The art of saying no: constrained decoding

23 June · 3 min read

We spend billions training machines to generate almost anything. Then, in production, we spend an enormous amount of effort making sure they generate only what the system can use.

Most of the conversation around LLMs is about generation: more creativity, more capability, more output from the same input. Constrained decoding is about the opposite problem. It belongs to the production phase of LLM integration, where the question is not "can the model produce something?" but "can the model produce something valid?"

What is constrained decoding?

Constrained decoding forces an LLM's output to follow a predefined structure while the output is being generated.

During inference, the model scores every possible next token. These scores are called logits: the model's raw, unnormalized preference for each token in its vocabulary. A constrained decoder filters those options before the next token is chosen, typically by masking out the tokens that would make the partial output invalid.

For example, imagine a text-to-SQL agent generating this partial query:

SELECT COUNT(*) FROM clients WHERE salary >

At that point, valid continuations might include:

8000
9000
10000

Invalid continuations might include:

FROM
DROP
hello

The model is still generating, but the decoder is narrowing the path.
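To make the filtering step concrete, here is a minimal sketch of logit masking over a toy vocabulary. The vocabulary, the logit values, and the `is_numeric_literal` rule are all invented for illustration; a real decoder works on tokenizer IDs and derives validity from a grammar or schema rather than a hand-written predicate.

```python
import math


def constrained_argmax(logits, vocab, is_valid):
    """Pick the highest-scoring token among those the constraint allows."""
    # Invalid tokens get a score of -inf so they can never win.
    masked = [
        score if is_valid(token) else -math.inf
        for token, score in zip(vocab, logits)
    ]
    best = max(range(len(vocab)), key=lambda i: masked[i])
    if masked[best] == -math.inf:
        raise ValueError("no valid continuation for this step")
    return vocab[best]


# Toy vocabulary and made-up logits for the step after "... WHERE salary >".
vocab = ["FROM", "DROP", "hello", "8000", "9000", "10000"]
logits = [2.1, 0.3, -1.0, 1.7, 1.2, 0.9]


def is_numeric_literal(token):
    # Hypothetical constraint: only an integer literal may follow ">".
    return token.isdigit()


# Without the mask, the model's top choice is an invalid keyword.
unconstrained = vocab[max(range(len(vocab)), key=lambda i: logits[i])]
constrained = constrained_argmax(logits, vocab, is_numeric_literal)

print(unconstrained)  # FROM
print(constrained)    # 8000
```

The model's preferences are untouched; the mask only removes options. Among the tokens that survive, the ranking is still the model's own.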

Applications

Take text-to-SQL. The promise is that a user can ask a database question in natural language and the model will write the SQL. But SQL is not just text; it is text another system will execute. If the model writes invalid SQL, the query fails. If it writes valid SQL with the wrong join or filter, the answer may look correct while being wrong.

The authors of TeCoD point out that text-to-SQL can look strong on averaged benchmarks and still fail badly on specific enterprise schemas. They report accuracy as low as 20-30% for some databases when results are broken down by schema.

That is exactly where constrained decoding helps. Instead of letting the model invent the whole query from scratch, a system can constrain it to a known SQL template and ask it to fill the slots:

SELECT COUNT(T1.client_id)
FROM client AS T1
INNER JOIN district AS T2
  ON T1.district_id = T2.district_id
WHERE T1.gender = [gender]
  AND T2.A3 = [district_name]
  AND T2.A11 > [salary]

The model still does useful work, but the riskiest part of the search space has been removed.
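One way to picture slot filling is the sketch below. It is not TeCoD's implementation: the `fill_template` helper, the per-slot validators, and the allowed gender values are assumptions made up for this example. The model is only asked for slot values; the query structure is fixed, and each value is checked and bound as a parameter instead of being pasted into the SQL text.

```python
import re

TEMPLATE = """SELECT COUNT(T1.client_id)
FROM client AS T1
INNER JOIN district AS T2
  ON T1.district_id = T2.district_id
WHERE T1.gender = [gender]
  AND T2.A3 = [district_name]
  AND T2.A11 > [salary]"""

# Hypothetical per-slot rules; a real system would derive them from the schema.
SLOT_VALIDATORS = {
    "gender": lambda v: v in {"M", "F"},  # assumed encoding
    "district_name": lambda v: isinstance(v, str) and 0 < len(v) <= 64,
    "salary": lambda v: isinstance(v, int) and not isinstance(v, bool) and v >= 0,
}


def fill_template(template, slot_values):
    """Replace each [slot] with a "?" placeholder and collect validated values."""
    params = []

    def bind(match):
        name = match.group(1)
        value = slot_values[name]  # KeyError if the model skipped a slot
        if not SLOT_VALIDATORS[name](value):
            raise ValueError(f"invalid value for slot {name!r}: {value!r}")
        params.append(value)
        return "?"

    sql = re.sub(r"\[(\w+)\]", bind, template)
    return sql, tuple(params)


# Slot values as a model might propose them (made up for this example).
sql, params = fill_template(
    TEMPLATE, {"gender": "F", "district_name": "Prague", "salary": 8000}
)
print(params)  # ('F', 'Prague', 8000)
```

The joins and the table names never pass through the model, so it cannot get them wrong. It can still choose the wrong template or the wrong value, which is why this narrows the problem rather than solving it.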

The same idea shows up in structured outputs. When we ask a model for JSON, the problem is not that the model has never seen JSON. The problem is that "looks like JSON" is not the same as JSON a parser can always consume. JSONSchemaBench evaluates constrained decoding systems on real-world JSON Schemas, measuring not just validity but also efficiency, coverage, and output quality.
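The gap between "looks like JSON" and "is JSON" is easy to demonstrate with the standard library. The candidate strings and the tiny `required` shape check below are invented for illustration, and the check is a stand-in for a real JSON Schema validator, not a replacement for one.

```python
import json


def parses_and_matches(text, required):
    """Return True only if text is valid JSON and has the required typed keys."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(value, dict):
        return False
    return all(isinstance(value.get(key), kind) for key, kind in required.items())


required = {"name": str, "age": int}

candidates = [
    '{"name": "Ada", "age": 36}',    # valid and well-shaped
    "{'name': 'Ada', 'age': 36}",    # single quotes: not JSON
    '{"name": "Ada", "age": 36,}',   # trailing comma: not JSON
    '{"name": "Ada", "age": "36"}',  # parses, but age has the wrong type
]

results = [parses_and_matches(text, required) for text in candidates]
print(results)  # [True, False, False, False]
```

A check like this runs after generation and can only reject. A constrained decoder applies the same kind of rule token by token, so the three failing strings would never be completed in the first place.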

The tradeoff

There is a cost to saying no. By keeping only the next tokens that leave the output valid, the decoder may block a path that would have been more creative or semantically useful. Constrained decoding can make large models behave like smaller, more specialized systems by reducing what they are allowed to do.

Agents often use a related pattern outside the decoder. A coding agent may generate code freely, then run bun lint, tsc, or tests to reject invalid output. That is not constrained decoding in the strict sense because the constraint happens after generation, but the production instinct is the same: generate, check, reject, repair.
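The generate, check, reject, repair loop can be sketched in a few lines. Everything here is a stand-in: `fake_model` is a stub that plays the role of an LLM call, and Python's built-in `compile` plays the role of a linter or type checker such as tsc.

```python
def syntax_check(source):
    """Return None if the source compiles, otherwise a short error message."""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def generate_until_valid(generate, check, max_attempts=3):
    """Call generate, reject failing candidates, and feed the error back."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        error = check(candidate)
        if error is None:
            return candidate, attempt
        feedback = error  # the next attempt sees why this one was rejected
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


def fake_model(feedback):
    # Stub for a model call: the first draft is broken, the repair is not.
    if feedback is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b\n"


code, attempts = generate_until_valid(fake_model, syntax_check)
print(attempts)  # 2
```

Compared with constraining the decoder, the loop pays for a full generation before it learns anything was wrong. In exchange, the check can be anything that runs, including tests no grammar could express.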

Today, constraints can feel like putting a muzzle on a poet. Newer approaches, such as backtracking or reasoning before structured generation, suggest a better direction: let the model explore, but make sure it eventually lands on an output the system can trust.

Constrained decoding is the art of saying no during inference. It does not make the model correct, but it can stop entire classes of invalid output before they reach production.
