5. Discussion

The vocabulary indexing introduced in this paper removes a prohibitive run-time scaling barrier in guided generation. Naturally, it trades memory for processing time, but we believe that the memory costs are relatively low on average and, when they are not, can be reduced through conventional means.

In our tests using a slightly augmented version of the Python grammar, we find that even naively constructed indices (i.e., ones containing unused and redundant parser and FSM state configurations) are still only around 50 MB. Furthermore, these indices were constructed with unreduced DFAs, so numerous redundant states needlessly inflate their size. Likewise, if the memory footprint of the exact state machine representation is ever an issue, other state machine formulations with lower memory requirements (e.g., NFAs) may suffice.
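To make the memory argument concrete, the following is a minimal sketch of an index that maps each DFA state to the vocabulary tokens it can consume. It is not the implementation behind the figures reported above: the toy DFA (for the regular expression `[0-9]+`, deliberately written with a redundant state), the five-token vocabulary, and the names `walk`, `build_index`, and `count_distinct_masks` are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy DFA for [0-9]+, written with a redundant state on purpose:
# states 1 and 2 are equivalent (both accept and both loop on digits).
DIGITS = "0123456789"
TRANSITIONS: Dict[Tuple[int, str], int] = {}
for d in DIGITS:
    TRANSITIONS[(0, d)] = 1
    TRANSITIONS[(1, d)] = 2
    TRANSITIONS[(2, d)] = 1

# Hypothetical vocabulary; a token id is its position in this list.
VOCAB: List[str] = ["1", "23", "4a", "a", "007"]


def walk(state: int, token: str) -> Optional[int]:
    """Return the state reached after reading `token`, or None if the DFA rejects it."""
    for ch in token:
        nxt = TRANSITIONS.get((state, ch))
        if nxt is None:
            return None
        state = nxt
    return state


def build_index(states: List[int]) -> Dict[int, Dict[int, int]]:
    """Map each state to {token id: next state} for every token it can fully consume."""
    index: Dict[int, Dict[int, int]] = {}
    for s in states:
        index[s] = {}
        for tid, tok in enumerate(VOCAB):
            end = walk(s, tok)
            if end is not None:
                index[s][tid] = end
    return index


def count_distinct_masks(index: Dict[int, Dict[int, int]]) -> int:
    """Count the distinct allowed-token sets; identical sets could be stored once."""
    return len({frozenset(entry) for entry in index.values()})


index = build_index([0, 1, 2])
print(len(index), "states,", count_distinct_masks(index), "distinct token set(s)")
```

In this toy case, states 1 and 2 produce identical entries, so the naive index stores the same token set more than once. Minimizing the DFA before indexing, or sharing identical token sets between states, are two conventional ways to remove that redundancy; at generation time the lookup remains a single dictionary access per step.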

The implications of this work are not limited to neural text generation. For instance, one could use the indexing approach described here to assist with the training or fine-tuning of LLMs when structured outputs are required. We can also speculate that assisted generation during training may reduce the need for a model to learn syntactic details.

In addition, this method provides an alternative way to evaluate current models. One could, for instance, attempt to quantify the discrepancy between the masked logits generated by our method and the raw logits generated by the model, which could in turn inform the training objective of a model.
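One possible way to quantify such a discrepancy is sketched below; the paper does not prescribe a metric, so both measures here are assumptions. The first is the probability mass the raw model assigns to tokens the mask forbids. The second is the KL divergence from the raw next-token distribution to the masked, renormalized one, which reduces to the negative log of the allowed mass. The function names and the example logits are hypothetical.

```python
import math
from typing import List, Set


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; entries equal to -inf receive probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mask_logits(logits: List[float], allowed: Set[int]) -> List[float]:
    """Set the logit of every disallowed token id to -inf."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def invalid_mass(logits: List[float], allowed: Set[int]) -> float:
    """Probability the raw model assigns to tokens outside the allowed set."""
    probs = softmax(logits)
    return sum(p for i, p in enumerate(probs) if i not in allowed)


def kl_masked_vs_raw(logits: List[float], allowed: Set[int]) -> float:
    """KL(q || p), where p is the raw distribution and q the masked one.

    Assumes `allowed` is non-empty and that no allowed token underflows to
    probability 0 under the raw distribution.
    """
    p = softmax(logits)
    q = softmax(mask_logits(logits, allowed))
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Hypothetical example: four-token vocabulary, tokens 1 and 2 allowed.
raw = [2.0, 1.0, 0.0, -1.0]
allowed_ids = {1, 2}
print(invalid_mass(raw, allowed_ids), kl_masked_vs_raw(raw, allowed_ids))
```

A value near zero for either measure would indicate that the model already concentrates its probability on syntactically valid continuations at that step; larger values flag steps where the mask is doing most of the work.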

It may also be possible to “lift” the masks computed by this approach into the language models themselves. In effect, the masks implicitly determine which computations do not need to be performed. Our current formulation only applies the masks at the lowest level, but by lifting the masks further up into the architecture of the model, we may be able to determine which slices of the model parameters are needed before any unnecessary operations are performed on them. This has the potential to further reduce computational costs.
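The simplest instance of this idea can be sketched at the final projection layer: rather than computing a logit for every vocabulary entry and then masking, one computes logits only for the rows of the output projection that correspond to allowed tokens. This is only one possible reading of "lifting", offered under stated assumptions; how masks would propagate into earlier layers is left open above, and the tiny matrix, the hidden vector, and the function names below are hypothetical.

```python
from typing import Dict, List, Set

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def full_logits(hidden: Vector, unembed: List[Vector]) -> List[float]:
    """Baseline: one dot product per vocabulary entry, masking applied afterwards."""
    return [dot(row, hidden) for row in unembed]


def lifted_logits(hidden: Vector, unembed: List[Vector], allowed: Set[int]) -> Dict[int, float]:
    """Compute logits only for allowed token ids, skipping all other rows."""
    return {tid: dot(unembed[tid], hidden) for tid in sorted(allowed)}


# Hypothetical 4-token vocabulary with a 2-dimensional hidden state.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
hidden = [1.0, -2.0]
allowed_ids = {1, 3}

partial = lifted_logits(hidden, unembed, allowed_ids)
print(partial)
```

Here the work is proportional to the number of allowed tokens rather than the vocabulary size, and the resulting logits agree with the baseline on every allowed token id.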


We would like to thank Dan Gerlanc and Dan Simpson for their support and constructive feedback.

This paper is available on arXiv under a CC 4.0 license.