Table of Links
- Methods
- 2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens
- Results
- B. A short primer on UTF-8 encoding
- C. Outputs for API-based verification
2.2 Indicators for detecting under-trained tokens
We propose and use model-architecture-dependent indicators to identify potentially under-trained tokens. A key distinction is whether a model uses the same matrix for its token embeddings E and for the final model layer, the 'unembedding' matrix U, which converts the final internal embeddings into a probability distribution over tokens.[1] Regardless of model architecture, all weights of the unembedding matrix influence the token predictions at every training step. Specifically, the training loss is minimized when the probability of unused tokens is predicted as zero regardless of the input, so their logits converge towards −∞. The model can achieve such an input-independent prediction by maintaining a constant vector in the residual stream and placing the negative of this vector in the unembedding rows of unused tokens, resulting in a constant negative contribution to their logit values. Using this intuition, we can find unused tokens from the unembedding weights as follows:
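As a minimal sketch of one plausible indicator in this spirit (the function name, the toy unembedding matrix, the reference-token ids, and the choice of cosine similarity to a mean "unused" direction are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def under_trained_indicator(U, reference_unused_ids):
    """Hypothetical indicator: score each token by the cosine similarity of
    its unembedding row to the mean direction of tokens already known to be
    unused. Higher score = more aligned with the 'unused' direction."""
    # Remove the shared mean so a constant offset common to all rows
    # does not dominate the similarity.
    U_centered = U - U.mean(axis=0, keepdims=True)
    # Mean direction of the reference (known-unused) tokens, normalized.
    ref_dir = U_centered[reference_unused_ids].mean(axis=0)
    ref_dir /= np.linalg.norm(ref_dir)
    # Cosine similarity of every row to that direction.
    norms = np.maximum(np.linalg.norm(U_centered, axis=1), 1e-12)
    return (U_centered @ ref_dir) / norms

# Toy unembedding matrix: 100 tokens, 16 dims; tokens 0-4 share a direction,
# mimicking unused tokens whose rows converged to the same constant vector.
rng = np.random.default_rng(0)
U = rng.normal(size=(100, 16))
unused_dir = rng.normal(size=16)
U[:5] = unused_dir + 0.01 * rng.normal(size=(5, 16))

scores = under_trained_indicator(U, reference_unused_ids=[0, 1, 2])
candidates = np.argsort(-scores)  # ranking; tokens 0-4 should come out on top
```

The ranking, not any fixed cutoff, is the output: as the next subsection notes, the indicator only orders candidates and a verification step decides which ones truly misbehave.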
2.3 Verification of candidate tokens
Our proposed indicators naturally provide a ranking of candidate under-trained tokens, but they do not give a definitive threshold, and their relative simplicity is likely to result in a somewhat noisy relation between indicator value and model behaviour. To confirm that candidate tokens indeed induce unwanted model outputs, we verify all tokens ranked in the top 2% by the chosen indicator, excluding partial UTF-8 sequences and unreachable tokens. This verification process involves constructing specific repetitive prompts that induce a high output probability for normal tokens, and checking whether a candidate token instead has a very low output probability (see Appendix A for details).
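A hedged sketch of this verification step, with a toy stand-in for the model call (the prompt template, the `token_prob` interface, the threshold, and the example glitch token are illustrative assumptions; the paper's actual prompts are in its Appendix A):

```python
def build_repetitive_prompt(token, repeats=10):
    """Build a simple repetitive prompt that should prime a trained model
    to emit `token` next. The exact wording here is an illustrative
    stand-in, not the template from the paper."""
    return "Repeat: " + " ".join([token] * repeats) + " "

def is_under_trained(token, token_prob, threshold=0.01):
    """Flag a candidate if the model assigns it very low probability even
    when the prompt strongly primes it. `token_prob(prompt, token)` stands
    in for a real model's next-token probability."""
    prompt = build_repetitive_prompt(token)
    return token_prob(prompt, token) < threshold

# Toy model: high probability for any primed token, except a hypothetical
# glitch token that it cannot reproduce.
def toy_token_prob(prompt, token):
    if token == " SolidGoldMagikarp":
        return 1e-6  # under-trained: near-zero probability despite priming
    return 0.95 if token in prompt else 0.0

flagged = is_under_trained(" SolidGoldMagikarp", toy_token_prob)
normal = is_under_trained("hello", toy_token_prob)
```

With a real model, `token_prob` would run a forward pass over the prompt and read the candidate token's probability from the final softmax; only tokens that stay below the threshold despite the repetitive priming are confirmed as under-trained.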
[1] We assume the conventional final layer structure, consisting solely of the unembedding matrix without a bias.