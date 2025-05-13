





4 Closed-source models

This section briefly explores whether our findings transfer to closed-source models. Because our techniques rely on the model weights, they are not directly applicable to such models. However, the experience gained in inspecting a large variety of open models has provided insight which may transfer to closed models. For these tests, we use a custom prompt that asks the model to repeat a string exactly, and check whether the model appears incapable of doing so (see Appendix C for details).
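A repetition test of this kind can be sketched as follows. This is a minimal illustration only: the prompt wording, the `build_repeat_prompt` and `fails_to_repeat` helpers, and the `toy_model` stand-in are all assumptions made for the example, and the prompt actually used is the one described in Appendix C. In practice, `generate` would wrap a call to the API of the model under test.

```python
from typing import Callable


def build_repeat_prompt(target: str) -> str:
    # Hypothetical wording, NOT the prompt from Appendix C.
    return f"Repeat the string between the markers exactly.\n<<<{target}>>>"


def fails_to_repeat(generate: Callable[[str], str], target: str) -> bool:
    """Return True if the model's reply does not contain the target verbatim."""
    reply = generate(build_repeat_prompt(target))
    return target not in reply


def toy_model(prompt: str) -> str:
    # Stand-in for a real API call: echoes the marked string, but garbles
    # one candidate token to mimic the failure mode described in the text.
    start = prompt.index("<<<") + 3
    end = prompt.rindex(">>>")
    text = prompt[start:end]
    return text.replace("_ForCanBeConverted", "_ForCan")


candidates = ["hello world", "_ForCanBeConverted"]
failures = [c for c in candidates if fails_to_repeat(toy_model, c)]
print(failures)  # ['_ForCanBeConverted']
```

The substring check is deliberately lenient about any text surrounding the repeated string; a stricter variant could compare the stripped reply to the target for equality.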

4.1 OpenAI GPT-3.5 and GPT-4

By using models that share a tokenizer (cf. section 3.3.7), we already have a list of potential candidates, including _ForCanBeConverted, $PostalCodesNL, useRalative, _typingsJapgolly, and others. We test some of these tokens in prompts and find that all OpenAI models fail to handle many of them correctly, resulting in hallucinations followed by an inability to tell the difference between the inputs and incorrect outputs, or degrading into repetition.[12]

4.2 Anthropic Claude 2 and 3

Although documentation on tokenization in these models is limited, the Anthropic SDK contains some tokenizer utilities for Claude 2, with remarks that they are not accurate for Claude 3.[13] Using the tokenizer provided for Claude 2, we can identify some candidates for merged tokens such as CandidateFaciNum (iCandidateFaciNum), TrileptonPatTuple (TrileptonPatTupleMC), BFrontend (DVBFrontend) and others. Some of these tokens can be confirmed as problematic in Claude 2.1, although none appear effective in the Claude 3 family of models, consistent with the change in tokenizer implied by their SDK code.

4.3 Mistral Medium and Large

Although tokenizers are available for Mistral’s open models, their flagship API models do not include information about tokenizers. However, due to a confirmed leak of an early version of their ‘medium’ model as ‘miqu’, we have some knowledge of the ‘medium’ model being potentially derived from Llama2 70B. By prompting both the ‘medium’ and ‘large’ models, we confirm that the ‘medium’ model is unable to repeat strings that are typically under-trained in Llama2 models, and the ‘large’ model fails on typical tokens from the ‘small’ and ‘Mixtral’ series. In addition, in experimenting with such prompts we found that the ‘large’ model occasionally responds with apparent undocumented special tokens including [TOOL_CALLS] and [control_331], which were recently confirmed to be part of the tokenizer for the 8x22B model.





5 Discussion

Our investigation has shown a wide variety of untrained and under-trained tokens present in tokenizers, and their prevalence differs significantly by model. The presence of under-trained tokens has several negative consequences for language models, including inefficient inference and the potential to bypass guardrails. Even with our relatively strict threshold for verification, we detect the presence of such tokens across all tested models, with typically around 0.1–1% of the vocabulary consisting of severely under-trained tokens. The most important factor in a model having many under-trained tokens, aside from simply having a large vocabulary, appears to be whether the tokenizer was trained on data similar to that used to train the model. Models which re-use a large external tokenizer, and then train from scratch, are among those with the highest number of under-trained tokens.





Analyzing the tokenizer directly can detect several of these issues without the need for any training, including unreachable tokens which do not encode back to their representation, and unused byte fallback tokens. This can be particularly useful in quickly catching tokenizer configuration errors, which appear to be particularly common when custom vocabulary is manually added. Using the model embedding weights directly is a reliable way to detect tokens which are under-trained, although care should be taken to account for the model architecture. Based on our findings, we can summarize a number of recommendations within the scope of current tooling:





• Ensure input data pre-processing is identical across tokenizer training data, model training data, and model inference. In particular, consider carefully how to handle carriage returns, tab characters, and special tokens present as plain text in training data and user input.





• Ensure the model training data and tokenizer are aligned, especially when training a new base model.





• For single-byte tokens, either include a single copy of all 256 bytes without allowing duplicates in the vocabulary, or exclude the 13 unused bytes 0xC0/0xC1, 0xF5-0xFF. When dynamically excluding extremely rare bytes such as 0xF1, consider including an explicit <token> as a fallback.





• After training a tokenizer, check for unreachable tokens by encoding and decoding the vocabulary to ensure manually added tokens are handled correctly.





• When publishing both ‘fast’ and ‘slow’ versions of a tokenizer on Hugging Face, ensure they give the same outputs, for example by tokenizing the tokenizer vocabulary itself with both versions.





• When training a base model, check for under-trained tokens after smaller test runs and reconsider tokenization methods and data. Running a test on a different corpus can also reveal pre-processing bugs that cause unrepresentative inputs in the main training data.
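Two of the tokenizer-level checks in the list above can be sketched in a few lines. The first function derives the set of bytes that never occur in valid UTF-8, which should match the 13 bytes named in the recommendation on single-byte tokens. The second performs the decode-then-encode round trip used to flag unreachable tokens. The toy vocabulary, the greedy longest-match encoder, and its "\r\n" normalization step are assumptions chosen only to make the example self-contained; a real check would call the encode and decode methods of the tokenizer under test.

```python
def unused_utf8_bytes() -> list[int]:
    """Bytes that never occur in the UTF-8 encoding of any Unicode scalar value."""
    used = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be encoded as UTF-8
            continue
        used.update(chr(cp).encode("utf-8"))
    return sorted(set(range(256)) - used)


# Toy vocabulary (hypothetical): token 4 was "manually added", but the
# encoder's pre-processing rewrites "\r\n" to "\n" before matching.
TOY_VOCAB = {0: "a", 1: "b", 2: "ab", 3: "\n", 4: "\r\n"}


def toy_decode(ids: list[int]) -> str:
    return "".join(TOY_VOCAB[i] for i in ids)


def toy_encode(text: str) -> list[int]:
    text = text.replace("\r\n", "\n")  # pre-processing step
    by_length = sorted(TOY_VOCAB.items(), key=lambda kv: -len(kv[1]))
    ids, pos = [], 0
    while pos < len(text):
        # Greedy longest-match lookup, a stand-in for a real tokenizer.
        for token_id, token in by_length:
            if text.startswith(token, pos):
                ids.append(token_id)
                pos += len(token)
                break
        else:
            raise ValueError(f"cannot encode {text[pos]!r}")
    return ids


def find_unreachable(vocab, encode, decode) -> list[int]:
    """Token ids whose decoded text does not encode back to that single id."""
    return [tid for tid in vocab if encode(decode([tid])) != [tid]]


print([hex(b) for b in unused_utf8_bytes()])
print(find_unreachable(TOY_VOCAB, toy_encode, toy_decode))  # [4]
```

In this toy setup the "\r\n" token is flagged because the pre-processing step removes the carriage return before matching, which is the kind of mismatch between vocabulary and input handling that the first recommendation warns about.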





In addition to providing a set of useful tools for improving models and tokenizers, our work suggests several directions for future research. Firstly, the results from StarCoder2 (section 3.3.8) highlight a potential weakness in BPE training in that occurrences in a single document (or even single sub-collection of documents, such as a repository) are able to define a token by themselves. Strategies for preventing this, such as limiting the count for pairs to be merged by document, should be explored. Secondly, one common difference between tokenizers is whether or not they allow partial UTF-8 sequences in tokens other than byte fallback tokens. This trade-off remains under-explored. Although allowing such tokens may lead to lower average token counts, it also leads to more untrained ‘fragments’ and tokens which are less semantically meaningful. Finally, we noticed differences between models in terms of how they apply weight decay to tokens not present in input. This choice may affect how well models remember the meaning of rare tokens and may mitigate the severity and impact of under-trained tokens. Although this choice has been known to be important in models that predate transformers [31], we are not aware of systematic ablations in recent LLMs.
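The first of these directions, limiting how much a single document can contribute to a pair's merge count, can be illustrated with a small sketch. This is a hypothetical illustration of the idea rather than an evaluated method: the `pair_counts` function, its `per_doc_cap` parameter, and the toy documents are assumptions made for the example, and only the pair-counting step of BPE training is shown.

```python
from collections import Counter


def pair_counts(documents, per_doc_cap=None) -> Counter:
    """Count adjacent symbol pairs, optionally capping each document's share."""
    total = Counter()
    for doc in documents:
        local = Counter(zip(doc, doc[1:]))
        if per_doc_cap is not None:
            # Clip the contribution of this document for every pair.
            local = Counter({p: min(c, per_doc_cap) for p, c in local.items()})
        total.update(local)
    return total


# One repetitive document and three short documents sharing a common pair.
docs = [list("xyxyxyxyxy"), list("ab"), list("ab"), list("ab")]

uncapped = pair_counts(docs)
capped = pair_counts(docs, per_doc_cap=1)

print(uncapped.most_common(1))  # [(('x', 'y'), 5)]
print(capped.most_common(1))    # [(('a', 'b'), 3)]
```

Without the cap, the single repetitive document decides the next merge by itself; with a cap of one occurrence per document, the pair that appears across several documents is selected instead.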





In conclusion, our findings highlight a range of tokenizer issues, and the severity of these varies across different models. By analyzing tokenizers and model embeddings, we can identify under-trained tokens and improve the efficiency and security of LLMs.

Acknowledgments

We thank Dirk Groeneveld, Luca Soldaini and Nathan Lambert of the Allen Institute for AI for helpful discussions and data on weight decay, tokens trained on, and tokenization in the OLMo models, and Stella Biderman of EleutherAI for information on weight decay and tokenization in the Pythia/GPT-NeoX models. We also thank Matthias Gallé and Phil Blunsom for valuable feedback.

[12] The same technique also confirms that the currently undocumented ‘gpt2-chatbot’ model on the LMSys Arena uses a related tokenizer.





[13] https://github.com/anthropics/anthropic-sdk-python/blob/8e3d8a68d309424238ae54e03ee962f7147cfc60/src/anthropic/_client.py#L276