Where Glitch Tokens Hide: Common Patterns in LLM Tokenizer Vocabularies

by Large Models (dot tech), May 12th, 2025

Too Long; Didn't Read

Untrained tokens often stem from unused byte tokens, fragments of merged tokens, and special tokens: patterns found across major LLMs regardless of architecture.

Abstract and 1. Introduction

  2. Methods

    2.1 Tokenizer analysis

    2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens

  3. Results

    3.1 Effectiveness of indicators and verification

    3.2 Common observations

    3.3 Model-specific observations

  4. Closed-source models

  5. Discussion, Acknowledgments, and References


A. Verification details

B. A short primer on UTF-8 encoding

C. Outputs for API-based verification

3.2 Common observations

Although many of our findings are dependent on model-specific details such as tokenizer training and configuration, model architecture, and training data, there are a number of commonalities that appear across many different model families.


3.2.1 Single-byte tokens


Tokens representing a single byte are a common source of untrained tokens. The most common case is the ‘fallback’ bytes 0xF5–0xFF, which are not used in UTF-8 encoded text [2] and are a convenient source of reference untrained tokens for indicators that require them. In addition, many tokenizers, including those of the Gemma, Llama2 and Mistral families, include every byte as a token but also assign a duplicate token to many characters in the normal ASCII range 0x00–0x7F. For example, in the Gemma models ‘A’ is both token 282, an unused byte fallback token, and token 235280, a text-based ‘A’. These issues are not universal, and we also find models which include precisely the 243 bytes used in UTF-8 as tokens, including the models by EleutherAI [14]. Untrained single-byte tokens are typically classified as ‘partial UTF-8 sequences’ or ‘unreachable’, and our indicators are effective in revealing which ones are never or rarely seen in training. We publish tables that show the status of each single-byte token for each analyzed model in our repository.


Table 1: Detection of under-trained tokens. #Confirmed are the confirmed/tested numbers for the tokens tested in verification that are predicted with a maximal probability of <1% across verification prompts. Examples were manually chosen for readability, similarity across models or for being particularly striking. Note that the leading ‘_’ in tokens such as _SolidGoldMagikarp indicates a leading space. * We use an unembedding-based indicator for these models (cf. section 3.3.2).


Figure 2: Under-trained token indicators vs. training data. Shown are the (un)embedding-based indicators for the OLMo v1.7 7B model and the number of times each token appears in the first epoch of the training data.
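The byte counts in the single-byte discussion can be checked directly. The sketch below is not the authors' tooling: it enumerates every encodable Unicode code point to find which byte values ever occur in valid UTF-8, and then looks for ASCII characters that have both a byte-fallback token and a text token. The function names and the toy vocabulary are illustrative assumptions. The `<0xNN>` spelling follows the SentencePiece byte-fallback convention, and only the two IDs for ‘A’ are taken from the text.

```python
import re

# SentencePiece-style byte-fallback tokens are spelled "<0x41>", "<0xF5>", ...
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def utf8_unused_bytes() -> set[int]:
    """Return the byte values that never occur in any valid UTF-8 encoding."""
    used: set[int] = set()
    for code_point in range(0x110000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        used.update(chr(code_point).encode("utf-8"))
    return set(range(256)) - used


def find_ascii_duplicates(vocab: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each ASCII character to (byte-token id, text-token id) when both exist."""
    duplicates: dict[str, tuple[int, int]] = {}
    for token, token_id in vocab.items():
        match = BYTE_TOKEN.match(token)
        if match is None:
            continue
        value = int(match.group(1), 16)
        if value < 0x80 and chr(value) in vocab:
            duplicates[chr(value)] = (token_id, vocab[chr(value)])
    return duplicates


if __name__ == "__main__":
    unused = utf8_unused_bytes()
    print(len(unused), 256 - len(unused))  # 13 unused, 243 used
    print(sorted(hex(b) for b in unused))

    # Toy vocabulary: the two ids for 'A' come from the text (Gemma);
    # the "hello" entry is made up purely for illustration.
    toy_vocab = {"<0x41>": 282, "A": 235280, "hello": 1000}
    print(find_ascii_duplicates(toy_vocab))
```

The enumeration reports 13 unused byte values: 0xF5–0xFF plus 0xC0 and 0xC1, which could only appear in overlong encodings. That leaves the 243 bytes mentioned above. Against a real tokenizer, the same duplicate check would be run on its full vocabulary mapping instead of the toy dictionary.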


3.2.2 Fragments of merged tokens



3.2.3 Special tokens


Many models include untrained special tokens, such as <pad>, <unk>, or <|unused_123|>. In the following discussion we generally omit them unless their status as an (un)trained token is particularly surprising, as their inclusion in the tokenizer and training data is typically deliberate, for purposes such as the ability to fine-tune models without changing tokenizers. A common observation is that tokens such as <mask>, which we would expect to be completely untrained, often appear to have been seen in training. A likely source for this is code repositories or guides about language models that use these tokens in normal text, combined with tokenizers that allow such special control tokens in normal input text.
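The mechanism suggested above can be illustrated with a toy encoder. This is a deliberately simplified sketch, not the behavior of any specific tokenizer library: `toy_encode`, the `allow_special` flag, and the token IDs are all hypothetical. It shows how a literal "<mask>" in ordinary text becomes the special token ID when special tokens are matched in input, and is split into ordinary characters otherwise.

```python
import re

# Hypothetical special tokens and ids, chosen only for illustration.
SPECIAL_TOKENS = {"<pad>": 0, "<mask>": 4}
SPECIAL_PATTERN = re.compile(
    "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
)
CHAR_OFFSET = 100  # keeps toy character ids clear of the special ids


def encode_chars(text: str) -> list[int]:
    """Toy character-level encoding standing in for a real subword tokenizer."""
    return [ord(ch) + CHAR_OFFSET for ch in text]


def toy_encode(text: str, allow_special: bool) -> list[int]:
    """Encode text, optionally mapping literal special-token strings to their ids."""
    if not allow_special:
        return encode_chars(text)
    ids: list[int] = []
    # re.split with a capturing group keeps the matched special tokens as pieces.
    for piece in SPECIAL_PATTERN.split(text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])
        else:
            ids.extend(encode_chars(piece))
    return ids


if __name__ == "__main__":
    sample = "use <mask> here"
    print(toy_encode(sample, allow_special=True))   # contains the id for <mask>
    print(toy_encode(sample, allow_special=False))  # "<mask>" split into characters
```

If training text such as a tutorial mentioning "<mask>" passes through an encoder configured like the first call, the special token receives gradient updates even though it was never used for its intended purpose.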


Authors:

(1) Sander Land, Cohere ([email protected]);

(2) Max Bartolo, Cohere ([email protected]).

This paper is available on arXiv under the CC BY-SA 4.0 DEED license.

[2] See Appendix B for a primer on UTF-8 encoding.


[3] When mentioning fragments of more complete tokens, the tokens in parentheses were not detected or verified as under-trained, unless explicitly mentioned otherwise.
