C Outputs for API-based verification
We use the following prompt for API-based testing of under-trained tokens.
In this prompt, the strings consist of the problematic token, occasionally with a prefix added to help identify its source and to avoid leading spaces, since we noticed that models often fail to repeat tokens with leading spaces correctly for unrelated reasons. Although many other prompt formats are effective, we found that this code-based approach avoids false positives more reliably.
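As an illustration of the repeat-the-string idea described above, the following is a minimal sketch rather than the prompt or tooling used in this paper. The template wording and the helper names `build_repeat_prompt` and `find_unrepeated` are hypothetical, and the call to a model API is deliberately left out; the response is assumed to be plain text with one repeated string per line.

```python
import json


def build_repeat_prompt(strings):
    """Build a code-style prompt asking a model to echo each string verbatim.

    The wording is a hypothetical template, not the prompt used in the paper.
    """
    # json.dumps yields an unambiguous quoted literal for every string.
    literal = json.dumps(strings, ensure_ascii=False)
    return (
        "Complete the following Python session by printing each string exactly.\n"
        f">>> strings = {literal}\n"
        ">>> for s in strings: print(s)\n"
    )


def find_unrepeated(strings, response):
    """Return the strings that do not appear verbatim as a line of the response.

    Assumes the candidate strings carry no leading or trailing whitespace,
    because each response line is stripped before comparison.
    """
    lines = [line.strip() for line in response.splitlines()]
    return [s for s in strings if s not in lines]


# Usage with placeholder strings and a hand-written response.
candidates = ["tokA", "tokB"]
prompt = build_repeat_prompt(candidates)
missing = find_unrepeated(candidates, "tokA\ntok\n")
```

A string reported by `find_unrepeated` is only a candidate: a failure to repeat it may have causes other than under-training, which is why the choice of prompt format matters for avoiding false positives.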
Figure 4 shows the results for Mistral, Anthropic, and OpenAI models.
Figure 4: API prompting results.