
Tokenizer analysis


Table of Links

Abstract and 1. Introduction

  2. Methods

    2.1 Tokenizer analysis

    2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens

  3. Results

    3.1 Effectiveness of indicators and verification

    3.2 Common observations

    3.3 Model-specific observations

  4. Closed-source models

  5. Discussion, Acknowledgments, and References


A. Verification details

B. A short primer on UTF-8 encoding

C. Outputs for API-based verification

2 Methods

Our method consists of three steps: (i) we perform a tokenizer analysis by inspecting its vocabulary and observing its encoding/decoding behaviour; (ii) we calculate a number of indicators that identify candidate tokens which have likely not been seen during model training; and (iii) we verify whether the identified candidate tokens are indeed out of distribution by prompting the target model.

2.1 Tokenizer analysis

We start by defining a number of useful categories for tokens:


• Partial UTF-8 sequences: The token contains a partial UTF-8 sequence and cannot be converted to a string by itself. This is typical for ‘fallback byte’ tokens in the 0x80-0xFF range (also see Appendix B), but depending on tokenizer configuration, such tokens can also include a combination of full and partial characters.


• Unreachable: When no input string can result in the token id, we categorize it as ‘unreachable’. We test this by checking whether decoding the token to a string and re-encoding that string yields the original token id. Such tokens are typically the result of tokenizer configuration errors or conflicts between trained and manually added vocabulary. As this test does not work when tokens cannot be decoded to a string, we exclude partial UTF-8 sequences from this category.


• Special tokens: Manually defined tokens carrying specific meaning, such as control tokens like <s>. We identify special tokens using the patterns <...> and [...] and list them separately from unreachable tokens, even if they might be considered as such due to input sanitization in tokenizer preprocessing.


• Tokens not in any of the other categories, which constitute the vast majority of the vocabulary.


We detect and exclude partial UTF-8 sequences and unreachable tokens from our token detection pipeline, as they are not suitable for automatically building verification prompts. Our published model reports include tables with such tokens, and we briefly discuss some interesting model-specific results in section 3.3.
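The category checks above can be sketched as follows. This is an illustrative reconstruction, not the authors' published code: the `encode`/`decode` interface and all function names are assumptions made for the sake of the example.

```python
import re

# Special tokens are identified by the <...> and [...] patterns, per the paper.
SPECIAL_TOKEN_PATTERN = re.compile(r"^(<[^>]+>|\[[^\]]+\])$")


def is_partial_utf8(token_bytes: bytes) -> bool:
    """Partial UTF-8 sequence: the token's raw bytes are not valid UTF-8 on
    their own, e.g. a lone fallback byte in the 0x80-0xFF range."""
    try:
        token_bytes.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def is_special_token(token_str: str) -> bool:
    """Match manually defined control tokens such as <s> or [PAD]."""
    return bool(SPECIAL_TOKEN_PATTERN.match(token_str))


def is_unreachable(token_id: int, encode, decode) -> bool:
    """Unreachable: decoding the token to a string and re-encoding that
    string does not round-trip back to exactly [token_id]. Only meaningful
    for tokens that decode to a valid string."""
    return encode(decode([token_id])) != [token_id]
```

For example, if two token ids both decode to the same string (a typical vocabulary conflict), the re-encode step can only ever return one of them, so the other is flagged as unreachable and excluded from the detection pipeline.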


Authors:

(1) Sander Land, Cohere ([email protected]);

(2) Max Bartolo, Cohere ([email protected]).


This paper is available on arxiv under CC BY-SA 4.0 DEED license.

