Table of Links
- Methods
  - 2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens
- Results
- B. A short primer on UTF-8 encoding
- C. Outputs for API-based verification
B A short primer on UTF-8 encoding
UTF-8 is the most prevalent encoding scheme used to represent text in computers and communication protocols worldwide. It efficiently encodes Unicode characters, which encompass a vast range of characters from various writing systems and symbols [32]. Encoding to UTF-8 is often the first step in tokenization.
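As a concrete illustration of this first step (a sketch using Python's built-in codec), the byte sequence below is what a byte-level tokenizer starts from:

```python
# UTF-8 encoding is typically applied before any tokenizer-specific
# processing: the tokenizer operates on this raw byte stream.
text = "héllo"
print(list(text.encode("utf-8")))  # → [104, 195, 169, 108, 108, 111]
# ASCII letters are one byte each; 'é' (U+00E9) becomes two bytes, 0xC3 0xA9.
```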
UTF-8 encoding can be summarized as follows:
• ASCII (Unicode below 128): Single byte, binary 0xxxxxxx representing up to 7 bits.
• 2-byte sequences: 110xxxxx 10xxxxxx representing up to 11 bits.
• 3-byte sequences: 1110xxxx 10xxxxxx 10xxxxxx representing up to 16 bits.
• 4-byte sequences: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx representing up to 21 bits.
The bits indicated by ‘x’ are concatenated to form the Unicode code point.
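These bit layouts can be made concrete with a small hand-rolled encoder (a sketch that omits validity checks such as rejecting surrogates; in practice `chr(cp).encode("utf-8")` does the same work):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point following the bit layouts above."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx (ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x20AC))  # U+20AC '€' → b'\xe2\x82\xac'
```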
• 111110xx, 1111110x, 11111110 and 11111111 would represent the first byte of sequences of 5–8 bytes, which are not in use. In addition, 11110101 through 11110111 would start 4-byte sequences encoding code points above the Unicode maximum of U+10FFFF, and are likewise invalid. Together these cover decimal 245–255, or hexadecimal F5–FF.
• 11000000 and 11000001 are not in use: any two-byte sequence starting with one of them would encode a code point that fits in 7 bits, i.e. an “overlong” encoding of a character that already has a single-byte form, which UTF-8 forbids. These are 192/193 in decimal and C0/C1 in hexadecimal.
• Additionally, other starting bytes can be covered entirely by other tokens, and also turn out to be unused. A common example is C2/C3, which are only used for code points 128–255. Furthermore, since code points U+323B0 to U+DFFFF are unassigned, the 0xF1 and 0xF2 bytes do not occur in UTF-8 representations of currently defined Unicode characters, and 0xF4 is only used for the “Supplementary Private Use Area-B”. However, even code points not defined in the current Unicode standard can easily be inserted in text and are found on web pages.
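The structurally impossible byte values from the two bullets above (C0/C1 and F5–FF, as opposed to bytes like F1/F2 that are valid but unused by assigned characters) can be verified by brute force; this sketch encodes every code point and collects the bytes that appear:

```python
# Collect every byte value that occurs in the UTF-8 encoding of some code point.
possible = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are reserved for UTF-16 and not encodable in UTF-8
    possible.update(chr(cp).encode("utf-8"))

never_used = sorted(set(range(256)) - possible)
print([f"{b:#04x}" for b in never_used])
# 0xC0 and 0xC1 (overlong leads) plus 0xF5-0xFF (beyond U+10FFFF, 5-8 byte leads)
```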