Why Can't AI Count the Number of "R" in the Word "Strawberry"?
Large language models, especially OpenAI's ChatGPT, have revolutionized how we interact with machines, understanding and generating human-like text. But these models come with their own quirks. One of the most talked-about on social media recently is their failure to correctly count the occurrences of a particular letter in a word. A popular example is the word "strawberry," where the AI often miscounts how many times "r" appears. Why does this happen? The answer lies deep in how these models process and generate language.
One of the main reasons AI stumbles over questions like counting letters is the way it processes words. Language models such as GPT-3 and GPT-4 do not treat words as sequences of individual letters. Instead, they break text into smaller units called "tokens." A token can be as short as a single character or as long as an entire word, depending on the model's design and the particular word involved.
For example, the word "strawberry" is most likely split into a few tokens, representations of partial word fragments that the model learned during training. Crucially, these fragments usually do not line up with individual letters. Rather than seeing the word broken down into ten separate characters, the AI may see only a couple of opaque tokens, identified internally by numeric IDs (for instance, IDs such as 496 and 675). When the model is later asked to count a particular letter, it has no straightforward way to map those tokens back to the number of times that letter occurs.
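To see this concretely, here is a minimal sketch using the open-source tiktoken tokenizer. This is an illustrative assumption: different models use different tokenizers, and the exact splits and IDs will vary.

```python
# Minimal sketch of inspecting tokenization, assuming the tiktoken package
# is installed; the exact splits and IDs depend on the tokenizer used.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # a GPT-4-era encoding
token_ids = enc.encode("strawberry")              # a short list of integer IDs
pieces = [enc.decode([tid]) for tid in token_ids]

print(token_ids)   # a handful of numbers, not ten single letters
print(pieces)      # word fragments rather than individual characters
```

The point is that the model's input is the list of integer IDs, which carries no explicit record of the letters inside each fragment.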
At their core, language models predict the next word or token in a sequence based on the context provided by the preceding tokens. This works remarkably well for generating text that is coherent and context-aware, but it is poorly suited to tasks that require precise counting or reasoning about individual characters.
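A toy illustration of that prediction step, with entirely made-up probabilities, shows why nothing in the process ever looks inside the letters of a token:

```python
# Toy example with invented numbers: given the context "I love strawberry",
# the model scores candidate next tokens and emits a likely one. At no point
# does it inspect the characters inside those tokens.
next_token_probs = {
    " jam": 0.41,
    " ice cream": 0.22,
    " fields": 0.13,
    " r": 0.01,
}
prediction = max(next_token_probs, key=next_token_probs.get)
print(prediction)  # " jam"
```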
If you ask the AI to count the occurrences of the letter "r" in "strawberry," it has no fine-grained representation of the word from which the number and position of each "r" could be read off. Instead, it produces an answer shaped by the patterns it has learned for requests of that form. The result can easily be wrong, because its training data was not about counting letters and may not contain the kind of material needed to trace every "r" in our example word.
Another important point is that language models on their own, as used in most chatbots, are not built for explicit counting or arithmetic. Put differently, a pure language model is closer to a very sophisticated predictive-text system: it performs tasks probabilistically, based on the patterns it has learned, but struggles with tasks that demand strict logical reasoning, such as counting. If the AI is instead asked to spell a word out or break it into individual letters, it often does better, because that framing is closer to the task it was trained on: generating text.
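In plain code, that "spell it out first" strategy looks something like the sketch below. It illustrates the idea only; it is not anything the model runs internally.

```python
# Enumerate the characters explicitly, then count the matches. Prompting a
# model to work letter by letter mirrors this explicit breakdown.
letters = list("strawberry")
positions = [i for i, ch in enumerate(letters) if ch == "r"]

print(letters)
print(f"'r' appears {len(positions)} times, at positions {positions}")
```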
Despite these limitations, performance on such tasks can be improved. One approach is to ask the AI to use a programming language, such as Python, to do the counting. For example, you can instruct the AI to write a Python function that counts the number of "r"s in "strawberry," and it will probably get it right. This works because it leverages the AI's ability to understand and generate code, which can then be executed to perform the task exactly.
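Such a function might look like the following minimal sketch; the name count_letter is illustrative.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```

Because the counting is done by executed code rather than by next-token prediction, the answer is exact.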
Beyond that, more recent generations of language models are being combined with external tools and algorithms that make them more capable on structured tasks, including counting and arithmetic. Embedding symbolic reasoning, or pairing an LLM with an external reasoning engine, would allow an AI system to overcome these shortcomings.
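A simplified sketch of that tool-use pattern follows. The tool registry and request format here are illustrative assumptions, not any particular vendor's API.

```python
# The model proposes a structured tool call; the surrounding system executes
# trusted code and feeds the exact result back into the conversation.
from typing import Callable, Dict

def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

TOOLS: Dict[str, Callable[..., int]] = {"count_letter": count_letter}

def run_tool_call(request: dict) -> int:
    """Dispatch a model-proposed tool call to exact, deterministic code."""
    return TOOLS[request["name"]](**request["arguments"])

# What a model might emit instead of guessing the count itself:
proposed_call = {"name": "count_letter",
                 "arguments": {"word": "strawberry", "letter": "r"}}
print(run_tool_call(proposed_call))  # 3
```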
The problem of counting letters in a word like "strawberry" points to a much broader issue: the occasional "collective stupidity" of these trained models. Even though they are trained on very large datasets and can generate text at a very sophisticated level, they still sometimes make mistakes a small child would easily avoid. This happens because the model's "knowledge" consists of pattern recognition and statistical associations rather than real-world understanding or logical inference.
Even when given detailed instructions, or placed in a setup where multiple models check each other, the AI can still stubbornly stick to a wrong answer. This behaviour underlines how important it is not to overestimate AI systems beyond their strengths, and to appreciate clearly what they can and cannot do.
The inability of AI to count the "r"s in "strawberry" is anything but a trivial flaw; it reflects the underlying architecture and design philosophy of language models. These models are extremely good at generating human-like text, understanding context, and emulating conversation, but they are not built for tasks that demand character-level attention to detail.
As AI continues to improve, future models are likely to handle such tasks better, through improved tokenization, integration of additional reasoning tools, or entirely different ways of representing and manipulating language. Until then, AI should be approached with an understanding of its limitations, with appropriate workarounds, and with the recognition that while it can simulate understanding, it does not yet truly "understand" the way humans do.