The Intelligence Paradox: Why We're Building LLMs Wrong (And How to Fix It)

Written by balogundavidtaiwo | Published 2026/04/10
Tech Story Tags: llms | llm-fine-tuning | llm-optimization | artificial-intelligence | llm-scaling-myth | limitations-of-llms | energy-cost-of-llm-inference | interpretable-ai-systems

TL;DR: LLMs aren’t failing because they’re small—they’re failing because scale is mistaken for intelligence. Benchmarks don’t reflect real-world use, alignment remains unsolved, and energy costs are ignored. The future of AI lies in specialized systems, human feedback loops, and interpretable architectures—not bigger models. The winners will build ecosystems, not just models.

A Thought Leadership Perspective on the Future of Large Language Models

The language surrounding large language models is a case study in overly positive description resting on a shaky foundation. We focus on an LLM's scale, its benchmark performance, and how it outshines the competition, but not on its limitations. Given the level of interest and investment, it would be a shame for the state of AI to stagnate. Based on industry observations and hands-on experience watching this field develop, the trend I see is local optimality: a pronounced polishing of what is already known, constrained by a lack of deeper understanding.


The Scale Myth

LLM developers have no qualms about the confident claim that "bigger is better". More parameters. Longer context. Better benchmarks. Success is being targeted, yet the lack of a principled foundation for development goes critically unexamined. Doubling parameters tends to improve performance, so LLMs are expected to keep improving with seemingly no principle behind it—but that is precisely a codification of empirical observation, not understanding. An unquantified improvement is not a rationale.


This is what model performance looks like in our experience. Models that achieve great results on benchmarks perform poorly in production. Systems trained on the same data produce different behaviour depending on initialisation. LLMs that perform perfectly on specialised technical tasks fail at basic, even childlike, reasoning. The troubling reality is that scale is a proxy. It's not the variable that matters. We're treating the proxy as if it were the goal.


The Real Challenge: Alignment in the Age of Uncertainty

"Alignment" refers to making AI systems behave as intended. Most research on the AI alignment challenge focuses not on the deep problem of misalignment but on its surface symptoms: filtering out toxic outputs, adding safety layers, and fighting jailbreaks. This is whack-a-mole.


Every creative user finds another avenue around the guardrails, and every new prompting technique offers a way to sidestep safety training.


Consider a seemingly straightforward task: writing a business email. What's "correct"?

  • Should it prioritise clarity or persuasiveness?
  • Should it be formal or approachable?
  • Should it acknowledge potential objections or paper over them?
  • What cultural context matters?


An LLM doesn't have the ability to ask these questions. It can't say, "I need more information to do this well." It hazards a guess based on patterns in its training data. And we deploy it anyway.


The companies winning with LLMs aren't winning because their models are slightly larger or trained on slightly better data. They're winning because they've built scaffolding around the models: human feedback loops, domain-specific fine-tuning, and explicit constraints on where the model is allowed to operate. The model itself is becoming a commodity. The intelligence is in the system around it.
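A minimal sketch of that scaffolding idea, with the caveat that every name here (`call_model`, `ALLOWED_TOPICS`, the escalation string) is a hypothetical placeholder rather than a real API:

```python
# Illustrative scaffolding around a commodity model: an explicit domain
# constraint plus a feedback log. All names are hypothetical placeholders.

ALLOWED_TOPICS = {"billing", "shipping", "returns"}
feedback_log: list[tuple[str, str]] = []  # fuel for the human feedback loop

def call_model(prompt: str) -> str:
    # Stand-in for any real LLM API call.
    return f"(model answer to: {prompt})"

def answer(user_query: str, topic: str) -> str:
    # Explicit constraint: refuse queries outside the sanctioned domain.
    if topic not in ALLOWED_TOPICS:
        return "ESCALATE: outside supported domain"
    draft = call_model(user_query)
    # Log the exchange so humans can review and correct it later.
    feedback_log.append((user_query, draft))
    return draft
```

The point of the sketch is that the interesting logic—the constraint and the feedback capture—sits entirely outside the model call.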


The Compute Tax We're Not Talking About

Here's an uncomfortable calculation that deserves more attention:


The energy cost of a single inference pass on a leading LLM is already comparable to the energy a human expends thinking through the same task—and at fleet scale, it dwarfs it.


A human takes roughly one minute to think through a problem using about 20 watts of power. A large-scale inference operation, serving millions of such requests, draws millions of watts.
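A back-of-envelope version of that comparison; the 20 W / one-minute human figure comes from the article, while the per-query and traffic numbers are loudly assumed round figures, not measurements:

```python
# Back-of-envelope energy comparison. Assumptions (not measurements):
# ~0.3 Wh per LLM query and 100M queries/day are illustrative guesses.
human_joules = 20 * 60                    # 20 W for 60 s = 1200 J per task
llm_joules = 0.3 * 3600                   # assumed 0.3 Wh/query = 1080 J
daily_queries = 100_000_000               # assumed fleet traffic
fleet_avg_watts = llm_joules * daily_queries / 86_400  # J/day -> average W

print(f"per-task: human {human_joules} J vs LLM {llm_joules:.0f} J")
print(f"fleet average draw: {fleet_avg_watts / 1e6:.2f} MW")
```

Under these assumptions a single query lands in the same energy range as a minute of human thought, but the aggregate draw is measured in megawatts—which is why the cost scales with users rather than amortising away.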


And the energy cost offers no economy of scale: it grows with the number of users of the application.


This is not virtue signalling; it is a question of choosing an AI architecture appropriate to the scale of an organisation.


The most critical question about the future of LLMs is not how to scale models up, but how to engineer efficient systems around them.


Three Shifts That Will Define the Next Era

1. From Generalisation to Specialisation

The AI community feels pressure to deliver general-purpose AI models. In the real world, however, generalisation loses to specialisation.


Take GPT-4, for example. By all metrics, it is an excellent product. However, on a given niche task—coding against a specific framework, say—GPT-4 will lose to a purpose-built specialised model with less than a tenth of its parameters.


The enterprises capitalising on this technology are not creating general-purpose models but domain-specialised, model-agnostic systems layered with purpose-built interfaces, targeted fine-tuning, and curated datasets.


Expect this paradigm to evolve into a microservices approach to model specialisation: many small models, each with extraordinary depth for one narrowly defined purpose, operated by systems with intelligent routing.
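A toy sketch of what that intelligent-routing layer might look like; the specialist names and keyword rules are hypothetical placeholders, and a production router would use a learned classifier rather than string matching:

```python
# Hypothetical router dispatching queries to specialised "microservice"
# models. The specialists and routing rules are illustrative only.

SPECIALISTS = {
    "code": lambda q: f"[code-model] {q}",
    "legal": lambda q: f"[legal-model] {q}",
    "general": lambda q: f"[general-model] {q}",
}

def route(query: str) -> str:
    lowered = query.lower()
    # Crude stand-in for a learned intent classifier.
    if "def " in lowered or "function" in lowered:
        key = "code"
    elif "contract" in lowered or "clause" in lowered:
        key = "legal"
    else:
        key = "general"
    return SPECIALISTS[key](query)
```

The design point is that each specialist stays small and replaceable; the routing layer, not any single model, carries the system-level intelligence.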


2. From Black Boxes to Interpretable Systems

The term ‘explainability’ has lost its meaning. In the near future, however, governance and regulation will force a precise definition.


The paradigm is shifting towards explainability as regulatory bodies increase their calls for transparency. For example, when an LLM is deployed in a company's decision-making process and the model helps determine whether an applicant is eligible for a loan, a mortgage, or bail, the model must be explainable or there will be no regulatory compliance. This is not a problem that better prompting can solve; it will require radical changes in the architectural design of the models.


Progress from researchers in this area suggests that full interpretability of purely neural systems is not attainable—and is not the goal. Instead, they are developing systems with interpretable components: decision trees, symbolic reasoning, explicit knowledge graphs, and neuro-symbolic (hybrid) models.
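A hedged sketch of the hybrid pattern the paragraph describes: an opaque scorer feeds an explicit, auditable rule table, so every final decision carries a human-readable reason. The scoring function, thresholds, and field names are all illustrative assumptions:

```python
# Hypothetical hybrid decisioning: opaque score -> interpretable rules.
# Thresholds and the scoring heuristic are placeholders, not policy.

def opaque_score(application: dict) -> float:
    # Stand-in for a neural model's risk score in [0, 1].
    return min(1.0, application["debt"] / max(application["income"], 1))

def decide(application: dict) -> tuple[str, str]:
    score = opaque_score(application)
    # Interpretable layer: every outcome maps to an explicit, citable rule.
    if score < 0.3:
        return "approve", f"risk score {score:.2f} below 0.30 threshold"
    if score < 0.6:
        return "review", f"risk score {score:.2f} in manual-review band"
    return "decline", f"risk score {score:.2f} above 0.60 threshold"
```

The neural component can remain a black box, but the decision boundary—the part regulators care about—is explicit and versionable.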


Purely neural models are not going away, but on their own they cannot satisfy the transparency requirements real systems face; that is what the hybrid systems of tomorrow are for.


3. From Training Once to Continuous Learning

Currently, LLMs are incapable of incorporating new data once trained. A model trained on data from 2023 is oblivious to the events of 2024. This is a defining constraint of current models, in contrast with genuinely intelligent systems: humans have the innate ability to learn continuously, while current LLMs learn once.


The challenges are multifaceted: catastrophic forgetting, maintaining a consistent model as new information arrives, and handling temporally correlated, non-stationary data. These challenges are not new, and there have been promising implementations to tackle them. The ability to learn continuously will be the key differentiator for next-generation production AI systems.
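One classic mitigation for catastrophic forgetting is rehearsal: mix a sample of previously seen data into each update so earlier knowledge keeps being revisited. A minimal sketch, assuming a placeholder trainer in which the actual weight update is omitted:

```python
# Minimal rehearsal sketch against catastrophic forgetting. The real
# weight update is omitted; only the data-mixing logic is shown.
import random

class RehearsalTrainer:
    def __init__(self, replay_capacity: int = 1000):
        self.replay: list = []          # reservoir of past examples
        self.capacity = replay_capacity
        self.seen = 0

    def update(self, new_batch: list) -> list:
        # Interleave old examples so earlier knowledge is revisited.
        old = random.sample(self.replay, min(len(self.replay), len(new_batch)))
        mixed = new_batch + old
        # Reservoir sampling keeps a bounded, uniform sample of history.
        for example in new_batch:
            self.seen += 1
            if len(self.replay) < self.capacity:
                self.replay.append(example)
            else:
                j = random.randrange(self.seen)
                if j < self.capacity:
                    self.replay[j] = example
        return mixed                    # batch the model would train on
```

Rehearsal is only one of several approaches (regularisation methods and parameter isolation are others), but it illustrates why continual learning is a systems problem as much as a modelling one.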


What Leaders Should Be Thinking About Now

When assessing AI providers or building systems on large language models (LLMs), the following questions matter more than traditional factors such as model size or benchmark scores:


1. How does the system quantify and manage uncertainty? Does it have an awareness of the “unknown” and the ability to escalate uncertain cases to human operators, or does it produce unqualified responses?


2. What does the human feedback loop look like? How easy is it to make corrections to the model? How cost-effective is it to obtain domain-specific data for retraining? Is your vendor creating opportunities to facilitate this?


3. Where should this system NOT operate? Systems with well-defined operational boundaries are preferable to those without them. Where are yours?


4. What is the environmental impact? This is not an exercise in virtue signalling but a real operational constraint that many organisations are adopting. At scale, computing costs become either a competitive advantage or a competitive disadvantage. Which do you expect them to be in your situation?


5. Will justification of the decision be possible? In the near term, expect to have to explain every AI-assisted decision you make, and favour decisions that can be justified.
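Question 1 above is the easiest to make concrete. A toy illustration of uncertainty escalation, where the confidence value is a stand-in—real systems might derive it from token log-probabilities or ensemble disagreement:

```python
# Toy escalation gate for low-confidence answers. The confidence source
# and the 0.75 floor are illustrative assumptions.

CONFIDENCE_FLOOR = 0.75

def respond(answer: str, confidence: float) -> str:
    # Below the floor, route to a human instead of guessing.
    if confidence < CONFIDENCE_FLOOR:
        return "ESCALATE_TO_HUMAN"
    return answer
```

However the confidence is estimated, the crucial property is that "I don't know" is a first-class output, not a failure mode.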


The Uncomfortable Truth

We have progressed beyond the first phase of developing ever more generalised AI tools and are now in the phase of constructing systems with meaningful utility beyond the hype. This phase is more valuable to users than the announcement of the newest and largest models. The companies that experience the most commercial success this decade will not be the ones with the largest models, but the ones with practical, explainable, efficient systems operating within well-defined domains.


Companies in this space will also have made significant investments in human-in-the-loop (HITL) infrastructure, supplementing one-time training with ongoing feedback mechanisms. They will have to ask more critical questions about what their systems should do than about the systems themselves.


The transformative development will not be the future of ever-larger models we once envisioned for large language models (LLMs); it will be the systems built around them.


The Path Forward

The best time to start rethinking your AI strategy was last year. The second-best time is today. Rather than chasing the largest models, concentrate on building the most thoughtful systems—geared toward reliability rather than raw capability, and including meaningful HITL feedback loops.


One of the most critical points in this development is that the real intelligence is no longer in the model, but in the ecosystem around it. The businesses and groups who internalise this will be the architects of the future. The rest will be trapped at a local maximum, chasing scale as the world moves on.


Written by balogundavidtaiwo | I'm a seasoned data scientist with a strong passion for harnessing the power of data to drive informed decision-making
Published by HackerNoon on 2026/04/10