Since early 2025, AI labs have flooded us with so many new models that I'm struggling to keep up.
But the trend says nobody cares! There is only ChatGPT:
How so?
The new models are awesome, but their naming is a complete mess. And you can't even tell models apart by benchmarks anymore. The simple "this one's the best, everyone use it" answer no longer works.
In short, there are many truly fantastic AI models on the market, but few people actually use them.
And that's a shame!
I'll try to make sense of the naming chaos, explain the benchmark crisis, and share tips on how to choose the right model for your needs.
Dario Amodei has long joked that we might create AGI before we learn to name our models clearly. Google is traditionally leading the confusion game:
To be fair, it makes some sense. Each "base" model now gets lots of updates, and they're not always groundbreaking enough to justify calling each one a new version. That's where all these prefixes come from.
To simplify things, I put together a table of model types from major labs, stripping out all the unnecessary details.
So, what are these types of models?
There are huge, powerful base models. They're impressive but slow and costly at scale.
That's why we invented distillation: take a base model, train a more compact model on its answers, and you get roughly the same capabilities, just faster and cheaper.
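To make that concrete, here's a minimal sketch of the classic soft-label recipe (a toy PyTorch example under my own assumptions, not any lab's actual pipeline): a small "student" is trained to match the softened output distribution of a frozen "teacher".

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: push the student's output distribution
    toward the teacher's, softening both with a temperature."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the two distributions, scaled by T^2
    # as in the original distillation recipe
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature**2

# Toy example: batch of 4 samples, vocabulary of 10 tokens
teacher_logits = torch.randn(4, 10)                       # stand-in for a frozen large model
student_logits = torch.randn(4, 10, requires_grad=True)   # stand-in for the compact model
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

In practice, labs often distill on generated answers (sequence-level distillation) rather than raw logits, but the idea is the same: the big model supplies the training signal for the small one.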
This is especially critical for reasoning models. The best performers now follow multi-step reasoning chains—plan the solution, execute, and verify the outcome. Effective but pricey.
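Modern reasoning models run this loop internally within a single call; the sketch below just spells it out as explicit prompts, with `call_model` as a hypothetical placeholder for whichever API you use.

```python
def call_model(prompt: str) -> str:
    """Hypothetical placeholder: plug in your provider's chat API here."""
    raise NotImplementedError

def solve(task: str, max_fixes: int = 2) -> str:
    # 1. Plan: break the task into steps
    plan = call_model(f"Outline the steps to solve this task:\n{task}")
    # 2. Execute: follow the plan
    answer = call_model(f"Task: {task}\nPlan:\n{plan}\nCarry out the plan and give the final answer.")
    # 3. Verify: check the result, retry if a problem is found
    for _ in range(max_fixes):
        verdict = call_model(f"Task: {task}\nAnswer:\n{answer}\nReply OK if correct, otherwise describe the error.")
        if verdict.strip().upper().startswith("OK"):
            break
        answer = call_model(f"Task: {task}\nFlawed answer:\n{answer}\nError report:\n{verdict}\nProduce a corrected answer.")
    return answer
```

Every extra pass burns more tokens, which is exactly why these models cost more to run.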
There are also specialized models: ones for search, super-cheap ones for simple tasks, and models for specific fields like medicine and law. Plus a separate group for images, video, and audio. I left those out of the table to avoid clutter, and I deliberately skipped some other models and labs to keep things as simple as possible.
Sometimes, more details just make things worse.
It's become tough to pick a clear winner. Andrej Karpathy recently called this an "evaluation crisis."
It's unclear which metrics to look at now. MMLU is outdated, and SWE-Bench is too narrow. Chatbot Arena is so popular that labs have learned to "hack" it.
Currently, there are several ways to evaluate models:
A 35-point gap means the higher-rated model wins just 55% of head-to-head comparisons.
As in chess, the player with the lower Elo still has a good chance to win. Even with a 100-point gap, the "worse" model still wins about a third of the time.
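Those numbers come straight from the standard Elo expected-score formula; a quick check in Python:

```python
def win_probability(rating_gap: float) -> float:
    """Expected score of the higher-rated model under the Elo formula."""
    return 1 / (1 + 10 ** (-rating_gap / 400))

print(f"{win_probability(35):.0%}")   # 55%: a 35-point gap
print(f"{win_probability(100):.0%}")  # 64%: so the lower-rated model still wins ~36% of the time
```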
And again: some tasks are better solved by one model, others by another. Pick a model higher on the leaderboard, and maybe one of your ten requests gets a better answer. Which one, and how much better?
Who knows.
For lack of better options, Karpathy suggests relying on the vibe check.
Test the models yourself and see which one feels right. Sure, it's easy to fool yourself.
It's subjective and prone to bias, but it's practical.
Here's my personal advice:
Meanwhile, if you've been waiting for a sign to try something other than ChatGPT, here it is:
Next, I'll cover the highlights of each model and summarize other people's vibe checks.
If you enjoyed this and don't want to miss the next article, subscribe!