Let’s take a moment to think about learning to drive. Any one of us can learn the basic principles of driving from just a few demonstrations, and once we understand concepts like steering, acceleration, and braking, we can apply those skills to any car, from a compact sedan to a large truck. With these skills, we can also quickly adapt to different road conditions, weather, and traffic situations, even if we have never encountered them before. Current AI, by contrast, needs thousands of hours of training data for each specific vehicle and condition, whereas an AGI would grasp the fundamental principles of driving and apply them broadly.
AGI research aims to create AI systems that can truly generalize knowledge and skills, learning fundamental principles that can be applied to completely new situations, whether that means driving a car or playing chess, Go, or Minecraft. This differs markedly from today’s AI systems, including LLMs, which primarily operate through sophisticated pattern matching across vast training datasets.
While modern LLMs can engage in seemingly intelligent conversation and solve complex problems, they fundamentally work by recognizing and recombining patterns they’ve encountered during training. This is more akin to extremely advanced memorization and statistical correlation than true understanding and generalization, as they don’t build genuine causal models or abstract representations of the world. When they do appear to generalize, they’re usually just finding subtle statistical patterns in their training data rather than understanding deeper principles.
ARC addresses a crucial measurement problem in AI research — how do we actually test if an AI system can generalize?
Traditional benchmarks typically measure performance on specific tasks with large training datasets, but high scores do not necessarily indicate true generalization ability. An AI might perform well simply by memorizing patterns in the training data rather than developing a genuine understanding.
As F. Chollet writes: “To the best of our knowledge, ARC does not appear to be approachable by any existing machine learning technique (including Deep Learning)”.
The main difficulties are the following:
– The expected output is not a label or even a set of labels, but a colored grid with sizes up to 30x30 and with up to 10 different colors. Therefore, it falls in the domain of structured prediction.
– The predicted output has to match the expected output exactly. If a single cell is wrong, the task is considered as failed. To compensate for that, three attempts are allowed for each input grid.
– In each task, there are generally between two and four training instances (input grid + output grid), and one or two test instances for which a prediction must be made.
– Each task relies on a distinct transformation from the input grid to the output grid. In particular, no evaluation task can be solved by reusing a transformation learned on the training tasks. Each task is a distinct learning problem, and what ARC evaluates is a broad generalization and few-shot learning.
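The constraints listed above can be made concrete with a short sketch. It assumes the JSON layout used by the public ARC repository (a `train` list and a `test` list of `input`/`output` pairs, each grid being a list of rows of integers 0 to 9). The helper names `is_valid_grid` and `score_test_input` and the toy task are hypothetical, chosen only for illustration.

```python
from typing import List

Grid = List[List[int]]  # rows of color indices

MAX_SIDE = 30      # grids are at most 30x30
NUM_COLORS = 10    # assumed encoding: colors are integers 0..9
MAX_ATTEMPTS = 3   # attempts allowed per test input, as stated above


def is_valid_grid(grid: Grid) -> bool:
    """Check that a grid is rectangular, within size limits, and uses valid colors."""
    if not grid or not grid[0]:
        return False
    height, width = len(grid), len(grid[0])
    if height > MAX_SIDE or width > MAX_SIDE:
        return False
    return all(
        len(row) == width and all(0 <= cell < NUM_COLORS for cell in row)
        for row in grid
    )


def score_test_input(attempts: List[Grid], expected: Grid) -> bool:
    """A test input is solved only if one of the allowed attempts matches exactly."""
    return any(attempt == expected for attempt in attempts[:MAX_ATTEMPTS])


# A toy task in the assumed layout; the hidden rule is "mirror each row".
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[5, 0, 0]], "output": [[0, 0, 5]]}],
}

expected = task["test"][0]["output"]
near_miss = [[0, 5, 5]]  # right shape, but one wrong cell fails the attempt
print(score_test_input([near_miss], expected))               # False
print(score_test_input([near_miss, [[0, 0, 5]]], expected))  # True
```

Because scoring is all-or-nothing per grid, a prediction that is 99% correct earns the same credit as a blank grid, which is part of what makes the benchmark hard to approach with approximate methods.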
ARC offers a more rigorous test of generalization by presenting each puzzle with only a handful of examples (3–5), a strictly few-shot setting. With so little training data, the AI can’t rely on extensive pattern matching; it must extract the underlying principles and apply them to new situations, just as humans do. The puzzles are also deliberately designed to resist shortcut solutions or memorization strategies.
What makes ARC particularly valuable is that it provides a quantifiable measure of generalization ability. Rather than debating whether an AI system truly “understands” in some philosophical sense, we can measure concrete performance on these carefully designed reasoning tasks. This gives researchers a clear benchmark for progress toward artificial general intelligence.
The 2020 Kaggle competition revealed one of the first approaches to solving ARC — brute-force search through a space of predefined transformations. The winning solution constructed a domain-specific language (DSL) containing 142 hand-crafted grid operations. By systematically searching through combinations of these operations, it achieved 20% accuracy on private evaluation tasks. Another similar solution using grammatical evolution to guide the transformation search reached 3–7.68% accuracy.
While notable for their initial success, these solutions highlighted a key limitation: they relied on an exhaustive search of pre-programmed rules rather than developing any real understanding or ability to generalize. This gap between programmatic search and true intelligence demonstrates why ARC remains a challenging benchmark for measuring generalization capabilities.
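To illustrate what brute-force search over a DSL looks like, here is a minimal sketch. It is not the winning solution: the four whole-grid operations, the names `DSL`, `apply_program`, and `brute_force_search`, and the depth limit are all assumptions made for this example, standing in for the much larger hand-crafted operation set described above.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Op = Callable[[Grid], Grid]

# A toy DSL of four whole-grid operations (hypothetical, for illustration only).
DSL: Dict[str, Op] = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
    "rotate_cw": lambda g: [list(col) for col in zip(*g[::-1])],
}


def apply_program(program: Tuple[str, ...], grid: Grid) -> Grid:
    """Apply a sequence of DSL operations from left to right."""
    for name in program:
        grid = DSL[name](grid)
    return grid


def brute_force_search(
    pairs: List[Tuple[Grid, Grid]], max_depth: int = 3
) -> Optional[Tuple[str, ...]]:
    """Return the shortest program that reproduces every training output, if any."""
    for depth in range(1, max_depth + 1):
        for program in product(DSL, repeat=depth):
            if all(apply_program(program, inp) == out for inp, out in pairs):
                return program
    return None


train_pairs = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),  # a 180-degree rotation
    ([[5, 0, 0]], [[0, 0, 5]]),
]
program = brute_force_search(train_pairs)
print(program)  # ('flip_h', 'flip_v')
```

The number of candidate programs grows exponentially with depth (4, 16, 64, and so on here), and the search can only ever find transformations that the predefined operations are able to express.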
The current approach (https://github.com/sebferre/ARC-MDL/tree/master) is based on the Minimum Description Length (MDL) principle, a fundamental principle used to discover patterns and build models that explain data in the most concise way possible. At its core, MDL states that “the best model for some data is the one that compresses the most of the data.”
The solution uses a specialized modeling language to describe grid patterns efficiently. This language provides a structured way to represent both input and output grids as combinations of basic elements:
At the highest level, each puzzle is represented as a pair and contains two grids:
● An input grid (in)
● An output grid (out)
Each grid is defined by three components:
Objects are positioned shapes, where each shape can be either:
● A single-colored point
● A rectangle with a specified size, color, and mask
The mask system is particularly powerful, allowing rectangles to take various forms:
● Full (solid rectangle)
● Border (outline only)
● Checkerboard patterns (even or odd)
● Cross patterns (plus or times shape)
● Custom bitmap patterns
This language allows the system to describe complex grid patterns compactly. For example, instead of storing a 10x10 grid pixel by pixel (100 values), it might store it as “a black background with a 3x3 red rectangle at position (2,2)” — using far fewer values while capturing the essential structure.
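The compression argument can be sketched in a few lines of Python. This is not the ARC-MDL implementation: the class names `Rectangle` and `GridModel`, the choice of grid fields (height, width, background color, object list), the color codes, and the use of "number of stored values" as a stand-in for description length are all assumptions made for illustration. A real MDL approach measures description length in bits.

```python
from dataclasses import dataclass, field
from typing import List

BLACK, RED = 0, 2  # assumed color codes for this example


@dataclass
class Rectangle:
    row: int
    col: int
    height: int
    width: int
    color: int
    mask: str = "full"  # only "full" and "border" are handled in this sketch

    def covers(self, r: int, c: int) -> bool:
        """Whether the mask paints cell (r, c), assumed to lie inside the rectangle."""
        if self.mask == "full":
            return True
        if self.mask == "border":
            return (r in (self.row, self.row + self.height - 1)
                    or c in (self.col, self.col + self.width - 1))
        raise ValueError(f"mask {self.mask!r} is not handled in this sketch")

    def num_values(self) -> int:
        return 6  # row, col, height, width, color, mask


@dataclass
class GridModel:
    height: int
    width: int
    background: int
    objects: List[Rectangle] = field(default_factory=list)

    def render(self) -> List[List[int]]:
        """Expand the compact description back into a full grid of cells."""
        grid = [[self.background] * self.width for _ in range(self.height)]
        for obj in self.objects:  # later objects are drawn on top of earlier ones
            for r in range(max(obj.row, 0), min(obj.row + obj.height, self.height)):
                for c in range(max(obj.col, 0), min(obj.col + obj.width, self.width)):
                    if obj.covers(r, c):
                        grid[r][c] = obj.color
        return grid

    def num_values(self) -> int:
        return 3 + sum(obj.num_values() for obj in self.objects)


# "A black background with a 3x3 red rectangle at position (2,2)" on a 10x10 grid.
model = GridModel(10, 10, BLACK, [Rectangle(2, 2, 3, 3, RED)])
print(model.num_values(), "values instead of", 10 * 10)  # 9 values instead of 100
```

Rendering the model reproduces the full grid exactly, so nothing is lost by storing the short description instead of the 100 cells.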
When searching for patterns, the system tries to find the most concise description of both input and output grids using this language. Good solutions tend to reuse elements between input and output (like taking a shape from the input and transforming it in the output), which leads to shorter descriptions and better compression.
The success of this approach (94/400 training tasks solved) suggests that this language captures many of the key patterns present in ARC puzzles while being constrained enough to avoid overfitting to specific examples.
While LLMs have shown impressive capabilities across many domains, using them directly to solve ARC presents both opportunities and challenges. The naive approach involves providing the LLM with input-output examples and asking it to predict the answer for new inputs. However, this method has significant limitations. LLMs demonstrate very limited spatial reasoning capabilities in this context and are highly prone to hallucinations when attempting to predict grid transformations.
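A minimal sketch of the naive approach is shown below. The prompt wording and the helper names `grid_to_text`, `build_direct_prompt`, and `parse_grid` are hypothetical, not the prompts used in our experiments, and no model is actually called. The parser shows one simple guard against malformed replies: anything that is not a rectangular grid of single digits is rejected.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]


def grid_to_text(grid: Grid) -> str:
    """Serialize a grid as one line per row, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def build_direct_prompt(train_pairs: List[Tuple[Grid, Grid]], test_input: Grid) -> str:
    """Build a few-shot prompt that asks the model for the output grid directly."""
    parts = ["Each example maps an input grid to an output grid using one hidden rule."]
    for i, (inp, out) in enumerate(train_pairs, start=1):
        parts.append(f"Example {i} input:\n{grid_to_text(inp)}")
        parts.append(f"Example {i} output:\n{grid_to_text(out)}")
    parts.append(f"Test input:\n{grid_to_text(test_input)}")
    parts.append("Test output:")
    return "\n\n".join(parts)


def parse_grid(reply: str) -> Optional[Grid]:
    """Parse a model reply into a grid, or return None if it is malformed."""
    rows = []
    for line in reply.strip().splitlines():
        tokens = line.split()
        if not tokens or not all(len(t) == 1 and t in "0123456789" for t in tokens):
            return None
        rows.append([int(t) for t in tokens])
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        return None
    return rows


prompt = build_direct_prompt([([[1, 0]], [[0, 1]])], [[5, 0, 0]])
print(prompt)
print(parse_grid("0 0 5"))    # [[0, 0, 5]]
print(parse_grid("0 0\n5"))   # None, because the rows have different lengths
```

A well-formed reply can still be wrong, of course; the parser only filters out structurally invalid grids, not incorrect transformations.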
A second approach builds upon the direct prediction method by first asking the LLM to analyze and describe the patterns it observes in the input-output pairs. While this additional reasoning step yields better results by helping the LLM decompose the problem, it still has the same fundamental limitations. The model continues to exhibit a high rate of hallucinations when attempting to predict final outputs, even after identifying potential patterns. This suggests that adding explicit reasoning steps alone is insufficient to overcome LLMs’ spatial reasoning limitations in solving ARC challenges.
At WLTech.AI, we believe AI agents play a huge role in the pursuit of AGI. They’re designed to interact with their environments dynamically, adapt based on what they learn, and learn on their own. Unlike static models trained only once, AI agents can learn from ongoing interactions and adapt to changing circumstances, making them a vital component in the development of AGI.
AI agents are the brains of the operation, coordinating a range of techniques that are adapted to the specific demands of a task. Symbolic systems are great at precise, rule-based reasoning, which makes them perfect for tasks that require understanding transformations such as rotations or reflections. Neural networks are great at recognizing patterns and generalizing from data, which is really useful for identifying the underlying structures in ARC tasks.
However, ARC’s challenges don’t end with symbolic manipulation or pattern recognition. Many tasks require a more advanced level of abstraction, including the ability to create new rules, make connections, and adapt to new situations. Language models are useful here, as they can be used for things like program synthesis and abstract reasoning. Search algorithms are another tool in the box as they can explore possible transformations efficiently to identify solutions. Planning systems, on the other hand, provide the framework to break down and tackle complex problems one step at a time.
What makes AI agents so clever is that they can bring all these different approaches together. They don’t just use one method at a time. They evaluate and deploy the best combination of techniques to address each unique problem. This ability to adapt on the fly is what sets humans apart and is an important part of advancing AGI.
At their core, AI agents are basically intelligent coordinators. They keep a running record of what works and what doesn’t, so that they can learn from past experiences.
Our Agentic AI solution
You can find our solution here: https://colab.research.google.com/drive/1-rQoqrGRGQE6M8bMpfzqf6tV3TnUi-Mp?usp=sharing or on GitHub: https://github.com/weblab-technology/arc-challenge-2024-10
Our breakthrough came from mimicking human problem-solving behavior: analyzing examples, hypothesizing rules, testing, and refining. Instead of brute force, our AI focuses on writing generation functions — Python code that defines transformations — and immediately testing them on the training data.
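The "write a function, then test it immediately" step can be sketched as follows. This is a simplified illustration rather than the code from our repository: it assumes the model returns Python source defining a function named `transform`, and the helper names `load_transform` and `passes_training` are hypothetical. Running model-written code with `exec` is only reasonable inside a sandbox.

```python
import copy
from typing import Callable, List, Tuple

Grid = List[List[int]]


def load_transform(source: str) -> Callable[[Grid], Grid]:
    """Execute generated source and return its `transform` function (assumed name)."""
    namespace: dict = {}
    exec(source, namespace)  # untrusted code: run this only in a sandbox
    return namespace["transform"]


def passes_training(source: str, train_pairs: List[Tuple[Grid, Grid]]) -> bool:
    """A candidate is accepted only if it reproduces every training output exactly."""
    try:
        transform = load_transform(source)
        # Deep-copy inputs so a candidate that mutates its argument cannot corrupt the data.
        return all(transform(copy.deepcopy(inp)) == out for inp, out in train_pairs)
    except Exception:
        return False  # syntax errors, crashes, and missing functions all count as failures


train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 3, 0]], [[0, 3, 3]]),
]

good_candidate = """
def transform(grid):
    return [row[::-1] for row in grid]
"""

bad_candidate = """
def transform(grid):
    return grid[::-1]
"""

print(passes_training(good_candidate, train_pairs))  # True
print(passes_training(bad_candidate, train_pairs))   # False
```

Checking against the training pairs gives an objective pass or fail signal before any answer is submitted, which a direct grid prediction cannot provide.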
A surprising discovery in our approach was that iterative improvements often hinder progress rather than helping. If the initial assumption behind a generation function is flawed, attempting to refine it typically amplifies the error instead of correcting it. This realization fundamentally reshaped our methodology.
Instead of refining incorrect assumptions, we found it more effective to discard the failed function and start again from a fresh hypothesis.
This mirrors human behavior of resetting and rethinking when a solution path proves unproductive, rather than patching a broken strategy.
This insight also explains why genetic algorithms failed to improve results. By design, genetic algorithms evolve solutions incrementally, refining them over generations. However, when the foundational assumptions are wrong, incremental changes lead to convoluted solutions that stray further from the correct transformation.
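The restart strategy described above can be sketched as a simple loop. The `propose` callable is a hypothetical stand-in for an LLM call that returns a brand-new candidate function each time; the names `fits` and `solve_with_restarts`, the attempt limit, and the stub proposer are assumptions made for this example.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Transform = Callable[[Grid], Grid]
Pairs = List[Tuple[Grid, Grid]]


def fits(transform: Transform, train_pairs: Pairs) -> bool:
    """True if the candidate reproduces every training output exactly."""
    try:
        return all(transform([row[:] for row in inp]) == out for inp, out in train_pairs)
    except Exception:
        return False


def solve_with_restarts(
    propose: Callable[[Pairs], Transform], train_pairs: Pairs, max_attempts: int = 5
) -> Tuple[Optional[Transform], int]:
    """Ask for a fresh hypothesis on every attempt instead of patching the last one."""
    for attempt in range(1, max_attempts + 1):
        candidate = propose(train_pairs)  # new hypothesis, not an edit of a failed one
        if fits(candidate, train_pairs):
            return candidate, attempt
    return None, max_attempts


# Stub proposer standing in for the LLM: its first guess is wrong, its second is right.
hypotheses = iter([
    lambda g: g[::-1],                   # wrong: flips the grid vertically
    lambda g: [row[::-1] for row in g],  # right: mirrors each row
])


def stub_propose(train_pairs: Pairs) -> Transform:
    return next(hypotheses)


train_pairs = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
solution, attempts = solve_with_restarts(stub_propose, train_pairs)
print(attempts)             # 2
print(solution([[3, 0]]))   # [[0, 3]]
```

The failed first guess is simply dropped. Nothing from it is carried into the second attempt, which is the opposite of how a genetic algorithm would treat it.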
LLM Evaluation: Claude 3.5 Sonnet Outperforms Competitors
To tackle the ARC challenge, we extensively tested multiple large language models (LLMs) to assist in writing generation functions. Among them, Claude 3.5 Sonnet emerged as the most capable, significantly outperforming its competitors.
Key Findings:
Claude 3.5 Sonnet vs. GPT-4o:
● Performance: Claude 3.5 Sonnet identified more patterns than GPT-4o, achieving nearly double the accuracy in pattern recognition.
● Efficiency: Claude achieved the same results as OpenAI’s GPT-4o in 1/7 of the execution time, making it not only more effective but also faster.
GPT-4o’s Decline:
● We observed a noticeable decline in GPT-4o’s performance over time. The initial version of GPT-4o was far more capable for ARC tasks than subsequent versions, indicating a potential shift in its optimization that hindered pattern recognition and reasoning for this challenge.
Why Claude 3.5 Sonnet Stands Out
Claude’s edge lies in its ability to generalize and identify subtle transformations, which is crucial for ARC’s few-shot nature. Its consistent performance and efficiency made it the clear choice for integrating into our AI agent framework, setting a new standard for LLM-driven reasoning in structured problem-solving.
Our approach achieved close to 30% accuracy on the ARC evaluation set, significantly outperforming baseline methods. This result highlights the strength of mimicking human problem-solving behavior, leveraging fresh hypotheses over iterative refinements, and utilizing the most capable LLMs like Claude 3.5 Sonnet. While there is still substantial room for improvement, this milestone demonstrates meaningful progress in addressing ARC’s challenges and advancing toward broader generalization in AI.
At WLTech.AI we believe the future of solving ARC lies in the continued growth of LLM capabilities combined with higher-level reasoning frameworks like Minimum Description Length (MDL) or similar approaches for concise pattern explanation. These advancements could enable models to better abstract and generalize transformations. Additionally, integrating a self-refining prompt system based on a growing solutions bank would allow models to iteratively improve their reasoning and draw from past successes, creating a more adaptive and efficient problem-solving pipeline. This synergy between advanced LLMs, structured explanations, and adaptive learning holds the potential to unlock new milestones in ARC and AGI development.
The Abstraction and Reasoning Corpus (ARC) benchmark has been key to testing how well AI can apply general rules and think in a more abstract way. Over the years, we’ve seen a number of standout solvers come along, each adding something different to the field.
● Ryan Greenblatt’s Approach
In 2024, Ryan Greenblatt, an engineer at Redwood Research, reached a major milestone by scoring 42% on the ARC-AGI public evaluation set, with a verification score of 43%. His approach involved using GPT-4o to generate and refine several Python programs and selecting the best ones for submission. This shows how we can use large language models with program synthesis to tackle complex reasoning tasks.
● Icecuber 2020
The “icecuber 2020” solution, a previous competition winner, got a public evaluation score of 39% and a verification score of 17%. While we don’t have all the details about the methodology, this solution has been really important in setting the bar for subsequent ARC solvers.
The ARC Prize 2024 leaderboard features the following top performers:
● MindsAI are in the lead with a score of 55.5%.
● The ARChitects are close behind with a score of 53.5%.
● Guillermo Barbadillo is third with a score of 40%.
● Alijs is fourth, also at 40%.
● TuMinhDang is fifth with a score of 38%.
These scores show how everyone’s working hard and coming up with new ways to tackle the ARC benchmark. They also show how different teams are using different strategies.
The ARC benchmark is still a great way to test how well AI systems can reason and generalize. Even though there have been some great advances, no model has got to grips with ARC completely, demonstrating how tricky it is to achieve artificial general intelligence. Researchers and practitioners are always looking for ways to combine different approaches, such as symbolic reasoning with neural networks, to get closer to solving it.
Benchmarks like ARC give us a sneak peek into what the future holds for AI research. They’re moving the field toward systems that can think and adapt like humans. While we’re still making slow progress, ARC has already set out a clear path to achieving AGI.
● The focus is moving towards generalization. (https://arxiv.org/abs/2305.07141)
In the future, AI systems will be designed to generalize rather than specialize. As the ARC tasks show, being able to solve new problems without having to retrain is a key sign of true intelligence. It seems likely that researchers will develop models that are great at few-shot or zero-shot learning, taking inspiration from how our brains work.
● Hybrid models are going to be the norm.
The success of ARC solvers has already shown us that single-approach systems have their limits. The future of AI is going to be hybrid models that integrate neural networks, symbolic systems, and probabilistic reasoning. These models will do well on ARC, but they’ll also be able to handle real-world problems where flexibility and adaptability are key.
● There’s a new focus on cognitive architecture.
ARC has got people thinking about cognitive architectures that copy the human brain’s ability to combine different ways of reasoning. We’ll see more research into working memory, meta-learning, and multi-agent systems, which will help pave the way for AI that can reason, learn, and adapt on the fly.
As AI systems get smarter, they’ll start working with us rather than just doing our jobs for us. Benchmarks like ARC are helping to develop AI systems that work alongside humans, offering insights and solutions in complex areas like scientific discovery and creative problem-solving.
Competitions like these have really inspired the AI community. With a reward of over $1,000,000, the ARC Prize is a great incentive for researchers to come up with open-source solutions that can beat the current benchmarks.
At WLTech.AI, we recognize that the value of such solutions far exceeds $1,000,000, and we are excited to participate in the challenge again next year to continue advancing the field.