The Tiny AI Benchmark That Might Be Asking the Biggest Question in Machine Learning
For the last decade, AI progress has largely followed one recipe.
Build larger models.
Train on more data.
Buy more GPUs.
Repeat.
The results have been extraordinary.
Modern language models can write software, explain scientific papers, solve complex programming tasks and engage in surprisingly natural conversations.
Yet one of the simplest algorithmic problems imaginable, adding two numbers, may now be forcing us to rethink one of our deepest assumptions about machine learning.
A small open-source project called AdderBoard has quietly been asking a deceptively simple question:
How small can a standard transformer become while still learning 10-digit addition?
The project itself began with a simple experiment. Its creator challenged two coding agents, Claude Code and Codex, to train the smallest possible transformer capable of 10-digit addition. Their models, containing 6,080 and 1,644 parameters respectively, became the first entries on a public leaderboard. Since then, the open-source community has been engaged in an extraordinary compression race.
At first glance this sounds like an obscure optimisation challenge.
I think it might be something much bigger.
The story begins with a ridiculous claim
Imagine someone had said in 2023:
"A transformer can learn 10-digit addition with only 36 parameters."
Most AI researchers would probably have laughed.
Not because addition is difficult.
Because transformers appeared to require vastly more parameters to reliably learn even simple algorithmic behaviour.
Yet within a short time of AdderBoard's launch, trained transformers with only a few dozen parameters were achieving perfect accuracy on the benchmark's held-out test set.
What's remarkable isn't the exact number.
It's how quickly the field has moved.
Why AdderBoard is different
Earlier research on arithmetic transformers asked questions such as:
- Can transformers learn addition?
- How do they internally represent arithmetic?
- Can they generalise to longer numbers?
Those are fascinating scientific questions.
But they were never trying to minimise parameter count.
AdderBoard turns that objective upside down.
Its primary goal is simply:
What is the smallest standard transformer that can perform 10-digit addition while meeting a fixed accuracy target?
That seemingly simple question has produced one of the fastest compression races I've seen in AI research.
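To make that objective concrete, here is a minimal sketch of how a candidate adder could be checked against an accuracy target. The function names, sample size, seed and 0.99 threshold are all illustrative assumptions; this is not AdderBoard's actual evaluation protocol.

```python
import random

def exact_match_accuracy(add_fn, n_samples=1000, digits=10, seed=0):
    """Fraction of random operand pairs on which add_fn returns the exact sum."""
    rng = random.Random(seed)
    upper = 10 ** digits - 1  # largest 10-digit operand
    correct = 0
    for _ in range(n_samples):
        a = rng.randint(0, upper)
        b = rng.randint(0, upper)
        if add_fn(a, b) == a + b:
            correct += 1
    return correct / n_samples

def reference_adder(a, b):
    """Stand-in for a model that has learned addition perfectly."""
    return a + b

def carry_blind_adder(a, b):
    """Stand-in for a flawed model: adds each column modulo 10, ignoring carries."""
    result, place = 0, 1
    while a > 0 or b > 0:
        result += ((a % 10 + b % 10) % 10) * place
        a //= 10
        b //= 10
        place *= 10
    return result

TARGET = 0.99  # assumed threshold, for illustration only
print(exact_match_accuracy(reference_adder) >= TARGET)
print(exact_match_accuracy(carry_blind_adder) >= TARGET)
```

The carry-blind stand-in is only right when no column carries, which is rare for random 10-digit operands, so an exact-match metric separates it sharply from a true adder.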
The Compression Ladder
One way to think about AdderBoard is as a process of progressively removing different sources of inefficiency from learning.
| Reference point | Approx. Parameters | Accuracy | Knowledge injected into learning |
|---|---|---|---|
| Earlier generic research transformer | ~3,000,000 | ~99% | None. Built to study transformer behaviour rather than minimise size. |
| Early AdderBoard entries | 6,080 → 1,644 | 99–100% | First attempts at explicitly minimising transformer size. |
| Current largely generic-trained AdderBoard model | ~456 | 100% | Better optimisation, parameter sharing, low-rank factorisation and architectural improvements, but no obvious addition-specific algorithmic tricks. |
| Current addition-aware trained model | ~36 | 100%* | Addition-specific inductive biases such as decimal-aware embeddings, carry-focused curriculum and task-specific routing. |
| Hand-designed constructive solution | ~6–10 meaningful constants | Exact / provable | Complete knowledge of the addition algorithm supplied by the designer. |
*100% refers to the benchmark's held-out evaluation set. The hand-designed constructions are analytical implementations rather than empirical training results.
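The table names low-rank factorisation and parameter sharing among the techniques behind the smaller entries. As a rough illustration of why they shrink a model, the sketch below counts weights for a dense matrix versus a rank-r factorisation, with and without sharing across layers. The dimensions are invented for the example and are not taken from any AdderBoard entry.

```python
def dense_params(d_in, d_out):
    """Weights in a full d_in x d_out matrix (bias ignored)."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Weights when W is approximated by A @ B, A: (d_in, rank), B: (rank, d_out)."""
    return d_in * rank + rank * d_out

def stacked_params(per_layer, n_layers, shared):
    """Total across layers; sharing reuses one set of weights for every layer."""
    return per_layer if shared else per_layer * n_layers

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
d, r, layers = 16, 2, 4

dense = dense_params(d, d)       # 16 * 16 = 256
low = low_rank_params(d, d, r)   # 16 * 2 + 2 * 16 = 64

print(stacked_params(dense, layers, shared=False))  # 1024
print(stacked_params(low, layers, shared=True))     # 64
```

In this toy setting the two techniques together cut the count by a factor of 16, with no change to the nominal width or depth of the model.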
This is the surprising part
A well-trained, largely generic transformer appears capable of learning perfect addition with hundreds of parameters rather than millions.
That's roughly a four-order-of-magnitude reduction.
Addition-aware training and hand-designed constructions reduce this by a further one to two orders of magnitude.
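A quick way to check these reductions is to take the base-10 logarithm of each ratio. The sketch below uses the approximate parameter counts from the table; the variable names are mine.

```python
import math

def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the reduction factor between two parameter counts."""
    return math.log10(larger / smaller)

# Approximate counts from the compression ladder table.
generic_research = 3_000_000
generic_trained = 456
addition_aware = 36
hand_designed_low, hand_designed_high = 6, 10

# Generic research model to generic-trained AdderBoard model: about 3.8.
print(round(orders_of_magnitude(generic_research, generic_trained), 2))
# Generic-trained to addition-aware: about 1.1.
print(round(orders_of_magnitude(generic_trained, addition_aware), 2))
# Generic-trained to hand-designed: between about 1.7 and 1.9.
print(round(orders_of_magnitude(generic_trained, hand_designed_high), 2))
print(round(orders_of_magnitude(generic_trained, hand_designed_low), 2))
```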
This matters because it shows that at least one important algorithm can be learned and represented with far fewer parameters than earlier generic transformer experiments suggested.
Several of the strongest trained entries rely on grokking or grokking-aware training: they are trained well past the point of fitting examples until a compact generalising solution appears.
What remains untested is whether similarly compact models can also be trained with dramatically less data once the right architecture, optimisation strategy and inductive biases are in place.
That raises a question I don't think we ask often enough.
Are we measuring the complexity of the computation, or the complexity of discovering it?
Those are fundamentally different things.
Representing an algorithm isn't the same as discovering one
Imagine two engineers.
The first is handed the long-addition algorithm.
The second has to invent it purely by observing thousands of worked examples.
Both eventually know exactly the same algorithm.
But one problem is vastly harder than the other.
Today's neural networks are solving the second problem, and that is why they initially require so many parameters and so much data.
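For reference, this is roughly what the first engineer is handed. The sketch below is ordinary schoolbook addition written in plain Python, not a transformer and not any leaderboard entry. The helper names and the least-significant-digit-first layout are my own choices.

```python
def long_addition(a_digits, b_digits):
    """Schoolbook addition on equal-length digit lists, least significant digit first."""
    assert len(a_digits) == len(b_digits)
    carry = 0
    out = []
    for x, y in zip(a_digits, b_digits):
        total = x + y + carry
        out.append(total % 10)   # digit written in this column
        carry = total // 10      # carry passed to the next column
    out.append(carry)            # final carry becomes the extra leading digit
    return out

def to_digits(n, width=10):
    """Split n into `width` decimal digits, least significant first."""
    return [(n // 10 ** i) % 10 for i in range(width)]

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(d * 10 ** i for i, d in enumerate(digits))

a, b = 9_999_999_999, 1
print(from_digits(long_addition(to_digits(a), to_digits(b))))  # 10000000000
```

The whole procedure is one loop with a single piece of state, the carry. Writing it down is easy; inferring it from input-output examples alone is the harder problem.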
Why this matters beyond addition
If AdderBoard were simply about addition, it would be an elegant curiosity.
But I don't think that's what it's really measuring.
AdderBoard is asking a much broader question:
How efficiently can gradient descent discover a specific algorithm?
Addition just happens to be the first algorithm that we know how to measure in this way.
Tomorrow it could be:
- comparison
- sorting
- parsing
- planning
- graph search
- counting
If the same compression ladders emerge repeatedly, then something profound may be happening.
The complexity of learning may often be much greater than the complexity of the computation itself.
A different way to think about intelligence
Much of modern AI assumes that sufficiently large general-purpose neural networks can eventually learn everything.
Perhaps that's true.
But perhaps what they are gradually discovering is a set of reusable, transformer-based computational building blocks.
Comparison.
Routing.
Copying.
Counting.
Memory.
If those computations repeatedly emerge during learning, then perhaps future AI systems shouldn't rediscover them independently every time.
Perhaps they should learn to recognise, reuse and build upon them.
That isn't hand-coding intelligence.
It's learning how to reuse computation.
Transformers may already be doing something like this, but implicitly.
Mechanistic interpretability researchers have found that individual neurons are often polysemantic: the same neuron can appear to participate in several unrelated features. Sparse autoencoders suggest that the underlying features may be cleaner than the individual neurons make them look.
One interpretation is simply that the model is packing many concepts into limited space.
But another possibility is more interesting: perhaps some of these shared activations reflect reusable computations, not just reused storage.
In other words, the model may not only be storing many meanings in the same weights. It may also be discovering small computational routines that are useful in many different contexts.
If so, the next step is obvious: can we make that reuse explicit, controllable and trainable?
This isn't an argument against scaling
Scaling has been extraordinarily successful.
It may even be the best algorithm we currently have for discovering reusable computational structure.
The point isn't that larger models are the wrong approach.
The point is that they may not be the final approach.
AdderBoard hints that there may eventually be a better way.
Not by replacing neural networks.
Not by abandoning gradient descent.
But by designing learning algorithms that discover and model reusable computational building blocks far more efficiently.
Why this excites me
What I find fascinating about AdderBoard isn't that someone built a tiny transformer that can add numbers.
It's that every improvement on the leaderboard seems to represent the removal of another source of inefficiency.
First came generic optimisation.
Then better parameterisation.
Then task-specific inductive bias.
Finally, complete algorithmic understanding.
It's almost like watching scientific understanding emerge in miniature.
The benchmark isn't simply finding smaller models.
It's revealing where the complexity of learning actually comes from.
A prediction
If AdderBoard is revealing a general principle rather than an arithmetic curiosity, then over the next few years we should begin to see similar compression ladders emerge for many other algorithmic tasks.
Comparison.
Sorting.
Parsing.
Graph algorithms.
Planning.
Reasoning.
For each task, we might observe the same progression.
Large generic neural solutions.
Much smaller solutions using better optimisation.
Even smaller solutions with appropriate inductive bias.
Tiny implementations once the underlying algorithm is fully understood.
If that happens repeatedly, it would suggest that modern neural networks are not simply learning functions.
They are gradually discovering reusable computational structure.
The next research question
AdderBoard's real finding isn't just about how small an addition transformer can be. It's that the gap between "the smallest possible representation" and "the solution gradient descent naturally discovers" is large, measurable and experimentally tractable.
We don't need to speculate about this part. Larger general-purpose models have a well-known failure mode: they can lose track of cascading carries on long digit chains. A compact exact solution should not have that failure mode at all.
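One way to probe for that failure mode is to measure how long the carry chain is in a given problem and then test specifically on the longest chains. The sketch below is a hypothetical helper for that purpose, not part of AdderBoard.

```python
def longest_carry_chain(a, b):
    """Longest run of consecutive digit columns that each carry into the next."""
    carry = 0
    longest = current = 0
    while a > 0 or b > 0:
        total = a % 10 + b % 10 + carry
        carry = total // 10
        current = current + 1 if carry else 0  # extend or reset the current run
        longest = max(longest, current)
        a //= 10
        b //= 10
    return longest

print(longest_carry_chain(9_999_999_999, 1))              # 10
print(longest_carry_chain(1234, 4321))                    # 0
print(longest_carry_chain(1_000_000_005, 1_000_000_005))  # 1
```

A problem such as 9,999,999,999 + 1 forces the carry through every column, which makes it a natural stress case for any adder, learned or hand-built.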
Understanding that gap may turn out to be far more important than addition itself.
Because if the same phenomenon appears across many computational tasks, future breakthroughs in AI may come not only from ever-larger models, but from learning algorithms that discover reusable, optimised computational building blocks inside transformer models.
If that turns out to be true, AdderBoard won't be remembered as a benchmark about addition.
It will be remembered as one of the first experiments to expose a fundamental property of how intelligent systems learn.