Elegance and Unification in AI
What fundamental physics taught me about where to look for weaknesses
A few months ago I first heard it articulated that the geocentric model of the solar system is no less correct than the heliocentric one; it's all just a matter of reference frames. I laughed to myself because, despite having a PhD in physics, it had never occurred to me. Perhaps it’s because I first learned about the Copernican Revolution, like most do, as a school-goer being taught a simple matter of fact, and I never questioned it. Geocentrism was obviously wrong.
Because of course it is wrong.
Not technically, no. But it is wrong in a way we all feel in our bones. This intuitive sense that heliocentrism feels ‘more correct’ is what mathematicians and physicists mean when they talk about ‘elegance’.
While we can describe the motions of the heavenly bodies from the point of view of Earth, it is incredibly convoluted compared to the elegant simplicity of the model of planets orbiting the Sun. One reason this model is simpler is related to the second important word from the title of this article: Unification. Planetary orbits in geocentric coordinates are intricate and have no uniform description. In the heliocentric model, they are all lovely ellipses. When Kepler discovered this in the early 1600s, he was able to write down his laws of planetary motion, catapulting our understanding forward.
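To make “uniform description” concrete: in heliocentric coordinates every planetary orbit is the same one-line curve, differing only in two numbers, the semi-major axis a and the eccentricity e:

$$r(\theta) = \frac{a(1 - e^2)}{1 + e\cos\theta}$$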
The history of physics is full of these stories. Something looks complicated; a new view is presented that simplifies the description and often unifies elements of the original theory. The new theory then allows you to see deeper and, in the best cases, make predictions that would have been near impossible to make in the previous theory.
The most famous examples are of course Maxwell’s unification of electricity and magnetism, and Einstein’s unification of space and time. In more recent history we had the Electroweak unification in the Standard Model of particle physics, the almost successful Grand Unified Theories (more on that later), and of course the (candidate) unification of quantum field theory with gravity in string theory.
Each new unification also brought simpler ways to describe the theory, with new notation or in some cases whole new mathematical frameworks. This simplicity of abstraction makes it easier to take further steps, but it is also just… nice. I recall being quite satisfied when, during a course in differential geometry, we compressed Maxwell’s original 20 equations down to 2:
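In the language of differential forms, with F the electromagnetic field-strength 2-form and J the current (source-term conventions vary):

$$dF = 0, \qquad d\star F = \star J$$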
I can’t resist mentioning that there’s a bit of excitement again in fundamental physics that ideas related to the AdS/CFT correspondence could represent the first glimpse of a new unification, leading some to suggest spacetime itself is emergent from something more fundamental that unifies gravity with quantum theory.
Suffice it to say that I was drunk on this stuff back in my physics days. The feeling of the final jigsaw piece slotting into place to reveal a brand new mental model was intoxicating. But, alas, these moments became harder to come by as I caught up on the history and entered research myself. After my PhD, like many of my peers, I headed for the greener pastures of AI.
It was 2015, a few short years post-AlexNet (the years were shorter then), and deep learning was really taking off. I was working in computer vision, and we were getting big gains from reusing ImageNet pre-trained ConvNets on our specialised tasks. It felt quite apparent to me that we were reaping the rewards of a recent unification in the field of computer vision. The field had transformed from a grab bag of SIFT and HOG features, PCA and Hough transforms, and scores of ML algorithms into just ConvNets for everything.
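In today’s terms, that reuse looked something like the following sketch (PyTorch and torchvision assumed; the 10-class head is a stand-in for whichever specialised task you had):

```python
# A sketch of the ImageNet-transfer recipe, in modern PyTorch terms.
# Assumes torchvision is installed; the 10-class head is a stand-in
# for a hypothetical specialised task.
import torch.nn as nn
import torchvision.models as models

# Load a ConvNet with ImageNet-pre-trained weights.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pre-trained feature extractor...
for param in backbone.parameters():
    param.requires_grad = False

# ...and replace the final classifier layer for the new task;
# only this layer is then trained on the specialised dataset.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)
```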
But there was still a lot of ugly out there. Different modalities like text and audio were simply different fields that other people worked in. Even within each field, every new paper had a new specialised model.
As the field grew and the capabilities improved, we started to see more cross-talk and unification attempts, which always caught my attention. We saw the ConvNets from images getting applied to text and the RNNs from text getting applied to images. This has accelerated enormously in recent years, to the point where now almost everything is a token getting fed into a multimodal transformer.
With my aforementioned predilection for elegance and unification, this has been very pleasing to see. It also makes me keep my eye out for remaining ugliness. One such example of ugliness is the tokenisers. The need to compress the vocabulary to a fixed, manageable size requires chopping things up into sometimes unnatural or arbitrary pieces, in a modality- and language-dependent way. This also leads crowds of what I can only hope are bots to spam replies to every new model announcement with “it still can’t count the number of r’s in strawberry”. This needs to go.¹
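To see the chopping in action, here is a minimal sketch using OpenAI’s open-source tiktoken library (my choice for illustration; any BPE tokeniser shows the same effect):

```python
# Minimal illustration of subword tokenisation using the open-source
# tiktoken library (pip install tiktoken); any BPE tokeniser behaves
# similarly.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]

# The model receives only the opaque token ids, never individual
# letters, which is why character-level questions trip it up.
print(ids)
print(pieces)
```

The word arrives as a few opaque chunks rather than ten letters, so counting r’s is genuinely not something the model ever ‘sees’.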
A slightly different ugliness is the prompt sensitivity of LLMs. It may have spawned the world’s worst job title, Prompt Engineer, but it is clearly a bug and not a feature. How to resolve it is less clear. Better models and test-time compute methods like OpenAI’s o-series seem to be chipping away at it, but I have a feeling we will see future innovation that makes a step change here.
So AI practitioners have certainly acquired the taste for elegance and unification that has so enamoured physicists for the last few centuries. You will often hear them quote Rich Sutton’s Bitter Lesson with the same satisfaction as I had with my 2 electromagnetism equations. But for balance, I thought I should finish with a bitter lesson of a different kind.
In 1974, Howard Georgi and Sheldon Glashow presented a beautiful Grand Unified Theory based on SU(5) symmetry. This unified the Electroweak and Strong forces of the Standard Model, which are based on an SU(3) x SU(2) x U(1) symmetry. You don’t need to understand Lie groups to immediately recognise that as the uglier of the two. The theory had many nice properties, including gathering the quarks and leptons of each generation into just two representations. It also made a nicely testable prediction about the half-life of the proton. In the 1980s a series of experiments, starting with Kamiokande in Japan, went looking for proton decay and … didn’t find it, eventually killing the hopes of Georgi and Glashow².
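For a taste of that tidiness (standard group theory, quoted from memory): the whole Standard Model gauge group sits inside SU(5), and the fifteen quark and lepton states of a single generation fill exactly two of its representations:

$$SU(3) \times SU(2) \times U(1) \subset SU(5), \qquad \text{one generation} = \bar{5} \oplus 10$$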
Similarly, in AI, could we be in danger of too quickly following the siren song? Reports of the demise of CNNs have been greatly exaggerated, and perhaps we threw the recurrence baby out with the LSTM bath water?³
Still, deeper we will go, in search of elegance, but let us keep one eye open for what my professor once described as “those damned Japanese protons!”
¹ Remarkably, while drafting this article there seems to have been a breakthrough from Meta on removing the tokenisation step and working with patches of bytes.
² I should note that SU(5) is still very much in play when embedded in more extensive theories like supersymmetry or compactifications of string theory.
³ State space models are a very interesting development that aims to undo this.

