Scientists make sense of shapes in the minds of the models


Since at least 2021, according to the authors of a preprint posted in March, researchers have been noticing something interesting inside their models.

A model, an AI program built on a neural network architecture, processes a word by learning to represent it as a vector: an arrow within a high-dimensional space. The directions of these vectors, each word ending up at a single point, become the model's carriers of information. 

While these spaces are already strange in their vastness, often consisting of thousands of dimensions, researchers were noticing something even more peculiar: sometimes, inputs would form distinctively shaped clouds of points, looking, for example, like 'Swiss rolls,' or cylinders, after being projected back down to just three dimensions using standard methods. Over the next few years, they started to see other cloudy shapes, too: curves, loops, circles; helixes, tori; even trees and fractal geometries. 

That models might learn to organize information in shapes did not necessarily surprise people. It was natural to think that a model might learn that certain categories of inputs could all be clumped together, like inputs describing calendar dates, or colors, or arithmetical operations.

But in 2023, when others discovered a new method for understanding the insides of their models, called sparse autoencoders (SAEs), the observations began to seem a little odder. This method, which quickly gained traction, suggested a very different picture: that the most important concepts a model learned, like love, or logic, or the identities of different people, were highly fragmented, each one tearing off in its own direction. But why, then, were certain inputs found close together?

Almost as soon as this hint of contradiction surfaced, it was quelled by other findings. Both the March study and an October study, by researchers at the company Anthropic, have shown that models learn shapes in ways that complement the tendencies suggested by other methods. As a consequence, we are increasingly making sense of why models learn to make shapes in the high-dimensional minds they live in.

"There's a lot of confusion, but it also feels like there's been a lot of progress," said Eric Michaud of the Massachusetts Institute of Technology (MIT), who spoke to Foom in an interview. "I don't know where it's all going to go. But overall, it feels healthy."

Signs of shapes

Well before the recent wave of results, there were observations that models were learning intriguing geometries. Perhaps the best-known example came from 2013, when researchers at Google created a new kind of neural network model, designed to be trained on a task similar to filling in the blank; the same task learned by countless schoolchildren. 

To perform this task, the network was designed to take a small handful of words (with one blanked out) and learn to represent the missing word as a vector in a high-dimensional space. As the model was trained on more and more handfuls, it would learn more and more refined ways of representing each missing word. At the end of this process, it would have learned a vector for each word in its full vocabulary. 

By creating a new architecture that was more efficient than older ones, the Google researchers were able to train their model more extensively. And this, they discovered, led the model to learn geometric relationships that had never been seen before. 

For example, if you added the directions the model had learned for the words 'king' and 'woman,' and then subtracted the direction representing 'man,' you ended up with a direction that pointed roughly where the model had learned to place 'queen.' In other words, the model had learned four-word analogies, just like grade-schoolers, but had learned to store them in the form of parallelograms, existing in high dimensions. This was striking. 

"People noticed that certain arithmetical operations—in the representation space—resulted in semantically meaningful outputs," said Alexander Modell of Imperial College London, a co-author of the March preprint. 
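The arithmetic in question can be sketched in a few lines. The vectors below are invented toy values, not real learned embeddings, and the four-dimensional space is far smaller than the real thing; the point is only that adding and subtracting directions, then finding the nearest stored vector, can recover an analogy:

```python
import numpy as np

# Toy 4-dimensional "embeddings," invented for illustration; real models
# learn these values (in hundreds of dimensions) during training.
emb = {
    "king":  np.array([0.9,  0.8, 0.1, 0.0]),
    "queen": np.array([0.9, -0.8, 0.1, 0.0]),
    "man":   np.array([0.1,  0.8, 0.0, 0.2]),
    "woman": np.array([0.1, -0.8, 0.0, 0.2]),
}

def cosine(a, b):
    # Similarity of two directions, ignoring their lengths
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land nearest the stored vector for queen
target = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)  # queen
```

The parallelogram picture is the same fact stated geometrically: the arrow from 'man' to 'king' is roughly parallel to the arrow from 'woman' to 'queen.'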

Within just a few more years, scientists had figured out how to make far more capable models. By 2021, companies were routinely building models that learned not just a single vector representing a single word, but many vectors simultaneously, each representing a word in a long passage and the way that word appeared in context with the others. 

As before, they would train such a network on a task similar to filling in the blank, like autocompletion. But now, they were training these models to process vast banks of vectors, existing in vast spaces, through many consecutive 'layers' of mutual and self-interactions. 

It was intuitive to think that training these more advanced models would push them to learn even more sophisticated geometric patterns. And then came the Swiss rolls, and the observations of even more interesting structures. 

There was already a basic intuition for why a model might learn to make shapes, similar to the way a librarian puts different categories of writing, like fiction and non-fiction, into separate stacks in order to make the information easier to retrieve. However, this explanation was only a vague one. 

Breaking space

By the time researchers started spotting these strange shapes, they were facing an even more fundamental question: How could a model get so good at processing information?

A model like GPT-3, released in May 2020 by the company OpenAI, had apparently learned millions or billions of different concepts, like many of the most important names and events listed in Wikipedia. However, the internal space where it stored information consisted of only around 12,000 dimensions, seemingly small by comparison. 

The only way to explain how models succeeded at learning so many concepts was that they must be storing many different concepts within each dimension. That would allow them to pack many distinctive concepts into a relatively small space, given the relentless demands of the training process. Researchers called this principle of packing concepts 'superposition.' 

However, this principle was not by itself sufficient to explain how models learned things. Suppose, for example, that a model possessed a space of just two dimensions, like a plane, and that each concept was represented by a vector, an arrow pointing in a single direction. It is easy to see that many more than two such arrows can be packed into this space without any two pointing the same way, demonstrating the idea of superposition. But if any two of those arrows become too close to parallel, they start to seem indistinguishable. Therefore, researchers reasoned, concepts must be packed in, but not too tightly; they must be nearly orthogonal to each other. 

That was the intuition when considering models with just two dimensions. But there are surprises in higher dimensions, as mathematicians know well. For example, in high-dimensional spaces, the number of nearly orthogonal directions, like almost up/down, and almost left/right, is exponentially greater than the number of dimensions itself. So it is easy to pack in many concepts that are almost orthogonal in direction. 
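A quick numerical sketch makes the contrast vivid. The numbers of vectors and dimensions below are arbitrary choices for illustration; the experiment just draws random directions and measures how close the worst pair comes to being parallel:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(n_vectors, dim):
    """Largest |cosine similarity| over all pairs of random unit vectors."""
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # normalize to unit length
    cos = v @ v.T                                   # all pairwise cosines
    np.fill_diagonal(cos, 0.0)                      # ignore self-similarity
    return np.abs(cos).max()

# In 2 dimensions, 50 directions are forced to crowd together:
print(max_abs_cosine(50, 2))       # near 1: some pair is almost parallel
# In 10,000 dimensions, the same 50 directions are all nearly orthogonal:
print(max_abs_cosine(50, 10_000))  # small: every pair is close to 90 degrees
```

In fact, random directions in high dimensions are nearly orthogonal almost automatically, which is one reason so many distinguishable concepts fit.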

But high-dimensional spaces also come with some limitations. If you wish a vector's direction to be easily decoded from such a vast space, then the vector must not point along very many of these different cardinal directions at the same time. In other words, researchers say, such a vector must be 'sparse': mainly aligned with just a few of the defining directions of the space it lives in. 

In 2023, two separate groups of researchers provided significant evidence that not only were these assumptions supported, you could put them to outstanding use. Namely, for any given input to the model, consisting of a passage of, say, one thousand words, you could determine a list of the key concepts the model would tend to use as it processed it.

The method they figured out for building such lists involved training a second neural network, built in two parts, on the first one. You trained the first part to de-compress a vector into a much larger space, like taking a bendy curve drawn on a plane and re-drawing it in a volume. Further, you required the de-compressed vector to be sparse, similar to approximating the bendy curve with a simpler, straighter line. You trained the second part of the network to re-compress this de-compressed vector back into the initial space while minimizing the information lost in transmission. 

When you did this, the second part of the network, tasked with re-compression, learned exactly the key directions present in the initial input passage. For example, for an input like 'The Golden Gate Bridge is colored red,' you might be able to learn the exact two directions, in the inner space of the model, representing the concept of the bridge and the concept of color. 

This method was called a sparse autoencoder (SAE) because it bore much in common with standard algorithms used for compressing (or encoding) information. It was perhaps suggestive of a clever technique you might employ if you ever get lost in a forest: try to get even more lost, a few times, while moving out along simple paths, and then find your way back again. By doing so enough times, you can figure out a list of meaningful directions, even if you've only come back to where you started. 
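The two-part structure described above can be sketched as a forward pass. This is a minimal, untrained sketch, not any particular paper's implementation: the sizes (64 dimensions de-compressed into 512) are assumptions, and the training loop that would actually shape the weights is only indicated in a comment:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae = 64, 512  # model space and (larger) de-compressed space; assumed sizes
W_enc = rng.standard_normal((d_model, d_sae)) * 0.1   # part one: de-compress
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_sae, d_model)) * 0.1   # part two: re-compress

def sae_forward(x):
    # Part one: de-compress into the larger space. The ReLU, combined with
    # an L1 penalty during training, keeps the activations sparse.
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Part two: re-compress back into the model's space. After training,
    # the rows of W_dec are the candidate 'concept' directions.
    x_hat = f @ W_dec
    return f, x_hat

# Training (not shown) would minimize ||x - x_hat||^2 + lam * ||f||_1,
# i.e. reconstruction error plus a sparsity penalty.
x = rng.standard_normal(d_model)
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)  # (512,) (64,)
```

The list of key concepts for a given input is then read off from which of the 512 sparse activations in `f` fire.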

"When sparse autoencoders were discovered, there was a lot of excitement," said Jacob Hilton of the Alignment Research Center. "Some people thought this was going to be the key to unlocking a huge amount of additional progress."

The need for clouds

The success of the SAE method provided new hope that language models, despite being very complex, could still be interpreted. It suggested that even though they converted words into vectors in arcane, high-dimensional spaces, each of those vectors might ultimately be broken up into a set of key, human-interpretable directions. Further, each of those directions would tend to be nearly orthogonal to the others, highly distinctive. 

But then, why were researchers seeing some inputs form shapes, rather than breaking up like the cardinal directions of a compass? 

The authors of the March preprint were motivated by a related question. As mathematicians, they wanted to know what AI researchers were actually seeing when they saw inputs form cloudy shapes, and how these shapes fit with all the other assumptions. 

They found that a few hypotheses were enough to explain the situation. In effect, all you had to do to explain the observations of shapes, while staying consistent with the other ideas, was to assume that certain concepts, like color, could themselves be seen as a kind of mathematical space. 

Consider, for example, the familiar color wheel: a literal circular arrangement of colors that changes from one hue to another as you move around the circumference. It is natural, the researchers argued, to hypothesize that simple text inputs involving colors all represent, in some sense, samples from an abstract mathematical space that is circular. 

And indeed, the researchers observed that when you gave a model many inputs, each describing a different color, the corresponding vectors traced out a cloud of points shaped like a wobbly circle (when projected down into three dimensions). 

After putting these hypotheses into mathematical language, they could prove that what you were seeing in the high-dimensional space was not just a cloud of points that happened to be vaguely circular. Rather, the circularity of that shape, and the distances from one point to another along it, could be seen as a direct consequence of the fact that the concept itself, like the color wheel, was topologically and geometrically circular. 
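The 'wobbly circle' observation can be reproduced with synthetic data. The setup below is a stand-in, not the study's actual experiment: 24 invented 'hues' are placed on a circle, embedded into a 768-dimensional space (both numbers are arbitrary), perturbed with noise, and then projected back down with a standard method (principal components via the SVD). The circle survives the round trip:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for color embeddings: 24 'hues' on a circle,
# embedded into a 768-dimensional space. Sizes are illustrative only.
n_hues, dim = 24, 768
angles = np.linspace(0, 2 * np.pi, n_hues, endpoint=False)
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (24, 2)

# Embed the circle in a random 2D plane of the big space, plus noise.
basis = np.linalg.qr(rng.standard_normal((dim, 2)))[0]        # orthonormal plane
points = circle @ basis.T + 0.05 * rng.standard_normal((n_hues, dim))

# Project back down using the top two principal components.
centered = points - points.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ vt[:2].T                                    # (24, 2)

# The projected points sit at roughly constant radius: a wobbly circle.
radii = np.linalg.norm(proj, axis=1)
print(radii.std() / radii.mean())  # small relative spread
```

The March result goes further than this picture: it ties the recovered circle's geometry directly to the circularity of the underlying concept, rather than treating it as a visual coincidence.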

So, even if a decoding method like an SAE might factor a model into 'atomic' concepts pointing in different directions, there are also atomic concepts that themselves factor into shapes, not vectors. This had already been suggested by an earlier study, co-authored by Michaud, whom I quoted earlier; but now, the origin of the requirement for shapes had been pinned down. 

Models made shapes, the researchers found, because some concepts intrinsically demand more space to represent them. Moreover, they demand manifolds: the generalized notion of shape, commonly studied in the mathematics of higher dimensions.

"What does distance mean in the representation space?" asked Patrick Rubin-Delanchy of the University of Edinburgh, another co-author of the March study. "Our preliminary answer is actually—for raw distance—not much. But manifold distance does." 

Between lines and shapes

The trend since March has been for researchers to continue to dispel any sense of dilemma.

In a comprehensive study from October, researchers from the company Anthropic chose to bring both the decoding perspective and the manifold perspective to bear on one problem. They sought to understand how models learn to generate line breaks correctly, a capability observed to be learned even by relatively simple models.

Using several methods, the researchers showed that models learned to solve this problem using both a dictionary of cardinal concepts and concepts that took the form of shapes. In particular, they found the model learned shapes of a form that would assist in its computations, and they observed remarkable indications that the computations themselves involved manipulating these shapes through processes like rotation and twisting. 

The greater contrast between understanding models in terms of directions versus shapes, they concluded, lies in automation: as of yet, there is no SAE-like method that automatically builds lists of the important shapes a model learns to think or compute in.

If we are to find interpretability tools that allow us to build safe AI, even as corporations (including Anthropic) relentlessly make AI more capable, we will need ways to automate all such interpretability mechanisms, even to let AI apply them. (Although even automated methods like SAEs have recently come under question.)

Amidst this uncertain picture, at least we can now say that the puzzles of the shapes, seen in the minds of the models, do not present a paradox for the program.

"A lot of the story of this last year is how our tools—and assumptions about these tools—interact with the messy reality in these models," said Michaud. "And the answer is a little unclear, but I think it's been productive." 


Author's note: No AI was used in the writing or editing of this post. AI was used lightly (<1 hour) for background research.
