Until very recently, it was a truism that nobody really understood the inner workings of neural networks. Sure, scientists could tell you that the algorithms learn to spot patterns in massive amounts of data. But look inside a neural network trained to recognize cats, say, and all you’d see would be lots of very complicated numbers being multiplied together. There is no section of code (at least, not in any language humans can understand) for distinguishing a whisker from a tail. And yet somehow, with enough data and computing power, AI can do just that.
That is miraculous, but it also poses challenges for safety. If you don’t know how a very powerful new system works, how can you be sure that it’s not dangerous?
Enter Chris Olah. The Anthropic co-founder is one of the pioneers of an entirely new scientific field, mechanistic interpretability, designed to peer into these seemingly impenetrable algorithms in order to make them safer. First at Google, then at OpenAI, and now at Anthropic, Olah has worked to identify the internal structures of neural networks, map out which artificial “neurons” are working together for which goals, and piece together their influence on an AI’s final outputs.
In May, Olah’s team at Anthropic made a breakthrough by applying these strategies to one of its most cutting-edge large language models. They found that they could identify groups of neurons within the model that correspond to different concepts and activities, like identifying bias or recognizing scam emails. Toggling those neuron groups on or off could alter the model’s behavior, potentially giving AI researchers a new tool to make AI less dangerous.
An empirical scientist at heart, Olah stresses there is still much work to be done. But he is optimistic about the fruits of his research so far. “If we could really understand these systems, and this would require a lot of progress, we might be able to go and say when these models are actually safe,” he says. “Or whether they just appear safe.”
*Disclosure: Investors in Anthropic include Salesforce, where TIME co-chair and owner Marc Benioff is CEO.
Write to Billy Perrigo/San Francisco at billy.perrigo@time.com