For now, AI systems are slow-witted enough that discerning users can usually tell whether their outputs are potentially harmful. For example, if a CEO were to ask GPT-4, OpenAI’s most powerful AI model released so far, to suggest ways she could maximize her company’s profits, she would likely be able to tell whether GPT-4’s responses would lead to bankruptcy.
But as AI systems become more capable, their outputs may become too complex for humans to evaluate. This is particularly concerning for those who work on alignment—the research field working on making sure AI systems behave as their creators intend—because they fear that AI systems may develop goals that could be catastrophic for humans. If, say, GPT-6 were asked how to maximize a company’s profits, it could likely produce a plan so complex that the CEO couldn’t evaluate it. If she decided to execute the plan, her profits might skyrocket … or GPT-6 might be manipulating her to gain immense power for its own agenda. Before hitting enter, she might not be able to tell which outcome she would get.
Jan Leike is hoping his work might make it easier to tell. Leike, 36, co-leads the Superalignment team at top AI lab OpenAI, which aims to build AI systems today that help ensure powerful future AI systems behave as their creators intend.
Leike has spent more than a decade thinking about alignment, ever since reading works by Ray Kurzweil and Eliezer Yudkowsky. After working closely with pathbreaking alignment researchers—including Marcus Hutter, then a professor at Australian National University and now a senior researcher at DeepMind; Nick Bostrom, author of Superintelligence; and Shane Legg of DeepMind—Leike joined OpenAI in 2021.
Now Leike is part of one of the most ambitious alignment efforts yet. The Superalignment team, announced in July, has given itself four years to “ensure AI systems much smarter than humans follow human intent.” To help achieve this, the team has 20% of the scarce, expensive computational resources available to OpenAI, and the considerable brainpower of Ilya Sutskever, OpenAI’s chief scientist, who co-leads the team with Leike.
Leike argues that such AI assistants could help in a number of ways. For example, they could assist in inspecting the vast mathematical representations that constitute an AI system’s model of the world, which even the scientists who build AI models struggle to interpret. In May, OpenAI took early steps toward this, publishing a paper in which GPT-4 tried to explain the purpose of each of the neurons in GPT-2, a smaller and less capable model. Although the results were mixed, there was some success—GPT-4 found a neuron that seemed to correspond to “references to Canadian people, places, and entities.”
Perhaps because he works on practical fixes for alignment rather than thought experiments, Leike is more optimistic than many who work on preventing AI-related catastrophe—preferring to emphasize how much room there is for human agency.
“So much is still up in the air. Humans have a lot of ownership over what happens, and we should try hard to make it go well,” says Leike. “[Avoiding AI catastrophe] is, in some sense, everyone’s responsibility, because it will probably take a lot of people working together and making the right calls at the right time.”