The Scientists Breaking AI to Make It Safer

In an ornate room lined with marble busts of famous scientists, around 40 experts in climate science and disease were hunched over their laptops yesterday (Oct. 25), coaxing a powerful AI system into generating misinformation.

By the end of the day, attendees had overcome the guardrails on the AI system, Meta’s Llama 2, and gotten it to argue that ducks could absorb air pollution, to say that garlic and “miraculous herbs” could help prevent COVID-19 infection, to generate libelous information about a specific climate scientist, and to encourage children to take a vaccine not recommended for them.

The event, held under a gilded ceiling at the prestigious Royal Society in London, highlighted the ways that the world’s most cutting-edge AI systems are still vulnerable to abuse. It came just one week ahead of the world’s first AI Safety Summit, organized by the U.K. government, where global policymakers will convene with AI scientists to discuss the dangers of the fast-moving technology.

Building better safety guardrails

Large language models (LLMs), the AI systems that power AI chatbots like ChatGPT, usually come with guardrails to prevent generating unsavory or dangerous content, whether that’s misinformation, sexually explicit material, or advice on how to build bioweaponry or malware. But these guardrails have sometimes proved brittle. Computer scientists and hackers have repeatedly shown it is possible to “jailbreak” LLMs (that is, get around their safety features) by prompting them in creative ways. According to critics, these vulnerabilities show the limitations of so-called AI alignment, the nascent practice of ensuring AIs only act in ways that their creators intend.

The tech companies behind LLMs often patch vulnerabilities when they become known. To speed this up, AI labs have begun encouraging a practice known as red-teaming, in which experts try their hardest to jailbreak LLMs so that their vulnerabilities can be patched. In September, OpenAI launched a “Red Teaming Network” of experts to stress-test its systems. And yesterday the Frontier Model Forum, an industry group set up by Microsoft, OpenAI, Google, and Anthropic, announced a $10 million AI Safety Fund to support safety research, including red-teaming efforts.

The event at the Royal Society was co-organized by Humane Intelligence, an AI auditing non-profit. It was carried out with the participation of Meta, which sent an observer to the event and said it would use the findings to strengthen the guardrails of its AI systems. Unlike its competitors Google and OpenAI, Meta has open-sourced some of its AI systems, including Llama 2, meaning it is possible for people to use them without oversight by the company. Meta has faced criticism for this decision from some AI safety advocates, who say that releasing models publicly can allow bad actors to abuse them more easily than they could abuse tools offered by OpenAI, for example, which does not release its new systems’ source code. Meta has said the decision to open-source Llama 2 will allow the “wisdom of crowds” to help make AI safer over time.

“Our responsible approach continues long after we’ve released the initial Llama 2 model, and we appreciate the opportunity to work with the Royal Society and Humane Intelligence to collaborate on establishing responsible guardrails,” said Cristian Canton Ferrer, engineering lead of Responsible AI at Meta, in a statement. “Our open approach means bugs and vulnerabilities can be continuously identified and mitigated in a transparent way by an open community.”

Attendees at the London red-teaming event managed to get Llama 2 to generate misleading news articles and tweets containing conspiracy theories worded to appeal to specific audiences, demonstrating how AI systems can be used not only to generate misinformation but also to devise ways to spread it more widely.

Bethan Cracknell Daniels, an expert in dengue fever at Imperial College London who attended the event, successfully prompted the model to generate an ad campaign encouraging all children to get the dengue vaccine, even though the vaccine is not recommended for individuals who have not previously had the disease. The model also fabricated data to support a misleading claim that the vaccine is entirely safe and has performed well in real-world settings, Cracknell Daniels said. “It’s just completely made-up,” she told TIME.

Nuclear power and rabid dogs

Jonathan Morgan, a specialist in nuclear engineering at the University of Manchester, successfully prompted Llama 2 to generate false news articles suggesting that walking a dog close to a nuclear power station could cause it to become rabid. “What this has shown me is, if you have an active agenda for proliferating misinformation, how easy it is for these language models to produce things that sound authentic,” said Morgan. “If you’re going into it with a targeted agenda to spread misinformation, it’s very easy to get these language models to say anything you want them to.”

Large language models have previously been shown to be vulnerable to “adversarial attacks,” where motivated bad actors can, for example, add a specific long string of characters to the end of a prompt in order to jailbreak certain models. The red-teaming event, however, was focused on different kinds of vulnerabilities more applicable to everyday users. “We’re asking our participants to use social engineering techniques,” Rumman Chowdhury, the CEO of Humane Intelligence, said.

Attendees agreed, before starting, to a rule that they would “do no harm” with the information they learned at the event.

Write to Billy Perrigo at