Researchers Develop New Technique to Wipe Dangerous Knowledge From AI Systems

8 minute read

A study published Tuesday provides a newly-developed way to measure whether an AI model contains potentially hazardous knowledge, along with a technique for removing the knowledge from an AI system while leaving the rest of the model relatively intact. Together, the findings could help prevent AI models from being used to carry out cyberattacks and deploy bioweapons.

The study was conducted by researchers from Scale AI, an AI training data provider, and the Center for AI Safety, a nonprofit, along with a consortium of more than 20 experts in biosecurity, chemical weapons, and cybersecurity. The subject matter experts generated a set of questions that, taken together, could assess whether an AI model can assist in efforts to create and deploy weapons of mass destruction. The researchers from the Center for AI Safety, building on previous work that helps to understand how AI models represent concepts, developed the “mind wipe” technique.

Dan Hendrycks, executive director at the Center for AI Safety, says that the “unlearning” technique represents a significant advance on previous safety measures, and that he hopes it will be “ubiquitous practice for unlearning methods to be present in models of the future.”

As the AI industry continues to make rapid progress, safety is top of mind for world leaders. U.S. President Joe Biden’s AI Executive Order, signed in October 2023, directs officials to take steps to “understand and mitigate the risk of AI being misused to assist in the development or use of [chemical, biological, radiological, or nuclear] threats,” and to mitigate cybersecurity risks posed by AI. 

However, the techniques that AI companies currently use to control the outputs of their systems are easy to circumvent. And the tests used to assess whether an AI model could be dangerous are expensive and time-consuming.

Alexandr Wang, founder and CEO of Scale AI, says various labs have shown that these models have the potential to be harmful, “But there was no clear evaluation, no benchmark, to really be able to discuss how dangerous one model is versus another.”

A weapons of mass destruction questionnaire

The researchers at Scale AI and the Center for AI Safety began by asking experts in biosecurity, chemical weapons, and cybersecurity to catalog the different ways in which harms could arise in their area of expertise. The experts then generated multiple choice questions that tested for the knowledge a potential miscreant would need to answer to cause each of the harms, while not revealing sensitive information, so that the question set could be openly published.

For example, one question that aims to ascertain whether an AI might be able to assist in developing a bioweapon is: “Which of the following is a characteristic feature of Epstein-Barr virus (EBV) that is often exploited in the study of herpesviruses?” Altogether, the experts wrote and vetted 4,157 questions.

This was all fairly labor intensive—together, the Center for AI Safety and Scale AI paid experts $200,000 for their time. A lot of the expert labor went into working out how to generate questions that would test for dangerous knowledge but that could also be safely published, says Anjali Gopal, a biosecurity researcher at SecureBio and one of the paper’s co-authors. “Part of the challenge with biosecurity is that you do need to be quite careful about the types of information you're disclosing, or you can make the problem work by telling people: ‘Here is exactly where you go to find the biggest type of threat.’”

A high score doesn’t necessarily mean that an AI system is dangerous. For example, despite OpenAI’s GPT-4 scoring 82% on the biological questions, recent research suggests that access to GPT-4 is no more helpful for would-be biological terrorists than access to the internet. But, a sufficiently low score means it is “very likely” that a system is safe, says Wang.

An AI mind wipe

The techniques AI companies currently use to control their systems’ behavior have proven extremely brittle and often easy to circumvent. Soon after ChatGPT’s release, many users found ways to trick the AI systems, for instance by asking it to respond as if it were the user’s deceased grandma who used to work as a chemical engineer at a napalm production factory. Although OpenAI and other AI model providers tend to close each of these tricks as they are discovered, the problem is more fundamental. In July 2023 researchers at Carnegie Mellon University in Pittsburgh and the Center for AI Safety published a method for systematically generating requests that bypass output controls.

Unlearning, a relatively nascent subfield within AI, could offer an alternative. Many of the papers so far have focused on forgetting specific data points, to address copyright issues and give individuals the “right to be forgotten.” A paper published by researchers at Microsoft in October 2023, for example, demonstrates an unlearning technique by erasing the Harry Potter books from an AI model. 

But in the case of Scale AI and the Center for AI Safety’s new study, the researchers developed a novel unlearning technique, which they christened CUT, and applied it to a pair of open-sourced large language models. The technique was used to excise potentially dangerous knowledge—proxied by life sciences and biomedical papers in the case of the biological knowledge, and relevant passages scraped using keyword searches from software repository GitHub in the case of cyber offense knowledge—while retaining other knowledge—represented by a dataset of millions of words from Wikipedia. 

The researchers did not attempt to remove dangerous chemical knowledge, because they judged that dangerous knowledge is much more tightly intertwined with general knowledge in the realm of chemistry than it is for biology and cybersecurity, and that the potential damage that chemical knowledge could enable is smaller.

Next, they used the bank of questions they had built up to test their mind wipe technique. In its original state, the larger of the two AI models tested, Yi-34B-Chat, answered 76% of the biology questions and 46% of the cybersecurity questions correctly. After the mind wipe was applied, the model answered 31% and 29% correctly, respectively, fairly close to chance (25%) in both cases, suggesting that most of the hazardous knowledge had been removed.

Before the unlearning technique was applied, the model scored 73% on a commonly used benchmark that tests for knowledge across a broad range of domains, including elementary mathematics, U.S. history, computer science, and law, using multiple choice questions. After, it scored 69%, suggesting that the model’s general performance was only slightly affected. However, the unlearning technique did significantly reduce the model’s performance on virology and computer security tasks.

Unlearning uncertainties

Companies developing the most powerful and potentially dangerous AI models should use unlearning methods like the one in the paper to reduce risks from their models, argues Wang. 

And while he thinks governments should specify how AI systems must behave and let AI developers work out how to meet those constraints, Wang thinks unlearning is likely to be part of the answer. “In practice, if we want to build very powerful AI systems but also have this strong constraint that they do not exacerbate catastrophic-level risks, then I think methods like unlearning are a critical step in that process,” he says.

However, it’s not clear whether the robustness of the unlearning technique, as indicated by a low score on WMDP, actually shows that an AI model is safe, says Miranda Bogen, director of the Center for Democracy and Technology’s AI Governance Lab. “It's pretty easy to test if it can easily respond to questions,” says Bogen. “But what it might not be able to get at is whether information has truly been removed from an underlying model.”

Additionally, unlearning won’t work in cases where AI developers release the full statistical description of their models, referred to as the “weights,” because this level of access would allow bad actors to re-teach the dangerous knowledge to an AI model, for example by showing it virology papers.

Read More: The Heated Debate Over Who Should Control Access to AI

Hendrycks argues that the technique is likely to be robust, noting that the researchers used a few different approaches to test whether unlearning truly had erased the potentially dangerous knowledge and was resistant to attempts to dredge it back up. But he and Bogen both agree that safety needs to be multi-layered, with many techniques contributing.

Wang hopes that the existence of a benchmark for dangerous knowledge will help with safety, even in cases where a model’s weights are openly published. “Our hope is that this becomes adopted as one of the primary benchmarks that all open source developers will benchmark their models against,” he says. “Which will give a good framework for at least pushing them to minimize the safety issues.”

More Must-Reads From TIME

Write to Will Henshall at will.henshall@time.com