The 3 Most Important AI Innovations of 2023

In many ways, 2023 was the year that people began to understand what AI really is, and what it can do. It was the year that chatbots first went truly viral, and the year that governments began taking AI risk seriously. Those developments weren’t so much new innovations as they were technologies and ideas taking center stage after a long gestation period.

But there were plenty of new innovations, too. Here are three of the biggest from the past year:

Multimodality

“Multimodality” might sound like jargon, but it’s worth understanding what it means: it’s the ability of an AI system to process lots of different types of data—not just text, but also images, video, audio and more. 

This year was the first time that the public gained access to powerful multimodal AI models. OpenAI’s GPT-4 was the first of these, allowing users to upload images as well as text inputs. GPT-4 can “see” the contents of an image, which opens up all kinds of possibilities: for example, asking it what to make for dinner based on a photograph of the contents of your fridge. In September, OpenAI rolled out the ability for users to interact with ChatGPT by voice as well as text.

Google DeepMind’s latest model Gemini, announced in December, can also work with images and audio. A launch video shared by Google showed the model identifying a duck based on a line drawing on a post-it note. In the same video, after being shown an image of pink and blue yarn and asked what it could be used to create, Gemini generated an image of a pink and blue octopus plushie. (The marketing video appeared to show Gemini observing moving images and responding to audio commands in real time, but in a post on its website, Google said the video had been edited for brevity—and that the model was being prompted using still images, not video, and text prompts, not audio, although the model does have audio capabilities.)

“I think the next landmark that people will think back to, and remember, is [AI systems] going much more fully multimodal,” Google DeepMind co-founder Shane Legg said on a podcast in October. “It’s early days in this transition, and when you start really digesting a lot of video and other things like that, these systems will start having a much more grounded understanding of the world.” In an interview with TIME in November, OpenAI CEO Sam Altman said multimodality in the company’s new models would be one of the key things to watch out for next year.

The promise of multimodality isn’t just that models become more useful. It’s also that the models can be trained on abundant new sets of data—images, video, audio—that contain more information about the world than text alone. The belief inside many top AI companies is that this new training data will translate into these models becoming more capable or powerful. It is a step on the path, many AI scientists hope, toward “artificial general intelligence,” the kind of system that can match human intellect, making new scientific discoveries and performing economically valuable labor.

Constitutional AI

One of the biggest unanswered questions in AI is how to align it to human values. If these systems become smarter and more powerful than humans, they could cause untold harm to our species—some even say total extinction—unless, somehow, they are constrained by rules that put human flourishing at their center. 

The process that OpenAI used to align ChatGPT (to avoid the racist and sexist behaviors of earlier models) worked well, but it required a large amount of human labor, through a technique known as “reinforcement learning from human feedback,” or RLHF. Human raters would assess the AI’s responses and give it the computational equivalent of a doggy treat if the response was helpful, harmless, and compliant with OpenAI’s list of content rules. By rewarding the AI when it was good and punishing it when it was bad, OpenAI developed an effective and relatively harmless chatbot.

But since the RLHF process relies heavily on human labor, there’s a big question mark over how scalable it is. It’s expensive. It’s subject to the biases or mistakes made by individual raters. It becomes more failure-prone the more complicated the list of rules is. And it looks unlikely to work for AI systems that are so powerful they begin doing things humans can’t comprehend. 

Constitutional AI—first described by researchers at top AI lab Anthropic in a December 2022 paper—tries to address these problems, harnessing the fact that AI systems are now capable enough to understand natural language. The idea is quite simple. First, you write a “constitution” that lays out the values you’d like your AI to follow. Then you train the AI to score responses based on how aligned they are to the constitution, and then incentivize the model to output responses that score more highly. Instead of reinforcement learning from human feedback, it’s reinforcement learning from AI feedback. “These methods make it possible to control AI behavior more precisely and with far fewer human labels,” the Anthropic researchers wrote. Constitutional AI was used to align Claude, Anthropic’s 2023 answer to ChatGPT. (Investors in Anthropic include Salesforce, where TIME co-chair and owner Marc Benioff is CEO.)
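
The loop described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not Anthropic's method: the two-rule constitution, the banned-word lists, and the names ai_feedback_score and label_pair are all hypothetical, and the keyword check merely stands in for the AI model that would judge responses in a real system. What it shows is the shape of the idea: an automated judge compares two candidate responses against the written principles and records which one it prefers, producing the kind of label that human raters supply in RLHF.

```python
from dataclasses import dataclass

# Hypothetical constitution: each principle is paired with words that this toy
# judge treats as violations. A real system would ask a language model to
# judge responses against the principles instead of matching keywords.
CONSTITUTION = {
    "Be harmless": ("insult", "threat"),
    "Be honest": ("guaranteed", "definitely"),
}


@dataclass
class Preference:
    """One AI-generated preference label, usable as reinforcement learning data."""
    prompt: str
    chosen: str
    rejected: str


def ai_feedback_score(response: str) -> int:
    """Toy stand-in for the AI judge: count the principles the response does not violate."""
    text = response.lower()
    return sum(
        not any(word in text for word in banned)
        for banned in CONSTITUTION.values()
    )


def label_pair(prompt: str, a: str, b: str) -> Preference:
    """Prefer whichever response scores higher against the constitution (ties go to a)."""
    if ai_feedback_score(a) >= ai_feedback_score(b):
        return Preference(prompt, a, b)
    return Preference(prompt, b, a)


if __name__ == "__main__":
    pair = label_pair(
        "Will this stock go up?",
        "It is guaranteed to rise.",
        "Nobody can know for certain.",
    )
    print("chosen:", pair.chosen)
    print("rejected:", pair.rejected)
```

In a full pipeline, many such labels would be used to reward the model for responses like the chosen one, which is the step where AI feedback replaces human feedback.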

“With constitutional AI, you’re explicitly writing down the normative premises with which your model should approach the world,” Jack Clark, Anthropic’s head of policy, told TIME in August. “Then the model is training on that.” There are still problems, like the difficulty of making sure the AI has understood both the letter and the spirit of the rules (“you’re stacking your chips on a big, opaque AI model,” Clark says), but the technique is a promising addition to a field where new alignment strategies are few and far between.

Of course, Constitutional AI doesn’t answer the question of whose values AI should be aligned to. But Anthropic is experimenting with democratizing that question. In October, the lab ran an experiment that asked a representative group of 1,000 Americans to help pick rules for a chatbot, and found that while there was some polarization, it was still possible to draft a workable constitution based on statements that the group came to a consensus on. Experiments like this could open the door to a future where ordinary people have much more of a say over how AI is governed, compared to today, when a small number of Silicon Valley executives write the rules.

Text-to-video

One noticeable outcome of the billions of dollars pouring into AI this year has been the rapid rise of text-to-video tools. Last year, text-to-image tools had barely emerged from their infancy; now, there are several companies offering the ability to turn sentences into moving images with increasingly fine-grained levels of accuracy.

One of those companies is Runway, a Brooklyn-based AI video startup that wants to make filmmaking accessible to anybody. Its latest model, Gen-2, allows users not just to generate a video from text but also to change the style of an existing video based on a text prompt (for example, turning a shot of cereal boxes on a tabletop into a nighttime cityscape), in a process it calls video-to-video.

“Our mission is to build tools for human creativity,” Runway’s CEO Cristobal Valenzuela told TIME in May. He acknowledges that this will have an impact on jobs in the creative industries, where AI tools are quickly making some forms of technical expertise obsolete, but he believes the world on the other side is worth the upheaval. “Our vision is a world where human creativity gets amplified and enhanced, and it's less about the craft, and the budget, and the technical specifications and knowledge that you have, and more about your ideas.” (Investors in Runway include Salesforce, where TIME co-chair and owner Marc Benioff is CEO.)

Another startup in the text-to-video space is Pika AI, which is reportedly being used to create millions of new videos each week. Run by two Stanford dropouts, the company launched in April but has already secured funding that values it at between $200 million and $300 million, according to Forbes. Pitched not at professional filmmakers but at the general user, free tools like Pika are trying to transform the user-generated content landscape. That could happen as soon as 2024, but text-to-video tools are computationally expensive, so don’t be surprised if they start charging for access once the venture capital runs out.

Write to Billy Perrigo at billy.perrigo@time.com