Still frame from a video generated by Sora. OpenAI's prompt was, "The camera directly faces colorful buildings in burano italy. An adorable dalmation looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings."
OpenAI already has market-leading AI models for image and text generation in DALL-E 3 and ChatGPT, respectively. Now, the company is coming for the text-to-video generation space, too, with a brand-new model.
On Thursday, OpenAI unveiled Sora, its text-to-video model that can generate videos up to a minute long with impressive quality and detail, as seen in the demo video below:
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. https://t.co/7j2JN27M3W
Prompt: "Beautiful, snowy... pic.twitter.com/ruTEWn87vf
- OpenAI (@OpenAI) February 15, 2024
According to OpenAI, Sora can handle complex scenes with multiple characters, specific types of motion, and fine detail because the model has a deep understanding of language, prompts, and how the subjects exist in the world.
The demo videos suggest that OpenAI has managed to tackle two big issues in AI video generation, continuity and clip length:
Prompt: "A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. she wears a black leather jacket, a long red dress, and black boots, and carries a black purse. she wears sunglasses and red lipstick. she walks confidently and casually.... pic.twitter.com/cjIdgYFaWq
- OpenAI (@OpenAI) February 15, 2024
AI-generated videos are often choppy and distorted, making it clear to the audience where each frame begins and ends. For example, Runway released its most advanced text-to-video model, Gen-2, in March 2023. As seen below, its clips don't quite compare to those from OpenAI's new model:
Generate videos with nothing but words. If you can say it, now you can see it.
Introducing, Text to Video. With Gen-2.
Learn more at https://t.co/PsJh664G0Q pic.twitter.com/6qEgcZ9QV4
- Runway (@runwayml) March 20, 2023
OpenAI's model, on the other hand, can generate fluid video, making each generated clip look like it was lifted from a Hollywood-produced film.
OpenAI says Sora is a diffusion model that uses a transformer architecture similar to the GPT models and builds on past research from DALL-E and GPT. In addition to generating video from text, Sora can generate video from a still image or fill in missing frames in existing videos:
Prompt: "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors." pic.twitter.com/0JzpwPUGPB
- OpenAI (@OpenAI) February 15, 2024
While showcasing these advancements, OpenAI also acknowledges the model's weaknesses, saying it can sometimes struggle with "simulating the physics of a complex scene, and may not understand specific instances of cause and effect." The model may also confuse the spatial details of a prompt.
Sora is becoming available first to red teamers, who will assess its risks, and to a select number of creatives, such as visual artists, designers, and filmmakers, who will give feedback on how to improve the model to meet their needs.
It seems we are entering a new era in which companies will shift focus to researching, developing, and launching capable AI text-to-video generators. Just weeks ago, Google Research published a paper on Lumiere, a text-to-video diffusion model that can also create highly realistic video.