OpenAI announced its latest generative AI model, Sora, yesterday (the 15th), which allows users to generate short videos from text input. The model has not yet been released publicly and is currently undergoing safety and functionality testing.
Sora can generate high-quality videos up to one minute long from user-provided text descriptions. It can create complex scenes with multiple characters, specific types of motion, and precise details of subjects and backgrounds. The underlying model has a deep understanding of language, enabling it to interpret user prompts accurately and generate characters that express vivid emotions. It can also create multiple shots within a single short video while keeping characters and visual style consistent across them. OpenAI has already provided Sora to a select group of red teamers, as well as a number of visual designers, photographers, and producers, to gather professional feedback.
Sora is similar to AI models released by Meta and Google, namely Emu Video and VideoPoet, respectively.
From a technical standpoint, Sora is a diffusion model: it starts from video that looks like static noise and gradually transforms it into a coherent clip by removing the noise over many steps. It builds on research behind the DALL-E and GPT models, using DALL-E 3’s recaptioning technique to generate highly descriptive captions for the visual training data, which is why it can follow users’ text instructions so faithfully. In addition to text prompts, the model can animate a static image into a video, and it can generate new videos from scratch, extend existing videos, or fill in missing frames.
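To make the diffusion idea concrete, the sketch below shows the core generation loop in miniature: start from pure noise and repeatedly subtract a predicted portion of that noise. Everything here is a toy stand-in for illustration only; `toy_denoiser`, the tiny video dimensions, and the step count are assumptions, and a real model like Sora uses a large trained neural network conditioned on the text prompt in place of the stand-in.

```python
import numpy as np

FRAMES, HEIGHT, WIDTH = 16, 64, 64  # tiny toy "video" for illustration
STEPS = 50                          # number of denoising steps

def toy_denoiser(video):
    """Hypothetical noise predictor: estimates the noise in `video`.

    In a real diffusion model this is a trained neural network,
    conditioned on the user's text prompt. Here the "noise" is simply
    each pixel's deviation from the mean value.
    """
    return video - video.mean()

rng = np.random.default_rng(0)
video = rng.standard_normal((FRAMES, HEIGHT, WIDTH))  # start as pure static noise

for _ in range(STEPS):
    predicted_noise = toy_denoiser(video)
    video = video - (1.0 / STEPS) * predicted_noise  # peel away a bit of noise each step

print(video.shape)  # (16, 64, 64): the denoised toy clip
```

The point of the loop, rather than the arithmetic of the stand-in, is what matters: generation proceeds by iterative refinement from noise, not by drawing frames one after another.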
OpenAI states that Sora serves as the foundation for a model capable of understanding and simulating the real world. The company believes it represents a significant milestone in achieving Artificial General Intelligence (AGI).
However, Sora still has clear weaknesses, particularly in accurately simulating the physics of complex scenes and understanding specific instances of cause and effect. For instance, a character in a generated video may take a bite of a cookie, yet the cookie shows no bite mark afterward. The model may also confuse spatial details in a prompt, such as mixing up left and right, or struggle to follow precise descriptions of events that unfold over time, such as a specific camera trajectory.
Before making Sora available to the public, OpenAI is stepping up its safety testing. The company is conducting red-team exercises to probe the model for misinformation, hateful content, and bias. The development team is also building tools to detect misleading content, such as detection classifiers that can be applied during the video generation process.
To help identify AI-generated content, OpenAI plans to embed C2PA metadata in videos generated by Sora once the model is deployed in OpenAI products. C2PA (the Coalition for Content Provenance and Authenticity) is a content provenance standard developed by an industry coalition, and OpenAI already includes C2PA metadata in images generated by DALL-E 3.
Furthermore, OpenAI will apply its existing safety techniques to Sora. For example, a text classifier will reject prompts that violate usage policies, and an image classifier will review generated video frames to ensure they comply with those policies. The company also commits to collaborating with lawmakers, educators, and artists to address concerns related to AI.
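The layered checks described above can be pictured as a simple pipeline: screen the text prompt before generation, then screen frames of the finished video before returning it. The sketch below is purely illustrative and is not OpenAI's implementation; `text_policy_classifier`, `frame_policy_classifier`, the toy blocklist, and the dummy generator are all hypothetical stand-ins.

```python
# Toy stand-in for a usage-policy list; a real system uses trained classifiers,
# not keyword matching.
BLOCKED_TERMS = {"extreme violence", "celebrity likeness"}

def text_policy_classifier(prompt: str) -> bool:
    """Return True if the prompt appears to violate usage policies."""
    return any(term in prompt.lower() for term in BLOCKED_TERMS)

def frame_policy_classifier(frame) -> bool:
    """Return True if a rendered frame appears to violate policies.

    A real system would run an image-understanding model here; this toy
    version accepts every frame.
    """
    return False

def generate_video_safely(prompt: str, generate_fn):
    """Run the prompt check, generate the video, then check every frame."""
    if text_policy_classifier(prompt):
        raise ValueError("Prompt rejected: violates usage policies")
    video_frames = generate_fn(prompt)
    if any(frame_policy_classifier(f) for f in video_frames):
        raise ValueError("Output withheld: a frame failed policy review")
    return video_frames

# Example usage with a dummy generator that returns placeholder frames.
frames = generate_video_safely("a calm beach at sunrise", lambda p: ["frame0", "frame1"])
print(len(frames), "frames passed review")
```

The design intent is that no single check is relied on alone: a prompt that slips past the text filter can still be caught when the rendered frames are reviewed.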