On Monday, OpenAI debuted GPT-4o (o for “omni”), a major new AI model that can reportedly converse using speech in real time, reading emotional cues and responding to visual input. It operates faster than OpenAI’s previous best model, GPT-4 Turbo. It will be free for ChatGPT users and available as a service through OpenAI’s API, with the rollout taking place over the next few weeks.
OpenAI revealed the new audio conversation and vision comprehension capabilities in a YouTube livestream titled “OpenAI Spring Update.” The event was presented by OpenAI CTO Mira Murati and employees Mark Chen and Barret Zoph, and it included live demos of GPT-4o in action.
OpenAI claims that GPT-4o responds to audio inputs in about 320 milliseconds on average, which, according to a 2009 study, is similar to human response times in conversation. With GPT-4o, OpenAI says it trained a brand-new AI model end-to-end using text, vision, and audio, so that all inputs and outputs “are processed by the same neural network.”