On May 7, 2026, OpenAI released three new real-time voice models that provide reasoning, translation, and live transcription [1].

These tools allow developers to build a new class of voice applications, potentially transforming how businesses and educators interact with users through conversational AI. By integrating high-level intelligence directly into voice streams, the company aims to reduce the friction between human speech and machine understanding.

The new models are available through OpenAI’s cloud API platform [4, 5]. They bring GPT-5-class reasoning to voice interactions, enabling the AI to process complex tasks and reason through problems while a user is speaking [2]. This capability is designed to support real-time conversational tasks across sectors including creator platforms, education, and customer service [1, 4, 6].
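The announcement does not include integration details, but OpenAI’s existing Realtime API is reached over a WebSocket connection, and the minimal sketch below assumes the new models follow that same pattern. The model identifier gpt-5-realtime and the session settings are illustrative assumptions, not published names.

```python
# Hypothetical sketch: opening a Realtime API session over WebSocket.
# The model name "gpt-5-realtime" is an assumption; OpenAI has not
# published identifiers for the models described in this article.
import asyncio
import json
import os

import websockets  # pip install websockets


async def open_session() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-5-realtime"  # assumed name
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older releases of the websockets package call this parameter
    # "extra_headers" instead of "additional_headers".
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Configure the session before streaming any audio.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "You are a helpful real-time voice assistant.",
            },
        }))
        # The server should acknowledge with a session.created/updated event.
        print(json.loads(await ws.recv()))


if __name__ == "__main__":
    asyncio.run(open_session())
```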

Translation is a core feature of the update, with the models supporting 70 languages [2]. Speech can be translated and transcribed nearly instantly as it occurs, removing the need for separate speech-to-text and translation steps. Integrating these features into a single real-time pipeline is intended to make voice-based AI feel more natural and responsive.
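As an illustration of what that single pipeline might look like in practice, the sketch below streams raw audio and reads back a live translation over one connection. The event names mirror OpenAI’s current Realtime API schema and are assumptions for the new models; the target language is arbitrary.

```python
# Hypothetical sketch: streaming audio and reading back a live translation
# in one pipeline. Event names follow OpenAI's current Realtime API and may
# differ for the new models described above.
import base64
import json


async def translate_stream(ws, pcm_chunks) -> None:
    """Send 16-bit PCM chunks and print translated text as it arrives.

    `ws` is an open Realtime API websocket; `pcm_chunks` is any iterable
    of raw audio byte chunks (e.g., read from a microphone).
    """
    # A single instruction replaces separate speech-to-text and
    # translation steps: ask the model to translate, not just transcribe.
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {"instructions": "Translate everything the user says into Spanish."},
    }))

    for chunk in pcm_chunks:
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))

    # Simplified: a production client would send and receive concurrently.
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") == "response.audio_transcript.delta":
            print(event["delta"], end="", flush=True)
```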

Developers can now use these tools to create more sophisticated voice-driven agents. The ability to transcribe and reason simultaneously means applications can react to the nuance of a conversation in real time, rather than waiting for a user to finish a full sentence before processing the request. OpenAI said these models are built specifically for real-time conversations and tasks [5].
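A hypothetical example of reacting mid-utterance: the loop below inspects interim transcript events as they arrive and starts application work before the user has finished speaking. The event names again follow the current Realtime API and are assumptions here, and prefetch_order_lookup is a made-up application hook.

```python
# Hypothetical sketch of an agent loop that reasons while the user is still
# speaking: interim transcripts are inspected as they stream in instead of
# waiting for the utterance to end.
import json


async def agent_loop(ws) -> None:
    partial = []  # words heard so far in the current utterance
    async for raw in ws:
        event = json.loads(raw)
        kind = event.get("type")

        if kind == "conversation.item.input_audio_transcription.delta":
            partial.append(event["delta"])
            # React to nuance mid-sentence, e.g., start fetching data the
            # moment a keyword appears rather than after the turn ends.
            if "order status" in "".join(partial).lower():
                prefetch_order_lookup()  # hypothetical application hook

        elif kind == "input_audio_buffer.speech_stopped":
            # Server-side voice activity detection ended the turn; ask the
            # model to respond with its accumulated reasoning.
            await ws.send(json.dumps({"type": "response.create"}))
            partial.clear()


def prefetch_order_lookup() -> None:
    """Placeholder for application-specific work started mid-utterance."""
```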

The shift toward “GPT-5-class” reasoning in a real-time voice API suggests a move away from the latency associated with traditional voice assistants. By combining transcription, translation, and reasoning into a single stream, OpenAI is positioning its API as the infrastructure for the next generation of autonomous voice agents, moving the technology from simple command-and-response to fluid, multilingual human-computer interaction.