OpenAI has enhanced its voice AI capabilities with the upgraded Realtime API and the introduction of gpt-realtime. The Realtime API is now generally available, featuring support for remote Model Context Protocol (MCP) servers, image inputs, and Session Initiation Protocol (SIP) for phone calls. These updates enable developers and enterprises to create more reliable and versatile voice agents.
The new gpt-realtime model is better at following complex instructions, calling tools precisely, and producing natural-sounding speech. It interprets system messages and developer prompts more reliably, can switch languages mid-conversation, and detects alphanumeric sequences more accurately. Image input support lets the model ground a conversation in visual context, so it can answer questions about an image or screenshot the user shares. The Realtime API processes audio directly rather than chaining separate speech-to-text and text-to-speech steps, which reduces latency and preserves nuances of speech.
The gpt-realtime model is priced at $32 per 1 million audio input tokens and $64 per 1 million audio output tokens, 20% less than its predecessor. New features also give developers fine-grained control over conversation context, letting them set token limits and reduce the cost of extended sessions.
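To make the pricing concrete, here is a minimal sketch of a cost estimate based only on the two audio rates quoted above. The function name `estimate_audio_cost` and the token counts in the example are hypothetical, chosen for illustration. The sketch ignores any other charges (such as text or image tokens) that a real session might incur.

```python
from decimal import Decimal

# Prices quoted in the article, in USD per 1 million audio tokens.
AUDIO_INPUT_USD_PER_MILLION = Decimal("32")
AUDIO_OUTPUT_USD_PER_MILLION = Decimal("64")
TOKENS_PER_MILLION = Decimal("1000000")


def estimate_audio_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost for the given audio token counts.

    Hypothetical helper: it covers audio tokens only, at the quoted rates.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = AUDIO_INPUT_USD_PER_MILLION * input_tokens / TOKENS_PER_MILLION
    output_cost = AUDIO_OUTPUT_USD_PER_MILLION * output_tokens / TOKENS_PER_MILLION
    return input_cost + output_cost


# Hypothetical session: 50,000 audio input tokens and 20,000 audio output tokens.
session_cost = estimate_audio_cost(50_000, 20_000)
print(f"Estimated audio cost: ${session_cost}")
```

Because output tokens cost twice as much as input tokens at these rates, a session in which the agent speaks at length costs more than one in which it mostly listens. That is one reason limits on conversation context can matter for long calls.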