OpenAI has enhanced its voice AI capabilities with the upgraded Realtime API and the introduction of gpt-realtime. The Realtime API is now generally available, featuring support for remote Model Context Protocol (MCP) servers, image inputs, and Session Initiation Protocol (SIP) for phone calls. These updates enable developers and enterprises to create more reliable and versatile voice agents.
The new gpt-realtime model is better at following complex instructions, calling tools precisely, and producing natural-sounding speech. It interprets system messages and developer prompts more faithfully, switches between languages mid-conversation, and detects alphanumeric sequences accurately. Image input support lets the model ground a conversation in visual context, so it can answer questions about a displayed image or screenshot. The Realtime API processes audio directly, which reduces latency and preserves the nuances of speech.
gpt-realtime is priced at $32 per 1 million audio input tokens and $64 per 1 million audio output tokens, 20% less than its predecessor. New features also give developers fine-grained control over conversation context: they can set token limits and truncate multiple turns at a time, which reduces cost in extended sessions.
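To make the pricing concrete, here is a minimal sketch of the arithmetic, using only the two audio rates quoted above. The function name and the token counts in the example are hypothetical, and the sketch deliberately ignores text tokens, image tokens, and any discounted rates, none of which are covered here.

```python
# Audio rates quoted in the article, in US dollars per 1 million tokens.
AUDIO_INPUT_USD_PER_MILLION = 32.0
AUDIO_OUTPUT_USD_PER_MILLION = 64.0


def estimate_audio_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the audio-only cost of a session in US dollars.

    Hypothetical helper for illustration; it is not part of any OpenAI SDK.
    """
    total = (
        input_tokens * AUDIO_INPUT_USD_PER_MILLION
        + output_tokens * AUDIO_OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Assumed usage figures, chosen only to illustrate the calculation.
    cost = estimate_audio_cost(input_tokens=250_000, output_tokens=100_000)
    print(f"Estimated audio cost: ${cost:.2f}")
```

Because output tokens cost twice as much as input tokens, a session in which the agent speaks at length costs more than one in which it mostly listens, which is why the context and token-limit controls matter for long conversations.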