OpenAI has recently unveiled a set of next-generation audio models, enhancing the capabilities of voice-enabled applications. These advancements include new speech-to-text (STT) and text-to-speech (TTS) models, offering developers more tools to create sophisticated voice agents. These advanced voice models, launched on the API, enable developers worldwide to build versatile and reliable voice agents much more easily. In this article, we will explore the features and applications of OpenAI's latest GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS models. We'll also learn how to access OpenAI's audio models and try them out ourselves. So let's get started!
OpenAI's New Audio Models
OpenAI has launched a new generation of audio models designed to enhance speech recognition and voice synthesis capabilities. These models offer improvements in accuracy, speed, and flexibility, enabling developers to build more powerful AI-driven voice applications. The suite consists of two speech-to-text models and one text-to-speech model:
- GPT-4o-Transcribe: OpenAI's most advanced speech-to-text model, offering industry-leading transcription accuracy. It is designed for applications that require precise and reliable transcriptions, such as meeting and lecture transcriptions, customer service call logs, and content subtitling.
- GPT-4o-Mini-Transcribe: A smaller, lightweight, and more efficient version of the above transcription model. It is optimized for lower-latency applications such as live captions, voice commands, and interactive AI agents. It provides faster transcription speeds, lower computational costs, and a balance between accuracy and efficiency.
- GPT-4o-mini TTS: This model introduces the ability to instruct the AI to speak in specific styles or tones, making AI-generated voices sound more human-like. Developers can now tailor the agent's voice tone to match different contexts like friendly, professional, or dramatic. It works well with OpenAI's speech-to-text models, enabling smooth voice interactions.
The speech-to-text models come with advanced technologies such as noise cancellation. They are also equipped with a semantic voice activity detector that can accurately detect when the user has finished speaking. These innovations help developers address a host of common issues while building voice agents. Along with these new models, OpenAI also announced that its recently launched Agents SDK now supports audio, which makes it even easier for developers to build voice agents.
Learn More: How to Use OpenAI Responses API & Agent SDK?
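For reference, here is a minimal sketch of how the new transcription models can be called through OpenAI's speech-to-text endpoint with the official Python SDK. The helper function and the file names are illustrative, and the SDK is imported lazily inside the function so the sketch reads on its own:

```python
import os

def transcribe(path: str, model: str = "gpt-4o-transcribe") -> str:
    """Send a local audio file to OpenAI's transcription endpoint and return its text."""
    from openai import OpenAI  # lazy import: this is a sketch, not a full application
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text

# Usage (requires an API key and a local audio file):
#   text = transcribe("meeting.wav")
#   text = transcribe("live_command.wav", model="gpt-4o-mini-transcribe")
```

Swapping the model name for "gpt-4o-mini-transcribe" trades a little accuracy for lower latency and cost, which suits the live-caption and voice-command use cases described above.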
Technical Innovations Behind OpenAI's Audio Models
The advancements in these audio models are attributed to several key technical innovations:
- Pretraining with Authentic Audio Datasets: Leveraging extensive and diverse audio data has enriched the models' ability to understand and generate human-like speech patterns.
- Advanced Distillation Methodologies: These methods have been employed to optimize model performance, ensuring efficiency without compromising quality.
- Reinforcement Learning Paradigm: Implementing reinforcement learning has contributed to the models' improved accuracy and adaptability in various speech scenarios.
How to Access OpenAI's Audio Models
The latest model, GPT-4o-mini TTS, is available on a new platform launched by OpenAI called OpenAI.fm. Here's how you can access this model:
- Open the Website
First, head to www.openai.fm.
- Choose the Voice and Vibe
On the interface that opens up, choose your voice and set the vibe. If you can't find the right character with the right vibe, click the refresh button to get different options.
- Fine-tune the Voice
You can further customize the chosen voice with a detailed prompt. Below the vibe options, you can type in details like accent, tone, pacing, etc. to get the exact voice you want.
- Add the Script and Play
Once set, just type your script into the text input box on the right, and click the 'PLAY' button. If you like what you hear, you can either download the audio or share it externally. If not, you can keep trying more iterations until you get it right.

The page requires no signup, and you can play with the model as much as you like. Moreover, in the top right corner, there is even a toggle that gives you the code for the model, fine-tuned to your choices.
Hands-on Testing of OpenAI's Audio Models
Now that we know how to use the model, let's give it a try! First, let's test out the OpenAI.fm website.
1. Using GPT-4o-mini TTS on OpenAI.fm
Suppose I wish to build an "Emergency Services" voice support agent.
For this agent, I select the:
- Voice – Nova
- Vibe – Sympathetic
Use the Following Instructions:
Tone: Calm, confident, and authoritative. Reassuring, to keep the caller calm while handling the situation. Professional yet empathetic, reflecting genuine concern for the caller's well-being.
Pacing: Steady, clear, and deliberate. Not too fast, to avoid panic, but not too slow, to avoid delaying the response. Slight pauses to give the caller time to respond and process information.
Clarity: Clear, neutral accent with a well-enunciated voice. Avoid jargon or complicated terms, using simple, easy-to-understand language.
Empathy: Acknowledge the caller's emotional state (fear, panic, etc.) without adding to it.
Offer calm reassurance and support throughout the conversation.
Use the Following Script:
"Hello, this is Emergency Services. I'm here to help you. Please stay calm and listen carefully as I guide you through this situation."
"Help is on the way, but I need a bit of information to make sure we respond quickly and appropriately."
"Please provide me with your location. The exact address or nearby landmarks will help us get to you faster."
"Thank you; if anyone is injured, I need you to stay with them and avoid moving them unless necessary."
"If there is any bleeding, apply pressure to the wound to control it. If the person is not breathing, I will guide you through CPR. Please stay with them and keep calm."
"If there are no injuries, please find a safe place and stay there. Avoid danger, and wait for emergency responders to arrive."
"You're doing great. Stay on the line with me, and I'll make sure help is on the way and keep you updated until responders arrive."
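The code toggle on OpenAI.fm produces an equivalent API call. A minimal sketch of what that looks like with OpenAI's official Python SDK, assuming the Nova voice chosen above (the `instructions` string is a condensed version of the vibe prompt, and the output file name is illustrative):

```python
import os

# Condensed version of the vibe prompt used above (illustrative, not the exact toggle output)
EMERGENCY_INSTRUCTIONS = (
    "Tone: calm, confident, and authoritative, yet empathetic. "
    "Pacing: steady and deliberate, with slight pauses. "
    "Clarity: clear, neutral accent; simple, easy-to-understand language."
)

def speak(text: str, out_path: str = "emergency_line.mp3") -> None:
    """Synthesize `text` with gpt-4o-mini-tts and stream the audio to a file."""
    from openai import OpenAI  # lazy import: this is a sketch, not a full application
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        instructions=EMERGENCY_INSTRUCTIONS,
        input=text,
    ) as response:
        response.stream_to_file(out_path)

# Usage (requires an API key):
#   speak("Hello, this is Emergency Services. I'm here to help you.")
```

The `instructions` parameter is what sets gpt-4o-mini-tts apart: it controls how the line is delivered, while `input` controls what is said.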

Output:
Wasn't that great? OpenAI's latest audio models are now accessible through OpenAI's API as well, enabling developers to integrate them into various applications.
Now let's test that out.
2. Using gpt-4o-audio-preview via API
We'll be accessing the gpt-4o-audio-preview model via OpenAI's API and trying out two tasks: one for text-to-speech, and the other for speech-to-text.
Task 1: Text-to-Speech
For this task, I'll be asking the model to tell me a joke.
Code Input:
import base64
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": "Can you tell me a joke about an AI trying to tell a joke?"
        }
    ]
)

print(completion.choices[0])

# Decode the base64-encoded audio returned by the model and save it as a WAV file
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("output.wav", "wb") as f:
    f.write(wav_bytes)
Response:
Task 2: Speech-to-Text
For our second task, let's give the model this audio file and see if it can tell us about the recording.
Code Input:
import base64
import requests
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")

# Fetch the audio file and convert it to a base64-encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)
Response:

Benchmark Results of OpenAI's Audio Models
To assess the performance of its latest speech-to-text models, OpenAI conducted benchmark tests using Word Error Rate (WER), a standard metric in speech recognition. WER measures transcription accuracy by calculating the percentage of incorrect words compared to a reference transcript. A lower WER indicates better performance with fewer errors.
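To make the metric concrete, here is a small, self-contained sketch that computes WER as word-level edit distance (substitutions + deletions + insertions) divided by the reference length; the example sentences are made up:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("help is on the way", "help is on way"))  # 1 deletion / 5 words = 0.2
```

A transcript that drops one word out of five thus scores a WER of 0.2, i.e. 20% of the reference words were wrong.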

As the results show, the new speech-to-text models – gpt-4o-transcribe and gpt-4o-mini-transcribe – offer improved word error rates and enhanced language recognition compared to earlier models like Whisper.
Performance on FLEURS Benchmark
One of the key benchmarks used is FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech), a multilingual speech dataset covering over 100 languages with manually transcribed audio samples.

The results indicate that OpenAI's new models:
- Achieve lower WER across multiple languages, demonstrating improved transcription accuracy.
- Show stronger multilingual coverage, making them more reliable for diverse linguistic applications.
- Outperform Whisper v2 and Whisper v3, OpenAI's previous-generation models, in all evaluated languages.
Cost of OpenAI's Audio Models

Conclusion
OpenAI's latest audio models mark a significant shift from purely text-based agents to sophisticated voice agents, bridging the gap between AI and human-like interaction. These models don't just understand what to say; they grasp how to say it, capturing tone, pacing, and emotion with remarkable precision. By offering both speech-to-text and text-to-speech capabilities, OpenAI enables developers to create AI-driven voice experiences that feel more natural and engaging.
The availability of these models via API means developers now have greater control over both the content and delivery of AI-generated speech. Additionally, OpenAI's Agents SDK makes it easier to transform traditional text-based agents into fully functional voice agents, opening up new possibilities for customer service, accessibility tools, and real-time communication applications. As OpenAI continues to refine its voice technology, these advancements set a new standard for AI-powered interactions.
Frequently Asked Questions
Q. What new audio models has OpenAI released?
A. OpenAI has launched three new audio models: GPT-4o-Transcribe, GPT-4o-Mini-Transcribe, and GPT-4o-mini TTS. These models are designed to enhance speech-to-text and text-to-speech capabilities, enabling more accurate transcriptions and natural-sounding AI-generated speech.
Q. How do the new models compare to Whisper?
A. Compared to OpenAI's Whisper models, the new GPT-4o audio models offer improved transcription accuracy and lower word error rates. They also offer enhanced multilingual support and better real-time responsiveness. Additionally, the text-to-speech model provides more natural voice modulation, allowing users to control tone, style, and pacing for more lifelike AI-generated speech.
Q. What can the new text-to-speech model do?
A. The new TTS model allows users to generate speech with customizable styles, tones, and pacing. It enhances human-like voice modulation and supports diverse use cases, from AI voice assistants to audiobook narration. The model also provides better emotional expression and clarity than previous iterations.
Q. What is the difference between GPT-4o-Transcribe and GPT-4o-Mini-Transcribe?
A. GPT-4o-Transcribe offers industry-leading transcription accuracy, making it ideal for professional use cases like meeting transcriptions and customer service logs. GPT-4o-Mini-Transcribe is optimized for efficiency and speed, catering to real-time applications such as live captions and interactive AI agents.
Q. What is OpenAI.fm?
A. OpenAI.fm is a web platform where users can test OpenAI's text-to-speech model without signing up. Users can select a voice, adjust the tone, enter a script, and generate audio instantly. The platform also provides the underlying API code for further customization.
Q. Does the Agents SDK support voice agents?
A. Yes, OpenAI's Agents SDK now supports audio, allowing developers to convert text-based agents into interactive voice agents. This makes it easier to create AI-powered customer support bots, accessibility tools, and personalized AI assistants with advanced voice capabilities.