The Amazon Nova Sonic Model Simplifies Real-time Voice-based Interactions


On 8 April 2025, Amazon announced Nova Sonic, a foundation model that combines speech understanding and speech generation in a single model, enabling more human-like voice conversations in AI applications. The model captures not only what is said but also how it is said, including tone, style, and pace, all of which matter for natural conversation. Nova Sonic is available through an API on Amazon Bedrock, which simplifies the development of voice-based applications across industries.


Traditional voice-based applications have required complex coordination of several separate models (speech recognition, a large language model, and text-to-speech), a pipeline that fails to preserve the acoustic context of speech. Nova Sonic instead integrates these capabilities into one unified system that, according to Rohit Prasad, Amazon’s AGI Chief Scientist, models not just the “what” but also the “how” of communication. The model achieved a 69.7% win rate against Google’s Gemini Flash 2.0 and a 51.0% win rate against OpenAI’s GPT-4o in single-turn American English conversations. It also recorded a 4.2% word error rate on the Multilingual LibriSpeech test across English, French, German, Italian, and Spanish, over 36% better than GPT-4o Transcribe.
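For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words. The Python sketch below illustrates that standard definition only. The function name and sample sentences are made up for illustration, and it does not reproduce Amazon's evaluation setup or any text normalisation the benchmark may apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.1%}")
```

On this definition, a 4.2% word error rate corresponds to roughly four word-level errors per hundred reference words.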

Amazon Nova Sonic offers several benefits to enterprise users, including a user-perceived latency of 1.09 seconds, lower than OpenAI’s GPT-4o (1.18 seconds) and Google’s Gemini Flash 2.0 (1.41 seconds). Amazon also claims Nova Sonic is nearly 80% cheaper than GPT-4o in real-time operation, a significant competitive advantage. Several companies already use the technology: ASAPP to optimise customer service centres, Education First (EF) to improve language learners’ pronunciation, and Stats Perform to enable data-rich sports interactions. Nova Sonic currently supports American and British English with male and female voices, with additional languages and accents in development.
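To put the latency figures in proportion, the short Python sketch below works out how much lower Nova Sonic's reported latency is relative to each competitor. It uses only the three numbers quoted above; the helper name is made up for illustration.

```python
def relative_reduction(value: float, baseline: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return 1 - value / baseline


NOVA_SONIC_SECONDS = 1.09  # user-perceived latency reported for Nova Sonic
competitors = {"GPT-4o": 1.18, "Gemini Flash 2.0": 1.41}

for name, seconds in competitors.items():
    reduction = relative_reduction(NOVA_SONIC_SECONDS, seconds)
    print(f"vs {name}: {reduction:.1%} lower latency")
# vs GPT-4o: 7.6% lower latency
# vs Gemini Flash 2.0: 22.7% lower latency
```

The margin over GPT-4o is therefore fairly small in relative terms, while the gap to Gemini Flash 2.0 is more pronounced.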

Sources:

1. Amazon’s new Nova Sonic foundation model understands not just what you say—but how you say it
   Our new gen AI model picks up on tone, inflection, and pacing, for a deeper understanding of human conversation.

2. Amazon plays catch-up with new Nova AI models to generate voices and video
   Nova Sonic can detect your tone.

3. Move over, Alexa: Amazon launches new realtime voice model Nova Sonic for third-party enterprise development
   Currently, the model supports multiple expressive voices, both masculine and feminine, in American and British English.