Speech-to-Speech Generative AI: From Theory to Practice

Explore the world of speech-to-speech generative AI. Learn key components and practical considerations to build your applications

Introduction

Generative AI is one of the most transformative, rapidly evolving, and widely adopted technologies today. A key reason for its impact is its ability to let humans communicate with computers using natural language, making interactions easy and intuitive. This has facilitated its adoption in our daily lives and across multiple industries, from education and healthcare to entertainment.

The way we interact started with text, powered by large language models (LLMs). Now, to further bridge the gap between human and machine communication, the natural evolution is moving beyond writing to speaking. This is driving a growing focus in research on multimodal generative AI, which refers to systems capable of processing and generating multiple forms of data (modalities), such as text, speech, and images.

In this post, we focus on speech-to-speech models, which are generative AI models designed to process spoken input and respond in real-time with natural, human-like speech, opening new possibilities for technological innovation and applications.

Here’s what we’ll explore:

  • Under the Hood: Key components of speech-to-speech generative AI models.
  • Real-World Applications: Use cases, capabilities and best practices for building with speech-to-speech generative AI models.
  • Measuring Success: Essential evaluation metrics for these systems.

Overview

Given the wide availability of LLMs trained on massive text datasets and the challenges associated with collecting and annotating large amounts of speech data, the natural approach to building speech-to-speech generative AI systems has been to leverage these LLMs and adapt them for spoken interactions. Broadly speaking, this adaptation has followed two main approaches: using a pipeline of cascaded independent modules, or employing an end-to-end speech large language model that directly handles the speech modality. In this post, we will refer to these approaches as cascaded and end-to-end speech-to-speech systems. Figure 1 illustrates these two approaches.

📜 N.B: The discussion throughout this post focuses on decoder-only (autoregressive) transformer architectures, as they are representative of the majority of current LLMs and speech-to-speech models.

1- Cascaded speech-to-speech systems

Cascaded systems operate as a pipeline, with processing handled by independent modules. The main components include Automatic Speech Recognition (ASR) to transcribe the spoken input into text, an LLM to process and generate text responses, and Text-to-Speech (TTS) to convert the generated text back into speech. AudioGPT is an example of a cascaded speech-to-speech system.
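
To make the pipeline concrete, below is a minimal Python sketch of a single cascaded turn. The asr_transcribe, llm_respond, and tts_synthesize functions are hypothetical placeholders for whichever ASR, LLM, and TTS components you use; they are not tied to a specific library.

# Minimal sketch of one turn in a cascaded speech-to-speech pipeline.
# asr_transcribe, llm_respond, and tts_synthesize are hypothetical
# placeholders for your ASR, LLM, and TTS components.

def asr_transcribe(audio_waveform) -> str:
    """Speech -> text (e.g., an ASR model or API)."""
    raise NotImplementedError

def llm_respond(user_text: str, history: list[dict]) -> str:
    """Text -> text response from an LLM, given the conversation history."""
    raise NotImplementedError

def tts_synthesize(response_text: str):
    """Text -> audio waveform (e.g., a TTS model or API)."""
    raise NotImplementedError

def cascaded_turn(audio_waveform, history: list[dict]):
    user_text = asr_transcribe(audio_waveform)                        # 1. ASR
    history.append({"role": "user", "content": user_text})
    response_text = llm_respond(user_text, history)                   # 2. LLM
    history.append({"role": "assistant", "content": response_text})
    return tts_synthesize(response_text)                              # 3. TTS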

Despite their modularity, cascaded systems rely on separate modules that can introduce latency, making real-time conversation challenging. Additionally, relying on intermediate text representations can sometimes lose nuances in the original speech, such as emotion or tone.

2- End-to-end speech-to-speech models

In contrast, end-to-end speech-to-speech models are capable of handling the speech modality by directly extracting information from speech inputs and generating speech representations without relying on intermediate text representation. End-to-end systems offer a more integrated approach, often resulting in lower latency and a more natural conversational flow.

It’s worth noting that there are also hybrid approaches, where some cascaded systems skip the ASR module by handling input speech directly but still need the TTS module, such as Qwen2-Audio or VITA. The choice between cascaded and end-to-end systems depends on the specific needs and constraints of the real-world application.

As current research primarily focuses on end-to-end models, we will emphasize them in this post. In the next section, we will discuss the key components of end-to-end speech-to-speech models.

2.1- Key components of end-to-end speech-to-speech models

There are three main components within an end-to-end speech-to-speech model, namely the speech encoder, the LLM (decoder-only transformer), and the vocoder, as illustrated in Figure 1.

Specifically, the speech encoder processes raw audio waveforms into a set of audio embeddings. In multimodal systems, additional modalities, primarily text, are often required in conjunction with audio. A crucial concept in such systems is the alignment of text and speech modalities to ensure coherent integration. While we will not delve deeply into alignment techniques in this post, we include both text and audio modalities in our illustrations to provide a realistic example of a speech-to-speech model.

One possible way of aligning text and audio modalities is to concatenate the audio embeddings with the text embeddings, achieved by expanding the embedding matrix of the model to include a new set of audio token embeddings. A mixed sequence of text and audio embeddings is then fed as input to the language model, which performs next-token prediction in an autoregressive manner, generating either text or audio tokens as needed. This approach is used, for instance, in AudioPaLM.
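
As a rough illustration of this idea (a simplification, not AudioPaLM’s actual implementation), the PyTorch sketch below uses made-up vocabulary sizes to extend a single embedding table so that audio token IDs occupy the range after the text vocabulary, letting a mixed sequence of text and audio tokens be embedded and fed to a decoder-only model.

import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from any specific model).
TEXT_VOCAB = 32_000      # original text vocabulary size
AUDIO_VOCAB = 1_024      # audio codebook size from the tokenizer
HIDDEN = 768

# One embedding table covering text tokens followed by audio tokens.
embed = nn.Embedding(TEXT_VOCAB + AUDIO_VOCAB, HIDDEN)

def audio_to_token_ids(audio_codes: torch.Tensor) -> torch.Tensor:
    """Shift audio codebook indices so they land after the text vocabulary."""
    return audio_codes + TEXT_VOCAB

# Mixed sequence: a few text token IDs followed by audio token IDs.
text_ids = torch.tensor([101, 2054, 2003])                  # placeholder text IDs
audio_ids = audio_to_token_ids(torch.tensor([17, 512, 9]))  # placeholder audio codes
mixed = torch.cat([text_ids, audio_ids])

hidden_states = embed(mixed)   # (seq_len, HIDDEN) input to the decoder-only LLM
print(hidden_states.shape)     # torch.Size([6, 768])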

Finally, the vocoder converts the audio tokens produced by the language model back into high-quality audio waveforms, enabling natural and seamless speech synthesis.

2.1.1- Audio encoders

The audio encoder is a critical component in end-to-end speech-to-speech models, with several types of encoders employed depending on how the speech is represented. These encoders can be designed for two types of representations: discrete representation and continuous representation, as illustrated in Figure 2.

Discrete representation

In the case of discrete representation, the audio encoder is also referred to as an audio tokenizer. It processes continuous audio signals (waveforms) by first encoding them into latent representations and then converting these latent representations into discrete tokens. These tokens represent the audio in a compressed, quantized form and can be directly fed into the LLM for further processing. Models like AudioPaLM and Moshi utilize this approach.
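
The sketch below illustrates the core quantization step behind audio tokenizers: each continuous latent frame is mapped to the index of its nearest codebook vector. Real codecs such as SoundStream use residual vector quantization with several learned codebooks; this single-codebook version with random tensors is only meant to make the idea concrete.

import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each latent frame to the index of its nearest codebook entry.

    latents:  (num_frames, dim) continuous encoder outputs
    codebook: (codebook_size, dim) learned code vectors
    returns:  (num_frames,) discrete audio token IDs
    """
    distances = torch.cdist(latents, codebook)   # (num_frames, codebook_size)
    return distances.argmin(dim=-1)

# Toy example with random tensors standing in for real encoder outputs.
latents = torch.randn(50, 128)      # 50 frames of 128-d latent features
codebook = torch.randn(1024, 128)   # 1024-entry codebook
audio_tokens = quantize(latents, codebook)
print(audio_tokens[:10])            # e.g., tensor([731,  12, 908, ...])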

Current models primarily use three main audio tokenization methods:

  • Acoustic tokens, which focus on capturing sound patterns and low-level acoustic details. These are typically generated by audio encoders like SoundStream, a neural audio codec designed for audio compression and synthesis.
  • Semantic tokens, which emphasize the meaning and linguistic content of the speech. These are extracted by models like HuBERT (Hidden-Unit BERT), which learns high-level semantic representations from speech for tasks like speech understanding.
  • Mixed tokens combine elements of acoustic and semantic tokens, offering a balanced representation that captures acoustic, semantic, and paralinguistic features. Models like AudioLM utilize both acoustic and semantic tokens in tandem to ensure meaningful and high-quality speech synthesis and translation.

To further understand the distinctions between audio tokenization methods, Table 1 summarizes the characteristics, objectives, and applications of acoustic, semantic, and mixed tokens for clearer differentiation.

Continuous representation

In the case of continuous representation, the audio encoder extracts speech features that are unquantized, real-valued representations of speech signals existing on a continuous scale. These features capture fine-grained, nuanced aspects of speech that might otherwise be lost during discretization. To make these features compatible with the LLM, a feature projection layer, i.e., an audio projector, is required to map the extracted audio features into continuous input embedding vectors, which are then passed to the LLM for further processing. An example of a model using this approach is SALMONN.
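
Below is a minimal sketch of such a projector with assumed dimensions: a small module that maps continuous encoder features into the LLM’s embedding space. Actual systems like SALMONN use more elaborate connectors (such as a window-level Q-Former); a simple MLP is shown here only to convey the idea.

import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps continuous audio encoder features to the LLM embedding size."""
    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, num_frames, encoder_dim)
        # returns:        (batch, num_frames, llm_dim), ready to be
        #                 concatenated with text embeddings.
        return self.proj(audio_features)

projector = AudioProjector()
features = torch.randn(1, 100, 1280)   # placeholder encoder output
audio_embeddings = projector(features)
print(audio_embeddings.shape)          # torch.Size([1, 100, 4096])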

2.1.2- Vocoder

The other key element in end-to-end speech-to-speech models is the vocoder, also referred to as the Token-to-Speech Synthesizer. The vocoder’s primary role is to convert discrete or continuous audio tokens generated by the LLM back into high-quality, natural-sounding audio waveforms. A typical vocoder used for this purpose is HiFi-GAN, known for its ability to generate high-fidelity waveforms from token sequences. The choice of vocoder significantly impacts the quality and expressiveness of the synthesized audio, making it an essential part of the speech synthesis pipeline.

Capabilities and practical considerations

Now that we understand the basics of speech-to-speech generative AI models, let’s discuss the ideal capabilities and key considerations you should think of when building your applications with these models.

📜N.B: While this section primarily focuses on capabilities specific to voice, it’s important to remember that other general capabilities required for LLMs — such as factuality, reasoning, and memory — remain equally critical for creating effective and reliable speech-to-speech systems.

1. Low latency

Latency is one of the most critical factors in speech-to-speech systems, directly affecting user experience. Achieving low latency ensures that interactions feel natural and fluid, maintaining a good perceived responsiveness. While there are numerous latency metrics, the most important ones for speech systems include:

  • Time to First Token (TTFT): Measures how quickly the system begins generating output after receiving input. A lower TTFT ensures faster responsiveness, which is crucial for maintaining conversational flow.
  • Interruption Time: Refers to how swiftly the system can process and respond to user interruptions during ongoing speech. This is essential for natural turn-taking in dialogue.
  • Average Latency: Represents the total processing time from input to output. Consistently low average latency enhances the overall experience, especially in real-time applications.
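
To make TTFT concrete, here is a minimal measurement sketch. stream_response is assumed to be a generator yielding output chunks from your speech-to-speech system; the timing logic is the point of the example.

import time
from typing import Iterator

def measure_latency(stream_response: Iterator[bytes]) -> dict:
    """Measure time-to-first-token (chunk) and total latency for a streamed reply."""
    start = time.perf_counter()
    ttft = None
    for chunk in stream_response:
        if ttft is None:
            ttft = time.perf_counter() - start   # first chunk arrived
    total = time.perf_counter() - start          # last chunk arrived
    return {"time_to_first_token_s": ttft, "total_latency_s": total}

# Usage with a fake stream that simulates generation delays.
def fake_stream() -> Iterator[bytes]:
    time.sleep(0.2)          # model "thinking" before the first chunk
    for _ in range(5):
        time.sleep(0.05)     # per-chunk generation time
        yield b"\x00" * 320  # placeholder audio chunk

print(measure_latency(fake_stream()))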

2. High-quality voice output

Beyond low latency, high-quality voice ensures user engagement, clarity, and satisfaction. From a quality perspective, ideal capabilities include:

2.1. Naturalness and speech intelligence

A good-quality speech-to-speech system should sound natural and exhibit speech intelligence capabilities such as emotion recognition, multilingual support, accent recognition, and sarcasm detection. These capabilities can be achieved through advanced voice encoders and vocoders, which capture subtle speech features such as intonation, rhythm, and prosody, as well as emotions like joy, sadness, or excitement.

2.2. Voice adaptability with prompt engineering

Contextual prompting for voice is an essential capability that enables easy adaptation of voice output features — such as tone, emotion, and style — using natural language prompts, without relying on SSML tags. For example, a prompt can adapt the voice to convey empathy in mental health applications, excitement in promotional scenarios, or professionalism in corporate interactions. Below is a prompt example to adapt a voice output for a “Happy New Year” message:

# Voice style parameters to inject into the prompt
tone = "cheerful"
emotion = "joyful"

# Natural-language prompt steering the tone and emotion of the voice output
prompt = f"Generate a Happy New Year message with a {tone} tone and a {emotion} emotion. Include phrases like 'Happy New Year!' and 'Wishing you a fantastic year ahead!' Conclude with an uplifting and encouraging note."

2.3. Speaker flexibility

Speaker flexibility is another essential component of high-quality voice output. By offering a variety of speakers — differing in gender, accent, or age — applications can cater to diverse user preferences and contexts. This flexibility is particularly important in areas such as customer service, where certain voice characteristics can improve user comfort, or in multilingual systems, where accents aid comprehension. Speech-to-speech systems that provide a range of speaker options ensure adaptability and inclusivity for diverse audiences.

3- Barge-in and full-duplex interaction

Barge-in refers to a user’s ability to interrupt a system’s speech output with their own speech input. This capability is essential for achieving full-duplex communication, where both the user and the system can simultaneously listen and speak, closely mimicking natural human conversational dynamics. Unlike text-based models, which typically operate in a half-duplex manner, full-duplex interaction is particularly important in voice-based systems to enable seamless, real-time interactions.

Recent advancements in speech-based models have aimed to incorporate full-duplex capabilities through techniques such as Voice Activity Detection (VAD), Parallel Stream Processing, or Interleaved Token Modeling.
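
Below is a deliberately simplified sketch of energy-based barge-in: while the system is speaking, incoming microphone frames are monitored, and playback stops as soon as sustained user speech is detected. Production systems typically rely on a trained VAD model rather than a raw energy threshold; the threshold and frame counts here are illustrative assumptions.

import numpy as np

ENERGY_THRESHOLD = 0.01   # illustrative RMS threshold for "speech present"
MIN_SPEECH_FRAMES = 5     # require consecutive frames to avoid false triggers

def frame_has_speech(frame: np.ndarray) -> bool:
    """Crude energy-based voice activity check on one audio frame."""
    rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
    return rms > ENERGY_THRESHOLD

def playback_with_barge_in(playback_frames, mic_frames):
    """Yield system audio frames until the user starts talking.

    playback_frames / mic_frames are assumed to be synchronized iterables
    of short audio frames (e.g., 20 ms each) from your audio I/O layer.
    """
    consecutive_speech = 0
    for out_frame, mic_frame in zip(playback_frames, mic_frames):
        consecutive_speech = consecutive_speech + 1 if frame_has_speech(mic_frame) else 0
        if consecutive_speech >= MIN_SPEECH_FRAMES:
            return            # barge-in detected: stop speaking, yield the turn
        yield out_frame       # otherwise keep playing the system's audio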

4- Streaming

Streaming means the system should not wait for a lengthy audio segment to be processed before generating a response. Instead, the model typically operates on a chunk-based mechanism, dynamically processing and generating audio in real time, one chunk at a time.

Streaming functionality is closely tied to the need for low latency, and both are essential for seamless real-time interactions and overall system responsiveness.
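
As a rough sketch of the chunk-based mechanism, the loop below consumes input audio chunk by chunk and yields output chunks as soon as the model produces them, rather than waiting for the full utterance. model_step is a hypothetical incremental inference call, not a real API.

def model_step(audio_chunk):
    """Hypothetical incremental inference call: consume one input chunk and
    return whatever output audio chunks are ready so far (possibly none)."""
    raise NotImplementedError

def stream_conversation(input_chunks):
    """Process input chunk by chunk and yield output chunks immediately."""
    for audio_chunk in input_chunks:
        for out_chunk in model_step(audio_chunk):
            yield out_chunk   # playback can start before the user finishes speaking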

5- Agentic capabilities

Agentic capabilities empower speech-to-speech models to autonomously perform tasks, make decisions, and interact with external tools or environments based on user input. While these functionalities are valuable for LLMs in general, they are particularly crucial for speech-to-speech models, enabling them to access external knowledge and be utilized in advanced applications beyond simple dialogue. Examples of agentic capabilities include accessing external tools, APIs, or databases through function calling, or executing code.
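
A minimal sketch of the function-calling pattern follows: the model emits a structured tool call (here a plain dict standing in for whatever format your model or API uses), and the application dispatches it to a registered Python function. The tool names and call format are illustrative assumptions.

import json

# Registry of tools the voice agent is allowed to call (illustrative stubs).
def get_weather(city: str) -> str:
    return f"Sunny and 22°C in {city}"              # stub: replace with a real API call

def book_meeting(topic: str, time: str) -> str:
    return f"Meeting '{topic}' booked for {time}"   # stub

TOOLS = {"get_weather": get_weather, "book_meeting": book_meeting}

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute a tool call emitted by the model, e.g.
    {"name": "get_weather", "arguments": {"city": "Paris"}}."""
    func = TOOLS.get(tool_call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool {tool_call['name']}"})
    return func(**tool_call["arguments"])

# Example: the model decided the user asked about the weather.
print(dispatch_tool_call({"name": "get_weather", "arguments": {"city": "Paris"}}))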

6- Multimodal capabilities

Multimodal capabilities enable speech-to-speech systems to process and integrate inputs from multiple modalities, such as text, images, and video, alongside speech. This enhances context understanding, broadens use cases in areas like education and healthcare, and increases accessibility for users with disabilities. For instance, a system might analyze a photo and verbal description simultaneously to offer precise solutions. Despite challenges like data alignment and computational demands, these capabilities are driving the evolution of more intuitive systems, making interactions richer and more human-like.

Evaluation

As systems grow in complexity, evaluating speech-to-speech systems is a challenging task, as it must account for linguistic and emotional nuance, real-time performance, and interactional factors. Effective evaluation requires a combination of approaches to capture both technical precision and user experience. Evaluating speech-to-speech systems remains a less advanced area of research compared to system development, with ongoing efforts to establish comprehensive and reliable methodologies. Available evaluation methods are typically divided into human-based assessments and automatic approaches, which include benchmarks and metrics.

📜N.B: This section aims to provide examples and is not an exhaustive list. As in the capabilities section, we focus on metrics specific to voice. Common metrics used for LLMs, including LLM-as-a-judge approaches, are not covered in this post.

1- Human evaluation

Human evaluation is critical for assessing subjective qualities like naturalness, expressiveness, and overall user experience. The Mean Opinion Score (MOS) is one commonly used metric, where raters score audio samples on a predefined scale based on perceived quality, focusing on intelligibility, emotional expressiveness, and fluency. However, while effective in capturing human-centric insights, MOS is resource-intensive and influenced by rater bias.
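
As a quick illustration of how MOS is aggregated, the sketch below averages per-rater scores for one sample and reports a 95% confidence interval using a normal approximation; the rating values are made up.

import numpy as np

def mean_opinion_score(ratings):
    """Average 1-5 ratings and report a 95% confidence interval
    (normal approximation; the ratings here are illustrative)."""
    r = np.asarray(ratings, dtype=float)
    mos = r.mean()
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mos, ci95

ratings = [4, 5, 4, 3, 5, 4, 4, 5]   # made-up rater scores for one audio sample
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")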

2- Automatic evaluation

Automatic evaluation provides objective measures to complement human assessment. It can be divided into two main components:

2.1. Benchmarks

Benchmarks such as VoiceBench, SUPERB (Speech Processing Universal PERformance Benchmark), AudioBench or AIR-Bench standardize the evaluation of speech-to-speech systems by providing pre-defined datasets and tasks. These benchmarks enable fair comparisons across models and ensure consistency in assessing performance across areas like speech recognition, translation, and synthesis.

2.2. Metrics

Metrics quantify specific aspects of system performance. For semantic accuracy, metrics like SpeechBERTScore, SpeechBLEU or SpeechTokenDistance can be used.

Acoustic quality metrics include Mel Cepstral Distortion (MCD), which evaluates the difference in spectral features between generated and reference speech, and Log F0 Root Mean Square Error (RMSE), which measures prosody accuracy by comparing fundamental frequency (F0) patterns. Real-time performance is evaluated through latency metrics, while interaction metrics such as turn-taking accuracy assess responsiveness and coherence in dynamic conversational scenarios. Figure 3 provides examples of evaluation metrics.
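
As an example of one acoustic metric, here is a rough sketch of log-F0 RMSE using librosa’s pyin pitch tracker, compared over frames where both signals are voiced. It assumes the generated and reference audio are already time-aligned; real evaluations usually apply dynamic time warping first.

import numpy as np
import librosa

def log_f0_rmse(generated: np.ndarray, reference: np.ndarray, sr: int = 16000) -> float:
    """Compute RMSE between log-F0 contours of generated and reference speech.

    Assumes the two signals are already aligned frame by frame; production
    pipelines usually apply DTW alignment before this step.
    """
    f0_gen, _, _ = librosa.pyin(generated, fmin=65.0, fmax=400.0, sr=sr)
    f0_ref, _, _ = librosa.pyin(reference, fmin=65.0, fmax=400.0, sr=sr)
    n = min(len(f0_gen), len(f0_ref))
    f0_gen, f0_ref = f0_gen[:n], f0_ref[:n]

    # Keep only frames where both signals are voiced (pyin returns NaN otherwise).
    voiced = ~np.isnan(f0_gen) & ~np.isnan(f0_ref)
    if not voiced.any():
        return float("nan")
    diff = np.log(f0_gen[voiced]) - np.log(f0_ref[voiced])
    return float(np.sqrt(np.mean(diff ** 2)))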

Use cases and best practices

Speech-to-speech generative AI models have transformative applications across various domains. Examples of use cases include conversational assistants and agents, such as virtual customer service representatives or personal assistants. They are also valuable for language learning and testing, allowing users to practice pronunciation and comprehension in different languages. Effective integration with other systems, such as multimodal frameworks or text-based LLMs, extends functionality to diverse domains: education, by enabling interactive language tutors or immersive learning environments; healthcare, by supporting voice-enabled patient assistance and telehealth services; entertainment, by enhancing gaming experiences and virtual reality interactions; and sports, by providing real-time commentary or personalized coaching through voice interfaces.

Developing speech-to-speech applications is a complex task that requires careful planning and execution. Here are some practical recommendations to guide you:

1- It’s not just the model, think of the full stack — From the user interface, data storage, APIs, to infrastructure, every layer impacts performance. Adopt a long-term vision and choose a stack that offers you flexibility to integrate with existing and new technologies, scale with demand, and adapt to changing requirements and future advancements.

2- Thoroughly test, evaluate and monitor your system — Validate every component of your voice system across multiple dimensions: accuracy across accents and languages, clarity, naturalness, hallucinations, and behavior in noisy conditions. Build scalable test protocols for real-world scenarios to measure latency and the ability to handle interruptions. Perform scalability tests for stress and edge cases. Include user feedback and accessibility checks. Regularly update tests to cover new features and keep monitoring over time.

3- Responsible AI is even more critical for speech systems — It’s not just about latency or voice quality; it’s about safety and trust. Implement strong guardrails to prevent harmful or biased outputs, ensure user privacy, and comply with regulations. Speech systems are accessible and interact directly with users — making ethical considerations non-negotiable.

4- Trade-offs are key: optimize for your use case — Noticed the diverse capabilities of generative AI speech models? You don’t need all of them to be perfect. Consider trade-offs to prioritize what matters most for your application. For example, you might accept higher latency for better accuracy, or sacrifice flexibility for better real-time performance in some use cases.

5- You need the right mindset and expertise — Speech-to-speech systems lie at the intersection of multiple disciplines, including multimodal AI, software engineering, and user experience design. Fostering collaboration is critical, as interdisciplinary problem-solving is essential to creating an effective user experience and adopting the technology effectively. Generative AI is also a fast-moving field, requiring a growth and learning mindset to stay ahead of advancements. Invest in the right talent and build teams that excel in cross-functional expertise, embrace diverse perspectives, and communicate effectively.

Conclusion

The rise of speech-to-speech generative AI is certainly opening up a range of new possibilities and can assist us in countless ways. However, this potential can only be fully realized if these technologies are built responsibly, users are educated on ethical use, and awareness is raised about the broader ethical implications and societal impacts. This field is constantly evolving, and what we’ve covered here is not exhaustive — it’s just the starting point. Staying informed about the latest innovations and engaging with the research community are essential for developing a comprehensive understanding.

How to build your speech-to-speech applications with Google Cloud?

  1. Explore the Multimodal Live API: Discover how the Multimodal Live API, powered by Gemini, enables low-latency, bidirectional voice and video interactions. This API is designed for seamless, real-time communication, making it ideal for applications like virtual assistants, interactive education, and immersive entertainment.
  2. Check out Gemini 2.0’s advanced native audio capabilities: Watch the video below to see Gemini 2.0 in action.

📜About the Author

Dr. Wafae Bakkali is a Staff Generative AI Specialist, Blackbelt at Google. With over 12 years of experience, she has a proven track record of combining deep technical expertise and leadership with strategic business insights to drive impactful outcomes. Through her roles at Google and other reputable organizations like AWS, Dr. Bakkali has successfully guided organizations across various industries in adopting and building AI solutions.

She holds a PhD from IMT Atlantique, France, and served as a postdoctoral researcher at Centrale Supélec, Paris-Saclay University, France.

Dr. Bakkali is also an active contributor to the AI community. She regularly shares her expertise as a speaker at international tech conferences and mentors participants in AI hackathons, tech, and startup events.

🔗Connect with me on LinkedIn: https://www.linkedin.com/in/wafae-bakkali/
