
Choosing the Right Text-to-Speech Model: Part 2

June 30, 2025
5 mins read
Rajdeep Borgohain
DevRel Engineer

Introduction

In our first part, we evaluated 9 TTS models, focusing on factors like voice quality, customization, integration ease, and latency across various use cases. Notably, models such as MeloTTS and Piper TTS demonstrated low latency, while Tortoise TTS exhibited significant delays with longer inputs.

Since then, the TTS landscape has evolved significantly. Innovations in deep learning and the availability of large-scale speech datasets have led to the development of models that produce more natural, expressive, and human-like speech. This progression has expanded TTS applications across virtual assistants, content creation, accessibility tools, and conversational AI systems.

Traditionally, TTS systems relied on concatenative and parametric methods, often resulting in robotic-sounding speech. The advent of neural network-based models has revolutionized this field. Modern TTS models, inspired by large language models, treat speech synthesis as a sequence modeling task, enabling more natural and expressive voice generation.

In this continuation, we delve into 12 prominent TTS models, examining the features that define their capabilities and analyzing their synthesized output and latency.

How Modern TTS Models Work: Three Dominant Paradigms

1. Codec Language Models (CLM)

CLMs, such as Dia-1.6B, transform audio waveforms into discrete tokens using neural audio codecs. These tokens are then modeled with language models to generate speech. This approach enables efficient handling of speaker identity and prosody, facilitating tasks like zero-shot voice cloning. Recent advancements focus on co-designing codecs and language models to enhance performance and efficiency.
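
To make the tokenize-then-model idea concrete, here is a minimal sketch of the codec half of the pipeline. It uses Meta's EnCodec via Hugging Face transformers purely as an illustration; Dia-1.6B and its peers ship their own codecs, and the language-model stage is only indicated in a comment.

```python
# A minimal sketch of the codec stage of a codec language model.
# Assumes `pip install transformers torch`; EnCodec stands in for the
# model-specific codecs that Dia-1.6B and similar systems actually use.
import torch
from transformers import AutoProcessor, EncodecModel

processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
codec = EncodecModel.from_pretrained("facebook/encodec_24khz")

# One second of dummy audio at 24 kHz (replace with real speech).
audio = torch.randn(24_000).numpy()
inputs = processor(raw_audio=audio, sampling_rate=24_000, return_tensors="pt")

# Encode the waveform into discrete codebook tokens...
encoded = codec.encode(inputs["input_values"], inputs["padding_mask"])
print(encoded.audio_codes.shape)  # discrete codebook indices per frame

# ...a language model would autoregressively predict such token
# sequences from text; decoding the tokens reconstructs a waveform.
decoded = codec.decode(encoded.audio_codes, encoded.audio_scales,
                       inputs["padding_mask"])[0]
```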

2. Diffusion-Based Models

Diffusion models, like F5-TTS, generate speech by iteratively refining random noise into coherent audio through a denoising process. They excel at producing high-fidelity and expressive speech. However, their iterative nature can lead to higher computational costs. Techniques like knowledge distillation are being explored to accelerate inference without compromising quality.
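
The iterative refinement is easiest to see as a loop. The sketch below is a toy, not F5-TTS's actual sampler: `denoise_step` is a hypothetical stand-in for a trained network, and real systems condition every step on the input text and use learned noise schedules.

```python
# A toy sketch of diffusion-style sampling: start from noise and
# iteratively denoise. `denoise_step` is a hypothetical stand-in for a
# trained, text-conditioned network; real samplers use learned
# schedules and far more sophisticated solvers.
import torch

def denoise_step(x: torch.Tensor, t: int, num_steps: int) -> torch.Tensor:
    # Placeholder: shrink toward zero as if removing predicted noise.
    return x * (1.0 - 1.0 / (num_steps - t + 1))

num_steps = 50
x = torch.randn(1, 80, 200)  # pure noise in mel-spectrogram shape

for t in range(num_steps):
    x = denoise_step(x, t, num_steps)  # each pass removes a bit of noise

# `x` would now be a clean mel-spectrogram, handed to a vocoder for audio.
```

Each extra step improves fidelity but adds a full network forward pass, which is exactly why inference cost grows with the step count.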

3. Direct Waveform / Vocoder-Coupled Models

These models, including Kokoro-82M and XTTS-v2, generate raw audio waveforms directly from intermediate representations like mel-spectrograms. They are known for producing natural-sounding speech with fine-grained control over acoustic features. Recent innovations, such as the BiVocoder, integrate feature extraction and waveform generation for improved performance.
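
As a concrete example of this family, Kokoro-82M can be driven in a few lines. This follows the usage documented for the `kokoro` pip package at the time of writing; treat it as a sketch, since the API may change.

```python
# A minimal sketch of running Kokoro-82M, following the usage
# documented for the `kokoro` pip package (pip install kokoro soundfile).
import soundfile as sf
from kokoro import KPipeline

pipeline = KPipeline(lang_code="a")  # "a" selects American English
text = "Direct-waveform models trade iterative sampling for speed."

# The pipeline yields (graphemes, phonemes, audio) chunks per segment.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"kokoro_chunk_{i}.wav", audio, 24000)  # 24 kHz output
```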

Comprehensive Feature Comparison of TTS Models

In this section, we present a comparative analysis of 12 prominent open-source TTS models. Each model is evaluated on three critical features: model size, language support, and zero-shot voice cloning capability.

Comparative Analysis of TTS Models

We conducted a comprehensive analysis of twelve TTS models, assessing their synthesized speech quality, customization options, ease of integration, and respective pros and cons.

Key Findings:

Several models achieved excellent ratings across multiple categories, with Kokoro-82M, csm-1b, Spark-TTS-0.5B, Orpheus-3b-0.1-ft, F5-TTS, and Llasa-3B delivering impressive performance in synthesized speech quality.

Zonos-v0.1-transformer stood out as the most controllable model in our evaluation, while F5-TTS and csm-1b emerged as the most well-rounded performers, delivering good results in both synthesized speech quality and controllability across all synthesis parameters.

Latency Comparison of TTS Models

This analysis compares the latency of various TTS models across input lengths ranging from 5 to 200 words.

We have evaluated 12 different TTS models: XTTS-v2, Kokoro-82M, Dia-1.6B, Llama-OuteTTS-1.0-1B, MegaTTS3, MaskGCT, Llasa-3B, F5-TTS, Zonos-v0.1-transformer, Orpheus-3b-0.1-ft, Spark-TTS-0.5B, and csm-1b.

We found that most of these models show a roughly linear increase in latency as inputs grow longer. Kokoro-82M emerges as the clear winner for speed, consistently processing texts in under 0.3 seconds across all tested lengths, while F5-TTS also performs well, staying under 7 seconds even for the longest inputs.

At the other end of the spectrum, Llama-OuteTTS-1.0-1B exhibits the highest latency, exceeding 4 minutes for 200-word inputs. Several models, including XTTS-v2, MegaTTS3, and Llasa-3B, perform well on shorter texts. Notably, Dia-1.6B and MaskGCT scale relatively consistently across the full range.

How We Tested Them

Testing Platform:

All tests were run in a Docker container on the same hardware configuration to ensure consistency.

  • GPU: NVIDIA L4 with 24 GB VRAM
  • CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
  • RAM: 32 GB

Text Inputs:

Text samples were prepared in lengths of 5, 10, 25, 50, 100, and 200 words, and the same text content was used across all the models for a fair comparison.
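
For reference, the shape of such a latency harness is simple. The sketch below is illustrative rather than our exact script: `synthesize(text)` is a hypothetical callable that would wrap whichever model is under test.

```python
# A minimal sketch of the latency harness. `synthesize(text)` is a
# hypothetical stand-in for a call into the TTS model under test.
import time

def synthesize(text: str) -> bytes:
    # Dummy cost proportional to input length; replace with a real model call.
    time.sleep(0.01 * len(text.split()))
    return b""

WORD_COUNTS = [5, 10, 25, 50, 100, 200]
words = ("speech " * 200).split()  # same source text for every model

for n in WORD_COUNTS:
    text = " ".join(words[:n])
    start = time.perf_counter()
    synthesize(text)
    print(f"{n:>3} words: {time.perf_counter() - start:.2f}s")
```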

Conclusion

TTS models have evolved rapidly, with modern neural approaches delivering significantly more natural speech than traditional systems. Our evaluation of 12 models reveals clear performance distinctions across speed, quality, and controllability.

Kokoro-82M excels in speed with sub-0.3-second processing, while F5-TTS and csm-1b offer the best balance of naturalness and intelligibility. Zonos-v0.1-transformer stands out for controllability, and models like Llama-OuteTTS-1.0-1B provide extensive multilingual support.

Selecting the right TTS model requires balancing application needs: real-time systems benefit from fast models like Kokoro-82M, while content creation applications may prioritize quality over speed. As TTS technology continues advancing, understanding these trade-offs between latency, quality, language support, and computational requirements will be essential for choosing models that enhance user experiences across diverse applications.

Resources:

  1. Dia 1.6B: A case study in smart innovation over brute-force compute in AI - TechTalks, https://bdtechtalks.com/2025/04/24/dia-1-6b-text-to-speech/
  2. Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese - arXiv, https://arxiv.org/html/2505.11200v1
  3. LLM-based TTS explained by a human, a breakdown - r/LocalLLaMA, Reddit, https://www.reddit.com/r/LocalLLaMA/comments/1jtwbt9/llmbased_tts_explained_by_a_human_a_breakdown/
  4. Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens - arXiv, https://arxiv.org/html/2503.01710v1
  5. Voice Cloning: Comprehensive Survey - arXiv, https://arxiv.org/html/2505.00579v1
  6. Diff-TTS: A Denoising Diffusion Model for Text-to-Speech - ISCA Archive, https://www.isca-archive.org/interspeech_2021/jeong21_interspeech.pdf
