Introduction

A person's voice is one of the most intimate and personal aspects of their identity. Accent, dialect, intonation, and rhythm carry a person's heritage, culture, and lived experiences - and, to some extent, their personality. Today's digital voice cloning models often fail to capture these subtleties, shifting the voice toward more generic American- or British-sounding speech regardless of the input.


EXPRESS-Voice is a new model developed by Synthesia that raises the bar for instant voice cloning - the creation of a realistic digital replica of a voice from a short audio sample, often just seconds long. It excels not only at preserving speaker identity and accent but also at generating speech that is expressive, emotionally rich, and naturally intonated. This report presents the results of a comprehensive subjective and objective evaluation comparing EXPRESS-Voice against the current leading open- and closed-source models.


While this evaluation focuses on identity preservation and voice quality, EXPRESS-Voice also unlocks capabilities that go beyond traditional cloning - enabling expressive, emotionally aware synthesis that stays coherent with the speaker's style. We explore these capabilities in a dedicated Beyond Cloning section later in the report.

Evaluation

Human Evaluation

EXPRESS-Voice Best Preserves Speaker Identity Across English Accents

In a blind listening study, 100 native English-speaking evaluators were asked to judge which voice clone best matched the identity of the original speaker.

EXPRESS-Voice was rated highest across 17 cloned identities, representing a broad range of native and non-native English accents - including American, British, Irish, Indian, Chinese, Turkish, French, and more.


EXPRESS-Voice is Most Preferred for Speaker Identity Matching

In blind pairwise comparisons, listeners consistently preferred EXPRESS-Voice over other models as the best match to the original speaker's identity and accent. Evaluations covered 17 native and non-native English speakers from a wide range of accents.


Objective Evaluation

[Charts: Speaker Similarity (higher is better) and Emotional Similarity (higher is better).]

Audio Examples

We present one utterance from each of 10 speakers, cloned with 5 different models. The ground truth is the same sentence spoken by the original speaker.


[Audio examples: ground truth plus clones from Synthesia, ElevenLabs, MiniMax, Fish Audio, and Resemble AI for speakers with Chinese, Irish, Turkish, Polish, Scottish, Lithuanian, Irish, Russian, Australian, and Yorkshire accents.]

Beyond Cloning

EXPRESS-Voice isn't just a cloning model - it's expressive, flexible, and coherent with the speaker's original emotionality. Listen below to how EXPRESS-Voice generates naturally emotive speech that retains the original speaker's expressive delivery.

Expressive Transfer

Depending on the emotional characteristics of the speaker, EXPRESS-Voice can generate realistic and natural emotional speech. The effect is most pronounced when the target text aligns with the emotional state captured in the cloning sample.


[Audio examples: voice cloning input and generated speech for the following speaker styles - Happy, Show Host, Relieved, Inspirational, Angry, Sad, Vocal Fry, Excited, Concerned.]

Method

We conducted a comprehensive evaluation of EXPRESS-Voice by comparing it to several leading instant voice cloning models - both open and closed source. The evaluation combined subjective human ratings with objective signal-level metrics across 17 internal speakers covering a diverse set of English accents and dialects, including American, British, Scottish, Irish, Chinese, Indian-British, Turkish, French, Lithuanian, and others.

For both subjective and objective evaluations, we took 13 phonetically broad and balanced target sentences from the Living Audio Dataset. Each speaker recorded these sentences to serve as ground truth. In addition, we recorded a separate 20–30 second segment from each speaker to be used exclusively for voice cloning. None of the ground truth recordings were used during the cloning process.

Models Included in the Evaluation
Model                    Company       Type
EXPRESS-Voice            Synthesia     Our Model
Chatterbox               Resemble AI   Open Source
Speech-02-hd             MiniMax       Closed Source
Fish-Speech              Fish Audio    Open Source
Eleven Multilingual V2   ElevenLabs    Closed Source
We are aware that ElevenLabs has released a new v3 alpha model that is partially available for use. As of June 2025 (the time of writing this report), however, it remains in its alpha stage and is not yet stable or accessible via API, so we chose not to include it in our evaluation.

Human Evaluation

To assess subjective quality, we conducted a blind listening study with 100 independent third-party evaluators. Participants were presented with sets of audio samples and asked to identify which sounded most similar to a reference prompt spoken by the same speaker reading a different utterance - enabling focused evaluation of identity preservation. The study followed a MUSHRA-style evaluation setup, and we analyzed responses by aggregating top-choice selections to quantify model-level preference. Each evaluation set also included the ground truth sample for reference.
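
As an illustration, aggregating top-choice selections into a model-level preference score takes only a few lines. The sketch below is hypothetical: the CSV layout and column names are assumptions, not the study's actual data schema.

```python
# Hypothetical sketch: turn MUSHRA-style ratings into model-level preference
# by counting how often each model was an evaluator's top choice per trial.
import pandas as pd

# Assumed layout: one row per (evaluator, trial, model) rating.
ratings = pd.read_csv("listening_study.csv")  # columns: evaluator_id, trial_id, model, score

# For every (evaluator, trial) pair, pick the model with the highest score.
top_choices = ratings.loc[
    ratings.groupby(["evaluator_id", "trial_id"])["score"].idxmax(), "model"
]

# Share of trials in which each model was the top choice.
preference = top_choices.value_counts(normalize=True).sort_values(ascending=False)
print(preference)
```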

Objective Evaluation

Using the same 13 samples generated for the 17 identities, we computed the following objective metrics (a minimal computation sketch follows the list):

  • Speaker Similarity – Cosine similarity between speaker embeddings computed using the WavLM model from Microsoft.
  • Emotional Similarity – Cosine similarity between emotion embeddings from the Emotion2Vec model by FunASR.
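
Below is a minimal sketch of the speaker-similarity computation. The Hugging Face checkpoint name (microsoft/wavlm-base-plus-sv) and the audio-loading details are assumptions; the report only specifies cosine similarity between WavLM speaker embeddings, with emotional similarity computed analogously from emotion2vec embeddings.

```python
# Minimal sketch: cosine similarity between WavLM speaker embeddings.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def load_16k(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)          # (channels, samples)
    wav = wav.mean(dim=0)                    # downmix to mono
    return torchaudio.functional.resample(wav, sr, 16_000)

def speaker_similarity(path_a: str, path_b: str) -> float:
    inputs = feature_extractor(
        [load_16k(path_a).numpy(), load_16k(path_b).numpy()],
        sampling_rate=16_000, return_tensors="pt", padding=True,
    )
    with torch.no_grad():
        emb = model(**inputs).embeddings     # (2, embedding_dim)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()

# Emotional similarity follows the same pattern, with utterance-level
# embeddings from FunASR's emotion2vec model in place of WavLM.
```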

Model

What enables EXPRESS-Voice to achieve such strong identity preservation, even across a wide range of accents and speakers? Let's take a closer look at the architecture, data design, and training procedure that make this possible.


Architecture: EXPRESS-Voice employs a two-stage Transformer architecture composed of an autoregressive (AR) model and a non-autoregressive (NAR) model, each with 800 million parameters. Both models operate directly on graphemes (text tokens) and are conditioned on reference audio, without relying on an explicit speaker embedding. This design lets the AR model first generate the coarse prosodic and phonetic structure, which the NAR model then refines with fine-grained acoustic detail.
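
The two-stage flow can be summarized schematically as below. The interfaces (generate_coarse, refine_level) are hypothetical stand-ins, not EXPRESS-Voice's actual API; only the overall structure - AR prediction of the first codebook level, NAR refinement of the remaining levels, conditioning on graphemes and reference audio - reflects the description above.

```python
# Schematic of the AR + NAR generation flow over RVQ codebook levels
# (interfaces are illustrative placeholders, not EXPRESS-Voice's actual API).
import torch

def clone_utterance(ar_model, nar_model, text_tokens, ref_codes, n_codebooks=8):
    """text_tokens: (T_text,) grapheme ids; ref_codes: (n_codebooks, T_ref) RVQ codes."""
    # Stage 1: the AR model predicts the first codebook level token by token,
    # conditioned on the grapheme sequence and the reference audio codes.
    coarse = ar_model.generate_coarse(text_tokens, ref_codes)             # (T_out,)

    # Stage 2: the NAR model fills in the remaining codebook levels,
    # each level conditioned on all previously generated levels.
    codes = [coarse]
    for level in range(1, n_codebooks):
        codes.append(nar_model.refine_level(level, torch.stack(codes),
                                            text_tokens, ref_codes))      # (T_out,)
    return torch.stack(codes)                                             # (n_codebooks, T_out)
```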


Tokenizer: The system employs Descript's residual vector quantization (RVQ) tokenizer to discretize acoustic representations. This choice enables efficient modeling while retaining high-fidelity audio during generation.
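
For readers who want to experiment with this family of tokenizers, the open-source descript-audio-codec package exposes an encode/decode interface along the lines sketched below. The pretrained 44 kHz checkpoint and file names are illustrative; the exact codec configuration used inside EXPRESS-Voice is not specified here.

```python
# Sketch using the open-source descript-audio-codec package
# (pip install descript-audio-codec); configuration is illustrative only.
import dac
from audiotools import AudioSignal

model_path = dac.utils.download(model_type="44khz")   # pretrained public checkpoint
codec = dac.DAC.load(model_path).eval()

signal = AudioSignal("reference.wav")
x = codec.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = codec.encode(x)   # codes: (batch, n_codebooks, frames) discrete tokens

reconstructed = codec.decode(z)             # back to a waveform for a fidelity check
```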


Training Data: The model is trained on a large internal dataset composed of high-quality, human-annotated studio recordings, as well as open-domain corpora such as YODAS and LibriLight. The majority of the data is heavily curated to ensure good coverage of accents and identities, and a rigorous data processing pipeline ensures precise transcriptions and clean segmentation. None of the evaluated speakers were included in the pre-training dataset.


Training Procedure: Training follows a curriculum based on utterance length and applies QK-layer normalization to stabilize learning. The model is trained end-to-end without any fine-tuning or speaker adaptation.
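
QK-layer normalization simply normalizes queries and keys per head before the attention dot product, which bounds the attention logits and helps stabilize training. A minimal sketch follows; the dimensions are illustrative, not EXPRESS-Voice's actual configuration.

```python
# Minimal sketch of QK-layer normalization inside a self-attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(self.d_head)   # normalize queries per head
        self.k_norm = nn.LayerNorm(self.d_head)   # normalize keys per head

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)     # QK-norm before the dot product
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out(attn.transpose(1, 2).reshape(B, T, -1))
```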


Sampling Strategy: A decisive factor in the final quality of EXPRESS-Voice is its sampling strategy. Relying on standard top-p sampling alone led to unstable prosody and identity drift, so the system adopts a modified version of RAS sampling (inspired by VALL-E 2) enhanced with a repetition penalty. The NAR stage uses nucleus sampling with a conservative top-p threshold, yielding high-fidelity, stable voices.
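
To make these ingredients concrete, here is a generic sketch of nucleus (top-p) sampling combined with a repetition penalty over codec-token logits. It illustrates the general technique only; the actual RAS-based variant used in EXPRESS-Voice differs in its details.

```python
# Generic nucleus sampling with a repetition penalty (not the exact RAS variant).
import torch

def sample_next_token(logits: torch.Tensor, history: torch.Tensor,
                      top_p: float = 0.9, rep_penalty: float = 1.2) -> int:
    """logits: (vocab,) next-step logits; history: (t,) previously generated token ids."""
    logits = logits.clone()
    # Repetition penalty: make already-generated tokens less likely.
    prev = history.unique()
    logits[prev] = torch.where(logits[prev] > 0,
                               logits[prev] / rep_penalty,
                               logits[prev] * rep_penalty)

    # Nucleus filtering: keep the smallest set of tokens whose mass reaches top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < top_p      # always keeps at least the top token
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return int(torch.multinomial(filtered / filtered.sum(), 1))
```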

References

  • Liao, S., Wang, Y., Li, T., Cheng, Y., Zhang, R., Zhou, R., & Xing, Y. (2024). Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis. arXiv:2411.01156
  • Resemble AI. (2024). Chatterbox: An Open Source Text-to-Speech Model. GitHub Repository
  • Chen, S., Wang, C., et al. (2022). WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505–1518. DOI:10.1109/JSTSP.2022.3188113
  • Chen, S., Liu, S., Zhou, L., et al. (2024). VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers. arXiv:2406.05370
  • Ma, Z., Zheng, Z., et al. (2024). emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. Findings of ACL 2024. GitHub Repository
  • Kumar, R., Seetharaman, P., et al. (2023). High-Fidelity Audio Compression with Improved RVQGAN. arXiv:2306.06546
  • Li, X., Takamichi, S., et al. (2024). YODAS: Youtube-Oriented Dataset for Audio and Speech. arXiv:2406.00899
  • Kahn, J., Rivière, M., et al. (2020). Libri-Light: A Benchmark for ASR with Limited or No Supervision. ICASSP 2020. GitHub Repository
  • Braude, D.A., Aylett, M.P., Laoide-Kemp, C., Ashby, S., Scott, K.M., Raghallaigh, B.Ó., Braudo, A., Brouwer, A., Stan, A. (2019). All Together Now: The Living Audio Dataset. Proc. Interspeech 2019, 1521-1525, doi: 10.21437/Interspeech.2019-2448