Introduction

A person's voice is one of the most intimate and personal aspects of their identity. Accent, dialect, intonation, and rhythm carry a person's heritage, culture, and lived experiences - and, to some extent, their personality. Today's digital voice cloning models often fail to capture these subtleties, shifting the voice toward more generic American- or British-sounding speech regardless of the input.


EXPRESS-Voice is a new model developed by Synthesia that raises the bar for instant voice cloning - the creation of a realistic digital replica of a voice from a short audio sample, often just seconds long. It excels not only at preserving speaker identity and accent but also at generating speech that is expressive, emotionally rich, and naturally intonated. This report presents the results of a comprehensive subjective and objective evaluation comparing EXPRESS-Voice against the current leading open- and closed-source models.


While this evaluation focuses on identity preservation and voice quality, EXPRESS-Voice also unlocks capabilities that go beyond traditional cloning - enabling expressive, emotionally aware synthesis that stays coherent with the speaker's style. We explore these capabilities in a dedicated Beyond Cloning section later in the report.

Evaluation

Human Evaluation

EXPRESS-Voice Best Preserves Speaker Identity Across English Accents

In a blind listening study, 100 native English-speaking evaluators were asked to judge which voice clone best matched the identity of the original speaker.

EXPRESS-Voice was rated highest across 17 cloned identities, representing a broad range of native and non-native English accents - including American, British, Irish, Indian, Chinese, Turkish, French, and more.


EXPRESS-Voice is Most Preferred for Speaker Identity Matching

In blind pairwise comparisons, listeners consistently preferred EXPRESS-Voice over other models as the best match to the original speaker's identity and accent. Evaluations covered 17 native and non-native English speakers from a wide range of accents.


Objective Evaluation

[Charts: Speaker Similarity (higher is better) and Emotional Similarity (higher is better).]

Audio Examples

We present one utterance from each of 10 speakers, cloned with 5 different models. The ground truth is the same sentence spoken by the original speaker.


[Audio examples: ground truth plus clones from Synthesia, ElevenLabs, MiniMax, Fish Audio, and Resemble AI for speakers with Chinese, Irish, Turkish, Polish, Scottish, Lithuanian, Irish, Russian, Australian, and Yorkshire accents.]

Beyond Cloning

EXPRESS-Voice isn't just a cloning model - it's expressive, flexible, and coherent with the speaker's original emotionality. Listen below to how EXPRESS-Voice generates naturally emotive speech that retains the original speaker's expressive delivery.

Expressive Transfer

Depending on the emotional characteristics of the speaker, EXPRESS-Voice can generate realistic and natural emotional speech. The effect is most pronounced when the target text aligns with the emotional state captured in the cloning sample.


[Audio examples: voice cloning input and generated speech for the following speaker styles - Happy, Show Host, Relieved, Inspirational, Angry, Sad, Vocal Fry, Excited, Concerned.]

Method

We conducted a comprehensive evaluation of EXPRESS-Voice by comparing it to several leading instant voice cloning models - both open and closed source. The evaluation combined subjective human ratings with objective signal-level metrics across 17 internal speakers covering a diverse set of English accents and dialects, including American, British, Scottish, Irish, Chinese, Indian-British, Turkish, French, Lithuanian, and others.

For both subjective and objective evaluations, we took 13 phonetically broad and balanced target sentences from the Living Audio Dataset. Each speaker recorded these sentences to serve as ground truth. In addition, we recorded a separate 20–30 second segment from each speaker to be used exclusively for voice cloning. None of the ground truth recordings were used during the cloning process.

Models Included in the Evaluation
Model                    Company       Type
EXPRESS-Voice            Synthesia     Our Model
Chatterbox               Resemble AI   Open Source
Speech-02-hd             MiniMax       Closed Source
Fish-Speech              Fish Audio    Open Source
Eleven Multilingual V2   ElevenLabs    Closed Source
We are aware that ElevenLabs has released a new v3 alpha model that is partially available for use. As of June 2025 (the time of writing this report), however, it remains in its alpha stage and is not yet stable or accessible via API, so we chose not to include it in our evaluation.

Human Evaluation

To assess subjective quality, we conducted a blind listening study with 100 independent third-party evaluators. Participants were presented with sets of audio samples and asked to identify which sounded most similar to a reference prompt spoken by the same speaker reading a different utterance - enabling focused evaluation of identity preservation. The study followed a MUSHRA-style evaluation setup, and we analyzed responses by aggregating top-choice selections to quantify model-level preference. Each evaluation set also included the ground truth sample for reference.
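
As an illustration, aggregating top-choice selections into a model-level preference score takes only a few lines. The sketch below is hypothetical: the CSV layout and column names are assumptions, not the study's actual data schema.

```python
# Hypothetical sketch: turn MUSHRA-style ratings into model-level preference
# by counting how often each model was an evaluator's top choice per trial.
import pandas as pd

# Assumed layout: one row per (evaluator, trial, model) rating.
ratings = pd.read_csv("listening_study.csv")  # columns: evaluator_id, trial_id, model, score

# For every (evaluator, trial) pair, pick the model with the highest score.
top_choices = ratings.loc[
    ratings.groupby(["evaluator_id", "trial_id"])["score"].idxmax(), "model"
]

# Share of trials in which each model was the top choice.
preference = top_choices.value_counts(normalize=True).sort_values(ascending=False)
print(preference)
```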

Objective Evaluation

Using the same 13 samples generated for the 17 identities, we computed the following objective metrics (a minimal computation sketch follows the list):

  • Speaker Similarity – Cosine similarity between speaker embeddings computed using the WavLM model from Microsoft.
  • Emotional Similarity – Cosine similarity between emotion embeddings from the Emotion2Vec model by FunASR.
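
Below is a minimal sketch of the speaker-similarity computation. The Hugging Face checkpoint name (microsoft/wavlm-base-plus-sv) and the audio-loading details are assumptions; the report only specifies cosine similarity between WavLM speaker embeddings, with emotional similarity computed analogously from emotion2vec embeddings.

```python
# Minimal sketch: cosine similarity between WavLM speaker embeddings.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def load_16k(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)          # (channels, samples)
    wav = wav.mean(dim=0)                    # downmix to mono
    return torchaudio.functional.resample(wav, sr, 16_000)

def speaker_similarity(path_a: str, path_b: str) -> float:
    inputs = feature_extractor(
        [load_16k(path_a).numpy(), load_16k(path_b).numpy()],
        sampling_rate=16_000, return_tensors="pt", padding=True,
    )
    with torch.no_grad():
        emb = model(**inputs).embeddings     # (2, embedding_dim)
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()

# Emotional similarity follows the same pattern, with utterance-level
# embeddings from FunASR's emotion2vec model in place of WavLM.
```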

Model

What enables EXPRESS-Voice to achieve such strong identity preservation, even across a wide range of accents and speakers? Let's take a closer look at the architecture, data design, and training procedure that make this possible.


Architecture: EXPRESS-Voice employs a two-stage Transformer architecture composed of an autoregressive (AR) model and a non-autoregressive (NAR) model, each with 800 million parameters. Both models operate directly on graphemes (text tokens) and are conditioned on reference audio, without relying on an explicit speaker embedding. This design lets the AR model first generate the coarse prosodic and phonetic structure, which the NAR model then refines with fine-grained acoustic detail.
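
The two-stage flow can be summarized schematically as below. The interfaces (generate_coarse, refine_level) are hypothetical stand-ins, not EXPRESS-Voice's actual API; only the overall structure - AR prediction of the first codebook level, NAR refinement of the remaining levels, conditioning on graphemes and reference audio - reflects the description above.

```python
# Schematic of the AR + NAR generation flow over RVQ codebook levels
# (interfaces are illustrative placeholders, not EXPRESS-Voice's actual API).
import torch

def clone_utterance(ar_model, nar_model, text_tokens, ref_codes, n_codebooks=8):
    """text_tokens: (T_text,) grapheme ids; ref_codes: (n_codebooks, T_ref) RVQ codes."""
    # Stage 1: the AR model predicts the first codebook level token by token,
    # conditioned on the grapheme sequence and the reference audio codes.
    coarse = ar_model.generate_coarse(text_tokens, ref_codes)             # (T_out,)

    # Stage 2: the NAR model fills in the remaining codebook levels,
    # each level conditioned on all previously generated levels.
    codes = [coarse]
    for level in range(1, n_codebooks):
        codes.append(nar_model.refine_level(level, torch.stack(codes),
                                            text_tokens, ref_codes))      # (T_out,)
    return torch.stack(codes)                                             # (n_codebooks, T_out)
```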


Tokenizer: The system employs Descript's residual vector quantization (RVQ) tokenizer to discretize acoustic representations. This choice enables efficient modeling while retaining high-fidelity audio during generation.
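
For readers who want to experiment with this family of tokenizers, the open-source descript-audio-codec package exposes an encode/decode interface along the lines sketched below. The pretrained 44 kHz checkpoint and file names are illustrative; the exact codec configuration used inside EXPRESS-Voice is not specified here.

```python
# Sketch using the open-source descript-audio-codec package
# (pip install descript-audio-codec); configuration is illustrative only.
import dac
from audiotools import AudioSignal

model_path = dac.utils.download(model_type="44khz")   # pretrained public checkpoint
codec = dac.DAC.load(model_path).eval()

signal = AudioSignal("reference.wav")
x = codec.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = codec.encode(x)   # codes: (batch, n_codebooks, frames) discrete tokens

reconstructed = codec.decode(z)             # back to a waveform for a fidelity check
```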


Training Data: The model is trained on a large internal dataset composed of high-quality, human-annotated studio recordings, as well as open-domain corpora such as YODAS and LibriLight. The majority of the data is heavily curated to ensure good coverage of accents and identities, and a rigorous data processing pipeline ensures precise transcriptions and clean segmentation. None of the evaluated speakers were included in the pre-training dataset.


Training Procedure: Training follows a curriculum based on utterance length and applies QK-layer normalization to stabilize learning. The model is trained end-to-end without any fine-tuning or speaker adaptation.
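
QK-layer normalization simply normalizes queries and keys per head before the attention dot product, which bounds the attention logits and helps stabilize training. A minimal sketch follows; the dimensions are illustrative, not EXPRESS-Voice's actual configuration.

```python
# Minimal sketch of QK-layer normalization inside a self-attention block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(self.d_head)   # normalize queries per head
        self.k_norm = nn.LayerNorm(self.d_head)   # normalize keys per head

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)     # QK-norm before the dot product
        attn = F.scaled_dot_product_attention(q, k, v)
        return self.out(attn.transpose(1, 2).reshape(B, T, -1))
```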


Sampling Strategy: A decisive factor in the final quality of EXPRESS-Voice is its sampling strategy. Relying on standard top-p sampling alone led to unstable prosody and identity drift, so the system adopts a modified version of RAS sampling (inspired by VALL-E 2) enhanced with a repetition penalty. The NAR stage uses nucleus sampling with a conservative top-p threshold, yielding high-fidelity, stable voices.
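
To make these ingredients concrete, here is a generic sketch of nucleus (top-p) sampling combined with a repetition penalty over codec-token logits. It illustrates the general technique only; the actual RAS-based variant used in EXPRESS-Voice differs in its details.

```python
# Generic nucleus sampling with a repetition penalty (not the exact RAS variant).
import torch

def sample_next_token(logits: torch.Tensor, history: torch.Tensor,
                      top_p: float = 0.9, rep_penalty: float = 1.2) -> int:
    """logits: (vocab,) next-step logits; history: (t,) previously generated token ids."""
    logits = logits.clone()
    # Repetition penalty: make already-generated tokens less likely.
    prev = history.unique()
    logits[prev] = torch.where(logits[prev] > 0,
                               logits[prev] / rep_penalty,
                               logits[prev] * rep_penalty)

    # Nucleus filtering: keep the smallest set of tokens whose mass reaches top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < top_p      # always keeps at least the top token
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return int(torch.multinomial(filtered / filtered.sum(), 1))
```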

References

  • Liao, S., Wang, Y., Li, T., Cheng, Y., Zhang, R., Zhou, R., & Xing, Y. (2024). Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis. arXiv:2411.01156
  • Resemble AI. (2024). Chatterbox: An Open Source Text-to-Speech Model. GitHub Repository
  • Chen, S., Wang, C., et al. (2022). WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE Journal of Selected Topics in Signal Processing, 16(6), 1505–1518. DOI:10.1109/JSTSP.2022.3188113
  • Chen, S., Liu, S., Zhou, L., et al. (2024). VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers. arXiv:2406.05370
  • Ma, Z., Zheng, Z., et al. (2024). emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. Findings of ACL 2024. GitHub Repository
  • Kumar, R., Seetharaman, P., et al. (2023). High-Fidelity Audio Compression with Improved RVQGAN. arXiv:2306.06546
  • Li, X., Takamichi, S., et al. (2024). YODAS: Youtube-Oriented Dataset for Audio and Speech. arXiv:2406.00899
  • Kahn, J., Rivière, M., et al. (2020). Libri-Light: A Benchmark for ASR with Limited or No Supervision. ICASSP 2020. GitHub Repository
  • Braude, D.A., Aylett, M.P., Laoide-Kemp, C., Ashby, S., Scott, K.M., Raghallaigh, B.Ó., Braudo, A., Brouwer, A., Stan, A. (2019). All Together Now: The Living Audio Dataset. Proc. Interspeech 2019, 1521-1525, doi: 10.21437/Interspeech.2019-2448