Method
We conducted a comprehensive evaluation of EXPRESS-Voice by comparing
it to several leading instant voice cloning models, both open and
closed source. The evaluation combined subjective human ratings with
objective signal-level metrics across
17 internal speakers covering a diverse set of
English accents and dialects, including
American, British, Scottish, Irish, Chinese, Indian-British,
Turkish, French, Lithuanian, and others.
For both subjective and objective evaluations, we selected
13 target sentences from the Living Audio Dataset,
chosen to form a phonetically broad and balanced set. Each
speaker recorded these sentences to serve as ground truth. In
addition, we recorded a separate
20–30 second segment from each speaker to be used
exclusively for voice cloning. None of the ground truth recordings
were used during the cloning process.
Models Included in the Evaluation
| Model | Company | Type |
| --- | --- | --- |
| EXPRESS-Voice | Synthesia | Our Model |
| Chatterbox | Resemble AI | Open Source |
| Speech-02-hd | MiniMax | Closed Source |
| Fish-Speech | Fish Audio | Open Source |
| Eleven Multilingual V2 | ElevenLabs | Closed Source |
ElevenLabs has also released a v3 alpha model that is partially
available for use. As of June 2025 (the time of writing this report),
however, it remains in its alpha stage and is neither stable nor
accessible via API. For these reasons, we chose not to include it in
our evaluation.
Human Evaluation
To assess subjective quality, we conducted a blind listening study
with 100 independent third-party evaluators.
Participants were presented with sets of audio samples and asked to
identify which sounded most similar to a reference prompt spoken by
the same speaker reading a different utterance, enabling focused
evaluation of identity preservation. The study followed a MUSHRA-style
evaluation setup, and we analyzed responses by aggregating top-choice
selections to quantify model-level preference. Each evaluation set
also included the ground truth sample for reference.
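To make the aggregation concrete, here is a minimal sketch of how top-choice selections can be turned into per-model preference rates; the response format and field names are hypothetical, not our actual evaluation tooling.

```python
from collections import Counter

def top_choice_preference(responses):
    """Convert per-trial top-choice selections into per-model
    preference rates (fraction of trials in which each model's sample
    was picked as most similar to the reference)."""
    counts = Counter(r["top_choice"] for r in responses)  # hypothetical field name
    total = sum(counts.values())
    return {model: n / total for model, n in counts.items()}

# Hypothetical example: three trials from the listening study.
responses = [
    {"top_choice": "model_a"},
    {"top_choice": "model_b"},
    {"top_choice": "model_a"},
]
print(top_choice_preference(responses))  # {'model_a': 0.67, 'model_b': 0.33} (approx.)
```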
Objective Evaluation
Using the same 13 samples generated for the 17 identities, we computed
the following objective metrics:
- Speaker Similarity – Cosine similarity between speaker embeddings
  computed using the WavLM model from Microsoft.
- Emotional Similarity – Cosine similarity between emotion embeddings
  from the Emotion2Vec model by FunASR.
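As an illustration, the sketch below computes speaker similarity as a cosine similarity between per-utterance embeddings. The checkpoint microsoft/wavlm-base-plus-sv (the speaker-verification variant of WavLM, loaded via Hugging Face transformers) is an assumption on our part; the exact checkpoint and pipeline are not specified here. The emotional-similarity metric follows the same pattern, with Emotion2Vec embeddings in place of speaker embeddings.

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

# Assumption: the speaker-verification WavLM checkpoint; the exact model
# variant used in the evaluation is not specified in this report.
CHECKPOINT = "microsoft/wavlm-base-plus-sv"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = WavLMForXVector.from_pretrained(CHECKPOINT).eval()

def speaker_embedding(path: str) -> torch.Tensor:
    """Load an audio file, resample to 16 kHz mono, and return its
    x-vector speaker embedding."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).embeddings.squeeze(0)

def speaker_similarity(ref_path: str, gen_path: str) -> float:
    """Cosine similarity between the reference speaker's embedding and
    the cloned sample's embedding (1.0 = identical direction)."""
    ref, gen = speaker_embedding(ref_path), speaker_embedding(gen_path)
    return torch.nn.functional.cosine_similarity(ref, gen, dim=0).item()

# Hypothetical file names, for illustration only:
# print(speaker_similarity("ground_truth.wav", "cloned_sample.wav"))
```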