Authors:
(1) Zengyi Qin, MIT & MyShell.ai (email: [email protected]);
(2) Wenliang Zhao, Tsinghua University;
(3) Xumin Yu, Tsinghua University;
(4) Xin Sun, MyShell.ai.
3 Experiment
Evaluating voice cloning objectively is difficult for several reasons. First, different studies (e.g., [8], [2]) usually use different training and test sets, so numerical comparisons can be intrinsically unfair. Even though metrics such as Mean Opinion Score can be collected by crowdsourcing, the diversity and difficulty of the test set significantly influence the results. For example, if many samples in the test set are neutral voices that concentrate around the mean of the human voice distribution, then it is relatively easy for most methods to achieve good voice cloning results. Second, the scale and diversity of the training sets also differ across studies and have a considerable influence on the results. Third, different studies focus on different core functionalities. OpenVoice mainly aims at tone color cloning, flexible control over style parameters, and making cross-lingual voice cloning easy even without massive-speaker data for a new language. These aims differ from the objectives of previous work on voice cloning or zero-shot TTS. Therefore, instead of comparing numerical scores with existing methods, we mainly focus on analyzing the qualitative performance of OpenVoice itself, and we make the audio samples publicly available so that other researchers can freely evaluate them.
Accurate Tone Color Cloning. We build a test set of reference speakers selected from celebrities, game characters, and anonymous individuals. The test set covers a wide voice distribution, including both expressive, unique voices and neutral samples of the human voice distribution. With any of the 4 base speakers and any of the reference speakers, OpenVoice is able to accurately clone the reference tone color and generate speech in multiple languages and accents. We invite readers to visit this website (footnote 5) for qualitative results.
Flexible Control on Voice Styles. A premise for the proposed framework to flexibly control speech styles is that the tone color converter modifies only the tone color and preserves all other styles and voice properties. To confirm this, we use both our base speaker model and Microsoft TTS with SSML to generate a speech corpus of 1K samples with diverse styles (emotion, accent, rhythm, pauses, and intonation) as the base voices. After converting them to the reference tone color, we observed that all styles are well preserved. In rare cases, the emotion is slightly neutralized. One fix we found is to replace the tone color embedding vector of that particular sentence with the average vector of multiple sentences with different emotions from the same base speaker. This gives less emotion information to the flow layers, so that they do not eliminate the emotion. Since the tone color converter preserves all the styles of the base voice, controlling the voice styles becomes straightforward: one simply manipulates the base speaker TTS model. The qualitative results are publicly available on this website (footnote 6).
Cross-Lingual Voice Clone with Ease. OpenVoice achieves near zero-shot cross-lingual voice cloning without using any massive-speaker data for an unseen language. It does require a base speaker for that language, which can be obtained with minimal difficulty using off-the-shelf models and datasets. On our website (footnote 7), we provide abundant samples that demonstrate the cross-lingual voice cloning capabilities of the proposed approach. The cross-lingual capabilities are two-fold:
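The embedding-averaging fix described above can be illustrated with a minimal sketch. The function name `average_embedding`, the 4-dimensional vectors, and the emotion labels are all hypothetical and chosen only for illustration; the source does not state the real embedding dimension or how many sentences are averaged.

```python
from typing import List


def average_embedding(embeddings: List[List[float]]) -> List[float]:
    """Return the element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    n = len(embeddings)
    return [sum(e[i] for e in embeddings) / n for i in range(dim)]


# Hypothetical tone color embeddings of ONE base speaker, each extracted
# from a sentence spoken with a different emotion (toy values).
per_sentence = [
    [1.0, 0.0, 2.0, 4.0],  # e.g. cheerful
    [3.0, 2.0, 2.0, 0.0],  # e.g. sad
    [2.0, 4.0, 2.0, 2.0],  # e.g. angry
]

# The averaged vector stands in for the per-sentence embedding, so that
# sentence-specific emotion cues are diluted before conversion.
base_tone_color = average_embedding(per_sentence)
print(base_tone_color)
```

The intuition, under these assumptions, is that components that vary with emotion tend to cancel in the mean, while components shared across all sentences of the same speaker remain.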
• When the language of the reference speaker is unseen in the MSML dataset, the model is able to accurately clone the tone color of the reference speaker.
• When the language of the generated speech is unseen in the MSML dataset, the model is able to clone the reference voice and speak in that language, as long as the base speaker TTS supports that language.
Fast Inference with Low Cost. Since OpenVoice is a feed-forward structure without any autoregressive component, it achieves very high inference speed. Our experiment shows that a slightly optimized version of OpenVoice (including the base speaker model and the tone color converter) is able to achieve 12× real-time performance on a single A10G GPU, which means it takes only about 85 ms to generate one second of speech. Through detailed GPU usage analysis, we estimate that the upper bound is around 40× real-time, but we leave this improvement as future work.
Importance of IPA. We found that using IPA as the phoneme dictionary is crucial for the tone color converter to perform cross-lingual voice cloning. As detailed in Section 2.3, when training the tone color converter, the text is first converted into a sequence of phonemes in IPA, and each phoneme is then represented by a learnable vector embedding. The sequence of embeddings is encoded with transformer layers, and a loss is computed against the output of the flow layers, with the aim of eliminating tone color information. IPA itself is a unified cross-lingual phoneme dictionary, which enables the flow layers to produce a language-neutral representation. Even if we feed speech audio in an unseen language to the tone color converter, it is still able to process the audio smoothly. We also experimented with other types of phoneme dictionaries, but the resulting tone color converter tends to mispronounce some phonemes in unseen languages. Although the input audio is correct, there is a high likelihood that the output audio is problematic and sounds non-native.
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
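The relation between a real-time factor and per-second latency is simple arithmetic, sketched below. The helper names are hypothetical and not part of OpenVoice. Exactly 12× real-time corresponds to about 83 ms per second of audio, so the quoted 85 ms figure corresponds to slightly under 12×.

```python
def ms_per_audio_second(real_time_factor: float) -> float:
    """Compute time in milliseconds needed to synthesize 1 s of audio."""
    if real_time_factor <= 0:
        raise ValueError("real_time_factor must be positive")
    return 1000.0 / real_time_factor


def speedup(audio_seconds: float, compute_seconds: float) -> float:
    """Real-time factor: seconds of audio produced per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


print(ms_per_audio_second(12))   # latency at the reported 12x speed
print(ms_per_audio_second(40))   # latency at the estimated 40x upper bound
print(speedup(1.0, 0.085))       # real-time factor implied by 85 ms per second
```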
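One plausible intuition for why a shared phoneme inventory helps can be shown with a toy sketch. It is not the authors' implementation: the phoneme sequences, the language-tagged symbol scheme, and the function names are all hypothetical, and the source does not say which alternative dictionaries were tried. The sketch only shows that symbols from an unseen language can land on embedding indices already learned when the inventory is shared, and cannot when each language has its own symbols.

```python
from typing import Dict, List


def build_vocab(utterances: List[List[str]]) -> Dict[str, int]:
    """Assign one embedding index to each distinct phoneme symbol."""
    vocab: Dict[str, int] = {}
    for phonemes in utterances:
        for p in phonemes:
            if p not in vocab:
                vocab[p] = len(vocab)
    return vocab


def coverage(vocab: Dict[str, int], phonemes: List[str]) -> float:
    """Fraction of phonemes that already have an embedding index."""
    if not phonemes:
        raise ValueError("empty phoneme sequence")
    return sum(1 for p in phonemes if p in vocab) / len(phonemes)


# Toy training data written with a shared, IPA-like inventory.
shared_train = [["m", "a", "n"], ["s", "o", "l"]]
# The same toy data with hypothetical language-tagged symbols.
tagged_train = [["en_m", "en_a", "en_n"], ["en_s", "en_o", "en_l"]]

# A toy utterance from a language not present in training.
shared_unseen = ["s", "a", "l", "o", "n"]
tagged_unseen = ["xx_s", "xx_a", "xx_l", "xx_o", "xx_n"]

shared_cov = coverage(build_vocab(shared_train), shared_unseen)
tagged_cov = coverage(build_vocab(tagged_train), tagged_unseen)
print(shared_cov, tagged_cov)
```

In this toy setting every phoneme of the unseen utterance reuses a trained index under the shared inventory, while none does under the language-tagged scheme.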
5 https://research.myshell.ai/open-voice/accurate-tone-color-cloning
6 https://research.myshell.ai/open-voice/flexible-voice-style-control
7 https://research.myshell.ai/open-voice/zero-shot-cross-lingual-voice-cloning