March 21, 2023

The Misconception of Voice Cloning: It’s Not Dolly the Sheep

When we hear the word “cloning”, most of us imagine a sci-fi scene with a scientist in a white lab coat, surrounded by test tubes and microscopes, creating an exact replica of a living organism. When it comes to voice cloning, however, the term is rather misleading. Putting one of these so-called cloned voices under our own microscope, we recently showed that the fake Emma Watson voice, despite sounding natural and like the famous actress to the human ear, failed both our biometrics and naturalness checks. Failing the biometrics check means that the voice did not match a voiceprint created from Emma Watson’s authentic audio. Failing the naturalness check means that the speech contained enough segments with the artifacts typical of artificially created voices to conclude that it was synthetic. So where are we with this technology, and is the word “cloning” really appropriate?

The Limitations of Audio Deepfakes and the Inaccuracy of the Term “Voice Cloning”

Creating audio deepfakes, sometimes referred to as voice cloning, is the process of producing a digital copy of someone’s voice. While it’s true that this technology uses complex algorithms to analyze recordings of a speaker and generate new audio that sounds like that person, the result is not an exact copy. In this case the term “cloning” is not accurate, because it implies a level of precision and duplication that is not currently possible with the technology, as we demonstrated recently with our audio analysis of the Emma Watson case.

The term is accurate when it comes to cloning living organisms. For example, Dolly the sheep, the first mammal to be cloned, was a true clone whose nuclear DNA was identical to that of the donor sheep. This is quite different from voice cloning, where a system that analyzes a few seconds of audio will have little knowledge of the full characteristics of a person’s voice. While audio deepfakes can mimic certain aspects of a person’s voice, such as pitch and tone, they can’t replicate the full range of nuances and variations that make that voice distinctive.

What does it mean in practice? It means we need to dig deeper than what we read in some sensationalist headlines. A system working from a few seconds of a speaker’s recorded speech only has information about how this speaker sounds during those few seconds. The rest is an approximation based on the generative model and, indirectly, on the data used to train it. We can experience this approximation when using some of the available tools. First, there is anecdotal evidence. Being a French native living in the UK, it’s quite something to hear so-called cloned voices of myself and my relatives with perfect… American intonations. The other evidence comes from the lab. As the technology only partially transfers a speaker’s characteristics, some of these fake voices are quite poor at fooling voice biometrics systems. And for those tests we didn’t use just a few seconds: we gave the voice-mimicking systems a fair advantage, with minutes of speech from which to build fake voices.
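
To make the idea of a biometrics check a little more concrete, here is a minimal sketch in Python. It assumes that a voice has already been turned into a fixed-length vector (an embedding) by some speaker model, which is not shown. The function names, the toy vectors and the 0.8 threshold are all hypothetical, chosen purely for illustration; this is not the system we used in our tests.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def matches_voiceprint(voiceprint, sample, threshold=0.8):
    """Accept the sample only if it is close enough to the enrolled voiceprint.

    The threshold is an illustrative assumption; a real system would
    calibrate it on labelled genuine and impostor recordings.
    """
    return cosine_similarity(voiceprint, sample) >= threshold


# Toy 4-dimensional embeddings (hypothetical numbers, not real speaker data).
enrolled = [0.9, 0.1, 0.4, 0.2]       # voiceprint built from authentic audio
genuine = [0.85, 0.15, 0.45, 0.2]     # new recording of the same speaker
imitation = [0.6, 0.7, 0.1, 0.5]      # synthetic voice: similar on some axes only

print(matches_voiceprint(enrolled, genuine))    # accepted
print(matches_voiceprint(enrolled, imitation))  # rejected
```

In this toy example the imitation resembles the enrolled voiceprint along some dimensions but not others, so its similarity falls below the threshold. That is the intuition behind a partial transfer of speaker characteristics: close enough for the ear, not close enough for the voiceprint.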

The Catchy Misconception of “Voice Cloning” and the Need for Accurate Terminology
Years of headlines, from Adobe’s preview of VoCo, the “Photoshop for voice”, in 2016 to the current wave of celebrity imitations in 2023.

Ultimately, we need a more objective view of where the technology really stands today and of the reality behind the headlines and the media hype, along with a constructive debate that is less heated and more conclusive. That starts with being clear about the capabilities of the technology and the terms used to describe it. The term “cloning” is catchier and more attention-grabbing than more accurate terms such as “voice mimicking” or “voice synthesis”. Some of us are old enough (time for another mention of my grey hair in a blog post!) to remember the controversy when news of Dolly, the first cloned mammal, born in 1996, hit the headlines, and the ethical questions it raised. Fast forward a quarter of a century and, in a world shaken by the sea changes brought by AI, the word “cloning” is preferred in the promotion of some of the novel speech synthesis services. The buzz and awe it carries outweigh its controversial nature.

In conclusion, the term “voice cloning” is not an accurate representation of the technology. Instead, we could start using more accurate terms such as “voice synthesis” or “automatic voice imitation” to describe this process. By doing so, we can help dispel the misconception that voice cloning creates an exact replica of someone’s voice, bring more objectivity and balance to the much-needed debate about the rise of generative AI, and raise awareness of the limitations and potential risks of this technology.

Let me be clear. While I highlight limitations, I am also aware of the boom in generative AI and of the potential of ever-larger models when it comes to voice creation, for good and for bad. It is remarkable to see the pace at which progress is being made in this field. As a Machine Learning practitioner, I am keenly following the advances in the two generative technologies, text-to-speech and voice conversion, while also leading the charge on technologies that can detect the content they create, such as voice biometrics and deepfake detection.

This is an arms race, and it is essential that we continue to push the boundaries of what is possible on all fronts. One thing is sure: as AI advances the capabilities of voice synthesis, being able to tell a genuine voice from a synthetic one becomes critical!