Illustrating current detection capabilities
Although detecting audio deepfakes has proven challenging, ValidSoft have always been at the forefront of this area of speech science and have been integrating the latest AI audio deepfake detection into their processes for years. From 2015 to 2017, ValidSoft were part of a consortium of academic and private partners in the multimillion-euro, EU-funded H2020 Octave project. In this project, ValidSoft worked with two of the research institutes that are long-term organizers of ASVspoof, the most famous series of academic challenges for audio fake detection. It is worth noting that the term “deepfake” was coined in late 2017 by a Reddit user of the same name, years after ValidSoft’s first release of a fake audio detector.
While VALL-E cannot be fully tested because no pre-trained model has yet been made available, the few examples shared on the VALL-E page can be distinguished as fake by ValidSoft’s synthetic voice detection module when compared with genuine audio.
Profiling VALL-E and genuine audio files
The same synthetic voice detection module also recently flagged a fake Emma Watson voice created with ElevenLabs TTS.
Challenge and future
Detecting audio deepfakes remains an ongoing process and poses specific challenges over the telephony channel. Unlike high-quality laptop or smartphone recordings, legacy telephony channels carry less information and have a narrower frequency bandwidth. Overall audio quality is therefore lower, and some of the artifacts created by neural vocoders can be lost. Nevertheless, ValidSoft’s detection capabilities work effectively and offer a further tool, or layer, of protection. The future of combating audio deepfakes will involve a mix of detection techniques, including audio watermarking, and mitigation strategies. Recently, companies like Resemble.ai and Microsoft have announced that they will include audio watermarking in their TTS products. Given the rapid advancement of generative AI technology, new detection and mitigation strategies will likely need to be continually developed and refined to keep up with the changing landscape of audio deepfakes.
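To make the telephony bandwidth point concrete, here is a minimal sketch of how band-limiting can remove high-frequency content from a recording. It is an illustration only and does not represent ValidSoft’s detection module. It assumes a narrowband passband of roughly 300 to 3400 Hz, a figure commonly quoted for legacy telephony, and it uses a 6 kHz tone as a purely hypothetical stand-in for a vocoder artifact; where real artifacts sit in the spectrum depends on the synthesis system. The function names are invented for this example.

```python
import numpy as np


def band_energy(signal, sample_rate, low_hz, high_hz):
    """Return the spectral energy of `signal` between low_hz and high_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    mask = (freqs >= low_hz) & (freqs < high_hz)
    return float(np.sum(np.abs(spectrum[mask]) ** 2))


def simulate_narrowband(signal, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Crude telephony model (assumption): zero every bin outside the passband."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


sample_rate = 16000                       # wideband recording, one second long
t = np.arange(sample_rate) / sample_rate

voice_like = np.sin(2 * np.pi * 1000 * t)           # in-band, speech-like tone
artifact_like = 0.1 * np.sin(2 * np.pi * 6000 * t)  # hypothetical artifact
wideband = voice_like + artifact_like
narrowband = simulate_narrowband(wideband, sample_rate)

# Compare energy in the "artifact" band and the "voice" band before and after.
artifact_before = band_energy(wideband, sample_rate, 5000, 7000)
artifact_after = band_energy(narrowband, sample_rate, 5000, 7000)
voice_before = band_energy(wideband, sample_rate, 500, 1500)
voice_after = band_energy(narrowband, sample_rate, 500, 1500)

print("artifact band energy:", artifact_before, "->", artifact_after)
print("voice band energy:   ", voice_before, "->", voice_after)
```

In this toy setup the in-band tone passes through essentially unchanged while the 6 kHz component disappears, which is the kind of information loss a detector working on telephony audio has to cope with. A real channel also involves codecs, resampling, and noise, none of which are modelled here.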
The development of audio deepfake technology raises important questions about its ethics and potential consequences. While the technology itself is not inherently good or bad, its potential for misuse underscores the need for responsible use and regulation of AI technologies. It is vital to consider carefully the implications of new technologies like audio deepfakes, to take steps to ensure that they are used ethically and responsibly, and to take proactive measures to combat nefarious uses and new fraud vectors.