OpenAI's text-to-speech Voice Engine isn't immune to existing detection tools

Assessing the Ongoing Battle Against Voice Cloning Technology: Lessons from OpenAI's Preview of its New Text-to-Speech Model

The landscape of artificial intelligence continues to evolve and amaze. Among these advancements, audio deepfake technology stands out as a double-edged sword, presenting both challenges and opportunities, as OpenAI explained in a recent blog post previewing its new Voice Engine.

OpenAI previewed its Voice Engine model recently

The Rise of Synthetic Voices: OpenAI’s Voice Engine Model

On one hand, it offers impressive capabilities from a high-profile company in the world of AI, enabling the creation of synthetic voices that, to the human ear, sound identical to real ones. On the other hand, it is business as usual in the ValidSoft world of advanced speech science.

Over the past 16 months, the field of generative voice AI has seen notable advances, with prominent players such as ElevenLabs and Microsoft (with VALL-E) showcasing their innovations. These solutions have attracted attention for their ability to clone voices from just a few seconds of audio input. While undeniably impressive, it is worth noting that text-to-speech (TTS) technology, AI-generated or not, is not novel: a major step in deep-learning-based TTS can be traced back to 2016 with DeepMind's release of WaveNet.

Detectability and Security: Testing Against Deepfake Audio

Despite the sophistication of generative AI deepfake audio technology, it is far from immune to detection. Tests conducted by ValidSoft on examples of deepfake audio generated by OpenAI's Voice Engine show that they can be caught by ValidSoft's existing deepfake audio detection tools. Even our older-generation detectors, trained over years on audio produced by early DNN-based vocoders such as WaveNet, detect OpenAI's deepfake audio.

What do these findings tell us? Most importantly, that even state-of-the-art generative voice models rely on foundational components developed years ago, leaving detectable traces that can be learned and exploited for automatic detection.
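To make the idea of "detectable traces" concrete, here is a deliberately simplified sketch (not ValidSoft's actual detector, whose methods are not described in this post). It illustrates one classic artifact class: neural vocoders can exhibit unnatural spectral energy distribution, such as band-limited high-frequency content. The feature choice, 4 kHz cutoff, threshold, and function names are all illustrative assumptions.

```python
import numpy as np

def spectral_features(audio: np.ndarray, sr: int = 16000, n_fft: int = 512) -> np.ndarray:
    """Frame the signal, take magnitude FFTs, and summarise the spectrum.

    Illustrative only: real detectors learn far richer representations, but
    simple statistics can already expose crude synthesis artefacts such as
    missing or unnatural high-frequency energy.
    """
    hop = n_fft // 2
    frames = [audio[i:i + n_fft] for i in range(0, len(audio) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1))
    # Fraction of spectral energy above ~4 kHz, averaged over frames.
    cutoff_bin = int(4000 / sr * n_fft)
    hf_ratio = mags[:, cutoff_bin:].sum(axis=1) / (mags.sum(axis=1) + 1e-9)
    return np.array([hf_ratio.mean(), hf_ratio.std()])

def classify(features: np.ndarray, threshold: float = 0.05) -> str:
    """Toy decision rule: flag audio with implausibly little high-band energy."""
    return "synthetic" if features[0] < threshold else "genuine"
```

In practice such hand-crafted rules are replaced by classifiers trained on large corpora of genuine and vocoder-generated speech, which is precisely why years of training data on early vocoders still transfer to new models built on the same foundations.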

Future of Detection Tools: Advancements and Necessities

As noted in a recent article in The Guardian, automatic methods have an effective role to play. The fact that OpenAI's deepfake audio could be detected on the day of release, without fine-tuning or tailored training data, underscores the inherent detectability of synthetic voices. However, we cannot afford to be complacent: the pace at which new models appear demands continuous refinement and adaptation of detection tools to keep up with evolving deepfake techniques. The situation is no different from the virus/anti-virus "war".

The accumulation of training data offers opportunities for enhancing the robustness of detection mechanisms. By incorporating diverse datasets and leveraging advancements in machine learning algorithms, detection tools can achieve even greater accuracy and resilience in identifying newly emerging deepfake audio.

Moreover, watermarking represents another layer of defense against audio deepfakes. It is not infallible, however, and certainly no panacea: a watermark cannot be relied upon as definitive proof and is best treated as an indicator. By embedding unique identifiers within audio recordings, content creators can authenticate their content and mitigate, to some extent, the risk of manipulation or misuse.
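The embedding idea can be sketched in a few lines. This is a toy least-significant-bit scheme over 16-bit PCM, chosen only because it is the simplest way to show an identifier being hidden in and recovered from audio; it is emphatically not what production systems (or OpenAI's watermarking) do, since LSB marks do not survive compression or re-recording. Function names and the identifier are illustrative.

```python
import numpy as np

def embed_watermark(samples: np.ndarray, watermark: bytes) -> np.ndarray:
    """Hide an identifier in the least-significant bits of int16 PCM samples.

    Each watermark bit replaces the LSB of one sample, changing its value by
    at most 1 (inaudible). Real schemes use spread-spectrum / perceptual
    masking for robustness; this is a minimal illustration of the concept.
    """
    bits = np.unpackbits(np.frombuffer(watermark, dtype=np.uint8))
    out = samples.copy()
    out[:len(bits)] = (out[:len(bits)] & ~1) | bits  # clear LSB, then set it
    return out

def extract_watermark(samples: np.ndarray, n_bytes: int) -> bytes:
    """Read the LSBs back and repack them into the original bytes."""
    bits = (samples[:n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()
```

A verifier holding the expected identifier can then check whether an audio clip still carries it; absence of the mark is suggestive but, as noted above, never conclusive.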

The emergence of advanced TTS technology presents both opportunities and challenges in the world of audio manipulation. While advancements like OpenAI's Voice Engine demonstrate the remarkable capabilities of AI, the ability of detectors to discern its artificiality from day zero validates the ongoing effort to develop automatic detection tools of the kind built by our team at ValidSoft. By remaining vigilant and proactive, audio forensics tools can contribute significantly to preventing a descent into a lawless digital world.