Detecting AI Synthetic Voice | Identifying Deepfakes and Robocalls at the Speed of Sound
Synthetic audio and video generated with advanced Artificial Intelligence techniques give cybercriminals a new avenue for fraud, and give voice biometrics a new application in fraud prevention.
Synthetic Voice and Robocalls
An increasingly common and unwelcome experience for many people is receiving a “synthetic call” or “robocall”: a computer-generated call, typically made for marketing purposes but increasingly used for fraud. Beyond being annoying and disconcerting, such calls can leave the recipient feeling bewildered and anxious, particularly where fraud is suspected. They can also cause chaos for automated Interactive Voice Response (IVR) systems, the computer-based systems operated by contact centers, banks, insurance companies, government agencies and so on, many of which use menu-based options for self-service or information gathering.
Because IVRs rely on the caller choosing an option by voice command or keypad tone (DTMF), a robocall, robotically delivering its message, never follows the instructions and so causes the IVR to hand the call to a human agent, a costly and time-wasting exercise.
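To make the handoff logic concrete, here is a minimal, hypothetical sketch of the IVR flow just described: a caller who never answers the prompt with a valid DTMF tone falls through to a human agent. All names and the menu itself are illustrative, not any real IVR platform's API.

```python
def handle_call(prompt_caller, max_attempts=2):
    """Play a menu prompt and route the call.

    prompt_caller: callable that plays a prompt and returns the caller's
    DTMF response (a digit string), or None if no valid input arrived.
    """
    for _ in range(max_attempts):
        response = prompt_caller("Press 1 for sales, 2 for support.")
        if response in {"1", "2"}:
            return f"route:menu_{response}"
    # A robocall talks over the prompt and never presses a key, so it
    # exhausts the attempts and lands on a costly human agent.
    return "route:human_agent"
```

A compliant human caller (`prompt_caller` returning `"1"`) is routed to the menu option, while a robocall (returning `None` every time) ends up at the agent queue, which is exactly the cost problem the text describes.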
These robocalls consist of synthetic speech: audio created by a computer program and intended to emulate a human voice. For marketing robocalls the quality is not particularly important, as it is simply a volume game; a recent estimate put the volume of robocalls in the US alone at over 4 billion per month. Though the Telephone Robocall Abuse Criminal Enforcement and Deterrence Act was passed in the US in December 2019, Covid-19 has seen a surge of calls offering non-existent PPE and fake testing kits. And whilst robocalls are, on the whole, merely a nuisance, the use of Artificial Intelligence to generate synthetic audio and video has taken a more sinister turn.
Deepfakes – A New Threat
Deepfakes are nefarious computer-generated audio and video, a by-product of advances in Artificial Intelligence. Designed to sound and look exactly like the human they mimic, the technology is now sufficiently advanced to fool the unsuspecting, and perhaps even the suspecting. With the availability of deepfake-generating software tools, the technology has been brought within the reach of fraudsters and cybercriminals, and, as always, they will devise novel ways of deploying it for monetary gain.
In an enterprise environment, the key to exploiting deepfake technology will be making the cybercriminal’s deepfake readily identifiable as somebody’s boss, colleague or external partner. Nobody wants to challenge the identity of a known person of authority who looks and/or sounds perfectly normal and familiar.
And this is exactly what occurred in 2019 in the first reported case of deepfake fraud. According to the Wall Street Journal, the CEO of an unnamed UK firm received a phone call from supposedly the CEO of the firm’s German parent company, requesting an urgent transfer of €220,000 ($243,000) to a Hungarian supplier. As the UK CEO recognised the German CEO’s voice, including “the hint of a German accent and the same melody”, he complied with the request. The money was subsequently moved from Hungary to a Mexican account and then further disbursed. The details of the attack, but not the company, were shared with the WSJ by the company’s insurer.
Apart from confirming that this form of deepfake attack is now in the wild and no longer just theoretical, the case also confirms that fraudsters are using synthetic “voice skin” technology, as distinct from static deepfake recordings. A synthetic voice skin allows the fraudster to speak with the target in a conversational manner, the fraudster’s voice being converted by the “skin” to sound like the impersonated voice. Whereas voice skins have previously been of lower quality than static deepfake recordings, the technology used in this incident was clearly good enough to fool a CEO who would easily recognise his colleague’s voice.
As with any fraud vector that is shown to work, it will only become more common, be used in ever more original ways and in the case of deepfakes, the technology will evolve, allowing them to become even more realistic.
Unified Communications and Omni-Channel strategies mean organisations, including banks, will increasingly communicate with their customers using browser-based video/audio for instance. This could be with a human agent, but in the future also Artificial Intelligence (AI) based agents.
Imagine, therefore, a video/audio conversation between a high net-worth client and their private banker. If the client looks and sounds authentic, and of course can provide the answers to any security questions (as they invariably would), why would the banker not acquiesce to any instructions the client gives?
Covid-19 will only exacerbate the problem. With so many people working from home, a trend that is sure to continue in the future, face-to-face interactions are now via video and voice teleconferencing, providing increased opportunities for deepfake fraud attacks. Many people, not just executives of the firm, now have their faces and voices online, whether through work teleconferencing, social gatherings (quiz nights a common event), or tele-medicine, which provides the input to train the deepfake engine. This is the new “normal” and when combined with existing social media content, or just as likely content from their own company’s web site, the fraudster has a ready-made treasure chest of data to exploit.
ValidSoft and the Detection of Synthetic Voice
Only advanced voice biometric engines that can discriminate between a human voice and a synthetic, machine-learning-generated voice, i.e. a deepfake, can detect this type of fraud.
Whilst deepfakes, whether audio alone or audio combined with video, can fool humans into believing they are the genuine person, they cannot fool ValidSoft’s synthetic voice detection capability, an integral component of our advanced voice biometric technology.
Our patented, proprietary, native synthetic speech detection algorithms are multi-dimensional, measuring both the behavioural traits of an individual and the impact their physical body has on sound. By linking the sound waves a human emits to the physiology of their body, we model the different frequencies in the sound spectrum created by the human voice, including those the human ear cannot hear. The result is a unique and highly accurate scientific method of biometric identification, with the inbuilt capability to detect differences between human and synthetically created voices that a human simply cannot perceive.
It is a data-driven classifier that analyses characteristics of audio and is therefore able to identify unnatural features in a voice. In other words, synthetic audio may sound identical to a particular person’s natural voice, but once the computer-generated audio is digitized and analysed by the synthetic voice detection module, the differences are clear.
The distribution graph above demonstrates the ability of ValidSoft’s synthetic detection capability to identify Deepfake audio, something the human ear could not.
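To illustrate the general idea of a data-driven classifier operating on spectral characteristics, here is a minimal, hypothetical Python sketch. It is emphatically not ValidSoft’s proprietary algorithm: the features (spectral centroid and high-band energy ratio) and the toy logistic classifier are stand-ins chosen only to show how inaudible frequency-domain differences become machine-readable inputs.

```python
import numpy as np

def spectral_features(audio, sample_rate=16000):
    """Summarise an audio frame by coarse spectral statistics.

    A synthetic voice can sound identical to the real one yet show an
    unnatural energy distribution, including in bands the ear barely
    perceives; features like these expose that to a classifier.
    """
    spectrum = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    total = spectrum.sum() + 1e-12                     # avoid divide-by-zero
    centroid = (freqs * spectrum).sum() / total       # spectral centroid (Hz)
    high_band = spectrum[freqs > 7000].sum() / total  # high-frequency energy ratio
    return np.array([centroid, high_band])

def classify(features, weights, bias):
    """Toy logistic classifier: probability the frame is synthetic.

    Real systems learn their parameters from large labelled corpora of
    genuine and machine-generated speech; the weights here are dummies.
    """
    score = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-score))
```

Feeding a one-second frame of audio through `spectral_features` and then `classify` yields a score between 0 and 1; in a production pipeline this decision would be fused with the voice biometric match itself.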
ValidSoft – Next Generation Synthetic Voice Detection
ValidSoft’s advanced voice biometric engines can be integrated with corporate communication channels, those controlled by processing platforms such as contact centres, IVR platforms, web servers, apps and even enterprise communications systems (video and audio conferencing), to detect deepfakes on any of them.
ValidSoft can provide mechanisms for overcoming deepfakes, whether integrated into payment systems, contact centres, telecommunications platforms or corporate communication channels, or deployed standalone. These mechanisms can include ValidSoft’s voice biometric authentication or operate solely on “invisible” deepfake detection. Whether based on automated outbound dialling or an inbound IVR-based feature, ValidSoft can ensure that high-risk remote instructions come from a human voice, and not a deepfake voice skin.