What is Voice Biometrics?

Voice biometrics (VB) is the technology that recognizes people from their voice characteristics by measuring the distortion that their physical make-up (physiology) imposes on sound. In other words, it identifies individuals by extracting information about their physiology from their voices.

How does Voice Biometrics work?

Voice biometric systems extract unique speaker-specific features from audio samples to create a “voiceprint” or “voice signature.” When a new voice sample is provided, the system compares it to the enrolled voiceprint using pattern-matching techniques to authenticate the speaker.
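
As a rough illustration of enrollment and matching (a minimal sketch, not ValidSoft's actual algorithm), the example below averages a few feature vectors into a voiceprint and compares a new sample against it with cosine similarity. The function names, the tiny two-value feature vectors, and the 0.8 threshold are all assumptions made for the example; production engines use far richer features and statistical models.

```python
import math


def enroll(feature_vectors):
    """Average several same-length feature vectors into one voiceprint."""
    count = len(feature_vectors)
    return [sum(column) / count for column in zip(*feature_vectors)]


def match_score(voiceprint, sample):
    """Cosine similarity between a voiceprint and a new sample (1.0 = same direction)."""
    dot = sum(v * s for v, s in zip(voiceprint, sample))
    norm_v = math.sqrt(sum(v * v for v in voiceprint))
    norm_s = math.sqrt(sum(s * s for s in sample))
    return dot / (norm_v * norm_s)


def authenticate(voiceprint, sample, threshold=0.8):
    """Accept the speaker only if the score reaches the (hypothetical) threshold."""
    return match_score(voiceprint, sample) >= threshold


# Toy enrollment from two made-up feature vectors.
voiceprint = enroll([[1.0, 0.0], [0.8, 0.2]])
print(authenticate(voiceprint, [1.0, 0.1]))  # similar sample
print(authenticate(voiceprint, [0.0, 1.0]))  # dissimilar sample
```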

Voice biometrics leverages a range of features from speech. Low-level features are derived from short segments of sound, which reflect the unique way air flows through a speaker’s vocal tract and can be captured in as little as 20 milliseconds. High-level features, such as speech habits or language patterns, are derived from longer stretches of speech. While both are useful, research shows low-level features are more efficient for voice biometrics as they closely relate to the speaker’s physical attributes.
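
To make the 20-millisecond figure concrete, the sketch below cuts a list of audio samples into 20 ms frames and computes one simple value per frame. It is only an illustration under stated assumptions: the function names are hypothetical, and average energy stands in for the spectral features a real engine would extract.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=20):
    """Cut audio into consecutive, non-overlapping frames of frame_ms milliseconds.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frame_energy(frame):
    """Average squared amplitude of one frame (a stand-in low-level feature)."""
    return sum(s * s for s in frame) / len(frame)


# One second of silence at 8000 samples per second gives 50 frames of 160 samples.
frames = split_into_frames([0.0] * 8000, sample_rate_hz=8000)
print(len(frames), len(frames[0]))
```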

What are Audio Deepfakes and how does ValidSoft prevent them?

Audio deepfakes are synthetic voice recordings generated by AI that mimic a person’s voice to deceive biometric systems. These can pose a security threat, especially in authentication scenarios.

At ValidSoft, we’ve been implementing anti-spoofing measures like synthetic voice detection and replay detection since long before the term “deepfake” was even coined. Our system is designed to detect anomalies in voice recordings, including subtle distortions and inconsistencies created by both deepfakes and traditional replay attacks. Additionally, our liveness detection ensures that only a live human voice, not a synthetic or recorded one, can successfully pass the authentication process.

We also provide our audio deepfake detection as a standalone solution as part of our Voice Verity® platform.

What is the difference between Text-Dependent, Text-Independent, and Spoken Digit systems?

Text-Dependent voice biometrics requires users to repeat specific phrases or words, often the same ones used during enrollment. This approach is ideal for secure authentication scenarios like mobile banking, where the phrase acts as a consistent identifier.

Text-Independent (also known as conversational) voice biometrics, on the other hand, does not rely on specific phrases. The system can authenticate users based on natural conversation, making it suitable for scenarios like call centers or virtual assistants, where authentication happens passively without interrupting the user experience.

Spoken Digit systems, like ValidSoft’s See-Say®, add an additional layer of security by requiring users to repeat a unique sequence of digits tied to a specific transaction. This ensures not only user identity verification but also the integrity of individual transactions, making it ideal for use cases where a high degree of security is required.

Which models does ValidSoft support?

ValidSoft supports a range of voice biometric models, including text-dependent, text-independent (conversational), and spoken digit systems. Our solutions are tailored to the unique needs of each use case:

  • Text-Dependent for scenarios requiring specific phrases for fast, accurate authentication
  • Text-Independent for passive, conversational verification, where users are authenticated naturally during conversations
  • Spoken Digit (such as our See-Say® solution), which enhances security by using a dynamic sequence of digits tied to a specific transaction

We provide customized solutions based on your specific security and user experience requirements.

What is the difference between Passive and Active enrollment?

Passive Enrollment happens naturally during regular interactions, such as conversations with a call center agent or an IVR system. The user doesn’t need to take any specific action to be enrolled.

Active Enrollment requires the user to repeat specific phrases or digits, typically associated with text-dependent voice biometrics.

ValidSoft supports both passive and active enrollment models, ensuring flexibility based on the application.

Does it work? Does it always work?

Voice biometric systems are highly accurate but, like all biometrics, they can’t guarantee 100% success in every scenario. ValidSoft’s voice biometrics engine typically achieves 98-99% accuracy for GSM or landline transmissions, and near 100% with high-definition (wideband) data channels, such as those used on Wi-Fi, 4G, or 5G.

Our system also integrates advanced features like Grey Zone Logic to minimize false positives and false negatives, ensuring reliability across various devices like smartphones and laptops.
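
False positives and false negatives trade off against each other through the decision threshold. The sketch below shows that trade-off in general terms; it does not describe how Grey Zone Logic works, and the function name and all scores are made-up values for illustration only.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold.

    A false accept is an impostor scoring at or above the threshold;
    a false reject is a genuine speaker scoring below it.
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Toy scores, invented for the example.
genuine = [0.9, 0.8, 0.4, 0.95]
impostor = [0.1, 0.3, 0.7, 0.2]

print(error_rates(genuine, impostor, threshold=0.35))  # lenient threshold
print(error_rates(genuine, impostor, threshold=0.75))  # strict threshold
```

With these toy values, the lenient threshold lets one impostor through but rejects no genuine speaker, while the strict threshold does the opposite.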

Can I still be recognized if I have a cold?

Yes. A cold is one of the natural sources of variation in your voice, and voice biometric engines are designed to deal with such variability. Furthermore, because VB engines exploit information that relates to the shape of the full vocal tract (i.e. the physical make-up of an individual), the effect of a cold will be minimal and is catered for by the system’s normalization. A human might think that a person sounds very different from usual, whereas a VB engine will notice that the fundamental features of the voice remain the same.

Can an impersonator fool a voice biometric system?

When someone mimics another person’s voice, they copy language mannerisms, which are high-level features, whereas voice biometric systems exploit low-level features that relate to the speaker’s vocal tract. It is easy to copy the way a person talks (accent, mannerisms) but impossible to alter the way speech is physically produced (the effect of the vocal tract). A human ear might think that an impersonator sounds very much like another person, whereas a VB engine will notice that the fundamental features of the voice are not the same.

Can a recording of a person’s voice fool a voice biometric system?

Using a voice recording device to play back another person’s voice is known as a Replay Attack. In many cases voice biometric engines can identify recording devices using several techniques, including detecting the absence of the highest and lowest frequencies, which, though not audible to humans, are detectable by VB engines.

Additionally, the process of replaying distorts the audio and, in many cases, this is detectable by ValidSoft’s replay detection algorithms. Another technique that can be applied is Identical Utterance Checking, where the VB engine checks whether the sample being analyzed is suspiciously identical to a previous authentication. The most foolproof method is liveness checking, where a random element must be spoken, such as random digits. A recording of another person’s static phrase, or of a different random phrase, will not be sufficient to pass given the random nature of the challenge.
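
The idea behind Identical Utterance Checking can be sketched as follows: two live utterances of the same phrase are never exactly alike, so a new sample that matches a stored one almost value for value is suspicious. This is a minimal sketch under stated assumptions, not ValidSoft's implementation; the function name, the feature vectors, and the tolerance are hypothetical.

```python
def is_identical_utterance(new_features, previous_features, tolerance=1e-6):
    """Return True if the new sample is a near-exact copy of any earlier one.

    previous_features is a list of feature vectors kept from earlier
    authentications; tolerance is the largest per-value difference that
    still counts as "identical".
    """
    for old in previous_features:
        same_length = len(old) == len(new_features)
        if same_length and all(abs(a - b) <= tolerance
                               for a, b in zip(old, new_features)):
            return True
    return False


history = [[0.30, 0.40], [0.10, 0.20]]
print(is_identical_utterance([0.10, 0.20], history))  # exact repeat of a stored sample
print(is_identical_utterance([0.10, 0.25], history))  # natural variation
```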

It should be noted, however, that the chances of obtaining a recording of a person’s authentication phrase for fraudulent purposes are small. Replay is not a common technique in mass attacks and is more likely in targeted attacks. To our knowledge, ValidSoft’s production deployment has never been compromised by a replay attack.

Can the system distinguish between twins?

Features used by voice biometric engines strongly relate to the shape of the vocal tract. Because identical twins share the same genes but develop physically in different ways, their vocal tracts will vary enough for voice biometric systems to be able to discriminate between their voices.

How does voice biometrics compare to the other forms of biometrics?

External evaluations have shown voice biometrics to exceed the accuracy of many other biometric modalities. This has been shown in an objective study by the UK National Physical Laboratory, the national measurement standards body, and by industry analysts such as Opus Research.

What matters more when comparing biometric modalities is the use case, i.e. which modalities are the most relevant or user-friendly for it. For instance, in a telephone banking or call center use case, the only relevant modality is voice biometrics. In financial services such as banking, we believe the only biometric modality that can be used on all electronic customer channels is voice.

Another significant benefit of voice is its accessibility. No specific capture device is required, since phones are ubiquitous, and no invasive action is required of the user: there is no physical contact or scanning, just speaking.

What is speaker verification? How does it differ from speaker identification?

Speaker Verification is the process of verifying someone’s voice against a claimed identity. It is a 1-to-1 matching process where audio is checked against a single previously enrolled voice model (the voiceprint). The goal is to assert whether the individual is or is not who they claim to be, generally for the purpose of authentication.

Speaker Identification is the process of checking spoken audio against multiple pre-enrolled voice models. It is a 1-to-N process: because there is no claimed identity, the system tries to establish who the speaker is from their voice alone. It is often used for watchlist processing (e.g. identification of known fraudsters) or agent identification.

ValidSoft’s VB engine performs both verification and identification and can perform both in parallel.
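
The difference between the two processes can be sketched in a few lines. This is an illustration under stated assumptions rather than ValidSoft's engine: voiceprints are toy feature vectors, cosine similarity stands in for the real scoring, and the names and the 0.8 threshold are hypothetical.

```python
import math


def similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def verify(claimed_id, sample, enrolled, threshold=0.8):
    """1-to-1: compare the sample with the voiceprint of the claimed identity only."""
    return similarity(enrolled[claimed_id], sample) >= threshold


def identify(sample, enrolled, threshold=0.8):
    """1-to-N: return the best-matching enrolled speaker, or None if nobody matches."""
    best_id, best_score = None, threshold
    for speaker_id, voiceprint in enrolled.items():
        score = similarity(voiceprint, sample)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    return best_id


enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(verify("alice", [0.9, 0.1], enrolled))  # claimed identity is checked
print(identify([0.1, 0.9], enrolled))         # no claim; the system searches
```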

What is the difference between speaker recognition and speech recognition?

The goal of speech recognition is to recognize what is said (which words are spoken), but not who spoke them. The goal of speaker recognition, which is based on voice biometric technology, is to recognize who is speaking, but not what is spoken.

Can speaker recognition and speech recognition be used together?

Yes. The combination of speaker recognition and speech recognition is referred to as Conversational Voice Biometrics, and it enables the solution to recognize both who is speaking and what they are saying. This can be used to augment knowledge-based questions as well as to apply randomness to ensure a “live” human is speaking. This is commonly used for a process called liveness testing, which ensures that an actual customer is speaking as opposed to pre-recorded audio being replayed into the authenticating system in an attempt to deceive the service.
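
A liveness test of this kind can be sketched as a random digit challenge that passes only when both checks agree. The example is a minimal sketch under stated assumptions: the function names are hypothetical, `recognized_text` stands for the output of some speech recognizer, and `speaker_score` stands for the output of some speaker recognition engine, neither of which is implemented here.

```python
import secrets


def new_digit_challenge(length=6):
    """Generate an unpredictable digit string for the user to speak."""
    return "".join(secrets.choice("0123456789") for _ in range(length))


def conversational_check(challenge, recognized_text, speaker_score, threshold=0.8):
    """Pass only if the right words were spoken AND the right person spoke them."""
    said_right_thing = recognized_text.replace(" ", "") == challenge
    is_right_speaker = speaker_score >= threshold
    return said_right_thing and is_right_speaker


challenge = new_digit_challenge()
print(challenge)
# A replayed recording of an old challenge fails the "what was said" check,
# even if the voice itself scores highly.
print(conversational_check("482913", "1 1 1 1 1 1", speaker_score=0.95))
print(conversational_check("482913", "4 8 2 9 1 3", speaker_score=0.95))
```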