1. What is Voice Biometrics?
Voice biometrics (VB) is the technology that recognizes people by their voice characteristics, measuring the distortion that their physical make-up (physiology) imposes on sound. In other words, it identifies individuals by extracting information about their physical make-up from their voices.

2. How does Voice Biometrics work?
Voice biometric systems process audio utterances to extract certain features (characteristics) that are speaker-specific. From these features, a statistical model, commonly called a voiceprint or voice signature, is built. When new audio is compared, e.g., when authenticating a person against their previously enrolled voiceprint, the same process is applied and pattern matching yields a similarity measure whose value indicates a pass or a fail.
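The comparison step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the voiceprint is treated as a plain feature vector, cosine similarity stands in for whatever pattern-matching measure a real engine uses, and the function names, the example vectors, and the 0.8 threshold are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction to compare")
    return dot / (norm_a * norm_b)

def verify(voiceprint, candidate, threshold=0.8):
    """Return (score, passed). The threshold is a placeholder value."""
    score = cosine_similarity(voiceprint, candidate)
    return score, score >= threshold

# Made-up feature vectors, purely for illustration.
enrolled = [0.9, 0.1, 0.4]
same_speaker = [0.85, 0.15, 0.38]
other_speaker = [0.1, 0.9, 0.2]

print(verify(enrolled, same_speaker))   # high score, passes
print(verify(enrolled, other_speaker))  # low score, fails
```

In practice the threshold would be tuned on real data rather than fixed in advance.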

Many different features can be extracted from a speech signal, and speech scientists classify them on a scale from low-level to high-level. A low-level feature relates to information extracted from very short periods of speech, e.g., a frequency analysis of roughly 20 milliseconds of speech. High-level features are obtained from longer periods, for example, verbal mannerisms or the frequency of the words a speaker uses, both of which can be estimated over one full sentence or many. All types of features can be used for voice biometrics, but research has shown that low-level features are by far the most practical and efficient.
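To make the idea of a low-level feature concrete, the sketch below cuts a signal into 20-millisecond frames and runs a frequency analysis on one of them. It is a simplified illustration, not a description of any particular engine: the 8 kHz sample rate, the synthetic 400 Hz tone, and the function names are assumptions, and production systems typically use overlapping, windowed frames and more refined spectral features.

```python
import cmath
import math

def split_into_frames(samples, sample_rate, frame_ms=20):
    """Cut a signal into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for a short frame)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

sample_rate = 8000  # Hz; assumed telephone-quality audio
tone_hz = 400       # synthetic test tone standing in for speech
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
          for t in range(sample_rate // 10)]  # 100 ms of audio

frames = split_into_frames(signal, sample_rate)
spectrum = magnitude_spectrum(frames[0])

# The strongest frequency bin in the first 20 ms frame.
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / len(frames[0])
print(len(frames), len(frames[0]), peak_hz)
```

With these numbers each frame holds 160 samples, so the analysis resolves frequencies in 50 Hz steps.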

Because of the way speech is produced, a voice utterance will include characteristics that relate to the way air flows from the lungs to the mouth when a person is talking, and more precisely to how this airflow is affected by the shape of the vocal tract. The low-level features capture this effect and consequently the information exploited by voice biometric systems closely relates to a physical characteristic of the speaker, i.e., their vocal tract.

3. What is the difference between Text-Dependent and Text-Independent systems?
A text-dependent system is one where the customer is prompted to repeat approximately the same phrase as the one spoken during enrolment. This is commonly referred to as Active or Prompt voice biometrics.

In a text-independent mode, no specific text is expected: the customer is free to say anything. This technology is typically used to process conversations passively, such as in call centers. This is commonly referred to as Passive or Conversational voice biometrics.

Text-dependent systems are favored for authentication solutions, such as those used with Internet or mobile banking systems, because they perform well with small amounts of speech. Text-independent systems, by contrast, are commonly applied where conversations are already taking place, e.g., between an agent and a customer or with digital assistants. In these cases, authentication acts as an overlay that is completely transparent to the customer.

The choice between text-dependent and text-independent technologies is often driven by use cases and customer experience, rather than by performance or security.

4. What model does ValidSoft support?
ValidSoft supports text-dependent, text-independent, and conversational modes. The most appropriate solution depends on the client’s use cases.

5. What is the difference between Passive and Active enrolment?
Passive enrolment is the process by which the audio used to build a voiceprint for a given speaker is captured during a speech interaction that would have happened anyway, e.g., a discussion with a call center agent or an interaction with an IVR system. The customer does not have to take a specific action to be enrolled.

This contrasts with active enrolment, during which the customer is prompted to repeat given phrases or digits. Passive enrolments typically involve text-independent VB, whilst active enrolments are usually associated with text-dependent phrases. ValidSoft supports both passive and active models.

6. Does it work? Does it always work?
Biometric systems, including voice, are highly accurate, but no biometric solution, regardless of modality, can ever achieve 100% security across all scenarios. Typical accuracy rates for the voice biometric engine alone are in the region of 98% to 99% for GSM or landline transmission, approaching 100% where high-definition (wideband) data channels such as Wi-Fi, 3G, or 4G are used.

This applies to authentications performed using smartphone or notebook microphones. ValidSoft’s VB engine also incorporates advanced features such as Grey Zone Logic, which further reduce false positives and false negatives.
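The false positives and false negatives mentioned above trade off against each other as the pass threshold moves. The sketch below shows that trade-off on invented similarity scores. It does not describe Grey Zone Logic, whose workings are not detailed here; the function name and all score values are hypothetical.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a given threshold.

    A genuine speaker scoring below the threshold is a false negative;
    an impostor scoring at or above it is a false positive.
    """
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_rejects / len(genuine_scores),
            false_accepts / len(impostor_scores))

# Invented similarity scores, for illustration only.
genuine = [0.91, 0.88, 0.95, 0.72, 0.97]
impostor = [0.20, 0.35, 0.81, 0.10, 0.44]

for threshold in (0.5, 0.8, 0.9):
    frr, far = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold:.1f}  false rejects={frr:.0%}  false accepts={far:.0%}")
```

Raising the threshold rejects more impostors at the cost of rejecting more genuine speakers, which is why no single setting removes both kinds of error.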

7. Can a recording of a person’s voice fool a voice biometric system?
Using a voice recording device to play back another person’s voice is known as a Replay Attack. In many cases, voice biometric engines can identify the use of a recording device through several techniques, including detecting the absence of the highest and lowest frequencies, which, though not audible to humans, are detectable by VB engines.

Additionally, the process of replaying distorts the audio, and in many cases this is detectable by ValidSoft’s replay detection algorithms. Another technique that can be applied is Identical Utterance Checking, where the VB engine checks whether the utterance being analyzed is suspiciously identical to one from a previous authentication. The most foolproof method is liveness checking, where a random element, such as random digits, must be spoken. A recording of another person’s static phrase, or of an earlier random phrase, will not be sufficient to pass, given the random nature of the challenge.
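The random-digit form of liveness checking can be sketched as follows. This is a minimal illustration with hypothetical function names: it assumes a separate speech recognizer has already turned the caller's audio into text, and it covers only the challenge itself, not the voiceprint comparison that would run on the same audio.

```python
import secrets

DIGITS = "0123456789"

def new_digit_challenge(length=6):
    """Generate an unpredictable digit string for the caller to speak."""
    return "".join(secrets.choice(DIGITS) for _ in range(length))

def digits_match(challenge, recognized_text):
    """Compare the challenge with the digits a speech recognizer reported.

    recognized_text is assumed to come from a recognizer outside this
    sketch; spaces and other non-digit characters are ignored.
    """
    spoken = "".join(ch for ch in recognized_text if ch in DIGITS)
    return spoken == challenge

challenge = new_digit_challenge()
print("Please say:", " ".join(challenge))

# A replayed recording of an older challenge contains the wrong digits.
print(digits_match("493027", "4 9 3 0 2 7"))  # True
print(digits_match("493027", "1 2 3 4 5 6"))  # False
```

Because each challenge is drawn fresh, a recording made during an earlier session would almost never contain the digits requested in the current one.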

It should be noted, however, that the chances of obtaining a recording of a person’s authentication phrase for fraudulent purposes are small; this technique is not common in mass attacks and is more typical of targeted attacks. To our knowledge, ValidSoft’s production deployment has never been compromised by a replay attack.

8. Can the system distinguish between twins?
Features used by voice biometric engines strongly relate to the shape of the vocal tract. Although identical twins share the same genes, their physical development differs, so their vocal tracts vary enough for voice biometric systems to discriminate between their voices.

9. What channels can ValidSoft’s voice biometrics be used on?
ValidSoft voice biometric technology can be used in Contact Centers (agents and intelligent virtual agents), Interactive Voice Response platforms, Web portals, Mobile Apps, Enterprise access platforms, eCommerce (3D-Secure), and IoT.

10. How do you capture the speaker’s audio?
ValidSoft can obtain a user’s audio through inbound or outbound phone calls, both mobile and landline; through mobile apps on smart devices; over social messaging channels such as WhatsApp for Business, Viber, and Messenger; directly through browsers, e.g., via WebRTC and HTML5; and from any IoT device with a microphone.