This is part of a series of blogs diving into the technical aspects of Veridium’s distributed data model, biometrics, and computer vision research and development by our chief biometric scientist Asem Othman.
Artificial intelligence voice assistants like Apple’s Siri, Google Assistant, and Amazon’s Alexa have become convenient alternatives to the tedious, frustrating, and time-consuming effort of keying data into mobile phones, as well as a new way to interact with Internet of Things (IoT) devices in smart homes. As these technologies become more commonplace, it’s natural for people to think the terms speech and voice recognition are synonymous.
Hence, we need to clarify the difference between these voice assistants, which are speech recognition systems, and voice recognition (sometimes called “speaker recognition”) systems.
Speech recognition is the use of software to recognize words as they are spoken and convert them to a digital representation, such as text for dictation. Speech recognition can be a phenomenal time-saving tool compared to typing, but it is not biometrics.
Voice recognition, on the other hand, is the exercise of matching a specific speaker’s voice to a unique digital representation as a means of identity authentication.
Unlike traditional biometrics such as fingerprint, face, and iris, voice is a combination of physiological and behavioral biometrics. Physiological aspects are based on the size and shape of each person’s mouth, throat, larynx, and nasal cavity, along with body weight and other factors; these result in our natural pitch, tone, and timbre. Behavioral properties are those shaped by language, education/influence, and geography, resulting in variable speech cadence, inflection, accent, and dialect.
Voicing the Advantages
Voice biometrics has a number of distinct advantages as a method for user authentication on mobile, IoT, and wearable devices. Speaking comes naturally to people, which makes voice easy to use for mobile authentication, and it can follow on from the success of fingerprint biometrics, which were easily integrated into flagship smartphones.
Voice is also well suited as a biometric authentication solution across a wide range of IoT devices, including tablets, wearables, PCs, gaming systems (handheld and console), smart TVs, even fixed line telephones and automobiles.
Voice recognition offers a cost-effective and flexible choice when compared to other biometric modalities that may be hindered by hardware integration efforts, particularly on mobile devices requiring fingerprint sensors and NIR iris cameras.
Representation and Matching
Voice recognition is much more specific and requires significantly more processing and analysis than speech recognition. Where speech recognition converts speech to text, voice recognition must also analyze the unique characteristics of each voice. It also has to compare that voice to a master voice print of a specific enrolled identity (in a verification scenario), or many identities (in the case of an identification scenario).
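To make the verification versus identification distinction concrete, here is a minimal Python sketch. It assumes each voice print has already been reduced to a short numeric feature vector and that cosine similarity with a fixed threshold is used for matching. The vectors, the names, and the 0.8 threshold are all hypothetical and chosen only for illustration; real systems use far richer models.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def verify(sample, claimed_print, threshold=0.8):
    """1:1 verification: does the sample match the one claimed identity?"""
    return cosine_similarity(sample, claimed_print) >= threshold

def identify(sample, enrolled_prints, threshold=0.8):
    """1:N identification: return the best-matching enrolled name, or None."""
    best_name, best_score = None, -1.0
    for name, voice_print in enrolled_prints.items():
        score = cosine_similarity(sample, voice_print)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical 4-dimensional voice prints (illustrative values only).
enrolled = {
    "alice": [0.9, 0.1, 0.3, 0.5],
    "bob": [0.1, 0.8, 0.6, 0.2],
}
sample = [0.85, 0.15, 0.35, 0.45]

print(verify(sample, enrolled["alice"]))  # one comparison against a claimed identity
print(identify(sample, enrolled))         # one comparison per enrolled identity
```

Verification costs a single comparison, while identification grows with the number of enrolled users, which is one reason the two scenarios are usually treated separately.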
Voice recognition is a popular choice for mobile authentication due to the availability of high-quality devices for collecting speech samples. For example, newer mobile phones have higher-quality digital microphones and noise-cancellation processing. If you’ve ever noticed a small pin-hole on the back of your phone, that is typically a secondary microphone that picks up background noise so the phone can cancel it out.
Voice recognition differs from other biometric methods in that voice samples are captured dynamically, over a short period of time such as a few seconds. Analysis relies on a model that tracks changes over time, which is similar to other behavioral biometrics like dynamic signature, gait, and keystroke recognition.
There are two modes of speaker recognition: text-dependent (constrained) and text-independent (unconstrained).
In a system using “text-dependent” speech, the individual presents either a fixed or prompted phrase that is programmed into the system, such as “My voice is my password.” The system compares the voice sample against a master voice print and calculates a match score. A text-dependent system requires less data, and the utterance of a fixed, predetermined phrase improves authentication performance, especially with cooperative users.
In a system using “text-independent” speech, the individual presents longer speech input that the system has no advance knowledge of. The system captures the speech input into a voice model and identifies speech mannerisms across a broader spectrum. A text-independent system therefore requires significantly more data and takes longer to process, but it can enroll users passively without requesting any specific utterance.
Both have been deployed successfully for call center identification, but text-dependent is the only viable option for functions like app or website access that must be fast and convenient. On the other hand, although a text-independent system is more difficult to design than a text-dependent system, it offers more protection against fraud due to its increased accuracy.
Analyzing Voices as Biometrics
Speech samples are waveforms with time on the horizontal axis and loudness on the vertical axis. In speaker recognition systems, these samples are converted from an analog format to a digital format. Then the features of the individual’s voice are extracted and a voice model is created. These models represent the underlying variations and changes over time in the speech signal, such as its quality, duration, intensity dynamics, and pitch. They are then used to compare the similarities and differences between the input voice and the stored voice “states” to produce a recognition decision.
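As a rough illustration of the feature extraction step, the Python sketch below slices a digitized waveform into short overlapping frames and computes two simple per-frame measurements: short-time energy (a proxy for intensity) and zero-crossing rate (a crude proxy for pitch content). The synthetic tones, the 8 kHz sample rate, and the 25 ms / 10 ms frame and hop sizes are assumptions made for the example. Production systems typically use richer spectral features, so treat this as a sketch of the idea rather than a real front end.

```python
import math

def make_tone(freq_hz, duration_s, sample_rate=8000, amplitude=0.5):
    """Synthetic stand-in for a digitized speech signal (a pure sine tone)."""
    n = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

def split_frames(samples, frame_len, hop_len):
    """Cut the signal into overlapping fixed-length frames."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop_len)]

def frame_features(frame):
    """Return (short-time energy, zero-crossing rate) for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / (len(frame) - 1)

def extract_features(samples, sample_rate=8000, frame_ms=25, hop_ms=10):
    """Turn a waveform into a time-ordered sequence of feature pairs."""
    frame_len = sample_rate * frame_ms // 1000
    hop_len = sample_rate * hop_ms // 1000
    return [frame_features(f) for f in split_frames(samples, frame_len, hop_len)]

low = extract_features(make_tone(100, 0.2))   # lower-pitched tone
high = extract_features(make_tone(400, 0.2))  # higher-pitched tone

mean_zcr_low = sum(z for _, z in low) / len(low)
mean_zcr_high = sum(z for _, z in high) / len(high)
print(len(low), mean_zcr_low < mean_zcr_high)
```

The resulting sequence of per-frame features is the kind of time-varying representation a voice model is then built from.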
Most “text-dependent” speaker verification systems use Hidden Markov Models (HMMs), which provide a statistical representation of the sounds produced by the individual. Another method is the Gaussian Mixture Model (GMM), a state-mapping model closely related to the HMM, which is often used for unconstrained “text-independent” applications.
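To give a feel for how a Gaussian Mixture Model can score a voice sample, here is a small Python sketch. It assumes diagonal-covariance mixtures and compares a speaker model against a generic background model using an average log-likelihood ratio. That comparison is a common practice in the field but is an assumption here, not something this article specifies. All model parameters and feature frames are made up for illustration.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log p(x) under a mixture given as a list of (weight, mean, var)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
    peak = max(logs)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(value - peak) for value in logs))

def average_llr(frames, speaker_gmm, background_gmm):
    """Mean per-frame log-likelihood ratio: speaker model vs. background."""
    total = sum(
        gmm_log_likelihood(f, speaker_gmm) - gmm_log_likelihood(f, background_gmm)
        for f in frames
    )
    return total / len(frames)

# Hypothetical 2-dimensional models (illustrative parameters only).
speaker_gmm = [
    (0.5, [1.0, 2.0], [0.5, 0.5]),
    (0.5, [3.0, 1.0], [0.5, 0.5]),
]
background_gmm = [(1.0, [0.0, 0.0], [4.0, 4.0])]

genuine_frames = [[1.1, 1.9], [2.9, 1.2]]     # close to the speaker model
impostor_frames = [[-1.0, -0.5], [0.2, -1.5]]  # far from the speaker model

print(average_llr(genuine_frames, speaker_gmm, background_gmm) > 0)
print(average_llr(impostor_frames, speaker_gmm, background_gmm) < 0)
```

A positive ratio means the frames are better explained by the enrolled speaker than by the generic model; a deployed system would compare the ratio against a tuned threshold rather than zero.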
Current research introduces the concept of an end-to-end neural speaker recognition system that works well for both text-dependent and text-independent scenarios. This means that the same system can be trained to recognize who is speaking either when you say a wake word to activate your home assistant or when you’re speaking in a meeting. Most recent research uses deep neural networks, with layers inspired by ResNet and recurrent models, to extract acoustic features.
Voicing the Disadvantages
The effectiveness of voice recognition depends directly on following careful and deliberate best practices during enrollment. Although enrollment is generally a simple and quick process, requiring the user to speak a passphrase or series of numbers three or four times, many people make the mistake of speaking with increased volume or force, or even sounding robotic. Speaking naturally is the most essential best practice, followed by enrolling in an environment without background or ambient noise.
However, voice recognition systems often face problems when end users have enrolled on a clean landline phone and attempt verification using a noisy cellular phone. The inability to control the factors affecting the input is one of the main issues that significantly degrades performance. For instance, background noise (e.g., traffic, fans, others speaking, music/TV, machinery) distorts the purity of voice collection during enrollment or authentication.
Moreover, speaker recognition systems, except those using prompted phrases, are also more susceptible to spoofing attacks through the use of recorded voice. Anti-spoofing measures that require the utterance of a specified, random word or phrase are being implemented to combat this weakness.
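A prompted-phrase check of this kind can be sketched in a few lines of Python. The example assumes the system asks the user to speak a random sequence of digit words and that a separate speech recognizer (not shown, and hypothetical here) returns a text transcript of what was said. The function names and the four-word challenge length are illustrative choices, not a description of any particular product.

```python
import secrets

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def make_challenge(num_words=4):
    """Pick an unpredictable phrase so a pre-recorded sample is unlikely to match."""
    return [secrets.choice(DIGIT_WORDS) for _ in range(num_words)]

def challenge_passed(challenge, transcript):
    """Check that the transcript contains exactly the prompted words, in order."""
    return transcript.lower().split() == challenge

challenge = make_challenge()
print("Please say:", " ".join(challenge))

# The transcript would come from a speech recognizer in a real system;
# here the correct response is simulated for illustration.
simulated_transcript = " ".join(challenge).upper()
print(challenge_passed(challenge, simulated_transcript))
```

The content check only confirms that the right words were spoken. The voice itself still has to be matched against the enrolled voice print, so the two checks work together.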
Finally, the behavioral aspects of a person’s voice change over time due to age, medical conditions (such as a common cold), emotional state, and so on. Therefore, voice is not considered a very distinctive biometric trait and may not be appropriate for large-scale identification.
You can read parts one, two, three, four, and five of this series here.