You’re the voice – the science behind speaker recognition tech

You may have read reports that the Australian Taxation Office (ATO) has introduced voiceprint technology which aims to do away with cumbersome identity-verification processes on the telephone.

When you phone the ATO call centre, instead of supplying your date of birth, address or a password, you’re prompted to say: “In Australia my voice identifies me.” By comparing this to a previously recorded voiceprint, the technology will deduce whether the tax file number you gave actually belongs to you.

The technology that makes this possible is called “speaker recognition”. So how does it work, and how secure is it?

Speech recognition and speaker recognition

Two distinct, but related, technologies use human speech as input:

speech recognition turns speech sounds into text. One speech recognition system that many people are familiar with is Apple’s Siri
speaker recognition identifies a person based on the sound of their voice, and is what the ATO’s voiceprint system is based on. It is one of a broad range of technologies called biometrics, which identify people based on physical or behavioural traits such as the sound of their voice, their fingerprint, the shape of blood vessels in their eye or the way they walk.

The science behind biometric systems such as voiceprints is based on various machine learning techniques. If you’d like to get technical, some examples are hidden Markov models, support vector machines and neural networks. These use sophisticated statistical algorithms to create biometric models of a speaker’s voice.

‘My voice is my password.’

Two common ways that a biometric model can be used are to identify a person based on their voice alone, or to verify by voice whether someone is correctly claiming an identity.

In The Sydney Morning Herald yesterday, Ben Grubb reported that the ATO’s voiceprint system was developed by a company called Nuance, a world leader in speech and speaker recognition. It’s very likely that the ATO uses the technology behind Nuance’s VocalPassword system, which matches a customer’s passphrase with a recording of that passphrase kept in a database.

Because a voiceprint matches a passphrase with a stored recording, it only has to verify a match rather than sort through the whole database to uniquely identify a caller based on their voice. This means the recognition process can be very fast and can work with very low-quality audio.
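For readers who like to see the idea in code, here is a minimal sketch of the difference between verifying a claimed identity (one comparison) and identifying a caller (a search through every stored voiceprint). It is an illustration only, not how Nuance’s system works: the three-number “voiceprints”, the caller names, the cosine-similarity score and the 0.95 threshold are all assumptions made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical enrolled voiceprints: real systems use far richer models.
ENROLLED = {
    "caller_a": [0.9, 0.1, 0.3],
    "caller_b": [0.2, 0.8, 0.5],
    "caller_c": [0.4, 0.4, 0.9],
}

def verify(claimed_id, sample, threshold=0.95):
    """Verification: one comparison against the claimed identity only."""
    return cosine_similarity(ENROLLED[claimed_id], sample) >= threshold

def identify(sample):
    """Identification: compare against every enrolled voiceprint."""
    return max(ENROLLED, key=lambda k: cosine_similarity(ENROLLED[k], sample))

sample = [0.88, 0.12, 0.31]  # invented features from a new call
print(verify("caller_a", sample))  # claim checked with a single comparison
print(identify(sample))            # best match found by scanning everyone
```

Verification does the same amount of work no matter how many people are enrolled, whereas identification has to score the sample against every stored voiceprint.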

Given a passphrase, the system would return a statistical likelihood that the speaker is the person who provided the original voiceprint. The ATO could select a threshold for a positive identification to ensure a good match was required.
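The choice of threshold is a trade-off, and a small sketch can make that concrete. The scores below are invented for illustration and are not measurements from any real system; `error_rates` is a hypothetical helper, not part of any vendor’s software. It counts how often genuine callers would be turned away and how often impostors would be let in at a given threshold.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold."""
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))

# Made-up match scores between 0 and 1, purely for illustration.
GENUINE = [0.91, 0.95, 0.88, 0.97, 0.79]   # true owner speaking
IMPOSTOR = [0.35, 0.62, 0.81, 0.48, 0.55]  # someone else speaking

for threshold in (0.6, 0.85, 0.9):
    far, frr = error_rates(GENUINE, IMPOSTOR, threshold)
    print(f"threshold={threshold}: false accepts={far}, false rejects={frr}")
```

Raising the threshold lets fewer impostors through but turns away more genuine callers, who would then need to prove their identity some other way.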

On the record

Engineers who develop systems such as these are very concerned with security. Much research effort has gone into what’s called “liveness detection” and “playback detection”.

These are ways to ensure that a real person is speaking the passphrase rather than a malicious person playing a recording or attempting to mimic another person’s voice.

It’s possible that a voiceprint is susceptible to what’s called a “replay attack”. If a recording could be obtained of someone saying the exact passphrase, there would be a strong chance of being able to access their account. A distinctive passphrase reduces this risk.

Voiceprint can identify you even if you have a cold because it doesn’t model the sound of your voice – it uses the sound of your voice to model the shape of your vocal tract. When you have a cold the shape of your vocal tract is still the same (you just might sound a bit nasal).

But there are situations or events that could prevent voiceprint or similar systems from correctly identifying a speaker. If someone received an injury that damaged their vocal tract, it would be unlikely that a speaker recognition system would match a voiceprint made before the injury.

A very poor phone connection or high background noise could also prevent a speaker identification system from working properly.

In both of these cases, a failure to match would probably require a caller to the ATO to verify their identity by another means. The system would be extremely unlikely to misidentify someone.

Systems such as voiceprints are intended to save time for callers and for call-centre workers by reducing the time it takes to verify identities – and less time on the phone with the tax office is always a good thing.
