tag:theconversation.com,2011:/us/topics/speaker-recognition-12323/articlesSpeaker recognition – The Conversation2017-07-19T06:44:08Ztag:theconversation.com,2011:article/790702017-07-19T06:44:08Z2017-07-19T06:44:08ZProtecting your smartphone from voice impersonators<figure><img src="https://images.theconversation.com/files/177937/original/file-20170712-19675-910rmn.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Is this an impostor trying to break into your phone with his voice?</span> <span class="attribution"><a class="source" href="https://www.shutterstock.com/image-photo/man-recording-voice-message-smartphone-541649137">Georgejmclittle/Shutterstock.com</a></span></figcaption></figure><p>It’s a lot easier to talk to a smartphone than to try to type instructions on its keyboard. This is particularly true when a person is trying to log in to a device or a system: Few people would choose to type a long, complex secure password if the alternative were to just say a few words and <a href="https://thenextweb.com/apps/2015/03/25/wechat-on-ios-now-lets-you-log-in-using-just-your-voice/">be authenticated with their voice</a>. But voices can be recorded, simulated or even imitated, making voice authentication vulnerable to attack.</p>
<p>The most common methods for securing voice-based authentication focus only on ensuring that the analysis of a spoken passphrase is not tampered with: they securely store the passphrase and the <a href="https://www.technologyreview.com/s/428970/securing-your-voice/">authorized user’s voiceprint in an encrypted database</a>. But securing a voice authentication system has to start with the sound itself.</p>
<p>The easiest attack on voice authentication is impersonation: Find someone who sounds enough like the real person and get them to respond to the login prompts. Fortunately, there are automatic speaker verification systems that <a href="http://dx.doi.org/10.1121/1.4879257">can detect</a> <a href="https://doi.org/10.1109/TMM.2014.2300071">human imitation</a>. However, those systems <a href="https://doi.org/10.1016/j.specom.2014.10.005">can’t detect more advanced machine-based attacks</a>, in which an attacker uses a computer and a speaker to simulate or play back recordings of a person’s voice.</p>
<p>If someone records your voice, they can use that recording to create a computer model that can generate any words in your voice. The consequences, from impersonating you to your friends to dipping into your bank account, are terrifying. The research my colleagues and I are doing uses <a href="https://pdfs.semanticscholar.org/6be6/00d60f4d3210d20567c0ab8f3d78324ab5d4.pdf">fundamental properties of audio speakers, and smartphones’ own sensors</a>, to defeat these computer-assisted attacks.</p>
<h2>How speakers work</h2>
<figure class="align-center ">
<img alt="" src="https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=306&fit=crop&dpr=1 600w, https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=306&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=306&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=385&fit=crop&dpr=1 754w, https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=385&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/175656/original/file-20170626-29070-1tmcqg1.png?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=385&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px">
<figcaption>
<span class="caption">The architecture of a conventional loudspeaker, showing the magnet, coil and cone that produce sound.</span>
</figcaption>
</figure>
<p>Conventional speakers contain magnets and a coil that vibrates back and forth according to <a href="http://www.physics.org/article-questions.asp?id=54">fluctuations of the electrical signal</a>, converting it into sound waves in the air. Putting a speaker up against the microphone of a smartphone, for example, means moving a magnet very close to the smartphone. And most smartphones contain a magnetometer, an electronic chip that can detect magnetic fields. (It comes in handy when using a compass or navigation app, for example.)</p>
<p>If the smartphone detects a magnet nearby during the process of voice authentication, that can be an indicator that a real human might not be doing the talking.</p>
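As a rough sketch of the idea (the function, baseline and threshold values here are invented for illustration, not taken from our actual system), a magnetometer check might look like this:

```python
import math

def magnet_nearby(samples_ut, ambient_ut=50.0, threshold_ut=30.0):
    """Flag a possible loudspeaker magnet when the measured field strength
    deviates sharply from the ambient geomagnetic baseline.

    samples_ut: (x, y, z) magnetometer readings in microtesla.
    The baseline and threshold here are illustrative, not calibrated values.
    """
    for x, y, z in samples_ut:
        magnitude = math.sqrt(x * x + y * y + z * z)
        if abs(magnitude - ambient_ut) > threshold_ut:
            return True  # field far stronger than Earth's: magnet suspected
    return False

print(magnet_nearby([(30.0, 25.0, 28.0)]))    # False: ordinary ambient field
print(magnet_nearby([(300.0, 120.0, 90.0)]))  # True: strong magnet close by
```

Earth’s field is only a few tens of microtesla, so a speaker magnet pressed against the phone overwhelms it easily.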
<h2>Making sure it’s a person talking</h2>
<figure class="align-right zoomable">
<a href="https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=45&auto=format&w=237&fit=clip" srcset="https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=851&fit=crop&dpr=1 600w, https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=851&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=851&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=1070&fit=crop&dpr=1 754w, https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=1070&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/178145/original/file-20170713-18558-rh6r8z.png?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=1070&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption">An outline of how our process works.</span>
<span class="attribution"><a class="source" href="https://pdfs.semanticscholar.org/6be6/00d60f4d3210d20567c0ab8f3d78324ab5d4.pdf">The Conversation (via Lucidchart), after Kui Ren et al.</a>, <a class="license" href="http://creativecommons.org/licenses/by-nd/4.0/">CC BY-ND</a></span>
</figcaption>
</figure>
<p>That’s just one part of our system. If someone uses a smaller speaker, like a set of headphones, the magnetometer might not detect its smaller magnets. So we use machine learning and advanced mathematics to examine physical properties of the sound as it arrives at the microphone.</p>
<p>Our system requires a user to hold the smartphone in front of his or her face and move it from side to side in a half-circle while speaking. We combine the sound captured by the microphone with movement data from gyroscopes and accelerometers inside the smartphone – the same sensors apps use to know when you’re walking or running, or changing direction. </p>
<p>Using that data, we can calculate how far away from the microphone the sound is being generated – which lets us spot when someone might be using speakers far enough away that their magnets wouldn’t be detected. And we can compare the phone’s movement to the changes in the sound to discover whether the sound is created by a source roughly the size of a human mouth near the phone.</p>
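A toy version of the distance calculation, assuming simple inverse-distance decay of sound amplitude (our real system fuses gyroscope and accelerometer readings, but the geometry is similar in spirit):

```python
def estimate_source_distance(a_far, a_near, moved_m):
    """Estimate how far away a sound source is, from the amplitudes measured
    before (a_far) and after (a_near) the phone moves moved_m metres toward
    it. Assumes free-field decay: amplitude is proportional to 1/distance.
    """
    if a_near <= a_far:
        raise ValueError("signal should grow as the phone moves closer")
    # a_far * d = a_near * (d - moved_m)  =>  d = a_near * moved_m / (a_near - a_far)
    return a_near * moved_m / (a_near - a_far)

# A source 0.3 m away sounds twice as loud after the phone halves the distance.
print(estimate_source_distance(a_far=1.0, a_near=2.0, moved_m=0.15))  # 0.3
```

A mouth held at normal speaking distance yields a very different amplitude profile than a loudspeaker across the room.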
<p>All of this, of course, could be defeated by a skilled impersonator – an actual human who mimics a user’s voice. But recall that existing speaker verification methods can catch impersonators, using machine learning techniques that identify <a href="http://dx.doi.org/10.1121/1.4879257">whether a speaker is modifying or disguising</a> his or her normal voice. We include that capability in our system as well. </p>
<h2>Does detection work?</h2>
<p>When we put our system to the test, we found that when the sound source is 6 centimeters (2 inches) from the microphone, we can always distinguish between a human and a computer-controlled speaker. At that distance, the magnet in a normal loudspeaker is strong enough to clearly interfere with the phone’s magnetometer. And if an attacker is using earphone speakers, the microphone is close enough to the sound source to detect it.</p>
<p>When the sound source is farther from the microphone, it’s harder to detect magnetic interference from a speaker. It’s also more difficult to analyze the movement of the sound source in relation to the phone when the distances are greater. But by using multiple lines of defense, we can defeat the vast majority of speaker- and human-based attacks and significantly improve the security of voice-based mobile apps. </p>
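Combining those lines of defense amounts to requiring every check to pass. A sketch, with invented names and limits:

```python
def passes_liveness_checks(magnet_detected, source_distance_m, voice_disguised,
                           max_mouth_distance_m=0.10):
    """Accept the audio as live human speech only if no loudspeaker magnet is
    detected, the sound source sits about a mouth's distance from the phone,
    and the voice shows no signs of disguise. The 10 cm limit is illustrative.
    """
    return (not magnet_detected
            and source_distance_m <= max_mouth_distance_m
            and not voice_disguised)

print(passes_liveness_checks(False, 0.06, False))  # True: all checks pass
print(passes_liveness_checks(True, 0.06, False))   # False: magnet nearby
print(passes_liveness_checks(False, 0.50, False))  # False: source too far away
```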
<p>At the moment, our system is a stand-alone app, but in the future we’ll be able to integrate it into other voice authentication systems.</p>
<p class="fine-print"><em><span>Kui Ren receives funding from US National Science Foundation. </span></em></p>You can log in to your smartphone by talking to it. Current security systems don’t protect enough against imitators. The best way to ensure voice authentication is secure is to start with the sound.Kui Ren, Professor of Computer Science and Engineering, University at BuffaloLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/527872016-01-07T11:39:44Z2016-01-07T11:39:44ZCan voice recognition technology really identify a masked jihadi?<figure><img src="https://images.theconversation.com/files/107442/original/image-20160106-29944-xer0i.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">A masked face but experts still have his voice to go on.</span> <span class="attribution"><span class="source">Video screengrab</span></span></figcaption></figure><p>The latest video of a masked Islamic State jihadist apparently speaking with a British accent led to him being <a href="http://www.telegraph.co.uk/news/worldnews/islamic-state/12081552/new-jihadi-john-Siddhartha-Dhar-isil-terrorist.html">tentatively identified</a> as Muslim convert Siddhartha Dhar from East London. Voice recognition experts were reportedly working with UK intelligence services using voice analysis. But how does this technology work and what is it capable of?</p>
<p>Most of us can, when we hear a voice we know well, recognise who is speaking after just a few words, while less familiar voices might take a little longer. If the context and content of the words spoken are familiar, that makes it easier still. Generally, machines face the same constraints when trying to compare recordings and find a match.</p>
<p>Computational systems that aim to establish who people are from their voices – <a href="https://www.ll.mit.edu/publications/journal/pdf/vol08_no2/8.2.4.speakerrecognition.pdf">speaker identification</a> – differ in what they aim to do: detect the presence of a single known speaker; match speech to one of several known speakers; work out what is recognisable from an unknown recording; or verify that a recording of speech was indeed from the expected speaker.</p>
<p>Modern systems tend to take a big data approach, where machine learning algorithms are trained with large sets of known recordings so they can recognise individual speakers’ vocal features. The idea is that the important features that discriminate between different speakers are learned automatically. In contrast, older methods specified which types of linguistic and phonetic features of speech were thought important, in order to compare them between speakers. </p>
<p>While we don’t really know what combination of features is best for voice recognition, we can classify them as either acoustic or linguistic.</p>
<h2>Acoustic and linguistic features</h2>
<p>Acoustic features are characteristics of how humans produce speech. When we speak, air is expelled from our lungs and travels up the trachea, through the larynx and out of our mouth and nose. As it passes, it vibrates the vocal cords, which, when relaxed or contracted, change the frequency of vibration and so the pitch of our voices. </p>
<figure class="align-right zoomable">
<a href="https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=237&fit=clip" srcset="https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=608&fit=crop&dpr=1 600w, https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=608&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=608&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=765&fit=crop&dpr=1 754w, https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=765&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/107441/original/image-20160106-14922-1gpr3lk.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=765&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption">The surfaces involved in producing speech.</span>
<span class="attribution"><a class="source" href="http://training.seer.cancer.gov/head-neck/anatomy/overview.html">National Cancer Institute</a></span>
</figcaption>
</figure>
<p>Several parts inside the vocal and nasal cavities, such as the tongue, teeth and lips – known as articulators – modify sounds to create different resonances – called formants – and so produce other variable characteristics of speech. What we hear as speech is the combination of all these interactions as air passes through these cavities and over and between these body parts.</p>
<p>Each of us has unique speaking characteristics: the way our lungs exhale, our vocal cords resonate and our articulators act all produce unique sounds. One person’s “a” can be very different from another’s – and that’s just one of the 44 phonemes (the smallest units of sound that make up words) in the English language. The way our speech blends the phonemes together and moves from one to another is also different, as is the speed at which this happens. Consider the difference between the steady tempo and rounded vowels of an English country accent and the faster, staccato speech common in bigger cities. </p>
<p>Linguistic features relate to which phonemes we choose to use and in what sequence, rather than how they’re produced. If I say “tomahto” and you say “tomayto” then we have spoken the same word with a different choice of phonemes. There are a vast number of alternative pronunciations, based on familiarity and often on regional and generational differences. Word choice, grammatical patterns, characteristic pauses, stresses, and favoured sentence structures or phrases also present a way to identify different speakers. </p>
<p>At a higher level still is the meaning of the words themselves. We tend to make different choices in what we say and how we choose to say it – how direct, or confrontational, or evasive, or intellectual our way of speaking is. If you’ve ever met someone and thought they speak like a lawyer, teacher or artist, then the patterns you picked up on can be recognised by computers too.</p>
<figure class="align-center zoomable">
<a href="https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=312&fit=crop&dpr=1 600w, https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=312&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=312&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=393&fit=crop&dpr=1 754w, https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=393&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/107440/original/image-20160106-14922-9kef12.png?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=393&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption">A time/frequency spectrogram of the phrase ‘I owe you’.</span>
<span class="attribution"><a class="source" href="https://en.wikipedia.org/wiki/File:Spectrogram_of_I_owe_you.png">Jonas.kluk</a></span>
</figcaption>
</figure>
<h2>Making sense of it all</h2>
<p>In computational terms, first the linguistic and acoustic features are isolated, condensing the large amount of data into manageable sets of features that succinctly capture their important nuances. Then pattern matching is used to compare these to the features from another recording.</p>
<p>Features of speech that can be automatically extracted include pitch, formant frequency, vocal tract length, and the rate at which syllables are spoken. Some modern methods operate better with lower-level features that require less processing and offer less intrinsic meaning to human ears. These are typically two-dimensional maps of time and frequency, such as spectrograms.</p>
<p>Once complex speech has been reduced to a set of more simplified representative features, then a process of generalised pattern matching is applied, establishing how best to make a comparison, and how closely patterns match. Given enough good quality speech to analyse, we can convincingly match the speaker to one person from among a small group of suspects. The more speech we have from both sets to compare, the better the match. In this case, experts had <a href="http://www.bbc.co.uk/news/uk-35228558">several recordings of Dhar giving interviews</a> when still in the UK.</p>
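To make the pipeline concrete, here is a deliberately crude sketch: condense each recording into a small spectral feature vector, then compare vectors with cosine similarity. Real systems use far richer features and trained models; the synthetic "voices" below are just two pairs of sine tones standing in for different pitches.

```python
import numpy as np

def voice_features(signal, n_bins=64):
    """Condense a waveform into a compact feature vector:
    the power spectrum averaged into coarse frequency bins."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    return np.array([chunk.mean() for chunk in np.array_split(spectrum, n_bins)])

def match_score(f1, f2):
    """Cosine similarity between feature vectors: 1.0 means identical shape."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

fs = 8000
t = np.arange(fs) / fs  # one second of "audio"
low_voice = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)
high_voice = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

same = match_score(voice_features(low_voice), voice_features(low_voice))
different = match_score(voice_features(low_voice), voice_features(high_voice))
print(same > different)  # True: the matching recording scores higher
```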
<p>With no suspects to go on the task would be near-impossible, like searching for a needle in a haystack. But what we can learn and infer about a speaker from a recording can itself reduce the haystack to a more manageable size. For example, expert listeners can narrow down the home region, age, gender, emotion and maybe infer something about a speaker’s education. In some cases speech experts can determine where a speaker was born, whether their parents spoke another language, and whether they have lived elsewhere more recently. Perhaps even when they left the UK.</p>
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/ielvhiS1Hfs?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
<h2>Science fiction or reality?</h2>
<p>While much is shrouded in secrecy, speaker identification technology is thought to be used by national security agencies such as GCHQ in the UK, the NSA in the US and the Public Security Bureau in China. It’s also widely believed that voice prints are captured at airport immigration counters in some countries, which perhaps explains why you may be asked a meaningless question or two during processing – after all, facial recognition is already widespread at airports, so why not voices too?</p>
<p>Commercial voice matching technology from the likes of GoVivace, iFlytek, IBM and Nuance is probably at least a generation behind that used by governments. How useful the technology is at present is debatable, but it is used daily by financial institutions as a means of <a href="http://technology.inquirer.net/39021/millions-of-voiceprints-quietly-being-harvested">speaker validation</a> – confirming that callers are who they claim to be.</p>
<p>Voice print analysis has been used in criminal cases since the 1970s, with mixed success, and usually for the less demanding task of proving that speech in a given recording belongs to a particular speaker. Trying to match one speaker from a huge set of possibilities that may not even include the correct match is far more difficult. But it’s not impossible, and systems are improving.</p>
<p class="fine-print"><em><span>Ian McLoughlin does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></em></p>When facial recognition isn’t possible, it’s time to bring in the voice recognition experts.Ian McLoughlin, Professor of Computing, Head of School (Medway), University of KentLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/315792014-09-12T04:45:21Z2014-09-12T04:45:21ZYou’re the voice – the science behind speaker recognition tech<figure><img src="https://images.theconversation.com/files/58847/original/qpcymxsk-1410497334.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Your voice can now be your password – well, for the ATO anyway. </span> <span class="attribution"><a class="source" href="http://www.flickr.com/photos/martinsphotoart/3415997798">Martin playing with pixels.../Flickr (cropped)</a>, <a class="license" href="http://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC</a></span></figcaption></figure><p>You may have read reports that the Australian Tax Office (ATO) has introduced <a href="https://www.ato.gov.au/About-ATO/About-us/Contact-us/Phone-us/Voiceprint/">voiceprint</a> technology which aims to do away with cumbersome identity-verification processes on the telephone. </p>
<p>When you phone the ATO call centre, instead of supplying your date of birth, address or a password, you’re prompted to say: “In Australia my voice identifies me.” By comparing this to a previously recorded voiceprint, the technology will deduce if the tax file number you gave actually belonged to you.</p>
<p>The technology that makes this possible is called “speaker recognition”. So how does it work, and how secure is it?</p>
<h2>Speech recognition and speaker recognition</h2>
<p>Two distinct, but related, technologies use human speech as input: </p>
<ol>
<li><strong>speech recognition</strong> turns speech sounds into text. One speech recognition system that many people are familiar with is Apple’s Siri</li>
<li><strong>speaker recognition</strong> identifies a person based on the sound of their voice, and is what the ATO’s voiceprint system is based on. Speaker recognition is one of a broad range of technologies called biometrics that can identify people based on physical properties such as the sound of their voice, their fingerprint, the shape of blood vessels in their eye or the way they walk.</li>
</ol>
<p>The science behind biometric systems such as voiceprints is based on various machine learning techniques. If you’d like to get technical, some examples are <a href="http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html">hidden Markov models</a>, <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=708428&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D708428">support vector machines</a> and <a href="http://www.doc.ic.ac.uk/%7End/surprise_96/journal/vol4/cs11/report.html">neural networks</a>. These use sophisticated statistical algorithms to create biometric models of a speaker’s voice.</p>
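As a minimal stand-in for those techniques (a single diagonal Gaussian per speaker, far simpler than an HMM, SVM or neural network, and with random numbers standing in for real voice features), a biometric model can be sketched like this:

```python
import numpy as np

class GaussianVoiceprint:
    """Toy biometric model: fit a diagonal Gaussian to a speaker's feature
    vectors, then score new utterances by average log-likelihood."""

    def fit(self, features):
        x = np.asarray(features, dtype=float)
        self.mean = x.mean(axis=0)
        self.var = x.var(axis=0) + 1e-6  # floor to avoid division by zero
        return self

    def score(self, sample):
        s = np.asarray(sample, dtype=float)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)
                          + (s - self.mean) ** 2 / self.var)
        return float(log_lik.mean())

rng = np.random.default_rng(0)
# Stand-ins for feature vectors extracted from each speaker's recordings.
alice = GaussianVoiceprint().fit(rng.normal(0.0, 1.0, size=(200, 8)))
bob = GaussianVoiceprint().fit(rng.normal(3.0, 1.0, size=(200, 8)))

new_utterance = rng.normal(0.0, 1.0, size=8)  # really from Alice
print(alice.score(new_utterance) > bob.score(new_utterance))  # True
```

The model that assigns the new utterance the higher likelihood wins; real systems do the same comparison with far more expressive models.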
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/NbgOvWxFhS8?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
<figcaption><span class="caption">‘My voice is my password.’</span></figcaption>
</figure>
<p>Two common ways that a biometric model can be used are to identify a person based on their voice alone, or to verify by voice whether someone is correctly claiming an identity.</p>
<p>In The Sydney Morning Herald yesterday, Ben Grubb <a href="http://www.smh.com.au/digital-life/digital-life-news/australian-taxation-office-uses-voiceprint-technology-to-speed-up-calls-20140911-10edli.html">reported</a> that the ATO’s voiceprint system is developed by a company called <a href="http://www.nuance.com/index.htm">Nuance</a>, a world leader in speech and speaker recognition. It’s very likely that the ATO uses the technology behind Nuance’s <a href="http://www.nuance.com/landing-pages/products/voicebiometrics/vocalpassword.asp">VocalPassword</a> system, which matches a customer’s passphrase with a recording of that passphrase kept in a database.</p>
<p>Because a voiceprint matches a passphrase with a stored recording, it only has to verify a match rather than sort through the whole database to uniquely identify a caller based on their voice. This means the recognition process can be very fast and can work with very low-quality audio. </p>
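The difference is easy to see in code. In this sketch (toy two-number "voiceprints" and a made-up match score, purely for illustration), verification touches a single stored record while identification must scan the whole database:

```python
def score(a, b):
    """Toy match score: negative squared distance between feature vectors."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

voiceprints = {"alice": [1.0, 2.0], "bob": [4.0, 1.0], "carol": [0.0, 5.0]}
sample = [1.1, 2.2]  # features of the caller's spoken passphrase

def verify(sample, claimed_id, threshold=-1.0):
    """1:1 verification: compare against the claimed identity's print only."""
    return score(sample, voiceprints[claimed_id]) >= threshold

def identify(sample):
    """1:N identification: search every stored print for the best match."""
    return max(voiceprints, key=lambda name: score(sample, voiceprints[name]))

print(verify(sample, "alice"))  # True
print(identify(sample))         # alice
```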
<p>Given a passphrase, the system would return a statistical likelihood that the speaker is the person who provided the original voiceprint. The ATO could select a threshold for a positive identification to ensure a good match was required.</p>
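The choice of threshold trades false rejects (genuine callers pushed to another verification method) against false accepts (impostors let through). A sketch with invented scores:

```python
genuine_scores = [0.91, 0.87, 0.95]   # the true account holder calling in
impostor_scores = [0.40, 0.62, 0.58]  # other speakers claiming that identity

for threshold in (0.90, 0.60):
    false_rejects = sum(s < threshold for s in genuine_scores)
    false_accepts = sum(s >= threshold for s in impostor_scores)
    print(threshold, false_rejects, false_accepts)
# 0.9 -> 1 false reject (the 0.87 caller must re-verify), 0 false accepts
# 0.6 -> 0 false rejects, 1 false accept (the 0.62 impostor slips through)
```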
<h2>On the record</h2>
<figure class="align-right zoomable">
<a href="https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=237&fit=clip" srcset="https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=600&fit=crop&dpr=1 600w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=600&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=600&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=754&fit=crop&dpr=1 754w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=754&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=754&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption"></span>
<span class="attribution"><a class="source" href="http://www.flickr.com/photos/johanl/4934459020">Johan Larsson/Flickr</a>, <a class="license" href="http://creativecommons.org/licenses/by/4.0/">CC BY</a></span>
</figcaption>
</figure>
<p>Engineers who develop systems such as these are very concerned with security. Much research effort has gone into what’s called “<a href="http://www.biometrics.org/bc2002/2_bc0130_DerakhshabiBrief.pdf">liveness detection</a>” and “playback detection”. </p>
<p>These are ways to ensure that a real person is speaking the passphrase rather than a malicious person playing a recording or attempting to mimic another person’s voice. </p>
<p>It’s possible that a voiceprint is susceptible to what’s called a “replay attack”. If a recording could be obtained of someone saying the exact passphrase, there would be a strong chance of being able to access their account. A distinctive passphrase reduces this risk.</p>
<p>Voiceprint can identify you even if you have a cold, because it doesn’t model the sound of your voice – it uses the sound of your voice to model the shape of your vocal tract. When you have a cold, the shape of your vocal tract is still the same (you just might sound a bit nasal).</p>
<p>But there are situations or events that could prevent voiceprint or similar systems from correctly identifying a speaker. If someone received an injury that damaged their vocal tract, it would be unlikely that a speaker recognition system would match a voiceprint made before the injury. </p>
<p>A very poor phone connection or high background noise could also prevent a speaker identification system from working properly. </p>
<p>In both of these cases, a failure to match would probably require a caller to the ATO to verify their identity by another means. The system would be extremely unlikely to misidentify someone.</p>
<p>Systems such as voiceprints are intended to save time for callers and for call-centre workers by reducing the time it takes to verify identities – and less time on the phone with the tax office is always a good thing.</p>
<p class="fine-print"><em><span>Ben Kraal receives funding from the Australian Research Council.</span></em></p><p class="fine-print"><em><span>David Dean receives funding from the Australian Research Council for research related to speaker recognition.</span></em></p>You may have read reports that the Australian Tax Office (ATO) has introduced voiceprint technology which aims to do away with cumbersome identity-verification processes on the telephone. When you phone…Ben Kraal, Research Fellow, Queensland University of TechnologyDavid Dean, Senior Research Fellow, Queensland University of TechnologyLicensed as Creative Commons – attribution, no derivatives.