Voice recognition software – The Conversation

Human voices are unique – but our study shows we’re not that good at recognising them
<figure><img src="https://images.theconversation.com/files/174205/original/file-20170616-545-1xwzbxe.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Computers can pick up the specific acoustic features of each individual voice. </span> <span class="attribution"><a class="source" href="https://www.shutterstock.com/download/confirm/571969369?src=gtIMfIr4VKY3Ka6URlW3YA-1-92&size=huge_jpg">Shutterstock</a></span></figcaption></figure><p><em>“Alexa, who am I?”</em> Amazon Echo’s voice-controlled virtual assistant, Alexa, doesn’t have an answer to that – yet. However, for other applications of speech technology, computer algorithms are increasingly able to discriminate, recognise and identify individuals from voice recordings. </p>
<p>Of course, these algorithms are far from perfect, as was recently shown when a BBC journalist broke into his own voice-controlled bank account <a href="http://www.bbc.co.uk/news/technology-39965545">using his twin brother’s voice</a>. Is this a case of computers just failing at something humans can do perfectly? We decided to find out.</p>
<p>Each human being <a href="https://theconversation.com/is-every-human-voice-and-fingerprint-really-unique-63739?sr=1">has a voice that is distinct</a> and different from everyone else’s. So it seems intuitive that we’d be able to identify someone from their voice fairly easily. But how well can you actually do this? When it comes to recognising your closest family and friends, you’re probably quite good. But would you be able to recognise the voice of your first primary school teacher if you heard them again today? How about the guy on the train this morning who was shouting into his phone? What if you had to pick him out, not from his talking voice, but from samples of his laughter, or singing?</p>
<p>To date, research has only explored voice identity perception using a limited set of vocalisations, for example sentences that have been read aloud or snippets of conversational speech. These studies have found that we can actually recognise the voices of familiar people <a href="https://www.bioscience.org/2014/v6s/af/S417/fulltext.htm">quite well</a>. But they have also shown that there are problems: ear-witness testimonies are notoriously <a href="http://psycnet.apa.org/journals/lhb/4/4/373/">unreliable and inaccurate</a>. </p>
<p>It’s important to keep in mind that these studies have not captured much of the flexibility of the sounds we can make with our voices. This is bound to have an effect on how we process the identity of the person behind the voice we are listening to. Therefore, we are currently missing a very large and important piece of the puzzle. </p>
<p>Recognising voices requires two broad processes to operate together: we need to distinguish between the voices of different people (“telling people apart”) and we need to be able to attribute a single identity to all the different sounds (talking, laughing, shouting) that can come from the same person (“telling people together”). We set out to investigate the limits of these abilities in humans.</p>
<h2>Voice experiment</h2>
<p>Our recent study, <a href="https://pure.royalholloway.ac.uk/portal/en/publications/impaired-generalization-of-speaker-identity-in-the-perception-of-familiar-and-unfamiliar-voices(0c74b93a-b5da-4b15-a4f9-a328ed41176d).html">published in the Journal of Experimental Psychology: General</a>, confirms that voice identity perception can be extremely challenging. Capitalising on how variable a single person’s voice can be, we presented 46 listeners with laughter and vowels produced by five people. Listeners were asked to make a very simple judgement about pairs of sounds: were they made by the same person, or by two different people? As long as they could compare vowels to vowels or laughter to laughter respectively, discriminating between speakers was relatively successful. </p>
<p>But when we asked our listeners to make this judgement based on a mixed pair of sounds, such as directly comparing vowels to laughter in a pair, they couldn’t discriminate between speakers at all – especially if they were not familiar with the speaker. However, even though a sub-group of people who knew the speakers performed better overall, they still struggled significantly with the challenge of “telling people together”. </p>
<p>Similar effects have been reported by studies showing, for example, that it is <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2680657/">difficult to recognise a bilingual speaker</a> across their two languages. What’s surprising about these findings is how bad voice perception can be once listeners are exposed to natural variation in the sounds that a voice can produce. So, it’s intriguing to consider that while we each have a unique voice, we don’t yet know how useful that uniqueness is.</p>
<figure class="align-center ">
<img alt="" src="https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=400&fit=crop&dpr=1 600w, https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=400&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=400&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=503&fit=crop&dpr=1 754w, https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=503&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/174204/original/file-20170616-545-4v0aac.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=503&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px">
<figcaption>
<span class="caption">I said</span>
<span class="attribution"><a class="source" href="https://www.shutterstock.com/image-photo/voice-recognition-622099973?src=fncNRwhcfE-8yEQSCOqrKw-1-42">graphbottles/Shutterstock</a></span>
</figcaption>
</figure>
<p>But why have we evolved to have unique voices if we can’t even recognise them? That’s really an open question so far. We don’t actually know whether we have evolved to have unique voices – we also all have different and largely unique fingerprints, but there’s no evolutionary advantage to that as far as we can tell. It just so happens that, based on differences in anatomy and, probably most importantly, how we use our voices, we all sound different to each other. </p>
<p>Luckily computer algorithms are still able to make the most of the individuality of the human voice. They have probably already outdone humans in some cases – and they will keep on improving. The way these machine-learning algorithms recognise speakers is based on mathematical solutions to create “voice prints” – unique representations picking up the specific acoustic features of each individual voice. </p>
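<p>To make the idea of a “voice print” concrete, here is a deliberately minimal sketch in Python. It is illustrative only: it collapses a recording into a set of averaged MFCC features (a standard acoustic representation) and compares two recordings with cosine similarity, whereas production systems rely on far richer statistical models and neural embeddings. The file names are hypothetical, and the librosa audio library is assumed to be available.</p>
<pre><code># Toy "voice print": average MFCC acoustic features over an utterance,
# then compare two prints with cosine similarity. Real systems use far
# richer models (i-vectors, neural embeddings); the idea is similar.
import numpy as np
import librosa  # assumed installed: pip install librosa

def voice_print(path, sr=16000, n_mfcc=13):
    """Collapse a recording into one fixed-length feature vector."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # crude: average the features over time

def similarity(a, b):
    """Cosine similarity: values near 1.0 suggest the same speaker."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical files: an enrolment recording and a new caller.
# enrolled = voice_print("alice_enrolment.wav")
# candidate = voice_print("unknown_caller.wav")
# print("Same speaker?", similarity(enrolled, candidate) > 0.9)
</code></pre>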
<p>In contrast to computers, humans might not know what they are listening out for, or <a href="http://asa.scitation.org/doi/10.1121/1.2770547">how to separate out these acoustic features</a>. So, the way that voice prints are created for the algorithms is not closely modelled on what human listeners appear to do – we’re still working on this. In the long term, it will be interesting to see if there is any overlap in the way human listeners and machine-learning algorithms recognise voices. While human listeners are unlikely to glean any insights from how computers solve this problem, we might conversely be able to build machines that emulate effective aspects of human performance.</p>
<p>It is rumoured that Amazon is currently working on teaching Alexa how to <a href="http://time.com/4683981/amazon-echo-voice-id-feature-2017/">identify specific users by their voice</a>. If this works, it will be a truly impressive feat and may put a stop to <a href="http://www.telegraph.co.uk/news/2017/01/08/amazon-echo-rogue-payment-warning-tv-show-causes-alexa-order/">further unwanted orders of dollhouses</a>. But, do be patient if Alexa makes mistakes – you may not be able to do it any better yourself. </p>
<p><em>We are currently running a further study on voice recognition – so, if you are interested in taking part, <a href="https://brunellifesc.eu.qualtrics.com/jfe/form/SV_6X7hJk5I8Ym2Dd3">click here</a>.</em></p>
<p class="fine-print"><em><span>Carolyn McGettigan receives funding from The Leverhulme Trust and the Economic and Social Research Council.</span></em></p><p class="fine-print"><em><span>Nadine Lavan does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></em></p>Would you be able to recognise the voice of your first primary school teacher, if you heard them again today?Carolyn McGettigan, Professor in Psychology, Royal Holloway University of LondonNadine Lavan, Postdoctoral research associate, Brunel University LondonLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/654952016-09-16T10:36:26Z2016-09-16T10:36:26ZAmazon Echo will bring genuinely helpful AI into our homes much sooner than expected<figure><img src="https://images.theconversation.com/files/138047/original/image-20160916-6332-tz0x7g.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">
</span> <span class="attribution"><span class="source">Amazon</span></span></figcaption></figure><p>What’s all the fuss about the voice-activated home speaker that Amazon is due to <a href="http://www.bbc.co.uk/news/technology-37351342">release in the UK</a> and Germany in late September? This gadget has been available in the US for over a year and has proven a minor hit, with sales estimates between <a href="http://uk.businessinsider.com/how-many-amazon-echo-smart-home-devices-have-been-installed-2016-6?r=US&IR=T">1.6m and 3m</a>. But these figures belie the potential impact this kind of artificial intelligence device could have on our lives in the near future.</p>
<p>Echo doesn’t just let you switch on your music by voice command. It’s the first of what will be several types of smart home appliances that work beyond simple tasks like playing music or turning on a light. It uses an artificial intelligence assistant app called Alexa to allow users to access the information and services of the internet and control personal organisation tools.</p>
<p>You can order a pizza or a taxi, or check the weather or your diary, all just by speaking to Alexa. In this way, it is similar to Apple’s Siri but has <a href="https://developer.amazon.com/appsandservices/solutions/alexa/alexa-voice-service">advances in microphone and AI technology</a> that make it significantly more accurate than past devices in understanding and executing commands – and from anywhere in your home that it can hear you.</p>
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/KkOCeAtKHIc?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
<p>I’ve been living with Amazon Echo for a year now, having imported it from the US via eBay. It’s an astonishing piece of kit that has to be experienced to see exactly why a personal assistant smart home hub could finally succeed. It’s not surprising that Amazon’s CEO Jeff Bezos has said it is potentially the <a href="http://venturebeat.com/2016/05/31/alexa-could-be-the-4th-pillar-of-amazon-says-jeff-bezos/">fourth core Amazon service</a> after its marketplace, cloud services and mobile devices.</p>
<p>Many of us have already become used to <a href="http://www.cbc.ca/radio/spark/292-what-you-say-will-be-searched-why-recognition-systems-don-t-recognize-accents-and-more-1.3211777/here-s-why-your-phone-can-t-understand-your-accent-1.3222569">poor voice-recognition</a> software and error-prone requests on our mobile devices. But Amazon started developing a high-precision microphone and more sophisticated voice recognition system a full 12 months before its competitors and has gained a <a href="https://www.wired.com/2016/05/google-home-amazon-echo/">significant headstart</a>. The big difference with other AI assistants is that instead of a single piece of software, Alexa uses 300 of its own apps (which Amazon calls “skills”) to provide the device’s different capabilities. This creates a system that is far more integrated and sophisticated yet simple to use with minimal setup.</p>
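<p>The “skills” model is easiest to see in code. The sketch below is not the real Alexa Skills Kit API – the handlers and keyword routing are invented for illustration – but it shows the architectural idea: each capability is a small, separate handler, and a dispatcher routes each utterance to one of them.</p>
<pre><code># Illustrative dispatcher in the spirit of Alexa's modular "skills".
# NOT the real Alexa Skills Kit API: the handlers and keyword routing
# are invented; real systems route requests via trained intent models.

def weather_skill(utterance: str) -> str:
    return "Today looks sunny."      # a real skill would call a weather API

def music_skill(utterance: str) -> str:
    return "Playing your playlist."  # a real skill would start playback

SKILLS = {"weather": weather_skill, "play": music_skill}

def dispatch(utterance: str) -> str:
    """Route an utterance to the first skill whose keyword it mentions."""
    for keyword, skill in SKILLS.items():
        if keyword in utterance.lower():
            return skill(utterance)
    return "Sorry, I can't help with that yet."

print(dispatch("Alexa, what's the weather like?"))  # Today looks sunny.
</code></pre>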
<figure class="align-center ">
<img alt="" src="https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=400&fit=crop&dpr=1 600w, https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=400&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=400&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=503&fit=crop&dpr=1 754w, https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=503&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/138051/original/image-20160916-6337-uoiujh.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=503&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px">
<figcaption>
<span class="caption">Always listening.</span>
<span class="attribution"><span class="source">Amazon</span></span>
</figcaption>
</figure>
<p>This is a very significant development in <a href="https://theconversation.com/amazon-dash-is-a-first-step-towards-an-internet-of-things-that-is-actually-useful-39711">the rise of</a> the <a href="https://theconversation.com/smart-homes-need-to-start-treating-their-inhabitants-better-57597">connected home</a>, which is coming as we move from PCs and mobile devices to the era of the <a href="https://theconversation.com/explainer-the-internet-of-things-16542">internet of things</a> when computer chips will be in objects all around us. Echo is arguably the first successful product to bridge that gap. Its working voice recognition service and connected sensors essentially link your home to a marketplace supply chain that services many (if not all) of your needs.</p>
<p>It’s still early days for this kind of device, but it raises the question of how other shops, banks and entertainment companies might need to respond to the technology, because it could effectively place a middle-man between them and their customers. If you want to order something, instead of going to the company that provides it directly, you just go to Amazon through your Echo. It’s what the IT industry might call an “aggregator” or a “service broker platform”. This is the much-spoken-of but near-mythical goal of many tech companies who want to become the service provider of all other services.</p>
<h2>Any downsides?</h2>
<p>The US feedback on Echo has been <a href="http://www.cnet.com/products/amazon-echo-review">very strong</a> from <a href="http://qz.com/611026/amazon-echo-is-a-sleeper-hit-and-the-rest-of-america-is-about-find-out-about-it-for-the-first-time/">early adopters</a>. In my experience, the argument that it doesn’t have a screen and therefore is harder to interact with disappears when you actually use the device. The voice interaction is natural and if there is a problem with the system it’s more to do with learning the range of “skills” the device can perform than getting them to work.</p>
<p>A device that is constantly listening for your commands (although, the company is at pains to make clear, not the rest of your unrelated conversations) will no doubt raise concerns about privacy, just as all our smart devices do. Echo and Alexa work through the existing security protocols that many people already use when online shopping or accessing cloud web services through Amazon. But how secure these systems really are – and their potential for misuse – may come under greater scrutiny once Amazon (or any smart home company) has access not just to our bank details but our private conversations, too.</p>
<p>Echo represents a new kind of interface that will likely make voice-activated services, along with the emerging concepts of virtual and augmented reality, the cutting-edge way we interact with computers in 2017 and beyond. Google have already launched Google Home in the US (a full year late) and other firms are developing <a href="http://www.itpro.co.uk/desktop-hardware/26577/google-home-vs-amazon-echo-vs-apple-homekit-6">similar solutions</a>. The astonishing thing about this is that it’s a vision of the future that’s arriving much sooner than expected. We’re still far from general artificial intelligence, with machines fully able to think and perform like humans, but the days of the keyboard and mouse are numbered.</p>
<p class="fine-print"><em><span>Mark Skilton owns shares in the HAT Hub-of-All-Things.com. He is affiliated with The Open Group and ISO international standards bodies, but this article is an independent view unrelated to any of these organisations directly.</span></em></p>Amazon’s voice-controlled personal assistant device is coming to the UK and bringing smart homes with it.Mark Skilton, Professor of Practice, Warwick Business School, University of WarwickLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/510172016-04-07T09:50:42Z2016-04-07T09:50:42ZCustomer service on hold: we hate phone menus and don’t trust virtual assistants like Siri<figure><img src="https://images.theconversation.com/files/117717/original/image-20160406-28973-18nue9u.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">This isn't going well.</span> <span class="attribution"><a class="source" href="http://www.shutterstock.com/pic.mhtml?id=255223171&src=id">Man image via www.shutterstock.com.</a></span></figcaption></figure><p>“Just thinking about it makes me break out into hives” reported one man in his 60’s. A woman in her 30’s said she does everything she can to avoid it, including pretending she doesn’t speak English. A woman in her 20’s said she’ll do an intensive online search, including blogs, websites and forums, to find others struggling with the same problem so she feels “less alone.” </p>
<p>No, we’re not describing some terrible social encounter or anxiety-provoking health condition. These are examples of how, in our recent nationwide survey and series of interviews, people described dealing with customer service to get help concerning a product or a service.</p>
<p>For the most part, this negative response was based primarily on experiences with interactive voice response systems (IVRs), or “robo-calls,” as one interviewee described them. Interactive voice response systems are those automated menus, prompts and directories that initially answer so many of today’s customer service phone calls. They require us either to press a series of buttons or speak certain keywords to direct the call. We found IVRs are the most common interface for starting customer service journeys – half our respondents’ “most recent” customer service experiences began with IVRs. And, surprise, surprise, they’re among the least-liked automated formats for customer service.</p>
<p>Our work follows up on a study led by our colleague <a href="http://www.bu.edu/com/profile/james-e-katz/">James E. Katz</a> nearly 20 years ago that detailed how <a href="http://doi.org/10.1080/014492997119860">people reacted to IVR technology at that time</a>. In the intervening years, there’ve been surprisingly few studies published on how users now negotiate the dramatic proliferation of communication modalities available for customer service. Outside of proprietary market research, we don’t know that much about users’ feelings concerning the increasing prevalence of voice-activated communication with computers, phones and similar devices. One of our goals was to fill in knowledge gaps about this burgeoning – and for most, irritating – area of media activity.</p>
<figure class="align-center zoomable">
<a href="https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=401&fit=crop&dpr=1 600w, https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=401&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=401&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=503&fit=crop&dpr=1 754w, https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=503&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/117719/original/image-20160406-28940-1dwh2w4.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=503&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption">Press 5 if you just want this to be over.</span>
<span class="attribution"><a class="source" href="http://www.shutterstock.com/pic.mhtml?id=68542516&src=id">Keypad image via www.shutterstock.com.</a></span>
</figcaption>
</figure>
<h2>Almost nobody likes dealing with IVRs</h2>
<p>Our survey, funded by the industry group <a href="http://www.interactions.com/">Interactions</a>, sampled 1,321 online respondents who were demographically matched to the overall U.S. population. In addition, we conducted 50 in-depth follow-up interviews and three focus groups to get a better understanding of the patterns in the survey data.</p>
<p>At the beginning of a customer service experience, 90 percent of our respondents want to speak to a live agent. And no matter how their customer service journey starts – with IVR, email, instant messaging, automated chat, virtual assistants (like Siri and similar voice-controlled mobile apps) or social media – by the end, 83 percent have reached a real, live person. Much as Katz and his colleagues saw in 1997, individuals still overwhelmingly want to deal with a human being rather than a machine. If it doesn’t work easily for them, people do what it takes – within what the system permits – to circumvent automated customer service.</p>
<p>When we asked respondents their opinions about IVRs being the most common entrée to customer service help, the results were almost uniformly negative. Only 10 percent were satisfied with their experience and approximately 35 percent of respondents found the systems difficult to use. Just 3 percent actually <em>liked</em> using the IVR service.</p>
<p>These results did not vary appreciably across gender, but younger individuals tended to rate their most recent IVR experience more favorably than older respondents did.</p>
<h2>Other automated modes for customer service</h2>
<p>The general opinions of these various automated modalities for customer service ranged considerably on a four-point scale from “miserable” to “excellent.” On average, virtual assistants such as <a href="http://www.nytimes.com/2016/01/28/technology/personaltech/siri-alexa-and-other-virtual-assistants-put-to-the-test.html?_r=0">Siri, Cortana, Alexa</a> or similar voice-controlled mobile applications performed the worst in the eyes of customers. Their average ranking was just above unsatisfactory, with 19 percent rating them “miserable.” Accessed via voice-activated mobile phone apps, these virtual assistants are <a href="http://www.psfk.com/2014/10/dom-siri-like-dominos-delivery-app.html">becoming more common</a>. </p>
<p>Of the nonhuman mediated interfaces, email and instant messaging were best received. They ranked behind only in-person customer service and live customer service agents on the phone, which had average rankings greater than “satisfactory” but less than “excellent.” </p>
<p>People’s preference to deal with human customer service agents seems to come down at least partly to trust. Live agents scored an average trust level midway between “some” and “a lot” on a four-point scale. Respondents told us they’re confident a real person will “see it through” and they feel more assured that “the call won’t just end” without some sort of resolution. They know they’ll at least have an answer by the end of the call – even if it’s not the one they want to hear. Social media and virtual assistants (such as Siri) were least trusted, with 35 percent and 29 percent of users reporting having no faith in those interfaces, respectively. </p>
<figure class="align-center zoomable">
<a href="https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=443&fit=crop&dpr=1 600w, https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=443&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=443&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=556&fit=crop&dpr=1 754w, https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=556&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/117721/original/image-20160406-29010-19dxlnp.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=556&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption">These friendly human operators are probably not standing by.</span>
<span class="attribution"><a class="source" href="http://www.shutterstock.com/pic.mhtml?id=155931950&src=id">Operators image via www.shutterstock.com.</a></span>
</figcaption>
</figure>
<h2>What goes wrong?</h2>
<p>An overwhelming majority of respondents reported problems using IVRs.</p>
<p>Part of the reason IVRs and automated speech recognition platforms are so disliked is that consumers must repeat themselves often when using them. Sixty-nine percent of consumers “agreed” or “strongly agreed” that IVRs make it hard to describe the problem that they’re calling about, and 75 percent “agreed” or “strongly agreed” that IVRs forced them to listen to irrelevant options. Similar percentages thought IVRs present choices that lead nowhere and so achieve nothing, that IVRs have too many menus, and that the prompts used in IVRs are too long. </p>
<p>People also consistently reported their frustration with the robot’s (in)ability to understand them and the need to repeat the same information many times during the interaction without making progress. People still just want to get to a live agent. Overall, respondents reported feeling like the IVR robot is “dragging out the conversation” and forcing them to pick from prompts that don’t really fit their problem. All of this leaves consumers “feel[ing] like I’m being managed,” as one woman described it.</p>
<p>Interestingly, people had strong emotional responses to these experiences. They reported fear about not understanding the prompts or pressing the wrong button, anger and frustration when the IVRs do not lead them to the right place, and an overwhelming sense of stress in general. </p>
<h2>Computer voices distant second to the real deal</h2>
<p>Obviously, <a href="https://mitpress.mit.edu/books/wired-speech">speech is an integral part of being human</a>. We’re extremely adept at picking up on its social aspects, and can easily distinguish between a voice that is human and one that is synthetic. Our inherent responses to computerized voices differ from our responses to live voices, which influences the level of comfort or frustration we feel with IVR.</p>
<p>Previous research has found that a <a href="https://global.oup.com/academic/product/image-bite-politics-9780195372076?cc=nl&lang=en&">mismatch or incongruities</a> between apparent emotion or delivery of a message and its content make people uncomfortable. This might be one cause for IVR users’ discomfort, but likely is only the tip of the iceberg.</p>
<p>In general, people also become uncomfortable and rate the helpfulness of an IVR lower <a href="https://mitpress.mit.edu/books/wired-speech">when synthetic voices refer to themselves as “I.”</a> People are unconsciously discomfited when the pronoun “I” is ascribed to anything not fully human that possesses agency.</p>
<p>Perceived gender of the voice can also influence how people react to IVR technology. Because of a natural tendency to treat technology socially, people <a href="http://doi.org/10.1145/633292.633461">automatically assign gendered stereotypes</a> to a voice. Indeed, others have argued that so many personal digital assistants like Siri and Cortana have <a href="http://doi.org/10.1177/1071181312561295">female voices and names because</a> people tend to find them more pleasant and helpful. Male voices, on the other hand, have been rated as more authoritative and contribute to higher perceptions of the usability of the service. </p>
<h2>Rise of the bots</h2>
<p>Our survey respondents held relatively innovative attitudes towards technology and were not especially apprehensive about communication. Even so, exactly half reported feeling that the use of IVRs shows “machines are taking over.” One woman in her 20s said “it annoys me that the company thinks they can do [customer service] with a robot.” And a woman in her 70s reported quite succinctly that “if I want to talk to a machine, I’ll yell at my computer.” </p>
<p>In this environment, we were surprised to find instant messaging had the highest favorability and trust rankings of any mediated customer service system – higher even than email. People told us they love the synchronous nature of this communication – they can see that someone is ostensibly typing or working on their answer in real time. Respondents liked that they can simultaneously multitask, so don’t feel they’ve been put on hold. And they don’t have to engage in fake pleasantries during the interaction.</p>
<p>Person-to-person communication remains hard to beat. Instant messaging platforms allow the person seeking support to interact in real time with someone capable of understanding human expression <em>accurately</em> and <em>quickly</em>. One woman reported loving that she can “just type” her questions and “receive an immediate response.” Many echoed this sentiment, and also appreciated having a written log of the conversation as well, things that technology enhances rather than diminishes in this modality.</p>
<p>One of the most interesting findings from our interviews and focus groups was that lots of people initiate live-chat with a company while on hold or dealing with the IVR. In essence, live-chat has become a workaround to the robot roadblocks people confront on the phone.</p>
<p>The fact that instant messaging ranked so highly demonstrates that consumers are open to using technology to improve customer service because it is faster and more immediate. But they’ll intentionally seek out platforms that provide detailed and personalized responses to complex questions in real time that minimize time costs, communication errors and frustration.</p>
<p class="fine-print"><em><span>This research was funded by Interactions, llc.</span></em></p><p class="fine-print"><em><span>Chelsea Cutino does not work for, consult, own shares in or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></em></p>Phone trees drive you mad? Just want to talk to an actual person? You aren’t alone – despite the fact that most customer service journeys begin with automated interactive voice response systems.Jacob Groshek, Associate Professor, Emerging Media Studies, Boston UniversityChelsea Cutino, Master's Student in Emerging Media Studies, Boston UniversityJill Walsh, Postdoctoral Fellow in Sociology, Boston UniversityLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/551842016-02-23T13:01:58Z2016-02-23T13:01:58ZSelfies could replace security passwords – but only with an upgrade<figure><img src="https://images.theconversation.com/files/112546/original/image-20160223-16464-ekcpa7.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Selfie shopping</span> <span class="attribution"><span class="source">Shutterstock</span></span></figcaption></figure><p>The next time you do some online shopping or call your bank, you may find you no longer have to scrabble around to remember your security password. Banks are <a href="http://www.bbc.co.uk/news/business-35609833">increasingly turning</a> to voice recognition technology as their preferred way of ensuring customers are who they say they are when they use telephone banking services. Mastercard has even announced that it will accept fingerprints or selfies as proof of identity for online purchases. </p>
<p>But does this kind of technology really mean that you’ll soon be able to just forget your passwords? The short answer right now is “no”. Banks are adopting <a href="https://theconversation.com/can-voice-recognition-technology-really-identify-a-masked-jihadi-52787">voice recognition</a> technologies (often known as “speaker identification” in research literature) and <a href="http://mi.eng.cam.ac.uk/%7Ecipolla/publications/article/1997-IVC-face-detection.pdf">face recognition</a> primarily for verification, not identification. These technologies are a reasonable tool for verifying a person is who they claim to be because machines can learn how one person normally speaks or looks. But they are not yet good methods of identifying a single customer from the very large number of possible voices or faces a bank might have in their database.</p>
<p>For voice identification to work, the difference between your voice and others’ (inter-speaker variation) must always be greater than the difference between your voice now and on another occasion (<a href="http://seas3.elte.hu/VLlxx/gosy.html">intra-speaker variation</a>). The same is true with “selfie recognition”; you need to look more like the normal you than anyone else does. In practice, this doesn’t always happen. The more voices or faces a system compares, the more likely it will find two that are very similar.</p>
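<p>A toy simulation makes the scaling problem visible. Below, every “voice print” is just a random point in an invented feature space, and intra-speaker variation jitters your own print between sessions; we then check whether any stranger in the database lands closer to today’s voice than your own enrolment does. All the numbers are made up for illustration.</p>
<pre><code># Toy model of identification at scale: voice prints are random points,
# and "intra-speaker variation" jitters your own print between sessions.
# The bigger the database, the likelier a stranger beats your own match.
import numpy as np

rng = np.random.default_rng(0)
DIM = 20         # dimensions of the invented voice-print space
INTRA_STD = 1.0  # how much one voice drifts from session to session

def misidentified(n_enrolled):
    others = rng.normal(size=(n_enrolled, DIM))   # everyone else's prints
    mine = rng.normal(size=DIM)                   # my enrolment print
    today = mine + rng.normal(scale=INTRA_STD, size=DIM)  # my voice now
    my_dist = np.linalg.norm(today - mine)
    nearest_other = np.linalg.norm(others - today, axis=1).min()
    return my_dist > nearest_other  # True: someone else matched better

for n in (10, 1_000, 100_000):
    rate = np.mean([misidentified(n) for _ in range(100)])
    print(f"{n:>7} enrolled voices: misidentified {rate:.0%} of the time")
</code></pre>
<p>The exact percentages depend entirely on the invented parameters, but the trend is the point: the larger the database, the more often someone else’s print beats your own.</p>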
<figure class="align-center ">
<img alt="" src="https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=602&fit=crop&dpr=1 600w, https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=602&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=602&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=756&fit=crop&dpr=1 754w, https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=756&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/112557/original/image-20160223-16416-b9ymr4.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=756&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px">
<figcaption>
<span class="caption">Voice or selfie recognition for identification and verification.</span>
</figcaption>
</figure>
<p>Imagine all the things that can change your own voice when you make a phone call: a blocked or damaged microphone, tiredness, mouth or throat pain, drinking excessive amounts of alcohol, eating curry or misplaced dentures. These make intra-speaker variation large. For face recognition, facial hair, complexion changes, makeup, glasses, lighting and face coverings all contribute to changes in the way you look.</p>
<p>The consequence is that banks have a fair chance of “verifying” that a caller or selfie-taker is who they claim to be, but not of “identifying” an unknown voice or selfie. So we will still need a way to identify ourselves for the foreseeable future, and the best method remains a secret PIN or password.</p>
<p>A driver in Malaysia who had a fingerprint authentication system fitted to his new Mercedes S-class in 2005 found out <a href="http://news.bbc.co.uk/1/hi/world/asia-pacific/4396831.stm">the painful way</a> that some biometrics can be stolen. When thieves discovered that his car could only be started with a fingerprint, they promptly stole his finger along with his car.</p>
<p>A <a href="https://theconversation.com/can-voice-recognition-technology-really-identify-a-masked-jihadi-52787">simple voiceprint</a> can likewise be stolen. All you need is a good quality recording of the person speaking. The same is true for systems that require a user to speak a fixed passphrase or PIN. Interactive systems using a <a href="https://www.sans.org/reading-room/whitepapers/authentication/exploration-voice-biometrics-1436">challenge-response protocol</a> (e.g. asking a user to repeat an unusual phrase) would raise the difficulty level for attackers, but <a href="http://homepages.inf.ed.ac.uk/zwu2/papers/apsipa2014_replay.pdf">can be defeated by current technology</a>.</p>
<p><a href="http://staff.estem-uc.edu.au/mwagner/files/2014/05/ChettyWagner_InvestigatingFeatureLevelFusionForLiveness_ISSPA_2005.pdf">Face recognition</a> (such as that used to identify selfies), lip reading, and <a href="http://www.cs.columbia.edu/%7Ebelhumeur/courses/biometrics/2010/howirisrecognitionworks.pdf">iris pattern recognition</a> are all visual methods that could possibly be stolen or spoofed by pictures or video images.</p>
<h2>More biometric data</h2>
<p>The solution appears to be either to make use of additional secret information (which means yet more to remember) or to combine different types of biometric information. Unfortunately, methods that require a camera <a href="https://www.newscientist.com/article/mg21128266-000-face-recognition-technology-fails-to-find-uk-rioters/">are sometimes of limited use</a>: the user must face the camera, must not have glasses or clothing obscuring their face and eyes, and needs adequate lighting – and the system probably should not be used while in the bath.</p>
<p>Other researchers are investigating the biometric potential of capturing an <a href="http://csee.essex.ac.uk/staff/palaniappan/IJVLSI_Biometric_Palani_2007.pdf">individual’s unique brainwaves</a> with a headset or, more recently, with earphones. But <a href="http://www.commsp.ee.ic.ac.uk/%7Edlooney/PDFs/The%20In-the-Ear%20Recording%20Concept.pdf">such technology is in its infancy</a>.</p>
<figure class="align-center ">
<img alt="" src="https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&fit=clip" srcset="https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=275&fit=crop&dpr=1 600w, https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=275&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=275&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=345&fit=crop&dpr=1 754w, https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=345&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/112422/original/image-20160222-25891-1gu316e.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=345&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px">
<figcaption>
<span class="caption">Multimodal biometric authentication.</span>
</figcaption>
</figure>
<p>One future technology being developed for mobile devices is an ultrasound scanner that <a href="http://www.lintech.org/savad/index.html">maps part of the face shape</a> of a person speaking. This is not just a snapshot of the face, but a recording of how the mouth of the speaker moves as the words are spoken. The biometric aspect is not just confined to the sound of the voice but includes the way the mouth shape changes as the voice is produced. The required hardware is even built into most smartphones already.</p>
<p>Imagine walking into a bakery and picking up a crusty farmhouse loaf. You take it over to the baker and say “I would like to buy this, please.” “That will be two pounds, do you wish to proceed with the purchase?” replies the baker. “Yes, please proceed,” you say, and wait for their “Okay” before walking out with your loaf. No cash, no payment card and no personal details divulged. </p>
<p>It might sound like a scene from a bygone era when you knew your local baker and maintained an account with them. But it is, in fact, a future that researchers are working hard to enable. <a href="http://www.bankingtech.com/47982/Biometrics-the-case-for-convenience/">Your smartphone will employ voice authentication and speech recognition technology to authorise the payment with your bank</a> who will confirm the transaction electronically with the baker. Meanwhile, a <a href="http://quintessentialthinking.com.au/uploaded_files/pos.pdf?PHPSESSID=92d3559682c19aade822be530c356e16">point-of-sale video recording</a> of the transaction will be lodged with both your bank and the bakery. So while you shouldn’t throw away your passwords just yet, you can expect some exciting developments in this area over the next few years.</p>
<p class="fine-print"><em><span>I have recently been awarded (but has not yet used) Faculty Research funding from Google to explore the biometric potential of speech and ultrasonic speech. This project has not yet started - it is due to start in June 2016 - and thus there are obviously no results yet.
The funding is a one-off "unrestricted gift" that has no strings attached.
Anyway, I haven't mentioned Google in the article.</span></em></p>
Plans to introduce voice and facial recognition technology for online shopping and banking point to a password-free future.
Ian McLoughlin, Professor of Computing, Head of School (Medway), University of Kent
Licensed as Creative Commons – attribution, no derivatives.

Listen to me: machines learn to understand how we speak
<figure><img src="https://images.theconversation.com/files/84778/original/image-20150612-11418-33kb9x.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Your smartphone is learning to better understand your voice commands.</span> <span class="attribution"><a class="source" href="https://www.flickr.com/photos/janitors/13989720008/">Flickr/Kārlis Dambrāns</a>, <a class="license" href="http://creativecommons.org/licenses/by/4.0/">CC BY</a></span></figcaption></figure><p>At Apple’s recent <a href="https://developer.apple.com/wwdc/">World Wide Developer Conference</a>, one of the tent-pole items was the addition of new intelligent voice recognition features for its personal assistant app Siri in the latest update to its mobile operating system, <a href="http://www.apple.com/ios/ios9-preview/">iOS 9</a>.</p>
<p>Now, instead of asking Siri to “remind me about Kevin’s birthday tomorrow”, you can rely on context and just ask Siri to “remind me of this” while viewing the Facebook event for the birthday. It will know what you mean.</p>
<p>Technology like this has also existed in Google devices for a little while now – thanks to <a href="https://support.google.com/websearch/answer/6031948">OK Google</a> – bringing us ever closer to context-aware voice recognition.</p>
<p>But how does it all work? Why is context so important and how does it tie in with voice recognition?</p>
<p>To answer that question, it’s worthwhile looking back at how voice recognition works and how it relates to another important area, natural language processing.</p>
<h2>A brief history of voice recognition</h2>
<p>Voice recognition has been in the public consciousness for a long time. Rather than tapping on a keyboard, wouldn’t it be nice to speak to a computer in natural language and have it understand everything you say?</p>
<p>Ever since Captain Kirk’s conversation with the computer aboard the USS Enterprise in the original Star Trek series in the 1960s (and Scotty’s <a href="https://www.youtube.com/watch?v=QpWhugUmV5U">failed attempt</a> to talk to a 20th-century computer in one of the later Original Series movies) we’ve dreamed about how this might work.</p>
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/NM4yEOdIHnc?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
<p>Even movies set in more recent times have flirted with the idea of better voice recognition. The technology-focused <a href="http://www.imdb.com/title/tt0105435/">Sneakers</a> from 1992 features Robert Redford painfully collecting snippets of an executive’s voice and playing them back with a tape recorder into a computer to gain voice access to the system.</p>
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/-zVgWpVXb64?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
<p>But the simplicity of the science-fiction depictions belies a complexity in the process of voice-recognition technology. Before a computer can even understand what you mean, it needs to be able to understand what you said.</p>
<p>This involves a complex process that includes audio sampling, feature extraction and then actual speech recognition to recognise individual sounds and convert them to text. </p>
<p>Researchers have been working on this technology for many years. They have developed techniques that extract features in a similar way to the human ear and recognise them as phonemes and sounds that human beings make as part of their speech. This involves the use of <a href="http://www.sciencedirect.com/science/article/pii/S1051200409001821">artificial neural networks</a>, <a href="http://www.sciencedirect.com/science/article/pii/S0167639303000992">hidden Markov models</a> and other ideas that are all part of the broad field of artificial intelligence.</p>
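<p>To give a flavour of the machinery involved, here is a miniature, self-contained hidden Markov model decode: given a sequence of already-extracted acoustic symbols, the classic Viterbi algorithm recovers the most likely phoneme sequence behind them. All the probabilities are invented purely for illustration.</p>
<pre><code># Miniature hidden Markov model decode: the Viterbi algorithm finds the
# most likely hidden phoneme sequence behind a series of acoustic
# observations. All probabilities here are invented for illustration.
import numpy as np

phonemes = ["h", "e", "l", "o"]
start = np.array([0.7, 0.1, 0.1, 0.1])
trans = np.array([[0.2, 0.6, 0.1, 0.1],   # h usually moves on to e
                  [0.1, 0.2, 0.6, 0.1],   # e usually moves on to l
                  [0.1, 0.1, 0.4, 0.4],   # l repeats or moves on to o
                  [0.1, 0.1, 0.1, 0.7]])  # o tends to stay o
# emission[s, o]: chance that phoneme s produced acoustic symbol o
emission = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.7, 0.1, 0.1],
                     [0.1, 0.1, 0.7, 0.1],
                     [0.1, 0.1, 0.1, 0.7]])

def viterbi(observations):
    scores = start * emission[:, observations[0]]
    back = []
    for obs in observations[1:]:
        cand = scores[:, None] * trans    # score of every state-to-state move
        back.append(cand.argmax(axis=0))  # best predecessor for each state
        scores = cand.max(axis=0) * emission[:, obs]
    path = [int(scores.argmax())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    return [phonemes[s] for s in reversed(path)]

print(viterbi([0, 1, 2, 2, 3]))  # ['h', 'e', 'l', 'l', 'o']
</code></pre>
<p>Real recognisers work the same way in spirit, but over thousands of context-dependent states, with probabilities learned from vast amounts of recorded speech.</p>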
<p>Through these models, speech-recognition rates have improved. Error rates of less than 8% were <a href="http://venturebeat.com/2015/05/28/google-says-its-speech-recognition-technology-now-has-only-an-8-word-error-rate/">reported this year</a> by Google.</p>
<p>But even with these advancements, auditory recognition is only half the battle. Once a computer has gone through this process, it only has the text that replicates what you said. But you could have said anything at all.</p>
<p>The next step is natural language processing.</p>
<h2>Did you get the gist?</h2>
<p>Once a machine has converted what you say into text, it then has to understand what you’ve actually said. This process is called “natural language processing”. This is arguably more difficult than the process of voice recognition, because human language is full of context and semantics that make the process of natural language recognition difficult.</p>
<p>Anybody who has used earlier voice-recognition systems can testify as to how difficult this can be. Early systems had a very limited vocabulary and you were required to say commands in just the right way to ensure that the computer understood them.</p>
<p>This was true not only for voice-recognition systems, but even textual input systems, where the order of the words and the inclusion of certain words made a large difference to how the system processed the command. This was because early language-processing systems used hard rules and decision trees to interpret commands, so any deviation from these commands caused problems.</p>
<p>Newer systems, however, use machine-learning algorithms similar to the hidden Markov models used in speech recognition to build a vocabulary. These systems still need to be taught, but they are able to make softer decisions based on weightings of the individual words used. This allows for more flexible queries, where the language used can be changed but the content of the query can remain the same.</p>
<p>This is why it’s possible to ask Siri either to “schedule a calendar appointment for 9am to pick up my dry-cleaning” or “enter pick up my dry-cleaning in my calendar for 9am” and get the same result.</p>
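<p>A stripped-down sketch of that “softer”, weighted matching: each intent scores an utterance by summing per-word weights, so differently worded requests can land on the same intent. The weights here are hand-picked for illustration; a real assistant learns them from training data.</p>
<pre><code># Sketch of "soft" intent matching with per-word weights, in contrast
# to the rigid command grammars of early systems. Weights are invented;
# a real assistant would learn them from training data.
INTENT_WEIGHTS = {
    "create_calendar_event": {"schedule": 2.0, "calendar": 2.0,
                              "appointment": 1.5, "enter": 1.0, "9am": 0.5},
    "send_message": {"text": 2.0, "message": 2.0, "send": 1.5},
}

def classify(utterance: str) -> str:
    """Return the intent whose word weights best cover the utterance."""
    words = utterance.lower().split()
    scores = {intent: sum(weights.get(w, 0.0) for w in words)
              for intent, weights in INTENT_WEIGHTS.items()}
    return max(scores, key=scores.get)

# Differently worded, same result:
print(classify("schedule a calendar appointment for 9am to pick up my dry-cleaning"))
print(classify("enter pick up my dry-cleaning in my calendar for 9am"))
# both print: create_calendar_event
</code></pre>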
<h2>But how do you deal with different voices?</h2>
<p>Despite these advancements there are still challenges in this space. In the field of voice recognition, accents and pronunciation can still cause problems.</p>
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/6xtHEIfJtPo?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
<p>Because of the way the systems work, different pronunciation of phonemes can cause the system to not recognise what you’ve said. This is especially true when a word’s spelling seems (to non-locals) to bear no relation to the way it is pronounced, such as the British cities of “Leicester” or “Glasgow”.</p>
<p>Even Australian cities such as “Melbourne” seem to trip up some Americans. While to an Australian the pronunciation of Melbourne is very obvious, the different way that phonemes are used in America means that they often pronounce it wrong (to parochial ears).</p>
<p>Anybody who has heard a GPS system mispronounce Ipswich as “eyp-swich” knows this also goes both ways. The only way around this is to train the system in the different ways words are pronounced. But with the variation in accents (and even pronunciation within accents) this can be quite a large and complex process.</p>
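<p>One common workaround is a pronunciation lexicon that stores several accepted phoneme sequences per word, one for each accent or variant the system has been trained on. The sketch below uses rough, ARPAbet-style phoneme strings invented for illustration.</p>
<pre><code># Sketch of a pronunciation lexicon with several accepted variants per
# word. Phoneme strings are rough ARPAbet-style guesses, for
# illustration only; real lexicons are built and tuned by specialists.
LEXICON = {
    "leicester": [["L", "EH", "S", "T", "ER"]],              # "less-ter"
    "glasgow":   [["G", "L", "AA", "S", "G", "OW"],
                  ["G", "L", "AE", "S", "G", "OW"]],         # accent variants
    "melbourne": [["M", "EH", "L", "B", "ER", "N"],          # Australian
                  ["M", "EH", "L", "B", "AO", "R", "N"]],    # American
}

def pronunciations(word):
    """Return every phoneme sequence the recogniser should accept."""
    return LEXICON.get(word.lower(), [])

print(pronunciations("Melbourne"))  # either variant is recognised
</code></pre>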
<p>On the language-processing side, the issue is predominantly one of context. The Siri example given in the opening shows the state of the art in contextual language processing. But all you need to do is pay attention to a conversation for a few minutes to realise how much we would have to change the way we speak to give machines extra context.</p>
<p>For instance, how often do you ask somebody:</p>
<blockquote>
<p>Did you get my e-mail?</p>
</blockquote>
<p>But what you actually mean is:</p>
<blockquote>
<p>Did you get my e-mail? If you did, have you read it and can you please provide a reply as response to this question?</p>
</blockquote>
<p>Things get even more complicated when you want to engage in a conversation with a machine, asking an initial question and the follow-up questions, such as “What is Martin’s number?”, followed by “Call him” or “Text him”. </p>
<p>Machines are improving when it comes to understanding context, but they still have a way to go!</p>
<h2>Automatic translation</h2>
<p>So, we have made great progress in a lot of different areas to get to this point. But there are still challenges ahead in accent recognition, implied meaning in language, and context in conversations. This means it might still be a while before we have those computers from Star Trek interpreting everything we say.</p>
<p>But rest assured. We are slowly getting closer, with recent advancements from Microsoft in <a href="https://www.microsoft.com/translator/at.aspx">automatic translation</a> showing that, if we get it right, the result can be very cool.</p>
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/HGFNGcpmDyA?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
<p>Google has recently revealed technology that uses a combination of image or voice recognition, natural language processing and the camera on your smartphone to automatically translate signs and short conversations from one language to another for you. It will even try to match the font so that the sign looks the same, but in English! </p>
<p>So no longer do you need to ponder over a menu written in Italian, or wonder how to order from a waiter who doesn’t speak English – Google has you covered. Not quite the USS Enterprise, but certainly closer!</p>
<p class="fine-print"><em><span>Michael Cowling does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></em></p>Voice recognition technology is getting better at understanding what we are saying, even if we only say part of what we mean. So how does it work?Michael Cowling, Senior Lecturer & Discipline Leader, Mobile Computing & Applications, CQUniversity AustraliaLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/315792014-09-12T04:45:21Z2014-09-12T04:45:21ZYou’re the voice – the science behind speaker recognition tech<figure><img src="https://images.theconversation.com/files/58847/original/qpcymxsk-1410497334.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Your voice can now be your password – well, for the ATO anyway. </span> <span class="attribution"><a class="source" href="http://www.flickr.com/photos/martinsphotoart/3415997798">Martin playing with pixels.../Flickr (cropped)</a>, <a class="license" href="http://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC</a></span></figcaption></figure><p>You may have read reports that the Australian Tax Office (ATO) has introduced <a href="https://www.ato.gov.au/About-ATO/About-us/Contact-us/Phone-us/Voiceprint/">voiceprint</a> technology which aims to do away with cumbersome identity-verification processes on the telephone. </p>
<p>When you phone the ATO call centre, instead of supplying your date of birth, address or a password, you’re prompted to say: “In Australia my voice identifies me.” By comparing this to a previously recorded voiceprint, the technology will deduce if the tax file number you gave actually belonged to you.</p>
<p>The technology that makes this possible is called “speaker recognition”. So how does it work, and how secure is it?</p>
<h2>Speech recognition and speaker recognition</h2>
<p>Two distinct, but related, technologies use human speech as input: </p>
<ol>
<li><strong>speech recognition</strong> turns speech sounds into text. One speech recognition system that many people are familiar with is Apple’s Siri</li>
<li><strong>speaker recognition</strong> identifies a person based on the sound of their voice, and is what the ATO’s voiceprint system is based on. Speaker recognition is one of a broad range of technologies called biometrics that can identify people based on physical properties such as the sound of their voice, their fingerprint, the shape of blood vessels in their eye or the way they walk.</li>
</ol>
<p>The science behind biometric systems such as voiceprints is based on various machine learning techniques. If you’d like to get technical, some examples are <a href="http://www.nature.com/nbt/journal/v22/n10/full/nbt1004-1315.html">hidden Markov models</a>, <a href="http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=708428&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D708428">support vector machines</a> and <a href="http://www.doc.ic.ac.uk/%7End/surprise_96/journal/vol4/cs11/report.html">neural networks</a>. These use sophisticated statistical algorithms to create biometric models of a speaker’s voice.</p>
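<p>As a concrete illustration, here is a minimal sketch of one classic technique from this family, a Gaussian mixture model “voiceprint”, written in Python with scikit-learn. The random vectors stand in for real acoustic features (such as MFCCs), and nothing here reflects any particular vendor’s implementation:</p>
<pre><code>
# Minimal Gaussian-mixture "voiceprint" sketch using scikit-learn.
# Real systems extract acoustic features (e.g. MFCCs) from audio first;
# here random vectors stand in for those features.
import numpy as np
from sklearn.mixture import GaussianMixture

def enrol_speaker(features):
    """Fit a GMM to a speaker's enrolment features -- their 'voiceprint'."""
    model = GaussianMixture(n_components=8, covariance_type="diag",
                            random_state=0)
    model.fit(features)
    return model

def verify(model, features, threshold):
    """Accept the claimed identity if the average log-likelihood of the
    new utterance under the enrolled model clears the threshold."""
    return model.score(features) > threshold

rng = np.random.default_rng(0)
enrolment = rng.normal(loc=0.0, scale=1.0, size=(500, 13))     # enrol call
same_speaker = rng.normal(loc=0.0, scale=1.0, size=(200, 13))  # later call
impostor = rng.normal(loc=2.0, scale=1.0, size=(200, 13))      # other voice

voiceprint = enrol_speaker(enrolment)
print(verify(voiceprint, same_speaker, threshold=-20.0))  # expected: True
print(verify(voiceprint, impostor, threshold=-20.0))      # expected: False
</code></pre>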
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/NbgOvWxFhS8?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
<figcaption><span class="caption">‘My voice is my password.’</span></figcaption>
</figure>
<p>Two common ways that a biometric model can be used are to identify a person based on their voice alone, or to verify by voice whether someone is correctly claiming an identity.</p>
<p>In The Sydney Morning Herald yesterday, Ben Grubb <a href="http://www.smh.com.au/digital-life/digital-life-news/australian-taxation-office-uses-voiceprint-technology-to-speed-up-calls-20140911-10edli.html">reported</a> that the ATO’s voiceprint system is developed by a company called <a href="http://www.nuance.com/index.htm">Nuance</a>, a world leader in speech and speaker recognition. It’s very likely that the ATO uses the technology behind Nuance’s <a href="http://www.nuance.com/landing-pages/products/voicebiometrics/vocalpassword.asp">VocalPassword</a> system, which matches a customer’s passphrase with a recording of that passphrase kept in a database.</p>
<p>Because a voiceprint matches a passphrase with a stored recording, it only has to verify a match rather than sort through the whole database to uniquely identify a caller based on their voice. This means the recognition process can be very fast and can work with very low-quality audio. </p>
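<p>The difference is easy to see in code. In the hedged sketch below, toy scoring functions stand in for real voiceprint models and the speaker names are invented: verification consults a single model, while identification must score every enrolled one.</p>
<pre><code>
# Verification is a 1:1 check; identification is a 1:N search.
import numpy as np

rng = np.random.default_rng(1)

def make_voiceprint(centre):
    """Toy scorer: higher when the features sit near this speaker's centre."""
    return lambda features: -float(np.mean((features - centre) ** 2))

enrolled = {name: make_voiceprint(rng.normal(size=13))
            for name in ("alice", "bob", "carol")}

def identify(features):
    # 1:N search across the whole database -- cost grows with enrolment
    return max(enrolled, key=lambda name: enrolled[name](features))

def verify(claimed, features, threshold):
    # 1:1 check against only the claimed identity's voiceprint
    return enrolled[claimed](features) > threshold

caller = rng.normal(size=13)
print(identify(caller))                        # best-matching enrolled name
print(verify("alice", caller, threshold=-3.0)) # True/False for the claim
</code></pre>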
<p>Given a passphrase, the system would return a statistical likelihood that the speaker is the person who provided the original voiceprint. The ATO could then set the threshold this likelihood must clear for a positive identification, ensuring a close match is required.</p>
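<p>Selecting that threshold is a trade-off: too lenient and impostors slip through, too strict and genuine callers are rejected. A toy sketch with invented scores shows the balancing act:</p>
<pre><code>
# Sweep candidate thresholds over held-out verification scores to see the
# false-accept / false-reject trade-off. Scores are invented for illustration.
import numpy as np

genuine = np.array([-18.5, -19.2, -17.8, -20.1, -18.9])   # true callers
impostor = np.array([-24.3, -26.0, -22.7, -25.1, -23.8])  # other voices

for threshold in (-25.0, -22.0, -19.5):
    far = np.mean(impostor > threshold)       # impostors wrongly accepted
    frr = 1.0 - np.mean(genuine > threshold)  # genuine callers rejected
    print(f"threshold={threshold}: FAR={far:.0%}, FRR={frr:.0%}")
</code></pre>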
<h2>On the record</h2>
<figure class="align-right zoomable">
<a href="https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=1000&fit=clip"><img alt="" src="https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=237&fit=clip" srcset="https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=600&h=600&fit=crop&dpr=1 600w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=600&h=600&fit=crop&dpr=2 1200w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=600&h=600&fit=crop&dpr=3 1800w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=754&h=754&fit=crop&dpr=1 754w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=30&auto=format&w=754&h=754&fit=crop&dpr=2 1508w, https://images.theconversation.com/files/58841/original/s649zv65-1410494664.jpg?ixlib=rb-1.1.0&q=15&auto=format&w=754&h=754&fit=crop&dpr=3 2262w" sizes="(min-width: 1466px) 754px, (max-width: 599px) 100vw, (min-width: 600px) 600px, 237px"></a>
<figcaption>
<span class="caption"></span>
<span class="attribution"><a class="source" href="http://www.flickr.com/photos/johanl/4934459020">Johan Larsson/Flickr</a>, <a class="license" href="http://creativecommons.org/licenses/by/4.0/">CC BY</a></span>
</figcaption>
</figure>
<p>Engineers who develop systems such as these are very concerned with security. Much research effort has gone into what’s called “<a href="http://www.biometrics.org/bc2002/2_bc0130_DerakhshabiBrief.pdf">liveness detection</a>” and “playback detection”. </p>
<p>These are ways to ensure that a real person is speaking the passphrase rather than a malicious person playing a recording or attempting to mimic another person’s voice. </p>
<p>It’s possible that a voiceprint is susceptible to what’s called a “replay attack”. If a recording could be obtained of someone saying the exact passphrase, there would be a strong chance of being able to access their account. A distinctive passphrase reduces this risk.</p>
<p>A voiceprint system can still identify you when you have a cold because it doesn’t model the raw sound of your voice – it uses the sound of your voice to model the shape of your vocal tract. When you have a cold, the shape of your vocal tract is still the same (you just might sound a bit nasal).</p>
<p>But there are situations or events that could prevent voiceprint or similar systems from correctly identifying a speaker. If someone received an injury that damaged their vocal tract, it would be unlikely that a speaker recognition system would match a voiceprint made before the injury. </p>
<p>A very poor phone connection or high background noise could also prevent a speaker identification system from working properly. </p>
<p>In both of these cases, a failure to match would probably require a caller to the ATO to verify their identity by another means. The system would be extremely unlikely to mistake one caller for another.</p>
<p>Systems such as voiceprints are intended to save time for callers and for call-centre workers by reducing the time it takes to verify identities – and less time on the phone with the tax office is always a good thing.</p>
<p class="fine-print"><em><span>Ben Kraal receives funding from the Australian Research Council.</span></em></p><p class="fine-print"><em><span>David Dean receives funding from the Australian Research Council for research related to speaker recognition.</span></em></p>You may have read reports that the Australian Tax Office (ATO) has introduced voiceprint technology which aims to do away with cumbersome identity-verification processes on the telephone. When you phone…Ben Kraal, Research Fellow, Queensland University of TechnologyDavid Dean, Senior Research Fellow, Queensland University of TechnologyLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/243102014-03-13T06:19:56Z2014-03-13T06:19:56ZSounding like a liar doesn’t make you a benefits cheat<figure><img src="https://images.theconversation.com/files/43742/original/878y7vtc-1394647050.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">You can't tell if someone's lying by listening to their voice and councils should know that by now.</span> <span class="attribution"><a class="source" href="http://www.flickr.com/photos/cizake/4164756091/sizes/o/">Florian Seroussi</a>, <a class="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA</a></span></figcaption></figure><p>Councils are facing questions about their use of <a href="http://www.theguardian.com/society/2014/mar/10/councils-use-lie-detector-tests-benefits-fraudsters">lie detectors</a> in attempts to catch benefits cheats over the phone. The idea is to listen out for subtle changes in the voice that might indicate that someone isn’t telling the truth about their circumstances.</p>
<p>But there are serious concerns about the technology being used and a chronic lack of evidence to support the claims made about it.</p>
<p>The UK government first tried out <a href="http://www.independent.co.uk/news/uk/crime/lie-detectors-to-assess-if-paedophiles-will-reoffend-6168950.html">lie detection tools</a> with offenders in 2004, despite findings from the US <a href="http://books.google.co.uk/books?id=4WR0AAAAQBAJ&pg=PA102&lpg=PA102&dq=%E2%80%98progressed+over+time+in+the+manner+of+a+typical+scientific+field%E2%80%99+lie+detection+national+research+council&source=bl&ots=cBcePpaf5z&sig=z3ty1F6Vk2KPzwpiRfkiB5tMllQ&hl=en&sa=X&ei=NHcgU9HAO6nT7AaLvIHAAg&ved=0CC0Q6AEwAA#v=onepage&q=%E2%80%98progressed%20over%20time%20in%20the%20manner%20of%20a%20typical%20scientific%20field%E2%80%99%20lie%20detection%20national%20research%20council&f=false">National Research Council</a> that suggested research in this area had not “progressed over time in the manner of a typical scientific field”. In a 2003 assessment, the council warned that research into testing for deception using physiological monitors had failed to “strengthen its scientific underpinnings in any significant manner”.</p>
<p>The main problem is that both voice stress analysis software and the polygraph, which monitors blood pressure, are based on the assumption that liars are more behaviourally aroused than truth tellers because they are afraid of being caught. In reality, displays of arousal depend on many factors, not least the circumstances in which we tell a lie, the individual differences between people, how serious the lie is and the potential repercussions of being found out.</p>
<p>In fact, liars often fail to show an increase in arousal. This can be attributed to a number of causes. Liars do not necessarily experience a fear of being caught or may even be able to control their level of arousal during a test.</p>
<p>On the other hand, truth tellers may show increased arousal due to a fear of not being believed. In the case of the UK government’s attempts to bring in lie detection, this might mean a fear of losing benefits.</p>
<h2>Lying or just stressed?</h2>
<p>The voice stress analyser, or voice risk analyser, uses microphones attached to computers to detect and display readings for the intensity, frequency, pitch, harmonics and micro-tremors of the voice.</p>
<p>This is based on established, empirically tested theory that changes to the voice indicate stress or arousal. When we are aroused, our muscles tense and tighten; the tensed muscles of the larynx vibrate at a higher frequency, leading to an increase in pitch.</p>
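<p>Measuring pitch itself is routine signal processing. The sketch below shows the classic autocorrelation method such analysers might use, with synthetic tones standing in for calm and tense speech; bear in mind that a raised pitch indicates arousal, not lying:</p>
<pre><code>
# Estimate fundamental frequency (pitch) of a frame by autocorrelation.
# The synthetic sine waves stand in for calm vs. tense voiced speech.
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Return the dominant fundamental frequency in Hz."""
    frame = frame - frame.mean()
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)  # shortest plausible period, in samples
    hi = int(sample_rate / fmin)  # longest plausible period, in samples
    period = lo + np.argmax(corr[lo:hi])
    return sample_rate / period

sr = 16_000
t = np.arange(2048) / sr
calm = np.sin(2 * np.pi * 120 * t)   # relaxed vocal folds: lower pitch
tense = np.sin(2 * np.pi * 180 * t)  # tensed vocal folds: higher pitch
print(estimate_pitch(calm, sr), estimate_pitch(tense, sr))  # ~120, ~180
</code></pre>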
<p>When voice stress analysis was first developed, it was hailed as an alternative to the polygraph because it is a non-invasive process and can be conducted without the person being tested even knowing. Unfortunately, this is the only improvement on the polygraph. The only real difference between the two is that a different physiological response is being measured.</p>
<p>Both techniques suffer from methodological and theoretical problems, particularly if those conducting the test don’t first ask the examinee to lie and to tell the truth in response to a series of control questions. Without this, there is no baseline for how an individual behaves when being truthful compared to when they lie, as the sketch below illustrates.</p>
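<p>Here is a toy sketch of that missing comparison, with invented pitch readings. Even when done properly, it only says which baseline a reading sits closer to, not whether anyone lied:</p>
<pre><code>
# Compare a test reading against per-person baselines gathered from
# control questions. Numbers are invented; proximity to the "lie" baseline
# indicates arousal at best, not deception.
import numpy as np

truth_baseline = np.array([118.0, 121.0, 119.5])  # pitch (Hz), known truths
lie_baseline = np.array([131.0, 128.5, 130.0])    # pitch (Hz), known lies
test_reading = 129.0

# Distance to each baseline, in that baseline's standard-deviation units
z_truth = abs(test_reading - truth_baseline.mean()) / truth_baseline.std()
z_lie = abs(test_reading - lie_baseline.mean()) / lie_baseline.std()
label = "lie" if z_truth > z_lie else "truth"
print(f"reading is closer to the {label} baseline (arousal, not proof of lying)")
</code></pre>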
<p>The key point is that voice stress analysis offers a measure of stress or arousal, and this is not the same as measuring deception. There is still scant data linking the signs of stress and negative emotion in the human voice with lying.</p>
<p>Despite disappointing results from numerous <a href="http://www.sciencedirect.com/science/article/pii/S0167876005001364">scientific studies</a> on the validity of these systems, voice analysis continues to be popular and several manufacturers produce variations of voice analysis software, which are sold as user-friendly computer devices. There is no evidence to suggest that these programs work at a level any better than chance.</p>
<p>Even ignoring the lack of any empirical support for its application, voice stress analysis is an outdated tool. Recent advances in the area of deception detection acknowledge that you need to take a multi-channel approach, looking at both verbal and non-verbal behavioural cues while taking context into account.</p>
<p>This is one of the aims of ongoing research being conducted within the <a href="http://www.uclan.ac.uk/research/environment/groups/emotions_credibility_and_deception_group.php">Centre for Emotions, Credibility and Deception</a> at the University of Central Lancashire. While technology, and in particular measures of real-time body motion and language use, undoubtedly has much to contribute to the field of deception detection, we need to ensure that the work is carried out in a rigorous manner, and is based on sound methodology and theory. Voice stress analysis is not.</p>
<p class="fine-print"><em><span>Beth Helen Richardson does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></em></p>Councils are facing questions about their use of lie detectors in attempts to catch benefits cheats over the phone. The idea is to listen out for subtle changes in the voice that might indicate that someone…Beth Helen Richardson, Senior Research Fellow, University of Central LancashireLicensed as Creative Commons – attribution, no derivatives.tag:theconversation.com,2011:article/48172011-12-20T19:32:08Z2011-12-20T19:32:08ZSomething about Siri: has the iPhone virtual assistant become the Apple of our eye?<figure><img src="https://images.theconversation.com/files/6593/original/8bjbj33s-1324343068.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=496&fit=clip" /><figcaption><span class="caption">Siri's become a useful assistant, but there are things she could do better.</span> <span class="attribution"><span class="source">Apple</span></span></figcaption></figure><p>In less than two months, <a href="https://theconversation.com/apples-iphone-4s-is-a-game-changer-siri-ously-3880">Siri</a>, Apple’s virtual assistant, has insinuated herself into western culture. This has been less because of Apple’s <a href="http://www.apple.com/iphone/features/siri.html">marketing</a> and more due to the public’s general interest in the concept and Siri’s potential. </p>
<p>Of course, the fact Siri refused to advise people of locations of <a href="http://www.guardian.co.uk/technology/2011/dec/01/siri-abortion-apple-unintenional-omissions">abortion clinics</a> helped increase her notoriety. Siri’s reproductive health oversight was declared an unintentional lapse by Apple; it was taken as a sign by comedian <a href="http://gawker.com/5864049/stephen-colbert-siri-is-clearly-an-arch+conservative-woman">Stephen Colbert</a> that Siri is actually ultra-conservative.</p>
<p>Colbert reinforces his claim with the fact <a href="http://www.youtube.com/watch?feature=player_embedded&v=E91Qu1nVQtE">Siri can’t understand “foreign” accents</a>.</p>
<p>Since Siri’s launch, developers have managed to reverse-engineer the way Siri communicates with Apple’s servers. Through this, they have been able to get some insight into how Siri works and provide a mechanism to extend her capabilities. </p>
<p>Recently, developer <a href="http://twitter.com/#!/plamoni">Pete Lamonica</a> created a piece of software called <a href="http://www.idownloadblog.com/2011/11/28/interview-pete-lamonica-siriproxy/">SiriProxy</a>. Once SiriProxy is installed on a computer connected to a local network, an iPhone can be reconfigured to talk to SiriProxy instead of Apple’s servers.</p>
<p>SiriProxy can then intercept replies the servers send back to the phone and carry out a whole range of activities, from switching lights off and on in a room (see video below) to <a href="http://www.idownloadblog.com/2011/11/25/siri-now-talking-to-cars/">unlocking and starting a car</a>.</p>
<figure> <div style="text-align:center;">
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/SzwRSMGM1Gs?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
</div></figure>
<p>Lamonica has shown that most of the processing of Siri takes place on Apple’s servers. Siri packages up the audio you record (when you ask her a question) and sends it to the server for interpretation. It is for this reason that Siri (and all speech recognition functionality) will not even start if there is no active internet connection.</p>
<p>Because of this, Siri does not need very much processor power and so Apple’s decision not to make it available on the iPhone 4 and iPhone 3GS is more about marketing (to make the iPhone 4S more desirable) than <a href="https://discussions.apple.com/thread/3359314?start=0&tstart=0">a lack of processing power</a>. (On a slight tangent, because Siri uses the internet to send voice, heavy use when roaming internationally might not be advisable.)</p>
<p>On receiving the audio, the server sends text commands back, telling Siri what to display and what to say. Siri has text-to-speech capabilities and can also interact with a limited range of applications. </p>
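<p>The exchange might look something like the sketch below. The payload layout and command names are invented for illustration; Apple’s actual protocol is proprietary:</p>
<pre><code>
# Act on the display/speak commands a server might send back to the phone.
# Payload structure is hypothetical, not Apple's real protocol.
import json

def handle_server_reply(reply_json):
    for command in json.loads(reply_json)["commands"]:
        if command["type"] == "display":
            print("SHOW:", command["text"])   # rendered on screen
        elif command["type"] == "speak":
            print("SAY:", command["text"])    # sent to text-to-speech

reply = json.dumps({"commands": [
    {"type": "display", "text": "Calling Martin..."},
    {"type": "speak", "text": "Calling Martin now."},
]})
handle_server_reply(reply)
</code></pre>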
<p>SiriProxy has a range of <a href="http://en.wikipedia.org/wiki/Plug-in_(computing)">“plugins”</a> that can intercept the commands Apple sends back and then run custom code to carry out a seemingly limitless range of actions.</p>
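<p>Real SiriProxy plugins are written in Ruby; the sketch below re-creates the basic listen-and-intercept pattern in Python, with invented names, to show the idea rather than SiriProxy’s actual API:</p>
<pre><code>
# Intercept-and-dispatch pattern: match recognised text against registered
# phrases and run custom code instead of passing the reply through.
class Plugin:
    def __init__(self):
        self.handlers = []  # (phrase, action) pairs

    def listen_for(self, phrase, action):
        """Register custom code to run when a phrase is heard."""
        self.handlers.append((phrase.lower(), action))

    def intercept(self, transcript):
        """Return a custom reply if any handler matches, else None
        (meaning: forward the original server reply untouched)."""
        for phrase, action in self.handlers:
            if phrase in transcript.lower():
                return action()
        return None

lights = Plugin()
lights.listen_for("turn on the lights", lambda: "Lights on.")
print(lights.intercept("Please turn on the lights"))  # Lights on.
print(lights.intercept("What's the weather?"))        # None: forwarded
</code></pre>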
<p>In a <a href="http://vimeo.com/5424527">conference speech</a>, Tom Gruber, the original technical architect at Siri (the company behind the assistant before Apple acquired it), explained Siri’s origins and the way the application works. In fact, Siri was pretty accurately foretold by Apple in 1987 with a concept called the “<a href="http://en.wikipedia.org/wiki/Knowledge_Navigator">Knowledge Navigator</a>”. </p>
<p>The Knowledge Navigator concept video portrayed an academic talking to his personal assistant. The essential elements of Siri’s current capabilities were all foretold in the video. In it, the academic gets a list of appointments and details of waiting messages.</p>
<p>He then uses the assistant to help prepare his afternoon’s lecture on deforestation of the Amazon rainforest (even this was prescient of the whole climate change debate). The preparation includes collaboration with a colleague over a videoconference and real-time data analysis and visualisation. This latter interaction, however, is sadly still science fiction. </p>
<p>Apple faces a herculean challenge to further develop and enhance Siri. It has been two years since Tom Gruber’s presentation that basically demonstrated all of the Siri functionality found in the iPhone 4S. Even bringing the Siri functionality available in the US market to the rest of the world presents a significant challenge. The difficulties are not as obvious as they might seem. </p>
<p>Locating services, as the glitch with abortion clinics has shown, is fraught with nuance. This nuance is not just a question of language and geography. Translating text is one thing; interpreting language in the context of local society and culture is much more difficult. Avoiding upsetting your customers and governments is even harder still.</p>
<p>The challenge of expanding Siri to world markets to reach feature parity with the US will make it that much harder to innovate on its capabilities. In this respect, allowing other developers to provide services at the back end of Siri is Apple’s only hope of progressing Siri’s potential.</p>
<figure><div style="text-align:center;">
<figure>
<iframe width="440" height="260" src="https://www.youtube.com/embed/SHoukZpMhDE?wmode=transparent&start=0" frameborder="0" allowfullscreen=""></iframe>
</figure>
</div></figure>
<p>Even though Apple has an advantage with the quality of technology used in Siri, it is clear Microsoft and Google will work on their own respective speech recognition technologies: <a href="http://www.microsoft.com/en-us/tellme/">TellMe</a> and <a href="http://www.eweek.com/c/a/Search-Engines/Google-Majel-May-Answer-Apple-Siri-798344/">Majel</a>.</p>
<p>But, as the video above dramatically illustrates, TellMe (at least) is so hopeless in comparison that Microsoft will really need to go back to the drawing board or look for existing technology elsewhere.</p>
<p>In the meantime, comedians are busy exploring Siri’s potential capabilities. One scenario is played out by the <a href="http://www.collegehumor.com/video/6648229/siri-argument">College Humor</a> site in a sketch (warning: some bad language) where Siri gets between a husband and wife having an argument. Acting like a discreet English butler, Siri’s attempts to defuse the argument sadly fail. </p>
<p>There’s no doubt Siri’s capabilities, even in the area of marriage guidance, will only get better.</p>
<p><em>You can follow David Glance on <a href="http://twitter.com/#!/david_glance">Twitter</a>.</em></p>
<p class="fine-print"><em><span>David Glance does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.</span></em></p>In less than two months, Siri, Apple’s virtual assistant, has insinuated herself into western culture. This has been less because of Apple’s marketing and more due to the public’s general interest in the…David Glance, Director, Centre for Software Practice, The University of Western AustraliaLicensed as Creative Commons – attribution, no derivatives.