Imperfect match: weighing probability in forensic voice analysis

A perfect voice match is not yet possible. A Germain/Flickr

We’ve all seen the TV cop shows where a wonderful bit of technology allows the police to analyse a criminal’s voice and solve the case. And while the technology doesn’t allow us to do that just yet, serious advances are being made by forensic investigators using voice analysis.

Today most people carry around mobile phones, which has made video and voice recording common. And there is increased demand for voice analysis in court and police investigations. However, the TV shows — which have driven a lot of the public perception of voice analysis — are extremely unrealistic, and consequently there are widespread misunderstandings regarding its usefulness. In short: voice analysis can provide valuable information, but the results are never as conclusive — and rarely as fast — as in TV shows.

For instance, analysts should never say “The two voices matched.” The human voice is not stable. Even if we try to repeat the same word in the same way, there will be subtle acoustic differences. Voice evidence never has a “match”.

Like fingerprints or DNA, the voice contains information from our anatomy. But fingerprints and DNA are direct records of anatomy, and they change only slowly. The voice is a product of complex interactions among many factors, many of them unrelated to identity. Emotion, health, who is listening, background noise and whether we are on the phone all affect how we speak. And these effects are transient, changing dramatically minute by minute and day by day.

So two speech samples always have different acoustic properties, regardless of the speaker’s identity. The analyst’s task is to assess whether the observed differences are more likely to come from a single speaker or from two separate individuals, and how strong this evidence is.

Forensic science has been shifting towards empirically founded, statistical evaluation. The catalyst for this was a US Supreme Court ruling in Daubert v Merrell Dow Pharmaceuticals in 1993, which set out the requirements for scientific evidence to be admissible in that jurisdiction.

It says that the court must consider whether the theory or technique used is empirically testable and replicable, whether it is accepted in the relevant scientific community, and whether the error rate is reported. Most “scientific” evidence at that time, including voice, failed to comply with these requirements.

In the past, voice evidence was evaluated by experts using their professional experience and technical knowledge, rather than empirical testing and statistical analysis.

Good experts could provide useful information for investigation, but we could not replicate the analysis or test its validity, so we could not know whether we had reliable evidence in court. So, many forensic voice comparison analysts around the world have moved away from this old paradigm.

Today, researchers from two fields are driving forensic voice comparison research: speech engineering and linguistic phonetics. Although engineers and linguists may use different techniques at various stages, they follow the same basic process: establishing comparability, extracting and comparing acoustic features, and finally producing the results in the form of a “likelihood ratio”.

Establishing comparability means being sure that the circumstances of the samples are sufficiently similar for the comparison to make sense. We cannot usefully compare shouting angrily on the phone with speaking calmly to an interviewer. The acoustic features commonly used today include average pitch and spectral characteristics.

The likelihood ratio evaluates the relative probability of the evidence under two competing hypotheses. In the context of voice comparison, it is the ratio of two probabilities: the probability of observing the difference between two voice samples when they come from the same speaker, and the probability of observing this difference when two separate individuals are involved.

Likelihood ratios show us how strong the evidence is, and which hypothesis the evidence supports. A likelihood ratio of 10 means we are 10 times more likely to observe the given evidence when the two samples are from the same person than when they are from separate individuals; 0.01 means we are 100 times more likely to observe the evidence when they are from separate individuals.
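
To make the arithmetic concrete, here is a minimal sketch in Python. It is an illustration only, not how any real forensic system works: it assumes a single feature (the difference in average pitch between two recordings, in Hz) and assumes that this difference follows a bell-shaped (normal) distribution under each hypothesis. The function names and all the numbers are invented for the example. Real analyses use many acoustic features and estimate the distributions from a database of recorded speakers.

```python
import math


def normal_density(x, mean, sd):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood_ratio(pitch_diff, same_sd, different_sd):
    """Ratio of the density of the observed difference under the
    same-speaker hypothesis to its density under the
    different-speaker hypothesis.

    Assumption (hypothetical model): signed pitch differences are
    centred on zero under both hypotheses and differ only in spread.
    """
    p_same = normal_density(pitch_diff, 0.0, same_sd)
    p_different = normal_density(pitch_diff, 0.0, different_sd)
    return p_same / p_different


def describe(lr):
    """Put a likelihood ratio into words."""
    if lr > 1:
        return f"evidence is {lr:.1f} times more likely if same speaker"
    if lr < 1:
        return f"evidence is {1 / lr:.1f} times more likely if different speakers"
    return "evidence supports neither hypothesis"


# Invented spreads: one speaker's average pitch varies by about 10 Hz
# between recordings; two different speakers differ by about 30 Hz.
SAME_SD = 10.0
DIFFERENT_SD = 30.0

for observed in (5.0, 40.0):
    lr = likelihood_ratio(observed, SAME_SD, DIFFERENT_SD)
    print(f"difference of {observed} Hz: {describe(lr)}")
```

Under these made-up figures, a small difference gives a ratio above 1 (support for the same-speaker hypothesis) and a large difference gives a ratio below 1 (support for the different-speaker hypothesis). In neither case does the output say the voices "matched"; it only says how much more probable the evidence is under one hypothesis than the other.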

Although courts often find likelihood ratios difficult to interpret, the forensic science community now considers them to be the logically and legally correct way to present scientific evidence.

Likelihood ratio-based analysis was adopted for voice comparison work only in the last ten years, and there is still a lot to be done. Many researchers are working on techniques to handle noisy recordings, find better features, automate the analysis process, and evaluate the reliability of the process.

However, the most pressing task for fair and reliable analysis is building a large database of the speech of Australian people, including ethnic varieties of Australian speech. No such database exists.

Two ARC-funded projects are addressing this problem and this will improve the situation significantly, but data for ethnic varieties of Australian speech is still to be collected.

There will never be a “perfect match” test, but forensic voice analysis can provide crucial investigative leads, and help courts determine guilt and innocence.