How a simple observation from the 1800s about patterns in big data sets can fight fraud

Why are some pages of a book of numbers tables more dog-eared than others? Book image via www.shutterstock.com.

Benford’s law was first mentioned by the American scientist Simon Newcomb in the 1880s, when he noticed that in books of tables of logarithms, the pages of numbers whose leading digit was 1 were more worn than the pages of numbers whose leading digit was 9. For some reason, people seemed to be consistently looking up certain numbers more frequently than others.

Not much was done with this observation for 50 years. Then Frank Benford, an engineer at General Electric, rediscovered it, finding the same pattern again and again as he looked at a variety of different data sets, from values of special functions to river lengths, county populations, physical constants, addresses of the first 342 listed members of the American Men of Science…, and on and on.

Distribution of leading digits from the data sets of Benford’s paper; the amalgamation of all observations is denoted by ‘Average.’ Some examples agree better with Benford’s law than others, but the amalgamation of them all is fairly close to Benford’s law. Frank Benford, CC BY-ND

What he noticed is that often the first digits of numbers in a data set are not distributed equally.

Here’s the mathematical way to describe it. A data set follows Benford’s law if the probability of observing a first digit of d is log(1 + 1/d) (all logs here are base 10). That makes the probability of a first digit of 1 around 30%, with a 9 happening roughly 4.6% of the time. This rapid decay is in stark contrast to the intuition many people have that all numbers would be equally likely to serve as a leading digit (with 1 or 9 or any other digit happening about 11% of the time each).

Benford probabilities. Steven J Miller, CC BY-ND
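To make the formula concrete, here is a minimal Python sketch that tabulates log(1 + 1/d) for each possible first digit. The function name benford_probability is chosen here for illustration only; it is not taken from the book or from any library.

```python
import math


def benford_probability(d):
    """Probability that the first digit equals d under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("a first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Tabulate all nine first digits; the probabilities should add up to 1.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"first digit {d}: {p:.1%}")
print(f"total: {sum(benford_table.values()):.4f}")
```

The probabilities sum to exactly 1 because the product of (1 + 1/d) for d from 1 to 9 telescopes to 10, and log(10) = 1.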

As editor of the recently published book Theory and Applications of Benford’s Law, I collected detailed descriptions of why so many systems exhibit this universal behavior, and what the consequences are.

There are a lot of explanations for why so many systems follow this law.

One particularly nice illustration is the example of a geometric process, say a stock that increases 4% a year. If we start with US$1, then after one year we have $1.04. After two years, we have $1.0816, and so on, finally reaching $2 after about 17.673 years. It would take approximately 58.708 years to reach $10. If we increase by a constant multiple each time, it’ll take more time to go from 1 to 2 than from 9 to 10 because the magnitude of the increase is larger at 9 than at 1 and the distance to cover is the same.

Here’s that explanation in the language of mathematics. At time t we have $1.04^t, so if $1.04^t = $2 then t log(1.04) = log(2), or t₂ = log(2)/log(1.04). Similarly we see that we reach $10 at t₁₀ = log(10)/log(1.04), or approximately 58.708 years. Thus the fraction of the time we spend with first digit 1 is t₂/t₁₀ = log(2)/log(10) = log(2) (since our logarithms are base 10). A similar argument works for the other leading digits.
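The argument above can be checked numerically. The sketch below assumes the same 4% yearly growth starting from $1, samples the price at evenly spaced moments between $1 and $10, and counts how often each first digit appears. The helper names (years_to_reach, leading_digit, fraction_of_time_with_digit) and the number of samples are illustrative choices, not part of the original argument.

```python
import math

GROWTH = 1.04  # 4% a year, as in the stock example


def years_to_reach(target, growth=GROWTH):
    """Years until $1 grows to `target` at a constant yearly growth factor."""
    return math.log10(target) / math.log10(growth)


def leading_digit(x):
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])


def fraction_of_time_with_digit(d, growth=GROWTH, steps=20_000):
    """Share of evenly spaced sample times, from $1 up to $10, with first digit d."""
    t10 = years_to_reach(10, growth)
    hits = sum(
        1
        for i in range(steps)
        if leading_digit(growth ** (t10 * i / steps)) == d
    )
    return hits / steps


print(f"years to reach $2:  {years_to_reach(2):.3f}")
print(f"years to reach $10: {years_to_reach(10):.3f}")
for d in range(1, 10):
    simulated = fraction_of_time_with_digit(d)
    predicted = math.log10(1 + 1 / d)
    print(f"digit {d}: simulated {simulated:.4f}, Benford {predicted:.4f}")
```

The simulated fractions should agree with log(1 + 1/d) up to the sampling resolution, and the result does not depend on the 4% figure: any constant growth factor gives the same fractions over one full factor-of-10 stretch.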

The author’s lecture to his probability students on Benford’s law.

In my conversations with Frank Benford’s grandson, he’s entertainingly called Benford’s law a growth industry. It has gone from an obscure subject, with a paper or two per decade, to one growing exponentially across numerous fields – not unlike the stock price in our example above. More and more systems have been shown to follow Benford’s law, and paper after paper has been written on this phenomenon and explanations for its prevalence.

One of the more interesting uses is to detect fraud. Accounting professor Mark Nigrini pioneered this area when he noticed Benford’s law could be used to detect financial irregularities in many data sets. Many of these data sets should follow Benford’s law – but when people create fraudulent sets they’re often unaware of this pattern. In their phony data, they either make all first digits equally likely, or cluster them in the middle.
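One common way to turn this idea into a screening step is a chi-square goodness-of-fit test on first digits. The sketch below is a generic illustration, not a description of Nigrini’s actual procedure. The two data sets are synthetic stand-ins: a geometric sequence plays the role of honest data, and the integers 100 to 999 play the role of phony data with all first digits equally likely. The cutoff 15.507 is the standard 5% critical value for a chi-square distribution with 8 degrees of freedom.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
CHI2_CRITICAL_8DF_5PCT = 15.507  # 9 digit classes -> 8 degrees of freedom


def first_digit(amount):
    """First significant digit of a nonzero number."""
    return int(f"{abs(amount):.15e}"[0])


def benford_chi_square(amounts):
    """Chi-square statistic comparing observed first digits with Benford's law."""
    digits = [first_digit(a) for a in amounts if a != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in BENFORD.items()
    )


# Synthetic illustrations only; neither list is real financial data.
geometric_amounts = [1.04 ** k for k in range(1000)]  # Benford-like growth
uniform_digit_amounts = list(range(100, 1000))        # every first digit equally common

for label, data in [("geometric", geometric_amounts),
                    ("uniform first digits", uniform_digit_amounts)]:
    stat = benford_chi_square(data)
    verdict = "looks suspicious" if stat > CHI2_CRITICAL_8DF_5PCT else "consistent with Benford"
    print(f"{label}: chi-square = {stat:.1f} ({verdict})")
```

A large statistic is only a flag for a closer look, not proof of fraud, since some legitimate data sets do not follow Benford’s law in the first place.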

Benford’s law can help flag credit card fraud. Credit cards image via www.shutterstock.com.

To give you a sense of Benford’s law’s power and utility, here’s my favorite application. It involves banking. If you lose your credit card, or have it stolen, after uttering some expletives you quickly call the bank to report the incident. The person on the phone offers kind, consoling words and reassures you that you are not liable for the charges and a new card is on its way. This ends your involvement, and starts theirs. They have assumed the responsibility to pay the charges, and have two options: they can just pay, or they can try to find the thief and make them pay.

It’s probably not worth it to the credit card company to track down someone who’s run up $90; if it’s $90,000, that’s a different story! Banks often have a demarcation line; anything below is written off as not worth the time to investigate, while anything higher generates a probe. For many companies that line is $5,000, which leads to my favorite example. An investigation at one bank turned up many more stolen card totals starting with a 4 than Benford’s law would predict. Eventually they found that a large number were around $4,800 or $4,900, and attributable to one agent who was having friends run up debts just below the threshold before reporting the card stolen! Fraudsters discovered, thanks again to Benford’s law.
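A per-digit check shows how such a spike could surface. The sketch below is purely illustrative: the amounts are synthetic (a Benford-like baseline between $100 and $10,000, plus a hypothetical cluster of totals at $4,800 and $4,900), and the cutoff of 3 standard deviations is an arbitrary assumption, not the method the bank used.

```python
import math


def first_digit(amount):
    """First significant digit of a nonzero number."""
    return int(f"{abs(amount):.15e}"[0])


def digits_in_excess(amounts, z_threshold=3.0):
    """Return (digit, observed count, expected count) for over-represented first digits."""
    amounts = [a for a in amounts if a != 0]
    n = len(amounts)
    counts = {d: 0 for d in range(1, 10)}
    for a in amounts:
        counts[first_digit(a)] += 1
    flagged = []
    for d in range(1, 10):
        p = math.log10(1 + 1 / d)  # Benford probability of first digit d
        z = (counts[d] / n - p) / math.sqrt(p * (1 - p) / n)
        if z > z_threshold:
            flagged.append((d, counts[d], round(n * p, 1)))
    return flagged


# Synthetic data: totals spread evenly on a log scale from $100 up to $10,000.
baseline = [100 * 10 ** (k / 250) for k in range(500)]
# Hypothetical cluster of totals sitting just under a $5,000 threshold.
suspicious = [4800, 4900] * 30

print("baseline only:", digits_in_excess(baseline))
print("with cluster: ", digits_in_excess(baseline + suspicious))
```

In this synthetic setup only the digit 4 should stand out once the cluster is added, which mirrors how an excess of totals starting with 4 pointed investigators toward amounts just below the threshold.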

There are many other uses. University of Michigan Professor of Political Science and Statistics Walter Mebane has fruitfully used Benford’s law to detect voter fraud. Knowing the expected pattern can help determine whether or not a digital image has been modified. The field of steganography studies hiding images inside images, where the embedded file often contains coded messages. Benford’s law has also found use in medical statistics, in psychology of games, in computer science…. Not bad for a subject that began with some worn pages in an old logarithm table.