Can we exploit the patterns lurking in real-life datasets?
A look at Benford's law, a late-19th century observation that is now used to investigate--and even prosecute--financial crimes
Today’s post is another reader request. Please comment or reach out if there is a quantitative business topic that you wish you understood better. I will do my best to write posts for all reasonable requests!
Below are two sequences of 64 1s and 0s (read left to right, then top to bottom). One of the sequences was randomly generated by Excel. The other I made myself, trying to appear random. Can you tell which is which?
If you said that the sequence on the left is random, you are correct. How might you have known that? You could start by counting the 1s and 0s, which should split roughly 50-50, or 32 of each. The random sequence has 35 1s and 29 0s, whereas mine has 33 1s and 31 0s. Both are close to even, so that’s unlikely to help. The giveaway, it turns out, is the length of the longest “streaks” of 1s and 0s. In my fake random sequence, there are never more than three 1s or 0s in a row. In the genuinely random string, there are nine 1s in a row and six 0s in a row (see the picture below). How does this expose me as a random number generator impersonator? You can show that the probability of the longest 1-streak (and 0-streak) in a string of 64 digits being three or fewer is about 0.5%. In other words, there’s a 99.5% chance of seeing at least four identical digits in a row in a truly random string of 64 1s and 0s.
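That 0.5% figure is easy to check by simulation. Below is a minimal Monte Carlo sketch (function names are my own) that estimates the chance that neither digit ever appears more than three times in a row in 64 fair coin flips:

```python
import random

def longest_run(bits):
    """Length of the longest run of identical symbols in a sequence."""
    best = cur = 1
    for prev, nxt in zip(bits, bits[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

def p_max_run_at_most(k, n=64, trials=50_000, seed=0):
    """Monte Carlo estimate of P(longest run <= k) in n fair coin flips."""
    rng = random.Random(seed)
    hits = sum(
        longest_run([rng.randint(0, 1) for _ in range(n)]) <= k
        for _ in range(trials)
    )
    return hits / trials

# Should land near 0.005: a streak of four or more is almost guaranteed.
print(p_max_run_at_most(3))
```

You can also derive the exact answer by counting binary strings whose runs all have length at most three, but the simulation makes the point with less bookkeeping.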
This example demonstrates a well-established fact in psychology: humans are not good at generating random numbers. By trying too hard to “look random,” we inadvertently produce discernible patterns in our strings of random numbers. While the preceding example is just a parlor trick, it raises the question of whether it’s possible to exploit humans’ inability to emulate randomness.
Benford’s Law
Imagine you had a list of all the rivers in the world together with their lengths in yards. Next, create a list of numbers by taking the first digit from the length of each river. For example, the Colorado River is about 2,550,000 yards long, so it contributes a ‘2’ to the list. What would you expect the distribution (i.e., histogram) of the first digits to look like? My guess would be that each digit 1-9 is equally likely to appear, with probability P = 1/9 ≈ 11.1%. It turns out that this is not the case. There’s a ~30% chance that the first digit will be 1, and each subsequent digit is less likely than the one before it. The approximate probability that a number d is the first digit in the length of a river is given by P(d) = log(1 + (1/d)), with the log in base 10.
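The formula is a one-liner to tabulate. A quick sketch (the dictionary name is mine):

```python
import math

# First-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
# Probabilities fall from about 30.1% for d=1 to about 4.6% for d=9.
# They sum to exactly 1, since the logs telescope to log10(10).
```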

This phenomenon—that leading digits in real-life datasets are more likely to be small—is known as Benford’s law. As a mathematician, I kind of cringe at the word “law” because there’s no proof that it’s true. (And it doesn’t hold in general.) That said, let’s look at an example where Benford’s law is known to hold to get a sense of the type of datasets to which it might apply. The Fibonacci sequence of numbers is defined as follows: the first two numbers are 1, and each subsequent term is the sum of the previous two. The first few terms are 1, 1, 2, 3, 5, 8, 13, 21, 34, and so on. If you pluck the first digits from the first 500 Fibonacci numbers, you’ll see near-perfect agreement with the Benford distribution.
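You can verify that agreement yourself in a few lines. A sketch using exact integer arithmetic, so no precision is lost in the big Fibonacci numbers:

```python
import math
from collections import Counter

# Generate the first 500 Fibonacci numbers as exact big integers.
fibs = [1, 1]
while len(fibs) < 500:
    fibs.append(fibs[-1] + fibs[-2])

# Tally leading digits and compare with Benford's prediction.
counts = Counter(int(str(f)[0]) for f in fibs)
for d in range(1, 10):
    observed = counts[d] / len(fibs)
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.1%} vs Benford {expected:.1%}")
```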

What is it about Fibonacci numbers that makes them behave this way? For starters, any adequately large collection of Fibonacci numbers will span many orders of magnitude. For example, the fiftieth Fibonacci number is 12,586,269,025. The river data also have this property. (The Roe River in Montana is about 200 feet long, whereas the Missouri is more than 2,300 miles long.) Conversely, the heights of adult U.S. citizens in feet will not follow Benford’s law because the data sit in a relatively narrow band. (Almost everyone has a leading digit of 4, 5, or 6.)
The Fibonacci sequence has another property that’s even more important. In the long run, each number in the sequence is approximately φ times the previous one, where φ is the “golden ratio” (1 + √5)/2 ≈ 1.618.¹ In other words, the sequence grows approximately exponentially. Writing the nth Fibonacci number as F_n, you can use this fact to show that

log(F_n) ≈ n·log(φ) − log(√5).
Since log(φ) is irrational, the fractional part of the expression on the right will be uniformly distributed in (0, 1). (Trust me on that one.) Let’s think about the first digit of F_n. You can write F_n in scientific notation as

F_n = m × 10^k,

with m between 1 and 10 and k an integer. The log of this number is

log(F_n) = k + log(m),

whose fractional part is log(m) because k is an integer and 1 ≤ m < 10. If the leading digit of m is 1—that is, if 1 ≤ m < 2—then log(m) lies between 0 = log(1) and log(2) = .301. Similarly, if the leading digit of m is 2, then log(m) lies between log(2) = .301 and log(3) = .477. Proceeding down the line, we end up chunking the interval [0, 1] in such a way that a leading digit of 1 takes the first 30.1%, a leading digit of 2 takes the next (.477 − .301) = 17.6%, and so forth. These are exactly the probabilities from the Benford distribution. But we just argued that the fractional part of the log of the Fibonacci numbers is uniformly distributed in [0, 1], which means that the probability of falling in any one of the chunks is given by the length of that sub-interval. From the chunk you can back out the leading digit, and Benford’s law follows.
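Both halves of the argument can be checked numerically: the fractional part of log(F_n) alone pins down the leading digit, and those fractional parts spread evenly over [0, 1). A sketch:

```python
import math

# First 500 Fibonacci numbers as exact integers.
fibs = [1, 1]
while len(fibs) < 500:
    fibs.append(fibs[-1] + fibs[-2])

# Fractional part of log10(F_n) is log10(m), where F_n = m * 10^k.
fracs = [math.log10(f) % 1 for f in fibs]

# Recover the leading digit from the fractional part alone: m = 10**frac.
for n in (10, 50, 100):
    recovered = int(10 ** fracs[n - 1])
    assert recovered == int(str(fibs[n - 1])[0])

# Rough uniformity check: about half the fractional parts fall below 0.5.
below = sum(fr < 0.5 for fr in fracs) / len(fracs)
print(f"fraction of log-fractional-parts below 0.5: {below:.2f}")
```

(Python’s `math.log10` accepts arbitrarily large integers, so this works even for Fibonacci numbers with a hundred digits.)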
Who gives a flying Fibonacci?
While Benford’s law might explain some unexpected patterns in nature, I still haven’t given any indication as to why this might be useful. Financial data, depending on the context, frequently checks a lot of the Benford boxes:
Numbers can span several orders of magnitude.
Geometric growth is common (e.g., year-over-year revenue growth, stock prices, etc.).
Lots of numbers appear as products of other numbers (e.g., revenue = price times quantity).
Numbers can be sampled from many different places.²
This suggests that Benford’s law might apply to adequately large datasets in finance or accounting. Suppose that you (not you, of course) were cooking the company books or sending fake returns data to potential investors. Would you be savvy enough to “Benfordize” your data? Harkening back to the first example, you probably wouldn’t be. Most likely, the leading digits of your fake numbers would be distributed too evenly. Several people have taken this idea and run with it, most notably Professor Mark Nigrini (not to be confused with the Campari cocktail), who literally wrote the book on fraud detection with Benford’s law. Impressively, Benford’s law has even been used to build legal cases. For example, investigators used it to bring securities fraud charges against an Oregon man, which resulted in a ten-year prison sentence.

By chance, I wrote just last week about the tool that people use to make Benford’s law arguments stick: χ² goodness-of-fit tests. If you take as your null hypothesis that Benford’s law holds, then the χ² statistic for a sample measures deviations from the expected distribution of leading digits. This can then be converted to a p-value. Since Benford’s law is a “law of nature,” a minuscule p-value would indicate that the observed distribution of leading digits is unnatural.
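As a sketch of how such a test works, here is the χ² statistic computed by hand for positive integers (function and variable names are my own). With nine digit categories there are 8 degrees of freedom, and the 5% critical value is about 15.5:

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi_square(values):
    """Chi-square statistic comparing leading digits of positive
    integers against the Benford distribution."""
    digits = Counter(int(str(v)[0]) for v in values)
    n = len(values)
    return sum(
        (digits[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )

# A fraudster's "too even" numbers: 100 of each leading digit, 900 in all.
fake = [d * 1000 + 7 for d in range(1, 10) for _ in range(100)]
stat = benford_chi_square(fake)
print(f"chi-square = {stat:.1f}")  # far above the 5% critical value of ~15.5
```

A real analysis would convert the statistic to a p-value (e.g., with `scipy.stats.chisquare`), but the comparison against the critical value already tells the story: evenly spread leading digits are wildly inconsistent with Benford’s law.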
While I focused on fraud, Benford’s law can also be used for non-criminal activities. For example, it can serve as a quality assurance tool for datasets in a wide variety of fields. You can also extend Benford’s law to the distribution of second digits (and beyond), which you can read about here. That’s all I have for today. I hope you enjoyed reading about Benford’s law. Please subscribe for more content like this and share with your network.
¹ Since a_{n+1} = a_n + a_{n-1}, if a number c existed such that a_{n+1} / a_n approaches c as n gets larger, then c must satisfy c = (a_n + a_{n-1}) / a_n = 1 + a_{n-1} / a_n = 1 + 1/c. Solving for c gives the golden ratio.
² I didn’t talk about why this matters, but this assumption is the basis for the best and most rigorous mathematical treatment of Benford’s law.