One of my adoring fans (let me have this one) requested an article about naive Bayes classifiers, which help you categorize objects based on their features. To do that topic justice, you really should understand Bayes’ rule for conditional probabilities. This post introduces you to that topic.
Conditional Probability
Suppose you pick a U.S. citizen at random. How likely is it that the person is a billionaire? According to Forbes, there are 813 billionaires in the U.S., so the chances are about 813 in 330 million, or .0000025. What if I asked the same question, except this time I told you that the surname of the person chosen was Bezos? I don’t know how many Bezoses live in the U.S., but I think it’s safe to assume that it’s fewer than 2,000.¹ Among them is at least one billionaire, so the odds are greater than 1/2,000 = .0005 that a random Bezos is a billionaire. That’s at least 200 times more likely than for the general population.
This example highlights an important topic in statistics called conditional probability. As the name suggests, conditional probability is the likelihood that some event occurs conditioned on another event occurring. For two events A and B, we write P(A|B) for the probability of A, given that B occurs. To derive the formula for this quantity, let’s recall how to calculate probability in general:
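P(A) = (number of ways A can occur) / (total number of possible outcomes)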
For the billionaire example, A is the event that a billionaire is chosen from the U.S. population. The numerator is 813—the number of such billionaires. The denominator is the number of possible outcomes from drawing a random U.S. citizen, i.e. the population of 330M.
Now let’s return to calculating P(A|B). What is the denominator in this case? The premise is that B has occurred, so it must be the number of ways that B can occur. We would like the numerator to be the number of ways that A can occur. However, this isn’t quite right because some of those ways might be incompatible with the occurrence of B. (Remember the premise that B occurs.) Thus, the numerator must be the number of ways that A and B can both occur. This leads to our formula:
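P(A|B) = (number of ways A and B can both occur) / (number of ways B can occur) = P(A ∩ B) / P(B)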
The arch symbol, ∩, stands for “and.” To go from the first equality to the second, multiply the top and bottom by 1/(the total number of possible outcomes), which is legal because it is tantamount to multiplying by one. Now we can see how this applies to the billionaire example. As stated, A is the event that a randomly chosen U.S. citizen is a billionaire. B is the event that a randomly chosen U.S. citizen is named Bezos. P(A|B) is the probability that a randomly chosen U.S. citizen is a billionaire, given that person’s last name is Bezos.
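Plugging in the rough numbers from before (at least one billionaire Bezos in a population of 330 million, and at most about 2,000 Bezoses), the formula reproduces the earlier estimate:

P(A|B) = P(A ∩ B) / P(B) ≥ (1/330M) / (2,000/330M) = 1/2,000 = .0005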
Naive Bayes classifiers require one more important concept: independence. Intuitively, events A and B are independent if the occurrence of one has no bearing on the occurrence of the other. In other words, the probability of A does not depend at all on B, or P(A|B) = P(A). If A and B are independent, a little algebra shows that the probability of both A and B occurring is equal to the product of the probabilities:
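P(A ∩ B) = P(A|B)*P(B) = P(A)*P(B)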
In general, there is no simple rule for calculating P(A ∩ B)—called the joint probability—unless A and B are independent. Independence is the “naive” assumption that people make with naive Bayes classifiers.
Bayes’ Rule
New problem: Suppose you run a company that manufactures doodads. (The widget market is too saturated.) The current defect rate in your manufacturing process is 4%. Let’s say you implement a new quality control system that correctly identifies 90% of defective doodads coming off the line. However, it also has a 5% false positive rate, meaning that it flags 5% of good doodads as defective. You want to answer the following question: if the QC system flags a doodad as defective, what is the probability that it is actually defective?
The if/then structure of the question tells us that this is a conditional probability problem. In shorthand, the quantity we need to compute is P(defect|flag). Let’s rewrite the given info using the same notation:
P(defect) = .04
P(flag|defect) = .90
P(flag|no defect) = .05
Hmm…this doesn’t line up neatly with the conditional probability formula above. Specifically, the event we care about (good/defective) and the conditioning event (flag/don’t flag) are flipped. The formula alone gives no indication of how to go from P(A|B) to P(B|A). To create this connection, notice that P(A ∩ B) appears in the formulae for both P(A|B) and P(B|A). More precisely, P(A|B)*P(B) = P(A ∩ B) = P(B|A)*P(A). (This is true since A ∩ B = B ∩ A.) Drop the middle term and divide both sides by P(A) to arrive at Bayes’ rule:
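P(B|A) = P(A|B)*P(B) / P(A)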
Applied to our problem, Bayes’ rule looks like this:
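P(defect|flag) = P(flag|defect)*P(defect) / P(flag)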
We’re making progress. The term on the left is what we want, and we have numbers for the two terms in the numerator on the right. Unfortunately, we still don’t know P(flag). We can figure out P(flag) with a nifty trick called the law of total probability, which computes P(flag) case by case over the possible outcomes of defective: P(flag) = P(flag|defect)*P(defect) + P(flag|no defect)*P(no defect). This works because every flagged doodad is either defective or not, so the two terms cover all the ways a flag can happen without overlapping.

Now we have everything we need to answer the question.
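P(defect|flag) = (.90 * .04) / (.90 * .04 + .05 * .96) = .036 / .084 ≈ .43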
So, if the QC system flags a doodad as defective, there’s only a 43% chance that it’s truly defective. Despite a pretty low false alarm rate of 5%, most doodads that the system flags will be false alarms. This is a common challenge that arises when trying to detect rare things (manufacturing defects, cancer, objects on a radar screen, etc.).
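If you want to experiment with how the base rate drives this effect, here is a small Python sketch of the same calculation (the posterior function is just my own shorthand for computing P(defect|flag)):

def posterior(p_defect, p_flag_given_defect, p_flag_given_ok):
    # P(flag) via the law of total probability, then Bayes' rule
    p_flag = p_flag_given_defect * p_defect + p_flag_given_ok * (1 - p_defect)
    return p_flag_given_defect * p_defect / p_flag

print(posterior(0.04, 0.90, 0.05))  # the doodad numbers: about 0.43
print(posterior(0.01, 0.90, 0.05))  # a rarer defect (1%): about 0.15, so flags are even less trustworthy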
Looking Ahead: The Bayesian Mindset
Next week, I’m going to apply the ideas and formulas from this article to the topic of classification. The basic idea is that you have an initial belief about how different categories are distributed—called the prior probability—and then you continually update these beliefs when presented with evidence (i.e., representative members of each category). After incorporating this evidence, you have a new belief about the probability of each class, called the posterior probability. To classify new objects, you check which class is most probable according to the updated belief system. But that’s for next week. For now, both author and reader deserve a rest.
¹ The U.S. Census Bureau says that there are at least 150,000 surnames in the U.S., so the average surname is shared by no more than 2,200 people (330M / 150K). I don’t think it’s a stretch to say that Bezos is an uncommon surname, so the number of Bezoses should be below 2,200. It’s probably 10% of that or less.