It's still ROC and roll to me
A close look at ROC curves, a World War II-era invention that now plays a prominent role in machine learning.
Last week, I talked about how to evaluate a binary classification algorithm. I focused on four popular metrics: accuracy, precision, recall, and F1-score. Those classification metrics all have one thing in common: they’re based on a single, fixed confusion matrix. As a reminder, a confusion matrix is a 2x2 table that displays the counts of each of the four possible outcomes for a binary prediction: true positive, false positive, true negative, and false negative. I worked off the confusion matrix below for a naive Bayes (NB) classifier, which shows the results of trying to predict whether doodads would be good (positive) or defective (negative).
As I said, the metrics mentioned above are all computed directly from the four numbers in the table. To learn more, let’s review how naive Bayes classifiers categorize objects:
1. They start with an assumption about the likelihood of defective vs. good.
2. They observe many defective and good doodads and update their conditional class probabilities based on this evidence.
3. For a new doodad, they calculate the probability of being in each class based on that doodad’s features.
4. They assign a doodad as good or defective based on which has a higher probability. (A quick code sketch of these four steps follows the list.)
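To make those steps concrete, here is a minimal sketch in Python. The “doodad” data, features, and scikit-learn GaussianNB model are stand-ins I made up for illustration, not the actual data or model behind the confusion matrix above.

```python
# A minimal sketch of the four steps, using made-up "doodad" data and
# scikit-learn's Gaussian naive Bayes (not the exact model from the post).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(seed=42)

# Hypothetical features: two measurements per doodad; label 1 = good, 0 = defective.
X_good = rng.normal(loc=1.0, scale=0.5, size=(500, 2))
X_defective = rng.normal(loc=0.0, scale=0.5, size=(500, 2))
X = np.vstack([X_good, X_defective])
y = np.concatenate([np.ones(500), np.zeros(500)])

model = GaussianNB()
model.fit(X, y)  # steps 1-2: learn class priors and per-class feature likelihoods

new_doodad = np.array([[0.9, 0.8]])
probs = model.predict_proba(new_doodad)  # step 3: [P(defective|x), P(good|x)]
label = model.predict(new_doodad)        # step 4: pick the likelier class
print(probs, label)
```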
Step 3 is really the critical step, yet we discard all that information when evaluating our classification model. Suppose that you trained two NB classifiers and presented them both with a new doodad x, which you know to be defective. If one classifier said that there is a 51% chance that x is defective and the other a 90% chance, would you say that the second classifier is performing better? I would argue yes, even though both would classify x as defective.
Parameter Tuning
To generalize the preceding observation, note that the “perfect” classifier would assign “good” class probabilities near 100% to all good doodads and near 0% to all defective doodads. Inspired by this, let’s try the following. Instead of predicting that a doodad is good or defective based on the greater class probability, pick a threshold probability for classifying a doodad as good. For example, a threshold of 80% means that a doodad x is classified as defective if P(good|x) < 0.8. This is equivalent to saying that x is classified as defective if P(defective|x) > 0.2. (Thus, a threshold of 50% corresponds to the normal rule of selecting the likelier class.) If our classifier is good—that is, if our class probabilities for the correct classes are ~100%—then you should see confusion matrices consisting of mostly TP and TN for (almost) any threshold value.
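As a rough illustration (reusing the hypothetical model and data from the sketch above), applying a custom threshold instead of the default “likelier class” rule takes only a couple of lines:

```python
# Sketch: classify as good only when P(good|x) clears an explicit threshold,
# rather than the default 50% rule. `model`, `X`, and `y` are from the sketch above.
threshold = 0.8
p_good = model.predict_proba(X)[:, 1]          # column 1 is class 1, i.e. "good"
pred_good = (p_good >= threshold).astype(int)  # 1 = classified good, 0 = defective
```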
To pressure test our classifier, we can vary our threshold for classification as “good” and compute the confusion matrix for each value of the threshold. This is called parameter tuning. More precisely, for each value of the threshold, count the number of TP, FP, TN, and FN that you have. Next, compute the true positive rate (TPR),

TPR = TP / (TP + FN),

and false positive rate (FPR),

FPR = FP / (FP + TN),

for each threshold. This can easily be done in Excel with IF statements (see the image below) or in a few lines of code (see the sketch after the list). You’ll notice a few things:
If the threshold for a positive (good) prediction is 0, then every doodad is classified as good, so there are no TNs or FNs. It follows that TPR = FPR = 100%.
If the threshold for a positive (good) prediction is 1, then every doodad is classified as defective, so there are no TPs or FPs. Hence, TPR = FPR = 0%.
As the threshold for classifying a doodad as good increases, the number of TPs and the number of FPs can only go down (or stay the same), so TPR and FPR are non-increasing as well. (In both cases, the denominator is constant: TP + FN is the total number of good doodads, and FP + TN is the total number of defective ones.)
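Here is what that threshold sweep looks like in Python rather than Excel, reusing the hypothetical p_good scores and true labels y from the earlier sketches; each threshold produces one (FPR, TPR) pair.

```python
# Sketch of the threshold sweep: for each candidate threshold, tally the confusion
# matrix and compute TPR and FPR. Uses the made-up `p_good` and `y` from above,
# where y == 1 means good and y == 0 means defective.
import numpy as np

thresholds = np.linspace(0.0, 1.0, 21)
tpr_list, fpr_list = [], []
for t in thresholds:
    pred_good = p_good >= t
    tp = np.sum(pred_good & (y == 1))
    fp = np.sum(pred_good & (y == 0))
    fn = np.sum(~pred_good & (y == 1))
    tn = np.sum(~pred_good & (y == 0))
    tpr_list.append(tp / (tp + fn))  # TPR = TP / (TP + FN)
    fpr_list.append(fp / (fp + tn))  # FPR = FP / (FP + TN)
```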

The ROC curve
The third bullet at the end of the previous section tells us that there is a trade-off between true positives and false positives. If you want to increase your true positive rate, you can do so by lowering your positive class threshold to make sure that you’re not missing any good doodads. However, lowering the positivity threshold will also lead to more false positives. This is a fact of life with parameter tuning. So, how do you decide what threshold to use? One thing you can do is draw the receiver operating characteristic (ROC) curve. The ROC curve is a plot of FPR on the x-axis against TPR on the y-axis. Where is the threshold, you ask? The threshold parametrizes the curve, meaning that you “trace” out the curve as the parameter increases. In other words, we don’t see the threshold exactly, but we can back it out from a (FPR, TPR) pair.1 Here is a picture of the ROC curve for my NB classifier.

This graph has a few features that all ROC curves share. First, it goes from (0%, 0%) to (100%, 100%). This follows from bullets one and two above. Second, as the false positive rate increases, the true positive rate either increases or stays the same (and vice versa). This means that the curve can’t ever decrease. (It can be vertical or horizontal for stretches, as is the case here for FPR = 0%.) Now that we have this curve, how can we leverage it? One way is by using it to select our threshold. Notice that the ideal classifier has a TPR of 100% and an FPR of 0%; this is the top left corner of the graph. Therefore, it makes sense to pick the value of the threshold whose corresponding (FPR, TPR) pair is as close to the top left corner as possible. Unless you break out the ruler, this probably won’t give you a definitive answer. However, it can narrow down the range of reasonable thresholds. For example, FPR = 0% for all thresholds greater than .6, so there is no reason to go above .6, since this would just create needless false negatives. Conversely, a threshold of .5 has FPR = 11% and TPR = 95%. Reducing the threshold to .45 increases TPR one point to 96%, but more than doubles FPR to 26%. This trade is probably not worth making. Thus the optimal value of the parameter is probably somewhere in the interval [.5, .6].
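If you would rather not break out the ruler, the “closest to the top left corner” rule is easy to automate. Here is a rough sketch using the thresholds, fpr_list, and tpr_list from the sweep above; treat it as one heuristic, not the final word on threshold selection.

```python
# Sketch: pick the threshold whose (FPR, TPR) point lies closest to the ideal
# top left corner (FPR, TPR) = (0, 1). Uses the lists from the sweep sketch above.
import numpy as np

fpr = np.array(fpr_list)
tpr = np.array(tpr_list)
distance_to_corner = np.sqrt(fpr**2 + (1 - tpr)**2)
best = np.argmin(distance_to_corner)
print(f"threshold = {thresholds[best]:.2f}, FPR = {fpr[best]:.0%}, TPR = {tpr[best]:.0%}")
```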
The other use of a ROC curve is to evaluate your model as a whole. As I mentioned earlier, you want to be as close to the top left corner as possible, regardless of the threshold. Since you must start at (0%, 0%) and end at (100%, 100%), the “ideal curve” would therefore be a vertical segment from (0%, 0%) to (0%, 100%) concatenated with a horizontal segment from (0%, 100%) to (100%, 100%). The area under this curve would be 1. (It encloses a 1x1 square, and 100% = 1.) A realistic ROC curve would have an area under the curve less than 1, but hopefully as close to 1 as possible. On the other end of the spectrum, what would happen if you just guessed randomly? Random guessing means that each class probability is a uniformly random number between 0 and 1, irrespective of whether a doodad is good or defective. For a good/positive threshold of z, the probability that a good doodad’s random probability clears the threshold is 1 - z, so TPR = 1 - z. The probability that a defective doodad is classified as good is also 1 - z, so FPR = 1 - z. In other words, TPR = FPR for a random-guess classifier, so its ROC curve is the line y = x. This forms a triangle of base and height 1, which has area 0.5. Area under the curve (AUC) is a popular metric for evaluating binary classifiers. The closer you are to 1, the better your model. If you’re at .5, you’re doing no better than random guessing, and if you’re below .5, it’s time to find a new line of work.
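For completeness, here is a short sketch that lets scikit-learn trace the ROC curve and compute AUC directly from the hypothetical labels y and probabilities p_good used above, with the random-guessing diagonal drawn in for reference.

```python
# Sketch: ROC curve and AUC via scikit-learn, using the made-up `y` and `p_good`.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y, p_good)
auc = roc_auc_score(y, p_good)  # 1.0 is a perfect classifier; 0.5 is random guessing

plt.plot(fpr, tpr, label=f"NB classifier (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing (AUC = 0.5)")
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```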

Detecting objects in noisy signals
I want to close with a little history. ROC curves were first developed during World War II to facilitate the analysis of radar signals. Noise is always present in a radar receiver, so setting the threshold for determining whether a detected signal is a false alarm is a critical and difficult problem. Radar receiver operators struggled with this classification problem in the early days of radar, and the development of ROC curves helped researchers understand and ultimately improve operator performance. Nowadays, ROC curves show up wherever classification algorithms do, both for training them and for evaluating them. As always, thank you for reading. Please share with your network if you found this interesting. Comments and questions are welcome.
1. It is possible that a range of threshold values gives you the same TPR and FPR. This isn’t a problem; it just means that you can place the threshold anywhere you want in this range without changing the performance of the classifier.