In the statistical world, in my simple mind, precision was the golden metric for gauging whether your model was performing as it was supposed to.

That is partially correct. But precision alone will only give you a partial picture of the model's performance. For the full picture, you will need to check other gauges: Recall and F1 Score.

You might be reading these paragraphs, trying to understand why these metrics are necessary in the first place. That's a fair question: if you executed all the calculations correctly, why do you need more than one metric?

When creating a machine learning model, you are trying to predict the future based on the past. Identifying all the factors that affect a specific outcome is challenging. When you don’t capture all the components that affect the outcome, your model will underperform.

What Are These Metrics?

If you are not sure what True Positive (TP), False Positive (FP), and False Negative (FN) mean, you can find more information in this Wikipedia article. I don't usually like linking to Wikipedia, but this one is really good.

Let’s do a quick recap on what each of these metrics stands for:

Precision: Measures the accuracy of positive predictions. Of all the instances the model flagged as positive, it tells you what fraction were truly positive.

Precision = (TP)/(TP + FP)

Recall: Also known as sensitivity or the True Positive rate. It is the percentage of actual positive cases that a model correctly identifies, and it measures the model's ability to find all positive instances.

Recall = (TP)/(TP + FN)

F1-Score: It is the harmonic mean of Precision and Recall. It balances the trade-off between the two.

F1 Score = 2×(Precision×Recall)/(Precision + Recall)

This metric is particularly useful because it balances precision and recall, making it a powerful tool for imbalanced datasets.
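As a quick sanity check, the three formulas above can be written as small Python helpers (the function names here are mine, not from any particular library):

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged positive, the fraction that truly was positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all actual positives, the fraction the model found."""
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r)
```

Libraries such as scikit-learn compute these directly from label arrays, but the hand-rolled versions make the formulas explicit.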

How are these metrics used to help you see how your model is performing?

Let’s say you wrote a model to filter spam emails. Your model has examined 100 emails, and the breakdown is as follows:

  • 70 legitimate emails correctly left in the inbox (True Negative)

  • 15 spam emails correctly flagged (True Positive)

  • 10 legitimate emails incorrectly flagged as spam (False Positive)

  • 5 spam emails missed by the filter (False Negative)
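Laying those counts out in code makes the later arithmetic easy to check; this is just the breakdown above restated (variable names are mine):

```python
# Confusion-matrix counts from the spam-filter example.
tn = 70  # legitimate emails correctly left alone (True Negatives)
tp = 15  # spam correctly caught (True Positives)
fp = 10  # legitimate emails wrongly flagged as spam (False Positives)
fn = 5   # spam that slipped through (False Negatives)

total = tn + tp + fp + fn
print(total)  # 100 emails examined in all
```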

Precision

Let’s talk about the metric I once thought was the gold standard. Precision is critical for spotting false positives.

Using the numbers we already know, we can plug them into the formula and come up with the right answer.

Precision = (15)/(15 + 10)

That means our precision comes out to 0.60, or 60%.

Recall

In this instance, recall measures the percentage of actual spam emails that were correctly identified. For our email problem, we plug 15 True Positives and 5 False Negatives into the formula.

Recall = (15)/(15 + 5)

You can see that in this case, our number has improved: 15/20 comes out to 0.75, or 75%.

F1 Score

For the last metric: the F1 Score balances recall and precision, penalizing a model that does well on one but poorly on the other. This metric is a good fit for the email case.

At its core, email spam classification is binary. It is either spam or not. That is a perfect candidate for this metric, which thrives in a binary classification problem.

F1 = 2×(0.6×0.75)/(0.6 + 0.75)

F1 ≈ 0.667
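Why the harmonic mean rather than a plain average? A short sketch, using the numbers from the example above plus a made-up extreme case, shows how it punishes imbalance:

```python
p, r = 0.60, 0.75
arithmetic = (p + r) / 2         # 0.675
harmonic = 2 * p * r / (p + r)   # ~0.667, always <= the arithmetic mean

# A hypothetical, badly imbalanced model: near-perfect precision, terrible recall.
p2, r2 = 0.99, 0.10
print(2 * p2 * r2 / (p2 + r2))   # far below the plain average of 0.545
```

The harmonic mean collapses toward the weaker of the two scores, which is exactly why F1 is informative when one metric is propping up the other.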

Combining It All

Which metrics matter most?

Think of a football team. Which player matters most? What good is a quarterback if the wide receiver has butterfingers? Or how does an excellent tight end fare when the quarterback can’t throw the ball?

Likewise, everyone has a role, and they are somewhat equally important. In some cases, you might rely more on one than another. But at the end of the day, you will need all of them.

Here is why every metric plays a role in this game.

A Precision of 0.60 means that 40% of the emails this model flagged as spam were actually legitimate and never reached the inbox. For email, in my opinion, that error rate is high and problematic.

For a spam filter, I would consider a recall of 0.75 poor performance. It means one in four spam emails is getting through to the inbox. If you receive 100 spam emails a day, your inbox will be littered with 25 of them. Painful, right?

And finally, the F1 Score of roughly 0.667 confirms the deeper issue. I don’t want to say that anything below 0.7 is automatically bad, but when I see a score below that threshold, it raises my eyebrow. I will immediately start looking for ways to improve the model.

When you consider each of these numbers individually, they all indicate a model that needs work. However, when we combine all three metrics, we can pinpoint issues that no single metric would reveal on its own.