Model Performance Parameters of Binary Classification Algorithms
True positives, false positives, confusion matrix, recall, precision, accuracy: these terms have puzzled many a beginner machine learning enthusiast. This article is an attempt to simplify the parameters that are crucial in determining the performance of a binary classification model. A binary classification model is an algorithm that predicts the probability of occurrence of an event (i.e., either the event will happen or it will not). It is called binary because there are only two possibilities (yes or no). These performance metrics may also be used in multi-class classification (more than two possible classes), but this article is restricted to the binary case. Common examples of binary classification algorithms are logistic regression, support vector classifiers, and naive Bayes classifiers.
Let us first understand a few basic terminologies with the help of an example. Imagine a hospital that is trying to predict whether a patient is susceptible to developing diabetes in the future based on his medical condition. The hospital has built a binary model from the past data of its patients and is comparing the predicted results with the actual results. Here the algorithm is trying to predict the probability of a possible case of diabetes, so 1 is coded as the 'yes' case (the person may develop diabetes).
Please note that an event is a case of something new happening (a case of diabetes) and a non-event is the status quo, or nothing happening (a case of no diabetes).
So, when a patient actually had diabetes and the algorithm correctly identifies it, it is called a true positive (1,1). When a patient did not have diabetes and the algorithm correctly classifies it as a 'no diabetes' case, we have a true negative (0,0). Both of these are successful predictions made by the algorithm, and they are used to find the accuracy of a model (which we will see later).
When a patient who actually did not have diabetes is classified by the algorithm as someone who has diabetes, we have a false positive (0,1). This error is popularly called a Type 1 or alpha error: we rejected the null hypothesis (class 0) even though we did not have enough evidence to do so. Similarly, when a patient who actually had diabetes is classified by the algorithm as a 'no diabetes' case, it is called a false negative (1,0). This error is termed a Type 2 or beta error: we failed to reject the null hypothesis even though it was false. Controlling these misclassifications is one of the biggest challenges for a machine learning scientist, as the two are complementary (as one decreases, the other increases).
Now let us see the performance metrics that decide the effectiveness of a model.
Confusion Matrix: A confusion matrix is a 2×2 matrix that depicts these correct classifications and misclassifications in tabular form. Generally, we show the actual cases as the rows and the predicted cases as the columns.
Here, we see that when the predicted case (column) and the actual case (row) are both 1, the corresponding cell gives you the number of true positives in the data, and so on. This is the raw matrix from which we derive insights about the performance of our model.
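The four cells of the matrix can be counted directly from paired label lists. Below is a minimal pure-Python sketch; the actual/predicted labels are made up for illustration, not real patient data:

```python
# Count confusion-matrix cells for a binary problem (1 = diabetes, 0 = no diabetes).

def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for two equal-length 0/1 label lists."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground truth
predicted = [1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model output

tp, fp, fn, tn = confusion_counts(actual, predicted)

# Rows = actual, columns = predicted, as described above.
print([[tn, fp],
       [fn, tp]])  # → [[3, 1], [1, 3]]
```

The same (tp, fp, fn, tn) counts feed every metric discussed in the rest of the article.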
Now, let us see what these performance parameters are and what information they carry. Please note that all these parameters range from 0 to 1, where 1 can be interpreted as 100%.
Accuracy: If you are throwing darts at a board, accuracy is a measure of how close you came to the bull's eye. See the figure of a hypothetical dart board, and imagine that hitting the inner circle gives you the maximum points. You had five throws and managed to hit inside the inner circle twice, so your accuracy is 2/5.
In classification, accuracy is the ratio of the model's correct predictions to the overall number of predictions it made. The correct predictions are the true positives and the true negatives. Hence, it can be expressed as,

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Note that the denominator is the sum of the correct predictions and the wrong predictions.
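Given the four confusion-matrix counts, accuracy (and its complement, the error rate introduced next) reduces to one division each. The counts below (tp=3, tn=3, fp=1, fn=1) are illustrative, not from real data:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Fraction of all predictions that were wrong; equals 1 - accuracy."""
    return (fp + fn) / (tp + tn + fp + fn)

print(accuracy(3, 3, 1, 1))    # → 0.75
print(error_rate(3, 3, 1, 1))  # → 0.25
```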
Error Rate: Error rate is the complement of accuracy. On the dart board, it is the proportion of hits outside the inner circle (3/5 here). In classification, it is the ratio of the number of wrong predictions to the total number of predictions made. It is expressed as,

Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy
Accuracy and error rate are general parameters that give only an overall idea of model performance. To understand in depth how the model fares on the data, we should look at the parameters below.
Precision: Going back to the dart board, precision is a measure of how tightly clustered your hits are. Even when your hits are away from the inner circle, precision can be high if they are all crowded into a small portion of the board. If your hits are spread out across the board, precision is low, and vice versa.
In the binary classification context, precision is the ratio of true events predicted (1,1) to all events predicted (1,1 and 0,1). In simpler language: out of all the positive cases predicted by the machine (true and false positives), how many were correctly predicted? Basically, it says how correct, or pure, your positive predictions are.

Mathematically, it can be expressed as,

Precision = TP / (TP + FP)
Precision is an important parameter which helps us to control the number of false positives in prediction. This aspect will be discussed later in the article.
Recall: Like precision, recall also tells us about positive predictions, but with a small change. Recall is the ratio of true events predicted (1,1) to all the events that actually happened (1,1 and 1,0). In simpler language: out of all the positive cases that actually occurred (true positives and false negatives), how many did the machine correctly predict? Basically, it says how good you are at finding the positive cases.

Mathematically, it can be expressed as,

Recall = TP / (TP + FN)
Recall is also termed sensitivity, or the true positive rate. This parameter is important in reducing the number of false negatives, an aspect that will be discussed later in the article.
Specificity: Specificity is the opposite of recall. It is the rate of true negatives, and hence is also called the true negative rate. It is defined as the ratio of true non-events predicted (0,0) to all the non-events that happened (0,0 and 0,1). In simpler language: out of all the negative cases that happened (true negatives and false positives), how many did the machine correctly predict? Basically, it is the accuracy of the predicted non-events.

Mathematically, it can be expressed as,

Specificity = TN / (TN + FP)
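The three ratios above translate directly into one-line functions. The counts used here (tp=8, fp=2, fn=4, tn=6) are hypothetical, chosen so the three metrics come out different from each other:

```python
def precision(tp, fp):
    """Of all predicted positives, how many were truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, how many were found (true positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Of all actual negatives, how many were correctly rejected."""
    return tn / (tn + fp)

print(precision(8, 2))    # → 0.8
print(recall(8, 4))       # 8/12 ≈ 0.667
print(specificity(6, 2))  # → 0.75
```

Note that precision and recall share the same numerator (TP) but different denominators, which is exactly why one can be traded against the other by moving the threshold.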
F1-Score: Often, a machine learning student gets confused about which metric to give importance to. Since precision and recall are complementary, we need a single comparison parameter with which we can compare the performance of multiple models. A solution to this issue is the F1-score: the harmonic mean of precision and recall. The higher the F1-score, the better the model.

Mathematically, it can be expressed as,

F1 = 2 × (Precision × Recall) / (Precision + Recall)
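A small sketch shows why the harmonic mean is preferred over a plain average: it is pulled towards the smaller of the two values, so a model cannot score well by excelling at only one of precision or recall. The precision/recall values below are made up:

```python
def f1_score(p, r):
    """Harmonic mean of precision (p) and recall (r)."""
    return 2 * p * r / (p + r)

# With precision 0.8 and recall 0.5, the arithmetic mean would be 0.65,
# but the harmonic mean penalises the imbalance:
print(round(f1_score(0.8, 0.5), 3))  # → 0.615

# Balanced precision and recall give the same value under both means:
print(f1_score(0.7, 0.7))            # → 0.7
```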
ROC curve: One important dilemma for a student is where to fix the cut-off, or threshold, for the predicted probabilities. The model built by the algorithm outputs a probability ranging from 0 to 1 rather than a definite class, and by default the cut-off is 0.5 (i.e., outputs with a probability greater than 0.5 are classified as 1, and vice versa, when modelling for class 1). The ROC curve, or Receiver Operating Characteristic curve, helps us select the optimum value of the threshold. It plots the true positive rate (sensitivity, or recall) against the false positive rate (1 − specificity) for every value of the cut-off. The curves you see in the figure are drawn by varying the cut-off from its minimum to its maximum.
Let us understand what a change in the cut-off value does. For example, if my diabetes algorithm classifies a patient as 0 or 1 (no or yes) based on a cut-off of 0.5, and I change the cut-off to 0.6, I am now allowing comparatively fewer 'yes' cases to appear in my results and simultaneously increasing the 'no' cases. The takeaway is that changing the cut-off changes both the true positive rate and the false positive rate, and the effects oppose each other. Now go back to the curves: each curve (curve 1, 2, 3) represents a different model, and each has a characteristic shape related to its slope.
When the cut-off is set to 0, all the classification outputs are 1. Hence there are no true negatives or false negatives, and correspondingly both the true positive rate and the false positive rate are 1 (the top right corner of the ROC space). Conversely, when the cut-off is 1, all the positive predictions disappear, so the true positive and false positive rates are both 0 (the bottom left corner). The diagonal of the space represents a model of random guessing (like tossing an unbiased coin), and any model on or below that diagonal is said to be poor.
Any machine learning scientist's aim is to increase the true positives and reduce the false positives, so we find the optimum threshold at the point where the curve makes its elbow (where the pace of change in FPR starts to overtake the change in TPR). This point is the one closest to the top left corner of the ROC space (point B(0,1)), because an ideal model would achieve a true positive rate of 1 with a false positive rate of 0 (i.e., it tries to touch point B). This is represented by curve 1. The area under this curve is called the AUC (Area Under the Curve) metric, and it has a maximum value of 1 (since the ideal model traces a square of area 1 as it starts from A, touches B, and ends at C). When comparing different models, we select the one with the maximum AUC and choose the cut-off at the point closest to the top left corner.
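The sweep described above can be sketched in plain Python: classify at a given threshold, count the confusion cells, and report one (FPR, TPR) point per threshold. The probability scores and labels below are invented for illustration:

```python
def roc_point(probs, actual, threshold):
    """Return (fpr, tpr) when scores >= threshold are classified as 1."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, predicted) if a == 1 and q == 1)
    fn = sum(1 for a, q in zip(actual, predicted) if a == 1 and q == 0)
    fp = sum(1 for a, q in zip(actual, predicted) if a == 0 and q == 1)
    tn = sum(1 for a, q in zip(actual, predicted) if a == 0 and q == 0)
    return fp / (fp + tn), tp / (tp + fn)

probs  = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]   # hypothetical model scores
actual = [0,   0,   1,   0,   1,   1]     # hypothetical ground truth

# Cut-off 0 puts us at the top right corner (1, 1); cut-off 1 at the
# bottom left corner (0, 0), exactly as described in the text.
for t in [0.0, 0.5, 1.0]:
    print(t, roc_point(probs, actual, t))
```

Plotting these (FPR, TPR) pairs for a fine grid of thresholds traces out the ROC curve; summing trapezoids between consecutive points gives the AUC.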
The Model Maker's Dilemma: If we compare building a machine learning model to cooking a curry, the vegetables are the data and the utensils are the languages, packages, and software we use. But what makes the curry tasty? It is the business or domain knowledge that acts as the spices and other ingredients of a useful and employable model. Understanding the problem at hand in the context of the business it is part of is the most important aspect of model building. To decide on a cut-off means to find a balance between true positives, true negatives, false positives, and false negatives. These issues would not arise if the classes were perfectly separable with an adequate gap between them (the ideal case). But we live in a world of chaos and imperfection, and so does our data, which comes with noise and overlaps. Hence, a scientist has to make trade-offs, accommodating one type of mistake in order to make a useful gain. What makes sense is unique to each business context and each problem at hand; there is no one-button solution to every machine learning problem.
One such aspect is the precision-recall trade-off, which is simply a decision about prioritizing between false positives and false negatives respectively (compare the equations of precision and recall to see how they depend on these two parameters). We cannot have both high in a real-world scenario and have to decide which one to prioritize.
For example, let us consider spam email filtering. Here a false positive means classifying a normal email as spam, and a false negative means classifying a spam email as normal. For a user, seeing a spam email in the inbox is not as costly as missing an important mail because the algorithm sent it to the spam folder. Hence, in this case we need more precision. One thing to understand is that to get more precision, we have to increase the threshold (which allows only values closer to 1 to be classified as 1, reducing the false positives).
In the opposite case, imagine a credit card fraud detection algorithm. A false positive is when a normal transaction is flagged as fraud, and a false negative is when a fraudulent transaction is missed and allowed through as normal. A company can try to repair the dissatisfaction of a normal customer whose transaction was blocked by a false fraud call, but think about the case when it fails to identify a fraud: the company might bear huge financial and credibility damages. Hence, recall is the important metric here. The same logic applies to my diabetes example: it is better to misclassify a healthy patient as a future diabetic and take adequate care than to miss a real future diabetic due to low recall. To increase recall, the threshold has to be decreased from its default.
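The two examples above can be sketched numerically: raising the threshold trades recall for precision, and lowering it does the opposite. The scores and labels here are made up purely to make the trade-off visible:

```python
def precision_recall_at(probs, actual, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 1)
    fp = sum(1 for a, q in zip(actual, pred) if a == 0 and q == 1)
    fn = sum(1 for a, q in zip(actual, pred) if a == 1 and q == 0)
    prec = tp / (tp + fp) if tp + fp else 1.0   # no positive predictions made
    rec = tp / (tp + fn)
    return prec, rec

probs  = [0.2, 0.35, 0.45, 0.55, 0.65, 0.8, 0.9]   # hypothetical scores
actual = [0,   0,    1,    0,    1,    1,   1]     # hypothetical labels

print(precision_recall_at(probs, actual, 0.5))  # → (0.75, 0.75)
print(precision_recall_at(probs, actual, 0.7))  # → (1.0, 0.5)   spam-filter style
print(precision_recall_at(probs, actual, 0.3))  # fraud/diabetes style: recall 1.0
```

Raising the cut-off to 0.7 removes the false positive at the cost of missed positives; lowering it to 0.3 catches every positive while letting extra false positives through.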
Another dilemma is the accuracy-latency trade-off, where the choice is between a faster output and a more accurate one. This is crucial in cases where real-time outputs are required and we have to accommodate some errors in the classifications made.
The final call rests with the creator, whose understanding of the problem plays the key role in the sustainability and applicability of the model in the real world. This is why collaboration between the data scientists and the business specialists on a team becomes an important indicator of project success.