The metrics used to compare classification performance are typically expressed in terms of the elements of the confusion matrix, which is generated by evaluating the machine learning model on a test sample. Figure 1 shows the template of a confusion matrix for a two-class classification problem, where the class of an instance is either positive or negative (i.e., binary classification).

In the confusion matrix, columns represent the actual classes, while rows represent the predicted classes. The number of instances in the test sample is shown at the top of the confusion matrix, where *P* is the total number of positive instances and *N* is the total number of negative instances. The number of instances predicted by the model in each class is shown on the left of the confusion matrix, where *p* is the total number of instances predicted to be positive and *n* is the total number of instances predicted to be negative.

### Elementary performance metrics

*True Positives (TP)* denotes the number of instances correctly predicted to be positive examples. *False Negatives (FN)* denotes the number of positive instances predicted to be negative. Similarly, *True Negatives (TN)* is the number of correctly predicted negative instances, and *False Positives (FP)* denotes the number of negative instances predicted to be positive.

The *true positive rate* ($TP_{rate}$), which is represented as $TP_{rate} = \frac{TP}{TP + FN}$, depicts the rate at which the positive class is recognised. This is also known as *recall* or *sensitivity*. The corresponding metric of the negative class is the *true negative rate* ($TN_{rate}$), which is measured as $TN_{rate} = \frac{TN}{TN + FP}$. This is also known as *specificity* and indicates the proportion of negative instances that are correctly detected.

The purpose of *Positive Predictive Value (PPV)* and *Negative Predictive Value (NPV)* is to quantify how many of the instances detected as belonging to a given class actually represent that class. *PPV*, which is also known as *precision*, measures the proportion of instances predicted to be positive that are actually positive (i.e., $PPV = \frac{TP}{TP + FP}$). *NPV* denotes the proportion of negative instances that are correctly detected out of all instances predicted to be negative (i.e., $NPV = \frac{TN}{TN + FN}$).
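The elementary metrics above can be sketched in a few lines of plain Python (no external libraries); the labels and the `positive=1` encoding below are illustrative assumptions, not taken from the text.

```python
# Elementary performance metrics computed from actual and predicted labels.

def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, TN, FP for a binary classification result."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    return tp, fn, tn, fp

def elementary_metrics(actual, predicted):
    tp, fn, tn, fp = confusion_counts(actual, predicted)
    return {
        "TP_rate (recall/sensitivity)": tp / (tp + fn),
        "TN_rate (specificity)": tn / (tn + fp),
        "PPV (precision)": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

# Example: 6 actual positives (P) and 4 actual negatives (N).
actual    = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(elementary_metrics(actual, predicted))
```

For this example, $TP = 4$, $FN = 2$, $TN = 3$, $FP = 1$, giving a recall of $4/6$ and a precision of $4/5$.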

### Composite measures

From the elementary performance metrics discussed above, several *composite measures* have been constructed, such as the *F-measure* and *ROC curves*. The F-measure (more specifically, $F_1$) is the harmonic mean of precision and recall, and is denoted as $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. The ROC (Receiver Operating Characteristic) curve plots the *true positive rate* (or *sensitivity*, denoted as $TP_{rate} = \frac{TP}{TP + FN}$) against the *false positive rate* (or $1 - \text{specificity}$, denoted as $FP_{rate} = \frac{FP}{FP + TN}$), at different classification thresholds. Typically, a good classification model should reside in the upper left region of the plot (Figure 2). Point (0,0) indicates a model that predicts every instance as negative, point (1,1) a model that predicts every instance as positive, while a random classifier corresponds to the

$y = x$ line. The ideal classification model generates the point (0,1), indicating that its false positive rate is zero (i.e., none of the negative instances are predicted to be positive) and its true positive rate is equal to 1 (i.e., every positive instance is identified). The AUC (Area Under the ROC Curve) is an aggregated measure of the ROC curve that indicates performance across all possible thresholds. More specifically, the AUC denotes the entire two-dimensional area under the ROC curve from point (0,0) to (1,1). Simply put, it indicates the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
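As a minimal sketch of the F-measure, the harmonic mean of precision and recall can be computed directly from the confusion-matrix counts; the counts used below are illustrative only.

```python
# F1: the harmonic mean of precision (PPV) and recall (TP_rate).

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)   # PPV
    recall = tp / (tp + fn)      # TP_rate
    return 2 * precision * recall / (precision + recall)

# e.g. TP = 4, FP = 1, FN = 2: precision = 0.8, recall = 2/3, F1 = 8/11
print(round(f1_score(4, 1, 2), 3))  # → 0.727
```

Because it is a harmonic mean, $F_1$ is pulled toward the smaller of precision and recall, so a model cannot score well by excelling at only one of the two.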
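The probabilistic reading of the AUC can be sketched directly: count, over all (positive, negative) pairs, how often the classifier scores the positive instance higher, with ties counted as half. The scores below are made-up illustrative values, not real model output.

```python
# AUC as the probability that a random positive instance is ranked above a
# random negative instance (pairwise comparison of classifier scores).

def auc_by_pairs(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

pos = [0.9, 0.8, 0.6, 0.55]  # scores assigned to actual positives
neg = [0.7, 0.5, 0.4, 0.3]   # scores assigned to actual negatives
print(auc_by_pairs(pos, neg))  # → 0.875
```

A perfect ranking (every positive scored above every negative) gives 1.0, while a random classifier gives about 0.5, matching the $y = x$ diagonal of the ROC plot.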