DAM Portfolio – Machine Learning Evaluation

Reflection Post/s:

Graduate Attribute/s:



After learning two different Machine Learning techniques, it was time for me to go back to a question I had raised to myself after the first DAM class: how do I evaluate the performance of different Machine Learning models?


The first metric we used to evaluate the predictive performance of a supervised Machine Learning algorithm is its accuracy. This percentage is calculated from the confusion matrix of the predictions made by the model.

A confusion matrix lists the counts for each type of prediction made by the model. For example, for a binary variable (Yes/No) the confusion matrix will look like this:

                 Predicted Yes          Predicted No
  Actual Yes     True Positive (TP)     False Negative (FN)
  Actual No      False Positive (FP)    True Negative (TN)

The rows are the real values of the data set and the columns are the predicted values.
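The tallying described above can be sketched in a few lines of Python. The labels and values here are toy data for illustration only, not taken from the Credit Scoring dataset discussed later:

```python
def confusion_matrix(actual, predicted, positive="Yes"):
    """Return the (TP, FN, FP, TN) counts for a binary classification,
    treating `positive` as the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    return tp, fn, fp, tn

# Toy example: real values vs. model predictions
actual    = ["Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No",  "No", "Yes", "Yes", "No"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```

Libraries such as R's caret or Python's scikit-learn produce the same table directly, but the counting logic is no more than this.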

Accuracy is calculated from the instances where a correct prediction has been made (either Yes or No):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

This is a good indicator of how a model performs, but it may not be sufficient in cases where a model performs better on one class than on the other (for instance, a model may predict the Yes cases better than the No cases). For those situations we have to look at the sensitivity (or true positive rate) and the specificity values:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Depending on the business requirement, one of these two measures may be more important than the other. For example, in the case of a spam filter, a model may have a sensitivity of 99% and a specificity of 97%, which means that 3% of legitimate emails are incorrectly classified as spam. The business may have set a specificity requirement of 99% and will therefore reject this model.
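These three measures follow directly from the four confusion-matrix counts. The counts below are made-up numbers for a spam-filter scenario like the one above (1,000 spam and 1,000 legitimate emails), chosen so the rates match the 99%/97% example:

```python
# Assumed illustrative counts: spam is the positive class
tp, fn, fp, tn = 990, 10, 30, 970

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # proportion of all correct predictions
sensitivity = tp / (tp + fn)                   # true positive rate: spam caught
specificity = tn / (tn + fp)                   # true negative rate: legit emails kept

print(accuracy, sensitivity, specificity)  # 0.98 0.99 0.97
```

Note that the 98% accuracy hides the fact that 30 legitimate emails were lost, which is exactly why the individual rates matter.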

There are also two other measures, called Precision and Recall. They both focus on the positive predictions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Precision is used to assess how often a model is correct when it makes a positive prediction. For instance, a search engine requires a high Precision value, as this means it is less likely to return unrelated results. Recall is actually the same as sensitivity. For a search engine, a high Recall value means it returns a high volume of related documents.

In general a model will have a trade-off between sensitivity and specificity but also between precision and recall.

The F-measure has been defined to evaluate the trade-off between precision and recall and is used to compare several different models. The model with the F-measure closest to 1 has the better performance.

F-measure = (2 × Precision × Recall) / (Precision + Recall)
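As a quick sketch, here are precision, recall and the F-measure (the harmonic mean of the two) computed from the same illustrative spam-filter counts used earlier:

```python
# Assumed illustrative counts, same as the spam-filter example
tp, fn, fp = 990, 10, 30

precision = tp / (tp + fp)  # how often a positive prediction is correct
recall    = tp / (tp + fn)  # same formula as sensitivity

# Harmonic mean: a single low value drags the F-measure down sharply
f_measure = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f_measure, 4))  # 0.9706 0.99 0.9802
```

Because it is a harmonic mean, the F-measure rewards models that balance the two rates rather than excelling at only one of them.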


I compared the two models I built for the Credit Scoring dataset (KNN and ANN) using the evaluation measures described above.

KNN Evaluation:

[Table: KNN evaluation statistics]

ANN Evaluation:

[Table: ANN evaluation statistics]

Looking at the F-measure for predicting credit delinquency, ANN (0.698) has a better performance than KNN (0.55).

Looking at precision and recall, ANN performed much better on recall. This means it is better at finding real delinquents, but its value is only 0.652 (35% of delinquent customers are still classified as non-delinquent!).


What happened?

During the first class we learned to use the accuracy percentage to evaluate the performance of the KNN model, but at that time I also saw a table with different statistics for this model. I thought these values might provide more information about the model.

What did I do?

I wrote down the question I had after the first class and parked it for a while. After going through the Neural Network activity I went back to this question and started to do some research online about the measures I had found at that time.

What did I think and feel about what I did? Why?

After the class, and after seeing this statistics table, I felt that I was maybe missing something important. At that time I was focused on the different learning activities and left this point for later. It is only when I started to look at ANN that I remembered to have a look at it, so I included it within my learning activities on the Neural Network algorithm.

What were the important elements of this experience and how do I know they are important? What have I learned?

First, I learned that there are multiple measures of the performance of Machine Learning models, not only the accuracy percentage. Secondly, I learned that depending on the business requirements, one of these measures may be more important than the others. So it is important to define the key measurements during the Business Understanding phase of a CRISP-DM project. This will help to better understand what the real expectations from the business are and will therefore help in choosing a model according to its performance.

How does this learning relate to the literature and to industry standards?

In a CRISP-DM project there is a dedicated step for evaluating a model. If a model fails to meet the business requirements, the project has to go back to the first stage of the cycle. This has a dramatic impact on the project from a cost and time perspective. Therefore it is highly recommended to define at the beginning of the project not only which measures are important but also what their thresholds are.


University of New South Wales 2014, Evaluation in Machine Learning.

Lantz, B. 2013, Machine Learning with R, Packt Publishing.
