DAM Portfolio – Machine Learning Evaluation

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

After learning two different Machine Learning techniques, it was time for me to go back to a question I had raised to myself after the first DAM class: how do I evaluate the performance of different Machine Learning models?


WHAT IS IT ALL ABOUT?


The first metric we used to evaluate the predictive performance of a supervised Machine Learning algorithm is its accuracy. This percentage is calculated from the confusion matrix of the predictions made by the model.

A confusion matrix tabulates the number of predictions of each type made by the model. For example, for a binary target variable (Yes/No), the confusion matrix looks like this:

                 Predicted: Yes         Predicted: No
Actual: Yes      True Positive (TP)     False Negative (FN)
Actual: No       False Positive (FP)    True Negative (TN)

The rows correspond to the real values from the data set and the columns to the values predicted by the model.

Accuracy is calculated from the instances where a correct prediction has been made (either Yes or No):

Accuracy = (TP + TN) / (TP + TN + FP + FN)
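
To make this concrete, here is a quick Python sketch (my own illustration with made-up counts, not output from Knime):

# Hypothetical confusion-matrix counts for a binary (Yes/No) classifier
tp, fn = 85, 15   # actual Yes: correctly predicted / missed
fp, tn = 10, 90   # actual No: wrongly flagged / correctly predicted

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.2%}")   # 87.50%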

This is a good indicator of how a model performs, but it may not be sufficient when a model performs better on one class than on the other (for instance, a model may predict the Yes cases better than the No cases). For those situations we also have to look at the sensitivity (or true positive rate) and the specificity (or true negative rate) values:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Depending on the business requirements, one of these two measures may be more important than the other. For example, in the case of a spam filter, a model may have a sensitivity of 99% and a specificity of 97%, which means that 3% of legitimate emails are incorrectly classified as spam. If the business has set a specificity requirement of 99%, it will reject this model.
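
Here is the same kind of sketch for the spam filter example, with made-up counts chosen to match the 99% / 97% figures above:

# Made-up counts for a hypothetical spam filter (positive class = spam)
tp, fn = 990, 10   # 1000 actual spam emails, 990 caught
tn, fp = 970, 30   # 1000 legitimate emails, 30 wrongly flagged as spam

sensitivity = tp / (tp + fn)   # 0.99: share of spam correctly caught
specificity = tn / (tn + fp)   # 0.97: share of legitimate email kept
print(f"Sensitivity: {sensitivity:.0%}, Specificity: {specificity:.0%}")
print(f"Legitimate email lost to the spam folder: {1 - specificity:.0%}")   # 3%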

There are also two other measures called Precision and Recall. They both focus on the positive predictions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Precision assesses how often the model's positive predictions are correct. For instance, a search engine requires a high Precision value, as this means it is less likely to return unrelated results. Recall is actually the same measure as sensitivity. For a search engine, a high Recall value means it returns a large proportion of the related documents.

In general there is a trade-off between sensitivity and specificity, and also between precision and recall.

The F-measure combines precision and recall into a single value and can be used to compare several different models. The model with the F-measure closest to 1 has the better performance.

F-measure = 2 × Precision × Recall / (Precision + Recall)
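
Below is a short Python sketch (my own illustration; the f_measure function and the precision/recall numbers are invented, not taken from the course data) showing how the F-measure can be used to compare two models:

def f_measure(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Invented values for two hypothetical models
model_a = {"precision": 0.80, "recall": 0.60}
model_b = {"precision": 0.70, "recall": 0.75}

for name, m in [("Model A", model_a), ("Model B", model_b)]:
    print(name, round(f_measure(m["precision"], m["recall"]), 3))
# Model A 0.686, Model B 0.724 -> Model B has the better trade-off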


HANDS-ON PRACTICE


I compared the 2 models I built for the Credit Scoring dataset (KNN and ANN) using the evaluation measures described above.

KNN Evaluation:

[Image: KNN evaluation statistics from the Knime Scorer node]

ANN Evaluation:

[Image: ANN evaluation statistics from the Knime Scorer node]

Looking at the F-measure for predicting credit delinquency, ANN (0.698) has a better performance than KNN (0.55).

Looking at precision and recall, ANN performed much better on recall. This means it is better at finding the real delinquent customers, but its recall is still only 0.652 (about 35% of delinquent customers are classified as non-delinquent!).
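
As a quick sanity check of that last figure, using only the recall value reported by the Scorer node:

recall_ann = 0.652                 # ANN recall on the delinquent class
missed = 1 - recall_ann            # delinquent customers not flagged by the model
print(f"Missed delinquent customers: {missed:.0%}")   # ~35%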


REFLECTION 


What happened?

During the first class we learned to use the accuracy percentage to evaluate the performance of the KNN model, but at that time I also saw a table with additional statistics for this model. I thought these values might provide more information about the model.

What did I do?

I wrote down the question I had after the first class and parked it for a while. After going through the Neural Network activity I went back to this question and started to do some research online about the measures I had found at that time.

What did I think and feel about what I did? Why?

After the class, having seen this statistics table, I felt that I was maybe missing something important. At that time I was focused on the different learning activities and left this point for later. It was only when I started to look at ANN that I remembered it, so I included it within my learning activities on the Neural Network algorithm.

What were the important elements of this experience and how do I know they are important? What have I learned?

First, I learned that there are multiple measures of the performance of Machine Learning models, not only the accuracy percentage. Secondly, I learned that depending on the business requirements, one of these measures may be more important than the others. So it is important to define the relevant measures during the Business Understanding phase of a CRISP-DM project. This helps to better understand what the real expectations from the business are and therefore helps in choosing a model according to its performance.

How does this learning relate to the literature and to industry standards?

In a CRISP-DM project there is a dedicated phase for evaluating a model. If a model fails to meet the business requirements, the project has to go back to the first stage of the cycle, which can have a dramatic impact on the project from a cost and time perspective. Therefore it is highly recommended to define at the beginning of the project which measures are important, and also what their thresholds are.


REFERENCES


University of New South Wales, 2014, Evaluation in Machine Learning

Lantz B., 2013, Machine Learning with R, Packt Publishing

DAM Portfolio – CRISP-DM

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

 

During the first DAM class, professor Siamak walked us through the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which is widely used for Data Science projects.


WHAT IS CRISP-DM?


CRISP-DM is a methodology for managing Data Mining projects. It was conceived in the 1990s by five different companies (SPSS, Teradata, Daimler AG, NCR Corporation and OHRA). According to different polls involving data scientists across different industries, it is currently the most used process for Data Mining projects.

It breaks down a project into 6 different phases:

  • Business Understanding: this phase focuses on understanding the requirements from the business, the problems and questions they want to answer, and on defining a project plan to address them.
  • Data Understanding: this phase is about collecting data and performing a first level of analysis of the data sets through a descriptive analysis of the different variables.
  • Data Preparation: this is when we clean, transform, merge and enhance the data set for the next phase.
  • Modelling: this is the step when we apply statistical or Machine Learning techniques to define the most appropriate model for the project.
  • Evaluation: after building the model we have to assess its performance and its ability to generalise what it has learned.
  • Deployment: the final phase is about implementing the model in a live environment and maintaining it. It can also be the finalisation of the report requested by the business.

[Image: CRISP-DM process diagram]

Understanding these different steps is pretty straightforward, but I personally think the important part of this methodology is the feedback loops. At almost every stage you are able to go back to previous steps according to what you have learned. It is not a V-model (sequential) as we usually see in IT projects; it is more agile and more iterative. It reminds me of the PDCA model designed by Deming, where you have to iterate the same approach several times in order to solve a problem: you plan your actions (how am I going to learn something about the problem?), you do the actions (you perform the tasks you defined), you check the results (you analyse the outcomes), you act (you reflect on the learnings) and then you start again if required (the learnings helped me to better understand the situation, but I need to dig deeper and learn more).

[Image: PDCA multi-loop diagram]

Another interesting part of the CRISP-DM methodology is the user guide section, which details the different tasks you have to perform for a data mining project, the associated risks for each phase and also the different possible outputs.

[Images: extracts from the CRISP-DM user guide]


HANDS-ON PRACTICE


I haven’t really applied the full methodology in a project yet, but throughout my career I have learned and applied other kinds of methodologies such as the V-Model, Agile, PDCA and DMAIC.

CRISP-DM shares a lot of similarities with the last one. Like DMAIC, CRISP-DM emphasises the importance of the first step: understanding the business requirements. Both methodologies recommend spending a fair bit of time properly defining the scope of the project before starting to work on it. In these kinds of complex projects (process improvement or data mining) it is crucial to challenge the business’s understanding of the situation. The risk is that they state a very broad view of what they want and push for starting the project as soon as possible. This can lead the project in the wrong direction, or even cause it to change direction in the middle of the project. A common technique used in DMAIC is called the 5 Whys, where you ask “why” five times in order to really get to the bottom of the question.

The DMAIC Measure phase is quite similar to the Data Understanding and Data Preparation phases of CRISP-DM. The difference is that DMAIC focuses on defining a very detailed measurement plan (mainly because most projects require collecting new measurements), while CRISP-DM focuses on the “quality” of the data set (treating missing values, outliers…).

Then the remaining phases from DMAIC and CRISP-DM differ quite a lot as they are very specific to their respective subject: process improvement or data mining.

[Image: Lean Six Sigma DMAIC road map]


REFLECTION 


What happened?

After the brief introduction of this methodology in class I did my own research in order to better understand what CRISP-DM is about.

What did I do?

I read the detailed description of the CRISP-DM methodology by SPSS.

What did I think and feel about what I did? Why?

During the class we had a high-level view of this methodology. I wanted to take a deeper dive into it and be able to compare it with other methodologies I have seen during my career.

What were the important elements of this experience and how do I know they are important? What have I learned?

As expected, this methodology has a lot of detail and it requires some practice before you really understand how deep it is. This is similar to any methodology: while you read it, it seems logical and pretty straightforward, but you only realise its true meaning once you have faced the situation in a project. So I decided to apply this methodology as much as possible in the coming assignments.

How does this learning relate to the literature and to industry standards?

CRISP-DM is the main methodology used in Data Mining projects so it is quite important to have a good understanding of it. It does provide some recommendations and best practices that may be valuable for the upcoming assignments and projects I will have to manage in the future.


REFERENCES


KD Nuggets, What main methodology are you using for your analytics, data mining, or data science projects? Poll, viewed October 2014, <http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html>

IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation

 

Evidence URL:

http://www-staff.it.uts.edu.au/~paulk/teaching/dmkdd/ass2/readings/methodology/CRISPWP-0800.pdf

DAM Portfolio – K Nearest Neighbor (KNN)

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

During the first DAM class, professor Siamak brought us through our first Machine Learning technique: K Nearest Neighbours. I didn’t expect this to come so quickly, as I was thinking we would only learn Machine Learning techniques after at least a few weeks of study. So I was really excited when I saw the program for the day. I had heard a lot about KNN prior to enrolling in this Master but had never gone into the details of it.


WHAT IS KNN?


In 2006 the IEEE International Conference on Data Mining listed the top 10 data mining algorithms, and KNN was one of them. Ten years later it is still well recognized within the industry and remains one of the most used techniques.

So how does KNN work? It is based on a very simple concept, which I will try to explain through a simple example. Let’s say you are eating a dish with your eyes closed and you try to find out what kind of food you are being served. If you start thinking “it smells like duck, it tastes like duck and the texture feels like duck”, you will probably classify it as duck meat, right? This is basically how KNN works: you classify any new observation according to previous experiences that share similar features.

It uses a distance calculation to identify which group the new observation is closest to and assigns it to this group.

As its name suggests, k is an important parameter. It defines the number of nearest neighbours to take into account when classifying the new observation. For example, if k is set to 3, then for each new observation KNN looks at its 3 nearest neighbours and assigns it to the majority group: if 2 of these neighbours are classified Red and the last one Blue, KNN classifies the new point as Red. To avoid ties it is recommended to pick an odd number for k.
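
As an illustration, here is a minimal Python sketch of the idea (my own toy example, not the Knime implementation we used in class):

import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    # train is a list of (features, label) pairs
    # Euclidean distance between the new point and every training observation
    distances = [(math.dist(features, new_point), label) for features, label in train]
    # Keep the k nearest neighbours and take a majority vote on their labels
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Tiny made-up example: two features, two classes
train = [((1.0, 1.2), "Red"), ((0.8, 0.9), "Red"), ((3.0, 3.1), "Blue"), ((2.9, 3.3), "Blue")]
print(knn_predict(train, (1.1, 1.0), k=3))   # -> "Red"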

Here comes the next question: how do we choose k? Choosing a low value for k makes the model very sensitive to noise in the data (it tends to overfit), while picking a high value for k smooths the boundaries between groups too much (it tends to underfit) and can also hurt the predictive power of the model. A rule of thumb is to use the square root of the number of observations in the data set. For instance, if there are 100 observations we set k = 10. This rule is a good starting point, and you will probably try several k values, compare the respective performance of the models and then select the best one.
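
In class we did this kind of k sweep in Knime, but the same idea can be sketched with Python and scikit-learn (using synthetic data here, not the Credit Scoring set):

import math
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for a real data set
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rule-of-thumb starting point: square root of the number of training observations
k_start = round(math.sqrt(len(X_train)))
print("Rule-of-thumb k:", k_start)

# Try several odd values of k and compare accuracy on held-out data
for k in [1, 3, 5, k_start if k_start % 2 else k_start + 1]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: accuracy={model.score(X_test, y_test):.3f}")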


HANDS-ON PRACTICE


During the first class, we went through an exercise in which we were given 45 minutes to implement our first data mining workflow in Knime using KNN. We worked within our respective groups and tried to come up with a solution for the Credit Scoring data set.

It wasn’t actually very hard to set up a KNN classification. We struggled a bit at the beginning as we didn’t know that we had to convert the response variable into a String type. The error message in the Knime console wasn’t very clear (making it easier to understand would be a good improvement for Knime, especially for non-technical people).

[Image: Knime KNN workflow]

The hardest part was actually the data preparation phase, when you transform, clean and enhance the original data set before feeding it to the modelling phase. At that time we chose to perform all these steps in Excel as we didn’t have much time (I have covered this in another post on my blog called DAM Portfolio – Data Preparation in Knime).

We did try to run KNN with different values of k during the class (from 1 to 5). The best model was for k = 3 (our original choice). At that time we looked only at the confusion matrix and its accuracy value to assess the performance of this model.

[Image: KNN confusion matrix]

But I saw that the Scorer node provides a table with additional statistics (precision, recall…). I told myself to have a look at these later on (this is the subject of another post on my blog).

[Image: KNN evaluation statistics from the Scorer node]


REFLECTION 


What happened?
We learned this Machine Learning technique through a presentation in class. Professor Siamak explained the theory behind KNN first, and later in the day we got some hands-on practice using Knime.

What did I do?
During the class I was really interested in this technique as I had heard about it quite often. I read the slides carefully while listening to the explanations. I asked one or two questions because I wasn’t quite sure how the k parameter impacted the classification process. I thought it was used to define the boundaries of a group, but thanks to professor Siamak’s explanations I understood it is actually used for the classification step, after the boundaries have been set.

What did I think and feel about what I did? Why?
Prior to this class Machine Learning was a bit “mystical” for me. It felt like a superpower that only chosen people could understand and use. During the class I was quite surprised by how simple this technique was. There was no need to have a PhD to understand it. I was actually thinking I must have missed something; it couldn’t be that simple. So after the class I read a few articles about it. Even though they helped me to better understand KNN, I actually already knew almost everything I needed from the class. I still learned a few more things, such as the fact that KNN is non-parametric (i.e. it makes no assumption about the distribution of the data) and that it is a lazy technique (i.e. it doesn’t build a model up front, unlike eager techniques). This experience helped me to demystify the Machine Learning field a bit, even if I know there are much more complex techniques that I will have to learn.

What were the important elements of this experience and how do I know they are important? What have I learned?
As I said, the main element of this experience was the demystification of Machine Learning. Previously I really thought I wasn’t yet capable of running any of these techniques, but that was actually not true. It was really surprising how easy it was to implement our first workflow in such a short time.

What I learned from this experience is that you don’t really need to understand every single detail of the algorithm. A good understanding of what it is for and in which cases it can be used is more than enough to apply it properly when analysing data sets.

How does this learning relate to the literature and to industry standards?
Even after 10 years, KNN is still one of the top-ranked Machine Learning techniques. Every book and article I found on Data Mining still talks about it. It is really amazing how such a simple algorithm can still produce very accurate predictions.


REFERENCES


Wu X, Kumar V., 2009, The Top Ten Algorithms in Data Mining, Chapman and Hall/CRC

Gorunescu, F., 2011, Data Mining Concepts, Models and Techniques, Springer

Lantz B., 2013, Machine Learning with R, Packt Publishing