DAM Portfolio – Machine Learning Evaluation

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

After learning two different Machine Learning techniques, it was time for me to go back to a question I had raised for myself after the first DAM class: how do I evaluate the performance of different Machine Learning models?


WHAT IS IT ALL ABOUT?


The first metric we used to evaluate the predictive performance of a supervised Machine Learning algorithm is its accuracy. This percentage is calculated from the confusion matrix of the predictions made by the model.

A confusion matrix lists the number of observations for each type of prediction made by the model. For example, for a binary variable (Yes/No) the confusion matrix will look like this:

                 Predicted: Yes         Predicted: No
Actual: Yes      True Positive (TP)     False Negative (FN)
Actual: No       False Positive (FP)    True Negative (TN)

The rows hold the actual values from the data set and the columns hold the predicted values.

Accuracy is calculated from the instances where a correct prediction has been made (either Yes or No):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

This is a good indicator of how a model performs, but it may not be sufficient in cases where a model performs better on one class than on the other (for instance, a model may predict the Yes cases better than the No cases). For those situations we have to look at the sensitivity (or true positive rate) and the specificity:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)

Depending on the business requirements, one of these two measures may be more important than the other. For example, in the case of a spam filter, a model may have a sensitivity of 99% and a specificity of 97%, which means that 3% of the legitimate emails are incorrectly classified as spam. The business may have set a requirement of 99% and will therefore reject this model.

There are also two other measures called Precision and Recall. They both focus on the positive predictions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Precision assesses how often the model's positive predictions are correct. For instance, a search engine requires a high precision value, as this means it is less likely to return unrelated results. Recall is actually the same measure as sensitivity. For a search engine, a high recall value means it returns a large proportion of the related documents.

In general a model will have a trade-off between sensitivity and specificity but also between precision and recall.

The F-measure has been defined to capture the trade-off between precision and recall and is used to compare several different models. The model with the F-measure closest to 1 has the better performance.

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
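
To make these definitions concrete, here is a minimal sketch in base R that computes the five measures from the four cells of a confusion matrix. The counts are made-up numbers for illustration, not values taken from any model in this post.

TP <- 90; FN <- 10   # actual Yes cases (made-up counts)
FP <- 15; TN <- 85   # actual No cases (made-up counts)

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # also called recall or true positive rate
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
f_measure   <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        precision = precision, f_measure = f_measure), 3)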


HANDS-ON PRACTICE


I compared the 2 models I built for the Credit Scoring dataset (KNN and ANN) using the evaluation measures described above.

KNN Evaluation:

[Figure: KNN evaluation statistics from the Knime Scorer node]

ANN Evaluation:

[Figure: ANN evaluation statistics from the Knime Scorer node]

Looking at the F-measure for predicting credit delinquency, ANN (0.698) has a better performance than KNN (0.55).

Looking at precision and recall, ANN performed much better on recall. This means it is better at finding the real delinquents, but its recall is still only 0.652 (about 35% of delinquent customers are classified as non-delinquent!).


REFLECTION 


What happened?

During the first class we learned to use the accuracy percentage to evaluate the performance of the KNN model, but at that time I also saw a table with different statistics for this model. I thought these values might provide more information about the model.

What did I do?

I wrote down the question I had after the first class and parked it for a while. After going through the Neural Network activity I went back to this question and started to do some research online about the measures I had found at that time.

What did I think and feel about what I did? Why?

After the class, and after seeing this statistics table, I felt that I was maybe missing something important. At that time I was focused on the different learning activities and left this point for later. It was only when I started to look at ANN that I remembered to come back to it, so I included it within my learning activities for the Neural Network algorithm.

What were the important elements of this experience and how do I know they are important? What have I learned?

First, I learned that there are multiple measures of the performance of Machine Learning models, not only the accuracy percentage. Secondly, I learned that depending on the business requirements one of these measures may be more important than the others. So it is important to define the key measures during the Business Understanding phase of a CRISP-DM project. This helps to better understand what the real expectations from the business are and therefore helps in choosing a model according to its performance.

How does this learning relate to the literature and to industry standards?

In a CRISP-DM project there is a dedicated step for evaluating a model. If a model fails to meet the business requirement the project has to go back to the first stage of the cycle. This will have a dramatic impact on the project from a cost and time perspective. Therefore it is highly recommended to define which measures are important at the beginning of the project but also to define what their thresholds are.


REFERENCES


University of New South Wales, 2014, Evaluation in Machine Learning

Lantz B., 2013, Machine Learning with R, Packt Publishing

DAM Portfolio – Artificial Neural Network (ANN)

As part of the DAM learning activities I went through the Neural Network Machine Learning technique.


WHAT IS A NEURAL NETWORK?


Neural Networks are currently considered among the most advanced Machine Learning algorithms. They are used in a lot of Artificial Intelligence projects such as AlphaGo (the system designed by Google DeepMind that recently beat one of the world's top players at the game of Go).

A Neural Network, or Artificial Neural Network (ANN), is classified as a black-box algorithm, as it is quite difficult to interpret the model it builds. It was designed as a reflection of human brain activity. Its objective is to create a network of interconnected neurons, each of which sends a signal to its neighbours when it receives enough activation (based on a threshold).

There are 3 main characteristics for an ANN:

  • An activation function that determines when a neuron fires and broadcasts a signal to its neighbours. There are different types of activation functions depending on the type of data: binary, logistic, linear, Gaussian, etc.
  • A network architecture that specifies the number of internal neurons and the number of internal layers. Adding more internal neurons and layers makes the model more complex. This can be necessary for tackling complex data sets, but it makes the model harder to interpret.
  • The training algorithm, which specifies the weights applied to every neuron connection.

A Neural Network is also defined by its input and output neurons. An input neuron is assigned to each feature of the data set and an output neuron is assigned to every possible value of the outcome variable.

[Figure: a neural network with its input, hidden and output layers]
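
Just to illustrate these characteristics outside of Knime, here is a minimal sketch in R using the nnet package, which supports a single hidden layer; the data frame and column names are invented for the example and are not the credit scoring data set used below.

# Train a small single-hidden-layer network on a made-up binary outcome
library(nnet)

set.seed(42)
train <- data.frame(income = rnorm(200), debt = rnorm(200))
train$default <- factor(ifelse(train$income - train$debt + rnorm(200) > 0,
                               "No", "Yes"))

# One input neuron per feature, 'size' neurons in the internal layer
model <- nnet(default ~ income + debt, data = train,
              size = 10, maxit = 200, trace = FALSE)

pred <- predict(model, train, type = "class")
table(actual = train$default, predicted = pred)

Changing the size argument is the code equivalent of changing the number of internal neurons discussed above.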


HANDS-ON PRACTICE


After going through the videos from the learning activities and after doing some research about this algorithm, I tried to run an ANN on the same data set as for the K Nearest Neighbours exercise (credit scoring).

In Knime the node for running an ANN is called Multilayer Perceptron Predictor. This node requires 2 inputs: a training model and a normalized test set. The training model is defined through a node called RProp MLP Learner that takes the normalized training set as input.

[Figure: Knime workflow with the RProp MLP Learner and Multilayer Perceptron Predictor nodes]
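
Both the learner and the predictor expect normalized data, so here is a minimal side sketch in base R of min-max scaling to [0, 1], which is one of the options the Knime Normalizer node offers; the small data frames are made up for illustration. The key point is that the test set has to be rescaled with the training set's minima and maxima.

# Min-max scaling of every column to [0, 1]
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

train <- data.frame(age = c(25, 40, 58), income = c(30000, 52000, 90000))
test  <- data.frame(age = 33, income = 61000)

train_norm <- as.data.frame(lapply(train, min_max))

# Reuse the training minima and maxima for the test set
test_norm <- as.data.frame(Map(function(v, tr) (v - min(tr)) / (max(tr) - min(tr)),
                               test, train))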

I ran several ANNs with different numbers of internal layers and neurons and compared the resulting models afterwards:

  • 1 internal layer and 10 internal neurons
  • 1 internal layer and 20 internal neurons
  • 3 internal layers and 10 internal neurons
  • 3 internal layers and 20 internal neurons
  • 5 internal layers and 10 internal neurons
  • 5 internal layers and 20 internal neurons

Before looking at the accuracy of these different models I thought that having more internal layers would significantly improve the performance of the model. For this particular data set I was wrong: the models with 5 layers were less accurate than the 1-layer ones. From the different tests I did, the best model is the one with only 1 internal layer and 20 internal neurons. Compared to the first model its accuracy increased by 2%, which is remarkable as the only thing I did was change some settings.

[Figure: accuracy of the ANN with 1 internal layer and 20 internal neurons]


REFLECTION 


What happened?

For this topic the learning process was different from the KNN one: here I had to learn the Neural Network algorithm by myself.

What did I do?

I had to do some research on this technique and find different types of materials. First I looked for introductory documents that explain what it is about in simple terms. Then, when I got a better understanding of it, I started to look at more advanced books that go into deeper detail. I found one related to Machine Learning in R. Even though I knew R isn't the main tool recommended for this subject, I decided to read it anyway as I am quite familiar with R. This helped me to understand how to run an ANN step by step, and I was able to easily recreate the same workflow in Knime afterwards.

What did I think and feel about what I did? Why?

I picked Neural Networks as I knew it is currently one of the most advanced Machine Learning techniques.

I was quite surprised by how simple and easy it was to run a K Nearest Neighbour model during the first class. As I said in my previous post, I didn't feel capable of running any Machine Learning algorithm before this class. So when I got the choice I picked one of the hardest techniques, to see if I would go through a similar experience even if it would be more challenging.

Even if the theory behind ANN is quite complex, the actual level of knowledge required to be able to run it is again much lower than I would have imagined at the beginning. It was again a big surprise for me.

What were the important elements of this experience and how do I know they are important? What have I learned?

I am really amazed by the fact that in such a short time I have been able to run some very complex Machine Learning algorithms. This experience has confirmed what I discovered during my first try with KNN: Machine Learning isn't very complicated at a practical level. It requires a good understanding of the key concepts of each technique, but you don't need to understand all the theory behind it. So my key learning from these past few weeks is that I am actually already smart enough to start my journey into the Machine Learning field.

How does this learning relate to the literature and to industry standards?

Neural Networks are one of the "hottest" techniques at the moment. Every time I hear about an innovative Artificial Intelligence project, I hear about Neural Networks or Deep Learning. I think in the coming years there will be more and more applications of this algorithm in Data Science projects.


REFERENCES


Gorunescu, F., 2011, Data Mining Concepts, Models and Techniques, Springer

Lantz B., 2013, Machine Learning with R, Packt Publishing


DAM Portfolio – CRISP-DM

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

 

During the first DAM class, Professor Siamak took us through the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which is widely used for Data Science projects.


WHAT IS CRISP-DM?


CRISP-DM is a methodology for managing Data Mining projects. It was conceived in the 1990s by 5 different companies (SPSS, Teradata, Daimler AG, NCR Corporation and OHRA). According to different polls of data scientists across different industries, it is currently the most used process for Data Mining projects.

It breaks down a project into 6 different phases:

  • Business Understanding: this step is focused on understanding what the requirements from the business are, what problems and questions they want answered, and on defining a project plan to address them.
  • Data Understanding: this phase is about collecting data and performing a first level of analysis of the data sets through a descriptive analysis of the different variables.
  • Data Preparation: this is when we clean, transform, merge and enhance the data set for the next phase.
  • Modelling: this is the step where we apply statistical or Machine Learning techniques to define the most appropriate model for the project.
  • Evaluation: after defining the model we have to assess its performance and its ability to generalize its learning.
  • Deployment: the final step is about deploying the model in a live environment and maintaining it. It can also be the finalization of the report requested by the business.

[Figure: CRISP-DM process diagram]

The understanding of these different steps is pretty straightforward, but I personally think the important part of this methodology is the feedback loops. At almost every stage you are able to go back to previous steps according to the learnings you get. It is not a V-model (sequential) as we usually see in IT projects; it is more agile and more iterative. It reminds me of the PDCA model designed by Deming, where you iterate the same approach several times in order to solve a problem: you plan your actions (how am I going to learn something about the problem?), you do the actions (you perform the tasks you defined), you check the results (you analyse the outcomes), you act (you reflect on the learnings) and then you start again if required (the learnings help me to better understand the situation, but I need to dig deeper and get additional learnings).

[Figure: the PDCA multi-loop cycle]

Another interesting part of the CRISP-DM methodology is the user guide section, which details the different tasks you have to perform in a data mining project, the associated risks for each phase and the different possible outputs.

[Figure: pages from the CRISP-DM user guide listing phases, tasks and outputs]


HANDS-ON PRACTICE


I haven't really applied the full methodology in a project yet, but through my career I have learned and applied other kinds of methodologies such as the V-Model, Agile, PDCA and DMAIC.

CRISP-DM shares a lot of similarities with the last of these. Like DMAIC, CRISP-DM emphasizes the importance of the first step: understanding business requirements. Both methodologies recommend spending a fair bit of time properly defining the scope of the project before starting to work on it. In these kinds of complex projects (process improvement or data mining) it is crucial to challenge the business's understanding of the situation. The risk is that they state a very broad view of what they want and push to start the project as soon as possible. This can lead the project in the wrong direction or even force it to change direction in the middle. A common technique used in DMAIC is the 5 Whys, where you ask "why" five times in order to really get to the bottom of the question.

The DMAIC Measure phase is quite similar to the Data Understanding and Data Preparation phases of CRISP-DM. The difference is that DMAIC focuses on defining a very detailed measurement plan (mainly because most projects require collecting new measurements) while CRISP-DM focuses on the "quality" of the data set (treating missing values, outliers…).

Then the remaining phases from DMAIC and CRISP-DM differ quite a lot as they are very specific to their respective subject: process improvement or data mining.

[Figure: Lean Six Sigma DMAIC road map]


REFLECTION 


What happened?

After the brief introduction of this methodology in class I did my own research in order to better understand what CRISP-DM is about.

What did I do?

I read the detailed description of the CRISP-DM methodology by SPSS.

What did I think and feel about what I did? Why?

During the class we had a high-level view of this methodology. I wanted to take a deeper dive into it and be able to compare it with other methodologies I have seen during my career.

What were the important elements of this experience and how do I know they are important? What have I learned?

As expected this methodology has a lot of detail and it requires some practice before you really understand how deep it is. This is similar to any methodology: while you read it, it seems logical and pretty straightforward, but you only realise its true meaning once you have faced the situation in a project. So I decided to apply this methodology as much as possible for the coming assignments.

How does this learning relate to the literature and to industry standards?

CRISP-DM is the main methodology used in Data Mining projects so it is quite important to have a good understanding of it. It does provide some recommendations and best practices that may be valuable for the upcoming assignments and projects I will have to manage in the future.


REFERENCES


KD Nuggets, What main methodology are you using for your analytics, data mining, or data science projects? Poll, viewed October 2014, <http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html>

IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation

 

Evidence URL:

http://www-staff.it.uts.edu.au/~paulk/teaching/dmkdd/ass2/readings/methodology/CRISPWP-0800.pdf

DAM Portfolio – Data preparation in Knime

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

I had never used Knime prior to the first DAM class. I had heard about it but never took the time to give it a go, as I was thinking that whatever I can do in Knime I can do directly in R. I didn't see the added value in learning this new tool.


WHAT IS IT ALL ABOUT?


Knime is an open-source data analytics tool. It is designed for implementing data analysis workflows through its visual interface, without any coding skills. It is extremely easy to use: you just select the nodes you want (each of which performs a specific task) and link them together until you get the final expected output.

Knime has an impressive library of nodes for Data Mining, Machine Learning and Data Visualisation tasks.

[Figure: the Knime workbench]

Once you build a workflow you can easily re-use it for another project or data set. You can also visually follow step by step how your workflow works.


HANDS-ON PRACTICE


During the first DAM class, we had to use Knime to run a K Nearest Neighbour (KNN) algorithm. There was a competition and we only got 45 minutes to come up with the most accurate model possible. Within our group we started to use Knime to prepare the data set but due to the lack of time we quickly decided to perform all the cleaning, binning and filtering steps directly in Excel and focus only on implementing our KNN algorithm.

After the class I decided to go through the exercise again, but this time I tried to perform all the tasks within Knime. I did this not only because it is the recommended tool for this subject but also because I was quite curious about the level of automation this tool can bring.

The first difficulty I encountered was finding the most appropriate nodes to perform the tasks I wanted. There were so many different ones, and for some of them I couldn't tell what the differences were. I did a bit of research online to see if there was documentation related to this tool. To my big surprise there wasn't much material about Knime on the Internet. The most useful resource I found was the forum on the Knime website, where people post questions and get answers directly from the user community.

So I decided to start building the data preparation workflow using this forum as a guide. What I found is that even if it is very easy to build your workflow, it starts to get a bit messy as you keep adding nodes one after the other. So I broke down the different tasks I wanted to perform on this data set into chunks. For each chunk I defined what the expected output was. This helped me to focus on specific tasks first and confirm they were producing the right result before moving on to the next chunk. In the end I came up with 3 different chunks.

[Figure: the three data preparation chunks in the Knime workflow]

The last one was the hardest by far. I wanted to impute missing values depending on the values of the other variables. Knime hasn't yet come up with an efficient way to tackle this kind of task easily, so you have to create your own sub-workflow within the main workflow to get the results you want. As I was struggling and couldn't find much help on the Knime forum, I decided to look for user guides or books related to Knime. I found one written by Gabor Bakos that describes every node from the main package and provides practical examples for them. This was very helpful. Even if I still needed to experiment with several options before finding the right solution, it did help me to narrow down the list of nodes that might be pertinent for my workflow.

[Figure: the missing value imputation part of the data preparation workflow]
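
Since this imputation was the painful part in Knime, here is a minimal sketch of the same idea in base R, assuming we replace the missing values of a numeric column with the median of its own group; the column names are illustrative only and do not come from the credit scoring data set.

# Group-based imputation: fill NA values with the median of the row's group
df <- data.frame(group = c("A", "A", "B", "B", "B"),
                 value = c(10, NA, 3, 4, NA))

group_median <- ave(df$value, df$group,
                    FUN = function(v) median(v, na.rm = TRUE))
df$value[is.na(df$value)] <- group_median[is.na(df$value)]
df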

In the end I was very happy that I succeeded in getting the results I wanted, i.e. the exact same output as the Excel file we came up with during the DAM challenge.


REFLECTION 


What happened?

During the first class, as a group, we decided to focus on implementing the KNN algorithm and left aside all the data preparation steps that we were supposed to perform within Knime.

What did I do?

After the class I went through the CRISP-DM methodology, which emphasizes the importance of the data preparation phase, so I decided to personally go through the exercise again, this time using Knime only.

I used the different materials I found on the Internet to help me in this task, but reading through the entire book Knime Essentials was the most helpful part.

What did I think and feel about what I did? Why?

In the end I was quite happy to have succeeded in getting the results I wanted. I learned how to use Knime and implement a data analysis workflow with this tool.

But at the same time I saw what its limitations were. Its main strength is how easy it is to design a workflow just by dragging and dropping the nodes you want, but it gets quite messy very quickly as the number of nodes increases. It requires some preliminary work to define the different steps of your workflow, and documenting it is quite useful, especially if the model is complex. Creating chunks and adding notes does help to bring more clarity.

What were the important elements of this experience and how do I know they are important? What have I learned?

Apart from learning how to use Knime, I discovered the importance of the 2 CRISP-DM phases related to data munging: Data Understanding and Data Preparation. Running a Machine Learning algorithm is actually pretty straightforward and doesn't require a lot of time, but the data preparation is much more time consuming. Now I understand why Data Scientists can spend 80% of a project on just these 2 steps.

These phases are also extremely important as they can dramatically increase the performance of your model.

How does this learning relate to the literature and to industry standards?

As I stated earlier, it is very important to spend a fair bit of time getting a clean, good data set prior to the modelling phase. But it is also important to highlight that using Knime helps to define a workflow that is repeatable and reproducible. This is a crucial point for any data mining project. It is particularly true at the Deployment phase of CRISP-DM, when you need to present your results to the business and explain how you achieved them, but also if they want to deploy the model to a bigger data set or to other areas.

But Knime cannot capture everything so it is still highly recommended to document all the steps while going through the project.


REFERENCES


Bakos G., 2013, Knime Essentials, Packt Publishing

DAM Portfolio – K Nearest Neighbor (KNN)

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

During the first DAM class, Professor Siamak took us through our first Machine Learning technique: K Nearest Neighbours. I didn't expect this to come so quickly, as I was thinking we would only learn Machine Learning techniques after at least a few weeks of study. So I was really excited when I saw the program for the day. I had heard a lot about KNN prior to enrolling in this Master, but I had never gone into the details of it.


WHAT IS KNN?


In 2006 the IEEE International Conference on Data Mining identified the top 10 data mining algorithms, and KNN was one of them. Ten years later it is still well recognized within the industry and remains one of the most used techniques.

So how does KNN work? It is based on a very simple concept, which I will try to explain through a simple example. Let's say you are eating a dish with your eyes closed and you try to find out what kind of food is being served to you. If you start thinking "it smells like duck, it tastes like duck and the texture feels like duck", you will probably classify it as duck meat, right? This is basically how KNN works: you classify any new observation according to previous experiences that share similar features.

It uses a distance calculation to identify which group the new observation is closest to and assigns it to that group.

As its name suggests, k is an important parameter. It defines the number of closest neighbours to take into account when classifying a new observation. For example, if k is set to 3, then for each new observation KNN will look at its 3 closest neighbours and assign it to the majority group among them. For instance, if 2 of these neighbours are classified Red and the last one Blue, KNN will assign the new point to the Red group. To avoid ties it is recommended to pick an odd number for k.

Here comes the next question: how do we choose k? Choosing a low value for k makes the model very sensitive to the noise in the training data (the classification can be driven by a single unusual neighbour), while picking a high value for k blurs the boundaries between groups and can also hurt the predictive power of the model. A rule of thumb is to use the square root of the number of observations in the training set; for instance, if there are 100 observations we start with k = 10. This rule is only a starting point: you will probably try several k values, compare the respective performance of the models and then select the best one.
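
As a small illustration of this trial-and-error on k, here is a minimal sketch in R using the class package on synthetic data (not the credit scoring set); it assumes the two features are already on a comparable [0, 1] scale.

library(class)

set.seed(1)
x <- matrix(runif(200 * 2), ncol = 2)                 # two synthetic features
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(200, sd = 0.2) > 1, "Yes", "No"))
train_idx <- 1:150                                    # 150 training observations

round(sqrt(length(train_idx)))   # rule of thumb: about 12, so try odd values around it

for (k in c(3, 7, 11, 13, 15)) {
  pred <- knn(train = x[train_idx, ], test = x[-train_idx, ],
              cl = y[train_idx], k = k)
  cat("k =", k, " accuracy =", round(mean(pred == y[-train_idx]), 3), "\n")
}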


HANDS-ON PRACTICE


During the first class, we went through an exercise in which we were given 45 minutes to implement our first data mining workflow in Knime using KNN. We worked within our respective groups and tried to come up with a solution for the Credit Scoring data set.

It wasn't actually very hard to set up a KNN classification. We struggled a bit at the beginning as we didn't know that we had to convert the response variable into a String type. The error message from the Knime console wasn't very clear (making it easier to understand, especially for non-technical people, would be a good improvement for Knime).

[Figure: KNN workflow in Knime]

The hardest part was actually the data preparation phase, when you transform, clean and enhance the original data set before feeding it into the modelling phase. At the time we chose to perform all these steps in Excel as we didn't have much time (I have captured this in another post on my blog called DAM Portfolio – Data Preparation in Knime).

We did try to run KNN with different values of k during the class (from 1 to 5). The best model was for k = 3 (our original choice). At that time we looked only at the confusion matrix and its accuracy value to assess the performance of this model.

[Figure: KNN confusion matrix from the Knime Scorer node]

But I saw that the Scorer node also provides a table with additional statistics (precision, recall…). I told myself to have a look at these later on (this is again the subject of another post on my blog).

[Figure: KNN evaluation statistics from the Knime Scorer node]


REFLECTION 


What happened?
We learned this Machine Learning technique through a presentation in class. Professor Siamak explained the theory behind KNN first, and then later in the day we got some hands-on practice using Knime.

What did I do?
During the class I was really interested in this technique as I had heard about it quite often. I carefully read the slides while listening to the explanations. I asked one or two questions at the time because I wasn't quite sure how the k parameter impacted the classification process. I was thinking that it was used to define the boundaries of a group, but thanks to Professor Siamak's explanations I understood it is actually used for the classification step, after the boundaries have been set.

What did I think and feel about what I did? Why?
Prior to this class, Machine Learning was a bit "mystical" for me. It was like a superpower that only chosen people could understand and use. During the class I was quite surprised by how simple this technique was. There was no need to have a PhD to understand it. I was actually thinking I must have missed something; it couldn't be that simple. So after the class I read a few articles about it. Even if they helped me to better understand KNN, I actually already knew almost everything I needed from the class. I still learned a few more things, like the fact that KNN is non-parametric (i.e. it makes no assumption about the underlying distribution) and that it is a lazy technique (i.e. it doesn't build a model up front, in contrast to eager learners). This experience helped me to demystify the Machine Learning field a bit, even if I know there are much more complex techniques that I will have to learn.

What were the important elements of this experience and how do I know they are important? What have I learned?
As I said, the main element of this experience was the demystification of Machine Learning. Previously I really thought I wasn't yet capable of running any of these techniques, but that was actually not true. It was really surprising how easy it was to implement our first workflow in such a short time.

What I learned from this experience is that you don’t really need to understand every single detail about this algorithm. A good understanding of what it is for and in which cases we can use it is more than enough to be able to use it properly for analysing data sets.

How does this learning relate to the literature and to industry standards?
Even after 10 years, KNN is still one of the top-ranked Machine Learning techniques. Every book and article I found on Data Mining still talks about it. It is really amazing how such a simple algorithm can still produce very accurate predictions.


REFERENCES


Wu X, Kumar V., 2009, The Top Ten Algorithms in Data Mining, Chapman and Hall/CRC

Gorunescu, F., 2011, Data Mining Concepts, Models and Techniques, Springer

Lantz B., 2013, Machine Learning with R, Packt Publishing