DAM Portfolio – K Nearest Neighbor (KNN)

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

During the first DAM class, Professor Siamak walked us through our first Machine Learning technique: K Nearest Neighbours. I didn't expect this to come so quickly, as I thought we would only learn Machine Learning techniques after at least a few weeks of study. So I was really excited when I saw the program for the day. I had heard a lot about KNN before enrolling in this Master's, but I had never gone into the details of it.


WHAT IS KNN?


In 2006, the IEEE International Conference on Data Mining listed the top 10 data mining algorithms, and KNN was among them. Ten years later it is still well recognised within the industry and remains one of the most widely used techniques.

So how does KNN work? It is based on a very simple concept, which I will try to explain through a simple example. Let's say you are eating a dish with your eyes closed and trying to work out what kind of food you are being served. If you start thinking "it smells like duck, it tastes like duck and the texture feels like duck", you will probably classify it as duck meat, right? This is basically how KNN works: you classify any new observation according to previous experiences that share similar features.

It uses a distance calculation to identify which existing observations the new one is closest to and assigns it to their group.

As its name suggests, k is an important parameter. It defines the number of closest neighbours taken into account when classifying a new observation. For example, if k is set to 3, KNN will look at the 3 closest neighbours of each new observation and assign it to the class that is most common among them. For instance, if 2 of these neighbours are classified as Red and the last one as Blue, KNN will classify the new point as Red. To avoid ties it is recommended to pick an odd number for k.
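To make the distance-plus-voting idea concrete, here is a minimal sketch in Python (this is not the KNIME workflow we used in class; the toy points, labels and function name are made up purely for illustration):

```python
import numpy as np
from collections import Counter

# Toy training data: two features per observation, each with a known class
train_X = np.array([[1.0, 1.2], [0.8, 1.0], [5.0, 5.1], [5.2, 4.8], [4.9, 5.3]])
train_y = np.array(["Red", "Red", "Blue", "Blue", "Blue"])

def knn_predict(new_point, k=3):
    # Euclidean distance from the new observation to every training point
    distances = np.linalg.norm(train_X - new_point, axis=1)
    # Indices of the k closest neighbours
    nearest = np.argsort(distances)[:k]
    # Majority vote among those neighbours decides the class
    votes = Counter(train_y[nearest])
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.1, 0.9])))  # expected: "Red"
```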

Here comes the next question: how do we choose k? Choosing a low value for k can produce an overfitted model that is too sensitive to noise in the training data, so it won't predict the class of new observations very well. On the other side, picking a high value for k smooths the boundaries between groups so much that the model may underfit, which also hurts its predictive power. A rule of thumb is to use the square root of the number of observations in the data set: for instance, if there are 100 observations we set k = 10. This rule is a good starting point, and in practice you will try several k values, compare the performance of the resulting models and then select the best one.
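Expressed as a tiny snippet (just a sketch; the variable names are mine):

```python
import math

n_observations = 100
k = int(round(math.sqrt(n_observations)))  # square-root rule of thumb -> 10
if k % 2 == 0:
    k += 1                                 # optionally nudge to an odd value to avoid ties
print(k)
```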


HANDS-ON PRACTICE


During the first class, we went through an exercise in which we were given 45 minutes to implement our first data mining workflow in KNIME using KNN. We worked within our respective groups and tried to come up with a solution for the Credit Scoring data set.

It wasn't actually very hard to set up a KNN classification. We struggled a bit at the beginning because we didn't know that we had to convert the response variable into a String type. The error message in the KNIME console wasn't very clear (making these messages easier to understand, especially for non-technical people, would be a good improvement for KNIME).

Knime KNN

The hardest part was actually the data preparation phase, where you transform, clean and enhance the original data set before feeding it into the modelling phase. We chose to perform all these steps in Excel as we didn't have much time (I have captured this in another post on my blog called DAM Portfolio – Data Preparation in Knime).

We did try running KNN with different values of k during the class (from 1 to 5). The best model was for k = 3 (our original choice). At that time we looked only at the confusion matrix and its accuracy value to assess the performance of the model.
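Outside of KNIME, the same comparison could be sketched in Python with scikit-learn. This is only an illustration: the file name, column names and train/test split are assumptions, not the actual class data set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical credit scoring data: 'default' is the class we want to predict,
# and the remaining columns are assumed to be numeric features
data = pd.read_csv("credit_scoring.csv")
X = data.drop(columns=["default"])
y = data["default"].astype(str)          # class column as a string, like in KNIME

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for k in range(1, 6):                    # try k = 1 to 5, as we did in class
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"k={k} accuracy={accuracy_score(y_test, predictions):.3f}")
    print(confusion_matrix(y_test, predictions))
```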

KNN confusion matrix

But I saw that the Scorer node also provides a table with additional statistics (precision, recall and so on). I told myself to have a look at these later on (this is again the subject of another post on my blog).
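As a quick reminder to myself of what those extra statistics mean, here is a self-contained sketch in Python (the toy labels are made up, and scikit-learn stands in for KNIME's Scorer node):

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Toy predictions vs. actual classes, just to show what the extra statistics measure
y_true = ["good", "good", "bad", "bad", "good", "bad"]
y_pred = ["good", "bad",  "bad", "bad", "good", "good"]

print(precision_score(y_true, y_pred, pos_label="bad"))  # of the points predicted 'bad', how many really are
print(recall_score(y_true, y_pred, pos_label="bad"))     # of the real 'bad' points, how many we caught
print(classification_report(y_true, y_pred))             # per-class precision, recall and F1
```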

KNN evaluation


REFLECTION 


What happened?
We learned this Machine Learning technique through a presentation in class. Professor Siamak explained the theory behind KNN first, and later in the day we got some hands-on practice using KNIME.

What did I do?
During the class I was really interested in this technique, as I had heard about it quite often. I read the slides carefully while listening to the explanations. I asked one or two questions because I wasn't quite sure how the k parameter impacted the classification process. I thought it was used to define the boundaries of a group, but thanks to Professor Siamak's explanations I understood that it is actually used for the classification step, after the boundaries have been set.

What did I think and feel about what I did? Why?
Prior to this class, Machine Learning was a bit "mystical" for me. It felt like a superpower that only chosen people could understand and use. During the class I was quite surprised by how simple this technique was. There was no need for a PhD to understand it. I actually thought I must have missed something; it couldn't be that simple. So after the class I read a few articles about it. Even though they helped me understand KNN better, I already knew almost everything I needed from the class. I still learned a few more things, such as the fact that KNN is non-parametric (i.e. it makes no assumption about the underlying distribution) and that it is a lazy technique (i.e. it doesn't build a model up front, unlike eager learners). This experience helped me demystify the Machine Learning field a bit, even though I know there are much more complex techniques that I will have to learn.

What were the important elements of this experience and how do I know they are important? What have I learned?
As I said, the main element of this experience was the demystification of Machine Learning. Previously I really thought I wasn't yet capable of running any of these techniques, but that actually wasn't true. It was really surprising how easy it was to implement our first workflow in such a short time.

What I learned from this experience is that you don't really need to understand every single detail of an algorithm. A good understanding of what it is for and in which cases it can be used is more than enough to apply it properly when analysing data sets.

How does this learning relate to the literature and to industry standards?
Even after 10 years, KNN is still one of the top-ranked Machine Learning techniques. Every book and article I found on data mining still talks about it. It is really amazing how such a simple algorithm can still produce very accurate predictions.


REFERENCES


Wu, X. & Kumar, V. 2009, The Top Ten Algorithms in Data Mining, Chapman and Hall/CRC.

Gorunescu, F. 2011, Data Mining: Concepts, Models and Techniques, Springer.

Lantz, B. 2013, Machine Learning with R, Packt Publishing.
