DAM Portfolio – Data preparation in Knime

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

I had never used Knime before the first DAM class. I had heard about it, but I never took the time to give it a go, as I figured that whatever I could do in Knime I could do directly in R. I didn't see the added value in learning this new tool.


WHAT IS IT ALL ABOUT?


Knime is an open-source data analytics tool. It is designed for building data analysis workflows through a visual interface, without any coding skills. It is extremely easy to use: you just select the nodes you want (each of which performs a specific task) and link them together until you get the final expected output.
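For readers who, like me, come from R or Python, the concept maps quite naturally onto a code pipeline: each node is one processing step, and the workflow chains them together. Here is a minimal sketch of that analogy in Python/scikit-learn (my own illustration, not something Knime produces):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Each pipeline step plays the role of one Knime node,
# and the Pipeline object is the workflow linking them.
workflow = Pipeline([
    ("normalise", StandardScaler()),    # like a normalisation node
    ("model", KNeighborsClassifier()),  # like a KNN learner node
])
# workflow.fit(X_train, y_train) then workflow.predict(X_test)
```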

Knime has an impressive library of nodes for Data Mining, Machine Learning and Data Visualisation tasks.


Once you build a workflow, you can easily re-use it for another project or data set. You can also follow visually, step by step, how your workflow runs.


HANDS-ON PRACTICE


During the first DAM class, we had to use Knime to run a K Nearest Neighbour (KNN) algorithm. It was a competition and we only had 45 minutes to come up with the most accurate model possible. Within our group we started using Knime to prepare the data set, but due to the lack of time we quickly decided to perform all the cleaning, binning and filtering steps directly in Excel and to focus only on implementing our KNN algorithm.
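For the record, the preparation steps themselves are simple to express in code. Roughly, what we ended up doing in Excel boils down to something like this (a Python/pandas sketch for illustration only; the file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("dam_challenge.csv")  # hypothetical file name

# Cleaning: drop duplicates and rows with a missing target label
df = df.drop_duplicates()
df = df.dropna(subset=["target"])      # "target" is a placeholder name

# Binning: bucket a continuous variable into categories
df["age_band"] = pd.cut(df["age"],
                        bins=[0, 25, 45, 65, 120],
                        labels=["young", "adult", "middle", "senior"])

# Filtering: keep only the rows relevant to the question
df = df[df["age"] >= 18]
```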

After the class I decided to go through the exercise again, but this time I tried to perform all the tasks within Knime. I did this not only because it is the recommended tool for this subject, but also because I was quite curious about the level of automation this tool could bring.

The first difficulty I encountered was finding the most appropriate nodes for the tasks I wanted to perform. There were so many different ones, and for some of them I couldn't tell what the differences were. I did a bit of research online to see if there was any documentation for the tool. To my big surprise, there wasn't much material about Knime on the Internet. The most useful resource I found was the forum on the Knime website, where people post questions and get answers directly from the user community.

So I decided to start building the data preparation workflow using this forum as a guide. What I found is that even though it is very easy to build a workflow, it starts to get a bit messy as you keep adding nodes one after the other. So I broke the different tasks I wanted to perform on this data set down into chunks. For each chunk I defined what the expected output was. This helped me focus on specific tasks first and confirm they produced the right result before moving on to the next chunk. In the end I came up with 3 different chunks.

[Figure: data preparation models]

The last chunk was the hardest one by far. I wanted to impute missing values depending on the values of the other variables. Knime doesn't yet offer an efficient way to tackle this kind of task easily, so you have to create your own sub-workflow within the main workflow to get the results you want. As I was struggling and couldn't find much help on the Knime forum, I decided to look for user guides or books related to Knime. I found one written by Gabor Bakos that describes every node from the main package and provides practical examples for each. This was very helpful: even though I still needed to experiment with several options before finding the right solution, it helped me narrow down the list of nodes that might be pertinent for my workflow.
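To make the idea concrete, this is the kind of conditional imputation I was after, sketched here in Python/pandas purely as an illustration (the column names are invented; in Knime it took a combination of grouping and joining nodes to achieve the same thing):

```python
import pandas as pd

df = pd.read_csv("dam_challenge.csv")  # hypothetical file name

# Impute a missing numeric value with the mean of the rows that
# share the same category, rather than one global mean.
group_means = df.groupby("occupation")["income"].transform("mean")
df["income"] = df["income"].fillna(group_means)

# For a categorical variable, use the most frequent value within
# each group (leave NaN if the whole group is missing).
def fill_with_group_mode(s):
    return s.fillna(s.mode().iloc[0]) if not s.mode().empty else s

df["education"] = (df.groupby("occupation")["education"]
                     .transform(fill_with_group_mode))
```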

In the end I was very happy that I succeeded in getting the results I wanted, i.e. the exact same output as the Excel file we came up with during the DAM challenge.


REFLECTION 


What happened?

During the first class, as a group, we decided to focus on implementing the KNN algorithm and left aside all the data preparation steps that we were supposed to perform within Knime.

What did I do?

After the class I went through the CRISP-DM methodology, which emphasises the importance of the data preparation phase, so I decided to go through the exercise again on my own, this time using Knime only.

I used the different materials I found on the Internet to help me with this task, but reading through the entire book Knime Essentials was the most helpful part.

What did I think and feel about what I did? Why?

In the end I was quite happy to have succeeded in getting the results I wanted. I learned how to use Knime and how to implement a data analysis workflow with this tool.

But at the same time I saw its limitations. Its main strength is how easy it is to design your workflow, just by dragging and dropping the nodes you want, but things get quite messy very quickly as the number of nodes increases. It requires some preliminary work to define the different steps of your workflow, and documenting the workflow is quite useful, especially if the model is complex. Creating chunks and adding notes does help to bring more clarity.

What were the important elements of this experience and how do I know they are important? What have I learned?

Apart from learning how to use Knime, I discovered the importance of the 2 CRISP-DM phases related to data munging: Data Understanding and Data Preparation. Running a machine learning algorithm is actually pretty straightforward and doesn't require a lot of time; the data preparation is far more time consuming. Now I understand why data scientists can spend 80% of a project on just these 2 steps.

These phases are also extremely important because they can dramatically increase the performance of your model.

How does this learning relate to the literature and to industry standards?

As I stated earlier, it is very important to spend a fair bit of time getting a clean, good-quality data set before the modelling phase. But it is also worth highlighting that using Knime helps to define a workflow that is repeatable and reproducible. This is a crucial point for any data mining project. It is particularly true at the Deployment phase of CRISP-DM, when you need to present your results to the business and explain how you achieved them, but also if they want to deploy the model to a bigger data set or to other areas.

But Knime cannot capture everything, so it is still highly recommended to document all the steps as you go through the project.


REFERENCES


Bakos, G. 2013, KNIME Essentials, Packt Publishing.
