DAM Portfolio – Summative Reflection

I have been on an incredible learning journey since I started the Master of Data Science and Innovation at UTS. It has only been a few weeks of teaching, but I feel I have already learned a huge number of new skills. Ten years after graduating from my last degree, I chose to go back to school because I want to become a Data Scientist. My personal objective for this degree is to learn as much as I can: I don’t want to simply add another diploma to my resume, I really want to acquire valuable skills that will help me in my new career. So far I am on track with this objective. In this summative reflection I will go through several examples of what I have learned in the Data, Algorithms and Meaning (DAM) subject.

One of the main skills a Data Scientist is expected to have in their tool-kit is Machine Learning. Since the start of the DAM subject I have learned two different Machine Learning algorithms: K Nearest Neighbours (KNN) and Artificial Neural Networks (ANN). Professor Siamak presented KNN during his first class and I tried to be an active listener, asking a few questions to get a better understanding of the technique. We then had a hands-on practice with my group: together we designed a KNIME workflow to process the Credit Scoring data set. All of us were novices with the tool, and we worked together to figure out how to get to the result we wanted. It was a very participative exercise. I was the one driving the laptop, trying to follow all the questions and suggestions from my team members while designing the workflow at the same time. In the end I think it was quite effective, as every member was pushing in the same direction and trying to bring added value to the group. I was still left with a few questions, as we did not have time to go through them.

After the class I went through the workflow again by myself and tried to complete what had been left unfinished at the data preparation stage (during the class we had cleaned the data manually in Excel). I spent the following three days familiarising myself with KNIME and working towards the result I wanted. It was quite laborious, as I did not know much about KNIME, but I kept trying different nodes to clean and transform the data and did some research whenever I faced a difficult issue. I finally succeeded in getting to the final result, and I was very happy and proud of myself for not giving up during the learning curve. From this experience I learned several things. On the technical side, I learned how to run a Machine Learning technique, KNN, and familiarised myself with KNIME. Regarding my soft skills, it helped me enhance my “hacking” and problem-solving skills. Finally, I discovered how important the Data Understanding and Data Preparation phases are.
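To make this concrete, here is a minimal sketch of the kind of workflow I built, written in Python with scikit-learn instead of KNIME’s visual nodes. The file name and column names (credit_scoring.csv, delinquent) are placeholders for illustration only, not the actual data set we used in class.

    # A rough sketch of the KNN workflow: data preparation, a train/test split,
    # scaling, then the classifier. All names below are illustrative assumptions.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Data preparation: load the table and drop rows with missing values
    # (in class this step was done by hand in Excel before the KNIME workflow)
    data = pd.read_csv("credit_scoring.csv")          # hypothetical file
    data = data.dropna()                              # assumes numeric features

    X = data.drop(columns=["delinquent"])             # hypothetical target column
    y = data["delinquent"]

    # Keep a hold-out partition to evaluate the model
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)

    # KNN is distance-based, so features should be on a comparable scale
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    # Train the K Nearest Neighbours classifier and check its accuracy
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, knn.predict(X_test)))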

For ANN it was a totally different approach, as it was an individual task. I watched the videos suggested by Professor Siamak about the five different Machine Learning techniques. At that point I did not feel I really understood how an ANN works, so I had to do my own research. I started with some introductory material I found on the Internet and progressively moved on to more advanced documentation. After a few days of reading I decided to give it a go in KNIME. I was quite surprised to find that I was able to run an ANN in less than half an hour. This experience confirmed what I had learned while working through KNN: Machine Learning is not as complicated as I thought it was, and it actually only requires a basic understanding of how each technique works. Now I am looking forward to learning more techniques, so I have decided to learn a new algorithm every week and keep practising on the same data set. I think hands-on practice on a real data set is very important and can help fast-track and consolidate the learning.
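As another illustration only, here is a rough sketch of the same idea in code: a small multilayer perceptron trained with scikit-learn’s MLPClassifier on the hypothetical prepared data from the KNN sketch above (it reuses the X_train/X_test splits defined there). The layer size and iteration count are arbitrary choices for the example, not the settings of my KNIME workflow.

    # A minimal ANN sketch, continuing from the KNN example above
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    # One hidden layer of 10 units; hyperparameters are illustrative only
    ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)
    ann.fit(X_train, y_train)              # reuses the scaled splits from above
    print("Accuracy:", accuracy_score(y_test, ann.predict(X_test)))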

The next skill I learned in the last few weeks is the CRISP-DM methodology. This is particularly important as most Data Science projects are managed through this approach. Professor Siamak presented us with an overview of its six phases. I was quite curious and wanted a better understanding, so I started to do some research and went through the user guide written by SPSS. While reading this guide I compared it with different methodologies I have seen during my career. It was interesting to find similarities with the PDCA approach designed by Deming and the Lean Six Sigma DMAIC methodology. Putting it in perspective with other knowledge I already have helped me gain a deeper understanding of some of the topics described by CRISP-DM. For example, I knew from the DMAIC model how critical the first phase is: clearly defining the scope of the project and identifying the key business requirements is key to its success. What I also noticed in the CRISP-DM model is the feedback loop from the Evaluation phase back to Business Understanding. This was quite surprising to me, as it means the model you build can turn out to be wrong or rejected, and you may have to start again from the beginning. This raised my interest in the Evaluation phase and I started to ask myself how to evaluate the performance of a model.

This brings us to the last topic I learned during this first semester: Machine Learning evaluation. During the first class we used the accuracy percentage to assess the performance of our KNN algorithm. I remembered that I had seen, in one of the KNN outputs, a table with some statistics related to the model. I was not quite sure what they actually meant, so I decided to look into it after the class. I did some research online to find out more about these measures. I found some slides that explained how the different measures were calculated, but I still did not really understand how to interpret them. At the UTS library I found a book called Machine Learning with R, with a specific chapter on evaluating models. Reading this chapter helped me get a better understanding of each of these measurements, and I decided to go back to the Credit Scoring exercise to compare the two models I had built with KNIME: KNN and ANN. I asked myself which measures were important for this specific data set. I tried to think as if I were part of the business and came to the conclusion that what mattered was how well the model truly predicts a delinquent, but also how it misclassifies: a true delinquent classified as non-delinquent increases the risk of losing money, while a non-delinquent classified as delinquent means losing sales. Therefore I chose precision and recall as the two main measures. The best model I built was the ANN, with a precision of 70% and a recall of 65%. In a real environment those values would probably have been too low for the business and the model would have been rejected. This conclusion brought me back to the feedback loop I mentioned earlier. I understood how critical it is, at the Business Understanding phase, to clearly define and agree with the business on the measures that will be used to evaluate a model and on their rejection thresholds. This experience really helped me better understand what the CRISP-DM methodology is. Even though I have not had any hands-on project practice yet, I learned a lot by going through an example by myself and trying to imagine how the project would have unfolded in a real environment.
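To illustrate this comparison, here is a small sketch that computes precision and recall for the two hypothetical models from the earlier sketches, assuming the delinquent class is coded as 1. The figures it would print are illustrative; the 70% precision and 65% recall above came from my KNIME workflows.

    # Compare the two sketch models on precision and recall for the
    # positive "delinquent" class (assumed here to be labelled 1)
    from sklearn.metrics import precision_score, recall_score, confusion_matrix

    for name, model in [("KNN", knn), ("ANN", ann)]:
        y_pred = model.predict(X_test)
        # Precision: of the cases predicted delinquent, how many really are
        # (low precision means rejecting good customers and losing sales).
        # Recall: of the true delinquents, how many the model catches
        # (low recall means missed delinquents and a risk of losing money).
        print(name,
              "precision:", precision_score(y_test, y_pred, pos_label=1),
              "recall:", recall_score(y_test, y_pred, pos_label=1))
        print(confusion_matrix(y_test, y_pred))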

As I said earlier, I am very happy with all the learning I have done since I started this Master. I have learned a lot of new techniques and gained new skills. I have personally spent a lot of time reading books and trying to find the answers to my questions by myself. All this personal work is starting to pay off, as I feel I am beginning to put the different pieces of the Data Scientist puzzle together. Writing my portfolio helped me assess retrospectively what I really learned and understood. I have also improved my writing skills, which is quite important as English is not my mother tongue.

What I should probably improve is connecting more with my peers in order to learn from their experiences. I will try to follow some of their blog posts and engage more with my team mates first. I also need to plan my learning activities better, focusing on specific topics rather than trying to go through everything.

But after just five weeks, I really think I am building the foundations of my new career as a Data Scientist.


RELATED BLOG POSTS


  • DAM Portfolio – K Nearest Neighbor (KNN)
  • DAM Portfolio – Data Preparation in Knime
  • DAM Portfolio – CRISP-DM
  • DAM Portfolio – Neural Network (ANN)
  • DAM Portfolio – Machine Learning Evaluation

REFERENCES


Wu, X. & Kumar, V., 2009, The Top Ten Algorithms in Data Mining, Chapman and Hall/CRC

Gorunescu, F., 2011, Data Mining: Concepts, Models and Techniques, Springer

Lantz, B., 2013, Machine Learning with R, Packt Publishing

KDnuggets, 2014, What main methodology are you using for your analytics, data mining, or data science projects? Poll, viewed October 2014, <http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html>

IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation

University of New South Wales, 2014, Evaluation in Machine Learning

Bakos, G., 2013, KNIME Essentials, Packt Publishing
