Call of Duty – Pedestrian Safety Hackathon 2016

 

This may come as a surprise to people who know me well: "Anthony wrote a blog post?! And posted it just 3 days after the end of the event?! No way!!!"

Yeah! Writing a blog post is not usually at the top of my to-do list. For instance, I still haven't drafted anything about the Unearthed Hackathon I won, and that was a month ago. So you may be wondering why I wrote this one so quickly. If you are thinking I won another hackathon, you are wrong. But something important enough obviously happened to trigger this sudden passion for blog posting.

Let's start from the beginning. Last weekend our usual hackathon crew participated in the Pedestrian Safety 2016 challenge organised by the NSW Data Analytics Centre (DAC) at the University of Sydney. Our team, DataCake, was composed of:

  • William Azevedo (aka “the-nicest-guy-in-the-world”)
  • Pedro Fernandez (aka “everything-is-awesome”)
  • William So (aka “the-big-data-guy”)
  • Anthony So (me aka “pinky-or-the-brain-we-still-do-not-know-which-one-he-is”)

I would also like to mention that Passiona Cottee was planning to join us but couldn't make it for personal reasons.

The hackathon

What was this hackathon about? The main objective was to find different ways to tackle the critical problem of pedestrian safety in NSW. DAC did a fantastic job of collecting, preparing and cleaning the different data sets for this challenge. Yeah, I know. They spoon-fed us. In other hackathons we usually receive a dump of totally unstructured data, or there is no data at all. So we were thinking: "Cool. No data munging! Easy peasy! We just need to run different Machine Learning algorithms, pick the best performer and that's it!"

We (as well as the other teams) quickly realised it wouldn't be that easy. The first issue we encountered was getting access to the data, and it was quite laborious. Due to data privacy concerns, the data sets were put in Microsoft Azure Blob storage and only shared with a specific Virtual Machine (VM) set up for this hackathon. No one in our team was familiar with Azure, so when the organisers told us "you have to set up a VM, then copy the Blob with AzCopy into your own Blob", it was like they were speaking Finnish to us. We had no clue what was expected of us, but we gave it a go and, with persistence, managed to do it after a few hours. I think only one other team made it through. Actually, a lot of teams dropped out of the competition because of this first hurdle.

This is just one example of the issues we ran into during the weekend, and trust me, there were a lot. But let's move on to the result of the hackathon. There were 2 rounds of presentations: every team had to do a 2-minute pitch in the first round, and only the 8 finalists did a 5-minute presentation in the final round. We were not among the finalists. We were quite disappointed, just like all the other teams who had worked very hard during the weekend, but... after announcing the list of the 8 teams, the judges made a strange announcement: "... We would also like to invite team DataCake to stay for the second round and present their findings".

As you can imagine, we were very puzzled. We were not short-listed for the final round, yet we were still asked to do the final presentation. Most teams presented very innovative, out-of-the-box solutions such as a rewards app for pedestrians or intelligent street sensors and lights. But we were the only team that really tried to use the full potential of the data sets provided to answer questions that had not been solved yet, such as: Why was there an increase in injuries and fatal accidents in 2015 compared to previous years? How can we reduce these numbers?

During the whole weekend we forced ourselves to go deep and ask "Why is this happening?" every time we found an interesting pattern. We really wanted to understand the true root causes of those accidents. We didn't want to stay at a descriptive level. We knew the answers were behavioural, and we knew there were multiple problems that therefore required different answers and solutions. We used different techniques to get there: machine learning, statistics, data visualisation. Which one we used didn't matter; the only important point was how to get to the answers to those questions.

For instance, we built a classification model on the severity of accidents involving children, but we didn't use it to make predictions. We used it to identify the important (and unimportant) features for those cases. We found that some of the variables related to the environment (Primary_hazardous_feature, Surface_condition, Weather...) and to the drivers (Fatigue_involved_in_crash...) were not important. This was a good indication that those accidents were mostly related to the behaviour of the children themselves. So we kept diving further and found 3 postcodes with higher numbers of accidents than the others. We focused on those 3 areas and kept going deeper and deeper. Here are some examples showing how deep we went to find answers:

[Images: case 1 and case 2]

[Image: case 3]
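To give an idea of what that feature-importance step can look like in practice, here is a minimal R sketch using the randomForest package. The data frame crashes_children and the outcome column Severity are hypothetical names; the predictors simply mirror the variables mentioned above.

    library(randomForest)

    # Hypothetical data frame of crashes involving children, with a factor outcome
    # 'Severity' and candidate predictors such as Primary_hazardous_feature,
    # Surface_condition, Weather, Fatigue_involved_in_crash, ...
    set.seed(42)
    rf <- randomForest(Severity ~ ., data = crashes_children,
                       importance = TRUE, na.action = na.omit)

    importance(rf)   # importance score of each variable
    varImpPlot(rf)   # visual ranking of the most (and least) important features

We used this kind of ranking only to decide where to dig next, not to make predictions.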

As you can see, we didn't come up with very innovative solutions, but we were absolutely 100% focused on finding the answers because we really wanted to save people's lives. We just used analytical techniques to help us find where we needed to focus and where we needed to investigate more, until we understood the reasons for the accidents.

This is probably why we didn't end up in the final 8: we didn't bring any new idea or concept. We probably didn't match the judging criteria, which we presumed were more focused on innovation. Having said that, the judges were still very interested in our analysis and wanted to hear what we found. I am extremely proud of what we accomplished as a team. We tackled every single issue we found on our road and kept moving forward without fear and with the same commitment until we reached our objective: saving people's lives. This is one of the reasons I wrote this blog post. The second one is coming.

The experience

This experience was really mind-blowing for me. As I said, I was already proud of the job we did. But while I was listening to the other teams' presentations, I profoundly realised that we had changed.

The following part may sound a bit critical, but criticism is absolutely not the point; it is more an explanation of why I started to reflect on our team and on myself. You know me. I am not the kind of person who easily criticises others.

During the weekend every team did an amazing job with the skills they had at hand. It was definitely a tough challenge. But there were a few points that bothered me as a data scientist:

  • A lot of teams identified the Sydney CBD as the place with the highest number of accidents from 2000 to 2015 and jumped straight to a solution. Some recommended running field tests on George Street with IoT (Internet of Things) solutions. Two things made me jump:
    1. Unfortunately they didn't put their findings back into context. We are in 2016! George Street has already changed since the last data point. The light rail extension works started a few months ago, and the only traffic left is on the cross streets. [SENSE MAKING]
    2. They didn't push their analysis further. We saw an increase in injuries in the CBD in 2015 and we wanted to understand why. We found there were more accidents on Thursday evenings. That sounds simple, but to find it we transformed the date and time variables and assigned them back to the corresponding evening: for instance, Thursday 6pm to midnight and Friday 12am to 6am were both assigned to the same Thursday evening (see the sketch after this list). Simple but extremely effective! Then we started to list different hypotheses and test them, and we finally found a strong correlation between the increase of accidents in the CBD on Thursday evenings and the growth of retail turnover, but also with the increase of women in the labour force. To do so we went to the Australian Bureau of Statistics (ABS) to find the relevant data sets and merged them into our analysis. Judging by the reaction of the organisers, this is something they didn't know. [COMMITMENT]
  • Some teams came up with a predictive model with an accuracy above 80% for classifying the severity of an accident. What they didn't realise was that the data set contains some variables highly correlated with their outcome variable: they predicted that an accident was classified as "killed" based on another variable that gave the number of people killed in that accident. They didn't take the time for the Data Understanding phase; they just put everything into an algorithm and reported the results. On our side, we looked at every single variable and kept only the meaningful ones for our analysis. We ended up with a very low accuracy. Tossing a coin would have given you a better chance of classifying the severity correctly than our model. But it didn't matter. We simply concluded that the data set hadn't captured the really important features. We didn't manipulate our model to tell a story that suited us. [INTEGRITY]
  • One team highlighted that the biggest group involved in fatal accidents was elderly people, but that after normalisation this group showed no difference from the others, so they focused on other problems. Data science practitioners play a lot with data and numbers, but it is absolutely our responsibility to understand what they mean. The numbers we were talking about were real people who DIED on the roads. Every single count was critical. Reducing that number even by one has an enormous impact: one person saved! And this is exactly what we tried to do during the entire weekend: save pedestrian lives now, not in some distant future, because with every minute that passes one more person could potentially be injured or killed on the road. For us it made absolutely no sense to say that after normalisation those numbers were not important anymore. [ETHICS]
  • Finally, the last thing that bothered me was that a lot of teams recommended an app to change the behaviour of pedestrians and drivers. The group most often killed in road accidents is people over 60 years old. The number of elderly pedestrians killed had been trending down for years, but it jumped last year and the organisers told us they still don't know why. We can safely assume this group generally follows the rules: they probably won't jaywalk and will cross the road where it is safe. Obviously something changed to cause this increase. They probably did change their behaviour in 2015, but it is almost certain they were forced to; this is not a group that decides one day to change its behaviour for no reason. Looking at the data, we found they were killed in high-density areas and quite often on big main roads. These areas are where public offices and agencies are located, and those are probably the places these people were trying to get to. Unfortunately we didn't have time to finalise this analysis, but our main hypothesis is that this group has had to walk more since the big bus timetable change or the merger of Medicare centres in 2015. What we are sure of is that no app will have a significant impact on decreasing their risk of a fatal accident. A better solution may be to change bus timetables or routes, or to relocate some public offices next to a train station or bus stop to lower the probability of crossing a main road. [PROBLEM SOLVING]
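To make the Thursday-evening transformation mentioned above more concrete, here is a minimal R sketch. The crashes data frame and its crash_datetime column are hypothetical names; only the bucketing logic matters.

    library(lubridate)

    # Keep only crashes that happened between 6pm and 6am
    evening <- subset(crashes, hour(crash_datetime) >= 18 | hour(crash_datetime) < 6)

    # Crashes between midnight and 6am are assigned back to the previous day's evening
    evening$evening_date <- as.Date(evening$crash_datetime) -
      ifelse(hour(evening$crash_datetime) < 6, 1, 0)
    evening$evening_day <- weekdays(evening$evening_date)

    # Count accidents per evening of the week, e.g. to spot the Thursday peak
    table(evening$evening_day)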

As I said earlier, the whole point was not to criticise the work of the other teams (I know it is still not that obvious). I bring this up because just a few months ago we would have made exactly the same mistakes, or we would have given up when the issues we faced were too challenging. Listening to those presentations made me realise where we are in our Ithaca journey. If you are a student of the UTS Master of Data Science and Innovation (MDSI) you should know what this means; otherwise you can still google it. Without realising it we had travelled a very long distance. And not only that, we had fully embraced some key skills: the ones I listed in brackets. During this hackathon it was absolutely natural for us to make sense of our findings and to commit ourselves to solving the problems in an ethical, responsible and honest way. We all learned about these concepts during the first semester of MDSI, but this is the first time I realised I didn't have to go through that checklist at the end of the project and promise I wouldn't forget about data privacy and ethics next time. No, this time everything flowed together, not as constraints but as drivers in our approach.

Final words

We didn't win any prize at this hackathon, but I am extremely proud of what Pedro, Will, William and I achieved and of the way we did it. In a lifetime there are few occasions when you feel you have changed and become another person, and this experience is definitely one of them for me. Personally it counts far more than winning Unearthed, no comparison! This is really the moment when I feel we became talented, ethically-minded and responsible data scientists. I remember that on my first day of MDSI (just 6 months ago) one of the guest lecturers said we would get the sexiest job in the world, but that with great power comes great responsibility. All this time we had mainly been focusing on getting more and more power, but this weekend showed us we have switched to the second part of that sentence. We moved to the superheroes' side 🙂

I really want to thank the NSW Data Analytics Centre again for this life-changing experience. I really hope you will look at our findings and use them as a starting point. I know it wasn't, and still isn't, an easy task to get access to data from the different official bodies, but you can show them our analysis of fatal accidents involving elderly people, and I hope it will make them realise that by sharing their data, lives can be saved. Please keep pushing to fast-track data-driven policy making.

I also want to thank all the teachers, lecturers and assistants from the UTS Master of Data Science and Innovation (MDSI). You helped us become what we are now. I know we are not finished with the Master yet (especially me), but you have already done a fantastic job. You did more than show us the right direction: you forced us to learn how to get on the right track by ourselves, and this will without any doubt shape our future careers as data scientists.

A big big big thank you to my teammates: Pedro, Will and William!!! Thanks for making this challenge so easy and so delightful! Thanks for keeping faith in our approach and pushing it as far as we could during this weekend! We definitely achieved our goal!

Thanks for reading to the end of this post. I hope you have already experienced, or soon will, the same amazing journey we did.

Viva DAC !!!

Viva UTS MDSI !!!

Viva team DataCake !!!

[Image: team DataCake]

Anthony So

PS: Don’t expect any blog post about my experience on Unearthed. It will just not happen 🙂

PPS: Sorry Pedro for not showing your slide during the presentation. Here it is:

[Image: Pedro's slide]

iLab1 – Personal Learning Objectives

iLab1 Personal Learning Objectives

My iLab subject is an internal project at my company, Fairfax Media.

 

Project Problem Statement and Description:

Fairfax Media publications are split into 2 main categories: Metropolitan and Community mastheads. The way the audiences of these two categories consume content may differ significantly, or on the contrary may show similarities for specific types of content. The objective of this project is to provide a better understanding of the news consumption behaviour of these 2 groups.

 

My 3 personal learning objectives for this iLab are:

Web Analytics

  • I am very keen to understand more about how solutions such as SiteCatalyst or Google Analytics track how online users use websites. It will be interesting to see what kind of information is automatically recorded and what the key measures in this area are. Even if this part isn't really in the scope of the problem statement, the way a website is designed may have a significant impact on how the end user uses it.

How to Profile Customer Behaviours

  • I am very interested in learning techniques to understand customer behaviour based on customers' actions. This mainly relates to association rule machine learning algorithms, which we haven't covered so far, so I really want to learn more about them (see the sketch after this list). I have often heard about the data science project that found a correlation between purchases of beer and nappies. This example is actually one of the reasons I wanted to learn more about Data Science, so I am really looking forward to a deep dive into this area.
  • I recently read some articles about graph analysis that may also fit the purpose of this project. I am not sure yet whether it is relevant, but if I have time it is an area I want to learn a bit more about.
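As a first step in that direction, here is a minimal sketch of association rule mining in R with the arules package. The file name, column names and thresholds are hypothetical; they just illustrate the kind of analysis I have in mind.

    library(arules)

    # Hypothetical transaction data: one row per purchased item, with a basket id
    trans <- read.transactions("purchases.csv", format = "single", sep = ",",
                               header = TRUE, cols = c("basket_id", "item"))

    # Mine association rules with minimum support and confidence thresholds
    rules <- apriori(trans, parameter = list(supp = 0.01, conf = 0.5))

    # Inspect the strongest rules (the classic "beer => nappies" kind of pattern)
    inspect(head(sort(rules, by = "lift"), 10))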

Clustering and Market Segmentation

  • Finally, the last thing I expect to learn from this iLab is how to group similar types of customers. I have already used some clustering algorithms, but they were all for continuous data. Depending on the data set I get, I may need to find algorithms that can handle count or categorical variables (a sketch follows below). As a second step I would also like to see how those results relate, or not, to my company's current marketing strategy. There are no strong internal expectations for this project at the moment, so it is more of a research exercise, but I would still like to know whether it can create value for the company.
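For the categorical case, one option I would like to try is k-modes. Here is a minimal sketch with the klaR package, assuming a hypothetical data frame readers that contains only categorical variables describing each customer.

    library(klaR)

    # Hypothetical data frame of readers described by categorical variables only
    # (masthead type, most-read section, device, visit frequency band, ...)
    set.seed(42)
    segments <- kmodes(readers, modes = 4)   # ask for 4 customer segments

    segments$size                        # number of readers in each cluster
    segments$modes                       # the "typical" profile of each cluster
    readers$cluster <- segments$cluster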

iLab1 – Reflection on Graduate Attributes

Reflection on Graduate Attributes

ATTRIBUTE 1: Complex Systems Thinking

I have more than 10 years of experience analysing business processes within corporations, so I am quite comfortable with this attribute. Part of my work is to get an end-to-end view of a process. Business processes can go through different departments and a multitude of systems. I usually spend quite a lot of time mapping the process at different levels of complexity.

 

Managing a Data Science project isn't much different, except that it can become quite complex to map all the attributes of the data. So one topic I need to work on is data management and the strategy of creating value from data. I need to get a better understanding of the differences between a database, a data warehouse and a data lake.

ATTRIBUTE 2: Creative, Analytical and Rigorous Sense Making

I am OK with the analytical and rigorous sense-making sides of this attribute. I am less comfortable with the creative one. I am a very logical person and rely heavily on facts, so I am definitely not a very creative person 🙂 But I can facilitate workgroups such as brainstorming or kaizen sessions to help people explore new solutions and think outside the box. During the first semester, at the beginning of every project, I always started by organising a brainstorming session with my team to understand the problem we needed to solve and define our solution plan.

 

At work I tend to rely on data quite a lot to understand where a business process is "broken" or which areas need to be improved. I always challenge "gut feeling" assumptions, or at least I try to verify them with data before taking them for granted. Relying on existing data to improve a business process is, for me, the least risky and most efficient way to do it.

 

ATTRIBUTE 3: Create Value in Problem Solving and Inquiry

Problem solving is definitely my cup of tea. Analysing and breaking down issues, finding root causes, defining proper and adequate solutions, and monitoring and controlling the new state are part of my day-to-day activities.

 

I am quite comfortable with different frameworks such as the Waterfall, Lean Six Sigma and CRISP-DM methodologies. They all provide a clear end-to-end view of how to handle different types of projects, and they all highlight the importance of properly understanding the business requirements and making sure a project generates the expected value and benefits. From experience, the upstream part of a project is always the most critical. If you have scoped it properly and done the right analysis, the right solutions will come by themselves and it is just a matter of implementing them.

 

One of my strengths is also my flexibility to adapt to any change that occurs. In general a project never goes ahead without unplanned hiccups, and you need to be agile in order to react quickly when a situation changes or a new issue occurs, while still keeping in mind what the objectives are and how you will get there.

 

ATTRIBUTE 4: Persuasive and Robust Communication

This is probably the main attribute I need to keep working on, because English isn't my strongest language. For any kind of project it is extremely important to be understood properly. You may have the right answers, but if you cannot convince the other parties then all the work you did can be totally useless. I very much appreciated the feedback on the Self Quantified assignment during the first semester, as I wasn't fully paying attention to details such as graph titles or using a meaningful unit of measure. This is something I will keep working on.

On the other hand, I have no issue presenting in front of an audience. It does require a bit of preparation beforehand, but I am not at all afraid of being on the "front line".

ATTRIBUTE 5: Ethical Citizenship and Leadership

Thanks to the DSI class, I am more aware of all the implications related to data privacy. It is absolutely critical that every data scientist takes ownership of, and feels responsible for, handling any kind of data ethically. More than ever, data privacy should be one of the core human rights, and it needs to be taken into account at every step of a data science project.

 

Funnily enough, I wouldn't have rated myself as a strong leader, but the different teammates I have worked with all think it is actually my key strength. Maybe this is due to my experience in project management. Anyway, one key area I want to work on is the fact that I tend to take the lead not at the beginning of a project but a few steps later. For instance, at the forming stage of a group I will usually stay "behind" and observe first. So I really need to force myself to "step in" a bit quicker.

 

iLab1 – Key takeaways from each of the subjects completed

Key takeaways from each of the subjects completed

Data Science for Innovation (DSI)

English isn't my mother tongue, so I think my key takeaway from this subject is academic writing. I have been living in Australia for 5 years and I use English every day at work, but it is generally quite informal: I mainly write emails, prepare presentations or write project reports. This was the first time I had to write an essay in an academic style. Getting detailed feedback on the different assignments I submitted really helped me understand where my limits were, and I have already seen some improvement compared to a semester ago.

Data, Algorithms and Meaning (DAM)

In this subject I learned from day 1 that I was already capable of running some machine learning techniques by myself. I had used R at work for the 2 years prior to enrolling in MDSI, mainly for data wrangling and data visualisation. But my main takeaway is really the ability to interpret the output of any algorithm and also to "challenge" it. It is extremely important to be able to take a step back and really think about how the results link back to the business objective, but also about their potential negative impacts. If taken for granted too quickly, those results can dramatically change the lives of individuals.

Statistical Thinking for Data Science

I kind of knew it before, but this course confirmed that a strong statistical background is not optional for a data scientist; it is a must. Anyone can apply a machine learning algorithm or run a regression analysis in a few lines of code, but the ability to interpret the results and assess the performance of a model is definitely key for a proper data scientist. This course gave me an overview of some of the most useful techniques, but I felt I was still missing the basics and I really wanted to understand what happens "behind the scenes". For this reason I picked a stats-related elective for the second semester called Multivariate Data Analysis.

 

I am really looking forward to semester 2 🙂

DAM Assignment 3 – Speculative analysis of a particular data context

In recent years the practice of Data Science has been attracting more and more attention from corporations. They are interested in the amazing possibilities algorithms offer for finding patterns within complex, high-dimensional data sets. At a high level it sounds like a magic wand that can transform a pile of waste into gold in a few clicks and a few lines of code. The objective of data science is to help individuals or businesses gain insights and make better decisions. The important word that is often missed in the previous sentence is: help. Data Science algorithms will not make decisions on anybody's behalf. They are just tools used to analyse the increasing volume of data faster and more efficiently, but they still rely on the judgement of individuals.

Maybe in the future we will have access to automated machine learning techniques, but until then data science still requires human intervention and decisions. As proof we can look at CRISP-DM, the main data mining methodology used by data scientists. If Machine Learning algorithms were as good as some people think, there would be only 3 main steps in this methodology: data loading, modelling and deployment. You would just load the data, launch the algorithm and then deploy the model it defined. But this is not the case. There are 6 different steps within CRISP-DM, and most of them require human expertise. The first step requires data scientists to have a good understanding of the business requirements, the research questions, the expected outcomes and the existing constraints. The next step is data understanding, which requires a deep analysis of the data set and the information lying within it. These first 2 steps are probably the most important ones in a data mining project, as they help define the best strategy to adopt. Data scientists have to assess the strengths and limitations of the data and decide which tool (algorithm) will perform best for the research questions to be answered. The data then needs to be prepared in the next step so that it fits the chosen algorithm. Again, the data scientist has to make decisions about the data preparation strategy, such as how to handle missing values, what to do with outliers, or whether binning is required to reduce noise. All these concerns need to be addressed because algorithms cannot make such decisions by themselves, and these human interventions help reduce modelling bias.

But why is it so important to lower the risk of bias? The main reason is that algorithms cannot make judgement calls. They are not able to determine whether a model is right or wrong. Whatever data they receive as input, they will perform their computation and produce an output, no matter how good or bad that input is. In a sense an algorithm is neutral: it doesn't make assumptions and will not adapt its strategy. For example, Facebook was recently accused of favouring one party in the US presidential election. The top trending news feature was highlighting more topics in favour of the Democratic party than of the conservative side. It turns out that Democratic-leaning users tend to post more than the other side, so the algorithms were simply amplifying the bias already present in their input. Another recent example showed Google Ads discriminating against women in advertising for high-salary jobs. These 2 examples show how important it is to limit the risk of bias, but also how necessary it is to properly evaluate the outputs of an algorithm. In the CRISP-DM evaluation phase a model is assessed against business criteria, but this can be extended to include the end user's point of view. Having the best model for the business's research questions doesn't mean it will be well received once deployed. Therefore it is important to also evaluate how the final users will perceive it, based on criteria such as ethics and privacy.

One last point relates to the transparency of Machine Learning algorithms. In most cases they are treated as black-box systems: we only see the inputs and outputs, not how the algorithm behaves in between. Providing more transparency helps us better understand and identify bias, and therefore lowers the risk of making bad or wrong decisions. For instance, in the USA some judges rely on algorithms to assess the risk of a convicted defendant committing similar crimes in the near future. They rely on these algorithms as a black box to decide on extended jail time for a "risky" offender. But if those algorithms are biased and the judges simply follow what the algorithms say, they may make the wrong decision and keep someone in jail longer than necessary. It is critical to bring transparency, especially where algorithms can dramatically change the lives of individuals. Recently, researchers at Carnegie Mellon University developed a new tool for detecting bias in algorithms. It provides different types of inputs to an algorithm and assesses which of them affect its outputs the most. This can help identify situations such as the one encountered by Facebook before a model is deployed. The results of this kind of analysis need to be taken into account in the evaluation phase of CRISP-DM, on top of the model performance and algorithm transparency reports.

In this discussion we saw how algorithms can create biased predictive models and how this can have a significant impact on the lives of individuals. As data scientists we need to be more accountable for the results of algorithms. It is critical for data science practice not only to try to limit the risk of bias, but also to properly assess those risks from the end user's perspective and to be transparent about how those algorithms work.

REFERENCES

Schneier, B. 2016, ‘The risks — and benefits — of letting algorithms judge us’, CNN, viewed 24 June 2016, <http://edition.cnn.com/2016/01/06/opinions/schneier-china-social-scores/>.

Pyle, D. & San Jose, C. 2016, 'An executive's guide to machine learning', McKinsey & Company, viewed 24 June 2016, <http://www.mckinsey.com/industries/high-tech/our-insights/an-executives-guide-to-machine-learning>.

Hayle, B. 2016, ‘US presidential election: Facebook accused of anti-conservative bias’, The Australian, viewed 24 June 2016, <http://www.theaustralian.com.au/news/world/the-times/us-presidential-election-facebook-accused-of-anticonservative-bias/news-story/a36f7da8ab4e20b37538bd64c835385e>.

Carpenter, J. 2015, 'Google's algorithm shows prestigious job ads to men, but not to women. Here's why that should worry you', The Washington Post, viewed 24 June 2016, <https://www.washingtonpost.com/news/the-intersect/wp/2015/07/06/googles-algorithm-shows-prestigious-job-ads-to-men-but-not-to-women-heres-why-that-should-worry-you/>.

Angwin, J., Larson, J., Mattu, S. & Kirchner, L. 2016, 'Machine Bias', ProPublica, viewed 24 June 2016, <https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing>.

Datta A., Sen, S. & Zick Y. 2016, Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems, Carnegie Mellon University, Pittsburgh, USA.

DAM Portfolio – Summative Reflection

I have been through an incredible learning journey since I started the Master of Data Science and Innovation at UTS. It has only been a few weeks of teaching, but I feel like I have already learned a huge number of new skills. Ten years after graduating from my last degree, I chose to go back to school because I want to become a Data Scientist. My personal objective for this degree is to learn as much as I can. I don't want to just add another diploma to my resume; I really want to acquire valuable skills that will help me in my new career. So far I am on track with this objective. For this summative reflection I will go through several examples of what I learned in the Data, Algorithms and Meaning subject.

One of the main skills a Data Scientist is expected to have in their toolkit is Machine Learning. Since the start of the DAM subject I have learned 2 different Machine Learning algorithms: K Nearest Neighbours (KNN) and Artificial Neural Networks (ANN). Professor Siamak presented KNN during the first class and I tried to be an active listener, asking a few questions to get a better understanding of the technique. Then we had a hands-on practice with my group: together we designed a Knime workflow to process the Credit Scoring data set. All of us were novices with this tool and we tried to figure out together how to get to the result we wanted. It was a very participative exercise. I was the one driving the laptop, trying to follow all the questions and suggestions from my team members while designing the workflow at the same time. In the end I think it was quite effective, as every member was pushing in the same direction and trying to add value to the group. I had a few questions at the time that I left open, as we didn't have time to go through them. After the class I went through the workflow again by myself and tried to complete what was left at the data preparation stage (during the class we had cleaned the data manually in Excel). I spent the following 3 days familiarising myself with Knime and working towards the result I wanted. It was quite laborious as I didn't know much about Knime, but I kept trying different nodes to clean and transform the data, and did some research whenever I faced a difficult issue. I finally succeeded in getting to the final result. I was very happy and very proud of myself, because I didn't give up during the learning curve. From this experience I learned several things. On the technical side, I learned to run a Machine Learning technique, KNN, and familiarised myself with Knime. Regarding my soft skills, it helped me enhance my "hacking" and problem-solving skills. Finally, I discovered how important the Data Understanding and Data Preparation phases are.

ANN was a totally different approach, as it was an individual task. I watched the videos suggested by Professor Siamak about the 5 different Machine Learning techniques. At that point I didn't feel I really understood how an ANN works, so I had to do my own research. I started with some introductory materials I found on the Internet and progressively moved to more advanced documentation. After a few days of reading I decided to give it a go in Knime. I was quite surprised to see that I was able to run an ANN in less than half an hour. This experience confirmed what I had learned while going through KNN: Machine Learning isn't as complicated as I thought it was, and actually just requires a basic knowledge of how each technique works. Now I am looking forward to learning more techniques, so I have decided to learn a new algorithm every week and keep practising on the same data set. I think hands-on practice on a real data set is very important and helps to fast-track and consolidate the learning.

The next skill I learned in the last few weeks is the CRISP-DM methodology. This is particularly important, as most Data Science projects are managed through this approach. Professor Siamak gave us an overview of the 6 different phases. I was quite curious about it and wanted a better understanding, so I started to do some research and went through the user guide written by SPSS. While going through this guide I compared it with different methodologies I have seen during my career. It was interesting to find some similarities with the PDCA approach designed by Deming and the Lean Six Sigma DMAIC methodology. Putting it in perspective with other knowledge I have helped me understand some of the topics described by CRISP-DM more deeply. For example, I knew from the DMAIC model how critical the first phase is: clearly defining the scope of the project and identifying the key business requirements is key to its success. What I also noticed in the CRISP-DM model is the feedback loop from the Evaluation phase back to Business Understanding. This was quite surprising to me, as it means the model you defined can be wrong or rejected and you have to start again from the beginning. This raised my interest in the Evaluation phase and I started to ask myself how to evaluate the performance of a model.

This brings us to the last topic I learned during this first semester: Machine Learning evaluation. During the first class we used the accuracy percentage to assess the performance of our KNN algorithm. I remembered that I had seen, in one of the KNN outputs, a table with some statistics related to the model. I wasn't quite sure what they actually meant and I decided to have a look at it after the class. I did some research online to try to find out more about these measures. I found some slides that explained how the different measures were calculated, but I still didn't really understand how to interpret them. At the UTS library I found a book called Machine Learning with R with a specific chapter on evaluating models. Reading through this chapter helped me get a better understanding of each of these measurements, and I decided to go back to the Credit Scoring exercise to compare the 2 models I had built with Knime: KNN and ANN. I asked myself which measures were important for this specific data set. I tried to think as if I were part of the business and came to the conclusion that what mattered was how well the model truly predicts a delinquent, but also how it misclassifies: true delinquents classified as non-delinquent (increasing the risk of losing money) and non-delinquents classified as delinquent (losing sales). Therefore I chose precision and recall as the 2 main measures. The best model I built was with the ANN algorithm, with a precision of 70% and a recall of 65%. In a real environment those values would probably have been too low for the business and the model would have been rejected. This conclusion brought me back to the feedback loop I mentioned earlier. I understood how critical it is, at the Business Understanding phase, to clearly define and agree with the business on the measures that will be used to evaluate a model and on their rejection thresholds. This experience really helped me better understand what the CRISP-DM methodology is. Even though I haven't had full hands-on practice with it yet, I learned a lot by going through an example by myself and trying to imagine how the project would have run in a real environment.

As I said earlier, I am very happy with all the learning I have done since I started this Master. I have learned a lot of new techniques and gained some new skills. I personally spent a lot of time reading books and trying to find the answers to my questions by myself. All this personal work is starting to pay off, as I feel I am starting to put the different pieces of the Data Scientist puzzle together. Writing my portfolio helped me assess retrospectively what I really learned and understood. I have also improved my writing skills, which is quite important as English is not my mother tongue.

What I should probably improve is connecting more with my peers in order to learn from their experiences. I will try to follow some of their blog posts and engage more with my teammates first. I also need to plan my learning activities better by focusing on specific topics rather than trying to go through everything.

But after just 5 weeks I really think I am building the foundations of my new career as a Data Scientist.


RELATED BLOG POSTS


  • DAM Portfolio – K Nearest Neighbor (KNN)
  • DAM Portfolio – Data Preparation in Knime
  • DAM Portfolio – CRISP-DM
  • DAM Portfolio – Neural Network (ANN)
  • DAM Portfolio – Machine Learning Evaluation

REFERENCES


Wu X, Kumar V., 2009, The Top Ten Algorithms in Data Mining, Chapman and Hall/CRC

Gorunescu, F., 2011, Data Mining Concepts, Models and Techniques, Springer

Lantz B., 2013, Machine Learning with R, Packt Publishing

KD Nuggets, What main methodology are you using for your analytics, data mining, or data science projects? Poll, viewed October 2014, <http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html>

IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation

University of New South Wales, 2014, Evaluation in Machine Learning

Bakos G., 2013, Knime Essentials, Packt Publishing


DAM Portfolio – Machine Learning Evaluation


After learning 2 different Machine Learning techniques, it was time for me to go back to a question I had raised after the first DAM class: how do I evaluate the performance of different Machine Learning models?


WHAT IS IT ALL ABOUT?


The first metric we used to evaluate the predictive performance of a supervised Machine Learning algorithm was its accuracy. This percentage is calculated from the confusion matrix of the predictions made by the model.

A confusion matrix lists the counts for each type of prediction made by the model. For example, for a binary variable (Yes/No) the confusion matrix looks like this:

[Image: confusion matrix]

The rows are the actual values from the data set and the columns are the predicted values.

Accuracy is calculated from the instances where a correct prediction has been made (either Yes or No):

[Image: accuracy formula]

This is a good indicator of how a model performs, but it may not be sufficient when a model performs better on one side than the other (for instance, a model may predict the Yes cases better than the No cases). For those situations we have to look at the sensitivity (or true positive rate) and the specificity:

[Image: sensitivity and specificity formulas]

Depending on the business requirements, one of these 2 measures may be more important than the other. For example, in the case of a spam filter a model may have a sensitivity of 99% and a specificity of 97%, which means 3% of legitimate emails are incorrectly classified as spam. The business may have set a requirement of 99% specificity and therefore would reject this model.

There are also 2 other measures called Precision and Recall. They both focus on the positive predictions:

[Image: precision formula]

Precision assesses how often the model's positive predictions are correct. For instance, a search engine requires a high Precision value, as this means it is less likely to return unrelated results. Recall is actually the same as sensitivity. For a search engine, a high Recall value means it returns a high proportion of the related documents.

In general a model will have a trade-off between sensitivity and specificity but also between precision and recall.

The F-measure was defined to capture the trade-off between precision and recall and is used to compare several models. The model with the F-measure closest to 1 has the better performance.

[Image: F-measure formula]
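For reference, these are the standard definitions behind the formulas above, written in terms of the confusion matrix counts (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives):

    \[
    \begin{aligned}
    \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
    \text{Sensitivity (Recall)} &= \frac{TP}{TP + FN} \qquad \text{Specificity} = \frac{TN}{TN + FP} \\
    \text{Precision} &= \frac{TP}{TP + FP} \qquad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
    \end{aligned}
    \]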


HANDS-ON PRACTICE


I compared the 2 models I built for the Credit Scoring dataset (KNN and ANN) using the evaluation measures described above.

KNN Evaluation:

[Image: KNN evaluation statistics]

ANN Evaluation:

[Image: ANN evaluation statistics]

Looking at the F-measure for predicting credit delinquency, ANN (0.698) has a better performance than KNN (0.55).

Looking at precision and recall, the ANN performed much better on recall. This means it is better at finding real delinquents, but its value is still only 0.652 (about 35% of delinquent customers are classified as non-delinquent!).
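To double-check these numbers outside Knime, the same measures can be computed with a few lines of R. This is a minimal sketch, assuming hypothetical factor vectors actual and predicted with levels "Yes"/"No" for delinquency:

    # Confusion matrix: rows = actual values, columns = predicted values
    cm <- table(Actual = actual, Predicted = predicted)

    TP <- cm["Yes", "Yes"]; FN <- cm["Yes", "No"]
    FP <- cm["No",  "Yes"]; TN <- cm["No",  "No"]

    accuracy  <- (TP + TN) / sum(cm)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)          # same as sensitivity
    f_measure <- 2 * precision * recall / (precision + recall)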


REFLECTION 


What happened?

During the first class we learned to use the accuracy percentage to evaluate the performance of the KNN model, but at that time I also saw a table with different statistics for the model. I thought these values might provide more information about it.

What did I do?

I wrote down the question I had after the first class and parked it for a while. After going through the Neural Network activity I came back to it and started to do some research online about the measures I had found.

What did I think and feel about what I did? Why?

After the class, and after seeing this statistics table, I felt I was maybe missing something important. At that time I was focused on the different learning activities and left this point for later. It was only when I started to look at ANN that I remembered it, so I included it within my learning activities for the Neural Network algorithm.

What were the important elements of this experience and how do I know they are important? What have I learned?

First, I learned that there are multiple measures of the performance of Machine Learning models, not only the accuracy percentage. Secondly, I learned that depending on the business requirements, one of these measures may be more important than the others. So it is important to define the key measurements during the Business Understanding phase of a CRISP-DM project. This helps to better understand what the real expectations of the business are, and therefore helps in choosing a model according to its performance.

How does this learning relate to the literature and to industry standards?

In a CRISP-DM project there is a dedicated step for evaluating a model. If a model fails to meet the business requirements, the project has to go back to the first stage of the cycle, which has a dramatic impact on cost and time. Therefore it is highly recommended to define at the beginning of the project which measures are important, and also what their thresholds are.


REFERENCES


University of New South Wales, 2014, Evaluation in Machine Learning

Lantz B., 2013, Machine Learning with R, Packt Publishing

DAM Portfolio – Artificial Neural Network (ANN)

As part of the DAM learning activities I went through the Neural Network Machine Learning technique.


WHAT IS A NEURAL NETWORK?


Neural Networks are currently considered among the most advanced Machine Learning algorithms. They are used in a lot of Artificial Intelligence projects, such as AlphaGo (the system designed by Google that recently beat one of the world's top players at the game of Go).

A Neural Network, or Artificial Neural Network (ANN), is classified as a black-box algorithm, as it is quite difficult to interpret the model it builds. It was designed as a reflection of human brain activity. Its objective is to create a network of interconnected neurons, each of which sends a signal to its neighbours when it receives enough activation (based on a threshold).

An ANN has 3 main characteristics:

  • An activation function that triggers a neuron's broadcast to its neighbours. There are different types of activation functions depending on the type of data: binary, logistic, linear, Gaussian, etc.
  • A network architecture that specifies the number of internal neurons and internal layers. Adding more internal neurons and layers makes the model more complex; this can be necessary for tackling complex data sets, but it makes the model harder to interpret.
  • The training algorithm, which specifies the weights to be applied to every neuron connection.

A Neural Network is also defined by its input and output neurons. An input neuron is assigned to each feature of the data set, and an output neuron is assigned to each possible value of the outcome variable.

[Image: neural network diagram]
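Outside Knime, the same kind of single-hidden-layer network can be sketched in R with the nnet package. The train and test data frames and the Class outcome column are hypothetical names:

    library(nnet)

    # One hidden layer with 20 neurons, trained on a hypothetical normalised
    # credit scoring data frame 'train' whose outcome 'Class' is a factor
    set.seed(42)
    model <- nnet(Class ~ ., data = train, size = 20, maxit = 200, decay = 5e-4)

    # Predicted classes on a hypothetical normalised test set
    pred <- predict(model, newdata = test, type = "class")
    table(Actual = test$Class, Predicted = pred)

Note that nnet only supports a single hidden layer, so the multi-layer experiments described below still need Knime or another package.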


HANDS-ON PRACTICE


After going through the videos from the learning activities and doing some research on this algorithm, I tried to run an ANN on the same data set as for K Nearest Neighbours (credit scoring).

In Knime the node for running an ANN is called Multilayer Perceptron Predictor. It requires 2 inputs: a trained model and a normalised test set. The model is trained by a node called RProp MLP Learner, which takes the normalised training set as input.

[Image: Knime ANN workflow]

I ran several ANNs with different numbers of internal layers and neurons and compared the models afterwards:

  • 1 internal layer and 10 internal neurons
  • 1 internal layer and 20 internal neurons
  • 3 internal layers and 10 internal neurons
  • 3 internal layers and 20 internal neurons
  • 5 internal layers and 10 internal neurons
  • 5 internal layers and 20 internal neurons

Before looking at the accuracy of these different models, I thought that having more internal layers would significantly improve performance. For this particular data set I was wrong: the models with 5 layers were less accurate than the 1-layer one. From the different tests I did, the best model is the one with only 1 internal layer and 20 internal neurons. Compared to the first model, accuracy increased by 2%, which is remarkable given that the only thing I did was change some settings.

[Image: accuracy of the ANN with 1 layer and 20 neurons]


REFLECTION 


What happened?

For this topic the learning process was different from the KNN one: here I had to learn the Neural Network algorithm by myself.

What did I do?

I had to do some research on this technique and find different types of material. First I looked for introductory documents that explain what it is about in simple terms. Then, when I had a better understanding, I started to look at more advanced books that go into deeper detail. I found one about Machine Learning in R. Even though I knew R isn't the main tool recommended for this subject, I decided to read it anyway as I am quite familiar with R. It helped me understand how to run an ANN step by step, and I was easily able to recreate the same workflow in Knime afterwards.

What did I think and feel about what I did? Why?

I picked Neural Networks as I knew it is currently one of the most advanced Machine Learning techniques.

I had been quite surprised by how simple and easy it was to run a K Nearest Neighbours model during the first class. As I said in my previous post, I didn't feel capable of running any Machine Learning algorithm before that class. So when I got the choice, I picked one of the hardest to see whether I would go through a similar experience, even if it was more challenging.

Even though the theory behind ANN is quite complex, the actual level of knowledge required to run one is again much lower than I would have imagined at the beginning. It was another big surprise for me.

What were the important elements of this experience and how do I know they are important? What have I learned?

I am really amazed that in such a short time I have been able to run some very complex Machine Learning algorithms. This experience confirmed what I discovered during my first try with KNN: Machine Learning isn't very complicated at a practical level. It requires a good understanding of the key concepts of each technique, but you don't need to understand all the theory behind it. So my key learning from these past few weeks is that I am actually already smart enough to start my journey into the Machine Learning field.

How does this learning relate to the literature and to industry standards?

Neural Networks are one of the "hottest" techniques at the moment. Every time I hear about an innovative Artificial Intelligence project, I hear about Neural Networks or Deep Learning. I think in the coming years there will be more and more applications of this algorithm in Data Science projects.


REFERENCES


Gorunescu, F., 2011, Data Mining Concepts, Models and Techniques, Springer

Lantz B., 2013, Machine Learning with R, Packt Publishing


DAM Portfolio – CRISP-DM


 

During the first DAM class, Professor Siamak walked us through the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology, which is widely used for Data Science projects.


WHAT IS CRISP-DM?


CRISP-DM is a methodology for managing Data Mining projects. It was conceived in the 90s by 5 different companies (SPSS, Teradata, Daimler AG, NCR Corporation and OHRA). According to various polls of data scientists across different industries, it is currently the most widely used process for Data Mining projects.

It breaks down a project into 6 different phases:

  • Business Understanding: this step focuses on understanding the requirements of the business and the problems and questions they want to answer, and on defining a project plan to address them.
  • Data Understanding: this phase is about collecting the data and performing a first level of analysis of the data sets through a descriptive analysis of the different variables.
  • Data Preparation: this is when we clean, transform, merge and enhance the data set for the next phase.
  • Modelling: this is the step where we apply statistical or Machine Learning techniques to define the most appropriate model for the project.
  • Evaluation: after defining the model we have to assess its performance and its ability to generalise its learning.
  • Deployment: the final step is about implementing the model in a live environment and maintaining it. It can also be the finalisation of the report requested by the business.

[Image: CRISP-DM process diagram]

Understanding these different steps is pretty straightforward, but I personally think the important part of this methodology is the feedback loops. At almost every stage you can go back to previous steps based on what you have learned. It is not a sequential V-model as we usually see in IT projects; it is more agile and iterative. It reminds me of the PDCA model designed by Deming, where you iterate the same approach several times in order to solve a problem: you plan your actions (how am I going to learn about the problem?), you do the actions (you perform the tasks you defined), you check the results (you analyse them), you act (you reflect on the learnings), and then you start again if required (what I learned helps me better understand the situation, but I need to dive deeper and learn more).

[Image: PDCA multi-loop diagram]

Another interesting part of the CRISP-DM methodology is the user guide section, which details the different tasks you have to perform for a data mining project, the associated risks for each phase and the different possible outputs.

[Images: CRISP-DM user guide excerpts]


HANDS-ON PRACTICE


I haven't really applied the full methodology in a project yet. But throughout my career I have learned and applied other kinds of methodologies such as the V-Model, Agile, PDCA and DMAIC.

CRISP-DM shares a lot of similarities with the last of these. Like DMAIC, CRISP-DM emphasises the importance of the first step: understanding the business requirements. Both methodologies recommend spending a fair bit of time properly defining the scope of the project before starting to work on it. In these kinds of complex projects (process improvement or data mining) it is crucial to challenge the business's understanding of the situation. The risk is that they state a very broad view of what they want and push to start the project as soon as possible. This can lead the project in the wrong direction, or force changes of direction in the middle of the project. A common technique used in DMAIC is the 5 Whys, where you ask "why" 5 times in order to really get to the bottom of the question.

The DMAIC Measure phase is quite similar to the Data Understanding and Data Preparation phases of CRISP-DM. The difference is that DMAIC focuses on defining a very detailed measurement plan (mainly because most projects require collecting new measurements) while CRISP-DM focuses on the “quality” of the data set (treating missing values, outliers…).

The remaining phases of DMAIC and CRISP-DM then differ quite a lot, as they are very specific to their respective subjects: process improvement and data mining.

[Figure: Lean Six Sigma DMAIC road map]


REFLECTION 


What happened?

After the brief introduction of this methodology in class I did my own research in order to better understand what CRISP-DM is about.

What did I do?

I read the detailed description of the CRISP-DM methodology by SPSS.

What did I think and feel about what I did? Why?

During the class we had a high-level view of this methodology. I wanted to take a deeper dive into it and compare it with other methodologies I have used during my career.

What were the important elements of this experience and how do I know they are important? What have I learned?

As expected, this methodology has a lot of detail and it requires some practice before you really understand how deep it goes. This is the same as with any methodology: while you read it, it seems logical and pretty straightforward, but you only realise its true meaning once you have faced the situation in a project. So I decided to apply this methodology as much as possible in the coming assignments.

How does this learning relate to the literature and to industry standards?

CRISP-DM is the main methodology used in Data Mining projects, so it is quite important to have a good understanding of it. It provides recommendations and best practices that may be valuable for the upcoming assignments and projects I will have to manage in the future.


REFERENCES


KD Nuggets, What main methodology are you using for your analytics, data mining, or data science projects? Poll, viewed October 2014, <http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html>

IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation

 

Evidence URL:

http://www-staff.it.uts.edu.au/~paulk/teaching/dmkdd/ass2/readings/methodology/CRISPWP-0800.pdf

DAM Portfolio – Data preparation in Knime

Reflection Post/s:

Graduate Attribute/s:

Skills:

Evidence:

I had never used Knime prior to the first DAM class. I had heard about it but never took the time to give it a go, as I thought that whatever I could do in Knime I could do directly in R. I didn’t see the added value in learning this new tool.


WHAT IS IT ALL ABOUT?


Knime is an open-source data analytics tool. It is designed for building data analysis workflows through its visual interface, without any coding skills. It is extremely easy to use: you just select the nodes you want (each performing a specific task) and link them together until you get the final expected output.

Knime has an impressive library of nodes for Data Mining, Machine Learning and Data Visualisation tasks.

[Figure: Knime]

Once you build a workflow you can easily re-use it for another project or data set. You can also visually follow step by step how your workflow works.


HANDS-ON PRACTICE


During the first DAM class, we had to use Knime to run a K Nearest Neighbour (KNN) algorithm. There was a competition and we only had 45 minutes to come up with the most accurate model possible. Within our group we started using Knime to prepare the data set but, due to the lack of time, we quickly decided to perform all the cleaning, binning and filtering steps directly in Excel and focus only on implementing our KNN algorithm.
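For readers who prefer code to a visual workflow, here is a minimal sketch in Python/scikit-learn of the same kind of pipeline: a bit of filtering and binning followed by a KNN classifier. The file name, column names and binning thresholds are invented for illustration; the actual exercise was done in Knime and Excel.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("class_exercise.csv")                  # hypothetical data set
df = df.dropna(subset=["target"])                       # filtering: drop unlabelled rows
df["age_band"] = pd.cut(df["age"],                      # binning a numeric variable
                        bins=[0, 18, 40, 65, 120], labels=False)

X = df[["age_band", "speed_zone", "hour_of_day"]].fillna(0)   # simple fallback for missing values
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# KNN is distance based, so scaling the features usually matters
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
```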

After the class I decided to go through the exercise again, but this time I tried to perform all the tasks within Knime. I did this not only because it is the recommended tool for this subject but also because I was quite curious about the level of automation this tool can bring.

The first difficulty I encountered was finding the most appropriate nodes to perform the tasks I wanted. There were so many different ones, and for some of them I couldn’t tell what the differences were. I did a bit of research online to see if there was any documentation for this tool. To my big surprise there wasn’t much material about Knime on the Internet. The most useful resource I found was the forum on the Knime website, where people post questions and get answers directly from the user community.

So I decided to start building the data preparation workflow using this forum as a guide. What I found is that even if it is very easy to build your workflow, it starts to get a bit messy as you keep adding nodes one after the other. So I broke down the different tasks I wanted to perform on this data set into chunks. For each chunk I defined what the expected output was. This helped me focus on specific tasks first and confirm they produced the right result before moving on to the next chunk. In the end I came up with 3 different chunks, roughly along the lines of the sketch below.
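The same chunking idea translates naturally to code. A minimal sketch, with invented chunk names, column names and checks, might look like this:

```python
import pandas as pd

def chunk_1_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Chunk 1: remove duplicates and obviously invalid rows."""
    out = df.drop_duplicates()
    out = out[out["age"].between(0, 120)]
    assert len(out) > 0, "Chunk 1 should not empty the data set"
    return out

def chunk_2_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Chunk 2: type conversions and derived variables."""
    out = df.copy()
    out["crash_date"] = pd.to_datetime(out["crash_date"])
    out["hour_of_day"] = out["crash_date"].dt.hour
    return out

def chunk_3_impute(df: pd.DataFrame) -> pd.DataFrame:
    """Chunk 3: impute the remaining missing values."""
    out = df.copy()
    out["speed_zone"] = out["speed_zone"].fillna(out["speed_zone"].median())
    assert out["speed_zone"].isna().sum() == 0, "Chunk 3 should leave no missing speed_zone"
    return out

# Each chunk has a clearly defined expected output, checked before moving on
df = pd.read_csv("class_exercise.csv")   # hypothetical input
df = chunk_3_impute(chunk_2_transform(chunk_1_clean(df)))
```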

[Figure: data preparation models]

The last one was by far the hardest. I wanted to impute missing values depending on the values of the other variables. Knime has not yet come up with an efficient way to tackle this kind of task easily, so you have to create your own workflow within the main workflow to get the results you want. As I was struggling and couldn’t find much help on the Knime forum, I decided to look for user guides or books about Knime. I found one written by Gabor Bakos that describes every node from the main package and provides practical examples for each of them. This was very helpful: even if I still needed to experiment with several options before finding the right solution, it did help me narrow down the list of nodes that might be relevant to my workflow.

[Figure: data preparation]
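For comparison, the same kind of conditional imputation is fairly compact in code. The grouping column and imputed column below are hypothetical; the point is simply that missing values can be filled in based on the values of other variables.

```python
import pandas as pd

df = pd.read_csv("class_exercise.csv")   # hypothetical data set

# Fill missing speed_zone values with the median speed_zone of rows
# that share the same road_type (imputation conditional on another variable)
df["speed_zone"] = (
    df.groupby("road_type")["speed_zone"]
      .transform(lambda s: s.fillna(s.median()))
)

# Any group that was entirely missing falls back to the overall median
df["speed_zone"] = df["speed_zone"].fillna(df["speed_zone"].median())
```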

In the end I was very happy that I succeeded in getting the results I wanted, i.e. the exact same output as the Excel file we came up with during the DAM challenge.


REFLECTION 


What happened?

During the first class, as a group we decided to focus on implementing the KNN algorithm and left aside all the data preparation steps that we were supposed to perform within Knime.

What did I do?

After the class I went through the CRISP-DM methodology, which emphasizes the importance of the data preparation phase, so I decided to personally go through the exercise again, this time using Knime only.

I used the different materials I found on the Internet to help me with this task, but reading through the entire book Knime Essentials was the most helpful part.

What did I think and feel about what I did? Why?

In the end I was quite happy to have succeeded in getting the results I wanted. I learned how to use Knime and how to implement a data analysis workflow with this tool.

But at the same time I saw what its limitations were. Its main strength is how easy it is to design a workflow, just by dragging and dropping the nodes you want, but it gets quite messy very quickly as the number of nodes increases. It requires some preliminary work to define the different steps of your workflow, and documenting it is quite useful, especially if the model is complex. Creating chunks and adding notes does help to bring more clarity.

What were the important elements of this experience and how do I know they are important? What have I learned?

Apart from learning how to use Knime, I discovered the importance of the 2 CRISP-DM phases related to data munging: Data Understanding and Data Preparation. Running a Machine Learning algorithm is actually pretty straightforward and doesn’t require a lot of time, but all the data preparation is much more time consuming. Now I understand why data scientists can spend 80% of a project on these 2 steps alone.

These phases are also extremely important as they can dramatically increase the performance of your model.

How does this learning relate to the literature and to industry standards?

As I stated earlier, it is very important to spend a fair amount of time getting a clean, good-quality data set prior to the modelling phase. But it is also important to highlight that using Knime helps to define a workflow that is repeatable and reproducible. This is a crucial point for any data mining project. It is particularly true in the Deployment phase of CRISP-DM, when you need to present your results to the business and explain how you achieved them, but also if they want to deploy the model to a bigger data set or to other areas.

But Knime cannot capture everything, so it is still highly recommended to document all the steps as you go through the project.


REFERENCES


Bakos G., 2013, Knime Essentials, Packt Publishing