Call of Duty – Pedestrian Safety Hackathon 2016

 

This may come as a surprise to people who know me well: "Anthony actually wrote a blog post?! And posted it just 3 days after the end of the event?! No way!!!"

Yeah! Writing blog posts is not usually at the top of my to-do list. For instance, I still haven't drafted anything about winning the Unearthed Hackathon, and that was a month ago. So you may be wondering why I wrote this one so quickly. If you are thinking I won another hackathon, you are wrong. But something important enough happened to trigger my sudden passion for blogging.

Let's start at the beginning. Last weekend our usual hackathon crew participated in the Pedestrian Safety 2016 challenge organised by the NSW Data Analytics Centre (DAC) at the University of Sydney. Our team, DataCake, was composed of:

  • William Azevedo (aka “the-nicest-guy-in-the-world”)
  • Pedro Fernandez (aka “everything-is-awesome”)
  • William So (aka “the-big-data-guy”)
  • Anthony So (me aka “pinky-or-the-brain-we-still-do-not-know-which-one-he-is”)

I would also like to mention that Passiona Cottee was planning to come but couldn't make it for personal reasons.

The hackathon

What was this hackathon? The main objective was to find different ways to tackle the critical problem of pedestrian safety in NSW. DAC did a fantastic job collecting, preparing and cleaning the different data sets for this challenge. Yeah! I know. They spoon-fed us. In other hackathons we usually receive a dump of totally unstructured data, or there is no data at all. So we were thinking "Cool. No data munging! Easy peasy! We just need to run different Machine Learning algorithms, pick the best performer and that is it!".

We (as well as the other teams) quickly realised that it wouldn't be that easy. The first issue we encountered was getting access to the data, and it was quite laborious. Due to data privacy concerns, the data sets were put in Microsoft Azure Blob storage and shared only with a specific Virtual Machine (VM) set up just for this hackathon. No one in our team was familiar with Azure, so when the organisers told us "you have to set up a VM then copy the Blob with AzCopy into your own Blob" it was like they were speaking Finnish to us. We had no clue what was expected of us, but we gave it a go and, with persistence, succeeded after a few hours. I think only one other team managed it. Actually, a lot of teams dropped out of the competition because of this first hurdle.

This is just one example of the issues we ran into during the weekend and, trust me, there were a lot. But let's move straight to the result of the hackathon. There were 2 rounds of presentations: every team gave a 2-minute pitch in the first round, and only the 8 finalists gave a 5-minute presentation in the final round. We were not among the finalists. We were quite disappointed, just like all the other teams who had worked very hard during the weekend, but… after announcing the list of 8 teams, the judges made a strange announcement: "… We would also like to invite team DataCake to stay for the second round and present their findings."

As you can imagine, we were very puzzled. We had not been short-listed for the final round, yet we were still asked to do the final presentation. Most teams in this hackathon presented very innovative, out-of-the-box solutions such as a rewards app for pedestrians, intelligent street sensors or smart lights. But we were the only team who really tried hard to use the full potential of the data sets provided to answer questions that had not been solved yet, such as: Why was there an increase in injuries and fatal accidents in 2015 compared to previous years? How can we reduce these numbers?

During the whole weekend we really forced ourselves to go deep and ask "Why is this happening? Why is this happening? Why is this happening?" every time we found an interesting pattern. We really wanted to understand the true root causes of those accidents. We didn't want to stay at a descriptive level. We knew the answers were behavioural. We knew there were multiple problems that therefore required different answers and solutions. We used different techniques to get there: machine learning, statistics, data visualisation. It didn't matter which one we used; the only important point was how to get to the answers to those questions.

For instance, we built a classification model on the severity of accidents involving children, but we didn't use it to make predictions. We used it to identify the important (and unimportant) features for those cases. We found that some of the variables related to the environment (Primary_hazardous_feature, Surface_condition, Weather…) and to the drivers (Fatigue_involved_in_crash…) were not important. This was a good indication that those accidents are mostly related directly to the behaviour of the children. So we kept diving further and found 3 postcodes with a higher number of accidents than the others. We focused on those 3 areas and kept going deeper and deeper. Here are some examples showing how deep we went to find answers:

[Images: case 1, case 2 and case 3, showing our drill-down analyses]
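For those curious about the mechanics, here is a minimal sketch in R of the feature-importance step I described above. The file name, the outcome column (Degree_of_crash) and the choice of a random forest are illustrative assumptions; only the environment and driver variables are the ones actually named in the data set.

```r
# Minimal sketch: fit a classifier on crash severity, then look only at the
# variable-importance ranking instead of the predictions.
library(randomForest)

# Hypothetical extract of the crash data, filtered to accidents involving children
crashes <- read.csv("child_pedestrian_crashes.csv", stringsAsFactors = TRUE)

set.seed(2016)
rf <- randomForest(
  Degree_of_crash ~ Primary_hazardous_feature + Surface_condition +
    Weather + Fatigue_involved_in_crash,
  data = crashes,
  importance = TRUE,
  na.action = na.omit
)

# A low importance score for the environment and driver variables is what
# pointed us towards the behaviour of the children instead.
importance(rf)
varImpPlot(rf)
```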

As you can see, we didn't come up with very innovative solutions, but we were totally, absolutely, 100% focused on finding the answers because we really wanted to save people's lives. We just used analytical techniques to help us find where we needed to focus and where we needed to investigate further, until we understood the reasons behind the accidents.

This is probably the reason why we didn't end up in the final 8: we didn't bring any new idea or concept, and we probably didn't match the judging criteria, which we presumed were more focused on innovation. Having said that, the judges were still very interested in our analysis and wanted to hear what we had found. I am extremely proud of what we accomplished as a team. We tackled every single issue we found on our road and kept moving forward without fear and with the same commitment until we reached our objective: saving people's lives. This is one of the reasons why I wrote this blog post. The second one is coming.

The experience

This experience was really mind-blowing for me. As I said, I was already proud of the job we did. But while I was listening to the presentations of the other teams, I profoundly realised that we had changed.

The following part may sound a bit critical, but criticism is absolutely not the main point; it is more an explanation of why I started to reflect on our team and on myself. You know me: I am not the kind of person who will easily criticise someone.

During this weekend every team did a really amazing job with the different skills they had at hand. It was definitely a tough challenge to work on. But there were a few points that bothered me as a data scientist:

  • A lot of the teams identified the Sydney CBD as the place with the highest number of accidents from 2000 to 2015 and jumped straight to a solution. Some recommended running field tests on George Street with IoT (Internet of Things) solutions. There were 2 things that made me jump:
    1. Unfortunately they didn't put their findings back into context. We are in 2016! George St has already changed since the last data point. The light rail extension works started a few months ago, and the only traffic left is on the cross streets. [SENSE MAKING]
    2. They didn't push their analysis further. We saw an increase in injuries in the CBD in 2015 and wanted to understand why. We found there were more accidents on Thursday evenings. That sounds simple, but in order to find it we transformed the date and time variables and assigned each record back to the corresponding evening. For instance, Thursday 6pm to midnight and Friday 12am to 6am were both assigned to the same Thursday evening (see the sketch after this list). Simple but extremely effective! Then we started to list different hypotheses and test them, and we finally found a strong correlation between the increase in accidents in the CBD on Thursday evenings and the growth of retail turnover, but also with the increase of women in the labour force. To do so we went to the Australian Bureau of Statistics (ABS) to find the relevant data sets and merged them into our analysis. Judging by the reaction of the organisers, this is something they didn't know. [COMMITMENT]
  • Some of the teams came up with a predictive model with an accuracy over 80% for classifying the severity of an accident. What they didn't realise was that the data set contains some variables highly correlated with their outcome variable: they predicted that an accident was classified as "killed" based on another variable that gave the number of people killed in that accident. They didn't take time for the Data Understanding phase; they just put everything into an algorithm and reported the results. On our side, we looked at every single variable and kept only the meaningful ones for our analysis. We ended up with a very low level of accuracy. Tossing a coin would have given you a better chance of classifying the severity correctly than our model. But it didn't matter. We simply concluded that the data set hadn't captured the really important features. We didn't manipulate our model in order to tell a story that suited us. [INTEGRITY]
  • One of the teams highlighted that the biggest group involved in fatal accidents was elderly people, but that after normalisation this group didn't show any difference from the other groups, so they focused on other problems. Data science practitioners do play a lot with data and numbers, but it is absolutely our responsibility to understand their meaning. The numbers we were talking about were real people who DIED on the roads. Every single count was critical. Reducing that number even by one has an enormous impact: one person has been saved! And this is exactly what we tried to do during the entire weekend: save pedestrian lives now, not in the future, because with every minute that passes, potentially one more person could be injured or killed on the road. For us it was absolute nonsense to say that after normalisation those numbers were not important anymore. [ETHICS]
  • Finally, the last thing that bothered me was that a lot of teams recommended an app to change the behaviour of pedestrians and drivers. The group most often killed in road accidents is people over 60 years old. The number of elderly pedestrians killed had been trending down for years, but it jumped last year and the organisers told us they still don't know why. We can reasonably assume this group generally follows the rules: they probably won't jaywalk and will cross the road where it is safe. Obviously something changed that drove this increase. They probably changed their behaviour in 2015, but it is almost certain they were forced to; this is not a group that will decide one day to change its behaviour for no reason. Looking at the data, we found they were killed in high-density areas and quite often on big main roads. These areas are where some public offices or agencies are located, and those are probably the places these people wanted to go. Unfortunately we didn't have time to finalise this analysis, but our main hypothesis is that this group has had to walk more since the big bus timetable change or the merger of Medicare centres in 2015. What we are sure of is that no app will have a significant impact on decreasing their risk of a fatal accident. A better solution might be to change bus timetables or routes, or to relocate some public offices next to a train station or bus stop, to lower the probability of having to cross a main road. [PROBLEM SOLVING]
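To make the evening re-assignment from the second point above a bit more concrete, here is a minimal sketch in R. The column names, date formats and the use of lubridate are assumptions for illustration, not necessarily what we ran on the night.

```r
# Minimal sketch: bucket crashes into "evenings" so that Thursday 6pm-midnight
# and Friday 12am-6am both count towards the same Thursday evening.
library(lubridate)

# Parse the crash timestamp (assumed column names and formats)
crashes$crash_time <- dmy_hm(paste(crashes$Date, crashes$Time),
                             tz = "Australia/Sydney")

# Keep only the evening/night records: 6pm to midnight and midnight to 6am
hrs     <- hour(crashes$crash_time)
evening <- crashes[hrs >= 18 | hrs < 6, ]

# A crash between midnight and 6am belongs to the previous day's evening
evening$evening_date <- as.Date(evening$crash_time, tz = "Australia/Sydney")
late_night <- hour(evening$crash_time) < 6
evening$evening_date[late_night] <- evening$evening_date[late_night] - 1

evening$evening_day <- weekdays(evening$evening_date)
table(evening$evening_day)   # this kind of count is what surfaced the Thursday-evening spike
```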

As I said earlier, the whole point was not to criticise the work of the other teams (I know it is probably still not that obvious). The reason I talked about this is that just a few months ago we would have made exactly the same mistakes, or we would have given up if the issues we faced were too challenging. Listening to those mistakes made me realise where we are in our Ithaca journey. If you are a student of the UTS Master of Data Science and Innovation (MDSI) you should know what this means; otherwise you can still google it. Without realising it, we have travelled a very long distance. And not only that, we fully embraced some key skills: the ones I listed in brackets. During this hackathon it was absolutely natural for us to make sense of our findings and to commit ourselves to solving the problems in an ethical, responsible and honest way. We all learned about these concepts during the first semester of MDSI, but this is the first time I realised that I didn't have to go through a checklist at the end of the project and promise not to forget about data privacy and ethics next time. No, this time everything flowed together, not as constraints but as drivers of our approach.

Final words

In this hackathon we didn't win any prize, but I am extremely proud of what Pedro, Will, William and I achieved and of the way we did it. In a lifetime there are few occasions when you feel you have changed and become another person, and this experience is definitely one of them for me. Personally it counts much more than winning Unearthed, without comparison! This is really the moment where I feel we became talented, ethically-minded and responsible data scientists. I remember that on my first day at MDSI (just 6 months ago) one of the guest lecturers said we would get the sexiest job in the world, but that great power comes with great responsibility. All this time we were mainly focused on getting more and more power, but this weekend showed us we had switched to the second part of that sentence. We moved to the superheroes' side 🙂

I really want to thank the NSW Data Analytics Centre again for this life-changing experience. I really hope you will look at our findings and use them as a starting point. I know it wasn't, and still isn't, an easy task to get access to the data from the different official bodies, but you can show them our analysis of fatal accidents involving elderly people, and I hope this will make them realise that by sharing their data, lives can be saved. Please keep pushing to fast-track data-driven policy making.

I also want to thank all the teachers, lecturers and assistants from the Master of Data Science and Innovation (MDSI) at UTS. You helped us become what we are now. I know we are not finished with the Master yet (especially me), but you have already done a fantastic job. You did more than show us the right direction; you forced us to learn to get on the right track by ourselves, and this will without any doubt shape our future careers as data scientists.

A big, big, big thank you to my teammates: Pedro, Will and William!!! Thanks for making this challenge so easy and so delightful! Thanks for keeping faith in our approach and pushing it as far as we could during this weekend! We definitely achieved our goal!

Thanks for reading to the end of this post, and I hope you already have experienced, or soon will, the same amazing journey as we did.

Viva DAC !!!

Viva UTS MDSI !!!

Viva team DataCake !!!

[Image: team DataCake]

Anthony So

PS: Don't expect any blog post about my experience at Unearthed. It is just not going to happen 🙂

PPS: Sorry, Pedro, for not showing your slide during the presentation. Here it is:

[Image: Pedro's slide]

iLab1 – Personal Learning Objectives


My iLab subject is an internal project at my company, Fairfax Media.

 

Project Problem Statement and Description:

Fairfax Media publications are split into 2 main categories: Metropolitan and Community mastheads. The way the audiences of these two categories consume content may be very different, or on the contrary may show similarities for some specific types of content. The objective of this project is to provide a better understanding of the news consumption behaviour of these 2 groups.

 

My 3 personal learning objectives for this iLab are:

Web Analytics

  • I am very keen to understand more about how solutions such as SiteCatalyst or Google Analytics track how online users use websites. It will be interesting to see what kind of information is automatically recorded and what the key measures are in this area. Even if this part isn't really in the scope of the problem statement, the way a website is designed may have a significant impact on how it is used by the end user.

How to Profile Customer Behaviours

  • I am very interested in learning techniques to understand customer behaviour based on their actions. This is mainly related to association rule mining algorithms; we haven't covered any so far, so I really want to learn more about them (see the sketch after this list). I have always heard about the example of a data science project that found the correlation between beer and nappy purchases. This example is actually one of the reasons why I wanted to learn more about Data Science, so I am really looking forward to a deep dive into this area.
  • I recently read some articles about graph analysis that may also fit the purpose of this project. I am not sure yet whether it is relevant, but if I have time, that is an area where I want to learn a bit more.
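As a placeholder for what I hope to learn, here is a rough, hedged sketch of the classic Apriori approach using the arules package in R. The file name, columns and thresholds are entirely hypothetical.

```r
# Hedged sketch: mine association rules from reading sessions, where each
# session is a "basket" and the articles read are the "items".
library(arules)

sessions <- read.transactions("article_views.csv",
                              format = "single",
                              sep    = ",",
                              header = TRUE,
                              cols   = c("session_id", "article_id"))

rules <- apriori(sessions,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2))

# "Beer and nappies"-style findings would show up as rules with high lift
inspect(head(sort(rules, by = "lift"), 10))
```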

Clustering and Market Segmentation

  • Finally, the last thing I expect to learn from this iLab is how to group similar types of customers. I have already used some clustering algorithms, but they were all for continuous data. Depending on the data set I get, I may need to find algorithms that can handle count or categorical variables (see the sketch below). As a second step, I would also like to see how those results relate (or not) to the current marketing strategy of my company. There are no strong expectations internally for this project at the moment, so it is more of a research exercise, but I would still like to know whether it can create value for the company.
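To illustrate the kind of technique I have in mind for mixed data types, here is a tiny sketch in R using Gower distance with k-medoids (PAM); the data frame is completely made up.

```r
# Hedged sketch: cluster readers described by a mix of categorical and count
# variables, which plain k-means cannot handle directly.
library(cluster)

readers <- data.frame(
  masthead      = factor(c("metro", "community", "metro", "community", "metro")),
  articles_read = c(12, 3, 25, 7, 18),               # count variable
  subscriber    = factor(c("yes", "no", "yes", "no", "yes"))
)

# Gower distance copes with factors and numeric columns at the same time
d <- daisy(readers, metric = "gower")

# k-medoids (PAM) works directly on the dissimilarity matrix
fit <- pam(d, k = 2)
fit$clustering
```

The interesting part will then be checking whether the resulting groups map back to the Metropolitan/Community split or cut across it.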

iLab1 – Reflection on Graduate Attributes


ATTRIBUTE 1: Complex Systems Thinking

I have more than 10 years of experience analysing business processes within corporations, so I am quite comfortable with this attribute. Part of my work is to get an end-to-end view of a process. Business processes can go through different departments and a multitude of systems. I usually spend quite a lot of time mapping a process at different levels of complexity.

 

Managing a data science project isn't much different, except that it can become quite complex to map all the attributes of the data. So one topic I need to work on is data management and the strategy for creating value from data. I need to get a better understanding of the differences between a database, a data warehouse and a data lake.

ATTRIBUTE 2: Creative, Analytical and Rigorous Sense Making

I am OK with the analytical and rigorous sense-making sides of this attribute. I am less comfortable with the creative one. I am a very logical person and rely heavily on facts, so I am definitely not a very creative person 🙂 But I can facilitate workgroups such as brainstorming or kaizen sessions to help people explore new solutions and think outside the box. During the first semester, at the beginning of any project, I always started by organising a brainstorming session with my team to understand the problem we needed to solve and define our solution plan.

 

At work I tend to rely on data quite a lot to understand where a business process is "broken" or which areas need to be improved. I always challenge "gut feeling" assumptions, or at least I try to verify them with data before taking them for granted. Relying on existing data to improve a business process is, for me, the least risky and most efficient way to do it.

 

ATTRIBUTE 3: Create Value in Problem Solving and Inquiry

Problem solving is definitely my cup of tea. Analysing and breaking down issues, finding root causes, defining proper and adequate solutions, and monitoring and controlling the new state are part of my day-to-day activities.

 

I am quite comfortable with the different frameworks such as Waterfall, Lean Six Sigma and CRISP-DM. They all provide a clear end-to-end view of how to handle different types of projects. All of them highlight the importance of properly understanding the business requirements and making sure a project generates the expected value and benefits. From experience, the upstream part of a project is always the most critical. If you have scoped it properly and done the right analysis, the right solutions will come by themselves, and it is just a matter of implementing them.

 

Also, one of my strengths is my flexibility in adapting to any change that may occur. In general, a project never goes without unplanned hiccups, and you need to be agile in order to react quickly when the situation changes or a new issue occurs, while still keeping in mind what the objectives are and how you will get there.

 

ATTRIBUTE 4: Persuasive and Robust Communication

This is probably the main attribute I need to keep working on, because English isn't my strongest language. For any kind of project it is extremely important to be understood properly. You may have the right answers, but if you cannot convince the other parties, then all the work you did can be totally useless. I really appreciated the feedback on the Self Quantified assignment during the first semester, as I wasn't paying full attention to details such as graph titles or using meaningful units of measure. This is something I will keep working on.

On the other hand, I have no issue presenting in front of an audience. It does require a bit of preparation beforehand, but I am not afraid at all to be on the "front line".

ATTRIBUTE 5: Ethical Citizenship and Leadership

Thanks to the DSI class, I am more aware of all the implications related to data privacy. It is absolutely critical that any data scientist takes ownership of, and feels responsible for, handling any kind of data ethically. More than ever, data privacy should be one of the core human rights, and it needs to be taken into account at every step of a data science project.

 

Funnily enough, I wouldn't have rated myself as a strong leader, but the different teammates I have worked with all think this is actually my key strength. Maybe this is due to my experience in project management. Anyway, one of the key areas I want to work on is that I tend to take the lead not at the beginning of a project but a few steps later. For instance, at the forming stage of a group I will usually stay "behind" and observe first. So I really need to force myself to "step in" a bit quicker.

 

iLab1 – Key takeaways from each of the subjects completed


Data Science for Innovation (DSI)

English isn't my mother tongue, so I think my key takeaway for this subject is academic writing. I have been living in Australia for 5 years and I use English every day at work, but it is quite informal in general: I mainly write emails, prepare presentations or write project reports. This was the first time I had to write an essay in an academic style. Getting detailed feedback on the different assignments I submitted really helped me understand where my limits were, and I have already seen some improvement compared to a semester ago.

Data, Algorithms and Meaning (DAM)

In this subject I learned from day 1 that I was already capable of running some machine learning techniques by myself. I had used R at work for the 2 years prior to enrolling in MDSI, mainly for data wrangling and data visualisation. But my main takeaway is really the ability to interpret the output of any algorithm, and also to "challenge" it. It is extremely important to be able to take a step back and really think about how the results link back to the business objective, but also about their potential negative impacts. If taken for granted too quickly, those results can dramatically change the lives of individuals.

Statistical Thinking for Data Science

I kind of knew it before, but following this course confirmed that a strong statistical background is not optional for a data scientist: it is a must. Anyone can apply a machine learning algorithm or run a regression analysis in a few lines of code, but the ability to interpret the results and assess the performance of a model is definitely key for a proper data scientist. This course gave me an overview of some of the most useful techniques, but I felt I was still missing the basics and I really wanted to understand what happens "behind the scenes". For this reason I picked a stats-related elective for the second semester called Multivariate Data Analysis.

 

I am really looking forward to semester 2 🙂