This may be a surprise for people who know me well: “Anthony did write a blog post?! And posted it just 3 days after the end of the event?! No way!!!”
Yeah! Writing a blog post is not usually at the top of my to-do list. For instance, I still haven’t drafted anything since I won the Unearthed Hackathon, and that was a month ago. So you may be wondering why I wrote this one so quickly. And if you are thinking that I won another hackathon, you are wrong. But there is obviously something important enough to trigger my sudden passion for blog posting.
Let’s start at the beginning. Last weekend our usual hackathon crew participated in the Pedestrian Safety 2016 challenge organised by the NSW Data Analytics Centre (DAC) at the University of Sydney. Our team, DataCake, was composed of:
- William Azevedo (aka “the-nicest-guy-in-the-world”)
- Pedro Fernandez (aka “everything-is-awesome”)
- William So (aka “the-big-data-guy”)
- Anthony So (me aka “pinky-or-the-brain-we-still-do-not-know-which-one-he-is”)
I would also like to mention that Passiona Cottee was planning to come but couldn’t make it for personal reasons.
What was this hackathon? The main objective was to find different ways to tackle the critical problem of pedestrian safety in NSW. DAC did a fantastic job of collecting, preparing and cleaning the different data sets for this challenge. Yeah! I know. They spoon-fed us. In other hackathons we usually receive a dump of totally unstructured data, or there is no data at all. So we were thinking “Cool. No data munging! Easy peasy! We just need to run different Machine Learning algorithms, pick the best performer and that is it!”.
We (as well as the other teams) quickly realised that it wouldn’t be that easy. The first issue we encountered was getting access to the data, which was quite laborious. Due to data privacy concerns, the data sets were put in Microsoft Azure Blob storage and were only shared with a specific Virtual Machine (VM) set up for this hackathon. No one in our team was familiar with Azure, so when the organisers told us “you have to set up a VM then copy the Blob with AzCopy into your own Blob” it was like they were speaking Finnish to us. We had no clue what was expected of us, but we gave it a go and with persistence we managed to do it after a few hours. I think only one other team managed it as well. Actually, a lot of teams dropped out of the competition because of this first challenge.
This is just one example of all the issues we ran into during the weekend, and trust me, there were a lot. But let’s move on directly to the result of this hackathon. There were 2 rounds of presentations: every team had to do a 2-minute pitch in the first round, and only the 8 finalists had to do a 5-minute presentation in the final round. We were not among the finalists. We were quite disappointed, just like all the other teams who had worked very hard during the weekend, but… after announcing the list of the 8 teams, the judges made a strange announcement: “… We also would like to invite team DataCake to stay for the second round and present us their findings”.
As you can imagine, we were very puzzled at that point. We were not short-listed for the final round, but we were still asked to do the final presentation. Most teams in this hackathon presented very innovative, out-of-the-box solutions such as a rewards app for pedestrians or intelligent street sensors and lights. But we were the only team who really tried hard to use the full potential of the data sets provided to answer questions that have not been solved yet, such as: Why was there an increase in injuries and fatal accidents in 2015 compared to previous years? How can we reduce these numbers?
During the whole weekend we really forced ourselves to go deep and asked “Why is it happening? Why is it happening? Why is it happening?” every time we found an interesting pattern. We really wanted to understand the true root causes of those accidents. We didn’t want to stay at a descriptive level. We knew the answers were behavioural. We knew there were multiple problems, which would therefore require different answers and solutions. We used different techniques to get there: machine learning, stats, data visualisation. It didn’t matter which one we used; the only important point was how to get to the answers to those questions.
For instance, we built a classification model on the severity of the accidents involving children, but we didn’t use it to make predictions. We used it to identify the important (and unimportant) features for those cases. We found out that some of the variables related to the environment (Primary_hazardous_feature, Surface_condition, Weather…) and to the drivers (Fatigue_involved_in_crash…) were not important. This gave us a good indication that those accidents are mostly related directly to the behaviour of the children. So we kept diving further and further and found 3 postcodes with higher numbers of accidents than the others. We focused on those 3 areas and kept going deeper and deeper. Here are some examples showing how deep we went to find answers:
As you can see, we didn’t come up with very innovative solutions, but we were totally, absolutely, 100% focused on finding the answers because we really wanted to save people’s lives. We just used analytical techniques to help us find where we needed to focus and where we needed to investigate more, until we understood the reasons behind the accidents.
This is probably the reason why we didn’t end up in the final 8: we didn’t bring any new idea or concept. We probably didn’t match the judging criteria, which we presumed were more focused on innovation. But having said that, the judges were still very interested in our analysis and still wanted to hear what we found. I am extremely proud of what we accomplished as a team. We really tackled every single issue we found on our road and kept moving forward without any fear and with the same commitment until we reached our objective: saving people’s lives. This is one of the reasons why I wrote this blog post. The second one is coming.
This experience was really mind-blowing for me. As I said, I was already proud of the job we did. But while I was listening to the presentations of the other teams, I realised how profoundly we had changed.
The following part may sound a bit critical, but criticism is absolutely not the main point; it is more an explanation of why I started to reflect on our team and on myself. You know me. I am not the kind of person who will easily criticise someone.
During this weekend every team really did an amazing job with the skills they had at hand. It was definitely a tough challenge to work on. But there were a few points that bothered me as a data scientist:
- A lot of the teams identified Sydney CBD as the place with the highest number of accidents from 2000 to 2015 and jumped straight to a solution. Some recommended running field tests on George Street with IoT (Internet of Things) solutions. There were 2 things that made me jump:
  - Unfortunately they didn’t put their findings back into context. We are in 2016! George Street has already changed since the last data point. Work on the light rail extension started a few months ago. The only traffic left is on the cross streets. [SENSE MAKING]
  - They didn’t push their analysis further. We saw an increase in injuries in the CBD in 2015 and we wanted to understand why. We found out there were more accidents on Thursday evenings. That sounds simple, but in order to find this we transformed the date and time variables and assigned them back to the corresponding evening. For instance, the Thursday 6pm to midnight and Friday 12am to 6am periods were both assigned to the same Thursday evening. Simple but extremely effective! Then we started to list different hypotheses and test them, and we finally found there was a strong correlation between the increase in accidents in the CBD on Thursday evenings and the growth of retail turnover, but also with the increase of women in the labour force. To do so, we went to the Australian Bureau of Statistics (ABS) to find the relevant data sets and merged them into our analysis. Judging by the reaction of the organisers, this is something they didn’t know. [COMMITMENT]
- Some of the teams came up with a predictive model with an accuracy over 80% for classifying the severity of an accident. What they didn’t realise was that the data set contains some variables that are highly correlated with their outcome variable. They predicted that an accident was classified as “killed” based on another variable which gave them the number of people killed in that accident. They didn’t take time for the Data Understanding phase. They just put everything into an algorithm and reported the results. We, on the other hand, looked at every single variable and kept only the meaningful ones for our analysis. We ended up with a very low level of accuracy. Tossing a coin and picking heads or tails would have given you a higher chance of correctly classifying the severity than our model. But it didn’t matter. We just concluded that the data set hadn’t captured the really important features. We didn’t manipulate our model in order to tell a story that would suit us. [INTEGRITY]
- One of the teams highlighted the fact that the biggest group involved in fatal accidents was elderly people, but said that after normalisation this group didn’t show any difference from the other groups, and therefore they focused on other problems. Data science practitioners do play a lot with data and numbers, but it is absolutely our responsibility to understand their meaning. The numbers we were talking about were real people who DIED on the roads. Every single count was critical. Reducing that number even by one has an enormous impact: one person has been saved! And this is exactly what we tried to do during this entire weekend: save pedestrian lives now, not in the future, because with every minute that goes by one more person could be injured or killed on the road. For us it was absolute nonsense to say that after normalisation those numbers were not important anymore. [ETHICS]
- Finally, the last thing that bothered me was the fact that a lot of teams recommended an application to change the behaviour of pedestrians and drivers. The group most often killed in road accidents is people over 60 years old. The number of older pedestrians killed had been trending down for years, but it jumped last year and the organisers told us they still don’t know why. We can easily assume this group of people will follow the rules in general. They probably won’t jaywalk and will cross the road where it is safe. Obviously something changed that caused this increase. They probably changed their behaviour in 2015, but it is almost certain they were forced to. This is not a group of people who will decide one day to change their behaviour for no reason. Looking at the data, we found they were killed in high-density areas and quite often on big main roads. These areas are where some public offices or agencies are located, and those are probably the places these people wanted to go to. Unfortunately we didn’t have time to finalise this analysis, but our main hypothesis is that this group of people has had to walk more since the big bus timetable change or the merger of Medicare centres in 2015. What we are sure of is that no app will have a significant impact on decreasing the risk of fatal accidents for them. A better solution may be to change the bus timetables or routes, or to relocate some of the public offices next to a train station or bus stop to lower the probability of having to cross a main road. [PROBLEM SOLVING]
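The evening re-assignment mentioned in the list above (Thursday 6pm to midnight and Friday 12am to 6am both counted as the same Thursday evening) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not necessarily the code used during the hackathon: the function name `evening_of`, the constants and the sample timestamps are all made up for the example, and the 6pm and 6am cut-offs are simply taken from the description above.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional

# Assumed cut-offs, taken from the example in the post:
# an "evening" runs from 6pm until 6am the next morning.
EVENING_START_HOUR = 18
EVENING_END_HOUR = 6

# Fixed list so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def evening_of(ts: datetime) -> Optional[str]:
    """Return the weekday name of the evening a crash belongs to.

    Crashes between 6am and 6pm are daytime crashes and return None.
    """
    if ts.hour >= EVENING_START_HOUR:
        # 6pm to midnight: the evening belongs to the same calendar day.
        anchor = ts.date()
    elif ts.hour < EVENING_END_HOUR:
        # Midnight to 6am: the evening started on the previous calendar day.
        anchor = ts.date() - timedelta(days=1)
    else:
        return None
    return DAY_NAMES[anchor.weekday()]


# Made-up timestamps for illustration only (not hackathon data).
crashes = [
    datetime(2015, 3, 12, 19, 30),  # Thursday 7:30pm -> Thursday evening
    datetime(2015, 3, 13, 2, 0),    # Friday 2am      -> Thursday evening
    datetime(2015, 3, 12, 12, 0),   # Thursday noon   -> daytime, ignored
]
evenings = [evening_of(ts) for ts in crashes]
evening_counts = Counter(e for e in evenings if e is not None)
```

With the crashes grouped this way, counting by evening rather than by calendar day keeps a night out from being split across two days, which is what makes a pattern like the Thursday evening peak visible.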
As I said earlier, the whole point was not to criticise the work of the different teams (I know it is still not that obvious yet). The reason I spoke about this is that just a few months ago we would have made exactly the same errors as the other teams, or we would have given up if the issues we were facing were too challenging. Listening to those mistakes made me realise where we are in our Ithaca journey. If you are a student of the UTS Master of Data Science and Innovation (MDSI) you should know what this means; otherwise you can still google it. Without realising it, we have travelled a very long distance. And not only that, we have fully embraced some key skills: the ones I listed in square brackets. During this hackathon it was absolutely natural for us to make sense of our findings and to commit ourselves to solving the problems in an ethical, responsible and honest way. We all learned about these concepts during the first semester of MDSI, but this is the first time I realised that I didn’t have to go through this checklist at the end of the project and promise that I would not forget about data privacy and ethics next time. No, this time every block flowed and came together, not as constraints but as drivers of our approach.
In this hackathon we didn’t win any prize, but I am extremely proud of what Pedro, Will, William and I achieved and of the manner in which we did it. In a lifetime there are few occasions when you feel you have changed and become another person, and this experience is definitely one of them for me. Personally, it counts for much more than winning Unearthed, without comparison! This is really the moment when I feel we became talented, ethically-minded and responsible data scientists. I remember that on the first day I started MDSI (just 6 months ago) one of the guest lecturers said we would get the sexiest job in the world, but that with great power comes great responsibility. All this time we were mainly focusing on getting more and more power, but this weekend showed us we have switched to the second part of that sentence. We moved to the superheroes side 🙂
I really want to thank the NSW Data Analytics Centre again for this life-changing experience. I really hope you will look at our findings and use them as a starting point. I know it wasn’t, and still is not, an easy task to get access to the data from the different official bodies, but you can show them our analysis of fatal accidents involving older people, and I hope this will make them realise that by sharing their data, lives can be saved. Please keep pushing on fast-tracking data-driven policy making.
I also want to thank all the teachers, lecturers and assistants from the Master of Data Science and Innovation (MDSI) at UTS. You helped us become what we are now. I know we are not finished with the Master yet (especially me), but you have done a fantastic job already. You did more than show us the right direction: you forced us to learn to get on the right track by ourselves, and this will without any doubt shape our future careers as data scientists.
A big big big thank you to my teammates: Pedro, Will and William!!! Thanks for making this challenge so easy and so delightful! Thanks for keeping faith in our approach and pushing it as far as we could during this weekend! We definitely achieved our goal!
Thanks for reading till the end of this post. I hope you have already experienced, or will soon experience, the same amazing journey as we did.
Viva DAC !!!
Viva UTS MDSI !!!
Viva team DataCake !!!
PS: Don’t expect any blog post about my experience at Unearthed. It will just not happen 🙂
PPS: Sorry Pedro for not showing your slide during the presentation. Here it is: