iLab1 – Personal Learning Objectives

My iLab subject is an internal project at my company, Fairfax Media.

Project Problem Statement and Description:

Fairfax Media publications are split into two main categories: Metropolitan and Community mastheads. The way the audiences of these two categories consume content may differ significantly, or may on the contrary present similarities for specific types of content. The objective of this project is to provide a better understanding of the behaviour of these two groups regarding news consumption.

My three personal learning objectives for this iLab are:

Web Analytics

  • I am very keen to understand more about how solutions such as SiteCatalyst or Google Analytics track the usage of websites by online users. It will be interesting to see what kind of information is automatically recorded and what the key measures in this area are. Even if this part is not strictly in the scope of the problem statement, the way a website is designed may have a significant impact on how it is used by the end user. A small sketch of the kind of data involved follows below.
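
To get a first feel for what these tools record, here is a minimal sketch of how raw page-view data can be rolled up into two common web analytics measures, page views and unique visitors. To be clear, this is not how SiteCatalyst or Google Analytics work internally; the toy log format and field names are my own assumptions for illustration.

    from collections import Counter

    # Toy page-view log: (visitor_id, page, timestamp). The field names
    # are assumptions for illustration, not a real SiteCatalyst/GA schema.
    hits = [
        ("u1", "/news/politics", "2016-06-24T09:00"),
        ("u1", "/news/sport",    "2016-06-24T09:05"),
        ("u2", "/news/politics", "2016-06-24T09:07"),
        ("u3", "/news/politics", "2016-06-24T10:12"),
    ]

    page_views = Counter(page for _, page, _ in hits)           # views per page
    unique_visitors = len({visitor for visitor, _, _ in hits})  # distinct users

    print(page_views.most_common())  # [('/news/politics', 3), ('/news/sport', 1)]
    print(unique_visitors)           # 3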

How to Profile Customer Behaviour

  • I am very interested in learning techniques to understand customer behaviour based on customers’ actions. This mainly relates to association rule learning algorithms, which we have not covered so far, so I really want to learn more about them. I have often heard about the data science project that found a correlation between beer and nappy purchases. That example is actually one of the reasons I wanted to learn more about data science, so I am really looking forward to taking a deep dive into this area (see the sketch after this list).
  • I recently read some articles about graph analysis, which may also fit the purpose of this project. I am not sure yet whether it is relevant, but if I have time that is an area where I would like to learn a bit more.
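
To make the beer and nappies example concrete, here is a minimal sketch of the three core association rule measures (support, confidence and lift) for the rule {nappies} -> {beer}, computed over a toy transaction list that I invented for illustration.

    # Toy transactions, invented for illustration.
    transactions = [
        {"beer", "nappies", "chips"},
        {"beer", "nappies"},
        {"milk", "bread"},
        {"beer", "chips"},
        {"nappies", "milk"},
    ]

    n = len(transactions)
    both    = sum(1 for t in transactions if {"beer", "nappies"} <= t)
    nappies = sum(1 for t in transactions if "nappies" in t)
    beer    = sum(1 for t in transactions if "beer" in t)

    support    = both / n                 # how often the itemset appears at all
    confidence = both / nappies           # P(beer | nappies)
    lift       = confidence / (beer / n)  # > 1 suggests a positive association

    print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
    # support=0.40 confidence=0.67 lift=1.11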

Clustering and Market Segmentation

  • Finally, the last thing I expect to learn from this iLab is how to group similar types of customers. I have already used some clustering algorithms, but they were all for continuous data. Depending on the data set I get, I may need algorithms that can handle count or categorical variables (see the sketch below). As a second step, I would also like to see whether those results relate back to the current marketing strategy of my company. There are no strong internal expectations for this project at the moment, so it is more of a research exercise, but I would still like to know whether it can create value for the company.
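
One candidate for categorical data is the k-modes algorithm, which works like k-means but uses per-column modes as centroids and Hamming distance instead of Euclidean distance. The sketch below is a minimal, self-contained illustration of the idea; a real project would more likely use an existing implementation, and the toy reader profiles are invented.

    import random
    from collections import Counter

    def hamming(a, b):
        """Number of positions where two categorical records differ."""
        return sum(x != y for x, y in zip(a, b))

    def k_modes(records, k, n_iter=10, seed=0):
        """Minimal k-modes: like k-means, but centroids are column-wise modes."""
        random.seed(seed)
        modes = random.sample(records, k)
        assignments = []
        for _ in range(n_iter):
            # Assign each record to its nearest mode under Hamming distance.
            assignments = [min(range(k), key=lambda i: hamming(r, modes[i]))
                           for r in records]
            # Recompute each mode as the most frequent value per column.
            new_modes = []
            for i in range(k):
                members = [r for r, a in zip(records, assignments) if a == i]
                if not members:
                    new_modes.append(modes[i])  # keep old mode if cluster is empty
                else:
                    new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                           for col in zip(*members)))
            if new_modes == modes:
                break
            modes = new_modes
        return modes, assignments

    # Toy reader profiles: (masthead type, device, visit frequency band).
    readers = [
        ("metro", "mobile", "daily"), ("metro", "mobile", "daily"),
        ("metro", "desktop", "daily"), ("community", "desktop", "weekly"),
        ("community", "desktop", "weekly"), ("community", "mobile", "weekly"),
    ]
    modes, labels = k_modes(readers, k=2)
    print(modes, labels)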

DAM Assignment 3 – Speculative analysis of a particular data context

In recent years the practice of data science has attracted more and more attention from corporations. They are interested in the possibilities that algorithms offer for finding patterns within complex, high-dimensional data sets. At a high level it sounds like a magic wand that can turn a pile of waste into gold in a few clicks and a few lines of code. The objective of data science is to help individuals or businesses gain insights and make better decisions. The important word that is often missed in the previous sentence is: help. Data science algorithms will not make a decision on anybody’s behalf. They are tools for analysing an ever-increasing volume of data faster and more efficiently, but they still rely on the judgement of individuals.

Maybe in the future we will have access to fully automated machine learning techniques, but until then data mining will still require human intervention and decisions. As evidence, consider CRISP-DM, the main data mining methodology used by data scientists. If machine learning algorithms were as capable as some people think, the methodology would need only three steps: data loading, modelling and deployment. You would just load the data, launch the algorithm and then deploy the model it produced. But this is not the case. CRISP-DM has six different phases, and most of them require human expertise. The first phase requires data scientists to gain a good understanding of the business requirements, the research questions, the expected outcomes and the existing constraints. The next phase is data understanding, which requires a deep analysis of the data set and the information lying within it. These first two phases are probably the most important in a data mining project, as they help define the best strategy to adopt. Data scientists have to assess the strengths and limitations of the data and decide which tool (algorithm) will perform best for the research questions to be answered. The data then needs to be prepared in the following phase to make it fit the chosen algorithm. Again, the data scientist has to make decisions about the data preparation strategy: how to handle missing values, what to do with outliers, and whether binning is required to reduce noise (a small sketch of such decisions follows below). All these concerns need to be addressed because algorithms cannot make such decisions by themselves, and this human intervention helps reduce the bias in the resulting models.
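
As a small illustration of the kind of preparation decisions involved, here is a sketch using pandas on an invented data set; the column names, thresholds and binning boundaries are all assumptions made purely for the example.

    import numpy as np
    import pandas as pd

    # Invented toy data set, purely for illustration.
    df = pd.DataFrame({
        "age":    [23, 45, np.nan, 37, 120, 29],  # 120 looks like an outlier
        "visits": [3, 10, 2, np.nan, 7, 1],
    })

    # Decision 1: impute missing values with the median (one option of many).
    df["age"] = df["age"].fillna(df["age"].median())
    df["visits"] = df["visits"].fillna(df["visits"].median())

    # Decision 2: cap outliers at the 1st/99th percentiles rather than drop them.
    low, high = df["age"].quantile([0.01, 0.99])
    df["age"] = df["age"].clip(low, high)

    # Decision 3: bin a noisy count variable into coarse categories.
    df["visit_band"] = pd.cut(df["visits"], bins=[0, 2, 5, np.inf],
                              labels=["low", "medium", "high"])
    print(df)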

But why is it so important to lower the risk of bias? The main reason is that algorithms cannot make judgement calls. They are not able to determine whether a model is right or wrong. Whatever input they receive, they will perform their computation and produce an output, no matter how good or bad that input is. In a sense an algorithm is neutral: it makes no assumptions and will not adapt its strategy. For example, Facebook was recently accused of favouring one party in the US presidential election. Its trending news feature was highlighting more topics in favour of the Democratic Party than the Republican Party. It turned out that Democratic-leaning users tend to post more than the other side, so the algorithm was simply amplifying the bias already present in its input. Another recent example showed Google’s ad system displaying high-salary job advertisements to men far more often than to women. These two examples show how important it is to limit the risk of bias, but also how necessary it is to properly evaluate an algorithm’s outputs. In the CRISP-DM evaluation phase a model is assessed against business criteria, but this can be extended to include the end user’s point of view. The model that best answers the business’s research questions will not necessarily be well received once deployed. It is therefore important to also evaluate how final users are likely to perceive it, using criteria such as ethics and privacy (one simple fairness check is sketched below).
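
As one example of what such an end-user evaluation could look like, the sketch below computes a simple disparate impact ratio: the rate of favourable model outcomes for one group divided by the rate for another. The decisions are invented, and the 0.8 threshold is only a commonly cited rule of thumb, not a standard drawn from the cases above.

    # Invented model decisions: (group, favourable outcome yes/no).
    decisions = [
        ("women", 1), ("women", 0), ("women", 0), ("women", 0),
        ("men",   1), ("men",   1), ("men",   0), ("men",   1),
    ]

    def positive_rate(group):
        outcomes = [o for g, o in decisions if g == group]
        return sum(outcomes) / len(outcomes)

    ratio = positive_rate("women") / positive_rate("men")
    print(f"disparate impact ratio: {ratio:.2f}")  # 0.25 / 0.75 = 0.33
    if ratio < 0.8:
        print("potential adverse impact: review before deployment")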

One last point relates to the transparency of machine learning algorithms. In most cases they are treated as black-box systems: we only see the inputs and the outputs, not what the algorithm does in between. Providing more transparency would help us better understand and identify bias, and therefore lower the risk of making bad or wrong decisions. For instance, in the USA some judges rely on algorithms to assess the risk that a convicted defendant will commit similar crimes in the near future. They rely on these algorithms as a black box when deciding how long a “risky” offender should stay in jail. But if those algorithms are biased and the judges simply follow what they say, they may make the wrong decision and keep someone in jail longer than necessary. It is critical to bring transparency, especially where algorithms can dramatically change the lives of individuals. Recently, researchers at Carnegie Mellon University developed a new tool for detecting bias in algorithms. It feeds different inputs to an algorithm and assesses which of them most affect its outputs (a minimal sketch of this idea follows below), which would help identify situations such as the one Facebook encountered before a model is ever deployed. The results of this kind of analysis should be taken into account in the evaluation phase of CRISP-DM, alongside model performance and algorithm transparency reports.
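
The Carnegie Mellon tool (Quantitative Input Influence) is far more sophisticated, but the core intuition can be sketched with a simple perturbation test: change one input at a time and measure how often the model’s output flips. The toy “model” and records below are entirely invented for illustration.

    import random

    # Toy black-box model: we can only call it, not inspect it.
    def model(gender, experience, postcode):
        return 1 if (experience > 5 and gender == "m") else 0

    # Invented applicant records: (gender, years of experience, postcode).
    people = [("m", 8, 2000), ("f", 9, 2010), ("m", 2, 3000),
              ("f", 7, 2000), ("m", 6, 2010), ("f", 3, 3000)]

    def influence(feature_index, trials=200, seed=0):
        """How often does swapping in another record's value flip the output?"""
        rng = random.Random(seed)
        flips = 0
        for _ in range(trials):
            person = list(rng.choice(people))
            donor = rng.choice(people)
            baseline = model(*person)
            person[feature_index] = donor[feature_index]  # perturb one input
            flips += model(*person) != baseline
        return flips / trials

    for i, name in enumerate(["gender", "experience", "postcode"]):
        print(f"{name}: influence ~ {influence(i):.2f}")
    # A high influence for "gender" would flag this model for review.
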
In this discussion we saw how algorithms can produce biased predictive models and how this can have a significant impact on the lives of individuals. As data scientists we need to be more accountable for the results of our algorithms. It is critical for the practice of data science not only to limit the risk of bias, but also to properly assess those risks from the end user’s perspective and to be transparent about how these algorithms work.

REFERENCES

Schneier, B. 2016, ‘The risks — and benefits — of letting algorithms judge us’, CNN, viewed 24 June 2016, <http://edition.cnn.com/2016/01/06/opinions/schneier-china-social-scores/>.

Pyle, D. & San Jose, C. 2016, ‘An executive’s guide to machine learning’, McKinsey & Company, viewed 24 June 2016, <http://www.mckinsey.com/industries/high-tech/our-insights/an-executives-guide-to-machine-learning>.

Hayle, B. 2016, ‘US presidential election: Facebook accused of anti-conservative bias’, The Australian, viewed 24 June 2016, <http://www.theaustralian.com.au/news/world/the-times/us-presidential-election-facebook-accused-of-anticonservative-bias/news-story/a36f7da8ab4e20b37538bd64c835385e>.

Carpenter, J. 2015, ‘Google’s algorithm shows prestigious job ads to men, but not to women. Here’s why that should worry you’, The Washington Post, viewed 24 June 2016, <https://www.washingtonpost.com/news/the-intersect/wp/2015/07/06/googles-algorithm-shows-prestigious-job-ads-to-men-but-not-to-women-heres-why-that-should-worry-you/>.

Angwin, J., Larson, J., Mattu, S. & Kirchner, L. 2016, ‘Machine Bias’, ProPublica, viewed 24 June 2016, <https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing>.

Datta, A., Sen, S. & Zick, Y. 2016, Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems, Carnegie Mellon University, Pittsburgh, USA.