During the first DAM class, Professor Siamak walked us through the Cross Industry Standard Process for Data Mining (CRISP-DM), a methodology that is widely used for Data Science projects.
WHAT IS CRISP-DM?
CRISP-DM is a methodology for managing Data Mining projects. It was conceived in the 1990s by 5 different companies (SPSS, Teradata, Daimler AG, NCR Corporation and OHRA). According to several polls of data scientists across different industries, it is currently the most used process for Data Mining projects.
It breaks down a project into 6 different phases:
- Business Understanding: this step focuses on understanding what the business requires, which problems and questions it wants answered, and on defining a project plan to address them.
- Data Understanding: this phase is about collecting data and performing a first level of analysis of the data sets through a descriptive analysis of the different variables.
- Data Preparation: this is when we clean, transform, merge and enhance the data set for the next phase.
- Modelling: this is the step where we apply statistical or Machine Learning techniques to define the most appropriate model for the project.
- Evaluation: after defining the model we have to assess its performance and its ability to generalize what it has learned.
- Deployment: the final step is about implementing the model in a live environment and maintaining it. It can also be the finalization of the report requested by the business.
These steps are fairly straightforward to understand, but I personally think the most important part of this methodology is its feedback loops. At almost every stage you can go back to a previous step based on what you have learned. It is not a sequential V-model as we usually see in IT projects; it is more agile and more iterative. It reminds me of the PDCA model designed by Deming, where you iterate through the same cycle several times in order to solve a problem: you plan your actions (how am I going to learn something about the problem?), you do them (you perform the tasks you defined), you check (you analyse the results), you act (you reflect on what you learned), and then you start again if required (what I learned helps me understand the situation better, but I need to dig deeper and learn more).
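To make the feedback loops more tangible, here is a minimal Python sketch that models the six phases as a small transition map. The specific backward arrows are an assumption based on the commonly reproduced CRISP-DM diagram as I remember it, and the names `Phase`, `NEXT_PHASES`, `is_valid_path` and `feedback_loops` are hypothetical, made up for this illustration only.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed transitions (my reading of the CRISP-DM diagram, not an official spec):
# each phase maps to the phases a project may move to next.
NEXT_PHASES = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.DATA_PREPARATION, Phase.BUSINESS_UNDERSTANDING},
    Phase.DATA_PREPARATION: {Phase.MODELLING},
    Phase.MODELLING: {Phase.EVALUATION, Phase.DATA_PREPARATION},
    Phase.EVALUATION: {Phase.DEPLOYMENT, Phase.BUSINESS_UNDERSTANDING},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed transition."""
    return all(b in NEXT_PHASES[a] for a, b in zip(path, path[1:]))


def feedback_loops():
    """List the transitions that go back to an earlier phase."""
    loops = [
        (a, b)
        for a, targets in NEXT_PHASES.items()
        for b in targets
        if b.value < a.value
    ]
    return sorted(loops, key=lambda pair: pair[0].value)


if __name__ == "__main__":
    # A project that returns to the business after a first look at the data.
    path = [
        Phase.BUSINESS_UNDERSTANDING,
        Phase.DATA_UNDERSTANDING,
        Phase.BUSINESS_UNDERSTANDING,
        Phase.DATA_UNDERSTANDING,
        Phase.DATA_PREPARATION,
        Phase.MODELLING,
        Phase.EVALUATION,
        Phase.DEPLOYMENT,
    ]
    print(is_valid_path(path))
    for source, target in feedback_loops():
        print(f"{source.name} -> {target.name}")
```

A purely sequential model would simply be the same map without the backward entries, which is what makes the difference with a V-model easy to see.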
Another interesting part of the CRISP-DM methodology is the user guide section, which details the different tasks to perform in a data mining project, the risks associated with each phase and the different possible outputs.
I haven’t really applied the full methodology in a project yet. But throughout my career I have learned and applied other kinds of methodologies, such as the V-Model, Agile, PDCA and DMAIC.
CRISP-DM shares a lot of similarities with the last of these, DMAIC. Like DMAIC, CRISP-DM emphasizes the importance of the first step: understanding the business requirements. Both methodologies recommend spending a fair amount of time properly defining the scope of the project before starting work on it. In these kinds of complex projects (process improvement or data mining) it is crucial to challenge the business's understanding of the situation. The risk is that the business states a very broad view of what it wants and pushes to start the project as soon as possible. This can send the project in the wrong direction or even force a change of direction midway. A common technique used in DMAIC is the 5 Whys, where you ask “why” five times in order to really get to the bottom of the question.
The DMAIC Measure phase is quite similar to the Data Understanding and Data Preparation phases of CRISP-DM. The difference is that DMAIC focuses on defining a very detailed measurement plan (mainly because most projects require collecting new measurements), whereas CRISP-DM focuses on the “quality” of the data set (treating missing values, outliers…).
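As a small illustration of what treating missing values and outliers can look like in practice, here is a minimal sketch using only the Python standard library. The choice of median imputation and of the 1.5 × IQR rule is my own assumption for the example, not something CRISP-DM prescribes, and the function names `impute_missing` and `flag_outliers` are hypothetical.

```python
from statistics import median, quantiles


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


if __name__ == "__main__":
    raw = [10, 12, None, 11, 13, 95]  # made-up measurements for illustration
    cleaned = impute_missing(raw)
    print(cleaned)                 # [10, 12, 12, 11, 13, 95]
    print(flag_outliers(cleaned))  # only the last value (95) is flagged
```

Whether an outlier should then be removed, capped or kept is a judgement call that depends on the business question, which is one more reason to go back to the Business Understanding phase when in doubt.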
The remaining phases of DMAIC and CRISP-DM differ quite a lot, as they are very specific to their respective subjects: process improvement and data mining.
After the brief introduction of this methodology in class I did my own research in order to better understand what CRISP-DM is about.
What did I do?
I read the detailed description of the CRISP-DM methodology by SPSS.
What did I think and feel about what I did? Why?
During the class we had a high-level view of this methodology. I wanted to dive deeper into it and be able to compare it with other methodologies I have encountered during my career.
What were the important elements of this experience and how do I know they are important? What have I learned?
As expected, this methodology is very detailed and it takes some practice to really appreciate how deep it goes. This is true of any methodology: while you read it, it seems logical and pretty straightforward, but you only realise its true meaning once you have faced the situation in a project. So I decided to apply this methodology as much as possible in the coming assignments.
How does this learning relate to the literature and to industry standards?
CRISP-DM is the main methodology used in Data Mining projects so it is quite important to have a good understanding of it. It does provide some recommendations and best practices that may be valuable for the upcoming assignments and projects I will have to manage in the future.
KDnuggets, 2014, What main methodology are you using for your analytics, data mining, or data science projects?, poll, viewed October 2014, <http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html>
IBM, 2011, IBM SPSS Modeler CRISP-DM Guide, IBM Corporation