I recently came across the term CRISP-DM, a data mining standard. Although the process is not new, I feel every analyst should know about this commonly used, industry-wide process. In this post I will explain the different phases involved in creating a data mining solution.
CRISP-DM, an acronym for Cross-Industry Standard Process for Data Mining, is a data mining process model that describes the approaches data analytics organizations commonly use to tackle business problems related to data mining. Polls conducted on the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology among the industry data miners who chose to respond to the survey.
The CRISP-DM model is a phased approach to tackling a business problem.
Let us look at each of the phases in detail:
Use case Identification: This is the initial phase of CRISP-DM, in which a potential business problem is formulated as a data mining use case. Brainstorming sessions are held at various levels between the different stakeholders to define the problem statement, its impact on the business, a clear objective for the solution, and its timelines.
For reference, the phases involved in the model are:
- Use case Identification
- Business Understanding
- Data Acquisition and Data Understanding
- Data Preparation
- Exploratory Analysis
- Data Modeling
- Data Evaluation
- Deployment
Audience:
- Higher management
- IT teams – Application team, DBA team
- Analytics team – Data Scientist
Audience:
- Domain experts - for domain knowledge and an understanding of the business rules
- IT teams - for identifying data sources and the key features of the system
- Analytics teams – Data Scientist
Data Preparation: This phase of CRISP-DM prepares the data to be fed into the data mining algorithms by processing and cleaning the raw data. It is one of the most crucial steps in data mining, because the accuracy of the solution depends on the quality of the data. All the preparation activities needed to create the final dataset are done here: handling missing data with methods such as imputation, converting data into the proper format (for example, unstructured to structured), identifying outliers, normalizing the data, and so on.
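The preparation steps mentioned above can be sketched in a few lines of Python. This is only an illustration using the standard library: the `ages` column is made-up data, the helper names are hypothetical, and median imputation, Tukey's IQR rule and min-max scaling are just one common choice for each step, not something CRISP-DM prescribes.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical raw column: one missing value and one implausible age.
ages = [23, 25, None, 31, 29, 120, 27]

filled = impute_missing(ages)
flags = flag_outliers(filled)
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
scaled = min_max_scale(cleaned)
print(scaled)
```

Whether an outlier should be dropped, capped or kept is a business decision; here it is simply dropped to keep the sketch short.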
Audience:
- Data Analytics team
Performing an exploratory analysis helps us:
- To understand the causes of an observed event.
- To understand the nature of the data we are dealing with.
- To assess the assumptions on which our analysis will be based.
- To identify the key features in the data needed for the analysis.
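As a rough illustration of the points above, the sketch below summarizes a column and ranks features by the strength of their linear correlation with a target. The data and column names are invented for the example, and Pearson correlation is only one of many ways to look for key features.

```python
from math import sqrt
from statistics import mean, median, stdev

def summarize(column):
    """Basic descriptive statistics that show the nature of a column."""
    return {
        "count": len(column),
        "mean": mean(column),
        "median": median(column),
        "stdev": stdev(column),
        "min": min(column),
        "max": max(column),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical dataset: two candidate features and a target.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 44, 40, 41],
}
exam_score = [52, 55, 61, 70, 72]

print(summarize(exam_score))

# Rank features by absolute correlation with the target, strongest first.
ranking = sorted(
    ((name, pearson(col, exam_score)) for name, col in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranking:
    print(f"{name}: r = {r:.3f}")
```

A high correlation only suggests a feature is worth a closer look; it does not by itself explain the cause of an observed event.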
Data Modeling: In this phase, various modeling techniques are selected and applied to the data to extract features, build the model, tune it, and calibrate its parameters to optimal values. Typically this involves applying suitable data mining or machine learning algorithms to the dataset. Some problems can be solved with a single method, whereas others require a combination of multiple techniques.
For example, Netflix's recommendation system uses a combination of Boltzmann machines, gradient boosted decision trees, logistic regression, etc.
Sometimes different methods are also applied separately to find the best one for the problem at hand.
For example, logistic regression, a decision tree and a random forest are each applied to the dataset to see which one produces the best model.
In the modeling phase, the dataset is divided into two sets: a training set and a test set. The model is built using the training set, and the test set is used to evaluate it.
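Here is a minimal sketch of the split-then-compare idea, written with the standard library only. The dataset is synthetic, the 25% test fraction is an arbitrary choice, and the two candidates (a constant mean predictor and a least-squares line) are simple stand-ins for the logistic regression, decision tree or random forest models mentioned above.

```python
import random
from statistics import mean

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_mean(train):
    """Baseline model: always predict the mean target seen in training."""
    avg = mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    """Least-squares straight line through the training points."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def mse(model, rows):
    """Mean squared prediction error of a model on the given rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)

# Synthetic (x, y) data: a linear trend plus a small alternating disturbance.
rows = [(x, 3 * x + 2 + (-1) ** x * 0.5) for x in range(20)]

train, test = train_test_split(rows)

# Fit every candidate on the training set only, then score it on the test set.
candidates = {"mean": fit_mean(train), "line": fit_line(train)}
test_errors = {name: mse(model, test) for name, model in candidates.items()}
best = min(test_errors, key=test_errors.get)
print(test_errors, "-> best:", best)
```

The important detail is that the test rows are never used while fitting, so the test error is an honest estimate of how each candidate behaves on unseen data.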
Data Evaluation: This is the follow-up step to the Data Modeling phase. The model built in the previous step needs to be thoroughly validated before moving to deployment, and it should address all the business objectives stated in the problem statement. The test set created in the previous step is used to test the model. The objective of this step is to check the prediction error made on the test set. If the prediction error is low, the model is good to go. Sometimes the error is larger, which indicates underfitting or overfitting. Based on the results, we might have to go back to the previous phases and tune the model.
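One rough way to read the evaluation results is to compare the error on the training set with the error on the test set. The sketch below is a rule of thumb, not a standard procedure: the `diagnose` helper and the 10% acceptable error are assumptions made for the example, and the threshold should really come from the business objectives.

```python
def error_rate(predictions, labels):
    """Fraction of predictions that do not match the true labels."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)

def diagnose(train_error, test_error, acceptable_error=0.10):
    """Rule-of-thumb reading of train and test error (thresholds are assumptions)."""
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if test_error > acceptable_error:
        # Fits the training data well but does not generalize.
        return "overfitting"
    return "good to go"

# Hypothetical predictions from a classifier on ten test rows.
test_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
test_predictions = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
test_error = error_rate(test_predictions, test_labels)

# Assume the same model made 5% errors on its training set.
print(diagnose(train_error=0.05, test_error=test_error))
```

In this made-up case the model would be sent back to the modeling phase, since a low training error combined with a high test error suggests overfitting.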
Deployment: Once model building and evaluation are complete and we are satisfied with the results, the next step is to present the results to the business users. The published results should be in a form users can read and understand. Most of the time the results are published as reports or through a UI. For example, if top management needs the results to make key business decisions, visualization reports are the appropriate choice. If an end user needs to be recommended a new item on an e-commerce website, the results should be displayed in the web UI.
Most of the time, going back and forth between phases is required. For example, if we find while evaluating the model that it suffers from overfitting, we can go back to the modeling phase and fine-tune the model. As another example, if in the modeling phase we observe that a sparsely populated feature column is critical to the solution, we go back to the Business Understanding step and consult the domain experts to find out whether we can derive more information about that column and impute it with relevant values.
To learn more about CRISP-DM, see its Wikipedia page.