7 Steps in the Data Analysis Process
Create a Data Analysis Process
Are you still finding it challenging to make good use of your data? Having a set of best practices for data science, applicable to each new or existing project, can ensure continual improvement in getting the most out of the data you collect and store.
For this blog, we’ve compiled a list of seven steps in the data analysis process that many data scientists and business stakeholders have learned to follow for turning data into actionable information.
- Define the business objective.
- Source and collect data.
- Process and clean the data.
- Perform exploratory data analysis (EDA).
- Select, build, and test models.
- Deploy models.
- Monitor and validate against stated objectives.
Let’s review each step in the data analysis process in more detail.
1. Define the Business Objective
Step one of the data analysis process should be to clearly state and understand the business objective.
This can start as simply as “we need to increase sales or increase revenues.”
Then, through discussions with business stakeholders such as executives, product management, sales and marketing, the objective should become more specific and actionable. From “increase sales” it may become: “find the best product to offer customers based on their buying history.” The second statement is more specific and actionable and aligns with “increasing sales.”
This objective may be even further refined into very specific statements that lend themselves to analytical solutions.
2. Source and Collect Data
The second step is data sourcing and collection. The goal is to find data that is relevant to solving the problem or supports an analytical solution of the stated objective. This step involves reviewing existing data sources and finding out if it is necessary to collect new data. It may involve any number of tasks to get the data in-hand, such as querying databases, scraping data from data streams, submitting requests to other departments, or searching for third-party data sources.
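As one illustration of "querying databases," the sketch below pulls a customer's buying history with Python's built-in sqlite3 module. The purchases table, its columns, and the fetch_purchase_history helper are hypothetical and exist only for this example; a real project would point at its own database and schema.

```python
import sqlite3


def fetch_purchase_history(conn, customer_id):
    """Return (product, quantity) rows for one customer, oldest first."""
    cur = conn.execute(
        "SELECT product, quantity FROM purchases "
        "WHERE customer_id = ? ORDER BY purchased_on",
        (customer_id,),
    )
    return cur.fetchall()


# Hypothetical in-memory table standing in for a real sales database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, product TEXT, "
    "quantity INTEGER, purchased_on TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, "widget", 2, "2023-01-05"),
        (2, "gadget", 1, "2023-01-07"),
        (1, "gadget", 3, "2023-02-11"),
    ],
)

history = fetch_purchase_history(conn, 1)
print(history)
```

Passing customer_id as a query parameter rather than formatting it into the SQL string keeps the query safe to reuse with untrusted input.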
3. Process and Clean Data
In step three of the data analysis process, the data collected is processed and verified. Raw data must be converted into a usable format and this often requires parsing, transforming, and encoding. This is a good time to look for data errors, missing data, or extreme outliers. Basic statistical summary reports and charts can help reveal any serious issues or gaps in the data. How to fix the issues will depend on the type of problem and will likely need to be considered case-by-case, at least at first. Over time, company protocols may be developed for specific data issues. Especially in a new data science solution, the data almost always needs a little repair work.
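A basic summary report of the kind described above might look like the following sketch. It counts missing values and flags extreme outliers using the interquartile range (IQR) rule, which is one common convention and not the only option. The sample values and the summarize_and_flag helper are made up for illustration.

```python
import statistics


def summarize_and_flag(values, k=1.5):
    """Count missing entries and flag outliers with the IQR rule."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical raw measurements; None marks a missing reading.
raw = [12.0, 11.5, None, 13.1, 12.4, 250.0, 11.9, None, 12.8]
report = summarize_and_flag(raw)
print(report)
```

Whether a flagged value is an error or a genuine extreme still has to be decided case by case, as noted above. The code only surfaces candidates for review.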
4. Perform Exploratory Data Analysis (EDA)
In the exploratory data analysis step, the data is examined carefully for possible logical groupings and hidden relationships. Basic statistical methods and graphs can be used, as well as more advanced methods like clustering, principal component analysis, or other dimension reduction methods.
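One of the basic statistical methods for spotting relationships is a pairwise correlation table. The sketch below computes Pearson correlations in plain Python. The column names and numbers are invented for the example, and Pearson correlation only captures linear relationships, so it is a starting point, not a complete EDA.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Map each (name_a, name_b) pair to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


# Hypothetical columns from a small sales dataset.
columns = {
    "price": [10, 12, 14, 16, 18],
    "units": [100, 90, 80, 70, 60],
    "ad_spend": [5, 3, 6, 2, 4],
}
corr = correlation_matrix(columns)
for (a, b), r in corr.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {r:+.2f}")
```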
5. Select, Build, and Test Models
The next step after exploratory data analysis is model selection, building, and testing. In this step, the analytical approach is put together and tested.
A few considerations will help select one or more appropriate statistical or machine learning models:
- What are the data types? Categorical, ordered, continuous, or mixed.
- Is there a time index to consider?
- Is the response multivariate?
- Are there rules and constraints that need to be incorporated into the model?
- What models have others used for similar problems?
With a few candidate models selected, the next step is model building, testing, and tuning. In this step, the models are configured, validated, and fine-tuned to improve accuracy.
For model validation, a very popular approach is to train the model on one set of data and then, using the trained or fitted model, evaluate its predictive ability on a separate set of data. Through the train-validate-test approach, the best performing models and configurations can be selected.
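The train-validate-test idea can be sketched in a few lines. In this minimal example, two candidate models (a mean-only baseline and a one-variable least-squares line) are fitted on the training rows, the better one is chosen by mean squared error on the validation rows, and the test rows are used only once for the final score. The synthetic data, the 60/20/20 split, and all function names are assumptions made for the illustration.

```python
import random


def split(rows, seed=0, train=0.6, validate=0.2):
    """Shuffle rows reproducibly and cut into train/validate/test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    a = int(len(rows) * train)
    b = a + int(len(rows) * validate)
    return rows[:a], rows[a:b], rows[b:]


def fit_mean(rows):
    """Baseline model: always predict the training mean of y."""
    m = sum(y for _, y in rows) / len(rows)
    return lambda x: m


def fit_linear(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Mean squared prediction error of a fitted model on rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


# Synthetic data: a linear trend plus a small alternating disturbance.
data = [(float(x), 3.0 * x + 2.0 + (0.5 if x % 2 else -0.5)) for x in range(50)]
train_rows, val_rows, test_rows = split(data)

candidates = {"mean": fit_mean, "linear": fit_linear}
fitted = {name: fit(train_rows) for name, fit in candidates.items()}
val_scores = {name: mse(model, val_rows) for name, model in fitted.items()}

best_name = min(val_scores, key=val_scores.get)  # selected on validation only
test_score = mse(fitted[best_name], test_rows)   # reported once, at the end
print(best_name, round(test_score, 3))
```

Keeping the test rows out of model selection is what makes the final score an honest estimate of performance on unseen data.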
6. Deploy Models
After selecting, building, and tuning models, the next step is model deployment. The goal of model deployment is to produce outputs that lead to a decision or action.
In a common scenario, model predictions and other variables are inputs to an optimization problem. The solution to that problem produces raw outputs that must be translated and communicated to business experts and decision makers. If the recommendations make sense from their perspective, they can decide to put them into play.
Here are some examples of what those decisions might look like after evaluating and translating model outputs:
- Raise price
- Launch the promotion
- Change the policy
- Change the mixture
In a data science application, model deployment is often automated while still allowing analyst users to override and influence the model’s recommendations.
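To make the common scenario above concrete, here is a small sketch in which a model's predictions feed a simple optimization and the raw result is translated into a plain recommendation. The demand function is a hypothetical stand-in for a fitted model, and the candidate prices and unit cost are invented. A real deployment would use its own model and constraints.

```python
def predicted_units(price):
    """Hypothetical stand-in for a fitted demand model."""
    return max(0.0, 200.0 - 10.0 * price)


def recommend_price(current_price, candidates, unit_cost):
    """Pick the candidate price with the highest predicted profit."""

    def profit(p):
        return (p - unit_cost) * predicted_units(p)

    best = max(candidates, key=profit)
    if best > current_price:
        action = "Raise price"
    elif best < current_price:
        action = "Lower price"
    else:
        action = "Hold price"
    # Translate the raw optimum into something a decision maker can act on.
    return {"action": action, "price": best, "expected_profit": profit(best)}


rec = recommend_price(10.0, [8.0, 10.0, 12.0, 14.0, 16.0], unit_cost=4.0)
print(rec)
```

Returning the recommendation as data instead of applying it directly leaves room for an analyst to review or override it before anything changes.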
7. Monitor and Validate
The final step in a data analysis process is monitoring and validation. After decisions have been put into play and allowed a short time to work, it's important to go back and check to see if outcomes are as expected.
Monitoring and validating results can take many forms, for example, summary reports and simple charts of actuals versus targets, or of average revenue or sales over time.
The goal is to make sure results are as expected. Otherwise, review any assumptions, check for errors in the data feeds or any unexpected changes to data attributes. Look to see if something changed in the market in an unexpected way.
With continual monitoring, problems can be detected early and corrected before decision-makers find themselves trying to understand nonsensical outputs or, worse, the entire project is branded a disappointing failure. With a good process in place, finding and fixing issues will be routine, and with a good complement of software tools, quality assurance can be built into the system.
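An actual-versus-target check is one of the simplest monitoring reports. The sketch below flags any period where the actual value misses its target by more than a relative tolerance. The 10% threshold, the monthly figures, and the flag_deviations helper are assumptions chosen for the example.

```python
def flag_deviations(actuals, targets, tolerance=0.10):
    """Return (period, relative_gap) where actual misses target by > tolerance."""
    flagged = []
    for period, actual in actuals.items():
        target = targets[period]
        gap = (actual - target) / target
        if abs(gap) > tolerance:
            flagged.append((period, round(gap, 3)))
    return flagged


# Hypothetical monthly sales targets and observed results.
targets = {"2024-01": 100.0, "2024-02": 110.0, "2024-03": 120.0}
actuals = {"2024-01": 98.0, "2024-02": 112.0, "2024-03": 96.0}

alerts = flag_deviations(actuals, targets)
print(alerts)
```

A flagged period is a prompt to review assumptions and data feeds, not proof that the model is wrong.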
Successful Data Analysis Process
The seven steps in the data analysis process can be applied to new and old use cases. They are meant to be put in place, automated to the extent possible, and continually improved and refined over time. To get the most out of your data, focus first on understanding and adopting the right process for data analysis.
We should work on our process, not the outcome of our processes.
--William Edwards Deming
See how IMSL Numerical Libraries can help you to address data analysis problems quickly with a variety of readily available algorithms.