Are you still finding it challenging to make good use of your data? Having a set of best practices for data science, applicable to each new or existing project, can ensure continual improvement in getting the most out of the data you collect and store.
For this blog, we’ve compiled a list of seven steps in the data analysis process that many data scientists and business stakeholders have learned to follow for turning data into actionable information.
Let’s review each step in the data analysis process in more detail.
Step one of the data analysis process should be to clearly state and understand the business objective.
This can start as simply as “we need to increase sales” or “we need to increase revenue.”
Then, through discussions with business stakeholders such as executives, product management, sales and marketing, the objective should become more specific and actionable. From “increase sales” it may become: “find the best product to offer customers based on their buying history.” The second statement is more specific and actionable and aligns with “increasing sales.”
This objective may be even further refined into very specific statements that lend themselves to analytical solutions.
The second step is data sourcing and collection. The goal is to find data that is relevant to solving the problem or supports an analytical solution of the stated objective. This step involves reviewing existing data sources and finding out if it is necessary to collect new data. It may involve any number of tasks to get the data in-hand, such as querying databases, scraping data from data streams, submitting requests to other departments, or searching for third-party data sources.
It is important to understand where the data comes from: whether it is final, and whether it is the data of record or a calculation based on other data sources. Timing of the data is also important. Is it produced once a month, once a day, or even more often? How often does the data you already have change?
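As a concrete illustration of the data collection step, the sketch below queries a database for customer purchase history. The in-memory database, the `sales` table, and its columns are all hypothetical stand-ins for a real company data source.

```python
import sqlite3

# Hypothetical example: an in-memory database stands in for a real
# company data source; the "sales" table and its columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("alice", "widget", 19.99), ("bob", "gadget", 34.50), ("alice", "gadget", 34.50)],
)

# Pull the purchase history relevant to the stated objective
rows = conn.execute(
    "SELECT customer, product, amount FROM sales ORDER BY customer"
).fetchall()
print(f"collected {len(rows)} rows")
conn.close()
```

In practice the connection string would point at a production database or data warehouse, and the query would be parameterized by date range so the same collection step can be re-run as new data arrives.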
In step three of the data analysis process, the data collected is processed and verified. Raw data must be converted into a usable format and this often requires parsing, transforming, and encoding. This is a good time to look for data errors, missing data, or extreme outliers. Basic statistical summary reports and charts can help reveal any serious issues or gaps in the data. How to fix the issues will depend on the type of problem and will likely need to be considered case-by-case, at least at first. Over time, company protocols may be developed for specific data issues. Especially in a new data science solution, the data almost always needs a little repair work.
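A minimal verification pass along those lines might look like the following sketch. The records, the use of `None`/NaN to mark missing values, and the 10×-MAD outlier threshold are illustrative assumptions, not a prescription.

```python
import math
import statistics

# Sketch of a basic verification pass; the records and the 10x-MAD
# outlier threshold are illustrative assumptions. None and NaN both
# mark missing values here.
records = [102.0, 98.5, None, 101.2, 97.8, 5000.0, 99.1, float("nan")]

missing = [i for i, v in enumerate(records)
           if v is None or (isinstance(v, float) and math.isnan(v))]
values = [v for i, v in enumerate(records) if i not in missing]

# A robust check: flag points far from the median, measured in units of
# the median absolute deviation (MAD), which extreme values cannot skew.
med = statistics.median(values)
mad = statistics.median(abs(v - med) for v in values)
outliers = [v for v in values if abs(v - med) > 10 * mad]

print(f"{len(missing)} missing value(s), outliers: {outliers}")
```

A median-based rule is used here rather than mean and standard deviation because a single extreme value can inflate the standard deviation enough to hide itself.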
To maximize the value of the data processing step, it is necessary to understand how this data might be sourced and processed in an automated way. These steps will need to be repeated as new data comes in, and any manual processing can introduce errors or produce results different from what you expect.
In the exploratory data analysis step, the data is examined carefully for possible logical groupings and hidden relationships. Basic statistical methods and graphs can be used, as well as more advanced methods like clustering, principal component analysis, or other dimension reduction methods.
During this phase of the process, it is good to start understanding the relationships between variables as well as the inherent properties of each. Are you working with a categorical field like “red”, “blue”, and “green”? Or are you using a numerical field to describe a categorical property? Does a number field indicate relative magnitude, or are the numbers simply labels for something else? Sometimes numbers are used in place of days of the week, which can create unexpected results: a model may assume a Sunday “7” is seven times larger or more significant than a Monday “1”.
Are any of your fields dependent on each other? Dependency is a very important concept in data science because built-in dependency will skew model results. A monthly payment is one example: if you also have a principal amount and an interest rate, the monthly payment is directly dependent on those. Finally, do you have any fields that contain only blanks, or fields in which every value is the same?
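The days-of-the-week pitfall above has a standard remedy: one-hot encoding, where each category becomes its own indicator column. The sketch below is a minimal hand-rolled version; real projects would typically use a library encoder.

```python
# Sketch: encode day-of-week labels as one-hot indicator vectors instead
# of the numbers 1-7, so a model cannot treat Sunday (7) as "7x" Monday (1).
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def one_hot(day: str) -> list[int]:
    """Return a 7-element indicator vector for the given day label."""
    return [1 if d == day else 0 for d in DAYS]

print(one_hot("Sun"))  # [0, 0, 0, 0, 0, 0, 1]
```

Each encoded vector has exactly one nonzero entry, so the model sees seven equal, unordered categories rather than a numeric scale.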
The next step after exploratory data analysis is model selection, building, and testing. In this step, the analytical approach is put together and tested.
A few considerations will help select one or more appropriate statistical or machine learning models:
With a few candidate models selected, the next step is model building, testing, and tuning. In this step the models are configured, validated, and fine-tuned to get better accuracy.
For model validation, a very popular approach is to train the model on one set of data and then, using the trained or fitted model, evaluate its predictive ability on a separate set of data. Through the train-validate-test approach, the best performing models and configurations can be selected.
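The train/holdout idea can be sketched with synthetic data. Everything below is illustrative: the linear model, the noise level, and the 80/20 split ratio are assumptions chosen to keep the example small.

```python
import random

# Toy sketch of train/holdout validation with synthetic data; the linear
# model and the 80/20 split ratio are illustrative assumptions.
random.seed(0)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in range(100)]
random.shuffle(data)

split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

# Fit ordinary least squares for y = a*x + b on the training set only
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
a = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
b = my - a * mx

# Judge predictive ability on data the model never saw during fitting
mse = sum((a * x + b - y) ** 2 for x, y in holdout) / len(holdout)
print(f"slope={a:.2f}, intercept={b:.2f}, holdout MSE={mse:.4f}")
```

The key discipline is that the holdout set plays no role in fitting; a model that memorizes the training data will be exposed by a poor holdout score.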
After selecting, building, and tuning models, the next step is model deployment. The goal of model deployment is to produce outputs that lead to a decision or action.
In a common scenario, model predictions and other variables are inputs to an optimization problem. The solution to that problem produces raw outputs that must be translated and communicated to business experts and decision makers. If the recommendations make sense from their perspective, they can decide to put them into play.
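Returning to the earlier objective of finding the best product to offer each customer, the translation from raw model outputs to decisions might look like this sketch. The customers, products, and scores are hypothetical numbers standing in for real model predictions.

```python
# Illustrative sketch: hypothetical model scores are turned into one
# concrete recommendation per customer.
scores = {
    "alice": {"widget": 0.72, "gadget": 0.41},
    "bob": {"widget": 0.15, "gadget": 0.88},
}

# For each customer, recommend the offer with the highest predicted score
recommendations = {
    customer: max(offers, key=offers.get)
    for customer, offers in scores.items()
}
print(recommendations)  # {'alice': 'widget', 'bob': 'gadget'}
```

A real deployment would add business constraints (inventory, eligibility, budget) on top of this, which is where the optimization step mentioned above comes in.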
Here are some examples of what those decisions might look like after evaluating and translating model outputs:
In a data science application, model deployment is often automated while still allowing analyst users to override and influence the model’s recommendations.
The final step in a data analysis process is monitoring and validation. After decisions have been put into play and allowed a short time to work, it's important to go back and check to see if outcomes are as expected.
Monitoring and validating results can take many forms, for example, summary reports and simple charts of actuals versus targets, or of average revenue or sales over time.
The goal is to make sure results are as expected. Otherwise, review any assumptions, check for errors in the data feeds or any unexpected changes to data attributes. Look to see if something changed in the market in an unexpected way.
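An actuals-versus-targets check of this kind can be automated with very little code. In the sketch below, the monthly figures and the 10% tolerance are illustrative assumptions.

```python
# Sketch of a monitoring check; the figures and the 10% tolerance are
# illustrative assumptions.
targets = {"jan": 100_000, "feb": 110_000, "mar": 120_000}
actuals = {"jan": 104_000, "feb": 108_500, "mar": 95_000}

TOLERANCE = 0.10  # flag any month off target by more than 10%

alerts = [
    month
    for month, target in targets.items()
    if abs(actuals[month] - target) / target > TOLERANCE
]
print(f"months needing review: {alerts}")  # months needing review: ['mar']
```

When a month is flagged, that is the cue to review assumptions, check the data feeds, and look for unexpected market changes as described above.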
By continually monitoring and going through the above data analysis process steps, problems can be detected early and corrected before decision-makers find themselves trying to understand nonsensical outputs, or worse, the entire project is branded a disappointing failure. With a good process in place, finding and fixing issues will be routine—and with a good complement of software tools, quality and assurance can be built into the system.
Once the initial system has been shown to perform as expected, continued monitoring must be put in place to detect any changes in the source data or the performance of the model. A model may perform less well as time progresses and may need updating with new training data.
The seven steps in the data analysis process can be applied to new and old use cases. They are meant to be put in place, automated to the extent possible, and continually improved and refined over time. To get the most out of your data, focus first on understanding and adopting the right process for data analysis.
Watch as Rod Cope, CTO, uses these 7 steps to walk through two real-life use cases.
See how IMSL Numerical Libraries can help you to address data analysis problems quickly with a variety of readily available algorithms.