What are the different stages of data science work? A high-level overview of the process of data science.

Toys used for Lowenfeld’s ‘World Technique’ therapy.
Image Source: https://www.jstor.org/stable/community.26318548

In this article we will take a high-level overview of the process of data science and look at the different stages of data science work. The process of solving a data science problem is summarized in the following figure, often called the Data Science Road Map.

Data Science process

The first step is always to frame the problem: understand the business use case and craft a well-defined analytics problem. All the subsequent steps involve some form of data processing before meaningful insights can be derived.

The Model and Analyze stage loops back to framing the problem. This is a tight feedback loop between data science and the real world. Questions are constantly reframed as new insights become available, so data scientists must keep their code flexible and always keep an eye on the real-world problem they are solving.

You can also see that there are two exit points to the road map: presenting the results and deploying the code. If you are using an available data source to answer a business question, the results are typically presented as a report, such as a slide deck or a written document. The goal is to deliver business insights, which are often used to make key decisions. This kind of data science project can also function as a pilot to test whether an analytics approach is worth a larger follow-up project that may result in production software.

If the final deliverable is a piece of software that will run continuously in the background, then the outcome blends into a software engineering solution: a piece of software that performs some analytics work.
Examples would be the following:

  • Exposing an API in banking software that predicts the likelihood of a loan applicant defaulting.
  • Implementing software that determines whether an insurance claim is legitimate or fraudulent.

Let’s delve into the details of each step in the road map.

The difference between success and failure on a data science project is not about math or engineering: it is about asking the right questions. No amount of technical competence or statistical rigor can make up for having solved a useless problem.

Most data science projects start with some kind of extremely open-ended question. Sometimes the questions are known in the form of pain points, but it is not clear what a solution would look like.
Before delving into the actual work, it is important to clarify exactly what would constitute a solution to the problem. A “definition of done” is a good way to document the criteria for a completed project.

For large projects, this document is called a “Statement of Work” (SOW). The SOW is written collaboratively, involving many stakeholders and several rounds of discussion. Its main purpose is to get everybody on the same page about exactly what work should be done, what the priorities are, and what expectations are realistic. Business problems are typically very vague to start with, and it takes a lot of time and effort to get clarity on the final expectations.

Once you have access to the data, it is a good idea to ask a standard set of questions that will quickly give you a feel for it. A few generic questions can be asked, such as the following (a short sketch of such checks appears after the list):

  • How big is the dataset?
  • Is this the entire dataset?
  • Is the data representative enough? Are all edge cases covered?
  • Are there any outliers? For example, a sudden peak in product sales due to a promotional campaign.
  • Are there any missing values in the data? Where do these missing values come from?
  • Is there any artificial data in the dataset?
  • Are there any unique identifiers? These help in joining data tables.
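As a minimal sketch of such first-pass checks in pandas, something like the following can answer several of these questions at once. The file name sales.csv and the columns order_id and sale_amount are assumptions for illustration only:

```python
# First-pass checks on a hypothetical sales.csv file;
# the file name and column names are assumptions.
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)                      # How big is the dataset?
print(df.isna().sum())               # Are there any missing values?
print(df["order_id"].is_unique)      # Is there a unique identifier?
print(df["sale_amount"].describe())  # Summary statistics hint at outliers

# Flag rows more than 3 standard deviations from the mean as potential outliers
mean, std = df["sale_amount"].mean(), df["sale_amount"].std()
outliers = df[(df["sale_amount"] - mean).abs() > 3 * std]
print(f"{len(outliers)} potential outliers")
```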

The most important question to ask about the data is whether it can solve the business problem that you are trying to tackle. If not, then you might need to look into additional sources of data or modify the work you are planning.

Another important step in understanding the data is data wrangling. Data wrangling is the process of getting the data from its raw format into something suitable for further analysis. This typically means creating an analysis pipeline that fetches the data from its source, does any necessary cleaning and filtering, and puts it into a regular format, as in the sketch below.
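A minimal wrangling pipeline along these lines, assuming a hypothetical raw_sales.csv with order_id, order_date and sale_amount columns, might look like this:

```python
# A minimal wrangling sketch: fetch raw data, clean and filter it, and write
# it out in a regular format. File and column names are assumptions.
import pandas as pd

def wrangle(raw_path: str, clean_path: str) -> pd.DataFrame:
    raw = pd.read_csv(raw_path)

    # Normalize column names and parse dates into a consistent type
    raw.columns = raw.columns.str.strip().str.lower()
    raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")

    # Drop duplicates and rows missing the fields the analysis needs
    clean = (
        raw.drop_duplicates(subset="order_id")
           .dropna(subset=["order_date", "sale_amount"])
    )

    # Persist the cleaned data so downstream steps start from a regular format
    clean.to_csv(clean_path, index=False)
    return clean

df = wrangle("raw_sales.csv", "clean_sales.csv")
```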

Once you have the data in a usable format, the next step is to carry out exploratory data analysis. This means getting an intuitive feel for the data and visualizing it in different ways to see the salient patterns.
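As a lightweight illustration, the sketch below (assuming the cleaned clean_sales.csv produced by the wrangling sketch above) prints summary statistics and draws a couple of plots to surface patterns such as promotional spikes:

```python
# A minimal EDA sketch: summary statistics plus two quick plots.
# The clean_sales.csv file and its columns are assumptions carried over
# from the wrangling sketch above.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_sales.csv", parse_dates=["order_date"])

print(df.describe(include="all"))     # per-column summary statistics

df["sale_amount"].hist(bins=50)       # distribution of the quantity of interest
plt.xlabel("sale_amount")
plt.ylabel("count")
plt.show()

# Monthly totals often reveal seasonality or promotional spikes
monthly = df.set_index("order_date")["sale_amount"].resample("M").sum()
monthly.plot()
plt.ylabel("total sales")
plt.show()
```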

A feature is an attribute extracted from the data that describes some entity. Features define the internal structure of a data set. For example, if you have temperature measurements, a feature could be the average temperature for a particular location. In practical terms, feature extraction means taking your raw data and distilling it into a table of rows and columns, called “tabular data”, where each row corresponds to an observation and each column to a variable. Extracting good features is the most important part of making your analysis work. The feature extraction phase may involve data scientists working closely with domain experts to understand the features and their relevance to the problem at hand.
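A small sketch of this idea, using made-up temperature readings and distilling them into one row per location, might look like this:

```python
# Feature extraction sketch: distill raw temperature readings into tabular
# data with one row per location. The readings and column names are made up.
import pandas as pd

readings = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "temperature": [21.5, 23.0, 18.2, 17.9, 19.1],
})

# Each row of `features` is one observation (a location);
# each column is a variable (a feature).
features = readings.groupby("location")["temperature"].agg(
    avg_temp="mean",
    max_temp="max",
    n_readings="count",
).reset_index()

print(features)
```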

Once features are extracted, most data science projects involve some kind of machine learning model. This stage is relatively simple: you take a standard suite of models, plug your data into each one, and see which works best. Once a model is selected, a lot of time goes into tuning it to increase its accuracy.
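As a sketch of this stage, the snippet below compares a few standard scikit-learn classifiers with cross-validation and then tunes the most promising one; a bundled scikit-learn dataset stands in for your own feature table and target:

```python
# Try a standard suite of models, then tune the best one.
# The bundled breast-cancer dataset is a stand-in for your own features/target.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# Tune the strongest candidate with a small hyperparameter grid
grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```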

The Present Results stage involves preparing a slide deck or a written report describing the work you did and your results. Communicating the results is often difficult, because the material is highly technical and you are presenting to a broad audience. The audience may have different knowledge bases and backgrounds, and they will pay attention to different things in your presentation.

Deployed code typically falls into two categories:

  • Batch analytics code: this involves running analytics similar to what has already been done, on data that will be collected in the future. It may also be a one-off analysis required to test a pilot concept.
  • Real-time code: this is full-fledged development of an analytics package, written in a high-performance programming language and adhering to all the best practices of software engineering.

The final deliverable of code deployment consists of the code itself and some documentation on how to run it.
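As a rough illustration of real-time code, here is a minimal sketch of the loan-default API example mentioned earlier, assuming a hypothetical pre-trained model saved as loan_model.joblib and served with Flask; the endpoint name and input fields are assumptions:

```python
# Minimal sketch of serving a hypothetical pre-trained loan-default model.
# The model file, endpoint name, and expected input fields are assumptions.
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("loan_model.joblib")   # trained and saved elsewhere

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body with the same feature columns used during training
    applicant = pd.DataFrame([request.get_json()])
    probability = model.predict_proba(applicant)[0, 1]
    return jsonify({"default_probability": float(probability)})

if __name__ == "__main__":
    app.run(port=5000)
```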

In this article we took a high-level overview of the process of data science and looked at the different stages of the data science workflow:

  • Frame the problem: this stage involves defining the problem
  • Understand the data: this stage involves data wrangling and exploration
  • Extract features: this stage involves distilling the raw data into the variables most suitable for the analysis
  • Model: this stage involves building a machine learning model and tuning it to improve its accuracy
  • Present results: this stage involves presenting the results to a broad audience in the form of a slide deck or written report
  • Deploy code: this stage involves deploying the code in production
