Without data, you’re just another person with an opinion

– W. Edwards Deming, noted statistician, professor, author, and lecturer

“The sexiest job of the 21st century.” Data scientist, a title that didn’t even exist before 2008, is now the position employers can’t hire enough of and job seekers strive to attain. There is good reason for the hype: with an estimated CAGR of 30.1 per cent, India has positioned itself as a global powerhouse for data analytics, with the industry expected to grow from **$5.7 billion in 2022** to **$30.7 billion by 2027**.

At present, no sector is untouched by the effects of data science. From aviation to manufacturing, retail to pharmaceuticals, and gaming to education, every sector leverages data science to make informed decisions and achieve its goals.

In this post, we will define data science and look at the knowledge areas that combine to make it a valuable tool for organisations making informed business decisions.

**What is Data Science?**

**Data science is the practice of using data to try to understand and solve real-world problems.** This concept isn’t exactly new; people have been analyzing sales figures and trends since the dawn of the Industrial Revolution. In the past decade, however, we have gained access to exponentially more data than existed before. The advent of computers, the internet revolution and IoT have all contributed to the generation of that data. Advances in semiconductor technology have resulted in fast, efficient computational resources to process this large pile of information. With computer code, a data scientist can transform or aggregate data, run statistical analyses, or train machine learning models. The output of this code may be a report or dashboard for human consumption, or it could be a machine learning model that will be deployed to run continuously.

Let’s take the case of a retail company and see how data science can help it decide where to open a new store.

If a retail company is having trouble deciding where to put a new store, for example, it may call in a data scientist to do an analysis. The data scientist could look at the historical data of locations where online orders are shipped to understand where customer demand is. They may also combine that customer location data with demographic and income information for those localities from census records. With these datasets, they could find the optimal place for the new store and create a Microsoft PowerPoint presentation to present their recommendation to the company’s vice president of retail operations.
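The analysis described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the locality names, order records, census figures, and the scoring rule (orders per 10,000 residents, weighted by relative median income) are all hypothetical and are not taken from any real company or dataset.

```python
from collections import Counter

# Hypothetical shipping records: one locality name per online order.
orders = ["Northside", "Northside", "Riverside", "Northside", "Hilltop", "Riverside"]

# Hypothetical census figures per locality (made-up numbers for illustration).
census = {
    "Northside": {"population": 150_000, "median_income": 60_000},
    "Riverside": {"population": 80_000, "median_income": 90_000},
    "Hilltop": {"population": 200_000, "median_income": 50_000},
}


def rank_localities(orders, census):
    """Rank localities by a naive demand score (an assumed scoring rule):
    orders per 10,000 residents, weighted by median income relative to
    the highest-income locality."""
    order_counts = Counter(orders)
    top_income = max(row["median_income"] for row in census.values())
    scores = {}
    for locality, row in census.items():
        orders_per_10k = order_counts.get(locality, 0) / row["population"] * 10_000
        income_weight = row["median_income"] / top_income
        scores[locality] = orders_per_10k * income_weight
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_localities(orders, census)
for locality, score in ranking:
    print(f"{locality}: {score:.3f}")
```

A real analysis would use far more data and a more careful model of demand, but the shape is the same: join order locations with demographic data, compute a score, and rank the candidates.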

To come up with such recommendations, though, the data scientist needs knowledge in several areas: coding, maths and statistics, and domain knowledge of whatever industry they are working in. These skill areas are aptly shown in Drew Conway’s popular data science Venn diagram.

In Conway’s opinion (at the time of the diagram’s creation), data science fell into the intersection of math and statistical knowledge, expertise in a domain (Substantive Expertise), and hacking skills (that is, coding). This image is often used as the cornerstone of defining what a data scientist is.

“Everyone was well aware of the inherent interdisciplinary nature of these skills; but more importantly, each of these skills are on their own very valuable, but when combined with only one other are at best simply not data science, or at worst downright dangerous.”

– Drew Conway

**What does each of these components mean?**

**Mathematics and Statistics**

At the basic level, mathematics and statistics knowledge is data literacy. Data literacy can be broken down into three levels:

- Knowing that a statistical technique exists
- Applying a statistical technique
- Choosing a statistical technique

**Knowing that a statistical technique exists**

If you don’t know that something is possible, you can’t use it. If a data scientist was trying to group similar customers, knowing that statistical methods (called clustering) can do this would be the first step.

**Applying a statistical technique**

If the data scientist wants to use a method such as k-means clustering to group the customers, they would need to understand how to adjust the parameters of the method, for example, by choosing how many groups to create.
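To make the role of that parameter concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The customer data is hypothetical, and using the first `k` points as starting centroids is a simplification assumed here to keep the example deterministic; library implementations usually choose starting points more carefully.

```python
import math


def k_means(points, k, iterations=20):
    """Plain Lloyd's algorithm. Simplifying assumption: the first k points
    are used as the initial centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


# Hypothetical customers: (monthly spend in thousands, store visits per month).
customers = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5), (1.2, 2.2), (8.5, 9.0)]

labels, centroids = k_means(customers, k=2)
print(labels)

# Compare how tightly the data is grouped for different choices of k.
inertia_by_k = {}
for k in (1, 2, 3):
    k_labels, k_centroids = k_means(customers, k)
    inertia_by_k[k] = inertia(customers, k_labels, k_centroids)
print(inertia_by_k)
```

Comparing the inertia across values of `k` is one common way to judge how many groups to create: a large drop from one value to the next suggests the extra group captures real structure, while small drops suggest diminishing returns.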

**Choosing a statistical technique**

In our customer grouping example, even after the data scientist focuses on clustering, they have to consider dozens of different methods and algorithms. Rather than trying each method, they need to be able to rule out methods quickly and focus on just a few.

“In order to do this, you need to apply appropriate math and statistics methods, which requires at least a baseline familiarity with these tools. This is not to say that a PhD in statistics is required to be a competent data scientist, but it does require knowing what an ordinary least squares regression is and how to interpret it.”

– Drew Conway
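As a rough illustration of the ordinary least squares regression mentioned in the quote, the sketch below fits a straight line to a handful of points using the closed-form formulas for simple linear regression. The data (advertising spend versus sales) is hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares
    (simple linear regression with one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (x) and sales (y), in arbitrary units.
ad_spend = [1, 2, 3, 4, 5]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_ols(ad_spend, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")
```

Interpreting the result is the part that matters: the slope says that, in this made-up data, each additional unit of advertising spend is associated with roughly two additional units of sales, and the intercept is the predicted sales when spend is zero. Neither number by itself shows that the spend causes the sales.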

**Programming**

Programming refers to the ability to pull data from company databases and to write clean, efficient, maintainable code. Knowing how to program makes common data science tasks easy, such as cleaning, manipulating, summarizing and visualizing data, building models, and sharing results.

In most data science projects, R or Python is the main language. R is a programming language that has its roots in statistics, so it’s generally strongest for statistical analysis and modeling, visualization, and generating reports with results. Python is a programming language that started as a general software development language and has become extremely popular in data science. The capabilities of the two languages are at near parity, so it is up to the data scientist to choose a language for the analysis.
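As a small taste of the cleaning and summarizing tasks mentioned above, here is a sketch in plain Python. The records and field names (`region`, `sales`) are hypothetical; in practice, a data scientist would more likely pull the rows from a database and use a data analysis library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records, with the kind of mess real data often has.
raw_rows = [
    {"region": " north ", "sales": "120.5"},
    {"region": "South", "sales": "80"},
    {"region": "north", "sales": ""},  # missing value
    {"region": "SOUTH", "sales": "100"},
    {"region": "North", "sales": "99.5"},
]


def clean(rows):
    """Drop rows with missing sales, normalize region names, parse numbers."""
    cleaned = []
    for row in rows:
        sales_text = row["sales"].strip()
        if not sales_text:
            continue  # skip rows with no sales figure
        cleaned.append(
            {"region": row["region"].strip().lower(), "sales": float(sales_text)}
        )
    return cleaned


def summarize(rows):
    """Compute average sales per region."""
    by_region = defaultdict(list)
    for row in rows:
        by_region[row["region"]].append(row["sales"])
    return {region: mean(values) for region, values in by_region.items()}


summary = summarize(clean(raw_rows))
print(summary)
```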

**Business Understanding**

A core skill in data science is knowing how to translate a business situation into a data question, find the data answer, and finally deliver the business answer. Business understanding helps in several ways:

- It helps you understand the practicalities of the real world. For example, if you are working for a bank, understanding the common behavior patterns of loan defaulters will help in building a robust loan default prediction model.
- It helps you know what questions to ask. Developing an understanding of the core business can help you judge the situation better. In our loan default example, how are current economic conditions likely to affect the applicant’s ability to repay the loan? Which economic indicators should we include in building our model? These questions can be debated properly only if you have a sound business understanding of the lending market.
- It includes developing general business skills, such as being able to tailor your presentations and reports to different audiences. As a data scientist, you will often end up making presentations in front of a vice president who hasn’t taken a math class in 20 years. You need to inform your audience without either talking down to them or overcomplicating things.

To sum up, data science is an interdisciplinary field. It requires knowledge of programming, maths and statistics, and an understanding of the business domain. Many organisations now treat data science as a key strategic initiative that helps them turn data into actionable insights.

**References:**

– http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

– https://timesofindia.indiatimes.com/blogs/voices/data-science-a-bankable-career-path-for-indian-youth/