What is Statistics and why is it important for data analysis?

Card sorter by International Computers & Tabulators, British, 1967.
Image Source: https://www.jstor.org/stable/community.26313794

One of the wonderful things about statistics is that it is relevant in so many areas. Whatever industry you work in, you will need statistical knowledge to make smart decisions.

In this post you will discover why statistics is important in general and for data science in particular, and which types of methods are available.

Statistics is a required prerequisite

Predictive analytics and statistics are closely related fields of study.

When you want to carry out any kind of prediction, be it customer churn or raw material pricing, familiarity with some statistical techniques is required.

You need statistical methods to measure model accuracy and to model the uncertainty that is inherent in data.

Statistics is widely used in business to measure KPIs, to quantify process variation and to present business performance over several quarters.

Take a look at this quote from the beginning of the popular machine learning book Applied Predictive Modeling:

…the reader should have some knowledge of basic statistics, including variance, correlation, simple linear regression, and basic hypothesis testing (e.g. p-values and test statistics).

Page vii, Applied Predictive Modeling, 2013

Here is another example, from the popular book An Introduction to Statistical Learning:

We expect that the reader will have had at least one elementary course in statistics

Page 9, An introduction to statistical learning with applications in R, 2013

Why is there so much emphasis on learning statistics in the first place?

Why learn statistics?

Raw observations alone are data, but they are not information or knowledge. Statistics is required for knowledge discovery.

Businesses are interested in looking for answers to two key questions:

  • What is happening? This comes under exploratory data analysis.
  • What will happen in the future? This comes under predictive analysis.

Both of these questions rely on statistics to find out:

  • What is the central tendency of the data?
  • What is the past trend, and will it continue in the future?
  • Are there any unusual patterns in the data?
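
To make these three questions concrete, here is a minimal sketch in Python using only the standard library. The monthly sales figures are made up for illustration, and the 1.5 × IQR rule used to flag unusual values is just one common convention, not the only way to do it.

```python
import statistics

# Hypothetical monthly sales figures (made up for illustration).
monthly_sales = [120, 125, 130, 128, 135, 140, 138, 145, 300, 150, 155, 160]

# Central tendency: the mean is pulled up by the spike, the median much less so.
mean_sales = statistics.mean(monthly_sales)
median_sales = statistics.median(monthly_sales)

# Past trend: slope of a least-squares line through (month index, sales).
months = list(range(len(monthly_sales)))
month_mean = statistics.mean(months)
slope = sum(
    (m - month_mean) * (s - mean_sales) for m, s in zip(months, monthly_sales)
) / sum((m - month_mean) ** 2 for m in months)

# Unusual patterns: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(monthly_sales, n=4)
iqr = q3 - q1
outliers = [s for s in monthly_sales if s < q1 - 1.5 * iqr or s > q3 + 1.5 * iqr]

print(f"mean={mean_sales:.1f}, median={median_sales}, slope per month={slope:.2f}")
print(f"unusual values: {outliers}")
```

A positive slope only says that sales rose over the observed months. Whether the trend continues is a separate, predictive question.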

In some cases, such as design of experiments, we may need to answer more sophisticated questions:

  • Can we generalise sample statistics to the population as a whole (hypothesis testing)?
  • What is the difference between the outcomes of two experiments?
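
As a rough illustration of comparing two experiments, the sketch below runs a permutation test on the difference in means. The measurements are made up, and the permutation test is used here only because it needs nothing beyond the standard library. A classical t-test would be a common alternative.

```python
import random
import statistics

# Hypothetical outcomes of two experiments (made-up numbers).
experiment_a = [20.1, 19.8, 20.5, 20.3, 19.9, 20.4, 20.2, 20.0]
experiment_b = [20.9, 21.2, 20.7, 21.0, 20.8, 21.3, 20.6, 21.1]

observed_diff = statistics.fmean(experiment_b) - statistics.fmean(experiment_a)


def permutation_p_value(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    observed = abs(statistics.fmean(b) - statistics.fmean(a))
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis they are interchangeable.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[len(a):]) - statistics.fmean(pooled[:len(a)])
        if abs(diff) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_permutations + 1)


p_value = permutation_p_value(experiment_a, experiment_b)
print(f"observed difference = {observed_diff:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests that a difference this large would rarely appear if the two experiments really behaved the same.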

Statistical methods play a vital role in understanding the data used to train a machine learning model and in interpreting the results of testing different machine learning models.

To sum up, if you want to extract any knowledge from raw data, you need some kind of statistical technique.

What is Statistics?

Statistics is the science of collecting, describing and analysing data.

Statistics is the science of collecting data:
Collecting data for statistical analysis is a critical phase in a data analysis project:

  • Define the objective of the analysis and its intended use (recall the first step of the CRISP-DM methodology).
  • How many data points are required for the study?
  • What should the cross-section of the data look like (does the data cover all the use cases)?
  • Which sampling method should be used?

Statistics is the science of describing data:
Describing data essentially means extracting information from raw observations:

  • What is the central tendency of the data?
  • Are there any correlations among the variables?
  • What does a summary of the data look like?
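
Here is a small sketch of what describing data can look like in practice. The two variables (ad spend and units sold) and their values are invented for illustration, and `pearson_r` and `summarise` are hypothetical helper names, not functions from any library.

```python
import math
import statistics

# Hypothetical paired observations (made up): ad spend and units sold.
ad_spend = [10, 12, 15, 17, 20, 22, 25, 30]
units_sold = [110, 118, 131, 135, 150, 158, 170, 190]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def summarise(values):
    """Basic summary: count, extremes, quartiles, mean and standard deviation."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "q1": q1,
        "median": statistics.median(values),
        "q3": q3,
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
    }


r = pearson_r(ad_spend, units_sold)
summary = summarise(units_sold)
print(f"correlation between ad spend and units sold: {r:.3f}")
print(summary)
```

Keep in mind that a high correlation describes how the two variables move together in this data. It does not by itself show that one causes the other.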

Statistics is the science of analysing data:
Once a prediction model is built, statistics helps us judge how far we can trust it:

  • How can we assess model accuracy?
  • How can probability theory be used to analyse the uncertainty in the data?
  • Which features should be selected for modelling?
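
As one possible way to put a number on both accuracy and its uncertainty, the sketch below computes the accuracy of a classifier on a small test set and a 95% Wilson score interval around it. The labels and predictions are made up, and the Wilson interval is my choice for the example. Other interval methods exist.

```python
import math
from statistics import NormalDist

# Hypothetical true labels and model predictions on a test set (made up).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

n = len(y_true)
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return centre - half_width, centre + half_width


low, high = wilson_interval(correct, n)
print(f"accuracy = {accuracy:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With only 20 test cases the interval is wide, which is exactly the kind of uncertainty a single accuracy figure hides.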

The subject of statistics can be divided into two broad categories: descriptive statistics and inferential statistics.

Descriptive statistics is used to summarise and describe the main features of a data set. For example, an HR manager may use descriptive statistics to find out:

  • What proportion of employees have a master's degree in engineering?
  • What does the salary distribution look like across the various ranks of employees?
  • Is there a correlation between employee retention and promotion?

Inferential statistics is used to draw conclusions about an entire population based on the information in a sample.

Businesses often collect data from a sample of customers to infer the preferences of the entire customer base. A widely used method is A/B testing: the click-through rate is compared between the old and new versions of a web page, and the page is then modified to maximise customer conversions.
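
A minimal sketch of how such an A/B comparison could be checked is shown below, using a two-proportion z-test. The visitor and click counts are made up, and the z-test is only one reasonable choice of test for click-through rates.

```python
import math
from statistics import NormalDist

# Hypothetical A/B test counts (made up for illustration).
old_clicks, old_visitors = 200, 5000   # old version of the page
new_clicks, new_visitors = 260, 5000   # new version of the page


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis of equal click-through rates.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = two_proportion_z_test(old_clicks, old_visitors, new_clicks, new_visitors)
print(f"old CTR = {old_clicks / old_visitors:.3f}, new CTR = {new_clicks / new_visitors:.3f}")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

If the p-value is below the threshold you chose before the test (0.05 is a common convention), the difference in click-through rate is unlikely to be due to sampling noise alone.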

How to learn statistics?

Learning statistics may cause a great deal of anxiety, apprehension and even dread. Most beginner data scientists fear that they lack the extensive mathematical background they assume is required, and this false assumption becomes the biggest obstacle to their learning.

Although an extensive mathematical background is required at more advanced levels of statistical study, it is not required for the day-to-day data analysis carried out by most businesses.

So, how can we learn statistics?

“As for issue of relevance is concerned we have found that, students better comprehend the power and purpose of statistics when it is presented in context of substantive problem of real data”

Page 2, Statistics using R, an integrated approach, 2020

I think the above statement really captures the essence of learning statistics.

Learn in the context of a real data set.

Let’s assume you are a marketing executive with access to a marketing database. Just pull the data out and start exploring it in Excel. You can perform a lot of data analysis with Excel. Generate some descriptive statistics. Understand the central tendency and spread of the data by finding the mean, median and standard deviation. What story can you tell from the data? Can you use the data to track a marketing campaign? What do customer buying patterns look like?

This is all descriptive statistics. That is often all that is required to carry out good data analysis.

Inferential statistics is also widely used in statistical process control, again with a clear context behind the analysis. If you are churning out millions of products from a machine, what can you tell about the machine's accuracy? Is the process stable? You can answer these questions based on a random sample of the machining data.

All of the above examples use statistics, but always in the context of real-world data. Statistics can be learned in the same way, rather than only by working through a dreaded statistics textbook.
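
To give a flavour of the process control idea, here is a simplified sketch that derives 3-sigma control limits from a baseline sample and checks new measurements against them. All measurements are made up, and real control charts usually estimate sigma from subgroups or moving ranges rather than from the plain sample standard deviation used here.

```python
import statistics

# Hypothetical baseline measurements of a machined part, in mm (made up).
baseline = [10.02, 9.98, 10.01, 9.99, 10.00, 10.03, 9.97, 10.01, 10.00, 9.99]

centre_line = statistics.fmean(baseline)
sigma = statistics.stdev(baseline)

# Simplified 3-sigma control limits around the centre line.
lower_limit = centre_line - 3 * sigma
upper_limit = centre_line + 3 * sigma

# New random sample from the same machine (also made up).
new_sample = [10.01, 9.98, 10.12, 10.00]
out_of_control = [x for x in new_sample if not lower_limit <= x <= upper_limit]

print(f"control limits: {lower_limit:.3f} to {upper_limit:.3f} mm")
print(f"points outside the limits: {out_of_control}")
```

A point outside the limits does not prove that the machine is faulty, but it is a signal that the process deserves a closer look.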

Summary:

In this article, we discussed the role of statistics in data science. Statistics is required to answer many interesting questions from data. We elaborated on the definition of statistics. There are two main branches of statistics: descriptive statistics and inferential statistics. Descriptive statistics is used to study the main features of the data, and inferential statistics uses a sample to draw conclusions about the population.

Finally, we touched upon a few pointers on learning statistics effectively without fear and anxiety.
