Is your data Normal? A gentle introduction to Normality Tests

View of sunrise or sunset from the beach, Kanyakumari, Tamil Nadu, India. Image Source: South Asia Art Archive Mary Binney Wheeler Image Collection

One of the important decision points when working with data is whether to use parametric or non-parametric statistical methods.

Many parametric statistical methods assume that the sample data was drawn from a normal distribution. If the sample is clearly not normal, non-parametric statistical methods should be used instead.

Therefore, it is very important to check the normality of the data before deciding which type of statistical model to use in the analysis.

There is a range of techniques, called normality tests, that you can use to check whether your data sample deviates from a Gaussian distribution.

In this tutorial, we will learn some techniques that can be used to check the normality of the data. So let’s get started.

Normality Assumption

A large part of data analysis in business is carried out using parametric statistical methods, which assume that the data follows a Gaussian distribution. However, if you use methods that assume a Gaussian distribution and your data was drawn from a different distribution, the findings may be misleading or wrong.

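So the above decision can be summarized as: if the data is Gaussian, use parametric statistical methods; otherwise, use non-parametric statistical methods.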

Sometimes we may end up in a middle ground, where we have to judge whether the data is Gaussian-like enough to use the standard techniques or sufficiently non-Gaussian to call for non-parametric methods.

In machine learning, we can apply normality tests to the input data when fitting a model, or to the residual errors of the model's predictions in the case of regression.

In this tutorial we will have a look at two classes of techniques for checking whether a sample of data is Gaussian.

Graphical method: These are the methods to qualitatively check the normality of data.
Statistical Test: These are the methods to quantitatively check the normality of data.

Test data set

We will use the diamonds data set from the ggplot2 package. This data set contains the prices and other attributes of almost 54,000 diamonds.

library(tidyverse)

head(diamonds)

We will select the price variable and run various graphical methods and statistical tests to check its normality.

dim(diamonds)

[1] 53940    10

This data set contains 53,940 rows. We don’t need all the data points for the analysis, so we will draw a random sample of 200 observations from the data set.

set.seed(123)

# sample_data holds the row indices of the 200 sampled diamonds
sample_data <- sample(seq_len(nrow(diamonds)), 200, replace = FALSE)

Let’s look at the sample data:

diamonds[sample_data,] %>% head

Let’s print out the mean and standard deviation of the price variable from the sample data.

print(paste("the mean value of sample is:",mean(diamonds[sample_data,]$price)))

print(paste("the sd value of sample is:",sd(diamonds[sample_data,]$price)))

[1] "the mean value of sample is: 4082.145"
[1] "the sd value of sample is: 4075.94549797804"

We can see that the mean and standard deviation are reasonable but rough estimations of the true underlying population mean and standard deviation, given the small-ish sample size.

Visual Normality Check – Histogram

A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram.

In the histogram, the data is divided into a pre-specified number of groups called bins. The data is then sorted into each bin and the count of the number of observations in each bin is measured.
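The binning step can be sketched in base R without drawing anything. The values below are simulated right-skewed numbers standing in for a price-like variable (an assumption, not the diamonds data), and the request for about 10 bins is arbitrary.

```r
set.seed(1)
# Simulated right-skewed values (assumed stand-in for a price-like variable)
x <- rlnorm(200, meanlog = 8, sdlog = 1)

# Ask for roughly 10 bins; hist() chooses rounded break points near that number
h <- hist(x, breaks = 10, plot = FALSE)

# One row per bin: lower edge, upper edge, number of observations in the bin
bin_table <- data.frame(
  lower = head(h$breaks, -1),
  upper = tail(h$breaks, -1),
  count = h$counts
)
print(bin_table)
```

The counts in this table are the bar heights that a histogram plot would draw, and they add up to the number of observations.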

Let’s check the distribution of price variable from sample data set.

ggplot(diamonds[sample_data,],aes(x=price))+
  geom_histogram(color="white")+
  theme_classic()

Looking at the histogram, we can conclude that the sample data does not follow a Gaussian distribution. The distribution is highly right-skewed: many diamonds are priced roughly between 100 and 5,000 dollars, while a few exorbitantly priced diamonds stretch the tail of the distribution to the right.

Quantile-Quantile Plot (Q-Q plot)

Another popular plot for checking the distribution of a data sample is the quantile-quantile plot, Q-Q plot, or QQ plot for short.

This plot generates its own sample of the idealized distribution that we are comparing with, in this case the Gaussian distribution. The idealized samples are divided into groups (e.g. 5), called quantiles. Each data point in the sample is paired with the member of the idealized distribution at the same cumulative probability.

The resulting points are plotted as a scatter plot with the idealized values on the x-axis and the data sample on the y-axis.

A perfect match for the distribution will be shown by a line of dots on a 45-degree angle from the bottom left of the plot to the top right. Often a line is drawn on the plot to help make this expectation clear. Deviations of the dots from the line show a deviation from the expected distribution.
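The pairing described above can be sketched by hand in base R. The sample here is simulated (an assumption), and ppoints() is used to choose the cumulative probabilities, which is the same choice base R's qqnorm() makes.

```r
set.seed(2)
x <- rnorm(100, mean = 50, sd = 5)   # hypothetical sample to check
n <- length(x)

probs       <- ppoints(n)    # evenly spread cumulative probabilities
theoretical <- qnorm(probs)  # idealized Gaussian quantiles (x-axis)
observed    <- sort(x)       # sample values in ascending order (y-axis)

qq <- data.frame(theoretical, observed)
print(head(qq))

# Cross-check against the coordinates qqnorm() would plot
ref <- qqnorm(x, plot.it = FALSE)
print(isTRUE(all.equal(sort(ref$x), theoretical)))
```

Plotting observed against theoretical gives the Q-Q plot. For a Gaussian sample the points fall close to a straight line, so the correlation between the two columns is close to 1.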

We can generate a Q-Q plot in R using the stat_qq() function from the ggplot2 package.

ggplot(diamonds[sample_data,],aes(sample=price))+
  stat_qq()+ # draw Q-Q plot
  stat_qq_line() # draw ideal normal distribution line

The Q-Q plot shows that the data points deviate strongly from the diagonal line. Since the data points do not closely fit the expected diagonal pattern, we can conclude that the sample data set is not normal.

Statistical Normality Test

The graphical methods used to check the normality of data are qualitative in nature. There are many statistical tests that we can use to quantify whether a sample of data was drawn from a Gaussian distribution.

Before you apply a statistical test, you must know how to interpret its results.

Each test will return two things:

Statistics: A test statistic describes how closely the distribution of your data matches the distribution predicted under the null hypothesis of the statistical test you are using.
p-value: Used to interpret the test, in this case whether the sample was drawn from a Gaussian distribution.

Interpreting the result through the test statistic requires a deeper level of proficiency in statistics and deeper knowledge of the specific statistical test. Instead, the p-value can be used to interpret the result quickly in practical applications.

Each test assumes that the sample was drawn from a Gaussian distribution. Technically this is called the null hypothesis, or H_0. A threshold level called alpha, typically 5% (0.05), is chosen and used to interpret the p-value.

You can interpret the p-value as follows:

\[p\le\alpha\] – Reject the null hypothesis: the data is unlikely to be normally distributed

\[p>\alpha\] – Fail to reject the null hypothesis: the data is consistent with a normal distribution

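This means that a larger p-value is what we hope to see if we want to treat our sample as drawn from a Gaussian distribution. Failing to reject the null hypothesis does not prove that the data is normal; it only means the test found no strong evidence against normality.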

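The p-value is not the probability that the data fits a Gaussian distribution. It is the probability of observing a test statistic at least as extreme as the one obtained if the null hypothesis were true, and it can be thought of as a value that helps us interpret the statistical test.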
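To make the role of alpha more concrete, here is a minimal simulation sketch. It is not part of the original analysis: the sample size of 50, the 2,000 repetitions and the seed are arbitrary assumptions. It draws samples that really are Gaussian and counts how often a normality test (base R's shapiro.test(), used here only as an example) rejects the null hypothesis anyway.

```r
# Assumed settings for illustration only
set.seed(42)
alpha  <- 0.05   # significance threshold
n_sims <- 2000   # number of simulated samples
n_obs  <- 50     # observations per sample

# Each simulated sample is truly Gaussian, so the null hypothesis holds
p_values <- replicate(n_sims, shapiro.test(rnorm(n_obs))$p.value)

# Share of samples where the test wrongly rejects normality
rejection_rate <- mean(p_values <= alpha)
print(rejection_rate)
```

Because the null hypothesis is true in every simulated sample, the rejection rate should land close to alpha. This is what alpha controls: the share of truly Gaussian samples that get flagged as non-Gaussian.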

Shapiro-Wilk test

The Shapiro-Wilk test, named for Samuel Shapiro and Martin Wilk, evaluates a data sample and quantifies how likely it is that the data was drawn from a Gaussian distribution.

In practice, the Shapiro-Wilk test is believed to be a reliable test of normality, although there is some suggestion that it is mainly suitable for smaller samples of data.
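The sample-size point shows up directly in R: the base shapiro.test() function accepts between 3 and 5000 observations. The sketch below uses simulated values (an assumption, not the diamonds data) to show what happens just above that limit and one simple way around it.

```r
set.seed(3)
too_big <- rnorm(5001)   # one observation more than shapiro.test() accepts

# Capture the error message instead of stopping the script
result <- tryCatch(
  shapiro.test(too_big),
  error = function(e) conditionMessage(e)
)
print(result)

# Testing a subset of at most 5000 observations works
ok <- shapiro.test(too_big[1:5000])
print(ok$p.value)
```

This is one practical reason to test a random subsample rather than a full data set with tens of thousands of rows.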

The shapiro.test() function in R performs the Shapiro-Wilk test on a given data set. We will run this test on the price variable from the sample data set.

results <- shapiro.test(diamonds[sample_data,]$price)
results

# Compare the p-value with alpha = 0.05 and report the decision
if (results$p.value > 0.05) {
  print(paste("P-value is more than 0.05, the sample is Gaussian,P-value is:", round(results$p.value, 3)))
} else {
  print(paste("P-value is less than 0.05, the sample is not Gaussian,P-value is:", round(results$p.value, 3)))
}

Shapiro-Wilk normality test

data:  diamonds[sample_data, ]$price
W = 0.8072, p-value = 5.429e-15

[1] "P-value is less than 0.05, the sample is not Gaussian,P-value is: 0"

Recall that the null hypothesis of the test is that the sample was drawn from a Gaussian distribution. Here we have:

\[p\le\alpha\]

Since the p-value is less than alpha, we reject the null hypothesis and conclude that the sample was not drawn from a Gaussian distribution.

Anderson-Darling Test

The Anderson-Darling test is a goodness-of-fit test that determines how well your data fits a given distribution.

This test is most typically used to check whether your data follows a normal distribution.
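To give a feel for what the test measures, here is a minimal sketch of the textbook Anderson-Darling statistic for normality, computed by hand in base R. The function name ad_statistic and the simulated samples are illustrative assumptions; the sketch estimates the mean and standard deviation from the sample and does not compute a p-value.

```r
# Hypothetical helper: textbook A^2 statistic against a fitted normal
ad_statistic <- function(x) {
  x <- sort(x)
  n <- length(x)
  z <- (x - mean(x)) / sd(x)   # standardize with the sample mean and sd
  i <- seq_len(n)
  log_lower <- pnorm(z, log.p = TRUE)                            # log F(z_i)
  log_upper <- pnorm(rev(z), lower.tail = FALSE, log.p = TRUE)   # log(1 - F(z_{n+1-i}))
  -n - mean((2 * i - 1) * (log_lower + log_upper))
}

set.seed(4)
a_normal <- ad_statistic(rnorm(200))                            # Gaussian sample
a_skewed <- ad_statistic(rlnorm(200, meanlog = 8, sdlog = 1))   # right-skewed sample
print(c(normal = a_normal, skewed = a_skewed))
```

The statistic weights the tails of the distribution heavily, so a strongly skewed sample should produce a much larger value than a Gaussian one.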

The ad.test() function in the nortest package calculates the Anderson-Darling test statistic and p-value. You will have to install the nortest package first in order to use ad.test().

install.packages("nortest")
library(nortest)

Calculate the test statistics

results_ad <- ad.test(diamonds[sample_data,]$price)
results_ad

# Compare the p-value with alpha = 0.05 and report the decision
if (results_ad$p.value > 0.05) {
  print(paste("P-value is more than 0.05, the sample is Gaussian,P-value is:", round(results_ad$p.value, 3)))
} else {
  print(paste("P-value is less than 0.05, the sample is not Gaussian,P-value is:", round(results_ad$p.value, 3)))
}

Anderson-Darling normality test

data:  diamonds[sample_data, ]$price
A = 11.942, p-value < 2.2e-16

[1] "P-value is less than 0.05, the sample is not Gaussian,P-value is: 0"

As before, the null hypothesis is that the sample was drawn from a Gaussian distribution. Here we have:

\[p\le\alpha\]

Since the p-value is less than alpha, we reject the null hypothesis and conclude that the sample was not drawn from a Gaussian distribution.

What test should you use?

We covered a few normality tests, but these are not all of the tests that exist. So which test should you use?
I recommend using them all on your data, where appropriate.

What if one test fails and another does not?

Your data may not be normal for lots of different reasons. Each test calculates its test statistic with a slightly different method and is sensitive to different kinds of departure from normality. So if the tests disagree, it is safer to assume that your data is not normal.

You can investigate why your data is not normal and use data preparation techniques, such as removing outliers, to make the data more normal.
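As one illustration of a data preparation step, the sketch below applies a log transform to a simulated right-skewed sample. Both the transform and the simulated data are assumptions made for illustration (the transform is not one of the techniques named above, and the data is not the diamonds sample). It then compares the Shapiro-Wilk statistic W before and after; values of W closer to 1 indicate a closer match to a Gaussian shape.

```r
set.seed(5)
# Simulated right-skewed, strictly positive values (assumed stand-in data)
x <- rlnorm(200, meanlog = 8, sdlog = 1)

raw    <- shapiro.test(x)        # test on the original scale
logged <- shapiro.test(log(x))   # test after the log transform

print(c(W_raw = unname(raw$statistic), W_log = unname(logged$statistic)))
```

A log transform only makes sense for strictly positive data, and whether it helps depends on why the data is skewed in the first place, so it is worth re-checking the histogram and Q-Q plot after any transformation.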

Summary

In this tutorial, you learned graphical and statistical methods to check the normality of the data. We used a histogram and a Q-Q plot to check the normality of the data qualitatively.

We used the Shapiro-Wilk test and the Anderson-Darling test to check the normality of the data statistically. We used the p-value as the statistical measure to reject or fail to reject the null hypothesis.