How to calculate and visualise the distribution of a Numeric Variable?

Violine and Case: Image Source: https://www.jstor.org/stable/community.27028072

Every variable can be characterized by its variation and shape. Variation measures the spread of the values. One simple measure of variation is the range of the values. However, since the range is calculated from the two most distant values in the data set, it is extremely sensitive to outliers. More commonly used measures of dispersion in statistics are the variance and the standard deviation. These are single summary statistics, so they do not tell us much about the data as a whole. A five number summary is a more detailed way to examine the overall data and its spread.

Shape, or skewness, measures the extent to which the data values are not symmetrical around the mean. The shape of a variable represents the pattern of all its values, from the lowest to the highest.

In this tutorial we will learn:

How to calculate the standard deviation and variance of a numeric variable?
How to calculate the five number summary of a numeric variable?
How to visualize the five number summary?
How to visualize the shape of the variable?
Introduction to violin plot.

Let’s get started…

Load the dataset

We will use the tidyverse library to generate summary statistics and some visualizations during the analysis.

library(tidyverse)

The dataset records the number of minutes a customer spends in the bank queue before being attended to by the counter manager. The variables bank1 and bank2 hold data collected at two different branches of the same bank.

customer_waiting_time <- data.frame(
  bank1 = c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20, 4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79),
  bank2 = c(9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35, 10.49, 6.68, 5.64, 4.08, 6.17, 9.91, 5.47)
)

Variance and Standard Deviation

Standard deviation and variance are two commonly used measures of variation. They measure the average scatter around the mean: how larger values fluctuate above it and how smaller values fluctuate below it.

Variance is the average of the squared differences from the mean of the data (the sample variance that var() computes divides by n - 1 rather than n). It measures how much the data points deviate from the mean. A high variance indicates that the data points are spread out over a wider range of values, while a low variance indicates that the data points are closer to the mean. In the output below, the variance for bank2 is higher than for bank1, indicating that customer waiting time is more spread out in bank2 than in bank1. For most business processes a low variance is desirable, because it reflects a reliable, consistent process.
The variance is calculated using the var() function:

customer_waiting_time %>%
  pivot_longer(c("bank1","bank2"),names_to = "bank",values_to = "wait_time") %>%
  group_by(bank) %>%
  summarise(variance = var(wait_time),
            standard_dev = sd(wait_time))

bank   variance standard_dev
<chr>     <dbl>        <dbl>
bank1  2.681192     1.637435
bank2  4.335512     2.082189

Because the variance is expressed in squared units rather than in the original units of measurement, its value cannot be related directly to the original data. Taking the positive square root of the variance expresses it in the original units of measurement. This new measure is called the standard deviation.
The standard deviation is calculated using the sd() function.

Since the variance for bank2 is larger than for bank1, the standard deviation for bank2 is also larger.
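As a quick check on the definitions above, the sketch below recomputes the variance of bank1 by hand and confirms that the standard deviation is its square root. The vector is copied from the dataset; the names manual_var and manual_sd are illustrative only and not part of any package.

```r
# bank1 waiting times (minutes), copied from the dataset above
bank1 <- c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
           4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

n <- length(bank1)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((bank1 - mean(bank1))^2) / (n - 1)

# Standard deviation: positive square root of the variance (back in minutes)
manual_sd <- sqrt(manual_var)

# Both comparisons should return TRUE
all.equal(manual_var, var(bank1))
all.equal(manual_sd, sd(bank1))
```

The division by n - 1 is what makes this the sample variance, which is the quantity var() reports.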

Create five number summary

The variance and standard deviation are summary statistics used to measure the dispersion in the data. These measurements, however, do not show how the values are distributed across the range of the data. To understand the distribution, we need to generate the five number summary.

The five number summary is a concise way to summarize the distribution of a data set. It consists of the following five values:

The minimum value
The first quartile (Q1)
The median
The third quartile (Q3)
The maximum value

The minimum value is the smallest value in the data set. The first quartile (Q1) is the value below which 25% of the data points lie. The median is the value below which 50% of the data points lie. The third quartile (Q3) is the value below which 75% of the data points lie. The maximum value is the largest value in the data set.

The five number summary can be used to get a quick overview of the distribution of a data set. It can tell us how spread out the data is, whether the data is skewed, and whether there are any outliers. In R, summary() is a very useful function for creating these statistics (it also reports the mean).

sapply(customer_waiting_time,summary)

           bank1     bank2
Min.    0.380000  3.820000
1st Qu. 3.370000  5.715000
Median  4.500000  6.680000
Mean    4.287333  7.114667
3rd Qu. 5.340000  8.540000
Max.    6.460000 10.490000
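If only the five values are wanted, without the mean, one option is quantile() with explicit probabilities. The sketch below does this for bank1; the name five_num is illustrative, and the vector is copied from the dataset. quantile() is called with its default interpolation rule (type 7).

```r
# bank1 waiting times (minutes), copied from the dataset above
bank1 <- c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
           4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

# Minimum, Q1, median, Q3 and maximum in one call
five_num <- quantile(bank1, probs = c(0, 0.25, 0.5, 0.75, 1))
five_num
#   0%  25%  50%  75% 100%
# 0.38 3.37 4.50 5.34 6.46
```

These match the Min., 1st Qu., Median, 3rd Qu. and Max. rows of the summary() output for bank1.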

You can make the following observations from the summary statistics:

The median waiting time for bank1 is lower by about 2.2 minutes (4.50 vs 6.68); the mean is lower by almost 3 minutes
75% of the customers coming into bank1 wait less than about 5.3 mins, whereas for bank2 this value is about 8.5 mins
There could be an outlier in the bank1 data (0.38 mins waiting time)
The waiting time for bank1 varies from 0.38 mins to 6.46 mins (a range of 6.08 mins), whereas for bank2 it varies from 3.82 mins to 10.49 mins (a range of 6.67 mins). The spread is somewhat larger in bank2, and the gap between the first and third quartiles is also wider for bank2 (2.83 vs 1.97 mins). This observation agrees with the higher variance and standard deviation for bank2.
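To compare the two spreads numerically rather than by eye, the range and the interquartile range can be computed for each branch. This is a minimal sketch in base R; the list name waiting and the row labels range_width and iqr are illustrative choices, and the vectors are copied from the dataset.

```r
# Waiting times (minutes) for both branches, copied from the dataset above
waiting <- list(
  bank1 = c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
            4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79),
  bank2 = c(9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35,
            10.49, 6.68, 5.64, 4.08, 6.17, 9.91, 5.47)
)

# Range width (max - min) and interquartile range (Q3 - Q1) per branch
spread <- sapply(waiting, function(x) {
  c(range_width = diff(range(x)), iqr = IQR(x))
})
round(spread, 3)
# bank1: range 6.08, IQR 1.97; bank2: range 6.67, IQR 2.825
```

Both measures come out larger for bank2, which is the same ordering that the variance and standard deviation give.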

Box plot – visualize five number summary

You can visualize the five number summary using box plots, also known as box and whisker plots. The plot is constructed by drawing a box that encloses the median. The box extends from the median out to the lower and upper quartiles, enclosing the middle 50% of the data. From the lower and upper quartiles, whiskers extend out from the box towards the outermost data values. The five number summary shown by the box plot is as follows:

The median Q2
The lower quartile Q1
The upper quartile Q3
The smallest value in the distribution
The largest value in the distribution

A box is drawn around the median with the lower and upper quartiles (Q1 and Q3) as the box endpoints. These box endpoints (Q1 and Q3) are referred to as the hinges of the box.

The interquartile range (IQR) is computed as Q3 - Q1. It covers the middle 50% of the data and equals the length of the box. A whisker, a line segment, is drawn from the lower hinge of the box outwards to the smallest data value that lies within the inner fence. A second whisker is drawn from the upper hinge of the box outwards to the largest value within the inner fence. Values beyond the inner fences are plotted individually as outliers. The inner fences of the box plot are given as follows:
Q1 - 1.5*IQR
Q3 + 1.5*IQR
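The fence calculation can be sketched for bank1 as follows, using the conventional 1.5 * IQR rule, which is also the default whisker coefficient in geom_boxplot(). The variable names (lower_fence, upper_fence, outliers) are illustrative, and the vector is copied from the dataset.

```r
# bank1 waiting times (minutes), copied from the dataset above
bank1 <- c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
           4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

q1  <- quantile(bank1, 0.25, names = FALSE)   # 3.37
q3  <- quantile(bank1, 0.75, names = FALSE)   # 5.34
iqr <- q3 - q1                                # 1.97

# Inner fences: 1.5 * IQR beyond each hinge
lower_fence <- q1 - 1.5 * iqr                 # 0.415
upper_fence <- q3 + 1.5 * iqr                 # 8.295

# Observations outside the inner fences are flagged as outliers
outliers <- bank1[bank1 < lower_fence | bank1 > upper_fence]
outliers
# 0.38
```

The only flagged value is 0.38, which sits just below the lower fence of 0.415.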

The box plot for the bank data is shown below:

customer_waiting_time %>%
  pivot_longer(c("bank1","bank2"),names_to = "bank",values_to = "wait_time")%>%
  ggplot(aes(x = bank, y= wait_time,fill = bank)) +
  geom_boxplot(outlier.color="red") +
  coord_flip()

We can make a few observations from the chart:

The location of the median within the box conveys information about the skewness of the middle 50% of the data. If the median is located towards the right side of the box, the middle 50% is skewed to the left. For bank1, the distribution is slightly left skewed, with a long tail that contains the smallest 25% of the values.
If the median is located towards the left side of the box, the middle 50% is skewed to the right, with a long tail that contains the largest 25% of the values. For bank2, the distribution is slightly right skewed.
The outlier in the bank1 box plot is shown in red. This observation falls beyond the inner fences of the box plot.
Customers’ median wait times at the two banks differ noticeably: the bank2 median lies above the upper quartile of bank1.
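The position of the median inside the box can also be checked numerically by comparing the two halves of the box. This is a rough sketch in base R; the labels median_minus_q1 and q3_minus_median are illustrative, and the vectors are copied from the dataset.

```r
# Waiting times (minutes) for both branches, copied from the dataset above
waiting <- list(
  bank1 = c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
            4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79),
  bank2 = c(9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35,
            10.49, 6.68, 5.64, 4.08, 6.17, 9.91, 5.47)
)

# Width of the lower and upper halves of the box for each branch
box_halves <- sapply(waiting, function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  c(median_minus_q1 = q[2] - q[1], q3_minus_median = q[3] - q[2])
})
round(box_halves, 3)
# bank1: 1.13 below the median, 0.84 above (median nearer Q3)
# bank2: 0.965 below the median, 1.86 above (median nearer Q1)
```

A wider lower half points to left skew in the middle 50%, and a wider upper half points to right skew.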

Kernel Density Plots – Visualise shape of the variable

Kernel Density Plots are a type of plot that displays the distribution of values in a dataset using one continuous curve. They are similar to histograms, but they often show the shape of a distribution more clearly because they do not depend on the number of bins used in the histogram. The smoothness of the curve does, however, depend on the bandwidth used for the estimate.
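For readers curious about what lies behind the curve, base R's density() computes the same kind of kernel density estimate. The sketch below is illustrative only and relies on the function's defaults (a Gaussian kernel and the "nrd0" bandwidth rule); the names dens, peak_at and area are my own.

```r
# bank1 waiting times (minutes), copied from the dataset above
bank1 <- c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
           4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)

# Kernel density estimate with the default kernel and bandwidth rule
dens <- density(bank1)

dens$bw                                   # bandwidth picked by the default rule
peak_at <- dens$x[which.max(dens$y)]      # waiting time where the curve peaks

# The curve is a density, so the area under it is approximately 1
area <- sum(dens$y) * diff(dens$x[1:2])
area
```

A larger bandwidth gives a smoother curve and a smaller one a bumpier curve, much as bin width changes the look of a histogram.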

Kernel Density Plots are also used to check the skewness of the data visually. Skewness occurs when a distribution is asymmetrical. The skewed portion is the long, thin tail of the curve.

For the bank1 data, which is left skewed, the median is greater than the mean. The few small values pull the mean towards the left tail. The values are concentrated at the high end of the scale.

For the bank2 data, which is right skewed, the mean is greater than the median. The few large values pull the mean towards the right tail. The values are concentrated at the low end of the scale.
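The relationship between the mean and the median can be confirmed directly. This is a small sketch in base R; the label mean_minus_median is illustrative, and the vectors are copied from the dataset.

```r
# Waiting times (minutes) for both branches, copied from the dataset above
waiting <- list(
  bank1 = c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
            4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79),
  bank2 = c(9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35,
            10.49, 6.68, 5.64, 4.08, 6.17, 9.91, 5.47)
)

# A negative difference suggests left skew, a positive one right skew
centre <- sapply(waiting, function(x) {
  c(mean = mean(x), median = median(x),
    mean_minus_median = mean(x) - median(x))
})
round(centre, 3)
# bank1: mean 4.287, median 4.50 (difference is negative)
# bank2: mean 7.115, median 6.68 (difference is positive)
```

This is only a rule of thumb, but here it agrees with what the box plot shows for both branches.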

Kernel Density plot of Bank1 and Bank2:

customer_waiting_time %>%
  ggplot(aes(x = bank1)) +
  geom_density(alpha = 0.5, fill = "khaki") +
  theme_classic() +
  # Vertical reference lines: mean in red, median in blue
  geom_vline(xintercept = mean(customer_waiting_time$bank1), color = "red", linewidth = 1) +
  geom_vline(xintercept = median(customer_waiting_time$bank1), color = "blue", linewidth = 1) +
  # annotate() draws each label once; geom_text() would repeat it for every row
  annotate("text", x = median(customer_waiting_time$bank1) + 0.5, y = 0.15,
           label = paste("Median:", round(median(customer_waiting_time$bank1), 3)),
           color = "blue", size = 4) +
  annotate("text", x = mean(customer_waiting_time$bank1) - 0.5, y = 0.07,
           label = paste("Mean:", round(mean(customer_waiting_time$bank1), 2)),
           color = "red", size = 4)

customer_waiting_time %>%
  ggplot(aes(x = bank2)) +
  geom_density(alpha = 0.5, fill = "khaki") +
  theme_classic() +
  # Vertical reference lines: mean in red, median in blue
  geom_vline(xintercept = mean(customer_waiting_time$bank2), color = "red", linewidth = 1) +
  geom_vline(xintercept = median(customer_waiting_time$bank2), color = "blue", linewidth = 1) +
  # annotate() draws each label once; geom_text() would repeat it for every row
  annotate("text", x = median(customer_waiting_time$bank2) - 0.5, y = 0.10,
           label = paste("Median:", round(median(customer_waiting_time$bank2), 3)),
           color = "blue", size = 4) +
  annotate("text", x = mean(customer_waiting_time$bank2) + 0.5, y = 0.07,
           label = paste("Mean:", round(mean(customer_waiting_time$bank2), 2)),
           color = "red", size = 4)

Violin Plot

A violin plot is a hybrid of a box plot and a kernel density plot, which shows peaks in the data. It is used to visualize the distribution of numerical data. Unlike a box plot that can only show summary statistics, violin plots depict summary statistics and the density of each variable.

customer_waiting_time %>%
  pivot_longer(c("bank1", "bank2"), names_to = "bank", values_to = "wait_time") %>%
  ggplot(aes(x = wait_time, y = factor(bank), fill = bank)) +
  geom_violin(width = 0.5) +
  geom_boxplot(width = 0.1, color = "blue", alpha = 0.2) +
  ylab("Bank") +
  theme_classic()

On each side of the box-plot is a kernel density estimation to show the distribution shape of the data. Wider sections of the violin plot represent a higher probability that members of the population will take on the given value; the skinnier sections represent a lower probability.

So for bank1, the probability that a customer will be served after between 2.5 mins and 6.5 mins is higher, since that is where the violin is widest. For bank2, since the data is more spread out, the probability remains almost the same between 3 mins and 10 mins.
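The violin widths are smoothed estimates, so a crude cross-check is the share of observed waiting times that fall inside each interval. With only 15 observations per branch these are rough figures; the names share_bank1 and share_bank2 are illustrative, and the vectors are copied from the dataset.

```r
# Waiting times (minutes) for both branches, copied from the dataset above
bank1 <- c(4.21, 5.55, 3.03, 5.13, 4.77, 2.34, 3.54, 3.20,
           4.50, 6.10, 0.38, 5.12, 6.46, 6.19, 3.79)
bank2 <- c(9.66, 5.90, 8.02, 5.79, 8.73, 3.82, 8.01, 8.35,
           10.49, 6.68, 5.64, 4.08, 6.17, 9.91, 5.47)

# Proportion of observations inside each interval
share_bank1 <- mean(bank1 >= 2.5 & bank1 <= 6.5)   # 13 of 15 observations
share_bank2 <- mean(bank2 >= 3 & bank2 <= 10)      # 14 of 15 observations

c(bank1 = share_bank1, bank2 = share_bank2)
```

The bank2 interval is nearly twice as wide as the bank1 interval, so a similar share of customers is spread over a much longer stretch of waiting times.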

Summary:

In this tutorial we learned various techniques for describing the distribution of the data. We started with the variance and standard deviation, the two most widely used measures of dispersion. However, both these measures are very susceptible to outliers and do not tell us anything about the overall shape of the data.

In order to check the overall spread of the data, we looked at the five number summary: the minimum, maximum, median, and first and third quartiles of the variable. We generated a visual display of the five number summary in the form of box and whisker plots.

However, box and whisker plots do not show us the full distribution of the data. We used Kernel Density Plots to check the skewness and the distribution of the data. We ended the tutorial by looking at violin plots, a hybrid of a box plot and a kernel density plot that shows peaks in the data.


Further Resources

If you need help with the dplyr and ggplot2 packages, please check out my YouTube videos on the ggplot2 and tidyr packages:

Pivoting the data with pivot longer function

How to create box-plot in ggplot2 package