A gentle introduction to descriptive statistics – examining univariate distributions

Descriptive statistics are used to describe data. A first step in this process is to check the distribution of values of each variable of interest. This tutorial discusses tools for describing a variable, such as frequency distributions, tabulations and graphical representations. We will work with the diamonds data set from the ggplot2 package, which contains prices and other attributes of almost 54,000 diamonds. While working with the diamonds data set you will learn:

How to generate a frequency table
How to generate a proportionate frequency table
How to plot a histogram and check the distribution
How to generate summary statistics describing the central tendency and dispersion of the data
How to visualise summary statistics using box plots

By the end of this tutorial, you will be able to quickly generate descriptive statistics for a single variable of interest.
Let’s get started…

Load the library

We will use a data set from the ggplot2 package, which is part of the tidyverse collection of packages. If you have not installed tidyverse, please do so using install.packages("tidyverse").

library(tidyverse)

Check the initial rows of the data set

The diamonds data set contains the prices and other attributes of almost 54,000 diamonds. You can read the documentation for the data set by typing ?diamonds in the console.

We are particularly interested in the price and cut variables. We will look at how diamond prices vary across the quality of the cut (Fair, Good, Very Good, Premium, Ideal).

head(diamonds)

Frequency Table

The cut variable is an example of a discrete variable with 5 categories (Fair, Good, Very Good, Premium, Ideal). To display the number of diamonds in each of these categories, we will use frequency and percent tables.

From the frequency table, you can see that the Ideal category has the highest number of diamonds in the data set, while the Fair category has the lowest. If we add up the n column, we obtain 53940, which is the total number of observations in the data set.

diamonds %>% count(cut)

cut           n
<ord>     <int>
Fair       1610
Good       4906
Very Good 12082
Premium   13791
Ideal     21551
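The same counts can also be obtained with base R. The snippet below is a minimal sketch, not part of the original walkthrough; the variable name cut_counts is made up for illustration. It also checks that the counts add up to the total number of rows.

```r
library(ggplot2)  # provides the diamonds data set

# Frequency table of the cut variable using base R
cut_counts <- table(diamonds$cut)
cut_counts

# The counts should add up to the number of rows in the data set
sum(cut_counts)
nrow(diamonds)
```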

Proportionate frequency table

While the frequency of a category is informative, so is its relative frequency, i.e. the frequency of the category relative to the total number of observations. The relative frequency is the proportion of all observations that fall in the category. The Ideal category represents about 40% of all observations in the data set, while Fair represents only about 3%. Note that the values in the table sum to 1, i.e. 100 percent.

prop.table(table(diamonds$cut))

      Fair       Good  Very Good    Premium      Ideal 
0.02984798 0.09095291 0.22398962 0.25567297 0.39953652 
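If you prefer to stay within dplyr, the proportions can be added as columns next to the counts. This is a sketch under the assumption that dplyr and ggplot2 are installed; the names cut_percent, proportion and percent are illustrative only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Counts per cut, with the proportion and a rounded percentage alongside
cut_percent <- diamonds %>%
  count(cut) %>%
  mutate(proportion = n / sum(n),
         percent = round(100 * proportion, 1))

cut_percent

# The proportions sum to 1 (up to floating point error)
sum(cut_percent$proportion)
```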

Bar Plot

In addition to the tabular format, frequency and percent distributions of categorical data may be represented visually as a bar chart. The count (frequency) is shown on the Y-axis and the categories on the X-axis. The bars are unconnected rectangles of equal width, and the height of each bar reflects the frequency of the category it represents. For example, the bar labelled Good reaches a count of roughly 5000, matching that category's frequency of 4906.

ggplot(diamonds,aes(x=factor(cut)))+
  geom_bar(fill="blue")+
  xlab("Cut")+
  ylab("Frequency")

Proportionate Bar Plot

The proportionate bar plot is the same as the bar plot, except that the Y-axis scale is different. Notice that the relative heights of the bars are the same in both charts, since the percent scale differs from the frequency scale only by a constant of proportionality (100 divided by the total number of observations).

ggplot(diamonds,aes(x=factor(cut),y=after_stat(count/sum(count)*100)))+
  geom_bar(fill="blue")+
  xlab("Cut")+
  ylab("Percentage")

Proportionate bar plot of *cut* variable

Summary Statistics

Summary statistics of the data can be grouped into measures of central tendency and measures of dispersion. A measure of central tendency represents the central point of the data set. Examples include the mean, median and mode. A measure of dispersion represents how spread out the values are, typically around the mean. Examples include the standard deviation, variance and interquartile range (IQR).

Measure of Central Tendency:

Mean: The sum of all values divided by the number of values. This is the average point of the data
Median: The middle value of the ordered (ascending or descending) data set
Mode: The value that appears most frequently in the data set

Measure of Dispersion:

Variance: The average of the squared differences from the mean (R's var() computes the sample variance, which divides by n - 1 rather than n)
Standard Deviation: The square root of the variance
Interquartile Range: The range between the 25th and 75th percentiles (the first and third quartiles)
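To make these definitions concrete, here is a small worked example on a made-up vector (the numbers are invented for illustration and do not come from the diamonds data). Base R has no built-in function for the statistical mode, so stat_mode below is a hypothetical helper written for this sketch.

```r
# Made-up values, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 11)

mean(x)    # 42 / 7 = 6
median(x)  # middle value of the sorted data: 5

# Hypothetical helper: returns the most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)  # 4 appears twice, every other value once

# Squared deviations from the mean sum to 60;
# var() divides by n - 1 = 6, giving 10
var(x)
sd(x)      # square root of the variance
IQR(x)     # Q3 - Q1 = 8 - 4 = 4 (default quantile method)
```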

summary_values <- diamonds %>%
                  group_by(cut)%>%
                  summarise(average_price = mean(price),
                      median_price = median(price),
                      price_standard_deviation = sd(price),
                      price_variation = var(price),
                      IQR_range = IQR(price))
summary_values

The average and median prices of the Fair and Premium categories are nearly the same. This deserves close scrutiny, since customers would be expected to pay more for Premium diamonds than for Fair ones.

The price variance is highest for the Premium category, which indicates that prices of Premium diamonds are very widely spread.

The range (maximum minus minimum value) is often not the best measure of spread, because it is based on only two values, and those are the most extreme values in the data. A better measure is the range of the middle 50% of the data, the interquartile range. Although the average and median prices of the Fair and Premium categories are nearly the same, their IQRs differ widely. We may need to check for the presence of outliers in the data.
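The sensitivity of the range to extreme values can be seen on a tiny made-up example (the prices below are invented for illustration). Adding a single very large value changes the range dramatically, while the IQR barely moves.

```r
# Made-up prices, for illustration only
prices <- c(300, 320, 350, 400, 420, 450, 500)
with_outlier <- c(prices, 18000)  # add one extreme value

# Range = maximum - minimum
diff(range(prices))        # 200
diff(range(with_outlier))  # 17700

# IQR = Q3 - Q1, based on the middle 50% of the data
IQR(prices)        # 100
IQR(with_outlier)  # 120
```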

Histogram

The histogram shows the distribution of the price variable from the diamonds data set. The red vertical line marks the mean and the khaki line marks the median.

ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  geom_vline(xintercept = mean(diamonds$price), color = "red", linewidth = 1) +
  geom_vline(xintercept = median(diamonds$price), color = "khaki", linewidth = 1) +
  # annotate() draws each label once; geom_text() with aes() would redraw it for every row
  annotate("text", x = median(diamonds$price), y = 9500,
           label = paste("Median:", round(median(diamonds$price), 0)),
           color = "blue", size = 4) +
  annotate("text", x = mean(diamonds$price), y = 6500,
           label = paste("Mean:", round(mean(diamonds$price), 0)),
           color = "black", size = 4)

The distribution is right skewed, since the values are concentrated at the low end of the scale. The 75th percentile (third quartile) of the price variable is $5324.

So 75% of all diamond prices lie between the start of the left tail and the $5324 mark. There is a long right tail that contains the most expensive 25% of the diamonds, demonstrating the lack of symmetry in this data set.
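The quartiles quoted above can be checked directly with quantile(). The sketch below also compares the mean with the median; in a right-skewed distribution the long right tail pulls the mean above the median. The variable name price_quartiles is illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

# First quartile, median and third quartile of price
price_quartiles <- quantile(diamonds$price, probs = c(0.25, 0.5, 0.75))
price_quartiles

# For a right-skewed variable, the mean exceeds the median
mean(diamonds$price) > median(diamonds$price)
```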

Complementing the histogram is the Kernel Density Estimate (KDE) curve, which provides a smoothed representation of the data distribution. Instead of counting observations in discrete bins, the KDE places a small smooth curve (a kernel) on each observation and averages them, which gives a continuous view of the data. It can be thought of as a refined version of the histogram that does not depend on where the bin edges fall, although it does depend on the chosen smoothing bandwidth.

ggplot( diamonds , aes(x=price)) +
    geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)

Kernel Density Estimate of *price* variable
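To show what a kernel density estimate computes, the sketch below builds one by hand on a small made-up sample, using a Gaussian kernel and an arbitrarily chosen bandwidth h, and compares it with R's density() function. The sample, the bandwidth and the variable names are assumptions made for illustration; geom_density() chooses its bandwidth automatically.

```r
# Made-up sample and bandwidth, for illustration only
x <- c(1, 2, 2.5, 4, 7)
h <- 1
grid <- seq(-3, 11, length.out = 200)

# KDE by hand: at each grid point, average the Gaussian
# kernels centred on the observations
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

# Built-in estimate on the same grid with the same bandwidth
kde_builtin <- density(x, bw = h, kernel = "gaussian",
                       from = -3, to = 11, n = 200)

# The two curves agree closely; density() uses a binned
# approximation internally, so the match is not exact
max(abs(kde_manual - kde_builtin$y))

# Like any density, the curve encloses an area of about 1
sum(kde_manual) * (grid[2] - grid[1])
```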

If the histogram is broken down further by cut category, the same right-skewed pattern is observed within each category.

ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 15, fill = "blue", color = "white") +
  facet_wrap(~cut, scales = "free") +
  # the vertical line marks the mean price of each cut
  geom_vline(aes(xintercept = average_price), data = summary_values)

Histogram of *price* variable across diamond *cuts*

Box Plot

The box plot uses a five-number summary (minimum, first quartile, median, third quartile and maximum) to visualize the shape of the distribution of a variable. Here is a box plot for the price variable.

ggplot(diamonds, aes(y = price)) +
  geom_boxplot(alpha=0.7)

The horizontal line drawn within the box marks the median (Q2, the 50th percentile). The bottom edge of the box marks the first quartile, Q1 (25% of diamonds are priced below it), and the top edge marks the third quartile, Q3 (75% of diamonds are priced below it). Thus, the box contains the middle 50% of the values.

The whiskers extend out from the box at both ends. The lower whisker extends from Q1 down to the minimum price ($326). The upper whisker extends from Q3 up to the largest price that lies within 1.5*IQR of Q3. The thick line above the upper whisker is actually a concentration of dots. These are outliers: prices more than 1.5*IQR above Q3, reaching up to the maximum value of the variable ($18823).

In terms of the shape of the box, the distance from Q3 to the median is greater than the distance from Q1 to the median. The top whisker is also much longer than the bottom whisker, suggesting a highly right-skewed distribution.
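The whisker and outlier limits described above can be reproduced by hand. This is a minimal sketch assuming the usual 1.5*IQR rule that geom_boxplot() applies by default; all variable names are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

price <- diamonds$price
q1 <- unname(quantile(price, 0.25))
q3 <- unname(quantile(price, 0.75))
iqr <- q3 - q1

# Fences: values beyond these are drawn as individual outlier points
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
lower_whisker <- min(price[price >= lower_fence])
upper_whisker <- max(price[price <= upper_fence])

# Number of points plotted as outliers
n_outliers <- sum(price < lower_fence | price > upper_fence)

c(lower_whisker = lower_whisker, upper_whisker = upper_whisker,
  n_outliers = n_outliers)
```

Because the lower fence falls below the cheapest diamond, the lower whisker simply stops at the minimum price and all outliers lie above the upper whisker.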

The box plots for the various diamond categories are shown below. The dots represent the mean values.
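The code for this figure is not shown in the text. One way it might be produced is sketched below, using stat_summary() to overlay the mean of each cut as a point. The dot colour and size and the axis labels are assumptions, not taken from the original figure.

```r
library(ggplot2)  # provides the diamonds data set

# One box per cut, with the mean price overlaid as a red dot
# (colour and size chosen arbitrarily for this sketch)
price_by_cut <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 2) +
  xlab("Cut") +
  ylab("Price")

price_by_cut
```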

Box plots of *price* variable across diamond *cuts*

Summary:

In this tutorial we delved into the diamonds data set and generated some interesting descriptive statistics. We started by looking at the first few rows of the data set, and then generated frequency and proportionate frequency tables. These tables are useful for checking the number of observations in each category.

In addition to frequency tables, the frequency and percent distributions of categorical data were represented visually as bar charts.

We calculated summary statistics such as the mean, median, variance, standard deviation and IQR.

We used histograms and box plots to represent and interpret the distribution and variability of the data visually.

Further Resources:

If you need help with the dplyr and ggplot2 packages, please check out my YouTube videos on them:

ggplot2 Package:

dplyr package:

Generate data summary using group_by() and summarise() functions