Descriptive statistics are used to describe data. A first step in this process is to check the distribution of values of each variable. This tutorial discusses tools for describing a variable, such as frequency tabulations, summary statistics and graphical representations. We will work with the `diamonds` data set from the `ggplot2` package. This data set contains the prices and other attributes of almost 54,000 diamonds. While working with the diamonds data set, you will learn:

- How to generate a frequency table
- How to generate a proportionate frequency table
- How to plot a histogram and check the distribution
- How to generate summary statistics describing the central tendency and dispersion of the data
- How to visualise summary statistics using box plots

By the end of this tutorial, you will be able to quickly generate descriptive statistics for a single variable of interest.

Let’s get started…

**Load the library**

We will use a data set from the `ggplot2` package, which is part of the `tidyverse` collection of packages. If you have not installed the tidyverse, please do so using `install.packages("tidyverse")`.

`library(tidyverse)`

**Check initial rows from dataset**

The diamonds data set contains the prices and other attributes of almost 54,000 diamonds. You can read the documentation for the data set by typing `?diamonds` in the console.

We are particularly interested in the `price` and `cut` variables. We will look at how the prices of diamonds vary across the quality of the cut (Fair, Good, Very Good, Premium, Ideal).

`head(diamonds)`

**Frequency Table**

The `cut` variable is an example of a discrete variable with 5 categories (Fair, Good, Very Good, Premium, Ideal). To display the number of diamonds in each of these categories, we will use frequency and percent tables.

From the frequency table, you can see that the Ideal category has the highest number of diamonds in the data set, while the Fair category has the lowest. Summing the `n` column gives 53940, which is the total number of observations in the data set.

```
diamonds %>% count(cut)
cut           n
<ord>     <int>
Fair       1610
Good       4906
Very Good 12082
Premium   13791
Ideal     21551
```
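As a quick check of the total mentioned above, the counts can be summed and compared with the number of rows. This is a minimal sketch; the names `cut_counts` and `total_n` are illustrative and not part of the original tutorial.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Frequency table of cut, sorted from most to least frequent
cut_counts <- diamonds %>% count(cut, sort = TRUE)

# The counts should add up to the number of rows in the data
total_n <- sum(cut_counts$n)
total_n == nrow(diamonds)
```

Sorting with `sort = TRUE` is optional; it simply puts the largest category first, which can make the table easier to scan.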

**Proportionate frequency table**

While the frequency of a category is informative, so is its **relative frequency**, i.e. the frequency of the category relative to the total number of observations. The relative frequency represents the **proportion** of the total number of observations that fall in the category. The Ideal category represents about 40% of all observations in the data set, while Fair represents only about 3%. Note that the sum of the values in the table equals 1 (100 percent).

```
prop.table(table(diamonds$cut))
      Fair       Good  Very Good    Premium      Ideal
0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
```
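The same proportions can also be computed in the `dplyr` style used for the frequency table. The sketch below is one possible approach; the column names `proportion` and `percent` and the object name `cut_props` are assumptions chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Counts, proportions and percentages in a single table
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(proportion = n / sum(n),
         percent = round(100 * proportion, 1))

cut_props

# The proportions add up to 1
sum(cut_props$proportion)
```

Keeping the counts next to the proportions makes it easy to see both the absolute and the relative size of each category.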

**Bar Plot**

In addition to the tabular format, frequency and percent distributions of categorical data may be represented visually in the form of a bar chart. The count (frequency) is shown on the Y-axis and the categories on the X-axis. The bars in a bar chart are **unconnected** rectangles of equal width, and the height of each bar reflects the frequency of the category it represents. For example, the bar labelled `Good` reaches a count of roughly 5000, indicating that the frequency of that category is about 5000 (4906, to be exact).

```
ggplot(diamonds, aes(x = factor(cut))) +
  geom_bar(fill = "blue") +
  xlab("Cut") +
  ylab("Frequency")
```

**Proportionate Bar Plot**

The proportionate bar plot is the same as the bar plot, except that the Y-axis scale is different. Notice that the relative heights of the bars are the same in both charts, since the percent scale differs from the frequency scale only by a constant of proportionality.

```
ggplot(diamonds, aes(x = factor(cut), y = after_stat(count / sum(count) * 100))) +
  geom_bar(fill = "blue") +
  xlab("Cut") +
  ylab("Percentage")
```

**Summary Statistics**

Summary statistics of the data can be grouped into measures of central tendency and measures of dispersion. Measures of central tendency represent the central point of the data set. Examples include the **mean, median and mode**. Measures of dispersion represent how spread out the values are. Examples include the **standard deviation, variance and interquartile range (IQR)**.

Measures of central tendency:

- **Mean:** The sum of all values divided by the number of values. This is the average point of the data.
- **Median:** The middle value of the ordered (ascending or descending) data set.
- **Mode:** The value that appears most frequently in the data set.

Measures of dispersion:

- **Variance:** The average of the squared differences from the mean. Note that R's `var()` computes the sample variance, which divides by n − 1 rather than n.
- **Standard deviation:** The square root of the variance.
- **Interquartile range:** The range between the 25th and 75th percentiles (the first and third quartiles).
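To make these definitions concrete, here is a small sketch on a made-up vector of six values. The vector `x` and the helper `stat_mode()` are hypothetical and only for illustration; base R has no built-in function for the statistical mode, so a simple one is assumed here.

```r
# Toy vector (made-up values for illustration only)
x <- c(2, 3, 3, 5, 7, 10)

mean(x)    # sum of the values divided by the number of values
median(x)  # middle of the sorted values
var(x)     # sample variance (divides by n - 1)
sd(x)      # square root of the variance
IQR(x)     # 75th percentile minus 25th percentile

# Hypothetical helper: returns the most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)
```

If several values tie for the highest frequency, this helper returns all of them, which is one reasonable convention among several.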

```
summary_values <- diamonds %>%
  group_by(cut) %>%
  summarise(average_price = mean(price),
            median_price = median(price),
            price_standard_deviation = sd(price),
            price_variation = var(price),
            IQR_range = IQR(price))
summary_values
```

The average and median prices of the `Fair` and `Premium` categories are nearly the same. This deserves close scrutiny, since customers would be expected to pay more for `Premium` diamonds than for `Fair` ones.

The price variation is highest for the `Premium` category, which points to a wide diversity in pricing for `Premium` diamonds.

The range (maximum minus minimum value) is often not the best measure of spread, because it is based on only two values, and those are the extreme values in the data. A better measure is the range of the middle 50% of the data, the **interquartile range**. Although the average and median prices of the `Fair` and `Premium` categories are nearly the same, their IQRs differ widely. We may need to check for the presence of outliers in the data.

**Histogram**

The histogram shows the distribution of the `price` variable from the `diamonds` data set.

```
# Compute the statistics once and reuse them
mean_price <- mean(diamonds$price)
median_price <- median(diamonds$price)

ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "blue", color = "white") +
  geom_vline(xintercept = mean_price, color = "red", linewidth = 1) +
  geom_vline(xintercept = median_price, color = "khaki", linewidth = 1) +
  # annotate() draws each label once; geom_text() with aes() would redraw it for every row
  annotate("text", x = median_price, y = 9500,
           label = paste("Median:", round(median_price, 0)), color = "blue", size = 4) +
  annotate("text", x = mean_price, y = 6500,
           label = paste("Mean:", round(mean_price, 0)), color = "black", size = 4)
```

The distribution is right-skewed, since the values are concentrated at the low end of the scale. The 75th percentile (third quartile) of the `price` variable is $5,324.

So 75% of all diamond prices lie between the start of the left tail and the $5,324 mark. The long right tail contains the most expensive 25% of the diamonds, demonstrating the lack of symmetry in this data set.
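The quartile quoted above can be checked directly with `quantile()`. This is a minimal sketch; the names `q3` and `share_below` are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

# Third quartile (75th percentile) of price
q3 <- quantile(diamonds$price, probs = 0.75)
q3

# Share of diamonds priced at or below the third quartile
share_below <- mean(diamonds$price <= q3)
share_below
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, so `share_below` should come out close to 0.75.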

Complementing the histogram is the **Kernel Density Estimation (KDE)** curve, which provides a smoothed representation of the data distribution. Loosely speaking, the KDE is a smooth counterpart of the histogram: instead of counting values in discrete bins, it places a small smooth kernel on each observation and sums them, giving a continuous view of the data that does not depend on where the bin edges fall. The amount of smoothing is controlled by a bandwidth, which plays a role similar to the bin width of a histogram.

```
ggplot( diamonds , aes(x=price)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
```
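To compare the two views directly, the histogram can be rescaled to density units and drawn underneath the KDE curve. The sketch below is one way to do this, not part of the original tutorial; the object name `p_overlay` and the colours are arbitrary choices.

```r
library(ggplot2)

# Histogram rescaled to density so it shares a y-axis with the KDE curve
p_overlay <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "blue", color = "white", alpha = 0.5) +
  geom_density(color = "red", linewidth = 1)

p_overlay
```

Without `after_stat(density)` the histogram would be on the count scale and the density curve would appear flat along the bottom of the plot.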

If the histogram is broken down further by diamond category, the same right-skewed pattern is observed within each category.

```
ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 15, fill = "blue", color = "white") +
  facet_wrap(~cut, scales = "free") +
  # Refer to the column by name so each facet gets the mean of its own category
  geom_vline(aes(xintercept = average_price), data = summary_values)
```

**Box Plot**

The box plot uses a five-number summary to visualize the shape of the distribution of a variable. Here is a box plot for the `price` variable.

```
ggplot(diamonds, aes(y = price)) +
geom_boxplot(alpha=0.7)
```

The horizontal line drawn within the box marks the location of *Q50*, the median. The bottom edge of the box marks the *first quartile, Q1* (25% of diamonds are priced below it), and the top edge marks the *third quartile, Q3* (75% of diamonds are priced below it). Thus, the box contains the middle 50% of the values.

The **whiskers** extend out from the box on both ends. The lower whisker extends from *Q1* down to the minimum price ($326), because no price lies more than *1.5 × IQR* below *Q1*. The upper whisker extends from *Q3* up to the largest price that lies within *1.5 × IQR* of *Q3*. The thick line above the upper whisker is actually a dense run of individual dots reaching up to the maximum price ($18,823). These are outliers: values that lie more than *1.5 × IQR* beyond the box.
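The 1.5 × IQR rule can be reproduced by hand to see where the whiskers end and how many points are flagged as outliers. This is a sketch under the assumption that the default quantile definition in R is acceptable; the names `lower_fence`, `upper_fence` and `n_outliers` are illustrative.

```r
library(ggplot2)  # provides the diamonds data set

q1 <- unname(quantile(diamonds$price, 0.25))
q3 <- unname(quantile(diamonds$price, 0.75))
iqr <- q3 - q1

# Limits beyond which a value is treated as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Prices outside the fences are drawn as individual points in the box plot
n_outliers <- sum(diamonds$price < lower_fence | diamonds$price > upper_fence)

c(lower_fence = lower_fence, upper_fence = upper_fence, n_outliers = n_outliers)
```

Because the lower fence falls below the smallest price, all of the flagged points sit above the upper whisker.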

In terms of the shape of the box, the distance from *Q3* to *Q50* is greater than the distance from *Q1* to *Q50*. The top whisker is also much longer than the bottom whisker, suggesting a highly right-skewed distribution.

Box plots can also be drawn separately for each diamond category, **with dots marking the mean values.**
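One way such a plot could be produced is sketched below. It is an assumption about how the figure might be built, not the original code; the object name `p_box` and the red colour of the mean dots are arbitrary.

```r
library(ggplot2)

p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(alpha = 0.7) +
  # stat_summary() adds one point per category at the mean price
  stat_summary(fun = mean, geom = "point", color = "red", size = 2)

p_box
```

Showing the mean next to the median line makes the skew visible in each category: in a right-skewed distribution the mean sits above the median.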

**Summary:**

In this tutorial we delved into the `diamonds` data set and generated some descriptive statistics. We started by looking at the first few rows of the data set, and then generated frequency and proportionate frequency tables. These tables are useful for checking how many observations fall into each category.

In addition to frequency tables, the frequency and percent distributions of categorical data were represented visually in the form of bar charts.

We calculated summary statistics such as the *mean, median, variance, standard deviation and IQR*.

We used visualizations, particularly histograms and box plots, to represent and interpret the distribution and variability of the data.

