Anatomy of a box-plot and how to create it in R

Corinthian Aryballos
Image Source: https://www.jstor.org/stable/community.22852321

Histograms and density plots show the distribution of data. We can also describe a distribution using a few critical points in the data set. These critical points are called quantile values, and they give an impression of the whole distribution.


A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values. In this article we will create a box-plot and cover some important concepts from descriptive statistics that are needed to interpret it. We will learn:

  • What are the quantile values of a numeric variable?
  • How are quantile values shown on a box-plot?
  • How do we create a box-plot in R?

Quantiles are cut points that divide a data set into groups of equal size. The term can also refer to dividing a probability distribution into areas of equal probability.

The median is a quantile because it splits the data into two equal parts: half of the values lie below the median and half lie above it. The median cuts a distribution into two equal areas, so it is sometimes called the 2-quantile.

Quartiles are also quantiles: they divide the distribution into four equal parts. Percentiles are quantiles that divide a distribution into 100 equal parts, and deciles are quantiles that divide a distribution into 10 equal parts.
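The different families of quantiles can be computed with the same base R function. The following is a minimal sketch that assumes the built-in mtcars data set and the default interpolation rule of quantile() (type 7); other quantile definitions can give slightly different values.

```r
# mpg values from the built-in mtcars data set
mpg <- mtcars$mpg

# Median: one cut point, two groups
med <- quantile(mpg, probs = 0.5)

# Quartiles: three cut points, four groups
quartiles <- quantile(mpg, probs = c(0.25, 0.50, 0.75))

# Deciles: nine cut points, ten groups
deciles <- quantile(mpg, probs = (1:9) / 10)

# Percentiles: 99 cut points, 100 groups
percentiles <- quantile(mpg, probs = (1:99) / 100)

print(quartiles)
```

Note that k groups always need k - 1 cut points, which is why there are three quartiles but four quarters of the data.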

A box-plot shows the quartile values of the data graphically. The three quartiles are denoted Q1, Q2 and Q3.

  • The first quartile Q1 is the value below which the bottom 25% of the data falls. It is also called the lower quartile.
  • The second quartile Q2 is the value below which the bottom 50% of the data falls. This is the median of the data.
  • The third quartile Q3 is the value below which the bottom 75% of the data falls. It is also called the upper quartile.

These three quartiles are shown below:

The interquartile range is the range of values between the first and third quartiles. It is the range of the middle 50% of the data and is calculated as Q3-Q1.
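As a quick check of the Q3-Q1 formula, the sketch below computes the quartiles of mtcars$mpg with quantile() and compares the manual result with the built-in IQR() function. The variable names are chosen for illustration only.

```r
mpg <- mtcars$mpg

# Named vector with elements "25%", "50%" and "75%"
q <- quantile(mpg, probs = c(0.25, 0.50, 0.75))

q1 <- q[["25%"]]  # lower quartile
q2 <- q[["50%"]]  # median
q3 <- q[["75%"]]  # upper quartile

# Interquartile range, by hand and with the built-in function
iqr_manual  <- q3 - q1
iqr_builtin <- IQR(mpg)

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr_manual))
```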

A box plot is constructed by drawing a box that encloses the median. The box extends from the lower quartile to the upper quartile, so it encloses not only the median but also the middle 50% of the data. From the lower and upper quartiles, lines referred to as whiskers extend out from the box towards the outermost values.

Thus a box-and-whisker plot is used to show the five-number summary of a distribution:

  • The median (Q2)
  • The lower quartile (Q1)
  • The upper quartile (Q3)
  • The smallest value in the distribution
  • The largest value in the distribution
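In R, these five numbers can be obtained with fivenum(). The sketch below is only an illustration on mtcars$mpg; note that fivenum() returns Tukey's hinges, which can differ slightly from the quartiles reported by quantile() or summary().

```r
mpg <- mtcars$mpg

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(mpg)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
print(five)

# summary() reports similar values (plus the mean), using quantile() quartiles
print(summary(mpg))
```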

The box endpoints (Q1 and Q3) are referred to as the hinges of the box. The interquartile range (IQR) is computed as Q3-Q1 and covers the middle 50% of the data. The inner fences lie at a distance of 1.5*IQR outwards from the lower and upper quartiles. A whisker, which is a line segment, is drawn from the lower hinge of the box outwards to the smallest data value that lies within the inner fence. A second whisker is drawn from the upper hinge of the box outwards to the largest data value that lies within the inner fence. When no value falls beyond the inner fences, the whiskers simply reach the smallest and largest values. The inner fences of a box-plot are established as follows:

Q1 – 1.5*IQR
Q3 + 1.5*IQR

If data falls beyond the inner fences, then outer fences can be constructed:

Q1 – 3.0*IQR
Q3 + 3.0*IQR

Box-plots are also used to show outliers. Data values that lie outside the mainstream of values in a distribution are called outliers.

Values that are outside the inner fences but within the outer fences are referred to as mild outliers. Values that are outside the outer fences are called extreme outliers.
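The fence rules above can be sketched in a few lines of R. The helper name classify_outliers and the sample values are made up for illustration. The sketch also assumes quartiles from quantile() with its default settings, whereas boxplot() itself uses Tukey's hinges, so the two can disagree slightly on small samples.

```r
# Hypothetical helper: label each value as "none", "mild" or "extreme"
classify_outliers <- function(x) {
  q1  <- quantile(x, 0.25, names = FALSE)
  q3  <- quantile(x, 0.75, names = FALSE)
  iqr <- q3 - q1

  inner <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # inner fences
  outer <- c(q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # outer fences

  label <- ifelse(x < outer[1] | x > outer[2], "extreme",
           ifelse(x < inner[1] | x > inner[2], "mild", "none"))
  data.frame(value = x, outlier = label)
}

# Made-up sample, used only to exercise the rules
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 30, 60)
result <- classify_outliers(x)
print(result)
```

For this sample, Q1 = 13.25 and Q3 = 17.75, so the inner fences are 6.5 and 24.5 and the outer fences are -0.25 and 31.25. The value 30 is therefore a mild outlier and 60 is an extreme outlier.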

The boxplot() function in R produces both vertical and horizontal box plots. Let’s produce a simple box plot for the mpg variable from the mtcars data set. We can add a Y-axis label using the ylab argument.

# Box plot of miles per gallon for all cars in mtcars
boxplot(mtcars$mpg, ylab = "average fuel consumption",
        main = "box-plot for vehicle fuel efficiency")

Simple Box plot

The thick horizontal line in the middle of the box is the median of the mpg variable. The upper edge of the box is the upper quartile and the lower edge is the lower quartile. The distance between the upper and lower quartiles is the interquartile range, which spans the middle 50% of the data. The dashed lines extending from both ends of the box are called whiskers. If the median line is approximately in the middle of the box and the whiskers are more or less the same length, you can assume that the distribution of the data is roughly symmetrical.
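The same plot can be drawn horizontally with the horizontal argument, and boxplot() invisibly returns the numbers it used to draw the box. The sketch below is one way to inspect them; the axis label and title are chosen for illustration.

```r
# Horizontal version of the plot; the value axis is now the x-axis
bp <- boxplot(mtcars$mpg, horizontal = TRUE,
              xlab = "Miles per gallon",
              main = "box-plot for vehicle fuel efficiency")

# Rows of bp$stats: lower whisker end, lower hinge, median,
# upper hinge, upper whisker end
print(bp$stats)

# Values drawn as individual points beyond the whiskers (if any)
print(bp$out)
```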

If we want to examine how the distribution of a variable changes across the categories of a categorical variable, we need to use the formula notation with the boxplot() function. For example, let’s plot the mpg variable again, but this time see how it changes with each level of the cyl variable.

# One box per level of cyl; "\n" breaks the title without stray spaces
boxplot(mpg ~ cyl, data = mtcars,
        ylab = "average fuel consumption",
        xlab = "Engine cylinders",
        main = "box-plot for vehicle fuel efficiency for different\ncylinder engines")

Multiple Box Plot

You can see that fuel economy drops noticeably as the number of cylinders increases. Also note that for 8-cylinder engines a point is plotted beyond the whisker, which marks a potential outlier. Compared to the other cars with an eight-cylinder engine, it has exceptionally low fuel efficiency.
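To read these observations off as numbers rather than from the picture, boxplot() can be called with plot = FALSE. The following sketch lists the group medians and any points that would be drawn beyond the whiskers; the variable names are illustrative.

```r
# Compute the box-plot statistics per cyl level without drawing
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# Median mpg for each cylinder group (third row of bp$stats)
medians <- setNames(bp$stats[3, ], bp$names)
print(medians)

# Points beyond the whiskers, with the group each one belongs to
outliers <- data.frame(cyl = bp$names[bp$group], mpg = bp$out)
print(outliers)
```

Several cars with the same mpg are drawn on top of each other, so a single plotted point can stand for more than one row of outliers.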

We can also group the data by two variables in the same plot. Let’s plot the mpg variable, but this time with a separate box for each combination of cyl and am. am is the transmission variable with two levels: 0 = automatic transmission, 1 = manual transmission.

# One box per am/cyl combination; am varies fastest along the x-axis
boxplot(mpg ~ am * cyl, data = mtcars,
        ylab = "average fuel consumption",
        xlab = "Engine cylinders",
        main = "box-plot for vehicle fuel efficiency for different\ncylinder engines")

Grouped Box Plot

0.4 is the label used for cars with a 4-cylinder engine and automatic transmission. 1.4 is the label used for cars with a 4-cylinder engine and manual transmission.

The median mpg differs considerably between four-cylinder cars with manual and automatic transmissions. This is not the case for cars with an 8-cylinder engine.
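The medians behind this comparison can be tabulated with aggregate(). This is a small sketch on the same mtcars data; the helper gap() is a made-up name used only to compare the two transmission types within one cylinder group.

```r
# Median mpg for every combination of transmission (am) and cylinders (cyl)
med <- aggregate(mpg ~ am + cyl, data = mtcars, FUN = median)
print(med)

# Hypothetical helper: manual minus automatic median for one cyl level
gap <- function(cylinders) {
  rows <- med[med$cyl == cylinders, ]
  rows$mpg[rows$am == 1] - rows$mpg[rows$am == 0]
}

gap4 <- gap(4)  # difference for 4-cylinder cars
gap8 <- gap(8)  # difference for 8-cylinder cars
print(c(gap4 = gap4, gap8 = gap8))
```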

Be careful with the order of the terms in the box-plot formula. In this case, the order is mpg~am*cyl, so the am levels vary fastest: for each cyl level, a box is drawn for am = 0 and then for am = 1.

If the order were changed to mpg~cyl*am, the boxes would be grouped differently: for each am level, a box is drawn for each cyl level in turn.

# One box per cyl/am combination; cyl varies fastest along the x-axis
boxplot(mpg ~ cyl * am, data = mtcars,
        ylab = "average fuel consumption",
        xlab = "Engine cylinders",
        main = "box-plot for vehicle fuel efficiency for different\ncylinder engines")

Grouped Box Plot
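The effect of the term order is easiest to see by comparing the box labels that each formula produces. The sketch below uses plot = FALSE so that nothing is drawn; the variable names are illustrative.

```r
# Box labels for each ordering of the grouping terms
names_am_cyl <- boxplot(mpg ~ am * cyl, data = mtcars, plot = FALSE)$names
names_cyl_am <- boxplot(mpg ~ cyl * am, data = mtcars, plot = FALSE)$names

# am varies fastest: automatic and manual side by side for each cyl level
print(names_am_cyl)

# cyl varies fastest: all cylinder levels for am = 0, then for am = 1
print(names_cyl_am)
```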

A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values. Quantiles are cut points that divide a data set into groups of equal size. The box-plot is also used to visualize the five-number summary.

In R, the boxplot() function is used to create a simple box-plot. A formula can be passed to the function to draw grouped box-plots across the levels of one or more categorical variables.
