Histograms and density plots show the distribution of data. We can also summarize a distribution using a few critical points in the data set. These critical points are called quantile values, and they give an impression of the whole distribution.
A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values. In this article we will create box plots and cover the concepts from descriptive statistics needed to interpret them. We will learn:
- What are quantile values of a numeric variable?
- How are quantile values shown on a box plot?
- How to create a box plot in R?
What are quantile values of a numeric variable?
Quantiles are cut points (measures of position) that divide a data set into groups of equal size. The term can also refer to dividing a probability distribution into areas of equal probability.
The median is a quantile because it splits the data into two equal parts: about half of the data lies below the median and about half lies above it. Because the median cuts a distribution into two equal areas, it is sometimes called the 2-quantile.
Quartiles are also quantiles; they divide the distribution into four equal parts. Percentiles divide a distribution into 100 equal parts, and deciles divide it into 10 equal parts.
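As a rough illustration of these definitions, the sketch below computes the median, quartiles and deciles of the mpg variable from the built-in mtcars data set with base R's median() and quantile() functions. The choice of mpg is only for illustration. quantile() uses its default interpolation rule here, so other quantile definitions may give slightly different values.

```r
# Quantiles of the mpg variable from the built-in mtcars data set
mpg <- mtcars$mpg

med       <- median(mpg)                                  # the 2-quantile
quartiles <- quantile(mpg, probs = c(0.25, 0.50, 0.75))   # Q1, Q2, Q3
deciles   <- quantile(mpg, probs = seq(0.1, 0.9, by = 0.1))

print(med)
print(quartiles)
print(deciles)
```

The middle quartile and the median are the same number, which is why Q2 is simply another name for the median.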
A box plot shows the quartile values of the data graphically. The three quartiles are denoted Q1, Q2 and Q3.
- The first quartile Q1 is the value below which the bottom 25% of the data falls. It is also called the lower quartile.
- The second quartile Q2 is the value below which the bottom 50% of the data falls. This is the median of the data.
- The third quartile Q3 is the value below which the bottom 75% of the data falls. It is also called the upper quartile.
These three quartiles are shown below:
The interquartile range (IQR) is the range of values between the first and third quartiles. It spans the middle 50% of the data and is calculated as Q3 - Q1.
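The calculation can be checked in R. This is a minimal sketch, again assuming the mpg variable from mtcars as example data: it computes Q3 - Q1 by hand and compares the result with the built-in IQR() function.

```r
# Interquartile range of mpg, computed by hand and with IQR()
mpg <- mtcars$mpg

q1 <- unname(quantile(mpg, 0.25))   # lower quartile
q3 <- unname(quantile(mpg, 0.75))   # upper quartile

iqr_manual  <- q3 - q1
iqr_builtin <- IQR(mpg)

print(c(Q1 = q1, Q3 = q3, IQR = iqr_manual))
print(iqr_builtin)
```

Both approaches should agree, because IQR() is based on the same default quantile rule.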
How are quantile values shown on a box plot?
A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values.
The plot is constructed by using a box to enclose the median. The box extends outwards from the median to the lower and upper quartiles, so it encloses not only the median but also the middle 50% of the data. From the lower and upper quartiles, lines referred to as whiskers extend out from the box towards the outermost values.
Thus a box-and-whisker plot shows the five-number summary:
- The median (Q2)
- The lower quartile (Q1)
- The upper quartile (Q3)
- The smallest value in the distribution
- The largest value in the distribution
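These five values can be obtained directly in R. The sketch below uses base R's fivenum() on the mpg variable from mtcars (an illustrative choice of data) and prints summary() alongside it. The names attached to the result are added here only for readability.

```r
# Five-number summary of mpg
mpg <- mtcars$mpg

five <- fivenum(mpg)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")

print(five)
print(summary(mpg))   # also reports the mean
```

fivenum() returns the lower and upper hinges (the box endpoints) rather than quartiles computed by quantile(), so the second and fourth values can differ slightly from what summary() reports.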
The box endpoints (Q1 and Q3) are referred to as the hinges of the box. The interquartile range (IQR) is computed as Q3 - Q1 and covers the middle 50% of the data. One whisker, a line segment, is drawn from the lower hinge of the box outwards to the smallest data value, and a second whisker is drawn from the upper hinge outwards to the largest data value. When outliers are shown as separate points, each whisker instead stops at the most extreme data value that lies within the inner fences. The inner fences are located at a distance of 1.5*IQR outwards from the lower and upper quartiles:
Q1 - 1.5*IQR
Q3 + 1.5*IQR
If data falls beyond the inner fences, outer fences can be constructed at:
Q1 - 3.0*IQR
Q3 + 3.0*IQR
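The fence formulas translate directly into R. The following is a sketch under the assumption that Q1 and Q3 are taken from quantile() with its default rule, using mpg from mtcars as example data. The variable names are illustrative.

```r
# Inner and outer fences for mpg, based on quantile() quartiles
mpg <- mtcars$mpg

q1  <- unname(quantile(mpg, 0.25))
q3  <- unname(quantile(mpg, 0.75))
iqr <- q3 - q1

inner_fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
outer_fences <- c(lower = q1 - 3.0 * iqr, upper = q3 + 3.0 * iqr)

print(inner_fences)
print(outer_fences)
```

Note that boxplot() itself works from the hinges returned by fivenum(), so the fences it uses can differ slightly from ones based on quantile().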
Box plots are also used to show outliers. Data values that lie outside the mainstream of values in a distribution are called outliers.
Values that are outside the inner fences but within the outer fences are referred to as mild outliers. Values that are outside the outer fences are called extreme outliers.
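To make the distinction concrete, here is a small sketch that classifies values as mild or extreme outliers. The vector x is made-up data chosen so that it contains one value of each kind, and the quartiles are assumed to come from quantile() with its default rule.

```r
# Hypothetical data with one mild and one extreme outlier
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 18, 30)

q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Logical flags: beyond the inner fences, and beyond the outer fences
beyond_inner <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
beyond_outer <- x < q1 - 3.0 * iqr | x > q3 + 3.0 * iqr

mild_outliers    <- x[beyond_inner & !beyond_outer]
extreme_outliers <- x[beyond_outer]

print(mild_outliers)
print(extreme_outliers)
```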
How to create a box plot in R?
Horizontal and vertical box plots are produced in R by the boxplot() function. Let’s produce a simple box plot for the mpg variable from the mtcars data set. We can add a Y-axis label using the ylab argument.
# Box plot of a single numeric variable
boxplot(mtcars$mpg, ylab = "average fuel consumption",
        main = "box-plot for vehicle fuel efficiency")
The thick horizontal line in the middle of the box is the median value of the mpg variable. The upper edge of the box is the upper quartile and the lower edge is the lower quartile. The distance between the upper and lower quartiles is the interquartile range, which spans the middle 50% of the data. The dashed lines at both ends of the box are called whiskers. If the median line is approximately in the middle of the box and the whiskers are more or less the same length, you can assume that the distribution of the data is roughly symmetrical.
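The same plot can be drawn horizontally, and the numbers behind it can be inspected. This sketch assumes the horizontal argument of boxplot() and uses the value that boxplot() returns invisibly. The axis label and title are illustrative choices.

```r
# Horizontal box plot of mpg; keep the returned statistics for inspection
bp <- boxplot(mtcars$mpg, horizontal = TRUE,
              xlab = "Miles per gallon",
              main = "Horizontal box plot of mpg")

# Rows of bp$stats: lower whisker end, lower hinge, median,
# upper hinge, upper whisker end
print(bp$stats)

# Points drawn individually beyond the whiskers, if any
print(bp$out)
```

Comparing the distances from the median to each hinge and from each hinge to its whisker end is a quick numeric check of the symmetry described above.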
If we want to examine how the distribution of a variable changes between the categories of a categorical variable, we need to use the formula notation with the boxplot() function. For example, let’s plot the mpg variable again, but this time see how it changes with each level of the cyl variable.
boxplot(mtcars$mpg~mtcars$cyl,ylab = "average fuel consumption",
xlab="Engine cylinders",
main="box-plot for vehicle fuel efficiency for different
cylinder engines")
You can see that a car’s fuel economy decreases markedly as the number of cylinders increases. Also note that, for 8-cylinder engines, one data point is plotted beyond the whisker, which marks a potential outlier. Compared to other cars with an eight-cylinder engine, this one has exceptionally low fuel efficiency.
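The pattern in the plot can be cross-checked numerically. The sketch below is one possible way to do it: tapply() gives the median mpg per cylinder count, and boxplot.stats() reports the values that would be drawn as individual points for the 8-cylinder group. The variable names are illustrative.

```r
# Median mpg for each number of cylinders
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
print(medians)

# Values flagged as potential outliers among 8-cylinder cars
eight <- mtcars$mpg[mtcars$cyl == 8]
out8  <- boxplot.stats(eight)$out
print(out8)

# Which cars those values belong to
print(rownames(mtcars)[mtcars$cyl == 8 & mtcars$mpg %in% out8])
```

Cars with identical mpg values are drawn on top of each other in the box plot, so a single plotted point may stand for more than one car.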
We can also group by two variables in the same plot. Let’s plot the mpg variable again, but this time draw a separate box for each combination of cyl and am. am is the transmission variable with two levels (0 = automatic transmission, 1 = manual transmission).
# One box per transmission/cylinder combination; labels read am.cyl
boxplot(mpg ~ am * cyl, data = mtcars, ylab = "average fuel consumption",
xlab = "Transmission.Engine cylinders",
main = "box-plot for vehicle fuel efficiency for different
cylinder engines")
0.4 is the label used for cars with a 4-cylinder engine and automatic transmission. 1.4 is the label used for cars with a 4-cylinder engine and manual transmission.
There is a large difference in median fuel consumption between four-cylinder cars with manual and automatic transmissions. This is not the case for cars with 8-cylinder engines.
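The medians behind this comparison can be tabulated. This is a minimal sketch using base R's aggregate() with a formula. The gap4 and gap8 names are introduced here only to compare the two transmission types within each cylinder group.

```r
# Median mpg for each transmission/cylinder combination
combo_medians <- aggregate(mpg ~ am + cyl, data = mtcars, FUN = median)
print(combo_medians)

# Absolute difference between manual and automatic medians
gap4 <- abs(diff(combo_medians$mpg[combo_medians$cyl == 4]))
gap8 <- abs(diff(combo_medians$mpg[combo_medians$cyl == 8]))
print(c(four_cylinder = gap4, eight_cylinder = gap8))
```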
You need to be careful about the order of the terms in the box plot formula. In this case, the order is mpg~am*cyl, which first creates boxes of mpg for each level of am and then groups them by cyl.
If the order were changed to mpg~cyl*am, a different box plot grouping would be created.
# Same data with the terms swapped; labels now read cyl.am
boxplot(mpg ~ cyl * am, data = mtcars, ylab = "average fuel consumption",
xlab = "Engine cylinders.Transmission",
main = "box-plot for vehicle fuel efficiency for different
cylinder engines")
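One way to see the effect of the term order without comparing the figures by eye is to look at the group labels that boxplot() generates. The sketch below assumes the plot = FALSE argument, which returns the statistics without drawing anything. The object names are illustrative.

```r
# Compute the box plot statistics without drawing, for both term orders
by_am_first  <- boxplot(mpg ~ am * cyl, data = mtcars, plot = FALSE)
by_cyl_first <- boxplot(mpg ~ cyl * am, data = mtcars, plot = FALSE)

# Group labels in the order the boxes would be drawn
print(by_am_first$names)
print(by_cyl_first$names)
```

In each case the first term in the formula varies fastest along the axis, so boxes that share a level of the second term sit next to each other.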
Summary
A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values. Quantiles are cut points that divide a data set into groups of equal size. The box plot is also used to visualize the five-number summary.
In R, the boxplot() function is used to create a simple box plot. It can be extended further by writing a formula that groups the boxes by one or more categorical variables.