Histograms and density plots show the distribution of the data. We can also summarize a distribution using a few critical points in the data set. These critical points are called *quantile values*, and they give an impression of the whole distribution.

A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values. In this article we will create box plots and cover the concepts from descriptive statistics needed to interpret them. We will learn:

- What are quantile values of a numeric variable?
- How are quantile values shown on the box-plot?
- How to create box-plot in R?

**What are quantile values of a numeric variable?**

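Quantiles are cut points that divide a data set into groups of equal size. They are measures of position rather than of *central tendency*; only the median, the middle quantile, also describes the center of the data. The term can also refer to dividing a probability distribution into areas of equal probability.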

The *median* is a quantile, because it splits the data into two equal parts. Half of the data lies below the median and half lies above it. The median cuts a distribution into two equal areas, so it is sometimes called the 2-quantile.

*Quartiles* are also *quantiles*: they divide the distribution into four equal parts. *Percentiles* are quantiles that divide a distribution into 100 equal parts, and *deciles* are quantiles that divide a distribution into 10 equal parts.
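As a rough illustration of these definitions, base R's `quantile()` function returns the cut points for any set of probabilities. The sketch below uses the built-in `mtcars$mpg` variable (the same one plotted later in this article) and assumes R's default interpolation method; other methods can give slightly different values.

```r
# Fuel efficiency values from the built-in mtcars data set
x <- mtcars$mpg

# The median is the 0.5 quantile (the single cut point of the "2-quantile")
med <- quantile(x, probs = 0.5, names = FALSE)

# Quartiles: three cut points that split the data into four groups
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Deciles: nine cut points that split the data into ten groups
deciles <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))

print(med)
print(quartiles)
print(deciles)
```

Percentiles follow the same pattern with `probs = seq(0.01, 0.99, by = 0.01)`.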

A box plot shows the *quartile* values of the data graphically. The three quartiles are denoted Q1, Q2 and Q3.

- The first quartile Q1 is the value below which the bottom 25% of the data lies. It is also called the lower quartile.
- The second quartile Q2 is the value below which the bottom 50% of the data lies. This is the median of the data.
- The third quartile Q3 is the value below which the bottom 75% of the data lies. It is also called the upper quartile.

These three quartiles are shown below:

The *interquartile range* (IQR) is the distance between the first and third quartiles. It is the range covered by the middle 50% of the data and is calculated as Q3 - Q1.
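The quartiles and the interquartile range can be checked numerically. The following is a minimal sketch on `mtcars$mpg`, assuming R's default quantile method; the variable names (`q1`, `iqr_manual`, and so on) are only illustrative.

```r
x <- mtcars$mpg

# Q1, Q2 (median) and Q3 as plain numbers
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
q1 <- q[1]
q2 <- q[2]
q3 <- q[3]

# Interquartile range computed by hand and with the built-in helper
iqr_manual  <- q3 - q1
iqr_builtin <- IQR(x)

# Share of observations at or below each quartile;
# with ties and a small sample these are only roughly 0.25, 0.50 and 0.75
share_below <- sapply(q, function(cut) mean(x <= cut))

print(c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = iqr_manual))
print(share_below)
```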

**How are quantile values shown on the box-plot?**

A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values.

The plot is constructed by drawing a box around the median. The *box* extends outwards from the median to the lower and upper quartiles, so it encloses not only the median but also the middle 50% of the data. From the lower and upper quartiles, lines referred to as *whiskers* extend out from the box towards the outermost values.

Thus a box-and-whisker plot shows the five-number summary:

- The median (Q2)
- The lower quartile (Q1)
- The upper quartile (Q3)
- The smallest value in the distribution
- The largest value in the distribution
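Base R offers `fivenum()` for this summary. One caveat worth knowing: `fivenum()` returns Tukey's hinges, which can differ slightly from the quartiles that `quantile()` returns under its default method. The labels assigned below are only for readability and are not part of the function's output.

```r
x <- mtcars$mpg

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)

# For comparison: quartiles from quantile() with its default method
print(quantile(x, probs = c(0.25, 0.75)))
```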

The box endpoints (Q1 and Q3) are referred to as the *hinges* of the box. The interquartile range (IQR) is computed as Q3 - Q1 and covers the middle 50% of the data. In the simplest form of the plot, one *whisker*, a line segment, is drawn from the lower hinge outwards to the smallest data value, and a second whisker is drawn from the upper hinge outwards to the largest data value. When outliers are marked separately, as R's `boxplot` does by default, each whisker instead stops at the most extreme data value that still lies within the inner fence. The *inner fences* lie at a distance of 1.5 × IQR outwards from the lower and upper quartiles:

Q1 - 1.5 × IQR

Q3 + 1.5 × IQR

If data values fall beyond the inner fences, *outer fences* can be constructed at a distance of 3 × IQR from the quartiles:

Q1 - 3.0 × IQR

Q3 + 3.0 × IQR

Box plots are also used to show *outliers*, which are data values that lie outside the mainstream of values in a distribution.

Values that lie outside the inner fences but within the outer fences are referred to as *mild outliers*. Values that lie outside the outer fences are called *extreme outliers*.
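The fence rules can be sketched in a few lines of R. The vector below is made up purely for illustration (it is not taken from any data set), and the quartiles come from `quantile()` with its default method, which is an assumption; a plot that uses hinges instead may place the fences slightly differently.

```r
# Illustrative values only: one moderately large and one very large value
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 18, 60)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Inner fences at 1.5 x IQR, outer fences at 3 x IQR from the quartiles
inner <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
outer <- c(lower = q1 - 3.0 * iqr, upper = q3 + 3.0 * iqr)

outside_inner <- x < inner[["lower"]] | x > inner[["upper"]]
outside_outer <- x < outer[["lower"]] | x > outer[["upper"]]

# Mild outliers: beyond the inner fences but within the outer fences
mild <- x[outside_inner & !outside_outer]

# Extreme outliers: beyond the outer fences
extreme <- x[outside_outer]

print(inner)
print(outer)
print(mild)
print(extreme)
```

With these values, 18 falls between the upper inner and upper outer fences, while 60 lies beyond the upper outer fence.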

**How to create box-plot in R?**

Horizontal and vertical box plots are produced in R by the `boxplot` function. Let's produce a simple box plot for the `mpg` variable from the `mtcars` data set. We can add a Y-axis label using the `ylab` argument.

```
# Box plot of fuel efficiency (miles per gallon) for all cars in mtcars
boxplot(mtcars$mpg,
        ylab = "Fuel efficiency (miles per gallon)",
        main = "Box plot of vehicle fuel efficiency")
```

The thick horizontal line in the middle of the box is the *median* value of the `mpg` variable. The upper edge of the box is the *upper quartile* and the lower edge is the *lower quartile*. The distance between the upper and lower quartiles is the *interquartile range*, which spans the middle 50% of the data. The dashed lines extending from both ends of the box are the *whiskers*. If the median line is approximately in the middle of the box and the whiskers are more or less the same length, the distribution of the data is likely to be roughly symmetrical.
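To read off the numbers behind the plot, `boxplot()` can be called with `plot = FALSE`, in which case it returns the statistics instead of drawing. The sketch below also shows the `horizontal` argument for a horizontal layout; the axis label is just an illustrative choice.

```r
# Compute the box plot statistics without drawing anything
b <- boxplot(mtcars$mpg, plot = FALSE)

# 5 x 1 matrix: lower whisker end, lower hinge, median,
# upper hinge, upper whisker end
print(b$stats)

# Values that would be drawn as individual points beyond the whiskers
# (this can be empty)
print(b$out)

# The same variable drawn as a horizontal box plot
boxplot(mtcars$mpg, horizontal = TRUE,
        xlab = "Fuel efficiency (miles per gallon)")
```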

If we want to examine how the distribution of a variable changes across the categories of a categorical variable, we need to use the formula notation with the `boxplot()` function. For example, let's plot the `mpg` variable again, but this time see how it changes with each level of the `cyl` variable.

```
# One box per cylinder count: mpg grouped by cyl
boxplot(mpg ~ cyl, data = mtcars,
        ylab = "Fuel efficiency (miles per gallon)",
        xlab = "Engine cylinders",
        main = "Fuel efficiency by number of engine cylinders")
```

You can see that fuel economy drops markedly as the number of cylinders increases. Also note that for eight-cylinder engines a point is plotted below the lower whisker, which marks a potential outlier: a fuel efficiency that is exceptionally low compared with the other eight-cylinder cars.

We can also group by two variables in the same plot. Let's plot the `mpg` variable, but this time draw a separate box for each combination of `cyl` and `am`. (`am` is the transmission variable with two levels: 0 = automatic transmission, 1 = manual transmission.)

```
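# One box per transmission/cylinder combination (am varies fastest)
boxplot(mpg ~ am * cyl, data = mtcars,
        ylab = "Fuel efficiency (miles per gallon)",
        xlab = "Transmission (0 = automatic, 1 = manual) and engine cylinders",
        main = "Fuel efficiency by transmission and cylinders")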
```

The label `0.4` is used for cars with a 4-cylinder engine and automatic transmission. The label `1.4` is used for cars with a 4-cylinder engine and manual transmission.

The median fuel efficiency differs considerably between four-cylinder cars with manual transmissions and those with automatic transmissions. This is not the case for cars with 8-cylinder engines.
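The comparison of medians can also be checked numerically. A minimal sketch, using `tapply()` to compute the median `mpg` for every transmission and cylinder combination; the names `gap_4cyl` and `gap_8cyl` are only illustrative.

```r
# Median mpg for each combination of transmission (rows) and cylinders (columns)
medians <- tapply(mtcars$mpg,
                  list(am = mtcars$am, cyl = mtcars$cyl),
                  median)
print(medians)

# Difference in median mpg between manual (1) and automatic (0) cars
gap_4cyl <- medians["1", "4"] - medians["0", "4"]
gap_8cyl <- medians["1", "8"] - medians["0", "8"]
print(c(four_cylinder = gap_4cyl, eight_cylinder = gap_8cyl))
```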

You need to be careful about the order of the terms in the box plot formula. In this case, the formula is `mpg~am*cyl`, which places the boxes for the two `am` levels next to each other within each level of the `cyl` variable.

If the order is changed to `mpg~cyl*am`, a different box plot grouping is created.

```
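# One box per cylinder/transmission combination (cyl varies fastest)
boxplot(mpg ~ cyl * am, data = mtcars,
        ylab = "Fuel efficiency (miles per gallon)",
        xlab = "Engine cylinders and transmission (0 = automatic, 1 = manual)",
        main = "Fuel efficiency by cylinders and transmission")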
```
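To see exactly how the order of the formula terms changes the grouping, the box labels can be inspected without drawing. This is a small sketch that relies on `plot = FALSE`, which makes `boxplot()` return its statistics, including the group names in plotting order.

```r
# Group labels in the order the boxes are drawn, for both formulas
order_am_cyl <- boxplot(mpg ~ am * cyl, data = mtcars, plot = FALSE)$names
order_cyl_am <- boxplot(mpg ~ cyl * am, data = mtcars, plot = FALSE)$names

# am * cyl: the two transmissions sit side by side within each cylinder count
print(order_am_cyl)

# cyl * am: the three cylinder counts sit side by side within each transmission
print(order_cyl_am)
```

In both cases the first term of the formula varies fastest along the axis.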

**Summary**

A box plot (or box-and-whisker plot) is a simple way to show the distribution of data using quantile values. Quantiles are cut points that divide a data set into groups of equal size. The box plot also visualizes the five-number summary.

In R, the `boxplot` function is used to create a simple box plot. It can be extended by writing a formula that groups the boxes by one or more categorical variables.