How to draw a histogram in R


Frequency histograms are useful when you want to get an idea of the distribution of values in a numeric variable. The hist() function takes a numeric vector as its main argument.

To construct a histogram, the first step is to “bin” (or “bucket”) the range of values, that is, to divide the entire range into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) are adjacent and are typically (but not required to be) of equal size.
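The binning step described above can be reproduced by hand. The following is a minimal sketch, assuming width-5 bin edges from 10 to 35 for the mpg variable (the edges are chosen here for illustration); it uses cut() to assign values to intervals and table() to count them.

```r
# Bin mpg by hand to mirror what a histogram counts.
x <- mtcars$mpg

# Assumed bin edges: width-5 intervals that cover the range of mpg.
edges <- seq(10, 35, by = 5)

# cut() assigns each value to an interval; right = TRUE and
# include.lowest = TRUE match the defaults used by hist().
bins <- cut(x, breaks = edges, right = TRUE, include.lowest = TRUE)

# table() counts how many values fall into each interval.
counts <- table(bins)
print(counts)
```

These counts should match the bar heights that hist() draws when it is given the same edges through its breaks argument.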

In this article we will learn to:

  • Generate a histogram using the hist() function
  • Check how the number of bins in the histogram helps us understand the underlying distribution of the data
  • Overlay a density plot on a histogram
  • Enhance a histogram by adding vertical lines for better insights

Let’s get started!

Generate Histogram

Let’s generate a histogram of the mpg variable from the mtcars data set.

hist(mtcars$mpg)

Histogram

The hist() function automatically creates the breakpoints (or bins) in the histogram using Sturges' formula. You can also specify your own bins using the breaks argument.
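To see what the default rule produces, the number of classes suggested by Sturges' formula can be computed directly with nclass.Sturges(), and the breakpoints that hist() settled on can be read from the object it returns. This is a small sketch; note that hist() rounds the suggested class count to "pretty" breakpoints, so the final number of bins may differ from the suggestion.

```r
x <- mtcars$mpg

# Sturges' rule: ceiling(log2(n) + 1) classes for n observations.
n_classes <- nclass.Sturges(x)
print(n_classes)

# plot = FALSE returns the histogram object without drawing it,
# so the breakpoints hist() actually chose can be inspected.
h <- hist(x, plot = FALSE)
print(h$breaks)
```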

Histogram with custom breaks

The breakpoints in the above histogram are generated in steps of 5. We will first generate a sequence from (approximately) the minimum value to (approximately) the maximum value of mpg in steps of 2 using the seq() function. We can then pass this sequence to the breaks argument.

brk <- seq(5,35,2)
hist(mtcars$mpg,breaks = brk,main="miles per gallon")

Histogram with different bin width

After changing the bin size, we can see the underlying distribution more clearly.

We can also pass a single number to the breaks argument to request a number of equal-width bins.

hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")

Histogram with equal breaks

In this example, we asked for about 15 equal-width bins. When breaks is a single number, hist() treats it as a suggestion and rounds the breakpoints to "pretty" values, so the actual number of bins may differ. This approach can help reveal underlying patterns in the data distribution.

The regular frequency distribution shows the number of values within each interval. We can also create a histogram of the relative frequency distribution. Here each bar's height is a density: the proportion of values within the interval divided by the interval width. The total area of the bars is therefore always equal to one.

We can generate a histogram of the relative frequency distribution using the freq=FALSE argument.
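When breaks is given as a single number, it is worth checking which bins were actually used. A minimal sketch, using plot = FALSE so that hist() returns its result without drawing:

```r
# Ask for about 15 bins, but do not draw the plot.
h <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

# Breakpoints that hist() actually chose.
print(h$breaks)

# Number of bins and their widths.
n_bins <- length(h$counts)
widths <- diff(h$breaks)
print(n_bins)
print(unique(widths))
```

If an exact set of bins is required, passing an explicit vector of breakpoints (for example from seq()) avoids the rounding.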

hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon",freq=FALSE)

Histogram with relative distribution
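The claim that the area equals one can be checked numerically. The sketch below reads the density values and breakpoints from the object returned by hist() and sums height times width over all bars; it also recovers the plain proportion of values per bin for comparison.

```r
h <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

# Width of each bin.
bin_width <- diff(h$breaks)

# Area of the bars drawn with freq = FALSE: density times width.
total_area <- sum(h$density * bin_width)
print(total_area)

# Proportion of values in each bin (these sum to one as well).
rel_freq <- h$counts / sum(h$counts)
print(rel_freq)
```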

Histogram with Density Plot

You can add a kernel density curve to a histogram. A density curve is simply a curve that helps us visualize the overall shape of the distribution. If we take the histogram we worked with and draw a smooth curve around its distribution, we have essentially made a density curve.

The total area under a density curve is always equal to one. A density curve gives us an idealized picture of the distribution that plays down individual outliers. A histogram is limited by its number of intervals, whereas a density curve is in effect drawn with infinitely many intervals, which produces a smooth curve. This is useful when you are dealing with very large data sets.

You can superimpose a density curve onto a histogram by first using the density() function to compute the density estimate and then using the low-level function lines() to add the estimate to the plot as a line. The histogram must be drawn with freq=FALSE so that the bars and the curve share the same density scale.

dens <- density(mtcars$mpg)
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon",freq=FALSE)
lines(dens)

Histogram with Kernel density plot
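As a rough check that the area under the estimated density curve is close to one, the curve can be integrated numerically. This is only a sketch: it relies on density() returning equally spaced x values and uses a simple rectangle sum, so the result is approximate.

```r
dens <- density(mtcars$mpg)

# density() evaluates the curve on an equally spaced grid of x values.
step <- diff(dens$x)[1]

# Rectangle-rule approximation of the area under the curve.
area <- sum(dens$y) * step
print(area)
```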

Enhance histogram for better insights

We can highlight specific values or thresholds within a histogram. Adding vertical lines to the histogram helps us mark important points on it.
To add a solid vertical line at a specific location in a histogram, we can use the abline() function in R. Here is an example:

hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=15,col="red")

Histogram with vertical line

The argument v=15 in the abline() function sets the x-value at which the vertical line is drawn. If you want to draw a horizontal line at a given y-value instead, you can use the h argument.

We can extend the above code. Instead of hard-coding the value of v, we can use the mean of the data, and abline() will draw the vertical line at the mean.
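For completeness, here is a small sketch of the h argument. It draws a horizontal line at the height of the tallest bar; the choice of that height, the colour and the line type are illustrative assumptions.

```r
# hist() returns its bin counts, which we reuse to place the line.
h <- hist(mtcars$mpg, breaks = 15,
          main = "Histogram-miles per gallon", xlab = "miles per gallon")

# Height of the tallest bar.
tallest <- max(h$counts)

# h = ... draws a horizontal line at that y-value.
abline(h = tallest, col = "blue", lty = "dashed")
```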

hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=mean(mtcars$mpg),col="red")

Histogram with mean line

We can also add multiple lines. Let’s also show a line at the median value.

hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=c(mean(mtcars$mpg),median(mtcars$mpg)),col=c("red","blue"),lwd=3,lty="dashed")

Histogram with mean and median values

The mean value is shown in red and the median value is shown in blue. The lwd=3 argument sets the line width to 3, and the lty argument sets the line type to dashed.
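When several lines are drawn, a legend can make the colours self-explanatory. The following sketch adds one with the base R legend() function; the "topright" position and the label names are assumptions made for illustration.

```r
# Named vector so the same object supplies line positions and labels.
stats <- c(mean = mean(mtcars$mpg), median = median(mtcars$mpg))
line_cols <- c("red", "blue")

hist(mtcars$mpg, breaks = 15,
     main = "Histogram-miles per gallon", xlab = "miles per gallon")

# One dashed vertical line per statistic.
abline(v = stats, col = line_cols, lwd = 3, lty = "dashed")

# Legend entries follow the same order as the lines.
legend("topright", legend = names(stats),
       col = line_cols, lwd = 3, lty = "dashed")
```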

Summary

A histogram is a very useful visualization tool for checking the distribution of a numeric variable. A histogram can be modified by changing the number of intervals, also called bins.

Histograms, because they depend on the choice of interval size, are less useful when dealing with very large data sets. Instead, a density plot can be used to check the distribution of a large data set. In this article we generated histograms and overlaid density plots to check the data distribution.
