Frequency histograms are useful when you want to get an idea of the distribution of values in a numeric variable. The hist() function takes a numeric vector as its main argument.
To construct a histogram, the first step is to “bin” (or “bucket”) the range of values, that is, divide the entire range into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. They are adjacent and are typically, but not necessarily, of equal size.
In this article we will learn to:
- Generate a histogram using the hist() function
- Check how the number of bins in the histogram helps us understand the underlying distribution of the data
- Overlay a density plot on a histogram
- Enhance a histogram by adding vertical lines for better insights
Let’s get started!
Generate Histogram
Let’s generate the histogram of the mpg variable from the mtcars data set.
hist(mtcars$mpg)
The hist() function automatically creates the breakpoints (or bins) in the histogram using Sturges' formula. You can also specify your own bins using the breaks argument.
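As a rough check of this default behaviour, the sketch below computes the bin count suggested by Sturges' formula and then looks at the breakpoints hist() actually uses. It relies on nclass.Sturges() from the grDevices package and on plot = FALSE, which returns the histogram object without drawing it; the variable names are only illustrative.

```r
# Sturges' formula suggests ceiling(log2(n) + 1) bins for n observations
n <- length(mtcars$mpg)
k_manual <- ceiling(log2(n) + 1)
k_sturges <- nclass.Sturges(mtcars$mpg)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(mtcars$mpg, plot = FALSE)

print(c(manual = k_manual, sturges = k_sturges))
print(h$breaks)  # breakpoints actually used, rounded to "pretty" values
print(h$counts)  # number of observations in each bin
```

The suggested count is only a target: hist() rounds the breakpoints to convenient values, so the number of bins drawn can differ from it.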
Histogram with custom breaks
The breakpoints in the above histogram are generated in steps of 5. We will first generate a sequence that spans the range of mpg (roughly from its minimum to its maximum) in steps of 2 using the seq() function. We can then pass this sequence to the breaks argument.
brk <- seq(5,35,2)
hist(mtcars$mpg,breaks = brk,main="miles per gallon")
After changing the bin size, we are able to see the underlying distribution more clearly.
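To see what the binning step does behind the scenes, here is a minimal sketch that counts the values per interval by hand with cut() and table() and compares the result with the counts stored by hist(). It assumes the default right-closed intervals (a, b]; the names bins and manual_counts are only illustrative.

```r
# Same breakpoints as in the example above
brk <- seq(5, 35, 2)

# cut() assigns each value to an interval (a, b]; table() counts per interval
bins <- cut(mtcars$mpg, breaks = brk)
manual_counts <- as.vector(table(bins))

# Counts computed by hist(), without drawing the plot
h <- hist(mtcars$mpg, breaks = brk, plot = FALSE)

print(data.frame(interval = levels(bins), count = manual_counts))
print(all(manual_counts == h$counts))
```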
We can also request equal-width bins by passing a single number to the breaks argument.
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
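In this example, we asked for about 15 equal-width bins. R treats a single number as a suggestion and rounds the breakpoints to convenient values, so the number of bins drawn may differ. This approach can help reveal underlying patterns in the data distribution.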
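A single number passed to breaks is treated by hist() as a suggestion rather than an exact bin count. The short sketch below inspects the histogram object to check how many bins were actually used and whether they share one width; plot = FALSE is used only to skip the drawing.

```r
# Ask for roughly 15 bins and inspect the result without drawing it
h <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

print(h$breaks)                 # breakpoints chosen by R
print(length(h$counts))         # number of bins actually used
print(unique(diff(h$breaks)))   # a single value means equal-width bins
```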
The regular frequency distribution shows the number of values within each interval. We can also create a histogram of the relative frequency distribution, which shows the proportion of values within each interval rather than the count. R draws this as a density histogram: the height of each bar is the proportion divided by the bin width, so the total area of the bars is always equal to one.
We can generate a histogram with the relative frequency distribution using the freq = FALSE argument.
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon",freq=FALSE)
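The claim that the area equals one can be checked numerically. In this sketch, the density component of the histogram object is compared with the proportion per bin divided by the bin width, and the bar areas are summed; widths and rel_freq are illustrative names.

```r
h <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

widths <- diff(h$breaks)               # width of each bin
rel_freq <- h$counts / sum(h$counts)   # proportion of values in each bin

# Bar height with freq = FALSE is the proportion divided by the bin width
print(all.equal(h$density, rel_freq / widths))

# Total area of the bars (height times width)
print(sum(h$density * widths))
```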
Histogram with Density Plot
You can add a kernel density curve to a histogram. A density curve is a smooth curve that helps us visualize the overall shape of the distribution. If we take the histogram we worked with and draw a smooth curve around its outline, we have essentially made a density curve.
The total area under a density curve is always equal to one. A density curve gives us an idealized picture of the distribution that plays down irregularities and isolated outliers. A histogram is limited by its number of intervals, whereas a density curve behaves like a histogram with infinitely many, infinitely narrow intervals, which produces a smooth curve. This is useful when you are dealing with a very large data set.
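You can superimpose a density curve onto a histogram by first using the density() function to compute the density estimate and then using the low-level function lines() to add the estimate to the plot as a line. The histogram must be drawn with freq = FALSE so that the bars and the curve share the same density scale.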
dens <- density(mtcars$mpg)
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon",freq=FALSE)
lines(dens)
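Depending on the data, the density curve can rise above the tallest bar and get clipped at the top of the plot. One possible variation, sketched below, computes the histogram first with plot = FALSE and sets ylim from whichever is taller. It also approximates the area under the curve with the trapezoid rule; the colour and line width are arbitrary choices.

```r
dens <- density(mtcars$mpg)
h <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

# Leave room for whichever is taller: the bars or the curve
y_max <- max(h$density, dens$y)

plot(h, freq = FALSE, ylim = c(0, y_max),
     main = "Histogram-miles per gallon", xlab = "miles per gallon")
lines(dens, col = "blue", lwd = 2)

# Trapezoid-rule approximation of the area under the density curve
area <- sum(diff(dens$x) * (head(dens$y, -1) + tail(dens$y, -1)) / 2)
print(area)  # should be close to one
```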
Enhance histogram for better insights
We can highlight specific values or thresholds within a histogram. Adding vertical lines to the histogram helps us mark important points on it.
To add a solid vertical line at a specific location in a histogram, we can use the abline() function in R. Here is an example:
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=15,col="red")
The argument v = 15 in the abline() function sets the x-value at which the vertical line is drawn. If you want to draw a horizontal line at a given y-value instead, you can use the h argument.
We can extend the above code. Instead of hard-coding the value of v, we can use the mean of the data, and the abline() function will draw the vertical line at the mean.
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=mean(mtcars$mpg),col="red")
We can also add multiple lines. Let’s also show a line at the median value.
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=c(mean(mtcars$mpg),median(mtcars$mpg)),col=c("red","blue"),lwd=3,lty="dashed")
The mean value is shown in red and the median value is shown in blue. The lwd = 3 argument changes the line width to 3 and the lty argument changes the line type to dashed.
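Since the colours carry the meaning here, a legend can make the plot easier to read on its own. The following is one possible sketch using the base graphics legend() function; the "topright" position, the label text, and the names mpg_mean and mpg_median are illustrative choices.

```r
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)

hist(mtcars$mpg, breaks = 15,
     main = "Histogram-miles per gallon", xlab = "miles per gallon")
abline(v = c(mpg_mean, mpg_median), col = c("red", "blue"),
       lwd = 3, lty = "dashed")

# Label each line with its statistic, matching the colours used above
legend("topright",
       legend = c(paste("mean =", round(mpg_mean, 1)),
                  paste("median =", round(mpg_median, 1))),
       col = c("red", "blue"), lwd = 3, lty = "dashed")
```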
Summary
A histogram is a very useful visualization tool for checking the distribution of a numeric variable. A histogram can be modified by changing the number of intervals, also called bins.
Because a histogram depends on the choice of interval size, it can be less informative when dealing with very large data sets. In that case, a density plot is often used to check the distribution instead. In this article we generated histograms and overlaid density plots to check the data distribution.