Frequency histograms are useful when you want to get an idea of how the values of a numeric variable are distributed. The `hist()` function takes a numeric vector as its main argument.

To construct a histogram, the first step is to “bin” (or “bucket”) the range of values, that is, divide the entire range into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. They are adjacent and typically, though not necessarily, of equal size.
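The binning step can be sketched by hand with `cut()` and `table()`. This is only an illustration of the idea, not what `hist()` runs internally; the breakpoints 10, 15, ..., 35 and the variable names are assumptions chosen to cover the `mpg` values in `mtcars`.

```r
# Illustrative sketch: bin the values by hand and count them.
# The breakpoints are an assumption chosen to cover the range of mpg.
brks <- seq(10, 35, by = 5)

# Assign each value to a right-closed interval such as (10,15]
bins <- cut(mtcars$mpg, breaks = brks, right = TRUE, include.lowest = TRUE)

# Count how many values fall into each interval
bin_counts <- table(bins)
print(bin_counts)

# hist() reports the same counts when given the same breakpoints
h <- hist(mtcars$mpg, breaks = brks, plot = FALSE)
print(h$counts)
```

With `plot = FALSE`, `hist()` skips drawing and only returns the computed bins, which makes it convenient for this kind of check.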

In this article we will learn to:

- Generate a histogram using the `hist()` function
- Check how the number of bins in the histogram helps us understand the underlying distribution of the data
- Overlay a density plot on a histogram
- Enhance a histogram by adding vertical lines for better insights

Let’s get started!

**Generate Histogram**

Let’s generate a histogram of the `mpg` variable from the `mtcars` data set.

`hist(mtcars$mpg)`

The `hist()` function automatically creates the breakpoints (or bins) in the histogram using Sturges' formula (`breaks = "Sturges"`, the default). You can also specify your own bins using the `breaks` argument.
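As a rough illustration, Sturges' formula can be computed by hand and compared with the built-in helper `nclass.Sturges()`. The variable names below are only for this sketch. Note that `hist()` uses the result as a target and then rounds the breakpoints to “pretty” values, so the number of bins drawn can differ from the formula's result.

```r
# Sturges' formula: number of bins = ceiling(log2(n) + 1)
x <- mtcars$mpg
k_manual <- ceiling(log2(length(x)) + 1)

# Built-in helper that applies the same formula
k_builtin <- nclass.Sturges(x)
print(c(manual = k_manual, builtin = k_builtin))

# Breakpoints that hist() actually uses by default
h_default <- hist(x, plot = FALSE)
print(h_default$breaks)
```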

**Histogram with custom breaks**

The breakpoints in the above histogram are generated in steps of 5. We will first use the `seq()` function to generate a sequence in steps of 2 that covers the range of `mpg`, from below its minimum value to above its maximum value. We can then pass this sequence to the `breaks` argument.

```
brk <- seq(5,35,2)
hist(mtcars$mpg,breaks = brk,main="miles per gallon")
```

After changing the bin size, we are able to see the underlying distribution more clearly.

We can also pass a single number to the `breaks` argument to request a number of equal-width bins.

`hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")`

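In this example, we asked for 15 equal-width bins. Note that `hist()` treats a single number as a suggestion and rounds the breakpoints to “pretty” values, so the number of bins actually drawn may differ. This approach can help reveal underlying patterns in the data distribution.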
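Because a single number passed to `breaks` is only a suggestion, it can be worth checking which breakpoints R actually chose. A minimal sketch, using `plot = FALSE` to get the computed bins without drawing anything (the name `h15` is just for this example):

```r
# Ask for about 15 bins without drawing the plot
h15 <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

print(h15$breaks)          # breakpoints R actually chose
print(length(h15$counts))  # number of bins actually used
print(diff(h15$breaks))    # bin widths (all equal)
```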

The regular frequency distribution shows the number of values within each interval. We can also create a histogram with a relative frequency distribution, which shows the share of values within each interval rather than the raw count. In R's version, the bar heights are densities, scaled so that the total area of the bars is always equal to one.

We can generate a histogram with a relative frequency distribution using the `freq = FALSE` argument.

`hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon",freq=FALSE)`
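The claim that the total area equals one can be checked numerically. The sketch below multiplies each bar's density by its width and sums the results; it also computes the plain proportion of values per bin for comparison. The variable names are assumptions made for this example.

```r
# Compute the histogram without drawing it
h_dens <- hist(mtcars$mpg, breaks = 15, plot = FALSE)

# Area of each bar = density (height) * bin width
bin_width <- diff(h_dens$breaks)
total_area <- sum(h_dens$density * bin_width)
print(total_area)

# Proportion of values per bin (these sum to one as well)
rel_freq <- h_dens$counts / sum(h_dens$counts)
print(rel_freq)
```

The density of a bin is its proportion divided by its width, so the two only coincide when the bin width is exactly 1.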

**Histogram with Density Plot**

You can add a kernel density curve to a histogram. A density curve is a smooth curve that helps us visualize the overall shape of the distribution. If we take the histogram we worked with and trace a smooth curve over its bars, we have essentially made a density curve.

The total area under a density curve is always equal to one. A density curve gives us an idealized, smoothed picture of the distribution that plays down small irregularities in the data. A histogram is limited by the number of intervals it has, whereas a density curve can be thought of as having an infinite number of intervals, which produces a smooth curve. This is useful when you are dealing with a very large number of observations.

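You can superimpose a density curve onto a histogram by first using the `density()` function to compute the density estimate and then using the low-level function `lines()` to add the estimate to the plot as a line. The histogram should be drawn with `freq = FALSE` so that the bars and the curve are on the same scale.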

```
dens <- density(mtcars$mpg)
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon",freq=FALSE)
lines(dens)
```
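As a quick sanity check, the area under the estimated curve can be approximated from the grid of points that `density()` returns. This is a rough numerical sketch (a simple Riemann sum over an evenly spaced grid), so the result is close to one rather than exactly one; the names `grid_step` and `area` are only for this example.

```r
# Density estimate evaluated on an evenly spaced grid of x values
dens <- density(mtcars$mpg)

# Approximate the area under the curve: sum of heights * grid spacing
grid_step <- diff(dens$x)[1]
area <- sum(dens$y) * grid_step
print(area)
```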

**Enhance histogram for better insights**

We can highlight specific values or thresholds within a histogram. Adding vertical lines helps us mark important points on the histogram.

To add a solid vertical line at a specific location in a histogram, we can use the `abline()` function in R. Here is an example:

```
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=15,col="red")
```

The `v = 15` argument of the `abline()` function sets the x-value at which the vertical line is drawn. If you want to draw a horizontal line at a given y-value instead, you can use the `h` argument.
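For instance, a horizontal reference line could be added as follows. The value 4 is an arbitrary count chosen only for illustration.

```r
# Draw the histogram; hist() also returns the computed bins invisibly
h <- hist(mtcars$mpg, breaks = 15, main = "Histogram-miles per gallon",
          xlab = "miles per gallon")

# Horizontal line at a frequency of 4 (arbitrary value for illustration)
abline(h = 4, col = "blue")
```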

We can extend the above code. Instead of hard-coding the value of `v`, we can pass the mean of the data, and the `abline()` function will draw the vertical line at the mean.

```
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=mean(mtcars$mpg),col="red")
```

We can also add multiple lines. Let’s also show a line at the median value.

```
hist(mtcars$mpg,breaks = 15,main="Histogram-miles per gallon",xlab="miles per gallon")
abline(v=c(mean(mtcars$mpg),median(mtcars$mpg)),col=c("red","blue"),lwd=3,lty="dashed")
```

The mean is shown in red and the median in blue. The `lwd = 3` argument sets the line width to 3, and the `lty` argument changes the line type to dashed.
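Since the colors alone do not tell a reader which line is which, one possible extension is to add a legend with the base graphics `legend()` function. The position `"topright"` and the label text below are choices made for this sketch.

```r
# Compute the two reference values once so they can be reused
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)

hist(mtcars$mpg, breaks = 15, main = "Histogram-miles per gallon",
     xlab = "miles per gallon")
abline(v = c(mpg_mean, mpg_median), col = c("red", "blue"),
       lwd = 3, lty = "dashed")

# Legend entries use the same colors and line style as the lines
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lwd = 3, lty = "dashed")
```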

**Summary**

A histogram is a very useful visualization tool for checking the distribution of a numeric variable. A histogram can be modified by changing the number of intervals, also called bins.

Because a histogram depends on the choice of interval size, it can be less useful when dealing with very large data sets. In that case, a density plot can be used to check the distribution instead. In this article we generated histograms and overlaid density plots to check the data distribution.