An intuitive understanding of the Normal Distribution using a business example

Friction Wheels with Parallel Axes
Image Source: Cornell University. http://kmoddl.library.cornell.edu/ (released 2003).

The normal distribution is probably the most widely known and used of all distributions. In the real world, many human characteristics, such as height, weight, and IQ score, have relative frequency curves that are closely approximated by the normal distribution.

Many variables in business and industry are also approximately normally distributed. Examples include annual returns from a stock, the cost per square foot of renting warehouse space, and the quantities of items produced or filled by machines. The normal distribution is an integral part of statistical process control. When large enough samples are taken (as a rule of thumb, more than about 30 observations), many sample statistics, such as the sample mean, are approximately normally distributed regardless of the shape of the underlying distribution from which the samples are drawn.

In this tutorial, we will take a gentle look at the normal distribution using a real-world example. We will generate normal distribution plots in R and learn some R functions for calculating normal probabilities. So let's get started.

Characteristics of Normal distribution

The following figure displays the characteristics of the normal distribution.

**Normal distribution**
Image source: Business Statistics: For Contemporary Decision Making, Wiley, 2020

The normal distribution exhibits the following properties.

It is a continuous distribution
It is symmetrical about the mean
The area under its curve is 1
It is asymptotic to the horizontal axis
It is unimodal
It is a family of curves

The normal distribution is symmetrical. Each half of the distribution is the mirror image of the other half.

In theory, the normal distribution is asymptotic to the horizontal axis: the curve never touches the axis and extends indefinitely in each direction. However, most real-world data sets have finite limits, so this asymptotic behavior is not observed. For example, a share price varies between its minimum and maximum values over a given time period.

The normal distribution is bell shaped: most of its values cluster around the center of the distribution. The normal distribution is actually a family of curves, because every unique pair of mean and standard deviation values results in a different normal curve.

The total area under any normal curve is 1. Area under the curve represents probability, so the probabilities for a normal distribution sum to 1. Because the distribution is symmetric, the area on each side of the mean is 0.5.
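
These two area properties can be checked numerically. The sketch below is illustrative only and assumes base R; the mean of 7 and standard deviation of 2 are arbitrary example values, and integrate() is used to integrate the density dnorm() numerically.

```r
# Arbitrary example parameters (assumed for illustration)
mu <- 7
sigma <- 2

# Total area under the density curve, by numerical integration
total_area <- integrate(dnorm, lower = -Inf, upper = Inf,
                        mean = mu, sd = sigma)$value

# Area on each side of the mean
left_area  <- integrate(dnorm, lower = -Inf, upper = mu,
                        mean = mu, sd = sigma)$value
right_area <- integrate(dnorm, lower = mu, upper = Inf,
                        mean = mu, sd = sigma)$value

# Expect values very close to 1, 0.5 and 0.5 (up to numerical error)
print(c(total = total_area, left = left_area, right = right_area))
```

Changing mu and sigma should leave all three areas unchanged, which is what "the total area under any normal curve is 1" means in practice.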

History of Normal Distribution

The normal probability model was developed by one of the greatest mathematicians of all time, Johann Carl Friedrich Gauss (1777-1855). Most mathematicians of Gauss's period were interested in applying mathematics to real-world problems, such as astronomy and navigation. Knowing that data obtained through observations and measurements contained errors due to the imprecision of the measuring instruments employed, Gauss studied many different sets of such data and noticed that they all possessed certain common characteristics. He observed that each set of observations was symmetric about some central value and that, farther from this central value, the observations became fewer in number. Based on such studies, Gauss was able to develop a mathematical model that could be used to describe the distribution of errors contained in these sets of data. Gauss called the mathematical model the normal probability model, and the distribution of errors the normal probability distribution.

The Probability Density Function

The normal distribution is described by two parameters: the mean, mu, and the standard deviation, sigma.
Each pair of values of mu and sigma produces a specific normal distribution with the following density function:

\[f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

This formula is rarely used directly by statisticians, because the area under the normal curve cannot be obtained from it in closed form, and evaluating the integral by hand is difficult and time-consuming. Therefore, statisticians have traditionally used table values to analyze normal distribution problems. In R, functions such as pnorm() compute these areas numerically, so no table lookup is needed.
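
The density can also be evaluated by hand and compared with R's built-in dnorm(). The following is a minimal sketch: normal_pdf is a hypothetical helper name introduced here for illustration, and it implements the standard normal density, in which the exponent is -(x - mu)^2 / (2 * sigma^2).

```r
# Hypothetical helper: the normal density written out term by term
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-4, 4, by = 0.5)                  # a few evaluation points

manual  <- normal_pdf(x, mu = 0, sigma = 1) # hand-written formula
builtin <- dnorm(x, mean = 0, sd = 1)       # R's built-in density

# TRUE when the two agree up to floating-point tolerance
print(all.equal(manual, builtin))
```

In practice dnorm() is the better choice; the hand-written version is only there to make the formula concrete.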

Standardized Normal distribution

Every unique pair of values of mu and sigma defines a different normal distribution. The figure shows normal distributions for the following three pairs of parameters:

\[\mu = 0 , \sigma=1\] \[\mu = 0.5 , \sigma=2\] \[\mu = 1 , \sigma=0.5\]
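
A figure of this kind can be reproduced with dnorm(). The sketch below is one possible way to draw the three curves; the x range of -6 to 6, the line types and the legend placement are illustrative choices, not taken from the original figure.

```r
x <- seq(-6, 6, length.out = 200)

# Densities for the three (mu, sigma) pairs listed above
d1 <- dnorm(x, mean = 0,   sd = 1)
d2 <- dnorm(x, mean = 0.5, sd = 2)
d3 <- dnorm(x, mean = 1,   sd = 0.5)

# Draw the tallest curve first so the y-axis range fits all three
plot(x, d3, type = "l", lty = 3, xlab = "x", ylab = "density",
     main = "Three normal distributions")
lines(x, d1, lty = 1)
lines(x, d2, lty = 2)
legend("topleft",
       legend = c("mu = 0, sigma = 1", "mu = 0.5, sigma = 2", "mu = 1, sigma = 0.5"),
       lty = c(1, 2, 3))
```

The mean shifts the curve left or right, while a smaller standard deviation makes it taller and narrower.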

Since each particular combination of mu and sigma makes a unique normal distribution curve, this could make analysis by the normal distribution tedious because volumes of normal curve tables – one for each different combination of mu and sigma – would be required.

Hence, a mechanism was developed by which all normal distributions can be converted into a single distribution: the z distribution. The z distribution is also called the standardized normal distribution. The z formula is given as:

\[z=\frac{x-\mu}{\sigma}, \sigma\neq 0\]

The z score is the number of standard deviations that a value, x, lies above or below the mean. If x is less than the mean, the z score is negative; if x is greater than the mean, the z score is positive; and if x equals the mean, the z score is zero.

The z distribution is a normal distribution with a mean of 0 and a standard deviation of 1. Any value of x at the mean of a normal curve is zero standard deviations from the mean. Any value of x that is one standard deviation above the mean has a z value of 1. As per the empirical rule, about 68% of all values of a normal distribution lie within one standard deviation of the mean, regardless of the values of mu and sigma. About 95% of the values lie within 2 standard deviations of the mean, and about 99.7% lie within 3 standard deviations of the mean.
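
The percentages in the empirical rule can be reproduced with pnorm(). This is a small sketch under stated assumptions: the second calculation uses mu = 7 and sigma = 2 purely as example values, to show that the result does not depend on the parameters.

```r
k <- 1:3   # number of standard deviations from the mean

# Area within k standard deviations on the standard normal (z) scale
within_k_z <- pnorm(k) - pnorm(-k)
print(round(within_k_z, 4))   # 0.6827 0.9545 0.9973

# Same calculation for an arbitrary normal distribution (example values)
mu <- 7
sigma <- 2
within_k_x <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
              pnorm(mu - k * sigma, mean = mu, sd = sigma)

# TRUE: the proportions do not depend on mu and sigma
print(all.equal(within_k_z, within_k_x))
```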

The following figure shows the standard normal (z) distribution.

x <- seq(-4,4,length=100)
hx <- dnorm(x,0,1) # Generate normal distribution curve
plot(hx~x,type="l",xlab="z-score",ylab="density",main="standard normal distribution")

Solving for probabilities using Normal curve

We will take the example of a company's sales page load time. Past data indicate that the sales page load time is normally distributed, with mean mu = 7 seconds and standard deviation sigma = 2 seconds. The normal distribution of the sales page load time is shown below:

load_time <- seq(1,13,length=50) # grid of sales page load times (seconds)
hx <- dnorm(load_time,7,2) # generate normal distribution curve
plot(hx~load_time,type="l")

As a property of the normal distribution, the total area under the curve is 1, so every probability value lies between 0 and 1.

The probability that a single randomly selected page load time will be lower than a particular load time x is found by calculating the area under the normal distribution curve to the left of x.

The probability that a single randomly selected page load time will be higher than a particular load time x is found by calculating the area under the normal distribution curve to the right of x.

Because the normal distribution is symmetric and the area under the curve is equal to 1, the probability that a single randomly selected page load time is above the mean is 0.5, and the probability that it is below the mean is also 0.5.
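
The left and right areas can be viewed as functions of x. The sketch below is illustrative: it evaluates pnorm() over the same grid of load times used for the density plot, takes the right-hand area as the complement of the left-hand one, and marks the mean, where both areas equal 0.5. The variable names p_below, p_above and p_at_mean are introduced here for illustration.

```r
load_time <- seq(1, 13, length.out = 50)        # grid of load times (seconds)

p_below <- pnorm(load_time, mean = 7, sd = 2)   # area to the left of each x
p_above <- 1 - p_below                          # area to the right of each x

plot(load_time, p_below, type = "l",
     xlab = "Page loading time (seconds)", ylab = "probability")
lines(load_time, p_above, lty = 2)              # dashed: area to the right
abline(v = 7, lty = 3)                          # dotted: the mean

# At the mean, the left and right areas are both 0.5
p_at_mean <- pnorm(7, mean = 7, sd = 2)
print(p_at_mean)
```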

With the help of the normal curve, we can answer questions such as:

What is the probability that sales page load time will be less than 5 seconds?
What is the probability that sales page load time will be more than 10 seconds?
What is the probability that sales page load time will be between 6 seconds and 10 seconds?

Sales page load time less than 5 seconds

To find the probability that the sales page load time will be less than 5 seconds, we use the cumulative distribution function pnorm():

pnorm(5,mean=7,sd=2)

[1] 0.1586553

This result means that the area under this normal curve to the left of 5 seconds is about 0.159. In other words, if we randomly select a page loading time from a normal distribution with mean 7 and standard deviation 2, the probability that it falls below 5 seconds is about 0.159.

load_time <- seq(1,13,length=50) # create page loading time
hx <- dnorm(load_time,7,2) # generate normal distribution with mean=7 and sd=2
plot(hx~load_time,type="l",xlab="Page loading time")

# values on x-axis from 0 to 5 seconds are shaded
cord.x <- c(0,seq(0,5,length=50),5)

# value shades below the normal curve
cord.y <- c(0,dnorm(seq(0,5,length=50),7,2),0)

polygon(cord.x,cord.y,col="grey")# display area of curve left to 5 seconds

**Cumulative probability of page loading time less than 5 sec.**
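
The same probability can be obtained by first converting 5 seconds to a z score and then using the standard normal distribution. The following is a minimal sketch of that route; the variable names are chosen here for illustration.

```r
mu <- 7       # mean load time (seconds)
sigma <- 2    # standard deviation (seconds)
x <- 5        # load time of interest (seconds)

z <- (x - mu) / sigma                          # z score: (5 - 7) / 2 = -1

p_direct <- pnorm(x, mean = mu, sd = sigma)    # using mu and sigma directly
p_via_z  <- pnorm(z)                           # using the standard normal

print(c(z = z, direct = p_direct, via_z = p_via_z))
```

Both calls return the same area, which is why a single standard normal table is enough for every combination of mu and sigma.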

Sales page load time more than 10 seconds

To find the probability that the sales page load time will be more than 10 seconds, we again use the cumulative distribution function pnorm(), this time for the upper tail:

pnorm(10,mean=7,sd=2,lower.tail = FALSE)

[1] 0.0668072

We use the same cumulative distribution function, pnorm(), as before. However, we now want the area to the right of x = 10, and pnorm() returns the area to the left by default, so we set the option lower.tail=FALSE. This ensures that the probability in the upper tail region is calculated.

This result means that the area under this normal curve to the right of 10 seconds is about 0.067. In other words, if we randomly select a page loading time from a normal distribution with mean 7 and standard deviation 2, the probability that it falls above 10 seconds is about 0.067.

load_time <- seq(1,13,length=50) # create page loading time
hx <- dnorm(load_time,7,2) # generate normal distribution with mean=7 and sd=2
plot(hx~load_time,type="l",xlab="Page loading time")

# values on x-axis from 10 to 13 seconds are shaded
cord.x <- c(10,seq(10,13,length=50),13) # last vertex closes the polygon at x=13

# value shades below the normal curve
cord.y <- c(0,dnorm(seq(10,13,length=50),7,2),0)

polygon(cord.x,cord.y,col="grey")# display area of curve right to 10 seconds

**Cumulative probability of page loading time more than 10 sec.**
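
The upper-tail probability can also be written as one minus the lower-tail probability. The sketch below compares the two forms; the load time of 40 seconds in the second part is a hypothetical extreme value, chosen only to show where the two forms differ numerically.

```r
# Two ways to get the area to the right of 10 seconds
p_upper      <- pnorm(10, mean = 7, sd = 2, lower.tail = FALSE)
p_complement <- 1 - pnorm(10, mean = 7, sd = 2)
print(all.equal(p_upper, p_complement))   # TRUE

# Far out in the tail (hypothetical value), the complement rounds to 0,
# while lower.tail = FALSE still returns a tiny positive probability
p_far_upper      <- pnorm(40, mean = 7, sd = 2, lower.tail = FALSE)
p_far_complement <- 1 - pnorm(40, mean = 7, sd = 2)
print(c(upper = p_far_upper, complement = p_far_complement))
```

For everyday values the two forms are interchangeable; lower.tail=FALSE is the safer habit when tail probabilities are very small.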

Sales page load time between 6 seconds and 10 seconds

To find the probability that the sales page load time will be between 6 seconds and 10 seconds, we take the difference of two cumulative distribution function values:

pnorm(10,mean=7,sd=2)- pnorm(6,mean=7,sd=2)

[1] 0.6246553

This problem is slightly trickier because two x values are involved, 6 seconds and 10 seconds, and we need the area under the curve between them. The area to the left of 10 seconds is 1 - 0.0668 = 0.9332, and the area to the left of 6 seconds is pnorm(6,mean=7,sd=2) = 0.3085. Subtracting the smaller area from the larger one gives the answer: 0.9332 - 0.3085 = 0.6247.

This result means that the area under this normal curve between 6 and 10 seconds is about 0.625. In other words, if we randomly select a page loading time from a normal distribution with mean 7 and standard deviation 2, the probability that it falls between these two values is about 0.625.

load_time <- seq(1,13,length=50) # create page loading time
hx <- dnorm(load_time,7,2) # generate normal distribution with mean=7 and sd=2
plot(hx~load_time,type="l",xlab="Page loading time")

# values on x-axis from 6 to 10 seconds are shaded
cord.x <- c(6,seq(6,10,length=50),10)

# value shades below the normal curve
cord.y <- c(0,dnorm(seq(6,10,length=50),7,2),0)

polygon(cord.x,cord.y,col="grey")# display area of curve between 6 and 10 seconds

**Cumulative probability of page loading between 6 and 10 sec.**
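
As a sanity check, the same probability can be approximated by simulation. This is a rough sketch, not part of the original analysis: the seed and the sample size of 100,000 are arbitrary choices, and the simulated proportion will only be close to the exact value, not equal to it.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility

# Simulate page load times from the assumed normal distribution
simulated <- rnorm(100000, mean = 7, sd = 2)

# Proportion of simulated load times between 6 and 10 seconds
p_simulated <- mean(simulated > 6 & simulated < 10)

# Exact probability from the cumulative distribution function
p_exact <- pnorm(10, mean = 7, sd = 2) - pnorm(6, mean = 7, sd = 2)

print(c(simulated = p_simulated, exact = p_exact))
```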

Summary

Many variables in business and industry are normally distributed. The normal distribution is a continuous distribution that is symmetrical about its mean. In this tutorial we learned:

– The properties of normal distribution
– The definition of standard normal distribution and its properties
– How to solve cumulative normal distribution problems using pnorm() function in R
– Interpretation of the results

The normal distribution also has a wide range of applications in sampling and hypothesis testing. We will delve deeper into these topics in future articles.