An intuitive real-life example of a binomial distribution and how to simulate it in R

Water color by E. Shwarz: Image Source: https://www.jstor.org/stable/community.24863986

A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: discrete and continuous. For example, consider the number of automobiles arriving at a tollbooth in a 60-second period. This is a random event, since we cannot know in advance how many cars will arrive at the tollbooth in 60 seconds. It could be 1 car, 2 cars, ..., n cars. Another example is the time between the completion of two tasks on a production line, which can range from 0 seconds to n seconds.

A random variable is discrete if it can take only a countable set of values, typically non-negative whole numbers. The number of cars arriving at a tollbooth is one example: 2.34 cars cannot arrive at a tollbooth. The number of defective items in a batch of 50 samples is another example of a discrete random variable.

A continuous random variable can take any value within an interval, so it has infinitely many possible values. For example, if a worker is assembling a product component, the time it takes to complete the work could be any value between, say, 3 minutes 54.345 seconds and 5 minutes 32.435 seconds. A temperature sensor might report a temperature to 3 decimal places. Time, height, weight, and temperature are all examples of continuous random variables.
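The difference between the two types can be seen by drawing a few random values of each kind in R. The sketch below is only an illustration: the 5% defect rate, the 3 to 6 minute range for assembly time, and the choice of rbinom() and runif() as generators are assumptions made for the example.

```r
set.seed(1)  # fixed seed so the draws are reproducible

# Discrete: number of defective items in each of 10 batches of 50
# (the 5% defect rate is an assumed value for illustration)
defects <- rbinom(10, size = 50, prob = 0.05)

# Continuous: assembly time in minutes for 10 components,
# assumed here to fall uniformly between 3 and 6 minutes
assembly_time <- runif(10, min = 3, max = 6)

defects        # whole numbers only
assembly_time  # arbitrary decimal values

all(defects == round(defects))  # TRUE: every count is a whole number
```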

The outcomes of random variables and their associated probabilities are described by probability distributions. There are two types of distributions: continuous distributions, constructed from continuous random variables, and discrete distributions, constructed from discrete random variables.

Some of the popular discrete distributions used in business are:

Binomial Distribution
Poisson distribution

Some of the popular continuous distributions used in business are:

Uniform Distribution
Normal Distribution
Exponential Distribution
t distribution
chi-square distribution
F distribution

Understanding distributions is important because the types of analysis available to a data analyst depend on the characteristics of the distribution at hand.

In this tutorial, we will go through the assumptions of the binomial distribution, work through a business example, build the binomial distribution formula, and use R to solve the problem.

Binomial Distribution

The binomial distribution is one of the most widely used discrete distributions in business. Each distribution mentioned above has its own assumptions and set of rules. The assumptions of the binomial distribution are as follows:

The experiment involves n identical trials
Each trial has only two possible outcomes, denoted as success or failure
Each trial is independent of previous trials
If p is the probability of success on any given trial and q=(1-p) is the probability of failure on any given trial, then the probabilities p and q remain constant throughout the experiment.

As the word binomial indicates, any single trial of a binomial experiment has only two possible outcomes: success or failure.

Understanding binomial distribution assumptions with an example

Suppose the failure rate of the LED bulbs that a company manufactures is 5%. If a random sample of 5 LED bulbs is selected, what is the probability that exactly one bulb is defective? This is a classic case of the binomial distribution because it fulfills all the assumptions.

The experiment involves n identical trials: Here the value of n is 5.
Each trial has only two possible outcomes, denoted as success or failure: The possible outcomes are LED bulb OK and LED bulb NOT OK.
Each trial is independent of previous trials: Since the LED bulbs are selected from an extremely large production volume, removing one bulb from the pool has a negligible effect on the bulbs that remain, so the trials can be treated as independent of each other.
Probabilities p and q remain constant throughout the experiment: Since the trials are independent and the production volume is very large, the outcome of one trial has no bearing on the others, and the probabilities p and q remain constant throughout the experiment. Here p, the probability of success, is 0.05, and q, the probability of failure, is 0.95. (A quality control analyst looking for defective products would count finding a defective product as a success.)
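To make the four assumptions concrete, one such experiment can be simulated in R. This is a minimal sketch, assuming that sample() with replacement is an acceptable stand-in for drawing bulbs from a very large production volume; the variable names are chosen for illustration.

```r
set.seed(42)  # fixed seed so the example is reproducible

n <- 5      # number of trials (bulbs sampled)
p <- 0.05   # probability of "success" (a defective bulb) on each trial

# Each trial is an independent draw with the same two outcomes and the same p
trials <- sample(c("NOT OK", "OK"), size = n, replace = TRUE,
                 prob = c(p, 1 - p))
trials

# Number of successes (defective bulbs) in this one experiment
sum(trials == "NOT OK")
```

Running the block again with a different seed gives a different sequence, but every run has 5 trials, two possible outcomes per trial, and the same p on each trial.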

Solving a binomial problem

Suppose the failure rate of the LED bulbs that a company manufactures is 5%. If a random sample of 5 LED bulbs is selected, what is the probability that exactly one bulb is defective?

Let OK denote an LED bulb that has passed the quality test and NOT OK denote an LED bulb that has not. Suppose the sequence of trials is as below:
OK, NOT OK, OK, OK, OK

We will use the multiplication rule for independent events to find the probability of getting one defective light bulb.

If p represents the probability of success and q represents the probability of failure, then in this example p=0.05 and q=0.95. The probability of getting this particular sequence is:

\[(0.95)(0.05)(0.95)(0.95)(0.95) \approx 0.0407\]
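The same product can be checked in R. This is a small sketch; the variable names are chosen for illustration.

```r
p <- 0.05   # probability of a defective bulb (NOT OK)
q <- 1 - p  # probability of a good bulb (OK)

# Sequence: OK, NOT OK, OK, OK, OK
seq_prob <- prod(c(q, p, q, q, q))
seq_prob
```

The result is roughly 0.0407, and it does not depend on where in the sequence the defective bulb appears, because multiplication is commutative.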

In the random selection of LED bulbs, the one defective LED bulb could have been selected in the first, second, third, fourth, or fifth trial. All the possible sequences with exactly one defective LED bulb are as follows:

OK, NOT OK, OK, OK, OK

OK, OK, NOT OK, OK, OK

OK, OK, OK, NOT OK, OK

OK, OK, OK, OK, NOT OK

NOT OK, OK, OK, OK, OK

Thus, there are five different ways of selecting exactly one defective LED bulb. Each sequence contains the same five probabilities, so the probability of each sequence is about 0.0407. Adding up the five sequences, the total probability of getting exactly one defective LED bulb is:

\[ 5(0.05)(0.95)^4 = 0.2036\]
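As a cross-check, the sequences can be enumerated by brute force in R rather than listed by hand. The sketch below assumes a coding of 1 for a defective bulb and 0 for a good one, and uses expand.grid() to generate every possible sequence of five bulbs.

```r
p <- 0.05
q <- 1 - p

# All 2^5 = 32 possible sequences; 1 marks a defective bulb, 0 a good one
sequences <- expand.grid(rep(list(0:1), 5))

# Keep the sequences with exactly one defective bulb
one_defect <- sequences[rowSums(sequences) == 1, ]
nrow(one_defect)  # 5

# Each such sequence has probability p * q^4; add them up
total_prob <- nrow(one_defect) * p * q^4
total_prob
```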

An easier way to determine the number of sequences than listing all the possibilities is to use combinations. Five LED bulbs are being sampled, so n=5, and we want exactly one defective LED bulb, so x=1. In general, nCx gives the number of ways to get x successes in n trials. For this problem, 5C1 gives the number of possible sequences.

\[{}_{5}C_{1} = \frac{5!}{1!\,(5-1)!} = 5\]
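In R, the number of combinations can be computed either from the factorial definition or with the built-in choose() function. A short sketch:

```r
n <- 5
x <- 1

# Combination count from the factorial definition
by_factorial <- factorial(n) / (factorial(x) * factorial(n - x))

# Same count using R's built-in choose()
by_choose <- choose(n, x)

by_factorial  # 5
by_choose     # 5
```

For large n, choose() is the safer option, since the factorials themselves grow very quickly.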

Using combinations simplifies the determination of how many sequences are possible for a given value of x in a binomial distribution. If we generalize the above example, we get the binomial distribution formula:

\[P(X = x) = {}_{n}C_{x}\, p^x q^{n-x} = \frac{n!}{x!\,(n-x)!}\, p^x q^{n-x}\]

where,
n = number of trials, in our example n = 5
x = the number of successes desired, in our example 1 defective LED bulb
p = the probability of getting a success in one trial, in our example, the probability that a sampled bulb is defective is 0.05
q = (1-p) = the probability of getting a failure in one trial, in our example, the probability of getting a non-defective LED bulb is 0.95
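The formula translates almost directly into R. The sketch below defines a hypothetical helper, binom_prob() (the name is ours, not part of R), and compares its result with R's built-in dbinom() for the LED bulb example.

```r
# Hypothetical helper: implements nCx * p^x * q^(n-x) directly
binom_prob <- function(x, n, p) {
  q <- 1 - p
  choose(n, x) * p^x * q^(n - x)
}

manual  <- binom_prob(1, 5, 0.05)
builtin <- dbinom(1, size = 5, prob = 0.05)

manual
all.equal(manual, builtin)  # TRUE
```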

Using R to solve a binomial problem

In the language of the binomial distribution, the probability of obtaining exactly 1 defective LED bulb in a sample of n=5 trials, with probability of success p=0.05, can be calculated using R's dbinom() function.

dbinom(1,5,0.05)

[1] 0.2036266

There is about a 20% chance that exactly 1 LED bulb will be defective in a sample of 5 LED bulbs. What is the probability of getting exactly 1 defective LED bulb in a sample of 100 LED bulbs randomly picked from the production line, if the probability of success (picking a defective LED bulb) is 0.05?

dbinom(1,100,0.05)

[1] 0.03116068

Given the binomial probability distribution, we can find the probability that the number of successes is less than or equal to some value, say k, by using R's cumulative binomial probability function, pbinom(k,n,p).

What is the probability of obtaining at most 5 defective LED bulbs in a sample of n=100 trials, with probability of success p=0.05?

At most 5 successes is the same as 5 or fewer successes and can be expressed as \[k\le5\], so we call pbinom() with k=5:

pbinom(5,100,0.05)

[1] 0.6159991
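The cumulative probability is simply the sum of the individual probabilities of 0, 1, ..., 5 defective bulbs. A quick sketch to confirm that summing dbinom() values reproduces the pbinom() result:

```r
n <- 100
p <- 0.05

# P(X <= 5) as a sum of the individual probabilities P(X = 0), ..., P(X = 5)
by_sum <- sum(dbinom(0:5, size = n, prob = p))

# Same probability from the cumulative function
by_cdf <- pbinom(5, size = n, prob = p)

all.equal(by_sum, by_cdf)  # TRUE
```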

By the law of large numbers, if a large number of samples of 100 LED bulbs each are selected, we should expect to see at most 5 defective LED bulbs in approximately 62% of these samples.
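This long-run interpretation can be checked by simulation with R's rbinom() function, which draws random values from a binomial distribution. The sketch below is one possible way to do it; the number of simulated samples (100,000) and the seed are arbitrary choices.

```r
set.seed(123)  # fixed seed so the simulation is reproducible

n_samples <- 100000  # number of simulated samples (an arbitrary large choice)

# Each draw is the number of defective bulbs in one sample of 100
defects <- rbinom(n_samples, size = 100, prob = 0.05)

# Proportion of simulated samples with at most 5 defective bulbs
simulated <- mean(defects <= 5)
simulated

# Theoretical value for comparison
pbinom(5, size = 100, prob = 0.05)
```

The simulated proportion should land close to the theoretical value, and it tends to get closer as n_samples grows.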

At least 5 successes is the same as 5 or more successes and can be expressed as \[k\geq5\]. We can use R's cumulative binomial probability function with the upper tail, pbinom(k,n,p,lower.tail=FALSE). Please note that with lower.tail=FALSE the summation does not include the value k itself (it returns the probability of more than k successes), so we must pass k-1 if we want k to be included in the calculation.

The probability that at least 5 of the 100 LED bulbs are defective is:

pbinom(4,100,0.05,lower.tail = FALSE)

[1] 0.5640187

By the law of large numbers, if a large number of samples of 100 LED bulbs each are selected, we should expect to see at least 5 defective LED bulbs in approximately 56% of these samples.
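Because the k-1 adjustment is easy to get wrong, it is worth verifying the "at least 5" probability in more than one way. The sketch below computes it by direct summation, by the upper tail with k-1 = 4, and as the complement of "at most 4"; all three should agree.

```r
n <- 100
p <- 0.05

# P(X >= 5) by summing the individual probabilities for 5, 6, ..., 100
by_sum <- sum(dbinom(5:n, size = n, prob = p))

# lower.tail = FALSE gives P(X > k), so k = 4 is needed to include 5
upper_tail <- pbinom(4, size = n, prob = p, lower.tail = FALSE)

# Complement of "at most 4"
by_complement <- 1 - pbinom(4, size = n, prob = p)

all.equal(by_sum, upper_tail)         # TRUE
all.equal(upper_tail, by_complement)  # TRUE
```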

Graphing the binomial distributions

At most 5 successes is the same as 5 or fewer successes and can be expressed as $k \leq 5$. To create a list of the probabilities of 0 through 5 defective LED bulbs and draw them as a bar graph, we can use a series of R commands as follows:

probs <- dbinom(0:5,100,0.05)
probs

[1] 0.005920529 0.031160680 0.081181772 0.139575678 0.178142642 0.180017827

barplot(probs, names.arg=0:5, xlab="Number of defective LED bulbs", ylab="Probability")
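The graph above stops at 5 defective bulbs, so it shows only part of the distribution. To see more of its shape, the same commands can be run over a wider range of values. In the sketch below, the cut-off at 15 is an arbitrary choice, made because larger counts are very unlikely when n=100 and p=0.05.

```r
n <- 100
p <- 0.05

# Probabilities of 0 through 15 defective bulbs (the cut-off at 15 is an
# arbitrary choice for this sketch; larger counts are very unlikely)
x_values   <- 0:15
probs_wide <- dbinom(x_values, size = n, prob = p)

barplot(probs_wide, names.arg = x_values,
        xlab = "Number of defective LED bulbs",
        ylab = "Probability")

# Share of the total probability covered by the plotted range
sum(probs_wide)
```

The tallest bar is at 5 defective bulbs, which matches the expected number of defects in a sample of 100 at a 5% defect rate.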

Summary

We started with the types of random variables used in statistics: continuous and discrete. The binomial distribution is one of the most widely used distributions for discrete variables and models a wide range of business situations. We picked one such scenario, in which we wanted to model the probability of getting a defective product in a random sample with a finite number of trials, and went through the assumptions on which the binomial distribution is built.

We used the LED bulb example to build the binomial distribution formula. Then, using R's binomial distribution functions dbinom() and pbinom(), we modeled various scenarios and graphed the binomial distribution.