A gentle introduction to Sampling terms and definitions – A pre-requisite to inferential statistics.

Children from a fishing village playing on a fishing boat near Mahabalipuram
Image Source: South Asia Art Archive

The subject of statistics is broadly divided into two branches: descriptive statistics and inferential statistics. Descriptive statistics are used to describe the data at hand, whereas inferential statistics are used to draw inferences about a larger group based on data from a smaller group.

In this tutorial we will get comfortable with some of the commonly used terms from the field of sampling theory. Clarity on these terms is required to understand inferential statistical techniques. Let’s get started with the first definition.

Samples and Populations

Suppose we want to know how people are going to vote in the upcoming presidential election. The population, in this case, consists of all people eligible to vote in the general election. Clearly, it would be enormously expensive and time-consuming to gather data from each person in the population.

Instead, we would select a sample that is representative of the entire population. We may poll the sample on how they will vote, and then draw conclusions about the population from the sample.

Here is another example. Suppose we want to find out whether a new drug reduces the number of deaths in patients with severe heart disease. This research question refers to the target population of all heart patients who have undergone treatment with the new drug. A sample is a subset of these cases, a small fraction of the population. Fifty patients (or some other number) might be selected from the population, and the data from this sample may be used to draw conclusions.

Population: A collection of persons, objects, or items of interest. For example, the population of all people eligible to vote in the general election.

Sample: A small portion of the population. For example, the people eligible to vote in the general election who live in a particular city.

“Sampling consists of selecting some part of a population to observe so that one may estimate something about the whole population.”
Sampling, Chapter 1, Third Edition, 2012

Reasons for Sampling

Taking a sample instead of using the whole population to conduct research offers several advantages:

The sample can save money
The sample can save time
Sometimes accessing the whole population is impossible, for example in the case of a rare disease, so a sample is the only option.
In business, sampling shortens research turnaround time and helps products launch faster.

Statistics and Parameter

If we had data from the entire population, we could compute descriptive measures such as the mean and the standard deviation, just as we can on a sample taken from the population.

Parameters: When values of the descriptive measures are computed from populations, they are called parameters.

Statistics: When values of the descriptive measures are computed from samples, they are called statistics.

To make clear whether we are talking about a measure computed from a sample or from a population, we denote statistics by Roman letters and parameters by Greek letters. For example, the sample mean is written x̄ and the population mean μ.
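
The distinction can be illustrated with a small R sketch. The population below is simulated and purely hypothetical (the income values, the seed, and the variable names are assumptions made for illustration); the point is only that the same measure is a parameter when computed on the population and a statistic when computed on a sample.

```r
# Hypothetical population: 5000 simulated incomes (values are made up)
set.seed(1)
population <- rnorm(5000, mean = 50000, sd = 8000)

# Parameters: descriptive measures computed on the whole population
mu <- mean(population)
# The population standard deviation divides by N, whereas sd() divides by n - 1
sigma <- sqrt(mean((population - mu)^2))

# Statistics: the same measures computed on a sample of 50 cases
smp <- population[sample(length(population), 50)]
x_bar <- mean(smp)
s <- sd(smp)

# Compare parameters (mu, sigma) with statistics (x_bar, s)
round(c(mu = mu, x_bar = x_bar, sigma = sigma, s = s), 1)
```

The statistics will typically be close to, but not equal to, the parameters they estimate.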

Representative Sample

A sample must be a subset of the population. For example, if we were interested in estimating the median income of software engineers in India (the population), it would make no sense to select a sample of medical doctors on which to base our estimate. Obviously, our sample should consist of software engineers from the population.

Many samples could be picked from this population, and each would differ somewhat from the others, even though they all come from the same population. The accuracy of the estimate depends on how representative our sample is of the population. If our sample is truly representative of the population, we can accurately estimate the population median. On the other hand, if our sample deviates to some extent from being truly representative of the population, we would expect our estimate to be inaccurate.
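
As a rough illustration of how samples from the same population differ, the sketch below draws five samples of 50 cases from a simulated, hypothetical income population and compares their medians with the population median. The distribution, its parameters, and the variable names are assumptions chosen only for the example.

```r
set.seed(42)
# Hypothetical population of 5000 incomes (right-skewed, simulated)
population <- rlnorm(5000, meanlog = 10, sdlog = 0.5)
pop_median <- median(population)

# Draw 5 simple random samples of 50 cases each and keep each sample median
sample_medians <- replicate(5, median(sample(population, 50)))

pop_median       # the parameter we are trying to estimate
sample_medians   # five estimates, each a little different
```

Each sample median lands somewhere around the population median, which is the sample-to-sample variation described above.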

So, can we determine how representative a particular sample is? In most cases we cannot make that determination for a particular sample, but we can do so in the long run if we select the sample using a specific procedure called simple random sampling. With simple random sampling, we can determine probabilistically just how representative our sample is expected to be. This is generally not possible with sampling methods that are not based on probability, and therefore much of inferential statistics rests on the assumption of random sampling from populations.

“A representative sample is a subset of a population that seeks to accurately reflect the characteristics of the larger group.”
Sampling, Chapter 1, Third Edition, 2012

“A probability design such as simple random sampling thus can provide unbiased estimates of the population mean or total and also an unbiased estimate of variability, which is used to assess the reliability of the survey result.”
Sampling, Chapter 1, Third Edition, 2012
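
The "long run" idea can be sketched in R by repeating simple random sampling many times on a simulated population. Everything here (the exponential population, the number of repetitions, the variable names) is an assumption made for illustration, not something taken from the cited book.

```r
set.seed(2020)
# Hypothetical skewed population of 5000 values
population <- rexp(5000, rate = 1 / 40)
mu <- mean(population)

# Repeat simple random sampling 2000 times, keeping each sample mean
sample_means <- replicate(2000, mean(sample(population, 50)))

mu                   # the population mean (parameter)
mean(sample_means)   # average of the sample means, expected to be close to mu
sd(sample_means)     # spread of the estimates from sample to sample
```

Any single sample mean may miss the population mean, but the estimates average out close to it, and their spread gives a sense of how reliable one sample is.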

Simple Random Sampling

What is a random sample, and how do we get one?

“A simple random sample is a sample chosen from a given population in such a way as to ensure that each person or thing in the population has an equal and independent chance of being picked for the sample”.
Statistics using R, Page 270, 2020

Equal means that at each stage of the selection process, all objects remaining in the population are equally likely to be picked for the sample. Independent means that no pick has any effect on any other pick.

The actual selection of a random sample can be done with a uniform random number generator. The important properties of a uniform random number generator are:

each digit generated is independent of all other digits
in the long run, each digit occurs with equal frequency.
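
These two properties can be checked informally in R. The sketch below is only a rough empirical check on R's built-in generator, with an arbitrary seed and sample size chosen for illustration; it is not a formal test of randomness.

```r
set.seed(123)
# Draw 10,000 digits (0 to 9), each equally likely
digits <- sample(0:9, 10000, replace = TRUE)

# Relative frequency of each digit; each should be close to 0.10
round(table(digits) / length(digits), 3)

# Rough independence check: correlation between each digit and the next one,
# which should be close to 0
cor(head(digits, -1), tail(digits, -1))
```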

Suppose you have 5000 cases in the population, from which you want to select a random sample of 50 cases. We perform the selection in R by typing:

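# Fix the seed so that the same sample is drawn on every run
set.seed(123)
# Draw 50 of the case numbers 1 to 5000 without replacement
x <- sample(1:5000, 50, replace = FALSE)
# Show the first six selected case numbers
head(x)

[1] 2463 2511 2227  526 4291 2986

The set.seed() function ensures that the random numbers we generate can be reproduced. The argument replace = FALSE specifies that the cases are drawn without replacement.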
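
In practice the selected numbers are usually used to pull the corresponding cases out of a data set. The sketch below assumes a hypothetical data frame called cases with made-up id and score columns; only sample(), nrow(), and anyDuplicated() are standard R functions.

```r
set.seed(123)
# Hypothetical population of 5000 cases; the columns are made up for illustration
cases <- data.frame(id = 1:5000, score = round(runif(5000, 0, 100)))

# Row numbers of the 50 selected cases, drawn without replacement
picked <- sample(nrow(cases), 50, replace = FALSE)
sample_cases <- cases[picked, ]

nrow(sample_cases)               # 50 cases in the sample
anyDuplicated(sample_cases$id)   # 0 means no case was selected twice
```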

Sampling with and without replacement

Simple random sampling is an example of sampling without replacement. That is, once an object from the population has been selected, it is removed from the population for all remaining stages of the selection process.

In sampling with replacement, every object is available for selection at every stage of the selection process, regardless of whether it has already been selected.

For example, suppose we want to select 2 cards from a deck of 52 playing cards.

With replacement: In the first selection, the probability of any particular card being picked is 1/52. The card is then put back in the deck. In the second selection, the probability of any particular card being picked is again 1/52.
Without replacement: In the first selection, the probability of any particular card being picked is 1/52. The card is then removed from the deck. In the second selection, since we are left with 51 cards, the probability of any particular remaining card being picked is 1/51, and the probability of the first card being picked again is 0. The outcome of the second pick therefore depends on the first, so the picks are not strictly independent, although every card remaining in the deck is still equally likely to be picked. When the population is large relative to the sample, this dependence is negligible.
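
The card example can be simulated in R. This is a minimal sketch under stated assumptions: the deck labels, the seed, and the number of repetitions are arbitrary choices for illustration.

```r
set.seed(7)
# Build a 52-card deck as labels such as "A-Hearts"
ranks <- c("A", 2:10, "J", "Q", "K")
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
deck <- paste(rep(ranks, times = 4), rep(suits, each = 13), sep = "-")

# Draw 2 cards many times under each scheme and record how often
# the same card turns up on both picks
n_draws <- 10000
same_with <- mean(replicate(n_draws,
  anyDuplicated(sample(deck, 2, replace = TRUE)) > 0))
same_without <- mean(replicate(n_draws,
  anyDuplicated(sample(deck, 2, replace = FALSE)) > 0))

same_with      # expected to be near 1/52, about 0.019
same_without   # always 0: a removed card cannot be picked again
```

With replacement, the same card occasionally appears twice; without replacement, it never does.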

Summary

In this tutorial, we learned some concepts from sampling theory. A sample is a small portion of a population. Sampling is used to infer some characteristic of the population. A sample must be a good representation of the population to estimate a parameter accurately. Using simple random sampling, we can determine probabilistically just how representative our sample is expected to be.

Simple random sampling is an example of sampling without replacement. In sampling without replacement, a selected object is not returned to the population. In sampling with replacement, a selected object is returned to the population before the next pick.

References

Sampling, Third Edition, Wiley, 2012
Statistics using R, an integrated approach, Cambridge University Press, 2020