A gentle introduction to covariance and correlation

St. Andrew’s Episcopal Church. Image source: https://www.jstor.org/stable/community.17343512

In this tutorial we will look at two measures of the relationship between two numeric variables: the covariance and the coefficient of correlation.

The covariance

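The covariance measures how two variables vary together in a linear fashion: a positive value means they tend to move in the same direction, and a negative value means they tend to move in opposite directions. Let’s work with sample data on advertisement expenditure and product sales. We will evaluate the effectiveness of a company’s advertising campaign.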

Let’s prepare a dataset

library(tidyverse)

ad_data <- data.frame(
  year = c(2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010),
  ad_expenditure = c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000),
  product_sales = c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)
)

ad_data

The covariance can be calculated using the cov() function.

covariance <- cov(ad_data$ad_expenditure,ad_data$product_sales)
covariance

[1] 2345357143
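To see what cov() computes, the sample covariance can be reproduced by hand: subtract each variable's mean, multiply the deviations pairwise, sum the products and divide by n - 1. The snippet below is only an illustrative sketch. The names ad, sales and manual_cov are introduced here for convenience, and the figures are copied from the data frame above so that the snippet runs on its own.

```r
# Same figures as ad_data above, repeated so the snippet is self-contained
ad    <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
sales <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

n <- length(ad)

# Sample covariance: sum of products of deviations from the means, over n - 1
manual_cov <- sum((ad - mean(ad)) * (sales - mean(sales))) / (n - 1)

manual_cov
all.equal(manual_cov, cov(ad, sales))  # TRUE when the hand calculation agrees with cov()
```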

The covariance has a major flaw as a measure of the linear relationship between two numeric variables. Because its value is unbounded and depends on the units in which the variables are measured, you cannot use it to judge the relative strength of the relationship. In the above example you cannot tell whether the value 2345357143 indicates a strong or a weak relationship between ad expenditure and product sales.
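One reason the raw number is hard to read is that it changes with the units of measurement. As a rough sketch, the snippet below expresses the same ad expenditure in thousands of dollars instead of dollars (ad_thousands is a hypothetical name introduced for this example). The data and the relationship are unchanged, yet the covariance shrinks by a factor of about 1000.

```r
# Same figures as ad_data above, repeated so the snippet is self-contained
ad    <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
sales <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

# Identical expenditure, expressed in thousands of dollars
ad_thousands <- ad / 1000

cov_dollars   <- cov(ad, sales)
cov_thousands <- cov(ad_thousands, sales)

cov_dollars
cov_thousands
cov_dollars / cov_thousands  # ratio of the two covariances, approximately 1000
```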

The coefficient of correlation

The coefficient of correlation (r) measures the relative strength of a linear relationship between two numeric variables. Its values range from -1 to +1, with -1 being a perfect negative correlation and +1 a perfect positive correlation. Here a perfect relationship means that, in a scatter plot, all the points would lie on a single straight line. Let’s find the coefficient of correlation between ad expenditure and product sales. It is calculated using the cor() function.

cor(ad_data$ad_expenditure,ad_data$product_sales)

[1] 0.9879274
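The coefficient of correlation can be read as the covariance rescaled by the two standard deviations, r = cov(x, y) / (sd(x) * sd(y)), which is what removes the units. The sketch below checks this with base R functions. The names ad, sales and manual_r are illustrative, and the figures repeat the data used above.

```r
# Same figures as ad_data above, repeated so the snippet is self-contained
ad    <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
sales <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

# Correlation = covariance divided by the product of the standard deviations
manual_r <- cov(ad, sales) / (sd(ad) * sd(sales))

manual_r
all.equal(manual_r, cor(ad, sales))  # TRUE when both calculations agree

# Unlike the covariance, r is unaffected by a change of units
all.equal(cor(ad / 1000, sales), cor(ad, sales))
```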

Since the coefficient of correlation is close to 1, we can conclude that there is a strong positive linear relationship between ad expenditure and product sales. In other words, years with higher ad expenditure also had higher product sales, and the points lie close to a straight line. Note that r describes how tightly the points follow a line, not how many dollars of sales accompany each dollar of advertising. The relationship between the two variables can be shown with a scatter plot.

ggplot(ad_data,aes(x=ad_expenditure,y=product_sales))+
  geom_point()+
  geom_smooth(method = "lm")

Relationship between Ad expenditure and product sale
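The correlation coefficient says how closely the points follow a straight line, but not how steep that line is. The steepness is the slope of the fitted line that geom_smooth(method = "lm") draws, and it can be obtained with lm(). The snippet below is a minimal sketch: the object names fit and slope_manual are hypothetical, and the data frame is rebuilt here so the code runs on its own.

```r
# Same figures as ad_data above, repeated so the snippet is self-contained
ad_data <- data.frame(
  ad_expenditure = c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000),
  product_sales  = c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)
)

# Least-squares line: product_sales = intercept + slope * ad_expenditure
fit <- lm(product_sales ~ ad_expenditure, data = ad_data)
coef(fit)

# The slope equals the covariance divided by the variance of the predictor
slope_manual <- cov(ad_data$ad_expenditure, ad_data$product_sales) /
  var(ad_data$ad_expenditure)
slope_manual
```

In this sample the slope works out to roughly 12.6 dollars of sales per additional dollar of ad expenditure, a figure that cannot be read off the value of r alone.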

Since the coefficient of correlation varies from -1 to +1, the relationship between two variables can be described as strong, moderate or weak depending on its value. Let’s work with a simulated data set and generate scatter plots for various correlation values.

Strong positive correlation

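Since the r value is close to 1, there is a strong positive relationship between the two variables: the points cluster tightly around an upward-sloping line. In this simulation the slope is 1, so when X increases by one unit, Y increases by about one unit on average.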

set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)

cor(x,y,method = "pearson")

[1] 0.9490752

# Semi-transparent points keep the 1000 observations readable
ggplot(dummy_data, aes(x = x, y = y)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm")

Medium positive correlation

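Since the r value is close to 0.5, there is a moderate positive relationship between the two variables. Y still tends to increase with X (here by about 0.2 units per unit of X), but the noise now accounts for much more of the variation in Y, so the points are scattered more widely around the line.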

set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- 0.2*x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)

cor(x,y,method = "pearson")

[1] 0.5351744

Weak positive correlation

Since the r value is below 0.2, there is only a weak positive relationship between the two variables. The slope is just 0.05, so the slight upward trend in Y is almost completely hidden by the noise.

set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- 0.05*x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)

cor(x,y,method = "pearson")

[1] 0.1842766

Strong negative correlation

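Since the r value is close to -1, there is a strong negative relationship between the two variables: the points cluster tightly around a downward-sloping line. In this simulation the slope is -1, so when X increases by one unit, Y decreases by about one unit on average.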

set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)

cor(x,y,method = "pearson")

[1] -0.9490752

Medium negative correlation

Since the r value is close to -0.5, there is a moderate negative relationship between the two variables. Y tends to decrease as X increases (here by about 0.2 units per unit of X), but the points are scattered widely around the line.

set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -0.2*x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)

cor(x,y,method = "pearson")

[1] -0.5351744

Weak negative correlation

Since the r value is between -0.2 and 0, there is only a weak negative relationship between the two variables. The slope is just -0.05, so the slight downward trend in Y is almost completely hidden by the noise.

set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -0.05*x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)

cor(x,y,method = "pearson")

[1] -0.1842766
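The six simulations differ only in the slope applied to x, so the correlation each one produces can be anticipated. Assuming y = b * x + noise, with x and the noise independent, the population correlation is b * sd_x / sqrt(b^2 * sd_x^2 + sd_noise^2). The sketch below compares that value with the sample correlation for each slope. The function name theoretical_r and the results data frame are illustrative choices, not part of the original code. For simplicity the noise is added with the same sign for every slope, so the observed values for the negative slopes need not match the outputs printed above exactly.

```r
sd_x     <- 15  # standard deviation of x in the simulations
sd_noise <- 5   # standard deviation of the added noise

# Population correlation of x and y = b * x + noise (x and noise independent)
theoretical_r <- function(b, sd_x, sd_noise) {
  b * sd_x / sqrt(b^2 * sd_x^2 + sd_noise^2)
}

slopes <- c(1, 0.2, 0.05, -1, -0.2, -0.05)

set.seed(12345)
x     <- rnorm(n = 1000, mean = 120, sd = sd_x)
noise <- rnorm(n = 1000, mean = 3, sd = sd_noise)

# One row per slope: expected (formula) versus observed (sample) correlation
results <- data.frame(
  slope    = slopes,
  expected = theoretical_r(slopes, sd_x, sd_noise),
  observed = sapply(slopes, function(b) cor(x, b * x + 6 + noise))
)
results
```

The expected magnitudes are about 0.95, 0.51 and 0.15 for slopes of 1, 0.2 and 0.05, which shows that the labels strong, moderate and weak here come from the ratio of signal to noise rather than from the slope alone.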

By default, the cor() function computes the Pearson correlation coefficient, a numerical measure of the degree of linear relationship between two variables. The Pearson coefficient is appropriate when both variables are quantitative. If the variables are ordinal, i.e. ranks or scores whose order is meaningful but whose spacing is not (e.g. a beauty score or an intelligence score), the Spearman correlation coefficient is used instead.
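Because the Spearman coefficient works on ranks, it responds to any consistently increasing or decreasing relationship, whereas the Pearson coefficient responds only to straight-line ones. The made-up example below (the vectors x_demo and y_demo are hypothetical and not part of the tutorial data) uses an exponential curve, which always increases but is far from linear.

```r
# Hypothetical data: y always increases with x, but along a curve
x_demo <- 1:10
y_demo <- exp(x_demo)

pearson_r  <- cor(x_demo, y_demo, method = "pearson")
spearman_r <- cor(x_demo, y_demo, method = "spearman")

pearson_r   # noticeably below 1, because the points do not lie on a line
spearman_r  # 1, because the two rank orders are identical
```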

Rank correlation

While Pearson correlation is often used for quantitative continuous variables, Spearman correlation (which is based on the ranked values for each variable rather than on the raw data) is often used to evaluate relationships involving ordinal variables. The rankings of 10 trainees in two skill sets are given below. Since these data are ranks rather than measurements, we will use the Spearman correlation coefficient.

test_score <- data.frame(
  programming = c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9),
  Analysis = c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)
)
test_score

The Spearman correlation coefficient is calculated as:

cor(test_score$programming,test_score$Analysis,method = "spearman")

[1] -0.2969697

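Since the rank correlation coefficient is negative but fairly close to zero (about -0.3), the two rankings are only weakly negatively associated: trainees ranked high in programming tend, slightly, to be ranked lower in analysis.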
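The Spearman coefficient is simply the Pearson coefficient applied to the ranks, and when there are no ties it can also be obtained from the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks of each trainee. The sketch below checks both routes against cor(). The vector and variable names are illustrative, and the rankings are copied from the data frame above.

```r
# Same rankings as test_score above, repeated so the snippet is self-contained
programming <- c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9)
analysis    <- c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)

# Route 1: Pearson correlation computed on the ranks
rho_from_ranks <- cor(rank(programming), rank(analysis), method = "pearson")

# Route 2: shortcut formula, valid because there are no tied ranks
d <- rank(programming) - rank(analysis)
n <- length(d)
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

rho_from_ranks
rho_shortcut
cor(programming, analysis, method = "spearman")  # built-in result for comparison
```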

Summary

The covariance measures how two variables vary together, but it has a major flaw as a measure of the linear relationship between two numeric variables. Because its value is unbounded and depends on the units of measurement, you cannot use it to judge the relative strength of the relationship.

The coefficient of correlation indicates the linear relationship, or association, between two variables. The closer it gets to +1 or -1, the stronger the linear relationship between the two variables. When it is close to zero, little or no linear relationship exists.

The Spearman correlation coefficient is used when the variables are ordinal, i.e. ranks or ordered scores (e.g. rankings of trainees, beauty scores).

Further Resources

If you need help drawing scatter plots with the ggplot2 package, please check out my YouTube videos on ggplot2:

How to draw a scatter plot?