In this tutorial we will look at two measures of the relationship between two numeric variables: the covariance and the coefficient of correlation.
The covariance
The covariance measures how two variables vary together: a positive value means they tend to move in the same direction, and a negative value means they tend to move in opposite directions. Let’s work on sample data of advertising expenditure and product sales. We will evaluate the effectiveness of a company's advertising campaign.
Let’s prepare a dataset
library(tidyverse)
ad_data <- data.frame(
  year = c(2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010),
  ad_expenditure = c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000),
  product_sales = c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)
)
ad_data
The covariance can be calculated using the cov() function.
covariance <- cov(ad_data$ad_expenditure,ad_data$product_sales)
covariance
[1] 2345357143
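As a cross-check, the same number can be reproduced by hand. The sketch below assumes the usual sample covariance definition (the sum of the products of the deviations from the two means, divided by n - 1), which is what cov() computes by default. The name manual_cov is only illustrative.

```r
# Sample data from the tutorial (only the two columns needed here)
ad_expenditure <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
product_sales  <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

n <- length(ad_expenditure)

# Sum of the products of deviations from the means, divided by n - 1
manual_cov <- sum((ad_expenditure - mean(ad_expenditure)) *
                  (product_sales - mean(product_sales))) / (n - 1)

manual_cov
all.equal(manual_cov, cov(ad_expenditure, product_sales))  # TRUE
```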
The covariance has a major flaw as a measure of the linear relationship between two numeric variables. Because its value depends on the units of measurement and can be any real number, you cannot use it to judge the relative strength of the relationship. In the example above, you cannot tell whether the value 2345357143 indicates a strong or a weak relationship between ad expenditure and product sales.
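One way to see the problem is to change the units. In the sketch below, product_sales_k is a hypothetical rescaled copy of the sales figures (thousands of dollars instead of dollars). The data and the relationship are unchanged, yet the covariance shrinks by a factor of 1000, while the correlation stays the same.

```r
ad_expenditure <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
product_sales  <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

# Same sales figures expressed in thousands of dollars (illustrative rescaling)
product_sales_k <- product_sales / 1000

cov(ad_expenditure, product_sales)    # covariance in dollar units
cov(ad_expenditure, product_sales_k)  # 1000 times smaller, same data

cor(ad_expenditure, product_sales)
cor(ad_expenditure, product_sales_k)  # identical to the line above
```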
The coefficient of correlation
The coefficient of correlation (r) measures the relative strength of a linear relationship between two numeric variables. Its value ranges from -1 to +1, with -1 being a perfect negative correlation and +1 being a perfect positive correlation. Here a perfect relationship means that, in a scatter plot, all the points fall on a straight line. Let’s find the coefficient of correlation between ad expenditure and product sales. It is calculated using the cor() function.
cor(ad_data$ad_expenditure,ad_data$product_sales)
[1] 0.9879274
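The link between the two measures can be made explicit: the Pearson coefficient is the covariance divided by the product of the two standard deviations, which is what removes the units. A minimal sketch, with r_manual as an illustrative name:

```r
ad_expenditure <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
product_sales  <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

# Covariance rescaled by both standard deviations
r_manual <- cov(ad_expenditure, product_sales) /
  (sd(ad_expenditure) * sd(product_sales))

r_manual
all.equal(r_manual, cor(ad_expenditure, product_sales))  # TRUE
```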
Since the coefficient of correlation is close to 1, we can conclude that there is a strong positive linear relationship between ad expenditure and product sales: years with higher ad spending also had higher sales. Note that r does not tell us how many dollars of sales each advertising dollar brings in (that is the slope of a fitted line), and on its own it does not prove that advertising caused the extra sales. The relationship between two variables can be represented with a scatter plot.
ggplot(ad_data,aes(x=ad_expenditure,y=product_sales))+
geom_point()+
geom_smooth(method = "lm")
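The line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit, so its intercept and slope can also be inspected numerically with lm(). This is a small sketch on the same data. The object name fit is illustrative, and the slope is read as the average change in sales per extra dollar of ad expenditure in this sample.

```r
ad_data <- data.frame(
  ad_expenditure = c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000),
  product_sales  = c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)
)

# Least-squares line: product_sales = intercept + slope * ad_expenditure
fit <- lm(product_sales ~ ad_expenditure, data = ad_data)
coef(fit)  # the slope is about 12.6

# For a single predictor, the slope equals cov(x, y) / var(x)
cov(ad_data$ad_expenditure, ad_data$product_sales) / var(ad_data$ad_expenditure)
```

The slope and the correlation answer different questions: the slope says how steep the line is, while r says how closely the points follow it.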
Since the coefficient of correlation varies from -1 to +1, the relationship between two variables can be described as strong, moderate, or weak depending on its value. Let’s work with simulated data sets and look at various correlation values.
Strong positive correlation
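Since the r value is close to 1, there is a strong positive relationship between the two variables: the points lie close to an upward-sloping straight line. In the simulation below, y is generated with a slope of 1, so when x increases by one unit, y increases by about one unit on average. That slope comes from how the data are generated, not from r itself.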
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
cor(x,y,method = "pearson")
[1] 0.9490752
ggplot(dummy_data,aes(x=x,y=y))+
geom_point(alpha=0.2)+
geom_smooth(method = "lm")
Medium positive correlation
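Since the r value is close to 0.5, there is a moderate positive relationship between the two variables. Here y is generated with a slope of 0.2, so when x increases by one unit, y increases by about 0.2 units on average. The correlation is lower than in the previous case because the same noise is now large relative to the change in y that x explains.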
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- 0.2*x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
cor(x,y,method = "pearson")
[1] 0.5351744
Weak positive correlation
Since the r value is about 0.18, there is only a weak positive relationship between the two variables. Here y is generated with a slope of 0.05, so when x increases by one unit, there is hardly any increase in y, and most of the variation in y is noise.
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- 0.05*x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
cor(x,y,method = "pearson")
[1] 0.1842766
Strong negative correlation
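Since the r value is close to -1, there is a strong negative relationship between the two variables: the points lie close to a downward-sloping straight line. Here y is generated with a slope of -1, so when x increases by one unit, y decreases by about one unit on average.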
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
cor(x,y,method = "pearson")
[1] -0.9490752
Medium negative correlation
Since the r value is close to -0.5, there is a moderate negative relationship between the two variables. Here y is generated with a slope of -0.2, so when x increases by one unit, y decreases by about 0.2 units on average.
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -0.2*x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
cor(x,y,method = "pearson")
[1] -0.5351744
Weak negative correlation
Since the r value is about -0.18, there is only a weak negative relationship between the two variables. Here y is generated with a slope of -0.05, so when x increases by one unit, there is hardly any decrease in y.
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -0.05*x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
cor(x,y,method = "pearson")
[1] -0.1842766
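The simulations above all follow the same recipe, so they can be wrapped in one small function. The helper simulate_cor() below is hypothetical (it is not part of any package) and its noise_sd argument is an added assumption that makes it possible to vary the noise as well as the slope. With the default noise, the positive-slope cases reproduce the r values shown earlier.

```r
# Hypothetical helper: simulate y = slope * x + 6 + noise and return r
simulate_cor <- function(slope, noise_sd = 5, n = 1000, seed = 12345) {
  set.seed(seed)
  x <- rnorm(n = n, mean = 120, sd = 15)
  y <- slope * x + 6 + rnorm(n = n, mean = 3, sd = noise_sd)
  c(slope = slope, noise_sd = noise_sd, r = cor(x, y))
}

# Same positive slopes as in the sections above, one row per slope
t(sapply(c(1, 0.2, 0.05), simulate_cor))

# Keeping the slope at 1 but adding more noise lowers r
simulate_cor(slope = 1, noise_sd = 5)
simulate_cor(slope = 1, noise_sd = 30)
```

The last two calls use the same slope but give very different correlations, which shows that r reflects how tightly the points cluster around the line rather than how steep the line is.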
By default, the cor() function computes the Pearson correlation coefficient, a numerical measure of the degree of linear relationship between two variables. The Pearson coefficient is appropriate when both variables are quantitative. If the variables are ordinal, i.e. ranks or ordered scores such as a beauty score or an intelligence score, the Spearman correlation coefficient is used instead.
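To see how the two coefficients can differ on the same data, here is a toy example with made-up values. y always increases with x, but not along a straight line. Because Spearman works on ranks, it only looks at the ordering of the values.

```r
# Made-up data: y grows with x, but exponentially rather than linearly
x <- 1:10
y <- exp(x)

cor(x, y, method = "pearson")   # below 1: the points do not fall on a straight line
cor(x, y, method = "spearman")  # 1: the ordering of y matches the ordering of x
```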
Rank correlation
While Pearson correlation is often used for quantitative continuous variables, Spearman correlation (which is based on the ranked values of each variable rather than on the raw data) is often used to evaluate relationships involving ordinal variables. The rankings of 10 trainees in two skill sets are given below. Since these data are ranks, we will use the Spearman correlation coefficient.
test_score <- data.frame(programming = c(3,5,8,4,7,10,2,1,6,9),Analysis=c(6,4,9,8,1,2,3,10,5,7))
test_score
The Spearman correlation coefficient is calculated as:
cor(test_score$programming,test_score$Analysis,method = "spearman")
[1] -0.2969697
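Since the rank correlation coefficient is negative, the two rankings are negatively associated: trainees ranked higher in programming tend to be ranked lower in analysis. At about -0.3, the association is weak.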
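The Spearman coefficient can be reproduced in two ways, which may help in seeing what it measures. It is the Pearson correlation of the ranks, and when there are no ties it also equals the classic formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks of each trainee. The sketch below uses the same rankings as above. The vector names are illustrative.

```r
programming <- c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9)
analysis    <- c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)

# 1) Pearson correlation computed on the ranks
cor(rank(programming), rank(analysis), method = "pearson")

# 2) Rank-difference formula (valid here because there are no ties)
d <- rank(programming) - rank(analysis)
n <- length(d)
1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Both lines print -0.2969697, matching cor(..., method = "spearman")
```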
Summary
The covariance measures how two variables vary together, and its sign gives the direction of the linear relationship. It has a major flaw as a measure of that relationship: because its value depends on the units of measurement and can be any real number, you cannot use it to determine the relative strength of the relationship.
The coefficient of correlation indicates the strength and direction of the linear relationship, or association, between two variables. The closer it gets to +1 or -1, the stronger the linear relationship. When it is close to zero, little or no linear relationship exists.
The Spearman correlation coefficient is used when the variables are ordinal in nature (e.g. ranks or ordered scores such as a beauty score).
References:
- Dplyr API
- ggplot2 API
- Statistics using R – An integrative approach, Daphna Harel. Cambridge University Press, 2020
Further Resources
If you need help drawing scatter plots with the ggplot2 package, please check out my YouTube videos on ggplot2.