In this tutorial we will look at two measures of the relationship between two numeric variables: the covariance and the coefficient of correlation.

**The covariance**

The covariance measures how two numeric variables vary together: a positive value means they tend to rise and fall together, and a negative value means one tends to fall as the other rises. Let’s work with sample data on `advertisement expenditure` and `product sale` to evaluate the effectiveness of a company's advertising campaign.

Let’s prepare a dataset:

```
library(tidyverse)
ad_data <- data.frame(year=c(2003,2004,2005,2006,2007,2008,2009,2010),ad_expenditure = c(12000,15000,16000,23000,24000,38000,42000,48000),product_sales = c(500000,560000,580000,700000,720000,880000,920000,950000))
ad_data
```

The covariance can be calculated using the `cov()` function.

```
covariance <- cov(ad_data$ad_expenditure,ad_data$product_sales)
covariance
[1] 2345357143
```

The covariance has a major flaw as a measure of the linear relationship between two numeric variables. **Because its size depends on the units of both variables, it can take any value, so you cannot use it to judge the relative strength of the relationship.** In the example above, you cannot tell whether the value 2345357143 indicates a strong or a weak relationship between ad expenditure and product sales; the positive sign only tells you that the two tend to increase together.
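One way to see the problem is to change the units. The sketch below repeats the tutorial's figures so it runs on its own and, as an illustrative assumption, also expresses sales in thousands of dollars (`sales_thousands` is a name introduced here). The relationship itself is unchanged, yet the covariance shrinks by a factor of 1,000.

```r
# Same figures as ad_data above, repeated so the snippet runs on its own
ad_expenditure <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
product_sales  <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

# Hypothetical rescaling: sales expressed in thousands of dollars
sales_thousands <- product_sales / 1000

cov_dollars   <- cov(ad_expenditure, product_sales)
cov_thousands <- cov(ad_expenditure, sales_thousands)

cov_dollars                  # 2345357143, as computed above
cov_thousands                # 1000 times smaller, same data
cov_dollars / cov_thousands  # ratio of the two covariances
```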

**The coefficient of correlation**

**The coefficient of correlation (r) measures the relative strength of a linear relationship between two numeric variables.** Its value ranges from -1 to +1, with -1 being a perfect negative correlation and +1 being a perfect positive correlation. Here *perfect relationship* means that if we draw a scatter plot, all the points lie on a single straight line. Let’s find the coefficient of correlation between ad expenditure and product sales. It is calculated using the `cor()` function.

```
cor(ad_data$ad_expenditure,ad_data$product_sales)
[1] 0.9879274
```
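The value reported by `cor()` can be reproduced by dividing the covariance by the product of the two standard deviations, which is what removes the units. A minimal check, with the data repeated so it runs on its own (`r_manual` is a name introduced here):

```r
ad_expenditure <- c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000)
product_sales  <- c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)

# Pearson's r = covariance / (sd of x * sd of y)
r_manual <- cov(ad_expenditure, product_sales) /
  (sd(ad_expenditure) * sd(product_sales))

r_manual
all.equal(r_manual, cor(ad_expenditure, product_sales))

# Rescaling a variable (here: sales in thousands) leaves r unchanged
all.equal(cor(ad_expenditure, product_sales / 1000), r_manual)
```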

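Since the coefficient of correlation is close to 1, we can conclude that **there is a strong positive correlation between ad expenditure and product sales**. In other words, years with higher ad expenditure also had higher product sales, and the points lie close to a straight line. Note that r does not tell us how many dollars of sales each advertising dollar brings in (that is the slope of the fitted line), and a high correlation alone does not prove that advertising caused the sales. The relationship between the two variables can be shown with a scatter plot.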

```
ggplot(ad_data,aes(x=ad_expenditure,y=product_sales))+
geom_point()+
geom_smooth(method = "lm")
```
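The line drawn by `geom_smooth(method = "lm")` is a least-squares fit. Its slope, rather than r, estimates how much sales change per extra dollar of advertising. A minimal sketch using base R's `lm()`, with the data repeated so it runs on its own (`fit`, `slope`, `x` and `y` are names introduced here):

```r
ad_data <- data.frame(
  ad_expenditure = c(12000, 15000, 16000, 23000, 24000, 38000, 42000, 48000),
  product_sales  = c(500000, 560000, 580000, 700000, 720000, 880000, 920000, 950000)
)
x <- ad_data$ad_expenditure
y <- ad_data$product_sales

# Least-squares line: product_sales = intercept + slope * ad_expenditure
fit   <- lm(product_sales ~ ad_expenditure, data = ad_data)
slope <- unname(coef(fit)["ad_expenditure"])
slope   # roughly 12.6 dollars of sales per advertising dollar

# The slope is cov(x, y) / var(x); multiplying by sd(x) / sd(y) gives r
all.equal(slope, cov(x, y) / var(x))
all.equal(slope * sd(x) / sd(y), cor(x, y))
```

The slope describes the association seen in these eight years only; it is not evidence of what would happen if the advertising budget were changed.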

**Since the coefficient of correlation varies from -1 to +1, the relationship between two variables can be described as strong, moderate or weak depending on its value.** Let’s work with simulated data sets and generate scatter plots for various correlation values.

**Strong positive correlation**

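Since r is close to 1, there is a strong positive relationship between the two variables. **As X increases, Y increases, and the points lie close to a straight line.** In this simulation the slope is set to 1, so Y rises by about one unit per unit increase in X; that comes from the slope in the code, not from the value of r.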

```
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
```

```
cor(x,y,method = "pearson")
[1] 0.9490752
```

```
# geom_point() alone is enough; alpha makes overlapping points easier to read
ggplot(dummy_data,aes(x=x,y=y))+
geom_point(alpha=0.2)+
geom_smooth(method = "lm")
```

**Medium positive correlation**

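Since r is close to 0.5, there is a moderate positive relationship between the two variables. **As X increases, Y tends to increase, but with considerable scatter around the line.** Here Y rises by about 0.2 per unit increase in X (the slope in the code); an r of 0.5 does not mean that Y grows by half of X.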

```
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- 0.2*x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
```

```
cor(x,y,method = "pearson")
[1] 0.5351744
```
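To see that r reflects the scatter around the line rather than the slope, the sketch below keeps the slope at 0.2 and changes only the noise standard deviation. The noise levels 1, 5 and 20 and the helper name `r_for_noise` are assumptions made for illustration.

```r
# Hypothetical helper: correlation for y = 0.2 * x + 6 + noise
r_for_noise <- function(noise_sd) {
  set.seed(12345)
  x <- rnorm(n = 1000, mean = 120, sd = 15)
  y <- 0.2 * x + 6 + rnorm(n = 1000, mean = 3, sd = noise_sd)
  cor(x, y)
}

# Same slope each time; only the spread around the line changes
r_values <- sapply(c(1, 5, 20), r_for_noise)
round(r_values, 2)
```

The correlation drops as the noise grows even though Y still changes by 0.2 per unit of X in every case.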

**Weak positive correlation**

Here r is about 0.18, which indicates a weak positive relationship between the two variables. **As X increases, Y increases only slightly on average, and the trend is hard to see through the scatter.**

```
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- 0.05*x + 6 + rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
```

```
cor(x,y,method = "pearson")
[1] 0.1842766
```

**Strong negative correlation**

Since r is close to -1, there is a strong negative relationship between the two variables. **As X increases, Y decreases, and the points lie close to a straight line.** In this simulation the slope is -1, so Y falls by about one unit per unit increase in X.

```
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
```

```
cor(x,y,method = "pearson")
[1] -0.9490752
```

**Medium negative correlation**

Since r is close to -0.5, there is a moderate negative relationship between the two variables. **As X increases, Y tends to decrease, but with considerable scatter around the line.** Here Y falls by about 0.2 per unit increase in X (the slope in the code).

```
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -0.2*x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
```

```
cor(x,y,method = "pearson")
[1] -0.5351744
```

**Weak negative correlation**

Here r is about -0.18, which indicates a weak negative relationship between the two variables. **As X increases, Y decreases only slightly on average, and the trend is hard to see through the scatter.**

```
set.seed(12345)
x <- rnorm(n=1000, mean=120, sd=15)
y <- -0.05*x + 6 - rnorm(n=1000, mean = 3, sd = 5)
dummy_data <- data.frame(x,y)
```

```
cor(x,y,method = "pearson")
[1] -0.1842766
```
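The six simulations above differ only in the slope, so they can be collected in one helper. This is a sketch: `simulate_r` is a name introduced here, and using `sign(slope)` to flip the noise term mirrors how the negative examples subtract the `rnorm()` term.

```r
# Hypothetical helper that re-creates each simulated data set and returns r
simulate_r <- function(slope) {
  set.seed(12345)
  x     <- rnorm(n = 1000, mean = 120, sd = 15)
  noise <- rnorm(n = 1000, mean = 3, sd = 5)
  y     <- slope * x + 6 + sign(slope) * noise
  cor(x, y, method = "pearson")
}

slopes  <- c(1, 0.2, 0.05, -0.05, -0.2, -1)
r_table <- data.frame(slope = slopes, r = sapply(slopes, simulate_r))
r_table
```

Because every run uses the same seed, each negative case is an exact mirror image of the matching positive case.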

By default, the `cor()` function uses the Pearson correlation coefficient, which gives a numerical measure of the degree of linear relationship between two variables. The Pearson correlation coefficient is appropriate when both variables are quantitative. **If the variables are ordinal, that is, ranks or ordered ratings (e.g. a beauty score or an intelligence score), the Spearman correlation coefficient is used instead.**
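The `method` argument of `cor()` selects the coefficient. As a small illustration with made-up data (the vectors `dose` and `response` are assumptions, not part of the tutorial's data), a relationship that always increases but is not a straight line gives a Spearman coefficient of 1, while Pearson's r falls below 1:

```r
# Made-up monotonic but non-linear data
dose     <- 1:10
response <- dose^3

pearson_r  <- cor(dose, response, method = "pearson")
spearman_r <- cor(dose, response, method = "spearman")

pearson_r    # below 1: the points do not lie on a straight line
spearman_r   # 1: the ranks of dose and response agree exactly
```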

**Rank correlation**

While Pearson correlation is often used for quantitative continuous variables, Spearman correlation (which is based on the ranked values of each variable rather than on the raw data) is often used to evaluate relationships involving ordinal variables. The rankings of 10 trainees in two skill sets are given below. Since these data are ranks, we will use the Spearman correlation coefficient.

```
test_score <- data.frame(programming = c(3,5,8,4,7,10,2,1,6,9),Analysis=c(6,4,9,8,1,2,3,10,5,7))
test_score
```

The Spearman correlation coefficient is calculated as:

```
cor(test_score$programming,test_score$Analysis,method = "spearman")
[1] -0.2969697
```

Since the rank correlation coefficient is negative, the two rankings are negatively correlated: they tend to move in opposite directions. At about -0.3, however, the relationship is weak.
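Spearman's coefficient is Pearson's r computed on ranks. Because the scores here are already ranks with no ties, the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) gives the same value, where d is the difference between each trainee's two ranks. A minimal check, with the data repeated so it runs on its own (`rho_ranks`, `rho_formula`, `d` and `n` are names introduced here):

```r
programming <- c(3, 5, 8, 4, 7, 10, 2, 1, 6, 9)
analysis    <- c(6, 4, 9, 8, 1, 2, 3, 10, 5, 7)

# Pearson's r applied to the ranks
rho_ranks <- cor(rank(programming), rank(analysis), method = "pearson")

# Shortcut formula, valid only when there are no tied ranks
d <- programming - analysis
n <- length(d)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

rho_ranks
rho_formula   # both agree with cor(..., method = "spearman")
```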

**Summary**

The covariance measures how two numeric variables vary together, but it has a major flaw as a measure of linear relationship. Because its value depends on the units of the variables and can be any number, you cannot use it to determine the relative strength of the relationship.

The coefficient of correlation indicates the linear relationship, or association, between two variables. The closer it is to +1 or -1, the stronger the linear relationship. When it is close to zero, little or no linear relationship exists.

The Spearman correlation coefficient is used when the variables are ordinal, that is, ranks or ordered ratings (e.g. test rankings, beauty scores).

**References:**

- dplyr API
- ggplot2 API
- Statistics using R – An integrative approach, Daphna Harel. Cambridge University Press, 2020

**Further Resources**

If you need help drawing scatter plots with the ggplot2 package, please check out my YouTube videos on ggplot2.