How do you remove skewness in your data? Learn 6 powerful data transformations.


Many statistical tests assume that the data is normally distributed. If the underlying data is not normal, we need to transform it to be approximately normal before we apply these tests. Transforming the data removes skewness, mitigates bias, and enhances the robustness of statistical models.

In this tutorial, we will use different transformations to normalise data. We will simulate a right-skewed and a left-skewed data set, then demonstrate the different methods used to normalise their distributions using transformations. So let’s get started.

This tutorial is divided into four parts:

  • Understanding skewness
  • Transformations to handle positive skewness
  • Transformations to handle negative skewness
  • Statistical evaluation of transformations

Skewness measures the extent to which the data values are not symmetrical around the mean. The three possibilities are:

  • Mean < median: a negative, or left-skewed, distribution
  • Mean = median: a symmetrical distribution (zero skewness)
  • Mean > median: a positive, or right-skewed, distribution

In a symmetrical distribution, the values below the mean are distributed in exactly the same way as the values above the mean. Therefore, for a symmetrical distribution, the skewness is zero.

In a skewed distribution, the data values below and above the mean are imbalanced. In such cases, the skewness is non-zero.
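As a quick illustration of the mean/median rule, here is a minimal sketch (the exponential sample below is hypothetical and used only for this check, not later in the tutorial):

set.seed(42)
x <- rexp(1000, rate = 1)  # exponential draws are right skewed
mean(x) > median(x)        # TRUE: the long right tail pulls the mean above the median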

Let’s simulate a left-skewed (negative) distribution. We will use the Beta distribution to generate the left-skewed data.

library(tidyverse)
library(gridExtra)

set.seed(123)
negetive_skew <- rbeta(1000, 100, 1) * 10  # Beta(100, 1) piles mass near the upper end, giving a long left tail
plot_negskew <- ggplot(data.frame(x = negetive_skew), aes(x)) +
  geom_histogram(binwidth = 0.01, color = "black", fill = "lightblue") +
  ggtitle("Negatively Skewed Data")

plot_negskew

The above histogram displays the left-skewed (negative) distribution. In a left-skewed distribution, most of the values lie in the upper portion of the range. A few extremely small values create the long tail to the left and pull the mean below the median.

To find the skewness of the distribution, we can use the skewness() function from the moments package. Please install the moments package first to use the skewness() function.

library(moments)
skewness(negetive_skew)

[1] -2.144462

Since the skewness is less than zero (negative), the distribution is negatively skewed, with the majority of data values greater than the mean.
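As a sanity check, you can verify the mean/median rule directly on the simulated data (the same check, with the inequality reversed, applies to the right-skewed sample generated below):

mean(negetive_skew) < median(negetive_skew)  # TRUE for a left-skewed sample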

Let’s simulate a right-skewed (positive) distribution. Again, we will use the Beta distribution to generate the right-skewed data.

set.seed(123)
positive_skew <- rbeta(1000, 1, 100) * 10  # Beta(1, 100) piles mass near zero, giving a long right tail
plot1 <- ggplot(data.frame(x = positive_skew ), aes(x)) +
  geom_histogram(binwidth = 0.01, color = "black", fill = "lightblue") +
  ggtitle("Positively Skewed Data")
plot1

The above histogram displays the right-skewed (positive) distribution. In a right-skewed distribution, most of the values lie in the lower portion of the range. A few extremely large values create the long tail to the right and pull the mean above the median.

The skewness value of the distribution is:

skewness(positive_skew)

[1] 2.144462

Since the skewness is greater than zero (positive), the distribution is positively skewed, with the majority of data values less than the mean.

We have now looked at both positively and negatively skewed data. If we use this data without transformation, the extreme values will introduce bias into the analysis, leading to less reliable and less robust statistical models. In the next sections, we will look at various transformations to remove the skewness in the data.

We can use four common transformations to tackle positively skewed data.

  • Log transformation
  • Square root transformation
  • Box-Cox transformation
  • Yeo-Johnson transformation

The logarithm is a classic tool for dealing with highly skewed data. It is a non-linear transformation: it changes the relative distances between the values in the distribution and, in the process, reshapes the distribution. A log transformation will not make a distribution perfectly symmetric, because it has a much greater impact on extreme values than on moderate ones.

log_positive_skew <- log(positive_skew, 2)  # log base 2; the base only rescales, so it does not affect skewness

plot2 <- ggplot(data.frame(x = log_positive_skew ), aes(x)) +
  geom_histogram(binwidth = 0.2, color = "black", fill = "lightblue") +
  ggtitle("log transformed Data")

plot2

grid.arrange(plot1,plot2)

The skewness of the transformed data is:

print(paste("Skewness after log transformation:",round(skewness(log_positive_skew),3)))

[1] "Skewness after log transformation: -1.241"

You can see the magnitude of the skewness is reduced, but the transformation has over-corrected: the data is now negatively skewed.

The square root transform modifies the data set non-linearly by taking the square root of each value.

sqrt_positive_skew <- sqrt(positive_skew)

plot3 <- ggplot(data.frame(x = sqrt_positive_skew ), aes(x)) +
  geom_histogram(binwidth = 0.02, color = "black", fill = "lightblue") +
  ggtitle("square root transformed Data")

grid.arrange(plot1,plot3)

The skewness of the transformed data is:

print(paste("Skewness after log transformation:",round(skewness(sqrt_positive_skew),3)))

[1] "Skewness after log transformation: 0.681"

You can see the skewness is further reduced, and the transformed data looks near-normal.

The Box-Cox transformation is a technique used to bring non-normal data closer to a normal distribution by applying a power transformation. It is commonly used in statistical modelling to improve the normality of the data, and it is applicable only to strictly positive data. The Box-Cox transformation uses maximum likelihood estimation to find the transformation parameter lambda in the following equation that optimises the normality of the transformed values \(x^{(\lambda)}\):

\[x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log(x), & \lambda = 0 \end{cases}\]

The above equation is solved for the optimum value of lambda, i.e. the one that yields the best normality.

We will use the caret package (Classification and Regression Training) to apply the Box-Cox transformation.

library(caret)
bc_positive_skew <- BoxCoxTrans(positive_skew) # Apply Box-Cox transformation to find out lambda

bc_positive_skew_transformed <- predict(bc_positive_skew,positive_skew) # transform the data

plot4 <- ggplot(data.frame(x = bc_positive_skew_transformed ), aes(x)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
  ggtitle("Box-Cox transformed Data")

grid.arrange(plot1,plot4)

The skewness of the transformed data is:

print(paste("Skewness after Box-Cox transformation:",round(skewness(bc_positive_skew_transformed),3)))

[1] "Skewness after Box-Cox transformation: 0.057"

You can see the skewness is further reduced, and the transformed data looks almost normal.
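If you are curious which power was actually applied, you can inspect the estimated lambda. A minimal sketch, assuming (as in caret) that the fitted BoxCoxTrans object stores the estimate in its lambda element:

bc_positive_skew$lambda  # the lambda chosen by maximum likelihood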

The Box-Cox transformation only works with positive data. The Yeo-Johnson transformation is similar to Box-Cox but accommodates zero and negative values as well. Like Box-Cox, it uses maximum likelihood estimation to estimate a transformation parameter lambda. This adaptability allows it to manage skewness across a wide range of data values, improving its fit for statistical models.
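For reference, this is the standard piecewise definition of the Yeo-Johnson transformation (included here for completeness; it is not specific to any one R package):

\[\psi(\lambda, y) = \begin{cases} \dfrac{(y+1)^{\lambda} - 1}{\lambda}, & y \ge 0,\ \lambda \neq 0 \\ \log(y+1), & y \ge 0,\ \lambda = 0 \\ -\dfrac{(1-y)^{2-\lambda} - 1}{2-\lambda}, & y < 0,\ \lambda \neq 2 \\ -\log(1-y), & y < 0,\ \lambda = 2 \end{cases}\]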

yeo_positive_skew <- bestNormalize::yeojohnson(positive_skew) # Apply Yeo-Johnson transformation to find out lambda

yeo_positive_skew_transformed <- predict(yeo_positive_skew,positive_skew) # transform the data

plot5 <- ggplot(data.frame(x = yeo_positive_skew_transformed ), aes(x)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
  ggtitle("Histogram of Yeo-Johnson transformed Data")

grid.arrange(plot1,plot5)

The skewness of the transformed data is:

print(paste("Skewness after Yeo-Johnson transformation:",round(skewness(yeo_positive_skew_transformed),3)))

[1] "Skewness after Yeo-Johnson transformation: 0.593"

You can see the skewness is substantially reduced; however, the result is not as good as the Box-Cox transformation.

The following visual provides a side-by-side comparison, helping you better understand the influence of each transformation on the distribution of the skewed data.

grid.arrange(plot1,plot2,plot3,plot4,plot5,nrow=2,ncol=3)

All transformations for positively skewed data

The consolidated table of skewness values suggests that the Box-Cox transformation produces the most nearly normal data.

tranforms_data_positive <- as.data.frame(cbind(positive_skew,log_positive_skew,sqrt_positive_skew,bc_positive_skew_transformed,yeo_positive_skew_transformed))

test_results_positive <- c("positive skew data","log transform","square-root transform","box-cox transform","yeo-johnson transform")

positive_skew_results <- function(labels, data){
  skewness_values <- rep(NA, length(labels))
  for (i in seq_along(labels)) {
    skewness_values[i] <- round(skewness(data[[i]]), 3)  # data[[i]] extracts the i-th column as a plain vector
  }
  print(cbind("transformations" = labels, "Skewness Value" = skewness_values), quote = FALSE)
}

positive_skew_results(test_results_positive, tranforms_data_positive)
     transformations       Skewness Value
[1,] positive skew data    2.144         
[2,] log transform         -1.241        
[3,] square-root transform 0.681         
[4,] box-cox transform     0.057         
[5,] yeo-johnson transform 0.593    

We can use four common transformations to tackle negatively skewed data.

  • Squared transformation
  • Cubed transformation
  • Box-Cox transformation
  • Yeo-Johnson transformation

This involves taking each data point in the dataset and squaring it (i.e., raising it to the power of 2). The squared transformation is useful for reducing negative skewness because it spreads out the higher values more than the lower ones, stretching the compressed upper portion of the distribution. However, it is most effective when all data points are positive and the degree of negative skewness is not extreme.

sqr_negative_skew <- negetive_skew^2

plot1_neg <- ggplot(data.frame(x = sqr_negative_skew ), aes(x)) +
  geom_histogram(binwidth = 0.2, color = "black", fill = "lightblue") +
  ggtitle("squared transformed Data")

grid.arrange(plot_negskew ,plot1_neg)

The skewness of the transformed data is:

print(paste("Skewness after squared transformation:",round(skewness(sqr_negative_skew),3)))

[1] "Skewness after squared transformation: -2.089"

The skewness of the data is slightly reduced; however, there is no remarkable improvement in the normality of the data.

This is similar to the squared transformation, but it involves raising each data point to the power of 3. The cubed transformation can reduce negative skewness further.

cube_negative_skew <- negetive_skew^3

plot2_neg <- ggplot(data.frame(x = cube_negative_skew ), aes(x)) +
  geom_histogram(binwidth = 2, color = "black", fill = "lightblue") +
  ggtitle("cubed transformed Data")

grid.arrange(plot_negskew ,plot2_neg)

The skewness of the transformed data is:

print(paste("Skewness after cubed transformation:",round(skewness(cube_negative_skew),3)))

[1] "Skewness after cubed transformation: -2.036"

The skewness value is slightly further reduced; however, it is still far from zero.

The Box-Cox transformation works the same way on negatively skewed data. The basic principle remains the same: maximum likelihood estimation is used to estimate the transformation parameter lambda, in the equation given earlier, that optimises the normality of the data.

bc_negative_skew <- BoxCoxTrans(negetive_skew) # Apply Box-Cox transformation to find out lambda

bc_negative_skew_transformed <- predict(bc_negative_skew,negetive_skew) # transform the data

plot3_neg <- ggplot(data.frame(x = bc_negative_skew_transformed ), aes(x)) +
  geom_histogram(binwidth = 0.1, color = "black", fill = "lightblue") +
  ggtitle("Box-Cox transformed Data")

grid.arrange(plot_negskew,plot3_neg)

The skewness of the transformed data is:

print(paste("Skewness after Box-Cox transformation:",round(skewness(bc_negative_skew_transformed),3)))

[1] "Skewness after Box-Cox transformation: -2.089"

The Yeo-Johnson transformation is similar to Box-Cox, but it is designed to handle both positive and negative data. For negatively skewed data, the Yeo-Johnson transformation can normalise distributions even when negative values are present.

yeo_negative_skew <- bestNormalize::yeojohnson(negetive_skew) # Apply Yeo-Johnson transformation to find out lambda

yeo_negative_skew_transformed <- predict(yeo_negative_skew,negetive_skew) # transform the data

plot4_neg <- ggplot(data.frame(x = yeo_negative_skew_transformed), aes(x)) +
  geom_histogram(binwidth = 0.05, color = "black", fill = "lightblue") +
  ggtitle("Yeo-Johnson transformed Data")

grid.arrange(plot_negskew,plot4_neg)

The skewness of the transformed data is:

print(paste("Skewness after Yeo-Johnson transformation:",round(skewness(yeo_negative_skew_transformed),3)))

[1] "Skewness after Yeo-Johnson transformation: -1.952"

As you can observe, there is hardly any improvement in the skewness of the transformed data. In such cases it is worth checking for outliers and, if justified, dropping them before transforming, as sketched below.
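As an illustration of that suggestion, here is a minimal sketch using the common 1.5 × IQR fence (one convention among several) to trim the extreme low-tail values and re-check the skewness:

lower_fence <- quantile(negetive_skew, 0.25) - 1.5 * IQR(negetive_skew)  # standard lower outlier fence
trimmed <- negetive_skew[negetive_skew >= lower_fence]                   # drop extreme low-tail values
skewness(trimmed)  # compare against the untrimmed value of -2.144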

The following visual provides a side-by-side comparison, helping you better understand the influence of each transformation on the distribution of the skewed data.

grid.arrange(plot_negskew,plot1_neg,plot2_neg,plot3_neg,plot4_neg,nrow=2,ncol=3)

The consolidated table of skewness values suggests there is a marginal improvement in skewness using the Yeo-Johnson transformation. However, none of the transformations produced normalised data.

tranforms_data_neg <- as.data.frame(cbind(negetive_skew,cube_negative_skew,sqr_negative_skew,bc_negative_skew_transformed,yeo_negative_skew_transformed))

test_results <- c("negative skew data","cube transform","square transform","box-cox transform","yeo-johnson transform")

neg_skew_results <- function(labels, data){
  skewness_values <- rep(NA, length(labels))
  for (i in seq_along(labels)) {
    skewness_values[i] <- round(skewness(data[[i]]), 3)  # data[[i]] extracts the i-th column as a plain vector
  }
  print(cbind("test_results" = labels, "skewness_value" = skewness_values), quote = FALSE)
}

neg_skew_results(test_results, tranforms_data_neg)
     test_results          skewness_value
[1,] negative skew data    -2.144        
[2,] cube transform        -2.036        
[3,] square transform      -2.089        
[4,] box-cox transform     -2.089        
[5,] yeo-johnson transform -1.952        

There are many statistical tests that we can use to quantify whether a sample follows a Gaussian distribution.

Before you can apply a statistical test, you must know how to interpret its results.

Each test will return two things:

  • Test statistic: describes how closely the distribution of your data matches the distribution predicted under the null hypothesis of the statistical test you are using.
  • p-value: used to interpret the test; in this case, whether the sample was drawn from a Gaussian distribution.

Interpreting the test result via the test statistic requires a deeper level of proficiency in statistics and deeper knowledge of the specific statistical test. Instead, the p-value can be used to quickly and accurately interpret the result in practical applications.

The test assumes that the sample was drawn from a Gaussian distribution. Technically this is called the null hypothesis, or \(H_0\). A threshold level called alpha, typically 5%, is chosen and used to interpret the p-value.

In terms of the p-value, you can interpret the result as:

  • \(p \le \alpha\): reject the null hypothesis
  • \(p > \alpha\): fail to reject the null hypothesis

This means we want a larger p-value to support the claim that our sample was likely drawn from a Gaussian distribution. We will use one such statistical test, the Anderson-Darling test, to check the normality of the data; a quick illustration of the decision rule follows.
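To see the decision rule in action before applying it to our data, here is a minimal sketch using base R's shapiro.test() (a different normality test, used here purely to illustrate the p-value logic):

set.seed(1)
alpha <- 0.05
shapiro.test(rnorm(500))$p.value > alpha  # TRUE: fail to reject H0 for a normal sample
shapiro.test(rexp(500))$p.value > alpha   # FALSE: reject H0 for a clearly skewed sample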

Anderson-Darling Test for negatively skewed data

The Anderson-Darling Test is a goodness-of-fit test that determines how well your data fits a given distribution.

This test is most typically used to check whether your data follow a normal distribution.

The ad.test() function from the nortest package computes the Anderson-Darling test statistic and p-value. You will have to install the nortest package first in order to use ad.test().

library(nortest)

tranforms_data_neg <- as.data.frame(cbind(negetive_skew,cube_negative_skew,sqr_negative_skew,bc_negative_skew_transformed,yeo_negative_skew_transformed))

test_results <- c("negative skew data","cube transform","square transform","box-cox transform","yeo-johnson transform")

neg_adtest_results <- function(labels, data){
  pvalues <- rep(NA, length(labels))
  for (i in seq_along(labels)) {
    pvalues[i] <- ad.test(data[[i]])$p.value  # ad.test() returns an htest object; the p-value lives in p.value
  }
  print(cbind("test_results" = labels, "p-value" = pvalues), quote = FALSE)
}

neg_adtest_results(test_results, tranforms_data_neg)
     test_results          p-value
[1,] negative skew data    3.7e-24
[2,] cube transform        3.7e-24
[3,] square transform      3.7e-24
[4,] box-cox transform     3.7e-24
[5,] yeo-johnson transform 3.7e-24

The test assumes the sample was drawn from a Gaussian distribution (the null hypothesis). Since the p-value is less than alpha for every column, we reject the null hypothesis in each case: neither the original negatively skewed sample nor any of its transformations is consistent with a Gaussian distribution.

For the positively skewed data, the results are promising. The p-value for the Box-Cox transformation is greater than alpha, so we fail to reject the null hypothesis and conclude that, after the Box-Cox transformation, the data is consistent with a normal distribution.

tranforms_data_positive <- as.data.frame(cbind(positive_skew,log_positive_skew,sqrt_positive_skew,bc_positive_skew_transformed,yeo_positive_skew_transformed))

test_results_positive <- c("positive skew data","log transform","square-root transform","box-cox transform","yeo-johnson transform")

pos_adtest_results <- function(labels, data){
  pvalues <- rep(NA, length(labels))
  for (i in seq_along(labels)) {
    pvalues[i] <- ad.test(data[[i]])$p.value  # iterate over the positive-skew labels and columns
  }
  print(cbind("test_results" = labels, "p-value" = pvalues), quote = FALSE)
}

pos_adtest_results(test_results_positive, tranforms_data_positive)
     test_results          p-value             
[1,] positive skew data    3.7e-24             
[2,] log transform         3.7e-24             
[3,] square-root transform 2.20152962495107e-13
[4,] box-cox transform     0.896158364454503   
[5,] yeo-johnson transform 3.7e-24             

In this tutorial, we generated simulated positively and negatively skewed data sets and used various transformation techniques to convert the data into Gaussian-like distributions.

We used log, square-root, Box-Cox and Yeo-Johnson transformations on positively skewed data.

We used squared, cubed, Box-Cox and Yeo-Johnson transformations on negatively skewed data.

In both cases we checked the effectiveness of the transformations using skewness values and statistical tests.

