Get Introduced to R’s apply Family: Your Guide to apply(), lapply(), sapply(), and tapply()


In R, a for loop repeatedly evaluates an expression while iterating over a list or vector. In practice, a for loop is often the last choice: when the iterations are independent of one another, the alternatives are much cleaner and easier to write and read.

The following code uses a for loop to create a list of three independent, normally distributed random vectors whose lengths are given by the vector len:

len <- c(3, 4, 5)
x <- list() # create an empty list
set.seed(123) # initialize the random number generator

for (i in seq_along(len)) {
  x[[i]] <- rnorm(len[i])
}
x

[[1]]
[1] -0.5604756 -0.2301775  1.5587083

[[2]]
[1] 0.07050839 0.12928774 1.71506499 0.46091621

[[3]]
[1] -1.2650612 -0.6868529 -0.4456620  1.2240818  0.3598138

The preceding example is simple, but the code is quite redundant compared to an implementation with lapply:

set.seed(123)
lapply(len,rnorm)

[[1]]
[1] -0.5604756 -0.2301775  1.5587083

[[2]]
[1] 0.07050839 0.12928774 1.71506499 0.46091621

[[3]]
[1] -1.2650612 -0.6868529 -0.4456620  1.2240818  0.3598138

The lapply version is much simpler. It applies rnorm() to each element of len and collects the results in a list.
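
As a quick check, the sketch below reruns both versions with the same seed and compares the results. The names x_loop and x_lapply are introduced here only for illustration.

```r
len <- c(3, 4, 5)

# for-loop version: preallocate the list, then fill it element by element
set.seed(123)
x_loop <- vector("list", length(len))
for (i in seq_along(len)) {
  x_loop[[i]] <- rnorm(len[i])
}

# lapply version: same seed, so the same random numbers are drawn in the same order
set.seed(123)
x_lapply <- lapply(len, rnorm)

identical(x_loop, x_lapply)   # TRUE
```

Resetting the seed before each version is what makes the comparison fair; without it the second call would continue the random stream and produce different numbers.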

This succinct code is possible only because R lets us pass functions around as objects. The rnorm function is passed to lapply just like any other argument. This feature makes code considerably more flexible.
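
To make the idea concrete, here is a minimal sketch of a function that receives another function as an argument. The helper summarise_with and the heights vector are hypothetical names made up for this example; they are not part of base R.

```r
# A hypothetical helper: it takes a vector `x` and a function `f`,
# and calls `f` on `x` just as it would use any other argument.
summarise_with <- function(x, f) {
  f(x)
}

heights <- c(15, 10, 12, 9, 17)

summarise_with(heights, mean)                      # 12.6
summarise_with(heights, max)                       # 17
summarise_with(heights, function(v) sum(v > 10))   # anonymous function: 3
```

The last call shows that the function does not even need a name: it can be written inline where it is passed.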

The functions in the apply family are higher-order functions: each one accepts another function as an argument. The family has several members, and each performs a specific task.

In this article we will look at the important apply family functions listed below, and use simple examples to understand their usage and limitations.

apply
lapply
sapply
tapply

Let’s get started!

apply function

The apply function is a higher-order function that accepts a function as an argument and applies it to the rows or columns of a data frame or matrix.

The data set below records the height, in inches, of five individual plants at three time points (0 days, 10 days and 20 days). The first column is the plant ID, and each of the next three columns gives the plant height at one time point.

example_df <- data.frame(plant_ID = c("A", "B", "C", "D", "E"),
                         height_0 = c(15, 10, 12, 9, 17),
                         height_10 = c(20, 18, 14, 15, 19),
                         height_20 = c(23, 24, 18, 17, 26))
head(example_df)

We are interested in the mean height at each stage of plant growth. We can use either a for loop or the apply function to get the answer. Comparing the two, the apply version is more compact and readable.

Let’s use the apply function on the data set to find the mean values. The function mean is passed as an argument to apply, which calls it on every column of the data frame.

# drop the first column since it is a character vector
apply(example_df[-1], MARGIN = 2, FUN = mean)

height_0 height_10 height_20 
     12.6      17.2      21.6
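
For comparison, here is roughly what the same calculation looks like with a for loop. This is only a sketch; the names heights and col_means are made up for the example.

```r
example_df <- data.frame(plant_ID = c("A", "B", "C", "D", "E"),
                         height_0 = c(15, 10, 12, 9, 17),
                         height_10 = c(20, 18, 14, 15, 19),
                         height_20 = c(23, 24, 18, 17, 26))

heights <- example_df[-1]               # drop the character column
col_means <- numeric(ncol(heights))     # preallocate the result vector
names(col_means) <- names(heights)      # carry over the column names

for (j in seq_along(heights)) {
  col_means[j] <- mean(heights[[j]])    # mean of the j-th column
}

col_means

# Same values as the one-line apply() call
all.equal(col_means, apply(heights, MARGIN = 2, FUN = mean))   # TRUE
```

The loop needs three lines of setup before any work happens, which is the redundancy that apply removes.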

Here is the syntax of the apply function.

The first argument is the object (data frame or matrix) you want to analyze.
The second argument is MARGIN. It specifies the dimension of the object over which the function is applied. For a data frame or matrix it takes one of two values:
MARGIN = 1 – apply the function to each row
MARGIN = 2 – apply the function to each column
The last argument is the function that will be applied to the rows or columns.

Calculations in apply are carried out row-wise or column-wise, depending on the MARGIN value you set. In the example above, MARGIN = 1 instead gives the mean height of each plant across the three time points.

apply(example_df[-1],MARGIN=1,FUN = mean)

[1] 19.33333 17.33333 14.66667 13.66667 20.66667
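
A small matrix makes the two MARGIN values easy to compare side by side. The matrix m below is a made-up example, not part of the plant data.

```r
# A small 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

apply(m, MARGIN = 1, FUN = sum)   # one value per row:    9 12
apply(m, MARGIN = 2, FUN = sum)   # one value per column: 3 7 11
```

The length of the result follows the chosen dimension: two values for the two rows, three values for the three columns.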

We can also pass a custom function to apply. For example, suppose we want to know at which stages of growth the average plant height exceeds 15 inches. We can write a custom function, is_tall, that checks the condition, and pass it to apply.

is_tall <- function(x) {
  value <- mean(x) > 15
  return(value)
}

apply(example_df[,-1],MARGIN = 2, is_tall) # apply with custom function

 height_0 height_10 height_20 
    FALSE      TRUE      TRUE 

This tells us that at time point 0 the plants are not taller than 15 inches on average, while the opposite is true for time points 10 and 20.

lapply function

One limitation of the apply function is that it does not work on lists. If we have a list to work on, we can use the lapply function instead.

Suppose we have a simple list with two elements. If we wanted the average value of each element, we could call the mean function on each element individually.
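
A minimal sketch of this first approach, using a made-up two-element list called measurements (the name and the values are only for illustration):

```r
# Hypothetical list with two numeric elements
measurements <- list(a = c(2, 4, 6), b = c(10, 20, 30, 40))

# One call to mean() per element
mean(measurements$a)   # 4
mean(measurements$b)   # 25
```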

This method is pretty inefficient and makes us repeat our code. And what if we have, say, more than 100 list elements? That would be a pain to type out. Let’s try another method.

We could create a for loop and save the results in a vector. This method is better because it automates the process, which would be especially useful if our list had a ton of elements. But a for loop takes more effort to write, and still takes up quite a bit of space in our code.
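
A sketch of the for loop approach, again on a made-up list called measurements (the names measurements and means are hypothetical):

```r
# Hypothetical list with two numeric elements
measurements <- list(a = c(2, 4, 6), b = c(10, 20, 30, 40))

# Preallocate a numeric vector for the results and carry over the element names
means <- numeric(length(measurements))
names(means) <- names(measurements)

for (i in seq_along(measurements)) {
  means[i] <- mean(measurements[[i]])   # mean of the i-th list element
}

means   # a = 4, b = 25
```

The loop scales to any number of elements, but the preallocation and indexing are bookkeeping that has nothing to do with the calculation itself.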

The last method uses the lapply function, which lets us wrap all of these steps into a single line of code.

Here is an example.

We will create a list called plants containing three elements, each a vector of length ten. Each element holds a different plant attribute (height, mass, and number of flowers). We draw height and mass from a uniform distribution with runif, and use the sample function to shuffle the integers from 1 to 10 for the flower counts. No seed is set here, so the values you get will differ from those shown below.

plants <- list(height = runif(10, min = 10, max = 20),
               mass = runif(10, min = 5, max = 10),
               flowers = sample(1:10, 10))
plants

$height
 [1] 12.81165 11.30546 12.79607 10.22552 12.28770 11.78231
 [7] 17.53214 14.35947 11.67449 12.37116

$mass
 [1] 5.423982 5.290442 6.548579 8.275295 6.344244 7.635298
 [7] 9.136648 9.250006 7.958576 5.585793

$flowers
 [1]  6  3  9 10  5  2  8  1  4  7

We can now use the lapply function to find the mean value of each list element.

lapply(plants,FUN = mean)

$height
[1] 12.7146

$mass
[1] 7.144886

$flowers
[1] 5.5

Note that the output of lapply is always a list. Also, lapply has no MARGIN argument: the function mean is simply applied to each element of the list.
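
The short sketch below checks the return type and shows two common ways of getting at the results. The list plants_demo is a made-up stand-in with fixed values, so the numbers are reproducible.

```r
# Hypothetical list of plant measurements (fixed values, for illustration only)
plants_demo <- list(height = c(12, 14, 16), mass = c(5, 7, 9))

res <- lapply(plants_demo, mean)

is.list(res)   # TRUE: one list element per input element
res$height     # 14: results are accessed like any other list element
unlist(res)    # flattens the list into a named numeric vector
```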

sapply function

The output of the lapply function is always a list. If we want the output as a vector or a matrix, we can use the sapply function. It works the same way as lapply, but instead of always returning a list, it simplifies the result to a vector or matrix where possible.

sapply(plants,FUN = mean)

 height      mass   flowers 
12.714597  7.144886  5.500000

Notice that the output is a simple named numeric vector, not a list. You can confirm its type using the class function.

class(sapply(plants,FUN = mean))

[1] "numeric"
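
The shape of the result depends on what the applied function returns. The sketch below uses a made-up list, plants_demo, with fixed values to show the vector, matrix, and list cases.

```r
# Hypothetical list with fixed values, for illustration only
plants_demo <- list(height = c(12, 14, 16),
                    mass = c(5, 7, 9),
                    flowers = c(3, 1, 2))

# mean() returns one number per element, so the result is a named vector
sapply(plants_demo, mean)

# range() returns two numbers (min, max) per element,
# so the result is a matrix with one column per list element
ranges <- sapply(plants_demo, range)
dim(ranges)          # 2 3
ranges[, "height"]   # 12 16

# When the results have different lengths, nothing can be simplified
# and sapply() falls back to returning a list
is.list(sapply(list(a = 1:2, b = 1:3), rev))   # TRUE
```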

tapply function

The tapply function works in much the same way as the other functions, but it lets you perform an operation within specified groups in your data. For those familiar with the dplyr package, this is similar to combining the group_by() and summarise() functions.

Here is an example. Suppose we have a data set that records the service time needed to repair a product, and we would like to find the mean service time for each product. We first have to group the data by product, and then find the mean service time within each group.
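
A minimal sketch of that scenario follows. The data frame repairs, its column names, and all of its values are invented for illustration.

```r
# Hypothetical repair log: one row per service job
repairs <- data.frame(
  product = c("printer", "scanner", "printer", "laptop", "scanner", "laptop"),
  service_time = c(30, 45, 50, 120, 55, 100)   # minutes (made-up values)
)

# Group service_time by product, then take the mean within each group
mean_time <- tapply(repairs$service_time, INDEX = repairs$product, FUN = mean)
mean_time
# laptop 110, printer 40, scanner 50
```

The groups come back in alphabetical order because the grouping column is converted to a factor internally.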

Let’s use the tapply function on the mtcars data set. This data set comprises fuel consumption and various other automobile parameters for 32 different car models. We want to determine the average fuel consumption, in miles per gallon, for each number of engine cylinders. That takes two steps:

Group the data by the number of engine cylinders.
Use the mean function on the mpg variable to find the average fuel consumption within each group.

head(mtcars)

We can perform both steps with a single call to the tapply function.

# tapply function to calculate the average fuel consumption
# for different engine cylinders.

tapply(mtcars$mpg,INDEX = mtcars$cyl,FUN = mean)

       4        6        8 
26.66364 19.74286 15.10000

You can observe the trend here: a car’s mileage decreases as the number of cylinders increases.

Let’s decode the syntax of the tapply function.

The first argument is the variable on which we want to perform the calculation, here the mpg variable.

INDEX is the grouping variable; here we group the data using the cyl variable. The last argument, FUN, is the function to apply, here mean, which calculates the average mpg within each group.
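
The two steps that tapply combines can also be written out separately, which shows what the single call is doing. This is a sketch using base R's split() and sapply(); the names mpg_by_cyl, manual, and via_tapply are made up for the example.

```r
# Step 1: split mpg into groups, one numeric vector per cylinder count
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Step 2: take the mean of each group
manual <- sapply(mpg_by_cyl, mean)
manual

# Same values as the one-line tapply() call
via_tapply <- tapply(mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)
all.equal(as.vector(manual), as.vector(via_tapply))   # TRUE
```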

Summary

The apply family functions are higher-order functions that accept other functions as arguments. They are applied to vectors, lists, and the rows and columns of a data frame or matrix, leading to concise, readable code.

In this article we learned about several functions in the apply family.

The apply function takes another function as an argument and applies it to the rows or columns of a data frame or matrix.

The lapply function applies a function to each element of a list or vector and always returns a list.

The sapply function works like lapply but returns the output as a vector or matrix where possible.

The tapply function splits a variable into groups and applies a function to each group.
