In R, a for loop repeatedly evaluates an expression while an iterator walks over a list or vector. In practice, a for loop is often the last choice, because when the iterations are independent of one another there is usually an alternative that is cleaner and easier to write and read.
The following code uses a for loop to create a list of three independent, normally distributed random vectors whose lengths are given by the vector len:
len <- c(3,4,5)
x <- list() # create empty list
set.seed(123) # seed the random number generator
for(i in 1:3){
x[[i]] <- rnorm(len[i])
}
x
[[1]]
[1] -0.5604756 -0.2301775 1.5587083
[[2]]
[1] 0.07050839 0.12928774 1.71506499 0.46091621
[[3]]
[1] -1.2650612 -0.6868529 -0.4456620 1.2240818 0.3598138
The preceding example is simple, but the code is verbose compared with the equivalent lapply call:
set.seed(123)
lapply(len,rnorm)
[[1]]
[1] -0.5604756 -0.2301775 1.5587083
[[2]]
[1] 0.07050839 0.12928774 1.71506499 0.46091621
[[3]]
[1] -1.2650612 -0.6868529 -0.4456620 1.2240818 0.3598138
The lapply version is much simpler. It applies rnorm() to each element of len and puts each result into a list.
This succinct code is possible only because R allows us to pass functions around as objects. The rnorm function is passed to lapply just like any ordinary argument. This feature makes code considerably more flexible.
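To make the idea of passing a function as an object concrete, here is a minimal sketch. The names lens and draw are illustrative and do not come from the example above.

```r
# Functions are ordinary objects: they can be stored in variables
# and handed to other functions as arguments.
lens <- c(3, 4, 5)   # illustrative name for the vector of lengths
draw <- rnorm        # bind the function itself, without calling it

set.seed(123)
a <- lapply(lens, draw)    # pass the function through a variable

set.seed(123)
b <- lapply(lens, rnorm)   # pass the function by its usual name

identical(a, b)            # both calls do exactly the same work
```

Because both calls start from the same seed and apply the same function object, the two lists should match element for element.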
The functions in the apply family are higher-order functions: each accepts a function as an argument. There are several functions in the apply family, and each performs a specific task.
In this article we will look at the important apply family functions listed below. We will explore their usage and limitations using simple examples.
- apply
- lapply
- sapply
- tapply
Let’s get started!
apply function
The apply function is a higher-order function that accepts a function as an argument. It applies this function to the rows or columns of a data frame or matrix.
The data set below records the height, in inches, of five individual plants at three time points (0 days, 10 days and 20 days). The first column is the plant ID, and each of the next three columns gives the plant height at one of the time points.
example_df <- data.frame(plant_ID = c("A", "B", "C", "D", "E"),
height_0 = c(15, 10, 12, 9, 17),
height_10 = c(20, 18, 14, 15, 19),
height_20 = c(23, 24, 18, 17, 26))
head(example_df)
We are interested in finding the mean height at each stage of plant growth. We can use either a for loop or the apply function to get the answer. Comparing the two, the apply function makes the code more compact and readable.
Let’s use the apply function on the data set to find the mean values. The function mean is passed as an argument to apply, which applies it to every column of the data frame.
# drop the first column since it is a character vector
apply(example_df[-1], MARGIN = 2, FUN = mean)
height_0 height_10 height_20
12.6 17.2 21.6
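For comparison, the same column means can be computed with a for loop. This is only a sketch of one possible loop; the names heights and col_means are illustrative.

```r
example_df <- data.frame(plant_ID = c("A", "B", "C", "D", "E"),
                         height_0 = c(15, 10, 12, 9, 17),
                         height_10 = c(20, 18, 14, 15, 19),
                         height_20 = c(23, 24, 18, 17, 26))

heights <- example_df[-1]              # numeric columns only
col_means <- numeric(ncol(heights))    # pre-allocate the result vector
names(col_means) <- names(heights)

# visit each column in turn and store its mean
for (j in seq_along(heights)) {
  col_means[j] <- mean(heights[[j]])
}

col_means
```

The loop needs a pre-allocated result, an index and an explicit assignment, all of which the single apply call handles internally.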
Here is the syntax of the apply function.
- The first argument is the object (data frame or matrix) you want to analyse.
- The second argument is MARGIN. It specifies the dimension of the data frame or matrix along which the function is applied.
- MARGIN = 1 indicates that the function is applied to each row.
- MARGIN = 2 indicates that the function is applied to each column.
- The last argument, FUN, is the function that will be applied to the rows or columns.
Calculations in the apply function are carried out row-wise or column-wise, depending on the MARGIN value you set. In the example above, MARGIN = 1 would produce a different result: the mean height of each plant across the three time points.
apply(example_df[-1],MARGIN=1,FUN = mean)
[1] 19.33333 17.33333 14.66667 13.66667 20.66667
We can also pass a custom function to apply. For example, suppose we want to find out at which stage of growth the average plant height exceeds 15 inches. We can create a custom function, is_tall, to check the condition and pass it to apply.
is_tall <- function(x) {
value <- mean(x) > 15
return(value)
}
apply(example_df[,-1],MARGIN = 2, is_tall) # apply with custom function
height_0 height_10 height_20
FALSE TRUE TRUE
This tells us that at time point 0 the plants are not taller than 15 inches on average, while the opposite is true for time points 10 and 20.
lapply function
One disadvantage of the apply function is that it does not work on lists. So, if we have a list object to work on, we can use the lapply function instead.
Suppose we have a simple list with two elements. If we wanted to calculate the average value of each list element, we could do it individually by calling the mean function on each element.
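As a minimal sketch of this element-by-element approach, assume a made-up list called my_list. The name and values are hypothetical and serve only as an illustration.

```r
# Hypothetical two-element list, for illustration only.
my_list <- list(a = c(1, 2, 3, 4), b = c(10, 20, 30))

# One call to mean() per element, typed out by hand.
mean_a <- mean(my_list$a)
mean_b <- mean(my_list$b)

mean_a   # 2.5
mean_b   # 20
```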
This method is inefficient and makes us repeat our code. And what if we had, say, more than 100 list elements? That would be a pain to type out. Let’s try another method.
We could create a for loop and save the results in a vector. This method is better because it automates the process, which is especially useful when the list has many elements. But a for loop still takes more effort to construct and takes up quite a bit of space in our code.
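A for loop along these lines might look like the following sketch. It again assumes a hypothetical list named my_list, and means is an illustrative name for the result vector.

```r
# Hypothetical two-element list, for illustration only.
my_list <- list(a = c(1, 2, 3, 4), b = c(10, 20, 30))

means <- numeric(length(my_list))   # pre-allocate the result vector
names(means) <- names(my_list)

# compute the mean of each element and store it by position
for (i in seq_along(my_list)) {
  means[i] <- mean(my_list[[i]])
}

means
```

The loop works for a list of any length, but the bookkeeping (allocation, indexing, assignment) is all written by hand.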
The last method is to use the lapply function, which lets us wrap all of these steps into a single line of code.
Here is an example.
We will create a list called plants containing three elements, each a vector of length ten. Each element holds a different plant attribute (height, mass, and number of flowers). We use the uniform distribution, via runif, to generate the random heights and masses, and the sample function to draw the integers 1 to 10 in random order for the flower counts.
plants <- list(height = runif(10, min = 10, max = 20),
mass = runif(10, min = 5, max = 10),
flowers = sample(1:10, 10))
plants
$height
[1] 12.81165 11.30546 12.79607 10.22552 12.28770 11.78231
[7] 17.53214 14.35947 11.67449 12.37116
$mass
[1] 5.423982 5.290442 6.548579 8.275295 6.344244 7.635298
[7] 9.136648 9.250006 7.958576 5.585793
$flowers
[1] 6 3 9 10 5 2 8 1 4 7
Now we use the lapply function to find the mean value of each list element.
lapply(plants,FUN = mean)
$height
[1] 12.7146
$mass
[1] 7.144886
$flowers
[1] 5.5
Note that the output of the lapply function is always a list. Also, we have not used a MARGIN argument with lapply: the function mean is simply applied to each list element.
sapply function
The output of the lapply function is always a list. If we want the output as a vector or a matrix, we can use the sapply function. The sapply function works the same way as lapply, but instead of always returning a list, it simplifies the result to a vector or matrix whenever possible.
sapply(plants,FUN = mean)
height mass flowers
12.714597 7.144886 5.500000
Notice that the output is a simple named numeric vector and not a list. You can confirm its type using the class function.
class(sapply(plants,FUN = mean))
[1] "numeric"
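The matrix case arises when the applied function returns more than one value per element. The sketch below uses range, which returns the minimum and maximum, on a small hypothetical list named plants_fixed with fixed values so that the result is reproducible.

```r
# Hypothetical list with fixed values, for illustration only.
plants_fixed <- list(height  = c(12, 15, 18),
                     mass    = c(5, 7, 9),
                     flowers = c(2, 4, 6))

# range() returns c(min, max), so each list element yields two values
# and sapply() arranges them as one column per element.
ranges <- sapply(plants_fixed, FUN = range)

ranges
is.matrix(ranges)
```

Each column of the result corresponds to one list element; the first row holds the minimum and the second row the maximum.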
tapply function
The tapply function works in much the same way as the other functions, but it lets you perform an operation across specified groups in your data. For those familiar with the dplyr package, this is similar to combining the group_by() and summarise() functions.
Here is an example. Suppose we have a data set that records the service time needed to repair each product. We would like to find the mean service time for each individual product. So we first have to group the data by product, and then find the mean service time within each group.
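As a rough sketch of this scenario, assume a small made-up data frame named service with a product column and an hours column. The names and values are hypothetical and not taken from any real data set.

```r
# Hypothetical service records, for illustration only.
service <- data.frame(
  product = c("printer", "router", "printer", "laptop", "router", "laptop"),
  hours   = c(2, 1, 4, 6, 3, 8)
)

# Group the repair times by product, then average within each group.
mean_hours <- tapply(service$hours, INDEX = service$product, FUN = mean)

mean_hours
```

The result has one entry per product, named after the product, holding that product's mean service time.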
Let’s use the tapply function on the mtcars data set. This data set comprises fuel consumption and various other automobile characteristics for 32 car models. We want to determine the average fuel economy, in miles per gallon, for each number of engine cylinders. This takes two steps:
- group the data by the number of engine cylinders
- use the mean function on the mpg variable to find the average fuel economy within each group.
head(mtcars)
We can perform both of the above steps with a single call to the tapply function.
# tapply function to calculate the average fuel consumption
# for different engine cylinders.
tapply(mtcars$mpg,INDEX = mtcars$cyl,FUN = mean)
4 6 8
26.66364 19.74286 15.10000
You can observe the trend here: a car’s mileage decreases as the number of cylinders increases.
Let’s decode the syntax of the tapply function.
- The first argument is the variable on which we want to perform the calculation, here the mpg variable.
- INDEX is the grouping variable, here the cyl variable.
- The last argument, FUN, is the function applied within each group, here mean, which calculates the average mpg of each group.
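The two steps that tapply combines can also be written out separately. The sketch below uses split for the grouping step and sapply for the averaging step; it is one possible decomposition, and the names groups, by_hand and by_tapply are illustrative.

```r
# Step 1: split mpg into one vector per cylinder count.
groups <- split(mtcars$mpg, mtcars$cyl)

# Step 2: take the mean of each group.
by_hand <- sapply(groups, mean)

# The single tapply call that does both steps at once.
by_tapply <- tapply(mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)

# Compare the values from the two approaches.
all.equal(as.vector(by_hand), as.vector(by_tapply))
```

Both approaches produce one mean per cylinder count; tapply simply saves us from writing the grouping step ourselves.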
Summary
The apply family of functions are higher-order functions that accept other functions as arguments. They can be applied to vectors, lists, and the rows and columns of a data frame or matrix, leading to concise and efficient code.
In this article we looked at several functions of the apply family.
- The apply function takes another function as an argument and applies it to the rows or columns of a data frame or matrix.
- The lapply function applies a function to each element of a list or vector and returns a list.
- The sapply function is used when the output is required in vector or matrix form.
- The tapply function is used to split a variable into groups and apply an operation to each group.