A brief introduction to “apply” in R

On any R Q&A site, you’ll frequently see an exchange like this one:

Q: How can I use a loop to [...insert task here...] ?
A: Don’t. Use one of the apply functions.

So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions.

If you fire up your R console, type “??apply” and scroll down to the functions in the base package, you’ll see something like this:

    
    base::apply             Apply Functions Over Array Margins
    base::by                Apply a Function to a Data Frame Split by Factors
    base::eapply            Apply a Function Over Values in an Environment
    base::lapply            Apply a Function over a List or Vector
    base::mapply            Apply a Function to Multiple List or Vector Arguments
    base::rapply            Recursively Apply a Function to a List
    base::tapply            Apply a Function Over a Ragged Array

Let’s examine each of those.

1. apply
Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.”

OK – we know about vectors/arrays and functions, but what are these “margins”? Simple: either the rows (1), the columns (2) or both (1:2). By “both”, we mean “apply the function to each individual value.” An example:

    
    # create a matrix of 10 rows x 2 columns
    m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
    # mean of the rows
    apply(m, 1, mean)
     [1]  6  7  8  9 10 11 12 13 14 15
    # mean of the columns
    apply(m, 2, mean)
    [1]  5.5 15.5
    # divide all values by 2
    apply(m, 1:2, function(x) x/2)
          [,1] [,2]
     [1,]  0.5  5.5
     [2,]  1.0  6.0
     [3,]  1.5  6.5
     [4,]  2.0  7.0
     [5,]  2.5  7.5
     [6,]  3.0  8.0
     [7,]  3.5  8.5
     [8,]  4.0  9.0
     [9,]  4.5  9.5
    [10,]  5.0 10.0

That last example was rather trivial; you could just as easily do “m[, 1:2]/2”, but you get the idea.
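To push the idea of margins a little further, here is a small sketch of my own (the variable names m.na, row.means, same.rows and same.cols are just for illustration). It shows that anything you write after the function name is handed on to that function, and that for plain means the dedicated rowMeans and colMeans functions agree with apply:

```r
# same 10 x 2 matrix as above
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)

# a copy with one missing value in the first row
m.na <- m
m.na[1, 1] <- NA

# arguments after the function are passed on to it:
# here na.rm = TRUE goes to mean()
row.means <- apply(m.na, 1, mean, na.rm = TRUE)

# for plain means, rowMeans/colMeans give the same result as apply
same.rows <- all.equal(apply(m, 1, mean), rowMeans(m))
same.cols <- all.equal(apply(m, 2, mean), colMeans(m))
```

The first row of m.na now contains only the value 11, so its mean is 11; the other rows are unchanged.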

2. by
Description: “Function ‘by’ is an object-oriented wrapper for ‘tapply’ applied to data frames.”

The by function is a little more complex than that. Read a little further and the documentation tells you that “a data frame is split by row into data frames subsetted by the values of one or more factors, and function ‘FUN’ is applied to each subset in turn.” So, we use this one where factors are involved.

To illustrate, we can load up the classic R dataset “iris”, which contains a bunch of flower measurements:

    
    attach(iris)
    head(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa

    # get the mean of the first 4 variables, by species
    # (colMeans, because mean() no longer accepts a data frame)
    by(iris[, 1:4], Species, colMeans)
    Species: setosa
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width
           5.006        3.428        1.462        0.246
    ------------------------------------------------------------
    Species: versicolor
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width
           5.936        2.770        4.260        1.326
    ------------------------------------------------------------
    Species: virginica
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width
           6.588        2.974        5.552        2.026

Essentially, by provides a way to split your data by factors and do calculations on each subset. It returns an object of class “by” and there are many, more complex ways to use it.
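As one example of a slightly more complex use, here is a sketch with a function of my own (species.summary is a made-up helper, not part of R). Each subset arrives in the function as a data frame, so you can compute whatever you like from its columns:

```r
# number of rows and mean sepal length for one subset
# (species.summary is a hypothetical helper name)
species.summary <- function(d) {
  c(n = nrow(d), mean.sepal.length = mean(d$Sepal.Length))
}

# split iris by species and apply the helper to each subset
res <- by(iris, iris$Species, species.summary)

# a "by" object can be indexed by factor level
res[["setosa"]]
```

Passing iris$Species explicitly means this version does not rely on attach.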

3. eapply
Description: “eapply applies FUN to the named values from an environment and returns the results as a list.”

This one is a little trickier, since you need to know something about environments in R. An environment, as the name suggests, is a self-contained object with its own variables and functions. To continue using our very simple example:

    
    # a new environment
    e <- new.env()
    # two environment variables, a and b
    e$a <- 1:10
    e$b <- 11:20
    # mean of the variables
    eapply(e, mean)
    $b
    [1] 15.5

    $a
    [1] 5.5

I don’t often create my own environments, but they’re commonly used by R packages, such as those from the Bioconductor project, so it’s good to know how to handle them.
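One thing to watch: in the output above, b came back before a. The eapply documentation notes that the order of the results is arbitrary, so if the order matters, a simple approach (my own sketch, nothing official) is to sort the result by name:

```r
e <- new.env()
e$a <- 1:10
e$b <- 11:20

# eapply makes no promise about the order of its results,
# so sort by name when the order matters
res <- eapply(e, mean)
res <- res[order(names(res))]
names(res)
```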

4. lapply
Description: “lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.”

That’s a nice, clear description which makes lapply one of the easier apply functions to understand. A simple example:

    
    # create a list with 2 elements
    l <- list(a = 1:10, b = 11:20)
    # the mean of the values in each element
    lapply(l, mean)
    $a
    [1] 5.5

    $b
    [1] 15.5

    # the sum of the values in each element
    lapply(l, sum)
    $a
    [1] 55

    $b
    [1] 155
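The function does not have to be a built-in one. A small sketch of two common variations (the names l.range and l.q90 are mine): an anonymous function written on the spot, and extra arguments passed through to the function:

```r
l <- list(a = 1:10, b = 11:20)

# an anonymous function: the spread (max - min) of each element
l.range <- lapply(l, function(x) max(x) - min(x))

# extra arguments are passed on to the function:
# here probs goes to quantile()
l.q90 <- lapply(l, quantile, probs = 0.9)
```

Both results are still lists with elements $a and $b, exactly as the description promises.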

The lapply documentation tells us to consult further documentation for sapply, vapply and replicate. Let’s do that.

4.1 sapply
Description: “sapply is a user-friendly version of lapply by default returning a vector or matrix if appropriate.”

That simply means that if lapply would have returned a list with elements $a and $b, sapply will return either a vector, with elements [['a']] and [['b']], or a matrix with column names “a” and “b”. Returning to our previous simple example:

    
    # create a list with 2 elements
    l <- list(a = 1:10, b = 11:20)
    # mean of values using sapply
    l.mean <- sapply(l, mean)
    # what type of object was returned?
    class(l.mean)
    [1] "numeric"
    # it's a numeric vector, so we can get element "a" like this
    l.mean[['a']]
    [1] 5.5
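What about the “matrix if appropriate” part? Here is a quick sketch of my own (l.range and l.big are illustrative names) showing the other two shapes sapply can return, depending on the length of each result:

```r
l <- list(a = 1:10, b = 11:20)

# each result has length 2, so sapply returns a 2 x 2 matrix
# with one column per list element
l.range <- sapply(l, range)

# results of different lengths cannot be simplified,
# so sapply falls back to a list, as lapply would
l.big <- sapply(list(a = 1:3, b = 1:5), function(x) x[x > 2])
```

This shape-shifting is convenient at the console, but it means you cannot always predict what kind of object you will get back.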

4.2 vapply
Description: “vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.”

A third argument is supplied to vapply, which you can think of as a kind of template for the output. The documentation uses the fivenum function as an example, so let’s go with that:

    
    l <- list(a = 1:10, b = 11:20)
    # fivenum of values using vapply
    l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
    # a matrix (versions of R before 4.0.0 print just "matrix")
    class(l.fivenum)
    [1] "matrix" "array"
    # let's see it
    l.fivenum
               a    b
    Min.     1.0 11.0
    1st Qu.  3.0 13.0
    Median   5.5 15.5
    3rd Qu.  8.0 18.0
    Max.    10.0 20.0

So, vapply returned a matrix, where the column names correspond to the original list elements and the row names to the output template. Nice.
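To see the “safer” part in action, here is a small sketch of my own (l.mean and bad are illustrative names). When the result of the function does not fit the template, vapply stops with an error rather than quietly returning a different shape:

```r
l <- list(a = 1:10, b = 11:20)

# the template numeric(1) says: each result is a single number
l.mean <- vapply(l, mean, numeric(1))

# fivenum returns five numbers, which does not match the
# template, so vapply stops with an error; tryCatch turns
# that error into the string "error" so the script carries on
bad <- tryCatch(vapply(l, fivenum, numeric(1)),
                error = function(err) "error")
```

With sapply, the second call would simply have returned a matrix, which may or may not be what the rest of your code expects.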

4.3 replicate
Description: “replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).”

The replicate function is very useful. Give it two mandatory arguments: the number of replications and the expression to evaluate each time; a third, optional argument, simplify, controls whether the result is simplified to a vector or matrix (by default, it is). An example: let’s draw 10 random samples from the standard normal distribution, each with 10 observations:

    
    replicate(10, rnorm(10))
                 [,1]        [,2]        [,3]       [,4]        [,5]         [,6]
     [1,]  0.67947001 -1.94649409  0.28144696  0.5872913  2.22715085 -0.275918282
     [2,]  1.17298643 -0.01529898 -1.47314092 -1.3274354 -0.04105249  0.528666264
     [3,]  0.77272662 -2.36122644  0.06397576  1.5870779 -0.33926083  1.121164338
     [4,] -0.42702542 -0.90613885  0.83645668 -0.5462608 -0.87458396 -0.723858258
     [5,] -0.73892937 -0.57486661 -0.04418200 -0.1120936  0.08253614  1.319095242
     [6,]  2.93827883 -0.33363446  0.55405024 -0.4942736  0.66407615 -0.153623614
     [7,]  1.30037496 -0.26207115  0.49818215  1.0774543 -0.28206908  0.825488436
     [8,] -0.04153545 -0.23621632 -1.01192741  0.4364413 -2.28991601 -0.002867193
     [9,]  0.01262547  0.40247248  0.65816829  0.9541927 -1.63770154  0.328180660
    [10,]  0.96525278 -0.37850821 -0.85869035 -0.6055622  1.13756753 -0.371977151
                 [,7]        [,8]       [,9]       [,10]
     [1,]  0.03928297  0.34990909 -0.3159794  1.08871657
     [2,] -0.79258805 -0.30329668 -1.0902070  0.73356542
     [3,]  0.10673459 -0.02849216  0.8094840  0.06446245
     [4,] -0.84584079 -0.57308461 -1.3570979 -0.89801330
     [5,] -1.50226560 -2.35751419  1.2104163  0.74650696
     [6,] -0.32790991  0.80144695 -0.0071844  0.05742356
     [7,]  1.36719970  2.34148354  0.9148911  0.20451421
     [8,] -0.51112579 -0.53658159  1.5194130 -0.94250069
     [9,]  0.52017814 -1.22252527  0.4519702  0.08779704
    [10,]  1.35908918  1.09024342  0.5912627 -0.20709053
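Your numbers will differ from mine, since they are random. A short sketch of a few related points (the seed value 42 and the names m, l and sample.means are arbitrary choices of mine): set.seed makes the draws reproducible, each replication becomes one column of the matrix, and simplify = FALSE returns a list instead:

```r
# fix the seed so the random draws can be reproduced
set.seed(42)

# 10 samples of 10 observations: a 10 x 10 matrix,
# one sample per column
m <- replicate(10, rnorm(10))
dim(m)

# simplify = FALSE keeps the samples as a list instead
l <- replicate(10, rnorm(10), simplify = FALSE)
length(l)

# a typical use: 1000 simulated means of samples of size 10
sample.means <- replicate(1000, mean(rnorm(10)))
```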

5. mapply
Description: “mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each (…) argument, the second elements, the third elements, and so on.”

The mapply documentation is full of quite complex examples, but here’s a simple, silly one:

    
    l1 <- list(a = c(1:10), b = c(11:20))
    l2 <- list(c = c(21:30), d = c(31:40))
    # sum the corresponding elements of l1 and l2
    mapply(sum, l1$a, l1$b, l2$c, l2$d)
     [1]  64  68  72  76  80  84  88  92  96 100

Here, we sum l1$a[1] + l1$b[1] + l2$c[1] + l2$d[1] (1 + 11 + 21 + 31) to get 64, the first element of the returned vector. All the way through to l1$a[10] + l1$b[10] + l2$c[10] + l2$d[10] (10 + 20 + 30 + 40) = 100, the last element.
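A couple more silly ones, as a sketch (sums, reps and same are names I made up). The first pairs up two vectors, the second shows that results of different lengths come back as a list, and the third confirms that the sum example is just vectorized addition in disguise:

```r
# add corresponding elements of two vectors
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# results of different lengths come back as a list:
# rep(1, 3), rep(2, 2), rep(3, 1)
reps <- mapply(rep, 1:3, 3:1)

# the sum example gives the same answer as vectorized addition
l1 <- list(a = 1:10, b = 11:20)
l2 <- list(c = 21:30, d = 31:40)
same <- all.equal(mapply(sum, l1$a, l1$b, l2$c, l2$d),
                  l1$a + l1$b + l2$c + l2$d)
```

So mapply earns its keep when the function is not already vectorized over its arguments.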

6. rapply
Description: “rapply is a recursive version of lapply.”

I found “recursive” a little confusing at first. It means that rapply descends into nested lists and applies the function to each non-list element; what comes back depends on the arguments supplied. Best illustrated by examples:

    
    # let's start with our usual simple list example
    l <- list(a = 1:10, b = 11:20)
    # log2 of each value in the list
    rapply(l, log2)
          a1       a2       a3       a4       a5       a6       a7       a8
    0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
          a9      a10       b1       b2       b3       b4       b5       b6
    3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
          b7       b8       b9      b10
    4.087463 4.169925 4.247928 4.321928
    # log2 of each value in each list
    rapply(l, log2, how = "list")
    $a
     [1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
     [9] 3.169925 3.321928

    $b
     [1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
     [9] 4.247928 4.321928

    # what if the function is the mean?
    rapply(l, mean)
       a    b
     5.5 15.5

    rapply(l, mean, how = "list")
    $a
    [1] 5.5

    $b
    [1] 15.5

So, the output of rapply depends on both the function and the how argument. When how = “list” (or “replace”), the original list structure is preserved. Otherwise, the default is to unlist, which results in a vector.

You can also pass a “classes” argument to rapply. For example, in a mixed list of numeric and character variables, you could specify that the function act only on the numeric values with classes = "numeric".
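A minimal sketch of that idea, using a made-up nested list (x and doubled are my own names). I have used how = "replace" so that elements which do not match are left as they are, and written the numbers as doubles so that their class really is "numeric":

```r
# a nested list mixing numbers and text
x <- list(a = c(1, 2, 3), b = "text", c = list(d = c(1.5, 2.5)))

# double only the numeric elements; how = "replace"
# leaves everything else untouched and keeps the nesting
doubled <- rapply(x, function(v) v * 2,
                  classes = "numeric", how = "replace")
```

Note that the function reached d, which sits one level down inside c, while the character element b passed through unchanged.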

7. tapply
Description: “Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.”

Whoa there. That sounds complicated. Don’t panic though, it becomes clearer when the required arguments are described. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”.

So, to go back to the famous iris data, “Species” might be a factor and “iris$Petal.Length” would give us a vector of values. We could then run something like:

    attach(iris)
    # mean petal length by species
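A sketch of what such a call might look like (this is my own version: petal.means is an illustrative name, and I pass iris$Species explicitly rather than relying on attach). The values agree with the Petal.Length column of the by output shown earlier:

```r
# mean petal length by species
petal.means <- tapply(iris$Petal.Length, iris$Species, mean)
petal.means
#     setosa versicolor  virginica
#      1.462      4.260      5.552
```

The vector of values is split into one group per level of the factor, and the function is applied to each group in turn.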
