Data Manipulation with R (2014)

Chapter 3. Data Manipulation Using plyr

In any data analysis task, most of the time is spent on data cleaning and preprocessing. By some estimates, about 80 percent of the effort goes into data cleaning before the actual analysis begins (Exploratory Data Mining and Data Cleaning, Dasu T. and Johnson T.). Data manipulation is an integral part of data cleaning and analysis. For large datasets, it is often preferable to perform an operation within subgroups of the dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large-scale data it requires a considerable amount of coding and can take more processing time. In such cases, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a single output. This type of splitting is not efficient in default R; to overcome this limitation, Wickham developed an R package called plyr in 2011, in which he implemented the split-apply-combine strategy efficiently. This chapter starts with the concept of split-apply-combine and then covers the different functions and utilities of the plyr package.

The split-apply-combine strategy

Often, we need to perform similar operations in different subgroups of a dataset, such as group-wise summarization, standardization, and statistical modeling. This type of task requires us to break up a big problem into manageable pieces, perform operations on each piece separately, and finally combine the output of each piece into a single output (see the paper The Split-Apply-Combine Strategy for Data Analysis, by Wickham, available at http://www.jstatsoft.org/v40/i01/paper). To understand the split-apply-combine strategy intuitively, we can compare it with the map-reduce strategy for processing large amounts of data, recently popularized by Google. In the map-reduce strategy, the map step corresponds to split and apply, and the reduce step corresponds to combine. The map-reduce approach is primarily designed for a highly parallel environment where the work is done independently by hundreds or thousands of computers.

The split-apply-combine strategy creates an opportunity to see the similarities of problems across subgroups that were previously unconnected. The same strategy appears in many existing tools, such as BY-group processing in SAS, PivotTable in MS Excel, and the SQL GROUP BY operator.

To explain the split-apply-combine strategy, we will use Fisher's iris data. This dataset contains the measurements in centimeters of these variables: sepal length and width, and petal length and width, for 50 flowers from each of the three species of iris. The species are Iris setosa, Iris versicolor, and Iris virginica. We want to calculate the mean of each variable and for each species separately. This can be done in different ways using a loop or without using one.

Split-apply-combine without a loop

In this section, we will see an example of the split-apply-combine strategy without using a loop. The steps are as follows:

1. Split the iris dataset into three parts.

2. Remove the species name variable from the data.

3. Calculate the mean of each variable for the three different parts separately.

4. Combine the output into a single data frame.

The code for this is as follows:

# notice that during the split step a negative 5 is used within the

# code; this negative 5 discards the fifth column of the iris data,

# which contains the "Species" information and is not needed

# to calculate the means.

iris.set <- iris[iris$Species=="setosa",-5]

iris.versi <- iris[iris$Species=="versicolor",-5]

iris.virg <- iris[iris$Species=="virginica",-5]

# calculating mean for each piece (The apply step)

mean.set <- colMeans(iris.set)

mean.versi <- colMeans(iris.versi)

mean.virg <- colMeans(iris.virg)

# combining the output (The combine step)

mean.iris <- rbind(mean.set,mean.versi,mean.virg)

# giving row names so that the output could be easily understood

rownames(mean.iris) <- c("setosa","versicolor","virginica")
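The same three steps can also be written more compactly with base R's own split() and lapply() functions. The following is only a sketch of one possible alternative, not the approach used in the rest of this chapter; the object names pieces, piece.means, and mean.iris.split are introduced here purely for illustration:

```r
# Split: one data frame per species, with the fifth (Species) column dropped
pieces <- split(iris[, -5], iris$Species)

# Apply: column means within each piece
piece.means <- lapply(pieces, colMeans)

# Combine: stack the per-species vectors into one matrix;
# the row names come from the names of the list (the species)
mean.iris.split <- do.call(rbind, piece.means)

mean.iris.split
```

Because split() names each piece after its group, the row names do not have to be assigned by hand in this version.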

Split-apply-combine with a loop

The following example will calculate the same statistics as in the previous section, but this time we will perform this task using a loop. The steps are similar but the code is different. In each iteration, we will split the data for each species and calculate the mean for each variable and then combine the output into a single data frame, as shown in the following code:

# split-apply-combine using loop

# each iteration represents split

# mean calculation within each iteration represents apply step

# rbind command in each iteration represents combine step

mean.iris.loop <- NULL

for(species in unique(iris$Species))

{

iris_sub <- iris[iris$Species==species,]

column_means <- colMeans(iris_sub[,-5])

mean.iris.loop <- rbind(mean.iris.loop,column_means)

}

# giving row names so that the output could be easily understood

rownames(mean.iris.loop) <- unique(iris$Species)

An important fact to note in the split-apply-combine strategy is that each piece must be independent of the others. If the calculation in one piece depends on another piece, the strategy will not work. A running average is a typical example: each value depends on the previous one, so the problem cannot be broken into independent pieces. The strategy applies only when the big problem can be broken up into smaller manageable pieces and the desired operation can be performed on each piece independently. For running-average calculations, we can use a loop instead, and if processing speed is a concern, we can write the code in a lower-level language such as C or Fortran.
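To make the dependency concrete, here is a small sketch of a running average computed with a loop. The vector x and the name running_mean are made up for this illustration; the point is only that each step reuses the result of the previous step, so the iterations cannot be treated as independent pieces:

```r
# hypothetical input values, used only for illustration
x <- c(2, 4, 6, 8)

running_mean <- numeric(length(x))

for (i in seq_along(x)) {
  if (i == 1) {
    # the first running mean is just the first value
    running_mean[i] <- x[i]
  } else {
    # each new value is an update of the PREVIOUS running mean
    running_mean[i] <- running_mean[i - 1] + (x[i] - running_mean[i - 1]) / i
  }
}

running_mean
# [1] 2 3 4 5
```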

Utilities of plyr

The plyr package is a set of tools for a common set of problems: we want to split a big problem into smaller pieces, apply a function to each piece, and then combine all the outputs back together. The example we presented using the iris data is one where we applied this strategy. Although split-apply-combine operations are already possible in base R, for example with split() and the apply family of functions, plyr makes things much easier and more intuitive with its consistent naming convention, various types of input-output processing, built-in error recovery, and informative messages. In general, plyr provides a replacement for the for loop. The reason to replace a for loop is not that it is slow, but that it requires extra, unimportant bookkeeping code. The following examples illustrate the benefit of replacing a for loop with its plyr counterpart.

The mean calculation of each variable within each species in the iris dataset using the for loop in base R can be coded as follows:

mean.iris.loop <- NULL

for(species in unique(iris$Species))

{

iris_sub <- iris[iris$Species==species,]

column_means <- colMeans(iris_sub[,-5])

mean.iris.loop <- rbind(mean.iris.loop,column_means)

}

rownames(mean.iris.loop) <- unique(iris$Species)

mean.iris.loop

Sepal.Length Sepal.Width Petal.Length Petal.Width

setosa 5.006 3.428 1.462 0.246

versicolor 5.936 2.770 4.260 1.326

virginica 6.588 2.974 5.552 2.026

The same mean calculation, but this time using the plyr package, can be coded as follows:

library(plyr)

ddply(iris,~Species,function(x) colMeans(x[,-which(colnames(x)=="Species")]))

Species Sepal.Length Sepal.Width Petal.Length Petal.Width

1 setosa 5.006 3.428 1.462 0.246

2 versicolor 5.936 2.770 4.260 1.326

3 virginica 6.588 2.974 5.552 2.026

Note that we can easily perform the same calculation with very little code using the plyr package compared to the for loop in base R.

Intuitive function names

To perform any kind of data processing, we need to know the type of input we have to provide and the expected format of the output. For most R functions, it is difficult to tell from the function name what type of input the function accepts and what output it produces. Function names in the plyr package are much more intuitive and instructive about their input and output types. Each function is named according to the type of input it accepts and the type of output it produces. The first letter of the function name specifies the input type and the second letter specifies the output type: a represents an array, d represents a data frame, l represents a list, and _ (underscore) means that the output is discarded. For example, the adply() function takes an array as input and produces a data frame as output. The following table summarizes the function-naming convention used in the plyr package:

Input \ Output	Array	Data frame	List	Discarded
Array	aaply	adply	alply	a_ply
Data frame	daply	ddply	dlply	d_ply
List	laply	ldply	llply	l_ply

We can see that there are three types of input and four types of output. Users can easily get an idea of the types of input and output from the function names. Another interesting feature is that we do not need to learn all the 12 functions; rather, it is sufficient to learn the three types of input and four types of output.
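As a small sketch of this convention, the same data frame and the same function can be passed to all four d*ply() functions, and only the shape of the result changes. The helper species_means and the result names below are hypothetical and introduced only for this illustration:

```r
library(plyr)

# hypothetical helper: mean of the four numeric columns of one piece
species_means <- function(piece) colMeans(piece[, 1:4])

# data frame in, data frame out
as_df <- ddply(iris, .(Species), species_means)

# data frame in, list out
as_list <- dlply(iris, .(Species), species_means)

# data frame in, array out (a 3 x 4 matrix here)
as_array <- daply(iris, .(Species), species_means)

# data frame in, output discarded: useful only for side effects,
# such as printing the number of rows in each piece
d_ply(iris, .(Species), function(piece) cat(nrow(piece), "rows\n"))

class(as_df)
class(as_list)
dim(as_array)
```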

Besides the functions in the table, plyr provides a special family of functions that corresponds to the mapply() function in base R. In base R, mapply() can take multiple inputs as separate arguments, whereas a*ply() takes only a single array argument. However, the separate arguments in mapply() should be of the same length. The plyr equivalents of mapply() are maply(), mdply(), mlply(), and m_ply().

Note that whenever a function name is written using a star symbol, such as a*ply(), it indicates that the common input is an array. The output can be in any format: array, data frame, or list. Optionally, the output can be discarded.
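The relationship between mapply() and the m*ply() functions can be sketched with a tiny example. The data frame args and the anonymous adding function are assumptions made up for this illustration: each row of the data frame supplies one set of arguments, matched to the function by column name.

```r
library(plyr)

# hypothetical argument sets: each row is one call, columns are argument names
args <- data.frame(x = 1:3, y = 4:6)

# base R: the arguments are passed as separate vectors
base_sums <- mapply(function(x, y) x + y, args$x, args$y)

# plyr: the arguments are passed together as the rows of a data frame;
# the result is a data frame holding the inputs plus one
# automatically named result column
plyr_sums <- mdply(args, function(x, y) x + y)

base_sums
plyr_sums
```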

To explain the intuitive nature of the input and output, we will now provide an example using the iris data that we used in an earlier example. This time, we will use the iris3 dataset; this is the same data, but stored in a three-dimensional array format. We will calculate the mean of each variable for each species as shown in the following code:

# class of iris3 dataset is array

class(iris3)

[1] "array"

# dimension of iris3 dataset

dim(iris3)

[1] 50 4 3

The following code snippet calculates the column mean for each species with the input as an array and the output as a data frame:

# Calculate column mean for each species and output will be data frame

iris_mean <- adply(iris3,3,colMeans)

class(iris_mean)

[1] "data.frame"

iris_mean

X1 Sepal L. Sepal W. Petal L. Petal W.

1 Setosa 5.006 3.428 1.462 0.246

2 Versicolor 5.936 2.770 4.260 1.326

3 Virginica 6.588 2.974 5.552 2.026

The following code snippet calculates the column mean for each species with the input as an array and the output as an array too:

# again we will calculate the mean but this time output will be an

# array

iris_mean <- aaply(iris3,3,colMeans)

class(iris_mean)

[1] "matrix"

iris_mean

X1 Sepal L. Sepal W. Petal L. Petal W.

Setosa 5.006 3.428 1.462 0.246

Versicolor 5.936 2.770 4.260 1.326

Virginica 6.588 2.974 5.552 2.026

# note that here the class is showing "matrix", since the output is a

# two dimensional array, which is a matrix

# (R 4.0.0 and later report both "matrix" and "array" here)

# Now calculate mean again with output as list

iris_mean <- alply(iris3,3,colMeans)

class(iris_mean)

[1] "list"

iris_mean

$`1`

Sepal L. Sepal W. Petal L. Petal W.

5.006 3.428 1.462 0.246

$`2`

Sepal L. Sepal W. Petal L. Petal W.

5.936 2.770 4.260 1.326

$`3`

Sepal L. Sepal W. Petal L. Petal W.

6.588 2.974 5.552 2.026

attr(,"split_type")

[1] "array"

attr(,"split_labels")

1 Setosa

2 Versicolor

3 Virginica

Input and arguments

The functions in the plyr package accept various input objects: data frames, arrays, and lists. Each type of input has its own rule for splitting. In this section, we will discuss input and arguments. The splitting rules can be described briefly as follows:

· Arrays are sliced by dimension into lower dimensional pieces. The corresponding common function is a*ply(), where the array is the common input and the output can be an array, data frame, or list, or can be discarded.

· Data frames are subset by a combination of variables from that dataset. The corresponding common function is d*ply(), where the data frame is the common input and the output can be an array, data frame, or list, or can be discarded.

· The elements of a list are processed separately. The corresponding common function is l*ply(), where the list is the common input and the output can be an array, data frame, or list, or can be discarded.
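As an illustration of splitting a data frame by a combination of variables, the following sketch groups the built-in mtcars dataset by two variables at once. The choice of mtcars, the grouping variables cyl and am, and the output column names are assumptions made only for this example:

```r
library(plyr)

# split mtcars by every observed combination of cyl and am,
# then summarise each piece with plyr's summarise()
mpg_by_group <- ddply(mtcars, .(cyl, am), summarise,
                      mean_mpg = mean(mpg),
                      n_cars = length(mpg))

# one row per combination of cyl and am that occurs in the data
mpg_by_group
```

Each row of the result corresponds to one piece of the split, so the n_cars column adds up to the number of rows in the original data frame.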

Depending on the input type, there are two or three main arguments for the common functions: a*ply(), d*ply(), and l*ply(). The following are the main arguments for these common functions:

· a*ply(.data, .margins, .fun, ..., .progress = "none")

· d*ply(.data, .variables, .fun, ..., .progress = "none")

· l*ply(.data, .fun, ..., .progress = "none")

The first argument, .data, is the input dataset that will be split up and processed; the output is combined from the results of each split. The .margins or .variables argument specifies how the data should be split up into smaller pieces. The .fun argument specifies the processing task; this can be any function that is applicable to each split of the input. If we omit the .fun argument, the input data is simply converted to the output structure specified by the function. If we want to monitor the progress of the processing task, the .progress argument should be specified. It will not show the progress status by default.
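The following sketch shows how these arguments fit together for a list input. The list readings is made up for this illustration. Arguments that plyr does not recognize as its own, such as na.rm here, are passed on through ... to the function given in .fun, and .progress = "text" requests a text progress bar in the console:

```r
library(plyr)

# hypothetical input: a list of numeric vectors, one with a missing value
readings <- list(a = c(1, NA, 3), b = c(4, 5, 6))

# na.rm = TRUE is forwarded to mean() for every element of the list;
# .progress = "text" draws a simple progress bar while the pieces are processed
clean_means <- llply(readings, mean, na.rm = TRUE, .progress = "text")

clean_means$a
# [1] 2

clean_means$b
# [1] 5
```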

In the following example, we will see what will happen if we do not specify the .fun argument in any function of the plyr package. If we give the input as an array and want the output as a data frame, but we haven't given a .fun argument, the adply() function will just convert the array object into a data frame. Here is an example:

# converting 3 dimensional array to a 2 dimensional data frame

iris_dat <- adply(iris3, .margins=3)

class(iris_dat)

[1] "data.frame"

str(iris_dat)

'data.frame': 150 obs. of 5 variables:

$ X1 : Factor w/ 3 levels "Setosa","Versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

$ Sepal L.: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal W.: num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal L.: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal W.: num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

The .margins argument works in a similar manner to the MARGIN argument of the apply() function in base R:

· .margins = 1 slices the array by rows

· .margins = 2 slices the array by columns

· .margins = c(1,2) slices the array into individual cells

The .margins argument works correspondingly for higher dimensions, with a combinatorial explosion in the number of possible ways to slice up the array.
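A small two-dimensional example may make the three cases easier to see. The matrix m below is made up for this sketch, and apply() is shown only for comparison:

```r
library(plyr)

# hypothetical 2 x 3 matrix:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# .margins = 1: one piece per row, so sum() gives the row sums
row_sums <- aaply(m, 1, sum)

# .margins = 2: one piece per column, so sum() gives the column sums
col_sums <- aaply(m, 2, sum)

# .margins = c(1, 2): one piece per cell, so the function sees single values
cells <- aaply(m, c(1, 2), function(cell) cell * 10)

row_sums
col_sums
dim(cells)

# the base R counterpart of the first call
apply(m, 1, sum)
```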

Comparing default R and plyr

In this section, we will compare code side by side to solve the same problem using both default R and plyr. Reusing the iris3 data, we are now interested in producing a five-number summary for each variable, grouped by species. The five numbers will be the minimum, mean, median, maximum, and standard deviation. The output will be a list of data frames.

To calculate the summaries for the five numbers, follow these steps:

1. Define a function that will calculate five number summary statistics for each column of a given matrix.

2. Produce the output of this function in a data frame object.

3. Apply this function to the iris3 dataset using a for loop.

4. Apply the same function using the alply() function of the plyr package.

An example that explains the calculation of the five number summary statistics is as follows:

# Function to calculate five number summary

fivenum.summary <- function(x)

{

results <-data.frame(min=apply(x,2,min),

mean=apply(x,2,mean),

median=apply(x,2,median),

max=apply(x,2,max),

sd=apply(x,2,sd))

return(results)

}

The following is how we calculate the summaries for the five numbers using a for loop with default R:

# initialize the output list object

all_stats <- list()

# the for loop will run for each species

for(i in 1:dim(iris3)[3])

{

sub_data <- iris3[,,i]

all_stat_species <- fivenum.summary(sub_data)

all_stats[[i]] <- all_stat_species

}

# class of the output object

class(all_stats)

[1] "list"

all_stats

[[1]]

min mean median max sd

Sepal L. 4.3 5.006 5.0 5.8 0.3524897

Sepal W. 2.3 3.428 3.4 4.4 0.3790644

Petal L. 1.0 1.462 1.5 1.9 0.1736640

Petal W. 0.1 0.246 0.2 0.6 0.1053856

[[2]]

min mean median max sd

Sepal L. 4.9 5.936 5.90 7.0 0.5161711

Sepal W. 2.0 2.770 2.80 3.4 0.3137983

Petal L. 3.0 4.260 4.35 5.1 0.4699110

Petal W. 1.0 1.326 1.30 1.8 0.1977527

[[3]]

min mean median max sd

Sepal L. 4.9 6.588 6.50 7.9 0.6358796

Sepal W. 2.2 2.974 3.00 3.8 0.3224966

Petal L. 4.5 5.552 5.55 6.9 0.5518947

Petal W. 1.4 2.026 2.00 2.5 0.2746501

Let's calculate the same statistics, but this time using the alply() function from the plyr package:

all_stats <- alply(iris3,3,fivenum.summary)

class(all_stats)

[1] "list"

all_stats

$`1`

min mean median max sd

Sepal L. 4.3 5.006 5.0 5.8 0.3524897

Sepal W. 2.3 3.428 3.4 4.4 0.3790644

Petal L. 1.0 1.462 1.5 1.9 0.1736640

Petal W. 0.1 0.246 0.2 0.6 0.1053856

$`2`

min mean median max sd

Sepal L. 4.9 5.936 5.90 7.0 0.5161711

Sepal W. 2.0 2.770 2.80 3.4 0.3137983

Petal L. 3.0 4.260 4.35 5.1 0.4699110

Petal W. 1.0 1.326 1.30 1.8 0.1977527

$`3`

min mean median max sd

Sepal L. 4.9 6.588 6.50 7.9 0.6358796

Sepal W. 2.2 2.974 3.00 3.8 0.3224966

Petal L. 4.5 5.552 5.55 6.9 0.5518947

Petal W. 1.4 2.026 2.00 2.5 0.2746501

attr(,"split_type")

[1] "array"

attr(,"split_labels")

1 Setosa

2 Versicolor

3 Virginica

Multiargument functions

Sometimes, we have to deal with functions that take multiple arguments, and the values of each argument can come from a data frame, a list, or an array. The plyr package has intuitive and user-friendly functions for working with multiargument functions. In this section, we will see an example of generating random numbers from a normal distribution with various combinations of sample size, mean, and standard deviation. These values are stored in a data frame. We will generate the random numbers first with default R, using a for loop, and then with the mlply() function from the plyr package. The parameter combinations are given in the following table:

n	Mean	Standard deviation
25	0	1
50	2	1.5
100	3.5	2
200	2.5	5
400	0.1	2

With these parameter combinations, we will generate normal random numbers using default R and plyr as shown in the following code:

# define parameter set

parameter.dat <- data.frame(n=c(25,50,100,200,400),mean=c(0,2,3.5,2.5,0.1),sd=c(1,1.5,2,5,2))

# displaying parameter set

parameter.dat

n mean sd

1 25 0.0 1.0

2 50 2.0 1.5

3 100 3.5 2.0

4 200 2.5 5.0

5 400 0.1 2.0

# generate random normal variates using base R

# set seed to make the example reproducible

set.seed(12345)

# initialize blank list object to store the generated variable

dat <- list()

for(i in 1:nrow(parameter.dat))

{

dat[[i]] <- rnorm(n=parameter.dat[i,1],

mean=parameter.dat[i,2],sd=parameter.dat[i,3])

}

# estimating mean from the newly generated data

estmean <- lapply(dat,mean)

estmean

[[1]]

[1] -0.001177287

[[2]]

[1] 2.417842

[[3]]

[1] 3.667193

[[4]]

[1] 2.999662

[[5]]

[1] 0.1765926

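# Performing the same task as above, but this time using the plyr package

# (the seed is not reset here, so the random draws and the

# estimated means differ from those of the loop above)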

dat_plyr <- mlply(parameter.dat,rnorm)

estmean_plyr <- llply(dat_plyr,mean)

estmean_plyr

$`1`

[1] 0.4252469

$`2`

[1] 2.037528

$`3`

[1] 3.070231

$`4`

[1] 2.144276

$`5`

[1] 0.05399488
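The two sets of estimated means above differ only because the random number generator continued from where the loop left off. As a sketch of how the two approaches could be compared directly, the seed can be reset before each run. The smaller parameter set params and the object names below are assumptions introduced for this illustration, and the comparison relies on mlply() calling rnorm() once per row, in row order:

```r
library(plyr)

# hypothetical, smaller parameter set used only for this comparison
params <- data.frame(n = c(5, 10), mean = c(0, 2), sd = c(1, 1.5))

# base R version, with the seed set immediately before drawing
set.seed(12345)
base_draws <- list()
for (i in 1:nrow(params)) {
  base_draws[[i]] <- rnorm(n = params[i, 1],
                           mean = params[i, 2], sd = params[i, 3])
}

# plyr version, with the same seed set again before drawing
set.seed(12345)
plyr_draws <- mlply(params, rnorm)

# compare the generated numbers, ignoring names and attributes
all.equal(unlist(base_draws), unlist(plyr_draws, use.names = FALSE))
```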

Summary

In this chapter, we discussed what the split-apply-combine strategy is and why it is important in data manipulation. The strategy can be implemented using base R, but this requires a large amount of code and is not memory or time efficient. To overcome this limitation, we discussed the plyr package, in which group-wise data manipulation can be implemented efficiently. The functions within plyr are intuitive and instructive in terms of input and output type. A large variety of data processing can be done using only a few functions with common input and various outputs. For further reading, an interested user can consult the paper The Split-Apply-Combine Strategy for Data Analysis, by Wickham, which can be found at http://www.jstatsoft.org/v40/i01/paper. In the upcoming chapter, we will learn about reshaping a dataset, which is another important aspect of group-wise data manipulation.