Data Manipulation with R, Second Edition (2015)

Chapter 3. Data Manipulation Using plyr and dplyr

We often collect data across different places, time points, and human characteristics. A census collects data across different states. In a longitudinal study, we collect information at different time points. The individuals observed could be male or female, and their occupations could differ. All individuals in a study can therefore be split into groups based on these geographical, temporal, and occupational characteristics. We usually analyze data as a whole, but sometimes it is useful to perform some tasks separately within each group.

As an example, if we collect details of the income of different individuals from six different regions, then we might be interested in seeing the income distribution among different professions (considering five different professions) across the six regions. This income could also vary depending on whether the person is male or female. In this situation, we can conceptualize the problem by splitting the dataset based on profession, gender, and region. There should be 5 x 6 x 2 = 60 different groups, and we need to calculate the average income separately for each group. Finally, we want to combine the results to see all the information side by side. This group-wise operation is often called the split-apply-combine approach to data analysis.

In this approach, we first split the dataset into mutually exclusive groups. We then apply a task to each group and combine all the results to get the desired output. This group-wise task could be generating new variables, summarizing existing variables, or even performing regression analysis on each group. Finally, the combine step gives us a single output in which the results from different groups can be compared.
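The income example above can be sketched in base R. This is only an illustration: the data frame income.dat, its column names, and the simulated incomes are all made up for the sketch and do not come from a real survey.

```r
# Hypothetical data: 5 professions x 6 regions x 2 genders = 60 groups
set.seed(2015)
groups <- expand.grid(profession = paste0("profession", 1:5),
                      region     = paste0("region", 1:6),
                      gender     = c("female", "male"))

# Ten simulated individuals per group, with made-up incomes
income.dat <- groups[rep(seq_len(nrow(groups)), each = 10), ]
income.dat$income <- round(rlnorm(nrow(income.dat), meanlog = 10, sdlog = 0.5))

# Split by the three variables, apply mean(), combine into one data frame
income.mean <- aggregate(income ~ profession + region + gender,
                         data = income.dat, FUN = mean)

nrow(income.mean)   # one row per group
head(income.mean)
```

Each row of income.mean holds the average income of one profession-region-gender group, so the groups can be compared side by side.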

This chapter will deal with the following topics:

· Applying the split-apply-combine strategy

· Utilities of the plyr library

· Different functions in the plyr package for handling different data structures

· Comparing base R and plyr

· Powerful data manipulation with the dplyr library

Applying the split-apply-combine strategy

For the purpose of demonstration, we will use the iris flower dataset, which is readily available in R. The dataset covers three species of iris: Iris setosa, Iris virginica, and Iris versicolor. Fifty samples from each species were collected and, for each sample, four variables were measured: the length and width of the sepals and petals. The name of each species is stored in the Species column, and the length and width of the sepal are stored in the Sepal.Length and Sepal.Width columns, respectively. Similarly, the length and width of the petal are stored in the Petal.Length and Petal.Width columns, respectively. The following command shows the first few rows of the iris data frame:

> head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

Now we will use the split-apply-combine strategy to find the average width and length of sepal and petal for three different species of iris. The strategy will be as follows:

1. First we will split the dataset into three subsets according to the species of the flower.

2. Next, for each subset, we will compute the average width and length of the sepal and petal.

3. Finally, we will combine all the results to compare them with each other.

# Step 1: Splitting dataset

iris.setosa <- subset(iris, Species=="setosa", select=c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width))

iris.versicolor <- subset(iris, Species=="versicolor", select=c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width))

iris.virginica <- subset(iris, Species=="virginica", select=c(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width))

# Step 2: Applying mean function to calculate mean

setosa <- colMeans(iris.setosa)

versicolor <- colMeans(iris.versicolor)

virginica <- colMeans(iris.virginica)

# Step 3: Combining results

rbind(setosa=setosa,versicolor=versicolor,virginica=virginica)

This is the detailed code to implement the split-apply-combine approach. We could implement the strategy with less code, as follows:

# Step 1: Splitting dataset

iris.split <- split(iris,as.factor(iris$Species))

# Step 2: Applying mean function to calculate mean

iris.apply <- lapply(iris.split,function(x)colMeans(x[-5]))

# Step 3: Combining results

iris.combine <- do.call(rbind,iris.apply)

iris.combine

Sepal.Length Sepal.Width Petal.Length Petal.Width

setosa 5.006 3.428 1.462 0.246

versicolor 5.936 2.770 4.260 1.326

virginica 6.588 2.974 5.552 2.026
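For comparison, base R also provides aggregate(), which can perform the three steps in a single call on a data frame. The sketch below is an alternative to the code above, not a replacement for it; the object name iris.aggregate is chosen here only for illustration.

```r
# aggregate() splits by Species, applies mean() to every remaining
# column, and combines the results into one data frame
iris.aggregate <- aggregate(. ~ Species, data = iris, FUN = mean)
iris.aggregate
```

Unlike the matrix returned by do.call(rbind, ...), the result here is a data frame with Species as an ordinary column.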

In later sections of this chapter, we will see how the plyr package comes in handy for implementing the split-apply-combine approach on all kinds of data structures. Using the plyr package, one line of code is sufficient to implement these three steps.

Introducing the plyr and dplyr libraries

We have seen how we can implement the split-apply-combine approach on a data frame using three lines of code. The plyr package helps us implement the approach in one line. Since R has multiple data structures, we need multiple functions to work on them. R has three main data structures: lists, arrays, and data frames. So, there are three possible types of input, and the output can be any of the same three types. This gives 3 x 3 = 9 possible input-output combinations, and for this reason, plyr has nine functions, one for each combination. In addition, there are three functions that accept the same three types of input but discard the output.

The plyr package works on every type of data structure, whereas the dplyr package is designed to work only on data frames. The dplyr package offers a complete set of functions to perform the data manipulation we need in the process of analysis. These functions take a data frame as input and also produce a data frame as output; hence the name: dplyr. There are two different types of function in the dplyr package: single-table functions and aggregate functions. A single-table function takes a data frame as input and performs an action, such as subsetting the data frame, generating new columns in the data frame, or rearranging the data frame. An aggregate function takes a column as input and produces a single value as output; it is mostly used for summarizing columns. On their own, these functions do not perform any group-wise operation, but combining them with the group_by() function allows us to implement the split-apply-combine approach.

plyr's utilities

The most important utility of the plyr package is that a single line of code can perform all the split, apply, and combine steps. What we have done using three lines of code in the first section can be implemented in just one line using the plyr package:

library(plyr)

ddply(iris, .(Species), function(x) colMeans(x[-5]))

Here, ddply() is a function from the plyr package that takes a data frame as input and produces a data frame as output. Hence, the name of the function is ddply. The arguments work as follows:

· The first argument is the name of the data frame. We put iris, since the iris dataset is in the data frame structure and we want to work on it.

· The second argument is for a variable or variables, according to which we want to split our data frame. In this case, we have Species.

· The third argument is a function that defines what kind of task we want to perform on each subset.

One question that should come to mind is how the function receives its data. Here, ddply() splits the data frame into three groups and processes them as follows:

· The first subset, which is also a data frame, will be considered as an input of the function. The function will calculate all the column means and store them somewhere.

· The second subset will be considered as input for the function, and so on.

· All the outputs will be combined to form a single data frame.
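To see what each subset looks like when it reaches the function, we can return its dimensions instead of its column means. This is a small sketch; the names piece and piece.size are chosen here only for illustration.

```r
library(plyr)

# Report the size of each piece that ddply() hands to the function
piece.size <- ddply(iris, .(Species), function(piece) {
  data.frame(rows = nrow(piece), cols = ncol(piece))
})
piece.size
```

Each piece is a data frame with 50 rows and all five columns, including Species, which is why the earlier call drops the fifth column with x[-5] before computing the column means.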

This is like the bysort command in Stata, but with a lot more flexibility. Since there are different types of data structure in R, one single function cannot handle all types of data structure. That is why we have multiple functions in the plyr package that have a very similar naming convention. It is very easy to remember all the functions, and it is easy to apply them when we need.

Intuitive function names in the plyr library

To perform any kind of data processing, we need to know the type of input that we have to provide and the expected output format. For most R functions, it is difficult to tell from the function name what types of input they accept and what types of output they produce. Function names in the plyr package are much more intuitive and instructive about their input and output types than those of most other packages. Each function is named according to the type of input it takes and the type of output it produces. The first letter of the function name specifies the input type, and the second letter specifies the output type; a represents array, d represents data frame, l represents list, and _ (underscore) indicates that the output is discarded. For example, the adply() function takes an array as input and produces a data frame as output. The following table gives a complete picture of the function-naming conventions used in the plyr package:

Input	Output: array	Output: data frame	Output: list	Output: discarded
Array	aaply()	adply()	alply()	a_ply()
Data frame	daply()	ddply()	dlply()	d_ply()
List	laply()	ldply()	llply()	l_ply()

We can see that there are three types of input and four types of output. Users can easily get an idea of the types of input and output from the function names.

Another interesting feature is that we do not need to learn all 12 functions. Instead, it is sufficient to learn the three types of input and four types of output.
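As a quick illustration of the naming convention, the same data frame input and the same function can be combined with three different output types just by changing the second letter. The helper name species.mean is introduced here only for this sketch.

```r
library(plyr)

species.mean <- function(x) colMeans(x[-5])   # drop the Species column

mean.df   <- ddply(iris, .(Species), species.mean)  # data frame in, data frame out
mean.list <- dlply(iris, .(Species), species.mean)  # data frame in, list out
mean.arr  <- daply(iris, .(Species), species.mean)  # data frame in, array out

class(mean.df)
class(mean.list)
dim(mean.arr)
```

The computed means are the same in all three objects; only the container that holds them changes.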

Other than the function names in the table, there are some special cases that correspond to the mapply() function in base R. In base R, mapply() can take multiple inputs as separate arguments, whereas a*ply() takes only a single array argument. However, the separate arguments in mapply() should be of the same length. The plyr equivalents of mapply() are maply(), mdply(), mlply(), and m_ply().

Note that, whenever a function name is written using a star symbol, such as a*ply(), the star stands for any of the output types. For a*ply(), the input is an array, and the output can be in any format: array, data frame, or list. Optionally, the output can be discarded.

To explain the intuitive nature of the input and output, we will now provide an example using the iris data that we used in an earlier example. This time, we will use the iris3 dataset; this is the same data, but it is stored in a three-dimensional array format. We will calculate the mean of each variable for each species, as shown in the following code:

# class of iris3 dataset is array

class(iris3)

[1] "array"

# dimension of iris3 dataset

dim(iris3)

[1] 50 4 3

The following code snippet calculates the column mean for each species, with the input as an array and the output as a data frame:

# Calculate column mean for each species and output will be

# data frame

iris_mean <- adply(iris3,3,colMeans)

class(iris_mean)

[1] "data.frame"

iris_mean

X1 Sepal L. Sepal W. Petal L. Petal W.

1 Setosa 5.006 3.428 1.462 0.246

2 Versicolor 5.936 2.770 4.260 1.326

3 Virginica 6.588 2.974 5.552 2.026

Since iris3 is an array, we need to specify the dimension along which to split it. We specify this using the .margins parameter of the adply() function. Writing adply(iris3, .margins=3, colMeans) tells adply() to split along the third dimension of the three-dimensional array. If we wanted to split the data by row or by column, we would put 1 or 2, respectively. It is also legitimate to use a combination of dimensions; in that case, c(1,2) could be a choice.

The following code snippet calculates the column mean for each species, with an array as both the input and the output:

# again we will calculate the mean, but this time the output will be an array

iris_mean <- aaply(iris3,3,colMeans)

class(iris_mean)

[1] "matrix"

iris_mean

X1 Sepal L. Sepal W. Petal L. Petal W.

Setosa 5.006 3.428 1.462 0.246

Versicolor 5.936 2.770 4.260 1.326

Virginica 6.588 2.974 5.552 2.026

# note that here the class is showing "matrix", since the output is a

# two-dimensional array, which is a matrix (R 4.0.0 and later print "matrix" "array").

# Now calculate the mean again with the output as a list

iris_mean <- alply(iris3,3,colMeans)

class(iris_mean)

[1] "list"

iris_mean

$`1`

Sepal L. Sepal W. Petal L. Petal W.

5.006 3.428 1.462 0.246

$`2`

Sepal L. Sepal W. Petal L. Petal W.

5.936 2.770 4.260 1.326

$`3`

Sepal L. Sepal W. Petal L. Petal W.

6.588 2.974 5.552 2.026

attr(,"split_type")

[1] "array"

attr(,"split_labels")

1 Setosa

2 Versicolor

3 Virginica

Inputs and arguments

The functions in the plyr package accept various input objects: data frames, arrays, and lists. Each type of input object has its own rule for splitting. In this section, we will discuss inputs and arguments, and briefly describe the rules of splitting.

Arrays are sliced by dimension into lower dimensional pieces. The corresponding common function is a*ply(), where the array is the common input, and the output can be an array, data frame, or list.

Data frames are sliced and subset by a combination of variables from the input dataset. The corresponding common function is d*ply(), where the data frame is the common input, and the output can be an array, data frame, or list.

The elements of a list are processed separately, and the common function is l*ply(), where the common input is a list, and the output can be an array, data frame, or list.

Depending on the input type, there are two or three main arguments for the common functions: a*ply(), d*ply(), and l*ply(). The following are the main arguments for these common functions:

· a*ply(.data, .margins, .fun, ..., .progress = "none")

· d*ply(.data, .variables, .fun, ..., .progress = "none")

· l*ply(.data, .fun, ..., .progress = "none")

The first argument, .data, is the input dataset that will be split and processed, and the output is combined from the results of each split. The .margins or .variables argument specifies how the data should be split up into smaller pieces. The .fun argument specifies the processing task; this can be any function that is applicable to each split of the input. If we omit the .fun argument, the input data is just converted into the output structure specified by the function. If we want to monitor the progress of the processing task, the .progress argument should be specified. By default, no progress status is shown.
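As a small sketch of the .progress argument, the following call uses the "text" option, which draws a text progress bar in the console. The toy input 1:5 and the squaring function are chosen here only for illustration.

```r
library(plyr)

# Square each element; .progress = "text" draws a progress bar in the console
squares <- llply(1:5, function(i) i^2, .progress = "text")
unlist(squares)
```

For a task this small the bar finishes immediately; it is more useful when each piece takes noticeable time to process.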

In the following example, we will see what will happen if we do not specify the .fun argument in any function of the plyr package. If we give the input as an array and want the output as a data frame, but we haven't given a .fun argument, the adply() function will just convert the array object into a data frame. Here is an example:

# converting a 3-dimensional array to a 2-dimensional data frame

iris_dat <- adply(iris3, .margins=3)

class(iris_dat)

[1] "data.frame"

str(iris_dat)

'data.frame': 150 obs. of 5 variables:

$ X1 : Factor w/ 3 levels "Setosa","Versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

$ Sepal L.: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal W.: num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal L.: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal W.: num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

The .margins argument works in a manner similar to the MARGIN argument of the apply() function in base R. For a two-dimensional array, it does the following:

· Slices by row when we specify .margins = 1

· Slices by column when we specify .margins = 2

· Slices into individual cells when we specify .margins = c(1,2)

The .margins argument works correspondingly for higher dimensions, with a combinatorial explosion in the number of possible ways to slice up the array.
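The effect of .margins can be checked on a small matrix by counting the pieces that each choice produces. This is a minimal sketch; the matrix m is a toy example introduced here.

```r
library(plyr)

m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

length(alply(m, 1, identity))        # number of pieces when slicing by row
length(alply(m, 2, identity))        # number of pieces when slicing by column
length(alply(m, c(1, 2), identity))  # number of pieces when slicing into cells

aaply(m, 1, sum)   # one sum per row
aaply(m, 2, sum)   # one sum per column
```

Slicing by row gives as many pieces as there are rows, slicing by column gives as many pieces as there are columns, and slicing by both gives one piece per cell.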

Multiargument functions

Sometimes, we have to deal with functions that take multiple arguments, where the values of each argument come from a data frame, a list, or an array. The plyr package has intuitive and user-friendly functions for working with multiargument functions. In this section, we will see an example of generating random numbers from a normal distribution, with various combinations of sample size, mean, and standard deviation. These values are stored in a data frame. We will generate the random numbers first using base R with a for loop, and then using the mlply() function from the plyr package. The parameter combinations are given in the following table:

Sample size (n)	Mean	Standard deviation
25	0	1
50	2	1.5
100	3.5	2
200	2.5	5
400	0.1	2

With these parameter combinations, we will generate normal random numbers using default R and plyr, as shown in the following code:

# define parameter set

parameter.dat <- data.frame(n=c(25,50,100,200,400),

mean=c(0,2,3.5,2.5,0.1),

sd=c(1,1.5,2,5,2))

# displaying parameter set

parameter.dat

n mean sd

1 25 0.0 1.0

2 50 2.0 1.5

3 100 3.5 2.0

4 200 2.5 5.0

5 400 0.1 2.0

# random normal variate generate using base R

# set seed to make the example reproducible

set.seed(12345)

# initialize blank list object to store the generated variable

dat <- list()

for(i in 1:nrow(parameter.dat))

{

dat[[i]] <- rnorm(n=parameter.dat[i,1],

mean=parameter.dat[i,2],sd=parameter.dat[i,3])

}

# estimating mean from the newly generated data

estmean <- lapply(dat,mean)

estmean

[[1]]

[1] -0.001177287

[[2]]

[1] 2.417842

[[3]]

[1] 3.667193

[[4]]

[1] 2.999662

[[5]]

[1] 0.1765926

# Performing the same task as above, but this time using the plyr package

# (the seed is not reset here, so the generated values differ from the base R run)

dat_plyr <- mlply(parameter.dat,rnorm)

estmean_plyr <- llply(dat_plyr,mean)

estmean_plyr

$`1`

[1] 0.4252469

$`2`

[1] 2.037528

$`3`

[1] 3.070231

$`4`

[1] 2.144276

$`5`

[1] 0.05399488
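If the estimated means are wanted next to the parameters that produced them, mdply() can return a data frame directly. The following is a sketch under the same parameter set; the column name est.mean and the object name est.dat are introduced here for illustration, and the estimates depend on the seed.

```r
library(plyr)

parameter.dat <- data.frame(n    = c(25, 50, 100, 200, 400),
                            mean = c(0, 2, 3.5, 2.5, 0.1),
                            sd   = c(1, 1.5, 2, 5, 2))

set.seed(12345)

# Each row supplies the arguments n, mean, and sd by name;
# base::mean() is spelled out because 'mean' is also an argument here
est.dat <- mdply(parameter.dat, function(n, mean, sd) {
  data.frame(est.mean = base::mean(rnorm(n, mean, sd)))
})
est.dat
```

The result has one row per parameter combination, with the columns n, mean, and sd followed by the estimated mean.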

Comparing base R and plyr

In this section, we will compare the code side by side to solve the same problem using both base R and plyr. Reusing the iris3 data, we are now interested in producing five-number summary statistics for each variable, grouped by species. The five numbers will be the minimum, mean, median, maximum, and standard deviation. The output will be a list of data frames.

To calculate the five-number summary statistics, follow these steps:

1. Define a function that will calculate five-number summary statistics for a given vector.

2. Produce the output of this function in a data frame object.

3. Apply this function in the iris3 dataset using a for loop.

4. Apply the same function using the alply() function of the plyr package.

An example that explains the calculation of the five-number summary statistics is as follows:

# Function to calculate five number summary

fivenum.summary <- function(x)

{

results <-data.frame(min=apply(x,2,min),

mean=apply(x,2,mean),

median=apply(x,2,median),

max=apply(x,2,max),

sd=apply(x,2,sd))

return(results)

}

Here, you can see how we calculate the five-number summaries using a for loop in base R:

# initialize the output list object

all_stats <- list()

# the for loop will run for each species

for(i in 1:dim(iris3)[3])

{

sub_data <- iris3[,,i]

all_stat_species <- fivenum.summary(sub_data)

all_stats[[i]] <- all_stat_species

}

# class of the output object

class(all_stats)

[1] "list"

all_stats

[[1]]

min mean median max sd

Sepal L. 4.3 5.006 5.0 5.8 0.3524897

Sepal W. 2.3 3.428 3.4 4.4 0.3790644

Petal L. 1.0 1.462 1.5 1.9 0.1736640

Petal W. 0.1 0.246 0.2 0.6 0.1053856

[[2]]

min mean median max sd

Sepal L. 4.9 5.936 5.90 7.0 0.5161711

Sepal W. 2.0 2.770 2.80 3.4 0.3137983

Petal L. 3.0 4.260 4.35 5.1 0.4699110

Petal W. 1.0 1.326 1.30 1.8 0.1977527

[[3]]

min mean median max sd

Sepal L. 4.9 6.588 6.50 7.9 0.6358796

Sepal W. 2.2 2.974 3.00 3.8 0.3224966

Petal L. 4.5 5.552 5.55 6.9 0.5518947

Petal W. 1.4 2.026 2.00 2.5 0.2746501

Let's calculate the same statistics, but this time using the alply() function from the plyr package:

all_stats <- alply(iris3,3,fivenum.summary)

class(all_stats)

[1] "list"

all_stats

$`1`

min mean median max sd

Sepal L. 4.3 5.006 5.0 5.8 0.3524897

Sepal W. 2.3 3.428 3.4 4.4 0.3790644

Petal L. 1.0 1.462 1.5 1.9 0.1736640

Petal W. 0.1 0.246 0.2 0.6 0.1053856

$`2`

min mean median max sd

Sepal L. 4.9 5.936 5.90 7.0 0.5161711

Sepal W. 2.0 2.770 2.80 3.4 0.3137983

Petal L. 3.0 4.260 4.35 5.1 0.4699110

Petal W. 1.0 1.326 1.30 1.8 0.1977527

$`3`

min mean median max sd

Sepal L. 4.9 6.588 6.50 7.9 0.6358796

Sepal W. 2.2 2.974 3.00 3.8 0.3224966

Petal L. 4.5 5.552 5.55 6.9 0.5518947

Petal W. 1.4 2.026 2.00 2.5 0.2746501

attr(,"split_type")

[1] "array"

attr(,"split_labels")

1 Setosa

2 Versicolor

3 Virginica
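To confirm that the two approaches give the same numbers, we can compare their results species by species. This is a sketch that repeats the summary function so that it runs on its own; the names stats.base, stats.plyr, and same are introduced here, and lapply() stands in for the for loop.

```r
library(plyr)

fivenum.summary <- function(x) {
  data.frame(min = apply(x, 2, min), mean = apply(x, 2, mean),
             median = apply(x, 2, median), max = apply(x, 2, max),
             sd = apply(x, 2, sd))
}

# Base R: loop over the third dimension with lapply()
stats.base <- lapply(seq_len(dim(iris3)[3]),
                     function(i) fivenum.summary(iris3[, , i]))

# plyr: the same computation with alply()
stats.plyr <- alply(iris3, 3, fivenum.summary)

# Compare the numbers species by species, ignoring attributes
same <- mapply(function(a, b) isTRUE(all.equal(a, b, check.attributes = FALSE)),
               stats.base, stats.plyr)
same
```

The plyr version additionally carries the species labels in the split_labels attribute, which the plain list from base R does not.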

Powerful data manipulation with dplyr

In real-life situations, we usually start our analysis with a data frame-type structure. What do we do after getting a dataset, and what are the basic data-manipulation tasks we usually perform before starting modeling? They are explained here:

1. We check the validity of a dataset based on conditions.

2. We sort the dataset based on some variables, in ascending or descending order.

3. We create new variables based on existing variables.

4. Finally, we summarize them.

This is a list of tasks we usually perform on full datasets. The dplyr package has the functions needed to perform all the tasks listed, along with some additional tasks that come in handy in the data-manipulation process. Group-wise operations are also possible using the dplyr package. In the dplyr package, every task is performed using a function that is called a verb. We may need to use multiple verbs on the same data frame. This could force us to write either a very long line or multiple lines of code. Chaining is a powerful feature of dplyr that allows the output from one verb to be piped into the input of another verb using a short, easy-to-read syntax.

Filtering and slicing rows

Sometimes, we need to subset the data frame based on the values of one or more variables. The filter() function allows us to perform this task. If we just want to see all the observations of the virginica species, then we need to use the following code:

library(dplyr)

filter(iris, Species=="virginica")

We could also create a data frame of virginica flowers with sepal length less than 6 cm and sepal width less than or equal to 2.7 cm:

filter(iris,Species=="virginica" & Sepal.Length<6 & Sepal.Width<=2.7)
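The filter() function also accepts several conditions separated by commas, which it combines with a logical AND. The following sketch checks that both spellings select the same rows; the object names with.and and with.comma are introduced here for illustration.

```r
library(dplyr)

# Conditions joined explicitly with &
with.and <- filter(iris, Species == "virginica" & Sepal.Length < 6 & Sepal.Width <= 2.7)

# The same conditions passed as separate arguments
with.comma <- filter(iris, Species == "virginica", Sepal.Length < 6, Sepal.Width <= 2.7)

all.equal(with.and, with.comma)
```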

We could also extract a subset of a data frame by row position using the slice() function. If we want the first 10 observations, the last 10 observations, or the 95th to 105th observations, then we could use the following code, respectively:

slice(iris, 1:10)

slice(iris, 141:150)

slice(iris, 95:105)
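When the number of rows is not known in advance, the n() helper can be used inside slice() to refer to the total number of rows. The following is a small sketch; the object name last10 is introduced here.

```r
library(dplyr)

# n() is the number of rows, so this keeps the last 10 rows
last10 <- slice(iris, (n() - 9):n())
nrow(last10)
```

This avoids hard-coding row numbers that would silently become wrong if the data frame changed size.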

Arranging rows

To sort the whole data frame based on a single variable or multiple variables, we could use the arrange() function. We could sort the dataset in ascending order of sepal length:

arrange(iris, Sepal.Length)

We could also sort the data frame by sepal length first and then, within equal sepal lengths, by sepal width:

arrange(iris, Sepal.Length, Sepal.Width)

If we want to sort the data frame in ascending order for sepal length, but descending order for sepal width, we can use the desc() function from this package:

arrange(iris, Sepal.Length, desc(Sepal.Width))

The arrange() function in the dplyr package is very similar to the order() function in base R, but its input arguments have a more intuitive structure.
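For comparison, the same ordering can be written with order() in base R. This is a sketch; the object names by.arrange and by.order are introduced here, and only the two sorting columns are compared.

```r
library(dplyr)

# dplyr: ascending sepal length, descending sepal width
by.arrange <- arrange(iris, Sepal.Length, desc(Sepal.Width))

# base R: order() returns row indices; a minus sign reverses a numeric key
by.order <- iris[order(iris$Sepal.Length, -iris$Sepal.Width), ]

all.equal(by.arrange$Sepal.Length, by.order$Sepal.Length)
all.equal(by.arrange$Sepal.Width, by.order$Sepal.Width)
```

With order(), the data frame name has to be repeated for every column, and the minus-sign trick works only for numeric columns, whereas desc() can be applied to any column type.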

Selecting and renaming

Most of the time, we do not work on all the variables in a data frame. Selecting a few columns can make the analysis process less complicated. We can easily select a smaller number of columns from a data frame. In our example, we select the Species, Sepal.Length, and Sepal.Width columns of the iris data frame using the select() function:

select(iris, Species, Sepal.Length, Sepal.Width)

We could also change the column name using the rename() function:

rename(iris, SL=Sepal.Length, SW=Sepal.Width, PL=Petal.Length, PW=Petal.Width)
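Columns can also be selected by pattern or by exclusion instead of listing each name. The following sketch uses the starts_with() helper and a minus sign; the object names sepal.only and no.species are introduced here.

```r
library(dplyr)

# Keep only the columns whose names start with "Sepal"
sepal.only <- select(iris, starts_with("Sepal"))
names(sepal.only)

# Keep every column except Species
no.species <- select(iris, -Species)
names(no.species)
```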

Adding new columns

Very often, we need to create new columns for the purpose of analysis. In the iris data frame, if we want to convert the width and length of the sepal and petal from centimeters to meters, we could use the mutate() function as follows:

mutate(iris, SLm=Sepal.Length/100, SWm= Sepal.Width/100, PLm=Petal.Length/100, PWm= Petal.Width/100 )

Also, we could standardize these variables in the following way:

mutate(iris, SLsd=(Sepal.Length-mean(Sepal.Length))/sd(Sepal.Length),

SWsd= (Sepal.Width-mean(Sepal.Width))/sd(Sepal.Width),

PLsd=(Petal.Length-mean(Petal.Length))/sd(Petal.Length),

PWsd= (Petal.Width-mean(Petal.Width))/sd(Petal.Width) )

If we want to keep only the new variables and drop the old ones, we could easily use the transmute() function:

transmute(iris, SLsd=(Sepal.Length-mean(Sepal.Length))/sd(Sepal.Length),

SWsd= (Sepal.Width-mean(Sepal.Width))/sd(Sepal.Width),

PLsd=(Petal.Length-mean(Petal.Length))/sd(Petal.Length),

PWsd= (Petal.Width-mean(Petal.Width))/sd(Petal.Width) )
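One way to check the standardization is to compare it with the scale() function in base R, which centers and scales a numeric vector in the same way by default. This is a sketch for a single column; the object name iris.std is introduced here.

```r
library(dplyr)

iris.std <- transmute(iris,
                      SLsd = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length))

# scale() returns a one-column matrix, so convert it to a plain vector
all.equal(iris.std$SLsd, as.numeric(scale(iris$Sepal.Length)))

mean(iris.std$SLsd)   # approximately 0
sd(iris.std$SLsd)     # approximately 1
```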

Selecting distinct rows

We can extract the distinct values of a variable or of a combination of variables using the distinct() function. Sometimes, we might encounter duplicate observations in a data frame. The distinct() function helps eliminate these observations:

distinct(iris,Species,Petal.Width)
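When no columns are named, distinct() compares whole rows, which makes it behave like unique() in base R on a data frame. The following sketch compares the row counts rather than assuming a particular number of duplicates.

```r
library(dplyr)

nrow(iris)             # all observations
nrow(distinct(iris))   # observations after dropping exact duplicate rows

# Same count as base R's unique() on the data frame
nrow(distinct(iris)) == nrow(unique(iris))

# One row per species
distinct(iris, Species)
```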

Column-wise descriptive statistics

We could summarize different variables based on different summary statistics using the summarise() function. Here, we summarized the length and width of sepal and petal by calculating their average:

summarise(iris, meanSL=mean(Sepal.Length),

meanSW=mean(Sepal.Width),

meanPL=mean(Petal.Length),

meanPW=mean(Petal.Width))

Group-wise operations

The functions we discussed in previous sections from the dplyr package work on the whole data frame. If we want to use a group-wise operation on different columns, we need to use a combination of the group_by() function and the other functions:

iris.grouped<- group_by(iris, Species)

summarize(iris.grouped, count=n(),

meanSL= mean(Sepal.Length),

meanSW=mean(Sepal.Width),

meanPL=mean(Petal.Length),

meanPW=mean(Petal.Width))

Here, the combination of the group_by() and summarise() functions can be considered an implementation of the split-apply-combine approach on a data frame. The group_by() function takes a data frame as input and also produces a data frame. However, this is a special type of data frame in which the grouping information is stored. When this special type of data frame is supplied as input to the summarise() function, it knows that the calculation should be group-wise. Here, all the calculations using n() and mean() are performed group-wise.
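The group-wise means can be cross-checked against a base R split-apply-combine call such as tapply(). This is a sketch for one column; the object names grouped.mean and base.mean are introduced here.

```r
library(dplyr)

# dplyr: group-wise mean of sepal length
grouped.mean <- summarise(group_by(iris, Species), meanSL = mean(Sepal.Length))

# base R: the same means with tapply()
base.mean <- tapply(iris$Sepal.Length, iris$Species, mean)

all.equal(grouped.mean$meanSL, as.numeric(base.mean))
```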

Chaining

Sometimes, it could be necessary to use multiple functions to perform a single task. From the iris data, we may want to use the group_by() operation to get a special data frame. Then we may want to use the select() function to select only the sepal length and width. It would then be interesting to see location and dispersion summary statistics. Finally, we might want to see species with maximum average sepal length and maximum average sepal width:

iris

iris.grouped<- group_by(iris, Species)

iris.grouped.selected<- select(iris.grouped, Sepal.Length, Sepal.Width)

iris.grouped.selected.summarised<- summarise(iris.grouped.selected,

meanSL=mean(Sepal.Length),

sdSL=sd(Sepal.Length),

meanSW= mean(Sepal.Width),

sdSW= sd(Sepal.Width))

filter(iris.grouped.selected.summarised, meanSL==max(meanSL) | meanSW==max(meanSW))

The workflow is very intuitive but, each time we applied a function, we saved a new data frame. The dplyr package has a nice operator that spares us from saving a new data frame after each action. This operator is called the %>% chain operator; it is similar to the pipe operation in shell scripting. The %>% operator turns x %>% f(y) into f(x, y). This operator not only allows us to avoid storing intermediate data frames, but also makes the block of code more intuitive for other people to understand. It also helps you read your own code in the future:

iris %>%

group_by( Species) %>%

select(Sepal.Length, Sepal.Width) %>%

summarise( meanSL=mean(Sepal.Length),

sdSL=sd(Sepal.Length),

meanSW= mean(Sepal.Width),

sdSW= sd(Sepal.Width)) %>%

filter(meanSL==max(meanSL) | meanSW==max(meanSW))

When we have a script file with a huge number of lines, this feature comes in handy. A cluster of these lines of code in a script file will help us understand that these lines of code were written to perform one task. This also saves us the effort of writing an additional data frame name each time.
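The rewriting rule of the chain operator can be verified on a small case by comparing a piped call with its nested equivalent. This is a sketch; the object names piped and nested are introduced here.

```r
library(dplyr)

# Piped form: read from left to right
piped <- iris %>% filter(Species == "setosa") %>% head(3)

# Nested form: read from the inside out
nested <- head(filter(iris, Species == "setosa"), 3)

all.equal(piped, nested)
```

Both forms perform the same computation; the piped form simply lists the steps in the order in which they are carried out.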

Summary

In this chapter, we discussed what the split-apply-combine strategy is and why it is important in data manipulation. The split-apply-combine strategy can be implemented using base R, but it requires a large amount of code and is not memory or time efficient. To overcome this limitation, we discussed the plyr package, in which group-wise data manipulation can be implemented efficiently. The functions within plyr are intuitive and instructive in terms of input and output types. A large variety of data processing can be done using only a few functions with common input and various types of output. For further reading, an interested user can refer to the paper The Split-Apply-Combine Strategy for Data Analysis by Wickham, which can be found at http://www.jstatsoft.org/v40/i01/paper. We also discussed how we can use dplyr as a powerful tool to manipulate data frames.

In the following chapter, you will learn about reshaping a dataset, which is another important aspect of group-wise data manipulation.