Reshaping Datasets - Data Manipulation with R, Second Edition (2015)

Data Manipulation with R, Second Edition (2015)

Chapter 4. Reshaping Datasets

Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we need to implement a reorientation to perform certain types of analysis. The layout of datasets could be long or wide. In a long layout, multiple rows represent a single subject's record, whereas, in a wide layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data. In such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this chapter, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This chapter mainly highlights the melt and cast paradigms of reshaping datasets, melt and cast is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time-and memory-efficient (refer to Reshaping Data with the reshape Package paper by Hadley Wickham, which can be found at http://www.jstatsoft.org/v21/i12/paper). In this chapter, we will discuss the layout of a dataset and understand how we can change the layout using the new paradigm of reshaping datasets with melt and cast. To run the example of this chapter, you need to install both the reshape and reshape2 packages.

Typical layout of a dataset

A single dataset can be rearranged in many different ways but, before going into this rearrangement, let's look at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement, where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset, and a column represents the information for each characteristic for all subjects. This means rows indicate records, and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play the role of an identifier, and others are measured characteristics. For the purpose of reshaping, we could group the variables into two groups: identifier variables and measured variables. They are explained here:

· Identifier variables: These help identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terms, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables.

· Measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mix of both.

Long layout

In this layout, the dataset is arranged in such a way that a single subject's information is stored in multiple rows. We need a composite identification variable to identify a unique row. This type of layout is usually seen in a longitudinal dataset. The following is an example of this type of dataset:

sid

exmterm

math

literature

language

1

1

50

40

70

1

2

65

45

80

2

1

75

55

75

2

2

69

59

78

Notice that in the dataset, we repeated sid but, if we consider both sid and exmterm, each row can be identified uniquely. This layout is known as the long layout. The following is the R code to produce this data frame:

# Example of typical two dimensional data

# A demo dataset "students" with typical layout. This data # contains two students' exam score of "math", "literature" # and "language" in different term exam.

students <- data.frame(sid=c(1,1,2,2),

exmterm=c(1,2,1,2),

math=c(50,65,75,69),

literature=c(40,45,55,59),

language=c(70,80,75,78))

students

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

Wide layout

In this layout, each row represents all the information of a single subject. Usually, only one identification variable is enough to identify a unique subject, but a composite identification variable can be used. The main difference between a wide layout and a long layout is that the wide layout contains all the measured information in different columns. The following is the wide layout of the same data that we initially stored in the long layout:

sid

math.1

literature.1

language.1

math.2

literature.2

language.2

1

50

40

70

65

45

80

2

75

55

75

69

59

78

Notice that, in this layout, each row contains all the information corresponding to a single value of sid. This layout is known as the wide form. In a later section, we will see how we can convert a long layout to a wide one and vice versa using R.

New layout of a dataset

In R, the layout of a dataset is known to be different from the typical layout that we discussed in the previous section. This new layout consists of only the identification variables and a value per variable. The identification variable identifies a subject, along with which measured variable the value represents and which is the long layout in this paradigm. In this new paradigm, each row represents one observation of one variable. Interestingly, the typical long and wide layouts are both known as wide layout in this new paradigm. In the new paradigm, long data is also known as molten data, and the process of producing molten data is known as melting from the wide layout. The difference between this new layout of the data and the typical layout is that it now contains only the IDvariable and a new column value, which represents the value of that observation. The following is an example of molten data that comes from the typical long layout:

sid

exmterm

variable

value

1

1

math

50

1

2

math

65

2

1

math

75

2

2

math

69

1

1

literature

40

1

2

literature

45

2

1

literature

55

2

2

literature

59

1

1

language

70

1

2

language

80

2

1

language

75

2

2

language

78

In this dataset, we see that each row contains all the information of one student, which is known as the wide data. The following is the R code to generate this molten data:

# Example of molten data

library(reshape)

molten_students <- melt.data.frame(students,id.vars=c("sid","exmterm"))"

The melt.data.frame function converts the wide data to a long (molten) form, and the new layout will contain only the identification variables, along with two other columns named variable and value. In the new layout, each row contains the observation of a single variable, which is also known as the long form. The variable column represents the identification information, along with what is being measured, and the value column contains the measurement itself.

Reshaping the dataset from the typical layout

In this section, we will see how we can convert a typical long layout to a typical wide layout, and vice versa. To perform this conversion, we will use the built-in reshape() function. This takes several arguments, but we will use the following arguments:

· data: This argument specifies the dataset that we want to change the layout of.

· direction: This argument specifies whether the data is long or wide. Note that, here, long and wide indicate the typical layout.

· idvar: This argument specifies the identification variable. It could be a single variable or multiple variables.

· timevar: This argument specifies how many times the values of idvar repeat for each subject.

The following example converts the students' data that was created earlier from a long layout to a wide layout:

# Reshaping dataset using reshape function

wide_students <- reshape(students,direction="wide",idvar="sid",timevar="exmterm")

wide_students

sid math.1 literature.1 language.1 math.2 literature.2 language.2

1 50 40 70 65 45 80

2 75 55 75 69 59 78

After reshaping the data, we see that the rows contain each student's exam record. Now, we will change the layout from wide to long using the same function:

# Now again reshape to long format

long_students <- reshape(wide_students,direction="long",idvar="id")

long_students

sid exmterm math.1 literature.1 language.1

1 1 50 40 70

2 1 75 55 75

1 2 65 45 80

2 2 69 59 78

The limitation of this default reshape function is that it can only deal with the long and wide structures. In reality, data might contain multiple nested levels. To deal with complex data structures, the reshape function is not useful; we should use the reshape package instead.

Reshaping the dataset with the reshape package

As we have seen, there are two different paradigms to define the layout of a dataset. To change the layout of a dataset, here are the steps of a new paradigm. We need to use the reshape package, where all the functions are implemented following the new layout. The main idea of the reshape package is melting a dataset and then casting it to a suitable layout. In the section, New layout of a dataset, we talked about melting a dataset and what it looks like. Just to recall, in molten data each row represents a single observation of a single variable in the dataset. Also, it contains only the identifier variables and a value variable to represent what is being measured. In this section, we will discuss melting with more examples and casting with molten datasets.

Melting data

In R, melting is a generic operation and can be applied to various data types, including data frames, arrays, and matrices. Though melting can be applied to different R objects, the most common use is to melt a data frame. To perform melting operations using the melt function, we need to know what the identification variables and measured variables in the original input dataset are.

If we do not specify the identification variables and measured variables, by default any factor variables are assumed as the ID variables, and any numeric variables are assumed as measured variables. To avoid this ambiguous operation, it would be good to specify it explicitly. If we specify only one type of variable, either identification or measured, the function assumes that the remaining variable is of the other category. For example, if we specify only the ID variables, the remaining variables will be considered as measuredvariables, and vice versa. The following example will clarify these points:

# original data

students

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

# Melting by specifying both id and measured variables

melt(students,id=c("sid","exmterm"),

measured=c("math","literature","language"))

sid exmterm variable value

1 1 1 math 50

2 1 2 math 65

3 2 1 math 75

4 2 2 math 69

5 1 1 literature 40

6 1 2 literature 45

7 2 1 literature 55

8 2 2 literature 59

9 1 1 language 70

10 1 2 language 80

11 2 1 language 75

12 2 2 language 78

# Melting by specifying only id variables

melt(students,id=c("sid","exmterm"))

sid exmterm variable value

1 1 1 math 50

2 1 2 math 65

3 2 1 math 75

4 2 2 math 69

5 1 1 literature 40

6 1 2 literature 45

7 2 1 literature 55

8 2 2 literature 59

9 1 1 language 70

10 1 2 language 80

11 2 1 language 75

12 2 2 language 78

In the melting process, the melt function does not assume the ID or measured variables; there could be any number of variables in any order. This gives the flexibility to deal with complex dataset. One important thing to note is that, whenever we use the meltfunction, all the measured variables should be of the same type: that is, the measured variables should be either numeric, factor, character, or date.

Missing values in molten data

There could be two types of missing value in practice: sampling zero (that is, no response) and structural missing. The sampling zero values are explicitly coded and represented in the dataset, but the structural missing values depend on the structure of the dataset. Structural missing value are implicit in the dataset; they are represented by the absence of a certain combination of the ID variable. If we change the structure of a dataset from nested to crossed, the implicit missing no longer exists in the data. Rather, it explicitly appears in the new structure, and care should be taken to deal with that data. The following simple example is taken from the Reshaping Data with the reshape Package paper by Hadley Wickham, which can be found at http://www.jstatsoft.org/v21/i12/paper. It clearly explains implicit and explicit missing values in two different data structures.

Consider a dataset with two ID variables: sex (male or female) and pregnant (yes or no). When the variables are nested, the missing value pregnant male is represented by its absence in the dataset, as shown in the following table. However, in a crossed view, we need to add the explicit missing value, as there will now be a cell that must be filled with a value.

Sex

Pregnant

Value

Male

No

10

Female

No

14

Female

Yes

4

The cross view of this table can be represented as follows:

Sex

Pregnant

Not Pregnant

Male

10

Female

4

14

To deal with the implicit missing value, it is good to use na.rm=TRUE with the melt function to remove the structural missing value. If we do not specify na.rm=TRUE during melting, we have to specify this during data analysis.

Casting molten data

Once we have molten data, we can rearrange it in any layout using the cast function from the reshape package. There are two main arguments required to cast molten data. They are as follows:

· data: This is the molten data that we want to reshape.

· formula: This is the casting formula to determine the layout of the output data; for example, which variable should go into columns and which should go into rows. If we do not specify a formula, the cast will return the classic data frame.

There are other argument options to perform certain types of operations, if required. The basic casting formula is col_var_1+col_var_2 ~ row_var_1+ row_var_2, which describes the variables to appear in columns and rows. The following example shows how the castfunction works:

# Melting students data

molten_students <- melt(students,id.vars=c("sid","exmterm"))

molten_students

sid exmterm variable value

1 1 1 math 50

2 1 2 math 65

3 2 1 math 75

4 2 2 math 69

5 1 1 literature 40

6 1 2 literature 45

7 2 1 literature 55

8 2 2 literature 59

9 1 1 language 70

10 1 2 language 80

11 2 1 language 75

12 2 2 language 78

Now use the cast function to return to the original data structure by specifying both row and column variables as follows:

cast(molten_students,sid+exmterm~variable)

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

The following is the same operation, but specifying only row variables:

cast(molten_students,...~variable)

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

We will now rearrange the data in such a way that sid is now a separate column for each student, as follows:

cast(molten_students,...~sid)

exmterm variable 1 2

1 1 math 50 75

2 1 literature 40 55

3 1 language 70 75

4 2 math 65 69

5 2 literature 45 59

6 2 language 80 78

We will rearrange the data again in such a way that exmterm is now a separate column for each term, as follows:

cast(molten_students,...~exmterm)

sid variable 1 2

1 1 math 50 65

2 1 literature 40 45

3 1 language 70 80

4 2 math 75 69

5 2 literature 55 59

6 2 language 75 78

Note

Note that the column names of the last two examples are not valid column names because they contain numbers. This is a limitation of R.

R cannot automatically label row or column names unambiguously, so we have to be careful about column names during analysis.

The reshape2 package

Though the reshape package has various functions to perform, there are various tasks that cannot be done using built-in R functions; this package is slow. To make this more time-and memory-efficient, Wickham reimplemented this package and developed another package, reshape2. The reason behind the development of the new reshape2 package is to keep the functionality of the original reshape package so that users do not get confused. Some important new features of the reshape2 package are as follows:

· It is much better than the original reshape package in terms of memory and time efficiency

· It uses several functions instead of only the cast function

· The multidimensional marginal total can be calculated

The melt function in the reshape2 package works the same as the melt function in the reshape package. The only difference is that the melt function in the reshape2 package is faster and more memory-efficient than the melt function in the reshape package. The meltfunction is pretty efficient at converting all data structures to molten data frames. The next step is to reshape the molten data frame into either a data frame or array structure. In the reshape package, this task is done using only the cast function. The output of thecast function, whether a data frame or array, depends on how we put the formula. In the reshape2 package, we have the dcast function to produce the data frame as output and acast to produce an array from a molten data frame.

We will also use the students dataset here. First, we will melt the dataset using the melt function in the reshape2 package, and then we will illustrate how we can use the dcast and acast functions to reshape the data:

library(reshape2)

molten_students <- melt(students,id.vars=c("sid","exmterm"))

The basic casting formula is x_variable + x_2 ~ y_variable + y_2 ~ z_variable ~. For the purpose of illustration, consider x_variable, x_2 as the first set of variables, y_variable, y_2 as the second set of variables, z_variable ,z_2 as the third set of variables, and so on. The first set of variables is used to make the row uniquely identifiable. For the molten dataset molten_students we are considering sid as first set of variable and variable as second set of variable in the following example:

> dcast(molten_students, sid~variable)

Aggregation function missing: defaulting to length

sid math literature language

1 1 2 2 2

2 2 2 2 2

Here, we can see that we have only two rows, although we do not have all the data here. This happened because the sid variable has only two unique values. To make the column uniquely identifiable using just the sid variable, we only need two rows:

> dcast(molten_students, sid+exmterm~variable)

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

Now we have four rows, because the sid and exmterm variables together can create only four unique rows. We have complete data here. So, in the process of data analysis, we should use the entire identification variable as the first set of variables. This is also true for the acast function:

> acast(molten_students, sid~variable)

Aggregation function missing: defaulting to length

math literature language

1 2 2 2

2 2 2 2

Here, there is no sid variable in the data, because acast produces an array and the value of the sid variable is used as the row index for this data:

> acast(molten_students, sid+exmterm~variable)

math literature language

1_1 50 40 70

1_2 65 45 80

2_1 75 55 75

2_2 69 59 78

This sheds light on how the combination of the sid and exmterm variables is considered as an index of the output array.

The second set of variables is used to produce column name. The combination of the values of the second set of variables is used as the column name of the output data frame in the dcast and acast functions:

> dcast(molten_students, sid~variable+exmterm)

sid math_1 math_2 literature_1 literature_2 language_1 language_2

1 1 50 65 40 45 70 80

2 2 75 69 55 59 75 78

> acast(molten_students, sid~variable+exmterm)

math_1 math_2 literature_1 literature_2 language_1 language_2

1 50 65 40 45 70 80

2 75 69 55 59 75 78

Here, we can see that the combination of the second set of variables is considered as the column name of the output data frame and array.

The third set of variables is only applicable for the acast function since an array could go beyond two dimensions, but data frame is strictly restricted to two dimensions. This is why we could not use the third set of variables in the formula for the dcast function:

> acast(molten_students, sid~exmterm~variable)

, , math

1 2

1 50 65

2 75 69

, , literature

1 2

1 40 45

2 55 59

, , language

1 2

1 70 80

2 75 78

Summary

This chapter introduced a theoretical framework for reshaping a dataset. The limitations of conventional approaches were pointed out, and the new paradigm of data layout was highlighted. In the new paradigm, employing only two functions allows users to rearrange datasets into various layouts as required. This chapter also discussed structural missing, sampling zero values, and how to deal with these missing values during the melting process. For faster and large data rearrangement, you were redirected to thereshape2 package.

In the next chapter, we will discuss how R can be connected with databases and handle large-scale data.