Reshaping Datasets - Data Manipulation with R (2014)

Data Manipulation with R (2014)

Chapter 4. Reshaping Datasets

Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping and we need to implement some reorientation to perform certain types of analyses. Datasets layout could be long or wide. In long-layout, multiple rows represent a single subject's record, whereas in wide-layout, a single row represents a single subject's record. Statistical analysis sometimes requires wide data and sometimes long data, and in such cases, we need to be able to fluently and fluidly reshape the data to meet the requirements of statistical analysis. Data reshaping is just a rearrangement of the form of the data—it does not change the content of the dataset. In this chapter, we will show you different layouts of the same dataset and see how they can be transferred from one layout to another. This chapter mainly highlights the melt and cast paradigm of reshaping datasets, which is implemented in the reshape contributed package. Later on, this same package is reimplemented with a new name, reshape2, which is much more time and memory efficient (the Reshaping Data with the reshape Package paper, by Wickham, which can be found at http://www.jstatsoft.org/v21/i12/paper). In this chapter, we will discuss the layout of a dataset and how we can change the layout using the new paradigm of reshaping datasets with melt and cast. To run the example of this chapter, readers need to install both the reshape and reshape2 packages.

The typical layout of a dataset

A single dataset can be rearranged in many different ways, but before going into rearrangement, let's look back at how we usually perceive a dataset. Whenever we think about any dataset, we think of a two-dimensional arrangement where a row represents a subject's (a subject could be a person and is typically the respondent in a survey) information for all the variables in a dataset and a column represents the information for each characteristic for all subjects. This means rows indicate records and columns indicate variables, characteristics, or attributes. This is the typical layout of a dataset. In this arrangement, one or more variables might play a role as an identifier and others are measured characteristics. For the purpose of reshaping, we could group the variables into two groups: identifier variables and measured variables.

· Identifier variables: These help to identify the subject from whom we took information on different characteristics. Typically, identifier variables are qualitative in nature and take a limited number of unique values. In database terms, an identifier is termed as the primary key, and this can be a single variable or a composite of multiple variables.

· Measured variables: These are those characteristics whose information we took from a subject of interest. These can be qualitative, quantitative, or a mix of both.

Long layout

In this layout, the dataset is arranged in such a way that a single subject's information is stored in multiple rows. We need a composite identification variable to identify a unique row. This type of layout is usually seen in a longitudinal dataset. The following is an example of this type of dataset:

sid

exmterm

math

literature

language

1

1

50

40

70

1

2

65

45

80

2

1

75

55

75

2

2

69

59

78

Notice that in the dataset we have repeated sid, but if we consider both sid and exmterm, each row can be identified uniquely. This layout is known as the long layout. The following is the R code to produce this data frame:

# Example of typical two dimensional data

# A demo dataset "students" with typical layout. This data # contains two students' exam score of "math", "literature" # and "language" in different term exam.

students <- data.frame(sid=c(1,1,2,2),

exmterm=c(1,2,1,2),

math=c(50,65,75,69),

literature=c(40,45,55,59),

language=c(70,80,75,78))

students

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

Wide layout

In this layout, each row represents all the information of a single subject. Usually, only one identification variable is enough to identify a unique subject, but a composite identification variable can be used. The main difference between wide and long layout is that the wide layout contains all the measured information in different columns. The following is the wide layout of the same data that we initially stored in long layout:

sid

math.1

literature.1

language.1

math.2

literature.2

language.2

1

50

40

70

65

45

80

2

75

55

75

69

59

78

Notice that in this layout, each row contains all the information corresponding to a single value of sid. This layout is known as the wide form. In a later section, we will see how we can convert long to wide and vice versa using R.

The new layout of a dataset

In R, the layout of a dataset is known differently than the typical layout we discussed in previous section. This new layout of a dataset consists of only the identification variables and a value per variable. The identification variable identifies a subject along with which measured variable the value represents, which is the long layout in this paradigm. In this new paradigm, each row represents one observation of one variable. Interestingly, the typical long and wide layouts are both known as wide layout in this new paradigm. In the new paradigm, long data is also known as molten data and the process of producing molten data is known as melting from the wide layout. The difference between this new layout of the data and the typical layout is that it now contains only the ID variable and a new column, value, which represents the value of that observation. The following is an example of molten data that comes from the typical long layout:

sid

exmterm

variable

value

1

1

math

50

1

2

math

65

2

1

math

75

2

2

math

69

1

1

literature

40

1

2

literature

45

2

1

literature

55

2

2

literature

59

1

1

language

70

1

2

language

80

2

1

language

75

2

2

language

78

In this dataset, we see that each row contains all the information of one student, which is known as the wide data. The following is the R code to generate this molten data:

# Example of molten data

library(reshape)

molten_students <- melt.data.frame(students,id.vars=c("sid","exmterm"))

The melt.data.frame function converts the wide data to long (molten) form, and the new layout will contain only the identification variables along with two other columns named variable and value. In the new layout, each row contains the observation of a single variable, which is also known as the long form. The variable column represents the identification information along with what is being measured and the value column contains the measurement itself.

Reshaping the dataset from the typical layout

In this section, we will see how we can convert a typical long layout to a typical wide layout and vice versa. To perform this conversion, we will use the built-in reshape() function, which takes several arguments, but we will use the following arguments:

· data: This argument specifies the dataset that we want to change the layout of.

· direction: This argument specifies whether the data is long or wide. Note that here long and wide indicates the typical layout.

· idvar: This argument specifies the identification variable. It could be a single variable or multiple variables.

· timevar: This argument specifies how many times the values of idvar repeats for each subject.

The following example converts the long layout to a wide layout of the students' data that was created earlier:

# Reshaping dataset using reshape function

wide_students <- reshape(students,direction="wide",idvar="sid",timevar="exmterm")

wide_students

sid math.1 literature.1 language.1 math.2 literature.2 language.2

1 50 40 70 65 45 80

2 75 55 75 69 59 78

After reshaping the data, we see that the rows contain each student's exam record. Now we will change the layout from wide to long using the same function:

# Now again reshape to long format

long_students <- reshape(wide_students,direction="long",idvar="id")

long_students

sid exmterm math.1 literature.1 language.1

1 1 50 40 70

2 1 75 55 75

1 2 65 45 80

2 2 69 59 78

The limitation of this default reshape function is that it can only deal with the long and wide structures. In reality, data might contain multiple nested levels. To deal with complex data structures, the reshape function is not useful; we could use the reshape package instead.

Reshaping the dataset with the reshape package

As we have seen, there are two different paradigms for defining the layout of a dataset. To change the layout of a dataset following the steps of a new paradigm, we need to use the reshape packages, where all the functions are implemented following the new layout. The main idea of the reshape package is melting a dataset and then casting it to a suitable layout. In the section The new layout of a dataset, we have mentioned melting a dataset and what it looks like. Just to recall, in molten data, each row represents a single observation of a single variable in the dataset. Also, it contains only the identifier variables and a value variable to represent what is being measured. In this section, we will discuss melting with more examples and casting with molten datasets.

Melting data

In R, melting is a generic operation and can be applied to various data types, including data frames, arrays, and matrices. Though melting can be applied to different R objects, the most common use is melting a data frame. To perform melting operations using themelt function, we need to know what the identification variables are and what the measured variables in the original input dataset are.

If we do not specify the identification variables and measured variables, by default any factor variables are assumed as the ID variables and any numeric variables are assumed as measured variables. To avoid this ambiguous operation, it would be good to specify it explicitly. If we specify only one type of variable, either identification or measured, the function assumes the remaining variable is of the other category. For example, if we specify only the ID variables, the remaining variables will be considered as measured variables and vice versa. The following example will clarify these points:

# original data

students

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

# Melting by specifying both id and measured variables

melt(students,id=c("sid","exmterm"),

measured=c("math","literature","language"))

sid exmterm variable value

1 1 1 math 50

2 1 2 math 65

3 2 1 math 75

4 2 2 math 69

5 1 1 literature 40

6 1 2 literature 45

7 2 1 literature 55

8 2 2 literature 59

9 1 1 language 70

10 1 2 language 80

11 2 1 language 75

12 2 2 language 78

# Melting by specifying only id variables

melt(students,id=c("sid","exmterm"))

sid exmterm variable value

1 1 1 math 50

2 1 2 math 65

3 2 1 math 75

4 2 2 math 69

5 1 1 literature 40

6 1 2 literature 45

7 2 1 literature 55

8 2 2 literature 59

9 1 1 language 70

10 1 2 language 80

11 2 1 language 75

12 2 2 language 78

In the melting process, the melt function does not assume the ID or measured variables; there could be any number of variables in any order, which gives flexibility to deal with the complex dataset. One important thing to note is that whenever we use the meltfunction, all the measured variables should be of the same type; that is, the measured variables should be either numeric, factor, character, or date.

Missing values in molten data

There could be two types of missing values in practice: one is sampling zero (that is, no response) and the other is structural missing. The sampling zeros are explicitly coded and represented in the dataset, but the structural missing depends on the structure of the dataset. The structural missing is implicit in the dataset represented by the absence of a certain combination of the ID variable. If we change the structure of a dataset from nested to crossed, the implicit missing no longer exists in the data, rather it explicitly appears in the new structure and care should be taken to deal with that data. The following simple example taken from the Reshaping Data with the reshape Package paper, by Wickham, which can be found at http://www.jstatsoft.org/v21/i12/paper, clearly explains implicit and explicit missing in two different data structures.

Consider a dataset with two ID variables, sex (male or female) and pregnant (yes or no). When the variables are nested, the missing value "pregnant male" is represented by its absence in the dataset, as shown in the following table. However, in a crossed view, we need to add the explicit missing value as there will now be a cell that must be filled with a value.

Sex

Pregnant

Value

Male

No

10

Female

No

14

Female

Yes

4

The cross view of this table can be represented as follows:

Sex

Pregnant

Not Pregnant

Male

10

Female

4

14

To deal with the implicit missing value, it is good to use na.rm=TRUE with the melt function to remove the structural missing. If we do not specify na.rm=TRUE during melting, we have to specify this during data analysis.

Casting molten data

Once we have molten data, we can rearrange it in any layout using the cast function from the reshape package. There are two main arguments required to cast molten data:

· data: This is the molten data that we want to reshape.

· formula: This is the casting formula to determine the layout of the output data; for example, which variable should go into columns and which should go into rows. If we do not specify a formula, the cast will return the classic data frame.

There are other argument options for performing certain types of operations, if required. The basic casting formula is col_var_1+col_var_2 ~ row_var_1+ row_var_2, which describes the variables to appear in columns and rows. The following example shows how thecast function works:

# Melting students data

molten_students <- melt(students,id.vars=c("sid","exmterm"))

molten_students

sid exmterm variable value

1 1 1 math 50

2 1 2 math 65

3 2 1 math 75

4 2 2 math 69

5 1 1 literature 40

6 1 2 literature 45

7 2 1 literature 55

8 2 2 literature 59

9 1 1 language 70

10 1 2 language 80

11 2 1 language 75

12 2 2 language 78

Now use the cast function to return to the original data structure by specifying both row and column variables, as follows:

cast(molten_students,sid+exmterm~variable)

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

The following is the same operation, but specifying only row variables:

cast(molten_students,...~variable)

sid exmterm math literature language

1 1 1 50 40 70

2 1 2 65 45 80

3 2 1 75 55 75

4 2 2 69 59 78

We now rearrange the data in a way that sid is now a separate column for each student, as follows:

cast(molten_students,...~sid)

exmterm variable 1 2

1 1 math 50 75

2 1 literature 40 55

3 1 language 70 75

4 2 math 65 69

5 2 literature 45 59

6 2 language 80 78

We again rearrange the data in a way that exmterm is now a separate column for each term, as follows:

cast(molten_students,...~exmterm)

sid variable 1 2

1 1 math 50 65

2 1 literature 40 45

3 1 language 70 80

4 2 math 75 69

5 2 literature 55 59

6 2 language 75 78

Note

Note that the column names of the last two examples are not valid column names because they contain numbers. It is a limitation of R. R cannot automatically label row or column names unambiguously, so we have to be careful about column names during analysis.

The reshape2 package

Though the reshape package has various functions to perform various tasks that cannot be done using built-in R functions, this package is slow. To make this more time and memory efficient, Wickham reimplemented this package and developed another package,reshape2. The reason behind the development of the new reshape2 package is to keep the functionality of the original reshape package so that users do not get confused. Some important new features of the reshape2 package are as follows:

· It is much better than the original reshape package in terms of memory and time efficiency

· It uses several functions instead of only the cast function

· The multidimensional marginal total can be calculated

Summary

This chapter introduced the theoretical framework for reshaping a dataset. The limitations of conventional approaches were pointed out and the new paradigm of data layout was highlighted. In the new paradigm, only two functions allow users to rearrange datasets into various layouts as required. This chapter also discussed structural missing and sampling zero, and how to deal with those missing during the melting process. For faster and large data rearrangement, readers were redirected to the reshape2 package. In the next chapter, we will discuss how R can be connected with databases and handle large-scale data.