Chapter 2. Basic Data Manipulation

When preparing a dataset for statistical analysis, data processing and manipulation tasks such as checking, cleaning, and creating new variables are among the most important. In this chapter, the basics of data manipulation will be discussed with examples that give us an idea of how to check a dataset and clean it if necessary. This chapter will deal with the following topics:

· Importing datasets from different sources

· Factor manipulations

· Numeric variables and date processing

· Character manipulations

· Subscripting and subsetting datasets

Acquiring data

A dataset can be stored in a computer or any other storage device in different file formats. R provides useful facilities to access different file formats through different commands. Some of the commonly used file formats are:

· Comma separated values (*.csv)

· Tab-delimited text file (*.txt)

· MS Excel file (*.xls or *.xlsx)

· R data object (*.RData)

Other than the file formats mentioned in the preceding list, a dataset can be stored in the format of another statistical package, such as Stata, SPSS, or SAS. In R, we can acquire datasets from other statistical software using the foreign library. In the following examples, we will see how we can acquire data in R from different file formats.

Firstly, we will import a .csv file, CSVanscombe.csv. This file contains four pairs of numeric variables, (x1,y1) to (x4,y4). The noticeable feature of this file is that the actual data starts from the third row and the first two rows contain a brief description about the dataset. Now we will use read.csv() to import the file and store it in the anscombe object in R, which will be a data frame, as shown in the following code:

# Before running the following command we need to set the data

# location using setwd(). For example setwd("d:/chap2").

anscombe <- read.csv("CSVanscombe.csv",skip=2)

Note that in the preceding code, the skip=2 command is used, which tells R that the actual data starts from the third row.

If a .csv file contains both numeric and character variables and we use read.csv(), the character variables get automatically converted to the factor type. We can prevent this automatic conversion to factor by specifying stringsAsFactors=FALSE within the read.csv() command, as shown in the following code:

# import csv file that contains both numeric and character variable

# firstly using default and then using stringsAsFactors=FALSE

iris_a <- read.csv("iris.csv")

str(iris_a)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

In the following example, we will see the difference if we specify the stringsAsFactors = FALSE argument:

# Now using stringsAsFactors=FALSE

iris_b <- read.csv("iris.csv",stringsAsFactors=FALSE)

str(iris_b)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : chr "setosa" "setosa" "setosa" "setosa" ...

We see that in the first data frame, the class of the Species variable is factor, whereas in the second data frame the class of the same variable is character. So we have to be careful while importing the .csv file with mixed variables.

Sometimes, it could happen that the file extension is *.csv but the data is not comma separated; rather, the data supplier has used a semicolon (;) as a separator, or any other symbol. In that case, we can still use the read.csv() command, but this time we have to specify the separator. Let's look at the example with a semicolon-separated .csv file of the same iris data:

iris_semicolon <- read.csv("iris_semicolon.csv",stringsAsFactors=FALSE,sep=";")

str(iris_semicolon)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : chr "setosa" "setosa" "setosa" "setosa" ...

Similarly, if the values are tab separated, we can use read.csv() with sep= "\t". Alternatively, we can use read.table(). The following is an example:

anscombe_tab <- read.csv("anscombe.txt",sep="\t")

anscombe_tab_2 <- read.table("anscombe.txt",header=TRUE)

Notice that when we use read.table(), we have to specify whether the first row contains variable names by setting header=TRUE.

If the dataset is stored in the *.xls or *.xlsx format, we have to use contributed R packages to import those files; one of these packages is xlsx, which can read both *.xls and *.xlsx files. The following is an example of importing the xlsxanscombe.xlsx file:

# Calling xlsx library

library(xlsx)

# importing xlsxanscombe.xlsx

anscombe_xlsx <- read.xlsx2("xlsxanscombe.xlsx",sheetIndex=1)
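If the file is saved in the older *.xls format, the read.xlsx() function from the same package can be used in much the same way; the following is a minimal sketch assuming a hypothetical file named xlsanscombe.xls:

# importing a hypothetical *.xls file with read.xlsx()

anscombe_xls <- read.xlsx("xlsanscombe.xls",sheetIndex=1)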

In R, single or multiple data frames or other objects can be stored in the *.RData format. This file format is convenient for storing more than one object in a single file. To acquire datasets or any other type of object from a *.RData file, we can use the load() function. The following is an example of loading multiple datasets and other R objects from a single *.RData file:

# loading robjects.RData file

load("robjects.RData")

# to see whether the objects are imported correctly

objects()

"character.obj" "diab.dat" "logical.obj" "num.obj" "var1" "var2" "var3" "var4"

Note that the objects() command is used to look at all the objects in the current R session. Now, to see the mode and class of each object, we can use the mode() and class() functions.
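For instance, we might inspect two of the objects listed in the preceding output as follows (the actual results will depend on how the objects stored in robjects.RData were created):

# checking the mode and class of the loaded objects

mode(character.obj)

class(diab.dat)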

To import a Stata file into R, we need to call the foreign library and then use the read.dta() function, which returns a data frame. Similarly, to import an SPSS data file, the corresponding command is read.spss(); note that read.spss() returns a list by default, so we should specify to.data.frame=TRUE to get a data frame. Here is an example of importing a Stata file:

library(foreign)

iris_stata <- read.dta("iris_stata.dta")
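The corresponding SPSS import would look like the following sketch, assuming a hypothetical file named iris_spss.sav:

# importing a hypothetical SPSS file as a data frame

iris_spss <- read.spss("iris_spss.sav",to.data.frame=TRUE)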

In this section, we saw that a dataset can be stored in different formats and that R has user friendly functionality to deal with each of them. The key points of this section are the arguments of the read.csv() function, such as skip, stringsAsFactors, and sep. To import any data correctly, we have to use these arguments carefully.

Factor manipulation

A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, this is known as a factor. During data analysis, the factor variable sometimes plays an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all its levels along with the factor. But if we take any subset of that factor variable, the subset inherits all the levels of the original factor. This feature sometimes creates confusion when interpreting results. Let us now see an example of this feature.

We will firstly create a factor variable from the character string datamanipulation, with the lowercase English alphabet as its levels. Each letter of this string represents a value of the factor variable. Then, we will display the data with the table() function, where we will see many zero frequencies corresponding to the letters that do not appear in the factor variable, as shown in the following code. We will then drop the levels that are not part of the original factor variable and display the data again.

# creating an R object whose value is "datamanipulation"

char.obj <- "datamanipulation"

# creating a factor variable by extracting each single letter from

# the character string. The substring() function is used to extract

# each letter. Note: the nchar() function gives the number of

# characters in a character type R object

factor.obj <- factor(substring(char.obj,1:nchar(char.obj),1:nchar(char.obj)),levels=letters)

# Displaying levels of the factor variable

levels(factor.obj)

[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

# Displaying the data using the table() function

table(factor.obj)

factor.obj

a b c d e f g h i j k l m n o p q r s t u v w x y z

4 0 0 1 0 0 0 0 2 0 0 1 1 2 1 1 0 0 0 2 1 0 0 0 0 0

Notice that there are only a few nonzero values in the table because the original factor variable does not have the entire alphabet as its value. Now we will drop the levels that do not appear in the original factor variable. To do so, we will create another factor variable from the original factor variable, as shown in the following code:

# re-creating factor variable from existing factor variable

factor.obj1 <- factor(factor.obj)

# Displaying levels of the new factor variable

levels(factor.obj1)

[1] "a" "d" "i" "l" "m" "n" "o" "p" "t" "u"

# displaying data using table() function

table(factor.obj1)

factor.obj1

a d i l m n o p t u

4 1 2 1 1 2 1 1 2 1

The important feature to notice here is that we can drop unused factor levels by recreating factor variables from the original factor variable. This is most useful when we use a subset of a factor variable.
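For instance, the following minimal sketch continues the factor.obj example: taking a subset keeps all of the original levels, while recreating the factor (or, equivalently, using the built-in droplevels() function) drops the unused ones:

# taking a subset of the factor keeps all of the original levels

subset.obj <- factor.obj[factor.obj %in% c("a","d","t")]

# recreating the factor drops the unused levels

levels(factor(subset.obj))

[1] "a" "d" "t"

# droplevels() gives the same result

levels(droplevels(subset.obj))

[1] "a" "d" "t"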

Factors from numeric variables

Numeric variables are convenient during statistical analysis, but sometimes we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut is a generic command to create factor variables from numeric variables. In the following example, we will see how we can create factors from a numeric variable using a series of conditional statements. We will also use the cut command to perform the same task.

# creating a numeric variable by taking 100 random numbers

# from normal distribution

set.seed(1234) # setting seed to reproduce the example

numvar <- rnorm(100)

# creating a factor variable with 5 distinct categories

num2factor <- cut(numvar,breaks=5)

class(num2factor)

[1] "factor"

levels(num2factor)

[1] "(-2.35,-1.37]" "(-1.37,-0.389]" "(-0.389,0.592]" "(0.592,1.57]" "(1.57,2.55]"

table(num2factor)

num2factor

(-2.35,-1.37] (-1.37,-0.389] (-0.389,0.592] (0.592,1.57] (1.57,2.55]

7 43 29 13 8

By default, the levels are produced using the actual range of values. Sometimes, the ranges of values are given specific names for convenience. For example, the five categories of the preceding factor might be called the lowest group, lower-middle group, middle group, upper-middle group, and highest group, as shown in the following code:

# creating factor with given labels

num2factor <- cut(numvar,breaks=5,labels=c("lowest group","lower middle group", "middle group", "upper middle", "highest group"))

# displaying the data in tabular form

data.frame(table(num2factor))

num2factor Freq

1 lowest group 7

2 lower middle group 43

3 middle group 29

4 upper middle 13

5 highest group 8

# creating factor variable using conditional statement

num2factor <- factor(ifelse(numvar<=-1.37,1,ifelse(numvar<=-0.389,2,ifelse(numvar<=0.592,3,ifelse(numvar<=1.57,4,5)))),labels=c("(-2.35,-1.37]", "(-1.37,-0.389]", "(-0.389,0.592]", "(0.592,1.57]", "(1.57,2.55]"))

# displaying data using table function

table(num2factor)

num2factor

(-2.35,-1.37] (-1.37,-0.389] (-0.389,0.592] (0.592,1.57] (1.57,2.55]

7 43 29 13 8

Once we have converted a numeric variable to a factor and discarded the numeric variable, we cannot recover the original numeric values from the factor. Therefore, we should be careful when converting numeric variables to factors.
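For instance, applying as.numeric() to the factor created above returns only the internal level codes, not the original measurements:

# as.numeric() on a factor gives the level codes (1 to 5 here),

# not the original values of numvar

head(as.numeric(num2factor))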

Date processing

R can handle date variables in several ways. There are built-in R functions available to deal with date variables, and there are also some useful contributed packages. The built-in R function as.Date() can handle only dates but not time, whereas the chron package, contributed by James and Hornik in 2008, can handle both date and time; however, it cannot deal with time zones. Using the POSIXct and POSIXlt class objects, we can deal with time zones. But there is another R package, lubridate, contributed by Grolemund and Wickham in 2011, that gives much more user friendly functionality for dealing with date and time, including time zone support. In this section, we will see how we can easily deal with date and time using the lubridate package and compare it with the built-in R functions.

Like other statistical software, R also has a base date, and using that base date, R internally stores date objects. In R, dates are stored as the number of days elapsed since January 1, 1970. So if we convert any date object to its internal number, it will show the number of days. We can reformat the number into a date using the date class. The following are some examples:

# creating date object using built in as.Date() function

as.Date("1970-01-01")

[1] "1970-01-01"

# looking at the internal value of date object

as.numeric(as.Date("1970-01-01"))

[1] 0

# January 2, 1970 shows that the number of elapsed days is 1

as.Date("1970-01-02")

[1] "1970-01-02"

as.numeric(as.Date("1970-01-02"))

[1] 1

Using the as.Date() function, we can easily create the date object; the typical format of the date object in this function is year, month, and then day. But we can also create a date object with other formats by specifying the format within the as.Date() function, as shown in the following example:

# creating date object specifying format of date

as.Date("Jan-01-1970",format="%b-%d-%Y")

[1] "1970-01-01"

Note that when specifying the format of the date, we have to give a format that is aligned with the input string. For the complete list of format codes used to specify dates, users are directed to the help documentation of the strptime function, which can be accessed by typing help(strptime) in the R console.
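For instance, a date written in the day/month/year form can be parsed as follows:

# day/month/year with a four-digit year

as.Date("23/07/2013",format="%d/%m/%Y")

[1] "2013-07-23"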

The lubridate package provides intuitive functionality to deal with the date object in R. The following are some of the examples to create the date object using the lubridate package:

# loading lubridate package

library(lubridate)

# creating date object using mdy() function

mdy("Jan-01-1970")

"1970-01-01 UTC"

Note that the default time zone in the mdy(), dmy(), and ymd() functions is Coordinated Universal Time (UTC). One of the most interesting and important features of the lubridate package is that it can deal with heterogeneous formats, which means users can store date information in various ways; for example, second chapter due on 2013, august, 24, first chapter submitted on 2013, 08, 18, or 2013 aug 23. From these heterogeneous dates, we can extract valid date objects that can be processed further within R using the lubridate package, as shown in the following code:

# creating heterogeneous date object

hetero_date <- c("second chapter due on 2013, august, 24", "first chapter submitted on 2013, 08, 18", "2013 aug 23")

# parsing the character date object and convert to valid date

ymd(hetero_date)

[1] "2013-08-24 UTC" "2013-08-18 UTC" "2013-08-23 UTC"

Although the lubridate package can handle heterogeneous dates, the sequence of year, month, and day should be the same across all values within the same object; otherwise, missing values will be generated during date extraction, along with a warning message. For example, if we alter the last date to 23 aug 2013, it will not get converted into a valid date, as shown in the following code:

hetero_date <- c("second chapter due on 2013, august, 24", "first chapter submitted on 2013, 08, 18", "23 aug 2013")

ymd(hetero_date)

[1] "2013-08-24 UTC" "2013-08-18 UTC" NA

Warning message:

1 failed to parse.

During the date manipulation, sometimes we need to change the month only within an existing R date object. The following is an example of doing this using the default R function and also using the lubridate package:

# creating date object using base R functionality

date <- as.POSIXct("23-07-2013",format = "%d-%m-%Y", tz = "UTC")

date

[1] "2013-07-23 UTC"

# extracting month from the date object

as.numeric(format(date, "%m"))

[1] 7

# manipulating month by replacing month 7 to 8

date <- as.POSIXct(format(date,"%Y-8-%d"), tz = "UTC")

date

[1] "2013-08-23 UTC"

# The same operation is done using lubridate package

date <- dmy("23-07-2013")

date

[1] "2013-07-23 UTC"

month(date)

[1] 7

month(date) <- 8

date

[1] "2013-08-23 UTC"

In a dataset, a variable might contain both date and time information, and we may need to round it to the nearest day, month, or year. The following example shows the date rounding functionality; this example also shows how to convert the time zone:

# accessing system date and time

current_time <- now()

current_time

[1] "2013-08-23 23:43:01 BDT"

# changing time zone to "GMT"

current_time_gmt <- with_tz(current_time,"GMT")

current_time_gmt

[1] "2013-08-23 17:43:01 GMT"

# rounding the date to nearest day

round_date(current_time_gmt,"day")

[1] "2013-08-24 GMT"

# rounding the date to nearest month

round_date(current_time_gmt,"month")

[1] "2013-09-01 GMT"

# rounding date to nearest year

round_date(current_time_gmt,"year")

[1] "2014-01-01 GMT"

In this section, we saw that dealing with dates using the lubridate package is really user friendly and intuitive.

Character manipulation

In any statistical software, most data is expected to be either numeric or factor type, but sometimes we have to work with character data. In the area of text mining, character or string manipulation is especially important. R has complete functionality to manipulate character (string) data for further analysis. Besides the default R functionality, there is a contributed package to deal with character data that is more user friendly and intuitive than its base R counterpart: Wickham developed the stringr package in 2010 to manipulate character data with user friendly functions. In this section, we will introduce the base R functions and their stringr counterparts in pairs, so that readers are able to use the functions from the stringr package easily; a short comparison example follows the list:

· paste() (base R): This function is used to concatenate a vector of characters, with a space as the default separator. Its stringr counterpart is str_c(), which has similar functionality but uses an empty string as the default separator and silently removes zero-length arguments.

· nchar() (base R): This returns the number of characters in a character string, but for NA it returns 2, which is not expected. For example:

nchar(c("x","y",NA))

[1] 1 1 2

Its stringr counterpart, str_length(), is the same as nchar() but preserves NA. For example:

str_length(c("x","y",NA))

[1] 1 1 NA

· substr() (base R): This extracts or replaces substrings in a character vector. Its stringr counterpart, str_sub(), is equivalent but returns a zero-length vector if any of its inputs are of zero length. It also accepts negative positions, which are counted from the end of the string; the end position defaults to -1, which corresponds to the last character.

· str_dup() (stringr): This has no direct base R counterpart; it is used to duplicate the characters within a string.

· str_trim() (stringr): This has no direct base R counterpart; it is used to remove leading and trailing whitespace.

· str_pad() (stringr): This has no direct base R counterpart; it is used to pad a string with extra whitespace on the left, right, or both sides.
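The following short sketch illustrates the difference between paste() and str_c(), and the use of negative positions in str_sub():

# loading the stringr package

library(stringr)

# paste() uses a space as the default separator

paste("data","manipulation")

[1] "data manipulation"

# str_c() uses an empty string as the default separator

str_c("data","manipulation")

[1] "datamanipulation"

# negative positions in str_sub() are counted from the end of the string

str_sub("datamanipulation",-12,-1)

[1] "manipulation"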

Other than the functions in the preceding list, there are some other user friendly functions for pattern matching, such as str_detect, str_locate, str_extract, str_match, and str_replace. To get more details about these functions, readers should refer to the stringr: modern, consistent string processing paper, by Wickham, which can be found at http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf.
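As a brief illustration, consider the following minimal sketch of two of these pattern matching helpers:

# str_detect() checks whether each string contains the pattern

str_detect(c("data","manipulation"),"man")

[1] FALSE  TRUE

# str_replace() replaces the first match of the pattern

str_replace("datamanipulation","data","text")

[1] "textmanipulation"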

Subscripting and subsetting

Subscripting and subsetting a dataset is an integral part of data manipulation. If we need to extract a smaller part of any R object (vector, data frame, matrix, or list) that contains more than one element, we need to use subscripts. Subscripting is an approach to accessing individual elements of an R object, for example, a particular element of a vector. Usually, integers are used for subscripting, but logical vectors can also be used for the same purpose. In R, subscripts start from 1, and if we specify a negative subscript, the corresponding position is omitted from the source object.

The following is an example of an R vector with 10 elements and the effect of positive and negative subscripting:

# creating a 10 element vector

num10 <- c(3,2,5,3,9,6,7,9,2,3)

# accessing fifth element

num10[5]

[1] 9

# checking whether any value of the num10 object is greater than 6

num10>6

[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE

# keeping only values greater than 6

num10[num10>6]

[1] 9 7 9

# use of negative subscript removes first element "3"

num10[-1]

[1] 2 5 3 9 6 7 9 2 3

Note that subscripted indexes are written within square brackets. For one-dimensional vectors, we use a single index to access elements, but for two-dimensional objects such as data frames or matrices, we have to use two-dimensional subscripts. In that case, we supply two indexes within the square brackets, separated by a comma: the first index represents rows and the second represents columns; for example:

# creating a data frame with 2 variables

data_2variable <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1))

# accessing only first row

data_2variable[1,]

x1 x2

1 2 5

# accessing only first column

data_2variable[,1]

[1] 2 3 4 5 6

# accessing first row and first column

data_2variable[1,1]

[1] 2

Similar indexing is used for matrices. For a list object, the indexing is different from that of data frames or matrices. To access an element of a list, we have to use double square brackets, [[ ]]; for example, [[1]] gives the first element of a list. If a list is nested within another list, we need to chain the double square brackets. The following example creates a list object and accesses its elements:

list_obj<- list(dat=data_2variable,vec.obj=c(1,2,3))

list_obj

$dat

x1 x2

1 2 5

2 3 6

3 4 7

4 5 8

5 6 1

$vec.obj

[1] 1 2 3

# accessing second element of the list_obj objects

list_obj[[2]]

[1] 1 2 3

Now, if we want to get access to the individual elements of list_obj[[2]], we have to use the following command:

list_obj[[2]][1]

[1] 1

If the list object is named, we can get access to the elements of that list using the name as follows:

# accessing dataset from the list object

list_obj$dat

x1 x2

1 2 5

2 3 6

3 4 7

4 5 8

5 6 1

Subsetting is simply storing a subscripted object: once we extract part of an R object using subscripts and store it in another variable, the newly created object is a subset of the original.
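For example, continuing with the data_2variable data frame created earlier, we can store the rows where x1 is greater than 3 as a new, smaller data frame:

# storing a subscripted object creates a subset of the original data

subset_data <- data_2variable[data_2variable$x1>3,]

subset_data

x1 x2

3 4 7

4 5 8

5 6 1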

Summary

In this chapter, we have covered some of the special features that we need to consider during data acquisition. We also discussed important aspects of factor manipulation, especially subsetting a factor variable and removing unused factor levels. The processing of date variables was covered using the lubridate package, with its user friendly and intuitive functions, and string processing was also highlighted. The chapter ended with an explanation of the concepts of subscripting and subsetting. For more details on string manipulation and date processing, readers should refer to the stringr: modern, consistent string processing paper, by Wickham, which can be found at http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf, and the Dates and Times Made Easy with lubridate paper, by Grolemund and Wickham, which can be found at http://www.jstatsoft.org/v40/i03/paper.

In the next chapter, we will discuss data manipulation with the plyr package, where we will focus on the split-apply-combine strategy, a state-of-the-art approach to group-wise data manipulation in R.