Basic Data Manipulation - Data Manipulation with R, Second Edition (2015)

Chapter 2. Basic Data Manipulation

When preparing a dataset for statistical analysis, data processing and manipulation, such as checking the data, cleaning it, and creating new variables, are important tasks. This chapter discusses the basics of data manipulation, with examples that show how to check a dataset and clean it if necessary.

This chapter will deal with the following topics:

· Acquiring data

· Vector and matrix operations

· Factor manipulations

· Factors from numeric variables

· Date processing using lubridate

· Character and string manipulations

· Subscripting and subsetting datasets

Acquiring data

A dataset can be stored on a computer or any other storage device in a variety of file formats. R provides functions to read each of these formats. Some of the commonly used file formats are as follows:

· Comma separated values (*.csv)

· Tab-delimited text file (*.txt)

· Microsoft Excel file (*.xls or *.xlsx)

· R data object (*.RData)

Besides the file formats in the preceding list, a dataset can be stored in the format of another statistical software package, for example, Stata, SPSS, or SAS. In R, the foreign package lets us acquire datasets from such software. The following examples show how to acquire data in R from different file formats.

Firstly, we will import a .csv file, CSVanscombe.csv. This file contains four pairs of numeric variables, (x1,y1) to (x4,y4). The noticeable feature of this file is that the actual data starts from the third row, and the first two rows contain a brief description about the dataset.

Now, we will use the read.csv() function to import the file and store it in the anscombe object in R, which will be a data frame, as shown in the following code:

# Before running the following command, set the working directory

# to the file location using setwd(), for example setwd("d:/chap2")

# (assuming the Windows operating system)

anscombe <- read.csv("CSVanscombe.csv", skip=2)

# if setwd() has not been used, give the full path instead

anscombe <- read.csv("d:/chap2/CSVanscombe.csv", skip=2)

Note

Note that in the preceding code, the skip=2 argument tells R to skip the first two rows, so that reading starts from the third row.
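As a small self-contained sketch of what skip does, the following code writes a made-up file with two description lines to a temporary location and reads it back. The file contents and the names csv_lines, tmp_csv, and demo_skip are hypothetical and are not the CSVanscombe.csv file itself.

```r
# Hypothetical file: two description lines, then the header and the data
csv_lines <- c(
  "Demo file for illustration only",
  "The real header starts on the third line",
  "x1,y1",
  "10,8.04",
  "8,6.95",
  "13,7.58"
)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_csv)

# skip = 2 drops the two description lines; the third line becomes the header
demo_skip <- read.csv(tmp_csv, skip = 2)
str(demo_skip)
```

Without skip=2, the first description line would be treated as the header, and the columns would not be read as intended.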

If a .csv file contains both numeric and character variables and we use read.csv(), the character variables are automatically converted to the factor type in versions of R before 4.0.0, where stringsAsFactors defaults to TRUE. From R 4.0.0 onward, the default is FALSE and character variables remain characters.

We can prevent this automatic conversion to factor by specifying stringsAsFactors=FALSE within the read.csv() function, as shown in the following code:

# import a csv file that contains both numeric and character variables,

# stored in the iris.csv file,

# firstly using the default and then using stringsAsFactors=FALSE

iris_a <- read.csv("iris.csv")

str(iris_a)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

In the following example, we will see the difference if we specify the stringsAsFactors = FALSE argument:

# Now using stringsAsFactors=FALSE

iris_b <- read.csv("iris.csv",stringsAsFactors=FALSE)

str(iris_b)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : chr "setosa" "setosa" "setosa" "setosa" ...

We see that in the first data frame, the class of the Species variable is factor, whereas in the second data frame the class of the same variable is character. So, we have to be careful when importing a .csv file with mixed variable types.
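Because the default value of stringsAsFactors depends on the R version, it can be safer to state it explicitly. The following sketch uses a small made-up file (not iris.csv) and hypothetical object names to read the same data both ways and check the class of the character column:

```r
# Hypothetical mixed-type file written to a temporary location
tmp_mixed <- tempfile(fileext = ".csv")
writeLines(c("length,species",
             "5.1,setosa",
             "7.0,versicolor",
             "6.3,virginica"), tmp_mixed)

as_factor <- read.csv(tmp_mixed, stringsAsFactors = TRUE)
as_char   <- read.csv(tmp_mixed, stringsAsFactors = FALSE)

class(as_factor$species)   # "factor"
class(as_char$species)     # "character"

# a factor can be converted back to character values when needed
identical(as.character(as_factor$species), as_char$species)
```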

Sometimes the file extension is *.csv, but the data is not comma separated; the data supplier has used a semicolon (;) or some other symbol as the separator. In that case, we can still use the read.csv() function, but we have to specify the separator with the sep argument.

Let's look at the example with a semicolon-separated .csv file, of the same iris data:

iris_semicolon <- read.csv("iris_semicolon.csv",stringsAsFactors=FALSE,sep=";")

str(iris_semicolon)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : chr "setosa" "setosa" "setosa" "setosa" ...
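When a semicolon-separated file also uses a comma as the decimal mark, base R offers read.csv2(), whose defaults are sep=";" and dec=",". The following sketch uses a small made-up file (not iris_semicolon.csv) and hypothetical object names to compare it with read.csv() called with both arguments given explicitly:

```r
# Hypothetical file: semicolon separator and comma as the decimal mark
tmp_semi <- tempfile(fileext = ".csv")
writeLines(c("Sepal.Length;Species",
             "5,1;setosa",
             "4,9;setosa"), tmp_semi)

semi_a <- read.csv(tmp_semi, sep = ";", dec = ",", stringsAsFactors = FALSE)
semi_b <- read.csv2(tmp_semi, stringsAsFactors = FALSE)

semi_a$Sepal.Length          # read as numeric: 5.1 4.9
all.equal(semi_a, semi_b)    # the two calls give the same data frame
```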

Similarly, if the values are tab separated, we can use read.csv() with sep= "\t". Alternatively, we can use read.table(). The following is an example:

anscombe_tab <- read.csv("anscombe.txt",sep="\t")

anscombe_tab_2 <- read.table("anscombe.txt",header=TRUE)

Notice that when we used read.table(), we had to specify header=TRUE to tell R that the first row contains the variable names; unlike read.csv(), the header argument of read.table() defaults to FALSE.
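The effect of the header argument can be seen on a small made-up tab-separated file. This is only a sketch; the file contents and object names are hypothetical and are not the anscombe.txt file.

```r
# Hypothetical tab-separated file with a header row
tmp_tab <- tempfile(fileext = ".txt")
writeLines(c("x1\ty1", "10\t8.04", "8\t6.95"), tmp_tab)

tab_csv   <- read.csv(tmp_tab, sep = "\t")
tab_table <- read.table(tmp_tab, header = TRUE, sep = "\t")
names(tab_table)     # "x1" "y1"

# without header = TRUE the first row is treated as data
tab_nohead <- read.table(tmp_tab, sep = "\t")
names(tab_nohead)    # "V1" "V2"
nrow(tab_nohead)     # 3
```

In the last call, the variable names end up as the first data row, so every column is read as text rather than as numbers.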

If the dataset is stored in the *.xls or *.xlsx format, we have to use certain R packages to import those files; one of the packages is xlsx, which is designed to read files formatted as *.xlsx.

The following is an example to import the xlsxanscombe.xlsx file:

# Calling xlsx library

library(xlsx)

# importing xlsxanscombe.xlsx

anscombe_xlsx <- read.xlsx2("xlsxanscombe.xlsx",sheetIndex=1)

In R, single or multiple data frames or other objects can be stored in the *.RData format. This file format is convenient for storing more than one dataset in a single file. To acquire a dataset or any other type of object from an *.RData file, we can use the load() function. The following is an example of loading multiple datasets and other R objects from a single *.RData file:

# loading robjects.RData file

load("robjects.RData")

# to see whether the objects are imported correctly

objects()

"character.obj" "diab.dat" "logical.obj" "num.obj" "var1" "var2" "var3" "var4"

Note that the objects() command lists all of the objects in the current workspace. To see the mode and class of each object, we can use the mode() and class() functions. See the section, Modes and classes of R objects in Chapter 1, for more details.
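The round trip between save() and load() can be sketched with a temporary file. The object names below are made up for illustration, and the objects are loaded into a separate environment so that nothing in the workspace is overwritten.

```r
# Two hypothetical objects saved into one .RData file
num_demo <- c(1.5, 2.5)
chr_demo <- "hello"
tmp_rdata <- tempfile(fileext = ".RData")
save(num_demo, chr_demo, file = tmp_rdata)

# load() restores objects under their original names and
# returns those names as a character vector
demo_env <- new.env()
loaded_names <- load(tmp_rdata, envir = demo_env)
loaded_names

# class of each restored object
sapply(mget(loaded_names, envir = demo_env), class)
```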

To import a Stata file into R, we need to load the foreign package and then use the read.dta() function, which returns a data frame. Similarly, to import an SPSS data file, the corresponding function is read.spss(); by default it returns a list, and specifying to.data.frame=TRUE makes it return a data frame.

Here is an example of importing a Stata file:

library(foreign)

iris_stata <- read.dta("iris_stata.dta")

Note

The read.dta() function can only read data files from Stata versions 5 to 12.

In this section, we saw that a dataset can be stored in different formats, and that R has user-friendly functions to deal with each of them. Particularly noteworthy are some of the arguments of the read.csv() function, such as skip, stringsAsFactors, and sep. To import data correctly, we have to use these arguments carefully.

Vector and matrix operations

Matrix operations are among the most commonly used mathematical operations in data processing and data analysis. For any matrix operation, the operands must be conformable, that is, their dimensions must be compatible with the operation.

The following are the rules that must be followed for matrix operations:

· Addition or subtraction rule: The two vectors or matrices must have the same dimensions

· Multiplication rule: The number of columns of the first matrix must be the same as the number of rows of the second one

· Element wise multiplication: Both matrices must have the same dimensions

The following is the R code to perform matrix operations:

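# Creating random matrices: one of dimension 4x3 and two of dimension 3x3

# we will use the rnorm() function to generate random numbers from the

# standard normal distribution, and the runif() function to generate

# random numbers from the standard uniform distribution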

set.seed(1234) # To make the result reproducible

matA <- matrix(rnorm(12),ncol=3)

matB <- matrix(rnorm(9),ncol=3)

matB2 <- matrix(runif(9),ncol=3)

# Matrix addition

matB + matB2 # both have dimension 3x3

[,1] [,2] [,3]

[1,] -0.4644296 0.3917120 -0.5932429

[2,] 0.6862780 0.1660850 3.1812950

[3,] 1.2892642 -0.4262042 0.2078681

In matrix addition, the default plus (+) symbol works well, but the dimensions of the matrices should be the same. The resultant matrix will also have the same dimensions.

In the following example, we will see that if two matrices have different dimensions, matrix addition cannot be performed:

# Matrix addition with varying dimensions

matA + matB

Error in matA + matB : non-conformable arrays

If the matrices are not of same dimensions, then matrix addition will not work.

# Matrix multiplication

matA %*% matB

[,1] [,2] [,3]

[1,] 0.4230620 0.4281611 1.9715294

[2,] -1.0367218 0.5218026 0.8709481

[3,] -1.3367123 0.6089153 -2.3603264

[4,] 0.8276759 1.4477557 0.5093074

In matrix multiplication, the important thing to note is that the operator is not the ordinary multiplication symbol, the asterisk (*), but %*%. If we use * instead, R will try to perform element wise multiplication. If the matrices do not have the same dimensions, element wise multiplication is not possible, and R reports an error.

# Multiplication with default multiplication symbol *

matA * matB

Error in matA * matB : non-conformable arrays

# Element wise multiplication

matB * matB2

[,1] [,2] [,3]

[1,] -0.24205483 -0.05536304 -0.204210306

[2,] 0.04008173 -0.34600174 1.849224683

[3,] 0.31641252 -0.44192179 0.009893013

# Matrix multiplication with two 3x3 matrices

# using the %*% operator

matB %*% matB2

[,1] [,2] [,3]

[1,] -0.5867067 -0.87037213 -0.3355362

[2,] 0.4990147 0.85801532 -0.1971938

[3,] -0.2231869 -0.07027022 -0.4535422
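Because the matrices above are random, the rules are easier to verify by hand on small integer matrices. The following sketch uses made-up matrices A, B, and C and a hypothetical helper, can_multiply(), that checks the multiplication rule before %*% is attempted:

```r
A <- matrix(1:6, nrow = 2)   # 2 x 3
B <- matrix(1:6, nrow = 3)   # 3 x 2
C <- matrix(6:1, nrow = 2)   # 2 x 3

A + C      # same dimensions, so addition works; every entry is 7
A * C      # element wise product, also 2 x 3
A %*% B    # (2 x 3) times (3 x 2) gives a 2 x 2 matrix

# hypothetical helper: columns of x must equal rows of y
can_multiply <- function(x, y) ncol(x) == nrow(y)
can_multiply(A, B)   # TRUE
can_multiply(A, C)   # FALSE

# element wise multiplication of a 2 x 3 and a 3 x 2 matrix fails
failed <- inherits(try(A * B, silent = TRUE), "try-error")
failed               # TRUE
```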

Factor manipulation

A variable that takes only a limited number of distinct values is usually known as a categorical variable, and in R, this is known as a factor. In data analysis, factor variables often play an important role, particularly in studying the relationship between two categorical variables. In this section, we will see some important aspects of factor manipulation. When a factor variable is first created, it stores all of its levels along with the data. If we then take a subset of that factor variable, the subset inherits all of the levels of the original factor, including levels that no longer occur in the subset. This feature sometimes causes confusion when interpreting results.

Let's now see an example of this feature.

We will first create a factor variable from the character string datamanipulation, with the lowercase English alphabet as its levels. Each letter of the string is one value of the factor variable. Then, we will display the data with the table() function, where we will see many zero frequencies corresponding to the letters that do not appear in the string, as shown in the following code. We will then drop the levels that do not occur in the factor variable, and display the data again:

# creating an R object whose value is "datamanipulation"

char.obj <- "datamanipulation"

# creating a factor variable by extracting each single letter from

# the character string. To extract each single letter, the substring()

# function has been used. Note: the nchar() function gives the number

# of characters in a character type R object

factor.obj <- factor(substring(char.obj,1:nchar(char.obj),1:nchar(char.obj)),levels=letters)

# Displaying levels of the factor variable

levels(factor.obj)

[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

# Displaying the data using the table() function

table(factor.obj)

factor.obj

a b c d e f g h i j k l m n o p q r s t u v w x y z

4 0 0 1 0 0 0 0 2 0 0 1 1 2 1 1 0 0 0 2 1 0 0 0 0 0

Notice that there are only a few nonzero values in the table, because the string does not contain every letter of the alphabet. Now, we will drop the levels that do not appear in the original factor variable.

To do so, we will create another factor variable from the original factor variable, as shown in the following code:

# re-creating factor variable from existing factor variable

factor.obj1 <- factor(factor.obj)

# Displaying levels of the new factor variable

levels(factor.obj1)

[1] "a" "d" "i" "l" "m" "n" "o" "p" "t" "u"

# displaying data using table() function

table(factor.obj1)

factor.obj1

a d i l m n o p t u

4 1 2 1 1 2 1 1 2 1

The important feature to notice here is that we can drop unused factor levels by recreating factor variables from the original factor variable. This is most useful when we use a subset of a factor variable.
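The subset case can be sketched with a short made-up factor. Besides re-creating the factor with factor(), base R also provides droplevels(), which is used here as an alternative; the object names are hypothetical.

```r
grade <- factor(c("low", "mid", "high", "mid"),
                levels = c("low", "mid", "high"))

# the subset no longer contains "high", but the level is kept
grade_sub <- grade[grade != "high"]
levels(grade_sub)               # "low" "mid" "high"
table(grade_sub)                # "high" appears with frequency 0

# two ways of dropping the unused level
levels(droplevels(grade_sub))   # "low" "mid"
levels(factor(grade_sub))       # "low" "mid"
```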

Factors from numeric variables

Numeric variables are convenient during statistical analysis, but sometimes we need to create categorical (factor) variables from numeric variables. We can create a limited number of categories from a numeric variable using a series of conditional statements, but this is not an efficient way to perform this operation. In R, cut() is a generic function that creates factor variables from numeric variables. In the following example, we will first use cut() to create a factor from a numeric variable, and then perform the same task using a series of conditional statements.

# creating a numeric variable by taking 100 random numbers

# from normal distribution

set.seed(1234) # setting seed to reproduce the example

numvar <- rnorm(100)

# creating factor variable with 5 distinct categories

num2factor <- cut(numvar,breaks=5)

class(num2factor)

[1] "factor"

levels(num2factor)

[1] "(-2.35,-1.37]" "(-1.37,-0.389]" "(-0.389,0.592]" "(0.592,1.57]" "(1.57,2.55]"

table(num2factor)

num2factor

(-2.35,-1.37] (-1.37,-0.389] (-0.389,0.592] (0.592,1.57] (1.57,2.55]

7 43 29 13 8

By default, the levels are produced using the actual range of values. Sometimes, the range of values is given a specific name for convenience. For example, the five categories of the preceding factor might be called the lowest group, lower-middle group, middle group, upper-middle group, and highest group, as shown in the following code:

# creating factor with given labels

num2factor <- cut(numvar,breaks=5,labels=c("lowest group","lower middle group", "middle group", "upper middle", "highest group"))

# displaying the data in tabular form

data.frame(table(num2factor))

num2factor Freq

1 lowest group 7

2 lower middle group 43

3 middle group 29

4 upper middle 13

5 highest group 8

# creating factor variable using conditional statement

num2factor <- factor(ifelse(numvar<=-1.37,1,ifelse(numvar<=-0.389,2,ifelse(numvar<=0.592,3,ifelse(numvar<=1.57,4,5)))),labels=c("(-2.35,-1.37]", "(-1.37,-0.389]", "(-0.389,0.592]", "(0.592,1.57]", "(1.57,2.55]"))

# displaying data using table function

table(num2factor)

num2factor

(-2.35,-1.37] (-1.37,-0.389] (-0.389,0.592] (0.592,1.57] (1.57,2.55]

7 43 29 13 8

Once we have converted the numeric variable to the factor variable and discarded the numeric variable, we cannot go back to the original numeric variable. Therefore, we should be careful when converting the numeric variable to the factor variable.
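Instead of a number of equal-width bins, cut() also accepts explicit break points. The following sketch uses made-up scores and labels, and keeps the numeric variable next to the new factor in a data frame so that the original values are not lost:

```r
scores <- c(5, 12, 18, 25, 31, 40)

# explicit break points; intervals are right-closed by default: (0,10], (10,20], ...
score_grp <- cut(scores, breaks = c(0, 10, 20, 30, 40),
                 labels = c("low", "mid", "high", "top"))

# keep the numeric values alongside the factor
score_data <- data.frame(scores, score_grp)
table(score_data$score_grp)

# a value that sits exactly on a break point moves to the next
# interval when right = FALSE
as.character(cut(10, breaks = c(0, 10, 20)))                 # "(0,10]"
as.character(cut(10, breaks = c(0, 10, 20), right = FALSE))  # "[10,20)"
```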

Date processing using lubridate

R can handle date variables in several ways. There are built-in R functions to process date variables, and there are also some useful contributed packages. The built-in R function as.Date() handles dates but not times, whereas the chron package, contributed by James and Hornik in 2008, handles both dates and times but cannot work with time zones. Using the POSIXct and POSIXlt classes, we can work with time zones. Another R package, lubridate, contributed by Grolemund and Wickham in 2011, offers much more user-friendly functions to process dates and times, with time zone support. In this section, we will see how we can easily process dates and times using the lubridate package, and compare it with built-in R functions.

Like other statistical software, R has a base date, relative to which it stores date objects internally. In R, dates are stored as the number of days elapsed since January 1, 1970. So if we convert any date object to its internal number, it will show the number of days. We can convert the number back into a date using the Date class. The following are some examples:

# creating date object using built in as.Date() function

as.Date("1970-01-01")

[1] "1970-01-01"

# looking at the internal value of date object

as.numeric(as.Date("1970-01-01"))

[1] 0

# January 2, 1970 is stored as 1, the number of elapsed days

as.Date("1970-01-02")

[1] "1970-01-02"

as.numeric(as.Date("1970-01-02"))

[1] 1

Using the as.Date() function, we can easily create a date object; the default format expected by this function is year, month, and then day. But we can also create a date object from other formats by specifying the format argument within the as.Date() function, as shown in the following example:

# creating date object specifying format of date

as.Date("Jan-01-1970",format="%b-%d-%Y")

[1] "1970-01-01"

Note that when specifying the format of the date, we have to give a format that matches the input string. For the complete list of codes used to specify date formats, users are directed to the help documentation of the strptime function, which can be accessed by typing help(strptime) in the R console.
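The relationship between the format codes, the printed date, and the internal day count can be sketched as follows. The date string and object names are made up for illustration, and only numeric format codes are used so that the result does not depend on the language settings of the session.

```r
# day/month/year input needs an explicit format
d <- as.Date("24/08/2013", format = "%d/%m/%Y")
format(d)                # "2013-08-24"
format(d, "%Y%m%d")      # "20130824"

# internal representation: days since 1970-01-01
n <- as.numeric(d)
back <- as.Date(n, origin = "1970-01-01")
back == d                # TRUE

# subtracting two dates gives the number of days between them
as.numeric(as.Date("2013-08-31") - d)   # 7
```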

The lubridate package provides intuitive functionality to work with the date object in R. The following are some of the examples to create the date object using the lubridate package:

# loading lubridate package

library(lubridate)

# creating date object using mdy() function

mdy("Jan-01-1970")

[1] "1970-01-01 UTC"

Note that the default time zone in the mdy(), dmy(), and ymd() functions is Coordinated Universal Time (UTC). One of the most interesting and important features of the lubridate package is that it can process date variables in heterogeneous formats, that is, dates recorded in various ways within the same variable; for example, second chapter due on 2013, August, 24, first chapter submitted on 2013, 08, 18, or 2013 Aug 23. From such heterogeneous values, we can extract valid date objects that can be processed further within R using the lubridate package, as shown in the following code:

# creating heterogeneous date object

hetero_date <- c("second chapter due on 2013, august, 24", "first chapter submitted on 2013, 08, 18", "2013 aug 23")

# parsing the character date object and convert to valid date

ymd(hetero_date)

[1] "2013-08-24 UTC" "2013-08-18 UTC" "2013-08-23 UTC"

Although the lubridate package can handle heterogeneous dates, the order of year, month, and day must be the same across all values within the same object; otherwise, a missing value is generated during date extraction, along with a warning message. For example, if we change the last date to 23 aug 2013, it will not be converted into a valid date, as shown in the following code:

hetero_date <- c("second chapter due on 2013, august, 24", "first chapter submitted on 2013, 08, 18", "23 aug 2013")

ymd(hetero_date)

[1] "2013-08-24 UTC" "2013-08-18 UTC" NA

Warning message:

1 failed to parse.

During date manipulation, we sometimes need to change only the month within an existing R date object. The following is an example of doing this using base R functions, and also using the lubridate package:

# Creating date object using base R functionality

date <- as.POSIXct("23-07-2013",format = "%d-%m-%Y", tz = "UTC")

date

[1] "2013-07-23 UTC"

# extracting month from the date object

as.numeric(format(date, "%m"))

[1] 7

# manipulating month by replacing month 7 to 8

date <- as.POSIXct(format(date,"%Y-8-%d"), tz = "UTC")

date

[1] "2013-08-23 UTC"

# The same operation is done using lubridate package

date <- dmy("23-07-2013")

date

[1] "2013-07-23 UTC"

month(date)

[1] 7

month(date) <- 8

date

[1] "2013-08-23 UTC"
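In base R, the month can also be changed through the components of a POSIXlt object instead of rebuilding the date from a formatted string. This is a sketch of an alternative, not the approach used above; note that POSIXlt stores months as 0 to 11 and years as years since 1900. The object names are hypothetical.

```r
d_lt <- as.POSIXlt("2013-07-23", tz = "UTC")

d_lt$mon + 1        # 7: months are stored as 0-11
d_lt$year + 1900    # 2013: years are stored as years since 1900

# set the month to August (index 7) and convert back to POSIXct
d_lt$mon <- 7
d_aug <- as.POSIXct(d_lt)
format(d_aug, "%Y-%m-%d")   # "2013-08-23"
```

The zero-based month is easy to get wrong, which is one reason the month() replacement function in lubridate is more convenient.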

In a dataset, a variable might hold both date and time information, and we might need to round it to the nearest day or month. The following example shows the date-rounding functionality; this example also displays how to convert the time zone:

# accessing system date and time

# the output of this section will vary for the readers

current_time <- now()

current_time

[1] "2013-08-23 23:43:01 BDT"

# changing time zone to "GMT"

current_time_gmt <- with_tz(current_time,"GMT")

current_time_gmt

[1] "2013-08-23 17:43:01 GMT"

# rounding the date to nearest day

round_date(current_time_gmt,"day")

[1] "2013-08-24 GMT"

# rounding the date to nearest month

round_date(current_time_gmt,"month")

[1] "2013-09-01 GMT"

# rounding date to nearest year

round_date(current_time_gmt,"year")

[1] "2014-01-01 GMT"
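Because now() returns a different value on every run, the rounding behavior is easier to check on a fixed time stamp. The following sketch assumes the lubridate package is installed and uses a made-up time stamp; it also shows floor_date() and ceiling_date(), which always round down and up respectively.

```r
library(lubridate)

# fixed, hypothetical time stamp so that the results are reproducible
stamp <- ymd_hms("2013-08-23 17:43:01", tz = "UTC")

floor_date(stamp, "day")       # 2013-08-23: always rounds down
round_date(stamp, "day")       # 2013-08-24: 17:43 is past midday
ceiling_date(stamp, "month")   # 2013-09-01: always rounds up
floor_date(stamp, "month")     # 2013-08-01
```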

These examples show that dealing with dates using the lubridate package is user friendly and intuitive.

Sometimes we need to change the time zone in date variables for data analysis purposes. For example, we might need to change the time zone from GMT to EST. The with_tz() function in the lubridate package makes changing the time zone easy and intuitive. Here is a simple example:

date <- ymd("20141221")

date

[1] "2014-12-21 UTC"

with_tz(date,"EST")

[1] "2014-12-20 19:00:00 EST"
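Note that with_tz() keeps the same instant in time and only changes how it is displayed. The lubridate package also provides force_tz(), which keeps the clock time and changes the instant instead. The following sketch, assuming lubridate is installed and using the America/New_York zone as an illustrative choice, contrasts the two:

```r
library(lubridate)

utc_midnight <- ymd_hms("2014-12-21 00:00:00", tz = "UTC")

# same instant, displayed in another time zone
shown_ny <- with_tz(utc_midnight, "America/New_York")
hour(shown_ny)                 # 19
shown_ny == utc_midnight       # TRUE

# same clock time, but a different instant
forced_ny <- force_tz(utc_midnight, "America/New_York")
hour(forced_ny)                # 0
difftime(forced_ny, utc_midnight, units = "hours")   # 5 hours later
```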

Sometimes we need to access individual components of a date and time variable, such as the year, month, and day, as well as the day of the week, and many more. The following is a list of easy-to-use functions from the lubridate package:

· To get the year part from a date time variable: year()

· To get the month only: month()

· To get the week number of a particular date: week()

· To get the day from a date variable (day of month): day() or mday()

· To get the day number between 1 and 366 (day of year): yday()

· To get the day of week: wday()

· To get the hour, min, and second: hour(), minute(), second()

· To access the time zone: tz()

Here is an example for each of the functions we just listed:

date <- ymd("20141221")

year(date)

[1] 2014

month(date)

[1] 12

month(date,label=TRUE)

[1] Dec

Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < Nov < Dec

month(date,label=TRUE,abbr=FALSE)

[1] December

Levels: January < February < March < April < May < June < July < August < September < October < November < December

week(date)

[1] 51

day(date)

[1] 21

mday(date)

[1] 21

yday(date)

[1] 355

wday(date)

[1] 1

wday(date,label=TRUE)

[1] Sun

Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

hour(date)

[1] 0

minute(date)

[1] 0

second(date)

[1] 0

tz(date)

[1] "UTC"

Now, we draw the reader's attention to the output of hour(), minute(), and second(); these functions return zero because the date object contains only the date part, so the time part is set to zero. In other words, the date is treated as recorded at 12:00 AM (midnight). Once we change the time zone of the date object, the values will be different; here is an example:

hour(with_tz(date,"EST"))

[1] 19

Character manipulation

In most statistical software, data is expected to be either numeric or at least a factor, but sometimes we have to work with character data. In text mining, character (string) manipulation is the most important task. R has extensive functionality to manipulate character data for further analysis. Besides the default R functionality, there is a contributed package for character data that is more user friendly and intuitive than its base R counterpart: Wickham developed the stringr package in 2010 to manipulate character data with some user-friendly functions. In this section, we will introduce different functions and their counterparts in a table, so that readers are able to use the functions from the stringr package easily:

Base R functions | stringr functions

paste(): Concatenates vectors of characters, with a space as the default separator. | str_c(): Similar to paste(), but uses the empty string as the default separator. It also silently removes zero-length arguments.

nchar(): Returns the number of characters in a character string. In older versions of R, it returns 2 for NA, which is not expected; for example, nchar(c("x","y",NA)) gives 1 1 2 (recent versions return NA instead). | str_length(): The same as nchar(), but it preserves NA; for example, str_length(c("x","y",NA)) gives 1 1 NA.

substr(): Extracts or replaces substrings in a character vector. | str_sub(): The equivalent of substr(), but it returns a zero-length vector if any of its inputs are of zero length. It also accepts negative positions, which are counted backward from the last character. The end position defaults to -1, which corresponds to the last character.

Unavailable in older versions of R (recent versions provide strrep()) | str_dup(): Duplicates the characters within a string.

Unavailable in older versions of R (recent versions provide trimws()) | str_trim(): Removes leading and trailing whitespace.

No direct counterpart (formatC() and sprintf() can pad to a fixed width) | str_pad(): Pads a string with extra whitespace on the left, right, or both sides.

Other than the functions listed in the preceding table, there are some other user friendly functions for pattern matching. Those functions are str_detect, str_locate, str_extract, str_match, str_replace, and so on. To get more details about these functions, readers should refer to the stringr: modern, consistent string processing paper, by Wickham, which can be found at http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf.
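The functions in the table, together with two of the pattern-matching functions, can be tried on short made-up strings. This sketch assumes the stringr package is installed; the input strings are chosen only for illustration.

```r
library(stringr)

x <- "datamanipulation"

str_c("data", "manipulation")    # "datamanipulation": no separator by default
str_length(c("x", "y", NA))      # 1 1 NA
str_sub(x, 1, 4)                 # "data"
str_sub(x, -12)                  # "manipulation": counted from the end
str_dup("ab", 3)                 # "ababab"
str_trim("  padded  ")           # "padded"
str_pad("7", width = 3, side = "left", pad = "0")   # "007"

# pattern matching
str_detect(c("apple", "kiwi"), "pp")       # TRUE FALSE
str_replace("2013 aug 23", "aug", "08")    # "2013 08 23"
```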

Subscripting and subsetting

Subscripting and subsetting a dataset is an integral part of data manipulation. If we need to extract a smaller part of any R object (vector, data frame, matrix, or list) that contains more than one element, we need to use subscripts. Subscripting is an approach to access individual elements of an R object; for example, accessing a particular element of a vector. Usually, numeric integers are used for subscripting, but logical vectors can also be used for the same purposes. In R, the subscript starts from 1, and if we specify any negative subscript, it omits that position from the source object.

The following is an example of an R vector with 10 elements, and the effect of positive and negative subscripting:

# creating a 10 element vector

num10 <- c(3,2,5,3,9,6,7,9,2,3)

# accessing fifth element

num10[5]

[1] 9

# checking which values of the num10 object are greater than 6

num10>6

[1] FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE

# keeping only values greater than 6

num10[num10>6]

[1] 9 7 9

# use of negative subscript removes first element "3"

num10[-1]

[1] 2 5 3 9 6 7 9 2 3
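Subscripts can also be vectors, and elements of a named vector can be accessed by name. The following sketch uses a made-up named vector to show a few more forms of subscripting:

```r
v <- c(a = 3, b = 2, c = 5, d = 9)

v[c(1, 3)]          # first and third elements
v[-(1:2)]           # everything except the first two elements
v[c("b", "d")]      # elements selected by name
which(v > 2)        # positions of the values greater than 2
v[v > 2 & v < 9]    # combining two logical conditions
```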

Note that subscripts are written within square brackets. For one-dimensional vectors, we use a single index to access elements, but for two-dimensional objects, such as data frames or matrices, we use two subscripts within a single pair of square brackets, separated by a comma. The first index represents rows, and the second represents columns; for example:

# creating a data frame with 2 variables

data_2variable <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1))

data_2variable

x1 x2

1 2 5

2 3 6

3 4 7

4 5 8

5 6 1

# accessing only first row

data_2variable[1,]

x1 x2

1 2 5

# accessing only first column

data_2variable[,1]

[1] 2 3 4 5 6

# accessing first row and first column

data_2variable[1,1]

[1] 2
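Rows and columns can also be selected by logical conditions and by column names. The following sketch rebuilds the same two-variable data frame under the hypothetical name dat2 and shows a few common patterns, including drop=FALSE, which keeps the result as a data frame:

```r
dat2 <- data.frame(x1 = c(2, 3, 4, 5, 6), x2 = c(5, 6, 7, 8, 1))

dat2[dat2$x1 > 3, ]             # rows where x1 is greater than 3
dat2[, "x2"]                    # a single column comes back as a vector
dat2[, "x2", drop = FALSE]      # the same column kept as a data frame
dat2[dat2$x2 > 5, "x1"]         # x1 values for rows where x2 > 5: 3 4 5

# subset() expresses the row condition and column choice by name
subset(dat2, x1 > 3 & x2 > 5, select = x1)
```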

Similar indexing is used for matrices. For list objects, the indexing is different from that of data frames or matrices. To access an element of a list, we use [[ ]] for indexing; for example, the index [[1]] gets the first element of a list. If a list is nested within another list, we chain the double square brackets one after another, for example [[1]][[2]].

The following example creates a list object and accesses its elements:

list_obj <- list(dat=data_2variable,vec.obj=c(1,2,3))

list_obj

$dat

x1 x2

1 2 5

2 3 6

3 4 7

4 5 8

5 6 1

$vec.obj

[1] 1 2 3

# accessing second element of the list_obj objects

list_obj[[2]]

[1] 1 2 3

Now, if we want to get access to the individual elements of list_obj[[2]], we have to use the following command:

list_obj[[2]][1]

[1] 1

If the list object is named, we can get access to the elements of that list, using the name as follows:

# accessing dataset from the list object

list_obj$dat

x1 x2

1 2 5

2 3 6

3 4 7

4 5 8

5 6 1
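The difference between single and double square brackets on a list, and the chaining needed for nested lists, can be sketched as follows. The list lst and its contents are made up for illustration.

```r
lst <- list(dat = data.frame(x1 = 1:2),
            vec.obj = c(1, 2, 3),
            inner = list(msg = "hi"))

class(lst["vec.obj"])      # "list": single brackets return a sub-list
class(lst[["vec.obj"]])    # "numeric": double brackets return the element

# nested list: chain the double brackets, or chain the $ operator
lst[["inner"]][["msg"]]    # "hi"
lst$inner$msg              # "hi"

# an element of a data frame stored inside the list
lst$dat$x1[2]              # 2
```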

Subsetting is just storing subscripted objects. Once we extract any subscripted R object, and store it in another variable, the newly created object is the subset of the original variable.

Summary

In this chapter, we covered some of the special features that we need to consider during data acquisition. We also discussed important aspects of factor manipulation, especially how to remove unused factor levels when subsetting a factor variable. The processing of date variables was covered using the lubridate package, with its user-friendly and intuitive functions, and string processing was also highlighted. The chapter ended with an explanation of the concepts of subscripting and subsetting. For more details on string manipulation and date processing, readers should refer to the paper stringr: modern, consistent string processing by Wickham, which can be found at http://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf, and the article Dates and Times Made Easy with lubridate by Grolemund and Wickham, which can be found at http://www.jstatsoft.org/v40/i03/paper.

In the next chapter, we will discuss data manipulation with the plyr package, where we will focus on the split-apply-combine strategy, a state-of-the-art approach to group-wise data manipulation in R.