Data Manipulation with R (2014)

Preface

This book, Data Manipulation with R, gives intermediate to advanced users of R who are already familiar with datasets an opportunity to apply state-of-the-art approaches to data manipulation. It discusses the types of data that R can handle and the operations available for each type. After reading this book, readers will be able to manage their datasets efficiently and check their validity using R, including specialized packages for data management. Readers will also learn the split-apply-combine strategy, a widely used approach to group-wise data management. The book ends with an introduction to using R with different database software.

What this book covers

Chapter 1, R Data Types and Basic Operations, discusses the different types of data used in R and their basic operations. Before introducing the data types, the chapter explains what an object in R is, along with its mode and class. The mode of an object is most commonly numeric, character, or logical, whereas its class could be vector, factor, list, data frame, matrix, array, or others. The chapter also shows how to work with objects of different modes, how to convert from one mode to another, and what caution should be taken during conversion. Missing values in R, and how to represent missing character and numeric data, are also discussed. Along with the data types and basic operations, this chapter sheds light on another important aspect that is rarely mentioned in other textbooks: the popular object-naming conventions used in R.
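The distinction between mode and class, and the caution needed during conversion, can be previewed with a short base R sketch. The object names below are illustrative and are not taken from the chapter itself.

```r
# Three small vectors, one per basic mode
num.obj <- c(1.5, 2, 3)
chr.obj <- c("1", "2", "three")
lgl.obj <- c(TRUE, FALSE, NA)

mode(num.obj)
mode(chr.obj)
mode(lgl.obj)

# A factor is stored as integer codes, so its mode and class differ
fac.obj <- factor(c("low", "high", "low"))
mode(fac.obj)    # "numeric"
class(fac.obj)   # "factor"

# Conversion caution: text that is not a number becomes NA.
# R normally issues a warning here; it is suppressed only to keep the output short.
converted <- suppressWarnings(as.numeric(chr.obj))
converted        # 1 2 NA

# Missing values have typed variants for character and numeric data
is.na(NA_character_)
is.na(NA_real_)
```

Converting a factor with as.numeric() returns its integer codes rather than its labels, which is one reason conversions deserve a second look.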

Chapter 2, Basic Data Manipulation, introduces some special features that we need to consider during data acquisition. It then discusses an important aspect of factor manipulation: what happens when a factor variable is subset, and how to remove unused factor levels. Date processing is covered using the lubridate package, which makes working with date variables considerably more convenient than most other packages designed for that purpose. String processing is also highlighted, and the chapter ends with a description of subscripting and subsetting.
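The factor issue mentioned above can be seen in a minimal base R sketch. The variable names are illustrative only; the chapter may use different data.

```r
# A factor with three levels
grp <- factor(c("a", "b", "c", "a"))

# Subsetting drops the observations but keeps every original level
sub <- grp[grp != "c"]
levels(sub)      # "a" "b" "c"
table(sub)       # level "c" is still tabulated, with a count of zero

# droplevels() removes the levels that no longer occur
clean <- droplevels(sub)
levels(clean)    # "a" "b"

# Re-creating the factor has the same effect
levels(factor(sub))
```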

Chapter 3, Data Manipulation Using plyr, introduces the state-of-the-art approach called split-apply-combine for manipulating datasets. Data manipulation is an integral part of data cleaning and analysis. For large data, it is often preferable to perform operations within subgroups of a dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large data it requires a considerable amount of coding and eventually takes more processing time. With large datasets, we can split the data, perform the manipulation or analysis on each piece, and then combine the results into a single output. This chapter discusses the different functions in the plyr package that are used for group-wise data manipulation and data analysis.
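As a rough illustration of the three steps, here is a sketch that uses only base R and the built-in iris dataset. The column name mean.Sepal.Length is made up for this example, and the plyr call in the final comment is shown for comparison only; it assumes the plyr package is installed.

```r
# Split: one vector of sepal lengths per species
pieces <- split(iris$Sepal.Length, iris$Species)

# Apply: compute the mean within each piece
means <- lapply(pieces, mean)

# Combine: gather the per-group results into a single data frame
result <- data.frame(Species = names(means),
                     mean.Sepal.Length = unlist(means),
                     row.names = NULL)
result

# With plyr, the same steps collapse into a single call:
# library(plyr)
# ddply(iris, .(Species), summarise, mean.Sepal.Length = mean(Sepal.Length))
```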

Chapter 4, Reshaping Datasets, deals with the orientation of datasets. Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we often need to reorient it to perform certain types of analysis. Some statistical analyses require wide data and others require long data, so we need to be able to reshape data fluently to meet these requirements. Important functions from the reshape package are discussed in this chapter with examples.
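To make the wide and long layouts concrete, the sketch below uses the reshape() function from base R as a stand-in; it is not the reshape package that the chapter covers. The data frame and its column names are invented for illustration.

```r
# Wide layout: one row per subject, one column per measurement occasion
wide <- data.frame(id = 1:2,
                   score.1 = c(10, 20),
                   score.2 = c(11, 21))

# Wide to long: names such as "score.1" are split at "." into a
# variable name ("score") and a time value (1 or 2)
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2"),
                idvar = "id", timevar = "time")
long

# Long back to wide
wide.again <- reshape(long, direction = "wide",
                      idvar = "id", timevar = "time", v.names = "score")
wide.again

# With the reshape package installed, the wide-to-long step would be
# written with melt(), for example: melt(wide, id.vars = "id")
```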

Chapter 5, R and Databases, talks about using R together with database software. One of the major limitations of R is that it holds data in RAM, so a dataset must be smaller than the available memory to be worked with. In practice, datasets are often larger than the capacity of RAM, and sometimes the length of an array or vector exceeds the maximum addressable range. To overcome these two limitations, R can be used with databases. This chapter discusses, with examples, how to interact with databases from R, how to deal with large datasets using specialized packages, and how to manipulate data with sqldf.

Bibliography provides a list of citations used in the book.

What you need for this book

Readers are expected to have basic knowledge of R and some knowledge of statistical data. To run the examples from this book, R should be installed; it is available from http://www.r-project.org. The example files were produced with R 2.15.2 and R 3.0.1.

Who this book is for

This book is for intermediate to advanced level users of R who have knowledge about datasets. It is also for those who regularly deal with research data in fields including, but not limited to, public health, business analysis, and machine learning.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Once we have an R object we can easily assess its mode by using mode()."

A block of code is set as follows:

num.obj <- seq(from = 1, to = 10, by = 2)
logical.obj <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
character.obj <- c("a", "b", "c")
is.numeric(num.obj)
[1] TRUE
is.logical(num.obj)
[1] FALSE
is.character(num.obj)
[1] FALSE

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# Calling xlsx library
library(xlsx)
# importing xlsxanscombe.xlsx
anscombe_xlsx <- read.xlsx2("xlsxanscombe.xlsx", sheetIndex = 1)

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Click on the Add... button and select an appropriate ODBC driver and then locate the desired file and give a data source name."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.