Data Manipulation with R, Second Edition (2015)

Preface

This book, Data Manipulation with R, gives intermediate-to-advanced users of R who are already familiar with datasets an opportunity to apply state-of-the-art approaches to data manipulation. It discusses the types of data that R can handle and the operations available for each type. After reading this book, you will be able to manage your datasets efficiently and check their validity using R, including specialized packages for data management. You will learn the split-apply-combine strategy, which is one such approach to data management, and you will see how to work with database software through ODBC with the help of very simple examples. The book ends with an introduction to text processing for text mining using R.

What this book covers

Chapter 1, Introduction to R Data Types and Basic Operations, explains how to obtain and install R and how to install additional packages. After introducing how to write commands in R, this chapter discusses the different types of data used in R and their basic operations. Before introducing the data types, we highlight what an object in R is and what its mode and class are. The mode of an object is typically numeric, character, or logical, whereas its class or data structure could be vector, factor, list, data frame, matrix, array, or others. This chapter also shows how to work with objects in different modes, how to convert from one mode to another, and what caution should be taken during conversion. Missing values in R, and how they are represented in character and numeric data, are also discussed here. Along with the data types and basic operations, this chapter sheds light on another important aspect that is almost never mentioned in other textbooks: the object-naming conventions that are popular in R.
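The distinction between mode and class, and the caution needed during conversion, can be illustrated with a minimal sketch in base R. The object names below are hypothetical and chosen only for illustration; they are not taken from the book's example files.

```r
# Hypothetical objects for illustration only
num.obj <- c(1.5, 2, 3)
chr.obj <- c("10", "20", "abc")
fac.obj <- factor(c("low", "high", "low"))

mode(num.obj)    # "numeric"
class(num.obj)   # "numeric"
mode(fac.obj)    # "numeric": a factor is stored as integer codes
class(fac.obj)   # "factor"

# Converting character to numeric: values that cannot be parsed become NA
# (R also issues a warning, suppressed here to keep the output clean)
converted <- suppressWarnings(as.numeric(chr.obj))
is.na(converted)  # FALSE FALSE TRUE

# Caution: converting a factor directly returns its internal codes,
# not the numbers shown in its labels
num.fac <- factor(c("10", "20", "10"))
as.numeric(num.fac)                 # 1 2 1
as.numeric(as.character(num.fac))   # 10 20 10
```

Going through as.character() first is a common way to recover the labelled values of a factor; whether it is appropriate depends on how the factor was created.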

Chapter 2, Basic Data Manipulation, introduces some special features that need care during data acquisition. It then discusses factor manipulation, including how to subset a factor variable and how to remove unused factor levels. This chapter also covers vector and matrix operations. Date processing is discussed using the lubridate package, which the chapter presents as a more efficient way to work with date variables than other packages designed for that purpose. String processing is also highlighted, and the chapter ends with a description of subscripting and subsetting.
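As a small illustration of why unused factor levels matter, the following sketch subsets a factor and then drops the level that no longer occurs. It uses only base R, and the variable names are hypothetical.

```r
# Hypothetical factor with three levels
grp <- factor(c("a", "b", "c", "a", "b"))

# Subsetting keeps the full set of levels, even those no longer present
sub.grp <- grp[grp != "c"]
levels(sub.grp)    # "a" "b" "c"
table(sub.grp)     # level "c" is reported with a count of 0

# droplevels() removes levels that do not occur in the data
clean.grp <- droplevels(sub.grp)
levels(clean.grp)  # "a" "b"
```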

Chapter 3, Data Manipulation Using plyr and dplyr, introduces the state-of-the-art approach called split-apply-combine to manipulate datasets. Data manipulation is an integral part of data cleaning and analysis. For a large dataset, it is often preferable to perform operations within subgroups of the dataset to speed up the process. In R, this type of data manipulation can be done with base functionality, but for large datasets, it requires a considerable amount of coding and eventually takes longer to process. With large datasets, we can split the dataset into pieces, perform the manipulation or analysis on each piece, and then combine the results into a single output. This chapter discusses the different functions in the plyr package that are used for group-wise data manipulation and for data analysis. It also contains examples and discussions of the dplyr package for working with data frames. Working with data frames using dplyr is much more efficient and intuitive. You will gain a very good understanding of data frame processing through the examples in this chapter.
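The three steps of split-apply-combine can be sketched with base R alone. This is only an illustration of the idea under simple assumptions, not the plyr or dplyr code used in the chapter; the data frame and column names are hypothetical.

```r
# Hypothetical data: one grouping column and one numeric column
df <- data.frame(group = c("x", "y", "x", "y", "x"),
                 value = c(1, 2, 3, 4, 8))

# Split: one numeric vector per group
pieces <- split(df$value, df$group)

# Apply: compute a summary on each piece
means <- lapply(pieces, mean)

# Combine: collect the per-group results into a single data frame
result <- data.frame(group = names(means),
                     mean.value = unlist(means),
                     row.names = NULL)
result
```

Packages such as plyr and dplyr wrap these three steps in a single call, which is where most of the saving in code comes from.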

Chapter 4, Reshaping Datasets, deals with the orientation of datasets. Reshaping data is a common and tedious task in real-life data manipulation and analysis. A dataset might come with different levels of grouping, and we need some reorientation to perform certain types of analysis. Statistical analysis sometimes requires wide data and sometimes long data, so we need to be able to reshape data fluently to meet the requirements of the analysis. Important functions from the reshape2 package are discussed in this chapter with examples.
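The difference between wide and long data can be shown with a minimal sketch. To keep it free of package dependencies, it uses the base reshape() function rather than reshape2, which is the package the chapter covers; the data frame and its column names are hypothetical.

```r
# Hypothetical wide data: one row per subject, one column per measurement
wide <- data.frame(id = 1:2,
                   score.1 = c(10, 20),
                   score.2 = c(11, 21))

# Wide to long: one row per subject and measurement occasion.
# With sep = ".", reshape() splits "score.1" into the variable name
# "score" and the time value 1.
long <- reshape(wide,
                direction = "long",
                varying = c("score.1", "score.2"),
                idvar = "id",
                timevar = "time",
                sep = ".")
long
```

The long form has one score column plus a time column identifying the occasion, which is the layout many modelling and plotting functions expect.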

Chapter 5, R and Databases, deals with using database software together with R. One of the major limitations of R is that its memory is bound by the system's virtual memory, which is why working with a dataset requires the data to be smaller than the available memory. In practice, however, a dataset can be larger than the virtual memory, and sometimes the length of an array or vector exceeds the maximum addressable range. To overcome these two limitations, R can be used together with databases. This chapter discusses, with examples, interacting with databases from R, dealing with large datasets using specialized packages, and data manipulation with sqldf.

Chapter 6, Text Manipulation, covers the processing of text data for text mining. This chapter introduces various sources of text data and the process of obtaining that data. This chapter also discusses processing text data for text mining purposes by using various relevant packages.
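A very small sketch of the kind of preprocessing involved is shown below. It uses only base R string functions rather than the text mining packages discussed in the chapter, and the sample sentences are made up for illustration.

```r
# Hypothetical raw text
txt <- c("R is great for data manipulation.",
         "Data manipulation in R is fun!")

# Remove punctuation and convert to lower case
clean <- tolower(gsub("[[:punct:]]", "", txt))

# Split each string into words on whitespace
tokens <- unlist(strsplit(clean, "[[:space:]]+"))

# Count how often each word occurs, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
freq
```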

What you need for this book

Knowledge of statistical data is required, and you are expected to have a basic knowledge of R. To run the examples in this book, R must be installed; it can be found at http://www.r-project.org. The example files were produced with R 3.0.2.

Who this book is for

This book is for intermediate-to-advanced users of R who have knowledge about datasets, and for those who regularly work with research data in fields including, but not limited to, public health, business analysis, and machine learning.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Once we have an R object, we can easily assess its mode by using mode()."

A block of code is set as follows:

num.obj <- seq(from=1,to=10,by=2)

logical.obj <- c(TRUE,TRUE,FALSE,TRUE,FALSE)

character.obj <- c("a","b","c")

is.numeric(num.obj)

[1] TRUE

is.logical(num.obj)

[1] FALSE

is.character(num.obj)

[1] FALSE

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# Calling xlsx library

library(xlsx)

# importing xlsxanscombe.xlsx

anscombe_xlsx <- read.xlsx2("xlsxanscombe.xlsx",sheetIndex=1)

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Click on the Add... button and select an appropriate ODBC driver and then locate the desired file and give a data source name."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.