Input and Output - R Recipes: A Problem-Solution Approach (2014)

Chapter 2. Input and Output

R provides many input and output capabilities. This chapter contains recipes on how to read data into R, as well as how to use several handy input and output functions. Although most R users are more concerned with input, there are times when you need to write to a file. You will find recipes for that in this chapter as well.

Oracle boasts that Java is everywhere, and that is certainly true, as Java is in everything from automobiles to cell phones and computers. R is not everywhere, but it is everywhere you need it to be for data analysis and statistics.

Recipe 2-1. Inputting and Outputting Data

Problem

To work with data, you need to get it into your R program. You may want to obtain that data from user input or from a file. Once you have done some processing you may want to output some data.

Solution

Besides typing data into the console, you can use the script editor. The output for your R session appears in the R Console or the R Graphics Device. The basic commands for reading data from a file are read.table() and read.csv().

Note: Here, CSV refers to comma-separated values.

You can write to a file using write.table(). In addition to these standard ways to get data into and out of R, there are some other helpful tools as well. You can use data frames, which are a special kind of list. As with any list, you can have multiple data types, and for statistical applications, the data frame is the most common data structure in R. You can get data and scripts from the Internet, and you can write functions that query users for keyboard input.

Before we discuss these I/O (input/output) options, let’s see how you can get information regarding files and directories in R. File and directory information can be very helpful. The functions getwd() and setwd() are used to identify the current working directory and to change the working directory. For files in your working directory, simply use the file name. For files in a different directory, you must give the path to the file in addition to the name.

The function file.info() provides details of a particular file. If you need to know whether a particular file is present in a directory, use file.exists(). Using the function objects() or ls() will show all the objects in your workspace. Type dir() for a list of all the files in the current directory. Finally, you can see a complete list of file- and directory-related functions by entering the command ?files.
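The file and directory functions described above can be tried safely in a scratch directory. The following is a minimal sketch; the directory prefix io_demo_ and the file names numbers.txt and absent.txt are made up for illustration.

```r
# Work in a temporary directory so the real working directory is untouched.
demo_dir <- tempfile("io_demo_")        # hypothetical directory name
dir.create(demo_dir)
old_wd <- setwd(demo_dir)               # setwd() returns the previous directory

writeLines("1 2 3", "numbers.txt")      # hypothetical file in the working directory

getwd()                                 # the current working directory
found   <- file.exists("numbers.txt")   # TRUE: the file was just written
missing <- file.exists("absent.txt")    # FALSE: no such file
info    <- file.info("numbers.txt")     # one-row data frame: size, isdir, mtime, ...
info$size                               # file size in bytes
files   <- dir()                        # names of all files in the current directory

setwd(old_wd)                           # restore the original working directory
```

Saving the value returned by setwd() makes it easy to switch back when you are done.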

To organize the discussion, I’ll cover keyboard and monitor I/O; reading, cleaning, and writing data files; reading and writing text files; and R connections, in that order.

Keyboard and Monitor Access

You can use the scan() function to read in a vector from a file or the keyboard. If you would rather enter the elements of a vector one at a time, each on its own line, just type x <- scan() and press the Enter key. R gives you the index, and you supply the value. See the following example. When you are finished entering data, just press the Enter key at an empty index prompt.

> xvector <- scan()
1: 19
2: 20
3: 31
4: 25
5: 36
6: 43
7: 53
8: 62
9: 40
10: 29
11:
Read 10 items
> xvector
[1] 19 20 31 25 36 43 53 62 40 29

Many people find it faster and less error-prone to enter data in a column than in a row. You may like this way of entering vectors more than using the c() function.

If your data are in a file in the current working directory, you can enter a vector by using the file name as the argument for scan(). For example, assume you have a vector stored in a file called yvector.txt.

> scan("yvector.txt")
Read 10 items
[1] 22 18 32 39 42 73 37 55 34 34
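If you do not have yvector.txt on disk, you can reproduce the idea with a temporary file. This sketch writes the ten values shown above to a stand-in file and reads them back; it also shows the text and what arguments of scan(), which the recipe does not use.

```r
# Write ten values to a temporary file (a stand-in for yvector.txt).
tmp <- tempfile(fileext = ".txt")
writeLines("22 18 32 39 42 73 37 55 34 34", tmp)

yvector <- scan(tmp)          # scan() returns a numeric vector by default
yvector

# scan() can also parse a character string, which is handy in scripts.
zvector <- scan(text = "19 20 31 25")

# Use what = "" to read character data instead of numbers.
words <- scan(text = "tom felix mittens", what = "")
```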

The readline() function works in a similar fashion to get information from the keyboard. For example, you may have a code fragment like the following:

> yourName <- readline("Type in Your First and Last Name: ")
Type in Your First and Last Name: Larry Pace
> yourName
[1] "Larry Pace"
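One caveat: readline() only waits for input in an interactive session. In a script run non-interactively, it returns an empty string immediately. A possible way to guard against that is sketched below; ask_name() and its default argument are hypothetical names, not part of base R.

```r
# ask_name() is a hypothetical helper, not a base R function.
ask_name <- function(default = "Anonymous") {
  if (!interactive()) {
    return(default)                     # no keyboard available: use the fallback
  }
  answer <- readline("Type in Your First and Last Name: ")
  if (nzchar(answer)) answer else default   # fall back if the user just presses Enter
}

yourName <- ask_name()
yourName
```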

In the interactive mode, you can print the value of an object to the screen simply by typing the name of the object and pressing Enter. You can also use the print() function, but it is not necessary at the top level of the interactive session. However, if you want to write a function that prints to the console, just typing the name of the object inside the function body will no longer work. In that case, you will have to use the print() function. Examine the following code. I wrote the function in the script editor to make things a little easier to control. I cover writing R functions in more depth in Chapter 11.

> cubes
function(x) {
print(x^3)
}
> x <- 1:20
> cubes(x)
[1] 1 8 27 64 125 216 343 512 729 1000 1331 1728 2197 2744 3375
[16] 4096 4913 5832 6859 8000
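To see the difference between typing an object's name and calling print() inside a function, consider the following sketch. The function show_steps() is made up for this illustration; capture.output() is used only to record what actually reaches the console.

```r
# show_steps() is a hypothetical function for illustration.
show_steps <- function(x) {
  squares <- x^2
  squares              # evaluated, but NOT printed: autoprinting happens only at top level
  print(x^3)           # printed explicitly
  invisible(squares)   # returned to the caller without being printed
}

# capture.output() collects the lines the function prints to the console.
printed <- capture.output(result <- show_steps(1:3))
printed                # only the cubes were printed
result                 # the squares were returned silently
```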

Reading and Writing Data Files

R can deal with data files of various types. Tab-delimited and CSV are two of the most common file types. If you load the foreign package, you can read in additional data types, such as SPSS and SAS files.
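As a quick illustration of the two most common formats, the following sketch writes a tiny CSV file and a tiny tab-delimited file to temporary locations and reads them back. The file contents are invented for the example.

```r
# Create two small example files (contents are made up for illustration).
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
writeLines(c("name,age", "Tom,12", "Felix,10"), csv_file)
writeLines(c("name\tage", "Tom\t12", "Felix\t10"), tab_file)

# read.csv() and read.delim() both assume a header row by default.
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
from_tab <- read.delim(tab_file, stringsAsFactors = FALSE)

# The same tab-delimited file read with read.table(), spelling out the options.
from_tab2 <- read.table(tab_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)

str(from_csv)
```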

Reading Data Files

To illustrate, I will get some data in SPSS format from the General Social Survey (GSS) and then open it in R. The GSS dataset is used by researchers in business, economics, marketing, sociology, political science, and psychology. The most recent GSS data are from 2012. You can download the data from www3.norc.org/GSS+Website/Download/ in either SPSS format or Stata format.

Because Stata does a better job than SPSS at coding the missing data in the GSS dataset, I saved the Stata (*.DTA) format into my directory and then opened the dataset in SPSS. This fixed the problem of dealing with missing data, but my data are far from ready for analysis yet. If you do not have SPSS, you can download the open-source program PSPP, which can read and write SPSS files, and can do most of the analyses available in SPSS. The point of this illustration is simply that there are data out there in cyberspace that you can import into R, but you may often have to make a pit stop at SPSS, Stata, PSPP, Excel, or some other program before the data are ready for R. If you have an “orderly” SPSS dataset with variable names that are legal in R, you can open that file directly into R with no difficulty using the foreign package.

When I read the SPSS data file into R, I see I still have some work to do:

> require(foreign)
Loading required package: foreign
> gss2012 <- read.spss("GSS2012.sav")
There were 11 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In read.spss("GSS2012.sav") :
GSS2012.sav: Unrecognized record type 7, subtype 18 encountered in system file
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
5: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
6: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
7: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
8: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
9: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
10: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated
11: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, ... :
duplicated levels in factors are deprecated

Although this dataset with 4820 records and 1067 variables is large by the standards of the majority of researchers, the data are not “big” in the modern sense. As you can see by the preceding warning messages, the next problem is that the data must be cleaned up a bit before I can do any serious data analysis. Dealing with dirty data is a real-world problem that is not sufficiently addressed in most statistics textbooks, in which professors like me make up examples that are easy to work with, and which almost never have missing data. Recipe 2-2 deals with cleaning up data.

Note: R nearly choked on the GSS data. We will talk about how to handle very large datasets in Chapter 13.

Writing Data Files

The write.table() function is the analog of the read.table() function. The write.table() function writes a data frame. The function cat() can also be used to write to a file (or to the screen), by successive parts. What this means is that you concatenate the arguments to the cat() function, separating them by commas. You can use any R data type for this purpose. The following code illustrates this:

> cats <- c("Tom","Felix","Mittens","Socks","Boots","Fluffy")
> ages <- c(12,10,8,2,5,3)
> pets <- data.frame(cats, ages, stringsAsFactors = FALSE)
> pets
     cats ages
1     Tom   12
2   Felix   10
3 Mittens    8
4   Socks    2
5   Boots    5
6  Fluffy    3
> write.table(pets, "myCats")
> cat("Tom\n", file = "catFile")
> cat("Felix\n", file = "catFile", append = TRUE)
> ## verify the file writes by using the file.exists() function
> file.exists("myCats")
[1] TRUE
> file.exists("catFile")
[1] TRUE
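Checking that a file exists does not confirm what was written to it. A round trip, writing the data and reading it back, is a stronger check. The sketch below uses temporary files rather than myCats and catFile, and the options row.names, quote, and sep are my choices for a clean tab-delimited file, not settings used in the recipe.

```r
pets <- data.frame(cats = c("Tom", "Felix", "Mittens"),
                   ages = c(12, 10, 8),
                   stringsAsFactors = FALSE)

# Write a tab-delimited file without row names or quotation marks.
out_file <- tempfile(fileext = ".txt")
write.table(pets, out_file, row.names = FALSE, quote = FALSE, sep = "\t")
readLines(out_file)                      # inspect the raw file contents

# Read the file back in and compare it with the original data frame.
pets_back <- read.table(out_file, header = TRUE, sep = "\t",
                        stringsAsFactors = FALSE)

# cat() joins its arguments with a space and writes them in successive parts.
log_file <- tempfile(fileext = ".txt")
cat("Tom", 12, "\n", file = log_file)
cat("Felix", 10, "\n", file = log_file, append = TRUE)
log_lines <- readLines(log_file)
```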

Recipe 2-2. Cleaning Up Data

Problem

Real-world data often need cleaning. For example, the GSS codebook uses several different codes for missing data. The easiest way to handle the recoding in this particular case is to clean the dataset in SPSS (see Recipe 2-1 for more on GSS). After the cleaning, the data will be more orderly. In many cases, cleaning data in R is more efficient, and in many others, it might be more efficient to use the search-and-replace functionality of a word processor or a spreadsheet program. As always, choose the most appropriate tool from the toolbox. If the dataset is small, you can make minor edits using the R Data Editor, not to be confused with the script editor.

Solution

When you have serious data recoding and cleaning to do (I call it “data surgery”), I suggest you make use of the plyr package in R. Think of a pair of pliers. The plyr package is a SAC (split-apply-combine) tool, and does a great job for such purposes.

To illustrate some real-world data cleaning issues, let us use a manageable (and I hope interesting to you) set of data, courtesy of Dr. Nat Goodman. The data consist of various measurements of mutant and normal mice. The mutated mice were created to carry the genome sequence for Huntington’s disease. Several different strains of mice were used because inbred mice are as alike genetically as identical human twins are. For this example, we will work with only two strains of mice.

The following is the head (the first few records) of the mouse data (which we can view with the head() function). Each mouse has a unique identifier, the strain, the nominal genome sequence, and the actual genome sequence. The sequence CAG repeated seven times represents a normal mouse. CAG sequences of 40 or more are associated with Huntington’s disease in mice. The other variables are self-descriptive. The age is the mouse’s age in weeks. This dataset represents a small portion of a much larger dataset.

When you read in a CSV file with read.csv(), you do not have to specify that the first row contains the variable names, because header = TRUE is the default for that function. The “header” is expected in a CSV file. However, many tab-delimited files do not have a row of column headers. If your tab-delimited file does have a row of variable names as the first row, set the header argument of read.table() to TRUE (the abbreviation T also works, but TRUE is safer because T can be reassigned), as shown in the following code segment.

> mouseWeights <- read.table("Mouse_Weights.txt", header = TRUE)
> head(mouseWeights)
mouse_id strain cag_nominal cag_actual sex age body_weight brain_weight
1 hd1769 B6 Q111 113 F 3.7143 11.18 0.380
2 hd1777 B6 Q111 137 F 4.0000 12.50 0.434
3 hd1778 B6 WT 7 F 4.0000 13.30 0.406
4 hd1782 B6 Q111 136 F 4.0000 11.66 0.426
5 hd1806 B6 WT 7 M 4.0000 14.33 0.464
6 hd1808 B6 Q111 113 M 4.0000 13.72 0.414

When we examine the data, we see that there are some problems. We find that some mice have an impossible body weight of zero grams. Other mice have an equally impossible brain weight of zero grams.

> summary(mouseWeights)
mouse_id strain cag_nominal cag_actual sex age
hd1094 : 1 B6 :376 Q111:172 Min. : 7.00 F:315 Min. : 3.714
hd1095 : 1 CD1:268 Q50 :166 1st Qu.: 7.00 M:329 1st Qu.: 8.000
hd1104 : 1 Q92 : 51 Median : 48.00 Median :12.143
hd1107 : 1 WT :255 Mean : 57.93 Mean :12.138
hd1109 : 1 3rd Qu.:113.00 3rd Qu.:16.286
hd1110 : 1 Max. :154.00 Max. :20.571
(Other):638
body_weight brain_weight
Min. : 0.00 Min. :0.0000
1st Qu.:21.12 1st Qu.:0.4580
Median :25.65 Median :0.4900
Mean :27.57 Mean :0.4925
3rd Qu.:33.00 3rd Qu.:0.5353
Max. :59.00 Max. :0.6660
NA's :12

Recode the zeros to missing data as follows. You can attach() the data frame so that you can read individual variables by name rather than using the $ format. Be careful, however: assigning to an attached variable does not change the data frame; it creates a separate copy of the variable in your workspace. To change the data frame itself, refer to the variables with the $ format. The following commands assign a missing value code (NA) to every mouse whose brain weight or body weight is equal to zero. Remember that we check for equality in R by using two equals signs (==). Note the square brackets, which are used as an index to instruct R to locate all the weights of zero and reassign NA to them.

> mouseWeights$brain_weight[mouseWeights$brain_weight == 0] <- NA
> mouseWeights$body_weight[mouseWeights$body_weight == 0] <- NA
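Here is a small, self-contained sketch of recoding zeros to NA in a data frame and then confirming the result. The data frame mice and its values are invented for illustration; they are not the Mouse_Weights.txt data. The recoding is applied to the data frame's own columns so that the change is stored in the data frame rather than in a separate copy.

```r
# A made-up data frame with two impossible zero weights and one existing NA.
mice <- data.frame(mouse_id     = c("m1", "m2", "m3", "m4"),
                   body_weight  = c(11.18, 0, 13.30, NA),
                   brain_weight = c(0.380, 0.434, 0, 0.426))

# Recode every zero in the weight columns to NA.
weight_cols <- c("body_weight", "brain_weight")
for (col in weight_cols) {
  is_zero <- !is.na(mice[[col]]) & mice[[col]] == 0   # TRUE only for genuine zeros
  mice[[col]][is_zero] <- NA
}

colSums(is.na(mice))      # number of missing values in each column
summary(mice$body_weight) # the minimum is no longer zero
```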

Now, just to illustrate the R Data Editor, say you want to simplify the variable names a little further to make the code more compact. To edit the data frame from inside R, simply enter fix(mouseWeights). The R Data Editor opens in the R GUI (see Figure 2-1).

Figure 2-1. The R Data Editor opens in the R GUI

The Data Editor is a simple spreadsheet-like view of your data frame. Make any needed changes, and then when you close the Data Editor, the changes are saved. Here is the newly named set of variables:

> head(mouseWeights)
ID strain CAGnom CAGscale sex age bodyWt brainWt
1 hd1769 B6 Q111 113 F 3.7143 11.18 0.380
2 hd1777 B6 Q111 137 F 4.0000 12.50 0.434
3 hd1778 B6 WT 7 F 4.0000 13.30 0.406
4 hd1782 B6 Q111 136 F 4.0000 11.66 0.426
5 hd1806 B6 WT 7 M 4.0000 14.33 0.464
6 hd1808 B6 Q111 113 M 4.0000 13.72 0.414
>
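The Data Editor requires a graphical session and leaves no record of what you changed. If you want the renaming to be reproducible in a script, one alternative is to assign to names(). The sketch below uses a two-row stand-in for the mouse data and a lookup vector, new_names, that I made up for the example.

```r
# A two-row stand-in for the mouse data (only three of the columns).
mice <- data.frame(mouse_id     = c("hd1769", "hd1777"),
                   body_weight  = c(11.18, 12.50),
                   brain_weight = c(0.380, 0.434))

# Lookup vector: old name on the left, new name on the right.
new_names <- c(mouse_id = "ID", body_weight = "bodyWt", brain_weight = "brainWt")

# Replace each column name with its entry in the lookup vector.
# A column missing from new_names would become NA, so list every column.
names(mice) <- unname(new_names[names(mice)])
names(mice)
```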

Finally, tidy things up a bit. If you attached the mouseWeights data frame, use the detach() function to “unattach” it. Remove any unneeded objects by using the rm() function, and save the workspace image if you plan to work again with the objects and data you used in this session.

Recipe 2-3. Dealing with Text Data

Problem

We are dealing with increasing volumes of text data. Text mining has become an important area of research and innovation, as well as a lucrative one. For our purposes, we define text data as data consisting mostly of characters and words. Text data are typically formatted in lines and paragraphs for human beings to read and understand.

Qualitative researchers treat textual material the same way quantitative researchers treat numbers. Qualitative researchers describe text data, look for relationships and differences, and examine patterns and classifications. There is a growing trend toward combining these methods into a mixed-method research approach.

Solution

Consider Plastic Omnium’s environmental policy, which states:

Plastic Omnium maintains a proactive environmental protection policy at the highest levels of the company worldwide. It not only ensures compliance with the legal requirements in effect in the countries where Plastic Omnium is present, but in the cases where there are no such requirements or where the company deems the existing requirements inadequate, Plastic Omnium develops and implements its own rules and ensures that they are followed. Every employee involved in an environment-related activity – such as measuring, recordkeeping, composing a report about an action or situation with consequences for the environment, or handling hazardous products or hazardous waste – must take care to perform his or her activities in strict compliance with the laws in effect and only after having received the necessary prior authorizations.

Everyone must ensure that the rules developed by Plastic Omnium are properly applied and will ensure that reports concerning events or situations related to environmental protection are accurate and complete. An employee who is aware of an event or situation within the company, which could result in pollution to the environment, has the duty to take immediate action to bring the matter to the attention of his or her direct supervisor or go directly to the Group’s Human Resources Department.

—Source: www.plasticomnium.com/en/

Microsoft Word has very rudimentary text analysis tools. We can count the number of words in the policy (there are 205). However, beyond spell checking and grammar checking, there’s not too much else we can do using a word processor. R opens up a host of new possibilities.
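Even base R can go beyond a simple word count. As a rough sketch, the following code splits the first sentence of the policy on white space, counts the words, and tabulates how often each one occurs. Splitting on white space is a simplifying assumption: punctuation stays attached to words, so "worldwide." counts as a word.

```r
# The first sentence of the policy, stored as two lines of text.
policy <- c("Plastic Omnium maintains a proactive environmental protection policy",
            "at the highest levels of the company worldwide.")

# Split each line on runs of white space and flatten the result to one vector.
words <- unlist(strsplit(policy, "[[:space:]]+"))
words <- words[nzchar(words)]        # drop any empty strings

length(words)                        # the word count

# Frequency table of lowercased words, most frequent first.
word_freq <- sort(table(tolower(words)), decreasing = TRUE)
head(word_freq, 3)
```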

To do serious text mining in R, you should install the tm package. This topic will be addressed in Chapter 14, but for the present, let’s just see how to read the text file into R. I saved the policy as a plain-text file with line feeds only.

> Omni <- readLines("Plastic_Omni_Environ_Policy.txt")
> Omni
[1] "Plastic Omnium maintains a proactive environmental protection policy at the highest levels of "
[2] "the company worldwide. It not only ensures compliance with the legal requirements in effect in "
[3] "the countries where Plastic Omnium is present, but in the cases where there are no such "
[4] "requirements or where the company deems the existing requirements inadequate, Plastic "
[5] "Omnium develops and implements its own rules and ensures that they are followed. Every "
[6] "employee involved in an environment-related activity – such as measuring, recordkeeping, "
[7] "composing a report about an action or situation with consequences for the environment, or "
[8] "handling hazardous products or hazardous waste – must take care to perform his or her "
[9] "activities in strict compliance with the laws in effect and only after having received the "
[10] "necessary prior authorizations."
[11] ""
[12] "Everyone must ensure that the rules developed by Plastic Omnium are properly applied and will "
[13] "ensure that reports concerning events or situations related to environmental protection are "
[14] "accurate and complete. "
[15] ""
[16] "An employee who is aware of an event or situation within the company, which could result in "
[17] "pollution to the environment, has the duty to take immediate action to bring the matter to the "
[18] "attention of his or her direct supervisor or go directly to the Group’s Human Resources "
[19] "Department."
[20] ""

We use the readLines() function to read in a text file all at once or one line at a time. What is returned is a character vector with one element per line read. The preceding example reads in a whole file, but if we would rather read in a line at a time, we will have to establish a connection. In this case, we will use a connection for file access. You can create a connection with functions such as file() and url(), among others. To see which functions can be used to establish connections, type ?connections at the command prompt. The second argument, "r", means that we have opened the file for reading. We tell R to read in the lines one at a time by setting the argument n to 1.

> connection <- file("Plastic_Omni_Environ_Policy.txt", "r")
> readLines(connection, n = 1)
[1] "Plastic Omnium maintains a proactive environmental protection policy at the highest levels of "
> readLines(connection, n = 1)
[1] "the company worldwide. It not only ensures compliance with the legal requirements in effect in "
>
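A connection that you open should also be closed when you are finished with it. The following sketch reads a small temporary file (a stand-in for the policy file) one line at a time until the end of the file is reached, and then closes the connection. The three lines of text are placeholders.

```r
# Create a small text file to read (placeholder contents).
txt_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), txt_file)

con <- file(txt_file, "r")           # open the file for reading
line_count <- 0
repeat {
  line <- readLines(con, n = 1)      # read the next line only
  if (length(line) == 0) break       # character(0) signals the end of the file
  line_count <- line_count + 1
  cat(line_count, ":", line, "\n")
}
close(con)                           # release the connection
```

Because the connection stays open between calls, each call to readLines() picks up where the previous one stopped.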

Recipe 2-4. Getting Data from the Internet

Problem

Many datasets are located in repositories on the Internet. There are datasets like the GSS data we have discussed, and literally thousands more web-hosted datasets in economics, data science, finance, government data for the United States and many other countries, health care, machine learning, and various university data repositories. The problem is not so much that we don’t have enough data, but instead the problem is that we don’t know how to access the right data.

Recipe 2-3 covered how to use a connection to read in a data file line by line. We can also establish a connection to a URL. This makes it possible to read in data from that particular source. The url type of connection supports http://, ftp://, and file://. For additional information, type ?connections at the R command prompt to see the documentation for connections.

Solution

Recipe 2-3 describes how you can simply copy and paste information from the Internet into a text document and read it into R. This recipe shows you how to use the scan() function to import a data file directly. The scan() function, unlike the read.table() function, returns a list or a vector. This makes it easy to read a text file from the Internet. For example, the Institute for Digital Research and Education (IDRE) at UCLA provides excellent R tutorials and example data. Let us read in the file scan.txt from the IDRE web site. We tell R that we want to read the text file into a list with the what argument.

> (x <- scan("http://www.ats.ucla.edu/stat/data/scan.txt", what = list(age = 0,
+ name = "")))
Read 4 records
$age
[1] 12 24 35 20

$name
[1] "bobby" "kate" "david" "michael"
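If the web address is unavailable, you can try the same call on a local file. In the sketch below, I assume the records are laid out as an age followed by a name on each line; that layout is a guess consistent with the output above, not a copy of the actual scan.txt file.

```r
# Local stand-in for scan.txt (assumed layout: age, then name, on each line).
rec_file <- tempfile(fileext = ".txt")
writeLines(c("12 bobby", "24 kate", "35 david", "20 michael"), rec_file)

# what = list(...) names the fields and gives an example of each type:
# 0 for numeric and "" for character.
x <- scan(rec_file, what = list(age = 0, name = ""))
x$age

# The list converts directly to a data frame if you prefer that structure.
people <- as.data.frame(x, stringsAsFactors = FALSE)
people
```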

The read.table() function allows the user to read in any kind of delimited ASCII file. Here’s another example from IDRE. In this case, we read in a text file and specify there is a row of column headings by setting the header argument to TRUE.

> (test <- read.table("http://www.ats.ucla.edu/stat/data/test.txt", header = TRUE))
prgtype gender id ses schtyp level
1 general 0 70 4 1 1
2 vocati 1 121 4 2 1
3 general 0 86 4 3 1
4 vocati 0 141 4 3 1
5 academic 0 172 4 2 1
6 academic 0 113 4 2 1
7 general 0 50 3 2 1
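You can also create the URL connection yourself with url() and pass it to read.table(). To keep the sketch runnable without Internet access, it uses a file:// URL that points to a temporary file holding the first three columns of the first two rows shown above. The way the URL is assembled from the file path is my own construction.

```r
# Local stand-in holding part of the first two rows shown above.
data_file <- tempfile(fileext = ".txt")
writeLines(c("prgtype gender id", "general 0 70", "vocati 1 121"), data_file)

# Build a file:// URL from the path (works for Unix-style and Windows paths).
clean_path <- normalizePath(data_file, winslash = "/")
file_url   <- paste0("file:///", sub("^/", "", clean_path))

# read.table() opens an unopened connection and closes it when it is done.
con  <- url(file_url)
test <- read.table(con, header = TRUE, stringsAsFactors = FALSE)
test
```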