
Chapter 5. R and Databases

We noted earlier that a dataset can be stored in various formats using different software, including relational databases. Large-scale datasets are usually stored in database software, and in data mining and statistical learning we often need to process such datasets. One of the major limitations of R is its memory usage: R keeps all of its objects in RAM, so a dataset has to be considerably smaller than the available RAM for us to work with it. As a result, R struggles with datasets that approach or exceed the size of memory.

This chapter introduces how to deal with datasets that are bigger than the computer's memory by interacting with database software. In the first few sections, we describe how to connect to database software through Open Database Connectivity (ODBC) and import datasets into R. The chapter starts with an example of the memory issues, then illustrates ODBC using MS Excel and MS Access, and then covers handling large datasets with specialized contributed R packages. It ends with an introduction to data manipulation using SQL through the sqldf package.

The following are two examples demonstrating memory problems in R:

· The following example demonstrates the memory limitation of a computer system. R stores everything in RAM, and a typical personal computer has a limited amount of RAM (how much of it R can address depends on whether the operating system is 32-bit or 64-bit).

# Trying to create a vector of zeros with length 2^31-1.
# Note that the RAM of the computer on which we are generating
# this example is 8 GB, with 64-bit Windows 7
# Professional edition. Processor: core i5.

x <- rep(0, 2^31-1)
Error: cannot allocate vector of size 16.0 Gb
In addition: Warning messages:
1: Reached total allocation of 8078Mb: see help(memory.size)
2: Reached total allocation of 8078Mb: see help(memory.size)
3: Reached total allocation of 8078Mb: see help(memory.size)
4: Reached total allocation of 8078Mb: see help(memory.size)

· The preceding example clarifies that R cannot allocate a vector whose size is larger than the RAM. Now we will see another example, this one related to the maximally addressable range of different types of numbers. The maximum addressable range for integers is 2^31-1.

# Maximum addressable range of an integer vector
as.integer(2^31-1)
[1] 2147483647

# If we try to coerce a number greater than the
# maximum addressable range, it will produce NA

as.integer(2^31)
[1] NA
Warning message:
NAs introduced by coercion

The topic of database administration is beyond the scope of this book, but we can easily discuss connectivity with databases using R.

R and different databases

Before going on to discuss large-scale data handling using R, we will discuss how R can interact with database software through ODBC. There are two principal ways to connect to a database: the first uses the ODBC facility available on many computers, and the second uses R's DBI package along with a specialized package for the particular database that needs to be accessed. If a specialized package is available for a database, we may find that the corresponding DBI-based package gives better performance than the ODBC approach. On the other hand, if a database does not have a specialized package, ODBC may be the only option.
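As an illustration of the DBI route, the following sketch uses the RSQLite backend, chosen here only because it needs no separate server; the file name mydata.sqlite and the table name mytable are hypothetical. The same connect, read, and disconnect pattern applies to the other DBI-based packages:

# A minimal sketch of the DBI approach, illustrated with RSQLite;
# "mydata.sqlite" and "mytable" are hypothetical names
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mydata.sqlite")
dbListTables(con)                      # list the tables in the database
mydata <- dbReadTable(con, "mytable")  # import a table as a data frame
dbDisconnect(con)                      # close the connection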

R and Excel

An Excel file can be imported into R using ODBC. We will now create an ODBC connection with an MS Excel file using the data source name xlopen.

To create an ODBC connection string with an MS Excel file, we need to open the control panel of the operating system and then open Administrative Tools and then choose ODBC. A dialog box will now appear. Click on the Add... button and select an appropriate ODBC driver and then locate the desired file and give a data source name. In our case, the data source name is xlopen. The name of the Excel file can be anything, and in our case the file name is xlsxanscombe.xlsx. The following R code will import the corresponding Excel file into the R environment:

# calling the RODBC library into R
library(RODBC)
# creating a connection with the database using the RODBC package.
# We created the connection following the steps outlined in the
# preceding paragraph
xldb <- odbcConnect("xlopen")
# In the odbcConnect() function, the minimum argument required
# is the ODBC data source name.
# Now that the connection is created, we will use it to import the data
xldata <- sqlFetch(xldb, "CSVanscombe")
# Note here that "CSVanscombe" is the Excel worksheet name.

We can use other packages to import an Excel file, but R also has the facility to import data using the ODBC approach. To use the ODBC approach with an Excel file, we first need to create the data source through the ODBC Data Source Administrator, as outlined in the preceding paragraph; a full treatment of ODBC administration is beyond the scope of this book.
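Once the data has been imported, it is good practice to close the connection. As a short sketch using two further RODBC functions, sqlTables() lists the tables (worksheets) visible through the connection and odbcClose() closes it:

# List the tables (worksheets) visible through the ODBC connection
sqlTables(xldb)
# Close the connection once the data has been imported
odbcClose(xldb)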

R and MS Access

To import data from an MS Access database, the procedure is the same as with Excel. First, we need to create a data source using the ODBC Data Source Administrator and then connect to the database from R using the RODBC package.

Let us consider an Access database containing three different tables: coverpage, questionnaire1, and questionnaire2. The data source name used to access this database is accessdata. The following commands can be used to import the three tables as separate data frames in R:

# calling the RODBC library
library(RODBC)
# connecting with the database
access_con <- odbcConnect("accessdata")
# import each table as a separate R data frame
coverage_page <- sqlFetch(access_con, "coverpage")
ques1 <- sqlFetch(access_con, "questionnaire1")
ques2 <- sqlFetch(access_con, "questionnaire2")
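Instead of importing a whole table with sqlFetch(), we can also send a query to Access and import only the rows we need with the RODBC sqlQuery() function. The following is a sketch; the column name respondent_id used in the condition is hypothetical and should be replaced with a real column of questionnaire1:

# Import only selected records by sending a SQL query to Access;
# "respondent_id" is a hypothetical column name used for illustration
ques1_subset <- sqlQuery(access_con,
                         "select * from questionnaire1 where respondent_id < 100")
# Close the connection when we are done
odbcClose(access_con)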

Using MS Excel and MS Access, we can deal with fairly large datasets, but sometimes a dataset is so large that handling it with Excel or Access becomes difficult. Moreover, Excel cannot deal with relational databases. To overcome these limitations, R has further functionality, which we will discuss in the following sections.

Relational databases in R

In this section, we will try to provide a concise overview of different packages in R for handling massive data and illustrate some of them.

A popular approach to dealing with bigger datasets is the use of SQL, a separate programming language. Learning another language is not necessarily difficult, but since we are working in R, the R community has developed specialized packages for dealing with large datasets. These contributed packages create interfaces between R and relational database management systems such as MySQL (RMySQL), PostgreSQL (RPgSQL), and Oracle (ROracle). To get the full benefit of these specialized packages, we have to install the corresponding third-party software. One of the most popular of these packages is RMySQL, which allows us to make connections between R and a MySQL server.

MySQL is a mid-size, multi-platform RDBMS that is popular in the open source community. Its advantages include high performance, being open source, and being free for non-commercial use. In order to use RMySQL properly, we need to install both the MySQL server and the RMySQL package.
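A typical RMySQL session follows the same DBI pattern described earlier. The following is a minimal sketch, assuming a running MySQL server; the user, password, database, and table names are hypothetical placeholders:

# A minimal sketch of connecting R to a MySQL server using RMySQL;
# all connection details below are hypothetical placeholders
library(RMySQL)

mysql_con <- dbConnect(MySQL(), user = "ruser", password = "secret",
                       dbname = "surveydb", host = "localhost")
dbListTables(mysql_con)                                # tables in the database
survey <- dbGetQuery(mysql_con, "select * from respondents limit 100")
dbDisconnect(mysql_con)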

There are several R packages available that allow direct interactions with large datasets within R, such as filehash, ff, and bigmemory. The idea is to avoid loading the whole dataset into memory.

The filehash package

The filehash package, which is used for solving large-data problems, was contributed by Roger Peng (see the paper Interacting with Data using the filehash Package for R, available at http://cran.r-project.org/web/packages/filehash/vignettes/filehash.pdf). The idea behind this package is to avoid loading the dataset into the computer's memory. Instead, we dump the large dataset onto the hard drive and assign an environment name to the dumped objects. Once a dataset is dumped onto the hard drive, we can access the data through the assigned environment. In this way, we can deal with larger datasets without exhausting the computer's memory, while still manipulating the data conveniently. We will now discuss the basic steps of using this package through some examples.

Firstly, we create a database that can be accessed later on. To create a database, we use the dbCreate function; the database then needs to be initialized (via dbInit) in order to be accessed, as shown in the following code. The dbInit function returns an S4 object that inherits from the filehash class.

library(filehash)
dbCreate("exampledb")
filehash_db <- dbInit("exampledb")

The primary interface to filehash databases consists of the functions dbFetch, dbInsert, dbExists, dbList, and dbDelete. All of these functions are generic, and specific methods exist for the database backend in use. The first argument taken by the functions in this package is an object of the filehash class. To insert some data into the database, we simply call dbInsert; we retrieve values with dbFetch, as shown in the following code:

dbInsert(filehash_db, "xx", rnorm(50))

value<- dbFetch(filehash_db, "xx")

summary(value)

The dbList function lists all of the keys that are available in the database, the dbExists function tests to see if a given key is in the database, and the dbDelete function deletes a key-value pair from the database, as shown in the following code:

dbInsert(filehash_db, "y", 4709)

dbDelete(filehash_db, "xx")

dbList(filehash_db)

dbExists(filehash_db, "xx")

There is another very useful command, dbLoad(), that works in a similar way to the attach() function; with the filehash package, the objects are attached but remain stored on the local hard disk. We may also access the objects in the filehash database using the standard R subset and accessor functions such as $, [[, and [, as shown in the following code:

filehash_db$x <- runif(100)
summary(filehash_db$x)
summary(filehash_db[["x"]])
filehash_db$y <- rnorm(100, 2)
dbList(filehash_db)
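As a minimal sketch of dbLoad(), the following loads the keys of the database created above ("x" and "y") into the workspace as lazily evaluated objects, so they can be used by name while the values themselves stay on disk; make sure the key names do not clash with existing variables:

# Attach the filehash database: its keys become available by name,
# but the data stays on the hard disk until it is actually used
dbLoad(filehash_db)
summary(x)
summary(y)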

When a database is initialized using the default DB1 format, filehash opens a file connection for reading and writing to the database file on disk. This file connection remains open until the database is closed via dbDisconnect or the database object in R is removed. There is a limit on the number of file connections that can be open at the same time, so to protect a database from unexpected results, we need to make sure that file connections are closed properly.
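A minimal sketch of closing the database opened above; the data remains on disk, so the database can be reinitialized later with dbInit():

# Close the file connection associated with the DB1 database;
# the data stays on disk and can be reopened later with dbInit()
dbDisconnect(filehash_db)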

The filehash package also includes some utilities similar to save.image in base R; two of them are dumpObjects and dumpImage. The dumpObjects utility saves an object directly into a filehash database so that it can be accessed later if required, without having to keep a copy in the R workspace. Similarly, dumpImage saves the entire workspace to a filehash database. The dumpList function takes a list and creates a filehash database with values from the list; the list must have a non-empty name for every element in order for dumpList to succeed. The dumpDF utility creates a filehash database from a data frame, where each column of the data frame becomes an element in the database. Essentially, dumpDF converts the data frame to a list and then calls dumpList. The following example shows how we can use dumpDF:

dumpDF(read.table("anscombe.txt", header=T), dbName="massivedata")
massive_environment <- db2env(db="massivedata")

The first argument of dumpDF() is a data object. R reads the data inside the call to dumpDF(), so no copy of it is kept in memory. Space saved! The dataset stored in anscombe.txt can now be accessed through the massive_environment environment. To access it, we use with(). Suppose we want to perform a linear regression of Y1 on X1 and access the data using the variable names. If we had assigned an object name to the read.table command, memory would hold a copy of the data, which is not desirable. Using the with() function, we can fit a model or compute summary statistics as usual, as follows:

fit <- with(massive_environment, lm(Y1~X1))
with(massive_environment, summary(Y1))
with(massive_environment, Y1[1] <- 99)

The ff package

As we have seen in the example in the introductory section of this chapter, R can only address objects that fit within the limits of its RAM, and a single vector cannot exceed the maximally addressable length of 2^31-1. To overcome this limitation, Adler and Glaser developed the ff package in 2010. This package extends the R system and stores data in the form of native binary flat files in persistent storage such as hard disks, CDs, or DVDs rather than in RAM. It enables users to work on several large datasets simultaneously and allows the allocation of vectors or arrays that are larger than the RAM. The package comprises two parts: a low-level layer written in C++ and a high-level layer in R. The package is designed for convenient access to large datasets.

As users will only deal with the high-level layer, the following are the tasks we do in this layer:

· Opening/creating flat files: There are two basic functions, ff and ffm, to deal with opening and creating flat files. If we specify the length argument or the dim argument, a new file is created, otherwise R will open an existing file.

· I/O operations: These operations are controlled by the [ (for reading) and [<- (for writing) operators.

· Generic functions and methods for the ff and ffm objects: Methods for dim and length are provided and the sample function is converted to a generic function.

The functions ff and ffm require a file name, supplied via the filename argument, to specify the flat file. Whenever length (for ff) or dim (for ffm) is specified, as shown in the following code, a new flat file is created; otherwise, an existing file is opened:

# A flat file with length 10 is created
library(ff)
file1 <- ff(filename="file1", length=10, vmode="double")

str(file1)

list()

- attr(*, "physical")=Class 'ff_pointer' <externalptr>

..- attr(*, "vmode")= chr "double"

..- attr(*, "maxlength")= int 10

..- attr(*, "pattern")= chr "/"

..- attr(*, "filename")= chr "D:/Book on R/Writing/outline/data_ch2/file1"

..- attr(*, "pagesize")= int 65536

..- attr(*, "finalizer")= chr "close"

..- attr(*, "finonexit")= logi TRUE

..- attr(*, "readonly")= logi FALSE

..- attr(*, "caching")= chr "mmnoflush"

- attr(*, "virtual")= list()

..- attr(*, "Length")= int 10

..- attr(*, "Symmetric")= logi FALSE

- attr(*, "class") = chr [1:2] "ff_vector" "ff"

The entries of file1 can be modified with the []<- operator. For example, the first 10 entries of the rivers dataset, which contains the lengths of 141 major rivers in North America, can be stored in the ff object, as shown in the following code:

# calling rivers data

data(rivers)

file1[1:10] <- rivers[1:10]

# Note that here file1 is an ff object whereas

# file1[...] returns default R vector

str(file1)

If required, we can perform sampling on the ff objects as follows:

# set seed to reproduce the example

set.seed(1337)

sample(file1, 5, replace=FALSE)

[1] 735 392 524 450 600

Flat files are referenced from R objects via external pointers. In order to clear the references, the garbage collector, gc(), can be used as follows:

gc()

Calling gc() clears the reference to the file, but does not delete the file from the hard drive. Since the data is still present, the flat file can be opened again at a later stage.
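As a minimal sketch, assuming the flat file file1 created above is still on disk, it can be reopened by calling ff() with the filename and the vmode it was created with; because length= is omitted, the existing file is opened rather than a new one created:

# Reopen the existing flat file created earlier; omitting length=
# makes ff() open the file instead of creating a new one
file1 <- ff(filename="file1", vmode="double")
file1[1:10]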

R and sqldf

The sqldf package is an R package that allows users to run SQL statements within R. SQL is a popular language for manipulating data in relational databases, and the sqldf package creates an opportunity to work with SQL statements directly on an R data frame. With this package, the user can do the following tasks easily:

· Write alternate syntax for data frame manipulation, particularly for purposes of faster processing, since using sqldf (with SQLite as the underlying database) is often faster compared to performing the same manipulations in built-in R functions

· Read portions of large files into R without reading the entire file

The user need not perform the following tasks once they use sqldf because these are automatically done:

· Database setup

· Writing the create table statement, which defines each table

· Importing and exporting to and from the database

· The coercing of the returned columns to the appropriate class in common cases

Data manipulation using sqldf

We can perform any type of data manipulation on an R data frame, either in memory or during import. The following example shows the selection of a portion of the iris dataset using the sqldf package:

# Selecting the rows from the iris dataset where sepal width > 3
# and storing them in the subiris data frame
library(sqldf)
subiris <- sqldf("select * from iris where Sepal_Width > 3")

head(subiris)

Sepal_Length Sepal_Width Petal_Length Petal_Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.7 3.2 1.3 0.2 setosa

3 4.6 3.1 1.5 0.2 setosa

4 5.0 3.6 1.4 0.2 setosa

5 5.4 3.9 1.7 0.4 setosa

6 4.6 3.4 1.4 0.3 setosa

nrow(subiris)

[1] 67

We can also select a smaller number of columns while filtering out some of the rows with a specified condition. The following example selects only sepal length, petal length, and species; however, this time, rows are filtered by values for petal length greater than 1.4:

subiris2 <- sqldf("select Sepal_Length, Petal_Length, Species from iris where Petal_Length > 1.4")

nrow(subiris2)

[1] 126

head(subiris2)

Sepal_Length Petal_Length Species

1 4.6 1.5 setosa

2 5.4 1.7 setosa

3 5.0 1.5 setosa

4 4.9 1.5 setosa

5 5.4 1.5 setosa

6 4.8 1.6 setosa

If a dataset is too large to be read into the R environment in its entirety, we can import a portion of it using sqldf. The following example shows how we can import a portion of a CSV file using the sqldf functionality. We will use the read.csv.sql() function to perform this task. This is an interface to sqldf that works like read.csv in R, except that it also provides a sql= argument; not all of the other arguments of read.csv are supported.

In the following example, we will import the iris.csv file. We will import only sepal width and petal width along with the species information where petal width is greater than 0.4:

iriscsv <- read.csv.sql("iris.csv", sql="select Sepal_Width, Petal_Width, Species from file where Petal_Width > 0.4")

head(iriscsv)

Sepal_Width Petal_Width Species

1 3.3 0.5 "setosa"

2 3.5 0.6 "setosa"

3 3.2 1.4 "versicolor"

4 3.2 1.5 "versicolor"

5 3.1 1.5 "versicolor"

6 2.3 1.3 "versicolor"

An important thing to note is that in the original iris.csv file the variable names are dot separated, but in the SQL statement we need to use underscores in the variable names; otherwise, it will produce an error, as follows:

iriscsv<-read.csv.sql("iris.csv",sql="select Sepal.Width,Petal.Width,Species from file where Petal.Width>0.4")

Error in sqliteExecStatement(con, statement, bind.data) :

RS-DBI driver: (error in statement: no such column: Sepal.Width)

We sometimes need to draw a random sample from a dataset but the original data file might be too large. In the following example, we will show how we can draw a random sample size of 10 from the iris data that is stored in the iris.csv file:

iris_sample <- read.csv.sql("iris.csv", sql="select * from file order by random(*) limit 10")

iris_sample

Sepal_Length Sepal_Width Petal_Length Petal_Width Species

1 6.5 3.0 5.2 2.0 "virginica"

2 5.0 3.5 1.3 0.3 "setosa"

3 6.0 2.2 4.0 1.0 "versicolor"

4 6.9 3.1 5.4 2.1 "virginica"

5 6.2 2.8 4.8 1.8 "virginica"

6 5.1 3.8 1.9 0.4 "setosa"

7 5.8 2.6 4.0 1.2 "versicolor"

8 5.9 3.2 4.8 1.8 "versicolor"

9 6.4 2.9 4.3 1.3 "versicolor"

10 6.4 3.1 5.5 1.8 "virginica"

We can perform group-wise processing and aggregation using sqldf, which is often a faster alternative to the aggregate function. For example, if we want to calculate the mean of each variable in the iris data for each species, the code is as follows:

# Calculate group-wise means from the iris data
iris_avg <- sqldf("select Species, avg(Sepal_Length), avg(Sepal_Width), avg(Petal_Length), avg(Petal_Width) from iris group by Species")

colnames(iris_avg) <- c("Species","Sepal_L","Sepal_W","Petal_L","Petal_W")

iris_avg

Species Sepal_L Sepal_W Petal_L Petal_W

1 setosa 5.006 3.428 1.462 0.246

2 versicolor 5.936 2.770 4.260 1.326

3 virginica 6.588 2.974 5.552 2.026

The base R counterpart for performing the same operation is as follows:

aggregate(iris[, -5], list(iris$Species), mean)

Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width

1 setosa 5.006 3.428 1.462 0.246

2 versicolor 5.936 2.770 4.260 1.326

3 virginica 6.588 2.974 5.552 2.026

Though both approaches give the same results, for larger datasets sqldf is often much faster than the base R alternative.
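The difference is mostly visible on larger data. As a rough sketch, we can simulate a data frame with one million rows and compare the two approaches with system.time(); the names big, grp, and val are made up for this illustration, and the actual timings depend on the machine:

# Compare sqldf and aggregate() on a simulated one-million-row data frame
# with ten groups; timings are machine dependent
library(sqldf)
set.seed(1337)
big <- data.frame(grp = sample(letters[1:10], 1e6, replace = TRUE),
                  val = rnorm(1e6))
system.time(sqldf("select grp, avg(val) from big group by grp"))
system.time(aggregate(big$val, list(big$grp), mean))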

Summary

At the beginning of this chapter, we showed how to treat an MS Excel file as a database and how MS Access database tables can be imported into R. One of the major problems in R is that its workspace is bounded by the system's memory, which is why a dataset has to be smaller than the available memory for us to be able to work with it. In reality, datasets are often larger than that, and sometimes the length of an array or vector exceeds the maximum addressable range. To overcome these two limitations, R can be used together with relational databases. Contributed R packages exist to help in dealing with such large datasets, and they have been highlighted in this chapter, particularly filehash and ff. We also discussed sqldf for faster data manipulation.