R Recipes: A Problem-Solution Approach (2014)
Chapter 5. Working with Dates and Strings
In Chapter 5, you will learn how to work with dates and strings.
Recipe 5-1. Working with Dates and Times
Problem
When you import date and time data into R, these are not recognized automatically. To work with them, you must convert dates and times to the proper format.
Solution
The system’s idea of the current date is returned by the Sys.Date() function. You can retrieve the current date with the time by using the Sys.time() function. Examine the following examples.
> Sys.time()
[1] "2014-06-03 13:26:35.06454 EDT"
> ## locale-specific version of date()
> format(Sys.time(), "%a %b %d %X %Y")
[1] "Tue Jun 03 1:26:35 PM 2014"
>
> Sys.Date()
[1] "2014-06-03"
>
> Sys.timezone()
[1] "America/New_York"
In R, the default format for dates is the four-digit year, followed by the month, and then the day. These can be separated by slashes or dashes, and the string must be converted using the as.Date() function. R provides three date and date-time classes: Date, POSIXct, and POSIXlt. The current date and time returned by Sys.time() is in POSIXct format, and this is generally the best alternative.
Here are some examples.
> as.Date("1952-5-30")
[1] "1952-05-30"
> as.Date("1952/10/28")
[1] "1952-10-28"
Many programs, such as Microsoft Excel, use the format month/day/year rather than year/month/day. To deal with this situation, you can create a format string using any of the following date codes (see Table 5-1).
Table 5-1. R Format Codes for Various Date Values
Code | Value
%d | Day of the month (decimal number)
%m | Month (decimal number)
%b | Month (abbreviated)
%B | Month (full name)
%y | Year (two digits)
%Y | Year (four digits)
For example, to convert the character string “6/1/2014” to a date in R, create the format string "%m/%d/%Y" to achieve the desired result.
> today <- as.Date("6/1/2014", format="%m/%d/%Y")
> class(today)
[1] "Date"
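The codes in Table 5-1 work in both directions: as.Date() uses them to read a string, and format() uses them to write a Date back out as text. The following is a minimal sketch; the input strings and variable names are made up for illustration.

```r
# Parse a U.S.-style date with a two-digit year.
excel_date <- as.Date("06/01/14", format = "%m/%d/%y")
excel_date                          # printed in the default year-month-day layout

# format() turns a Date back into a character string in any layout.
iso_text <- format(excel_date, "%Y-%m-%d")
eu_text  <- format(excel_date, "%d.%m.%Y")
iso_text
eu_text

# A format string that does not match the input yields NA, not an error.
bad <- as.Date("2014-06-01", format = "%m/%d/%Y")
is.na(bad)
```

Checking for NA after a conversion is a cheap way to catch rows whose dates did not follow the expected layout.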
Date objects are stored internally as the number of days since January 1, 1970. Earlier dates are represented by negative numbers. You can convert a Date object to this internal form by using the as.numeric() function. You can also use the weekdays() and months() functions to extract the desired components of a date. For example, statistician John Tukey was born on 6/16/1915, and R. A. Fisher was born on 2/17/1890:
> StatBdays <- c(tukey = as.Date("1915-06-16"), fisher = as.Date("1890-02-17"))
> StatBdays
tukey fisher
"1915-06-16" "1890-02-17"
> weekdays(StatBdays)
tukey fisher
"Wednesday" "Monday"
> months(StatBdays)
tukey fisher
"June" "February"
You can perform arithmetic with dates. For example, to determine the age in days of a person born on June 3, 2000, first assign today’s date to the variable today. Then assign June 3, 2000, to the variable then. Finally, subtract the two dates:
> today <- Sys.Date()
> today
[1] "2014-06-27"
> then <- as.Date("2000/6/3")
> then
[1] "2000-06-03"
> howLong <- today - then
> howLong
Time difference of 5137 days
You can make this a little more generic by using the system date instead of creating a date variable for the current day:
> Sys.Date() - then
Time difference of 5137 days
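Date subtraction works because each Date is just a count of days from January 1, 1970. The sketch below uses fixed dates (so the results do not depend on when it is run) to show the day counts, the difference as a plain number, and a few related operations. The variable names are illustrative.

```r
# Day counts relative to the origin, 1970-01-01
as.numeric(as.Date("1970-01-01"))   # 0: the origin itself
as.numeric(as.Date("1969-12-31"))   # -1: one day before the origin

then  <- as.Date("2000-06-03")
later <- as.Date("2014-06-27")

gap <- later - then            # an object of class "difftime"
as.numeric(gap)                # the bare number of days: 5137

# difftime() lets you choose the units explicitly
difftime(later, then, units = "weeks")

# Adding a number to a Date moves it forward by that many days
later + 30                     # "2014-07-27"

# seq() steps through dates; here, the first day of three consecutive months
seq(as.Date("2014-01-01"), by = "month", length.out = 3)
```

Converting the difference with as.numeric() is useful when the day count feeds into further arithmetic.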
Here is a more meaningful use of date calculations. Suppose we invest $1,000 on January 1, 2015, and want to know how much money we will have on May 22, 2015, if we can earn simple interest of 0.05% per day. The as.integer() function converts the dates to their internal day counts, so subtracting one from the other gives the number of days elapsed. I wrote a simple function to calculate the simple interest, and then supplied the appropriate arguments to it to find the answer.
> start <- as.integer(as.Date("2015/1/1"))
> stop <- as.integer(as.Date("2015/5/22"))
> t <- stop - start
>
> Return <- function(p = 1000, r = .0005, t = 365){
+ amount <- p * (1 + r * t)
+ return(amount)
+ }
> t
[1] 141
> Return(1000,.0005, 141)
[1] 1070.5
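The same calculation can be wrapped so that the function accepts the two dates directly. This is only a sketch: simple_interest() and its argument names are hypothetical, chosen for illustration.

```r
# Hypothetical helper: simple interest earned between two calendar dates.
simple_interest <- function(principal, daily_rate, from, to) {
  # Subtracting two Dates gives the elapsed time in days
  days <- as.numeric(as.Date(to) - as.Date(from))
  principal * (1 + daily_rate * days)
}

simple_interest(1000, 0.0005, from = "2015-01-01", to = "2015-05-22")
```

With the inputs from the example above, this gives the same 1070.5 as the Return() function.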
When you have times along with dates, the best class to use is most often POSIXct, as mentioned previously. The “ct” stands for calendar time, and the “lt” in POSIXlt stands for local time. A POSIXct object stores a date-time as a single number (the number of seconds since the beginning of 1970, UTC), whereas a POSIXlt object stores it as a list of components, as you can see from examining the following code. As the name implies, the unclass() function returns a copy of its argument with its class attribute removed.
> time1 <- as.POSIXct(Sys.Date())
> time1
[1] "2014-06-02 20:00:00 EDT"
> unclass(time1)
[1] 1401753600
> time2 <- as.POSIXlt(Sys.Date())
> unclass(time2)
$sec
[1] 0
$min
[1] 0
$hour
[1] 0
$mday
[1] 3
$mon
[1] 5
$year
[1] 114
$wday
[1] 2
$yday
[1] 153
$isdst
[1] 0
attr(,"tzone")
[1] "UTC"
> time1
[1] "2014-06-02 20:00:00 EDT"
> class(time1)
[1] "POSIXct" "POSIXt"
> mode(time1)
[1] "numeric"
> time2
[1] "2014-06-03 UTC"
> class(time2)
[1] "POSIXlt" "POSIXt"
> mode(time2)
[1] "list"
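Because a POSIXlt object is a list, its components can be pulled out by name, while a POSIXct object supports arithmetic in seconds. The following sketch uses a fixed timestamp in UTC so that it does not depend on the local time zone. The variable names are illustrative.

```r
stamp_ct <- as.POSIXct("2014-06-03 13:26:35", tz = "UTC")
stamp_lt <- as.POSIXlt(stamp_ct)

as.numeric(stamp_ct)      # seconds since 1970-01-01 00:00:00 UTC

# POSIXlt components are reachable by name
stamp_lt$hour             # 13
stamp_lt$mon + 1          # months are counted from 0, so add 1: 6
stamp_lt$year + 1900      # years are counted from 1900: 2014

# POSIXct arithmetic is in seconds
stamp_ct + 3600           # one hour later
```

The offsets for mon and year explain the values 5 and 114 in the unclass() output above.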
Recipe 5-2. Working with Character Strings
Problem
Visualize an iceberg. When we think of data, we typically think of something like a data frame in R, an Excel spreadsheet, or some other database with a fixed structure. As you have seen, data frames can include string (character) data as well as numbers, but in data frames these are limited to a fixed structure and typically represent factors or nominal variables. The visible part of the iceberg is about 20%, containing “data” as we commonly conceptualize it. The 80% below the surface contains a dizzying array of “stuff,” and much of that stuff is very useful, even vital, to us on a daily basis, both at a personal and at a business level. The stuff includes, among other things, video and audio files, images, texts of all kinds, PDF files, PowerPoint files, e-mail, notes, and Word documents.
We have a digital universe that is growing exponentially. According to EMC’s seventh digital universe study conducted by the market research company IDC, the size of the digital universe increases 40% per year.
In the year 2005, there were “only” 132 exabytes of data. An exabyte is 10^18 bytes. The Internet of Things (IoT) is predicted to account for approximately 10% of the digital universe by 2020, which itself will contain nearly as many digital bits as there are stars in the “real” universe.
With this much information “out there,” and the majority of it not numbers, but instead narratives, pictures, and sounds, text mining has become increasingly important. Although perhaps not as proficient as other scripting languages in this regard, R is still quite capable of working with string data. We will discuss creating strings first, and then we will discuss various options for working with string data.
Solution
The class of a string object is character. Strings must be enclosed in either single or double quotes. You can insert single quotes into a string enclosed in double quotes, and vice versa, but you cannot directly insert the same kind of quote that encloses the string. For R to recognize such a quote as part of the string, you have to escape it with a backslash. The character() function creates vector objects of the character type.
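The quoting rules can be seen in a short sketch. The strings and variable names below are made up for illustration.

```r
single_in_double <- "R's character type"   # single quote inside double quotes
double_in_single <- 'He said "hello"'      # double quotes inside single quotes
escaped          <- "He said \"hello\""    # same kind of quote, escaped

# The escaped and unescaped versions are the same string
identical(double_in_single, escaped)       # TRUE

print(escaped)       # print() displays the escape characters
cat(escaped, "\n")   # cat() displays the string as it is stored
nchar(escaped)       # 15: the backslashes are not part of the string
```

The escape character only tells the parser how to read the quote; it does not add to the length of the string.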
In this solution, you’ll learn how to create character strings and how to find patterns and matches in strings. You will also learn how to use the stringr package to make working with strings more effective and more systematic.
Creating Character Strings
You can create character strings in a couple of different ways. You have already seen the use of the c() function. You can also create an empty character vector and then fill in the elements separately. Here are a couple of examples.
> example <- character(5)
> example
[1] "" "" "" "" ""
> example[1] <- "a"
> example[2] <- "b"
> example[3] <- "c"
> example[4] <- "d"
> example[5] <- "e"
> example
[1] "a" "b" "c" "d" "e"
> example2 <- c("f","g","h","i","j")
> example2
[1] "f" "g" "h" "i" "j"
> c(example, example2)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
Another important function for dealing with character data is paste(). This function takes any number of arguments, coerces them to character type if they are not already in that format, and then pastes, or concatenates, the arguments into one or more character strings. Examine the following code segments.
> PieLife <- paste("The life of", pi,"is sweet")
> PieLife
[1] "The life of 3.14159265358979 is sweet"
The default is to use a space as the separator, but you can specify another separator with the sep argument. For example, you can use a comma followed by a space.
> MyPieLife <- paste("Today", Sys.Date(), "I did not eat pie.", sep = ", ")
> MyPieLife
[1] "Today, 2014-06-06, I did not eat pie."
> typeof(MyPieLife)
[1] "character"
You can also use the cat() function to concatenate output, but as the following example shows, you cannot save the output from the cat() function to a variable. You will find the cat() and print() functions to be very useful when you write your own custom functions in R. Note the “escaped” character "\n", which tells R to go to the next line. Without the "\n", R would keep the command prompt on the same line as the output. Observe that the attempt to store a string in MyPieLife with the paste() function was successful, but the same is not true for the cat() function: cat() prints its arguments and returns NULL, so MyPieLife now holds NULL rather than the text.
> MyPieLife <- cat("Today, ", as.character(Sys.Date()),", I did not eat pie.","\n")
Today, 2014-06-06 , I did not eat pie.
>
> typeof(MyPieLife)
[1] "NULL"
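When the text is needed both on screen and in a variable, one option is to build it with paste() first and then hand the result to cat(). A minimal sketch, with illustrative variable names and a fixed date string:

```r
# Build the string with paste(), keep it, then display it with cat()
message_text <- paste("Today,", "2014-06-06,", "I did not eat pie.")
cat(message_text, "\n")

# cat() itself returns NULL (invisibly), so there is nothing to keep
returned <- cat("printed, not stored\n")
is.null(returned)            # TRUE
is.character(message_text)   # TRUE
```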
Pasting follows the same recycling rule as other vector operations. If you paste objects of different lengths, the shorter one is recycled, as you can see in the following example.
> paste("X",1:10,sep = ".")
[1] "X.1" "X.2" "X.3" "X.4" "X.5" "X.6" "X.7" "X.8" "X.9" "X.10"
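Recycling is easier to see when the shorter vector has more than one element. The sketch below also shows paste0(), which is shorthand for sep = "", and the collapse argument.

```r
# Two prefixes recycled across four numbers
paste(c("a", "b"), 1:4, sep = "")       # "a1" "b2" "a3" "b4"

# paste0() is shorthand for sep = ""
paste0("X", 1:3)                        # "X1" "X2" "X3"

# collapse joins the resulting elements into one string
paste0("X", 1:3, collapse = "+")        # "X1+X2+X3"
```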
In addition to concatenating character values with the paste() function, you can also use the sprintf() function. This function lets us control the output by specifying the format of each value being printed. For example:
> sprintf("%s was born in %d", "Tukey", 1915)
[1] "Tukey was born in 1915"
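A few more format specifications illustrate the kind of control sprintf() offers. This is a short sketch, not a complete list of the available conversions.

```r
sprintf("%.2f", pi)          # two decimal places: "3.14"
sprintf("%5d", 42L)          # right-justified in a field of width 5: "   42"
sprintf("%05d", 42L)         # zero-padded: "00042"
sprintf("%-8s|", "Tukey")    # left-justified string: "Tukey   |"

# sprintf() is vectorized over its arguments
sprintf("%s was born in %d", c("Tukey", "Fisher"), c(1915L, 1890L))
```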
Finding Patterns and Matches in Strings
The substr() function can be used to extract a substring, and the sub() function can be used to replace the first match of a pattern. The gsub() function replaces all matches. The grep() function searches for matches to a pattern within the elements of a character vector. See the following examples for the use of the substr(), sub(), and gsub() functions.
> TukeySaid <- "An approximate answer to the right problem is worth a good
+ deal more than the exact answer to an approximate problem."
> substr(TukeySaid,start = 3, stop = 14)
[1] " approximate"
> sub("answer", "solution", TukeySaid)
[1] "An approximate solution to the right problem is worth a good\ndeal more than the exact answer to an approximate problem."
> gsub("answer", "solution", TukeySaid)
[1] "An approximate solution to the right problem is worth a good\ndeal more than the exact solution to an approximate problem."
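The pattern given to sub() and gsub() is treated as a regular expression unless you say otherwise. The sketch below shows a few common variations; the shortened, single-line quotation is used only for illustration.

```r
tukey <- "An approximate answer to the right problem beats an exact answer"

# The pattern is a regular expression by default
gsub("[aeiou]", "", "approximate")                  # "pprxmt"
sub("^an", "One", tukey, ignore.case = TRUE)        # replaces the leading "An"

# fixed = TRUE treats the pattern as literal text, so "." matches only a period
gsub(".", "!", "R. A. Fisher", fixed = TRUE)        # "R! A! Fisher"

# Collapse runs of whitespace (including newlines) to single spaces
gsub("\\s+", " ", "a good\ndeal   more")            # "a good deal more"
```

The last call is one way to remove the embedded "\n" that appears when a string is typed across two lines.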
>
The grep() function can locate matches in character vectors. For example, the state.name dataset that ships with R lists the names of the 50 U.S. states:
> state.name
[1] "Alabama" "Alaska" "Arizona" "Arkansas"
[5] "California" "Colorado" "Connecticut" "Delaware"
[9] "Florida" "Georgia" "Hawaii" "Idaho"
[13] "Illinois" "Indiana" "Iowa" "Kansas"
[17] "Kentucky" "Louisiana" "Maine" "Maryland"
[21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
[25] "Missouri" "Montana" "Nebraska" "Nevada"
[29] "New Hampshire" "New Jersey" "New Mexico" "New York"
[33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
[37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
[41] "South Dakota" "Tennessee" "Texas" "Utah"
[45] "Vermont" "Virginia" "Washington" "West Virginia"
[49] "Wisconsin" "Wyoming"
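Let us find the states with the word “New” in their names. Because pattern is supplied by name in the calls below, state.name is matched to the next argument of grep(), the vector to be searched. Without the argument value = TRUE, the grep() function returns the index numbers of these states rather than their names. See the following: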
> grep(state.name, pattern = "New")
[1] 29 30 31 32
> grep(state.name, pattern = "New", value = TRUE)
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
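Regular-expression anchors and the related grepl() function extend this kind of search. The following is a brief sketch using the same state.name vector.

```r
# "^" anchors the match to the start of the name, "|" means "or"
grep("^North|^South", state.name, value = TRUE)

# grepl() returns TRUE or FALSE for every element, which is handy for counting
ends_in_a <- grepl("a$", state.name)      # "$" anchors the match to the end
sum(ends_in_a)

# ignore.case = TRUE matches regardless of capitalization
grep("new", state.name, ignore.case = TRUE, value = TRUE)
```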
Now, find the state or states with the longest name(s). Use the nchar() function for this purpose. We see that two of the states have 14-character names. We can then determine the names of those two states.
> nchar(state.name)
[1] 7 6 7 8 10 8 11 8 7 7 6 5 8 7 4 6 8 9 5 8 13 8 9 11 8
[26] 7 8 6 13 10 10 8 14 12 4 8 6 12 12 14 12 9 5 4 7 8 10 13 9 7
> longest <- nchar(state.name)
> state.name[which(longest == max(longest))]
[1] "North Carolina" "South Carolina"
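The same approach finds the shortest names, and it shows why comparing against max() or min() is safer than which.max() when ties are possible. A short sketch, with an illustrative variable name:

```r
name_lengths <- nchar(state.name)

# All states tied for the shortest name
state.name[name_lengths == min(name_lengths)]    # "Iowa" "Ohio" "Utah"

# which.max() reports only the FIRST maximum, so it would miss a tie
state.name[which.max(name_lengths)]              # "North Carolina"

# table() summarizes how many names have each length
table(name_lengths)
```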
Using the stringr Package
The stringr package written by Hadley Wickham overcomes some of the limitations of the base version of R when it comes to string manipulations. According to Wickham, the stringr package is a “set of simple wrappers that make R’s string functions more consistent, simpler, and easier to use.” Install the package by using the install.packages() function.
You can load stringr into your current R session with library() or require(). To see the list of the functions available in stringr, use the command library(help = stringr).
> install.packages("stringr")
> library(stringr)
> library(help = stringr)
The stringr package has all the functionality of the string functions we have used previously, but has the advantage that it works with missing data in a more appropriate way, as demonstrated next. The stringr package also has functionality that was not available in base R when this was written, such as the ability to duplicate characters. All the functions start with “str_” followed by a term that is descriptive of the task the function performs.
The str_length() and str_c() Functions
In the base R string functions used here, NA is treated as a two-character string, rather than as missing data. To illustrate, the nchar() function counts the characters in “NA” and reports it as a two-character string, while the str_length() function in stringr recognizes the missing value as such. (More recent versions of R return NA from nchar() for a missing character value, so the first result below depends on your R version.)
> myName <- c("Larry",NA,"Pace")
> nchar(myName)
[1] 5 2 4
> str_length(myName)
[1] 5 NA 4
The str_length() function also converts factors to characters, something of which nchar() is not capable.
> sexFactor <- factor(c(0,0,0,0,1,1,1,1,0,1,1,0,0,1), labels = c("female","male"))
> sexFactor
[1] female female female female male male male male female male
[11] male female female male
Levels: female male
> nchar(sexFactor)
Error in nchar(sexFactor) : 'nchar()' requires a character vector
> str_length(sexFactor)
[1] 6 6 6 6 4 4 4 4 6 4 4 6 6 4
The str_c() function is a substitute for paste(), but uses the empty string "" as the default separator instead of the single space that paste() uses.
> str_c("Statistics","is","the","grammar","of","science.","Karl Pearson")
[1] "Statisticsisthegrammarofscience.Karl Pearson"
You can change the separator by using the sep argument, as follows:
> str_c("Statistics","is","the","grammar","of","science.","Karl Pearson", sep = " ")
[1] "Statistics is the grammar of science. Karl Pearson"
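The difference in how missing values are handled carries over to concatenation. The sketch below assumes that the stringr package is installed, and contrasts str_c() with paste().

```r
library(stringr)   # assumes the stringr package is installed

# A missing value propagates through str_c() ...
str_c("Larry", NA, "Pace", sep = " ")      # NA

# ... whereas paste() converts it to the text "NA"
paste("Larry", NA, "Pace")                 # "Larry NA Pace"

# collapse joins the elements of a vector into a single string
str_c(c("a", "b", "c"), collapse = "-")    # "a-b-c"
```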
The str_sub() Function
The str_sub() function extracts substrings from character vectors. The user supplies three arguments: the string vector, the start value, and the end value. The function has the ability to work with negative indexes, which cause the function to work backward from the last character in a string element.
> pearsonSays <-str_c("Statistics","is","the","grammar","of","science.","Karl Pearson", sep = " ")
> pearsonSays
[1] "Statistics is the grammar of science. Karl Pearson"
> str_sub(pearsonSays, start = 1, end = 10)
[1] "Statistics"
> str_sub(pearsonSays, start = -7, end = -1)
[1] "Pearson"
You can also use the str_sub() function to replace substrings, as in the following example.
> str_sub(pearsonSays, 39, 50) <- "Ronald Fisher"
> pearsonSays
[1] "Statistics is the grammar of science. Ronald Fisher"
The str_dup() Function
When this was written, base R provided no specific function for duplicating strings, but the str_dup() function in stringr allows that operation. The str_dup() function duplicates and then concatenates strings within a character vector. You can specify the particular string as well as the number of times the string is to be duplicated. See the following:
> SantaSays <- str_dup("Ho", 3)
> SantaSays
[1] "HoHoHo"
> MrsSantaSays <- c(str_dup("Merry", 1:3),"Christmas")
> MrsSantaSays
[1] "Merry" "MerryMerry" "MerryMerryMerry" "Christmas"
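Later releases of base R include a comparable function, strrep(). The sketch below assumes R 3.3.0 or later; on older installations, str_dup() remains the way to do this.

```r
# strrep() is part of base R in recent versions (assumed here: R 3.3.0 or later)
strrep("Ho", 3)                     # "HoHoHo"
strrep("Merry", 1:3)                # "Merry" "MerryMerry" "MerryMerryMerry"

# Combined with paste0(), it can draw simple text rules
paste0("+", strrep("-", 10), "+")   # "+----------+"
```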
Padding, Wrapping, and Trimming Strings
Padding involves taking a string and adding leading or trailing characters (or both) to achieve a specified width. The str_pad() function accomplishes this. The default pad character is a space (pad = " "). The side argument takes the options "left" (the default), "right", and "both", which add the padding on the left, on the right, or on both sides; the text is thus right-aligned, left-aligned, or centered, respectively. Here are some examples.
> str_pad("Tukey", width = 10)
[1] "     Tukey"
> str_pad("Tukey", width = 10, side = "right")
[1] "Tukey     "
> str_pad("Tukey", width = 10, side = "both")
[1] "  Tukey   "
> str_pad("Tukey", width = 10, pad = "#")
[1] "#####Tukey"
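Two practical uses of padding are zero-padding identifiers and lining up text in columns. A minimal sketch, assuming stringr is installed; the identifiers are made up.

```r
library(stringr)   # assumes the stringr package is installed

# Zero-pad identifiers to a common width
ids <- c("1", "22", "333")
str_pad(ids, width = 5, pad = "0")                        # "00001" "00022" "00333"

# Strings already at or beyond the target width are returned unchanged
str_pad("Tukey", width = 3)                               # "Tukey"

# Right-padding gives every element the same width for a left-aligned column
str_pad(c("Tukey", "Fisher"), width = 8, side = "right")  # "Tukey   " "Fisher  "
```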
The str_wrap() function can wrap a string to form paragraphs. For example, consider the following quote from R. A. Fisher:
fisherSays <- c(
"If ... we choose a group of social",
"phenomena with no antecedent knowledge",
"of the causation or absence of causation",
"among them, then the calculation of",
"correlation coefficients, total or partial,",
"will not advance us a step toward evaluating",
"the importance of the causes at work.",
"R. A. Fisher"
)
To display this quote as a single paragraph, we must paste the elements together as follows. The collapse argument tells R to join the individual elements, separated here by a single space, into a single character string.
fisherSays <- paste(fisherSays, collapse = " ")
We can control the width of the lines, as well as indentation. The indent argument sets the indentation of the first line, and exdent sets that of the following lines; the default for both is 0. Here is an example.
> cat(str_wrap(fisherSays, width = 30, indent = 2), "\n")
If ... we choose a group of
social phenomena with no
antecedent knowledge of the
causation or absence of
causation among them, then
the calculation of
correlation coefficients,
total or partial, will not
advance us a step toward
evaluating the importance of
the causes at work. R. A.
Fisher
We can trim strings using the str_trim() function. In string processing, we often parse a text into individual words. The words often wind up with whitespace (blank space) on either end. If so, use str_trim() to remove it.
> textToTrim <- c("There", " are","many "," extra ", "whitespaces")
> textToTrim
[1] "There" " are" "many " " extra " "whitespaces"
> str_trim(textToTrim, side = "both")
[1] "There" "are" "many" "extra" "whitespaces"
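The side argument also accepts "left" and "right" when only one end should be trimmed. A short sketch, assuming stringr is installed; the variable name is illustrative.

```r
library(stringr)   # assumes the stringr package is installed

padded <- "  extra  "
str_trim(padded, side = "left")    # "extra  "
str_trim(padded, side = "right")   # "  extra"
str_length(str_trim(padded))       # 5: side = "both" is the default
```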
Extracting Words
The word() function extracts words from a sentence. You pass the function a string along with start, the position of the first word to extract, and optionally end, the position of the last word to extract; by default, end is the same as start, so a single word is returned. Negative positions count backward from the last word. By default, a single space is used as the separator between words. Let’s use the Fisher quote and extract different words. We extract the first, the second, and the last words of each string.
> fisherSays <- c(
+ "If ... we choose a group of social",
+ "phenomena with no antecedent knowledge",
+ "of the causation or absence of causation",
+ "among them, then the calculation of",
+ "correlation coefficients, total or partial,",
+ "will not advance us a step toward evaluating",
+ "the importance of the causes at work.",
+ "R. A. Fisher")
> word(fisherSays, 1)
[1] "If" "phenomena" "of" "among" "correlation"
[6] "will" "the" "R."
> word(fisherSays, 2)
[1] "..." "with" "the" "them,"
[5] "coefficients," "not" "importance" "A."
> word(fisherSays, -1)
[1] "social" "knowledge" "causation" "of" "partial,"
[6] "evaluating" "work." "Fisher"
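Supplying both start and end extracts a run of words, and the sep argument changes what counts as a word boundary. The following sketch assumes stringr is installed; the variable name is illustrative.

```r
library(stringr)   # assumes the stringr package is installed

credit <- "R. A. Fisher"
word(credit, start = 1, end = 2)          # "R. A."
word(credit, start = 2, end = -1)         # "A. Fisher"

# A different separator can be supplied through sep
word("2014-06-03", 2, sep = fixed("-"))   # "06"
```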