I/O and Data Storage - THE RUBY WAY, Third Edition (2015)

Chapter 10. I/O and Data Storage

On a clean disk, you can seek forever.

—Thomas B. Steel, Jr.

Computers are good at computing. This tautology is more profound than it appears. If we only had to sit and chew up the CPU cycles and reference RAM as needed, life would be easy.

A computer that only sits and thinks to itself is of little use, however. Sooner or later we have to get information into it and out of it, and that is where life becomes more difficult.

Several things make I/O complicated. First, input and output are rather different things, but we naturally lump them together. Second, the varieties of I/O operations (and their usages) are as diverse as species of insects.

History has seen such devices as drums, paper tapes, magnetic tapes, punched cards, and teletypes. Some operated with a mechanical component; others were purely electromagnetic. Some were read-only; others were write-only or read-write. Some writable media were erasable, and others were not. Some devices were inherently sequential; others were random access. Some media were permanent; others were transient or volatile. Some devices depended on human intervention; others did not. Some were character oriented; others were block oriented. Some block devices were fixed length; others were variable length. Some devices were polled; others were interrupt-driven. Interrupts could be implemented in hardware or software or both. We have both buffered and non-buffered I/O. There has been memory-mapped I/O and channel-oriented I/O, and with the advent of operating systems such as UNIX, we have seen I/O devices mapped to files in a filesystem. We have done I/O in machine language, in assembly language, and in high-level languages. Some languages have the I/O capabilities firmly hardwired in place; others leave it out of the language specification completely. We have done I/O with and without suitable device drivers or layers of abstraction.

If this seems like a confusing mess, that is because it is. Part of the complexity is inherent in the concept of input/output, part of it is the result of design trade-offs, and part of it is the result of legacies or traditions in computer science and the quirks of various languages and operating systems.

Ruby’s I/O is complex because I/O in general is complex. But we have tried here to make it understandable and present a good overview of how and when to use various techniques.

The core of all Ruby I/O is the IO class, which defines behavior for every kind of input/output operation. Closely allied to IO (and inheriting from it) is the File class. A nested class within File, called Stat, encapsulates various details about a file that we might want to examine (such as its permissions and time stamps). The methods stat and lstat return objects of type File::Stat.

The module FileTest also has methods that allow us to test much the same set of properties. These methods also appear in the File class as class methods. Finally, there are I/O methods in the Kernel module that are mixed into Object (the ancestor of all objects, including classes). These are the simple I/O routines that we have used all along without worrying about what their receiver was. These naturally default to standard input and standard output.

The beginner may find these classes to be a confused jumble of overlapping functionality. The good news is that you need only use small pieces of this framework at any given time.

On a higher level, Ruby offers features to make object persistence possible. The Marshal, YAML, and JSON libraries allow simple serialization of objects, while CSV and SQLite persist data to files on disk.

On the highest level of all, external data stores such as PostgreSQL, MySQL, and Redis provide a way to share a single data store across many Ruby processes. External data stores are complex enough that they have their own books. We will provide only a brief overview to get the programmer started. In some cases, we provide only a pointer to online documentation.

10.1 Working with Files and Directories

When we say file, we usually mean a disk file, though not always. We do use the concept of a file as a meaningful abstraction in Ruby as in other programming languages. When we say directory, we mean a directory in the normal Windows or UNIX sense.

The File class is closely related to the IO class from which it inherits. The Dir class is not so closely related, but we chose to discuss files and directories together because they are still conceptually related.

10.1.1 Opening and Closing Files

The class method File.new instantiates a File object and opens the file. The first parameter is naturally the filename.

The optional second parameter is called the mode string and tells how to open the file, whether for reading, writing, and so on. (The mode string has nothing to do with the mode as in permissions.) This defaults to "r" for reading. The following code demonstrates opening files for reading and writing:

file1 = File.new("one") # Open for reading
file2 = File.new("two", "w") # Open for writing

Another form for new takes three parameters. In this case, the second parameter is a set of flags ORed together, and the third specifies the original permissions for the file (usually as an octal constant). The flags are constants such as File::CREAT (create the file when it is opened if it doesn’t already exist) and File::RDONLY (open for reading only). This form will rarely be used.

file = File.new("three", File::CREAT|File::WRONLY, 0755)

As a courtesy to the operating system and the runtime environment, always close a file that you open. In the case of a file open for writing, this is more than mere politeness and can actually prevent lost data. Not surprisingly, the close method serves this purpose:

out = File.new("captains.log", "w")
# Process as needed...
out.close

There is also an open method. In its simplest form, it is merely a synonym for new, as we see here:

trans = File.open("transactions","w")

But open can also take a block; this is the form that is more interesting. When a block is specified, the open file is passed in as a parameter to the block. The file remains open throughout the scope of the block and is closed automatically at the end. Here is an example:

File.open("somefile","w") do |file|
  file.puts "Line 1"
  file.puts "Line 2"
  file.puts "Third and final line"
end
# The file is now closed

This is obviously an elegant way of ensuring that a file is closed when we’ve finished with it. In addition, the code that handles the file is grouped visually into a unit.
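
The automatic close also applies when the block is left abnormally. The following is a minimal sketch, assuming a scratch file created with the standard Tempfile library; the names scratch and leaked are illustrative only.

```ruby
require "tempfile"

scratch = Tempfile.new("block_demo")   # hypothetical scratch file
path = scratch.path
scratch.close

leaked = nil
begin
  File.open(path, "w") do |file|
    leaked = file                 # keep a reference so we can inspect it later
    file.puts "Line 1"
    raise "something went wrong"  # leave the block abnormally
  end
rescue RuntimeError
  # The exception propagates, but the file has already been closed.
end

puts leaked.closed?               # true
```

Because the close happens on the way out of the block, buffered output is flushed to disk even though the block did not finish normally.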

The reopen method will associate a new stream with its receiver. In this example, we turn off output to standard error, and later we turn it back on:

save = STDERR.dup
STDERR.reopen("/dev/null")
# Quiet now...
STDERR.reopen(save)

10.1.2 Updating a File

Suppose that we want to open a file for reading and writing. This is done simply by adding a plus sign (+) in the file mode when we open the file (see Section 10.1.1, “Opening and Closing Files”):

f1 = File.new("file1", "r+")
# Read/write, starting at beginning of file.

f2 = File.new("file2", "w+")
# Read/write; truncate existing file or create a new one.

f3 = File.new("file3", "a+")
# Read/write; start at end of existing file or create a
# new one.
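
As a rough illustration of how these modes differ in practice, the sketch below overwrites a few bytes in place with "r+" and then appends with "a+". It assumes a scratch file created with Tempfile; the file contents are invented for the example.

```ruby
require "tempfile"

scratch = Tempfile.new("update_demo")  # hypothetical scratch file
path = scratch.path
scratch.close

File.write(path, "abcdefghi")

File.open(path, "r+") do |f|
  f.seek(3)        # move to the fourth byte
  f.write("XYZ")   # overwrite three bytes in place
  f.rewind
  puts f.read      # abcXYZghi
end

File.open(path, "a+") do |f|
  f.write("!")     # in append mode, writes go to the end of the file
  f.rewind
  puts f.read      # abcXYZghi!
end
```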

10.1.3 Appending to a File

Suppose that we want to append information onto an existing file. This is done simply by using "a" in the file mode when we open the file (see Section 10.1.1, “Opening and Closing Files”):

logfile = File.open("captains_log", "a")
# Add a line at the end, then close.
logfile.puts "Stardate 47824.1: Our show has been canceled."
logfile.close

10.1.4 Random Access to Files

If you want to read a file randomly rather than sequentially, you can use the method seek, which File inherits from IO. The simplest usage is to seek to a specific byte position. The position is relative to the beginning of the file, where the first byte is numbered 0.

# myfile contains only: abcdefghi
file = File.new("myfile")
file.seek(5)
str = file.gets # "fghi"

If you took care to ensure that each line was a fixed length, you could seek to a specific line, as in the following example:

# Assume 20 bytes per line.
# Line N starts at byte (N-1)*20
file = File.new("fixedlines")
file.seek(5*20) # Sixth line!
# Elegance is left as an exercise.
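
One possible take on that exercise is sketched below. The helper fetch_line and its line_length parameter are hypothetical names, and the sketch assumes every line (including its newline) occupies exactly 20 bytes.

```ruby
require "tempfile"

# Return line number n (1-based) from a file whose lines all occupy
# line_length bytes, newline included.
def fetch_line(file, n, line_length = 20)
  file.seek((n - 1) * line_length)
  file.read(line_length)
end

# Build a sample file: each record is padded to 19 characters plus "\n".
scratch = Tempfile.new("fixedlines")
1.upto(10) { |i| scratch.write(format("%-19s\n", "record #{i}")) }
scratch.flush

File.open(scratch.path) do |file|
  puts fetch_line(file, 6).rstrip   # record 6
end
```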

If you want to do a relative seek, you can use a second parameter. The constant IO::SEEK_CUR specifies that the offset (which may be negative) is relative to the current position:

file = File.new("somefile")
file.seek(55) # Position is 55
file.seek(-22, IO::SEEK_CUR) # Position is 33
file.seek(47, IO::SEEK_CUR) # Position is 80

You can also seek relative to the end of the file. Only a negative offset makes sense here:

file.seek(-20, IO::SEEK_END) # twenty bytes from eof

There is also a third constant, IO::SEEK_SET, but it is the default value (seek relative to beginning of file).

The method tell reports the file position; pos is an alias:

file.seek(20)
pos1 = file.tell # 20
file.seek(50, IO::SEEK_CUR)
pos2 = file.pos # 70

The rewind method can also be used to reposition the file pointer at the beginning. This terminology comes from the use of magnetic tapes.

If you are performing random access on a file, you may want to open it for update (reading and writing). Updating a file is done by specifying a plus sign (+) in the mode string. See Section 10.1.2, “Updating a File.”

10.1.5 Working with Binary Files

Although every file is ultimately binary code, binary is colloquially used to mean data that is not readable as text. It may be compressed, encrypted, or contain audio or video data.

To tell Ruby it will read binary data, add the "b" character to the mode string when opening a file. The resulting string will always have the encoding ASCII-8BIT, which is a string of bytes without any encoding.

Binary mode allows reading and manipulating bytes that are invalid in an encoding:

File.write("invalid", "\xFC\x80\x80 \x80\x80\xAF")
File.read("invalid", mode: "r").split(" ")
# invalid byte sequence in UTF-8
File.read("invalid", mode: "rb").split(" ")
# ["\xFC\x80\x80", "\x80\x80\xAF"]
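
The encoding difference can be checked directly. This is a small sketch using a Tempfile scratch file (an assumption, not part of the example above); whether the text-mode string is valid depends on your default external encoding.

```ruby
require "tempfile"

scratch = Tempfile.new("binary_demo")  # hypothetical scratch file
scratch.binmode
scratch.write("\xFC\x80\x80 \x80\x80\xAF".b)
scratch.flush

text   = File.read(scratch.path, mode: "r")
binary = File.read(scratch.path, mode: "rb")

p binary.encoding == Encoding::ASCII_8BIT   # true
p binary.bytesize                           # 7
p binary.valid_encoding?                    # true: any bytes are valid as raw bytes
p text.valid_encoding?                      # false when the default external encoding is UTF-8
```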

On Windows, binary mode also means that each line break is not translated into a single \n linefeed but is kept as the \r\n carriage-return/linefeed pair.

Another important difference is that Ctrl+Z is treated as end-of-file in non-binary mode, as shown here:

# myfile contains "12345\0326789\r".
# Note the embedded octal 032 (^Z)
File.open("myfile","rb") {|f| str = f.sysread(15) }.size # 11
File.open("myfile","r") {|f| str = f.sysread(15) }.size # 5

The following code fragment shows that carriage returns remain untranslated in binary mode on Windows:

# Input file contains a single line: Line 1.
file = File.open("data")
line = file.readline # "Line 1.\n"
puts "#{line.size} characters." # 8 characters
file.close

file = File.open("data","rb")
line = file.readline # "Line 1.\r\n"
puts "#{line.size} characters." # 9 characters
file.close

Note that the binmode method, shown in the following code example, can switch a stream to binary mode. Once switched, it cannot be switched back.

file = File.open("data")
file.binmode
line = file.readline # "Line 1.\r\n"
puts "#{line.size} characters." # 9 characters
file.close

If you really want to do low-level input/output, you can use the sysread and syswrite methods. The former takes a number of bytes as a parameter; the latter takes a string and returns the actual number of bytes written. (You should not use other methods to read from the same stream; the results may be unpredictable.)

input = File.new("infile")
output = File.new("outfile", "w") # must be opened for writing
instr = input.sysread(10)
bytes = output.syswrite("This is a test.")

Note that sysread raises EOFError if it is invoked at end-of-file (though not if it encounters end-of-file during a successful read). Either of these methods will raise SystemCallError when an error occurs.

The Array method pack and the String method unpack can be useful in dealing with binary data.
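
As a brief sketch of how pack and unpack fit together, the following builds and then decodes a small binary header. The layout (a 16-bit version, a 32-bit length, and a 4-byte tag) is invented purely for illustration.

```ruby
# "n"  = 16-bit unsigned integer, big-endian
# "N"  = 32-bit unsigned integer, big-endian
# "a4" = 4 bytes of arbitrary binary string
header = [1, 258, "RUBY"].pack("nNa4")

puts header.bytesize        # 10
p header.bytes              # [0, 1, 0, 0, 1, 2, 82, 85, 66, 89]

version, length, tag = header.unpack("nNa4")
puts version                # 1
puts length                 # 258
puts tag                    # RUBY
```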

10.1.6 Locking Files

On operating systems where it is supported, the flock method of File will lock or unlock a file. Its parameter is one of these constants: File::LOCK_EX, File::LOCK_NB, File::LOCK_SH, File::LOCK_UN, or a logical-OR of two or more of these. Note, of course, that many of these combinations will be nonsensical; the nonblocking flag is the one most frequently used:

file = File.new("somefile")

file.flock(File::LOCK_EX) # Lock exclusively; no other process
# may use this file.
file.flock(File::LOCK_UN) # Now unlock it.

file.flock(File::LOCK_SH) # Lock file with a shared lock
# (other processes may do the same).
file.flock(File::LOCK_UN) # Now unlock it.

locked = file.flock(File::LOCK_EX | File::LOCK_NB)
# Try to lock the file, but don't block if we can't; in that
# case, locked will be false.

This function is not available on the Windows family of operating systems.
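
The nonblocking case can be observed within a single process by opening the same file twice. This sketch assumes a platform where flock behaves as it does on Linux and the BSDs (locks belong to each open of the file), and it uses a Tempfile scratch file that is not part of the example above.

```ruby
require "tempfile"

scratch = Tempfile.new("lock_demo")   # hypothetical scratch file
path = scratch.path
scratch.close

results = []
File.open(path, "r+") do |holder|
  holder.flock(File::LOCK_EX)         # take the exclusive lock

  File.open(path, "r+") do |contender|
    # A second open of the same file competes for the lock.
    results << contender.flock(File::LOCK_EX | File::LOCK_NB)
  end

  holder.flock(File::LOCK_UN)         # release it

  File.open(path, "r+") do |contender|
    results << contender.flock(File::LOCK_EX | File::LOCK_NB)
  end
end

p results   # [false, 0]
```

A successful flock returns 0 (which is true in Ruby), so the result can be used directly in a conditional.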

10.1.7 Performing Simple I/O

You are already familiar with some of the I/O routines in the Kernel module; these are the ones we have called all along without specifying a receiver for the methods. Calls such as gets and puts originate here; others are print, printf, and p (which calls the object’s inspect method to display it in some way readable to humans).

There are some others that we should mention for completeness, though. The putc method outputs a single character. (The corresponding method getc is not implemented in Kernel for technical reasons; it can be found in any IO object, however.) If a String is specified, the first character of the string will be taken.

putc(?\n) # Output a newline
putc("X") # Output the letter X

A reasonable question is, where does output go when we use these methods without a receiver? Well, to begin with, three constants are defined in the Ruby environment, corresponding to the three standard I/O streams we are accustomed to on UNIX. These are STDIN, STDOUT, and STDERR. All are global constants of the type IO.

There is also a global variable called $stdout, which is the destination of all the output coming from Kernel methods. This is initialized (indirectly) to the value of STDOUT so that this output all gets written to standard output as we expect. The variable $stdout can be reassigned to refer to some other IO object at any time.

diskfile = File.new("foofile","w")
puts "Hello..." # prints to stdout
$stdout = diskfile
puts "Goodbye!" # prints to "foofile"
diskfile.close
$stdout = STDOUT # reassign to default
puts "That's all." # prints to stdout
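
The same reassignment technique can capture output in memory instead of on disk. The sketch below assumes the standard StringIO library, which the text above does not discuss; capture_output is a hypothetical helper name.

```ruby
require "stringio"

# Run a block with $stdout temporarily pointed at a StringIO and
# return whatever was printed.
def capture_output
  original = $stdout
  $stdout = StringIO.new
  yield
  $stdout.string
ensure
  $stdout = original   # always restore, even if the block raises
end

text = capture_output { puts "Goodbye!" }
puts "Captured: #{text.inspect}"   # Captured: "Goodbye!\n"
```

Restoring the previous value in an ensure clause avoids leaving $stdout redirected if the block raises an exception.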

Besides gets, Kernel also has the methods readline and readlines for input. The former is equivalent to gets, except that it raises EOFError at the end of a file instead of just returning a nil value. The latter is equivalent to the IO.readlines method (that is, it reads an entire file into memory).

Where does input come from? Well, there is also the standard input stream $stdin, which defaults to STDIN. In the same way, there is a standard error stream ($stderr, which defaults to STDERR).

There is also an interesting global object called ARGF, which represents the concatenation of all the files named on the command line. It is not really a File object, though it resembles one. Default input is connected to this object in the event files are named on the command line.

# cat.rb
# Read all files, then output again
puts ARGF.read
# Or more memory-efficient:
puts ARGF.readline until ARGF.eof?
# Example usage: ruby cat.rb file1 file2 file3

Reading from standard input (STDIN) will bypass the Kernel methods. That way, you can bypass ARGF (or not), as shown here:

# Read a line from standard input
str1 = STDIN.gets
# Read a line from ARGF
str2 = ARGF.gets
# Now read again from standard input
str3 = STDIN.gets

It is possible to read at the character and byte levels as well. In a single-byte encoding, these will be essentially the same, except that a byte is an integer (a Fixnum; simply an Integer from Ruby 2.4 on), and a character is a single-character string:

c = input.getc
b = input.getbyte
input.ungetc(c) # These two operations are not
input.ungetbyte(b) # always possible.
b = input.readbyte # Like getbyte, but can raise EOFError

10.1.8 Performing Buffered and Unbuffered I/O

Ruby does its own internal buffering in some cases. Consider this fragment:

print "Hello... "
sleep 5
print "Goodbye!\n"

If you run this, you will notice that the hello and goodbye messages both appear at the same time, after the sleep. The first output is not terminated by a newline.

This can be fixed by calling flush to flush the output buffer. In this case, we use STDOUT as the receiver (the default stream for all Kernel method output, $stdout, refers to it unless reassigned). It then behaves as we probably wanted, with the first message appearing earlier than the second one:

print "Hello... "
STDOUT.flush
sleep 5
print "Goodbye!\n"

This buffering can be turned off (or on) with the sync= method; the sync method lets us know the status:

STDOUT.sync = true # turn output buffering off
buf_flag = STDOUT.sync # true
STDOUT.sync = false # turn it back on
buf_flag = STDOUT.sync # false

There is also at least one lower level of buffering going on behind the scenes. Just as the getc method returns a character and moves the file or stream pointer, ungetc will push a character back onto the stream:

ch = mystream.getc # ?A
mystream.ungetc(?C)
ch = mystream.getc # ?C

You should be aware of three things. First, the buffering we speak of here is unrelated to the buffering mentioned earlier in this section; in other words, setting sync to true won't turn it off. Second, do not count on pushing back more than one character; historically only the last one pushed was actually returned to the input stream, and the behavior has varied between Ruby versions. Finally, the ungetc method will not work for inherently unbuffered read operations (such as sysread).

10.1.9 Manipulating File Ownership and Permissions

The issue of file ownership and permissions is highly platform dependent. Typically, UNIX provides a superset of the functionality; for other platforms, many features may be unimplemented.

To determine the owner and group of a file (which are integers), File::Stat has a pair of instance methods, uid and gid, as shown here:

data = File.stat("somefile")
owner_id = data.uid
group_id = data.gid

Class File::Stat has an instance method called mode that returns the mode (or permissions) of the file:

perms = File.stat("somefile").mode

File has class and instance methods named chown to change the owner and group IDs of a file. The class method accepts an arbitrary number of filenames. Where an ID is not to be changed, nil or -1 can be used:

uid = 201
gid = 10
File.chown(uid, gid, "alpha", "beta")
f1 = File.new("delta")
f1.chown(uid, gid)
f2 = File.new("gamma")
f2.chown(nil, gid) # Keep original owner id

Likewise, the permissions can be changed by chmod (also implemented both as class and instance methods). The permissions are traditionally represented in octal, though they need not be:

File.chmod(0644, "epsilon", "theta")
f = File.new("eta")
f.chmod(0444)

A process always runs under the identity of some user (possibly root); as such, there is a user ID associated with it. (Here, we are talking about the effective user ID.) We frequently need to know whether that user has permission to read, write, or execute a given file. There are instance methods in File::Stat to make this determination:

info = File.stat("/tmp/secrets")
rflag = info.readable?
wflag = info.writable?
xflag = info.executable?

Sometimes we need to distinguish between the effective user ID and the real user ID. The appropriate instance methods are readable_real?, writable_real?, and executable_real?, respectively.

info = File.stat("/tmp/secrets")
rflag2 = info.readable_real?
wflag2 = info.writable_real?
xflag2 = info.executable_real?

We can test the ownership of the file as compared with the effective user ID (and group ID) of the current process. The class File::Stat has the instance methods owned? and grpowned? to accomplish this.

Note that many of these methods can also be found in the module FileTest:

rflag = FileTest::readable?("pentagon_files")
# Other methods are: writable? executable?
# readable_real? writable_real?
# executable_real? owned? grpowned?
# Not found here: uid gid mode

The umask associated with a process determines the initial permissions of new files created. The standard mode 0777 is logically ANDed with the negation of the umask so that the bits set in the umask are “masked” or cleared. If you prefer, you can think of this as a simple subtraction (without borrowing). Thus, a umask of 022 results in files being created with a mode of 0755.

The umask can be retrieved or set with the class method umask of class File. If a parameter is specified, the umask will be set to that value (and the previous value will be returned).

File.umask(0237) # Set the umask
current_umask = File.umask # 0237
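
The masking arithmetic described above can be worked through directly. This is only a sketch of the bit operations; it does not touch the process umask.

```ruby
umask = 0o022
mode  = 0o777 & ~umask
puts format("%04o", mode)            # 0755

# With a base mode of 0666 and a umask of 027, ordinary subtraction
# would give 0637, which is wrong; masking gives the right answer.
puts format("%04o", 0o666 & ~0o027)  # 0640
```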

Some file mode bits (such as the sticky bit) are not strictly related to permissions. For a discussion of these, see Section 10.1.12, “Checking Special File Characteristics.”

10.1.10 Retrieving and Setting Timestamp Information

Each disk file has multiple timestamps associated with it (though there are some variations between operating systems). The three timestamps that Ruby understands are the modification time (the last time the file contents were changed), the access time (the last time the file was read), and the change time (the last time the file’s directory information was changed).

These three pieces of information can be accessed in three different ways. Each of these fortunately gives the same result.

The File class methods mtime, atime, and ctime return the times without the file being opened or any File object being instantiated.

t1 = File.mtime("somefile")
# Thu Jan 04 09:03:10 GMT-6:00 2001
t2 = File.atime("somefile")
# Tue Jan 09 10:03:34 GMT-6:00 2001
t3 = File.ctime("somefile")
# Sun Nov 26 23:48:32 GMT-6:00 2000

If there happens to be a File instance already created, the instance method can be used:

myfile = File.new("somefile")
t1 = myfile.mtime
t2 = myfile.atime
t3 = myfile.ctime

And if there happens to be a File::Stat instance already created, it has instance methods to do the same thing:

myfile = File.new("somefile")
info = myfile.stat
t1 = info.mtime
t2 = info.atime
t3 = info.ctime

Note that a File::Stat is returned by File’s class (or instance) method stat. The class method lstat (or the instance method of the same name) is identical except that it reports on the status of the link itself instead of following the link to the actual file. In the case of links to links, all links are followed but the last one.

File access and modification times may be changed using the utime method. It will change the times on one or more files specified. The times may be given either as Time objects or a number of seconds since the epoch:

today = Time.now
yesterday = today - 86400
File.utime(today, today, "alpha")
File.utime(today, yesterday, "beta", "gamma")

Because both times are changed together, if you want to leave one of them unchanged, you have to save it off first:

mtime = File.mtime("delta")
File.utime(Time.now, mtime, "delta")
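
To confirm that the saved modification time really survives, here is a self-contained sketch. It assumes a Tempfile scratch file and an arbitrary known timestamp (midnight UTC on January 1, 2000).

```ruby
require "tempfile"

scratch = Tempfile.new("utime_demo")    # hypothetical scratch file
path = scratch.path
scratch.close

old_mtime = Time.at(946_684_800)        # 2000-01-01 00:00:00 UTC
File.utime(old_mtime, old_mtime, path)  # set both times to a known value

# Update only the access time; pass the saved mtime back unchanged.
saved = File.mtime(path)
File.utime(Time.now, saved, path)

puts File.mtime(path) == old_mtime      # true
```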

10.1.11 Checking File Existence and Size

One fundamental question we sometimes want answered is whether a file of a given name exists. The exist? method in the File class provides a way to find out:

flag = File.exist?("LochNessMonster")
flag = File.exists?("UFO")
# exists? is a deprecated synonym for exist?
# (removed in Ruby 3.2); prefer exist?

Related to the question of a file’s existence is the question of whether it has any contents. After all, a file may exist but have zero length (which is the next best thing to not existing).

If we are only interested in this yes/no question, File has two class methods that are useful. The method zero? returns true if the file is zero length and false otherwise:

flag = File.zero?("somefile")

Conversely, the method size? returns either the size of the file in bytes if it is nonzero length, or the value nil if it is zero length. It may not be immediately obvious why nil is returned rather than 0. The answer is that the method is primarily intended for use as a predicate, and 0 is true in Ruby, whereas nil tests as false.

if File.size?("myfile")
  puts "The file has contents."
else
  puts "The file is empty."
end

This leads naturally to the question, how big is this file? We've already seen that in the case of a nonempty file, size? returns the length; but if we’re not using it as a predicate, the nil value would confuse us.

The File class has both class and instance methods to give us this answer:

size1 = File.size("filename") # returns 0 if filename is empty
size2 = File.new("filename").size # instance method; same result
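
The difference between zero?, size?, and size is easiest to see side by side. A small sketch, assuming two Tempfile scratch files (one empty, one holding five bytes):

```ruby
require "tempfile"

empty = Tempfile.new("empty")   # hypothetical scratch files
full  = Tempfile.new("full")
full.write("hello")
full.flush

p File.zero?(empty.path)   # true
p File.size?(empty.path)   # nil
p File.size(empty.path)    # 0

p File.zero?(full.path)    # false
p File.size?(full.path)    # 5
p File.size(full.path)     # 5
```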

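If we want the file size in blocks rather than bytes, we can use the instance method blocks in File::Stat. This is certainly dependent on the operating system; on many UNIX systems the count is in 512-byte units. (The method blksize reports the operating system’s preferred block size for I/O, which is not necessarily the unit that blocks counts in.)

info = File.stat("somefile")
blocks_used = info.blocks # number of blocks allocated
io_block = info.blksize # preferred I/O block size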

10.1.12 Checking Special File Characteristics

There are numerous aspects of a file that we can test. We summarize here the relevant built-in methods that we don’t discuss elsewhere. Most, though not all, are predicates.

Bear in mind two facts throughout this section (and most of this chapter). First, most tests that can be done by invoking a File class method can also be called as an instance method of the File::Stat object returned by stat (or lstat); a File object itself responds to only a few of them. Second, this means there is a high degree of overlap between File and File::Stat. In some cases, there will be different ways to call what is essentially the same method. We won’t necessarily show this every time.

Some operating systems have the concept of block-oriented devices as opposed to character-oriented devices. A file may refer to either but not both. The methods blockdev? and chardev? in the FileTest module test for this:

flag1 = File.chardev?("/dev/hdisk0") # false
flag2 = File.blockdev?("/dev/hdisk0") # true

Sometimes we want to know whether the stream is associated with a terminal. The IO instance method tty? tests for this (as does the synonym isatty):

flag1 = STDIN.tty? # true
flag2 = File.new("diskfile").isatty # false

A stream can be a pipe or a socket. There are corresponding FileTest methods to test for these cases:

flag1 = File.pipe?(myfile)
flag2 = File.socket?(myfile)

Recall that a directory is really just a special case of a file. Therefore, we need to be able to distinguish between directories and ordinary files, which a pair of methods enable us to do:

file1 = File.new("/tmp")
file2 = File.new("/tmp/myfile")
test1 = file1.stat.directory? # true
test2 = file1.stat.file? # false
test3 = file2.stat.directory? # false
test4 = file2.stat.file? # true

There is also a File class method named ftype, which tells us what kind of thing a stream is; it can also be found as an instance method in the File::Stat class. This method returns a string that has one of the following values: file, directory, blockSpecial, characterSpecial, fifo, link, or socket. (The string fifo refers to a pipe.)

this_kind = File.ftype("/dev/hdisk0") # "blockSpecial"
that_kind = File.new("/tmp").stat.ftype # "directory"
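
Device files such as /dev/hdisk0 will not exist on every system, so here is a sketch of the same calls that should run anywhere. It assumes the standard tmpdir library and an invented file name, plain.txt.

```ruby
require "tmpdir"

kinds = {}
Dir.mktmpdir do |dir|
  plain = File.join(dir, "plain.txt")    # hypothetical file name
  File.write(plain, "data")

  kinds[:dir]  = File.ftype(dir)         # "directory"
  kinds[:file] = File.ftype(plain)       # "file"
  kinds[:stat] = File.stat(plain).ftype  # "file"
end

p kinds
```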

Certain special bits may be set or cleared in the permissions of a file. These are not strictly related to the other bits that we discuss in Section 10.1.9, “Manipulating File Ownership and Permissions.” These are the set-group-id bit, the set-user-id bit, and the sticky bit. There are methods for each of these:

file = File.new("somefile")
sticky_flag = file.stat.sticky?
setgid_flag = file.stat.setgid?
setuid_flag = file.stat.setuid?

A disk file may have symbolic or hard links that refer to it (on operating systems supporting these features). To test whether a file is actually a symbolic link to some other file, use the symlink? method. To count the number of hard links associated with a file, use the nlink method (found only in File::Stat). A hard link is virtually indistinguishable from an ordinary file; in fact, it is an ordinary file that happens to have multiple names and directory entries.

File.symlink("yourfile","myfile") # Make a link
is_sym = File.symlink?("myfile") # true
hard_count = File.new("myfile").stat.nlink # 1 (no extra hard links)

Incidentally, note that in the previous example, we used the File class method symlink to create a symbolic link.

In rare cases, you may want even lower-level information about a file. The File::Stat class has three more instance methods that give you the gory details. The method dev gives you an integer identifying the device on which the file resides, rdev returns an integer specifying the kind of device, and for disk files, ino gives you the starting inode number for the file.

file = File.new("diskfile")
info = file.stat
device = info.dev
devtype = info.rdev
inode = info.ino

10.1.13 Working with Pipes

There are various ways of reading and writing pipes in Ruby. The class method IO.popen runs a command as a subprocess and hooks that subprocess’s standard input and standard output into the IO object returned. Frequently we will have different threads handling each end of the pipe; here, we just show a single thread writing and then reading:

check = IO.popen("spell","r+")
check.puts("'T was brillig, and the slithy toves")
check.puts("Did gyre and gimble in the wabe.")
check.close_write
list = check.readlines
list.collect! { |x| x.chomp }
# list is now %w[brillig gimble gyre slithy toves wabe]

Note that the close_write call is necessary. If it were not issued, we would not be able to reach the end-of-file when we read the pipe.

There is a block form that works as follows:

IO.popen("/usr/games/fortune") do |pipe|
  quote = pipe.gets
  puts quote
  # On a clean disk, you can seek forever. - Thomas Steel
end

If the string "-" is specified, a new Ruby instance is started. If a block is specified with this, the block is run as two separate processes, rather like a fork: The child gets nil passed into the block, and the parent gets an IO object with the child’s standard input and/or output connected to it.

IO.popen("-") do |mypipe|
  if mypipe
    puts "I'm the parent: pid = #{Process.pid}"
    listen = mypipe.gets
    puts listen
  else
    puts "I'm the child: pid = #{Process.pid}"
  end
end

# Prints:
# I'm the parent: pid = 10580
# I'm the child: pid = 10582

The IO.pipe method returns a pair of pipe ends connected to each other. In the following code example, we create a pair of threads and let one pass a message to the other (the first message that Samuel Morse sent over the telegraph). Refer to Chapter 13, “Threads and Concurrency,” for more information.

pipe = IO.pipe
reader = pipe[0]
writer = pipe[1]

str = nil
thread1 = Thread.new(reader,writer) do |reader,writer|
  # writer.close_write
  str = reader.gets
  reader.close
end

thread2 = Thread.new(reader,writer) do |reader,writer|
  # reader.close_read
  writer.puts("What hath God wrought?")
  writer.close
end

thread1.join
thread2.join

puts str # What hath God wrought?

10.1.14 Performing Special I/O Operations

It is possible to do lower-level I/O in Ruby. We will only mention the existence of these methods; if you need to use them, some of them will be highly machine-specific anyway (varying even between different versions of UNIX).

The ioctl method (“I/O control”) accepts two arguments. The first is an integer specifying the operation to be done. The second is either an integer or a string representing a binary number.

The fcntl method is also for low-level control of file-oriented streams in a system-dependent manner. It takes the same kinds of parameters as ioctl.

The select method (in the Kernel module) accepts up to four parameters; the first is the read-array, and the last three are optional (write-array, error-array, and the timeout value). This method allows a process to wait for an I/O opportunity. When input is available from one or more devices in the read-array, or when one or more devices in the write-array are ready, the call returns an array of three elements representing the respective arrays of devices that are ready for I/O.
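
A small sketch of select in action follows. It uses IO.pipe so that it is self-contained, and the timeouts (0 and 1 second) are arbitrary choices for the example.

```ruby
reader, writer = IO.pipe

# Nothing has been written yet, so a zero-second timeout returns nil.
p IO.select([reader], nil, nil, 0)   # nil

writer.puts "ready"

# Now the read end is ready; select returns three arrays.
readable, writable, errored = IO.select([reader], nil, nil, 1)
p readable == [reader]               # true
p writable                           # []
p errored                            # []
puts reader.gets                     # ready

reader.close
writer.close
```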

The Kernel method syscall takes at least one integer parameter (and up to nine string or integer parameters in all). The first parameter specifies the I/O operation to be done.

The fileno method returns an old-fashioned file descriptor associated with an I/O stream. This is the least system-dependent of all the methods mentioned here:

desc = $stderr.fileno # 2

10.1.15 Using Nonblocking I/O

Ruby makes a concerted effort “behind the scenes” to ensure that I/O does not block. For this reason, it is possible in most cases to use Ruby threads to manage I/O—a single thread may block on an I/O operation while another thread goes on processing.

However, those who want explicit control over blocking can use read_nonblock and write_nonblock, which perform nonblocking I/O directly. The former uses the read(2) system call internally; for this reason, it may raise exceptions corresponding to the usual set of errors, such as Errno::EWOULDBLOCK and others. It is equivalent to readpartial with the nonblocking flag set (see Section 10.1.16, “Using readpartial”).

string = input.read_nonblock(64) # read up to 64 bytes
buffer = ""
input.read_nonblock(64, buffer) # optional buffer

When end-of-file is reached, EOFError is raised. When EWOULDBLOCK is raised, you should avoid calling the method again until input is available. Here is an example:

begin
data = input.read_nonblock(256)
rescue Errno::EWOULDBLOCK
IO.select([input])
retry
end

Likewise, write_nonblock invokes the write(2) system call (and can raise corresponding exceptions). It takes a string as an argument and returns the number of bytes written (which may be smaller than the length of the string). The EWOULDBLOCK exception can be handled the same as with the read_nonblock shown previously.
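
Because write_nonblock may write only part of the string, callers usually loop until everything has been sent. Here is one possible sketch; the helper name write_all_nonblock is hypothetical, and it rescues IO::WaitWritable, the module that Ruby mixes into the EWOULDBLOCK/EAGAIN errors raised by write_nonblock:

```ruby
# Write all of `data` to `io`, retrying after partial writes.
def write_all_nonblock(io, data)
  remaining = data.b                      # work on a binary copy
  until remaining.empty?
    begin
      written = io.write_nonblock(remaining)
      remaining = remaining.byteslice(written..-1)
    rescue IO::WaitWritable               # stream is not ready for writing
      IO.select(nil, [io])                # wait until it is writable
      retry
    end
  end
  data.bytesize
end

reader, writer = IO.pipe
write_all_nonblock(writer, "What hath God wrought?\n")
writer.close
puts reader.read
```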

10.1.16 Using readpartial

The readpartial method makes I/O easier with streams, such as a socket.

The “max length” parameter is required. If the buffer parameter is specified, it should refer to a string where the data will be stored.

data = sock.readpartial(128) # Read at most 128 bytes

The readpartial method doesn’t honor the nonblocking flag. It will sometimes block, but only when three conditions are true: The IO object’s buffer is empty, the stream content is empty, and the stream has not yet reached an end-of-file condition.

So in effect, if there is data in the stream, readpartial will not block. It will read up to the maximum number of bytes specified, but if there are fewer bytes available, it will grab those and continue.

If the stream has no data, but it is at end-of-file, readpartial will immediately raise an EOFError.

If the call blocks, it waits until either it receives data or it detects an EOF condition. If it receives data, it simply returns that data. If it detects EOF, it raises an EOFError.

When sysread is called in blocking mode, its behavior is similar to that of readpartial. If the buffer is empty, their behavior is identical.
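
The rules above suggest a simple read loop: call readpartial until it raises EOFError. This is only a sketch; the helper name drain is hypothetical, and a pipe stands in for a socket:

```ruby
# Collect everything from `io` in chunks of at most `max_len` bytes.
def drain(io, max_len = 4096)
  chunks = []
  loop { chunks << io.readpartial(max_len) }
rescue EOFError
  chunks                                  # end-of-file: return what we got
end

reader, writer = IO.pipe
writer.write("What hath God wrought?")
writer.close
p drain(reader, 8)
```

Each call returns as soon as some data is available, so the chunks may be shorter than the requested maximum.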

10.1.17 Manipulating Pathnames

In manipulating pathnames, the first things to be aware of are the class methods File.dirname and File.basename; these work like the UNIX commands of the same name and return the directory name and the filename, respectively. If an extension is specified as a second parameter to basename, that extension will be removed.

str = "/home/dave/podbay.rb"
dir = File.dirname(str) # "/home/dave"
file1 = File.basename(str) # "podbay.rb"
file2 = File.basename(str,".rb") # "podbay"

Note that although these are methods of File, they are really simply doing string manipulation.

A comparable method is File.split, which returns these two components (directory and filename) in a two-element array:

info = File.split(str) # ["/home/dave","podbay.rb"]

The expand_path class method expands a relative pathname, converting to an absolute path. If the operating system understands such idioms as ~ and ~user, these will be expanded also. The optional second argument serves as a path to expand from, and is often used with the current file path, __FILE__.

Dir.chdir("/home/poole/personal/docs")
abs = File.expand_path("../../misc") # "/home/poole/misc"
abs = File.expand_path("misc", "/home/poole") # "/home/poole/misc"

Given an open file, the path instance method returns the pathname used to open the file:

File.new("../../foobar").path # "../../foobar"

The constant File::SEPARATOR gives the character used to separate pathname components. Ruby uses a forward slash on every platform, Windows included; on Windows, the constant File::ALT_SEPARATOR additionally holds the backslash. An alias is File::Separator.

The class method join uses this separator to produce a path from a list of directory components:

path = File.join("usr","local","bin","someprog")
# path is "usr/local/bin/someprog"
# Note that it doesn't put a separator on the front!

Don’t fall into the trap of thinking that File.join and File.split are somehow inverses. They’re not.
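
To see why they are not inverses, compare what each one does to the same path. In this sketch, path_components is a hypothetical helper (not part of File) that uses Pathname#each_filename to recover every component:

```ruby
require "pathname"

# Break a path into all of its components (hypothetical helper).
def path_components(path)
  Pathname.new(path).each_filename.to_a
end

joined = File.join("usr", "local", "bin", "someprog")
p File.split(joined)        # two elements: directory part and last component
p path_components(joined)   # one element per component
```

File.split only ever separates the final component from the rest, whereas File.join accepts any number of pieces.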

10.1.18 Using the Pathname Class

You should also be aware of the standard library pathname, which gives us the Pathname class. This is essentially a wrapper for Dir, File, and FileUtils; as such, it has much of the functionality of these, unified in a way that is supposed to be logical and intuitive.

require 'pathname'

path = Pathname.new("/home/hal")
file = Pathname.new("file.txt")
p2 = path + file

path.directory? # true
path.file? # false
p2.directory? # false
p2.file? # true

parts = p2.split # [Pathname:/home/hal, Pathname:file.txt]
ext = p2.extname # .txt

There are also a number of convenience methods, as you would expect. The root? method attempts to detect whether a path refers to the root directory; it can be “fooled” because it merely analyzes the string and does not access the filesystem. The parent method returns the pathname of this path’s parent. The children method returns a list of the next-level children below this path; it includes both files and directories but is not recursive:

p1 = Pathname.new("//") # odd but legal
p1.root? # true
p2 = Pathname.new("/home/poole")
p3 = p2.parent # Pathname:/home
items = p2.children # array of Pathnames (all files
# and dirs directly inside poole)

As you would expect, relative? and absolute? try to determine whether a path is relative or absolute (by looking for a leading slash):

p1 = Pathname.new("/home/dave")
p1.absolute? # true
p1.relative? # false

Many methods, such as size, unlink, and others, are actually delegated to File and FileUtils; the functionality is not reimplemented.

For more details on Pathname, consult ruby-doc.org or any good reference.

10.1.19 Command-Level File Manipulation

Often we need to manipulate files in a manner similar to the way we would at a command line. That is, we need to copy, delete, rename, and so on.

Many of these capabilities are built-in methods; a few are in the FileUtils module in the fileutils library. Be aware that FileUtils used to mix functionality directly into the File class by reopening it; now these methods stay in their own module.

To delete a file, we can use File.delete or its synonym, File.unlink:

File.delete("history")
File.unlink("toast")

To rename a file, we can use File.rename, as follows:

File.rename("Ceylon", "Sri Lanka")

File links (hard and symbolic) can be created using File.link and File.symlink, respectively:

File.link("/etc/hosts", "/etc/hostfile") # hard link
File.symlink("/etc/hosts", "/tmp/hosts") # symbolic link

We can truncate a file to zero bytes (or any other specified number) by using the truncate class method:

File.truncate("myfile",1000) # Now at most 1000 bytes

Two files may be compared by means of the compare_file method. There is an alias named cmp (and there is also compare_stream):

require "fileutils"

same = FileUtils.compare_file("alpha","beta") # true

The copy method will copy a file to a new name or location. It accepts an optional verbose: true keyword that echoes each operation to standard error. The UNIX-like name cp is an alias:

require "fileutils"

# Copy epsilon to theta, echoing the command to standard error.
FileUtils.copy("epsilon", "theta", verbose: true)

A file may be moved with the move method (alias mv). Like copy, it also has an optional verbose flag:

require "fileutils"

FileUtils.move("/tmp/names","/etc") # Move to new directory
FileUtils.move("colours","colors") # Just a rename

The safe_unlink method (an alias for rm_f) deletes the specified file or files, ignoring errors such as a file that does not exist. Several names are passed as an array, and the verbose: true option echoes the operation:

require "fileutils"

FileUtils.safe_unlink(["alpha","beta","gamma"])
# Echo the command for the next two files
FileUtils.safe_unlink(["delta","epsilon"], verbose: true)

Finally, the install method basically does a copy, except that it first checks that the file either does not exist or has different content.

require "fileutils"

FileUtils.install("foo.so","/usr/lib")
# Existing foo.so will not be overwritten
# if it is the same as the new one.
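
The following sketch pulls several of these operations together in a scratch directory so that nothing outside it is touched. The method name fileutils_demo and the filenames are hypothetical; Dir.mktmpdir comes from the tmpdir standard library:

```ruby
require "fileutils"
require "tmpdir"

# Copy, compare, move, and delete files inside a throwaway directory.
def fileutils_demo
  Dir.mktmpdir do |dir|
    original = File.join(dir, "alpha")
    copy     = File.join(dir, "beta")
    renamed  = File.join(dir, "gamma")

    File.write(original, "some text\n")
    FileUtils.cp(original, copy)
    same = FileUtils.compare_file(original, copy)  # contents identical?
    FileUtils.mv(copy, renamed)                    # just a rename
    FileUtils.rm_f([original, renamed])            # delete both files

    { same: same, left_over: Dir.entries(dir) - %w[. ..] }
  end
end

p fileutils_demo
```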

For more on FileUtils, consult ruby-doc.org or any other reference.

10.1.20 Grabbing Characters from the Keyboard

Here we use the term grabbing because we sometimes want to process a character as soon as it is pressed rather than buffer it and wait for a newline to be entered.

This can be done in both UNIX variants and Windows variants. Unfortunately, the two methods are completely unrelated to each other.

The UNIX version is straightforward. We use the well-known technique of putting the terminal in raw mode (and we usually turn off echoing at the same time):

def getchar
system("stty raw -echo") # Raw mode, no echo
char = STDIN.getc
system("stty -raw echo") # Reset terminal mode
char
end

In the Windows world, we would need to write a C extension for this. An alternative is to use a small feature of the Win32API library:

require 'Win32API'

def getchar
char = Win32API.new("crtdll", "_getch", [], 'L').Call
end

In either case, the behavior is effectively the same.

10.1.21 Reading an Entire File into Memory

To read an entire file into an array, you need not even open the file. The method IO.readlines will do this, opening and closing the file on its own:

arr = IO.readlines("myfile")
lines = arr.size
puts "myfile has #{lines} lines in it."

longest = arr.collect {|x| x.length}.max
puts "The longest line in it has #{longest} characters."

We can also use IO.read (which returns a single large string rather than an array of lines):

str = IO.read("myfile")
bytes = str.bytesize
puts "myfile has #{bytes} bytes in it."

longest = str.each_line.collect {|x| x.length}.max # iterate by line
puts "The longest line in it has #{longest} characters."

Obviously because IO is an ancestor of File, we can say File.readlines and File.read just as easily.

10.1.22 Iterating Over a File by Lines

To iterate over a file a line at a time, we can use the class method IO.foreach or the instance method each. In the former case, the file need not be opened in our code:

# Print all lines containing the word "target"
IO.foreach("somefile") do |line|
puts line if line =~ /target/
end

# Another way...
File.new("somefile").each do |line|
puts line if line =~ /target/
end

Note that each_line is an alias for each. It can also be used to get an enumerator:

lines = File.new("somefile").each_line
lines.find{|line| line =~ /target/ } # Treat as any enumerator

10.1.23 Iterating Over a File by Byte or Character

To iterate a byte at a time, use the each_byte instance method. It feeds a byte (that is, a Fixnum in the range 0..255) into the block:

a_count = 0
File.new("myfile").each_byte do |byte|
a_count += 1 if byte == 97 # lowercase a in ASCII
end

You can also iterate by character (where a character is really a one-character string). Depending on what encoding you are using, a character may be a single byte (as in ASCII) or multiple bytes.

a_count = 0
File.new("myfile").each_char do |char|
a_count += 1 if char == "a"
end

The each_char method also returns an ordinary enumerator.
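
The difference between bytes and characters shows up as soon as the text contains a multibyte character. As a small sketch (using StringIO in place of a real file so the example is self-contained, and a hypothetical helper name), we can count both:

```ruby
require "stringio"

# Count the bytes and the characters in an IO-like object built on `text`.
def byte_and_char_counts(text)
  bytes = StringIO.new(text).each_byte.count
  chars = StringIO.new(text).each_char.count
  [bytes, chars]
end

p byte_and_char_counts("abc")
p byte_and_char_counts("na\u00EFve")   # U+00EF occupies two bytes in UTF-8
```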

10.1.24 Treating a String As a File

Sometimes people want to know how to treat a string as though it were a file. The answer depends on the exact meaning of the question.

An object is defined mostly in terms of its methods. The following code shows an iterator applied to an object called source; with each iteration, a line of output is produced. Can you tell the type of source by reading this fragment?

source.each_line do |line|
puts line
end

Actually, source could be a file, or it could be a string containing embedded newlines. Therefore, in cases like these, a string can trivially be treated as a file.

The StringIO class provides many of the methods of the IO class that a regular string lacks. It also has a string accessor that refers to the contents of the string itself:

require 'stringio'

ios = StringIO.new("abcdefghijkl\nABC\n123")

ios.seek(5)
ios.puts("xyz")
puts ios.tell # 9
puts ios.string.inspect # "abcdexyz\njkl\nABC\n123"

puts ios.getc # j
ios.ungetc(?w)
puts ios.string.inspect # "abcdexyz\nwkl\nABC\n123"

s1 = ios.gets # "wkl\n"
s2 = ios.gets # "ABC\n"

10.1.25 Copying a Stream

Use the class method copy_stream for copying a stream. All the data will be dumped from the source to the destination. The source and destination may be IO objects or filenames. The third (optional) parameter is the number of bytes to be copied (defaulting, of course, to the entire source). The fourth parameter is a beginning offset (in bytes) for the source:

src = File.new("garbage.in")
dst = File.new("garbage.out", "w")
IO.copy_stream(src, dst)

IO.copy_stream("garbage.in","garbage.out", 1000, 80)
# Copy 1000 bytes to output starting at offset 80
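
To make the length and offset parameters concrete, here is a sketch that copies a slice of a small file inside a scratch directory. The helper name copy_slice is hypothetical; the filenames simply echo the ones used above:

```ruby
require "tmpdir"

# Copy `length` bytes of `text`, starting at byte `offset`, via copy_stream.
def copy_slice(text, length, offset)
  Dir.mktmpdir do |dir|
    src = File.join(dir, "garbage.in")
    dst = File.join(dir, "garbage.out")
    File.write(src, text)
    IO.copy_stream(src, dst, length, offset)
    File.read(dst)
  end
end

p copy_slice("0123456789", 4, 3)
```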

10.1.26 Working with Character Encodings

For this topic, refer to Chapter 4, “Internationalization in Ruby.” Using and manipulating character encodings for String and IO objects is covered there.

10.1.27 Reading Data Embedded in a Program

In days gone by, children learned BASIC by copying programs out of magazines. One convenient feature of this language (if any of it was convenient) was the DATA statement. The information was embedded in the program, but it could be read as if it originated outside.

Should you ever want to, you can do much the same thing in Ruby. The directive __END__ at the end of a Ruby program signals that embedded data follows. This can be read using the global constant DATA, which is an IO object like any other. (Note that the __END__ marker must be at the beginning of the line on which it appears.)

# Print each line backwards...
DATA.each_line do |line|
puts line.reverse
end
__END__
A man, a plan, a canal... Panama!
Madam, I'm Adam.
,siht gnidaer er'uoy fI
.evisserpmi si noitacided ruoy

10.1.28 Reading Program Source

Suppose you wanted to access the source of your own program. This can be done using a variation on a trick we used earlier (see Section 10.1.27, “Reading Data Embedded in a Program”).

The global constant DATA is an IO object that refers to the data following the __END__ directive. But if you do a rewind operation, it resets the file pointer to the beginning of the program source.

The following program prints itself with line numbers:

DATA.rewind
DATA.each_line.with_index do |line, i|
puts "#{'%03d' % (i + 1)} #{line.chomp}"
end
__END__

Note that the __END__ directive is necessary; without it, DATA cannot be accessed at all.

Another way to read the source of the current file is by using the special variable __FILE__. It contains the path of the current source file (as it was given to Ruby, so it may be relative), and the file can be read:

puts File.read(__FILE__)

10.1.29 Working with Temporary Files

There are many circumstances in which we need to work with files that are all but anonymous. We don’t want to trouble with naming them or making sure there is no name conflict, and we don’t want to bother with deleting them.

All these issues are addressed in the Tempfile library. The new method (alias open) takes an arbitrary name as a seed and concatenates it with the process ID and a unique sequence number. The optional second parameter is the directory to be used; it defaults to the value of environment variable TMPDIR, TMP, or TEMP, and finally the value "/tmp".

The resulting IO object may be opened and closed many times during the execution of the program. Upon termination of the program, the temporary file will be deleted.

The close method has an optional flag; if set to true, the file will be deleted immediately after it is closed (instead of waiting until program termination). The path method returns the actual pathname of the file, should you need it.

require "tempfile"

temp = Tempfile.new("stuff")
name = temp.path # "/tmp/stuff17060.0"
temp.puts "Kilroy was here"
temp.close

# Later...
temp.open
str = temp.gets # "Kilroy was here"
temp.close(true) # Delete it NOW
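
When the temporary file is needed only for the duration of one operation, the block form of Tempfile.create is a convenient alternative: it yields an open file and removes it when the block finishes. A minimal sketch, with a hypothetical helper name:

```ruby
require "tempfile"

# Write `text` to a temporary file, read it back, and report
# whether the file still exists after the block has finished.
def scratch_round_trip(text)
  path = nil
  content = Tempfile.create("stuff") do |file|
    path = file.path
    file.write(text)
    file.rewind
    file.read
  end
  [content, File.exist?(path)]
end

p scratch_round_trip("Kilroy was here\n")
```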

10.1.30 Changing and Setting the Current Directory

The current directory may be determined by the use of Dir.pwd or its alias Dir.getwd; these abbreviations historically stand for print working directory and get working directory, respectively.

The method Dir.chdir may be used to change the current directory. On Windows, the logged drive may appear at the front of the string:

Dir.chdir("/var/tmp")
puts Dir.pwd # "/var/tmp"
puts Dir.getwd # "/var/tmp"

This method also takes a block parameter. If a block is specified, the current directory is changed only while the block is executed (and restored afterward):

Dir.chdir("/home")
Dir.chdir("/tmp") do
puts Dir.pwd # /tmp
end
puts Dir.pwd # /home

10.1.31 Changing the Current Root

On most UNIX variants, it is possible to change the current process’s idea of where root or “slash” is. This is typically done to prevent code that runs later from being able to reach the entire filesystem. The chroot method sets the new root to the specified directory:

Dir.chdir("/home/guy/sandbox/tmp")
Dir.chroot("/home/guy/sandbox")
puts Dir.pwd # "/tmp"

10.1.32 Iterating Over Directory Entries

The class method foreach is an iterator that successively passes each directory entry into the block. The instance method each behaves the same way.

Dir.foreach("/tmp") { |entry| puts entry }

dir = Dir.new("/tmp")
dir.each { |entry| puts entry }

Both of these code fragments print the same output (the names of all files and subdirectories in /tmp).

10.1.33 Getting a List of Directory Entries

The class method Dir.entries returns an array of all the entries in the specified directory:

list = Dir.entries("/tmp") # %w[. .. alpha.txt beta.doc]

As shown in the preceding code, the current and parent directories are included. If you don’t want these, you'll have to remove them manually.
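
Removing them takes only an array subtraction. The helper name below is hypothetical:

```ruby
# List a directory's entries without the "." and ".." pseudo-entries.
def entries_without_dots(dir)
  Dir.entries(dir) - %w[. ..]
end

p entries_without_dots(".")
```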

10.1.34 Creating a Chain of Directories

Sometimes we want to create a chain of directories where the intermediate directories themselves don’t necessarily exist yet. At the UNIX command line, we would use mkdir -p for this.

In Ruby code, we can do this by using the FileUtils.mkpath method (also available under the names makedirs and mkdir_p):

require "fileutils"
FileUtils.mkpath("/tmp/these/dirs/need/not/exist")

10.1.35 Deleting a Directory Recursively

In the UNIX world, we can type rm -rf dir at the command line, and the entire subtree starting with dir will be deleted. Obviously, we should exercise caution in doing this.

Pathname has a method called rmtree that will accomplish this:

require 'pathname'
dir = Pathname.new("/home/poole")
dir.rmtree

There is also a method called rm_r in FileUtils that will do the same:

require 'fileutils'
FileUtils.rm_r("/home/poole")

10.1.36 Finding Files and Directories

The Dir class provides the glob method (aliased as []), which returns an array of files that match the given shell glob. In simple cases, this is often enough to find a specific file inside a given directory:

Dir.glob("*.rb") # all ruby files in the current directory
Dir["spec/**/*_spec.rb"] # all files ending _spec.rb inside spec/

For more complicated cases, the standard library find allows us to iterate over every file in a directory and all its subdirectories. Here is a method that finds files in a given directory by either filename (that is, a string) or regular expression:

require "find"

def findfiles(dir, name)
list = []
Find.find(dir) do |path|
Find.prune if [".",".."].include? path
case name
when String
list << path if File.basename(path) == name
when Regexp
list << path if File.basename(path) =~ name
else
raise ArgumentError
end
end
list
end

findfiles "/home/hal", "toc.txt"
# ["/home/hal/docs/toc.txt", "/home/hal/misc/toc.txt"]

findfiles "/home", /^[a-z]+\.doc$/
# ["/home/hal/docs/alpha.doc", "/home/guy/guide.doc",
# "/home/bill/help/readme.doc"]

Contrary to its name, the find library can be used to do any task that requires traversing a directory and its children, such as adding up the total space taken by all files in a directory.
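
As a sketch of that last idea, the following method walks a directory tree and sums the sizes of the regular files it finds. The name total_size is our own:

```ruby
require "find"

# Add up the sizes (in bytes) of all regular files under `dir`.
def total_size(dir)
  total = 0
  Find.find(dir) do |path|
    total += File.size(path) if File.file?(path)
  end
  total
end
```

It would be called as total_size("/home/hal"), for example; directories themselves are skipped, and only their contents are counted.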

10.2 Higher-Level Data Access

Frequently, we want to store some specific data for later, rather than simply write bytes to a file. In order to do this, we convert data from objects into bytes and back again, a process called serialization. There are many ways to serialize data, so we will examine only the simplest and most common formats.

The Marshal module offers simple object persistence, and the PStore library builds on that functionality. The YAML format (and Ruby library) provides another way to marshal objects, but using plaintext that is easily human readable.

JSON can only persist numbers, strings, arrays, and hashes, but can be written and read by any language. CSV files can be used to exchange tabular data with many applications, including spreadsheets such as Excel.

External databases, with a socket interface to the database server process, will be examined in the next section.

10.2.1 Simple Marshaling

The simplest way to save an object for later use is by marshaling it. The Marshal module enables programs to serialize and unserialize Ruby objects into strings, and therefore also files.

# array of elements [composer, work, minutes]
works = [["Leonard Bernstein", "Overture to Candide", 11],
["Aaron Copland", "Symphony No. 3", 45],
["Jean Sibelius", "Finlandia", 20]]

# We want to keep this for later...
File.write "store", Marshal.dump(works)

# Much later...
works = Marshal.load File.read("store")

Storing data in this way can be extremely convenient, but it can also be very dangerous. Loading marshaled data can be exploited to execute arbitrary code, not just the code of your program.

Never unmarshal data that was supplied by any external source, including users of your program. Instead, use the YAML and JSON libraries to safely read and write data provided by untrusted sources, as shown later in this section.

Marshaling also has limits: Not all objects can be dumped. Objects of system-specific classes cannot be dumped, including IO, Thread, and Binding. Anonymous and singleton classes also cannot be serialized.

Data produced by Marshal.dump includes two bytes at the beginning, a major and minor version number:

Marshal.dump("foo").bytes[0..1] # [4, 8]

Ruby will only load marshaled data with the same major version and the same or lower minor version. When the “verbose” flag is set, the versions must match exactly. The version number is incremented when the marshal format changes, but it has been stable for many years at this point.

10.2.2 “Deep Copying” with Marshal

Ruby has no “deep copy” operation. For example, using dup or clone on a hash will not copy the keys and values that the hash references. With enough nested object references, a copy operation turns into a game of Pick-Up-Sticks.

We offer here a way to handle a restricted deep copy. It is restricted because it is still based on Marshal and has the same inherent limitations:

def deep_copy(obj)
Marshal.load Marshal.dump(obj)
end

a = deep_copy(b)
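
The difference from dup becomes visible once a nested object is mutated. In this sketch, the hash and its contents are purely illustrative:

```ruby
def deep_copy(obj)
  Marshal.load(Marshal.dump(obj))
end

original = { "langs" => ["ruby"] }
shallow  = original.dup          # copies the hash, not the array inside it
deep     = deep_copy(original)   # copies the array as well

original["langs"] << "perl"      # mutate the nested array

p shallow["langs"]   # shares the array with original, so it sees the change
p deep["langs"]      # has its own array, so it does not
```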

10.2.3 More Complex Marshaling

Sometimes we want to customize our marshaling to some extent. Creating marshal_load and marshal_dump methods makes this possible. If they exist, these hooks are called when marshaling is done so that you are handling your own conversion to and from a string.

In the following example, a person has been earning 5% interest on his beginning balance since he was born. We don’t store the age and the current balance because they are a function of time:

class Person

attr_reader :balance, :name

def initialize(name, birthdate, deposit)
@name = name
@birthdate = birthdate
@deposit = deposit
@age = (Time.now - @birthdate) / (365*86400)
@balance = @deposit * (1.05 ** @age)
end

def age
@age.floor
end

def marshal_dump
{name: @name, birthdate: @birthdate, deposit: @deposit}
end

def marshal_load(data)
initialize(data[:name], data[:birthdate], data[:deposit])
end
end

p1 = Person.new("Rudy", Time.now - (14 * 365 * 86400), 100)
[p1.name, p1.age, p1.balance] # ["Rudy", 14, 197.9931599439417]

p2 = Marshal.load Marshal.dump(p1)
[p2.name, p2.age, p2.balance] # ["Rudy", 14, 197.9931599440351]

When an object of this type is saved, the age and current balance will not be stored; when the object is “reconstituted,” they will be computed. Notice how the marshal_load method assumes an existing object; this is one of the few times you might want to call initialize explicitly (just as new calls it).

10.2.4 Marshaling with YAML

YAML stands for “YAML Ain’t Markup Language,” and it was created to serve as a flexible, human-readable format that allows data to be exchanged between programs and across programming languages.

When using the yaml library, we can use the YAML.dump and YAML.load methods almost identically to those of Marshal. It is instructive to dump a few objects to see how YAML deals with them:

require 'yaml'

Person = Struct.new(:name)

puts YAML.dump("Hello, world.")
puts YAML.dump({this: "is a hash",
with: "symbol keys and string values"})
puts YAML.dump([1, 2, 3])
puts YAML.dump Person.new("Alice")

# Output:
# --- Hello, world.
# ...
# ---
# :this: is a hash
# :with: symbol keys and string values
# ---
# - 1
# - 2
# - 3
# --- !ruby/struct:Person
# name: Alice

Each YAML document begins with “---” and then contains the object(s) being serialized, converted into a human-readable text format.

Using YAML.load to load a string is similarly straightforward. In fact, YAML.load_file takes it one step further and allows us to simply supply the name of the file we want to load. Assume that we have a file named data.yml, as shown here:

---
- "Hello, world"
- 237
-
  - Jan
  - Feb
  - Mar
  - Apr
-
  just a: hash.
  This: is

This is the same as the four data items we just looked at, except they are collected into a single array. If we now load this file, we get the array back:

require 'yaml'
p YAML.load_file("data.yml")
# Output:
# ["Hello, world", 237, ["Jan", "Feb", "Mar", "Apr"],
# {"just a"=>"hash.", "This"=>"is"}]

In general, YAML is just a way to marshal objects. At a higher level, it can be used for many purposes. For example, the fact that it is human readable also makes it human editable, and it becomes a natural format for configuration files and such things.

Because YAML can marshal Ruby objects, the dangers of marshaled data apply to YAML as well. Never load YAML files that come from an external source. However, the YAML.safe_load method limits the classes that can be unmarshaled, and it can be used to load data from untrusted sources.
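
Here is a brief sketch of safe_load at work. Plain data loads normally, whereas a document that tries to instantiate a Ruby object raises Psych::DisallowedClass; the helper name load_untrusted is hypothetical, and returning nil on rejection is just one possible policy:

```ruby
require "yaml"

# Load YAML from an untrusted source, refusing arbitrary Ruby objects.
def load_untrusted(text)
  YAML.safe_load(text)
rescue Psych::DisallowedClass
  nil   # the document asked for a class that is not permitted
end

p load_untrusted("- Jan\n- Feb\n")
p load_untrusted("--- !ruby/object:Object {}\n")
```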

There is more to YAML than shown here. For further information, consult the Ruby stdlib documentation, or the official website at yaml.org.

10.2.5 Persisting Data with JSON

JSON stands for JavaScript Object Notation, but it is not tied to JavaScript: it serializes data in a human-readable format built from a few basic data types shared by almost every programming language.

Because it is so simple, readable, and widespread, it has become the preferred data format for transmitting data between programs on the Internet.

Using the JSON standard library is nearly identical to YAML, with the caveat that only hashes, arrays, numbers, strings, true, false, and nil can be dumped. Any object that is not one of those classes will be converted to a string when it is serialized:

require 'json'

data = {
string: "Hi there",
array: [1, 2, 3],
boolean: true,
object: Object.new
}

puts JSON.dump(data)
# Output: {"string":"Hi there","array":[1,2,3],
# "boolean":true,"object":"#<Object:0x007fd61b890320>"}

Converting Ruby objects into JSON and back requires writing code similar to the marshal_dump and marshal_load methods seen earlier. There is a convention, started in Ruby on Rails, of implementing an as_json method to convert an object into the data types supported by JSON.

Here is how to implement as_json and convert a Person (from Section 10.2.3, “More Complex Marshaling”) into JSON and back:

require 'json'
require 'time'

class Person
# other methods as before...

def as_json
{name: @name, birthdate: @birthdate.iso8601, deposit: @deposit}
end

def self.from_json(json)
data = JSON.parse(json)
birthdate = Time.parse(data["birthdate"])
new(data["name"], birthdate, data["deposit"])
end
end

p1 = Person.new("Rudy", Time.now - (14 * 365 * 86400), 100)
p1.as_json # {:name=>"Rudy", :deposit=>100,
# :birthdate=>"2000-07-23T23:25:02-07:00"}

p2 = Person.from_json JSON.dump(p1.as_json)
[p2.name, p2.age, p2.balance] # ["Rudy", 14, 197.9931600356966]

Because JSON cannot contain Time objects, we convert to a string using the iso8601 method, and then parse the string back into a Time when creating a new Person object.

As we’ve just illustrated, JSON cannot serialize arbitrary Ruby objects, unlike Marshal and YAML. This means that calling JSON.parse on an untrusted JSON document yields only hashes, arrays, strings, numbers, true, false, and nil, which makes JSON suitable for exchanging data with systems that are not under your control.

10.2.6 Working with CSV Data

CSV (comma-separated values) format is something you may have had to deal with if you have ever worked with spreadsheets or databases. Fortunately, Ruby includes the CSV library, written by James Edward Gray II.

The CSV library can parse or generate data in CSV format. There is no universal agreement on the exact format of CSV data, but there are some common conventions. The defaults used by the CSV library are as follows:

• The record separator is CR + LF.

• The field separator is a comma (,).

• Quote data with double quotes if it contains a CR, LF, or comma.

• Quote a double quote by prefixing it with another double quote (" -> "").

• An empty field with quotes means an empty string (data,"",data).

• An empty field without quotes means nil (data,,data).

These conventions can be adjusted.

Let’s start by creating a file. To write out comma-separated data, we can simply open a file for writing; the open method will pass a writer object into the attached block. We then use the append operator to append arrays of data (which are converted to comma-separated format upon writing). The first line will be a header:

require 'csv'

CSV.open("data.csv","w") do |wr|
wr << ["name", "age", "salary"]
wr << ["mark", "29", "34500"]
wr << ["joe", "42", "32000"]
wr << ["fred", "22", "22000"]
wr << ["jake", "25", "24000"]
wr << ["don", "32", "52000"]
end

The preceding code gives us a data file called data.csv:

name,age,salary
mark,29,34500
joe,42,32000
fred,22,22000
jake,25,24000
don,32,52000

Another program can read this file:

require 'csv'

CSV.foreach('data.csv') do |row|
p row
end

# Output:
# ["name", "age", "salary"]
# ["mark", "29", "34500"]
# ["joe", "42", "32000"]
# ["fred", "22", "22000"]
# ["jake", "25", "24000"]
# ["don", "32", "52000"]

We could also call CSV.open('data.csv', 'r') without a block; the call then returns a reader object, and we can invoke shift on the reader (as though it were an array) to retrieve the next row. However, the block-oriented way seems more straightforward.

Regardless of the method used to create the CSV files, they can then be read and modified by Excel and other spreadsheet programs.
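
Since our file begins with a header row, it can be convenient to address fields by name. The sketch below assumes the headers: true and converters: :numeric options of CSV.parse; the method name average_age is our own, and the data is passed in as a string so the example stands alone:

```ruby
require "csv"

# Compute the mean of the "age" column in CSV text that has a header row.
def average_age(csv_text)
  table = CSV.parse(csv_text, headers: true, converters: :numeric)
  ages = table.map { |row| row["age"] }   # numeric thanks to the converter
  ages.sum.to_f / ages.size
end

puts average_age("name,age,salary\nmark,29,34500\njoe,42,32000\n")
```

Without the converter, every field would come back as a string, as in the earlier output.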

10.2.7 SQLite3 for SQL Data Storage

SQLite3 is a data store for those who appreciate zero configuration software. It is a small self-contained executable, written in C, that can handle a complete database in a single file. Although it is usually used for small databases, it can deal with data up to hundreds of gigabytes.

The biggest advantage of using SQLite3 is that it provides the same query interface as other full-sized databases such as MySQL and PostgreSQL. This makes it possible to start a project using SQLite3 and then migrate to an external database server without having to rewrite the database code.

The Ruby bindings for SQLite3 are relatively straightforward. The SQLite3::Database class allows you to open a database file and execute SQL queries against it. Here is a brief piece of sample code:

require "sqlite3"

# Open a new database
db = SQLite3::Database.new("library.db")

# Create a table to store books
db.execute "create table books (
  title varchar(1024), author varchar(256) );"

# Insert records into the table
{
  "Robert Zubrin" => "The Case for Mars",
  "Alexis de Tocqueville" => "Democracy in America"
}.each do |author, title|
  db.execute "insert into books values (?, ?)", [title, author]
end

# Read records from the table using a block
db.execute("select title,author from books") do |row|
  p row
end

# Close the open database
db.close

# Output:
# ["The Case for Mars", "Robert Zubrin"]
# ["Democracy in America", "Alexis de Tocqueville"]

If a block is not specified, execute returns an array of arrays, and each row can be iterated over:

rs = db.execute("select title,author from books")
rs.each {|row| p row } # Same results as before

Several different exceptions may be raised by this library. All are subclasses of SQLite3::Exception, so it is easy to catch any or all of them.

Although the sqlite3 library is fairly full featured, SQLite itself does not completely implement the SQL92 standard. Use PostgreSQL or MySQL (examined in the next section) if you need the features of a full SQL database.

For more information on SQLite3, see sqlite.org. For more information on the Ruby bindings, see the project’s online documentation at github.com/sparklemotion/sqlite3-ruby.

10.3 Connecting to External Data Stores

External data stores run in their own process, and they accept connections from clients over a socket. The clients send queries over this connection, and the data stores reply with results.

Data stores are able to hold only basic data types, but can provide advantages in flexibility, shared access, and speed. Libraries exist to connect to almost any data store that exists, including SQL databases, key-value stores, and document stores.

In this section, we will look at the Ruby libraries for the three most widely used data stores: MySQL, PostgreSQL, and Redis.

10.3.1 Connecting to MySQL Databases

Ruby’s MySQL interface is among the most stable and fully functional of its database interfaces. After installing both Ruby and MySQL, install the mysql2 gem using either gem install or bundle install.

There are three steps to using the gem after you have it installed. First, load the gem in your script; then connect to the database; finally, work with your tables. Connecting requires the usual parameters of host, username, password, database, and so on.

The module is composed of two classes (Mysql2::Client and Mysql2::Result), as described in the README. We summarize some useful methods here, but you can always find more information in the actual documentation.

The class method Mysql2::Client.new takes a hash of connection options, all of them optional, and returns a client object. The most useful options are host, username, password, port, and database. Use the query method on the client object to interact with the database:

require 'mysql2'

client = Mysql2::Client.new(
  :host => "localhost",
  :username => "root"
)

# Create a database
client.query("CREATE DATABASE mailing_list")

# Use the database
client.query("USE mailing_list")

If the database already exists, you can skip the USE statement by passing the :database option to Mysql2::Client.new. Once a database is selected, we can create a table and insert some rows:

# Create a table to hold names and email addresses
client.query("CREATE TABLE members (
  name varchar(1024), email varchar(1024))")

# Insert two mailing list member rows
client.query <<-SQL
  INSERT INTO members VALUES
  ('John Doe', 'jdoe@rubynewbie.com'),
  ('Fred Smith', 'smithf@rubyexpert.com')
SQL

When inserting data, keep in mind that values are not escaped! Malicious users might provide data specifically crafted to delete your database or even grant administrator access. Use the escape method on data before you insert it to prevent this.

escaped = ["Bob Howard", "bofh@laundry.gov.uk"].map do |value|
  "'#{client.escape(value)}'"
end
client.query("INSERT INTO members VALUES (#{escaped.join(",")})")
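To make the risk concrete, the following sketch builds an INSERT statement by plain string interpolation and never touches a database. The hostile input and the variable names are invented for illustration.

```ruby
# Hypothetical hostile input; nothing here talks to a database.
name  = "x', 'y'); DROP TABLE members; --"
email = "attacker@example.com"

# Unescaped interpolation: the quote inside name ends the string literal
# early, so the rest of the input is read as SQL rather than as data.
unsafe_sql = "INSERT INTO members VALUES ('#{name}', '#{email}')"
puts unsafe_sql
# INSERT INTO members VALUES ('x', 'y'); DROP TABLE members; --', 'attacker@example.com')
```

Whether a particular payload actually succeeds depends on client and server settings, but in every case the statement is no longer the one the programmer wrote. Escaping each value keeps the embedded quote inside the string literal.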

Queries that yield results will return a Mysql2::Result instance that mixes in Enumerable. The result rows can then be iterated over with each or any other Enumerable method.

# Query for the data and return each row as a hash
client.query("SELECT * from members").each do |member|
  puts "Name: #{member["name"]}, Email: #{member["email"]}"
end
# Output:
# Name: John Doe, Email: jdoe@rubynewbie.com
# Name: Fred Smith, Email: smithf@rubyexpert.com
# Name: Bob Howard, Email: bofh@laundry.gov.uk

MySQL data types are fully supported, and strings, numbers, times, and other types are automatically converted into instances of the corresponding Ruby classes.

The Result yields each result row as a hash. Each hash has string keys by default, but can have symbol keys if the query method is passed the option :symbolize_keys => true.

Other useful Result methods include count (the number of rows returned) and fields (an array of column names in the results). The each method can also yield each result row as an array of values if the :as => :array option is passed.

# Query for the data and return each row as an array
result = client.query("SELECT * FROM members")
puts "#{result.count} records"
puts result.fields.join(" - ")
result.each(:as => :array){|m| puts m.join(" - ") }
# Output:
# 3 records
# name - email
# John Doe - jdoe@rubynewbie.com
# Fred Smith - smithf@rubyexpert.com
# Bob Howard - bofh@laundry.gov.uk

The names of existing databases can be found using the MySQL-specific query SHOW DATABASES, like so:

client.query("SHOW DATABASES").to_a(:as => :array).flatten
# ["information_schema", "mailing_list", "mysql",
# "performance_schema", "test"]

Finally, the connection to the database can be closed via the close method:

# Close the connection to the database
client.close

As usual, we have only covered a fraction of what is possible. For more information, see the MySQL website at mysql.com, and the mysql2 gem website at github.com/brianmario/mysql2.

10.3.2 Connecting to PostgreSQL Databases

The pg gem provides an interface to connect to PostgreSQL databases, and it can be installed with gem install or bundle install after PostgreSQL and Ruby have been installed.

As with other database adapters, load the module, connect to the database, and then do your work with the tables. The PG module provides the connect method to obtain a connection to the database.

The connection provides the exec method (aliased as query) to run SQL queries against the database:

require 'pg'

# Create the database
PG.connect.exec("CREATE DATABASE pets")

# Open a connection to the created database
conn = PG.connect(dbname: "pets")

Note that, unlike MySQL, changing databases requires creating an entirely new connection, rather than executing a query.

Sending a query using the pg gem is very similar to any other SQL library:

# Create a table and insert some data
conn.exec("CREATE TABLE pets (name varchar(255),
  species varchar(255), birthday date)")
conn.exec <<-SQL
  INSERT INTO pets VALUES
  ('Spooky', 'cat', '2008-10-03'),
  ('Spot', 'dog', '2004-06-10')
SQL

Escaping values as they are inserted is greatly simplified by the exec_params method, which can escape values as they are provided:

# Escape data as it is inserted
require 'date'
conn.exec_params("INSERT INTO pets VALUES ($1, $2, $3)",
  ['Maru', 'cat', Date.today.to_s])

Unlike the mysql2 gem, however, the pg gem does not convert query results into their corresponding Ruby classes. All results are provided as strings and must be converted if necessary. Here, we use Date.parse on each result’s date field:

# Query for the data
res = conn.query("SELECT * FROM pets")
puts "There are #{res.count} pets."
res.each do |pet|
  name = pet["name"]
  age = (Date.today - Date.parse(pet["birthday"]))/365
  species = pet["species"]
  puts "#{name} is a #{age.floor}-year-old #{species}."
end

# Output:
# There are 3 pets.
# Spooky is a 5-year-old cat.
# Spot is a 10-year-old dog.
# Maru is a 0-year-old cat.
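The conversion step can be tried without a database. In this sketch a hand-written hash stands in for one row yielded by each, typed_pet is a hypothetical helper, and "today" is pinned to a fixed date so the result is repeatable.

```ruby
require 'date'

# Stand-in for one result row: every value arrives as a String.
row = { "name" => "Spooky", "species" => "cat", "birthday" => "2008-10-03" }

# Hypothetical helper: convert the fields we care about into Ruby objects.
def typed_pet(row)
  {
    name:     row["name"],
    species:  row["species"],
    birthday: Date.parse(row["birthday"])
  }
end

pet   = typed_pet(row)
today = Date.new(2014, 6, 1)   # fixed date, an assumption for this sketch
age   = ((today - pet[:birthday]) / 365).floor

puts "#{pet[:name]} is a #{age}-year-old #{pet[:species]}."
# Spooky is a 5-year-old cat.
```

Dividing by 365 ignores leap days, which is acceptable for a rough age but not for exact date arithmetic.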

Whereas the each method always yields result rows as hashes with string keys, the PG::Result instance returned by the query also provides the values method, which returns all rows as an array of arrays. Other useful methods include count, for the number of result rows, and fields, for an array of field names.

As always, see the PostgreSQL website at postgresql.org and the pg gem’s website at bitbucket.org/ged/ruby-pg for further details and documentation.

10.3.3 Object-Relational Mappers (ORMs)

The traditional relational database is good at what it does. It handles queries in an efficient way without foreknowledge of the nature of those ad hoc queries. But this model is not very object oriented, especially in cases where all result fields are returned as strings.

The ubiquity of both these models (RDBMS and OOP) and the “impedance mismatch” between them has led many people to try to bridge this gap. The software bridge that accomplishes this is called an Object-Relational Mapper (ORM).

There are many ways of approaching this problem. All have their advantages and disadvantages. Here, we’ll take a short look at ActiveRecord, probably the best known of these.

The ActiveRecord library for Ruby is named after Martin Fowler’s “Active Record” design pattern. In the pattern, an Active Record “wraps a row in a database table or view, encapsulates the database access, and adds domain logic on that data” (see Patterns of Enterprise Application Architecture by Martin Fowler, Addison-Wesley, 2003).

The activerecord gem does this with a Ruby class for each database table. It can make some queries without SQL, it represents each row as an instance of the class, and it allows domain logic to be added to each class. Classes are named after tables and descend from the ActiveRecord::Base class. Only one database connection is required for all classes.

Here is a short example of how it all works:

require 'active_record'

class Pet < ActiveRecord::Base
end

ActiveRecord::Base.establish_connection(
  :adapter => "postgresql", :database => "pets")

snoopy = Pet.new(name: "Snoopy", species: "dog")
snoopy.birthday = Date.new(1950, 10, 4)
snoopy.save

p Pet.all.map {|pet| pet.birthday.to_s }

As you can see, it can infer table names, provide attribute accessors based on column names, and execute queries in an object-oriented manner. It also provides the same API, including data type conversion, with any underlying SQL database.
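The pattern itself is small enough to sketch in plain Ruby. The toy below is not how the activerecord gem is implemented: ToyRecord and ToyPet are hypothetical names, the "table" is just an in-memory array, and the table-name rule is deliberately naive. It only illustrates the idea of one class per table, one object per row, and domain logic living alongside persistence.

```ruby
# Toy illustration of the Active Record pattern (not the activerecord gem).
class ToyRecord
  # Each subclass gets its own in-memory "table": an array of row hashes.
  def self.rows
    @rows ||= []
  end

  # Naively infer a table name from the class name, e.g. ToyPet -> "toypets".
  def self.table_name
    name.downcase + "s"
  end

  # Wrap every stored row in an instance of the class.
  def self.all
    rows.map { |row| new(row) }
  end

  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes.dup
  end

  # Persistence lives on the object itself.
  def save
    self.class.rows << attributes.dup
    true
  end
end

class ToyPet < ToyRecord
  def name
    attributes[:name]
  end

  def species
    attributes[:species]
  end

  # Domain logic added to the class that wraps the row.
  def description
    "#{name} the #{species}"
  end
end

snoopy = ToyPet.new(name: "Snoopy", species: "dog")
snoopy.save

descriptions = ToyPet.all.map(&:description)
p ToyPet.table_name   # "toypets"
p descriptions        # ["Snoopy the dog"]
```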

The activerecord API is rich and complex, and entire books have been written about how to use it. For more information, consult any reference.

10.3.4 Connecting to Redis Data Stores

Although it is not a database (SQL or otherwise), the Redis data store has become extremely popular in recent years. It provides key-value storage, and allows values to contain strings, hashes, lists, sets, and sorted sets.

Although Redis makes a bad permanent data store, it makes an excellent place to cache data. It can be an excellent way to handle the results of expensive calculations that can be derived from permanently stored data.
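A common shape for this kind of caching is "look up the key, and compute and store the value only on a miss." The sketch below assumes a store object that responds to get and set, as a client from the redis gem does. FakeStore is a stand-in written for this example so that it runs without a server, and fetch_cached is a hypothetical helper name.

```ruby
# Stand-in for a Redis client so the sketch runs without a server.
# It mimics only get and set, storing values as strings.
class FakeStore
  def initialize
    @data = {}
  end

  def get(key)
    @data[key]
  end

  def set(key, value)
    @data[key] = value.to_s
    "OK"
  end
end

# Hypothetical helper: return the cached value, or compute and store it.
def fetch_cached(store, key)
  cached = store.get(key)
  return cached unless cached.nil?

  value = yield.to_s          # cached values come back as strings
  store.set(key, value)
  value
end

store = FakeStore.new
calls = 0
first  = fetch_cached(store, "total_salary") { calls += 1; 164_500 }
second = fetch_cached(store, "total_salary") { calls += 1; 164_500 }

p [first, second, calls]   # ["164500", "164500", 1]
```

The block runs only once; the second call is served from the store. Because the value can always be recomputed from the permanent data, losing the cache costs time but not information.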

Connecting to a Redis server using the redis gem is extremely easy. When the Redis server is running on the same machine, no parameters are required.

require 'redis'
r = Redis.new

One useful storage type is the set. Redis can take the difference, intersection, and union of multiple sets quickly, and either return the result or store the result in another key.

# Store pet names in a set
r.sadd("pets", "spooky")
# Get the members of a set
r.smembers("pets") # ["spooky"]

The hash type allows multiple fields to be read and written beneath a single key, either one at a time or all at once.

# Use pet name as key for a hash of other attributes
r.hset("spooky", "species", "cat")
r.hset("spooky", "birthday", "2008-10-03")

# Get a single hash value
r.hget("spooky", "species") # "cat"
# Get the entire hash
r.hgetall("spooky") # {"species"=>"cat", "birthday"=>"2008-10-03"}

The sorted set associates a score with each value, and it provides commands to increment and decrement scores, as well as to retrieve values sorted by score.

# Use a sorted set to store pets by weight
r.zadd("pet_weights", 6, "spooky")
r.zadd("pet_weights", 12, "spot")
r.zadd("pet_weights", 2, "maru")

# Retrieve the first value from the set sorted highest to lowest
r.zrevrange("pet_weights", 0, 0) # => ["spot"]

Many more complex and optimized uses can be made of Redis by leveraging the various data types it provides. See the Redis website at redis.io for thorough documentation of every data type and command available.

10.4 Conclusion

This chapter provided an overview of I/O in Ruby. We’ve looked at the IO class itself and its descendant File, along with other related classes such as Dir and Pathname. We’ve seen some useful tricks for manipulating IO objects and files.

We’ve also looked at data storage at a slightly higher level—storing data externally as marshaled and serialized objects. Finally, we’ve had a short overview of the true database solutions available in Ruby, along with some OOP techniques for interacting with them in easier ways.

Later, we will look at I/O again from the perspective of sockets and network programming. But before getting too deep into that, let’s examine some other topics.