3. Collections

It’s tempting to think of Ruby’s collection classes as containers. If you squint and focus on the two rock stars of the core library, Array and Hash, this is certainly true. But what about the often used Range class? Is it a container? Does having an each method and including the Enumerable module make a class a container?

Alright, I might have stretched that a bit far. You already know that Ruby has several classes which are clearly not containers but use Enumerable. That’s why we throw around the term “collection” which is sometimes synonymous with “container”, but obviously not always.

One of the first classes you usually play around with while learning Ruby is either Array or Hash. That’s because they’re really simple to use, easy to understand, yet no less powerful than their counterparts in other languages. In this chapter I’ll spend a little bit of time with these classes, but only long enough to show off some features you might have missed. Mostly though, I’ll focus on staying out of trouble when using the collection classes and some Ruby features which tend to bite beginner and intermediate Ruby programmers.

Speaking of things that might trip you up, newcomers to Ruby often shy away from the confusing reduce method (also known as inject). I shed some light on that misunderstood Enumerable method and hopefully give you a new trick for your toolbox along the way. You’ll be folding collections like the pros, which, believe it or not, is pretty useful. How useful? How about incrementally converting an array into a hash without needing any temporary variables? You know, fun stuff.

Item 16: Duplicate Collections Passed as Arguments Before Mutating Them

As you know, the collection classes in Ruby hold a group of objects, but what does that really entail? When you insert an object into an array, what are you actually inserting? A copy of the object? The original? The answer actually depends on the class of the object.

Most objects are passed around as references and not as actual values. When these types of objects are inserted into a container the collection class is actually storing a reference to the object and not the object itself. (The notable exception to the rule is the Fixnum class whose objects are always passed by value and not by reference.)

The same is true when objects are passed as method arguments. The method will receive a reference to the object and not a new copy. This is great for efficiency but has a startling implication. The act of modifying an object is observable by any code which holds a reference to it. Mutating an element of an array affects the original object which may still be available outside of the array. Similarly, when a method alters one of its arguments, those changes are visible outside of the method. Sometimes that’s the entire point of the method and its author was kind enough to suffix the method name with “!”. Most of the time, however, you really don’t want your objects to be mutated behind your back.

Collections (such as arrays and hashes) tend to be modified heavily during their lifetime. A common mistake among new Ruby programmers is to mutate a collection which was passed as an argument to a method, often without realizing that the original collection will also be modified. Throw in a little indirection and it’s not hard to see how something like this can happen.

As an example, consider the following class which represents a radio tuner. The class supports the notion of station presets—commonly found in car stereos—by which a user can jump to a favorite frequency with a single button press. When a Tuner object is initialized it’s given an array of preset stations which it then sanitizes to ensure all of the frequencies are valid.

class Tuner
  def initialize (presets)
    @presets = presets
    clean
  end

  private

  def clean
    # Valid frequencies end in odd digits.
    @presets.delete_if {|f| f[-1].to_i.even?}
  end
end

irb> presets = %w(90.1 106.2 88.5)
---> ["90.1", "106.2", "88.5"]

irb> tuner = Tuner.new(presets)
---> #<Tuner @presets=["90.1", "88.5"]>

irb> presets
---> ["90.1", "88.5"]

If someone using the Tuner class expected the array passed to new to remain unchanged, well, they’ll be in for a surprise. Mutating arguments which are given to a method is an easy mistake to make, especially when the changes are made by nested method calls like in the Tuner class. It’s even more troublesome when the internal state of one object is passed as an argument to another and then subsequently modified. These types of bugs often go unnoticed and untested due to complex interactions between objects.

It might seem like the best course of action would be to simply avoid mutating method arguments in the first place. For some classes this might actually be the easiest approach. The Tuner class, for example, could use Array#reject instead of Array#delete_if. This would work because reject returns a new array (with possibly fewer elements than the original) instead of mutating the receiver. Updating Tuner to use this strategy is fairly trivial.

class Tuner
  def initialize (presets)
    @presets = clean(presets)
  end

  private

  def clean (presets)
    presets.reject {|f| f[-1].to_i.even?}
  end
end

This time, the array given to Tuner::new is not the same array which is kept in @presets. Unfortunately, things aren’t always this simple. What if you don’t need to mutate the collection up front, during initialization, but later on in other instance methods? If you mistakenly take an argument given to initialize and stash it away in an instance variable you’ll still run into trouble if you later modify it. What you really want in these situations is a copy of the collection instead of a reference to the original. Then you’re free to mutate it to your heart’s content. But hang with me for a few more moments so you’re not surprised by how object copying works.

Ruby comes with two methods which can create copies of objects, dup and clone. Both create new objects based on the receiver, but clone preserves two additional features of the original object which dup does not. First, clone respects the frozen state of the receiver. If the original object is frozen then the copy will also be frozen. This is different from dup, which never returns a frozen object. Second, if the receiver has singleton methods clone will also duplicate the singleton class. Because dup doesn’t do this, the original object and the copy may not respond to the same messages.
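
To see both differences in action, here’s a short IRB session using a throwaway string (a minimal sketch; the behavior matches MRI, though output formatting may vary by version):

irb> s = "cake"

irb> def s.shout; upcase; end

irb> s.freeze

irb> s.clone.frozen?
---> true

irb> s.dup.frozen?
---> false

irb> s.clone.respond_to?(:shout)
---> true

irb> s.dup.respond_to?(:shout)
---> false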

Usually you’ll want to use dup instead of clone, especially when you plan on mutating the resulting object. Frozen objects cannot be modified nor unfrozen, therefore clone could potentially return an immutable object. Since we want to mutate the new object, dup is the way to go. Now we can resurrect the first Tuner implementation and fix it using dup:

class Tuner
  def initialize (presets)
    @presets = presets.dup
    clean # Modifies the duplicate.
  end

  ...
end

It’s a common pattern to duplicate collection objects given as method arguments. Watch out though, it comes with a warning label. You need to be aware that dup and clone return shallow copies. For collection classes like Array this means the container is duplicated but not the elements. You can add and remove elements without affecting the original collection, but the same can’t be said of the elements themselves. The original collection and the copy both reference the same element objects. Modifying an element will be visible in both collections and to the outside world.

irb> a = ["Polar"]

irb> b = a.dup << "Bear"
---> ["Polar", "Bear"]

irb> b.each {|x| x.sub!('lar', 'oh')}
---> ["Pooh", "Bear"]

irb> a
---> ["Pooh"]

When writing your own class you can override the initialize_copy method to control the depth of the duplication process. If you need deep copies of the existing collection classes, however, you’ll have to roll your own. Thankfully, there’s a quick solution which works the majority of the time. You can use the Marshal class to serialize and then deserialize the collection and its elements:

irb> a = ["Monkey", "Brains"]

irb> b = Marshal.load(Marshal.dump(a))

irb> b.each(&:upcase!); b.first
---> "MONKEY"

irb> a.last
---> "Brains"

The Marshal trick comes with its fair share of limitations. Besides the time it takes to serialize and deserialize an object, you also have to consider the amount of memory needed. The duplicated object will of course occupy its own memory space, but so will the serialized byte stream which is created by Marshal::dump. Dumping and loading large objects can certainly turn your program into a memory hog.

A potentially even bigger problem is that not all objects can be serialized with Marshal. Objects which contain closures or those with singleton methods cannot be serialized. Some of the core Ruby classes also don’t work with Marshal; these include the IO and File classes, for example. In all these cases a TypeError exception will be raised from Marshal::dump.

This might seem like a lot of trouble to go through (and you’d be correct of course), but you’ll rarely need to turn to Marshal in order to create deep copies of objects. Most often, just duplicating the container is enough to keep the surprise factor low. Knowing these limitations, however, will keep you out of trouble when you do need deep copies.
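
As for rolling your own, the initialize_copy hook mentioned earlier is the place to control copy depth in your own classes. Here’s a minimal sketch using a hypothetical Playlist class (an illustration of the technique, not code from this book) that deepens the copy by one level, duplicating its elements as well as its container:

class Playlist
  def initialize (tracks)
    @tracks = tracks.dup
  end

  # dup and clone copy instance variables by reference and then invoke
  # initialize_copy on the new object, so @tracks still points at the
  # original array here. Swap it for an array of duplicated elements.
  def initialize_copy (other)
    super
    @tracks = @tracks.map(&:dup)
  end
end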

Things to Remember

• Method arguments in Ruby are passed as references, not values. Notable exceptions to this rule are Fixnum objects.

• Duplicate collections passed as arguments before mutating them.

• The dup and clone methods only create shallow copies.

• For most objects, Marshal can be used to create deep copies when needed.

Item 17: Use the Array Method to Convert nil and Scalar Objects into Arrays

In statically typed languages like C++ you can have several implementations of a single method just so long as they differ in their arguments. While this isn’t something we can do directly in Ruby, it’s often emulated using duck typing. That is, we write a single method which accepts any kind of object as an argument, as long as that argument responds to an expected set of messages. In other words, Ruby programmers prefer to work with interfaces over types. Unfortunately, that’s not always possible.

We often need to be able to handle situations where a variable isn’t what we expected it to be. This tends to happen a lot with arrays. The implementation of a method might expect an array as an argument but the caller wants to pass in a single object. Sometimes that single object is nil when it really should be an empty array. The same thing can happen with the return value of a method. It might return a single object, an array, or even nil. If you’re like me you’d like to handle all of these situations uniformly.

Consider for a moment a class used for ordering pizza:

class Pizza
  def initialize (toppings)
    toppings.each do |topping|
      add_and_price_topping(topping)
    end
  end

  ...
end

The initialize method expects an array of toppings but we’d like to be able to pass in a single topping, or possibly nil when there are no toppings. You might be tempted to reach for variable-length argument lists by changing the argument to *toppings, thus collapsing all arguments into a single array. While this allows us to pass in a single topping, an array of toppings can no longer be given directly without using “*” to explode the array as it’s passed to initialize. All we’ve done is traded one problem for another. A better solution is to convert the incoming argument into an array so we know for sure what we’re working with.
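
Before reaching for that solution, here’s a quick IRB illustration of the splat trade-off using a hypothetical order method (just a sketch to show the behavior):

irb> def order (*toppings); toppings; end

irb> order('pepperoni')
---> ["pepperoni"]

irb> order(['pepperoni', 'olives'])
---> [["pepperoni", "olives"]]

irb> order(*['pepperoni', 'olives'])
---> ["pepperoni", "olives"]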

Tucked away in the Kernel module is a method with an unusual name, Array. That’s right, a method name which starts with an uppercase letter. While Ruby is nearly draconian when it comes to the names of constants, it’s much less restrictive with method names. This is also one of those times where parentheses really help clarify what’s going on because it’s easy to confuse the Array class and the Array method. So, what does this method do? Unsurprisingly, it converts its argument to an array. The great thing about this method is how it converts different objects into arrays. Most objects are converted into an array which contains one element. It’s also smart enough to turn nil values into empty arrays. See for yourself:

irb> Array('Betelgeuse')
---> ["Betelgeuse"]

irb> Array(nil)
---> []

irb> Array(['Nadroj', 'Retep'])
---> ["Nadroj", "Retep"]

As you can see from the last example, if the argument is already an array then the Array method simply passes it through as its return value. The way other objects are converted is pretty straightforward as well. If the argument to Array responds to either the to_ary or to_a messages then that method is called. If neither method can be called on the argument then Array wraps the argument in a new array and returns this single-element array. Most of the time this works exactly as you’d expect, but look what happens when you pass the Array method a hash:

irb> h = {pepperoni: 20, jalapenos: 2}

irb> Array(h)
---> [[:pepperoni, 20], [:jalapenos, 2]]

Each key-value pair is converted into a 2-element array which Hash#to_a then nests into another array. This might not seem very useful but it does completely represent the source hash and the resulting array can be used to reconstruct the original hash using the Hash::[] method. While Hash#to_a and Hash::[] might be great for converting a hash to and from an array they unfortunately present a problem with our use of the Array method. That is, they’re incompatible with this technique. If you want to write a method which can either accept an array of hashes or a single hash you’ll have to resort to something other than using the Array method. But if you’re dealing with an array of hashes, perhaps you should consider the advice in Item 10.
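
If you do need to accept either a single hash or an array of hashes, one possible workaround (my own sketch, not part of the Array method) is to special-case Hash before falling back to Array:

# Hypothetical helper: wrap a lone Hash, otherwise defer to Array().
def hashes_from (arg)
  arg.is_a?(Hash) ? [arg] : Array(arg)
end

hashes_from(nil)               # => []
hashes_from({pepperoni: 20})   # => [{:pepperoni=>20}]
hashes_from([{pepperoni: 20}]) # => [{:pepperoni=>20}]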

Back to pizza toppings, we can now simplify the Pizza initialization interface using the Array method trick:

class Pizza
  def initialize (toppings)
    Array(toppings).each do |topping|
      add_and_price_topping(topping)
    end
  end

  ...
end

With a really small change to the initialize method it can now take an array of toppings, a single topping, or even no toppings at all (nil or []). Potentially more important, the change makes the code more robust in the presence of a nil value which might accidentally wind up as the argument to initialize. That’s a nice win all by itself.

Things to Remember

• Use the Array method to convert nil and scalar objects into arrays.

• Don’t pass a Hash to the Array method; it will get converted into a set of nested arrays.

Item 18: Consider Set for Efficient Element Inclusion Checking

There’s no doubt that you’re familiar with and most likely very comfortable using the core collection classes. Both Array and Hash are integral to everyday Ruby programming, and let’s not forget Range. All three classes are woven into the language with dedicated syntax for creating objects from literals. Being part of the core library also means that you don’t have to do anything special before using them, no library to load or file to require. These classes are so prevalent in Ruby programs that sometimes they’re used even when a more appropriate class for the task at hand is available.

Part of the problem might be that Ruby ships with two different libraries. The core library is pre-loaded in every program and includes most of the famous collection classes you’ve grown to love. Then there’s the standard library. It’s quite a bit bigger than the core library, so it’s not loaded automatically. Before you can use any class from the standard library you have to require the correct file. This one extra step seems to make a big difference when it comes to discovering the wealth of classes and modules in the standard library.

Take the Set class for example. It’s a very useful collection class but since it’s part of the standard library you have to require a file before using it. Based on its name alone you probably already have a good idea of what it does. It’s a lot like its mathematical cousin, a collection of unordered unique elements with operations such as union, intersection, and subset/superset testing. While there are definitely times when it’s handy to have a mathematical set class available so easily, I want to talk about one specific feature of Set, element inclusion testing.
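
If you haven’t played with it before, here’s a taste of those operations in IRB (the output comes from Set#inspect and may differ slightly between Ruby versions):

irb> require('set')
---> true

irb> Set.new([1, 2, 3]) | Set.new([3, 4])
---> #<Set: {1, 2, 3, 4}>

irb> Set.new([1, 2, 3]) & Set.new([2, 3, 4])
---> #<Set: {2, 3}>

irb> Set.new([1, 2]).subset?(Set.new([1, 2, 3]))
---> true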

As a way to play with sets, let’s explore a watered down version of a class which provides role-based authorization. Typically, a role represents a grouping of permissions. Before a user can perform an action you consult all roles associated with the user to see if any of them allow the action. Consider this very simple implementation which stores the collection of permissions in an array:

class Role
  def initialize (name, permissions)
    @name, @permissions = name, permissions
  end

  def can? (permission)
    @permissions.include?(permission)
  end
end

There are two parts to this implementation. The Role class expects to be given an array of permissions and in return it provides a can? method which can be used to test if the role supports a specific permission. The implementation is grossly oversimplified, but bear with me for just a bit.

In the initialize method this array of permissions is stashed away in the @permissions instance variable without having to convert it or manipulate it in any way. This makes it easy to write the can? method since we just need to see if a permission is an element of the array using the include? method. You might have noticed that the array is only being used for its ability to be a container of elements. The @permissions variable could really be any container which supports the include? method. As a matter of fact, out of all the collection classes, Array has the worst performing include? method with a time complexity of O(n). As the number of permissions in the array grows so does the time needed to check if an element is contained within it. This also means that each time the can? method is called to check a permission, and that permission isn’t in the @permissions array, the entire array needs to be consulted before we can return false. We can do better than that, much better.

After an array gets big enough that include? becomes a real performance concern, most Ruby programmers turn to Hash. While the hash requires more memory for its internal data structure, accessing elements takes roughly constant time. That’s a big improvement over Array. The only (slightly) annoying thing about using a hash in the Role class is its construction. Hashes store keys and values, but we don’t have any values to store; we just have a list of permissions. When using a hash this way, it’s common practice to use true as the value so that every key in the hash references the same immutable object. Here’s an updated Role class which uses Hash instead of Array:

class Role
  def initialize (name, permissions)
    @name = name
    @permissions = Hash[permissions.map {|p| [p, true]}]
  end

  def can? (permission)
    @permissions.include?(permission)
  end
end

The only change is in the initialize method where we need to convert the incoming array into a hash. There are two trade-offs that you need to be aware of. First, since Hash only stores unique keys, any duplicates in the array will be lost when converted into a hash. In this case the loss is completely reasonable since duplicate permissions in a role are nonsensical. If you want to preserve duplicates then this technique isn’t appropriate.

Second, in order to convert the array into a hash we need to map over the entire array and construct an even bigger array which then gets fed to Hash::[]. This shifts the performance burden from the can? method to the initialize method with the assumption that can? is called more often than initialize. If Role objects are created and destroyed with the can? method only ever being called a small number of times, we’re probably not saving any time by trying to optimize with Hash. If, on the other hand, the hash was incrementally built over time instead of all at once in initialize, this would still be a good technique. (A good way to incrementally build a hash like this is using the reduce method, which is discussed in Item 19.)

Converting the array of permissions into a hash is pretty simple, if not just a tiny bit ugly with all those braces. And just as with the Array version before it, we’re not really using any of the unique features which Hash provides. That is, we’re only using it to test for element inclusion. Wouldn’t it be nice to abstract away the hash construction code in initialize and clean this code up a bit? I sure think so, and that’s where Set comes in.

require('set')

class Role
  def initialize (name, permissions)
    @name, @permissions = name, Set.new(permissions)
  end

  def can? (permission)
    @permissions.include?(permission)
  end
end

This version of the Role class is very similar to the first version which used the Array class. This is mostly due to the fact that Set objects can be constructed from any collection object or enumerator. Essentially, Set::new is doing the conversion for us (and incrementally too). On the plus side, the Set version performs nearly as well as the Hash version. Internally the Set class stores its elements in a Hash. This means that the trade-offs for using Hash also apply when using Set, namely that duplicate elements are lost and the original array needs to be converted into a hash. There’s also a requirement that objects stored in the set are well behaved as hash keys. It’s the same requirement the previous version of the Role class had; it’s just implicit this time around.

Since the array given to Set::new must be converted into a hash, it might not be obvious that using Set in this case is worthwhile. Again, if the number of calls to can? doesn’t exceed the number of calls to initialize it’s probably not worth the conversion. If you don’t need to perform a conversion but need to test for element inclusion often then Set can be a real win for performance and ease of use. To illustrate this let’s go back to the example code from Item 10.

As a brief reminder, the example class processed weather data from a CSV file. Each row in the CSV file represents a single month of data and is stored in a Reading struct which is then inserted into an array. To make things more interesting, let’s say that the CSV file might contain duplicate rows. When encountering a duplicate we’d like to keep the most recently read row, discarding the previous value. Guess what, we only need to make two simple changes to the existing code. The Reading class needs to be changed so its objects can be used as hash keys and the AnnualWeather#initialize method needs to be changed to use Set instead of Array. See for yourself:

require('set')

class AnnualWeather
  Reading = Struct.new(:date, :high, :low) do
    def eql? (other) date.eql?(other.date); end
    def hash; date.hash; end
  end

  def initialize (file_name)
    @readings = Set.new

    CSV.foreach(file_name, headers: true) do |row|
      @readings << Reading.new(Date.parse(row[2]),
                               row[10].to_f,
                               row[11].to_f)
    end
  end
end

Because Set includes Enumerable and provides an interface similar to that of Array nothing else needs to change. This version automatically ensures that there are no duplicate records in the collection, and the duplicate detection is fast. Since the Reading objects are ultimately stored in a hash and their uniqueness is based on their date, the eql? and hash methods just delegate to the date attribute. That’s not a lot of code for a big win. There is a downside to be aware of though.

Arrays are ordered containers. In the version of AnnualWeather which used an array, the elements were in the same order in the array that they were read from the CSV file. That’s not necessarily the case with Set, which is an unordered container. The array version also allowed random access to any element given its index. There’s no way to access individual elements in a Set by indexing into it, to do that you’d have to first convert it into an array. (Technically speaking, since Set actually uses a hash under the covers the elements should be accessible in the order in which they were inserted. That said, the documentation for Set clearly states that the container is unordered. In other words, the fact that the elements are actually ordered is an implementation detail which is subject to change. If you truly need ordered sets you might want to consider the SortedSet class which is also part of the standard library.)

If you don’t need the elements to be in any specific order, don’t need random access to individual elements, and you need efficient checking for element inclusion, Set is the class for you. Just remember to require the “set” file before using it.

Things to Remember

• Consider Set for efficient element inclusion checking.

• Objects inserted into a Set must also be usable as hash keys.

• Require the “set” file before using Set.

Item 19: Know How to Fold Collections with reduce

Classes which include the Enumerable module gain a wealth of instance methods for filtering, traversing, and transforming collections of objects. Among them are the ever popular map and select, powerful methods which are at the core of nearly any Ruby program. Collection classes such as Array and Hash are so integral to writing Ruby programs that if you’re not familiar with the methods defined in Enumerable you’re probably writing more code than you ought to and I’d say you’re really missing out.

Right now, however, I want to talk about one particular method in Enumerable. One method to rule them all. One that you should definitely understand and be comfortable with. “You mean reduce?” How observant you are. Yes, reduce, also known as inject. (I like neither name but if I must pick then I choose reduce. Before Ruby 1.9 reduce wasn’t an option so you’re likely to see inject more often.)

What does reduce do and what’s so special about it? In functional programming terms, it’s a folding function which transforms one data structure into another. One of the things that makes it so powerful yet frustratingly confusing is how general it is. Programmers new to Ruby who don’t have a background in functional programming tend to have a hard time with the reduce method. Don’t worry though, we’re going to explore several sides to reduce and by the end of this item you’ll have a new power tool to turn to.

Let’s start by taking a bird’s-eye view of folding. When working with folding functions such as reduce you need to understand three main components:

• An enumerable object which is the receiver of the reduce message. This is the source collection that you want to transform in some way. Its class must obviously include Enumerable, otherwise you wouldn’t be able to call reduce on it.

• A block which is invoked once for each element in the source collection, similar to how the each method invokes its block. Unlike with each, however, the block given to reduce must produce a return value. The return value represents the final data structure with the current element folded into it. Some examples will make this more concrete.

• An object which is the starting value for the destination data structure, known as the accumulator. Each invocation of the block accepts the current accumulator value and returns a new accumulator value. After all elements have been folded into the accumulator, its final value becomes the return value from reduce.

The obligatory example when demonstrating reduce is a method which sums an array of numbers. In this case you can think of reduce as folding a collection of numbers into a single number. The starting value (accumulator) should be zero because if the array is empty—and therefore the block is never invoked—it’s the only valid result. Consider the following implementation of sum:

def sum (enum)
  enum.reduce(0) do |accumulator, element|
    accumulator + element
  end
end

The argument given to reduce is the starting value for the accumulator. It’s also the first argument which is yielded to the block. The purpose of the block is to create and return a new accumulator which is then fed into the next iteration of the block. The accumulator returned from the last iteration becomes the return value for reduce. If the source collection is empty then the block is never executed and reduce simply returns the initial accumulator value.

Notice that the block doesn’t make any assignments. It’s important to realize that mutating the accumulator won’t always have the desired effect. That’s because after each iteration reduce throws away the previous accumulator and keeps the return value of the block as the new accumulator. If you mutate the accumulator and return nil from the block then the accumulator for the next iteration will be nil. Probably not what you wanted.
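
Here’s that failure mode in IRB (the exact error message depends on your Ruby version):

irb> [1, 2, 3].reduce([]) {|acc, element| acc.push(element * 2)}
---> [2, 4, 6]

irb> [1, 2, 3].reduce([]) {|acc, element| acc.push(element * 2); nil}
NoMethodError: undefined method `push' for nil:NilClass

The first block works because Array#push returns the array itself; the second returns nil, so the next iteration tries to push onto nil.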

Take another look at the block used in the sum method. For each iteration the block is yielded the current accumulator value along with the current element, which are then added together. This sum then becomes the return value for the block and hence the accumulator for the next iteration. The process repeats until all elements have been passed to the block. The final accumulator value is then returned from reduce.

There’s a dangerous shortcut offered by reduce: it lets you omit the starting value for the accumulator. When you do this reduce will use the first element in the source collection as the accumulator and start the iteration cycle on the second element. The reason this is dangerous is that if the source collection is empty then reduce will return nil instead of something sensible. Most uses of reduce should not return nil since it’s so easy to have it return the starting value. Therefore, I recommend you always use a starting value for the accumulator.
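
A quick IRB comparison shows the difference:

irb> [].reduce {|total, element| total + element}
---> nil

irb> [].reduce(0) {|total, element| total + element}
---> 0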

Another shortcut which is sometimes handy deals with the block itself. Instead of giving reduce a block you can give it a symbol. On each iteration reduce will send a message to the accumulator using the symbol as the message name, passing the current element as an argument. This allows us to express sum more concisely:

def sum (enum)
  enum.reduce(0, :+)
end

If you’re used to generating blocks on the fly with the “&” operator it might look weird to pass a bare symbol to reduce. Don’t worry, in this case &:+ will also work. Both basically do the same thing but the version with “&” is more familiar since that’s what’s used with methods like map and each.

Personally, I think that using sum as an example is quite boring. Let’s do something more interesting and use reduce to transform an array into a hash. Back in Item 18 we converted an array into a hash by essentially duplicating the array and feeding it into Hash::[]. The elements of the array became the keys and true was used for the values. Here’s what it looked like:

Hash[array.map {|x| [x, true]}]

Let’s save some space and build the hash incrementally. Guess which method is perfect for this, yep, reduce:

array.reduce({}) do |hash, element|
  hash.update(element => true)
end

Right off the bat you’ll notice that the starting accumulator is a brand new (and empty) hash. If the source array is empty then reduce will return a valid hash instead of the troublesome nil. When the source array has elements, however, reduce will invoke the given block once for each element, just like it did in the sum example. Every iteration of the block then adds a key to the hash with true as the associated value. Since the Hash#update method returns the mutated hash, the return value for the block is also a hash. Therefore the requirement that the block return a new accumulator is also satisfied. Awesome!

Converting arrays to hashes using reduce is more space efficient than the version which used map. Still, it’s not very interesting. Let’s use reduce to join together two loops and in the process show off some slightly more complicated logic in the block. Suppose we have an array of users which we’d like to filter so the resulting array only includes those people who are 21 or older. Then we’d like to convert the array of users into an array of names. Without access to reduce, you’d probably write something like this:

users.select {|u| u.age >= 21}.map(&:name)

While that certainly works, it’s not very efficient. The select method will walk through the entire array of users and return a new array with just those who are 21 or older. This new array is then traversed with map and another new array is returned containing just names. If we use reduce we don’t need to create or traverse more than one array:

users.reduce([]) do |names, user|
  names << user.name if user.age >= 21
  names
end

Do you see the pattern which is emerging? The block given to reduce can do anything you’d like with the yielded accumulator and elements (including nothing) just so long as it returns a new accumulator. In this example the block ends by returning the names array. This ensures that the block returns an accumulator even if the conditional logic is false. If no users are 21 or over then reduce will return an empty list. I’ve said it several times so far but it’s worth repeating, make sure the block always returns an accumulator. Sometimes it might be a completely new accumulator (like with sum) and other times it’s the same accumulator which the block was yielded (such as in this example).

Speaking of patterns, you probably write fold-like code all the time without using reduce. Recognizing a fold and then refactoring to use a direct invocation of reduce often leads to simpler, self-contained code. So, what does a fold without reduce look like? Think about the last time you used the each method to iterate through a collection in order to update a variable. For example, you may have previously converted an array into a hash using code like this:

hash = {}

array.each do |element|
  hash[element] = true
end

This is precisely the pattern which reduce is meant to help with. When you see a pattern like this in your code consider using reduce to clean it up.

Things to Remember

• Always use a starting value for the accumulator.

• The block given to reduce should always return an accumulator. It’s fine to mutate the current accumulator, just remember to return it from the block.

Item 20: Consider Using a Default Hash Value

You’re a Ruby programmer who, I’m sure, has been around the block a few times. So tell me, how often do you see the following pattern show up in code?

def frequency (array)
  array.reduce({}) do |hash, element|
    hash[element] ||= 0 # Make sure the key exists.
    hash[element] += 1  # Then increment it.
    hash                # Return the hash to reduce.
  end
end

I’m referring specifically to using the “||=” operator to ensure that a key is set in a hash before attempting to modify it in some way. Why do we have to resort to this? Okay, I know that you know the answer, but bear with me for a bit and you might discover that things aren’t as straightforward as you think they are. Or maybe they are; either way, I’d like to show you some alternatives to this pattern.

You already know that if you try to fetch a value from a hash using a nonexistent key you’ll get back nil. It’s even likely that you’ve been bitten by this and shoved a few “||” or “||=” operators here and there to get around it. But Hash doesn’t return nil because it wants to punish you. It just happens to be the default value, one which you’re free to change. That’s right, you can have Hash return any value you like for nonexistent keys instead of nil. I’m not a huge fan of nil but in this case using it as the default value for hashes probably makes sense. However, when we actually use a hash in our code for a specific purpose, there’s often a better default value to be found.

Look at the previous frequency method again. If you could change the default value for the hash used as the accumulator to reduce, what would you choose? Come on, the answer’s right there on the right-hand side of the “||=” operator. Take a look at another version of this method, this time using 0 as the default value:

def frequency (array)
  array.reduce(Hash.new(0)) do |hash, element|
    hash[element] += 1 # Increment the value.
    hash               # Return the hash to reduce.
  end
end

Hey, before you run off and start changing all your hashes, let’s work through what’s going on; there are a couple of gotchas lurking in the dark. Obviously the value given to Hash::new becomes the default value for the hash. Keep in mind though, the default value is only used when you try to access nonexistent keys. It doesn’t magically populate the hash when you attempt to read from it. This is a really important detail about hashes and default values. Consider this IRB session:

irb> h = Hash.new(42)

irb> h[:missing_key]
---> 42

irb> h.keys # Hash is still empty!
---> []

irb> h[:missing_key] += 1
---> 43

irb> h.keys # Ah, there you are.
---> [:missing_key]

Notice that accessing a nonexistent key returns the default value without modifying the hash. However, using the “+=” operator does indeed update the hash as you’d expect. It’s not as clear as it could be though. It helps to recall what “+=” expands to:

# Short version:
hash[key] += 1

# Expands to:
hash[key] = hash[key] + 1

Now it’s clear that the assignment happens by first fetching the default value, then adding one to it, and finally storing the resulting value into the hash under the same key. You should also take note that we don’t mutate the default value in any way. Sure, the default value in this code is a Fixnum, which isn’t mutable. But what happens when we do use a mutable value as the default value, and then we mutate it? This is where things really start to get interesting. Check this out:

irb> h = Hash.new([])

irb> h[:missing_key]
---> []

irb> h[:missing_key] << "Hey there!"
---> ["Hey there!"]

irb> h.keys # Wait for it...
---> []

irb> h[:missing_key]
---> ["Hey there!"]

What in the world is going on? Well, to be honest, I picked an example with a fair amount of dramatic flair. That’s not to say it’s unrealistic though, I’ve actually written silly code like this. There are two things that are obscuring what’s going on here and I’m sure that if you go back through the code you’ll see them. The first one is simple. Hash::new takes an object to use as the default value and I’ve given it an empty array. Technically, it’s a reference to a mutable array. If the default value is therefore modified the mutation will be visible each time you request a nonexistent key.

The other bit of trickery on my part is the use of the “<<” operator. Can you see it now? I never actually changed the hash. Requesting a nonexistent key returned the empty array which was then changed by inserting a new element. The hash didn’t change but the default value did. That’s why the keys method reports that the hash is empty but accessing any nonexistent key returns the recently modified default value. If you really wanted to insert a new element and set a key in the hash you need to go a bit further. But watch out, there’s another sneaky side effect waiting for you:

irb> h = Hash.new([])

irb> h[:weekdays] = h[:weekdays] << "Monday"

irb> h[:months] = h[:months] << "January"

irb> h.keys
---> [:weekdays, :months]

irb> h[:weekdays]
---> ["Monday", "January"]

irb> h.default
---> ["Monday", "January"]

Oh yeah, both keys are sharing the same default array. Probably not what you want to do in most situations. What we’d really like is for the hash to return a brand new array when we access a missing key. It just so happens that there’s another way to specify default values for a hash. If you give Hash::new a block it will invoke the block when it needs a default value. Returning a newly created array fits the bill nicely:

irb> h = Hash.new { [] }

irb> h[:weekdays] = h[:weekdays] << "Monday"

irb> h[:months] = h[:months] << "January"

irb> h[:weekdays]
---> ["Monday"]

Much better. But we can take this one step further. The block given to Hash::new can optionally accept two arguments, the hash itself and the key which is being accessed. That means if we want to mutate the hash we can. Why don’t we make it so that accessing a nonexistent key actually sets the key to a new, empty array:

irb> h = Hash.new {|hash, key| hash[key] = [] }

irb> h[:weekdays] << "Monday"

irb> h[:holidays]
---> []

irb> h.keys
---> [:weekdays, :holidays]

Look familiar? We’re back to that tricky code from before but this time it actually works (with a twist). You can probably see that this technique comes with a major downside. Every time a missing key is accessed, the block will not only create a new entry in the hash but it’ll also create a new array. Let me repeat that, accessing a nonexistent key will put that key into the hash. This illuminates a problem with default values in general. Let me show you what I mean.

The correct way to check a hash to see if it contains a specific key is with the has_key? method or one of its aliases. But we’re all guilty of exploiting the fact that nil is usually the default value:

if hash[key]
...
end

If the hash has a default value other than nil or false then this conditional will always succeed. This is a good reminder that it might not be safe to pass a hash with a non-nil default value to a method which isn’t expecting it. It should also remind you that checking for the existence of a key by fetching its value is sloppy and won’t always do what you expect. Stick to has_key?.
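
A short IRB session shows how a default value breaks that habit:

irb> h = Hash.new(0)

irb> h[:missing] ? "looks present" : "looks absent"
---> "looks present"

irb> h.has_key?(:missing)
---> false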

There’s another way to work with default values which is sometimes more appropriate. The Hash#fetch method works a lot like Hash#[] with a few important changes. Like Hash#[], the first argument to fetch is the key you want to look up in the hash. But fetch can take an optional second argument. If the given key isn’t found in the hash then fetch will return its second argument instead. If you omit the second argument then fetch will raise an exception if you attempt to access a nonexistent key. This can be a much safer alternative to setting a default value for the entire hash.

irb> h = {}

irb> h[:weekdays] = h.fetch(:weekdays, []) << "Monday"
---> ["Monday"]

irb> h.fetch(:missing_key)
KeyError: key not found: :missing_key

Using fetch is a bit more verbose but has its merits. If you need to pass a hash around to code which assumes invalid keys return nil, using fetch is safer than using a default value. But don’t discount default values, especially the block version of Hash::new. You can do some really interesting and powerful stuff with both default values and default blocks.

Things to Remember

• Consider using a default Hash value.

• Use has_key? or one of its aliases to check if a hash contains a key. That is, don’t assume that accessing a nonexistent key will return nil.

• Don’t use default values if you need to pass the hash to code which assumes invalid keys return nil.

• Hash#fetch can sometimes be a safer alternative to default values.

Item 21: Prefer Delegation to Inheriting from Collection Classes

This item could have just as well been titled “Prefer Delegation to Inheriting from Core Classes” because it applies equally to all of the core Ruby classes. However, beginner and intermediate Ruby programmers tend to be more tempted to subclass the collection classes than classes like Fixnum. That’s because the collection classes tend to look like good foundations when you want to write your own collection class. If you squint a bit it might even look like this models the “is-a” relationship, which is often promoted as a good basis for using inheritance.

More often though, the relationship is truly a “has-a” (composition) relationship which is being shoehorned into an inheritance hierarchy. But I’m not here to rehash the “is-a”/“has-a” debate, there are plenty of other books and papers which have done a much better job on that subject than I can do here. Even if you think you have the most perfect “is-a” relationship with one of the core collection classes, I’m here to tell you why it’s a bad idea and show you how to model your classes properly.

“What could possibly be wrong with inheriting from one of the collection classes?” For starters, think about the relationship which inheritance creates. The subclass first and foremost inherits the public interface from its superclass. This automatically shows up in the documentation and reflection information for your class. In other words, programmers using your class will have an expectation that all those superclass methods work correctly on your subclass. And that’s where we start to run into trouble.

As you know, all of the core Ruby classes are implemented in C. The reason that I point this out is because the instance methods for some of these classes don’t always respect subclasses. Take the Array#reverse method for example. Instead of mutating the receiver it returns a new Array. Guess what happens if you subclass Array and use the reverse method on an instance of your subclass. Okay, you don’t have to guess, I’ll show you:

irb> class LikeArray < Array; end

irb> x = LikeArray.new([1,2,3])
---> [1, 2, 3]

irb> y = x.reverse
---> [3, 2, 1]

irb> y.class
---> Array

Yep, LikeArray#reverse returns an Array and not a LikeArray. You can’t really blame the Array class though, the documentation clearly says that reverse returns a new Array. It’s just not what you’d like to happen, and it might surprise users of your class. But of course it doesn’t end there, a lot of other instance methods on the collection classes do this. The inherited comparison operators might even be worse. They allow instances of your subclass to be compared with instances of the superclass. Does that really make sense?

irb> LikeArray.new([1,2,3]) == [1,2,3]
---> true

Inheritance isn’t always the best option in Ruby. Inheriting from the core collection classes almost never makes sense. So what do you do instead? Well, we don’t even have to look outside of Ruby’s standard library to see an example of how to do this correctly. The Set class is written in Ruby and has a rather simple implementation. It keeps a Hash as its only instance variable and uses it as a place to store elements, forwarding most of its methods on to the hash.

This is a great example where it might be tempting to have Set inherit from Hash. The Set class implements a lot of the same methods as Hash after all. But a Set isn’t a Hash and we certainly don’t want to call a method on a Set and accidentally get back a Hash. That would be weird. So Set takes the “has-a” approach and uses a hash internally without ever exposing it. You don’t need to know anything about Hash in order to use Set. It’s just an implementation detail.

But what about classes which really do seem to model the “is-a” relationship? Without being able to inherit from one of the collection classes you might be left writing a lot of wrapper methods to expose a similar interface. That’s where delegation comes in.

Delegation allows you to declare methods which should be forwarded along to a specific instance variable. It’s sort of like using inheritance but with quite a bit more control. With delegation you can expose an interface similar to one of the collection classes without exposing the entire interface. I like to think about delegation as building up a class from external pieces as opposed to tearing out the parts you inherited but don’t want. An example is in order here.

Let’s write a class based on Hash with one important difference, accessing nonexistent keys raises an exception. There are many different ways to implement something like this but writing a new class lets us reuse the same implementation easily. Instead of inheriting from Hash and then putting bandages all over the code to make sure it works correctly we’ll use delegation. We’ll only need one instance variable, @hash, which does all the heavy lifting for us. Let’s create the class and set up delegation to @hash:

require('forwardable')

class RaisingHash
  extend(Forwardable)
  include(Enumerable)

  def_delegators(:@hash, :[], :[]=, :delete, :each,
                 :keys, :values, :length,
                 :empty?, :has_key?)
end

There are several ways to implement delegation in Ruby. The Forwardable module makes it easy to delegate the listed methods to a specific instance variable. It’s part of the standard library (and not the Ruby core library) which is why we need the require.

After extending the Forwardable module you can use the def_delegators class method. The first argument should be a symbol representing the name of an instance variable to use as the target object. The target doesn’t have to be an instance variable though. If the symbol given to def_delegators is the name of a method, that will work too. In this case the method should return an object which will then be used as the delegation target. The remaining arguments to def_delegators are names of instance methods which should be forwarded to the target object. This is a lot like using the attr_accessor method to generate getter and setter methods. For delegation, the def_delegators method will generate instance methods which simply forward the requested message to the target object.
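
To make that concrete, each generated method behaves roughly like the hand-written forwarding method below (a simplified sketch; Forwardable’s real implementation takes care of a few extra details):

# Roughly what def_delegators(:@hash, :delete) defines on RaisingHash:
def delete (*args, &block)
  @hash.delete(*args, &block)
end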

This is where using delegation instead of inheritance really starts to shine. Using def_delegators we specifically choose which hash methods we want to expose through the RaisingHash public interface. Since we’re going to raise exceptions for missing keys it doesn’t really make sense to support the Hash#fetch method. If we were using inheritance we’d have to hide unwanted methods with undef_method or mark them as private. With delegation we cherry pick the methods we want to support. We can even take this a bit further.

Using def_delegators assumes you want to name the locally generated instance method the same as the delegation target’s method. In other words, delegating the delete method creates a RaisingHash#delete method which forwards the message to @hash.delete. Sometimes, however, you want to use a different local name. You can do this with the def_delegator method. For example, if you wanted the RaisingHash#erase! method forwarded to @hash.delete you’d use the following def_delegator call:

# Forward self.erase! to @hash.delete
def_delegator(:@hash, :delete, :erase!)

After setting up the delegation methods you need to make sure that the initialize method creates the instance variable mentioned in the delegation declarations. This step requires a bit more thought for our RaisingHash class. Recall that we want to raise exceptions when an invalid key is used with the hash. While we could certainly write our own “[]” operator which verifies each access, there’s a simpler way. Let’s use the technique from Item 20 and give Hash::new a block which just raises a KeyError. Since the block is only ever called when a nonexistent key is accessed we don’t have to perform any key checking logic. This also means that if we expose any other way to access keys in the hash we don’t have to duplicate the KeyError raising code. Here’s the initialize method:

def initialize
  @hash = Hash.new do |hash, key|
    raise(KeyError, "invalid key `#{key}'!")
  end
end

At this point you could create a new RaisingHash object and see that the delegation works as expected. But you may have noticed that there are several Hash methods which we aren’t delegating to @hash. As I mentioned earlier, not all of the Hash methods make sense for RaisingHash, but there’s another reason too. Delegation isn’t the correct answer for all methods. Remember the methods I mentioned earlier which return a new Hash instead of the proper subclass? Just as with inheritance we need to deal with those carefully. Hash#invert is one such method, therefore we need to implement our own version which ensures that a new RaisingHash is returned instead of a Hash:

def invert
  other = self.class.new
  other.replace!(@hash.invert)
  other
end

protected

def replace! (hash)
  hash.default_proc = @hash.default_proc
  @hash = hash
end

Notice that RaisingHash#invert used the protected method replace! to update the @hash instance variable in the other RaisingHash object. This technique comes from Item 14 and establishes the replace! method as a gateway through which other internal methods can change the @hash instance variable. The replace! method isolates an important piece of logic, ensuring that any hash which is assigned to @hash uses the same default_proc. What is a default_proc and why is it important?

The default_proc for a Hash is the block given to the Hash::new method and is supposed to return a default value, but in our case it raises an exception. As it turns out, Hash methods which return a brand new hash (such as invert) don’t copy the default_proc into the new hash. If the return value from Hash#invert was directly placed into the @hash instance variable then accessing an invalid key would no longer raise an exception. Therefore we need to ensure that the block which was used in the initialize method is copied into the incoming hash. This preserves the logic which raises KeyError exceptions.
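
You can see the problem in IRB:

irb> h = Hash.new {|hash, key| raise(KeyError, "invalid key `#{key}'!")}

irb> h[:name] = "Jordan"

irb> h.invert.default_proc # The block wasn't copied over.
---> nil

irb> h.invert[:missing] # So no exception, just nil.
---> nil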

There’s one last set of considerations for the RaisingHash class. Because it’s pretending to be a collection class (all the real logic is in @hash) we need to make sure that it behaves correctly when duplicated or cloned. It also needs to properly respond to the freeze, taint, and untaint methods. Let’s start with dup and clone.

If someone duplicates a RaisingHash object Ruby will, by default, create a new object which has the same instance variables as the original object. Since the @hash instance variable is the real container, we need to arrange to have it duplicated as well. This logic belongs in the initialize_copy method which is called just after a new object is created from an original:

def initialize_copy (other)
  @hash = @hash.dup
end

The receiver of the initialize_copy method is the new object and it’s given the original object as its only argument. When initialize_copy is called all of the instance variables point to the ones from the original object. We only need to duplicate the @hash variable. Pretty simple.

Finally we come to freezing and tainting. Since RaisingHash is delegating mutating methods to @hash we need to ensure that if an instance of RaisingHash is frozen so is its @hash instance variable. Fortunately, overriding the freeze method is straightforward. We just need to pass the message on to @hash and then call super:

def freeze
  @hash.freeze
  super
end

The taint and untaint methods should behave the same way. That is, they should pass the message on to @hash and then call super. With those methods in place freezing, tainting, or untainting a RaisingHash object also affects the internal hash object. If we didn’t do this then calling a mutating method like delete would continue to work even with a frozen RaisingHash object. That’s not how freezing should work. Make sure you take this into consideration.
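
Following the same pattern as freeze, a minimal sketch of those two methods might look like this:

def taint
  @hash.taint
  super
end

def untaint
  @hash.untaint
  super
end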

Things to Remember

• Prefer delegation to inheriting from collection classes.

• Don’t forget to write an initialize_copy method which duplicates the delegation target.

• Write freeze, taint, and untaint methods which send the corresponding message to the delegation target followed by a call to super.