THE RUBY WAY, Third Edition (2015)

Chapter 4. Internationalization in Ruby

Therefore, [the place] was called Babel, because there the Lord confused the language of all the earth; and from there the Lord scattered them abroad over the face of all the earth.

—Genesis 11:9

Earlier we said that character data was arguably the most important data type. But what do we mean by character data? Whose characters, whose alphabet, whose language and culture?

In the past, computing had an Anglocentric bias, perhaps going back as far as Charles Babbage. This is not necessarily a bad thing. We had to start somewhere, and it might as well be with an alphabet of 26 letters and no diacritic marks.

But computing is a much more global phenomenon now. Every country in the world now has computing hardware and access to online resources. Naturally everyone would prefer to work with web pages, email, and other data not just in English but in his own language.

Human written languages are amazingly diverse. Some are nearly phonetic; others are hardly phonetic at all. Some have true alphabets, whereas others are mostly large collections of thousands of symbols evolved from pictograms. Some languages have more than one alphabet. Some are intended to be written vertically; some are written horizontally, as most of us are used to—but from right to left, as most of us are not used to. Some alphabets are fairly plain; others are incredibly elaborate and adorned richly with diacritics. Some languages have letters that can be combined with their neighboring letters in certain circumstances; sometimes this is mandatory, sometimes optional. Some languages have the concept of upper and lowercase letters; many do not.

In 30 years or so, the computing world has managed to create a little order out of the chaos. If you deal much with programming applications that are meant to be used in linguistically diverse environments, you know the term internationalization. This could be defined as the enabling of software to handle more than one written language.

Related terms are multilingualization and localization. All of these are traditionally abbreviated by the curious practice of deleting the middle letters and replacing them with the number of letters deleted:

def shorten(str)
  (str[0] + str[1..-2].length.to_s + str[-1]).upcase
end

shorten("internationalization") # I18N
shorten("localization") # L10N
shorten("multilingualization") # M17N

Localization involves complete support for local conventions and culture, such as currency symbols, ways of formatting dates and times, using a comma as a decimal separator, and much more. There is no universal agreement on the exact meaning of multilingualization. Many or most use this term to mean the combination of I18N and L10N; others use it as synonymous with I18N. I tend to avoid this term in this book.

In essence, internationalizing an application is a one-time process; the source needs to be modified so that it can handle multiple character sets and data in multiple languages. A program that is fully internationalized can accept strings in multiple languages, manipulate those strings, and output them.

A localized program goes beyond that, to permit selection of a language at runtime for use in prompts, error messages, and so on; this includes interpolating data in the correct places in messages and forming plurals correctly. It also includes the capability to perform correct collating (sorting), as well as formatting numbers, dates, times, and currencies.

The process of localization of course includes the translating of error messages into the target language. This process is repeated for every language the application is to support.

In this chapter, we’ll examine tools and techniques that help the Ruby developer with these issues. Let’s begin with a little more terminology, since this area is particularly prone to jargon.

4.1 Background and Terminology

At one time, there was a proliferation of character sets. In the early 60s, ASCII was created, and it gained widespread acceptance through the 70s and early 80s. ASCII stands for American Standard Code for Information Interchange. It was a big step forward, but the operative word here is American. It was never designed to handle even European languages, much less Asian ones.

But there were loopholes. This character set had 128 characters (being a 7-bit code). But an 8-bit byte was standard, so the natural idea was to make a superset of ASCII, using codes 128 and higher for other purposes. The trouble is, this was done many times in many different ways by IBM and others; there was no widespread agreement on what, for example, character 221 was.

The shortcomings of such an approach are obvious. Not only do the sender and receiver have to agree on the exact character set, but they are limited in what languages they can use in a single context. If you wanted to write in German but quote a couple of sources in Greek and Hebrew, you probably couldn’t do it at all. This scheme didn’t begin to address the complexities of Asian languages such as Chinese, Japanese, and Korean.

There were two basic kinds of solutions to this problem. One was to use a much larger character set—one with 16 bits, for example (so-called wide characters). The other was to use variable-length multibyte encodings.

In a variable-length scheme, some characters might be represented in a single byte, some in two bytes, and some in three or even more. Obviously, this raised many issues: For one, a string had to be uniquely decodable. The first byte of a multibyte character could be in a special class so that we could know to expect another byte, but what about the second and later bytes? Were they allowed to overlap with the set of single-byte characters? Were certain characters allowed as second or third bytes, or were they disallowed? Would we be able to jump into the middle of a string and still make sense of it? Would we be able to iterate backwards over a string if we wanted? Different encodings made different design decisions.

Eventually the idea for Unicode was born. Think of it as a “world character set.” Unfortunately, nothing is ever that simple.

You may have heard it said that Unicode was (or is) limited to 65,536 characters (the number of codes representable in 16 bits). This is a common misconception; Unicode was never designed with that kind of constraint. It was understood from the beginning that in many usages, it would be a multibyte scheme. The number of characters that could be represented was essentially limitless—a good thing, because 65,000 codes would never suffice to handle all the languages of the world.

One of the first things to understand about i18n is that the interpretation of a string is not intrinsic to the string itself. That kind of old-fashioned thinking comes from the notion that there is only one way of storing strings. I can’t stress this enough. Internally, a string is just a series of bytes. To emphasize this, let’s imagine a single ASCII character stored in a byte of memory. If we store the letter that we call “capital A,” we really are storing the number 65.

Why do we view a 65 as an A? It’s because of how the data item is used (or how it is interpreted). If we take that item and add it to another number, we are using it (interpreting it) as a number; if we send it to an ASCII terminal over a serial line, we are interpreting it as an ASCII character.
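We can watch this dual interpretation in action. Here is a quick irb sketch using Ruby's ord and chr methods to move between the two views of the same byte:

byte = "A"          # one byte in memory: 65
byte.ord            # 65 (interpreted as a number)
byte.ord + 1        # 66 (arithmetic on the number)
(byte.ord + 1).chr  # "B" (interpreted as a character again)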

Just as a single byte can be interpreted in more than one way, so, obviously, can a whole sequence of bytes. In fact, the intended interpretation scheme (or encoding) has to be known in advance for a string to make any real sense. An encoding is essentially a mapping between binary numbers and characters. And yet it still isn’t quite this simple.

But before we get into these issues, let’s look at some terminology. These terms are not always intuitive.

• A byte is simply eight bits (though in the old days, even this was not true). Traditionally, many programmers have thought of a byte as corresponding to a single character. Obviously, we can't think that way in an I18N context.

• A codepoint is simply a single entry in the imaginary table that represents the character set. As a half-truth, you may think of a codepoint as mapping one-to-one to a character. Nearer to the truth, it sometimes takes more than a single codepoint to uniquely specify a character.

• A glyph is the visual representation of a codepoint. It may seem a little unintuitive, but a character’s identity is distinct from its visual representation.

• A grapheme is similar in concept to a glyph, but when we talk about graphemes, we are coming from the context of the language, not the context of our software. A grapheme may be the combination (naive or otherwise) of two or more glyphs. It is the way a user thinks about a character in his own native language context. The distinction is subtle enough that many programmers will simply never worry about it.

What then is a character? Even in the Unicode world, there is some fuzziness associated with this concept because different languages behave a little differently and programmers think differently from other people. Let’s say that a character is an abstraction of a writing symbol that can be visually represented in one or more ways.

Let’s get a little more concrete. First, let me introduce a notation to you. We habitually represent Unicode codepoints with the notation U+ followed by four or more uppercase hexadecimal digits. So what we call the letter “A” can be specified as U+0041.

Now take the letter “é” for example (lowercase e with an acute accent). This can actually be represented in two ways in Unicode. The first way is the single codepoint U+00E9 (LATIN SMALL LETTER E WITH ACUTE). The second way is two codepoints—a small e followed by an acute accent: U+0065 and U+0301 (or LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT).

Both forms are equally valid. The shorter one is referred to as the precomposed form. Bear in mind, though, that not every language has precomposed variants, so it isn’t always possible to reduce such a character to a single codepoint.
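We can put both forms side by side in Ruby. This is just a sketch, using the \u string escapes that we will meet again shortly:

precomposed = "\u00E9"    # é as a single codepoint
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
precomposed == decomposed # false (different codepoint sequences)
precomposed.length        # 1
decomposed.length         # 2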

I’ve referred to Unicode as an encoding, but that isn’t strictly correct. Unicode maps characters to codepoints; but there are different ways to map codepoints to binary storage. In effect, Unicode is a family of encodings.

Let’s take the string “Matz” as an example. This consists of four Unicode codepoints:

"Matz" # U+004d U+0061 U+0074 U+007a

The straightforward way to store this would be as a simple sequence of bytes:

00 4d 00 61 00 74 00 7a

This is called UCS-2 (as in two bytes) or UTF-16 (as in 16 bits). Note that this encoding itself actually comes in two “flavors,” a big-endian and a little-endian form. However, notice that every other byte is zero. This isn’t mere coincidence; it is typical for English text, which rarely goes beyond codepoint U+00FF. It’s somewhat wasteful of memory.
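Ruby can show us this layout directly. As a quick check, here is the big-endian form of the same string:

"Matz".encode("UTF-16BE").bytes
# [0, 77, 0, 97, 0, 116, 0, 122] (hex: 00 4d 00 61 00 74 00 7a)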

This brings us to the idea of UTF-8. This is a Unicode encoding where the “traditional” characters are represented as single bytes, but others may be represented as multiple bytes. Here is a UTF-8 encoding of this same string:

4d 61 74 7a

Notice that all we have done is strip off the zeroes; more importantly, note that this is the same as ordinary ASCII. This is obviously by design; “plain ASCII” can be thought of as a proper subset of UTF-8. The 8-bit ISO-8859-1 (also called Latin-1) gets similar treatment at the codepoint level: Unicode’s first 256 codepoints match Latin-1, although Latin-1 characters above 0x7F occupy two bytes in UTF-8 rather than one. (Do not confuse ISO-8859-1 with Windows-1252.) In short, this means that UTF-8 is “backward compatible” with text encoded as ASCII, and Unicode’s codepoints are backward compatible with Latin-1.

One implication of this is that when UTF-8 text is interpreted, it sometimes appears “normal” (especially if the text is mostly English). Sometimes you may find that in a browser or other application, English text is displayed correctly, but there are additional “garbage” characters. In such a case, it’s likely that the application is making the wrong assumption about what encoding is being used.

So we can argue that UTF-8 saves memory (speaking from an Anglocentric point of view again, or at least ASCII-centric). When the text is primarily ASCII, memory will be conserved, but for other writing systems such as Greek or Cyrillic, the strings will actually grow in size.

It is an obvious benefit that UTF-8 is backward compatible with ASCII, still arguably the most common single-byte encoding in the world. Finally, UTF-8 also has some special features to make it convenient for programmers.

For one thing, the bytes used in multibyte characters are assigned carefully. The null character (ASCII 0) is never used as the nth byte in a sequence (where n > 1), nor are such common characters as the slash (commonly used as a pathname delimiter). As a matter of fact, no byte in the full ASCII range (0x00-0x7F) can be used as part of any other character.

The first byte of a multibyte character uniquely determines how many bytes will follow. That first byte is always in the range 0xC0 to 0xFD, and any following bytes are always in the range 0x80 to 0xBF. This ensures that the encoding scheme is stateless and allows recovery after missing or garbled bytes.
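A short sketch makes these properties visible; note that none of the bytes of a multibyte character falls in the ASCII range:

"é".bytes.map {|b| b.to_s(16) } # ["c3", "a9"]
"é".bytes.all? {|b| b >= 0x80 } # true (no ASCII bytes are reused)
"/".bytes                       # [47] (a slash is always just a slash)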

UTF-8 is one of the most flexible and common encodings in the world, in use since the early 1990s. Unless it is told otherwise, Ruby assumes that the code you write, any text input, and all text output are encoded as UTF-8. As a result, most of our attention in this chapter will be focused on working with the UTF-8 encoding in Ruby.

With respect to internationalization, there are two concepts that I consider to be fundamental, almost axiomatic. First, a string has no intrinsic interpretation. It must be interpreted according to some external standard. Second, a byte does not correspond to a character; a character may be one or more bytes. There are other lessons to be learned, but these two come first.

4.2 Working with Character Encodings

Even if you as a programmer only use English, there is a strong chance that users of your program will want to enter non-English text. Personal names, place names, and many other such pieces of text often contain characters from other languages.

Handling those languages can be an extremely challenging task. Different languages sort words completely differently: in Slovak, “ch” comes after “h,” whereas in Swedish, “Å” comes after “Z.” In Russian, there are two different plural word endings, one for numbers ending in 2, 3, or 4, and one for numbers ending in 5, 6, 7, 8, 9, or 0. Modern programs, especially those written for the Web, should be prepared to accept multilingual text even if they will never be localized into other languages.

Programs that are fully internationalized and localized should be able to take input in multiple languages and express output in multiple languages, including correctly formed plurals. They need to format numbers, dates, and currencies correctly for the current language, and they must be able to sort strings correctly as well.

Let’s investigate the various tools available to the Ruby programmer that allow us to do I18N and L10N. We need to understand how Ruby stores character strings.

As I said, UTF-8 is now the default for Ruby. Unless the interpreter is told otherwise, it assumes that the code you write, any text input, and all text output are encoded as UTF-8.

The letter “e” with an acute accent has the codepoint U+00E9. In Ruby, Unicode characters can also be written in the form \u followed by the hexadecimal codepoint. As a result, these strings are equal:

"épée" == "\u00E9p\u00E9e" # true
# The French word "épée" refers to a kind of sword

In Ruby, strings are always a series of codepoints, but some codepoints are encoded as more than one byte. The interpreter permits us to access the underlying series of bytes. Let’s look at some examples:

sword = "épée"
sword.length # 4
sword.bytes # [195, 169, 112, 195, 169, 101]
sword.bytes.length # 6

The string contains four characters, but encoding those four characters requires 6 bytes. The character é is represented in UTF-8 as the bytes 195 and 169, in that order.

cp = sword.codepoints # [233, 112, 233, 101]
cp.map {|c| c.to_s(16) } # ["e9", "70", "e9", "65"]
sword.scan(/./) # ["é", "p", "é", "e"]

Using the String#codepoints method allows us to see the Unicode codepoint numbers. We can convert those numbers to hexadecimal to see that the codepoint for “é” is in fact U+00E9.

Also in the previous example, note that regular expressions are fully Unicode-aware. They match characters rather than bytes.

4.2.1 Normalization

Many characters are conceptually made up of two other existing characters, such as the letter e and the acute accent combining to create the letter é. In addition to LATIN SMALL LETTER E WITH ACUTE, Unicode contains separate codepoints for LATIN SMALL LETTER E (U+0065) and COMBINING ACUTE ACCENT (U+0301) for the e and the acute accent, respectively.

Combining codepoints is called composition, and separating them is decomposition. In order to compare a sequence of codepoints to see if they include the same characters, any program that deals with Unicode must first compose (or decompose) the codepoints using the same set of rules. This process is called normalization. Once a sequence of codepoints has been normalized, it can be accurately compared with others that have been normalized the same way.

Up until now, we’ve been using precomposed characters—ones in which the base character and diacritic are combined into a single entity and a single Unicode codepoint. In general, though, Unicode supports the separate encoding of base characters and their diacritics.

Why is this separation supported? It provides flexibility and allows us to apply diacritic marks to any character, not just the combinations considered by the encoding designer. In fact, fonts will typically include glyphs for common combinations of character and diacritic, but the display of an entity is separate from its encoding. Table 4.1 clarifies this a little.


Table 4.1 Precomposed and Decomposed Forms

Unicode has numerous design considerations such as efficiency and round-trip compatibility with existing national encodings. Sometimes these constraints may introduce some redundancy; for example, not only does Unicode include codepoints for decomposed forms but also for many of the precomposed forms already in use. This means that there is also a codepoint for LATIN SMALL LETTER E WITH ACUTE, as well as for things such as the double-f ligature.

For example, let’s consider the German word “öffnen” (to open). Without even considering case, there are four ways to encode it:

1. o + COMBINING DIAERESIS (U+0308) + f + f + n + e + n

2. LATIN SMALL LETTER O WITH DIAERESIS (U+00F6) + f + f + n + e + n

3. o + COMBINING DIAERESIS + DOUBLE-F LIGATURE (U+FB00) + n + e + n

4. LATIN SMALL LETTER O WITH DIAERESIS + DOUBLE-F LIGATURE + n + e + n

The diaeresis (also spelled dieresis) is simply a pair of dots over a character. In German, it is called an umlaut.

Normalizing is the process of standardizing the character representations used. After normalizing, we can be sure that a given character is encoded in a particular way. Exactly what those forms are depends on what we are trying to achieve. Annex 15 of the Unicode Standard lists four normalization forms:

1. D (Canonical Decomposition)

2. C (Canonical Decomposition followed by Canonical Composition)

3. KD (Compatibility Decomposition)

4. KC (Compatibility Decomposition followed by Canonical Composition)

You may see these written as NFKC (Normalization Form KC) and so on.

The precise rules set out in the standard are complex and cover the difference between “canonical equivalence” and “compatibility equivalence.” (Korean and Japanese require particular attention, but we won’t address these here.) Table 4.2 summarizes the effects of each normalization form on the strings we started with previously.


Table 4.2 Normalized Unicode Forms

Which form is most appropriate depends on the application at hand. C and D are simply “fully composed” and “fully decomposed” forms, respectively; these forms are reversible. KC and KD are the “compatibility” forms; they require conversions such as the ligature “ffi” being separated into its three component letters (making these forms irreversible). Because information is lost, the compatibility forms are best suited to transient tasks in which texts must be compared with one another, such as search and validation.
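As an aside, Ruby 2.2 added String#unicode_normalize, which implements all four of these forms directly (the activesupport approach shown next predates it and remains common):

"e\u0301".unicode_normalize(:nfc) == "\u00E9" # true
"\u00E9".unicode_normalize(:nfd).codepoints   # [101, 769] (U+0065, U+0301)
"\uFB00".unicode_normalize(:nfkd)             # "ff" (ligature decomposed)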

The activesupport gem provides the ActiveSupport::Multibyte::Chars class, which wraps the String class. It allows manipulation of Unicode codepoints in ways that the String class does not. Perhaps the most unexpected Unicode functionality neglected by the String class is that of changing case:

require 'active_support'

chars = ActiveSupport::Multibyte::Chars.new("épée")
"épée".upcase # "éPéE"
chars.upcase.to_s # "ÉPÉE"
"épée".capitalize # "épée"
chars.capitalize.to_s # "Épée"

This feature provided by activesupport is “neglected” in Ruby by design: Many languages do not have the concept of upper- and lowercase letters, so for those languages these operations are meaningless, and they cannot be well defined in the general case. Only perform this kind of transformation if you are certain of the behavior with respect to all the languages you plan to support.

The gem also provides useful methods for truncation, codepoint composition and decomposition, and finally normalization to all four forms.

sword_kc = chars.normalize(:kc)
sword_kd = chars.normalize(:kd)
sword_kc.bytes # [195, 169, 112, 195, 169, 101]
sword_kd.bytes # [101, 204, 129, 112, 101, 204, 129, 101]
sword_kc.scan(/./) # ["é", "p", "é", "e"]
sword_kd.scan(/./) # ["e", "´", "p", "e", "´", "e"]

4.2.2 Encoding Conversions

Each String in Ruby has an encoding that determines how the bytes in the string are interpreted as characters. This allows input to be provided using different encodings and used together in the same program, but it can cause unexpected problems. Trying to combine strings with incompatible encodings will cause an Encoding::CompatibilityError exception. Fortunately, the encode method allows strings to be converted from one character encoding to another. Using the encode method, we can convert strings into the same encoding before we try to combine them:

sword = "épée"
sword_mac = sword.encode("macRoman")
sword.bytes # [195, 169, 112, 195, 169, 101]
sword_mac.bytes # [142, 112, 142, 101]

str = sword + sword_mac
# incompatible character encodings: UTF-8 and macRoman

str = sword + sword_mac.encode("UTF-8")
# "épée"

In a similar way, receiving input in an unexpected encoding can easily cause an exception later on when that input is processed:

File.write("sword.txt", sword.encode("macRoman"))
invalid_sword = File.read("sword.txt") # "\x8Ep\x8Ee"
invalid_sword.encoding # #<Encoding:UTF-8>
invalid_sword.valid_encoding? # false

strings = invalid_sword.split("p")
# invalid byte sequence in UTF-8 error

One fix is to use the force_encoding method to correct the encoding of the bytes that have already been read:

forced_sword = invalid_sword.force_encoding("macRoman")
forced_sword.encoding # #<Encoding:macRoman>
forced_sword.valid_encoding? # true
forced_sword.split("p") # ["\x8E", "\x8Ee"]

The other way to handle invalid byte sequences is to tell Ruby what the encoding of the file is, so it can be read correctly:

read_sword = File.read("sword.txt", :encoding => "macRoman")
read_sword.encoding # #<Encoding:macRoman>
read_sword.split("p") # ["\x8E", "\x8Ee"]

open_sword = File.open("sword.txt", "r:macRoman:UTF-8") do |f|
f.read
end
open_sword.encoding # #<Encoding:UTF-8>
open_sword.split("p") # ["é", "ée"]

In the second example, we used File.open’s mode string to tell Ruby to read in one encoding, but translate the result into another encoding before returning it.

However, sometimes the encoding of some text is simply unknown, or the data is so garbled that there is no possible valid encoding. In that case, it is still possible to avoid an encoding exception, by replacing invalid bytes with valid ones.

bad_sword = "\x8Ep\x8Ee"
bad_sword.encode!("UTF-8", :invalid => :replace,
                  :undef => :replace)
bad_sword # "�p�e"
bad_sword.valid_encoding? # true

Using the :replace option means that any bytes that cannot be decoded will be replaced with �, the Unicode replacement character. If desired, unreadable bytes can be replaced by any string of your choosing, including an empty string.
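Ruby 2.1 and later also provide String#scrub, which performs the same cleanup more concisely:

bad_sword = "\x8Ep\x8Ee"
bad_sword.scrub      # "�p�e" (the default replacement character)
bad_sword.scrub("?") # "?p?e"
bad_sword.scrub("")  # "pe" (simply drop the invalid bytes)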

4.2.3 Transliteration

A completely different way of converting between encodings is called transliteration. This is the process of taking characters from one alphabet and converting them to characters in another alphabet. In the context of I18N, transliteration almost always means simplifying alphabets into their closest equivalent in basic ASCII.

Examples include converting “épée” into “epee,” “dziękuję” into “dziekuje,” and “εὐδαιμονία” into “eudaimonia.” In transliterations from other writing systems, there can even be multiple ways to transliterate the same thing. Although transliteration can be useful for signage or names that are readable in another alphabet, it can often cause some information to be lost. For these reasons, transliteration does not provide a true encoding option, and should be avoided when possible.
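When transliteration is needed anyway, the Ruby standard library offers no support for it, but the activesupport gem we used earlier includes a basic transliterator for Latin-based scripts. A minimal sketch:

require 'active_support'
require 'active_support/inflector'

ActiveSupport::Inflector.transliterate("épée") # "epee"
ActiveSupport::Inflector.transliterate("über") # "uber"

Only characters covered by the default approximation rules are converted; anything else is replaced with a question mark.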

4.2.4 Collation

In computing, collation refers to the process of arranging text according to a particular order. Generally, but not always, this implies some kind of alphabetical or similar order. Collation often depends on normalization to be done correctly, because most languages group letters with and without accents together when sorting.

For example, let’s consider an array of strings that we want to collate. Note the presence of both composed and decomposed characters. What happens when we use the Array#sort method?

eacute = [0x00E9].pack('U')
acute = [0x0301].pack('U')
array = ["epicurean", "#{eacute}p#{eacute}e", "e#{acute}lan"]
array.sort # ["epicurean", "élan", "épée"]

That’s not what we want. But let’s try to understand why it happens. Let’s look at the first two characters of each string, in sorted order, and the bytes into which they’re encoded:

array.sort.map {|word| {word[0,2] => word[0,2].bytes} }
# [{"ep"=>[101, 112]},
#  {"é"=>[101, 204, 129]},
#  {"ép"=>[195, 169, 112]}]

In the first word, the 2 bytes encode two characters. In the second, 3 bytes encode two codepoints that compose into just one character. The third word has a 2-byte and then 1-byte character. By examining the values of the bytes, we can see that the words are sorted based on the bytes that comprise them.

The “e” has a lower value than the first byte of the precomposed “é,” so it is sorted first. In UTF-8, ASCII characters have the lowest possible values, so non-ASCII characters will always be sorted last. The decomposed “é” in the middle word begins with a plain “e” byte, so it sorts among the “e” words; but the accent’s high first byte pushes it after every all-ASCII “e” sequence, while still keeping it before “f.”

Bear in mind that this type of sorting problem is something of an issue even with plain ASCII. For example, uppercase characters all have lower byte values than their lowercase equivalents, so they don’t sort together. We might expect “pyre” and “PyrE” to be adjacent after sorting, but “Pyramus” would come between them. Basically this just means that lexicographic sorting is not alphabetic sorting, and it certainly does not follow any complex rules such as we see in a dictionary or phone book.
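A quick demonstration:

["pyre", "Pyramus", "PyrE"].sort
# ["PyrE", "Pyramus", "pyre"] (bytewise, not alphabetical)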

Let’s assume that we are processing our list according to English rules and that we are going to ignore accents. The first step is to define our transformation method. We’ll normalize our strings to decomposed forms and then elide the diacritics, leaving just the base characters, and sort by the result. For reference, the Unicode range for combining diacritical marks runs from U+0300 to U+036F.

Chars = ActiveSupport::Multibyte::Chars # for convenience

def english_sort(str)
  kd = Chars.new(str).downcase.normalize(:kd)
  cp = kd.codepoints.select {|c| c < 0x0300 || c > 0x036F }
  cp.pack("U*")
end


array.map{|s| english_sort(s) }
# ["epicurean", "epee", "elan"]

array.sort_by{|s| english_sort(s) }
# ["élan", "épée", "epicurean"]

That’s better. Although we have addressed capitalization (by using the Unicode-aware downcase method), we haven’t addressed character equivalents yet. Let’s look at German as an example.

In fact, there is more than one collation for German; we’ll use the DIN-2 collation (or phone book collation) for this exercise, in which the German character “ß” is equivalent to “ss,” and the umlaut is equivalent to a letter “e.” So “ö” is equivalent to “oe”, and so on.

Our transformation method should address this. Once again, we will start by normalizing our string to a decomposed form. For reference, the combining diaeresis (or umlaut) is U+0308. We’ll also use Ruby’s case conversion, but we need to augment it a little. Here, then, is a basic transformation method:

def sort_german(str)
  mb = ActiveSupport::Multibyte::Chars.new(str)
  kd = mb.downcase.normalize(:kd)
  kd.gsub('ß', 'ss').gsub("\u0308", 'e').to_s
end

["Straße", "öffnen"].map {|x| sort_german(x) }
# ["strasse", "oeffnen"]

Real-world collation algorithms are more complex than the examples we have seen here, and employ multiple levels. Usually, the first level tests the character only (ignoring accents and case), the second level orders by accents, the third level takes case into consideration, and so on. Each level is only used if the previous level was inconclusive.
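Here is a minimal sketch of the multilevel idea. It reuses our english_sort method as the first level and falls back to raw byte order as a crude stand-in for the accent level (a real collator assigns accents their own weights):

words = ["côté", "coté", "côte", "cote"]
words.sort_by {|w| [english_sort(w), w] }
# ["cote", "coté", "côte", "côté"]

Because the first element of each key is the accent-stripped form, accents only ever break ties between otherwise identical words.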

Even with multiple levels, sorting is still extremely difficult, and often requires detailed knowledge of the language. Some languages sort multiple-character sequences as a single semantic unit (for example, “lj” in Croatian is placed between “l” and “m”).

In Danish, “aa” (or “å”) is sorted after “z,” but only for Danish words. For foreign words, “aa” is just sorted with the other “a” words. That rule means that the city of Aachen comes before Atlanta, but Aalborg comes after Zurich.

Ultimately, it’s not possible to devise a truly generic collation algorithm that works for all languages, because many languages have directly contradictory sorting requirements. Keep this in mind when sorting lists for other languages.

4.3 Translations

Lojban is culturally fully neutral. Its vocabulary was built algorithmically using today’s six most widely spoken languages: Chinese, Hindi, English, Russian, Spanish, and Arabic.

—Nick Nicholas and John Cowan, from “What Is Lojban?”

Now that we have covered how to handle program input and output in any alphabet, we turn to handling input and output in any language. Translating a program into other languages is a lot of work, but makes your program usable by many additional people.

Fully localizing your program means translating all of the text, including instructions, error messages, and output. It also means formatting data correctly for the locale, including currencies, numbers, dates, and times. We’re going to start with the basics of translation, then work our way up through more complex examples, including pluralization and formatting numerical data.

By using the i18n gem, we can use translations from an external file or files, which can be written and edited by translators. The files are formatted in a human-readable format called YAML (YAML Ain’t Markup Language), and each message can be looked up by a key. The I18n.translate method (which is aliased to I18n.t for brevity) accepts a key, and it returns the message that corresponds to that key.

Let’s look at an example. First, we’ll require the i18n gem (you can install it with gem install i18n if needed), and then we’ll supply some translations.

require 'i18n'
I18n.backend.store_translations(:en,
  greeting: { hello: "Hello there!" })
I18n.backend.store_translations(:ja,
  greeting: { hello: "こんにちは！" })

The two-letter abbreviations each refer to a language, and they come from the ISO 639 international standard. Many programming languages, including C, Java, PHP, Perl, and Python, also use these same codes.

The second argument to store_translations sets up the keys that allow the lookup of messages. Beneath the language codes, messages can be organized in any arbitrary hierarchy of keys.

The translate method (or t for short) takes a key and looks up the message stored under that key in the current locale. The locale defaults to :en, but can be changed at any time.

I18n.locale # :en
I18n.t "greeting.hello" # "Hello there!"

I18n.locale = :ja # :ja
I18n.t "greeting.hello" # "こんにちは！"

Each period in the key passed to I18n.t indicates a level of hierarchy; each segment of the key names an entry in the nested hash of translations.
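The same lookup can also be written with the scope option to the translate method, which we will put to use later in the chapter:

# these three lookups are equivalent
I18n.t "greeting.hello"
I18n.t :hello, scope: :greeting
I18n.t :hello, scope: [:greeting]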

In large programs, translations are typically stored in dedicated Ruby or YAML files in a locale/ directory. When multiple files are used, all their keys and messages will be combined, so files are normally separated by language or by type of message.

Let’s create a small program that surveys the user and then prints the results of the survey. We’ll ask for the user’s name, home city, and number of children, using the i18n gem to provide translations for the program output.

First, we need to require the i18n gem, tell it where we will store our translation files, and set the locale. On UNIX systems, the locale is traditionally provided in the LANG environment variable in a form like en_US.UTF-8, so we’ll use that.

# survey.rb
require 'i18n'
I18n.load_path = Dir["locale/*"]
I18n.enforce_available_locales = true
I18n.locale = (ENV["LANG"] || "en").split("_").first

puts I18n.t("ask.name")
name = gets.chomp
puts I18n.t("ask.location")
place = gets.chomp
puts I18n.t("ask.children")
childnum = gets.chomp.to_i
puts I18n.t("ask.thanks")

puts name, place, childnum

If you run the code at this point, however, you’ll simply see an I18n::InvalidLocale exception. That’s because we haven’t supplied English translations yet. Let’s create files containing translations in English and Japanese:

# locale/en.yml
en:
  ask:
    name: "What is your name?"
    location: "Where do you live?"
    children: "How many children do you have?"
    thanks: "Thank you!"

# locale/ja.yml
ja:
  ask:
    name: "お名前は何ですか？"
    location: "どこに住んでいますか？"
    children: "お子さんは何人いますか？"
    thanks: "ありがとうございます！"

Now, our program can ask questions in either English or Japanese, based on the LANG environment variable:

$ ruby survey.rb
What is your name?
[...]
$ LANG=ja ruby survey.rb
お名前は何ですか？
[...]

4.3.1 Defaults

Translating your program into many languages can make it usable by a much larger audience, but this comes with some dangers of its own. The translate method expects every language to have a translation for every key. Missing keys don’t raise errors, but can cause a program to be unusable. If we add an empty Russian translation file, we can run our survey in Russian and see the missing translation message:

$ echo "ru:\n ask:" > locale/ru.yml
$ LANG=ru ruby survey.rb
translation missing: ru.ask.name

As you might expect, this can interfere with using your program successfully. There are two strategies to handle this type of problem: One option is to raise an exception when a translation is missing. This reveals the problem early, and can be especially useful while running automated tests. To raise an error on missing translations, set an I18n.exception_handler like this one:

I18n.exception_handler = -> (problem, locale, key, options) {
  raise problem.to_exception
}

In production, it’s probably a bad idea to raise an error whenever a translation is missing. By enabling fallbacks, it is possible to provide a different translation that does actually exist. The simplest way to enable fallbacks is to include the Fallbacks backend extension and set a default locale:

require "i18n/backend/fallback"
I18n::Backend::Simple.send(:include, I18n::Backend::Fallback)
I18n.default_locale = :en

After a default locale of :en has been set, locales with missing translations will use the English translation instead:

$ LANG=ru ruby survey.rb
What is your name?
[...]

Note that setting both a default locale and an exception handler means that keys in both the current locale and the default locale will be tried before the exception is raised.

4.3.2 Namespaces

As you may have noticed earlier, every translation key for a survey question starts with ask. Using the scope argument to the translate method, it is possible to create a helper method that only requires the name of the question.

def ask(key)
  I18n.translate(key, scope: "ask")
end

puts ask("name")
name = gets.chomp

It is good practice to use named helper methods, or to define a common translation method that supplies the correct scope when it is used in different contexts. This makes it easier to manage complex sets of translations.
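One possible shape for such a shared helper follows; the method name here is our own invention:

# a hypothetical shared helper; each part of the program passes
# its own scope, so keys stay short at the call site
def translate_scoped(key, scope, options = {})
  I18n.translate(key, options.merge(scope: scope))
end

translate_scoped("name", "ask") # same as I18n.t("ask.name")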

4.3.3 Interpolation

Without translations, inserting variables into strings is very straightforward. To generate a greeting given a name variable, you can simply write the string "Hello, #{name}!".

With translated strings, that isn’t possible. In other languages, variables might need to be inserted before, or after, or even in between other words. Deciding where to interpolate needs to be done by translators, not developers.

To handle this problem, translation strings have their own interpolation format, using the percent sign and curly braces. Developers can supply named variables when asking for a translation, and translators can insert those named variables at the appropriate location.

Using interpolation, we can add another message to our translations, and another line of code that prints out the survey results in a much clearer way:

# survey.rb
puts I18n.t("result.lives_in", name: name,
            place: place, childnum: childnum)

# locale/en.yml
en:
  result:
    lives_in: >
      %{name} lives in %{place},
      and has %{childnum} child(ren).

# locale/ja.yml
ja:
  result:
    lives_in: >
      %{name}は%{place}に住んでいて、
      子供が%{childnum}人います。

Now the results are printed out as a complete sentence in both English and Japanese:

John Smith lives in San Francisco, and has 4 child(ren).
John SmithはSan Franciscoに住んでいて、子供が4人います。

There’s one place where our “new and improved” output looks quite awkward, though. We have to say “child(ren)” because there might be one child, but there might be zero (or several) children. In Japanese, it’s even worse: We might end up saying “zero children exist,” which doesn’t even make sense.

4.3.4 Pluralization

Counting things in multiple languages is even harder than just translating. A single translation might have to be completely different depending on how many things are being talked about.

In English, pluralization isn’t too bad. Nouns have singular and plural forms (such as “child” and “children”), or a single word that works for both (like “sheep” or “deer”). Japanese is even easier—nouns are the same whether there is one or many.

Other languages can be far more complex. Russian has three forms: one singular, one plural for numbers ending in 2, 3, or 4, and one for all other numbers. Russian isn’t even alone: Polish and others have similar patterns for plural words.

In order to pluralize and translate at the same time, pluralized translations need to contain separate keys for each plural form. In English, that means “zero,” “one,” and “other.” Here’s what the English locale file looks like with pluralizations:

# locale/en.yml
en:
  result:
    lives_in: "%{name} lives in %{place}, and has "
    children:
      zero: "no children."
      one: "a single child."
      other: "%{count} children."

With pluralized translations, just call the translate method and pass it a count parameter. The correct pluralization form will be added to the key automatically:

# survey.rb
puts I18n.t("result.lives_in", name: name, place: place) +
  " " + I18n.t("result.children", count: childnum)

In order to pluralize in Japanese, we’ll need to require the Pluralization backend and start using it:

# survey.rb
require "i18n/backend/pluralization"
I18n::Backend::Simple.send(:include,
  I18n::Backend::Pluralization)

Next, we add a pluralization rule, written in Ruby, to a new locale file for pluralization rules:

# locale/plurals.rb
{ ja: { i18n: { plural: {
  keys: [:zero, :other],
  rule: -> (n) { n.zero? ? :zero : :other }
}}}}

Then, we can add keys for just “zero” and “other” because Japanese doesn’t treat one object differently from two or more objects:

# locale/ja.yml
ja:
  result:
    lives_in: "%{name}は%{place}に住んでいて、"
    children:
      zero: "子供はいません。"
      other: "子供が%{count}人います。"

Pluralization rules for other languages, both more and less complicated, can be implemented in a very similar way. Rather than rediscover pluralization rules for every language separately, developers from many countries have collaborated to create fairly exhaustive lists of pluralization and other formatting rules.

The Unicode Consortium hosts a large set of rules at the Common Locale Data Repository, online at cldr.unicode.org. Twitter has released a version of that data, usable directly in Ruby, as the twitter_cldr gem.

4.4 Localized Formatting

As hinted at in our discussion of pluralization, there are other formatting rules beyond count-based noun forms. Every language has its own standards for how to represent dates, times, numbers, and currency.

The bare i18n gem includes support for formatting dates and times, but doesn’t supply any translations. The twitter_cldr gem is the easiest way to format dates, times, numbers, and currencies for the current locale. We’ll use it in the following examples, but keep in mind that it hard-codes the CLDR formats and cannot be customized as easily as editing a .yml file.

4.4.1 Dates and Times

The CLDR defines four date and time formats for every language: full, long, medium, and short. Other formats may be available for some languages, so check the documentation for the gem if you’d like to learn more about the options. Formatting dates and times into a format that you create (using strftime) is covered in Section 7.21, “Formatting and Printing Time Values.”

Before using any of the localize methods, install the twitter_cldr gem by running gem install twitter_cldr and then require it in your Ruby script:

require 'twitter_cldr'

To format dates and times together into a single localized string, use DateTime#localize:

date = DateTime.parse("2014-12-15")
date.localize(:en).to_short_s # "12/15/14, 12:00 AM"
date.localize(:fr).to_short_s # "15/12/2014 00:00"

To convert times, there is a similar Time#localize method:

require 'time' # Time.parse lives in the time library

time = Time.parse("9:00 PM GMT")
time.localize(:en).to_short_s # "9:00 PM"
time.localize(:fr).to_short_s # "21:00"

Printing dates is the odd case out, because localized DateTime objects must be converted to dates before they can be formatted. There is no Date#localize method.

date = DateTime.parse("2014-12-15")
date.localize(:en).to_date.to_short_s # "12/15/14"
date.localize(:fr).to_date.to_short_s # "15/12/2014"

Predictably, there are also methods named to_medium_s, to_long_s, and to_full_s that format dates into strings that spell out month names and weekdays:

date.localize(:en).to_medium_s
# "Dec 15, 2014, 12:00:00 AM"

date.localize(:en).to_long_s
# "December 15, 2014 'at' 12:00:00 AM UTC"

date.localize(:en).to_full_s
# "Monday, December 15, 2014 'at' 12:00:00 AM UTC +00:00"

4.4.2 Numbers

Formatting numbers is similarly straightforward because every number gains a localize method:

num = 1_337
num.localize(:en).to_s # "1,337"
num.localize(:fr).to_s # "1 337"

Formatting decimals is as simple as localizing a Float or calling to_decimal on the localized number:

1337.00.localize(:en).to_s(precision: 2) # "1,337.00"
num.localize(:fr).to_decimal.to_s(precision: 2) # "1 337,00"

Finally, localized numbers can be formatted as percentages using the to_percent method, which also takes a precision parameter:

num.localize(:en).to_percent.to_s # "1,337%"
num.localize(:fr).to_percent.to_s(precision: 2) # "1 337,00 %"

4.4.3 Currencies

Currency formatting defaults to USD (American dollars), but can easily be set using three-letter currency codes, as shown here:

num.localize(:en).to_currency.to_s
# "$1,337.00"
num.localize(:fr).to_currency.to_s(currency: "EUR")
# "1 337,00 €"

4.5 Conclusion

In this chapter, we’ve looked at the issues faced by programmers as they internationalize and localize their applications. I18N and L10N are regarded as passionately important by many users and developers, and when well done can greatly multiply the reach and user base of any application.

As part of I18N, we examined character encodings and how the Unicode standard provides a way to encode almost any character that exists. Next, we reviewed L10N and how to implement a localized application with fully translated output, including pluralization. Finally, we looked at how to format numbers, dates, times, and currencies according to the rules of a particular locale.

Along the way, we saw how to use the i18n gem in conjunction with other gems to translate and pluralize strings as well as format numbers correctly for any locale.

At this point, we’ll take a short break from strings and formatting. In the next chapter, we’ll look at how to represent numbers in Ruby and perform calculations with them.