Internationalization with gawk - Moving Beyond Standard awk with gawk - Effective awk Programming (2015)

Part III. Moving Beyond Standard awk with gawk

Chapter 13. Internationalization with gawk

Once upon a time, computer makers wrote software that worked only in English. Eventually, hardware and software vendors noticed that if their systems worked in the native languages of non-English-speaking countries, they were able to sell more systems. As a result, internationalization and localization of programs and software systems became a common practice.

For many years, the ability to provide internationalization was largely restricted to programs written in C and C++. This chapter describes the underlying library gawk uses for internationalization, as well as how gawk makes internationalization features available at the awk program level. Having internationalization available at the awk level gives software developers additional flexibility—they are no longer forced to write in C or C++ when internationalization is a requirement.

Internationalization and Localization

Internationalization means writing (or modifying) a program once, in such a way that it can use multiple languages without requiring further source code changes. Localization means providing the data necessary for an internationalized program to work in a particular language. Most typically, these terms refer to features such as the language used for printing error messages, the language used to read responses, and information related to how numerical and monetary values are printed and read.

GNU gettext

gawk uses GNU gettext to provide its internationalization features. The facilities in GNU gettext focus on messages: strings printed by a program, either directly or via formatting with printf or sprintf().[82]

When using GNU gettext, each application has its own text domain. This is a unique name, such as ‘kpilot’ or ‘gawk’, that identifies the application. A complete application may have multiple components—programs written in C or C++, as well as scripts written in sh or awk. All of the components use the same text domain.

To make the discussion concrete, assume we’re writing an application named guide. Internationalization consists of the following steps, in this order:

1. The programmer reviews the source for all of guide’s components and marks each string that is a candidate for translation. For example, "`-F': option required" is a good candidate for translation. A table with strings of option names is not (e.g., gawk’s --profile option should remain the same, no matter what the local language).

2. The programmer indicates the application’s text domain ("guide") to the gettext library, by calling the textdomain() function.

3. Messages from the application are extracted from the source code and collected into a portable object template file (guide.pot), which lists the strings and their translations. The translations are initially empty. The original (usually English) messages serve as the key for lookup of the translations.

4. For each language with a translator, guide.pot is copied to a portable object file (.po) and translations are created and shipped with the application. For example, there might be a fr.po for a French translation.

5. Each language’s .po file is converted into a binary message object (.gmo) file. A message object file contains the original messages and their translations in a binary format that allows fast lookup of translations at runtime.

6. When guide is built and installed, the binary translation files are installed in a standard place.

7. For testing and development, it is possible to tell gettext to use .gmo files in a different directory than the standard one by using the bindtextdomain() function.

8. At runtime, guide looks up each string via a call to gettext(). The returned string is the translated string if available, or the original string if not.

9. If necessary, it is possible to access messages from a different text domain than the one belonging to the application, without having to switch the application’s default text domain back and forth.

In C (or C++), the string marking and dynamic translation lookup are accomplished by wrapping each string in a call to gettext():

printf("%s", gettext("Don't Panic!\n"));

The tools that extract messages from source code pull out all strings enclosed in calls to gettext().

The GNU gettext developers, recognizing that typing ‘gettext(…)’ over and over again is both painful and ugly to look at, use the macro ‘_’ (an underscore) to make things easier:

/* In the standard header file: */
#define _(str) gettext(str)

/* In the program text: */
printf("%s", _("Don't Panic!\n"));

This reduces the typing overhead to just three extra characters per string and is considerably easier to read as well.

There are locale categories for different types of locale-related information. The defined locale categories that gettext knows about are:

LC_MESSAGES

Text messages. This is the default category for gettext operations, but it is possible to supply a different one explicitly, if necessary. (It is almost never necessary to supply a different category.)

LC_COLLATE

Text-collation information (i.e., how different characters and/or groups of characters sort in a given language).

LC_CTYPE

Character-type information (alphabetic, digit, upper- or lowercase, and so on) as well as character encoding. This information is accessed via the POSIX character classes in regular expressions, such as /[[:alnum:]]/ (see Using Bracket Expressions).

LC_MONETARY

Monetary information, such as the currency symbol, and whether the symbol goes before or after a number.

LC_NUMERIC

Numeric information, such as which characters to use for the decimal point and the thousands separator.[83]

LC_TIME

Time- and date-related information, such as 12- or 24-hour clock, month printed before or after the day in a date, local month abbreviations, and so on.

LC_ALL

All of the above. (Not too useful in the context of gettext.)

Internationalizing awk Programs

gawk provides the following variables for internationalization:

TEXTDOMAIN

This variable indicates the application’s text domain. For compatibility with GNU gettext, the default value is "messages".

_"your message here"

String constants marked with a leading underscore are candidates for translation at runtime. String constants without a leading underscore are not translated.

gawk provides the following functions for internationalization:

dcgettext(string [, domain [, category]])

Return the translation of string in text domain domain for locale category category. The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".

If you supply a value for category, it must be a string equal to one of the known locale categories described in the previous section. You must also supply a text domain. Use TEXTDOMAIN if you want to use the current domain.

CAUTION

The order of arguments to the awk version of the dcgettext() function is purposely different from the order for the C version. The awk version’s order was chosen to be simple and to allow for reasonable awk-style default arguments.

dcngettext(string1, string2, number [, domain [, category]])

Return the plural form used for number of the translation of string1 and string2 in text domain domain for locale category category. string1 is the English singular variant of a message, and string2 is the English plural variant of the same message. The default value for domain is the current value of TEXTDOMAIN. The default value for category is "LC_MESSAGES".

The same remarks about argument order as for the dcgettext() function apply.

bindtextdomain(directory [, domain ])

Change the directory in which gettext looks for .gmo files, in case they will not or cannot be placed in the standard locations (e.g., during testing). Return the directory in which domain is “bound.”

The default domain is the value of TEXTDOMAIN. If directory is the null string (""), then bindtextdomain() returns the current binding for the given domain.

To use these facilities in your awk program, follow these steps:

1. Set the variable TEXTDOMAIN to the text domain of your program. This is best done in a BEGIN rule (see The BEGIN and END Special Patterns), or it can also be done via the -v command-line option (see Command-Line Options):

    BEGIN {
        TEXTDOMAIN = "guide"
        …
    }

2. Mark all translatable strings with a leading underscore (‘_’) character. It must be adjacent to the opening quote of the string. For example:

    print _"hello, world"
    x = _"you goofed"
    printf(_"Number of users is %d\n", nusers)

3. If you are creating strings dynamically, you can still translate them, using the dcgettext() built-in function:[84]

    if (groggy)
        message = dcgettext("%d customers disturbing me\n", "adminprog")
    else
        message = dcgettext("enjoying %d customers\n", "adminprog")
    printf(message, ncustomers)

Here, the call to dcgettext() supplies a different text domain ("adminprog") in which to find the message, but it uses the default "LC_MESSAGES" category.

The previous example reads correctly only if ncustomers is greater than one. This example would be better done with dcngettext(), which takes the count as its third argument:

    if (groggy)
        message = dcngettext("%d customer disturbing me\n",
                             "%d customers disturbing me\n",
                             ncustomers, "adminprog")
    else
        message = dcngettext("enjoying %d customer\n",
                             "enjoying %d customers\n",
                             ncustomers, "adminprog")
    printf(message, ncustomers)

4. During development, you might want to put the .gmo file in a private directory for testing. This is done with the bindtextdomain() built-in function:

    BEGIN {
        TEXTDOMAIN = "guide"   # our text domain
        if (Testing) {
            # where to find our files
            bindtextdomain("testdir")
            # joe is in charge of adminprog
            bindtextdomain("../joe/testdir", "adminprog")
        }
        …
    }

See A Simple Internationalization Example for an example program showing the steps to create and use translations from awk.

Translating awk Programs

Once a program’s translatable strings have been marked, they must be extracted to create the initial .pot file. As part of translation, it is often helpful to rearrange the order in which arguments to printf are output.

gawk’s --gen-pot command-line option extracts the messages and is discussed next. After that, printf’s ability to rearrange the order for printf arguments at runtime is covered.

Extracting Marked Strings

Once your awk program is working, and all the strings have been marked and you’ve set (and perhaps bound) the text domain, it is time to produce translations. First, use the --gen-pot command-line option to create the initial .pot file:

gawk --gen-pot -f guide.awk > guide.pot

When run with --gen-pot, gawk does not execute your program. Instead, it parses it as usual and prints all marked strings to standard output in the format of a GNU gettext Portable Object file. Also included in the output are any constant strings that appear as the first argument to dcgettext() or as the first and second argument to dcngettext().[85] You should distribute the generated .pot file with your awk program; translators will eventually use it to provide you translations that you can also then distribute. See A Simple Internationalization Example for the full list of steps to go through to create and test translations for guide.

Rearranging printf Arguments

Format strings for printf and sprintf() (see Using printf Statements for Fancier Printing) present a special problem for translation. Consider the following:[86]

printf(_"String `%s' has %d characters\n",
       string, length(string))

A possible German translation for this might be:

"%d Zeichen lang ist die Zeichenkette `%s'\n"

The problem should be obvious: the order of the format specifications is different from the original! Even though gettext() can return the translated string at runtime, it cannot change the argument order in the call to printf.

To solve this problem, printf format specifiers may have an additional optional element, which we call a positional specifier. For example:

"%2$d Zeichen lang ist die Zeichenkette `%1$s'\n"

Here, the positional specifier consists of an integer count, which indicates which argument to use, and a ‘$’. Counts are one-based, and the format string itself is not included. Thus, in the following example, ‘string’ is the first argument and ‘length(string)’ is the second:

$ gawk 'BEGIN {
>     string = "Don\47t Panic"
>     printf "%2$d characters live in \"%1$s\"\n",
>                         string, length(string)
> }'
11 characters live in "Don't Panic"

If present, positional specifiers come first in the format specification, before the flags, the field width, and/or the precision.

Positional specifiers can be used with the dynamic field width and precision capability:

$ gawk 'BEGIN {
>     printf("%*.*s\n", 10, 20, "hello")
>     printf("%3$*2$.*1$s\n", 20, 10, "hello")
> }'
     hello
     hello

NOTE

When using ‘*’ with a positional specifier, the ‘*’ comes first, then the integer position, and then the ‘$’. This is somewhat counterintuitive.

gawk does not allow you to mix regular format specifiers and those with positional specifiers in the same string:

$ gawk 'BEGIN { printf "%d %3$s\n", 1, 2, "hi" }'

error→ gawk: cmd. line:1: fatal: must use `count$' on all formats or none

NOTE

There are some pathological cases that gawk may fail to diagnose. In such cases, the output may not be what you expect. It’s still a bad idea to try mixing them, even if gawk doesn’t detect it.

Although positional specifiers can be used directly in awk programs, their primary purpose is to help in producing correct translations of format strings into languages different from the one in which the program is first written.

awk Portability Issues

gawk’s internationalization features were purposely chosen to have as little impact as possible on the portability of awk programs that use them to other versions of awk. Consider this program:

BEGIN {
    TEXTDOMAIN = "guide"
    if (Test_Guide)   # set with -v
        bindtextdomain("/test/guide/messages")
    print _"don't panic!"
}

As written, it won’t work on other versions of awk. However, it is actually almost portable, requiring very little change:

§ Assignments to TEXTDOMAIN won’t have any effect, because TEXTDOMAIN is not special in other awk implementations.

§ Non-GNU versions of awk treat marked strings as the concatenation of a variable named _ with the string following it.[87] Typically, the variable _ has the null string ("") as its value, leaving the original string constant as the result.

§ By defining “dummy” functions to replace dcgettext(), dcngettext(), and bindtextdomain(), the awk program can be made to run, but all the messages are output in the original language. For example:

    function bindtextdomain(dir, domain)
    {
        return dir
    }

    function dcgettext(string, domain, category)
    {
        return string
    }

    function dcngettext(string1, string2, number, domain, category)
    {
        return (number == 1 ? string1 : string2)
    }

§ The use of positional specifications in printf or sprintf() is not portable. To support gettext() at the C level, many systems’ C versions of sprintf() do support positional specifiers. But it works only if enough arguments are supplied in the function call. Many versions of awk pass printf formats and arguments unchanged to the underlying C library version of sprintf(), but only one format and argument at a time. What happens if a positional specification is used is anybody’s guess. However, because the positional specifications are primarily for use in translated format strings, and because non-GNU awks never retrieve the translated string, this should not be a problem in practice.

A Simple Internationalization Example

Now let’s look at a step-by-step example of how to internationalize and localize a simple awk program, using guide.awk as our original source:

BEGIN {
    TEXTDOMAIN = "guide"
    bindtextdomain(".")  # for testing
    print _"Don't Panic"
    print _"The Answer Is", 42
    print "Pardon me, Zaphod who?"
}

Run ‘gawk --gen-pot’ to create the .pot file:

$ gawk --gen-pot -f guide.awk > guide.pot

This produces:

#: guide.awk:4
msgid "Don't Panic"
msgstr ""

#: guide.awk:5
msgid "The Answer Is"
msgstr ""

This original portable object template file is saved and reused for each language into which the application is translated. The msgid is the original string and the msgstr is the translation.

NOTE

Strings not marked with a leading underscore do not appear in the guide.pot file.

Next, the messages must be translated. Here is a translation to a hypothetical dialect of English, called “Mellow”:[88]

$ cp guide.pot guide-mellow.po

Add translations to guide-mellow.po …

Following are the translations:

#: guide.awk:4
msgid "Don't Panic"
msgstr "Hey man, relax!"

#: guide.awk:5
msgid "The Answer Is"
msgstr "Like, the scoop is"

The next step is to make the directory to hold the binary message object file and then to create the guide.mo file. We pretend that our file is to be used in the en_US.UTF-8 locale, because we have to use a locale name known to the C gettext routines. The directory layout shown here is standard for GNU gettext on GNU/Linux systems. Other versions of gettext may use a different layout:

$ mkdir en_US.UTF-8 en_US.UTF-8/LC_MESSAGES

The msgfmt utility does the conversion from human-readable .po file to machine-readable .mo file. By default, msgfmt creates a file named messages.mo in the current directory. The output must instead be given the proper name and placed in the proper directory (using the -o option) so that gawk can find it:

$ msgfmt guide-mellow.po -o en_US.UTF-8/LC_MESSAGES/guide.mo

Finally, we run the program to test it:

$ gawk -f guide.awk
Hey man, relax!
Like, the scoop is 42
Pardon me, Zaphod who?

If the three replacement functions for dcgettext(), dcngettext(), and bindtextdomain() (see awk Portability Issues) are in a file named libintl.awk, then we can run guide.awk unchanged as follows:

$ gawk --posix -f guide.awk -f libintl.awk
Don't Panic
The Answer Is 42
Pardon me, Zaphod who?

gawk Can Speak Your Language

gawk itself has been internationalized using the GNU gettext package. (GNU gettext is described in complete detail in GNU gettext utilities.) As of this writing, the latest version of GNU gettext is version 0.19.4.

If a translation of gawk’s messages exists, then gawk produces usage messages, warnings, and fatal errors in the local language.

Summary

§ Internationalization means writing a program such that it can use multiple languages without requiring source code changes. Localization means providing the data necessary for an internationalized program to work in a particular language.

§ gawk uses GNU gettext to let you internationalize and localize awk programs. A program’s text domain identifies the program for grouping all messages and other data together.

§ You mark a program’s strings for translation by preceding them with an underscore. Once that is done, the strings are extracted into a .pot file. This file is copied for each language into a .po file, and the .po files are compiled into .gmo files for use at runtime.

§ You can use positional specifications with sprintf() and printf to rearrange the placement of argument values in formatted strings and output. This is useful for the translation of format control strings.

§ The internationalization features have been designed so that they can be easily worked around in a standard awk.

§ gawk itself has been internationalized and ships with a number of translations for its messages.


[82] For some operating systems, the gawk port doesn’t support GNU gettext. Therefore, these features are not available if you are using one of those operating systems. Sorry.

[83] Americans use a comma every three decimal places and a period for the decimal point, while many Europeans do exactly the opposite: 1,234.56 versus 1.234,56.

[84] Thanks to Bruno Haible for this example.

[85] The xgettext utility that comes with GNU gettext can handle .awk files.

[86] This example is borrowed from the GNU gettext manual.

[87] This is good fodder for an “Obfuscated awk” contest.

[88] Perhaps it would be better if it were called “Hippy.” Ah, well.