Sams Teach Yourself PHP, MySQL and Apache All in One (2012)

Part V. Basic Projects

Chapter 27. Application Localization


In this chapter, you learn the following:

How to recognize and prepare for character set differences

How to prepare the structure of your application and produce localized sites


The key phrase in World Wide Web is World Wide. Creating a website useful to speakers of different languages is a breeze using PHP and MySQL. The process of preparing your applications for use in multiple locales is called internationalization; customizing your code for each locale is called localization.

About Internationalization and Localization

First and foremost, it’s important to understand that neither internationalization nor localization is the same thing as content translation. In fact, you can have a fully translated website—all in German, all in Japanese, or all in whatever language you want—and it will not be considered an internationalized or localized website. It will just be a translated one. The key aspects of an internationalized application are as follows:

• Externalizing all strings, icons, and graphics

• Modifying the display of formatting functions (dates, currency, numbers, and so on)

After you have constructed your application so that your strings are externalized (all strings used in functions, classes, and other scripts are managed in one place and included or otherwise referred to as constants) and your formatting functions can change per locale, you can begin the process of localization. Translation happens to be a part of localization.

A locale is essentially a grouping—in this case, a grouping of the translated strings, graphics, text, and formatting conventions that will be used in the application or website to be localized. These groupings are usually referred to by the name of the pervasive language of the application, such as the German locale. Although it might be obvious that the German locale includes text translated into German, it does not mean that the website is applicable only to people in Germany. Austrians who speak German would probably utilize a localized German website, but it would not be referred to as the Austrian locale.

In the next few sections, you learn about working with different character sets and how to modify your environment to successfully prepare your applications for localization.

About Character Sets

Character sets are usually referred to as single-byte or multibyte, referring to the number of bytes needed to represent a character used in a language. English, German, and French (among many others) are single-byte languages; only 1 byte is necessary to represent a character such as the letter a or the number 9. Single-byte code sets have, at most, 256 characters, including the entire set of ASCII characters, accented characters, and other characters necessary for formatting.

Multibyte code sets have more than 256 characters, including all single-byte characters as a subset. Multibyte languages include traditional and simplified Chinese, Japanese, Korean, Thai, Arabic, Hebrew, and so forth. These languages require more than 1 byte to represent a character. A good example is the word Tokyo, the capital of Japan. In English, it is spelled with four different characters, using a total of 5 bytes. However, in Japanese, the word is written with two characters, read tou and kyou, each of which uses 2 bytes in an encoding such as Shift_JIS, for a total of 4 bytes used.
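To make the byte counts concrete, the following sketch uses PHP's strlen() function, which counts bytes rather than characters. The Japanese word is written as escape codes so that the result does not depend on how the file is saved; the variable names are only illustrative.

```php
<?php
// strlen() counts bytes, not characters.
$english  = "Tokyo";                     // five ASCII letters, 1 byte each
$shiftJis = "\x93\x8C\x8B\x9E";          // the two Japanese characters, encoded as Shift_JIS
$utf8     = "\xE6\x9D\xB1\xE4\xBA\xAC";  // the same two characters, encoded as UTF-8

$byteCounts = array(
    'english'   => strlen($english),
    'shift_jis' => strlen($shiftJis),
    'utf8'      => strlen($utf8),
);

foreach ($byteCounts as $label => $bytes) {
    echo $label . ": " . $bytes . " bytes\n";
}
```

The same two Japanese characters occupy 4 bytes in Shift_JIS and 6 bytes in UTF-8, which is why the browser must be told which character set a page uses.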

This is a complete simplification of character sets and the technology behind them, but the relevance is this: To properly interpret and display the text of web pages in their intended language, it is up to you to tell the web browser which character set to use. This is achieved by sending the appropriate headers before all content.

If you have a set of pages that include Japanese text and you do not send the correct headers regarding language and character set, those pages will render incorrectly in web browsers whose primary language is not Japanese. In other words, because no character set information is included, the browser assumes that it is to render the text using its own default character set. For example, if your Japanese pages use the Shift_JIS or UTF-8 character set and your browser is set for ISO-8859-1, your browser will try to render the Japanese text using the single-byte ISO-8859-1 character set. It will fail miserably in this unless the headers alert it to use Shift_JIS or UTF-8 and you have the appropriate libraries and language packs installed on your operating system.


Tip

Mojibake is the term for unrecognizable characters of this type. For more information, see http://en.wikipedia.org/wiki/Mojibake.


The headers in question are the Content-Type and Content-Language headers, and the same information can also be set with HTML5 tags and attributes. Because you have all the tools for a dynamic environment, it’s best to both send the appropriate headers before your text and print the correct HTML5 tags and attributes in your document. The following is an example of the header() function outputting the proper character information for an English site:

header("Content-Type: text/html;charset=ISO-8859-1");
header("Content-Language: en");

The accompanying HTML5 tags would be these:

<html lang="en">
<meta charset="ISO-8859-1">

A German site would use the same character set but a different language code:

header("Content-Type: text/html;charset=ISO-8859-1");
header("Content-Language: de");

The accompanying HTML5 tags would be these:

<html lang="de">
<meta charset="ISO-8859-1">

A Japanese site uses both a different character set and different language code:

header("Content-Type: text/html;charset=Shift_JIS");
header("Content-Language: ja");

The accompanying HTML5 tags would be these:

<html lang="ja">
<meta charset="Shift_JIS">
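The three header pairs above differ only in two values, so they can be produced from a single table. The following is a minimal sketch of that idea; the function name buildLocaleHeaders(), the $charsets array, and the fallback to English for unknown codes are assumptions of this sketch, not part of the examples above.

```php
<?php
// Hypothetical lookup table: language code => character set,
// using the same values as the three examples above.
$charsets = array(
    'en' => 'ISO-8859-1',
    'de' => 'ISO-8859-1',
    'ja' => 'Shift_JIS',
);

// Build the two header lines for a language code.
// Unknown codes fall back to English (an assumption of this sketch).
function buildLocaleHeaders($lang, array $charsets)
{
    if (!isset($charsets[$lang])) {
        $lang = 'en';
    }
    return array(
        'Content-Type: text/html;charset=' . $charsets[$lang],
        'Content-Language: ' . $lang,
    );
}

// Headers must go out before any content, so check first.
if (!headers_sent()) {
    foreach (buildLocaleHeaders('ja', $charsets) as $line) {
        header($line);
    }
}
```

Keeping the values in one table means that adding a locale is a one-line change rather than another copy of the header() calls.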

Environment Modifications

Your environment, as defined in the installation chapters of this book, need not change to handle localized websites. Although you can use several language-related settings in Apache, PHP, and MySQL to accommodate localized websites, you can also perform all the tasks in this chapter without making any language-related changes to your configuration. Just for your own information, the next few sections point you to the appropriate documentation for internationalization using Apache, PHP, and MySQL.

Configuration Changes to Apache

In Chapter 29, “Apache Performance Tuning and Virtual Hosting,” you learn about the concept of content negotiation using the mod_mime or mod_negotiation modules and the AddLanguage and AddCharset directives (among others). You use these directives when you manually change the extension of your file and want Apache to interpret the character set to be used, based on that extension. However, that is not what this chapter discusses. You want all your localized websites to have the same file-naming conventions (such as index.html and company_info.html) and not have to manually create multiple pages with different language-based extensions to accommodate translated files. Your goal regarding website localization is to have a single set of pages filled with the appropriately translated text running from one web server.


Note

There’s nothing wrong with Apache-based content negotiation using multiple files with language-based naming conventions. It’s just not the focus of this chapter. You can read more about Apache-based content negotiation at http://httpd.apache.org/docs-2.2/content-negotiation.html.


Configuration Changes to PHP

As with Apache, no configuration changes in PHP are required for any tasks in this chapter. However, you can use a host of functions related to the handling of multibyte characters, if you want. These functions are in the PHP manual at http://www.php.net/mbstring and must be enabled during the configuration process using this code. (Windows users enable the php_mbstring.dll extension in php.ini.)

--enable-mbstring=LANG

Here, LANG is a language code, such as ja for Japanese, cn for Simplified Chinese, and so forth. Or, you can use this line to enable all available languages:

--enable-mbstring=all

When you enable mbstring functions in PHP, you can set several options in the php.ini configuration file to use these functions properly. After this is configured, you can use any of the more than 40 mbstring-related functions for handling multibyte input in PHP.

The manual entries for these functions are comprehensive and recommended reading for advanced work with multibyte character sets and dynamic content. You will get by just fine in this chapter without them, although it is recommended that at some point you peruse the PHP manual for your own edification.
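As a small taste of what the mbstring functions offer, the following sketch compares strlen() with mb_strlen() and mb_substr() on a two-character Japanese string. It assumes the string is UTF-8 encoded and checks whether the mbstring extension is loaded before using it; the variable names are illustrative.

```php
<?php
// The Japanese word for Tokyo as UTF-8 escape codes: two characters, six bytes.
$tokyo = "\xE6\x9D\xB1\xE4\xBA\xAC";

$byteLength = strlen($tokyo); // strlen() counts bytes

if (extension_loaded('mbstring')) {
    // mb_strlen() counts characters when told which encoding the string uses.
    $charLength = mb_strlen($tokyo, 'UTF-8');
    // mb_substr() cuts on character boundaries, so no character is split in half.
    $firstChar = mb_substr($tokyo, 0, 1, 'UTF-8');
} else {
    // Without mbstring, this sketch cannot count characters.
    $charLength = null;
    $firstChar = null;
}

echo "bytes: " . $byteLength . "\n";
echo "characters: " . var_export($charLength, true) . "\n";
```

Functions such as substr() work on bytes and can cut a multibyte character in the middle, which is one practical reason to reach for the mb_ equivalents when handling user input.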

Configuration Changes to MySQL

No explicit changes are needed in MySQL for the localization examples used in this chapter because the examples are not database-driven. The default character set used in MySQL is ISO-8859-1, but that does not mean that you are limited only to storing single-byte characters in your database tables. For more information on the current language-related elements of MySQL, read the MySQL Manual entry at http://dev.mysql.com/doc/refman/5.5/en/globalization.html.

Creating a Localized Page Structure

In this section, you look at a functioning example of a localized welcome page that uses PHP to enable a user to select a target language and then receive the appropriate text. The goal of this section is to show an example of externalizing the strings used in this script, which is one of the characteristics of internationalization.

In this script, the user happens upon your English-based website but is also presented with an option to browse within the locale of his choice—English, German, or Japanese. Three elements are involved in this process:

• Creating and using a master file for sending locale-specific header information

• Creating and using a master file for displaying the information based on the selected locale

• Using the script itself

Listing 27.1 shows the contents of the master file used for sending locale-specific header information.

Listing 27.1 Language Definition File


1: <?php
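2: if (!isset($_GET['lang'])) { // no new selection in this request
3: if (!isset($_SESSION['lang'])) { $_SESSION['lang'] = "en"; } // first visit: default to English
4: $currLang = $_SESSION['lang']; // otherwise keep the stored choice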
5: } else {
6: $currLang = $_GET['lang'];
7: $_SESSION['lang'] = $currLang;
8: }
9:
10: switch($currLang) {
11: case "en":
12: define("CHARSET","ISO-8859-1");
13: define("LANGCODE", "en");
14: break;
15:
16: case "de":
17: define("CHARSET","ISO-8859-1");
18: define("LANGCODE", "de");
19: break;
20:
21: case "ja":
22: define("CHARSET","UTF-8");
23: define("LANGCODE", "ja");
24: break;
25:
26: default:
27: define("CHARSET","ISO-8859-1");
28: define("LANGCODE", "en");
29: break;
30: }
31:
32: header("Content-Type: text/html;charset=".CHARSET);
33: header("Content-Language: ".LANGCODE);
34: ?>


Lines 2–8 of Listing 27.1 set up the session value needed to store the user’s selected language choice.


Note

The session_start() function is not used in the define_lang.php or the lang_strings.php file listed in the following paragraphs because these files are included via the include() function from within the master file. The master file, which you will create shortly, calls the session_start() function, which will be valid for these included files as well.


If no session value exists, the English locale settings are used. If your site were a German site by default, you would change this file to use the German locale by default. This script prepares for the next script, which contains an input-selection mechanism, by setting the value of $currLang to the result of this input in line 6.

The switch statement beginning on line 10 contains several case statements designed to assign the appropriate values to the constants CHARSET and LANGCODE. Lines 32–33 use these constants for the first time when dynamically creating and sending the Content-Type and Content-Language headers.

Save this file as define_lang.php and place it in the document root of your web server. This file defines two constants used in the next script, which is the actual display script. The constants are CHARSET and LANGCODE, corresponding to the character set and language code for each locale. The display script uses these constants to create the proper META tags regarding character set and language code. Although this script sends the headers, it’s a good idea to ensure that they are part of the page itself to aid in any necessary input from forms.
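The language-selection logic in lines 2–8 of Listing 27.1 can also be written as a small function that is easy to try outside a web server. The following is a sketch only: the function name resolveLanguage() and the list of supported codes are hypothetical, and the arrays passed in stand in for $_GET and $_SESSION. The sketch reuses the stored session value whenever the request carries no lang parameter, and it ignores codes that are not in the supported list; both choices are assumptions of the sketch.

```php
<?php
// Hypothetical helper: decide which language to use for this request.
// $query and $session stand in for $_GET and $_SESSION.
function resolveLanguage(array $query, array $session, array $supported, $default = 'en')
{
    // A supported language in the query string wins.
    if (isset($query['lang']) && in_array($query['lang'], $supported, true)) {
        return $query['lang'];
    }
    // Otherwise reuse a supported language remembered in the session.
    if (isset($session['lang']) && in_array($session['lang'], $supported, true)) {
        return $session['lang'];
    }
    // Otherwise fall back to the default locale.
    return $default;
}

$supported = array('en', 'de', 'ja');

echo resolveLanguage(array('lang' => 'de'), array(), $supported) . "\n"; // de
echo resolveLanguage(array(), array('lang' => 'ja'), $supported) . "\n"; // ja
echo resolveLanguage(array('lang' => 'xx'), array(), $supported) . "\n"; // en
```

In a real page, the result would be stored back in $_SESSION['lang'] after session_start() has been called, so that the choice carries over to the next request.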

Listing 27.2 creates a function that simply stores the externalized strings used in the display script. This example uses two: one to welcome the user to the page (WELCOME_TXT) and one to introduce the language selection process (CHOOSE_TXT).

Listing 27.2 String Definition File


1: <?php
2: function defineStrings() {
3: switch($_SESSION['lang']) {
4: case "en":
5: define("WELCOME_TXT","Welcome!");
6: define("CHOOSE_TXT","Choose Language");
7: break;
8:
9: case "de":
10: define("WELCOME_TXT","Willkommen!");
11: define("CHOOSE_TXT","Sprache auswählen");
12: break;
13:
14: case "ja":
15: define("WELCOME_TXT","[unprintable characters]");
16: define("CHOOSE_TXT","[unprintable characters]");
17: break;
18:
19: default:
20: define("WELCOME_TXT","Welcome!");
21: define("CHOOSE_TXT","Choose Language");
22: break;
23: }
24: }
25: ?>


Use the file lang_strings.php from the CD included with this book to use the actual Japanese characters that cannot be displayed here. Place this file in the document root of your web server. This file defines two constants, WELCOME_TXT and CHOOSE_TXT, which are used in the display script. These constants are defined within the context of the function called defineStrings(), although you could just as easily make this file a long switch statement outside the context of the function structure. I’ve simply put it in a function for the sake of organization and for ease of explanation when it comes time to use the display script.
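Another way to externalize strings, shown here only as a sketch, is to keep them in one array per locale and look them up by name. The array layout and the function name lookupString() are hypothetical and are not part of the listings in this chapter. The German umlaut is written as an HTML entity so that the string does not depend on the file's encoding, and the Japanese entries are left out because they cannot be reproduced here.

```php
<?php
// Hypothetical string table: one array per locale, keyed by string name.
$strings = array(
    'en' => array(
        'WELCOME_TXT' => 'Welcome!',
        'CHOOSE_TXT'  => 'Choose Language',
    ),
    'de' => array(
        'WELCOME_TXT' => 'Willkommen!',
        'CHOOSE_TXT'  => 'Sprache ausw&auml;hlen',
    ),
);

// Return the string for a locale, falling back to the default locale
// and finally to the key itself so that a missing entry is visible.
function lookupString(array $strings, $lang, $key, $default = 'en')
{
    if (isset($strings[$lang][$key])) {
        return $strings[$lang][$key];
    }
    if (isset($strings[$default][$key])) {
        return $strings[$default][$key];
    }
    return $key;
}

echo lookupString($strings, 'de', 'WELCOME_TXT') . "\n"; // Willkommen!
echo lookupString($strings, 'ja', 'WELCOME_TXT') . "\n"; // Welcome! (no Japanese entry in this sketch)
```

With this layout, a missing translation falls back to English instead of leaving a constant undefined, at the cost of a function call wherever a string is printed.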

Finally, it’s time to create the display script. Remember, one key element of internationalization is to externalize all strings so that only one master file needs to be used. Listing 27.3 is such an example.

Listing 27.3 Localized Welcome Script


1: <?php
2: session_start();
3: include 'define_lang.php';
4: include 'lang_strings.php';
5: defineStrings();
6: ?>
7: <!DOCTYPE html>
8: <html lang="<?php echo LANGCODE; ?>">
9: <head>
10: <title><?php echo WELCOME_TXT; ?></title>
11: <meta charset="<?php echo CHARSET; ?>" />
12: </head><body>
13: <h1 style="text-align: center;"><?php echo WELCOME_TXT; ?></h1>
14: <p style="text-align: center; font-weight: bold;">
15: <?php echo CHOOSE_TXT; ?><br/><br/>
16: <a href="<?php echo htmlspecialchars($_SERVER['PHP_SELF'])."?lang=en"; ?>">
17: <img src="en_flag.gif" alt="English" /></a>
18: <a href="<?php echo htmlspecialchars($_SERVER['PHP_SELF'])."?lang=de"; ?>">
19: <img src="de_flag.gif" alt="German" /></a>
20: <a href="<?php echo htmlspecialchars($_SERVER['PHP_SELF'])."?lang=ja"; ?>">
21: <img src="ja_flag.gif" alt="Japanese" /></a>
22: </p>
23: </body>
24: </html>


Notice that Listing 27.3 is a basic template because all the language-related elements are externalized in the define_lang.php or lang_strings.php files. All this third file does is display the appropriate results, depending on the selected (or default) locale.

Line 5 calls the defineStrings() function, which makes available the appropriate values for the two constants WELCOME_TXT and CHOOSE_TXT, used in lines 10, 13, and 15. Lines 8 and 11 use the LANGCODE and CHARSET constants defined in define_lang.php. Lines 16–21 display flags representing the English, German, and Japanese locales, which are clickable. When the user clicks one of the flags, the locale changes to the new, selected locale, and the strings used are those appropriate to the new locale. These links contain the lang variable, which is passed to the script as $_GET['lang']. If you look at line 6 of Listing 27.1, you will see how the code uses this to change the setting regarding the user’s preferred locale.


Note

Despite the use here for illustrative purposes in a development environment, the use of a flag to represent language selection options is not recommended, because there exists no natural graphic representation for a language. Take, for instance, the use of the flag of Great Britain to represent English and the flag of Germany to represent German. English has an official or majority status in over 80 different countries, and German in at least 10—no single flag can represent that information.


Save this file as lang_selector.php and place it in the document root of your web server. When visited for the first time, it should look something like Figure 27.1.

Figure 27.1 Viewing the language selector for the first time.

Until another language is selected, the default is English; accordingly, the Welcome and Choose Language text appears in English. When the user clicks the German flag, he sees Figure 27.2; when the user clicks the Japanese flag, he sees Figure 27.3.

Figure 27.2 Viewing the German language page.

Figure 27.3 Viewing the Japanese language page.

Companies and organizations that offer localized versions of their websites often have long discussions about how to represent the locale selections—flags, names of countries, names of languages, and so forth. There is no clear-cut answer, but please remember that the use of flags is frowned upon. How to display the language selection is definitely a business decision, but if you have gone through the process of externalizing strings, text, and images and created an internationalized website template that is ready to be localized, the format of your locale selection is the least of your concerns.
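One common alternative to flags is to label each choice with the language's own name. The following sketch shows one way that might look; the function name renderLanguageLinks(), the $languages array, and the script path are hypothetical. The Japanese name is written with numeric character references so that it still displays when the page is served as ISO-8859-1.

```php
<?php
// Hypothetical list of locales, each labeled with the language's own name.
// The labels are trusted HTML fragments written by the developer.
$languages = array(
    'en' => 'English',
    'de' => 'Deutsch',
    'ja' => '&#26085;&#26412;&#35486;',
);

// Build a row of text links; the active locale is shown as plain text.
function renderLanguageLinks(array $languages, $scriptPath, $current)
{
    $links = array();
    foreach ($languages as $code => $label) {
        if ($code === $current) {
            $links[] = '<strong>' . $label . '</strong>';
        } else {
            $href = htmlspecialchars($scriptPath . '?lang=' . urlencode($code));
            $links[] = '<a href="' . $href . '" lang="' . $code . '">' . $label . '</a>';
        }
    }
    return implode(' | ', $links);
}

$html = renderLanguageLinks($languages, '/lang_selector.php', 'en');
echo $html . "\n";
```

The lang attribute on each link tells browsers and screen readers which language the label itself is written in.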

Localizing Your Application with gettext()

The previous sections walked you through a basic approach to application internationalization and localization. A more advanced approach would be to use the built-in PHP function called gettext(), which is a gateway of sorts (an API, or application programming interface) to the GNU gettext package.


Note

For more information about GNU gettext, see http://www.gnu.org/software/gettext/gettext.htm.


The use of gettext and its PHP-related functions requires translation catalog files to be in a specific format. A popular cross-platform editor for these files is Poedit (see http://www.poedit.com/). Once a translation catalog template has been created (full of your externalized strings), you can give the template to translators you have hired, or that you use through a crowdsourcing service such as Transifex (https://www.transifex.net/) or Get Localization (http://www.getlocalization.com/). With completed catalog files in hand, you can put them in a directory in your web server document root and begin the process of using gettext functions.

You can learn more about gettext and the PHP gettext-related functions in the PHP Manual at http://www.php.net/gettext, but in general the process goes something like this:

• Use putenv() to set the LC_ALL environment variable for the locale.

• Use setlocale() to set a value for LC_ALL (see http://www.php.net/setlocale).

• Use bindtextdomain() to set the location of the translation catalog for the given domain (domain in this case means a name identifying the application, not a domain like www.mydomain.com; see http://www.php.net/bindtextdomain).

• Use textdomain() to set the default domain to use with gettext (see http://www.php.net/textdomain).

• From this point forward, use either gettext("some string") or _("some string") to invoke the gettext translation for that string. So, if you have a translation catalog that assigns the German translated string "Willkommen!" for "Welcome", and all the environment variables are set as appropriate for German, the following code will output “Willkommen!”:

echo _("Welcome");
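Put together, the steps above might look like the following sketch. It is not a complete setup: it assumes that PHP's gettext extension is enabled, that the de_DE.UTF-8 locale is installed on the server, and that a compiled catalog exists at locale/de_DE/LC_MESSAGES/messages.mo next to the script. The locale name, the directory, and the domain name messages are all assumptions of this sketch.

```php
<?php
// Assumed values; adjust them to match your own catalog layout.
$locale = 'de_DE.UTF-8';
$domain = 'messages';

if (function_exists('gettext')) {
    putenv('LC_ALL=' . $locale);                   // set the environment variable
    setlocale(LC_ALL, $locale);                    // set the locale within PHP
    bindtextdomain($domain, __DIR__ . '/locale');  // where the catalogs live
    textdomain($domain);                           // default domain for gettext()
    $welcome = gettext('Welcome');                 // _('Welcome') is equivalent
} else {
    // Without the extension, fall back to the untranslated string.
    $welcome = 'Welcome';
}

echo $welcome . "\n";
```

If the locale or the catalog cannot be found, gettext() simply returns the original string, so the page still displays in the source language.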

Once you have a handle on the basics of application internationalization and localization, if you are going to develop an application used by speakers of many different languages, I recommend looking into a gettext-based localization framework and crowdsourced translation services (unless you have a plethora of native language speakers at your disposal or a lot of money to spend on translation services).

Summary

This chapter introduced you to the basics of internationalization and localization. You learned the two keys to creating an internationalized site: All strings, text, and graphics are externalized, as is number, currency, and date formatting. You also learned that neither internationalization nor localization is equivalent to translating text; translation is just one part of localization.

You also learned a little bit about character sets: They can be single-byte or multibyte. You also learned the importance of sending the appropriate language-related headers so that your web browser can interpret and display your text properly.

You also created a practical example of how to store a locale-related session variable, to determine and send the localized strings to a preexisting template. This template can be used by all locales because each element was externalized. As a bit of a bonus, you learned about an advanced step in using application frameworks for localization, using PHP’s gettext functions.

Q&A

Q. How do I go about localizing numbers, dates, and currency using PHP?

A. Two functions will prove very useful to you in this regard: number_format() and date(). You have already learned about the date() function. To use it in a localized environment, you simply rearrange the month, day, and year elements as appropriate to the locale (MM-DD-YYYY, DD-MM-YYYY, and so forth). The number_format() function is used for numbers and currency; it groups the thousands with a comma, period, or space, as appropriate to the locale. Read the PHP Manual entry at http://www.php.net/number_format for possible uses.
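As a brief sketch of this answer, the following code keeps the date pattern and the number separators for each locale in a table and passes them to date() and number_format(). The $formats array and the two helper function names are hypothetical, and the time zone is fixed to UTC only so that the example is repeatable.

```php
<?php
date_default_timezone_set('UTC'); // fixed time zone so the example is repeatable

// Hypothetical per-locale formatting rules.
$formats = array(
    'en' => array('date' => 'm-d-Y', 'decimal' => '.', 'thousands' => ','),
    'de' => array('date' => 'd.m.Y', 'decimal' => ',', 'thousands' => '.'),
);

// Format a number with two decimals using the locale's separators.
function formatNumberFor(array $formats, $lang, $number)
{
    $f = $formats[$lang];
    return number_format($number, 2, $f['decimal'], $f['thousands']);
}

// Format a timestamp using the locale's date pattern.
function formatDateFor(array $formats, $lang, $timestamp)
{
    return date($formats[$lang]['date'], $timestamp);
}

$when = mktime(12, 0, 0, 7, 4, 2012); // July 4, 2012

echo formatNumberFor($formats, 'en', 1234567.891) . "\n"; // 1,234,567.89
echo formatNumberFor($formats, 'de', 1234567.891) . "\n"; // 1.234.567,89
echo formatDateFor($formats, 'en', $when) . "\n";         // 07-04-2012
echo formatDateFor($formats, 'de', $when) . "\n";         // 04.07.2012
```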

Workshop

The workshop is designed to help you review what you’ve learned and begin putting your knowledge into practice.

Quiz

1. Is English a single-byte or multibyte language? What about Japanese?

2. What two headers related to character encoding are crucial in a localized site?

3. In addition to text strings, what other content elements need attention when internationalizing a site?

Answers

1. English is single-byte; Japanese is multibyte.

2. Content-Type with the charset indicator, Content-Language.

3. The formatting of dates, currency, and numbers also needs attention in the internationalization process, as do icons and graphics.

Activities

1. Use Google Translate (or your own knowledge) to add “Welcome!” messages in a few different languages to the language definition and display files you created in this chapter.

2. Because graphical representations of flags are nonoptimal for use in selecting languages, change the flag-based language selection in the sample files from this chapter to something more appropriate.