Locales and Code Pages

Unicode Tasks   Multibyte Character Set (MBCS) Tasks

A “locale” reflects the local conventions and language for a particular geographical region. A given language may be spoken in more than one country; for example, Portuguese is spoken in Brazil as well as in Portugal. Conversely, a country may have more than one official language. For example, Canada has two: English and French. Thus, Canada has two distinct locales: Canadian-English and Canadian-French. Some locale-dependent categories include the formatting of dates and the display format for monetary values.
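As a rough illustration of locale-dependent categories, the following sketch prints the decimal point, currency symbol, and preferred date format of whatever locale is currently active. It uses only the standard C functions localeconv and strftime; the helper name print_locale_conventions is hypothetical and not part of the run-time library.

```c
#include <locale.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: prints a few conventions of the active locale. */
void print_locale_conventions(void)
{
    const struct lconv *lc = localeconv();
    char date_buf[64];
    time_t now = time(NULL);
    struct tm *local = localtime(&now);

    /* Numeric and monetary conventions come from the lconv structure. */
    printf("decimal point:   \"%s\"\n", lc->decimal_point);
    printf("currency symbol: \"%s\"\n", lc->currency_symbol);

    /* %x asks strftime for the locale's preferred date representation. */
    if (local != NULL && strftime(date_buf, sizeof date_buf, "%x", local) > 0)
        printf("date:            %s\n", date_buf);
}
```

If the program has not yet changed its locale, the values printed are those of the “C” locale, in which the decimal point is "." and the currency symbol is an empty string.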

The language determines the text and data formatting conventions, while the country determines the national conventions. Every language has a unique mapping, represented by “code pages,” which includes characters other than those in the alphabet (such as punctuation marks and numbers). A code page is a character set and is related to the current locale and language. As such, a locale is a unique combination of language, country, and code page. The code page setting can determine the locale setting and can be changed at run time by calling the setlocale function.

Different languages may use different code pages. For example, the ANSI code page 1252 is used for American English and most European languages, and the ANSI code page 932 is used for Japanese Kanji. Virtually all code pages share the ASCII character set for the lowest 128 characters (0x00 to 0x7F).

Any single-byte code page can be represented in a table (with 256 entries) as a mapping of byte values to characters (including numbers and punctuation marks), or glyphs. Any multibyte code page can also be represented as a very large table (with 64K entries) mapping double-byte values to characters. In practice, however, multibyte code pages are usually represented as a table for the first 256 (single-byte) characters and as ranges for the double-byte values.
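The range-based representation can be sketched as follows for code page 932. The lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are an assumption taken from the commonly published layout of that code page, not from this topic, and the names byte_range, is_cp932_lead_byte, and cp932_char_count are hypothetical. This is a minimal sketch, not how the run-time library itself stores its tables.

```c
#include <stddef.h>

/* A closed range of byte values. */
struct byte_range { unsigned char first, last; };

/* Assumed lead-byte ranges for code page 932. */
static const struct byte_range cp932_lead_ranges[] = {
    { 0x81, 0x9F },
    { 0xE0, 0xFC },
};

/* Returns 1 if `byte` falls in a lead-byte range, 0 otherwise. */
int is_cp932_lead_byte(unsigned char byte)
{
    size_t i;
    size_t n = sizeof cp932_lead_ranges / sizeof cp932_lead_ranges[0];

    for (i = 0; i < n; ++i) {
        if (byte >= cp932_lead_ranges[i].first &&
            byte <= cp932_lead_ranges[i].last)
            return 1;
    }
    return 0;
}

/* Counts characters (not bytes) in a NUL-terminated code page 932 string. */
size_t cp932_char_count(const unsigned char *s)
{
    size_t count = 0;

    while (*s != '\0') {
        /* A lead byte consumes the trail byte that follows it, if any. */
        if (is_cp932_lead_byte(*s) && s[1] != '\0')
            s += 2;
        else
            s += 1;
        ++count;
    }
    return count;
}
```

Because a trail byte can have a value that also falls in the ASCII range or in a lead-byte range, the string has to be scanned from its beginning; a byte cannot be classified reliably in isolation.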

The C run-time library has two types of internal code pages: locale and multibyte. You can change the current code page during program execution (see the documentation for the setlocale and _setmbcp functions). Also, the run-time library may obtain and use the value of the operating system code page. In Windows NT, the operating system code page is the “system default ANSI” code page. This code page is constant for the duration of the program’s execution.

When the locale code page changes, the behavior of the locale-dependent set of functions changes to that dictated by the chosen code page. By default, all locale-dependent functions begin execution with a locale code page unique to the “C” locale. You can change the internal locale code page (as well as other locale-specific properties) by calling the setlocale function. A call to setlocale(LC_ALL, "") will set the locale to that indicated by the operating system’s default code page.
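The startup behavior described above can be observed with a short sketch. Passing NULL as the second argument to setlocale only queries the current setting; passing "" requests the operating system's default. The function name adopt_system_locale is hypothetical, and the locale names printed vary by system.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reports the locale before and after adopting the
   system default. Returns 1 if the locale was "C" on entry, 0 otherwise. */
int adopt_system_locale(void)
{
    const char *current = setlocale(LC_ALL, NULL);   /* query only */
    int started_in_c = (current != NULL && strcmp(current, "C") == 0);
    const char *chosen;

    printf("locale on entry: %s\n", current != NULL ? current : "(unknown)");

    /* "" selects the locale indicated by the operating system. */
    chosen = setlocale(LC_ALL, "");
    if (chosen == NULL)
        printf("system default locale unavailable; locale unchanged\n");
    else
        printf("locale now:      %s\n", chosen);

    return started_in_c;
}
```

When this is the first locale-related call in a program, the return value is expected to be 1, because execution begins in the “C” locale.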

Similarly, when the multibyte code page changes, the behavior of the multibyte functions changes to that dictated by the chosen code page. By default, all multibyte functions begin execution with a multibyte code page corresponding to the operating system’s default code page. You can change the internal multibyte code page by calling the _setmbcp function.
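A minimal sketch of changing the multibyte code page follows. _setmbcp and _getmbcp are declared in <mbctype.h> in the Microsoft C run-time library only, so the calls are guarded with _WIN32; the wrapper name use_multibyte_code_page and its non-Windows fallback are assumptions made so that the sketch compiles elsewhere.

```c
#include <stdio.h>
#ifdef _WIN32
#include <mbctype.h>   /* _setmbcp, _getmbcp (Microsoft C run-time only) */
#endif

/* Hypothetical wrapper: tries to switch the multibyte code page.
   Returns 0 on success, -1 on failure or if unsupported. */
int use_multibyte_code_page(int code_page)
{
#ifdef _WIN32
    /* _setmbcp returns 0 on success and -1 for an invalid code page. */
    if (_setmbcp(code_page) != 0) {
        printf("code page %d was rejected\n", code_page);
        return -1;
    }
    /* _getmbcp reports 0 when a single-byte code page is in use. */
    printf("_getmbcp() now reports %d\n", _getmbcp());
    return 0;
#else
    (void)code_page;
    printf("_setmbcp is not available on this platform\n");
    return -1;
#endif
}
```

For example, use_multibyte_code_page(932) asks the multibyte functions to follow the Japanese code page. Whether the call succeeds depends on the code pages installed on the system.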

The C run-time function setlocale sets, changes, or queries some or all of the current program’s locale information. The _wsetlocale routine is a wide-character version of setlocale; the arguments and return values of _wsetlocale are wide-character strings.
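Querying and setting can be combined to change one category temporarily and then restore it, as in the sketch below. The helper name with_locale and its callback parameter are hypothetical. The saved name is copied because the string returned by setlocale may be overwritten by a later call.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: runs `action` with `category` set to `name`, then
   restores the previous setting. Returns 0 on success, -1 on failure. */
int with_locale(int category, const char *name, void (*action)(void))
{
    const char *current = setlocale(category, NULL);   /* query only */
    char *saved;

    if (current == NULL)
        return -1;

    /* Copy the name; a later setlocale call may overwrite the buffer. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(category, name) == NULL) {
        free(saved);            /* requested locale unavailable; unchanged */
        return -1;
    }

    if (action != NULL)
        action();

    setlocale(category, saved); /* restore the previous setting */
    free(saved);
    return 0;
}
```

A call such as with_locale(LC_NUMERIC, "C", some_function) would run some_function with “C” numeric formatting while leaving the other categories as they were. With _wsetlocale the same pattern would apply using wide-character strings.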

See Also   Benefits of Character Set Portability