International Support in Windows NT 5.0

Windows NT®

Operating System

White Paper

Abstract

Since its initial design stages, Microsoft® Windows NT® has incorporated international support through the Unicode character encoding system, APIs that retrieve language-specific information, and resource files that store user interface (UI) elements in multiple languages. Windows NT 5.0, a world-ready operating system that supports more than 100 international locales, is the culmination of several years of progressive improvements in the operating system’s international support. Each of the more than two dozen language editions of Windows NT 5.0 will support the input and display of languages used in all 100 locales. Because all language editions are based on the same core code (the same API set, the same character encoding, the same fonts and character tables), it will be much easier to maintain multilingual networks and machines using Windows NT 5.0 and to create applications that can easily support multilingual documents.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.

Microsoft, BackOffice, the BackOffice logo, MS-DOS, Visual Basic, Visual C++, Win32, Windows, and Windows NT are registered trademarks of Microsoft Corporation.

Other product or company names mentioned herein may be the trademarks of their respective owners.

Microsoft Corporation · One Microsoft Way · Redmond, WA 98052-6399 · USA

0997

Introduction

Microsoft® Windows NT® 5.0 is a fully globalized operating system that will be released in more than two dozen language editions. Each language edition will support the input and display of all languages that Windows NT supports, making it much easier to maintain multilingual networks and machines, and to create multilingual documents.

With international support built into the system through the National Language Support API (NLSAPI), the Multilingual API (MLAPI), and Windows resource files, developers will find it easier to create globalized applications that support multilingual data and a multilingual UI without using special tools, relying on multiple editions of the operating system, or writing complex, specialized code. Using these APIs, developers can create applications that run on any language edition of Windows NT 5.0 and allow the editing and display of multiple languages.

For detailed information on the NLSAPI, please consult the Microsoft Windows Operating Systems NLSAPI Functional Specification. For detailed information on the MLAPI, please consult the Microsoft Windows NT 5.0 Multilingual Functional Specification. Both documents are available through Microsoft Developer’s Network.

A Single, Worldwide Binary

All language editions of Windows NT 5.0 are created from the same core code base. In previous releases of Windows NT, the Asian and Middle East editions were a superset of the core U.S. and European editions and contained additional APIs to handle more complex text input and layout requirements. In Windows NT 5.0, all APIs are contained in all language editions, making possible scenarios that will be described later in this paper.

In addition, every language edition will ship with the components necessary to support the input, display, and formatting of text in all languages that Windows NT supports. For example, each CD will include at least one font to represent each script supported by the system. (Additional fonts may ship for the primary language of the packaged product.)

Important Concepts

The following concepts are key to understanding international support in Windows.

Locale

A locale is a set of user preference information related to the user's language and sublanguage. An example of a language is “French,” where the sublanguage could be French as spoken in Canada, France, or Switzerland. Locale information includes the currency symbol; date, time, and number formatting information; localized days of the week and months of the year; the standard abbreviation for the name of the country; and character encoding information. (For a more complete list, see the NLSAPI specification.) Each Windows NT system has a default system locale and one user locale per user, which may be different from the default system locale. Both can be changed via the Control Panel. Applications can specify a locale on a per-thread basis when calling APIs.
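The language and sublanguage pairing described above can be illustrated with a short sketch. It reproduces, in Python, the bit layout used by the Win32 MAKELANGID and MAKELCID macros (a 10-bit primary language, a 6-bit sublanguage, and a sort ID above them). The helper names are illustrative and are not part of any Windows API; only the constants and the bit layout come from the Win32 headers.

```python
# Sketch of how a locale ID (LCID) packs a language and sublanguage.
# Helper names are hypothetical; constants mirror the Win32 header values.
LANG_FRENCH = 0x0C
SUBLANG_FRENCH = 0x01            # French as used in France
SUBLANG_FRENCH_CANADIAN = 0x03
SUBLANG_FRENCH_SWISS = 0x04
SORT_DEFAULT = 0x0


def make_langid(primary: int, sub: int) -> int:
    """Combine a primary language (low 10 bits) and sublanguage (high 6 bits)."""
    return (sub << 10) | primary


def make_lcid(langid: int, sort_id: int = SORT_DEFAULT) -> int:
    """Place the sort ID above the 16-bit language ID."""
    return (sort_id << 16) | langid


def primary_langid(langid: int) -> int:
    return langid & 0x3FF


def sub_langid(langid: int) -> int:
    return langid >> 10


french_locales = {
    "France": make_lcid(make_langid(LANG_FRENCH, SUBLANG_FRENCH)),
    "Canada": make_lcid(make_langid(LANG_FRENCH, SUBLANG_FRENCH_CANADIAN)),
    "Switzerland": make_lcid(make_langid(LANG_FRENCH, SUBLANG_FRENCH_SWISS)),
}

for country, lcid in french_locales.items():
    print(f"French ({country}): LCID 0x{lcid:04X}")
```

All three locale IDs share the same primary language, which is why an application can treat them as "French" for some purposes and still distinguish regional formatting rules.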

Figure 1: The Regional Settings properties in Windows NT 5.0 Control Panel.

Character Encoding

A character encoding (also called a code page) is a set of numeric values, or code points, that represents a group of alphanumeric characters, punctuation, and symbols. Single-byte character encodings use 8 bits to encode 256 different characters. On Windows, the first 128 characters of all code pages consist of the standard ASCII set of characters. The characters at code points 128–255 represent additional characters and vary depending on the set of scripts represented by the character encoding (for a complete listing of character set tables, see Developing International Software for Windows 95 and Windows NT, published by Microsoft Press). Double-byte character encodings on Windows, used for Asian languages, use 8–16 bits to encode each character. Computers exchange information encoded in character encodings and render it on screens using fonts.
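The split between the shared ASCII range and the code-page-specific upper range can be seen in a small sketch. It uses Python's built-in codecs for the Windows code pages (cp1252, cp1253, cp1251, cp1256, cp932) as a stand-in for the system tables; the byte value 0xE1 is an arbitrary sample chosen for illustration.

```python
# Sketch: the same byte values interpreted under different Windows code pages.
CODE_PAGES = ["cp1252", "cp1253", "cp1251", "cp1256"]

ascii_bytes = b"Windows NT"       # every byte is below 128
sample_byte = bytes([0xE1])       # one byte in the 128-255 range

# ASCII text decodes identically under every code page...
ascii_views = {cp: ascii_bytes.decode(cp) for cp in CODE_PAGES}
# ...but a byte in the upper half maps to a different character in each.
upper_views = {cp: sample_byte.decode(cp) for cp in CODE_PAGES}

for cp in CODE_PAGES:
    print(cp, repr(ascii_views[cp]), repr(upper_views[cp]))

# A double-byte code page (932, Japanese) mixes 1-byte and 2-byte characters.
japanese = "日本語"
dbcs_bytes = japanese.encode("cp932")
latin_bytes = "NT".encode("cp932")
print(len(japanese), "Japanese characters ->", len(dbcs_bytes), "bytes")
print(len("NT"), "Latin characters ->", len(latin_bytes), "bytes")
```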

Figure 2: Code Page 1256, the Arabic character encoding.

Windows NT supports OEM character encodings (those originally designed for MS-DOS®), ANSI character encodings (those introduced with Windows 3.1), and Unicode. Unicode is a 16-bit character encoding that encompasses most of the scripts in wide computer use today (for more information on Unicode, see The Unicode Standard, published by the Unicode Consortium, or visit http://www.unicode.org). Windows NT 5.0 uses Unicode as its base character encoding, meaning that all strings passed around internally in the system, including strings in Windows resource (.res) files, are encoded in Unicode. ANSI character encodings remain available to applications: each API that takes a string as a parameter has two entry points, an ‘A’ (ANSI) entry point and a ‘W’ (wide-character, or Unicode) entry point.

Windows NT supports additional code pages for translating data to and from Unicode, including Macintosh, EBCDIC, and ISO encodings. It also contains translation tables for the UTF-7 and UTF-8 standards, which are commonly used to send Unicode-based data across networks, in particular the Internet.
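As a rough illustration of why UTF-7 and UTF-8 are used for transport, the sketch below round-trips a mixed Greek and Russian string through both formats using Python's built-in codecs. It does not call the Windows translation tables; the sample text is arbitrary.

```python
# Sketch: Unicode text carried in UTF-8 and UTF-7 transport encodings.
text = "Ελληνικά и русский"

utf8_bytes = text.encode("utf-8")
utf7_bytes = text.encode("utf-7")

# Both encodings are lossless: decoding returns the original string.
from_utf8 = utf8_bytes.decode("utf-8")
from_utf7 = utf7_bytes.decode("utf-7")

# UTF-7 restricts itself to 7-bit bytes, which suits mail gateways
# that cannot pass 8-bit data; UTF-8 uses the full byte range.
utf7_is_7bit = all(b < 0x80 for b in utf7_bytes)
utf8_is_7bit = all(b < 0x80 for b in utf8_bytes)

print(len(text), "characters")
print(len(utf8_bytes), "bytes as UTF-8, 7-bit clean:", utf8_is_7bit)
print(len(utf7_bytes), "bytes as UTF-7, 7-bit clean:", utf7_is_7bit)
```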

National Language Support

National Language Support in Windows NT consists of a set of system tables that applications can access through the NLSAPI. The NLSAPI retrieves the following types of information:

Locale information such as date, time, number, or currency format, or localized names of countries, languages, or days of the month and week.
Character mapping tables that map local character encodings (ANSI or OEM) to Unicode or vice versa.
Keyboard layout information. On Windows, keyboard layouts are software-driven, so the same keyboard hardware can be used to generate a variety of different language scripts.
Character typing information. Does a specific Unicode code point represent a letter, a number, a spacing character, or a punctuation symbol? Is a character uppercase or lowercase? For a particular locale, what is the character’s uppercase or lowercase equivalent?
Sorting information. For example, different locales follow different sorting rules for accented characters, and some support more than one sorting algorithm.
Font information. The system stores information about which fonts support which character encoding(s) or which range(s) of Unicode. APIs exist to map which languages the font will support.
On Windows NT 5.0, users can install National Language Support for any locale via the Control Panel (see Figure 1).
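The character typing questions in the list above (is this code point a letter, a digit, a space, or punctuation?) can be sketched with Python's standard unicodedata module. This is an analogy to the kind of answer the NLSAPI character typing functions provide, not a call into them, and the case mappings shown are the default Unicode mappings rather than locale-specific ones.

```python
# Sketch: classifying Unicode code points by general category.
import unicodedata


def describe(ch: str) -> dict:
    """Return basic typing information for a single character."""
    category = unicodedata.category(ch)   # e.g. 'Lu', 'Nd', 'Zs', 'Po'
    return {
        "char": ch,
        "category": category,
        "is_letter": category.startswith("L"),
        "is_digit": category == "Nd",
        "is_space": category == "Zs",
        "is_punct": category.startswith("P"),
        "upper": ch.upper(),   # default (locale-independent) case mapping
        "lower": ch.lower(),
    }


# A Latin letter, an Arabic-Indic digit, a space, and an ideographic comma.
for sample in ["\u00e9", "\u0663", " ", "\u3001"]:
    info = describe(sample)
    print(repr(info["char"]), info["category"],
          "letter" if info["is_letter"] else "",
          "digit" if info["is_digit"] else "",
          "space" if info["is_space"] else "",
          "punct" if info["is_punct"] else "")
```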

Localizable Resources

A localizable resource is any piece of information in a software program that will change from language to language. Although certain algorithms may change depending on language (for example, spelling or hyphenation), localizable resources are generally UI elements. Examples include menus, dialog boxes, help text, icons, and bitmaps. On Windows, most of these resources are stored in Windows resource files. In text form, Windows resource files have the extension .rc. When compiled, they have the extension .res. With today’s tools (see Figure 3), resource files are compiled directly into the application executable. On Windows NT 5.0, all language editions share the same binary code. All that changes are the localizable resources.

Figure 3: Editing a Resource file in Developer Studio.

The Significance of a Worldwide Binary

Operating in a multilingual environment introduces technical issues that have traditionally been difficult to address:

Sharing Data

Users cannot read documents unless the system and the applications they are running understand the character encoding used to create these documents. For example, a document created on a Japanese Windows 95 system can be displayed on an English Windows 95 system if that system has Japanese fonts installed; it cannot, however, be easily edited because English Windows 95 supports neither a Japanese character encoding nor Japanese Input Method Editors (programs that convert keystrokes into ideographic characters). In addition, applications have typically extended their file formats for Asian editions, making them unreadable in non-Asian editions of their products. To share data, then, users have been required to run compatible systems and applications, meaning their environments must support matching character encodings, fonts, and file formats.

Creating Multilingual Documents

The limitations of sharing data naturally extend to putting “incompatible” data in a single document. On Windows 3.1 and Windows 95, for example, it is not possible to combine Arabic, Greek, Russian, and Japanese text in a single document without special applications that contain complicated (and often proprietary) code. Windows 95 introduced a limited solution (similar to the Macintosh solution) that involves tagging data with font and character encoding information. Users could create and display multi-script documents that were portable in a rich text format. However, this solution did not enable portable documents stored in a plain-text format (important for communication across the Internet) and required that applications understand and support the font-tagging scheme. It limited multilingual documents to scripts of similar types (European text, for example) and did not support the mixing of complex scripts such as Arabic, Japanese, and Chinese.

Supporting Multinational User Scenarios

Large organizations such as banks, universities, or government agencies often support staff, as well as customers, who speak more than one language. This may require that individuals who speak different languages use the same machine, or that a single individual use one machine to communicate in more than one language. Just as a user of an ATM machine may want to change the UI language of the ATM to conduct a transaction, a PC user may want to change the UI language of either the application she is using or of the entire system, without affecting data. Typically, a single PC will be dedicated to a single language—for example, a Japanese system running Japanese applications to handle Japanese data. Although Windows 95 and Windows NT 4.0 both enable applications to change their UI language and system locale settings on the fly, they do not allow users to change the UI language of the entire system. As mentioned above, it is also difficult to create an environment in which a user can run a system in one default language and run applications that support other languages. For example, on existing Windows-based systems it is possible, but not transparent, to run an application with a Greek UI on an English edition of Windows and enter Greek text. It is also difficult to run an application with an English UI on an English system and be able to enter and edit Japanese text.

Creating Multiple Language Editions of an Application

A large part of the reason that multilingual user scenarios have been difficult to set up is that comprehensive multilingual applications have been difficult to create, due to limitations in the operating system. Creating a Japanese-language application, for example, used to require a Japanese edition of Windows, special Japanese editions of programming tools, and a separate Software Development Kit (SDK). Fortunately, it is now possible to use standard tools (the English edition of Visual Basic® or Visual C++®, for example) to create Japanese-language applications. In addition, the SDK has been unified, and it is possible to compile Japanese applications on any language edition of Windows NT, as long as the proper national language tables have been installed. However, it has still been necessary to run Japanese applications on Japanese Windows because non-Japanese language editions of the operating system have not supported the additional APIs (for Input Method Editors). Thus, testing additional language editions of an application could still require additional installations of the operating system.

In addition, to support different languages, developers often had to customize or add code. For example, supporting Asian languages on Windows 95 required changing pointer arithmetic to handle double-byte character encodings and adding support for Input Method Editors. Supporting Arabic and Hebrew required customizing dialogs and menus with right-to-left controls, and adding code to handle ligatures and other text layout issues. Supporting multilingual documents required tagging data with font and language information. Although system flags, messages, and APIs exist in both Windows 95 and Windows NT 4.0 to handle text input, layout, and UI issues, not all the necessary mechanisms exist in all language editions of the operating systems.
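The pointer arithmetic problem mentioned above can be sketched as follows. In a double-byte code page, a character may occupy one or two bytes, so code cannot simply advance one byte at a time; it has to test each byte against the lead-byte ranges first. The sketch uses the lead-byte ranges of code page 932 (0x81–0x9F and 0xE0–0xFC); the function names are illustrative and only mimic the role that IsDBCSLeadByte plays in Win32 code.

```python
# Sketch: walking a double-byte (code page 932) string character by character.
def is_cp932_lead_byte(value: int) -> bool:
    """True if the byte starts a two-byte character in code page 932."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC


def split_cp932(data: bytes) -> list:
    """Split encoded bytes into one bytes object per character."""
    characters = []
    i = 0
    while i < len(data):
        # A lead byte consumes the following trail byte as well.
        step = 2 if is_cp932_lead_byte(data[i]) and i + 1 < len(data) else 1
        characters.append(data[i:i + step])
        i += step
    return characters


sample_text = "Windows NT 日本語版"
encoded = sample_text.encode("cp932")
pieces = split_cp932(encoded)

print(len(encoded), "bytes,", len(pieces), "characters")
print([piece.decode("cp932") for piece in pieces])
```

Byte-oriented code that skips the lead-byte test can land in the middle of a character, which is the class of bug that a uniformly 16-bit encoding avoids.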

How Windows NT 5.0’s Worldwide Binary Addresses These Issues

The unified architecture of Windows NT 5.0’s worldwide binary makes it much easier to create scenarios such as multilingual user environments, mixed language networks, and multilingual documents. Several key design decisions form the basis for the global operating system.

Windows NT is based on the Unicode standard

Support for the Unicode Standard was built into the Windows NT operating system from its early stages. The first release of Windows NT used Unicode as the system’s base character encoding. Subsequent releases used Unicode as the basis for the file system, the UI, and for network communication. Windows NT 5.0 supports version 2.0 of Unicode. It provides a Unicode-based application environment and includes forward migration tools for existing non-Unicode data (explained later).

Unicode’s most important benefit is that it allows for unambiguous plain text representation of data, ending the requirement of tagging text strings with code page information. As a uniformly 16-bit character encoding, it represents Asian languages without requiring the programming tricks necessary to support variable-width character encodings used in Windows 9x. As an industry standard, it simplifies sharing of data in mixed platform environments.
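A brief sketch can make the plain-text benefit concrete. It places Arabic, Greek, Russian, and Japanese samples in one string, encodes it as 16-bit Unicode (UTF-16, little-endian), and then checks that none of the corresponding single-script Windows code pages could hold the same text. The sample words are arbitrary, and the two-bytes-per-character result holds for these samples, not necessarily for every character Unicode can represent.

```python
# Sketch: one plain-text string holding four scripts, with no code page tags.
samples = {
    "Arabic": "سلام",
    "Greek": "Ελλάδα",
    "Russian": "Москва",
    "Japanese": "日本語",
}
document = " / ".join(samples.values())

# Every character in this sample occupies exactly one 16-bit unit.
wide = document.encode("utf-16-le")
bytes_per_char = len(wide) / len(document)
print(len(document), "characters,", len(wide), "bytes")


def fits_in(code_page: str, text: str) -> bool:
    """True if the legacy code page can represent every character."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


legacy_results = {cp: fits_in(cp, document)
                  for cp in ("cp1256", "cp1253", "cp1251", "cp932")}
print(legacy_results)
```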

Windows 9x and Windows NT both contain tables for converting text from ANSI character encodings to Unicode and vice versa. Users and developers can add conversion tables for a variety of character encodings, including Mac and UNIX character encodings, through the Regional Settings Control Panel applet (see Figure 4). Conversion tables make it possible for non-Unicode enabled applications to operate in the Windows NT environment, and Unicode-enabled applications to operate in the Windows 9x environment. Although Windows 9x does not contain native support for Unicode, it supports several wide-character APIs, such as TextOutW.

Figure 4: The Advanced Regional Settings Dialog allows users to install code page conversion tables for a variety of standards.

Windows NT includes transparent support for multiple languages

Developers can use system APIs to create generic code that will correctly handle data input, storage, and display for a wide range of languages. The National Language Support API (NLSAPI) contains functions for transforming strings, retrieving and manipulating code page information, and for retrieving and manipulating locale information. These APIs are listed in table 1. The NLSAPI functions allow applications to query the system for types of information that can change depending on language, country, or character encoding. For example, LCMapString converts a string to uppercase, lowercase, or to a sort key depending on the language parameter passed to the call. GetCurrencyFormat returns all the information an application needs to format a currency string for a particular country – what the currency symbol is, whether the symbol comes before the numerical amount or after, and so forth. MultiByteToWideChar will convert a string from an ANSI character encoding into the proper Unicode range.

APIs to retrieve locale information: GetSystemDefaultLangID, GetUserDefaultLangID, GetSystemDefaultLCID, GetUserDefaultLCID, SetThreadLocale, GetThreadLocale, IsValidLocale, ConvertDefaultLocale, EnumSystemLocales, GetLocaleInfo, SetLocaleInfo, GetTimeFormat, GetDateFormat, EnumDateFormats(Ex), EnumTimeFormats, EnumCalendarInfo(Ex), GetNumberFormat, GetCurrencyFormat

APIs to analyze and manipulate strings: CompareString, LCMapString, MultiByteToWideChar, WideCharToMultiByte, FoldString, IsDBCSLeadByte, IsDBCSLeadByteEx, GetStringTypeEx, GetStringType[A|W]

APIs to analyze and manipulate system character encoding tables: IsValidCodePage, EnumSystemCodePages, GetConsoleCP, GetConsoleOutputCP, SetConsoleCP, SetConsoleOutputCP, GetACP, GetOEMCP, GetCPInfo, GetCPInfoEx

Table 1: NLSAPI functions.
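The conversion role that MultiByteToWideChar and WideCharToMultiByte play can be approximated with a short sketch. The two functions below are hypothetical Python stand-ins built on the standard codecs; they are not bindings to the Win32 functions, and the replacement of unmappable characters with a question mark is a simplification of how the real API handles them.

```python
# Sketch: converting between an ANSI code page and 16-bit Unicode.
def multibyte_to_wide(data: bytes, code_page: int) -> bytes:
    """Decode bytes in the given Windows code page, return UTF-16LE bytes."""
    return data.decode(f"cp{code_page}").encode("utf-16-le")


def wide_to_multibyte(wide: bytes, code_page: int) -> bytes:
    """Encode UTF-16LE text into a code page; unmappable characters become '?'."""
    return wide.decode("utf-16-le").encode(f"cp{code_page}", errors="replace")


greek_ansi = "Αθήνα".encode("cp1253")        # text as a Greek application stores it
wide_text = multibyte_to_wide(greek_ansi, 1253)

round_trip = wide_to_multibyte(wide_text, 1253)   # back to Greek: lossless
to_western = wide_to_multibyte(wide_text, 1252)   # Western European: lossy

print(len(greek_ansi), "ANSI bytes ->", len(wide_text), "Unicode bytes")
print("round trip intact:", round_trip == greek_ansi)
print("as code page 1252:", to_western)
```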

These APIs accept identifiers for languages, locales, or character encodings. Applications can therefore pass the system, user, or thread locale to an API, which will return the appropriate information from tables carried by the operating system. If the system or user locale changes, the application behavior will automatically adjust without requiring any code changes or any action on the part of the user. Developers can also set the locale of a thread before calling an API in order to retrieve information about a specific locale. For example, if one section of a document is tagged as German text, an application can set the thread’s locale to German before calling GetDateFormat, so that any dates in this section of the document are formatted according to German conventions.
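The thread-locale pattern described above can be sketched in a few lines. The table below is a tiny, illustrative stand-in for the operating system's locale tables and covers only the short date format of two locales; the function names echo SetThreadLocale and GetDateFormat but are hypothetical Python helpers, not the Win32 calls.

```python
# Sketch: formatting a date according to the calling thread's locale.
import threading
from datetime import date

LCID_GERMAN = 0x0407        # German (Germany)
LCID_ENGLISH_US = 0x0409    # English (United States)

# Illustrative stand-in for system locale tables (short date format only).
SHORT_DATE_FORMATTERS = {
    LCID_GERMAN: lambda d: f"{d.day:02d}.{d.month:02d}.{d.year}",
    LCID_ENGLISH_US: lambda d: f"{d.month}/{d.day}/{d.year}",
}

_thread_state = threading.local()


def set_thread_locale(lcid: int) -> None:
    _thread_state.lcid = lcid


def get_thread_locale() -> int:
    # Assumption for this sketch: threads default to English (United States).
    return getattr(_thread_state, "lcid", LCID_ENGLISH_US)


def format_short_date(d: date, lcid=None) -> str:
    """Format with an explicit locale, or fall back to the thread locale."""
    chosen = lcid if lcid is not None else get_thread_locale()
    return SHORT_DATE_FORMATTERS[chosen](d)


when = date(1997, 9, 30)
default_text = format_short_date(when)      # thread default
set_thread_locale(LCID_GERMAN)              # entering a German-tagged section
german_text = format_short_date(when)
print(default_text, "|", german_text)       # 9/30/1997 | 30.09.1997
```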

Applications can also create generic code for handling text input and display. Windows 9x and Windows NT allow users to install several keyboard layouts and change them on the fly, for example when creating a multilingual document. The Multilingual API contains functions for changing keyboard layout tables as well as fonts used to display text (see Table 2). It also contains APIs to handle text layout issues, e.g. vertical text for Japanese or right-to-left text containing ligatures for Arabic. Applications that use these APIs will contain basic, transparent support for creating mixed-language documents. Supporting complex scripts such as Arabic, Hebrew and Thai requires using these APIs (see Appendix B for details).

APIs to control keyboard layouts: ActivateKeyboardLayout, GetKeyboardLayout, GetKeyboardLayoutList, GetKeyboardLayoutName, LoadKeyboardLayout, MapVirtualKeyEx, ToAsciiEx, ToUnicodeEx, VkKeyScanEx, SystemParametersInfo

APIs to handle font information: ChooseFont, CreateFontIndirectEx, EnumFontFamilies, EnumFontFamiliesEx, EnumFontFamExProc, GetFontLanguageInfo, GetTextCharsetInfo, GetTextFace, TranslateCharsetInfo

APIs to handle text layout and data: DrawTextEx, ExtTextOut, GetCharacterPlacement, GetTextAlign, SetTextAlign, GetClipboardData, SetClipboardData, GetTextExtent

Table 2: Multilingual API functions.

Through these APIs, developers can also create applications that can handle text input and display for any number of languages, even if a fully localized UI will not be available for all languages. For example, English-language applications running on Windows NT 5.0 will automatically handle the input of Japanese text as long as the application is based on Unicode. This works on Windows NT 5.0 because all APIs are fully functional in all language editions of the operating system. In the past, IME APIs were either unavailable or simply stubbed on non-Asian editions of Windows NT. Non-Unicode applications can easily handle the input of Japanese text by adding code to trap IME-related window messages.

Windows NT makes it easy to change the language of an application’s UI

As mentioned before, traditional Windows applications store localizable resources in a resource (.res) file which is compiled into the application executable. With a resource file editor (such as the one built into Visual Studio) it is possible to create multiple language versions of localizable resources (tagged with language IDs) and compile them into the same .exe. It is also possible to extract a set of resources and replace them with a translated version.

The APIs listed in Table 3 are dependent on language. Several APIs (FindResourceEx, MessageBoxEx, and FormatMessage) accept a language ID as a parameter. Others retrieve the version of the menu, string, or icon that corresponds to the language ID of the calling thread. Since these APIs are language sensitive, developers can create applications that display the UI in a different language depending on the user’s locale ID or some other mechanism (for example, a menu choice).

CreateDialog

CreateDialogParam

CreateIconFromResource

CreateWindowEx (RTL flags)

DialogBox

DialogBoxIndirect

DialogBoxParam

DialogBoxIndirectParam

LoadAccelerators

LoadBitmap

LoadCursor

LoadIcon

LoadMenu

LoadString

MessageBox

MessageBoxEx

FindResource

FindResourceEx

FormatMessage

Table 3: APIs for retrieving UI elements.
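The idea of storing several language versions of a resource under one identifier, and selecting among them by language ID, can be sketched as a simple lookup with a fallback. The resource IDs, the strings, and the fallback rule (requested language first, then a default language) are assumptions made for this illustration; the actual search order used by the resource loader is more elaborate.

```python
# Sketch: language-tagged resources selected by language ID, with fallback.
LANGID_ENGLISH_US = 0x0409
LANGID_GERMAN = 0x0407
LANGID_FRENCH = 0x040C

# Hypothetical string resource identifiers.
IDS_FILE_MENU = 100
IDS_EXIT = 101

# (resource id, language id) -> localized string
RESOURCES = {
    (IDS_FILE_MENU, LANGID_ENGLISH_US): "File",
    (IDS_FILE_MENU, LANGID_GERMAN): "Datei",
    (IDS_FILE_MENU, LANGID_FRENCH): "Fichier",
    (IDS_EXIT, LANGID_ENGLISH_US): "Exit",
    (IDS_EXIT, LANGID_GERMAN): "Beenden",
    # No French translation of IDS_EXIT has been supplied yet.
}


def load_string(resource_id: int, langid: int,
                default_langid: int = LANGID_ENGLISH_US) -> str:
    """Return the string for langid, or the default language if missing."""
    for candidate in (langid, default_langid):
        text = RESOURCES.get((resource_id, candidate))
        if text is not None:
            return text
    raise KeyError(f"resource {resource_id} not found")


german_menu = [load_string(i, LANGID_GERMAN) for i in (IDS_FILE_MENU, IDS_EXIT)]
french_menu = [load_string(i, LANGID_FRENCH) for i in (IDS_FILE_MENU, IDS_EXIT)]
print(german_menu)
print(french_menu)   # the untranslated item falls back to English
```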

Windows NT provides a flexible application environment

Windows NT 5.0 can imitate the Win32® application environment of any non-Unicode language edition of Windows—for example, any edition of Windows 95. This allows Win32 applications that are not enabled for Unicode to run on any language edition of Windows NT 5.0. For example, a Win32 application that uses Code Page 1253 (Greek) can run on French Windows NT 5.0 with the proper system settings and tables. The major limitation is that multiple language applications cannot run at the same time if they use different character encodings (for example, a Japanese application that expects Code Page 932 and a Russian application that expects Code Page 1251). Windows NT will require the user to reboot the system before changing application environments. Unicode-based applications are not subject to this limitation—running two Unicode-based applications side-by-side does not require resetting the system locale.
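The reason a non-Unicode application depends on the system locale can be sketched with the Greek example above. A non-Unicode application stores text as code page bytes and relies on the system to interpret those bytes; if the system's code page does not match, the same bytes display as unrelated characters. The sketch uses Python's codecs for code pages 1253 and 1252 and an arbitrary sample word.

```python
# Sketch: the same application bytes under matching and mismatched code pages.
greek_text = "Καλημέρα"

# What a code page 1253 (Greek) application writes to a file or control.
stored_bytes = greek_text.encode("cp1253")

# System locale set to Greek: the bytes are interpreted as intended.
seen_with_greek_locale = stored_bytes.decode("cp1253")

# System locale left at a Western European setting (code page 1252):
# every byte is still valid, but it maps to a different character.
seen_with_western_locale = stored_bytes.decode("cp1252", errors="replace")

print("stored bytes:  ", stored_bytes.hex(" "))
print("Greek locale:  ", seen_with_greek_locale)
print("Western locale:", seen_with_western_locale)
```

A Unicode-based application avoids the problem because its strings carry their meaning without reference to the system code page.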

The flexible application environment allows users to run localized, non-Unicode applications, but its major benefit may be to application developers, who can now test a myriad of localized applications on a single machine. It is no longer necessary to maintain several machines with different language editions of the operating system for development and testing.

Figure 5: English Windows NT 5.0 running Arabic Word for Windows. The system locale is Arabic, which allows Word to run correctly. The user locale, however, is Japanese. Note the date in the bottom right hand corner, formatted in Japanese.

Customer scenarios

Sharing the Same Machine with Users Who Speak Different Languages

Today is your first day on the job at a multinational bank in New York City. Your native language is German. Because of space constraints, you have to share a machine with another part-time worker whose native language is Russian. When you arrive at your desk, the Russian worker is finishing her tasks. You notice that the machine is running in Russian—the UI is Russian, and when she types into a dialog box, the text appears in Russian characters. Before she leaves for the day, she logs off.

After she leaves, you log on to the machine. Instead of a Russian UI, however, the system appears with an English UI. When you launch an application and type, you notice that the keyboard behaves just like an English keyboard: characters appear in the English alphabet, not in Cyrillic characters. Any dates you insert into the document appear in English. You call the system administrator and tell him you would prefer a German machine. He tells you to go to the Control Panel, click on Regional Settings, and select “Deutsch” in the drop-down list labeled “UI Language” (see Figure 6). You do so, and a dialog box appears informing you that the system settings will change the next time you log on. Do you want to log off now? You log off and then log on again. This time your UI appears in German. When you launch your application and type, the keyboard now behaves like a German keyboard. Any dates you insert into the document are formatted according to German conventions.

Figure 6: Setting the UI language for Windows NT 5.0 in the Control Panel

How it works

When the system administrator set up this workstation, he installed a feature called “Multilingual UI” from a special CD that contains language resources and a special administration tool. When he ran the tool for this particular workstation, it told him what the system’s base language was and gave him a list of available UI languages on the CD. He then chose to install French, German, Russian, and Spanish UIs. The necessary resource files were copied to the workstation, and the registry was updated to reflect which languages are present on the system. Now whenever the user runs the control panel, an additional list box appears beneath the sorting option in the regional settings dialog, giving a choice of UI languages available on the machine.

The system UI is a user property. Different users can set the UI language to different defaults. Administrators, for example, can stipulate that the UI language for the administrator account is always a particular language. Therefore, if you are supporting a network containing machines running in six different European languages and you only speak English, you can administer each machine in English.

Handling Multilingual Data

You work for the European Union as a translator and speak six languages. You want to create a single document that contains translations of a recent meeting in English, French, German, Dutch, and Greek. You open a document that contains your French notes, edit it, and check its spelling. You then click your taskbar to change your “input language” to English. You translate the French text into English. As you type, the keyboard reacts as a French keyboard. When you are ready to begin the German section, you click the taskbar to select your German input language. The keyboard still reacts as a French keyboard. Before you begin the Greek section, you select the Greek input language. The keyboard now reacts as a Greek keyboard, and the text appears in your document in a Greek font. When you are done, you move the cursor to the beginning of the document, and check the spelling in the entire document. You find two minor spelling errors in the English section and one in the Greek section. You print the document and send it for proofing.

How it works

Different countries have different standard keyboard layouts. For example, compared with the U.S. keyboard layout, the French keyboard layout supports additional characters (e.g. for accented letters) and places others in different physical positions (on a French keyboard, for example, z and w are reversed relative to their position on the U.S. keyboard). People who speak different languages may be able to type in different languages, but they generally prefer to use one keyboard layout to enter text for all languages. When a language uses a different script (such as Russian and Greek), however, it is necessary to change keyboard layouts.

Windows stores keyboard layout information in tables that determine which character gets generated when the user presses a particular key on the keyboard hardware. Since the character generation is a software issue, Windows can control which keyboard layout is active for which user and which application at any given time. Users can go to the control panel and create “input locales,” best described as a language-keyboard layout pair. For example, the user in the above scenario set up her machine so that any time she typed English, she would be using the French layout. It would also be possible for her to assign a different keyboard layout to each input locale (see Figure 7).

Figure 7: Adding an input locale and assigning a keyboard layout.

Using the taskbar indicator (see Figure 8) or a shortcut key combination, she can switch between any of these input locales. When she changes input locales, Windows generates a WM_INPUTLANGCHANGEREQUEST message that applications can accept, reject, or ignore. If an application accepts the message, Windows generates another message that gives the application the locale ID of the new input locale. Applications can use this ID to tag text with a language property, which is useful for operations like spelling or grammar checking. An application may choose to reject the request—for example, if the system for some reason does not contain the proper fonts to display the requested language.

Figure 8: The taskbar indicator for input language

Windows NT stores locale-keyboard layout pairs as part of a user’s profile. Different users may assign a different keyboard layout to a particular language. Each user session tracks current input locales by thread—that is, two applications running at the same time may be using different input locales. In addition, an application can change the input locale for the user. For example, if the translator in the above example moved her cursor from English text to Greek text, her application may choose to activate the Greek input locale.
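The relationship between a user's input locales, the per-thread active locale, and an application's option to reject a change request can be modeled in a short sketch. The class and method names are invented for this illustration and do not correspond to Windows data structures; the accept callback stands in for an application's response to the change-request message.

```python
# Sketch: input locales as language/layout pairs, tracked per thread.
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    # language name -> keyboard layout name (one pair per input locale)
    input_locales: dict = field(default_factory=dict)

    def add_input_locale(self, language: str, layout: str) -> None:
        self.input_locales[language] = layout


@dataclass
class Session:
    profile: UserProfile
    active: dict = field(default_factory=dict)   # thread id -> language

    def request_change(self, thread_id: int, language: str,
                       accept=lambda lang: True) -> bool:
        """Switch a thread's input locale unless the application rejects it."""
        if language not in self.profile.input_locales or not accept(language):
            return False
        self.active[thread_id] = language
        return True

    def layout_for(self, thread_id: int) -> str:
        return self.profile.input_locales[self.active[thread_id]]


# The translator types English and German on her French layout,
# but needs a Greek layout for Greek.
translator = UserProfile()
for language, layout in [("French", "French"), ("English", "French"),
                         ("German", "French"), ("Greek", "Greek")]:
    translator.add_input_locale(language, layout)

session = Session(translator)
session.request_change(thread_id=1, language="English")
session.request_change(thread_id=2, language="Greek")

# An application lacking suitable fonts might refuse the switch.
rejected = session.request_change(1, "Greek", accept=lambda lang: False)
print(session.layout_for(1), session.layout_for(2), rejected)
```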

Changing the UI Language

You work at an international research firm and are at the library, using the on-line catalog. The previous user ran the search application in Czech and left it running. You do not speak Czech. You right-click on a little globe icon in the corner of the application, and a list of languages pops up. You select Spanish. The application UI redraws and changes to Spanish. You run the application and then close it down when you are done. The next person who runs it sees a Spanish UI.

How it works

Applications can implement a multilingual UI in several different ways. They can base the UI language of the application on the system locale, on the user locale, or on a manually selected default. On the system described above, for example, the application may save the locale ID of the most recently selected UI language. The next time it is launched, it can call SetThreadLocale with that locale ID so that any APIs that retrieve UI elements from the program files will retrieve elements in the appropriate language.

If the current user would like to change the UI language, they could do so from the application’s menus or by using an application-supported hot key combination. This would in turn invoke a command to reset the thread locale. This scenario is useful if a number of people will be using the same machine with the same application running all the time, much like an ATM machine. If an application does not support menu or keyboard options, it is still possible for the user to change the application UI language by changing the user locale in the regional settings of the Windows NT control panel. If the application contains the proper language resources and retrieves resources based on the user locale, then it will automatically start drawing them in the language of the new user locale. This second type of mechanism is useful if more than one user is sharing the same machine, running the same applications but in separate sessions. When each person logs on, his user locale determines the UI language for the applications. This makes it possible to install one copy of an application with multiple language resources, rather than numerous copies of the same application in different language editions.

Figure 9: Notepad running in both English and Japanese on the same Windows NT Workstation.

Running Applications that Require Different Language Environments

You are a student at a university taking a Japanese class. You are in the language lab, preparing to do your Japanese homework on a machine running English Windows NT Workstation. The teacher has provided an applet written for Japanese Windows 95 that will help you practice your Japanese characters. Following her instructions, you first set your system locale to Japanese in the Control Panel (see Figure 1). Then you reboot and run the application. The system UI remains in English, but the applet works perfectly, allowing you to read and type Japanese characters.

How it works

Since the application was written for Windows 95, it is based on the Shift-JIS character encoding (code page 932) and not Unicode. When the administrator set up the workstation, he installed support for the Japanese language—character tables, keyboard support, fonts, and locale-based information (sorting, date and time formatting, and so forth). When the student sets the system locale to Japanese, Windows NT loads the Shift-JIS character tables and, upon reboot, simulates the Win32 environment for Japanese, which is based on code page 932. The system behaves as if code page 932 is its local character set, even though the system environment is still in English, and Unicode-based applications still run unchanged.

The Japanese-language support includes Input Method Editor (IME) support, which takes advantage of the same input locale-keyboard layout mechanism described earlier in this paper. Input Method Editors contain more intelligence than a simple keyboard layout table, but the user can treat IMEs as they would any other input method, assigning a particular IME algorithm to an Asian input locale, and switching among Asian and non-Asian input locales using the taskbar indicator.

Differences Between Windows NT and Windows 98

The implementation of international support in Windows 98 and Windows NT 5.0 differs. Both operating systems support the NLSAPI and the MLAPI, both handle input locale switching and multilingual fonts, and both will be released in multiple language versions (Windows 98 will ship in a few more languages than Windows NT 5.0). However, key architectural differences mean that Windows 98 will not support multilingual applications to the same degree that Windows NT does.

Since Windows 98 has evolved from the Windows 3.x code base, it does not contain native Unicode support, but instead uses ANSI character encodings. The lack of native Unicode support makes sharing data between machines running different character encodings more difficult. It is still possible to write a Unicode-based application that runs on Windows 98 (Word 97, for example), but with the exception of a small subset of wide character APIs that Windows 98 supports, Unicode data must be translated before it is sent to system calls. One of the wide character APIs, TextOutW, allows applications to display Unicode-encoded data. This is the API that Internet Explorer uses, for example, to display Japanese text on an English system.

Windows 98 and Windows NT share a common resource file format. It is therefore possible to create applications that can run on Windows 98 and change UI language. However, Windows 98 does not support multilingual user profiles or thread locales, so some mechanisms for automating the change of an application’s UI language do not exist. In addition, Windows 98 does not support the ability to change the UI of the system itself.

Unlike Windows NT, localized editions of Windows 98 do not share a single binary. Asian and Middle Eastern editions are still supersets of the European editions of the system. Input Method Editor support, for example, is limited to Asian editions of Windows 98.

Summary

The foundation of Microsoft's multilingual platform is the international support contained in the Windows NT operating system. With Windows NT, it is possible to create a solution that supports multiple language data and a multiple language UI without requiring specialized applications or creating incompatibilities for users in different countries. Since it was first released, Windows NT has used Unicode as its base character encoding, which ensures the integrity of multilingual data shared across networks, in e-mail or in document files. Windows NT contains the font support, the keyboard support, and the APIs to allow for both the display and input of multiple languages (French, Russian, and Greek, for example) in a single document. In addition, the system carries information for formatting dates and currencies and sorting text in more than 100 international locales.

Solutions built using Windows NT and international-aware applications like Microsoft Office and Internet Explorer allow for universal storage of data in the Unicode format (translated to local character encodings when necessary through tables provided by the system). Users of the system can use any language edition of Windows NT, Word, or Internet Explorer, to display any document, as long as they have installed the appropriate language support (fonts and locale information) through the control panel. With Windows NT 5.0, users will also be able to enter any language into a document. For example, they could run a Russian word processor on English Windows NT and enter Japanese text. The system offers users the additional flexibility of changing the language of the system's UI or the UI of any application that supports multiple languages. Because Windows NT supports user profiles, users sharing the same machine at different times can log on with different language preferences.

For More Information

For the latest information on Windows NT Server, check out our World Wide Web site at http://www.microsoft.com/backoffice or the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).

For information on globalizing applications, visit http://www.microsoft.com/globaldev.

You can find more details on software internationalization in the Microsoft Windows Operating Systems NLSAPI Functional Specification and the Microsoft Windows NT 5.0 Multilingual Functional Specification, available on the Microsoft Developer Network, or in Developing International Software for Windows 95 and Windows NT by Nadine Kano, published by Microsoft Press, ISBN 1-55615-840-8.

Appendix A: Internationalization Checklist

DBCS Enabling

To write double-byte character set (DBCS) enabled code

1	Avoid assuming:
	a fixed character size
	a fixed lead byte range
2	Use CharPrev and CharNext instead of p-- or p++
3	Treat lead bytes and trail bytes as one unit
4	Handle WM_IME_CHAR
5	Store code page information with data
6	Check your algorithms for:
	caret positioning
	pointer arithmetic
	character code alignment
	word wrapping
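The advice to treat lead and trail bytes as one unit can be illustrated with a small portable sketch of what functions such as CharNext and CharPrev have to do for code page 932. The helper names are hypothetical, and the hard-coded lead byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are used only to keep the example self-contained; production code should ask the system rather than assume a fixed range, as item 1 recommends.

```c
#include <stddef.h>

/* Lead byte ranges for code page 932 (Shift-JIS). Hard-coded only to keep
   this sketch portable; real code should query the system instead. */
static int is_lead_byte_932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Advance by one character: two bytes for a lead/trail pair, otherwise one.
   Never steps past the terminating NUL. */
const char *dbcs_next(const char *p)
{
    if (*p == '\0')
        return p;
    if (is_lead_byte_932((unsigned char)p[0]) && p[1] != '\0')
        return p + 2;
    return p + 1;
}

/* Step back by one character. A trail byte can have the same value as an
   ordinary single-byte character, so p-- is unsafe; rescan from the start. */
const char *dbcs_prev(const char *start, const char *p)
{
    const char *cur = start;
    const char *prev = start;
    while (cur < p && *cur != '\0') {
        prev = cur;
        cur = dbcs_next(cur);
    }
    return prev;
}

/* Count characters, not bytes. */
size_t dbcs_char_count(const char *s)
{
    size_t n = 0;
    while (*s != '\0') {
        s = dbcs_next(s);
        n++;
    }
    return n;
}
```

For example, the Katakana letter encoded as the bytes 0x83 0x41 has a trail byte equal to the ASCII code for "A". Stepping back with p-- from the following character would land on that trail byte and misread it as a separate letter, whereas dbcs_prev returns the position of the lead byte.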

Unicode Enabling

To write Unicode-enabled software

1	Use:
	generic data types TCHAR, LPTSTR for text
	LPVOID for pointers of indeterminate type
	explicit types LPBYTE for byte pointers
	the TEXT macro
	generic function prototypes
2	Avoid:
	algorithms that assume small character sets
	translation to and from code pages
	assuming a character size
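The following sketch shows the idea behind generic text types and the rule against assuming a character size. To keep it compilable outside Windows, it uses hypothetical stand-ins (XCHAR, XTEXT, COUNT_OF, and the BUILD_WIDE switch) in place of TCHAR, the TEXT macro, and the UNICODE compile-time switch; the helper functions are likewise illustrative only.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR and TEXT. Define BUILD_WIDE at compile
   time to switch every string in the program to wide characters. */
#ifdef BUILD_WIDE
typedef wchar_t XCHAR;
#define XTEXT(s) L##s
#else
typedef char XCHAR;
#define XTEXT(s) s
#endif

/* Number of elements in a fixed array, independent of the element size. */
#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Length in characters, whatever the character size is. */
size_t xstrlen(const XCHAR *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Bytes needed to store s with its terminator. Multiplying by sizeof(XCHAR)
   avoids the assumption that one character is one byte. */
size_t xbytes_needed(const XCHAR *s)
{
    return (xstrlen(s) + 1) * sizeof(XCHAR);
}

/* Copy with truncation. capacity is a count of characters, not bytes.
   Returns the number of characters copied, excluding the terminator. */
size_t xcopy(XCHAR *dst, size_t capacity, const XCHAR *src)
{
    size_t i = 0;
    if (capacity == 0)
        return 0;
    while (i + 1 < capacity && src[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    dst[i] = 0;
    return i;
}
```

Because every size is expressed either in characters (COUNT_OF) or as a character count multiplied by sizeof(XCHAR), the same source behaves correctly whether it is built for 8-bit or wide characters.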

Font Technology

To select the appropriate font and output text in local script

1	Use:
	EnumFontFamilies or ChooseFont to select fonts
	GetTextCharSetInfo to generate the font signature
	GetLocaleInfo to generate the locale signature
2	Record the charset in your document files
3	Avoid:
	using OEM_CHARSET
	using ANSI_CHARSET by default
	assuming a given font facename exists

Bidirectionality

To support bidirectionality in your software

1	Use GetFontLanguageInfo and GetCharacterPlacement to reorder text
2	Use ExtTextOut

Vertical Writing

To implement vertical writing in your software

1	Use fonts with @ in front of facename
2	Set escapement and orientation to 270°
3	Check your algorithms for:
	coordinates calculation
	caret positioning
	caret orientation
	virtual key handling

Changing Input Language

To handle changing the input language properly

1	Add code to manage WM_INPUTLANGCHANGEREQUEST and WM_INPUTLANGCHANGE
2	Use GetLocaleInfo and either TranslateCharsetInfo or GetTextCharsetInfo to determine if language is supported by available fonts
3	Use ActivateKeyboardLayout to activate a specific layout

Locale Awareness

To make your software locale-aware

1	Use the National Language Support (NLS) API to:
	formulate date and time
	create calendars
	format numbers and currency
	compare strings
	sort strings
	validate code pages
	generate locale font signatures
	enumerate system code pages
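To see why number formatting belongs to the locale rather than to the application, consider this portable sketch. The number_format structure and format_number function are hypothetical; in a real application the separator values would come from the NLS API (for example, from GetLocaleInfo) instead of being supplied by hand, and the sketch handles only uniform digit groups and two decimal places.

```c
#include <stddef.h>

/* Hypothetical per-locale settings. A real application would retrieve
   these from the system instead of hard-coding them. */
struct number_format {
    char decimal_sep;   /* for example '.' or ',' */
    char group_sep;     /* for example ',' or '.' */
    int  group_size;    /* digits per group; 0 disables grouping */
};

/* Format a non-negative amount given as a whole part and hundredths.
   Returns the number of characters written, excluding the terminator. */
size_t format_number(char *out, size_t cap, unsigned long whole,
                     unsigned hundredths, const struct number_format *fmt)
{
    char tmp[64];       /* digits and separators, built in reverse order */
    size_t n = 0;
    size_t len = 0;
    int in_group = 0;

    if (cap == 0)
        return 0;

    /* Fractional part, least significant digit first. */
    tmp[n++] = (char)('0' + hundredths % 10);
    tmp[n++] = (char)('0' + (hundredths / 10) % 10);
    tmp[n++] = fmt->decimal_sep;

    /* Whole part, inserting a separator after each full group. */
    do {
        if (fmt->group_size > 0 && in_group == fmt->group_size) {
            tmp[n++] = fmt->group_sep;
            in_group = 0;
        }
        tmp[n++] = (char)('0' + whole % 10);
        whole /= 10;
        in_group++;
    } while (whole != 0);

    /* Reverse into the caller's buffer, truncating if it is too small. */
    while (n > 0 && len + 1 < cap)
        out[len++] = tmp[--n];
    out[len] = '\0';
    return len;
}
```

The same value can then be rendered with a period as the decimal separator for one locale and a comma for another without changing any application logic, which is the behavior the NLS API provides for the full set of supported locales.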

Localizability

To make your software localizable

1	Isolate all UI elements
2	Be sure UI elements are not:
	hidden
	overlapping
	part of sentences
3	Use dynamic buffers to allow maximum buffer size
4	Document string usage
5	Be sure strings are not:
	built at run time
	concatenated
	built by stripping out characters
6	Do not use string substitution unless absolutely necessary
7	Use FormatMessage for strings that have several arguments
8	Do not build sentences at runtime
9	Avoid gender dependencies
10	Be sure icons or bitmaps do not contain text
11	Be sure Resource IDs are:
	constant throughout all language editions
	unique in a file
	the same across platforms
12	Be sure resources are:
	standard Windows resources
	categorized
13	Allow for forty percent (40%) growth of resources
14	Be sure the .RC file does not contain items that cannot be localized
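The recommendation to use FormatMessage for strings with several arguments rests on numbered inserts: because each argument is referenced by position (%1, %2), a translator can reorder the arguments to suit the grammar of the target language without any code change. The sketch below imitates that convention in portable C. The format_inserts function is a hypothetical helper written for this illustration, not a Win32 call, and it supports only plain string inserts %1 through %9 and %% for a literal percent sign.

```c
#include <stddef.h>

/* Append one character if room remains, always leaving space for the NUL. */
static void put_char(char *out, size_t cap, size_t *len, char c)
{
    if (*len + 1 < cap)
        out[(*len)++] = c;
}

/* Expand %1..%9 in tmpl using args[0..nargs-1]; %% yields a literal '%'.
   Inserts that refer to a missing argument expand to nothing.
   Returns the number of characters written, excluding the terminator. */
size_t format_inserts(char *out, size_t cap, const char *tmpl,
                      const char *const *args, size_t nargs)
{
    size_t len = 0;

    if (cap == 0)
        return 0;
    while (*tmpl != '\0') {
        if (tmpl[0] == '%' && tmpl[1] >= '1' && tmpl[1] <= '9') {
            size_t idx = (size_t)(tmpl[1] - '1');
            if (idx < nargs) {
                const char *a = args[idx];
                while (*a != '\0')
                    put_char(out, cap, &len, *a++);
            }
            tmpl += 2;
        } else if (tmpl[0] == '%' && tmpl[1] == '%') {
            put_char(out, cap, &len, '%');
            tmpl += 2;
        } else {
            put_char(out, cap, &len, *tmpl++);
        }
    }
    out[len] = '\0';
    return len;
}
```

With the arguments "report.txt" and "Maria", the template "%2 opened %1" and the template "%1 was opened by %2" both produce complete sentences from the same code path. Only the template, which lives in the localizable resources, differs.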

Multilingual UI

To implement a multilingual interface, do one of the following

1	Have multiple language resources in one executable
	Retrieve resources with FindResource or FindResourceEx
2	Separate language DLLs
	Use naming conventions for DLL extensions
	Construct DLL names at run time
	Retrieve DLLs using FindFirstFile or FindNextFile

To enable dynamic switching between languages

1	Enumerate languages at run time
2	Provide a UI for the user to select a language
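The "construct DLL names at run time" step for separate language DLLs might look like the following sketch. The naming convention used here (base name followed by the four-digit hexadecimal language ID, such as myapp0411.dll) and both function names are assumptions made for the example; the paper does not prescribe a particular convention. The candidate list tries the exact language first, then the default sublanguage of the same primary language, then a fallback language.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical convention: <base><4 hex digits of language ID>.dll.
   Returns 1 on success, 0 if the name did not fit in the buffer. */
int build_resource_dll_name(char *out, size_t cap,
                            const char *base, unsigned lang_id)
{
    int n = snprintf(out, cap, "%s%04X.dll", base, lang_id & 0xFFFFu);
    return n >= 0 && (size_t)n < cap;
}

/* Fill out[] with up to three distinct language IDs to try, in order:
   the requested language, the same primary language with the default
   sublanguage (sublanguage value 1 in bits 10 and up), and the fallback.
   Returns the number of entries written. */
size_t candidate_languages(unsigned wanted, unsigned fallback,
                           unsigned out[3])
{
    size_t n = 0;
    unsigned primary_default = 0x0400u | (wanted & 0x3FFu);

    out[n++] = wanted;
    if (primary_default != wanted)
        out[n++] = primary_default;
    if (fallback != wanted && fallback != primary_default)
        out[n++] = fallback;
    return n;
}
```

An application could build a name for each candidate in turn and attempt to load it (for example with LoadLibrary), stopping at the first DLL that is present.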

Appendix B: Guidelines for Supporting Complex Scripts in Applications

Introduction

This appendix presents design principles to consider when developing applications to support complex scripts such as Arabic, Hebrew, Thai, and Indic scripts. We will first go over the properties of these scripts that set them apart from traditional scripts used for written communication, such as the Latin and ideographic scripts. Then we will point out some conventional programming techniques that cause problems in processing complex scripts, and give guidelines on how to avoid these problems in your applications.

Basic Concepts

A script as used in this document is a collection of symbols used for written communication, usually with a common property that justifies their association as a set. For example, the Latin script consists of the uppercase letters A-Z and the lowercase letters a-z. Written English generally contains two scripts: Latin letters and Arabic numerals. Written Japanese can contain up to five scripts: Hiragana, Katakana, ideographs, Arabic numerals, and Latin script. Other examples are Hungarian (extended Latin script), Korean (Hangul and Hanja scripts), and Vietnamese (extended Latin script). These scripts all share the property that they are displayed as discrete glyphs, one per character, one after another, progressing from left to right (or vertically from top to bottom).

A complex script is one in which this assumption of linear layout, from left to right, does not hold. The following are some examples of non-linear processing required of complex scripts, including example languages:

Bidirectional (BiDi) reordering when displaying the backing store in visual order (Arabic, Hebrew, Persian, Pashto, Urdu).

Contextual shaping (Arabic and Indic families).

Display of combining characters and diacritics (Arabic, Hebrew, Thai, Indic).

Specialized word-break and justification rules (Arabic, Thai, others).

Disallowing illegal character combinations (Thai).

Proper cursor movement, text selection, and text highlighting.

Programming Pitfalls

Remember the good old days, when all characters were ASCII, and there was only one locale (“C”)? You could make all kinds of assumptions that simplified programming (for example, that everyone uses the same date format and the same decimal point indicator). And you did not sell much software outside North America, either.

Nowadays, most software designers are aware of the need to eliminate assumptions about locale and language from their software to make it acceptable to users in other locales. The message in this section is simply that you probably need to eliminate more assumptions in your software.

Multiple scripts per document

In the past, when Windows did not support mixing of scripts very well, you could get by with a monolingual application, using 8-bit character strings to store text, assuming the same code page throughout your application. However, as explained in the body of this document, Windows NT 5.0 supports multilingual applications, and many of your customers will demand the ability to mix scripts in a single text document.

One approach, of course, is to use multiple 8-bit code pages, enough for each of the scripts you wish to support. This is cumbersome at best, and quite unnecessary. Instead, use Unicode, as explained earlier in this paper.

Context sensitive characters

The second assumption you need to discard is that a given character in a given font always looks the same, and has the same properties. Characters in languages such as Arabic change shape depending on the surrounding characters. Specifically, Arabic characters take one of four forms—initial, medial, final, and stand-alone—depending on where they occur in the text stream. Moreover, adjacent Arabic characters often ligate, meaning they combine together in a single glyph called a ligature.

This means you cannot use the old trick of putting out characters one by one, as you get them in the wParam parameter from the WM_CHAR message. If you do, then the system cannot do the contextual shaping for you, because when it comes time to render a character, the system does not know what characters precede or follow. It also means that you should not cache character widths and compute line lengths yourself, since the width of the character depends on the context. For example, this code will produce incorrect results when displaying most complex scripts:

case WM_CHAR:

// NOTE: This is an example of what *not* to do, because
// characters that should join or otherwise interact
// typographically will show as separate, stand-alone characters.
hDc = GetDC (hWnd) ; // BTW, this is also bad for other reasons!
SelectObject (hDc, hTextFont) ;
SetBkMode (hDc, TRANSPARENT) ;
ExtTextOut (hDc, g_xStartChar, YSTART, 0,
 NULL, (LPCTSTR) &wParam, 1, NULL) ;

// Get the next character position
GetCharWidth (hDc, (UINT) wParam, (UINT) wParam, &cCharWidths) ;
// This assumes left to right scripts, so it will break on
// Arabic and Hebrew!
g_xStartChar += cCharWidths ;

ReleaseDC (hWnd, hDc) ;
return 0 ;

Instead, you should save characters in a buffer, and put out the entire buffer each time a new character is typed, as follows:

RECT rcRectLine ;
...
case WM_CHAR:

// Append the character only if there is room left in the buffer
if (nChars < BUFFER_SIZE-1) {
 szOutputBuffer[nChars] = (TCHAR) wParam ;
 nChars++ ;
}
// This will generate a WM_PAINT message, where all of
// the text buffer is displayed at once. This is the
// recommended approach.
InvalidateRect (hWnd, &rcRectLine, TRUE) ;
return 0 ;

case WM_PAINT :

hDc = BeginPaint (hWnd, &ps) ;

SelectObject (hDc, hTextFont) ;

// Write the whole text buffer in the line buffer rectangle
// This happens every time the user enters a character
ExtTextOut (hDc, nxStartBuffer, nyStartBuffer, ETO_OPAQUE, 
&rcRectLine, szOutputBuffer, nChars, NULL) ;

EndPaint (hWnd, &ps) ;

return 0 ;

Bidirectional layout

Another assumption is that a character always displays to the right of the characters that precede it in the text. Notice in the example above, we moved the x position to the right after each character was input, using these lines:

// Get the next character position
GetCharWidth (hDc, (UINT) wParam, (UINT) wParam, &cCharWidths) ;
// BAD! Don’t do this!
g_xStartChar += cCharWidths ;

Correctly determining the position of the next character in the stream would require implementing the Unicode algorithm for layout of bidirectional text (BiDi algorithm), which is a major undertaking indeed. Instead, use ExtTextOut on the whole buffer, as shown above, and let the system implementation of the BiDi algorithm handle layout.

However, there may be other cases where your application assumes left to right (LTR) layout, such as the x position passed in the call to ExtTextOut. You can make this selectable by the user, and set the proper x value as follows:

static UINT uiAlign = TA_LEFT ;
int nxStartBuffer ;
...
case WM_PAINT :

hDc = BeginPaint (hWnd, &ps) ;

SelectObject (hDc, hTextFont) ;

// Set the x position for right or left aligned text
if (uiAlign & TA_RIGHT) {// Start at right edge
 nxStartBuffer = rcRectLine.right ;
} else {// Start at left edge
 nxStartBuffer = XSTART ;
}

SetTextAlign (hDc, uiAlign) ;

// Same as above
ExtTextOut (hDc, nxStartBuffer, nyStartBuffer, ETO_OPAQUE, 
&rcRectLine, szOutputBuffer, nChars, NULL) ;
...

Complex cursor positioning, highlighting, and selection

Because modern graphical interfaces handle glyphs of various widths, most applications that display a cursor as they put out text take this into account. However, you may find that your software assumes it can move the cursor over one character at a time as the user types the left or right arrow keys. This does not work for Thai and some Indic scripts, some of whose characters may be displayed above, below, or to the left of previous characters. In Thai, for example, if the cursor is positioned after a base consonant, vowel, and tone mark, the cursor should skip back over all three characters when the user types the back arrow.
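The Thai example can be made concrete with a small sketch of cluster-aware caret movement. It is deliberately simplified and rests on several assumptions: the text is stored as 16-bit Unicode code units, only the Thai marks that are drawn above or below a base character (U+0E31, U+0E34 to U+0E3A, and U+0E47 to U+0E4E) are treated as part of the preceding character, and the function names are hypothetical. It does not address other scripts or the vowels that are displayed to the left of a consonant.

```c
#include <stddef.h>

/* Thai marks that are drawn above or below the preceding base character.
   Simplified: covers Thai only. */
static int is_thai_combining(unsigned short ch)
{
    return ch == 0x0E31 ||
           (ch >= 0x0E34 && ch <= 0x0E3A) ||
           (ch >= 0x0E47 && ch <= 0x0E4E);
}

/* New caret index after the user presses the back arrow at index pos.
   Skips back over any marks so the caret lands before the base character. */
size_t caret_prev(const unsigned short *text, size_t pos)
{
    if (pos == 0)
        return 0;
    pos--;
    while (pos > 0 && is_thai_combining(text[pos]))
        pos--;
    return pos;
}

/* New caret index after the user presses the forward arrow at index pos.
   Skips forward over any marks attached to the character just passed. */
size_t caret_next(const unsigned short *text, size_t len, size_t pos)
{
    if (pos >= len)
        return len;
    pos++;
    while (pos < len && is_thai_combining(text[pos]))
        pos++;
    return pos;
}
```

For a buffer holding a consonant (U+0E01), a vowel above it (U+0E34), and a tone mark (U+0E49), a back arrow pressed after the tone mark moves the caret to index 0 in a single step instead of stopping between the marks.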

This is just one example of the kinds of problems you can run into when you support direct editing of text. Others include split highlighting and selection when the user drags the mouse over bidirectional text, and improper assumptions about word breaking when you wrap text.

A complete description of how to handle all cases for every script you encounter is beyond the scope of this paper. Suffice it to say that the most convenient way to handle these cases is to leave it up to the system by using an edit control. Both the simple edit control and the rich edit control have been enabled for complex scripts in the Arabic, Hebrew, and Thai versions of Windows NT 4.0, and in Windows NT 5.0 with the appropriate locale support installed.

Summary of Guidelines

Here is a summary of the guidelines to process complex scripts correctly:

Use Unicode as your character encoding if the target platform is Windows NT.
Use ExtTextOut to display all of the text in a line at once. Displaying text character by character as it is entered will result in improper display of context sensitive text.
Do not cache character widths; instead use GetTextExtentExPoint.
Applications that cache character widths implicitly assume that characters always have the same width. As a result, they may measure line lengths of complex scripts incorrectly, because the width of a character depends on the surrounding characters. The text extent functions, such as GetTextExtentExPoint, have been extended to work correctly with complex scripts on platforms that support those scripts.
Use an edit control if possible.
The edit control will handle all processing of complex script for you, including input, display, cut and paste, input of Unicode control characters, and so on.