Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0--MSJ, November 1998

Download Nov98MTLCS.exe (629KB)

The three authors of this article work in the Windows Operating System division International group at Microsoft. F. Avery Bishop is an evangelist for international software development. David C. Brown is the architect of Uniscribe. David M. Meltzer manages the OpenType specification and related developer resources.

All international versions of Windows NT® 5.0, from Japanese to Hebrew, are based on the same binary files. In addition, Microsoft has created new services for text layout that support a wide range of languages. By internationalizing the Windows® text layout interfaces, Microsoft has made it easier to develop applications that can lay out text for almost any language.
      For developers of global applications, the shift to a single worldwide binary strategy is welcome news; different localized versions of an application can now be developed under a single Windows NT 5.0-based system. That is, for most applications you won't have to switch from one localized system to another during development (although you should certainly test on each targeted platform before release).
      In this article, we'll explore techniques for developing applications that handle multilingual text and complex scripts. We'll introduce Uniscribe, the Windows Unicode script processor, which comes with Windows NT 5.0. This article also will cover existing interfaces that Microsoft has extended to meet the requirements of complex text layout.
Multilingual Features in Windows NT 5.0
      All language versions of Windows NT 5.0 will be enabled for all supported languages, including European and Far Eastern languages. This includes languages written with complex scripts such as Arabic, Hebrew, Thai, Devanagari, and Tamil. Applications that display plain text using Unicode can handle mixed text from any of the supported scripts. For example, you can pass a Unicode string containing French, Thai, Hindi, Korean, and Arabic text to ExtTextOutW, which will display the whole string in one pass. Thus, a Unicode application (one compiled with the -DUNICODE option) can display text in any of the supported scripts, without changing locales, on any language version of Windows NT 5.0.
      An ANSI application that targets a particular language can run on any language version of Windows NT 5.0 if you first set the system default appropriately. For example, setting the system default locale to Japanese on a system localized to French allows the user to run a Japanese ANSI application.
      Windows NT 5.0 includes a new Unicode script processor called Uniscribe that supports line measurement, display, caret movement, character selection, justification, and line breaking of Unicode plain text and rich text.
      The input method manager (IMM) runs on any version of Windows NT 5.0. You can install Chinese, Japanese, and Korean input method editors (IMEs), and use them to enter text in the appropriate language. Similarly, all keyboard drivers work on all language versions of Windows NT 5.0, and all locales are supported by all versions of Windows NT 5.0. For example, users on any system can set their user locale to Arabic and the calendar type to Hijri.
      A multilanguage version of Windows NT 5.0 will allow per-user setting of the user-interface language. One user can see system messages, menus, and other text in Japanese, while another user logging onto the same system can see the corresponding text in French. Plus, Windows NT 5.0 can provide a fallback glyph for characters that have no corresponding glyph in the currently selected font. This enables an application to display multilingual Unicode plain text.
Characteristics of Complex Scripts
      The rules governing the shaping and positioning of glyphs are specified and cataloged within the Unicode standard. The shaping engines that comprise the Windows Unicode Script Processor implement this standard for applications performing complex text layout.
      A complex script is one that requires special processing to display and edit because the characters are not laid out in a simple linear progression from left to right, as most European characters are. This special processing falls into several general classes.
Character Reordering: Characters must be rearranged from logical (keystroke) order to visual order.
Contextual Shaping: In some languages, the choice of which glyph to display depends on the surrounding characters.
Display of Combining Characters and Diacritics: Multiple characters must be stacked or combined into one cluster.
Specialized Word-break and Justification Rules: Some languages require special word-break logic because there is no fixed set of characters that delimit words.
Cursor Movement and Hit Testing: The mapping between screen position and a character index for, say, selection of text or cursor display requires knowledge of the layout algorithms.
The OpenType Font Format
      The Unicode-based OpenType™ font format has been developed jointly by Microsoft and Adobe; it extends the TrueType font file format originally designed by Apple. OpenType fonts allow mapping between characters and glyphs, enabling support for ligatures, positional forms, alternates, and other substitutions. OpenType fonts may also include information that supports two-dimensional glyph positioning and glyph attachment, and may contain either TrueType or PostScript outlines.
      Layout features within OpenType fonts are organized by scripts and languages, allowing a single font to support multiple writing systems, even within the same script. To ensure consistency in text layout operations and to avoid unnecessary overhead in font files or applications, many of the text layout and language semantic algorithms are included in Uniscribe. This relieves the font developer from having to define generalized script rules within a font.
      Applications may introduce their own knowledge or preferences regarding script layout. OpenType layout fonts may even contain layout rules that duplicate or supersede those applied by OS services. The layered structure of OS services supporting text layout allows a client to choose which layout information to use, and how to apply it.
      At a minimum, font developers should be able to expect that an application has knowledge of (or services for executing) script rules as defined in the Unicode standard. Application developers should be able to expect that a font has glyphs and positioning information representing layout features as defined by the Unicode standard.
Multilingual Input
      Chapter 6 of Developing International Software for Windows 95 and Windows NT by Nadine Kano (Microsoft Press, 1995) explains how an application can handle switching of the input locale by the user. When the book was published, support for switching input locales was only available in Windows 95; Windows NT 3.5 did not support the new messages and interfaces. This support has been in Windows NT since version 4.0.
      Notice that we use the term "input locale" rather than "keyboard layout." When users change the input locale, they're doing much more than changing the keyboard layout, and in some cases a keyboard may not be involved at all. The input locale consists of an input language and a method of input. The input language can be any LangID supported in Windows NT, as found in the WINNT.H header file. The method of input is often a keyboard layout, but it can be anything from an IME to a speech recognition system.
      The input language is of interest to the user and to the application. For example, applications can use the input language to tag text or to choose a new font. In general, however, applications do not care what method of input is used. The application always gets the characters from the user in the same way, through the WM_CHAR or WM_IME_CHAR messages. (IME-aware applications, those that display their own UI for an IME, are an obvious exception.)
      When the user changes the input locale, the application receives a WM_INPUTLANGCHANGEREQUEST message, with lParam set to the new HKL. (HKL originally stood for "handle to a keyboard layout," but as you've seen, it isn't necessarily associated with a keyboard at all.) The HKL contains the LangID of the new language in the low word. You can use this to select compatible fonts or to tag text for other processing.
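      The relationship between the HKL and its LangID can be sketched in portable C++. This is a minimal illustration, not code from the sample: InputLocale, LangId, and the helper names are hypothetical stand-ins for the Win32 HKL and LANGID types and the LOWORD, PRIMARYLANGID, and SUBLANGID macros, and only Arabic and Hebrew are treated as right-to-left here.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-ins (assumptions): on Windows, use HKL and LANGID instead.
using InputLocale = std::uintptr_t;  // stands in for HKL
using LangId      = std::uint16_t;   // stands in for LANGID

// The low word of the input locale handle is the language identifier
// (the job LOWORD does in Win32 code).
inline LangId langIdOf(InputLocale hkl) {
    return static_cast<LangId>(hkl & 0xFFFF);
}

// The low 10 bits of a LangID are the primary language (cf. PRIMARYLANGID).
inline unsigned primaryLang(LangId id) { return id & 0x3FFu; }

// The high 6 bits are the sublanguage (cf. SUBLANGID).
inline unsigned subLang(LangId id) { return static_cast<unsigned>(id) >> 10; }

const unsigned kLangArabic = 0x01;  // value of LANG_ARABIC in WINNT.H
const unsigned kLangHebrew = 0x0D;  // value of LANG_HEBREW in WINNT.H

// True when the new input language is Arabic or Hebrew, so that an
// application might switch reading order or choose a compatible font.
// Other right-to-left languages would need to be added to this check.
inline bool isRightToLeftInput(InputLocale hkl) {
    const unsigned primary = primaryLang(langIdOf(hkl));
    return primary == kLangArabic || primary == kLangHebrew;
}
```

      In a window procedure, a check of this kind would typically be applied to the lParam of the input language change message before selecting a font or tagging text.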
      For example, the code in Figure 1 identifies the charset corresponding to the new input locale and enumerates a set of compatible fonts. Note that hWnd is the window handle where the font will be used, and hDlg is a dialog to display the font list. In this example, the callback function passed to EnumFontFamiliesEx is as follows:
int CALLBACK EnumFontProc (ENUMLOGFONTEX* lpelfe, NEWTEXTMETRICEX* lpntme,
                           int iFontType, LPARAM lParam)
{
    // Size computed from format used below
    // and buffer limits
    TCHAR szFaceName [4+LF_FULLFACESIZE+LF_FACESIZE] ;

    // Build "<full face name> (<script name>)"
    wsprintf (szFaceName, TEXT("%s (%s)"), lpelfe->elfFullName, lpelfe->elfScript) ;

    // Add string to listbox to describe this font
    SendDlgItemMessage ((HWND) lParam, IDC_FONTLIST, LB_ADDSTRING, (WPARAM) 0, (LPARAM) szFaceName) ;

    return TRUE ;
}
Each time the callback function is called, it builds a string containing the full typeface name and the script name, and adds that string to the listbox in a dialog box.
      As you can see from the sample code in Figure 1, Indic scripts must be handled separately because they have no charset values. Since there is no default ACP value for Indic scripts, none of the Win32® ANSI entry points (the A routines) will work with Indic text. Indic text is not automatically translated to Unicode. There are ways to force translation of Indic text to or from Unicode using MultiByteToWideChar and WideCharToMultiByte by specifying the appropriate code page. However, an Indic input locale can only pass Indic text to a Unicode application, so full support for Indic scripts requires a Unicode application.
Doing Text Layout Using Win32 APIs
      An application has the following options for performing text layout:
Calling Win32 text APIs
Instantiating Win32 edit controls
Instantiating a RichEdit control
Calling Uniscribe
Some applications will use a combination of these methods. Responsibility for performing and tracking text layout operations depends on a client's implementation model. For example, some clients handle line breaking and some don't. Some functionality, such as managing memory and maintaining a backing store, is shared by all clients.
      Many applications deal mostly in plain text—text that is all in the same typeface, weight, color, and so on. Such applications have traditionally displayed text using standard Win32 display entry points (TextOut, ExtTextOut, TabbedTextOut, and DrawText) to write text to a window, and the GetTextExtent family of functions to measure line lengths. As you'll see later, Uniscribe provides ScriptString APIs for better plain text processing on Windows NT 5.0 and Windows 9x. There is good news for existing applications that use the standard Win32 API for plain text processing; it just works!
      In Windows NT 5.0, the standard entry points have been extended to support display of complex scripts and, through the font fallback mechanisms mentioned earlier, multilingual Unicode text. In general, this support is transparent to the application itself, so properly designed applications require no changes to support complex scripts through these interfaces.
      There are two requirements for displaying complex scripts correctly using the standard Win32 APIs. First, applications should save characters in a buffer and display the whole line of text at once rather than, for example, calling ExtTextOut on each character as it is typed in by the user. When characters are written out one by one, the complex script shaping modules cannot determine the context for correct reordering and glyph shaping.
      Second, applications should use one of the GetTextExtentXxx functions to determine line length rather than computing line lengths from cached character widths. This is because the width of a glyph used to display a character may vary by context.
      In addition, complex script-aware applications should consider adding support for right-to-left reading order and right alignment. You can toggle the reading order or alignment between left and right with the code shown in Figure 2. Of course, you can toggle both attributes at once, as Notepad does, by executing the following statement:
lAlign ^= TA_RIGHT|TA_RTLREADING;
Follow this by the calls to SetTextAlign and ExtTextOut as shown in Figure 2.
Standard Edit Control
      The standard edit control has been extended in Windows to support multilingual text and complex scripts. This includes not only input and display, but also correct cursor movement over character clusters (in Thai and Devanagari script, for example).
      As with the standard Win32 API functions, a well-written application will receive this support automatically, without modification. Again, you should consider adding support for right-to-left reading order and right alignment. In this case, toggle the extended style flags of the edit control window to control these attributes:
// ID_EDITCONTROL is the control ID in the
// resource file.
HWND hWndEdit = GetDlgItem(hDlg, ID_EDITCONTROL);
LONG lAlign = GetWindowLong(hWndEdit, GWL_EXSTYLE) ;
//...
// To toggle alignment
lAlign ^= WS_EX_RIGHT ;
// To toggle reading order
lAlign ^= WS_EX_RTLREADING ;
      After setting the lAlign value, enable the new display by setting the extended style of the edit control window as follows:
// (This assumes your edit control is in a
// dialog box. If not, you can
// get the edit control handle
// from another source)
SetWindowLong(hWndEdit, GWL_EXSTYLE, lAlign);
InvalidateRect(hWndEdit, NULL, FALSE);
One new feature of the standard edit control is a context menu that allows the user to toggle the reading order and insert/display Unicode bidirectional control characters (see Figure 3).

Figure 3 Edit Control Context Menu

RichEdit Control
      RichEdit 3.0 is a higher-level collection of interfaces that takes advantage of Uniscribe to further insulate text layout clients from the complexities of certain scripts. RichEdit is designed for clients whose primary purpose is not necessarily text layout, but who nonetheless need to display complex scripts.
      RichEdit provides fast, versatile editing of rich Unicode multilingual text and simple plain text. It includes extensive message and COM interfaces, text editing, formatting, line breaking, simple table layout, vertical text layout, bidirectional text layout, Indic and Thai support, a Word-like edit UI, and Text Object Model interfaces. RichEdit is the simplest way for a client to support features of complex scripts. Clients use its TextOut function to automatically parse, shape, position, and break lines.
Uniscribe
      The new Unicode Script Processor (USP10.DLL), also known as Uniscribe, is a collection of APIs that enables a text layout client to format complex scripts. Uniscribe supports the complex rules found in scripts such as Arabic, Indian, and Thai. Uniscribe also handles scripts written from right-to-left such as Arabic or Hebrew, and supports the mixing of scripts. For plain-text clients, Uniscribe provides a range of ScriptString functions that are similar to TextOut, with additional support for caret placement. The remainder of the Uniscribe interfaces provide finer control to clients.
      Although native to Windows NT 5.0, the Uniscribe DLL may also be distributed for use on Windows NT 4.0, Windows 95, and Windows 98-based systems. USP10.DLL is also expected to ship with Microsoft® Internet Explorer 5.0.
      Uniscribe uses multiple shaping engines that contain the layout knowledge for particular scripts (see Figure 4). It also takes advantage of the OpenType layout shaping engine for handling font-specific script features such as glyph generation, extent measurement, and word-breaking support.
      Uniscribe subdivides strings of characters into items (a character string having all the same script and direction attributes), runs (portions of an item that have continuous formatting attributes), and clusters (script-defined, indivisible character groupings). The client builds runs based on its own stored formatting attributes and on the item boundaries obtained by calling the Uniscribe ScriptItemize API.
      The Uniscribe ScriptShape API breaks a run into clusters according to script rules and then generates glyphs. The ScriptPlace API generates x and y positions for the glyphs. The ScriptTextOut API then displays these glyphs using these x and y positions.
      Uniscribe supports line breaking at word boundaries through ScriptBreak. Hit testing and cursor positioning are supported by ScriptCPtoX and ScriptXtoCP. Character-to-glyph mapping is provided by ScriptGetCMap. Uniscribe manages bidirectional character reordering using the Unicode bidirectional algorithm, and understands non-OpenType layout font formats for Arabic, Hebrew, and Thai shaping.
      Using Uniscribe, clients need only manage a backing store of Unicode character codes. Text layout clients do not need to maintain any other buffer or mapping table to track character order. A client only needs to store and manage the order in which the characters were entered by the user. This is the same logical order as defined by Unicode. The client's backing store never changes as a result of layout operations. Uniscribe maintains an index from the reordered clusters to the original character boundaries passed by the client. Using the Uniscribe interfaces ScriptCPtoX and ScriptXtoCP, clients can support cursor positioning and hit testing.
      All Uniscribe APIs are Unicode APIs. Uniscribe is a single API for Unicode output across Microsoft's operating system range. Scripts are supported as shown in Figure 5.
Problems with Common Text-layout Methods
      The most common way to break a simple text paragraph into filled lines is to sum the widths of the individual characters until the line is full, and then back up to the closest preceding space. Lines of simple text are conventionally split into runs based on hdc attributes such as font and color. Each run is displayed immediately to the right of the previous run, using SelectObject to set the style and ExtTextOut to display the text.
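      The conventional approach described above can be sketched as follows. This is an illustration only, under the assumption that every character has a fixed width that can be looked up in a cached table; breakLinesNaive and unitWidth are hypothetical names, and unitWidth stands in for a real width table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a cached per-character width table (assumption: every
// character is one unit wide).
inline int unitWidth(char) { return 1; }

// Break text into lines no wider than maxWidth by summing fixed character
// widths and backing up to the closest preceding space on overflow.
std::vector<std::string> breakLinesNaive(const std::string& text,
                                         int maxWidth,
                                         int (*widthOf)(char)) {
    assert(maxWidth > 0);
    std::vector<std::string> lines;
    std::size_t lineStart = 0;
    while (lineStart < text.size()) {
        // Sum widths until the line is full or the text ends.
        int width = 0;
        std::size_t i = lineStart;
        while (i < text.size() && width + widthOf(text[i]) <= maxWidth) {
            width += widthOf(text[i]);
            ++i;
        }
        std::size_t lineEnd = i;
        if (i < text.size()) {
            // Overflow: back up to the closest preceding space, if the
            // current line contains one.
            const std::size_t space = text.rfind(' ', i);
            if (space != std::string::npos && space > lineStart) {
                lineEnd = space;
            } else if (i == lineStart) {
                lineEnd = lineStart + 1;  // always make progress
            }
        }
        lines.push_back(text.substr(lineStart, lineEnd - lineStart));
        // The next line begins after any spaces at the break.
        lineStart = lineEnd;
        while (lineStart < text.size() && text[lineStart] == ' ') ++lineStart;
    }
    return lines;
}
```

      Every step here relies on a width that belongs to a single character and on a space character that separates words.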
      For complex scripts these simple approaches have problems. First, the width of a complex script character depends on its context. It is not possible to save the widths in simple tables. Second, breaking between words in scripts like Thai requires dictionary support since there is no separator character between Thai words. Third, Arabic, Hebrew, Farsi, Urdu and other bidirectional text requires reordering before display. And finally, some form of font association is often required to easily use complex scripts.
Formatting Paragraphs in the Sample Application
      Our sample application, CSSamp, demonstrates Uniscribe APIs displaying text (see Figure 6). (Editor's note: the code for this example is included in the archive which can be found at the top of this article.) DspPlain.cpp shows how to use the ScriptString APIs to display plain text. These APIs give similar functionality to ExtTextOut, DrawText, and GetTextExtent, providing full complex script support under Windows 9x and Windows NT 4.0 and higher. DspFormt.cpp shows how to use lower-level APIs such as ScriptItemize, ScriptShape, and ScriptPlace to display high quality formatted text. In the sample application, all the paragraph formatting code is in DspFormt.cpp.
      Let's summarize the text formatting process. First, runs are built that are unique in style, script, and direction. Then, lines are broken into whole runs. For each line, a map is built from visual position to a run. For each run, the codepoints are shaped in visual order into glyphs, which are then positioned and rendered.
      The function PaintFormattedText in DspFormt.cpp breaks the text buffer (g_wcBuf) at CR+LF and passes single paragraphs to PaintFormattedTextPara. To keep the app simple, it reapplies the entire layout process every time it displays a paragraph. (In real-world applications, it may be a better idea to save formatting information such as run sizes and line boundaries.) The entry conditions for PaintFormattedTextPara are the text buffer (g_wcBuf), the text style list (head at g_pFirstFormatRun), and the character positions at the start and end of the paragraph. The entire paragraph is first split into runs, each of a single script, a single style, and a single direction. The paragraph is broken into lines by measuring runs in logical order until the line overflows. A word-breaking algorithm then breaks the overflowing run between the current line and the next line. Finally, the lines are displayed one at a time by PaintFormattedTextLine. BuildParaRunList creates runs of text that contain no changes of style, script, or direction.
      The Uniscribe API ScriptItemize is passed the entire paragraph and then breaks it at script and direction boundaries. Script boundaries are determined by the codepoint coverage of the shaping engines; direction boundaries are evaluated according to the Unicode bidirectional algorithm. ScriptItemize accepts many options in its SCRIPT_CONTROL and SCRIPT_STATE parameters, some of which are Unicode bidirectional algorithm control and choice of digits. ScriptItemize fills a buffer with information about each item, including an internal script enumeration, the direction encoded as a Unicode embedding level, and the flags to be passed to the shaping engine.
Shaping Engines and Font Association
      Itemization serves two main purposes. It breaks the string into runs that match the codepoint ranges of the shaping engines or where a change of direction will require reordering. Direction changes are identified according to the Unicode bidirectional algorithm. Many applications will do their own reordering, applying what Unicode calls a higher protocol, because applications generally know more about a string than just the codepoints and can do a better job of reordering than the Unicode bidirectional algorithm. For example, an application may derive directionality from the keyboard layout used to enter a character. This approach gives a consistent and easily understood user interface. If that's what you want your application to do, instruct ScriptItemize to break only for the shaping engine by passing NULL to the SCRIPT_CONTROL and SCRIPT_STATE parameters.
      The sample application merges the items in the ScriptItemize buffer with its own style list, returning a paragraph run list. The representation of the style list and paragraph run list in the sample is arbitrary. I chose a linked list with nodes containing length and style. You may prefer a dynamic array or STL data type.
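      The merge of item boundaries with a style list can be sketched as below. This is not the sample's BuildParaRunList; it is a minimal illustration that assumes the item boundaries are available as ascending character offsets (in the spirit of the character positions ScriptItemize reports) and that the style list is a sequence of lengths. StyleRun, ParaRun, and buildRunList are hypothetical names, and a vector is used instead of a linked list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct StyleRun {            // hypothetical style-list node
    std::size_t length;      // number of characters in this style
    int style;               // application-defined style id
};

struct ParaRun {             // one run: single style, single item
    std::size_t start;       // first character of the run
    std::size_t length;      // number of characters in the run
    int style;               // style id inherited from the style list
    std::size_t item;        // index of the item the run belongs to
};

// itemStarts holds the starting character offset of each item, ascending,
// with itemStarts[0] == 0. The styles are assumed to cover the paragraph.
std::vector<ParaRun> buildRunList(const std::vector<std::size_t>& itemStarts,
                                  const std::vector<StyleRun>& styles) {
    assert(!itemStarts.empty() && itemStarts[0] == 0);
    std::vector<ParaRun> runs;
    std::size_t pos = 0;
    std::size_t item = 0;
    for (const StyleRun& s : styles) {
        const std::size_t styleEnd = pos + s.length;
        while (pos < styleEnd) {
            // Advance to the item that contains pos.
            while (item + 1 < itemStarts.size() && itemStarts[item + 1] <= pos) {
                ++item;
            }
            // The run ends at the next item boundary or at the end of the
            // style, whichever comes first.
            const std::size_t itemEnd =
                (item + 1 < itemStarts.size()) ? itemStarts[item + 1] : styleEnd;
            const std::size_t end = itemEnd < styleEnd ? itemEnd : styleEnd;
            runs.push_back({pos, end - pos, s.style, item});
            pos = end;
        }
    }
    return runs;
}
```

      For item offsets {0, 5, 8} and styles of lengths 3 and 7, this yields four runs starting at offsets 0, 3, 5, and 8, none of which crosses a style or item boundary.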
      Lines containing one or more runs are constructed by measuring the runs in logical order until a run causes the line to overflow. The overflowing run is passed to BreakRun, which determines a suitable wordbreak position. BreakRun uses ScriptGetLogicalWidths to convert the glyph widths returned by ScriptPlace into character widths. ScriptGetLogicalWidths returns virtual character widths ordered one for one with the logical character buffer. These widths are summed to identify the physical end of line as a logical character position.
      BreakRun then uses ScriptBreak to obtain character classifications including whitespace and the start of the word in scripts like Thai. BreakRun retreats from the overflow position to the nearest preceding start of a word. The run is split at this point. Spaces, if any, are left attached to the end of the previous line so that the new line always begins with a nonspace character. (For simplicity, the sample does not implement Far East word-breaking rules.) BreakRun treats a break request at the beginning of a line as a special case to ensure that each line contains at least one cluster. In this case BreakRun uses the logical cluster array returned by ScriptShape to make sure that combining characters are not split from their base characters.
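      The break decision described above might be sketched as follows. This is a simplified illustration, not the sample's BreakRun: CharInfo is a hypothetical per-character record that an application would fill from the logical widths, the ScriptBreak classifications, and the logical cluster array, and fromAscii is a helper invented here to build test input from plain ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct CharInfo {        // hypothetical per-character record
    int  width;          // logical character width
    bool wordStart;      // a word may start at this character
    bool whiteSpace;     // character is whitespace
    bool clusterStart;   // first character of its cluster
};

// Returns how many characters of the run stay on the current line.
// available is the width left on the line; atLineStart is true when the
// run is the first thing on the line.
std::size_t chooseBreak(const std::vector<CharInfo>& run, int available,
                        bool atLineStart) {
    // Sum logical widths to find the physical end of line.
    std::size_t fit = 0;
    int used = 0;
    while (fit < run.size() && used + run[fit].width <= available) {
        used += run[fit].width;
        ++fit;
    }
    if (fit == run.size()) return fit;       // the whole run fits

    std::size_t brk = fit;
    if (run[brk].whiteSpace) {
        // Spaces stay attached to the end of the current line.
        while (brk < run.size() && run[brk].whiteSpace) ++brk;
        return brk;
    }
    // Retreat to the nearest preceding start of a word.
    while (brk > 0 && !run[brk].wordStart) --brk;
    if (brk > 0 || !atLineStart) return brk; // 0 moves the run to the next line

    // Special case: nothing fits at the start of a line. Break at a cluster
    // boundary and keep at least one whole cluster on the line.
    brk = fit;
    while (brk > 0 && !run[brk].clusterStart) --brk;
    if (brk == 0) {
        brk = 1;
        while (brk < run.size() && !run[brk].clusterStart) ++brk;
    }
    return brk;
}

// Test helper (assumption): ASCII text, unit widths, words split at spaces,
// every character its own cluster.
std::vector<CharInfo> fromAscii(const std::string& s) {
    std::vector<CharInfo> run;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool space = (s[i] == ' ');
        const bool start = !space && (i == 0 || s[i - 1] == ' ');
        run.push_back({1, start, space, true});
    }
    return run;
}
```

      A return value of zero tells the caller to move the whole run to the next line, which can only happen when the run is not the first on its line.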
      Next, PaintFormattedTextLine is passed a single line of runs for display. Before the runs can be rendered, the correct display order must be established.
      BuildVisualDisplayOrder passes the Unicode embedding levels from the runs in the line to the ScriptLayout API. ScriptLayout can return both logical-to-visual and visual-to-logical mapping arrays. A logical-to-visual mapping array is indexed by a logical (stored) run offset, providing the appropriate visual position for each run. A visual-to-logical mapping is indexed by a visual run position, providing the index of the logical run that should be displayed at that position. BuildVisualDisplayOrder uses the logical-to-visual mapping to construct and return two arrays. pVisualOrder is indexed by a visual run index and provides a pointer to the logical run that should be displayed at that visual index. iPos is indexed by a visual run index and returns the offset to the first character of the logical run that should be displayed at that visual index.
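      The two mappings can be illustrated with a portable sketch of the reordering rule from the Unicode bidirectional algorithm: from the highest embedding level down to the lowest odd level, reverse every contiguous sequence of runs at that level or higher. This is not the ScriptLayout implementation; RunOrder and layoutRuns are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct RunOrder {
    std::vector<int> visualToLogical;  // indexed by visual position
    std::vector<int> logicalToVisual;  // indexed by logical run index
};

// levels[i] is the embedding level of logical run i
// (even levels are left-to-right, odd levels are right-to-left).
RunOrder layoutRuns(const std::vector<unsigned char>& levels) {
    const int n = static_cast<int>(levels.size());
    RunOrder order;
    order.visualToLogical.resize(n);
    order.logicalToVisual.resize(n);
    for (int i = 0; i < n; ++i) order.visualToLogical[i] = i;

    int highest = 0;
    int lowestOdd = 256;               // stays 256 when no run is right-to-left
    for (unsigned char l : levels) {
        highest = std::max(highest, static_cast<int>(l));
        if ((l & 1) && l < lowestOdd) lowestOdd = l;
    }

    std::vector<int>& v2l = order.visualToLogical;
    for (int level = highest; level >= lowestOdd; --level) {
        int i = 0;
        while (i < n) {
            if (levels[v2l[i]] < level) { ++i; continue; }
            // Find the contiguous sequence at this level or higher.
            int j = i;
            while (j < n && levels[v2l[j]] >= level) ++j;
            std::reverse(v2l.begin() + i, v2l.begin() + j);
            i = j;
        }
    }
    // The logical-to-visual map is the inverse permutation.
    for (int v = 0; v < n; ++v) order.logicalToVisual[v2l[v]] = v;
    return order;
}
```

      For the levels {0, 1, 1, 0}, such as English text surrounding two Arabic runs, the visual-to-logical map is {0, 2, 1, 3}: the two right-to-left runs swap places and the left-to-right runs stay put.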
      PaintFormattedTextLine now loops through the runs on the line in visual order, using pVisualOrder and iPos to pass the correct logical runs to PaintFormattedRun.
      PaintFormattedRun displays a single run in a single style. These are the steps for displaying the run.

Update the hdc as necessary for any change in style from the previously displayed run.
Call ShapePlaceRun to generate glyphs and positions.
Call ScriptTextOut to render the run to the hdc.
Call CaretHandling to process any caret display or mouse hit testing required in this run.
      In the sample application, styles are simply hFonts, so the style change is a simple SelectObject call.
Shaping Functions
      ShapePlaceRun encapsulates the calls to ScriptShape and ScriptPlace and implements simple font association. The call to ScriptShape requires the SCRIPT_ANALYSIS returned by ScriptItemize. If ScriptShape returns USP_E_SCRIPT_NOT_IN_FONT, it means the shaping engine was unable to generate glyphs for this script with the currently selected font. To handle this case, the sample app tries using the first style to shape the run. A real-world application might keep a list of standard fonts to try. By keeping such a list indexed by the script number in the itemization analysis, the application can avoid running through many alternative attempts.
      If this fallback strategy fails, the sample application restores the original style and changes the script field of the itemization analysis to SCRIPT_UNDEFINED (the only publicized script number). SCRIPT_UNDEFINED causes ScriptShape to bypass shaping and use the 1:1 codepoint to glyph mappings from the font CMAP table. Most likely this will display the missing glyph for each character in the run. (The missing glyph is usually represented as an empty rectangle.)
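      The control flow of this fallback strategy can be sketched independently of the Uniscribe calls. In the sketch below, tryShape is a hypothetical callback standing in for "select this font and call ScriptShape with this script"; ShapeResult, ShapeChoice, and shapeWithFallback are names invented for the illustration, and the list of fallback fonts follows the real-world suggestion above rather than the sample's single retry. The value 0 for SCRIPT_UNDEFINED matches its definition in the Uniscribe header.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical result type mirroring success and USP_E_SCRIPT_NOT_IN_FONT.
enum class ShapeResult { Ok, ScriptNotInFont };

const int kScriptUndefined = 0;  // SCRIPT_UNDEFINED

// Stand-in for selecting a font and calling ScriptShape for a script.
using ShapeFn = std::function<ShapeResult(int font, int script)>;

struct ShapeChoice {
    int font;    // font that should be used for the run
    int script;  // script number to pass to the shaping call
};

// Try the run's own font, then each fallback font in order. If none can
// shape the script, keep the original font and bypass shaping.
ShapeChoice shapeWithFallback(int runFont, int runScript,
                              const std::vector<int>& fallbackFonts,
                              const ShapeFn& tryShape) {
    assert(tryShape);
    if (tryShape(runFont, runScript) == ShapeResult::Ok) {
        return {runFont, runScript};
    }
    for (int font : fallbackFonts) {
        if (tryShape(font, runScript) == ShapeResult::Ok) {
            return {font, runScript};
        }
    }
    // Nothing worked: restore the original font and use the unshaped
    // codepoint-to-glyph mapping, which likely shows missing glyphs.
    return {runFont, kScriptUndefined};
}
```

      An application that indexes its fallback list by script number would pass only the fonts known to cover that script, which keeps the number of attempts small.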
      The glyphs are then passed to ScriptPlace for positioning. ScriptPlace returns an advance width and an x, y offset for each glyph. Usually, base characters have an advance width and no x, y offset, and combining characters have a zero advance width and an x, y offset to place them correctly over the preceding base glyph.
      Once ShapePlaceRun has generated glyphs and widths, the glyphs are rendered by a call to ScriptTextOut. ScriptTextOut is a slightly extended form of ExtTextOut(… ETO_GLYPH_INDEX …) that can handle the x, y combining character offsets.
      Finally, the run display process checks for any caret display or mouse hit testing activity required in this run. We do this here in the sample application because we don't keep width information hanging around. In your apps, you might save enough information to do hit testing and caret placement at least on the current line without requiring reprocessing of the paragraph.
      Microsoft will add more shaping engines to Uniscribe in the future. The exact codepoint ranges assigned to each shaping engine may vary, so with the exception of SCRIPT_UNDEFINED, script numbers are not published. Currently, codepoint range divisions include the following: complex text ranges such as Arabic, Hebrew, Thai, Hindi; complex script digit ranges; basic punctuation; ASCII; other Western text; and Far East CJK.
      Although the script numbers are not published, attributes of the scripts can be tested. There is a global script properties table that can be indexed by script number.
const SCRIPT_PROPERTIES **g_ppScriptProperties;
int g_iMaxScript;

ScriptGetProperties(&g_ppScriptProperties, &g_iMaxScript);
hResult = ScriptItemize( … , pItems, &cItems);
for (i=0; i<cItems; i++) {
    if (g_ppScriptProperties[pItems[i].a.eScript]->fComplex) {
        // Item [i] is complex script text
        // requiring glyph shaping
    }
}
Applications may use these properties to help combine their own layout rules with the required shaping engine divisions.
      All the complex script shaping engines, the digit shaping engines, and the punctuation and ASCII shaping engines validate the font in the hdc before shaping, and will return HRESULT USP_E_SCRIPT_NOT_IN_FONT if the font does not contain sufficient glyphs and/or shaping tables. Only scripts that have the property fComplex should be shaped with the script returned by ScriptItemize. All other runs may be merged and shaped with SCRIPT_UNDEFINED. If there are characters not supported by the font, SCRIPT_UNDEFINED will not fail with USP_E_SCRIPT_NOT_IN_FONT. Missing glyphs will usually be displayed as an empty rectangle. An application can determine if a codepoint is supported by a font by calling ScriptGetFontProperties to obtain the default glyph index, and ScriptGetCMap to look up font glyphs for Unicode codepoints.
The Unicode Bidirectional Algorithm
      The Unicode bidirectional algorithm resolves the layout of mixed-direction text in the absence of higher-level protocols. Here are some of the general assumptions it makes. Adjacent runs of words of opposite language direction are laid out according to the base level—left-to-right for an English paragraph, right-to-left for an Arabic paragraph. Numbers following left-to-right words should be displayed to the right of the words. Numbers following right-to-left words should be displayed to the left of the words. Punctuation between words of the same language direction should be displayed between those words. Punctuation between runs of words of opposite language direction appears between those runs. Punctuation at the beginning or end of a paragraph is laid out according to the paragraph direction and is not affected by the direction of adjacent text.
      The digits of numbers are laid out left-to-right in the number. Commas and periods are considered part of a number when immediately surrounded by digits. Other characters, such as currency signs, are considered part of a number when immediately adjacent to a digit. The algorithm makes a valiant and surprisingly successful stab at resolving what can be very ambiguous text. In applications such as databases and forms, it is often sufficient. In applications such as word processors, it is usually considered necessary to give the user more direct control over bidirectional text layout.
Experimenting with the Sample App
      You can use the sample application to experiment with reading order. The default text for the sample shows the line "123-52 is 71" twice, once in English and once with "is" translated to Arabic.
      In the second case, the number 71 following the Arabic translation of "is" has right-to-left layout because it follows Arabic text. Thus, it is displayed to the left of the translation of "is". Since the overall line direction is left-to-right, there is a conflict with the first part "123-52" which is assumed to be left-to-right since there is no preceding text.
      Now press the RTL button in the SCRIPT_STATE control group. Notice how the second example now looks better, and the first (English) numeric example suffers from the conflict instead.
      Is the right-to-left sample now correct? Should "One hundred and twenty three minus fifty two" appear as "123-52" or as "52-123"? It depends on the country. In Israel and Iran, sums like this are usually presented left-to-right, while for the rest of the bidirectional world they are presented right-to-left.
      Set the AraNum Context checkbox in the SCRIPT_STATE control group to change the display to "52-123." (The Arabic number context is normally set by preceding Arabic text; AraNum Context sets the initial value.)
Caret Placement and Hit Testing
      Complex script languages are broken into clusters by ScriptShape. Character reordering always occurs within cluster boundaries. The clusters themselves are guaranteed to advance monotonically in the reading order.
      Conventions for caret placement within clusters depend on the script. For the Arabic script, if the cursor position is set between a base character and its combining mark, then the caret is displayed halfway through the base character. For the Thai script, the cursor may not be positioned within a cluster. When the user advances the cursor, the application must advance over all characters that make up the cluster.
      In the sample app, caret placement and hit testing are performed in CaretHandling. The Uniscribe APIs ScriptXtoCP and ScriptCPtoX translate between cursor positions (in codepoint offsets) and x positions (in logical pixels). Both APIs require the attribute and position information returned by ScriptShape and ScriptPlace. In the sample app, pending caret displays and mouse clicks are saved in global variables and processed during line display. A real-world application might choose to cache this information for the current line.
      ScriptXtoCP returns a trailing edge flag so the caller knows which side of the character or cluster the user clicked. The value of the flag is either zero or the width of the character or cluster in codepoints. The returned CP is the position of the character on which the user clicked. Most editors place the cursor at the character edge nearest the click. To achieve this, add the flag value to the returned CP.
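      The arithmetic is easier to see in a small model. The sketch below is portable C, not a call to Uniscribe: hit_test_ltr is a hypothetical stand-in for the kind of answer ScriptXtoCP gives, restricted to a left-to-right run in which every character is its own cluster, and it clamps clicks outside the run to the first or last character, which is an assumption of this model only.

```c
#include <assert.h>
#include <stddef.h>

/* Model of hit testing for a left-to-right run with one glyph per
   character. advances[i] is the width of character i in pixels.
   On return, *cp is the character under x and *trailing is 1 when x
   falls in the trailing (right) half of that character, else 0. */
void hit_test_ltr(const int *advances, int count, int x,
                  int *cp, int *trailing)
{
    int left = 0;
    int i;

    assert(advances != NULL && cp != NULL && trailing != NULL && count > 0);

    for (i = 0; i < count; ++i) {
        int right = left + advances[i];
        if (x < right || i == count - 1) {
            *cp = i;
            *trailing = ((x - left) * 2 >= advances[i]) ? 1 : 0;
            return;
        }
        left = right;
    }
}

/* The caret goes to the edge nearest the click: the returned character
   position plus the trailing-edge value. */
int caret_from_hit(int cp, int trailing)
{
    return cp + trailing;
}
```

      With three characters that are each 10 pixels wide, a click at x = 7 lands on character 0 with the trailing flag set, so the caret goes to position 1. A click at x = 14 lands on the leading half of character 1, so the caret also goes to position 1.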
      For languages such as Thai, where the user conventionally does not place the cursor inside a cluster, ScriptXtoCP sets the trailing edge flag to zero or the cluster width. The application should also advance the cursor position in whole clusters for such languages. For languages such as Arabic, where the user expects to be able to edit within a cluster, ScriptXtoCP sets the trailing edge flag to zero or one. Uniscribe provides information on valid cursor positions in the fCharStop BOOL in the logical attributes returned by ScriptBreak: TRUE for most characters and FALSE for characters inside a cluster in scripts such as Thai. Check the fNeedsCaretInfo flag in the SCRIPT_PROPERTIES for an item to see whether it is necessary to call ScriptBreak to check for valid cursor positions. If fNeedsCaretInfo is FALSE, then all codepoints are valid cursor positions.
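      Moving the cursor by whole clusters then amounts to skipping positions that are not valid stops. The following portable C sketch assumes the application has already copied the fCharStop values for a run into a plain array; the array and the function names next_caret_stop and prev_caret_stop are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* char_stop[i] models fCharStop for character i: nonzero when the caret
   may be placed before character i. Returns the next valid caret
   position after pos, or len (end of run) when there is none. */
int next_caret_stop(const unsigned char *char_stop, int len, int pos)
{
    int next;

    assert(char_stop != NULL && len >= 0 && pos >= 0 && pos <= len);

    next = pos + 1;
    while (next < len && !char_stop[next])
        ++next;
    return next < len ? next : len;
}

/* Returns the previous valid caret position before pos, or 0 (start of
   run) when there is none. */
int prev_caret_stop(const unsigned char *char_stop, int len, int pos)
{
    int prev;

    assert(char_stop != NULL && len >= 0 && pos >= 0 && pos <= len);

    prev = pos - 1;
    while (prev > 0 && !char_stop[prev])
        --prev;
    return prev > 0 ? prev : 0;
}
```

      For a six-character run whose stops are 1, 0, 0, 1, 1, 0, pressing the right arrow at position 0 moves the caret to position 3, skipping the two characters that belong to the first cluster.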
Digit Shape Selection
      Unicode provides separate digit codepoints for each script that has its own digits. For historical reasons, the conventional names for some of these digit styles are confusing. The Arabic numerals used in America and Europe aren't used in the Arab world. The Arabic-Indic digits used in the Arab world aren't used in Indic countries. And Farsi and Urdu use Eastern Arabic-Indic digits, which, just to keep things confusing, aren't used in the nations that use Arabic or Indic numerals. Other complex scripts have their own digit shapes, including Thai, Tibetan, and all nine Indian scripts.
      Although Unicode provides separate codepoints for alternate digit shapes, very little software will recognize them if entered into a numeric form field, and most software will produce ASCII digits (U+0030 through U+0039) when converting from internal (binary) representation to character codes.
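      To make the codepoint layout concrete, here is a small portable C sketch that maps between numeric values and a few of the Unicode digit blocks. The block start values are the Unicode codepoints for the digit zero in each script; the enum and function names are hypothetical. Note that this only illustrates the codepoints themselves; it is not how Uniscribe selects digit shapes at display time.

```c
#include <assert.h>
#include <stddef.h>

/* Codepoint of the digit zero in several Unicode digit blocks. */
enum {
    DIGITS_ASCII                = 0x0030,
    DIGITS_ARABIC_INDIC         = 0x0660,
    DIGITS_EASTERN_ARABIC_INDIC = 0x06F0,
    DIGITS_DEVANAGARI           = 0x0966,
    DIGITS_THAI                 = 0x0E50
};

/* Returns the codepoint for value (0 through 9) in the digit block that
   starts at zero_codepoint. */
unsigned digit_codepoint(unsigned zero_codepoint, unsigned value)
{
    assert(value <= 9);
    return zero_codepoint + value;
}

/* Returns the numeric value 0 through 9 when cp is a digit in one of the
   blocks listed above, or -1 otherwise. A numeric form field could use a
   test like this to accept digits typed in any of these scripts. */
int digit_value(unsigned cp)
{
    static const unsigned zeros[] = {
        DIGITS_ASCII, DIGITS_ARABIC_INDIC, DIGITS_EASTERN_ARABIC_INDIC,
        DIGITS_DEVANAGARI, DIGITS_THAI
    };
    size_t i;

    for (i = 0; i < sizeof zeros / sizeof zeros[0]; ++i) {
        if (cp >= zeros[i] && cp <= zeros[i] + 9)
            return (int)(cp - zeros[i]);
    }
    return -1;
}
```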
      The fDigitSubstitute and fContextDigits flags in SCRIPT_STATE and the uDefaultLanguage field in SCRIPT_CONTROL determine how ScriptItemize will classify ASCII digits. To cause U+0030 through U+0039 to display in an alternate digit script, set fDigitSubstitute to TRUE and uDefaultLanguage to the language with which the digits are associated. You can also set fContextDigits to have digits displayed in the language of preceding letters in the same itemization.
Caching
      Uniscribe saves Unicode to glyph mappings (CMAP), glyph widths, and OpenType script shaping tables. A handle to the tables for a particular font of a particular size is called a SCRIPT_CACHE. Uniscribe functions look first for information through the SCRIPT_CACHE, using the hdc only when required tables are not already cached. When calling ScriptShape, ScriptPlace, and ScriptTextOut, you must provide a pointer to a SCRIPT_CACHE variable, which you must initially set to NULL.
      For ScriptShape and ScriptPlace it is valid to pass the hdc as NULL. Most often the call will be successful as required tables will already be cached. If the shaping or placement requires access to an hdc, ScriptShape or ScriptPlace will return immediately with the HRESULT E_PENDING. This allows the client to avoid most SelectObject calls.
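      The control flow this enables is "try without a device context first, and select the font only when asked." The sketch below models that flow in portable C so it can be run anywhere. Everything in it is a stand-in: ModelCache plays the role of a SCRIPT_CACHE, model_shape plays the role of ScriptShape or ScriptPlace, and MODEL_PENDING plays the role of E_PENDING.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the HRESULT values S_OK and E_PENDING. */
enum { MODEL_OK = 0, MODEL_PENDING = 1 };

/* Hypothetical stand-in for a SCRIPT_CACHE: it only remembers whether
   the font tables have been loaded. Start with tables_loaded == 0. */
typedef struct {
    int tables_loaded;
} ModelCache;

/* Stand-in for a shaping call. It needs a device context only while the
   cache is still empty, and reports MODEL_PENDING when it has none. */
int model_shape(const void *dc, ModelCache *cache)
{
    assert(cache != NULL);

    if (!cache->tables_loaded) {
        if (dc == NULL)
            return MODEL_PENDING;
        cache->tables_loaded = 1;   /* tables read through the dc */
    }
    return MODEL_OK;
}

/* Calls the shaping stand-in without a dc first, and retries with the dc
   only when the first attempt reports MODEL_PENDING. *select_calls counts
   how often the font would have been selected into the dc. */
int shape_with_lazy_dc(const void *dc, ModelCache *cache, int *select_calls)
{
    int hr;

    assert(select_calls != NULL);

    hr = model_shape(NULL, cache);
    if (hr == MODEL_PENDING) {
        ++*select_calls;            /* a real caller would select the font here */
        hr = model_shape(dc, cache);
    }
    return hr;
}
```

      In this model the first call for a given cache costs one font selection, and every later call succeeds without one.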
Symbol and Device Fonts
      Symbolic fonts can be recognized by calling GetTextMetrics and checking for a tmCharSet value of SYMBOL_CHARSET or OEM_CHARSET. Such fonts do not necessarily conform to Unicode conventions. Although Uniscribe will process such fonts, it probably makes no sense to itemize them. Instead, consider a run formatted with such a font as a single item with eScript SCRIPT_UNDEFINED.
      Printer device fonts are not processed by Uniscribe. If you call ScriptShape, ScriptPlace, and ScriptTextOut, the strings will be sent to ExtTextOut without any manipulation. You cannot use ScriptGetCMap on a printer device font.
Supporting Multilingual and Complex Scripts
      In the preceding sections we've discussed four ways to enable your application to support multilingual content in documents: standard Win32 API functions, edit controls, RichEdit controls, and Uniscribe. Figure 7 explains which platforms support complex scripts through which interfaces.
      Until now, the discussion has assumed that you have Unicode strings that you pass to the Uniscribe entry points. What should you do if your application needs to run on Windows 95, Windows 98, and Windows NT, given that Windows 98 allegedly doesn't support Unicode? There are a couple of strategies that work.
      Before we get into these strategies, let's briefly review the A and W entry points in the Win32 API. In essence, all entry points used in a normal Win32-based application, such as RegisterClass and CreateWindowEx, are actually symbols in the Windows header files, defined along the following lines:
#ifdef UNICODE
#define MessageBox  MessageBoxW
#else
#define MessageBox  MessageBoxA
#endif // !UNICODE
In this example, the application actually calls MessageBoxW if you compile your source code with the -DUNICODE switch; otherwise it calls MessageBoxA. Text strings passed to or from these entry points all have the LPCTSTR or LPTSTR data types, which are defined as pointers to WCHAR (an unsigned short) if the symbol UNICODE is defined, and as pointers to char otherwise. Unicode applications (those that call the W interfaces) get all characters and strings from the system as Unicode, whereas ANSI applications (users of the A routines) get text encoded in the ACP. It's important to keep in mind that this applies not only to the arguments of the Win32 entry points, but to all text passed to or from the application. For example, window messages such as WM_CHAR, WM_GETTEXT, and WM_SETTEXT that pass text in the wParam or lParam parameters also use Unicode or ANSI, depending on the type of application.
      With this in mind, how can you encode text in Unicode and run on all Win32-based platforms?
Strategy 1 Always run as a pure Unicode application. Compile the application with the -DUNICODE switch so that you use only the W entry points. All text passed to and from the application is in Unicode. This is the easiest to program by far. It also supports all Indic scripts and all new scripts added to Windows NT in the future. However, the application will not run on Windows 95 or Windows 98. This is the best approach if your application is targeted for Windows NT only, as is the case for many in-house or vertical applications.
Strategy 2 Create two binaries, one for Windows NT using Unicode and one for Windows 95 and Windows 98 using ANSI. Use LPTSTR for pointers to string buffers, TCHAR for characters, and so on. Use -DUNICODE to compile the Windows NT version only. This strategy is easy to program in its simplest form, and covers Windows 95, Windows 98, and Windows NT. Unfortunately, the ANSI version is basically restricted to the Win32 API standard calls, and localization, distribution, and maintenance of two binaries are difficult. This strategy is recommended only for simple or special in-house applications.
Strategy 3 Always run as an ANSI application, but use Unicode internally. This is the strategy used by the CSSamp sample code. The source code is compiled as an ANSI application, which receives text from the keyboard through the WM_CHAR or WM_IME_CHAR messages in the codepage of the current input locale. In general, this is not the same as the ACP. The application converts text to Unicode using the codepage of the current input locale. If the input locale changes in the middle of a string, the application will have to concatenate strings converted from different codepages. With this strategy, the same binary runs on Windows NT, Windows 95, and Windows 98, and it supports nearly all of the scripts in Unicode.
      On the other hand, it's somewhat more difficult to program than the pure Unicode approach. Also, it does not support scripts without an ACP, such as Indic scripts, even when running on Windows NT. This is a sound approach if your application must run on Windows 95 and Windows 98 and does not need to support Indic scripts and others without an ACP.
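      The heart of Strategy 3 is converting each incoming character with the codepage of the input locale that was active when it was typed, and appending the result to one Unicode buffer. The portable C sketch below models that step with a toy converter. It is not a replacement for the system conversion routines: byte_to_utf16 and append_typed_char are hypothetical names, and the converter knows only ASCII, the upper range of codepage 1252 (0xA0 through 0xFF), and the Hebrew letters of codepage 1255 (0xE0 through 0xFA).

```c
#include <assert.h>
#include <stddef.h>

enum { MODEL_CP_1252 = 1252, MODEL_CP_1255 = 1255 };

/* Toy single-byte converter. Returns U+FFFD for anything outside the
   small ranges this model knows about. */
unsigned short byte_to_utf16(unsigned char b, int codepage)
{
    if (b < 0x80)
        return b;                                   /* ASCII            */
    if (codepage == MODEL_CP_1252 && b >= 0xA0)
        return b;                                   /* same as Latin-1  */
    if (codepage == MODEL_CP_1255 && b >= 0xE0 && b <= 0xFA)
        return (unsigned short)(0x05D0 + (b - 0xE0)); /* alef..tav      */
    return 0xFFFD;
}

/* Appends one typed character to a UTF-16 buffer, converting it with the
   codepage of the input locale in effect when it arrived. Returns the
   new length; the buffer is left unchanged when it is already full. */
size_t append_typed_char(unsigned short *buf, size_t len, size_t cap,
                         unsigned char b, int input_codepage)
{
    assert(buf != NULL && len <= cap);

    if (len < cap)
        buf[len++] = byte_to_utf16(b, input_codepage);
    return len;
}
```

      Because the codepage is supplied per character, a user who types "a", then "é" under a Western European layout, then switches to a Hebrew layout and types alef still ends up with one well-formed Unicode string.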
Strategy 4 Detect the system and explicitly call the W APIs for Windows NT and the A routines for Windows 95 and Windows 98. The application registers itself as a Unicode application on Windows NT and as an ANSI application on Windows 95 and Windows 98.
      The easiest way to implement this approach is to write a set of functions, say U routines, that parallel the Win32 W and A routines. Your application first calls GetVersionEx to detect the system, and stores that information into a global variable:
BOOL g_IsWindowsNT;
Each U interface looks just like the corresponding W interface. For example, the prototype for CreateWindowExU would be:
WINUSERAPI HWND WINAPI CreateWindowExU(
    DWORD dwExStyle,
    LPCWSTR lpClassName,
    LPCWSTR lpWindowName,
    DWORD dwStyle,
    int X,
    int Y,
    int nWidth,
    int nHeight,
    HWND hWndParent,
    HMENU hMenu,
    HINSTANCE hInstance,
    LPVOID lpParam);
You can implement CreateWindowExU as a function pointer. When the app is launched, your initialization code checks to see if g_IsWindowsNT is TRUE, and if so, sets CreateWindowExU equal to CreateWindowExW. Otherwise (that is, when running on Windows 95 or Windows 98), CreateWindowExU is set to a routine you write yourself, say CreateWindowExAU. This routine converts lpClassName and lpWindowName to the ACP using WideCharToMultiByte, and passes those parameters along with everything else to CreateWindowExA.
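      The function-pointer arrangement can be modeled in portable C. In the sketch below, SetTitleW_model and SetTitleA_model are hypothetical stand-ins for a W and A entry-point pair, SetTitleAU_model is the hand-written wrapper, and the standard C function wcstombs stands in for the wide-to-ACP conversion. None of these names come from the Win32 API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Records which stand-in entry point ran last, for demonstration only. */
const char *g_last_entry_point = "";

/* Stand-in for a W entry point: takes Unicode text directly. */
int SetTitleW_model(const wchar_t *title)
{
    assert(title != NULL);
    g_last_entry_point = "W";
    return (int)wcslen(title);
}

/* Stand-in for an A entry point: takes text in the ANSI codepage. */
int SetTitleA_model(const char *title)
{
    assert(title != NULL);
    g_last_entry_point = "A";
    return (int)strlen(title);
}

/* Wrapper with the W signature that converts and forwards to the A
   stand-in. Returns -1 when the text cannot be converted. */
int SetTitleAU_model(const wchar_t *title)
{
    char narrow[128];
    size_t n;

    assert(title != NULL);
    n = wcstombs(narrow, title, sizeof narrow - 1);
    if (n == (size_t)-1)
        return -1;
    narrow[n] = '\0';
    return SetTitleA_model(narrow);
}

/* The U routine is a function pointer chosen once at startup. */
int (*SetTitleU)(const wchar_t *title);

void init_u_routines(int is_windows_nt)
{
    SetTitleU = is_windows_nt ? SetTitleW_model : SetTitleAU_model;
}
```

      After init_u_routines runs, the rest of the application calls SetTitleU with Unicode text and never needs to know which path was taken. The cost of the design is one wrapper per entry point the application uses.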
      This approach also requires special handlers for messages such as WM_CHAR and WM_GETTEXT to convert the text passed in the wParam or lParam parameters to or from Unicode when g_IsWindowsNT is false. In the case of WM_CHAR and WM_IME_CHAR when running on Windows 95 and Windows 98, the application will also have to build up the Unicode string from multiple conversions via MultiByteToWideChar if the user switches input locales while typing in text.
      This strategy runs on all platforms with the same binary files. It supports all scripts when running on Windows NT, including Indic, and allows use of Uniscribe on all platforms. The only disadvantage to this approach is that it requires considerable development investment. This strategy is your best choice for any application that needs to run on Windows 95 and Windows 98 and needs universal support for all scripts on Windows NT.
Summary
      It's time to make your application multilingual! Don't overlook the new markets into which you can now more easily localize your application. The Uniscribe and RichEdit libraries enable you to rely on consistent and standardized layout of complex scripts—and, of course, typical scripts as well. Applications performing advanced typographic layout may complement their own proprietary layout engines with features available from these libraries.
See the Glossary of terms for Supporting Multilanguage Text Layout and Complex Scripts with Windows NT 5.0.

From the November 1998 issue of Microsoft Systems Journal.