Unicode RTF

Word 97 is a partially Unicode-enabled application. Text is handled using the 16-bit Unicode character encoding scheme. Expressing this text in RTF requires a new mechanism, because until this release (version 1.5), RTF has only handled 7-bit characters directly and 8-bit characters encoded as hexadecimal. The Unicode mechanism described here can be applied to any RTF destination or body text.

Control word	Meaning
\ansicpgN	This keyword represents the ANSI code page which is used to perform the Unicode to ANSI conversion when writing RTF text. N represents the code page in decimal. This is typically set to the default ANSI code page of the run-time environment (for example \ansicpg1252 for U.S. Windows). The reader can use the same ANSI code page to convert ANSI text back to Unicode. This keyword should be emitted in the RTF header section right after the \ansi, \mac, \pc or \pca keyword.
\upr	This keyword represents a destination with two embedded destinations, one represented using Unicode and the other using ANSI. This keyword operates in conjunction with the \ud keyword to provide backward compatibility. The general syntax is as follows: `{\upr{keyword ansi_text}{\\ud{keyword Unicode_text}}}` Notice that this keyword-destination does not use the \ keyword; this forces the old RTF readers to pick up the ANSI representation and discard the Unicode one.
\ud	This is a destination which is represented in Unicode. The text is represented using a mixture of ANSI translation and use of \uN keywords to represent characters which do not have the exact ANSI equivalent.
\uN	This keyword represents a single Unicode character which has no equivalent ANSI representation based on the current ANSI code page. N represents the Unicode character value expressed as a decimal number. This keyword is followed immediately by equivalent character(s) in ANSI representation. In this way, old readers will ignore the \uN keyword and pick up the ANSI representation properly. When this keyword is encountered, the reader should ignore the next N characters, where N corresponds to the last \ucN value encountered. As with all RTF keywords, a keyword-terminating space may be present (before the ANSI characters) which is not counted in the characters to skip. While this is not likely to occur (or recommended), a \bin keyword, its argument, and the binary data that follows are considered one character for skipping purposes. If an RTF scope delimiter character (that is, an opening or closing brace) is encountered while scanning skippable data, the skippable data is considered to be ended before the delimiter. This makes it possible for a reader to perform some rudimentary error recovery. To include an RTF delimiter in skippable data, it must be represented using the appropriate control symbol (that is, escaped with a backslash,) as in plain text. Any RTF control word or symbol is considered a single character for the purposes of counting skippable characters. An RTF writer, when it encounters a Unicode character with no corresponding ANSI character, should output \uN followed by the best ANSI representation it can manage. Also, if the Unicode character translates into an ANSI character stream with count of bytes differing from the current Unicode Character Byte Count, it should emit the \ucN keyword prior to the \uN keyword to notify the reader of the change. RTF control words generally accept signed 16-bit numbers as arguments. For this reason, Unicode values greater than 32767 must be expressed as negative numbers.
\ucN	This keyword represents the number of bytes corresponding to a given \uN Unicode character. This keyword may be used at any time, and values are scoped like character properties. That is, a \ucN keyword applies only to text following the keyword, and within the same (or deeper) nested braces. On exiting the group, the previous \uc value is restored. The reader must keep a stack of counts seen and use the most recent one to skip the appropriate number of characters when it encounters a \uN keyword. When leaving an RTF group which specified a \uc value, the reader must revert to the previous value. A default of 1 should be assumed if no \uc keyword has been seen in the current or outer scopes. A common practice is to emit no ANSI representation for Unicode characters within a Unicode destination context (that is, inside a \ud destination.). Typically, the destination will contain a \uc0 control sequence. There is no need to reset the count on leaving the \ud destination as the scoping rules will ensure the previous value is restored.

Document Text

Document text should be emitted as ANSI characters. If there are Unicode characters that do not have corresponding ANSI characters, they should be output using the \ucN and \uN keywords.

For example, the text LabΓValue (Unicode characters 0x004c, 0x0061, 0x0062, 0x0393, 0x0056, 0x0061, 0x006c, 0x0075, 0x0065) should be represented as follows (assuming a previous \ucl):

Lab\u915Gvalue

Destination Text

Destination text is defined as any text represented in an RTF destination. A good example is the bookmark name in the \bkmkstart destination.

Any destination containing Unicode characters should be emitted as two destinations within a \upr destination to ensure that old readers can read it properly and that no Unicode character encoding is lost when read with a new reader.

For example, a bookmark name LabΓValue (Unicode characters 0x004c, 0x0061, 0x0062, 0x0393, 0x0056, 0x0061, 0x006c, 0x0075, 0x0065) should be represented as follows:

{\upr{\*\bkmkstart LabGValue}{\*\ud{\*\bkmkstart Lab\u915 Value}}}

The first sub-destination contains only ANSI characters and is the representation that old readers will see. The second sub-destination is a \*\ud destination which contains a second copy of the \bkmkstart destination. This copy can contain Unicode characters and is the representation that Unicode-aware readers must pay attention to, ignoring the ANSI-only version.