Overview of Concepts

Understanding some basic terms and concepts is useful prior to a detailed discussion of how to build international applications using Microsoft SQL Server. This section discusses the basics of code pages, sort orders, and language/locale support.

Code Pages

A code page, or character set, defines a collection of characters, numbers, punctuation, symbols, and special characters for a particular language. Using a process called encoding, each character in the code page is assigned a numeric value called a code point. Computer hardware, software, and operating systems can then exchange information for a particular language or for a collection of languages using these code points.

From a SQL Server perspective, as long as the software running on the client computer is using the same code page as SQL Server, all is well. However, if different code pages are being used on the client and the server, there is a chance that data can be corrupted.

The various character encoding schemes that exist for different computing environments can be confusing. Microsoft SQL Server is concerned primarily with PC computing environments; therefore, it supports code pages that are commonly used in localized editions of Microsoft Windows, as well as a subset of code pages used in MS-DOS®. There are many code pages supported by other computing environments (for example Macintosh, IBM EBCDIC, and so on) that are not supported by Microsoft SQL Server. Character-set conversion technology must be used to facilitate interoperability between SQL Server and these other computing environments.

Characters 32 through 127 (hex 0x20 through 0x7F) in all code pages supported by MS-DOS, Windows, and SQL Server are defined by the American Standard Code for Interchange of Information (ASCII). This seven-bit standard for character encoding accommodates all of the standard characters in the English alphabet. ASCII characters were the only guaranteed way of exchanging electronic information between all major languages and code pages until the adoption of a more modern character encoding scheme called Unicode, which is discussed later.

A single byte can be used to represent up to 256 different code points in a code page. Many languages (for example English, French, and German) require only a single byte of storage to represent their alphabets, punctuation, and special characters. The code pages for these languages are often referred to as single-byte code pages. The majority of the character sets provided with SQL Server 6.5 are single-byte.

Complex Asian languages such as Chinese, Korean, or Japanese have many more characters than Western European languages. For example, Chinese has more than 10,000 characters. New code pages had to be developed for languages that use multiple bytes of storage to represent all possible code points. A double-byte character set (DBCS) or multiple-byte character set can be used to encode 216, or more than 65,000, different characters.

The remaining characters in a code page above code point 128 (hex 0x80) are sometimes referred to as extended characters. Extended characters differ from code page to code page depending on the requirements of the language.

Microsoft has assigned a unique SQL Server-specific identifier for each code page. The sp_helpsort stored procedure can be used to determine which code page is currently installed. There are a few classes of code pages that are common in PC computing environments. They are called OEM code pages, ANSI code pages, and DBCS code pages. Unicode is a new code page that is an emerging standard for international software and operating systems.

OEM Code Pages

The term OEM code page refers to code pages used in MS-DOS that have built-in hardware support to allow rapid display of characters on the computer console. The first 128 characters in OEM code pages are defined by the ASCII standard. The extended characters in OEM code pages are usually special graphics characters used for line drawing and graphics. Windows 95 and Windows NT support a separate installable OEM code page for applications that were written in MS-DOS or applications that use the Win32 console APIs.

Microsoft SQL Server 6.5 provides support for two OEM code pages (437 and 850) to facilitate backward compatibility with MS-DOS applications written for the earlier U.S. versions of SQL Server. It is important to note that MS-DOS, Windows 95, and Windows NT provide support for many other OEM code pages besides these two.

ANSI Code Pages

Because Microsoft Windows features a rich graphical user interface and installable software-based fonts, Windows developers no longer have the need to use the extended characters in OEM code pages to perform line drawing and graphics. The ASCII character encoding scheme was not rich enough to support many different languages, so Microsoft introduced new code pages for Windows 3.1, which still supported ASCII for the first 128 characters, but replaced the old OEM code page extended graphics characters with much-needed characters and symbols. This made it easier to build international applications with Windows.

ANSI code page is a generic term used in Windows to describe a collection of code pages used for character encoding in graphical Win32 applications and fonts. The original Windows ANSI code page was developed for Microsoft Windows 3.1 for use in the United States and Western Europe. This code page has many names, including code page 1252, Latin 1, and Windows ANSI. It is similar to an ISO standard code page called ISO 8859-1.

Yet, code page 1252 could not represent all the characters required to create localized editions of Microsoft Windows for all of the countries that were demanding them. So the code page proliferation continued and several variants of code page 1252 were created to handle localized versions of Windows for Eastern Europe, the Mediterranean, and the Middle East for a total of eight Windows ANSI code pages (code pages 1250-1257). These code pages are referred to collectively as ANSI code pages in Win32 API documentation. All Windows 3.1 and Windows 95 fonts are defined using an ANSI code page. Every copy of SQL Server, regardless of the country in which it is sold, includes support for all eight of these code pages.

The introduction of the ANSI code pages made it easy to produce localized Windows applications for North American, South American, European, and Middle Eastern regions, but a single-byte character encoding scheme was not sufficient to produce a localized version of Windows for Asian languages such as Chinese, Japanese, and Korean. A new set of Windows ANSI code pages called double-byte character sets (DBCS) were introduced in Asian editions of Microsoft Windows to accommodate these more complex languages. Each DBCS character set was developed individually by examining its compatibility with OEM code pages, ANSI code pages, and existing character encoding standards common in Asian computing. For example, the Japan Industrial Standard for multi-byte encoding called Shift-JIS was used to develop the Japanese DBCS code page 932. DBCS character sets are therefore a hybrid of ASCII, OEM, ANSI, and Asian code pages.

DBCS code pages are actually a mixture of single-byte and double-byte characters. For example, in the Japanese DBCS code-page (932) the first 256 characters are occupied by ASCII characters and some very commonly used Japanese characters called katakana. In DBCS parlance these are called half-width characters because they occupy only a single byte of storage.

Most of the remaining extended characters require two bytes of storage and are called full-width characters. The first byte of a full-width character is called the lead byte and the second is called the trail byte. It is up to the programmer to DBCS-enable an application by using proper string handling and input routines to determine whether characters are half-width or full-width.

Unicode

Today, many overlapping standards exist for character encoding, and there is a proliferation of code pages that international software developers have to deal with in their code, making the process of building a single international code base extremely difficult. Unicode is an industry-standard character encoding scheme designed to solve some of the most complex problems that occur in developing international software.

The Unicode standard was produced by an industry group called the Unicode Consortium and has been adopted as the standard character encoding specification by the International Standards Organization (ISO 10646). The Unicode standard incorporates support for the existing ASCII and Windows ANSI code pages, translation from other legacy DBCS code pages, and support for all the characters and scripts in common business use in the world today. Unicode also provides room for expansion and user- or OEM-specific extensions. Unicode is the strategic direction for character encoding in Windows and will be supported in a future release of Microsoft SQL Server.

Unicode uses a fixed-width 16-bit character encoding scheme to represent all characters. It does not mix half-width and full-width characters as DBCS code pages do. Additionally, Unicode solves the problem of proliferation of new code pages and allows mixing and sharing data between multiple languages without corruption. The standard is rich and full featured and is not covered in detail in this paper.

The section "For More Information" includes a reference for a book from Microsoft Press titled Developing International Applications for Windows 95 and Windows NT. This work contains excellent recommendations for coding practices to ease the transition from using ANSI code pages to Unicode in Windows applications written in C or C++. If you are writing a new application or re-engineering an existing application that targets international markets, it would be wise to review and incorporate these coding techniques.

Sort Orders

After you select a code page for your SQL Server installation, you must also choose a sort order for that code page. Each code page offers anywhere from 2 to 17 sort order choices. A sort order is a set of rules for a particular code page that determines how SQL Server matches, collates, and presents data in response to database queries. SQL Server uses these rules to determine the sequence in which data is presented in response to SELECT statements involving GROUP BY, ORDER BY, and DISTINCT. The sort order also determines how certain queries are resolved, such as those involving WHERE and DISTINCT. Sort orders affect the physical storage of SQL Server indexes.

Microsoft assigns a unique SQL Server-specific identifier to each sort order supported in version 6.5. Sort order rules are stored using the identifier in the syscharsets system table. You can determine the current sort order by running the sp_helpsort stored procedure. SQL Server uses a special text file to define a sort order during installation. These files are stored in the charsets directory in the SQL Server root directory. It may be useful to examine these files to determine how a particular sort order operates.

If you intend to support multiple languages from the same SQL Server installation, it is important to choose a sort order that is acceptable to all users. Not all code pages offer the same sort order options. For a comprehensive list of code pages and sort orders, see
Appendix B.

Languages and Locales

Developers building Windows applications for the international marketplace must be sensitive to cultural conventions used in different languages and regions. This includes ensuring that numbers, currency, dates, and times are formatted correctly when displayed for the user, as well as properly sorting data when it is presented in a list. Using a combination of international features provided by Win32 and SQL Server, developers can build international applications that meet these demanding requirements.

When developing an application for an international audience, code page and sort order support are key SQL Server features. In addition, SQL Server has international features designed prior to the releases of Windows 95 and Windows NT that are still supported for compatibility reasons. These features are less critical from an application development perspective but may still be useful. This section defines these features and how they relate to Win32 capabilities available to application developers.

The Win32 NLSAPI

The Win32 API provides support for the National Language Support API (NLSAPI). The NLSAPI allows applications to make simple API calls to determine how to format certain data (for example, dates, times, numbers, and currency) using culturally correct conventions for a particular language or region. It also includes functions that allow applications to sort strings using a culturally correct sort order.

These conventions are grouped together in a system-wide setting referred to as a locale. Each locale supported by the NLSAPI is assigned a unique Locale ID (LCID) by Microsoft. You can set the current locale in Control Panel using International in Windows NT 3.51 and Regional Settings in Windows NT 4.0 and Windows 95. The NLSAPI is useful when developing international applications but is beyond the scope of this paper. For further information, see Developing International Applications for Windows 95 and Windows NT.

SQL Server was developed before the NLSAPI and uses it only to define the dictionary sort order for DBCS character sets. In most cases it is the responsibility of the client application to properly format dates, times, numbers, and currency data retrieved from SQL Server using the NLSAPI. There are, however, some useful operations that can be performed by SQL Server. These operations are discussed in detail in the sections that follow.

SQL Server langids

SQL Server has its own concept of a language that is separate and distinct from the Windows concept of a locale. To avoid confusion, this paper refers to a SQL Server-specific language setting as a langid rather than as a language or a locale. A langid enables SQL Server to determine which localized version of a system message should be returned to a user and how SQL Server performs certain date and time operations.

SQL Server langid settings are of minor interest to international developers. Not every targeted Windows locale(s) has a corresponding langid. The most important component is a compatible code page and sort order. For a comprehensive list of the SQL Server langids supported by each code page, see Appendix B.

SQL Server langids and related settings are stored in the syslanguages table and can be queried using the sp_helplanguage stored procedure. The only exception is langid 0 (us_english), which is hard-coded and therefore not stored in syslanguages. SQL Server langids and related attributes are specific to SQL Server and have no relationship to other standard identifiers or naming schemes, such as those defined by the NLSAPI or Unicode. The following table describes syslanguages in more detail.

Column in syslanguages Description
LANGID Unique identifier for a language assigned by Microsoft. us_english is hard-coded as langid 0. French is 1, German is 2, and Japanese is 3.
NAME Name of the langid accepted by the SET LANGUAGE statement.
ALIAS Alternate name of the langid accepted by the SET LANGUAGE statement.

Primary and Alternate langids

Not all SQL Server langids are created equal. Primary langids were created for localized editions of SQL Server (English, French, German, and Japanese) and include a set of localized system messages. Alternate langids are available for other languages and do not include a set of localized system messages. This terminology (primary and alternate) is used only in this paper and is not used in the SQL Server documentation. Appendix B lists each of the primary and alternate langids available by code page.

The Default langid

User connections that do not specifically request a particular langid inherit the settings used for the default langid. The default langid is set at installation and can be changed later. For information on how to change the default langid, see "Installing and Configuring langids." Only primary langids are used as defaults to ensure that system messages are available for connections using the default langid.

Supporting Multiple langids

In some installations a developer may want to support multiple langids on a per-user basis depending upon the language required by the user. SQL Server can be configured to support additional primary or alternate langids other than the default. For more information about how to install and use multiple langids, see the section on installation and configuration.

Specifying a non-default langid does not affect the way that data is sorted because a single sort order is used by SQL Server system-wide. There are three ways multiple langids can be enabled for user connections:

Localized System Messages

SQL Server must periodically send system status and error messages to connected clients. System messages are stored in the sysmessages system table, and each message is associated with a SQL Server langid, allows localized system messages for different languages. The langid for English messages is set to NULL.

Currently there are four sets of localized system messages available for English, French, German, and Japanese editions of SQL Server. English versions of all system messages are available on all localized editions of SQL Server. If a connection uses a langid other than us_english (langid 0), SQL Server attempts to use a localized version of the system message if it is available. If a localized system message is not available, the English version is returned.

This capability is extensible in that developers can add their own user-defined system messages using the sp_addmessage system stored procedure, then raise them from within Transact-SQL batches, stored procedures, or triggers by using the Transact-SQL RAISERROR statement. Developers should always add the English version of a message first, then localized versions of the message in each desired language.

System messages support the ability to format message parameters as strings. The formatting syntax is similar to the syntax used by the C language printf function. Each message parameter in a us_english message is implicitly assigned a parameter number from left to right. The localized counterpart(s) of the us_english system message then use this parameter number to retrieve the appropriate parameter value and format from the us_english message.

Consider the following example. An error message is returned when a user attempts to create an index with the same name as an existing index. The English version of the message embeds two string parameters for the name of the table and the name of the index. For the French version of this system message to be culturally correct, the name of the index should be stated first:

// us_english message
There is already an index on table '%.*s' named '%.*s'
// Localized message
n index appelé '%2' existe déjà sur la table '%1' 

English messages are required in all releases of SQL Server and are installed by default regardless of which localized edition of SQL Server is being installed. Localized versions of system messages for the French, German, and Japanese editions of SQL Server are located in a server localization file called Server.loc. They are included on the distribution CD-ROM for each edition and loaded into sysmessages by a special program called Langinst.exe, which is run by the SQL Server Setup program.

Date and Time Operations

The SQL Server datetime and smalldatetime data types use either a four- or eight-byte binary value to represent a particular moment in time on the Gregorian calendar. SQL Server can automatically convert a wide variety of string constants to datetime values subject to certain rules. For more information, see the Microsoft SQL ServerTransact-SQL Reference.

Once date and time data is represented in binary format, it is a simple matter to perform complex date and time calculations. Additionally, it becomes easier to convert dates and times to strings while applying different formats using Transact-SQL functions. There are a variety of SQL Server-defined locale settings (unrelated to NLSAPI locale settings) stored with each langid in syslanguages that affect operations involving datetime data. This table describes the settings.

Column in syslanguages
Description
DATEFORMAT Sets the order of the month, day, and year for entering datetime or smalldatetime. The default for the English edition is mdy, the German and French editions use dmy, and the Japanese edition uses ymd. The current dateformat for a particular user connection can be changed at runtime using the SET DATEFORMAT statement.
DATEFIRST Defines the first day of the week as a number from 1 through 7. The default for the English and Japanese editions is 7 (Sunday), and the German and French editions use 1 (Monday). The current datefirst for a particular user connection can be changed at runtime using the SET DATEFIRST statement.
MONTHS List of month names for a langid. SQL Server uses these names to properly convert strings into datetime values and vice-versa for a particular langid.
SHORTMONTHS List of abbreviated month names for the langid. SQL Server uses these names to convert strings into datetime values and vice-versa for a particular langid.
DAYS List of day names for the langid. SQL Server uses these names to convert strings into datetime values and vice-versa for a particular langid.

Date Parts

Useful datetime calculations can be performed using units that SQL Server calls date parts. A date part is a unit of time that can be extracted from a datetime value or used in datetime calculations. SQL Server defines several date parts to handle anything from a millisecond to a century. For more information about date parts, see the Microsoft SQL Server Transact-SQL Reference.

Data Entry Issues with Datetime Values

SQL Server-defined locale settings have an impact on data entry operations involving datetime data. For example, dateformat can vary depending upon the langid for a particular user connection. SQL Server uses dateformat to interpret string constants that are to be converted and stored as datetime values. The following Transact-SQL script demonstrates how dateformat can affect input operations:

set dateformat dmy /* or 'set language deutsch', if installed */
go
declare @foo datetime
select @foo = '25.06.97' ---this works
set dateformat mdy /* or 'set language us_english' */
go
declare @foo datetime
select @foo = '25.06.97' --- this breaks, The conversion of CHAR to DATETIME resulted in a DATETIME value out of range.

The same consideration applies to the other SQL Server-defined locale settings including DATEFIRST, MONTHS, SHORTMONTHS and DAYS. These settings can affect data entry options and developers may need to make use of the appropriate SET command to change their values at runtime in order to avoid data entry problems such as the one illustrated.

Using Transact-SQL Datetime Functions

Transact-SQL contains several functions that are useful in performing datetime calculations. These functions include DATEADD, DATEDIFF, DATENAME, and DATEPART. These functions can be useful when a developer wants to perform datetime calculations in Transact-SQL batches, stored procedures, and triggers. For complete information about using these functions, see the Microsoft SQL Server Transact-SQL Reference.

These functions are also affected by SQL Server-defined locale settings. For example, consider the difference in output of the DATENAME function based upon the current langid.

Transact-SQL batch ISQL/w output
SET LANGUAGE "english"
GO
SELECT DATENAME(MONTH,GETDATE())
GO
SET LANGUAGE "german"
GO
SELECT DATENAME(MONTH,GETDATE())
GO

------------------------------
June

(1 row(s) affected)


------------------------------
Juni

(1 row(s) affected)

The difference in output is the result of two separate values for the SQL Server-defined locale setting called MONTHS for the English and German langids. Developers must be aware of the way these functions can be affected by changes to the SQL Server-defined locale settings based upon the current langid.

Converting Datetime Values to Strings

SQL Server's CONVERT function supports the conversion of binary datetime values into formatted strings. In most cases developers wll prefer to perform this operation in the client application due to the richness of the NLSAPI. However, in some cases it can be useful to allow SQL Server to perform this conversion.

The Transact-SQL CONVERT function supports a style parameter. If this parameter is not used a default style is applied. These styles are not extensible and Transact-SQL does not currently support format specifier syntax for formatting dates and times. It is important to recognize that returning the strings formatted using the default style may not be appropriate for a particular language or region.

This table lists each of the date and time styles available in SQL Server 6.5. Default values (style 0 or 100, 9 or 109, and 13 or 113) always return the century (yyyy).

Format number
without century (yy)
Format number
with century (yyyy)

Description

Output
  0 or 100 Default mon dd yyyy hh:miAM (or PM)
1 101 USA mm/dd/yy
2 102 ANSI yy.mm.dd
3 103 British/French dd/mm/yy
4 104 German dd.mm.yy
5 105 Italian dd-mm-yy
6 106   dd mon yy
7 107 mon dd, yy
8 108 hh:mm:ss
  9 or 109 Default milliseconds mon dd yyyy hh:mi:ss:mmmAM (or PM)
10 110 USA mm-dd-yy
11 111 JAPAN yy/mm/dd
12 112 ISO yymmdd
  13 or 113 Europe default milliseconds dd mon yyyy hh:mm:ss:mmm(24h)
14 114 - hh:mi:ss:mmm(24h)