Sjoert Ebben and Gwyneth Marshall
Microsoft Corporation
April 21, 1999
Contents
Introduction
Globalization/Internationalization
What Is Globalization and When Should I Do It?
What Types of Problems Will I Find in Globalization?
How Can I Identify Globalization Bugs?
Pseudo-Localization
Pilot Process
What Is a Pilot Project?
Why Should I Run a Pilot Project?
When Should I Run a Pilot Project?
Which Language?
Localization
Scheduling
The Localization Kit
Graphics
Quality Assurance
Market Specifics
Source Control
Outsourcing
Summary
Links to Explore
About the Authors
"Oh, What a Tangled Web We Wove" was the title of an article we saw years ago about how complicated and large the Web had become. Of course, at that time almost all of the Web was in English because the United States was miles ahead of the rest of the world when it came to being online. Now, about six years later, the rest of the world is catching up quickly. Japan, Germany, and China all have a large number of users, and the Spanish-speaking population on the Web is increasing rapidly. The consequence? The Web is becoming a multicultural and multilingual playground, where Web sites are becoming available in the native languages of the audience. On the global market, there are now plenty of opportunities for English-speaking companies to port their Web sites to a local market. Doing so might seem easy; at first it's natural to think, "Just translate the thing." Well, it's not that easy. Local markets have specific requirements, not only from a marketing perspective but also from a technical point of view.
In this article, we give you an overview of some of the difficulties, the various steps in the process, and some tips and tricks on how to go about it. We don't presume any prior knowledge of the localization business, but we do assume some Internet experience.
This is by no means a complete guide, but a starting point to localizing Web sites. We also don't want to claim to be the inventors of the process! Our knowledge is based on our experience and also that of many others, and great thanks go out to them!
The Computer Dictionary (Microsoft Press, 1994) defines localization as "the process of altering a program so that it is appropriate for the area in which it is used." We have divided the generic term "localization" into two phases: globalization/internationalization and localization. Globalization and internationalization are used interchangeably to refer to code changes that are made to ensure that a product -- or Web site -- can be localized and that information is presented in a format to which your users are accustomed. We use the term localization to refer to translation of strings and localization engineering tasks (more explanation about that later).
There are several schools of thought as to when this process should occur; however, we have had the best success when globalization is done in parallel with core development (development of the base product, usually targeted for the United States market) and before localization has begun. This has several benefits. It:
There is no fool-proof way to find all globalization errors. The key to success is testing and research. You must know your market -- or find someone who does -- to avoid introducing errors in your code, or you must test your product and remove the errors later.
Even if you begin a project knowing it will ultimately be released in several languages or markets, your innate cultural biases will blind you to problems that could potentially be introduced into your site. (A language and a market can be completely different. Consider the French language: the market considerations for France, Belgium, and Switzerland may be quite different although each of these markets speaks a form of French.) There are thousands of globalization issues -- different address conventions around the world, money formats, and date formats, to name a few. The type of errors you run into depends on the format and complexity of your site. We can't cover them all in this article, but we do cover some of the common mistakes we see.
It is important to remember that these are problems we see relating to code. If you want to ensure your content is appropriate for a particular market, you need to have your content reviewed by someone in that market. No one -- us included -- can know all of the issues that pertain to a particular area. Having a native or expatriate review your content can be critical, especially in politically or culturally sensitive areas of the world.
Here's a short list of items that would need to be changed to be globalized. These items are often hard-coded. We prefer either using the system settings when you can or using an automated function to present the information to the user. (The classic book on software localization, Developing International Software by Nadine Kano, discusses the Native Language Support files and some ways to use them. It is out of print, but available online at MSDN.) If you have to have a localization engineer change the code later, you could be introducing new errors into the project.
A note about fonts and font attributes. You may think formatting is a trivial issue, but when you spend thousands of dollars to present your corporate image in a market, you want to ensure the correct image is presented. In the United States, sites are usually designed using certain fonts (Arial, Times New Roman, and so forth). While these fonts can generally also be used for Western European languages, they do not contain the characters for Japanese, Korean, Traditional Chinese, Simplified Chinese, Russian, or others. We recommend that you write a style guide for each of these markets detailing the fonts you want to use, the font size, and any other changes in font attributes -- such as bold and italics -- that can make reading some languages difficult.
Truncation happens when you design an interface with a particular word or set of words in mind, and the word length varies in other languages. Any words that are longer are then truncated in the interface. This is very common on Web sites, because real estate is very valuable and designers often want to get the most information into a limited space. Consider the simple example of a Search button. "Search" is fairly short in English, but in French one would use the term "Rechercher." If you were to use the exact same button for the word "Rechercher," only "Recher" would display -- not very useful to the user.
In your design and development of a Web site, you should consider that all of the words and phrases could expand. The general rule given is to expect terms to expand by 30%. However, we've found that this rule works best when applied to large sections of text, like help files, but doesn't work as well when applied to a single word. A term can expand anywhere from 200% to 400%. Having a flexible design is critical to the success of localizing a Web site.
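To make the expansion point concrete, here is a minimal Python sketch (not part of the original article) that flags translated strings likely to overflow a fixed-width control. The function names, the resource keys, and the six-character budget are all hypothetical, and character count is only a rough stand-in for rendered width, which really depends on the font.

```python
def expansion_percent(source: str, translation: str) -> float:
    """Return how much longer the translation is than the source, in percent."""
    return (len(translation) - len(source)) / len(source) * 100


def find_truncation_risks(strings, max_chars):
    """Yield (key, translation, visible_text) for strings that exceed max_chars.

    `strings` maps a hypothetical resource key to a (source, translation) pair.
    Character count is only a rough stand-in for rendered width.
    """
    for key, (_source, translation) in strings.items():
        if len(translation) > max_chars:
            yield key, translation, translation[:max_chars]


# Hypothetical sample: a button sized for the six letters of "Search".
samples = {"btn_search": ("Search", "Rechercher"), "btn_ok": ("OK", "OK")}

for key, text, visible in find_truncation_risks(samples, max_chars=6):
    print(f"{key}: '{text}' would display as '{visible}'")
```

A check like this is cheap to run on every handback, but it does not replace looking at the localized page in a browser.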
A composite string is an error message or other text that is dynamically generated and presented to the user in sentence form. Here is an overly simplistic example:
<%
If Err = 400 Then
    errText = "server"
Else
    errText = "connection"
End If
%>
<P>The <%=errText%> is currently unavailable.</P>
This string could not be translated into Spanish. Because "server" is a masculine noun and "connection" is a feminine noun, the translation of the term "the" would have to change depending on which condition is met. And because the localizer is given only one opportunity to translate the shell error message, the correct message will show for only 50% of the conditions.
A better solution would be to have multiple error messages that are translated separately, or to find a different way of presenting the same information. For example, the following are two ways of telling the user how many mail messages they have.
MailMsg = [Number of messages in user's mailbox]
<P>You have <%=MailMsg%> message(s) in your inbox</P>
Even in U.S. English this is not the clearest communication. However, we can rewrite the code so that it is flexible enough for multiple languages and conveys the same information to the user.
MailMsg = [Number of messages in user's mailbox]
<P>Inbox: <%=MailMsg%></P>
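For the earlier server/connection error, the other fix mentioned above, separate messages that are translated as whole sentences, might look like the following Python sketch. The resource keys, the `error_message` helper, and the Spanish wording are our own illustrations and are not taken from any real site.

```python
# Each complete sentence is a separate resource, so a translator can pick
# the right article and adjective agreement for every case.
MESSAGES = {
    "en": {
        "server_unavailable": "The server is currently unavailable.",
        "connection_unavailable": "The connection is currently unavailable.",
    },
    "es": {
        "server_unavailable": "El servidor no está disponible en este momento.",
        "connection_unavailable": "La conexión no está disponible en este momento.",
    },
}


def error_message(lang: str, err_code: int) -> str:
    """Return a whole, pre-translated sentence for the given error code."""
    # Same branching as the ASP example: 400 means the server, anything
    # else means the connection.
    key = "server_unavailable" if err_code == 400 else "connection_unavailable"
    return MESSAGES[lang][key]


print(error_message("es", 400))
print(error_message("es", 503))
```

The cost is more strings to translate; the benefit is that no sentence is assembled from fragments at run time.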
Let's say you have a list of items that you want to present to the user. What is the most common way to present a list of information? Most likely you'd choose an alphabetical list. Now consider taking this same list of information and translating it. You would probably give the list to the translator in the same order you wrote it -- alphabetically for U.S. English. However, sort order is not the same for all languages, particularly for languages that do not use the Western alphabet. In many Asian cultures, characters are composed of brushstrokes written in a prescribed, traditional order, and characters are sorted by that brushstroke order.
If we were to just blindly translate the terms as given, the user would not know how to find the information they were looking for. You need to either find a way to automatically sort the items (this can be a very difficult task) or ensure that your localizers can change the sort order of the list while they are localizing the code.
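Even among languages that use the Latin alphabet, sort order differs. As a small illustration, the Python sketch below compares plain code point sorting with a hand-written Swedish alphabet, in which å, ä, and ö follow z. The `swedish_key` helper and the word list are hypothetical, and the approach is deliberately simplified; a production site would more likely rely on the collation support of its platform or database.

```python
# Simplified alphabet for Swedish: a-z followed by å, ä, ö.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word: str):
    """Build a sort key from each letter's position in the Swedish alphabet.

    Characters outside the alphabet sort after all known letters. This ignores
    many real collation rules (case, punctuation, accents on other letters).
    """
    return [_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in word.lower()]


words = ["öl", "zebra", "ål", "äpple", "apa"]

print(sorted(words))                   # code point order
print(sorted(words, key=swedish_key))  # Swedish alphabetical order
```

The two printed lists differ in where "ål" and "äpple" land, which is exactly the kind of difference a user notices when scanning a list.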
When localizing a site, one of the first decisions to make is what character encoding to use. On the Web, the choice is generally between UTF-8 (which covers the characters of most of the world's major languages) and a native encoding (an encoding specific to a language or set of languages). There are costs and benefits to both systems. Regardless of what encoding you decide to use, you need to ensure that the data displays correctly to your users and any data sent to and from a form or database maintains data integrity. (For more information about character encoding, see HTML Character Sets.)
Carefully consider each component of your Web site (this could include databases, form elements, Component Object Model (COM) objects, JavaScript, DHTML, ActiveX® controls, etc.) and ask yourself:
The answers to these questions may not be the same for different elements on your site. If you are storing all the strings for several language versions of your Web site in a single database, you may need to use Unicode or UTF-8. If you've encoded your HTML natively, how are you transforming your Unicode data so that the user can read it on the Web site?
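One quick way to explore these questions is to check whether sample strings survive a round trip through each encoding you are considering. The Python sketch below is our own illustration; the `round_trips` helper and the sample strings are hypothetical, and the code page names are those used by Python's standard codecs.

```python
def round_trips(text: str, encoding: str) -> bool:
    """Return True if text survives encode/decode in the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no representation for at least one character.
        return False


samples = {"French": "Rechercher à l'étranger", "Japanese": "検索", "Russian": "Поиск"}
encodings = ["utf-8", "cp1252", "shift_jis", "cp1251"]

for language, text in samples.items():
    supported = [enc for enc in encodings if round_trips(text, enc)]
    print(f"{language}: {supported}")
```

UTF-8 handles every sample, while each native code page handles only some of them, which is the trade-off to weigh for a shared multi-language database.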
Now that you've read about a few of the errors that can occur on a Web site when you don't consider a global audience, we hope it's apparent that globalization is crucial if you are going to translate that site. We discuss a few simple ways to uncover globalization errors, but if you've never globalized a Web site and are unsure of how to begin, two of the best ways are described in this article: pseudo-localization and a pilot project. You will uncover whole sets of bugs you never imagined before.
Sometimes just reading a functional or technical specification from the point of view of an international user can reveal a number of issues. For example, suppose a technical specification says that a Web site will display the date and time it was last updated, and that this data will be taken from the server time. Much of Asia and Oceania lies on the other side of the International Date Line from a server in the United States, so for users there a date based on server time can make your site appear a day older than it is. If you were to calculate the date based on the user's time instead, your Web site's content would appear fresh.
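The date problem is easy to reproduce. The following Python sketch uses hypothetical fixed offsets for a server on the U.S. West Coast and a user in Japan; the offsets, the timestamp, and the helper name are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offsets; real code should use the user's actual time zone.
SERVER_TZ = timezone(timedelta(hours=-8))  # a server on the U.S. West Coast
TOKYO_TZ = timezone(timedelta(hours=9))    # a user in Japan


def last_updated_for_user(server_time: datetime, user_tz: timezone) -> datetime:
    """Convert a time-zone-aware server timestamp to the user's local time."""
    return server_time.astimezone(user_tz)


updated = datetime(1999, 4, 20, 18, 0, tzinfo=SERVER_TZ)
local = last_updated_for_user(updated, TOKYO_TZ)

print("Server date:", updated.date())
print("User date:  ", local.date())
```

The same instant falls on different calendar dates for the server and the user, so displaying the raw server date makes the page look stale to the user in Japan.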
A more formalized approach to this idea is to write your own globalization specification based on the functional or technical specification.
One of the best ways to analyze whether you have globalized a Web site is by writing test cases. Obviously, this is best if you have a dedicated international tester. If we were testing a word processor, you wouldn't think it odd if we were to test each font, then test them in bold, then italics, then bold italics, and so forth. The same type of granular thinking needs to be applied when writing globalization test cases. Even though we educate many people about international Web site design and development, we continually run into people who design address forms with a U.S.-style ZIP code field.
Writing test cases allows you to focus on smaller and smaller units of a project. As you write a test case for each component, you can start thinking about how others would use it. In addition to uncovering problems at the globalization stage, you can reuse these test cases at other stages of your project -- particularly when the files have been localized.
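As an illustration of a granular globalization test case, the Python sketch below runs sample postal codes from several markets against a validator written with only U.S. addresses in mind. The validator, the market list, and the sample codes are hypothetical; they are chosen only to follow each market's postal code format.

```python
import re

# A validator written with only U.S. addresses in mind.
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")


def accepts_postal_code(value: str) -> bool:
    """Return True if the U.S.-only form would accept this postal code."""
    return bool(US_ZIP.match(value))


# Globalization test cases: (market, sample postal code). Each sample follows
# that market's postal code format and should be accepted by a global form.
TEST_CASES = [
    ("United States", "98052"),
    ("Canada", "K1A 0B1"),
    ("United Kingdom", "SW1A 1AA"),
    ("Netherlands", "1012 AB"),
]

for market, code in TEST_CASES:
    result = "pass" if accepts_postal_code(code) else "FAIL"
    print(f"{market}: {code} -> {result}")
```

Every non-U.S. case fails, which is the bug report you want to see before localization starts rather than after.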
Pseudo-localization is the process of exercising your site's user interface, localizability, and site stability before localization. This is done by quickly editing all of the strings in a project to:
Pseudo-localization can be as simple or as complex as the stage of the project demands, and runs the gamut from automated processes to manually editing strings.
If you are in the design phase of a project, you may want to run a prototype through pseudo-localization to ensure that the design is flexible for all of the terms to be translated. For more complex sites, you could use pseudo-localization to test dynamically generated data or to ensure that your controls can display extended characters correctly.
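A very small automated pseudo-localizer might look like the Python sketch below. The specific transformations (accented look-alike vowels, roughly 30% padding, and bracket markers that make truncation visible) are common pseudo-localization techniques that we assume here for illustration; the function name and the sample strings are hypothetical.

```python
import re

# Map plain vowels to accented look-alikes so the text stays readable.
ACCENTED = str.maketrans("aeiouAEIOU", "àéîõüÀÉÎÕÜ")

# Leave HTML tags and ASP expressions such as <%=MailMsg%> untouched.
PROTECTED = re.compile(r"(<%.*?%>|<[^>]+>)")


def pseudo_localize(text: str, expansion: float = 0.3) -> str:
    """Return text with accented vowels, padding, and [ ] boundary markers."""
    parts = PROTECTED.split(text)
    # With one capture group, split() puts protected spans at odd indexes.
    converted = [
        part if i % 2 else part.translate(ACCENTED)
        for i, part in enumerate(parts)
    ]
    padding = "~" * max(1, round(len(text) * expansion))
    return "[" + "".join(converted) + padding + "]"


print(pseudo_localize("Search"))
print(pseudo_localize("You have <%=MailMsg%> new messages"))
```

If a closing bracket is missing on the rendered page, the string was truncated; if an accented vowel shows up as garbage, the page or control has an encoding problem.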
The benefit of pseudo-localization is that you can run through the process iteratively or at different stages of your project, to identify and resolve international issues without wasting your translator's time (and your money) by fixing bugs later in the project. Typically, this process is used to test:
While there are some automated tools for pseudo-localization that will test for string truncation and other error conditions, having a human test the pseudo-localized output can be very important. You must also keep in mind that pseudo-localization can't test everything; this is where human testing and running a pilot project can be very useful.
A pilot project is a localization project run for one particular language before the process starts for the other languages. Typically, a pilot project will run a few weeks ahead of the other languages, although in some cases this might be just a few days or sometimes months ahead. Sometimes, when plenty of time is allowed between the release of the source project (mostly U.S. English) and the release of the localized versions, a localization pilot starts when the source project is released. This method has a few advantages and disadvantages. Both are discussed in more detail in this section.
Localizing a project usually means you'll find specific problems caused by the fact you're changing English text to a foreign language. As a rule of thumb, foreign languages take an average of 30% more space than English does. If a particular resource string doesn't allow that many more characters, a localizer will have a tough time shortening a valid translation or might not be able to find a translation that fits.
Most of these problems will normally be caught during the globalization phase, but because the steps involved in globalization are mostly automated, not all issues will be found.
The pilot project is the first time a project is localized by a human being, involving mostly manual work. This allows the localizer to find those specific issues that fell through the cracks. Some problems the localizer encounters might not even be issues, but just "localization challenges." The pilot localizer will investigate this challenge and come up with a solution that other localizers can then use as well.
Running a pilot project will naturally take more time than might be expected from the amount of localizable resources involved. This is due to the problem solving as described above. However, the extra time needed for the pilot will be earned back when the other language versions are built. Most products will be localized in eight languages or more, so it becomes clear that any fix at the pilot stage will save time on other language versions. When you don't run some sort of pilot, you can apply the following calculation: TotalBugs = CoreBug * Languages.
Before starting any localization project, a localization plan needs to be written. In this plan, you need to outline the various instructions for the localizers, the engineering changes for a market (if any), and the overall process to follow. Of course, at the process planning stage, no actual localization work is done yet and the plan has to be tested in practice.
The pilot project allows you to test the localization process, to make sure all steps outlined make sense and are achievable. For example, in a localization plan, the localizer is instructed to work online via the Internet on a file or project. Unfortunately, the Internet connection in his country is either slow or unreliable and a connection can't be maintained for the required amount of time. The localization process is changed to e-mailing the file and letting the localizer work offline.
The pilot localization project is one last sanity check before the bulk of the work starts, to make sure the localization of other languages runs smoothly.
Whenever you develop a product that is going to be localized in a few languages, you should consider running a pilot project.
You've seen that TotalBugs = CoreBug * Languages, so obviously the more languages you localize in, the higher the bug count.
Also, TotalBugs = EngineeringHours, and EngineeringHours = AddToProjectCost. You can easily see from these equations that if you are localizing your product in a few languages, running a pilot can save you money. (Unless your product meets all the international requirements and is designed in such a way that it will never break anywhere when localized. Feel our cynicism?)
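To put rough numbers on the formula, here is a small Python sketch. The bug counts are hypothetical, and the second function is our own simplifying assumption: a bug the pilot catches is fixed once in the source, while any bug it misses still surfaces in every language.

```python
def total_bugs(core_bugs: int, languages: int) -> int:
    """TotalBugs = CoreBug * Languages, as given in the article."""
    return core_bugs * languages


def bugs_with_pilot(core_bugs: int, languages: int, caught_by_pilot: int) -> int:
    """Bugs left to fix when the pilot finds some core bugs first.

    Bugs the pilot catches are fixed once in the source; the rest still
    surface in every language, the pilot language included.
    """
    remaining = core_bugs - caught_by_pilot
    return caught_by_pilot + remaining * languages


# Hypothetical numbers: 20 core bugs, 8 languages, pilot catches 15 of them.
print(total_bugs(20, 8))
print(bugs_with_pilot(20, 8, 15))
```

Under these assumptions the pilot cuts the number of fixes from 160 to 55, and the gap widens with every language added.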
In localization, all languages are equal but some languages are more equal than others (brutally rephrased from Animal Farm). What this means is that you're more likely to find problems in some languages than in others. Don't forget that English is also a language and we still find problems when localizing a product.
Major (potential) problem areas to look out for in globalization/localization are:
Of these, the last two are more generic and will not be dependent on which language you choose; the first two issues can be very language specific. From a technical point of view, it's therefore better to choose a pilot language that will address these issues than a language that won't.
Because a pilot project is ahead of the other languages, it's likely it will be ready for release before other languages. From a marketing point of view, it's therefore better to choose a language for the pilot that is of more strategic importance than others.
Depending on the product or product type (software, online), the "best" pilot language can be different but will most likely be one of the following:
All of these languages use international characters, take more space, and are important markets. Japanese uses a double-byte character set (DBCS), which is an area with very specific issues, and is mostly used as a pilot language together with a non-DBCS language.
Alternatively, the following languages will also meet (most of) the requirements outlined above:
Eastern European languages usually don't have an important place in the global market but face very specific character issues. Some products provide different code bases for these languages and therefore create a need for an Eastern European pilot.
A pilot project is run to find and solve localization problems before the bulk of the work starts and to streamline the localization process. When a product is localized in a few languages, running a pilot can be a time- and cost-saving exercise. The language for the pilot project is based on both language suitability and strategic importance. If a product has more than one code base, a pilot is required for each of them.
Wouldn't it be nice if you could hand off some files to the localizers and say, "Please hand back at your earliest convenience"? Well, it doesn't exactly work that way. Most products have a deadline -- a target date for going out to the customers. Web sites are no different. Therefore, you need to plan when each step in the project should be completed. In other words, you need to schedule.
Scheduling as such isn't that hard, as long as you stick to the basics. Once you go into more detail, it can get complicated. You can create your schedule according to:
With the "time you need" method, you work from the day your schedule starts and assign time frames to each step:
As you can see, your schedule determines when the product ships. Unfortunately, Web products drive in a faster lane than software products, so you're expected to:
This means that release to Web (RTW) dates are not always under your control and might be part of a company/marketing strategy. You probably have guessed by now that the "time you have" method is used more often. Good scheduling means you can still achieve the dates. Because you still need x amount of time, you'll just need to apply your schedule more aggressively.
Extra resources | Overlap |
Testing/Engineering - 5 days | Testing/Engineering/Localization - 14 days |
Handoff - 1 day | Handoff - 1 day |
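The difference between the "time you need" and "time you have" methods can be sketched in a few lines of Python. The step names follow the project steps used elsewhere in this article, but the durations, the release date, and the function names are hypothetical, and the sketch counts calendar days rather than working days.

```python
from datetime import date, timedelta

# Hypothetical step lengths in calendar days, in the order they run.
STEPS = [
    ("Handoff", 1),
    ("Localization", 18),
    ("Engineering", 2),
    ("Testing", 4),
    ("Bug fixing", 2),
]


def schedule_forward(start: date, steps=STEPS):
    """'Time you need': start from a date and see when the product ships."""
    plan, current = [], start
    for name, days in steps:
        end = current + timedelta(days=days)
        plan.append((name, current, end))
        current = end
    return plan


def schedule_backward(release: date, steps=STEPS):
    """'Time you have': start from the release date and find the latest start."""
    plan, current = [], release
    for name, days in reversed(steps):
        begin = current - timedelta(days=days)
        plan.append((name, begin, current))
        current = begin
    return list(reversed(plan))


for name, begin, end in schedule_backward(date(1999, 6, 1)):
    print(f"{name:<13} {begin} to {end}")
```

If the latest start date computed by `schedule_backward` is already in the past, that is the signal to add resources or overlap steps.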
To ensure your project is localized properly, you have to provide instructions for the localizers, testers, and engineers. Localizers need information on what to localize, for which audience they localize and, in most cases, what not to touch in the file. (Unfortunately, Web-based translation cannot always be done in "safe mode" and localizers often have access to source code.) Static HTML pages are generally easier than ASP or HTML with scripting, because HTML is relatively simple to understand and there are tools that "lock" the HTML tags and allow only plain text or attribute values to be localized. ASP files normally contain a large amount of scripting (VBScript or JavaScript) and localizable strings are often embedded in the scripting. For localizers it can be hard to distinguish localizable strings from functional ones.
For example, the following variable is used in a piece of scripting to determine a course of action. When localized, this functionality breaks.
L_PAGE_STRING = "home"
If L_PAGE_STRING = "home" Then
    Response.Redirect "/home.htm"
End If
Please note that the resource identifier as used above allows for localization of the string in some localization tools. If the identifier doesn't use this model, the string won't be accessible. In this model, the string will only be localizable when initialized and not in the if clause. Localizing the word "home" means the if clause will never be true.
For the localizer, it's impossible to know in most cases if the string should be localized. Giving instructions or adding a comment will clarify:
L_PAGE_STRING = "home" 'Don't localize this string
Note: The example above is also an example of "over-globalization." Functional variables should not follow the resource identifier model, because they should not be localized.
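To show how a localization tool could act on the resource identifier model and on developer comments, here is a minimal Python sketch. It is our own illustration, not a description of any real tool: the regular expression, the `extract_localizable` helper, and the extra sample lines are hypothetical.

```python
import re

# Matches assignments such as: L_PAGE_STRING = "home" 'optional comment
ASSIGNMENT = re.compile(
    r"""^\s*(?P<name>L_\w+)\s*=\s*"(?P<value>[^"]*)"\s*(?:'(?P<comment>.*))?$""",
    re.MULTILINE,
)


def extract_localizable(source: str):
    """Return {identifier: string} for L_ assignments not marked as locked."""
    found = {}
    for match in ASSIGNMENT.finditer(source):
        comment = (match.group("comment") or "").lower()
        if "don't localize" in comment:
            continue  # the developer flagged this string as functional
        found[match.group("name")] = match.group("value")
    return found


script = '''
L_PAGE_STRING = "home" 'Don't localize this string
L_WELCOME_TEXT = "Welcome to our site"
pageId = "home"
'''

print(extract_localizable(script))
```

Only the welcome text is offered for translation: the flagged string is skipped because of its comment, and `pageId` is never exposed because it does not follow the naming model.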
Testers and engineers also need information on what to do with the files. Providing test cases to the tester as well as an overview of the project will ensure the project is checked thoroughly.
So you know why you have to provide a localization kit, but for whom should you write it? Well, as mentioned above, the localizers, the testers, the engineers -- and if you outsource, the project program managers. However, that's a bit over-simplified, because you need to think about your audience. If you write instructions for localizers, you should make sure they actually understand what you're saying. Generally, localizers have a language background and have moderate technical skills. Instructions written for a developer audience might not work. For project managers, you don't have to go into the tiny details of what needs to be changed. They will be interested in an overview, the number of files and the quantity of words to localize. Engineers like their instructions to be to-the-point, and in their lingo.
Here are some examples:
For a project manager
The Hercules project consists of 200 files, of which 138 need to be localized and 62 need to be localized and engineered. The total word count is 35,500 words. There are 50 test cases, spanning 2 platforms and 2 browsers. The project needs to be localized for the following languages:
Handoff | Localization | Engineering | Testing | Bug fixing | Handback |
date | 18 days | 2 days | 4 days | 2 days | date |
For a localizer
These files need to be localized:
File | Words |
/default.asp | 30 |
/content.asp | 1830 |
/help/default.htm | 4375 |
/default.asp
/style.css
/content.asp
For an engineer
/default.asp
Response.write(Services.LookupValue(Services.Key(i)) & "<br>")
to
WriteUTF8(Services.LookupValue(Services.Key(i)) & "<br>")
The examples above should give you an idea of how to go about it. You want the localization of the project to run smoothly, so the better the instructions, the fewer the problems. You can take a few general steps to write the localization kit:
Localizing graphics can be a tricky process, but with the right preparation you shouldn't run into too many problems.
The most common issue with localizing graphics is that no proper source files are provided. Most localizable graphics consist of text on top of some sort of structured background. To localize the text, you need to get access to the text only. If you get a GIF or JPEG file, the text and the background are in the same layer so changing the text means touching the background as well. If the background is just a plain color, you can easily replace the text. If the background contains structure, well, good luck.
Figure 1. Example of file with structured background
The most common format used to hand off localizable graphics is Photoshop (.psd extension). Photoshop supports layers, which means localizable text can go into a separate layer and the other components of the file don't have to be touched.
Many things can go wrong in the localization process. This is why localized products should be reviewed for quality. There are three common reviews:
Shipping quality products, whether Web sites or software, is something every company, developer, and localizer has as a primary goal. For software products, the focus has always been on functionality. Functionality needs to be 100% + 30% because it's very costly and time-consuming to recall or update a product. Language quality needs to be 100% but doesn't need the extra 30%, because users will be mostly concerned with functionality. For Web sites, this focus is changing. Functionality is still important but can be fixed much more quickly, because it's just a matter of updating some lines of code and re-posting the file. Language quality is much more in the user's face because Web pages are collections of text with not as much functionality as applications. If the language quality is bad, the user will be annoyed and less inclined to come back. Highest annoyance factors:
Examples of wrong terminology can be found in other industries as well. When General Motors introduced the Chevy Nova in South America, it was apparently unaware that "no va" means "it won't go." After the company figured out why it wasn't selling any cars, it renamed the car in its Spanish markets to the Caribe. Of course, this example could be an urban legend, but it still makes a good point.
If the user doesn't come back to the site, revenue is lost. Don't forget that there are plenty of alternatives for almost every type of site. That's why, for localizing Web sites, the QA focus shifts to 100% + 30% on language quality.
The Internet has reached many homes and many countries. Until recently, high Internet penetration was reserved for just a few markets outside the U.S. Market specific changes to a site were easy, because there were only a few. Now there are many flavors for many languages to be taken into account.
Use different flavors so you can serve your customers. If your customers expect a particular terminology, you'd better use it. Let's look at Spanish as an example. Mexico still has the highest Internet penetration of the Spanish-speaking markets, but the rest of Latin America and Spain itself are catching up. Terminology-wise, the various flavors of Spanish can be very different. What can be perfectly acceptable in one flavor can be an insult in another. Even for English there are different flavors. In British English, I would have spelled "flavors" as "flavours" and "localization" as "localisation." Not an insulting difference in my humble opinion, but still different!
For software, products are more often language specific, for these reasons:
For Web products, it's a different story. Let's look at portal sites as an example. The main attraction of a portal site is that a user finds what he or she needs. Most often, it's news, and a user in Finland will not really be interested in The Seattle Times. So, go local! Providing local content will attract the user, and part of local content is local terminology. The user will feel more "at home" and will come back to the site. More hits, more revenue.
Making the page truly local means higher overall cost. In most cases, the page needs to be localized only once. Market specific changes can be implemented afterwards. The number of fixes can mean that the pages have to be completely re-localized, although those cases are generally rare.
Although the overall cost of the project would be higher, the cost per market for market specific changes is actually going down. Most of the work has been done on the primary market for that language anyway (localization, engineering, and testing) so you can do market specific changes for two markets for almost the same price as localizing another market. For example, the French language is used for the French, Swiss, and Belgian markets, and with minor modifications you can effect market specific changes for them all. Now that's a bargain!
You might have your pages localized in-house, or you might outsource the localization completely. In any case, you need to maintain source control. This means that you have to establish a code base that all languages are based on. From this code base, the localization will be done. Even market specific changes can be implemented, but it's advisable to keep those changes limited to text and not functionality. Keep in mind the distinction between functionality and code, because code can be a URL or a method of generating/displaying text. If you require functionality changes for a specific market, it means the page wasn't fully globalized. In that case, it's advisable to edit the source file and make it suitable for all markets.
The consequences of not having a code base will be visible quickly. How do you know a localized page is properly done if there's nothing to reference it against? How do you make sure a code update is properly implemented in all files? Working from the same source means it's easier to track updates or problems.
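One way to use the code base as a reference is to check that a localized page still has the same markup skeleton as its source. The Python sketch below is a minimal illustration for static HTML only; the class and function names and the sample pages are hypothetical, and it compares tag names and attribute names while ignoring text and attribute values, which may legitimately be translated.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the sequence of tags and attribute names, ignoring all text."""

    def __init__(self):
        super().__init__()
        self.structure = []

    def handle_starttag(self, tag, attrs):
        names = tuple(sorted(name for name, _value in attrs))
        self.structure.append(("open", tag, names))

    def handle_endtag(self, tag):
        self.structure.append(("close", tag, ()))


def page_structure(html: str):
    """Return the list of tag events found in an HTML string."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.structure


def same_structure(source_html: str, localized_html: str) -> bool:
    """True if the localized page kept the source page's markup skeleton."""
    return page_structure(source_html) == page_structure(localized_html)


source = '<p class="note">Inbox: <b>3</b></p>'
good = '<p class="note">Posteingang: <b>3</b></p>'
bad = '<p class="note">Posteingang: <b>3</p>'

print(same_structure(source, good))
print(same_structure(source, bad))
```

A mismatch does not say what went wrong, only that a localizer touched something other than text, which is usually enough to send the file back for review.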
A good tool to maintain source control is Visual SourceSafe (VSS). VSS forces you to "check out" a file. This will make it read-only to anyone else so that only one person can work on a file at a time. Compare it to a library. If you borrow a book from the library, nobody else can read that specific book until you've brought it back.
One could probably write a whole book just on outsourcing. We'll cover just the basic concepts here. Many companies outsource localization, for various reasons:
Many companies specialize in localization. Some of them are small and focused on a smaller market; some of them are huge and have offices in almost every country and take on all sorts of localization projects.
To choose a localization company, you can keep the following in mind:
Although prices are important, there's fierce competition in the market and it's far more important to focus on quality than bargain prices.
Outsourcing has its own problems you'll need to deal with:
There's no "best" way of going about it. You'll need to analyze what's best for you.
After this overview of the localization process, we hope you see that creating an international Web site is more than just translation. We often say, "The good news is international is fairly straightforward. The bad news is that you have to completely change how you think about Web sites!" With this in mind, we encourage you to read other articles and learn more about the issues involved. In our next article, we will discuss some of the issues in designing a global Web site and how you can avoid some of the standard errors we have discussed in this article.
Language and Code Pages
Sjoert Ebben is a program manager on the Web Essentials team, Ireland and has been with Microsoft for nearly four years. For the last three years he's been in the Internet fast lane, which corresponds to his driving style. When he's not spending his time on Internet business, his wife Cathérine and daughter Lilith make sure to catch him in their Web.
Gwyneth Marshall is a program manager specializing in the globalization and localization of Web sites, most notably MSN.com (http://www.msn.com/wwcon/intl_map.asp). She dabbles in high finance on a global scale via her investment club.