Tips on Using the HTML Data Refinery
The HXML (HTML Data Refinery) tool creates a set of "globalized" XML template files and a "lingo" XML file containing attributed tags and script. (You can either do a global search for localizable text, or insert custom attributes for tags that are to be localized.) The localized versions of the lingo files can be recombined with the globalized XML templates to create the "final" HTML versions.
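The recombination step can be pictured with a minimal sketch. The file layouts below are assumptions made for illustration only (the actual HXML template and lingo schemas are not shown here): the template keeps its LID attributes but has its text stripped, and the hypothetical lingo file holds one item element per LID. The function name recombine is likewise invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical globalized template: localizable text stripped, LID attributes kept.
TEMPLATE = """<html><head><title LID="t1"/></head>
<body><label LID="l1"/><input LID="b1" type="submit"/></body></html>"""

# Hypothetical localized lingo file: one <item> per LID (assumed layout).
LINGO_DE = """<lingo locale="de">
<item LID="t1">Bestellung</item>
<item LID="l1">Name</item>
<item LID="b1" value="Senden"/>
</lingo>"""

def recombine(template_xml: str, lingo_xml: str) -> str:
    """Merge localized text and attributes into the template, matching on LID."""
    root = ET.fromstring(template_xml)
    lingo = {item.get("LID"): item for item in ET.fromstring(lingo_xml).iter("item")}
    for elem in root.iter():
        item = lingo.get(elem.get("LID"))
        if item is None:
            continue  # element is not localizable, or has no lingo entry
        if item.text:
            elem.text = item.text
        for name, value in item.attrib.items():
            if name != "LID":
                elem.set(name, value)  # localized attribute such as value or alt
    # method="html" writes explicit close tags for non-void elements.
    return ET.tostring(root, encoding="unicode", method="html")

print(recombine(TEMPLATE, LINGO_DE))
```

Matching on the LID rather than on document position means a translator can reorder or omit lingo entries without breaking the merge. This sketch leaves the LID attributes in the output; a real build might strip them.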
The Pitfalls of HXML
There are several issues with the HXML tool that have yet to be resolved. We have found the following:
- HTML must be written using HTML AutoLayout (HAL) rules. The localized HTML will not resize properly otherwise.
- Most of the time, HTML encoding is your friend. It is normally better to convert reserved characters that the XML parser may interpret as entities (&) or subelements (<...>) into entity format ("&amp;", "&lt;", and "&gt;" respectively). However, when storing HTML and SCRIPT text in an XML file, reserved characters may be very important, for instance the U (underline) tags inside the LABEL tags. Unfortunately, there is no way (that we know of) to disable HTML encoding of element text by XML. Consequently, script that is not to be localized should be removed from the source files. Use <SCRIPT SRC=...> to include the script.
- TITLE tags cannot be automatically extracted from the HTML source files because they do not support custom attributes and the LID attribute is ignored. A bug has been filed against the Internet Explorer parser for this. For the time being, you will have to copy the LID attribute from your HTML source file to the globalized XML template by hand after a build.
- Script execution must be disabled in the browser during the build process. If the script is allowed to run, it can make modifications to the page during the window.onload event (such as preloading form fields from a database). Programmatic changes to the HTML hierarchy will be copied to the template file when it is saved.
- Simple replacement of lingo terms is inadequate for some locales. (Consider the difference in address conventions between the United States and Japan.) We don't have a good story for this sticky problem yet.
- The NLS component is required to format data for display in a particular locale. SQL does this already with dates, but doesn't provide locale-specific formatting for numbers or currency. (Moreover, currency formatting requires XML data for exchange rates.) We have yet to come upon a "safe" solution for the reverse procedure, accepting user input from one locale and converting it back into the locale of the database.
- ASP source files are not supported, because the XML parser does not recognize <%..%> tags.
- The character set is assigned automatically for all the globalized template pages, but you may still have to set the Session.CodePage property on the IIS server for some languages (like Japanese) to display properly. This may be related to the fact that some code pages aren't handled well by the XML parser. You have to encode these languages.
- The tool itself still has a few problems with user interface and error handling. We hope to have these worked out in a later version.
- Non-breaking space entities (&nbsp;) are converted to special white space characters by the HTML DOM. These characters are saved as hard spaces in the converted XML template files, and can appear as garbage characters after the build with some code pages.
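Two of the pitfalls above, the encoding of reserved characters and the handling of non-breaking spaces, can be illustrated with a short sketch. This is not how HXML itself works; it is a stand-alone illustration using Python's standard library, and the helper names encode_for_xml and decode_from_xml are invented for this example. It assumes that writing the hard space as a numeric character reference is an acceptable workaround, since the reference is plain ASCII and so does not depend on the code page of the file.

```python
from xml.sax.saxutils import escape, unescape

NBSP = "\u00a0"  # the hard space the HTML DOM produces for a non-breaking space entity

def encode_for_xml(text: str) -> str:
    """Escape reserved characters and write hard spaces as numeric references."""
    escaped = escape(text)  # converts &, < and > to their entity forms
    return escaped.replace(NBSP, "&#160;")

def decode_from_xml(text: str) -> str:
    """Reverse encode_for_xml, restoring the original markup and hard spaces."""
    return unescape(text.replace("&#160;", NBSP))

label = "<U>N</U>ame" + NBSP + "& more"
stored = encode_for_xml(label)
print(stored)
print(decode_from_xml(stored) == label)
```

The round trip shows why encoding is usually harmless: the markup in the label survives as long as whatever reads the XML decodes the text again before it is written back into the HTML page.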
HXML Settings Used by the BDG
Here are the HXML Conversion Options used by the BDG to create the globalized XML templates from the source application:
- General. Convert to XML, without URL auto-completion. Use "LID" as the localization ID attribute.
- Custom Attributes. Four custom attributes were used: L_NAME, required, validator, and VIEWASTEXT.
- Always Close. Always close the following tags when converting to XML: APPLET, BODY, DIV, FIELDSET, FRAMESET, IFRAME, OBJECT, SCRIPT, SELECT, SPAN, TABLE, TD, TEXTAREA, TITLE, and TR.
- Lingo. Save localizable text into the lingo file, and strip text and attributes from the source file. The following attributes were used: accessKey, alt, L_NAME, src, title, and value.
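As a rough picture of what the Lingo option does, the sketch below walks a parsed page, moves the text and the listed attributes of every element carrying a LID into a dictionary, and strips them from the source tree. The attribute list and the "LID" name come from the settings above; the function name extract_lingo, the dictionary layout, and the sample markup are assumptions made for illustration and do not reflect the tool's actual output format.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from the Lingo setting above.
LINGO_ATTRIBUTES = ("accessKey", "alt", "L_NAME", "src", "title", "value")

def extract_lingo(root, lid_attr="LID"):
    """Strip localizable text and attributes from the tree, keyed by LID."""
    lingo = {}
    for elem in root.iter():
        lid = elem.get(lid_attr)
        if lid is None:
            continue  # only elements tagged with a LID are localized
        entry = {}
        if elem.text and elem.text.strip():
            entry["text"] = elem.text.strip()
            elem.text = None  # the template keeps the tag but not the text
        for name in LINGO_ATTRIBUTES:
            if name in elem.attrib:
                entry[name] = elem.attrib.pop(name)
        lingo[lid] = entry
    return lingo

# Hypothetical source fragment for demonstration.
page = ET.fromstring(
    '<body><label LID="l1" accessKey="N">Name</label>'
    '<img LID="i1" src="logo.gif" alt="Logo" width="10"/><p>static</p></body>'
)
print(extract_lingo(page))
print(ET.tostring(page, encoding="unicode"))
```

Attributes that are not in the list, such as width in the sample, stay in the template, so layout is shared across locales while only the listed values are handed to translators.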