Lessons Learned from Internationalizing a Global Resource II

Lessons Learned from Internationalizing a Global Resource II:
Experiences with the Second Generation

Gary Perlman and Debbie Hysell
OCLC Online Computer Library Center, Inc.

In a previous paper, Hysell and Perlman (1999) reported on the lessons learned from the internationalization and localization of the FirstSearch® 4.0 search system. At the time that paper was presented, we were in the process of building a completely new version of FirstSearch, which required a new translation effort. Our 1999 paper presented our plan for the next-generation translation. In this paper, we report on the results of that translation. Overall, many problems in the first translation were avoided the second time around, but there were both positive and negative experiences, which we report here.

The translation process used for FirstSearch 5.0 is depicted in Figure 1.

Figure 1: FirstSearch Translation Process

Language files, db.ini and en.ini (used by different groups of developers), contain over 3500 entities (FirstSearch variables) that define the English terminology used in FirstSearch.
The English files are merged, and differences from previous versions are highlighted, to create a Microsoft® Word® file (lang.doc). The Word file contains a table with columns for entity name, English value, current non-English value, and notes.
The Word document is e-mailed to the translators, who translate the new and changed entities and e-mail the translated document back to OCLC.
Language-specific files are generated for Spanish (es.ini) and French (fr.ini).
These files are checked with automated scripts, corrected if necessary, and integrated into FirstSearch.
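The merge-and-highlight step of this process can be illustrated with a minimal Python sketch. It is not the tooling actually used at OCLC: it assumes the language files follow a conventional INI layout ([section] headers and name = value lines), and the function names load_entities and translation_rows, along with the sample sections, are hypothetical.

```python
import configparser


def load_entities(*ini_texts):
    """Merge several INI-format language files into one {"section.name": value} dict."""
    merged = {}
    for text in ini_texts:
        parser = configparser.ConfigParser(interpolation=None)
        parser.optionxform = str  # keep entity names case-sensitive
        parser.read_string(text)
        for section in parser.sections():
            for name, value in parser.items(section):
                merged[f"{section}.{name}"] = value
    return merged


def translation_rows(old_english, new_english, translated):
    """Build table rows: (entity, English value, current translation, status).

    Status is "new" for entities absent from the previous English version,
    "changed" when the English value differs, and "" otherwise.
    """
    rows = []
    for name in sorted(new_english):
        if name not in old_english:
            status = "new"
        elif old_english[name] != new_english[name]:
            status = "changed"
        else:
            status = ""
        rows.append((name, new_english[name], translated.get(name, ""), status))
    return rows
```

Rows with a non-empty status correspond to the entries that would be highlighted for the translators; unchanged rows keep their existing translation.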

The following are highlights of the lessons we learned translating a new version of FirstSearch, referring back to the lessons learned the first time. The first translation, for FirstSearch 4.0, was made difficult because the system was never designed with translation in mind. It was a huge effort to find English in HTML and C code and replace it with variables. FirstSearch 5.0 was built on a new application, SiteSearch 4.0, written in Java, so it was a complete rewrite of the old system, with many new features.

Standard Language Codes: The ISO 639 2-character language codes were used throughout the system (i.e., en for English, es for Español, and fr for Français), with the positive effect that many decisions were made for us, particularly for file and other naming conventions. When we needed a file translated into multiple languages, we were able to predict that the same file name would be used, but be located within a standard directory name. This convention was used both within the help system, and also when accessing news files on the OCLC site (e.g., www.oclc.org/oclc/fs/en/fsnews.htm), and it simplified negotiating the user's preferred language with the browser.
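As a hedged illustration of how two-letter codes simplify both negotiation and file naming, the following Python sketch picks a language from a browser's Accept-Language header and builds a path in the directory convention described above. The paper does not say which mechanism was used for negotiation, so the use of Accept-Language is an assumption, as are the function names and the default base path; paths for languages other than en are inferred from the convention.

```python
SUPPORTED = ("en", "es", "fr")


def preferred_language(accept_language, supported=SUPPORTED, default="en"):
    """Return the best supported 2-letter code for an Accept-Language header value."""
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a missing q parameter means full preference
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        primary = tag.split("-")[0]  # "fr-CA" falls back to "fr"
        candidates.append((-quality, position, primary))
    # Highest quality first; ties keep the order given by the browser.
    for negative_quality, _, primary in sorted(candidates):
        if negative_quality < 0 and primary in supported:
            return primary
    return default


def localized_path(language, filename, base="oclc/fs"):
    """Same file name for every language, inside a directory named by its code."""
    return f"{base}/{language}/{filename}"
```

Because the directory name is the language code itself, no lookup table is needed between the negotiated language and the location of a translated file.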
Structured Language Entities: In our first translation effort, we had to separate the English text from the HTML, placing the English strings into entities. The system had not been built with translation in mind. The entities were not well-organized because they were named as they were found, without an overall plan. In the new system, the language-specific text was not only separated from the code and HTML in the system, but it was also structured, with the positive effect of regularizing and clarifying the translation. The language configuration files (i.e., en.ini, es.ini and fr.ini) contained sections with related items:
- Standard parts of screens: titles, tips, status information
- Items related to particular functions
- Standard parts of user interactions: prompts, labels
- Items related to particular databases (e.g., field and index names, special values)
This structuring simplified the naming of entities because each entity name included the section in which it was placed. For example, the title of the advanced search page was &Lang.pagetitle.advanced; and the label for the Author index was &Lang.index.author;. The adjacency of related terms helped ensure they were consistent in English and in their translations.
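The naming scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not the FirstSearch implementation (which was built on SiteSearch in Java): the expand function is hypothetical, the language table is assumed to be a nested mapping of section to variable to text, and the sample English values are assumed.

```python
import re

# Matches references of the form &Lang.section.variable;
ENTITY = re.compile(r"&Lang\.([A-Za-z0-9_]+)\.([A-Za-z0-9_]+);")


def expand(template, language_table):
    """Replace each &Lang.section.variable; with its text in the given language."""

    def substitute(match):
        section, name = match.group(1), match.group(2)
        try:
            return language_table[section][name]
        except KeyError:
            # Leave unresolved references visible so they are easy to spot.
            return match.group(0)

    return ENTITY.sub(substitute, template)


english = {
    "pagetitle": {"advanced": "Advanced Search"},  # assumed value
    "index": {"author": "Author"},
}
page = "<title>&Lang.pagetitle.advanced;</title> <label>&Lang.index.author;</label>"
print(expand(page, english))
```

Because the section is part of the entity name, the same template works unchanged with a Spanish or French table of the same shape.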
After a period of high contention for the English language file between groups of developers, en.ini was split into parts, one for user interface entities (en.ini) and one for database entities (db.ini). This made it easier for groups working on different aspects of the system to work independently. Recently, a new set of entities for the administrative control module was started as a separate file from the start. These files are all merged for the translators.
Although FirstSearch 5.0 has considerably more functionality than the previous system, it has about 27% fewer language entities than the old system (3695 compared to 5082), and proportionally fewer words (notable because we pay translators by the word). We attribute this in part to the reduction of redundant information in the system.
Help for Developers: Despite a more structured environment and tools to support it, programmers often failed to address the demands of international development and inserted English text directly into HTML or into Java code that generated HTML (this was true of both monolingual and bilingual programmers). Writing language-independent code is difficult, especially when display-independent code is also needed (all styles were in another set of configuration files). Instead of requiring this of every programmer, we suggested that programmers develop HTML with embedded English, which an expert would then replace with entities. For Java code, we wrote perl scripts to detect English that needed replacement, and we provided Java methods to look for values that are submitted in the language currently in use (e.g., a submit button to start a search would be Search in English, Buscar in Spanish, or Rechercher in French).
To help ensure valid English files, and later valid Spanish and French files, we wrote perl scripts to check the files for every anticipated problem. As unanticipated problems were discovered, new checks were added to the scripts.
English as a Second Language: Although our first experience with translation was from English into other languages, the new effort for FirstSearch 5.0 was to translate into English first. That is, we were able to review and change the terminology used throughout the system, in part because of the structured system we used for naming entities. For example, we knew that there would be a search index to look up articles by the title of the publishing periodical. Its name is &Lang.index.source;, but the English version could be magazine, journal, or periodical title. Because everyone used the entity, global changes were possible, and marketing, documentation, and others were able to participate in the design of the language. One interesting result is that we now have internal names for many parts of the system, but these are not the names used outside the system (e.g., history was renamed to previous searches).
Previous Translator Experience: The translators already had experience with the previous system, so despite changes in terminology, they had already decided on many of the most appropriate translations.
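The kind of check an expert might run over developer-written HTML can be sketched as follows. This is a crude Python stand-in, not the perl scripts mentioned above: it works line by line with regular expressions rather than a real HTML parser, it does not handle tags that span lines or script and style content, and the function name and the list of displayable attributes are assumptions.

```python
import re

TAG = re.compile(r"<[^>]*>")
ENTITY_REF = re.compile(r"&#?[A-Za-z0-9_.]+;")
WORD = re.compile(r"[A-Za-z]{2,}")
# Attributes whose values are shown to the user (an assumed, incomplete list).
DISPLAY_ATTR = re.compile(r'\b(?:value|alt|title)\s*=\s*"([^"]*)"', re.IGNORECASE)


def find_embedded_text(html):
    """Return (line number, words) for lines with literal text not wrapped in an entity."""
    findings = []
    for number, line in enumerate(html.splitlines(), start=1):
        attribute_text = " ".join(DISPLAY_ATTR.findall(line))
        visible = TAG.sub(" ", line) + " " + attribute_text
        words = WORD.findall(ENTITY_REF.sub(" ", visible))
        if words:
            findings.append((number, " ".join(words)))
    return findings


page = "\n".join([
    "<h1>&Lang.pagetitle.advanced;</h1>",
    "<label>Author</label>",
    '<input type="submit" value="&Lang.button.search;">',
    '<input type="submit" value="Search">',
])
for line_number, text in find_embedded_text(page):
    print(line_number, text)
```

Lines that use entities pass silently, while the literal label and the literal button value are reported for replacement.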
Translation Tools: The conversion of the entity files into a multi-column Microsoft Word file that highlighted changed and new entities allowed the translators to work more efficiently than in the old translation environment. The translators were able to use the table to do global search-and-replace operations and, as a look-up device, to ensure consistency in translations of help files and related publications. The table has also allowed translators to continue work when the development system is not available. Review of the interface has also been more efficient because the table is reviewed first and then a quick pass is made over the screens themselves, whereas in the previous version, reviewers had to review every screen and message displayed by the system. This improved translation process should make it possible to add a new language version of the interface within 2-3 months.
Additionally, the use of SGML/XML in creating help and documentation content files has allowed us to reuse text and reduce costs and review time. Because the new version of FirstSearch is customized for user experience level (basic, advanced, expert) and databases used (over 80), the need for effective language control and efficiency and economy in the translation process is far greater.
Translator Knowledge of HTML: Our translators were knowledgeable about HTML, which was critical because some text to be translated contained many entities (including language entities, display format entities, and standard HTML). The translators were able to move parts of HTML in ways that maintained the structure while adapting the grammar to the target language. Still, there were some problems in the translations, such as slightly broken HTML and numerical entities whose names had been translated (e.g., &year; translated to &ano;). We developed perl scripts to check the HTML after translation to automatically detect as many problems as possible.
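One such post-translation check, detecting entity references that were altered or dropped, might look like the following Python sketch. It is an assumption-laden illustration rather than a port of the perl scripts: the function name is hypothetical, the sample sentences are invented, and a real check would need an allowlist for character entities (such as &eacute;) that translators legitimately add.

```python
import re
from collections import Counter

ENTITY_REF = re.compile(r"&#?[A-Za-z0-9_.]+;")


def entity_mismatches(english, translated):
    """Compare entity references in a source string and its translation.

    Returns (missing, unexpected): references present in the English text but
    not in the translation, and references that appear only in the translation.
    """
    source = Counter(ENTITY_REF.findall(english))
    target = Counter(ENTITY_REF.findall(translated))
    missing = sorted((source - target).elements())
    unexpected = sorted((target - source).elements())
    return missing, unexpected


# Invented sample text; only the &year; to &ano; slip comes from our experience.
print(entity_mismatches(
    "Published in &year; by &Lang.index.author;",
    "Publicado en &ano; por &Lang.index.author;",
))
```

Counting occurrences rather than comparing sets means that a reference used twice in English but once in the translation is also reported.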
A document provided meta-data about 40 different section types in the language files, indicating for variables in a section: where they are used; how many words/characters are expected (because of screen constraints); whether to use mixed case, lower case, or a sentence; whether HTML tags are allowed (e.g., tags in page titles appeared uninterpreted in window titles). These details helped translators avoid translations that would not fit, look bad, etc.
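Section meta-data of this kind could also be checked mechanically. The Python sketch below is hypothetical and covers only two of the properties listed above (length and whether HTML tags are allowed); the SectionRule class, the check_value function, and the limits are invented for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionRule:
    """Constraints that apply to every variable in one section type."""
    max_chars: int
    allow_html: bool


def check_value(value, rule):
    """Return a list of problems with one translated value (empty if none)."""
    problems = []
    if len(value) > rule.max_chars:
        problems.append(f"too long: {len(value)} > {rule.max_chars} characters")
    if not rule.allow_html and re.search(r"<[^>]+>", value):
        problems.append("HTML tags are not allowed in this section")
    return problems


# Hypothetical rule for page titles: short, and no tags because tags
# would appear uninterpreted in window titles.
pagetitle_rule = SectionRule(max_chars=20, allow_html=False)
print(check_value("<b>Recherche avancée</b> dans toutes les bases", pagetitle_rule))
```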
Help for Writers: In addition to natural language entities, we generated a special language that defined each entity as its section and variable name (e.g., &Lang.index.subject;, which displayed as Sujet in French, would be defined as index~subject to indicate that the item was defined in the subject variable in the index section). This helped translators locate the source of an entity on their screens. The feature was also used by help writers so that they could include entities for text instead of the text (e.g., include &Lang.pagename.history; instead of Previous Searches). The section headings in help files were used as index keywords for a Latin-1-aware search of help documents.
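Generating such a where-defined language from an existing one is straightforward, as this small Python sketch suggests. It assumes a language is held as a nested mapping of section to variable to text; the function name is hypothetical.

```python
def where_defined(language_table):
    """Build a pseudo-language whose text for each entity is "section~variable"."""
    return {
        section: {name: f"{section}~{name}" for name in entries}
        for section, entries in language_table.items()
    }


french = {"index": {"subject": "Sujet"}}
print(where_defined(french))
```

Loading the generated table in place of a natural language makes every screen show the origin of each string instead of its text.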
Checking HTML Before Translation: Our help files were translated from HTML files, and we found that any HTML errors in the source were maintained in the translation. To avoid that problem in the future, we wrote perl scripts to detect HTML problems before help files are translated. These scripts were also useful for locating problems in the translated files.
Dynamic Updates: The language files are all loaded into a server that is shared by multiple user sessions; language switching is simply a change of a single pointer. One advantage of storing all the language entities in configuration files is that they can be modified easily without any need for compilation, just a reload of the changed file. This feature was used extensively by our marketing department while developing the terminology to be used in the English version. They would edit en.ini and click on a re-load button to see their changes in a working environment. We have not yet automated the translation process in Figure 1 to make this available to the translators; instead they may need to wait at least a few minutes and usually hours to have someone at OCLC reload their changes.
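The shared-table arrangement can be sketched in Python as follows. This is a simplified illustration, not the Java server code: the class names and the loader callback are hypothetical, and the sketch ignores concurrency, which a real multi-session server would have to handle.

```python
class LanguageServer:
    """Holds one table per language, shared by all sessions."""

    def __init__(self, loader):
        self._loader = loader  # callable: language code -> {section: {name: text}}
        self._tables = {}

    def reload(self, language):
        """Re-read one language file; no compilation step is involved."""
        self._tables[language] = self._loader(language)

    def table(self, language):
        return self._tables[language]


class Session:
    """A user session only records which shared table it points at."""

    def __init__(self, server, language="en"):
        self._server = server
        self.language = language  # switching language changes only this value

    def text(self, section, name):
        return self._server.table(self.language)[section][name]


files = {
    "en": {"button": {"search": "Search"}},
    "es": {"button": {"search": "Buscar"}},
}
server = LanguageServer(lambda code: {s: dict(v) for s, v in files[code].items()})
for code in files:
    server.reload(code)

session = Session(server)
session.language = "es"
print(session.text("button", "search"))
```

Because sessions look up the table on every request, a reload of en.ini becomes visible to all English sessions at once, which matches the edit-and-reload workflow described above.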

While most of our experiences were positive, especially compared to the first experience, there are still areas for future improvement:

Complete the implementation of the database system that will track the versions of help/documentation content files and coordinate translation variants at the paragraph level. This will improve the efficiency and consistency of the translations, both for online material and for print media.
Complete usability testing and user feedback surveys.
Provide a where-used capability for the entity strings, to supplement the where-defined capability now available.
Complete the translation of the search engine so that all operators and index labels are transparently translated.
Complete implementation of the SGML-to-HTML conversion process to reduce tagging errors introduced by translators.
Continue to develop safeguards/discipline to ensure that developers have defined all interface text as entities.
