Lessons Learned from Internationalizing a Global Resource II:
Experiences with the Second Generation
Gary Perlman
and
Debbie Hysell
OCLC Online Computer Library Center, Inc.
In a previous paper,
Hysell and Perlman (1999) reported on the lessons learned from
the internationalization and localization of the FirstSearch® 4.0 search system.
At the time that paper was presented, we were in the process of building
a completely new version of FirstSearch,
which required a new translation effort.
Our 1999 paper presented our plan for the next generation translation.
In this paper, we will report on the results of the translation.
The overall results are that many problems in the first translation were avoided
the second time around, but there were both positive and negative experiences,
reported here.
The translation process used for FirstSearch 5.0 is depicted in
Figure 1.
Figure 1: FirstSearch Translation Process
- Language files,
db.ini and en.ini
(used by different groups of developers),
contain over 3500 entities (FirstSearch variables)
that define the English terminology used in FirstSearch.
-
The English files are merged,
and differences with previous versions are highlighted
to create a Microsoft® Word® file
(lang.doc).
The Word file contains a table with entity names,
English value, current non-English value, and notes.
-
The Word document is e-mailed to the translators
who translate new and changed entities,
and e-mail the translated document back to OCLC.
-
Language-specific files are generated for Spanish
(es.ini) and French (fr.ini).
-
These files are checked with automated scripts,
corrected if necessary, and integrated into FirstSearch.
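The merge-and-highlight step in Figure 1 can be illustrated with a minimal sketch. It assumes the language files use ordinary INI syntax with one section per group of entities; the real file format, the function names, and the sample values (including the Spanish strings) are assumptions made for illustration, not the tools used at OCLC.

```python
import configparser


def load_entities(text):
    """Parse INI text into a flat {"section.name": value} mapping."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep entity names case-sensitive
    parser.read_string(text)
    return {
        f"{section}.{name}": value
        for section in parser.sections()
        for name, value in parser.items(section)
    }


def build_table(english_now, english_before, translated):
    """Return rows of (entity, English, current translation, note)."""
    rows = []
    for entity in sorted(english_now):
        english = english_now[entity]
        if entity not in english_before:
            note = "NEW"
        elif english_before[entity] != english:
            note = "CHANGED"
        else:
            note = ""
        rows.append((entity, english, translated.get(entity, ""), note))
    return rows


# Hypothetical contents standing in for en.ini and db.ini.
ui = load_entities("[pagetitle]\nadvanced = Advanced Search\n")
db = load_entities("[index]\nauthor = Author\nsource = Source\n")
merged = {**ui, **db}

# Hypothetical previous English version and current Spanish values.
previous = {"pagetitle.advanced": "Advanced Search", "index.author": "Author Name"}
spanish = {"pagetitle.advanced": "Búsqueda avanzada", "index.author": "Autor"}

table = build_table(merged, previous, spanish)
```

Each row corresponds to one line of the table sent to the translators, and the note column marks the entities that need attention.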
The following are highlights of the lessons we learned translating
a new version of FirstSearch, referring back to the
lessons learned the first time.
The first translation,
for FirstSearch 4.0,
was made difficult because the system was never designed
with translation in mind.
It was a huge effort to find English
in HTML and C code and replace it with variables.
FirstSearch 5.0 was built on a new application,
SiteSearch 4.0,
written in Java,
so it was a complete rewrite of the old system,
with many new features.
- Standard Language Codes:
The
ISO 639 2-character language codes were used throughout the system
(i.e., en for English,
es for Español,
and fr for Français),
with the positive effect that many decisions were made for us,
particularly for file and other naming conventions.
When we needed a file translated into multiple languages,
we were able to predict that the same file name would be used,
but be located within a standard directory name.
This convention was used both within the help system,
and also when accessing news files on the OCLC site
(e.g.,
www.oclc.org/oclc/fs/en/fsnews.htm),
and it simplified negotiating the user's preferred language
with the browser.
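As an illustration of how the two-character codes simplify negotiation, the following sketch picks a language from a browser's Accept-Language header and builds the path of a language-specific file. The function names are hypothetical, the parsing is deliberately simplified, and the Spanish and French paths are assumed to follow the English example above.

```python
SUPPORTED = ("en", "es", "fr")  # ISO 639 two-letter codes named in the text


def pick_language(accept_language, supported=SUPPORTED, default="en"):
    """Choose the best supported language from an Accept-Language header."""
    candidates = []
    for position, item in enumerate(accept_language.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        quality = 1.0
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Reduce regional tags such as "fr-CA" to the primary code "fr".
        primary = tag.split("-")[0]
        candidates.append((-quality, position, primary))
    # Highest quality first; ties keep the order given by the browser.
    for neg_quality, _, primary in sorted(candidates):
        if neg_quality < 0 and primary in supported:
            return primary
    return default


def news_url(language, base="www.oclc.org/oclc/fs", filename="fsnews.htm"):
    """Same file name for every language, inside a per-language directory."""
    return f"{base}/{language}/{filename}"
```

Because the directory name is the language code itself, no lookup table is needed between the negotiated language and the file location.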
- Structured Language Entities:
In our first translation effort, we had to separate the
English text from the HTML,
placing the English strings into entities.
The system had not been built with translation in mind.
The entities were not well-organized
because they were named as they were found,
without an overall plan.
In the new system,
the language-specific text was not only separated from the code
and HTML in the system, but it was also structured,
with the positive effect of regularizing and clarifying the translation.
The language configuration files
(i.e., en.ini, es.ini and fr.ini)
contained sections with related items:
- Standard parts of screens: titles, tips, status information
- Items related to particular functions
- Standard parts of user interactions: prompts, labels
- Items related to particular databases (e.g., field and index names, special values)
This structuring simplified the naming of entities because
each entity name included the section in which it was placed.
For example, the title of the advanced search page was
&Lang.pagetitle.advanced;
and the label for the Author index was
&Lang.index.author;.
The adjacency of related terms helped ensure they
were consistent in English and in their translations.
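The naming scheme can be sketched as a simple substitution over a page template. The sketch below assumes the language file has already been loaded into a nested dictionary keyed by section and then by variable name; the function name, the dictionary layout, and the Spanish values are illustrative assumptions, not the SiteSearch implementation.

```python
import re

# Matches references of the form &Lang.section.name;
ENTITY = re.compile(r"&Lang\.([A-Za-z0-9_]+)\.([A-Za-z0-9_]+);")


def expand(template, language):
    """Replace &Lang.section.name; references using a nested dict."""
    def lookup(match):
        section, name = match.group(1), match.group(2)
        try:
            return language[section][name]
        except KeyError:
            return match.group(0)  # leave unknown entities visible
    return ENTITY.sub(lookup, template)


# Hypothetical language tables.
EN = {"pagetitle": {"advanced": "Advanced Search"}, "index": {"author": "Author"}}
ES = {"pagetitle": {"advanced": "Búsqueda avanzada"}, "index": {"author": "Autor"}}

template = "<h1>&Lang.pagetitle.advanced;</h1><label>&Lang.index.author;</label>"
english_page = expand(template, EN)
spanish_page = expand(template, ES)
```

Leaving an undefined entity in place, rather than substituting an empty string, makes a missing definition easy to spot on the screen.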
After a period of high contention for
the English language file
between groups of developers,
en.ini
was split into parts,
one for user interface entities
(en.ini)
and one for database entities
(db.ini).
This made it easier for groups working on different aspects
of the system to work independently.
Recently, a new set of entities for the administrative control
module was created in its own file from the outset.
These files are all merged for the translators.
Although FirstSearch 5.0 has considerably more
functionality than the previous system,
it has about 27% fewer language entities than
the old system (3695 compared to 5082),
and proportionally fewer words
(notable because we pay translators by the word).
We attribute this in part to the reduction of
redundant information in the system.
- Help for Developers:
Despite a more structured environment and tools to support it,
programmers often failed to address the demands of international development
and inserted English text directly into HTML
or into Java code that generated HTML
(and this was true for both monolingual and bilingual programmers).
Writing language-independent code is difficult,
especially when the code must also be display-independent
(all styles were in another set of configuration files).
Instead, we suggested that programmers develop
HTML with embedded English, and an expert would replace it with entities.
For Java code, we wrote Perl scripts to detect English
that needed replacement, and we provided Java methods to look for
values that are submitted in the language currently in use
(e.g., a submit button to start a search would be
Search in English,
Buscar in Spanish,
or
Rechercher in French).
To help ensure valid English files,
and later valid Spanish and French files,
we wrote Perl scripts to
check the files for every anticipated problem.
As unanticipated problems were discovered,
new checks were added to the scripts.
- English as a Second Language:
Although our first experience with translation
was from English into other languages,
the new effort for FirstSearch 5.0 was to
translate into English first.
That is, we were able to review and change the terminology
used throughout the system, in part because of the
structured system we used for naming entities.
For example, we knew that there would be a search index
to look up articles by the title of the publishing periodical.
Its name is &Lang.index.source;, but the English version
could be magazine, journal, or periodical title.
Because everyone used the entity,
global changes were possible,
and marketing, documentation, and others were able to participate
in the design of the language.
One interesting result is that we now have
internal names for many parts of the system,
but these are not the names used outside the system
(e.g., history was renamed to previous searches).
- Previous Translator Experience:
The translators already had experience with the previous system,
so despite changes in terminology, they had already settled on many of the
most appropriate translations.
- Translation Tools:
The conversion of the entity files into a multi-column Microsoft Word
file that highlighted changed and new entities
allowed the translators to work more efficiently
than in the old translation environment.
The translators were able to use the table to
do global search-and-replace operations
and, as a look-up device,
to ensure consistency in translations of help files and related publications.
The table has also allowed translators to continue work when the
development system is not available.
Review of the interface has also
been more efficient since the table is first reviewed and then a quick
pass is made over the screens themselves, whereas in the previous
version, reviewers were dependent on reviewing every screen and message
displayed by the system.
This improved translation process should make it
possible to add a new language version of the interface within 2-3
months.
Additionally, the use of SGML/XML in creating help and documentation
content files has allowed us to reuse text and reduce costs and review
time.
Because the new version of FirstSearch is customized for user
experience level (basic, advanced, expert)
and databases used (over 80), the need for effective language
control and efficiency and economy in the translation process is far greater.
- Translator Knowledge of HTML:
Our translators were knowledgeable about HTML,
which was critical because some text to be translated contained
many entities
(including language entities, display format entities, and standard HTML).
The translators were able to move parts of HTML in ways that maintained the structure
while adapting the grammar to the target language.
Still, there were some problems in the translations, such as
slightly broken HTML and numerical entities that were translated
when they should have been left unchanged
(e.g., &year; translated to &ano;).
We developed Perl scripts to check the HTML after translation to
automatically detect as many problems as possible.
A document provided meta-data about 40 different section types
in the language files, indicating for variables in a section:
where they are used;
how many words/characters are expected (because of screen constraints);
whether to use mixed case, lower case, or a sentence;
and whether HTML tags are allowed (e.g., tags in page titles
appeared uninterpreted in window titles).
These details helped translators avoid translations that
would not fit or would look bad.
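A check of this kind can be sketched as follows. The function compares the entities in the English and translated values and applies two of the section constraints described above (a word limit and whether HTML tags are allowed). The function names, parameters, and sample strings are assumptions for illustration; standard HTML character entities such as &eacute; are ignored so that accented letters in a translation are not reported as errors.

```python
import re
from html.entities import name2codepoint

ENTITY_NAME = re.compile(r"&([A-Za-z][A-Za-z0-9_.]*);")


def language_entities(text):
    """Named entities in text, excluding standard HTML character entities."""
    return sorted(
        name for name in ENTITY_NAME.findall(text) if name not in name2codepoint
    )


def check_translation(english, translated, max_words=None, allow_html=True):
    """Return a list of problems found in one translated value."""
    problems = []
    source = language_entities(english)
    target = language_entities(translated)
    if source != target:
        problems.append(f"entities differ: {source} vs {target}")
    if max_words is not None and len(translated.split()) > max_words:
        problems.append(f"more than {max_words} words")
    if not allow_html and re.search(r"<[^>]+>", translated):
        problems.append("HTML tags not allowed in this section")
    return problems


# The renamed entity from the example above is reported.
renamed = check_translation("&copy; &year; OCLC", "&copy; &ano; OCLC")
# An accented letter written as a character entity is not.
accented = check_translation("Results for &year;", "R&eacute;sultats pour &year;")
```

In practice the word limit and the HTML flag would come from the per-section meta-data rather than being passed by hand.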
- Help for Writers:
In addition to natural language entities,
we generated a special language that defined each entity
as its section and variable name
(e.g.,
&Lang.index.subject;,
which displayed as Sujet in French,
would be defined as
index~subject
to indicate that the item was defined in the subject
variable in the index section).
This helped translators locate the source of an entity on their screens.
The feature was also used by help writers so that they could
include entities for text instead of the text
(e.g., include
&Lang.pagename.history;
instead of Previous Searches).
The section headings in help files were used as index keywords
for a
Latin-1-aware search of the help documents.
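The special locator language described above can be derived mechanically from any existing language table. This is a minimal sketch that assumes the table is a nested dictionary keyed by section and variable name; the function name and the dictionary layout are hypothetical.

```python
def locator_language(language):
    """Build a pseudo-language whose values name their own source."""
    return {
        section: {name: f"{section}~{name}" for name in entries}
        for section, entries in language.items()
    }


# Hypothetical fragment of the French table.
FR = {"index": {"subject": "Sujet", "author": "Auteur"}}
locator = locator_language(FR)
```

Selecting this pseudo-language in a session would show index~subject wherever the French interface shows Sujet.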
- Checking HTML Before Translation:
Our help files were translated from HTML files,
and we found that any HTML errors in the source were
maintained in the translation.
To avoid that problem in the future,
we wrote Perl scripts to detect HTML problems
before help files are translated.
These scripts were also useful for locating problems in the translated files.
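One kind of problem such a script might detect is unbalanced tags. The sketch below uses Python's standard html.parser module as a stand-in for the original scripts, whose actual checks are not described here. It is intentionally strict: it reports closing tags that HTML allows to be omitted (such as a missing closing p tag), which suits files that must survive hand editing by translators.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Track open tags and record closing tags that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def check_html(text):
    """Return a list of tag-balance problems found in text."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    return checker.problems + [f"unclosed <{tag}>" for tag in checker.open_tags]
```

Running the same function over a help file before and after translation separates errors inherited from the source from those introduced during translation.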
- Dynamic Updates:
The language files are all loaded into a server
that is shared by multiple user sessions;
language switching is simply a change of a single pointer.
One advantage of storing all the language entities
in configuration files is that they can be modified easily
without any need for compilation, just a reload of the
changed file.
This feature was used extensively by our
marketing department while developing the terminology
to be used in the English version.
They would edit en.ini
and click on a re-load button
to see their changes in a working environment.
We have not yet automated the translation process in
Figure 1
to make this available to the translators;
instead, they may need to wait at least a few minutes,
and usually hours, for someone at OCLC to reload their changes.
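The shared-server arrangement can be sketched with a registry of loaded language tables and sessions that hold only a language code. The class names, the button section, and the replacement value are hypothetical; the Search, Buscar, and Rechercher values are the ones mentioned earlier in this paper.

```python
class LanguageRegistry:
    """Shared store of loaded language tables, keyed by language code."""

    def __init__(self):
        self._tables = {}

    def load(self, code, table):
        # Replacing the entry swaps in the new table in one step, so a
        # reload needs no recompilation and no session restart.
        self._tables[code] = dict(table)

    def get(self, code):
        return self._tables[code]


class Session:
    """A user session; its language code is its only language state."""

    def __init__(self, registry, code="en"):
        self.registry = registry
        self.code = code

    def text(self, section, name):
        return self.registry.get(self.code)[section][name]


registry = LanguageRegistry()
registry.load("en", {"button": {"search": "Search"}})
registry.load("es", {"button": {"search": "Buscar"}})
registry.load("fr", {"button": {"search": "Rechercher"}})

first = Session(registry)
second = Session(registry)

before = first.text("button", "search")
first.code = "es"                      # language switch: one assignment
switched = first.text("button", "search")

registry.load("en", {"button": {"search": "Find"}})  # simulated reload of en.ini
reloaded = second.text("button", "search")
```

Because sessions look up text through the registry on every request, a reloaded file is visible to all sessions immediately.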
While most of our experiences were positive,
especially compared to the first experience,
there are still
areas for future improvement:
- Complete the implementation of the database system
that will track the versions of help/documentation content files
and coordinate translation variants at the paragraph level.
This will improve the efficiency and consistency of the translations,
both for online material and for print media.
- Complete usability testing and user feedback surveys.
- Provide a where-used capability for the entity strings,
to supplement the where-defined capability now available.
- Complete the translation of the search engine so that
all operators and index labels are transparently translated.
- Complete implementation of the SGML-to-HTML conversion process
to reduce tagging errors introduced by translators.
- Continue to develop safeguards/discipline to ensure that developers have
defined all interface text as entities.