World Views

Open Communication Requires Unicode

At this week’s Sun Engineering Conference, my contribution was a call to use Unicode everywhere.

What’s new about that? Hasn’t Sun been using Unicode for years? Yes, of course we have, because Sun requires all revenue products to be internationalized, and using Unicode is usually the first step in internationalizing modern software. Java has been based on Unicode since version 1.0, Solaris offers a wide range of UTF-8 locales (UTF-8 is a Unicode character encoding), StarOffice uses Unicode for text processing, and so on.

The problem lies in the systems that we use to communicate with customers and other partners but that aren’t considered products. These software tools and web applications often don’t use Unicode and so impose arbitrary restrictions on the languages that can be used. That’s bad because Sun has partners worldwide and needs to speak their languages and listen in their languages.

One negative example is our bug tracking system. This is not a creaky relic from the last millennium, but a brand-new system, developed within Sun, deployed in summer 2004, and using the whole range of modern software technologies. The main front end for Sun-internal use is a Java Web Start-deployed Swing application, which of course lets you input characters in any of the 14 writing systems supported by the Java platform. But if you try to save a bug report that contains, say, Chinese characters, you get this lovely alert:

Alert: One or more characters in the field Note: Description are not in the extended-ASCII character range. Please remove those characters.

The reason is that the back end system has been configured to use the ISO 8859-1 character encoding, which restricts all text to English and a few other western European languages. Text in any other language cannot be stored in any text field. The only workaround is to store it as a binary attachment, which makes it inaccessible to search and difficult to access in general.
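To make the restriction concrete, here is a minimal Java sketch of what an ISO 8859-1 back end can and cannot hold. It is not taken from the bug tracking system; the class and method names (Latin1Check, fitsLatin1, roundTrip) are made up for illustration, and only standard java.nio.charset calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Latin1Check {

    // True if every character of the text has a representation in ISO 8859-1.
    static boolean fitsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Encodes the text to bytes in the given charset and decodes it again,
    // roughly what happens when text is stored in and read back from
    // a system configured for that encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String german = "Gr\u00fc\u00dfe";   // "Grüße"
        String chinese = "\u4e2d\u6587";     // "中文"

        System.out.println(fitsLatin1(german));   // true
        System.out.println(fitsLatin1(chinese));  // false

        // String.getBytes substitutes '?' for characters the charset cannot encode.
        System.out.println(roundTrip(chinese, StandardCharsets.ISO_8859_1)); // ??

        // With UTF-8, the same text survives unchanged.
        System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese)); // true
    }
}
```

The point of the sketch is that the loss happens at the encoding step: once the text has been turned into ISO 8859-1 bytes, the original characters cannot be recovered.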

As part of opening up its development processes, Sun also offers a front end for public use, the bugs.sun.com web site, which allows anybody to submit and track bug reports against a number of Sun products. Following the lead of the bug database, this web site also uses ISO 8859-1 for all text. In this case, if a user includes non-Latin text in a bug report, she usually won’t even get an alert – the browser will just silently convert the text to question marks or, if we’re lucky, to numeric character references, making it rather difficult for engineers to understand what the bug is about. The web site does not accept or display attachments, so the workaround for the internal front end is incompatible with the public front end.
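For readers who haven’t seen numeric character references in a bug report, the following Java sketch imitates the "lucky" case: characters outside ISO 8859-1 are replaced by their decimal code points in &#…; form. This is only an approximation of what browsers do when submitting a form, written under that assumption; the class and method names (NumericReferences, toNumericReferences) are hypothetical.

```java
// Hypothetical helper, for illustration only.
class NumericReferences {

    // Replaces every code point outside the ISO 8859-1 range (U+0000..U+00FF)
    // with a decimal numeric character reference such as &#20013;.
    static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp <= 0xFF) {
                out.appendCodePoint(cp);                   // representable: keep as is
            } else {
                out.append("&#").append(cp).append(';');  // not representable: escape
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "Bug in 中文 text"
        System.out.println(toNumericReferences("Bug in \u4e2d\u6587 text"));
        // prints: Bug in &#20013;&#25991; text
    }
}
```

The text is technically recoverable in this form, but an engineer reading the report sees rows of numbers instead of the characters, and a report that legitimately contains the literal string "&#20013;" can no longer be told apart from one that contained the character itself.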

The reason given for the restriction to ISO 8859-1 is that all Sun employees speak English, and therefore internationalization isn’t necessary. This obviously ignores that a bug tracking system isn’t just about Sun employees communicating with each other; it’s about customers and developers communicating with Sun employees about problems that customers have when processing their data. The data can be in any language that customers use, and so the bug tracking system needs to be able to represent text in any language. Removing “those characters” may make it impossible to investigate and fix a bug.

The language in which customers, developers and Sun employees communicate about bugs is a separate issue from the data. Sun obviously prefers such communication to be in English, because it makes it easier for us to pass the information around within Sun. However, if a customer speaks only Thai and runs into a problem in processing Thai text using Sun products, wouldn’t we prefer that she submit a bug report in Thai rather than immediately switching to a competitor’s product? Most Sun engineers don’t speak Thai, but some do, so we can get help if necessary – if the text survives in the bug tracking system.

The bug tracking system isn’t the only tool that obstructs communication with unnecessary technical restrictions. Other examples include the software behind the Java developer forums, which corrupts non-ASCII text and so makes it difficult for developers to discuss internationalization problems, and the developer feedback forms, which use ISO 8859-1 and so block feedback in non-Latin languages.

Sun has realized that open communication with customers, developers, and other partners is critical to its future success. Limiting the communication to English or a small set of other western European languages means limiting our success worldwide. That’s why using Unicode everywhere is an important first step. The site on which this blog appears, blogs.sun.com, shows the way: It enables daily communication between Sun and the world in Chinese, English, and Japanese; and occasional blogs in French, Hungarian, German, and Korean show that more is possible. All components of blogs.sun.com, of course, use Unicode.

2005-02-27 (Sunday) – Comments [6]
Comments:

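[Trackback] I think quite a few Japanese developers are unhappy that the Java Forum and the Bug Database can only be written in English. Norbert, a core member of the Java internationalization team, has written about a related issue on his blog. - The reason that, at present, input is basically possible only in English is that the discussion information itself should be shared worldwide, and for that there is no choice but to communicate in English; but for problems specific to Japanese, for example a particular kanji being displayed incorrectly...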

Posted by Naoki Ishihara's Blog : Crying over spilt java milk (chats) on February 28, 2005 at 10:24 AM PST #

I notice that the date format at the bottom of your postings is more international than I see sometimes. I.e. you use yyyy/mm/dd whereas if you used mm/dd/yyyy then that would be ambiguous for those of us in the UK who use dd/mm/yyyy. Is this a deliberate decision by you, or a happy byproduct of your blogging software? I also notice the date format on the comment uses the month name, which is less confusing, but obviously English-centric. PS If you're interested in Internationalisation you might be interested in http://scripts.sil.org which shows work done by our sister organisation. "Our" is Wycliffe Bible Translators - http://www.wycliffe.net

Posted by Paul Morriss on March 02, 2005 at 07:31 AM PST #

You mention the Java development forums. The internationalization problems are only one of many problems introduced by the recent so-called "improvements". If anybody at Sun is interested in repairing these bugs (which I very much doubt) they may contact me at the e-mail address on this comment.

Posted by Paul Clapham on March 02, 2005 at 08:10 AM PST #

Just to add another observation: Even someone who does speak and write in English might have non-Western characters in her name or her company's name. Or a bug report written in English might need to include an error message or filename in the user's local script. Even "legitimate" English messages must be able to contain international characters.

Posted by M. Brubeck on March 02, 2005 at 08:25 AM PST #

Paul M: Where possible, I use the ISO 8601 / RFC 3339 date format. It’s designed to be unambiguous (although possibly unfamiliar) to users worldwide. I have no control over the date format used for comments. I'm familiar with SIL, especially their Ethnologue database.

Paul C: I’ve forwarded your comments in the right direction.

Matt: Error messages and filenames are two good examples for what I mean by data that users need to process. Names are another good reason for using Unicode: I’ve heard complaints from Japanese coworkers that the information in the database doesn’t let them correctly address Japanese customers because it’s not always possible to guess the correct kanji from the romanized form of a name.

Posted by Norbert on March 02, 2005 at 02:07 PM PST #

An additional problem that I've encountered is when you move beyond allowing people to post on the site to allowing them to upload structured files (csv, tab-delimited, etc.) to a website for text processing and display. OSes and particularly individual software apps seem to use wildly inconsistent ... meaning that when you try to get the text from the file, you get inconsistent results, particularly when the text has special characters such as é, ™, ®, etc. etc. Have you found any solutions to issues like this? Please let me know!

Posted by John Koetsier on March 03, 2005 at 08:41 AM PST #

  © World Views. All rights reserved.