World Views

Eliminating the Digital Divide in Java

Scott McNealy recently wrote about the community effort needed to eliminate the “digital divide” and will give a related presentation at JavaOne.

Software globalization is, of course, one of the critical pieces in this effort. A language barrier is a pretty effective divider. If software isn’t capable of rendering text and accepting input in the user’s language(s), it’s not very useful. If the user can’t understand the user interface because it’s in a foreign language, her use of the software will be limited as well. And if the software doesn’t fit into the cultural, legal, and business environment of its intended users, it may not matter how cheap it is.

So I’d like to survey where we stand with Java internationalization and localization, how we enable the community to contribute, and what the remaining issues are.

One acronym you’ll see several times is “SPI,” for Service Provider Interface. By this we mean public interfaces in the Java platform APIs that let third-party developers extend the functionality of the Java runtime through new classes (and some identifying information) that are installed into the runtime’s extension directory. SPIs are one way to enable the community to provide support for additional languages.

Unicode

The foundation of Java globalization is the Unicode character set – the Java platform now supports Unicode 4.0. Unicode has always endeavored to include all languages used on planet Earth, but a number of writing systems still have not been encoded. Encoding a writing system requires detailed knowledge about its use in real life as well as about how it would be processed in software, so it’s often difficult work. Supporting the Script Encoding Initiative may be the best way to help.

Character Encoding Conversion

While the Java runtime uses Unicode (more precisely, UTF-16) internally, much of the world’s data is stored in other character encodings. The JRE already supports a long list of character encodings, but the Java platform also provides an SPI that lets developers add any other encoding that may be needed.
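As a rough illustration of the conversion described above, the sketch below decodes bytes stored in ISO-8859-1 into a Java string and re-encodes them as UTF-8 using the standard java.nio.charset.Charset class. The class and method names (EncodingConversionSketch, decode, encode) are made up for this example; it doesn’t show how to write a new encoding for the SPI.

```java
import java.nio.charset.Charset;

class EncodingConversionSketch {
    // Decodes bytes stored in the named encoding into a Java (UTF-16) string.
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    // Encodes a Java string into the bytes of the named encoding.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The German word "Gruesse" with u-umlaut and sharp s, as ISO-8859-1 bytes.
        byte[] latin1 = { 0x47, 0x72, (byte) 0xFC, (byte) 0xDF, 0x65 };
        String text = decode(latin1, "ISO-8859-1");
        byte[] utf8 = encode(text, "UTF-8");
        System.out.println(latin1.length + " ISO-8859-1 bytes, " + utf8.length + " UTF-8 bytes");
        // The size of this list depends on the runtime and any installed providers.
        System.out.println(Charset.availableCharsets().size() + " charsets available");
    }
}
```

Encodings added through the SPI should show up in Charset.availableCharsets() alongside the built-in ones, so application code like this wouldn’t need to change.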

Locale and Currency Identification

The Locale class currently is based on the ISO standard 639-1 for languages and 3166 for countries. ISO 639-1 covers about 200 of the most important languages, but estimates for the total number of languages on planet Earth range into the thousands (many of them near extinction). There’s still plenty of work to do to support just the ISO 639-1 languages, but using the three-character language codes of ISO 639-2 or an extensible standard such as RFC 3066 (or its successor) may eventually be necessary to enable even broader coverage. For countries and currencies, the situation is simpler: The JRE knows all of them.
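To make the identifiers a bit more concrete, here is a small sketch that builds a locale from ISO 639-1 and ISO 3166 codes and looks up a country’s currency. It only uses long-standing java.util.Locale and java.util.Currency methods; the class name LocaleCodesSketch and the helper currencyCodeFor are invented for the example.

```java
import java.util.Currency;
import java.util.Locale;

class LocaleCodesSketch {
    // Looks up the ISO 4217 code of the currency used in the locale's country.
    static String currencyCodeFor(Locale locale) {
        return Currency.getInstance(locale).getCurrencyCode();
    }

    public static void main(String[] args) {
        // Two-letter ISO 639-1 language code plus two-letter ISO 3166 country code.
        Locale maltese = new Locale("mt", "MT");
        System.out.println(maltese + ": language=" + maltese.getLanguage()
                + ", country=" + maltese.getCountry());
        // The three-letter code is available for lookup, but it isn't the locale's identity.
        System.out.println("three-letter language code: " + maltese.getISO3Language());
        System.out.println(Locale.getISOLanguages().length + " language codes known");
        System.out.println(Locale.getISOCountries().length + " country codes known");
        System.out.println("currency of Japan: " + currencyCodeFor(Locale.JAPAN));
    }
}
```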

Date and Time Handling

The Calendar class was intended to enable support for all calendars used in the world, but it turned out that its design was hard to understand, difficult to subclass correctly, and not extensible enough for complete coverage (the Balinese calendars, for example, just don’t fit into the mold). Support for the world’s calendars has therefore been slow in coming: The JRE provides only the Gregorian, Thai Buddhist, and – starting with JRE 6 – Japanese calendars. A complete solution will likely require a new API/SPI combination. In the meantime, the ICU4J library provides a separate, somewhat incompatible Calendar class with several additional calendar implementations.

For time zones, the situation is simpler: The JRE knows all of them. However, there is a little problem with keeping the information up to date: Politicians in some countries like to tinker with the daylight saving rules, often with little advance notice, and so the time zone rules in the JRE don’t always match reality. More frequent updates of the JRE time zone data may be one solution; productizing the tool that we use to update the data may be another.
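If you want to see what your runtime currently believes about a time zone, something like the following sketch may help. It relies only on java.util.TimeZone; the class name TimeZoneSketch and the helper isKnownZone are made up, and “Mars/Olympus_Mons” is a deliberately nonexistent ID.

```java
import java.util.Date;
import java.util.TimeZone;

class TimeZoneSketch {
    // Returns true if the runtime's time zone data contains the given ID.
    static boolean isKnownZone(String id) {
        for (String known : TimeZone.getAvailableIDs()) {
            if (known.equals(id)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TimeZone tz = TimeZone.getTimeZone("America/Los_Angeles");
        System.out.println(tz.getID() + " uses daylight saving time: " + tz.useDaylightTime());
        // This answer is only as good as the rules shipped with the runtime.
        System.out.println("in daylight saving time now: " + tz.inDaylightTime(new Date()));
        // Unknown IDs silently fall back to GMT, so check before trusting the result.
        System.out.println(isKnownZone("Mars/Olympus_Mons")
                + " / " + TimeZone.getTimeZone("Mars/Olympus_Mons").getID());
    }
}
```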

Names of Languages, Countries, Time Zones, and Currencies

The JRE has traditionally provided complete sets of these names and symbols for about 10 languages, and smaller sets for another 30 or so languages. A new SPI in Java SE 6 enables third parties to provide more. As I mentioned in my blog about this SPI, there’s the idea of creating an extension that uses the SPI to support all locales that the Unicode Consortium’s Common Locale Data Repository provides but the JRE doesn’t. This means that community members who want to extend the set of supported locales may have two ways to contribute: Directly by implementing an extension using the SPI, or indirectly by contributing data to the CLDR.
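Here is a minimal sketch of how applications ask for these names. The methods used (getDisplayLanguage, getDisplayCountry, Currency.getSymbol, TimeZone.getDisplayName) are standard API; the class and helper names are invented, and the exact strings printed depend on the locale data your runtime has.

```java
import java.util.Currency;
import java.util.Locale;
import java.util.TimeZone;

class DisplayNamesSketch {
    // Name of a language, localized for the given user interface locale.
    static String languageName(Locale language, Locale inLocale) {
        return language.getDisplayLanguage(inLocale);
    }

    // Name of a country, localized for the given user interface locale.
    static String countryName(Locale country, Locale inLocale) {
        return country.getDisplayCountry(inLocale);
    }

    public static void main(String[] args) {
        Locale[] uiLocales = { Locale.ENGLISH, Locale.FRENCH, Locale.JAPANESE };
        TimeZone berlin = TimeZone.getTimeZone("Europe/Berlin");
        Currency euro = Currency.getInstance("EUR");
        for (Locale ui : uiLocales) {
            System.out.println(ui + ": "
                    + languageName(Locale.GERMAN, ui) + ", "
                    + countryName(Locale.FRANCE, ui) + ", "
                    + euro.getSymbol(ui) + ", "
                    + berlin.getDisplayName(false, TimeZone.LONG, ui));
        }
    }
}
```

When the runtime has no name for the requested locale, these methods fall back to other data or to the raw codes, which is the gap that additional locale data is meant to fill.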

Text Processing

The text processing functionality in the java.text package supports about 100 locales (more in JRE 6). The discussion of the SPI and CLDR in the previous section applies to this functionality as well.
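For readers who haven’t used the java.text package much, the following sketch shows two of its locale-sensitive services, number formatting and collation. NumberFormat and Collator are standard classes; TextProcessingSketch and its helper methods are names made up for this example.

```java
import java.text.Collator;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Locale;

class TextProcessingSketch {
    // Formats a number with the grouping and decimal separators of the locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Returns a copy of the words, sorted by the locale's collation rules.
    static String[] sortFor(Locale locale, String... words) {
        String[] sorted = words.clone();
        Arrays.sort(sorted, Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(formatNumber(1234567.891, Locale.US));      // 1,234,567.891
        System.out.println(formatNumber(1234567.891, Locale.GERMANY)); // 1.234.567,891
        // A plain String sort would put the word starting with A-umlaut after "Zebra".
        String[] words = { "Zebra", "\u00C4pfel", "Apfel" };
        System.out.println(Arrays.toString(sortFor(Locale.GERMAN, words)));
    }
}
```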

Text Input

Text input in the Java platform typically relies on the host operating system. For cases where the host OS’s facilities are insufficient, the input method engine SPI can be used to implement both full-fledged input methods and simple keyboard remappers. The JRE itself comes with input methods for Thai and Devanagari, Naoto’s article provides a few more, and several third-party input methods exist.

Text Rendering and Editing

The JRE currently supports 11 of the Unicode standard’s 60 writing systems. Extending this set is where the really hard problems are. Text rendering and editing are complex processes that don’t lend themselves to the creation of SPIs. They also require extensive testing, which is hard to automate. Testing has been the main bottleneck so far. Some of the currently “supported” writing systems don’t receive any systematic testing. Several additional writing systems (including the ones supported by our Indic input method) are implemented, but we don’t document them as supported because they’re complex and the risk of failures is too high. Community help therefore would be most useful in the area of testing.

User Interface Localization

The JRE’s user interface is currently provided in ten languages, the JDK tools in three. To enable others to provide additional localizations, we’re currently evaluating whether we can document the main user interface resource bundles and how to create and install additional ones. This wouldn’t be the same as an SPI – resource bundles are easier to create, but there would be no guarantee of compatibility from release to release as there is for an SPI.

Documentation and Community Interaction

Sun currently provides documentation about the Java platform primarily in English, with significant amounts translated into Japanese, small amounts into Chinese and Korean, and nothing in other languages. Input from the community is pretty much only accepted if it comes in English. It’s been recognized that these are serious problems, and some efforts are underway to provide more documentation particularly in Chinese. However, much more is needed to engage developers worldwide and forge a global community without language barriers.

2005-06-27 (Monday)

Locale Sensitive Services SPI

OK, the name of this new feature in Java SE 6 has become more complicated than we’d like it to be. What this “Locale Sensitive Services SPI” really means is that third parties can now provide implementations of most locale sensitive classes in the java.text and java.util packages for locales that aren’t yet supported by the Java runtime. “SPI” stands for “service provider interface,” a pattern that some developers are already familiar with from developing character converters, input methods, or other extensions. In this pattern, the Java platform provides interfaces or abstract classes that a developer can implement or subclass. The implementations or subclasses are then packaged with some descriptive information and installed into the Java runtime’s extension directory, and become available to applications through factory methods of the Java platform.

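The locale-specific implementations that can be provided through the locale sensitive services SPI are:

  • BreakIterator objects
  • Collator objects
  • DateFormat objects
  • DateFormatSymbols objects
  • DecimalFormatSymbols objects
  • NumberFormat objects
  • Localized names for Currency, Locale, and TimeZone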

Each of the classes mentioned above now has a corresponding abstract provider class in the java.util.spi or java.text.spi package. Concrete provider classes advertise the locales they support through their getAvailableLocales methods; these locales get added to the lists returned by the getAvailableLocales methods of the corresponding API classes. If an application requests an object for a locale that the JRE doesn’t support, a provider that supports the locale is asked to provide the necessary object.
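To give an idea of what a provider looks like, here is a minimal sketch that subclasses java.util.spi.LocaleNameProvider. Everything specific in it is an assumption made for illustration: the class name, the made-up locale xx_YY, and the placeholder strings standing in for real translations. A real provider would also need to be a public class with a public no-argument constructor and be named in a META-INF/services/java.util.spi.LocaleNameProvider file inside the installed JAR.

```java
import java.util.Locale;
import java.util.spi.LocaleNameProvider;

// Hypothetical provider for a made-up locale; not part of any real product.
class PlaceholderLocaleNameProvider extends LocaleNameProvider {
    // "xx" and "YY" are invented codes standing in for a locale the JRE lacks.
    static final Locale SUPPORTED = new Locale("xx", "YY");

    // Advertises the locales for which this provider can supply names.
    @Override
    public Locale[] getAvailableLocales() {
        return new Locale[] { SUPPORTED };
    }

    @Override
    public String getDisplayLanguage(String languageCode, Locale locale) {
        requireSupported(locale);
        // Placeholder "translation"; returning null means no name is available.
        return "en".equals(languageCode) ? "[xx] English" : null;
    }

    @Override
    public String getDisplayCountry(String countryCode, Locale locale) {
        requireSupported(locale);
        return "MT".equals(countryCode) ? "[xx] Malta" : null;
    }

    @Override
    public String getDisplayVariant(String variant, Locale locale) {
        requireSupported(locale);
        return null; // no variant names in this sketch
    }

    private static void requireSupported(Locale locale) {
        if (!SUPPORTED.equals(locale)) {
            throw new IllegalArgumentException("unsupported locale: " + locale);
        }
    }

    public static void main(String[] args) {
        // Called directly here; once installed, the runtime calls the provider instead.
        LocaleNameProvider provider = new PlaceholderLocaleNameProvider();
        System.out.println(provider.getDisplayLanguage("en", SUPPORTED));
    }
}
```

Once such a provider is installed, an application wouldn’t call it directly; it would keep calling Locale.getDisplayLanguage and friends, and the runtime would consult the provider for the locales it advertises.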

An alternative to using an SPI might have been to simply document the format of the resource bundles used in implementing the java.text and java.util packages. This however would have required freezing the current format of the resource bundles, and we weren’t quite ready for that (in fact, we did redesign it for JDK 6). It would also have prevented the implementation of objects such as collators and break iterators that may require locale-specific code or data that doesn’t fit into the existing resource bundle format. Using an SPI gives developers complete freedom in the design of their implementations.

We also considered importing the complete data of the Unicode Consortium’s Common Locale Data Repository instead of providing an SPI. The CLDR covers about twice as many locales as Sun’s current JRE. However, importing the complete CLDR data would have added substantially to the JRE download size while still not helping users of locales that are not (yet) supported in the CLDR. Instead, we only used CLDR for a few additional locales. We think however that it would be worthwhile creating an extension that uses the SPI to support all locales that the CLDR provides but the JRE doesn’t. If you’re interested in helping with such a project, please let us know.

2005-06-27 (Monday)

Farewell to the “2”

It’s official: The next version of the Java platform won’t have a “2” in its name anymore. The successor to Java™ 2 Platform, Standard Edition 5.0 (J2SE 5.0) is going to be Java™ Platform, Standard Edition 6 (Java SE 6). Similarly, new versions of J2EE and J2ME will be Java EE and Java ME. The name change doesn’t affect currently shipping products, so it’ll be a long transition. But reducing the number of numbers in the product name is clearly a change for the better.

2005-06-27 (Monday)

More Control over ResourceBundle

Did you ever wish you had a little more control over the ResourceBundle class? Say, have it instantiate bundles from XML files or from data in a database, rather than just class and properties files? Or, to the contrary, have it look only for properties files, because you never use class-based resource bundles? Or have it reload a bundle with a little fix without having to restart your web application, which otherwise is on the way towards reaching 99.999% uptime? Well, you just got that little more control: The ResourceBundle.Control class in JDK 6.

The central method in ResourceBundle has always been getBundle. This method looks for resource bundles of predefined types in predefined places using a predefined search strategy, loads them in predefined ways, and caches them in a largely unspecified way.

The idea of the Control class is to expose every major step of the bundle loading process as a separate method that can be overridden and customized. The Control class itself implements the methods so that using it directly results in the same behavior as in previous releases. But applications can subclass it, override as many methods as necessary to implement the behavior they need, and pass an instance of this subclass to new getBundle methods that accept Control objects.

For some of the most common cases, you don’t actually need to write your own subclass: If you just want to use only class-based or only properties-based resource bundles, or if you want to avoid the fallback to the default locale, the getControl and getNoFallbackControl methods provide you with ready-made instances.
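A quick sketch of the ready-made instances: the factory methods and the FORMAT_PROPERTIES and FORMAT_DEFAULT constants are part of ResourceBundle.Control, while the class name ReadyMadeControlsSketch and the bundle name “Messages” are just assumptions for the example.

```java
import java.util.Locale;
import java.util.ResourceBundle;
import java.util.ResourceBundle.Control;

class ReadyMadeControlsSketch {
    // A Control that only looks for properties files, never for bundle classes.
    static Control propertiesOnly() {
        return Control.getControl(Control.FORMAT_PROPERTIES);
    }

    // A Control with the default formats that never falls back to the default locale.
    static Control noFallback() {
        return Control.getNoFallbackControl(Control.FORMAT_DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println(propertiesOnly().getFormats("Messages"));
        System.out.println(noFallback().getFallbackLocale("Messages", Locale.FRENCH));
        // With a real bundle on the class path, you would pass the Control to getBundle:
        // ResourceBundle bundle = ResourceBundle.getBundle("Messages", Locale.FRENCH, noFallback());
    }
}
```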

Here are a few examples for how you can go further (the class description has more):

  • To search for bundles according to the list of languages that your web application received in an Accept-Language HTTP header, override the getFallbackLocale method to successively return the locales of the language list.
  • To load resource bundles from locale-specific directories rather than using locale-specific suffixes, override the toBundleName method to insert the locale ID components into the appropriate places of the bundle name.
  • To use bundles for Chinese/Taiwan as the parent bundles of Chinese/Hong Kong bundles in order to share traditional Chinese strings, override getCandidateLocales to insert the Chinese/Taiwan locale in the right place.
  • To ensure that cached bundles are checked against their source files on disk at least every 6 hours, override getTimeToLive to return 21,600,000 (6 hours in milliseconds). (The specification currently says that that’s the default behavior. Unfortunately, this uncovered a bug in a major third-party application that we test with, so in order to avoid incompatibilities, the default behavior will revert to TTL_NO_EXPIRATION_CONTROL, which reflects behavior in previous releases.)
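The last two items in the list above might look roughly like the following subclass. The overridden methods are the real Control methods; the class name SharedChineseControl and the choice to combine both overrides in one class are just for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

class SharedChineseControl extends ResourceBundle.Control {
    private static final Locale HONG_KONG = new Locale("zh", "HK");

    // Makes Chinese/Taiwan bundles the parents of Chinese/Hong Kong bundles.
    @Override
    public List<Locale> getCandidateLocales(String baseName, Locale locale) {
        if (baseName == null) {
            throw new NullPointerException("baseName");
        }
        if (locale.equals(HONG_KONG)) {
            return Arrays.asList(locale, Locale.TAIWAN, Locale.ROOT);
        }
        if (locale.equals(Locale.TAIWAN)) {
            return Arrays.asList(locale, Locale.ROOT);
        }
        return super.getCandidateLocales(baseName, locale);
    }

    // Cached bundles expire after 6 hours and are then checked for changes.
    @Override
    public long getTimeToLive(String baseName, Locale locale) {
        if (baseName == null || locale == null) {
            throw new NullPointerException();
        }
        return 6L * 60 * 60 * 1000; // 21,600,000 milliseconds
    }

    public static void main(String[] args) {
        SharedChineseControl control = new SharedChineseControl();
        // "Messages" is a hypothetical bundle name.
        System.out.println(control.getCandidateLocales("Messages", HONG_KONG));
        System.out.println(control.getTimeToLive("Messages", HONG_KONG));
    }
}
```

An instance of this class would be passed to one of the getBundle methods that accept a Control, in the same way as the ready-made instances.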

Some methods need to be overridden together. For example, if you override getFormats to return formats other than "java.class" and "java.properties", you also need to override newBundle to load bundles of these formats. If, in addition, you override getTimeToLive to enable checking whether a bundle in the cache is still up to date, you may also need to override needsReload to implement the checking.
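Here is a minimal sketch of overriding getFormats and newBundle together. To keep it self-contained, the made-up format “memory” reads from a static map that stands in for XML files or a database; the class name, the format name, and the bundle contents are all assumptions for the example.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.ResourceBundle;
import java.util.Set;

class InMemoryControl extends ResourceBundle.Control {
    // Hypothetical bundle store, keyed by bundle name as produced by toBundleName.
    private static final Map<String, Map<String, String>> STORE =
            new HashMap<String, Map<String, String>>();
    static {
        Map<String, String> root = new HashMap<String, String>();
        root.put("greeting", "Hello");
        root.put("farewell", "Goodbye");
        STORE.put("Messages", root);
        STORE.put("Messages_de", Collections.singletonMap("greeting", "Hallo"));
    }

    // Announces the only format this Control knows about.
    @Override
    public List<String> getFormats(String baseName) {
        if (baseName == null) {
            throw new NullPointerException("baseName");
        }
        return Collections.singletonList("memory");
    }

    // Creates a bundle for the new format, or returns null if there is none.
    @Override
    public ResourceBundle newBundle(String baseName, Locale locale, String format,
                                    ClassLoader loader, boolean reload) {
        if (!"memory".equals(format)) {
            return null;
        }
        final Map<String, String> data = STORE.get(toBundleName(baseName, locale));
        if (data == null) {
            return null;
        }
        return new ResourceBundle() {
            @Override
            protected Object handleGetObject(String key) {
                return data.get(key);
            }

            @Override
            public Enumeration<String> getKeys() {
                Set<String> keys = new HashSet<String>(data.keySet());
                if (parent != null) {
                    keys.addAll(Collections.list(parent.getKeys()));
                }
                return Collections.enumeration(keys);
            }
        };
    }

    public static void main(String[] args) {
        ResourceBundle bundle =
                ResourceBundle.getBundle("Messages", Locale.GERMAN, new InMemoryControl());
        System.out.println(bundle.getString("greeting")); // found in Messages_de
        System.out.println(bundle.getString("farewell")); // inherited from Messages
    }
}
```

Because this sketch keeps the default getTimeToLive, cached bundles never expire and needsReload never comes into play; a Control that enables expiration for such a format would probably have to override needsReload as well.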

If, instead of periodically checking whether resource bundles in the cache are still up to date, you’d rather remove all your application’s bundles from the cache when installing new bundles, you can use the new ResourceBundle.clearCache methods.

As always, we’d love to hear from you whether these API additions meet your needs, or what’s wrong with them. Please try them, file bugs, or send us your feedback.

2005-06-24 (Friday)

New Supported Locales

If you pay close attention to the return values of the getAvailableLocales methods in the java.text and java.util packages, you may have noticed that the lists got a little longer in recent builds of JDK 6. The additions to the already long list of supported locales in JRE 5.0 are:

Language                        Country       Locale ID
Chinese (Simplified)            Singapore     zh_SG
English                         Malta         en_MT
English                         Philippines   en_PH
English                         Singapore     en_SG
Greek                           Cyprus        el_CY
Indonesian                      Indonesia     in_ID
Japanese (Japanese calendar)    Japan         ja_JP_JP
Malay                           Malaysia      ms_MY
Maltese                         Malta         mt_MT
Spanish                         US            es_US

You may wonder how we picked these locales. OK:

  • One group is based on IT market sizes in the IDC Worldwide Black Book, from which we selected all countries above a certain market size. We also looked at the languages spoken within these countries, and tried to guesstimate IT market size for those. For example, 11% of the US population now speak Spanish at home, and at least in the consumer space companies can no longer afford to ignore this.
  • Then we added locales to cover all countries in the European Union with their main languages. Malta is a small country, but it’s a member of the EU, Maltese is an official EU language, and some business in the EU simply requires support for all EU languages. As it turns out, the EU added Gaelic to its official languages just as we finished the addition of these locales, so we’re already behind again.
  • The new Japanese locale finally is the way applications can access the new Japanese calendar, which is not exposed as a separate class in the API. To obtain an instance of the Japanese calendar, you’d use Calendar.getInstance(new Locale("ja", "JP", "JP")).
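For the Japanese calendar mentioned in the last item, a small sketch: it converts a Gregorian date into the calendar obtained for the ja_JP_JP locale and reads the era and the year within the era. The class and method names are invented for the example, and the numeric era values are an implementation detail that isn’t exposed as public constants.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

class JapaneseCalendarSketch {
    // Returns a Japanese calendar set to the given Gregorian date.
    // The month is zero-based, as with all Calendar fields (use Calendar.JUNE etc.).
    static Calendar japaneseCalendarFor(int year, int month, int day) {
        Calendar gregorian = new GregorianCalendar(year, month, day);
        Calendar japanese = Calendar.getInstance(new Locale("ja", "JP", "JP"));
        // Transfer the point in time rather than the field values,
        // because YEAR means "year within the era" in the Japanese calendar.
        japanese.setTime(gregorian.getTime());
        return japanese;
    }

    public static void main(String[] args) {
        Calendar cal = japaneseCalendarFor(2005, Calendar.JUNE, 24);
        System.out.println("calendar class: " + cal.getClass().getName());
        System.out.println("era: " + cal.get(Calendar.ERA));
        System.out.println("year in era: " + cal.get(Calendar.YEAR)); // 17 (Heisei 17)
    }
}
```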

The additions mark the first time that we use data from the Common Locale Data Repository for the JRE. The CLDR is maintained by the Unicode Consortium with contributions from IBM, Apple, Sun, and many others. We intend to continue to use CLDR for new locale data in the JRE. It’s an open question whether we should also update existing data from this source – it might fix some problems in our data and improve the alignment between the JRE and other software products, but it may also be seen as an unnecessary incompatibility. It’s pretty common for locale data to have two or more different versions that are all acceptable to users, while random changes between these versions are not acceptable. If you have an opinion on this issue, please let us know.

2005-06-24 (Friday)