I attended the SF PHP Meetup last night where Andrei Zmievski (PHP 6 release manager and PHP core team member) gave a talk on PHP 6 and internationalization (i18n).  It was good to hear that while PHP 6 has been in development for the past 2 years, it's very likely that we'll be seeing a release in early 2009, and definitely ahead of Perl 6, as Andrei joked.

The main feature of PHP 6 will be full Unicode support.  Or, as one of his slides so aptly stated:

PHP 6 = PHP 5 + Unicode

It was evident that Andrei and team have given quite a bit of thought to what i18n means for the PHP world, and as a result, PHP developers everywhere will soon be enjoying a new set of tools to enable faster development of multilingual sites.  My favorite example was a class whose method names were all defined in different languages, including one in Hebrew (written right to left)!  From a practical standpoint, many of the features are intelligent enough to handle common cultural issues such as proper sorting and date/number formatting.

The even better news is that most of these features will also be available for the upcoming PHP 5.3 release via PECL.  The intl module will be "backwards" compatible with PHP 5.3 since the classes expect UTF-8 encodings.  How you provide those strings is up to you.

One concern about PHP 6 is that since it will be entirely Unicode, strings will automatically double in size, meaning there will certainly be a performance hit.  So for now, I look forward to i18n with PHP 5.3 as well as the much needed namespaces.
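To put rough numbers on the size concern, here is a minimal sketch that compares how many bytes the same text takes in UTF-8 and in UTF-16. It uses Python purely for illustration, not PHP 6, and the sample strings and the encoded_sizes helper are made up for this example; how PHP 6 actually stores strings is up to its implementation.

```python
# Illustration only: Python stands in for PHP here, and the sample
# strings are hypothetical. This is not how PHP 6 stores strings.
samples = {
    "ascii": "Hello, world",
    "hebrew": "\u05e9\u05dc\u05d5\u05dd",        # "shalom", 4 letters
    "kana": "\u3053\u3093\u306b\u3061\u306f",    # "konnichiwa", 5 kana
}

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),                        # Python 3 counts code points
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),    # -le skips the 2-byte BOM
    )

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

In this sketch the plain ASCII string doubles in UTF-16 (12 bytes to 24), the Hebrew string is the same size in both encodings, and the kana string is actually smaller in UTF-16 than in UTF-8. So the cost depends heavily on what kind of text a site handles.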

Andrei's presentation is available on his site here.

Comments:

Whether strings will grow with Unicode support really depends on the implementation.

If PHP 6 uses UTF-8, then no, all strings will have exactly the same length as they have now. UTF-8 is directly backwards-compatible with US-ASCII, so it is generally preferred as the industry standard encoding for Unicode.

UTF-8 characters take a variable number of bytes (from 1 to 4 under the current standard, RFC 3629; the original design allowed longer sequences), so unless you type lots of Hebrew or Japanese Kana or Ancient Phoenician, your strings will remain the same size as always. :)

- Simon

Posted by Simon on July 13, 2008 at 07:19 AM PDT #

But will it support astral characters?

Unicode has 17 "planes", each with 65,536 characters. Most software that claims to support Unicode, even using terms like "full Unicode support", only supports characters from plane 0, the Basic Multilingual Plane. The "astral" characters are those from planes 1 through 16.

And how well is PHP going to hide its implementation details? Will trying to get the length of a string tell you how many characters are in the string, or will it do like most languages and instead tell you how many 8-bit (UTF-8) or 16-bit (UTF-16) units are used to encode those characters?
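To make the question concrete, here is a small sketch of how one astral character can have three different "lengths" depending on what is counted. It is written in Python only for illustration, and the length_in_units helper is hypothetical; it says nothing about what PHP 6's own string functions return.

```python
# Illustration only, in Python rather than PHP. The helper name is
# hypothetical and exists just for this example.
def length_in_units(text):
    """Count a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),                          # Python 3 str length
        "utf16_units": len(text.encode("utf-16-le")) // 2, # 2 bytes per unit
        "utf8_bytes": len(text.encode("utf-8")),
    }

bmp_char = "\u00e9"         # LATIN SMALL LETTER E WITH ACUTE, plane 0
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, plane 1

print(length_in_units(bmp_char))
print(length_in_units(astral_char))
```

The G clef is a single code point, but it needs a surrogate pair (two 16-bit units) in UTF-16 and four bytes in UTF-8. A length function that counts code units or bytes would report 2 or 4 for it instead of 1.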

It's great that PHP is increasing its support for Unicode, but most likely there will still be a very, very long way to go after PHP 6.

Posted by James Justin Harrell on July 13, 2008 at 09:43 AM PDT #

Good performance, bytes, codepoints, graphemes and characters ... IMHO Parrot and Perl 6 are designed with real Unicode support.

Posted by mj on July 13, 2008 at 10:29 AM PDT #

James,

PHP 6 supports all the planes, fully and transparently. strlen() and all similar functions operate on codepoints (not codeunits or bytes), so you will get the correct result back.

Simon,

PHP 6 uses UTF-16 internally.

Posted by Andrei Z on July 17, 2008 at 09:48 AM PDT #

This blog copyright 2008 by Wen Huang