Ienup's Weblog
Leaders wanted at OpenSolaris I18N & L10N community!
It feels like only yesterday that OpenSolaris launched, and yet within a couple of days we have already formed 15 subgroups under the OpenSolaris Internationalization and Localization community. They are:
- French
- German
- Hungarian
- Indic
- Internationalization (I18N)
- Italian
- Japanese
- Korean
- Portuguese Brazil
- Simplified Chinese
- Spanish
- Tamil
- Traditional Chinese
- Traditional Chinese Hong Kong
- Vietnamese
Many of them already have subgroup leaders, but some do not, so we are looking for subgroup leaders! Also, if your language isn't listed and you would like to see a subgroup for it in the Internationalization and Localization community so that OpenSolaris will speak your language (even better if Solaris already supports your language), please join us and take part in the party! We are also looking for global coordinators who will oversee and coordinate projects of a similar or common nature across multiple subgroups!
Thanks!
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Posted at 05:58PM Jun 17, 2005 by is in Solaris |
Secure UTF-8
OpenSolaris launched today, and since people can now access the OpenSolaris code, I thought it would be interesting to say a few things about illegal UTF-8 byte sequences and show how to avoid them in your program, with a couple of links to a real code example.
As you know, UTF-8 (Unicode/UCS Transformation Format 8) is the file code representation used in all Unix/Linux operating systems, especially for their Unicode/UTF-8 locales. It is also widely accepted and endorsed as a standard encoding form by numerous modern, major standards from standardization organizations such as IETF, W3C, Unicode, Inc., ISO/IEC, and so on.
Until rather recently (say, a few years ago), we had quite relaxed binary mappings between Unicode scalar values and UTF-8 byte values, as in the following table:
| Unicode Scalar Values in Binary | Hex Min | Hex Max | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|---|---|
| 00000000 00000000 0xxxxxxx | U+0000 | U+007F | 0xxxxxxx | | | |
| 00000000 00000yyy yyxxxxxx | U+0080 | U+07FF | 110yyyyy | 10xxxxxx | | |
| 00000000 zzzzyyyy yyxxxxxx | U+0800 | U+FFFF | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | U+10000 | U+10FFFF | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |
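To make the bit layout in this table concrete, here is a minimal sketch in C of a decoder that checks nothing but those bit patterns. The function name `loose_utf8_decode` is hypothetical and is not taken from the Solaris source. Because it looks only at the marker bits, it accepts more than one byte sequence for the same scalar value, which is exactly how loose this mapping is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Solaris code: decode one character using only
 * the bit patterns from the relaxed table. Returns the number of bytes
 * consumed and stores the scalar value in *out, or returns 0 if a byte
 * does not match the expected bit pattern or the input is too short.
 */
size_t
loose_utf8_decode(const uint8_t *s, size_t len, uint32_t *out)
{
    size_t need, i;
    uint32_t v;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return (0);

    if ((s[0] & 0x80) == 0x00) {            /* 0xxxxxxx */
        need = 1;
        v = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110yyyyy */
        need = 2;
        v = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110zzzz */
        need = 3;
        v = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110uuu */
        need = 4;
        v = s[0] & 0x07;
    } else {
        return (0);                         /* not a lead byte pattern */
    }
    if (len < need)
        return (0);

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* trailing bytes are 10xxxxxx */
            return (0);
        v = (v << 6) | (s[i] & 0x3F);
    }
    *out = v;
    return (need);
}
```

With this sketch, the single byte `2F` and the two-byte sequence `C0 AF` both decode to U+002F, so a filter that inspects raw bytes and a consumer that decodes them can disagree about what the input contains.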
Everyone used the above mapping scheme happily and without major problems for many years. Then some people realized that they could exploit these loose mappings to introduce illegal sequences that might contain machine code and, combined with existing security holes such as buffer overflows, do malicious things to your systems. So the Unicode Consortium and others tightened the mappings and introduced the so-called "UTF-8 Corrigendum" in the Unicode Standard Version 3.1, which was revised once more in Unicode 3.2. For your convenience, the following table shows the final mappings, which I took from Unicode 3.2:
| Unicode Scalar Values in Binary | Hex Min | Hex Max | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|---|---|
| 00000000 00000000 0xxxxxxx | U+0000 | U+007F | 00..7F | | | |
| 00000000 00000yyy yyxxxxxx | U+0080 | U+07FF | C2..DF | 80..BF | | |
| 00000000 zzzzyyyy yyxxxxxx | U+0800 | U+0FFF | E0 | A0..BF | 80..BF | |
| | U+1000 | U+CFFF | E1..EC | 80..BF | 80..BF | |
| | U+D000 | U+D7FF | ED | 80..9F | 80..BF | |
| | U+D800 | U+DFFF | ill-formed | | | |
| | U+E000 | U+FFFF | EE..EF | 80..BF | 80..BF | |
| 000uuuuu zzzzyyyy yyxxxxxx | U+10000 | U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| | U+40000 | U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| | U+100000 | U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
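As a rough illustration of how the byte ranges in this table translate into code, here is a minimal C sketch of a well-formedness check. The name `strict_utf8_len` is hypothetical and this is not the Solaris implementation. It relies on one observation from the table: only the second byte has bounds that depend on the first byte, while the third and fourth bytes are always 80..BF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Solaris code: return the length (1..4) of the
 * well-formed UTF-8 sequence at the start of s, or 0 if the bytes do not
 * match any row of the Unicode 3.2 table.
 */
size_t
strict_utf8_len(const uint8_t *s, size_t len)
{
    size_t need, i;
    uint8_t lo = 0x80, hi = 0xBF;   /* default bounds for the 2nd byte */

    assert(s != NULL);
    if (len == 0)
        return (0);
    if (s[0] <= 0x7F)
        return (1);

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0)
            lo = 0xA0;              /* E0 80..9F would be below U+0800 */
        else if (s[0] == 0xED)
            hi = 0x9F;              /* ED A0..BF would be U+D800..U+DFFF */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0)
            lo = 0x90;              /* F0 80..8F would be below U+10000 */
        else if (s[0] == 0xF4)
            hi = 0x8F;              /* F4 90..BF would be above U+10FFFF */
    } else {
        return (0);                 /* 80..C1 and F5..FF never start a character */
    }

    if (len < need)
        return (0);
    if (s[1] < lo || s[1] > hi)
        return (0);
    for (i = 2; i < need; i++) {
        if (s[i] < 0x80 || s[i] > 0xBF)
            return (0);
    }
    return (need);
}
```

Under this check, `C0 AF` and `E0 80 AF` are rejected, as is `ED A0 80` (which would be U+D800), while `ED 9F BF` (U+D7FF) is accepted as a three-byte sequence.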
Since we support UTF-8 as the file code of the Unicode/UTF-8 locales in Solaris, we naturally pay close attention to such revisions. We have incorporated the necessary changes into the various places in Solaris where the above mappings are used, so that any illegal byte sequences entered into the system are properly screened out. One such place is the ldterm kernel module, which handles terminal line editing when the shell or application at your terminal isn't doing the line editing itself.
In my opinion, we also do this screening as efficiently as possible by reducing the number of comparisons. For more details on the screening, please see the UTF-8 conversion done in the function shown here. We do this conversion to figure out the number of columns occupied by a Unicode character, which we need, for example, to "rub out" the character from the terminal screen when you hit the Backspace or Delete key.
Please note that the difference between the "UTF-8 Corrigendum" in Unicode 3.1 and the latest legal UTF-8 byte sequences in Unicode 3.2 is the removal of the surrogate code values (U+D800..U+DFFF) from the UTF-8 byte sequences. In the code example from ldterm, the screening is done in two places: (1) within the for() loop during the conversion and (2) in the Unicode width table for the BMP (Basic Multilingual Plane).
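The rub-out step can be sketched roughly as follows. This is not the ldterm code: the names `rub_out_last` and `toy_width` are hypothetical, the width function is a toy stand-in for a real Unicode width table (it ignores control characters and combining marks), and the sketch assumes the line buffer already holds only screened, well-formed sequences.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a real width table: a few East Asian blocks take two columns. */
static int
toy_width(uint32_t cp)
{
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo leading consonants */
        (cp >= 0x3040 && cp <= 0x30FF) ||   /* Hiragana and Katakana */
        (cp >= 0x4E00 && cp <= 0x9FFF) ||   /* CJK Unified Ideographs */
        (cp >= 0xAC00 && cp <= 0xD7A3))     /* Hangul Syllables */
        return (2);
    return (1);
}

/*
 * Hypothetical helper: drop the last character from buf[0..*lenp) and
 * return the number of screen columns to rub out (0 if the line is empty).
 */
int
rub_out_last(const uint8_t *buf, size_t *lenp)
{
    size_t start, n, i;
    uint32_t cp;
    uint8_t lead;

    assert(buf != NULL && lenp != NULL);
    if (*lenp == 0)
        return (0);

    /* Step back over trailing bytes (10xxxxxx), at most three of them. */
    start = *lenp - 1;
    while (start > 0 && *lenp - start < 4 && (buf[start] & 0xC0) == 0x80)
        start--;

    n = *lenp - start;
    lead = buf[start];
    if (n == 1 && lead < 0x80) {
        cp = lead;
    } else if (n == 2 && (lead & 0xE0) == 0xC0) {
        cp = lead & 0x1F;
    } else if (n == 3 && (lead & 0xF0) == 0xE0) {
        cp = lead & 0x0F;
    } else if (n == 4 && (lead & 0xF8) == 0xF0) {
        cp = lead & 0x07;
    } else {
        /* Lead byte and length disagree: drop one byte, one column. */
        (*lenp)--;
        return (1);
    }

    /* Fold in the six payload bits of each trailing byte. */
    for (i = start + 1; i < *lenp; i++)
        cp = (cp << 6) | (buf[i] & 0x3F);

    *lenp = start;
    return (toy_width(cp));
}
```

For a line holding `61 ED 95 9C C3 A9` (the letter a, the Hangul syllable U+D55C, and U+00E9), three successive calls would report 1, 2, and 1 columns and leave the buffer empty.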
Technorati Tag: OpenSolaris
Technorati Tag: Solaris
Posted at 08:38AM Jun 14, 2005 by is in Solaris |
Friday Jun 17, 2005