Ienup's Weblog / 성인업의 웹로그

Friday June 17, 2005

Leaders wanted at OpenSolaris I18N & L10N community!

It feels like only yesterday that OpenSolaris launched, and yet within a couple of days we have already formed 15 subgroups under the OpenSolaris Internationalization and Localization community.

Many of them already have subgroup leaders, but some do not, so we are looking for subgroup leaders! Also, if your language isn't there and you would like to see a subgroup for it in the Internationalization and Localization community so that OpenSolaris will speak your language (even better if Solaris already supports your language), please join us and take part in the party! We are also looking for global coordinators who will oversee and coordinate projects of a similar or common nature across multiple subgroups!

Thanks!

( Jun 17 2005, 05:58:22 PM PDT )

Tuesday June 14, 2005

Secure UTF-8

OpenSolaris launched today, and since people can now access the OpenSolaris code, I thought it would be interesting to say a few things about illegal byte sequences in UTF-8 and show how to avoid them in your program, with a couple of links to a real code example.

As you know, UTF-8 (Unicode/UCS Transformation Format 8) is the file code representation used in all Unix/Linux operating systems, especially for their Unicode/UTF-8 locales. It is also widely accepted and endorsed as a standard encoding form by numerous modern, major standards from standardization organizations such as the IETF, W3C, Unicode, Inc., ISO/IEC, and so on.

Until a few years ago, we had a fairly relaxed binary mapping between Unicode scalar values and UTF-8 byte sequences, as in the following table:

Table 1: Previous UTF-8 Binary Encoding Mapping

Unicode Scalar Values in Binary   Hex Min    Hex Max    1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 00000000 0xxxxxxx        U+0000     U+007F     0xxxxxxx
00000000 00000yyy yyxxxxxx        U+0080     U+07FF     110yyyyy  10xxxxxx
00000000 zzzzyyyy yyxxxxxx        U+0800     U+FFFF     1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx        U+10000    U+10FFFF   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
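The bit patterns in Table 1 can be sketched as a small decoder. This is only an illustration that I am adding here, not code from Solaris, and the name loose_utf8_decode() is hypothetical. It checks nothing beyond the bit patterns in the table.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one character using only the bit patterns of Table 1.
 * Returns the number of bytes consumed (1..4) and stores the scalar
 * value in *out, or returns 0 if the bytes match no pattern.
 * loose_utf8_decode() is a hypothetical name, not a Solaris function.
 */
size_t
loose_utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t need, i;
    uint32_t u;

    if (len == 0)
        return 0;

    if ((s[0] & 0x80) == 0x00) {            /* 0xxxxxxx */
        need = 1;
        u = s[0];
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110yyyyy */
        need = 2;
        u = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110zzzz */
        need = 3;
        u = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110uuu */
        need = 4;
        u = s[0] & 0x07;
    } else {
        return 0;                           /* not a first byte */
    }

    if (len < need)
        return 0;

    /* Every following byte must look like 10xxxxxx. */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        u = (u << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = u;
    return need;
}
```

Note that a decoder this loose lets more than one byte sequence produce the same scalar value: the single byte 2F and the two bytes C0 AF both decode to U+002F.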

Everyone used the above mapping scheme happily and without major problems for many years. Then some people realized that they could exploit these loose mappings to introduce illegal sequences that may contain machine code and, combined with existing security holes such as buffer overflows, do malicious things to your systems. So the Unicode Consortium and others tightened the mappings and introduced the so-called "UTF-8 Corrigendum" in the Unicode Standard Version 3.1, which was revised once more in Unicode 3.2. For your convenience, the following table shows the final mappings, which I took from Unicode 3.2:

Table 2: Legal UTF-8 Byte Sequences

Unicode Scalar Values in Binary   Hex Min    Hex Max    1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 00000000 0xxxxxxx        U+0000     U+007F     00..7F
00000000 00000yyy yyxxxxxx        U+0080     U+07FF     C2..DF    80..BF
00000000 zzzzyyyy yyxxxxxx        U+0800     U+0FFF     E0        A0..BF    80..BF
00000000 zzzzyyyy yyxxxxxx        U+1000     U+CFFF     E1..EC    80..BF    80..BF
00000000 zzzzyyyy yyxxxxxx        U+D000     U+D7FF     ED        80..9F    80..BF
00000000 zzzzyyyy yyxxxxxx        U+D800     U+DFFF     ill-formed
00000000 zzzzyyyy yyxxxxxx        U+E000     U+FFFF     EE..EF    80..BF    80..BF
000uuuuu zzzzyyyy yyxxxxxx        U+10000    U+3FFFF    F0        90..BF    80..BF    80..BF
000uuuuu zzzzyyyy yyxxxxxx        U+40000    U+FFFFF    F1..F3    80..BF    80..BF    80..BF
000uuuuu zzzzyyyy yyxxxxxx        U+100000   U+10FFFF   F4        80..8F    80..BF    80..BF
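Table 2 translates fairly directly into a screening function. The following is a minimal sketch under my own assumptions, not the Solaris code, and utf8_legal_length() is a hypothetical name. It relies on the fact that only the second byte ever has a narrower range than 80..BF.

```c
#include <stddef.h>

/*
 * Return the length (1..4) of the legal UTF-8 sequence starting at s,
 * or 0 if the bytes are not a legal sequence according to Table 2.
 * utf8_legal_length() is a hypothetical name used for illustration.
 */
size_t
utf8_legal_length(const unsigned char *s, size_t len)
{
    unsigned char lo = 0x80, hi = 0xBF;     /* usual 2nd byte range */
    size_t need, i;

    if (len == 0)
        return 0;

    if (s[0] <= 0x7F)
        return 1;

    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0)
            lo = 0xA0;                      /* below U+0800 is ill-formed */
        else if (s[0] == 0xED)
            hi = 0x9F;                      /* U+D800..U+DFFF is ill-formed */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0)
            lo = 0x90;                      /* below U+10000 is ill-formed */
        else if (s[0] == 0xF4)
            hi = 0x8F;                      /* above U+10FFFF is ill-formed */
    } else {
        return 0;                           /* 80..C1 and F5..FF never lead */
    }

    if (len < need)
        return 0;
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < need; i++) {
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    }
    return need;
}
```

With this sketch, C0 AF and ED A0 80 are rejected, while ED 9F BF (U+D7FF) and F4 8F BF BF (U+10FFFF) are accepted.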

Since we support UTF-8 as the file code of the Unicode/UTF-8 locales in Solaris, we naturally pay close attention to such revisions, and we have incorporated the necessary changes into the various places in Solaris where the above mappings are used, so that any illegal byte sequence entered into the system is properly screened out. One of the places where we do such screening is the ldterm kernel module, which handles terminal line editing if the shell or application at your terminal isn't doing the line editing by itself.

In my opinion, we also do this screening as efficiently as possible by reducing the number of comparisons. For more details on the screening, please see the UTF-8 conversion done in the function shown here. We do this conversion to figure out the number of columns that a Unicode character occupies, which we need, for example, to "rub out" the character from the terminal screen when you hit the Back Space or Delete key.

Please note that the difference between the "UTF-8 Corrigendum" in Unicode 3.1 and the latest legal UTF-8 byte sequences shown in Unicode 3.2 is the removal of the surrogate code values from the UTF-8 byte sequences. In the code example from ldterm, the screening is done in two places: (1) within the for() loop during the conversion and (2) in the Unicode width table for the BMP (Basic Multilingual Plane).
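To picture why a rub-out needs the conversion, here is a rough sketch that finds the last character in a line buffer and reports how many bytes and columns it occupies. It is not how ldterm is written. The names last_char_extent() and sketch_width() are hypothetical, the buffer is assumed to have been screened already, and the width rule is a drastic simplification that I made up for the example (a real implementation consults a full width table).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, greatly simplified width rule: 2 columns for Hangul
 * syllables and CJK unified ideographs, 1 column for everything else.
 */
static int
sketch_width(uint32_t u)
{
    if ((u >= 0xAC00 && u <= 0xD7A3) || (u >= 0x4E00 && u <= 0x9FFF))
        return 2;
    return 1;
}

/*
 * Report the byte count and column count of the last character in buf.
 * Assumes buf holds UTF-8 that has already been screened.
 * Returns 0 on success, or -1 if buf is empty or does not end with a
 * complete character.
 */
int
last_char_extent(const unsigned char *buf, size_t len,
    size_t *nbytes, int *ncols)
{
    size_t start, need, i;
    unsigned char lead;
    uint32_t u;

    if (len == 0)
        return -1;

    /* Step back over at most three continuation bytes (10xxxxxx). */
    start = len - 1;
    while (start > 0 && len - start < 4 && (buf[start] & 0xC0) == 0x80)
        start--;

    lead = buf[start];
    if (lead < 0x80) {
        need = 1;
        u = lead;
    } else if ((lead & 0xE0) == 0xC0) {
        need = 2;
        u = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        need = 3;
        u = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        need = 4;
        u = lead & 0x07;
    } else {
        return -1;                  /* no first byte found */
    }

    if (need != len - start)
        return -1;                  /* truncated or stray bytes */

    for (i = start + 1; i < len; i++)
        u = (u << 6) | (uint32_t)(buf[i] & 0x3F);

    *nbytes = need;
    *ncols = sketch_width(u);
    return 0;
}
```

For a buffer that ends with a Hangul syllable, the sketch reports three bytes and two columns, so a Back Space would have to erase two screen cells while dropping three bytes from the buffer.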
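The effect of removing the surrogate code values can be seen on three-byte sequences. The sketch below is my own illustration with hypothetical function names, not the ldterm code: the first check only insists on the shortest form, and the second one also refuses U+D800..U+DFFF.

```c
#include <stdint.h>

/*
 * Decode a three-byte sequence (1110zzzz 10yyyyyy 10xxxxxx).
 * Returns 0 and stores the scalar value, or -1 on a bad bit pattern.
 */
static int
decode3(const unsigned char *s, uint32_t *out)
{
    if ((s[0] & 0xF0) != 0xE0 || (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80)
        return -1;
    *out = ((uint32_t)(s[0] & 0x0F) << 12) |
        ((uint32_t)(s[1] & 0x3F) << 6) | (uint32_t)(s[2] & 0x3F);
    return 0;
}

/* Shortest form only: three bytes must encode U+0800 or above. */
int
legal3_shortest_form(const unsigned char *s)
{
    uint32_t u;

    return decode3(s, &u) == 0 && u >= 0x0800;
}

/* Shortest form and, in addition, no surrogate code values. */
int
legal3_no_surrogates(const unsigned char *s)
{
    uint32_t u;

    if (decode3(s, &u) != 0 || u < 0x0800)
        return 0;
    return u < 0xD800 || u > 0xDFFF;
}
```

The sequence ED A0 80 decodes to U+D800, so it passes the first check and fails the second, while ED 9F BF (U+D7FF) and EE 80 80 (U+E000) pass both.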

( Jun 14 2005, 08:38:05 AM PDT )

Thursday June 02, 2005

Starting my weblog

Howdy!

With this weblog, I'm hoping to write about some interesting things that I encounter, know, or create(!) that might be useful or worth reading about. I'm also hoping to exercise my writing skills not only in English but also in Korean.

Ienup

Hello!

Through this weblog, I plan to write about things that I come across, already know, or have newly created myself, and that seem interesting or worthwhile. I would also like to blog in Korean.

성인업

( Jun 02 2005, 02:22:37 PM PDT )