Ienup's Weblog

Tuesday Jun 14, 2005

Secure UTF-8

OpenSolaris launched today, and since people can now access the OpenSolaris code, I thought it would be interesting to say a few things about illegal byte sequences in UTF-8 and show how to avoid them in your programs, with a couple of links to a real code example.

As you know, UTF-8 (Unicode/UCS Transformation Format 8) is the file code representation used across Unix/Linux operating systems, especially for their Unicode/UTF-8 locales. It is also widely accepted and endorsed as a standard encoding form by numerous modern, major standards from standardization organizations such as IETF, W3C, Unicode, Inc., ISO/IEC, and so on.

Until rather recently (a few years ago), we had quite relaxed binary mappings between Unicode scalar values and UTF-8 byte sequences, as in the following table:

Table 1: Previous UTF-8 Binary Encoding Mapping

Unicode Scalar Values in Binary   Hex Min   Hex Max    1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 00000000 0xxxxxxx        U+0000    U+007F     0xxxxxxx
00000000 00000yyy yyxxxxxx        U+0080    U+07FF     110yyyyy  10xxxxxx
00000000 zzzzyyyy yyxxxxxx        U+0800    U+FFFF     1110zzzz  10yyyyyy  10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx        U+10000   U+10FFFF   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
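To see why a mapping defined only by bit layout is loose, here is a minimal Python sketch of a decoder that applies Table 1 mechanically and does no range checking. The names loose_decode and strict_accepts are made up for this illustration and do not come from Solaris; the second helper simply asks Python's built-in strict UTF-8 decoder for comparison.

```python
def loose_decode(seq: bytes) -> int:
    """Decode one sequence using only the bit layout of Table 1 (no range checks)."""
    first = seq[0]
    if first < 0x80:                  # 0xxxxxxx
        need, value = 0, first
    elif first & 0xE0 == 0xC0:        # 110yyyyy
        need, value = 1, first & 0x1F
    elif first & 0xF0 == 0xE0:        # 1110zzzz
        need, value = 2, first & 0x0F
    elif first & 0xF8 == 0xF0:        # 11110uuu
        need, value = 3, first & 0x07
    else:
        raise ValueError("not a lead byte")
    if len(seq) != need + 1:
        raise ValueError("wrong length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:          # every trailing byte must be 10xxxxxx
            raise ValueError("not a continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value


def strict_accepts(seq: bytes) -> bool:
    """True if Python's built-in strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With this sketch, loose_decode(b"\xc0\xaf") and loose_decode(b"\xe0\x80\xaf") both return 0x2F, the same value as the single byte for "/", while strict_accepts() rejects both longer forms. A filter that looks only for the byte 2F would miss them.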

Everyone used the above mapping scheme happily and without any major problems for many years. Then some people realized that they could exploit these loose mappings to introduce illegal sequences, which may carry machine code, and, combined with existing security holes such as buffer overflows, do malicious things to your systems. So the Unicode Consortium and other folks tightened the mappings and introduced the so-called "UTF-8 Corrigendum" in the Unicode Standard Version 3.1, which was revised once more in Unicode 3.2. The following table shows the final mappings, which I took from Unicode 3.2 for your convenience:

Table 2: Legal UTF-8 Byte Sequences

Unicode Scalar Values in Binary   Hex Min    Hex Max    1st Byte  2nd Byte  3rd Byte  4th Byte
00000000 00000000 0xxxxxxx        U+0000     U+007F     00..7F
00000000 00000yyy yyxxxxxx        U+0080     U+07FF     C2..DF    80..BF
00000000 zzzzyyyy yyxxxxxx        U+0800     U+0FFF     E0        A0..BF    80..BF
                                  U+1000     U+CFFF     E1..EC    80..BF    80..BF
                                  U+D000     U+D7FF     ED        80..9F    80..BF
                                  U+D800     U+DFFF     ill-formed
                                  U+E000     U+FFFF     EE..EF    80..BF    80..BF
000uuuuu zzzzyyyy yyxxxxxx        U+10000    U+3FFFF    F0        90..BF    80..BF    80..BF
                                  U+40000    U+FFFFF    F1..F3    80..BF    80..BF    80..BF
                                  U+100000   U+10FFFF   F4        80..8F    80..BF    80..BF
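Table 2 translates almost row for row into a checker. The following Python sketch is only an illustration of the table, not the Solaris code; LEGAL_ROWS, legal_sequence_length, and is_legal_utf8 are names I made up here. Only the second byte has a range that depends on the first byte, so every later byte is checked against plain 80..BF.

```python
# One tuple per row of Table 2:
# (first-byte low, first-byte high, second-byte low, second-byte high, total length)
LEGAL_ROWS = [
    (0x00, 0x7F, None, None, 1),
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),   # 80..9F keeps U+D800..U+DFFF out
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),   # 80..8F stops at U+10FFFF
]


def legal_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return the length of the legal sequence starting at pos, or 0 if ill-formed."""
    first = data[pos]
    for lo, hi, lo2, hi2, length in LEGAL_ROWS:
        if lo <= first <= hi:
            break
    else:
        return 0                   # 80..C1 and F5..FF never start a sequence
    if pos + length > len(data):
        return 0                   # sequence is cut short
    if length == 1:
        return 1
    if not lo2 <= data[pos + 1] <= hi2:
        return 0
    for b in data[pos + 2 : pos + length]:
        if not 0x80 <= b <= 0xBF:
            return 0
    return length


def is_legal_utf8(data: bytes) -> bool:
    """True if data consists only of byte sequences listed in Table 2."""
    pos = 0
    while pos < len(data):
        n = legal_sequence_length(data, pos)
        if n == 0:
            return False
        pos += n
    return True
```

For example, is_legal_utf8(b"\xed\x9f\xbf") is true (U+D7FF), while b"\xed\xa0\x80" (a surrogate), b"\xc0\xaf" (a non-shortest form), and b"\xf4\x90\x80\x80" (beyond U+10FFFF) are all rejected.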

Since we support UTF-8 as the file code of the Unicode/UTF-8 locales in Solaris, we naturally pay close attention to such revisions, and we have incorporated the necessary changes into the various places in Solaris where the above mappings are used, so that any illegal byte sequences entering the system are properly screened out. One of the places where we do such screening is the ldterm kernel module, which handles terminal line editing when the shell or application at your terminal isn't doing the line editing by itself.

In my opinion, we also do this screening as efficiently as possible by reducing the number of comparisons. For more details on the screening, please see the UTF-8 conversion done in the function shown here. We do this conversion to figure out the number of columns for a Unicode character that, for example, we need to "rub out" from the terminal screen when you hit the Back Space or Delete key.
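To give a rough feel for the rub-out problem, here is a small Python sketch. It is not the ldterm implementation: ldterm uses its own conversion and width tables, while this sketch leans on Python's strict decoder and the unicodedata module. The function names, and the choice to count an illegal trailing byte as one byte and one column, are assumptions made for the illustration.

```python
import unicodedata


def last_char_start(buf: bytes) -> int:
    """Index of the lead byte of the last character in buf (buf must be non-empty)."""
    i = len(buf) - 1
    # Step back over at most three continuation bytes (10xxxxxx).
    while i > 0 and len(buf) - i < 4 and buf[i] & 0xC0 == 0x80:
        i -= 1
    return i


def columns_to_rub_out(buf: bytes) -> tuple:
    """Return (bytes to drop, screen columns to erase) for one Back Space."""
    start = last_char_start(buf)
    try:
        ch = buf[start:].decode("utf-8")   # strict: illegal sequences raise
    except UnicodeDecodeError:
        return 1, 1    # assumption: an illegal trailing byte counts as one column
    if unicodedata.combining(ch):
        cols = 0       # combining marks take no column of their own
    elif unicodedata.east_asian_width(ch) in ("W", "F"):
        cols = 2       # wide and fullwidth characters take two columns
    else:
        cols = 1
    return len(buf) - start, cols
```

For the line "a" followed by the Hangul syllable U+AC00, columns_to_rub_out() returns (3, 2): three bytes leave the buffer and two columns are erased on the screen. Getting the column count wrong is exactly what leaves stray half-characters behind on a terminal.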

Please note that the difference between the "UTF-8 Corrigendum" in Unicode 3.1 and the latest legal UTF-8 byte sequences shown in Unicode 3.2 is the removal of the surrogate code values (U+D800 to U+DFFF) from the UTF-8 byte sequences. In the code example from ldterm, the screening is done in two places: (1) within the for() loop during the conversion and (2) at the Unicode width table for the BMP (Basic Multilingual Plane).
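The arithmetic behind the surrogate removal is easy to check. The Python sketch below applies the three-byte bit layout with no screening at all (scalar_from_three_bytes is a made-up helper, not an ldterm function) and collects the values reachable from the lead byte ED.

```python
def scalar_from_three_bytes(b1: int, b2: int, b3: int) -> int:
    """Apply the 1110zzzz 10yyyyyy 10xxxxxx layout without any range screening."""
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


# Second byte A0..BF after ED: the sequences that Unicode 3.2 made ill-formed.
surrogate_hits = {
    scalar_from_three_bytes(0xED, b2, b3)
    for b2 in range(0xA0, 0xC0)
    for b3 in range(0x80, 0xC0)
}

# Second byte 80..9F after ED: the sequences that remain legal.
still_legal = {
    scalar_from_three_bytes(0xED, b2, b3)
    for b2 in range(0x80, 0xA0)
    for b3 in range(0x80, 0xC0)
}
```

Here surrogate_hits is exactly the 2048 values from 0xD800 to 0xDFFF, and still_legal runs from 0xD000 to 0xD7FF. Narrowing the second byte after ED from 80..BF to 80..9F therefore removes the surrogates and nothing else.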
