Osamu Sayama's Weblog

Monday, October 20, 2008

Reduce locale shared object size

The shared object for a UTF-8 locale is currently about 2 MB per locale. This size keeps growing because each new release of the Unicode standard introduces new characters.

% ls -lah /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3 /usr/lib/locale/fr_FR.UTF-8/fr_FR.UTF-8.so.3 /usr/lib/locale/fr_CA.UTF-8/fr_CA.UTF-8.so.3
-r-xr-xr-x   1 root     bin         2.4M Sep 19 04:31 /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3
-r-xr-xr-x   1 root     bin         1.7M Aug  7 19:42 /usr/lib/locale/fr_CA.UTF-8/fr_CA.UTF-8.so.3
-r-xr-xr-x   1 root     bin         1.7M Aug  7 19:42 /usr/lib/locale/fr_FR.UTF-8/fr_FR.UTF-8.so.3
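To see how per-locale sizes like these add up across a whole system, a small helper along the following lines could total the byte sizes of the UTF-8 locale shared objects. This is only a sketch: the function name total_utf8_locale_bytes and the file name pattern are assumptions for illustration, not part of any Solaris tool.

```shell
#!/bin/sh
# Sketch: total size in bytes of all UTF-8 locale shared objects below a
# locale root. The name pattern '*UTF-8*.so.3' is an assumption based on the
# file names shown above.
total_utf8_locale_bytes() {
    root=${1:-/usr/lib/locale}
    # Field 5 of 'ls -l' output is the file size in bytes.
    find "$root" -type f -name '*UTF-8*.so.3' -exec ls -l {} + |
        awk '{ total += $5 } END { printf "%d\n", total }'
}

# Only report on systems that actually have the locale directory.
if [ -d /usr/lib/locale ]; then
    total_utf8_locale_bytes /usr/lib/locale
fi
```

Running the same function before and after a change gives a direct measure of how much disk space the change saves.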

'locale -a | grep -i utf-8 | wc' shows 108 locales on Nevada 100 with full locale support, and more than 400 MB is used for UTF-8 locale shared objects. This was not much of a problem on an installed system (although any patch for the locales would be huge). It is a problem for the OpenSolaris Live CD, however, because space there is more limited, so we should reduce this size as much as possible.

The root cause of the size is the weight tables in _LC_collate_t lc_coll (ct_wgts* and subs_map) and the qmask index table (qidx) in _LC_ctype_t lc_ctype. Many UTF-8 locales share the same LC_CTYPE and LC_COLLATE definitions (for example, fr_FR.UTF-8 and fr_CA.UTF-8), so splitting these tables out of each locale shared object into new shared objects for the ctype and collation tables can reduce the total disk size dramatically.

On a UTF-8 locale, the LC_CTYPE and LC_COLLATE tables appear to account for about 99% of the total size, and LC_COLLATE alone for about 90%.

- en_US.UTF-8

[68]    |    982856|    974848|OBJT |LOCL |0    |11     |ct_wgts0
[67]    |      8008|    974848|OBJT |LOCL |0    |11     |ct_wgts1
[72]    |   1958144|    243713|OBJT |LOCL |0    |11     |subs_map
[81]    |   2202848|    243456|OBJT |LOCL |0    |11     |qidx

- fr_FR.UTF-8 and fr_CA.UTF-8

[72]    |   1120768|    607132|OBJT |LOCL |0    |11     |weightstr
[69]    |    793096|    262136|OBJT |LOCL |0    |11     |ct_wgts0
[68]    |    530960|    262136|OBJT |LOCL |0    |11     |ct_wgts1
[67]    |    268824|    262136|OBJT |LOCL |0    |11     |ct_wgts2
[66]    |      6688|    262136|OBJT |LOCL |0    |11     |ct_wgts3
[71]    |   1055232|     65535|OBJT |LOCL |0    |11     |subs_map
[80]    |   1728456|     65278|OBJT |LOCL |0    |11     |qidx
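The share of the file taken by these tables can be checked by adding up the size column. The sketch below does this for the French listing; it assumes the third '|'-separated field is the symbol size in bytes (the usual nm column order), and the function name sum_table_sizes is made up for this example.

```shell
#!/bin/sh
# Sketch: add up the size column (third '|'-separated field) of nm-style
# symbol lines read from standard input.
sum_table_sizes() {
    awk -F'|' '{ total += $3 } END { printf "%d\n", total }'
}

# Symbol lines copied from the fr_FR.UTF-8 / fr_CA.UTF-8 listing above.
fr_total=$(sum_table_sizes <<'EOF'
[72]    |   1120768|    607132|OBJT |LOCL |0    |11     |weightstr
[69]    |    793096|    262136|OBJT |LOCL |0    |11     |ct_wgts0
[68]    |    530960|    262136|OBJT |LOCL |0    |11     |ct_wgts1
[67]    |    268824|    262136|OBJT |LOCL |0    |11     |ct_wgts2
[66]    |      6688|    262136|OBJT |LOCL |0    |11     |ct_wgts3
[71]    |   1055232|     65535|OBJT |LOCL |0    |11     |subs_map
[80]    |   1728456|     65278|OBJT |LOCL |0    |11     |qidx
EOF
)
echo "fr_FR.UTF-8 tables: $fr_total bytes"
```

For the French locales this gives 1786489 bytes, about 1.7 MB, which is essentially the whole 1.7M file reported by ls.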

As a trial, I split fr_FR.UTF-8.c, which is generated by the localedef command, into three parts: CLDR.UTF-8-ctype.c, CLDR.fr.UTF-8-collate.c, and fr_FR.UTF-8.c. I then compiled and linked them as follows.

% cc -xO3 -K PIC -G -Xa  -h CLDR.UTF-8-ctype.so.3 -o CLDR.UTF-8-ctype.so.3 ./CLDR.UTF-8-ctype.c
% cc -xO3 -K PIC -G -Xa  -h CLDR.fr.UTF-8-collate.so.3 -o CLDR.fr.UTF-8-collate.so.3 ./CLDR.fr.UTF-8-collate.c
% cc -xO3 -K PIC -G -Xa  -h fr_FR.UTF-8.so.3 -o fr_FR.UTF-8.so.3 ./fr_FR.UTF-8.c  /usr/lib/locale/common/methods_unicode.so.3 ./CLDR.UTF-8-ctype.so.3 ./CLDR.fr.UTF-8-collate.so.3 -R /usr/lib/locale/common

Then I copied CLDR.UTF-8-ctype.so.3 and CLDR.fr.UTF-8-collate.so.3 to /usr/lib/locale/common, and fr_FR.UTF-8.so.3 to /usr/lib/locale/fr_FR.UTF-8. Here is the modified source.
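Because fr_CA.UTF-8 shares its ctype and collation tables with fr_FR.UTF-8, only the small per-locale object has to be built again for it. The sketch below prints (rather than runs) the link command for such a locale, reusing the flags from the commands above. The helper name link_locale_stub is hypothetical, and it assumes a fr_CA.UTF-8.c that has been split in the same way.

```shell
#!/bin/sh
# Sketch: print the link command for a per-locale shared object that depends
# on shared ctype and collation objects. The flags are copied from the
# commands above; the helper name is hypothetical.
link_locale_stub() {
    locale_name=$1    # e.g. fr_CA.UTF-8
    ctype_so=$2       # e.g. CLDR.UTF-8-ctype.so.3
    collate_so=$3     # e.g. CLDR.fr.UTF-8-collate.so.3
    echo cc -xO3 -K PIC -G -Xa \
        -h "$locale_name.so.3" -o "$locale_name.so.3" "./$locale_name.c" \
        /usr/lib/locale/common/methods_unicode.so.3 \
        "./$ctype_so" "./$collate_so" -R /usr/lib/locale/common
}

link_locale_stub fr_CA.UTF-8 CLDR.UTF-8-ctype.so.3 CLDR.fr.UTF-8-collate.so.3
```

Printing the command first makes it easy to review before running it by hand.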

% ldd /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3 /usr/lib/locale/fr_FR.UTF-8/fr_FR.UTF-8.so.3 /usr/lib/locale/fr_CA.UTF-8/fr_CA.UTF-8.so.3
/usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3:
        libc.so.1 =>     /lib/libc.so.1
        /usr/lib/locale/common/methods_unicode.so.3
        en_US.UTF-8-ctype.so.3 =>        /usr/lib/locale/common/en_US.UTF-8-ctype.so.3
        en_US.UTF-8-collate.so.3 =>      /usr/lib/locale/common/en_US.UTF-8-collate.so.3
        libm.so.2 =>     /lib/libm.so.2
/usr/lib/locale/fr_FR.UTF-8/fr_FR.UTF-8.so.3:
        libc.so.1 =>     /lib/libc.so.1
        /usr/lib/locale/common/methods_unicode.so.3
        CLDR.UTF-8-ctype.so.3 =>         /usr/lib/locale/common/CLDR.UTF-8-ctype.so.3
        CLDR.fr.UTF-8-collate.so.3 =>    /usr/lib/locale/common/CLDR.fr.UTF-8-collate.so.3
        libm.so.2 =>     /lib/libm.so.2
/usr/lib/locale/fr_CA.UTF-8/fr_CA.UTF-8.so.3:
        libc.so.1 =>     /lib/libc.so.1
        /usr/lib/locale/common/methods_unicode.so.3
        CLDR.UTF-8-ctype.so.3 =>         /usr/lib/locale/common/CLDR.UTF-8-ctype.so.3
        CLDR.fr.UTF-8-collate.so.3 =>    /usr/lib/locale/common/CLDR.fr.UTF-8-collate.so.3
        libm.so.2 =>     /lib/libm.so.2

% ls -lah /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3 /usr/lib/locale/fr_FR.UTF-8/fr_FR.UTF-8.so.3 /usr/lib/locale/fr_CA.UTF-8/fr_CA.UTF-8.so.3
-rwxr-xr-x   1 root     root         44K Oct 17 15:19 /usr/lib/locale/en_US.UTF-8/en_US.UTF-8.so.3
-rwxr-xr-x   1 root     root         14K Oct 19 08:59 /usr/lib/locale/fr_CA.UTF-8/fr_CA.UTF-8.so.3
-rwxr-xr-x   1 root     root         14K Oct 17 18:58 /usr/lib/locale/fr_FR.UTF-8/fr_FR.UTF-8.so.3
% ls -lah /usr/lib/locale/common/CLDR.* /usr/lib/locale/common/en_US.UTF-8-c*
-rwxr-xr-x   1 root     root         90K Oct 18 15:09 /usr/lib/locale/common/CLDR.UTF-8-ctype.so.3
-rwxr-xr-x   1 root     root        1.6M Oct 18 15:09 /usr/lib/locale/common/CLDR.fr.UTF-8-collate.so.3
-rwxr-xr-x   1 root     root        2.1M Oct 17 15:18 /usr/lib/locale/common/en_US.UTF-8-collate.so.3
-rwxr-xr-x   1 root     root        243K Oct 17 15:18 /usr/lib/locale/common/en_US.UTF-8-ctype.so.3
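As a rough check of the saving for the two French locales, the sizes reported by ls can be compared before and after the split. The sketch below uses the rounded figures from the listings, so the result is approximate, and it charges the whole shared ctype object to the French pair, which is a conservative assumption. The function name fr_pair_usage is made up for this example.

```shell
#!/bin/sh
# Sketch: disk usage of fr_FR.UTF-8 + fr_CA.UTF-8 before and after the split,
# in KB, using the rounded sizes printed by "ls -lah" above.
fr_pair_usage() {
    echo "1.7 1.7 14 14 90 1.6" | awk '{
        before = ($1 + $2) * 1024           # two 1.7M locale objects
        after  = $3 + $4 + $5 + $6 * 1024   # two 14K stubs, 90K ctype, 1.6M collate
        printf "before=%.0fK after=%.0fK ratio=%.2f\n", before, after, after / before
    }'
}

fr_pair_usage   # prints before=3482K after=1756K ratio=0.50
```

With only two locales sharing the tables the usage is roughly halved; the saving grows as more locales share the same collation and ctype objects.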

This simple modification works fine with the current libc (no modification to libc is needed!) and meets our requirement. There are currently about 15 UTF-8 collation types and 2 ctype types, so I expect this change to reduce the size to about 1/6 ((15 collation types + 2 ctype types) / 100 UTF-8 locales). Now I'm thinking that localedef should get an option to produce the three shared objects. I will try that later...
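The 1/6 estimate is simple arithmetic and can be reproduced as below. Like the estimate itself, this sketch treats every shared object as if it were the size of one full locale and ignores the small per-locale objects; the function name estimate_ratio is made up for this example.

```shell
#!/bin/sh
# Sketch: fraction of the original size that remains if the locales only need
# the shared collation and ctype objects (per-locale stubs are ignored).
estimate_ratio() {
    # $1 = collation types, $2 = ctype types, $3 = number of UTF-8 locales
    echo "$1 $2 $3" | awk '{ printf "%.2f\n", ($1 + $2) / $3 }'
}

estimate_ratio 15 2 100   # prints 0.17, i.e. roughly 1/6
```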
