Monday, September 22, 2008

According to the man page, iconv(3C) returns the number of non-identical conversions (non-reversible conversions, in GNU's vocabulary) it performed, when any occur. But what is a non-identical/non-reversible conversion? Try the following example:

$ echo "abc测试" | iconv -f UTF-8 -t ASCII
abc??


As you can see, the last two characters are outside the range of ASCII, although they are perfectly valid UTF-8 characters. In this case, iconv(3C) should convert them to two substitute characters (the non-identical conversion character, which is '?' in some cases) and return 2. GNU's iconv, however, raises an error in this case (it returns -1 and sets errno to EILSEQ), even though, according to its own man page, EILSEQ indicates an invalid multibyte sequence in the input.
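
To make the difference concrete, here is a minimal sketch in C that runs a single iconv() call and sorts the result into the cases described above. The enum and the function name classify_conversion are made up for this illustration, and the sketch assumes the converted text fits in a 256-byte buffer and that the encoding names are known to the local iconv.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcome labels for this sketch; not part of any iconv API. */
enum conv_outcome {
    CONV_EXACT,        /* iconv() returned 0: nothing had to be substituted */
    CONV_SUBSTITUTED,  /* iconv() returned a positive count */
    CONV_REJECTED,     /* iconv() returned -1 with errno == EILSEQ */
    CONV_OTHER_ERROR   /* iconv_open() failed, or some other errno */
};

/* Assumption: the converted text fits in 256 bytes. */
enum conv_outcome classify_conversion(const char *from, const char *to,
                                      const char *text)
{
    iconv_t cd = iconv_open(to, from);  /* note the order: target first */
    if (cd == (iconv_t)-1)
        return CONV_OTHER_ERROR;

    char out[256];
    /* Some systems declare the input argument as const char **;
       this sketch is written against the char ** prototype. */
    char *inp = (char *)text;
    size_t inleft = strlen(text);
    char *outp = out;
    size_t outleft = sizeof out;

    size_t n = iconv(cd, &inp, &inleft, &outp, &outleft);
    int saved_errno = errno;            /* iconv_close() may change errno */
    iconv_close(cd);

    if (n == (size_t)-1)
        return saved_errno == EILSEQ ? CONV_REJECTED : CONV_OTHER_ERROR;
    return n == 0 ? CONV_EXACT : CONV_SUBSTITUTED;
}
```

Calling classify_conversion("UTF-8", "ASCII", text) on the string from the example should give CONV_SUBSTITUTED on an implementation that behaves as the man page describes, and CONV_REJECTED on one that behaves like GNU's iconv. Note that CONV_REJECTED cannot tell "not representable in the target" apart from "invalid input", which is exactly the ambiguity discussed here.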

So this is not a portable way to tell whether the target encoding can represent the source contents. Nor should you rely on the count of non-identical conversions: an otherwise successful conversion may return -1 with errno set to E2BIG when the output buffer is exhausted, even though non-identical conversions happened along the way. And there is no flag in iconv(3C)/iconv_open(3C) to control whether non-identical conversions are performed or an error is raised.
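
The E2BIG case can be sketched as follows. The function below (count_nonreversible is a hypothetical name) converts through a deliberately tiny buffer, throws the output away, and adds up whatever counts iconv() reports. Because a call that ends in E2BIG returns -1 instead of a count, any substitutions made during that call are not reported, so the total is only a lower bound. This is a sketch under those assumptions, not a recommended way to measure anything.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Returns the sum of the counts reported by iconv(), or -1 on any error
   other than E2BIG. The converted output is discarded. */
long count_nonreversible(const char *from, const char *to, const char *text)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;

    char chunk[8];                 /* tiny on purpose, to provoke E2BIG */
    char *inp = (char *)text;      /* written against the char ** prototype */
    size_t inleft = strlen(text);
    long total = 0;
    int failed = 0;

    while (inleft > 0) {
        char *outp = chunk;
        size_t outleft = sizeof chunk;
        size_t n = iconv(cd, &inp, &inleft, &outp, &outleft);

        if (n != (size_t)-1) {
            total += (long)n;      /* count for this call only */
        } else if (errno != E2BIG || outp == chunk) {
            failed = 1;            /* real error, or no progress at all */
            break;
        }
        /* On E2BIG with progress: part of the input was converted, but the
           count for that part is not available. Loop with a fresh buffer. */
    }

    iconv_close(cd);
    return failed ? -1 : total;
}
```

A caller that only wants a yes/no answer can treat any result other than 0 as "the target encoding may not represent this text", which covers both the positive count and the EILSEQ behavior.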

In general, iconv(3C) is not a well-defined interface.

P.S. This post summarizes a discussion between the JDS/Evolution team and myself while we were locating and isolating an iconv(3C)-related bug.

Monday, August 18, 2008

GNU-iconv supports more encoding conversions than Solaris-iconv and performs better on some of them. For example, Solaris-iconv currently does not support conversions between GB18030 and UCS-2BE, UCS-4LE/BE, or UTF-16LE/BE, and the GB18030<->UTF-8 (UCS-2LE) conversions are twice as fast in GNU-iconv. GNU-iconv also has a star-shaped structure (with some exceptions) that uses UCS-4 as the intermediary encoding, whereas Solaris-iconv has a peer-to-peer structure (with aliases), which makes adding a new encoding really painful.

So you may want to use the GNU-iconv library. On Solaris 10/Nevada, you can download and install gnu-libiconv from www.sunfreeware.com; on OpenSolaris, you can install SUNWgnu-libiconv from pkg.opensolaris.org, but note that the OpenSolaris package does not contain the header files.

You may notice that the function symbols in gnu-libiconv carry the prefix "lib", e.g., iconv_open -> libiconv_open. So LD_PRELOAD and RUNPATH alone are not sufficient to replace the iconv(3) routines in libc. You need to make sure that you include the "iconv.h" from gnu-libiconv when compiling, since that header maps the standard names onto the prefixed ones.
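
One way to check which header a build actually picked up is sketched below. It relies on the _LIBICONV_VERSION macro that gnu-libiconv's "iconv.h" defines; the function name iconv_header_flavor is invented for this example, and where the header and library live on your system (and which -I/-L options you need) is something you have to fill in yourself.

```c
#include <iconv.h>
#include <stddef.h>

/* Reports which iconv.h was seen at compile time.
   _LIBICONV_VERSION is defined by gnu-libiconv's header only. */
const char *iconv_header_flavor(void)
{
#ifdef _LIBICONV_VERSION
    /* Here iconv_open() etc. are macros for libiconv_open() etc.,
       so the program must also be linked against libiconv. */
    return "GNU libiconv";
#else
    /* The plain names resolve to the iconv(3) routines in libc. */
    return "system iconv";
#endif
}
```

If this reports "system iconv" although gnu-libiconv is installed, the include path is probably pointing at the libc header, and no amount of LD_PRELOAD will change which routines get called.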




This blog copyright 2009 by yongsun