Xueming Shen's Blog
Non-UTF-8 encoding in ZIP file
Historically, the ZIP specification did not say which character encoding should be used for embedded file names and comments; the original IBM PC character set, commonly referred to as IBM Code Page 437, was supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly requires UTF-8 for encoding and decoding all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding when dealing with the file names and comments stored in Jar/ZIP files.
The consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to a java.util.jar/zip based tool, and vice versa, if a file name contains characters that are encoded differently in Cp437 (or in the default platform encoding, which some tools simply use instead) and in UTF-8.
For most Europeans, you're "lucky":-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese users, most characters are simply out of luck. This is why bug 4244499 was No. 1 on the Top 25 Java Bugs list for so many years. The bug is no longer on the list:-) It was finally "fixed" in OpenJDK 7, b57. I still keep a snapshot as a record/kudos for myself:-)
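A minimal sketch of the mismatch at the byte level. It assumes ISO-8859-1 as a stand-in for "the default platform encoding", because that charset is guaranteed to exist in every Java runtime; Cp437 behaves the same way in principle, only with different byte values. The class and method names are hypothetical, chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class NameEncodingMismatch {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "\u00fcber.txt"; // u-umlaut followed by "ber.txt"

        // The same name yields different bytes in the two encodings.
        byte[] legacy = name.getBytes(StandardCharsets.ISO_8859_1); // 0xFC ...
        byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);        // 0xC3 0xBC ...
        System.out.println(legacy.length + " bytes vs " + utf8.length + " bytes");

        // A UTF-8-only reader cannot decode the legacy bytes ...
        System.out.println("legacy name is valid UTF-8: " + isValidUtf8(legacy));

        // ... while pure ASCII names are identical in both and cause no trouble.
        byte[] ascii = "readme.txt".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("ASCII name is valid UTF-8: " + isValidUtf8(ascii));
    }
}
```

Names that stay within ASCII survive either way, which is why the problem only shows up once umlauts or CJK characters appear in entry names.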
The solution (I would rather call it a "solution" than a "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" as a parameter, as shown below.
ZipFile(File, Charset)
ZipInputStream(InputStream, Charset)
ZipOutputStream(OutputStream, Charset)
With these new constructors, applications can now access non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or, if necessary, create ZIP files encoded in a non-UTF-8 charset via the new ZipOutputStream(os, charset) constructor.
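Here is a rough sketch of how the three constructors fit together: write a ZIP whose entry names are encoded in a legacy charset, then read the names back through both ZipFile and ZipInputStream using the same charset. ISO-8859-1 is only an assumed example of a legacy encoding, and the class, method and file names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipCharsetRoundTrip {

    // Writes a ZIP with one empty entry per name; names are encoded with cs.
    static void write(Path zip, Charset cs, String... names) throws IOException {
        try (OutputStream os = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(os, cs)) {
            for (String name : names) {
                zos.putNextEntry(new ZipEntry(name));
                zos.closeEntry();
            }
        }
    }

    // Lists entry names via ZipFile, decoding them with cs.
    static List<String> namesViaZipFile(Path zip, Charset cs) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile(), cs)) {
            for (ZipEntry e : Collections.list(zf.entries())) {
                result.add(e.getName());
            }
        }
        return result;
    }

    // Lists entry names via ZipInputStream, decoding them with cs.
    static List<String> namesViaStream(Path zip, Charset cs) throws IOException {
        List<String> result = new ArrayList<>();
        try (InputStream is = Files.newInputStream(zip);
             ZipInputStream zis = new ZipInputStream(is, cs)) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                result.add(e.getName());
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Charset legacy = StandardCharsets.ISO_8859_1; // assumed legacy encoding
        Path zip = Files.createTempFile("legacy-names", ".zip");
        try {
            write(zip, legacy, "\u00fcbung.txt", "caf\u00e9.txt");
            System.out.println(namesViaZipFile(zip, legacy));
            System.out.println(namesViaStream(zip, legacy));
        } finally {
            Files.deleteIfExists(zip);
        }
    }
}
```

The charset has to be known out of band: a ZIP file written this way carries no record of which legacy encoding was used, so the reader must pass the same one the writer did.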
zip is a stripped-down version of the Jar tool with a "-encoding" option that supports non-UTF-8 encodings for entry names and comments. It can serve as a demo of how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...
Some things you might want to keep in mind when using these new APIs and the new JDK7 bundles:
(1) The java.util.jar package is not touched, therefore there is no behavior change when accessing Jar and ZIP files via the java.util.jar package (Jar is Jar, it uses UTF-8).
(2) UTF-8 is still used to decode the file names and comments if general purpose flag bit 11 (EFS) is ON, even if a non-UTF-8 charset is specified in the constructors. (See the PKWARE ZIP spec for more detailed info regarding EFS.)
(3) Jar and ZIP files created by JDK7 b57 and later now set "general purpose flag bit 11" if UTF-8 is used to encode the file names and comments.
(4) Since JDK7 b57 we have switched to the "standard" UTF-8 charset in the java.util.jar/zip implementation; earlier Java releases use a "modified" version of UTF-8. This is an incompatible change for sure, but I strongly believe it is something worth doing.
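Points (2) and (3) can be observed with a small in-memory sketch. It peeks at the general purpose flag in the first local file header (a 4-byte signature and a 2-byte version precede it, so the 16-bit little-endian flag sits at offset 6) and then reads a UTF-8 ZIP back while deliberately passing the "wrong" charset. The class and helper names are hypothetical, and ISO-8859-1 is again just an assumed example of a non-UTF-8 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class EfsFlagDemo {

    static final int EFS = 1 << 11; // general purpose flag bit 11 (0x0800)

    // Builds an in-memory ZIP with a single empty entry; the name is encoded with cs.
    static byte[] zipWithEntry(String name, Charset cs) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos, cs)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    // Reads the general purpose flag of the first local file header (offset 6).
    static int firstEntryFlag(byte[] zip) {
        return (zip[6] & 0xff) | ((zip[7] & 0xff) << 8);
    }

    // Returns the first entry name as decoded by a ZipInputStream created with cs.
    static String firstEntryName(byte[] zip, Charset cs) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            ZipEntry e = zis.getNextEntry();
            return e == null ? null : e.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        String name = "\u65e5\u672c\u8a9e.txt"; // three kanji followed by ".txt"

        // Point (3): a ZIP written with UTF-8 carries the EFS bit.
        byte[] utf8Zip = zipWithEntry(name, StandardCharsets.UTF_8);
        System.out.println("EFS set (UTF-8 zip): " + ((firstEntryFlag(utf8Zip) & EFS) != 0));

        // Point (2): the EFS bit overrides the charset given to the constructor.
        String decoded = firstEntryName(utf8Zip, StandardCharsets.ISO_8859_1);
        System.out.println("name survives wrong charset: " + name.equals(decoded));

        // A ZIP written with a non-UTF-8 charset does not get the EFS bit.
        byte[] latin1Zip = zipWithEntry("plain.txt", StandardCharsets.ISO_8859_1);
        System.out.println("EFS set (ISO-8859-1 zip): " + ((firstEntryFlag(latin1Zip) & EFS) != 0));
    }
}
```

In practice this means the charset passed to the constructors only matters for archives that do not have the EFS bit set, which is the case for ZIP files produced by many traditional tools.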
Enjoy the APIs! Leave me a comment if you have any questions, issues or problems.
Posted at 01:14PM May 01, 2009 by xuemingshen in Java
Friday May 01, 2009