Most of the time, web applications have to deal with multi-byte characters, whatever programming language they are written in. For non-English text, the recommended choice is Unicode, which maps a very wide range of characters. To get it right, you need to configure every layer, from the operating system up to the application. Internationalization is always a tricky problem in the development cycle, so let us go through it part by part, from the operating system level to the application level, and I will explain internationalization tricks for web-based applications and technologies along the way.

Operating System configuration:

Set the following environment variables on your system:

LANG = en_US.UTF-8; LC_ALL = en_US.UTF-8 
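As a rough illustration of how these variables are consulted, here is a minimal Python sketch. It assumes the usual POSIX precedence (LC_ALL first, then LC_CTYPE, then LANG). The function name effective_codeset is made up for this example and is not part of any library.

```python
import os

def effective_codeset(env=None):
    """Return the codeset part of the first locale variable that is set.

    Assumption: POSIX precedence, LC_ALL overrides LC_CTYPE overrides LANG.
    Returns None when no variable is set or the value names no codeset.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # "en_US.UTF-8@euro" -> "UTF-8": text after "." and before "@"
            _, dot, rest = value.partition(".")
            return rest.partition("@")[0] if dot else None
    return None
```

Calling effective_codeset() with no argument inspects the real environment, while passing a dict makes it easy to try out combinations such as a French LANG overridden by LC_ALL.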

Handling UTF-8 characters in HTML pages:

The most important step in displaying non-English text on your web page is to make sure that the encoding the browser uses to communicate with the server is right. To see what this means, open a browser such as Firefox: under the 'View' menu there is a sub-menu called 'Character Encoding', and in it you should select 'Unicode (UTF-8)' instead of auto-detect. On the server side there are a couple of ways to do this: when the login or welcome page is first served, set the HTML meta tag to 'utf-8', and for the rest of the session keep the same encoding in the HTTP header. So there are two places where the browser encoding can be specified: the Content-Type HTTP header and the Content-Type meta tag.

The meta tag is specified in the <head> of each HTML page, like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The Content-Type HTTP header is set in the server's configuration, and how you do that is vendor-specific. In IIS, for example, you can find the content-type setting for each file type under the "Headers" menu in the properties of your web site.
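To make the two places concrete, here is a minimal sketch of a Python WSGI application that declares UTF-8 both in the HTTP header and in the meta tag. The name app and the page content are assumptions for illustration only; a real site would normally set the header through its server or framework.

```python
def app(environ, start_response):
    """Minimal WSGI app that declares UTF-8 in both places."""
    body = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
        "</head><body>é</body></html>"
    ).encode("utf-8")  # the bytes must really be UTF-8 to match the label
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),  # HTTP header
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a quick local try, the standard library's wsgiref.simple_server.make_server("", 8080, app).serve_forever() should be enough to serve it.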

Jetty Server Configuration

echo off
rem set LANG=fr_FR.ISO8859-1
set LANG=en_US.UTF-8
set JETTY_PORT=8080
set JETTY_HOME=.
java %JAVA_OPTS% -Djetty.port=%JETTY_PORT% -Djetty.home=%JETTY_HOME% -Dfile.encoding=UTF-8 -jar %JETTY_HOME%/start.jar

Instead of setting the LANG system variable, you can pass the JVM properties

-Duser.language=en
-Duser.country=US

in the server startup script.

Tomcat Server Configuration:

To enable UTF-8 in Tomcat, for example, you have to add

URIEncoding="UTF-8"

to each connector enabled/used in conf/server.xml. For example the non-SSL HTTP Connector should read:

<Connector port="8080" maxHttpHeaderSize="8192"
    maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
    enableLookups="false" redirectPort="8443" acceptCount="100"
    connectionTimeout="20000" disableUploadTimeout="true"
    URIEncoding="UTF-8"/>

Warning: In case you're using AJP to connect Tomcat and httpd, make sure you add this attribute to the AJP connector.
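To see why the URI encoding matters, the following Python sketch decodes the same percent-encoded query value in two ways. It is only an illustration of the effect, not Tomcat code, and it assumes the connector would otherwise fall back to ISO-8859-1, as older Tomcat versions do.

```python
from urllib.parse import unquote

raw_query_value = "caf%C3%A9"  # 'café' as a UTF-8 page would send it

# Decoded as UTF-8 (what URIEncoding="UTF-8" asks the connector to do)
as_utf8 = unquote(raw_query_value, encoding="utf-8")

# Decoded as ISO-8859-1 (the assumed fallback when nothing is configured)
as_latin1 = unquote(raw_query_value, encoding="iso-8859-1")

print(as_utf8)    # café
print(as_latin1)  # cafÃ©
```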

HTTP requests from the client side

There are two places in which HTTP requests (from browsers to web servers) may include character data:

  1. In the URL part of the first line of the HTTP request,
  2. In the HTTP content area at the end of the HTTP message, resulting from an HTML <form method='post'...> ... </form> submission

HTTP content part handling characters

Wherever possible, a POST method should be used when international characters are involved.

This is because the browser sends an HTTP Content-Type header, which can help the web server determine the encoding of the content. The Content-Type header tells the server the MIME type of the content (usually application/x-www-form-urlencoded) and can optionally include the character encoding of the content, e.g.:

Content-Type: application/x-www-form-urlencoded;charset=UTF-8

If both the MIME type and the charset are sent in the POST HTTP header, the server can correctly decode the content.

Unfortunately, many browsers do not bother to send the charset information, leaving the web server to guess the correct encoding. For this reason, the Servlet API provides the ServletRequest.setCharacterEncoding(String) method to allow the webapp developer to control the decoding of the form content.

Jetty-6 uses a default of UTF-8 if no overriding character encoding is set on a request.
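The same idea can be sketched outside the Servlet API. The Python function below uses the charset from the Content-Type header when the browser sent one and otherwise falls back to a default chosen by the developer, which is roughly the role setCharacterEncoding plays. The name decode_form and its parameters are hypothetical.

```python
from email.message import Message
from urllib.parse import parse_qs

def decode_form(body, content_type, default_charset="utf-8"):
    """Decode an application/x-www-form-urlencoded body.

    Uses the charset parameter of the Content-Type header if present,
    otherwise default_charset (an assumption made by the application).
    """
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or default_charset
    return parse_qs(body, encoding=charset)

print(decode_form("name=caf%C3%A9",
                  "application/x-www-form-urlencoded;charset=UTF-8"))
```

When the header carries no charset, passing default_charset="iso-8859-1" shows how a value such as caf%E9 from a Latin-1 page would be decoded instead.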

OK, let us look at the HTML form. With <form method="GET">, the browser tries to pass the characters to the server in the character set of the page, but it will only succeed if the characters in question can be represented in that character set. If not, browsers make "their best bet" based on what is available (old style) or use a Unicode encoding (new style).

Example: Western browsers send 'é' as '%E9' by default (URL encoding). But when the page is in UTF-8, the browser first looks up the UTF-8 multi-byte encoding of 'é'. In this case it is 2 bytes, because 'é' (U+00E9) lies in the code point range U+0080 to U+07FF. Those two bytes, 0xC3 and 0xA9, correspond to Ã and © in ISO-8859-1 and result in '%C3%A9' (URL encoding) in the eventual query string. <form method="post" enctype="application/x-www-form-urlencoded"> is the same as <form method="POST"> and uses the same general principle as GET.
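The two encodings of 'é' described above can be reproduced with Python's standard library. This is only a sketch of what the browser does when it builds the query string, not browser code.

```python
from urllib.parse import quote

char = "é"  # U+00E9

latin1_form = quote(char, encoding="iso-8859-1")  # page served as ISO-8859-1
utf8_form = quote(char, encoding="utf-8")         # page served as UTF-8

print(latin1_form)                # %E9
print(utf8_form)                  # %C3%A9
print(len(char.encode("utf-8")))  # 2
```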

In <form method="POST" enctype="multipart/form-data"> there is no default encoding at all, because this encoding type needs to be able to transfer non-base64-ed binaries. 'é' will be passed as 'é' and that's it.

I think we are not fully done yet. There are three more things to consider:
(1) The user's (available) charsets
(2) The charset of the web page
(3) How JavaScript handles characters internally

Only (3) is of importance in your case (this is for the comment posted by laxmi on this blog):

Paste into input field:

<br>ﾔﾂｶ<hr>
<form>
<input name="i">
<input type="button" value="check" onClick="
if (document.forms[0].i.value == '\uFF94\uFF82\uFF76')  {
alert('equal') }
else  {
alert('not equal')
}"></form>
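The comparison in the form works on code points, so characters that merely look alike do not match. The escapes \uFF94\uFF82\uFF76 are halfwidth katakana; ordinary katakana with the same reading are different code points and would give 'not equal'. A small Python sketch of the same comparison, with variable names chosen for this example:

```python
import unicodedata

halfwidth = "\uFF94\uFF82\uFF76"  # the string the script compares against
fullwidth = "\u30E4\u30C4\u30AB"  # ordinary katakana that look similar

print(halfwidth == fullwidth)  # False: different code points
# NFKC compatibility normalization folds halfwidth forms into ordinary ones
print(unicodedata.normalize("NFKC", halfwidth) == fullwidth)  # True
```

If both forms should be accepted, normalizing the input before comparing is one option. Whether that is appropriate depends on the application.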

Using the charset attribute of the <script> tag in JavaScript

The easiest way to ensure your script is served as UTF-8 is to add a charset attribute (charset="utf-8") to your <script> tags in the parent page:

<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>

You can also configure your web server to serve all .js files in the UTF-8 charset, or only the .js files in a single directory. You can do the latter (in Apache) by adding this line to the .htaccess file in the directory where your scripts are stored:

AddCharset utf-8 .js

Working on MySQL Database and configuration

On *nix systems, the MySQL configuration file is commonly located at /etc/mysql/my.cnf:

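[client]
default-character-set=utf8
[mysqld]
# default-character-set is the older spelling of character-set-server and was
# removed from the server options in MySQL 5.5; drop this line on newer servers.
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_general_ci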

If you have been using MySQL 4.0 and storing UTF-8 data in string columns regardless of the default server character set, one of the things you will want to do after upgrading to MySQL 4.1 is let the server know the true character set of those columns. But if you simply run ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8, the server will try to convert the data in the myColumn column from the server default character set to UTF-8, which corrupts bytes that are already UTF-8. You need to do a two-step conversion through a binary type to avoid this:

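 -- VARBINARY keeps the stored bytes as they are; on current MySQL versions
 -- a fixed-length BINARY column would pad short values.
 ALTER TABLE myTable MODIFY myColumn VARBINARY(255);
 ALTER TABLE myTable MODIFY myColumn VARCHAR(255) CHARACTER SET utf8;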
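What goes wrong with the one-step conversion can be imitated in a few lines of Python. This sketch assumes the server default character set is latin1. It models only the byte-level effect, not MySQL itself.

```python
original = "é"
stored = original.encode("utf-8")  # bytes already in the column: C3 A9

# One-step ALTER: the server reads the bytes as latin1, then re-encodes them
double_encoded = stored.decode("latin-1").encode("utf-8")

# Two-step ALTER: the bytes are only relabelled, never converted
relabelled = stored.decode("utf-8")

print(double_encoded.hex())    # c383c2a9
print(relabelled == original)  # True
```

The four bytes C3 83 C2 A9 are what shows up as 'Ã©' in a UTF-8 client, the typical sign of double encoding.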

Working with UTF-8 on the Web PHP

Ignoring older (and badly implemented) browsers for a second, handling UTF-8 data on the web is quite simple. You just need to indicate in the header and/or body of your document the character set, like so (using PHP):

<?php header("Content-type: text/html; charset=utf-8");?>
<html>
 <head>
  <meta http-equiv="Content-type" content="text/html; charset=utf-8">
  ...

If your HTML page contains a form, browsers will generally send the results back in the character set of the page. So if your page is sent in UTF-8, you will (usually) get UTF-8 results back. The default encoding of HTML documents is ISO-8859-1, so by default you will get form data encoded as ISO-8859-1, with one big exception: some browsers (including Microsoft Internet Explorer and Apple Safari) will actually send the data encoded as Windows-1252, which extends ISO-8859-1 with some special symbols, like the euro (€) and the curly quotes (“”).

It's those "usually" and "ignoring older (and badly implemented) browsers" qualifiers that make it a little bit tricky: if you want to make sure to catch these edge cases, you'll need to do a little bit of extra work. One thing you can do is add a hidden field to your form containing some data that is likely to be corrupted if the client isn't handling the character set correctly:

 <input type="hidden" name="charset_check" value="ä™®">

You can also verify that you have received valid UTF-8 content with the regular expression published by the W3C.
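The W3C expression itself is not reproduced here. As a stand-in, the following Python sketch performs a comparable check by attempting a strict UTF-8 decode; the function name is_valid_utf8 is an assumption of this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict by default: rejects malformed sequences
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("ä™®".encode("utf-8")))  # True
print(is_valid_utf8(b"caf\xe9"))             # False: lone Latin-1 byte
```

Note that validity alone does not prove the data was meant as UTF-8. For instance, the three bytes E4 99 AE also form a well-formed UTF-8 sequence, which is one reason a known probe value in a hidden field is a useful complement.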

If the data is not valid UTF-8, or you already know that you are dealing with data in another character set that you want to convert into UTF-8, PHP supports a few different ways of converting the data, such as iconv() and the mbstring extension's mb_convert_encoding().

So handling input might look something like this:

<?php
$test  = $_REQUEST['charset_check']; /* our test field */
$field = $_REQUEST['field']; /* the data field */

if (bin2hex($test) == "c3a4e284a2c2ae") { /* UTF-8 for "ä™®" */
  /* Nothing to do: it's UTF-8! */
} elseif (bin2hex($test) == "e499ae") { /* Windows-1252 */
  $field = iconv("windows-1252", "utf-8", $field);
} else {
  die("Sorry, I didn't understand the character set of the data you sent!");
}

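/* Assumes $mysqli is an open mysqli connection whose charset is utf8, and
   that my_table stands in for your real table name. mysql_query() was
   removed in PHP 7, and addslashes() is not a safe way to escape SQL. */
$stmt = $mysqli->prepare("INSERT INTO my_table SET field = ?");
$stmt->bind_param("s", $field);
$stmt->execute() or die("INSERT failed: " . $stmt->error);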
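For readers who do not use PHP, the probe-field logic above can be sketched in Python as follows. The names UTF8_PROBE, CP1252_PROBE and to_utf8_text are made up for this example, and the function works on the raw bytes of the submitted fields.

```python
UTF8_PROBE = "ä™®".encode("utf-8")     # hex c3a4e284a2c2ae
CP1252_PROBE = "ä™®".encode("cp1252")  # hex e499ae

def to_utf8_text(probe: bytes, field: bytes) -> str:
    """Decode field using the charset revealed by the hidden probe bytes."""
    if probe == UTF8_PROBE:
        return field.decode("utf-8")
    if probe == CP1252_PROBE:
        return field.decode("cp1252")
    raise ValueError("unrecognised character set")

print(to_utf8_text(CP1252_PROBE, b"\x80"))  # €
```

Comparing the probe bytes exactly, rather than guessing from the data field, keeps the decision independent of what the user typed.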

I have covered the most commonly used technologies here; the same ideas can be extended to Web 2.0 applications.

Comments:

Thanks for the blog. Is there any way to turn a form submission in Japanese character into it's ampersand encoded character? For example, a person searches for "一" which is one in Japanese, but the results passed in the POST submission is "&#19968;" which one in UTF-8 (which is the character in my MySQL database that is being searched for). Thanks in advance (again!).

Posted by JW on March 10, 2009 at 07:48 AM IST #

Hi my chinese webpage was utf-8 encoded. It loads well in most every browsers except loading a total blank page until the user switch the character encoding to auto-detect.

I need your help.

Thanks

James
Hong Kong

Posted by James Wong on March 26, 2009 at 08:24 AM IST #

Hi James,
Install a Live HTTP Headers add-on in your browser and look at the request/response from the server to make sure that you are getting the meta tag below:

<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8"> </head>

Regards
Shankar

Posted by Shankar Gowda Mbn on March 26, 2009 at 03:18 PM IST #

thank you so much.

Posted by glosgu on June 01, 2009 at 05:07 PM IST #


This blog copyright 2009 by shankar