Claudiu Persoiu

Claudiu Persoiu's blog

PHP and Unicode using UTF-8


One of the biggest issues with the web is encoding.

In the old days the de facto standard was ISO 8859-1, which defines 191 Latin characters, each stored in a single byte (1 char = 1 B). Other languages used other encodings, which led to portability problems and made it hard to cover a large number of languages at once.

The problem shows up when a project has to be available in several languages and the set of languages is not known in advance. A big project like WordPress, for example, should work in any language.

Unicode is a better alternative to ISO 8859-1, with more than 100,000 characters defined. In other words, it covers almost every character of almost every existing language.

As I mentioned when writing about MySQL, UTF-8 characters have a variable length of between 1 and 4 bytes.
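To make the variable length concrete, here is a small sketch that prints how many bytes a few characters take in UTF-8. The sample characters are my own choice, and the "\u{...}" escapes assume PHP 7.0 or newer.

```php
<?php
// Each string below holds exactly one character, written as a Unicode escape
// so the example does not depend on the encoding of this file.
$samples = [
    'U+0061 LATIN SMALL LETTER A'              => "\u{0061}",
    'U+015F LATIN SMALL LETTER S WITH CEDILLA' => "\u{015F}",
    'U+20AC EURO SIGN'                         => "\u{20AC}",
    'U+1F600 GRINNING FACE'                    => "\u{1F600}",
];

// strlen() counts bytes, so it shows how many bytes UTF-8 needs per character.
// With a single input array, array_map() keeps the string keys.
$byteLengths = array_map('strlen', $samples);

foreach ($byteLengths as $name => $bytes) {
    echo $name . ': ' . $bytes . ' byte(s)' . PHP_EOL;
}
```

The four samples take 1, 2, 3 and 4 bytes respectively.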

Displaying the UTF-8 content in PHP pages

For the browser to interpret the page content as UTF-8, it must receive the right header:

<?php header("Content-type: text/html; charset=utf-8");?>

Attention! The header must be sent before anything else the server sends. In other words, header() has to be called before any output, including whitespace or HTML that precedes the opening <?php tag.
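One way to guard against sending the header too late is to check headers_sent() first. This is only a sketch: sendUtf8Header() is a hypothetical helper name, not something PHP provides.

```php
<?php
// Hypothetical helper: sends the UTF-8 Content-Type header only while that is
// still possible, and reports whether it did.
function sendUtf8Header(): bool
{
    // headers_sent() returns true once output has already been sent.
    if (headers_sent()) {
        return false;
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

// Call this before any echo or HTML.
$sent = sendUtf8Header();
echo ($sent ? 'Header set' : 'Too late: output already started') . PHP_EOL;
```

If the function returns false, something was already printed earlier in the script, and that is the place to look.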

The document type can also be specified with the “Content-Type” meta tag. If the page already has a similar meta tag, it should be removed and replaced with:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The .htaccess file and string processing

Add to the .htaccess file (for Apache servers) the following lines:

# default charset used by PHP
php_value default_charset utf-8
# encoding for mbstring
php_value mbstring.internal_encoding utf-8
php_value mbstring.func_overload 7

The first line sets the default charset for PHP; this setting can also be made directly in php.ini.

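The second and third lines configure the mbstring (multibyte string) functions: the internal encoding they assume, and function overloading. The value 7 overloads all three groups (mail, string and regular expression functions), so a call such as strlen() is handled by mb_strlen().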
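If editing .htaccess or php.ini is not an option, the first two settings can also be applied from the script itself. This is a sketch that assumes the mbstring extension is loaded. mbstring.func_overload is left out because it cannot be changed at runtime.

```php
<?php
// Runtime counterparts of the first two .htaccess directives.
// Assumes the mbstring extension is loaded.
ini_set('default_charset', 'UTF-8'); // charset PHP puts in the Content-Type header
mb_internal_encoding('UTF-8');       // default encoding for the mb_* functions

echo ini_get('default_charset') . PHP_EOL;
echo mb_internal_encoding() . PHP_EOL;
```

These calls should run early, before any string processing and before any output.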

With UTF-8, as I was saying earlier, 1 char != 1 B, so the byte-oriented string functions can give wrong results:

$var = 'aşadar';

echo strlen($var).PHP_EOL; // 7
echo strtoupper($var).PHP_EOL; // AşADAR

// using mbstring functions
echo mb_strlen($var).PHP_EOL; // 6
echo mb_strtoupper($var).PHP_EOL; // AŞADAR

This is why we set the mbstring function overloading in the .htaccess file. Content entered through forms should be processed with mbstring functions, to avoid problems like the ones in the example above.
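As an illustration of processing form content with mbstring, here is a minimal sketch that calls the mb_* functions explicitly instead of relying on overloading (mbstring.func_overload was removed in PHP 8.0). cleanField() and its parameters are hypothetical names, and the nullable return type assumes PHP 7.1 or newer.

```php
<?php
// Hypothetical helper for a form field: rejects invalid UTF-8 and limits the
// value to a number of characters (not bytes).
function cleanField(string $value, int $maxChars): ?string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        return null; // not valid UTF-8, let the caller decide what to do
    }
    // trim() strips only ASCII whitespace by default, which is safe for UTF-8.
    return mb_substr(trim($value), 0, $maxChars, 'UTF-8');
}

$name = cleanField("  a\u{015F}adar  ", 4);

echo $name . PHP_EOL;                     // aşad
echo mb_strlen($name, 'UTF-8') . PHP_EOL; // 4
echo strlen($name) . PHP_EOL;             // 5
```

Passing 'UTF-8' explicitly to each call keeps the result independent of the server's mbstring.internal_encoding setting.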

The available functions are listed in the PHP manual.

Encoding old content

There are many ways to convert ISO 8859-1 content to UTF-8. A couple of ways of doing it in PHP are:

The iconv() function, which converts from one encoding to another:

echo iconv("ISO-8859-1", "UTF-8", "Test");

The utf8_encode() function, which converts from ISO 8859-1 to UTF-8:

echo utf8_encode("Test");
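When old content is converted in bulk, text that is already UTF-8 can easily get encoded a second time. The sketch below guards against that with iconv() and mb_check_encoding(). toUtf8() is a hypothetical helper, and the check is only a heuristic: a few ISO 8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: converts ISO 8859-1 text to UTF-8, but leaves the input
// alone when it is already valid UTF-8, to avoid encoding it twice.
function toUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $text);
    // iconv() returns false on failure; fall back to the original text then.
    return $converted === false ? $text : $converted;
}

$latin1 = "caf\xE9"; // "café" in ISO 8859-1 (é is the single byte 0xE9)
$utf8   = toUtf8($latin1);

echo bin2hex($utf8) . PHP_EOL;         // 636166c3a9
echo bin2hex(toUtf8($utf8)) . PHP_EOL; // 636166c3a9 (unchanged)
```

The hex output shows the single byte e9 becoming the two-byte sequence c3 a9, and a second pass leaving the text untouched.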

What does the future bring?

The long-awaited PHP 6 will have native support for Unicode, so all the above tricks will become unnecessary. At the time of writing, PHP 6 is 70.70% done, and with a little luck it will be ready in less than a year.

Written by Claudiu Persoiu

11 August 2009 at 10:40 AM

Posted in PHP
