-
Another year has passed and PHP 6 is still not here…
But that is not exactly news: this is the fourth year in which this long-awaited version has not been released. It is not for nothing that it is called the most awaited version.
Overall, though, it was a good year for the community. We still don't have native Unicode in a stable version, but all the other awaited features have arrived in PHP 5.3, which will probably need another few years before it is used on a large scale.
Last year everyone expected Oracle to enter the small and medium database market in full force by purchasing Sun, enlarging its already well-established portfolio on the enterprise market. It seems that it was not to be, at least not yet: the European Commission is still analyzing the deal.
However, MySQL is not what it used to be 5-6 years ago, when nobody dared to use it for enterprise products. These days MySQL is ready to be used in both small and large products, including ones that require a lot of scalability.
But back to the year that just ended: it was a full year, even with the economic crisis.
-
One of the biggest issues with the web is encoding.
In the old days the base standard was ISO 8859-1, which defines 191 Latin characters, with 1 char = 1 byte. Different languages used different encodings, and that is where many problems came from: poor portability, no way of covering a larger number of languages at once, and so on.
The problem occurs when a project must be available in several languages and the set of languages is not known in advance. A big project like WordPress, for example, should be usable in any language.
Unicode is a better alternative to ISO 8859-1, with more than 100,000 characters defined. In other words, it has just about every character of just about any existing language.
As I said for MySQL, UTF-8 characters have a variable length of between 1 and 4 bytes.
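To make the variable length concrete, here is a small sketch that prints how many bytes a few sample characters take in UTF-8. The sample characters and the array keys are my own choice for illustration, and the characters are written as raw bytes so the result does not depend on how the file is saved.

```php
<?php
// One character per entry, written as raw UTF-8 bytes.
$samples = [
    'a'         => "a",                 // U+0061
    's-cedilla' => "\xC5\x9F",          // U+015F
    'euro'      => "\xE2\x82\xAC",      // U+20AC
    'g-clef'    => "\xF0\x9D\x84\x9E",  // U+1D11E
];

$byteLengths = [];
foreach ($samples as $name => $char) {
    // strlen() counts bytes; mb_strlen() counts characters in the given encoding.
    $byteLengths[$name] = strlen($char);
    printf("%-10s %d byte(s), %d character(s)\n", $name, strlen($char), mb_strlen($char, 'UTF-8'));
}
```

Each entry is a single character for mb_strlen(), while the byte count grows from 1 to 4.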
Displaying UTF-8 content in PHP pages
For the browser to interpret the page content as UTF-8, it must receive the right header:
<?php header("Content-Type: text/html; charset=utf-8"); ?>
Attention! The header must be sent before anything else the server sends. In other words, header() has to be called before the page produces any output, including whitespace or HTML placed before the opening PHP tag.
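One way to guard against the "output already started" problem is to check headers_sent() before calling header(). The two helper functions below are hypothetical names of mine, not part of PHP; this is only a sketch of the idea.

```php
<?php
// Hypothetical helper: builds the header line for a given MIME type.
function utf8_content_type($mime = 'text/html')
{
    return 'Content-Type: ' . $mime . '; charset=utf-8';
}

// Hypothetical helper: sends the header only if no output has started yet.
function send_utf8_header($mime = 'text/html')
{
    if (headers_sent($file, $line)) {
        // Output already began at $file:$line, so header() would only raise a warning.
        error_log("Cannot set charset, output started at $file:$line");
        return false;
    }
    header(utf8_content_type($mime));
    return true;
}

// Call this before any echo, HTML, or stray whitespace.
send_utf8_header();
```

The log message points at the file and line where output began, which is usually the quickest way to find a stray blank line or BOM.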
The document type can also be specified with the “Content-Type” meta tag. If the page already has such a meta tag, it should be removed and replaced with:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The .htaccess file and string processing
Add to the .htaccess file (for Apache servers) the following lines:
# default charset used by PHP
php_value default_charset utf-8
# encoding for mbstring
php_value mbstring.internal_encoding utf-8
php_value mbstring.func_overload 7
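The default_charset setting sets the default charset for PHP; it can also be made directly in php.ini.
The two mbstring settings configure the mbstring (multi-byte string) functions: the internal encoding they work with, and the overloading of the standard string functions with their mbstring equivalents.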
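If .htaccess or php.ini cannot be edited, the first two settings can also be applied from the script itself. This is a sketch under that assumption; the function-overloading setting is left out, since as far as I know it cannot be changed from a script.

```php
<?php
// Runtime counterpart of "php_value default_charset utf-8".
ini_set('default_charset', 'UTF-8');

// Runtime counterpart of "php_value mbstring.internal_encoding utf-8".
mb_internal_encoding('UTF-8');

// With the internal encoding set, the encoding argument can be left out.
// The string is "aşadar" written with raw UTF-8 bytes for the "ş".
echo mb_strlen("a\xC5\x9Fadar") . PHP_EOL;
```

These calls belong at the very top of the script, or in a common include file, so every later mbstring call sees the same encoding.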
As I said earlier, with UTF-8 1 char != 1 byte, so the byte-based string functions can give wrong results:
$var = 'aşadar';

echo strlen($var).PHP_EOL; // 7
echo strtoupper($var).PHP_EOL; // AşADAR

// using mbstring functions
echo mb_strlen($var).PHP_EOL; // 6
echo mb_strtoupper($var).PHP_EOL; // AŞADAR
This is why we configured mbstring in the .htaccess file. Content entered through forms should be processed with the mbstring functions, to avoid problems like the ones in the example above.
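As an illustration of processing form content with mbstring, here is a minimal sketch that validates a field and shortens it by characters rather than bytes. The helper clean_field() and the field name "name" are assumptions of mine, not something PHP provides.

```php
<?php
// Hypothetical helper: validates a form field and limits its length in characters.
function clean_field($value, $maxChars)
{
    // Reject input that is not valid UTF-8 instead of storing broken bytes.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    // trim() only strips ASCII whitespace bytes, which never occur inside
    // a multi-byte UTF-8 sequence, so it is safe here.
    $value = trim($value);
    // mb_substr() cuts on character boundaries, so no character is split in half.
    return mb_substr($value, 0, $maxChars, 'UTF-8');
}

// 'name' is an assumed form field.
$name = clean_field(isset($_POST['name']) ? $_POST['name'] : '', 50);
```

A plain substr() with the same limit could cut a two-byte character in the middle and leave an invalid string behind.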
The available functions are in the manual.
Encoding old content
There are many ways to convert ISO 8859-1 content to UTF-8. A couple of ways of doing that with PHP are:
– the iconv() function, which converts from one specified encoding to another:
echo iconv("ISO-8859-1", "UTF-8", "Test");
– the utf8_encode() function, which converts from ISO 8859-1 to UTF-8:
echo utf8_encode("Test");
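When converting old content in bulk, some of it may already be UTF-8, and converting it a second time would garble it. The sketch below skips strings that are already valid UTF-8. The helper to_utf8() is a hypothetical name of mine, and the check is only a heuristic: a short ISO 8859-1 string can, in rare cases, also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: converts ISO 8859-1 text to UTF-8 unless it is already valid UTF-8.
function to_utf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (plain ASCII included)
    }
    return iconv('ISO-8859-1', 'UTF-8', $text);
}

// "\xE9" is "é" in ISO 8859-1; in UTF-8 it becomes the two bytes C3 A9.
echo bin2hex(to_utf8("\xE9")) . PHP_EOL;
```

Running the helper twice on the same string gives the same result as running it once, which is what makes it safe for partly migrated data.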
What does the future bring?
The long-awaited PHP 6 will have native support for Unicode, so all the tricks above will become unnecessary. At the time of writing, PHP 6 is 70.70% done, and with a little luck it will be ready in less than a year.