Claudiu Persoiu

Blog-ul lui Claudiu Persoiu


Archive for the ‘unicode’ tag

PHP and Unicode using UTF-8

One of the biggest issues with the web is encoding.

In the old days the base standard was ISO 8859-1, which defines 191 Latin characters, with 1 char = 1 byte. Different languages used different encodings, which caused many portability problems and made it hard to cover a larger number of languages.

The problem occurs when a project must be available in several languages and the set of languages is not known in advance. A big project like WordPress, for example, should be available in any language.

Unicode is a better alternative to ISO 8859-1, with more than 100,000 characters defined. In other words, it covers nearly every character of nearly every existing language.

As I said in the MySQL post, UTF-8 characters have a variable length of between 1 and 4 bytes.

Displaying the UTF-8 content in PHP pages

For the browser to interpret the page content as UTF-8, it should receive the right header:

<?php header("Content-type: text/html; charset=utf-8");?>

Attention! The header must be sent before any output from the server. In other words, header() has to be called before anything is printed on the page.
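A minimal sketch of guarding that call. The helper name send_utf8_header() is an assumption made for illustration, not part of the original setup; headers_sent() and header() are standard PHP functions.

```php
<?php
// Hypothetical helper: sends the UTF-8 Content-Type header only if
// no output has been sent yet. Returns true when header() was called.
function send_utf8_header()
{
    if (headers_sent()) {
        // Too late: some output (even a blank line before the opening
        // PHP tag) has already gone to the client.
        return false;
    }
    header('Content-Type: text/html; charset=utf-8');
    return true;
}

send_utf8_header();
```

Calling the helper at the very top of the script, before any HTML, keeps the rule above easy to follow.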

The type of the document can also be specified with the “Content-Type” meta tag. If there is a similar meta tag on the page, it should be removed and replaced with:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The .htaccess file and string processing

Add to the .htaccess file (for Apache servers) the following lines:

# default charset used by PHP
php_value default_charset utf-8
# encoding for mbstring
php_value mbstring.internal_encoding utf-8
php_value mbstring.func_overload 7

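The first line sets the default charset for PHP; this setting can also be made directly in php.ini.

The second and third lines configure mbstring (multibyte string): the internal encoding and function overloading. The value 7 overloads mail(), the string functions and the regular expression functions with their mb_ equivalents.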

With UTF-8, as I was saying earlier, 1 char != 1 byte, so errors may appear when the plain string functions are used:

$var = 'aşadar';

echo strlen($var).PHP_EOL; // 7
echo strtoupper($var).PHP_EOL; // AşADAR

// using mbstring functions
echo mb_strlen($var).PHP_EOL; // 6
echo mb_strtoupper($var).PHP_EOL; // AŞADAR

This is why we set the mbstring overloading mode in the .htaccess file: with it enabled, strlen() and strtoupper() are replaced by their mbstring equivalents. Content entered through forms should be processed using mbstring functions, to avoid problems like the ones in the example above.
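The same comparison can be sketched with the encoding passed explicitly, so the result does not depend on the mbstring.internal_encoding setting. This assumes the mbstring extension is loaded and function overloading is off; the string is written with byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
$var = "a\xC5\x9Fadar"; // "aşadar" in UTF-8; ş is the two bytes C5 9F

$bytes  = strlen($var);                    // counts bytes
$chars  = mb_strlen($var, 'UTF-8');        // counts characters
$upper  = mb_strtoupper($var, 'UTF-8');    // upper-cases ş as well
$first2 = mb_substr($var, 0, 2, 'UTF-8');  // "aş", never cuts a character in half

echo $bytes, ' ', $chars, ' ', $upper, ' ', $first2, PHP_EOL;
```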

The available functions are in the manual.

Converting old content

There are many ways to convert ISO 8859-1 content to UTF-8. Two ways of doing it with PHP are:

The iconv() function, which converts from one encoding to another:

echo iconv("ISO-8859-1", "UTF-8", "Test");

The utf8_encode() function, which converts from ISO 8859-1 to UTF-8:

echo utf8_encode("Test");
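One thing to watch for when converting old content is converting text that is already UTF-8 a second time, which corrupts it. A minimal sketch of a guard, assuming the mbstring extension is loaded; the helper name to_utf8() is hypothetical, and the check is only a heuristic, since some ISO 8859-1 byte sequences also happen to be valid UTF-8.

```php
<?php
// Hypothetical helper: converts ISO 8859-1 input to UTF-8 unless the
// input is already valid UTF-8.
function to_utf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";     // "café" in ISO 8859-1
$utf8   = "caf\xC3\xA9"; // "café" in UTF-8

echo bin2hex(to_utf8($latin1)), PHP_EOL; // same bytes as $utf8
echo bin2hex(to_utf8($utf8)), PHP_EOL;   // unchanged
```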

What does the future bring?

The long-awaited PHP 6 will have native support for Unicode, so all the tricks above will become unnecessary. At the time of writing, PHP 6 is 70.70% done, and with a little luck it will be ready in less than a year.

Written by Claudiu Persoiu

11 August 2009 at 10:40 AM

Posted in PHP

MySQL and Unicode using UTF-8

With globalization, the old ASCII code is no longer suitable. Suppose that one day you have to develop a project in German, Russian or even Japanese: you could adapt the charset for each of these languages, or you could simply develop using Unicode.

To use Unicode with MySQL, you can use UTF-8.

Note that UTF-8 characters are variable in length and ASCII compatible. In ASCII 1 char = 1 byte; in UTF-8 1 char can take between 1 and 4 bytes. MySQL’s utf8 charset stores at most 3 bytes per character; the 4-byte characters need utf8mb4, available from MySQL 5.5.3.
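The variable length is easy to see by counting bytes. A small sketch in PHP, where strlen() counts bytes; the sample characters are my own choice and are written as byte escapes so the result does not depend on the encoding of the source file.

```php
<?php
$samples = array(
    'a'         => "a",                // U+0061, plain ASCII
    's-cedilla' => "\xC5\x9F",         // ş, U+015F
    'euro'      => "\xE2\x82\xAC",     // €, U+20AC
    'emoji'     => "\xF0\x9F\x98\x80", // U+1F600
);

foreach ($samples as $name => $char) {
    echo $name, ': ', strlen($char), ' byte(s)', PHP_EOL;
}
```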

UTF-8 charset and collation on the server

The character encoding in MySQL is determined by the charset.

To check whether UTF-8 is installed on the server:

SHOW CHARSET LIKE 'utf8';

or with information_schema:

SELECT * FROM information_schema.CHARACTER_SETS WHERE CHARACTER_SET_NAME = 'utf8';

If the charset was found, we can continue.

Another element that comes with the charset is the collation, which is used for comparing strings when sorting.

To see what collations are available on the server:

SHOW COLLATION WHERE CHARSET = 'utf8';

or with information_schema:

SELECT * FROM information_schema.COLLATIONS WHERE CHARACTER_SET_NAME = 'utf8';

Collations are usually per language, for comparing strings with or without diacritics for example. A “bin” collation can also be used, which orders strings in binary mode, i.e. by byte value, so “a” is greater than “A” for example.

If no collation is specified, then the default one will be used.
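The difference between a binary and a case-insensitive ordering can be sketched in PHP. This is only an analogy: MySQL collations apply their own rules (including for diacritics), and strcmp() and strcasecmp() merely stand in for a “bin” and a “_ci” collation here.

```php
<?php
$words = array('b', 'A', 'C');

$binary = $words;
usort($binary, 'strcmp');              // byte-wise comparison, like a "bin" collation

$caseInsensitive = $words;
usort($caseInsensitive, 'strcasecmp'); // ignores case, like a "_ci" collation

echo implode(',', $binary), PHP_EOL;          // A,C,b
echo implode(',', $caseInsensitive), PHP_EOL; // A,b,C
```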

UTF-8 and databases

When creating a database you can specify the default charset, which is used for all new tables that do not specify a charset of their own.

For example:

CREATE DATABASE db_name CHARACTER SET utf8 COLLATE utf8_romanian_ci;

Or, to modify the default charset of an existing database:

ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_romanian_ci;

UTF-8, tables and columns

To modify existing tables, ALTER TABLE must be used.

A table can have a default charset and collation, and each column can have its own charset and collation.

For more information about the table:

SHOW CREATE TABLE tab;

To set the default charset for an existing table (existing columns keep their own charset):

ALTER TABLE tab CHARSET = utf8 COLLATE = utf8_romanian_ci;

To change the charset of a VARCHAR(200) column:

ALTER TABLE tab MODIFY c1 VARCHAR(200) CHARSET utf8 COLLATE utf8_romanian_ci;

String size

A “problem” that may arise is related to the size of a character: it can occupy more than 1 byte (between 1 and 4 in UTF-8). That is why, to measure a string column (like VARCHAR) in characters, you must use CHAR_LENGTH(str) instead of LENGTH(), which counts bytes.

A short example:

SET @var = 'aşadar';
SELECT CHAR_LENGTH(@var) AS 'Char', LENGTH(@var) AS 'Length';

-- The output is: Char = 6 and Length = 7, because ş takes 2 bytes

Written by Claudiu Persoiu

10 August 2009 at 1:40 PM

Posted in MySQL

What’s new in PHP 5.3?

While we wait for the long-awaited version 6, PHP 5.3 has already reached RC 2.

PHP 5.3 comes with many new things; in fact there are so many that it could easily have been version 6. So why isn’t it PHP 6? Because everyone expects version 6 to have Unicode support, and when a community of programmers has been waiting for that for more than 3 years, you cannot simply tell them that you have decided to postpone Unicode support until version 7.

Namespaces

Among the new features are namespaces. I find the idea interesting but not fantastic; initially they were not on the priority list, because the problem they solve was considered relatively easy to handle using prefixes and classes.

An example of a namespace:

<?php
// clasa.php

// namespace definition; in this form it must be the first statement in the file
namespace teste;

// a class with a constructor
class clasaTest {
   function __construct() {
      echo "in constructor";
   }
}

// a function
function funcTest() {
   echo "in functie";
}

// a constant
const CONSTANTA = 'global';
?>

The same file can also be placed in a namespace in the following way:

<?php
// clasa.php

// namespace definition
namespace teste {

   // a class with a constructor
   class clasaTest {
     function __construct() {
        echo "in constructor";
     }
   }

   // a function
   function funcTest() {
      echo "in functie";
   }

   // a constant
   const CONSTANTA = 'global';

}
?>

The file in which I run the tests:

<?php

// include the file above
require('clasa.php');

// commented out so the rest of the script can run:
// $obj = new clasaTest(); // Fatal error: Class 'clasaTest' not found in ...

$obj2 = new teste\clasaTest(); // prints "in constructor"

// apparently we overwrite the constant
const CONSTANTA = 'local';

echo CONSTANTA; // prints "local"

echo teste\CONSTANTA; // prints "global"

teste\funcTest(); // prints "in functie"

?>

And if we do not want to keep using the “\” operator and the namespace name:

<?php
// must be the first statement
namespace teste;

// include the file above
require('clasa.php');

$obj = new clasaTest(); // prints "in constructor"

// try to overwrite the constant
const CONSTANTA = 'local'; // Notice: Constant teste\CONSTANTA already defined in...

echo CONSTANTA; // prints "global"

?>

These are the basic features; using them will help portability and scalability. Class and function names no longer have to be unique, they simply have to be in different namespaces.

Lambda functions and “closures”

The most interesting features, in my opinion, are lambda functions and “closures”.

Sound familiar? Maybe because they are heavily used in JavaScript:

<script language="javascript">
closure = function(text) {
   alert(text);
}

closure("hello");
</script>

And now it is possible in PHP:

<?php
$closure = function ($text) {
   return $text;
};

echo $closure("hello");
?>

Along with lambda functions and “closures”, “use” was also introduced, which allows variables from the enclosing scope to be used:

<?php
$x = 'Claudiu';
$closure = function ($text) use ($x) {
   return $text.' '.$x;
};

echo $closure("hello"); // prints "hello Claudiu"
?>

And if, in the example above, we want to modify the variables passed to “use”, all we have to do is pass them by reference:

<?php
$x = 'Claudiu';
$closure = function ($text) use (&$x) {
   $x = 'utilizator';
   return $text;
};

echo $closure("hello"); // prints "hello"

echo $x; // prints "utilizator"
?>
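The difference between the two forms of “use” also shows up when the outer variable changes after the closure is defined. A minimal sketch; the variable names are chosen only for illustration.

```php
<?php
$x = 1;

// Captured by value: the closure keeps the value $x had at definition time.
$byValue = function () use ($x) {
    return $x;
};

// Captured by reference: the closure sees later changes to $x.
$byReference = function () use (&$x) {
    return $x;
};

$x = 2;

echo $byValue(), PHP_EOL;     // 1
echo $byReference(), PHP_EOL; // 2
```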

And to take a concrete example of where they can be useful, consider the “usort” function:

<?php
function cmp($a, $b)
{
    if ($a == $b) return 0;
    return ($a < $b) ? -1 : 1;
}

$a = array(3, 2, 5, 6, 1);

usort($a, "cmp");

// is equivalent to:

$a = array(3, 2, 5, 6, 1);

usort($a, function ($a, $b){
    if ($a == $b) return 0;
    return ($a < $b) ? -1 : 1;
});
?>
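Combined with “use”, the comparison function can also be parameterised, which a named function like cmp() cannot do easily. A sketch, assuming a hypothetical helper by_key() and made-up sample rows:

```php
<?php
// Hypothetical helper: builds a comparison closure for a given array key.
function by_key($key)
{
    return function ($a, $b) use ($key) {
        if ($a[$key] == $b[$key]) return 0;
        return ($a[$key] < $b[$key]) ? -1 : 1;
    };
}

$rows = array(
    array('name' => 'gamma', 'size' => 3),
    array('name' => 'alpha', 'size' => 5),
    array('name' => 'beta',  'size' => 1),
);

$bySize = $rows;
usort($bySize, by_key('size')); // beta, gamma, alpha

$byName = $rows;
usort($byName, by_key('name')); // alpha, beta, gamma

echo $bySize[0]['name'], ' ', $byName[0]['name'], PHP_EOL; // beta alpha
```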

Nice, isn’t it? I think it is also much clearer this way what happens in the function and what it is used for.

Objects can also behave like closures, using the new “magic method” __invoke():

<?php 

class testCl {
   function __invoke() {
      return "metoda de closure";
   }
}

$obj = new testCl(); // instantiate the class
echo $obj(); // call it like a closure; prints "metoda de closure"

?>
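An object with __invoke() can be passed wherever a callback is expected. A minimal sketch, with a hypothetical Multiplier class invented for the example:

```php
<?php
// Hypothetical class: an object that multiplies by a fixed factor when called.
class Multiplier
{
    private $factor;

    public function __construct($factor)
    {
        $this->factor = $factor;
    }

    public function __invoke($value)
    {
        return $value * $this->factor;
    }
}

$double = new Multiplier(2);

// The object is accepted as a callback, just like a closure.
$result = array_map($double, array(1, 2, 3));

echo implode(',', $result), PHP_EOL; // 2,4,6
```

Unlike a closure, such an object can carry its state in properties and expose other methods besides the call itself.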

NOWDOC

NOWDOC is similar to HEREDOC, except that it does not interpret variables and special characters. In other words, HEREDOC is the equivalent of double quotes and NOWDOC is the equivalent of single quotes:

<?php

$var = "5";

// double quotes
echo "Valoarea este: $var <br>"; // "Valoarea este: 5 <br>"

// single quotes
echo 'Valoarea este: $var <br>'; // "Valoarea este: $var <br>"

// HEREDOC
echo <<<HEREDOC
Valoarea este: $var <br>
HEREDOC;
// "Valoarea este: 5 <br>"

// NOWDOC
echo <<<'NOWDOC'
Valoarea este: $var <br>
NOWDOC;
// "Valoarea este: $var <br>"

?>
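NOWDOC is handy for text that contains dollar signs and backslashes, such as code samples or Windows paths. A small sketch; the text and the TEXT identifier are made up for the example.

```php
<?php
$name = 'Claudiu';

// HEREDOC: $name is interpolated.
$heredoc = <<<TEXT
Hello $name
TEXT;

// NOWDOC: everything is kept literally, including "$name" and the backslashes.
$nowdoc = <<<'TEXT'
Hello $name, see C:\temp\new
TEXT;

echo $heredoc, PHP_EOL; // Hello Claudiu
echo $nowdoc, PHP_EOL;  // Hello $name, see C:\temp\new
```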

The “?:” operator

This operator has been slightly “improved”:

<?php
// before
echo $_GET['ceva']?$_GET['ceva']:false;

// now
echo $_GET['ceva']?:false;

?>

In other words, the condition itself can now be used as the value for the true case.
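Two details are worth keeping in mind: the short form tests truthiness, not existence, so a value like '0' triggers the fallback, and a missing key still has to be checked with isset(). A sketch, with a hypothetical $params array standing in for $_GET:

```php
<?php
$params = array('page' => '0', 'lang' => 'ro');

// Short form: returns the left operand when it is truthy.
$lang = $params['lang'] ?: 'en'; // 'ro'

// '0' is falsy in PHP, so the fallback is used here.
$page = $params['page'] ?: '1';  // '1'

// The short form does not test for existence: reading a missing key would
// still raise a notice (a warning in PHP 8), so isset() is used for optional keys.
$sort = isset($params['sort']) ? $params['sort'] : 'date'; // 'date'

echo $lang, ' ', $page, ' ', $sort, PHP_EOL; // ro 1 date
```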

goto

Honestly… I do not really see why this was needed, but maybe it is not so bad. A small example:

<?php

$x = 1;

label1:
echo "la label1 $x <br>";
$x++;

if($x>3) goto label2;
goto label1;

label2:
echo "la label2";

// will print:
// la label1 1
// la label1 2
// la label1 3
// la label2
?>

It looks more like BASIC than PHP, but it really works.
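For comparison, the same lines can be produced with an ordinary loop. A minimal sketch; the $lines array is an assumption added here only so the result can be inspected.

```php
<?php
$lines = array();

// Same sequence as the goto example: three passes, then the final label.
for ($x = 1; $x <= 3; $x++) {
    $lines[] = "la label1 $x";
}
$lines[] = "la label2";

echo implode(" <br>\n", $lines), PHP_EOL;
```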

There are other new features as well; I tried to list only the ones I find most interesting.

Unfortunately, none of these things will really be used on a large scale for perhaps another 2-3 years, or even longer. It may not be PHP 6, but I think all these changes will be associated with PHP 6 rather than with PHP 5; I do not think anyone wants to put at risk an entire application that uses namespaces or closures just because the server is not running the “new” version of PHP 5.

Written by Claudiu Persoiu

10 May 2009 at 5:42 PM

Posted in PHP
