MySQL and Unicode using UTF-8

Citește postarea în română

Share on:

Along with globalization, the old ASCII code is no longer suitable.  Consider that one day you have to develop a project in German, Russian or even Japanese, you could adapt the charset for each of these languages or you could simply develop using Unicode.

To use Unicode with MySQL UTF-8 can be used.

You must note that UTF-8 characters are variable in length and they are ASCII compatible. In ASCII 1 char = 1B, in UTF-8 1 char can be between 1 and 4 B.

UTF-8 charset and collation on the server

Character type in MySQL is dictated by charset.

To check if UTF-8 in installed on the server:

1SHOW CHARSET LIKE 'utf8';

or with information_schema

1SELECT * FROM `CHARACTER_SETS` WHERE CHARACTER_SET_NAME = 'utf8';

If the charset was found then we can continue.

Another element that appears with charset is collation, which it’s used for comparing strings at ordering.

To see what collations are available on the server:

1SHOW COLLATION WHERE CHARSET = 'utf8';

or with information_schema

1SELECT * FROM `COLLATIONS` WHERE CHARACTER_SET_NAME = 'utf8';

Collation are usually by language, for comparing strings with or without diacritics for example, or “bin” can be used with orders strings in binary mode,  ie “A” is greater than “a” for example.

If no collation is specified, then the default one will be used.

UTF-8 and data bases

When creating a database you can specify the default charset to be used with all new tables for which there isn’t any charset specified.

For example:

1CREATE DATABASE db_name CHARACTER SET utf8 COLLATE utf8_romanian_ci;

Or for modifying the default charset for a data base which already exists:

1ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_romanian_ci;

UTF-8, tables and columns

For modifying tables which already exist ALTER TABLE must be used is used.

A table can have a default charset and collation, and each column can have it’s own charset and collation.

For more information about the table:

1SHOW CREATE TABLE tab;

To set a charset for an existing table:

1ALTER TABLE tab CHARSET = utf8 COLLATE = utf8_romanian_ci;

For modifying the charset of a VARCHAR(200) column is used:

1ALTER TABLE tab MODIFY c1 VARCHAR(200) CHARSET utf8 COLLATE utf8_romanian_ci;

String size

A “problem” that may arise is related to the size of the character, it’s size can be between 1 and 4B.  That is why for measuring a string column (like varchar) you must use CHAR_LENGTH(str) instead of LENGTH().

A short exemple:

1SET @var = 'aşadar';
2SELECT CHAR_LENGTH(@var) AS 'Char', LENGTH(@var) AS 'Length';
3
4// The output is: Char = 6 and Length = 7 because ş is 2B