Monday, April 11, 2011

PHP, MySQL and UTF-8 encoding

UTF-8 encoding lets you display different scripts on the same Web page: with UTF-8, you can mix Latin, Chinese and Arabic characters on a single page.
On the other hand, setting up a whole site to work with UTF-8 can be incredibly tedious if you do not know how to proceed. You have to configure the HTML code, PHP, Apache, and the MySQL databases you use.
Here is a very simple tutorial to convert a whole website to UTF-8 within minutes!

Apache

Send the charset in the HTTP headers. You can either add this to the .htaccess file at the root of your website:

AddDefaultCharset UTF-8

Or include this in every PHP script:

header("Content-type: text/html; charset=UTF-8"); 

PHP Configuration

The usual string functions (substr, etc.) work on bytes, so they can miscount or cut multibyte characters. You will need to use the multibyte string functions instead, and to set their defaults you will have to edit your php.ini configuration file:

; mbstring.language expects a language name (Neutral covers UTF-8), not an encoding
mbstring.language=Neutral
mbstring.internal_encoding=UTF-8
mbstring.http_input=UTF-8
mbstring.http_output=UTF-8
mbstring.detect_order=auto
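To check that these settings are really the ones PHP is using, a script can read them back at runtime. Here is a minimal sketch, assuming the mbstring extension is loaded; the function name report_mbstring_settings is made up for this example.

```php
<?php
// Hypothetical helper: collect the mbstring settings PHP is actually using.
function report_mbstring_settings(): array
{
    return [
        'internal_encoding' => mb_internal_encoding(),
        'regex_encoding'    => mb_regex_encoding(),
        'language'          => mb_language(),
    ];
}

// Setting the encoding per script is an alternative when php.ini cannot be edited.
mb_internal_encoding('UTF-8');

foreach (report_mbstring_settings() as $name => $value) {
    echo $name, ' = ', $value, "\n";
}
```

If the values printed do not match your php.ini, the file may not have been reloaded yet; as a reader points out in the comments, Apache has to be restarted after editing php.ini.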


Here are the multibyte counterparts to the regular string functions:

Regular                     Multibyte
mail                        mb_send_mail
strlen                      mb_strlen
strpos                      mb_strpos
strrpos                     mb_strrpos
substr                      mb_substr
strtolower                  mb_strtolower
strtoupper                  mb_strtoupper
substr_count                mb_substr_count
ereg                        mb_ereg
eregi                       mb_eregi
ereg_replace                mb_ereg_replace
eregi_replace               mb_eregi_replace
split                       mb_split
htmlentities($text)         htmlentities($text, ENT_QUOTES, 'UTF-8')
htmlspecialchars($text)     htmlspecialchars($text, ENT_QUOTES, 'UTF-8')

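To see why the multibyte versions matter, here is a small sketch comparing a few of the pairs above on the word "café". The string is written with byte escapes (an assumption made only so the example does not depend on how the file is saved), and the results are printed as hex where raw bytes would be unreadable.

```php
<?php
// "café" as UTF-8 bytes: \xC3\xA9 is the two-byte encoding of "é".
$word = "caf\xC3\xA9";

echo strlen($word), "\n";                    // counts bytes: 5
echo mb_strlen($word, 'UTF-8'), "\n";        // counts characters: 4

// substr() cuts inside the two-byte "é"; mb_substr() returns it whole.
echo bin2hex(substr($word, 3, 1)), "\n";             // c3
echo bin2hex(mb_substr($word, 3, 1, 'UTF-8')), "\n"; // c3a9

// mb_strtoupper() also uppercases the accented letter ("É" is c389).
echo bin2hex(mb_strtoupper($word, 'UTF-8')), "\n";   // 434146c389
```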
You can use the mb_convert_encoding and mb_detect_encoding functions to convert a string to its UTF-8 equivalent, which is very useful when you are using data from external files or parsed HTML pages.
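As an illustration of such a conversion, here is a minimal sketch. It assumes that any input which is not valid UTF-8 is ISO-8859-1; that fallback and the helper name to_utf8 are choices made for this example, not something PHP decides for you. It uses mb_check_encoding rather than mb_detect_encoding because detection is only a guess when several encodings fit the same bytes.

```php
<?php
// Hypothetical helper: return $text as UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded as $fallback.
function to_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";                    // "café" in ISO-8859-1
echo bin2hex(to_utf8($latin1)), "\n";   // 636166c3a9
```

When the source encoding is known (for example from an HTTP header of the page you parsed), passing it explicitly as the fallback is safer than guessing.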
Note that in multibyte regular expressions, \w also matches accented characters, which is very practical for word detection in most non-English languages!
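Here is a short sketch of that behaviour, assuming mbstring was built with its regular expression support (the usual case). The sample strings are invented for the example and written as UTF-8 byte escapes.

```php
<?php
mb_regex_encoding('UTF-8'); // make the mb_ereg* functions treat input as UTF-8

$word = "caf\xC3\xA9";      // "café" as UTF-8 bytes

// \w matches the accented letter, so the whole word is recognised.
var_dump((bool) mb_ereg('^\w+$', $word));

// Splitting on non-word characters keeps accented words intact.
$words = mb_split('\W+', "d\xC3\xA9j\xC3\xA0 vu, caf\xC3\xA9");
echo count($words), "\n";
```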

MySQL

When you create a database, use the following statement:

CREATE DATABASE foo
CHARACTER SET utf8 COLLATE utf8_bin;


By default, a table created in a UTF-8 database will use the UTF-8 charset. Similarly, text columns will by default inherit from the table they are in.
If you are using phpMyAdmin, you can access those properties in the Operations tab of the MySQL database or table you want to change.
In your PHP code, don't forget this query for every connection to the database:

mysql_query("SET NAMES 'utf8'");
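The mysql_* functions were deprecated in later PHP versions, so on a recent installation the same thing can be done with PDO by passing the charset in the connection string. This is only a sketch, assuming the pdo_mysql driver is available; the host, database name and credentials are placeholders, and build_utf8_dsn is a helper invented for this example.

```php
<?php
// Hypothetical helper: build a PDO DSN that asks MySQL for a utf8 connection.
function build_utf8_dsn(string $host, string $database): string
{
    return "mysql:host={$host};dbname={$database};charset=utf8";
}

echo build_utf8_dsn('localhost', 'foo'), "\n";

// Connecting would then look like this (placeholder credentials):
// $pdo = new PDO(build_utf8_dsn('localhost', 'foo'), 'user', 'password');
```

With the charset in the DSN there is no separate SET NAMES query to remember for each connection.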


If you are using MySQL from the command line, you have to specify the following option:
--default-character-set=utf8

HTML

You can put the following tag in the head section:

<meta http-equiv="Content-type" content="text/html; charset=UTF-8" />
And don't forget to declare that form data will be sent as UTF-8:

<form accept-charset="UTF-8">
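The accept-charset attribute is only a request to the browser, so it can be worth checking on the PHP side that what arrives really is UTF-8. Here is a minimal sketch; the helper name invalid_utf8_fields and the sample data are made up for this example, and in a real script the array would be $_POST.

```php
<?php
// Hypothetical helper: names of submitted fields that are not valid UTF-8.
function invalid_utf8_fields(array $fields): array
{
    $bad = [];
    foreach ($fields as $name => $value) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $name;
        }
    }
    return $bad;
}

// Sample data: "Zürich" as valid UTF-8, "café" as ISO-8859-1 bytes.
$sample = ['city' => "Z\xC3\xBCrich", 'note' => "caf\xE9"];
print_r(invalid_utf8_fields($sample));
```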


Always save your files as UTF-8 without a BOM in your text editor! This applies to JavaScript, HTML, and PHP files. CSS files are less likely to contain special characters, but consistency never hurts.
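The BOM matters in PHP files in particular: the three bytes sit in front of the opening PHP tag, so they are sent as output before any header() call can run. If you have to deal with content that may already carry a BOM, something like the following sketch can remove it; strip_utf8_bom is a name invented for this example.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 byte order mark, if any.
function strip_utf8_bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($contents, $bom, 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

echo strip_utf8_bom("\xEF\xBB\xBFhello"), "\n"; // hello
```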


5 comments:

  1. Thank you very much! This tutorial help me a lot

  2. thank you, it is really helpful !
    Just don't forget to restart the Apache server after editing php.ini

  3. Thanks for sharing. It is indeed a useful info.


  4. Thank you. It works. God Bless you. I implemented it in my website www.kitkatfun.com

  5. Thank you. It works. God Bless you. I implemented it in my website latest facebook funny punjabi posts and shayari
