Monday, May 25, 2015

UTF-8 end to end: Eclipse -> XText -> Java -> MySQL -> PHP -> HTML

Passing UTF-8 characters through a chain of languages and processing components can be a bit of a nightmare, because each component has its own quirky configuration for handling UTF-8.

In one of my pet projects, I have a long text processing chain:

  1. User enters text in Eclipse
  2. Text is parsed by the XText parser
  3. The parsed text is processed by a Java batch application
  4. The Java batch application stores the processed text in MySQL
  5. PHP reads the data from MySQL and sends it to the browser as either JSON or HTML
  6. Text shows up embedded in HTML/JSON on the browser

To ensure the UTF-8 data flows through the chain properly, the following needs to be done:

  1. Ensure the text file encoding in Eclipse is set to UTF-8 (Window->Preferences->General->Workspace [Text file encoding])

  2. Ensure the XText standalone parser is configured to respect UTF-8. Refer to this post: http://sandeepdeb.blogspot.in/2015/04/ensuring-xtext-parsers-respect-utf-8.html

  3. For Java to store UTF-8 in MySQL via JDBC, the JDBC URL needs to specify the encoding:

     jdbc:mysql://localhost:3306/?useUnicode=yes&characterEncoding=UTF-8

  4. PHP's default_charset in /etc/php5/apache2/php.ini needs to be set to UTF-8 (by default it is commented out)

  5. The database connection charset in PHP needs to be explicitly set to UTF-8:

     $dbConn = mysqli_connect( DB_HOST, DB_USER, DB_PASSWORD, DB_SCHEMA ) ;
     $dbConn->set_charset("utf8") ;

     Note that MySQL's "utf8" charset stores at most three bytes per character; characters outside the Basic Multilingual Plane (emoji, for example) need "utf8mb4" instead.

  6. <meta charset="UTF-8"> needs to be set in the HTML <head>
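To see why every hop in the chain has to agree on the encoding, here is a minimal Java sketch. It is not part of the project above; the class name Utf8RoundTrip and the method hop are made up for illustration. It encodes a string to bytes with one charset and decodes it with another, which is roughly what happens when one component writes UTF-8 and the next one reads it with a different default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class Utf8RoundTrip {

    // Simulates one hop in the chain: the writer encodes the text to bytes,
    // the reader decodes those bytes using its own idea of the charset.
    public static String hop(String text, Charset writer, Charset reader) {
        byte[] wire = text.getBytes(writer);
        return new String(wire, reader);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, written as an escape so the example
        // does not depend on the source file's own encoding.
        String original = "caf\u00e9";

        // Both sides use UTF-8: the text survives.
        String ok = hop(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Writer uses UTF-8, reader assumes ISO-8859-1: the two-byte
        // sequence for the accented e is decoded as two separate characters.
        String garbled = hop(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(ok.equals(original));      // true
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 5 (original has 4)
    }
}
```

A single mismatched hop is enough to corrupt the text, and later hops cannot reliably undo it, which is why each of the six settings above matters.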
