Passing UTF-8 characters through a chain of languages and processing components can be a bit nightmarish, because each component has its own quirky configuration for handling UTF-8.
In one of my pet projects, I have a long text processing chain:
- User enters text in Eclipse
- Text is parsed by the XText parser
- The parsed text is processed by a Java batch application
- Java batch application stores the processed text in MySQL
- PHP reads the data from MySQL and sends it to the browser as either JSON or HTML
- Text shows up embedded in HTML/JSON in the browser
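To see why every hop in this chain matters, here is a minimal Java sketch of what happens when a single component decodes UTF-8 bytes with the wrong charset. The class and method names (MojibakeDemo, misread, roundTrip) are made up for illustration and are not part of my project; ISO-8859-1 just stands in for whatever non-UTF-8 default a misconfigured component might fall back to.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulates a misconfigured hop: the bytes are UTF-8, but they are
    // decoded as ISO-8859-1 (an assumed stand-in for a wrong default).
    static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Simulates a correctly configured hop: encode and decode both use UTF-8.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café", written with an escape so the source file encoding does not matter
        String original = "caf\u00e9";
        System.out.println(roundTrip(original).equals(original)); // prints true
        System.out.println(misread(original).length());           // prints 5
    }
}
```

The accented character takes two bytes in UTF-8, so the wrong decoder turns it into two unrelated characters. That is the classic garbled output you see when one link in the chain is not UTF-8 aware.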
To ensure the UTF-8 data flows through the chain properly, the following needs to be done:
1. Ensure the text file encoding in Eclipse is set to UTF-8 (Window->Preferences->General->Workspace [Text file encoding])
2. Ensure the XText standalone parser is configured to respect UTF-8. Refer to this post: http://sandeepdeb.blogspot.in/2015/04/ensuring-xtext-parsers-respect-utf-8.html
3. For Java to store UTF-8 in MySQL via JDBC, the JDBC URL needs to mention the encoding:
    jdbc:mysql://localhost:3306/?useUnicode=yes&characterEncoding=UTF-8
4. PHP's default_charset in /etc/php5/apache2/php.ini needs to be set to UTF-8 (the line is commented out by default)
5. The database connection charset in PHP needs to be explicitly set to UTF-8:
    $dbConn = mysqli_connect( DB_HOST, DB_USER, DB_PASSWORD, DB_SCHEMA ) ;
    $dbConn->set_charset("utf8") ;
6. <meta charset="UTF-8"> needs to be set in the HTML <head>
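For the JDBC step above, the following is a rough sketch of how the batch application side could build and use that URL. Utf8Jdbc, utf8Url and open are hypothetical names, not code from my project, and open() assumes MySQL Connector/J is on the classpath and a server is listening on localhost:3306.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class Utf8Jdbc {
    // Builds a MySQL JDBC URL that carries the encoding parameters.
    // Pass an empty schema to get the exact URL shown in the JDBC step.
    static String utf8Url(String host, int port, String schema) {
        return "jdbc:mysql://" + host + ":" + port + "/" + schema
             + "?useUnicode=yes&characterEncoding=UTF-8";
    }

    // Opens a connection with the UTF-8 aware URL. This only works with the
    // MySQL driver on the classpath and a running server (both assumed here).
    static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(utf8Url("localhost", 3306, ""), user, password);
    }

    public static void main(String[] args) {
        // Only prints the URL; no database is contacted.
        System.out.println(utf8Url("localhost", 3306, ""));
    }
}
```

Keeping the URL construction in one place means the encoding parameters cannot be forgotten when a second connection is added later.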