MySQL and Unicode: Three Gotchas

Written by Owen Raccuglia | Jul 2, 2009

Unicode support in MySQL is a pain — I lost a few hours to it today, and it won't be the last time. So here are the three MySQL unicode gotchas that took over my day. They're all based on this simple table:

CREATE TABLE utfail8 (
    str VARCHAR(250) KEY
);

Gotcha number 1: In the latin1 character set, 'café' = 'cafe'. If you have a primary key or other unique constraint on your data, you can't insert both, and any SELECT queries for 'café' return the same results as queries for 'cafe':

mysql> insert into utfail8 VALUES ('cafe');
Query OK, 1 row affected (0.00 sec)

mysql> insert into utfail8 VALUES ('café');
ERROR 1062 (23000): Duplicate entry 'café' for key 1

At a glance, this violates one of my fundamental rules for "equals:" when I say WHERE str = 'café', you do not return 'cafe': those strings are different. The MySQL folks are quick to point out that it's not a bug. As someone who just wasted hours on variations of the feature, I think that's about as convincing as this proof that 1 = 2. I wish it had just thrown an error (or warning), like if I had tried to insert a 200 character string into a VARCHAR(100). It'd be a cool feature if it weren't so easy to forget about.

Anyway, the fix is nice and easy: just ALTER TABLE utfail8 MODIFY str VARCHAR(250) CHARACTER SET utf8. Most of the time, this is enough.

Gotcha number 2: Your column has the right character set, but you still can't INSERT 'café'! This is about when you start losing hair; utf8 knows the difference between 'é' and 'e', right? So what's going on? If you're unlucky (like me), you'll start to question utf8, read up on collations, and waste time trying a few of them out.

It turns out that, while SELECT's still not broken, your IDE might be. I ditched Toad and went back to basics, when the same queries that failed on my machine worked fine on @danmil's plain mysql prompt.

Gotcha number 3: It's time to push to QA; I run the ALTER TABLE to set the character set to unicode, and I run the python script I was fixing in the first place... and it fails, even after you "fixed everything." So, I open a mysql shell on the server, and try the test script out, but itfails, too:

mysql> select * from utfail8;
+------+
| str  |
+------+
| cafe |
+------+
1 row in set (0.00 sec)

mysql> insert into utfail8 VALUES ('café');
ERROR 1062 (23000): Duplicate entry 'cafÃ©' for key 1

The ugly end of that error should catch your eye: it's completely mangled the unicode. Try running SHOW variables to see what's up. Here's what my client said:

mysql> SHOW variables LIKE '%character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |  <--- Hint: This is the problem
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Yep: it doesn't take an IDE to screw up your input. Anything — a mysql prompt, Toad, a database connection object in your python script — can mangle unicode like there's no tomorrow, with the right (wrong) settings. This was the hardest of the three to track down... and of course, like any hair-tearing bug worth its salt, it only showed up on QA or production.

The good news is, it's easy to fix, too: just run SET character_set_client = utf8. To fix the Python script, I just ran that query right after opening a connection.

How much hair have you lost to unicode lately? Know of any more nasty mysql character set "gotchas?"

View full post