CREATE TABLE utfail8 ( str VARCHAR(250) KEY
);
Gotcha number 1: In the latin1 character set, 'café' = 'cafe'. If you have a primary key or other unique constraint on your data, you can't insert both, and any SELECT queries for 'café' return the same results as queries for 'cafe':
mysql> insert into utfail8 VALUES ('cafe');At a glance, this violates one of my fundamental rules for "equals:" when I say WHERE str = 'café', you do not return 'cafe': those strings are different. The MySQL folks are quick to point out that it's not a bug. As someone who just wasted hours on variations of the feature, I think that's about as convincing as this proof that 1 = 2. I wish it had just thrown an error (or warning), like if I had tried to insert a 200 character string into a VARCHAR(100). It'd be a cool feature if it weren't so easy to forget about.
Query OK, 1 row affected (0.00 sec)
mysql> insert into utfail8 VALUES ('café');
ERROR 1062 (23000): Duplicate entry 'café' for key 1
mysql> select * from utfail8; +------+ | str | +------+ | cafe | +------+ 1 row in set (0.00 sec) mysql> insert into utfail8 VALUES ('café'); ERROR 1062 (23000): Duplicate entry 'café' for key 1The ugly end of that error should catch your eye: it's completely mangled the unicode. Try running SHOW variables to see what's up. Here's what my client said:
mysql> SHOW variables LIKE '%character_set%'; +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | latin1 | <--- Hint: This is the problem | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | latin1 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+ 8 rows in set (0.00 sec)Yep: it doesn't take an IDE to screw up your input. Anything — a mysql prompt, Toad, a database connection object in your python script — can mangle unicode like there's no tomorrow, with the right (wrong) settings. This was the hardest of the three to track down... and of course, like any hair-tearing bug worth its salt, it only showed up on QA or production.