Fixing character sets in MySQL

I’ve recently had to move a few databases from MySQL 4.x to MySQL 5.x. One of the most important differences is that the 5.x family understands character encodings. Not exactly fresh news, version 5.0 was issued in 2003, but there is still a lot of 4.x installations around.

MySQL 5.0 no longer happily accepts any byte string into a VARCHAR or TEXT field. It stores encoding names as part of the table structure, and converts between encodings when necessary. MediaWiki or WordPress, when run on MySQL 4.x, store data in UTF-8, but the database itself doesn’t “know” about it. Everything seems fine, until you dump your database to a file and load it into MySQL 5.0 (or above). What happens, is that your text is considered to be latin1 (a.k.a. ISO-8859-1). If you happen to have any non-English characters as, say, article names in MediaWiki, you’re going to end up with an error message such as:

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function “Article::pageData”. MySQL returned error “1267: Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation ‘=’ (localhost)”.

To fix the problem, you need to tell MySQL that your text data is really UTF-8, not latin1. You need to find all the columns of type VARCHAR or TEXT, and modify them to have UTF-8 character set. For example, if your column is VARCHAR(255), you can execute this statement:

ALTER TABLE your_table
MODIFY COLUMN your_column VARCHAR(255)
CHARACTER SET utf8
COLLATE utf8_bin;

(The utf8_bin collation is needed to keep your sorting case-sensitive.)

However, MySQL will convert your text from latin1 to UTF-8, and your text will still appear “wrong”. You can fix it in one more step. The problem is that you had UTF-8 taken to be latin1 and then stored as UTF-8. To fix this, you need to “convert” your text from UTF-8 back to latin1, and then make MySQL take it as UTF-8, but, importantly, without converting it. This can be achieved by temporarily casting your data to binary — this operation doesn’t trigger encoding changes. You can then cast your data into any encoding you want. In a nutshell, you need to go: UTF-8 →(conversion)→ latin1 → binary → UTF-8.

UPDATE your_table
SET your_column = CONVERT(
CONVERT(
CONVERT(
your_column
USING latin1
)
USING binary
)
USING utf8
);

It may take you a while to understand it. If you want to get a better feel of what’s going on, consider the following, equivalent example in shell. Let’s assume you’re using Polish diacritics: ąćęłńóśżź. Your Polish page title “Café” might in a garbled form look something like this:

Café

Assuming your system is natively UTF-8 (most of modern Linux distributions are), an easy way to simulate text garbling is the following shell expression.

echo Café \
| iconv -f utf-8 -t utf-8 \
| iconv -f latin1 -t utf-8
Café

Converting from UTF-8 to UTF-8 seems stupid, but I wanted it to be very clear: what we have here, is an UTF-8 string, as output by the second line, taken to be latin1 in the third line. This is how your text can become garbled. An obvious way to fix it, is to run the process backwards:

echo Café \
| iconv -f utf-8 -t latin1 \
| iconv -f utf-8 -t utf-8
Café

Again, the UTF-8 to UTF-8 conversion is preserved to make this crucial point explicit. Your string was converted to latin1, and then taken to be UTF-8.

Back to our problem. We know how to convert the data, but we need to find all the tables and columns taht need converting. Conveniently, MySQL offers an “information_schema” database, which allows us to read information about MySQL tables. It’s enough to run this query to find all the tables of interest:

SELECT
table_name,
column_name,
column_type,
character_set_name
FROM
columns
WHERE
table_schema = ‘your_table’
AND
(
data_type = ‘varchar’
OR
data_type = ‘text’
)
AND
character_set_name != ‘utf8′
;

If you’re really lazy, as I am, your data has exactly this problem (UTF-8 taken to be latni1 and looking garbled), you can use a Python script I wrote. But please be careful! You’re using it at your own risk! Backup your database first! If this script damages your database and you lose all your data, it’s your problem, not mine. You have been warned. Here’s the script.

Note: the script will only print SQL statements to screen. You have to execute them youself.

About these ads

15 Responses to “Fixing character sets in MySQL”

  1. Pavel Says:

    Thank you! This solved my very problem.

  2. Pavel Says:

    BTW, isnt’t it a simpler solution to switch encoding on a column:

    ALTER TABLE t MODIFY COLUMN c varbinary; /* drop any charset properties */
    ALTER TABLE t MODIFY COLUMN c CHARSET ; /* imply a new charset over the data */

    and no excessive recodings occur.

  3. automatthias Says:

    You can do that if you know what you’re doing. It will work when the data in the database is a valid representation of text in some encoding. It doesn’t generally have to be the case.

  4. Henry Says:

    Thanks/Dziękuję for your article!

    I recently inherited a PHP-Fusion v6 site in Polish, with DB text I guess is encoded iso-8859-2. Why? Because the generated pages say “charset=iso-8859-2″ in the headers, and all the Polish text, including diacritics, renders correctly.

    The site’s MySQL 4.1.22 DB charset is set to UTF-8 Unicode (utf8), so I guess that MySQL doesn’t “know” some table columns contain text encoded with iso-8859-2; apparently MySQL 4.1x doesn’t care. Right?

    I’m must convert the DB to be used by PHP-Fusion v7, which uses only utf-8, again with MySQL 4.1.22 — of course, with the DB set to use utf8. I have only a few tables to convert, so I don’t need a script, nor do I want to convert the entire DB. (You might guess that I’m simplifying a much more complex situation for the sake of this discussion.)

    I’ve tried several variations of Pavel’s solution.

    One variation replaces all the Polish diacritics with their base glyphs, for example, the stroked-L with a plain L. I guess this would be barely OK for Polish-speakers, but –in my view– it is not good enough. (Anyway, I’m not certain I can figure out which experiment of dozens succeeded in doing this.)

    Another variation truncates the source string at the first diacritic, which is clearly unacceptable. I’ve found some other people who have encountered this truncation, some of whom offer explanations, but I’m not certain any of them really understand why it is occurring. I certainly don’t.

    I’m going to try your suggestion next. I hope it works. The site has been down for almost a month, and the encoding issue is (I hope) the last major problem to solve before putting it back on the air.

    Your suggestions would be most welcome!

    Thanks,

    Henry

  5. Pavel Says:

    Henry,

    Could you please post the queries you’ve tried? I suppose you are observing the effects of interpretation of iso-8859-2-encoded data as utf-8-encoded, to the extent these encodings are compatible.

    In general, I suppose, you have to a) reliably determine the actual data encoding (being unsure of conversions that may happen inside web-engine, I could have used bare SELECT into a file (i.e. through mysql.exe) and then examined the raw data in and editor or a browser), and b) convert columns data from their actual encoding to UTF-8 (as you need this for web-engine).

    My queries are a shorter form of what automatthias’s script does, and do not perform any conversion on data.

    So, being moderately sure in actual data encoding, please try the following:

    ALTER TABLE t MODIFY COLUMN c binary;
    ALTER TABLE t MODIFY COLUMN c varchar(…) charset iso-8859-2; /* ‘de jure’ declare the ‘de facto’ encoding */
    ALTER TABLE t MODIFY COLUMN c varchar(…) charset utf-8; /* convert the text accurately into utf-8 */

  6. Pavel Says:

    p.s. Well, ‘bare’ SELECT is indeed affected by connection encoding. In the case of mysql.exe it is set by ‘–default-character-set=’ option. So, the values of text types (which are encoding-aware types) are converted from their column encoding into this encoding when needed.

    So, before doing SELECT into a file, do

    ALTER TABLE t MODIFY COLUMN c binary;

    to get actual bytes in the file regardless of connection encoding. Then you have the other two steps (declare the discovered actual encoding and convert data to desired final encoding).

  7. Henry Says:

    Pavel:

    Thanks for your responses!

    The code you provided looks reasonable, but threw errors. I modified it to this:

    ALTER TABLE polski_shoutbox MODIFY COLUMN shout_message binary;
    ALTER TABLE polski_shoutbox MODIFY COLUMN shout_message varchar(200) charset latin2;
    ALTER TABLE polski_shoutbox MODIFY COLUMN shout_message varchar(200) charset utf8i;

    where, of course, “polski_shoutbox” is the table I’m trying to convert and “shout_message” is the specific column, which is typed varchar(200). I hope my changes reflect your intentions.

    Unfortunately, this sequence results in truncating all the messages to one character. After staring at the results for a while, I made a slight modification:

    ALTER TABLE polski_shoutbox MODIFY COLUMN shout_message binary(800);
    ALTER TABLE polski_shoutbox MODIFY COLUMN shout_message varchar(200) charset latin2;
    ALTER TABLE polski_shoutbox MODIFY COLUMN shout_message varchar(200) charset utf8;

    Let’s talk about the results of using this transformation in terms of a single example shoutbox message, one that has plenty of diacritics. I’m not certain how well these will render in WordPress so I also describe what I see.

    1) The original v6 site renders this message in the browser window as:

    Panowie! Słów wdzięczności za Wasz trud nigdy dość, więc powiem: dzięki za Waszą pracę, którą zachowujecie dorobek WIG-u. Dzięki!

    This is exactly correct: All diacritics are rendered perfectly and other characters are right, too. (I do not read Polish, so I’m guessing a bit. It looks right.)

    2) On the new site, BEFORE running the 3 queries just above, this message is rendered in the browser window as:

    Panowie! S��w wdzi�czno�ci za Wasz trud nigdy do��, wi�c powiem: dzi�ki za Wasz� prac�, kt�r� zachowujecie dorobek WIG-u. Dzi�ki!

    Here, all the diacritics are replaced by a graphic of a white question mark over a black diamond.

    Looking at this string with myPHPadmin, it appears this way:

    Panowie! S³ów wdziêczno¶ci za Wasz trud nigdy do¶æ, wiêc powiem: dziêki za Wasz± pracê, któr± zachowujecie dorobek WIG-u. Dziêki!

    Here, all the diacritics are replaced by other “special” characters. As far as I checked, the replacement is the same for each given diacritic. By the way, the collation sequence is given as latin_swedish_ci.

    3) On the new site, AFTER running the 3 queries just above, this message is rendered in the browser window as

    Panowie! S?�w wdzi?czno?ci za Wasz trud nigdy do??, wi?c powiem: dzi?ki za Wasz? prac?, kt�r? zachowujecie dorobek WIG-u. Dzi?ki!

    Here the diacritics are replaced by question marks (‘?’) or the graphic of a white question mark over a black diamond.

    Looking at this string with myPHPadmin:

    Panowie! Słów wdzięczności za Wasz trud nigdy dość, więc powiem: dzięki za Waszą pracę, którą zachowujecie dorobek WIG-u. Dzięki!

    which is identical to the correct appearance in 1) above. By the way, the collation sequence is now: utf8_general_ci.

    In other words: the transformation has succeeded as far as myPHPadmin is concerned, but the browser rendering is still incorrect. The collation sequence was changed to the correct one, even though we didn’t explicitly ask for that.

    (By the way, I tried using latin1 instead of latin2 in the second query. The result at 3) same as the situation described in 2). So the original encoding is probably not Latin1. )

    —-

    As may be obvious, I’ve only been working with mySQL for a few hours; I’ve essentially never seen any of these tools before. Seeing the example message rendered correctly within myPHPadmin looks like real progress to me, but I have no idea why the message still does not render correctly in the browser window. I _think_ that everything is aligned correctly: The string is now certainly encoded in utf8 (right?), myPHPadmin recognizes that the DB is set to use utf8, and both the PHP-Fusion and the myPHPadmin page headers specify ‘charset=utf-8′.

    At this point I’m wondering if this is a PHP issue — I’ve seen some criticisms of PHP’s handling of international text. Hmmm I did a search just now for “php encoding”. The first Google hit includes this observation “… When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough….”

    Thanks,

    Henry

  8. Henry Says:

    I modified the author’s suggested code and ran it on my shoutbox table, making a slight change as I believe the content of the table is encoded Latin2:

    UPDATE polski_shoutbox
    SET shout_message = CONVERT(
    CONVERT(
    CONVERT(
    shout_message
    USING latin2
    )
    USING binary
    )
    USING utf8
    );

    Result: In the generated web page, all diacritics are represented as question marks. Same when the example is viewed in phpMyAdmin.

    (Note: about my various renderings of this name, yes, I am a bit dyslexic, obviously. In my English writing _never_ in my code. )

    Does the result offer any clues?

    Henry

  9. automatthias Says:

    Henry, try taking a piece of your Polish garbled text and paste it into http://krzaki.blizinski.pl/ — with a bit of luck, it will tell you what’s wrong with your text.

  10. Henry Says:

    Thanks, that’s a great tool, even if I’m not 100% certain what it does.

    Here are some quick results:

    Input: Example text taken from old version’s output to browser:
    Result: The text does not appear to pokrzaczony or is pokrzaczony in a way that is difficult to detect.

    Input: Example text taken from phpMyAdmin browser before transformation.
    Result: Encoding iso-8859-2 interpreted as iso-8859-1.

    Input: Example text taken from new version’s output to browser before transformation
    Result: The text does not appear to pokrzaczony or is pokrzaczony in a way that is difficult to detect.

    Input: Example text taken from phpMyAdmin browser after Pavel’s transformation as modified by Henry.

    Result: The text does not appear to pokrzaczony or is pokrzaczony in a way that is difficult to detect.

    Input: Example text taken from phpMyAdmin browser after Maciej’s transformation as modified by Henry (“Latin2″)

    Result: The text does not appear to pokrzaczony or is pokrzaczony in a way that is difficult to detect.

    —–

    I did not try to run ALL combinations from all sources — hmmm, I may have not made the best picks, so please feel free to suggest… I think the most interesting place to get the text is from the edit window for the example in phpMyAdmin. But…

    The one direct result “Encoding iso-8859-2 interpreted as iso-8859-1.” seems to confirm that the original encoding in the old PHP-Fusion site is 8859-2, aka Latin-2 (right?) which would seem to support my mods to both methods. But why would it be interpreted as 8859-1? The DB is set to utf-8… Do you think phpMyAdmin is ignoring that setting, examining the encoding tag and finding “8859-1″?

    Thanks,

    Henry

  11. automatthias Says:

    Yes, I think at some point the application decided it’s going to ignore everything that MySQL tells it about character encodings and do whatever it wants. It was a standard behavior in MySQL 4, from my experience at least. Or maybe your tables don’t have the correct encoding set. Or maybe it’s both.

    If the diagnosis is “Encoding iso-8859-2 interpreted as iso-8859-1″, your garbled text probably looks like this:

    maciej@quince ~ $ echo "Zagęśl żółcią jaźń" | iconv -f utf-8 -t iso-8859-2 | iconv -f iso-8859-1 -t utf-8
    Zagê¶l ¿ó³ci± ja¼ñ

    To verify that your text is in fact garbled this way, you can do this:

    maciej@quince ~ $ echo "Zagê¶l ¿ó³ci± ja¼ñ" | iconv -f utf-8 -t iso-8859-1 | iconv -f iso-8859-2 -t utf-8
    Zagęśl żółcią jaźń

    You said: “the transformation has succeeded as far as myPHPadmin is concerned, but the browser rendering is still incorrect.”

    I think the right way to go would be to achieve a correct encoding in phpMyAdmin first (you did just that) and then beat the application into submission. Look into configs, see if there’s any way to declare character encodings. Try to log stuff from the application and see what it looks like at various stages. See what encoding is used in the connection. Maybe you have to declare connection encoding?

    If this fails, you could also enter some new text with diacritics into the application, then look if the application can handle it correctly. If it can, see what this text looks like in the database. If it’s garbled, see how. Then convert the rest of the database to achieve the “right garbledness”. (I hope you won’t have to do that.)

  12. Pavel Says:

    Well, it should be real fun when the “character set support ” morphs to trying to achieve the “right garbledness”. %)

    I could expect the application installation manual to contain something useful on configuring MySQL and charsets everywhere from scratch, in what case you may try to start with a verified DB dump and a clean server…

  13. Henry Says:

    Thank you both for your responses!

    As you both might have suspected, the project history is quite complex. Although the old site was labeled “PHP-Fusion v6.00.303″, hidden away on the server I found clear signs it was a composite implementation, built with pieces of a Polish project called “mFusion”. From what little I can discover, the mFusion project was intended to be a fully multilingual version of PHP-Fusion; it was never completed and has apparently been abandoned. Under the circumstances, it is not at all surprising that the encodings from the old site are not, ummm, clear.

    Aside from that, I’ve already mentioned that PHP itself has historically not dealt well with character encoding issues, and I’ve been unable to find relevant detailed documentation or knowledgeable people in the PHP-Fusion support.

    As I said, I inherited the project … It is not a situation I would choose!

    —-

    My impression of mySQL is that pre-version 5, it provided some tools to deal with character encoding, but enforced nothing. Version 5, –if my quick read is correct– tries to assure some level of consistency between, well, what you say and what you actually do.

    Thank you for the suggestion to use iconv. I’ll take a look.

    But first, let me ask you for some practical advice. It seems to me that one must be very careful, as hidden character encoding conversions lurk.

    For example: almost immediately, I discovered that I could enter the content editors on the old website and the new one, transfer Polish user content by simply copy-and-paste, and the content would render correctly on the new site, diacritics and all. (This work-around is OK for small amounts of content, but clearly impractical for the hundreds of items we want to transfer.) Nice to have implicit conversions provided by the system … except when you are trying to understand the encoding issues!

    On the other hand, I found that copy-and-paste between corresponding data records in the DB using phpMyAdmin failed. I don’t know if that’s because of implicit conversions, or what?

    By the way, I am on a Mac (10.5.6) , but I confirmed by one example that the copy-and-paste workaround functions just as well on a PC. From long experience, I doubt that the two function the same way in every case. Even on a Mac, I am not confident about understanding the difference between GUI and Unix shell operation in this respect….

    Do you have any practical suggestions or ideas about where implicit conversions generally do or do not occur?

    Yes, I agree, the way to solve this problem is to examine interfaces and to find ways to access the fundamental data. But, again, one must be certain that implicit encoding conversions are not confusing the issue.

    I’m going to look around to see if I can get some suggestions from mySQL and the phpMyAdmin experts.

    Based on how little the PHP-Fusion support actually mentions character encoding, I think there is a good chance that there is some kind of error in the recent code. Perhaps that is the reason that many PHP-Fusion people suggest forgetting my old data and starting fresh: There is an unrecognized error in the PHP-Fusion code. So, they are saying that the most recent version uses only pure utf-8, but I may find it not so. The next step is to find a foolproof way to look at a specific string in phpMyAdmin and determine its encoding. I have a default installation of the latest PHP-Fusion. If I enter new content into that, locate that content in the DB, and see it is not perfectly consistent utf8-encoded…

    ———————

    Yes, we’re all having fun! My other projects are, however, suffering greatly …

  14. Henry Says:

    Some kind souls on the Quakenet mySQL IRC led me to this phpMyAdmin incantation:

    select charset(shout_message) from polski_shoutbox where shout_id=378;

    This confirms that my adaptation of Pavel’s transformation produced utc-8 in this column. Excellent!

    Applied to a similar item that has not been transformed, it confirms what automatthias’s utility at

    http://krzaki.blizinski.pl/

    said: At least in this case, the old database contains Latin-2 encodings tagged as Latin-1.

    Now that I have a way of confirming the actual encoding of a particular string in the DB… I have a better chance of figuring out why the CMS fails to render that string correctly.

    Maybe, despite some evidence to the contrary, the DB really wants Latin-2 encoding. Next step is to transform the table to provide Latin-2 and see what happens. I’ll post back.

  15. Henry Says:

    Well, none of the obvious measures worked.

    After much head-scratching, I prepared a short test phrase, nonsense Polish with plenty of diacritics, and I entered that into the shoutbox. It displays correctly, of course.

    Then I want to phpMyAdmin to examine the new table row.

    First surprise: The text does NOT display correctly in phpMyAdmin

    Second surprise: The SQL what-encoding incantation reveals it is encoded as Latin-1.

    My new theory: Strings are being double-encoded on the way from the user to the database, and double-decoded on the way out. These must be symmetric, because the diacritics survive the round-trip. I didn’t expect utf-8 strings to survive being encoded in Latin-1, but … why not, it’s just bits, right? Evidently, because it happens.

    This would explain why perfectly good utf-8 encoded-strings in the database don’t succeed. They probably would be fine if I could figure out a simply way of re-encoding them in Latin-1.

    We discussed that the old database appears to contain Latin-2 encoded strings mislabeled as Latin-1. Is this compatible with “…appears to contain Latin-2 encoded strings, re-encoded with Latin-1?”

    Hmm, if Pavel’s casting trick works one way, it should work the other, right?

    YESSSSSSSS!! That works!!!!!!!

    Henry

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 489 other followers

%d bloggers like this: