Fixing character sets in MySQL

I’ve recently had to move a few databases from MySQL 4.x to MySQL 5.x. One of the most important differences is that the 5.x family understands character encodings. Not exactly fresh news, version 5.0 was issued in 2003, but there is still a lot of 4.x installations around.

MySQL 5.0 no longer happily accepts any byte string into a VARCHAR or TEXT field. It stores encoding names as part of the table structure, and converts between encodings when necessary. MediaWiki or WordPress, when run on MySQL 4.x, store data in UTF-8, but the database itself doesn’t “know” about it. Everything seems fine, until you dump your database to a file and load it into MySQL 5.0 (or above). What happens, is that your text is considered to be latin1 (a.k.a. ISO-8859-1). If you happen to have any non-English characters as, say, article names in MediaWiki, you’re going to end up with an error message such as:

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function “Article::pageData”. MySQL returned error “1267: Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation ‘=’ (localhost)”.

To fix the problem, you need to tell MySQL that your text data is really UTF-8, not latin1. You need to find all the columns of type VARCHAR or TEXT, and modify them to have UTF-8 character set. For example, if your column is VARCHAR(255), you can execute this statement:

ALTER TABLE your_table
MODIFY COLUMN your_column VARCHAR(255)
CHARACTER SET utf8
COLLATE utf8_bin;

(The utf8_bin collation is needed to keep your sorting case-sensitive.)

However, MySQL will convert your text from latin1 to UTF-8, and your text will still appear “wrong”. You can fix it in one more step. The problem is that you had UTF-8 taken to be latin1 and then stored as UTF-8. To fix this, you need to “convert” your text from UTF-8 back to latin1, and then make MySQL take it as UTF-8, but, importantly, without converting it. This can be achieved by temporarily casting your data to binary — this operation doesn’t trigger encoding changes. You can then cast your data into any encoding you want. In a nutshell, you need to go: UTF-8 →(conversion)→ latin1 → binary → UTF-8.

UPDATE your_table
SET your_column = CONVERT(
CONVERT(
CONVERT(
your_column
USING latin1
)
USING binary
)
USING utf8
);

It may take you a while to understand it. If you want to get a better feel of what’s going on, consider the following, equivalent example in shell. Let’s assume you’re using Polish diacritics: ąćęłńóśżź. Your Polish page title “Café” might in a garbled form look something like this:

Café

Assuming your system is natively UTF-8 (most of modern Linux distributions are), an easy way to simulate text garbling is the following shell expression.

echo Café \
| iconv -f utf-8 -t utf-8 \
| iconv -f latin1 -t utf-8
Café

Converting from UTF-8 to UTF-8 seems stupid, but I wanted it to be very clear: what we have here, is an UTF-8 string, as output by the second line, taken to be latin1 in the third line. This is how your text can become garbled. An obvious way to fix it, is to run the process backwards:

echo Café \
| iconv -f utf-8 -t latin1 \
| iconv -f utf-8 -t utf-8
Café

Again, the UTF-8 to UTF-8 conversion is preserved to make this crucial point explicit. Your string was converted to latin1, and then taken to be UTF-8.

Back to our problem. We know how to convert the data, but we need to find all the tables and columns taht need converting. Conveniently, MySQL offers an “information_schema” database, which allows us to read information about MySQL tables. It’s enough to run this query to find all the tables of interest:

SELECT
table_name,
column_name,
column_type,
character_set_name
FROM
columns
WHERE
table_schema = ‘your_table’
AND
(
data_type = ‘varchar’
OR
data_type = ‘text’
)
AND
character_set_name != ‘utf8’
;

If you’re really lazy, as I am, your data has exactly this problem (UTF-8 taken to be latni1 and looking garbled), you can use a Python script I wrote. But please be careful! You’re using it at your own risk! Backup your database first! If this script damages your database and you lose all your data, it’s your problem, not mine. You have been warned. Here’s the script.

Note: the script will only print SQL statements to screen. You have to execute them youself.

Advertisements

Genetic data in PostgreSQL

People get usually famous for the things they’ve done. Well, that’s not entirely true. They usually get famous for the things they’ve done, when they were successful. You don’t get famous for attempting and being unsuccessful, now do you?

It works the same way for the scientific publications. All scientists work hard trying various things, and when they finally succeed, they publish a paper. But what happens with all those hours spend on unsuccessful attempts? Nobody seems to be proud of blowing a whole laboratory up. Or whatever didn’t work for them. This means that other people can never learn that something was unsuccessful and they’re likely to get the same, unfeasible, idea and repeat the same research. Needless to say, unsuccessfully.

Not that I’m proud of what I’ve done here, but I will at least allow other people to find this post on Google, when searching for genetic data and relational database. I’ll describe what I did, so they at least don’t do it the way I did.

Continue reading “Genetic data in PostgreSQL”

Tetralogy of Fallot database representation

Tetralogy of Fallot is a significant and complex congenital heart disease. It consists of four heart malformations. So if a patient is described to have TOF, it means that she/he has all those four malformations together.

However, the separate malformations are already present on the diagnoses list, as separate entities. From a data-modeling perspective, it’s a redundancy on the factors (malformations, diseases) list. This leads to problems with interpretation. As the VSD is one of TOF’s components, is already present on the diseases list, and users are allowed to enter both VSD and TOF diagnoses, there are patients with all four combinations in the database.

Continue reading “Tetralogy of Fallot database representation”

Malcolm Tredinnick’s SQL puzzle solution

The puzzle

Malcolm has asked, how to find the classes that were attended by all of the students from a given list. Then, he proposed a solution with a HAVING clause. I’ll call it the one-join solution. I’d like to suggest another one, which I’ll call multi-join.

I’ve made a benchmark to evaluate the execution time. A statistical tool was used to create a mathematical model of the execution time.

Continue reading “Malcolm Tredinnick’s SQL puzzle solution”