Fixing character sets in MySQL

I’ve recently had to move a few databases from MySQL 4.x to MySQL 5.x. One of the most important differences is that the 5.x family understands character encodings. Not exactly fresh news, version 5.0 was issued in 2003, but there is still a lot of 4.x installations around.

MySQL 5.0 no longer happily accepts any byte string into a VARCHAR or TEXT field. It stores encoding names as part of the table structure, and converts between encodings when necessary. MediaWiki or WordPress, when run on MySQL 4.x, store data in UTF-8, but the database itself doesn’t “know” about it. Everything seems fine, until you dump your database to a file and load it into MySQL 5.0 (or above). What happens, is that your text is considered to be latin1 (a.k.a. ISO-8859-1). If you happen to have any non-English characters as, say, article names in MediaWiki, you’re going to end up with an error message such as:

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function “Article::pageData”. MySQL returned error “1267: Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation ‘=’ (localhost)”.

To fix the problem, you need to tell MySQL that your text data is really UTF-8, not latin1. You need to find all the columns of type VARCHAR or TEXT, and modify them to have UTF-8 character set. For example, if your column is VARCHAR(255), you can execute this statement:

ALTER TABLE your_table
MODIFY COLUMN your_column VARCHAR(255)
COLLATE utf8_bin;

(The utf8_bin collation is needed to keep your sorting case-sensitive.)

However, MySQL will convert your text from latin1 to UTF-8, and your text will still appear “wrong”. You can fix it in one more step. The problem is that you had UTF-8 taken to be latin1 and then stored as UTF-8. To fix this, you need to “convert” your text from UTF-8 back to latin1, and then make MySQL take it as UTF-8, but, importantly, without converting it. This can be achieved by temporarily casting your data to binary — this operation doesn’t trigger encoding changes. You can then cast your data into any encoding you want. In a nutshell, you need to go: UTF-8 →(conversion)→ latin1 → binary → UTF-8.

UPDATE your_table
SET your_column = CONVERT(
USING latin1
USING binary
USING utf8

It may take you a while to understand it. If you want to get a better feel of what’s going on, consider the following, equivalent example in shell. Let’s assume you’re using Polish diacritics: ąćęłńóśżź. Your Polish page title “Café” might in a garbled form look something like this:


Assuming your system is natively UTF-8 (most of modern Linux distributions are), an easy way to simulate text garbling is the following shell expression.

echo Café \
| iconv -f utf-8 -t utf-8 \
| iconv -f latin1 -t utf-8

Converting from UTF-8 to UTF-8 seems stupid, but I wanted it to be very clear: what we have here, is an UTF-8 string, as output by the second line, taken to be latin1 in the third line. This is how your text can become garbled. An obvious way to fix it, is to run the process backwards:

echo Café \
| iconv -f utf-8 -t latin1 \
| iconv -f utf-8 -t utf-8

Again, the UTF-8 to UTF-8 conversion is preserved to make this crucial point explicit. Your string was converted to latin1, and then taken to be UTF-8.

Back to our problem. We know how to convert the data, but we need to find all the tables and columns taht need converting. Conveniently, MySQL offers an “information_schema” database, which allows us to read information about MySQL tables. It’s enough to run this query to find all the tables of interest:

table_schema = ‘your_table’
data_type = ‘varchar’
data_type = ‘text’
character_set_name != ‘utf8’

If you’re really lazy, as I am, your data has exactly this problem (UTF-8 taken to be latni1 and looking garbled), you can use a Python script I wrote. But please be careful! You’re using it at your own risk! Backup your database first! If this script damages your database and you lose all your data, it’s your problem, not mine. You have been warned. Here’s the script.

Note: the script will only print SQL statements to screen. You have to execute them youself.

Django – phpBB

About two years ago, I’ve started putting together a set of classes to allow Django to interact with a phpBB database. It was used to interact with phpBB 2.x at the time. The code was sitting together with the rest of the project I was working on.

The code was sitting idle for quite a while. I returned to it lately and decided to separate the phpBB-relevant part and create a new project, dedicated to creating Django-phpBB compatibility layer.

I’m currently working on updating the code to be compatible with Django 1.0 and phpBB 3.0. I’ve ported phpBB’s password hashing functions, my current focus is on the remaining user auth bits and updating the forum view code.

Project’s name is simply django-phpbb. Its goal is to allow you to access a live, unmodified phpBB database, read from it and authenticate users against it.

Directory renaming in SCM

SCM stands for Source Code Management. Pretty much the same thing can be called VCS, Version Control Software. Perhaps even more TLA’s are there out in the wild. It all boils down to a program which allows programmers to manage their source code.

Pretty much everybody who started using SCM, started with CVS and then moved to something else. Probably Subversion, which is meant to be a CVS replacement. For more adventurous or demanding developers, there are many other SCM’s: Git, Bazaar, Monotone, Mercurial, Darcs… and more.

Mark Shuttleworth has written an interesting thing: that file and directory renaming is one of the most important operations to be handled with an SCM. I got curious and wrote a test case for three SCM’s I know: Bazaar, Git and Subversion. The scenario is:

Continue reading “Directory renaming in SCM”

Code over the phone

Two weeks of waiting for it have built it up pretty much. When the phone finally rang, for some reason it felt surprising. No more waiting? Interview begins?

“I would like you to dictate me some code” said the engineer. He was describing tasks and asking me to write code that solves them. They were not daunting; could be easily decomposed into basic operations. Recruiters surely understand that people get nervous when being interviewed. The fact that one is being interviewed, not “just” asked to write some code, makes them make stupid mistakes all the time. Interviewers try to give simple tasks to people. It’s also interesting that the questions were not strictly theoretical, but more like “how would you…” followed by something to find out or to calculate.

Continue reading “Code over the phone”

Slugify in a shell script

When constructing nice file names or URLs, it’s often nice to “slugify” a string, so it has a form of alphanumerics separated by dashes. For instance, you may have a string like this:

Linux clover 2.6.19-gentoo-r5 i686 Genuine Intel(R) CPU T2050 @ 1.60GHz

It has uppercase and lowercase letters, digits, brackets… you need to remove all but alphanumerics while retaining readability. Basically, you may want for instance:


If you append “.html” to it, it makes a very nice URL, doesn’t it?

Here’s a part of a pipe chain that slugifies strings:

sed -e 's/[^[:alnum:]]/-/g' | tr -s '-' | tr A-Z a-z

If you have a shell script and you want to slugify variable content, you can:

SLUGIFIED="$(echo -n "${VARIABLE}" | sed -e 's/[^[:alnum:]]/-/g' \
| tr -s '-' | tr A-Z a-z)"

Note that wordpress likes to mess up quotes. They are meant to be plain, double ones.

Parallel programming course

I have spent this whole week in the Computer Science and Informatics building. I wonder how did “informatics” creep into the English language; I was taught in 2002 that there is no such thing as “informatics”. There’s only Computer Science. Term “informatics” was supposed to be used only by mistake. German has “informatik”, Polish has “informatyka”, it’s probably those non-native English speakers who just kept using it until even English people started believing that it’s a legitimate English word. A lie told a thousand times… well, what was I… yes, the course.

The main topic was parallel programming, harnessing multiple processors to solve a single, computationally-intensive task such as a weather forecast or a car-crash simulation. There’s more than that, there are many more problems that you can solve and lots of money you save by simulating things for you instead of doing them for real.

Continue reading “Parallel programming course”

Screen for gnome-terminal users

I have noticed that gnome-terminal doesn’t handle bold fonts properly. Instead of using a bold version of given font, gnome-terminal displays a kind of widened regular font.

Not really a bold font

In contrast to gnome-terminal, xterm uses the bold variant:

Bold variant is used here
This is enough for me to want to use xterm instead of gnome-terminal, but there’s one problem: tabs.

Everyone who uses Firefox , knows how addictive tabs can be. There are no tabs in xterm, but there is a program called screen, which can do the same job of handling multiple terminals in one window. The only problem with screen was that it uses CTRL+A, N and CTRL+A, P keystrokes to switch the terminals. I am used to do the same thing with CTRL+Page up, CTRL+Page down not only because gnome-terminal uses that, but because it appears to be a standard key shortcut for tab switching in GTK applications. Fortunately, it is possible to configure screen to use those shortcuts. It’s enough to add the following two lines to ~/.screenrc file.

bindkey ^[[5;5~ prev
bindkey ^[[6;5~ next

Another problem with (unconfigured) screen is that there is no status bar showing the list of currently open terminals. This can be configured too, as explained on Gentoo Wiki. This configuration allowed me to switch easily from gnome-terminal to xterm + screen.