Fixing character sets in MySQL

I’ve recently had to move a few databases from MySQL 4.x to MySQL 5.x. One of the most important differences is that the 5.x family understands character encodings. Not exactly fresh news, version 5.0 was issued in 2003, but there is still a lot of 4.x installations around.

MySQL 5.0 no longer happily accepts any byte string into a VARCHAR or TEXT field. It stores encoding names as part of the table structure, and converts between encodings when necessary. MediaWiki or WordPress, when run on MySQL 4.x, store data in UTF-8, but the database itself doesn’t “know” about it. Everything seems fine, until you dump your database to a file and load it into MySQL 5.0 (or above). What happens, is that your text is considered to be latin1 (a.k.a. ISO-8859-1). If you happen to have any non-English characters as, say, article names in MediaWiki, you’re going to end up with an error message such as:

A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:

(SQL query hidden)

from within function “Article::pageData”. MySQL returned error “1267: Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation ‘=’ (localhost)”.

To fix the problem, you need to tell MySQL that your text data is really UTF-8, not latin1. You need to find all the columns of type VARCHAR or TEXT, and modify them to have UTF-8 character set. For example, if your column is VARCHAR(255), you can execute this statement:

ALTER TABLE your_table
MODIFY COLUMN your_column VARCHAR(255)
COLLATE utf8_bin;

(The utf8_bin collation is needed to keep your sorting case-sensitive.)

However, MySQL will convert your text from latin1 to UTF-8, and your text will still appear “wrong”. You can fix it in one more step. The problem is that you had UTF-8 taken to be latin1 and then stored as UTF-8. To fix this, you need to “convert” your text from UTF-8 back to latin1, and then make MySQL take it as UTF-8, but, importantly, without converting it. This can be achieved by temporarily casting your data to binary — this operation doesn’t trigger encoding changes. You can then cast your data into any encoding you want. In a nutshell, you need to go: UTF-8 →(conversion)→ latin1 → binary → UTF-8.

UPDATE your_table
SET your_column = CONVERT(
USING latin1
USING binary
USING utf8

It may take you a while to understand it. If you want to get a better feel of what’s going on, consider the following, equivalent example in shell. Let’s assume you’re using Polish diacritics: ąćęłńóśżź. Your Polish page title “Café” might in a garbled form look something like this:


Assuming your system is natively UTF-8 (most of modern Linux distributions are), an easy way to simulate text garbling is the following shell expression.

echo Café \
| iconv -f utf-8 -t utf-8 \
| iconv -f latin1 -t utf-8

Converting from UTF-8 to UTF-8 seems stupid, but I wanted it to be very clear: what we have here, is an UTF-8 string, as output by the second line, taken to be latin1 in the third line. This is how your text can become garbled. An obvious way to fix it, is to run the process backwards:

echo Café \
| iconv -f utf-8 -t latin1 \
| iconv -f utf-8 -t utf-8

Again, the UTF-8 to UTF-8 conversion is preserved to make this crucial point explicit. Your string was converted to latin1, and then taken to be UTF-8.

Back to our problem. We know how to convert the data, but we need to find all the tables and columns taht need converting. Conveniently, MySQL offers an “information_schema” database, which allows us to read information about MySQL tables. It’s enough to run this query to find all the tables of interest:

table_schema = ‘your_table’
data_type = ‘varchar’
data_type = ‘text’
character_set_name != ‘utf8’

If you’re really lazy, as I am, your data has exactly this problem (UTF-8 taken to be latni1 and looking garbled), you can use a Python script I wrote. But please be careful! You’re using it at your own risk! Backup your database first! If this script damages your database and you lose all your data, it’s your problem, not mine. You have been warned. Here’s the script.

Note: the script will only print SQL statements to screen. You have to execute them youself.

register365 and scp

My current hosting provider, hosting365 register365, is a traditional-style shared hosting service, where file upload is still done via FTP with passwords sent as clear text. They provide ssh access on demand, and it has to be manually approved by company’s staff. I don’t understand why don’t they provide shell by default. Maybe it’s part of being a traditional style hosting and trying to avoid the word “shell” or anything like it.

Shell they provide, but what about public key authentication? No, they don’t. As this is turned on by default in all Linux installations, they must have switched this option off. Why did they do that, remains a mystery as public key authentication is no less secure than password entry.

Continue reading “register365 and scp”

Code over the phone

Two weeks of waiting for it have built it up pretty much. When the phone finally rang, for some reason it felt surprising. No more waiting? Interview begins?

“I would like you to dictate me some code” said the engineer. He was describing tasks and asking me to write code that solves them. They were not daunting; could be easily decomposed into basic operations. Recruiters surely understand that people get nervous when being interviewed. The fact that one is being interviewed, not “just” asked to write some code, makes them make stupid mistakes all the time. Interviewers try to give simple tasks to people. It’s also interesting that the questions were not strictly theoretical, but more like “how would you…” followed by something to find out or to calculate.

Continue reading “Code over the phone”

Cartesian product of multiple sets

What a cartesian product is, knows everyone who ever saw a table. For example:

        |    hard-working        |      lazy        |
| smart | smart and hard-working | smart but lazy   |
| dumb  | dumb but hard-working  | dumb and lazy    |

It’s an example of product of two cartesian sets: {hard-working, lazy} and {smart, dumb}. It’s easy to generate such a product in bash:

maciej@clover ~ $ echo {smart,dumb}-{hard-working,lazy}
smart-hard-working smart-lazy dumb-hard-working dumb-lazy

It’s a list of all the possible pairs of elements.

In order to generate a cartesian product of two sets, one usually writes two nested loops. For example, in Python:

for i in ['smart', 'dumb']:
    for j in ['hard-working', 'lazy']:
        print i, j

What if we want to generate a cartesian product of three sets? Three nested loops? What about four sets? What about N sets?

I’ve found a thread with examples of code generating such cartesian products. I especially liked the solution with generators, because it avoids keeping in memory potentially enormous tables with data. The example from the forum thread:

def cartesian_product(L,*lists):
    if not lists:
        for x in L:
            yield (x,)
        for x in L:
            for y in cartesian_product(lists[0],*lists[1:]):
                yield (x,)+y

It’s a short and effective solution, using recursion. This particular implementation has one distadvantage: lists need to be given as function arguments:

cartesian_product(list1, list2, list3)

I wanted a solution where I could give it a list of lists instead.

UPDATE:  James Hopkin suggested using an asterisk (thanks!):


Here’s my original solution:

def cartesian_product(lists, previous_elements = []):
    if len(lists) == 1:
        for elem in lists[0]:
            yield previous_elements + [elem, ]
        for elem in lists[0]:
            for x in cartesian_product(lists[1:], previous_elements + [elem, ]):
                yield x

Usage of this function can look like this:

a = []
a.append(['in', 'out'])
a.append(['put', 'come'])
for i in cartesian_product(a):
    print "%s%s" % (i[0], i[1])

Another example, generating a natural binary code, with the number of bits as a parameter. Please note that when you give it a very large number of bits, it will take a lot of time to execute, but it will not exhaust the memory.

bits = 5
for i in cartesian_product([range(2) for x in range(bits)]):
    print i

Perl: argument passing weirdness

The project I’m currently working on is using Perl scripts. I usually use Python for scripting, but introducing another programming language didn’t seem like a good idea, so I decided to write in Perl.

At first, I was struck with $all @the %decorators which make the code @$difficult $to @#read. Nothing to do about it.

Next thing, a very strange way of reading function arguments. You don’t define them, all the stuff comes in something with a friendly name of @_ which is basically a lump of arguments. It’s your job to make it useful. And it isn’t necessarily easy to do.

Let me give you an example. I’ll start off with a Python function that prints a list to the screen:

a = ['a', 'b', 'c', 'd']
print a

When executed, it prints this:

['a', 'b', 'c', 'd']

A corresponding Perl code looks like this:

@a = ('a', 'b', 'c', 'd');
print "@a∖n";

And produces:

a b c d

As you can see, Python code is a little cleaner, as it doesn’t have @decorators and you don’t need to surround the list with “all the weird stuff∖n” to get the list on the screen in a comprehensive form.

Let’s get back to our code. Let us write a function that takes a list as an argument and prints it. In Python:

def one_list(list_a):
    print list_a

It’s pretty straightforward. We need to write the name of our function, then the arguments (currently just one) in the brackets and then the function body. The same in Perl:

sub one_list {
    my @list_a = @_;
    print "@list_a∖n";

What’s strange here is that there is no way to specify the list of arguments that the function accepts. There’s only the lumpy @_ variable which contains a list that the function was called with. You don’t actually know, how was the function called, because it could be one_list(@a) as well as one_list(‘a’, ‘b’, ‘c’, ‘d’) and you can’t tell any difference in the @_. That’s why I call it a lump.

Now that we have a function which accepts a list, let’s try two lists. With Python, we can just extrapolate what we had before. One more thing in the brackets, voila, two lists ready to use.

def two_lists(list_a, list_b):
    print list_a
    print list_b

What about Perl? Can we extrapolate the one_list function? Perhaps I’m spoiled, but usually as I learn how to declare a function with one argument, I don’t need to learn anything more to write a two-argument one. So, I’ve tried this…

sub two_lists {
    my ($list_a, $list_b) = @_;
    print "@list_a∖n";
    print "@list_b∖n";

…no error message. And… somehow, @list_a swallowed the second argument, leaving @list_b empty. The two list were just concatenated somewhere on the way and function sees just one list instead of original two. Ouch.

Finally, a friend brought a solution. It looks like this:

sub two_lists {
    my ($list_a_ref, $list_b_ref) = @_;
    my @list_a = @$list_a_ref;
    my @list_b = @$list_b_ref;
    print "@list_a∖n";
    print "@list_b∖n";

What it does, is it reads references to the lists and then uses weird @$ syntax to dereference them, er, make them usable.

In this example, there are additional lines of code that wouldn’t have to be there if… well… never mind.

Whatever the reason, instead of simply declaring a list of arguments, I have to wrestle with @_ and references.

Life with Python was so easy… I think you can now understand this picture (even though it was Perl 5 in the examples):

Perl 6

Programming with Perl looks more like something weird, geeky and obscure. It’s easy to write “Hello, world”, but one you want to write any complex data structure, references and inconsistent syntax will quickly get you down. Since data structures are difficult to handle in Perl, you will tend to write complex procedures, making your program difficult to read and understand and therefore difficult to maintain. It’s the opposite to what Eric Raymond says about handling complexity.

Perl seems to be a quintessence of what I don’t like in programming languages.

Dreamhost 100MB memory limit

I’ve recently found a thread on Google Groups which mentioned a 100MB memory limit for FCGI processes on Dreamhost, which can be a reason for killing them by their process monitor.

One of the posts says:

Interestingly, this limit doesn’t apply to Ruby processes. When I asked them if this was an admission that Ruby on Rails has a sad deployment story, the response was “Ahem.. =)”

This could explain my trick with the “dispatch.fcgi” file name, assuming this is how their process monitors detect Ruby on Rails. Again, it’s only my guesses.

Dreamhost, kernel 2.6, FCGI and threads

My Dreamhost server got rebooted yesterday. After the reboot, I’ve noticed two things:

  1. Kernel 2.6
  2. My Django application down. I’m not sure if it was the 2.6 kernel that caused the problem. It could be a coincidence.

I’ve spent a whole day investigating the problem. It turned out that Python couldn’t make a new thread. Just like Grimboy, I’ve changed “threaded” method into “prefork” in dispatch.fcgi and got my site up and running.

FCGI is pretty difficult to debug, I must say. To get a debug message, I needed to run a Perl script, from which a Python script was called, with stderr redirected to a file.

API for full-text search in Django

Let me imagine a way I’d like to use a full-text search in Django. It would look like this:

class Person(models.Model):
….first_name = models.CharField(maxlength = 50)
….about = models.TextField()
….class TextSearch:

# This would return a QuerySet
people =“Miles Davis”)

That’s it.

The inner class “TextSearch” would take optional arguments like the list of fields to be indexed. All fields would be indexed by default.

I am aware of already existing projects which provide search capabilities to Django.

  • Mercurytide uses MySQL-specific functions, so it wouldn’t work for other database backends.
  • Merquery doesn’t seem to have a nice API. For example, a system path is needed to initialize an indexer.

Any other search engines out there for Django?