MySQL encoding problems on Dreamhost

I’m running phpBB, MediaWiki and WordPress on Dreamhost. All the applications use MySQL database. Once I imported the data into the database, I checked how it looks like in phpMyAdmin. I was a little concerned when I saw latin1_swedish_ci collation in all the text columns in all tables. I checked the applications, expecting to see wrong encoding displayed, but everything seemed fine.

I learned the truth later, when developing a Django application which sits on top of the existing phpBB tables. All the data in the tables was stored wrongly encoded, but since the encoding and decoding were symmetrically wrong, all the characters were displayed correctly. Unfortunately, the database content is stored wrongly.

The problem is, all the databases on Dreamhost are created with LATIN1 default encoding (LATIN1 and ISO-8859-1 are synonyms), and it’s impossible to create a database with, say, UTF-8 default encoding. As a result, all the connections to the database are in LATIN1 by default. It is possible to set the encoding to UTF-8, but applications don’t do that. Typically. Because Django does.

Django stores all the text correctly encoded, other applications ― wrongly. Everything is fine, unless Django reads data from other applications. All the accented characters are trashed. I’ve written a small wrapper function that could bring some of the text to the proper encoding:

def repair_encoding(s):
    try:
        return s.decode('utf-8').encode('latin1').decode('latin2').encode('utf-8')
    except:
        return s

What it does, is:

  1. Read the data (variable s) and consider it an UTF-8 encoded text, storing it as Unicode
  2. Encode the Unicode object in LATIN1
  3. Take the LATIN1-encoded text and consider it LATIN2, convert it to Unicode again
  4. Encode it in UTF-8
  5. If any of the above fails, just return the original data

Steps 1-4 can fail, especially step 2, where it can happen that the Unicode object contains characters that are not present in LATIN1.

This hack allows to read data from PHP applications, but I wanted to repair the wrongly encoded text, so all the database content is straightened out. I saw some tutorials which involved dumping and restoring the database. I didn’t want that because that would mean a considerable downtime. I wanted to fix that in place. I’ve finally figured it out. Here’s how to fix column colname in table tablename.

SET NAMES latin1;
ALTER TABLE tablename MODIFY COLUMN colname TEXT CHARACTER SET latin1;
ALTER TABLE tablename MODIFY COLUMN colname blob;
ALTER TABLE tablename MODIFY COLUMN colname TEXT CHARACTER SET utf8;
SET NAMES utf8;

It should be ran against every TEXT column in the database. The same applies to the VARCHAR and CHAR columns.

After applying the script, all the data in the database is encoded correctly. The problem is that the PHP applications started displaying trashed text on-line. It was due the default LATIN1 encoding connection on Dreamhost. I fixed it by adding the below query just after the connection was established. Alternatively, it could be added before every query.

SET NAMES utf8;

This line sets the connection encoding to UTF-8, so all the data is transmitted to the application in correct encoding.

If I knew how to set the default encoding to UTF-8, it wouldn’t be necessary. I’ve posted a question about it on Dreamhost forum. We’ll see if there will be any answer.

7 Responses to “MySQL encoding problems on Dreamhost”

  1. Joanna Ludmiła Says:

    latin1_swedish_ci ist the default for MySQL, not for Dreamhost, don’t ask why swedish – I don’t know!

  2. automatthias Says:

    Is it? My local MySQL installation has UTF-8 by default.

  3. Joanna Ludmiła Says:

    As far as I know — I’ve read about it in a book about MySQL — last week. Maybe you can change somewhere this default.

  4. Franky Says:

    those methods there are just not working with me ..

    i dont know why .. but …finanlly .. i found a way to fix my problems.

    hopefully could help you guys as well .
    ———————–
    1.SET NAMES ‘utf8′;

    2.ALTER DATABASE phpbb DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

    ———————–

    phpbb is my DB name, you should change your own database name !

  5. millionnairedeluxe Says:

    I managed to fix my problem with french accents on
    http://www.millionnairedeluxe.com

    by choosing ut8_bin as MySQL connection collation and MySQL charset: UTF-8 Unicode (utf8)

    to import my database to dreamhost.

    Hope it can help.

  6. Unicode or How to deal with f’d up text in Django at Spindrop Says:

    [...] searching as much as I could into django and unicode, I found this article which [...]

  7. Character encoding issues — MySql + Django: latin-1 to unicode | Disembrangling programming Says:

    [...] For some reason, you cannot change the character encoding in MySQL as this example shows. [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.