I’m running phpBB, MediaWiki and WordPress on Dreamhost. All the applications use MySQL database. Once I imported the data into the database, I checked how it looks like in phpMyAdmin. I was a little concerned when I saw latin1_swedish_ci collation in all the text columns in all tables. I checked the applications, expecting to see wrong encoding displayed, but everything seemed fine.
I learned the truth later, when developing a Django application which sits on top of the existing phpBB tables. All the data in the tables was stored wrongly encoded, but since the encoding and decoding were symmetrically wrong, all the characters were displayed correctly. Unfortunately, the database content is stored wrongly.
The problem is, all the databases on Dreamhost are created with LATIN1 default encoding (LATIN1 and ISO-8859-1 are synonyms), and it’s impossible to create a database with, say, UTF-8 default encoding. As a result, all the connections to the database are in LATIN1 by default. It is possible to set the encoding to UTF-8, but applications don’t do that. Typically. Because Django does.
Django stores all the text correctly encoded, other applications ― wrongly. Everything is fine, unless Django reads data from other applications. All the accented characters are trashed. I’ve written a small wrapper function that could bring some of the text to the proper encoding:
def repair_encoding(s): try: return s.decode('utf-8').encode('latin1').decode('latin2').encode('utf-8') except: return s
What it does, is:
- Read the data (variable s) and consider it an UTF-8 encoded text, storing it as Unicode
- Encode the Unicode object in LATIN1
- Take the LATIN1-encoded text and consider it LATIN2, convert it to Unicode again
- Encode it in UTF-8
- If any of the above fails, just return the original data
Steps 1-4 can fail, especially step 2, where it can happen that the Unicode object contains characters that are not present in LATIN1.
This hack allows to read data from PHP applications, but I wanted to repair the wrongly encoded text, so all the database content is straightened out. I saw some tutorials which involved dumping and restoring the database. I didn’t want that because that would mean a considerable downtime. I wanted to fix that in place. I’ve finally figured it out. Here’s how to fix column colname in table tablename.
SET NAMES latin1;
ALTER TABLE tablename MODIFY COLUMN colname TEXT CHARACTER SET latin1;
ALTER TABLE tablename MODIFY COLUMN colname blob;
ALTER TABLE tablename MODIFY COLUMN colname TEXT CHARACTER SET utf8;
SET NAMES utf8;
It should be ran against every TEXT column in the database. The same applies to the VARCHAR and CHAR columns.
After applying the script, all the data in the database is encoded correctly. The problem is that the PHP applications started displaying trashed text on-line. It was due the default LATIN1 encoding connection on Dreamhost. I fixed it by adding the below query just after the connection was established. Alternatively, it could be added before every query.
SET NAMES utf8;
This line sets the connection encoding to UTF-8, so all the data is transmitted to the application in correct encoding.
If I knew how to set the default encoding to UTF-8, it wouldn’t be necessary. I’ve posted a question about it on Dreamhost forum. We’ll see if there will be any answer.