Authentication in django-phpbb

I’ve recently checked in changes to django-phpbb, my Django-phpBB integration project. It’s now possible to authenticate users against an unmodified phpBB database. I’ve also added installation instructions.

Current focus will be on removing the parts that are specific to my project and making django-phpbb more generic.

Dreamhost 100MB memory limit

I’ve recently found a thread on Google Groups that mentions a 100MB memory limit for FCGI processes on Dreamhost. Exceeding it may be the reason their process monitor kills them.

One of the posts says:

Interestingly, this limit doesn’t apply to Ruby processes. When I asked them if this was an admission that Ruby on Rails has a sad deployment story, the response was “Ahem.. =)”

This could explain my trick with the “dispatch.fcgi” file name, assuming this is how their process monitors detect Ruby on Rails. Again, these are only my guesses.

Dreamhost, kernel 2.6, FCGI and threads

My Dreamhost server got rebooted yesterday. After the reboot, I’ve noticed two things:

  1. Kernel 2.6
  2. My Django application was down. I’m not sure if it was the 2.6 kernel that caused the problem. It could be a coincidence.

I spent a whole day investigating the problem. It turned out that Python couldn’t create a new thread. Just like Grimboy, I changed the method from “threaded” to “prefork” in dispatch.fcgi and got my site up and running.

FCGI is pretty difficult to debug, I must say. To get a debug message, I needed to run a Perl script, from which a Python script was called, with stderr redirected to a file.
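The Perl layer isn’t essential to the trick; what matters is that stderr ends up in a file you can read. A minimal sketch of the same idea in Python 3, where `run_with_stderr_log` and the paths in the usage comment are my own hypothetical names, not anything provided by Dreamhost, flup or Django:

```python
import subprocess
import sys


def run_with_stderr_log(command, log_path):
    """Run a command, appending whatever it writes to stderr to log_path.

    Returns the exit code of the command.
    """
    with open(log_path, "ab") as log:
        completed = subprocess.run(command, stderr=log)
    return completed.returncode


# Hypothetical usage; adjust the paths to wherever the script and log live:
# run_with_stderr_log([sys.executable, "dispatch.fcgi"], "fcgi-stderr.log")
```

The log is opened in append mode, so repeated runs pile up in one file instead of overwriting each other.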

API for full-text search in Django

Let me imagine a way I’d like to use a full-text search in Django. It would look like this:

class Person(models.Model):
    first_name = models.CharField(maxlength = 50)
    about = models.TextField()

    class TextSearch:
        pass

# This would return a QuerySet
people = Person.objects.search("Miles Davis")

That’s it.

The inner class “TextSearch” would take optional arguments like the list of fields to be indexed. All fields would be indexed by default.
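To make the “all fields by default” rule concrete, here is a rough sketch of how such options might be resolved. It uses plain classes in place of Django models so it runs on its own; `indexed_fields` and the `fields` option name are assumptions made up for illustration, not an existing Django API:

```python
def indexed_fields(model_cls, all_fields):
    """Return the field names to index for model_cls.

    No inner TextSearch class: the model is not searchable (empty list).
    TextSearch without a `fields` attribute: index every field.
    """
    options = getattr(model_cls, "TextSearch", None)
    if options is None:
        return []
    return list(getattr(options, "fields", all_fields))


# Plain classes stand in for Django models here.
class Person:
    class TextSearch:
        pass  # no options given: index everything


class Article:
    class TextSearch:
        fields = ["title"]  # hypothetical option: index only the title


class Tag:
    pass  # no TextSearch class: not searchable
```

With these definitions, `indexed_fields(Person, ["first_name", "about"])` gives back both fields, while `Article` is limited to its title and `Tag` is skipped entirely.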

I am aware of already existing projects which provide search capabilities to Django.

  • Mercurytide uses MySQL-specific functions, so it wouldn’t work for other database backends.
  • Merquery doesn’t seem to have a nice API. For example, a system path is needed to initialize an indexer.

Any other search engines out there for Django?

PHP on Dreamhost also suffers from 500

My idea of renaming the django.fcgi file to dispatch.fcgi fixed the “500 Internal Server Error” mostly, but not completely. Watching site stats and Google webmaster tools reports, I was seeing “500” errors popping up every now and then. At first, I thought that there might still be some Django-related problem. My latest observations suggest that it’s probably a general issue which concerns both Python and PHP. And possibly Ruby as well.

My wild guess is that it’s got something to do with the server load. When a server is busy, some processes get killed. When fcgi doesn’t receive any data from a killed process, it returns 500.

So, the question arises: should Dreamhost be removed from the list of Django-friendly hosts? Or should it be removed from the list of anything-friendly hosts? No, it shouldn’t, because it should definitely stay on the list of wallet-friendly hosts.

Killing phpBB softly

My Polish forum is powered by phpBB. Undoubtedly, it’s the most popular bulletin board package. It’s free (as in freedom), easy to install and easy to use. Virtually every Internet user has had some exposure to it. When starting a new forum, it’s a safe choice.

As the years were passing by and my forum was growing bigger, I started being somewhat dissatisfied with it. Smaller and bigger annoyances were biting me every now and then. I’d like to point out some of them.

  • Search. Its user interface is unnecessarily complicated. It yields unsatisfactory results. As a result, people don’t want to use it and tend to ask the same questions over and over again. A good forum engine needs a decent search. Look at Vanilla’s search, it’s so simple and functional! Although that doesn’t mean I wouldn’t like to simplify it even a little more.
  • Uncomfortable add-ons installation. So-called mods are distributed as instructions on how to modify the code. You have to open files and edit them by hand. One missed dot, BANG! Your forum is down. Want to upgrade your modified phpBB? It’s very likely that you will have to install it from scratch and install all the mods again. That’s why my moderators still don’t have the “merge topics” mod back. (sorry! I’ll try to install it some time!)
  • Crufty URLs. Compare “/viewtopic.php?t=1234” with “/topics/1234/i-like-clean-urls/”
  • Google won’t index it. It’s a mystery. Perhaps Google recognizes phpBB and avoids it. phpBB has a nasty habit of “enriching” its URLs with things that are different each time, generating an infinite number of addresses. Google can never know if it has got all the topics from the forum. No wonder it gets discouraged. This causes a major problem: if the forum is not indexed, it doesn’t come up in search results and there ain’t no people coming! I consider it the biggest problem with phpBB.
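The clean URL from the list above is easy to build from a topic’s id and title. A minimal sketch using only the standard library; `slugify` and `topic_url` are names I made up for illustration, and the `/topics/<id>/<slug>/` layout is simply the one from the example:

```python
import re
import unicodedata


def slugify(title):
    """Lower-case ASCII slug: letters and digits joined by single hyphens."""
    # Drop accents where Unicode can decompose them (e.g. "ó" -> "o").
    # Letters with no decomposition, such as "ł", are dropped entirely.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def topic_url(topic_id, title):
    """Build a URL like /topics/1234/i-like-clean-urls/."""
    slug = slugify(title)
    if slug:
        return "/topics/%d/%s/" % (topic_id, slug)
    # Titles with nothing usable in them fall back to the bare id.
    return "/topics/%d/" % topic_id
```

Since the id alone identifies the topic, the slug is only decoration: the address stays the same on every visit, with nothing session-specific in it.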

I could also complain about the lack of several features, including tags, ranking, finding similar topics, etc. Many of them are available… as mods, of course. Theoretically, I could fix three of the above problems, but once phpBB required an upgrade, I’d have to edit all the source code again, by hand. That’s the main reason why I haven’t been adding many things to the forum.

I tried installing Vanilla. It’s brilliant, but once I launched a test installation, the users who visited it complained about everything they could. I tried to fix the things they mentioned, but there was one major and unavoidable problem: Vanilla doesn’t look like phpBB. For example, buttons are in different places. Users are so tied to the existing interface that they can’t stand a button moved from right to left. I gave up on Vanilla.

I considered writing my own forum engine, then started having doubts and finally gave up. It’s too much hassle. Loads of work, data migration, user complaints… I would have probably rewritten the whole thing if I were younger. I would work furiously for many weeks, then force users into the new version, take flame-war attacks on my chest… No, I don’t want to do that any more.

However, I’m still too young to just sit around. With just a few hours to spare, I started playing around with Django, writing a model on top of the phpBB database. I was soon able to fiddle with forums, topics and posts using Django’s ORM. I created a read-only forum archive with clean URLs, an RSS feed and a Sitemap for Google. The forum sitemap consists of about three thousand URLs, where each URL is the starting point of a topic. Each topic can have several pages.
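For readers who haven’t seen one, a sitemap is just an XML list of addresses. Django ships its own sitemap framework (django.contrib.sitemaps); the sketch below only shows the shape of the output using the standard library. `build_sitemap`, the base URL and the `/topics/<id>/` layout are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, topic_ids):
    """Return sitemap XML (as a str) with one <url> per topic's first page."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for topic_id in topic_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        loc.text = "%s/topics/%d/" % (base_url.rstrip("/"), topic_id)
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical usage:
# print(build_sitemap("http://example.com", [1, 2, 3]))
```

The sitemap protocol allows up to 50,000 URLs per file, so a few thousand topics fit in a single one.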

My models work directly on phpBB database tables without modifying them. phpBB itself doesn’t even “know” that someone else is reading its dear tables.

My forum users didn’t notice anything. They’re happily using the old phpBB forum. In the meantime, Googlebot is crawling the Django-powered forum archive with dogged persistence. I think it will soon include the archive in its index and start directing traffic to it.

I’ll keep on developing the Django-powered forum. I can do it slowly and on-line. I will add a nice search engine, posts ranking and all other stuff that will come to my mind. Thing is, I won’t be touching the original phpBB tables. If I ever need to extend some models, I’ll just use Django OneToOne mapping. Current phpBB users will be able to use their forum just as they were before. However, all the cool features will be appearing on the new, Django-powered forum. They might find it more useful and start using it instead of the PHP version. It doesn’t need to happen any time soon. I can take my time developing the features as I want them. If they don’t like it, they can always go back to the PHP version.

It will be all soft. There will be no data migration. No forced user interface change. I’m going to slowly attract phpBB users to the new, Django-powered forum interface.

I’ll put all the phpBB-related code in a separate package and publish it once it’s mature enough. It won’t necessarily be a forum implementation. It will be a Django-phpBB integration layer that will allow Django programmers to develop their own ideas for their phpBB-powered forums.

I’ll be killing phpBB softly.

MySQL encoding problems on Dreamhost

I’m running phpBB, MediaWiki and WordPress on Dreamhost. All the applications use a MySQL database. Once I had imported the data into the database, I checked how it looked in phpMyAdmin. I was a little concerned when I saw the latin1_swedish_ci collation on all the text columns in all tables. I checked the applications, expecting to see wrongly encoded text displayed, but everything seemed fine.

I learned the truth later, when developing a Django application which sits on top of the existing phpBB tables. All the data in the tables was stored wrongly encoded, but since the encoding and decoding were symmetrically wrong, all the characters were displayed correctly. The applications looked fine while the database content itself was wrong.
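Here is a small illustration of how two wrongs make a right. It assumes the PHP application sends ISO-8859-2 (LATIN2) bytes, which is common for Polish sites, and it uses Python’s latin1 codec as a stand-in for MySQL’s; both are assumptions of the sketch rather than something I have verified byte by byte:

```python
original = "żółć"

# What the PHP application puts on the wire.
sent_bytes = original.encode("iso-8859-2")

# The connection is declared LATIN1, so MySQL believes these are LATIN1 characters.
stored_text = sent_bytes.decode("latin1")

# PHP reads over the same LATIN1 connection, gets the same bytes back and
# interprets them as ISO-8859-2 again: the two errors cancel out.
php_view = stored_text.encode("latin1").decode("iso-8859-2")

# A client on a UTF-8 connection receives the characters MySQL believes in.
utf8_client_view = stored_text

print(php_view == original)          # the PHP side looks fine
print(utf8_client_view == original)  # the UTF-8 side does not
```

The bytes never change; only the label attached to them does, which is why nothing looks broken until a client with a different connection encoding shows up.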

The problem is, all the databases on Dreamhost are created with LATIN1 default encoding (MySQL’s latin1 is close to ISO-8859-1; strictly speaking it is Windows-1252), and it’s impossible to create a database with, say, UTF-8 default encoding. As a result, all the connections to the database are in LATIN1 by default. It is possible to set the connection encoding to UTF-8, but applications typically don’t do that. Django does.

Django stores all the text correctly encoded; the other applications store it wrongly. Everything is fine until Django reads data written by the other applications: all the accented characters are trashed. I’ve written a small wrapper function that brings some of the text back to the proper encoding:

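def repair_encoding(s):
    """Undo the double mis-encoding of a UTF-8 byte string read from MySQL."""
    try:
        # UTF-8 bytes -> Unicode -> raw bytes (via LATIN1) -> Unicode (as LATIN2) -> UTF-8 bytes
        return s.decode('utf-8').encode('latin1').decode('latin2').encode('utf-8')
    except (UnicodeError, AttributeError):
        # Not the expected mis-encoding (or not a byte string, e.g. None): leave it untouched.
        return s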

What it does:

  1. Read the data (variable s) and consider it a UTF-8 encoded text, storing it as Unicode
  2. Encode the Unicode object in LATIN1
  3. Take the LATIN1-encoded text and consider it LATIN2, convert it to Unicode again
  4. Encode it in UTF-8
  5. If any of the above fails, just return the original data

Steps 1-4 can fail, especially step 2, where it can happen that the Unicode object contains characters that are not present in LATIN1.
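For comparison, here is how the steps above might look in Python 3, where the database driver normally hands over str objects, so steps 1 and 4 (the UTF-8 decoding and encoding) disappear. `repair_text` is a hypothetical name, and the sketch assumes the original bytes were LATIN2, as in the function above:

```python
def repair_text(text):
    """Python 3 variant of the repair steps, working on str instead of bytes."""
    try:
        # Step 2: recover the original bytes; fails for characters outside LATIN1.
        raw = text.encode("latin1")
        # Step 3: reinterpret those bytes as LATIN2.
        return raw.decode("iso-8859-2")
    except UnicodeError:
        # Step 5: leave the text alone if it does not fit the pattern.
        return text


print(repair_text("¿ó³æ"))  # mis-decoded LATIN2 bytes
print(repair_text("żółć"))  # "ż" is not in LATIN1, so step 2 fails
```

One caveat: the function can’t tell correct text apart from mis-decoded text when every character happens to exist in LATIN1. A correctly stored “æ” would be turned into “ć”, which is probably why this only works as a stopgap.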

This hack allows reading data from PHP applications, but I wanted to repair the wrongly encoded text itself, so that all the database content is straightened out. I saw some tutorials which involved dumping and restoring the database. I didn’t want that, because it would mean considerable downtime. I wanted to fix it in place. I’ve finally figured it out. Here’s how to fix column colname in table tablename.

SET NAMES latin1;
ALTER TABLE tablename MODIFY COLUMN colname TEXT CHARACTER SET latin1;
ALTER TABLE tablename MODIFY COLUMN colname blob;
ALTER TABLE tablename MODIFY COLUMN colname TEXT CHARACTER SET utf8;
SET NAMES utf8;

It should be run against every TEXT column in the database. The same applies to the VARCHAR and CHAR columns.
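With dozens of columns, typing these statements by hand gets tedious. A small generator along these lines could produce the script; `conversion_statements` and `conversion_script` are hypothetical helper names, and the table and column names in the usage comment are only examples:

```python
def conversion_statements(table, column, column_type="TEXT"):
    """Return the three ALTER statements for one column, in order."""
    target = "ALTER TABLE `%s` MODIFY COLUMN `%s`" % (table, column)
    return [
        "%s %s CHARACTER SET latin1;" % (target, column_type),
        "%s BLOB;" % target,
        "%s %s CHARACTER SET utf8;" % (target, column_type),
    ]


def conversion_script(columns):
    """Build the whole script from (table, column, column_type) tuples."""
    lines = ["SET NAMES latin1;"]
    for table, column, column_type in columns:
        lines.extend(conversion_statements(table, column, column_type))
    lines.append("SET NAMES utf8;")
    return "\n".join(lines)


# Hypothetical usage:
# print(conversion_script([
#     ("phpbb_posts_text", "post_text", "TEXT"),
#     ("phpbb_topics", "topic_title", "VARCHAR(60)"),
# ]))
```

Note that MODIFY COLUMN replaces the whole column definition, so attributes such as NOT NULL or DEFAULT have to be included in `column_type` if they should survive. It’s worth trying the output on a copy of a table first.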

After applying the script, all the data in the database is encoded correctly. The problem is that the PHP applications then started displaying trashed text on-line. This was due to the default LATIN1 connection encoding on Dreamhost. I fixed it by adding the query below just after the connection is established. Alternatively, it could be added before every query.

SET NAMES utf8;

This line sets the connection encoding to UTF-8, so all the data is transmitted to the application in correct encoding.
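The applications in question are PHP, so the following is only the shape of the idea, shown in Python to match the repair function earlier in this post. It assumes a DB-API style connection object (something with a cursor() method); `force_utf8` is a made-up name:

```python
def force_utf8(connection):
    """Issue SET NAMES utf8 on a DB-API style connection, then return it."""
    cursor = connection.cursor()
    try:
        cursor.execute("SET NAMES utf8")
    finally:
        # Close the cursor even if the statement fails.
        cursor.close()
    return connection
```

Calling it once, right after connecting, covers every later query on that connection, which is cheaper than repeating the statement before each query.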

If I knew how to set the default encoding to UTF-8, this wouldn’t be necessary. I’ve posted a question about it on the Dreamhost forum. We’ll see if there is any answer.