phpBB static archive

I looked online for instructions on how to create a static phpBB archive of
a retired forum, and didn’t find much, apart from other people asking the same thing. I’ve investigated it myself.

UPDATE 2016-05-09: New things I found: How to archive phpBB (similar writeup), and phpbb3-static (a converter script).

UPDATE 2016-11-28: I’ve decided to do it again, better, using phpbb3-static.

General options

When choosing an approach, one criterion is the future maintenance cost. The likely reason you want a static archive is that it should require no maintenance, or as little as possible.

Option 1: Lock the forum and continue to run phpBB

  • Pros:
    • There’s little to do, so it’s quick.
  • Cons:
    • High maintenance. It’s not static. You’re still running PHP, so you have to keep on upgrading your PHP installation and your phpBB installation, or your forum archive will get hacked.

Option 2: Download the whole forum using wget or httrack

  • Pros:
    • The result looks the same as the original.
  • Cons:
    • The result looks the same as the original. (e.g. hard to browse on phones)
    • Out of the box, it does not work! It requires tweaks as discussed below.
    • Lots of content duplication. If there are different URLs with the same content, they will exist as separate files on disk.

Option 3: Write your own exporter

Query the database with SQL and write the output the way you want it.

  • Pros:
    • Low maintenance of the resulting site.
    • High level of control of how the output is structured.
  • Cons:
    • Writing the exporter is time consuming.
    • The output will most likely look different from the original forum, so people who are used to the forum will likely be confused by the navigation.
    • You need to put in additional work to preserve the old URLs.

Also… you could even generate a set of Markdown files to be fed as input to a static website generator such as hugo. This would give you a lot of things for free, including nice URLs and a sitemap.
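As a rough illustration of the Markdown idea, here is a minimal sketch of such an exporter. It assumes the default phpbb_ table prefix and the phpbb_topics / phpbb_posts columns named in the queries, and it uses an in-memory SQLite database as a stand-in for the real MySQL one. The function names and the output paths are made up for the example. Note that post_text is written out verbatim, so the post content problem described in the next section still applies.

```python
import json
import sqlite3
from datetime import datetime, timezone


def export_topics(conn):
    """Return {relative_path: markdown_text}, one document per topic."""
    documents = {}
    topics = conn.execute(
        "SELECT topic_id, forum_id, topic_title FROM phpbb_topics ORDER BY topic_id"
    ).fetchall()
    for topic_id, forum_id, title in topics:
        posts = conn.execute(
            "SELECT post_time, post_text FROM phpbb_posts"
            " WHERE topic_id = ? ORDER BY post_time, post_id",
            (topic_id,),
        ).fetchall()
        if not posts:
            continue
        # post_time is a Unix timestamp; the first post dates the topic.
        started = datetime.fromtimestamp(posts[0][0], tz=timezone.utc)
        lines = [
            "---",
            f"title: {json.dumps(title)}",  # JSON string doubles as a quoted YAML string
            f"date: {started.isoformat()}",
            "---",
            "",
        ]
        for post_time, post_text in posts:
            stamp = datetime.fromtimestamp(post_time, tz=timezone.utc)
            lines.append(f"**{stamp:%Y-%m-%d %H:%M} UTC**")
            lines.append("")
            lines.append(post_text)  # still in phpBB's stored format, not converted
            lines.append("")
        # Hypothetical layout: one directory per forum, one file per topic.
        documents[f"forum-{forum_id}/topic-{topic_id}.md"] = "\n".join(lines)
    return documents


def demo_database():
    """Tiny stand-in database with only the columns the sketch needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE phpbb_topics (topic_id INTEGER, forum_id INTEGER, topic_title TEXT);
        CREATE TABLE phpbb_posts (post_id INTEGER, topic_id INTEGER,
                                  post_time INTEGER, post_text TEXT);
        INSERT INTO phpbb_topics VALUES (4, 1, 'I like fluffy kittens');
        INSERT INTO phpbb_posts VALUES (10, 4, 1461513480, 'First post');
        INSERT INTO phpbb_posts VALUES (11, 4, 1461513540, 'A reply');
        """
    )
    return conn
```

Calling export_topics(demo_database()) returns a single document keyed forum-1/topic-4.md, which could then be written to the content directory of the static site generator.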

Option 4: Use an existing exporter

  • Pros:
    • Low maintenance result.
    • Takes less time than Option 3, with comparable results.
  • Cons:
    • You can’t expect the exporter to just work for you, especially if you’ve heavily customized your forum. You will have to dig into the exporter script and fix issues in somebody else’s code.
      Why:
      Archiving a forum is a one-off job. Once the result is satisfying, the user will lose interest in the exporter and will most likely not improve it any further. When you pick up an exporter, you’ll pick it up where the previous user left off.

Post content / bbcode

In my experience, processing the post content properly is the hardest problem. This is due to the format that phpBB uses to store posts in the database.

You would think that there is just one syntax – the one that forum users enter, which is stored in the database, and rendered into HTML when served on the web. In the case of phpBB it is not so: there are 3 formats! One for the user to edit, one to display (HTML) and something intermediate, that is stored in the database.

The existing exporter I found, phpbb3-static, used an existing bbcode parser to transform the database contents into HTML. The problem is that the database content isn’t bbcode, or at least it isn’t pure bbcode.

It’s a mix of HTML containing raw <a href="…">…</a> links and bbcode links (“[url=www.example.com]bbcode links[/url]”), and the existing bbcode parser also tries to linkify bare URLs that it spots in the content. If there’s something like this in the content…

[url=$valid_url]$truncated_url[/url]

…the end result is (indentation added for readability)…

<a href="$valid_url">
  <a href="$truncated_url">
    $truncated_url
  </a>
</a>

…and that doesn’t work, because $truncated_url is… truncated. This is what phpBB does with long links by default: it shortens them, turning “longlonglonglink” into “lo…nk”. The shortened text still starts with “http://”, so the bare link matcher catches it and wraps it in an <a href="…"></a> tag.
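The nesting can be reproduced with a toy renderer. The sketch below is not the parser that phpbb3-static uses. Both functions and the sample URLs are made up for illustration, and HTML escaping is left out to keep it short. naive_render linkifies bare URLs and converts bbcode in two separate passes, while render_links handles both in a single pass so that the label of a [url] tag is never linkified again.

```python
import re

# A bare URL that is not already the value of "url=" or of an HTML attribute.
BARE_URL = re.compile(r'(?<![="])https?://[^\s\[\]<"]+')
BBCODE_URL = re.compile(r"\[url=([^\]]+)\](.*?)\[/url\]", re.S)
# One pattern for both cases: a whole [url] tag is consumed before its label
# can be seen as a bare URL.
TOKEN = re.compile(
    r"\[url=(?P<href>[^\]]+)\](?P<label>.*?)\[/url\]"
    r'|(?P<bare>https?://[^\s\[\]<"]+)',
    re.S,
)


def naive_render(text):
    """Two independent passes: this produces the nested anchors."""
    linked = BARE_URL.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)
    return BBCODE_URL.sub(r'<a href="\1">\2</a>', linked)


def render_links(text):
    """Single pass: a bbcode label is emitted as-is, never linkified."""
    def replace(match):
        bare = match.group("bare")
        if bare:
            return f'<a href="{bare}">{bare}</a>'
        return f'<a href="{match.group("href")}">{match.group("label")}</a>'

    return TOKEN.sub(replace, text)
```

This only covers the one failure described above. The stored format has more cases than a toy renderer like this handles.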

I examined the database representation and realized that it’s complex, and that improving the parser on my own would be futile: in the best case I would merely be reimplementing what phpBB itself already implements. Perhaps I could just call the generate_text_for_display() function from phpBB to render the HTML? Theoretically yes. Unfortunately, this function isn’t just a parser. It uses a number of global variables, such as $user and $cache. The $cache is used to access the forum configuration, and makes SQL queries. As a result, what should be just a text parser requires the full phpBB environment.

I could wire the exporter to phpBB, but I thought that it would make it dependent on a certain phpBB version. What I could do instead is make an HTTP request to the live version of the forum, find the right snippet of HTML, and save it.

I’ve tried it. This method was an order of magnitude slower than in-process parsing. But on the positive side, it gave me the right results!
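The scraping step could look roughly like the sketch below. It assumes that the forum theme wraps each rendered post body in <div class="content">, as the stock prosilver style does. A customized theme may use different markup, so check your own page source. The class and function names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PostContentExtractor(HTMLParser):
    """Collect the inner HTML of every <div class="content"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.posts = []
        self._chunks = None  # pieces of the post body currently being read
        self._depth = 0      # <div> nesting depth inside that body

    def handle_starttag(self, tag, attrs):
        if self._chunks is not None:
            if tag == "div":
                self._depth += 1
            self._chunks.append(self.get_starttag_text())
        elif tag == "div" and "content" in (dict(attrs).get("class") or "").split():
            self._chunks = []
            self._depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied through unchanged.
        if self._chunks is not None:
            self._chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self._chunks is None:
            return
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._chunks).strip())
                self._chunks = None
                return
        self._chunks.append(f"</{tag}>")

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_entityref(self, name):
        if self._chunks is not None:
            self._chunks.append(f"&{name};")

    def handle_charref(self, name):
        if self._chunks is not None:
            self._chunks.append(f"&#{name};")


def extract_posts(page_html):
    """Return the rendered HTML of each post found on a topic page."""
    parser = PostContentExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.posts


def fetch_posts(topic_url):
    """Download one topic page from the live forum and extract its posts."""
    with urlopen(topic_url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_posts(response.read().decode(charset, errors="replace"))
```

One HTTP request per topic page is what makes this approach slow, but the HTML comes straight from phpBB’s own renderer.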

phpbb3-static

 

[Obsolete] The previous attempt, using wget

Left here for the record. Superseded by the above approach, using phpbb3-static.

I’m intentionally not writing the whole thing up in the form of a script, even though it was tempting. I expect different phpBB installations to vary, and the chance that my script would work with somebody else’s forum is slim. So instead I’ll write up what I did step by step, and people can follow this howto and make alterations as they see fit.

Note: I’m using Apache and I’m quoting Apache specific configuration lines.

Mirroring the forum

I downloaded the database and the forum snapshot to a local computer to start a local instance. It’s a hassle but it makes things quicker. Once it was ready, I created a mirror on disk:

wget --mirror -k -p <Forum URL>

After downloading, it turned out that I had 127 thousand files on disk, taking up 5GB of space as shown by du -sh <directory>. I’ve seen larger in my career, but I expected a smaller size from a generally text-based static forum archive.

I put the result of wget’s work on a test server to see how it works.

Question marks

During testing it turned out that the “?” in the URL is treated as a special character. For example, when the browser requests this:

GET /style.php?id=1 HTTP/1.1

…the WWW server looks for a file on disk named style.php, fails to find it, and returns an HTTP 404 error.

HTTP 404: style.php not found

But in our case we want the server to serve the file named “style.php?id=1”!

$ ls -l style.php*
-rw-rw-r-- 1 maciej maciej 71445 Apr 24 15:58 style.php?id=1&lang=pl
-rw-rw-r-- 1 maciej maciej 71445 Apr 24 16:24 style.php?id=1&lang=pl&sid=2231c9b38ea28f9aa9e9bdd2a8452846

By the way, did you notice the file with sid? Ugh. Anyway…

With help from StackOverflow I’ve found these magic lines that I added to .htaccess:

RewriteCond %{ENV:REDIRECT_STATUS} !200 
RewriteCond %{QUERY_STRING} !^$ 
RewriteRule ^(.*)$ %{REQUEST_URI}\%3F%{QUERY_STRING} [noescape,last,qsdiscard]

I don’t fully understand what it does, but it seems to work. As far as I can tell, when the query string is not empty (“?foo=bar” in the URL), the request is rewritten by putting it together again from REQUEST_URI and QUERY_STRING, joined with “%3F”, which is a URL-encoded question mark. When this is done, Apache understands that we mean a “?” in a file name on disk, and not a URL/query string combination. We also have to add “qsdiscard” to prevent Apache from appending the query string onto the URL again.

In a way, Apache is trying to do the right thing: keeping the file part and the query string part of the URL meaningful and separate. But in this case we want the opposite: to treat the “?” literally as part of the file name.

By the way, the solution I found on StackOverflow was slightly different and didn’t work for me verbatim.
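The lookup that the rewrite is meant to produce can be modelled outside Apache, which is handy for checking a list of known URLs against the mirror before deploying it. This is only a model of the intended result, not of how mod_rewrite works internally, and both function names are made up for the example.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def file_for_request(root, request_target):
    """Return the path of the mirrored file that should answer a request."""
    parts = urlsplit(request_target)
    relative = unquote(parts.path).lstrip("/")
    if parts.query:
        # Same effect as the rewrite: glue the query string back on with a
        # literal "?", so that it becomes part of the file name.
        relative = f"{relative}?{parts.query}"
    return Path(root) / relative


def missing_targets(root, request_targets):
    """List the requests that have no exactly matching file in the mirror."""
    return [t for t in request_targets if not file_for_request(root, t).is_file()]
```

For example, missing_targets("mirror", ["/style.php?id=1&lang=pl"]) returns an empty list only if a file with exactly that name exists under the (hypothetical) mirror directory.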

Done-ish? Probably not

OK, so this is the rudimentary version of the archive. It has a number of disadvantages, but it meets the main criteria: we have static files and the content is there, you can browse it.

What are the problems?

  1. The login form and the search box are still there, which is confusing: people will try to log in and wonder why it’s broken.
    Addressed below.
  2. A number of URLs won’t work. There are a number of reasons for this; one of them is parameter ordering. The web server isn’t interpreting the query strings any more, so these two are different now:
    viewtopic.php?f=1&t=2
    viewtopic.php?t=2&f=1
    

    In the PHP world the parameters were parsed into the same set of values regardless of their order, but now Apache just looks for a file on disk named exactly as specified in the URL. So some URLs that used to work, especially links to your forum from the outside, will no longer work.

    Not addressed as of 2016-05-05.

  3. URLs are ugly. I know that search engines can deal with this sort of stuff, and they can do things like filtering out the “sid” parameter from the URL. But still, I keep on thinking that the forum URLs should be more like:
    /forum/1/4/i-like-fluffy-kittens/
    

    Not addressed as of 2016-05-05.

  4. No sitemap. Not addressed as of 2016-05-05.
  5. Not mobile friendly. This isn’t a problem with the archiving process per se, but it is a feature I would expect in a good archive. Not addressed as of 2016-05-05.
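For the parameter ordering problem in item 2, one possible direction is to pick a canonical form for every query string: parameters sorted by name, with the session id dropped. The sketch below shows only that normalization step. The function name is made up, and treating sid as the only parameter to drop is an assumption. It also does not solve the server side: incoming requests would still need to be mapped to the canonical name somehow.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def canonical_target(request_target, drop=("sid",)):
    """Sort query parameters by name and drop the ones listed in `drop`."""
    parts = urlsplit(request_target)
    params = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in drop
    ]
    params.sort()
    query = urlencode(params)
    return f"{parts.path}?{query}" if query else parts.path
```

With this, viewtopic.php?t=2&f=1 and viewtopic.php?f=1&t=2 both map to viewtopic.php?f=1&t=2, so the mirrored files could be renamed to one name per page. That would also remove the duplicates that differ only in sid.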

Login form and the search box

The next thing I noticed is that there still is a login form in the HTML. It is confusing for people because there’s nothing indicating that there’s nothing to log into. I wanted to remove the form, but it was duplicated across 127 thousand files!

First I tested it on one file:

sed -i -e '/<div id="search-box">$/,+9d' viewtopic.php?…

And then ran across all files:

find . -name '*.php*' -exec sed -i -e '/<div id="search-box">$/,+9d' {} \;

This took a fair bit of time, but was successful. I actually don’t know how much because I went out for a small hike.
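The find command above starts a separate sed process for every file. As an alternative, the same edit can be done from a single process. The sketch below mirrors the sed expression: it deletes each line ending with the marker plus the 9 lines that follow it. The function names are made up, and like the sed version it relies on the form always spanning exactly 10 lines in the page source.

```python
from pathlib import Path


def drop_block(lines, marker='<div id="search-box">', extra=9):
    """Remove each line ending with `marker` and the `extra` lines after it."""
    kept = []
    skip = 0
    for line in lines:
        if skip:
            skip -= 1
            continue
        if line.rstrip("\n").endswith(marker):
            skip = extra
            continue
        kept.append(line)
    return kept


def strip_search_box(root):
    """Rewrite every *.php* file under root in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.php*"):
        if not path.is_file():
            continue
        # surrogateescape keeps bytes that are not valid UTF-8 intact.
        with open(path, encoding="utf-8", errors="surrogateescape", newline="") as f:
            original = f.readlines()
        cleaned = drop_block(original)
        if cleaned != original:
            with open(path, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
                f.writelines(cleaned)
            changed += 1
    return changed
```

As with sed, it is worth trying this on a copy of one file first and comparing the result by eye.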

Let’s make it smaller

One reason why the forum occupies a large amount of disk space is that even a small file occupies a full block on disk, so there’s a sort of file count tax that you pay when storing many files. But there’s something you can do about it. The forum archive is static, so I can use a read-only file system, and there are read-only file systems which pack files efficiently. After a quick look around, SquashFS turned up as the best pick, with efficient file packing, compression, and support in the Linux kernel.

The whole packed forum shrank from 5G to 517MB. I mounted it using the loopback device on the web server (added it to /etc/fstab), and voilà! Almost 10× reduction in size. My web server only has 20G of disk space, so saving 4.5G is significant.
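To get a feel for how much of the on-disk size is the per-file block tax rather than content, the two can be measured separately. This is a rough sketch that assumes a block size of 4096 bytes, which is a common default but may not match your file system. The function name is made up for the example.

```python
import os


def disk_usage(root, block_size=4096):
    """Return (file_count, apparent_bytes, allocated_bytes) for a directory tree.

    allocated_bytes rounds every file up to a whole number of blocks; it is an
    estimate, not what the file system actually reports.
    """
    count = apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.lstat(os.path.join(dirpath, name)).st_size
            count += 1
            apparent += size
            allocated += -(-size // block_size) * block_size  # ceiling division
    return count, apparent, allocated
```

As a back-of-the-envelope check under the same assumption: rounding wastes less than one block per file, so 127 thousand files can lose at most about 127,000 × 4096 bytes ≈ 0.5 GB this way. That suggests compression probably accounts for a good share of the reduction too.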

Unresolved problems

At the time of writing there are a number of problems I haven’t addressed in my forum archive. If I manage to, I’ll update this page with new information.

Author: automatthias
