phpBB static archive

I looked online for instructions on how to create a static phpBB archive of
a retired forum, and didn’t find much, apart from other people asking the same thing. I’ve investigated it myself.

UPDATE 2016-05-09: New things I found: How to archive phpBB (similar writeup), and phpbb3-static (a converter script).

UPDATE 2016-11-28: I’ve decided to do it again, better, using phpbb3-static.

General options

When choosing your approach, one criterion is the future maintenance cost. The likely reason you want a static archive is that it should require no maintenance, or as little as possible.

Option 1: Lock the forum and continue to run phpBB

  • Pros:
    • There’s little to do, so it’s quick.
  • Cons:
    • High maintenance. It’s not static. You’re still running PHP, so you have to keep on upgrading your PHP installation and your phpBB installation, or your forum archive will get hacked.

Option 2: Download the whole forum using wget or httrack

  • Pros:
    • The result looks the same as the original.
  • Cons:
    • The result looks the same as the original. (e.g. hard to browse on phones)
    • Out of the box, it does not work! It requires tweaks as discussed below.
    • Lots of content duplication. If there are different URLs with the same content, they will exist as separate files on disk.

Option 3: Write your own exporter

Query the database with SQL and write the output the way you want it.

  • Pros:
    • Low maintenance of the resulting site.
    • High level of control of how the output is structured.
  • Cons:
    • Writing the exporter is time consuming.
    • The output will most likely look different from the original forum, so people who are used to the forum will likely be confused by the navigation.
    • You need to put in additional work to preserve the old URLs.

Also… you could even generate a set of Markdown files to be fed as input to a static website generator such as Hugo. This would give you a lot of things for free, including nice URLs and a sitemap.
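A minimal sketch of what such an exporter could look like, in Python. It uses an in-memory SQLite database as a stand-in for the forum's real database, and the table and column names (phpbb_topics, phpbb_posts, topic_title, post_text, post_time) are assumptions based on a default phpBB 3 schema, so check them against your own installation. The post text is written out as stored, with no bbcode processing.

```python
import sqlite3


def export_topics(conn):
    """Return {filename: markdown} with one Markdown document per topic."""
    pages = {}
    topics = conn.execute(
        "SELECT topic_id, topic_title FROM phpbb_topics ORDER BY topic_id"
    ).fetchall()
    for topic_id, title in topics:
        # Minimal front matter; assumes the title contains no double quotes.
        lines = ["---", f'title: "{title}"', "---", ""]
        posts = conn.execute(
            "SELECT post_text FROM phpbb_posts "
            "WHERE topic_id = ? ORDER BY post_time",
            (topic_id,),
        )
        for (text,) in posts:
            lines += [text, ""]
        pages[f"topic-{topic_id}.md"] = "\n".join(lines)
    return pages


# Tiny in-memory stand-in for the forum database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE phpbb_topics (topic_id INTEGER, topic_title TEXT);
    CREATE TABLE phpbb_posts (topic_id INTEGER, post_time INTEGER, post_text TEXT);
    INSERT INTO phpbb_topics VALUES (1, 'Hello');
    INSERT INTO phpbb_posts VALUES (1, 200, 'Second post'), (1, 100, 'First post');
""")
pages = export_topics(conn)
print(pages["topic-1.md"])
```

The file naming scheme (topic-N.md) is arbitrary; a real export would also need authors, dates, and a mapping from the old URLs.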

Option 4: Use an existing exporter

  • Pros:
    • Low maintenance result.
    • Takes less time than Option 3, with comparable results.
  • Cons:
    • You can’t expect the exporter to just work for you, especially if you’ve heavily customized your forum. You will have to dig into the exporter script and fix issues in somebody else’s code.
      Archiving a forum is a one-off job. Once the result is satisfactory, the user loses interest in the exporter and will most likely not improve it any further. When you pick up an exporter, you pick it up where the previous user left off.

Post content / bbcode

In my experience, proper processing of the post content is the hardest problem. This is due to the format that phpBB uses to store posts in the database.

You would think that there is just one syntax – the one that forum users enter, which is stored in the database, and rendered into HTML when served on the web. In the case of phpBB it is not so: there are 3 formats! One for the user to edit, one to display (HTML) and something intermediate, that is stored in the database.

The existing exporter I found, phpbb3-static, used an existing bbcode parser to transform the database contents into HTML. The problem is that the database content isn’t bbcode, or at least it isn’t pure bbcode.

It’s a mix of HTML containing raw <a href=”…”>…</a> links, with bbcode links (“[]bbcode links[/url]”), and the existing bbcode parser tries to linkify bare URLs that it spots in the content. If there’s something like this in the content…


…the end result is (indentation added for readability)…

<a href="$valid_url">
  <a href="$truncated_url">

…and that doesn’t work, because $truncated_url is… truncated. This is what phpBB does with long links by default: it shortens them, turning “longlonglonglink” into “lo…nk”. The first part still starts with “http://”, so the bare link matcher catches it and adds an <a href=”…”></a> tag around it.
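The failure can be reproduced with a few lines of Python. This is only a sketch: naive_linkify is a hypothetical bare-URL matcher written for illustration, not the parser that phpbb3-static uses, and the URLs are made up.

```python
import re

# Hypothetical naive linkifier: wraps any bare URL not preceded by a quote.
BARE_URL = re.compile(r'(?<!")https?://[^\s<"]+')


def naive_linkify(html):
    """Wrap every bare URL in an <a> tag, even inside an existing link."""
    return BARE_URL.sub(
        lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', html
    )


# Stored content: a valid href, with the shortened URL as the link text.
stored = ('<a href="http://example.com/a/very/long/path">'
          'http://example.com/a/ver…ong/path</a>')
result = naive_linkify(stored)
print(result)
```

The href attribute is left alone, but the shortened link text is wrapped in a second, nested link that points at the truncated URL.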

I examined the database representation and realized that it’s complex and that improving the parser on my own would be futile: in the best case I would merely be reimplementing what has already been implemented in phpBB itself. Perhaps I could just call the generate_text_for_display() function from phpBB to render the HTML? Theoretically yes. Unfortunately, this function isn’t just a parser. It uses a number of global variables, such as $user and $cache. The $cache is used to access the forum configuration, and makes SQL queries. As a result, what should be just a text parser requires the full phpBB environment.

I could wire the exporter to phpBB, but I thought that it would make it dependent on a certain phpBB version. What I could do instead is make an HTTP request to the live version of the forum, find the right snippet of HTML and save it.

I’ve tried it. This method was an order of magnitude slower than in-process parsing. But on the positive side, it gave me the right results!
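A sketch of the “find the right snippet” step, using only the Python standard library. It assumes that each post body is rendered inside a <div class="content"> element, which is how the default prosilver style does it as far as I know; a customized theme may use different markup. The fetch_page helper is hypothetical and is not called here.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class PostExtractor(HTMLParser):
    """Collect the inner HTML of every <div class="content"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.posts = []   # finished post bodies
        self._depth = 0   # <div> nesting depth inside the current post
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "div":
                self._depth += 1
            self._buf.append(self.get_starttag_text())
        elif tag == "div" and "content" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br /> are copied verbatim.
        if self._depth:
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._buf))
                self._buf = []
                return
        self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

    def handle_entityref(self, name):
        if self._depth:
            self._buf.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth:
            self._buf.append(f"&#{name};")


def extract_posts(page_html):
    parser = PostExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.posts


def fetch_page(url):
    """Hypothetical helper: download one topic page from the live forum."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


sample = ('<div class="postbody"><div class="content">Hi '
          '<a href="http://example.com/">link</a><br />bye</div></div>')
posts = extract_posts(sample)
print(posts)
```

One HTTP request per topic page is what makes this approach slow, but the HTML comes straight from phpBB’s own renderer.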



[Obsolete] The previous attempt, using wget

Left here for the record. Superseded by the above approach, using phpbb3-static.

I’m intentionally not trying to write the whole thing in the form of a script, even though it was tempting. I expect different phpBB installations to vary, and the chance that my script would work with somebody else’s forum is slim. So instead I’ll write up what I did step by step, and people can follow this howto and make alterations as they see fit.

Note: I’m using Apache and I’m quoting Apache specific configuration lines.

Mirroring the forum

I downloaded the database and the forum snapshot to a local computer to start a local instance. It’s a hassle but it makes things quicker. Once it was ready, I created a mirror on disk:

wget --mirror -k -p <Forum URL>

After downloading it turned out that I had 127 thousand files on disk, which takes up 5GB of space as shown by du -sh <directory>. I mean I’ve seen larger in my career, but I expected a smaller size from a generally text-based static forum archive.

I put the result of wget’s work on a test server to see how it works.

Question marks

During testing it turned out that the “?” in the URL is treated as a special character. For example, when the browser requests this:

GET /style.php?id=1 HTTP/1.1

…the web server looks for a file on disk named style.php, fails to find it, and returns an HTTP 404 error.

HTTP 404: style.php not found

But in our case we want the server to serve the file named “style.php?id=1”!

$ ls -l style.php*
-rw-rw-r-- 1 maciej maciej 71445 Apr 24 15:58 style.php?id=1&lang=pl
-rw-rw-r-- 1 maciej maciej 71445 Apr 24 16:24 style.php?id=1&lang=pl&sid=2231c9b38ea28f9aa9e9bdd2a8452846

By the way, did you notice the file with sid? Ugh. Anyway…

With help from StackOverflow I’ve found these magic lines that I added to .htaccess:

RewriteCond %{ENV:REDIRECT_STATUS} !200 
RewriteCond %{QUERY_STRING} !^$ 
RewriteRule ^(.*)$ %{REQUEST_URI}\%3F%{QUERY_STRING} [noescape,last,qsdiscard]

I don’t fully understand what these lines do, but they seem to work. As far as I can tell, when the query string is not empty (“?foo=bar” in the URL), the request is rewritten by putting it together again from REQUEST_URI and QUERY_STRING, joined with “%3F”, which is a URL-encoded question mark. When this is done, Apache understands that we mean a “?” in a file name on disk, and not a URL/query string combination. We also have to add “qsdiscard” to prevent Apache from appending the query string to the URL again. In a way, Apache is trying to do the right thing: keeping the file part and the query string part of the URL meaningful and separate. But in this case we want the opposite: to treat the “?” literally, as a part of the file name.

By the way, the solution I found on StackOverflow was slightly different and didn’t work for me verbatim.
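To make the effect of the rewrite easier to follow, here is a small Python model of it. This is a simplification for illustration only, not how Apache works internally; the function names are made up.

```python
from urllib.parse import unquote


def rewritten_target(request_uri, query_string):
    """Mimic the rewrite: join path and query with an encoded '?'."""
    if not query_string:
        # The RewriteCond on QUERY_STRING skips requests without a query.
        return request_uri
    return f"{request_uri}%3F{query_string}"


def file_on_disk(rewritten):
    """The name the server looks up once the target is URL-decoded."""
    return unquote(rewritten)


target = rewritten_target("/style.php", "id=1")
print(target)                # /style.php%3Fid=1
print(file_on_disk(target))  # /style.php?id=1
```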

Done-ish? Probably not

OK, so this is the rudimentary version of the archive. It has a number of disadvantages, but it meets the main criteria: we have static files and the content is there, you can browse it.

What are the problems?

  1. The login form and the search box are still there, which is confusing: people will try to log in and wonder why it’s broken.
    Addressed below.
  2. A number of URLs won’t work. There are several reasons for this; one of them is parameter ordering. The web server isn’t interpreting the query strings any more, so these two are different now:

    In the PHP world they were interpreted and became part of the URL parameter namespace regardless of the order, but now Apache is just looking for files on disk, and it just looks for files named exactly as specified in the URL. So some URLs that used to work, especially if somebody linked to your forum  from the outside, will not work.

    Not addressed as of 2016-05-05.

  3. URLs are ugly. I know that search engines can deal with this sort of stuff, and they can do things like filtering out the “sid” parameter from the URL. But still, I keep on thinking that the forum URLs should be more like:

    Not addressed as of 2016-05-05.

  4. No sitemap.
    Not addressed as of 2016-05-05.
  5. Not mobile friendly. This isn’t a problem with the archiving process per se, but it is a feature I would expect in a good archive.
    Not addressed as of 2016-05-05.
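One possible way to approach the parameter ordering and sid problems would be to pick a single canonical name for each page, for example by sorting the parameters and dropping the session ID. The following Python sketch shows the idea; it is not something I have applied to the archive, and the example URLs are made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def canonical_name(url, drop=("sid",)):
    """Sort the query parameters and drop session-only ones."""
    parts = urlsplit(url)
    params = sorted(
        (key, value) for key, value in parse_qsl(parts.query) if key not in drop
    )
    query = urlencode(params)
    return parts.path + ("?" + query if query else "")


first = canonical_name("/viewtopic.php?t=2&f=1")
second = canonical_name("/viewtopic.php?f=1&t=2&sid=abc")
print(first)
print(first == second)
```

Both spellings map to the same name, so files could be renamed to the canonical form and incoming requests redirected to it.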

Login form and the search box

The next thing I noticed is that there still is a login form in the HTML. It is confusing for people because there’s nothing indicating that there’s nothing to log into. I wanted to remove the form, but it was duplicated across 127 thousand files!

First I tested it on one file:

sed -i -e '/<div id="search-box">$/,+9d' viewtopic.php?…

And then ran across all files:

find . -name '*.php*' -exec sed -i -e '/<div id="search-box">$/,+9d' {} \;

This took a fair bit of time, but was successful. I actually don’t know how much because I went out for a small hike.

Let’s make it smaller

The reason why the forum occupies a large amount of disk space is that a small file still occupies a full block on disk, so there’s a sort of file count tax that you have to pay when storing files on disk. But there’s something that you can do. I realized that the forum archive is static, so I can use a read-only file system, and there are read-only file systems which pack files efficiently. After a quick look around, SquashFS turned up as the best pick, with efficient file packing, compression, and support in the Linux kernel. The whole packed forum shrank from 5G to 517MB. I mounted it using the loopback device on the web server (added it to /etc/fstab), and voilà! Almost 10× reduction in size. My web server only has 20G of disk space, so saving 4.5G is significant.
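The “file count tax” is easy to estimate. The sketch below assumes a 4096-byte block size, which is a common default but depends on the file system, and uses made-up file sizes rather than measurements from the archive.

```python
def allocated_size(file_sizes, block_size=4096):
    """Bytes occupied on disk when every file uses whole blocks."""
    total = 0
    for size in file_sizes:
        blocks = -(-size // block_size)  # ceiling division
        total += blocks * block_size
    return total


# Hypothetical: 127 thousand files of 100 bytes each.
sizes = [100] * 127_000
print(sum(sizes))             # 12700000 bytes of actual content
print(allocated_size(sizes))  # 520192000 bytes occupied on disk
```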

Unresolved problems

At the time of writing there are a number of problems I haven’t addressed in my forum archive. If I manage to, I’ll update this page with new information.

HTTP PUT with multipart/form-data using pycurl

Let’s suppose you have a REST interface to talk to, and there’s a PUT request you want to make, sending data over using the multipart/form-data encoding (as opposed to application/x-www-form-urlencoded). If you’re using Python and pycurl, you’ll find out that if you try to combine setopt(pycurl.PUT, 1) with setopt(pycurl.HTTPPOST, [ (key1, val1), … ]), it doesn’t work. You could try to use setopt(pycurl.POSTFIELDS, "…"), but you’d have to handle encoding to multipart/form-data by hand, or use a third party library such as poster. In any case, it looks like more hassle than it should be. The pycurl.HTTPPOST option can already do what’s needed; it’s just that it implies the POST method, while you want to use PUT.

A solution came to me when reading a thread on the curl-with-python mailing list. I knew I could already do what I needed using the command line utility, like this:

curl -X PUT -F 'fieldname=@filename.json' http://localhost:8000/

If you add an option like --libcurl foo.c to such a call, you’ll get a C program which does what your command line invocation would do. This revealed that “-X PUT” did not translate into setopt(pycurl.PUT, 1), but into setopt(pycurl.CUSTOMREQUEST, "PUT"). It might look like a subtle difference, but the latter does what I wanted, while the former doesn’t. A minimal working example would look like this:

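import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, "http://localhost:8000")
# HTTPPOST builds the multipart/form-data body (and implies POST).
c.setopt(pycurl.HTTPPOST, [('foo', 'bar')])
# CUSTOMREQUEST only changes the method name sent on the request line.
c.setopt(pycurl.CUSTOMREQUEST, "PUT")
# Nothing is sent until perform() is called.
c.perform()
c.close()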

If you run “nc -l 8000” and run the above code, you’ll see:

PUT / HTTP/1.1
User-Agent: PycURL/7.26.0
Host: localhost:8000
Accept: */*
Content-Length: 141
Expect: 100-continue
Content-Type: multipart/form-data; boundary=----------------------------2def70e0b37a

Content-Disposition: form-data; name="foo"


…which is exactly what I wanted.

Merging from trunk to a branch

You created a branch in subversion, and while you were working on it, trunk progressed. You now want to include the trunk updates in your branch. What should you do? Maybe merge from trunk into your branch?

svn merge ${url}/trunk branches/mybranch

Nope! This isn’t it. Think about the simple case: branch out, edit the branch, merge back. What does ‘merge’ mean in this case? If I understand correctly, it means replaying on trunk all the changes you made to your branch.

What happens when you run the above command then? You replay all the changes you made to trunk, on top of your branch. Once that is done, what happens when you want to merge your branch back to trunk? One of the changes to be replayed is the merge you did, but it contains changes that have already been made on trunk, and the merge does not work.

How to do it properly then? What you probably meant to do, is to have your branch as if you started your branch-work on the newer trunk. Let’s first consider the simple case, where you branch out and then merge back.

svn cp ${url}/trunk ${url}/branches/mybranch
svn update
...editing your branch...
svn commit -m "edits to my branch"
svn merge ${url}/branches/mybranch trunk
svn commit -m "merging mybranch back to trunk"

That works. And it cannot really be more complex than that. Maybe if you’re a subversion whiz, but I’m not, so I like to stick to simple scenarios I can understand.

Let’s try to accommodate an updated trunk into the above workflow. It starts as usual:

svn cp ${url}/trunk ${url}/branches/mybranch
svn update
...editing your branch...
svn commit -m "edits to my branch"

So far so good. Let’s say there are some updates to trunk we want to see in our branch. You would think: “Why didn’t I start working on my branch later? I would already have all the updates in my branch!” It turns out you can do that! You can create a new branch from the new trunk, and then replay all the changes from your branch on top of it. The result? You still have your changes in a separate branch, and you have the updates to trunk too.

svn status
# Make sure this returns nothing ‒ your working copy is clean.
svn cp ${url}/trunk ${url}/branches/mybranch2
svn update
svn merge ${url}/branches/mybranch branches/mybranch2
# There is potential for code conflicts here, you need to resolve them.
svn commit -m "Replaying changes made to mybranch onto mybranch2."
svn rm ${url}/branches/mybranch
# Let's go to the original branch name.
svn mv ${url}/branches/mybranch2 ${url}/branches/mybranch
svn update

Your branch is now updated and looks as if you’ve started to work on it using the new trunk. You can use the regular merging procedure.

svn merge ${url}/branches/mybranch trunk
svn commit -m "merging mybranch back to trunk"

Your changes are now merged back to trunk.

Canon XM2 (DV) to DVD, on Linux

I wanted to transfer some material from DV cassettes to DVD. My main workstation is running Ubuntu 12.04, and I decided to use the tools that are available with the distribution. I tried multiple ways of doing each of the tasks, and hit many dead ends, mainly due to crashing programs, bugs, or incompatible tools. For instance, tovid looked very promising until it turned out that it is not compatible with the new version of the ffmpeg utility. My source material was DV, recorded by Canon XM2, the video format was 768×576, interlaced (576i), with audio at 48kHz, PCM, stereo. Interlacing was giving me some headache, because the first attempts led to unsightly stripey output. The camera outputs double-scan interlace, which should be interpreted as 50 frames per second with reduced resolution. Interlacing might be tricky.

The first step is to capture the video from the camera. Connect the camera to the laptop, switch the camera to the playback mode, rewind the tape and:

dvgrab birthday-

The “birthday-” bit is a prefix that will be added to the saved .dv files. dvgrab will save multiple 1GB files, each file about 4 minutes long. Once the material is captured, you can merge the multiple files into one, by simply concatenating them:

cat birthday-001.dv birthday-002.dv birthday-003.dv > birthday.dv

Once you have one file with the complete material, fire off a player and note down (I used paper and pencil) the times of segments you want to extract. You won’t be able to do a lot of cutting that way, but if it’s a couple of segments, it shouldn’t be too labor intensive. Once you know which segments you want, you can extract them and encode them as .vob files. Suppose one fragment starts at 02:13 and is 135 seconds long:

avconv -i birthday.dv -target pal-dvd -flags +ilme+ildct -b:v 6000k -ss 02:13 -t 135 birthday-01.vob

The “+ilme+ildct” bit is responsible for correct handling of interlacing, because DV uses a different field order than DVD. Repeat the above command for each segment, and you’ll get a list of VOB files. These VOB files are DVD compliant, and they implement the interlacing correctly. They must not be re-encoded when transferred to DVD, otherwise the interlacing settings will most likely be lost. You can check whether your interlacing settings are correct by watching the VOB file using VLC with automatic deinterlace detection:

vlc --deinterlace -1 --deinterlace-mode bob --play-and-exit birthday-01.vob

You should see no stripes during movement in the video, and the displayed frame rate should be 50fps (although the video frame rate is set to 25fps).
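If there are more than a couple of segments, the avconv command lines shown earlier can be generated from the noted start and end times. A small Python sketch, assuming the times were written down as MM:SS and that output files are numbered in order; the helper names are made up, and the avconv options are simply the ones used above.

```python
def to_seconds(stamp):
    """Convert "MM:SS" or "HH:MM:SS" to a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def segment_commands(source, prefix, segments):
    """Build one avconv argument list per (start, end) pair."""
    commands = []
    for number, (start, end) in enumerate(segments, start=1):
        duration = to_seconds(end) - to_seconds(start)
        commands.append([
            "avconv", "-i", source,
            "-target", "pal-dvd",
            "-flags", "+ilme+ildct",
            "-b:v", "6000k",
            "-ss", start, "-t", str(duration),
            f"{prefix}-{number:02d}.vob",
        ])
    return commands


commands = segment_commands("birthday.dv", "birthday", [("02:13", "04:28")])
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list can be passed to subprocess.run() as is, which avoids shell quoting issues.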

The next step is to create a DVD menu. There are a number of DVD authoring programs. I had the most success with DVD Styler. I also tried tovid and Bombono.

In DVD Styler, I managed to create a DVD directory structure, but not an ISO image, and I was not able to burn a DVD directly from DVD Styler. Instead, I only generated the DVD structure on disk, and used k3b, using its DVD template. I created a new project, found the generated VIDEO_TS directory from DVD Styler, and added it to the project in k3b. This was enough to arrive at a working DVD.

DVD Styler would recognize that the files are already DVD compatible and did not attempt to re-encode them.

The above method is rather basic and crude, but gets the job done. There isn’t a video editor used at any stage; instead we just note down the times and then extract time regions using the -ss and -t options of avconv. I tried to use pitivi for video editing, but there were issues with rendered video, and since I didn’t really need any editing, I dropped pitivi from the workflow. The main problem to solve in pitivi would be to encode a DVD compliant VOB video file. You can select a DVD VOB as the output format, but there’s still a lot of things you can mess up, for instance accidentally encode audio in 44.1kHz instead of 48kHz, which results in a DVD disc with no audio.

I suspect that tovid will soon be adapted for use with the new ffmpeg tools (using /usr/bin/avconv instead of /usr/bin/ffmpeg), which would make it easier to script the process if I had more such (e.g. archival) DVDs to make.

Headless VirtualBox setup

VirtualBox, unless you look deeper, is a desktop application. However, it is possible to install it on a server without a monitor, and run virtual machines there. The setup procedure is somewhat quirky, and it has changed since the last time I did it. If you search for VBoxManage and VBoxHeadless, you’re likely to hit outdated instructions. If you’re reading this in 2020, these instructions are probably also out of date. Here’s the 2012 edition. Tested on Ubuntu Oneiric Ocelot.

Initial setup for installation of the OS from an ISO image. In my case, it was Solaris 10, which will be my small private porting and OpenCSW Solaris package building host.

UPDATE 2012-06-04: I’ve checked in the code into a subversion repository on The script is called

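VM_NAME="Solaris 10 x86"
DISK_SIZE=20000 # In MB
# The four values below are not part of the original listing; they are
# placeholders to adjust for your own host.
DISK_DIR="${HOME}/vbox"
MEMORY_IN_MB=2048
VDI="${DISK_DIR}/${VM_NAME}/${VM_NAME}.vdi"
CD="/path/to/install.iso"

function setup {
  # Create the VM. The listing was cut off after --basefolder; --register
  # is assumed here so that later commands can refer to the VM by name.
  VBoxManage createvm \
    --name "${VM_NAME}" \
    --basefolder "${DISK_DIR}" \
    --register
  # Memory, boot order, and a network card bridged to eth0.
  VBoxManage modifyvm "${VM_NAME}" \
    --memory "${MEMORY_IN_MB}" \
    --acpi on \
    --boot1 dvd \
    --nic1 bridged \
    --bridgeadapter1 eth0
  # Virtual hard disk.
  VBoxManage createhd \
    --filename "${VDI}" \
    --size "${DISK_SIZE}"
  # One controller for the DVD drive, one for the hard disk.
  VBoxManage storagectl "${VM_NAME}" \
    --name "IDE Controller" --add ide
  VBoxManage storagectl "${VM_NAME}" \
    --name "SATA Controller" --add sata
  VBoxManage storageattach "${VM_NAME}" \
    --storagectl "SATA Controller" \
    --type hdd --device 0 --port 0 \
    --medium "${VDI}"
  # Installation ISO in the virtual DVD drive.
  VBoxManage storageattach "${VM_NAME}" \
    --storagectl "IDE Controller" \
    --type dvddrive --device 0 --port 0 \
    --medium "${CD}"
}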

If something went wrong and you need to start over, you can nuke your VM with:

VBoxManage unregistervm "${VM_NAME}"
rm -rf "${DISK_DIR}/${VM_NAME}"

To start your virtual machine:

VBoxHeadless -s "${VM_NAME}" --vnc

You can now connect to your virtual machine via VNC. You will probably need to set up a VNC tunnel and use a VNC viewer. Ubuntu has a remote desktop application.

When you’ve installed your guest OS, you need to remove the installation media, otherwise your VM will start the installation process all over again. You can stop the VM (I just press CTRL+C, there must be a better way).

VBoxManage storageattach "${VM_NAME}" \
--storagectl "IDE Controller" \
--type dvddrive --device 0 --port 0 \
--medium none

You can start your VM again, this time it will boot into the installed OS.

Zyxel P-660HW-T1 vs IPv6 tunnel vs SIP

I’m a SIP/VoIP user.  It allows me to make international calls for the cost of a local connection.  I have set up a Linksys PAP2T gateway with a regular phone connected to it. Worked great for calling out.  Unfortunately, the SIP gateway kept on logging out of the SIP server.  It could last a day or an hour, but it would always eventually lose the logged-in state and never return to it without human intervention.  Resetting the gateway would not help, it was usually necessary to switch it off for 15 minutes or so.  A quicker method was to change the SIP port from 5060 to 5061 and back every time I needed to restore the service.  I tried fiddling with the PAP2T settings, but no setting changes seemed to alleviate the issue.

I’m also an IPv6 user.  I’ve got an OpenWRT installation on WRT54GL, running aiccu and providing an IPv6 tunnel from SixXS.  The tunnel had a similar ailment: it would go down every couple hours to days.  The workaround was to restart aiccu.  I would restart it when I needed it.

At some point, I started neglecting the IPv6 tunnel.  I didn’t need to use it, and I just didn’t bother to restart it.  At the same time, I noticed that the SIP gateway would stay logged in without dropping out for much longer than usual.  This state remained for about two weeks, until I needed to reinstate the IPv6 tunnel.  Right after doing that, I walked over to the SIP gate and… noticed that it had dropped out.  Correlation does not imply causation, but you know… it raised my suspicion.  What is it that these two devices have in common?  The router!

A search on Google for “zyxel sixxs sip” revealed a forum post, in which someone described the same symptoms I had, with a bit of diagnostics.  Both the IPv6 tunnel and the SIP service are using UDP, which is harder to NAT than TCP.  The Zyxel router would repeatedly get confused and misinterpret the IPv6 related UDP packets as SIP packets and vice versa.

The solution was to take the logic out of the Zyxel router and make it act as a DSL modem only. I’ve reconfigured it to the bridge / transparent mode, and moved the NAT logic to OpenWRT.  I initially wasn’t sure how the bridging mode works, but it turned out to be simple enough.  What’s nice about my particular setup is that I have a static IP address.  The setup uses a /30 network block, so it’s effectively using up 4 addresses of the IPv4 address space.  It’s essentially a 2-bit netblock, so we can use 0, 1, 2, and 3 as the addresses.  Any CIDR calculator will tell you that if your netmask is /30 (or 255.255.255.252), then it’s a 4-address network.  Let’s consider the NAT mode first.

  • 0: netblock address
  • 1: the router tells you it’s the remote side (the gateway)
  • 2: the router’s public IP address
  • 3: the broadcast address

It might look like the 1 address is remote, at the ISP side.  But if you configure your router in a bridging mode, you have 2 devices, and they both have public IP addresses.  Let’s call them Zyxel (acting as a DSL modem) and OpenWRT (doing NAT).

  • 0: netblock address
  • 1: the Zyxel router’s address (gateway)
  • 2: the OpenWRT router’s address
  • 3: the broadcast address

So what you might have thought of as the remote IP address is your local router’s address.  What was left to do was to configure NAT on OpenWRT.  Linux knows how to interpret incoming UDP packets, and both my SIP gate and IPv6 tunnel are working correctly now.  Plus, I have more control.
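The layout of a /30 block can be checked with Python’s ipaddress module instead of a CIDR calculator. The 192.0.2.0/30 block used here is an assumption for illustration (it comes from a range reserved for documentation), not my actual network.

```python
import ipaddress

# Example block from the documentation range, not a real assignment.
net = ipaddress.ip_network("192.0.2.0/30")

print(net.netmask)                    # 255.255.255.252
print(net.num_addresses)              # 4
print(net.network_address)            # 192.0.2.0  (address 0)
print([str(h) for h in net.hosts()])  # ['192.0.2.1', '192.0.2.2']
print(net.broadcast_address)          # 192.0.2.3  (address 3)
```

The two host addresses are the ones that end up on the two routers in the bridged setup.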

FLOSS Weekly 163: OpenCSW, addendum

FLOSS Weekly, a podcast about Free-Libre and Open Source software, episode 163 featured OpenCSW, a project I actively participate in.

Since I was not on the podcast, I would like to use this opportunity to add to what has been said there.

Q: 05:30 What is OpenCSW and what does it contribute to the world?
A: …to add to what Phil said (we provide packages free as in free beer), there are two parts of what we provide: one part is binary packages, and the other part is the source code to build these packages.  It hasn’t been historically the culture at OpenCSW (or formerly, Blastwave) to release build recipes.  At OpenCSW, the policy for all new maintainers is to release source code of all packages they build.  However, there are still a number of old-timers who build packages using their own, unpublished scripts.  We are making efforts to have all build recipes published as open source, and while we’re not there yet, it’s one of the most important points on our agenda.  In this sense, we do care about freedom and about being an open source project.

Q: 15:15 Do you think of OpenCSW as a Solaris distribution?
A: Yes, as much as it is possible, while being based on commercial Solaris. The main difference between OpenCSW and Linux or BSD distributions is that OpenCSW does not provide the base OS, such as the kernel, libc or an installer.  From the perspective of a business which runs third party applications, it’s important that their OS is supported by the vendor.  Nexenta is a lovely Debian-based system with a Solaris kernel, but you can’t get support for an Oracle database on it.
