Genetic data in PostgreSQL

People get usually famous for the things they’ve done. Well, that’s not entirely true. They usually get famous for the things they’ve done, when they were successful. You don’t get famous for attempting and being unsuccessful, now do you?

It works the same way for the scientific publications. All scientists work hard trying various things, and when they finally succeed, they publish a paper. But what happens with all those hours spend on unsuccessful attempts? Nobody seems to be proud of blowing a whole laboratory up. Or whatever didn’t work for them. This means that other people can never learn that something was unsuccessful and they’re likely to get the same, unfeasible, idea and repeat the same research. Needless to say, unsuccessfully.

Not that I’m proud of what I’ve done here, but I will at least allow other people to find this post on Google, when searching for genetic data and relational database. I’ll describe what I did, so they at least don’t do it the way I did.

There is a project called Hapmap, which is essentially a publicly available genetic data set. It is now being used for various genetic studies, usually for finding associations between specific genes and diseases. That’s what the genetic epidemiology is all about, I think. To find the guilty gene. I wonder when will they start fixing people with broken genes. It’s a only matter of time. Anyway, Hapmap allows you to download a whole bulk of genetic data and torture them in any way you want, just listening to the bits squeaking in pain. That’s a really great thing.

For my project, there was a problem of reformatting the data, so they could be fed to the program that analyzes them. Sounds easy, but writing a C++ code to do that is neither quick nor nice. And it works quite slow. In fact, I’m just watching one of my Core 2 Duo cores working at 100% for… how long?… 103 minutes so far, working with one chromosome. So I got this idea to put all the Hapmap data into an SQL database so I could throw declarative statements at it, getting whatever I asked for.

Genetic data as in Hapmap phase II, is generally a matrix with 210 rows (individuals) and 3.3 million columns (loci). Relational database is all about tables, but you can’t create a table with 3.3 million columns. PostgreSQL has a limitation to 65 thousands or so. I decided to try a sparse matrix implementation with one table for individuals, one for loci and one connecting table with two foreign keys and two fields for allele, defining a genotype. I was expecting it to be somewhat demanding for a machine, but having a Core 2 Duo with 2GB RAM and 70GB hard disk, how hard can it be? One row of the connecting table consists of 4 + 8 + 1 + 1 bytes, giving 14 bytes in total. 3.3 million times 210 would give about 700 million rows. Not every locus is polymorphic in every population, so in practice there were about 500 million rows in that table. 500 million times 14 bytes gives about 7GB. Piece of cake.

CREATE TABLE individual (indiv_no SERIAL, indiv_id VARCHAR);

CREATE TABLE locus (locus_no BIGSERIAL, snp_id VARCHAR, position INT8);

CREATE TABLE observed_genotype(indiv_no INTEGER, locus_no INT8, a1 CHAR, a2 CHAR);

Those SQL statements are simplified for better readability, in reality they had more stuff in them, like primary and foreign key declarations.

I sat down and hacked few Python scripts to prepare the data and load them into PostgreSQL. At first, my computer almost melted, then I found some more efficient ways to handle the data and finally got them all loaded into the database. The database had 38GB.

That is pretty heavy. It’s about 5 times more than the actual data takes. Not mentioning that using continuous bit arrays would allow to store the whole thing in about 250MB. We’re in relational world, you know. Things just need to take space.

There was only 4GB disk space left. The data was loaded, but the connecting table didn’t have the primary key. It had to be dropped, otherwise it would’ve taken too long to load the data. I happily launched phppgadmin and made few clicks to create a primary key. I’m getting old, you know. I know I could’ve typed it. But regardless to whether I’d type or click it in, the PostgreSQL took the remaining 4GB to create the index, but it wasn’t enough so it rolled back and returned the space to the file system.

That was basically it. I’ve had enough, dropped the whole database and started hacking with text files and coreutils. If you had an idea to put whole Hapmap data set into PostgreSQL, think if you have enough resources. Maybe, if you’re in 2012, sitting in front of a 16GHz eight-core machine with 32GB RAM and 1TB hard disk you do. But I’m in 2007 and I don’t. So, here is my opposite to “success story”. I hope you appreciate that. And no, I’m not very proud of it; but I value other people’s time and I think I might help someone before they spend a week doing the same thing.

Nevertheless, it could be feasible to load Hapmap genetic data into some specifically optimized relational database. Perhaps Oracle guys have something that could handle it. It’s still very tempting to be able to manipulate genetic data in a high level, declarative manner.


Author: automatthias

You won't believe what a skeptic I am.