<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Maciej Bliziński &#187; Database</title>
	<atom:link href="http://automatthias.wordpress.com/category/database/feed/" rel="self" type="application/rss+xml" />
	<link>http://automatthias.wordpress.com</link>
	<description>Data analysis and Linux</description>
	<lastBuildDate>Thu, 03 Dec 2009 08:03:37 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='automatthias.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/a6db86e33e907b5131ba30b2228db630?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Maciej Bliziński &#187; Database</title>
		<link>http://automatthias.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://automatthias.wordpress.com/osd.xml" title="Maciej Bliziński" />
		<item>
		<title>Fixing character sets in MySQL</title>
		<link>http://automatthias.wordpress.com/2008/12/26/fixing-character-sets-in-mysql/</link>
		<comments>http://automatthias.wordpress.com/2008/12/26/fixing-character-sets-in-mysql/#comments</comments>
		<pubDate>Fri, 26 Dec 2008 15:01:30 +0000</pubDate>
		<dc:creator>automatthias</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[SQL]]></category>

		<guid isPermaLink="false">http://automatthias.wordpress.com/?p=405</guid>
		<description><![CDATA[I&#8217;ve recently had to move a few databases from MySQL 4.x to MySQL 5.x. One of the most important differences is that the 5.x family understands character encodings. This is not exactly fresh news, since version 5.0 was issued in 2003, but there are still a lot of 4.x installations around.
MySQL 5.0 no longer happily accepts any byte [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I&#8217;ve recently had to move a few databases from MySQL 4.x to MySQL 5.x. One of the most important differences is that the 5.x family understands character encodings. This is not exactly fresh news, since version 5.0 was issued in 2003, but there are still a lot of 4.x installations around.</p>
<p>MySQL 5.0 no longer happily accepts any byte string into a VARCHAR or TEXT field. It stores encoding names as part of the table structure, and converts between encodings when necessary. MediaWiki or WordPress, when run on MySQL 4.x, store data in UTF-8, but the database itself doesn&#8217;t &#8220;know&#8221; about it. Everything seems fine until you dump your database to a file and load it into MySQL 5.0 (or above). What happens is that your text is considered to be latin1 (a.k.a. ISO-8859-1). If you happen to have any non-English characters in, say, article names in MediaWiki, you&#8217;re going to end up with an error message such as:</p>
<blockquote><p>A database query syntax error has occurred. This may indicate a bug in the software. The last attempted database query was:</p>
<p>(SQL query hidden)</p>
<p>from within function &#8220;Article::pageData&#8221;. MySQL returned error &#8220;1267: Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation &#8216;=&#8217; (localhost)&#8221;.</p></blockquote>
<p>To fix the problem, you need to tell MySQL that your text data is really UTF-8, not latin1. You need to find all the columns of type VARCHAR or TEXT, and modify them to have UTF-8 character set. For example, if your column is VARCHAR(255), you can execute this statement:</p>
<blockquote><p>ALTER TABLE your_table<br />
MODIFY COLUMN your_column VARCHAR(255)<br />
CHARACTER SET utf8<br />
COLLATE utf8_bin;</p></blockquote>
<p><em>(The utf8_bin collation is needed to keep your sorting case-sensitive.)</em></p>
<p>However, MySQL will <em>convert</em> your text from latin1 to UTF-8, and your text will still appear &#8220;wrong&#8221;. You can fix it in one more step. The problem is that your UTF-8 data was taken to be latin1 and then converted to UTF-8 a second time. To fix this, you need to convert your text from UTF-8 back to latin1, and then make MySQL take the result as UTF-8, but, importantly, without converting it. This can be achieved by temporarily casting your data to binary, an operation that doesn&#8217;t trigger encoding changes. You can then cast your data into any encoding you want. In a nutshell, you need to go: UTF-8 →(conversion)→ latin1 → binary → UTF-8.</p>
<blockquote><p>UPDATE your_table<br />
SET your_column = CONVERT(<br />
CONVERT(<br />
CONVERT(<br />
your_column<br />
USING latin1<br />
)<br />
USING binary<br />
)<br />
USING utf8<br />
);</p></blockquote>
<p>It may take you a while to understand it. If you want to get a better feel for what&#8217;s going on, consider the following, equivalent example in shell. Let&#8217;s assume your page titles contain non-ASCII characters, such as the Polish diacritics ąćęłńóśżź or the é in &#8220;Café&#8221;. In garbled form, the title &#8220;Café&#8221; looks something like this:</p>
<blockquote><p>CafÃ©</p></blockquote>
<p>Assuming your system is natively UTF-8 (most modern Linux distributions are), an easy way to simulate text garbling is the following shell expression.</p>
<blockquote><p>echo Café \<br />
| iconv -f utf-8 -t utf-8 \<br />
| iconv -f latin1 -t utf-8<br />
CafÃ©</p></blockquote>
<p>Converting from UTF-8 to UTF-8 seems stupid, but I wanted it to be very clear: what we have here is a UTF-8 string, as output by the second line, taken to be latin1 in the third line. This is how your text can become garbled. An obvious way to fix it is to run the process backwards:</p>
<blockquote><p>echo CafÃ© \<br />
| iconv -f utf-8 -t latin1 \<br />
| iconv -f utf-8 -t utf-8<br />
Café</p></blockquote>
<p>Again, the UTF-8 to UTF-8 conversion is preserved to make this crucial point explicit. Your string was converted to latin1, and then taken to be UTF-8.</p>
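<p>For illustration, the same round trip can be sketched in Python 3. The variable names are made up for this example, and Python&#8217;s "latin-1" codec stands in for MySQL&#8217;s latin1 here.</p>

```python
# Simulate the garbling: UTF-8 bytes are misread as latin1.
original = "Café"
garbled = original.encode("utf-8").decode("latin-1")

# Reverse it: turn the text back into latin1 bytes,
# then read those same bytes as UTF-8 without converting them.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

<p>The middle step, where the text exists only as bytes, plays the role of the cast to binary in the SQL statement.</p>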
<p>Back to our problem. We know how to convert the data, but we need to find all the tables and columns that need converting. Conveniently, MySQL offers an &#8220;information_schema&#8221; database, which allows us to read information about MySQL tables. It&#8217;s enough to run this query to find all the columns of interest:</p>
<blockquote><p>SELECT<br />
table_name,<br />
column_name,<br />
column_type,<br />
character_set_name<br />
FROM<br />
information_schema.columns<br />
WHERE<br />
table_schema = 'your_database'<br />
AND<br />
(<br />
data_type = 'varchar'<br />
OR<br />
data_type = 'text'<br />
)<br />
AND<br />
character_set_name != 'utf8'<br />
;</p></blockquote>
<p>If you&#8217;re really lazy, as I am, and your data has exactly this problem (UTF-8 taken to be latin1 and looking garbled), you can use a Python script I wrote. But please be careful! You&#8217;re using it at your own risk! Back up your database first! If this script damages your database and you lose all your data, it&#8217;s your problem, not mine. You have been warned. <a href="http://django-phpbb.googlecode.com/svn-history/r26/trunk/phpbb/tools/mysql_repair_encoding.py">Here&#8217;s the script</a>.</p>
<p>Note: the script will only <em>print</em> SQL statements to screen. You have to execute them yourself.</p>
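<p>The idea of turning rows from the information_schema query into repair statements can be sketched as follows. This is not the linked script, only a minimal illustration; the helper name <code>repair_statements</code> and the sample row are hypothetical, and the sketch only builds strings without connecting to a database.</p>

```python
def repair_statements(table, column, column_type):
    """Return the two SQL statements that re-label one column as UTF-8.

    Hypothetical helper for illustration: it only formats strings
    and trusts its inputs to be valid identifiers.
    """
    alter = (
        f"ALTER TABLE {table} MODIFY COLUMN {column} {column_type} "
        "CHARACTER SET utf8 COLLATE utf8_bin;"
    )
    update = (
        f"UPDATE {table} SET {column} = CONVERT(CONVERT(CONVERT("
        f"{column} USING latin1) USING binary) USING utf8);"
    )
    return [alter, update]


# Example row, shaped like (table_name, column_name, column_type).
rows = [("page", "page_title", "varchar(255)")]
for table, column, column_type in rows:
    for statement in repair_statements(table, column, column_type):
        print(statement)
```

<p>Keep in mind that MODIFY COLUMN restates the whole column definition, so attributes such as NOT NULL or a default value would have to be carried over as well; the sketch ignores them.</p>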
</div>]]></content:encoded>
			<wfw:commentRss>http://automatthias.wordpress.com/2008/12/26/fixing-character-sets-in-mysql/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e09207f4f71e692020a239853749b114?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">automatthias</media:title>
		</media:content>
	</item>
		<item>
		<title>Genetic data in PostgreSQL</title>
		<link>http://automatthias.wordpress.com/2007/06/06/genetic-data-in-postgresql/</link>
		<comments>http://automatthias.wordpress.com/2007/06/06/genetic-data-in-postgresql/#comments</comments>
		<pubDate>Wed, 06 Jun 2007 20:00:59 +0000</pubDate>
		<dc:creator>automatthias</dc:creator>
				<category><![CDATA[Computers]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://automatthias.wordpress.com/2007/06/06/genetic-data-in-postgresql/</guid>
		<description><![CDATA[People usually get famous for the things they&#8217;ve done.  Well, that&#8217;s not entirely true. They usually get famous for the things they&#8217;ve done, when they were successful. You don&#8217;t get famous for attempting and being unsuccessful, now do you?
It works the same way for scientific publications. All scientists work hard trying various things, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>People usually get famous for the things they&#8217;ve done.  Well, that&#8217;s not entirely true. They usually get famous for the things they&#8217;ve done, when they were successful. You don&#8217;t get famous for attempting and being unsuccessful, now do you?</p>
<p>It works the same way for scientific publications. All scientists work hard trying various things, and when they finally succeed, they publish a paper. But what happens with all those hours spent on unsuccessful attempts? Nobody seems to be proud of blowing a whole laboratory up. Or whatever didn&#8217;t work for them. This means that other people can never learn that something was unsuccessful, and they&#8217;re likely to get the same, unfeasible, idea and repeat the same research. Needless to say, unsuccessfully.</p>
<p>Not that I&#8217;m proud of what I&#8217;ve done here, but I will at least allow other people to find this post on Google when searching for genetic data and relational databases. I&#8217;ll describe what I did, so they at least don&#8217;t do it the way I did.</p>
<p><span id="more-279"></span></p>
<p>There is a project called <a href="http://www.hapmap.org/">Hapmap</a>, which is essentially a publicly available genetic data set. It is now being used for various genetic studies, usually for finding associations between specific genes and diseases. That&#8217;s what genetic epidemiology is all about, I think: finding the guilty gene. I wonder when they will start fixing people with broken genes. It&#8217;s only a matter of time. Anyway, Hapmap allows you to download a whole bulk of genetic data and torture them in any way you want, just listening to the bits squeaking in pain. That&#8217;s a really great thing.</p>
<p>For my project, the problem was reformatting the data so they could be fed to the program that analyzes them. Sounds easy, but writing C++ code to do that is neither quick nor nice. And it runs quite slowly. In fact, I&#8217;m just watching one of my Core 2 Duo cores working at 100% for&#8230; how long?&#8230; 103 minutes so far, working with one chromosome. So I got this idea to put all the Hapmap data into an SQL database so I could throw declarative statements at it, getting whatever I asked for.</p>
<p>Genetic data, as in Hapmap phase II, is essentially a matrix with 210 rows (individuals) and 3.3 million columns (loci). A relational database is all about tables, but you can&#8217;t create a table with 3.3 million columns; PostgreSQL limits a table to 1600 columns at most. I decided to try a sparse matrix implementation with one table for individuals, one for loci and one connecting table with two foreign keys and two allele fields, which together define a genotype.  I was expecting it to be somewhat demanding for a machine, but having a Core 2 Duo with 2GB RAM and a 70GB hard disk, how hard can it be? One row of the connecting table consists of 4 + 8 + 1 + 1 bytes, giving 14 bytes in total. 3.3 million times 210 would give about 700 million rows. Not every locus is polymorphic in every population, so in practice there were about 500 million rows in that table. 500 million times 14 bytes gives about 7GB. Piece of cake.</p>
<blockquote><p>CREATE TABLE individual (indiv_no SERIAL, indiv_id VARCHAR);</p>
<p>CREATE TABLE locus (locus_no BIGSERIAL, snp_id VARCHAR, position INT8);</p>
<p>CREATE TABLE observed_genotype(indiv_no INTEGER, locus_no INT8, a1 CHAR, a2 CHAR);</p>
<p><em>Those SQL statements are simplified for better readability, in reality they had more stuff in them, like primary and foreign key declarations.</em></p></blockquote>
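<p>The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The variable names are made up for this sketch. It counts only the column payload and assumes nothing about per-row overhead or indexes, so it is a lower bound rather than a prediction of the on-disk size.</p>

```python
individuals = 210
loci = 3_300_000
bytes_per_row = 4 + 8 + 1 + 1  # indiv_no + locus_no + a1 + a2

# Size of the full matrix if every locus were stored for everyone.
full_matrix_rows = individuals * loci

# Approximate number of rows actually stored, as quoted in the text.
observed_rows = 500_000_000
payload_bytes = observed_rows * bytes_per_row

print(full_matrix_rows)        # 693000000
print(payload_bytes / 10**9)   # 7.0 (GB of raw column data)
```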
<p>I sat down and hacked up a few Python scripts to prepare the data and load them into PostgreSQL. At first, my computer almost melted, then I found some more efficient ways to handle the data and finally got them all loaded into the database. The database took up 38GB.</p>
<p>That is pretty heavy. It&#8217;s about 5 times more than the actual data takes. Not to mention that contiguous bit arrays would store the whole thing in about 250MB. We&#8217;re in the relational world, you know. Things just need to take space.</p>
<p>There was only 4GB of disk space left. The data was loaded, but the connecting table didn&#8217;t have a primary key. I had dropped it, otherwise loading the data would&#8217;ve taken too long. I happily launched phppgadmin and made a few clicks to create the primary key. I&#8217;m getting old, you know. I know I could&#8217;ve typed it. But regardless of whether I typed or clicked, PostgreSQL used up the remaining 4GB while building the index, and since that wasn&#8217;t enough, it rolled back and returned the space to the file system.</p>
<p>That was basically it. I&#8217;d had enough, dropped the whole database and started hacking with text files and coreutils. If you have an idea to put the whole Hapmap data set into PostgreSQL, check whether you have enough resources. Maybe, if you&#8217;re in 2012, sitting in front of a 16GHz eight-core machine with 32GB RAM and a 1TB hard disk, you do. But I&#8217;m in 2007 and I don&#8217;t. So, here is my opposite of a “success story”. I hope you appreciate that. And no, I&#8217;m not very proud of it; but I value other people&#8217;s time and I think I might help someone before they spend a week doing the same thing.</p>
<p>Nevertheless, it <em>could</em> be feasible to load Hapmap genetic data into some specifically optimized relational database. Perhaps the Oracle guys have something that could handle it. It&#8217;s still very tempting to be able to manipulate genetic data in a high-level, declarative manner.</p>
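<p>To give an idea of what a bit-array representation could look like, here is a minimal Python sketch. The post does not say which encoding lies behind the 250MB figure; the 2-bit genotype code used here (integers 0 to 3) and both function names are assumptions made for illustration only.</p>

```python
def pack_genotypes(codes):
    """Pack genotype codes (integers 0..3) into a bytearray, 2 bits each."""
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        if code not in (0, 1, 2, 3):
            raise ValueError(f"genotype code out of range: {code}")
        # Four codes share one byte; each gets its own 2-bit slot.
        packed[i // 4] |= code << (2 * (i % 4))
    return packed


def unpack_genotype(packed, i):
    """Read back the i-th 2-bit genotype code."""
    return (packed[i // 4] >> (2 * (i % 4))) & 3


codes = [0, 1, 2, 3, 3]
packed = pack_genotypes(codes)
print(len(packed))
print([unpack_genotype(packed, i) for i in range(len(codes))])
```

<p>Five genotypes fit into two bytes here, compared with 14 bytes of column data per genotype in the connecting table.</p>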
</div>]]></content:encoded>
			<wfw:commentRss>http://automatthias.wordpress.com/2007/06/06/genetic-data-in-postgresql/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e09207f4f71e692020a239853749b114?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">automatthias</media:title>
		</media:content>
	</item>
		<item>
		<title>Method of comparing hospitals in the EACTS Congenital Database</title>
		<link>http://automatthias.wordpress.com/2007/04/15/method-of-comparing-hospitals-in-the-eacts-congenital-database/</link>
		<comments>http://automatthias.wordpress.com/2007/04/15/method-of-comparing-hospitals-in-the-eacts-congenital-database/#comments</comments>
		<pubDate>Sun, 15 Apr 2007 11:11:28 +0000</pubDate>
		<dc:creator>automatthias</dc:creator>
				<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[EACTS Congenital Database]]></category>
		<category><![CDATA[Medicine]]></category>

		<guid isPermaLink="false">http://automatthias.wordpress.com/2007/04/15/method-of-comparing-hospitals-in-the-eacts-congenital-database/</guid>
		<description><![CDATA[I have published my MSc thesis on-line. It&#8217;s available for (free) download in PDF format. It contains:

An example of complicated data reshaped to a form which allows statistical analysis
A method of comparing hospitals fairly

Read more and download the thesis.
]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I have published my MSc thesis on-line. It&#8217;s available for (free) download in PDF format. It contains:</p>
<ol>
<li>An example of complicated data reshaped to a form which allows statistical analysis</li>
<li>A method of comparing hospitals fairly</li>
</ol>
<p><a href="http://automatthias.wordpress.com/eacts-congenital-database/">Read more and download</a> the thesis.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://automatthias.wordpress.com/2007/04/15/method-of-comparing-hospitals-in-the-eacts-congenital-database/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e09207f4f71e692020a239853749b114?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">automatthias</media:title>
		</media:content>
	</item>
		<item>
		<title>Tetralogy of Fallot database representation</title>
		<link>http://automatthias.wordpress.com/2006/07/31/tetralogy-of-fallot-database-representation/</link>
		<comments>http://automatthias.wordpress.com/2006/07/31/tetralogy-of-fallot-database-representation/#comments</comments>
		<pubDate>Mon, 31 Jul 2006 13:25:45 +0000</pubDate>
		<dc:creator>automatthias</dc:creator>
				<category><![CDATA[Data Analysis]]></category>
		<category><![CDATA[Database]]></category>
		<category><![CDATA[EACTS Congenital Database]]></category>
		<category><![CDATA[Medicine]]></category>
		<category><![CDATA[Thesis]]></category>

		<guid isPermaLink="false">https://automatthias.wordpress.com/2006/07/31/tetralogy-of-fallot-database-over-representation/</guid>
		<description><![CDATA[Tetralogy of Fallot is a significant and complex congenital heart disease. It consists of four heart malformations. So if a patient is described to have TOF, it means that she/he has all those four malformations together.
However, the separate malformations are already present on the diagnoses list, as separate entities. From a data-modeling perspective, it&#8217;s a [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Tetralogy of Fallot is a significant and complex congenital heart disease. It consists of <a href="http://en.wikipedia.org/wiki/Tetralogy_of_Fallot">four heart malformations</a>. So if a patient is described as having TOF, it means that she/he has all four of those malformations together.</p>
<p>However, the separate malformations are already present on the <a href="http://www.sts.org/file/AppendixIV.doc">diagnoses list</a>, as separate entities. From a data-modeling perspective, it&#8217;s a redundancy on the factors (malformations, diseases) list. This leads to problems with interpretation. Since <a href="http://en.wikipedia.org/wiki/Ventricular_septal_defect">VSD</a> is one of TOF&#8217;s components and is also present on the diagnoses list by itself, and since users are allowed to enter both VSD and TOF diagnoses, the database contains patients with all four combinations.</p>
<p><span id="more-163"></span><br />
People might write it like this:</p>
<ul>
<li>TOF</li>
<li>VSD</li>
<li>TOF and VSD</li>
<li>None</li>
</ul>
<p>But I would prefer the tabular notation:</p>
<pre>+-----+-----+
| TOF | VSD |
+-----+-----+
|  -  |  -  |
|  X  |  -  |
|  -  |  X  |
|  X  |  X  |
+-----+-----+</pre>
<p>It should be clear now why there are exactly four combinations. And all four are present in the database. The TOF and VSD combination is invalid, because VSD is already included in TOF. But only doctors know that, because the data structure doesn&#8217;t reflect this knowledge.</p>
<p>Now, how to fix it? Should the data entry software prohibit entering both TOF and VSD? No. It would mean implementing an <i>exceptional behaviour</i> in the application while leaving the weak data structure in place. Instead, I would make an alternative data structure:</p>
<pre>+-----+--------------------+------------------+----------------------------+
| VSD | Pulmonary stenosis | Overriding aorta | Right ventric. hypertrophy |
+-----+--------------------+------------------+----------------------------+
|  -  | -                  | -                | -                          |
|  X  | -                  | -                | -                          |
|  X  | X                  | X                | X                          |
|  X  | X                  | X                | X                          |
+-----+--------------------+------------------+----------------------------+</pre>
<p>It doesn&#8217;t mean that the data entry software users would have to enter every component separately. For such complex diseases, presets would be helpful: click the TOF button and the four factors are filled in. But if you have an exceptional patient, you can tweak every single factor independently.</p>
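<p>A preset of this kind could be as simple as a named set of factors. The following Python sketch is only an illustration; the names <code>PRESETS</code> and <code>apply_preset</code> are hypothetical and not part of any existing data entry software.</p>

```python
# The four TOF components, taken from the table above.
TOF_COMPONENTS = frozenset({
    "VSD",
    "Pulmonary stenosis",
    "Overriding aorta",
    "Right ventricular hypertrophy",
})

# Hypothetical preset registry: button name -> factors it fills in.
PRESETS = {"TOF": TOF_COMPONENTS}


def apply_preset(factors, preset_name):
    """Return a new set of factors with the preset's components added."""
    return set(factors) | PRESETS[preset_name]


from_scratch = apply_preset(set(), "TOF")
with_vsd_first = apply_preset({"VSD"}, "TOF")
print(sorted(from_scratch))
print(from_scratch == with_vsd_first)
```

<p>Entering VSD first and then clicking TOF gives the same result as clicking TOF alone, which is why the last two rows of the table are identical.</p>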
<p><b>The reports</b></p>
<p>When viewing information about VSD and related diseases, a surgeon doesn&#8217;t want to see what percentage of patients with VSD also have TOF. Also, when viewing TOF-related diseases, the surgeon doesn&#8217;t want to see VSD, because all patients with TOF have VSD by definition.</p>
<p><b>New relation</b></p>
<p>So I have to add new data structures. There will be a new relation between the diseases:</p>
<ul>
<li><i><b>Is part of</b></i></li>
</ul>
<p>The results in the reports will be:</p>
<ul>
<li>When viewing the superior disease (here: TOF), patients who have the inferior disease (here: VSD) will be shown, but the information about the inferior disease will be discarded, so they will appear as if they didn&#8217;t have it.</li>
<li>When viewing the inferior disease (here: VSD), patients who also have the superior disease (here: TOF) will not be included or shown.</li>
</ul>
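<p>The two report rules can be sketched in Python as follows. The patient records, the <code>IS_PART_OF</code> mapping and the <code>report</code> function are all hypothetical, made up to illustrate the rules rather than taken from the actual database.</p>

```python
# "Is part of" relation: inferior disease -> superior disease.
IS_PART_OF = {"VSD": "TOF"}

# Made-up patients covering the four TOF/VSD combinations.
patients = {
    "p1": {"VSD"},
    "p2": {"TOF"},
    "p3": {"TOF", "VSD"},
    "p4": set(),
}


def report(disease, patients):
    """Select patients for a report on one disease, applying both rules."""
    parts = {inf for inf, sup in IS_PART_OF.items() if sup == disease}
    superior = IS_PART_OF.get(disease)
    selected = {}
    for patient_id, diagnoses in patients.items():
        if disease not in diagnoses:
            continue
        if superior is not None and superior in diagnoses:
            continue  # rule 2: reported under the superior disease instead
        selected[patient_id] = diagnoses - parts  # rule 1: hide the parts
    return selected


print(report("TOF", patients))
print(report("VSD", patients))
```

<p>With this sample data, the TOF report lists p2 and p3 without any VSD entry, and the VSD report lists only p1.</p>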
</div>]]></content:encoded>
			<wfw:commentRss>http://automatthias.wordpress.com/2006/07/31/tetralogy-of-fallot-database-representation/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e09207f4f71e692020a239853749b114?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">automatthias</media:title>
		</media:content>
	</item>
		<item>
		<title>Malcolm Tredinnick&#8217;s SQL puzzle solution</title>
		<link>http://automatthias.wordpress.com/2006/07/23/malcolm-tredinnicks-sql-puzzle-solution/</link>
		<comments>http://automatthias.wordpress.com/2006/07/23/malcolm-tredinnicks-sql-puzzle-solution/#comments</comments>
		<pubDate>Sun, 23 Jul 2006 11:42:18 +0000</pubDate>
		<dc:creator>automatthias</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">https://automatthias.wordpress.com/2006/07/23/malcolm-tredinnicks-sql-puzzle-solution/</guid>
		<description><![CDATA[The puzzle
Malcolm has asked how to find the classes that were attended by all of the students from a given list. Then, he proposed a solution with a HAVING clause. I&#8217;ll call it the one-join solution. I&#8217;d like to suggest another one, which I&#8217;ll call multi-join.
I&#8217;ve made a benchmark to evaluate the execution time. A [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><h3>The puzzle</h3>
<p>Malcolm has <a href="http://www.pointy-stick.com/blog/2006/06/12/sql-puzzle/">asked</a> how to find the classes that were attended by all of the students from a given list. Then, he <a href="http://www.pointy-stick.com/blog/2006/06/13/sql-puzzle-solution/">proposed a solution</a> with a HAVING clause. I&#8217;ll call it the <em>one-join</em> solution. I&#8217;d like to suggest another one, which I&#8217;ll call <em>multi-join</em>.</p>
<p>I&#8217;ve made a benchmark to evaluate the execution time. A statistical tool was used to create a mathematical model of the execution time.</p>
<p><span id="more-156"></span></p>
<h3>The database structure</h3>
<pre>CREATE TABLE Class (
    id  integer NOT NULL PRIMARY KEY);
CREATE TABLE Student (
    id integer NOT NULL PRIMARY KEY);

CREATE TABLE Reln_Class_Student (
    class_id integer NOT NULL REFERENCES Class(id),
    student_id integer NOT NULL REFERENCES Student(id),
    PRIMARY KEY (class_id, student_id));</pre>
<h3>My solution</h3>
<p>As in Malcolm&#8217;s example, the following query finds the classes that are attended by both student 253 and student 289.</p>
<pre>SELECT id
FROM
    Class AS C
    INNER JOIN Reln_Class_Student AS s253
        ON (C.id = s253.class_id)
    INNER JOIN Reln_Class_Student AS s289
        ON (C.id = s289.class_id)
WHERE
    s253.student_id = 253
    AND
    s289.student_id = 289
;</pre>
<p>In my example, the number of students on the list (here: two) is equal to the number of joins to be performed. Each join must be assigned a different alias, hence the &#8220;s253&#8221; and &#8220;s289&#8221; aliases for the joined tables.</p>
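<p>Since the query grows with the list, it is natural to generate it. The sketch below builds the multi-join query for any list of student ids and runs it against a tiny in-memory SQLite database. Note the assumptions: the function name <code>multi_join_query</code> and the sample rows are made up, ids are taken to be non-negative integers, and SQLite stands in for PostgreSQL purely to keep the example self-contained.</p>

```python
import sqlite3


def multi_join_query(student_ids):
    """Build the multi-join query: one join per student on the list."""
    ids = sorted(set(int(s) for s in student_ids))
    if not ids:
        raise ValueError("the student list must not be empty")
    joins = []
    conditions = []
    for sid in ids:
        alias = f"s{sid}"  # each join needs its own alias
        joins.append(
            f"INNER JOIN Reln_Class_Student AS {alias} "
            f"ON (C.id = {alias}.class_id)"
        )
        conditions.append(f"{alias}.student_id = {sid}")
    return (
        "SELECT C.id FROM Class AS C "
        + " ".join(joins)
        + " WHERE "
        + " AND ".join(conditions)
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Class (id integer NOT NULL PRIMARY KEY);
CREATE TABLE Student (id integer NOT NULL PRIMARY KEY);
CREATE TABLE Reln_Class_Student (
    class_id integer NOT NULL REFERENCES Class(id),
    student_id integer NOT NULL REFERENCES Student(id),
    PRIMARY KEY (class_id, student_id));
INSERT INTO Class VALUES (1), (2);
INSERT INTO Student VALUES (253), (289);
INSERT INTO Reln_Class_Student VALUES (1, 253), (1, 289), (2, 253);
""")

classes = [row[0] for row in conn.execute(multi_join_query([253, 289]))]
print(classes)
```

<p>In the sample data only class 1 is attended by both students, so it is the only id returned.</p>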
<h3>Without indexes on the connecting table</h3>
<p>Malcolm&#8217;s solution found the class with a list of 10 students in 214ms, while my query did the same job in 5.8ms, <strong>36 times faster</strong>. With a short (2-element) student list, Malcolm&#8217;s query takes 130ms, while my query completes in 4.9ms, 26 times faster.</p>
<p>I analyzed the planner (<em>EXPLAIN ANALYZE</em>) output. Although my query uses multiple joins, it always uses index scans, while with Malcolm&#8217;s query the planner uses sequential scans (over the long Reln_Class_Student table), which effectively kill the performance. <strong>It&#8217;s faster to perform 10 index scans than just one sequential one.</strong></p>
<h3>The benchmark</h3>
<h4>The environment</h4>
<p>The test was run on a 1.5GHz Celeron M with 768MB of RAM, using PostgreSQL 8.1 on Ubuntu Linux 6.06.</p>
<h4>The data</h4>
<p>Let&#8217;s change the nomenclature to a more generic example: <em>documents and tags</em>.</p>
<ul>
<li>2000 documents</li>
<li>100 tags</li>
</ul>
<p>Each tag had a number and an assigned frequency: tag 1 had 0% frequency, tag 2 had 1% frequency, and so on. Tags were distributed independently of each other.</p>
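<p>A data set with this shape could be generated along the following lines. This is only a sketch under assumptions: the function name <code>generate_assignments</code> and the fixed seed are made up, and each tag is given exactly its target share of documents, which may differ from how the original test data was drawn.</p>

```python
import random


def generate_assignments(n_documents=2000, n_tags=100, seed=0):
    """Return (document_id, tag_id) pairs; tag k covers (k - 1)% of documents."""
    rng = random.Random(seed)
    documents = range(1, n_documents + 1)
    pairs = []
    for tag in range(1, n_tags + 1):
        # Tag 1 gets 0% of the documents, tag 2 gets 1%, and so on.
        count = n_documents * (tag - 1) // 100
        # A separate sample per tag keeps the tags independent of each other.
        for doc in rng.sample(documents, count):
            pairs.append((doc, tag))
    return pairs


pairs = generate_assignments()
print(len(pairs))  # prints 99000
```

<p>The resulting pairs can be loaded directly into the connecting table, with documents in place of classes and tags in place of students.</p>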
<h4>The tests</h4>
<p>Queries with two tags were tested, covering all possible combinations of tags. In total, 20 thousand observations were recorded (100 × 100 × 2). Each observation had 4 properties:</p>
<ul>
<li>Query type (one-join or multi-join)</li>
<li>Tag 1 id (~frequency, from 0 to 99)</li>
<li>Tag 2 id (~frequency, from 0 to 99)</li>
<li>Execution time (in ms)</li>
</ul>
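<p>The structure of such a test run can be sketched as follows. The names <code>run_benchmark</code> and <code>fake_query</code> are hypothetical, and the placeholder callback does no database work; a real run would execute the one-join or multi-join SQL in its place.</p>

```python
import itertools
import time


def run_benchmark(tag_ids, run_query):
    """Time every (query type, tag 1, tag 2) combination once."""
    observations = []
    for qrytype in ("one-join", "multi-join"):
        for tag1, tag2 in itertools.product(tag_ids, repeat=2):
            start = time.perf_counter()
            run_query(qrytype, tag1, tag2)
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # One observation: query type, both tag ids, time in ms.
            observations.append((qrytype, tag1, tag2, elapsed_ms))
    return observations


def fake_query(qrytype, tag1, tag2):
    # Placeholder only: a real benchmark would run the SQL query here.
    return None


observations = run_benchmark(range(100), fake_query)
print(len(observations))  # prints 20000
```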
<h4>The analysis</h4>
<p>All the observations were imported into the R statistical package. Linear regression was used to create a mathematical model of the execution time. The following model was found:</p>
<pre>Call:
lm(formula = time ~ tag1 * tag2 * qrytype, data = B)

Residuals:
     Min       1Q   Median       3Q      Max
-36.9016  -5.8601  -2.8720   0.8972 794.9966

Coefficients:
                            Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)               27.0779277  0.8396820  32.248  &lt; 2e-16 ***
tag1                      -0.0626129  0.0144355  -4.337 1.45e-05 ***
tag2                       0.2447616  0.0144355  16.956  &lt; 2e-16 ***
qrytypeone-join           -2.0632741  1.1874897  -1.738   0.0823 .
tag1:tag2                  0.0013108  0.0002482   5.282 1.29e-07 ***
tag1:qrytypeone-join       0.1914354  0.0204149   9.377  &lt; 2e-16 ***
tag2:qrytypeone-join       0.1304854  0.0204149   6.392 1.68e-10 ***
tag1:tag2:qrytypeone-join -0.0015868  0.0003510  -4.521 6.18e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 20.68 on 19992 degrees of freedom
Multiple R-Squared: 0.2284,     Adjusted R-squared: 0.2281
F-statistic: 845.5 on 7 and 19992 DF,  p-value: &lt; 2.2e-16</pre>
<p>Leaving aside the intercept and the non-significant <em>qrytypeone-join</em> term, the three largest coefficients are:</p>
<ul>
<li><strong>tag2</strong><br />
the more popular the second tag, the longer the execution time</li>
<li><strong>tag1:qrytypeone-join</strong><br />
when the one-join query is used, each unit of tag1 frequency costs additional time</li>
<li><strong>tag2:qrytypeone-join</strong><br />
when the one-join query is used, each unit of tag2 frequency costs additional time</li>
</ul>
<p>The <em>qrytypeone-join</em> coefficient is not statistically significant at the 0.05 level (p = 0.0823). This is probably the effect of some one-join queries that executed very fast. They are visible on the plot.</p>
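<p>To read the model more concretely, the coefficients can be plugged into a small function that predicts the execution time for a given pair of tags. The sketch below copies the estimates from the R output above; the names <code>COEF</code> and <code>predict_ms</code> are illustrative. Given the low R-squared (0.23), the predictions are rough averages and not precise timings.</p>

```python
# Estimates copied from the R summary; multi-join is the baseline level.
COEF = {
    "intercept": 27.0779277,
    "tag1": -0.0626129,
    "tag2": 0.2447616,
    "one_join": -2.0632741,
    "tag1:tag2": 0.0013108,
    "tag1:one_join": 0.1914354,
    "tag2:one_join": 0.1304854,
    "tag1:tag2:one_join": -0.0015868,
}


def predict_ms(tag1, tag2, one_join):
    """Predicted execution time in ms for tag ids tag1 and tag2."""
    q = 1.0 if one_join else 0.0  # dummy variable for the query type
    return (COEF["intercept"]
            + COEF["tag1"] * tag1
            + COEF["tag2"] * tag2
            + COEF["one_join"] * q
            + COEF["tag1:tag2"] * tag1 * tag2
            + COEF["tag1:one_join"] * tag1 * q
            + COEF["tag2:one_join"] * tag2 * q
            + COEF["tag1:tag2:one_join"] * tag1 * tag2 * q)


for tag1, tag2 in [(0, 0), (50, 50), (99, 99)]:
    multi = predict_ms(tag1, tag2, one_join=False)
    one = predict_ms(tag1, tag2, one_join=True)
    print(tag1, tag2, round(multi, 1), round(one, 1))
```

<p>For two rare tags the two predictions are within about 2ms of each other, while for tags 50 and 50 the one-join prediction is roughly 10ms higher.</p>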
<h4>The plot</h4>
<p><a href="http://automatthias.files.wordpress.com/2006/07/time-by-qrytype-by-tags.png" class="imagelink" title="Query execution time"><img src="http://automatthias.files.wordpress.com/2006/07/time-by-qrytype-by-tags.thumbnail.png" alt="Query execution time" /></a><br />
Click the thumbnail to see the large version.</p>
<ul>
<li>One-join queries: Red</li>
<li>Multi-join queries: Blue</li>
</ul>
<p>The plot shows the timings of the queries. The lower, the better. As you can see, the relation between tag frequency and execution time is linear, but the slope is steeper for one-join queries. The same thing shows up in the model as the positive <em>tagX:qrytypeone-join</em> coefficients.</p>
<h4>Conclusion</h4>
<p>The multi-join query is always at least as fast as the one-join query. Both queries perform similarly when both tags have a low frequency. The more frequent the tags, the bigger the difference between the queries, in favor of the multi-join query.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://automatthias.wordpress.com/2006/07/23/malcolm-tredinnicks-sql-puzzle-solution/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/e09207f4f71e692020a239853749b114?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">automatthias</media:title>
		</media:content>

		<media:content url="http://automatthias.files.wordpress.com/2006/07/time-by-qrytype-by-tags.thumbnail.png" medium="image">
			<media:title type="html">Query execution time</media:title>
		</media:content>
	</item>
	</channel>
</rss>