Malcolm Tredinnick’s SQL puzzle solution

The puzzle

Malcolm has asked, how to find the classes that were attended by all of the students from a given list. Then, he proposed a solution with a HAVING clause. I’ll call it the one-join solution. I’d like to suggest another one, which I’ll call multi-join.

I’ve made a benchmark to evaluate the execution time. A statistical tool was used to create a mathematical model of the execution time.

The database structure

CREATE TABLE Class (
    id  integer NOT NULL PRIMARY KEY);
CREATE TABLE Student (
    id integer NOT NULL PRIMARY KEY);

CREATE TABLE Reln_Class_Student (
    class_id integer NOT NULL REFERENCES Class(id),
    student_id integer NOT NULL REFERENCES Student(id),
    PRIMARY KEY (class_id, student_id));

My solution

Like in Malcolm’s example, following example finds the classes that are attended by both 253 and 289 students.

SELECT id
FROM
    Class AS C
    INNER JOIN Reln_Class_Student AS s253
        ON (C.id = s253.class_id)
    INNER JOIN Reln_Class_Student AS s289
        ON (C.id = s289.class_id)
WHERE
    s253.student_id = 253
    AND
    s289.student_id = 289
;

In my example, the number of students on the list (here: two) is equal to number of joins that are to be performed. Each join must have different alias assigned, hence “s253″ and “s289″ aliases for joined tables.

Without indexes on the connecting table

Malcolm’s solution found the class with list of 10 students in 214ms, while my query did the same job in 5.8ms, 36 times faster. With a short (2-element) student list, Malcolm’s query takes 130ms, while my query completes in 4.9ms, 26 times faster.

I analyzed the planner (EXPLAIN ANALYZE) output. Despite my query uses multiple joins, it always uses index scans, while with Malcolm’s query, planner uses sequential scans (over the long Reln_Class_Student table), which effectively kill the performance. It’s faster to perform 10 index scans than just one sequential one.

The benchmark

The environment

The test was done on a Celeron M 1.5GHz with 768MB of RAM, on PostgreSQL 8.1, under Ubuntu Linux 6.06 distribution.

The data

Let’s change the nomenclature to a more generic example: documents and tags.

  • 2000 documents
  • 100 tags

Each tag had a number and assigned frequency. tag 1 had 0% frequency, tag 2 had 1% frequency, and so on. Tags were distributed independently from each other.

The tests

Queries with two keywords were tested. All possible combinations of the tags were tested. In total, 20 thousands observations were recorded (100 × 100 × 2). Each observation had 4 properties:

  • Query type (one-join or multi-join)
  • Tag 1 id (~frequency, from 0 to 99)
  • Tag 2 id (~frequency, from 0 to 99)
  • Execution time (in ms)

The analysis

All the observations were imported into the R-project statistical package. A linear regression were used to create a mathematical model for the execution time. The following model was found:

Call:
lm(formula = time ~ tag1 * tag2 * qrytype, data = B)Residuals:

Min       1Q   Median       3Q      Max
-36.9016  -5.8601  -2.8720   0.8972 794.9966
Coefficients:
                             Estimate Std. Error t value Pr(>|t|)

(Intercept)               27.0779277  0.8396820  32.248  < 2e-16 ***
tag1                      -0.0626129  0.0144355  -4.337 1.45e-05 ***
tag2                       0.2447616  0.0144355  16.956  < 2e-16 ***
qrytypeone-join           -2.0632741  1.1874897  -1.738   0.0823 .
tag1:tag2                  0.0013108  0.0002482   5.282 1.29e-07 ***
tag1:qrytypeone-join       0.1914354  0.0204149   9.377  < 2e-16 ***
tag2:qrytypeone-join       0.1304854  0.0204149   6.392 1.68e-10 ***
tag1:tag2:qrytypeone-join -0.0015868  0.0003510  -4.521 6.18e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20.68 on 19992 degrees of freedom
Multiple R-Squared: 0.2284,     Adjusted R-squared: 0.2281
F-statistic: 845.5 on 7 and 19992 DF,  p-value: < 2.2e-16

The three greatest coefficients are:

  • tag2
    the more popular second tag, the longer execution time
  • tag1:qrytypeone-join
    when one-join query is used, additional time on tag1 is used
  • tag2:qrytypeone-join
    when one-join query is used, additional time on tag2 is used

The qrytypeone-join doesn’t show statistical significance. It’s probably the effect of some queries with one-join query that executed very fast. They are visible on the plot.

The plot

Query execution time
Click the thumbnail to see the large version.

  • One-join queries: Red
  • Multi-join queries: Blue

The plot shows timings of the queries. The lower, the better. As you can see, the tag-frequency to time ratio relation is linear, but the slope is steeper for multi-join queries. You can see the same thing in the model as the tagX:qrytypeone-join coefficients.

Conclusion

Multi-join query is always at least as fast as one-join. Both queries perform essentially similar if both tags are of low frequency. The more frequent tags, the bigger difference between queries, in favor for multi-join query.

About these ads

One Response to “Malcolm Tredinnick’s SQL puzzle solution”

  1. Michael Axiak Says:

    Hello,
    I was procrastinating and I happened upon this blog entry. It just so happens that I just wrote a patch for django (well, not really…since QuerySet is going to be refactored) that does exactly what you suggested:

    http://code.djangoproject.com/ticket/3691

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 474 other followers

%d bloggers like this: