Malcolm has asked, how to find the classes that were attended by all of the students from a given list. Then, he proposed a solution with a HAVING clause. I’ll call it the one-join solution. I’d like to suggest another one, which I’ll call multi-join.
I’ve made a benchmark to evaluate the execution time. A statistical tool was used to create a mathematical model of the execution time.
The database structure
CREATE TABLE Class ( id integer NOT NULL PRIMARY KEY); CREATE TABLE Student ( id integer NOT NULL PRIMARY KEY); CREATE TABLE Reln_Class_Student ( class_id integer NOT NULL REFERENCES Class(id), student_id integer NOT NULL REFERENCES Student(id), PRIMARY KEY (class_id, student_id));
Like in Malcolm’s example, following example finds the classes that are attended by both 253 and 289 students.
SELECT id FROM Class AS C INNER JOIN Reln_Class_Student AS s253 ON (C.id = s253.class_id) INNER JOIN Reln_Class_Student AS s289 ON (C.id = s289.class_id) WHERE s253.student_id = 253 AND s289.student_id = 289 ;
In my example, the number of students on the list (here: two) is equal to number of joins that are to be performed. Each join must have different alias assigned, hence “s253″ and “s289″ aliases for joined tables.
Without indexes on the connecting table
Malcolm’s solution found the class with list of 10 students in 214ms, while my query did the same job in 5.8ms, 36 times faster. With a short (2-element) student list, Malcolm’s query takes 130ms, while my query completes in 4.9ms, 26 times faster.
I analyzed the planner (EXPLAIN ANALYZE) output. Despite my query uses multiple joins, it always uses index scans, while with Malcolm’s query, planner uses sequential scans (over the long Reln_Class_Student table), which effectively kill the performance. It’s faster to perform 10 index scans than just one sequential one.
The test was done on a Celeron M 1.5GHz with 768MB of RAM, on PostgreSQL 8.1, under Ubuntu Linux 6.06 distribution.
Let’s change the nomenclature to a more generic example: documents and tags.
- 2000 documents
- 100 tags
Each tag had a number and assigned frequency. tag 1 had 0% frequency, tag 2 had 1% frequency, and so on. Tags were distributed independently from each other.
Queries with two keywords were tested. All possible combinations of the tags were tested. In total, 20 thousands observations were recorded (100 × 100 × 2). Each observation had 4 properties:
- Query type (one-join or multi-join)
- Tag 1 id (~frequency, from 0 to 99)
- Tag 2 id (~frequency, from 0 to 99)
- Execution time (in ms)
All the observations were imported into the R-project statistical package. A linear regression were used to create a mathematical model for the execution time. The following model was found:
Call: lm(formula = time ~ tag1 * tag2 * qrytype, data = B)Residuals: Min 1Q Median 3Q Max -36.9016 -5.8601 -2.8720 0.8972 794.9966 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 27.0779277 0.8396820 32.248 < 2e-16 *** tag1 -0.0626129 0.0144355 -4.337 1.45e-05 *** tag2 0.2447616 0.0144355 16.956 < 2e-16 *** qrytypeone-join -2.0632741 1.1874897 -1.738 0.0823 . tag1:tag2 0.0013108 0.0002482 5.282 1.29e-07 *** tag1:qrytypeone-join 0.1914354 0.0204149 9.377 < 2e-16 *** tag2:qrytypeone-join 0.1304854 0.0204149 6.392 1.68e-10 *** tag1:tag2:qrytypeone-join -0.0015868 0.0003510 -4.521 6.18e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 20.68 on 19992 degrees of freedom Multiple R-Squared: 0.2284, Adjusted R-squared: 0.2281 F-statistic: 845.5 on 7 and 19992 DF, p-value: < 2.2e-16
The three greatest coefficients are:
the more popular second tag, the longer execution time
when one-join query is used, additional time on tag1 is used
when one-join query is used, additional time on tag2 is used
The qrytypeone-join doesn’t show statistical significance. It’s probably the effect of some queries with one-join query that executed very fast. They are visible on the plot.
- One-join queries: Red
- Multi-join queries: Blue
The plot shows timings of the queries. The lower, the better. As you can see, the tag-frequency to time ratio relation is linear, but the slope is steeper for multi-join queries. You can see the same thing in the model as the tagX:qrytypeone-join coefficients.
Multi-join query is always at least as fast as one-join. Both queries perform essentially similar if both tags are of low frequency. The more frequent tags, the bigger difference between queries, in favor for multi-join query.