Tour of California 2009, Stage 2, Sausalito

The Tour of California Stage 2 started in Sausalito, went over the Golden Gate Bridge, then started for real and headed to Santa Cruz. Since the start was a few blocks away I got up and wandered down the street to check it out. Fairly exciting, and a much larger crowd than I would expect for such an early hour on a holiday in crappy rainy weather. I saw Lance Armstrong, which was cool; hopefully his comeback works out well for him. All the riders are in such stupendous shape it's pretty crazy to even think about. Though this time, with the weather, I was happy to go back home. :)

Postgres Partitioning and the Query Planner

I love that Postgres supports partitioning, even if partitions aren't the cleanest things to set up and maintain. However, I have run into some issues with the query planner and partitioned tables. The gist of it is that when tables are partitioned the query planner can get quite confused and take an endless amount of time to run what should be a 'simple' query. That isn't in and of itself a problem; query planners notoriously pick less-than-great plans. The problem is that in most cases you can explicitly tell the planner to do things 'the right way' (well, your way, anyway), and with partitioned tables you really can't.
I have a key/value table organized by group, and I have it partitioned by id in order to limit the total number of rows in any one table (see Postgres Partitioning and Hibernate for one way to do that). My goal is to denormalize this table into the basis for a set of data warehouse tables. In order to denormalize the key/value table I have to do various subselects against the keys I'm interested in for the data warehouse. However, the rows are organized by group_id (a group that the keys belong to, maybe 10 keys per group on average) and the denormalized table will use group_id, so searching by key results in a large scan of all the child tables for each subselect. Ideally there would be a single scan gathering up the data one cares about (a la GROUPING SETS), or in my case a UNION of the results of my query on each child table would work too.
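
To make "a single scan" concrete: against the example foo table defined below, the denormalization can be written as a pivot using conditional aggregation instead of one subselect per key. This is only a sketch of the shape of query I would like to end up with (it is not real GROUPING SETS, and the keys 'A' and 'B' stand in for the real warehouse keys), but the table, partitioned or not, only gets read once no matter how many keys are pulled out:

-- sketch only: gather the key/value pairs for a group in a single pass
-- instead of one subselect (and one scan of every child table) per key
select group_id,
       max(case when key = 'A' then value end) as a_value,
       max(case when key = 'B' then value end) as b_value
from foo
where key in ('A', 'B')
group by group_id;

Unlike the joins below, a group that is missing one of the keys shows up here with a NULL in that column instead of being dropped, which may or may not be what you want.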

A simple example shows the issue:

create table FOO (
id integer primary key,
group_id integer,
key varchar(20),
value varchar(20)
);

insert into foo values (1, 1, 'A', 'sdf');
insert into foo values (2, 1, 'B', 'onv');
insert into foo values (3, 2, 'A', 'now');
insert into foo values (14, 2, 'B', 'doe');
insert into foo values (15, 2, 'C', 'mns');

EXPLAIN
select fooA.group_id, fooA.value, fooB.value FROM
(select * from foo where key='A') as fooA,
(select * from foo where key='B') as fooB
WHERE
fooA.group_id = fooB.group_id;

                           QUERY PLAN
-----------------------------------------------------------------
 Hash Join  (cost=16.66..33.31 rows=1 width=120)
   Hash Cond: (public.foo.group_id = public.foo.group_id)
   ->  Seq Scan on foo  (cost=0.00..16.62 rows=3 width=62)
         Filter: ((key)::text = 'A'::text)
   ->  Hash  (cost=16.62..16.62 rows=3 width=62)
         ->  Seq Scan on foo  (cost=0.00..16.62 rows=3 width=62)
               Filter: ((key)::text = 'B'::text)

 group_id | value | value
----------+-------+-------
        1 | sdf   | onv
        2 | now   | doe
(2 rows)

drop table foo;

create table FOO (
id integer primary key,
group_id integer,
key varchar(20),
value varchar(20)
);

CREATE TABLE foo_1 (CHECK ( id >= 0 AND id < 10 ) ) INHERITS (foo);
CREATE TABLE foo_2 (CHECK ( id >= 10 AND id < 20 ) ) INHERITS (foo);

insert into foo_1 values (1, 1, 'A', 'sdf');
insert into foo_1 values (2, 1, 'B', 'onv');
insert into foo_1 values (3, 2, 'A', 'now');
insert into foo_2 values (14, 2, 'B', 'doe');
insert into foo_2 values (15, 2, 'C', 'mns');

EXPLAIN
select fooA.group_id, fooA.value, fooB.value FROM
(select * from foo where key='A') as fooA,
(select * from foo where key='B') as fooB
WHERE
fooA.group_id = fooB.group_id;
                                QUERY PLAN
-----------------------------------------------------------------------------
 Hash Join  (cost=49.99..99.99 rows=9 width=120)
   Hash Cond: (public.foo.group_id = public.foo.group_id)
   ->  Append  (cost=0.00..49.88 rows=9 width=62)
         ->  Seq Scan on foo  (cost=0.00..16.62 rows=3 width=62)
               Filter: ((key)::text = 'A'::text)
         ->  Seq Scan on foo_1 foo  (cost=0.00..16.62 rows=3 width=62)
               Filter: ((key)::text = 'A'::text)
         ->  Seq Scan on foo_2 foo  (cost=0.00..16.62 rows=3 width=62)
               Filter: ((key)::text = 'A'::text)
   ->  Hash  (cost=49.88..49.88 rows=9 width=62)
         ->  Append  (cost=0.00..49.88 rows=9 width=62)
               ->  Seq Scan on foo  (cost=0.00..16.62 rows=3 width=62)
                     Filter: ((key)::text = 'B'::text)
               ->  Seq Scan on foo_1 foo  (cost=0.00..16.62 rows=3 width=62)
                     Filter: ((key)::text = 'B'::text)
               ->  Seq Scan on foo_2 foo  (cost=0.00..16.62 rows=3 width=62)
                     Filter: ((key)::text = 'B'::text)

As you can see, the cost estimate for the partitioned table is significantly worse, and as you add more rows and more partitions the difference goes off the charts. In my real-world use case I have 50 child tables with a million rows each and 9 keys I'm pulling out, and the lower bound of the cost estimate was a 50 digit number. Talking to the folks on the #postgres IRC channel, there doesn't seem to be any reasonable way to deal with this, so what I ended up doing was running my query against each child table individually from my Java code, which was very, very fast. There is talk of GROUPING SETS coming in 8.4 or 8.5, which will be nice.
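
To make the workaround concrete, here is a rough sketch of what it boils down to in plain SQL; my actual version just loops over the child table names from Java and stitches the rows back together by group_id in the application:

-- one cheap, well-planned scan per child table; the application then groups
-- the returned rows by group_id and emits one denormalized row per group
select group_id, key, value from foo_1 where key in ('A', 'B');
select group_id, key, value from foo_2 where key in ('A', 'B');

Because each statement names a child table directly, the planner never has to reason about an Append on either side of a join, and each scan gets the same cheap plan as the un-partitioned case.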