A grouping performance problem

By now I’ve come across this problem at least thee times, each time I needed to build a relatively big rollup table, which would use a join of about ten tables, some of which may have several million rows, and others – several hundred thousand rows. Each time I needed to run something like this:


select
t1.a
,t1.b
,t2.a2
,t2.b2
,t3.a3
,t4.b4
,sum(t1.c) as c_sum
,max(t1.d) as d_max
,sum(case t1.e='a' then 1 else 0 end) as sum_a
from t1
join t2 (using join_id2)
join t3 (using join_id3)
join t4 (using join_id4)
group by t1.a
,t1.b
,t2.a2
,t3.a3
,t2.b2

In all cases the table t1 will be the only table with the “real” aggregation, and for the rest of the tables there would be exactly one line for each of the aggregated line of t1.

The problem: this query would run… forever.

Now, if I just separate the “aggregation” part:


select
t1.a
,t1.b
,sum(t1.c) as c_sum
,max(t1.d) as d_max
,sum(case t1.e='a' then 1 else 0 end) as sum_a
from t1 group by t1.a, t1.b

This one will run really fast, even though the table t1 would be the biggest table. Let’s add the rest of the tables to the join:


select
t1.a
,t1.b
,sum(t1.c) as c_sum
,max(t1.d) as d_max
,sum(case t1.e='a' then 1 else 0 end) as sum_a
from t1
join t2 (using join_id2)
join t3 (using join_id3)
join t4 (using join_id4)
group by t1.a, t1.b

This will run just a little bit slower, than the “aggregate only”. But then…

I would imagine, after all joins and groupings the rest simply does not matter, because “everything is there already”, and all data is already read from the database. But the moment I start to add these “singular” columns to the select list and group by clause, it starts to break:


select
t1.a
,t1.b
,t2.a2
,t2.b2
,sum(t1.c) as c_sum
,max(t1.d) as d_max
,sum(case t1.e='a' then 1 else 0 end) as sum_a
from t1
join t2 (using join_id2)
join t3 (using join_id3)
join t4 (using join_id4)
group by t1.a
,t1.b
,t2.a2
,t2.b2

This will run visibly slower… this one:


select
t1.a
,t1.b
,t2.a2
,t2.b2
,t3.a3
,sum(t1.c) as c_sum
,max(t1.d) as d_max
,sum(case t1.e='a' then 1 else 0 end) as sum_a
from t1
join t2 (using join_id2)
join t3 (using join_id3)
join t4 (using join_id4)
group by t1.a
,t1.b
,t2.a2
,t3.a3
,t2.b2

even more slower… till it just will run forever…

The work around is easy – to materialize the “aggregated” piece. But I still do not think, it’s a correct behavior… I have no idea, why it does what it does…

Advertisements

2 Comments

Filed under SQL

2 responses to “A grouping performance problem

  1. Lily

    Your aggregation part has less fields in group by, that’s why it runs fast. Isolated aggregation part will on t1 table only will be
    select
    t1.a
    ,t1.b
    ,t1.t2_id
    ,t1.t3_id
    ,sum(t1.c) as c_sum
    ,max(t1.d) as d_max
    ,sum(case t1.e=’a’ then 1 else 0 end) as sum_a
    from t1
    group by t1.t2_id, t.t3_id, t1.a, t1.b

    here t2_id, t.t3_id are keys to t2 and t3_tables. How fast this query is running ( if you add all your tables)?

    If it is still fast and with joins the query becomes slower, may be this will help:
    http://use-the-index-luke.com/sql/sorting-grouping/indexed-group-by

  2. That’s what I just said – if I add all the joins, it will be only a little bit slower. Which is normal :). Technically, after all joins are there, “everything is there”, and the grouping should happen in main memory with almost no overhead. Which is not the case.

    See – yes, there are more fields to group by, but this is not a “real” grouping (it won’t reduce the number of rows),so I would assume Postgres should figure this out

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s