A grouping performance problem

By now I’ve come across this problem at least thee times, each time I needed to build a relatively big rollup table, which would use a join of about ten tables, some of which may have several million rows, and others – several hundred thousand rows. Each time I needed to run something like this:

select t1.a ,t1.b ,t2.a2 ,t2.b2 ,t3.a3 ,t4.b4 ,sum(t1.c) as c_sum ,max(t1.d) as d_max ,sum(case t1.e='a' then 1 else 0 end) as sum_a from t1 join t2 (using join_id2) join t3 (using join_id3) join t4 (using join_id4) group by t1.a ,t1.b ,t2.a2 ,t3.a3 ,t2.b2

In all cases the table t1 will be the only table with the “real” aggregation, and for the rest of the tables there would be exactly one line for each of the aggregated line of t1.

The problem: this query would run… forever.

Now, if I just separate the “aggregation” part:

select t1.a ,t1.b ,sum(t1.c) as c_sum ,max(t1.d) as d_max ,sum(case t1.e='a' then 1 else 0 end) as sum_a from t1 group by t1.a, t1.b

This one will run really fast, even though the table t1 would be the biggest table. Let’s add the rest of the tables to the join:

select t1.a ,t1.b ,sum(t1.c) as c_sum ,max(t1.d) as d_max ,sum(case t1.e='a' then 1 else 0 end) as sum_a from t1 join t2 (using join_id2) join t3 (using join_id3) join t4 (using join_id4) group by t1.a, t1.b

This will run just a little bit slower, than the “aggregate only”. But then…

I would imagine, after all joins and groupings the rest simply does not matter, because “everything is there already”, and all data is already read from the database. But the moment I start to add these “singular” columns to the select list and group by clause, it starts to break:

select t1.a ,t1.b ,t2.a2 ,t2.b2 ,sum(t1.c) as c_sum ,max(t1.d) as d_max ,sum(case t1.e='a' then 1 else 0 end) as sum_a from t1 join t2 (using join_id2) join t3 (using join_id3) join t4 (using join_id4) group by t1.a ,t1.b ,t2.a2 ,t2.b2

This will run visibly slower… this one:

select t1.a ,t1.b ,t2.a2 ,t2.b2 ,t3.a3 ,sum(t1.c) as c_sum ,max(t1.d) as d_max ,sum(case t1.e='a' then 1 else 0 end) as sum_a from t1 join t2 (using join_id2) join t3 (using join_id3) join t4 (using join_id4) group by t1.a ,t1.b ,t2.a2 ,t3.a3 ,t2.b2

even more slower… till it just will run forever…

The work around is easy – to materialize the “aggregated” piece. But I still do not think, it’s a correct behavior… I have no idea, why it does what it does…

2 responses to “A grouping performance problem”

Lily

February 23, 2015 at 8:56 pm

Your aggregation part has less fields in group by, that’s why it runs fast. Isolated aggregation part will on t1 table only will be
select
t1.a
,t1.b
,t1.t2_id
,t1.t3_id
,sum(t1.c) as c_sum
,max(t1.d) as d_max
,sum(case t1.e=’a’ then 1 else 0 end) as sum_a
from t1
group by t1.t2_id, t.t3_id, t1.a, t1.b

here t2_id, t.t3_id are keys to t2 and t3_tables. How fast this query is running ( if you add all your tables)?

If it is still fast and with joins the query becomes slower, may be this will help:
http://use-the-index-luke.com/sql/sorting-grouping/indexed-group-by

Henrietta Dombrovskaya

February 23, 2015 at 10:51 pm

That’s what I just said – if I add all the joins, it will be only a little bit slower. Which is normal :). Technically, after all joins are there, “everything is there”, and the grouping should happen in main memory with almost no overhead. Which is not the case.

See – yes, there are more fields to group by, but this is not a “real” grouping (it won’t reduce the number of rows),so I would assume Postgres should figure this out

	Hettie D. on PostgreSQL 17 features I am mo…
	holgerjakobs on PostgreSQL 17 features I am mo…
	Hettie D. on The Optimization Book Second E…
	Dmitriy S. on The Optimization Book Second E…
	Hettie D. on The Optimization Book Second E…

A grouping performance problem

2 responses to “A grouping performance problem”

Leave a comment Cancel reply

Pages

Recent Posts

Recent Comments

Archives

Categories

Follow Blog via Email

Meta

A grouping performance problem

Share this:

Related

2 responses to “A grouping performance problem”

Leave a comment Cancel reply

Pages

Recent Posts

Recent Comments

Archives

Categories

Follow Blog via Email

Meta