You probably know, what is a long query and what is a short query, but just to make sure we all are on the same page, here is a precise definition: a query is considered short, when the results can be obtained using a small portion of all records from the table(s), for example, to find a mailing address of one specific customer. And as you can guess, the query is considered long, if you need to read “almost all” data to obtain the result, for example, to find a total of all sales for the previous quarter.
Note, that if a query returns a very little number of records, it does not automatically mean, that this is a short query. Actually, it may return just one number, like this total sales for a quarter, but in order to obtain this number, you need to read all records for the quarter, so this is definitely a “long” query.
Why we need to know, whether a query is a “short” or a “long” one? Because generally speaking we need to utilize different optimization techniques in these two cases, and (generally) we pursue different optimization goals.
For short queries our goal is to truncated “almost all” records by means of selecting the most selective semi-join (BTW, do you know, what “semi-join is?), while for long queries the goal is to ready “almost everything” the fastest possible way.
And now – here is a story about how to turn a long query into a short query.
Last week there was a batch job timing out, and when we talk about batches we almost instantaneously think about “long” queries. So when the question popped up in the mailing list with a comment that this query was running for hours, I assumed it is a “long query”. Recently I almost never respond to the optimization questions in the mailing list in a hopeless attempt to concentrate on just one thing, which is the most important at that time. But the day the question was posted I was too upset and frustrated with one issue which didn’t go as planned, so I had to do something satisfying :).
The query included a join of eight tables, and the database developer who posted the question mentioned, that he knows which join should be executed first, but he is not sure how to make Postgres to choose a specific join order. I’ve checked the selectivity of different selection criteria, and he was right – the set of joins he wanted to execute first, would indeed limit the output to 1300 records, which is nothing! I’ve suggested he just execute this join as a sub-select (or inline view) , and since the number of records returned was so small, it didn’t really matter, how the rest of the joins would be executed! He first tried to resist:), saying that the problem is in other joins, but finally agreed to give it a try.
Turned out, that the total execution time of the whole batch query was reduced to the execution time of this sub-select, which was just 19 seconds! Seconds, not minutes!
So, it turned out, that what we originally thought to be a “long” query, was actually a “short” query, and the problem was solved by finding a correct semi-join. Should we figure this out earlier, we could probably solve it faster!
The moral of the story – check the query(es) selectivity!