A strange bug (or not?) in Postgres optimizer

Actually, I do believe it's a bug, but some people have commented that this is not a bug but a way to optimize queries. So, see for yourself.

Here is what I need to do: select all personal information (name, emails, all phone numbers) for any person whose phone number starts with a certain sequence of digits. Unlike in the previous case, the phone numbers are stored in a separate table, phones, which can hold multiple active phones per customer (each row records the phone type and the validity period). So, once again: once we find a customer who has at least one phone satisfying the search criterion, we select all phone numbers for him or her, matching and non-matching alike.
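
For reference, here is a minimal sketch of the schema these queries assume; the table and column names are taken from the queries below, while the types and the extra descriptive columns are my guesses:

CREATE TABLE people (
    person_id int PRIMARY KEY,
    full_name text                       -- guessed column
);

CREATE TABLE accounts (
    account_id int PRIMARY KEY,
    person_id  int REFERENCES people (person_id)
);

CREATE TABLE phones (
    account_id   int REFERENCES accounts (account_id),
    phone_number text,
    phone_type   text,                   -- guessed column
    end_date     date                    -- NULL means the phone is still active
);

CREATE TABLE emails (
    account_id int REFERENCES accounts (account_id),
    email      text,                     -- guessed column
    end_date   date                      -- NULL means the address is still active
);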

The SQL in question is generated by a Postgres function, and originally I planned to generate something like this:

SELECT ...
FROM people p
INNER JOIN accounts a ON p.person_id = a.person_id
...
INNER JOIN phones ph ON a.account_id = ph.account_id AND ph.end_date IS NULL
INNER JOIN emails eh ON a.account_id = eh.account_id AND eh.end_date IS NULL
INNER JOIN phones ph_s ON a.account_id = ph_s.account_id AND ph_s.end_date IS NULL
    AND ph_s.phone_number LIKE '84732%'

In this case the last joined table differs depending on the search condition. I was absolutely sure that the order of joins makes no difference to the optimizer (that's what I was always told!), and that the Postgres optimizer would be smart enough to figure out that the last join should be applied first. But no! The query was executing for several seconds, and when I looked at the execution plan, I saw that the joins were performed in the listed order!
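
You can check this yourself by running the generated query under EXPLAIN ANALYZE; a sketch, with an illustrative column list and the elided joins dropped:

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.person_id, ph.phone_number, eh.email
FROM people p
INNER JOIN accounts a ON p.person_id = a.person_id
INNER JOIN phones ph ON a.account_id = ph.account_id AND ph.end_date IS NULL
INNER JOIN emails eh ON a.account_id = eh.account_id AND eh.end_date IS NULL
INNER JOIN phones ph_s ON a.account_id = ph_s.account_id AND ph_s.end_date IS NULL
    AND ph_s.phone_number LIKE '84732%';
-- reading the plan tree from the innermost node outward shows the join order
-- the planner actually chose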

And when I changed the generated SQL to look like this:

SELECT ...
FROM people p
INNER JOIN accounts a ON p.person_id = a.person_id
INNER JOIN phones ph_s ON a.account_id = ph_s.account_id AND ph_s.end_date IS NULL
    AND ph_s.phone_number LIKE '84732%'
...
INNER JOIN phones ph ON a.account_id = ph.account_id AND ph.end_date IS NULL
INNER JOIN emails eh ON a.account_id = eh.account_id AND eh.end_date IS NULL

it worked just fine, with a total execution time of 30-40 ms!

Do you think it's a bug? I think so! And we should ask Mr. Haas about it!

 


11 responses to “A strange bug (or not?) in Postgres optimizer”

  1. lily

    I think it is a bug. The cause may be that the optimizer does not know how many records the condition LIKE '84732%' will return and does not take this into account during optimization. It would otherwise reorder the joins so that the join producing the fewest rows is executed first, but it cannot decide whether this LIKE clause makes the result smaller.
    It is only my guess; I don't know for sure.
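
    One way to test this guess is to compare the planner's row estimate for the LIKE condition with the actual row count (a sketch, assuming the phones table above):

    EXPLAIN ANALYZE
    SELECT * FROM phones WHERE phone_number LIKE '84732%';
    -- compare the estimated rows of the scan on phones with the actual rows;
    -- a large gap would point to a selectivity estimation problem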

  2. I also think it's a bug, especially because it has been stated publicly a number of times that the order SHOULD NOT matter. And actually I think I should submit it as a bug… but then what about my optimizations :)?

  3. CY

    Hi Hettie, I faced a similar problem before on PostgreSQL 9.1, and I was told on the channel that the order does not matter.

  4. Yes, but it does :), hence it’s a bug!

  5. Pingback: Another look at the optimizer parameters | The World of Data

  6. Rajshree

    Hi, I am going through the bug you mentioned and trying to reproduce it.
    Does the bug still exist in Postgres, or has it been fixed? In which Postgres version did you find the issue?

    • Actually, I experimented later and found that this can be cured by increasing the join_collapse_limit parameter. But you should be very careful with it, because most of the time increasing it to more than 12 decreases performance in general… I didn't experiment on 9.5, because by the time I was on 9.5, I was already with another company 🙂
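
      For example, at the session level (a sketch; the value is illustrative and just has to cover the number of joined tables):

      SET join_collapse_limit = 16;  -- the stock default is 8
      -- ... run the generated query here ...
      RESET join_collapse_limit;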

  7. Hi Henrietta,
    I am a colleague of Rajshree who put the comment above. I realise you have found a way to circumvent the situation but nevertheless we would still like to reproduce it if we can.
    I will really appreciate it if you can help with the following:
    1.) Do you think the size of the database, or the amount of data participating in the joins, has any role to play? Currently we are trying with 20 GB of data or so, and I want to know whether size is a consideration.
    2.) Does the number of join conditions have any role to play? I am assuming it does, since you circumvented the bug by using join_collapse_limit.
    3.) Anything else you can think of that can help us?

    Actually, our task is to look out for Optimizer bugs in PostgreSQL and we are looking at those that are not in the official bug list.

    Thanks so much for your time and consideration.

    Regards,
    Anoop

  8. Hi, here is how it works. The join_collapse_limit parameter, which you can set both at the system level and at the session level, defines the maximum number of joined tables for which the optimizer will actually search for the best join order. When a query joins more tables than this limit, the optimizer gives up on reordering and simply executes the joins in the order they are written.

    The default is 8; sometimes people tune it up or down (like in my case, when somebody adjusted it to make an OLTP system more efficient, which didn't play well). My experience is that raising it above 16 is rarely helpful and can increase the total execution time. I also found that in some cases it's better to set it to 1, which disables reordering entirely, and define the join order yourself, if you “know better than Postgres”.

    So, to replicate the behavior, you need to set this parameter to a value smaller than the number of joined tables in the query!
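
    For example, a minimal sketch using the illustrative tables from the top of the post:

    SET join_collapse_limit = 2;  -- smaller than the number of joined tables

    EXPLAIN
    SELECT p.person_id, ph.phone_number
    FROM people p
    INNER JOIN accounts a ON p.person_id = a.person_id
    INNER JOIN phones ph ON a.account_id = ph.account_id
        AND ph.phone_number LIKE '84732%';
    -- the plan should now perform the joins in the written order

    RESET join_collapse_limit;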

    Hope this helps!

  9. Hi Henrietta,
    Thanks so much.

    Regards,
    Anoop
