I meant to write about this optimization story for a while, not because somebody can use it as an example, but rather as a counter-example, a case of “what I’ve tough you so far is all completely wrong and not working”.
It all have started with the story which I’ve described in this blog post: the problem was, that indexes on the remote site were not being used, and in order to make things work I had to create a remote function, which would “inject” the constants into a remotely executed query and then too map the results. A couple of weeks later I had to face a similar problem which had also manifested itself only in production environment. But that time around the performance degradation was so severe, that we had to rollback the changes.
First I thought that I will be able to deal with a similar problem in a similar way, but turned out that was not the case. The SQL involving remote tables was a little bit more complex, and the function didn’t help. I’ve realized that I needed to explicitly pass some constants to the dynamic query, which meant that I had to perform a crime against relational theory, braking one perfectly fine SQL into two, and passing the result of the first selection to the second statement. This performed better, but still not ideal. I continued to follow the same path. It became better. I’ve repeated it one more time. And one more time.
At the end it was a textbook example of how not to write SQL. My own lectures were always filled with this “how it will not perform” examples: first you select a person, then using the person id you select position, then… But what I did to optimize performance my function was the same thing almost to the letter. I needed to select all loans for a specific person and some of the loan-related details. And instead of writing one SQL I ended up doing the following:
- having a person_id, select all accounts
- generate a SQL with all these user accounts embedded as constants to retrieve loans
- generate a set of SQL statements, which would access remote information for each of the loans individually
- and actually there was more in the middle
A part of the problem is, that in PostgreSQL functions are not really stored in a truly “compiled” way, they should be rather viewed as “stored sequences of actions”. And parametrized statements, unlike those in different systems, are not optimized properly. That’s why in most of the cases I build dynamic SQL inside of the functions. But in the case I am describing here, things were even worse. Even when I was executing a SQL statement for a specific person id, the indexes on user account and loan where not really utilized (at least not always). So in order for me to push through the usage of indexed, I had to explicitly pass the values to the generated SQL…
It all worked at the end, and performance was amazing, but I can’t even convince myself to post code snippets – it looks ugly 🙂
Oh, and another part of the equation is, that in all those SELECT statement were selecting a really small portion of the data from each of the tables, so there was not that many total executions.
The moral of the story: trust, but verify!