About one very controversial optimization

I meant to write about this optimization story for a while, not because somebody can use it as an example, but rather as a counter-example, a case of “what I’ve tough you so far is all completely wrong and not working”.

It all have started with the story which I’ve described in this blog post: the problem was, that indexes on the remote site were not being used, and in order to make things work I had to create a remote function, which would “inject” the constants into a remotely executed query and then too map the results. A couple of weeks later I had to face a similar problem which had also manifested itself only in production environment. But that time around  the performance degradation was so severe, that we had to rollback the changes.

First I thought that I will be able to deal with a similar problem in a similar way, but turned out that was not the case. The SQL involving remote tables was a little bit more complex, and the function didn’t help. I’ve realized that I needed to explicitly pass some constants to the dynamic query, which meant that I had to perform a crime against relational theory, braking one perfectly fine SQL into two, and passing the result of the first selection to the second statement. This performed better, but still not ideal. I continued to follow the same path. It became better. I’ve repeated it one more time. And one more time.

At the end it was a textbook example of how not to write SQL. My own lectures were always filled with this “how it will not perform” examples: first you select a person, then using the person id you select position, then… But what I did to optimize  performance my function was the same  thing almost to the letter. I needed to select all loans for a specific person and some of the loan-related details. And instead of writing one SQL I ended up doing the following:

  • having a person_id, select all accounts
  • generate a SQL with all these user accounts embedded as constants to retrieve loans
  • generate a set of SQL statements, which would access remote information for each of the loans individually
  • and actually there was more in the middle

A part of the problem is, that in PostgreSQL functions are not really stored in a truly “compiled” way, they should be  rather viewed as “stored sequences of actions”.  And parametrized statements, unlike those in different systems,  are not optimized properly. That’s why in most of the cases I  build dynamic SQL inside of the functions. But in the case I am describing here,  things were even worse. Even when I was executing a SQL statement for a specific person id, the indexes on user account  and loan where not really utilized (at least not always).  So in order for me to push through the usage of indexed, I had to explicitly pass the values to the generated SQL…

It all worked at the end, and performance was amazing, but I can’t even convince myself to post code snippets – it looks ugly 🙂

Oh, and another part of the equation is, that in all those SELECT statement were selecting a really small portion of the data from each of the tables, so there was not that many total executions.

The moral of the story:  trust, but verify!

Advertisements

Leave a comment

Filed under Data management, Development and testing, SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s