How I did what I never do to optimize…

So… all of you guys who were ever taking my classes, and all of you guys who read my optimization posts, you all know, that one of my favorite topics is – the TEMP tables must die!

How many times did I repeat, that one should never materialize the intermediate results of the SELECT statement? How many times did I tell, that it takes extra read/write operations, which cost the most, and plus – that this locks the order of joins, thereby limiting the optimization options?

Well… a couple of weeks ago I had to do exactly the opposite of what I usually teach! That is – in order to improve performance of the critical function I had to create an intermediate permanent table!

That is how it happened.

The function is building the “delta” – the set of changed records to feed a data mart. Each of the records for this feed is build from several OLTP database records, and I need to include the new record in the output each time at least one of those got changed.

Technically the select statement should look like this:


select
from people p
inner join addresses a (using person_id)
inner join phones ph (using person_id)
inner join email_addresses e (using person_id)
inner join employment em (using person_id)
where p.updated_at between t1 and t2
or p.updated_at between t1 and t2
or a.updated_at between t1 and t2
or ph.updated_at between t1 and t2
or e.updated_at between t1 and t2
or em.updated_at between t1 and t2

Do you know how long this will take providing each time several hundred records are selected? Forever!

Do you know, why? Because even though each of the updated_at fields is indexes, and even though in general the condition by the timestamp is very selective, the OR – ing makes it impossible to limit any of the participating tables enough.

The next idea I had – to pre-select the person_id list which has changed. The code started to look like this:


select
from people p
inner join addresses a (using person_id)
inner join phones ph (using person_id)
inner join email_addresses e (using person_id)
inner join employment em (using person_id)
where p.person_id in (
select person_id from people where updated_at between t1 and t2
union
select person_id from addresses where updated_at between t1 and t2
union
select person_id from phones where updated_at between t1 and t2
union
select person_id from email_addresses where updated_at between t1 and t2
union
select person_id from employment where updated_at between t1 and t2)

What happened here: Postgres refused to use the PK on the people table to select the person_ids, which were pre-selected in the UNION.

After that I had the only option left. I’ve created the permanent table people_changed, which I truncate each time I execute this function. This table is indexed by person_id, and the first step of this function execution is, that I insert the result of the UNION into this table, and then execute:


select
from people p
inner join addresses a (using person_id)
inner join phones ph (using person_id)
inner join email_addresses e (using person_id)
inner join employment em (using person_id)
inner join people_changed pc (using person_id)

And this works perfectly!

Advertisements

Leave a comment

Filed under SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s