I was going to post about this interesting problem for a couple of months, but each and single time life would get on my way :). First I wanted to post a problem and wait to see whether somebody will come up with a solution. However, since I didn’t… let me just tell the whole story.
Let’s say, you have a table, in which you store e-mail messages, let’s call it all_messages. And similar to the google tags, each message might have multiple tags, which can be assigned to the message or removed from it at any time. The table all_tags_for_messages contains just a pair of columns (message_id, tag_id).
Obviously, the primary key for this table consists of both columns of the table, right?
Now imagine, you need to send all new records from this table to the data mart. Let’s say, each hour you need to send all new records which were added to this table – how you are going to select them? Neither message_id nor tag_id are ordered. Will you run “except” each time? how long it will take, when the table continues to grow?
The Data Warehouse developers asked to add some field which will allow to them to identify “newcomers”: either a serial primary key, which will be incremented each time a record is added, or a timestamp: created_at.
The application developers replied, that for them it would be much easier to have a new primary key. Now: what the database developers supposed to do? You have a primary key already. The table is huge. How you are going to switch the primary key without locking the table for a long time? (and you can’t really lock it for any visible time, because it’s being used non-stop).
Here is a solution, which one of our database developers came with.
First, we do not really need to change a primary key, it’s enough to have one additional unique field, which will be populated from a sequence. Now it will be easy to populate this field for new records. What about existing records?
Here is a proposed solution:
– create “main” sequence with a really big START value
– create a “temporal” sequence starting from 1
– use the “temporal” sequence for a background backfill of the new column
– when a backfill is completed, drop the sequence we do not need any more.
This seems trivial, but only after you’ve figured it out 🙂