Some time ago we encountered an interesting locking issue. To be honest it didn’t seem quite so “interesting” at the time when it occurred, because it happened in production, and looked like a disaster was about to happen. After we got this issue resolved, I was thinking about it, and for a while I though that this is actually a very non-typical situation, and that “Postgres should be smarter than that”. But the more I though about it, the more I was coming to the conclusion, that actually it’s not much Postgres could do under the circumstances.
Long story short, here is how it all happened. One of our developers was applying a change to production database, and a change was really simple – he needed to add one field to the table, and this field was NULL by default, and it should have taken a fraction of a second. Nevertheless, this transaction was hanging, and the lock waits started to pile up.
We immediately started to look at the sequence of lock waits, trying to figure out, what was holding what, and we found right away, that there was a COPY command for this table (because another developer was taking a database dump), and this COPY, which was running for several minutes, prevented the exclusive lock, which was required to add a column. But what surprised us, there were a couple of dozen (literally!) lock waits for shared lock, which were sitting there, not being granted, and we could not figure out, why.
So first thing we did – we killed this COPY command, and next we were going to kill this hanging implementation, but we realized, that it was already completed! The moment the COPY was killed, the lock was granted, and the column was added. We started to look at the rest of the locks, but they were all gone! So we realized, that this lock wait for an exclusive lock created this whole chain of waits.
I started to wonder, why these locks where not granted, when the copy was running, and realized, that the were not, because the exclusive lock wait was first in line. Then I though – why? Why Postgres could not grant all these read locks, while waiting for the COPY to finish? But … yes, that’s me, who knows that all “other” read locks will be short in duration, but how Postgres is supposed to know?! Each next shared lock request might initiate the similar long transaction. And if all these read locks will keep coming, then… the exclusive lock will never be granted! So actually, this was the only logical behavior on the Postgres side, and nothing could be better.
Do you want to know, what was my take on it? I’ve stopped taking production copies during the business hours!