A couple of months ago we started to run a job, which collects the execution statistics in our OLTP database. We’ve been running a similar job in our reporting system for a while, but there was a significant difference – which SELECTS we would consider to be long -running.
In the reporting system you expect things to be a little bit slower, so I would not care about SELECT statements, which run less than a minute. And for the longer ones if was enough to collect stats once a minute. Which meant, that we could schedule the execution using our cron-like sql-schedule-running system.
Not the case for the OLTP database. There we would consider the 30-sec running SQL statements unacceptably slow, so we definitely wanted to monitor them. But what about 1-min scheduler granularity? I can’t run a shell script in our scheduling system, it was designed for SQL execution only.
Then I though I’ve got the smartest idea ever – I suggested we should run a loop inside the function, and pass a number of seconds it would sleep between reading the database stats as a function parameter… and I thought it was running – I thought it for a while. There were other small issues I needed to address, and I’ve being fixing them. And then I’ve realized that something was wrong with my monitoring – the execution time for long-running transaction was suspiciously “even”, lasting for 55 sec, or 1 min 55 sec… I was staring at the code… and suddenly understood, what was wrong. Then I quickly ran an experiment, which confirmed my suspicions.
Did you realize what have happened? Continue reading!