Monthly Archives: September 2014
About a week before PG Open I had one performance disaster on our new platform. I’ve developed a function a while ago, which is generating a pretty sophisticated query and then executes it. And this function worked very fast, both query generation and execution took about 100 ms. Now imagine how I felt, when two days after this thing went in production, I saw this function execution at the very top of the offenders list! And the average execution time was almost 2 sec! I’ve started investigating, and could not figure out, what’s wrong.
I had a full copy of production database locally on my computer, and is was performing perfectly fine. More interestingly, I found that even EXPLAIN (without ANALYZE) was executing close to a second.
So after I exhausted all of my ideas, e-mailed to my favorite DBA (no-no, do not take my wrong, I love all of you guys, but…:)). So I’ve emailed to my favorite DBA asking for help. And you know what he found?
He found, that the default_statistics_target parameter on this Postgres cluster was set to 10,000 instead of our usual 1,000, and that’s why it took so long to analyze. And consequently, since the query in the function was generated each time, it had to be analyzed each time, and it took so long for the whole function.
I was so happy that the solution turned to be so simple; and now this is fixed on all clusters.
And, speaking about parameters… do you remember this post? At the conference I talked to people about it, and found out, that it was enough to increase the join_collapse_limit parameter! It was set to 8, and I’ve experimented to see, how much I can increase it not to kill the optimizer performance.
The problem with this particular cluster is, that there are lots of small “vertically partitioned” tables, so the average join would easily have 10 tables involved, and the moment the number of tables in the join will exceed the join_collapse_limit, the optimizer would just join them in the order they are listed….
Thus one more problem was solved, and this was a perfect timing, because this very day one of my coworkers asked for my advice on his troubled query, and turned out that increasing this parameter locally to 12 would bring the query execution time from 2 min down to 13 ms 🙂
So, the conference is over, and there are so many things I want to say about it! A year ago I talked at PG Open 2013, and it was my first talk after 15+ years of silence :). Back then I was glad I’ve got accepted, and I was so happy I was able to get the audience attention. I was also pleasantly surprised with the overall high level of the conference.
Many things have changed in course of this past year. First, I am not a novice anymore, and I can tell, that people actually consider me as a peer. Many people were approaching me and saying they remember me, that we met earlier, even not necessarily at this conference, and sometime I didn’t even remember them…
Another new thing was, that I talked Grant into presenting together with me, and I think it was a great idea. It was good, that we rehearsed properly in advance, and I think we managed to make our presentation quite dynamic, and keeping people awake and alert :). A great thing was, that last year only one person stayed after my presentation to talk to me, and this time there were five! And those were not the people, who were “just curios” – these people were actually facing similar problems in their production environment, and we shared some ideas. I felt, that now it’s not like “Hettie’s thing”, but a whole new direction of research. And this makes me feel really happy.
Also, I went around asking people whether they saw some problem(s) similar to what I’ve got. And in a couple of cases I was actually able to find the answer! I will post separately about a couple of amazingly easy solutions to my long-running problems.
I forgot to repost it early, and today, when I finally got to update my blog, I am shamelessly promoting myself;
Getting to Know: Tech
Want to know more about the folks who call Enova home? We asked Enova’s Software Engineering Technical Lead Henrietta Dombrovskaya our five favorite “get to know you better” questions, and here’s what she had to say:
How did you come to work at Enova?
A friend of mine sent me an email: “I know you only recently switched jobs, but the company my husband works at is looking for an experienced database administrator. Are you interested?”
While my move to Enova was motivated more by getting out of my old job than the new position, I quickly came to love it here! Both the people and the environment won me over very quickly!
What do you do at Enova?
My current role is the Database Development Tech Lead for Software Engineering, and believe it or not, what I actually spend most of my time doing is talking! I’m not sure how I ended up in this situation, but apparently I can do it quite well, despite not being a native speaker.
What’s the best part of working at Enova?
The freedom! It feels like I can do whatever I want at Enova, and pursue my interests every day! Luckily, it turns out that the projects I want to pursue are good for the company! I also love having the opportunity to participate in international conferences.
If you could live anywhere in the world, where would it be?
I believe it would still be Chicago. Oh, and I would love to have instant transportation to Helsinki!
What are your top three desert island must-haves?
- Computer with Internet access
Now I finally finished the book I was talking about for several times already – “12 essential skills of software architect”. And now I am even more confident, that all IT people should read this book, not only the people who want to become software architects. Because communication is not the strongest point of most of IT people.
The Chapter 4 of this book talks about leadership, and I would like to share some quotes, which precede this chapter:
Management is doing things right; leadership is doing the right things
– Peter F. Drucker, Father of Modern Management.
Leadership os the art of getting someone else to do something you want done because he wants to do it.
– Dwight D. Eisenhower
And one more quote from Eisenhower, which I love so much – I put it in my signature on Enova mail:
You don’t lead by hitting people over the head; this is an assault, not leadership.
Some time ago we encountered an interesting locking issue. To be honest it didn’t seem quite so “interesting” at the time when it occurred, because it happened in production, and looked like a disaster was about to happen. After we got this issue resolved, I was thinking about it, and for a while I though that this is actually a very non-typical situation, and that “Postgres should be smarter than that”. But the more I though about it, the more I was coming to the conclusion, that actually it’s not much Postgres could do under the circumstances.
Long story short, here is how it all happened. One of our developers was applying a change to production database, and a change was really simple – he needed to add one field to the table, and this field was NULL by default, and it should have taken a fraction of a second. Nevertheless, this transaction was hanging, and the lock waits started to pile up.
We immediately started to look at the sequence of lock waits, trying to figure out, what was holding what, and we found right away, that there was a COPY command for this table (because another developer was taking a database dump), and this COPY, which was running for several minutes, prevented the exclusive lock, which was required to add a column. But what surprised us, there were a couple of dozen (literally!) lock waits for shared lock, which were sitting there, not being granted, and we could not figure out, why.
So first thing we did – we killed this COPY command, and next we were going to kill this hanging implementation, but we realized, that it was already completed! The moment the COPY was killed, the lock was granted, and the column was added. We started to look at the rest of the locks, but they were all gone! So we realized, that this lock wait for an exclusive lock created this whole chain of waits.
I started to wonder, why these locks where not granted, when the copy was running, and realized, that the were not, because the exclusive lock wait was first in line. Then I though – why? Why Postgres could not grant all these read locks, while waiting for the COPY to finish? But … yes, that’s me, who knows that all “other” read locks will be short in duration, but how Postgres is supposed to know?! Each next shared lock request might initiate the similar long transaction. And if all these read locks will keep coming, then… the exclusive lock will never be granted! So actually, this was the only logical behavior on the Postgres side, and nothing could be better.
Do you want to know, what was my take on it? I’ve stopped taking production copies during the business hours!