Category Archives: Data management

Can you teach somebody to optimize?

I got a lot of feedback on my last blog post; one question was posted on another platform where I reblogged the same text, and it was so interesting that I decided to write a separate post in reply.

So tell me, Hettie, are these kinds of discoveries reported at conferences, and do they later become part of a common knowledge base? Will this sort of technique be taught in colleges? Overall, in your opinion, are today’s CS graduates more knowledgeable in this area? And, by the way, is there any special knowledge which is necessary to be able to resolve problems like this, or is it just a combination of basic knowledge plus experience?

Great question! I have been teaching optimization for almost 15 years, and in general my optimism on this subject is very modest. You can teach techniques, you can show tons of examples, and still there is no guarantee that a student who has attended your class will be able to correctly identify a similar pattern in real life and recall the specific technique that was advised for a similar problem. But maybe I am just not good at teaching this stuff.

It’s tempting to say that all that matters is years of practice, but that is not always the case either, since you, and I, and all of us can recall situations in which years of experience did not help. And to be honest, I do not want to wait for “years of experience”! I want to be able to hire a new grad who can optimize at a reasonable level. I am not saying I have never met CS grads like that, but whenever it happens, it is due to the person’s individual abilities and/or a desire to excel in this particular skill.

Let’s be clear: the kind of breakthrough I described in the previous post does not happen often. In fact, you might never get anything like it in your life. But there are still tons of optimizations which can be done almost every day.

I would still argue that knowing the basics is key. For the thousandth time over: you need to know your math, your calculus and your algebra in order to understand how databases work. You might not be aware of some sophisticated indexes, but you should be able to identify, looking at a query, what it is about, whether it is “short” or “long”. And if you try and try, and it does not become faster, you need all your convincing powers to convince yourself that this query can be optimized. There should be a way.
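
To illustrate the distinction, here is a minimal sketch with a hypothetical orders table (none of these names come from the post):

    -- A "short" query: selects a handful of rows for one customer;
    -- an index on (customer_id, order_date) is what makes it fast.
    SELECT order_id, order_date, amount
      FROM orders
     WHERE customer_id = 42
       AND order_date >= current_date - 30;

    -- A "long" query: aggregates most of the table, so a full scan
    -- is expected and the optimization game is entirely different.
    SELECT date_trunc('month', order_date) AS month,
           sum(amount) AS monthly_total
      FROM orders
     GROUP BY 1;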

Another big thing I am trying to teach is to write queries declaratively. This is an extremely challenging task, because most requirements are formulated in an imperative manner. So what I am trying to teach is that even if you find something like “previous three occurrences”, or “return to the previous status”, or “compare the top two” in the requirements, you can still write a declarative statement, carefully using CASE, GROUP BY and sometimes window functions. And it’s amazing how fast everything starts running right away. Most of the time, being able to reduce the number of table scans to one does the trick, except… well, except in the situation when you should do exactly the opposite.
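
For instance, “compare the top two” sounds imperative (fetch the latest row, fetch the one before it, subtract), but it can be expressed declaratively in a single scan. A minimal sketch, assuming a hypothetical payments table:

    -- For each customer, compare the two most recent payments
    -- without scanning the table twice.
    SELECT customer_id,
           max(amount) FILTER (WHERE rn = 1)
             - max(amount) FILTER (WHERE rn = 2) AS latest_minus_previous
      FROM (SELECT customer_id, amount,
                   row_number() OVER (PARTITION BY customer_id
                                      ORDER BY payment_date DESC) AS rn
              FROM payments) ranked
     WHERE rn <= 2
     GROUP BY customer_id;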

I haven’t figured out yet how to teach people to distinguish one from the other :). But the more I think about it, the more it seems like that’s what signifies that somebody can optimize, and that this skill is the most difficult to teach. Most optimization classes teach you how and when to use indexes, and how to choose the join order, but they do not teach you how to rewrite the query itself.

… As for the original question: no, I do not think they teach it in school. But I am trying to promote the idea!

Filed under Data management, SQL

How to make optimization a priority

One of my usual/perennial rants is that many managers will tell you something like “business needs this functionality, we will optimize later”. And we all know what happens “later”: nothing. The thing I love most about working at Braviant is that the tech team shares the same values; since I know quite well how often this is not the case, I appreciate it immensely. However, business might still think differently, and what’s good for business… well, that’s what we should do, since we all need to make money.

… After one of our recent releases our DBA noticed that one of the application functions had slowed down significantly: instead of executing in about 3 seconds, it started to take 7-8 seconds. And we were all immediately alarmed. You might wonder why: 7 seconds is a good execution time, perfectly acceptable for the end users, especially because this function is not executed that often. Well… didn’t I just say our tech team agrees on priorities? We believe that a good user experience includes fast response times, and thereby our applications time out at 10 seconds. And if a function’s average execution time is over 7 seconds, the peak time can easily reach 10 seconds!

I had to produce a miracle… the sooner the better. Because, as you can imagine, I usually do not write bad queries. Well, most of the time :). Which meant I had to find some unusual optimization.

To tell the truth, I knew right away why this function had started to perform slower. We had added one new field to the distributed query (which was required by business stakeholders, of course!), and to select this additional field I needed to join one more remote table. And all of a sudden, although all the required indexes were in place, the optimizer would choose a full table scan. Of a huge table!

There is not much I can do to explain to the optimizer that it is wrong (are optimizers male or female, what do you think? 🙂 They are male in my mother tongue, which explains a lot: they are always sure they know better how to make things work!). So I had to put this optimizer in a situation where there would be no way other than to utilize the indexes which were out there. At first I thought it would be relatively easy: in all previous cases when a similar issue occurred, I would create a view on the remote node, but this time it didn’t work. I conducted several experiments and came to the conclusion that the only way to make it work was to implement one new technology, which I had played with a couple of months ago but never implemented in production.

So…
– testing
– making sure it does what I want
– mapping it through the foreign data wrapper and making sure it works
– creating a new version of the function
– testing in lower environments
– QA
– implementing on staging and QA on both our products
– deploying to production on both products

Total execution time of the process described above: around 3 hours.

Result: everybody is happy, and we’ve got one more groundbreaking technology, which I can guarantee nobody else in the world is using (because the documentation clearly says it’s impossible :)), and which we will be able to use in many other cases to reduce execution time. And all because we have our priorities right!

P.S. Are you anxious to find out what this technology is? Come to 2Q PgConf in Chicago this December!

Filed under Companies, Data management, Development and testing, SQL

If we need this data for reporting purposes only…

I always thought I had a well-established opinion on separating application needs from reporting needs. So whenever I was asked to add some attribute to the production system which was “necessary for reporting”, I would always say: we will build this data in the reporting system.

So at first I could not believe what I had just said, when I said something to the effect of “I will create this table, which the application does not need, and I will use it to record data which we need for reporting”. Literally, after I said this I thought: what in the world am I saying?! But then I was very positive that we indeed needed this data to be recorded, and I kept thinking about it…

And then I realized that the “reporting” I was referring to is not really reporting. We just historically use the reporting system to go for this information, but in reality those are not analytical reports but operational reports, which raise alerts when some processing errors occur, mostly in the daily batch jobs. And then I started to think: should we perhaps treat those operational reports differently? They are not processing big data volumes; they are actually exception reports, and maybe we even need to create an “exception processing service” in our production system?

Yes, just to prove that I am right, of course 🙂

So maybe the distinction to make is between exception processing and data which is genuinely required for reporting. I think it makes a lot of sense.

Filed under Data management, Systems

There is nothing more permanent than a temporary solution!

I like this saying, because it always reminds me of the danger of doing something “just for now”, and I often cite it to my coworkers when I argue for a better, more stable solution. However, sometimes I revert to this kind of tactics myself, and then…

Yesterday one of our external service providers was moving our data to another server, and as a result I had to recreate the foreign servers associated with it, and cascade-recreate all the foreign tables. I’d say I am reasonably organized: all my data definitions are stored in a GitHub repository, so in an event like yesterday’s I just need to rerun a DDL script.
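
For what it’s worth, such a rerunnable script might look roughly like this: a sketch assuming postgres_fdw, with made-up server, schema and credential names:

    -- Dropping the server cascades to every dependent foreign table,
    -- so the script can be rerun from scratch at any time.
    DROP SERVER IF EXISTS provider_srv CASCADE;

    CREATE SERVER provider_srv
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'new-provider-host', dbname 'provider_db', port '5432');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER provider_srv
        OPTIONS (user 'etl_user', password '...');

    -- Recreate all the foreign tables in one shot.
    CREATE SCHEMA IF NOT EXISTS provider_data;
    IMPORT FOREIGN SCHEMA public
        FROM SERVER provider_srv
        INTO provider_data;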

Which I did. And half an hour later I started one of my daily jobs, which sometimes crashes, which is why I always keep a close eye on it. It crashed that time as well, but for an unexpected reason: one of the foreign tables it used could not find its source… The name didn’t look familiar to me, and it was definitely not created in the morning. I searched the GitHub repository with no luck, then searched the “suspicious” directories on my computer. Finally, I started a global search on my computer and iCloud. And I found the missing definition! Guess what the file it was stored in was called? – temp_fix.sql

🙂

Filed under Data management, Development and testing, SQL

Don’t forget about transactions – even when you do not write anything

A couple of months ago we started to run a job which collects execution statistics in our OLTP database. We had been running a similar job in our reporting system for a while, but there was a significant difference: which SELECTs we would consider to be long-running.

In the reporting system you expect things to be a little bit slower, so I would not care about SELECT statements which run for less than a minute, and for the longer ones it was enough to collect stats once a minute. Which meant that we could schedule the execution using our cron-like SQL-schedule-running system.

Not the case for the OLTP database. There we would consider SQL statements running for 30 seconds unacceptably slow, so we definitely wanted to monitor them. But what about the 1-minute scheduler granularity? I can’t run a shell script in our scheduling system; it was designed for SQL execution only.

Then I thought I had the smartest idea ever: I suggested we run a loop inside the function, and pass the number of seconds it would sleep between readings of the database stats as a function parameter… and I thought it was working; I thought so for a while. There were other small issues I needed to address, and I was busy fixing them. And then I realized that something was wrong with my monitoring: the execution times of long-running transactions were suspiciously “even”, lasting 55 seconds, or 1 minute 55 seconds… I was staring at the code… and suddenly understood what was wrong. Then I quickly ran an experiment which confirmed my suspicions.
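
To make the setup concrete, here is roughly what such a loop might look like (the function and table names are my guesses, not the actual code); the title of this post is the hint:

    CREATE OR REPLACE FUNCTION collect_activity(sleep_seconds int,
                                                iterations int)
    RETURNS void AS
    $$
    BEGIN
        FOR i IN 1..iterations LOOP
            INSERT INTO activity_log
            SELECT clock_timestamp(), pid, state, query_start, query
              FROM pg_stat_activity;
            PERFORM pg_sleep(sleep_seconds);
            -- The whole loop runs inside ONE transaction, and Postgres
            -- keeps the first statistics snapshot it took for the rest
            -- of that transaction (unless pg_stat_clear_snapshot() is
            -- called), so every iteration sees the same picture.
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;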

Did you realize what happened? Continue reading!

Filed under Data management, SQL

I am not sure what I fixed, but I’ve definitely fixed something

I have had this problem for a while. It’s very difficult to describe, and even more difficult to report, because I do not know a good way to reproduce it.

The reason I am writing about it is that if somebody ever had or will have a similar problem, then a) you know there is a way to fix it, and b) if there is more than one person experiencing the same problem, together we can find the root cause.

So… when we import data from our external service providers’ databases, we use an EC2 machine with a Postgres instance running on it as our “proxy”. We have several foreign data wrappers installed on the said EC2 instance, and all the external databases (which use different DBMSs) are mapped to the Postgres database, from where they are mapped to our Data Warehouse. The Data Warehouse resides on RDS, which means that only the Postgres FDW is available.
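
In other words, it’s a two-hop chain. A minimal sketch of the idea; the choice of tds_fdw and every name here are assumptions for illustration only:

    -- On the EC2 proxy: map an external, non-Postgres database
    -- (tds_fdw is just one example of such a wrapper).
    CREATE EXTENSION tds_fdw;
    CREATE SERVER provider_src
        FOREIGN DATA WRAPPER tds_fdw
        OPTIONS (servername 'provider-host', database 'provider_db');

    -- On the RDS Data Warehouse: only postgres_fdw is available,
    -- so it points at the proxy, not at the external database itself.
    CREATE EXTENSION postgres_fdw;
    CREATE SERVER proxy_srv
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'proxy-host', dbname 'proxy_db');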

We didn’t have any issues while we were only using this setup to refresh materialized views in our Data Warehouse. But recently we started to use the same proxy to communicate with one of the external databases from the OLTP database. And that’s when strange things started to happen.

They happen when we have “a complex” query, and that’s the part I can’t quantify. I can’t say “if we have more than five external tables joined” or “if we have more than one join condition on more than two tables”… it just happens at some point. What happens? The query starts to return only the first row of the result set.

When I run the same query on the proxy, it returns the correct number of rows, so the specific FDW does not appear to be the problem. Then what? I do not know the answer. The way I fixed it: I created a view on the proxy which joins all the tables I need, and mapped this view to the OLTP database. At first I was reluctant to do it, because I was sure that the conditions would not be pushed down correctly to the lowest level, and thus the query would be incredibly slow, but life proved me wrong :). It works beautifully, and very fast.
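
Sketched with made-up object names (and reusing the hypothetical proxy_srv server from the sketch above), the workaround looks like this:

    -- On the proxy: one view that performs all the joins locally.
    CREATE VIEW combined_provider_data AS
    SELECT a.id, a.status, b.created_at, c.amount
      FROM ext_accounts a
      JOIN ext_events   b ON b.account_id = a.id
      JOIN ext_payments c ON c.account_id = a.id;

    -- On the OLTP side: a foreign table over that single view,
    -- so the remote query is no longer "complex".
    CREATE FOREIGN TABLE combined_provider_data (
        id          bigint,
        status      text,
        created_at  timestamptz,
        amount      numeric
    )
    SERVER proxy_srv
    OPTIONS (table_name 'combined_provider_data');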

So, for now the problem is solved, but I am still wondering what exactly causes it in the original query…

Filed under Data management, Development and testing, SQL