Tag Archives: optimization

The second rejected paper: the ORIM again

Object-relational impedance mismatch is by far my favorite research topic, mostly because it has very practical applications. I would make an even stronger statement: the most rewarding optimization is the one where you reduce the number of SQL statements executed when a web page is rendered, and all of a sudden it loads 50 times faster (and I mean actually 50 times, not figuratively speaking!). It always looks like magic – and I haven't done anything!

That being said, ORMs are my worst enemies, and I am always looking for opportunities to promote better ways of communication between a database and an application. Most of the time the human factor turns out to be more important than the technological challenges, so I always think of these projects as battles.

At Braviant, however, for the first time in my professional career I had nobody to fight with over this issue – the app developers were completely on board with my approach from day one. This allowed us to develop something really cool and to optimize the interaction between the databases and the application to the point of absolute perfection. So when my husband suggested we write a short paper about this project, I had no doubt it would be accepted – after all, two of my previous papers on the same subject had been accepted to very serious conferences.

Life proved me wrong :). I am not going to name the conference or the workshop, but I have to make some comments about the reviews, so that the level of my frustration can be understood.

One of the reviewers asked why we think that the number of round trips defines the response time of a web application. Another reviewer asked whether we had tried MongoDB :))). And why we think that (de)serialization of JSON takes negligible time. And why we think Hibernate is worse.

I think the only valid objection was that the topic of the paper is not relevant to the workshop. And that might explain the whole story.

Several years ago, when I started attending database conferences again after fifteen years of absence, I observed that a significant number of the attendees had never seen real applications and had never dealt with performance problems. Fortunately, I have also met and gotten to know some really outstanding researchers, whom I admire and feel honored to be acquainted with, so… I am sure I will find the right place to showcase our work.

And maybe it's time to get back to my old "HDAT" workshop idea…

And for my fellow Chicagoans: I will be presenting this work this Tuesday, Feb 13 at the Chicago PUG meetup!



Filed under research

I am not sure what I fixed, but I’ve definitely fixed something

I've had this problem for a while. It's very difficult to describe, and even more difficult to report, because I do not know a good way to reproduce it.

The reason I am writing about it is that if somebody has ever had, or ever will have, a similar problem, then a) you know there is a way to fix it, and b) if more than one person is experiencing the same problem, together we can find the root cause.

So… when we import data from our external service providers' databases, we use an EC2 machine with a Postgres instance running on it as our "proxy". We have several foreign data wrappers installed on that EC2 instance, and all the external databases (which use different DBMSs) are mapped into the Postgres database, from where they are mapped to our Data Warehouse. The Data Warehouse resides on RDS, which means that only postgres_fdw is available.
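For reference, the chain looks roughly like this in SQL. This is just a minimal sketch with made-up names (provider_srv, proxy_srv, etl_user and so on are placeholders, and the FDW on the proxy side depends on the provider's DBMS – mysql_fdw below is only an example):

-- On the EC2 "proxy": map one provider's database through the FDW
-- matching its DBMS.
CREATE EXTENSION IF NOT EXISTS mysql_fdw;
CREATE SERVER provider_srv FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host 'provider.example.com', port '3306');
CREATE USER MAPPING FOR etl_user SERVER provider_srv
    OPTIONS (username 'reader', password '...');
CREATE FOREIGN TABLE provider.payments (
    payment_id  bigint,
    payment_ts  timestamp,
    amount      numeric
) SERVER provider_srv OPTIONS (dbname 'source_db', table_name 'payments');

-- On the RDS Data Warehouse: only postgres_fdw is available, so it points
-- at the proxy, which in turn points at the providers.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER proxy_srv FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'proxy-host.internal', dbname 'proxy_db');
CREATE USER MAPPING FOR etl_user SERVER proxy_srv
    OPTIONS (user 'etl_user', password '...');
IMPORT FOREIGN SCHEMA provider FROM SERVER proxy_srv INTO staging;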

We didn’t have any issues while we were only using this setup to refresh materialized views in our Data Warehouse. But recently we started to use the same proxy to communicate with one of the external databases from the OLTP database. And that’s when strange things started to happen.

They happen when we have a "complex" query, and that's the part I can't quantify. I can't say "if we have more than five external tables joined" or "if we have more than one join condition on more than two tables"… it just happens at some point. What happens? The query starts to return only the first row of the result set.

When I run the same query on the proxy, it returns the correct number of rows, so the specific FDW does not appear to be the problem. Then what? I do not know the answer. The way I've fixed it: I created a view on the proxy which joins all the tables I need, and mapped this view to the OLTP database. At first I was reluctant to do it, because I was sure that the conditions would not be pushed down correctly to the lowest level, and thus the query would be incredibly slow, but life proved me wrong :). It works beautifully – and very fast.
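In case it helps somebody hitting the same wall, here is a simplified sketch of the workaround (all the object and column names are invented for illustration, and proxy_srv stands for the postgres_fdw server pointing at the proxy):

-- On the proxy: perform the multi-table join once, in a plain view over
-- the foreign tables.
CREATE VIEW provider.payments_enriched AS
SELECT p.payment_id, p.payment_ts, p.amount, c.customer_id, c.state
  FROM provider.payments p
  JOIN provider.customers c ON c.customer_id = p.customer_id;

-- On the OLTP database: map the view as a single foreign table, so the
-- join itself never crosses the FDW boundary.
CREATE FOREIGN TABLE ext.payments_enriched (
    payment_id  bigint,
    payment_ts  timestamp,
    amount      numeric,
    customer_id bigint,
    state       text
) SERVER proxy_srv
  OPTIONS (schema_name 'provider', table_name 'payments_enriched');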

So for now the problem is solved, but I am still wondering what exactly causes the problem in the original query…


Filed under Data management, Development and testing, SQL

Optimizing something you can’t control

This is very much like finding Pluto! At Braviant, we use several external service providers to perform some business tasks. And, as I've mentioned in one of my presentations about our usage of foreign data wrappers, we need to manage data when we do not really own the data.

But this time around the task was even more complex, and I spent weeks trying to figure out how to approach it. There is one Really Large Table on the "other" side, and to refresh the Data Mart we need to select a small subset of records each time – basically "all records starting from the moment we refreshed last time".

For some reason unknown to me, something on the way from "them" to "us" did not work, and we could not push the condition to the external site. No matter what I was selecting, what was really happening (I figured it out by observing the query behavior closely) was that the whole table was fetched from the third-party server, and only then were the selection criteria applied.
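(For what it's worth, with postgres_fdw one way to see what actually gets shipped to the other side is EXPLAIN VERBOSE; the foreign table name below is hypothetical.)

-- If the WHERE clause does not show up in the "Remote SQL:" line of the
-- verbose plan, the filter is being applied locally, after the full fetch.
EXPLAIN (VERBOSE)
SELECT *
  FROM ext.provider_records
 WHERE updated_ts >= now() - interval '2 hours';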

The problem looked unsolvable, because "everything worked on the other side". Then I came up with one crazy idea. I thought: if we can't push our condition through, maybe we can create a similar condition on the other side.

So, I’ve asked our service provider tech support, whether they can create a view on their side, which would restrict the size of object, I am selecting from, Note, I’ve asked for just a view, not a materialized view. So it was literally “query is executed locally”. And then I’ve mapped this view to the foreign table, so there was no changes to reporting.

Yes, this view has way more records than I need (it contains the "last 24 hours"), while I refresh data every two hours. However, now I select from a much smaller data set, because the view contains only the last 24 hours, not the last two months!
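The whole change, stripped of specifics, amounts to something like this (the names are invented, and the provider's side is written here in Postgres syntax, although their DBMS may well use a different dialect):

-- On the provider's side: a plain (non-materialized) view over the Really
-- Large Table, limited to a rolling 24-hour window.
CREATE VIEW recent_transactions AS
SELECT *
  FROM transactions
 WHERE transaction_ts >= now() - interval '24 hours';

-- On our side: the foreign table is mapped to the view instead of the base
-- table, so every two-hour refresh scans one day of data, not two months.
CREATE FOREIGN TABLE ext.recent_transactions (
    transaction_id bigint,
    transaction_ts timestamp,
    amount         numeric
) SERVER provider_srv OPTIONS (table_name 'recent_transactions');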

… and now tell me, which optimizer would be able to perform this kind of optimization?!


Filed under Data management, SQL

Revisiting the old code (or not)

This blog is becoming more of a collection of mistakes I've made than anything else, but I believe learning from other people's mistakes is important. So each time I do something not-so-smart, I share it here.

I was not sure what to call this post, and I am still not sure the title reflects the contents, so let me proceed to the story :). It has been over a year since I started rewriting the pieces of an old system, one by one. Granted, in the very beginning I didn't know the data that well, so after a year in production I could rewrite most of them much better.

But what is more important, the data itself has changed as well. One of the important changes is that a year ago we were using two external service providers for loan processing, and for several months now we have not been using one of them (except for the servicing of old loans). But it turned out that I had a step in my code (which, BTW, had to be executed every two hours!) that would try to fill in the ID from this old system we are not using anymore – for all records which do not have this ID assigned. Which means (since we do not use this system) that every two hours I was scanning all records – for nothing!
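To make it concrete, the offending step boiled down to something like this (the table and column names are made up, and the real code was a loop rather than a single statement):

-- With the old provider retired, legacy_loan_id stays NULL for every new
-- record, so this anti-join re-reads the whole loans table every two hours
-- and updates nothing.
UPDATE loans l
   SET legacy_loan_id = o.loan_id
  FROM old_provider_loans o
 WHERE o.loan_number = l.loan_number
   AND l.legacy_loan_id IS NULL;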

After I commented out this loop, the execution time for the whole process became pretty much invisible.

… now – how should I title this post?!


Filed under Data management, Development and testing

The March Chicago PUG meetup is coming, and I am advertising myself again!

I have at least two optimization posts in my Drafts, but today I wanted to remind everybody that Thursday (not Wednesday!) is the day of the next Chicago PUG.

So, for those of you who vaguely remember my talks "Why the databases are slow?", for those of you who at some point in your work lives had fun working with me on web applications, for those with whom I had all these extended arguments about "what databases can do better"… this is your chance to experience it again :).

And for those of you who have never heard this talk… well, it's your chance to find out for the first time why databases are slow!

Here is the event:

https://www.meetup.com/Chicago-PostgreSQL-User-Group/events/237300112/

and if you plan to attend, please RSVP, otherwise I can’t guarantee there will be enough pizza!!!


Filed under events

One “emergency optimization”

Why is it an "emergency optimization"? Well, because you write some SQL, you debug and optimize it, it works fine, and it runs in production ten times during business hours producing some report… and all of a sudden it just stops working… well, not necessarily "stops", but starts to run more than a hundred times slower… and then you need to drop everything you are doing and optimize it ASAP. Because it's already in production!

Last Thursday I had several things which "stopped working" in the manner described above. I've written about one of them in my previous post, but that one was relatively easy to fix. The next one looked as trivial as it can be, but still I could not make it work!

Here is what the query looked like:

SELECT ...
   FROM payments p WHERE payment_date < current_date
        AND payment_id NOT IN (SELECT payment_id FROM bank_records)
        AND ...

There were more conditions, but they are not so important. This worked fine until a couple of days ago. What could be simpler? The payment_id field is indexed in both tables, and once again, this query should return a very small number of records. Also, the tables themselves are small – a little over 100,000 records. However, all of a sudden it started to run for minutes! When I looked at the execution plan, I saw that instead of doing a merge join utilizing the index, Postgres decided to read the whole bank_records table into main memory, "materialize" it, and then loop through this data set for each and every record of the payments table!

When I saw this, I immediately rewrote the SQL in my favorite way – with an OUTER JOIN:

SELECT ...
   FROM payments p
        LEFT OUTER JOIN bank_records b
        ON b.payment_id = p.payment_id
   WHERE payment_date < current_date
        AND b.payment_id IS NULL
        AND ...

This didn't help either – still a sequential scan and a line-by-line comparison. And then I decided to use "the last resort" – which I hate!

SELECT ...
   FROM payments p WHERE payment_date < current_date
        AND NOT EXISTS (SELECT payment_id FROM bank_records b
                        WHERE b.payment_id = p.payment_id)
        AND ...

This SQL just magically started to use the index, and the execution time went down from 8 minutes to 300 milliseconds…

Yes, it’s cool… but why?!


Filed under Data management, SQL

When things do not work like you expect them to work…

This post has a history. I did the original optimization about three weeks ago, then I wanted to cleanse my example of company-specific things, then I could not reproduce the original execution plan, and then, when I was finally ready to publish it… it stopped working in our production system! By "stopped working" I mean that it started to take about 90 seconds to execute… But by the time I was done with the optimization for the second time, I was able, I guess, to formulate some basic principles of optimizing this kind of query… so, without further ado – let's start!

Each time I talk about optimizing joins in Postgres, I advise people to include any additional selection criteria in the join condition, even if they are, technically speaking, not "join" conditions. The reason for this suggestion is that if the optimizer sees any way to pre-filter the records before or while joining, it will do so.

So I was trying to utilize this strategy while working on a query which would check for the following condition:

SELECT ...
FROM loan_contract lc
JOIN loan_details ld ON lc.loan_id = ld.loan_id
     AND contract_date >= current_date - interval '10 days'
WHERE (state = 'CA' AND rate > 10) OR (state = 'MO' AND rate > 15)

I knew that this SELECT should retrieve a very small number of records, if any, and therefore I was thinking that if I added this condition to the JOIN, it would execute faster, because Postgres would be able to filter out the records which are not needed before the join… but I was wrong!

SELECT ...
FROM loan_contract lc
JOIN loan_details ld ON lc.loan_id = ld.loan_id
     AND contract_date >= current_date - interval '10 days'
     AND ((state = 'CA' AND rate > 10) OR (state = 'MO' AND rate > 15))

Continue reading


Filed under SQL