Tag Archives: optimization

Revisiting the old code (or not)

This blog is becoming more of a collection of mistakes I’ve made than anything else, but I believe learning from other people’s mistakes is important. So each time I do something not-so-smart, I share it here.

I was not sure what to call this post, and I’m still not sure the name reflects the contents, so let me just proceed to the story :). It has been over a year since I started rewriting pieces of an old system, one by one. Granted, in the very beginning I didn’t know the data so well, so after a year in production I could rewrite most of them much better.

But what is more important, the data itself has changed as well. One of the important changes was that a year ago we were using two external service providers for loan processing, and for several months now we have not been using one of them (except for servicing the old loans). But it turned out that I had a step in my code (which, BTW, had to be executed every two hours!) that would try to fill in the ID from this old system we are not using anymore – for all records which do not have this ID assigned yet! Which means (since we do not use this system) that every two hours I was scanning all the records – for nothing!
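Schematically, the offending step looked something like the statement below. This is just an illustration – the table, column, and function names are made up, not our actual schema.

-- runs every two hours; scans every record that never got a legacy ID, which,
-- now that the old provider is gone, is every new record – forever
UPDATE loans
   SET legacy_provider_id = legacy_provider_lookup(loan_id) -- hypothetical call-out to the old provider
 WHERE legacy_provider_id IS NULL;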

After I commented out this loop, the execution time for the whole process became pretty much invisible.

… now – how should I title this post?!


Filed under Data management, Development and testing

The March Chicago PUG meetup is coming, and I am advertising myself again!

I have at least two optimization posts in my Drafts, but today I wanted to remind everybody that Thursday (not Wednesday!) is the day of the next Chicago PUG.

So, for those of you who vaguely remember my talk “Why the databases are slow?”, for those of you who at some point in your working lives had fun working with me on web applications, for those with whom I had all these extended arguments about “what databases can do better”… that’s your chance to experience it again :).

And for those of you who never heard this talk… well, it’s your chance to find out for the first time why databases are slow!

Here is the event:

https://www.meetup.com/Chicago-PostgreSQL-User-Group/events/237300112/

and if you plan to attend, please RSVP, otherwise I can’t guarantee there will be enough pizza!!!


Filed under events

One “emergency optimization”

Why is it an “emergency optimization”? Well, because you write some SQL, you debug and optimize it, it works fine, it runs in production ten times during business hours producing some report… and all of a sudden it just stops working… well, not necessarily “stops”, but starts to run more than a hundred times slower… and then you need to drop everything you are doing and optimize it ASAP. Because it’s already in production!

Last Thursday I had several things which “stopped working” in the manner described above. I wrote about one of them in my previous post, but that one was relatively easy to fix. The next one looked as trivial as it can be, but still I could not make it work!

Here is what the query looked like:

SELECT ...
   FROM payments p WHERE payment_date < current_date
        AND payment_id NOT IN (SELECT payment_id FROM bank_records)
        AND ...

There were more conditions, but they are not so important. This worked fine until a couple of days ago. What could be simpler? The payment_id field is indexed in both tables, and once again, this query should return a very small number of records. Also, the tables themselves are small, a little over 100,000 records. However, all of a sudden it started to run for minutes! When I looked at the execution plan, I saw that instead of doing a merge join utilizing the index, Postgres decided to read the whole bank_records table into main memory, “materialize” it, and then loop through this data set for each and every record of the payments table!

When I saw this, I immediately rewrote the SQL in my favorite way – with an OUTER JOIN:

SELECT ...
   FROM payments p
        LEFT OUTER JOIN bank_records b
        ON b.payment_id = p.payment_id
   WHERE payment_date < current_date
        AND b.payment_id IS NULL
        AND ...

This didn’t help either – still a sequential scan and a line-by-line comparison. And then I decided to use “the last resort” – which I hate!

SELECT ...
   FROM payments p WHERE payment_date < current_date
        AND NOT EXISTS (SELECT payment_id FROM bank_records b
                        WHERE b.payment_id = p.payment_id)
        AND ...

This SQL just magically started to use the index, and the execution time went down from 8 minutes to 300 milliseconds…

Yes, it’s cool… but why?!


Filed under Data management, SQL

When things do not work like you expect them to work…

This post has a history. I did the original optimization about three weeks ago; then I wanted to cleanse my example of company-specific things, then I could not reproduce the original execution plan, and then at the end, when I was ready to publish it… it stopped working in our production system! By “stopped working” I mean that it started to take about 90 seconds to execute… But at the very end, by the time I was done with the optimization for the second time, I was, I guess, able to formulate some basic principles for optimizing this kind of query… so, without further ado – let’s start!

Each time I talk about optimizing joins in Postgres, I advise people to include any additional selection criteria in the join condition, even if they are, technically speaking, not “join” conditions. The reason for this suggestion is that if the optimizer sees any way to pre-filter the records before or while joining, it will do so.

So I was trying to utilize this strategy while working on a query which would check for the following condition:

SELECT ...
FROM loan_contract lc
JOIN loan_details ld ON lc.loan_id = ld.loan_id
     AND contract_date >= current_date - interval '10 days'
WHERE (state = 'CA' AND rate > 10) OR (state = 'MO' AND rate > 15)

I knew that this SELECT should retrieve a very small number of records, if any, and therefore I was thinking that if I added this condition to the JOIN, the query would execute faster, because Postgres would be able to filter out the records which are not needed before the join… but I was wrong!

SELECT ...
FROM loan_contract lc
JOIN loan_details ld ON lc.loan_id = ld.loan_id
     AND contract_date >= current_date - interval '10 days'
     AND ((state = 'CA' AND rate > 10) OR (state = 'MO' AND rate > 15))

Continue reading


Filed under SQL

Duplicates removal

I do not think I’ve ever published on this blog the simple way of removing duplicates from a table. I’ve faced this problem multiple times in different environments, first coming up with a solution during my consulting engagement with the City of Chicago, and later virtually everywhere I worked. I believe now is the time to finally document this solution.

First, let’s define the problem and specify more precisely what “duplicates” means in this context. Let’s assume that the table with duplicates has a primary key. Although the SQL standard does not require a PK constraint on each table, it’s considered good practice to have one, even if it is just a sequential number assigned to the record on insertion. If all attributes except the PK are the same in two records, we consider them duplicates.
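With this definition in hand, a quick way to check whether a table contains duplicates at all is to group by everything except the PK. This is just a sketch, using the same schematic table and column names as the code further down:

SELECT attr1, attr2, attr3, count(*) AS copies, min(pk) AS pk_to_keep
FROM table_name
GROUP BY attr1, attr2, attr3
HAVING count(*) > 1;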

Most of the time duplicates are not intended to happen; they occur due to some programmatic error in the insertion process. After the error is discovered and corrected, we usually want to remove the duplicates, leaving just one copy of each set of attributes. Yes, we can select distinct values into a temporary table, truncate and re-insert, but this process would hold a long exclusive lock (possibly prohibitively long) on the table, which is not always acceptable in a production environment. Besides, there may be memory limitations if the table is big enough.
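For completeness, the “select distinct, truncate, re-insert” variant would look roughly like this (same schematic names as the code below; note that TRUNCATE holds an exclusive lock on the table for the whole rebuild):

BEGIN;
CREATE TEMP TABLE table_name_dedup AS
    SELECT DISTINCT ON (attr1, attr2, attr3) pk, attr1, attr2, attr3
    FROM table_name
    ORDER BY attr1, attr2, attr3, pk; -- keeps the smallest pk of each group
TRUNCATE table_name;
INSERT INTO table_name SELECT * FROM table_name_dedup;
COMMIT;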

To remove duplicates fast, I’ve developed a very efficient algorithm, which can be schematically represented as follows:


-- schematic PL/pgSQL: rows are ordered by the attributes, so duplicates are adjacent
v_prev_record := ROW(0, ' ', ' ', ' ');
FOR rec IN (SELECT t.pk, t.attr1, t.attr2, t.attr3 FROM table_name t
            ORDER BY 2, 3, 4, 1) LOOP
    IF v_prev_record.attr1 = rec.attr1 AND
       v_prev_record.attr2 = rec.attr2 AND
       v_prev_record.attr3 = rec.attr3
    THEN
        -- same attributes as the previous row: delete this duplicate
        DELETE FROM table_name WHERE pk = rec.pk;
    END IF;
    v_prev_record.attr1 := rec.attr1;
    v_prev_record.attr2 := rec.attr2;
    v_prev_record.attr3 := rec.attr3;
END LOOP;

This way, only “the first” of each set of identical records remains in the table.

When we are just removing dups, this might look like a somewhat excessive amount of work, but often the table with the dups also serves as a reference in a foreign key constraint, and then we need to modify the dependent table as well – and that might take a REALLY long time! Now check out how a very small modification to the code above makes this work easy!

-- same loop, but first re-pointing the dependent table to the copy we keep
v_prev_record := ROW(0, ' ', ' ', ' ');
v_pk_to_keep := 0;
FOR rec IN (SELECT t.pk, t.attr1, t.attr2, t.attr3 FROM table_name t
            ORDER BY 2, 3, 4, 1) LOOP
    IF v_prev_record.attr1 = rec.attr1 AND
       v_prev_record.attr2 = rec.attr2 AND
       v_prev_record.attr3 = rec.attr3
    THEN
        UPDATE dependent_table SET fk_attr = v_pk_to_keep
        WHERE fk_attr = rec.pk; -- we are about to delete this dup
        DELETE FROM table_name WHERE pk = rec.pk;
    ELSE
        v_pk_to_keep := rec.pk; -- this is a "new" value we keep
    END IF;
    v_prev_record.attr1 := rec.attr1;
    v_prev_record.attr2 := rec.attr2;
    v_prev_record.attr3 := rec.attr3;
END LOOP;

Isn’t it just beautiful?!


Filed under SQL

Using inline functions

Did you know anything about inline functions which return sets of records? I didn’t know anything about them until recently, when one of my co-workers discovered them while trying to solve a problem which had been a problem for a while…

Here is what the problem was. We have a table in our Data Warehouse which is updated incrementally, since it stores historical and current data for all installments for all loans. All good so far, right? The problem is that for reporting purposes we need to calculate lots of additional data based on the current date – and this data needs to be calculated for all installments, for all loans. Obviously, it’s impossible to recreate all the data daily, especially because the actual data is refreshed every 15 minutes.

The option which seems obvious would be to create a view which uses the “permanent” table and calculates the rest dynamically based on the current date. Sounds like a perfect solution? Well, if you query this data for a specific loan or installment, then – yes, you receive the response instantaneously. But if you want to join this view with some other table – performance is terrible! And no wonder – not only Postgres, but virtually no database can push a condition or parameter deep into the grouping. Materializing the view would not help either, because such a materialized view would soon become so big that the calculations would run out of temp tablespace.

And then one of our co-workers asked: do you guys know about the inline functions?

You can read about inline functions here. The biggest advantage of using inline functions that return record sets is that they are not black boxes. For the optimizer, the inline function body is directly substituted with its raw SQL, and the optimization is no different from that of any “regular” query. In our case, we were now spared the “calculate everything for all installments first” step, and the query could be optimized for each specific use case.
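As an illustration (the table and column names below are made up, not our actual schema), a set-returning SQL-language function like this is a candidate for inlining, so the conditions of the outer query are applied inside the function body instead of calculating everything for all installments first:

-- a LANGUAGE sql function consisting of a single SELECT, marked STABLE and
-- not SECURITY DEFINER, can be inlined by the Postgres planner
CREATE OR REPLACE FUNCTION installment_report(p_as_of date)
RETURNS TABLE (loan_id int, installment_number int, days_past_due int)
LANGUAGE sql STABLE AS
$$
    SELECT i.loan_id,
           i.installment_number,
           GREATEST(0, p_as_of - i.due_date)
    FROM installments i
$$;

-- used in a join, the function body is substituted into the query, so only
-- the installments of the selected loans are actually calculated
SELECT l.loan_id, r.installment_number, r.days_past_due
FROM loans l
JOIN installment_report(current_date) r ON r.loan_id = l.loan_id
WHERE l.state = 'CA';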


Filed under SQL

A need for reverse index

One more optimization problem I came across last week. I spotted a slow-running query and could not figure out why it was slow.

The query was rather simple, and if we take all the “extras” out of the equation, it comes down to something like this:

 

SELECT l.loan_id FROM loans l
WHERE l.loan_id NOT IN
      (SELECT loan_id FROM payment_runs WHERE run_date = '2016-01-13')

 

Now, the table payment_runs has a primary key on (loan_id, run_date) – and that’s basically all it has. So you would imagine this index would be used for searching the loans – but it was not!

Actually, I realized more or less right away why it was not used: the index starts with loan_id, so the b-tree search would already need to “know” a loan_id or a set of loan_ids in order to use it. However, we need exactly the opposite – we need to find loan_ids which are not present in the table, and the index is of no use when the search criterion is based on the second column.

We ended up creating one more, “reverse” index on that table: (run_date, loan_id). Although it looks completely redundant, we really need it – the table payment_runs is growing very fast, and there was no way to avoid a full table scan without this index!
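Schematically, the extra index is just this (the index name is made up; what matters is that run_date comes first):

CREATE INDEX payment_runs_run_date_loan_id_idx
    ON payment_runs (run_date, loan_id);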

 

 


Filed under SQL