The misconceptions about nested loops

When I first started working on query optimization, I had the same misconception as many people: that having nested loops in the execution plan is a good thing. In reality, for this to be true, two conditions should be met: 1) the query should be a short query, and 2) the nested loop should utilize a unique key and/or primary/foreign-key constraints.

For long queries (reminder: a query is considered long when most of the records from the table(s) contribute to the result), a full scan (and consequently the hash join algorithm) can be more beneficial. Why? Because when records are retrieved using an index, two reads are required: one to read the index block, and another to read the actual data block. The order of index blocks often (not necessarily, but really often) does not match the order of the table blocks, so consecutive access through the index may result in random access to the actual data. Now imagine what will happen if we need "almost all" records from the table: we will execute twice as many reads as we would if we just ignored the index!
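
If you want to observe this trade-off yourself, PostgreSQL lets you temporarily discourage index access in a session and compare the plans and timings it produces. Here is a minimal sketch, reusing the payment_transactions table from the examples below; the filter on amount is just a made-up predicate that happens to match "almost all" of the rows:

-- plan and timing with the planner free to use indexes
EXPLAIN ANALYZE SELECT * FROM payment_transactions WHERE amount > 0;

-- discourage index access for this session only, then compare
SET enable_indexscan = off;
SET enable_bitmapscan = off;
EXPLAIN ANALYZE SELECT * FROM payment_transactions WHERE amount > 0;

-- restore the defaults
RESET enable_indexscan;
RESET enable_bitmapscan;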

When I was reviewing the PowerPoint for my last class (SQL for Advanced Analytics), I realized that all the examples I had for the "Art of the full scan" section were still coming from my work at the City of Chicago, from a very old version of the Oracle Applications Suite. I had to come up with "in-house" examples, and I was worried that I wouldn't be able to find any.

You know why? Because our database servers have huge main memory, and even big tables can be read into it all at once, thereby making nested loops fast even when they should not be :). So I was really happy when I started to reconstruct one City of Chicago example in our environment, and it worked exactly how I wanted!

Now, let’s proceed with an example.

This is how you calculate the "account sums" for an individual loan:

SELECT
  vl.value AS account,
  SUM(CASE vl.value WHEN pt.debit_account_cd THEN pt.amount ELSE 0 END)
    - SUM(CASE vl.value WHEN pt.credit_account_cd THEN pt.amount ELSE 0 END) AS amount
FROM payment_transactions pt
INNER JOIN valuelists vl ON vl.type_cd = 'transaction_account'
WHERE loan_id = 30501044
GROUP BY vl.value

The output will be a list of accounts with their amounts, one row per account.

And if you look at the execution plan, it will be something like this:

HashAggregate  (cost=227.63..228.01 rows=30 width=55)
  ->  Nested Loop  (cost=2.65..183.01 rows=3570 width=55)
        ->  Index Scan using payment_transactions_m1 on payment_transactions_committed  (cost=0.00..100.97 rows=70 width=41)
              Index Cond: (loan_id = 30501044)
        ->  Materialize  (cost=2.65..37.54 rows=51 width=14)
              ->  Bitmap Heap Scan on valuelists vl  (cost=2.65..37.28 rows=51 width=14)
                    Recheck Cond: ((type_cd)::text = 'transaction_account'::text)
                    ->  Bitmap Index Scan on valuelists_m1  (cost=0.00..2.63 rows=51 width=0)
                          Index Cond: ((type_cd)::text = 'transaction_account'::text)

Basically, this is what you would expect, and it will execute super-fast.

Now, imagine we want to "help" our users and create a view which would allow them to select any account for any given loan. This is how the new view will be defined:

CREATE OR REPLACE VIEW balances_example_v AS
SELECT
  loan_id,
  vl.value AS account,
  SUM(CASE vl.value WHEN pt.debit_account_cd THEN pt.amount ELSE 0 END)
    - SUM(CASE vl.value WHEN pt.credit_account_cd THEN pt.amount ELSE 0 END) AS amount
FROM payment_transactions pt
INNER JOIN valuelists vl ON vl.type_cd = 'transaction_account'
GROUP BY loan_id, vl.value
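
The intended usage is something like this (a minimal sketch, reusing the loan id and the 'principal_due' account that appear elsewhere in this post):

SELECT account, amount
FROM balances_example_v
WHERE loan_id = 30501044
  AND account = 'principal_due';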

If you want to use this view just to select specific accounts for a specific loan, everything will be just beautiful: the same execution plan as above, the same high execution speed. But what will happen if we use this view in a "long" query?

Yes, we need to run batches. Yes, we need to run reports. And we often need the account values in those reports. We can't use Ruby methods there, so we are left with no choice but to use SQL that is "close enough" to the Ruby methods. So imagine we have a batch which executes a query like this:

SELECT
  c.id,
  email,
  v.amount AS principal_due,
  loan_id
FROM balances_example_v v
INNER JOIN loans l ON l.id = v.loan_id
INNER JOIN customers c ON l.customer_id = c.id
WHERE account = 'principal_due'
  AND l.status_cd = 'in_default'
  AND l.created_on > '06-01-13'

Guess what the execution time for this SELECT is? 25 minutes!
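
If you want to reproduce this kind of measurement yourself, the easiest way in PostgreSQL is to prefix the statement with EXPLAIN ANALYZE, which actually runs the query and reports both the chosen plan and the real execution time:

EXPLAIN ANALYZE
SELECT c.id, email, v.amount AS principal_due, loan_id
FROM balances_example_v v
INNER JOIN loans l ON l.id = v.loan_id
INNER JOIN customers c ON l.customer_id = c.id
WHERE account = 'principal_due'
  AND l.status_cd = 'in_default'
  AND l.created_on > '06-01-13';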

And guess how the execution plan would look?

Like this:

Note the inner NESTED LOOP: you see a sequential scan on the payment_transactions_committed table! And you know why? Because when we have a view with GROUP BY, the optimizer cannot push the join condition on loan_id below the aggregation: it has to build the account totals for every loan in the table before it can join to the handful of loans we actually need, so it effectively "forgets" all the indexes it could use on the original table!
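
To picture what the planner is up against, here is roughly what the query looks like once the view is expanded (a sketch, not the literal internal rewrite): the aggregation over the whole payment_transactions table sits in a subquery, and the per-loan filtering coming from the join to loans can only happen after that aggregate has been computed.

SELECT c.id, email, v.amount AS principal_due, loan_id
FROM (
  SELECT
    loan_id,
    vl.value AS account,
    SUM(CASE vl.value WHEN pt.debit_account_cd THEN pt.amount ELSE 0 END)
      - SUM(CASE vl.value WHEN pt.credit_account_cd THEN pt.amount ELSE 0 END) AS amount
  FROM payment_transactions pt
  INNER JOIN valuelists vl ON vl.type_cd = 'transaction_account'
  GROUP BY loan_id, vl.value
) v
INNER JOIN loans l ON l.id = v.loan_id
INNER JOIN customers c ON l.customer_id = c.id
WHERE account = 'principal_due'
  AND l.status_cd = 'in_default'
  AND l.created_on > '06-01-13';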

Now, guess what happens when we do not use a view. The SQL will look like this:

SELECT
  c.id,
  email,
  SUM(CASE WHEN pt.debit_account_cd = 'principal_due' THEN pt.amount ELSE 0 END)
    - SUM(CASE WHEN pt.credit_account_cd = 'principal_due' THEN pt.amount ELSE 0 END) AS principal_due,
  loan_id
FROM payment_transactions pt
INNER JOIN loans l ON l.id = pt.loan_id
INNER JOIN customers c ON l.customer_id = c.id
WHERE l.status_cd = 'in_default'
  AND l.created_on > '06-01-13'
GROUP BY c.id, email, loan_id

You want to know what the execution time for this one is? 2.5 minutes!!!
And you want to know why? Look at the execution plan:

Do you see what I see? Do you see which join algorithm is used when we join with payment_transactions_committed? A hash join! And it works faster! Ten times faster!

Now you can reflect on this, because I've been writing this post since Christmas, and it was quite exhausting!

 
