Why this SELECT is bad, and how you can tell without actually running it

So, about this SELECT, which I was discussing with many of my friends and coworkers over the past week and promised to write a blog post. Yea, about this ugly SELECT.

First, what is the purpose of this select? When a call from a customer comes through, the system should be able to find this customer (if this is customer, who’s already in the system, of cause) and display all customer information right away. Which is a common task which has been successfully solved in many call centers across the world.

However, in this particular case there is one problem: the customer data is not sanitized, meaning, although in most of the cases the phone is stored in format without any dashes or spaces, there us a limited number of cases, when this is not correct, namely there is no country code in the front. (Less than 1% of all phones).

The SELECT statement which was put on place to search for a phone, was the following:

SELECT c.id, c.brand_id, p.first_name, p.last_name, p.country_cd
FROM cnu.customers c INNER JOIN cnu.people p
ON c.person_id = p.id
AND ( (POSITION('5027677730' IN p.home_phone) > 0)
OR (POSITION('5027677730' IN p.mobile_phone) > 0)
OR (POSITION('5027677730' IN p.work_phone) > 0) )

So, what’s the problem with this statement?

First, it is just wrong logically, because we so not need ANY record which “includes” a correct phone number. Yes, in 99% of the cases this is, what we will get, but from the logical perspective we are solving a wrong problem here. What we really need to find, is a customer with either “this” phone number, or the same phone number, preceded by ‘1’, which is the following SELECT:

SELECT c.id, c.brand_id, p.first_name, p.last_name, p.country_cd
FROM cnu.customers c INNER JOIN cnu.people p
ON c.person_id = p.id
AND (home_phone ='15027677730'
or home_phone ='5027677730'
or mobile_phone='15027677730'
or mobile_phone='5027677730'
or work_phone='15027677730'
or work_phone='5027677730')

What are the other reasons the first SELECT is “bad”? POSITION does not allow to search by any index, neither by any regular index on the phone number(s), nor by pattern index. Which means, that this will always be a full scan, which is totally unacceptable for the large tables in “short” queries.

We do not even need to run EXPLAIN for this SELECT to understand, that this solution is not scalable, but let’s take a look 🙂

Here is an execution plan for the first select:
"Hash Join (cost=591558.63..2188487.12 rows=6167904 width=22) (actual time=315988.193..332839.641 rows=1 loops=1)"
" Hash Cond: (people.id = c.person_id)"
" -> Seq Scan on people (cost=0.00..1092811.80 rows=15691320 width=20) (actual time=146168.437..162950.158 rows=1 loops=1)"
" Filter: (("position"((home_phone)::text, '5027677730'::text) > 0) OR ("position"((mobile_phone)::text, '5027677730'::text) > 0) OR ("position"((work_phone)::text, '5027677730'::text) > 0))"
" -> Hash (cost=439199.17..439199.17 rows=8764917 width=10) (actual time=169817.624..169817.624 rows=8766163 loops=1)"
" Buckets: 262144 Batches: 8 Memory Usage: 47082kB"
" -> Seq Scan on customers c (cost=0.00..439199.17 rows=8764917 width=10) (actual time=43.807..166620.638 rows=8766163 loops=1)"

And here is the plan for the second one:

"Nested Loop (cost=27.16..1715.41 rows=96 width=22)"
" -> Bitmap Heap Scan on people (cost=27.16..516.05 rows=244 width=20)"
" Recheck Cond: (((home_phone)::text = '15027677730'::text) OR ((home_phone)::text = '5027677730'::text) OR ((mobile_phone)::text = '15027677730'::text) OR ((mobile_phone)::text = '5027677730'::text) OR ((work_phone)::text = '15027677730'::text) OR ((work_phone)::text = '5027677730'::text))"
" -> BitmapOr (cost=27.16..27.16 rows=244 width=0)"
" -> Bitmap Index Scan on people_m3 (cost=0.00..4.47 rows=11 width=0)"
" Index Cond: ((home_phone)::text = '15027677730'::text)"
" -> Bitmap Index Scan on people_m3 (cost=0.00..4.47 rows=11 width=0)"
" Index Cond: ((home_phone)::text = '5027677730'::text)"
" -> Bitmap Index Scan on people_m5 (cost=0.00..4.43 rows=60 width=0)"
" Index Cond: ((mobile_phone)::text = '15027677730'::text)"
" -> Bitmap Index Scan on people_m5 (cost=0.00..4.43 rows=60 width=0)"
" Index Cond: ((mobile_phone)::text = '5027677730'::text)"
" -> Bitmap Index Scan on people_m4 (cost=0.00..4.50 rows=52 width=0)"
" Index Cond: ((work_phone)::text = '15027677730'::text)"
" -> Bitmap Index Scan on people_m4 (cost=0.00..4.50 rows=52 width=0)"
" Index Cond: ((work_phone)::text = '5027677730'::text)"
" -> Index Scan using customers__u_person__brand on customers c (cost=0.00..4.90 rows=1 width=10)"
" Index Cond: (person_id = people.id)"

During our discussion of this SELECT, an alternative approach was proposed: to use a reverse index and to search “backwards”

So, if we build three additional reverse indexes and rewrite this query the following way:

SELECT c.id, c.brand_id, p.first_name, p.last_name, p.country_cd
FROM cnu.customers c INNER JOIN cnu.people p
ON c.person_id = p.id
AND (reverse(p.home_phone) like reverse('%5027677730')
OR reverse(p.work_phone) like reverse('%5027677730')
OR reverse(p.mobile_phone) like reverse('%5027677730') )

The execution plan will look like this:

"Hash Join (cost=356785.55..873176.53 rows=130556 width=22)"
" Hash Cond: (c.person_id = people.id)"
" -> Seq Scan on customers c (cost=0.00..438545.82 rows=8747382 width=10)"
" -> Hash (cost=352532.03..352532.03 rows=340282 width=20)"
" -> Bitmap Heap Scan on people (cost=5938.37..352532.03 rows=340282 width=20)"
" Recheck Cond: ((reverse((home_phone)::text) ~~ '0377767205%'::text) OR (reverse((work_phone)::text) ~~ '0377767205%'::text) OR (reverse((mobile_phone)::text) ~~ '0377767205%'::text))"
" Filter: ((reverse((home_phone)::text) ~~ '0377767205%'::text) OR (reverse((work_phone)::text) ~~ '0377767205%'::text) OR (reverse((mobile_phone)::text) ~~ '0377767205%'::text))"
" -> BitmapOr (cost=5938.37..5938.37 rows=341989 width=0)"
" -> Bitmap Index Scan on r1 (cost=0.00..2007.95 rows=113996 width=0)"
" Index Cond: ((reverse((home_phone)::text) ~>=~ '0377767205'::text) AND (reverse((home_phone)::text) ~ Bitmap Index Scan on r3 (cost=0.00..1857.65 rows=113996 width=0)"
" Index Cond: ((reverse((work_phone)::text) ~>=~ '0377767205'::text) AND (reverse((work_phone)::text) ~ Bitmap Index Scan on r2 (cost=0.00..1817.57 rows=113996 width=0)"
" Index Cond: ((reverse((mobile_phone)::text) ~>=~ '0377767205'::text) AND (reverse((mobile_phone)::text) ~<~ '0377767206'::text))"

And yes, the execution time will be more or less the same, as in the second case, but do we really need to build indexes, which we do not really need?…

Advertisements

Leave a comment

Filed under Data management, SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s