Monthly Archives: February 2014

I hope I will now remember FOREVER…

… that concatenating anything with a NULL string in Postgres results in NULL!!! Because in Postgres, in contrast to Oracle, it’s done correctly: a NULL string is different from an empty string.
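If you want to see it with your own eyes, here is a tiny sketch (using Python’s sqlite3 for convenience – SQLite follows the same standard NULL semantics as Postgres here), along with the usual COALESCE fix:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# In standard SQL (Postgres, SQLite, ...), NULL || anything is NULL
cur.execute("SELECT 'Hello, ' || NULL")
print(cur.fetchone()[0])  # None: the whole concatenation became NULL

# The usual fix: turn a possibly-NULL value into an empty string first
cur.execute("SELECT 'Hello, ' || COALESCE(NULL, '')")
print(cur.fetchone()[0])  # 'Hello, '
```

(Oracle, by contrast, treats NULL and '' as the same thing, which is exactly what makes this so easy to forget.)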

I knew it the last time – when I first debugged the customer search – but I forgot about it 18 months ago, when I got to move it to a different platform. And it took me hours to straighten this out again!


Filed under SQL

Finally – more female than male in CS? Maybe…

This article talks about one very small but important shift: last spring, for the first time in history, more female than male students signed up for the introductory CS class at UC Berkeley.

The article also talks about the still-existing gap between the numbers of male and female CS majors, and about ways to overcome the problem. Here is something to think about:

A 2008 study by the Association for Computing Machinery, consulted in designing the Berkeley course, found that while college-bound boys equated words like “interesting,” “video games,” and “solving problems” with computing, girls associated terms like “typing,” “math” and “boredom.”

This graph illustrates the gender gap for CS major grads:


Filed under People

A perfect example of ORIM (object-relational impedance mismatch) and how we successfully overcame it

I try to run execution statistics on our application controllers every day, or at least every other day. Each time I see a controller action producing an excessive number of database calls, I bring it to my squad’s attention, and we create a task to investigate.

However, I am not a purist :), so I am OK if a controller is not “perfect” and not completely optimized. Usually I am OK. But not in the case of our payment presenter. The problem was that the page time was way more than the database time, which should not be the case: the page time should be only a little more than the db time. Moreover, after playing with the statistics I realized that the more db calls a controller action executed, the bigger that difference became, which sounded completely ridiculous! So I pushed for an investigation.

This was something like a mystery novel, but I will spare my readers the details and proceed directly to our findings. The first thing we found was that each payment had to read a loan type and, as usual, “could not memoize it”, so for each payment it was reading a new copy of the loan object! No wonder we had memory issues so severe that I had to implement pagination on the database side! My coworker Richard fixed this by making the method reference only one loan object.

But that was not all. I hope Richard will eventually create a pretty diagram for me, but for the time being let me try to explain it without a visual aid.

In my function I was returning the “payment status code”, which was the ach status code if an ach record was present and “some other status” otherwise – I returned this value using the COALESCE function. Guess what was happening in the payments presenter: it would search for a “description” of this code, which would be the ach code description if there was an ach record, and null otherwise! So in reality we only needed to join with one table! Moreover, we already had this join (to get the code), and in all other cases the description would be NULL anyway, but no model could figure this out! And even if it could, it would still have to run “one select per returned line”, because this was a “different” join, not the one specified in the model for the payments – ach relationship. So the eager loader could not be used…

Solution: I added one more field to the function – the ach code description. There was zero extra db work, because the table was already present in the FROM list; we just didn’t select this particular field.
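To make the idea concrete, here is a minimal sketch (all table and column names are hypothetical, and I am using SQLite via Python for brevity): once the ach table is already joined in to compute the status code, selecting the description from the same join costs nothing, and no per-row lookup is needed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ach (id INTEGER PRIMARY KEY, status_code TEXT, description TEXT);
CREATE TABLE payments (id INTEGER PRIMARY KEY, ach_id INTEGER REFERENCES ach(id));
INSERT INTO ach VALUES (1, 'ACH_OK', 'ACH transfer settled');
INSERT INTO payments VALUES (10, 1), (11, NULL);
""")

# One LEFT JOIN returns both the status code and its description,
# so the presenter never needs a second query per payment row.
rows = conn.execute("""
SELECT p.id,
       COALESCE(a.status_code, 'OTHER') AS payment_status_code,
       a.description                    AS ach_description
FROM payments p LEFT JOIN ach a ON a.id = p.ach_id
ORDER BY p.id
""").fetchall()
print(rows)  # [(10, 'ACH_OK', 'ACH transfer settled'), (11, 'OTHER', None)]
```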

Result: a steady 2 db calls per controller action! I will report production statistics later, when this solution is in production (on Wed, I guess).

Now you guys have to tell us, that we are awesome :))


Filed under Data management, Systems, Team and teamwork

Why is this index never used?

Recently I got a question from a coworker.

We have a table credit_reports, which has, among others, the following fields:

  • customer_id
  • inquiry_time
  • report_type

The table contains over 51,000,000 rows. There are several indexes on this table, among them:
idx1 – (customer_id, inquiry_time)
idx2 – (inquiry_time::date) – the timestamp converted to date

A coworker was wondering why idx2 was never used, even when we use this transformation in the selection criteria, like:

SELECT * FROM credit_reports WHERE report_type='type1' AND inquiry_time::date > current_date - 5

But if we omit the transformation:

SELECT * FROM credit_reports WHERE report_type='type1' AND inquiry_time > current_date - 5

then idx1 will be used (although inquiry_time is not in the first position in the index).

The answer is the following: idx2 is never going to be used (except for queries like SELECT min(inquiry_time::date) FROM credit_reports), because the first record was inserted 7 years ago, which means there are only about 2,500 distinct dates in the whole table – at least 20,000 records per index value… no sane optimizer would choose this path!
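The back-of-the-envelope selectivity arithmetic behind that answer (using the row and date counts quoted above):

```python
total_rows = 51_000_000      # rows in credit_reports
distinct_dates = 2_500       # ~7 years of data, one value per day

# average number of rows hiding behind every distinct date value
rows_per_index_value = total_rows // distinct_dates
print(rows_per_index_value)  # 20400
```

With roughly 20,400 rows per key, each index probe would still fetch a huge slice of the table, so the planner rightly prefers other access paths.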

As for the usage of idx1, it was used not exactly as an index, but rather was read sequentially, because reading the index is faster than reading the whole table!


Filed under SQL

Videos from the Tech Conference

Courtesy of Preeti 🙂 They cover almost all of my presentation, except for the very beginning and the very end. Also, I need to remove the part which is rotated 90 degrees 🙂


Filed under talks

A word about colocation

One of the principles of Agile methodology is colocation – that all team members should sit close to each other. The strangest thing is that I actually understand this requirement and find it important.

Those of my friends and co-workers who have known me long enough can understand why it is quite unexpected for me to like this idea. For at least the last 22 years I was always fighting for my right to work remotely, and I proved that I can – the New York Department of Education project alone is living proof. I never went to New York until we actually had to hand the project over to the new team. I always took pride in being able to work from home and stay efficient, in being able to work in a different time zone and still stay connected, and I was always a firm believer that where there is a will, there is a way – to collaborate remotely, in this case.

However, working on an Agile team proved how important it is to be able to look around, make sure your co-worker is at his desk, and walk over. There are so many things which are like “let me show you…”. In many cases you can do the same thing on Skype, but with Agile development you really lose precious minutes or even hours when you need to get on Skype to discuss things.

I absolutely love the idea of open space, which again is news – I was always suffering without a private space. But now for some reason I do not suffer, and I didn’t like it when at some point our management decided to try to put “dividers” on the 9th floor. I love the fact that I can see where everybody is, who is in which meeting, and that I can stop by literally anybody at any time and ask my question. I can’t remember how many times I had absolutely critical conversations in the break room aka kitchen, how many times somebody would appear behind me with a new and revolutionary idea…

Long story short – I do not know why, but it does make a difference!


Filed under Companies, Team and teamwork, Workplace

Duplicates removal

A friend of mine asked me the other day how I would solve the following problem she got during an interview. Imagine you have a table with several attributes, and there are some records where all attributes except the primary key are the same (duplicates). Write one SQL statement to delete the duplicates and leave just one row for each combination of values.

If that is the only constraint, and we do not care about performance, the solution is easy. Let’s say we have a table:

CREATE TABLE t1 (
id integer PRIMARY KEY
,attr1 number NOT NULL
,attr2 text NOT NULL);

To delete all duplicate occurrences of (attr1, attr2) we can write the following SQL statement:

DELETE FROM t1
WHERE id NOT IN (
SELECT min(id) AS id
FROM t1 GROUP BY attr1, attr2);

This way only one (minimal) id for each (attr1, attr2) combination will be left.
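A quick sanity check of that statement (SQLite via Python, with made-up data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER PRIMARY KEY, attr1 NUMERIC NOT NULL, attr2 TEXT NOT NULL);
INSERT INTO t1 VALUES (1, 10, 'a'), (2, 10, 'a'), (3, 10, 'b'), (4, 10, 'b'), (5, 20, 'a');
""")

# keep only the minimal id per (attr1, attr2) group
conn.execute("""
DELETE FROM t1
WHERE id NOT IN (SELECT min(id) FROM t1 GROUP BY attr1, attr2)
""")

print(conn.execute("SELECT id FROM t1 ORDER BY id").fetchall())
# [(1,), (3,), (5,)] -- one survivor per combination
```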

However, for big tables this is not an efficient way to perform the delete. For big tables we want to become “imperative” and use a cursor with ordering.

In no particular 4GL language this will look like this:

DECLARE v_attr1 number := to_number(NULL);
v_attr2 text := NULL;
FOR rec IN (SELECT id, attr1, attr2 FROM t1
ORDER BY attr1, attr2, id) LOOP
IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL AND v_attr1=rec.attr1 AND v_attr2=rec.attr2 THEN
DELETE FROM t1 WHERE id = rec.id;
ELSE
v_attr1 := rec.attr1; v_attr2 := rec.attr2;
END IF;
END LOOP;

If the table is really big, you can add a counter and commit after each X number of records.

This is a simple problem, but quite often we need to remove duplicates for which we have foreign key references, and then it’s a real problem. Suppose t1 is a lookup table, and t2 references it:

CREATE TABLE t2 (
id integer PRIMARY KEY
,t1_id integer REFERENCES t1(id)
,attr3 text);

Not only do we need to remove the duplicates now, but we also need to fix the references. You can do both with a single cursor:

DECLARE v_t1_id integer;
v_attr1 number := to_number(NULL);
v_attr2 text := NULL;
FOR rec IN (SELECT id, attr1, attr2 FROM t1
ORDER BY attr1, attr2, id) LOOP
IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL AND v_attr1=rec.attr1 AND v_attr2=rec.attr2 THEN
UPDATE t2 SET t1_id = v_t1_id WHERE t1_id = rec.id;
DELETE FROM t1 WHERE id = rec.id;
ELSE
v_t1_id := rec.id; v_attr1 := rec.attr1; v_attr2 := rec.attr2;
END IF;
END LOOP;

Once again, you may need to commit periodically if the tables are big enough and you do not want to paralyze the work of the system.
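The same reference-fixing logic can also be expressed set-based, without a cursor; here is a sketch of that alternative (SQLite via Python, hypothetical data), which first repoints every t2 reference at the surviving minimal id and only then deletes the duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (id INTEGER PRIMARY KEY, attr1 NUMERIC NOT NULL, attr2 TEXT NOT NULL);
CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER REFERENCES t1(id), attr3 TEXT);
INSERT INTO t1 VALUES (1, 10, 'a'), (2, 10, 'a'), (3, 20, 'b');
INSERT INTO t2 VALUES (100, 2, 'x'), (101, 3, 'y');

-- repoint each reference at the minimal id of its (attr1, attr2) group
UPDATE t2 SET t1_id = (
  SELECT min(d.id) FROM t1 d
  JOIN t1 o ON o.attr1 = d.attr1 AND o.attr2 = d.attr2
  WHERE o.id = t2.t1_id);

-- now the duplicates are unreferenced and can be deleted
DELETE FROM t1
WHERE id NOT IN (SELECT min(id) FROM t1 GROUP BY attr1, attr2);
""")

print(conn.execute("SELECT id, t1_id FROM t2 ORDER BY id").fetchall())
# [(100, 1), (101, 3)] -- the reference to duplicate id 2 now points to id 1
```

Whether this beats the cursor depends on table sizes and how long you can hold the locks; the cursor version lets you commit in batches.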

I think it’s quite neat – what do you think?


Filed under Data management, SQL