What my co-workers are presenting

I’ve being asking some of my co-workers to post their optimization findings to this blog, but didn’t get any submissions yet. And since lots of people were asking me about “what exactly” we are presenting, I’ve decided to show one of the interesting cases myself, with a hope that people, who actually did this, will comment :).

Here is it.

The problem – original non-normalized schema:

This schema was normalized and now looks like this:

The task – to update each result with a correct set of failure details.

The first program which the database developer put together looked somewhat like this:


Loop Through All Runs
Loop Through All Branded Examples for that run
tests = Results for Run/BrandedExample
attempts = tests.size
if attempts > 1
statuses = tests.map(&:status_id)
keeper = tests.shift
update_keeper = "UPDATE results SET attempts=#{attempts}"
if statuses.uniq.size > 1
update_keeper += ", status_id=#{sporadic.id}"
end
update_keeper += " WHERE id = #{keeper.id}"
test_ids = tests.pluck(:id).join(',')
execute <<-SQL
#{update_keeper};
UPDATE failure_details SET result_id = #{keeper.id}
WHERE result_id IN (#{test_ids});
DELETE FROM results
WHERE id IN (#{test_ids});
SQL
end

The sample time measurements proved that the complete update of the whole table will take about 19 hours!

Then my other co-worker suggested the alternative, and after multiple backs and forth this code looked like this.
First step – updates all failure details with same run and branded example to the first result (result_to_point to):

update failure_details
set result_id = data_source.result_to_point_to
from (min(result_id) as result_to_point_to,
array_agg(failure_details.id) as failures
from results
join failure_details on results.id = failure_details.result_id
group by run_id, branded_example_id
) data_source
where failure_details.id = ANY (data_source.failures);

Second step -Updates results with attempts with sporadic status if # of failures are less than attempts and greater than 0:

UPDATE results
SET attempts = data_source.attempts,
status_id = case when data_source.num_of_failures > 0 AND data_source.num_of_failures <> data_source.attempts
then #{sporadic.id}
else status_id
end
FROM(
SELECT count(*) as attempts,
COUNT(status_id = 2) as num_of_failures,
MIN(result_id) as result_to_point_to
FROM results
JOIN failure_details on results.id = failure_details.result_id
GROUP BY run_id, branded_example_id
) data_source
WHERE results.id = data_source.result_to_point_to;

After this is done, the unassigned results are deleted:

DELETE FROM results
WHERE attempts IS NULL

The total execution time for this whole thing was less than 5 min!

Advertisements

Leave a comment

Filed under Data management, SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s