Monthly Archives: December 2014

2014 in review

The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog.

Here's an excerpt:

A San Francisco cable car holds 60 people. This blog was viewed about 2,100 times in 2014. If it were a cable car, it would take about 35 trips to carry that many people.

Click here to see the complete report.

Advertisements

Leave a comment

Filed under Uncategorized

Who are the senior people

For the past two weeks I had my friends from Moscow over. We talked about all sorts of things, including work and professional development, and at some point my guests asked me: you are saying, you need more senior database people. How you can tell, who is senior, and who is junior?

For many years I thought that the answer to this question is based on the people’s experience, or on how smart the people are. But before opened my mouth I’ve realized that now I have a different answer. That’s what I’ve told my friends:

– A person is senior, when he or she can take an initiative and develop a full solution of a complex problem, without asking for detailed instructions. It does not mean, that this person won’t consult other team members, and it does not mean that he thinks he knows best. In fact, a senior person would run his/her ideas by other people, to get a second opinion, to have an extra pair of eyes on the matter, to have a good argument. But at the same time a senior person has a plan in mind, and he won’t sit not knowing how to start; he won’t be wondering – what are my next steps.

… now, that I think about it, it’s surprising, how long did it take me to realize this….

Leave a comment

Filed under People, Team and teamwork, Workplace

I am NOT building a Data Mart!… wait, why?!

Recently somebody asked me: why you are saying you are building a data mart? Whatever you are building, is being built in the application database, even thought it happens in the “staging area”, designated for the “reporting” stuff.

And then I started to think about what we really mean when we are saying “we are building a data mart”. Yes, you can say, that building a data mart means building all those fact and dimension tables, and creating indexes, and storing aggregates… but the most important is to map the OLTP data structures to the reporting structures, to define, which data from the application tables we need and how we want to process it in order to make reporting fast, easy and efficient. Until we define this mapping, there is no data mart.

Does this mapping belong to the application? I do not think so. Why – because the application can be 100% functional without all these staging schemas. The only purpose and the only reasons for them to exist – to provide the “service” for the data mart. The application technically “does not know” about the reporting structure, besides, the reporting structures actually may be different for different reporting needs.

How this mapping is done, is a data mart “responsibility. Although I may end up doing it anyway, and “I” means “an application database” 🙂

Leave a comment

Filed under Data management

How I did what I never do to optimize…

So… all of you guys who were ever taking my classes, and all of you guys who read my optimization posts, you all know, that one of my favorite topics is – the TEMP tables must die!

How many times did I repeat, that one should never materialize the intermediate results of the SELECT statement? How many times did I tell, that it takes extra read/write operations, which cost the most, and plus – that this locks the order of joins, thereby limiting the optimization options?

Well… a couple of weeks ago I had to do exactly the opposite of what I usually teach! That is – in order to improve performance of the critical function I had to create an intermediate permanent table!

That is how it happened.

The function is building the “delta” – the set of changed records to feed a data mart. Each of the records for this feed is build from several OLTP database records, and I need to include the new record in the output each time at least one of those got changed.

Technically the select statement should look like this:


select
from people p
inner join addresses a (using person_id)
inner join phones ph (using person_id)
inner join email_addresses e (using person_id)
inner join employment em (using person_id)
where p.updated_at between t1 and t2
or p.updated_at between t1 and t2
or a.updated_at between t1 and t2
or ph.updated_at between t1 and t2
or e.updated_at between t1 and t2
or em.updated_at between t1 and t2

Do you know how long this will take providing each time several hundred records are selected? Forever!

Do you know, why? Because even though each of the updated_at fields is indexes, and even though in general the condition by the timestamp is very selective, the OR – ing makes it impossible to limit any of the participating tables enough.

The next idea I had – to pre-select the person_id list which has changed. The code started to look like this:


select
from people p
inner join addresses a (using person_id)
inner join phones ph (using person_id)
inner join email_addresses e (using person_id)
inner join employment em (using person_id)
where p.person_id in (
select person_id from people where updated_at between t1 and t2
union
select person_id from addresses where updated_at between t1 and t2
union
select person_id from phones where updated_at between t1 and t2
union
select person_id from email_addresses where updated_at between t1 and t2
union
select person_id from employment where updated_at between t1 and t2)

What happened here: Postgres refused to use the PK on the people table to select the person_ids, which were pre-selected in the UNION.

After that I had the only option left. I’ve created the permanent table people_changed, which I truncate each time I execute this function. This table is indexed by person_id, and the first step of this function execution is, that I insert the result of the UNION into this table, and then execute:


select
from people p
inner join addresses a (using person_id)
inner join phones ph (using person_id)
inner join email_addresses e (using person_id)
inner join employment em (using person_id)
inner join people_changed pc (using person_id)

And this works perfectly!

Leave a comment

Filed under SQL