Monthly Archives: September 2015

Pipeline DB at PG Open 2015

Conferences are mostly about meeting people: finding people with similar research interests, getting to see in the flesh people whose work you've been admiring for a long time, making sure people who met you ten years ago still remember your name, and there is always something totally unexpected.

One of the very interesting conversations I had at the conference was with the folks from PipelineDB. Technically speaking, I am not responsible for the Data Marts at Enova; my focus is on the application databases. But since the source of data for the Data Marts always resides in the application DB, I've become very involved in developing the processes of data transfer – from the apps to the Data Marts.

That's why I gladly accepted an invitation to talk about PipelineDB, and about whether my company can use it. Although some of my colleagues are skeptical about a product which was forked from Postgres, I say it remains to be seen. At the end of the day – what if we found a totally new tool for data transfer, one with no relationship to Postgres whatsoever?

As usually happens after you've been away from the office for several days (especially when "you" means almost all of the company's database developers), you get caught up in addressing the immediate needs. But I hope that in a week or so we will be able to test how well PipelineDB can serve our needs. And if there are definite advantages in both development time and easing the load on the application DB – why not choose this avenue?…


Filed under Data management

PG Open 2015 – a talk about FDW

I didn't pay much attention to Postgres Foreign Data Wrappers until recently, but when Jim Mlodgenski mentioned the FDW for Hadoop during his tutorial, I realized I was missing something, so I came to his other presentation (on Thursday) to learn more about the ways to combine data from different sources in one query.

Once again, it's my fault that I didn't pay much attention to this topic before, but now I am thinking… I think we can identify three big groups of foreign data: 1) other (non-Postgres) databases, 2) non-databases (flat files), 3) other Postgres databases.

Regarding group one – if you have to do it, you have to do it. I mean, if you really need to access data from different databases in one query, there is no way around it. However, Jim demonstrated that just starting the FDW service takes a lot of time, regardless of how fast the data is retrieved afterwards. Because of this, in some cases using Hadoop will not improve but rather decrease query performance. To be honest, I do have prejudices against Hadoop… Yes, you can write to Hadoop fast, but what about reads? Also, in order to run any reports from Hadoop… first you build a SQL extension to Hadoop (Hive), then you join it with some data which is stored in Postgres… under which circumstances will it be faster than accessing the same data in Postgres?

But enough about that; let's just say that at least in this case there is a case for some sort of FDW.

Group two: I am having trouble understanding why we may need to "wrap" a file, other than "it looks cool". If we really need data from external files, the only way to use them efficiently is to load them. Even if we want to read this data only once, we may as well "read" it into some database table. I am just trying to imagine what happens at the disk level when an external table actually represents a flat file… and we are trying to make it behave like a proper table…

Group three: definitely the most interesting usage. I asked Jim what the difference is between using dblink and an FDW for Postgres. It does not look like there is a huge difference; with an FDW we can write more complex queries, and it looks cooler :).
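For reference, setting up the Postgres-to-Postgres wrapper (postgres_fdw) looks roughly like this; the server, user, and table names below are made up for illustration:

```sql
-- a minimal postgres_fdw sketch; all names here are hypothetical
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER remote_app_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER remote_app_db
    OPTIONS (user 'reporting', password 'secret');

-- expose a remote table locally; it can now be joined with local tables
CREATE FOREIGN TABLE remote_orders (
    order_id   int,
    created_at timestamptz
)
    SERVER remote_app_db
    OPTIONS (schema_name 'public', table_name 'orders');
```

With dblink you would instead call a function and pass the remote query as a string, which is exactly why FDW queries end up more readable.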

The most important takeaway was that there is no concept of a "driving node" in Postgres: no matter how hard you try, you can't force it to execute a join or select on the node where it makes more sense; Postgres will pull "all" the data from the remote node to the local node… The advice we got was to create a view on the remote side and then select from this view… which contradicts the statement that during query compilation Postgres always substitutes views with their underlying code…

Lots of things to think about, not that many to use right away…


Filed under events, People, SQL

PG Open 2015 – tutorials

Hello from Dallas, TX! It's the first time that the PG Open conference is not in Chicago; to be honest, I am not entirely happy about this, because Chicago is a better place in general :). But on the other hand, it's nice to stay in a hotel for three days, not worry about anything, and just immerse yourself in professional communications.

We have a big group of people from Enova attending this conference – five database developers and two DBAs – and I am really happy that our "younger generation" had a chance to attend, to meet interesting people, to be seen and to be heard. I think we were quite visible :).

The first conference day was a day of tutorials. I attended Jim Mlodgenski's tutorial "Big Postgres: Scaling PostgreSQL in a BigData environment", and I enjoyed it a lot.

First, I like that Jim used the same definition of Big Data that I liked so much during the panel discussion at ICDE 2015 ("too big for conventional processing"). Second, even though there was nothing completely new for me in this tutorial, it really helped me to connect all the things together and to view all the different technologies that were presented as different ways to make Postgres scalable.

Third, and maybe most important: now I've got some numbers! I've had this discussion multiple times at work: people think they need some kind of powerful data warehousing tool, like Hadoop, but they never try to figure out whether this is feasible – what the additional costs will be, not even in terms of money, but in terms of system response time. I only knew theoretically that Hadoop is "too big" for us, but now I have actual measurements.

Can't wait till Jim uploads his slides – I will show them to those people who are reluctant to admit that each task requires appropriate tools.


Filed under events, SQL, Systems

Another look at the optimizer parameters

Almost exactly a year ago I was dealing with the optimization problem I described in this post, which in turn was referencing that post with my original problem. Recently I faced another problem of unexpected slowness, and the solution turned out to be… kind of the opposite.

This time the slow query was running (or rather not running) in the Data Mart, in the classic star schema environment.

We had one fact table and 20+ dimensions joined to this fact.

So the query looked somewhat like this:

SELECT d1.name1
FROM fact_table f
INNER JOIN dimension1 d1 USING (key1)
INNER JOIN dimension2 d2 USING (key2)
...
INNER JOIN dimension20 d20 USING (key20)

The problem was… this query was not running! It would enter the query planner and never come back!

What I found about this cluster was that both geqo_threshold and join_collapse_limit were set to 30, which paralyzed the optimizer almost entirely. I started experimenting with reducing the join_collapse_limit and found that what we actually had to do was set it to 1. This allowed us to control the join order, which is perfectly fine: we know exactly what the order should be: the fact table first, followed by the dimensions :)
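If you want to try the same trick, it looks roughly like this; the parameter can be set per session, so this is a sketch rather than a recommendation for the whole cluster:

```sql
-- with join_collapse_limit = 1 the planner keeps the joins
-- exactly in the order they are written in the FROM clause
SET join_collapse_limit = 1;

SELECT d1.name1
FROM fact_table f
INNER JOIN dimension1 d1 USING (key1)
INNER JOIN dimension2 d2 USING (key2);
```

With this setting the developer, not the planner, is responsible for the join order – which for a star schema is an easy responsibility to take.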

The total execution time got reduced from "god knows what" to 34 ms! Which, I think, proves that in many cases humans think faster than computers :)


Filed under SQL

Team and teamwork

For the past couple of weeks I've been asked multiple times the same question I thought I'd answered a long time ago: Hettie, why do you want to have a database developer on each squad? Wouldn't it be better if you just had a team of database developers, and if there is a need for database work, you would dispatch a developer to a project?

I was so sure I'd written about it already, but I've just looked over all the "people" tags and didn't find anything… so I guess I just always meant to write about it, and never did :).

In short: if a squad does not have a database developer, it becomes much more difficult for the squad members to realize they actually need some database work. Since the database is "the backend of the backend", many potential problems are not recognized as "DB problems" or "DB tasks".

On the other hand, when a database developer is indeed a member of the squad, when he or she is present starting from the very first design sessions, when (s)he understands the project/product needs, then better solutions can be introduced from the very beginning, and potential problems can be resolved before they occur.

It's very rare that "an outsider" or "a consultant", which is what a database developer becomes if (s)he is not a squad member, can grasp all the small details of a project. I personally do not believe in a "theoretically correct solution". In the overwhelming majority of cases the solution should be tailored for a particular case. Foreign keys, other constraints, and functional dependencies originate in the real world, not in the database. Not always, but in most cases, you can't just look at the database schema and say whether it is "right" or "wrong". And that's where a database developer on a team matters – to do things right. Right away.


Filed under Team and teamwork

Creating a new primary key

I was going to post about this interesting problem for a couple of months, but every single time life would get in my way :). First I wanted to post the problem and wait to see whether somebody would come up with a solution. However, since I didn't… let me just tell the whole story.

Let's say you have a table in which you store e-mail messages; let's call it all_messages. And similar to Google's tags, each message might have multiple tags, which can be assigned to the message or removed from it at any time. The table all_tags_for_messages contains just a pair of columns (message_id, tag_id).

Obviously, the primary key for this table consists of both columns of the table, right?

Now imagine you need to send all new records from this table to the data mart. Let's say each hour you need to send all new records which were added to this table – how are you going to select them? Neither message_id nor tag_id is ordered. Will you run EXCEPT each time? How long will that take as the table continues to grow?
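The brute-force EXCEPT approach would look something like this; already_sent is a hypothetical name for whatever copy of the shipped records the warehouse side keeps:

```sql
-- compare the whole table against everything shipped so far;
-- both relations keep growing, so this scan gets slower every hour
SELECT message_id, tag_id
FROM all_tags_for_messages
EXCEPT
SELECT message_id, tag_id
FROM already_sent;
```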

The Data Warehouse developers asked us to add some field which would allow them to identify the "newcomers": either a serial primary key, which would be incremented each time a record is added, or a timestamp: created_at.

The application developers replied that for them it would be much easier to have a new primary key. Now: what are the database developers supposed to do? You have a primary key already. The table is huge. How are you going to switch the primary key without locking the table for a long time? (And you can't really lock it for any visible time, because it's being used non-stop.)

Here is a solution which one of our database developers came up with.

First, we do not really need to change the primary key; it's enough to have one additional unique field, which will be populated from a sequence. Now it will be easy to populate this field for new records. What about the existing records?

Here is a proposed solution:

– create a "main" sequence with a really big START value
– create a "temporary" sequence starting from 1
– use the "temporary" sequence for a background backfill of the new column
– when the backfill is completed, drop the sequence we do not need any more.
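The steps above could be sketched roughly as follows; the column and sequence names are made up, and the START value is just a placeholder that must be safely above the number of existing rows:

```sql
-- new unique column, populated from the "main" sequence for new records
ALTER TABLE all_tags_for_messages ADD COLUMN record_id bigint;

CREATE SEQUENCE main_record_id_seq START 1000000000;
ALTER TABLE all_tags_for_messages
    ALTER COLUMN record_id SET DEFAULT nextval('main_record_id_seq');

-- "temporary" sequence for the background backfill of the old records
CREATE SEQUENCE backfill_record_id_seq START 1;
UPDATE all_tags_for_messages
    SET record_id = nextval('backfill_record_id_seq')
    WHERE record_id IS NULL;  -- in practice, run in small batches

DROP SEQUENCE backfill_record_id_seq;

-- enforce uniqueness without taking a long lock on the table
CREATE UNIQUE INDEX CONCURRENTLY all_tags_record_id_idx
    ON all_tags_for_messages (record_id);
```

Because the two ranges never overlap, the backfill can run in the background while new records keep arriving.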

This seems trivial, but only after you've figured it out :)


Filed under SQL