PG Open 2015 – a talk about FDW

I didn’t pay much attention to Postgres Foreign Data Wrappers until recently, but after Jim Mlodgenski mentioned the FDW for Hadoop during his tutorial, I realized I was missing something, so I came to his other presentation (on Thursday) to learn more about ways to combine data from different sources in one query.

Once again, it’s my fault that I didn’t pay much attention to this topic before, but now I am thinking… I think we can identify three big groups of foreign data: 1) other (non-Postgres) databases, 2) non-databases (flat files), 3) other Postgres databases.

Regarding group one – if you have to do it, you have to do it. I mean, if you really need to access data from different databases in one query, there is no way around it. However, Jim demonstrated that just starting the FDW service takes a lot of time, regardless of how fast the data is retrieved afterwards. Because of this, in some cases using Hadoop will not improve but actually decrease query performance. To be honest, I do have prejudices against Hadoop… Yes, you can write to a Hadoop database fast, but what about reads? Also, in order to run any reports from Hadoop… yes, first you put a SQL layer on top of Hadoop (Hive), then you join it with some data which is stored in Postgres… under which circumstances will that be faster than accessing the same data in Postgres?

But enough about that; let’s say that at least in this scenario there is a case for some sort of FDW.

Group two: I am having trouble understanding why we may need to “wrap” a file, other than “it looks cool”. If we really need data from external files, the only way to use it efficiently is to load it. Even if we want to read this data only once, we may as well “read” it into some database table. I am just trying to imagine what happens at the disk level when an external table actually represents a flat file… and we are trying to make it behave like a proper table…
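For the record, this is roughly what “wrapping” a file looks like with the stock file_fdw extension (a sketch; the file path, table, and columns are made up for illustration):

```sql
-- file_fdw ships in contrib; it needs a server object but no remote host
CREATE EXTENSION file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- Hypothetical CSV file exposed as a read-only foreign table
CREATE FOREIGN TABLE sales_import (
    sale_id   integer,
    amount    numeric,
    sold_on   date
) SERVER csv_files
  OPTIONS (filename '/tmp/sales.csv', format 'csv', header 'true');

-- Every SELECT re-parses the file from disk, with no indexes and no
-- statistics to speak of; if the data will be queried more than once,
-- loading it into a real table is the obvious alternative:
CREATE TABLE sales AS SELECT * FROM sales_import;
```

Which rather supports the point above: at the disk level each query against the foreign table is a full sequential parse of the CSV, so the “table” never behaves like a proper table.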

Group three: definitely the most interesting usage. I asked Jim what the difference is between using dblink and an FDW for Postgres. It does not look like a huge difference, just that with an FDW we can write more complex queries, and it looks cooler :).
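To make the comparison concrete, here is a sketch of the same remote access done both ways (host, credentials, and table names are all hypothetical):

```sql
-- dblink: the remote query travels as a string, and the caller has to
-- re-declare the result's column types on every call
CREATE EXTENSION dblink;

SELECT *
FROM dblink('host=remotehost dbname=sales',
            'SELECT id, amount FROM orders')
     AS t(id integer, amount numeric);

-- postgres_fdw: the remote table is declared once, then behaves like
-- any local relation, so it can sit inside joins, CTEs, views, etc.
CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_sales
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remotehost', dbname 'sales');

CREATE USER MAPPING FOR CURRENT_USER SERVER remote_sales
    OPTIONS (user 'app', password 'secret');

CREATE FOREIGN TABLE orders (id integer, amount numeric)
    SERVER remote_sales OPTIONS (table_name 'orders');

-- Now a "more complex query" mixing local and remote data reads naturally:
SELECT c.name, sum(o.amount)
FROM customers c
JOIN orders o ON o.id = c.order_id
GROUP BY c.name;
```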

The most important takeaway was that there is no concept of a “driving node” in Postgres: no matter how hard you try, you can’t force it to execute a join or select on the node where it makes more sense; Postgres will pull “all” the data from the remote node to the local node… The advice we got was to create a view on the remote side and then select from this view… which seems to contradict the statement that during query planning in Postgres views are always substituted by their definitions…
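A sketch of that workaround, assuming postgres_fdw and hypothetical table names. Presumably there is no real contradiction: views are inlined by the planner of the server that owns them, and the local planner never sees the remote view’s definition at all — it just ships a query for an opaque foreign relation.

```sql
-- On the remote server: push the expensive work into a view, so the
-- join/aggregation runs where the data lives
CREATE VIEW recent_order_totals AS
SELECT customer_id, sum(amount) AS total
FROM orders
WHERE ordered_on > now() - interval '30 days'
GROUP BY customer_id;

-- On the local server: expose the remote view through the FDW; the
-- local planner treats it as just another foreign table, so only the
-- (already aggregated) view output crosses the wire
CREATE FOREIGN TABLE recent_order_totals (
    customer_id integer,
    total       numeric
) SERVER remote_sales OPTIONS (table_name 'recent_order_totals');

SELECT * FROM recent_order_totals WHERE total > 1000;
```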

Lots of things to think about, not that many to use right away…
