Monthly Archives: February 2017

I honestly hope it’s not a bug!

When I shared this discovery with my co-workers, they told me I'd better not write a blog post about it, because god forbid it actually is a bug, somebody will fix it, and we'll lose this feature. But I hope it's not!

… For years I've been complaining about the fact that Postgres functions are atomic, meaning there is no way to have transactions inside a function, so it's impossible to commit intermediate results: it's always either all or nothing. Not that I really wanted checkpoints and such, but processing huge data volumes without the option of committing intermediate results is challenging, to say the least. You are bound to have long-running transactions, extensive locks, and so on. I really missed the option I had with Oracle functions to commit every 100,000 records….
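
As an illustration, here is a minimal sketch (with hypothetical table and function names, not code from any real project) of the Oracle-style incremental commit that a Postgres function could not do at the time: the whole function body runs inside the caller's transaction, so an explicit COMMIT simply raises an error and everything rolls back.

```sql
-- Hypothetical names; a sketch of what does NOT work inside a Postgres function.
CREATE OR REPLACE FUNCTION load_first_batch() RETURNS void AS $$
BEGIN
    INSERT INTO target_table
    SELECT * FROM staging_table LIMIT 100000;
    COMMIT;   -- raises an error at runtime: the function runs inside one transaction
END;
$$ LANGUAGE plpgsql;

-- SELECT load_first_batch();   -- fails at the COMMIT, and the insert is rolled back
```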

For a while I kept asking the lead Postgres contributors "how much longer", and for a while they kept replying "in the next release", until they just stopped replying…

So, the other day I was testing my new function, which builds a table out of multiple materialized views; for each INSERT I have a prepared statement, which is executed by a single EXECUTE operator. When the execution crashed, because one of the materialized views that was supposed to be in place was missing, I thought: well… now I need to start all over again… and to my surprise I saw that all the inserts which had happened before the crash persisted!

So, let's reiterate. It might be the same function, but if the SQL statements are executed as generated statements using the EXECUTE operator, each of the executions will be treated as a separate transaction! Which is pretty awesome, keeping in mind that we need to insert over 18 million records! And no, I do not mean I am going to insert 18 million times 🙂
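
For context, the general shape of such a function is sketched below. This is a hedged sketch with hypothetical materialized-view and table names, not the actual production code: each INSERT statement is built as text and run through the EXECUTE operator.

```sql
-- Hypothetical names; the real function and views are not shown in the post.
CREATE OR REPLACE FUNCTION build_report_table() RETURNS void AS $$
DECLARE
    mv text;
BEGIN
    FOREACH mv IN ARRAY ARRAY['mv_loans', 'mv_payments', 'mv_customers']
    LOOP
        -- Each generated INSERT is run as dynamic SQL via EXECUTE.
        EXECUTE format('INSERT INTO report_table SELECT * FROM %I', mv);
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```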

Filed under Data management, Development and testing, SQL

Using the S3 .csv FDW for VERY LARGE files

As usual, when I am asked, "Hettie, can we do it?" I reply, "Sure!", because I never, ever assume there is something I can't do :). But this time I didn't even give it a second thought: there was simply no doubt!

Since I started to build a new Data Warehouse for Braviant in May 2016, I have actively used Postgres FDWs to integrate heterogeneous data sources. The last one I incorporated was the aforementioned .csv FDW for Amazon S3.

Speaking of csv files, anybody who has ever worked with them knows that they are not as "primitive" as one might think. The Postgres documentation underlines the fact that a file which is considered a .csv file by other programs might not be recognized as csv by Postgres, and vice versa. So when something goes wrong with a csv file… you can never be sure.

This time around, the problem I was facing was that the .csv file was huge. How huge exactly? About 60GB. And there was no indication in the Postgres documentation of what precisely the maximum size of a .csv file that can be mapped through the FDW is. When I talked to people who had some experience with mapping large files, the answer was "the bigger the file, the more complicated it gets". To add to the mix, there were actually many different .csv files, coming from different sources and of different sizes, and I could not figure out why some of them mapped easily while some produced errors. The original comment that "files over 1GB might have problems" didn't appear to be relevant, since I had plenty of files of 1GB or more which I was able to map with no problem.
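
To make the setup concrete, the mapping itself looks roughly like the sketch below. This is a hedged illustration only: the wrapper name, the server and table options, the column list, and the S3 path are all hypothetical, and the exact option names differ between S3 FDW implementations, so check the documentation of whichever one is installed.

```sql
-- Illustrative only: option names vary between S3 .csv FDW implementations.
CREATE EXTENSION IF NOT EXISTS s3_fdw;

CREATE SERVER s3_csv_server
    FOREIGN DATA WRAPPER s3_fdw;

CREATE FOREIGN TABLE staging_events (
    event_id   bigint,
    event_time timestamp,
    payload    text
) SERVER s3_csv_server
OPTIONS (
    filename 's3://example-bucket/exports/events.csv',  -- hypothetical path
    format   'csv',
    header   'true'
);
```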

It might look funny that it took me so long to figure out what the actual problem was. What I am trying to explain is that I got confused about what to expect, because I was more than sure that "size does not matter", that it only matters in combination with other factors, like the cleanness of the format… and I wasted a whole week on fruitless and exhausting experiments.

Until one morning I decided to "start from scratch" and try the very small files first. And once I realized that the file size is actually the only factor that matters, I found the real limit within 3 hours: the file has to be under 2GB to be mapped successfully!

And here another story starts: what I did to avoid copy-pasting 37 times… but that will be the topic of my next post!

Filed under Uncategorized

February Chicago PUG is just a week away!

Hi everybody, once again I am shamelessly advertising the next meetup of the Chicago Postgres User Group.

This one will be experimental in some sense: during our January meetup we talked about alternating "beginners" and "advanced" meetings. In February we specifically target users who are relatively new to Postgres and who are often not sure what to ask about or what options are available. It's not uncommon that app developers do not want to "risk it" and try something beyond the "normal" SELECT statement.

Fear no more :). Come to the Chicago PUG meetup to learn how to use Postgres functions, and how using them can make your life as a developer or report writer much easier. Come to listen and learn (and have some pizza and beer), and to share your own experience, ask questions, and share success stories (and have more pizza and beer!).

See you there!

Filed under events