Using the S3 .csv FDW for VERY LARGE files

As usual, when I am asked – Hettie, can we do it? – I reply: Sure! Because I never ever assume there is something I can’t do :). But this time I didn’t even give it a second thought: there was simply no doubt!

Since I started building a new Data Warehouse for Braviant in May 2016, I have actively used Postgres FDWs to integrate heterogeneous data sources. The latest one I incorporated was the aforementioned csv FDW for Amazon S3.
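To give an idea of what “mapping” means here, the setup follows the usual FDW pattern: create the extension, create a server, and declare a foreign table that points at the file in the bucket. Below is a minimal sketch assuming a Multicorn-based S3 csv wrapper (such as s3csv_fdw); the wrapper class, option names, schema, bucket, and columns are illustrative, so adjust them to whichever extension you actually install.

```sql
-- Minimal sketch of mapping a csv file in S3 as a foreign table.
-- Assumes a Multicorn-based S3 csv wrapper (e.g. s3csv_fdw); the wrapper
-- class and option names below are illustrative and may differ in your FDW.
CREATE EXTENSION IF NOT EXISTS multicorn;

CREATE SERVER s3_csv_server
    FOREIGN DATA WRAPPER multicorn
    OPTIONS (wrapper 's3csvfdw.s3csvfdw.S3CsvFdw');

-- Hypothetical schema, table, columns, bucket, and file name.
-- AWS credentials are supplied via server/table options or the environment,
-- depending on the wrapper you use.
CREATE FOREIGN TABLE stg.loans_raw (
    loan_id          text,
    origination_date text,
    amount           text
) SERVER s3_csv_server OPTIONS (
    bucket   'my-bucket',
    filename 'exports/loans.csv'
);

-- Once the file is mapped, it can be queried like any other table:
-- SELECT count(*) FROM stg.loans_raw;
```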

Speaking of csv files, anybody who has ever worked with them knows that they are not as “primitive” as one might think. The Postgres documentation underlines the fact that a file which other programs consider to be a .csv file might not be recognized as csv by Postgres, and vice versa. So when something goes wrong with a csv file… you can never be sure.
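For what it’s worth, the knobs that decide whether Postgres accepts a file as csv are the usual COPY-style parsing options: format, header, delimiter, quote, escape, null. Here is a small illustration with the stock file_fdw (the local-file wrapper that ships in contrib, not the S3 one) just to show where those knobs live; the file path and columns are made up.

```sql
-- Illustration with file_fdw (ships with Postgres contrib) to show the
-- csv parsing options; the path and columns here are made up.
CREATE EXTENSION IF NOT EXISTS file_fdw;

CREATE SERVER local_files FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE sample_csv (
    id     text,
    name   text,
    amount text
) SERVER local_files OPTIONS (
    filename  '/tmp/sample.csv',
    format    'csv',      -- strict csv rules, not the default 'text' format
    header    'true',     -- skip the header line
    delimiter ',',
    quote     '"',
    escape    '"',
    null      ''          -- empty, unquoted fields become NULL
);
```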

This time around, the problem I was facing was that the .csv file was huge. How huge, exactly? About 60GB. And there was no indication in the Postgres documentation of what precisely the maximum size of a .csv file that can be mapped through the FDW is. When I talked to people who had some experience with mapping large files, the answer was “the bigger the file, the more complicated it gets.” To add to the mix, there were actually many different .csv files, coming from different sources and of different sizes, and I could not figure out why some of them were mapped easily while others produced errors. The original comment that “files over 1GB might have problems” didn’t appear to be relevant, since I had plenty of files of 1GB or more which I was able to map with no problem.

It might look funny that it took me so long to figure out what the actual problem was. What I am trying to explain is that I got confused about what to expect, because I was more than sure that “size does not matter,” that it only matters in combination with other factors, like the cleanness of the format… and I wasted a whole week on fruitless and exhausting experiments.

Until one morning I decided to “start from scratch” and begin with very small files. Once I realized that the file size is actually the only factor that matters, I found the real limit within 3 hours: the file has to be under 2GB to be successfully mapped!

And here another story starts: what I did to avoid copy-pasting 37 times… but that will be the topic of my next post!

 

 


3 responses to “Using the S3 .csv FDW for VERY LARGE files”

  1. Iakov

    The 2 Gb limit could be somehow related to the underlying file system. It is a kind of well-known limitation for older file systems and/or software that was conceived without very large files in mind. At least, the number “2 Gb” rings some bells in my mind.

  2. This is S3, for which we do not get to choose the version :), whatever Amazon is providing.

  3. Pingback: OK, we know what to do with very large files, but what if there are too many? | The World of Data
