Category Archives: Uncategorized

Don’t forget about Chicago PUG this Tuesday!

Attention, fellow Chicagoans! Do you want to learn more about PG Open 2017 straight from the participants? Come to the Chicago PUG meetup on September 12!

Here is a link to the meetup: https://www.meetup.com/Chicago-PostgreSQL-User-Group/events/242081721/ – please RSVP there.



Filed under events, Uncategorized

May PUG with Joe Conway!

I neglected to advertise our May event, and it is going to be, indeed, the most interesting meetup of 2017! In just two days, on May 19, Joe Conway will be speaking at Chicago PUG.

He certainly needs no introduction from me, but I am advertising the fact that he will be appearing in Chicago, and I encourage everybody to attend.

Please RSVP at our Meetup page – I hope to see you there.


Filed under events, People, Uncategorized

Using the S3 .csv FDW for VERY LARGE files

As usual, when I am asked – Hettie, can we do it? – I reply: Sure! Because I never, ever assume there is something I can't do :). But this time I didn't even need a second thought: there was simply no doubt!

Since I started building a new data warehouse for Braviant in May 2016, I have actively used Postgres FDWs to integrate heterogeneous data sources. The latest one I incorporated was the csv FDW for Amazon S3 mentioned in the title.

Speaking of csv files: anybody who has ever worked with them knows that they are not as “primitive” as one might think. The Postgres documentation underlines the fact that a file which other programs consider to be a .csv file might not be recognized as csv by Postgres, and vice versa. So when something goes wrong with a csv file… you can never be sure what exactly is to blame.
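Just for reference, this is roughly what such a mapping looks like. The sketch below uses the built-in file_fdw rather than the S3 wrapper (the S3 csv FDW follows the same CREATE SERVER / CREATE FOREIGN TABLE pattern, just with its own server and table options), and the table and column names are made up for illustration:

-- A minimal sketch using the built-in file_fdw; an S3 csv wrapper follows the same
-- CREATE SERVER / CREATE FOREIGN TABLE pattern, only the server and options differ.
CREATE EXTENSION IF NOT EXISTS file_fdw;

CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- Hypothetical table: the column list has to match the file layout exactly.
CREATE FOREIGN TABLE payments_csv (
    payment_id   bigint,
    customer_id  bigint,
    amount       numeric(12,2),
    paid_at      timestamptz
)
SERVER csv_files
OPTIONS (
    filename  '/data/incoming/payments.csv',  -- local path; an S3 wrapper takes a bucket/key instead
    format    'csv',
    header    'true',
    delimiter ','       -- quote, escape, null etc. can be spelled out here too
);

-- The file is re-parsed on every scan, so a quick sanity check is cheap to run:
SELECT count(*) FROM payments_csv;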

This time around, the problem I was facing was that the .csv file was huge. How huge, exactly? About 60GB. And there was no indication in the Postgres documentation of the maximum size of a .csv file that can be mapped through the FDW. When I talked to people who had some experience with mapping large files, the answer was “the bigger the file, the more complicated it gets.” To add to the mixture, there were actually many different .csv files, coming from different sources and of different sizes, and I could not figure out why some of them mapped easily while others produced errors. The original comment that “files over 1GB might have problems” did not appear to be relevant, since I had plenty of files of 1GB or more which I was able to map with no problem.

It might look funny that it took me so long to figure out what the actual problem was. What I am trying to explain is that I got confused about what to expect, because I was more than sure that “size does not matter” – that it only matters in combination with other factors, like the cleanliness of the format… and I wasted a whole week on fruitless and exhausting experiments.

Until one morning I decided to “start from scratch” and try the very small files first. And once I realized that the file size is actually the only factor that matters, I found the real limit within 3 hours: the file has to be under 2GB to be mapped successfully!

And here another story starts: what I did to avoid copy-pasting 37 times… but that will be the topic of my next post!


Filed under Uncategorized

ACM Hour of Code Dec 5-11

I am re-posting the ACM newsletter about the upcoming Hour of Code – please consider organizing something in your community!

Organize an Hour of Code in Your Community During Computer Science Education Week, December 5-11

Over the past three years, the Hour of Code has introduced over 100 million students in more than 180 countries to computer science. ACM (a partner of Code.org, a coalition of organizations dedicated to expanding participation in computer science) invites you to host an Hour of Code in your community and give students an opportunity to gain the skills needed for creating technology that’s changing the world.

The Hour of Code is a global movement designed to generate excitement in young people. Games, tutorials, and other events are organized by local volunteers from schools, research institutions, and other groups during Computer Science Education Week, December 5-11.

Anyone, anywhere can organize an Hour of Code event, and anyone from ages 4 to 104 can try the one-hour tutorials, which are available in 40 languages. Learn more about how to teach an Hour of Code. Visit the Get Involved page for additional ideas for promoting your event.

Please post activities you are hosting/participating in, pass along this information, and encourage others to post their activities. Tweet about it at #HourOfCode.


Filed under events, news, Uncategorized

My experience with Xplenty

… what I liked, and why I opted not to use their services.

When I first started at Braviant and, within a couple of weeks, realized that I would need to build a new data mart, the question was: how can I do it with four different external data sources? Not to mention having no IT department and no app developers.

At first my plan was to use foreign data wrappers – there are FDWs for all of the data source types I needed: SQL Server, MySQL, .csv files, and Postgres itself. So everything seemed easy, except… well, except that RDS does not support any of these FDWs, aside from the Postgres one.
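For the record, the one wrapper RDS does allow is postgres_fdw, and setting it up looks roughly like this (host, credentials, and schema names are placeholders, not our actual setup):

-- postgres_fdw is the one FDW available on RDS; all names and credentials below are placeholders.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER legacy_pg
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'legacy.example.com', port '5432', dbname 'legacy');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER legacy_pg
    OPTIONS (user 'readonly', password 'secret');

-- Pull in all remote table definitions at once instead of declaring each one by hand.
CREATE SCHEMA IF NOT EXISTS staging;
IMPORT FOREIGN SCHEMA public
    FROM SERVER legacy_pg
    INTO staging;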

I started to look for alternatives, and several people pointed me to Xplenty, so I decided to give it a try. I almost feel bad about choosing, in the end, not to go with their solution, because these folks spent an enormous amount of time discussing my needs and trying their best to accommodate my wants. And I believe that for many organizations it might indeed be a very sensible solution.

Who should consider using Xplenty, and when?

  • Your organization has no IT department, or a very small one without much data integration expertise
  • The number of tables to be integrated is small (or reasonably small)
  • The speed and/or frequency of data refreshes/pulls is not a big concern
  • There is little or no special data processing

One definitely positive thing about Xplenty is their customer service: they actually get back to you, they talk to you, and they are really focused on resolving your problems. They will give you a sandbox instance to try everything for a week, and you can perform as many data pulls during the trial as you want. They will help you debug your scheduled jobs. Another great thing is that you do not really need to know anything about the external systems except the connection details: all the meta-information is extracted and processed, and the data is presented to you.

So why did we end up not using their services? Well, because, as often happens, the things which are good for some customers are not good for others. We needed to map over 300 tables in total, and this was a completely manual process. Besides, it turned out that some column names in our external data sources were reserved keywords in Postgres, so they required special coding – yes, Xplenty supports this option, but again, it is a manual process.
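To illustrate what that “special coding” means on the Postgres side (the column names here are invented): a source column called user or order collides with a reserved word, so it has to be double-quoted, or renamed, everywhere it appears.

-- Hypothetical illustration: "user" and "order" are reserved words in Postgres,
-- so columns carrying those names must be double-quoted (or renamed) everywhere.
CREATE TABLE crm_accounts (
    id       bigint PRIMARY KEY,
    "user"   text,
    "order"  integer
);

SELECT "user", "order"
FROM crm_accounts
WHERE "order" > 100;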

There were several other things which also had to be fixed manually – for example, integer 0/1 columns were force-converted to boolean – but the biggest problem was the speed of refresh. Again, if you have just a handful of tables which should be refreshed a couple of times a day, there is no problem at all. But if you have 300 tables which have to be refreshed every hour or so, and each table takes at least a minute to refresh… you get the idea.
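As a made-up illustration of that kind of manual fix-up: if a 0/1 flag arrives force-converted to boolean while the downstream code expects an integer, it has to be cast back explicitly, for example:

-- Hypothetical fix-up: a 0/1 flag that arrived force-converted to boolean,
-- cast back to integer for downstream code that expects 0/1.
ALTER TABLE loans
    ALTER COLUMN is_active TYPE integer
    USING (CASE WHEN is_active THEN 1 ELSE 0 END);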

To summarize: there are lots of cases in which Xplenty will be the fastest and easiest solution to deliver. It didn’t work for us, but then, I do not think any out-of-the-box solution would have worked for us – we ended up with custom development.


Filed under Companies, Data management, Uncategorized

Getting ready for ICDE 2016

This Monday is the deadline for the camera-ready paper versions, and I am so glad we were done more than a week ago! That was the most complex submission of my life (granted, I haven’t had a lot of them :)), and the instructions were somewhat confusing.

A couple of days ago the list of accepted Industrial papers finally appeared on the conference website, so now we can proudly show everybody at work that “we are there”!

And you know what? It feels really good to see ourselves in the company of Google, Oracle and Teradata!

We still have lots of work to do: I have never had to prepare one of these huge posters before, so we need to decide what we are going to put on it; and our idea of all three of us presenting will require a lot of rehearsing… but it is so, so, so exciting! I still can’t believe we did it 🙂


Filed under events, talks, Uncategorized

I haven’t been posting anything for ages, but now I am going to fix that, and I am planning to write about all the interesting things in the World of Data on a more regular basis. And I am going to start (or rather resume) by writing about a book I read recently – a book which totally blew my mind!

The book is called Managing Time in Relational Databases: How to Design, Update and Query Temporal Data, and it presents the most complete bi-temporal data model. Actually, you may call it “tri-temporal”, because in the “classic” bi-temporal model you have an “effective” time and a “system” time, and the system time just indicates when the record was added or updated.

However, in the model described in this book – Asserted Versioning – an additional concept, assert time, is introduced. That is, “the time we believe(d) it was true”.

Let me tell you that I have been the biggest fan of Richard Snodgrass’s temporal database concepts probably since the time they were first published (or something close to that :)). I really “felt” them, and for as long as I can remember I have wanted to implement them – in real life, in some real project.

I can’t even say that nobody ever needed temporal data. It’s just that, for some reason, people strongly believe that storing changes is very resource-consuming. Which is not exactly true. As the authors of this book point out, it’s way easier to convince your manager to store one year’s worth of transaction log than to store object versions for the same period of time, although the latter requires way less space and is way easier to use. I constantly had a hard time convincing people that 1) versioning is not that space-consuming, 2) you need both a start and an end date, not just the start date, and 3) the “current” state is not the one whose end date IS NULL, but the one having end date = INFINITY – and the list goes on and on…
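Just to make that last point concrete, here is a minimal versioning sketch (the table and column names are mine, not the book’s):

-- Minimal versioning sketch (names are mine, not the book's): every change closes
-- the old row and opens a new one; the current version carries valid_to = 'infinity',
-- not NULL, so range comparisons work the same for current and historical rows.
CREATE TABLE customer_version (
    customer_id  bigint      NOT NULL,
    name         text        NOT NULL,
    valid_from   timestamptz NOT NULL,
    valid_to     timestamptz NOT NULL DEFAULT 'infinity',
    PRIMARY KEY (customer_id, valid_from)
);

-- Current state: compare against infinity, not IS NULL.
SELECT * FROM customer_version WHERE valid_to = 'infinity';

-- State as of a given moment: the same predicate works for current and historical rows.
SELECT * FROM customer_version
WHERE valid_from <= timestamptz '2015-06-01'
  AND valid_to   >  timestamptz '2015-06-01';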

That being said, the book feels extremely refreshing. It is just so clear what the authors are trying to achieve; I understand each and every statement completely, and I do not need to go through all the examples to understand the concepts…

Now, more than ever, I really, really, really wish… I could implement it somewhere in real life 🙂

December 4, 2015 · 6:19 pm