Category Archives: Uncategorized

Protected: 2Q PG Conf Ultimate Optimization Training Slides

Filed under Uncategorized

My Ups and Downs at Urbansoft

Today, everybody is blogging “The Horror Coding Stories.” I didn’t plan on joining in, but then I realized that I have a perfect horror story of my own (reposted from my personal blog). The context: I was working at one of the first joint ventures in Russia in the 1990s.

Hettie's Reflections

At the end of December, John went back to the US for Christmas. I was still working at that “it’s great!” project and on my makeshift database. And I came up with something cool. Something I was very proud of.

I did most of that work at home because it was around the holidays. Although I did have a modem, that was before the times you could email a bulky attachment, so usually I would pack my code with the tar command, copy the .tar file to a diskette, and take the diskette to the office.

John was supposed to be back the next day, and I was anticipating my triumph. At about 9 PM, when the kids were already long asleep, I started to make my final .tar file.

Nowadays, even some of the younger IT people might not know what the tar command does, let alone those of…

Filed under Uncategorized

What I am looking (and not looking) for

Since I’ve been looking for database developers and DBAs for quite some time now, and since virtually everybody knows about this, people often ask me: what are you looking for? What skills and qualifications are you interested in? Who would be your ideal candidate?

Most of the time I reply: please read the job description. I know that all the career advisors tell you to “apply even if you do not have some qualifications”, but in my job postings, I actually need the qualifications listed as “required”, and I would really prefer candidates who have the “is a plus” qualifications.

Also, there are definitely some big DON’Ts, which I wish I would never ever hear again during an interview:

  • when asked for the definition of a foreign key, starting your answer with “when we need to join two tables”
  • when asked about normalization, starting with “for better performance”
  • when asked about schemas, saying that we use them for storage optimization

Today, however, I was asked a different question: why do you say that you are looking for skilled candidates, and at the same time admit that anybody who gets hired will face a long learning process? If a candidate does not know something, doesn’t it mean (s)he does not have enough skills? Doesn’t it mean (s)he is underqualified?

I thought for a while before I responded. When I was first hired as a Postgres DBA, it was a senior position right away, although at that time I did not know any Postgres at all. But the people who hired me were confident not only that I could learn fast, but also that I could generalize my existing knowledge and skills and apply them in the new environment.

To build on this example, there are two prerequisites for success: knowledge, and the ability to apply it in real-life circumstances.

I think that a person who wants to succeed as a database developer or a DBA should possess a solid knowledge of relational theory. But it is not enough to memorize your Ullman or Jennifer Widom; you need to be able to connect this theory to real-world problems. This is such an obvious thing that I never thought I would need to write about it, but life proved me wrong :).

The same goes for the situation when a candidate has a lot of experience with another database, not the one you need. Yes, database systems may differ, and differ significantly. Can a very skilled Oracle DBA be qualified for a Postgres DBA position? Yes, if this person knows how to separate generic knowledge from system specifics.

If you know how to read an execution plan in Oracle and, more importantly, why you need to be able to read it, you will have no problem reading execution plans in Postgres. If you used the system tables in Oracle to generate dynamic SQL, you will know exactly what to look for in the Postgres catalog. And if you know that queries can be optimized, it will help no matter what the specific DBMS is. And it won’t help if the only thing you know is how to execute utilities.

… No idea whether this blog post is optimistic or pessimistic, but… we are still hiring 🙂

Filed under SQL, Uncategorized

Don’t forget about Chicago PUG this Tuesday!

Attention, fellow Chicagoans! Do you want to learn more about PG Open 2017 straight from the participants? Come to the Chicago PUG meetup on September 12!

Here is a link to the meetup – please RSVP there.

Filed under events, Uncategorized

May PUG with Joe Conway!

I neglected to advertise our May event, and this is indeed going to be the most interesting meetup of 2017, because in just two days, on May 19, Joe Conway will be speaking at Chicago PUG.

I definitely do not need to advertise him, but I am advertising the fact of his appearance in Chicago, and I encourage everybody to attend.

Please RSVP at our Meetup page; hope to see you there.

Filed under events, People, Uncategorized

Using the S3 .csv FDW for VERY LARGE files

As usual, when I am asked “Hettie, can we do it?”, I reply: Sure! Because I never ever assume there is something I can’t do :). But this time I didn’t even give it a second thought: there was simply no doubt!

Since I started to build a new Data Warehouse for Braviant in May 2016, I have actively used Postgres FDWs to integrate heterogeneous data sources. The last one I incorporated was the said csv FDW for Amazon S3.

Speaking of csv files, anybody who has ever worked with them knows that they are not as “primitive” as one might think. The Postgres documentation underlines the fact that a file which other programs consider to be a .csv file might not be recognized as csv by Postgres, and vice versa. So when something goes wrong with a csv file… you can never be sure.

This time around, the problem I was facing was that the .csv file was huge. How huge exactly? About 60GB. And there was no indication in the Postgres documentation of what precisely the maximum size is for a .csv file which can be mapped through the FDW. When I talked to people who had some experience mapping large files, the answer was “the bigger the file, the more complicated it gets”. To add to the mix, there were actually many different .csv files, coming from different sources, of different sizes, and I could not figure out why some of them were mapped easily and some produced errors. The original comment that “files over 1GB might have problems” didn’t appear to be relevant, since I had plenty of files of 1GB or more which I was able to map with no problem.

It might look funny that it took me so long to figure out what the actual problem was. What I am trying to explain is that I got confused about what to expect, because I was more than sure that “size does not matter”, that it only matters in combination with other factors, like the cleanness of the format… and I wasted a whole week on fruitless and exhausting experiments.

Until one morning when I decided “to start from scratch” and to try starting from very small files. And once I realized that the file size is actually the only factor which matters, I found the real limit within 3 hours: the file has to be under 2GB to be successfully mapped!

And here another story starts: what I’ve done to avoid copy-pasting 37 times… but this will be the topic of my next post!
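To make the 2GB limit concrete, here is a minimal Python sketch of one way to cut a large .csv into parts that each stay under a size cap, repeating the header row in every part. It is only an illustration and not necessarily what was done for this warehouse: the function names, the part_NNN.csv naming, and the 64MB safety margin are assumptions, and the sketch expects a UTF-8 file that begins with a header row.

```python
import csv
import io
import os

# Assumed cap: the 2 GB limit from the post minus a 64 MB safety margin.
MAX_BYTES = 2 * 1024**3 - 64 * 1024**2


def _encode_row(row):
    """Format one parsed row as a CSV line and return it as UTF-8 bytes."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow(row)
    return buf.getvalue().encode("utf-8")


def split_csv(src_path, out_dir, max_bytes=MAX_BYTES):
    """Split src_path into part files of at most max_bytes each.

    Rows are parsed with the csv module, so a quoted field containing
    a newline is never cut in half. Parts are re-written with default
    csv quoting, so they are equivalent to the source but not
    necessarily byte-identical. A single row larger than max_bytes
    would still produce an oversized part.
    Returns the list of part file paths, in order.
    """
    parts = []
    out = None
    size = 0
    try:
        with open(src_path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = _encode_row(next(reader))
            for row in reader:
                line = _encode_row(row)
                # Start a new part when this row would not fit.
                if out is None or size + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    name = f"part_{len(parts) + 1:03d}.csv"
                    path = os.path.join(out_dir, name)
                    out = open(path, "wb")
                    out.write(header)
                    size = len(header)
                    parts.append(path)
                out.write(line)
                size += len(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Each resulting part could then be mapped as its own foreign table. Formatting every row through a fresh buffer is slow for a 60GB file, so a real job would likely want something faster; the point here is only the size bookkeeping.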




Filed under Uncategorized

ACM Hour of Code Dec 5-11

I am re-posting the ACM newsletter about the upcoming Hour of Code – please consider organizing something in your community!


Organize an Hour of Code in Your Community During Computer Science Education Week, December 5-11

Over the past three years, the Hour of Code has introduced over 100 million students in more than 180 countries to computer science. ACM (a partner of, a coalition of organizations dedicated to expanding participation in computer science) invites you to host an Hour of Code in your community and give students an opportunity to gain the skills needed for creating technology that’s changing the world.

The Hour of Code is a global movement designed to generate excitement in young people. Games, tutorials, and other events are organized by local volunteers from schools, research institutions, and other groups during Computer Science Education Week, December 5-11.

Anyone, anywhere can organize an Hour of Code event, and anyone from ages 4 to 104 can try the one-hour tutorials, which are available in 40 languages. Learn more about how to teach an Hour of Code. Visit the Get Involved page for additional ideas for promoting your event.

Please post activities you are hosting/participating in, pass along this information, and encourage others to post their activities. Tweet about it at #HourOfCode.

Filed under events, news, Uncategorized

My experience with Xplenty

… what I liked, and why I’ve opted not to use their services.

When I first started at Braviant, and realized within just a couple of weeks that I would need to build a new data mart, the question was: how can I do it with four different external data sources? Not to mention having no IT and no app developers.

At first my plan was to use foreign data wrappers, since there are FDWs for all of the data source types I needed: SQL Server, MySQL, .csv files and Postgres itself. So everything seemed easy, except… well, except that RDS does not support any FDW aside from the Postgres one.

I started to look for alternatives, and several people pointed me to Xplenty, so I decided to give it a try. I almost feel bad for choosing in the end not to go with their solution, because these folks spent an enormous amount of time discussing my needs and trying their best to accommodate my wants. And I believe that for many organizations it might indeed be a very sensible solution.

Who and when should consider using Xplenty?

  • Organizations which have no IT department, or a very small one without enough expertise in data integration
  • The number of tables to be integrated is small (or reasonably small)
  • The speed and/or frequency of data refresh/pull is not a big concern
  • There is little or no special data processing

One of the definitely positive things about Xplenty is their customer service: they actually get back to you, they talk to you, they are really focused on resolving your problems. They will give you a sandbox instance to try everything for a week, and you can perform as many data pulls during this trial as you want. They will help you debug your scheduled jobs. Another great thing is that you do not really need to know anything about these external systems except the connection details. All the meta-information will be extracted and processed, and the data will be presented to you.

So why did we end up not using their services? Well, because, as often happens, things which are good for some customers are not good for others. We needed to map over 300 tables in total, and this was a completely manual process. Besides, it turned out that some column names in our external data sources were keywords in Postgres, so they required special coding. Yes, Xplenty supports this option, but again, it is a manual process.
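The keyword problem is easy to show in a few lines. The Python sketch below double-quotes any column name that is a reserved word or is not a plain lowercase identifier, which is the kind of rule a generated mapping could apply instead of fixing 300 tables by hand. The RESERVED set is a small illustrative subset rather than the full Postgres list, and the helper names are made up for this example. Inside the database, Postgres offers its own quote_ident() function and the %I specifier of format() for the same purpose.

```python
import re

# Illustrative subset only; the complete list of reserved words is in
# the PostgreSQL documentation (SQL Key Words appendix).
RESERVED = {"user", "order", "group", "table", "select", "end",
            "limit", "offset"}

# A name that is safe to use unquoted: lowercase letters, digits and
# underscores, not starting with a digit.
_PLAIN = re.compile(r"[a-z_][a-z0-9_]*\Z")


def quote_ident(name):
    """Return name unchanged if it is safe, else wrapped in double quotes."""
    if _PLAIN.match(name) and name not in RESERVED:
        return name
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


def select_list(columns):
    """Build the column list of a SELECT from raw source column names."""
    return ", ".join(quote_ident(c) for c in columns)
```

For example, select_list(["id", "user", "end"]) gives id, "user", "end".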

There were several other things which also had to be coded manually (for example, integer 0/1 was force-converted to boolean), but the biggest problem was the speed of refresh. Again, if you have just a handful of tables which should be refreshed a couple of times a day, there is no problem at all. But if you have 300 tables which have to be refreshed every hour or so, and each table takes at least a minute to refresh (done one at a time, that is at least five hours per cycle)… you get the idea.

To summarize: there are lots of cases when Xplenty will be the easiest solution and the fastest to deliver. It didn’t work for us, but on the other hand I do not think any out-of-the-box solution would work for us; we ended up with custom development.



Filed under Companies, Data management, Uncategorized

Getting ready for ICDE 2016

This Monday is the deadline for the camera-ready paper versions, and I am so glad we were done more than a week ago! That was the most complex submission in my life (granted, I haven’t had a lot of them :)), and the instructions were somewhat confusing.

A couple of days ago the list of accepted Industrial papers finally appeared on the conference website, so now we can proudly show everybody at work that “we are there”!

And you know what? It feels really good to see ourselves in the company of Google, Oracle and Teradata!

We still have lots of work to do; I never had to prepare these huge posters, so we need to decide what we are going to put on them; and our idea of all three of us presenting will require a lot of rehearsing… but it is so-so-so exciting! I still can’t believe we did it 🙂

Filed under events, talks, Uncategorized

I haven’t been posting anything forever, but now I am going to fix this, and I plan to write about all the interesting things in the World of Data on a more regular basis. And I am going to start (or rather resume) by writing about the book I read recently – the book which totally blew my mind!

The book is called Managing Time in Relational Databases: How to Design, Update and Query Temporal Data, and it presents the most complete bi-temporal data model. Actually, you may call it “tri-temporal”, because in the “classic” bi-temporal model you have an “effective” time and a “system time”, and the system time just indicates, when the record was added or updated.

However, in the model which is described in this book – Asserted Versioning – an additional concept, assert time, is introduced. That is, “the time we believe(d) it was true”.

Let me tell you that I have been the biggest fan of Richard Snodgrass’s temporal database concepts probably since the time they were first published (or something close to that :)). I really “felt” them, and for longer than I can remember I have wanted to implement them in real life, in some real project.

I can’t even say that nobody ever needed temporal data. It’s just that for some reason people strongly believe that storing changes is very resource-consuming. Which is not exactly true. As the authors of this book point out, it’s way easier to convince your manager to store one year’s worth of transaction logs than to store the object versions for the same period of time, although the latter requires way less space and is way easier to use. I constantly had a hard time convincing people that 1) versioning is not so space-consuming, 2) you need to have both a start and an end date, not just the start date, 3) the “current” state is not the one where the end date IS NULL, but the one where the end date = INFINITY, and the list can go on and on…

This being said, the book feels extremely refreshing. It is just so clear what the authors are trying to achieve; I understand every single statement completely, and I do not need to go through all the examples to understand the concepts…
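Points 2) and 3) can be sketched in a few lines of Python. This is only an illustration of the idea, not the book’s Asserted Versioning model: it tracks a single time dimension, the Version class and the sample data are invented for the example, and date.max stands in for INFINITY (in Postgres itself, 'infinity' is a valid date and timestamp value).

```python
from dataclasses import dataclass
from datetime import date

# Stand-in for INFINITY: the largest date Python can represent.
INFINITY = date.max


@dataclass(frozen=True)
class Version:
    key: str
    value: str
    start: date            # first day this version is effective
    end: date = INFINITY   # first day it is no longer effective


def as_of(versions, key, when):
    """Return the version of key effective on the given date, or None.

    Because an open-ended version ends at INFINITY rather than NULL,
    a single comparison covers past and current versions alike; with
    a NULL end date every lookup would need an extra "or end is None".
    """
    for v in versions:
        if v.key == key and v.start <= when < v.end:
            return v
    return None


def current(versions, key):
    """Return the open-ended version of key: the one ending at INFINITY."""
    for v in versions:
        if v.key == key and v.end == INFINITY:
            return v
    return None


# Hypothetical history: a customer who moved on June 1, 2015.
history = [
    Version("cust-1", "Chicago", date(2014, 1, 1), date(2015, 6, 1)),
    Version("cust-1", "Evanston", date(2015, 6, 1)),
]
```

With both dates stored, as_of(history, "cust-1", date(2014, 3, 1)) finds the Chicago version directly, without looking at the next row to work out where it ended.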

Now more than ever I really, really, really wish… I could implement it somewhere in real life 🙂


December 4, 2015 · 6:19 pm