Category Archives: Data management

Bitemporal documentation is available!

Everybody who was curious enough to start using our pg_bitemporal GitHub repo complained about the lack of documentation, so we’ve tried really hard to provide our followers with some guidance.

What we have now is far from perfect, but if you go to the docs directory, you will find a lot of documentation, including our old presentations, explanations of the basic bitemporal concepts and, most importantly, the first-ever bitemporal functions manual, which we promise to make more readable in the near future. Meanwhile – please share your feedback! Thank you!



Filed under Data management, research

Databases are not sexy?

I’ve heard this line from a former colleague of mine. He was explaining why there are so few database people around, and why IT students are not specializing in databases. That was his answer – it is not cool. “Web designer” sounds cool, “database developer” does not.

Several months have passed since I heard that, and I kept thinking: I should write a blog post about it! However, honestly – what else can I say except the things I’ve already said in this blog multiple times? That there is nothing more exciting than exploring the endless possibilities of SQL, and nothing more rewarding than applying your magic to a job which runs for an hour, and all of a sudden it runs in less than a minute and produces the same result :)

I suspect that the general public does not realize that there is something behind a web page, and when somebody experiences website slowness, they refer to it as “the internet is slow.” Also, buzzwords like “Big Data” often send a message that “databases are outdated,” and that there is something bigger, cooler, and more fashionable than just “databases,” which does not help a bit.

As I always like to be practical, and not only state a problem but come up with a way to solve it, I am now thinking about how to bridge this gap. I want to find ways to reach out to college and high school students and to give them some exposure to the Wonderful World Of Data. A couple of years ago, when I was attending one of the Big Conferences, I heard a discussion regarding “what our students want to be taught.” That was the time when Big Data was just becoming a Big Deal :). Honestly, my opinion is that students’ interests should not drive a curriculum entirely 🙂 and that this is exactly the right place to intervene.

Is anybody interested in joining me in this effort?


Filed under Data management, SQL

Working Under Pressure, Next Chicago PUG And The Wonders of FDW

Does anybody like working under pressure? Does anybody like a production crisis? Nobody would admit it, however…

Try to recall something really exciting which has happened to you in your work environment. No, not your recent promotion :). Try to remember the last time you did something great, and things worked fabulously. Maybe you were not running around, as I do in such cases, telling everybody what a cool thing I just came up with, but even without the running around, most likely your last great thing happened when there was some serious production issue and you had very limited time to come up with an entirely new solution.

You know how I love sharing the story of the New York High School Application Processing System! And when I tell this story, I always mention that I had very limited time to fix everything, and worked 16-hour days for two weeks.

What happened recently at my work was not that kind of crisis, but it was still a production issue. And like 99% of all of our production issues, this one happened during some interaction with our third-party service provider, who is utilizing MySQL. To be honest, knowing what I know now about MySQL, I am very close to making the statement that it is not a database! Because anything which wants to be treated as a database should know how to use indexes :).

To make the story short, we figured it all out and rewrote the query – in the most absurd way :), I have to admit. But after this one, and a whole bunch of other similar exercises, I think I have a very good understanding of how things work on the MySQL side, what gets lost when we are using an FDW, and what can be mitigated and how.
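To give a flavor of the kind of thing that gets lost, here is a hedged sketch (the table names payments and provider_transactions are made up, and the exact wording of the remote-query line depends on the wrapper): EXPLAIN VERBOSE is the quickest way to see what actually gets shipped to the MySQL side.

```sql
-- Hypothetical example: "payments" is a local table, "provider_transactions"
-- is a foreign table defined through mysql_fdw.
EXPLAIN (VERBOSE)
SELECT p.payment_id, t.status
FROM payments p
JOIN provider_transactions t ON t.external_id = p.external_id
WHERE p.created_at >= current_date - 1;

-- If the foreign scan's remote query does not contain the join key, every row
-- of provider_transactions is pulled across before the join, and the remote
-- index on external_id never gets a chance to be used.
```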

And guess what – I will be talking about all this cool stuff next Wednesday, at the Chicago PUG. Come and join the conversation!


Filed under Data management, SQL

I knew I wouldn’t like this feature! About parallel execution

Many years ago, when Postgres 9.6 was still in the making, my coworker said to me with excitement: Hettie, Postgres will now have the ability to run queries in parallel! There can be only four parallel processes, but still, isn’t it nice?!

And I remember exactly what I replied: no, I do not like this feature one bit! You know why? Because executing queries in parallel rarely solves performance problems. In fact, if four parallel workers can solve your performance problems, they were not really problems! Parallelism can do more harm than good, because it can mask real performance problems for a while, and then they just turn out to be more severe.

What happened then: I started a new chapter of my life at Braviant, where we had Postgres 9.5 at the time, and for almost three years it was like I never had time to stop and upgrade :). But now I have a team, so we finally planned the upgrade, and since we were already four versions behind, we planned to upgrade to 9.6 and immediately to PG 10.

We started with our Data Warehouse. First, in the staging environment, where we tested and observed the execution for some time. Then on the mirroring instance, and only then – on production.

And then it started! Seriously, out of all the data refreshes, this one is the only one which is important for the OLTP instance, because it sends data to one of our client interfaces. It started to behave inconsistently. Sometimes it would be just fine. Other times, instead of about 50 seconds, it would run for an hour and probably never finish if we didn’t kill it. Yes, it was obvious that something had changed in the execution plan after the upgrade. But what?! At first glance the execution plan looked the same: all HASH JOINs, which is what you would expect when you join tables with no restrictive conditions.

But it was still painfully slow. What was more puzzling: I could take the JOIN to any one table out of the equation, and performance would still be unpredictable. After several dozen attempts to make things run, I decided to take a closer look at the execution plan. And you know what I saw?! Yes, parallel execution! Four joins were parallelized, which made the execution time really horrible. After the issue was found, the only thing left was to figure out how to turn this feature off 🙂
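For anyone wondering how exactly the feature gets turned off, these are the standard knobs; the database and role names below are placeholders, not our actual setup.

```sql
-- Session level: disable parallel query entirely (available since 9.6).
SET max_parallel_workers_per_gather = 0;

-- Or pin it for a specific database or role, so the server-wide default
-- stays untouched ("mydb" and "refresh_user" are placeholder names):
ALTER DATABASE mydb SET max_parallel_workers_per_gather = 0;
ALTER ROLE refresh_user SET max_parallel_workers_per_gather = 0;
```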


Filed under Data management, Development and testing, SQL

One more time on the state of optimization

I just have to show some of the execution-time graphs, and how they have changed after the optimized versions of the respective functions were deployed:

I know that many people, looking at the second image, are wondering why I am striving to optimize things which are already running super-fast.

It’s not because I am trying to demonstrate my superpowers; it’s because I know for a fact that, with the database size we currently have, that is the right execution time. What does it mean? It means that if the execution time is more than that, it indicates a wrong execution plan.

All these optimizations have been performed on our OLTP database, which means that all of these queries are “small queries”, retrieving a relatively small number of records. This implies that the appropriate indexes should be used, and that the execution plans should show the NESTED LOOP join algorithm. When I see an execution time of 500 ms, it tells me that there is at least one full table scan inside. Which, in turn, means that the execution time will keep increasing as the data volumes grow. Which is not good if we are building a scalable system.
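Just to show what I mean, here is a sketch with made-up table names (accounts, loans); the plan shape is what matters, not the exact numbers.

```sql
-- An OLTP-style lookup should resolve to index scans inside a nested loop.
EXPLAIN (ANALYZE)
SELECT a.account_id, l.loan_id, l.balance
FROM accounts a
JOIN loans l ON l.account_id = a.account_id
WHERE a.person_id = 12345;

-- Healthy plan shape (abridged):
--   Nested Loop
--     -> Index Scan using accounts_person_id_idx on accounts a
--     -> Index Scan using loans_account_id_idx on loans l
-- A "Seq Scan on loans" in that spot is exactly the full table scan that
-- turns a millisecond query into a 500 ms one.
```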

Another important thing to consider is that all these small queries cannot be “parallelized” to speed up the execution. We are in an OLTP environment, not OLAP. I know that I can’t rely on switching to a larger AWS instance, because 1) this process gets out of control very fast, and 2) it does not help. Seeing execution times like the ones in both of these pictures, so short that you “can’t even see them,” just proves that the functions are performing as expected.


Filed under Data management, Development and testing, SQL

About one very controversial optimization

I have been meaning to write about this optimization story for a while, not because somebody can use it as an example, but rather as a counter-example, a case of “what I’ve taught you so far is all completely wrong and not working”.

It all started with the story which I described in this blog post: the problem was that indexes on the remote site were not being used, and in order to make things work I had to create a remote function which would “inject” the constants into a remotely executed query and then map the results. A couple of weeks later I had to face a similar problem, which had also manifested itself only in the production environment. But that time around the performance degradation was so severe that we had to roll back the changes.

First I thought that I would be able to deal with the similar problem in a similar way, but it turned out that was not the case. The SQL involving remote tables was a little bit more complex, and the function didn’t help. I realized that I needed to explicitly pass some constants to the dynamic query, which meant that I had to commit a crime against relational theory: breaking one perfectly fine SQL statement into two, and passing the result of the first selection to the second statement. This performed better, but still not ideally. I continued to follow the same path. It became better. I repeated it one more time. And one more time.

In the end it was a textbook example of how not to write SQL. My own lectures were always filled with these “how it will not perform” examples: first you select a person, then using the person id you select a position, then… But what I did to optimize the performance of my function was the same thing almost to the letter. I needed to select all loans for a specific person and some of the loan-related details. And instead of writing one SQL statement, I ended up doing the following:

  • having a person_id, select all accounts
  • generate a SQL statement with all these user accounts embedded as constants to retrieve the loans
  • generate a set of SQL statements which would access remote information for each of the loans individually
  • and actually there was more in the middle

Part of the problem is that in PostgreSQL functions are not really stored in a truly “compiled” way; they should rather be viewed as “stored sequences of actions”. And parametrized statements, unlike those in some other systems, are not always optimized properly. That’s why in most cases I build dynamic SQL inside the functions. But in the case I am describing here, things were even worse. Even when I was executing a SQL statement for a specific person id, the indexes on user account and loan were not really utilized (at least not always). So in order to push through the usage of indexes, I had to explicitly pass the values to the generated SQL…
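For the curious, here is a minimal, sanitized sketch of the “embed the constants” pattern; all names (get_person_loans, accounts, loans) are made up, and the real production code was considerably messier.

```sql
CREATE OR REPLACE FUNCTION get_person_loans(p_person_id int)
RETURNS TABLE (loan_id int, balance numeric)
LANGUAGE plpgsql AS
$$
DECLARE
    v_account_ids int[];
BEGIN
    -- Step 1: resolve the account ids locally.
    SELECT array_agg(account_id) INTO v_account_ids
      FROM accounts
     WHERE person_id = p_person_id;

    -- Step 2: build the next query with the ids spelled out as literals,
    -- so the planner (and the remote side behind the FDW) sees constants
    -- instead of parameters.
    RETURN QUERY EXECUTE format(
        'SELECT loan_id, balance FROM loans WHERE account_id = ANY (%L::int[])',
        v_account_ids);
END;
$$;
```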

It all worked in the end, and performance was amazing, but I can’t even convince myself to post the real code snippets – it looks ugly 🙂

Oh, and another part of the equation is that all those SELECT statements were selecting a really small portion of the data from each of the tables, so there were not that many total executions.

The moral of the story: trust, but verify!


Filed under Data management, Development and testing, SQL

I hate views!

I really do. I have no idea why they were even introduced into the SQL language – for the record, there are no views in relational theory. There is no justification for views to exist.

I suspect that the person who introduced views for the first time just thought about themselves and was too lazy to repeat the same SQL over and over again. And then they never gave a thought to the fact that somebody else might start using the same view.

The first reason I hate views actually comes out of the only legitimate reason to use them at all (yes, I am contradicting myself). I believe the only legitimate reason to have a view is to use it as a canned report. So you come up with a perfect SQL statement for some requirements, and then you give it to the end user in the form of a view and say: use this! Just SELECT * FROM this-view.

But the next thing that happens (and that is my reason number one to hate views) is that the end users take the words “as good as a table” literally and start actually using your perfect view as a table: adding some conditions, or worse – joining it with other tables. Actually, even adding a condition may be damaging enough if your view contains a GROUP BY clause.

Why is this so bad? Because in general Postgres (and most other databases) does not guarantee anything regarding how the execution plan involving the view will go. It might or might not execute the view first. It may or may not push down additional selection criteria. In general, the performance becomes unpredictable.
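A quick illustration with made-up names (payments, daily_totals) of why even “just adding a condition” can misbehave when the view aggregates:

```sql
CREATE VIEW daily_totals AS
SELECT customer_id,
       date_trunc('day', created_at) AS day,
       sum(amount) AS total
FROM payments
GROUP BY customer_id, date_trunc('day', created_at);

-- The end user "just adds a condition":
SELECT * FROM daily_totals WHERE total > 100;
-- A filter on the computed column cannot be pushed below the GROUP BY, so the
-- whole payments table is aggregated before a single row is filtered out.

SELECT * FROM daily_totals WHERE customer_id = 42;
-- This one usually can be pushed down, which is exactly the point: the end
-- user has no way to tell the two cases apart by looking at the view's name.
```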

The second reason is a continuation of the first. The end users do not know what’s inside the view (as they should not!), and often make the decision on which view to use literally by its name!

You might have guessed by now that there is some immediate reason I am writing this post. Here it is: a couple of weeks ago I finally found some time to address the super-poor performance of one of the views used by our marketing team. Since I wrote the first version of this view about 2.5 years ago, I thought I should be the one to fix it now.

I started to read the code, looking at what had changed in the past 2+ years. Some of the conditions looked questionable to me, and I scheduled a meeting with the primary business stakeholder to go over them.

You won’t believe what I found out! None, NONE of the complicated conditions from the original report were needed! None. Instead, only a certain subset of the old output was needed, with a couple of extra fields. So all the complex calculations from the original view were still performed, but the results were not needed. Instead, additional filtering was applied, since less than half of the result set was used. Well, the name of the view still reflected its purpose, so the marketing department never gave it a second thought 🙂

I can’t even call this an optimization! I just crossed out 80% of the old code and added a new condition. A second instead of 15 minutes! So we are good for now… till somebody tries to use this view for something else 🙂


Filed under Data management, SQL