Category Archives: SQL

The best things from 2017 and what I am looking forward to in 2018

When I talk to people about 2017 and what was important, I repeat over and over: everything I was striving for during my 30+ years of professional career has happened. All my wishes have come true. Which inevitably leads to the question: well, Hettie, what are you going to do next?

And there are plenty of things I want to do, but first I wanted to reflect on 2017 one more time. I have already listed all our technical accomplishments multiple times, all the wonderful things I was able to implement in a real production environment. But when I look back at what was the best, it is definitely working with our new tech team. As I keep repeating, “the database is a service”, and whatever we do inside the database can only make an impact if our results can be utilized by “somebody” – by our end users.

And most of the time our end users are application developers. I can’t imagine any serious database development without continuous interaction with the rest of the development team. For years my “gold standard” for teamwork was my work for the New York Department of Education, where I had the full support of the tech leadership, which helped me defend my approaches and produce results I am still happy with.

But what has been going on in the past several months is even better. The way we discuss the user stories. The way we decide what should go into the application, and what – into the database. The way we debug. The way we discuss what is critical and what’s not. The willingness of each team member to go the extra mile, to adjust in order to make others’ jobs easier. Basically, everything I could imagine for perfect teamwork is there. 🙂

Having been in this industry for 34 years, I know that nothing lasts forever :). I know that every team and every organization evolves, and that nothing is granted forever. But I have also learned to be thankful for what’s going on right now, and to enjoy the moment.

Looking forward to 2018 – I hope that our team will continue to be the most amazing team ever. I am looking forward to the stress-test of our new applications with high data volumes. After all, I designed the data storage and data access to be super-scalable, and I hope it will work that way.

I am looking forward to using the performance data we obtain to improve our bitemporal library, and I already have some ideas of exactly what I want to do. Actually, I have new technical ideas for almost everything I’ve developed over the last year. I still haven’t received notifications on whether any of the papers I submitted to next year’s conferences were accepted, but I know that both are very good papers :), so even if neither gets accepted… I will try another conference!


Filed under Development and testing, SQL, Systems, Team and teamwork

I finally figured out how it works!

I never understood why, in some cases, when you look at a Postgres execution plan, you see “Recheck Cond”. This happens when an index scan is executed, and whenever I looked at execution plans of that kind, I could never understand why Postgres wants to go back to the table and “recheck” something, when all the conditions should be satisfied by the index itself.

Last week, however, I found myself in an atypical situation: I had a query to optimize, and I could not bring it to an acceptable speed. It was imperative that it execute in less than a minute, and highly preferable – within 30 seconds. I knew that the culprit was one of the sub-selects, which was querying a very large table, and therefore I absolutely had to convince Postgres to use an index-only scan.

And it didn’t work. I created a compound partial index which exactly described the condition I was filtering by, and I included in this index all the fields I needed to select. But the query continued to be slow, and the execution plan clearly showed “Recheck Cond”.

After a number of experiments I realized that in order to ensure an index-only scan, I need to include in the index not only the attributes I am selecting, but also the attributes I am constraining on.

Specifically, when I was running a query

SELECT loan_id,
       min(date) AS first_date
  FROM very_large_table
 WHERE days_past_due > 30 AND status IN (1, 2)
 GROUP BY loan_id;

having the index

CREATE INDEX cond_index
    ON very_large_table (loan_id, date, days_past_due)
 WHERE days_past_due > 30 AND status IN (1, 2);

was not enough to ensure the index-only scan. Instead I had to build the following index:

CREATE INDEX cond_index_2
    ON very_large_table (loan_id, date, days_past_due, status)
 WHERE days_past_due > 30 AND status IN (1, 2);

I wondered for a while – why? But then I sort of thought about this “as a database” and realized that if status is not included in the index, then, when only the status changes for some record and all indexed attributes remain the same, the index entry just won’t be updated, and therefore the condition has to be rechecked against the table. On the contrary, if the status changes from “something else” to 1 or 2, the index predicate is triggered, and this record is re-indexed.
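
One way to see the effect – a quick sketch using the same table, columns, and index names as in the example above – is to compare the plans before and after adding the second index:

EXPLAIN (ANALYZE, BUFFERS)
SELECT loan_id,
       min(date) AS first_date
  FROM very_large_table
 WHERE days_past_due > 30 AND status IN (1, 2)
 GROUP BY loan_id;

-- With cond_index alone, the plan showed a Recheck Cond (on a Bitmap Heap Scan).
-- With cond_index_2 in place, it should report
-- "Index Only Scan using cond_index_2 on very_large_table" instead.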

Now I will remember!


Filed under SQL

Optimizing something you can’t control

This is very much like finding Pluto! At Braviant, we use several external service providers to perform some business tasks. And, as I’ve mentioned in one of my presentations about our usage of foreign data wrappers, we need to manage data when we do not really own that data.

But this time around the task was even more complex, and I spent weeks trying to figure out how to approach it. There is one Really Large Table on the “other” side, and to refresh the Data Mart, we need to select a small subset of records each time – basically “all records starting from the moment we last refreshed”.

For some reason unknown to me, something on the way from “them” to “us” did not work, and we could not push the condition down to the external site. No matter what I selected, what really happened (I figured it out by observing the query behavior closely) was that the whole table was fetched from the third-party server, and only then were the selection criteria applied.

The problem looked unsolvable, because “everything worked on the other side”. Then I came up with one crazy idea. I thought: if we can’t push our condition through, maybe we can create a similar condition on the other side.

So I asked our service provider’s tech support whether they could create a view on their side which would restrict the size of the object I am selecting from. Note, I asked for just a view, not a materialized view, so the restricting query is literally executed locally, next to the data. And then I mapped this view to a foreign table, so there were no changes to reporting.

Yes, this view still has way more records than I need (it contains the “last 24 hours”, while I refresh data every two hours). However, I now select from a much smaller data set, because the view contains only the last 24 hours, not the last two months!
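
For illustration, the mapping on our side looks roughly like this – a minimal sketch assuming postgres_fdw, where the server name, schema, view name, and columns are all made up for the example:

CREATE FOREIGN TABLE staging.recent_activity (
    record_id   bigint,
    loan_id     bigint,
    created_at  timestamptz,
    amount      numeric
)
SERVER provider_server
OPTIONS (schema_name 'reporting', table_name 'recent_activity_last_24h');

-- The Data Mart refresh then pulls only what it needs from a view
-- that already holds just one day of data instead of two months:
SELECT *
  FROM staging.recent_activity
 WHERE created_at >= now() - interval '2 hours';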

… and now tell me, which optimizer would be able to execute this kind of optimization?!


Filed under Data management, SQL

From theory to practice

For the past several months I have been implementing the bitemporal framework on real-life objects, not on lab mice :). And this process has been quite a revelation!

I wrote the functions for the basic bitemporal operations almost two years ago, and talked about them at several conferences and workshops. I could not imagine that anything could go wrong with them – and yet it did. And that’s exactly what happens when all your test cases are cloned lab mice!

One of the first errors I got was an empty assertion interval, and that’s when I realized that we had never discussed the relationship between transactions and bitemporal operations. Well, a transaction is a transaction, isn’t it? Nobody is supposed to see what’s inside until the transaction is finished – committed or rolled back. So… if there are several modifications (say INSERT, UPDATE and CORRECT for the same logical record) within one transaction… what are we supposed to see when the transaction is committed? Just an INSERT, if the first operation was an INSERT? But this “won’t be true”!

Yes, but on the other hand, imagine what happens if we record the “interim” states, and then later we want to run a query “as asserted” at some time in the past, and at that exact moment some transactions were still uncommitted? Then we will get results which are inconsistent. As of now I haven’t decided how I want these situations to be handled. I am almost convinced that I want to give the user an option: if you want to be “anti-transactional”, you can :)). But then you’ll need to accept the consequences.
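
To make the dilemma concrete, here is a sketch of the kind of sequence that raises the question. The function names are hypothetical placeholders for the bitemporal insert, update and correct operations, not the actual library API:

BEGIN;
-- create the logical record
SELECT bitemporal_insert('loan', '{"loan_id": 1, "status": "new"}');
-- a regular update: a new effective-time version of the same record
SELECT bitemporal_update('loan', '{"loan_id": 1, "status": "approved"}');
-- a correction: fix what was recorded, without changing effective time
SELECT bitemporal_correct('loan', '{"loan_id": 1, "status": "active"}');
COMMIT;

-- After the commit: should the assertion history show all three steps,
-- or only the final state, as if a single INSERT had happened?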

Another set of problems is rather philosophical: do we believe in reincarnation? 🙂 More precisely, if an object is “bitemporally deleted”, and then a new object with the same business key value is created, is this “the same object” or a “new object”? Both ways can be supported, but I think that by default we should take a “formal approach” and say that this is “the same” object. And if the real world (i.e. business rules) is such that the new object is a different object… well, that means that something else should be included in the business key. For example, if an SSN is reused, then we need an extra piece of information, like the person’s date of birth.

Related questions: can we update a deleted (inactive) record? What is the difference between UPDATE and CORRECTION if the date ranges are “equal”? I can only imagine how many issues like this are just waiting to be discovered!


Filed under Data management, Development and testing, research, SQL

Dos and Don’ts of the Data Warehouse

In the past couple of months the number of employees in our company has grown significantly. And guess what: almost all of the new employees need access to the Data Warehouse!

While we were very small, I used to be able (to have time) to explain to each new person how our Data Warehouse is organized, how it is populated, how data is refreshed, and what you should and should not do. But recently I could barely memorize the names of the new employees! And when I overheard one of my experienced co-workers asking one of the new co-workers, “do you know how to join tables?”… I realized I owe them some education.

So last Thursday I gave a presentation about our data warehouse, and it was a big success – for many folks it was the first time they realized “how this thing works”. But undoubtedly the most popular part was the last slide: what not to do with your database.

Since I think those statements are largely universal, I am going to paste the contents of the last slide here.

  • Although you can’t write anything to the Data Warehouse, there are plenty of ways to crash the system, so use caution.
  • Please use the copies of the core tables for exploration purposes only; do not run big queries on them.
  • Please kill any query which runs over 1 minute and ask somebody from the IT database group for assistance.
  • Do not use temporary tables.
  • Do not create objects in the public schema.
  • Before creating a new report or requesting one, please check what’s already available. The views and materialized views in our Data Warehouse are well documented.

A couple of comments:
1. “Over 1 minute” is a surprisingly good estimate. Granted, our Data Warehouse is relatively small now, but most of the time when something runs over 1 minute, it indicates that either the join criteria are not specified correctly, or one or more conditions have very low selectivity, or an index is missing. In all of those cases an IT person should take a closer look (see the sketch after these comments for how to find and stop such queries).

2. Why avoid temporary tables? Because they occupy the same disk space which is used to allocate intermediate result sets, and at the end of the day they slow things down due to extra I/O.

3. Why not create objects in the public schema? Well, because it’s public! Anybody can create tables in the public schema, and everybody creates tables owned by themselves, which other people can’t access. The public schema should only hold publicly used functions and such.
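
For reference, here is a rough sketch of how points 1 and 3 can be supported from the database side. These are standard Postgres catalog views and commands, but the exact threshold and the pid below are just illustrations:

-- Find queries that have been running for more than one minute...
SELECT pid, usename, now() - query_start AS runtime, query
  FROM pg_stat_activity
 WHERE state = 'active'
   AND now() - query_start > interval '1 minute';

-- ...and cancel one of them (pg_terminate_backend(pid) is the harsher option)
SELECT pg_cancel_backend(12345);

-- Keep ordinary users from creating objects in the public schema
REVOKE CREATE ON SCHEMA public FROM PUBLIC;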

I think the rest is self-explanatory!


Filed under Data management, SQL, talks

Chicago PUG meetup with Joe Conway

Yesterday was a Day – the day when Joe Conway presented at Chicago PUG. He talked about the PL/R extension for Postgres, which is really important for our data analysts.

We had a full house, and everybody was listening to the great presentation.



Filed under events, People, SQL, talks

I honestly hope it’s not a bug!

When I shared this discovery with my co-workers, they told me I’d better not write a blog post about it, because god forbid it is actually a bug, somebody will fix it, and we’ll lose this feature. But I hope it’s not!

… For years I’ve been complaining about the fact that Postgres functions are atomic, meaning there is no way to have transactions inside a function, and thereby it’s impossible to commit intermediate results – it’s always all or nothing. Not that I really wanted checkpoints and such, but processing huge data volumes without the option of committing intermediate results is challenging, to say the least. You are bound to have long-running transactions, extensive locks and so on. I really missed the option I had with Oracle functions to commit every 100,000 records…

For a while I kept asking the lead Postgres contributors “how much longer?”, and for a while they kept replying “in the next release”, until they just stopped replying…

So, the other day I was testing my new function, which builds a table out of multiple materialized views; for each INSERT I have a prepared statement, which is executed by a single EXECUTE operator. When the execution crashed – because one of the materialized views that was meant to be in place was not – I thought: well… now I need to start all over again… and to my surprise I saw that all the inserts which had happened before the crash persisted!

So, let’s reiterate. It might be the same function, but if the SQL statements are executed as generated statements using the EXECUTE operator, each of the executions is treated as a separate transaction! Which is pretty awesome, keeping in mind that we need to insert over 18 million records! And no, I do not mean I am going to insert 18 million times 🙂
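
For context, here is a rough sketch of the kind of function I am describing. The table and view names are made up, and whether each EXECUTE really ends up persisted on its own depends on how the function is invoked, so treat this as an illustration of the setup rather than a guarantee:

CREATE OR REPLACE FUNCTION build_reporting_table()
RETURNS void AS
$$
DECLARE
    v_mv  text;
    v_sql text;
BEGIN
    FOREACH v_mv IN ARRAY ARRAY['mv_loans', 'mv_payments', 'mv_applications']
    LOOP
        -- build the INSERT for this materialized view and run it dynamically
        v_sql := format('INSERT INTO reporting_table SELECT * FROM %I', v_mv);
        EXECUTE v_sql;
    END LOOP;
END;
$$ LANGUAGE plpgsql;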


Filed under Data management, Development and testing, SQL