Category Archives: Data management

From theory to practice

For the past several months I am implementing the bitemporal framework on the real life objects, not on the lab mice :). And this process was quite a revelation!

I’ve written the functions for bitemporal basic operations almost two years ago, and talked about them on several conferences and workshops. I could not imagine something can go wrong with them – and yet it did. And that’s exactly what happens when all your test cases are cloned lab mice!

One of the first errors I’ve got was an empty assertion interval, and that’s when I’ve realized than we never discussed the relations between transactions and bitemporal operations. Well, a transaction is a transaction, isn’t it? Nobody is supposed to see what’s inside, until transaction is finished – committed or rolled back. So… if there are several modifications (say INSERT, UPDATE and CORRECT for the same logical record) within one transaction… what we are supposed to see when transaction is committed? Just an INSERT, if the first operation was INSERT? But this “won’t be true”!

Yes, but on the other hand, imagine what will happen if we would record the “interim” state, and then later we would like to run a query “as asserted” at some time in the past, and at that exact moment some transactions will be in the uncommitted state? Then we will get results which will be in the inconsistent  status. As of now I didn’t come up with how I want these situations to be handled. I am almost convinced that I want to give a user an option: if you want to be “anti-transactional”, you can :)). But then you’ll need to accept the consequences.

Another set of problems is rather philosophical: do we believe in reincarnation? 🙂 More precisely, if an object is “bitemporally deleted”, and then a new object with the same business key value is created, is this “the same object” or a “new object”? Both ways can be supported, but I think that by default we should assume a “formal approach”, and say the this is “the same” object. And if the real world (i.e. business rules) is such, that the new object is a different object… well, that means, that something else should be included into the business key. For example, if the SSN is reused, then we need an extra piece of information, like person’d data of birth.

Related questions: can we update a deleted (inactive) record? What are the differences between UPDATE and CORRECTION if the date ranges are “equal”?  I can only imagine how many issues like this are just waiting to be discovered!

Leave a comment

Filed under Data management, Development and testing, research, SQL

Dos and Don’ts of the Data Warehouse

In the past couple of months the number of employees in our company have grown significantly. And guess what: almost all of the new employees need access to the Data warehouse!

While we were very small, I used to be able (to have time) to explain each new person, how our Data warehouse is organized, how it is being populated, how data is refreshed, and what you should and should not do. But recently I barely could memorize the names of new employees! And when I overheard one of myexperienced co-workers asking one of the new co-workers: do you know how to join tables?… I’ve realized I owe them some education.

So, last Thursday I gave a presentation about our data warehouse, and it was a big success – for many folks it was the first time realizing “how this thing works”. But un-doubtfully the most popular one was the last slide: what not to do with your database.

Since I think those statements are largely universial, I am going to paste here the contents of the last slide.

  • Although you can’t write anything to the Data Warehouse there are plenty of ways to crush the system,  so use caution.
  • Please use the copies of the core tables for exploration purposes only, do not run big queries on them
  • Please kill any query which runs over 1 min and ask somebody from the IT database group for assistance
  • Do not use temporal tables.
  • Do not create objects in the public schema.
  • Before creating a new report or requesting one, please check what’s already available. The view and mat. views in our Data Warehouse are well- documented

Couple of comments
1. “Over 1 minute” is a surprisingly good estimate. Granted, out Data Warehouse is relatively small now, but most of the time when something is running over 1 min, it indicates that either the join criteria  are not specified correctly, or one or more conditions have very low selectivity, or there is an index missing. In all of those cases an IT person should take a closer look

2. Why avoid using temporal tables? Because they occupy the same space on disk which is used to allocate the intermediate result sets, and at the end of the day slow the things down due to extra IO

3. Why not to create objects in the public schema? Well, because it’s public! Because anybody can create tables in the public schema! And everybody create tables owned by them, which other people can’t access. The public schema should only hold the publicly used functions and such.

I think, the rest is self-explanatory!

Leave a comment

Filed under Data management, SQL, talks

The Data science education panel on ICDE 2017

In order to keep up with my own promises to tell more about what was happening on ICDE 2017 I am going to write about the panel on data science education. The panel was called “Data Science Education: We’re Missing the Boat, Again”, and I’d say it was probably the most interesting panel I’ve ever attended! By the time the panel was about to start, there was a huge crowd, and people were encouraged to take a dozen of remaining seats in the first and second rows (do I need to mention that I was at the front five minutes before the panel started?)

The topic of the panel described in my own words was the following. The Data science is a buzz word, students want to be taught “data science”, and there is a common believe that data science is about machine learning and statistical modeling while in reality 80% of time of the data scientists is spent on data pre-processing, cleansing, etc.

The panelists were given the questions which I am copying below.

If data scientists are spending 80% of their time grappling with data, what are they doing wrong? What are we doing wrong? What can we teach them to reduce this cost?
• What should a practicing data scientist learn about sys- tems engineering? What’s the difference between a data engineer and a data scientist?
• Scale is at the heart of what we do, and it’s a daily source of friction for data scientists. How can we teach funda- mental principles of scalability (randomized algorithms, for example) in the context of data systems?
• Perhaps data scientists are just consumers of our technol- ogy — how much do they really need to know about how things work? Empirically, it appears to be more than we think. There is a black art to making our systems sing and dance at scale, even though we like to pretend everything happens automatically. How can we stop pretending and start teaching the black art in a principled way?
• How can we address emerging issues in reproducibility, provenance, curation in a principled yet practical way as a core part of data engineering and data systems? Consider that the ML community has a vibrant workshop on fairness, accountability, and transparency. These topics are at least as relevant from a database perspective as they are from an ML perspective, maybe more so. Can we incorporate these issues into what we teach?
• How much math do we need to teach in our database- oriented data science courses? How can we expose the underlying rigor while remaining practical for people seeking professional degrees?

Bill Howe from UW was a moderator and the first panelist to give his talk.

The second one was Jeff Ullman, and thereby I have nothing more to say:)

Actually, i really liked the fact that he mentioned, that the math courses, linear algebra and calculus should be included into the Database curriculum.  I was always saying that nobody without Calc  BC should be allowed anywhere near any database.

The next panelist was Laura Haas, and again – what else I need to say, except of I’ve enjoyed each and every moment of her presentation?

One thing from her presentation which I find really important is that the Data science is not a part of the Computer Science, and not a part of Database management.  As Laura put it, “we provide the tools”, but not like “we” should teach the DS as a part of CS.

Next panelist was Mike Franklin from UC, and I hope this picture is clear enough for you to see a funny example of DS he is showing.

And the last one was very controversial Tim Kraska from Brown, who started with “he is going to disagree with all the rest of panelists” – and he did.

To be honest, it’s very difficult to write about this panel, because each of you can google all these great people, but you would need to see a video recording of this panel to really fell how interesting, and how much fun it was.

After the panel I talked to several conference participants, who like me are from industry and asked them what are they looking for when hiring recent grads. And literally everybody said the same thing that I was thinking about: they said they hire smart people with solid basic education, people who can solve problems, “and we will teach them all the rest”. Which I couldn’t agree more!

Paradoxically, the students think it’s cool to have something about “Data science” in their curriculum, they often think it will make them more marketable, but real future employers do not care that much!

Leave a comment

Filed under Data management, events, People, publications and discussions, talks

I honestly hope it’s not a bug!

When I’ve shared this discovery with my co-workers, they told me I’d better not write any blog post about this, because god forbid it is actually a bug, and somebody will fix it, and we’ll loose this feature.  But I hope it’s not!

… For years I’ve being complaining about the fact, that Postgres functions are atomic, meaning there is no way to have transactions inside the function, thereby it’s impossible to commit intermediate results, it’s always either all or nothing. Not like I really wanted to have the checkpoints and such, but processing huge data volumes without the  option of committing the intermediate results is at least challenging. You are bound to have long-running transactions, extensive locks and such. I really missed this option I had with Oracle functions to be able to commit each 100,000 records….

For a while I’ve being asking the lead Postgres contributors, “how much longer”,  and for a while they were replying – in the next release, until they just stopped replying…

So, the other day I was testing my new function, which is building a table out of multiple materialized views, and for each INSERT I have a prepared statement, which is executed by a single EXECUTE operator. When the execution crashed, because one of the materialized views which meant to be on place was not, I was think: well… now I need to start all over again… and to my surprise I saw, that all inserts which happened before this crash, persisted!

So, let’s re-iterate. If might be the same function, but if the SQL statements are executed as generated statements using EXECUTE operator, each of the executions will be treated as a separate transaction! Which is pretty awesome, keeping in mind that we need to insert over 18 million records! And no, I do not mean I am going to insert 18 million times 🙂

Leave a comment

Filed under Data management, Development and testing, SQL

One “emergency optimization”

Why it’s an “emergency optimization”? Well, because you write some SQL, you debug and optimize it, and it works fine, and it runs in production ten times during the business hours producing some report… and all of a sudden it just stops working… well, not necessarily “stop”, but start to run more than a hundred times slower… then you need to drop everything you are doing and optimize it ASAP. Because it’s already in production!

Last Thursday I had several things which “stopped working” in a manner described above. I’ve written about one of them in my previous post, but that one was relatively easy to fix. The next one looked as trivial as it can be, but stll I could not make it work!

Here is how the query looked like:

SELECT ...
   FROM payments p WHERE payment_date< current_date 
        AND payment_id NOT IN (SELECT payment_id from bank_records)
        AND...

There were more conditions, but they are not so important. This worked fine until a couple of days ago. What could be more simple? The payment_id field is indexed in both tables, and once again, this query should return a very small number of records. Also, the tables themselves are small, a little bit over 100,000 records. However, all of a sudden it started to run for minutes! When I looked at the execution plan I saw, that instead of doing a merge join utilizing the index, Postgres decided to read the whole bank_records table into the main memory, “materialize”, and then loop through this data set for each and single record of the payments table!

When I saw this, I immediately rewrote the SQL in my favorite way – with OUTER JOIN:

SELECT ...
   FROM payments p 
        LEFT OUTER JOIN bank_records b 
        ON b.payment_id =p.payment_id AND payment_date< current_date 
        AND b.payment_id IS NULL 
        AND...

This didn’t help either – still sequential scan and line by line comparison. And then I decided to use “the last resort”- which I hate!

SELECT ...
   FROM payments p WHERE payment_date< current_date 
        AND NOT EXISTS (SELECT payment_id from bank_records b
        WHERE payment_id=p.payment-id)
        AND...

This SQL just magically started to use the index, and the execution time when down from8 minutes to 300 milliseconds…

Yes, it’s cool… but why?!

2 Comments

Filed under Data management, SQL

A major accomplishment

It’s almost the end of the year. And it sounds like the most improbable day to write something about work, especially voluntarily.  But I’ve being really busy working  for the past several weeks (and months, to that matter).

But now I have something really exciting to report. And that “something” happened on Wednesday last week. On that day we stopped the automated emailing of reports from the old datawarehouse. Which of cause means, that we have all needed reports set up to run and email results from the new datawarehouse, which I have being building for the past seven months.

On one hand it seems like it took too long – the data was there and available for a while. But it always takes lots of time to convince people, who are used to the old ways, that they need to switch to the new system, the the new system is better, more accurate, etc. Besides,as it almost always happens, on the day that we’ve announced that “the old reports will be discontinued”,  the whole bunch of reports never converted to the new system emerged out of nowhere…

We were able to convert all these reports, to validate their results with business, and to shut down the automated reporting as planned.  And that means a lot for me, because it’s a very practical result of all my work since I’ve joined Braviant.  I do not have any illusions regrading this “sunset”, I know that there may be still multiple cases of “I need to see this old report” and such. And even after all user access to the old system will be disabled, we’ll need to archive the data for auditing purposes. But in any case I consider the current status as a major accomplishment, and a very material result of my work.

Of cause I won’t be able to accomplish that task without my wonderful co-workers, and since we are a small company, the input of each individual made a huge impact on the successful achievement of that goal.  I am in a very good place now, in all possible meanings of this word 🙂

Leave a comment

Filed under Companies, Data management, Team and teamwork, Workplace

Using Postgres functions for reporting

Two weeks ago, when I was participating in the ORM panel I thought to myself that it has been a while since I’ve used Postgres functions to encapsulate some database functionality. And the reason for that was not that I am not really supporting the application at that time, but that I was trying to keep the structure of my data mart really clear and transparent.

There was one particular case where I knew for sure I should use functions instead of views, because I did have multiple views with identical logic which differ just by the state and the lender name. And I could not create  generic views and select from them, because in all those cases the condition won’t be pushed down.

But.. we were still validating results with our business stakeholders, and for some stupid reason I was thinking that I should wait till everything is final before converting. And then – I actually decided to convert to functions, because I could not stand a perspective of copying the same changes to four different places. And I did, and now I am thinking, why I was so stupid not to do it long time ago? I thought that it was not an immediate business need, but boy, how wrong I was!  It became so-so-so much easier to make the changes; also it turned out that Looker can work with the functions output just fine, no issues at all.

And as an extra bonus – since I always generate and execute dynamic SQL in the functions (for performance reasons), I do not need to recompile them when I need to change the underlying materialized views. No need to recompile 24 views! Isn’t it a relief?! Now I am thinking I should convert even more views to functions – as long as my team members are OK with that.

Leave a comment

Filed under Data management, SQL