Category Archives: Development and testing

On the importance of automation: I am migrating my data again

Moving my Data Warehouse to a separate cluster was a big and exhausting project. However, it looks like it did not teach me anything – when I started to build a staging environment, I realized that almost nothing was automated. By “automated” I mean that you should be able to run a set of scripts against a clean database and have all the objects created.

I always had the best intentions to build my data warehouse that way, but life would always get in my way in the form of urgent business requests – things which should have been done yesterday – combined with “I will clean it up tomorrow”. Now that I am building “the same” environment for the third time in a row, I’ve decided to spend extra time cleaning up all the creation scripts and making them re-runnable, no matter how much time it takes.
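
To give a sense of what “re-runnable” means here – a minimal sketch with made-up object names; every statement can be executed repeatedly, against a clean or an already-populated database, with the same end result:

CREATE SCHEMA IF NOT EXISTS staging;

CREATE TABLE IF NOT EXISTS staging.customer (
    customer_id   integer PRIMARY KEY,
    customer_name text NOT NULL
);

-- views are cheap to rebuild, so just drop and re-create them
DROP VIEW IF EXISTS staging.customer_summary;
CREATE VIEW staging.customer_summary AS
    SELECT customer_id, customer_name
      FROM staging.customer;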

Well, it takes tons of time! But now nobody but myself forces me to do things that way, and I fully and genuinely understand how important it is! So it may take me another two weeks to finish building the staging environment, but in the end I will get not only an environment but a process in place as well. Which will make me very proud, even if nobody but me will know 🙂

Filed under Development and testing

From theory to practice

For the past several months I have been implementing the bitemporal framework on real-life objects, not on lab mice :). And this process has been quite a revelation!

I wrote the functions for the basic bitemporal operations almost two years ago and talked about them at several conferences and workshops. I could not imagine anything could go wrong with them – and yet it did. And that’s exactly what happens when all your test cases are cloned lab mice!

One of the first errors I got was an empty assertion interval, and that’s when I realized that we had never discussed the relationship between transactions and bitemporal operations. Well, a transaction is a transaction, isn’t it? Nobody is supposed to see what’s inside until the transaction is finished – committed or rolled back. So… if there are several modifications (say INSERT, UPDATE and CORRECT for the same logical record) within one transaction… what are we supposed to see when the transaction is committed? Just an INSERT, if the first operation was an INSERT? But this “won’t be true”!

Yes, but on the other hand, imagine what would happen if we recorded the “interim” states, and then later wanted to run a query “as asserted” at some time in the past, when at that exact moment some transactions were still uncommitted? Then we would get inconsistent results. As of now I haven’t decided how I want these situations to be handled. I am almost convinced that I want to give the user an option: if you want to be “anti-transactional”, you can :)). But then you’ll need to accept the consequences.
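
One PostgreSQL detail that matters a lot here (and, I suspect, is behind that empty assertion interval): now() is frozen at the start of the transaction, so if one operation closes an assertion interval at now() and the next operation in the same transaction opens a new one at now(), the interim row ends up with a zero-length interval. Easy to see:

BEGIN;
SELECT now() AS txn_time, clock_timestamp() AS wall_time;
-- ... run any bitemporal operations here ...
SELECT now() AS txn_time, clock_timestamp() AS wall_time;
-- txn_time is identical in both queries: now() returns the transaction start
-- time, while clock_timestamp() keeps moving within the transaction.
COMMIT;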

Another set of problems is rather philosophical: do we believe in reincarnation? 🙂 More precisely, if an object is “bitemporally deleted”, and then a new object with the same business key value is created, is this “the same object” or a “new object”? Both ways can be supported, but I think that by default we should take a “formal approach” and say that this is “the same” object. And if the real world (i.e. business rules) is such that the new object is a different object… well, that means something else should be included in the business key. For example, if an SSN is reused, then we need an extra piece of information, like the person’s date of birth.

Related questions: can we update a deleted (inactive) record? What is the difference between UPDATE and CORRECTION if the date ranges are “equal”? I can only imagine how many issues like this are just waiting to be discovered!

Filed under Data management, Development and testing, research, SQL

OK, we know what to do with very large files, but what if there are too many?

About two weeks ago I wrote a post about dealing with very large CSV files. The proposed solution was to break this very large file into smaller files and then merge the results of mapping back together. This was definitely a doable solution; however, since we had to break this one very large file into 37 smaller files, it involved having 37 foreign tables on our proxy server, another 37 foreign tables in the staging area of the data warehouse, and also 37 matching materialized views…

And if this were not enough, I soon found out that the number of files may vary for each marketing campaign. And later I learned that the past campaigns should be stored permanently… although this requirement was later revoked. But some other results had to be stored forever…

In any case, the problem I was facing was that there is no way you can stay sane doing 37 copy/pastes… or making changes in 37 places… So I realized that I needed to automate the creation of the said foreign tables and materialized views.

Once again, there is nothing better than the ability to generate SQL inside a function and then execute it. You can’t make a table name a function “parameter”, but… you can, if you generate a DDL statement inside the function. I was very happy when I realized that in a situation where requirements change so often, it really does not make much of a difference whether you refresh a materialized view or drop it and create it again. That’s what my refresh functions now do, and this makes changes to the logic really easy. Well… relatively easy. As easy as you can make it :).
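
Schematically, one of these refresh functions looks something like this – a simplified sketch with made-up schema and file names, assuming one foreign table per incoming file already exists:

CREATE OR REPLACE FUNCTION refresh_campaign_file(p_file_name text)
RETURNS void
LANGUAGE plpgsql AS
$$
BEGIN
    -- a table name can't be a real function parameter, but it can be
    -- spliced into generated DDL with format(); %I quotes identifiers safely
    EXECUTE format('DROP MATERIALIZED VIEW IF EXISTS staging.%I',
                   p_file_name || '_mv');
    EXECUTE format('CREATE MATERIALIZED VIEW staging.%I AS SELECT * FROM staging.%I',
                   p_file_name || '_mv', p_file_name);
END;
$$;

-- one call per incoming file instead of 37 copy/pasted scripts
SELECT refresh_campaign_file('campaign_01');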

Honestly, if I had not come up with my first automated generation about two weeks ago, I would be way more upset each time I needed to make changes. But by now I’ve parametrized literally everything :). Tomorrow we’ll test whether somebody other than me can execute this super-flexible set of functions – because I am taking vacation next week – for real!

Filed under Development and testing

I honestly hope it’s not a bug!

When I shared this discovery with my co-workers, they told me I’d better not write any blog post about it, because God forbid it is actually a bug, and somebody will fix it, and we’ll lose this feature. But I hope it’s not!

… For years I’ve been complaining about the fact that Postgres functions are atomic, meaning there is no way to have transactions inside a function, and thereby it’s impossible to commit intermediate results – it’s always all or nothing. Not that I really wanted checkpoints and such, but processing huge data volumes without the option of committing intermediate results is challenging, to say the least. You are bound to have long-running transactions, extensive locks and so on. I really missed the option I had with Oracle functions to commit every 100,000 records…

For a while I kept asking the lead Postgres contributors “how much longer”, and for a while they kept replying “in the next release”, until they just stopped replying…

So, the other day I was testing my new function, which builds a table out of multiple materialized views; for each INSERT I have a prepared statement, which is executed by a single EXECUTE operator. When the execution crashed, because one of the materialized views that was supposed to be in place was not, I thought: well… now I need to start all over again… and to my surprise I saw that all the inserts which had happened before the crash persisted!

So, let’s reiterate. It might be the same function, but if the SQL statements are executed as generated statements using the EXECUTE operator, each of the executions will be treated as a separate transaction! Which is pretty awesome, keeping in mind that we need to insert over 18 million records! And no, I do not mean I am going to insert 18 million times 🙂
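
Schematically, the function in question does something like this – a simplified sketch with made-up names, assuming the target table and the staging materialized views already exist:

CREATE OR REPLACE FUNCTION load_campaign_results()
RETURNS void
LANGUAGE plpgsql AS
$$
DECLARE
    v_mv text;
BEGIN
    -- one generated INSERT per source materialized view
    FOREACH v_mv IN ARRAY ARRAY['campaign_01_mv', 'campaign_02_mv', 'campaign_03_mv']
    LOOP
        EXECUTE format('INSERT INTO dwh.campaign_results SELECT * FROM staging.%I',
                       v_mv);
        -- when one of these EXECUTEs crashed (a missing view), the rows
        -- inserted by the previous iterations were already there
    END LOOP;
END;
$$;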

Filed under Data management, Development and testing, SQL

New vs old: should we even validate?

I think there is no developer in the world who has never faced this situation: there was an older version of “something” – the app, the program, the interface, the function. Now you need to create a new version (most often because you are migrating to a new system), and you need to make sure this new version of “something” produces the same result.

First of all, this never happens. I’ve never experienced a situation where the new procedure produced “exactly the same result”. There are always mismatches. And second, when you start to investigate them in more detail, it turns out that the “new” one is better than the “old” one, in the sense that most of the time the new code has fixed some issues of the old code :).

Moreover, quite often we already know that the old code is wrong. But for some reason we are still trying to “first make it work like the old one, and then fix it”.

Why?

I think in most cases it’s purely “psychological”. We think that the fact that the old code was running for a while and nobody complained sort of validates it. But quite often the end users actually know about the errors and have learned to correct them in post-processing (at one of my consulting assignments I learned that an error in the client’s financial software had been known about for almost… 30 years!!!). Or people might not care. Or they have no way to validate correctness.

In any case, if we already know for sure that some parts are wrong and we want to rewrite them anyway, I do not see a reason to “first replicate what it was”. It might be easier (and better!) to gather new requirements, as if it were just another new development, and to validate the results against these requirements, not against the old version!

Filed under Development and testing

How I learned to love tests: both using and writing ones – part 2

Even when I would reluctantly admit that I needed to have tests in place, I never understood why one might want to declare the number of tests you plan to run in pg_tap. What’s the point? You know how many tests you want to run, so the only thing you need is to count the executions :). And when the number does not match, it means you didn’t count them correctly ;)

That’s what I was absolutely sure about… until last week. A week before that I discovered that I had mapped one foreign table incorrectly. Or maybe it got changed and I didn’t notice – I didn’t have proper tests!

Nevertheless, after I fixed the table structure… yes, you are right, a number of tests failed! And since I had added a whole bunch of newly mapped columns (21 of them, to be precise), I had to add 84 more tests… four for each column… and after I did it, pg_tap reported that I ran fewer tests than I had planned. My first inclination was to change the “number of tests I want to run”. And I almost did it… but then I thought: I remember I counted! If there are fewer than 84 new tests, then there are two options: either I counted them incorrectly, or I missed several tests.
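
For the skeptics like my former self, this is roughly what that bookkeeping looks like – a simplified sketch with a made-up table and column; the four checks per column here are just an example of the kind I use:

BEGIN;
SELECT plan(4);  -- the number of tests we promise to run

SELECT has_column('staging', 'customer', 'birth_date', 'column exists');
SELECT col_type_is('staging', 'customer', 'birth_date', 'date', 'column type is date');
SELECT col_not_null('staging', 'customer', 'birth_date', 'column is NOT NULL');
SELECT col_hasnt_default('staging', 'customer', 'birth_date', 'column has no default');

SELECT * FROM finish();  -- complains if the number of tests actually run differs from the plan
ROLLBACK;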

It was not fun at all, going through this huge file with all the tests… but I found the missing ones! And I was so happy, once again, that somebody forces me to run the tests each time I commit changes 🙂

Filed under Development and testing, Team and teamwork

How I learned to love tests: both using and writing ones – part 1

My fellow database developers, let’s be honest: we do not like writing tests. We are not application developers. We do not understand test-driven development: at the end of the day, how can we figure out what the outcome of our functions and stored procedures should be, when we do not know what data they will be working with?! When our managers tell us we should generate test data, we think it’s the most ridiculous thing in the world, because if we create the data ourselves, obviously the results of the testing will be favorable!

I understand that not each and every database developer goes to that extreme, but… pretty close. And I will be the first to admit being guilty of a similar attitude. I only believed in testing on a “copy of the real data”, a sort of A/B testing, which is important, but not the only thing that needs to be tested.

Especially these days, when the data structure is not “almost always static”, when changes to the application DB are not a rare catastrophe but a part of the normal life of the application. At a minimum you want to have tests which show you that if you change “something” in the database structure, some other “something” won’t break. We need those tests. But… it is so boring to write them! It slows our development process sooooo much! Especially, you know, when you have this big project to complete, and each and every half an hour matters!

At least that’s what I was thinking for the past two months while working hard to bring our new Data Warehouse live. And promising myself that I would write the tests… later :).

But my former co-worker, my forever-mentor, and current consultant for my company – Chad – had written his tests for “his” part – the part of our system which is responsible for “talking” to the third-party databases. So… on the night of massive changes to the said third-party database, which were not properly communicated in advance, when some parts of my processes started to fail… and when I fixed the data structures to match the new ones… the tests started to fail!

I was not happy :). Not happy at all. I was thinking – why? Why do I have to sit and fix these tests at 11:30 PM?! But guess what: it took me only 45 minutes to fix every single test touched by the change and to validate the new data structures. I was done before midnight. And guess how long it took other people to get their parts of the system updated and running with no issues? Almost the whole night, and almost the whole next day! You might laugh at the next statement, but here it is anyway: at that moment I felt very much protected by these tests.

And that’s what tests are for, isn’t it?!

Filed under Development and testing, Systems, Team and teamwork