I attended two conference panels: “Dark Data: Are We Solving the Right Problems?” on Tuesday and “Big Data Quality – Whose problem is it?” on Thursday.
In some sense both were logical continuations of Martin Kersten’s keynote (although the keynote was better :)). Granted, the topic is popular. But as I’ve already mentioned elsewhere, I agree with the panelists who questioned the validity of the concept of “data cleansing” in general.
Let me explain what I mean. I am questioning the concept of modifying any data without notice, and when people talk about data cleansing, that’s what they mean. Basically we are saying: this data does not fit our standards of cleanness, so we are going to modify it according to our best guess and make it usable. In contrast, I believe that if we have some data that is “wrong”, we need to identify how this data got into our database. If we have some “bad data” on our hands, it means that the process by which it was loaded had some issues.
I remember how much effort we spent at PepsiAmericas trying to verify the data loaded into the data warehouse. The dimensional data (data related to products and points of sale) was loaded from one set of independent systems, while the factual data (orders and invoices) was loaded from others. Technically speaking, every product on an order or invoice should match some product in the master product list, but there were tons of situations where that was not the case — most often when the systems were updated asynchronously, but sometimes because data was typed into the orders and invoices rather than selected from drop-down lists.
So what do you do when you come across an invoice line containing a product that is not in the master list? Well, first of all, you can’t lose this invoice! Because this is money you are losing :). So we came up with an algorithm that would insert a placeholder record into the master product list: this record would have a surrogate key assigned, some description of the unknown product, and a status of ‘?’, indicating that it is neither “active” nor “inactive” and has to be resolved.
The business stakeholders responsible for specific data feeds would receive daily reports with all “need to be fixed” records — until they were actually fixed.
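To make the approach concrete, here is a minimal sketch of the placeholder idea in Python. All names here (`master_products`, `resolve_product_key`, `unresolved_products`) are invented for illustration — the real system was a data warehouse load process, not this toy dictionary:

```python
# Hypothetical sketch of the placeholder-record approach described above.
# Unknown product codes get a '?' record instead of being dropped,
# so no invoice line (i.e., no money) is ever lost.

import itertools

_serial = itertools.count(1000)  # surrogate-key generator (assumed starting value)

# Master product list keyed by the product code from the source systems.
master_products = {
    "P-001": {"key": 1, "description": "Cola 12oz", "status": "active"},
}

def resolve_product_key(product_code):
    """Return the surrogate key for a product code, inserting a
    placeholder record with status '?' when the code is unknown."""
    record = master_products.get(product_code)
    if record is None:
        record = {
            "key": next(_serial),
            "description": f"UNKNOWN PRODUCT {product_code}",
            "status": "?",  # neither active nor inactive: must be resolved
        }
        master_products[product_code] = record
    return record["key"]

def unresolved_products():
    """The daily 'need to be fixed' report: all records still in '?' state."""
    return {code: rec for code, rec in master_products.items()
            if rec["status"] == "?"}

# An invoice line referencing an unknown product still loads:
key = resolve_product_key("P-999")
```

The design point is that `resolve_product_key` is idempotent — loading the same unknown code twice reuses the same placeholder — and the ‘?’ status makes the bad record impossible to ignore, because it keeps showing up in `unresolved_products()` until a human fixes it.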
Sure, life is way more complicated than just missing product codes, but I still firmly believe that this general approach is the only way to achieve data accuracy.