Duplicates removal

I do not think I’ve ever published in this blog the simple way of removing duplicates from the table. I’ve faced this problem multiple times in different environments, first coming up with a solution during my consulting engagement with the City of Chicago and later virtually everywhere I worked.  I believe now is the time to finally document this solution.

First, let’s define the problem, and let’s specify more precisely what “duplicates” mean in this context. First, let’s assume that the table with duplicates has a primary key. Although the SQL standard does not require to have a PK constraint on each table, it’s considered a good practice to have one, even if it is just a sequential number assigned to the record on insertion. If all other attributes, except the PK are the same in two records, let’s consider them duplicates.

Most of the time duplicates are really not intended to happen and they occur due to some programmatic error in the insertion process. After the error is discovered and corrected, we usually want to remove the duplicates, leaving just one copy of each set of attributes. Yes, we can select distinct values into a temporal table, truncate and re-insert, but this process would initiate a long exclusive lock (possibly prohibitively long) on the table, which is not always possible in production environment. Besides, if there may be memory limitations, id the table is big enough.

To remove duplicates fast, I’ve developed a very efficient algorithm, which can be schematically represented as follows:


v_prev_record :={0,' ',' ',' ' ,' ');
FOR rec in (SELECT t.pk, t.attr1, t.attr2, t.attr3 FROM table_name t
ORDER BY 2,3,4,1 LOOP
IF v_prev_record.attr1=rec.attr1 AND
v_prev_record.attr2=rec.attr2 AND
v_prev_record.attr3=rec.attr3
THEN
DELETE table_name WHERE pk=rec.pk;
END IF;
v_prev_record.attr1:=rec.attr1;
v_prev_record.attr2:=rec.attr2;
v_prev_record.attr3:=rec.attr3;
END LOOP;

This way only “the first” of identical records will remain in the table.

When we are just removing dups, this might look like somewhat excessive  amount of work, but often the table with the dups also serves as a reference in the foreign key constraint, and then we need to modify a dependent table as well, and this might REALLY take long time! Now check out, how a very small modification to the code above can make this work easy!

v_prev_record :={0,' ',' ',' ' ,' ');
v_pk_keep :=0;
FOR rec in (SELECT t.pk, t.attr1, t.attr2, t.attr3 FROM table_name t
ORDER BY 2,3,4,1 LOOP
IF v_prev_record.attr1=rec.attr1 AND
v_prev_record.attr2=rec.attr2 AND
v_prev_record.attr3=rec.attr3
THEN
UPDATE dependent_table SET fk_attr:=v_pk_to_keep
WHERE fk_attr=rec.pk; --we are about to delete this dup
DELETE table_name WHERE pk=rec.pk;
ELSE
v_pk_to_keep:=rec.pk;--this is a "new" value we keep
END IF;
v_prev_record.attr1:=rec.attr1;
v_prev_record.attr2:=rec.attr2;
v_prev_record.attr3:=rec.attr3;
END LOOP;

Isn’t it just beautiful?!

Advertisements

5 Comments

Filed under SQL

5 responses to “Duplicates removal

  1. Hey Henrietta. That’s a very nice algorithm. I’ve had many times in the past where I needed to the same thing. Thank you for this solution.

    One thing i would point out is that I’ve run across the scenario where only a subset of the columns in a table were used to define duplicates. For instance, if you have col1, col2, .. col10 and only col1..col5 were used to identify uniqueness. Forget for a moment that you might have a table that is not fully normalized and assume it needs to stay as it is designed. Is there a more generic solution where the developer could look up a variable set of field names that collectively indicate uniqueness in a given table? And what about providing an option of finding the most recent duplicate of a record and updating the original record with the fields not used to indicate uniqueness? The simple example I’m thinking of is an address book where the unique fields *could* be the first, middle, and last name… and the rest of the fields address, phone, etc… would be updated from duplicate records…

    Thanks again for the great solution!

    BB

  2. Yes, the same algorithm would work perfectly fine, if I understand correctly what you need. First, you do not need to list all the attributes in the SELECT list, just those which identify uniqueness should be there. And as long as we ORDER BY these “need to stay unique” attributes, and we do not care about the rest, we can still delete the records “all except first” till one of the five (in your example) columns will change.

    Also, if you need to keep “the most recent one”, like in your example – keep the most recent phone/address, delete the older ones, you can order in descending. The SELECT which drives the loop will look like this:

    SELECT id, last_name, first_name, middle_name, date_added
    FROM addresses
    ORDER BY last_name, first_name, middle_name, date_added DESC;

    This way for each person the most recent phone/address will always be the first one, and will stay, while the earlier records will be deleted.

    Thank you for asking :))

  3. Svitlana Senenko

    Yes, it’s beautiful indeed 🙂 Thank you for a very good idea!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s