Duplicates removal

A friend of mine asked me the other day how I would solve the following problem from her interview. Imagine you have a table with several attributes, and it contains records in which all attributes except the primary key are the same (duplicates). Write one SQL statement to delete the duplicates and leave just one row for each combination of values.

If the only constraint is that it must be a single statement, and we do not care about performance, the solution is easy. Let's say we have a table:

CREATE TABLE t1 (
id integer PRIMARY KEY
,attr1 number NOT NULL
,attr2 text NOT NULL);

To delete all duplicate occurrences of (attr1, attr2) we can write the following SQL statement:


DELETE FROM t1 WHERE id NOT IN (
SELECT min(id)
FROM t1 GROUP BY attr1, attr2);

This way only one row (the one with the minimal id) for each (attr1, attr2) combination will be left.
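For example, with some hypothetical sample data:

INSERT INTO t1 VALUES (1, 10, 'a');
INSERT INTO t1 VALUES (2, 10, 'a'); -- duplicate of id=1
INSERT INTO t1 VALUES (3, 20, 'b');
INSERT INTO t1 VALUES (4, 10, 'a'); -- another duplicate of id=1

After the DELETE above, only rows 1 and 3 remain.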

However, for big tables this is not an efficient way to perform the delete. For big tables we want to become "imperative" and use a cursor with ordering.

In no particular 4GL language (PL/SQL-like pseudocode), this looks like the following:


DECLARE
v_attr1 number := to_number(NULL);
v_attr2 text := NULL;
BEGIN
FOR rec IN (SELECT id, attr1, attr2 FROM t1
            ORDER BY attr1, attr2, id) LOOP
  IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL
     AND v_attr1 = rec.attr1 AND v_attr2 = rec.attr2
  THEN
    -- same combination as the previous row: this is a duplicate
    DELETE FROM t1 WHERE id = rec.id;
  ELSE
    -- new combination: remember it and keep this row
    v_attr1 := rec.attr1;
    v_attr2 := rec.attr2;
  END IF;
END LOOP;
END;

If the table is really big, you can add a counter and commit after every X records.
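A minimal sketch of the batched version, in the same pseudocode (the batch size of 1000 is an arbitrary choice of mine, not anything special):

DECLARE
v_attr1 number := to_number(NULL);
v_attr2 text := NULL;
v_counter integer := 0; -- deletes since the last commit
BEGIN
FOR rec IN (SELECT id, attr1, attr2 FROM t1
            ORDER BY attr1, attr2, id) LOOP
  IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL
     AND v_attr1 = rec.attr1 AND v_attr2 = rec.attr2
  THEN
    DELETE FROM t1 WHERE id = rec.id;
    v_counter := v_counter + 1;
    IF mod(v_counter, 1000) = 0 THEN
      COMMIT; -- release locks and undo space after each batch
    END IF;
  ELSE
    v_attr1 := rec.attr1;
    v_attr2 := rec.attr2;
  END IF;
END LOOP;
COMMIT; -- commit the final partial batch
END;

Note that some engines have caveats about committing while a cursor is open (fetch-across-commit behavior), so check your dialect.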

This is a simple problem, but quite often we need to remove duplicates from a table that other tables reference via foreign keys, and then it gets harder. Suppose t1 is a lookup table, and t2 references it:


CREATE TABLE t2 (
id integer PRIMARY KEY
,t1_id integer REFERENCES t1(id)
,attr3 text);

Now we not only need to remove the duplicates, we also need to fix the references. You can do both with a single cursor.


DECLARE
v_t1_id integer;
v_attr1 number := to_number(NULL);
v_attr2 text := NULL;
BEGIN
FOR rec IN (SELECT id, attr1, attr2 FROM t1
            ORDER BY attr1, attr2, id) LOOP
  IF v_attr1 IS NOT NULL AND v_attr2 IS NOT NULL
     AND v_attr1 = rec.attr1 AND v_attr2 = rec.attr2
  THEN
    -- duplicate: repoint its children to the surviving row, then delete it
    UPDATE t2 SET t1_id = v_t1_id
    WHERE t1_id = rec.id;
    DELETE FROM t1 WHERE id = rec.id;
  ELSE
    -- new combination: this row survives; remember its id
    v_t1_id := rec.id;
    v_attr1 := rec.attr1;
    v_attr2 := rec.attr2;
  END IF;
END LOOP;
END;

Once again, you may need to commit periodically if the tables are big enough and you do not want to paralyze the work of the system.

I think it's quite neat. What do you think?



Filed under Data management, SQL

7 responses to "Duplicates removal"

  1. lily

    I rewrote the stored proc for SQL Server.
    Why would my procedure work slower than the procedure in the blog? At the interview I was asked to delete the duplicate records inside the cursor using one delete statement, not one by one; maybe this is an incorrect approach.
    CREATE TABLE [dbo].[tb0](
    [id] [int] IDENTITY(1,1) NOT NULL,
    [trId] [int] NOT NULL,
    [datafield1] [varchar](50) NULL,
    [datafield2] [varchar](50) NULL,
    CONSTRAINT [PK_tb0] PRIMARY KEY CLUSTERED
    (
    [id] ASC
    )
    ) ON [PRIMARY]

    The stored proc from the blog:

    CREATE PROCEDURE [dbo].[remove_duplicates]

    AS
    BEGIN

    SET NOCOUNT ON;
    declare db_crc cursor for
    SELECT trId, datafield1, datafield2, id
    FROM tb0
    ORDER BY trId, datafield1, datafield2, id

    declare @field1 varchar(50), @field2 varchar(50)
    declare @trId int, @id int
    declare @v_field1 varchar(50), @v_field2 varchar(50), @v_trId int

    open db_crc
    FETCH NEXT FROM db_crc INTO @trId, @field1, @field2, @id

    WHILE @@FETCH_STATUS = 0
    BEGIN
    if ( @v_trId is not null and @v_trId = @trid AND
    ( @field1 is null and @v_field1 is null or @v_field1 = @field1) AND
    ( @field2 is null and @v_field2 is null or @v_field2 = @field2)
    )
    -- this is a duplicate of the previous record: delete it by primary key
    delete from tb0 WHERE (id = @id)
    else
    begin
    set @v_trId = @trId
    set @v_field1 = @field1
    set @v_field2 = @field2
    end

    FETCH NEXT FROM db_crc INTO @trId, @field1, @field2, @id
    END
    close db_crc
    deallocate db_crc
    END
    My stored proc:
    ALTER PROCEDURE [dbo].[remove_duplicates_my]

    AS
    BEGIN

    SET NOCOUNT ON;
    declare db_crc cursor for
    select trId, count(*) as cnt
    from [dbo].[tb0]
    group by [trId]
    having count(*) >1
    declare @field1 varchar(50), @field2 varchar(50)
    declare @trId int, @cnt int, @id int
    open db_crc
    FETCH NEXT FROM db_crc INTO @trId, @cnt

    WHILE @@FETCH_STATUS = 0
    BEGIN
    SELECT TOP (1) @id = id, @field1 = datafield1, @field2= datafield2
    FROM tb0
    WHERE (trId = @trId)

    if (@cnt = 2 )
    delete from tb0 where id = @id
    else
    begin
    delete from tb0 WHERE (trId = @trId)
    INSERT INTO tb0 (trId, datafield1, datafield2)
    VALUES (@trId,@field1,@field2)
    end
    FETCH NEXT FROM db_crc INTO @trId, @cnt
    END
    close db_crc
    deallocate db_crc
    END

So, the question was why my solution is faster. Well, there are several reasons. First, let me step back: in many cases the "one SQL statement" solution actually WILL be faster, exactly because it's ONE SQL statement.

    However, when the table is big, we do not have Enova-size memory, and we need to keep two copies of the table for the whole span of the transaction, this MAY become a problem. The cursor solution allows intermediate commits, thus reducing the potential problems of locking big data volumes and running out of temp space.

    But back to the two procedures. In general, the fewer database operations, the faster. Deleting and then re-inserting is bad for many reasons. Even if we are not talking about Postgres, an INSERT that follows a DELETE never reuses the same space, so the table size will increase, the data blocks will become sparse, and it will take longer to read the table.

    Second, there is an embedded cursor (when you run one more select inside your fetch loop, you open an implicit cursor) – one extra read for each field combination.

    Third, the delete in your case does not use the PK index, so there is another scan before the delete is executed. In my solution the table is scanned only once.

    Does it make sense?

    • lily

      1. Of course a delete by primary key is faster, but I don't know which is faster overall – deleting a huge set of records (let's say 10,000) one by one using the primary key, or deleting them in one statement using a field which is not a key (and, it seems, not an indexed field; I think the answer can differ for indexed and non-indexed fields).
      2. The embedded cursor adds an extra read, but the outer cursor executes fewer times because it has a GROUP BY in it. Is that correct or not?
      I am not trying to argue, just trying to get a better understanding.

      • No-no, you are not arguing; that's exactly what we are doing – trying to see what the right way is and why. I will explain the different cases tomorrow!

      • OK, let's count operations. Let's say the table has m records, of which n are non-duplicates, so we need to delete m-n records.

        My process executes 1 select for the cursor and m-n deletes, which do not really require extra selects, because the record is the "current of" the cursor (if a particular 4GL language allows it, you can just DELETE ... WHERE CURRENT OF the cursor).
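        For illustration, a minimal sketch of the WHERE CURRENT OF variant, assuming Oracle-style PL/SQL with an explicit FOR UPDATE cursor:

        DECLARE
          CURSOR c IS
            SELECT id, attr1, attr2 FROM t1
            ORDER BY attr1, attr2, id
            FOR UPDATE;  -- locking the rows is required for WHERE CURRENT OF
          v_attr1 t1.attr1%TYPE := NULL;
          v_attr2 t1.attr2%TYPE := NULL;
        BEGIN
          FOR rec IN c LOOP
            IF v_attr1 = rec.attr1 AND v_attr2 = rec.attr2 THEN
              -- duplicate of the previous row: delete it in place, no extra lookup
              DELETE FROM t1 WHERE CURRENT OF c;
            ELSE
              v_attr1 := rec.attr1;
              v_attr2 := rec.attr2;
            END IF;
          END LOOP;
        END;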

        So the total number of operations is m-n+1, and deletes are fast.

        Now let's count your number of operations.

        Your cursor is 1 select.
        The number of rows it returns is n (assuming every combination is duplicated). For each of those n rows you execute one more select and at least one more delete, so you execute 2n+1 operations if there is only one duplicate for each record, i.e. if m=2n. In that case my count is m-n+1 = n+1, and your count is 2n+1.

        Now let's say there are more than 2 occurrences of some combinations. For each group with more than one dup, you have an insert in addition to the delete, so the count will be 3n+1.

        Which means my procedure will execute more operations than yours only if
        m-n+1 > 3n+1
        m > 4n

        Which means that each value combination occurs more than 4 times on average.
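        To make this concrete with hypothetical numbers: if m = 1000 and n = 100 (each combination occurring 10 times on average), my cursor performs 1000-100+1 = 901 operations while your grouped procedure performs 3*100+1 = 301; but with m = 1000 and n = 400 (2.5 occurrences each), it is 601 for mine versus 1201 for yours.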

        These estimates do not take into account the fact that "my" deletes are fast, since they are executed on the "current" record, while "your" SELECT/DELETE/INSERT statements actually need to locate the rows. Does it make sense?

      • lily

        Thank you, this is the explanation I was looking for!
        I have one more question. What if I am the owner of that database? Then I have two ways to fight the duplicates. Let's assume that duplicates are records with the same trId (a single int field, and it is not the primary key).
        I can remove the duplicates once and then make this field a unique index. It will prevent any further duplicates, but it will slow down inserts (is that correct?). Another solution – I can run the procedure to remove duplicates once a day or more often. When should I choose which solution?
        I think it is better to have the unique constraint, but I do not know why.

  3. You are correct, it's better to have a unique constraint. The reason is that this is the correct state of the data, and we want it to ALWAYS be correct, not just once a day. The overhead on inserts is minimal, close to invisible, and keeping the data consistent is important.
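    For example, once the duplicates are removed, the constraint itself is a single statement (a sketch against lily's tb0 table above; the index name is my own choice):

    -- rejects any future row with an already-existing trId
    CREATE UNIQUE INDEX ux_tb0_trId ON dbo.tb0 (trId);

    After this, an INSERT with a duplicate trId fails immediately instead of waiting to be cleaned up later.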
