RE: de-dup process

  • From: "Kerber, Andrew" <Andrew.Kerber@xxxxxxx>
  • To: tim@xxxxxxxxx, ebadi01@xxxxxxxxx
  • Date: Thu, 14 Dec 2006 07:31:37 -0600

This will probably work, but keep in mind that by adding a column you
run the risk of breaking any applications that rely on the table
structure to remain constant (ie, anything that relies on a select *..
).  Admittedly, it is bad programming practice to do this, but I have
seen it done quite a bit.


Andrew W. Kerber 
Oracle DBA 


"If at first you dont succeed, dont take up skydiving" 

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx
[mailto:oracle-l-bounce@xxxxxxxxxxxxx] On Behalf Of Tim Gorman
Sent: Thursday, December 14, 2006 7:19 AM
To: ebadi01@xxxxxxxxx
Cc: mgogala@xxxxxxxxxxx; tboss@xxxxxxxxxxxxxxxxxx;
Subject: Re: de-dup process


When an UPDATE or DELETE won't scale (and it never will!), convert it
into a parallel nologging direct-path INSERT.

Use a query similar to Tom's to insert only the first "copy" of a row
(i.e. use "RN = 1" instead of "RN <> 1" the outermost WHERE clause) to
INSERT the rows to be retained into a new table.  Make the new table
NOLOGGING, use APPEND and PARALLEL hints on the INSERT portion and
PARALLEL hints on the SELECT portion of the INSERT ... SELECT statement.

Here's the cute part.  If you can't simply RENAME the two tables (i.e.
old and new) to swap them, then make the newly-created table look
exactly like the original table, but make it range-partitioned on a
dummy numeric column (call it DUMMY or PARTKEY), give this column a
constant value of zero, and name the single partition PZERO (or PDUMMY
or whatever).  Also, add this DUMMY or PARTKEY column to the original
table as well.  Then, use "ALTER TABLE <new-table> EXCHANGE PARTITION
PZERO WITH TABLE <original-table>" to swap the new data with the old

A direct-path parallel INSERT...SELECT will scale far better than any
possible optimization on an UPDATE or DELETE statement.  The comparison
is not even close.

Hope this helps...


A Ebadi wrote: 

Biggest problem we've faced in coming up with a solution is none of the
solutions so far scale.  In other words, things are fine if we have a 20
million row table with 2-3 million duplicates - runs in 10-15 minutes.
However, trying it for 100+ million row table - it runs for hrs!


We've even had another tool (Informatica) select out the ROWIDs of the
duplicates into a separate table then we are using PL/SQL cursor to
delete those rows from the large table, but this doesn't scale either!


I'm currently looking at some of the suggestions I got from this list
and seeing if any of them will work with us.

Thanks for all who replied and I'll let you know how it goes!  

Mladen Gogala <mgogala@xxxxxxxxxxx> <mailto:mgogala@xxxxxxxxxxx>  wrote:

        On 12/12/2006 08:42:38 PM, tboss@xxxxxxxxxxxxxxxxxx wrote:
        > >From asktom, the best way I've found is to use Tom's little
code snippet below:
        > delete from table your_huge_table
        > where rowid in
        > (select rid
        > from
        > (select rowid rid,
        > row_number() over
        > (partition by varchar_that_defines_duplicates
        > order by rowid ) rn
        > from your_huge_table
        > )
        > where rn <> 1
        > )
        > /
        Good luck with that if cardinality is +60M rows. I would also
add a condition like
        WHERE ROWID in (SELECT ROW_ID from EXCEPTIONS) to your query.
You only need to populate
        exceptions table with duplicates and if those duplicates are a
small percentage of the
        total number of records, your task will be done two order of
magnitude faster then without
        the exceptions table.
        Mladen Gogala



Everyone is raving about the all-new Yahoo! Mail beta.


NOTICE:  This electronic mail message and any attached files are confidential.  
The information is exclusively for the use of the individual or entity intended 
as the recipient.  If you are not the intended recipient, any use, copying, 
printing, reviewing, retention, disclosure, distribution or forwarding of the 
message or any attached file is not authorized and is strictly prohibited.  If 
you have received this electronic mail message in error, please advise the 
sender by reply electronic mail immediately and permanently delete the original 
transmission, any attachments and any copies of this message from your computer 
system. Thank you.


Other related posts: