Re: de-dup process

When an UPDATE or DELETE won't scale (and it never will!), convert it into a parallel nologging direct-path INSERT.

Use a query similar to Tom's to insert only the first "copy" of a row (i.e. use "RN = 1" instead of "RN <> 1" the outermost WHERE clause) to INSERT the rows to be retained into a new table.  Make the new table NOLOGGING, use APPEND and PARALLEL hints on the INSERT portion and PARALLEL hints on the SELECT portion of the INSERT ... SELECT statement.

Here's the cute part.  If you can't simply RENAME the two tables (i.e. old and new) to swap them, then make the newly-created table look exactly like the original table, but make it range-partitioned on a dummy numeric column (call it DUMMY or PARTKEY), give this column a constant value of zero, and name the single partition PZERO (or PDUMMY or whatever).  Also, add this DUMMY or PARTKEY column to the original table as well.  Then, use "ALTER TABLE <new-table> EXCHANGE PARTITION PZERO WITH TABLE <original-table>" to swap the new data with the old data.

A direct-path parallel INSERT...SELECT will scale far better than any possible optimization on an UPDATE or DELETE statement.  The comparison is not even close.

Hope this helps...

-Tim



A Ebadi wrote:
Biggest problem we've faced in coming up with a solution is none of the solutions so far scale.  In other words, things are fine if we have a 20 million row table with 2-3 million duplicates - runs in 10-15 minutes.  However, trying it for 100+ million row table - it runs for hrs!
 
We've even had another tool (Informatica) select out the ROWIDs of the duplicates into a separate table then we are using PL/SQL cursor to delete those rows from the large table, but this doesn't scale either!
 
I'm currently looking at some of the suggestions I got from this list and seeing if any of them will work with us.

Thanks for all who replied and I'll let you know how it goes! 


Mladen Gogala <mgogala@xxxxxxxxxxx> wrote:

On 12/12/2006 08:42:38 PM, tboss@xxxxxxxxxxxxxxxxxx wrote:
> >From asktom, the best way I've found is to use Tom's little code snippet below:
>
> delete from table your_huge_table
> where rowid in
> (select rid
> from
> (select rowid rid,
> row_number() over
> (partition by varchar_that_defines_duplicates
> order by rowid ) rn
> from your_huge_table
> )
> where rn <> 1
> )
> /



Good luck with that if cardinality is +60M rows. I would also add a condition like
WHERE ROWID in (SELECT ROW_ID from EXCEPTIONS) to your query. You only need to populate
exceptions table with duplicates and if those duplicates are a small percentage of the
total number of records, your task will be done two order of magnitude faster then without
the exceptions table.
--
Mladen Gogala
http://www.mladen-gogala.com

--
http://www.freelists.org/webpage/oracle-l




Everyone is raving about the all-new Yahoo! Mail beta.
-- http://www.freelists.org/webpage/oracle-l

Other related posts: