When an UPDATE or DELETE won't scale (and it never will!), convert it
into a parallel NOLOGGING direct-path INSERT.
Use a query similar to Tom's to select only the first "copy" of each
row (i.e. use "RN = 1" instead of "RN <> 1" in the outermost WHERE
clause) and INSERT the rows to be retained into a new table. Make the
new table NOLOGGING, use APPEND and PARALLEL hints on the INSERT
portion and PARALLEL hints on the SELECT portion of the INSERT ...
SELECT statement.
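Here's a minimal sketch, assuming hypothetical names (big_table,
big_table_new, columns col1-col3, with col1 being the column that
defines a duplicate) and a degree of parallelism of 8:

    -- Empty NOLOGGING copy of the original table (structure only).
    CREATE TABLE big_table_new NOLOGGING AS
      SELECT * FROM big_table WHERE 1 = 0;

    -- Without this, the INSERT itself runs serially.
    ALTER SESSION ENABLE PARALLEL DML;

    INSERT /*+ APPEND PARALLEL(big_table_new, 8) */ INTO big_table_new
    SELECT col1, col2, col3
      FROM (SELECT /*+ PARALLEL(t, 8) */
                   t.col1, t.col2, t.col3,
                   ROW_NUMBER() OVER
                     (PARTITION BY col1 ORDER BY ROWID) rn
              FROM big_table t)
     WHERE rn = 1;   -- keep only the first copy of each duplicate set

    COMMIT;

The APPEND hint gives you the direct path, and loading a NOLOGGING
table keeps redo generation to a minimum.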
Here's the cute part. If you can't simply RENAME the two tables (i.e.
old and new) to swap them, then make the newly-created table look
exactly like the original table, but make it range-partitioned on a
dummy numeric column (call it DUMMY or PARTKEY), give this column a
constant value of zero, and name the single partition PZERO (or PDUMMY
or whatever). Also, add this DUMMY or PARTKEY column to the original
table as well. Then, use "ALTER TABLE <new-table> EXCHANGE
PARTITION PZERO WITH TABLE <original-table>" to swap the new data
with the old data.
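A hedged sketch of that swap (again, all names are placeholders, and a
two-column table stands in for the real structure):

    -- New table: identical columns plus the dummy partitioning column,
    -- range-partitioned with a single partition that accepts any value.
    CREATE TABLE big_table_new (
      col1    VARCHAR2(30),
      col2    NUMBER,
      partkey NUMBER DEFAULT 0
    )
    PARTITION BY RANGE (partkey)
      (PARTITION pzero VALUES LESS THAN (MAXVALUE))
    NOLOGGING;

    -- The original table needs the same dummy column, in the same
    -- position, for the exchange to be allowed.
    ALTER TABLE big_table ADD (partkey NUMBER DEFAULT 0);

    -- ... run the parallel direct-path INSERT into big_table_new ...

    -- Swap the segments: a data-dictionary operation, not a data copy.
    ALTER TABLE big_table_new
      EXCHANGE PARTITION pzero WITH TABLE big_table;

After the exchange, the original table name points at the
de-duplicated data, and the dummy column can be dropped or set UNUSED
at leisure.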
A direct-path parallel INSERT...SELECT will scale far better than any
possible optimization on an UPDATE or DELETE statement. The comparison
is not even close.
Hope this helps...
A Ebadi wrote:
The biggest problem we've faced in coming up with a solution is that
none of the solutions so far scale. In other words, things are fine if
we have a 20 million row table with 2-3 million duplicates - that runs
in 10-15 minutes. However, trying it on a 100+ million row table, it
runs far too long to be usable.
We've even had another tool (Informatica) select out the ROWIDs of the
duplicates into a separate table, then used a PL/SQL cursor to delete
those rows from the large table, but this doesn't scale either.
I'm currently looking at some of the suggestions I got from this list
and seeing if any of them will work for us.
Thanks for all who replied and I'll let you know how it goes!
On 12/12/2006 08:42:38 PM, tboss@xxxxxxxxxxxxxxxxxx
> From asktom, the best way I've found is to use Tom's little
> code snippet below:
> delete from your_huge_table
> where rowid in
>   (select rid
>      from (select rowid rid,
>                   row_number() over
>                     (partition by varchar_that_defines_duplicates
>                      order by rowid) rn
>              from your_huge_table)
>     where rn <> 1);
Good luck with that if the cardinality is 60M+ rows. I would also add
a WHERE ROWID IN (SELECT row_id FROM exceptions) to your query. You
only need to populate the exceptions table with the rowids of the
duplicates, and if those duplicates are a small percentage of the
total number of records, your task will be done two orders of
magnitude faster than without the exceptions table.
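A minimal sketch of that approach, assuming a hypothetical exceptions
table with a single row_id column, populated by reusing Tom's
duplicate-finding query:

    -- Capture the rowids of the duplicates (every copy after the first).
    CREATE TABLE exceptions NOLOGGING AS
    SELECT rid AS row_id
      FROM (SELECT ROWID rid,
                   ROW_NUMBER() OVER
                     (PARTITION BY varchar_that_defines_duplicates
                      ORDER BY ROWID) rn
              FROM your_huge_table)
     WHERE rn <> 1;

    -- The delete now only has to visit the known-duplicate rows.
    DELETE FROM your_huge_table
     WHERE ROWID IN (SELECT row_id FROM exceptions);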