Re: de-dup process

  • From: tboss@xxxxxxxxxxxxxxxxxx
  • To: ebadi01@xxxxxxxxx
  • Date: Tue, 12 Dec 2006 20:42:38 -0500 (EST)

From asktom, the best way I've found is to use Tom's little code snippet below:

delete from your_huge_table
where rowid in
  (select rid
     from (select rowid rid,
                  row_number() over
                    (partition by varchar_that_defines_duplicates
                     order by rowid) rn
             from your_huge_table)
    where rn <> 1)
/

It handles keys with more than two duplicate rows (the row with the lowest
rowid in each group is kept), and in my experience it works far faster than
any NOT EXISTS, MINUS, or cursor-based solution.
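
Before running the delete, it can be worth checking how many rows it would
remove. A rough sketch, reusing the placeholder names from the snippet above
(your_huge_table and varchar_that_defines_duplicates are stand-ins, not real
object names):

```sql
-- Count the rows the delete would remove: every row in a duplicate
-- set except the one with the lowest rowid.
select count(*) as rows_to_delete
  from (select row_number() over
                 (partition by varchar_that_defines_duplicates
                  order by rowid) rn
          from your_huge_table)
 where rn <> 1;
```

If the count is far from the number of duplicates you expect, that is a hint
that the partition column is not the right definition of a duplicate.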

A few other options that may be faster, if they are feasible for you:
1. create table as select distinct; probably faster than doing any sort of
deleting.  Note that DISTINCT only collapses rows that are identical in
every selected column.

2. Alter table mytab enable constraint PK exceptions into exceptions;
Better way: much faster for large tables, and it lets you audit the
duplicate rows by examining the exceptions table.  (You must run
$ORACLE_HOME/rdbms/admin/utlexcpt.sql before doing this.)
Con: the exceptions table will contain ALL of the rows in each set of
duplicates in the source table, not just the extras ... you'll have to
delete the ones you don't want manually.

3. Use unix.  Perhaps the purest, fastest way is to use the unix sort/uniq
commands:
a. unload the data to a delimited flat file (SQL*Loader only loads, so
   spool a delimited select or use an unload tool)
b. sort filename | uniq > newfile
c. sqlload back in.

This is only a viable option if your table is "thin" and has just a few
columns.
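
Options 1 and 2 might look roughly like the following. This is a sketch under
assumptions, not something tested against the poster's schema: key_col, col_a,
col_b, your_huge_table_dedup and mytab_pk are hypothetical names. Option 1 is
shown with row_number() rather than DISTINCT because the original question
defines duplicates by a single column.

```sql
-- Option 1 (variant): build a de-duplicated copy instead of deleting.
-- Plain SELECT DISTINCT is enough when duplicate rows match in every
-- column; this form keeps one row per key_col when only the key repeats.
-- key_col, col_a, col_b are hypothetical column names.
create table your_huge_table_dedup nologging as
select key_col, col_a, col_b
  from (select t.key_col, t.col_a, t.col_b,
               row_number() over (partition by t.key_col
                                  order by t.rowid) rn
          from your_huge_table t)
 where rn = 1;

-- Option 2: let the constraint report the duplicates.
-- Run utlexcpt.sql once from SQL*Plus to create the EXCEPTIONS table:
--   @?/rdbms/admin/utlexcpt.sql
-- mytab_pk is a hypothetical constraint name.
alter table mytab add constraint mytab_pk primary key (key_col) disable;

-- This statement raises an error when duplicates exist, but the rowids
-- of the offending rows are still written to EXCEPTIONS.
alter table mytab enable constraint mytab_pk exceptions into exceptions;

-- Audit the rows that were flagged.
select t.*
  from mytab t
 where t.rowid in (select row_id
                     from exceptions
                    where table_name = 'MYTAB');

-- Delete all but the lowest-rowid row of each flagged key.
delete from mytab
 where rowid in
   (select rid
      from (select t.rowid rid,
                   row_number() over (partition by t.key_col
                                      order by t.rowid) rn
              from mytab t
             where t.rowid in (select row_id
                                 from exceptions
                                where table_name = 'MYTAB'))
     where rn > 1);
```

With option 1 you would then swap the new table in for the old one (rename,
re-create indexes and grants), which is where most of the extra work lies.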

hope this helps, todd

> 
> We have a huge table (> 160 million rows) which has about 20 million 
> duplicate rows that we need to delete.  What is the most efficient way to do 
> this as we will need to do this daily?
>   A single varchar2(30) column is used to identify duplicates.  We could 
> possibly have > 2 rows of duplicates.
>    
>   We are doing direct path load so no unique key indexes can be put on the 
> table to take care of the duplicates.
>    
>   Platform: Oracle 10G RAC (2 node) on Solaris 10.
>    
--
//www.freelists.org/webpage/oracle-l

