Re: de-dup process


A Ebadi wrote:

> Biggest problem we've faced in coming up with a solution is none of 
> the solutions so far scale.  In other words, things are fine if we 
> have a 20 million row table with 2-3 million duplicates - runs in 
> 10-15 minutes.  However, trying it for 100+ million row table - it 
> runs for hrs!

You are, of course, deleting non-redoable? When deleting a row, Oracle 
creates redo info which you, having done a direct load, will not 
need. Generating it takes time.

>  
> We've even had another tool (Informatica) select out the ROWIDs of the 
> duplicates into a separate table then we are using PL/SQL cursor to 
> delete those rows from the large table, but this doesn't scale either!
>  

If you mean that deleting 20 million rows from a huge table is not as 
fast as deleting 2 million, then no, nothing will scale. Try buying 
more iron and using parallel query.

Why don't you look at cleansing the dataset before loading it? For 
example, use 'sort -u' on the file to get rid of duplicate lines. That 
might be quicker than loading everything and deleting later on...
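
If the load file cannot be re-sorted (say the loader expects the 
original row order), the same cleansing idea can be sketched in a few 
lines of Python. This is only an illustration, not a tested part of 
anyone's load process: the function name dedup_lines and the file names 
are made up, and it assumes that a "duplicate" means a byte-for-byte 
identical line and that the set of distinct lines fits in memory. For a 
100+ million row file that last assumption may well not hold, in which 
case 'sort -u' (which spills to disk) is the safer bet.

```python
import sys


def dedup_lines(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first copy of each line.

    Hypothetical helper for illustration. Returns (lines_read, lines_written).
    """
    seen = set()          # every distinct line seen so far (held in memory)
    total = kept = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            total += 1
            # Ignore the line terminator so a final line without a newline
            # still matches an earlier identical line.
            key = line.rstrip(b"\r\n")
            if key in seen:
                continue  # duplicate: skip it
            seen.add(key)
            dst.write(line)
            kept += 1
    return total, kept


if __name__ == "__main__":
    # Usage: python dedup.py load_file.dat load_file.dedup.dat
    read, written = dedup_lines(sys.argv[1], sys.argv[2])
    print("read %d lines, wrote %d, dropped %d" % (read, written, read - written))
```

Unlike 'sort -u', this keeps the rows in their original order and makes 
a single pass over the file, at the price of holding every distinct 
line in memory.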

Cheers,

-- 
Tony van Lingen
Tech One Contractor
Information Management
Corporate Development Division
Environmental Protection Agency



--
http://www.freelists.org/webpage/oracle-l

