RE: de-dup process

  • From: "Ken Naim" <kennaim@xxxxxxxxx>
  • To: <ebadi01@xxxxxxxxx>, "'Tony van Lingen'" <tony.vanlingen@xxxxxxxxxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
  • Date: Thu, 14 Dec 2006 13:35:28 -0500

For the initial load, use an external table and do an insert as select
distinct. After the initial load, use an external table for the file and use
the merge statement to insert only the rows you want. Alternatively, you can
union the two sets into a third table, then rename the tables, truncate the
original, and swap back and forth nightly between the two. The select from
the external table can be optimized to exclude values known to be
duplicates, i.e. anything over 6 months old or whatever criteria make sense
for you.
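A rough sketch of the three steps (table names, the external table, and the key column `id` are all hypothetical placeholders for your own schema):

```sql
-- Initial load: de-dup while inserting from the external table
INSERT /*+ APPEND */ INTO target_tab
SELECT DISTINCT * FROM ext_tab;

-- Nightly loads: MERGE inserts only rows not already present,
-- assuming `id` is the column (or columns) that defines a duplicate
MERGE INTO target_tab t
USING (SELECT DISTINCT id, col1, col2 FROM ext_tab) s
ON (t.id = s.id)
WHEN NOT MATCHED THEN
  INSERT (id, col1, col2) VALUES (s.id, s.col1, s.col2);

-- Swap variant: build the de-duplicated union in a third table,
-- then RENAME the tables and truncate the old one
CREATE TABLE target_swap AS
SELECT * FROM target_tab
UNION               -- UNION (not UNION ALL) removes duplicates
SELECT * FROM ext_tab;
```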
 
Ken
 
  _____  

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of A Ebadi
Sent: Thursday, December 14, 2006 12:47 PM
To: Tony van Lingen; oracle-l@xxxxxxxxxxxxx
Cc: ebadi01@xxxxxxxxx
Subject: Re: de-dup process
 
Cannot clean the data before loading, as it comes from many different
sources that don't know about each other.
 
Thanks to everyone who replied; still doing testing to find the best
method!

Tony van Lingen <tony.vanlingen@xxxxxxxxxxxxxx> wrote:


A Ebadi wrote:

> Biggest problem we've faced in coming up with a solution is none of 
> the solutions so far scale. In other words, things are fine if we 
> have a 20 million row table with 2-3 million duplicates - runs in 
> 10-15 minutes. However, trying it for 100+ million row table - it 
> runs for hrs!

You do of course delete without generating redo? When deleting a row, Oracle
will create redo info which, having done a direct load, you will not
need. This'll take time.

> 
> We've even had another tool (Informatica) select out the ROWIDs of the 
> duplicates into a separate table then we are using PL/SQL cursor to 
> delete those rows from the large table, but this doesn't scale either!
> 

If you mean that deleting 20 million rows from a huge table is not as
fast as deleting 2-3 million from a smaller one, then no, nothing will
scale. Try buying more iron and using parallel query.
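A parallel delete driven by the ROWID table mentioned above might look something like this (the table names `big_tab` and `dup_rowids` and the degree of 8 are hypothetical; parallel DML must be enabled for the session):

```sql
-- Parallel DML is disabled by default; switch it on for this session
ALTER SESSION ENABLE PARALLEL DML;

-- Delete the rows whose ROWIDs were collected into dup_rowids
DELETE /*+ PARALLEL(t, 8) */ FROM big_tab t
WHERE t.rowid IN (SELECT dup_rowid FROM dup_rowids);

COMMIT;
```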

Why don't you look at cleansing the dataset before loading it? e.g. use 
'sort -u' on the file to get rid of duplicate lines. Might be quicker 
than loading everything and deleting later on...
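For whole-line duplicates this is a one-liner (the file name here is made up for illustration):

```shell
# Hypothetical load file; lines must be byte-for-byte identical
# for sort -u to treat them as duplicates
printf 'a,1\nb,2\na,1\n' > load.dat

# De-duplicate before loading (sorts the file as a side effect)
sort -u load.dat > load_dedup.dat

cat load_dedup.dat
# a,1
# b,2
```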

Cheers,

-- 
Tony van Lingen
Tech One Contractor
Information Management
Corporate Development Division
Environmental Protection Agency

Ph: (07) 3234 1972
Fax: (07) 3227 6534
Mobile: 0413 701 284
E-mail: tony.vanlingen@xxxxxxxxxxxxxx

Visit us online at www.epa.qld.gov.au
--


 
  
