Re: de-dup process
- From: A Ebadi <ebadi01@xxxxxxxxx>
- To: Tony van Lingen <tony.vanlingen@xxxxxxxxxxxxxx>, oracle-l@xxxxxxxxxxxxx
- Date: Thu, 14 Dec 2006 09:47:03 -0800 (PST)
We cannot clean the data before loading, as it comes from many different
sources that don't know about each other.
Thanks to everyone who replied. We are still testing to find the best
method!
Tony van Lingen <tony.vanlingen@xxxxxxxxxxxxxx> wrote:
A Ebadi wrote:
> The biggest problem we've faced in coming up with a solution is that none
> of the solutions so far scale. In other words, things are fine if we
> have a 20 million row table with 2-3 million duplicates - it runs in
> 10-15 minutes. However, trying it on a 100+ million row table - it
> runs for hours!
You do, of course, delete non-redoable? When deleting a row, Oracle will
create redo information which you, having done a direct load, will not
need. Generating it takes time.
>
> We've even had another tool (Informatica) select the ROWIDs of the
> duplicates into a separate table; we then use a PL/SQL cursor to
> delete those rows from the large table, but this doesn't scale either!
>
If you mean that deleting 20 million rows from a huge table is not as
fast as deleting 2, then no. Nothing will scale. Try buying more iron
and using parallel query.
Why don't you look at cleansing the dataset before loading it? e.g. use
'sort -u' on the file to get rid of duplicate lines. Might be quicker
than loading everything and deleting later on...
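
As a rough illustration of the 'sort -u' idea, here is a minimal sketch. It
assumes the extracts from the different sources are available as flat files
in one place before the load, and that a "duplicate" means a byte-identical
line; the file names and the pipe-delimited sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical extracts from two sources that overlap (illustrative data only).
workdir=$(mktemp -d)
printf '1001|alice\n1002|bob\n1003|carol\n' > "$workdir/source_a.dat"
printf '1002|bob\n1004|dave\n1001|alice\n' > "$workdir/source_b.dat"

# Sort all extracts together and keep one copy of each distinct line.
# LC_ALL=C forces byte-wise comparison, so only byte-identical lines
# are treated as duplicates.
LC_ALL=C sort -u "$workdir/source_a.dat" "$workdir/source_b.dat" \
    > "$workdir/deduped.dat"

# The de-duplicated file is what would then be handed to the loader.
cat "$workdir/deduped.dat"
```

Rows that differ in whitespace, case, or column order would not be caught
this way and would still need to be handled after the load.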
Cheers,
--
Tony van Lingen
Tech One Contractor
Information Management
Corporate Development Division
Environmental Protection Agency
Ph: (07) 3234 1972
Fax: (07) 3227 6534
Mobile: 0413 701 284
E-mail: tony.vanlingen@xxxxxxxxxxxxxx
Visit us online at www.epa.qld.gov.au