Re: de-dup process

From: "Gints Plivna" <gints.plivna@xxxxxxxxx>
To: ebadi01@xxxxxxxxx
Date: Thu, 14 Dec 2006 16:19:29 +0200

Ok I've created a test case on our development box with 3 striped SATA
disks, 2 CPUs and 2G of RAM, but only 500M are devoted to this 9.2
Oracle instance.

base_table was created from dba_source with one unique column and text
just to fill up some space
Then I multiplied base_table again and again until the count reached
119 258 304 rows. Space it took was about ~30GB.

Then I created temp_table with 20M rows to simulate your loaded table.
half of them i.e. 10M rows where different than those in base_table,
other 10M were the same.

Then I created unique index on base_table varchar2 column simalting
your unique key and altered this column to not null.

OK. Now I was ready to do the insert.

Firstly I've used following insert

insert /*+ append */ into base_table
select * from temp_table where ukrow in (
select ukrow from temp_table
minus
select ukrow from base_table);
with plan you can find in the link at very end of this message

It took 26 minutes and 11 seconds to complete
and it consisted of following steps show in v$session_longops
Seconds Opname Target

281 Sort Output
18 Sort/Merge
316 Hash Join
246 Sort Output
261 Index Fast Full Scan  GINTS.BASE_TABLE
525 Sort Output
37 Sort/Merge
177 Table Scan         GINTS.TEMP_TABLE
217 Table Scan         GINTS.TEMP_TABLE

The thing I didn't like was at least that GINTS.TEMP_TABLE was scanned
twice. Writing SQLs I tend to follow the principle scan any necessary
row as less as possible, at best only once. So after the first scan of
GINTS.TEMP_TABLE I actaully knew all data for insert and second scan
was a simple waste.

Then I rewrote my insert as follows:
insert /*+ append */ into base_table
select ukrow, text from (
select t.ukrow, t.text,
  case when b.ukrow = t.ukrow then 0 else 1 end flag
from temp_table t, base_table b
where t.ukrow = b.ukrow(+)
)
where flag = 1;

with plan you can find in the link at very end of this message

It took 16 minutes and 35 seconds to complete
and it consisted of following steps show in v$session_longops
Seconds Opname Target
196 Sort Output
39 Sort/Merge
363 Hash Join
244 Index Fast Full Scan         GINTS.BASE_TABLE
147 Table Scan                       GINTS.TEMP_TABLE

So summary is
10 M rows out of 20 M rows to insert in ~120 M rows table took 16.5
minutes, and it included also 1 UK index update.

So looking in the insert we can see that it is directly dependent on
temp_table size, base_table UK index size and involves 1 hash and 1
sort. To my mind it should scale rather linearly or at least the in
the same rate as worst of sort and hash join.

One more thing - consider using workarea_size_policy = manual with big
sort and hahs area_sizes instead of workarea_size_policy = auto and
[possibly] miserable pga_aggregate_target.

Gints Plivna
http://www.gplivna.eu

P.S. resent message because the body was too big.
The full spool file you can find in http://www.gplivna.eu/TabInsert.txt

2006/12/14, A Ebadi <ebadi01@xxxxxxxxx>:

Biggest problem we've faced in coming up with a solution is none of the
solutions so far scale.

--
//www.freelists.org/webpage/oracle-l

References:
- Re: de-dup process
  - From: Mladen Gogala
- Re: de-dup process
  - From: A Ebadi

Re: de-dup process

Other related posts: