Re: de-dup process

  • From: Nigel Thomas <nigel_cl_thomas@xxxxxxxxx>
  • To: ebadi01@xxxxxxxxx, oracle-l <oracle-l@xxxxxxxxxxxxx>
  • Date: Thu, 14 Dec 2006 11:47:16 -0800 (PST)

>> Cannot clean data before loading as data is from many different sources
>> that don't know about each other.
How many is many? 
If you really have many different load files (or can manufacture them), then 
you may be able to change the way the problem scales. 
How do you decide which of the duplicates to use?
If you can use the "first come, first served" rule, your de-duplication 
problem becomes much simpler: duplicate avoidance rather than duplicate removal.
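To make "duplicate avoidance" concrete, here's a minimal sketch (mine, not 
from the thread - SQLite via Python standing in for Oracle, with INSERT OR 
IGNORE standing in for a MERGE or anti-join insert):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base_table (cust_id INTEGER PRIMARY KEY, name TEXT)")

# Two feeds that don't know about each other; cust_id 1 appears in both.
feed_a = [(1, "Smith"), (2, "Jones")]
feed_b = [(1, "Smith, J."), (3, "Brown")]

# First come, first served: INSERT OR IGNORE keeps the first row seen for
# each key and silently drops later duplicates - avoidance, not removal.
for feed in (feed_a, feed_b):
    conn.executemany("INSERT OR IGNORE INTO base_table VALUES (?, ?)", feed)

rows = conn.execute("SELECT cust_id, name FROM base_table ORDER BY cust_id").fetchall()
print(rows)  # the cust_id 1 row from feed_a wins
```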
Is there any easy way you can partition the incoming data so you can be sure 
that records in files in group A can't interfere with records in files in 
groups B, C, D, etc?
If the files 'overlap' when they come in, you may still be able to split them 
on a partition key - i.e. a component of the unique key - before they get loaded.
Task I: "Reception"
First, as data files arrive, deal them out into N non-overlapping file groups, 
where with luck N is an acceptable degree of (external) parallelism. This task 
is not internally parallelised:
1. Get the next file.
2. Place it as-is into the appropriate load file group for task III, OR pass 
it to a splitter process for task II if necessary.
3. Go back to step 1.
This is NOT parallel - like the maître d' in a restaurant, it does one thing 
quickly, then hands you off:
- "Here is Annette to take you to your table" OR 
- "Your table is by the window"
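The reception loop might look something like this (my sketch; the directory 
names, the .mixed suffix and the needs_split rule are all made up for 
illustration):

```python
import shutil
from pathlib import Path

SPLIT_QUEUE = Path("to_split")   # hypothetical layout - not from the post
N_GROUPS = 4                     # external degree of parallelism

def needs_split(path: Path) -> bool:
    # Placeholder rule: in practice you know from the source whether a
    # file can span more than one partition group.
    return path.suffix == ".mixed"

def dispatch(path: Path) -> Path:
    """One quick decision per file, then hand it off (the maître d')."""
    if needs_split(path):
        dest = SPLIT_QUEUE / path.name             # task II will split it
    else:
        group = hash(path.stem) % N_GROUPS         # any stable rule works
        dest = Path(f"group_{group}") / path.name  # straight to task III
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(path), str(dest))
    return dest
```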

Task II: "Your table is ready"
This step deals with physically partitioning individual files outside the 
database, as a precursor to task III. You can skip this if you don't need to 
physically split files:
1. Find a file that's ready to split (eg it's in a specific directory).
2. Deal its records into a file for each load group.
You can parallelise this task if necessary - have as many parallel splitter 
processes as the CPU and file system can stand, to ensure that there is always 
a file ready for every loader process in the next task.

This is parallel - like the number of waiters in a restaurant; one of them 
organises you, takes your order, etc.
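A sketch of one splitter process, assuming CSV files whose first column is 
the partition-key component (that layout is my assumption, not from the post):

```python
import csv
from pathlib import Path

N_GROUPS = 4

def split_file(src: Path, out_dir: Path) -> list:
    """Deal each record of src into one output file per load group."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"{src.stem}.group_{g}.csv" for g in range(N_GROUPS)]
    outs = [p.open("w", newline="") for p in paths]
    try:
        writers = [csv.writer(f) for f in outs]
        with src.open(newline="") as fh:
            for row in csv.reader(fh):
                # Use the same stable partition function everywhere, so
                # group g never receives a key belonging to another group.
                writers[int(row[0]) % N_GROUPS].writerow(row)
    finally:
        for f in outs:
            f.close()
    return paths
```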

Task III: "The meal"
Within each group, set up a loader process:
1. Find a file that's ready to load for this group (eg it's in a specific 
directory).
2. Load it into temporary table TEMP_TABLE_A (truncate first). Each group has 
its own temp table.
3. Merge TEMP_TABLE_A into your master table BASE_TABLE (using the technique 
Gints Plivna mentioned).
4. Mark the file as loaded (eg move it to another directory).
5. Go back to step 1 until files are exhausted (or you are).
This is parallel (like the number of tables in a restaurant). Restaurants can 
add more tables - but only if the kitchen can keep up... otherwise you get mad 
and leave.
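One loader iteration, sketched again in Python/SQLite (INSERT OR IGNORE as a 
stand-in for the Oracle MERGE; I'm not claiming this is exactly the technique 
Gints described):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_table   (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE temp_table_a (id INTEGER, payload TEXT);
""")

def load_and_merge(rows):
    """One loader iteration: truncate, load, merge, done."""
    conn.execute("DELETE FROM temp_table_a")       # stand-in for TRUNCATE
    conn.executemany("INSERT INTO temp_table_a VALUES (?, ?)", rows)
    # First-come-first-served merge: rows whose id already exists in
    # BASE_TABLE are silently dropped.  In Oracle, this join between temp
    # and base table is where the hash-join / file-size tuning matters.
    conn.execute("""
        INSERT OR IGNORE INTO base_table
        SELECT id, payload FROM temp_table_a
    """)
    conn.commit()

load_and_merge([(1, "a"), (2, "b")])   # first file for this group
load_and_merge([(2, "B"), (3, "c")])   # id 2 is a duplicate: dropped
```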

Summary
You now have some flexibility to tune things:
- the size of the individual files, to get the best out of the hash join from 
TEMP_TABLE_A to BASE_TABLE
- the external degree of parallelism (how many loader group processes, how 
many splitter processes)
- the (internal) degree of parallelism within the database (for each group), 
so that the sum of parallel servers required across all groups matches the 
capacity of your server (ie DOP * N = function of max_parallel_servers)
- task II could be done inside Oracle (easier to manage?) - substitute temp 
tables for temp files throughout
- or task II could be pipelined into task III using FIFOs (meaning you don't 
have to double-handle the data on disc - but this can cause other problems)
And BTW, this doesn't necessarily need the Partitioning option...

HTH

Regards Nigel
