Re: de-dup process

  • From: "jaromir nemec" <jaromir@xxxxxxxxxxxx>
  • To: <tboss@xxxxxxxxxxxxxxxxxx>
  • Date: Wed, 13 Dec 2006 22:22:12 +0100 (CET)


> (select rowid rid,
> row_number() over
> (partition by varchar_that_defines_duplicates
> order by rowid ) rn

A good thing about this approach is that you can control which of the
duplicated rows will be discarded and which preserved. You may order not
only by rowid but by any other attribute(s).
(In case the dups are defined based on a key attribute, there is no
guarantee that all attributes in the duplicated rows are identical.)

A common approach with incremental loading is to check for dups in the
newly loaded data and to eliminate them.

Optionally, if relevant, a second step is performed, in which the new
data is checked against your base table. A big optimisation can be
achieved if there is some business rule saying that the dups must be
close in time (i.e. they can appear only within a limited interval of
time). This allows the check to be limited to the few most recent
partitions of the base table instead of scanning the whole table.
Provided that the base table is partitioned by time, of course.
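As a sketch of that second step, assuming the base table is range-partitioned on a load_date column and that dups can only appear within, say, 7 days (all names here are hypothetical):

```sql
-- Insert only staging rows whose key is not already present
-- in the recent part of the base table. The load_date
-- predicate lets the optimizer prune the old partitions.
insert into base_table
select s.*
  from staging s
 where not exists (
         select 1
           from base_table b
          where b.dup_key    = s.dup_key
            and b.load_date >= sysdate - 7);
```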

Both steps can be combined and performed in one pass.
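One way to combine both steps in a single pass (again only a sketch, with hypothetical table and column names) is to rank the staging rows together with the recent base rows and insert only those staging rows that rank first within their key:

```sql
-- src marks the row's origin: 1 = already in base, 2 = new.
-- Ordering by src first guarantees an existing base row wins
-- over any new duplicate; rn = 1 also dedups within staging.
insert into base_table (dup_key, payload, load_date)
select dup_key, payload, load_date
  from (select u.*,
               row_number() over
                 (partition by dup_key
                  order by src, load_date) rn
          from (select dup_key, payload, load_date, 1 src
                  from base_table
                 where load_date >= sysdate - 7
                union all
                select dup_key, payload, load_date, 2 src
                  from staging) u)
 where rn = 1
   and src = 2;
```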



