Re: Building fact tables in a data warehouse

  • From: "jaromir nemec" <jaromir@xxxxxxxxxxxx>
  • To: <tomdaytwo@xxxxxxxxx>, <oracle-l@xxxxxxxxxxxxx>
  • Date: Sun, 23 Dec 2007 21:49:25 +0100

Hi Thomas,

> This worked OK for our ETL developers when we were dealing with 10,000 rows
> in the development database.
> Now we're working with a much larger source set and we're talking
> 100,000,000 rows. It doesn't work that well.  
> Does anybody have experience or even a theoretical insight into a better way
> to do this?

One important thing in an ETL process is to know where the bottleneck is. From a 
bird's-eye view it can be
a) in the data source
b) in the transformation
c) in the data target

Except for some cases such as inefficient selects in the data source, indexes on 
the target, or contention with other processes, in the prevailing number of cases 
the problem lies in the transformation. This is due to the row-by-row processing 
logic of the majority of ETL tools.
A typical answer is to use parallelism, i.e. to start more ETL sessions working 
on partitioned sources and targets. This can indeed help, and with your row count 
it will probably be required.
But there is another possibility: to shift some of the transformation logic into 
the data source.
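For illustration only (the staging table stage_z, the target fact_z and all 
column names are invented for this sketch), a single parallel, direct-path 
INSERT ... SELECT pushes the whole transformation into one set-based SQL 
statement instead of row-by-row processing in the ETL tool:

  alter session enable parallel dml;

  insert /*+ append parallel(fact_z 8) */ into fact_z
         (source_key, event_day, amount)
  select /*+ parallel(s 8) */
         s.source_key,
         trunc(s.event_ts)  as event_day,   -- transformation expressed in SQL
         nvl(s.amount, 0)   as amount
  from   stage_z s;

  commit;

The direct-path (append) insert combined with parallel DML usually scales much 
better than single-row inserts issued by the ETL tool.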

> To build our fact table, let's call it Z  (as I understand the process) we
> grab the primary key of the row in dimension A plus information which
> we can use to find the associated row in dimension B.  We go to dimension B,
> grab the primary key plus information which allows us to
> find the associated row in dimension C, and so on through 8 dimension
> tables.

In this case I'd try to join the fact records with the dimensions in the 
database (e.g. using a parallel, partition-wise hash join), possibly 
pre-denormalizing the dimension hierarchy into a single table before the ETL 
process starts (to avoid the cascading joins).
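A rough sketch of what I mean; dim_a .. dim_c, stage_z and the column names are 
invented, and I am assuming the staged fact rows carry only the natural key of 
dimension A:

  -- flatten the dimension hierarchy once, before the load starts
  create table dim_abc_flat as
  select a.a_key, b.b_key, c.c_key,
         a.a_natural_key          -- lookup key carried by the staged fact rows
  from   dim_a a
  join   dim_b b on b.b_id = a.b_id
  join   dim_c c on c.c_id = b.c_id;

  -- then resolve all surrogate keys in one hash join
  -- instead of cascading per-row lookups through eight dimensions
  insert /*+ append */ into fact_z
         (a_key, b_key, c_key, measure_1, measure_2)
  select /*+ parallel(s 8) use_hash(f) */
         f.a_key, f.b_key, f.c_key,
         s.measure_1, s.measure_2
  from   stage_z s
  join   dim_abc_flat f on f.a_natural_key = s.a_natural_key;

With eight dimensions the flattened table simply gets more key columns; the 
point is that the joins against it are done once, set-based, by the database.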

One starting point when troubleshooting an ETL process is to rewrite the process 
(possibly simplified) as a CTAS statement. This gives you a quick orientation as 
to where the problem could be. If the CTAS doesn't perform as expected, it is 
probably an a) or c) problem and should be solved in the database. Otherwise it 
gives you an estimate of the throughput you can expect from the ETL 
transformation.
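For example (again with the invented names from above, and assuming you can 
afford NOLOGGING on a throw-away test table), a stripped-down CTAS of the same 
transformation gives a rough upper bound for the achievable throughput:

  create table fact_z_ctas_test
  parallel 8 nologging
  as
  select /*+ parallel(s 8) */
         f.a_key, f.b_key, f.c_key,
         s.measure_1, s.measure_2
  from   stage_z s
  join   dim_abc_flat f on f.a_natural_key = s.a_natural_key;

If this statement is already slow, the problem is in reading the source or 
writing the target; if it is fast, the ETL tool's transformation layer is the 
place to look.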

Regards,

Jaromir D.B. Nemec 
