Re: Data Warehouse Guru Question (Data Model question) between FACT and DIMENSION tbl

  • From: Tim Gorman <tim.evdbt@xxxxxxxxx>
  • To: Chris Taylor <christopherdtaylor1994@xxxxxxxxx>
  • Date: Wed, 4 Aug 2021 07:34:40 -0700

Chris,

Apologies for slow response!

Comments in-line in /red italicized font/ below...

Thanks!

-Tim

On 7/30/2021 6:16 AM, Chris Taylor wrote:

Tim,

Thank you for the detailed response.

There are a lot of dimensions involved in the query, this one jumps out though because this dimension is very large and the secondary pass on the "*TABLE ACCESS BY LOCAL INDEX ROWID" *with a filter condition is a bit annoying.

So two MORE questions if you don't mind if I pick your brain a bit.

1.) *So if we were filtering on a lot of conditions on all the dimensions, it would be normal to see the filter condition applied after the TABLE ACCESS BY LOCAL INDEX ROWID on all the dimensions where we filter for some condition then?*  (That is, is it normal/expected to get a filter condition against the table access for the rows we are retrieving?  That's annoying because effectively a second pass/sanitation of the data but it sounds like it's expected.)  Some of these dimensions are really, really large.

   /[TG]: Yes, a joinback is expected, to go back to the dimension
   table to retrieve columns which were not part of the original star
   transformation operation./



2.) *Is it NORMAL (or a no-no) to join dimensions to dimensions as part of a data warehouse query?*  We've got the following two situations:
a.)  A join between 2 dimension tables on an ID field and also joined to the FACT table on the same ID field.  I think the join between the DIMs is redundant here and causes an extra condition.
b.) We have a situation where partition pruning isn't happening because a separate dimension table has the partition key but isn't being brought into the joins.  So to get to the partition key of the FACT table, we have to join 2 other DIMENSIONS together (Dim to Dim Join) and then join that result back to the FACT table to have the appropriate partition_key for the FACT table.  I've tested doing that and the performance didn't really improve much.  What's crazy is, if I do an IN statement instead and do a sub query against the other DIMENSION to pull out the partitions I want (SITE_IDs) , the query returns in 4 seconds (vs over a minute).

   /[TG]: Yes, I'm guessing that a DIM->DIM join would
   exclude/disqualify both DIM tables from the initial star
   transformation operation.  The IN statement on FACT is likely
   happening at a stage subsequent to the massive filtering of the star
   transformation, around the same time as the dimensional joinbacks. 
   So I think it would be expected for that option to perform better./



For example of item 2.b):  (beware SQL text ahead!) - two examples

*Example 1 of a DIM to DIM join to get the SITE partition key on the FACT table:*
/*performance bad/
WITH
    SAWITH0
    AS
        (SELECT SUM (T1586.UOM_QNTY_CNT) AS c1,
                COUNT (DISTINCT T1526.CE_INTRNL_DRUG_CD) AS c2
           FROM DM_CHARGE_DIM          T1526  /* Dim_CHARGE */
                                            ,
                DM_DATE_DAY_DIM        T1568             /* Dim_DISP_SRV_DT */
                                            ,
                DM_COVERED_ENTITY_DIM  T1032          /* Dim_COVERED_ENTITY */
                                            ,
                DM_DISPENSING_FACT     T1586 */* Fact_DISPENSING partitioned by SITE_KEY_NB */,*
                , DM_SITE_DIM T1 */* Added Table with SITE_KEY_NB partition */*
          WHERE ( T1032.CE_KEY_NB = T1586.CE_KEY_NB
                 AND T1526.CHG_PRICE_KEY_NB = T1586.CHG_PRICE_KEY_NB
                 AND T1568.DT_KEY_NB = T1586.SRV_DT_KEY_NB
                 AND CASE WHEN T1526.HID_MAP_FLG = 1 THEN 'Yes' ELSE 'No' END =
                     'No'
                 AND CASE WHEN T1526.NON_RX_FLG = 1 THEN 'Yes' ELSE 'No' END =
                     'No'
                 AND CONCAT (
                         CONCAT (TRIM (BOTH ' ' FROM T1568.MO_DESC), ' '),
                         CAST (T1568.YR_NB AS CHARACTER (4))) =
                     'December 2020'
                 AND (T1032.CE_GRP_ID IN (29.0,
                                          30.0,
                                          31.0,
                                          32.0,
                                          741.0,
                                          4223.0,
                                          7091.0,
                                          10552.0))
                 AND (T1526.CE_GRP_ID IN (29.0,
                                          30.0,
                                          31.0,
                                          32.0,
                                          741.0,
                                          4223.0,
                                          7091.0,
                                          10552.0)))
*AND T1586.SITE_KEY_NB = T1.SITE_KEY_NB /* Added Join to get SITE_KEY_NB */*
*AND T1032.CE_ID = T1.CE_ID  /* Added Join on CE_IDs to find appropriate SITE_KEY_NBs from DM_SITE_DIM , DIM-DIM join */*
                                          )
SELECT D1.c1 AS c1, D1.c2 AS c2, D1.c3 AS c3
  FROM (SELECT DISTINCT 0 AS c1, D1.c2 AS c2, D1.c1 AS c3
          FROM SAWITH0 D1) D1
 WHERE ROWNUM <= 125001

*Example 2 of adding an IN statement for the FACT table instead:*
/*performance great!/
WITH
    SAWITH0
    AS
        (SELECT SUM (T1586.UOM_QNTY_CNT) AS c1,
                COUNT (DISTINCT T1526.CE_INTRNL_DRUG_CD) AS c2
           FROM DM_CHARGE_DIM          T1526  /* Dim_CHARGE */
                                            ,
                DM_DATE_DAY_DIM        T1568             /* Dim_DISP_SRV_DT */
                                            ,
                DM_COVERED_ENTITY_DIM  T1032          /* Dim_COVERED_ENTITY */
                                            ,
                DM_DISPENSING_FACT     T1586 * /* Fact_DISPENSING partitioned by SITE_KEY_NB */,*
          WHERE ( T1032.CE_KEY_NB = T1586.CE_KEY_NB
                 AND T1526.CHG_PRICE_KEY_NB = T1586.CHG_PRICE_KEY_NB
                 AND T1568.DT_KEY_NB = T1586.SRV_DT_KEY_NB
                 AND CASE WHEN T1526.HID_MAP_FLG = 1 THEN 'Yes' ELSE 'No' END =
                     'No'
                 AND CASE WHEN T1526.NON_RX_FLG = 1 THEN 'Yes' ELSE 'No' END =
                     'No'
                 AND CONCAT (
                         CONCAT (TRIM (BOTH ' ' FROM T1568.MO_DESC), ' '),
                         CAST (T1568.YR_NB AS CHARACTER (4))) =
                     'December 2020'
                 AND (T1032.CE_GRP_ID IN (29.0,
                                          30.0,
                                          31.0,
                                          32.0,
                                          741.0,
                                          4223.0,
                                          7091.0,
                                          10552.0))
                 AND (T1526.CE_GRP_ID IN (29.0,
                                          30.0,
                                          31.0,
                                          32.0,
                                          741.0,
                                          4223.0,
                                          7091.0,
                                          10552.0)))
*AND **T1586.SITE_KEY_NB IN (SELECT T1.SITE_KEY_NB /* Find the Site Key Numbers that belong to our CE_GRP_IDs */*
*        from DIM_SITE_DIM T1*
*                                  where T1.CE_ID IN*
*                                        (*
*                                        SELECT CE_ID*
*                                        FROM DM_COVERED_ENTITY_DIM T2 /* This table is our main join above as well */*
*                                        WHERE T2.CE_GRP_ID IN (29.0,*
*                30.0,
                31.0,
                32.0,
                741.0,
                4223.0,
                7091.0,
                10552.0*
*)*
*                                        )*
*                                        )*
                                          )
SELECT D1.c1 AS c1, D1.c2 AS c2, D1.c3 AS c3
  FROM (SELECT DISTINCT 0 AS c1, D1.c2 AS c2, D1.c1 AS c3
          FROM SAWITH0 D1) D1
 WHERE ROWNUM <= 125001

I also tried creating a mapping View between DM_SITE_DIM and DM_COVERED_ENTITY_DIM that has SITE_KEY_NB, CE_ID, CE_GRP_ID and add that to the join and performance is also equally bad.

Also we have the PKs and enabled FKs (novalidate) between the DIMs and FACT.

Finally, I will look CREATE HIERARCHY stuff you mentioned along with the BITMAP join indexes.  Those concern me a bit because we do a lot of parallel DDL when running the ETL process and there seems to be a lot of Oracle caveats/warnings with using BITMAP join indexes.

   /[TG]: There are a lot of limitations on BJI, but if they can be
   built, then they provide an efficiency similar to that of
   materialized views have on normal table join operations and
   filtering operations.  That is, the BJI is the intermediate result
   of the star transformation materialized as a bitmap index, which
   allows queries to skip over portions of the normal star
   transformation process, and I believe their presence also helps the
   CBO choose their use, effectively coercing a star transformation
   operation./



Chris



On Thu, Jul 29, 2021 at 6:43 PM Tim Gorman <tim.evdbt@xxxxxxxxx <mailto:tim.evdbt@xxxxxxxxx>> wrote:

    Chris,

    It is just the two tables (i.e. FACT and DIM) involved in the
    query?  There are no additional dimension tables?

    If so, that is likely much of the problem, because the concept of
    a dimensional data model (a.k.a. star schema) as well as Oracle's
    "star transformation" mechanism is built around the idea of two or
    more dimensions being involved in any query on the fact table.  As
    many dimensions as possible, in fact.  The more, the merrier.

    The basic idea is this...

     1. user specifies search criteria on columns on the dimensions
          * if no search predicates are specified for a dimension
            table, then that dimension is not included in steps 2-4 below.
              o instead it will be used in step 5 as a post-processing
                filter or join step
     2. database uses search criteria to filter down to a result set
        from each dimension
     3. database uses all the result sets from all the dimensions to
        create a consolidated set (a.k.a. star transformation)
     4. database uses all of the dimension key values to join to the
        bitmap keys in the fact table (a.k.a. star join)
     5. database uses additional filter predicates or join predicates
        against the fact table to produce final results


    The entire reason for having bitmap indexes on all of the foreign
    keys of the fact table is for that big operation on step 4 to
    work.  That operation in step 4 works well if the consolidated set
    created in step 3 is small, and that's only going to happen if
    filtering happens on the dimensions.

    The filtering in step 5 is merely an after-effect, ideally most of
    the work of joining all the tables was already performed in step
    4, and step 5 is running against a really small final result set.

    The upshot is the more well-filtered dimensions that are involved
    in the star transformation, the more efficient is the star join. 
    Most (if not all) filtering predicates in star transformations
    should be found on columns in the dimensions.  Ideally there are
    no filter predicates against the fact table, but if there are,
    then we have to consider whether they should be converted into
    dimensions or left as "conformed dimensions".

    Are star transformations important?  Yes, if you are attempting to
    query a dimensional data model.

    There is more about star transformations HERE
    
<https://docs.oracle.com/en/database/oracle/oracle-database/19/tgsql/query-transformations.html#GUID-76B6B58B-24B0-4DFE-AC1B-CFAC4D93C55A>.
 
    Also, the STAR_TRANSFORMATION_ENABLED
    
<https://docs.oracle.com/en/database/oracle/oracle-database/19/refrn/STAR_TRANSFORMATION_ENABLED.html#GUID-B2E6145D-164A-4453-9839-0F6E6442A922>
    parameter should not be FALSE, and there are additional guidelines
    to post for the Oracle optimizer to initiate star transformations
    more easily, such as hierarchies specified with the CREATE
    HIERARCHY
    
<https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/CREATE-HIERARCHY.html#GUID-73925877-992B-4624-AA28-8F565E9C3F0D>
    command, enabled PK constraints on the dimension tables, and
    ENABLED or RELY enforced foreign-key constraints on the fact table.

    Finally, one fun feature for star transformations is when Oracle
    database materializes the consolidated set mentioned in step 3
    above, which is called a bitmap-join index. Unfortunately, there
    are no specific documentation pages in the Oracle doc-set devoted
    to bitmap-join indexes, but there is a good discussion of DW
    optimization HERE
    
<https://docs.oracle.com/en/database/oracle/oracle-database/19/dwhsg/data-warehouse-optimizations-techniques.html#GUID-79C29A60-3477-448D-835D-2940D060D050>as
    well Tim Hall's more detailed blog post HERE
    <https://oracle-base.com/articles/9i/bitmap-join-indexes>.  If it
    is possible to create a bitmap-join index on a fact table, then it
    is almost certain that any query using those dimensional search
    predicates will achieve a star transformation.

    Hope this helps,

    -Tim

    On 7/28/2021 6:41 AM, Chris Taylor wrote:
    This is a sanity check question as I'm helping out with a data
    model design and indexing strategy with our ETL team. None of us
    are what I'd call real strong on data warehouse design.

    Here's the setup and the question:

    Setup:
    We have a DIMENSION table partitioned by SITE (Client) and we
    have a FACT table partitioned by SITE (Client).

    The PK on the DIMENSION is a combo of SITE+KEY_FIELD
    The join between the DIMENSION and the FACT is on SITE+KEY_FIELD

    All that is fine and works great with bitmap indexes on the FACT etc.

    *HOWEVER* here is the problem:

    In one particular query, we're applying an /additional FILTER
    /condition to the DIMENSION table (instead of the FACT) after the
    join which causes a bit of performance penalty as the rows are
    found between the DIMENSION and FACT but then the filter is then
    applied to the rows from the DIMENSION table (which is very large
    table also).

    My theory is that this filter clause (the columns used by the
    filter) should actually also be in the FACT and apply the filter
    there with a BITMAP index instead of applying it to the
    DIMENSION.   RIght now the condition (columns) in the DIMENSION
    that are checked for a value do not exist in the FACT table..

    The join & condition is something like this:

    *|  35 |     PARTITION LIST ITERATOR                    |        
                        |      0 |      1 |     |     1   (0)|
    00:00:01 |   KEY |   KEY |  Q1,01 | PCWP
    |* 36 |                  INDEX UNIQUE SCAN                 |
    DIMENSION_TABLE_NDX2      |  0 |      1 |       |     1   (0)|
    00:00:01 | KEY |   KEY |  Q1,01 | PCWP
    |* 37 |                TABLE ACCESS BY LOCAL INDEX ROWID        
      | DIMENSION_TABLE


      36 - access("FACT_TABLE"."SITE_ID"="DIMENSION_TABLE"."SITE_ID"
    AND "FACT_TABLE"."KEY_FIELD"="DIMENSION_TABLE"."KEY_FIELD")
      37 - filter((CASE "FIELD1" WHEN 'I' THEN 'Inpatient' WHEN 'O'
    THEN 'Outpatient' END))*

    So we have the join between the FACT and DIMENSION on SITE_ID &
    KEY_FIELD

    But then we have an additional FILTER being applied to the
    DIMENSION table.

    Currently this doesn't kill performance, but definitely adds
    about 25% execution time (6 secs out of 24) and I'm concerned
    this will get worse as data volumes grow.

    So the question is , should CASE statement filter really get
    applied to the FACT table and replicate that column (or some
    version of it) to the FACT table instead?   (Generally as a rule
    of thumb I mean)

    We can add the CASE statement to the PK index on the
    DIMENSION_TABLE as well, and thats what the ETL team is
    suggesting, but that seems like the wrong way to go.

    Chris






Other related posts: