performance problems updating dimension tables with MVs prejoined between fact and dimension

From: Marc Slemko <identd@xxxxxxxxx>
To: oracle-l@xxxxxxxxxxxxx
Date: Sun, 20 Feb 2005 00:10:24 -0800
We are having a bit of an annoying issue with a data warehouse we are
currently in the process of deploying WRT some of the fast refresh
materialized views we have.  Running 10g.

Due to the large size of some of our dimensions, we want to create
materialized views that pre-join a fact table with a dimension table
to pull one boolean attribute, for example, out of the dimension
without having to blow up the size of the materialized view by having
the actual id of the dimension table in it.

This data warehouse is loaded more or less in realtime, throughout the
day, so nearly all our materialized views are refresh on commit and
fast refresh.  We don't just add new fact table entries during the
course of a day, but also load in new dimension records as necessary.

Conceptually, this poses no problems for the rollups since the
dimension data in question is never modified after insertion.

However we are running into an implementation issue with Oracle that
is causing this to be extremely ugly.  When we do a COMMIT after
adding a batch of records to the dimension, Oracle behind the scenes
runs something like the following to update the materialized view:

MERGE INTO "MARCS"."MV_LARGE" "SNA$" USING (SELECT   /*+
  OPT_ESTIMATE(QUERY_BLOCK MAX=1000)   */ "MAS$1"."SMALL_D" "GB0",
  "DLT$0"."IS_SPECIAL" "GB1", SUM(1) "D0" FROM (SELECT  /*+ CARDINALITY(MAS$
  1000) */  CHARTOROWID("MAS$"."M_ROW$$") RID$  ,  "MAS$"."IS_SPECIAL",
  "MAS$"."LARGE_D"   FROM "MARCS"."MLOG$_LARGE" "MAS$"   WHERE
  "MAS$".SNAPTIME$$ > :1 ) "DLT$0" , (SELECT "MAS$"."ROWID" "RID$"  ,
  "MAS$"."SMALL_D", "MAS$"."LARGE_D"  FROM "MARCS"."FACT" "MAS$") AS OF
  SNAPSHOT (:2)  "MAS$1"  WHERE ("MAS$1"."LARGE_D"="DLT$0"."LARGE_D") GROUP
  BY "MAS$1"."SMALL_D","DLT$0"."IS_SPECIAL")"AV$" ON
  (SYS_OP_MAP_NONNULL("SNA$"."SMALL_D")=SYS_OP_MAP_NONNULL("AV$"."GB0") AND
  SYS_OP_MAP_NONNULL("SNA$"."IS_SPECIAL")=SYS_OP_MAP_NONNULL("AV$"."GB1"))
  WHEN MATCHED THEN UPDATE  SET "SNA$"."COUNT(*)"="SNA$"."COUNT(*)
  "+"AV$"."D0" WHEN NOT MATCHED THEN INSERT ("SNA$"."SMALL_D",
  "SNA$"."IS_SPECIAL", "SNA$"."COUNT(*)") VALUES ( "AV$"."GB0", "AV$"."GB1",
  "AV$"."D0")

Which ends up being optimized as:

Rows     Row Source Operation
-------  ---------------------------------------------------
      2  MERGE  (cr=90996 pr=90930 pw=0 time=12304762 us)
      0   PX COORDINATOR  (cr=90996 pr=90930 pw=0 time=12304749 us)
      0  VIEW  (cr=0 pr=0 pw=0 time=0 us)(Orphan Entry)
      0  HASH JOIN OUTER BUFFERED (cr=0 pr=0 pw=0 time=0 us)
      0   BUFFER SORT (cr=0 pr=0 pw=0 time=0 us)
      0    PX RECEIVE  (cr=0 pr=0 pw=0 time=0 us)
      0  VIEW  (cr=90971 pr=90930 pw=0 time=12283633 us)(Orphan Entry)
      0  SORT GROUP BY (cr=90971 pr=90930 pw=0 time=12283627 us)
      0   HASH JOIN  (cr=90971 pr=90930 pw=0 time=12283597 us)
   1000    TABLE ACCESS FULL MLOG$_LARGE (cr=23 pr=0 pw=0 time=1106 us)
1000000    TABLE ACCESS FULL FACT (cr=90948 pr=90930 pw=0 time=2006047 us)
      0   PX RECEIVE  (cr=0 pr=0 pw=0 time=0 us)
      0  PX BLOCK ITERATOR (cr=0 pr=0 pw=0 time=0 us)(Orphan Entry)
      0  MAT_VIEW ACCESS FULL MV_LARGE (cr=0 pr=0 pw=0 time=0 us)

Note the full table scan of the "fact" table.  The optimizer appears
to be making the correct decision here, since as far as it knows there
are enough rows of the materialized view that would have to be updated
based on the new records in the "large" dimension that a full table
scan of the fact table is faster ... it just isn't being told that
there really can't be any rows to update due to the foreign key
constraint and the fact rows were only added.  This becomes quite
impractical with our larger fact tables.  For example, with one
dimension we have about 65 million fact table rows with 800k distinct
dimension key values in the fact table, so you wouldn't have to
actually update too many dimension records before the proper choice
for the CBO would be to do the full table scan based on what it knows.

When we do updates in very small batches, the optimizer picks an index
lookup into the fact table instead, which is much faster since it
doesn't take long to lookup something that doesn't exist in the index.

I don't even know how I can add a hint to work around this, since the
MERGE statement is generated by Oracle.  Stored outlines don't seem an
option because Oracle dynamically includes information about
cardinality and blocks in the query text, which changes more or less
with each commit.

A related, but less important problem, is that if we update column "b"
in a row in a dimension table that is prejoined to the fact table in a
materialized view using column "a" of the dimension, Oracle still
thinks it has to update all the rows in the materialized views related
to the modified rows, even though column "b" isn't even in the
materialized view log, let alone used in the materialized view.

Has anyone run into this before or have any suggestions or other
approaches that could accomplish the same thing?  We really don't want
to have to break the dimension out into multiple dimensions.  We do
have a TAR open, but have not been getting very useful results so far
and are on a tight timeline.

It can be a bit hard to explain exactly what is going on here, a full
example, with schema, population scripts, sample sqltrace / tkprof
output, etc. is at http://znep.com/~marcs/tmp/mvtest.tgz if anyone
thinks they could offer some insight, a README.txt in that file
contains details of the  contents.

Thanks for any comments.  I'll post a followup if I figure anything out.
--
//www.freelists.org/webpage/oracle-l
performance problems updating dimension tables with MVs prejoined between fact and dimension

Other related posts: