Re: need help designing a heavily concurrent reporting engine

  • From: Tim Gorman <tim@xxxxxxxxx>
  • To: maxim.pakhutkin@xxxxxxxxxxxxxx, ORACLE-L <oracle-l@xxxxxxxxxxxxx>
  • Date: Tue, 04 Sep 2012 15:55:11 -0600

Maxim,
For ad-hoc activity, your best bet for a data model is dimensional 
<http://en.wikipedia.org/wiki/Dimensional_modeling> (a.k.a. star schema) 
with facts and dimensions.
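
For illustration only, a minimal sketch with invented names -- one 
dimension and one fact, with the fact already range-partitioned by date 
per the next paragraph:

CREATE TABLE customer_dim
( customer_id   NUMBER        PRIMARY KEY
, customer_name VARCHAR2(100)
, region        VARCHAR2(30)
);

CREATE TABLE sales_fact
( sale_date   DATE   NOT NULL      -- the partitioning key
, customer_id NUMBER NOT NULL      -- joins to CUSTOMER_DIM
, quantity    NUMBER
, amount      NUMBER
)
PARTITION BY RANGE (sale_date)
( PARTITION p2012_09 VALUES LESS THAN (DATE '2012-10-01')
, PARTITION p2012_10 VALUES LESS THAN (DATE '2012-11-01')
);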

Range-partition the fact table(s) by date.  Use the EXCHANGE-PARTITION 
load technique described in this white paper 
<http://www.evdbt.com/TGorman%20TD2009%20DWScale.doc> and presentation 
<http://www.evdbt.com/OOW09%20DWScaling%20TGorman%2020091013.ppt>, 
allowing you to use INSERT+APPEND, NOLOGGING if necessary, and TRUNCATE 
to remove data.  Mostly direct-path loads for table data and index 
creation, very little "conventional" DML -- everything you're wishing 
for.  :-)
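
A sketch of one load cycle under that technique, reusing the invented 
names above (the paper and slides have the real details):

-- 1. Build the new data in a standalone scratch table, direct-path
CREATE TABLE sales_fact_scratch NOLOGGING
  AS SELECT * FROM sales_fact WHERE 1 = 0;

INSERT /*+ APPEND */ INTO sales_fact_scratch
SELECT sale_date, customer_id, quantity, amount
  FROM sales_stage;                 -- hypothetical staging table
COMMIT;

-- 2. Index and analyze the scratch table while it is still standalone
--    (the index must match a LOCAL index on SALES_FACT for the exchange)
CREATE INDEX sales_fact_scratch_i1
  ON sales_fact_scratch (customer_id) NOLOGGING;
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'SALES_FACT_SCRATCH')

-- 3. Swap it in: a data-dictionary operation, nearly instantaneous
ALTER TABLE sales_fact EXCHANGE PARTITION p2012_09
  WITH TABLE sales_fact_scratch
  INCLUDING INDEXES WITHOUT VALIDATION;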

If you have any slowly-changing dimensions, first load the "type 2" 
(time-variant) dimension table(s) using the EXCHANGE-PARTITION load 
technique, then re-load the "type 1" (point-in-time) dimension table(s) 
using the exchange-partition technique illustrated on slides 25-29 of 
the presentation 
<http://www.evdbt.com/OOW09%20DWScaling%20TGorman%2020091013.ppt>. 
Again, no need for expensive "conventional" operations; large-volume 
data manipulation is faster with INSERT+APPEND in parallel.  No need for 
any of this if the dimension is not slowly-changing, by the way.
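
To illustrate the type-1 re-load -- my rough paraphrase, not the exact 
steps on those slides, and again with invented names.  Assume the 
type-1 CUSTOMER_DIM was built with a single dummy partition so that the 
whole table can be swapped by exchange:

CREATE TABLE customer_dim_scratch NOLOGGING
  AS SELECT * FROM customer_dim WHERE 1 = 0;

-- rebuild the current image of each key from the type-2 history
-- table, here called CUSTOMER_DIM_HIST (a hypothetical name)
INSERT /*+ APPEND */ INTO customer_dim_scratch
SELECT customer_id, customer_name, region
  FROM ( SELECT customer_id, customer_name, region,
                ROW_NUMBER() OVER (PARTITION BY customer_id
                                   ORDER BY effective_date DESC) AS rn
           FROM customer_dim_hist )
 WHERE rn = 1;                      -- keep only the current row per key
COMMIT;

ALTER TABLE customer_dim EXCHANGE PARTITION pmax
  WITH TABLE customer_dim_scratch WITHOUT VALIDATION;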

That's it in a nutshell.  I've been using this stuff for ~14 years and 
talking to the world about it for over 8 years now.  It's not easy to 
do, but it rocks.

------------------------------------------------------------------------

By the way, I'll be doing my 2-day seminar on this topic (a.k.a. 
"/Scaling To Infinity:  Partitioning Data Warehouses on Oracle 
Database/") for Oracle University in Munich Germany on 22-23 October 
<http://education.oracle.com/webreg/ShoppingCart?arg=1&doccode=D70363_1657076&av_id=&org_id=34&territory_code=DE&lang=D&fee=&event_id=3253864&p_return_url=http://education.oracle.com/pls/web_prod-plq-dad/searchResult.searchEngineParser?p_search_keyword=Gorman&p_org_id=1001&p_lang=US&p_country=34,41&p_search_location=&p_search_category_id=&p_search_format=&p_search_start_month=OCT&p_search_start_year=2012&p_search_end_month=NOV&p_search_end_year=2012&p_page_number=1&p_search_from_date=&p_search_to_date=&p_forceExpandedSearch=N>
 
and in Utrecht, Netherlands on 25-26 October 
<http://education.oracle.com/webreg/ShoppingCart?arg=1&doccode=D70363_1657076&av_id=&org_id=41&territory_code=NL&lang=NL&fee=&event_id=3253866&p_return_url=http://education.oracle.com/pls/web_prod-plq-dad/searchResult.searchEngineParser?p_search_keyword=Gorman&p_org_id=1001&p_lang=US&p_country=34,41&p_search_location=&p_search_category_id=&p_search_format=&p_search_start_month=OCT&p_search_start_year=2012&p_search_end_month=NOV&p_search_end_year=2012&p_page_number=1&p_search_from_date=&p_search_to_date=&p_forceExpandedSearch=N>.
  
I'll be doing my 60-90 minute version of the 2-day seminar at the 
Slovenian Oracle Users Group in Ljubljana, Slovenia on 15-16 October 
<http://www.sioug.si/index.php?option=com_content&view=article&id=201&Itemid=180&lang=en>.

I'm glad to do one or more of the 60-90 minute presentations over a 
GoToMeeting webinar as well, if anyone is interested.

Please let me know if you have any questions.

Hope this helps...

-- 
Tim Gorman
consultant -> Evergreen Database Technologies, Inc.
postal     => PO Box 352151, Westminster CO 80035
website    => http://www.EvDBT.com/
email      => Tim@xxxxxxxxx
mobile     => +1-303-885-4526
fax        => +1-303-484-3608
Lost Data? => http://www.ora600.be/ for info about DUDE...





On 9/4/2012 3:10 PM, Pakhutkin, Maxim (Max) wrote:
> Hello,
>
> I'm looking for some advice on designing a process for generating reports. 
> The reports are ad-hoc, i.e. it's safe to assume that the reports will be 
> generated continuously and concurrently, with no start and finish point or 
> quiet time. This essentially data-warehouse-type process needs to live in a 
> heavily OLTP-type system, so we hope to reduce datafile and redo I/O overhead 
> as much as possible.
>
> The direction we've been going so far:
> 1) We are hoping to run multiple database sessions each working on its own 
> set of data at a time (this corresponds to one report). This way we can add 
> more processes if we need to speed things up (plenty of CPU and storage on 
> the system). These sessions are called renderers (they render reports) and 
> each will use its own render_key value to access its own data.
> 2) The processes would copy some detail data for each report into their own 
> tables, do various joins of this detail data with other data to generate data 
> for various sections of the report.
> 3) The detail data does not need to be restorable, so we are hoping to use 
> INSERT+APPEND to put it in, TRUNCATE to get rid of it, and NOLOGGING on the 
> tables/tablespaces that store it, to avoid the I/O overhead of using plain 
> INSERT/DELETE.
> 4) We don't want to use GLOBAL temporary tables, if possible, because in 
> addition to the render process accessing the data for its report, the 
> "orchestrator" process will need to access the same data. Also, we want to be 
> able to easily collect optimizer stats on the data rather than set the stats 
> manually, which may be error-prone and is a support overhead.
> 5) After each set of data is inserted, stats would be collected on that data 
> (sketched below).
>
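> Roughly, per renderer -- table, column, and bind names below are just 
> placeholders:
>
> INSERT /*+ APPEND */ INTO detail_x1    -- table created NOLOGGING
> SELECT :render_key, s.*
>   FROM source_detail s
>  WHERE s.report_id = :report_id;
> COMMIT;
>
> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'DETAIL_X1')
>
> -- ... joins against detail_x1 to render the report sections ...
>
> TRUNCATE TABLE detail_x1;
>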
> The goal is to get maximum speed out of it, but also hopefully to keep code 
> fairly reasonably maintainable.
>
> At first I thought of partitioned tables, with each process working on its own 
> partition. But I found out that INSERT+APPEND and TRUNCATE PARTITION don't 
> play well with concurrent access to the same table (even though different 
> partitions of the same table are being accessed). Collecting stats also would 
> take out latches on the whole table in the library cache, despite us 
> collecting stats only at the partition level. For all of the above we get 
> library cache waits of various kinds, slowing things down significantly.
>
> Then we thought of using separate tables, but joining them into a union_all 
> view (for ease of code maintenance), that way simulating a partitioned table. 
> Each separate table would still have one column filled with the same value, 
> dedicated to identifying which render process would work on it.
>
> CREATE TABLE X1 (
>   render_key NUMBER,  -- contains the same value for all rows in the
>                       -- table; the appropriate bill rendering process
>                       -- supplies this value in queries to get only its
>                       -- own data from the union-all view
>   ...
> );
>
> CREATE TABLE X2 (
>   render_key NUMBER,
>   ...
> );
> Etc.
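>
> The union-all view itself would be something like:
>
> CREATE OR REPLACE VIEW x_all AS
> SELECT * FROM x1
> UNION ALL
> SELECT * FROM x2;   -- one SELECT branch per render table
>
> with each render process querying
>
> SELECT * FROM x_all WHERE render_key = :render_key;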
>
> Now we can INSERT+APPEND, but TRUNCATE doesn't work too well, as some 
> processes accessing the view get an "object no longer exists" error due 
> to other processes truncating their tables that are part of the same 
> view. (This may be partially because Oracle tries to use a very 
> selective index that exists on each of these tables first and then 
> filter by the render_key identifier, instead of reading the render_key 
> column histograms first, realizing that there is only one table where 
> that column will have the right value, and not touching the other 
> tables, as I hoped it would. But I may be wrong and it would do that 
> even without the index.)
>
> I can't think of any other approaches that would get us what we want (speed + 
> ease of code maintenance) and am ready to recommend to the developers that 
> each process work in its own schema with its own tables (so code and tables 
> would be duplicated everywhere, making maintenance harder).
>
> Can anyone think of other architectural alternatives that could give us what 
> we want?
>
> Thanks in advance.
>
> --
> Max Pakhutkin
>


--
//www.freelists.org/webpage/oracle-l

