Re: need help designing a heavily concurrent reporting engine

  • From: rjamya <rjamya@xxxxxxxxx>
  • To: maxim.pakhutkin@xxxxxxxxxxxxxx
  • Date: Wed, 5 Sep 2012 07:39:24 -0400

Here is an idea,

1. Let your N renderer processes truncate, load, and transform their own
pieces of data into renderer-specific tables (one per renderer; no one
else needs to access it).
2. Once a renderer has its data ready for the orchestrator process, let
it insert+append into a partitioned table (partitioned any way you
want), commit, and be done with it.
3. Your orchestrator process accesses only this one table (mentioned in
2). It can collect stats on the necessary partitions, generate reports,
delete old data, and exit (see the sketch below).
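
A rough SQL sketch of that flow; all object names are invented for
illustration:

-- 1. renderer N refreshes its private work table
--    (renderer_3_work created once up front, NOLOGGING)
TRUNCATE TABLE renderer_3_work;
INSERT /*+ APPEND */ INTO renderer_3_work
  SELECT item_id, amount FROM source_detail WHERE report_id = :rid;
COMMIT;

-- 2. hand the finished rows to the shared partitioned table
INSERT /*+ APPEND */ INTO central_results
  SELECT :render_key, item_id, amount FROM renderer_3_work;
COMMIT;

-- 3. orchestrator gathers stats on just the relevant partition,
--    generates its reports, then cleans up
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'CENTRAL_RESULTS', partname => 'P_RK3', granularity => 'PARTITION')
DELETE FROM central_results WHERE render_key = :render_key;
COMMIT;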

In other words: truncating the renderer-specific tables will not impact
any other process, since each renderer works in its own space, and stats
collection on those tables affects only their owner. The orchestrator
process accesses a central table that everyone only inserts into, so as
long as the renderer processes keep doing that and the orchestrator
eventually cleans up the data in the central table, they all remain
independent.

Not sure if this makes a whole lot of sense but I am still catching up
on my first cup of coffee and quickly thought of this to avoid
dependency. GTTs might work with the above method as well, but if you have
continuously running processes, any DDL on GTTs (when necessary) would
be difficult.

Raj

On Tue, Sep 4, 2012 at 5:10 PM, Pakhutkin, Maxim (Max)
<maxim.pakhutkin@xxxxxxxxxxxxxx> wrote:
> Hello,
>
> I'm looking for some advice on designing a process for generating reports.
> The reports are ad hoc, i.e. it's safe to assume that reports will be
> generated continuously and concurrently, with no start or finish point and no
> quiet time. This basically data-warehouse-type process needs to live in a
> heavily OLTP-type system, so we hope to reduce datafile and redo I/O overhead
> as much as possible.
>
> The direction we've been going so far:
> 1) We are hoping to run multiple database sessions each working on its own 
> set of data at a time (this corresponds to one report). This way we can add 
> more processes if we need to speed things up (plenty of CPU and storage on 
> the system). These sessions are called renderers (they render reports) and 
> each will use its own render_key value to access its own data.
> 2) The processes would copy some detail data for each report into their own 
> tables, do various joins of this detail data with other data to generate data 
> for various sections of the report.
> 3) The detail data does not need to be restorable, so we are hoping to use 
> INSERT+APPEND to put it in, TRUNCATE to get rid of it, and NOLOGGING on the 
> tables/tablespaces that store it, to avoid the I/O overhead of using plain 
> INSERT/DELETE.
> 4) We don't want to use GLOBAL temporary tables, if possible, because in
> addition to the render process accessing the data for its report, the
> "orchestrator" process will need to access the same data. Also, we want to be
> able to easily collect optimizer stats on the data rather than having to set
> stats manually, which may be error-prone and is a support overhead.
> 5) After each set of data is inserted, stats would be collected on that data
> (see the sketch below).
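>
> A minimal sketch of the cycle in 3) and 5), with table and column names
> invented for illustration:
>
> -- NOLOGGING so direct-path inserts generate minimal redo
> CREATE TABLE render_detail_1
>   (render_key NUMBER, item_id NUMBER, amount NUMBER) NOLOGGING;
>
> -- direct-path load; a commit is required before the same session
> -- can read the table again
> INSERT /*+ APPEND */ INTO render_detail_1
>   SELECT 1, item_id, amount FROM billing_detail WHERE report_id = :rid;
> COMMIT;
>
> -- gather stats on the freshly loaded data
> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'RENDER_DETAIL_1')
>
> -- cheap cleanup compared to a fully logged DELETE
> TRUNCATE TABLE render_detail_1;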
>
> The goal is to get maximum speed out of it, but also hopefully to keep code 
> fairly reasonably maintainable.
>
> At first I thought of partitioned tables, with each process working on its
> own partition. But I found out that INSERT+APPEND and TRUNCATE PARTITION
> don't play well with concurrent access to the same table (even though
> different partitions are being accessed). Collecting stats would also take
> out latches on the whole table in the library cache, despite us collecting
> stats only at the partition level. For all of the above we get library cache
> waits of various kinds, slowing things down significantly.
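>
> Concretely, the per-partition operations were along these lines
> (hypothetical names):
>
> ALTER TABLE render_detail TRUNCATE PARTITION p_rk1;
> INSERT /*+ APPEND */ INTO render_detail PARTITION (p_rk1)
>   SELECT 1, item_id, amount FROM billing_detail WHERE report_id = :rid;
> COMMIT;
> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'RENDER_DETAIL', partname => 'P_RK1', granularity => 'PARTITION')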
>
> Then we thought of using separate tables but joining them into a union-all
> view (for ease of code maintenance), thereby simulating a partitioned table.
> Each separate table would still have one dedicated column filled with the
> same value, identifying which render process would work on it:
>
> CREATE TABLE X1 (
>   Render_key NUMBER, -- contains the same value for all rows in the table;
>                      -- the appropriate bill rendering process supplies this
>                      -- value in queries to get only its own data from the
>                      -- union-all view
>   ...
> );
>
> CREATE TABLE X2 (
>   Render_key NUMBER,
>   ...
> );
> Etc.
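>
> The union-all view tying these together would be something like this
> (view name invented):
>
> CREATE OR REPLACE VIEW render_detail_v AS
>   SELECT * FROM X1
>   UNION ALL
>   SELECT * FROM X2;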
>
> Now we can INSERT+APPEND, but TRUNCATE doesn't work too well: some processes
> accessing the view get "object no longer exists" errors when other processes
> truncate their tables that are part of the same view. (This may be partly
> because Oracle tries to use a very selective index that exists on each of
> these tables first and then filters by the render_key identifier, instead of
> reading the render_key column histograms first, realizing that only one table
> will have the right value in that column, and leaving the other tables
> untouched, as I hoped it would. But I may be wrong; it might behave this way
> even without the index.)
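>
> The access pattern we hoped the optimizer would prune on (column names
> invented):
>
> SELECT item_id, amount
>   FROM render_detail_v
>  WHERE render_key = :render_key;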
>
> I can't think of any other approaches that would get us what we want (speed +
> ease of code maintenance) and am ready to recommend to the developers that
> each process work in its own schema with its own tables (so code and tables
> would be duplicated everywhere, making them harder to maintain).
>
> Can anyone think of other architectural alternatives that could give us what 
> we want?
>
> Thanks in advance.
>
> --
> Max Pakhutkin
>
--
//www.freelists.org/webpage/oracle-l

