Re: How to design data movement

From: yudhi s <learnerdatabase99@xxxxxxxxx>
To: "Mark W. Farnham" <mwf@xxxxxxxx>
Date: Sat, 10 Jun 2023 14:25:13 +0530

Thank You mark.

If I get you correct , in such a situation the approach should be to find
out the max(Txn_ids) from all the tables , so in the above scenario it will
be '4' for table1, '2' for table2 and 3 for table3. And then take the Min
of all those txn_ids which will be txn_id '2' here. and only load the data
till the txn_ids<=2 with equi join across all the tables. And rest of
the txn_ids data should get picked up in subsequent hourly runs. Hope my
understanding is correct here.

But also even with this approach , like table2 and table3 having multiple
rows with same txn_ids , in those cases if only one row came in the first
batch but other rows for same txn_ids came in the subsequent batch, the
will be chances of missing some of the rows for some txn_ids. Is there any
way to handle such a scenario?

Regarding below point, do you mean to say , we will pick all the txn_ids
data using outer join, however in place of null we will manually put some
default values so that the length of the row will not increase when we
update the column with the exact value afterwards for catching up missing
txn_ids data?

*"IF you insist on UPDATE, find yourself an acceptable stand-in for NULL
for as many of the columns as you can that is either the maximum width of
the column or something like 80 percent of the maximum column width.
UPDATEs are still expensive, but they are less expensive if the rows don’t
grow on insert."*

Or should we wait and only run the batch in this downstream by ~1hr delay ,
means if upstream batch will push the data in stage for 9hrs , the the
downstream batch will be triggered for the 8hrs batch(i.e. giving a
buffer of 1hrs) to minimize the missing txn_ids data in any table. Will
this be the correct approach?

On Fri, Jun 9, 2023 at 4:57 PM Mark W. Farnham <mwf@xxxxxxxx> wrote:

My experience is that the folks USING the data find difficulties using
missing bits. Only rarely is it useful to include partial data. Air traffic
control  and missile defense come to mind.

IF your analysts agree with my analysis, THEN the best way I know of is to
find the maximum transaction id for which you have data in all the sources
and use that as a maximum filter on each.

This relies on no “Swiss Cheese” batches where some transaction ids are
skipped in some sources and are filled in later, in which case you can also
record the maximum transaction already processed (zero the first time
unless you have negative transaction ids), which also presumes that your
transaction ids are monotonically increasing (like time in most of our
universe.)

Or you can simply use inner joins, or build just an inner join spine
intersecting the tables pairwise until you have a spine that will work when
you pull all the non-intersection columns after building the spine.

Your proposal to build with NULLs for missing values is fraught with
problems. UPDATES are expensive, and filling in NULLs expands each row
dynamically. (IF your analysts actually prefer to include rows with
temporarily missing data, read up on Tim Gorman’s explanations of why and
how an INSERT is the fastest update if your data is even mildly batchy,
which probably fits your data arrival texture).

IF you insist on UPDATE, find yourself an acceptable stand-in for NULL for
as many of the columns as you can that is either the maximum width of the
column or something like 80 percent of the maximum column width. UPDATEs
are still expensive, but they are less expensive if the rows don’t grow on
insert.

Good luck.

People are really making decisions hourly? That seems like a madhouse to
me for MOST real world usefulness. Even when trends are important on a
short trigger, using the math for trends based on new arrivals against the
existing averages for some weighted time period is probably better than
generating a full new data set.

mwf

*From:* oracle-l-bounce@xxxxxxxxxxxxx [mailto:
oracle-l-bounce@xxxxxxxxxxxxx] *On Behalf Of *yudhi s
*Sent:* Friday, June 09, 2023 1:31 AM
*To:* Oracle L
*Subject:* How to design data movement

Hello Listers,

We have multiple transaction tables and data on those are streamed from
upstream systems using tools like kafka, glue etc. And we have an
analytical database at the destination which is consuming the data and
going to publish reports on those data. In the analytical database , the
data gets dumped in a stage schema as is and then procedures written to
pull the stage tables data transform the data and populate the main schema
tables(which are having different structures) and on them
reporting/analytical queries are getting executed. Some structures in the
main schema are one to one as its on stage but others are
joining/aggregating multiple "stage tables" data into one table in "main
schema".

Now the issue is , we are getting the batches from the source loaded to
the stage schema of the analytical database hourly once. And we are seeing,
even the tables are holding the related data(i.e. the transaction ID
column) but they are coming with different lags.Say for e.g. at any point
in time when we are trying to pull data from stage and populate main
schema(which is also going to be hourly once frequency), the stage tables
looks something as below.

Table1 has all the data/maximum number of Transaction ids but table2 and
table3 is lagging behind. Table2 is yet to be populated with the
transaction ids- 3,4 data and also Table3 is yet to be populated with
transaction id-4 data and that will mostly be populated in the stage area
on the analytical database in the next hourly batch by the upstream job.

So how should we handle this scenario?

So in the above scenario we are thinking of
1)Pick up all the required data for a specific batch, from stage, using
"outer join" in TXN_ID column from related tables first and populate main
schema tables. And this will populate all the TXN_ID's in the main schema
but with null values for attributes of table TABLE2 and TABLE3 i.e. col3,
col4, col5, col6 will be populated as NULL for the missing Txn_ids.

2)In the next hourly run of the procedures, we will check , if any of the
values of the column(col3, col4, col5,col6) showing NULL then we will
execute UPDATE queries to populate those columns of the main schema table ,
for those specific transaction IDS from Table2 and Table3.

I Wanted to understand from experts if there exists any better way to
handle such data streaming challenges?

TABLE1
TXN_ID  COL1 COL2  Batch_time
1        A     B    9hrs
2        C     D    9hrs
3        E     F    9hrs
4        G     H    9hrs

TABLE2
TXN_ID COL3 COL4   Batch_time
1       I     J    9hrs
1       K     L    9Hrs
2       M     N    9hrs

TABLE3
TXN_ID COL5 COL6  Batch_time
1       O     P   9hrs
2       Q     R   9hrs
3       S     T   9hrs
3       U     V   9hrs

Regards

Yudhi

Follow-Ups:
- Re: How to design data movement
  - From: Pap
- RE: How to design data movement
  - From: Mark W. Farnham

References:
- How to design data movement
  - From: yudhi s
- RE: How to design data movement
  - From: Mark W. Farnham

Re: How to design data movement

Other related posts: