Re: Full table scan -- uniform extent allocation - extent allocation map issue?

From: Vishnu Potukanuma <vishnupotukanuma@xxxxxxxxx>
To: jonathan@xxxxxxxxxxxxxxxxxx
Date: Wed, 27 Nov 2019 02:00:35 +0530

Hi Jonathan, thanks for your timely response.

I am just trying to see the entire picture... of what it takes to perform a
full table scan... every possibility and every path oracle can take..

My understanding:
Oracle knows that highwater mark information by reading the L3BMB or
segment header initially to get the low - high water mark, high - high
water mark information and also regarding extent map so that it knows which
extents to read from. Oracle is also aware that all blocks below the low -
high water mark - whether they are used or not or free, it should read (in
other words all formatted) and as the blocks between Low - HWM and high HWM
are either formatted or unformatted (may or may not contain data). so it
selectively reads only those formatted block ranges between low-HWM and
high-HWM.

So if A, B, C, D, E extents are allocated to the table E is the last extent
so E usually contains the high water mark, unless users manually allocated
extents.. So while performing a full table scan, oracle has to read all the
blocks from the table (formatted). If the extent A is stored at say near
31gb position and extent stored at 1gb position, the disk spindles whether
they are based out of RAID or single disks have to read the extend from A
position go to position B which is at 1gb position and vice versa depending
on the location..

My idea was as Oracle knows that E is the last extent which contains the
high water mark,
if it read A, B, C, D the order in which they are stored in a data file,
the disk spindle has to move only along one path over the surface of the
disk and finally read the last extent this goes same even RAID volumes. So
that most of the IO seek time can be minimised  I mean this is only for the
tables in the tablespaces with uniform extent allocation, as the extent
sizes grow with auto allocate, this issue i was talking about diminishes.
The only reason why I brought this up is, *although negligible as you
indicated in most cases,* this is rather an interesting behaviour to note,
but when the disks are busy a large random read (pread64 which reads 1mb
chunk in one IO operation) may affect other sessions as well.

Now cases when this issue can occur (in OLTP environments) i mean this can
grow over the years, not immediate.
1. Over time rebuild indexes on a tables, as it is mandatory in few cases
due to its dependency on ROWID.
2. Over time use shrinking operations on indexes or tables to free up
unused space or bloat which can occur.. For examples, tables holding orders
that are passing through, and we can't use a temporary table in this case
as restarting the database we will loose all orders, days when the orders
are high and days when it is low.
3. Users may rarely recreate tables to mitigate migrated rows.
4. archive old partitions and drop them (partitioning). etc

Same goes for OLAP:
1. Data is loaded initially into a table or segment, espeically if the data
is coming from different sources, here temporary segments also fail as they
are not retained post restart.
2. Here also old partitions archived to tape and dropped as in some
environments compliance requires us to have only 5 or 10 years worth of
data and anything older than that is archived to tape.
3. Movement of partitions between Tier-1 and Tier-2 storage as it gets old.

The idea about scanning the table in the reserve order makes sense but
there are lot of variables or cases or alternate path that can lead to
loads of possibilites. Regarding the CR processing part and reverse order:

Jonathan please correct me if am wrong here:

1. Reserve order is beneficial only in the cases when most of the activity
is concentrated in the last extents or latest extents, which can be inserts
updates or deletes, or application works with only latest data and the size
of the table is very large to mitigate the effort required for CR
processing. but this comes with two possibilities. when we are talking
about full table scans.
     a. use partitioning in this case, and modify the application to be
partition aware, or periodically archiving the data - which may also
require modifying the application.... and this ends up two possibilites.
              * mercy of the application team, whether they are willing to
make any changes to the code.
              * whether the application is third party in which case we
don't have any control, and they call it best practice.
     b. Size of the buffer cache and table - and this raises four
possibilities yet again. and can end in various paths....
              * if the size of the buffer cache is large, and the buffers
that are modified are hot enough not to get flushed from the disk, as a
full table scan can go with *db file scattered read (*depending on its size
and other variables like _small_table_threshold, mid_table_threshold,
rarely even beyond mid_table_threshold until VLOT > don't remember 4 or 5
times buffer cache). in this case two possibilities again
                        * *buffers replaced from the cache to accomodate
the buffers coming in doesn't belong to this table* and the buffer cache is
not that busy, if the buffer cache is busy this can add two more wait free
buffer waits or write complete waits, in both cases the either DBWn or both
DBWn and LGWR has to work again, and this is basically at the mercy of on
the size on the buffer cache and the number of cpus as the number working
sets are equal to some constant 2 (dont remember) * cpu_count, if the size
of the buffer cache is too large and the number of cpus is less the length
of the working sets are usually larger and if there are any dirty buffers
in the working set they are added to the dirty queue, and during this time
if the cached blocks from the current table in the last extents does not
get modified we are good, if they are modified, then CR processing
required, this again depends on how many times the buffer is modified as
the SCN is maintained at the block level, and it has to go through the undo
table entries to create a CR copy of the block and in order to make a CR
copy of the block this again requires a free buffer, as this can again add
two more waits free buffer waits or write complete waits.
                         * *buffers replaced from the cache belong to this
table*, in which case these blocks are flushed down to disk, and have to be
reread again and this again raises two more possibilities and these two
possibilities list like above two more possibilities.
                * the table size is large By default before a full table
scan (provided a session chooses* direct path read)*, oracle performs a
full object checkpoint which makes sure the on disk version of blocks are
consistent till that point and in OLTP environments, as the transactions
are usually of very short duration, the number of buffers with uncommitted
changes will be less at the precise point at the time of select (diring
object level checkpoint) will be less and as dbwr process uses db file
parallel write wait event which inturn makes use of async IO to flush the
contents, it is very aggressive in pushing the blocks or writing the blocks
down to the disk > 1k blocks in one IO operation. example attachment output
of strace.  and as most of the work is done before there is less chance for
CR image fabrication at a later point in time, and here there exists two
alternate paths of execution.
                          *    no further blocks written so no issues,
continue with the full table scan.
                           *   what if the dirty blocks with committed or
uncommited changes gets written to the disk, by dbwr due to either
checkpoint or buffer cache flush or space pressure in the buffer cache to
bring new blocks in? in this case the CR processing is mandatory.

            c.  if we have a hash partitioned table / index or a reverse
key index, the data blocks may be placed partition/block and with ASSM and
multiple sessions, data can be anywhere. and in these cases reverse order
scan may not yield any performance benefits as is the case with the general
scan.

now that i am writing this mail, this came to my mind...
what precisely happens when the full table scan starts and a table is large
and during this time due to space pressure, an extent gets allocated, and
possibility of a row migration migrating a row which is not read yet into
the new extent...

Thanks,
Vishnu

On Tue, Nov 26, 2019 at 10:38 PM Jonathan Lewis <jonathan@xxxxxxxxxxxxxxxxxx>
wrote:

No longer any need to send a full test case.  Your observation has been
known for a very long time and doesn't need to be demonstrated again.

First thought, of course, is that an application should not be dropping
and creating segments at all (let alone frequently). If this type of
processing appears to be needed then the requirement should be re-examined
with global temporary tables in mind.

Second, if one is dropping and creating segments frequently they should be
assigned to a tablespace that is (a) uniform extent sized at 1MB (or more
if the O/S allows for a larger read) and (b) not used by permanent
objects.   Feature (a) ensures that Oracle has the best chance of reading
the segment as efficiently as possible and once the tablescan is doing the
largest read the various components of the O/S will handle it's
over-optimistic to assume that changing the order in which extents are
accessed will make any significant difference in a multi-user system.  If
(some of) the segments you are dropping and recreating are guaranteed to be
very small and you really don't want to "waste" 1MB per segment then assign
small objects to a tablespace with a small uniform extent size.

A significant problem, though, with having a good idea is that it's very
difficult to think of all the ramifications and handle all the peripheral
cases.  For example:   "shrink space compact" deletes rows from the table
backwards from the last block of the last extent, inserting them forwards
from the first block of the first extent - should the shrink code be
modified to use the L1 bitmaps to decide where to delete from and where to
insert to ? What impact would that have on the potential for dropping
extents.

Consider an idea that sounds much more useful: the idea that you should be
able to do a tablescan "backwards" has been around for (literally) years  -
after all, the most recent, hence most interesting, data will be in the
last extent(s) of the table - it's a feature that hasn't been made
(publicly - but see
https://jonathanlewis.wordpress.com/2016/03/14/quiz-2/ ;) available
despite the fact that from time to time people complain about performance
problems because the end of a table has been changing a lot while their
tablescan has been reading the unchanging start of the table.  (And that's
a good reason why reading by extent id (in reverse order) would be nicer
than reading by L1 address.)

Regards
Jonathan Lewis

________________________________________
From: Vishnu Potukanuma <vishnupotukanuma@xxxxxxxxx>
Sent: 26 November 2019 16:30
To: Jonathan Lewis
Cc: Oracle L
Subject: Re: Full table scan -- uniform extent allocation - extent
allocation map issue?

Hi Jonathan,

The trace file indicate for autoallocate - yes.. i must have pasted the
old one when i was testing auto allocate (in auto allocate this issue
diminishes as the extent sizes become large)

Oracle allocates extents to a segment at any allocation in the data file
(uniform size) when the demand for space arises which is logical thing but
as oracle reads extents in the order they are allocated during the full
table scans rather than where they are stored, the IO becomes random rather
than sequential given time when extent sizes are low and the workload
mainly contains frequent segment allocation and deallocation and that if
other sessions doing index range scans saturate IOPS at the storage layer
especially HDDs as these are random in nature, reduces the overall
throughput of the full table scan...

I mean when the full table scans are slow, usually we end up looking at
various things, but could be a factor in very rare cases (OLTP) and could
be common in warehouses where segments are frequently dropped or created,
and this random extent allocation can grow right under our nose and we
never detect it. There are various other scenarios, edge cases i am looking
at, but i am really interested as to why a session blindly reads the
extents in order they are allocated rather than in the order they are
stored based on L1DBA.

Please give me sometime (have to get dinner before its too late), i will
send the complete test case with which this thing can be replicated.

Thanks,
Vishnu

On Tue, Nov 26, 2019 at 8:58 PM Jonathan Lewis <
jonathan@xxxxxxxxxxxxxxxxxx<mailto:jonathan@xxxxxxxxxxxxxxxxxx>> wrote:
insert into table10 select * From table1;
drop table1;

--- create a tablespace with 350mb in size and uniform extent allocation,

The extent information you've supplied shows us that you've created the
final table in a tablespace using autoallocate - the early extents are
64KB, the next 1MB, the last few you've shown are 8MB.

If you want to supply a test case that has a chance of reproducing then
you need to tell us WHAT size uniform size, preferably giving us a script
that allows us change a couple of names (tablespace / file) and create the
tablespace.

-- create 10 tables with 30 mb in size.

Needs one sample Create table script - did you create it with "initial
30M"  (obviously not); did you create it with "create as select ...",  did
you do anything odd with pctfree; how did you populate it.  Do you need the
tables to allocate 30MB, or should the highwater mark be pushed to 30MB as
well.

-- insert into table10 select * From table1;

That's probably going to work - because we'd probably assume you wanted
all 10 tables to look alike.

-- drop table1;

That should work, but do you have your recyclebin enabled - that may
affect where the next insert puts the data (especially if some doesn't
follow your instruction that the tablespace should be 350MB).

-- similarly perform the same randomly

If my results don't match your results, might that be because your results
are highlighting a flaw in your system, or because I didn't choose the same
random order as you did ?  Don't supply a "random" test if you want to
demonstrate a reproducible anomaly. Tell us the exact order you used to
insert and drop.

-- interestingly the fix is to read the extents one after the other based
on the L1 DBA rather than the order in which they are allocated to the
segment

Are you saying you think this is what Oracle does, or what you think
Oracle should do ?

Regards
Jonathan Lewis

________________________________________
From: oracle-l-bounce@xxxxxxxxxxxxx<mailto:oracle-l-bounce@xxxxxxxxxxxxx>
<oracle-l-bounce@xxxxxxxxxxxxx<mailto:oracle-l-bounce@xxxxxxxxxxxxx>> on
behalf of Vishnu Potukanuma <vishnupotukanuma@xxxxxxxxx<mailto:
vishnupotukanuma@xxxxxxxxx>>
Sent: 26 November 2019 14:12
To: Oracle L
Subject: Full table scan -- uniform extent allocation - extent allocation
map issue?

Hi,

I have observed an interesting behaviour today, while working on other
aspects and my hunch appears to be true ,,, I will start by saying that
there are too many variables and scenarios where things can be optimal but
i am talking about the worst case scenario where the segments are
frequently created and dropped in a tablespace.

This works by exploiting certain aspects of the database such as how
oracle maintains extent map in the segment header and how the extent
allocation is done or how the full table scan reads the segment. Inherently
in this case there is nothing we can do as there are too many variables.
The impact may be small in terms of the overall response times but this
typically reduces the lifespan of the disks as the disk spindles ends up
moving from higher to lower tracks and vice versa over and over again...
instead of just in a simple arc from higher tracks to lower tracks. in
worst case this could add delay when the segment is very large and if the
extent sizes are small (this probability of this occuring is very less but
there is still a chance).

when it comes to oracle while performing full table scans, the process
starts by reading the extent map and starts reading extents one by one
until the last extent or high water mark is reached... but what if the
extents allocated to the segment are stored in random areas just like
clustering_factor? interestingly the fix is to read the extents one after
the other based on the L1 DBA rather than the order in which they are
allocated to the segment, in most cases, oracle does a pretty good job
allocating extents close to each other to the segment, but things can get
messy.

this is the simplest way to explain and this goes gradually from here and
we can make however we like it.
create a tablespace with 350mb in size and uniform extent allocation,
create 10 tables with 30 mb in size.
insert into table10 select * From table1;
drop table1;
similarly perform the same randomly so that the final end result is only
one table. here we are just simulating a case where the subsequent extents
allocated to the segment are on different regions randomly in a datafile,
thus their locations are on the disk as well (which could be or could not
be based on several factors such as how fragmented the disks are etc, i
wont go in that space as it adds too much complexity).. here we perform a
full table scan, oracle reads the segment header to read the extent map and
reads the extents one by one...

Notice the block numbers in the traces.
WAIT #139711646963136: nam='db file sequential read' ela= 612 file#=5
block#=71810 blocks=1 obj#=75252 tim=248445129880
WAIT #139711646963136: nam='db file scattered read' ela= 763 file#=5
block#=71811 blocks=5 obj#=75252 tim=248445130757
WAIT #139711646963136: nam='db file scattered read' ela= 771 file#=5
block#=71816 blocks=8 obj#=75252 tim=248445131732
....
....
WAIT #139711646963136: nam='db file scattered read' ela= 1941 file#=5
block#=44736 blocks=64 obj#=75252 tim=248445270874
WAIT #139711646963136: nam='db file scattered read' ela= 1788 file#=5
block#=44802 blocks=62 obj#=75252 tim=248445273696
WAIT #139711646963136: nam='db file scattered read' ela= 1902 file#=5
block#=44864 blocks=64 obj#=75252 tim=248445276573
WAIT #139711646963136: nam='db file scattered read' ela= 1756 file#=5
block#=44930 blocks=62 obj#=75252 tim=248445279335
....

WAIT #139711646963136: nam='db file scattered read' ela= 3078 file#=5
block#=132 blocks=128 obj#=75252 tim=248445887737
WAIT #139711646963136: nam='db file scattered read' ela= 3045 file#=5
block#=260 blocks=124 obj#=75252 tim=248445893065
WAIT #139711646963136: nam='db file scattered read' ela= 3399 file#=5
block#=384 blocks=128 obj#=75252 tim=248445898641
....
...
WAIT #139711646963136: nam='db file scattered read' ela= 12269 file#=5
block#=97540 blocks=124 obj#=75252 tim=248446182677
WAIT #139711646963136: nam='db file scattered read' ela= 17440 file#=5
block#=97664 blocks=128 obj#=75252 tim=248446202405
WAIT #139711646963136: nam='db file scattered read' ela= 3282 file#=5
block#=97792 blocks=128 obj#=75252 tim=248446208027
...
...
WAIT #139711646963136: nam='db file scattered read' ela= 3279 file#=5
block#=61058 blocks=128 obj#=75252 tim=248446281696
WAIT #139711646963136: nam='db file scattered read' ela= 3115 file#=5
block#=61186 blocks=126 obj#=75252 tim=248446287141
WAIT #139711646963136: nam='db file scattered read' ela= 2990 file#=5
block#=61312 blocks=128 obj#=75252 tim=248446292359
WAIT #139711646963136: nam='db file scattered read' ela= 3109 file#=5
block#=61440 blocks=128 obj#=75252 tim=248446297711
....
...

WAIT #139711646963136: nam='db file scattered read' ela= 3125 file#=5
block#=7812 blocks=128 obj#=75252 tim=248446303109
WAIT #139711646963136: nam='db file scattered read' ela= 3246 file#=5
block#=7940 blocks=124 obj#=75252 tim=248446309728
WAIT #139711646963136: nam='db file scattered read' ela= 3180 file#=5
block#=8064 blocks=128 obj#=75252 tim=248446315169

the L1 DBA extent map:
  Auxillary Map
  --------------------------------------------------------
   Extent 0     :  L1 dba:  0x01411880 Data dba:  0x01411883
   Extent 1     :  L1 dba:  0x01411880 Data dba:  0x01411888
   ....
   Extent 35    :  L1 dba:  0x0140ae80 Data dba:  0x0140ae82
   Extent 36    :  L1 dba:  0x0140af00 Data dba:  0x0140af02
   Extent 37    :  L1 dba:  0x0140af80 Data dba:  0x0140af82
   ...
   Extent 38    :  L1 dba:  0x0140b000 Data dba:  0x0140b002
   Extent 39    :  L1 dba:  0x0140b080 Data dba:  0x0140b082
   Extent 40    :  L1 dba:  0x0140b100 Data dba:  0x0140b102
   Extent 41    :  L1 dba:  0x0140b180 Data dba:  0x0140b182
   Extent 42    :  L1 dba:  0x0140b200 Data dba:  0x0140b202
   Extent 43    :  L1 dba:  0x0140b280 Data dba:  0x0140b282
   ...
   ...
   Extent 92    :  L1 dba:  0x01417c80 Data dba:  0x01417c84
   Extent 93    :  L1 dba:  0x01418080 Data dba:  0x01418084
   ...
   Extent 94    :  L1 dba:  0x0140ee80 Data dba:  0x0140ee82
   Extent 95    :  L1 dba:  0x01401e80 Data dba:  0x01401e84
   Extent 96    :  L1 dba:  0x01402280 Data dba:  0x01402284
   Extent 97    :  L1 dba:  0x01402680 Data dba:  0x01402684
   Extent 98    :  L1 dba:  0x01402a80 Data dba:  0x01402a84

When it comes to index range scan things are as expected, as the rowids
are sorted it reads all the blocks in one arc (i am not talking about
prefetching, or batched (db file parallel read which exploits IOPS at a
storage layer but a very simplistic case with all prefetching etc turned
off) but for full table scans things appear to be different as it simply
following the extent map and not in an order of their L1 DBA,

maybe this can be explained by something, but I am still not be able to
figure out any other valid reason as to why we should not perform a full
table scan based on the storage order extent map with the exception of last
extent. Please let me know if I have missed anything?

Thanks,
Vishnu
--
//www.freelists.org/webpage/oracle-l

--
//www.freelists.org/webpage/oracle-l

Attachment: strace
Description: Binary data

Follow-Ups:
- Re: Full table scan -- uniform extent allocation - extent allocation map issue?
  - From: Jonathan Lewis

References:
- Full table scan -- uniform extent allocation - extent allocation map issue?
  - From: Vishnu Potukanuma
- Re: Full table scan -- uniform extent allocation - extent allocation map issue?
  - From: Jonathan Lewis
- Re: Full table scan -- uniform extent allocation - extent allocation map issue?
  - From: Vishnu Potukanuma
- Re: Full table scan -- uniform extent allocation - extent allocation map issue?
  - From: Jonathan Lewis

Re: Full table scan -- uniform extent allocation - extent allocation map issue?

Other related posts: