RE: What is the purpose of segment level checkpoint before DROP/TRUNCATE of a table?

  • From: "Mark W. Farnham" <mwf@xxxxxxxx>
  • To: <saibabu_d@xxxxxxxxx>, "'Jonathan Lewis'" <jonathan@xxxxxxxxxxxxxxxxxx>, "'free'" <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 8 Jul 2011 10:07:51 -0400

I now see your concern about the ability to checksum. But that is not
guaranteed (or needed by the recovery model) and since those blocks are no
longer assigned to a segment any contents are immaterial. If someone is
doing checksums or binary diffs between datafiles through the same point in
recovery, it might break their sanity check, but it does not compromise the
integrity of the standby in any way I can grok.


I think Jonathan raises an interesting question that there is a potential
bug and challenging race condition in direct writes, but how that relates to
flushing dirty buffers through DBWR escapes me; wouldn't the same proposed
route of marking the buffers invalid/free without DBWR writing them be
effective there as well? I think the question is whether a direct write
might write somewhere it no longer has a right to write, but perhaps I'm not
entirely clear on the issue and I look forward to Jonathan explaining his
concern more fully.


The query in flight question is interesting, but if the buffer is marked
free the query in flight needing a read consistent version of the block
should attempt to find the block and roll it back, just as with any other
block that is no longer in cache. This should cancel the query with an
error. That a truncate command might cancel queries in flight involving the
truncated table was accepted from the first thought of the idea of truncate
versus deletes. Once the truncate has been issued (whether complete or not),
if a block clone you need is still valid when you call for it you can
complete, but if you have to get a block from disk and undo it your query is
toast because it will fail attempting to get the block. That is no worse
than a snapshot too old, and someone with authority to truncate the table
did truncate the table (which I hold is the same standard as someone with
authority to have given you more undo space to avoid the snapshot too old
did not do that). If since the original inception of the truncate command
someone has adopted the stance that it should not potentially cancel queries
in flight, that would be both (IMHO) a silly standard to try to achieve and,
I think, impractical if the purpose of the truncate command is to be
honored. If I recall correctly (and it was 20 years ago) this was referred
to as allowing breakable parse locks.


Perhaps some lurker from Oracle can tune up my memory of the history this
request and its implementation. At this moment I cannot remember whether it
was retrofit to V6 or not. It might have only appeared in V7.1 or later.
I'll see if I can find my Oracle VLDB notes.


And I continue to be unclear as to WHY Oracle needs to flush dirty buffers
from truncated or dropped segments through DBWR other than it possibly being
easier to code up. I look forward to being enlightened!







From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx]
On Behalf Of Saibabu Devabhaktuni
Sent: Friday, July 08, 2011 3:45 AM
To: Jonathan Lewis; free
Subject: Re: What is the purpose of segment level checkpoint before
DROP/TRUNCATE of a table?


Inline responses (lines starting with "=>").
Notes in line
Jonathan Lewis

----- Original Message ----- From: "Saibabu Devabhaktuni"
<saibabu_d@xxxxxxxxx> To: "Jonathan Lewis" <jonathan@xxxxxxxxxxxxxxxxxx>;

Sent: Thursday, July 07, 2011 5:07 PM

Subject: Re: What is the purpose of segment level checkpoint before
DROP/TRUNCATE of a table?


Hi Jonathan,
If we have a table with 100,000 dirty buffers on primary database.
1) Let's say 50,000 dirty buffers were already written to disk by DBWR on
2) At time T1, truncate table command issued on primary.

3) If Oracle had this feature to not write buffers as part of truncate, then

time T2 Oracle finished marking buffers not to write.
4) At time T3, truncate operation is fully completed on primary.
5) On the standby, configure very small buffer cache size (i.e. can only fit
10,000 buffers).
6) As the redo up to time T1 applied on the standby, only 10,000 dirty

can stay in the cache and rest of the blocks will be written to disk as soon

redo is applied.

7) Redo as of time T2, will not really mark buffers as not to write, as most

them are already on disk. This is when blocks on the standby is going to be
different from the primary as of same checkpoint time.

So what. I've already pointed out that the standby physical files are always
likely to be different from the primary files at any point in time. Clearly
Oracle has to be able to deal with that problem because we can always point
out that Oracle is supposed survive a global powercut because of the basic
redo and recovery mechanism.


=> I guess I'm not stating it properly. If you do normal shutdown of both
primary and the physical standby as of time T3, and then open both of them
in read only mode, checkpoint time of all datafiles on both standby and
primary should be same, but some of the blocks belong to the table truncated
can be different at binary level across standby and primary, but not
logically. This defies the whole premise of primary and physical standby
being one and the same at binary level when both of them are brought to the
same checkpoint. In other words if I do "checksum" on datafiles at OS level,
that should be same on primary and standby when both of them are at same
checkpoint level and no further changes are happening.


Data loss scenario:
1) If there was a system or datafile level checkpoint finished on primary
between time T2 and T3.
2) If primary instance crashes between time T2 and T3, but after the above
checkpoint was completed.
3) Above checkpoint would have skipped writing buffers marked as not to
and hence on disk image is not current.

4) When instance is starting up, crash recovery starts as of redo from the

recent checkpoint.
5) After the completion of crash recovery, truncate never really finished,
the data in the dirty blocks as of time T1 is missing.

Points to worry about - what does it mean to say:

truncate operation is fully completed - what are the events, and in what
sequence why do you assume that you "mark the buffers as free" (t2) before
you complete the truncate (t3)

 if you have a checkpoint that finishes between t2 and t3 -
       what does it mean to say that the checkpoint finishes in this context

when did the checkpoint start, and does that matter ? (before t1, between t1
and t2, between t2 and t3)

       where do the local writes come into it in your scenario
       where do the updates to the data dictionary come in your scenario
How about this for a truncate sequence.

a) Grab an exclusive pin on the table definition - this stops any more SQL
referencing the table from being compiled

       (may have to wait)

b) Grab an exclusive pin on every cursor referencing the table definition -
this stops any SQL referencing the table from being executed

       (may have to wait)
c)    Invalidate table and index entries in dictionary cache
d)    Invalidate all cursors referencing the table
e)    Generate redo for correcting segment header blocks etc.
f)    Apply, commit, and generate local writes        ***

g) Generate redo for the data dictionary to show table (and indexes) have
been truncated, data_object_id changed.

h)    apply and commit
i)    mark as free all buffers for table and indexes and move to REPL-AUX
       have to ignore blocks on write queue that are already pinned by dbwr
j)    release cursor pins
k)    release object pin

*** Point (f) needs further thought - Oracle must have a mechanism for
avoiding a race condition for blocks which are subject to local writes when
a checkpoint is running or you get the option for something similar to your
inconsistency description: dirty block is not written to file during
checkpoint because it's supposed to be a local write

   local write doesn't take place (for some reason, e.g. session crashes)
   checkpoint completes

if a recovery is required very soon afterwards (and the local write still
hasn't happened) then

       the block on disc is wrong

the recovery process is going to start from the next redo log, and therefore
not see the redo that should bring the block up to date.


=> Assuming "local writes" from step f above meaning "direct path writes".
Why can't they be left to be written by DBWR instead. Also, what happens if
user hit CTRL-C between step f and step h.


=> Any select queries in flight will continue  to execute and fetch the
records for quite a bit of time, before getting ORA-8103 error, after
truncate command is completed. These select queries may see stale data if
dirty buffers are flushed out without writing to the disk.


Other related posts: