Re: What is the purpose of segment level checkpoint before DROP/TRUNCATE of a table?

  • From: Saibabu Devabhaktuni <saibabu_d@xxxxxxxxx>
  • To: Jonathan Lewis <jonathan@xxxxxxxxxxxxxxxxxx>, free <oracle-l@xxxxxxxxxxxxx>
  • Date: Fri, 8 Jul 2011 00:45:10 -0700 (PDT)

Inline responses (lines starting with "=>").

Thanks,
Sai
http://sai-oracle.blogspot.com

Notes in line.

Regards,
Jonathan Lewis
http://jonathanlewis.wordpress.com

----- Original Message -----
From: "Saibabu Devabhaktuni" <saibabu_d@xxxxxxxxx>
To: "Jonathan Lewis" <jonathan@xxxxxxxxxxxxxxxxxx>; "free" <oracle-l@xxxxxxxxxxxxx>
Sent: Thursday, July 07, 2011 5:07 PM
Subject: Re: What is the purpose of segment level checkpoint before DROP/TRUNCATE of a table?
> Hi Jonathan,
>
> If we have a table with 100,000 dirty buffers on the primary database:
> 1) Let's say 50,000 dirty buffers were already written to disk by DBWR on the primary.
> 2) At time T1, a truncate table command is issued on the primary.
> 3) If Oracle had this feature to not write buffers as part of truncate, then at time T2 Oracle finished marking buffers as not to write.
> 4) At time T3, the truncate operation is fully completed on the primary.
> 5) On the standby, configure a very small buffer cache (i.e. it can only fit 10,000 buffers).
> 6) As the redo up to time T1 is applied on the standby, only 10,000 dirty buffers can stay in the cache and the rest of the blocks will be written to disk as soon as the redo is applied.
> 7) The redo as of time T2 will not really mark buffers as not to write, as most of them are already on disk. This is when blocks on the standby are going to be different from the primary as of the same checkpoint time.

So what. I've already pointed out that the standby physical files are always likely to be different from the primary files at any point in time. Clearly Oracle has to be able to deal with that problem, because we can always point out that Oracle is supposed to survive a global power cut because of the basic redo and recovery mechanism.

=> I guess I'm not stating it properly. If you do a normal shutdown of both the primary and the physical standby as of time T3, and then open both of them in read-only mode, the checkpoint time of all datafiles on both standby and primary should be the same, but some of the blocks belonging to the truncated table can be different at the binary level across standby and primary, even though they are logically the same. This defeats the whole premise of the primary and the physical standby being one and the same at the binary level when both of them are brought to the same checkpoint. In other words, if I do a "checksum" on the datafiles at the OS level, it should be the same on primary and standby when both of them are at the same checkpoint level and no further changes are happening.
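
=> For what it's worth, here is a minimal sketch of the kind of comparison I mean (the file names are just placeholders). Run the query on both the primary and the read-only standby to confirm they are at the same checkpoint, then checksum the datafiles at the OS level:

    SELECT file#, checkpoint_change#, checkpoint_time
      FROM v$datafile_header
     ORDER BY file#;

    -- On each host, once the checkpoint SCNs match and no further changes are happening:
    -- $ md5sum /u01/oradata/PROD/users01.dbf     (primary)
    -- $ md5sum /u01/oradata/STBY/users01.dbf     (standby)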


> Data loss scenario:
> 1) If there was a system or datafile level checkpoint that finished on the primary between time T2 and T3.
> 2) If the primary instance crashes between time T2 and T3, but after the above checkpoint was completed.
> 3) The above checkpoint would have skipped writing the buffers marked as not to write, and hence the on-disk image is not current.
> 4) When the instance is starting up, crash recovery starts with the redo from the most recent checkpoint.
> 5) After the completion of crash recovery, the truncate never really finished, but the data in the dirty blocks as of time T1 is missing.
Points to worry about:
- What does it mean to say the truncate operation is fully completed - what are the events, and in what sequence?
- Why do you assume that you "mark the buffers as free" (t2) before you complete the truncate (t3)?
- If you have a checkpoint that finishes between t2 and t3 - what does it mean to say that the checkpoint finishes in this context? When did the checkpoint start, and does that matter? (before t1, between t1 and t2, between t2 and t3)
- Where do the local writes come into it in your scenario?
- Where do the updates to the data dictionary come in your scenario?

How about this for a truncate sequence.

a)    Grab an exclusive pin on the table definition - this stops any more SQL referencing the table from being compiled (may have to wait)
b)    Grab an exclusive pin on every cursor referencing the table definition - this stops any SQL referencing the table from being executed (may have to wait)
c)    Invalidate table and index entries in the dictionary cache
d)    Invalidate all cursors referencing the table
e)    Generate redo for correcting segment header blocks etc.
f)    Apply, commit, and generate local writes        ***
g)    Generate redo for the data dictionary to show that the table (and indexes) have been truncated, data_object_id changed
h)    Apply and commit
i)    Mark as free all buffers for the table and indexes and move them to REPL-AUX - have to ignore blocks on the write queue that are already pinned by dbwr
j)    Release cursor pins
k)    Release object pin

*** Point (f) needs further thought - Oracle must have a mechanism for avoiding a race condition for blocks which are subject to local writes when a checkpoint is running, or you get the option for something similar to your inconsistency description:
- dirty block is not written to file during the checkpoint because it's supposed to be a local write
- the local write doesn't take place (for some reason, e.g. the session crashes)
- the checkpoint completes
- if a recovery is required very soon afterwards (and the local write still hasn't happened), then the block on disc is wrong; the recovery process is going to start from the next redo log, and therefore not see the redo that should bring the block up to date.



=> I'm assuming that the "local writes" in step f above mean "direct path writes". Why can't they be left to be written by DBWR instead? Also, what happens if the user hits CTRL-C between step f and step h?
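
=> As a side note on step g, the data_object_id change is easy to see from the dictionary; a minimal sketch, assuming a scratch table T (truncate typically bumps data_object_id while object_id stays the same):

    CREATE TABLE t AS SELECT * FROM all_objects;

    SELECT object_id, data_object_id
      FROM user_objects
     WHERE object_name = 'T';

    TRUNCATE TABLE t;

    -- object_id is unchanged, data_object_id has moved on; in-flight cursors
    -- still carrying the old data_object_id are the ones that end up with ORA-8103.
    SELECT object_id, data_object_id
      FROM user_objects
     WHERE object_name = 'T';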

=> Any select queries in flight will continue to execute and fetch records for quite a bit of time after the truncate command has completed, before getting an ORA-8103 error. These select queries may see stale data if dirty buffers are flushed out without being written to disk.
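
=> Roughly the behaviour I mean, as a sketch (the table name BIG_T is just a placeholder, and it needs to be big enough that the fetch outlives the truncate):

    -- Session 1: start a long-running full scan and keep fetching
    SELECT /*+ full(b) */ * FROM big_t b;

    -- Session 2: while session 1 is still fetching
    TRUNCATE TABLE big_t;

    -- Session 1 keeps returning rows for a while from blocks it has already
    -- reached, and then fails with:
    --   ORA-08103: object no longer exists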
