I can confirm that any DDL on any of the tables involved, regardless of
partition, triggers a hard invalidation of the cursor (v$sql.object_status
becomes ‘INVALID_UNAUTH') and causes it to roll back and restart
(sql_exec_start and sql_exec_id are reset). Each rollback can easily take an
hour, depending on how long it ran before it was invalidated, as it inserts
tens of millions of rows. As it happens there is an exercise at the moment to
reload some data for previous dates, and this INSERT doesn’t have a chance.
The 12.2 "deferred invalidations” clause and the corresponding
cursor_invalidation parameter made no difference.
Possibly we can add some mutex rules to our batch job config, but working out
exactly which rules to add could be a challenge and I suspect there are too
many suspects to be practical.
Another approach might be to lock all 10 tables in row share mode at the start
of the procedure, and have our partition maintenance procedures take out an
exclusive lock on the table before making any change, which might give us a
mechanism for making partition operations wait for critical DML.
I’m wondering why I’ve never come across this before, apart from an issue
recently with a stats job passing no_invalidate = false. I suppose it must
happen all the time, but it takes a long-running, high-volume insert and an
unusually busy batch system before you really notice it.
On 5 Aug 2020, at 10:19, William Robertson <william@xxxxxxxxxxxxxxxxxxxx> wrote:
Well, that’s the kind of smoking gun I am looking for. The statement involved
half a dozen tables, mostly partitioned, some subpartitioned, so I was hoping
for a quick way to track it down, but I suppose the timestamps in dba_objects
and dba_tab_stats_history might be a start.
No automatic/interval partitioning, no triggers, I’d be surprised if there were
force-invalidate stats operations (six of them) but you never know, and this
insert was serial and conventional.
Thanks for the links - I’d come across the one from Randolf Geist one before,
and it describes the situation exactly. I see he wasn’t sure at the time of
writing whether fine-grained cursor invalidation might prevent it in 12.2. It
doesn’t. I’ll have to read the Kun Sun post a couple more times to get my head
around the scenario with undo. Does it seem likely in an insert?
We also get regular ‘Object no longer exists’ errors that are similarly hard to
track down. I wonder if that is a related issue.
On 5 Aug 2020, at 00:42, Sayan Malakshinov <xt.and.r@xxxxxxxxx
Were there any DDL operations on that table? Drop/add? Automatic interval
partitioning? Row-level 'before insert' triggers? Was it parallel insert? Undo
On Wed, Aug 5, 2020 at 1:29 AM <dimensional.dba@xxxxxxxxxxx
While it is running, what is it waiting on?
From: oracle-l-bounce@xxxxxxxxxxxxx <mailto:oracle-l-bounce@xxxxxxxxxxxxx>
<oracle-l-bounce@xxxxxxxxxxxxx <mailto:oracle-l-bounce@xxxxxxxxxxxxx>> On
Behalf Of William Robertson
Sent: Tuesday, August 4, 2020 3:19 PM
To: oracle-l <oracle-l@xxxxxxxxxxxxx <mailto:oracle-l@xxxxxxxxxxxxx>>
Subject: INSERT statement restarts internally
I’m trying to get to the bottom of an issue where a SQL statement (in this case
an insert) that normally runs in 15 minutes suddenly takes two hours. Checking
the session, it’s not blocked, the plan hasn’t changed and seems fine, all row
estimates are good, there is no more data than usual, backups aren’t running
(though I can’t rule out other background resource hogs, as I have somewhat
limited access). Then I notice that the start time shown in SQL Monitor has
changed, and going back to the session I see there have been six sql_exec_ids
and sql_exec_starts for the sid/serial#/sql_id, even though the procedure only
calls it once. Some of them are a couple of minutes apart, some are after 40
minutes. (With hindsight I should have looked at v$sql.object_status and also
checked whether the session was rolling back, which I’m guessing it was.)
The last time we saw this, it coincided with a job that gathered stats on all
partitions marked as stale, passing no_invalidate = false. We fixed that by
changing it to true and the issue seemed to go away. This time though, this
stats job wasn’t running and neither was anything else I could see that would
have affected the same tables and partitions - all our large tables are
list-partitioned by business date and many are subpartitioned by business line
etc, partly to isolate batch processes. Looking at dba_active_sess_history I
can see it’s happened before but sporadically.
I’m a bit stumped about what to do about it. Would this be logged anywhere
(similar to a deadlock report)? I couldn’t see anything likely in
v$diag_trace_file_contents but maybe I’m looking for the wrong thing. Is it the
case that another session must have hard-invalidated the cursor, and I just
have to find the smoking gun, or is there some other scenario like an internal
failure? This is 184.108.40.206.