Re: DB Instance failure Issue

From: Lok P <loknath.73@xxxxxxxxx>
To: Pap <oracle.developer35@xxxxxxxxx>
Date: Thu, 15 Sep 2022 02:08:55 +0530

Thank You so much.
This doc is saying the database hangs with 'log file parallel write', but
in our case we were not seeing much of 'log file parallel write' wait
events when the crash happened. And also it all just happened suddenly, we
didn't experience hung kind of scenario.
On the other hand, we see the Avg 'log file parallel write' response time
becomes better post 19c migration (Avg ~10ms vs <2ms now during peak load),
it may be that because they must be served from flash cache somehow as
opposed to hard disk in 11.2 version.

We got some pointer towards Doc ID- 2694588.1 from support, which say large
IO or multiple paralle slaves can cause the crash. But we are still
struggling to connect the dots i.e. to understand what exact query were
causing the large IO or paralle slaves and thus caused this issue. And what
exactly is MR Key?

On Wed, Sep 14, 2022 at 11:56 AM Pap <oracle.developer35@xxxxxxxxx> wrote:

See this one..
Exadata: Database hang with high log file parallel write wait and the
cells report ORA-00600: [Scheduler::One_Or_More_Cellsrv_Threads_Stuck]
after upgrading to 19.2.x (Doc ID 2604805.1)

On Wed, 14 Sep 2022, 10:24 am Pap, <oracle.developer35@xxxxxxxxx> wrote:

Your log says slow log writer and a disk offline issue. We have also
encountered similar issues wrt log writer responses and it may be a bug.
You may try to search the MOS.

On Wed, Sep 14, 2022 at 1:07 AM Lok P <loknath.73@xxxxxxxxx> wrote:

Hello Listers,

Its Oracle version- 19.15.0.0.0. Its 4 node RAC. Its Exadata X8M, image
version 19.3.7. It was upgraded from 11.2 a few days back. All the Nodes
crashed suddenly. And alert logs pointing to mainly Node-2 and a specific
cellserver. Below are the snippets of the alert logs from DB and the
cellserver.

We are suspecting any bugs which we have hit. We are checking with
support. But want to understand from experts here, if anything similar
anybody has encountered?

*Database log:- *

2022-09-12T15:01:53.278393-04:00

requested by (instance=2, osid=97030 (DIA0)), summary=[process state
dump request (kjznHA)]

requested by (instance=2, osid=97030 (DIA0)), summary=[process state
dump request (kjznHA)]

2022-09-12T15:02:54.901871-04:00

NOTE: process _lgwr_DB12 (97118) initiating optional offline of disk
22.4042277007 (RECO_XGMB_CD_06_XXGMCL13) with mask 0x7e in group 2
(RECO_XGMB) with client assisting

2022-09-12T15:02:54.946461-04:00

NOTE: updating disk modes to 0x5 from 0x7 for disk 12
(RECO_XGMB_CD_05_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0

NOTE: updating disk modes to 0x5 from 0x7 for disk 13
(RECO_XGMB_CD_02_XXGMCL13) in group 2 (RECO_XGMB): lflags 0x0

---------------

2022-09-12T15:02:54.961121-04:00

Errors in file
/u01/app/oracle/diag/rdbms/DB1_xXGMb/DB12/trace/DB12_lgwr_97118.trc:

ORA-15080: synchronous I/O operation failed to write block 32665540 of
disk 22 in disk group RECO_XGMB

2022-09-12T15:02:54.961169-04:00

WARNING: failed to write mirror side 2 of virtual extent 3987 logical
extent 1 of file 625 in group 2 on disk 22 allocation unit 1294

NOTE: process _lgwr_DB12 (97118) initiating offline of disk
22.4042277007 (RECO_XGMB_CD_06_XGMCL13) with mask 0x7e in group 2
(RECO_XGMB) with client assisting

---------------

2022-09-12T15:03:09.809095-04:00

NOTE: ASMB process initiating disk discovery for grp 2
(reqid:5125151858335298229)

2022-09-12T15:03:11.517788-04:00

NOTE: Found o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13 for disk
RECO_XGMB_CD_05_XGMCL13

SUCCESS: disk RECO_XGMB_CD_05_XGMCL13 (12.4042276997) replaced in
diskgroup RECO_XGMB path: o/XXXXXXXX/RECO_XGMB_CD_05_XGMcl13

SUCCESS: completed reconfiguration of group 2 (RECO_XGMB)

-------------

2022-09-12T15:05:13.914953-04:00

LGWR (ospid: 97118) waits for event 'log file parallel write' for 138
secs.

2022-09-12T15:05:24.010945-04:00

LGWR (ospid: 97118) waits for event 'log file parallel write' for 148
secs.

Errors in file
/u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/trace/DB12_lmhb_97063.trc
(incident=319107):

ORA-29770: global enqueue process LGWR (OSID 97118) is hung for more
than 140 seconds

Incident details in:
/u01/app/oracle/diag/rdbms/DB1_XGMb/DB12/incident/incdir_319107/DB12_lmhb_97063_i319107.trc

2022-09-12T15:05:24.744688-04:00

NOTE: ASMB process initiating disk discovery for grp 2
(reqid:5125151858335314113)

2022-09-12T15:05:26.701077-04:00

LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_DB1_XGMB by memno
1

2022-09-12T15:05:27.038295-04:00

Process termination requested for pid 97118 [source = rdbms], [info = 0]
[request issued by pid: 97063, uid: 260]

2022-09-12T15:05:27.038338-04:00

Process termination requested for pid 97041_97049 [source = rdbms],
[info = 0] [request issued by pid: 97063, uid: 260]

2022-09-12T15:05:27.038441-04:00

Process termination requested for pid 97039_97050 [source = rdbms],
[info = 0] [request issued by pid: 97063, uid: 260]

2022-09-12T15:05:27.038471-04:00

*****************

2022-09-12T15:05:27.054641-04:00

Dumping diagnostic data in directory=[cdmp_20220912150527], requested by
(instance=2, osid=97063 (LMHB)), summary=[abnormal instance termination].

LMHB (ospid: 97063): terminating the instance due to ORA error 29770

2022-09-12T15:05:27.064406-04:00

Cause - 'ERROR: Some process(s) is not making progress.

LMHB (ospid: 97063) is terminating the instance.

Please check LMHB trace file for more details.

Please also check the CPU load, I/O load and other system properties for
anomalous behavior

ERROR: Some process(s'

2022-09-12T15:05:29.118991-04:00

*Cell server log :- *

sendMessage completed with error 2, clientHost:
XGMdb06.XX.XXXXX.com[pid:148143]
destinationID: 1243439106 openCnt 1 requestCnt 4

Implicit fencing of this port will be performed

slos passed to ossmisc_dump_slos_alert routine is not in failure status

2022-09-11T22:10:01.317939-04:00

I/O Resource Manager: Setting intra-database plan for database DB1_XGMB:
DEFAULT_MAINTENANCE_PLAN

2022-09-12T15:03:16.978494-04:00

OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
timed out due to pending I/O's)

2022-09-12T15:04:21.539419-04:00

OSS_IOCTL_GRIDDISK_FENCE for griddisk: RECO_XGMB_CD_06_XGMcl13 giuid:
673900a4-8854-4ad8-8a5b-2015942d1bbc failed with error 388 (Fencing IOCTL
timed out due to

-------------

WARNING: Cellsrv UserThread: LWPID: 59568 userId: 108 kernelId: 72
pthreadID: 13766447306496 is stuck for more than 10025222 micro sec.

Please take a look at svtrc_57983_108.trc and System State dump for more
info

Errors in file
/opt/oracle/cell/log/diag/asm/cell/x073cl13/trace/svtrc_57983_16.trc
(incident=17):

ORA-00600: internal error code, arguments:
[Scheduler::One_Or_More_Cellsrv_Threads_Stuck], [1], [], [], [], [], [],
[], [], [], [], []

Incident details in: /opt/oracle/cell/log/diag/asm/cell/XGMcl13
/incident/incdir_17/svtrc_57983_16_i17.trc

2022-09-12T15:48:06.190103-04:00

[RS] Monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid:
57980, srvc_pid: 57983) returned with error: 128

CELLSRV error - ORA-600 internal error

*logwriter trace:-*

2022-09-12 15:02:54.947
:C343DCE7:KFNC:kfnc.c@12127:kfncSlaveMsgFree():1373:13291:
[kfncDiskOffline]
0x247eeee5d8 alloc=0x1 msgtyp=0x1

2022-09-12 15:02:54.947
:C343DCE8:KFNC:kfnc.c@4048:kfncDiskOffline():1373:13291:
12 disks were taken offline for gnum [2.1193292329]

2022-09-12 15:02:54.961
:C343DE61:KFD:kfd.c@12756:kfdOfflinePriv():1373:13291:
kfdOffline: mask=0x7e check=0xc flg=0x2 caller=1 dskCnt=1 client=1
clientAssisting=1 at kfio.c:8245

2022-09-12 15:02:54.961
:C343DE62:KFNC:kfnc.c@3929:kfncDiskOffline():1373:13291:
gnum [2.1193292329], count 1 mask 126 ocheck 12

2022-09-12 15:02:54.961
:C343DE63:KFNC:kfn.c@19336:kfnSetupASMBIdx():1373:13291:
Known group number 2, owner asmb index 0; inrenew = 0

2022-09-12 15:02:54.961
:C343DE64:KFNC:kfnc.c@7525:kfncGetCurrentPoolInt():1373:13291:
asmb index = 0

2022-09-12 15:02:54.961
:C343DE65:KFNC:kfnc.c@12024:kfncSlaveMsgAllocInt():1373:13291:
[kfncDiskOffline] usepga=0 workmsg=1 noslave=0 uga=0x7fabf6d2bc30 at
kfnc.c:3952, asmb#=0

Regards

Lok

References:
- DB Instance failure Issue
  - From: Lok P
- Re: DB Instance failure Issue
  - From: Pap
- Re: DB Instance failure Issue
  - From: Pap

Re: DB Instance failure Issue

Other related posts: