RE: Grid (RAC & Standalone) Unexpected Node Reboots Upon Device Path Failures

From: "Dimensional DBA" <dimensional.dba@xxxxxxxxxxx>
To: <fmhabash@xxxxxxxxx>, "'Oracle-L Group'" <oracle-l@xxxxxxxxxxxxx>
Date: Mon, 13 Jun 2016 11:51:58 -0700

You should also check your disktimeout and misscount values at the Oracle
Clusterware level. If the reboots were related to those settings and to long
failover times at the hardware, though, you should see timeouts in your logs.

Matthew Parker

Chief Technologist

Dimensional DBA

425-891-7934 (cell)

D&B 047931344

CAGE 7J5S7

Dimensional.dba@xxxxxxxxxxx

<http://www.linkedin.com/pub/matthew-parker/6/51b/944/> View Matthew Parker's
profile on LinkedIn

www.dimensionaldba.com <http://www.dimensionaldba.com/>

From: Dimensional DBA [mailto:dimensional.dba@xxxxxxxxxxx]
Sent: Monday, June 13, 2016 11:29 AM
To: dimensional.dba@xxxxxxxxxxx; fmhabash@xxxxxxxxx; 'Oracle-L Group'
Subject: RE: Grid (RAC & Standalone) Unexpected Node Reboots Upon Device Path
Failures

Other generic notes.

Normally you don’t set “queue_if_no_path” but set “no_path_retry N” instead.

The number can vary, but a standard setting for, say, EMC Symmetrix with UCS is

    no_path_retry 6
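
As a rough worked example of what that number means: with a numeric
no_path_retry, I/O is queued for about that many path-checker intervals after
the last path fails, and is then failed back to the caller. The sketch below
assumes the multipath default polling_interval of 5 seconds and a CSS
disktimeout of 200 seconds (the usual 11gR2 default); both are assumptions, so
check your own values (for example with "crsctl get css disktimeout"). The
function names are illustrative only.

```python
def queue_window_seconds(no_path_retry: int, polling_interval: int = 5) -> int:
    """Approximate time I/O stays queued after all paths to a LUN are lost.

    Assumes a numeric no_path_retry and a fixed checker interval; real timing
    also depends on when the path checker notices the failure.
    """
    if no_path_retry < 0 or polling_interval <= 0:
        raise ValueError("no_path_retry must be >= 0 and polling_interval > 0")
    return no_path_retry * polling_interval


def fits_within(window_seconds: int, timeout_seconds: int) -> bool:
    """True if queued I/O is failed before the given timeout would expire."""
    return window_seconds < timeout_seconds


# Assumed values: polling_interval 5 s (multipath default),
# CSS disktimeout 200 s (typical 11gR2 default).
window = queue_window_seconds(no_path_retry=6)
print(f"I/O queued for roughly {window} s after the last path fails")
print("Below CSS disktimeout:", fits_within(window, 200))
```

With queue_if_no_path there is no such window: I/O is queued indefinitely
until a path returns, which is why a bounded no_path_retry is usually
preferred.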

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On
Behalf Of Dimensional DBA
Sent: Monday, June 13, 2016 10:52 AM
To: fmhabash@xxxxxxxxx; 'Oracle-L Group'
Subject: RE: Grid (RAC & Standalone) Unexpected Node Reboots Upon Device Path
Failures

Does it happen every time or sporadically?

Can you provide an example LUN from your multipath.conf and the values you are
using for those settings, or for the combination of those settings? Some of
them are binary opposites of each other.
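
For illustration, an excerpt of the kind that helps might look like the sketch
below. The WWID and alias are placeholders, and the values are examples rather
than recommendations. The commented-out features line shows the setting that
conflicts with a numeric no_path_retry: it queues I/O indefinitely instead of
failing it after N retries.

```
defaults {
    polling_interval    5
    failback            immediate
}

multipaths {
    multipath {
        # placeholder WWID and hypothetical alias
        wwid            REPLACE_WITH_LUN_WWID
        alias           asm_data01
        no_path_retry   6
        # features      "1 queue_if_no_path"
    }
}
```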

What UCS Manager version and firmware bundle patch are you running, and on
which blade type are you having problems?

Do the cluster logs and OS logs show an error that all paths timed out?

There are a variety of failure points, and each failure point has a different
solution.

That includes an administrator modifying templates in UCS Manager, which can
cause the nodes to reboot.

From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On
Behalf Of fmhabash@xxxxxxxxx
Sent: Monday, June 13, 2016 10:05 AM
To: Oracle-L Group
Subject: Grid (RAC & Standalone) Unexpected Node Reboots Upon Device Path
Failures

We are experiencing a perplexing issue for which we have not been able to
arrive at an RCA or a resolution. Grid nodes (RAC or standalone) reboot
unexpectedly and sporadically (not every time) when we fail over a hardware
component such as a UCS fabric interconnect, an HBA, or a storage controller.
On some systems, we have also noticed filesystems going read-only.

All devices are configured with multipathing, with a minimum of 4 paths.
Multipathing is provided by EMC PowerPath or native Linux DM-MPIO.

All nodes use 11gR2 ASM LVM, with a subset using ASMLib, running on OEL 6.3-6.6
and RDBMS 11gR2.

I know there are a zillion factors to consider here, but to keep things simple,
let’s focus on DM-MPIO for now. We believe all these symptoms are related to
how the software (Oracle ASM or Linux LVM) reacts to the loss of a path in a
multipathed setup. So we focused on the multipath.conf settings that control
I/O path failover, namely:

no_path_retry

queue_if_no_path

polling_interval

rr_min_io

failback immediate

1)      Have you experienced issues like unexpected node reboots or filesystems
going read-only when failing over at the hardware level I listed above?

2)      What was your resolution?

3)      How do your multipath.conf settings compare to the ones listed above?

Thanks all
