Hello, this is my first post on the oracle-l freelists... and sorry for my poor English :)

I have an Oracle 11.2.0.2 RAC system on Solaris 10, and I have some problems with its stability. Every few days one node restarts the other one; then the system works for the next few days, and it repeats... I have 2 nodes, named Mars and Venus. For example, today the Mars node restarted the second one (Venus), but I don't know why... The db alert log only says that there was some communication problem - a timeout - and that's all... It happened at about 02:30.

Db alert log on Mars:

Thread 2 advanced to log sequence 5461 (LGWR switch)
  Current log# 10 seq# 5461 mem# 0: +DATA/lnred/onlinelog/group_10.308.743814917
Thu Jul 07 01:02:23 2011
Archived Log entry 3047 added for thread 2 sequence 5460 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:31:28 2011
minact-scn master exiting with err:12751
Thu Jul 07 02:31:43 2011
IPC Send timeout detected. Sender: ospid 1940 [oracle@Mars (MMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 26
IPC Send timeout: Terminating pid 26 osid 1940
Thu Jul 07 02:31:46 2011
IPC Send timeout detected. Sender: ospid 18305 [oracle@Mars (M000)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 75
IPC Send timeout: Terminating pid 75 osid 18305
Thu Jul 07 02:32:07 2011
IPC Send timeout detected. Sender: ospid 1908 [oracle@Mars (LMD0)]
Receiver: inst 1 binc 36502 ospid 1949
Thu Jul 07 02:32:10 2011
Restarting dead background process MMON
Thu Jul 07 02:32:11 2011
MMON started with pid=26, OS id=19687
Thu Jul 07 02:32:15 2011
IPC Send timeout detected.
Sender: ospid 1928 [oracle@Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1951
IPC Send timeout to 1.1 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:32:19 2011
Communications reconfiguration: instance_number 1
Thu Jul 07 02:32:24 2011
IPC Send timeout to 1.0 inc 32 for msg type 65521 from opid 12
Thu Jul 07 02:32:29 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle@Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1955
IPC Send timeout to 1.2 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:33:18 2011
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Thu Jul 07 02:33:38 2011
Remote instance kill is issued with system inc 34
Remote instance kill map (size 1) : 1
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1
Thu Jul 07 02:33:48 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:18 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:40 2011
Reconfiguration started (old inc 32, new inc 36)
List of instances: 2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:34:40 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:34:40 2011
LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Thu Jul 07 02:34:40 2011
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
read 3662 KB redo, 135 data blocks need recovery
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Started redo application at
Thread 1: logseq 10409, block 19647
Recovery of Online Redo Log:
Thread 1 Group 5 Seq 10409 Reading mem 0
  Mem# 0: +DATA/lnred/onlinelog/group_5.303.743814897
Completed redo application of 0.02MB
Completed instance recovery at
Thread 1: logseq 10409, block 26971, scn 2612073761
62 data blocks read, 121 data blocks written, 3662 redo k-bytes read
Thread 1 advanced to log sequence 10410 (thread recovery)
Redo thread 1 internally disabled at seq 10410 (SMON)
Thu Jul 07 02:34:41 2011
Archived Log entry 3048 added for thread 1 sequence 10409 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:42 2011
ALTER SYSTEM SET service_names='lnwyd_service','lnred_service','backup' SCOPE=MEMORY SID='lnred2';
Thu Jul 07 02:34:42 2011
ARC3: Archiving disabled thread 1 sequence 10410
Archived Log entry 3049 added for thread 1 sequence 10410 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:53 2011
minact-scn: Inst 2 is now the master inc#:36 mmon proc-id:19687 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.9bb067c4 gcalc-scn:0x0000.9bb06ebf
Thu Jul 07 02:35:51 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 no longer running (check cable)
Thu Jul 07 02:36:52 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 is now running
Thu Jul 07 02:37:28 2011
IPC Send timeout detected. Sender: ospid 1886 [oracle@Mars (PMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 2
Thu Jul 07 02:41:17 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
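By the way, to line up the IPC Send timeout events between the two nodes' alert logs I used a small script. This is just a rough sketch I wrote myself (the alert-log path in the comment is only my guess based on diagnostic_dest): it remembers the last timestamp line it saw and attaches it to every "IPC Send timeout" line that follows.

```python
import re

# Alert-log timestamps appear on their own line, e.g. "Thu Jul 07 02:31:43 2011".
TS_RE = re.compile(r'^\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}$')

def ipc_timeouts(lines):
    """Return (timestamp, line) pairs for every 'IPC Send timeout' line,
    tagged with the most recent timestamp seen above it."""
    events, last_ts = [], None
    for raw in lines:
        line = raw.rstrip('\n')
        if TS_RE.match(line):
            last_ts = line
        elif 'IPC Send timeout' in line:
            events.append((last_ts, line))
    return events

# Usage against the real file (the path is only an example, derived from
# diagnostic_dest=/md1/app/oracle, db_name=lnred, instance lnred2):
#   with open('/md1/app/oracle/diag/rdbms/lnred/lnred2/trace/alert_lnred2.log') as f:
#       for ts, msg in ipc_timeouts(f):
#           print(ts, '|', msg)
```

Running it on both alert logs makes it easier to see that all the timeouts point at the same receiver process on instance 1. Anyway, the Mars alert log continues: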
Thu Jul 07 02:42:28 2011
Reconfiguration started (old inc 36, new inc 38)
List of instances: 1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Thu Jul 07 02:42:28 2011
* domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:42:28 2011
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:42:28 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Thu Jul 07 02:42:29 2011
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:38 errcnt:0
Reconfiguration complete
Thu Jul 07 02:56:18 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 04:09:37 2011
Thread 2 advanced to log sequence 5462 (LGWR switch)
  Current log# 8 seq# 5462 mem# 0: +DATA/lnred/onlinelog/group_8.306.743814913
Thu Jul 07 04:09:39 2011

Db alert log on Venus:

Wed Jul 06 22:09:06 2011
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_1_seq_10394.415.755808529
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_2_seq_5418.421.755808553
Archived Log entry 3044 added for thread 1 sequence 10408 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:42:16 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Private Interface 'aggr2:1' configured from GPnP for use as a private interconnect.
  [name='aggr2:1', type=1, ip=169.254.16.251, mac=00-1b-21-49-f4-34, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'aggr1' configured from GPnP for use as a public interface.
  [name='aggr1', type=1, ip=10.122.10.150, mac=00-1b-21-49-f4-30, net=10.122.10.0/23, mask=255.255.254.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters option.
Using parameter settings in server-side pfile /md1/app/oracle/product/11.2.0/db1/dbs/initlnred1.ora
System parameters with non-default values:
  processes                 = 300
  sessions                  = 480
  timed_statistics          = TRUE
  spfile                    = "+DATA/lnred/spfilelnred.ora"
  nls_language              = "POLISH"
  nls_territory             = "POLAND"
  nls_length_semantics      = "CHAR"
  resource_manager_plan     = ""
  sga_target                = 8G
  control_files             = "+DATA/lnred/controlfile/current.256.741561271"
  db_block_size             = 8192
  compatible                = "11.2.0.0.0"
  cluster_database          = TRUE
  db_create_file_dest       = "+DATA"
  db_recovery_file_dest     = "+FRA"
  db_recovery_file_dest_size= 500G
  thread                    = 1
  undo_tablespace           = "UNDOTBS1"
  instance_number           = 1
  remote_login_passwordfile = "EXCLUSIVE"
  db_domain                 = ""
  dispatchers               = "(PROTOCOL=TCP) (SERVICE=lnredXDB)"
  remote_listener           = "lnscan.adln.local:1521"
  audit_file_dest           = "/md1/app/oracle/admin/lnred/adump"
  optimizer_features_enable = "11.1.0.7"
  audit_trail               = "DB"
  db_name                   = "lnred"
  open_cursors              = 300
  pga_aggregate_target      = 8G
  aq_tm_processes           = 1
  diagnostic_dest           = "/md1/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.16.251
cluster interconnect IPC version: Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Thu Jul 07 02:42:25 2011
PMON started with pid=2, OS id=1914

OS messages on Venus:

Jul 7 02:34:38 Venus unix: [ID 836849 kern.notice]
Jul 7 02:34:38
Venus ^Mpanic[cpu6]/thread=ffffffff9bec7e60:
Jul 7 02:34:38 Venus genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017ceeb0 genunix:kadmin+517 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef00 genunix:uadmin+c7 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef10 unix:brand_sys_syscall+21d ()
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 672855 kern.notice] syncing file systems...
Jul 7 02:34:39 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:39 Venus genunix: [ID 733762 kern.notice] 220
Jul 7 02:34:40 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:40 Venus genunix: [ID 733762 kern.notice] 205
Jul 7 02:34:42 Venus genunix: [ID 433738 kern.notice] [6]

So... it isn't much information... but in the grid log on Venus (grid/log/venus/crsd/crsd.log) I found:

2011-07-07 02:02:10.052: [ CRSPE][47] {1:3804:25345} Processing PE command id=25249.
Description: [Stat Resource : 3703910]
2011-07-07 02:02:10.054: [UiServer][48] {1:3804:25345} Done for ctx=32ca350

Then there is some garbage in that log (maybe from the same communication problem):

[NUL] [NUL] [NUL] ... (a long run of NUL bytes)
[ clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=venusDBG_CRSD))
2011-07-07 02:41:48.808: [ clsdmt][3]PID for the Process [1619], connkey 1

But on Mars, in grid/log/mars/crsd/crsd.log:

2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-07-07 02:20:48.482: [ CRSCOMM][39] Ipc: Client disconnected.
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] IpcL: Listener got clsc error 11 for memNum. 8
2011-07-07 02:20:48.482: [ CRSCOMM][39] IpcL: connection to member 8 has been removed
2011-07-07 02:20:48.483: [CLSFRAME][39] Removing IPC Member:{Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [CLSFRAME][39] Disconnected from AGENT process: {Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agfw Proxy Server received process disconnected notification, count=1
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} /md1/app/11.2.0/grid/bin/oraagent_oracle disconnected.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agent /md1/app/11.2.0/grid/bin/oraagent_oracle[18546] stopped!
2011-07-07 02:20:48.483: [ CRSCOMM][42] {2:44971:2690} IpcL: removeConnection: Member 8 does not exist.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Created alert : (:CRSAGF00129:) : The agent: /md1/app/11.2.0/grid/bin/oraagent has been disabled because it exceeded the maximum failures (5) over a period of 600 seconds.

At this point the Venus OS was rebooted (I think via IPMI, triggered from Mars):

2011-07-07 02:34:39.963: [ CRSSE][32] Forwarding Node Leave to PE for: venus
2011-07-07 02:34:39.965: [ CRSCCL][34]Reconfig event received by clssgsgrpstat
2011-07-07 02:34:39.965: [ CRSCCL][34]cclGetMemberData called
2011-07-07 02:34:39.979: [ CRSCCL][34]Member (2, 307107, mars:11.2.0.2.0) @ found.
2011-07-07 02:34:39.999: [CLSFRAME][34] CCL MEMBER LEFT:1:1:CRSD:venus
2011-07-07 02:34:39.999: [CLSFRAME][34] Disconnected from CRSD:venus process:

So... what should I do? Do you have any ideas?

Regards,
----------------------------------------------------
Robert Grzesiak
Senior Developer, Oracle DBA
LexisNexis Polska Sp. z o.o.
tel office: +48 12 426 00 03
email: Robert.Grzesiak@xxxxxxxxxxxxx
www.lexisnexis.pl
Biurowiec Galileo, ul. Armii Krajowej 18, 30-150 Kraków
XII Wydział Gospodarczy Krajowego Rejestru Sądowego KRS 0000035401
Właściciel: Reed Elsevier Overseas B.V.
Kapitał zakładowy: 8 200 000 PLN, REGON: 010860143, NIP: 526-10-13-257
----------------------------------------------------