Hello, this is my first post on the oracle-l freelists... and sorry for my poor English :)

I have an Oracle 11.2.0.2 RAC system on Solaris 10, and I have some problems with its stability. Every few days one node restarts the other one; then the system works for the next few days, and it repeats... I have 2 nodes, named Mars and Venus. For example, today the Mars node restarted the second one (Venus), but I don't know why... The db alert log only says that there was some communication problem - a timeout - and that's all... It happened at about 02:30.

Db alert log on Mars:

Thread 2 advanced to log sequence 5461 (LGWR switch)
  Current log# 10 seq# 5461 mem# 0: +DATA/lnred/onlinelog/group_10.308.743814917
Thu Jul 07 01:02:23 2011
Archived Log entry 3047 added for thread 2 sequence 5460 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:31:28 2011
minact-scn master exiting with err:12751
Thu Jul 07 02:31:43 2011
IPC Send timeout detected. Sender: ospid 1940 [oracle@Mars (MMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 26
IPC Send timeout: Terminating pid 26 osid 1940
Thu Jul 07 02:31:46 2011
IPC Send timeout detected. Sender: ospid 18305 [oracle@Mars (M000)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 75
IPC Send timeout: Terminating pid 75 osid 18305
Thu Jul 07 02:32:07 2011
IPC Send timeout detected. Sender: ospid 1908 [oracle@Mars (LMD0)]
Receiver: inst 1 binc 36502 ospid 1949
Thu Jul 07 02:32:10 2011
Restarting dead background process MMON
Thu Jul 07 02:32:11 2011
MMON started with pid=26, OS id=19687
Thu Jul 07 02:32:15 2011
IPC Send timeout detected.
Sender: ospid 1928 [oracle@Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1951
IPC Send timeout to 1.1 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:32:19 2011
Communications reconfiguration: instance_number 1
Thu Jul 07 02:32:24 2011
IPC Send timeout to 1.0 inc 32 for msg type 65521 from opid 12
Thu Jul 07 02:32:29 2011
IPC Send timeout detected. Sender: ospid 1928 [oracle@Mars (LGWR)]
Receiver: inst 1 binc 36528 ospid 1955
IPC Send timeout to 1.2 inc 32 for msg type 73 from opid 20
Thu Jul 07 02:33:18 2011
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Thu Jul 07 02:33:38 2011
Remote instance kill is issued with system inc 34
Remote instance kill map (size 1) : 1
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x20000000
The instance eviction map is 1
Thu Jul 07 02:33:48 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:18 2011
Waiting for instances to leave: 1
Thu Jul 07 02:34:40 2011
Reconfiguration started (old inc 32, new inc 36)
List of instances: 2 (myinst: 2)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:34:40 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:34:40 2011
LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Thu Jul 07 02:34:40 2011
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
Started redo scan
Completed redo scan
read 3662 KB redo, 135 data blocks need recovery
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Reconfiguration complete
Started redo application at
Thread 1: logseq 10409, block 19647
Recovery of Online Redo Log:
Thread 1 Group 5 Seq 10409 Reading mem 0
  Mem# 0: +DATA/lnred/onlinelog/group_5.303.743814897
Completed redo application of 0.02MB
Completed instance recovery at
Thread 1: logseq 10409, block 26971, scn 2612073761
62 data blocks read, 121 data blocks written, 3662 redo k-bytes read
Thread 1 advanced to log sequence 10410 (thread recovery)
Redo thread 1 internally disabled at seq 10410 (SMON)
Thu Jul 07 02:34:41 2011
Archived Log entry 3048 added for thread 1 sequence 10409 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:42 2011
ALTER SYSTEM SET service_names='lnwyd_service','lnred_service','backup' SCOPE=MEMORY SID='lnred2';
Thu Jul 07 02:34:42 2011
ARC3: Archiving disabled thread 1 sequence 10410
Archived Log entry 3049 added for thread 1 sequence 10410 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:34:53 2011
minact-scn: Inst 2 is now the master inc#:36 mmon proc-id:19687 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.9bb067c4 gcalc-scn:0x0000.9bb06ebf
Thu Jul 07 02:35:51 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 no longer running (check cable)
Thu Jul 07 02:36:52 2011
SKGXP: ospid 1908: network interface with IP address 169.254.220.65 is now running
Thu Jul 07 02:37:28 2011
IPC Send timeout detected. Sender: ospid 1886 [oracle@Mars (PMON)]
Receiver: inst 1 binc 36502 ospid 1949
IPC Send timeout to 1.0 inc 32 for msg type 8 from opid 2
Thu Jul 07 02:41:17 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
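By the way, to line up the IPC Send timeout events between the two nodes' alert logs I used a small script. This is just a rough sketch I wrote myself (the alert-log path in the comment is only my guess based on diagnostic_dest): it remembers the last timestamp line it saw and attaches it to every "IPC Send timeout" line that follows.

```python
import re

# Alert-log timestamps appear on their own line, e.g. "Thu Jul 07 02:31:43 2011".
TS_RE = re.compile(r'^\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}$')

def ipc_timeouts(lines):
    """Return (timestamp, line) pairs for every 'IPC Send timeout' line,
    tagged with the most recent timestamp seen above it."""
    events, last_ts = [], None
    for raw in lines:
        line = raw.rstrip('\n')
        if TS_RE.match(line):
            last_ts = line
        elif 'IPC Send timeout' in line:
            events.append((last_ts, line))
    return events

# Usage against the real file (the path is only an example, derived from
# diagnostic_dest=/md1/app/oracle, db_name=lnred, instance lnred2):
#   with open('/md1/app/oracle/diag/rdbms/lnred/lnred2/trace/alert_lnred2.log') as f:
#       for ts, msg in ipc_timeouts(f):
#           print(ts, '|', msg)
```

Running it on both alert logs makes it easier to see that all the timeouts point at the same receiver process on instance 1. Anyway, the Mars alert log continues: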
Thu Jul 07 02:42:28 2011
Reconfiguration started (old inc 36, new inc 38)
List of instances: 1 2 (myinst: 2)
Global Resource Directory frozen
Communication channels reestablished
Thu Jul 07 02:42:28 2011
* domain 0 valid = 1 according to instance 1
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Thu Jul 07 02:42:28 2011
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Jul 07 02:42:28 2011
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Thu Jul 07 02:42:29 2011
minact-scn: Master returning as live inst:1 has inc# mismatch instinc:0 cur:38 errcnt:0
Reconfiguration complete
Thu Jul 07 02:56:18 2011
db_recovery_file_dest_size of 512000 MB is 80.48% used. This is a
user-specified limit on the amount of space that will be used by this
database for recovery-related files, and does not reflect the amount of
space available in the underlying filesystem or ASM diskgroup.
Thu Jul 07 04:09:37 2011
Thread 2 advanced to log sequence 5462 (LGWR switch)
  Current log# 8 seq# 5462 mem# 0: +DATA/lnred/onlinelog/group_8.306.743814913
Thu Jul 07 04:09:39 2011

Db alert log on Venus:

Wed Jul 06 22:09:06 2011
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_1_seq_10394.415.755808529
Deleted Oracle managed file +FRA/lnred/archivelog/2011_07_06/thread_2_seq_5418.421.755808553
Archived Log entry 3044 added for thread 1 sequence 10408 ID 0xa9ba4d33 dest 1:
Thu Jul 07 02:42:16 2011
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Private Interface 'aggr2:1' configured from GPnP for use as a private interconnect.
  [name='aggr2:1', type=1, ip=169.254.16.251, mac=00-1b-21-49-f4-34, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62]
Public Interface 'aggr1' configured from GPnP for use as a public interface.
  [name='aggr1', type=1, ip=10.122.10.150, mac=00-1b-21-49-f4-30, net=10.122.10.0/23, mask=255.255.254.0, use=public/1]
Picked latch-free SCN scheme 3
Using LOG_ARCHIVE_DEST_1 parameter default value as USE_DB_RECOVERY_FILE_DEST
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up:
Oracle Database 11g Release 11.2.0.2.0 - 64bit Production
With the Real Application Clusters option.
Using parameter settings in server-side pfile /md1/app/oracle/product/11.2.0/db1/dbs/initlnred1.ora
System parameters with non-default values:
  processes                 = 300
  sessions                  = 480
  timed_statistics          = TRUE
  spfile                    = "+DATA/lnred/spfilelnred.ora"
  nls_language              = "POLISH"
  nls_territory             = "POLAND"
  nls_length_semantics      = "CHAR"
  resource_manager_plan     = ""
  sga_target                = 8G
  control_files             = "+DATA/lnred/controlfile/current.256.741561271"
  db_block_size             = 8192
  compatible                = "11.2.0.0.0"
  cluster_database          = TRUE
  db_create_file_dest       = "+DATA"
  db_recovery_file_dest     = "+FRA"
  db_recovery_file_dest_size= 500G
  thread                    = 1
  undo_tablespace           = "UNDOTBS1"
  instance_number           = 1
  remote_login_passwordfile = "EXCLUSIVE"
  db_domain                 = ""
  dispatchers               = "(PROTOCOL=TCP) (SERVICE=lnredXDB)"
  remote_listener           = "lnscan.adln.local:1521"
  audit_file_dest           = "/md1/app/oracle/admin/lnred/adump"
  optimizer_features_enable = "11.1.0.7"
  audit_trail               = "DB"
  db_name                   = "lnred"
  open_cursors              = 300
  pga_aggregate_target      = 8G
  aq_tm_processes           = 1
  diagnostic_dest           = "/md1/app/oracle"
Cluster communication is configured to use the following interface(s) for this instance
  169.254.16.251
cluster interconnect IPC version: Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
Thu Jul 07 02:42:25 2011
PMON started with pid=2, OS id=1914

OS messages on Venus:

Jul 7 02:34:38 Venus unix: [ID 836849 kern.notice]
Jul 7 02:34:38
Venus ^Mpanic[cpu6]/thread=ffffffff9bec7e60:
Jul 7 02:34:38 Venus genunix: [ID 156897 kern.notice] forced crash dump initiated at user request
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017ceeb0 genunix:kadmin+517 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef00 genunix:uadmin+c7 ()
Jul 7 02:34:38 Venus genunix: [ID 655072 kern.notice] fffffe80017cef10 unix:brand_sys_syscall+21d ()
Jul 7 02:34:38 Venus unix: [ID 100000 kern.notice]
Jul 7 02:34:38 Venus genunix: [ID 672855 kern.notice] syncing file systems...
Jul 7 02:34:39 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:39 Venus genunix: [ID 733762 kern.notice] 220
Jul 7 02:34:40 Venus genunix: [ID 433738 kern.notice] [6]
Jul 7 02:34:40 Venus genunix: [ID 733762 kern.notice] 205
Jul 7 02:34:42 Venus genunix: [ID 433738 kern.notice] [6]

So... it isn't much information... but in the grid log on Venus (grid/log/venus/crsd/crsd.log) I found:

2011-07-07 02:02:10.052: [ CRSPE][47] {1:3804:25345} Processing PE command id=25249.
Description: [Stat Resource : 3703910]
2011-07-07 02:02:10.054: [UiServer][48] {1:3804:25345} Done for ctx=32ca350

Then there is some garbage in that log (maybe from the same communication problem):

[NUL] [NUL] [NUL] ... (a long run of NUL bytes)
[ clsdmt][3]Listening to (ADDRESS=(PROTOCOL=ipc)(KEY=venusDBG_CRSD))
2011-07-07 02:41:48.808: [ clsdmt][3]PID for the Process [1619], connkey 1

But on Mars, in grid/log/mars/crsd/crsd.log:

2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server received the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:19:42.393: [ AGFW][42] {0:8:6} Agfw Proxy Server replying to the message: RESOURCE_STATUS[Proxy] ID 20481:58
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] Ipc: Couldnt clscreceive message, no message: 11
2011-07-07 02:20:48.482: [ CRSCOMM][39] Ipc: Client disconnected.
2011-07-07 02:20:48.482: [ CRSCOMM][39][FFAIL] IpcL: Listener got clsc error 11 for memNum. 8
2011-07-07 02:20:48.482: [ CRSCOMM][39] IpcL: connection to member 8 has been removed
2011-07-07 02:20:48.483: [CLSFRAME][39] Removing IPC Member:{Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [CLSFRAME][39] Disconnected from AGENT process: {Relative|Node:0|Process:8|Type:3}
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agfw Proxy Server received process disconnected notification, count=1
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} /md1/app/11.2.0/grid/bin/oraagent_oracle disconnected.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Agent /md1/app/11.2.0/grid/bin/oraagent_oracle[18546] stopped!
2011-07-07 02:20:48.483: [ CRSCOMM][42] {2:44971:2690} IpcL: removeConnection: Member 8 does not exist.
2011-07-07 02:20:48.483: [ AGFW][42] {2:44971:2690} Created alert : (:CRSAGF00129:) : The agent: /md1/app/11.2.0/grid/bin/oraagent has been disabled because it exceeded the maximum failures (5) over a period of 600 seconds.

At this point the Venus OS was rebooted (I think via IPMI, triggered from Mars):

2011-07-07 02:34:39.963: [ CRSSE][32] Forwarding Node Leave to PE for: venus
2011-07-07 02:34:39.965: [ CRSCCL][34]Reconfig event received by clssgsgrpstat
2011-07-07 02:34:39.965: [ CRSCCL][34]cclGetMemberData called
2011-07-07 02:34:39.979: [ CRSCCL][34]Member (2, 307107, mars:11.2.0.2.0) @ found.
2011-07-07 02:34:39.999: [CLSFRAME][34] CCL MEMBER LEFT:1:1:CRSD:venus
2011-07-07 02:34:39.999: [CLSFRAME][34] Disconnected from CRSD:venus process:

So... what should I do? Do you have any ideas?

Regards,
----------------------------------------------------
Robert Grzesiak
Senior Developer, Oracle DBA
LexisNexis Polska Sp. z o.o.
tel office: +48 12 426 00 03
email: Robert.Grzesiak@xxxxxxxxxxxxx
www.lexisnexis.pl
Biurowiec Galileo, ul. Armii Krajowej 18, 30-150 Kraków
XII Wydział Gospodarczy Krajowego Rejestru Sądowego KRS 0000035401
Właściciel: Reed Elsevier Overseas B.V.
Kapitał zakładowy: 8 200 000 PLN, REGON: 010860143, NIP: 526-10-13-257
----------------------------------------------------