A customer ran into similar problems with OCFS2 on RHEL4 Update 4 (smp kernel). Heavy db updates or mixed I/O (cp from ocfs to ext3, Oracle export to ext3) would cause the cluster to become unresponsive and crash a node. cp and exp caused a high load average and heavy swapping. We couldn't even ssh to the host. I didn't understand the heavy swapping because there was 3GB of cache memory available (shown by free -m). It had something to do with ocfs and low memory usage. I never got a clear answer on it.
They ended up setting "vm.lower_zone_protection=100", which helped with the swapping issue.
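For anyone chasing the same symptom, here is a minimal sketch for comparing low-memory headroom with the page cache figure that free -m reports. It assumes a kernel built with highmem support (such as a 32-bit smp kernel), where /proc/meminfo carries LowTotal and LowFree lines; on other kernels those keys come back as None. The function names are mine, and whether low-memory exhaustion was really the cause here was never confirmed.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def low_memory_report(text):
    """Pick out the fields relevant to low-memory pressure.

    LowTotal/LowFree are an assumption: they only exist on highmem kernels,
    so a missing key is reported as None rather than raising.
    """
    info = parse_meminfo(text)
    keys = ("MemFree", "Cached", "LowTotal", "LowFree", "SwapFree")
    return {key: info.get(key) for key in keys}
```

To use it on a live host, pass in the contents of /proc/meminfo, e.g. low_memory_report(open("/proc/meminfo").read()). A large Cached value next to a very small LowFree would at least be consistent with swapping while cache memory looks plentiful.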
The fencing problem was attributed to the following init.ora parameters:

filesystemio_options = asynch
disk_asynch_io = TRUE

They were changed to:

disk_asynch_io=FALSE
filesystemio_options='DIRECTIO'

Things have improved since.

I asked Oracle for a good document on OCFS2 and RAC and still haven't received a response.
I also asked for optimal kernel parameter settings for OCFS2. The closest I got was the following list, but no values.

- vm.swappiness
- vm.lower_zone_protection
- vm.vfs_cache_pressure
- vm.dirty_ratio
- vm.dirty_background_ratio

I'm not sure about the "unbreakable" Oracle/Linux combo. I'd be happy if they focused on "stable" Oracle/Linux.
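On the kernel parameter list above: since no recommended values were given, a reasonable first step is to record what each node is currently running with. The sketch below reads the five parameters from /proc/sys/vm, which is where the kernel exposes vm.* sysctls. The function name and the base argument are my own, and a parameter that does not exist on a given kernel (vm.lower_zone_protection is not present on all kernels) is reported as None.

```python
import os

# The five parameters from the list; names are relative to /proc/sys/vm.
VM_PARAMS = [
    "swappiness",
    "lower_zone_protection",
    "vfs_cache_pressure",
    "dirty_ratio",
    "dirty_background_ratio",
]


def read_vm_params(base="/proc/sys/vm", names=VM_PARAMS):
    """Return {"vm.<name>": value-as-string, or None if unavailable}."""
    result = {}
    for name in names:
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                result["vm." + name] = f.read().strip()
        except OSError:
            # Not present on this kernel, or not readable by this user.
            result["vm." + name] = None
    return result
```

Running read_vm_params() on each node and comparing the results gives a baseline to keep before changing anything. It does not tell you what the values should be.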
It comes back to "You get what you pay for". Customers think that Oracle spends as much money on the "freebies" (i.e. OCFS) as they do on the database.
my 2¢

P.S. I spend as much time on Bugzilla as I do on Metalink these days.

On Dec 28, 2006, at 11:14 AM, Kevin Closson wrote:
And to point out that I'm not being obtuse, here is a snippet from http://oss.oracle.com/bugzilla/show_bug.cgi?id=822 :

Environment: Linux x86-64 Redhat 4.0 Update 3, OCFS2 1.2.3, 3-node cluster.

Problem: After installation, created two filesystems to be used for software. To limit timeout problems, increased the O2CB_HEARTBEAT_THRESHOLD to 31. During maintenance window, decided to use the OCFS2 filesystem to store a large backup file (about 5-10 gig file). SCP'ed the file from an outside server to node1 of the cluster using command "scp $file oracle@sachlp10:/ocfs2_fs1/". After a few minutes, node1 crashed. Did not find error messages on node1, but found them in /var/log/messages on node2:

...wow, sounds like a pretty aggressive workload, right?

--
http://www.freelists.org/webpage/oracle-l