Re: RAC on OCFS2 acceptance testing

A customer ran into similar problems with OCFS2 on RHEL4 Update 4 (smp kernel). Heavy db updates or mixed I/O (cp from OCFS to ext3, Oracle export to ext3) would cause the cluster to become unresponsive and crash a node. The cp and exp runs caused a high load average and heavy swapping; we couldn't even ssh to the host. I didn't understand the heavy swapping, because free -m showed 3GB of cache memory available. It had something to do with OCFS and low mem usage, but I never got a clear answer on it.


They ended up setting "vm.lower_zone_protection=100", which helped with the swapping issue.

The fencing problem was attributed to the following init.ora parameters:
filesystemio_options     = asynch
disk_asynch_io           = TRUE

They were changed to:
disk_asynch_io=FALSE
filesystemio_options='DIRECTIO'
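
For anyone who wants to check a pfile for the same two settings, here is a rough Python sketch. The parser and the helper names (parse_pfile, uses_async_io) are made up for illustration; it only handles plain "name = value" lines with # comments, and it only looks at values that are set explicitly, not at Oracle's defaults.

```python
def parse_pfile(text):
    """Parse simple 'name = value' lines from init.ora-style text.

    Assumption: one parameter per line, '#' starts a comment,
    no multi-line or nested values.
    """
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        # Parameter names are case-insensitive; strip optional quotes.
        params[name.strip().lower()] = value.strip().strip("'\"")
    return params


def uses_async_io(params):
    """True if either parameter explicitly requests asynchronous I/O."""
    fs_opt = params.get("filesystemio_options", "").lower()
    disk_async = params.get("disk_asynch_io", "").lower()
    # 'setall' enables both asynchronous and direct I/O.
    return fs_opt in ("asynch", "setall") or disk_async == "true"


# The two settings exactly as quoted in this post.
BEFORE = """
filesystemio_options     = asynch
disk_asynch_io           = TRUE
"""

AFTER = """
disk_asynch_io=FALSE
filesystemio_options='DIRECTIO'
"""

if __name__ == "__main__":
    print("before:", uses_async_io(parse_pfile(BEFORE)))
    print("after: ", uses_async_io(parse_pfile(AFTER)))
```

This only reads text; it says nothing about what the running instance is actually using.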

Things have improved since.

I asked Oracle for a good document for OCFS2 and RAC and still haven't got a response.
I also asked for optimal kernel parameter settings for OCFS2.

The closest I got was the following list, but no values.
- vm.swappiness
- vm.lower_zone_protection
- vm.vfs_cache_pressure
- vm.dirty_ratio
- vm.dirty_background_ratio
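
Since I never got values, the best I can offer is a way to see what a node is currently running with. A minimal Python sketch, assuming the usual /proc/sys/vm layout; the function name read_vm_tunables is my own, and a parameter the running kernel doesn't have (lower_zone_protection is not present on every kernel) is reported as missing rather than guessed at.

```python
from pathlib import Path

# The five tunables from the list above; no recommended values are implied.
VM_TUNABLES = [
    "swappiness",
    "lower_zone_protection",
    "vfs_cache_pressure",
    "dirty_ratio",
    "dirty_background_ratio",
]


def read_vm_tunables(names=VM_TUNABLES, root="/proc/sys/vm"):
    """Return {name: current value as a string, or None if unreadable}."""
    values = {}
    for name in names:
        path = Path(root) / name
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # File missing or unreadable on this kernel.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        shown = value if value is not None else "(not present)"
        print(f"vm.{name} = {shown}")
```

Running it on each node before and after a change at least gives you something concrete to put in the SR.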

I'm not sure about the "unbreakable" Oracle/Linux combo. I'd be happy if they focused on "stable" Oracle/Linux.

It comes back to "You get what you pay for". Customers think that Oracle spends as much money on the "freebies" (i.e. OCFS) as they do on the database.

my 2¢

P.S. I spend as much time on Bugzilla as I do on Metalink these days.


On Dec 28, 2006, at 11:14 AM, Kevin Closson wrote:


And to point out that I'm not being obtuse,
here is a snippet from
http://oss.oracle.com/bugzilla/show_bug.cgi?id=822 :


Environment:
   Linux x86-64  Redhat 4.0 Update 3
   OCFS2 1.2.3  3-node cluster.
Problem:
   After installation, created two filesystems to be used for software.
   To limit timeout problems, increased the O2CB_HEARTBEAT_THRESHOLD to 31.

   During maintenance window, decided to use the OCFS2 filesystem
   to store a large backup file (about 5-10 gig file).
   SCP'ed the file from an outside server to node1 of the cluster
   using command "scp $file oracle@sachlp10:/ocfs2_fs1/."

   After a few minutes, node1 crashed.
   Did not find error messages on node1, but found them in
   /var/log/messages on node2:

...wow, sounds like a pretty aggressive workload, right?
--
http://www.freelists.org/webpage/oracle-l


