Re: RAC on OCFS2 acceptance testing

  • From: Mladen Gogala <mgogala@xxxxxxxxxxx>
  • To: sperry@xxxxxxxxxxx
  • Date: Sat, 30 Dec 2006 17:42:37 -0500

Comments in-line:


On 12/30/2006 12:17:57 PM, Steve Perry wrote:
> A customer ran into similar problem(s) with OCFS2 and RHEL4 upd 4  
> (smp kernel).
> heavy db updates or mixed io (cp from ocfs to ext3, oracle export to  
> ext3) would cause the cluster to become unresponsive and crash a node.
> cp and exp caused a high load avg and heavy swapping. We couldn't  
> even ssh to the host.
> I didn't understand the heavy swapping because there was 3GB of cache  
> mem available (shown by free -m).
> something to do with ocfs and low mem usage. I never got a clear  
> answer on it.
> 
> They ended up setting "vm.lower_zone_protection=100" which helped the  
> swapping issue.

The vm.lower_zone_protection parameter keeps a certain portion of low physical
memory out of reach of pageable user allocations. On MVS, something similar
used to be known as the "VIRTUAL=REAL boundary". Conveniently, the units are
roughly megabytes, which means that you shielded about 100M of low memory from
user allocations. In particular, that means that the OCFS kernel module will
not be able to allocate user buffers from the memory below the 100M boundary.
The reason for that is a set of "features" in the Linux kernel, more or less
openly admitted in the documentation for this parameter. Here is an excerpt
from the documentation:

/usr/share/doc/kernel-doc-2.6.17/Documentation/filesystems/proc.txt

lower_zone_protection
---------------------

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lower_zone_protection' tunable determines how aggressive the kernel is
in defending these lower zones.  The default value is zero - no
protection at all.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should increase the lower_zone_protection setting.

The units of this tunable are fairly vague.  It is approximately equal
to "megabytes".  So setting lower_zone_protection=100 will protect around 100
megabytes of the lowmem zone from user allocations.  It will also make
those 100 megabytes unavailable for use by applications and by
pagecache, so there is a cost.
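
(End of the documentation excerpt.) As a minimal sketch of how the setting
discussed above could be inspected and persisted: sysctl names map to files
under /proc/sys, so vm.lower_zone_protection lives at
/proc/sys/vm/lower_zone_protection, and a "name = value" line in
/etc/sysctl.conf keeps the value across reboots. The helper names below are
hypothetical, and the tunable exists only on kernels that still provide it, so
a missing file is reported as None rather than as an error.

```python
import os

def sysctl_path(name, proc_root="/proc/sys"):
    """Map a dotted sysctl name (e.g. 'vm.lower_zone_protection') to its /proc path."""
    return os.path.join(proc_root, *name.split("."))

def read_sysctl(name, proc_root="/proc/sys"):
    """Return the current value as a string, or None if the kernel lacks the tunable."""
    try:
        with open(sysctl_path(name, proc_root)) as f:
            return f.read().strip()
    except OSError:
        return None

def sysctl_conf_line(name, value):
    """Render the line that would go into /etc/sysctl.conf to persist the value."""
    return f"{name} = {value}"
```

Calling read_sysctl("vm.lower_zone_protection") on the affected host shows
whether the value of 100 actually took effect, and
sysctl_conf_line("vm.lower_zone_protection", 100) gives the line to persist it.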



> 
> The fencing problem was attributed to the following init.ora parms.
> filesystemio_options     = asynch
> disk_asynch_io           = TRUE
> 
> they were changed to:
> disk_asynch_io=FALSE
> filesystemio_options='DIRECTIO'


Neither OCFS nor OCFS2 supports asynchronous I/O. They both allow only direct
I/O. By attempting to use asynchronous I/O, you may crash your system or your
database. That is well documented on the OCFS site.
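
A minimal sketch, assuming a plain-text init.ora, of how the two settings
quoted above could be checked before a database is started on OCFS. The
function names are hypothetical, and the rule it encodes is the recommendation
from this thread, not an Oracle-supplied check. Only explicitly set parameters
are examined; defaults are not modelled.

```python
def parse_init_ora(text):
    """Parse 'name = value' lines into a dict with lower-cased names and values."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding blanks
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip().lower()] = value.strip().strip("'\"").lower()
    return params

def async_io_warnings(params):
    """Flag the settings that this thread blamed for the fencing problem."""
    warnings = []
    if params.get("disk_asynch_io") == "true":
        warnings.append("disk_asynch_io is TRUE")
    # 'setall' requests both asynchronous and direct I/O, so it is flagged too.
    if params.get("filesystemio_options") in ("asynch", "setall"):
        warnings.append("filesystemio_options enables asynchronous I/O")
    return warnings
```

With the original parameters the function returns two warnings; with
disk_asynch_io=FALSE and filesystemio_options='DIRECTIO' it returns an empty
list.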


> 
> Things have improved since.
> 
> I asked Oracle for a good document for OCFS2 and RAC and still  
> haven't got a response.
> I also asked for optimal kernel parameter settings for OCFS2.
> 
> The closest I got was the following list, but no values.
> - vm.swappiness
> - vm.lower_zone_protection
> - vm.vfs_cache_pressure
> - vm.dirty_ratio
> - vm.dirty_background_ratio

Here we have to deal with the fact that the Linux kernel is less than perfect,
to say the least. Of those parameters, swappiness and vfs_cache_pressure are
so-called "composite parameters" which regulate a "tendency", which means that
you don't get to see an accurate parameter description without plunging into
the kernel code. Both of these parameters regulate the "aggressiveness" of the
OS with swapping/page stealing or with replacing inodes and directory entries.
I find them best set to 0. I was playing with "swappiness" and I found that a
higher value turns on aggressive page swapping, which will slow down your
system.

Dirty ratio and background ratio are parameters for modified page write-back.
Due to Linux kernel problems, you don't have any tools which would help you
diagnose problems with the page write-back. You don't have anything even
remotely like the VMS "monitor page" and "monitor pool" commands. Linux itself
is inferior to SYSVR4 Unix derivatives like AIX or HP-UX and is certainly
inferior to Solaris, which took a significant step ahead and removed itself
from the pervasive SYSVR4 standard. Without being able to monitor the effects,
those parameters should be left alone.

The parameter that you should set to at least 15% of your memory, to ensure
ease of memory allocation, is vm.min_free_kbytes. This parameter sets the
target amount of memory for the page replacement daemon to keep free and
available for "malloc" calls.
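
The arithmetic behind that 15% figure can be sketched as follows. This assumes
the total is taken from the MemTotal line of /proc/meminfo, which is reported
in kB, the same unit vm.min_free_kbytes uses. The function names are
hypothetical and the 15% is only the rule of thumb given above.

```python
def mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

def min_free_kbytes_target(total_kb, percent=15):
    """Return the given percentage of total memory in kB, using integer arithmetic."""
    return total_kb * percent // 100
```

For a machine reporting MemTotal of 4194304 kB (4 GB), the target works out to
629145 kB.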


> 
> I'm not sure about "unbreakable" Oracle/Linux combo. I'd be happy if  
> they focused on "stable" Oracle/Linux.

They first have to make Linux as capable as the other operating systems and
add instrumentation for monitoring and diagnostics. When that is done, the
first Mogens theorem will apply:
"Anything that is sufficiently instrumented is obsolete".
That is the theorem from the Oak Table book.


> 
> It comes back to "You get what you pay for". Customers think that  
> Oracle spends as much money on the "freebies" (i.e. OCFS) as they do  
> the database.

That has always been the case. Anybody in their right mind should expect to pay
for decent things. There is a reason why a Cadillac costs more than a Yugo.


-- 
Mladen Gogala
http://www.mladen-gogala.com

--
//www.freelists.org/webpage/oracle-l

