RE: ocssd

  • From: "Kevin Closson" <kevinc@xxxxxxxxxxxxx>
  • To: "Alex Gorbachev" <gorbyx@xxxxxxxxx>
  • Date: Fri, 19 May 2006 15:27:06 -0700

 >>>
>>>That's what I was told today during our ASM training (or 
>>>what I assumed based on information I got). For Oracle, IO 
>>>fencing requires at least one voting disk and this is what's 
>>>configured with CRS (voting disks). In a non-RAC setup, you 
>>>don't specify voting disks, thus no IO fencing. Right? From 
>>>my point of view, IO fencing is only needed to support 
>>>split-brain resolution for a clustered setup and to evict 
>>>nodes (anything I missed?).



...OK, this makes sense, Alex. But applying the term "I/O fencing"
to the thing that CRS does is a little off base. Not to get
crazy about semantics, but "I/O Fencing" is a term applied to
technology that isolates a server from I/O. That is, it remains
alive, but CANNOT touch storage.

What CRS does is referred to as "Server Fencing". The approach CRS
takes to Server Fencing is routinely mislabeled STONITH (Shoot
The Other Node In The Head). CRS does not, in fact, implement STONITH.
What CRS implements is something that still has not been given a
name. I call it ATONTRI (Ask The Other Node To Reboot Itself). You
can read /etc/init.d/init.cssd on a Linux RAC system to see what I
mean; a rough sketch of the shape follows below.
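
To make this concrete, here is a minimal sketch of the ATONTRI
shape in shell. This is NOT Oracle's code; css_eviction_pending is
a hypothetical stand-in for however CSS decides this node has been
voted out of the cluster:

    #!/bin/sh
    # Minimal ATONTRI sketch (hypothetical, not init.cssd itself).
    css_eviction_pending() {
        test -f /tmp/evict_me    # stub flag file, for illustration
    }
    while true ; do
        if css_eviction_pending ; then
            # The node shoots *itself*: a user-mode fork/exec of a
            # dynamically linked binary, which has to succeed for
            # the fence to happen at all.
            /sbin/reboot -f
        fi
        sleep 1
    done
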
I'm not casting stones by any means; how could I, since I'm a
nobody? However, I assert that it is of the UTMOST importance that
IT professionals involved with a RAC implementation be keenly aware of
what they are actually running. It is lack of knowledge regarding the
underpinnings that will come back and bite you. I know everyone out
there builds these Linux RAC systems and is generally happy. But that
is the extent of it. Generally speaking, people are not harsh enough
on the technology. To foster assurance that your RAC kit is going to
hold together, you need to load up a test cluster and torture test it.
Yes, that means you will need to physically touch the servers,
switches, cables, GBICs, all of it. Inject faults. Inject multiple
cascading faults. Observe the results. Do you see ancillary failures
(e.g., you sever an I/O path from node 3 and boom, node 2 goes down
too)? Do you ever see any total outages? Etc. Fault injection testing
should be a natural when you shell out for such expensive software
(RAC). One simple injected fault is sketched below.
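
As one concrete example of an injected fault, and assuming (purely
for illustration) that eth1 is the private interconnect, you could
sever the interconnect on one node and watch what the rest of the
cluster does. Test cluster only, of course:

    # Hypothetical fault injection: silently drop all interconnect
    # traffic on this node (assumes eth1 is the private network).
    iptables -A INPUT  -i eth1 -j DROP
    iptables -A OUTPUT -o eth1 -j DROP

    # Remove the fault later and see if the cluster recovers:
    iptables -D INPUT  -i eth1 -j DROP
    iptables -D OUTPUT -o eth1 -j DROP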

Why is ATONTRI interesting? Well, consider what happens when the reason
a node is "being fenced" is a catatonic situation. That is,
a node in the RAC cluster is not performing its check-ins to the
CSS disk. OK, fine. But what if it isn't checking in because the
kernel is cranky? Uh, have you ever seen a system so overloaded
you can't execute a command? Ever seen a system thrashing in
desperate VM code? Of course we all have. So ask yourself how in
the world /etc/init.d/init.cssd is going to successfully execute
the reboot(8) command. How many fork calls is that? The shell forks
and execs reboot, and reboot is a dynamically linked binary. That
means the overloaded or catatonic server needs to be able to allow
the reboot command to mmap shared libraries, get file descriptors,
etc., etc., etc.... have you ever seen a system where that sort of
processing cannot get through the system? Of course you have.
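
You can see for yourself how much has to go right just to get reboot
running. Both commands below are safe (neither reboots anything),
and the exact output varies by distro:

    # reboot(8) is dynamically linked, so the runtime loader must
    # open and mmap these libraries before reboot can do anything:
    ldd /sbin/reboot

    # Count the system calls needed just to start a trivial
    # dynamically linked binary (a harmless stand-in for reboot):
    strace -c /bin/true

On a thrashing box, every one of those calls is a chance to stall.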

If you've made it this far, ponder for a moment what happens if a RAC
node has been told to ATONTRI and it didn't because it couldn't. It
is no longer a viable member of the cluster, but it sure has a path to
storage and it has electricity. What happens if the catatonic state
was transient? Maybe two minutes, who knows? Are there I/O requests
queued in the SCSI midlayer? Do you think those might be I/Os
headed for an Oracle datafile?

This is not FUD. These are real clustering concerns. This is why
PolyServe implements assured peer-fencing. There will never be
a "missed fencing operation" with our stuff...not even down to the
very last 2 nodes...and then there is no split-brain, because
we use a much more sophisticated membership algorithm than
simple "who's got more". A toy illustration of where "who's got
more" breaks down follows below.

Oracle instituted the Clusterware Compatibility Program 
http://www.oracle.com/technology/software/oce/oce_fact_sheet.htm 
under which we have been certified. The point of the program is to
make sure that host clusterware doesn't in fact weaken ATONTRI. In
our case, we add value because we run in kernel mode and nodes do
not fence themselves. I'm writing a piece describing the
relationship between CRS and non-integrated (compatible) host
clusterware, and how our two fencing options are more reliable than
other fencing technologies. I'll post a URL here when I finish it.
I never bothered writing it before because Oracle had no
certification program for us, so we only sold to shops that had a
mandate to go live and be successful at the high end with Linux
clustering, shops that had generally failed with other clustering
solutions.




>>>
>>>Today, I did a test - I created a second ASM instance on the 
>>>same host, and databases were not able to register with this 
>>>instance unless I registered this new ASM instance with CRS 
>>>(and consequently CSS is aware). There was some problem with 
>>>CRS behaving strangely, but that is another story.
>>>
>>>In the end, I won't tell you "yes, I am 100% sure" unless I 
>>>trace the processes involved. I am positive that this is the 
>>>case and I actually can try to trace it down (if I have 
>>>enough spare time).
>>>
>>>2006/5/19, Kevin Closson <kevinc@xxxxxxxxxxxxx>:
>>>>  >>>
>>>> >>>It was quite a while since this thread was posted, but in
>>>> >>>the meantime I figured out that ASM needs the ocssd daemon
>>>> >>>because this is the way of establishing communication
>>>> >>>between database instances and the ASM instance.
>>>>
>>>> Alex, are you sure that is why it needs ocssd and not just for 
>>>> "fencing" functionality?
>>>> --
>>>> //www.freelists.org/webpage/oracle-l
>>>>
>>>>
>>>>
>>>
>>>
>>>--
>>>Best regards,
>>>Alex Gorbachev
>>>
>>>http://oracloid.blogspot.com
>>>
--
//www.freelists.org/webpage/oracle-l

