Interesting. I hadn't considered system hangs caused by flaky hardware
or device drivers. This is something I used to encounter a lot (many
years ago) with SCO. I can easily imagine Linux suffering similar
problems for similar reasons, especially if you start using "exotic"
hardware. (I seem to recall a number of horror scenarios with SCO,
mostly stemming from using unusual combinations of hardware, like
Compaq-proprietary SCSI controllers with token-ring networks...)
Because Linux supports such a huge variety of devices, there's really no
way the hardware and device drivers can possibly be tested in every
combination.
And, of course, a "hung" node is a very different critter from a
"failed" node... Yeah, I could envision something like that causing a
cluster to hang...
Are such problems unique to Linux, though? I doubt it. But I *could*
see them maybe happening more frequently...
Chris, I am curious about this: to what extent did Oracle honour their
claims of "unbreakable Linux"? Did they ever "fix" the problem? Did
they even try?
Cheers, -- Mark.
Marquez, Chris wrote:
-----Original Message-----
David wrote:
Have any of you run into the issue whereby
you lost an entire RAC db (crash) due to an instance loss?
You want a count in total or just for one year? ;o)
I would be surprised if anyone running RAC had not, at some point, seen a RAC instance hang or crash cause the other instance to hang.
Granted, often the OS/config is the root cause and to blame, but I
have seen session "GC" waits on one node continue right to the other
PASSIVE node if we did not shut down the first dying node. I have seen a
controller error on NodeA hang the instance on NodeA and, during the
reconfigure, hang the instance on NodeB.
What fun...I can feel the panic of management as if it were yesterday. :o|
Chris Marquez Oracle DBA
-- //www.freelists.org/webpage/oracle-l