Single points of failure, how to identify them?

  • From: Guillermo Alan Bort <cicciuxdba@xxxxxxxxx>
  • To: oracle-l-freelists <oracle-l@xxxxxxxxxxxxx>
  • Date: Wed, 6 Jul 2011 16:19:04 -0300

A few days ago another business was hit by a bug and ended up with some
corruption on ASM. I'm not very familiar with what happened, which bug it
was or anything like that, but they had to restore a bunch of databases.
This got me thinking... we don't normally like to admit it, but we do have
single points of failure, and identifying them could help us be prepared to
deal with any issue impacting one of them (or find an alternative to
minimize downtime).

There are several things to consider when talking about points of failure,
and I might even start a blog series about this topic, but I will try to
describe what I consider to be a point of failure.

A point of failure is any part of a system (in our case, a computer system)
that, by not performing its task as designed, could cause problems in the
end result expected of the system. This means we also have to define what a
system is, and to make life easy I will define a system as a set of tools
and processes that transform something into something else (in our case,
information). Systems include hardware, software and human components.

When looking for points of failure in a system one must consider the full
extent of the system and then take a close look at each and every component
of that system and ask: If this here stops working, what will happen with
the entire system?

If the answer to that question is "the entire system will stop working",
well, that's a single point of failure...
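
The question above can be sketched as a small check. This is only a minimal
illustration, assuming the system can be described as a set of required
functions, each with a list of components able to perform it. The function
name, the component names and the layout are all made up for the example.

```python
from typing import Dict, List


def single_points_of_failure(providers: Dict[str, List[str]]) -> List[str]:
    """Return the components whose loss alone stops the whole system.

    `providers` maps each required function of the system to the
    components able to perform it (a hypothetical model, not a real API).
    """
    components = sorted({c for comps in providers.values() for c in comps})
    spofs = []
    for candidate in components:
        # "If this here stops working, what will happen with the entire system?"
        # The system survives only if every function still has another provider.
        survives = all(
            any(c != candidate for c in comps)
            for comps in providers.values()
        )
        if not survives:
            spofs.append(candidate)
    return spofs


# Hypothetical layout: two RAC nodes and two listeners, but one ASM disk group.
system = {
    "database instance": ["rac_node_1", "rac_node_2"],
    "storage": ["asm_diskgroup"],
    "listener": ["listener_a", "listener_b"],
}
print(single_points_of_failure(system))  # ['asm_diskgroup']
```

The model only covers a component that stops working outright. It says
nothing about a component that keeps running while doing the wrong thing,
which is where corruption comes in.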

What can we do to prevent the "entire system" from "stopping working"?
Usually it comes down to redundancy... we take the "piece" that is likely to
cause the system to not work and we throw in a couple of replacements that
will pick up its task should it fail. Some will even do the work at the same
time, increasing efficiency while everything runs smoothly (or giving a
headache to the system administrator).

Now, back to real life... we are DBAs, well, some of us are... and we manage
databases... so... what are the points of failure in a database and how do
you work your way around them? Have you ever found anything that cannot be
solved by redundancy? (Data corruption usually falls into this category.)
What do you do then?

Well, the RDBMS itself is a point of failure: if there is a bug hitting a
particular patchset, no matter how many RAC nodes you have, they will all
hit it (unless it's an intermittent bug!!!). ASM behaves pretty much the
same way.
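
The shared-patchset problem can be made visible with a similar sketch. It
assumes each redundant node is described by a few attributes and reports
whatever every node has in common, since a bug in a shared item hits all of
them at once. The attribute names, host names and version string are
illustrative only.

```python
from typing import Dict


def common_mode_risks(nodes: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Return the attribute/value pairs shared by every node.

    Anything returned here is something the redundant nodes do NOT
    protect against, because they all depend on the same thing.
    """
    if not nodes:
        return {}
    attr_sets = [set(attrs.items()) for attrs in nodes.values()]
    shared = set.intersection(*attr_sets)
    return dict(sorted(shared))


# Hypothetical two-node RAC: separate hosts, same RDBMS patchset.
rac = {
    "node1": {"rdbms_patchset": "11.2.0.2", "host": "srv1"},
    "node2": {"rdbms_patchset": "11.2.0.2", "host": "srv2"},
}
print(common_mode_risks(rac))  # {'rdbms_patchset': '11.2.0.2'}
```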

You can have multiple homes and have listeners of different versions ready
should you run into problems with any particular one.

User error... well, users are already redundant enough, let's not make them
more redundant :-P

cheers
Alan.-
