Re: Measure database availability beyond 99.9%

  • From: Ingrid Voigt <GiantPanda@xxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Fri, 29 Aug 2008 22:24:43 +0200

For many databases we only do database hosting; the applications are
the responsibility of the customer. The SLAs for these contain no
hard numbers other than availability and service times, and no
performance data.

So, the only thing we really need to measure here is the uptime
of the databases as such. For accessibility from the client side we
would need some sort of monitoring installed on the clients, which we
cannot always do. Besides, we "know" the network is stable. If
we can reach the databases, so can the customers.

For High Availability we use a simple Windows cluster with Oracle
Failsafe. Automatic failover takes about 53 seconds and has
occurred twice this year (we've been lucky). The customers know
that the SPOF for this solution is the SAN storage, but are not
willing to pay for something more reliable.

So we would like to give them numbers like "As long as there is no
storage failure, you get 99.99% availability. If there is - bad
luck." Maybe we can talk them into Data Guard.
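
To put rough numbers on this, here is a back-of-the-envelope calculation
in Python. It is only a sketch: it assumes a full 365-day measurement
window with 24x7 service time and counts nothing but the two automatic
failovers of about 53 seconds each as downtime. Planned maintenance and
any storage outage are left out on purpose, and the function names are
made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year, 24x7 service time


def downtime_budget_seconds(target_percent):
    """Downtime per year that an availability target still allows."""
    return SECONDS_PER_YEAR * (100.0 - target_percent) / 100.0


def availability_percent(downtime_seconds):
    """Availability over one year for a given total downtime."""
    return 100.0 * (1.0 - downtime_seconds / SECONDS_PER_YEAR)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        print(target, "% allows", round(downtime_budget_seconds(target)), "s per year")
    observed = 2 * 53  # two failovers at roughly 53 seconds each
    print("observed:", round(availability_percent(observed), 4), "%")
```

Under these assumptions 99.9% leaves about 8.76 hours per year and
99.99% about 52.6 minutes. Two 53-second failovers use up 106 seconds,
which works out to roughly 99.9997% as long as nothing else goes wrong.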

Business would also like to use these numbers in upcoming discussions
of service levels and prices, as well as for bragging.

A real monitoring system for the whole company (not only databases)
is being built, but will take time. There are several unsolved
problems in the proposed solution.



Niall Litchfield wrote:
Aaaarrrrgh! I'm sure there's a purpose that isn't lying to justify
expensive investments. I just cannot see it. Real HA must do service
level monitoring (aka can the users work). What you seem to propose
has no clear benefit; please tell me I'm wrong.

On 28/08/2008, Ingrid Voigt <GiantPanda@xxxxxxx> wrote:
Hi,

we are looking for a tool to measure and report the availability of our
databases in the HA range, i.e. with high precision. At this time we are
only interested in the database state, not whether the customers can work.

The database versions involved are 9.2 - 10.2, 11 coming next year. All
editions: SE1, SE and EE.

So far, we have been using EM Grid Control, but beyond 99.9% this is not
precise enough. Too many failures of the agent/the Grid Control system
rather than the database and too much time between "database back up"
and "agent notices that database is back up". A switch in the failsafe
clusters takes less than a minute and should be reported to the second,
if possible.

We can get startup time easily from a database trigger or the alertlog,
but have no good way to measure shutdown time so far. Is there
something good available (free would be nice) or do we have to build it
ourselves?
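
If we end up building it ourselves, one idea is a heartbeat: something
writes a timestamp every second or so while the database is open, and
any gap in the series is an outage whose start is known to within about
one heartbeat interval. The sketch below only does the gap analysis on a
list of timestamps. How the heartbeats get written (a job inside the
database, an external poller) is left open, and the function name,
interval and tolerance are assumptions for illustration, not an existing
tool.

```python
from datetime import datetime, timedelta


def find_outages(heartbeats, interval_s=1.0, tolerance_s=2.0):
    """Return (last_seen, seen_again, gap_seconds) for every suspicious gap.

    The database was last known up at last_seen. Assuming heartbeats
    arrive on schedule, the shutdown happened before the next one was
    due, so gap_seconds is an upper bound on the downtime.
    """
    ordered = sorted(heartbeats)
    outages = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = (cur - prev).total_seconds()
        if gap > interval_s + tolerance_s:
            outages.append((prev, cur, gap))
    return outages


if __name__ == "__main__":
    start = datetime(2008, 8, 29, 12, 0, 0)
    # Made-up data: 10 heartbeats, a 53-second hole, then 5 more heartbeats.
    beats = [start + timedelta(seconds=i) for i in range(10)]
    beats += [beats[-1] + timedelta(seconds=53 + i) for i in range(5)]
    for last_seen, seen_again, gap in find_outages(beats):
        print(last_seen, "->", seen_again, gap, "s")
```

The precision of such a measurement is limited by the heartbeat
interval, so a one-second interval would be needed to report a failover
to the second.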


Thanks for your help.


Regards
Ingrid Voigt
--
//www.freelists.org/webpage/oracle-l





