11g fault diagnosability infrastructure and poor documentation

From: Jeremiah Wilton <jeremiah@xxxxxxxxxxx>
To: ORACLE-L <oracle-l@xxxxxxxxxxxxx>
Date: Tue, 02 Oct 2007 18:29:56 -0700

Am I the only one who has been unable to do much with this feature due to the woefully absent documentation? Three components of "fault diagnosability" in particular seem very interesting:


- automatic hang detection
- automatic reactive "health checks"
- incident packages as a replacement for RDA

Hang detection seems like a great idea, but there is no information on precisely what constitutes a "hang" according to DIAG and DIA0. These processes seem never to wake up, even in the most dire of hanging situations. I did find that by default in single-instance databases, the _hang_resolution, _hm_analysis_output_disk and _hm_log_incidents parameters are set to FALSE, which I take to mean the feature is turned off. Even turned on, long hangs involving chains of waiters visible in hanganalyze output do not trigger any actions that I can discern. This is slightly complicated by the fact that two components of "fault diagnosability" share the initials HM, and packages, parameters and views use HM interchangeably to mean "hang manager" and "health monitor".
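
To make "chains of waiters" concrete, here is a minimal sketch of how such chains can be reconstructed from a map of waiting session to blocking session. It does not describe how DIA0 or hanganalyze work internally, which is exactly what the documentation fails to explain. The function name wait_chains, the blocked_by map and the session IDs are all hypothetical and used for illustration only.

```python
from typing import Dict, List, Optional


def wait_chains(blocked_by: Dict[int, Optional[int]]) -> List[List[int]]:
    """Return one chain per leaf waiter, from the waiter up to its root blocker.

    blocked_by maps a session ID to the session it is waiting on,
    or to None if the session is not waiting (hypothetical input format).
    """
    # Any session that some other session is waiting on.
    blockers = {b for b in blocked_by.values() if b is not None}
    chains = []
    for sid in sorted(blocked_by):
        # A leaf waiter is blocked itself but blocks nobody else.
        if blocked_by[sid] is None or sid in blockers:
            continue
        chain = [sid]
        seen = {sid}
        cur = blocked_by.get(sid)
        # Walk up the chain; the seen set stops the walk if it loops back.
        while cur is not None and cur not in seen:
            chain.append(cur)
            seen.add(cur)
            cur = blocked_by.get(cur)
        chains.append(chain)
    return chains


# Made-up sessions: 101 waits on 102, 102 and 104 wait on 103, 105 is idle.
sample = {101: 102, 102: 103, 103: None, 104: 103, 105: None}
for chain in wait_chains(sample):
    print(" -> ".join(str(s) for s in chain))
```

In this sketch a chain starts at a session that nobody else waits on, so a pure cycle with no such session (a deadlock ring) is not reported. A real hang detector would presumably also need a time threshold before calling a chain a "hang", and that threshold is the undocumented part.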

As for Health Checks, there is no documentation indicating what kinds of events or incidents might result in a "reactive" health check. The existence of reactive health checks is repeatedly asserted in the documentation, and there is even a parameter called _diag_hm_rc_enabled with the description "Parameter to enable/disable Diag HM Reactive Checks". Set to FALSE by default, this parameter does nothing in the event of a badly degraded and hanging system either. We are left to wonder what "reactive" health checks react to!

Finally, the incident packaging service works well enough, but is predicated completely upon the notion that any and all problems will be associated with a fatal error of some kind. Anything that does not dump ORA-600 or another fatal error will not result in an "incident" and thus there is nothing to package. There is apparently no provision for problems that do not dump on an error. Thus, despite the incident payloads having many of the same contents as the horrid RDA of yore, you apparently cannot generate one on demand in a supported way. You can shoot a server process with a SIGSEGV, but I cannot imagine that is how Oracle intends us to get diagnostic data for opening an SR.

You can probably detect that I am frustrated, but I have been playing with this feature set for weeks and it is a frustrating morass of nonworking undocumented wastes of server memory. Remember, we are all now running two extra background processes, DIAG and DIA0, just for this feature. They are up and running and using memory on all of our 11g systems even if they do nothing and are turned off at the parameter level by default.

I am ranting here in hopes that someone else has gotten further than I have or knows someone on the inside who can shed some light on these concerns.


Thanks,

Jeremiah Wilton
ORA-600 Consulting
--
//www.freelists.org/webpage/oracle-l

Follow-Ups:
- RE: 11g fault diagnosability infrastructure and poor documentation
  - From: Robert Freeman
- Re: 11g fault diagnosability infratructure and poor documentation
  - From: Andre van Winssen
