11g fault diagnosability infrastructure and poor documentation

  • From: Jeremiah Wilton <jeremiah@xxxxxxxxxxx>
  • To: ORACLE-L <oracle-l@xxxxxxxxxxxxx>
  • Date: Tue, 02 Oct 2007 18:29:56 -0700

Am I the only one who has been unable to do much with this feature due to the woefully absent documentation? Three components of "fault diagnosability" in particular seem very interesting:


- automatic hang detection
- automatic reactive "health checks"
- incident packages as a replacement for RDA

Hang detection seems like a great idea, but there is no information on precisely what constitutes a "hang" according to DIAG and DIA0. These processes seem never to wake up, even in the most dire of hanging situations. I did find that by default in single-instance databases, the _hang_resolution, _hm_analysis_output_disk and _hm_log_incidents parameters are set to FALSE, which I take to mean the feature is turned off. Even with them turned on, long hangs involving chains of waiters visible in hanganalyze output do not trigger any actions that I can discern. This is slightly complicated by the fact that two components of "fault diagnosability" share the initials HM, and packages, parameters and views use HM interchangeably to mean "hang manager" and "health monitor".

As for Health Checks, there is no documentation indicating what kinds of events or incidents might result in a "reactive" health check. The existence of reactive health checks is repeatedly asserted in the documentation, and there is even a parameter called _diag_hm_rc_enabled with the description "Parameter to enable/disable Diag HM Reactive Checks". Set to FALSE by default, this parameter does nothing in the event of a badly degraded and hanging system either. We are left to wonder what "reactive" health checks react to!
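For anyone who wants to check these settings on their own instance, here is a minimal sketch in Python that only builds the query text and bind variables for the four underscore parameters named above. It assumes the usual approach of joining the fixed tables x$ksppi and x$ksppcv, which normally requires a SYS connection. The function name and the bind names (p1, p2, ...) are made up for illustration, and running the statement through whatever client or driver you use is left out.

```python
# Hidden ("underscore") parameters named in this post.
HIDDEN_PARAMS = [
    "_hang_resolution",
    "_hm_analysis_output_disk",
    "_hm_log_incidents",
    "_diag_hm_rc_enabled",
]


def build_hidden_param_query(names):
    """Return (sql, binds) for listing the value and description of each parameter.

    Assumption: x$ksppi holds parameter names/descriptions and x$ksppcv holds
    the current values, joined on INDX. Both are visible only to SYS.
    """
    if not names:
        raise ValueError("at least one parameter name is required")
    # One bind placeholder per parameter name: :p1, :p2, ...
    placeholders = ", ".join(f":p{i}" for i in range(1, len(names) + 1))
    sql = (
        "SELECT i.ksppinm AS name, v.ksppstvl AS value, i.ksppdesc AS description\n"
        "  FROM x$ksppi i\n"
        "  JOIN x$ksppcv v ON v.indx = i.indx\n"
        f" WHERE i.ksppinm IN ({placeholders})\n"
        " ORDER BY i.ksppinm"
    )
    binds = {f"p{i}": name for i, name in enumerate(names, start=1)}
    return sql, binds


if __name__ == "__main__":
    sql, binds = build_hidden_param_query(HIDDEN_PARAMS)
    print(sql)
    print(binds)
```

The printed statement can be pasted into a SYS session with the binds substituted by hand. It only reports what the parameters are set to and says nothing about whether changing them makes DIAG or DIA0 do anything.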

Finally, the incident packaging service works well enough, but is predicated completely upon the notion that any and all problems will be associated with a fatal error of some kind. Anything that does not dump ORA-600 or another fatal error will not result in an "incident", and thus there is nothing to package. There is apparently no provision for problems that do not raise such an error. So, despite the incident payloads having many of the same contents as the horrid RDA of yore, you apparently cannot generate an incident package on demand in a supported way. You can shoot a server process with a SIGSEGV, but I cannot imagine that is how Oracle intends us to get diagnostic data for opening an SR.

You can probably detect that I am frustrated, but I have been playing with this feature set for weeks and it is a morass of nonworking, undocumented wastes of server memory. Remember, we are all now running two extra background processes, DIAG and DIA0, just for this feature. They are up and running and using memory on all of our 11g systems even if they do nothing and are turned off at the parameter level by default.

I am ranting here in hopes that someone else has gotten further than I have or knows someone on the inside who can shed some light on these concerns.

Thanks,

Jeremiah Wilton
ORA-600 Consulting
--
//www.freelists.org/webpage/oracle-l

