Re: 11g fault diagnosability infrastructure and poor documentation

  • From: "Andre van Winssen" <dreveewee@xxxxxxxxx>
  • To: jeremiah@xxxxxxxxxxx
  • Date: Wed, 3 Oct 2007 14:49:51 +0200

Hi,

The 11g engine is more complex than ever, which leads to 'better bugs' that
are harder to solve, a larger attack surface for hackers, more frequent
unexpected side effects of so-called fixes, and more difficulty finding
ways to turn off unwanted features.

As Dilbert said: "If it ain't broke, it doesn't have enough features yet."
Does 11g have enough features now?

Regards,
Andre


2007/10/3, Jeremiah Wilton <jeremiah@xxxxxxxxxxx>:
>
> Am I the only one who has been unable to do much with this feature due
> to the woefully absent documentation?  Three components of "fault
> diagnosability" in particular seem very interesting:
>
> - automatic hang detection
> - automatic reactive "health checks"
> - incident packages as a replacement for RDA
>
> Hang detection seems like a great idea, but there is no information on
> precisely what constitutes a "hang" according to DIAG and DIA0.  These
> processes seem never to wake up, even in the most dire of hanging
> situations.  I did find that by default in single-instance databases,
> the _hang_resolution, _hm_analysis_output_disk and _hm_log_incidents
> parameters are set to FALSE, which I take to mean the feature is turned
> off.  Even turned on, long hangs involving chains of waiters visible in
> hanganalyze output do not trigger any actions that I can discern. This
> is slightly complicated by the fact that two components of "fault
> diagnosability" share the initials HM, and packages, parameters and
> views use HM interchangeably to mean "hang manager" and "health monitor".
>
> As for Health Checks, there is no documentation indicating what kinds of
> events or incidents might result in a "reactive" health check.  The
> existence of reactive health checks is repeatedly asserted in the
> documentation, and there is even a parameter called _diag_hm_rc_enabled
> with the description "Parameter to enable/disable Diag HM Reactive
> Checks".  Set to FALSE by default, this parameter does nothing in the
> event of a badly degraded and hanging system either.  We are left to
> wonder what "reactive" health checks react to!
>
> Finally, the incident packaging service works well enough, but is
> predicated completely upon the notion that any and all problems will be
> associated with a fatal error of some kind.  Anything that does not dump
> ORA-600 or another fatal error will not result in an "incident" and thus
> there is nothing to package.  There is apparently no provision for
> problems that do not dump on an error, so despite the incident payloads
> having many of the same contents as the horrid RDA of yore, you cannot
> generate an incident package on demand in a supported way.  You can shoot a server
> process with a SIGSEGV, but I cannot imagine that is how Oracle intends
> us to get diagnostic data for opening an SR.
>
> You can probably tell that I am frustrated, but I have been playing
> with this feature set for weeks and it is a morass of nonworking,
> undocumented wastes of server memory.  Remember, we are all
> now running two extra background processes, DIAG and DIA0, just for this
> feature.  They are up and running and using memory on all of our 11g
> systems even if they do nothing and are turned off at the parameter
> level by default.
>
> I am ranting here in hopes that someone else has gotten further than I
> have or knows someone on the inside who can shed some light on these
> concerns.
>
> Thanks,
>
> Jeremiah Wilton
> ORA-600 Consulting
> --
> //www.freelists.org/webpage/oracle-l
