The performance problem seems to have been caused by memory (no proof was provided,
so it remains an assumption). Whenever uncorrectable memory errors are
detected (typically via ECC, which is a hardware/memory feature; Exadata does
use ECC memory), Linux is notified and marks the affected page 'HardwareCorrupted'.
This means that at the Linux layer, that statistic can be used to detect
memory failures.
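As a minimal sketch of that idea: the kernel exposes the counter in /proc/meminfo, so a small check like the following could flag corrupted pages (the parsing helper and threshold handling here are my own illustration, not anything from the thread):

```python
import re

def hardware_corrupted_kb(meminfo_text):
    """Parse the HardwareCorrupted counter (in kB) out of /proc/meminfo text."""
    match = re.search(r"^HardwareCorrupted:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        kb = hardware_corrupted_kb(f.read())
    if kb is not None and kb > 0:
        print(f"WARNING: {kb} kB of memory marked HardwareCorrupted")
    else:
        print(f"HardwareCorrupted: {kb} kB")
```

Note that this only catches uncorrectable errors the kernel has already acted on; the correctable-error thresholds discussed below live in the ILOM, not in this counter.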
However, with Exadata you get an ILOM, which does hardware management. Why
not use that functionality? You can use EM with an agent that the ILOM can
notify as its SNMP trap destination. You can also use the email option in the
ILOM, which sends an email whenever it finds an issue; this is a very nice
option, as these emails are very clear.
In order to analyse whether the memory problems caused the performance
degradation, you first need to establish the difference between the badly
performing and the well performing periods with regard to CPU usage and waits
at the Oracle level. Then look at the Linux level and see whether the problem
caused swapping, for example, which would mean much extra CPU being used
outside of the database.
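The swapping check at the Linux level can be sketched like this: /proc/vmstat keeps cumulative pswpin/pswpout counters (pages swapped in/out since boot), so a nonzero delta over an interval means the box is actively swapping. The helper names and interval are my own illustration:

```python
import time

def swap_counters(vmstat_text):
    """Extract pswpin/pswpout (cumulative pages swapped in/out) from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name in ("pswpin", "pswpout"):
            counters[name] = int(value)
    return counters

def swap_rate(interval=5):
    """Pages swapped in/out over the interval; nonzero values indicate active swapping."""
    with open("/proc/vmstat") as f:
        before = swap_counters(f.read())
    time.sleep(interval)
    with open("/proc/vmstat") as f:
        after = swap_counters(f.read())
    return {name: after[name] - before[name] for name in before}

if __name__ == "__main__":
    print(swap_rate())
```

The same numbers are visible in the si/so columns of vmstat output, if you prefer the standard tool.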
Frits Hoogland
http://fritshoogland.wordpress.com
frits.hoogland@xxxxxxxxx
Mobile: +31 6 14180860
On 2 Feb 2018, at 22:49, Q A I S E R <qrasheed@xxxxxxxxx> wrote:
It looks like the DIMM was bad on the database node, not on an Exadata
storage cell node. You do not need a script to monitor hardware; configure
EM instead. Oracle Enterprise Manager can be configured to monitor all components of
Exadata. In addition, if you had ASR configured, it would have automatically
detected the memory fault and raised a service request for you.
Your performance issue may very well be related to this event, as the SGA is
allocated in memory.
As for the kipmi0 process using over 100% CPU, please see MOS note Kipmi0
Using 100% CPU (Doc ID 1235235.1).
Thanks,
--Qaiser
On Fri, Feb 2, 2018 at 2:02 PM, Glenn Travis <Glenn.Travis@xxxxxxx> wrote:
Over the holidays we experienced some very poor performance on one of our
Exadata (X3-2 quarter rack) nodes. The other node was unaffected. We spent
several days running AWR reports and pursuing other performance tuning
avenues to diagnose it. At the database level we observed high I/O and some
complex, poorly performing SQL. At the server level we identified high CPU
usage. We bounced the databases several times over the next few days to
troubleshoot, but performance was only slightly better.
We decided to peruse the hardware logs and looked at the system log on the
ILOM for the node. We noticed we had 2 DIMMs receiving errors:
Event Type - DIMM Service Required
Subsystem – Memory
Component – P0/D4 (CPU 0 DIMM 4) and P0/D5
Message - The number of memory correctable errors has exceeded threshold
limit. (Probability:100, UUID:7b4fe74e-4fb1-4a69-c966-b38d1cc8dab5,
Resource:/SYS/MB/P0/D5)
My question is: do you think the poor performance is related to the memory
issues/errors? Can you tell from the errors what state the memory was in?
We also noticed at the server level (using top) that the [kipmi0] process was
using over 100% CPU on a constant basis. Is this normal? Is it related to
the memory errors?
Also, the ora_dia0_<SID> process was using near 100% CPU during this poor
performance event. Is this normal?
We resolved the issue (or the issue went away) after we scheduled an outage
and had Oracle replace the 2 bad DIMMs. Once the server was rebooted,
performance returned to normal.
We are wondering if there may have been something else going on, or was this
solely related to hardware. And curious about the 2 high cpu processes.
On a side note: Does anyone have a script/program/command to run (at regular
intervals) to check the state of the hardware? We usually get emailed for
hardware issues, but apparently this one was not bad enough to send an email.
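For a periodic check of the sort asked about here, one option is to poll the service processor's event log from the host with ipmitool and flag memory-related entries. The sketch below assumes ipmitool is installed and the host can reach the BMC; the keyword list and parsing are my own guess, since SEL line formats vary by platform:

```python
import subprocess

# Keywords that typically appear in memory/ECC events; adjust for your platform.
MEMORY_KEYWORDS = ("correctable ecc", "uncorrectable ecc", "memory")

def memory_events(sel_text):
    """Return SEL lines that look like memory/ECC events (simple keyword match)."""
    return [line for line in sel_text.splitlines()
            if any(keyword in line.lower() for keyword in MEMORY_KEYWORDS)]

def check_sel():
    """Run 'ipmitool sel list' (usually needs root) and print memory-related events."""
    out = subprocess.run(["ipmitool", "sel", "list"],
                         capture_output=True, text=True, check=True).stdout
    for line in memory_events(out):
        print("MEMORY EVENT:", line)

if __name__ == "__main__":
    check_sel()
```

Run from cron at whatever interval suits you; but as noted elsewhere in the thread, EM/ASR or the ILOM's own SNMP and email alerting are the supported ways to get notified on Exadata, so treat this only as a belt-and-braces check.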
Thanks all!
Glenn Travis
DBA ▪ Database Services
IT Enterprise Solutions
SAS Institute