MMON slaves spinning

From: De DBA <dedba@xxxxxxxxxx>
To: oracle-l@xxxxxxxxxxxxx
Date: Fri, 29 Nov 2013 12:09:00 +1000
G'day

Oracle 11.2.0.3 with Oct13 CPU, on RHEL 6

We have 2 servers: one runs just one instance - Production - with Memory_Target 
= 4G and Memory_Max_Target = 25G. The other server has 7 instances: an Active 
DG standby database (Memory_Target = 2G, no Memory_Max_Target set) and 6 
staging databases, each with Memory_Target = 4G and Memory_Max_Target = 8G. 
Both servers have 64G of physical RAM installed, and 30G of swap available. 
Top shows that no swap is in use on either.
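
As a quick sanity check on the memory figures above, here is a minimal Python sketch that adds up the configured targets per server. The numbers are the ones stated in this post; treating an unset Memory_Max_Target as equal to Memory_Target is an assumption, and the names (totals, staging_server, etc.) are only illustrative.

```python
# Memory figures (in GB) as stated in the post. Treating an unset
# Memory_Max_Target as equal to Memory_Target is an assumption.
PHYSICAL_RAM_GB = 64


def totals(instances):
    """Return (sum of Memory_Target, sum of Memory_Max_Target) in GB."""
    target = sum(t for t, _ in instances)
    # Fall back to Memory_Target where no Memory_Max_Target is set.
    ceiling = sum(t if m is None else m for t, m in instances)
    return target, ceiling


# Each entry is (Memory_Target, Memory_Max_Target or None if unset).
production = [(4, 25)]
staging_server = [(2, None)] + [(4, 8)] * 6  # standby + 6 staging databases

for name, instances in (("production", production), ("staging", staging_server)):
    target, ceiling = totals(instances)
    print(f"{name}: target {target} GB, max {ceiling} GB, "
          f"headroom {PHYSICAL_RAM_GB - ceiling} GB")
```

On these figures even the worst-case total on the staging server stays below the installed physical RAM, which is consistent with no swap being in use. The sketch ignores OS overhead and anything allocated outside the memory targets.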

All databases are identical - the staging databases are regularly re-created 
from the standby database using RMAN DUPLICATE. Each database is accessed by 
the same set of web applications (excluding the standby database, obviously). 
Modifications made in the staging applications are migrated to the production 
environment after testing and approval. TDE is used in every database.

There is also a matching (and underused) set of development databases, which 
are created identically to the staging databases, via an intermediary where 
sensitive data is masked.

Four of the staging databases suffer from time to time from a spinning MMON 
slave (usually M000), which may or may not block the Library Cache Mutex. When 
it does, no more sessions can log on and the database is for all intents and 
purposes down. The 2 staging databases that do not suffer this problem are 
recreated (and therefore restarted) daily. No other database suffers from this 
problem, even though all databases are identical in both content and 
configuration.

As the production database is part of a critical 24/7 environment, the fear is 
that this will eventually hit the production environment as well and cause 
large losses...

The spinning slave process is still alive and a system state dump of the 
spinning situation shows nothing out of the ordinary (except long lists of 
blocked processes when the mutex is locked). We tried the following:

- flush the Shared Pool - this provided only temporary relief (a few minutes)
- kill MMON and its slaves at the OS level (pkill -9 ora_mmon_stgx; <etc>) - 
immediately afterwards a new MMON process was started, and a slave spawned and 
started spinning
- bounce the instance - this provides relief for hours, sometimes even days.

The SQL area of a staging database with a spinning MMON slave does not show 
large numbers of child cursors; in fact the production database (which never 
suffers this problem, and is never bounced) has 10 times the number of cursors 
and child cursors. One thing I noticed in the staging alert logs is that 
sometimes, but not always, a spinning situation is preceded by an emergency ASH 
flush. This also never happens in the production instance. Symptoms of a 
spinning process (TNS timeout errors, other background processes failing to 
start, PMON failing to acquire a latch) always start appearing in the alert log 
shortly after the daily maintenance window closes. This is the standard 
Oracle-defined maintenance window and associated plans.

The MMON process logs absolutely no errors of any kind, so none of the 
scenarios that I can find using Google or MetaLink apply (they all seem to be 
associated with ORA-600 errors).

Any suggestion welcome :)

Cheers,
Tony
--
//www.freelists.org/webpage/oracle-l