We have observed several critical slowdowns of our production system since upgrading to 10g. The symptoms include a rapid rise in library cache latch waits (with associated, but less notable, rises in shared pool and library cache pin waits), CPU pegged at 100% with a load average in the triple digits, and massive shared pool resizing evidenced by v$sga_resize_ops (1.7 GB added in 15 seconds). Obviously, trying to run queries against the normal v$ views is rather slow (it only adds to the latching problem). We sent hanganalyze and systemstate dumps to Oracle Support, but after a couple of months of trying to figure things out, we have word that the initial support engineer may have misdiagnosed our problem.

Based on what little I know and have researched (reading a number of notes here, Metalink, Google, OraFAQ, etc.), a large number of literal SQL statements might be able to cause such symptoms. It is just hard for me to imagine the number of literal statements it would take to consume 1.7 GB of shared pool in 15 seconds. Perhaps it is not only quantity but quality as well, with each statement consuming a large chunk of real estate. The only problem with this conjecture is that I do not find evidence of it in v$sql (looking at the number of executions, number of literals, and SHARABLE_MEM).

But the puzzle is even more elusive. EM reports extremely (abnormally) large numbers of executions for seemingly random SQL statements. In one case, we had 3 statements that were each executed 35 million times in one hour (2 of those statements had bind variables; the other was a "simple" select). I can never catch these large numbers in v$sql, so I have to assume the statements age out by the time we start to look. I have asked the application folks to check whether these queries are related, and why the execution count is a couple of orders of magnitude greater than normal. Still waiting for resolution at that end.
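For the literal-SQL theory, one thing we could try (I have not verified this on our system yet) is grouping v$sql by the 10g FORCE_MATCHING_SIGNATURE column, which is the same for statements that differ only by their literals; the cutoff below is just illustrative:

```sql
-- Sketch: find families of cursors that differ only by literal values.
-- FORCE_MATCHING_SIGNATURE is 0 for some statement types, so filter those out.
SELECT force_matching_signature,
       COUNT(*)          AS versions,
       SUM(sharable_mem) AS total_sharable_mem,
       MIN(sql_text)     AS sample_text
FROM   v$sql
WHERE  force_matching_signature <> 0
GROUP  BY force_matching_signature
HAVING COUNT(*) > 20               -- arbitrary cutoff for "a lot of literals"
ORDER  BY total_sharable_mem DESC;
```

Of course, if cursors age out as fast as they seem to for the 35-million-execution statements, this would only catch the problem while it is actually happening.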
It seems to be the general consensus that these larger numbers are symptomatic as well, and that they only start to escalate when the system is severely slow and the middle tier starts to time out. NOTE: that is the working hypothesis; we are trying to prove it.

So, my question is two-fold. Do you experts out there have advice on how we can reactively figure out what already happened? And how can we proactively avoid this happening again? Because the automatic AWR snapshots only happen each hour, it is hard for us to get more granular when looking at a historical view. Perhaps we should temporarily increase the snapshot frequency?

Any and all advice is gladly welcome. I am hoping to learn some of the tricks and glean from your experience.

Thanks in advance,

PS - after reading about Kyle Hailey's and Roger Sanders' work, does anyone have access to a DMA for dummies? =)

charles schultz
oracle dba
aits - adsd
university of illinois
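PPS - for reference, I believe the AWR snapshot interval can be lowered with something along these lines (the interval is in minutes; I have not tried this in Production yet):

```sql
-- Sketch: take snapshots every 15 minutes instead of hourly.
-- Retention is left at its current setting.
BEGIN
  dbms_workload_repository.modify_snapshot_settings(interval => 15);
END;
/
```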