In a network sense, where is the client application with respect to the two
databases?
When you are using the “primary” database is the client at the “primary”
database location?
When you are using the “DR” database is the client at the “DR” database
location?
Depending on how the application handshakes information to be written and how
waits are rolled up, that can make all the difference.
A very quick test is to actually place a client on the LAN at the “DR” database
location, if in fact you are currently measuring the application's performance
from the “primary” database location in both cases.
Explaining how and why the performance difference can actually be network
latency and speed yet appear to be differences in DBWR service rate is beyond
the scope of this note.
mwf
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On
Behalf Of Pap
Sent: Friday, April 09, 2021 8:29 AM
To: Tanel Poder
Cc: Oracle L
Subject: Re: Difference in writer response time
Thank You Tanel.
I don't have much understanding of the infra side. But actually, the first
thing we want to understand is why there is a performance difference during
that peak period between the two sites. As those are primary and DR, ideally
everything should be the same, including the SAN configuration. We have
verified that the DB parameters are mostly the same on both, and none are
suspected of impacting DBWR throughput on one side versus the other.
But as you said, there may be some issue in the kernel or I/O stack, so we
wanted to understand exactly what information we should ask our storage/SAN
team for, so that we can conclude whether there really is a difference in
storage I/O performance between the two sides.
And should we ask for that I/O information only for the issue period? At
normal times both databases perform similarly; it is only during that peak
activity that DR behaves badly compared to primary.
Regards
Pap
On Fri, Apr 9, 2021 at 5:32 AM Tanel Poder <tanel@xxxxxxxxxxxxxx> wrote:
Ok, sorry I didn't realize this was on HP-UX (I guess I'm too spoiled nowadays
with "servers" like this one
<https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/>
under my desk ;-)
Some comments:
1. The free buffer waits show that DBWR falls behind with syncing dirty
buffers to disk (~4000 blocks written per second). So it looks like a DBWR
throughput problem (caused by some issue in the kernel or I/O stack)
2. Other than storage hardware latency itself and things like block device
queue depths, the filesystem itself may be a bottleneck - are you using VxFS -
and if yes, with or without ODM? VxFS without ODM doesn't use concurrent I/O by
default, potentially causing write-write contention at file level (POSIX single
writer locking)
3. Also, the normal recommendation for production systems is to use
filesystemio_options=setall, but I don't remember if HP-UX was special in that
sense (or you just forced direct I/O with a VxFS mount option)
4. Run Snapper on the DBWRs and see how many of them you have in each instance
and whether they are 100% busy waiting on db file parallel write or db file
async I/O submit (or showing pure CPU usage with no wait events whatsoever,
which can be due to an instrumentation bug)
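For point 4, a Snapper invocation might look roughly like this (a sketch, assuming Tanel's snapper script is available to SQL*Plus; the program-name pattern for background processes varies by platform, so treat the LIKE filter as an assumption):

```sql
-- Sample all DBWn sessions with one 30-second snapshot; adjust the
-- LIKE pattern to match your platform's background process naming.
@snapper all 30 1 "select sid from v$session where program like '%DBW%'"
```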
The various "higher level contention" wait events, buffer busy waits, enq: TX
contention and in rare cases even some latch contention may be just a symptom
of the underlying I/O problems. You need to do wait chain analysis to see
beyond the "top level symptoms".
The ash_wait_chains.sql output is pretty wide, Oracle-L doesn't allow me to
send screenshots as images, so here are a couple of tweets with images of such
problems (where the high level symptoms show different waits than the root
cause):
* https://twitter.com/TanelPoder/status/1318023393304825856
* https://twitter.com/TanelPoder/status/1380304923678498820
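A typical ash_wait_chains invocation looks something like this (a sketch; the parameters are the grouping columns, an ASH filter, and the time range):

```sql
-- Who is ultimately blocking whom over the last hour, grouped by
-- program and wait event of each session in the chain:
@ash_wait_chains program2||event2 session_type='FOREGROUND' sysdate-1/24 sysdate
```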
Back when I used to deal with HP-UX/Solaris machines with VxFS more, a really
common bottleneck in high-performance databases (of that era) was people not
using concurrent I/O on that filesystem.
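For reference, concurrent/direct I/O on VxFS is enabled via mount options; the exact option names vary by VxFS release and platform, so the entry below is only a hypothetical sketch (check mount_vxfs(1M) on your system):

```
# Hypothetical /etc/fstab-style entry enabling VxFS concurrent I/O (cio):
/dev/vgdata/lvol1  /u01/oradata  vxfs  delaylog,cio  0 2
# On releases without cio, direct I/O can be forced instead with:
#   mincache=direct,convosync=direct
```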
--
Tanel Poder
#vConf2021: Troubleshooting Very Complex Oracle Performance Problems
https://tanelpoder.com/conference/
On Thu, Apr 8, 2021 at 2:05 PM Pap <oracle.developer35@xxxxxxxxx> wrote:
Thank you Tanel.
It's an HP machine and it's on a file system, no ASM. filesystemio_options in
gv$parameter shows "asynch" in both databases, and disk_asynch_io is also set
to TRUE. As I see from AWR, it's an HP-UX IA (64-bit), 64-core, 8-socket
machine on both sides.
The SAN team confirms there is not much visible difference between the
underlying storage performance of the two sites. So, as we don't use Data
Guard, we are wondering whether this difference could just be due to
structural differences in the tables/indexes? But we have the same volume of
data on both sides, so I am not sure how that could make such a big
difference. And contrary to that, I see we are doing more work on the primary
(faster) side during that time period than on DR (the slow side), so how is
that possible? I also see that direct writes are higher on DR, i.e. the slow
side, than on primary; not sure if that points to something suspicious.
I have fetched the wait event sections of the AWR from both databases and am
attaching them here (in two tabs, as primary and DR wait profiles). Although I
do see "free buffer waits" in one of the top lists, the major one is 'index
contention' (the top one is a composite PK index, but a local index, so less
chance of a structural difference causing such issues). But again, it might be
that the DBWR response trend is just a symptom of something else rather than
the cause, as the SAN team says there is no difference there.
Regards
Pap
On Thu, Apr 8, 2021 at 10:34 PM Tanel Poder <tanel@xxxxxxxxxxxxxx> wrote:
There's also one usual suspect: I don't see the filesystemio_options parameter
set. You are running on ASM, not a filesystem, right?
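Checking that is quick on both instances, for example:

```sql
-- Compare the I/O-related parameters across both databases:
select name, value
  from v$parameter
 where name in ('filesystemio_options', 'disk_asynch_io');
```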
Thanks,
Tanel.
On Thu, Apr 8, 2021 at 4:06 AM Pap <oracle.developer35@xxxxxxxxx> wrote:
Hello Listers, we have two databases that are the same in all hardware
configuration and DB parameters, and are kept in sync/replicated using
GoldenGate. They are treated as primary and DR for us, and both are active.
But we are seeing a difference in behaviour when our application points to
primary vs. when it points to DR. During specific times of the day, when our
system is at its peak, in one of the databases, i.e. DR, the DBWR response
time spikes to ~200+ ms, slowing down the write queries, while in the other
database the DBWR response time stays at ~65 ms for a similar amount of data
and transaction volume. So we wanted to understand what can cause such a
drastic difference? Note: it's Oracle version 11.2.0.4.
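As a back-of-envelope check of what those numbers imply (a sketch using Little's law, L = lambda * W, with the figures quoted in this thread): at the same write rate, ~200 ms per write needs roughly three times as many writes in flight as ~65 ms, and if DBWR cannot keep that many outstanding, dirty buffers back up and sessions start hitting free buffer waits.

```python
# Little's law: average I/Os in flight = rate * service time.
# Figures from this thread: ~4000 blocks/s written, ~65 ms vs ~200 ms latency.

def outstanding_ios(writes_per_sec: float, service_time_ms: float) -> float:
    """Concurrent writes needed to sustain the rate at a given latency."""
    return writes_per_sec * service_time_ms / 1000.0

rate = 4000.0
print(f"primary (~65 ms):  {outstanding_ios(rate, 65.0):.0f} writes in flight")
print(f"DR      (~200 ms): {outstanding_ios(rate, 200.0):.0f} writes in flight")
```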