Re: really slow RMAN backups

  • From: "Mark Brinsmead" <pythianbrinsmead@xxxxxxxxx>
  • To: sperry@xxxxxxxxxxx
  • Date: Mon, 21 Aug 2006 20:45:44 -0600

Steve,

  Here are a few thoughts -- for what they're worth.  I'm sure others on
this list can offer much better feedback.

1.  You did not say how your NetApp storage is connected.  I presume NFS,
but there are other options...

2.  Aside from mentioning the MTU and suggesting that there are multiple
networks in play, you haven't said much about your network config.
Bandwidth?  NFS mount options?  And so on...  Is your NFS storage spread
across multiple networks (NICs)?  (Probably not...)

3.  You mentioned "tape channels", but you haven't told us anything about
your backup hardware, media manager, etc.  I would imagine that the backups
are performed across a network (sadly, the norm these days); is it the same
network used for NFS?  Are you sure?

4.  You're using 10gR2.  And a flash recovery area.  Where is it
stored?  On the NetApp, perhaps?

5.  Are you backing up "directly" to tape, or moving data through the FRA
and then to tape?  What about your media manager?  Does it stage data to
disk first, then "destage" to tape?  If so, where is the staging area?
Maybe on the NetApp?

6.  What is the actual bandwidth of your TCP network(s)?  I don't just mean
"are you using 100Mbit or 1000Mbit?", but rather, what kind of actual
throughput are you able to achieve, for example using 'dd' to read or write
a file on the NetApp filer?  (You could find that something like a duplex
mismatch or traffic congestion is cutting your actual bandwidth to much
less than you would think it should be.)  A rough sketch of that test, and
a couple of the other checks, follows this list.
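
For what it's worth, here is roughly how I would go about chasing a few of
those questions down myself.  This is only a sketch; the mount point
(/mnt/netapp) and the file sizes are placeholders, so adjust them to suit
your environment:

# 1/2: how is the filer mounted, and with which NFS options?
nfsstat -m

# 4: where does the database think the recovery area lives?
echo "show parameter db_recovery_file_dest" | sqlplus -s / as sysdba

# 6: raw write/read throughput to the filer.  Use a file a good deal
# larger than RAM (8GB here, against your 4GB) so the OS cache does not
# flatter the read-back numbers.
time dd if=/dev/zero of=/mnt/netapp/ddtest.dat bs=1048576 count=8192
time dd if=/mnt/netapp/ddtest.dat of=/dev/null bs=1048576
rm /mnt/netapp/ddtest.dat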

Okay, so, silly questions out of the way, here are the observations I
promised earlier...

You said:
I don't have any experience with netapp and want to see if there are
some known issues with it.

One comes to mind.  With (redhat) Linux, it is not possible to do
asynchronous I/O against NetApp storage.  Not if you're using NFS, anyway.
This can have huge implications for I/O performance, especially if you
happen to be assuming that you are (capable of) doing Async I/O...

You said:
I don't know why they chose directio (1 dbwr) instead of async. they
may not have anything to do with it, but it's the first time I saw
them set on a RAC database.

Lack of async I/O could be a major factor here.  Here's the bad news:
"they" chose not to use Async I/O because it is not available (i.e. not
possible) with NFS-on-redhat-linux.  Not much of a choice, really...

All of your I/O is being done synchronously.  And this can lead to serious
bottlenecks.  (Mostly on writes, though.)
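
If you want to confirm what the instance actually has in effect, something
like this (just a sketch, run on one of the RAC nodes as the oracle owner)
will show it:

sqlplus -s / as sysdba <<'EOF'
select name, value
  from v$parameter
 where name in ('disk_asynch_io', 'filesystemio_options',
                'dbwr_io_slaves', 'backup_tape_io_slaves');
EOF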

You said:
I ran an awr report and "RMAN backup & recovery I/O" was the top
waiter with an avg wait of 134 ms.

Average wait of 134 ms?  That's about 7 (synchronous) I/Os per second.  At
8KB per I/O (you didn't tell us DB_BLOCK_SIZE) that's about 56KB/s, or
around 200MB/hr.  Obviously, you're not bottlenecked (completely) on this
all of the time -- your backups would take 2,000+ hours, not 20+ hours.

I don't know much about this particular wait (obviously).  I would want to
understand what it means a lot better before really running with this, but
that 134ms average wait does not sound (at all) promising.

So, you're backing up a 500GB database.  To do it in 10 hours (that's a lot)
you need to sustain 50GB/hr -- end to end -- just for the backups.  That's
around 15MB per second.  That could mean (something vaguely like) reading
from the NetApp at 15MB/s, writing to the flash recovery area (also on
the NetApp?) at 15MB/s, reading again from the flash recovery area at
15MB/s, transmitting backup over the network to the media manager at 15MB/s,
staging the backup data to disk at 15MB/s, destaging the backup data from
disk at 15MB/s, and (finally!) writing to tape at 15MB/s.  All concurrently!

So, depending on the answers to the "silly" questions above, I count
somewhere up to 6 or 7 traversals of your IP network, for a total of 100MB/s
(roughly 1000Mbit/s), and total NetApp throughput (just for backups) of
maybe 90MB/s.  How much (sustained) I/O can it do?

You may want to consider DBWR_IO_SLAVES for your database.  This is probably
not (directly) related to backups, but you didn't tell us what else your
database has been waiting on.  In any event, environments where ASYNC I/O is
unavailable (yours is one) are the rare cases where DBWR_IO_SLAVES can be
warranted.

And if you haven't already, you may want to look into
BACKUP_TAPE_IO_SLAVES, too...
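
Something like the following is all it takes to try them; the values are
only illustrative starting points (not recommendations), and with
scope=spfile they take effect only after the instances are restarted:

sqlplus -s / as sysdba <<'EOF'
-- illustrative values only; test before relying on them
alter system set dbwr_io_slaves = 4 scope=spfile sid='*';
alter system set backup_tape_io_slaves = true scope=spfile sid='*';
EOF

Keep in mind that the buffers for slave I/O are generally allocated from
the large pool, so keep an eye on that 52M large_pool_size if you turn
these on.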


On 8/21/06, Steve Perry <sperry@xxxxxxxxxxx> wrote:

This was just passed to me, but I thought I'd check with the group to see if anyone else has experienced this slowness.



RMAN backups (2 tape channels)  take forever on this system. forever
means 20+ hours.

the view v$backup_sync_io shows the effective bytes per second at 2
or 3 MB per second. nothing above 5MB per second.
v$backup_async_io doesn't show anything.

Setup.
500GB database on a netapp filer (40+ disks, don't know the model)
with ASM
32-bit 10.2.0.1
2 - node RAC EE cluster
rhel3
2 cpu
1 GB swap
4GB ram
600 MB SGA (small and uses the automatic memory management)
flash recovery area is on
DG is setup for 2 different databases
mtu sizes of all NICs are set to 1500 (since it's netapp, they might
prefer something else)
legato is the media manager

I looked at the init.ora settings and besides the small sga,
disk_asynch_io = false
filesystemio_options = directIO
large_pool_size = 52M

I don't know why they chose directio (1 dbwr) instead of async. they
may not have anything to do with it, but it's the first time I saw
them set on a RAC database.

I ran an awr report and "RMAN backup & recovery I/O" was the top
waiter with an avg wait of 134 ms. the class is "system io".
other things are an index with 19 million buffer gets during a 2 hour
snapshot.
I see a few slow access times (300ms avg. read time), but there are
only 200 or so reads against it. Most of the access times are less
than 20ms.
I don't know if the problem is contention with other jobs, config
parameter or hardware.

I checked a similar system (db ver, 2 node rac, asm) that gets
80-90MB per second for its backup.
it's on the SAN and uses async.
I haven't looked at the awr report from it.

any suggestions?
--
//www.freelists.org/webpage/oracle-l





--
Cheers,
-- Mark Brinsmead
  Staff DBA,
  The Pythian Group
  http://www.pythian.com/blogs
