RE: Oracle & pagefaults (solaris)

  • From: "Patterson, Joel" <Joel.Patterson@xxxxxxxxxxx>
  • To: "przemolicc@xxxxxxxxx" <przemolicc@xxxxxxxxx>, "oracle-l@xxxxxxxxxxxxx" <oracle-l@xxxxxxxxxxxxx>
  • Date: Mon, 13 Aug 2012 10:44:58 -0400

To throw some suggestions at the issue while you are waiting for some responses:
Are you using ZFS?   If so, have you looked at any of the following?   
ZFS for Databases:
http://www.solarisinternals.com/wiki/index.php/ZFS_for_Databases 
Limitations  -- Full table scans, etc.
Variable performance of sequential reads - Since ZFS does block-level 
copy-on-write for all writes, the physical layout of a file will tend to become 
random over time for files that are updated randomly. Normal RDBMS operations 
such as the updating of tables and indexes characteristically result in such 
random writes. As a consequence, the performance of database queries that do 
sustained large sequential reads (including "full table scans", "full index 
scans", or CREATE INDEX operations) can deteriorate over time as the database 
is updated. Simple ad-hoc or utility queries such as SELECT COUNT (*) from a 
non-indexed table commonly cause such sustained sequential I/O demand. The 
worst case impact varies inversely with the ZFS record size, thus posing a 
tradeoff between the performance of random reads and the performance of 
sequential reads in relation to the degree of potential fragmentation.

http://developers.sun.com/solaris/docs/wp-oraclezfsconfig-0510_ds_ac2.pdf 
Backups:
Copy operations, like backups, are similar to a full scan, long logical 
sequential reads and
are also subject to IOPS inflation as compared to traditional file system 
storage. The
elapsed time for a database backup can be due to the copy-on-write evolution
of the on-disk format for data files. Copying the data files, although 
cumbersome, might
help to relocate the blocks into a more continuous physical layout, 
particularly if the
pool is left with a minimum of free disk blocks.

This Guy, Constantin Gonzalez like ZFS and supports the defaults for most 
situations:
http://constantin.glez.de/blog/2010/04/ten-ways-easily-improve-oracle-solaris-zfs-filesystem-performance
 

Evil Tuning Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide 


Have you changed any of your SAN hardware or software; any other changes?

Joel Patterson
Database Administrator
904 727-2546

-----Original Message-----
From: oracle-l-bounce@xxxxxxxxxxxxx [mailto:oracle-l-bounce@xxxxxxxxxxxxx] On 
Behalf Of przemolicc@xxxxxxxxx
Sent: Monday, August 13, 2012 10:27 AM
To: oracle-l@xxxxxxxxxxxxx
Subject: Oracle & pagefaults (solaris)


Hi all,

we have been facing performance problems on our Solaris 10 Sparc based server 
with several Oracle databases running on it.
It started 2 months ago and gradually is increasing. There is always high level 
of syscalls when we face these problems:

09:20:03      25       8       0      68
09:30:02      22       7       0      70
09:40:03      23       8       0      70
09:50:03      23       7       0      69
10:00:03      22       9       0      69
10:10:03      25      34       0      41
10:20:03      28      28       0      44
10:30:03      27      24       0      49
10:40:02      24       8       0      68
10:50:03      27      19       0      54
11:00:02      31      27       0      41
11:10:21      29      48       0      23
11:20:03      36      56       0       9
11:30:05      32      51       0      18
11:40:03      37      43       0      20
11:50:06      28      59       0      14
12:00:06      32      61       0       7
12:10:03      38      26       0      36
12:20:02      34       9       0      57
12:30:02      35       8       0      58
12:40:03      35       9       0      56

Normal level of syscalls is about 7-10%. But when something is happening it 
increases to over 50%.
Using DTrace I can find that kernel is working in 'pagefault' function:

......
              unix`page_freelist_coalesce+0x894 [***]
              0x1fe0000
              unix`page_get_mnode_freelist+0x398
              unix`page_get_freelist+0x428
              unix`page_alloc_pages+0x110
              genunix`anon_map_getpages+0x348
              genunix`segvn_fault_anonpages+0x32c
              genunix`segvn_fault+0x530
              genunix`as_fault+0x4c8
              unix`pagefault+0x68
              unix`trap+0xd50
              unix`utl0+0x4c
            29062

              unix`cpu_halt+0x10c
              unix`cpu_halt+0x104
              unix`idle+0x128
              unix`thread_start+0x4
           719133

(this is just part of the image - there are much more pagefaults then listed 
above).
So natural is to check on behalf of which application kernel is working in 
pagefaults:

[ High syscalls ]

CPU     ID                    FUNCTION:NAME
 96  85656                       :tick-5sec
  nscd                                                              1
  top                                                               1
  dtrace                                                            5
  perl                                                            122
  emagent                                                         124
  tnslsnr                                                        1569
  oracle                                                         5538

[ Low (normal) syscalls ]

CPU     ID                    FUNCTION:NAME
 20  85656                       :tick-5sec
  nscd                                                              1
  dtrace                                                            5
  emagent                                                          41
  sh                                                               91
  perl                                                             93
  emdctl                                                          243
  tnslsnr                                                         320
  oracle                                                         2466


When I watch which Oracle's PID are generating these high syscalls using 'ps'
most of them don't exists (!). Are they short lived processes ?
Our DBA claims that this is not a problem of Oracle but DTrace shows that 
system is working on behalf of Oracle. Can you shed some light where our DBAs 
should look for any solution ? When I analyzed number of connections in 
listener log they don't differ much between low and high system calls.

Can you help me please ? I don't feel comfortable because my feeling is that I 
am doing DBAs homework but they claim that AWR report is the same as before so 
there is nothing they could do.

Best regards
Przemek
--
//www.freelists.org/webpage/oracle-l


--
//www.freelists.org/webpage/oracle-l


Other related posts: