The ill effects of NOT using direct I/O

  • From: Gaja Krishna Vaidyanatha <gajav@xxxxxxxxx>
  • To: yong321@xxxxxxxxx, Freek.DHooge@xxxxxxxxx
  • Date: Sun, 13 Feb 2011 21:56:37 -0800 (PST)

Hi Yong,

Nice writeup. I have started a new thread in an effort to focus on the topic of 
discussion. Yes, indeed on Linux with ext3 filesystems, setting 
FILESYSTEMIO_OPTIONS to SETALL and not doing anything special at the filesystem 
mount is enough to get direct I/O working. But as mentioned by me in a prior 
note (in the previous thread), the same is not true for jfs or vxfs, the last 
time I checked (a few months ago). You also bring a very relevant point to the 
discussion, the distinction between page cache and OS buffer cache. Let us for 
the purposes of our discussion, just say that the filesystem buffer cache 
includes both the page cache and the OS buffer cache.

Also, the scenario you have depicted in your last paragraph is interesting. My 
question to you - Have you observed what you are suggesting in your write up on 
a production system? The reason why I bring this up, is that in my experience 
the effects of NOT having direct I/O (with or without HugePages) has had 
crippling effects on the entire system, in this case in RH Linux. Similar 
behavior have also been observed and documented on other operating systems. 

The following is pretty much a standard sequence of what has been observed on 
Linux without direct I/O on very busy Oracle systems (100+ active sessions at 
any given time):

1) As the load on the system increases, more sessions run more SQL and 
eventually perform more I/O, thus increasing total I/O consumption on the 

2) This causes the OS to allocate more memory to the filesystem buffer cache, 
an effort to buffer future I/O requests (which may or may not occur). The 
classic "double buffering" problem of data in the Oracle database buffer cache 
and the filesystem buffer cache has now begun.

3) An increase in the memory footprint of the filesystem buffer cache causes an 
increase in the activity of the paging daemon - the OS tries to take preemptive 
action against a potential memory bottleneck.

4) As the I/O pressure increases, the paging daemon starts to work even more 
aggressively (significantly increasing the number of pageins and pageouts). 
Eventually, the paging daemon starts to get even more aggressive, not only 
increasing its paging rates, but also swapping portions of the SGA (even if it 
were locked) when the demand for memory increases. A system-wide Memory 
Bottleneck has now become a reality.

5) This further increases the CPU consumption on the system (listed below in 
and also starts to create an artificial I/O Bottleneck (db file sequential read 
and db file scattered read times increase exponentially, as Oracle database 
blocks need to brought in from the OS's swap partitions). OUCH!!!

6) This increase in paging daemon activity combined with the increase in the 
filesystem buffer cache footprint, increases the overall CPU utilization of the 
OS (verified with a continuous sar - u or vmstat). The "sar -u" breaks down CPU 
consumption is in 4 buckets - %usr, %sys, %wio, %idle. %sys will consistently 
and continuously increase.

7) This starts to create an artificial CPU Bottleneck as %sys in many cases 
consumes a very significant percentage of total CPU on the system (have 
%sys consume 40% - 50% of total CPU). 

8) Imagine what havoc starts to unfold when the OS consumes 50% of the system's 
CPU resources - CPU Bottleneck becomes a reality. Nobody plans for the OS to 
consume 50% of configured CPU resources while engaging in system sizing and 
capacity planning efforts.

9) Pretty soon, the CPU Bottleneck causes Oracle sessions start to timeout. 

10) Depending on the application vendor and their connection pool mechanics, 
application then starts to re-try connections to the database, which starts a 
connection storm to the database. In some cases, the database is completely 
hung-up with various session/login-related latch issues.

11) Sooner than later, the database comes to a grinding halt (with or without a 
crash), and the OS eventually may come to a grinding halt by imploding with a 
kernel panic. Even this did not occur, you pretty much have an unusable Oracle 
database at this stage of the game.
The above is pretty much a "poster child storyline" of very busy Oracle systems 
with hundreds of concurrent sessions that do not use direct I/O. Again, this is 
my experience and I'm curious if you or others on this list have observed 
anything to the contrary. If so, please provide the relevant details, as I 
love to learn alternative system behavioral patterns without direct I/O. It is 
something that interests me quite a bit and I'm very curious. Thanks for taking 
the time to read this long note :)


Gaja Krishna Vaidyanatha,
Founder/Principal, DBPerfMan LLC
Phone - 001-(650)-743-6060
Co-author:Oracle Insights:Tales of the Oak Table -
Co-author:Oracle Performance Tuning 101 -

From: Yong Huang <yong321@xxxxxxxxx>
To: Freek.DHooge@xxxxxxxxx
Cc: oracle-l@xxxxxxxxxxxxx
Sent: Sun, February 13, 2011 10:41:30 AM
Subject: RE: How much RAM is to much

> But I always thought that setting the filesystemio_options parameter 
> to directIO or setall caused the processes to open the files with 
> the O_DIRECT flag.

I can confirm what Freek said, on my Oracle, Red Hat 5 kernel 
2.6.18-194.17.1.el5 architecture x86_64 if these details matter. The 
filesystem is ext3, mounted with no option (i.e. everything default). 
When the process opens a datafile for the first time, I can see the 
flags passed to open() call in strace are O_RDWR|O_SYNC|O_DIRECT.

By the way, ever since page cache is introduced to the file system, 
we should make a distinction between buffer cache and page cache. What's 
interesting here is page cache, because buffer cache caches file system 
metadata. Well, in Linux, I believe buffer cache is physically inside 
page cache. So you may call those buffers "buffer pages". Confusing name 
indeed. But to make it short, as soon as you have the word "buffer" in 
it, people knowing Linux kernel (not me!) will think you're referring to 
the cache for metadata, not for file content, which may be what you want 
to say.

Last point. While Oracle manages data in memory through a better 
mechanism because it knows the data better, leaving the decision of 
caching/not caching to the file system is not that dreadful. In some 
cases, it may even be a big help. So instead of assigning most memory to 
Oracle buffer cache and using direct I/O, someone can make a comprehensive 
test to see if locking a large portion of RAM to SGA (through HugePages 
etc.) and leaving file system to NOT direct I/O will be even better in 
some cases. If a table slightly bigger than 2% of the buffer cache is 
frequently full-scanned, and you didn't do anything special about it, I 
think it's better to let the file system cache its content.

Yong Huang


Other related posts: