RE: Tim Gorman's "...Cost-Based Optimizer.doc"

  • From: "Marquez, Chris" <cmarquez@xxxxxxxxxxxxxxxx>
  • To: "Mladen Gogala" <mgogala@xxxxxxxxxxxxxxxxxxxx>
  • Date: Thu, 16 Jun 2005 14:17:43 -0400

Mladen,
Thanks for the very technical explanation although I honestl;y didn't follow it.

>>Your oracle waiting times will be 
>>reflecting this and blur any comparison.
>>In short, you cannot compare. 
>>In the old tuning books, when DBA used to
>>be much more conscientious of the underlying OS,


Can't compare;
"NON-cached" filesystems vs. "cached" filesystems
or can't compare;
"db file scattered reads" vs. "db file sequential reads"

Not sure what you mean?

Chris Marquez
Oracle DBA


-----Original Message-----
From: Mladen Gogala [mailto:mgogala@xxxxxxxxxxxxxxxxxxxx]
Sent: Thu 6/16/2005 11:56 AM
To: Marquez, Chris
Cc: oracle-l@xxxxxxxxxxxxx
Subject: Re: Tim Gorman's "...Cost-Based Optimizer.doc"
 
Marquez, Chris wrote:

>Tim Gorman's "The Search for Intelligent Life in the Cost-Based
>Optimizer.doc"
>
>
>One comment / question on OPTIMIZER_INDEX_COST_ADJ;
>It seems that on "NON-cached" filesystems (e.g. RAW, OCFS) that there
>would/do NOT be a great difference in "db file scattered reads" vs. "db
>file sequential reads" AVERAGE_WAITS...as every read from disk (on
>non-cached filesystem) is a *real* read from dusk...no OS buffer to
>help, no?
>
>Thanks,
>
>Chris Marquez
>Oracle DBA
>  
>

Chris, things aren't that simple. To explain why, I must show the following:


    NAME 

readv, writev - read or write a vector


    SYNOPSIS 

*#include <sys/uio.h <http://www.die.net/doc/linux/include/sys/uio.h>>*

*int readv(int* /fd/*, const struct iovec **/vector/*, int* /count/*);* 
*int writev(int* /fd/*, const struct iovec **/vector/*, int* /count/*);*

*struct iovec {*
    *__ptr_t* /iov_base/*;*    /* Starting address */
    *size_t* /iov_len/*;*      /* Length in bytes  */
*};*


    DESCRIPTION

*readv* reads data from file descriptor /fd/, and puts the result in the 
buffers described by /vector/. The number of buffers is specified by 
/count/. The buffers are filled in the order specified. Operates just 
like *read* except that data is put in /vector/ instead of a contiguous 
buffer.

*writev* writes data to file descriptor /fd/, and from the buffers 
described by /vector/. The number of buffers is specified by /count/. 
The buffers are used in the order specified. Operates just like *write* 
except that data is taken from /vector/ instead of a contiguous buffer.


    RETURN VALUE

On success *readv* returns the number of bytes read. On success *writev* 
returns the number of bytes written. On error, -1 is returned, and 
/errno/ is set appropriately.


    ERRORS

*EINVAL*
    An invalid argument was given. For instance /count/ might be greater
    than *MAX_IOVEC*, or zero. /fd/ could also be attached to an object
    which is unsuitable for reading (for *readv*) or writing (for
    *writev*). 
*EFAULT*
    "Segmentation fault." Most likely /vector/ or some of the /iov_base/
    pointers points to memory that is not properly allocated. 
*EBADF*
    The file descriptor /fd/ is not valid. 
*EINTR*
    The call was interrupted by a signal before any data was read/written. 
*EAGAIN*
    Non-blocking I/O has been selected using *O_NONBLOCK* and no data
    was immediately available for reading. (Or the file descriptor /fd/
    is for an object that is locked.) 
*EISDIR*
    /fd/ refers to a directory. 
*EOPNOTSUPP*
    /fd/ refers to a socket or device that does not support
    reading/writing. 
*ENOMEM*
    Insufficient kernel memory was available.

Other errors may occur, depending on the object connected to /fd/.


    CONFORMING TO

4.4BSD (the *readv* and *writev* functions first appeared in BSD 4.2), 
Unix98. Linux libc5 uses *size_t* as the type of the /count/ parameter, 
which is logical but non-standard.


    SEE ALSO

*read <http://www.die.net/doc/linux/man/man2/read.2.html>*(2), *write 
<http://www.die.net/doc/linux/man/man2/write.2.html>*(2), *fprintf 
<http://www.die.net/doc/linux/man/man3/fprintf.3.html>*(3), *fscanf 
<http://www.die.net/doc/linux/man/man3/fscanf.3.html>*(3)


    REFERENCED BY 

*recv <http://www.die.net/doc/linux/man/man2/recv.2.html>*(2), *recvfrom 
<http://www.die.net/doc/linux/man/man2/recvfrom.2.html>*(2), *recvmsg 
<http://www.die.net/doc/linux/man/man2/recvmsg.2.html>*(2), *socket 
<http://www.die.net/doc/linux/man/man7/socket.7.html>*(7), *syscalls 
<http://www.die.net/doc/linux/man/man2/syscalls.2.html>*(2)



As you can see, "readv" or "scattered read" is a single call to the OS 
and the wait is calculated as the time to completion.
Full table scan will attempt to read as many as 
DB_FILE_MULTIBLOCK_READ_COUNT blocks from the file in various
DB Block buffers in SGA (they are memory elements of the "vector"). 
Internally, OS might break this request into several I/O
requests and complete everything as the controller scheduler (yes, disk 
controllers have I/O schedulers, usually using SPF
type schedule ("Shortest Path First")).  Your wait time is OS 
overhead+time to completion of EACH I/O request scheduled
by OS to satisfy the call to readv. That may, on average, be 
significantly more then time necessary to satisfy a single block I/O,
which is done by calling "read" or "write" (no trailing "v"). Your 
oracle waiting times will be reflecting this and blur any comparison.
In short, you cannot compare. In the old tuning books, when DBA used to 
be much more conscientious of the underlying OS,
as the best DBA staff was evolving from the primordial mud of system 
administration,   there was an advice that size of DB_FILE_MULTIBLOCK 
READ_COUNT shouldn't ever exceed the maximum size of a single, atomic 
I/O that your OS
is capable of performing, but that  piece of advice is lost in the newer 
books. It was a moot point anyway, because I/O could be
broken because of the fragmentation as well.

-- 
Mladen Gogala
Oracle DBA
Ext. 121





--
//www.freelists.org/webpage/oracle-l

Other related posts: