RE: 64 node Oracle RAC Cluster (The reality of...)

>Kevin,
>I don't know for the others... but I'd like to keep reading 
>this thread and how it is evolving.
>The discussion is interesting.
>
>Fabrizio

 This thread gets more traction in this forum than
suse-oracle as Fabrizio and I can attest. It seems
over there that any platform software, regardless
of quality, is best as long as it is free and 
open source...which I find particularly odd when
choosing a platform to host the most expensive
(and most feature rich) closed source software 
out there (Oracle). hmmm...

 So the thread is a technical comparison of 
cluster filesystem architectures. Or at least
a tip-toe through the tulips on horseback.

One camp is the central locking and metadata 
approach of IBM GPFS, Sistina GFS, and Veritas CFS;
the other is the fully symmetric, distributed approach
implemented by PolyServe on Linux and Windows.

 The central approach is the easiest approach.
Period. That does not make them useless. On
the contrary, they are extremely good (better
than PolyServe) at HPC workloads. When you
compare more commercial-style workloads, like
email, the distributed, symmetric approach
bears fruit. Workloads like email are great for
showing whether or not a CFS is
general purpose. See the following 
URLs for an independent test of an email system for 
hundreds of thousands of users comparing 
the various CFS technologies out there (for Linux):

http://www.polyserve.com/pdf/Caspur_CS.pdf
http://www.linuxelectrons.com/article.php/20050126205113614

Mladen asked about intricacies such as versioning.
There is no such concept on the table.
A CFS is responsible for keeping filesystem
metadata coherent, applications are responsible
for keeping file content coherent. Now, having
said that, PolyServe supports positional locking
and we do also maintain page cache coherency 
on a per-file granularity. So, if two processes
in the cluster use a non-cluster-aware program,
like vi, and set out to edit the same file
in the CFS, the result will be that the last
process to write the file will be the winner.
This is how vi works on a non-CFS, so this
should be expected. 
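
The last-writer-wins outcome can be sketched without anything
cluster-specific. This is a minimal illustration on a scratch file;
the save_as helper and the editor names are made up for the example,
and each call simply rewrites the whole file the way a
non-cluster-aware editor saves its buffer.

```shell
#!/bin/bash
# Hypothetical demo: two "editors" that each rewrite the whole file on save.
f=$(mktemp)

save_as() {     # save_as <who>: overwrite the file with that editor's buffer
    printf 'buffer from %s\n' "$1" > "$f"
}

save_as editor1     # first process saves
save_as editor2     # second process saves later and replaces the first copy

result=$(cat "$f")
echo "$result"
rm -f "$f"
```

Whichever save lands last determines the file content, on a CFS
exactly as on a local filesystem.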

Oracle file access characteristics are an entirely
different story. Here, the application is
cluster-aware so we've implemented a mount
option for direct IO (akin to forcedirectio
mount option in Solaris). Here, the IO requests
are DMAed directly from the address space 
of the process to disk - without serialization
or inode updates like [ma]time. The value add
that we implemented, however, is what sets this
approach apart. 

In the same filesystem, that is mounted with 
the direct IO option, you can have one process 
performing properly aligned IOs going through 
the direct IO path (e.g., lgwr) while another 
process is doing unaligned buffered IO. This 
comes in handy, for instance, when you
have a process like ARCH spooling the archived
redo logs (direct IO) followed by compress/gzip 
compressing down the file. Tools like compress 
nearly always produce an output file that is not 
a multiple of 512 bytes, so for that reason alone 
it cannot use direct IO on any SCSI based system. 
Lots of stuff to consider in making a comprehensive
cluster platform for databases...
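
The 512-byte point is easy to check by hand. The following is only a
sketch: is_sector_multiple is a hypothetical helper (not a PolyServe
or Oracle tool), and the file names are stand-ins for an archived
redo log and its gzip'ed copy.

```shell
#!/bin/bash
# Sketch: does a file's size come out to a whole number of 512-byte sectors?
# is_sector_multiple is a hypothetical helper name.
is_sector_multiple() {
    size=$(wc -c < "$1") || return 2
    [ $((size % 512)) -eq 0 ]
}

tmp=$(mktemp -d)
head -c 1048576 /dev/zero > "$tmp/arch.log"     # stand-in for an archived redo log
gzip -c "$tmp/arch.log" > "$tmp/arch.log.gz"    # compressed copy, arbitrary length

for f in "$tmp/arch.log" "$tmp/arch.log.gz"; do
    if is_sector_multiple "$f"; then
        echo "$f: whole sectors"
    else
        echo "$f: partial last sector"
    fi
done
rm -rf "$tmp"
```

A file that ends in a partial sector has to be written through the
buffered path, which is why being able to mix both paths in one
mounted filesystem matters.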

A good CFS being able to handle
text-mapping is not a concern. The following 
example is from a small 10 node PolyServe Matrix (cluster).
The test first compares 1000 executions
of the Pro*C executable on the CFS against a non-CFS (reiserfs
in this case).

First, prove that the test binary (proc in this case)
is the same inode in the CFS on all 10 nodes:

$ for i in 1 2 3 4 5 6 7 8 9 10; do rsh mxserv$i "ls -i $ORACLE_HOME/bin/proc"; done
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
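
Rather than eyeballing ten lines, the comparison can be scripted.
This is a sketch only: same_inode is a hypothetical helper that reads
"ls -i" style lines on stdin, and the sample input below is copied
from the listing above instead of coming from a live rsh loop.

```shell
#!/bin/bash
# Sketch: succeed only if every "ls -i" line on stdin carries the same inode number.
# same_inode is a hypothetical helper name.
same_inode() {
    awk '!seen[$1]++ { n++ } END { exit (n == 1 ? 0 : 1) }'
}

# Sample input in the format shown above. On a real cluster the
# rsh loop over mxserv1..mxserv10 would be piped in instead.
printf '%s\n' \
    '2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc' \
    '2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc' |
    same_inode && echo "same inode on all nodes"
```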

Next, copy the proc executable to /tmp to get a baseline
for comparing non-CFS (reiserfs) with PolyServe CFS:

$ cp $ORACLE_HOME/bin/proc /tmp
$ md5sum $ORACLE_HOME/bin/proc /tmp/proc
af42f080f2ddba7fe90530d15ac1880a  /u01/app/oracle10/product/10.1.0/db_1/bin/proc
af42f080f2ddba7fe90530d15ac1880a  /tmp/proc
$

Next, a quick script to fire off 1000 concurrent
invocations of the binary named by its first argument:

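$ cat t_proc
#!/bin/bash
# Launch 1000 concurrent invocations of the binary named by arg1.

binary=$1
getenv=$2       # any non-empty value: source oracle's .bash_profile first

[[ -n "$getenv" ]] && cd ~oracle && . ./.bash_profile

cnt=0
until [ $cnt -eq 1000 ]
do
        (( cnt = cnt + 1 ))
        # Each invocation runs in its own background subshell.
        ( $binary sqlcheck=FULL foo.pc > /dev/null 2>&1 ) &

done
###End script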

Next, execute the script under time(1) to get the count of
minor faults and the execution time. When executed as /tmp/proc,
the cost is 1020884 minor faults and 11.72 seconds elapsed
(11.60 seconds of user CPU).

$ /usr/bin/time ./t_proc /tmp/proc
11.60user 10.42system 0:11.72elapsed 187%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1020884minor)pagefaults 0swaps


Next, execute the script pointing to the Shared Oracle
Home copy of the proc executable:

$ echo $ORACLE_HOME
/u01/app/oracle10/product/10.1.0/db_1
$ /usr/bin/time ./t_proc $ORACLE_HOME/bin/proc
11.43user 10.52system 0:11.08elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1016753minor)pagefaults 0swaps

So, 1000 invocations parallelized as much as a dual-processor
system can muster yield essentially the same execution performance
on non-CFS as on CFS.
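
To put a number on "the same", the minor-fault counts can be pulled
out of the two time(1) reports and compared. A sketch, assuming GNU
time's default output format as shown above; minor_faults is a
hypothetical helper name and the two input lines are copied from the
runs in this post.

```shell
#!/bin/bash
# Sketch: extract the minor-fault count from GNU time's default output
# and compare the two runs. minor_faults is a hypothetical helper name.
minor_faults() {
    sed -n 's/.*+\([0-9][0-9]*\)minor).*/\1/p'
}

tmp_run=$(echo '0inputs+0outputs (0major+1020884minor)pagefaults 0swaps' | minor_faults)
cfs_run=$(echo '0inputs+0outputs (0major+1016753minor)pagefaults 0swaps' | minor_faults)

echo "reiserfs: $tmp_run  CFS: $cfs_run  difference: $((tmp_run - cfs_run))"
echo "per invocation: $((tmp_run / 1000)) vs $((cfs_run / 1000))"
```

The difference is 4131 faults out of roughly a million, about 0.4
percent, or 1020 versus 1016 minor faults per invocation.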

Next, execute the script in parallel on 2, 4, and then 
10 nodes. Note, the timing granularity
is seconds, using the $SECONDS builtin variable.

$ cat para_t_proc
for i in 1 2
do
rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait

echo $SECONDS

for i in 1 2 3 4
do
rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait

echo $SECONDS

for i in 1 2 3 4 5 6 7 8 9 10
do
rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait

echo $SECONDS

$ sh ./para_t_proc
11
22
34


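The $SECONDS readings are cumulative, so each phase took
roughly the same 11 to 12 seconds whether 2, 4, or 10 nodes
were running the test. So, parallel and cluster-concurrent
execution of bits scales linearly...as it should. Otherwise,
as I've ranted before, you would not be able to
call it a CFS, or an FS at all for that matter :-)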
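
Since $SECONDS counts from the start of the script, subtracting
consecutive readings gives the time spent in each phase. A small
sketch of that arithmetic; phase_times is a hypothetical helper name
and the input is the three readings printed above.

```shell
#!/bin/bash
# Sketch: turn cumulative $SECONDS readings into per-phase durations.
# phase_times is a hypothetical helper name.
phase_times() {
    awk '{ print $1 - prev; prev = $1 }'
}

# Readings from the 2, 4 and 10 node phases, in order.
printf '%s\n' 11 22 34 | phase_times
```

This prints 11, 11 and 12: the 2, 4 and 10 node phases each finish
in about the time a single node needs for its own 1000 invocations.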



--
http://www.freelists.org/webpage/oracle-l
