>Kevin,
>I don't know for the others... but I'd like to keep reading
>this thread and how it is evolving.
>The discussion is interesting.
>
>Fabrizio

This thread gets more traction in this forum than on suse-oracle, as Fabrizio and I can attest. Over there, it seems that any platform software, regardless of quality, is best as long as it is free and open source... which I find particularly odd when choosing a platform to host the most expensive (and most feature-rich) closed source software out there (Oracle). Hmmm...

So the thread is a technical comparison of cluster filesystem architectures. Or at least a tip-toe through the tulips on horseback. One camp is the central locking and metadata approach of IBM GPFS, Sistina GFS, and Veritas CFS; the other is the fully symmetric, distributed approach implemented by PolyServe on Linux and Windows.

The central approach is the easiest approach. Period. That does not make those products useless. On the contrary, they are extremely good (better than PolyServe) at HPC workloads. When you compare more commercial-style workloads, like email, the distributed, symmetric approach bears fruit. Workloads like email are great for showing which CFS is general purpose and which isn't. See the following URLs for an independent test of an email system for hundreds of thousands of users, comparing the various CFS technologies out there (for Linux):

http://www.polyserve.com/pdf/Caspur_CS.pdf
http://www.linuxelectrons.com/article.php/20050126205113614

Mladen asked about such intricacies as versioning. There is no such concept on the table. A CFS is responsible for keeping filesystem metadata coherent; applications are responsible for keeping file content coherent. Now, having said that, PolyServe supports positional locking, and we also maintain page cache coherency at per-file granularity.
So, if two processes in the cluster use a non-cluster-aware program, like vi, and set out to edit the same file in the CFS, the last process to write the file will be the winner. This is how vi works on a non-CFS, so this should be expected.

Oracle file access characteristics are an entirely different story. Here, the application is cluster-aware, so we've implemented a mount option for direct IO (akin to the forcedirectio mount option in Solaris). With it, IO requests are DMAed directly from the address space of the process to disk, without serialization or inode updates like [ma]time. The value add that we implemented, however, is what sets this approach apart. In the same filesystem, mounted with the direct IO option, you can have one process performing properly aligned IOs through the direct IO path (e.g., lgwr) while another process is doing unaligned buffered IO. This comes in handy, for instance, when you have a process like ARCH spooling the archived redo logs (direct IO) followed by compress/gzip compressing the file. Tools like compress nearly always produce an output file whose size is not a multiple of 512 bytes, so for that reason alone they cannot use direct IO on any SCSI-based system.

Lots of stuff to consider in making a comprehensive cluster platform for databases...

Whether a good CFS can handle text mapping is not an issue. The following example uses a small 10-node PolyServe Matrix (cluster). The test starts by comparing 1000 executions of the Pro*C executable on the CFS against a non-CFS (reiserfs in this case).

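As an aside on the 512-byte point made above for compress output, here is a minimal sketch of how one might check whether a file's size is a whole multiple of the sector size. The function name is_sector_multiple and the temporary files are hypothetical, not part of PolyServe or Oracle. It checks file size only; as far as I know, direct IO on Linux also expects aligned offsets and buffers, which this does not examine.

```shell
#!/bin/bash
# Hypothetical helper (not from the original post): succeed if the size of
# FILE is a whole multiple of SECTOR bytes (default 512).
is_sector_multiple() {
    local file=$1
    local sector=${2:-512}
    local size
    # wc -c reads the byte count; a missing or unreadable file returns 2.
    size=$(wc -c < "$file") || return 2
    (( size % sector == 0 ))
}

# Small demonstration with two scratch files of known sizes.
tmpdir=$(mktemp -d)
head -c 1024 /dev/zero > "$tmpdir/aligned"
head -c 1000 /dev/zero > "$tmpdir/unaligned"

for f in "$tmpdir/aligned" "$tmpdir/unaligned"; do
    if is_sector_multiple "$f"; then
        echo "$(basename "$f"): size is a multiple of 512"
    else
        echo "$(basename "$f"): size is NOT a multiple of 512"
    fi
done

rm -rf "$tmpdir"
```

Run against a compressed archived redo log, a check like this would typically report the "NOT a multiple" case, which is why the buffered path has to coexist with the direct IO path in the same mount.
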
First, prove that the test binary (proc in this case) is the same inode in the CFS on all 10 nodes:

$ for i in 1 2 3 4 5 6 7 8 9 10; do rsh mxserv$i "ls -i $ORACLE_HOME/bin/proc"; done
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc
2241437 /u01/app/oracle10/product/10.1.0/db_1/bin/proc

Next, copy the proc executable to /tmp to get a baseline comparison of non-CFS (reiserfs) against the PolyServe CFS:

$ cp $ORACLE_HOME/bin/proc /tmp
$ md5sum $ORACLE_HOME/bin/proc /tmp/proc
af42f080f2ddba7fe90530d15ac1880a  /u01/app/oracle10/product/10.1.0/db_1/bin/proc
af42f080f2ddba7fe90530d15ac1880a  /tmp/proc
$

Next, a quick script to fire off 1000 concurrent invocations of the binary pointed to by arg1:

$ cat t_proc
#!/bin/bash
binary=$1
getenv=$2
[[ ! -z "$2" ]] && cd ~oracle && . ./.bash_profile
cnt=0
until [ $cnt -eq 1000 ]
do
    (( cnt = $cnt + 1 ))
    ( $binary sqlcheck=FULL foo.pc > /dev/null 2>&1) &
done
###End script

Next, execute the script under time(1) to get the count of minor faults and the execution time. When executed as /tmp/proc, the cost is 1020884 minor faults and about 11.7 seconds elapsed:

$ /usr/bin/time ./t_proc /tmp/proc
11.60user 10.42system 0:11.72elapsed 187%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1020884minor)pagefaults 0swaps

Next, execute the script pointing to the Shared Oracle Home copy of the proc executable:

$ echo $ORACLE_HOME
/u01/app/oracle10/product/10.1.0/db_1
$ /usr/bin/time ./t_proc $ORACLE_HOME/bin/proc
11.43user 10.52system 0:11.08elapsed 198%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1016753minor)pagefaults 0swaps

So, 1000 invocations, parallelized as much as a dual-processor system can muster, yield the same execution performance on non-CFS as on CFS.

Next, execute the script in parallel on 2, 4, and then 10 nodes. Note that the timing granularity is seconds, using the $SECONDS builtin variable.

$ cat para_t_proc
for i in 1 2
do
    rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait
echo $SECONDS

for i in 1 2 3 4
do
    rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait
echo $SECONDS

for i in 1 2 3 4 5 6 7 8 9 10
do
    rsh mxserv$i "/u01/t_proc $ORACLE_HOME/bin/proc GETENV" &
done
wait
echo $SECONDS

$ sh ./para_t_proc
11
22
34

$SECONDS counts from the start of the script, so the readings are cumulative: the 2-, 4-, and 10-node runs took about 11, 11, and 12 seconds respectively. So, parallel and cluster-concurrent execution of bits is 100% linearly scalable... as it should be. Otherwise, as I've ranted before, you would not be able to call it a CFS, or an FS at all for that matter :-)

--
http://www.freelists.org/webpage/oracle-l