Re: Moving db to linux

  • From: Mladen Gogala <mgogala@xxxxxxxxxxxx>
  • To: oracle-l@xxxxxxxxxxxxx
  • Date: Sat, 28 Feb 2004 13:38:22 -0500

Nuno, here's an excerpt from the IBM JFS manual:
*********************************************
File System operations logged by JFS
 The following list of file system operations
 changes meta-data of the file system so they
 must be logged.
·    File creation (create)
·    Linking (link)
·    Making directory (mkdir)
·    Making node (mknod)
·    Removing file (unlink)
·    Rename (rename)
·    Removing directory (rmdir)
·    Symbolic link (symlink)
·    Set ACL (setacl)
·    Writing File (write) (not on normal
     conditions)
·    Truncating regular file
******************************************* 

You are right about logging being for metadata only, but not so right
about direct I/O. Most file systems simply ignore a request to open
with O_DIRECT. XFS reports an error on Linux (a subsequent
read/write gets EINVAL), but works as advertised on Irix.
Below is a little program that I used to test direct I/O:

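#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#define BUFFSIZE 65536
#define ALIGN    4096

int main(void) {
    void    *buff;
    ssize_t  nread, nwritten;
    int      err, fd1, fd2;

    /* O_DIRECT transfers need an aligned buffer */
    if ((err = posix_memalign(&buff, ALIGN, BUFFSIZE)) != 0) {
        fprintf(stderr, "ALIGN ERR:%s\n", strerror(err));
        exit(1);
    }

    if ((fd1 = open("xxx", O_RDONLY|O_DIRECT)) < 0) {
        fprintf(stderr, "OPEN ERR (xxx):%s\n", strerror(errno));
        exit(1);
    }
    if ((fd2 = open("yyy", O_CREAT|O_WRONLY|O_DIRECT, S_IRWXU)) < 0) {
        fprintf(stderr, "OPEN ERR (yyy):%s\n", strerror(errno));
        exit(1);
    }

    /* read() returns 0 at end of file and -1 on error; errno is only
       meaningful after a call has actually failed */
    while ((nread = read(fd1, buff, BUFFSIZE)) != 0) {
        if (nread < 0) {
            fprintf(stderr, "READ ERR:%s\n", strerror(errno));
            exit(1);
        }
        /* a short final block may be rejected with EINVAL under O_DIRECT
           if its length is not block-aligned */
        nwritten = write(fd2, buff, (size_t) nread);
        if (nwritten < 0) {
            fprintf(stderr, "WRITE ERR:%s\n", strerror(errno));
            exit(1);
        }
    }
    close(fd1);
    close(fd2);
    free(buff);
    return 0;
}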
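
The same check can be reduced to a small probe that only reports where
O_DIRECT is refused, at open() or at the first read(). This is a minimal
sketch, not a definitive test: probe_direct_io() and PROBE_ALIGN are
names made up for illustration, and the 4096-byte alignment is the same
assumption as ALIGN in the program above. A return value of 0 only means
that nothing complained; a file system that silently ignores O_DIRECT
will also return 0.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for O_DIRECT with glibc on Linux */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define PROBE_ALIGN 4096       /* assumed alignment and transfer size */

/*
 * probe_direct_io() is a hypothetical helper, not part of any library.
 * Returns 0 if an aligned O_DIRECT read was accepted, otherwise the
 * error number from the step that failed (open, allocation or read).
 */
int probe_direct_io(const char *path)
{
    void *buf = NULL;
    int fd, err;
    ssize_t n;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return errno;                 /* refused (or failed) at open time */

    err = posix_memalign(&buf, PROBE_ALIGN, PROBE_ALIGN);
    if (err != 0) {
        close(fd);
        return err;
    }

    n = read(fd, buf, PROBE_ALIGN);   /* aligned buffer, aligned length */
    err = (n < 0) ? errno : 0;        /* refused at read time, e.g. EINVAL */

    free(buf);
    close(fd);
    return err;
}
```

Calling probe_direct_io("xxx") on each file system of interest and
printing strerror() of a non-zero result shows whether the refusal comes
from open() or from read().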




On 02/28/2004 10:13:12 AM, Nuno Souto wrote:
> ----- Original Message ----- 
> From: "Mladen Gogala" <mgogala@xxxxxxxxxxxx>
> 
> > Journalling for files is a concept similar to redo in the world
> > of oracle.
> 
> No, it MOST DEFINITELY is not.  Journalled file systems are similar
> to redo ONLY for file system metadata.  NOT for the data itself!
> 
> > With JFS, you get the process called jfsCommit running,
> > which "commits" buffer operations. Each filehandle operation like
> > "flush" or "close" is a "commit".
> 
> So it is in a non-journalled file system.  "flush" has existed in
> normal file systems since the year dot and does exactly and precisely that.
> There is also a background process in non-JFS file systems that flushes
> every 30 seconds or so: it's called "sync".
> 
> > Basically, journalled FS guarantees
> > that the data written down synchronously will really be written down
> > to the disk device(s).
> 
> ANY file system guarantees that data written synchronously
> is really written to the disk device.
> Synchronous access is NOT a synonym for journalling.
> 
> > If you can do DIO, your data is a little bit
> > safer.
> 
> Most file systems can do DIO.  It's got nothing to do with
> journalling itself.
> 
> > What a journalling FS protects you against is a huge data loss
> > of blocks that were in the buffer cache.
> 
> NO WAY! If you do NOT write synchronously in a JFS, you WILL
> lose ANY data blocks in the cache!
> 
> And to write synchronously you have to use synchronous I/O,
> DIO or frequent "flushes".  Which you can equally do in ANY file
> system, be it journalled or not.
> 
> I repeat: Synchronous writing has NOTHING to do with journalling.
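
To make the "frequent flushes" option concrete, here is a minimal sketch
of a synchronous write using only write() and fsync(), which behave the
same way whether or not the file system is journalled. write_sync() is a
hypothetical helper written for this illustration. Opening the file with
O_SYNC instead would make every write() wait for the device in the same
way.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * write_sync() is a hypothetical helper for illustration.
 * It writes len bytes to path and returns 0 only after fsync() has
 * reported success; it returns -1 on any failure.
 */
int write_sync(const char *path, const void *data, size_t len)
{
    const char *p = data;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    if (fsync(fd) < 0) {                 /* push the file's data to the device */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Without the fsync() call the data may sit in the buffer cache after
write() returns, which is exactly the window Nuno describes.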
> 
> 
> 
> What a JFS really does is to automatically (like it or not) write
> - synchronously - to a journal file, ANY changes to file system METADATA.
> IOW, any changes that involve creation/delete files, allocation of
> disk space or freeing of disk space.
> 
> Those and ONLY those are recovered after a system crash, by simply
> reading from the journal file. Instead of inspecting the ENTIRE file
> system looking for broken metadata.  Which is what fsck does in a
> non-journalled file system.
> 
> With the result (in a JFS) that you do not lose large chunks of a file.
> This is the problem that fsck has with non-journalled file systems:
> sometimes it cannot recover the metadata and it loses track of an entire
> space allocation for a file.  Which can be a substantial part of the
> file.  This happens mostly when files are very volatile or constantly
> changing in allocation.
> 
> Which is NOT the case for Oracle datafiles.  They are pre-allocated
> and do not often change in size.
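
The pre-allocation point can be illustrated with a short sketch. It
assumes posix_fallocate() as the way to reserve space up front; the
thread does not say how Oracle creates its datafiles, so this is only an
analogy. create_preallocated() and file_size() are hypothetical helpers.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600      /* for posix_fallocate() */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create path with `size` bytes allocated up front.
 * Returns an open descriptor, or -1 on failure.
 */
int create_preallocated(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* posix_fallocate() returns an error number, not -1 */
    if (posix_fallocate(fd, 0, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Size of the open file as the file system reports it, or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t) -1;
}
```

Overwriting blocks inside such a file with pwrite() leaves file_size()
unchanged, so the file's length and space allocation stay as they were
while the contents change.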
> 
> 
> It's high time this myth of journalled file systems "protecting"
> data was exposed.  A run-of-the-mill JFS does NOT protect data blocks inside
> files, it protects ONLY the file system's own metadata!  That is certainly
> the case of ext3, JFS, NTFS and many other journalled f/s.  Veritas
> is the only JFS I know of that can ALSO protect the data but that is
> an add-on, not a characteristic of JFS.
> 
> 
> 
> Historical note:
> This f/s metadata thing is the major factor why I never lost a benchmark
> against Ingres: journalled file systems were unknown back then and Ingres
> did not use the concept of pre-allocated datafiles like Oracle.  Their
> tables were stored one table per file, with dynamic space management done
> by the file system itself.  With the result that if you specified a
> benchmark where tables were dropped/re-created and inserted/deleted from
> and you pulled the plug half way through, you'd have a very high
> probability fsck would NOT recover the file system where the Ingres
> database was.
> 
> While Oracle would quietly just roll back the last transaction and keep
> going.  After the fsck was finished, of course.  Remember: no JFS back
> then!  Not once did I have to use the redo log.  Datafiles were
> pre-allocated and the f/s metadata never changed, no matter how busy the
> system was.
> 
> 
> As well, not ONCE did Ingres survive this little "technique"!
> Cheers
> Nuno Souto
> in sunny Sydney, Australia
> dbvision@xxxxxxxxxxxxxxx
> 
> ----------------------------------------------------------------
> Please see the official ORACLE-L FAQ: http://www.orafaq.com
> ----------------------------------------------------------------
> To unsubscribe send email to:  oracle-l-request@xxxxxxxxxxxxx
> put 'unsubscribe' in the subject line.
> --
> Archives are at //www.freelists.org/archives/oracle-l/
> FAQ is at //www.freelists.org/help/fom-serve/cache/1.html
> -----------------------------------------------------------------
> 

-- 
Mladen Gogala
Oracle DBA