Re: ZFS or UFS? Solaris 11 or better stay with Solaris 10?

  • From: kyle Hailey <kylelf@xxxxxxxxx>
  • To: grzegorzof@xxxxxxxxxx
  • Date: Fri, 30 Mar 2012 14:38:00 -0700

The Oracle paper is good.
I work exclusively with Oracle databases on ZFS, and in general it works
well. I've listed below the issues I've run into and the solutions for each
of them. There are many cool options that are only available on ZFS, and
the ZFS community is quite active in improving, fixing and adding
functionality. There is a lot going on under the hood, and I'm still quite
new to all the possibilities in ZFS.

A few things to keep in mind:

For best write performance the pool should be less than 80% full.
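A quick way to keep an eye on that is the CAP column of zpool list, which
shows the percentage of the pool in use (dbpool below is just a placeholder
pool name):
  zpool list dbpool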

ZFS by default double writes, similar to Oracle. Oracle writes to redo and
to the datafiles; similarly ZFS writes to the ZIL (its redo) and to the
files themselves. This can double the amount of writes, which can be
confusing when benchmarking the I/O. It can be tuned with logbias, as shown
in the Oracle paper. Set the Oracle datafiles to logbias=throughput and
only metadata will be written to the ZIL (like nologging operations in
Oracle) while the data is written straight to the data files. Put the redo
logs in logbias=latency mode so commits go through the ZIL. That is faster
for commits, but it is what causes the double writes.
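For example, assuming separate filesystems for datafiles and redo
(dbpool/oradata and dbpool/redo are made-up names), something like:
  zfs set logbias=throughput dbpool/oradata
  zfs set logbias=latency dbpool/redo
  zfs get logbias dbpool/oradata dbpool/redo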

My experience is that ZFS read ahead and caching work well, especially for
DSS queries. I can't say for sure, but ZFS seems aggressive with read ahead
and caching. If you actually want less read ahead, e.g. you are only doing
random 8K reads, you can turn read ahead off with zfs_prefetch_disable
http://forums.freenas.org/archive/index.php/t-1076.html
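On Solaris this is usually set persistently in /etc/system (takes effect at
the next reboot):
  set zfs:zfs_prefetch_disable = 1
or on a live system with mdb (0t1 is decimal 1); double check the tunable
name against your release before relying on it:
  echo "zfs_prefetch_disable/W0t1" | pfexec mdb -kw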


Two rare but problematic issues come to mind:

1. The ZFS ARC (roughly, the file system cache) stopped caching and spent
all its time kicking pages out. There is a fix, but I'm not sure if it's
out in the open source community yet. To monitor, run
echo '::arc ! egrep -w "c_min|c_max|size|arc_no_grow"' | pfexec mdb -k
The problem manifests as arc_no_grow set to 1 (i.e. don't grow the cache)
while the ARC size stays well under the available memory.
I've seen this happen 4 times on maybe 100s of systems.
If it happens it might require a reboot.
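The ARC size and targets are also visible as kstats, which is handy for
watching them over time (arc_no_grow itself I've only seen through the mdb
command above):
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:c_min zfs:0:arcstats:c_max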

2. Write throughput dropped drastically after a flurry of disk errors. The
disk errors were the core problem, but it turned out that ZFS became a bit
overly protective and throttled writes down too far. Write throttling can
be turned off with zfs_no_write_throttle.
I've seen this happen once, but it was quite confusing at the time.
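If you do need to turn throttling off, the tunable follows the same pattern
as the others on Solaris (again, verify it exists in your release first):
  set zfs:zfs_no_write_throttle = 1          (in /etc/system, after reboot)
  echo "zfs_no_write_throttle/W0t1" | pfexec mdb -kw          (live)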
You should be able to monitor what ZFS thinks the write limit and
throughput are with DTrace:
/* replace "domain0" with the name of your pool */
dsl_pool_sync:entry
/stringof(args[0]->dp_spa->spa_name) == "domain0"/
{
        /* remember the dsl_pool_t for this sync */
        self->dp = args[0];
}

dsl_pool_sync:return
/self->dp/
{
        /* current write limit and the write throughput ZFS has measured */
        printf("write_limit %d, write_throughput %d\n",
            self->dp->dp_write_limit,
            self->dp->dp_throughput);
        self->dp = NULL;
}
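To use it, save the two clauses above to a file (the name is arbitrary, say
write_limit.d) and run it as root:
  pfexec dtrace -qs write_limit.d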


Another thing to be aware of is that a ZFS scrub can show up as a lot of
reads when the filesystem would otherwise be idle. The scrub should back
off strongly when user load comes on, to give priority to other I/O. A
running scrub can be stopped with
zpool scrub -s poolname
That should only be temporary, as it's crucial that scrub gets run
regularly (like weekly).
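You can check whether a scrub is running (and how far along it is), and
kick one off again later, with:
  zpool status poolname
  zpool scrub poolname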

There is no direct I/O per se. Data will get cached in the ARC. If you want
to turn off caching and simulate direct I/O (not suggesting this, but it's
useful for testing the actual back end disks) you can set caching off:
  zfs set primarycache=none poolname
Note that this will still leave things cached that are already in the
cache. You'd have to export and re-import the pool to clear the cache of an
existing pool.
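If you really do need to clear it, something like the following works, but
obviously only with the database down and the pool otherwise unused:
  zpool export poolname
  zpool import poolname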

ZFS will also tell the disks to flush their write caches, as an internal
array usually has one. This can cause problems if that cache is battery
backed and the array honors the flush by forcing the write all the way to
disk. The cache flush can be turned off with
zfs_nocacheflush
I think the Oracle paper discusses this.
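It follows the same pattern as the other tunables; verify it against your
release and only use it when the whole pool sits behind protected cache:
  set zfs:zfs_nocacheflush = 1          (in /etc/system, after reboot)
  echo "zfs_nocacheflush/W0t1" | pfexec mdb -kw          (live)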

The parameters you want to set on each filesystem are (a sketch with
example commands follows the list)

   - compression - on/off, up to you
   - logbias - latency for redo, throughput for datafiles
   - recordsize - database block size for datafiles, 128K for others. The
   Oracle paper gives recommendations
   - primarycache - all, except archive logs (and UNDO) which can be set to
   metadata
   - secondarycache - all for datafiles, none for others (probably)
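
Put together, a rough sketch with made-up filesystem names and an 8K
db_block_size (take the exact recordsize and cache settings from the Oracle
paper for your setup):
  zfs set compression=on dbpool/oradata
  zfs set recordsize=8k dbpool/oradata
  zfs set logbias=throughput dbpool/oradata
  zfs set secondarycache=all dbpool/oradata
  zfs set logbias=latency dbpool/redo
  zfs set primarycache=metadata dbpool/arch
  zfs set secondarycache=none dbpool/arch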


ZFS also bases some calculations on LUNs, so if you give ZFS a few LUNs
that represent many back end spindles, some of the I/O queue calculations
can be off. I believe the Oracle paper goes into this. It has never been a
problem that I've seen directly, though I have heard about it.

As in all I/O systems, block alignment is important. Unfortunately that is
the point in this list I'm weakest on.

Comments, additions, corrections welcome as this is all new to me :)


- Kyle









On Thu, Mar 29, 2012 at 9:49 AM, GG <grzegorzof@xxxxxxxxxx> wrote:

> On 2012-03-28 15:34, De DBA wrote:
> > G'day,
> ZFS is very mature and database friendly filesystem as long as You
> follow the rules :) mentioned here
> http://developers.sun.com/solaris/docs/wp-oraclezfsconfig-0510_ds_ac2.pdf
> :)
>
> Regards
> GregG
>
> --
> //www.freelists.org/webpage/oracle-l
>
>
>


--
//www.freelists.org/webpage/oracle-l

