Re: Dell-Oracle-Linux: Anyone else run this...because its not working for us!
From: Mark Brinsmead <mark.brinsmead@xxxxxxx>
To: tomday2@xxxxxxxxx
Date: Mon, 12 Dec 2005 00:48:00 -0700
Thomas,
I don't "buy" your explanation, at least not as far as SCNs and
checkpoints are concerned. Yeah, they generate I/O, but not usually a
lot of it. Of course, that doesn't make the problems you have
experienced *any* less real. (Nor any less disconcerting...)
It strikes me as much more likely that the controller (or even more
likely the device driver) is having trouble with *very* high volume I/O
requests -- maybe with multiple overlapping I/Os. (This is all
guesswork, of course.) A subtle error in the kernel locking / mutual
exclusion operations in the device driver could easily result in errors
that manifest only under very high I/O conditions -- and maybe even only
on multi-processor systems.
I think I recall others contributing to this thread saying that these
errors are unique to Linux -- that is consistent with a device driver
problem rather than a hardware problem. Has *anybody* out there
experienced problems with Dell PERC controllers under Windows?
As for people suffering problems under Linux, are there any other
commonalities? For example, async I/O, or direct I/O? Does the problem
go away (or reduce in frequency) if you mount filesystems with the
'noatime' option? (It is my theory that maintaining "access" times on
datafiles will generate *far* more I/Os than database checkpoints will;
in fact if I'm not mistaken, this could nearly double effective I/O
levels. I have done *nothing* whatsoever to *verify* that theory,
though, so I'll apologise in advance if it proves to be less than
entirely correct.)
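For anyone who wants to check their own mounts, here is a rough sketch (untested against any real PERC system) that lists which disk-backed filesystems are mounted without 'noatime'. It assumes the usual /proc/mounts layout of "device mountpoint fstype options ..." per line; the function name mounts_without_noatime is just something made up for illustration.

```python
def mounts_without_noatime(mounts_text):
    """Return (mount_point, fs_type) pairs whose mount options lack 'noatime'.

    mounts_text is expected in /proc/mounts format (an assumption):
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        if not device.startswith("/dev/"):
            continue  # ignore pseudo-filesystems such as proc or sysfs
        if "noatime" not in options.split(","):
            result.append((mount_point, fs_type))
    return result


# Hypothetical sample input, not taken from any real system.
SAMPLE = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "/dev/sdb1 /u01 ext3 rw,noatime 0 0\n"
    "proc /proc proc rw 0 0\n"
)

for mount_point, fs_type in mounts_without_noatime(SAMPLE):
    print(f"{mount_point} ({fs_type}) is mounted without noatime")
```

On a live Linux box you would pass in the contents of /proc/mounts instead of SAMPLE. This only tells you where access times are still being maintained; it says nothing about how much extra I/O that actually causes.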
Oh well, this is all wild speculation -- and I hope I'm not just
wasting everybody's time -- but maybe there's something here that
somebody might find helpful. (But sadly, there's probably not...) In
the meantime, I'm going to start talking to my SAs (and "architects")
about PERC controllers...
Just out of curiosity, what kind of symptoms are folks seeing? I
presume you're seeing SCSI errors. "Hard" (permanent) or "soft"
(temporary)? System hangs? Kernel panics?
Anyway, thanks for the "heads-up" to all who have contributed to this
thread. This has been an eye-opener...
Thomas Day wrote:
[...] Under certain circumstances, this seems to be
beyond the ability of the controller (to get all these counters
written in a timely manner).