Re: Dell-Oracle-Linux: Anyone else run this...because its not working for us!

Thomas,

I don't "buy" your explanation, at least not as far as SCNs and checkpoints are concerned. Yeah, they generate I/O, but not usually a lot of it. Of course, that doesn't make the problems you have experienced *any* less real. (Nor any less disconcerting...)

It strikes me as much more likely that the controller (or even more likely the device driver) is having trouble with *very* high volume I/O requests -- maybe with multiple overlapping I/Os. (This is all guesswork, of course.) A subtle error in the kernel locking / mutual exclusion operations in the device driver could easily result in errors that manifest only under very high I/O conditions -- and maybe even only on multi-processor systems.

I think I recall others contributing to this thread saying that these errors are unique to Linux -- that is consistent with a device driver problem rather than a hardware problem. Has *anybody* out there experienced problems with Dell PERC controllers under Windows?

As for people suffering problems under Linux, are there any other commonalities? For example, async I/O, or direct I/O? Does the problem go away (or reduce in frequency) if you mount filesystems with the 'noatime' option? (It is my theory that maintaining "access" times on datafiles will generate *far* more I/Os than database checkpoints will; in fact if I'm not mistaken, this could nearly double effective I/O levels. I have done *nothing* whatsoever to *verify* that theory, though, so I'll apologise in advance if it proves to be less than entirely correct.)
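For anyone who wants to check the 'noatime' question on their own box, here is a rough sketch (mine, untested against any PERC system) that lists the disk-backed filesystems mounted without that option. It assumes the usual Linux /proc/mounts layout of device, mount point, type and options; the function name is just something I made up for illustration.

```python
from pathlib import Path


def mounts_without_noatime(mounts_text):
    """Return (mount_point, fs_type) pairs whose mount options lack 'noatime'.

    Assumes /proc/mounts-style lines: device mount_point fs_type options ...
    Only entries whose device starts with /dev/ are considered.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        if not device.startswith("/dev/"):
            continue  # skip proc, sysfs, tmpfs and other pseudo-filesystems
        if "noatime" not in options.split(","):
            result.append((mount_point, fs_type))
    return result


if __name__ == "__main__":
    mounts_file = Path("/proc/mounts")
    if mounts_file.exists():
        for mount_point, fs_type in mounts_without_noatime(mounts_file.read_text()):
            print(f"{mount_point} ({fs_type}): 'noatime' is not set")
```

This only tells you which filesystems are candidates for the experiment; it says nothing about how much I/O the access-time updates actually cost.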

Oh well, this is all wild speculation -- and I hope I'm not just wasting everybody's time -- but maybe there's something here that somebody might find helpful. (But sadly, there's probably not...) In the meantime, I'm going to start talking to my SAs (and "architects") about PERC controllers...

Just out of curiosity, what kind of symptoms are folks seeing? I presume you're seeing SCSI errors. "Hard" (permanent) or "soft" (temporary)? System hangs? Kernel panics?

Anyway, thanks for the "heads-up" to all who have contributed to this thread. This has been an eye-opener...


Thomas Day wrote:

[...]  Under certain circumstances, this seems to be
beyond the ability of the controller (to get all these counters
written in a timely manner).

[...]



-- http://www.freelists.org/webpage/oracle-l

