* FYI: RAID5 unusably unstable through 2.6.14
@ 2006-01-17 19:35 Cynbe ru Taren
  2006-01-17 19:39 ` Benjamin LaHaise
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Cynbe ru Taren @ 2006-01-17 19:35 UTC (permalink / raw)
  To: linux-kernel


Just in case the RAID5 maintainers aren't aware of it:

The current Linux kernel RAID5 implementation is just
too fragile to be used for most of the applications
where it would be most useful.

In principle, RAID5 should allow construction of a
disk-based store which is considerably MORE reliable
than any individual drive.

In my experience, at least, using Linux RAID5 results
in a disk storage system which is considerably LESS
reliable than the underlying drives.

What happens repeatedly, at least in my experience over
a variety of boxes running a variety of 2.4 and 2.6
Linux kernel releases, is that any transient I/O problem
results in a critical mass of RAID5 drives being marked
'failed', at which point there is no longer any supported
way of retrieving the data on the RAID5 device, even
though the underlying drives are all fine, and the underlying
data on those drives almost certainly intact.

This has just happened to me for at least the sixth time,
this time in a brand new RAID5 consisting of 8 200G hotswap
SATA drives backing up the contents of about a dozen onsite
and offsite boxes via dirvish, which took me the better part
of December to get initialized and running, and now two weeks
later I'm back to square one.

I'm currently digging through the md kernel source code
trying to work out some ad-hoc recovery method, but this
level of flakiness just isn't acceptable on systems where
reliable mass storage is a must -- and when else would
one bother with RAID5?

I run a RAID1 mirrored boot and/or root partition on all
the boxes I run RAID5 on -- and lots more as well -- and
RAID1 -does- work as one would hope, providing a disk
store -more- reliable than the underlying drives.  A
Linux RAID1 system will ride out any sort of sequence
of hardware problems, and if the hardware is physically
capable of running at all, the RAID1 system will pop
right back like a cork coming out of white water.

I've NEVER had a RAID1 throw a temper tantrum and go
into apoptosis mode the way RAID5s do given the slightest
opportunity.

We need RAID5 to be equally resilient in the face of
real-world problems, people -- it isn't enough to
just be able to function under ideal lab conditions!

A design bug is -still- a bug, and -still- needs to
get fixed.

Something HAS to be done to make the RAID5 logic
MUCH more conservative about destroying RAID5
systems in response to a transient burst of I/O
errors, before it can in good conscience be declared
ready for production use -- or at MINIMUM to provide
a SUPPORTED way of restoring a butchered RAID5 to
last-known-good configuration or such once transient
hardware issues have been resolved.

There was a time when Unix filesystems disintegrated
on the slightest excuse, requiring guru-level inode
hand-editing to fix.  fsck basically ended that,
allowing any idiot to successfully maintain a unix
filesystem in the face of real-life problems like
power failures and kernel crashes.  Maybe we need
a mdfsck which can fix sick RAID5 subsystems?

In the meantime, IMHO Linux RAID5 should be prominently flagged
EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid
building up ill-will and undeserved distrust of Linux
software quality generally.

Pending some quantum leap in Linux RAID5 resistance to
collapse, I'm switching to RAID1 everywhere:  Doubling
my diskspace hardware costs is a SMALL price to pay to
avoid weeks of system downtime and rebuild effort annually.
I like to spend my time writing open source, not
rebuilding servers. :)   (Yes, I could become an md
maintainer myself.  But only at the cost of defaulting
on pre-existing open source commitments.  We all have
full plates.)

Anyhow -- kudos to everyone involved:  I've been using
Unix since v7 on PDP-11, Irix since its 68020 days,
and Linux since booting off floppy was mandatory, and
in general I'm happy as a bug in a rug with the fleet
of Debian Linux boxes I manage, with uptimes often exceeding
a year, typically limited only by hardware or software
upgrades -- great work all around, everyone!

Life is Good!

 -- Cynbe





* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
@ 2006-01-17 19:39 ` Benjamin LaHaise
  2006-01-17 20:13   ` Martin Drab
  2006-01-17 19:56 ` Kyle Moffett
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Benjamin LaHaise @ 2006-01-17 19:39 UTC (permalink / raw)
  To: Cynbe ru Taren; +Cc: linux-kernel

On Tue, Jan 17, 2006 at 01:35:46PM -0600, Cynbe ru Taren wrote:
> In principle, RAID5 should allow construction of a
> disk-based store which is considerably MORE reliable
> than any individual drive.
> 
> In my experience, at least, using Linux RAID5 results
> in a disk storage system which is considerably LESS
> reliable than the underlying drives.

That is a function of how RAID5 works.  A properly configured RAID5 array 
will have a spare disk to take over in case one of the members fails, as 
otherwise you run a serious risk of not being able to recover any data.
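
For what it's worth, a minimal sketch of setting that up with mdadm (the
device and array names below are made-up examples, not anything from this
report):

	# create a 4-drive RAID5 with one hot spare standing by
	mdadm --create /dev/md0 --level=5 --raid-devices=4 --spare-devices=1 \
		/dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
	# or add a spare to an array that is already running
	mdadm /dev/md0 --add /dev/sdg1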

> What happens repeatedly, at least in my experience over
> a variety of boxes running a variety of 2.4 and 2.6
> Linux kernel releases, is that any transient I/O problem
> results in a critical mass of RAID5 drives being marked
> 'failed', at which point there is no longer any supported
> way of retrieving the data on the RAID5 device, even
> though the underlying drives are all fine, and the underlying
> data on those drives almost certainly intact.

Underlying disks should not be experiencing transient failures.  Are you 
sure the problem isn't with the disk controller you're building your array 
on top of?  At the very least any bug report requires that information to 
be able to provide even a basic analysis of what is going wrong.

Personally, I am of the opinion that RAID5 should not be used by the 
vast majority of people as the failure modes it entails are far too 
complex for most people to cope with.

		-ben
-- 
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <dont@kvack.org>.


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
  2006-01-17 19:39 ` Benjamin LaHaise
@ 2006-01-17 19:56 ` Kyle Moffett
  2006-01-17 19:58 ` David R
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Kyle Moffett @ 2006-01-17 19:56 UTC (permalink / raw)
  To: Cynbe ru Taren; +Cc: linux-kernel

On Jan 17, 2006, at 14:35, Cynbe ru Taren wrote:
> What happens repeatedly, at least in my experience over a variety  
> of boxes running a variety of 2.4 and 2.6 Linux kernel releases, is  
> that any transient I/O problem results in a critical mass of RAID5  
> drives being marked 'failed', at which point there is no longer any  
> supported way of retrieving the data on the RAID5 device, even  
> though the underlying drives are all fine, and the underlying data  
> on those drives almost certainly intact.

Insufficient detail.  Please provide a full bug report detailing the  
problem, then we can help you.

> I've NEVER had a RAID1 throw a temper tantrum and go into  
> apoptosis mode the way RAID5s do given the slightest opportunity.

I've never had either RAID1 _or_ RAID5 throw temper tantrums on me,  
_including_ during drive failures.  In fact, I've dealt easily with  
Linux RAID multi-drive failures that threw all our shiny 3ware RAID  
hardware into fits it took me an hour to work out.

> Something HAS to be done to make the RAID5 logic MUCH more  
> conservative about destroying RAID5
> systems in response to a transient burst of I/O errors, before it  
> can in good conscience be declared ready for production use -- or  
> at MINIMUM to provide a SUPPORTED way of restoring a butchered  
> RAID5 to last-known-good configuration or such once transient  
> hardware issues have been resolved.

The problem is that such errors are _rarely_ transient; they usually  
indicate deeper media problems.  Have you verified your disks using  
smartctl?  There already _is_ such a way to restore said "butchered"  
RAID5:  "mdadm --assemble --force".  In any case, I suspect your  
RAID-on-SATA problems are more due to the primitive nature of the SATA  
error handling; much of the code does not do more than a basic bus  
reset before failing the whole I/O.
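
A rough sketch of both checks (device and array names are placeholders
only):

	# SMART health, error log and attributes for one member disk
	smartctl -a /dev/sda
	# run an extended self-test, then read back the result
	smartctl -t long /dev/sda
	smartctl -l selftest /dev/sda
	# once the disks check out, force-assemble from the kicked-out members
	mdadm --assemble --force /dev/md0 /dev/sd[a-h]1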

> In the meantime, IMHO Linux RAID5 should be prominently flagged  
> EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid  
> building up ill-will and undeserved distrust of Linux software  
> quality generally.

It works great for me, and for a lot of other people too, including  
production servers.  In fact, I've had fewer issues with Linux RAID5  
than with a lot of hardware RAIDs, especially when the HW raid  
controller died and the company was no longer in business :-\.  If  
you can provide actual bug reports, we'd be happy to take a look at  
your problems, but as it is, we can't help you.

Cheers,
Kyle Moffett

--
There is no way to make Linux robust with unreliable memory  
subsystems, sorry.  It would be like trying to make a human more  
robust with an unreliable O2 supply. Memory just has to work.
   -- Andi Kleen




* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
  2006-01-17 19:39 ` Benjamin LaHaise
  2006-01-17 19:56 ` Kyle Moffett
@ 2006-01-17 19:58 ` David R
  2006-01-17 20:00 ` Kyle Moffett
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: David R @ 2006-01-17 19:58 UTC (permalink / raw)
  To: Cynbe ru Taren; +Cc: linux-kernel


Cynbe ru Taren wrote:
> The current Linux kernel RAID5 implementation is just
> too fragile to be used for most of the applications
> where it would be most useful.

I'm not sure I agree.

> What happens repeatedly, at least in my experience over
> a variety of boxes running a variety of 2.4 and 2.6
> Linux kernel releases, is that any transient I/O problem
> results in a critical mass of RAID5 drives being marked
> 'failed', at which point there is no longer any supported

What "transient" I/O problem would this be. I've had loads of issues with
flaky motherboard/PCI bus implementations that make RAID using addin cards
(all 5 slots filled with other devices) a nightmare. The built in controllers
seem to be more reliable.

> way of retrieving the data on the RAID5 device, even
> though the underlying drives are all fine, and the underlying
> data on those drives almost certainly intact.

This is no problem; just use something like:

	mdadm --assemble --force /dev/md5 /dev/sda1 /dev/sdb1 /dev/sdc1 \
		/dev/sdd1 /dev/sde1

(Then of course do a fsck)

You can even do this with (nr.drives-1), then add in the last one to be
sync'ed up in the background.
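
Spelled out with the device names from the command above (a sketch only):

	# assemble from all members but one, check the filesystem, then
	# re-add the last drive so it resyncs in the background
	mdadm --assemble --force /dev/md5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
	fsck /dev/md5
	mdadm /dev/md5 --add /dev/sde1
	cat /proc/mdstat	# watch the resync progress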

> This has just happened to me for at least the sixth time,
> this time in a brand new RAID5 consisting of 8 200G hotswap
> SATA drives backing up the contents of about a dozen onsite
> and offsite boxes via dirvish, which took me the better part
> of December to get initialized and running, and now two weeks
> later I'm back to square one.

:-( .. maybe try the force assemble?

> I'm currently digging through the md kernel source code
> trying to work out some ad-hoc recovery method, but this
> level of flakiness just isn't acceptable on systems where
> reliable mass storage is a must -- and when else would
> one bother with RAID5?

It isn't flaky for me now that I'm using a better-quality motherboard; in
fact it's saved me through 3 near-simultaneous failures of WD 250GB drives.

> We need RAID5 to be equally resilient in the face of
> real-world problems, people -- it isn't enough to
> just be able to function under ideal lab conditions!

I think it is. The automatics are paranoid (as they should be) when failures
are noticed. The array can be assembled manually though.

> A design bug is -still- a bug, and -still- needs to
> get fixed.

It's not a design bug - in my opinion.

> Something HAS to be done to make the RAID5 logic
> MUCH more conservative about destroying RAID5
> systems in response to a transient burst of I/O
> errors, before it can in good conscience be declared

If such things are common you should investigate the hardware.

> ready for production use -- or at MINIMUM to provide
> a SUPPORTED way of restoring a butchered RAID5 to
> last-known-good configuration or such once transient
> hardware issues have been resolved.

It is. See above.

> In the meantime, IMHO Linux RAID5 should be prominently flagged
> EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid
> building up ill-will and undeserved distrust of Linux
> software quality generally.

I'd calm down if I were you.

Cheers
David



* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
                   ` (2 preceding siblings ...)
  2006-01-17 19:58 ` David R
@ 2006-01-17 20:00 ` Kyle Moffett
  2006-01-17 23:27 ` Michael Loftis
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Kyle Moffett @ 2006-01-17 20:00 UTC (permalink / raw)
  To: LKML Kernel

BTW, when you get this message via the list (because I refuse to  
touch challenge-response email systems), shut off your damn email  
challenge-response system before posting to this list again.  Such  
systems are exceptionally poor etiquette on a public mailing list,  
and more than likely most posters will flag your "challenge" as junk- 
mail (as I did, and will continue doing).

Cheers,
Kyle Moffett

--
Premature optimization is the root of all evil in programming
   -- C.A.R. Hoare





* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:39 ` Benjamin LaHaise
@ 2006-01-17 20:13   ` Martin Drab
  2006-01-17 23:39     ` Michael Loftis
  2006-02-02 20:33     ` Bill Davidsen
  0 siblings, 2 replies; 44+ messages in thread
From: Martin Drab @ 2006-01-17 20:13 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Cynbe ru Taren, linux-kernel

On Tue, 17 Jan 2006, Benjamin LaHaise wrote:

> On Tue, Jan 17, 2006 at 01:35:46PM -0600, Cynbe ru Taren wrote:
> > In principle, RAID5 should allow construction of a
> > disk-based store which is considerably MORE reliable
> > than any individual drive.
> > 
> > In my experience, at least, using Linux RAID5 results
> > in a disk storage system which is considerably LESS
> > reliable than the underlying drives.
> 
> That is a function of how RAID5 works.  A properly configured RAID5 array 
> will have a spare disk to take over in case one of the members fails, as 
> otherwise you run a serious risk of not being able to recover any data.
> 
> > What happens repeatedly, at least in my experience over
> > a variety of boxes running a variety of 2.4 and 2.6
> > Linux kernel releases, is that any transient I/O problem
> > results in a critical mass of RAID5 drives being marked
> > 'failed', at which point there is no longer any supported
> > way of retrieving the data on the RAID5 device, even
> > though the underlying drives are all fine, and the underlying
> > data on those drives almost certainly intact.
> 
> Underlying disks should not be experiencing transient failures.  Are you 
> sure the problem isn't with the disk controller you're building your array 
> on top of?  At the very least any bug report requires that information to 
> be able to provide even a basic analysis of what is going wrong.

Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID 
5 array. Due to the CPU overheating, the whole box was suddenly shut down 
by the CPU damage protection mechanism. Since there is no battery backup 
on this particular RAID controller, the sudden poweroff caused some very 
localized inconsistency on one disk in the RAID. The configuration was 
1x160 GB and 3x120 GB, with the 160 GB being split into a 120 GB part 
within the RAID 5 and a 40 GB part as a separate volume. The inconsistency 
happened in the 40 GB part of the 160 GB HDD (as reported by the Adaptec 
BIOS media check). In particular the problem was in /dev/sda2 (with 
/dev/sda being the 40 GB volume, /dev/sda1 being an NTFS Windows system, 
and /dev/sda2 being an ext3 Linux system).

Now, what is interesting is that Linux completely refused any 
access to any byte within /dev/sda: not even dd(1) reading from any 
position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything 
ended up with lots of the following messages:

        sd 0:0:0:0: SCSI error: return code = 0x8000002
        sda: Current: sense key: Hardware Error
            Additional sense: Internal target failure
        Info fld=0x0
        end_request: I/O error, dev sda, sector <some sector number>

I consulted Mark Salyzyn about this, because I thought it was a problem 
with the AACRAID driver. But I was told that there is nothing that AACRAID 
can possibly do about it, and that it is a problem of the upper Linux 
layers (block device layer?), which are strictly fault intolerant: even 
though the problem was just an inconsistency in one particular localized 
region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a 
single byte from the ENTIRE VOLUME (/dev/sda)!

And now for the best part: From Windows, I was able to access the ENTIRE 
VOLUME without the slightest problem. Not only did Windows boot entirely 
from /dev/sda1, but using Total Commander's ext3 plugin I was also 
able to access the ENTIRE /dev/sda2 and at least extract the most 
important data and configurations, before I did the complete low-level 
formatting of the drive, which fixed the inconsistency problem.

I call this "AN IRONY" to be forced to use Windows to extract information 
from Linux partition, wouldn't you? ;)

(Besides, even GRUB (using the BIOS) accessed /dev/sda without 
complications - as it was the bootable volume. Only Linux failed here, 
100%. :()

Martin


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
                   ` (3 preceding siblings ...)
  2006-01-17 20:00 ` Kyle Moffett
@ 2006-01-17 23:27 ` Michael Loftis
  2006-01-18  0:12   ` Kyle Moffett
  2006-01-18  0:21   ` Phillip Susi
  2006-01-18 10:54 ` Helge Hafting
  2006-01-19  0:13 ` Neil Brown
  6 siblings, 2 replies; 44+ messages in thread
From: Michael Loftis @ 2006-01-17 23:27 UTC (permalink / raw)
  To: linux-kernel



--On January 17, 2006 1:35:46 PM -0600 Cynbe ru Taren <cynbe@muq.org> wrote:

>
> Just in case the RAID5 maintainers aren't aware of it:
>
> The current Linux kernel RAID5 implementation is just
> too fragile to be used for most of the applications
> where it would be most useful.
>
> In principle, RAID5 should allow construction of a
> disk-based store which is considerably MORE reliable
> than any individual drive.

Absolutely not.  The more spindles, the more chance of a double failure. 
Simple statistics mean that unless you have mirrors, the more drives 
you add, the greater the chance of two of them (really) failing at once and 
choking the whole system.

That said, there very well could be (are?) cases where md needs to do a 
better job of handling the world unravelling.


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 20:13   ` Martin Drab
@ 2006-01-17 23:39     ` Michael Loftis
  2006-01-18  2:30       ` Martin Drab
  2006-02-02 20:33     ` Bill Davidsen
  1 sibling, 1 reply; 44+ messages in thread
From: Michael Loftis @ 2006-01-17 23:39 UTC (permalink / raw)
  To: Martin Drab, Benjamin LaHaise; +Cc: Cynbe ru Taren, linux-kernel



--On January 17, 2006 9:13:49 PM +0100 Martin Drab 
<drab@kepler.fjfi.cvut.cz> wrote:

> I've consulted this with Mark Salyzyn, because I thought it was a problem
> of the AACRAID driver. But I was told, that there is nothing that AACRAID
> can possibly do about it, and that it is a problem of the upper Linux
> layers (block device layer?) that are strictly fault intolerant, and
> though the problem was just an inconsistency of one particular localized
> region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a
> single byte from the ENTIRE VOLUME (/dev/sda)!

Actually... this is also related to how the controller reports the error. 
If it reports a device-level death/failure rather than a read error, Linux 
is just taking that at face value.  Yup, it should retry though.  Other 
possibilities exist, including the volume going offline at the controller 
level and having nothing to do with Linux; this is most often the problem I 
see with RAIDs.


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 23:27 ` Michael Loftis
@ 2006-01-18  0:12   ` Kyle Moffett
  2006-01-18 11:24     ` Erik Mouw
  2006-01-18  0:21   ` Phillip Susi
  1 sibling, 1 reply; 44+ messages in thread
From: Kyle Moffett @ 2006-01-18  0:12 UTC (permalink / raw)
  To: Michael Loftis; +Cc: linux-kernel

On Jan 17, 2006, at 18:27, Michael Loftis wrote:
> --On January 17, 2006 1:35:46 PM -0600 Cynbe ru Taren  
> <cynbe@muq.org> wrote:
>> Just in case the RAID5 maintainers aren't aware of it:
>>
>> The current Linux kernel RAID5 implementation is just too fragile  
>> to be used for most of the applications where it would be most  
>> useful.
>>
>> In principle, RAID5 should allow construction of a disk-based  
>> store which is considerably MORE reliable than any individual drive.
>
> Absolutely not.  The more spindles the more chances of a double  
> failure. Simple statistics will mean that unless you have mirrors  
> the more drives you add the more chance of two of them (really)  
> failing at once and choking the whole system.

The most reliable RAID-5 you can build is a 3-drive system.  For each  
byte of data you have a half-byte of parity, meaning any one of the  
three drives (half the data space, not counting the parity) can fail  
without data loss.  I'm ignoring the issue of the rotating parity for  
simplicity, but that only affects performance, not the algorithm.  If  
you want any kind of _real_ reliability and speed, you should buy a  
couple of good hardware RAID-5 units and mirror them in software.

Cheers,
Kyle Moffett

--
If you don't believe that a case based on [nothing] could potentially  
drag on in court for _years_, then you have no business playing with  
the legal system at all.
   -- Rob Landley





* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 23:27 ` Michael Loftis
  2006-01-18  0:12   ` Kyle Moffett
@ 2006-01-18  0:21   ` Phillip Susi
  2006-01-18  0:29     ` Michael Loftis
  2006-02-02 22:10     ` Bill Davidsen
  1 sibling, 2 replies; 44+ messages in thread
From: Phillip Susi @ 2006-01-18  0:21 UTC (permalink / raw)
  To: Michael Loftis; +Cc: linux-kernel

Your understanding of statistics leaves something to be desired.  As you 
add disks the probability of a single failure grows linearly, but the 
probability of a double failure grows much more slowly.  For example:

If 1 disk has a 1/1000 chance of failure, then
2 disks have a (1/1000)^2 chance of double failure, and
3 disks have a (1/1000)^2 * 3 chance of double failure
4 disks have a (1/1000)^2 * 7 chance of double failure

Thus the probability of double failure on this 4 drive array is ~142 
times less than the odds of a single drive failing.  As the probability of 
a single drive failing becomes more remote, the ratio of that 
probability to the probability of a double fault in the array grows 
exponentially.

( I think I did that right in my head... will check on a real calculator 
  later )

This is why raid-5 was created: because the array has a much lower 
probability of double failure, and thus of data loss, than a single drive. 
Then of course, if you are really paranoid, you can go with raid-6 ;)
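
For anyone who wants the exact numbers rather than the back-of-the-envelope
version, a quick sketch (p and n are arbitrary example values, not measured
failure rates):

	# P(>=2 failures) = 1 - (1-p)^n - n*p*(1-p)^(n-1)  (independent drives)
	awk 'BEGIN {
		p = 0.001; n = 4
		none = (1 - p) ^ n                  # no drive fails
		one  = n * p * (1 - p) ^ (n - 1)    # exactly one fails
		printf "P(>=1) = %g   P(>=2) = %g\n", 1 - none, 1 - none - one
	}'
	# With these values the leading term is C(4,2) * p^2 = 6e-06, so a
	# double failure is still far less likely than a single-drive failure.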


Michael Loftis wrote:
> Absolutely not.  The more spindles the more chances of a double failure. 
> Simple statistics will mean that unless you have mirrors the more drives 
> you add the more chance of two of them (really) failing at once and 
> choking the whole system.
> 
> That said, there very well could be (are?) cases where md needs to do a 
> better job of handling the world unravelling.
> -


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  0:21   ` Phillip Susi
@ 2006-01-18  0:29     ` Michael Loftis
  2006-01-18  2:10       ` Phillip Susi
  2006-02-02 22:10     ` Bill Davidsen
  1 sibling, 1 reply; 44+ messages in thread
From: Michael Loftis @ 2006-01-18  0:29 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-kernel



--On January 17, 2006 7:21:45 PM -0500 Phillip Susi <psusi@cfl.rr.com> 
wrote:

> Your understanding of statistics leaves something to be desired.  As you
> add disks the probability of a single failure grows linearly, but the
> probability of double failure grows much more slowly.  For example:

What about what I said was inaccurate?  I never said that it increases 
exponentially or anything like that, just that it does increase, which 
you've proven.  I was speaking in the case of a RAID-5 set, where the 
minimum is 3 drives, so every additional drive increases the chance of a 
double fault condition.  Now if we're including mirrors and stripes/etc, 
then that means we do have to look at the 2 spindle case, but the third 
spindle and beyond keeps increasing.  If you've a 1% failure rate, and you 
have 100+ drives, chances are pretty good you're going to see a failure. 
Yes it's a LOT more complicated than that.

>
> If 1 disk has a 1/1000 chance of failure, then
> 2 disks have a (1/1000)^2 chance of double failure, and
> 3 disks have a (1/1000)^2 * 3 chance of double failure
> 4 disks have a (1/1000)^2 * 7 chance of double failure
>
> Thus the probability of double failure on this 4 drive array is ~142
> times less than the odds of a single drive failing.  As the probability of a
> single drive failing becomes more remote, then the ratio of that
> probability to the probability of double fault in the array grows
> exponentially.
>
> ( I think I did that right in my head... will check on a real calculator
> later )
>
> This is why raid-5 was created: because the array has a much lower
> probabiliy of double failure, and thus, data loss, than a single drive.
> Then of course, if you are really paranoid, you can go with raid-6 ;)
>
>
> Michael Loftis wrote:
>> Absolutely not.  The more spindles the more chances of a double failure.
>> Simple statistics will mean that unless you have mirrors the more drives
>> you add the more chance of two of them (really) failing at once and
>> choking the whole system.
>>
>> That said, there very well could be (are?) cases where md needs to do a
>> better job of handling the world unravelling.
>> -
>



--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  0:29     ` Michael Loftis
@ 2006-01-18  2:10       ` Phillip Susi
  2006-01-18  3:01         ` Michael Loftis
  2006-01-18 16:47         ` Krzysztof Halasa
  0 siblings, 2 replies; 44+ messages in thread
From: Phillip Susi @ 2006-01-18  2:10 UTC (permalink / raw)
  To: Michael Loftis; +Cc: linux-kernel

Michael Loftis wrote:

> What about what I said was inaccurate?  I never said that it increases 
> exponentially or anything like that, just that it does increase, which 
> you've proven.  I was speaking in the case of a RAID-5 set, where the 
> minimum is 3 drives, so every additional drive increases the chance of 
> a double fault condition.  Now if we're including mirrors and 
> stripes/etc, then that means we do have to look at the 2 spindle case, 
> but the third spindle and beyond keeps increasing.  If you've a 1% 
> failure rate, and you have 100+ drives, chances are pretty good you're 
> going to see a failure. Yes it's a LOT more complicated than that.
>

I understood you to be saying that a raid-5 was less reliable than a 
single disk, which it is not.  Maybe I did not read correctly.  Yes, a 3 
+ n disk raid-5 has a higher chance of failure than a 3 disk raid-5, but 
only slightly so, and in any case, a 3 disk raid-5 is FAR more reliable 
than a single drive, and only slightly less reliable than a two disk 
raid-1 ( though you get 3x the space for only 50% higher cost, so 6x 
cheaper cost per byte of storage ). 






* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 23:39     ` Michael Loftis
@ 2006-01-18  2:30       ` Martin Drab
  0 siblings, 0 replies; 44+ messages in thread
From: Martin Drab @ 2006-01-18  2:30 UTC (permalink / raw)
  To: Michael Loftis; +Cc: Benjamin LaHaise, Cynbe ru Taren, linux-kernel

On Tue, 17 Jan 2006, Michael Loftis wrote:
> --On January 17, 2006 9:13:49 PM +0100 Martin Drab <drab@kepler.fjfi.cvut.cz> wrote:
> 
> > I've consulted this with Mark Salyzyn, because I thought it was a problem
> > of the AACRAID driver. But I was told, that there is nothing that AACRAID
> > can possibly do about it, and that it is a problem of the upper Linux
> layers (block device layer?) that are strictly fault intolerant, and 
> though the problem was just an inconsistency of one particular localized 
> > region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a
> > single byte from the ENTIRE VOLUME (/dev/sda)!
> 
> Actually...this is also related to how the controller reports the error. If it
> reports a device level death/failure rather than a read error, Linux is just

Yes, but that wasn't the case here. I've witnessed that a while ago, but 
not this time. It was just a read error, no device death nor going off-line. 
Otherwise I wouldn't be so surprised that Linux didn't even try. 
The controller didn't do anything that would prevent the system from reading. 
Windows used that and worked; Linux unfortunately didn't even try. That's 
why I'm talking about it here.

> taking that on face value.  Yup, it should retry though.  Other possibilities
> exist including the volume going offline at the controller level, having
> nothing to do with Linux, this is most often the problem I see with RAIDs.

Martin


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  2:10       ` Phillip Susi
@ 2006-01-18  3:01         ` Michael Loftis
  2006-01-18 16:49           ` Krzysztof Halasa
  2006-01-18 16:47         ` Krzysztof Halasa
  1 sibling, 1 reply; 44+ messages in thread
From: Michael Loftis @ 2006-01-18  3:01 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-kernel



--On January 17, 2006 9:10:56 PM -0500 Phillip Susi <psusi@cfl.rr.com> 
wrote:

> I understood you to be saying that a raid-5 was less reliable than a
> single disk, which it is not.  Maybe I did not read correctly.  Yes, a 3
> + n disk raid-5 has a higher chance of failure than a 3 disk raid-5, but
> only slightly so, and in any case, a 3 disk raid-5 is FAR more reliable
> than a single drive, and only slightly less reliable than a two disk
> raid-1 ( though you get 3x the space for only 50% higher cost, so 6x
> cheaper cost per byte of storage ).


Yup, we're on the same page, we just didn't think we were.  It happens :) 
R-5 (in theory) could be less reliable than a mirror or possibly a single 
drive, but it'd take a pretty obscene number of drives with an excessively 
large strip size.


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
                   ` (4 preceding siblings ...)
  2006-01-17 23:27 ` Michael Loftis
@ 2006-01-18 10:54 ` Helge Hafting
  2006-01-18 16:15   ` Mark Lord
  2006-01-19  0:13 ` Neil Brown
  6 siblings, 1 reply; 44+ messages in thread
From: Helge Hafting @ 2006-01-18 10:54 UTC (permalink / raw)
  To: Cynbe ru Taren; +Cc: linux-kernel

Cynbe ru Taren wrote:

>Just in case the RAID5 maintainers aren't aware of it:
>
>The current Linux kernel RAID5 implementation is just
>too fragile to be used for most of the applications
>where it would be most useful.
>
>In principle, RAID5 should allow construction of a
>disk-based store which is considerably MORE reliable
>than any individual drive.
>
>In my experience, at least, using Linux RAID5 results
>in a disk storage system which is considerably LESS
>reliable than the underlying drives.
>
>What happens repeatedly, at least in my experience over
>a variety of boxes running a variety of 2.4 and 2.6
>Linux kernel releases, is that any transient I/O problem
>results in a critical mass of RAID5 drives being marked
>'failed', 
>
What kind of "transient io error" would that be?
That is not supposed to happen regularly. . .

You do replace failed drives immediately?  Allowing
systems to run "for a while" in degraded mode is
surely a recipe for disaster.  Degraded mode
has no safety at all, it is just raid-0 with a performance
overhead added in. :-/

Having hot spares is a nice way of replacing the failed
drive quickly.
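
A sketch of both ideas (array and device names are placeholders): keep a
spare attached so rebuilds start by themselves, and run the monitor so a
failed member actually gets noticed:

	# add a hot spare to a running array
	mdadm /dev/md0 --add /dev/sdf1
	# mail an alert on Fail/DegradedArray events for the arrays in mdadm.conf
	mdadm --monitor --scan --mail=root --daemonise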

>at which point there is no longer any supported
>way of retrieving the data on the RAID5 device, even
>though the underlying drives are all fine, and the underlying
>data on those drives almost certainly intact.
>  
>
As others have shown - "mdadm" can reassemble your
broken raid - and it'll work well in those cases where
the underlying drives indeed are ok.  It will fail
spectacularly if you have a real double fault though,
but then nothing short of raid-6 can save you.


Helge Hafting



* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  0:12   ` Kyle Moffett
@ 2006-01-18 11:24     ` Erik Mouw
  0 siblings, 0 replies; 44+ messages in thread
From: Erik Mouw @ 2006-01-18 11:24 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: Michael Loftis, linux-kernel

On Tue, Jan 17, 2006 at 07:12:57PM -0500, Kyle Moffett wrote:
> The most reliable RAID-5 you can build is a 3-drive system.  For each  
> byte of data, you have a half-byte of parity, meaning that half the  
> data-space (not including the parity) can fail without data loss.   
> I'm ignoring the issue of rotating parity drive for simplicity, but  
> that only affects performance, not the algorithm.  If you want any  
> kind of _real_ reliability and speed, you should buy a couple good  
> hardware RAID-5 units and mirror them in software.

Actually, the most reliable RAID-5 is a 2-drive system, where you have
a full byte of redundancy for each byte of data. Two drive RAID-5
systems are usually called RAID-1, but if you write out the formulas it
becomes clear that RAID-1 is just a special case of RAID-5.


Erik

-- 
+-- Erik Mouw -- www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
| Data lost? Stay calm and contact Harddisk-recovery.com


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18 10:54 ` Helge Hafting
@ 2006-01-18 16:15   ` Mark Lord
  2006-01-18 17:32     ` Alan Cox
  2006-01-18 23:37     ` Neil Brown
  0 siblings, 2 replies; 44+ messages in thread
From: Mark Lord @ 2006-01-18 16:15 UTC (permalink / raw)
  To: Helge Hafting; +Cc: Cynbe ru Taren, linux-kernel

Helge Hafting wrote:
 >
 > As others have shown - "mdadm" can reassemble your
 > broken raid - and it'll work well in those cases where
 > the underlying drives indeed are ok.  It will fail
 > spectacularly if you have a real double fault though,
 > but then nothing short of raid-6 can save you.

No, actually there are several things we *could* do,
if only the will-to-do-so existed.

For example, one bad sector on a drive doesn't mean that
the entire drive has failed.  It just means that one 512-byte
chunk of the drive has failed.

We could rewrite the failed area of the drive, allowing the
onboard firmware to repair the fault internally, likely by
remapping physical sectors.  This is nothing unusual, as all
drives these days ship from the factory with many bad sectors
that have already been remapped to "fix" them.  One or two
more in the field is no reason to toss a perfectly good drive.

Mind you, if it's more than just one or two bad sectors,
then the drive really should get tossed regardless. And the case
can be made that even for the first one or two bad sectors,
a prudent sysadmin would schedule replacement of the whole drive.

But until the drive is replaced, it could be repaired and continue
to be used as added redundancy, helping us cope far more reliably
with multiple failures.

Sure, nobody's demanding double-fault protection -- where the SAME
sector of data fails on multiple drives, and nothing can be done
to recover it then.  But we really could/should handle the case
of two *different* unrelated single-faults, at least when those
are just soft failures of unrelated sectors.

Just need somebody motivated to actually fix it,
rather than bitch about how impossible/stupid it would be.

Cheers


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  2:10       ` Phillip Susi
  2006-01-18  3:01         ` Michael Loftis
@ 2006-01-18 16:47         ` Krzysztof Halasa
  1 sibling, 0 replies; 44+ messages in thread
From: Krzysztof Halasa @ 2006-01-18 16:47 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Michael Loftis, linux-kernel

Phillip Susi <psusi@cfl.rr.com> writes:

> but only slightly so, and in any case, a 3 disk raid-5 is FAR more
> reliable than a single drive, and only slightly less reliable than a
> two disk raid-1 ( though you get 3x the space for only 50% higher
> cost, so 6x cheaper cost per byte of storage ).

Actually with 3-disk RAID5 you get 2x the space of RAID1 for 1.5 x cost,
so the factor is 1.5/2 = 0.75, i.e., you save only 25% on RAID5 or RAID1
is 33% more expensive.
-- 
Krzysztof Halasa


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  3:01         ` Michael Loftis
@ 2006-01-18 16:49           ` Krzysztof Halasa
  0 siblings, 0 replies; 44+ messages in thread
From: Krzysztof Halasa @ 2006-01-18 16:49 UTC (permalink / raw)
  To: Michael Loftis; +Cc: Phillip Susi, linux-kernel

Michael Loftis <mloftis@wgops.com> writes:

> Yup we're on the same page, we just didn't think we were.  It happens
> :) R-5 (in theory) could be less reliable than a mirror

Statistically, RAID-5 with 3 or more disks is always less reliable than
a mirror. Strip size doesn't matter.

> or possibly a
> single drive,

With a lot of drives.
-- 
Krzysztof Halasa


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18 16:15   ` Mark Lord
@ 2006-01-18 17:32     ` Alan Cox
  2006-01-19 15:59       ` Mark Lord
  2006-01-18 23:37     ` Neil Brown
  1 sibling, 1 reply; 44+ messages in thread
From: Alan Cox @ 2006-01-18 17:32 UTC (permalink / raw)
  To: Mark Lord; +Cc: Helge Hafting, Cynbe ru Taren, linux-kernel

On Mer, 2006-01-18 at 11:15 -0500, Mark Lord wrote:
> For example, one bad sector on a drive doesn't mean that
> the entire drive has failed.  It just means that one 512-byte
> chunk of the drive has failed.

You don't actually know what failed; truth be told, it's probably a lot more
than a 512-byte speck of disk nowadays.

> We could rewrite the failed area of the drive, allowing the
> onboard firmware to repair the fault internally, likely by

We should definitely do so, but you probably want to rewrite the stripe
as a whole so that you fix up the other sectors in the physical region
that went poof.

> Just need somebody motivated to actually fix it,
> rather than bitch about how impossible/stupid it would be.

Send patches ;)

PS: How is the delkin_cb driver - does it know how to do modes and stuff
yet? Just wondering if I should pull a version for libata whacking.

Alan


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18 16:15   ` Mark Lord
  2006-01-18 17:32     ` Alan Cox
@ 2006-01-18 23:37     ` Neil Brown
  2006-01-19 15:53       ` Mark Lord
  1 sibling, 1 reply; 44+ messages in thread
From: Neil Brown @ 2006-01-18 23:37 UTC (permalink / raw)
  To: Mark Lord; +Cc: Helge Hafting, Cynbe ru Taren, linux-kernel

On Wednesday January 18, lkml@rtr.ca wrote:
> Helge Hafting wrote:
>  >
>  > As others have shown - "mdadm" can reassemble your
>  > broken raid - and it'll work well in those cases where
>  > the underlying drives indeed are ok.  It will fail
>  > spectacularly if you have a real double fault though,
>  > but then nothing short of raid-6 can save you.
> 
> No, actually there are several things we *could* do,
> if only the will-to-do-so existed.

You not only need the will.  You also need the ability and the time,
and the three must be combined into the one person...

> 
> For example, one bad sector on a drive doesn't mean that
> the entire drive has failed.  It just means that one 512-byte
> chunk of the drive has failed.
> 
> We could rewrite the failed area of the drive, allowing the
> onboard firmware to repair the fault internally, likely by
> remapping physical sectors.  This is nothing unusual, as all
> drives these days ship from the factory with many bad sectors
> that have already been remapped to "fix" them.  One or two
> more in the field is no reason to toss a perfectly good drive.

Very recent 2.6 kernels do exactly this.  They don't drop a drive on a
read error, only on a write error.  On a read error they generate the
data from elsewhere and schedule a write, then a re-read.

NeilBrown


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
                   ` (5 preceding siblings ...)
  2006-01-18 10:54 ` Helge Hafting
@ 2006-01-19  0:13 ` Neil Brown
  6 siblings, 0 replies; 44+ messages in thread
From: Neil Brown @ 2006-01-19  0:13 UTC (permalink / raw)
  To: Cynbe ru Taren; +Cc: linux-kernel

On Tuesday January 17, cynbe@muq.org wrote:
> 
> Just in case the RAID5 maintainers aren't aware of it:

Others have replied, but just so that you know that the "RAID5
maintainer" is listening, I will too.

You refer to "current" implementation and then talk about " a variety
of 2.4 and 2.6" releases.... Not all of them are 'current'.

The 'current' raid5 (in 2.6.15) is much more resilient against read
errors than earlier versions.

If you are having transient errors that really are very transient,
then the device driver should be retrying more, I expect.

If you are having random connectivity errors causing transient errors,
then your hardware is too unreliable for raid5 to cope with.

As has been said, there *is* a supported way to regain a raid5 after
connectivity problems - mdadm --assemble --force.

The best way to help with the improvement of md/raid5 is to give
precise details of situations where md/raid5 doesn't live up to your
expectations.  Without precise details it is hard to make progress.

Thank you for your interest.

NeilBrown


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18 23:37     ` Neil Brown
@ 2006-01-19 15:53       ` Mark Lord
  0 siblings, 0 replies; 44+ messages in thread
From: Mark Lord @ 2006-01-19 15:53 UTC (permalink / raw)
  To: Neil Brown; +Cc: Helge Hafting, Cynbe ru Taren, linux-kernel

Neil Brown wrote:
>
> Very recent 2.6 kernels do exactly this.  They don't drop a drive on a
> read error, only on a write error.  On a read error they generate the
> data from elsewhere and schedule a write, then a re-read.

Well done, then.  Further to this:

Pardon me for not looking at the specifics of the code here,
but experience shows that rewriting just the single sector
is often not enough to repair an error.  The drive often just
continues to fail when only the bad sector is rewritten by itself.

Dumb drives, or what, I don't know, but they seem to respond
better when the entire physical track is rewritten.

Since we rarely know what a physical track is these days,
this often boils down to simply rewriting a 64KB chunk
centered on the failed sector.  So far, this strategy has
always worked for me.
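
Done by hand, that boils down to something like the sketch below (the
device and LBA are hypothetical, and overwriting is only sane if the data
in that window is expendable or will be rebuilt from the array's
redundancy afterwards):

	BAD=1234567	# failing LBA as reported in the kernel log for /dev/sdc
	# rewrite the 64 KiB (128-sector) window centred on the bad sector,
	# giving the drive firmware a chance to remap the defect internally
	dd if=/dev/zero of=/dev/sdc bs=512 seek=$((BAD - 64)) count=128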

Cheers


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18 17:32     ` Alan Cox
@ 2006-01-19 15:59       ` Mark Lord
  2006-01-19 16:25         ` Alan Cox
  0 siblings, 1 reply; 44+ messages in thread
From: Mark Lord @ 2006-01-19 15:59 UTC (permalink / raw)
  To: Alan Cox; +Cc: Helge Hafting, Cynbe ru Taren, linux-kernel


Alan Cox wrote:
> PS: How is the delkin_cb driver - does it know how to do modes and stuff
> yet ? Just wondering if I should pull a version for libata whacking

I whacked at it for libata a while back, and then shelved it while awaiting
PIO to appear in a released libata version.  Now that we've got PIO, I ought
to add a couple of lines to bind in the right functions and release it.

No knowledge of "modes" and stuff -- but the basic register settings I
reverse engineered seem to work adequately on the cards I have here.

But the card is a total slug unless the host does 32-bit PIO to/from it.
Do we have that capability in libata yet?

My last hack at it (without the necessary libata PIO bindings) is attached,
but this is several revisions behind libata now, and probably needs some
updates to compile.  Suggestions welcomed.

Cheers


[-- Attachment #2: pata_delkin_cb.c --]
[-- Type: text/x-csrc, Size: 7188 bytes --]

/*
 *  Delkin CardBus IDE CompactFlash Adapter
 *
 *  This program is free software; you can redistribute it and/or
 *  modify it under the terms of the GNU General Public License
 *  as published by the Free Software Foundation; either version
 *  2 of the License, or (at your option) any later version.
 *
 *  Written by Mark Lord, Real-Time Remedies Inc.
 *  Copyright (C) 2005 	Mark Lord <mlord@pobox.com>
 *  Released under terms of General Public License
 * 
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/init.h>
#include <linux/blkdev.h>
#include <linux/delay.h>
#include "scsi.h"
#include <scsi/scsi_host.h>
#include <linux/libata.h>
#include <asm/io.h>

#define DRV_NAME	"delkin_cb"
#define DRV_VERSION	"0.01"

static int  delkin_cb_init_one(struct pci_dev *pdev, const struct pci_device_id *ent);
static void delkin_cb_remove_one(struct pci_dev *pdev);

static struct pci_device_id delkin_cb_pci_tbl[] = {
	{ PCI_VENDOR_ID_WORKBIT, PCI_DEVICE_ID_WORKBIT_CB, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
	{ }	/* terminate list */
};

static struct pci_driver delkin_cb_pci_driver = {
	.name			= DRV_NAME,
	.id_table		= delkin_cb_pci_tbl,
	.probe			= delkin_cb_init_one,
	.remove			= __devexit_p(delkin_cb_remove_one),
};

static Scsi_Host_Template delkin_cb_sht = {
	.module			= THIS_MODULE,
	.name			= DRV_NAME,
	.ioctl			= ata_scsi_ioctl,
	.queuecommand		= ata_scsi_queuecmd,
	.eh_strategy_handler	= ata_scsi_error,
	.can_queue		= ATA_DEF_QUEUE,
	.this_id		= ATA_SHT_THIS_ID,
	.sg_tablesize		= 256,
	.max_sectors		= 256,
	.cmd_per_lun		= 1,
	.emulated		= ATA_SHT_EMULATED,
	.use_clustering		= ENABLE_CLUSTERING,
	.proc_name		= DRV_NAME,
	.bios_param		= ata_std_bios_param,
	.resume			= ata_scsi_device_resume,
	.suspend		= ata_scsi_device_suspend,
};

static int no_check_atapi_dma(struct ata_queued_cmd *qc)
{
	printk("no_check_atapi_dma\n");
	return 1; /* atapi DMA not okay */
}

static void no_bmdma_stop(struct ata_port *ap)
{
	printk("no_bmdma_stop\n");
}

static u8 no_bmdma_status(struct ata_port *ap)
{
	printk("no_bmdma_status\n");
	return 0;
}

static void no_irq_clear(struct ata_port *ap)
{
	printk("no_irq_clear\n");
}

static void no_scr_write (struct ata_port *ap, unsigned int sc_reg, u32 val)
{
	printk("no_scr_write\n");
}

static u32 no_scr_read (struct ata_port *ap, unsigned int sc_reg)
{
	printk("no_scr_read\n");
	return ~0U;
}

static void no_phy_reset(struct ata_port *ap)
{
	printk("no_phy_reset\n");
	ap->flags &= ~ATA_FLAG_PORT_DISABLED;
	ata_bus_reset(ap);
}

static int delkin_cb_qc_issue(struct ata_queued_cmd *qc)
{
	printk("qc_issue: cmd=0x%02x proto=%d\n", qc->tf.command, qc->tf.protocol);
	switch (qc->tf.protocol) {
	case ATA_PROT_NODATA:
	case ATA_PROT_PIO:
		return ata_qc_issue_prot(qc);
	default:
		printk("qc_issue: bad protocol: %d\n", qc->tf.protocol);
		return -1;
	}
}

static struct ata_port_operations delkin_cb_ops = {
	.port_disable		= ata_port_disable,
	.tf_load		= ata_tf_load,
	.tf_read		= ata_tf_read,
	.check_status		= ata_check_status,
	.check_atapi_dma	= no_check_atapi_dma,
	.exec_command		= ata_exec_command,
	.dev_select		= ata_std_dev_select,
	.phy_reset		= no_phy_reset,
	.qc_prep		= ata_qc_prep,
	.qc_issue		= delkin_cb_qc_issue,
	.eng_timeout		= ata_eng_timeout,
	.irq_handler		= ata_interrupt,
	.irq_clear		= no_irq_clear,
	.scr_read		= no_scr_read,
	.scr_write		= no_scr_write,
	.port_start		= ata_port_start,
	.port_stop		= ata_port_stop,
	.bmdma_stop		= no_bmdma_stop,
	.bmdma_status		= no_bmdma_status,
};

static struct ata_port_info delkin_cb_port_info[] = {
	{
		.sht		= &delkin_cb_sht,
		.host_flags	= ATA_FLAG_SRST,
		.pio_mask	= 0x1f, /* pio0-4 */
		.port_ops	= &delkin_cb_ops,
	},
};

MODULE_AUTHOR("Mark Lord");
MODULE_DESCRIPTION("Basic support for Delkin-ASKA-Workbit Cardbus IDE");
MODULE_LICENSE("GPL");
MODULE_DEVICE_TABLE(pci, delkin_cb_pci_tbl);

/*
 * No chip documentation has yet been found,
 * so these configuration values were pulled from
 * a running Win98 system using "debug".
 * This gives around 3MByte/second read performance,
 * which is about 1/3 of what the chip is capable of.
 *
 * There is also a 4KByte mmio region on the card,
 * but its purpose has yet to be reverse-engineered.
 */
static const u8 delkin_cb_setup[] = {
	0x00, 0x05, 0xbe, 0x01, 0x20, 0x8f, 0x00, 0x00,
	0xa4, 0x1f, 0xb3, 0x1b, 0x00, 0x00, 0x00, 0x80,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0xa4, 0x83, 0x02, 0x13,
};

/**
 * delkin_cb_init_one - PCI probe function
 * Called when an instance of cardbus adapter is inserted.
 * 
 * @pdev: instance of pci_dev found
 * @ent:  matching entry in the id_tbl[]
 */
static int __devinit delkin_cb_init_one(struct pci_dev *pdev,
					const struct pci_device_id *ent)
{
	static int printed_version;
	struct ata_probe_ent *probe_ent = NULL;
	unsigned long io_base;
	unsigned int board_idx = (unsigned int) ent->driver_data;
	int i, rc;

	if (!printed_version++)
		printk(KERN_DEBUG DRV_NAME " version " DRV_VERSION "\n");

	rc = pci_enable_device(pdev);
	if (rc)
		return rc;

	rc = pci_request_regions(pdev, DRV_NAME);
	if (rc)
		goto err_out;

	probe_ent = kmalloc(sizeof(*probe_ent), GFP_KERNEL);
	if (probe_ent == NULL) {
		rc = -ENOMEM;
		goto err_out_regions;
	}

	memset(probe_ent, 0, sizeof(*probe_ent));
	probe_ent->dev = pci_dev_to_dev(pdev);
	INIT_LIST_HEAD(&probe_ent->node);

	probe_ent->sht		= delkin_cb_port_info[board_idx].sht;
	probe_ent->host_flags	= delkin_cb_port_info[board_idx].host_flags;
	probe_ent->pio_mask	= delkin_cb_port_info[board_idx].pio_mask;
	probe_ent->port_ops	= delkin_cb_port_info[board_idx].port_ops;

       	probe_ent->irq = pdev->irq;
       	probe_ent->irq_flags = SA_SHIRQ;
	io_base = pci_resource_start(pdev, 0);
	probe_ent->n_ports = 1;

	/* Initialize the device configuration registers */
	outb(0x02, io_base + 0x1e);	/* set nIEN to block interrupts */
	inb(io_base + 0x17);		/* read status to clear interrupts */
	for (i = 0; i < sizeof(delkin_cb_setup); ++i) {
		if (delkin_cb_setup[i])
			outb(delkin_cb_setup[i], io_base + i);
	}
	inb(io_base + 0x17);		/* read status to clear interrupts */

	probe_ent->port[0].cmd_addr = io_base + 0x10;
	ata_std_ports(&probe_ent->port[0]);
	probe_ent->port[0].altstatus_addr =
		probe_ent->port[0].ctl_addr = io_base + 0x1e;

	ata_device_add(probe_ent);
	kfree(probe_ent);

	// drive->io_32bit = 1;
	// drive->unmask   = 1;

	return 0;

err_out_regions:
	pci_release_regions(pdev);
err_out:
	pci_disable_device(pdev);
	return rc;
}

/**
 * delkin_cb_remove_one - Called to remove a single instance of the
 * adapter.
 *
 * @dev: The PCI device to remove.
 * FIXME: module load/unload not working yet
 */
static void __devexit delkin_cb_remove_one(struct pci_dev *pdev)
{
	ata_pci_remove_one(pdev);
}
/**
 * delkin_cb_init - Called after this module is loaded into the kernel.
 */
static int __init delkin_cb_init(void)
{
	return pci_module_init(&delkin_cb_pci_driver);
}
/**
 * delkin_cb_exit - Called before this module unloaded from the kernel
 */
static void __exit delkin_cb_exit(void)
{
	pci_unregister_driver(&delkin_cb_pci_driver);
}

module_init(delkin_cb_init);
module_exit(delkin_cb_exit);


* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-19 15:59       ` Mark Lord
@ 2006-01-19 16:25         ` Alan Cox
  2006-02-08 14:46           ` Alan Cox
  0 siblings, 1 reply; 44+ messages in thread
From: Alan Cox @ 2006-01-19 16:25 UTC (permalink / raw)
  To: Mark Lord; +Cc: Helge Hafting, Cynbe ru Taren, linux-kernel

On Iau, 2006-01-19 at 10:59 -0500, Mark Lord wrote:
> But the card is a total slug unless the host does 32-bit PIO to/from it.
> Do we have that capability in libata yet?

Very, very easy to sort out - just need a ->pio_xfer method set. That would
then eliminate some of the core driver flags and let us do VLB sync for
legacy hw.



* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-17 20:13   ` Martin Drab
  2006-01-17 23:39     ` Michael Loftis
@ 2006-02-02 20:33     ` Bill Davidsen
  2006-02-03  0:57       ` Martin Drab
  1 sibling, 1 reply; 44+ messages in thread
From: Bill Davidsen @ 2006-02-02 20:33 UTC (permalink / raw)
  To: Martin Drab; +Cc: Cynbe ru Taren, linux-kernel

Martin Drab wrote:

> Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID 
> 5 array. Due to the CPU overheating the whole box was suddenly shut down 
> by the CPU damage protection mechanism. While there is no battery backup 
> on this particular RAID controller, the sudden poweroff caused some very 
> localized inconsistency of one disk in the RAID. The configuration was 
> 1x160 GB and 3x120GB, with the 160 GB being split into 120 GB part within 
> the RAID 5 and a 40 GB part as a separate volume. The inconsistency 
> happened in the 40 GB part of the 160 GB HDD (as reported by the Adaptec 
> BIOS media check). In particular the problem was in the /dev/sda2 (with 
> /dev/sda being the 40 GB Volume, /dev/sda1 being an NTFS Windows system, 
> and /dev/sda2 being ext3 Linux system).
> 
> Now, what is interesting, is that Linux completely refused any possible 
> access to every byte within /dev/sda, not even dd(1) reading from any 
> position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything 
> ended up with lots of following messages:
> 
>         sd 0:0:0:0: SCSI error: return code = 0x8000002
>         sda: Current: sense key: Hardware Error
>             Additional sense: Internal target failure
>         Info fld=0x0
>         end_request: I/O error, dev sda, sector <some sector number>

But /dev/sda is not a Linux filesystem; running fsck on it makes no 
sense. You wanted to run it on /dev/sda2.
> 
> I've consulted this with Mark Salyzyn, because I thought it was a problem 
> of the AACRAID driver. But I was told, that there is nothing that AACRAID 
> can possibly do about it, and that it is a problem of the upper Linux 
> layers (block device layer?) that are strictly fault intollerant, and 
> thouth the problem was just an inconsistency of one particular localized 
> region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a 
> single byte from the ENTIRE VOLUME (/dev/sda)!

The obvious test of this "it's not us" statement is to connect that one 
drive to another type of controller and see if the upper level code 
recovers. I'm assuming that "sda" is a real drive and not some 
pseudo-drive which exists only in the firmware of the RAID controller. 
That message is curious: did you cat /proc/scsi/scsi to see what the 
system thought was there? Use the infamous "cdrecord -scanbus" command?

> 
> And now for the best part: From Windows, I was able to access the ENTIRE 
> VOLUME without the slightest problem. Not only did Windows boot entirely 
> from the /dev/sda1, but using Total Commander's ext3 plugin I was also 
> able to access the ENTIRE /dev/sda2 and at least extract the most 
> important data and configurations, before I did the complete low-level 
> formatting of the drive, which fixed the inconsistency problem.
> 
> I call this "AN IRONY" to be forced to use Windows to extract information 
> from Linux partition, wouldn't you? ;)
> 
> (Besides, even GRUB (using BIOS) accessed the /dev/sda without 
> complications - as it was the bootable volume. Only Linux failed here a 
> 100%. :()

From the way you say sda when you presumably mean sda1 or sda2, it's not 
clear whether you don't understand the difference between drive and 
partition access, or are just so pissed off that you are not taking the 
time to state the distinction clearly.

There was a problem with recovery from errors in RAID-5 which is 
addressed by recent changes to fail a sector, try rewriting it, etc. I 
would have to read the linux-raid archives to explain it, so I'll stop 
with the overview. I don't think that's the issue here, though: you're 
using a RAID controller rather than the software RAID, so it should not 
apply.

I assume that the problem is gone, so we can't do any more analysis 
after the fact.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-18  0:21   ` Phillip Susi
  2006-01-18  0:29     ` Michael Loftis
@ 2006-02-02 22:10     ` Bill Davidsen
  2006-02-08 21:58       ` Pavel Machek
  1 sibling, 1 reply; 44+ messages in thread
From: Bill Davidsen @ 2006-02-02 22:10 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3117 bytes --]

Phillip Susi wrote:
> Your understanding of statistics leaves something to be desired.  As you 
> add disks the probability of a single failure is grows linearly, but the 
> probability of double failure grows much more slowly.  For example:
> 
> If 1 disk has a 1/1000 chance of failure, then
> 2 disks have a (1/1000)^2 chance of double failure, and
> 3 disks have a (1/1000)^2 * 3 chance of double failure
> 4 disks have a (1/1000)^2 * 7 chance of double failure

After the first drive fails you have no redundancy; the chance of an 
additional failure is linear in the number of remaining drives.

Assume:
   p - probability of a drive failing in unit time
   n - number of drives
   F - probability of double failure

The chance of a single drive failure is n*p. After that you have a new 
"independent trial" for the failure of any one of the n-1 remaining 
drives, so the chance of a double drive failure is actually:
   F = (n*p) * (n-1)*p

But wait, there's more:
   p - chance of a drive failing in unit time
   n - number of drives
   R - the time to rebuild to a hot spare in the same units as p
   F - probability of double failure

So:

   F = n*p * (n-1)*(R * p)

If you rebuild a track at a time, each track takes the time to read the 
slowest drive plus the time to write the spare. If the array remains in 
use, load increases those times.

And the ugly part is that p is changing all the time: there's infant 
mortality on new drives, a fairly constant probability of electronic 
failure, and an increasing probability of mechanical failure over time. 
If all of your drives are the same age they are less reliable than 
mixed-age drives.
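
Setting those complications aside, here is a small worked example using 
the figures from the original post (p = 1/1000 per unit time, n = 4 
drives) and an assumed rebuild window of R = 0.01 of that unit time:

   single failure:  n*p             = 4 * 0.001                = 0.004
   double failure:  n*p * (n-1)*R*p = 0.004 * 3 * 0.01 * 0.001 = 1.2e-7

so with a fast rebuild the double-failure probability sits several orders 
of magnitude below the single-failure probability, and it scales linearly 
with the rebuild time R.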

> 
> Thus the probability of double failure on this 4 drive array is ~142 
> times less than the odds of a single drive failing.  As the probably of 
> a single drive failing becomes more remote, then the ratio of that 
> probability to the probability of double fault in the array grows 
> exponentially.
> 
> ( I think I did that right in my head... will check on a real calculator 
>  later )
> 
> This is why raid-5 was created: because the array has a much lower 
> probabiliy of double failure, and thus, data loss, than a single drive. 
>  Then of course, if you are really paranoid, you can go with raid-6 ;)

If you're paranoid you mirror over two RAID-5 arrays, with the mirrors on 
independent controllers (RAID 5+1).

> 
> 
> Michael Loftis wrote:
> 
>> Absolutely not.  The more spindles the more chances of a double 
>> failure. Simple statistics will mean that unless you have mirrors the 
>> more drives you add the more chance of two of them (really) failing at 
>> once and choking the whole system.
>>
>> That said, there very well could be (are?) cases where md needs to do 
>> a better job of handling the world unravelling.
>> -
A small graph of the effect of rebuild time on RAID-5 is attached. It 
assumes a probability of failure of 1/1000 per the original post, and 
shows how the double-failure probability drops for shorter rebuild times.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

[-- Attachment #2: 2dfail-2.png --]
[-- Type: image/png, Size: 3106 bytes --]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-02 20:33     ` Bill Davidsen
@ 2006-02-03  0:57       ` Martin Drab
  2006-02-03  1:13         ` Martin Drab
  2006-02-03 15:41         ` Phillip Susi
  0 siblings, 2 replies; 44+ messages in thread
From: Martin Drab @ 2006-02-03  0:57 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Cynbe ru Taren, Linux Kernel Mailing List, Salyzyn, Mark

[-- Attachment #1: Type: TEXT/PLAIN, Size: 10175 bytes --]

On Thu, 2 Feb 2006, Bill Davidsen wrote:

Just to state this clearly up front: I've already solved the problem 
by low-level formatting the entire disk that this inconsistent array in 
question was part of.

So now everything is back to normal, and unfortunately I am not able 
to do any more tests on the device in its non-working state.

I mentioned this problem here just to let you know that Linux behaves in 
this problematic (and IMO flawed) way under such circumstances, and so 
that you can perhaps keep such situations in mind when doing further 
improvements and development in the design of the block device layer (or 
wherever the problem may come from).

I also hope you will understand that I won't try to create that 
state again deliberately, since my main system is running on that array 
and I won't risk losing more data over this.

However, maybe someone at Adaptec or somewhere else has a similar system 
at their disposal on which they could experiment on demand without any 
serious risk of losing anything important.

What I can say is that it is an Adaptec 2410SA with 8205 firmware and 
without a battery backup unit (which is probably the crucial thing). 
The inconsistency was caused by the motherboard's CPU-overheat protection 
shutting the machine down: I had started the system and booted Linux from 
the array in question (which consists of just one part of one disk) while 
forgetting to turn on the water cooling of the CPU and northbridge. After 
about 3 minutes the system automatically shut down, and Linux was 
probably writing at that very moment; the write could not complete fully 
(most probably because of the missing battery backup on the RAID 
controller). So my guess is that this may be reproduced artificially by 
suddenly switching off the power while Linux is writing to the array.

My arrays in particular are:

	HDD1 (160 GB): 120 GB Array 1, 40 GB Array 2
	HDD2 (120 GB): 120 GB Array 1
	HDD3 (120 GB): 120 GB Array 1
	HDD4 (120 GB): 120 GB Array 1

Array 1 is a RAID 5 array, /dev/sdb (labeled "Data 1"), which contains 
just one 330 GB partition, /dev/sdb1. Array 2 is a bootable so-called 
Volume array (i.e. no RAID, in Adaptec BIOS setup terms), /dev/sda 
(labeled "SYSTEM"), which contains /dev/sda1 (an NTFS Windows system), 
/dev/sda2 (an ext3 Linux system) and /dev/sda3 (Linux swap). The problem 
was accessing the whole of Array 2; Array 1 worked fine from Linux.

Then, when I tried it, the array checking function in the Adaptec 
controller's BIOS found an inconsistency somewhere in the middle of 
/dev/sda, i.e. somewhere within /dev/sda2 in particular. So I low-level 
formatted the entire HDD1, resynced Array 1 (which is RAID 5, so no 
problem) and reinstalled both systems on Array 2, and now it is all back 
to normal again.

> Martin Drab wrote:
> 
> > Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID 5
> > array. Due to the CPU overheating the whole box was suddenly shot down by
> > the CPU damage protection mechanism. While there is no battery backup on
> > this particular RAID controller, the sudden poweroff caused some very
> > localized inconsistency of one disk in the RAID. The configuration was 1x160
> > GB and 3x120GB, with the 160 GB being split into 120 GB part within the RAID
> > 5 and a 40 GB part as a separate volume. The inconsistency happend in the 40
> > GB part of the 160 GB HDD (as reported by the Adaptec BIOS media check). In
> > particular the problem was in the /dev/sda2 (with /dev/sda being the 40 GB
> > Volume, /dev/sda1 being an NTFS Windows system, and /dev/sda2 being ext3
> > Linux system).
> > 
> > Now, what is interesting, is that Linux completely refused any possible
> > access to every byte within /dev/sda, not even dd(1) reading from any
> > position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything
                                        ^^^^^^^^^^^^^^
> > ended up with lots of following messages:
> > 
> >         sd 0:0:0:0: SCSI error: return code = 0x8000002
> >         sda: Current: sense key: Hardware Error
> >             Additional sense: Internal target failure
> >         Info fld=0x0
> >         end_request: I/O error, dev sda, sector <some sector number>
> 
> But /dev/sda is not a Linux filesystem, running fsck on it makes no sense. You
> wanted to run on /dev/sda2.

But I was talking about fdisk(1). This wasn't a problematic behaviour of a 
filesystem, but of the entire block device.

> > I've consulted this with Mark Salyzyn, because I thought it was a problem of
> > the AACRAID driver. But I was told, that there is nothing that AACRAID can
> > possibly do about it, and that it is a problem of the upper Linux layers
> > (block device layer?) that are strictly fault intollerant, and thouth the
> > problem was just an inconsistency of one particular localized region inside
> > /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a single byte from
> > the ENTIRE VOLUME (/dev/sda)!
> 
> The obvious test of this "it's not us" statement is to connect that one drive
> to another type controller and see if the upper level code recovers. I'm
> assuming that "sda" is a real drive and not some pseudo-drive which exists
> only in the firmware of the RAID controller.

/dev/sda is a 40 GB RAID array consisting of just one 40 GB part of one 
160 GB drive. But it is in fact a virtual device supplied by the 
controller, i.e. this 40 GB part of that disk behaves as an entire hard 
disk (with its own MBR etc.). And it is at the end of the drive, so it 
may be a little tricky to find the exact position of the partitions 
there, but it should be possible.

> That message is curious, did you
> cat /proc/scsi/scsi to see what the system thought was there? Use the infamous
> "cdrecord -scanbus" command?

----------
$ cdrecord -scanbus
Cdrecord-Clone 2.01.01a03-dvd (i686-pc-linux-gnu) Copyright (C) 1995-2005 Jörg Schilling
Note: This version is an unofficial (modified) version with DVD support
Note: and therefore may have bugs that are not present in the original.
Note: Please send bug reports or support requests to warly at mandriva.com.
Note: The author of cdrecord should not be bothered with problems in this 
version.
Linux sg driver version: 3.5.33
Using libscg version 'schily-0.8'.
scsibus0:
        0,0,0     0) 'Adaptec ' 'SYSTEM          ' 'V1.0' Disk
        0,1,0     1) 'Adaptec ' 'Data 1          ' 'V1.0' Disk
        0,2,0     2) *
        0,3,0     3) *
        0,4,0     4) *
        0,5,0     5) *
        0,6,0     6) *
        0,7,0     7) *

$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: Adaptec  Model: SYSTEM           Rev: V1.0
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: Adaptec  Model: Data 1           Rev: V1.0
  Type:   Direct-Access                    ANSI SCSI revision: 02
-----------

The 0,0,0 device is /dev/sda. And even though this output is from now, 
after the low-level formatting of the previously inconsistent disk, the 
indications back then were just the same: everything looked as usual and 
both arrays were properly identified. But whenever I accessed the 
inconsistent one, i.e. /dev/sda, in any way (even raw bytes; this has 
nothing to do with any filesystem), the error messages mentioned above 
appeared. I'm not sure what exactly was generating them, but I've CC'd 
Mark Salyzyn; maybe he can explain more about it.

> > And now for the best part: From Windows, I was able to access the ENTIRE
> > VOLUME without the slightest problem. Not only did Windows boot entirely
> > from the /dev/sda1, but using Total Commander's ext3 plugin I was also able
> > to access the ENTIRE /dev/sda2 and at least extract the most important data
> > and configurations, before I did the complete low-level formatting of the
> > drive, which fixed the inconsistency problem.
> > 
> > I call this "AN IRONY" to be forced to use Windows to extract information
> > from Linux partition, wouldn't you? ;)
> > 
> > (Besides, even GRUB (using BIOS) accessed the /dev/sda without complications
> > - as it was the bootable volume. Only Linux failed here a 100%. :()
> 
> From the way you say sda when you presumably mean sda1 or sda2 it's not clear
> if you don't understand the difference between drive and partition access or
> are just so pissed off you are not taking the time to state the distinction
> clearly.

No, I understand the differences very clearly. But maybe I was just 
unclear in my expressions (for which I apologize). What I mean is that 
the problem was with the entire RAID array /dev/sda. So on ANY access to 
ANY part of /dev/sda, which of course also includes accesses to all of 
/dev/sda1, /dev/sda2, and /dev/sda3, the error messages appeared and no 
access was performed. That includes even accesses like 
"dd if=/dev/sda of=/dev/null bs=512 count=1" and any other possible 
access. So the problem was with the entire device /dev/sda.

> There was a problem with recovery from errors in RAID-5 which is addressed by
> recent changes to fail a sector, try rewriting it, etc.

Maybe this was again my poor explanation, but this wasn't a problem with 
a RAID 5 array, much less a software one. The Adaptec 2410SA is a 
4-channel hardware SATA-I RAID controller.

> I would have to read linux-raid archives to explain it, so I'll stop 
> with the overview. I don't think that's the issue here, you're using a 
> RAID controller rather than the software RAID, so it should not apply.

Yes, exactly. And again, I've solved it by low-level formatting.

> I assume that the problem is gone, so we can't do any more analysis after the
> fact.

Unfortunately, yes. But I've described above how it happened, so maybe 
someone at Adaptec would be able to reproduce it. Mark?

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03  0:57       ` Martin Drab
@ 2006-02-03  1:13         ` Martin Drab
  2006-02-03 15:41         ` Phillip Susi
  1 sibling, 0 replies; 44+ messages in thread
From: Martin Drab @ 2006-02-03  1:13 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Cynbe ru Taren, Linux Kernel Mailing List, Salyzyn, Mark

[-- Attachment #1: Type: TEXT/PLAIN, Size: 11090 bytes --]

On Fri, 3 Feb 2006, Martin Drab wrote:

> On Thu, 2 Feb 2006, Bill Davidsen wrote:
> 
> Just to state clearly in the first place. I've allready solved the problem 
> by low-level formatting the entire disk that this inconsistent array in 
> question was part of.
> 
> So now everything is back to normal. So unforunatelly I would not be able 
> to do any more tests on the device in the non-working state.
> 
> I mentioned this problem here now just to let you konw that there is such 
> a problematic Linux behviour (and IMO flawed) in such circumstances, and 
> that perhaps it may let you think of such situations when doing further 
> improvements and development in the design of the block device layer (or 
> wherever the problem may possibly come from).

Perhaps of the SCSI layer, rather than of the block layer?
 
> And I also hope you would understand, that I wouldn't try to create that 
> state again deliberatelly, since my main system is running on that array 
> and I wouldn't risk loosing some more data because of this.
> 
> However maybe someone perhaps in Adaptec or smewhere else may have 
> some simillar system at the disposal on which he could allow to experiment 
> on demand without any serious risk of loosing anything important.
> 
> So what I may say is that it is an Adaptec 2410SA with 8205 firmware and 
> without a battery backup system (which is probably the crutial thing). 
> And the inconsistency was caused by a MB protection of CPU overheat 
> shutdown, because I've started the system and booted Linux from the array 
> in question (which consisted by just one part of one disk), while I've 
> forgotten to turn on the water cooling of the CPU and northbridge. So 
> after about 3 minutes the system automatically shut down and Linux was 
> probably doing some writing in that very moment, which wasn't able to 
> complete fully (most probably due to the lack of the battery backup system 
> on the RAID controller). So my guess is that this may be artificially 
> reproduced when you suddenly switch off a power source of the system while 
> Linux is doing some writing to the array.
> 
> My arrays in particular are:
> 
> 	HDD1 (160 GB): 120 GB Array 1, 40 GB Array 2
> 	HDD2 (120 GB): 120 GB Array 1
> 	HDD3 (120 GB): 120 GB Array 1
> 	HDD4 (120 GB): 120 GB Array 1
> 
> Where Array 1 is a RAID 5 array /dev/sdb (labeled as "Data 1"), which 
> contains just one 330 GB partition /dev/sdb1, and Array 2 is a bootable 
> (in Adaptec BIOS setup so called) Volume array (i.e. no RAID) /dev/sda 
> (labeled as "SYSTEM"), which contains /dev/sda1 (NTFS Windows), /dev/sda2 
> (ext3 Linux), /dev/sda3 (Linux swap). Problem was accessing the whole Array 2. 
> Array 1 from Linux worked well.
> 
> Then, when I tried, the array checking function within the BIOS of the 
> Adaptec controller found an inconsistency on the position somewhere in the 
> middle of the /dev/sda, so somewhere within the /dev/sda2 in particular. 
> So I low-level formatted the entire HDD1, resynced the Array 1 (which is 
> RAID 5, so no problem) and reinstalled both systems in Array 2, and now it 
> is all back to normal again.
> 
> > Martin Drab wrote:
> > 
> > > Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID 5
> > > array. Due to the CPU overheating the whole box was suddenly shot down by
> > > the CPU damage protection mechanism. While there is no battery backup on
> > > this particular RAID controller, the sudden poweroff caused some very
> > > localized inconsistency of one disk in the RAID. The configuration was 1x160
> > > GB and 3x120GB, with the 160 GB being split into 120 GB part within the RAID
> > > 5 and a 40 GB part as a separate volume. The inconsistency happend in the 40
> > > GB part of the 160 GB HDD (as reported by the Adaptec BIOS media check). In
> > > particular the problem was in the /dev/sda2 (with /dev/sda being the 40 GB
> > > Volume, /dev/sda1 being an NTFS Windows system, and /dev/sda2 being ext3
> > > Linux system).
> > > 
> > > Now, what is interesting, is that Linux completely refused any possible
> > > access to every byte within /dev/sda, not even dd(1) reading from any
> > > position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything
>                                         ^^^^^^^^^^^^^^
> > > ended up with lots of following messages:
> > > 
> > >         sd 0:0:0:0: SCSI error: return code = 0x8000002
> > >         sda: Current: sense key: Hardware Error
> > >             Additional sense: Internal target failure
> > >         Info fld=0x0
> > >         end_request: I/O error, dev sda, sector <some sector number>
> > 
> > But /dev/sda is not a Linux filesystem, running fsck on it makes no sense. You
> > wanted to run on /dev/sda2.
> 
> But I was talking about fdisk(1). This wasn't a problematic behaviour of a 
> filesystem, but of the entire block device.
> 
> > > I've consulted this with Mark Salyzyn, because I thought it was a problem of
> > > the AACRAID driver. But I was told, that there is nothing that AACRAID can
> > > possibly do about it, and that it is a problem of the upper Linux layers
> > > (block device layer?) that are strictly fault intollerant, and thouth the
> > > problem was just an inconsistency of one particular localized region inside
> > > /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a single byte from
> > > the ENTIRE VOLUME (/dev/sda)!
> > 
> > The obvious test of this "it's not us" statement is to connect that one drive
> > to another type controller and see if the upper level code recovers. I'm
> > assuming that "sda" is a real drive and not some pseudo-drive which exists
> > only in the firmware of the RAID controller.
> 
> /dev/sda is a 40 GB RAID array consisting of just one 40 GB part of one 
> 160 GB drive. But it is in fact a virtual device supplied by the 
> controller. I.e. this 40 GB part of that disc behaves as an entire 
> harddisk (with it's own MBR etc.). And it is at the end of the drive, so 
> it may be a little tricky to find the exact position of the partitions 
> there, but it may be possible.
> 
> > That message is curious, did you
> > cat /proc/scsi/scsi to see what the system thought was there? Use the infamous
> > "cdrecord -scanbus" command?
> 
> ----------
> $ cdrecord -scanbus
> Cdrecord-Clone 2.01.01a03-dvd (i686-pc-linux-gnu) Copyright (C) 1995-2005 Jörg Schilling
> Note: This version is an unofficial (modified) version with DVD support
> Note: and therefore may have bugs that are not present in the original.
> Note: Please send bug reports or support requests to warly at mandriva.com.
> Note: The author of cdrecord should not be bothered with problems in this 
> version.
> Linux sg driver version: 3.5.33
> Using libscg version 'schily-0.8'.
> scsibus0:
>         0,0,0     0) 'Adaptec ' 'SYSTEM          ' 'V1.0' Disk
>         0,1,0     1) 'Adaptec ' 'Data 1          ' 'V1.0' Disk
>         0,2,0     2) *
>         0,3,0     3) *
>         0,4,0     4) *
>         0,5,0     5) *
>         0,6,0     6) *
>         0,7,0     7) *
> 
> $ cat /proc/scsi/scsi
> Attached devices:
> Host: scsi0 Channel: 00 Id: 00 Lun: 00
>   Vendor: Adaptec  Model: SYSTEM           Rev: V1.0
>   Type:   Direct-Access                    ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 01 Lun: 00
>   Vendor: Adaptec  Model: Data 1           Rev: V1.0
>   Type:   Direct-Access                    ANSI SCSI revision: 02
> -----------
> 
> The 0,0,0 is the /dev/sda. And even though this is now, after low-level 
> formatting of the previously inconsistent disc, the indications back then 
> were just the same. Which means every indication behaved as usual. Both 
> arrays were properly identified. But when I was accessing the inconsistent 
> one, i.e. /dev/sda, in any way (even just bytes, this has nothing to do 
> with any filesystems) the error messages mentioned above appeared. I'm not 
> sure what exactly was generating them, but I've CC'd Mark Salyzyn, maybe 
> he can explain more to it.
> 
> > > And now for the best part: From Windows, I was able to access the ENTIRE
> > > VOLUME without the slightest problem. Not only did Windows boot entirely
> > > from the /dev/sda1, but using Total Commander's ext3 plugin I was also able
> > > to access the ENTIRE /dev/sda2 and at least extract the most important data
> > > and configurations, before I did the complete low-level formatting of the
> > > drive, which fixed the inconsistency problem.
> > > 
> > > I call this "AN IRONY" to be forced to use Windows to extract information
> > > from Linux partition, wouldn't you? ;)
> > > 
> > > (Besides, even GRUB (using BIOS) accessed the /dev/sda without complications
> > > - as it was the bootable volume. Only Linux failed here a 100%. :()
> > 
> > From the way you say sda when you presumably mean sda1 or sda2 it's not clear
> > if you don't understand the difference between drive and partition access or
> > are just so pissed off you are not taking the time to state the distinction
> > clearly.
> 
> No, I understand the differences very clearly. But maybe I was just 
> unclear in my expressions (for which I appologize). What I mean is that 
> the problem was with the entire RAID array /dev/sda. So whenever ANY 
> access to ANY part of /dev/sda, which of course also includes accesses to 
> all of /dev/sda1, /dev/sda2, and /dev/sda3, the error messages appeared 
> and no access was performed. That includes even accesses like this
> "dd if=/dev/sda of=/dev/null bs=512 count=1" and any other possible 
> accesses. So the problem was with the entire device /dev/sda.
> 
> > There was a problem with recovery from errors in RAID-5 which is addressed by
> > recent changes to fail a sector, try rewriting it, etc.
> 
> Maybe this was again my bad explanation, but this wasn't a problem of a 
> RAID 5 array, and much less of a software array. Adaptec 2410SA is a 
> 4-channel HW SATA-I RAID controller.
> 
> > I would have to read linux-raid archives to explain it, so I'll stop 
> > with the overview. I don't think that's the issue here, you're using a 
> > RAID controller rather than the software RAID, so it should not apply.
> 
> Yes, exactly. And again, I've solved it by lowlevel formatting.
> 
> > I assume that the problem is gone, so we can't do any more analysis after the
> > fact.
> 
> Unfortunatelly, yes. But I've described above how did it happen, so maybe 
> someone in Adaptec would be able to reproduce, Mark?
> 
> Martin

====================================================
Martin Drab
Department of Solid State Engineering
Department of Mathematics
Faculty of Nuclear Sciences and Physical Engineering
Czech Technical University in Prague
Trojanova 13
120 00  Praha 2, Czech Republic
Tel: +420 22435 8649
Fax: +420 22435 8601
E-mail: drab@kepler.fjfi.cvut.cz
====================================================

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03  0:57       ` Martin Drab
  2006-02-03  1:13         ` Martin Drab
@ 2006-02-03 15:41         ` Phillip Susi
  2006-02-03 16:13           ` Martin Drab
  1 sibling, 1 reply; 44+ messages in thread
From: Phillip Susi @ 2006-02-03 15:41 UTC (permalink / raw)
  To: Martin Drab
  Cc: Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List, Salyzyn, Mark

Martin Drab wrote:
> On Thu, 2 Feb 2006, Bill Davidsen wrote:
>
> Just to state clearly in the first place. I've allready solved the problem 
> by low-level formatting the entire disk that this inconsistent array in 
> question was part of.
>
> So now everything is back to normal. So unforunatelly I would not be able 
> to do any more tests on the device in the non-working state.
>
> I mentioned this problem here now just to let you konw that there is such 
> a problematic Linux behviour (and IMO flawed) in such circumstances, and 
> that perhaps it may let you think of such situations when doing further 
> improvements and development in the design of the block device layer (or 
> wherever the problem may possibly come from).
>
>   

It looks like the problem is in that controller card and its driver.  
Was this a proprietary closed source driver?  Linux is perfectly happy 
to access the rest of the disk when some parts of it have gone bad; 
people do this all the time.  It looks like your raid controller decided 
to take the entire virtual disk that it presents to the kernel offline 
because it detected errors.

<snip>
> The 0,0,0 is the /dev/sda. And even though this is now, after low-level 
> formatting of the previously inconsistent disc, the indications back then 
> were just the same. Which means every indication behaved as usual. Both 
> arrays were properly identified. But when I was accessing the inconsistent 
> one, i.e. /dev/sda, in any way (even just bytes, this has nothing to do 
> with any filesystems) the error messages mentioned above appeared. I'm not 
> sure what exactly was generating them, but I've CC'd Mark Salyzyn, maybe 
> he can explain more to it.
>
>   

How did you low-level format the drive?  These days disk manufacturers 
ship drives already low-level formatted and end users cannot perform a 
low-level format.  The last time I remember being able to low-level 
format a drive was with MFM and RLL drives, prior to IDE.  My guess is 
that what you actually did was simply write out zeros to every sector on 
the disk, which replaced the corrupted data in the affected sectors with 
good data, rendering them repaired.  Usually a drive will fail reads of a 
bad sector, but when you write to that sector it will write and then 
re-read it to see whether it is fine after being rewritten; if the media 
is bad, it will remap the sector to a spare. 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 15:41         ` Phillip Susi
@ 2006-02-03 16:13           ` Martin Drab
  2006-02-03 16:38             ` Phillip Susi
  2006-02-03 17:51             ` Martin Drab
  0 siblings, 2 replies; 44+ messages in thread
From: Martin Drab @ 2006-02-03 16:13 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List, Salyzyn, Mark

On Fri, 3 Feb 2006, Phillip Susi wrote:

> Martin Drab wrote:
> > On Thu, 2 Feb 2006, Bill Davidsen wrote:
> > 
> > Just to state clearly in the first place. I've allready solved the problem
> > by low-level formatting the entire disk that this inconsistent array in
> > question was part of.
> > 
> > So now everything is back to normal. So unforunatelly I would not be able to
> > do any more tests on the device in the non-working state.
> > 
> > I mentioned this problem here now just to let you konw that there is such a
> > problematic Linux behviour (and IMO flawed) in such circumstances, and that
> > perhaps it may let you think of such situations when doing further
> > improvements and development in the design of the block device layer (or
> > wherever the problem may possibly come from).
> > 
> >   
> 
> It looks like the problem is in that controller card and its driver.  Was this
> a proprietary closed source driver?

No, it was the kernel's AACRAID driver (drivers/scsi/aacraid/*). I 
consulted Mark Salyzyn about it, and he told me that it is a problem of 
the upper layers, which are zero fault tolerant, and that the driver can 
do nothing about it.

So as I understand it, the RAID controller signalled some error with 
respect to the inconsistency of that particular array, and the upper 
layers probably weren't able to distinguish the real condition, just 
interpreted it as an error, and so refused to access the device 
altogether. But understand that this is just my way of interpreting what 
I think might have happened, without any knowledge of the SCSI protocol 
or of how the SCSI and other related layers work.

> Linux is perfectly happy to access the
> rest of the disk when some parts of it have gone bad; people do this all the
> time.  It looks like your raid controller decided to take the entire virtual
> disk that it presents to the kernel offline because it detected errors.

No, it wasn't offline; no such messages appeared in the kernel log. And 
if it had been taken offline, the kernel/driver would certainly have 
reported that, as I've already witnessed such a situation in the past 
(though for a totally different reason).

> <snip>
> > The 0,0,0 is the /dev/sda. And even though this is now, after low-level
> > formatting of the previously inconsistent disc, the indications back then
> > were just the same. Which means every indication behaved as usual. Both
> > arrays were properly identified. But when I was accessing the inconsistent
> > one, i.e. /dev/sda, in any way (even just bytes, this has nothing to do with
> > any filesystems) the error messages mentioned above appeared. I'm not sure
> > what exactly was generating them, but I've CC'd Mark Salyzyn, maybe he can
> > explain more to it.  
> 
> How did you low level format the drive? 

The BIOS of the RAID controller has this option.

> These days disk manufacturers ship
> drives already low level formatted and end users can not perform a low level
> format. The last time I remember being able to low level format a drive was
> with MFM and RLL drives, prior to IDE.  My guess is what you actually did was
> simply write out zeros to every sector on the disk, which replaced the
> corrupted data in the effected sector with good data, rendering it repaired.

That may very well be true. I do not know what the Adaptec BIOS does under 
the "Low-Level Format" option. Maybe someone from Adaptec would know that. 
Mark?

> Usually drives will fail reads to bad sectors but when you write to that
> sector, it will write and read that sector to see if it is fine after being
> written again, or if the media is bad in which case it will remap the sector
> to a spare. 

No, I don't think this was a case of physically bad sectors. I think 
it was just an inconsistency in the RAID controller's metadata (or 
something similar) related to that particular array.

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 16:13           ` Martin Drab
@ 2006-02-03 16:38             ` Phillip Susi
  2006-02-03 17:22               ` Roger Heflin
  2006-02-03 17:51             ` Martin Drab
  1 sibling, 1 reply; 44+ messages in thread
From: Phillip Susi @ 2006-02-03 16:38 UTC (permalink / raw)
  To: Martin Drab
  Cc: Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List, Salyzyn, Mark

Martin Drab wrote:
> On Fri, 3 Feb 2006, Phillip Susi wrote:
>   
>> It looks like the problem is in that controller card and its driver.  Was this
>> a proprietary closed source driver?
>>     
>
> No, it was the kernel's AACRAID driver (drivers/scsi/aacraid/*). And I've 
> consulted that with Mark Salyzyn who told me that it is the problem of the 
> upper layers which are only zero fault tollerant and that driver con do 
> nothing about it.
>   

That's a strange statement; maybe we could get some clarification on 
it?  From the dmesg lines you posted before, it appeared that the 
hardware was failing the request with a bad disk sense code.  As I said 
before, Linux normally has no problem reading the good parts of a 
partially bad disk, so I wonder exactly what Mark means by "upper layers 
which are only zero fault tolerant"?




^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 16:38             ` Phillip Susi
@ 2006-02-03 17:22               ` Roger Heflin
  2006-02-03 19:38                 ` Phillip Susi
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Heflin @ 2006-02-03 17:22 UTC (permalink / raw)
  To: 'Phillip Susi', 'Martin Drab'
  Cc: 'Bill Davidsen', 'Cynbe ru Taren',
	'Linux Kernel Mailing List', 'Salyzyn, Mark'

 

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Phillip Susi
> Sent: Friday, February 03, 2006 10:38 AM
> To: Martin Drab
> Cc: Bill Davidsen; Cynbe ru Taren; Linux Kernel Mailing List; 
> Salyzyn, Mark
> Subject: Re: FYI: RAID5 unusably unstable through 2.6.14
> 
> Martin Drab wrote:
> > On Fri, 3 Feb 2006, Phillip Susi wrote:
> >   
> >> It looks like the problem is in that controller card and 
> its driver.  
> >> Was this a proprietary closed source driver?
> >>     
> >
> > No, it was the kernel's AACRAID driver 
> (drivers/scsi/aacraid/*). And 
> > I've consulted that with Mark Salyzyn who told me that it is the 
> > problem of the upper layers which are only zero fault tollerant and 
> > that driver con do nothing about it.
> >   
> 
> That's a strange statement, maybe we could get some 
> clarification on it?  From the dmesg lines you posted before, 
> it appeared that the hardware was failing the request with a 
> bad disk sense code.  As I said before, normally Linux has no 
> problem reading the good parts of a partially bad disk, so I 
> wonder exactly what Mark means by "upper layers which are 
> only zero fault tollerant"?


Some of the fakeraid controllers will kill the disk when the
disk returns a failure like that.

On top of that, usually (even if the controller does not kill
the disk) the application will also get a fatal disk error,
causing the application to die.

The best I have been able to hope for (this is a raid0 stripe
case) is that the fakeraid controller does not kill the disk,
returns the disk error to the higher levels and lets the application
be killed; at least in that case you will likely know the disk
has a fatal error, rather than (in the raid0 case) having the
machine crash and having to debug it to determine exactly
what the nature of the failure was.

The same may need to be applied when the array is already
in degraded mode ... limping along with some lost data and messages
indicating such is a lot better than losing all of the data.

                           Roger


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 16:13           ` Martin Drab
  2006-02-03 16:38             ` Phillip Susi
@ 2006-02-03 17:51             ` Martin Drab
  2006-02-03 19:10               ` Roger Heflin
  1 sibling, 1 reply; 44+ messages in thread
From: Martin Drab @ 2006-02-03 17:51 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List, Salyzyn, Mark

On Fri, 3 Feb 2006, Martin Drab wrote:

> On Fri, 3 Feb 2006, Phillip Susi wrote:
> 
> > Usually drives will fail reads to bad sectors but when you write to that
> > sector, it will write and read that sector to see if it is fine after being
> > written again, or if the media is bad in which case it will remap the sector
> > to a spare. 
> 
> No, I don't think this was the case of a physically bad sectors. I think 
> it was just an inconsistency of the RAID controllers metadata (or 
> something simillar) related to that particular array.

Or is such a situation not possible at all? Are bad sectors the only 
reason that might have caused this? That sounds a little strange to me; 
it would have been a very unlikely concentration of coincidences, IMO. 
That's why I still think there are no bad sectors at all (at least none 
caused by this). Is there any way to actually find out?

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 17:51             ` Martin Drab
@ 2006-02-03 19:10               ` Roger Heflin
  2006-02-03 19:12                 ` Martin Drab
  0 siblings, 1 reply; 44+ messages in thread
From: Roger Heflin @ 2006-02-03 19:10 UTC (permalink / raw)
  To: 'Martin Drab', 'Phillip Susi'
  Cc: 'Bill Davidsen', 'Cynbe ru Taren',
	'Linux Kernel Mailing List', 'Salyzyn, Mark'

 

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org 
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Martin Drab
> Sent: Friday, February 03, 2006 11:51 AM
> To: Phillip Susi
> Cc: Bill Davidsen; Cynbe ru Taren; Linux Kernel Mailing List; 
> Salyzyn, Mark
> Subject: Re: FYI: RAID5 unusably unstable through 2.6.14
> 
> On Fri, 3 Feb 2006, Martin Drab wrote:
> 
> > On Fri, 3 Feb 2006, Phillip Susi wrote:
> > 
> > > Usually drives will fail reads to bad sectors but when 
> you write to 
> > > that sector, it will write and read that sector to see if 
> it is fine 
> > > after being written again, or if the media is bad in 
> which case it 
> > > will remap the sector to a spare.
> > 
> > No, I don't think this was the case of a physically bad sectors. I 
> > think it was just an inconsistency of the RAID controllers metadata 
> > (or something simillar) related to that particular array.
> 
> Or is such a situation not possible at all? Are bad sectors 
> the only reason that might have caused this? That sounds a 
> little strange to me, that would have been a very unlikely 
> concentration of conincidences, IMO. 
> That's why I still think there are no bad sectors at all (at 
> least not because of this). Is there any way to actually find out?


Some of the drive manufacturers have tools that will read out
"log" files from the disks, and these log files include stuff
such as how many bad block errors were returned to the host
over the life of the disk.

You would need a decent contact at the disk manufacturer, and
you might be able to get them to tell you, maybe.

                         Roger


^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 19:10               ` Roger Heflin
@ 2006-02-03 19:12                 ` Martin Drab
  2006-02-03 19:41                   ` Phillip Susi
  0 siblings, 1 reply; 44+ messages in thread
From: Martin Drab @ 2006-02-03 19:12 UTC (permalink / raw)
  To: Roger Heflin
  Cc: 'Phillip Susi', 'Bill Davidsen',
	'Cynbe ru Taren', 'Linux Kernel Mailing List',
	'Salyzyn, Mark'

On Fri, 3 Feb 2006, Roger Heflin wrote:

> > -----Original Message-----
> > From: linux-kernel-owner@vger.kernel.org 
> > [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Martin Drab
> > Sent: Friday, February 03, 2006 11:51 AM
> > To: Phillip Susi
> > Cc: Bill Davidsen; Cynbe ru Taren; Linux Kernel Mailing List; 
> > Salyzyn, Mark
> > Subject: Re: FYI: RAID5 unusably unstable through 2.6.14
> > 
> > On Fri, 3 Feb 2006, Martin Drab wrote:
> > 
> > > On Fri, 3 Feb 2006, Phillip Susi wrote:
> > > 
> > > > Usually drives will fail reads to bad sectors but when 
> > you write to 
> > > > that sector, it will write and read that sector to see if 
> > it is fine 
> > > > after being written again, or if the media is bad in 
> > which case it 
> > > > will remap the sector to a spare.
> > > 
> > > No, I don't think this was the case of a physically bad sectors. I 
> > > think it was just an inconsistency of the RAID controllers metadata 
> > > (or something simillar) related to that particular array.
> > 
> > Or is such a situation not possible at all? Are bad sectors 
> > the only reason that might have caused this? That sounds a 
> > little strange to me, that would have been a very unlikely 
> > concentration of conincidences, IMO. 
> > That's why I still think there are no bad sectors at all (at 
> > least not because of this). Is there any way to actually find out?
> 
> 
> Some of the drive manufacturers have tools that will read out
> "log" files from the disks, and these log files include stuff
> such as how many bad block errors where returned to the host
> over the life of the disk.

S.M.A.R.T. should be able to do this. But the last time I checked it 
wasn't working with Linux on SCSI/SATA. Is it working now?

> You would need a decent contatct with the disk manufacturer, and
> you might be able to get them to tell you, maybe.

Well it's a WD 1600SD.

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 17:22               ` Roger Heflin
@ 2006-02-03 19:38                 ` Phillip Susi
  0 siblings, 0 replies; 44+ messages in thread
From: Phillip Susi @ 2006-02-03 19:38 UTC (permalink / raw)
  To: Roger Heflin
  Cc: 'Martin Drab', 'Bill Davidsen',
	'Cynbe ru Taren', 'Linux Kernel Mailing List',
	'Salyzyn, Mark'

I fail to see how this is a reply to my message.  I was asking for 
clarification on what "higher layer" supposedly resulted in this 
behavior (of not being able to access any part of the disk), because as 
far as I know all the higher layers are quite happy to access the 
non-broken parts of the disk, and return the appropriate error to the 
calling application for the bad parts. 

Roger Heflin wrote:
>> That's a strange statement, maybe we could get some 
>> clarification on it?  From the dmesg lines you posted before, 
>> it appeared that the hardware was failing the request with a 
>> bad disk sense code.  As I said before, normally Linux has no 
>> problem reading the good parts of a partially bad disk, so I 
>> wonder exactly what Mark means by "upper layers which are 
>> only zero fault tollerant"?
>>     
>
>
> Some of the fakeraid controllers will kill the disk when the
> disk returns a failure like that.
>
> On top of that usually (even if the controller were not to
> kill the disk) the application will get a fatal disk error
> also, causing the application to die.
>
> The best I have been able to hope for (this is a raid0 stripe
> case) is that the fakeraid controller does not kill the disk,
> returns the disk error to the higher levels and lets the application
> be killed, at least in this case you will likely know the disk
> has a fatal error, rather than (in the raid0 case) having the
> machine crash, and have to debug it to determine exactly
> what the nature of the failure was.
>
> The same may need to be applied when the array is already
> in degraded mode ... limping along with some lost data and messages
> indicating such is a lot better that losing all of the data.
>
>                            Roger


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 19:12                 ` Martin Drab
@ 2006-02-03 19:41                   ` Phillip Susi
  2006-02-03 19:45                     ` Martin Drab
  0 siblings, 1 reply; 44+ messages in thread
From: Phillip Susi @ 2006-02-03 19:41 UTC (permalink / raw)
  To: Martin Drab
  Cc: Roger Heflin, 'Bill Davidsen', 'Cynbe ru Taren',
	'Linux Kernel Mailing List', 'Salyzyn, Mark'

Martin Drab wrote:
> S.M.A.R.T. should be able to do this. But last time I've checked it wasn't 
> working with Linux and SCSI/SATA. Is this working now?
>
>   

Yes, it is working now.  The smartmontools package returns all kinds of 
handy information from the drive and can force the drive to perform a 
low-level disk self-test on request.  It likely won't pass through a 
hardware RAID controller, however. 
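
As a rough sketch of what that looks like (assuming smartmontools is 
installed and the controller passes ATA commands through, which many 
hardware RAID cards do not; some setups also need an extra -d option 
depending on the driver):

$ smartctl -a /dev/sda            # identity, SMART attributes and the drive's error log
$ smartctl -t long /dev/sda       # start the drive's built-in long self-test
$ smartctl -l selftest /dev/sda   # read back the self-test results afterwards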

> Well it's a WD 1600SD.
>
> Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 19:41                   ` Phillip Susi
@ 2006-02-03 19:45                     ` Martin Drab
  0 siblings, 0 replies; 44+ messages in thread
From: Martin Drab @ 2006-02-03 19:45 UTC (permalink / raw)
  To: Phillip Susi
  Cc: Roger Heflin, 'Bill Davidsen', 'Cynbe ru Taren',
	'Linux Kernel Mailing List', 'Salyzyn, Mark'

On Fri, 3 Feb 2006, Phillip Susi wrote:

> Martin Drab wrote:
> > S.M.A.R.T. should be able to do this. But last time I've checked it wasn't
> > working with Linux and SCSI/SATA. Is this working now?
> 
> Yes, it is working now.  The smartutils package returns all kinds of handy
> information from the drive and can force the drive to perform a low level disk
> check on request.  It likely won't pass through a hardware raid controller
> however. 

Yes, that may be another issue. It depends on whether AACRAID is ready 
for that or not. (Adaptec declares that the controller is SMART capable.)

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-01-19 16:25         ` Alan Cox
@ 2006-02-08 14:46           ` Alan Cox
  0 siblings, 0 replies; 44+ messages in thread
From: Alan Cox @ 2006-02-08 14:46 UTC (permalink / raw)
  To: Mark Lord; +Cc: Helge Hafting, Cynbe ru Taren, linux-kernel

On Iau, 2006-01-19 at 16:25 +0000, Alan Cox wrote:
> On Iau, 2006-01-19 at 10:59 -0500, Mark Lord wrote:
> > But the card is a total slug unless the host does 32-bit PIO to/from it.
> > Do we have that capability in libata yet?
> 
> Very very easy to sort out. Just need a ->pio_xfer method set. Would
> then eliminate some of the core driver flags and let us do vlb sync for
> legacy hw


This is now all done and present in the 2.6.16 libata PATA patches.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-02 22:10     ` Bill Davidsen
@ 2006-02-08 21:58       ` Pavel Machek
  0 siblings, 0 replies; 44+ messages in thread
From: Pavel Machek @ 2006-02-08 21:58 UTC (permalink / raw)
  To: Bill Davidsen; +Cc: Phillip Susi, linux-kernel

Hi!

> >If 1 disk has a 1/1000 chance of failure, then
> >2 disks have a (1/1000)^2 chance of double failure, and
> >3 disks have a (1/1000)^2 * 3 chance of double failure
> >4 disks have a (1/1000)^2 * 7 chance of double failure
> 
> After the first drive fails you have no redundancy, the 
> chance of an additional failure is linear to the number 
> of remaining drives.
> 
> Assume:
>   p - probability of a drive failing in unit time
>   n - number of drives
>   F - probability of double failure
> 
> The chance of a single drive failure is n*p. After that 

<pedantic>
Actually it is not. Imagine 100 drives with a 10% failure rate each. You
can't have a probability of 1000%...
</> 
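
(In exact terms, the probability of at least one failure among n 
independent drives is 1 - (1-p)^n, which always stays below 1; with 
p = 0.10 and n = 100 that gives 1 - 0.9^100, roughly 0.99997. n*p is 
only a reasonable approximation while n*p << 1.)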

							Pavel
-- 
Thanks, Sharp!

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 17:00 Salyzyn, Mark
  2006-02-03 17:39 ` Martin Drab
@ 2006-02-03 19:46 ` Phillip Susi
  1 sibling, 0 replies; 44+ messages in thread
From: Phillip Susi @ 2006-02-03 19:46 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: Martin Drab, Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List

Salyzyn, Mark wrote:
> The drive is low level formatted. This resolved the problem you were
> having.
>
>   

Could you define what you mean by "low level format"?  AFAIK, IDE drives 
do not provide a command to low-level format them (like MFM and RLL 
drives required), so the best you can do is write zeroes to all sectors 
on the disk. 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: FYI: RAID5 unusably unstable through 2.6.14
  2006-02-03 17:00 Salyzyn, Mark
@ 2006-02-03 17:39 ` Martin Drab
  2006-02-03 19:46 ` Phillip Susi
  1 sibling, 0 replies; 44+ messages in thread
From: Martin Drab @ 2006-02-03 17:39 UTC (permalink / raw)
  To: Salyzyn, Mark
  Cc: Phillip Susi, Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List

On Fri, 3 Feb 2006, Salyzyn, Mark wrote:

> Martin Drab [mailto:drab@kepler.fjfi.cvut.cz] sez:
> > That may very well be true. I do not know what the Adaptec 
> > BIOS does under the "Low-Level Format" option. Maybe someone from
> Adaptec 
> > would know that.
> 
> The drive is low level formatted. This resolved the problem you were
> having.
> 
> > No, I don't think this was the case of a physically bad 
> > sectors. I think it was just an inconsistency of the RAID controllers
> metadata (or 
> > something simillar) related to that particular array.
> 
> It was a case of a set of physically bad sectors in a non-redundant
> formation resulting in a non-recoverable situation, from what I could
> tell. Read failures do not take the array offline, write failures do.

Again, neither reads nor writes resulted in the disk going offline. 
(Though I'm not quite positive I tried writing under Linux.) And it 
definitely wasn't caused by the controller, since I was doing both reads 
and writes to that "faulty" array from Windows and all those operations 
completed without any problem.

> Instead the adapter responds with a hardware failure to the read
> responses. Writing the data would have re-assigned the bad blocks. (RAID
> controllers do reassign media bad blocks automatically, but sets them as
> inconsistent under some scenarios, requiring a write to mark them
> consistent again. This is no different to how single drive media reacts
> to faulty or corruption issues).
>
> The bad sectors were localized only affecting the Linux partition, the
> accesses were to directory or superblock nodes if memory serves. Another
> system partition was unaffected because the errors were not localized to
> it's area.

However, I was able to read the Linux ext3 data on the /dev/sda2 
partition using Total Commander's ext2 plugin from Windows, and that 
worked well for the entire partition (both reads and writes).

Are you 100% certain it must have been physically bad sectors? I'm not 
at all sure.

> Besides low level formatting, there is not much anyone can do about this
> issue except ask for a less catastrophic response from the Linux File
> system drivers.

This has nothing to do with filesystems, since no access to that block 
device was possible at all.

> I make no offer or suggestion regarding the changes that
> would be necessary to support the system limping along when file system
> data has been corrupted; UNIX policy in general is to walk away as
> quickly as possible and do the least continuing damage.
> 
> Except this question: If a superblock can not be read in, what about the
> backup copies? Could an fsck play games with backup copies to result in
> a write to close inconsistencies?

OK, this is probably also something that needs to be improved if that 
problem exists as well, but it is a totally different case from what 
happened here. This certainly had nothing to do with filesystems, as 
(as I've mentioned earlier) not even plain access to the whole /dev/sda 
using dd(1) was working.

Martin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: FYI: RAID5 unusably unstable through 2.6.14
@ 2006-02-03 17:00 Salyzyn, Mark
  2006-02-03 17:39 ` Martin Drab
  2006-02-03 19:46 ` Phillip Susi
  0 siblings, 2 replies; 44+ messages in thread
From: Salyzyn, Mark @ 2006-02-03 17:00 UTC (permalink / raw)
  To: Martin Drab, Phillip Susi
  Cc: Bill Davidsen, Cynbe ru Taren, Linux Kernel Mailing List

Martin Drab [mailto:drab@kepler.fjfi.cvut.cz] sez:
> That may very well be true. I do not know what the Adaptec 
> BIOS does under the "Low-Level Format" option. Maybe someone from
Adaptec 
> would know that.

The drive is low level formatted. This resolved the problem you were
having.

> No, I don't think this was the case of a physically bad 
> sectors. I think it was just an inconsistency of the RAID controllers
metadata (or 
> something simillar) related to that particular array.

It was a case of a set of physically bad sectors in a non-redundant
formation resulting in a non-recoverable situation, from what I could
tell. Read failures do not take the array offline; write failures do.
Instead, the adapter responds to the reads with a hardware failure.
Writing the data would have reassigned the bad blocks. (RAID controllers
do reassign bad media blocks automatically, but mark them as inconsistent
under some scenarios, requiring a write to mark them consistent again.
This is no different from how single-drive media reacts to fault or
corruption issues.)
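
(For illustration only, a rough sketch of what such a rewrite looks like
from the host side on a plain disk; the LBA below is purely hypothetical,
the dd command overwrites data so the sector number must be verified
first, and behind a hardware RAID controller the SMART query may not pass
through at all:

$ smartctl -A /dev/sda | grep -i -e Reallocated_Sector -e Current_Pending   # reallocated/pending sector counts
$ dd if=/dev/zero of=/dev/sda bs=512 seek=123456 count=1 oflag=direct       # rewrite one suspect 512-byte sector so the drive can remap it
)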

The bad sectors were localized, affecting only the Linux partition; the
accesses were to directory or superblock nodes, if memory serves. The
other system partition was unaffected because the errors were not located
in its area.

Besides low-level formatting, there is not much anyone can do about this
issue except ask for a less catastrophic response from the Linux
filesystem drivers. I make no offer or suggestion regarding the changes
that would be necessary to support the system limping along when
filesystem data has been corrupted; UNIX policy in general is to walk
away as quickly as possible and do the least continuing damage.

Except this question: if a superblock cannot be read in, what about the
backup copies? Could an fsck play games with backup copies to result in
a write that closes the inconsistencies?

-- Mark Salyzyn

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2006-02-09 17:06 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-01-17 19:35 FYI: RAID5 unusably unstable through 2.6.14 Cynbe ru Taren
2006-01-17 19:39 ` Benjamin LaHaise
2006-01-17 20:13   ` Martin Drab
2006-01-17 23:39     ` Michael Loftis
2006-01-18  2:30       ` Martin Drab
2006-02-02 20:33     ` Bill Davidsen
2006-02-03  0:57       ` Martin Drab
2006-02-03  1:13         ` Martin Drab
2006-02-03 15:41         ` Phillip Susi
2006-02-03 16:13           ` Martin Drab
2006-02-03 16:38             ` Phillip Susi
2006-02-03 17:22               ` Roger Heflin
2006-02-03 19:38                 ` Phillip Susi
2006-02-03 17:51             ` Martin Drab
2006-02-03 19:10               ` Roger Heflin
2006-02-03 19:12                 ` Martin Drab
2006-02-03 19:41                   ` Phillip Susi
2006-02-03 19:45                     ` Martin Drab
2006-01-17 19:56 ` Kyle Moffett
2006-01-17 19:58 ` David R
2006-01-17 20:00 ` Kyle Moffett
2006-01-17 23:27 ` Michael Loftis
2006-01-18  0:12   ` Kyle Moffett
2006-01-18 11:24     ` Erik Mouw
2006-01-18  0:21   ` Phillip Susi
2006-01-18  0:29     ` Michael Loftis
2006-01-18  2:10       ` Phillip Susi
2006-01-18  3:01         ` Michael Loftis
2006-01-18 16:49           ` Krzysztof Halasa
2006-01-18 16:47         ` Krzysztof Halasa
2006-02-02 22:10     ` Bill Davidsen
2006-02-08 21:58       ` Pavel Machek
2006-01-18 10:54 ` Helge Hafting
2006-01-18 16:15   ` Mark Lord
2006-01-18 17:32     ` Alan Cox
2006-01-19 15:59       ` Mark Lord
2006-01-19 16:25         ` Alan Cox
2006-02-08 14:46           ` Alan Cox
2006-01-18 23:37     ` Neil Brown
2006-01-19 15:53       ` Mark Lord
2006-01-19  0:13 ` Neil Brown
2006-02-03 17:00 Salyzyn, Mark
2006-02-03 17:39 ` Martin Drab
2006-02-03 19:46 ` Phillip Susi
