* mismatch_cnt again
@ 2009-11-07  0:41 Eyal Lebedinsky
  2009-11-07  1:53 ` berk walker
  2009-11-09 22:03 ` Eyal Lebedinsky
  0 siblings, 2 replies; 58+ messages in thread
From: Eyal Lebedinsky @ 2009-11-07  0:41 UTC (permalink / raw)
  To: linux-raid list

For years I found the mismatch_cnt rising regularly every few weeks and could
never relate it to any events.

I since replaced the computer, installed fedora 11 (was very old debian)
and only kept the array itself (ext3 on 5x1TB raid5). I had the raid
'repair'ed to get it to mismatch_cnt=0.
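
(For reference, a minimal sketch of the sysfs interface involved,
assuming the array is md0:)

	# 'repair' rewrites inconsistent stripes; mismatch_cnt is reset at
	# the start of each check/repair pass and counts what the pass finds
	echo repair > /sys/block/md0/md/sync_action
	# after the repair goes idle, a fresh 'check' should report 0
	echo check > /sys/block/md0/md/sync_action
	cat /sys/block/md0/md/mismatch_cnt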

I thought that I saw the last of these. I had a good run for almost three
months, then last week I saw the first mismatch_cnt=184. It was still so
on this weekly 'check'.

I cannot see any bad event logged.

Are there situations known to cause this without an actual hardware failure?
I know that this came up in the past (often) but I see little recent
discussion and wonder what the current status is.

For the last 6 weeks (my uptime) the machine runs
	2.6.30.5-43.fc11.x86_64 #1 SMP

The raid holds data (no root or swap) used mostly as DVR (nothing heavy).
smartd checks each week and so far no errors. The disks are modern 1yo
"SAMSUNG HD103UJ".

TIA

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07  0:41 mismatch_cnt again Eyal Lebedinsky
@ 2009-11-07  1:53 ` berk walker
  2009-11-07  7:49   ` Eyal Lebedinsky
  2009-11-09 22:03 ` Eyal Lebedinsky
  1 sibling, 1 reply; 58+ messages in thread
From: berk walker @ 2009-11-07  1:53 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: linux-raid list

Eyal Lebedinsky wrote:
> For years I found the mismatch_cnt rising regularly every few weeks and 
> could
> never relate it to any events.
> 
> I since replaced the computer, installed fedora 11 (was very old debian)
> and only kept the array itself (ext3 on 5x1TB raid5). I had the raid
> 'repair'ed to get it to mismatch_cnt=0.
> 
> I thought that I saw the last of these. I had a good run for almost three
> months, then last week I saw the first mismatch_cnt=184. It was still so
> on this weekly 'check'.
> 
> I cannot see any bad event logged.
> 
> Are there situations known to cause this without an actual hardware 
> failure?
> I know that this came up in the past (often) but I see little recent
> discussion and wonder what the current status is.
> 
> For the last 6 weeks (my uptime) the machine runs
>     2.6.30.5-43.fc11.x86_64 #1 SMP
> 
> The raid holds data (no root or swap) used mostly as DVR (nothing heavy).
> smartd checks each week and so far no errors. The disks are modern 1yo
> "SAMSUNG HD103UJ".
> 
> TIA
> 
 >I< am not quite sure what you are reporting as a problem here, sir. 
"new computer".. drives ~ 1 yr old...mismatch..I know.. seems like 
things just want to die ..
Do you have any logs showing these strangenesses?

I suggest - start off with SOMETHING at zero point, then track what 
changes...  oh BTW did you change the drive cables?

Best to you,
b-



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07  1:53 ` berk walker
@ 2009-11-07  7:49   ` Eyal Lebedinsky
  2009-11-07  8:08     ` Michael Evans
  0 siblings, 1 reply; 58+ messages in thread
From: Eyal Lebedinsky @ 2009-11-07  7:49 UTC (permalink / raw)
  To: linux-raid list

Berk,

I am not sure that I understood your response. As I explained, I did start with
a clean slate: new hardware, a new system install and a zero mismatch count.

After three months the weekly 'check' detected a mismatch. During this period
there were no hardware errors reported on this system. smart shows no issues
with the disks.

You ask "Do you have any logs showing these strangenesses?" and I have none,
just a non-zero count from a raid check.

About cables: does the SATA protocol have checksums on the transactions?
I always assumed so but never looked into it.
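
(A quick cable sanity check regardless: SMART attribute 199,
UDMA_CRC_Error_Count, counts link-level CRC errors, so a marginal cable
should show up there. The device name below is an example only:)

	# attribute 199 rises on cable/link CRC errors
	smartctl -A /dev/sda | grep -i crc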

I also ran a long memtest before commissioning the new hardware.

cheers
	Eyal

berk walker wrote:
> Eyal Lebedinsky wrote:
>> For years I found the mismatch_cnt rising regularly every few weeks 
>> and could
>> never relate it to any events.
>>
>> I since replaced the computer, installed fedora 11 (was very old debian)
>> and only kept the array itself (ext3 on 5x1TB raid5). I had the raid
>> 'repair'ed to get it to mismatch_cnt=0.
>>
>> I thought that I saw the last of these. I had a good run for almost three
>> months, then last week I saw the first mismatch_cnt=184. It was still so
>> on this weekly 'check'.
>>
>> I cannot see any bad event logged.
>>
>> Are there situations known to cause this without an actual hardware 
>> failure?
>> I know that this came up in the past (often) but I see little recent
>> discussion and wonder what the current status is.
>>
>> For the last 6 weeks (my uptime) the machine runs
>>     2.6.30.5-43.fc11.x86_64 #1 SMP
>>
>> The raid holds data (no root or swap) used mostly as DVR (nothing heavy).
>> smartd checks each week and so far no errors. The disks are modern 1yo
>> "SAMSUNG HD103UJ".
>>
>> TIA
>>
>  >I< am not quite sure what you are reporting as a problem here, sir. 
> "new computer".. drives ~ 1 yr old...mismatch..I know.. seems like 
> things just want to die ..
> Do you have any logs showing these strangenesses?
> 
> I suggest - start off with SOMETHING at zero point, then track what 
> changes...  oh BTW did you change the drive cables?
> 
> Best to you,
> b-

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07  7:49   ` Eyal Lebedinsky
@ 2009-11-07  8:08     ` Michael Evans
  2009-11-07  8:42       ` Eyal Lebedinsky
  2009-11-07 13:51       ` Goswin von Brederlow
  0 siblings, 2 replies; 58+ messages in thread
From: Michael Evans @ 2009-11-07  8:08 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: linux-raid list

Your dmesg and/or the syslog stream of the same kernel warnings/info
should show you when and where these errors occurred.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07  8:08     ` Michael Evans
@ 2009-11-07  8:42       ` Eyal Lebedinsky
  2009-11-07 13:51       ` Goswin von Brederlow
  1 sibling, 0 replies; 58+ messages in thread
From: Eyal Lebedinsky @ 2009-11-07  8:42 UTC (permalink / raw)
  To: linux-raid list

Michael,

This is something new to me. I never saw any message in any of the logs.

I am not a new Linux user, and neither am I unfamiliar with looking after
my system, yet I have never seen any messages related to these mismatches.

The first time I find out about this is when I run a raid 'check', and
even this scan does not produce any messages about the mismatch, which
is simply recorded in /sys/devices/virtual/block/md0/md/mismatch_cnt:

Nov  7 03:05:02 e7 kernel: md: data-check of RAID array md0
Nov  7 03:05:02 e7 kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Nov  7 03:05:02 e7 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Nov  7 03:05:02 e7 kernel: md: using 128k window, over a total of 976759808 blocks.
Nov  7 07:06:22 e7 kernel: md: md0: data-check done.

All intervening messages are unrelated (named verbiage).

Eyal

Michael Evans wrote:
> Your dmesg and/or the syslog stream of the same kernel warnings/info
> should show you when and where these errors occurred.

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07  8:08     ` Michael Evans
  2009-11-07  8:42       ` Eyal Lebedinsky
@ 2009-11-07 13:51       ` Goswin von Brederlow
  2009-11-07 14:58         ` Doug Ledford
  1 sibling, 1 reply; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-07 13:51 UTC (permalink / raw)
  To: Michael Evans; +Cc: Eyal Lebedinsky, linux-raid list

Michael Evans <mjevans1983@gmail.com> writes:

> Your dmesg and/or the syslog stream of the same kernel warnings/info
> should show you when and where these errors occurred.

I believe mismatch count doesn't show up in the kernel. The mismatch
count shows where data can be read clearly from the disks but the
computed parity does not match the read parity (or the mirrors
disagree). If the drive reports an actual error then the block is
recomputed and not left as mismatch.

So this would be caused by a bit flipping in RAM (CPU, controller or
disk) before being written to the platter, flipping in the cable or
flipping on the platter. Or software.

I currently only have mismatches on raid1. In both cases on a device
containing swap on lvm, which I think is the culprit. Lucky me.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 13:51       ` Goswin von Brederlow
@ 2009-11-07 14:58         ` Doug Ledford
  2009-11-07 16:23           ` Piergiorgio Sartor
                             ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: Doug Ledford @ 2009-11-07 14:58 UTC (permalink / raw)
  To: Goswin von Brederlow; +Cc: Michael Evans, Eyal Lebedinsky, linux-raid list

On 11/07/2009 08:51 AM, Goswin von Brederlow wrote:
> Michael Evans <mjevans1983@gmail.com> writes:
> 
>> Your dmesg and/or the syslog stream of the same kernel warnings/info
>> should show you when and where these errors occurred.
> 
> I believe mismatch count doesn't show up in the kernel. The mismatch
> count shows where data can be read clearly from the disks but the
> computed parity does not match the read parity (or the mirrors
> disagree). If the drive reports an actual error then the block is
> recomputed and not left as mismatch.
> 
> So this would be caused by a bit flipping in RAM (CPU, controller or
> disk) before being written to the platter, flipping in the cable or
> flipping on the platter. Or software.
> 
> I currently only have mismatches on raid1. In both cases on a device
> containing swap on lvm, which I think is the culprit. Lucky me.

I'm very quickly starting to become dubious of the current mismatch_cnt
implementation.  I think a kernel patch is in order and I may just work
on that today.  Here's the deal: a non-0 mismatch count is worthless if
you don't also tell people *where* the mismatch is so they can
investigate it and correct it.

And Goswin is correct, once a mismatch exists, reading the mismatch
would not normally produce any kernel messages because the data is being
read just fine, it's simply inconsistent (bad parity or disagreeing
copies in raid1/10).  Whatever *caused* it to be inconsistent might show
up in your logs (system crash, drive reset) or it might not (sectors
went bad on a disk and were reallocated by the disk's firmware so they
now read all zeros or just random junk instead of your data).

And actually, with 1TB drives, your most likely culprit for this is the
last item I just listed: reallocated drive sectors.  Here's the deal.
If the drive detects the bad sectors during a write, it reallocates and
redoes the write to the new sectors, data saved.  If, on the other hand,
the sectors go bad after the write, then whether or not your data gets
saved depends on a number of factors.  For instance, if the sectors were
going bad slowly and you also read those sectors on a regular basis so
the drive firmware would have reason to know that they are going bad (it
would start gettings reads with errors that it had to ECC correct before
it went totally bad), then some drives will reallocate the sectors and
move the data before it's totally lost.  But, if they go bad suddenly,
or if they went bad without having frequent enough intervening reads to
pick it up that it was on its way to going bad, then the data is just
lost.  But, that's what RAID is for, so we can get it back.  Anyway,
that's my guess for the culprit of your situation.  And, unfortunately,
without getting in and looking at the mismatch to identify the correct
data, a repair operation is just as likely (50-50 chance) to corrupt
things as opposed to correct things.

With Fedora 11 there should be the palimpsest program installed.  Run it
and it will allow you to see the SMART details on each drive.  Take a
look and see if you have any showing reallocated sectors.  I happen to
have 4 of 6 drives in my array that show reallocated sectors.  I also
happen to be lucky in that none of my weekly raid-checks have turned up
a mismatch count on any devices, so the bad sectors must have been
caught in time (or there was a read error sometime for the sectors and
the raid subsystem corrected it, but if that happened I missed it in the
kernel logs).
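
(The same attributes can be pulled with smartctl if you prefer the
command line; a sketch, with the device list adjusted to taste:)

	# reallocated and pending sector counts for each array member
	for d in /dev/sd[a-f]; do
		echo "== $d =="
		smartctl -A $d | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
	done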

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 14:58         ` Doug Ledford
@ 2009-11-07 16:23           ` Piergiorgio Sartor
  2009-11-07 16:37             ` Doug Ledford
  2009-11-08 15:32             ` Goswin von Brederlow
  2009-11-07 22:19           ` Eyal Lebedinsky
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 58+ messages in thread
From: Piergiorgio Sartor @ 2009-11-07 16:23 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Goswin von Brederlow, Michael Evans, Eyal Lebedinsky, linux-raid list

On 11/07/2009 03:58 PM, Doug Ledford wrote:
[...]
> I'm very quickly starting to become dubious of the current mismatch_cnt
> implementation.  I think a kernel patch is in order and I may just work
> on that today.  Here's the deal: a non-0 mismatch count is worthless if
> you don't also tell people *where* the mismatch is so they can
> investigate it and correct it.

You're perfectly right.

And this, again, fits in the discussion of RAID-6 error
check and, potentially, repair.

Ideally the log should tell which (RAID) address has a
mismatch and, in case of RAID-6, if a specific device
could be faulty at that position.

This would already be quite a huge step forward in improving
the overall reliability of the RAID sub-system.

Side note: in F11 there is this cron.weekly raid-check, but
nothing is reported (per email to root) in case of mismatch.
Is there any plan to add such a facility?

Thanks.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 16:23           ` Piergiorgio Sartor
@ 2009-11-07 16:37             ` Doug Ledford
  2009-11-07 22:25               ` Eyal Lebedinsky
  2009-11-08 15:32             ` Goswin von Brederlow
  1 sibling, 1 reply; 58+ messages in thread
From: Doug Ledford @ 2009-11-07 16:37 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: linux-raid list

On 11/07/2009 11:23 AM, Piergiorgio Sartor wrote:
> Side note: in F11 there is this cron.weekly raid-check, but
> nothing is reported (per email to root) in case of mismatch.
> Is there any plan to add such a facility?

Unless you've modified the default cron behaviour, then yes something is
reported to the user on a mismatch.  The fact that the script goes from
being silent to echoing a warning will result in the cron job itself
emailing root the output of the script.
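
(Roughly this pattern; a sketch, not the shipped script:)

	# cron mails root whatever a job prints, so print nothing when clean
	cnt=$(cat /sys/block/md0/md/mismatch_cnt)
	if [ "$cnt" -ne 0 ]; then
		echo "WARNING: mismatch_cnt is $cnt on md0"
	fi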


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 14:58         ` Doug Ledford
  2009-11-07 16:23           ` Piergiorgio Sartor
@ 2009-11-07 22:19           ` Eyal Lebedinsky
  2009-11-07 22:58             ` Doug Ledford
  2009-11-08 15:46           ` Goswin von Brederlow
  2009-11-09 18:11           ` Bill Davidsen
  3 siblings, 1 reply; 58+ messages in thread
From: Eyal Lebedinsky @ 2009-11-07 22:19 UTC (permalink / raw)
  To: linux-raid list

Doug Ledford wrote:
[trim]
> And actually, with 1TB drives, your most likely culprit for this is the
> last item I just listed: reallocated drive sectors.
[trim]

I did say "smart shows no issues with the disks". This means, naturally, no
reallocated count/events on any of the 5 drives.

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 16:37             ` Doug Ledford
@ 2009-11-07 22:25               ` Eyal Lebedinsky
  2009-11-07 22:57                 ` Doug Ledford
  0 siblings, 1 reply; 58+ messages in thread
From: Eyal Lebedinsky @ 2009-11-07 22:25 UTC (permalink / raw)
  To: linux-raid list

Doug,

I can only see the cron job start a check:
	echo "check" > /sys/block/$dev/md/sync_action
It does not wait for completion and does not report the count. Neither is
the count reported to the system log.

I replaced it with a script that does report.
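
(Roughly like this; a sketch of the idea rather than the exact script:)

	echo check > /sys/block/$dev/md/sync_action
	# wait for the scan to finish before reading the result
	while [ "$(cat /sys/block/$dev/md/sync_action)" != "idle" ]; do
		sleep 300
	done
	echo "$dev: mismatch_cnt=$(cat /sys/block/$dev/md/mismatch_cnt)"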

Eyal

Doug Ledford wrote:
> On 11/07/2009 11:23 AM, Piergiorgio Sartor wrote:
>> Side note: in F11 there is this cron.weekly raid-check, but
>> nothing is reported (per email to root) in case of mismatch.
>> Is there any plan to add such a facility?
> 
> Unless you've modified the default cron behaviour, then yes something is
> reported to the user on a mismatch.  The fact that the script goes from
> being silent to echoing a warning will result in the cron job itself
> emailing root the output of the script.

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 22:25               ` Eyal Lebedinsky
@ 2009-11-07 22:57                 ` Doug Ledford
  0 siblings, 0 replies; 58+ messages in thread
From: Doug Ledford @ 2009-11-07 22:57 UTC (permalink / raw)
  To: Eyal Lebedinsky, Linux RAID Mailing List

On 11/07/2009 05:25 PM, Eyal Lebedinsky wrote:
> Doug,
> 
> I can only see the cron job start a check:
>     echo "check" > /sys/block/$dev/md/sync_action
> It does not wait for completion and does not report the count. Neither is
> the count reported to the system log.
> 
> I replaced it with a script that does report.

Or you can update to a later mdadm package.  The one in rawhide has a
fairly well fleshed-out script now.

> Eyal
> 
> Doug Ledford wrote:
>> On 11/07/2009 11:23 AM, Piergiorgio Sartor wrote:
>>> Side note: in F11 there is this cron.weekly raid-check, but
>>> nothing is reported (per email to root) in case of mismatch.
>>> Is there any plan to add such a facility?
>>
>> Unless you've modified the default cron behaviour, then yes something is
>> reported to the user on a mismatch.  The fact that the script goes from
>> being silent to echoing a warning will result in the cron job itself
>> emailing root the output of the script.
> 


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 22:19           ` Eyal Lebedinsky
@ 2009-11-07 22:58             ` Doug Ledford
  0 siblings, 0 replies; 58+ messages in thread
From: Doug Ledford @ 2009-11-07 22:58 UTC (permalink / raw)
  To: Eyal Lebedinsky; +Cc: linux-raid list

On 11/07/2009 05:19 PM, Eyal Lebedinsky wrote:
> Doug Ledford wrote:
> [trim]
>> And actually, with 1TB drives, your most likely culprit for this is the
>> last item I just listed: reallocated drive sectors.
> [trim]
> 
> I did say "smart shows no issues with the disks". This means, naturally, no
> reallocated count/events on any of the 5 drives.
> 

Sorry, overlooked that (or thought it was just that it passed the self
test... wasn't sure that it specifically meant the reallocated sector
count was 0).

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 16:23           ` Piergiorgio Sartor
  2009-11-07 16:37             ` Doug Ledford
@ 2009-11-08 15:32             ` Goswin von Brederlow
  2009-11-09 18:08               ` Bill Davidsen
  1 sibling, 1 reply; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-08 15:32 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Doug Ledford, Goswin von Brederlow, Michael Evans,
	Eyal Lebedinsky, linux-raid list

Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> writes:

> On 11/07/2009 03:58 PM, Doug Ledford wrote:
> [...]
>> I'm very quickly starting to become dubious of the current mismatch_cnt
>> implementation.  I think a kernel patch is in order and I may just work
>> on that today.  Here's the deal: a non-0 mismatch count is worthless if
>> you don't also tell people *where* the mismatch is so they can
>> investigate it and correct it.
>
> You're perfectly right.
>
> And this, again, fits in the discussion of RAID-6 error
> check and, potentially, repair.
>
> Ideally the log should tell which (RAID) address has a
> mismatch and, in case of RAID-6, if a specific device
> could be faulty at that position.

Actually, in raid6 mode, if one parity block is bad but the other is
correct I would expect that to automatically repair the bad block, at
least optionally. Same with a 3+ way mirror and one mirror being bad.

In general if a block is bad and the kernel can isolate which block in
a stripe is bad then it should repair it while checking.

> This would be already quite a huge step forward in improving
> the overall reliability of the RAID sub-system.
>
> Side note: in F11 there is this cron.weekly raid-check, but
> nothing is reported (per email to root) in case of mismatch.
> Is there any plan to add such a facility?

In Debian it is monthly, on the first Sunday of the month. It takes too
long to do weekly, imho.

> Thanks.
>
> bye,

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 14:58         ` Doug Ledford
  2009-11-07 16:23           ` Piergiorgio Sartor
  2009-11-07 22:19           ` Eyal Lebedinsky
@ 2009-11-08 15:46           ` Goswin von Brederlow
  2009-11-08 16:04             ` Piergiorgio Sartor
  2009-11-08 22:51             ` Peter Rabbitson
  2009-11-09 18:11           ` Bill Davidsen
  3 siblings, 2 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-08 15:46 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Goswin von Brederlow, Michael Evans, Eyal Lebedinsky, linux-raid list

Doug Ledford <dledford@redhat.com> writes:

> On 11/07/2009 08:51 AM, Goswin von Brederlow wrote:
>> Michael Evans <mjevans1983@gmail.com> writes:
>> 
>>> Your dmesg and/or the syslog stream of the same kernel warnings/info
>>> should show you when and where these errors occurred.
>> 
>> I believe mismatch count doesn't show up in the kernel. The mismatch
>> count shows where data can be read clearly from the disks but the
>> computed parity does not match the read parity (or the mirrors
>> disagree). If the drive reports an actual error then the block is
>> recomputed and not left as mismatch.
>> 
>> So this would be caused by a bit flipping in RAM (CPU, controller or
>> disk) before being written to the platter, flipping in the cable or
>> flipping on the platter. Or software.
>> 
>> I currently only have mismatches on raid1. In both cases on a device
>> containing swap on lvm, which I think is the culprit. Lucky me.
>
> I'm very quickly starting to become dubious of the current mismatch_cnt
> implementation.  I think a kernel patch is in order and I may just work
> on that today.  Here's the deal: a non-0 mismatch count is worthless if
> you don't also tell people *where* the mismatch is so they can
> investigate it and correct it.
>
> And Goswin is correct, once a mismatch exists, reading the mismatch
> would not normally produce any kernel messages because the data is being
> read just fine, it's simply inconsistent (bad parity or disagreeing
> copies in raid1/10).  Whatever *caused* it to be inconsistent might show
> up in your logs (system crash, drive reset) or it might not (sectors
> went bad on a disk and were reallocated by the disk's firmware so they
> now read all zeros or just random junk instead of your data).

I think the kernel should output a message when it detects a
mismatch, and probably gather sequential mismatches into a single
message in case a disk's sector turned bad completely.

> And actually, with 1TB drives, your most likely culprit for this is the
> last item I just listed: reallocated drive sectors.  Here's the deal.
> If the drive detects the bad sectors during a write, it reallocates and
> redoes the write to the new sectors, data saved.  If, on the other hand,
> the sectors go bad after the write, then whether or not your data gets
> saved depends on a number of factors.  For instance, if the sectors were
> going bad slowly and you also read those sectors on a regular basis so
> the drive firmware would have reason to know that they are going bad (it
> would start getting reads with errors that it had to ECC correct before
> it went totally bad), then some drives will reallocate the sectors and
> move the data before it's totally lost.  But, if they go bad suddenly,
> or if they went bad without having frequent enough intervening reads to
> pick it up that it was on its way to going bad, then the data is just
> lost.  But, that's what RAID is for, so we can get it back.  Anyway,

But unless your drive firmware is broken, the drive will only ever give
the correct data or an error. SMART has a counter for blocks that have
gone bad and will be fixed pending a write to them:
Current_Pending_Sector.

The only way the drive should be able to give you bad data is if
multiple bits toggle in such a way that the ECC still fits.

> that's my guess for the culprit of your situation.  And, unfortunately,
> without getting in and looking at the mismatch to identify the correct
> data, a repair operation is just as likely (50-50 chance) to corrupt
> things as opposed to correct things.

Unless you have more redundancy like raid6 or a 3+ way mirror. My / is a
4-way raid1. Having 3 disks give the same bad data is so much more
unlikely than 1 disk giving bad data that I would be happy to
automatically repair there.

> With Fedora 11 there should be the palimpsest program installed.  Run it
> and it will allow you to see the SMART details on each drive.  Take a
> look and see if you have any showing reallocated sectors.  I happen to
> have 4 of 6 drives in my array that show reallocated sectors.  I also
> happen to be lucky in that none of my weekly raid-checks have turned up
> a mismatch count on any devices, so the bad sectors must have been
> caught in time (or there was a read error sometime for the sectors and
> the raid subsystem corrected it, but if that happened I missed it in the
> kernel logs).

Either corrected on write or repaired. As said, the drive should give
correct data or no data.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-08 15:46           ` Goswin von Brederlow
@ 2009-11-08 16:04             ` Piergiorgio Sartor
  2009-11-09 18:22               ` Bill Davidsen
  2009-11-09 19:13               ` Goswin von Brederlow
  2009-11-08 22:51             ` Peter Rabbitson
  1 sibling, 2 replies; 58+ messages in thread
From: Piergiorgio Sartor @ 2009-11-08 16:04 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

Hi,

> But unless your drive firmware is broken, the drive will only ever give
> the correct data or an error. SMART has a counter for blocks that have
> gone bad and will be fixed pending a write to them:
> Current_Pending_Sector.
> 
> The only way the drive should be able to give you bad data is if
> multiple bits toggle in such a way that the ECC still fits.

Not really, I have disks which are *perfect* in the SMART sense
and nevertheless I had a mismatch count.
This was a SW problem, I think now fixed, in the RAID-10 code.

This means that, yes, there could be mismatches, without
any warning, from other sources than disks.
And these could be anywhere in the system.
I already mentioned, time ago, a cabling problem which was
leading to a similar result: wrong data on different disks,
without any warning or error from the HW layer.

That is why it is important to know *where* the mismatch
occurs and, if possible, in which device component.
If it is in an empty part of the FS, no problem; if it
belongs to a specific file, then it would be possible
to restore/recreate it.

Of course, a tool will be needed that tells which file is
using a certain block of the device.
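
(For ext2/ext3 something close already exists in debugfs; a sketch,
with a made-up block number, assuming the array-level address has
already been translated to a filesystem block:)

	# which inode owns filesystem block 123456?
	debugfs -R "icheck 123456" /dev/md0
	# then map the inode reported above (say 98765) back to a path
	debugfs -R "ncheck 98765" /dev/md0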

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-08 15:46           ` Goswin von Brederlow
  2009-11-08 16:04             ` Piergiorgio Sartor
@ 2009-11-08 22:51             ` Peter Rabbitson
  2009-11-09 18:56               ` Piergiorgio Sartor
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Rabbitson @ 2009-11-08 22:51 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

My 2c on how to approach this subject without controversial auto-repair
issues: http://marc.info/?l=linux-raid&m=120605458309825&w=2
Also follow the thread to see Neil's reply

Cheers

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-08 15:32             ` Goswin von Brederlow
@ 2009-11-09 18:08               ` Bill Davidsen
  0 siblings, 0 replies; 58+ messages in thread
From: Bill Davidsen @ 2009-11-09 18:08 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: Piergiorgio Sartor, Doug Ledford, Michael Evans, Eyal Lebedinsky,
	linux-raid list

Goswin von Brederlow wrote:
> Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> writes:
>
>   
>> On 11/07/2009 03:58 PM, Doug Ledford wrote:
>> [...]
>>     
>>> I'm very quickly starting to become dubious of the current mismatch_cnt
>>> implementation.  I think a kernel patch is in order and I may just work
>>> on that today.  Here's the deal: a non-0 mismatch count is worthless if
>>> you don't also tell people *where* the mismatch is so they can
>>> investigate it and correct it.
>>>       
>> You're perfectly right.
>>
>> And this, again, fits in the discussion of RAID-6 error
>> check and, potentially, repair.
>>
>> Ideally the log should tell which (RAID) address has a
>> mismatch and, in case of RAID-6, if a specific device
>> could be faulty at that position.
>>     
>
> Actually, in raid6 mode, if one parity block is bad but the other is
> correct I would expect that to automatically repair the bad block, at
> least optionally. Same with a 3+ way mirror and one mirror being bad.
>
> In general if a block is bad and the kernel can isolate which block in
> a stripe is bad then it should repair it while checking.
>   

While I agree totally on what the kernel *should* do, AFAIK it does no 
such thing. In fact, I believe that even with a three way mirror the 
mismatch is "fixed" by picking one copy at random and writing it over 
the others, rather than voting.

I haven't looked at this in ages, but that's my memory. Like Dennis 
Miller, "I could be wrong."

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07 14:58         ` Doug Ledford
                             ` (2 preceding siblings ...)
  2009-11-08 15:46           ` Goswin von Brederlow
@ 2009-11-09 18:11           ` Bill Davidsen
  2009-11-09 20:58             ` Doug Ledford
  3 siblings, 1 reply; 58+ messages in thread
From: Bill Davidsen @ 2009-11-09 18:11 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Goswin von Brederlow, Michael Evans, Eyal Lebedinsky, linux-raid list

Doug Ledford wrote:
> And actually, with 1TB drives, your most likely culprit for this is the
> last item I just listed: reallocated drive sectors.  Here's the deal.
> If the drive detects the bad sectors during a write, it reallocates and
> redoes the write to the new sectors, data saved.  If, on the other hand,
> the sectors go bad after the write, then whether or not your data gets
> saved depends on a number of factors.  For instance, if the sectors were
> going bad slowly and you also read those sectors on a regular basis so
> the drive firmware would have reason to know that they are going bad (it
> would start getting reads with errors that it had to ECC correct before
> it went totally bad), then some drives will reallocate the sectors and
> move the data before it's totally lost.  But, if they go bad suddenly,
> or if they went bad without having frequent enough intervening reads to
> pick it up that it was on its way to going bad, then the data is just
> lost.  But, that's what RAID is for, so we can get it back.  Anyway,
> that's my guess for the culprit of your situation.  And, unfortunately,
> without getting in and looking at the mismatch to identify the correct
> data, a repair operation is just as likely (50-50 chance) to corrupt
> things as opposed to correct things.
>
> With Fedora 11 there should be the palimpsest program installed.  Run it
> and it will allow you to see the SMART details on each drive.  Take a
> look and see if you have any showing reallocated sectors.  I happen to
> have 4 of 6 drives in my array that show reallocated sectors.  I also
> happen to be lucky in that none of my weekly raid-checks have turned up
> a mismatch count on any devices, so the bad sectors must have been
> caught in time (or there was a read error sometime for the sectors and
> the raid subsystem corrected it, but if that happened I missed it in the
> kernel logs).
>   

Are you saying or implying that this palimpsest program will show 
reallocated sectors which the current tools (smartmontools) don't? If not, 
what does this tool do other than what's currently done by most people 
using smartctl?

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-08 16:04             ` Piergiorgio Sartor
@ 2009-11-09 18:22               ` Bill Davidsen
  2009-11-09 21:50                 ` NeilBrown
  2009-11-09 19:13               ` Goswin von Brederlow
  1 sibling, 1 reply; 58+ messages in thread
From: Bill Davidsen @ 2009-11-09 18:22 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Goswin von Brederlow, Doug Ledford, Michael Evans,
	Eyal Lebedinsky, linux-raid list

Piergiorgio Sartor wrote:
> Hi,
>
>   
>> But unless your drive firmware is broken, the drive will only ever give
>> the correct data or an error. SMART has a counter for blocks that have
>> gone bad and will be fixed pending a write to them:
>> Current_Pending_Sector.
>>
>> The only way the drive should be able to give you bad data is if
>> multiple bits toggle in such a way that the ECC still fits.
>>     
>
>> Not really, I have disks which are *perfect* in the SMART sense
>> and nevertheless I had a mismatch count.
>> This was a SW problem, I think now fixed, in the RAID-10 code.
>
>   
IIRC there still is an error in raid-1 code, in that data is written to 
multiple drives without preventing modification of the memory between 
writes. As I understand Neil's explanation, this happens (a) when memory 
is being changed rapidly and frequently via memory mapped files, or (b) 
writing via O_DIRECT, or (c) when raid-1 is being used for swap. I'm not 
totally sure why the last one, but I have always seen mismatches on swap 
in a system which is actually swapping. What is more troubling is that 
if I do a hibernate, which writes to swap, and then force a boot from 
other media to a Live-CD, doing a check of the swap array occasionally 
shows a mismatch. That doesn't give me a secure feeling; although I have 
never had an issue in practice, I was just curious.

> This means that, yes, there could be mismatches, without
> any warning, from other sources than disks.
> And these could be anywhere in the system.
> I already mentioned, time ago, a cabling problem which was
> leading to a similar result: wrong data on different disks,
> without any warning or error from the HW layer.
>
> That is why it is important to know *where* the mismatch
> occurs and, if possible, in which device component.
> If it is an empty part of the FS, no problem, if it
> belongs to a specific file, then it would be possible
> to restore/recreate it.
>
> Of course, a tool will be needed telling which file is
> using a certain block of the device.
>   

There are tools which claim to do that, or list blocks used in a given 
file, which is not nearly as useful, but easier to do.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-08 22:51             ` Peter Rabbitson
@ 2009-11-09 18:56               ` Piergiorgio Sartor
  2009-11-09 21:14                 ` NeilBrown
  0 siblings, 1 reply; 58+ messages in thread
From: Piergiorgio Sartor @ 2009-11-09 18:56 UTC (permalink / raw)
  To: Peter Rabbitson
  Cc: Goswin von Brederlow, Doug Ledford, Michael Evans,
	Eyal Lebedinsky, linux-raid list

I think what you explained is the basic common understanding, at
least for me, of what the next step of the MD software will be.

About Neil's answer: my opinion is that you do not need a model
of what happens to a person, jumping out of a plane at 6000ft,
when he reaches the ground, to know he had better use a parachute...

No offense Neil... :-)

In other words, IMHO, sometimes it could be better to be proactive
against unspecified problems than to complain later.

Of course, if this is a manpower issue, maybe we should find
some support for the coding and the rest.

There was already the question about patches; I guess this is
an open possibility.

bye,

> My 2c on how to approach this subject without controversial auto-repair
> issues: http://marc.info/?l=linux-raid&m=120605458309825&w=2
> Also follow the thread to see Neil's reply
> 
> Cheers

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-08 16:04             ` Piergiorgio Sartor
  2009-11-09 18:22               ` Bill Davidsen
@ 2009-11-09 19:13               ` Goswin von Brederlow
  1 sibling, 0 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-09 19:13 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Goswin von Brederlow, Doug Ledford, Michael Evans,
	Eyal Lebedinsky, linux-raid list

Piergiorgio Sartor <piergiorgio.sartor@nexgo.de> writes:

> Hi,
>
>> But unless your drive firmware is broken, the drive will only ever give
>> the correct data or an error. SMART has a counter for blocks that have
>> gone bad and will be fixed pending a write to them:
>> Current_Pending_Sector.
>> 
>> The only way the drive should be able to give you bad data is if
>> multiple bits toggle in such a way that the ECC still fits.
>
> Not really, I have disks which are *perfect* in the SMART sense
> and nevertheless I had a mismatch count.
> This was a SW problem, I think now fixed, in the RAID-10 code.

But that wasn't the drive giving you bad data. That was you writing
bad data in the first place. :)

> This means that, yes, there could be mismatches, without
> any warning, from other sources than disks.
> And these could be anywhere in the system.
> I already mentioned, time ago, a cabling problem which was
> leading to a similar result: wrong data on different disks,
> without any warning or error from the HW layer.
>
> That is why it is important to know *where* the mismatch
> occurs and, if possible, in which device component.
> If it is an empty part of the FS, no problem, if it
> belongs to a specific file, then it would be possible
> to restore/recreate it.

FULL ACK.

> Of course, a tool will be needed telling which file is
> using a certain block of the device.
>
> bye,

Filesystems usually have such a tool. Worst case, write a little C
program that checks the FIBMAP of each file.
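
(Or skip the C program; filefrag from e2fsprogs prints the same
mapping:)

	# list the on-disk blocks backing a file (FIEMAP/FIBMAP underneath)
	filefrag -v /some/file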

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-09 18:11           ` Bill Davidsen
@ 2009-11-09 20:58             ` Doug Ledford
  0 siblings, 0 replies; 58+ messages in thread
From: Doug Ledford @ 2009-11-09 20:58 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Goswin von Brederlow, Michael Evans, Eyal Lebedinsky, linux-raid list

On 11/09/2009 01:11 PM, Bill Davidsen wrote:
> 
> Are you saying or implying that this palimpsest program will show
> relocated sectors which the current tools (smartools) don't? If not,
> what does this tool do other than what's surrently done by most people
> using smartctl?

Nah, it's just a slightly easier way to get the SMART results, if
someone doesn't use smartmontools already anyway.


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: CFBFF194
	      http://people.redhat.com/dledford

Infiniband specific RPMs available at
	      http://people.redhat.com/dledford/Infiniband


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-09 18:56               ` Piergiorgio Sartor
@ 2009-11-09 21:14                 ` NeilBrown
  2009-11-09 21:54                   ` Piergiorgio Sartor
  0 siblings, 1 reply; 58+ messages in thread
From: NeilBrown @ 2009-11-09 21:14 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Peter Rabbitson, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

On Tue, November 10, 2009 5:56 am, Piergiorgio Sartor wrote:
> I think what you explained is the basic common understanding, at
> least for me, of what the next step of the MD software will be.

Is this an offer to submit a patch ?? :-)

>
> About Neil's answer: my opinion is that you do not need a model
> of what happens to a person, jumping out of a plane at 6000ft,
> when he reaches the ground, to know he had better use a parachute...

I disagree.  You do need a model.  The particular features of the
model would be the weight and wind-resistance of the person so that
you can estimate what extra wind resistance is needed to reduce terminal
velocity such that the impact will be something that the person's
legs can absorb.  So you also need the model to describe the legs
in enough detail so that a suitable target terminal velocity can
be determined.

>
> No offense Neil... :-)

I never take offense - it just doesn't seem to be worth the effort.

>
> In other words, IMHO, sometimes it could be better to be proactive
> against unspecified problems than to complain later.

If we proactively hand out parachutes that can just barely land a
small dog safely, then we aren't doing any people any favours,
and probably are making their situation less safe because they are
more likely to take a risk in the belief that their parachute
will protect them - which it might not.

>
> Of course, if this is a manpower issue, maybe we should find
> some support for the coding and the rest.

Certainly manpower is an issue - and it is pointless spending it
on something that you think sounds nice, but have no evidence that it
will actually address a real need.
The money spent on those dog-sized parachutes would clearly be
a complete waste.

NeilBrown



>
> There was already the question about patches, I guess this is
> an open possibility.
>
> bye,
>
>> My 2c on how to approach this subject without controversial auto-repair
>> issues: http://marc.info/?l=linux-raid&m=120605458309825&w=2
>> Also follow the thread to see Neil's reply
>>
>> Cheers
>
> --
>
> piergiorgio


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-09 18:22               ` Bill Davidsen
@ 2009-11-09 21:50                 ` NeilBrown
  2009-11-10 18:05                   ` Bill Davidsen
  0 siblings, 1 reply; 58+ messages in thread
From: NeilBrown @ 2009-11-09 21:50 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

On Tue, November 10, 2009 5:22 am, Bill Davidsen wrote:
> Piergiorgio Sartor wrote:
>> Hi,
>>
>>
>>> But unless your drive firmware is broken, the drive will only ever give
>>> the correct data or an error. SMART has a counter for blocks that have
>>> gone bad and will be fixed pending a write to them:
>>> Current_Pending_Sector.
>>>
>>> The only way the drive should be able to give you bad data is if
>>> multiple bits toggle in such a way that the ECC still fits.
>>>
>>
>> Not really, I have disks which are *perfect* in the SMART sense
>> and nevertheless I had a mismatch count.
>> This was a SW problem, I think now fixed, in the RAID-10 code.
>>
>>
> IIRC there still is an error in raid-1 code, in that data is written to
> multiple drives without preventing modification of the memory between
> writes. As I understand Neil's explanation, this happens (a) when memory
> is being changed rapidly and frequently via memory mapped files, or (b)
> writing via O_DIRECT, or (c) when raid-1 is being used for swap. I'm not
> totally sure why the last one, but I have always seen mismatches on swap
> in a system which is actually swapping. What is more troubling is that
> if I do a hibernate, which writes to swap, and then force a boot from
> other media to a Live-CD, doing a check of the swap array occasionally
> shows a mismatch. That doesn't give me a secure feeling, although I have
> never had an issue in practice, I was just curious.

I don't think this is really an error in the RAID1 code.
The only thing that the RAID1 code could do differently is make a local
copy of the data and then write that to all of the devices (a bit like
RAID5 does so it can generate a parity block reliably).
Doing this would introduce a performance penalty with no real
benefit (the only benefit would be to stop long email threads about
mismatch_cnt :-)

You could possibly argue that it is a weakness in the interface to block
devices that the block device cannot ask for the buffer to be guaranteed
to be stable for the duration of the write, but there is little real
need for that and it would probably be fairly hard to implement both
efficiently and generally.

A filesystem is well placed to do this sort of thing and it is quite
likely that BTRFS does something appropriate to ensure that the block
checksums it creates are reliable.
All the filesystem needs to do is forcibly unmap the page from any
process address space and make sure it doesn't get remapped or otherwise
modified until the write completes.

The (c) option is actually the most likely to cause inconsistencies.
If a page is modified while being written out to swap, the swap
system will effectively forget that it ever tried to write it, so
any inconsistency is likely to remain (but never be read, so there
is no problem).
With a filesystem, if the page is changed while being written, it is
very likely that the filesystem will try to write the page to the same
location again, thus fixing the inconsistency.

When suspend-to-disk writes to swap, it stops all changes from happening
and then writes the data and waits for it to complete, so you will never
find inconsistencies in blocks on swap that actually contain a
suspend-to-disk image.

NeilBrown



>
>> This means that, yes, there could be mismatches, without
>> any warning, from other sources than disks.
>> And these could be anywhere in the system.
>> I already mentioned, time ago, a cabling problem which was
>> leading to a similar result: wrong data on different disks,
>> without any warning or error from the HW layer.
>>
>> That is why it is important to know *where* the mismatch
>> occurs and, if possible, in which device component.
>> If it is an empty part of the FS, no problem, if it
>> belongs to a specific file, then it would be possible
>> to restore/recreate it.
>>
>> Of course, a tool will be needed telling which file is
>> using a certain block of the device.
>>
>
> There are tools which claim to do that, or list blocks used in a given
> file, which is not nearly as useful, but easier to do.
>
> --
> Bill Davidsen <davidsen@tmr.com>
>   "We can't solve today's problems by using the same thinking we
>    used in creating them." - Einstein
>


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-09 21:14                 ` NeilBrown
@ 2009-11-09 21:54                   ` Piergiorgio Sartor
  2009-11-10  0:17                     ` NeilBrown
  0 siblings, 1 reply; 58+ messages in thread
From: Piergiorgio Sartor @ 2009-11-09 21:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Piergiorgio Sartor, Peter Rabbitson, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

Well...

> Is this an offer to submit a patch ?? :-)

almost, I was looking into RAID-6 for this, but unfortunately
it seems I'll need external manpower too... :-)

> I disagree.  You do need a model.  The particular features of the
> model would be the weight and wind-resistance of the person so that
> you can estimate what extra wind resistance is needed to reduce terminal
> velocity such that the impact will be something that the person's
> legs can absorb.  So you also need the model to describe the legs
> in enough detail so that a suitable target terminal velocity can
> be determined.

Well, sorry, but IMHO this is needed only when you design
the parachute, not when you jump out of the plane.

It seems that here some people, including me, would have
found useful such a feature.
For example I have a RAID-10 which shows a mismatch_cnt of
256, but everything seems to work fine.
The disks are new, no SMART errors or anything else.
Where the mismatch belongs I do not know.
What should I do? Try to fill up the MD device and then
see if the mismatch is still there?
It would be much better to know which file, if any, is
affected and then take the proper countermeasures.

At the moment, since everything runs fine, I do not dare
to start a resync, since it will not be better than
leaving things as they are right now.
I am hoping that some file creation or similar will fix
the mismatch.
Or do you have a better option?

> If we proactively hand out parachutes that can just barely land a
> small dog safely, then we aren't doing any people any favours,
> and probably are making their situation less safe because they are
> more likely to take a risk in the belief that their parachute
> will protect them - which it might not.

Do not overstretch the example.
The parachute, in the MD case, will not remove any risk;
it will simply help people to better manage damage that
might occur for any reason, including SW bugs.

I mean, will you swear that the actual RAID software will
never cause, on its own, a mismatch between disks?
I guess not.
So, why not give a mechanism to enable the user to look
further into mismatches and be able to take proper action?

> Certainly manpower is an issue - and it is pointless spending it
> on something that you think sounds nice, but have no evidence that it
> will actually address a real need.

It seems some people, here, have this need.
So, it is real.

I am not the only one here asking for features like returning
the block address of a mismatch or triggering a *proper*
repair instead of a random one.

Frankly speaking, the whole resync/repair concept is, at the
moment, a waste of manpower (when it was done), since repairing
a RAID or not does not change the underlying situation.
It just sets the mismatch_cnt to zero, but if an error is
present there are good chances it will still be there.
And this is the problem: after the resync people will *feel*
secure, people *feel* safe (because there is a "repair"),
but in the end the risk is simply increased (as per your
example about the dog parachute).

Again, manpower is always an issue and priorities are needed,
of course, but what if we vote, here, for such a feature and
then it turns out it is "most wanted"?

Having written that, since complaining alone does not help, how
should I proceed if I wanted to print the MD block address
of a mismatch? Which source code file would be most sensible
to look into?

Thanks for your attention and sorry for the rant,

P.S.: I like very much the MD thing, that's the reason
why I would like to see it improved.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-07  0:41 mismatch_cnt again Eyal Lebedinsky
  2009-11-07  1:53 ` berk walker
@ 2009-11-09 22:03 ` Eyal Lebedinsky
  1 sibling, 0 replies; 58+ messages in thread
From: Eyal Lebedinsky @ 2009-11-09 22:03 UTC (permalink / raw)
  To: linux-raid list

Thanks everyone,

I wish to narrow down the issue to my question
	Are there situations known to cause this without an actual hardware failure?

Meaning, are there known *software* issues with this configuration
	2.6.30.5-43.fc11.x86_64, ext3, raid5, sata, Adaptec 1430SA
that can lead to a mismatch?

It is not root, not swap, has weekly smartd scans and weekly (different days) raid
'check's. The only report is a growing mismatch_cnt.

I noted the raid1 as mentioned in the thread.

cheers
	Eyal

Eyal Lebedinsky wrote:
> For years I found the mismatch_cnt rising regularly every few weeks and 
> could
> never relate it to any events.
> 
> I since replaced the computer, installed fedora 11 (was very old debian)
> and only kept the array itself (ext3 on 5x1TB raid5). I had the raid
> 'repair'ed to get it to mismatch_cnt=0.
> 
> I thought that I saw the last of these. I had a good run for almost three
> months, then last week I saw the first mismatch_cnt=184. It was still so
> on this weekly 'check'.
> 
> I cannot see any bad event logged.
> 
> Are there situations known to cause this without an actual hardware 
> failure?
> I know that this came up in the past (often) but I see little recent
> discussion and wonder what the current status is.
> 
> For the last 6 weeks (my uptime) the machine runs
>     2.6.30.5-43.fc11.x86_64 #1 SMP
> 
> The raid holds data (no root or swap) used mostly as DVR (nothing heavy).
> smartd checks each week and so far no errors. The disks are modern 1yo
> "SAMSUNG HD103UJ".
> 
> TIA

-- 
Eyal Lebedinsky	(eyal@eyal.emu.id.au)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-09 21:54                   ` Piergiorgio Sartor
@ 2009-11-10  0:17                     ` NeilBrown
  2009-11-10  9:09                       ` Peter Rabbitson
                                         ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: NeilBrown @ 2009-11-10  0:17 UTC (permalink / raw)
  Cc: Piergiorgio Sartor, Peter Rabbitson, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

On Tue, November 10, 2009 8:54 am, Piergiorgio Sartor wrote:
> Well...
>
>> Is this an offer to submit a patch ?? :-)
>
> almost, I was looking into RAID-6 for this, but unfortunately
> it seems I'll need external manpower too... :-)
>
>> I disagree.  You do need a model.  The particular features of the
>> model would be the weight and wind-resistance of the person so that
>> you can estimate what extra wind resistance is needed to reduce terminal
>> velocity such that the impact will be something that the person's
>> legs can absorb.  So you also need the model to describe the legs
>> in enough detail so that a suitable target terminal velocity can
>> be determined.
>
> Well, sorry, but IMHO this is needed only when you design
> the parachute, not when you jump out of the plane.
>
> It seems that some people here, including me, would have
> found such a feature useful.
> For example I've a RAID-10 which shows a mismatch_cnt of
> 256, but everything seems to work fine.
> The disks are new, no SMART errors or anything else.
> Where the mismatch belongs, I do not know.
> What should I do? Try to fill up the MD device and then
> see if the mismatch is still there?
> It would be much better to know which file, if any, is
> affected and then take the proper countermeasures.
>


It seems we might have been talking at cross-purposes.

When I wrote about the need for a threat model, it was in the
context of automatically determining which block was most
likely to be in error (e.g. voting with a 3-drive RAID1 or
fancy arithmetic with RAID6).  I do not believe there is any
value in doing that.  At least not automatically in the kernel
with the aim of just repairing which block was decided to be
most wrong.

You now seem to be talking about the ability to find out which
blocks are inconsistent.  That is very different.  I do agree there
is value in that.  Maybe it should appear in the kernel logs,
or maybe we could store the information and report it via sysfs
(the former would certainly be easier).

I would be very happy to accept a patch which logged this
information - providing it was careful not to overly spam the logs if there
were lots and lots of errors.  I may even write one myself.




> At the moment, since everything runs fine, I do not dare
> to start a resync, since it would not be better than
> leaving things as they are right now.
> I am hoping that some file creation or similar will fix
> the mismatch.
> Or do you have a better option?

It is possible that a resync could improve
the situation.  Having a block that will sometimes read with
one value and sometimes with a different value could easily
confuse something - particularly a filesystem.

I would probably run a 'repair' to fix the difference, but that
isn't firm advice.  It is quite probable that the block is not
actively in use and so the inconsistency will never be noticed.


>
>> If we proactively hand out parachutes that can just barely land a
>> small dog safely, then we aren't doing any people any favours,
>> and probably are making their situation less safe because they are
>> more likely to take a risk in the belief that their parachute
>> will protect them - which it might not.
>
> Do not overstretch the example.
> The parachute, in the MD case, will not remove any risk;
> it will simply help people better manage damage that might
> occur for any reason, including SW bugs.
>
> I mean, will you swear that the actual RAID software will
> never cause, on its own, a mismatch between disks?
> I guess not.
> So, why not give a mechanism that enables the user to look
> further into mismatches and take proper action?
>
>> Certainly manpower is an issue - and it is pointless spending it
>> on something that you think sounds nice, but have no evidence that it
>> will actually address a real need.
>
> It seems some people here have this need, so it is real.
>
> I am not the only one asking for features like returning
> the block addresses behind the mismatch count, or triggering a *proper*
> repair instead of a random one.
>
> Frankly speaking, the whole resync/repair concept is, at the
> moment, a waste of manpower (when it is run), since repairing
> or not repairing a RAID does not change the underlying situation.
> It just sets the mismatch_cnt to zero, but if an error is
> present there are good chances it will still be there.
> And this is the problem: after the resync people will *feel*
> secure, people *feel* safe (because there was a "repair"),
> but in the end the risk is simply increased (as per your
> dog-parachute example).

check/repair is primarily about reading every block on every device,
and being ready to cope with read errors by overwriting with the
correct data.  This is known as scrubbing, I believe.
I would normally just 'repair' every month or so.  If there are
discrepancies I would like them reported and fixed.  If they happen
often on a non-swap partition, I would like to know about it; otherwise
I would rather they were just fixed.
'check' largely exists because it was trivial to implement given
that 'repair' was being implemented, and it could conceivably be useful,
e.g. you have assembled an array read-only as you aren't at all sure the
disks should form an array.  You run a 'check' to increase your
confidence that all is OK without risking any change to any data in case
you put the array together badly.
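
For concreteness, a scrub can be driven entirely from user space via
sysfs.  A minimal sketch (the sync_action and mismatch_cnt paths are
the documented md ones; 'md0', running as root, and the lack of
polling for completion are all simplifications):

#include <stdio.h>

static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	char buf[32];
	FILE *f;

	/* Start a scrub; write "repair" instead to also rewrite differences. */
	if (write_sysfs("/sys/block/md0/md/sync_action", "check\n")) {
		perror("sync_action");
		return 1;
	}

	/* Real code would poll sync_action here until it reads "idle". */

	f = fopen("/sys/block/md0/md/mismatch_cnt", "r");
	if (!f) {
		perror("mismatch_cnt");
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("mismatch_cnt: %s", buf);
	fclose(f);
	return 0;
}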


>
> Again, manpower is always an issue and priorities are needed,
> of course, but what if we vote, here, for such a feature and
> it turns out to be "most wanted"?
>
> That written, since complaining alone does not help: how should
> I proceed if I want to print the MD block address of a mismatch?
> Which source code file would be most sensible to look into?

drivers/md/raid1.c for RAID1
drivers/md/raid5.c for RAID4/RAID5/RAID6

Look for where the resync_mismatches field is updated.
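
As an untested, illustrative fragment (not a patch), logging at that
spot in raid1.c could look roughly like this; the identifiers follow
the raid1 code of this era, but treat every name as an assumption:

	if (printk_ratelimit())
		printk(KERN_WARNING
		       "md/raid1:%s: check found mismatch at sector %llu (%d sectors)\n",
		       mdname(mddev),
		       (unsigned long long)r1_bio->sector, r1_bio->sectors);
	mddev->resync_mismatches += r1_bio->sectors;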


>
> Thanks for your attention, and sorry for the rant,
>
> P.S.: I like the MD thing very much; that's the reason
> why I would like to see it improved.
>

Thanks for your interest!

NeilBrown


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10  0:17                     ` NeilBrown
@ 2009-11-10  9:09                       ` Peter Rabbitson
  2009-11-10 14:03                         ` Martin K. Petersen
  2009-11-10 19:52                       ` Piergiorgio Sartor
  2009-11-12 22:57                       ` Bill Davidsen
  2 siblings, 1 reply; 58+ messages in thread
From: Peter Rabbitson @ 2009-11-10  9:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

NeilBrown wrote:
> On Tue, November 10, 2009 8:54 am, Piergiorgio Sartor wrote:
> <snip>
> 
> check/repair is primarily about reading every block on every device,
> and being ready to cope with read errors by overwriting with the
> correct data.  This is known as scrubbing, I believe.
> I would normally just 'repair' every month or so.  If there are
> discrepancies I would like them reported and fixed.  If they happen
> often on a non-swap partition, I would like to know about it; otherwise
> I would rather they were just fixed.

Bingo - and according to the list archive many of us are getting mismatches
without swap anywhere near the raid in question. The current situation is
more akin to "Ok folks, get in the plane, we're deploying in 2 hours, and
btw your chute is not going to open and there is nothing you can do about
it". How is that for a threat model :)

Please someone step in and add *some* sort of reporting about which
particular blocks are screwed, so a user can figure out which data is
(or is about to be) lost.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10  9:09                       ` Peter Rabbitson
@ 2009-11-10 14:03                         ` Martin K. Petersen
  2009-11-12 22:40                           ` Bill Davidsen
  0 siblings, 1 reply; 58+ messages in thread
From: Martin K. Petersen @ 2009-11-10 14:03 UTC (permalink / raw)
  To: Peter Rabbitson
  Cc: NeilBrown, Piergiorgio Sartor, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

>>>>> "Peter" == Peter Rabbitson <rabbit+list@rabbit.us> writes:

Peter> Bingo - and according to the list archive many of us are getting
Peter> mismatches without swap anywhere near the raid in question. The
Peter> current situation is more akin to "Ok folks get in the plane,
Peter> we're deploying in 2 hours, and btw your chute is not going to
Peter> open and there is nothing you can do about it" How is that for a
Peter> threat model :)

Way back we used to lock pages down entirely for I/O submission.  At
some point the writeback bit was introduced to gate the page during the
actual (physical) write operation only.  That made locking trickier and
not all filesystems correctly adapted to this.  ext[234] in particular
have issues of varying degrees, somewhat amplified by their use of
buffer_heads to track buffers instead of pages.  See the recent thread
about corruption with ext4 in 2.6.32+ for examples of this.

It's not just RAID consistency that breaks.  In the ext4 case above we
end up with garbled blocks being written to a single drive.

Add data integrity protection to the mix (btrfs, DIX) and all hell
breaks loose if you change the buffer after the checksum has been
generated.  So while modifying pages in flight has kinda-sorta worked
for a while (i.e. the window of error is small) it's something we'll
simply have to stop doing to support new features in the storage stack.
You'll be glad to know there's discussion about merging the debug patch
(which marks pages read-only during writeback) into ext4.

FWIW, XFS and btrfs both use the page writeback bit correctly and never
change a page while it is undergoing I/O.
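
In outline, the gating pattern is (a simplified, illustrative fragment;
the real call sites live in each filesystem's writepage path):

	set_page_writeback(page);   /* page is now under I/O */
	unlock_page(page);
	submit_bio(WRITE, bio);     /* data must stay stable until completion */

	/* ... and in the bio completion handler: */
	end_page_writeback(page);   /* page may be modified again */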

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-09 21:50                 ` NeilBrown
@ 2009-11-10 18:05                   ` Bill Davidsen
  2009-11-10 22:17                     ` Peter Rabbitson
  2009-11-13  2:15                     ` Neil Brown
  0 siblings, 2 replies; 58+ messages in thread
From: Bill Davidsen @ 2009-11-10 18:05 UTC (permalink / raw)
  To: NeilBrown
  Cc: Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

NeilBrown wrote:
> On Tue, November 10, 2009 5:22 am, Bill Davidsen wrote:
>   
>> Piergiorgio Sartor wrote:
>>     
>>> Hi,
>>>
>>>
>>>       
>>>> But unless your drive firmware is broken the drive will only ever give
>>>> the correct data or an error. Smart has a counter for blocks that have
>>>> gone bad and will be fixed pending a write to them:
>>>> Current_Pending_Sector.
>>>>
>>>> The only way the drive should be able to give you bad data is if
>>>> multiple bits toggle in such a way that the ECC still fits.
>>>>
>>>>         
>>> Not really, I have disks which are *perfect* in the smart sense
>>> and nevertheless I had a mismatch count.
>>> This was a SW problem, I think now fixed, in the RAID-10 code.
>>>
>>>
>>>       
>> IIRC there still is an error in raid-1 code, in that data is written to
>> multiple drives without preventing modification of the memory between
>> writes. As I understand Neil's explanation, this happens (a) when memory
>> is being changed rapidly and frequently via memory mapped files, or (b)
>> writing via O_DIRECT, or (c) when raid-1 is being used for swap. I'm not
>> totally sure why the last one, but I have always seem mismatches on swap
>> in a system which is actually swapping. What is more troubling is that
>> if I do a hibernate, which writes to swap, and then force a boot from
>> other media to a Live-CD, doing a check of the swap array occasionally
>> shows a mismatch. That doesn't give me a secure feeling, although I have
>> never had an issue in practice, I was just curious.
>>     
>
> I don't think this is really an error in the RAID1 code.
> The only thing that the RAID1 code could do differently is make a local
> copy of the data and then write that to all of the devices (a bit like
> RAID5 does so it can generate a parity block reliably).
> Doing this would introduce a performance penalty with no real
> benefit (the only benefit would be to stop long email threads about
> mismatch_cnt :-)
>
>   
After thinking about it, I agree that "limitation" would be a more
accurate term. Apologies. This is one of the few reasons to consider
hardware raid. By writing all copies of the data from a single cache
buffer in the controller, they are always consistent, and they only take
up the memory-bus bandwidth needed to transfer the initial data to the
controller.

Of course, unless the cache on the controller is really large it can
become a choke point, adds controller firmware as a failure point, adds
to the cost... so I regard hardware raid as useful only when it
justifies spending big bucks to get a really good controller.

> You could possibly argue that it is a weakness in the interface to block
> devices that the block device cannot ask for the buffer to be guaranteed
> to be stable for the duration of the write, but as there is little real
> need for that and it would probably be fairly hard to implement both
> efficiently and generally.
>
>   
The raid code would need its own copy of the data in a private buffer,
or would have to mark the written memory as copy-on-write. I suspect the
second is far more efficient, but I have no idea how hard it would be to
implement.

> A filesystem is well placed to do this sort of thing and it is quite
> likely that BTRFS does something appropriate to ensure that the block
> checksums it creates are reliable.
> All the filesystem needs to do is forcibly unmap the page from any
> process address space and make sure it doesn't get remapped or otherwise
> modified until the write completes.
>
>   
That sounds like a lot more overhead than just making the page COW for
the duration, since only a very small number of writes ever actually
get changed.  No easy answer, but at least the filesystem can align the
buffers in a reasonable way.
> The (c) option is actually the most likely to cause inconsistencies.
> If a page is modified while being written out to swap, the swap
> system will effectively forget that it ever tried to write it, so
> any inconsistency is likely to remain (but never be read, so there
> is no problem).
> With a filesystem, if the page is changed while being written, it is
> very likely that the filesystem will try to write the page to the same
> location again, thus fixing the inconsistency.
>
>   
Well, I do get a *ton* of mismatches in swap; I just ran a check and got
12032 in the mismatch count. Another raid1 on partitions of the same
drives showed 128, which still bothers me, since /boot hasn't changed in
months.
> When suspend-to-disk writes to swap, it stops all changes from happening
> and then writes the data and waits for it to complete, so you will never
> find inconsistencies in blocks on swap that actually contain a
> suspend-to-disk image.
>   

Then that's not an issue for restart, at least.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10  0:17                     ` NeilBrown
  2009-11-10  9:09                       ` Peter Rabbitson
@ 2009-11-10 19:52                       ` Piergiorgio Sartor
  2009-11-13  2:37                         ` Neil Brown
  2009-11-12 22:57                       ` Bill Davidsen
  2 siblings, 1 reply; 58+ messages in thread
From: Piergiorgio Sartor @ 2009-11-10 19:52 UTC (permalink / raw)
  To: NeilBrown
  Cc: Piergiorgio Sartor, Peter Rabbitson, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

Hi again,

> It seems we might have been talking at cross-purposes.
> 
> When I wrote about the need for a threat model, it was in the
> context of automatically determining which block was most
> likely to be in error (e.g. voting with a 3-drive RAID1 or
> fancy arithmetic with RAID6).  I do not believe there is any
> value in doing that.  At least not automatically in the kernel
> with the aim of just repairing which block was decided to be
> most wrong.
> 
> You now seem to be talking about the ability to find out which
> blocks are inconsistent.  That is very different.  I do agree there
> is value in that.  Maybe it should appear in the kernel logs,
> or maybe we could store the information and report it via sysfs
> (the former would certainly be easier).

maybe there is a misunderstanding between us! :-)

Automatic repair *might* be a far end target, but I do
agree, this needs to be clarified deeply.

I see the thing similarly to a previous comment from a
fellow poster.
To do:
1) detect which MD block is inconsistent
2) detect, when possible, which device component is responsible
3) trigger a repair action

This would be done all under user control, i.e. the user
will get the mismatch count, maybe with some hint on which
device could be guilty (RAID-6 or RAID-1/10 with multiple
redundancy) and then he could decide what to do.

The user will have full control and full *responsibility*
on the action, but it will also be fully informed on what
the situation is.

The system will tell: block ABC is inconsistent, maybe
device /dev/sdX is guilty, you could: do nothing, resync
the parity, try to repair.

> I would be very happy to accept a patch which logged this
> information - providing it was careful not to overly spam the logs if there
> were lots and lots of errors.  I may even write one myself.

I could try to have a look into it, time permitting.

[mismatch_cnt=256]
> I would probably run a 'repair' to fix the difference, but that
> isn't firm advice.  It is quite probable that the block is not
> actively in use and so the inconsistency will never be noticed.

Exactly, that's why knowing *where*
the issue is would already help a lot!
 
> check/repair is primarily about reading every block on every device,
> and being ready to cope with read errors by overwriting with the
> correct data.  This is known as scrubbing, I believe.
> I would normally just 'repair' every month or so.  If there are
> discrepancies I would like them reported and fixed.  If they happen
> often on a non-swap partition, I would like to know about it; otherwise
> I would rather they were just fixed.
> 'check' largely exists because it was trivial to implement given
> that 'repair' was being implemented, and it could conceivably be useful,
> e.g. you have assembled an array read-only as you aren't at all sure the
> disks should form an array.  You run a 'check' to increase your
> confidence that all is OK without risking any change to any data in case
> you put the array together badly.

As I mentioned some time ago, I built a RAID-6 where
one disk, due to a strange cabling problem, was sometimes
returning wrong data (one bit flip, actually).
And this without any errors reported, i.e. a bit was
sometimes flipped, at the very end of the chain it seems, and it
was undetected by ECC/CRC/whatever.

This was noticed by the "check", so I ran a "repair", which,
of course, did more damage...

What I did was run a 'check' with one device at a time failed
(and then re-added, of course) on a read-only MD device.

I was able to find the guilty disk and to fix the array
for good!

Now, this was a really lengthy process; I would have
preferred to have it done automatically and then get
a report on which device *could* be the responsible one.

I agree with you that an automatic repair would not have
been the right choice, without first knowing what
was going on.

> drivers/md/raid1.c for RAID1
> drivers/md/raid5.c for RAID4/RAID5/RAID6
> 
> Look for where the resync_mismatches field is updated.

Thanks, I'll try to have a look!
 
bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10 18:05                   ` Bill Davidsen
@ 2009-11-10 22:17                     ` Peter Rabbitson
  2009-11-13  2:15                     ` Neil Brown
  1 sibling, 0 replies; 58+ messages in thread
From: Peter Rabbitson @ 2009-11-10 22:17 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: NeilBrown, Piergiorgio Sartor, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

Bill Davidsen wrote:
> Well, I do get a *ton* of mismatches in swap, I just ran a check and got
> 12032 in the mismatch count. Another raid1 on partitions of the same
> drives showed 128, which still bothers me, since /boot hasn't changed in
> months.

I can answer the /boot part (I did a hexdiff on all the raid members to
come up with this): http://marc.info/?l=linux-raid&m=120988628322707&w=2

Cheers

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10 14:03                         ` Martin K. Petersen
@ 2009-11-12 22:40                           ` Bill Davidsen
  2009-11-13 17:12                             ` Martin K. Petersen
  0 siblings, 1 reply; 58+ messages in thread
From: Bill Davidsen @ 2009-11-12 22:40 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Peter Rabbitson, NeilBrown, Piergiorgio Sartor,
	Goswin von Brederlow, Doug Ledford, Michael Evans,
	Eyal Lebedinsky, linux-raid list

Martin K. Petersen wrote:
>>>>>> "Peter" == Peter Rabbitson <rabbit+list@rabbit.us> writes:
>>>>>>             
>
> Peter> Bingo - and according to the list archive many of us are getting
> Peter> mismatches without swap anywhere near the raid in question. The
> Peter> current situation is more akin to "Ok folks get in the plane,
> Peter> we're deploying in 2 hours, and btw your chute is not going to
> Peter> open and there is nothing you can do about it" How is that for a
> Peter> threat model :)
>
> Way back we used to lock pages down entirely for I/O submission.  At
> some point the writeback bit was introduced to gate the page during the
> actual (physical) write operation only.  That made locking trickier and
> not all filesystems correctly adapted to this.  ext[234] in particular
> have issues of varying degrees, somewhat amplified by their use of
> buffer_heads to track buffers instead of pages.  See the recent thread
> about corruption with ext4 in 2.6.32+ for examples of this.
>
> It's not just RAID consistency that breaks.  In the ext4 case above we
> end up with garbled blocks being written to a single drive.
>
> Add data integrity protection to the mix (btrfs, DIX) and all hell
> breaks loose if you change the buffer after the checksum has been
> generated.  So while modifying pages in flight has kinda-sorta worked
> for a while (i.e. the window of error is small) it's something we'll
> simply have to stop doing to support new features in the storage stack.
> You'll be glad to know there's discussion about merging the debug patch
> (which marks pages read-only during writeback) into ext4.
>
> FWIW, XFS and btrfs both use the page writeback bit correctly and never
> change a page while it is undergoing I/O.
>
>   
That's necessary but not sufficient. To be done correctly it must be
protected by md as well. This is because arrays are used without a
filesystem by some applications, such as swap and databases, to name the
most common cases. Data simply can't be correct on the drive if it is
allowed to change between the write system call and arrival on the
media, even more so if a CRC or mirror is involved.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10  0:17                     ` NeilBrown
  2009-11-10  9:09                       ` Peter Rabbitson
  2009-11-10 19:52                       ` Piergiorgio Sartor
@ 2009-11-12 22:57                       ` Bill Davidsen
  2 siblings, 0 replies; 58+ messages in thread
From: Bill Davidsen @ 2009-11-12 22:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Piergiorgio Sartor, Peter Rabbitson, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

NeilBrown wrote:
> On Tue, November 10, 2009 8:54 am, Piergiorgio Sartor wrote:
>   
>> Well...
>>
>>     
>>> Is this an offer to submit a patch ?? :-)
>>>       
>> almost, I was looking into RAID-6 for this, but unfortunately
>> it seems I'll need external manpower too... :-)
>>
>>     
>>> I disagree.  You do need a model.  The particular features of the
>>> model would be the weight and wind-resistance of the person so that
>>> you can estimate what extra wind resistance is needed to reduce terminal
>>> velocity such that the impact will be something that the person's
>>> legs can absorb.  So you also need the model to describe the legs
>>> in enough detail so that a suitable target terminal velocity can
>>> be determined.
>>>       
>> Well, sorry, but IMHO this is needed only when you design
>> the parachute, not when you jump out of the plane.
>>
>> It seems that some people here, including me, would have
>> found such a feature useful.
>> For example I've a RAID-10 which shows a mismatch_cnt of
>> 256, but everything seems to work fine.
>> The disks are new, no SMART errors or anything else.
>> Where the mismatch belongs, I do not know.
>> What should I do? Try to fill up the MD device and then
>> see if the mismatch is still there?
>> It would be much better to know which file, if any, is
>> affected and then take the proper countermeasures.
>>
>>     
>
>
> It seems we might have been talking at cross-purposes.
>
> When I wrote about the need for a threat model, it was in the
> context of automatically determining which block was most
> likely to be in error (e.g. voting with a 3-drive RAID1 or
> fancy arithmetic with RAID6).  I do not believe there is any
> value in doing that.  At least not automatically in the kernel
> with the aim of just repairing which block was decided to be
> most wrong.
>   

And on this point I continue to believe you are not going in the
wrong direction, but riding the wrong horse. What is the value of having
a 'repair' operation in the kernel if it makes no effort to fix the
problem, but instead hides it, picks one possible value for the
contents and writes it everywhere, perhaps because at least occasionally
the data will be correct? In the case of an N-way mirror with N>2, and with
raid-6, a "most likely" value can be identified, and from data already in
memory! And the tests appear to be possible by calling code which is
already used either for recovery on actual drive error or to generate P
and Q values.

To suggest doing it in a non-kernel solution is to say it shouldn't be
done. The problems being discussed with timing, protecting data from
changing, etc., all become worse when trying to do this via system calls
instead of diddling the locks and I/O queues using the existing kernel code.

The argument that such repair would not be guaranteed correct in all 
cases is true, but given that the current code is guaranteed to be wrong 
a significant percentage of the time, how could taking the obvious steps 
not be better?

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10 18:05                   ` Bill Davidsen
  2009-11-10 22:17                     ` Peter Rabbitson
@ 2009-11-13  2:15                     ` Neil Brown
  1 sibling, 0 replies; 58+ messages in thread
From: Neil Brown @ 2009-11-13  2:15 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

On Tuesday November 10, davidsen@tmr.com wrote:
> NeilBrown wrote:
> 
> > You could possibly argue that it is a weakness in the interface to block
> > devices that the block device cannot ask for the buffer to be guaranteed
> > to be stable for the duration of the write, but as there is little real
> > need for that and it would probably be fairly hard to implement both
> > efficiently and generally.
> >
> >   
> The raid code would need it's own copy of the data in a private buffer, 
> or would have to mark the write memory as copy on write. I suspect the 
> 2nd if far more efficient, but I have no idea how hard it would be to 
> implement.

Copy-on-write is not actually possible for md to enforce - it is at
the wrong layer and knows nothing about who owns the page or how or
where it is mapped.
A filesystem can impose copy-on-write; a block device cannot.
I gather from odd comments that I have seen that copy-on-write is
rather expensive.  Marking a thousand contiguous pages copy-on-write
is much faster than copying one thousand pages.  Making a single page
copy-on-write may not be much faster than copying the page.
However I'm not 100% certain of these details.

Maybe if the filesystem could set a flag in the bio saying "this page
will not  change until the write completes", then md could optimise
that case and do copies in other cases...
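
Something like the following, where BIO_STABLE_WRITE and both helper
functions are invented names, purely for illustration:

	if (bio_flagged(bio, BIO_STABLE_WRITE)) {
		/* caller promises the pages stay stable until completion */
		raid1_write_to_all_mirrors(bio);
	} else {
		/* no promise given: snapshot the pages first, then write */
		struct bio *copy = raid1_clone_bio_with_copied_pages(bio);

		raid1_write_to_all_mirrors(copy);
	}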

NeilBrown

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-10 19:52                       ` Piergiorgio Sartor
@ 2009-11-13  2:37                         ` Neil Brown
  2009-11-13  5:30                           ` Goswin von Brederlow
                                             ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Neil Brown @ 2009-11-13  2:37 UTC (permalink / raw)
  To: Piergiorgio Sartor
  Cc: Peter Rabbitson, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

On Tuesday November 10, piergiorgio.sartor@nexgo.de wrote:
> Hi again,
> 
> > It seems we might have been talking at cross-purposes.
> > 
> > When I wrote about the need for a threat model, it was in the
> > context of automatically determining which block was most
> > likely to be in error (e.g. voting with a 3-drive RAID1 or
> > fancy arithmetic with RAID6).  I do not believe there is any
> > value in doing that.  At least not automatically in the kernel
> > with the aim of just repairing which block was decided to be
> > most wrong.
> > 
> > You now seem to be talking about the ability to find out which
> > blocks are inconsistent.  That is very different.  I do agree there
> > is value in that.  Maybe it should appear in the kernel logs,
> > or maybe we could store the information and report it via sysfs
> > (the former would certainly be easier).
> 
> maybe there is a misunderstanding between us! :-)
> 
> Automatic repair *might* be a far end target, but I do
> agree, this needs to be clarified deeply.
> 
> I see the thing similarly to a previous comment from a
> fellow poster.
> To do:
> 1) detect which MD block is inconsistent
> 2) detect, when possible, which device component is responsible
> 3) trigger a repair action
> 
> This would be done all under user control, i.e. the user
> will get the mismatch count, maybe with some hint on which
> device could be guilty (RAID-6 or RAID-1/10 with multiple
> redundancy) and then he could decide what to do.
> 
> The user will have full control and full *responsibility*
> on the action, but it will also be fully informed on what
> the situation is.
> 
> The system will tell: block ABC is inconsistent, maybe
> device /dev/sdX is guilty, you could: do nothing, resync
> the parity, try to repair.

I think just "block ABC is inconsistent" is sufficient.
User-space can then quiesce that part of the array, read the relevant
blocks, do any analysis that might be appropriate, and report to the
admin. 
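
For example, the "read the relevant blocks" step for a two-disk RAID1
could be a user-space sketch like this (device names, block size and
offset are illustrative; it assumes 0.90 superblocks, which sit at the
end of each member, so member data starts at offset zero):

#define _XOPEN_SOURCE 500
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

int main(void)
{
	const char *dev[2] = { "/dev/sda2", "/dev/sdb2" };
	off_t off = (off_t)12345 * BLK;   /* block reported inconsistent */
	unsigned char buf[2][BLK];
	int i;

	for (i = 0; i < 2; i++) {
		int fd = open(dev[i], O_RDONLY);

		if (fd < 0 || pread(fd, buf[i], BLK, off) != BLK) {
			perror(dev[i]);
			return 1;
		}
		close(fd);
	}
	printf("copies %s\n",
	       memcmp(buf[0], buf[1], BLK) ? "DIFFER" : "match");
	return 0;
}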

> 
> As I mentioned some times ago, I built a RAID-6, where
> one disk, due to a strange cabling problem, was sometimes
> returning wrong data (one bit flip, actually).
> And this without any errors reported, i.e. a bit was
> sometimes flipped, at the very end it seems, and it
> was undetected by ECC/CRC/whatever.

That is a very interesting threat scenario - occasional bit flip on
read between media and memory.  I had a drive like that once.  One
particular bit in the sector would fairly often return '1' no matter
what had been written.  I had it in a RAID1 and it quickly made a mess
of the filesystem.

As you say, there is nothing that md can or should do about this
except report that something odd is happening, which it does, and
report where it is happening, which it does not.

NeilBrown

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13  2:37                         ` Neil Brown
@ 2009-11-13  5:30                           ` Goswin von Brederlow
  2009-11-13  9:33                           ` Peter Rabbitson
  2009-11-15 21:05                           ` Piergiorgio Sartor
  2 siblings, 0 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-13  5:30 UTC (permalink / raw)
  To: Neil Brown
  Cc: Piergiorgio Sartor, Peter Rabbitson, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

Neil Brown <neilb@suse.de> writes:

> On Tuesday November 10, piergiorgio.sartor@nexgo.de wrote:
>> Hi again,
>> 
>> > It seems we might have been talking at cross-purposes.
>> > 
>> > When I wrote about the need for a threat model, it was in the
>> > context of automatically determining which block was most
>> > likely to be in error (e.g. voting with a 3-drive RAID1 or
>> > fancy arithmetic with RAID6).  I do not believe there is any
>> > value in doing that.  At least not automatically in the kernel
>> > with the aim of just repairing which block was decided to be
>> > most wrong.
>> > 
>> > You now seem to be talking about the ability to find out which
>> > blocks are inconsistent.  That is very different.  I do agree there
>> > is value in that.  Maybe it should appear in the kernel logs,
>> > or maybe we could store the information and report it via sysfs
>> > (the former would certainly be easier).
>> 
>> maybe there is a misunderstanding between us! :-)
>> 
>> Automatic repair *might* be a far end target, but I do
>> agree, this needs to be clarified deeply.
>> 
>> I see the thing similarly to a previous comment from a
>> fellow poster.
>> To do:
>> 1) detect which MD block is inconsistent
>> 2) detect, when possible, which device component is responsible
>> 3) trigger a repair action
>> 
>> This would be done all under user control, i.e. the user
>> will get the mismatch count, maybe with some hint on which
>> device could be guilty (RAID-6 or RAID-1/10 with multiple
>> redundancy) and then he could decide what to do.
>> 
>> The user will have full control and full *responsibility*
>> on the action, but it will also be fully informed on what
>> the situation is.
>> 
>> The system will tell: block ABC is inconsistent, maybe
>> device /dev/sdX is guilty, you could: do nothing, resync
>> the parity, try to repair.
>
> I think just "block ABC is inconsistent" is sufficient.
> User-space can then quiesce that part of the array, read the relevant
> blocks, do any analysis that might be appropriate, and report to the
> admin. 

It is a beginning. Eventually I would like to see the guilty device in
the log though. That way the log can be analysed quickly, and, for
example, a bad cable or failing drive will show up as always being the
guilty one. That only makes sense for 3+ mirrors or raid6, though.

The repair should also determine the likely faulty block and rewrite
that instead of picking a random one. So you already need a "who is to
blame" function. The logging and repair can share the code.

>> As I mentioned some times ago, I built a RAID-6, where
>> one disk, due to a strange cabling problem, was sometimes
>> returning wrong data (one bit flip, actually).
>> And this without any errors reported, i.e. a bit was
>> sometimes flipped, at the very end it seems, and it
>> was undetected by ECC/CRC/whatever.
>
> That is a very interesting threat scenario - occasional bit flip on
> read between media and memory.  I had a drive like that once.  One
> particular bit in the sector would fairly often return '1' no matter
> what had been written.  I had it in a RAID1 and it quickly made a mess
> of the filesystem.

I had an external raid enclosure that would flip bits in the block
number data was read from or written to. With the box alone, data
written to one file would suddenly appear in another file.

To make matters worse, 2 enclosures were combined in a software raid1,
giving the strangest errors. The file contents would randomly change
depending on which enclosure was used to read the data.

Those errors do happen from time to time and will keep happening.

> As you say, there is nothing that md can or should do about this
> except report that something odd is happening, which it does, and
> report where it is happening, which it does not.
>
> NeilBrown

Regards,
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13  2:37                         ` Neil Brown
  2009-11-13  5:30                           ` Goswin von Brederlow
@ 2009-11-13  9:33                           ` Peter Rabbitson
  2009-11-15 21:05                           ` Piergiorgio Sartor
  2 siblings, 0 replies; 58+ messages in thread
From: Peter Rabbitson @ 2009-11-13  9:33 UTC (permalink / raw)
  To: Neil Brown
  Cc: Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

Neil Brown wrote:
> On Tuesday November 10, piergiorgio.sartor@nexgo.de wrote:
>> Hi again,
>>
>>> It seems we might have been talking at cross-purposes.
>>>
>>> When I wrote about the need for a threat model, it was in the
>>> context of automatically determining which block was most
>>> likely to be in error (e.g. voting with a 3-drive RAID1 or
>>> fancy arithmetic with RAID6).  I do not believe there is any
>>> value in doing that.  At least not automatically in the kernel
>>> with the aim of just repairing which block was decided to be
>>> most wrong.
>>>
>>> You now seem to be talking about the ability to find out which
>>> blocks are inconsistent.  That is very different.  I do agree there
>>> is value in that.  Maybe it should appear in the kernel logs,
>>> or maybe we could store the information and report it via sysfs
>>> (the former would certainly be easier).
>> maybe there is a misunderstanding between us! :-)
>>
>> Automatic repair *might* be a far end target, but I do
>> agree, this needs to be clarified deeply.
>>
>> I see the thing similarly to a previous comment from a
>> fellow poster.
>> To do:
>> 1) detect which MD block is inconsistent
>> 2) detect, when possible, which device component is responsible
>> 3) trigger a repair action
>>
>> This would be done all under user control, i.e. the user
>> will get the mismatch count, maybe with some hint on which
>> device could be guilty (RAID-6 or RAID-1/10 with multiple
>> redundancy) and then he could decide what to do.
>>
>> The user will have full control and full *responsibility*
>> on the action, but it will also be fully informed on what
>> the situation is.
>>
>> The system will tell: block ABC is inconsistent, maybe
>> device /dev/sdX is guilty, you could: do nothing, resync
>> the parity, try to repair.
> 
> I think just "block ABC is inconsistent" is sufficient.
> User-space can then quiesce that part of the array, read the relevant
> blocks, do any analysis that might be appropriate, and report to the
> admin. 

Will there be an accompanying userspace tool to determine the physical
device addresses of the individual blocks representing an inconsistent MD
block? Is there any way the addresses of individual blocks can be reported
right there by the kernel? I.e. figuring out which physical blocks make
up a block in a raid -l 10 -n5 -pf3 is not an easy task, while the kernel
already knows what is where.

Cheers

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-12 22:40                           ` Bill Davidsen
@ 2009-11-13 17:12                             ` Martin K. Petersen
  2009-11-14 17:01                               ` Bill Davidsen
  2009-11-14 19:04                               ` Goswin von Brederlow
  0 siblings, 2 replies; 58+ messages in thread
From: Martin K. Petersen @ 2009-11-13 17:12 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Martin K. Petersen, Peter Rabbitson, NeilBrown,
	Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

>>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:

>> FWIW, XFS and btrfs both use the page writeback bit correctly and
>> never change a page while it is undergoing I/O.
>> 
>> 
Bill> That's necessary but not sufficient. To be done correctly it must
Bill> be protected by md as well. This is because arrays are used
Bill> without a filesystem by some applications, such as swap and
Bill> database, to name the most common cases. 

I agree that making MD RAID1 do a copy would be a quick fix.  But I
don't see any reason to encourage what is essentially sloppy behavior at
the top of the stack.  And then what if you stack MD/DM devices?  Does
each layer do a copy?  I think that gets murky pretty quickly.

I'd much rather fix the cases where the top layers are broken.  And as I
said there are several people working on this spurred by my work on the
data integrity extensions.

FWIW, databases on raw disk have gone out of fashion.  But it is true
that applications that do direct I/O need to avoid updating buffers in
flight.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13 17:12                             ` Martin K. Petersen
@ 2009-11-14 17:01                               ` Bill Davidsen
  2009-11-17  5:19                                 ` Martin K. Petersen
  2009-11-14 19:04                               ` Goswin von Brederlow
  1 sibling, 1 reply; 58+ messages in thread
From: Bill Davidsen @ 2009-11-14 17:01 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Peter Rabbitson, NeilBrown, Piergiorgio Sartor,
	Goswin von Brederlow, Doug Ledford, Michael Evans,
	Eyal Lebedinsky, linux-raid list

Martin K. Petersen wrote:
>>>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:
>>>>>>             
>
>   
>>> FWIW, XFS and btrfs both use the page writeback bit correctly and
>>> never change a page while it is undergoing I/O.
>>>
>>>
>>>       
> Bill> That's necessary but not sufficient. To be done correctly it must
> Bill> be protected by md as well. This is because arrays are used
> Bill> without a filesystem by some applications, such as swap and
> Bill> database, to name the most common cases. 
>
> I agree that making MD RAID1 do a copy would be a quick fix.  But I
> don't see any reason to encourage what is essentially sloppy behavior at
> the top of the stack.  And then what if you stack MD/DM devices?  Do
> each layer do a copy?  I think that gets murky pretty quickly.
>
>   
Which is why I suggested that the ideal implementation is COW, which in
most cases would need no copy unless someone attempted to modify the
pages. That requires some assumptions about how the buffers are
aligned vs. memory pages, and is hardware dependent to some extent.
It's not easy, but I never said it was. The question is whether it is
*required* in some places, as determined by (a) good practice if the
overhead is low, or (b) user option for "safe even if slow."

> I'd much rather fix the cases where the top layers are broken.  And as I
> said there are several people working on this spurred by my work on the
> data integrity extensions.
>
> FWIW, databases on raw disk have gone out of fashion.  But it is true
> that applications that do direct I/O need to avoid updating buffers in
> flight.
>   

They may have gone out of fashion for new applications (I'm not sure I
agree, but as a talking point), but there are tons of old apps which are
not going to be updated any time soon, and any number of libraries which
mmap stuff and affect multiple applications.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13 17:12                             ` Martin K. Petersen
  2009-11-14 17:01                               ` Bill Davidsen
@ 2009-11-14 19:04                               ` Goswin von Brederlow
  2009-11-17  5:22                                 ` Martin K. Petersen
  1 sibling, 1 reply; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-14 19:04 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bill Davidsen, Peter Rabbitson, NeilBrown, Piergiorgio Sartor,
	Goswin von Brederlow, Doug Ledford, Michael Evans,
	Eyal Lebedinsky, linux-raid list

"Martin K. Petersen" <martin.petersen@oracle.com> writes:

>>>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:
>
>>> FWIW, XFS and btrfs both use the page writeback bit correctly and
>>> never change a page while it is undergoing I/O.
>>> 
>>> 
> Bill> That's necessary but not sufficient. To be done correctly it must
> Bill> be protected by md as well. This is because arrays are used
> Bill> without a filesystem by some applications, such as swap and
> Bill> database, to name the most common cases. 
>
> I agree that making MD RAID1 do a copy would be a quick fix.  But I
> don't see any reason to encourage what is essentially sloppy behavior at
> the top of the stack.  And then what if you stack MD/DM devices?  Do
> each layer do a copy?  I think that gets murky pretty quickly.

Maybe, as a quick debug, the raid layer should make the page read-only
and then watch what faults trying to write to it.

> I'd much rather fix the cases where the top layers are broken.  And as I
> said there are several people working on this spurred by my work on the
> data integrity extensions.
>
> FWIW, databases on raw disk have gone out of fashion.  But it is true
> that applications that do direct I/O need to avoid updating buffers in
> flight.

Maybe a flag somewhere saying whether the data is safe from writes or
not. The default would be unsafe, and md copies. A filesystem that works
"right" sets the safe flag, as would md after copying. That way
anything lower in the stack (like another md) sees the flag set.

Regards,
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13  2:37                         ` Neil Brown
  2009-11-13  5:30                           ` Goswin von Brederlow
  2009-11-13  9:33                           ` Peter Rabbitson
@ 2009-11-15 21:05                           ` Piergiorgio Sartor
  2009-11-15 22:29                             ` Guy Watkins
  2 siblings, 1 reply; 58+ messages in thread
From: Piergiorgio Sartor @ 2009-11-15 21:05 UTC (permalink / raw)
  To: Neil Brown
  Cc: Piergiorgio Sartor, Peter Rabbitson, Goswin von Brederlow,
	Doug Ledford, Michael Evans, Eyal Lebedinsky, linux-raid list

Hi,

> I think just "block ABC is inconsistent" is sufficient.
> User-space can then quiesce that part of the array, read the relevant
> blocks, do any analysis that might be appropriate, and report to the
> admin. 

personally I think user space is good for
this kind of operation.

I think the point here is not whether this kind of
recovery should be in kernel space or not, but
to have this kind of recovery at all.

> That is a very interesting threat scenario - occasional bit flip on
> read between media and memory.  I had a drive like that once.  One
> particular bit in the sector would fairly often return '1' no matter
> what had been written.  I had it in a RAID1 and it quickly made a mess
> of the filesystem.

In my case, a further analysis showed that the "bits"
were always *written* correctly, but the reading
operation was, sometimes, flipping bits.

This was especially nasty because, without a "resync",
the array would have always been fine.
 
> As you say, there is nothing that md can or should do about this
> except report that something odd is happening, which it does, and
> report where it is happening, which it does not.

Well, md specifically may or may not have the infrastructure
to use the RAID-6 parity to correct this sort of issue.
Nevertheless, using the RAID-6 double parity, in user or
kernel space, is really a point in favour of software RAID.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: mismatch_cnt again
  2009-11-15 21:05                           ` Piergiorgio Sartor
@ 2009-11-15 22:29                             ` Guy Watkins
  2009-11-16  1:23                               ` Goswin von Brederlow
  2009-11-16  1:37                               ` Neil Brown
  0 siblings, 2 replies; 58+ messages in thread
From: Guy Watkins @ 2009-11-15 22:29 UTC (permalink / raw)
  To: 'Piergiorgio Sartor', 'Neil Brown'
  Cc: 'Peter Rabbitson', 'Goswin von Brederlow',
	'Doug Ledford', 'Michael Evans',
	'Eyal Lebedinsky', 'linux-raid list'

I have been following this issue some, and I think this could be a cause for
silent corruption on RAID5 and RAID6.  I don't think this has been
mentioned, if so, sorry.

If data blocks can be changed in memory before being written to disk, even
if the changed blocks are never needed again from the disk, the other
related blocks in the stripe are at risk.  If the parity blocks are
computed, then one data block in memory is changed, then the blocks are
written to disk, the parity will be wrong.  If a disk fails and is
re-added or replaced, the data block in that stripe will be computed
using the changed block, giving a now-corrupt value.  I am assuming the
stripe has some data blocks that hold needed data and at least one that
was not needed, that the unneeded block was changed before being written
to disk, and that the failed disk did not hold the changed block.

I have a hard time conveying my thoughts in text.  I hope you understand me.
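
Maybe a toy XOR example says it better (one-byte "blocks", made-up
values):

#include <stdio.h>

int main(void)
{
	unsigned char d0 = 0xAA, d1 = 0x55;
	unsigned char p  = d0 ^ d1;   /* parity computed and written: 0xFF */

	d1 = 0x56;   /* block changes in memory before it reaches the disk */

	/* The disk now holds d0=0xAA, d1=0x56, p=0xFF (stale).  Lose the
	 * d0 disk, and reconstruction returns p ^ d1, not the original,
	 * untouched d0: */
	printf("reconstructed d0 = 0x%02X (was 0xAA)\n", p ^ d1);
	return 0;
}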

Thanks for reading.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-15 22:29                             ` Guy Watkins
@ 2009-11-16  1:23                               ` Goswin von Brederlow
  2009-11-16  1:37                               ` Neil Brown
  1 sibling, 0 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-16  1:23 UTC (permalink / raw)
  To: Guy Watkins
  Cc: 'Piergiorgio Sartor', 'Neil Brown',
	'Peter Rabbitson', 'Goswin von Brederlow',
	'Doug Ledford', 'Michael Evans',
	'Eyal Lebedinsky', 'linux-raid list'

"Guy Watkins" <linux-raid@watkins-home.com> writes:

> I have been following this issue some, and I think this could be a cause for
> silent corruption on RAID5 and RAID6.  I don't think this has been
> mentioned, if so, sorry.
>
> If data blocks can be changed in memory before being written to disk, even if the
> data blocks that were changed were never needed again from the disk, the
> other related blocks in the stripe are at risk.  If the parity blocks are
> computed, then the 1 data block in memory is changed, then the blocks are
> written to disk, the parity would be wrong.  If a disk fails and is re-added
> or replaced, the data block in that stripe will be computed using the
> changed block giving a now corrupt value.  I am assuming the stripe has some
> data blocks that have needed data and at least 1 that was not needed, and
> that block that was not needed was changed before writing it to disk.  And
> the disk that failed did not have the block that had been changed.
>
> I have a hard time conveying my thought in text.  I hope you understand me.
>
> Thanks for reading.

In short, the block on the replaced disk will be wrong, and it won't be
the one that caused the mismatch. I.e. a second block gets broken.

Replace another disk and yet another block goes wrong, and so on.

Regards,
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-15 22:29                             ` Guy Watkins
  2009-11-16  1:23                               ` Goswin von Brederlow
@ 2009-11-16  1:37                               ` Neil Brown
  2009-11-16  5:21                                 ` Goswin von Brederlow
  1 sibling, 1 reply; 58+ messages in thread
From: Neil Brown @ 2009-11-16  1:37 UTC (permalink / raw)
  To: Guy Watkins
  Cc: 'Piergiorgio Sartor', 'Peter Rabbitson',
	'Goswin von Brederlow', 'Doug Ledford',
	'Michael Evans', 'Eyal Lebedinsky',
	'linux-raid list'

On Sun, 15 Nov 2009 17:29:17 -0500
"Guy Watkins" <linux-raid@watkins-home.com> wrote:

> I have been following this issue some, and I think this could be a
> cause for silent corruption on RAID5 and RAID6.  I don't think this
> has been mentioned, if so, sorry.

RAID1/RAID10 are very different from RAID5/RAID6

RAID1/RAID10 can get 'mismatches' due to the particular behaviour
of swap or filesystems.  However this doesn't matter (the blocks that
are inconsistent are of no interest to the filesystem).

RAID5/RAID6 is careful not to allow any mismatches to creep in
due to any particular filesystem or swap activity.  This is because,
as you say, those mismatches could be significant to the RAID
algorithm even though they might be of no interest to the filesystem.

mismatches can only occur in a RAID5/RAID6 due to a software bug
in the md/raid code, or due to 'hardware errors' (including of course
drive firmware errors etc).

NeilBrown


> 
> If data blocks can be changed in memory before being written to disk, even
> if the data blocks that were changed were never needed again from the
> disk, the other related blocks in the stripe are at risk.  If the
> parity blocks are computed, then the 1 data block in memory is
> changed, then the blocks are written to disk, the parity would be
> wrong.  If a disk fails and is re-added or replaced, the data block
> in that stripe will be computed using the changed block giving a now
> corrupt value.  I am assuming the stripe has some data blocks that
> have needed data and at least 1 that was not needed, and that block
> that was not needed was changed before writing it to disk.  And the
> disk that failed did not have the block that had been changed.
> 
> I have a hard time conveying my thought in text.  I hope you
> understand me.
> 
> Thanks for reading.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-16  1:37                               ` Neil Brown
@ 2009-11-16  5:21                                 ` Goswin von Brederlow
  2009-11-16  5:35                                   ` Neil Brown
  0 siblings, 1 reply; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-16  5:21 UTC (permalink / raw)
  To: Neil Brown
  Cc: Guy Watkins, 'Piergiorgio Sartor',
	'Peter Rabbitson', 'Goswin von Brederlow',
	'Doug Ledford', 'Michael Evans',
	'Eyal Lebedinsky', 'linux-raid list'

Neil Brown <neilb@suse.de> writes:

> On Sun, 15 Nov 2009 17:29:17 -0500
> "Guy Watkins" <linux-raid@watkins-home.com> wrote:
>
>> I have been following this issue some, and I think this could be a
>> cause for silent corruption on RAID5 and RAID6.  I don't think this
>> has been mentioned, if so, sorry.
>
> RAID1/RAID10 are very different from RAID5/RAID6
>
> RAID1/RAID10 can get 'mismatches' due to the particular behaviour
> of swap or filesystems.  However this doesn't matter (the blocks that
> are inconsistent are of no interest to the filesystem).
>
> RAID5/RAID6 is careful not to allow any mismatches to creep in
> due to any particular filesystem or swap activity.  This is because,
> as you say, those mismatches could be significant to the RAID
> algorithm even though they might be of no interest to the filesystem.
>
> mismatches can only occur in a RAID5/RAID6 due to a software bug
> in the md/raid code, or due to 'hardware errors' (including of course
> drive firmware errors etc).
>
> NeilBrown

Does that mean raid4/5/6 always copies the data, or that it protects
it with the MMU?

Regards,
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-16  5:21                                 ` Goswin von Brederlow
@ 2009-11-16  5:35                                   ` Neil Brown
  2009-11-16  7:40                                     ` Goswin von Brederlow
  0 siblings, 1 reply; 58+ messages in thread
From: Neil Brown @ 2009-11-16  5:35 UTC (permalink / raw)
  Cc: Guy Watkins, 'Piergiorgio Sartor',
	'Peter Rabbitson', 'Goswin von Brederlow',
	'Doug Ledford', 'Michael Evans',
	'Eyal Lebedinsky', 'linux-raid list'

On Mon, 16 Nov 2009 06:21:03 +0100
Goswin von Brederlow <goswin-v-b@web.de> wrote:

> Neil Brown <neilb@suse.de> writes:
> 
> > On Sun, 15 Nov 2009 17:29:17 -0500
> > "Guy Watkins" <linux-raid@watkins-home.com> wrote:
> >
> >> I have been following this issue some, and I think this could be a
> >> cause for silent corruption on RAID5 and RAID6.  I don't think this
> >> has been mentioned, if so, sorry.
> >
> > RAID1/RAID10 are very different from RAID5/RAID6
> >
> > RAID1/RAID10 can get 'mismatches' due to the particular behaviour
> > of swap or filesystems.  However this doesn't matter (the blocks
> > that are inconsistent are of no interest to the filesystem).
> >
> > RAID5/RAID6 is careful not to allow any mismatches to creep in
> > due to any particular filesystem or swap activity.  This is because,
> > as you say, those mismatches could be significant to the RAID
> > algorithm even though they might be of no interest to the
> > filesystem.
> >
> > mismatches can only occur in a RAID5/RAID6 due to a software bug
> > in the md/raid code, or due to 'hardware errors' (including of
> > course drive firmware errors etc).
> >
> > NeilBrown
> 
> Does that mean raid4/5/6 always copies the data or that it protects
> it with the MMU?

Always copies.  Given that it has to access the data to calculate the
XOR, the extra overhead of copying it is smaller than in RAID1.
Where hardware XOR support is present, hardware copy support is
normally also available, and that is used.
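A rough sketch of the idea in Python (names are made up; the real
stripe-cache code in C is far more involved): parity is computed from
the private copy, so a caller scribbling on its buffer mid-flight
cannot desynchronise data and parity:

    def compute_parity(stripe):
        # XOR all data blocks together; the data has to be read for
        # the XOR anyway, which is why the copy below is cheap.
        parity = bytearray(len(stripe[0]))
        for blk in stripe:
            for i, b in enumerate(blk):
                parity[i] ^= b
        return bytes(parity)

    def submit_write(stripe, idx, page):
        stripe[idx] = bytes(page)       # private, stable copy
        return compute_parity(stripe)   # always matches the copy

Even if the caller scribbles on 'page' after submit_write() returns,
stripe[idx] and the returned parity stay mutually consistent.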


NeilBrown

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-16  5:35                                   ` Neil Brown
@ 2009-11-16  7:40                                     ` Goswin von Brederlow
  0 siblings, 0 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-16  7:40 UTC (permalink / raw)
  To: Neil Brown
  Cc: Goswin von Brederlow, Guy Watkins, 'Piergiorgio Sartor',
	'Peter Rabbitson', 'Doug Ledford',
	'Michael Evans', 'Eyal Lebedinsky',
	'linux-raid list'

Neil Brown <neilb@suse.de> writes:

> On Mon, 16 Nov 2009 06:21:03 +0100
> Goswin von Brederlow <goswin-v-b@web.de> wrote:
>
>> Neil Brown <neilb@suse.de> writes:
>> 
>> > On Sun, 15 Nov 2009 17:29:17 -0500
>> > "Guy Watkins" <linux-raid@watkins-home.com> wrote:
>> >
>> >> I have been following this issue some, and I think this could be a
>> >> cause for silent corruption on RAID5 and RAID6.  I don't think this
>> >> has been mentioned, if so, sorry.
>> >
>> > RAID1/RAID10 are very different from RAID5/RAID6
>> >
>> > RAID1/RAID10 can get 'mismatches' due to the particular behaviour
>> > of swap or filesystems.  However this doesn't matter (the blocks
>> > that are inconsistent are of no interest to the filesystem).
>> >
>> > RAID5/RAID6 is careful not to allow any mismatches to creep in
>> > due to any particular filesystem or swap activity.  This is because,
>> > as you say, those mismatches could be significant to the RAID
>> > algorithm even though they might be of no interest to the
>> > filesystem.
>> >
>> > mismatches can only occur in a RAID5/RAID6 due to a software bug
>> > in the md/raid code, or due to 'hardware errors' (including of
>> > course drive firmware errors etc).
>> >
>> > NeilBrown
>> 
>> Does that mean raid4/5/6 always copies the data or that it protects
>> it with the MMU?
>
> Always copies.  Given that it has to access the data to calculate the
> XOR, the extra overhead of copying it is smaller than in RAID1.
> Where hardware XOR support is present, hardware copy support is
> normally also available, and that is used.

In cases where you have XOR but not copy, wouldn't you XOR against a
zero-filled page to copy?
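I.e. something like this (toy Python, relying on x ^ 0 == x):

    def xor_copy(src):
        dst = bytearray(len(src))   # zero-filled destination page
        for i, b in enumerate(src):
            dst[i] ^= b             # dst[i] was 0, so this just copies
        return bytes(dst)

    assert xor_copy(b"raid") == b"raid"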

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-14 17:01                               ` Bill Davidsen
@ 2009-11-17  5:19                                 ` Martin K. Petersen
  0 siblings, 0 replies; 58+ messages in thread
From: Martin K. Petersen @ 2009-11-17  5:19 UTC (permalink / raw)
  To: Bill Davidsen
  Cc: Martin K. Petersen, Peter Rabbitson, NeilBrown,
	Piergiorgio Sartor, Goswin von Brederlow, Doug Ledford,
	Michael Evans, Eyal Lebedinsky, linux-raid list

>>>>> "Bill" == Bill Davidsen <davidsen@tmr.com> writes:

>> FWIW, databases on raw disk have gone out of fashion.  But it is true
>> that applications that do direct I/O need to avoid updating buffers
>> in flight.
>> 

Bill> May have gone out of fashion in new applications (I'm not sure I
Bill> agree, but as a talking point), 

I don't believe Oracle supports it anymore.  ASM solves this and several
other problems.


Bill> but there are tons of old apps which are not going to be updated
Bill> any time soon, and any number of libraries which mmap stuff and
Bill> effect multiple applications.

We'll just unmap the page while it's being written out.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-14 19:04                               ` Goswin von Brederlow
@ 2009-11-17  5:22                                 ` Martin K. Petersen
  0 siblings, 0 replies; 58+ messages in thread
From: Martin K. Petersen @ 2009-11-17  5:22 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: Martin K. Petersen, Bill Davidsen, Peter Rabbitson, NeilBrown,
	Piergiorgio Sartor, Doug Ledford, Michael Evans, Eyal Lebedinsky,
	linux-raid list

>>>>> "Goswin" == Goswin von Brederlow <goswin-v-b@web.de> writes:

>> I agree that making MD RAID1 do a copy would be a quick fix.  But I
>> don't see any reason to encourage what is essentially sloppy behavior
>> at the top of the stack.  And then what if you stack MD/DM devices?
>> Does each layer do a copy?  I think that gets murky pretty quickly.

Goswin> Maybe as a quick debug the raid layer should make the page
Goswin> read-only and then watch what fails to write to it.

That's essentially what the fs-level debug patches do.  The advantage is
that you get a bit more information about the call path when you do it
up there.


Goswin> Maybe a flag somewhere saying if the data is safe from writes or
Goswin> not. Default would be unsafe and md copies. A filesystem that
Goswin> works "right" sets the safe flag as would md after copying. That
Goswin> way anything lower in the stack (like another md) has the flag
Goswin> set.

I actually have a patch kicking around in my guilt stack that implements
such a flag.  Mostly because it appears nobody is interested in fixing
ext2.
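In toy form the idea looks something like this (Python with made-up
names only, nothing like the actual patch):

    class Bio:
        def __init__(self, page, stable=False):
            self.page = page
            self.stable = stable  # owner promises not to touch the page

    def md_submit(bio, lower):
        if not bio.stable:
            # Copy once and mark the copy safe; a stacked md/dm below
            # sees stable=True and does not copy again.
            bio = Bio(bytes(bio.page), stable=True)
        lower(bio)

A filesystem that already keeps its pages stable would submit with
stable=True and skip the copy entirely.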

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-16 22:14 ` Neil Brown
@ 2009-11-17  4:50   ` Goswin von Brederlow
  0 siblings, 0 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-17  4:50 UTC (permalink / raw)
  To: Neil Brown; +Cc: greg, Eyal Lebedinsky, linux-raid list

Neil Brown <neilb@suse.de> writes:

> On Mon, 16 Nov 2009 15:36:55 -0600
> greg@enjellic.com wrote:
>
>> If a scrub directive were to be implemented it would be beneficial to
> make it interruptible.  A 'halt' or similar directive would shut down
>> the scrub and latch the last block number which had been examined.
>> That would allow a scrub to be resumed from that point in a subsequent
>> session.
>> 
>> With some of these large block devices it is difficult to get through
>> an entire 'check/scrub' in whatever late night window is left after
>> backups have run.  The above infra-structure would allow userspace to
>> gate the checking into whatever windows are available for these types
>> of activities.
>
> This is already possible with check.
>
> If you write 'idle' to 'sync_action', the check will stop.
> If you first read from 'sync_completed' and store that value,
> then before starting a new 'check', write the value to
> 'sync_min', then you get exactly what you are asking for, all
> easily done in a shell script.
> You can also set 'sync_max' if you like, thus you could e.g.
> quite easily have a cron job that scrubs 1/28th of the array each
> night based on the day of the month.
>
> NeilBrown

Great. I was looking for that feature too.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-16 21:36 greg
@ 2009-11-16 22:14 ` Neil Brown
  2009-11-17  4:50   ` Goswin von Brederlow
  0 siblings, 1 reply; 58+ messages in thread
From: Neil Brown @ 2009-11-16 22:14 UTC (permalink / raw)
  To: greg; +Cc: Eyal Lebedinsky, linux-raid list

On Mon, 16 Nov 2009 15:36:55 -0600
greg@enjellic.com wrote:

> If a scrub directive were to be implemented it would be beneficial to
> make it interruptible.  A 'halt' or similar directive would shut down
> the scrub and latch the last block number which had been examined.
> That would allow a scrub to be resumed from that point in a subsequent
> session.
> 
> With some of these large block devices it is difficult to get through
> an entire 'check/scrub' in whatever late night window is left after
> backups have run.  The above infra-structure would allow userspace to
> gate the checking into whatever windows are available for these types
> of activities.

This is already possible with check.

If you write 'idle' to 'sync_action', the check will stop.
If you first read from 'sync_completed' and store that value,
then before starting a new 'check', write the value to
'sync_min', then you get exactly what you are asking for, all
easily done in a shell script.
You can also set 'sync_max' if you like, thus you could e.g.
quite easily have a cron job that scrubs 1/28th of the array each
night based on the day of the month.
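A minimal sketch of such a script (Python here for clarity; the device
path and state file are illustrative, and error handling is omitted):

    import os

    MD = "/sys/block/md0/md"               # array path is illustrative
    STATE = "/var/lib/md0-check.resume"    # made-up state file

    def w(name, value):
        with open(os.path.join(MD, name), "w") as f:
            f.write(value)

    def stop_check():
        # Read progress while the check is still running, then stop it.
        with open(os.path.join(MD, "sync_completed")) as f:
            done = f.read().split()[0]     # "<sectors done> / <total>"
        w("sync_action", "idle")
        with open(STATE, "w") as f:
            f.write(done)                  # latch the last sector

    def resume_check():
        if os.path.exists(STATE):
            with open(STATE) as f:
                w("sync_min", f.read().strip())  # resume point
        w("sync_action", "check")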

NeilBrown

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
@ 2009-11-16 21:36 greg
  2009-11-16 22:14 ` Neil Brown
  0 siblings, 1 reply; 58+ messages in thread
From: greg @ 2009-11-16 21:36 UTC (permalink / raw)
  To: Neil Brown, greg; +Cc: Eyal Lebedinsky, linux-raid list

On Nov 13,  1:28pm, Neil Brown wrote:
} Subject: Re: mismatch_cnt again

Good afternoon to everyone, hope your week is starting well.

> On Thursday November 12, greg@enjellic.com wrote:
> > 
> > Neil/Martin what do you think?

> I think that if you found out which blocks were different and mapped
> that back through the filesystem, you would find that those blocks
> are not a part of any file, or possibly are part of a file that is
> currently being written.

I can buy the issue of the mismatches being part of a file being
written, but that doesn't explain machines where the RAID1 array was
initialized and allowed to synchronize and which now show persistent
counts of mismatched sectors.

I can certainly buy the issue of the mismatches not being part of an
active file.  I still think this leaves the issue of why the
mismatches were generated unless we want to assume that whatever
causes the mismatch only affects areas of the filesystem which don't
have useful files.  Not a reassuring assumption.

> I guess I need to start logging the error address so people can
> start dealing with facts rather than fears.

I think that would be a good starting point.  If for no other reason
than to allow people to easily figure out the possible ramifications of a
mismatch count.

One other issue to consider.  We have RAID1 volumes with mismatch
counts over a wide variety of hardware platforms and Linux kernels.
In all cases the number of mismatched blocks is an exact multiple of
128.  That doesn't seem to suggest some type of random corruption.

This issue may all be innocuous but we have about the worst situation
we could have.  An issue which may be generating false positives for
potential corruption.  Amplified by the fact that major distributions
are generating what will be interpreted as warning e-mails about their
existence.  So even if the problem is innocuous, the list is guaranteed
to be spammed with these reports, let alone your inbox... :-)

Just a thought in moving forward.

The 'check' option is primarily useful for its role in scrubbing RAID*
volumes with an eye toward making sure that silent corruption
scenarios don't arise which would thwart a resync.  Particularly since
you implemented the ability to attempt a sector re-write to trigger
block re-allocations.  This is a nice deterministic repair mechanism
which has fixed problems for us on a number of occasions.

I think what is needed is a 'scrub' directive which carries out this
function without incrementing mismatch counts and the like.  That
would leave a possibly enhanced 'check' command to report on
mismatches and carry out any remedial action, if any, that the group
can think of.

If a scrub directive were to be implemented it would be beneficial to
make it interruptible.  A 'halt' or similar directive would shut down
the scrub and latch the last block number which had been examined.
That would allow a scrub to be resumed from that point in a subsequent
session.

With some of these large block devices it is difficult to get through
an entire 'check/scrub' in whatever late night window is left after
backups have run.  The above infra-structure would allow userspace to
gate the checking into whatever windows are available for these types
of activities.

> NeilBrown

Hope the above comments are helpful.

Best wishes for a productive week.

}-- End of excerpt from Neil Brown

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"When I am working on a problem I never think about beauty.  I only
 think about how to solve the problem.  But when I have finished, if
 the solution is not beautiful, I know it is wrong."
                                -- Buckminster Fuller

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13  2:28 ` Neil Brown
  2009-11-13  5:19   ` Goswin von Brederlow
@ 2009-11-15  1:54   ` Bill Davidsen
  1 sibling, 0 replies; 58+ messages in thread
From: Bill Davidsen @ 2009-11-15  1:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: greg, Eyal Lebedinsky, linux-raid list

Neil Brown wrote:
> On Thursday November 12, greg@enjellic.com wrote:
>   
>> Neil/Martin what do you think?
>>     
>
> I think that if you found out which blocks were different and mapped
> that back through the filesystem, you would find that those blocks are
> not a part of any file, or possibly are part of a file that is
> currently being written.
>   

Well, I have a bunch on my /boot partition, so here's my test plan. 
Please comment on the safety of this plan as you see it.

- remount the raid-1 array read-only. It shouldn't be changing!
- mount 1st component r/o and do md5sum on every file[1]
- for each other component, mount r/o and check every file

Investigate any mismatches found in real data.
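Something like this for the comparison step (Python; device names and
the mount point are placeholders):

    import hashlib, os, subprocess

    COMPONENTS = ["/dev/sdX1", "/dev/sdY1"]   # placeholder members
    MNT = "/mnt/rcheck"

    def tree_md5(root):
        sums = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                h = hashlib.md5()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                sums[os.path.relpath(path, root)] = h.hexdigest()
        return sums

    baseline = None
    for dev in COMPONENTS:
        subprocess.run(["mount", "-o", "ro", dev, MNT], check=True)
        sums = tree_md5(MNT)
        subprocess.run(["umount", MNT], check=True)
        if baseline is None:
            baseline = sums               # digests from 1st component
        else:
            for path, digest in sums.items():
                if baseline.get(path) != digest:
                    print("mismatch on", dev, ":", path)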

The alternative involves taking the system down and doing this from a 
Live-CD. Since most of my servers and desktops (including this one) run 
in VMs on the server in question, this is safer but not likely to happen 
soon.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-13  2:28 ` Neil Brown
@ 2009-11-13  5:19   ` Goswin von Brederlow
  2009-11-15  1:54   ` Bill Davidsen
  1 sibling, 0 replies; 58+ messages in thread
From: Goswin von Brederlow @ 2009-11-13  5:19 UTC (permalink / raw)
  To: Neil Brown; +Cc: greg, Eyal Lebedinsky, linux-raid list

Neil Brown <neilb@suse.de> writes:

> On Thursday November 12, greg@enjellic.com wrote:
>> 
>> Neil/Martin what do you think?
>
> I think that if you found out which blocks were different and mapped
> that back through the filesystem, you would find that those blocks are
> not a part of any file, or possibly are part of a file that is
> currently being written.
>
> I guess I need to start logging the error address so people can start
> dealing with facts rather than fears.
>
> NeilBrown

+1

Even if only to debug where the mismatch comes from. Maybe it is the
raid layer or the fs layer or swap. Currently we just can not tell.

MfG
        Goswin

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
  2009-11-12 19:20 greg
@ 2009-11-13  2:28 ` Neil Brown
  2009-11-13  5:19   ` Goswin von Brederlow
  2009-11-15  1:54   ` Bill Davidsen
  0 siblings, 2 replies; 58+ messages in thread
From: Neil Brown @ 2009-11-13  2:28 UTC (permalink / raw)
  To: greg; +Cc: Eyal Lebedinsky, linux-raid list

On Thursday November 12, greg@enjellic.com wrote:
> 
> Neil/Martin what do you think?

I think that if you found out which blocks were different and mapped
that back through the filesystem, you would find that those blocks are
not a part of any file, or possibly are part of a file that is
currently being written.

I guess I need to start logging the error address so people can start
dealing with facts rather than fears.

NeilBrown

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: mismatch_cnt again
@ 2009-11-12 19:20 greg
  2009-11-13  2:28 ` Neil Brown
  0 siblings, 1 reply; 58+ messages in thread
From: greg @ 2009-11-12 19:20 UTC (permalink / raw)
  To: Eyal Lebedinsky, linux-raid list; +Cc: neilb

On Nov 10,  9:03am, Eyal Lebedinsky wrote:
} Subject: Re: mismatch_cnt again

Good day to everyone.

> Thanks everyone,

> I wish to narrow down the issue to my question Are there situations
> known to cause this without an actual hardware failure?
>
> Meaning, are there known *software* issues with this configuration
> 	2.6.30.5-43.fc11.x86_64, ext3, raid5, sata, Adaptec 1430SA
> that can lead to a mismatch?
>
> It is not root, not swap, has weekly smartd scans and weekly
> (different days) raid 'check's. Only report is a growing
> mismatch_cnt.
>
> I noted the raid1 as mentioned in the thread.

I have concerns there is a big ugly issue waiting to rear its head in
the Linux storage community.  Particularly after reading Martin's note
about pages not being pinned through the duration of an I/O.

Speaking directly to your concerns Eyal.  One of my staff members runs
recent Fedora on his desktop with software RAID1.  On a brand new box
shortly after installation he is noting large mismatch_cnt's on the
RAID pairs.

He posted about the issue a month or so ago to the linux-raid list.
He received no definitive responses other than some vague hand waving
that ext3 could cause this.  I believe he is running ext4 on the RAID1
volumes in question.

Interestingly enough, a filesystem check comes up normal.  So there are
mismatches but they do not seem to be manifesting themselves.  It
would seem that others confirm this issue.

More to the point we manage geographically mirrored storage systems.
Linux initiators receive fiber-channel based block devices from two
separate mirrors.  The block devices are used as the basis for a RAID1
volume with persistent bitmaps.

In the data-centers we have SCST based Linux storage targets.  The
target 'disks' are LVM based logical volumes platformed on top of
software RAID5 volumes.
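Schematically:

    initiator:  ext3 on md RAID1 (persistent bitmaps)
                  |-- FC LUN from site A -> SCST -> LVM LV -> md RAID5
                  `-- FC LUN from site B -> SCST -> LVM LV -> md RAID5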

We are seeing, in some cases, large mismatch_cnts on the RAID1
initiators.  Check runs on each of the two RAID5 target volumes show
no mismatches.  So the mismatch is occurring at the RAID1 level and is
independent of what is happening at the physical storage level.

The filesystems on the RAID1 volumes are ext3 running under moderate
to heavy load.  Initiator kernels, in general, have been reasonably
new, 2.6.27.x and forward, with RHEL5 userspace.

I suspect there are one or more subtle factors which are making the
non-pinned pages more of an issue than they appear to be at first
analysis.  Jens and company have been putzing with the I/O schedulers
and related issues.  One possible bit of hand waving is that all of
this may be somehow confounded by elevator induced latencies.

Our I/O latencies are longer due to the physical issues of shooting
I/O through a fair amount of glass and multi-trunked switch
architectures.  In addition we configure somewhat deeper queue depths
on the targets which may compound the problem.  But that doesn't
explain Eyal's and other's issues with this showing up on desktop
systems.

In any case I am convinced the problem is real and potentially
significant.  What seems to be perplexing is why it isn't showing up
as corrupted files and the like.  We are not hearing anything from the
user side which would suggest manifestation of the problem.

More troubling in my opinion is how widespread the problem might be
and how do we fix it?  Automatic repair is problematic as has been
discussed, particularly in the case of a two-disk RAID1 volume.  I'm
also equally apprehensive about doing a casino roll with data by
blindly running a 'repair'.

The obvious alternative is to compare the mismatches and figure out
which block is correct.  Pragmatically a somewhat daunting task on
potentially thousands of mismatches on multi-hundred gigabyte
filesystems.  Much more so when one considers the qualitative
assessment issue and the need to do this off-line to avoid Heisenberg
issues.

> cheers Eyal

So I think the problem is real and one we need to respond to as a
community sooner rather than later.  I shudder at the thought of an
LWN or Slashdot article heralding the fact there might be silent
corruption on thousands of filesystems around the planet... :-)

Neil/Martin what do you think?

I'm happy to hunt if we can do anything from our end.

Best wishes for a pleasant weekend to everyone.

Greg

}-- End of excerpt from Eyal Lebedinsky

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information infra-structure
Fargo, ND  58102            development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@enjellic.com
------------------------------------------------------------------------------
"If I'd listened to customers, I'd have given them a faster horse."
                                -- Henry Ford

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2009-11-17  5:22 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-07  0:41 mismatch_cnt again Eyal Lebedinsky
2009-11-07  1:53 ` berk walker
2009-11-07  7:49   ` Eyal Lebedinsky
2009-11-07  8:08     ` Michael Evans
2009-11-07  8:42       ` Eyal Lebedinsky
2009-11-07 13:51       ` Goswin von Brederlow
2009-11-07 14:58         ` Doug Ledford
2009-11-07 16:23           ` Piergiorgio Sartor
2009-11-07 16:37             ` Doug Ledford
2009-11-07 22:25               ` Eyal Lebedinsky
2009-11-07 22:57                 ` Doug Ledford
2009-11-08 15:32             ` Goswin von Brederlow
2009-11-09 18:08               ` Bill Davidsen
2009-11-07 22:19           ` Eyal Lebedinsky
2009-11-07 22:58             ` Doug Ledford
2009-11-08 15:46           ` Goswin von Brederlow
2009-11-08 16:04             ` Piergiorgio Sartor
2009-11-09 18:22               ` Bill Davidsen
2009-11-09 21:50                 ` NeilBrown
2009-11-10 18:05                   ` Bill Davidsen
2009-11-10 22:17                     ` Peter Rabbitson
2009-11-13  2:15                     ` Neil Brown
2009-11-09 19:13               ` Goswin von Brederlow
2009-11-08 22:51             ` Peter Rabbitson
2009-11-09 18:56               ` Piergiorgio Sartor
2009-11-09 21:14                 ` NeilBrown
2009-11-09 21:54                   ` Piergiorgio Sartor
2009-11-10  0:17                     ` NeilBrown
2009-11-10  9:09                       ` Peter Rabbitson
2009-11-10 14:03                         ` Martin K. Petersen
2009-11-12 22:40                           ` Bill Davidsen
2009-11-13 17:12                             ` Martin K. Petersen
2009-11-14 17:01                               ` Bill Davidsen
2009-11-17  5:19                                 ` Martin K. Petersen
2009-11-14 19:04                               ` Goswin von Brederlow
2009-11-17  5:22                                 ` Martin K. Petersen
2009-11-10 19:52                       ` Piergiorgio Sartor
2009-11-13  2:37                         ` Neil Brown
2009-11-13  5:30                           ` Goswin von Brederlow
2009-11-13  9:33                           ` Peter Rabbitson
2009-11-15 21:05                           ` Piergiorgio Sartor
2009-11-15 22:29                             ` Guy Watkins
2009-11-16  1:23                               ` Goswin von Brederlow
2009-11-16  1:37                               ` Neil Brown
2009-11-16  5:21                                 ` Goswin von Brederlow
2009-11-16  5:35                                   ` Neil Brown
2009-11-16  7:40                                     ` Goswin von Brederlow
2009-11-12 22:57                       ` Bill Davidsen
2009-11-09 18:11           ` Bill Davidsen
2009-11-09 20:58             ` Doug Ledford
2009-11-09 22:03 ` Eyal Lebedinsky
2009-11-12 19:20 greg
2009-11-13  2:28 ` Neil Brown
2009-11-13  5:19   ` Goswin von Brederlow
2009-11-15  1:54   ` Bill Davidsen
2009-11-16 21:36 greg
2009-11-16 22:14 ` Neil Brown
2009-11-17  4:50   ` Goswin von Brederlow
