From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: NeilBrown <neilb@suse.de>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
	Peter Rabbitson <rabbit+list@rabbit.us>,
	Goswin von Brederlow <goswin-v-b@web.de>,
	Doug Ledford <dledford@redhat.com>,
	Michael Evans <mjevans1983@gmail.com>,
	Eyal Lebedinsky <eyal@eyal.emu.id.au>,
	linux-raid list <linux-raid@vger.kernel.org>
Subject: Re: mismatch_cnt again
Date: Tue, 10 Nov 2009 20:52:22 +0100
Message-ID: <20091110195222.GA2777@lazy.lzy>
In-Reply-To: <d99ca9481d2471073484c5d43d493b4d.squirrel@neil.brown.name>

Hi again,

> It seems we might have been talking at cross-purposes.
> 
> When I wrote about the need for a threat model, it was in the
> context of automatically determining which block was most
> likely to be in error (e.g. voting with a 3-drive RAID1 or
> fancy arithmetic with RAID6).  I do not believe there is any
> value in doing that.  At least not automatically in the kernel
> with the aim of just repairing which block was decided to be
> most wrong.
> 
> You now seem to be talking about the ability to find out which
> blocks are inconsistent.  That is very different.  I do agree there
> is value in that.  Maybe it should appear in the kernel logs,
> or maybe we could store the information and report it via sysfs
> (the former would certainly be easier).

maybe there is a misunderstanding between us! :-)

Automatic repair *might* be a long-term goal, but I do
agree that this needs to be thought through carefully.

I see it much as a fellow poster described in a
previous comment.
To do:
1) detect which MD block is inconsistent
2) detect, when possible, which device component is responsible
3) trigger a repair action

All of this would be done under user control, i.e. the user
would get the mismatch count, maybe with some hint about which
device could be guilty (RAID-6, or RAID-1/10 with multiple
redundancy), and could then decide what to do.

The user would have full control over, and full *responsibility*
for, the action, but would also be fully informed about what
the situation is.

The system would say: block ABC is inconsistent, maybe
device /dev/sdX is guilty, you could: do nothing, resync
the parity, or try to repair.
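
To illustrate the RAID-6 case, here is a minimal sketch (plain Python,
not md code) of how the P and Q syndromes can point at a single faulty
device. It assumes the usual RAID-6 field GF(2^8) with polynomial 0x11d
and generator 2. The names syndromes() and locate() are made up for the
example, and it only gives a hint when exactly one device is wrong.

```python
# Log/antilog tables for GF(2^8), polynomial 0x11d, generator 2.
GF_EXP = [0] * 510
GF_LOG = [0] * 256
_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    GF_EXP[_i] = GF_EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def syndromes(data):
    """Return (P, Q) for a list of equal-length data chunks (bytes)."""
    n = len(data[0])
    p = bytearray(n)
    q = bytearray(n)
    for idx, chunk in enumerate(data):
        coeff = GF_EXP[idx]          # g**idx
        for j, byte in enumerate(chunk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def locate(data, p_stored, q_stored):
    """Hint at the guilty device of one stripe.

    Returns None if the stripe is consistent, a data-disk index,
    "P", "Q", or "unknown" if no single device explains the mismatch.
    """
    p_calc, q_calc = syndromes(data)
    suspects = set()
    for pc, ps, qc, qs in zip(p_calc, p_stored, q_calc, q_stored):
        dp, dq = pc ^ ps, qc ^ qs
        if dp == 0 and dq == 0:
            continue                 # this byte position is fine
        if dq == 0:
            suspects.add("P")        # only P disagrees
        elif dp == 0:
            suspects.add("Q")        # only Q disagrees
        else:
            # A single bad data disk z gives dq = g**z * dp.
            z = (GF_LOG[dq] - GF_LOG[dp]) % 255
            suspects.add(z if z < len(data) else "unknown")
    if not suspects:
        return None
    if len(suspects) == 1:
        return suspects.pop()
    return "unknown"
```

With a hint like this the user would still be the one to decide; the
sketch reports a suspect and changes nothing.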

> I would be very happy to accept a patch which logged this
> information - providing it was careful not to overly spam the logs if there
> were lots and lots of errors.  I may even write one myself.

I could try to look into it, time permitting.
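
Just to make the "do not spam the logs" part concrete, a rough sketch
of the rate limiting in Python (the kernel would presumably use its own
rate-limit helpers instead). The class name, the message format and
the burst/interval values are all assumptions for illustration only.

```python
import time


class RateLimitedReporter:
    """Emit at most `burst` mismatch reports per `interval` seconds."""

    def __init__(self, burst=10, interval=5.0,
                 clock=time.monotonic, emit=print):
        self.burst = burst
        self.interval = interval
        self.clock = clock           # injectable, which makes testing easy
        self.emit = emit
        self.window_start = None
        self.emitted = 0
        self.suppressed = 0

    def report(self, sector):
        """Report one mismatch; return True if it was actually logged."""
        now = self.clock()
        if (self.window_start is None
                or now - self.window_start >= self.interval):
            # A new window starts: summarise what was dropped before.
            if self.suppressed:
                self.emit(f"md: {self.suppressed} mismatch reports suppressed")
            self.window_start = now
            self.emitted = 0
            self.suppressed = 0
        if self.emitted < self.burst:
            self.emitted += 1
            self.emit(f"md: mismatch at sector {sector}")
            return True
        self.suppressed += 1
        return False
```

The summary line means no information about the *number* of mismatches
is lost, only the individual locations beyond the burst.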

[mismatch_cnt=256]
> I would probably run a 'repair' to fix the difference, but that
> isn't firm advice.  It is quite probable that the block is not
> actively in use and so the inconsistency will never be noticed.

Exactly, that's why knowing *where* the issue is
would already help a lot!
 
> check/repair is primarily about reading every block on every device,
> and being ready to cope with read errors by overwriting with the
> correct data.  This is known as scrubbing I believe.
> I would normally just 'repair' every month or so.  If there are
> discrepancies I would like them reported and fixed.  If they happen
> often on a non-swap partition, I would like to know about it, otherwise
> I would rather they were just fixed.
> 'check' largely exists because it was trivial to implement given
> that 'repair' was being implemented, and it could conceivably be useful,
> e.g. you have assembled an array read-only as you aren't at all sure the
> disks should form an array.  You run a 'check' to increase your
> confidence that all is OK without risking any change to any data in case
> you put the array together badly.

As I mentioned some time ago, I built a RAID-6 where
one disk, due to a strange cabling problem, was sometimes
returning wrong data (one bit flip, actually).
And this happened without any errors reported, i.e. a bit was
sometimes flipped, at the very end it seems, and it
went undetected by ECC/CRC/whatever.

This was noticed by the "check", so I ran a "repair", which
was, of course, doing more damage...

What I did was run a check on a RO MD device, failing one
device after the other (and re-adding it afterwards, of course).

I was able to find the guilty disk and to fix the array
for good!
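
The idea behind that procedure can be simulated in a few lines. This
is only a toy model in Python, not what md does internally: for each
device in turn, pretend it is failed, rebuild it from the others and
see whether the stripe becomes consistent. It assumes RAID-6 P/Q parity
over GF(2^8) with polynomial 0x11d and generator 2, and all function
names are invented for the example.

```python
def gf_mul2(a):
    """Multiply a field element by the generator 2 (polynomial 0x11d)."""
    a <<= 1
    return (a ^ 0x11D) & 0xFF if a & 0x100 else a


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks (this is the P parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for j, byte in enumerate(block):
            out[j] ^= byte
    return bytes(out)


def q_parity(data):
    """Q parity, sum of g**i * D_i, evaluated with Horner's scheme."""
    q = bytearray(len(data[0]))
    for chunk in reversed(data):
        for j, byte in enumerate(chunk):
            q[j] = gf_mul2(q[j]) ^ byte
    return bytes(q)


def suspects_by_exclusion(data, p, q):
    """Return the devices whose exclusion makes the stripe consistent.

    `data` is a list of data chunks; an empty result means the stripe
    was consistent to begin with.
    """
    if xor_blocks(data) == p and q_parity(data) == q:
        return []
    found = []
    for i in range(len(data)):
        # "Fail" data disk i: rebuild it from P and the other data disks,
        # then use Q as the remaining redundancy to check the result.
        rebuilt = xor_blocks(data[:i] + data[i + 1:] + [p])
        trial = data[:i] + [rebuilt] + data[i + 1:]
        if q_parity(trial) == q:
            found.append(i)
    if q_parity(data) == q:          # "fail" P: data and Q still agree
        found.append("P")
    if xor_blocks(data) == p:        # "fail" Q: data and P still agree
        found.append("Q")
    return found
```

With a single bad device, only excluding that device gives a clean
check, which is the same reasoning as failing the disks one by one
by hand.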

Now, this was a really lengthy process; I would have
preferred to have it done automatically and then get
a report on which device *could* be responsible.

I agree with you that an automatic repair would not
have been the right choice without first knowing what
was going on.

> drivers/md/raid1.c for RAID1
> drivers/md/raid5.c for RAID4/RAID5/RAID6
> 
> Look for where the resync_mismatches field is updated.

Thanks, I'll try to have a look!
 
bye,

-- 

piergiorgio
