linux-kernel.vger.kernel.org archive mirror
From: Andreas Dilger <adilger@dilger.ca>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>,
	Allison Henderson <allison.henderson@oracle.com>,
	linux-block@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	Martin Petersen <martin.petersen@oracle.com>,
	shirley.ma@oracle.com, bob.liu@oracle.com
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry
Date: Wed, 28 Nov 2018 12:38:47 -0700
Message-ID: <4C297F47-C508-4DC9-8360-5E0873873833@dilger.ca>
In-Reply-To: <20181128054923.GF8125@magnolia>

On Nov 27, 2018, at 10:49 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>> On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
>>> Motivation:
>>> When an fs data/metadata checksum mismatches, lower block devices may have other
>>> correct copies.  For example, if XFS successfully reads a metadata buffer off a raid1
>>> but decides that the metadata is garbage, today it will shut down the entire
>>> filesystem without trying any of the other mirrors.  This is a severe
>>> loss of service, and we propose these patches to have XFS try harder to
>>> avoid failure.
>>> 
>>> This patch set prototypes the mirror retry idea by:
>>> * Adding @nr_mirrors to struct request_queue.  Similar to
>>>  blk_queue_nonrot(), the filesystem can grab the device request queue and
>>>  check the max number of mirrors this block device has.
>>>  Helper functions were also added to get/set nr_mirrors.
>>> 
>>> * Expanding bi_write_hint to bi_rw_hint; @bi_rw_hint now has three meanings:
>>>  1. The original write_hint.
>>>  2. end_io() updates @bi_rw_hint to reflect which mirror the I/O was really served from.
>>>  3. The fs sets @bi_rw_hint to force the driver (e.g. raid1) to read from a specific mirror.
>>> 
>>> * Modify md/raid1 to support this retry feature.
>>> 
>>> * Add b_rw_hint to xfs_buf
>>>  This patch adds a new field b_rw_hint to xfs_buf.  We will use this to set the
>>>  new bio->bi_rw_hint when submitting the read request, and also to store the
>>>  returned mirror when the read completes
> 
>> the retry iterations. That allows us to let the block layer pick
>> whatever leg it wants for the initial read, but if we get a failure
>> we directly control the mirror we retry from and all bios in the
>> buffer go to that same mirror.
>> 	- is it generic/abstract enough to be able to work with
>> 	  RAID5/6 to trigger verification/recovery from the parity
>> 	  information in the stripe?
> 
> In theory we could supply a raid5 implementation, wherein rw_hint == 0
> lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> rw_hint == 2 forces stripe recovery for the given block.

Definitely this API needs to be useful for RAID-5/6 storage as well, and
I don't think that needs too complex an interface to achieve.

Basically, the "nr_mirrors" parameter would instead be "nr_retries" or
similar, so that the caller knows how many possible data combinations
there are to try and validate.  For mirrors this is easy, and is what is
currently implemented.  For RAID-5/6 this would essentially be the
number of data rebuild combinations in the RAID group (e.g. 8 in a
RAID-5 8+1 setup, and 16 in a RAID-6 8+2).
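
The arithmetic above can be written down as a small helper.  This is only
a sketch: nr_retries() and the enum are hypothetical names, not an existing
kernel interface, and the RAID-6 case assumes the 16 comes from rebuilding
each of the 8 data drives from either of the two parity blocks.

```c
#include <assert.h>

enum raid_level { LEVEL_RAID1, LEVEL_RAID5, LEVEL_RAID6 };

/*
 * Hypothetical helper (an assumption for illustration): how many
 * alternative reads a caller may ask for after the default read
 * fails verification.
 *   RAID-1: one per mirror leg, as nr_mirrors does in the patch set.
 *   RAID-5: skip each data drive in turn and rebuild it from parity.
 *   RAID-6: as RAID-5, but each data drive can be rebuilt from either
 *           of the two parity blocks (assumed reading of "16").
 */
static unsigned int nr_retries(enum raid_level level, unsigned int members)
{
	assert(members > 0);
	switch (level) {
	case LEVEL_RAID1:
		return members;		/* members = mirror legs */
	case LEVEL_RAID5:
		return members;		/* members = data drives */
	case LEVEL_RAID6:
		return 2 * members;	/* members = data drives */
	}
	return 0;
}
```

With 8 data drives this gives 8 for RAID-5 and 16 for RAID-6, matching
the figures quoted above.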

For each call with a nonzero retry index (up to nr_retries), the MD
RAID-5/6 driver would skip one of the data drives, and rebuild that
part of the data from parity.
This wouldn't take too long, since the blocks are already in memory,
they just need the parity to be recomputed in a few different ways to
try and find a combination that returns valid data (e.g. if a drive
failed and the parity also has a latent corrupt sector, not uncommon).
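
A userspace toy of that retry loop, for single-parity (RAID-5 style) XOR
only, might look like the following.  Everything here is an assumption
made for illustration: the struct stripe layout, stripe_read(),
find_valid_retry() and the toy checksum are hypothetical and are not MD
or XFS code.  Retry 0 returns the data as read; retry k ignores data
drive k-1 and rebuilds its chunk from parity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NDATA 4		/* data drives in the toy stripe */
#define CHUNK 8		/* bytes per chunk */

struct stripe {
	uint8_t data[NDATA][CHUNK];
	uint8_t parity[CHUNK];	/* XOR of all data chunks */
};

/* Toy stand-in for the filesystem's metadata checksum. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t h = 0;

	for (size_t i = 0; i < len; i++)
		h = h * 31 + buf[i];
	return h;
}

/*
 * retry == 0: return the data chunks as read.
 * retry == k (1..NDATA): ignore data drive k-1 and rebuild its chunk
 * from the parity and the remaining data chunks.
 */
static int stripe_read(const struct stripe *s, unsigned int retry,
		       uint8_t out[NDATA][CHUNK])
{
	if (retry > NDATA)
		return -1;
	memcpy(out, s->data, sizeof(s->data));
	if (retry == 0)
		return 0;

	unsigned int skip = retry - 1;

	for (size_t b = 0; b < CHUNK; b++) {
		uint8_t v = s->parity[b];

		for (unsigned int d = 0; d < NDATA; d++)
			if (d != skip)
				v ^= s->data[d][b];
		out[skip][b] = v;
	}
	return 0;
}

/* Caller side: first retry index whose data passes the checksum, or -1. */
static int find_valid_retry(const struct stripe *s, uint32_t want)
{
	uint8_t buf[NDATA][CHUNK];

	for (unsigned int retry = 0; retry <= NDATA; retry++) {
		stripe_read(s, retry, buf);
		if (toy_csum((const uint8_t *)buf, sizeof(buf)) == want)
			return (int)retry;
	}
	return -1;
}

/* Build a consistent stripe, silently corrupt one data chunk, then search. */
static int demo_find_retry(unsigned int bad)
{
	struct stripe s;

	assert(bad < NDATA);
	memset(s.parity, 0, sizeof(s.parity));
	for (unsigned int d = 0; d < NDATA; d++)
		for (size_t b = 0; b < CHUNK; b++) {
			s.data[d][b] = (uint8_t)(d * 16 + b);
			s.parity[b] ^= s.data[d][b];
		}

	uint32_t want = toy_csum((const uint8_t *)s.data, sizeof(s.data));

	s.data[bad][3] ^= 0xff;		/* latent corruption on one drive */
	return find_valid_retry(&s, want);
}
```

Calling demo_find_retry(2) corrupts data drive 2, and the search stops at
the retry that skips that drive.  A real driver would work on the stripe
cache rather than copies, and RAID-6 would add the second parity block.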

The next step is to have an API that says "retry=N returned the correct
data, rebuild the parity/drive with that combination of devices" so
that the corrupt parity sector isn't used during the rebuild.
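
One possible shape for such a call, again as a single-parity userspace
toy: stripe_commit() and its arguments are hypothetical names invented
for this sketch, not a proposed kernel API.  The caller reports which
retry passed verification, and the stripe is repaired using only the
devices that took part in that read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NDATA 4		/* data drives in the toy stripe */
#define CHUNK 8		/* bytes per chunk */

struct stripe {
	uint8_t data[NDATA][CHUNK];
	uint8_t parity[CHUNK];	/* XOR of all data chunks */
};

/* XOR every data chunk except `skip` into out; skip == NDATA skips none. */
static void xor_data(const struct stripe *s, unsigned int skip,
		     uint8_t out[CHUNK])
{
	memset(out, 0, CHUNK);
	for (unsigned int d = 0; d < NDATA; d++)
		if (d != skip)
			for (size_t b = 0; b < CHUNK; b++)
				out[b] ^= s->data[d][b];
}

/*
 * Hypothetical "commit" call: retry == good_retry returned valid data.
 *   good_retry == 0: all data chunks are good, so the parity is the
 *                    suspect and is recomputed from the data.
 *   good_retry == k: drive k-1 was bad and the parity was good, so
 *                    chunk k-1 is rewritten from parity + other data.
 */
static int stripe_commit(struct stripe *s, unsigned int good_retry)
{
	uint8_t x[CHUNK];

	if (good_retry > NDATA)
		return -1;
	if (good_retry == 0) {
		xor_data(s, NDATA, x);
		memcpy(s->parity, x, CHUNK);
	} else {
		unsigned int skip = good_retry - 1;

		xor_data(s, skip, x);
		for (size_t b = 0; b < CHUNK; b++)
			s->data[skip][b] = x[b] ^ s->parity[b];
	}
	return 0;
}

/* Corrupt the device implied by good_retry, commit, compare to original. */
static int demo_commit(unsigned int good_retry)
{
	struct stripe orig, s;

	assert(good_retry <= NDATA);
	memset(orig.parity, 0, sizeof(orig.parity));
	for (unsigned int d = 0; d < NDATA; d++)
		for (size_t b = 0; b < CHUNK; b++) {
			orig.data[d][b] = (uint8_t)(d * 16 + b);
			orig.parity[b] ^= orig.data[d][b];
		}

	s = orig;
	if (good_retry == 0)
		s.parity[5] ^= 0x5a;			/* bad parity sector */
	else
		s.data[good_retry - 1][5] ^= 0x5a;	/* bad data sector */

	stripe_commit(&s, good_retry);
	return memcmp(&s, &orig, sizeof(s)) == 0;
}
```

demo_commit() returns 1 when the repaired stripe matches the original.
With two parity blocks the same idea would also tell the driver which
parity block to distrust during the rebuild.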

Cheers, Andreas