linux-block.vger.kernel.org archive mirror
From: Dave Chinner <david@fromorbit.com>
To: Bob Liu <bob.liu@oracle.com>
Cc: linux-block@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, martin.petersen@oracle.com,
	shirley.ma@oracle.com, allison.henderson@oracle.com,
	darrick.wong@oracle.com, hch@infradead.org, adilger@dilger.ca
Subject: Re: [RFC PATCH v2 0/9] Block/XFS: Support alternative mirror device retry
Date: Fri, 1 Mar 2019 08:49:49 +1100
Message-ID: <20190228214949.GO23020@dastard>
In-Reply-To: <ba78a587-04ba-58a4-f282-63e200c81e2d@oracle.com>

On Thu, Feb 28, 2019 at 10:22:02PM +0800, Bob Liu wrote:
> On 2/19/19 5:31 AM, Dave Chinner wrote:
> > On Wed, Feb 13, 2019 at 05:50:35PM +0800, Bob Liu wrote:
> >> Motivation:
> >> When an fs data/metadata checksum mismatches, lower block devices may have other
> >> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1 but
> >> decides that the metadata is garbage, today it will shut down the entire
> >> filesystem without trying any of the other mirrors.  This is a severe
> >> loss of service, and we propose these patches to have XFS try harder to
> >> avoid failure.
> >>
> >> This patch series prototypes the mirror retry idea by:
> >> * Adding @nr_mirrors to struct request_queue, similar to
> >>   blk_queue_nonrot(), so a filesystem can grab the device request queue
> >>   and check the maximum number of mirrors this block device has.
> >>   Helper functions were also added to get/set the nr_mirrors.
> >>
> >> * Introducing bi_rd_hint just like bi_write_hint, but bi_rd_hint is a long bitmap
> >> in order to support stacked layer case.
> >>
> >> * Modify md/raid1 to support this retry feature.
> >>
> >> * Adapt xfs to use this feature.
> >>   If the read verify fails, we loop over the available mirrors and retry the read.
> > 
> > Why does the filesystem have to iterate every single possible
> > combination of devices that are underneath it?
> > 
> > Wouldn't it be much simpler to be able to attach a verifier
> > function to the bio, and have each layer that gets called iterate
> > over all its copies internally until the verifier function passes
> > or all copies are exhausted?
> > 
> > This works for stacked mirrors - it can pass the higher layer
> > verifier down as far as necessary. It can work for RAID5/6, too, by
> > having that layer supply its own verifier for reads that verifies
> > parity and can reconstruct on failure, then when it has reconstructed
> > a valid stripe it can run the verifier that was supplied to it from
> > above, etc.
> > 
> > i.e. I don't see why only filesystems should drive retries or have to
> > be aware of the underlying storage stacking. ISTM that each
> > layer of the storage stack should be able to verify what has been
> > returned to it is valid independently of the higher layer
> > requirements. The only difference from a caller point of view should
> > be submit_bio(bio); vs submit_bio_verify(bio, verifier_cb_func);
> > 
> 
> We already have bio->bi_end_io(), how about doing the verification inside bi_end_io()?
> 
> Then the whole sequence would look like:
> bio_endio()
>     > 1.bio->bi_end_io()
>         > xfs_buf_bio_end_io()
>             > verify, set bio->bi_status = "please retry" if verify fail
>         
>     > 2.if found bio->bi_status = retry
>     > 3.resubmit bio

As I mentioned to Darrick, this isn't as simple as it seems
because what XFS actually does is this:

IO completion thread			Workqueue Thread
bio_endio(bio)
  bio->bi_end_io(bio)
    xfs_buf_bio_end_io(bio)
      bp->b_error = bio->bi_status
      xfs_buf_ioend_async(bp)
        queue_work(bp->b_ioend_wq, bp)
      bio_put(bio)
<io completion done>
					.....
					xfs_buf_ioend(bp)
					  bp->b_ops->read_verify()
					.....

IOWs, XFS does not do read verification inside the bio completion
context, but instead defers it to an external workqueue so it does
not delay processing incoming bio IO completions. Hence there is no
way to get the verification status back to the bio completion (the
bio has already been freed!) to resubmit from there.
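
To make the ordering problem concrete, here is a minimal userspace
sketch of the sequence above. It is a model under stated assumptions,
not XFS code: every sim_* name is hypothetical, free() stands in for
bio_put(), and a flag stands in for queue_work(). The point it shows is
that by the time the verifier runs, only the buffer is reachable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct xfs_buf. */
struct sim_buf {
	int  error;        /* copied from the bio at completion time */
	bool work_queued;  /* stands in for queue_work() */
	bool verified;     /* only known later, in the worker */
};

/* Hypothetical stand-in for struct bio. */
struct sim_bio {
	int             status;
	struct sim_buf *private;
};

/* "IO completion thread": record status, defer verification, drop bio. */
static void sim_bio_end_io(struct sim_bio *bio)
{
	struct sim_buf *bp = bio->private;

	bp->error = bio->status;
	bp->work_queued = true;
	free(bio);  /* stands in for bio_put(); the bio is gone from here on */
}

/* "Workqueue thread": runs later and only has the buffer to work with. */
static bool sim_buf_ioend(struct sim_buf *bp,
			  bool (*read_verify)(const struct sim_buf *))
{
	assert(bp->work_queued);
	bp->work_queued = false;
	bp->verified = (bp->error == 0) && read_verify(bp);
	return bp->verified;  /* no bio left to carry a "please retry" status */
}

/* Example verifier that always rejects the buffer contents. */
static bool sim_verify_fails(const struct sim_buf *bp)
{
	(void)bp;
	return false;
}
```

A caller would allocate a sim_bio, point it at a sim_buf, call
sim_bio_end_io(), and only afterwards call sim_buf_ioend(); a failed
verification there has nowhere to go but the buffer.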

This is one of the reasons I suggested a verifier be added to the
submission, so the bio itself is wholly responsible for running it,
not an external, filesystem level completion function that may
operate outside of bio scope....
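
As a rough illustration of that idea, the sketch below models a mirror
layer that owns the retry loop and runs a verifier supplied at
submission. This is a userspace model only: model_submit_bio_verify(),
struct model_mirror and the toy checksum are all hypothetical, and none
of them is an existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_BLOCK_SIZE 8
#define MODEL_MAX_COPIES 4

/* Verifier callback supplied by the submitter (hypothetical signature). */
typedef bool (*model_verifier_fn)(const unsigned char *data, size_t len);

struct model_bio {
	unsigned char     data[MODEL_BLOCK_SIZE];
	model_verifier_fn verify;  /* may be NULL: accept the first copy */
	int               status;  /* 0 on success, -1 if no copy verified */
};

struct model_mirror {
	size_t        nr_copies;
	unsigned char copies[MODEL_MAX_COPIES][MODEL_BLOCK_SIZE];
};

/*
 * Mirror layer: read each copy in turn until the verifier accepts one or
 * all copies are exhausted. Returns the index of the accepted copy, or -1.
 */
static int model_submit_bio_verify(const struct model_mirror *m,
				   struct model_bio *bio,
				   model_verifier_fn verify)
{
	bio->verify = verify;
	bio->status = -1;

	for (size_t i = 0; i < m->nr_copies && i < MODEL_MAX_COPIES; i++) {
		memcpy(bio->data, m->copies[i], MODEL_BLOCK_SIZE);
		if (!bio->verify || bio->verify(bio->data, MODEL_BLOCK_SIZE)) {
			bio->status = 0;
			return (int)i;
		}
	}
	return -1;
}

/* Toy verifier: last byte must equal the sum of the others (mod 256). */
static bool model_checksum_ok(const unsigned char *data, size_t len)
{
	unsigned char sum = 0;

	for (size_t i = 0; i + 1 < len; i++)
		sum += data[i];
	return len > 0 && sum == data[len - 1];
}
```

In this model the submitter never learns how many copies exist; it only
sees the final status, which is the property argued for above.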

> Is it fine to resubmit a bio inside bio_endio()?

Depends on the context the bio_endio() completion is running in.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
