All of lore.kernel.org
 help / color / mirror / Atom feed
From: Eric Wheeler <bcache@lists.ewheeler.net>
To: Lars Ellenberg <lars.ellenberg@linbit.com>
Cc: Mike Snitzer <snitzer@redhat.com>,
	Eric Wheeler <bcache@lists.ewheeler.net>,
	NeilBrown <neilb@suse.com>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Jiri Kosina <jkosina@suse.cz>, Ming Lei <ming.lei@canonical.com>,
	"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org,
	Takashi Iwai <tiwai@suse.de>,
	linux-bcache@vger.kernel.org, Zheng Liu <gnehzuil.liu@gmail.com>,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Keith Busch <keith.busch@intel.com>,
	dm-devel@redhat.com, Shaohua Li <shli@kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Alasdair Kergon <agk@redhat.com>,
	Roland Kammerer <roland.kammerer@linbit.com>
Subject: Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
Date: Wed, 10 Aug 2016 21:16:52 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LRH.2.11.1608102103040.10662@mail.ewheeler.net> (raw)
In-Reply-To: <20160719090024.GB20868@soda.linbit>

On Tue, 19 Jul 2016, Lars Ellenberg wrote:
> On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 12 2016 at 10:18pm -0400,
> > Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > > On Tue, 12 Jul 2016, NeilBrown wrote:
> > > 
> > > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > > ....
> > > > >
> > > > > Instead, I suggest to distinguish between recursive calls to
> > > > > generic_make_request(), and pushing back the remainder part in
> > > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > > 	struct recursion_to_iteration_bio_lists {
> > > > > 		struct bio_list recursion;
> > > > > 		struct bio_list queue;
> > > > > 	}
> > > > >
> > > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > > bio_list, then merging any recursively submitted bios to the
> > > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > > logic in generic_make_request() process deepest level bios first,
> > > > > and "sibling" bios of the same level in "natural" order.
> > > > >
> > > > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > > > > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> > > > 
> > > > Reviewed-by: NeilBrown <neilb@suse.com>
> > > > 
> > > > Thanks again for doing this - I think this is a very significant
> > > > improvement and could allow other simplifications.
> > > 
> > > Thank you Lars for all of this work!  
> > > 
> > > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > > will certainly address some of those (maybe all of them?).  (I think we 
> > > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > > compatible.
> > > 
> > > 
> > > Do you believe that this patch would solve any of the proposals by others 
> > > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > > list, none of which appear have landed yet as of 4.7-rc7 (but correct me 
> > > if I'm wrong):

[... cut unrelated A,B ... ]

> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > > 		(was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now, this is yet an other issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected be the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(&s->lock) that is held by Process B (in step 5).
> 
> There is an other suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion,
> simply not iterate over all pieces first,
> but process one piece, and return back the the
> iteration in generic_make_request.
> 
> A bit of conflict here may be that DM has all its own
> split and clone and queue magic, and wants to process
> "all of the bio" before returning back to generic_make_request().
> 
> To change that, __split_and_process_bio() and all its helpers
> would need to learn to "push back" (pieces of) the bio they are
> currently working on, and not push back via "DM_ENDIO_REQUEUE",
> but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).
> 
> Then, after they processed each piece,
> *return* all the way up to the top-level generic_make_request(),
> where the recursion-to-iteration logic would then
> make sure that all deeper level bios, submitted via
> recursive calls to generic_make_request() will be processed, before the
> next, pushed back, piece of the "original incoming" bio.
> 
> And *not* do their own iteration over all pieces first.
> 
> Probably not as easy as dropping the while loop,
> using bio_advance, and pushing that "advanced" bio back to
> current->...queue?
> 
> static void __split_and_process_bio(struct mapped_device *md,
> 				    struct dm_table *map, struct bio *bio)
> ...
> 		ci.bio = bio;
> 		ci.sector_count = bio_sectors(bio);
> 		while (ci.sector_count && !error)
> 			error = __split_and_process_non_flush(&ci);
> ...
> 		error = __split_and_process_non_flush(&ci);
> 		if (ci.sector_count)
> 			bio_advance()
> 			bio_list_add_head(&current->bio_lists->queue, )
> ...
> 
> Something like that, maybe?
> Just a thought.

Hello all,

Has anyone been able to make progress with resolution to this issue?  

Might the suggestions from Lars help solve BZ# 119841?
	https://bugzilla.kernel.org/show_bug.cgi?id=119841

--
Eric Wheeler

  parent reply	other threads:[~2016-08-11  4:16 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-08 15:04 [PATCH 0/1] block: fix blk_queue_split() resource exhaustion Lars Ellenberg
2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
2016-07-08 18:49   ` Mike Snitzer
2016-07-11 14:13     ` Lars Ellenberg
2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
2016-07-12  2:55     ` [dm-devel] " NeilBrown
2016-07-13  2:18       ` Eric Wheeler
2016-07-13  2:32         ` Mike Snitzer
2016-07-19  9:00           ` Lars Ellenberg
2016-07-19  9:00             ` Lars Ellenberg
2016-07-21 22:53             ` Eric Wheeler
2016-07-25 20:39               ` Jeff Moyer
2016-08-11  4:16             ` Eric Wheeler [this message]
2017-01-07 19:56             ` Lars Ellenberg
2017-01-07 19:56               ` Lars Ellenberg
2016-09-16 20:14         ` [dm-devel] " Matthias Ferdinand
2016-09-18 23:10           ` Eric Wheeler
2016-09-19 20:43             ` Matthias Ferdinand
2016-09-21 21:08               ` bcache: discard BUG (was: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion) Eric Wheeler
2016-12-23  8:49     ` [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion Michael Wang
2016-12-23 11:45       ` Lars Ellenberg
2016-12-23 11:45         ` Lars Ellenberg
2017-01-02 14:33         ` [dm-devel] " Jack Wang
2017-01-02 14:33           ` Jack Wang
2017-01-04  5:12           ` NeilBrown
2017-01-04  5:12             ` [dm-devel] " NeilBrown
2017-01-04 18:50             ` Mike Snitzer
2017-01-04 18:50               ` Mike Snitzer
2017-01-05 10:54               ` 王金浦
2017-01-05 10:54                 ` 王金浦
2017-01-06 16:50               ` Mikulas Patocka
2017-01-06 16:50                 ` Mikulas Patocka
2017-01-06 17:34                 ` Mikulas Patocka
2017-01-06 17:34                   ` Mikulas Patocka
2017-01-06 17:34                   ` Mikulas Patocka
2017-01-06 19:52                   ` Mike Snitzer
2017-01-06 19:52                     ` Mike Snitzer
2017-01-06 23:01                     ` NeilBrown
2017-01-06 23:01                       ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LRH.2.11.1608102103040.10662@mail.ewheeler.net \
    --to=bcache@lists.ewheeler.net \
    --cc=agk@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=dm-devel@redhat.com \
    --cc=gnehzuil.liu@gmail.com \
    --cc=jkosina@suse.cz \
    --cc=keith.busch@intel.com \
    --cc=kent.overstreet@gmail.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=lars.ellenberg@linbit.com \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-raid@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ming.lei@canonical.com \
    --cc=mingo@redhat.com \
    --cc=neilb@suse.com \
    --cc=peterz@infradead.org \
    --cc=roland.kammerer@linbit.com \
    --cc=shli@kernel.org \
    --cc=snitzer@redhat.com \
    --cc=tiwai@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.