From: Kent Overstreet <koverstreet@google.com>
To: linux-bcache@vger.kernel.org, linux-kernel@vger.kernel.org,
	dm-devel@redhat.com
Cc: Kent Overstreet <koverstreet@google.com>,
	tj@kernel.org, axboe@kernel.dk, vgoyal@redhat.com
Subject: [PATCH v4 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
Date: Mon, 15 Oct 2012 13:09:00 -0700
Message-ID: <1350331769-14856-27-git-send-email-koverstreet@google.com>
In-Reply-To: <1350331769-14856-1-git-send-email-koverstreet@google.com>

Previously, if we ever tried to allocate more than once from the same
bio_set while running under generic_make_request() (i.e. in a stacking
block driver), we risked deadlock.

This is because of the code in generic_make_request() that converts
recursion to iteration; any bios we submit won't actually be submitted
(so they can complete and eventually be freed) until after we return -
this means if we allocate a second bio, we're blocking the first one
from ever being freed.
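
To make the deferral concrete, here is a small user-space sketch of the
recursion-to-iteration scheme. It is only a model, not the kernel code:
every sim_* name is invented for this example, and sim_active stands in
for current->bio_list.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_CHILDREN 8

struct sim_bio {
	struct sim_bio *next;
	int splits_left;	/* how many more bios the toy driver will submit */
	int started;		/* set when the toy driver starts on this bio */
};

struct sim_list {
	struct sim_bio *head, *tail;
};

static struct sim_list *sim_active;	/* stand-in for current->bio_list */
static struct sim_bio sim_children[SIM_MAX_CHILDREN];
static int sim_nchildren;
static int sim_nesting, sim_max_nesting;
static int sim_deferred;	/* children still unstarted after being submitted */

static void sim_list_add(struct sim_list *l, struct sim_bio *bio)
{
	bio->next = NULL;
	if (l->tail)
		l->tail->next = bio;
	else
		l->head = bio;
	l->tail = bio;
}

static struct sim_bio *sim_list_pop(struct sim_list *l)
{
	struct sim_bio *bio = l->head;

	if (bio) {
		l->head = bio->next;
		if (!l->head)
			l->tail = NULL;
		bio->next = NULL;
	}
	return bio;
}

static void sim_submit(struct sim_bio *bio);

/* Toy stacking driver: each bio spawns one child until splits run out. */
static void sim_make_request(struct sim_bio *bio)
{
	if (++sim_nesting > sim_max_nesting)
		sim_max_nesting = sim_nesting;
	bio->started = 1;

	if (bio->splits_left > 0 && sim_nchildren < SIM_MAX_CHILDREN) {
		struct sim_bio *child = &sim_children[sim_nchildren++];

		child->splits_left = bio->splits_left - 1;
		child->started = 0;
		sim_submit(child);
		if (!child->started)	/* only queued, not yet issued */
			sim_deferred++;
	}
	sim_nesting--;
}

/* Model of the submit path: nested calls queue, the outermost call loops. */
static void sim_submit(struct sim_bio *bio)
{
	struct sim_list on_stack = { NULL, NULL };

	assert(bio != NULL);
	if (sim_active) {
		sim_list_add(sim_active, bio);
		return;
	}

	sim_active = &on_stack;
	do {
		sim_make_request(bio);
		bio = sim_list_pop(sim_active);
	} while (bio);
	sim_active = NULL;
}
```

Submitting a bio with splits_left = 3 never nests the toy driver more
than one level deep, and each child is still unstarted at the point
where the driver that submitted it carries on; that window is where a
second allocation from the same pool can get stuck.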

Thus if enough threads call into a stacking block driver at the same
time with bios that need multiple splits, and the bio_set's reserve gets
used up, we deadlock.

This can be worked around in the driver code - we could check if we're
running under generic_make_request(), then mask out __GFP_WAIT when we
go to allocate a bio, and if the allocation fails punt to workqueue and
retry the allocation.

But this is tricky and not a generic solution. This patch solves it for
all users by inverting the previously described technique. We allocate a
rescuer workqueue for each bio_set, and then in the allocation code, if
there are bios on current->bio_list that we would be blocking, we punt
them to the rescuer workqueue to be submitted.
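
The punt step amounts to partitioning a singly linked list by owning
pool while keeping order. A rough user-space sketch follows; the pk_*
names are made up for illustration, there is no locking or workqueue
here, and pk_list_merge() is this example's own helper rather than the
kernel's bio_list_merge().

```c
#include <assert.h>
#include <stddef.h>

struct pk_pool { int id; };

struct pk_bio {
	struct pk_bio *next;
	const struct pk_pool *pool;	/* pool this bio was allocated from */
};

struct pk_list {
	struct pk_bio *head, *tail;
};

static void pk_list_add(struct pk_list *l, struct pk_bio *bio)
{
	bio->next = NULL;
	if (l->tail)
		l->tail->next = bio;
	else
		l->head = bio;
	l->tail = bio;
}

static struct pk_bio *pk_list_pop(struct pk_list *l)
{
	struct pk_bio *bio = l->head;

	if (bio) {
		l->head = bio->next;
		if (!l->head)
			l->tail = NULL;
		bio->next = NULL;
	}
	return bio;
}

/* Append all of src to dst and leave src empty. */
static void pk_list_merge(struct pk_list *dst, struct pk_list *src)
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	src->head = src->tail = NULL;
}

/*
 * Move every bio owned by @pool from @pending to @rescue; bios from other
 * pools stay on @pending in their original order.
 */
static void pk_punt(struct pk_list *pending, struct pk_list *rescue,
		    const struct pk_pool *pool)
{
	struct pk_list punt = { NULL, NULL }, nopunt = { NULL, NULL };
	struct pk_bio *bio;

	assert(pool != NULL);

	/* Singly linked, so pop everything rather than unlink mid-list. */
	while ((bio = pk_list_pop(pending)))
		pk_list_add(bio->pool == pool ? &punt : &nopunt, bio);

	*pending = nopunt;
	pk_list_merge(rescue, &punt);
}
```

With a pending list holding bios from pools A, B, A, B, punting for A
leaves the two B bios pending and moves the two A bios to the rescue
list, both in their original order.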

This guarantees forward progress for bio allocations under
generic_make_request() provided each bio is submitted before allocating
the next, and provided the bios are freed after they complete.
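
As a back-of-the-envelope check of that argument, the following toy
model counts pool slots instead of allocating anything. It is a heavily
simplified sketch: the fw_* names are hypothetical, the "rescuer" runs
synchronously, and punted bios are assumed to complete and be freed at
once, which real I/O does not do (the real allocator would sleep until
they are).

```c
#include <assert.h>

struct fw_pool {
	int free_slots;	/* bios left in the reserve */
	int rescues;	/* times the rescuer had to run */
};

/* Non-blocking attempt: succeeds only if the reserve has a slot. */
static int fw_try_alloc(struct fw_pool *pool)
{
	if (pool->free_slots == 0)
		return 0;
	pool->free_slots--;
	return 1;
}

/* Issue the queued bios; in this model they complete and free at once. */
static void fw_issue(struct fw_pool *pool, int *pending)
{
	pool->free_slots += *pending;
	*pending = 0;
}

/* Try without blocking first; on failure punt our queued bios and retry. */
static int fw_alloc(struct fw_pool *pool, int *pending)
{
	if (fw_try_alloc(pool))
		return 1;
	if (*pending == 0)
		return 0;	/* nothing of ours to punt; real code would sleep */
	pool->rescues++;
	fw_issue(pool, pending);
	return fw_try_alloc(pool);
}

/* Toy driver splitting one request into @nsplits bios; returns bios made. */
static int fw_split(struct fw_pool *pool, int nsplits)
{
	int pending = 0, i;

	assert(nsplits >= 0);
	for (i = 0; i < nsplits; i++) {
		if (!fw_alloc(pool, &pending))
			break;
		pending++;	/* submit each bio before allocating the next */
	}
	fw_issue(pool, &pending);	/* driver returned: the rest get issued */
	return i;
}
```

With a reserve of two, a five-way split completes in this model after
two trips through the rescue path. Without the punt in fw_alloc() the
third allocation would have nothing to wait for.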

Note that this doesn't do anything for allocation from other mempools.
Instead of allocating per bio data structures from a mempool, code
should use bio_set's front_pad.
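
For readers unfamiliar with front_pad, the idea is that the per-bio
structure lives in the same allocation, immediately in front of the
bio, so no second pool is involved. A user-space sketch of the layout
follows; the fp_* names are invented for this example and calloc()
stands in for the bio_set's mempool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct fp_bio {
	unsigned long sector;
};

/* Hypothetical per-bio driver state; the embedded bio must come last. */
struct fp_request {
	int retries;
	struct fp_bio bio;
};

#define fp_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Allocate front_pad bytes plus the bio; return a pointer to the bio. */
static struct fp_bio *fp_bio_alloc(size_t front_pad)
{
	char *p = calloc(1, front_pad + sizeof(struct fp_bio));

	return p ? (struct fp_bio *)(p + front_pad) : NULL;
}

/* Free using the same front_pad that was passed to fp_bio_alloc(). */
static void fp_bio_free(struct fp_bio *bio, size_t front_pad)
{
	assert(bio != NULL);
	free((char *)bio - front_pad);
}
```

A driver in this model would pass offsetof(struct fp_request, bio) as
the front pad and recover its state with
fp_container_of(bio, struct fp_request, bio).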

Tested by forcing the rescue codepath to be taken (by disabling the
first GFP_NOWAIT attempt), then running it with bcache (which does a lot
of arbitrary bio splitting) and verifying that the rescuer was being
invoked.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Muthukumar Ratty <muthur@gmail.com>
---
 fs/bio.c            | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/bio.h |   9 ++++
 2 files changed, 123 insertions(+), 2 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 9298c65..9aa1938 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -295,6 +295,54 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+static void bio_alloc_rescue(struct work_struct *work)
+{
+	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&bs->rescue_lock);
+		bio = bio_list_pop(&bs->rescue_list);
+		spin_unlock(&bs->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+static void punt_bios_to_rescuer(struct bio_set *bs)
+{
+	struct bio_list punt, nopunt;
+	struct bio *bio;
+
+	/*
+	 * In order to guarantee forward progress we must punt only bios that
+	 * were allocated from this bio_set; otherwise, if there was a bio on
+	 * there for a stacking driver higher up in the stack, processing it
+	 * could require allocating bios from this bio_set, and doing that from
+	 * our own rescuer would be bad.
+	 *
+	 * Since bio lists are singly linked, pop them all instead of trying to
+	 * remove from the middle of the list:
+	 */
+
+	bio_list_init(&punt);
+	bio_list_init(&nopunt);
+
+	while ((bio = bio_list_pop(current->bio_list)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+
+	*current->bio_list = nopunt;
+
+	spin_lock(&bs->rescue_lock);
+	bio_list_merge(&bs->rescue_list, &punt);
+	spin_unlock(&bs->rescue_lock);
+
+	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -312,11 +360,27 @@ EXPORT_SYMBOL(bio_reset);
  *   previously allocated bio for IO before attempting to allocate a new one.
  *   Failure to do so can cause deadlocks under memory pressure.
  *
+ *   Note that when running under generic_make_request() (i.e. any block
+ *   driver), bios are not submitted until after you return - see the code in
+ *   generic_make_request() that converts recursion into iteration, to prevent
+ *   stack overflows.
+ *
+ *   This would normally mean allocating multiple bios under
+ *   generic_make_request() would be susceptible to deadlocks, but we have
+ *   deadlock avoidance code that resubmits any blocked bios from a rescuer
+ *   thread.
+ *
+ *   However, we do not guarantee forward progress for allocations from other
+ *   mempools. Doing multiple allocations from the same mempool under
+ *   generic_make_request() should be avoided - instead, use bio_set's front_pad
+ *   for per bio allocations.
+ *
  *   RETURNS:
  *   Pointer to new bio on success, NULL on failure.
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
+	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -334,7 +398,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
+		/*
+		 * generic_make_request() converts recursion to iteration; this
+		 * means if we're running beneath it, any bios we allocate and
+		 * submit will not be submitted (and thus freed) until after we
+		 * return.
+		 *
+		 * This exposes us to a potential deadlock if we allocate
+		 * multiple bios from the same bio_set() while running
+		 * underneath generic_make_request(). If we were to allocate
+		 * multiple bios (say a stacking block driver that was splitting
+		 * bios), we would deadlock if we exhausted the mempool's
+		 * reserve.
+		 *
+		 * We solve this, and guarantee forward progress, with a rescuer
+		 * workqueue per bio_set. If we go to allocate and there are
+		 * bios on current->bio_list, we first try the allocation
+		 * without __GFP_WAIT; if that fails, we punt those bios we
+		 * would be blocking to the rescuer workqueue before we retry
+		 * with the original gfp_flags.
+		 */
+
+		if (current->bio_list && !bio_list_empty(current->bio_list))
+			gfp_mask &= ~__GFP_WAIT;
+
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
+		if (!p && gfp_mask != saved_gfp) {
+			punt_bios_to_rescuer(bs);
+			gfp_mask = saved_gfp;
+			p = mempool_alloc(bs->bio_pool, gfp_mask);
+		}
+
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -347,6 +441,12 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
+		if (!bvl && gfp_mask != saved_gfp) {
+			punt_bios_to_rescuer(bs);
+			gfp_mask = saved_gfp;
+			bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
+		}
+
 		if (unlikely(!bvl))
 			goto err_free;
 	} else if (nr_iovecs) {
@@ -1575,6 +1675,9 @@ static void biovec_free_pools(struct bio_set *bs)
 
 void bioset_free(struct bio_set *bs)
 {
+	if (bs->rescue_workqueue)
+		destroy_workqueue(bs->rescue_workqueue);
+
 	if (bs->bio_pool)
 		mempool_destroy(bs->bio_pool);
 
@@ -1610,6 +1713,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 
 	bs->front_pad = front_pad;
 
+	spin_lock_init(&bs->rescue_lock);
+	bio_list_init(&bs->rescue_list);
+	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab) {
 		kfree(bs);
@@ -1620,9 +1727,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (!biovec_create_pools(bs, pool_size))
-		return bs;
+	if (biovec_create_pools(bs, pool_size))
+		goto bad;
+
+	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!bs->rescue_workqueue)
+		goto bad;
 
+	return bs;
 bad:
 	bioset_free(bs);
 	return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 93d3d17..b31036f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -513,6 +513,15 @@ struct bio_set {
 	mempool_t *bio_integrity_pool;
 #endif
 	mempool_t *bvec_pool;
+
+	/*
+	 * Deadlock avoidance for stacking block drivers: see comments in
+	 * bio_alloc_bioset() for details
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {
-- 
1.7.12
