* [PATCH 0/2] Avoid deadlocks with bio allocation
@ 2012-09-07 22:12 Kent Overstreet
  2012-09-07 22:12 ` [PATCH 1/2] block: Reorder struct bio_set Kent Overstreet
  2012-09-07 22:12 ` [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers Kent Overstreet
From: Kent Overstreet @ 2012-09-07 22:12 UTC
  To: linux-bcache, linux-kernel, dm-devel; +Cc: Kent Overstreet, tj, axboe

These patches were part of the block cleanups series I just sent out, but I
split them off. Nothing's changed with them lately; the last thing I added was
a bit of logic to the "punt to rescuer" code to only punt bios that were
allocated from the current bio_set.

Kent Overstreet (2):
  block: Reorder struct bio_set
  block: Avoid deadlocks with bio allocation by stacking drivers

 fs/bio.c            | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/bio.h | 75 +++++++++++++++++++++++++--------------------
 2 files changed, 126 insertions(+), 36 deletions(-)

-- 
1.7.12



* [PATCH 1/2] block: Reorder struct bio_set
  2012-09-07 22:12 [PATCH 0/2] Avoid deadlocks with bio allocation Kent Overstreet
@ 2012-09-07 22:12 ` Kent Overstreet
  2012-09-07 22:12 ` [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers Kent Overstreet
From: Kent Overstreet @ 2012-09-07 22:12 UTC
  To: linux-bcache, linux-kernel, dm-devel; +Cc: Kent Overstreet, tj, axboe

This is prep work for the next patch, which embeds a struct bio_list in
struct bio_set.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
---
 include/linux/bio.h | 66 ++++++++++++++++++++++++++---------------------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 52b9cbc..a2bfe3a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -298,39 +298,6 @@ static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
 static inline void bio_disassociate_task(struct bio *bio) { }
 #endif	/* CONFIG_BLK_CGROUP */
 
-/*
- * bio_set is used to allow other portions of the IO system to
- * allocate their own private memory pools for bio and iovec structures.
- * These memory pools in turn all allocate from the bio_slab
- * and the bvec_slabs[].
- */
-#define BIO_POOL_SIZE 2
-#define BIOVEC_NR_POOLS 6
-#define BIOVEC_MAX_IDX	(BIOVEC_NR_POOLS - 1)
-
-struct bio_set {
-	struct kmem_cache *bio_slab;
-	unsigned int front_pad;
-
-	mempool_t *bio_pool;
-#if defined(CONFIG_BLK_DEV_INTEGRITY)
-	mempool_t *bio_integrity_pool;
-#endif
-	mempool_t *bvec_pool;
-};
-
-struct biovec_slab {
-	int nr_vecs;
-	char *name;
-	struct kmem_cache *slab;
-};
-
-/*
- * a small number of entries is fine, not going to be performance critical.
- * basically we just need to survive
- */
-#define BIO_SPLIT_ENTRIES 2
-
 #ifdef CONFIG_HIGHMEM
 /*
  * remember never ever reenable interrupts between a bvec_kmap_irq and
@@ -505,6 +472,39 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 	return bio;
 }
 
+/*
+ * bio_set is used to allow other portions of the IO system to
+ * allocate their own private memory pools for bio and iovec structures.
+ * These memory pools in turn all allocate from the bio_slab
+ * and the bvec_slabs[].
+ */
+#define BIO_POOL_SIZE 2
+#define BIOVEC_NR_POOLS 6
+#define BIOVEC_MAX_IDX	(BIOVEC_NR_POOLS - 1)
+
+struct bio_set {
+	struct kmem_cache *bio_slab;
+	unsigned int front_pad;
+
+	mempool_t *bio_pool;
+#if defined(CONFIG_BLK_DEV_INTEGRITY)
+	mempool_t *bio_integrity_pool;
+#endif
+	mempool_t *bvec_pool;
+};
+
+struct biovec_slab {
+	int nr_vecs;
+	char *name;
+	struct kmem_cache *slab;
+};
+
+/*
+ * a small number of entries is fine, not going to be performance critical.
+ * basically we just need to survive
+ */
+#define BIO_SPLIT_ENTRIES 2
+
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
 
 #define bip_vec_idx(bip, idx)	(&(bip->bip_vec[(idx)]))
-- 
1.7.12



* [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-07 22:12 [PATCH 0/2] Avoid deadlocks with bio allocation Kent Overstreet
  2012-09-07 22:12 ` [PATCH 1/2] block: Reorder struct bio_set Kent Overstreet
@ 2012-09-07 22:12 ` Kent Overstreet
  2012-09-08 19:36   ` Tejun Heo
From: Kent Overstreet @ 2012-09-07 22:12 UTC
  To: linux-bcache, linux-kernel, dm-devel; +Cc: Kent Overstreet, tj, axboe

Previously, if we ever tried to allocate more than once from the same bio
set while running under generic_make_request() (i.e. in a stacking block
driver), we risked deadlock.

This is because of the code in generic_make_request() that converts
recursion to iteration; any bios we submit won't actually be submitted
(so they can complete and eventually be freed) until after we return -
this means if we allocate a second bio, we're blocking the first one
from ever being freed.
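
For reference, the recursion-to-iteration logic in generic_make_request()
looks roughly like this (a simplified sketch, not the exact upstream
code):

void generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	if (current->bio_list) {
		/* already iterating: queue the bio for the outer call */
		bio_list_add(current->bio_list, bio);
		return;
	}

	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		q->make_request_fn(q, bio);
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;
}

Any bio submitted from inside make_request_fn() just sits on
current->bio_list until the driver returns.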

Thus if enough threads call into a stacking block driver at the same
time with bios that need multiple splits, and the bio_set's reserve gets
used up, we deadlock.

This can be worked around in the driver code - we could check if we're
running under generic_make_request(), then mask out __GFP_WAIT when we
go to allocate a bio, and if the allocation fails punt to workqueue and
retry the allocation.
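
A sketch of that per-driver workaround, for comparison (hypothetical
driver code - the my_* names are stand-ins, not anything in the tree):

static void my_make_request(struct request_queue *q, struct bio *parent)
{
	gfp_t gfp = GFP_NOIO;
	struct bio *split;

	if (current->bio_list)	/* running under generic_make_request()? */
		gfp &= ~__GFP_WAIT;

	split = bio_alloc_bioset(gfp, nr_vecs, bs);
	if (!split) {
		/* punt to process context and retry the allocation there */
		queue_work(my_wq, &my_retry_work);
		return;
	}
	...
}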

But this is tricky and not a generic solution. This patch solves it for
all users by inverting the previously described technique. We allocate a
rescuer workqueue for each bio_set, and then in the allocation code if
there are bios on current->bio_list we would be blocking, we punt them
to the rescuer workqueue to be submitted.

Tested it by forcing the rescue codepath to be taken (by disabling the
first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
of arbitrary bio splitting) and verified that the rescuer was being
invoked.

Signed-off-by: Kent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
---
 fs/bio.c            | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/bio.h |  9 ++++++
 2 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 13e9567..244007f 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -295,6 +295,43 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+static void bio_alloc_rescue(struct work_struct *work)
+{
+	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&bs->rescue_lock);
+		bio = bio_list_pop(&bs->rescue_list);
+		spin_unlock(&bs->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+static void punt_bios_to_rescuer(struct bio_set *bs)
+{
+	struct bio_list punt, nopunt;
+	struct bio *bio;
+
+	bio_list_init(&punt);
+	bio_list_init(&nopunt);
+
+	while ((bio = bio_list_pop(current->bio_list)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+
+	*current->bio_list = nopunt;
+
+	spin_lock(&bs->rescue_lock);
+	bio_list_merge(&bs->rescue_list, &punt);
+	spin_unlock(&bs->rescue_lock);
+
+	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -317,6 +354,7 @@ EXPORT_SYMBOL(bio_reset);
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
+	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -334,13 +372,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
+		/*
+		 * generic_make_request() converts recursion to iteration; this
+		 * means if we're running beneath it, any bios we allocate and
+		 * submit will not be submitted (and thus freed) until after we
+		 * return.
+		 *
+		 * This exposes us to a potential deadlock if we allocate
+		 * multiple bios from the same bio_set while running
+		 * underneath generic_make_request(). If we were to allocate
+		 * multiple bios (say a stacking block driver that was splitting
+		 * bios), we would deadlock if we exhausted the mempool's
+		 * reserve.
+		 *
+		 * We solve this, and guarantee forward progress, with a rescuer
+		 * workqueue per bio_set. If we go to allocate and there are
+		 * bios on current->bio_list, we first try the allocation
+		 * without __GFP_WAIT; if that fails, we punt those bios we
+		 * would be blocking to the rescuer workqueue before we retry
+		 * with the original gfp_flags.
+		 */
+
+		if (current->bio_list && !bio_list_empty(current->bio_list))
+			gfp_mask &= ~__GFP_WAIT;
+retry:
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
 
 	if (unlikely(!p))
-		return NULL;
+		goto err;
 
 	bio = p + front_pad;
 	bio_init(bio);
@@ -361,6 +423,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 err_free:
 	mempool_free(p, bs->bio_pool);
+err:
+	if (gfp_mask != saved_gfp) {
+		punt_bios_to_rescuer(bs);
+		gfp_mask = saved_gfp;
+		goto retry;
+	}
+
 	return NULL;
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1570,6 +1639,9 @@ static void biovec_free_pools(struct bio_set *bs)
 
 void bioset_free(struct bio_set *bs)
 {
+	if (bs->rescue_workqueue)
+		destroy_workqueue(bs->rescue_workqueue);
+
 	if (bs->bio_pool)
 		mempool_destroy(bs->bio_pool);
 
@@ -1605,6 +1677,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 
 	bs->front_pad = front_pad;
 
+	spin_lock_init(&bs->rescue_lock);
+	bio_list_init(&bs->rescue_list);
+	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab) {
 		kfree(bs);
@@ -1615,9 +1691,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (!biovec_create_pools(bs, pool_size))
-		return bs;
+	if (biovec_create_pools(bs, pool_size))
+		goto bad;
+
+	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!bs->rescue_workqueue)
+		goto bad;
 
+	return bs;
 bad:
 	bioset_free(bs);
 	return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a2bfe3a..c32ea0d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -491,6 +491,15 @@ struct bio_set {
 	mempool_t *bio_integrity_pool;
 #endif
 	mempool_t *bvec_pool;
+
+	/*
+	 * Deadlock avoidance for stacking block drivers: see comments in
+	 * bio_alloc_bioset() for details
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {
-- 
1.7.12



* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-07 22:12 ` [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers Kent Overstreet
@ 2012-09-08 19:36   ` Tejun Heo
  2012-09-10  0:28     ` Kent Overstreet
From: Tejun Heo @ 2012-09-08 19:36 UTC
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

(Restoring cc list from the previous discussion.  Please retain the cc
list of the people who discussed in the previous postings.)

On Fri, Sep 07, 2012 at 03:12:53PM -0700, Kent Overstreet wrote:
> But this is tricky and not a generic solution. This patch solves it for
> all users by inverting the previously described technique. We allocate a
> rescuer workqueue for each bio_set, and then in the allocation code if
> there are bios on current->bio_list we would be blocking, we punt them
> to the rescuer workqueue to be submitted.

It would be great if this explanation can be expanded a bit.  Why does
it make the deadlock condition go away?  What are the restrictions -
e.g. using other mempools for additional per-bio data structures
wouldn't work, right?

> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +	struct bio_list punt, nopunt;
> +	struct bio *bio;
> +
> +	bio_list_init(&punt);
> +	bio_list_init(&nopunt);
> +
> +	while ((bio = bio_list_pop(current->bio_list)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +
> +	*current->bio_list = nopunt;

Why this is necessary needs explanation, and it's done in a rather
unusual way.  I suppose the weirdness is from the bio_list API
restriction?

> +	spin_lock(&bs->rescue_lock);
> +	bio_list_merge(&bs->rescue_list, &punt);
> +	spin_unlock(&bs->rescue_lock);
> +
> +	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +}
> +
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -317,6 +354,7 @@ EXPORT_SYMBOL(bio_reset);
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> +	gfp_t saved_gfp = gfp_mask;
>  	unsigned front_pad;
>  	unsigned inline_vecs;
>  	unsigned long idx = BIO_POOL_NONE;
> @@ -334,13 +372,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		front_pad = 0;
>  		inline_vecs = nr_iovecs;
>  	} else {
> +		/*
> +		 * generic_make_request() converts recursion to iteration; this
> +		 * means if we're running beneath it, any bios we allocate and
> +		 * submit will not be submitted (and thus freed) until after we
> +		 * return.
> +		 *
> +		 * This exposes us to a potential deadlock if we allocate
> +		 * multiple bios from the same bio_set() while running
> +		 * underneath generic_make_request(). If we were to allocate
> +		 * multiple bios (say a stacking block driver that was splitting
> +		 * bios), we would deadlock if we exhausted the mempool's
> +		 * reserve.
> +		 *
> +		 * We solve this, and guarantee forward progress, with a rescuer
> +		 * workqueue per bio_set. If we go to allocate and there are
> +		 * bios on current->bio_list, we first try the allocation
> +		 * without __GFP_WAIT; if that fails, we punt those bios we
> +		 * would be blocking to the rescuer workqueue before we retry
> +		 * with the original gfp_flags.
> +		 */
> +
> +		if (current->bio_list && !bio_list_empty(current->bio_list))
> +			gfp_mask &= ~__GFP_WAIT;
> +retry:
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
>  		front_pad = bs->front_pad;
>  		inline_vecs = BIO_INLINE_VECS;
>  	}

Wouldn't the following be better?

	p = mempool_alloc(bs->bi_pool, gfp_mask);
	if (unlikely(!p) && gfp_mask != saved_gfp) {
		punt_bios_to_rescuer(bs);
		p = mempool_alloc(bs->bi_pool, saved_gfp);
	}

I really hope the usage restriction (don't mix with mempool) for
bioset is clearly documented somewhere appropriate.

Thanks.

-- 
tejun


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-08 19:36   ` Tejun Heo
@ 2012-09-10  0:28     ` Kent Overstreet
  2012-09-10 15:25       ` Vivek Goyal
  2012-09-10 17:22       ` Tejun Heo
From: Kent Overstreet @ 2012-09-10  0:28 UTC
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

On Sat, Sep 08, 2012 at 12:36:41PM -0700, Tejun Heo wrote:
> (Restoring cc list from the previous discussion.  Please retain the cc
> list of the people who discussed in the previous postings.)
> 
> On Fri, Sep 07, 2012 at 03:12:53PM -0700, Kent Overstreet wrote:
> > But this is tricky and not a generic solution. This patch solves it for
> > all users by inverting the previously described technique. We allocate a
> > rescuer workqueue for each bio_set, and then in the allocation code if
> > there are bios on current->bio_list we would be blocking, we punt them
> > to the rescuer workqueue to be submitted.
> 
> It would be great if this explanation can be expanded a bit.  Why does
> it make the deadlock condition go away?  What are the restrictions -
> e.g. using other mempools for additional per-bio data structures
> wouldn't work, right?

Ok, I'll do that. New patch below.

> > +static void punt_bios_to_rescuer(struct bio_set *bs)
> > +{
> > +	struct bio_list punt, nopunt;
> > +	struct bio *bio;
> > +
> > +	bio_list_init(&punt);
> > +	bio_list_init(&nopunt);
> > +
> > +	while ((bio = bio_list_pop(current->bio_list)))
> > +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> > +
> > +	*current->bio_list = nopunt;
> 
> > Why this is necessary needs explanation, and it's done in a rather
> > unusual way.  I suppose the weirdness is from the bio_list API
> restriction?

It's because bio_lists are singly linked, so deleting an entry from the
middle of the list would be a real pain - just much cleaner/simpler to
do it this way.
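
For reference, a bio_list is just a head and a tail pointer, with the
bios chained through bio->bi_next:

struct bio_list {
	struct bio *head;
	struct bio *tail;
};

With no back pointers, removing an arbitrary entry means walking from
the head anyway, so popping everything and redistributing onto two
fresh lists costs the same and is much simpler.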

> > +	spin_lock(&bs->rescue_lock);
> > +	bio_list_merge(&bs->rescue_list, &punt);
> > +	spin_unlock(&bs->rescue_lock);
> > +
> > +	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +}
> > +
> >  /**
> >   * bio_alloc_bioset - allocate a bio for I/O
> >   * @gfp_mask:   the GFP_ mask given to the slab allocator
> > @@ -317,6 +354,7 @@ EXPORT_SYMBOL(bio_reset);
> >   */
> >  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  {
> > +	gfp_t saved_gfp = gfp_mask;
> >  	unsigned front_pad;
> >  	unsigned inline_vecs;
> >  	unsigned long idx = BIO_POOL_NONE;
> > @@ -334,13 +372,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  		front_pad = 0;
> >  		inline_vecs = nr_iovecs;
> >  	} else {
> > +		/*
> > +		 * generic_make_request() converts recursion to iteration; this
> > +		 * means if we're running beneath it, any bios we allocate and
> > +		 * submit will not be submitted (and thus freed) until after we
> > +		 * return.
> > +		 *
> > +		 * This exposes us to a potential deadlock if we allocate
> > +		 * multiple bios from the same bio_set() while running
> > +		 * underneath generic_make_request(). If we were to allocate
> > +		 * multiple bios (say a stacking block driver that was splitting
> > +		 * bios), we would deadlock if we exhausted the mempool's
> > +		 * reserve.
> > +		 *
> > +		 * We solve this, and guarantee forward progress, with a rescuer
> > +		 * workqueue per bio_set. If we go to allocate and there are
> > +		 * bios on current->bio_list, we first try the allocation
> > +		 * without __GFP_WAIT; if that fails, we punt those bios we
> > +		 * would be blocking to the rescuer workqueue before we retry
> > +		 * with the original gfp_flags.
> > +		 */
> > +
> > +		if (current->bio_list && !bio_list_empty(current->bio_list))
> > +			gfp_mask &= ~__GFP_WAIT;
> > +retry:
> >  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> >  		front_pad = bs->front_pad;
> >  		inline_vecs = BIO_INLINE_VECS;
> >  	}
> 
> Wouldn't the following be better?
> 
> 	p = mempool_alloc(bs->bi_pool, gfp_mask);
> 	if (unlikely(!p) && gfp_mask != saved_gfp) {
> 		punt_bios_to_rescuer(bs);
> 		p = mempool_alloc(bs->bi_pool, saved_gfp);
> 	}

That'd require duplicating the error handling in two different places -
once for the initial allocation, once for the bvec allocation. And I
really hate that writing code that does

alloc_something()
if (fail) {
	alloc_something_again()
}

it just screams ugly to me.

> I really hope the usage restriction (don't mix with mempool) for
> bioset is clearly documented somewhere appropriate.

Good point, I'm adding it to bio_alloc_bioset's documentation.
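
To illustrate the front_pad alternative (a minimal sketch - struct
my_tio and the helpers are made up, but it's the same pattern dm's
dm_target_io conversion uses):

struct my_tio {
	void		*private;	/* per-bio state, no extra mempool */
	struct bio	bio;		/* must be the last member */
};

static struct bio_set *my_bs;

static int my_init(void)
{
	/* every bio from this set gets offsetof(struct my_tio, bio)
	 * bytes of leading padding, from the same allocation: */
	my_bs = bioset_create(BIO_POOL_SIZE, offsetof(struct my_tio, bio));
	return my_bs ? 0 : -ENOMEM;
}

/* recover the private state from a bio allocated out of my_bs: */
static struct my_tio *my_tio_of(struct bio *bio)
{
	return container_of(bio, struct my_tio, bio);
}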


commit 4edb21e0b749fc098c72edcb4f9abdeca6fc62cd
Author: Kent Overstreet <koverstreet@google.com>
Date:   Sun Sep 9 17:23:29 2012 -0700

    block: Avoid deadlocks with bio allocation by stacking drivers
    
    Previously, if we ever tried to allocate more than once from the same bio
    set while running under generic_make_request() (i.e. in a stacking block
    driver), we risked deadlock.
    
    This is because of the code in generic_make_request() that converts
    recursion to iteration; any bios we submit won't actually be submitted
    (so they can complete and eventually be freed) until after we return -
    this means if we allocate a second bio, we're blocking the first one
    from ever being freed.
    
    Thus if enough threads call into a stacking block driver at the same
    time with bios that need multiple splits, and the bio_set's reserve gets
    used up, we deadlock.
    
    This can be worked around in the driver code - we could check if we're
    running under generic_make_request(), then mask out __GFP_WAIT when we
    go to allocate a bio, and if the allocation fails punt to workqueue and
    retry the allocation.
    
    But this is tricky and not a generic solution. This patch solves it for
    all users by inverting the previously described technique. We allocate a
    rescuer workqueue for each bio_set, and then in the allocation code if
    there are bios on current->bio_list we would be blocking, we punt them
    to the rescuer workqueue to be submitted.
    
    This guarantees forward progress for bio allocations under
    generic_make_request() provided each bio is submitted before allocating
    the next, and provided the bios are freed after they complete.
    
    Note that this doesn't do anything for allocation from other mempools.
    Instead of allocating per bio data structures from a mempool, code
    should use bio_set's front_pad.
    
    Tested it by forcing the rescue codepath to be taken (by disabling the
    first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
    of arbitrary bio splitting) and verified that the rescuer was being
    invoked.
    
    Signed-off-by: Kent Overstreet <koverstreet@google.com>
    CC: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/bio.c b/fs/bio.c
index 13e9567..fa07e1f 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -295,6 +295,53 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+static void bio_alloc_rescue(struct work_struct *work)
+{
+	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&bs->rescue_lock);
+		bio = bio_list_pop(&bs->rescue_list);
+		spin_unlock(&bs->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+static void punt_bios_to_rescuer(struct bio_set *bs)
+{
+	struct bio_list punt, nopunt;
+	struct bio *bio;
+
+	/*
+	 * Don't want to punt all bios on current->bio_list; if there was a bio
+	 * on there for a stacking driver higher up in the stack, processing it
+	 * could require allocating bios from this bio_set, and we don't want to
+	 * do that from our own rescuer.
+	 *
+	 * Since bio lists are singly linked, pop them all instead of trying to
+	 * remove from the middle of the list:
+	 */
+
+	bio_list_init(&punt);
+	bio_list_init(&nopunt);
+
+	while ((bio = bio_list_pop(current->bio_list)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+
+	*current->bio_list = nopunt;
+
+	spin_lock(&bs->rescue_lock);
+	bio_list_merge(&bs->rescue_list, &punt);
+	spin_unlock(&bs->rescue_lock);
+
+	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -312,11 +359,27 @@ EXPORT_SYMBOL(bio_reset);
  *   previously allocated bio for IO before attempting to allocate a new one.
  *   Failure to do so can cause deadlocks under memory pressure.
  *
+ *   Note that when running under generic_make_request() (i.e. any block
+ *   driver), bios are not submitted until after you return - see the code in
+ *   generic_make_request() that converts recursion into iteration, to prevent
+ *   stack overflows.
+ *
+ *   This would normally mean allocating multiple bios under
+ *   generic_make_request() would be susceptible to deadlocks, but we have
+ *   deadlock avoidance code that resubmits any blocked bios from a rescuer
+ *   thread.
+ *
+ *   However, we do not guarantee forward progress for allocations from other
+ *   mempools. Doing multiple allocations from the same mempool under
+ *   generic_make_request() should be avoided - instead, use bio_set's front_pad
+ *   for per bio allocations.
+ *
  *   RETURNS:
  *   Pointer to new bio on success, NULL on failure.
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
+	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -334,13 +397,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
+		/*
+		 * generic_make_request() converts recursion to iteration; this
+		 * means if we're running beneath it, any bios we allocate and
+		 * submit will not be submitted (and thus freed) until after we
+		 * return.
+		 *
+		 * This exposes us to a potential deadlock if we allocate
+		 * multiple bios from the same bio_set while running
+		 * underneath generic_make_request(). If we were to allocate
+		 * multiple bios (say a stacking block driver that was splitting
+		 * bios), we would deadlock if we exhausted the mempool's
+		 * reserve.
+		 *
+		 * We solve this, and guarantee forward progress, with a rescuer
+		 * workqueue per bio_set. If we go to allocate and there are
+		 * bios on current->bio_list, we first try the allocation
+		 * without __GFP_WAIT; if that fails, we punt those bios we
+		 * would be blocking to the rescuer workqueue before we retry
+		 * with the original gfp_flags.
+		 */
+
+		if (current->bio_list && !bio_list_empty(current->bio_list))
+			gfp_mask &= ~__GFP_WAIT;
+retry:
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
 
 	if (unlikely(!p))
-		return NULL;
+		goto err;
 
 	bio = p + front_pad;
 	bio_init(bio);
@@ -361,6 +448,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 err_free:
 	mempool_free(p, bs->bio_pool);
+err:
+	if (gfp_mask != saved_gfp) {
+		punt_bios_to_rescuer(bs);
+		gfp_mask = saved_gfp;
+		goto retry;
+	}
+
 	return NULL;
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
@@ -1570,6 +1664,9 @@ static void biovec_free_pools(struct bio_set *bs)
 
 void bioset_free(struct bio_set *bs)
 {
+	if (bs->rescue_workqueue)
+		destroy_workqueue(bs->rescue_workqueue);
+
 	if (bs->bio_pool)
 		mempool_destroy(bs->bio_pool);
 
@@ -1605,6 +1702,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 
 	bs->front_pad = front_pad;
 
+	spin_lock_init(&bs->rescue_lock);
+	bio_list_init(&bs->rescue_list);
+	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab) {
 		kfree(bs);
@@ -1615,9 +1716,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (!biovec_create_pools(bs, pool_size))
-		return bs;
+	if (biovec_create_pools(bs, pool_size))
+		goto bad;
+
+	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!bs->rescue_workqueue)
+		goto bad;
 
+	return bs;
 bad:
 	bioset_free(bs);
 	return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a2bfe3a..c32ea0d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -491,6 +491,15 @@ struct bio_set {
 	mempool_t *bio_integrity_pool;
 #endif
 	mempool_t *bvec_pool;
+
+	/*
+	 * Deadlock avoidance for stacking block drivers: see comments in
+	 * bio_alloc_bioset() for details
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10  0:28     ` Kent Overstreet
@ 2012-09-10 15:25       ` Vivek Goyal
  2012-09-10 17:22       ` Tejun Heo
From: Vivek Goyal @ 2012-09-10 15:25 UTC
  To: Kent Overstreet
  Cc: Tejun Heo, linux-bcache, linux-kernel, dm-devel, axboe,
	Mikulas Patocka, bharrosh, david

On Sun, Sep 09, 2012 at 05:28:10PM -0700, Kent Overstreet wrote:

[..]
> > > +retry:
> > >  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> > >  		front_pad = bs->front_pad;
> > >  		inline_vecs = BIO_INLINE_VECS;
> > >  	}
> > 
> > Wouldn't the following be better?
> > 
> > 	p = mempool_alloc(bs->bi_pool, gfp_mask);
> > 	if (unlikely(!p) && gfp_mask != saved_gfp) {
> > 		punt_bios_to_rescuer(bs);
> > 		p = mempool_alloc(bs->bi_pool, saved_gfp);
> > 	}
> 
> That'd require duplicating the error handling in two different places -
> once for the initial allocation, once for the bvec allocation. And I
> really hate that writing code that does
> 
> alloc_something()
> if (fail) {
> 	alloc_something_again()
> }
> 
> it just screams ugly to me.

Personally I kind of like Tejun's suggestion. Yes, it means that you
will have to do it twice (once for the bio and once for the bvecs). But
at the same time, if the bvec allocation fails, you don't have to free
the already-allocated bio and retry the bio allocation again.  The
current code is doing that, and I am not sure why we should free the
allocated bio and retry again in case of bvec allocation failure.

Secondly, the current code does the following.

                if (unlikely(!bvl))
                        goto err_free;

err_free:
	mempool_free(p, bs->bio_pool);
err:
	retry_allocation.

I am not sure, but err_free gives the impression of a "free up whatever
was allocated and return NULL" path, when instead we end up retrying the
allocation.

Implementing Tejun's idea at least keeps the error and retry code
separate. I am not too particular about it; just a thought. Though I do
want to know why we free up the already-allocated bio and retry in case
of bvec allocation failure.

Thanks
Vivek


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10  0:28     ` Kent Overstreet
  2012-09-10 15:25       ` Vivek Goyal
@ 2012-09-10 17:22       ` Tejun Heo
  2012-09-10 20:24         ` Kent Overstreet
From: Tejun Heo @ 2012-09-10 17:22 UTC
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

Hello, Kent.

On Sun, Sep 09, 2012 at 05:28:10PM -0700, Kent Overstreet wrote:
> > > +	while ((bio = bio_list_pop(current->bio_list)))
> > > +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> > > +
> > > +	*current->bio_list = nopunt;
> > 
> > > Why this is necessary needs explanation, and it's done in a rather
> > > unusual way.  I suppose the weirdness is from the bio_list API
> > restriction?
> 
> It's because bio_lists are singly linked, so deleting an entry from the
> middle of the list would be a real pain - just much cleaner/simpler to
> do it this way.

Yeah, I wonder how beneficial that singly linked list is.  Eh well...

> > Wouldn't the following be better?
> > 
> > 	p = mempool_alloc(bs->bi_pool, gfp_mask);
> > 	if (unlikely(!p) && gfp_mask != saved_gfp) {
> > 		punt_bios_to_rescuer(bs);
> > 		p = mempool_alloc(bs->bi_pool, saved_gfp);
> > 	}
> 
> That'd require duplicating the error handling in two different places -
> once for the initial allocation, once for the bvec allocation. And I
> really hate that writing code that does
> 
> alloc_something()
> if (fail) {
> 	alloc_something_again()
> }
> 
> it just screams ugly to me.

I don't know.  That at least represents what's going on and goto'ing
back and forth is hardly pretty.  Sometimes the code gets much uglier
/ unwieldy and we have to live with gotos.  Here, that doesn't seem to
be the case.

> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +	struct bio_list punt, nopunt;
> +	struct bio *bio;
> +
> +	/*
> +	 * Don't want to punt all bios on current->bio_list; if there was a bio
> +	 * on there for a stacking driver higher up in the stack, processing it
> +	 * could require allocating bios from this bio_set, and we don't want to
> +	 * do that from our own rescuer.

Hmmm... isn't it more like we "must" process only the bios which are
from this bio_set to have any kind of forward-progress guarantee?  The
above sounds like it's just something undesirable.

Thanks.

-- 
tejun


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 17:22       ` Tejun Heo
@ 2012-09-10 20:24         ` Kent Overstreet
  2012-09-10 20:40           ` Tejun Heo
From: Kent Overstreet @ 2012-09-10 20:24 UTC
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

On Mon, Sep 10, 2012 at 10:22:10AM -0700, Tejun Heo wrote:
> Hello, Kent.
> 
> On Sun, Sep 09, 2012 at 05:28:10PM -0700, Kent Overstreet wrote:
> > > > +	while ((bio = bio_list_pop(current->bio_list)))
> > > > +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> > > > +
> > > > +	*current->bio_list = nopunt;
> > > 
> > > Why this is necessary needs explanation, and it's done in a rather
> > > unusual way.  I suppose the weirdness is from the bio_list API
> > > restriction?
> > 
> > It's because bio_lists are singly linked, so deleting an entry from the
> > middle of the list would be a real pain - just much cleaner/simpler to
> > do it this way.
> 
> Yeah, I wonder how beneficial that singly linked list is.  Eh well...

Well, this is the first time I can think of that it's come up, and IMO
this is no less clean a way of writing it... just a bit unusual in C;
it feels more functional than imperative to me.

> > > Wouldn't the following be better?
> > > 
> > > 	p = mempool_alloc(bs->bi_pool, gfp_mask);
> > > 	if (unlikely(!p) && gfp_mask != saved_gfp) {
> > > 		punt_bios_to_rescuer(bs);
> > > 		p = mempool_alloc(bs->bi_pool, saved_gfp);
> > > 	}
> > 
> > That'd require duplicating the error handling in two different places -
> > once for the initial allocation, once for the bvec allocation. And I
> > really hate that writing code that does
> > 
> > alloc_something()
> > if (fail) {
> > 	alloc_something_again()
> > }
> > 
> > it just screams ugly to me.
> 
> I don't know.  That at least represents what's going on and goto'ing
> back and forth is hardly pretty.  Sometimes the code gets much uglier
> / unwieldy and we have to live with gotos.  Here, that doesn't seem to
> be the case.

I think this is really more personal preference than anything, but:

Setting gfp_mask = saved_gfp after calling punt_bios_to_rescuer() is
really the correct thing to do, and makes the code clearer IMO: once
we've run punt_bios_to_rescuer() we don't need to mask out __GFP_WAIT
(not until the next time a bio is submitted, really).

This matters a bit for the bvl allocation too: if we call
punt_bios_to_rescuer() for the bio allocation, there's no point doing it
again.

So to be rigorously correct, your way would have to be

p = mempool_alloc(bs->bio_pool, gfp_mask);
if (!p && gfp_mask != saved_gfp) {
	punt_bios_to_rescuer(bs);
	gfp_mask = saved_gfp;
	p = mempool_alloc(bs->bio_pool, gfp_mask);
}

And at that point, why duplicate that line of code? It doesn't matter that
much, but IMO a goto retry better labels what's actually going on (it's
something that's not uncommon in the kernel and if I see a retry label
in a function I pretty immediately have an idea of what's going on).

So we could do

retry:
p = mempool_alloc(bs->bio_pool, gfp_mask);
if (!p && gfp_mask != saved_gfp) {
	punt_bios_to_rescuer(bs);
	gfp_mask = saved_gfp;
	goto retry;
}

(side note: not that it really matters here, but gcc will inline the
bvec_alloc_bs() call if it's not duplicated; I've never seen it
consolidate duplicated code and /then/ inline based off that)

This does have the advantage that we're not freeing and reallocating the
bio like Vivek pointed out, but I'm not a huge fan of having the
punting/retry logic in the main code path.

I don't care that much though. I'd prefer not to have the actual
allocations duplicated, but it's starting to feel like bikeshedding to
me.

> > +static void punt_bios_to_rescuer(struct bio_set *bs)
> > +{
> > +	struct bio_list punt, nopunt;
> > +	struct bio *bio;
> > +
> > +	/*
> > +	 * Don't want to punt all bios on current->bio_list; if there was a bio
> > +	 * on there for a stacking driver higher up in the stack, processing it
> > +	 * could require allocating bios from this bio_set, and we don't want to
> > +	 * do that from our own rescuer.
> 
> Hmmm... isn't it more like we "must" process only the bios which are
> from this bio_set to have any kind of forward-progress guarantee?  The
> above sounds like it's just something undesirable.

Yeah, that'd be better, I'll change it.


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 20:24         ` Kent Overstreet
@ 2012-09-10 20:40           ` Tejun Heo
  2012-09-10 21:33             ` Kent Overstreet
From: Tejun Heo @ 2012-09-10 20:40 UTC
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

Hello, Kent.

On Mon, Sep 10, 2012 at 01:24:35PM -0700, Kent Overstreet wrote:
> And at that point, why duplicate that line of code? It doesn't matter that
> much, but IMO a goto retry better labels what's actually going on (it's
> something that's not uncommon in the kernel and if I see a retry label
> in a function I pretty immediately have an idea of what's going on).
> 
> So we could do
> 
> retry:
> p = mempool_alloc(bs->bio_pool, gfp_mask);
> if (!p && gfp_mask != saved_gfp) {
> 	punt_bios_to_rescuer(bs);
> 	gfp_mask = saved_gfp;
> 	goto retry;
> }

Yes, we do retry loops if that makes the code simpler.  Doing that to
save one extra alloc call, I don't think so.

Thanks.

-- 
tejun


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 20:40           ` Tejun Heo
@ 2012-09-10 21:33             ` Kent Overstreet
  2012-09-10 21:37               ` Tejun Heo
From: Kent Overstreet @ 2012-09-10 21:33 UTC
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

On Mon, Sep 10, 2012 at 01:40:10PM -0700, Tejun Heo wrote:
> Hello, Kent.
> 
> On Mon, Sep 10, 2012 at 01:24:35PM -0700, Kent Overstreet wrote:
> > And at that point, why duplicate that line of code? It doesn't matter that
> > much, but IMO a goto retry better labels what's actually going on (it's
> > something that's not uncommon in the kernel and if I see a retry label
> > in a function I pretty immediately have an idea of what's going on).
> > 
> > So we could do
> > 
> > retry:
> > p = mempool_alloc(bs->bio_pool, gfp_mask);
> > if (!p && gfp_mask != saved_gfp) {
> > 	punt_bios_to_rescuer(bs);
> > 	gfp_mask = saved_gfp;
> > 	goto retry;
> > }
> 
> Yes, we do retry loops if that makes the code simpler.  Doing that to
> save one extra alloc call, I don't think so.

"Simpler" isn't really an objective thing though. To me the goto version
is more obvious/idiomatic.

Eh. I'll do it your way, but consider this a formal objection :p


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 21:33             ` Kent Overstreet
@ 2012-09-10 21:37               ` Tejun Heo
  2012-09-10 21:56                 ` Kent Overstreet
From: Tejun Heo @ 2012-09-10 21:37 UTC
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

On Mon, Sep 10, 2012 at 02:33:49PM -0700, Kent Overstreet wrote:
> "Simpler" isn't really an objective thing though. To me the goto version
> is more obvious/idiomatic.
> 
> Eh. I'll do it your way, but consider this a formal objection :p

Thanks. :)

-- 
tejun


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 21:37               ` Tejun Heo
@ 2012-09-10 21:56                 ` Kent Overstreet
  2012-09-10 22:09                   ` Tejun Heo
  2012-09-11 18:36                   ` Muthu Kumar
From: Kent Overstreet @ 2012-09-10 21:56 UTC
  To: Tejun Heo
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

On Mon, Sep 10, 2012 at 02:37:10PM -0700, Tejun Heo wrote:
> On Mon, Sep 10, 2012 at 02:33:49PM -0700, Kent Overstreet wrote:
> > "Simpler" isn't really an objective thing though. To me the goto version
> > is more obvious/idiomatic.
> > 
> > Eh. I'll do it your way, but consider this a formal objection :p
> 
> Thanks. :)

Here's the current version. Good enough for an acked-by?

commit df7e63cbffa3065fcc4ba2b9a93418d7c7312243
Author: Kent Overstreet <koverstreet@google.com>
Date:   Mon Sep 10 14:33:46 2012 -0700

    block: Avoid deadlocks with bio allocation by stacking drivers
    
    Previously, if we ever tried to allocate more than once from the same bio
    set while running under generic_make_request() (i.e. in a stacking block
    driver), we risked deadlock.
    
    This is because of the code in generic_make_request() that converts
    recursion to iteration; any bios we submit won't actually be submitted
    (so they can complete and eventually be freed) until after we return -
    this means if we allocate a second bio, we're blocking the first one
    from ever being freed.
    
    Thus if enough threads call into a stacking block driver at the same
    time with bios that need multiple splits, and the bio_set's reserve gets
    used up, we deadlock.
    
    This can be worked around in the driver code - we could check if we're
    running under generic_make_request(), then mask out __GFP_WAIT when we
    go to allocate a bio, and if the allocation fails punt to workqueue and
    retry the allocation.
    
    But this is tricky and not a generic solution. This patch solves it for
    all users by inverting the previously described technique. We allocate a
    rescuer workqueue for each bio_set, and then in the allocation code if
    there are bios on current->bio_list we would be blocking, we punt them
    to the rescuer workqueue to be submitted.
    
    This guarantees forward progress for bio allocations under
    generic_make_request() provided each bio is submitted before allocating
    the next, and provided the bios are freed after they complete.
    
    Note that this doesn't do anything for allocation from other mempools.
    Instead of allocating per bio data structures from a mempool, code
    should use bio_set's front_pad.
    
    Tested it by forcing the rescue codepath to be taken (by disabling the
    first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
    of arbitrary bio splitting) and verified that the rescuer was being
    invoked.
    
    Signed-off-by: Kent Overstreet <koverstreet@google.com>
    CC: Jens Axboe <axboe@kernel.dk>

diff --git a/fs/bio.c b/fs/bio.c
index 13e9567..4783e31 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -295,6 +295,54 @@ void bio_reset(struct bio *bio)
 }
 EXPORT_SYMBOL(bio_reset);
 
+static void bio_alloc_rescue(struct work_struct *work)
+{
+	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&bs->rescue_lock);
+		bio = bio_list_pop(&bs->rescue_list);
+		spin_unlock(&bs->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+static void punt_bios_to_rescuer(struct bio_set *bs)
+{
+	struct bio_list punt, nopunt;
+	struct bio *bio;
+
+	/*
+	 * In order to guarantee forward progress we must punt only bios that
+	 * were allocated from this bio_set; otherwise, if there was a bio on
+	 * there for a stacking driver higher up in the stack, processing it
+	 * could require allocating bios from this bio_set, and doing that from
+	 * our own rescuer would be bad.
+	 *
+	 * Since bio lists are singly linked, pop them all instead of trying to
+	 * remove from the middle of the list:
+	 */
+
+	bio_list_init(&punt);
+	bio_list_init(&nopunt);
+
+	while ((bio = bio_list_pop(current->bio_list)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+
+	*current->bio_list = nopunt;
+
+	spin_lock(&bs->rescue_lock);
+	bio_list_merge(&bs->rescue_list, &punt);
+	spin_unlock(&bs->rescue_lock);
+
+	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -312,11 +360,27 @@ EXPORT_SYMBOL(bio_reset);
  *   previously allocated bio for IO before attempting to allocate a new one.
  *   Failure to do so can cause deadlocks under memory pressure.
  *
+ *   Note that when running under generic_make_request() (i.e. any block
+ *   driver), bios are not submitted until after you return - see the code in
+ *   generic_make_request() that converts recursion into iteration, to prevent
+ *   stack overflows.
+ *
+ *   This would normally mean allocating multiple bios under
+ *   generic_make_request() would be susceptible to deadlocks, but we have
+ *   deadlock avoidance code that resubmits any blocked bios from a rescuer
+ *   thread.
+ *
+ *   However, we do not guarantee forward progress for allocations from other
+ *   mempools. Doing multiple allocations from the same mempool under
+ *   generic_make_request() should be avoided - instead, use bio_set's front_pad
+ *   for per bio allocations.
+ *
  *   RETURNS:
  *   Pointer to new bio on success, NULL on failure.
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
+	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -334,7 +398,37 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
+		/*
+		 * generic_make_request() converts recursion to iteration; this
+		 * means if we're running beneath it, any bios we allocate and
+		 * submit will not be submitted (and thus freed) until after we
+		 * return.
+		 *
+		 * This exposes us to a potential deadlock if we allocate
+		 * multiple bios from the same bio_set while running
+		 * underneath generic_make_request(). If we were to allocate
+		 * multiple bios (say a stacking block driver that was splitting
+		 * bios), we would deadlock if we exhausted the mempool's
+		 * reserve.
+		 *
+		 * We solve this, and guarantee forward progress, with a rescuer
+		 * workqueue per bio_set. If we go to allocate and there are
+		 * bios on current->bio_list, we first try the allocation
+		 * without __GFP_WAIT; if that fails, we punt those bios we
+		 * would be blocking to the rescuer workqueue before we retry
+		 * with the original gfp_flags.
+		 */
+
+		if (current->bio_list && !bio_list_empty(current->bio_list))
+			gfp_mask &= ~__GFP_WAIT;
+
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
+		if (!p && gfp_mask != saved_gfp) {
+			punt_bios_to_rescuer(bs);
+			gfp_mask = saved_gfp;
+			p = mempool_alloc(bs->bio_pool, gfp_mask);
+		}
+
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -347,6 +441,12 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
+		if (!bvl && gfp_mask != saved_gfp) {
+			punt_bios_to_rescuer(bs);
+			gfp_mask = saved_gfp;
+			bvl = bvec_alloc_bs(gfp_mask, nr_iovecs, &idx, bs);
+		}
+
 		if (unlikely(!bvl))
 			goto err_free;
 	} else if (nr_iovecs) {
@@ -1570,6 +1670,9 @@ static void biovec_free_pools(struct bio_set *bs)
 
 void bioset_free(struct bio_set *bs)
 {
+	if (bs->rescue_workqueue)
+		destroy_workqueue(bs->rescue_workqueue);
+
 	if (bs->bio_pool)
 		mempool_destroy(bs->bio_pool);
 
@@ -1605,6 +1708,10 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 
 	bs->front_pad = front_pad;
 
+	spin_lock_init(&bs->rescue_lock);
+	bio_list_init(&bs->rescue_list);
+	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
+
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab) {
 		kfree(bs);
@@ -1615,9 +1722,14 @@ struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 	if (!bs->bio_pool)
 		goto bad;
 
-	if (!biovec_create_pools(bs, pool_size))
-		return bs;
+	if (biovec_create_pools(bs, pool_size))
+		goto bad;
+
+	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!bs->rescue_workqueue)
+		goto bad;
 
+	return bs;
 bad:
 	bioset_free(bs);
 	return NULL;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index a2bfe3a..c32ea0d 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -491,6 +491,15 @@ struct bio_set {
 	mempool_t *bio_integrity_pool;
 #endif
 	mempool_t *bvec_pool;
+
+	/*
+	 * Deadlock avoidance for stacking block drivers: see comments in
+	 * bio_alloc_bioset() for details
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {


* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 21:56                 ` Kent Overstreet
@ 2012-09-10 22:09                   ` Tejun Heo
  2012-09-10 22:50                     ` [dm-devel] " Alasdair G Kergon
  2012-09-10 23:13                     ` Tejun Heo
  2012-09-11 18:36                   ` Muthu Kumar
From: Tejun Heo @ 2012-09-10 22:09 UTC
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

Hello, Kent.

On Mon, Sep 10, 2012 at 02:56:33PM -0700, Kent Overstreet wrote:
> commit df7e63cbffa3065fcc4ba2b9a93418d7c7312243
> Author: Kent Overstreet <koverstreet@google.com>
> Date:   Mon Sep 10 14:33:46 2012 -0700
> 
>     block: Avoid deadlocks with bio allocation by stacking drivers
>     
> >     Previously, if we ever tried to allocate more than once from the same bio
> >     set while running under generic_make_request() (i.e. in a stacking block
> >     driver), we risked deadlock.
>     
>     This is because of the code in generic_make_request() that converts
>     recursion to iteration; any bios we submit won't actually be submitted
>     (so they can complete and eventually be freed) until after we return -
>     this means if we allocate a second bio, we're blocking the first one
>     from ever being freed.
>     
>     Thus if enough threads call into a stacking block driver at the same
>     time with bios that need multiple splits, and the bio_set's reserve gets
>     used up, we deadlock.
>     
>     This can be worked around in the driver code - we could check if we're
>     running under generic_make_request(), then mask out __GFP_WAIT when we
>     go to allocate a bio, and if the allocation fails punt to workqueue and
>     retry the allocation.
>     
>     But this is tricky and not a generic solution. This patch solves it for
>     all users by inverting the previously described technique. We allocate a
>     rescuer workqueue for each bio_set, and then in the allocation code if
>     there are bios on current->bio_list we would be blocking, we punt them
>     to the rescuer workqueue to be submitted.
>     
>     This guarantees forward progress for bio allocations under
>     generic_make_request() provided each bio is submitted before allocating
>     the next, and provided the bios are freed after they complete.
>     
>     Note that this doesn't do anything for allocation from other mempools.
>     Instead of allocating per bio data structures from a mempool, code
>     should use bio_set's front_pad.
>     
> >     Tested it by forcing the rescue codepath to be taken (by disabling the
> >     first GFP_NOWAIT attempt), and then ran it with bcache (which does a lot
>     of arbitrary bio splitting) and verified that the rescuer was being
>     invoked.
>     
>     Signed-off-by: Kent Overstreet <koverstreet@google.com>
>     CC: Jens Axboe <axboe@kernel.dk>

I'm still a bit scared but think this is correct.

 Acked-by: Tejun Heo <tj@kernel.org>

One last thing is that we may want to add @name on bioset creation so
that we can name the workqueue properly, but that's for another patch.
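
Something like this, perhaps (a purely hypothetical signature, just to
illustrate):

struct bio_set *bioset_create(const char *name, unsigned int pool_size,
			      unsigned int front_pad);
...
bs->rescue_workqueue = alloc_workqueue("bioset/%s", WQ_MEM_RECLAIM,
				       0, name);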

Thanks.

-- 
tejun


* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 22:09                   ` Tejun Heo
@ 2012-09-10 22:50                     ` Alasdair G Kergon
  2012-09-10 23:01                       ` Tejun Heo
  2012-09-10 23:01                       ` Kent Overstreet
  2012-09-10 23:13                     ` Tejun Heo
From: Alasdair G Kergon @ 2012-09-10 22:50 UTC
  To: Kent Overstreet, axboe, device-mapper development, david,
	linux-kernel, linux-bcache, Mikulas Patocka, bharrosh,
	Vivek Goyal

On Mon, Sep 10, 2012 at 03:09:10PM -0700, Tejun Heo wrote:
> On Mon, Sep 10, 2012 at 02:56:33PM -0700, Kent Overstreet wrote:
> > commit df7e63cbffa3065fcc4ba2b9a93418d7c7312243
> > Author: Kent Overstreet <koverstreet@google.com>
> > Date:   Mon Sep 10 14:33:46 2012 -0700
> > 
> >     block: Avoid deadlocks with bio allocation by stacking drivers

> >     Note that this doesn't do anything for allocation from other mempools.

Note that dm has several cases of this, so this patch should not be used with
dm yet.  Mikulas is studying those cases to see whether anything like this
might be feasible/sensible or not.

> I'm still a bit scared but think this is correct.
>  Acked-by: Tejun Heo <tj@kernel.org>
 
Alasdair



* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 22:50                     ` [dm-devel] " Alasdair G Kergon
@ 2012-09-10 23:01                       ` Tejun Heo
  2012-09-10 23:06                         ` Kent Overstreet
  2012-09-10 23:09                         ` Alasdair G Kergon
  2012-09-10 23:01                       ` Kent Overstreet
From: Tejun Heo @ 2012-09-10 23:01 UTC
  To: Kent Overstreet, axboe, device-mapper development, david,
	linux-kernel, linux-bcache, Mikulas Patocka, bharrosh,
	Vivek Goyal

Hello,

On Mon, Sep 10, 2012 at 3:50 PM, Alasdair G Kergon <agk@redhat.com> wrote:
>> >     Note that this doesn't do anything for allocation from other mempools.
>
> Note that dm has several cases of this, so this patch should not be used with
> dm yet.  Mikulas is studying those cases to see whether anything like this
> might be feasible/sensible or not.

IIUC, Kent posted a patch which converts all of them to use front-pad
(there's no reason not to, really). This better come after that, but
it's not like this is gonna break something which isn't broken now.

Thanks.

-- 
tejun


* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 22:50                     ` [dm-devel] " Alasdair G Kergon
  2012-09-10 23:01                       ` Tejun Heo
@ 2012-09-10 23:01                       ` Kent Overstreet
From: Kent Overstreet @ 2012-09-10 23:01 UTC
  To: axboe, device-mapper development, david, linux-kernel,
	linux-bcache, Mikulas Patocka, bharrosh, Vivek Goyal

On Mon, Sep 10, 2012 at 11:50:57PM +0100, Alasdair G Kergon wrote:
> On Mon, Sep 10, 2012 at 03:09:10PM -0700, Tejun Heo wrote:
> > On Mon, Sep 10, 2012 at 02:56:33PM -0700, Kent Overstreet wrote:
> > > commit df7e63cbffa3065fcc4ba2b9a93418d7c7312243
> > > Author: Kent Overstreet <koverstreet@google.com>
> > > Date:   Mon Sep 10 14:33:46 2012 -0700
> > > 
> > >     block: Avoid deadlocks with bio allocation by stacking drivers
> 
> > >     Note that this doesn't do anything for allocation from other mempools.
> 
> Note that dm has several cases of this, so this patch should not be used with
> dm yet.

That just means it won't affect dm one way or the other for those
allocations.

> Mikulas is studying those cases to see whether anything like this
> might be feasible/sensible or not.

I've got a patch that eliminates one of the per-bio mempools in dm, and
I'll probably work on the rest after I finish off immutable biovecs -
which is mostly done; just cleaning up/testing/pushing patches now.


commit 8754349145edfc791450d3ad54c19f0f3715c86c
Author: Kent Overstreet <koverstreet@google.com>
Date:   Tue Sep 4 06:17:56 2012 -0700

    dm: Use bioset's front_pad for dm_target_io

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f2eb730..3cf39b0 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -71,6 +71,7 @@ struct dm_target_io {
 	struct dm_io *io;
 	struct dm_target *ti;
 	union map_info info;
+	struct bio clone;
 };
 
 /*
@@ -174,7 +175,7 @@ struct mapped_device {
 	 * io objects are allocated from here.
 	 */
 	mempool_t *io_pool;
-	mempool_t *tio_pool;
+	mempool_t *rq_tio_pool;
 
 	struct bio_set *bs;
 
@@ -214,15 +215,8 @@ struct dm_md_mempools {
 
 #define MIN_IOS 256
 static struct kmem_cache *_io_cache;
-static struct kmem_cache *_tio_cache;
 static struct kmem_cache *_rq_tio_cache;
 
-/*
- * Unused now, and needs to be deleted. But since io_pool is overloaded and it's
- * still used for _io_cache, I'm leaving this for a later cleanup
- */
-static struct kmem_cache *_rq_bio_info_cache;
-
 static int __init local_init(void)
 {
 	int r = -ENOMEM;
@@ -232,22 +226,13 @@ static int __init local_init(void)
 	if (!_io_cache)
 		return r;
 
-	/* allocate a slab for the target ios */
-	_tio_cache = KMEM_CACHE(dm_target_io, 0);
-	if (!_tio_cache)
-		goto out_free_io_cache;
-
 	_rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0);
 	if (!_rq_tio_cache)
-		goto out_free_tio_cache;
-
-	_rq_bio_info_cache = KMEM_CACHE(dm_rq_clone_bio_info, 0);
-	if (!_rq_bio_info_cache)
-		goto out_free_rq_tio_cache;
+		goto out_free_io_cache;
 
 	r = dm_uevent_init();
 	if (r)
-		goto out_free_rq_bio_info_cache;
+		goto out_free_rq_tio_cache;
 
 	_major = major;
 	r = register_blkdev(_major, _name);
@@ -261,12 +246,8 @@ static int __init local_init(void)
 
 out_uevent_exit:
 	dm_uevent_exit();
-out_free_rq_bio_info_cache:
-	kmem_cache_destroy(_rq_bio_info_cache);
 out_free_rq_tio_cache:
 	kmem_cache_destroy(_rq_tio_cache);
-out_free_tio_cache:
-	kmem_cache_destroy(_tio_cache);
 out_free_io_cache:
 	kmem_cache_destroy(_io_cache);
 
@@ -275,9 +256,7 @@ out_free_io_cache:
 
 static void local_exit(void)
 {
-	kmem_cache_destroy(_rq_bio_info_cache);
 	kmem_cache_destroy(_rq_tio_cache);
-	kmem_cache_destroy(_tio_cache);
 	kmem_cache_destroy(_io_cache);
 	unregister_blkdev(_major, _name);
 	dm_uevent_exit();
@@ -461,20 +440,15 @@ static void free_io(struct mapped_device *md, struct dm_io *io)
 	mempool_free(io, md->io_pool);
 }
 
-static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
-{
-	mempool_free(tio, md->tio_pool);
-}
-
 static struct dm_rq_target_io *alloc_rq_tio(struct mapped_device *md,
 					    gfp_t gfp_mask)
 {
-	return mempool_alloc(md->tio_pool, gfp_mask);
+	return mempool_alloc(md->rq_tio_pool, gfp_mask);
 }
 
 static void free_rq_tio(struct dm_rq_target_io *tio)
 {
-	mempool_free(tio, tio->md->tio_pool);
+	mempool_free(tio, tio->md->rq_tio_pool);
 }
 
 static int md_in_flight(struct mapped_device *md)
@@ -658,7 +632,6 @@ static void clone_endio(struct bio *bio, int error)
 	int r = 0;
 	struct dm_target_io *tio = bio->bi_private;
 	struct dm_io *io = tio->io;
-	struct mapped_device *md = tio->io->md;
 	dm_endio_fn endio = tio->ti->type->end_io;
 
 	if (!bio_flagged(bio, BIO_UPTODATE) && !error)
@@ -681,7 +654,6 @@ static void clone_endio(struct bio *bio, int error)
 		}
 	}
 
-	free_tio(md, tio);
 	bio_put(bio);
 	dec_pending(io, error);
 }
@@ -998,13 +970,16 @@ int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
 }
 EXPORT_SYMBOL_GPL(dm_set_target_max_io_len);
 
-static void __map_bio(struct dm_target *ti, struct bio *clone,
-		      struct dm_target_io *tio)
+static void __map_bio(struct dm_io *io, struct dm_target *ti, struct bio *clone)
 {
+	struct dm_target_io *tio = container_of(clone, struct dm_target_io, clone);
 	int r;
 	sector_t sector;
 	struct mapped_device *md;
 
+	tio->io = io;
+	tio->ti = ti;
+
 	clone->bi_end_io = clone_endio;
 	clone->bi_private = tio;
 
@@ -1028,7 +1003,6 @@ static void __map_bio(struct dm_target *ti, struct bio *clone,
 		md = tio->io->md;
 		dec_pending(tio->io, r);
 		bio_put(clone);
-		free_tio(md, tio);
 	} else if (r) {
 		DMWARN("unimplemented target map return value: %d", r);
 		BUG();
@@ -1104,26 +1078,18 @@ static struct bio *clone_bio(struct bio *bio, sector_t sector,
 	return clone;
 }
 
-static struct dm_target_io *alloc_tio(struct clone_info *ci,
-				      struct dm_target *ti)
+static void init_tio(struct bio *bio)
 {
-	struct dm_target_io *tio = mempool_alloc(ci->md->tio_pool, GFP_NOIO);
-
-	tio->io = ci->io;
-	tio->ti = ti;
+	struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
 	memset(&tio->info, 0, sizeof(tio->info));
-
-	return tio;
 }
 
 static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
 				   unsigned request_nr, sector_t len)
 {
-	struct dm_target_io *tio = alloc_tio(ci, ti);
+	struct dm_target_io *tio;
 	struct bio *clone;
 
-	tio->info.target_request_nr = request_nr;
-
 	/*
 	 * Discard requests require the bio's inline iovecs be initialized.
 	 * ci->bio->bi_max_vecs is BIO_INLINE_VECS anyway, for both flush
@@ -1136,7 +1102,10 @@ static void __issue_target_request(struct clone_info *ci, struct dm_target *ti,
 		clone->bi_size = to_bytes(len);
 	}
 
-	__map_bio(ti, clone, tio);
+	tio = container_of(clone, struct dm_target_io, clone);
+	tio->info.target_request_nr = request_nr;
+
+	__map_bio(ci->io, ti, clone);
 }
 
 static void __issue_target_requests(struct clone_info *ci, struct dm_target *ti,
@@ -1166,13 +1135,13 @@ static int __clone_and_map_empty_flush(struct clone_info *ci)
 static void __clone_and_map_simple(struct clone_info *ci, struct dm_target *ti)
 {
 	struct bio *clone, *bio = ci->bio;
-	struct dm_target_io *tio;
 
-	tio = alloc_tio(ci, ti);
 	clone = clone_bio(bio, ci->sector, ci->idx,
 			  bio->bi_vcnt - ci->idx, ci->sector_count,
 			  ci->md->bs);
-	__map_bio(ti, clone, tio);
+
+	init_tio(clone);
+	__map_bio(ci->io, ti, clone);
 	ci->sector_count = 0;
 }
 
@@ -1213,7 +1182,6 @@ static int __clone_and_map(struct clone_info *ci)
 	struct bio *clone, *bio = ci->bio;
 	struct dm_target *ti;
 	sector_t len = 0, max;
-	struct dm_target_io *tio;
 
 	if (unlikely(bio->bi_rw & REQ_DISCARD))
 		return __clone_and_map_discard(ci);
@@ -1250,10 +1218,11 @@ static int __clone_and_map(struct clone_info *ci)
 			len += bv_len;
 		}
 
-		tio = alloc_tio(ci, ti);
 		clone = clone_bio(bio, ci->sector, ci->idx, i - ci->idx, len,
 				  ci->md->bs);
-		__map_bio(ti, clone, tio);
+
+		init_tio(clone);
+		__map_bio(ci->io, ti, clone);
 
 		ci->sector += len;
 		ci->sector_count -= len;
@@ -1278,12 +1247,12 @@ static int __clone_and_map(struct clone_info *ci)
 
 			len = min(remaining, max);
 
-			tio = alloc_tio(ci, ti);
 			clone = split_bvec(bio, ci->sector, ci->idx,
 					   bv->bv_offset + offset, len,
 					   ci->md->bs);
 
-			__map_bio(ti, clone, tio);
+			init_tio(clone);
+			__map_bio(ci->io, ti, clone);
 
 			ci->sector += len;
 			ci->sector_count -= len;
@@ -1911,8 +1880,8 @@ static void free_dev(struct mapped_device *md)
 	unlock_fs(md);
 	bdput(md->bdev);
 	destroy_workqueue(md->wq);
-	if (md->tio_pool)
-		mempool_destroy(md->tio_pool);
+	if (md->rq_tio_pool)
+		mempool_destroy(md->rq_tio_pool);
 	if (md->io_pool)
 		mempool_destroy(md->io_pool);
 	if (md->bs)
@@ -1935,16 +1904,16 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
 	struct dm_md_mempools *p;
 
-	if (md->io_pool && md->tio_pool && md->bs)
+	if ((md->io_pool || md->rq_tio_pool) && md->bs)
 		/* the md already has necessary mempools */
 		goto out;
 
 	p = dm_table_get_md_mempools(t);
-	BUG_ON(!p || md->io_pool || md->tio_pool || md->bs);
+	BUG_ON(!p || md->io_pool || md->rq_tio_pool || md->bs);
 
 	md->io_pool = p->io_pool;
 	p->io_pool = NULL;
-	md->tio_pool = p->tio_pool;
+	md->rq_tio_pool = p->tio_pool;
 	p->tio_pool = NULL;
 	md->bs = p->bs;
 	p->bs = NULL;
@@ -2693,40 +2662,29 @@ struct dm_md_mempools *dm_alloc_md_mempools(unsigned type, unsigned integrity)
 	if (!pools)
 		return NULL;
 
-	pools->io_pool = (type == DM_TYPE_BIO_BASED) ?
-			 mempool_create_slab_pool(MIN_IOS, _io_cache) :
-			 mempool_create_slab_pool(MIN_IOS, _rq_bio_info_cache);
+	if (type == DM_TYPE_BIO_BASED)
+		pools->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache);
 	if (!pools->io_pool)
-		goto free_pools_and_out;
+		goto err;
 
-	pools->tio_pool = (type == DM_TYPE_BIO_BASED) ?
-			  mempool_create_slab_pool(MIN_IOS, _tio_cache) :
-			  mempool_create_slab_pool(MIN_IOS, _rq_tio_cache);
+	if (type == DM_TYPE_REQUEST_BASED)
+		pools->tio_pool =
+			mempool_create_slab_pool(MIN_IOS, _rq_tio_cache);
 	if (!pools->tio_pool)
-		goto free_io_pool_and_out;
+		goto err;
 
 	pools->bs = bioset_create(pool_size,
-				  offsetof(struct dm_rq_clone_bio_info, clone));
+				  max(offsetof(struct dm_target_io, clone),
+				      offsetof(struct dm_rq_clone_bio_info, clone)));
 	if (!pools->bs)
-		goto free_tio_pool_and_out;
+		goto err;
 
 	if (integrity && bioset_integrity_create(pools->bs, pool_size))
-		goto free_bioset_and_out;
+		goto err;
 
 	return pools;
-
-free_bioset_and_out:
-	bioset_free(pools->bs);
-
-free_tio_pool_and_out:
-	mempool_destroy(pools->tio_pool);
-
-free_io_pool_and_out:
-	mempool_destroy(pools->io_pool);
-
-free_pools_and_out:
-	kfree(pools);
-
+err:
+	dm_free_md_mempools(pools);
 	return NULL;
 }
 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 23:01                       ` Tejun Heo
@ 2012-09-10 23:06                         ` Kent Overstreet
  2012-09-10 23:09                         ` Alasdair G Kergon
  1 sibling, 0 replies; 26+ messages in thread
From: Kent Overstreet @ 2012-09-10 23:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: axboe, device-mapper development, david, linux-kernel,
	linux-bcache, Mikulas Patocka, bharrosh, Vivek Goyal

On Mon, Sep 10, 2012 at 04:01:01PM -0700, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 10, 2012 at 3:50 PM, Alasdair G Kergon <agk@redhat.com> wrote:
> >> >     Note that this doesn't do anything for allocation from other mempools.
> >
> > Note that dm has several cases of this, so this patch should not be used with
> > dm yet.  Mikulas is studying those cases to see whether anything like this
> > might be feasible/sensible or not.
> 
> IIUC, Kent posted a patch which converts all of them to use front-pad
> (there's no reason not to, really). This better come after that but
> it's not like this is gonna break something which isn't broken now.

Not all - I only did the easy one. You know how dm has all those crazy
abstraction layers? They've got multiple per-bio allocations because of
that: the core dm code does one, and then some other code takes that
struct dm_io * and allocates its own state pointing at it (which then
points to the original bio...)

So front_pad should still work, but you'd need to have, say, dm-crypt
pass the amount of front pad it needs to the core dm code when the
bio_set is created; dm-crypt could then embed its per-bio state and get
at it with container_of(), like everything else that uses the bio_set
front pad does.

*I'm probably misremembering all the names.
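
Roughly this shape, say - every name below is invented, but it's the
same trick the dm_target_io patch elsewhere in this thread pulls:

struct crypt_clone {
	struct dm_target *ti;		/* stand-in per-bio state */
	struct bio bio;			/* must be the last member */
};

static struct bio_set *crypt_bs;

static int crypt_bioset_init(void)
{
	/* The target hands its front-pad requirement to whoever
	 * creates the bio_set: */
	crypt_bs = bioset_create(MIN_IOS,
				 offsetof(struct crypt_clone, bio));
	return crypt_bs ? 0 : -ENOMEM;
}

static void crypt_clone_endio(struct bio *bio, int error)
{
	/* Recover the per-bio state with no separate mempool: */
	struct crypt_clone *cc = container_of(bio, struct crypt_clone,
					      bio);

	/* ... finish up using cc->ti, then ... */
	bio_put(bio);
}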

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 23:01                       ` Tejun Heo
  2012-09-10 23:06                         ` Kent Overstreet
@ 2012-09-10 23:09                         ` Alasdair G Kergon
  2012-09-10 23:35                           ` Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Alasdair G Kergon @ 2012-09-10 23:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kent Overstreet, axboe, device-mapper development, david,
	linux-kernel, linux-bcache, Mikulas Patocka, bharrosh,
	Vivek Goyal

On Mon, Sep 10, 2012 at 04:01:01PM -0700, Tejun Heo wrote:
> IIUC, Kent posted a patch which converts all of them to use front-pad

The only patch I saw posted here handles just one of the easier cases so far.

The others are a bit trickier and probably involve a decision about which way to
change an in-kernel dm interface.

Alasdair


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 22:09                   ` Tejun Heo
  2012-09-10 22:50                     ` [dm-devel] " Alasdair G Kergon
@ 2012-09-10 23:13                     ` Tejun Heo
  1 sibling, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2012-09-10 23:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-bcache, linux-kernel, dm-devel, axboe, Vivek Goyal,
	Mikulas Patocka, bharrosh, david

Hello, again.

On Mon, Sep 10, 2012 at 03:09:10PM -0700, Tejun Heo wrote:
> I'm still a bit scared but think this is correct.
> 
>  Acked-by: Tejun Heo <tj@kernel.org>
> 
> One last thing: we may want to add @name to bioset creation so that
> we can name the rescuer workqueue properly, but that's for another
> patch.

Yet another thing that I noticed in a different discussion.

  http://thread.gmane.org/gmane.linux.network.drbd.devel/2130

Before this, I think bios didn't get reordered while they were traveling
down the stacked bio drivers.  After this, I don't think that's true
anymore.  Currently, IIRC, we don't have any ordering functionality at
the bio interface, so I don't think this breaks anything, but it can lead
to stupidly subtle bugs if the upper layer is somehow making assumptions
about ordering.  It's something which at least should be noted, I
think.  Whether we want to update the code so that ordering is
maintained, I don't know.  I hope not.  It's already crazy complex. :(

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 23:09                         ` Alasdair G Kergon
@ 2012-09-10 23:35                           ` Tejun Heo
  2012-09-10 23:45                             ` Alasdair G Kergon
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2012-09-10 23:35 UTC (permalink / raw)
  To: Kent Overstreet, axboe, device-mapper development, david,
	linux-kernel, linux-bcache, Mikulas Patocka, bharrosh,
	Vivek Goyal

Hello, Alasdair.

On Tue, Sep 11, 2012 at 12:09:17AM +0100, Alasdair G Kergon wrote:
> On Mon, Sep 10, 2012 at 04:01:01PM -0700, Tejun Heo wrote:
> > IIUC, Kent posted a patch which converts all of them to use front-pad
> 
> The only patch I saw posted here handles just one of the easier cases so far.
> 
> The others are a bit trickier and probably involve a decision about which way to
> change an in-kernel dm interface.

Ah, I see.  Thought that was a full conversion.  I still think this is
the right direction to be headed tho.  Using multiple mempools from
the stacking drivers is way too fragile and difficult to verify and
debug.  Would it be difficult to convert dm drivers to collect size
requirements and use front-pad for all per-bio data?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dm-devel] [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 23:35                           ` Tejun Heo
@ 2012-09-10 23:45                             ` Alasdair G Kergon
  0 siblings, 0 replies; 26+ messages in thread
From: Alasdair G Kergon @ 2012-09-10 23:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kent Overstreet, axboe, device-mapper development, david,
	linux-kernel, linux-bcache, Mikulas Patocka, bharrosh,
	Vivek Goyal

On Mon, Sep 10, 2012 at 04:35:02PM -0700, Tejun Heo wrote:
> debug.  Would it be difficult to convert dm drivers to collect size
> requirements and use front-pad for all per-bio data?
 
I can't give a quick answer: a single bio may require a variable number
of sequential mempool_alloc()s from the same mempool to complete its
processing (how many depends on the bio's content), so we do have to
audit all this very carefully to see what can and can't be pulled out.
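
Illustrative only - none of these names exist in dm - but the
problematic shape is:

static void dm_process_one_bio(struct bio *bio, mempool_t *pool)
{
	/* N varies with the bio's content, so a fixed front pad
	 * can't absorb these allocations; and if the earlier
	 * fragments can't complete and be freed independently of
	 * the later ones, the usual mempool forward-progress
	 * argument no longer holds. */
	unsigned n = fragments_needed(bio);

	while (n--) {
		struct dm_fragment *frag = mempool_alloc(pool, GFP_NOIO);

		queue_fragment(frag, bio);
	}
}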

Alasdair


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-10 21:56                 ` Kent Overstreet
  2012-09-10 22:09                   ` Tejun Heo
@ 2012-09-11 18:36                   ` Muthu Kumar
  2012-09-11 18:45                     ` Tejun Heo
  1 sibling, 1 reply; 26+ messages in thread
From: Muthu Kumar @ 2012-09-11 18:36 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Tejun Heo, linux-bcache, linux-kernel, dm-devel, axboe,
	Vivek Goyal, Mikulas Patocka, bharrosh, david

Kent,

On Mon, Sep 10, 2012 at 2:56 PM, Kent Overstreet <koverstreet@google.com> wrote:
..
<snip>
..

>
> +static void bio_alloc_rescue(struct work_struct *work)
> +{
> +       struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> +       struct bio *bio;
> +
> +       while (1) {
> +               spin_lock(&bs->rescue_lock);
> +               bio = bio_list_pop(&bs->rescue_list);
> +               spin_unlock(&bs->rescue_lock);
> +
> +               if (!bio)
> +                       break;
> +
> +               generic_make_request(bio);
> +       }
> +}
> +
> +static void punt_bios_to_rescuer(struct bio_set *bs)
> +{
> +       struct bio_list punt, nopunt;
> +       struct bio *bio;
> +
> +       /*
> +        * In order to guarantee forward progress we must punt only bios that
> +        * were allocated from this bio_set; otherwise, if there was a bio on
> +        * there for a stacking driver higher up in the stack, processing it
> +        * could require allocating bios from this bio_set, and doing that from
> +        * our own rescuer would be bad.
> +        *
> +        * Since bio lists are singly linked, pop them all instead of trying to
> +        * remove from the middle of the list:
> +        */
> +
> +       bio_list_init(&punt);
> +       bio_list_init(&nopunt);
> +
> +       while ((bio = bio_list_pop(current->bio_list)))
> +               bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +
> +       *current->bio_list = nopunt;
> +
> +       spin_lock(&bs->rescue_lock);
> +       bio_list_merge(&bs->rescue_list, &punt);
> +       spin_unlock(&bs->rescue_lock);
> +
> +       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +}


Does this preserve the CPU from which the bio was submitted
originally? Not familiar with cmwq; maybe Tejun can clarify.

Tejun - the question is, do we honor the rq_affinity with the above
rescue worker implementation?

Regards,
Muthu

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-11 18:36                   ` Muthu Kumar
@ 2012-09-11 18:45                     ` Tejun Heo
  2012-09-11 18:58                       ` Muthu Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2012-09-11 18:45 UTC (permalink / raw)
  To: Muthu Kumar
  Cc: Kent Overstreet, linux-bcache, linux-kernel, dm-devel, axboe,
	Vivek Goyal, Mikulas Patocka, bharrosh, david

Hello,

On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
> Does this preserve the CPU from which the bio was submitted
> originally? Not familiar with cmwq; maybe Tejun can clarify.
> 
> Tejun - the question is, do we honor the rq_affinity with the above
> rescue worker implementation?

The work item would run from the same CPU but there isn't any
mechanism to keep track of the issuing CPU if there are multiple bios
to be rescued.  Isn't rq_affinity an optimization hint?  If so, I
don't think it matters here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-11 18:45                     ` Tejun Heo
@ 2012-09-11 18:58                       ` Muthu Kumar
  2012-09-11 19:31                         ` Kent Overstreet
  0 siblings, 1 reply; 26+ messages in thread
From: Muthu Kumar @ 2012-09-11 18:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Kent Overstreet, linux-bcache, linux-kernel, dm-devel, axboe,
	Vivek Goyal, Mikulas Patocka, bharrosh, david

On Tue, Sep 11, 2012 at 11:45 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
>> Does this preserve the CPU from which the bio was submitted
>> originally? Not familiar with cmwq; maybe Tejun can clarify.
>>
>> Tejun - the question is, do we honor the rq_affinity with the above
>> rescue worker implementation?
>
> The work item would run from the same CPU but there isn't any
> mechanism to keep track of the issuing CPU if there are multiple bios
> to be rescued.  Isn't rq_affinity an optimization hint?  If so, I
> don't think it matters here.
>

Thanks... Just worried about performance impact.

Kent - Anything to validate that the performance is not impacted would
be really good. Otherwise, the patch looks great.

Feel free to add:

Reviewed-by: Muthukumar Ratty <muthur@gmail.com>


Regards,
Muthu


> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-11 18:58                       ` Muthu Kumar
@ 2012-09-11 19:31                         ` Kent Overstreet
  2012-09-11 20:00                           ` Muthu Kumar
  0 siblings, 1 reply; 26+ messages in thread
From: Kent Overstreet @ 2012-09-11 19:31 UTC (permalink / raw)
  To: Muthu Kumar
  Cc: Tejun Heo, linux-bcache, linux-kernel, dm-devel, axboe,
	Vivek Goyal, Mikulas Patocka, bharrosh, david

On Tue, Sep 11, 2012 at 11:58:05AM -0700, Muthu Kumar wrote:
> On Tue, Sep 11, 2012 at 11:45 AM, Tejun Heo <tj@kernel.org> wrote:
> > Hello,
> >
> > On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
> >> Does this preserve the CPU from which the bio was submitted
> >> originally? Not familiar with cmwq; maybe Tejun can clarify.
> >>
> >> Tejun - the question is, do we honor the rq_affinity with the above
> >> rescue worker implementation?
> >
> > The work item would run from the same CPU but there isn't any
> > mechanism to keep track of the issuing CPU if there are multiple bios
> > to be rescued.  Isn't rq_affinity an optimization hint?  If so, I
> > don't think it matters here.
> >
> 
> Thanks... Just worried about performance impact.
> 
> Kent - Anything to validate that the performance is not impacted would
> be really good. Otherwise, the patch looks great.

Well - there'll only be any performance impact at all when we're memory
constrained enough that GFP_NOWAIT allocations fail, which for these
size allocations definitely isn't normal.

I did test it with forcing everything to use the rescuer, and I also
benchmarked Vivek's version - in any sane configuration, the impact of
punting everything to a workqueue is not very noticeable (the AHCI
interrupt handler uses more CPU).

> 
> Feel free to add:
> 
> Reviewed-by: Muthukumar Ratty <muthur@gmail.com>

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers
  2012-09-11 19:31                         ` Kent Overstreet
@ 2012-09-11 20:00                           ` Muthu Kumar
  0 siblings, 0 replies; 26+ messages in thread
From: Muthu Kumar @ 2012-09-11 20:00 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Tejun Heo, linux-bcache, linux-kernel, dm-devel, axboe,
	Vivek Goyal, Mikulas Patocka, bharrosh, david

Kent,

On Tue, Sep 11, 2012 at 12:31 PM, Kent Overstreet
<koverstreet@google.com> wrote:
> On Tue, Sep 11, 2012 at 11:58:05AM -0700, Muthu Kumar wrote:
>> On Tue, Sep 11, 2012 at 11:45 AM, Tejun Heo <tj@kernel.org> wrote:
>> > Hello,
>> >
>> > On Tue, Sep 11, 2012 at 11:36:28AM -0700, Muthu Kumar wrote:
>> >> Does this preserve the CPU from which the bio was submitted
>> >> originally? Not familiar with cmwq; maybe Tejun can clarify.
>> >>
>> >> Tejun - the question is, do we honor the rq_affinity with the above
>> >> rescue worker implementation?
>> >
>> > The work item would run from the same CPU but there isn't any
>> > mechanism to keep track of the issuing CPU if there are multiple bios
>> > to be rescued.  Isn't rq_affinity an optimization hint?  If so, I
>> > don't think it matters here.
>> >
>>
>> Thanks... Just worried about performance impact.
>>
>> Kent - Anything to validate that the performance is not impacted would
>> be really good. Otherwise, the patch looks great.
>
> Well - there'll only be any performance impact at all when we're memory
> constrained enough that GFP_NOWAIT allocations fail, which for these
> size allocations definitely isn't normal.


Agreed. If we are in this situation, we are already running in
degraded mode. So performance is not in question.

Regards,
Muthu

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread

Thread overview: 26+ messages
2012-09-07 22:12 [PATCH 0/2] Avoid deadlocks with bio allocation Kent Overstreet
2012-09-07 22:12 ` [PATCH 1/2] block: Reorder struct bio_set Kent Overstreet
2012-09-07 22:12 ` [PATCH 2/2] block: Avoid deadlocks with bio allocation by stacking drivers Kent Overstreet
2012-09-08 19:36   ` Tejun Heo
2012-09-10  0:28     ` Kent Overstreet
2012-09-10 15:25       ` Vivek Goyal
2012-09-10 17:22       ` Tejun Heo
2012-09-10 20:24         ` Kent Overstreet
2012-09-10 20:40           ` Tejun Heo
2012-09-10 21:33             ` Kent Overstreet
2012-09-10 21:37               ` Tejun Heo
2012-09-10 21:56                 ` Kent Overstreet
2012-09-10 22:09                   ` Tejun Heo
2012-09-10 22:50                     ` [dm-devel] " Alasdair G Kergon
2012-09-10 23:01                       ` Tejun Heo
2012-09-10 23:06                         ` Kent Overstreet
2012-09-10 23:09                         ` Alasdair G Kergon
2012-09-10 23:35                           ` Tejun Heo
2012-09-10 23:45                             ` Alasdair G Kergon
2012-09-10 23:01                       ` Kent Overstreet
2012-09-10 23:13                     ` Tejun Heo
2012-09-11 18:36                   ` Muthu Kumar
2012-09-11 18:45                     ` Tejun Heo
2012-09-11 18:58                       ` Muthu Kumar
2012-09-11 19:31                         ` Kent Overstreet
2012-09-11 20:00                           ` Muthu Kumar
