linux-kernel.vger.kernel.org archive mirror
* [PATCH] block: flush queued bios when the process blocks
@ 2014-05-27 15:03 Mikulas Patocka
  2014-05-27 15:08 ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2014-05-27 15:03 UTC (permalink / raw)
  To: Jens Axboe, Kent Overstreet
  Cc: linux-kernel, dm-devel, Alasdair G. Kergon, Mike Snitzer

The block layer uses a per-process bio list to avoid recursion in
generic_make_request. When generic_make_request is called recursively, the
bio is added to current->bio_list and the function returns immediately.
The top-level instance of generic_make_request takes bios from
current->bio_list and processes them.

This bio queuing can result in deadlocks. The following deadlock was
observed:

1) Process A sends one-page read bio to the dm-snapshot target. The bio
spans snapshot chunk boundary and so it is split to two bios by device
mapper.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (which allocates a
dm_snap_tracked_chunk structure and adds it to tracked_chunk_hash) and then
remaps the bio to the underlying linear target and exits with
DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
processed - it is added to current->bio_list instead.

5) Meanwhile, process B executes pending_complete for the affected chunk:
it takes down_write(&s->lock) and then loops in
__check_for_conflicting_io, waiting for the dm_snap_tracked_chunk created in
step 3) to be released.

6) Process A continues, it creates a new bio for the rest of the original
bio.

7) snapshot_map is called for this new bio; it waits on
down_write(&s->lock), which has been held since step 5).

The resulting deadlock:
* bio added to current->bio_list at step 4) waits until the function in
  step 7) finishes
* the function in step 7) waits until s->lock held in step 5) is released
* the process in step 5) waits until the bio queued in step 4) finishes

The general problem is that queuing bios on current->bio_list introduces
additional lock dependencies. If a device mapper target sends a bio to
some block device, it assumes that the bio only takes locks of the target
block device or devices that are below the target device. However, if the
bio is added to the queue on current->bio_list, it creates an artificial
locking dependency on locks taken by other bios that are on
current->bio_list. In the above scenario, this artificial locking
dependency results in deadlock.

Kent Overstreet already created a workqueue for every bio set, and there is
code that tries to resolve some low-memory deadlocks by redirecting bios
queued on current->bio_list to the workqueue when the system is low on
memory. However, other deadlocks (such as the one described above) may
happen without any low-memory condition.

This patch generalizes Kent's concept: it redirects bios on
current->bio_list to the bio_set's workqueue on every schedule call.
Consequently, when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueues and can
complete without waiting for the mutex to become available.

Bios allocated with bio_kmalloc do not have a bio_set, so they are not
redirected; however, bio_kmalloc shouldn't be used by stacking drivers (it
is currently used by raid1.c and raid10.c; we need to change them to use a
bio_set).


Note to stable kernel maintainers: before backporting this patch, you also
need to backport df2cb6daa4cbc34406bc4b1ac9b9335df1083a72.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 fs/bio.c               |   84 ++++++++++++++-----------------------------------
 include/linux/blkdev.h |    7 +++-
 kernel/sched/core.c    |    7 ++++
 3 files changed, 37 insertions(+), 61 deletions(-)

Index: linux-3.15-rc5/fs/bio.c
===================================================================
--- linux-3.15-rc5.orig/fs/bio.c	2014-05-26 19:02:47.000000000 +0200
+++ linux-3.15-rc5/fs/bio.c	2014-05-27 00:00:13.000000000 +0200
@@ -342,35 +342,34 @@ static void bio_alloc_rescue(struct work
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ *
+ * Pop bios queued on current->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on current->bio_list.
+ * However, stacking drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(void)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *current->bio_list;
+	bio_list_init(current->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs)) {
+			bio_list_add(current->bio_list, bio);
+		} else {
+			spin_lock(&bs->rescue_lock);
+			bio_list_add(&bs->rescue_list, bio);
+			spin_unlock(&bs->rescue_lock);
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+			queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		}
+	}
 }
 
 /**
@@ -410,7 +409,6 @@ static void punt_bios_to_rescuer(struct 
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -428,36 +426,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
-		/*
-		 * generic_make_request() converts recursion to iteration; this
-		 * means if we're running beneath it, any bios we allocate and
-		 * submit will not be submitted (and thus freed) until after we
-		 * return.
-		 *
-		 * This exposes us to a potential deadlock if we allocate
-		 * multiple bios from the same bio_set() while running
-		 * underneath generic_make_request(). If we were to allocate
-		 * multiple bios (say a stacking block driver that was splitting
-		 * bios), we would deadlock if we exhausted the mempool's
-		 * reserve.
-		 *
-		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
-		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
 
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
@@ -471,11 +440,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
 
 		if (unlikely(!bvl))
 			goto err_free;
Index: linux-3.15-rc5/kernel/sched/core.c
===================================================================
--- linux-3.15-rc5.orig/kernel/sched/core.c	2014-05-26 19:30:51.000000000 +0200
+++ linux-3.15-rc5/kernel/sched/core.c	2014-05-27 00:23:00.000000000 +0200
@@ -2734,6 +2734,13 @@ static inline void sched_submit_work(str
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
+	 * If there are bios on the bio list, flush them to the appropriate
+	 * rescue threads.
+	 */
+	if (unlikely(current->bio_list != NULL) &&
+	    !bio_list_empty(current->bio_list))
+		blk_flush_bio_list();
+	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
Index: linux-3.15-rc5/include/linux/blkdev.h
===================================================================
--- linux-3.15-rc5.orig/include/linux/blkdev.h	2014-05-26 23:54:48.000000000 +0200
+++ linux-3.15-rc5/include/linux/blkdev.h	2014-05-26 23:56:41.000000000 +0200
@@ -1103,6 +1103,8 @@ static inline bool blk_needs_flush_plug(
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(void);
+
 /*
  * tag stuff
  */
@@ -1634,12 +1636,15 @@ static inline void blk_schedule_flush_pl
 {
 }
 
-
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	return false;
 }
 
+static inline void blk_flush_bio_list(void)
+{
+}
+
 #endif /* CONFIG_BLOCK */
 
 #endif

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 15:03 [PATCH] block: flush queued bios when the process blocks Mikulas Patocka
@ 2014-05-27 15:08 ` Jens Axboe
  2014-05-27 15:23   ` Mikulas Patocka
  2014-05-27 17:59   ` [PATCH] " Kent Overstreet
  0 siblings, 2 replies; 43+ messages in thread
From: Jens Axboe @ 2014-05-27 15:08 UTC (permalink / raw)
  To: Mikulas Patocka, Kent Overstreet
  Cc: linux-kernel, dm-devel, Alasdair G. Kergon, Mike Snitzer

On 2014-05-27 09:03, Mikulas Patocka wrote:
> The block layer uses a per-process bio list to avoid recursion in
> generic_make_request. When generic_make_request is called recursively, the
> bio is added to current->bio_list and the function returns immediately.
> The top-level instance of generic_make_request takes bios from
> current->bio_list and processes them.

This really begs the question of why we just don't use the per-process 
plugs for this. We already have scheduler hooks in place to flush those 
at the appropriate time. Why are we reinventing something for 
essentially the same thing?

-- 
Jens Axboe



* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 15:08 ` Jens Axboe
@ 2014-05-27 15:23   ` Mikulas Patocka
  2014-05-27 15:42     ` Jens Axboe
  2014-05-27 17:59   ` [PATCH] " Kent Overstreet
  1 sibling, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2014-05-27 15:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon,
	Mike Snitzer



On Tue, 27 May 2014, Jens Axboe wrote:

> On 2014-05-27 09:03, Mikulas Patocka wrote:
> > The block layer uses a per-process bio list to avoid recursion in
> > generic_make_request. When generic_make_request is called recursively, the
> > bio is added to current->bio_list and the function returns immediately.
> > The top-level instance of generic_make_request takes bios from
> > current->bio_list and processes them.
> 
> This really begs the question of why we just don't use the per-process plugs
> for this. We already have scheduler hooks in place to flush those at the
> appropriate time. Why are we reinventing something for essentially the same
> thing?
> 
> -- 
> Jens Axboe

Plugs work with requests; this patch works with bios. They are different 
structures, so you can't use one infrastructure to process both.

The patch adds bio list flushing to the scheduler right beside plug 
flushing.

Mikulas


* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 15:23   ` Mikulas Patocka
@ 2014-05-27 15:42     ` Jens Axboe
  2014-05-27 16:26       ` Mikulas Patocka
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2014-05-27 15:42 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon,
	Mike Snitzer

On 2014-05-27 09:23, Mikulas Patocka wrote:
>
>
> On Tue, 27 May 2014, Jens Axboe wrote:
>
>> On 2014-05-27 09:03, Mikulas Patocka wrote:
>>> The block layer uses a per-process bio list to avoid recursion in
>>> generic_make_request. When generic_make_request is called recursively, the
>>> bio is added to current->bio_list and the function returns immediately.
>>> The top-level instance of generic_make_request takes bios from
>>> current->bio_list and processes them.
>>
>> This really begs the question of why we just don't use the per-process plugs
>> for this. We already have scheduler hooks in place to flush those at the
>> appropriate time. Why are we reinventing something for essentially the same
>> thing?
>>
>> --
>> Jens Axboe
>
> Plugs work with requests; this patch works with bios. They are different
> structures, so you can't use one infrastructure to process both.

Yes... I realize the list and plugs are for requests. But there's 
nothing preventing a non-rq hook, we have uses like that too. And it 
could easily be extended to handle bio lists, too.

> The patch adds bio list flushing to the scheduler right beside plug
> flushing.

... which is exactly why I'm commenting. It'd be great to avoid yet one 
more scheduler hook for this sort of thing.

-- 
Jens Axboe



* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 15:42     ` Jens Axboe
@ 2014-05-27 16:26       ` Mikulas Patocka
  2014-05-27 17:33         ` Mike Snitzer
  2014-05-27 17:42         ` [PATCH] " Jens Axboe
  0 siblings, 2 replies; 43+ messages in thread
From: Mikulas Patocka @ 2014-05-27 16:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon,
	Mike Snitzer

On Tue, 27 May 2014, Jens Axboe wrote:

> On 2014-05-27 09:23, Mikulas Patocka wrote:
> 
> > The patch adds bio list flushing to the scheduler right beside plug
> > flushing.
> 
> ... which is exactly why I'm commenting. It'd be great to avoid yet one more
> scheduler hook for this sort of thing.
> 
> -- 
> Jens Axboe

One could create something like a schedule notifier chain, but I'm not sure 
if it is worth the complexity for just two users. If more users 
come in the future, it could be generalized.

Mikulas


* Re: block: flush queued bios when the process blocks
  2014-05-27 16:26       ` Mikulas Patocka
@ 2014-05-27 17:33         ` Mike Snitzer
  2014-05-27 19:56           ` Kent Overstreet
  2014-05-27 17:42         ` [PATCH] " Jens Axboe
  1 sibling, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2014-05-27 17:33 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon

On Tue, May 27 2014 at 12:26pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> On Tue, 27 May 2014, Jens Axboe wrote:
> 
> > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > 
> > > The patch adds bio list flushing to the scheduler right beside plug
> > > flushing.
> > 
> > ... which is exactly why I'm commenting. It'd be great to avoid yet one more
> > scheduler hook for this sort of thing.
> > 
> > -- 
> > Jens Axboe
> 
> One could create something like a schedule notifier chain, but I'm not sure 
> if it is worth the complexity for just two users. If more users 
> come in the future, it could be generalized.

It could be that Jens is suggesting updating blk_needs_flush_plug() and
blk_schedule_flush_plug() to be bio_list aware too (rather than teaching
sched_submit_work() about this new bio_list)?

Somewhat awkward, but _could_ be made to work.


* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 16:26       ` Mikulas Patocka
  2014-05-27 17:33         ` Mike Snitzer
@ 2014-05-27 17:42         ` Jens Axboe
  2014-05-27 18:14           ` [dm-devel] " Christoph Hellwig
                             ` (2 more replies)
  1 sibling, 3 replies; 43+ messages in thread
From: Jens Axboe @ 2014-05-27 17:42 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon,
	Mike Snitzer

On 2014-05-27 10:26, Mikulas Patocka wrote:
> On Tue, 27 May 2014, Jens Axboe wrote:
>
>> On 2014-05-27 09:23, Mikulas Patocka wrote:
>>
>>> The patch adds bio list flushing to the scheduler right beside plug
>>> flushing.
>>
>> ... which is exactly why I'm commenting. It'd be great to avoid yet one more
>> scheduler hook for this sort of thing.
>>
>> --
>> Jens Axboe
>
> One could create something like a schedule notifier chain, but I'm not sure
> if it is worth the complexity for just two users. If more users
> come in the future, it could be generalized.

Except such a thing already exists, there are unplug callback chains. 
All I'm asking is that you look into how feasible it would be to use 
something like that, instead of reinventing the wheel.

-- 
Jens Axboe



* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 15:08 ` Jens Axboe
  2014-05-27 15:23   ` Mikulas Patocka
@ 2014-05-27 17:59   ` Kent Overstreet
  1 sibling, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2014-05-27 17:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mikulas Patocka, linux-kernel, device-mapper development,
	Alasdair G. Kergon, Mike Snitzer

On Tue, May 27, 2014 at 8:08 AM, Jens Axboe <axboe@kernel.dk> wrote:
> This really begs the question of why we just don't use the per-process plugs
> for this. We already have scheduler hooks in place to flush those at the
> appropriate time. Why are we reinventing something for essentially the same
> thing?

Yes! Unifying the two plugging mechanisms has been on my todo/wishlist for ages.


* Re: [dm-devel] [PATCH] block: flush queued bios when the process blocks
  2014-05-27 17:42         ` [PATCH] " Jens Axboe
@ 2014-05-27 18:14           ` Christoph Hellwig
  2014-05-27 19:59             ` Kent Overstreet
  2014-05-27 19:56           ` Mikulas Patocka
  2014-05-29 23:52           ` Mikulas Patocka
  2 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2014-05-27 18:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mikulas Patocka, Mike Snitzer, Kent Overstreet,
	Alasdair G. Kergon, linux-kernel, dm-devel

On Tue, May 27, 2014 at 11:42:26AM -0600, Jens Axboe wrote:
> Except such a thing already exists, there are unplug callback
> chains. All I'm asking is that you look into how feasible it would
> be to use something like that, instead of reinventing the wheel.

I suspect the code from MD could be lifted quite easily.

It would be nicer to move it up to the block layer entirely, but when I
tried that a while ago I ran into issues that I unfortunately don't
remember.


* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 17:42         ` [PATCH] " Jens Axboe
  2014-05-27 18:14           ` [dm-devel] " Christoph Hellwig
@ 2014-05-27 19:56           ` Mikulas Patocka
  2014-05-27 20:06             ` Kent Overstreet
  2014-05-29 23:52           ` Mikulas Patocka
  2 siblings, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2014-05-27 19:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon,
	Mike Snitzer



On Tue, 27 May 2014, Jens Axboe wrote:

> On 2014-05-27 10:26, Mikulas Patocka wrote:
> > On Tue, 27 May 2014, Jens Axboe wrote:
> > 
> > > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > > 
> > > > The patch adds bio list flushing to the scheduler right beside plug
> > > > flushing.
> > > 
> > > ... which is exactly why I'm commenting. It'd be great to avoid yet one
> > > more
> > > scheduler hook for this sort of thing.
> > > 
> > > --
> > > Jens Axboe
> > 
> > One could create something like a schedule notifier chain, but I'm not sure
> > if it is worth the complexity for just two users. If more users
> > come in the future, it could be generalized.
> 
> Except such a thing already exists, there are unplug callback chains. All I'm
> asking is that you look into how feasible it would be to use something like
> that, instead of reinventing the wheel.
> 
> -- 
> Jens Axboe

Do you mean moving current->bio_list to struct blk_plug and calling 
blk_start_plug/blk_finish_plug around generic_make_request?

It would be possible on the condition that we can redirect all bios to a 
workqueue (i.e. eliminate bio_kmalloc and always use bio_alloc_bioset).

What are the performance implications of this - does it make sense to have 
blk_start_plug/blk_finish_plug around every call to generic_make_request? 
That would mean that all I/O requests are added to a plug and then 
unplugged.

Mikulas


* Re: block: flush queued bios when the process blocks
  2014-05-27 17:33         ` Mike Snitzer
@ 2014-05-27 19:56           ` Kent Overstreet
  2015-10-05 19:50             ` Mike Snitzer
  0 siblings, 1 reply; 43+ messages in thread
From: Kent Overstreet @ 2014-05-27 19:56 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Mikulas Patocka, Jens Axboe, linux-kernel, dm-devel, Alasdair G. Kergon

On Tue, May 27, 2014 at 01:33:06PM -0400, Mike Snitzer wrote:
> On Tue, May 27 2014 at 12:26pm -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > On Tue, 27 May 2014, Jens Axboe wrote:
> > 
> > > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > > 
> > > > The patch adds bio list flushing to the scheduler right beside plug
> > > > flushing.
> > > 
> > > ... which is exactly why I'm commenting. It'd be great to avoid yet one more
> > > scheduler hook for this sort of thing.
> > > 
> > > -- 
> > > Jens Axboe
> > 
> > One could create something like a schedule notifier chain, but I'm not sure
> > if it is worth the complexity for just two users. If more users
> > come in the future, it could be generalized.
> 
> It could be that Jens is suggesting updating blk_needs_flush_plug() and
> blk_schedule_flush_plug() to be bio_list aware too (rather than train
> sched_submit_work() from this new bio_list)?
> 
> Somewhat awkward, but _could_ be made to work.

No...

I started on this ages and ages ago, but the patches were nowhere near ready to
go out.

What I'd do is rework the existing blk_plug to work in terms of bios, not
requests. This is a fair amount of work, and make_request_fn() then needs to be
able to take multiple bios at a time, but IIRC there were other simplifications
that fell out of this, and it also means bio-based drivers can take advantage
of plugging/batching.

Once this is done, you just... replace the bio list in generic_make_request()
with a plug. Boom, done.

(oh, and to avoid blowing the stack the scheduler/plug callback or whatever
needs to check the current stack usage and punt off to a per request_queue
workqueue, but that's easy enough).


* Re: [dm-devel] [PATCH] block: flush queued bios when the process blocks
  2014-05-27 18:14           ` [dm-devel] " Christoph Hellwig
@ 2014-05-27 19:59             ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2014-05-27 19:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Mikulas Patocka, Mike Snitzer, Alasdair G. Kergon,
	linux-kernel, dm-devel

On Tue, May 27, 2014 at 11:14:15AM -0700, Christoph Hellwig wrote:
> On Tue, May 27, 2014 at 11:42:26AM -0600, Jens Axboe wrote:
> > Except such a thing already exists, there are unplug callback
> > chains. All I'm asking is that you look into how feasible it would
> > be to use something like that, instead of reinventing the wheel.
> 
> I suspect the code from MD could be lifted quite easily.
> 
> It would be nicer to move it up to the block layer entirely, but when I
> tried that a while ago I ran into issues that I unfortunately don't
> remember.

How's md do it?


* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 19:56           ` Mikulas Patocka
@ 2014-05-27 20:06             ` Kent Overstreet
  0 siblings, 0 replies; 43+ messages in thread
From: Kent Overstreet @ 2014-05-27 20:06 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, linux-kernel, dm-devel, Alasdair G. Kergon, Mike Snitzer

On Tue, May 27, 2014 at 03:56:00PM -0400, Mikulas Patocka wrote:
> 
> 
> On Tue, 27 May 2014, Jens Axboe wrote:
> 
> > On 2014-05-27 10:26, Mikulas Patocka wrote:
> > > On Tue, 27 May 2014, Jens Axboe wrote:
> > > 
> > > > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > > > 
> > > > > The patch adds bio list flushing to the scheduler right beside plug
> > > > > flushing.
> > > > 
> > > > ... which is exactly why I'm commenting. It'd be great to avoid yet one
> > > > more
> > > > scheduler hook for this sort of thing.
> > > > 
> > > > --
> > > > Jens Axboe
> > > 
> > > One could create something like a schedule notifier chain, but I'm not sure
> > > if it is worth the complexity for just two users. If more users
> > > come in the future, it could be generalized.
> > 
> > Except such a thing already exists, there are unplug callback chains. All I'm
> > asking is that you look into how feasible it would be to use something like
> > that, instead of reinventing the wheel.
> > 
> > -- 
> > Jens Axboe
> 
> Do you mean moving current->bio_list to struct blk_plug and calling 
> blk_start_plug/blk_finish_plug around generic_make_request?
> 
> It would be possible on the condition that we can redirect all bios to a 
> workqueue (i.e. eliminate bio_kmalloc and always use bio_alloc_bioset).
> 
> What are the performance implications of this - does it make sense to have 
> blk_start_plug/blk_finish_plug around every call to generic_make_request? 
> That would mean that all I/O requests are added to a plug and then 
> unplugged.

We've already got blk_start_plug() calls around IO submission at higher points
in the stack. (I actually have seen it show up in profiles, though; it probably
would be worth inlining and slimming down a bit.)


* Re: [PATCH] block: flush queued bios when the process blocks
  2014-05-27 17:42         ` [PATCH] " Jens Axboe
  2014-05-27 18:14           ` [dm-devel] " Christoph Hellwig
  2014-05-27 19:56           ` Mikulas Patocka
@ 2014-05-29 23:52           ` Mikulas Patocka
  2015-10-05 20:59             ` Mike Snitzer
  2 siblings, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2014-05-29 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Kent Overstreet, linux-kernel, dm-devel, Alasdair G. Kergon,
	Mike Snitzer



On Tue, 27 May 2014, Jens Axboe wrote:

> On 2014-05-27 10:26, Mikulas Patocka wrote:
> > On Tue, 27 May 2014, Jens Axboe wrote:
> > 
> > > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > > 
> > > > The patch adds bio list flushing to the scheduler right beside plug
> > > > flushing.
> > > 
> > > ... which is exactly why I'm commenting. It'd be great to avoid yet one
> > > more
> > > scheduler hook for this sort of thing.
> > > 
> > > --
> > > Jens Axboe
> > 
> > One could create something like a schedule notifier chain, but I'm not sure
> > if it is worth the complexity for just two users. If more users
> > come in the future, it could be generalized.
> 
> Except such a thing already exists, there are unplug callback chains. All I'm
> asking is that you look into how feasible it would be to use something like
> that, instead of reinventing the wheel.
> 
> -- 
> Jens Axboe


You can use this patch as an example of moving current->bio_list to 
struct blk_plug, but I don't recommend putting it in the kernel - this patch 
still has some issues (some lvm raid tests fail), and at the -rc7 stage we 
should really be fixing bugs, not rearchitecting the code.


You should commit my original patch because it is small and it generated 
no regressions for me. Think about stable kernels and enterprise 
distributions - the smaller the patch is, the easier it is to backport.


Mikulas


---
 block/blk-core.c       |   19 ++++++++++++-------
 drivers/md/dm-bufio.c  |    2 +-
 drivers/md/raid1.c     |    6 +++---
 drivers/md/raid10.c    |    6 +++---
 fs/bio.c               |   21 +++++++++------------
 include/linux/blkdev.h |    7 +++++--
 include/linux/sched.h  |    4 ----
 kernel/sched/core.c    |    7 -------
 8 files changed, 33 insertions(+), 39 deletions(-)

Index: linux-3.15-rc5/block/blk-core.c
===================================================================
--- linux-3.15-rc5.orig/block/blk-core.c	2014-05-29 23:06:29.000000000 +0200
+++ linux-3.15-rc5/block/blk-core.c	2014-05-30 00:30:41.000000000 +0200
@@ -1828,7 +1828,7 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct blk_plug plug;
 
 	if (!generic_make_request_checks(bio))
 		return;
@@ -1858,8 +1858,8 @@ void generic_make_request(struct bio *bi
 	 * it is non-NULL, then a make_request is active, and new requests
 	 * should be added at the tail
 	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	if (current->plug) {
+		bio_list_add(&current->plug->bio_list, bio);
 		return;
 	}
 
@@ -1877,17 +1877,18 @@ void generic_make_request(struct bio *bi
 	 * of the top of the list (no pretending) and so remove it from
 	 * bio_list, and call into ->make_request() again.
 	 */
+	blk_start_plug(&plug);
+	current->plug->in_generic_make_request = 1;
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		q->make_request_fn(q, bio);
 
-		bio = bio_list_pop(current->bio_list);
+		bio = bio_list_pop(&current->plug->bio_list);
 	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	current->plug->in_generic_make_request = 0;
+	blk_finish_plug(&plug);
 }
 EXPORT_SYMBOL(generic_make_request);
 
@@ -2965,6 +2966,8 @@ void blk_start_plug(struct blk_plug *plu
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
+	bio_list_init(&plug->bio_list);
+	plug->in_generic_make_request = 0;
 
 	/*
 	 * If this is a nested plug, don't actually assign it. It will be
@@ -3060,6 +3063,8 @@ void blk_flush_plug_list(struct blk_plug
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
+	blk_flush_bio_list(plug);
+
 	flush_plug_callbacks(plug, from_schedule);
 
 	if (!list_empty(&plug->mq_list))
Index: linux-3.15-rc5/include/linux/blkdev.h
===================================================================
--- linux-3.15-rc5.orig/include/linux/blkdev.h	2014-05-29 23:05:46.000000000 +0200
+++ linux-3.15-rc5/include/linux/blkdev.h	2014-05-30 00:30:54.000000000 +0200
@@ -1061,6 +1061,8 @@ struct blk_plug {
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
+	struct bio_list bio_list; /* list of queued bios */
+	int in_generic_make_request;
 };
 #define BLK_MAX_REQUEST_COUNT 16
 
@@ -1100,10 +1102,11 @@ static inline bool blk_needs_flush_plug(
 	return plug &&
 		(!list_empty(&plug->list) ||
 		 !list_empty(&plug->mq_list) ||
-		 !list_empty(&plug->cb_list));
+		 !list_empty(&plug->cb_list) ||
+		 !bio_list_empty(&plug->bio_list));
 }
 
-extern void blk_flush_bio_list(void);
+extern void blk_flush_bio_list(struct blk_plug *plug);
 
 /*
  * tag stuff
Index: linux-3.15-rc5/include/linux/sched.h
===================================================================
--- linux-3.15-rc5.orig/include/linux/sched.h	2014-05-29 23:07:01.000000000 +0200
+++ linux-3.15-rc5/include/linux/sched.h	2014-05-29 23:07:08.000000000 +0200
@@ -126,7 +126,6 @@ struct sched_attr {
 struct exec_domain;
 struct futex_pi_state;
 struct robust_list_head;
-struct bio_list;
 struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
@@ -1427,9 +1426,6 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 
-/* stacked block device info */
-	struct bio_list *bio_list;
-
 #ifdef CONFIG_BLOCK
 /* stack plugging */
 	struct blk_plug *plug;
Index: linux-3.15-rc5/fs/bio.c
===================================================================
--- linux-3.15-rc5.orig/fs/bio.c	2014-05-29 23:19:04.000000000 +0200
+++ linux-3.15-rc5/fs/bio.c	2014-05-29 23:36:40.000000000 +0200
@@ -352,23 +352,20 @@ static void bio_alloc_rescue(struct work
  * However, stacking drivers should use bio_set, so this shouldn't be
  * an issue.
  */
-void blk_flush_bio_list(void)
+void blk_flush_bio_list(struct blk_plug *plug)
 {
 	struct bio *bio;
-	struct bio_list list = *current->bio_list;
-	bio_list_init(current->bio_list);
 
-	while ((bio = bio_list_pop(&list))) {
+	while ((bio = bio_list_pop(&plug->bio_list))) {
 		struct bio_set *bs = bio->bi_pool;
-		if (unlikely(!bs)) {
-			bio_list_add(current->bio_list, bio);
-		} else {
-			spin_lock(&bs->rescue_lock);
-			bio_list_add(&bs->rescue_list, bio);
-			spin_unlock(&bs->rescue_lock);
+		if (!bs)
+			bs = fs_bio_set;
 
-			queue_work(bs->rescue_workqueue, &bs->rescue_work);
-		}
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		spin_unlock(&bs->rescue_lock);
+
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
 	}
 }
 
Index: linux-3.15-rc5/kernel/sched/core.c
===================================================================
--- linux-3.15-rc5.orig/kernel/sched/core.c	2014-05-29 23:17:04.000000000 +0200
+++ linux-3.15-rc5/kernel/sched/core.c	2014-05-29 23:18:28.000000000 +0200
@@ -2734,13 +2734,6 @@ static inline void sched_submit_work(str
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
-	 * If there are bios on the bio list, flush them to the appropriate
-	 * rescue threads.
-	 */
-	if (unlikely(current->bio_list != NULL) &&
-	    !bio_list_empty(current->bio_list))
-		blk_flush_bio_list();
-	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
Index: linux-3.15-rc5/drivers/md/dm-bufio.c
===================================================================
--- linux-3.15-rc5.orig/drivers/md/dm-bufio.c	2014-05-30 00:25:55.000000000 +0200
+++ linux-3.15-rc5/drivers/md/dm-bufio.c	2014-05-30 00:31:28.000000000 +0200
@@ -169,7 +169,7 @@ static inline int dm_bufio_cache_index(s
 #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
 #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
 
-#define dm_bufio_in_request()	(!!current->bio_list)
+#define dm_bufio_in_request()	(current->plug && current->plug->in_generic_make_request)
 
 static void dm_bufio_lock(struct dm_bufio_client *c)
 {
Index: linux-3.15-rc5/drivers/md/raid1.c
===================================================================
--- linux-3.15-rc5.orig/drivers/md/raid1.c	2014-05-30 00:19:28.000000000 +0200
+++ linux-3.15-rc5/drivers/md/raid1.c	2014-05-30 00:33:11.000000000 +0200
@@ -912,8 +912,8 @@ static sector_t wait_barrier(struct r1co
 				    (!conf->barrier ||
 				    ((conf->start_next_window <
 				      conf->next_resync + RESYNC_SECTORS) &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list))),
+				     current->plug &&
+				     !bio_list_empty(&current->plug->bio_list))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1052,7 +1052,7 @@ static void raid1_unplug(struct blk_plug
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || (current->plug && current->plug->in_generic_make_request)) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
Index: linux-3.15-rc5/drivers/md/raid10.c
===================================================================
--- linux-3.15-rc5.orig/drivers/md/raid10.c	2014-05-30 00:23:51.000000000 +0200
+++ linux-3.15-rc5/drivers/md/raid10.c	2014-05-30 00:32:50.000000000 +0200
@@ -1045,8 +1045,8 @@ static void wait_barrier(struct r10conf 
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (conf->nr_pending &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     current->plug &&
+				     !bio_list_empty(&current->plug->bio_list)),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1122,7 +1122,7 @@ static void raid10_unplug(struct blk_plu
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || (current->plug && current->plug->in_generic_make_request)) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: block: flush queued bios when the process blocks
  2014-05-27 19:56           ` Kent Overstreet
@ 2015-10-05 19:50             ` Mike Snitzer
  0 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-05 19:50 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jens Axboe, dm-devel, Mikulas Patocka, linux-kernel, Alasdair G. Kergon

On Tue, May 27 2014 at  3:56pm -0400,
Kent Overstreet <kmo@daterainc.com> wrote:

> On Tue, May 27, 2014 at 01:33:06PM -0400, Mike Snitzer wrote:
> > On Tue, May 27 2014 at 12:26pm -0400,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> > 
> > > On Tue, 27 May 2014, Jens Axboe wrote:
> > > 
> > > > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > > > 
> > > > > The patch adds bio list flushing to the scheduler just beside plug
> > > > > flushing.
> > > > 
> > > > ... which is exactly why I'm commenting. It'd be great to avoid yet one more
> > > > scheduler hook for this sort of thing.
> > > > 
> > > > -- 
> > > > Jens Axboe
> > > 
> > > One could create something like schedule notifier chain, but I'm not sure 
> > > if it is worth the complexity because of just two users. If more users 
> > > come in the future, it could be generalized.
> > 
> > It could be that Jens is suggesting updating blk_needs_flush_plug() and
> > blk_schedule_flush_plug() to be bio_list aware too (rather than teaching
> > sched_submit_work() about this new bio_list)?
> > 
> > Somewhat awkward, but _could_ be made to work.
> 
> No...
> 
> I started on this ages and ages ago, but the patches were nowhere near ready to
> go out.
> 
> What I'd do is rework the existing blk_plug to work in terms of bios, not
> requests. This is a fair amount of work, and then make_request_fn() needs to be
> able to take multiple bios at a time, but IIRC there were other simplifications
> that fell out of this and this also means bio based drivers can take advantage
> of plugging/batching.
> 
> Once this is done, you just... replace the bio list in generic_make_request()
> with a plug. Boom, done.
> 
> (oh, and to avoid blowing the stack the scheduler/plug callback or whatever
> needs to check the current stack usage and punt off to a per request_queue
> workqueue, but that's easy enough).

Re-reading your response since this bug is still lingering, and Mikulas'
patch _still_ fixes the deadlock... you're suggesting all of the above
just to avoid an extra conditional scheduler callout?

Could be what you're proposing is "the one true way forward" but I'm not
touching it at this point.


* Re: block: flush queued bios when the process blocks
  2014-05-29 23:52           ` Mikulas Patocka
@ 2015-10-05 20:59             ` Mike Snitzer
  2015-10-06 13:28               ` Mikulas Patocka
  2015-10-06 18:17               ` [dm-devel] " Mikulas Patocka
  0 siblings, 2 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-05 20:59 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Kent Overstreet, Alasdair G. Kergon, linux-kernel, dm-devel

On Thu, May 29 2014 at  7:52pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Tue, 27 May 2014, Jens Axboe wrote:
> 
> > On 2014-05-27 10:26, Mikulas Patocka wrote:
> > > On Tue, 27 May 2014, Jens Axboe wrote:
> > > 
> > > > On 2014-05-27 09:23, Mikulas Patocka wrote:
> > > > 
> > > > > The patch adds bio list flushing to the scheduler just beside plug
> > > > > flushing.
> > > > 
> > > > ... which is exactly why I'm commenting. It'd be great to avoid yet one
> > > > more
> > > > scheduler hook for this sort of thing.
> > > > 
> > > > --
> > > > Jens Axboe
> > > 
> > > One could create something like schedule notifier chain, but I'm not sure
> > > if it is worth the complexity because of just two users. If more users
> > > come in the future, it could be generalized.
> > 
> > Except such a thing already exists, there are unplug callback chains. All I'm
> > asking is that you look into how feasible it would be to use something like
> > that, instead of reinventing the wheel.
> > 
> > -- 
> > Jens Axboe
> 
> 
> You can use this patch as an example that moves current->bio_list to
> struct plug, but I don't recommend putting it in the kernel - this patch
> still has some issues (some lvm raid tests fail).


Mikulas,

Could it be that cond_resched() wasn't unplugging?  As was
recently raised in this thread: https://lkml.org/lkml/2015/9/18/378
Chris Mason's patch from that thread fixed this issue... I _think_ Linus
has since committed Chris' work but I haven't kept my finger on the
pulse of that issue.

FYI, I've put rebased versions of your 2 patches in my wip branch, see:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

I tweaked the 2nd patch that adds bio_list to plug so that
generic_make_request's checks for in_generic_make_request aren't racy
(your original patch could happen to have current->plug set but
in_generic_make_request not yet set).

Anyway, I'm just dusting off your earlier work and seeing if I can make
sense of a way forward that meets Jens' approval.  Jens, if adding a
bio_list to struct blk_plug isn't what you were after either please be
more specific on what you would like to see...

Thanks,
Mike


* Re: block: flush queued bios when the process blocks
  2015-10-05 20:59             ` Mike Snitzer
@ 2015-10-06 13:28               ` Mikulas Patocka
  2015-10-06 13:47                 ` Mike Snitzer
  2015-10-06 18:17               ` [dm-devel] " Mikulas Patocka
  1 sibling, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-06 13:28 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Kent Overstreet, Alasdair G. Kergon, linux-kernel, dm-devel



On Mon, 5 Oct 2015, Mike Snitzer wrote:

> Mikulas,
> 
> Could it be that cond_resched() wasn't unplugging?  As was
> recently raised in this thread: https://lkml.org/lkml/2015/9/18/378
> Chris Mason's patch from that thread fixed this	issue... I _think_ Linus
> has since committed Chris' work but I haven't kept my finger on the
> pulse of that issue.

I think it doesn't matter (regarding correctness) if cond_resched unplugs
or not. If it didn't unplug, the process will be scheduled later, and it
will eventually reach the point where it unplugs.

> FYI, I've put rebased versions of your 2 patches in my wip branch, see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> 
> I tweaked the 2nd patch that adds bio_list to plug so that
> generic_make_request's checks for in_generic_make_request aren't racy
> (your original patch could happen to have current->plug set but
> in_generic_make_request not yet set).

I don't recommend that second patch 
(http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=5e740c2e45a767d8d6ef8ca36b0db705ef6259c4). 
The patch just complicates things without adding any value. It's also not
correct because it plugs bios at places where bios aren't supposed to be
plugged.

> Anyway, I'm just dusting off your earlier work and seeing if I can make
> sense of a way forward that meets Jens' approval.  Jens, if adding a
> bio_list to struct blk_plug isn't what you were after either please be
> more specific on what you would like to see...
> 
> Thanks,
> Mike

Mikulas


* Re: block: flush queued bios when the process blocks
  2015-10-06 13:28               ` Mikulas Patocka
@ 2015-10-06 13:47                 ` Mike Snitzer
  2015-10-06 14:10                   ` Mikulas Patocka
  2015-10-06 14:26                   ` Mikulas Patocka
  0 siblings, 2 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-06 13:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Kent Overstreet, Alasdair G. Kergon, linux-kernel, dm-devel

On Tue, Oct 06 2015 at  9:28am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Mon, 5 Oct 2015, Mike Snitzer wrote:
> 
> > Mikulas,
> > 
> > Could it be that cond_resched() wasn't unplugging?  As was
> > recently raised in this thread: https://lkml.org/lkml/2015/9/18/378
> > Chris Mason's patch from that thread fixed this	issue... I _think_ Linus
> > has since committed Chris' work but I haven't kept my finger on the
> > pulse of that issue.
> 
> I think it doesn't matter (regarding correctness) if cond_resched unplugs
> or not. If it didn't unplug, the process will be scheduled later, and it
> will eventually reach the point where it unplugs.

Couldn't the original deadlock you fixed (as described in your first
patch) manifest when a new process is scheduled?
 
> > FYI, I've put rebased versions of your 2 patches in my wip branch, see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> > 
> > I tweaked the 2nd patch that adds bio_list to plug so that
> > generic_make_request's checks for in_generic_make_request aren't racy
> > (your original patch could happen to have current->plug set but
> > in_generic_make_request not yet set).
> 
> I don't recommend that second patch 
> (http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=5e740c2e45a767d8d6ef8ca36b0db705ef6259c4). 
> The patch just complicates things without adding any value. It's also not
> correct because it plugs bios at places where bios aren't supposed to be
> plugged.

Avoiding another hook in the scheduler is a requirement (from Jens).
"without adding any value": it offers a different strategy for recording
bios on the bio_list by making it part of the plug.  The plug doesn't
actually block bios the way requests are plugged.  What am I missing?


* Re: block: flush queued bios when the process blocks
  2015-10-06 13:47                 ` Mike Snitzer
@ 2015-10-06 14:10                   ` Mikulas Patocka
  2015-10-06 14:26                   ` Mikulas Patocka
  1 sibling, 0 replies; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-06 14:10 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Kent Overstreet, Alasdair G. Kergon, linux-kernel, dm-devel



On Tue, 6 Oct 2015, Mike Snitzer wrote:

> On Tue, Oct 06 2015 at  9:28am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Mon, 5 Oct 2015, Mike Snitzer wrote:
> > 
> > > Mikulas,
> > > 
> > > Could it be that cond_resched() wasn't unplugging?  As was
> > > recently raised in this thread: https://lkml.org/lkml/2015/9/18/378
> > > Chris Mason's patch from that thread fixed this	issue... I _think_ Linus
> > > has since committed Chris' work but I haven't kept my finger on the
> > > pulse of that issue.
> > 
> > I think it doesn't matter (regarding correctness) if cond_resched unplugs
> > or not. If it didn't unplug, the process will be scheduled later, and it
> > will eventually reach the point where it unplugs.
> 
> Couldn't the original deadlock you fixed (as described in your first
> patch) manifest when a new process is scheduled?
>  
> > > FYI, I've put rebased versions of your 2 patches in my wip branch, see:
> > > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> > > 
> > > I tweaked the 2nd patch that adds bio_list to plug so that
> > > generic_make_request's checks for in_generic_make_request aren't racy
> > > (your original patch could happen to have current->plug set but
> > > in_generic_make_request not yet set).
> > 
> > I don't recommend that second patch 
> > (http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=5e740c2e45a767d8d6ef8ca36b0db705ef6259c4). 
> > The patch just complicates things without adding any value. It's also not 
> > correct because it plugs bios at places where bios aren't supposed to be
> > plugged.
> 
> Avoiding another hook in the scheduler is a requirement (from Jens).
> "without adding any value": it offers a different strategy for recording
> bios to the bio_list by making it part of the plug.  The plug doesn't
> actually block bios like requests are plugged.

Unfortunately, it does (when the second patch is applied).

> What am I missing?

Bios allocated with bio_kmalloc can't be unplugged because they lack the
rescuer workqueue (they shouldn't be used by stacking drivers; they
should only be used by top-level mm code). If you plug those bios at
incorrect places (i.e. between blk_start_plug and blk_finish_plug), you
introduce another deadlock possibility.


> Avoiding another hook in the scheduler is a requirement (from Jens).

So, you can add a flag that is set when either current->plug or 
current->bio_list is non-NULL and if the flag is set, run a function that 
unplugs both current->plug and current->bio_list.

But anyway - if you look at struct task_struct, you see that the "bio_list"
and "plug" fields are next to each other, so they will likely be in the
same cacheline. Testing current->bio_list in the schedule function thus has
no performance overhead because the cacheline was already loaded when
testing current->plug.

Mikulas


* Re: block: flush queued bios when the process blocks
  2015-10-06 13:47                 ` Mike Snitzer
  2015-10-06 14:10                   ` Mikulas Patocka
@ 2015-10-06 14:26                   ` Mikulas Patocka
  1 sibling, 0 replies; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-06 14:26 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Kent Overstreet, Alasdair G. Kergon, linux-kernel, dm-devel



On Tue, 6 Oct 2015, Mike Snitzer wrote:

> On Tue, Oct 06 2015 at  9:28am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Mon, 5 Oct 2015, Mike Snitzer wrote:
> > 
> > > Mikulas,
> > > 
> > > Could it be that cond_resched() wasn't unplugging?  As was
> > > recently raised in this thread: https://lkml.org/lkml/2015/9/18/378
> > > Chris Mason's patch from that thread fixed this	issue... I _think_ Linus
> > > has since committed Chris' work but I haven't kept my finger on the
> > > pulse of that issue.
> > 
> > I think it doesn't matter (regarding correctness) if cond_resched unplugs
> > or not. If it didn't unplug, the process will be scheduled later, and it
> > will eventually reach the point where it unplugs.
> 
> Couldn't the original deadlock you fixed (as described in your first
> patch) manifest when a new process is scheduled?

A process rescheduled with cond_resched is not stuck; it will be run later.
It may contribute to increased request latency, but it can't contribute to
a deadlock.

Mikulas


* Re: [dm-devel] block: flush queued bios when the process blocks
  2015-10-05 20:59             ` Mike Snitzer
  2015-10-06 13:28               ` Mikulas Patocka
@ 2015-10-06 18:17               ` Mikulas Patocka
  2015-10-06 18:50                 ` Mike Snitzer
  1 sibling, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-06 18:17 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Kent Overstreet, dm-devel, linux-kernel, Alasdair G. Kergon



On Mon, 5 Oct 2015, Mike Snitzer wrote:

> FYI, I've put rebased versions of your 2 patches in my wip branch, see:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

I found a bug in the first patch 
(http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2e90df2e9cf482f45be4230152535fdab525fbd8)

There is this piece of code:

	spin_lock(&bs->rescue_lock);
	bio_list_add(&bs->rescue_list, bio);
	spin_unlock(&bs->rescue_lock);
	queue_work(bs->rescue_workqueue, &bs->rescue_work);

It is possible that after spin_unlock and before queue_work the bio is
finished by a previous workqueue invocation. Once the bio is finished, it is
possible that the block device is unloaded and queue_work accesses freed
memory.

Change the code so that queue_work is executed inside the spinlock:
	spin_lock(&bs->rescue_lock);
	bio_list_add(&bs->rescue_list, bio);
	queue_work(bs->rescue_workqueue, &bs->rescue_work);
	spin_unlock(&bs->rescue_lock);

Mikulas


* Re: block: flush queued bios when the process blocks
  2015-10-06 18:17               ` [dm-devel] " Mikulas Patocka
@ 2015-10-06 18:50                 ` Mike Snitzer
  2015-10-06 20:16                   ` [PATCH v2] " Mike Snitzer
  0 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2015-10-06 18:50 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Kent Overstreet, dm-devel, linux-kernel, Alasdair G. Kergon

On Tue, Oct 06 2015 at  2:17pm -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Mon, 5 Oct 2015, Mike Snitzer wrote:
> 
> > FYI, I've put rebased versions of your 2 patches in my wip branch, see:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> 
> I found a bug in the first patch 
> (http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2e90df2e9cf482f45be4230152535fdab525fbd8)
> 
> There is this piece of code:
> 
> 	spin_lock(&bs->rescue_lock);
> 	bio_list_add(&bs->rescue_list, bio);
> 	spin_unlock(&bs->rescue_lock);
> 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> 
> It is possible that after spin_unlock and before queue_work the bio is 
> finished by previous workqueue invocation. When the bio is finished, it is 
> possible that the block device is unloaded and queue_work accesses freed 
> memory.
> 
> Change the code so that queue_work is executed inside the spinlock:
> 	spin_lock(&bs->rescue_lock);
> 	bio_list_add(&bs->rescue_list, bio);
> 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> 	spin_unlock(&bs->rescue_lock);

OK, but that should get pulled out to a separate stable@ fix that patch
you reference builds on.

I've adjusted my 'wip' branch accordingly (with placeholder commit that
needs revised header, etc).


* [PATCH v2] block: flush queued bios when the process blocks
  2015-10-06 18:50                 ` Mike Snitzer
@ 2015-10-06 20:16                   ` Mike Snitzer
  2015-10-06 20:26                     ` Mike Snitzer
                                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-06 20:16 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, kent.overstreet, dm-devel, linux-kernel, Alasdair G. Kergon

To give others context for why I'm revisiting this issue, this
recent BZ against 4.3-rc served as a reminder that we _need_ a fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1267650

FYI, I cleaned up the plug-based approach a bit further, here is the
incremental patch:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=f73d001ec692125308accbb5ca26f892f949c1b6

And here is a new version of the overall combined patch (sharing now
before I transition to looking at alternatives, though my gut says the use
of a plug in generic_make_request really wouldn't hurt us... famous last
words):

 block/bio.c            | 82 +++++++++++++-------------------------------------
 block/blk-core.c       | 21 ++++++++-----
 drivers/md/dm-bufio.c  |  2 +-
 drivers/md/raid1.c     |  6 ++--
 drivers/md/raid10.c    |  6 ++--
 include/linux/blkdev.h | 11 +++++--
 include/linux/sched.h  |  4 ---
 7 files changed, 51 insertions(+), 81 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index ad3f276..3d03668 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,35 +354,31 @@ static void bio_alloc_rescue(struct work_struct *work)
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @plug: the blk_plug that may have collected bios
+ *
+ * Pop bios queued on plug->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we use the default fs_bio_set.
+ * However, stacking drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct blk_plug *plug)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while (plug->bio_list && (bio = bio_list_pop(plug->bio_list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (!bs)
+			bs = fs_bio_set;
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_unlock(&bs->rescue_lock);
+	}
 }
 
 /**
@@ -422,7 +418,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -443,37 +438,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		/* should not use nobvec bioset for nr_iovecs > 0 */
 		if (WARN_ON_ONCE(!bs->bvec_pool && nr_iovecs > 0))
 			return NULL;
-		/*
-		 * generic_make_request() converts recursion to iteration; this
-		 * means if we're running beneath it, any bios we allocate and
-		 * submit will not be submitted (and thus freed) until after we
-		 * return.
-		 *
-		 * This exposes us to a potential deadlock if we allocate
-		 * multiple bios from the same bio_set() while running
-		 * underneath generic_make_request(). If we were to allocate
-		 * multiple bios (say a stacking block driver that was splitting
-		 * bios), we would deadlock if we exhausted the mempool's
-		 * reserve.
-		 *
-		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
-		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -486,12 +452,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d..cf0706a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1927,6 +1927,7 @@ end_io:
 void generic_make_request(struct bio *bio)
 {
 	struct bio_list bio_list_on_stack;
+	struct blk_plug plug;
 
 	if (!generic_make_request_checks(bio))
 		return;
@@ -1934,15 +1935,15 @@ void generic_make_request(struct bio *bio)
 	/*
 	 * We only want one ->make_request_fn to be active at a time, else
 	 * stack usage with stacked devices could be a problem.  So use
-	 * current->bio_list to keep a list of requests submited by a
-	 * make_request_fn function.  current->bio_list is also used as a
+	 * current->plug->bio_list to keep a list of requests submitted by a
+	 * make_request_fn function.  current->plug->bio_list is also used as a
 	 * flag to say if generic_make_request is currently active in this
 	 * task or not.  If it is NULL, then no make_request is active.  If
 	 * it is non-NULL, then a make_request is active, and new requests
 	 * should be added at the tail
 	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	if (current->plug && current->plug->bio_list) {
+		bio_list_add(current->plug->bio_list, bio);
 		return;
 	}
 
@@ -1962,15 +1963,17 @@ void generic_make_request(struct bio *bio)
 	 */
 	BUG_ON(bio->bi_next);
 	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	blk_start_plug(&plug);
+	current->plug->bio_list = &bio_list_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		q->make_request_fn(q, bio);
 
-		bio = bio_list_pop(current->bio_list);
+		bio = bio_list_pop(current->plug->bio_list);
 	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	current->plug->bio_list = NULL; /* deactivate */
+	blk_finish_plug(&plug);
 }
 EXPORT_SYMBOL(generic_make_request);
 
@@ -3065,6 +3068,8 @@ void blk_start_plug(struct blk_plug *plug)
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
+	plug->bio_list = NULL;
+
 	/*
 	 * Store ordering should not be needed here, since a potential
 	 * preempt will imply a full memory barrier
@@ -3151,6 +3156,8 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	LIST_HEAD(list);
 	unsigned int depth;
 
+	blk_flush_bio_list(plug);
+
 	flush_plug_callbacks(plug, from_schedule);
 
 	if (!list_empty(&plug->mq_list))
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 2dd3308..c2bff16 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -168,7 +168,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
 #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
 #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
 
-#define dm_bufio_in_request()	(!!current->bio_list)
+#define dm_bufio_in_request()	(current->plug && !!current->plug->bio_list)
 
 static void dm_bufio_lock(struct dm_bufio_client *c)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4517f06..357782f 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -874,8 +874,8 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
+				      (current->plug && current->plug->bio_list &&
+				       !bio_list_empty(current->plug->bio_list)))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1013,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || (current->plug && current->plug->bio_list)) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0fc33eb..780681f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -944,8 +944,8 @@ static void wait_barrier(struct r10conf *conf)
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (conf->nr_pending &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     (current->plug && current->plug->bio_list &&
+				      !bio_list_empty(current->plug->bio_list))),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1021,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || (current->plug && current->plug->bio_list)) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99da9eb..9bdac70 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1040,6 +1040,7 @@ struct blk_plug {
 	struct list_head list; /* requests */
 	struct list_head mq_list; /* blk-mq requests */
 	struct list_head cb_list; /* md requires an unplug callback */
+	struct bio_list *bio_list; /* queued bios from stacked block device */
 };
 #define BLK_MAX_REQUEST_COUNT 16
 
@@ -1079,9 +1080,12 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 	return plug &&
 		(!list_empty(&plug->list) ||
 		 !list_empty(&plug->mq_list) ||
-		 !list_empty(&plug->cb_list));
+		 !list_empty(&plug->cb_list) ||
+		 (plug->bio_list && !bio_list_empty(plug->bio_list)));
 }
 
+extern void blk_flush_bio_list(struct blk_plug *plug);
+
 /*
  * tag stuff
  */
@@ -1673,12 +1677,15 @@ static inline void blk_schedule_flush_plug(struct task_struct *task)
 {
 }
 
-
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	return false;
 }
 
+static inline void blk_flush_bio_list(void)
+{
+}
+
 static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 				     sector_t *error_sector)
 {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b7b9501..ca304f1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,7 +128,6 @@ struct sched_attr {
 
 struct futex_pi_state;
 struct robust_list_head;
-struct bio_list;
 struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
@@ -1633,9 +1632,6 @@ struct task_struct {
 /* journalling filesystem info */
 	void *journal_info;
 
-/* stacked block device info */
-	struct bio_list *bio_list;
-
 #ifdef CONFIG_BLOCK
 /* stack plugging */
 	struct blk_plug *plug;

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-06 20:16                   ` [PATCH v2] " Mike Snitzer
@ 2015-10-06 20:26                     ` Mike Snitzer
  2015-10-08 15:04                     ` Mikulas Patocka
  2015-10-09 11:58                     ` kbuild test robot
  2 siblings, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-06 20:26 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, kent.overstreet, dm-devel, linux-kernel, Alasdair G. Kergon

On Tue, Oct 06 2015 at  4:16P -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> To give others context for why I'm caring about this issue again, this
> recent BZ against 4.3-rc served as a reminder that we _need_ a fix:
> https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> 
> FYI, I cleaned up the plug-based approach a bit further, here is the
> incremental patch:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=f73d001ec692125308accbb5ca26f892f949c1b6
> 
> And here is a new version of the overall combined patch (sharing now
> before I transition to looking at alternatives, though my gut is the use
> of a plug in generic_make_request really wouldn't hurt us.. famous last
> words):
> 
>  block/bio.c            | 82 +++++++++++++-------------------------------------
>  block/blk-core.c       | 21 ++++++++-----
>  drivers/md/dm-bufio.c  |  2 +-
>  drivers/md/raid1.c     |  6 ++--
>  drivers/md/raid10.c    |  6 ++--
>  include/linux/blkdev.h | 11 +++++--
>  include/linux/sched.h  |  4 ---
>  7 files changed, 51 insertions(+), 81 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276..3d03668 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -354,35 +354,31 @@ static void bio_alloc_rescue(struct work_struct *work)
>  	}
>  }
>  
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> +/**
> + * blk_flush_bio_list
> + * @plug: the blk_plug that may have collected bios
> + *
> + * Pop bios queued on plug->bio_list and submit each of them to
> + * their rescue workqueue.
> + *
> + * If the bio doesn't have a bio_set, we use the default fs_bio_set.
> + * However, stacking drivers should use bio_set, so this shouldn't be
> + * an issue.
> + */
> +void blk_flush_bio_list(struct blk_plug *plug)
>  {
> -	struct bio_list punt, nopunt;
>  	struct bio *bio;
>  
> -	/*
> -	 * In order to guarantee forward progress we must punt only bios that
> -	 * were allocated from this bio_set; otherwise, if there was a bio on
> -	 * there for a stacking driver higher up in the stack, processing it
> -	 * could require allocating bios from this bio_set, and doing that from
> -	 * our own rescuer would be bad.
> -	 *
> -	 * Since bio lists are singly linked, pop them all instead of trying to
> -	 * remove from the middle of the list:
> -	 */
> -
> -	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
> -
> -	while ((bio = bio_list_pop(current->bio_list)))
> -		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -	*current->bio_list = nopunt;
> -
> -	spin_lock(&bs->rescue_lock);
> -	bio_list_merge(&bs->rescue_list, &punt);
> -	spin_unlock(&bs->rescue_lock);
> +	while ((bio = bio_list_pop(&plug->bio_list))) {

Bleh, should be plug->bio_list.. obviously I didn't compile test this...

Here is the incremental I've folded in:

diff --git a/block/bio.c b/block/bio.c
index 3d03668..b868b9e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -369,7 +369,10 @@ void blk_flush_bio_list(struct blk_plug *plug)
 {
 	struct bio *bio;
 
-	while ((bio = bio_list_pop(&plug->bio_list))) {
+	if (!plug->bio_list)
+		return;
+
+	while ((bio = bio_list_pop(plug->bio_list))) {
 		struct bio_set *bs = bio->bi_pool;
 		if (!bs)
 			bs = fs_bio_set;


* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-06 20:16                   ` [PATCH v2] " Mike Snitzer
  2015-10-06 20:26                     ` Mike Snitzer
@ 2015-10-08 15:04                     ` Mikulas Patocka
  2015-10-08 15:08                       ` Mike Snitzer
  2015-10-09 11:58                     ` kbuild test robot
  2 siblings, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-08 15:04 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, kent.overstreet, dm-devel, linux-kernel, Alasdair G. Kergon



On Tue, 6 Oct 2015, Mike Snitzer wrote:

> To give others context for why I'm caring about this issue again, this
> recent BZ against 4.3-rc served as a reminder that we _need_ a fix:
> https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> 
> FYI, I cleaned up the plug-based approach a bit further, here is the
> incremental patch:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=f73d001ec692125308accbb5ca26f892f949c1b6
> 
> And here is a new version of the overall combined patch (sharing now
> before I transition to looking at alternatives, though my gut is the use
> of a plug in generic_make_request really wouldn't hurt us.. famous last
> words):
> 
>  block/bio.c            | 82 +++++++++++++-------------------------------------
>  block/blk-core.c       | 21 ++++++++-----
>  drivers/md/dm-bufio.c  |  2 +-
>  drivers/md/raid1.c     |  6 ++--
>  drivers/md/raid10.c    |  6 ++--
>  include/linux/blkdev.h | 11 +++++--
>  include/linux/sched.h  |  4 ---
>  7 files changed, 51 insertions(+), 81 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276..3d03668 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -354,35 +354,31 @@ static void bio_alloc_rescue(struct work_struct *work)
>  	}
>  }
>  
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> +/**
> + * blk_flush_bio_list
> + * @plug: the blk_plug that may have collected bios
> + *
> + * Pop bios queued on plug->bio_list and submit each of them to
> + * their rescue workqueue.
> + *
> + * If the bio doesn't have a bio_set, we use the default fs_bio_set.
> + * However, stacking drivers should use bio_set, so this shouldn't be
> + * an issue.
> + */
> +void blk_flush_bio_list(struct blk_plug *plug)
>  {
> -	struct bio_list punt, nopunt;
>  	struct bio *bio;
>  
> -	/*
> -	 * In order to guarantee forward progress we must punt only bios that
> -	 * were allocated from this bio_set; otherwise, if there was a bio on
> -	 * there for a stacking driver higher up in the stack, processing it
> -	 * could require allocating bios from this bio_set, and doing that from
> -	 * our own rescuer would be bad.
> -	 *
> -	 * Since bio lists are singly linked, pop them all instead of trying to
> -	 * remove from the middle of the list:
> -	 */
> -
> -	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
> -
> -	while ((bio = bio_list_pop(current->bio_list)))
> -		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -	*current->bio_list = nopunt;
> -
> -	spin_lock(&bs->rescue_lock);
> -	bio_list_merge(&bs->rescue_list, &punt);
> -	spin_unlock(&bs->rescue_lock);
> +	while ((bio = bio_list_pop(&plug->bio_list))) {
> +		struct bio_set *bs = bio->bi_pool;
> +		if (!bs)
> +			bs = fs_bio_set;
>  
> -	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +		spin_lock(&bs->rescue_lock);
> +		bio_list_add(&bs->rescue_list, bio);
> +		queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +		spin_unlock(&bs->rescue_lock);
> +	}
>  }
>  
>  /**
> @@ -422,7 +418,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -	gfp_t saved_gfp = gfp_mask;
>  	unsigned front_pad;
>  	unsigned inline_vecs;
>  	unsigned long idx = BIO_POOL_NONE;
> @@ -443,37 +438,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		/* should not use nobvec bioset for nr_iovecs > 0 */
>  		if (WARN_ON_ONCE(!bs->bvec_pool && nr_iovecs > 0))
>  			return NULL;
> -		/*
> -		 * generic_make_request() converts recursion to iteration; this
> -		 * means if we're running beneath it, any bios we allocate and
> -		 * submit will not be submitted (and thus freed) until after we
> -		 * return.
> -		 *
> -		 * This exposes us to a potential deadlock if we allocate
> -		 * multiple bios from the same bio_set() while running
> -		 * underneath generic_make_request(). If we were to allocate
> -		 * multiple bios (say a stacking block driver that was splitting
> -		 * bios), we would deadlock if we exhausted the mempool's
> -		 * reserve.
> -		 *
> -		 * We solve this, and guarantee forward progress, with a rescuer
> -		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_WAIT; if that fails, we punt those bios we
> -		 * would be blocking to the rescuer workqueue before we retry
> -		 * with the original gfp_flags.
> -		 */
> -
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> -			gfp_mask &= ~__GFP_WAIT;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> -		if (!p && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			p = mempool_alloc(bs->bio_pool, gfp_mask);
> -		}
> -
>  		front_pad = bs->front_pad;
>  		inline_vecs = BIO_INLINE_VECS;
>  	}
> @@ -486,12 +452,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  
>  	if (nr_iovecs > inline_vecs) {
>  		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -		if (!bvl && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -		}
> -
>  		if (unlikely(!bvl))
>  			goto err_free;
>  
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2eb722d..cf0706a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1927,6 +1927,7 @@ end_io:
>  void generic_make_request(struct bio *bio)
>  {
>  	struct bio_list bio_list_on_stack;
> +	struct blk_plug plug;
>  
>  	if (!generic_make_request_checks(bio))
>  		return;
> @@ -1934,15 +1935,15 @@ void generic_make_request(struct bio *bio)
>  	/*
>  	 * We only want one ->make_request_fn to be active at a time, else
>  	 * stack usage with stacked devices could be a problem.  So use
> -	 * current->bio_list to keep a list of requests submited by a
> -	 * make_request_fn function.  current->bio_list is also used as a
> +	 * current->plug->bio_list to keep a list of requests submitted by a
> +	 * make_request_fn function.  current->plug->bio_list is also used as a
>  	 * flag to say if generic_make_request is currently active in this
>  	 * task or not.  If it is NULL, then no make_request is active.  If
>  	 * it is non-NULL, then a make_request is active, and new requests
>  	 * should be added at the tail
>  	 */
> -	if (current->bio_list) {
> -		bio_list_add(current->bio_list, bio);
> +	if (current->plug && current->plug->bio_list) {
> +		bio_list_add(&current->plug->bio_list, bio);
>  		return;
>  	}
>  
> @@ -1962,15 +1963,17 @@ void generic_make_request(struct bio *bio)
>  	 */
>  	BUG_ON(bio->bi_next);
>  	bio_list_init(&bio_list_on_stack);
> -	current->bio_list = &bio_list_on_stack;
> +	blk_start_plug(&plug);
> +	current->plug->bio_list = &bio_list_on_stack;
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		q->make_request_fn(q, bio);
>  
> -		bio = bio_list_pop(current->bio_list);
> +		bio = bio_list_pop(current->plug->bio_list);
>  	} while (bio);
> -	current->bio_list = NULL; /* deactivate */
> +	current->plug->bio_list = NULL; /* deactivate */
> +	blk_finish_plug(&plug);
>  }
>  EXPORT_SYMBOL(generic_make_request);
>  
> @@ -3065,6 +3068,8 @@ void blk_start_plug(struct blk_plug *plug)
>  	INIT_LIST_HEAD(&plug->list);
>  	INIT_LIST_HEAD(&plug->mq_list);
>  	INIT_LIST_HEAD(&plug->cb_list);
> +	plug->bio_list = NULL;
> +
>  	/*
>  	 * Store ordering should not be needed here, since a potential
>  	 * preempt will imply a full memory barrier
> @@ -3151,6 +3156,8 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  	LIST_HEAD(list);
>  	unsigned int depth;
>  
> +	blk_flush_bio_list(plug);
> +
>  	flush_plug_callbacks(plug, from_schedule);
>  
>  	if (!list_empty(&plug->mq_list))
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 2dd3308..c2bff16 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -168,7 +168,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
>  
> -#define dm_bufio_in_request()	(!!current->bio_list)
> +#define dm_bufio_in_request()	(current->plug && !!current->plug->bio_list)

This condition is repeated several times throughout the whole patch, so 
maybe you should make it a function in the block device header file.
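
A sketch of what such a shared helper could look like (the helper name
and the pared-down struct definitions standing in for the kernel's are
illustrative only, not from the posted patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Pared-down stand-ins for the kernel structures (illustrative only). */
struct bio_list { void *head, *tail; };
struct blk_plug { struct bio_list *bio_list; };
struct task_struct { struct blk_plug *plug; };

/*
 * One possible shape for the suggested helper: a single inline
 * predicate replacing every open-coded
 * "current->plug && current->plug->bio_list" check.
 */
static inline bool blk_bio_list_active(struct task_struct *tsk)
{
	return tsk->plug && tsk->plug->bio_list;
}
```

dm_bufio_in_request() would then reduce to blk_bio_list_active(current),
and the raid1/raid10 wait conditions could call the same predicate.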

Mikulas


* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-08 15:04                     ` Mikulas Patocka
@ 2015-10-08 15:08                       ` Mike Snitzer
  2015-10-09 19:52                         ` Mike Snitzer
  0 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2015-10-08 15:08 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, kent.overstreet, dm-devel, linux-kernel, Alasdair G. Kergon

On Thu, Oct 08 2015 at 11:04am -0400,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Tue, 6 Oct 2015, Mike Snitzer wrote:
> 
> > To give others context for why I'm caring about this issue again, this
> > recent BZ against 4.3-rc served as a reminder that we _need_ a fix:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> > 
> > FYI, I cleaned up the plug-based approach a bit further, here is the
> > incremental patch:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=f73d001ec692125308accbb5ca26f892f949c1b6
> > 
> > And here is a new version of the overall combined patch (sharing now
> > before I transition to looking at alternatives, though my gut is the use
> > of a plug in generic_make_request really wouldn't hurt us.. famous last
> > words):
> > 
> >  block/bio.c            | 82 +++++++++++++-------------------------------------
> >  block/blk-core.c       | 21 ++++++++-----
> >  drivers/md/dm-bufio.c  |  2 +-
> >  drivers/md/raid1.c     |  6 ++--
> >  drivers/md/raid10.c    |  6 ++--
> >  include/linux/blkdev.h | 11 +++++--
> >  include/linux/sched.h  |  4 ---
> >  7 files changed, 51 insertions(+), 81 deletions(-)
> > 
...
> > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> > index 2dd3308..c2bff16 100644
> > --- a/drivers/md/dm-bufio.c
> > +++ b/drivers/md/dm-bufio.c
> > @@ -168,7 +168,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
> >  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
> >  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
> >  
> > -#define dm_bufio_in_request()	(!!current->bio_list)
> > +#define dm_bufio_in_request()	(current->plug && !!current->plug->bio_list)
> 
> This condition is repeated several times throughout the whole patch - so 
> maybe you should make it a function in block device header file.

Yeah, I thought of that too but forgot to come back to it.  Will do,
thanks.

FYI, I found another bug in my last patch and fixed it up.  I'll get
some refactoring done (including your suggestion), actually _test_ the
code (e.g. verify all of lvm testsuite passes) and then send out v3.

Mike


* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-06 20:16                   ` [PATCH v2] " Mike Snitzer
  2015-10-06 20:26                     ` Mike Snitzer
  2015-10-08 15:04                     ` Mikulas Patocka
@ 2015-10-09 11:58                     ` kbuild test robot
  2 siblings, 0 replies; 43+ messages in thread
From: kbuild test robot @ 2015-10-09 11:58 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: kbuild-all, Mikulas Patocka, Jens Axboe, kent.overstreet,
	dm-devel, linux-kernel, Alasdair G. Kergon

[-- Attachment #1: Type: text/plain, Size: 4629 bytes --]

Hi Mike,

[auto build test ERROR on v4.3-rc4 -- if it's inappropriate base, please ignore]

config: x86_64-lkp (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from include/linux/linkage.h:4:0,
                    from include/linux/fs.h:4,
                    from include/linux/highmem.h:4,
                    from include/linux/bio.h:23,
                    from drivers/md/bcache/bcache.h:181,
                    from drivers/md/bcache/btree.c:23:
   drivers/md/bcache/btree.c: In function '__bch_btree_node_write':
>> drivers/md/bcache/btree.c:454:16: error: 'struct task_struct' has no member named 'bio_list'
     BUG_ON(current->bio_list);
                   ^
   include/linux/compiler.h:166:42: note: in definition of macro 'unlikely'
    # define unlikely(x) __builtin_expect(!!(x), 0)
                                             ^
   drivers/md/bcache/btree.c:454:2: note: in expansion of macro 'BUG_ON'
     BUG_ON(current->bio_list);
     ^
   drivers/md/bcache/btree.c: In function 'bch_btree_leaf_dirty':
   drivers/md/bcache/btree.c:548:14: error: 'struct task_struct' has no member named 'bio_list'
         !current->bio_list)
                 ^
   In file included from include/linux/linkage.h:4:0,
                    from include/linux/fs.h:4,
                    from include/linux/highmem.h:4,
                    from include/linux/bio.h:23,
                    from drivers/md/bcache/bcache.h:181,
                    from drivers/md/bcache/btree.c:23:
   drivers/md/bcache/btree.c: In function 'mca_alloc':
   drivers/md/bcache/btree.c:893:16: error: 'struct task_struct' has no member named 'bio_list'
     BUG_ON(current->bio_list);
                   ^
   include/linux/compiler.h:166:42: note: in definition of macro 'unlikely'
    # define unlikely(x) __builtin_expect(!!(x), 0)
                                             ^
   drivers/md/bcache/btree.c:893:2: note: in expansion of macro 'BUG_ON'
     BUG_ON(current->bio_list);
     ^
   drivers/md/bcache/btree.c: In function 'bch_btree_node_get':
   drivers/md/bcache/btree.c:980:14: error: 'struct task_struct' has no member named 'bio_list'
      if (current->bio_list)
                 ^
   drivers/md/bcache/btree.c: In function 'bch_btree_insert_node':
   drivers/md/bcache/btree.c:2131:13: error: 'struct task_struct' has no member named 'bio_list'
     if (current->bio_list) {
                ^
   In file included from include/linux/linkage.h:4:0,
                    from include/linux/fs.h:4,
                    from include/linux/highmem.h:4,
                    from include/linux/bio.h:23,
                    from drivers/md/bcache/bcache.h:181,
                    from drivers/md/bcache/btree.c:23:
   drivers/md/bcache/btree.c: In function 'bch_btree_insert':
   drivers/md/bcache/btree.c:2211:16: error: 'struct task_struct' has no member named 'bio_list'
     BUG_ON(current->bio_list);
                   ^
   include/linux/compiler.h:166:42: note: in definition of macro 'unlikely'
    # define unlikely(x) __builtin_expect(!!(x), 0)
                                             ^
   drivers/md/bcache/btree.c:2211:2: note: in expansion of macro 'BUG_ON'
     BUG_ON(current->bio_list);
     ^
   drivers/md/bcache/btree.c: In function 'bch_btree_insert_node':
   drivers/md/bcache/btree.c:2147:1: warning: control reaches end of non-void function [-Wreturn-type]
    }
    ^

vim +454 drivers/md/bcache/btree.c

ee811287 Kent Overstreet 2013-12-17  448  	struct bset *i = btree_bset_last(b);
cafe5635 Kent Overstreet 2013-03-23  449  
2a285686 Kent Overstreet 2014-03-04  450  	lockdep_assert_held(&b->write_lock);
2a285686 Kent Overstreet 2014-03-04  451  
c37511b8 Kent Overstreet 2013-04-26  452  	trace_bcache_btree_write(b);
c37511b8 Kent Overstreet 2013-04-26  453  
cafe5635 Kent Overstreet 2013-03-23 @454  	BUG_ON(current->bio_list);
57943511 Kent Overstreet 2013-04-25  455  	BUG_ON(b->written >= btree_blocks(b));
57943511 Kent Overstreet 2013-04-25  456  	BUG_ON(b->written && !i->keys);
ee811287 Kent Overstreet 2013-12-17  457  	BUG_ON(btree_bset_first(b)->seq != i->seq);

:::::: The code at line 454 was first introduced by commit
:::::: cafe563591446cf80bfbc2fe3bc72a2e36cf1060 bcache: A block layer cache

:::::: TO: Kent Overstreet <koverstreet@google.com>
:::::: CC: Kent Overstreet <koverstreet@google.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 21911 bytes --]


* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-08 15:08                       ` Mike Snitzer
@ 2015-10-09 19:52                         ` Mike Snitzer
  2015-10-09 19:59                           ` Mike Snitzer
  2015-10-15  3:27                           ` [PATCH v2] block: flush queued bios when the process blocks Ming Lei
  0 siblings, 2 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-09 19:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, kent.overstreet, dm-devel, linux-kernel, Alasdair G. Kergon

On Thu, Oct 08 2015 at 11:08am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Oct 08 2015 at 11:04am -0400,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
> 
> > 
> > 
> > On Tue, 6 Oct 2015, Mike Snitzer wrote:
> > 
> > > To give others context for why I'm caring about this issue again, this
> > > recent BZ against 4.3-rc served as a reminder that we _need_ a fix:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> > > 
> > > FYI, I cleaned up the plug-based approach a bit further, here is the
> > > incremental patch:
> > > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=f73d001ec692125308accbb5ca26f892f949c1b6
> > > 
> > > And here is a new version of the overall combined patch (sharing now
> > > before I transition to looking at alternatives, though my gut is the use
> > > of a plug in generic_make_request really wouldn't hurt us.. famous last
> > > words):
> > > 
> > >  block/bio.c            | 82 +++++++++++++-------------------------------------
> > >  block/blk-core.c       | 21 ++++++++-----
> > >  drivers/md/dm-bufio.c  |  2 +-
> > >  drivers/md/raid1.c     |  6 ++--
> > >  drivers/md/raid10.c    |  6 ++--
> > >  include/linux/blkdev.h | 11 +++++--
> > >  include/linux/sched.h  |  4 ---
> > >  7 files changed, 51 insertions(+), 81 deletions(-)
> > > 
> ...
> > > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> > > index 2dd3308..c2bff16 100644
> > > --- a/drivers/md/dm-bufio.c
> > > +++ b/drivers/md/dm-bufio.c
> > > @@ -168,7 +168,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
> > >  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
> > >  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
> > >  
> > > -#define dm_bufio_in_request()	(!!current->bio_list)
> > > +#define dm_bufio_in_request()	(current->plug && !!current->plug->bio_list)
> > 
> > This condition is repeated several times throughout the whole patch - so 
> > maybe you should make it a function in block device header file.
> 
> Yeah, I thought of that too but forgot to come back to it.  Will do,
> thanks.
> 
> FYI, I found another bug in my last patch and fixed it up.  I'll get
> some refactoring done (including your suggestion), actually _test_ the
> code (e.g. verify all of lvm testsuite passes) and then send out v3.

Turns out that this change:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841

needed to be reverted with:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe

Because nested plugs caused generic_make_request()'s onstack bio_list to
go out of scope (blk_finish_plug() wouldn't actually flush the list
within generic_make_request because XFS already added an outermost
plug).
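
The failure mode can be illustrated with a deliberately simplified
userspace model (none of this is kernel code; start_plug, finish_plug
and the make_request function below only mimic the nesting behaviour
described above): only the outermost plug is installed, so a nested
finish flushes nothing, and the pointer to the on-stack list survives
the stack frame that owns it.

```c
#include <stddef.h>

struct bio_list { int pending; };
struct blk_plug { struct bio_list *bio_list; };

static struct blk_plug *current_plug;          /* models current->plug */

/* A nested plug does not replace the one already active. */
static void start_plug(struct blk_plug *plug)
{
	plug->bio_list = NULL;
	if (!current_plug)
		current_plug = plug;
}

/* Only the owning (outermost) plug flushes on finish. */
static void finish_plug(struct blk_plug *plug)
{
	if (current_plug == plug)
		current_plug = NULL;               /* flush would happen here */
	/* nested finish: no flush at all */
}

/*
 * Models generic_make_request() parking its on-stack bio list in the
 * active plug.  Returns 1 if, at the point this frame is about to die,
 * the outer plug still points at the local list - i.e. the pointer is
 * about to dangle.
 */
static int make_request_leaves_dangling(void)
{
	struct bio_list on_stack = { 0 };
	struct blk_plug inner;

	start_plug(&inner);                 /* no-op: outer plug is active */
	current_plug->bio_list = &on_stack; /* points into THIS stack frame */
	finish_plug(&inner);                /* nested: nothing flushed */
	return current_plug->bio_list == &on_stack;
}
```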

But even after fixing that I then hit issues with these changes: they
result in imperfect 'in_generic_make_request' accounting that happens
lazily, only once the outermost plug completes blk_finish_plug.  This
manifested as dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request())
being hit.

Basically using the blk-core's onstack plugging isn't workable for
fixing this deadlock and we're back to having to seriously consider
this (with its additional hook in the scheduler): 

Mike


* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-09 19:52                         ` Mike Snitzer
@ 2015-10-09 19:59                           ` Mike Snitzer
  2015-10-14 20:47                             ` [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock Mike Snitzer
  2015-10-15  3:27                           ` [PATCH v2] block: flush queued bios when the process blocks Ming Lei
  1 sibling, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2015-10-09 19:59 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, kent.overstreet, dm-devel, linux-kernel, Alasdair G. Kergon

On Fri, Oct 09 2015 at  3:52pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:
 
> Turns out that this change:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841
> 
> needed to be reverted with:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe
> 
> Because nested plugs caused generic_make_request()'s onstack bio_list to
> go out of scope (blk_finish_plug() wouldn't actually flush the list
> within generic_make_request because XFS already added an outermost
> plug).
> 
> But even after fixing that I then hit issues with these changes: they
> result in imperfect 'in_generic_make_request' accounting that happens
> lazily, only once the outermost plug completes blk_finish_plug.  This
> manifested as dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request())
> being hit.
> 
> Basically using the blk-core's onstack plugging isn't workable for
> fixing this deadlock and we're back to having to seriously consider
> this (with its additional hook in the scheduler): 

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=a91709cd32b5ca7ca047b68c9299e747f2ae6ca2


* [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-09 19:59                           ` Mike Snitzer
@ 2015-10-14 20:47                             ` Mike Snitzer
  2015-10-14 21:44                               ` Jeff Moyer
  2015-10-17 16:04                               ` Ming Lei
  0 siblings, 2 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-14 20:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: kent.overstreet, Mikulas Patocka, dm-devel, linux-kernel,
	Alasdair G. Kergon, jmoyer

From: Mikulas Patocka <mpatocka@redhat.com>

The block layer uses per-process bio list to avoid recursion in
generic_make_request.  When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately.  The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However another deadlock (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call.  Consequently, when
the process blocks on a mutex, the bios queued on current->bio_list are
dispatched to independent workqueues and they can complete without
waiting for the mutex to be available.

Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
calls to it because bio_alloc_bioset() will implicitly punt all bios on
current->bio_list if it performs a blocking allocation.

** Here is the dm-snapshot deadlock that was observed:

1) Process A sends a one-page read bio to the dm-snapshot target. The bio
spans a snapshot chunk boundary and so it is split into two bios by device
mapper.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (that allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
the bio to the underlying device and exits with DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
issued - it is added to current->bio_list instead.

5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
chunk affected by the first remapped bio; it takes down_write(&s->lock)
and then loops in __check_for_conflicting_io, waiting for
dm_snap_tracked_chunk created in step 3) to be released.

6) Process A continues; it creates a second sub-bio for the rest of the
original bio.

7) snapshot_map is called for this new bio; it waits on
down_write(&s->lock) that is held by Process B (in step 5).

The resulting deadlock:
* the bio queued in step 4) waits for the snapshot_map call in step 7)
  to finish
* the snapshot_map call in step 7) waits for s->lock taken in step 5) to
  be released
* the process in step 5) waits for the bio queued in step 4) to complete

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Cc: stable@vger.kernel.org
---
 block/bio.c            | 75 +++++++++++++++++++-------------------------------
 include/linux/blkdev.h | 19 +++++++++++--
 kernel/sched/core.c    |  7 ++---
 3 files changed, 48 insertions(+), 53 deletions(-)

v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(),
    io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list()
    now takes a @tsk argument rather than assuming current. v3 is being submitted with
    more feeling now that (ab)using the onstack plugging proved problematic, please see:
    https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html

diff --git a/block/bio.c b/block/bio.c
index ad3f276..99f5a2ad 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * Pop bios queued on @tsk->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * However, stacking drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *tsk->bio_list;
+	bio_list_init(tsk->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs)) {
+			bio_list_add(tsk->bio_list, bio);
+			continue;
+		}
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_unlock(&bs->rescue_lock);
+	}
 }
 
 /**
@@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * reserve.
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
+		 * workqueue per bio_set. If an allocation would block (due to
+		 * __GFP_WAIT) the scheduler will first punt all bios on
+		 * current->bio_list to the rescuer workqueue.
 		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 19c2e94..5dc7415 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1084,6 +1084,22 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(struct task_struct *tsk);
+
+static inline void blk_flush_queued_io(struct task_struct *tsk)
+{
+	/*
+	 * Flush any queued bios to corresponding rescue threads.
+	 */
+	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+		blk_flush_bio_list(tsk);
+	/*
+	 * Flush any plugged IO that is queued.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
+}
+
 /*
  * tag stuff
  */
@@ -1671,11 +1687,10 @@ static inline void blk_flush_plug(struct task_struct *task)
 {
 }
 
-static inline void blk_schedule_flush_plug(struct task_struct *task)
+static inline void blk_flush_queued_io(struct task_struct *tsk)
 {
 }
 
-
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	return false;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 10a8faa..eaf9eb3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3127,11 +3127,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
-	 * If we are going to sleep and we have plugged IO queued,
+	 * If we are going to sleep and we have queued IO,
 	 * make sure to submit it to avoid deadlocks.
 	 */
-	if (blk_needs_flush_plug(tsk))
-		blk_schedule_flush_plug(tsk);
+	blk_flush_queued_io(tsk);
 }
 
 asmlinkage __visible void __sched schedule(void)
@@ -4718,7 +4717,7 @@ long __sched io_schedule_timeout(long timeout)
 	long ret;
 
 	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
+	blk_flush_queued_io(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();
-- 
2.3.8 (Apple Git-58)


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-14 20:47                             ` [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock Mike Snitzer
@ 2015-10-14 21:44                               ` Jeff Moyer
  2015-10-17 16:04                               ` Ming Lei
  1 sibling, 0 replies; 43+ messages in thread
From: Jeff Moyer @ 2015-10-14 21:44 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, kent.overstreet, Mikulas Patocka, dm-devel,
	linux-kernel, Alasdair G. Kergon


I don't see a problem with this.  Jens, I'm not sure what you were
getting at about using the existing plugging infrastructure.  I
couldn't think of a clean way to integrate this code with the plugging.
They really do serve two separate purposes, and I don't think growing a
conditional in the scheduler hook is all that onerous.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

Mike Snitzer <snitzer@redhat.com> writes:

> From: Mikulas Patocka <mpatocka@redhat.com>
>
> The block layer uses per-process bio list to avoid recursion in
> generic_make_request.  When generic_make_request is called recursively,
> the bio is added to current->bio_list and generic_make_request returns
> immediately.  The top-level instance of generic_make_request takes bios
> from current->bio_list and processes them.
>
> Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> stacking drivers") created a workqueue for every bio set and code
> in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> redirecting bios queued on current->bio_list to the workqueue if the
> system is low on memory.  However another deadlock (see below **) may
> happen, without any low memory condition, because generic_make_request
> is queuing bios to current->bio_list (rather than submitting them).
>
> Fix this deadlock by redirecting any bios on current->bio_list to the
> bio_set's rescue workqueue on every schedule call.  Consequently, when
> the process blocks on a mutex, the bios queued on current->bio_list are
> dispatched to independent workqueues and they can complete without
> waiting for the mutex to be available.
>
> Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
> calls to it because bio_alloc_bioset() will implicitly punt all bios on
> current->bio_list if it performs a blocking allocation.
>
> ** Here is the dm-snapshot deadlock that was observed:
>
> 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> spans snapshot chunk boundary and so it is split to two bios by device
> mapper.
>
> 2) Device mapper creates the first sub-bio and sends it to the snapshot
> driver.
>
> 3) The function snapshot_map calls track_chunk (that allocates a structure
> dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
>
> 4) The remapped bio is submitted with generic_make_request, but it isn't
> issued - it is added to current->bio_list instead.
>
> 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> chunk affected by the first remapped bio; it takes down_write(&s->lock)
> and then loops in __check_for_conflicting_io, waiting for
> dm_snap_tracked_chunk created in step 3) to be released.
>
> 6) Process A continues, it creates a second sub-bio for the rest of the
> original bio.
>
> 7) snapshot_map is called for this new bio, it waits on
> down_write(&s->lock) that is held by Process B (in step 5).
>
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
> Cc: stable@vger.kernel.org
> ---
>  block/bio.c            | 75 +++++++++++++++++++-------------------------------
>  include/linux/blkdev.h | 19 +++++++++++--
>  kernel/sched/core.c    |  7 ++---
>  3 files changed, 48 insertions(+), 53 deletions(-)
>
> v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(),
>     io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list()
>     now takes a @tsk argument rather than assuming current. v3 is now being submitted with
>     more feeling now that (ab)using the onstack plugging proved problematic, please see:
>     https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html
>
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276..99f5a2ad 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
>  	}
>  }
>  
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> +/**
> + * blk_flush_bio_list
> + * @tsk: task_struct whose bio_list must be flushed
> + *
> + * Pop bios queued on @tsk->bio_list and submit each of them to
> + * their rescue workqueue.
> + *
> + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> + * However, stacking drivers should use bio_set, so this shouldn't be
> + * an issue.
> + */
> +void blk_flush_bio_list(struct task_struct *tsk)
>  {
> -	struct bio_list punt, nopunt;
>  	struct bio *bio;
> +	struct bio_list list = *tsk->bio_list;
> +	bio_list_init(tsk->bio_list);
>  
> -	/*
> -	 * In order to guarantee forward progress we must punt only bios that
> -	 * were allocated from this bio_set; otherwise, if there was a bio on
> -	 * there for a stacking driver higher up in the stack, processing it
> -	 * could require allocating bios from this bio_set, and doing that from
> -	 * our own rescuer would be bad.
> -	 *
> -	 * Since bio lists are singly linked, pop them all instead of trying to
> -	 * remove from the middle of the list:
> -	 */
> -
> -	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
> -
> -	while ((bio = bio_list_pop(current->bio_list)))
> -		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -	*current->bio_list = nopunt;
> -
> -	spin_lock(&bs->rescue_lock);
> -	bio_list_merge(&bs->rescue_list, &punt);
> -	spin_unlock(&bs->rescue_lock);
> +	while ((bio = bio_list_pop(&list))) {
> +		struct bio_set *bs = bio->bi_pool;
> +		if (unlikely(!bs)) {
> +			bio_list_add(tsk->bio_list, bio);
> +			continue;
> +		}
>  
> -	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +		spin_lock(&bs->rescue_lock);
> +		bio_list_add(&bs->rescue_list, bio);
> +		queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +		spin_unlock(&bs->rescue_lock);
> +	}
>  }
>  
>  /**
> @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -	gfp_t saved_gfp = gfp_mask;
>  	unsigned front_pad;
>  	unsigned inline_vecs;
>  	unsigned long idx = BIO_POOL_NONE;
> @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 * reserve.
>  		 *
>  		 * We solve this, and guarantee forward progress, with a rescuer
> -		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_WAIT; if that fails, we punt those bios we
> -		 * would be blocking to the rescuer workqueue before we retry
> -		 * with the original gfp_flags.
> +		 * workqueue per bio_set. If an allocation would block (due to
> +		 * __GFP_WAIT) the scheduler will first punt all bios on
> +		 * current->bio_list to the rescuer workqueue.
>  		 */
> -
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> -			gfp_mask &= ~__GFP_WAIT;
> -
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> -		if (!p && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			p = mempool_alloc(bs->bio_pool, gfp_mask);
> -		}
> -
>  		front_pad = bs->front_pad;
>  		inline_vecs = BIO_INLINE_VECS;
>  	}
> @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  
>  	if (nr_iovecs > inline_vecs) {
>  		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -		if (!bvl && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -		}
> -
>  		if (unlikely(!bvl))
>  			goto err_free;
>  
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 19c2e94..5dc7415 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1084,6 +1084,22 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  		 !list_empty(&plug->cb_list));
>  }
>  
> +extern void blk_flush_bio_list(struct task_struct *tsk);
> +
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
> +{
> +	/*
> +	 * Flush any queued bios to corresponding rescue threads.
> +	 */
> +	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
> +		blk_flush_bio_list(tsk);
> +	/*
> +	 * Flush any plugged IO that is queued.
> +	 */
> +	if (blk_needs_flush_plug(tsk))
> +		blk_schedule_flush_plug(tsk);
> +}
> +
>  /*
>   * tag stuff
>   */
> @@ -1671,11 +1687,10 @@ static inline void blk_flush_plug(struct task_struct *task)
>  {
>  }
>  
> -static inline void blk_schedule_flush_plug(struct task_struct *task)
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
>  {
>  }
>  
> -
>  static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>  	return false;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 10a8faa..eaf9eb3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3127,11 +3127,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
>  		return;
>  	/*
> -	 * If we are going to sleep and we have plugged IO queued,
> +	 * If we are going to sleep and we have queued IO,
>  	 * make sure to submit it to avoid deadlocks.
>  	 */
> -	if (blk_needs_flush_plug(tsk))
> -		blk_schedule_flush_plug(tsk);
> +	blk_flush_queued_io(tsk);
>  }
>  
>  asmlinkage __visible void __sched schedule(void)
> @@ -4718,7 +4717,7 @@ long __sched io_schedule_timeout(long timeout)
>  	long ret;
>  
>  	current->in_iowait = 1;
> -	blk_schedule_flush_plug(current);
> +	blk_flush_queued_io(current);
>  
>  	delayacct_blkio_start();
>  	rq = raw_rq();

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-09 19:52                         ` Mike Snitzer
  2015-10-09 19:59                           ` Mike Snitzer
@ 2015-10-15  3:27                           ` Ming Lei
  2015-10-15  8:06                             ` Mike Snitzer
  1 sibling, 1 reply; 43+ messages in thread
From: Ming Lei @ 2015-10-15  3:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Mikulas Patocka, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon

On Sat, Oct 10, 2015 at 3:52 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Thu, Oct 08 2015 at 11:08am -0400,
> Mike Snitzer <snitzer@redhat.com> wrote:
>
>> On Thu, Oct 08 2015 at 11:04am -0400,
>> Mikulas Patocka <mpatocka@redhat.com> wrote:
>>
>> >
>> >
>> > On Tue, 6 Oct 2015, Mike Snitzer wrote:
>> >
>> > > To give others context for why I'm caring about this issue again, this
>> > > recent BZ against 4.3-rc served as a reminder that we _need_ a fix:
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=1267650
>> > >
>> > > FYI, I cleaned up the plug-based approach a bit further, here is the
>> > > incremental patch:
>> > > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=f73d001ec692125308accbb5ca26f892f949c1b6
>> > >
>> > > And here is a new version of the overall combined patch (sharing now
>> > > before I transition to looking at alternatives, though my gut is the use
>> > > of a plug in generic_make_request really wouldn't hurt us.. famous last
>> > > words):
>> > >
>> > >  block/bio.c            | 82 +++++++++++++-------------------------------------
>> > >  block/blk-core.c       | 21 ++++++++-----
>> > >  drivers/md/dm-bufio.c  |  2 +-
>> > >  drivers/md/raid1.c     |  6 ++--
>> > >  drivers/md/raid10.c    |  6 ++--
>> > >  include/linux/blkdev.h | 11 +++++--
>> > >  include/linux/sched.h  |  4 ---
>> > >  7 files changed, 51 insertions(+), 81 deletions(-)
>> > >
>> ...
>> > > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
>> > > index 2dd3308..c2bff16 100644
>> > > --- a/drivers/md/dm-bufio.c
>> > > +++ b/drivers/md/dm-bufio.c
>> > > @@ -168,7 +168,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>> > >  #define DM_BUFIO_CACHE(c)        (dm_bufio_caches[dm_bufio_cache_index(c)])
>> > >  #define DM_BUFIO_CACHE_NAME(c)   (dm_bufio_cache_names[dm_bufio_cache_index(c)])
>> > >
>> > > -#define dm_bufio_in_request()    (!!current->bio_list)
>> > > +#define dm_bufio_in_request()    (current->plug && !!current->plug->bio_list)
>> >
>> > This condition is repeated several times throughout the whole patch - so
>> > maybe you should make it a function in block device header file.
>>
>> Yeah, I thought of that too but forgot to come back to it.  Will do,
>> thanks.
>>
>> FYI, I found another bug in my last patch and fixed it up.  I'll get
>> some refactoring done (including your suggestion), actually _test_ the
>> code (e.g. verify all of lvm testsuite passes) and then send out v3.
>
> Turns out that this change:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841
>
> needed to be reverted with:
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe
>
> Because nested plugs caused generic_make_request()'s onstack bio_list to
> go out of scope (blk_finish_plug() wouldn't actually flush the list
> within generic_make_request because XFS already added an outermost
> plug).

It looks like you should have defined bio_list in the plug as

               'struct bio_list bio_list'

instead of a pointer.

>
> But even after fixing that I then hit issues with these changes now
> resulting in imperfect 'in_generic_make_request' accounting that happens
> lazily once the outermost plug completes blk_finish_plug.  manifested as
> dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request()); hitting.

It looks like this problem is related to the 'bio_list' definition above, too.

>
> Basically using the blk-core's onstack plugging isn't workable for
> fixing this deadlock and we're back to having to seriously consider
> this (with its additional hook in the scheduler):
>
> Mike
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-15  3:27                           ` [PATCH v2] block: flush queued bios when the process blocks Ming Lei
@ 2015-10-15  8:06                             ` Mike Snitzer
  2015-10-16  3:08                               ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2015-10-15  8:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mikulas Patocka, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon

On Wed, Oct 14 2015 at 11:27pm -0400,
Ming Lei <tom.leiming@gmail.com> wrote:

> On Sat, Oct 10, 2015 at 3:52 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >
> > Turns out that this change:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841
> >
> > needed to be reverted with:
> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe
> >
> > Because nested plugs caused generic_make_request()'s onstack bio_list to
> > go out of scope (blk_finish_plug() wouldn't actually flush the list
> > within generic_make_request because XFS already added an outermost
> > plug).
> 
> Looks you should have defined bio_list in plug as
> 
>                'struct bio_list bio_list'
> 
> instead of one pointer.

I realized that and fixed it (see commit ad3ccd760da7c05b90 referenced
above that does exactly that).  That wasn't the problem.

> >
> > But even after fixing that I then hit issues with these changes now
> > resulting in imperfect 'in_generic_make_request' accounting that happens
> > lazily once the outermost plug completes blk_finish_plug.  manifested as
> > dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request()); hitting.
> 
> Looks this problem should be related with above 'bio_list' definition too.

No, as I explained it was due to the nested plug:

> >
> > Basically using the blk-core's onstack plugging isn't workable for
> > fixing this deadlock and we're back to having to seriously consider
> > this (with its additional hook in the scheduler)

To elaborate, for the code in DM (and other subsystems like bcache) that
rely on accurate accounting of whether we're actively _in_
generic_make_request: using plug to store/manage the bio_list isn't
workable because nested plugs change when the bio_list
is processed (as I implemented it -- which was to respect nested plugs).
I could've forced the issue by making the bio_list get processed
regardless of nesting but that would've made the onstack plugging much
more convoluted (a duality between nested and non-nested cases just for
bio_list's benefit, and for what gain?  Simply to avoid an extra conditional
immediately in the scheduler?  That conditional was still added anyway
but just as part of blk_needs_flush_plug so in the end there wasn't any
benefit!).

Hopefully my middle-of-the-night reply is coherent and helped to clarify
my position that (ab)using blk_plug for the bio_list management is
_really_ awkward. ;)

Thanks,
Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-15  8:06                             ` Mike Snitzer
@ 2015-10-16  3:08                               ` Ming Lei
  2015-10-16 15:29                                 ` Mike Snitzer
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2015-10-16  3:08 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Mikulas Patocka, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon

On Thu, Oct 15, 2015 at 4:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Wed, Oct 14 2015 at 11:27pm -0400,
> Ming Lei <tom.leiming@gmail.com> wrote:
>
>> On Sat, Oct 10, 2015 at 3:52 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >
>> > Turns out that this change:
>> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841
>> >
>> > needed to be reverted with:
>> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe
>> >
>> > Because nested plugs caused generic_make_request()'s onstack bio_list to
>> > go out of scope (blk_finish_plug() wouldn't actually flush the list
>> > within generic_make_request because XFS already added an outermost
>> > plug).
>>
>> Looks you should have defined bio_list in plug as
>>
>>                'struct bio_list bio_list'
>>
>> instead of one pointer.
>
> I realized that and fixed it (see commit ad3ccd760da7c05b90 referenced
> above that does exactly that).  That wasn't the problem.

OK.

>
>> >
>> > But even after fixing that I then hit issues with these changes now
>> > resulting in imperfect 'in_generic_make_request' accounting that happens
>> > lazily once the outermost plug completes blk_finish_plug.  manifested as
>> > dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request()); hitting.
>>
>> Looks this problem should be related with above 'bio_list' definition too.
>
> No, as I explained it was due to the nested plug:
>
>> >
>> > Basically using the blk-core's onstack plugging isn't workable for
>> > fixing this deadlock and we're back to having to seriously consider
>> > this (with its additional hook in the scheduler)
>
> To elaborate, for the code in DM (and other subsystems like bcache) that
> rely on accurate accounting of whether we're actively _in_
> generic_make_request: using plug to store/manage the bio_list isn't

That looks like an interesting requirement: DM just needs to know
whether the current callsite is inside generic_make_request(), so what
you need is just one per-task variable.

With the onstack 'plug' variable, it should be easy to do that for DM:
for example, you can introduce a flag in 'struct blk_plug', set it on
entry to generic_make_request(), and clear it on exit.

> workable because nested plugs change the lifetime of when the bio_list
> is processed (as I implemented it -- which was to respect nested plugs).
> I could've forced the issue by making the bio_list get processed
> regardless of nesting but that would've made the onstack plugging much
> more convoluted (duality between nested vs not just for bio_list's
> benefit and for what gain?  Simply to avoid an extra conditional
> immediately in the scheduler?  That conditional was still added anyway
> but just as part of blk_needs_flush_plug so in the end there wasn't any
> benefit!).
>
> Hopefully my middle-of-the-night reply is coherent and helped to clarify
> my position that (ab)using blk_plug for the bio_list management is
> _really_ awkward. ;)

Hope it wasn't my reply that broke your sleep, :-)

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-16  3:08                               ` Ming Lei
@ 2015-10-16 15:29                                 ` Mike Snitzer
  2015-10-17 15:54                                   ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Mike Snitzer @ 2015-10-16 15:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mikulas Patocka, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon

On Thu, Oct 15 2015 at 11:08pm -0400,
Ming Lei <tom.leiming@gmail.com> wrote:

> On Thu, Oct 15, 2015 at 4:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Wed, Oct 14 2015 at 11:27pm -0400,
> > Ming Lei <tom.leiming@gmail.com> wrote:
> >
> >> On Sat, Oct 10, 2015 at 3:52 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> >> >
> >> > Turns out that this change:
> >> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841
> >> >
> >> > needed to be reverted with:
> >> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe
> >> >
> >> > Because nested plugs caused generic_make_request()'s onstack bio_list to
> >> > go out of scope (blk_finish_plug() wouldn't actually flush the list
> >> > within generic_make_request because XFS already added an outermost
> >> > plug).
> >>
> >> Looks you should have defined bio_list in plug as
> >>
> >>                'struct bio_list bio_list'
> >>
> >> instead of one pointer.
> >
> > I realized that and fixed it (see commit ad3ccd760da7c05b90 referenced
> > above that does exactly that).  That wasn't the problem.
> 
> OK.
> 
> >
> >> >
> >> > But even after fixing that I then hit issues with these changes now
> >> > resulting in imperfect 'in_generic_make_request' accounting that happens
> >> > lazily once the outermost plug completes blk_finish_plug.  manifested as
> >> > dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request()); hitting.
> >>
> >> Looks this problem should be related with above 'bio_list' definition too.
> >
> > No, as I explained it was due to the nested plug:
> >
> >> >
> >> > Basically using the blk-core's onstack plugging isn't workable for
> >> > fixing this deadlock and we're back to having to seriously consider
> >> > this (with its additional hook in the scheduler)
> >
> > To elaborate, for the code in DM (and other subsystems like bcache) that
> > rely on accurate accounting of whether we're actively _in_
> > generic_make_request: using plug to store/manage the bio_list isn't
> 
> That looks like an interesting requirement: DM just needs to know
> whether the current call site is from generic_make_request(), so what
> you need is just one per-task variable.
> 
> With the on-stack 'plug' variable, it should be easy to do that for DM:
> for example, you could introduce a flag in 'struct blk_plug', set it on
> entry to generic_make_request(), and clear it on exit of the
> function.

Yes, I mean we _could_ set/clear the 'in_generic_make_request' flag _in_
generic_make_request(), but then it just calls into question why we're
using the plug to begin with (especially given that plugging is for
request-based devices at this point!).

It really doesn't make _any_ sense to overload blk_plug by moving the
bio_list into it and adding an 'in_generic_make_request' flag... when you
consider the _only_ reason this was suggested is to (ab)use the existing
hook in scheduler/core.c.

So I stand by my position that there is really no point in the exercise
and that it actually hurts the code to try to make this a blk_plug
"feature".

We already have well-established current->bio_list semantics that can be
reused as a flag, given it is a pointer.  The block callout in the
scheduler is going to grow a conditional either way.  What I've proposed
_seems_ the cleanest to me and others.  Hopefully you can see that
aspect of things.
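To illustrate the point being argued here, below is a minimal userspace model (hypothetical names and mock types, not the kernel implementation) of why the pointer itself already serves as a flag: the outermost generic_make_request() points current->bio_list at an on-stack list for its whole duration, so "is the pointer non-NULL?" answers "am I inside generic_make_request?" without any extra field.

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *bi_next; };
struct bio_list { struct bio *head, *tail; };
struct task_struct { struct bio_list *bio_list; };

static struct task_struct current_task;   /* stand-in for the kernel's 'current' */
#define current (&current_task)

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->bi_next = NULL;
	if (bl->tail)
		bl->tail->bi_next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->bi_next;
		if (!bl->head)
			bl->tail = NULL;
	}
	return bio;
}

/* The pointer doubles as the "in generic_make_request" flag,
 * the way dm_bufio_in_request()-style checks use it. */
static int in_generic_make_request(void)
{
	return current->bio_list != NULL;
}

static int submissions;        /* counts bios actually issued to the driver */
static int flag_seen_inside;   /* records the flag's value mid-submission */

static void generic_make_request(struct bio *bio)
{
	struct bio_list list = { NULL, NULL };

	if (current->bio_list) {                 /* recursive call: just queue */
		bio_list_add(current->bio_list, bio);
		return;
	}
	current->bio_list = &list;               /* non-NULL == inside gmr */
	do {
		flag_seen_inside = in_generic_make_request();
		submissions++;                   /* stands in for ->make_request_fn */
	} while ((bio = bio_list_pop(&list)) != NULL);
	current->bio_list = NULL;
}
```

The model is a sketch of the recursion-avoidance shape only; the real function dispatches each popped bio to the driver rather than counting it.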

So if you could review the v3 patch with a critical eye that'd be very
much appreciated.

But I do look forward to Jens also having a look at this and providing
his review feedback.

> > workable because nested plugs change the lifetime of when the bio_list
> > is processed (as I implemented it -- which was to respect nested plugs).
> > I could've forced the issue by making the bio_list get processed
> > regardless of nesting but that would've made the onstack plugging much
> > more convoluted (duality between nested vs not just for bio_list's
> > benefit and for what gain?  Simply to avoid an extra conditional
> > immediately in the scheduler?  That conditional was still added anyway
> > but just as part of blk_needs_flush_plug so in the end there wasn't any
> > benefit!).
> >
> > Hopefully my middle-of-the-night reply is coherent and helped to clarify
> > my position that (ab)using blk_plug for the bio_list management is
> > _really_ awkward. ;)
> 
> Hope it wasn't my reply that broke your sleep, :-)

No, my dog woke me up to go outside at 4am.. I was up and couldn't
resist looking at my phone.. the rest is history ;)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2] block: flush queued bios when the process blocks
  2015-10-16 15:29                                 ` Mike Snitzer
@ 2015-10-17 15:54                                   ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2015-10-17 15:54 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Mikulas Patocka, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon

On Fri, Oct 16, 2015 at 11:29 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Thu, Oct 15 2015 at 11:08pm -0400,
> Ming Lei <tom.leiming@gmail.com> wrote:
>
>> On Thu, Oct 15, 2015 at 4:06 PM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > On Wed, Oct 14 2015 at 11:27pm -0400,
>> > Ming Lei <tom.leiming@gmail.com> wrote:
>> >
>> >> On Sat, Oct 10, 2015 at 3:52 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> >> >
>> >> > Turns out that this change:
>> >> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=2639638c77768a86216be456c2764e32a2bcd841
>> >> >
>> >> > needed to be reverted with:
>> >> > http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=ad3ccd760da7c05b90775372f9b39dc2964086fe
>> >> >
>> >> > Because nested plugs caused generic_make_request()'s onstack bio_list to
>> >> > go out of scope (blk_finish_plug() wouldn't actually flush the list
>> >> > within generic_make_request because XFS already added an outermost
>> >> > plug).
>> >>
>> >> Looks like you should have defined bio_list in the plug as
>> >>
>> >>                'struct bio_list bio_list'
>> >>
>> >> instead of one pointer.
>> >
>> > I realized that and fixed it (see commit ad3ccd760da7c05b90 referenced
>> > above that does exactly that).  That wasn't the problem.
>>
>> OK.
>>
>> >
>> >> >
>> >> > But even after fixing that I then hit issues with these changes
>> >> > resulting in imperfect 'in_generic_make_request' accounting that happens
>> >> > lazily once the outermost plug completes blk_finish_plug.  This manifested
>> >> > as dm-bufio.c:dm_bufio_prefetch's BUG_ON(dm_bufio_in_request()) firing.
>> >>
>> >> Looks like this problem should be related to the above 'bio_list' definition too.
>> >
>> > No, as I explained it was due to the nested plug:
>> >
>> >> >
>> >> > Basically using the blk-core's onstack plugging isn't workable for
>> >> > fixing this deadlock and we're back to having to seriously consider
>> >> > this (with its additional hook in the scheduler)
>> >
>> > To elaborate, for the code in DM (and other subsystems like bcache) that
>> > rely on accurate accounting of whether we're actively _in_
>> > generic_make_request: using plug to store/manage the bio_list isn't
>>
>> That looks like an interesting requirement: DM just needs to know
>> whether the current call site is from generic_make_request(), so what
>> you need is just one per-task variable.
>>
>> With the on-stack 'plug' variable, it should be easy to do that for DM:
>> for example, you could introduce a flag in 'struct blk_plug', set it on
>> entry to generic_make_request(), and clear it on exit of the
>> function.
>
> Yes, I mean we _could_ set/clear the 'in_generic_make_request' flag _in_
> generic_make_request(), but then it just calls into question why we're
> using the plug to begin with (especially given that plugging is for
> request-based devices at this point!).
>
> It really doesn't make _any_ sense to overload blk_plug by moving the
> bio_list into it and adding an 'in_generic_make_request' flag... when you
> consider the _only_ reason this was suggested is to (ab)use the existing
> hook in scheduler/core.c.

At first glance, I meant that it was doable to use blk_plug for the issue.

From last year's discussion, it looks like Jens thought the plug we
already have should have covered this case; Kent also wanted to
implement plugging for bios.

>
> So I stand by my position that there is really no point in the exercise
> and that it actually hurts the code to try to make this a blk_plug
> "feature".
>
> We already have well established current->bio_list semantics that can be
> reused as a flag given it is a pointer.  The block callout in the
> scheduler is going to grow a conditional either way.  What I've proposed
> _seems_ the cleanest to me and others.  Hopefully you can see that
> aspect of things.
>
> So if you could review the v3 patch with a critical eye that'd be very
> much appreciated.

Will do.

>
> But I do look forward to Jens also having a look at this and providing
> his review feedback.
>
>> > workable because nested plugs change the lifetime of when the bio_list
>> > is processed (as I implemented it -- which was to respect nested plugs).
>> > I could've forced the issue by making the bio_list get processed
>> > regardless of nesting but that would've made the onstack plugging much
>> > more convoluted (duality between nested vs not just for bio_list's
>> > benefit and for what gain?  Simply to avoid an extra conditional
>> > immediately in the scheduler?  That conditional was still added anyway
>> > but just as part of blk_needs_flush_plug so in the end there wasn't any
>> > benefit!).
>> >
>> > Hopefully my middle-of-the-night reply is coherent and helped to clarify
>> > my position that (ab)using blk_plug for the bio_list management is
>> > _really_ awkward. ;)
>>
>> Hope it wasn't my reply that broke your sleep, :-)
>
> No, my dog woke me up to go outside at 4am.. I was up and couldn't
> resist looking at my phone.. the rest is history ;)



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-14 20:47                             ` [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock Mike Snitzer
  2015-10-14 21:44                               ` Jeff Moyer
@ 2015-10-17 16:04                               ` Ming Lei
  2015-10-20 19:57                                 ` Mike Snitzer
  2015-10-20 20:03                                 ` Mikulas Patocka
  1 sibling, 2 replies; 43+ messages in thread
From: Ming Lei @ 2015-10-17 16:04 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Jens Axboe, Kent Overstreet, Mikulas Patocka, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon, Jeff Moyer

On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> From: Mikulas Patocka <mpatocka@redhat.com>
>
> The block layer uses per-process bio list to avoid recursion in
> generic_make_request.  When generic_make_request is called recursively,
> the bio is added to current->bio_list and generic_make_request returns
> immediately.  The top-level instance of generic_make_request takes bios
> from current->bio_list and processes them.
>
> Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> stacking drivers") created a workqueue for every bio set and code
> in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> redirecting bios queued on current->bio_list to the workqueue if the
> system is low on memory.  However another deadlock (see below **) may
> happen, without any low memory condition, because generic_make_request
> is queuing bios to current->bio_list (rather than submitting them).
>
> Fix this deadlock by redirecting any bios on current->bio_list to the
> bio_set's rescue workqueue on every schedule call.  Consequently, when
> the process blocks on a mutex, the bios queued on current->bio_list are
> dispatched to independent workqueues and they can complete without
> waiting for the mutex to be available.

It isn't common to acquire a mutex/semaphore inside .make_request()
or .request_fn(), so I am wondering whether it is good to reuse the
rescuing workqueue for this unusual case.

Also, it can sometimes hurt performance to convert I/O submission
from one context into concurrent workqueue contexts, especially
in the case of sequential I/O, since plugging and plug merging can't be
used any more.

>
> Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
> calls to it because bio_alloc_bioset() will implicitly punt all bios on
> current->bio_list if it performs a blocking allocation.
>
> ** Here is the dm-snapshot deadlock that was observed:
>
> 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> spans snapshot chunk boundary and so it is split to two bios by device
> mapper.
>
> 2) Device mapper creates the first sub-bio and sends it to the snapshot
> driver.
>
> 3) The function snapshot_map calls track_chunk (that allocates a structure
> dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
>
> 4) The remapped bio is submitted with generic_make_request, but it isn't
> issued - it is added to current->bio_list instead.
>
> 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> chunk affected by the first remapped bio, it takes down_write(&s->lock)
> and then loops in __check_for_conflicting_io, waiting for
> dm_snap_tracked_chunk created in step 3) to be released.
>
> 6) Process A continues, it creates a second sub-bio for the rest of the
> original bio.
>
> 7) snapshot_map is called for this new bio, it waits on
> down_write(&s->lock) that is held by Process B (in step 5).
>
> Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
> Cc: stable@vger.kernel.org
> ---
>  block/bio.c            | 75 +++++++++++++++++++-------------------------------
>  include/linux/blkdev.h | 19 +++++++++++--
>  kernel/sched/core.c    |  7 ++---
>  3 files changed, 48 insertions(+), 53 deletions(-)
>
> v3: improved patch header, changed sched/core.c block callout to blk_flush_queued_io(),
>     io_schedule_timeout() also updated to use blk_flush_queued_io(), blk_flush_bio_list()
>     now takes a @tsk argument rather than assuming current. v3 is now being submitted with
>     more feeling now that (ab)using the onstack plugging proved problematic, please see:
>     https://www.redhat.com/archives/dm-devel/2015-October/msg00087.html
>
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276..99f5a2ad 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
>         }
>  }
>
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> +/**
> + * blk_flush_bio_list
> + * @tsk: task_struct whose bio_list must be flushed
> + *
> + * Pop bios queued on @tsk->bio_list and submit each of them to
> + * their rescue workqueue.
> + *
> + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> + * However, stacking drivers should use bio_set, so this shouldn't be
> + * an issue.
> + */
> +void blk_flush_bio_list(struct task_struct *tsk)
>  {
> -       struct bio_list punt, nopunt;
>         struct bio *bio;
> +       struct bio_list list = *tsk->bio_list;
> +       bio_list_init(tsk->bio_list);
>
> -       /*
> -        * In order to guarantee forward progress we must punt only bios that
> -        * were allocated from this bio_set; otherwise, if there was a bio on
> -        * there for a stacking driver higher up in the stack, processing it
> -        * could require allocating bios from this bio_set, and doing that from
> -        * our own rescuer would be bad.
> -        *
> -        * Since bio lists are singly linked, pop them all instead of trying to
> -        * remove from the middle of the list:
> -        */
> -
> -       bio_list_init(&punt);
> -       bio_list_init(&nopunt);
> -
> -       while ((bio = bio_list_pop(current->bio_list)))
> -               bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -       *current->bio_list = nopunt;
> -
> -       spin_lock(&bs->rescue_lock);
> -       bio_list_merge(&bs->rescue_list, &punt);
> -       spin_unlock(&bs->rescue_lock);
> +       while ((bio = bio_list_pop(&list))) {
> +               struct bio_set *bs = bio->bi_pool;
> +               if (unlikely(!bs)) {
> +                       bio_list_add(tsk->bio_list, bio);
> +                       continue;
> +               }
>
> -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +               spin_lock(&bs->rescue_lock);
> +               bio_list_add(&bs->rescue_list, bio);
> +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> +               spin_unlock(&bs->rescue_lock);
> +       }

Unlike the rescuing path, scheduling out can be quite frequent, and the
above change switches to submitting these I/Os from workqueues
concurrently, which might hurt performance for sequential I/O.

Also, I am wondering: why not submit these I/Os in the 'current' context,
just like a plug flush does?

>  }
>
>  /**
> @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -       gfp_t saved_gfp = gfp_mask;
>         unsigned front_pad;
>         unsigned inline_vecs;
>         unsigned long idx = BIO_POOL_NONE;
> @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  * reserve.
>                  *
>                  * We solve this, and guarantee forward progress, with a rescuer
> -                * workqueue per bio_set. If we go to allocate and there are
> -                * bios on current->bio_list, we first try the allocation
> -                * without __GFP_WAIT; if that fails, we punt those bios we
> -                * would be blocking to the rescuer workqueue before we retry
> -                * with the original gfp_flags.
> +                * workqueue per bio_set. If an allocation would block (due to
> +                * __GFP_WAIT) the scheduler will first punt all bios on
> +                * current->bio_list to the rescuer workqueue.
>                  */
> -
> -               if (current->bio_list && !bio_list_empty(current->bio_list))
> -                       gfp_mask &= ~__GFP_WAIT;
> -
>                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> -               if (!p && gfp_mask != saved_gfp) {
> -                       punt_bios_to_rescuer(bs);
> -                       gfp_mask = saved_gfp;
> -                       p = mempool_alloc(bs->bio_pool, gfp_mask);
> -               }
> -
>                 front_pad = bs->front_pad;
>                 inline_vecs = BIO_INLINE_VECS;
>         }
> @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>
>         if (nr_iovecs > inline_vecs) {
>                 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -               if (!bvl && gfp_mask != saved_gfp) {
> -                       punt_bios_to_rescuer(bs);
> -                       gfp_mask = saved_gfp;
> -                       bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> -               }
> -

Looks like you touched the rescuing path for bio allocation; better to
just do one thing in one patch.

>                 if (unlikely(!bvl))
>                         goto err_free;
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 19c2e94..5dc7415 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1084,6 +1084,22 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>                  !list_empty(&plug->cb_list));
>  }
>
> +extern void blk_flush_bio_list(struct task_struct *tsk);
> +
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
> +{
> +       /*
> +        * Flush any queued bios to corresponding rescue threads.
> +        */
> +       if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
> +               blk_flush_bio_list(tsk);
> +       /*
> +        * Flush any plugged IO that is queued.
> +        */
> +       if (blk_needs_flush_plug(tsk))
> +               blk_schedule_flush_plug(tsk);
> +}
> +
>  /*
>   * tag stuff
>   */
> @@ -1671,11 +1687,10 @@ static inline void blk_flush_plug(struct task_struct *task)
>  {
>  }
>
> -static inline void blk_schedule_flush_plug(struct task_struct *task)
> +static inline void blk_flush_queued_io(struct task_struct *tsk)
>  {
>  }
>
> -
>  static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>         return false;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 10a8faa..eaf9eb3 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3127,11 +3127,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>         if (!tsk->state || tsk_is_pi_blocked(tsk))
>                 return;
>         /*
> -        * If we are going to sleep and we have plugged IO queued,
> +        * If we are going to sleep and we have queued IO,
>          * make sure to submit it to avoid deadlocks.
>          */
> -       if (blk_needs_flush_plug(tsk))
> -               blk_schedule_flush_plug(tsk);
> +       blk_flush_queued_io(tsk);
>  }
>
>  asmlinkage __visible void __sched schedule(void)
> @@ -4718,7 +4717,7 @@ long __sched io_schedule_timeout(long timeout)
>         long ret;
>
>         current->in_iowait = 1;
> -       blk_schedule_flush_plug(current);
> +       blk_flush_queued_io(current);
>
>         delayacct_blkio_start();
>         rq = raw_rq();
> --
> 2.3.8 (Apple Git-58)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-17 16:04                               ` Ming Lei
@ 2015-10-20 19:57                                 ` Mike Snitzer
  2015-10-20 20:03                                 ` Mikulas Patocka
  1 sibling, 0 replies; 43+ messages in thread
From: Mike Snitzer @ 2015-10-20 19:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Linux Kernel Mailing List, Jeff Moyer, dm-devel,
	Mikulas Patocka, Kent Overstreet, Alasdair G. Kergon

On Sat, Oct 17 2015 at 12:04pm -0400,
Ming Lei <tom.leiming@gmail.com> wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > From: Mikulas Patocka <mpatocka@redhat.com>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request.  When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately.  The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory.  However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call.  Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueues and they can complete without
> > waiting for the mutex to be available.
> 
> It isn't common to acquire a mutex/semaphore inside .make_request()
> or .request_fn(), so I am wondering whether it is good to reuse the
> rescuing workqueue for this unusual case.

Which specific locking are you concerned about?

> Also, it can sometimes hurt performance to convert I/O submission
> from one context into concurrent workqueue contexts, especially
> in the case of sequential I/O, since plugging and plug merging can't be
> used any more.

True, plug and plug merge wouldn't be usable, but this recursive call to
generic_make_request isn't expected to be common.

This patch was to fix a relatively obscure bio splitting scenario that
fell out of the complexity of dm-snapshot.

> > diff --git a/block/bio.c b/block/bio.c
> > index ad3f276..99f5a2ad 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -354,35 +354,35 @@ static void bio_alloc_rescue(struct work_struct *work)
> >         }
> >  }
> >
> > -static void punt_bios_to_rescuer(struct bio_set *bs)
> > +/**
> > + * blk_flush_bio_list
> > + * @tsk: task_struct whose bio_list must be flushed
> > + *
> > + * Pop bios queued on @tsk->bio_list and submit each of them to
> > + * their rescue workqueue.
> > + *
> > + * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
> > + * However, stacking drivers should use bio_set, so this shouldn't be
> > + * an issue.
> > + */
> > +void blk_flush_bio_list(struct task_struct *tsk)
...
> > +       while ((bio = bio_list_pop(&list))) {
> > +               struct bio_set *bs = bio->bi_pool;
> > +               if (unlikely(!bs)) {
> > +                       bio_list_add(tsk->bio_list, bio);
> > +                       continue;
> > +               }
> >
> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_lock(&bs->rescue_lock);
> > +               bio_list_add(&bs->rescue_list, bio);
> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_unlock(&bs->rescue_lock);
> > +       }
> 
> Unlike the rescuing path, scheduling out can be quite frequent, and the
> above change switches to submitting these I/Os from workqueues
> concurrently, which might hurt performance for sequential I/O.
> 
> Also, I am wondering: why not submit these I/Os in the 'current' context,
> just like a plug flush does?

Flushing the plug during schedule makes use of kblockd, so I'm not sure
what you're referring to here.

> >  }
> >
> >  /**
> > @@ -422,7 +422,6 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
> >   */
> >  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >  {
> > -       gfp_t saved_gfp = gfp_mask;
> >         unsigned front_pad;
> >         unsigned inline_vecs;
> >         unsigned long idx = BIO_POOL_NONE;
> > @@ -457,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >                  * reserve.
> >                  *
> >                  * We solve this, and guarantee forward progress, with a rescuer
> > -                * workqueue per bio_set. If we go to allocate and there are
> > -                * bios on current->bio_list, we first try the allocation
> > -                * without __GFP_WAIT; if that fails, we punt those bios we
> > -                * would be blocking to the rescuer workqueue before we retry
> > -                * with the original gfp_flags.
> > +                * workqueue per bio_set. If an allocation would block (due to
> > +                * __GFP_WAIT) the scheduler will first punt all bios on
> > +                * current->bio_list to the rescuer workqueue.
> >                  */
> > -
> > -               if (current->bio_list && !bio_list_empty(current->bio_list))
> > -                       gfp_mask &= ~__GFP_WAIT;
> > -
> >                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> > -               if (!p && gfp_mask != saved_gfp) {
> > -                       punt_bios_to_rescuer(bs);
> > -                       gfp_mask = saved_gfp;
> > -                       p = mempool_alloc(bs->bio_pool, gfp_mask);
> > -               }
> > -
> >                 front_pad = bs->front_pad;
> >                 inline_vecs = BIO_INLINE_VECS;
> >         }
> > @@ -486,12 +473,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >
> >         if (nr_iovecs > inline_vecs) {
> >                 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > -               if (!bvl && gfp_mask != saved_gfp) {
> > -                       punt_bios_to_rescuer(bs);
> > -                       gfp_mask = saved_gfp;
> > -                       bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
> > -               }
> > -
> 
> Looks like you touched the rescuing path for bio allocation; better to
> just do one thing in one patch.

Yes, good point, I've split the patches, you can see the result in the 2
topmost commits in this branch:

http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

Definitely helped clean up the patches and make them more
approachable/reviewable.  Thanks for the suggestion.

I'll hold off on sending out v4 of these patches until I can better
understand your concerns about using the rescue workqueue during
schedule.

Mike

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-17 16:04                               ` Ming Lei
  2015-10-20 19:57                                 ` Mike Snitzer
@ 2015-10-20 20:03                                 ` Mikulas Patocka
  2015-10-21 16:38                                   ` Ming Lei
  1 sibling, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-20 20:03 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon, Jeff Moyer



On Sun, 18 Oct 2015, Ming Lei wrote:

> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
> > From: Mikulas Patocka <mpatocka@redhat.com>
> >
> > The block layer uses per-process bio list to avoid recursion in
> > generic_make_request.  When generic_make_request is called recursively,
> > the bio is added to current->bio_list and generic_make_request returns
> > immediately.  The top-level instance of generic_make_request takes bios
> > from current->bio_list and processes them.
> >
> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
> > stacking drivers") created a workqueue for every bio set and code
> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
> > redirecting bios queued on current->bio_list to the workqueue if the
> > system is low on memory.  However another deadlock (see below **) may
> > happen, without any low memory condition, because generic_make_request
> > is queuing bios to current->bio_list (rather than submitting them).
> >
> > Fix this deadlock by redirecting any bios on current->bio_list to the
> > bio_set's rescue workqueue on every schedule call.  Consequently, when
> > the process blocks on a mutex, the bios queued on current->bio_list are
> > dispatched to independent workqueues and they can complete without
> > waiting for the mutex to be available.
> 
> It isn't common to acquire a mutex/semaphore inside .make_request()
> or .request_fn(), so I am wondering whether it is good to reuse the
> rescuing workqueue for this unusual case.

Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_request()
for every bio. It wouldn't be practical to convert them to not acquire the
mutex (and it would also degrade the performance of these drivers if they
had to offload every bio to a worker thread that can acquire the mutex).

You can't acquire a mutex in .request_fn() because .request_fn() is called 
with queue spinlock held.

> Also, it can sometimes hurt performance to convert I/O submission
> from one context into concurrent workqueue contexts, especially
> in the case of sequential I/O, since plugging and plug merging can't be
> used any more.

You can add blk_start_plug/blk_finish_plug to the function
bio_alloc_rescue. That would be a reasonable way to make sure that the
requests are merged even when they are offloaded to the rescue thread.
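A toy userspace model of that suggestion (all names and types here are mocks, not kernel API) shows why the plug preserves batching: with a plug active, every bio the rescue worker submits is held and handed to the "queue" in a single batch when the plug finishes, whereas unplugged submission produces one dispatch, and thus one lost merge window, per bio.

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *bi_next; };

/* Toy "queue": counts dispatch batches (merge windows) and bios. */
static int dispatch_batches;
static int dispatched_bios;

/* Toy plug: collects bios while active, like an on-stack blk_plug. */
struct blk_plug { struct bio *held; };
static struct blk_plug *cur_plug;

static void blk_start_plug(struct blk_plug *plug)
{
	plug->held = NULL;
	cur_plug = plug;
}

static void dispatch_batch(struct bio *bio)
{
	dispatch_batches++;                 /* one batch == one merge window */
	for (; bio; bio = bio->bi_next)
		dispatched_bios++;
}

static void blk_finish_plug(struct blk_plug *plug)
{
	if (plug->held)
		dispatch_batch(plug->held); /* all held bios go out together */
	cur_plug = NULL;
}

static void generic_make_request(struct bio *bio)
{
	if (cur_plug) {                     /* plugged: hold for batching */
		bio->bi_next = cur_plug->held;
		cur_plug->held = bio;
	} else {                            /* unplugged: one batch per bio */
		bio->bi_next = NULL;
		dispatch_batch(bio);
	}
}

/* The rescue-worker loop, with the suggested plug wrapped around it. */
static void bio_alloc_rescue(struct bio *rescue_list, int use_plug)
{
	struct blk_plug plug;

	if (use_plug)
		blk_start_plug(&plug);
	while (rescue_list) {
		struct bio *bio = rescue_list;

		rescue_list = bio->bi_next; /* save before bi_next is reused */
		generic_make_request(bio);
	}
	if (use_plug)
		blk_finish_plug(&plug);
}
```

Running the worker over three bios with the plug yields one dispatch batch; without it, three.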

> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_lock(&bs->rescue_lock);
> > +               bio_list_add(&bs->rescue_list, bio);
> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> > +               spin_unlock(&bs->rescue_lock);
> > +       }
> 
> Unlike the rescuing path, scheduling out can be quite frequent, and the
> above change switches to submitting these I/Os from workqueues
> concurrently, which might hurt performance for sequential I/O.
> 
> Also, I am wondering: why not submit these I/Os in the 'current' context,
> just like a plug flush does?

Processing requests doesn't block (they only take the queue spinlock).

Processing bios can block (they can take other mutexes or semaphores), so 
processing them from the schedule hook is impossible - the bio's 
make_request function could attempt to take some lock that is already 
held. So - we must offload the bios to a separate workqueue.
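To make the offload concrete, here is a minimal userspace model of the v3 patch's blk_flush_bio_list() (mock types; the queue_work() handoff is reduced to inserting into the set's rescue list): every bio carrying a bio_set is punted to that set's rescue list for a worker to submit, and a bio without one is left queued on the task.

```c
#include <assert.h>
#include <stddef.h>

struct bio_set;
struct bio { struct bio *bi_next; struct bio_set *bi_pool; };
struct bio_list { struct bio *head, *tail; };
struct bio_set { struct bio_list rescue_list; };   /* queue_work() elided */
struct task_struct { struct bio_list *bio_list; };

static void bio_list_add(struct bio_list *bl, struct bio *bio)
{
	bio->bi_next = NULL;
	if (bl->tail)
		bl->tail->bi_next = bio;
	else
		bl->head = bio;
	bl->tail = bio;
}

static struct bio *bio_list_pop(struct bio_list *bl)
{
	struct bio *bio = bl->head;

	if (bio) {
		bl->head = bio->bi_next;
		if (!bl->head)
			bl->tail = NULL;
	}
	return bio;
}

static int bio_list_size(const struct bio_list *bl)
{
	int n = 0;
	struct bio *bio;

	for (bio = bl->head; bio; bio = bio->bi_next)
		n++;
	return n;
}

/* Mirrors the v3 patch: drain tsk->bio_list, punting each bio to its
 * bio_set's rescue list (where the real code then does queue_work());
 * a bio with no bio_set is put back on the task's list. */
static void blk_flush_bio_list(struct task_struct *tsk)
{
	struct bio *bio;
	struct bio_list list = *tsk->bio_list;

	tsk->bio_list->head = tsk->bio_list->tail = NULL;

	while ((bio = bio_list_pop(&list)) != NULL) {
		struct bio_set *bs = bio->bi_pool;

		if (!bs) {
			bio_list_add(tsk->bio_list, bio);
			continue;
		}
		bio_list_add(&bs->rescue_list, bio);
	}
}
```

Nothing in this path calls a make_request function, which is the whole point: the schedule hook only moves bios between lists, and the blocking submission happens later in worker context.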

Mikulas

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-20 20:03                                 ` Mikulas Patocka
@ 2015-10-21 16:38                                   ` Ming Lei
  2015-10-21 21:49                                     ` Mikulas Patocka
  0 siblings, 1 reply; 43+ messages in thread
From: Ming Lei @ 2015-10-21 16:38 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon, Jeff Moyer

On Wed, Oct 21, 2015 at 4:03 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Sun, 18 Oct 2015, Ming Lei wrote:
>
>> On Thu, Oct 15, 2015 at 4:47 AM, Mike Snitzer <snitzer@redhat.com> wrote:
>> > From: Mikulas Patocka <mpatocka@redhat.com>
>> >
>> > The block layer uses a per-process bio list to avoid recursion in
>> > generic_make_request.  When generic_make_request is called recursively,
>> > the bio is added to current->bio_list and generic_make_request returns
>> > immediately.  The top-level instance of generic_make_request takes bios
>> > from current->bio_list and processes them.
>> >
>> > Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
>> > stacking drivers") created a workqueue for every bio set and code
>> > in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
>> > redirecting bios queued on current->bio_list to the workqueue if the
>> > system is low on memory.  However another deadlock (see below **) may
>> > happen, without any low memory condition, because generic_make_request
>> > is queuing bios to current->bio_list (rather than submitting them).
>> >
>> > Fix this deadlock by redirecting any bios on current->bio_list to the
>> > bio_set's rescue workqueue on every schedule call.  Consequently, when
>> > the process blocks on a mutex, the bios queued on current->bio_list are
>> > dispatched to independent workqueues and they can complete without
>> > waiting for the mutex to be available.
>>
>> It isn't common to acquire a mutex/semaphore inside .make_request()
>> or .request_fn(), so I am wondering whether it is good to reuse the
>> rescuing workqueue for this unusual case.
>
> Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_request()
> for every bio. It wouldn't be practical to convert them to not acquire the
> mutex (and it would also degrade the performance of these drivers if they
> had to offload every bio to a worker thread that can acquire the mutex).

Lots of drivers handle I/O in that way, and this approach basically makes
AIO impossible for dm-snapshot.

>
> You can't acquire a mutex in .request_fn() because .request_fn() is called
> with the queue spinlock held.

Lots of drivers drop the lock inside .request_fn(), but I admit that
.request_fn() shouldn't block.

>
>> Also sometimes it can hurt performance by converting I/O submission
>> from one context into concurrent workqueue contexts, especially
>> in the case of sequential I/O, since plugging and plug merging can't
>> be used any more.
>
> You can add blk_start_plug/blk_finish_plug to the function
> bio_alloc_rescue. That would be a reasonable way to make sure that the
> requests are merged even when they are offloaded to the rescue thread.

The I/Os submitted from separate workqueue contexts are no longer
contiguous, so plug merging isn't doable, not to mention the extra
context-switch cost.

This kind of cost would be introduced for all bio devices just to handle
the unusual case - is that fair?

>
>> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> > +               spin_lock(&bs->rescue_lock);
>> > +               bio_list_add(&bs->rescue_list, bio);
>> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> > +               spin_unlock(&bs->rescue_lock);
>> > +       }
>>
>> Unlike the rescuing path, scheduling out can be quite frequent, and the
>> above change will switch to submitting these I/Os from workqueues
>> concurrently, which might hurt performance for sequential I/O.
>>
>> Also I am wondering why not submit these I/Os in the 'current' context,
>> just like flushing a plug does?
>
> Processing requests doesn't block (they only take the queue spinlock).
>
> Processing bios can block (they can take other mutexes or semaphores), so
> processing them from the schedule hook is impossible - the bio's
> make_request function could attempt to take some lock that is already
> held. So - we must offload the bios to a separate workqueue.

Yes, so it is better to handle just dm-snapshot in this way.


Thanks,
Ming Lei


* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-21 16:38                                   ` Ming Lei
@ 2015-10-21 21:49                                     ` Mikulas Patocka
  2015-10-22  1:53                                       ` Ming Lei
  0 siblings, 1 reply; 43+ messages in thread
From: Mikulas Patocka @ 2015-10-21 21:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon, Jeff Moyer



On Thu, 22 Oct 2015, Ming Lei wrote:

> > Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_request()
> > for every bio. It wouldn't be practical to convert them to not acquire the
> > mutex (and it would also degrade the performance of these drivers if they
> > had to offload every bio to a worker thread that can acquire the mutex).
> 
> Lots of drivers handle I/O in that way, and this approach basically makes
> AIO impossible for dm-snapshot.

It doesn't have anything to do with asynchronous I/O. Of course you can do
asynchronous I/O on dm-snapshot.

> >> Also sometimes it can hurt performance by converting I/O submission
> >> from one context into concurrent workqueue contexts, especially
> >> in the case of sequential I/O, since plugging and plug merging can't
> >> be used any more.
> >
> > You can add blk_start_plug/blk_finish_plug to the function
> > bio_alloc_rescue. That would be a reasonable way to make sure that the
> > requests are merged even when they are offloaded to the rescue thread.
> 
> The I/Os submitted from separate workqueue contexts are no longer
> contiguous, so plug merging isn't doable, not to mention the extra
> context-switch cost.

If the requests are mergeable, blk_start_plug/blk_finish_plug will merge
them; if they are not, it won't.

> This kind of cost would be introduced for all bio devices just to handle
> the unusual case - is that fair?

Offloading bios to a worker thread when the make_request_fn function
blocks is required to avoid a deadlock (BTW, the deadlock became more
common in kernel 4.3 due to the unrestricted size of bios).

The bio list current->bio_list introduces a false locking dependency -
completion of a bio depends on the completion of other bios on
current->bio_list directed at different devices, so it can create a
circular dependency resulting in deadlock.

To avoid the circular dependency, each bio must be offloaded to a specific 
workqueue, so that completion of bio for device A no longer depends on 
completion of another bio for device B.

> >> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> >> > +               spin_lock(&bs->rescue_lock);
> >> > +               bio_list_add(&bs->rescue_list, bio);
> >> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
> >> > +               spin_unlock(&bs->rescue_lock);
> >> > +       }
> >>
> >> Unlike the rescuing path, scheduling out can be quite frequent, and the
> >> above change will switch to submitting these I/Os from workqueues
> >> concurrently, which might hurt performance for sequential I/O.
> >>
> >> Also I am wondering why not submit these I/Os in the 'current' context,
> >> just like flushing a plug does?
> >
> > Processing requests doesn't block (they only take the queue spinlock).
> >
> > Processing bios can block (they can take other mutexes or semaphores), so
> > processing them from the schedule hook is impossible - the bio's
> > make_request function could attempt to take some lock that is already
> > held. So - we must offload the bios to a separate workqueue.
> 
> Yes, so it is better to handle just dm-snapshot in this way.

All dm targets and almost all other bio-processing drivers can block in 
the make_request_fn function (for example, they may block when allocating 
from a mempool).

Mikulas

> Thanks,
> Ming Lei
> 


* Re: [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock
  2015-10-21 21:49                                     ` Mikulas Patocka
@ 2015-10-22  1:53                                       ` Ming Lei
  0 siblings, 0 replies; 43+ messages in thread
From: Ming Lei @ 2015-10-22  1:53 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Mike Snitzer, Jens Axboe, Kent Overstreet, dm-devel,
	Linux Kernel Mailing List, Alasdair G. Kergon, Jeff Moyer

On Thu, Oct 22, 2015 at 5:49 AM, Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>
> On Thu, 22 Oct 2015, Ming Lei wrote:
>
>> > Some drivers (dm-snapshot, dm-thin) do acquire a mutex in .make_request()
>> > for every bio. It wouldn't be practical to convert them to not acquire the
>> > mutex (and it would also degrade the performance of these drivers if they
>> > had to offload every bio to a worker thread that can acquire the mutex).
>>
>> Lots of drivers handle I/O in that way, and this approach basically makes
>> AIO impossible for dm-snapshot.
>
> It doesn't have anything to do with asynchronous I/O. Of course you can do
> asynchronous I/O on dm-snapshot.

But it is actually synchronous I/O, because submit_bio() will block on the
semaphore, so this behaviour isn't acceptable from the AIO point of view.
That is probably why there are very few drivers that use a semaphore/mutex
in the I/O path.

I found that zram removed the rwlock from its I/O path in v3.17.

>
>> >> Also sometimes it can hurt performance by converting I/O submission
>> >> from one context into concurrent workqueue contexts, especially
>> >> in the case of sequential I/O, since plugging and plug merging can't
>> >> be used any more.
>> >
>> > You can add blk_start_plug/blk_finish_plug to the function
>> > bio_alloc_rescue. That would be a reasonable way to make sure that the
>> > requests are merged even when they are offloaded to the rescue thread.
>>
>> The I/Os submitted from separate workqueue contexts are no longer
>> contiguous, so plug merging isn't doable, not to mention the extra
>> context-switch cost.
>
> If the requests are mergeable, blk_start_plug/blk_finish_plug will merge
> them; if they are not, it won't.

The plug is per-task, and these requests were previously mergeable because
they came from the same task; once you submit them from different workqueue
contexts, they are no longer mergeable via the plug. Also, context switches
aren't free.

>
>> This kind of cost would be introduced for all bio devices just to handle
>> the unusual case - is that fair?
>
> Offloading bios to a worker thread when the make_request_fn function
> blocks is required to avoid a deadlock (BTW, the deadlock became more
> common in kernel 4.3 due to the unrestricted size of bios).
>
> The bio list current->bio_list introduces a false locking dependency -
> completion of a bio depends on the completion of other bios on
> current->bio_list directed at different devices, so it can create a
> circular dependency resulting in deadlock.
>
> To avoid the circular dependency, each bio must be offloaded to a specific
> workqueue, so that completion of bio for device A no longer depends on
> completion of another bio for device B.

You can do that for dm-snapshot alone by using a per-device thread/workqueue
or whatever, but it isn't good to involve other drivers by reusing the
rescue workqueue.

>
>> >> > -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> >> > +               spin_lock(&bs->rescue_lock);
>> >> > +               bio_list_add(&bs->rescue_list, bio);
>> >> > +               queue_work(bs->rescue_workqueue, &bs->rescue_work);
>> >> > +               spin_unlock(&bs->rescue_lock);
>> >> > +       }
>> >>
>> >> Unlike the rescuing path, scheduling out can be quite frequent, and the
>> >> above change will switch to submitting these I/Os from workqueues
>> >> concurrently, which might hurt performance for sequential I/O.
>> >>
>> >> Also I am wondering why not submit these I/Os in the 'current' context,
>> >> just like flushing a plug does?
>> >
>> > Processing requests doesn't block (they only take the queue spinlock).
>> >
>> > Processing bios can block (they can take other mutexes or semaphores), so
>> > processing them from the schedule hook is impossible - the bio's
>> > make_request function could attempt to take some lock that is already
>> > held. So - we must offload the bios to a separate workqueue.
>>
>> Yes, so it is better to handle just dm-snapshot in this way.
>
> All dm targets and almost all other bio-processing drivers can block in
> the make_request_fn function (for example, they may block when allocating
> from a mempool).

blk_queue_bio() and its blk-mq counterparts can block in get_request() too,
and that is acceptable, but using a mutex/semaphore in the I/O path, which
can block quite frequently, isn't friendly/acceptable for AIO.

--
Ming Lei


* [PATCH] block: flush queued bios when the process blocks
@ 2014-06-26 23:46 Mikulas Patocka
  0 siblings, 0 replies; 43+ messages in thread
From: Mikulas Patocka @ 2014-06-26 23:46 UTC (permalink / raw)
  To: Jens Axboe, Kent Overstreet
  Cc: linux-kernel, dm-devel, Alasdair G. Kergon, Mike Snitzer

Hi Jens

Would you please consider applying this patch?

Regarding your idea about merging bio and request plugging - I think this 
could be done later when it is properly designed by Kent Overstreet or 
someone else.

We need this patch to fix the deadlock in dm snapshot.

Mikulas


---------- Forwarded message ----------
Date: Tue, 27 May 2014 11:03:36 -0400 (EDT)
From: Mikulas Patocka <mpatocka@redhat.com>
To: Jens Axboe <axboe@kernel.dk>, Kent Overstreet <kmo@daterainc.com>
Cc: linux-kernel@vger.kernel.org, dm-devel@redhat.com,
    Alasdair G. Kergon <agk@redhat.com>, Mike Snitzer <msnitzer@redhat.com>
Subject: [PATCH] block: flush queued bios when the process blocks

The block layer uses a per-process bio list to avoid recursion in
generic_make_request. When generic_make_request is called recursively, the
bio is added to current->bio_list and the function returns immediately.
The top-level instance of generic_make_request takes bios from
current->bio_list and processes them.

This bio queuing can result in deadlocks. The following deadlock was
observed:

1) Process A sends a one-page read bio to the dm-snapshot target. The bio
spans a snapshot chunk boundary and so it is split into two bios by device
mapper.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (which allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash), then remaps the
bio to the underlying linear target and exits with DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
processed - it is added to current->bio_list instead.

5) Meanwhile, process B executes pending_complete for the affected chunk:
it takes down_write(&s->lock) and then loops in
__check_for_conflicting_io, waiting for the dm_snap_tracked_chunk created
in step 3) to be released.

6) Process A continues, it creates a new bio for the rest of the original
bio.

7) snapshot_map is called for this new bio; it waits on
down_write(&s->lock), which is held since step 5).

The resulting deadlock:
* bio added to current->bio_list at step 4) waits until the function in
  step 7) finishes
* the function in step 7) waits until s->lock held in step 5) is released
* the process in step 5) waits until the bio queued in step 4) finishes

The general problem is that queuing bios on current->bio_list introduces
additional lock dependencies. If a device mapper target sends a bio to
some block device, it assumes that the bio only takes locks of the target
block device or of devices below the target device. However, if the
bio is queued on current->bio_list, this creates an artificial locking
dependency on locks taken by other bios that are on current->bio_list. In
the above scenario, this artificial locking dependency results in
deadlock.

Kent Overstreet already created a workqueue for every bio set, and there is
code that tries to resolve some low-memory deadlocks by redirecting bios
queued on current->bio_list to the workqueue if the system is low on
memory. However, other deadlocks (as described above) may happen without
any low-memory condition.

This patch generalizes Kent's concept: it redirects bios on
current->bio_list to the bio_set's workqueue on every schedule call.
Consequently, when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueues and they can
complete without waiting for the mutex to be available.

Bios allocated with bio_kmalloc do not have a bio_set, so they are not
redirected; however, bio_kmalloc shouldn't be used by stacking drivers (it
is currently used by raid1.c and raid10.c; we need to change them to use a
bio_set).


Note to stable kernel maintainers: before backporting this patch, you also
need to backport df2cb6daa4cbc34406bc4b1ac9b9335df1083a72.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 fs/bio.c               |   84 ++++++++++++++-----------------------------------
 include/linux/blkdev.h |    7 +++-
 kernel/sched/core.c    |    7 ++++
 3 files changed, 37 insertions(+), 61 deletions(-)

Index: linux-3.15-rc5/fs/bio.c
===================================================================
--- linux-3.15-rc5.orig/fs/bio.c	2014-05-26 19:02:47.000000000 +0200
+++ linux-3.15-rc5/fs/bio.c	2014-05-27 00:00:13.000000000 +0200
@@ -342,35 +342,34 @@ static void bio_alloc_rescue(struct work
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ *
+ * Pop bios queued on current->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on current->bio_list.
+ * However, stacking drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(void)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *current->bio_list;
+	bio_list_init(current->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs)) {
+			bio_list_add(current->bio_list, bio);
+		} else {
+			spin_lock(&bs->rescue_lock);
+			bio_list_add(&bs->rescue_list, bio);
+			spin_unlock(&bs->rescue_lock);
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+			queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		}
+	}
 }
 
 /**
@@ -410,7 +409,6 @@ static void punt_bios_to_rescuer(struct 
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	unsigned long idx = BIO_POOL_NONE;
@@ -428,36 +426,7 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
-		/*
-		 * generic_make_request() converts recursion to iteration; this
-		 * means if we're running beneath it, any bios we allocate and
-		 * submit will not be submitted (and thus freed) until after we
-		 * return.
-		 *
-		 * This exposes us to a potential deadlock if we allocate
-		 * multiple bios from the same bio_set() while running
-		 * underneath generic_make_request(). If we were to allocate
-		 * multiple bios (say a stacking block driver that was splitting
-		 * bios), we would deadlock if we exhausted the mempool's
-		 * reserve.
-		 *
-		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
-		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
 
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
@@ -471,11 +440,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 
 	if (nr_iovecs > inline_vecs) {
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
 
 		if (unlikely(!bvl))
 			goto err_free;
Index: linux-3.15-rc5/kernel/sched/core.c
===================================================================
--- linux-3.15-rc5.orig/kernel/sched/core.c	2014-05-26 19:30:51.000000000 +0200
+++ linux-3.15-rc5/kernel/sched/core.c	2014-05-27 00:23:00.000000000 +0200
@@ -2734,6 +2734,13 @@ static inline void sched_submit_work(str
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
+	 * If there are bios on the bio list, flush them to the appropriate
+	 * rescue threads.
+	 */
+	if (unlikely(current->bio_list != NULL) &&
+	    !bio_list_empty(current->bio_list))
+		blk_flush_bio_list();
+	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
Index: linux-3.15-rc5/include/linux/blkdev.h
===================================================================
--- linux-3.15-rc5.orig/include/linux/blkdev.h	2014-05-26 23:54:48.000000000 +0200
+++ linux-3.15-rc5/include/linux/blkdev.h	2014-05-26 23:56:41.000000000 +0200
@@ -1103,6 +1103,8 @@ static inline bool blk_needs_flush_plug(
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(void);
+
 /*
  * tag stuff
  */
@@ -1634,12 +1636,15 @@ static inline void blk_schedule_flush_pl
 {
 }
 
-
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	return false;
 }
 
+static inline void blk_flush_bio_list(void)
+{
+}
+
 #endif /* CONFIG_BLOCK */
 
 #endif


end of thread, other threads:[~2015-10-22  1:53 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-27 15:03 [PATCH] block: flush queued bios when the process blocks Mikulas Patocka
2014-05-27 15:08 ` Jens Axboe
2014-05-27 15:23   ` Mikulas Patocka
2014-05-27 15:42     ` Jens Axboe
2014-05-27 16:26       ` Mikulas Patocka
2014-05-27 17:33         ` Mike Snitzer
2014-05-27 19:56           ` Kent Overstreet
2015-10-05 19:50             ` Mike Snitzer
2014-05-27 17:42         ` [PATCH] " Jens Axboe
2014-05-27 18:14           ` [dm-devel] " Christoph Hellwig
2014-05-27 19:59             ` Kent Overstreet
2014-05-27 19:56           ` Mikulas Patocka
2014-05-27 20:06             ` Kent Overstreet
2014-05-29 23:52           ` Mikulas Patocka
2015-10-05 20:59             ` Mike Snitzer
2015-10-06 13:28               ` Mikulas Patocka
2015-10-06 13:47                 ` Mike Snitzer
2015-10-06 14:10                   ` Mikulas Patocka
2015-10-06 14:26                   ` Mikulas Patocka
2015-10-06 18:17               ` [dm-devel] " Mikulas Patocka
2015-10-06 18:50                 ` Mike Snitzer
2015-10-06 20:16                   ` [PATCH v2] " Mike Snitzer
2015-10-06 20:26                     ` Mike Snitzer
2015-10-08 15:04                     ` Mikulas Patocka
2015-10-08 15:08                       ` Mike Snitzer
2015-10-09 19:52                         ` Mike Snitzer
2015-10-09 19:59                           ` Mike Snitzer
2015-10-14 20:47                             ` [PATCH v3 for-4.4] block: flush queued bios when process blocks to avoid deadlock Mike Snitzer
2015-10-14 21:44                               ` Jeff Moyer
2015-10-17 16:04                               ` Ming Lei
2015-10-20 19:57                                 ` Mike Snitzer
2015-10-20 20:03                                 ` Mikulas Patocka
2015-10-21 16:38                                   ` Ming Lei
2015-10-21 21:49                                     ` Mikulas Patocka
2015-10-22  1:53                                       ` Ming Lei
2015-10-15  3:27                           ` [PATCH v2] block: flush queued bios when the process blocks Ming Lei
2015-10-15  8:06                             ` Mike Snitzer
2015-10-16  3:08                               ` Ming Lei
2015-10-16 15:29                                 ` Mike Snitzer
2015-10-17 15:54                                   ` Ming Lei
2015-10-09 11:58                     ` kbuild test robot
2014-05-27 17:59   ` [PATCH] " Kent Overstreet
2014-06-26 23:46 Mikulas Patocka
