linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] block: fix blk_queue_split() resource exhaustion
@ 2016-07-08 15:04 Lars Ellenberg
  2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
  0 siblings, 1 reply; 23+ messages in thread
From: Lars Ellenberg @ 2016-07-08 15:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Lars Ellenberg, NeilBrown, linux-raid, Martin K. Petersen,
	Mike Snitzer, Peter Zijlstra, Jiri Kosina, Ming Lei,
	linux-kernel, Zheng Liu, linux-block, Takashi Iwai, linux-bcache,
	Ingo Molnar, Alasdair Kergon, Keith Busch, dm-devel, Shaohua Li,
	Kent Overstreet, Kirill A. Shutemov, Roland Kammerer

Result of RFC previously discussed here:
https://lkml.org/lkml/2016/6/22/172
[RFC] block: fix blk_queue_split() resource exhaustion

Rebased to linux-block/for-4.8/core as of today.
Would also need to go to Stable 4.3 and later.

Lars Ellenberg (1):
  block: fix blk_queue_split() resource exhaustion

 block/bio.c               | 27 +++++++++++++++++--------
 block/blk-core.c          | 50 +++++++++++++++++++++++++----------------------
 block/blk-merge.c         |  5 ++++-
 drivers/md/bcache/btree.c | 12 ++++++------
 drivers/md/dm-bufio.c     |  2 +-
 drivers/md/md.h           |  7 +++++++
 drivers/md/raid1.c        |  5 ++---
 drivers/md/raid10.c       |  5 ++---
 include/linux/bio.h       | 18 +++++++++++++++++
 include/linux/sched.h     |  4 ++--
 10 files changed, 88 insertions(+), 47 deletions(-)

-- 
1.9.1


* [PATCH 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-08 15:04 [PATCH 0/1] block: fix blk_queue_split() resource exhaustion Lars Ellenberg
@ 2016-07-08 15:04 ` Lars Ellenberg
  2016-07-08 18:49   ` Mike Snitzer
  2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
  0 siblings, 2 replies; 23+ messages in thread
From: Lars Ellenberg @ 2016-07-08 15:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Lars Ellenberg, NeilBrown, linux-raid, Martin K. Petersen,
	Mike Snitzer, Peter Zijlstra, Jiri Kosina, Ming Lei,
	linux-kernel, Zheng Liu, linux-block, Takashi Iwai, linux-bcache,
	Ingo Molnar, Alasdair Kergon, Keith Busch, dm-devel, Shaohua Li,
	Kent Overstreet, Kirill A. Shutemov, Roland Kammerer

For a long time, generic_make_request() has converted recursion into
iteration by queuing recursive arguments on current->bio_list.

This is convenient for stacking drivers: the top-most driver takes the
originally submitted bio and re-submits a re-mapped version of it, or
one or more clones, or one or more newly allocated bios, to its
backend(s). These are then simply processed in turn, and each can again
queue more "backend-bios", until we reach the bottom of the driver
stack and actually dispatch to the real backend device.
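
For reference, that mechanism boils down to the following condensed
sketch (blk_queue_enter()/error handling and the return value are
omitted; see the full code in the diff below):

	blk_qc_t generic_make_request(struct bio *bio)
	{
		struct bio_list bio_list_on_stack;

		if (current->bio_list) {
			/* recursive call: queue it, the already active
			 * iteration below will pick it up */
			bio_list_add(current->bio_list, bio);
			return BLK_QC_T_NONE;
		}

		bio_list_init(&bio_list_on_stack);
		current->bio_list = &bio_list_on_stack;	/* activate */
		do {
			struct request_queue *q = bdev_get_queue(bio->bi_bdev);

			/* may add more bios to current->bio_list */
			q->make_request_fn(q, bio);
			bio = bio_list_pop(current->bio_list);
		} while (bio);
		current->bio_list = NULL;	/* deactivate */
	}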

Any stacking driver's ->make_request_fn() could expect that, once it
returns, any backend-bios it submitted via recursive calls to
generic_make_request() would be processed and dispatched before the
current task calls into this driver again.

This was changed by commit
  54efd50 ("block: make generic_make_request handle arbitrarily sized bios")

Drivers may call blk_queue_split() inside their ->make_request_fn(),
which may split the current bio into a front-part to be dealt with
immediately, and a remainder-part, which may need to be split even
further. That remainder-part will simply also be pushed to
current->bio_list, and will end up being head-of-queue, in front of any
backend-bios the current make_request_fn() might submit during
processing of the front-part.

This means the current task immediately ends up back in the same
make_request_fn() of the same driver again, before any of its backend
bios have even been processed.
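
To illustrate: suppose blk_queue_split() pushes back a remainder R, and
the driver then submits backend-bios B1 and B2 while processing the
front-part. The single current->bio_list then looks like this:

	[ R, B1, B2 ]	R at the head; it is popped and processed
			first, before B1 and B2 are ever dispatched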

This can lead to resource-starvation deadlock.
Drivers could avoid this by learning not to need blk_queue_split(), by
submitting their backend bios in a different context (dedicated kernel
thread, workqueue context, ...), or by playing funny re-ordering games
with entries on current->bio_list.

Instead, I suggest distinguishing between recursive calls to
generic_make_request() and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list queue;
	}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the head of
the "queue" list, we can make the recursion-to-iteration logic in
generic_make_request() process deepest-level bios first, and "sibling"
bios of the same level in "natural" order.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
---
 block/bio.c               | 27 +++++++++++++++++--------
 block/blk-core.c          | 50 +++++++++++++++++++++++++----------------------
 block/blk-merge.c         |  5 ++++-
 drivers/md/bcache/btree.c | 12 ++++++------
 drivers/md/dm-bufio.c     |  2 +-
 drivers/md/md.h           |  7 +++++++
 drivers/md/raid1.c        |  5 ++---
 drivers/md/raid10.c       |  5 ++---
 include/linux/bio.h       | 18 +++++++++++++++++
 include/linux/sched.h     |  4 ++--
 10 files changed, 88 insertions(+), 47 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 848cd35..1f9fcf0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	 */
 
 	bio_list_init(&punt);
-	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->recursion = nopunt;
 
-	*current->bio_list = nopunt;
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->queue)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->queue = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -380,6 +384,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	queue_work(bs->rescue_workqueue, &bs->rescue_work);
 }
 
+static bool current_has_pending_bios(void)
+{
+	return current->bio_lists &&
+		(!bio_list_empty(&current->bio_lists->queue) ||
+		 !bio_list_empty(&current->bio_lists->recursion));
+}
+
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -453,13 +464,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
 		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
+		 * bios on current->bio_lists->{recursion,queue}, we first try the
+		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
+		 * punt those bios we would be blocking to the rescuer
+		 * workqueue before we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current_has_pending_bios())
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 3cfd67d..74bceea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2040,7 +2040,7 @@ end_io:
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2049,15 +2049,20 @@ blk_qc_t generic_make_request(struct bio *bio)
 	/*
 	 * We only want one ->make_request_fn to be active at a time, else
 	 * stack usage with stacked devices could be a problem.  So use
-	 * current->bio_list to keep a list of requests submited by a
-	 * make_request_fn function.  current->bio_list is also used as a
+	 * current->bio_lists to keep a list of requests submited by a
+	 * make_request_fn function.  current->bio_lists is also used as a
 	 * flag to say if generic_make_request is currently active in this
 	 * task or not.  If it is NULL, then no make_request is active.  If
 	 * it is non-NULL, then a make_request is active, and new requests
-	 * should be added at the tail
+	 * should be added at the tail of current->bio_lists->recursion;
+	 * bios resulting from a call to blk_queue_split() from
+	 * within ->make_request_fn() should be pushed back to the head of
+	 * current->bio_lists->queue.
+	 * After the current ->make_request_fn() returns, .recursion will be
+	 * merged back to the head of .queue.
 	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	if (current->bio_lists) {
+		bio_list_add(&current->bio_lists->recursion, bio);
 		goto out;
 	}
 
@@ -2066,35 +2071,34 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * Before entering the loop, bio->bi_next is NULL (as all callers
 	 * ensure that) so we have a list with a single bio.
 	 * We pretend that we have just taken it off a longer list, so
-	 * we assign bio_list to a pointer to the bio_list_on_stack,
-	 * thus initialising the bio_list of new bios to be
-	 * added.  ->make_request() may indeed add some more bios
-	 * through a recursive call to generic_make_request.  If it
-	 * did, we find a non-NULL value in bio_list and re-enter the loop
-	 * from the top.  In this case we really did just take the bio
-	 * of the top of the list (no pretending) and so remove it from
-	 * bio_list, and call into ->make_request() again.
+	 * we assign bio_list to a pointer to the bio_lists_on_stack,
+	 * thus initialising the bio_lists of new bios to be added.
+	 * ->make_request() may indeed add some more bios to .recursion
+	 * through a recursive call to generic_make_request.  If it did,
+	 * we find a non-NULL value in .recursion, merge .recursion back to the
+	 * head of .queue, and re-enter the loop from the top.  In this case we
+	 * really did just take the bio of the top of the list (no pretending)
+	 * and so remove it from .queue, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_lists_on_stack.queue);
+	current->bio_lists = &bio_lists_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			bio_list_init(&bio_lists_on_stack.recursion);
 			ret = q->make_request_fn(q, bio);
-
 			blk_queue_exit(q);
-
-			bio = bio_list_pop(current->bio_list);
+			bio_list_merge_head(&bio_lists_on_stack.queue,
+					&bio_lists_on_stack.recursion);
+			/* XXX bio_list_init(&bio_lists_on_stack.recursion); */
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(&current->bio_lists->queue);
 	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	current->bio_lists = NULL; /* deactivate */
 
 out:
 	return ret;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index c265348..df96327 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	struct bio *split, *res;
 	unsigned nsegs;
 
+	BUG_ON(!current->bio_lists);
 	if (bio_op(*bio) == REQ_OP_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
@@ -190,7 +191,9 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
-		generic_make_request(*bio);
+		/* push back remainder, it may later be split further */
+		bio_list_add_head(&current->bio_lists->queue, *bio);
+		/* and fake submission of a suitably sized piece */
 		*bio = split;
 	}
 }
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534..731ec3b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
 
 	trace_bcache_btree_write(b);
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 	BUG_ON(b->written >= btree_blocks(b));
 	BUG_ON(b->written && !i->keys);
 	BUG_ON(btree_bset_first(b)->seq != i->seq);
@@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
 
 	/* Force write if set is too big */
 	if (set_bytes(i) > PAGE_SIZE - 48 &&
-	    !current->bio_list)
+	    !current->bio_lists)
 		bch_btree_node_write(b, NULL);
 }
 
@@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
 {
 	struct btree *b;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 
 	lockdep_assert_held(&c->bucket_lock);
 
@@ -976,7 +976,7 @@ retry:
 	b = mca_find(c, k);
 
 	if (!b) {
-		if (current->bio_list)
+		if (current->bio_lists)
 			return ERR_PTR(-EAGAIN);
 
 		mutex_lock(&c->bucket_lock);
@@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
 
 	return 0;
 split:
-	if (current->bio_list) {
+	if (current->bio_lists) {
 		op->lock = b->c->root->level + 1;
 		return -EAGAIN;
 	} else if (op->lock <= b->c->root->level) {
@@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
 	struct btree_insert_op op;
 	int ret = 0;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 	BUG_ON(bch_keylist_empty(keys));
 
 	bch_btree_op_init(&op.op, 0);
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 6571c81..ba0c325 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
 #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
 #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
 
-#define dm_bufio_in_request()	(!!current->bio_list)
+#define dm_bufio_in_request()	(!!current->bio_lists)
 
 static void dm_bufio_lock(struct dm_bufio_client *c)
 {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index b4f3352..b62e65f4 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -664,6 +664,13 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 	}
 }
 
+static inline bool current_has_pending_bios(void)
+{
+	return current->bio_lists && (
+	      !bio_list_empty(&current->bio_lists->queue) ||
+	      !bio_list_empty(&current->bio_lists->recursion));
+}
+
 extern struct md_cluster_operations *md_cluster_ops;
 static inline int mddev_is_clustered(struct mddev *mddev)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 10e53cd..38790e3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
+				      current_has_pending_bios())),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->bio_lists) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 245640b..13a5341 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (conf->nr_pending &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     current_has_pending_bios()),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->bio_lists) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b7e1a008..0b2b28e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -541,6 +541,24 @@ struct bio_list {
 	struct bio *tail;
 };
 
+/* for generic_make_request() */
+struct recursion_to_iteration_bio_lists {
+	/* For stacking drivers submitting to their respective backend,
+	 * bios are added to the tail of .recursion, which is re-initialized
+	 * before each call to ->make_request_fn() and after that returns,
+	 * the whole .recursion queue is then merged back to the head of .queue.
+	 *
+	 * The recursion-to-iteration logic in generic_make_request() will
+	 * peel off of .queue.head, processing bios in deepest-level-first
+	 * "natural" order. */
+	struct bio_list recursion;
+
+	/* This keeps a list of to-be-processed bios.
+	 * The "remainder" part resulting from calling blk_queue_split()
+	 * will be pushed back to its head. */
+	struct bio_list queue;
+};
+
 static inline int bio_list_empty(const struct bio_list *bl)
 {
 	return bl->head == NULL;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..146eedc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,7 +128,7 @@ struct sched_attr {
 
 struct futex_pi_state;
 struct robust_list_head;
-struct bio_list;
+struct recursion_to_iteration_bio_lists;
 struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
@@ -1727,7 +1727,7 @@ struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct recursion_to_iteration_bio_lists *bio_lists;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */
-- 
1.9.1


* Re: [PATCH 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
@ 2016-07-08 18:49   ` Mike Snitzer
  2016-07-11 14:13     ` Lars Ellenberg
  2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2016-07-08 18:49 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Jens Axboe, linux-block, Keith Busch, Martin K. Petersen,
	Peter Zijlstra, Jiri Kosina, Ming Lei, Kirill A. Shutemov,
	NeilBrown, linux-kernel, linux-raid, Takashi Iwai, linux-bcache,
	Zheng Liu, Kent Overstreet, dm-devel, Shaohua Li, Ingo Molnar,
	Alasdair Kergon, Roland Kammerer

On Fri, Jul 08 2016 at 11:04am -0400,
Lars Ellenberg <lars.ellenberg@linbit.com> wrote:

> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more newly allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest distinguishing between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list queue;
> 	}
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>  block/bio.c               | 27 +++++++++++++++++--------
>  block/blk-core.c          | 50 +++++++++++++++++++++++++----------------------
>  block/blk-merge.c         |  5 ++++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/md.h           |  7 +++++++
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 18 +++++++++++++++++
>  include/linux/sched.h     |  4 ++--
>  10 files changed, 88 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..1f9fcf0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>  	 */
>  
>  	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
>  
> -	while ((bio = bio_list_pop(current->bio_list)))
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>  		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->recursion = nopunt;
>  
> -	*current->bio_list = nopunt;
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->queue)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->queue = nopunt;
>  
>  	spin_lock(&bs->rescue_lock);
>  	bio_list_merge(&bs->rescue_list, &punt);
> @@ -380,6 +384,13 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>  	queue_work(bs->rescue_workqueue, &bs->rescue_work);
>  }
>  
> +static bool current_has_pending_bios(void)
> +{
> +	return current->bio_lists &&
> +		(!bio_list_empty(&current->bio_lists->queue) ||
> +		 !bio_list_empty(&current->bio_lists->recursion));
> +}
> +

This should be moved to include/linux/bio.h

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3cfd67d..74bceea 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2066,35 +2071,34 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * Before entering the loop, bio->bi_next is NULL (as all callers
>  	 * ensure that) so we have a list with a single bio.
>  	 * We pretend that we have just taken it off a longer list, so
> -	 * we assign bio_list to a pointer to the bio_list_on_stack,
> -	 * thus initialising the bio_list of new bios to be
> -	 * added.  ->make_request() may indeed add some more bios
> -	 * through a recursive call to generic_make_request.  If it
> -	 * did, we find a non-NULL value in bio_list and re-enter the loop
> -	 * from the top.  In this case we really did just take the bio
> -	 * of the top of the list (no pretending) and so remove it from
> -	 * bio_list, and call into ->make_request() again.
> +	 * we assign bio_list to a pointer to the bio_lists_on_stack,
> +	 * thus initialising the bio_lists of new bios to be added.
> +	 * ->make_request() may indeed add some more bios to .recursion
> +	 * through a recursive call to generic_make_request.  If it did,
> +	 * we find a non-NULL value in .recursion, merge .recursion back to the
> +	 * head of .queue, and re-enter the loop from the top.  In this case we
> +	 * really did just take the bio of the top of the list (no pretending)
> +	 * and so remove it from .queue, and call into ->make_request() again.
>  	 */
>  	BUG_ON(bio->bi_next);
> -	bio_list_init(&bio_list_on_stack);
> -	current->bio_list = &bio_list_on_stack;
> +	bio_list_init(&bio_lists_on_stack.queue);
> +	current->bio_lists = &bio_lists_on_stack;
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			bio_list_init(&bio_lists_on_stack.recursion);
>  			ret = q->make_request_fn(q, bio);
> -
>  			blk_queue_exit(q);
> -
> -			bio = bio_list_pop(current->bio_list);
> +			bio_list_merge_head(&bio_lists_on_stack.queue,
> +					&bio_lists_on_stack.recursion);
> +			/* XXX bio_list_init(&bio_lists_on_stack.recursion); */

Please remove this XXX commented code.

> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index b4f3352..b62e65f4 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -664,6 +664,13 @@ static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
>  	}
>  }
>  
> +static inline bool current_has_pending_bios(void)
> +{
> +	return current->bio_lists && (
> +	      !bio_list_empty(&current->bio_lists->queue) ||
> +	      !bio_list_empty(&current->bio_lists->recursion));
> +}
> +

This can be removed once include/linux/bio.h exports the same helper.


* [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
  2016-07-08 18:49   ` Mike Snitzer
@ 2016-07-11 14:10   ` Lars Ellenberg
  2016-07-12  2:55     ` [dm-devel] " NeilBrown
  2016-12-23  8:49     ` Michael Wang
  1 sibling, 2 replies; 23+ messages in thread
From: Lars Ellenberg @ 2016-07-11 14:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, linux-raid, Martin K. Petersen, Mike Snitzer,
	Peter Zijlstra, Jiri Kosina, Ming Lei, linux-kernel, Zheng Liu,
	linux-block, Takashi Iwai, linux-bcache, Ingo Molnar,
	Alasdair Kergon, Keith Busch, dm-devel, Shaohua Li,
	Kent Overstreet, Kirill A. Shutemov, Roland Kammerer

For a long time, generic_make_request() has converted recursion into
iteration by queuing recursive arguments on current->bio_list.

This is convenient for stacking drivers: the top-most driver takes the
originally submitted bio and re-submits a re-mapped version of it, or
one or more clones, or one or more newly allocated bios, to its
backend(s). These are then simply processed in turn, and each can again
queue more "backend-bios", until we reach the bottom of the driver
stack and actually dispatch to the real backend device.

Any stacking driver's ->make_request_fn() could expect that, once it
returns, any backend-bios it submitted via recursive calls to
generic_make_request() would be processed and dispatched before the
current task calls into this driver again.

This was changed by commit
  54efd50 ("block: make generic_make_request handle arbitrarily sized bios")

Drivers may call blk_queue_split() inside their ->make_request_fn(),
which may split the current bio into a front-part to be dealt with
immediately, and a remainder-part, which may need to be split even
further. That remainder-part will simply also be pushed to
current->bio_list, and will end up being head-of-queue, in front of any
backend-bios the current make_request_fn() might submit during
processing of the front-part.

This means the current task immediately ends up back in the same
make_request_fn() of the same driver again, before any of its backend
bios have even been processed.

This can lead to resource-starvation deadlock.
Drivers could avoid this by learning not to need blk_queue_split(), by
submitting their backend bios in a different context (dedicated kernel
thread, workqueue context, ...), or by playing funny re-ordering games
with entries on current->bio_list.

Instead, I suggest distinguishing between recursive calls to
generic_make_request() and pushing back the remainder part in
blk_queue_split(), by pointing current->bio_lists to a
	struct recursion_to_iteration_bio_lists {
		struct bio_list recursion;
		struct bio_list queue;
	}

By providing each q->make_request_fn() with an empty "recursion"
bio_list, then merging any recursively submitted bios to the head of
the "queue" list, we can make the recursion-to-iteration logic in
generic_make_request() process deepest-level bios first, and "sibling"
bios of the same level in "natural" order.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
---
 block/bio.c               | 20 +++++++++++--------
 block/blk-core.c          | 49 +++++++++++++++++++++++++----------------------
 block/blk-merge.c         |  5 ++++-
 drivers/md/bcache/btree.c | 12 ++++++------
 drivers/md/dm-bufio.c     |  2 +-
 drivers/md/raid1.c        |  5 ++---
 drivers/md/raid10.c       |  5 ++---
 include/linux/bio.h       | 25 ++++++++++++++++++++++++
 include/linux/sched.h     |  4 ++--
 9 files changed, 80 insertions(+), 47 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 848cd35..c2606fd 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
 	 */
 
 	bio_list_init(&punt);
-	bio_list_init(&nopunt);
 
-	while ((bio = bio_list_pop(current->bio_list)))
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
 		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->recursion = nopunt;
 
-	*current->bio_list = nopunt;
+	bio_list_init(&nopunt);
+	while ((bio = bio_list_pop(&current->bio_lists->queue)))
+		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
+	current->bio_lists->queue = nopunt;
 
 	spin_lock(&bs->rescue_lock);
 	bio_list_merge(&bs->rescue_list, &punt);
@@ -453,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
 		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
+		 * bios on current->bio_lists->{recursion,queue}, we first try the
+		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
+		 * punt those bios we would be blocking to the rescuer
+		 * workqueue before we retry with the original gfp_flags.
 		 */
 
-		if (current->bio_list && !bio_list_empty(current->bio_list))
+		if (current_has_pending_bios())
 			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
diff --git a/block/blk-core.c b/block/blk-core.c
index 3cfd67d..2886a59b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2040,7 +2040,7 @@ end_io:
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2049,15 +2049,20 @@ blk_qc_t generic_make_request(struct bio *bio)
 	/*
 	 * We only want one ->make_request_fn to be active at a time, else
 	 * stack usage with stacked devices could be a problem.  So use
-	 * current->bio_list to keep a list of requests submited by a
-	 * make_request_fn function.  current->bio_list is also used as a
+	 * current->bio_lists to keep a list of requests submited by a
+	 * make_request_fn function.  current->bio_lists is also used as a
 	 * flag to say if generic_make_request is currently active in this
 	 * task or not.  If it is NULL, then no make_request is active.  If
 	 * it is non-NULL, then a make_request is active, and new requests
-	 * should be added at the tail
+	 * should be added at the tail of current->bio_lists->recursion;
+	 * bios resulting from a call to blk_queue_split() from
+	 * within ->make_request_fn() should be pushed back to the head of
+	 * current->bio_lists->queue.
+	 * After the current ->make_request_fn() returns, .recursion will be
+	 * merged back to the head of .queue.
 	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	if (current->bio_lists) {
+		bio_list_add(&current->bio_lists->recursion, bio);
 		goto out;
 	}
 
@@ -2066,35 +2071,33 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * Before entering the loop, bio->bi_next is NULL (as all callers
 	 * ensure that) so we have a list with a single bio.
 	 * We pretend that we have just taken it off a longer list, so
-	 * we assign bio_list to a pointer to the bio_list_on_stack,
-	 * thus initialising the bio_list of new bios to be
-	 * added.  ->make_request() may indeed add some more bios
-	 * through a recursive call to generic_make_request.  If it
-	 * did, we find a non-NULL value in bio_list and re-enter the loop
-	 * from the top.  In this case we really did just take the bio
-	 * of the top of the list (no pretending) and so remove it from
-	 * bio_list, and call into ->make_request() again.
+	 * we assign bio_list to a pointer to the bio_lists_on_stack,
+	 * thus initialising the bio_lists of new bios to be added.
+	 * ->make_request() may indeed add some more bios to .recursion
+	 * through a recursive call to generic_make_request.  If it did,
+	 * we find a non-NULL value in .recursion, merge .recursion back to the
+	 * head of .queue, and re-enter the loop from the top.  In this case we
+	 * really did just take the bio of the top of the list (no pretending)
+	 * and so remove it from .queue, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
+	bio_list_init(&bio_lists_on_stack.queue);
+	current->bio_lists = &bio_lists_on_stack;
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			bio_list_init(&bio_lists_on_stack.recursion);
 			ret = q->make_request_fn(q, bio);
-
 			blk_queue_exit(q);
-
-			bio = bio_list_pop(current->bio_list);
+			bio_list_merge_head(&bio_lists_on_stack.queue,
+					&bio_lists_on_stack.recursion);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
-
 			bio_io_error(bio);
-			bio = bio_next;
 		}
+		bio = bio_list_pop(&current->bio_lists->queue);
 	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	current->bio_lists = NULL; /* deactivate */
 
 out:
 	return ret;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index c265348..df96327 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 	struct bio *split, *res;
 	unsigned nsegs;
 
+	BUG_ON(!current->bio_lists);
 	if (bio_op(*bio) == REQ_OP_DISCARD)
 		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
 	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
@@ -190,7 +191,9 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
 
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
-		generic_make_request(*bio);
+		/* push back remainder, it may later be split further */
+		bio_list_add_head(&current->bio_lists->queue, *bio);
+		/* and fake submission of a suitably sized piece */
 		*bio = split;
 	}
 }
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 76f7534..731ec3b 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
 
 	trace_bcache_btree_write(b);
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 	BUG_ON(b->written >= btree_blocks(b));
 	BUG_ON(b->written && !i->keys);
 	BUG_ON(btree_bset_first(b)->seq != i->seq);
@@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
 
 	/* Force write if set is too big */
 	if (set_bytes(i) > PAGE_SIZE - 48 &&
-	    !current->bio_list)
+	    !current->bio_lists)
 		bch_btree_node_write(b, NULL);
 }
 
@@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
 {
 	struct btree *b;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 
 	lockdep_assert_held(&c->bucket_lock);
 
@@ -976,7 +976,7 @@ retry:
 	b = mca_find(c, k);
 
 	if (!b) {
-		if (current->bio_list)
+		if (current->bio_lists)
 			return ERR_PTR(-EAGAIN);
 
 		mutex_lock(&c->bucket_lock);
@@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
 
 	return 0;
 split:
-	if (current->bio_list) {
+	if (current->bio_lists) {
 		op->lock = b->c->root->level + 1;
 		return -EAGAIN;
 	} else if (op->lock <= b->c->root->level) {
@@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
 	struct btree_insert_op op;
 	int ret = 0;
 
-	BUG_ON(current->bio_list);
+	BUG_ON(current->bio_lists);
 	BUG_ON(bch_keylist_empty(keys));
 
 	bch_btree_op_init(&op.op, 0);
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 6571c81..ba0c325 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
 #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
 #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
 
-#define dm_bufio_in_request()	(!!current->bio_list)
+#define dm_bufio_in_request()	(!!current->bio_lists)
 
 static void dm_bufio_lock(struct dm_bufio_client *c)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 10e53cd..38790e3 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
 				    (!conf->barrier ||
 				     ((conf->start_next_window <
 				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
+				      current_has_pending_bios())),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r1conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->bio_lists) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 245640b..13a5341 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
 		wait_event_lock_irq(conf->wait_barrier,
 				    !conf->barrier ||
 				    (conf->nr_pending &&
-				     current->bio_list &&
-				     !bio_list_empty(current->bio_list)),
+				     current_has_pending_bios()),
 				    conf->resync_lock);
 		conf->nr_waiting--;
 	}
@@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	struct r10conf *conf = mddev->private;
 	struct bio *bio;
 
-	if (from_schedule || current->bio_list) {
+	if (from_schedule || current->bio_lists) {
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		conf->pending_count += plug->pending_cnt;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b7e1a008..2f8a361 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -541,6 +541,24 @@ struct bio_list {
 	struct bio *tail;
 };
 
+/* for generic_make_request() */
+struct recursion_to_iteration_bio_lists {
+	/* For stacking drivers submitting to their respective backend,
+	 * bios are added to the tail of .recursion, which is re-initialized
+	 * before each call to ->make_request_fn() and after that returns,
+	 * the whole .recursion queue is then merged back to the head of .queue.
+	 *
+	 * The recursion-to-iteration logic in generic_make_request() will
+	 * peel off of .queue.head, processing bios in deepest-level-first
+	 * "natural" order. */
+	struct bio_list recursion;
+
+	/* This keeps a list of to-be-processed bios.
+	 * The "remainder" part resulting from calling blk_queue_split()
+	 * will be pushed back to its head. */
+	struct bio_list queue;
+};
+
 static inline int bio_list_empty(const struct bio_list *bl)
 {
 	return bl->head == NULL;
@@ -551,6 +569,13 @@ static inline void bio_list_init(struct bio_list *bl)
 	bl->head = bl->tail = NULL;
 }
 
+static inline bool current_has_pending_bios(void)
+{
+	return current->bio_lists &&
+		(!bio_list_empty(&current->bio_lists->queue) ||
+		 !bio_list_empty(&current->bio_lists->recursion));
+}
+
 #define BIO_EMPTY_LIST	{ NULL, NULL }
 
 #define bio_list_for_each(bio, bl) \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6e42ada..146eedc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -128,7 +128,7 @@ struct sched_attr {
 
 struct futex_pi_state;
 struct robust_list_head;
-struct bio_list;
+struct recursion_to_iteration_bio_lists;
 struct fs_struct;
 struct perf_event_context;
 struct blk_plug;
@@ -1727,7 +1727,7 @@ struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct recursion_to_iteration_bio_lists *bio_lists;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */
-- 
1.9.1


* Re: [PATCH 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-08 18:49   ` Mike Snitzer
@ 2016-07-11 14:13     ` Lars Ellenberg
  0 siblings, 0 replies; 23+ messages in thread
From: Lars Ellenberg @ 2016-07-11 14:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, linux-block, Keith Busch, Martin K. Petersen,
	Peter Zijlstra, Jiri Kosina, Ming Lei, Kirill A. Shutemov,
	NeilBrown, linux-kernel, linux-raid, Takashi Iwai, linux-bcache,
	Zheng Liu, Kent Overstreet, dm-devel, Shaohua Li, Ingo Molnar,
	Alasdair Kergon, Roland Kammerer

Dropped the XXX comment (oops),
moved the current_has_pending_bios() helper to bio.h
and dropped the identical ones from bio.c and md.h.

Reposted in-thread as
[PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion

Thanks,

    Lars Ellenberg


* Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
@ 2016-07-12  2:55     ` NeilBrown
  2016-07-13  2:18       ` Eric Wheeler
  2016-12-23  8:49     ` Michael Wang
  1 sibling, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-07-12  2:55 UTC (permalink / raw)
  To: Lars Ellenberg, Jens Axboe
  Cc: linux-block, Martin K. Petersen, Mike Snitzer, Peter Zijlstra,
	Jiri Kosina, Ming Lei, Kirill A. Shutemov, linux-kernel,
	linux-raid, Takashi Iwai, linux-bcache, Zheng Liu,
	Kent Overstreet, Keith Busch, dm-devel, Shaohua Li, Ingo Molnar,
	Alasdair Kergon, Roland Kammerer


On Tue, Jul 12 2016, Lars Ellenberg wrote:
....
>
> Instead, I suggest distinguishing between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list queue;
> 	}
>
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
>
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks again for doing this - I think this is a very significant
improvement and could allow other simplifications.

NeilBrown



* Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-12  2:55     ` [dm-devel] " NeilBrown
@ 2016-07-13  2:18       ` Eric Wheeler
  2016-07-13  2:32         ` Mike Snitzer
  0 siblings, 1 reply; 23+ messages in thread
From: Eric Wheeler @ 2016-07-13  2:18 UTC (permalink / raw)
  To: NeilBrown
  Cc: Lars Ellenberg, Jens Axboe, linux-block, Martin K. Petersen,
	Mike Snitzer, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch, dm-devel,
	Shaohua Li, Ingo Molnar, Alasdair Kergon, Roland Kammerer

On Tue, 12 Jul 2016, NeilBrown wrote:

> On Tue, Jul 12 2016, Lars Ellenberg wrote:
> ....
> >
> > Instead, I suggest distinguishing between recursive calls to
> > generic_make_request(), and pushing back the remainder part in
> > blk_queue_split(), by pointing current->bio_lists to a
> > 	struct recursion_to_iteration_bio_lists {
> > 		struct bio_list recursion;
> > 		struct bio_list queue;
> > 	}
> >
> > By providing each q->make_request_fn() with an empty "recursion"
> > bio_list, then merging any recursively submitted bios to the
> > head of the "queue" list, we can make the recursion-to-iteration
> > logic in generic_make_request() process deepest level bios first,
> > and "sibling" bios of the same level in "natural" order.
> >
> > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> 
> Thanks again for doing this - I think this is a very significant
> improvement and could allow other simplifications.

Thank you Lars for all of this work!  

It seems like there have been many 4.3+ blockdev stacking issues, and 
this will certainly address some of those (maybe all of them?).  (I 
think we hit this while trying drbd in 4.4, so we dropped back to 4.1 
without issue.)  It would be great to see 4.4.y stable pick this up if 
compatible.


Do you believe that this patch would solve any of the proposals by others 
since 4.3 related to bio splitting/large bios?  I've been collecting a 
list, none of which appear have landed yet as of 4.7-rc7 (but correct me 
if I'm wrong):

A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
	by Ming Lei: https://patchwork.kernel.org/patch/9169483/

B.  block: don't make BLK_DEF_MAX_SECTORS too big
	by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html

C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
		(was https://patchwork.kernel.org/patch/7398411/)

D.  dm-crypt: Fix error with too large bios
	by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/

Patches A, B, and D are known to fix large-bio issues when stacking 
dm+bcache (though B and D are trivial and probably necessary even with 
your patch).

Patch C was mentioned earlier in this thread by Mike Snitzer and you 
commented briefly that his patch might solve the issue; given that, and in 
the interest of minimizing duplicate effort, which of the following best 
describes the situation?

  1. Your patch could supersede Mikulas's patch; they address the same 
issue.

  2. Mikulas's patch addresses different issues, such that both patches 
should be applied.

  3. There is overlap between both your patch and Mikulas's such that both 
#1,#2 are true and effort to solve this has been duplicated.


If #3, then what might be done to resolve the overlap?

What are the opinions of the authors and can a consensus be reached so we 
can see these pushed upstream with the appropriate stable Cc tags and 
ultimately fix 4.4.y?


--
Eric Wheeler


* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-13  2:18       ` Eric Wheeler
@ 2016-07-13  2:32         ` Mike Snitzer
  2016-07-19  9:00           ` Lars Ellenberg
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2016-07-13  2:32 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: NeilBrown, Lars Ellenberg, Jens Axboe, linux-block,
	Martin K. Petersen, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch, dm-devel,
	Shaohua Li, Ingo Molnar, Alasdair Kergon, Roland Kammerer

On Tue, Jul 12 2016 at 10:18pm -0400,
Eric Wheeler <bcache@lists.ewheeler.net> wrote:

> On Tue, 12 Jul 2016, NeilBrown wrote:
> 
> > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > ....
> > >
> > > Instead, I suggest distinguishing between recursive calls to
> > > generic_make_request(), and pushing back the remainder part in
> > > blk_queue_split(), by pointing current->bio_lists to a
> > > 	struct recursion_to_iteration_bio_lists {
> > > 		struct bio_list recursion;
> > > 		struct bio_list queue;
> > > 	}
> > >
> > > By providing each q->make_request_fn() with an empty "recursion"
> > > bio_list, then merging any recursively submitted bios to the
> > > head of the "queue" list, we can make the recursion-to-iteration
> > > logic in generic_make_request() process deepest level bios first,
> > > and "sibling" bios of the same level in "natural" order.
> > >
> > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> > 
> > Reviewed-by: NeilBrown <neilb@suse.com>
> > 
> > Thanks again for doing this - I think this is a very significant
> > improvement and could allow other simplifications.
> 
> Thank you Lars for all of this work!  
> 
> It seems like there have been many 4.3+ blockdev stacking issues and this 
> will certainly address some of those (maybe all of them?).  (I think we 
> hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> issue.)  It would be great to hear 4.4.y stable pick this up if 
> compatible.
> 
> 
> Do you believe that this patch would solve any of the proposals by others 
> since 4.3 related to bio splitting/large bios?  I've been collecting a 
> list, none of which appear have landed yet as of 4.7-rc7 (but correct me 
> if I'm wrong):
> 
> A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
> 	by Ming Lei: https://patchwork.kernel.org/patch/9169483/
> 
> B.  block: don't make BLK_DEF_MAX_SECTORS too big
> 	by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html
> 
> C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> 		(was https://patchwork.kernel.org/patch/7398411/)
> 
> D.  dm-crypt: Fix error with too large bios
> 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/
> 
> The A,B,D are known to fix large bio issues when stacking dm+bcache 
> (though the B,D are trivial and probably necessary even with your patch).
> 
> Patch C was mentioned earlier in this thread by Mike Snitzer and you 
> commented briefly that his patch might solve the issue; given that, and in 
> the interest of minimizing duplicate effort, which of the following best 
> describes the situation?
> 
>   1. Your patch could supersede Mikulas's patch; they address the same 
> issue.
> 
> >   2. Mikulas's patch addresses different issues, such that both patches 
> > should be applied.
> 
>   3. There is overlap between both your patch and Mikulas's such that both 
> #1,#2 are true and effort to solve this has been duplicated.
> 
> 
> If #3, then what might be done to resolve the overlap?

Mikulas confirmed to me that he believes Lars' v2 patch will fix the
dm-snapshot problem, which is being tracked with this BZ:
https://bugzilla.kernel.org/show_bug.cgi?id=119841

We'll see how testing goes (currently underway).


* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-13  2:32         ` Mike Snitzer
@ 2016-07-19  9:00           ` Lars Ellenberg
  2016-07-21 22:53             ` Eric Wheeler
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Lars Ellenberg @ 2016-07-19  9:00 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Eric Wheeler, NeilBrown, Jens Axboe, linux-block,
	Martin K. Petersen, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch, dm-devel,
	Shaohua Li, Ingo Molnar, Alasdair Kergon, Roland Kammerer

On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> On Tue, Jul 12 2016 at 10:18pm -0400,
> Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> 
> > On Tue, 12 Jul 2016, NeilBrown wrote:
> > 
> > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > ....
> > > >
> > > > Instead, I suggest distinguishing between recursive calls to
> > > > generic_make_request(), and pushing back the remainder part in
> > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > 	struct recursion_to_iteration_bio_lists {
> > > > 		struct bio_list recursion;
> > > > 		struct bio_list queue;
> > > > 	}
> > > >
> > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > bio_list, then merging any recursively submitted bios to the
> > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > logic in generic_make_request() process deepest level bios first,
> > > > and "sibling" bios of the same level in "natural" order.
> > > >
> > > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > > > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> > > 
> > > Reviewed-by: NeilBrown <neilb@suse.com>
> > > 
> > > Thanks again for doing this - I think this is a very significant
> > > improvement and could allow other simplifications.
> > 
> > Thank you Lars for all of this work!  
> > 
> > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > will certainly address some of those (maybe all of them?).  (I think we 
> > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > compatible.
> > 
> > 
> > Do you believe that this patch would solve any of the proposals by others 
> > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > list, none of which appear have landed yet as of 4.7-rc7 (but correct me 
> > if I'm wrong):
> > 
> > A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
> > 	by Ming Lei: https://patchwork.kernel.org/patch/9169483/

That's an independent issue.

> > B.  block: don't make BLK_DEF_MAX_SECTORS too big
> > 	by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html

Yet another independent issue.

> > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > 		(was https://patchwork.kernel.org/patch/7398411/)

As it stands now,
this is yet another issue, but a related one.

From the link above:

| ** Here is the dm-snapshot deadlock that was observed:
| 
| 1) Process A sends one-page read bio to the dm-snapshot target. The bio
| spans snapshot chunk boundary and so it is split to two bios by device
| mapper.
| 
| 2) Device mapper creates the first sub-bio and sends it to the snapshot
| driver.
| 
| 3) The function snapshot_map calls track_chunk (that allocates a
| structure
| dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
| the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
| 
| 4) The remapped bio is submitted with generic_make_request, but it isn't
| issued - it is added to current->bio_list instead.
| 
| 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
| chunk affected by the first remapped bio, it takes down_write(&s->lock)
| and then loops in __check_for_conflicting_io, waiting for
| dm_snap_tracked_chunk created in step 3) to be released.
| 
| 6) Process A continues, it creates a second sub-bio for the rest of the
| original bio.

Aha.
Here is the relation.
If "A" had only ever processed "just the chunk it can handle now",
and "pushed back" the rest of the incoming bio,
it could rely on all deeper level bios to have been submitted already.

But this does not look like it easily fits into the current DM model.

| 7) snapshot_map is called for this new bio, it waits on
| down_write(&s->lock) that is held by Process B (in step 5).

There is another suggestion:
Use down_trylock (or down_timeout),
and if it fails, push back the currently to-be-processed bio.
We can introduce a new bio helper for that.
Kind of what blk_queue_split() does with my patch applied.
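
A minimal sketch of such a helper, assuming the
recursion_to_iteration_bio_lists from my patch is in place
(the name bio_push_back() and the return code are made up):

	/* Park a bio at the head of the iteration queue; it gets
	 * re-processed once all deeper-level bios have been dealt with. */
	static inline void bio_push_back(struct bio *bio)
	{
		BUG_ON(!current->bio_lists);
		bio_list_add_head(&current->bio_lists->queue, bio);
	}

	/* in snapshot_map(), instead of blocking on the semaphore: */
	if (!down_write_trylock(&s->lock)) {
		bio_push_back(bio);
		return DM_MAPIO_SUBMITTED;	/* "we keep the bio"; a guess */
	}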

Or even better, ignore the down_trylock suggestion:
simply do not iterate over all pieces first,
but process one piece, and return back to the
iteration in generic_make_request().

A bit of conflict here may be that DM has all its own
split and clone and queue magic, and wants to process
"all of the bio" before returning back to generic_make_request().

To change that, __split_and_process_bio() and all its helpers
would need to learn to "push back" (pieces of) the bio they are
currently working on, and not push back via "DM_ENDIO_REQUEUE",
but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).

Then, after they processed each piece,
*return* all the way up to the top-level generic_make_request(),
where the recursion-to-iteration logic would then
make sure that all deeper level bios, submitted via
recursive calls to generic_make_request() will be processed, before the
next, pushed back, piece of the "original incoming" bio.

And *not* do their own iteration over all pieces first.
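
Schematically (R, b1 and b2 are made-up names: R is the pushed-back
remainder piece, b1 and b2 are backend bios submitted recursively
while processing the front piece):

	queue:     [R, ...]         /* after the push back in blk_queue_split() */
	recursion: [b1, b2]         /* when ->make_request_fn() returns */
	queue:     [b1, b2, R, ...] /* after .recursion is merged to the head */

So b1 and b2 get dispatched before R is looked at again.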

It is probably not quite as easy as just dropping the while loop,
using bio_advance, and pushing that "advanced" bio back to
current->...queue, but roughly:

static void __split_and_process_bio(struct mapped_device *md,
				    struct dm_table *map, struct bio *bio)
...
		ci.bio = bio;
		ci.sector_count = bio_sectors(bio);
		while (ci.sector_count && !error)
			error = __split_and_process_non_flush(&ci);
...
		/* instead: process a single piece, push back the rest */
		error = __split_and_process_non_flush(&ci);
		if (ci.sector_count) {
			/* skip what was just consumed; the byte count is a guess */
			bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
			bio_list_add_head(&current->bio_lists->queue, bio);
		}
...

Something like that, maybe?
Just a thought.

> > D.  dm-crypt: Fix error with too large bios
> > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/
> > 
> > The A,B,D are known to fix large bio issues when stacking dm+bcache 
> > (though the B,D are trivial and probably necessary even with your patch).
> > 
> > Patch C was mentioned earlier in this thread by Mike Snitzer and you 
> > commented briefly that his patch might solve the issue; given that, and in 
> > the interest of minimizing duplicate effort, which of the following best 
> > describes the situation?
> > 
> >   1. Your patch could supersede Mikulas's patch; they address the same 
> > issue.
> > 
> >   2. Mikulas's patch addresses different issues such that both patches 
> > should be applied.
> > 
> >   3. There is overlap between both your patch and Mikulas's such that both 
> > #1,#2 are true and effort to solve this has been duplicated.
> > 
> > 
> > If #3, then what might be done to resolve the overlap?
> 
> Mikulas confirmed to me that he believes Lars' v2 patch will fix the
> dm-snapshot problem, which is being tracked with this BZ:
> https://bugzilla.kernel.org/show_bug.cgi?id=119841
> 
> We'll see how testing goes (currently underway).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-19  9:00           ` Lars Ellenberg
@ 2016-07-21 22:53             ` Eric Wheeler
  2016-07-25 20:39               ` Jeff Moyer
  2016-08-11  4:16             ` Eric Wheeler
  2017-01-07 19:56             ` Lars Ellenberg
  2 siblings, 1 reply; 23+ messages in thread
From: Eric Wheeler @ 2016-07-21 22:53 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Mike Snitzer, NeilBrown, Jens Axboe, linux-block,
	Martin K. Petersen, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch, dm-devel,
	Shaohua Li, Ingo Molnar, Alasdair Kergon, Roland Kammerer,
	Mikulas Patocka, Jeff Moyer

[+cc Mikulas Patocka, Jeff Moyer; Do either of you have any input on Lars' 
commentary related to patchwork #'s 9204125 and 7398411 and BZ#119841? ]

On Tue, 19 Jul 2016, Lars Ellenberg wrote:

> On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 12 2016 at 10:18pm -0400,
> > Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > > On Tue, 12 Jul 2016, NeilBrown wrote:
> > > 
> > > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > > ....
> > > > >
> > > > > Instead, I suggest to distinguish between recursive calls to
> > > > > generic_make_request(), and pushing back the remainder part in
> > > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > > 	struct recursion_to_iteration_bio_lists {
> > > > > 		struct bio_list recursion;
> > > > > 		struct bio_list queue;
> > > > > 	}
> > > > >
> > > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > > bio_list, then merging any recursively submitted bios to the
> > > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > > logic in generic_make_request() process deepest level bios first,
> > > > > and "sibling" bios of the same level in "natural" order.
> > > > >
> > > > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > > > > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> > > > 
> > > > Reviewed-by: NeilBrown <neilb@suse.com>
> > > > 
> > > > Thanks again for doing this - I think this is a very significant
> > > > improvement and could allow other simplifications.
> > > 
> > > Thank you Lars for all of this work!  
> > > 
> > > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > > will certainly address some of those (maybe all of them?).  (I think we 
> > > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > > compatible.
> > > 
> > > 
> > > Do you believe that this patch would solve any of the proposals by others 
> > > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > > list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
> > > if I'm wrong):
> > > 
> > > A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
> > > 	by Ming Lei: https://patchwork.kernel.org/patch/9169483/
> 
> That's an independent issue.
> 
> > > B.  block: don't make BLK_DEF_MAX_SECTORS too big
> > > 	by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html
> 
> Yet another independent issue.
> 
> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > > 		(was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now,
> this is yet another issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected by the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(&s->lock) that is held by Process B (in step 5).
> 
> There is another suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion:
> simply do not iterate over all pieces first,
> but process one piece, and return back to the
> iteration in generic_make_request().
> 
> A bit of conflict here may be that DM has all its own
> split and clone and queue magic, and wants to process
> "all of the bio" before returning back to generic_make_request().
> 
> To change that, __split_and_process_bio() and all its helpers
> would need to learn to "push back" (pieces of) the bio they are
> currently working on, and not push back via "DM_ENDIO_REQUEUE",
> but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).
> 
> Then, after they processed each piece,
> *return* all the way up to the top-level generic_make_request(),
> where the recursion-to-iteration logic would then
> make sure that all deeper level bios, submitted via
> recursive calls to generic_make_request() will be processed, before the
> next, pushed back, piece of the "original incoming" bio.
> 
> And *not* do their own iteration over all pieces first.
> 
> It is probably not quite as easy as just dropping the while loop,
> using bio_advance, and pushing that "advanced" bio back to
> current->...queue, but roughly:
> 
> static void __split_and_process_bio(struct mapped_device *md,
> 				    struct dm_table *map, struct bio *bio)
> ...
> 		ci.bio = bio;
> 		ci.sector_count = bio_sectors(bio);
> 		while (ci.sector_count && !error)
> 			error = __split_and_process_non_flush(&ci);
> ...
> 		/* instead: process a single piece, push back the rest */
> 		error = __split_and_process_non_flush(&ci);
> 		if (ci.sector_count) {
> 			/* skip what was just consumed; the byte count is a guess */
> 			bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
> 			bio_list_add_head(&current->bio_lists->queue, bio);
> 		}
> ...
> 
> Something like that, maybe?
> Just a thought.
> 
> > > D.  dm-crypt: Fix error with too large bios
> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/
> > > 
> > > The A,B,D are known to fix large bio issues when stacking dm+bcache 
> > > (though the B,D are trivial and probably necessary even with your patch).
> > > 
> > > Patch C was mentioned earlier in this thread by Mike Snitzer and you 
> > > commented briefly that his patch might solve the issue; given that, and in 
> > > the interest of minimizing duplicate effort, which of the following best 
> > > describes the situation?
> > > 
> > >   1. Your patch could supersede Mikulas's patch; they address the same 
> > > issue.
> > > 
> > >   2. Mikulas's patch addresses different issues such that both patches 
> > > should be applied.
> > > 
> > >   3. There is overlap between both your patch and Mikulas's such that both 
> > > #1,#2 are true and effort to solve this has been duplicated.
> > > 
> > > 
> > > If #3, then what might be done to resolve the overlap?
> > 
> > Mikulas confirmed to me that he believes Lars' v2 patch will fix the
> > dm-snapshot problem, which is being tracked with this BZ:
> > https://bugzilla.kernel.org/show_bug.cgi?id=119841
> > 
> > We'll see how testing goes (currently underway).
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-21 22:53             ` Eric Wheeler
@ 2016-07-25 20:39               ` Jeff Moyer
  0 siblings, 0 replies; 23+ messages in thread
From: Jeff Moyer @ 2016-07-25 20:39 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: Lars Ellenberg, Mike Snitzer, NeilBrown, Jens Axboe, linux-block,
	Martin K. Petersen, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch, dm-devel,
	Shaohua Li, Ingo Molnar, Alasdair Kergon, Roland Kammerer,
	Mikulas Patocka

Eric Wheeler <bcache@lists.ewheeler.net> writes:

> [+cc Mikulas Patocka, Jeff Moyer; Do either of you have any input on Lars' 
> commentary related to patchwork #'s 9204125 and 7398411 and BZ#119841? ]

Sorry, I don't have any time to look at this right now.

Cheers,
Jeff

>
> On Tue, 19 Jul 2016, Lars Ellenberg wrote:
>
>> On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
>> > On Tue, Jul 12 2016 at 10:18pm -0400,
>> > Eric Wheeler <bcache@lists.ewheeler.net> wrote:
>> > 
>> > > On Tue, 12 Jul 2016, NeilBrown wrote:
>> > > 
>> > > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
>> > > > ....
>> > > > >
>> > > > > Instead, I suggest to distinguish between recursive calls to
>> > > > > generic_make_request(), and pushing back the remainder part in
>> > > > > blk_queue_split(), by pointing current->bio_lists to a
>> > > > > 	struct recursion_to_iteration_bio_lists {
>> > > > > 		struct bio_list recursion;
>> > > > > 		struct bio_list queue;
>> > > > > 	}
>> > > > >
>> > > > > By providing each q->make_request_fn() with an empty "recursion"
>> > > > > bio_list, then merging any recursively submitted bios to the
>> > > > > head of the "queue" list, we can make the recursion-to-iteration
>> > > > > logic in generic_make_request() process deepest level bios first,
>> > > > > and "sibling" bios of the same level in "natural" order.
>> > > > >
>> > > > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
>> > > > > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
>> > > > 
>> > > > Reviewed-by: NeilBrown <neilb@suse.com>
>> > > > 
>> > > > Thanks again for doing this - I think this is a very significant
>> > > > improvement and could allow other simplifications.
>> > > 
>> > > Thank you Lars for all of this work!  
>> > > 
>> > > It seems like there have been many 4.3+ blockdev stacking issues and this 
>> > > will certainly address some of those (maybe all of them?).  (I think we 
>> > > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
>> > > issue.)  It would be great to hear 4.4.y stable pick this up if 
>> > > compatible.
>> > > 
>> > > 
>> > > Do you believe that this patch would solve any of the proposals by others 
>> > > since 4.3 related to bio splitting/large bios?  I've been collecting a 
>> > > list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
>> > > if I'm wrong):
>> > > 
>> > > A.  [PATCH v2] block: make sure big bio is splitted into at most 256 bvecs
>> > > 	by Ming Lei: https://patchwork.kernel.org/patch/9169483/
>> 
>> That's an independent issue.
>> 
>> > > B.  block: don't make BLK_DEF_MAX_SECTORS too big
>> > > 	by Shaohua Li: http://www.spinics.net/lists/linux-bcache/msg03525.html
>> 
>> Yet another independent issue.
>> 
>> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
>> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
>> > > 		(was https://patchwork.kernel.org/patch/7398411/)
>> 
>> As it stands now,
>> this is yet another issue, but related.
>> 
>> From the link above:
>> 
>> | ** Here is the dm-snapshot deadlock that was observed:
>> | 
>> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
>> | spans snapshot chunk boundary and so it is split to two bios by device
>> | mapper.
>> | 
>> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
>> | driver.
>> | 
>> | 3) The function snapshot_map calls track_chunk (that allocates a
>> | structure
>> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
>> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
>> | 
>> | 4) The remapped bio is submitted with generic_make_request, but it isn't
>> | issued - it is added to current->bio_list instead.
>> | 
>> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
>> | chunk affected by the first remapped bio, it takes down_write(&s->lock)
>> | and then loops in __check_for_conflicting_io, waiting for
>> | dm_snap_tracked_chunk created in step 3) to be released.
>> | 
>> | 6) Process A continues, it creates a second sub-bio for the rest of the
>> | original bio.
>> 
>> Aha.
>> Here is the relation.
>> If "A" had only ever processed "just the chunk it can handle now",
>> and "pushed back" the rest of the incoming bio,
>> it could rely on all deeper level bios to have been submitted already.
>> 
>> But this does not look like it easily fits into the current DM model.
>> 
>> | 7) snapshot_map is called for this new bio, it waits on
>> | down_write(&s->lock) that is held by Process B (in step 5).
>> 
>> There is another suggestion:
>> Use down_trylock (or down_timeout),
>> and if it fails, push back the currently to-be-processed bio.
>> We can introduce a new bio helper for that.
>> Kind of what blk_queue_split() does with my patch applied.
>> 
>> Or even better, ignore the down_trylock suggestion:
>> simply do not iterate over all pieces first,
>> but process one piece, and return back to the
>> iteration in generic_make_request().
>> 
>> A bit of conflict here may be that DM has all its own
>> split and clone and queue magic, and wants to process
>> "all of the bio" before returning back to generic_make_request().
>> 
>> To change that, __split_and_process_bio() and all its helpers
>> would need to learn to "push back" (pieces of) the bio they are
>> currently working on, and not push back via "DM_ENDIO_REQUEUE",
>> but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).
>> 
>> Then, after they processed each piece,
>> *return* all the way up to the top-level generic_make_request(),
>> where the recursion-to-iteration logic would then
>> make sure that all deeper level bios, submitted via
>> recursive calls to generic_make_request() will be processed, before the
>> next, pushed back, piece of the "original incoming" bio.
>> 
>> And *not* do their own iteration over all pieces first.
>> 
>> It is probably not quite as easy as just dropping the while loop,
>> using bio_advance, and pushing that "advanced" bio back to
>> current->...queue, but roughly:
>> 
>> static void __split_and_process_bio(struct mapped_device *md,
>> 				    struct dm_table *map, struct bio *bio)
>> ...
>> 		ci.bio = bio;
>> 		ci.sector_count = bio_sectors(bio);
>> 		while (ci.sector_count && !error)
>> 			error = __split_and_process_non_flush(&ci);
>> ...
>> 		/* instead: process a single piece, push back the rest */
>> 		error = __split_and_process_non_flush(&ci);
>> 		if (ci.sector_count) {
>> 			/* skip what was just consumed; the byte count is a guess */
>> 			bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
>> 			bio_list_add_head(&current->bio_lists->queue, bio);
>> 		}
>> ...
>> 
>> Something like that, maybe?
>> Just a thought.
>> 
>> > > D.  dm-crypt: Fix error with too large bios
>> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9138595/
>> > > 
>> > > The A,B,D are known to fix large bio issues when stacking dm+bcache 
>> > > (though the B,D are trivial and probably necessary even with your patch).
>> > > 
>> > > Patch C was mentioned earlier in this thread by Mike Snitzer and you 
>> > > commented briefly that his patch might solve the issue; given that, and in 
>> > > the interest of minimizing duplicate effort, which of the following best 
>> > > describes the situation?
>> > > 
>> > >   1. Your patch could supersede Mikulas's patch; they address the same 
>> > > issue.
>> > > 
>> > >   2. Mikulas's patch addresses different issues such that both patches 
>> > > should be applied.
>> > > 
>> > >   3. There is overlap between both your patch and Mikulas's such that both 
>> > > #1,#2 are true and effort to solve this has been duplicated.
>> > > 
>> > > 
>> > > If #3, then what might be done to resolve the overlap?
>> > 
>> > Mikulas confirmed to me that he believes Lars' v2 patch will fix the
>> > dm-snapshot problem, which is being tracked with this BZ:
>> > https://bugzilla.kernel.org/show_bug.cgi?id=119841
>> > 
>> > We'll see how testing goes (currently underway).
>> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-19  9:00           ` Lars Ellenberg
  2016-07-21 22:53             ` Eric Wheeler
@ 2016-08-11  4:16             ` Eric Wheeler
  2017-01-07 19:56             ` Lars Ellenberg
  2 siblings, 0 replies; 23+ messages in thread
From: Eric Wheeler @ 2016-08-11  4:16 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Mike Snitzer, Eric Wheeler, NeilBrown, Jens Axboe, linux-block,
	Martin K. Petersen, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch, dm-devel,
	Shaohua Li, Ingo Molnar, Alasdair Kergon, Roland Kammerer

On Tue, 19 Jul 2016, Lars Ellenberg wrote:
> On Tue, Jul 12, 2016 at 10:32:33PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 12 2016 at 10:18pm -0400,
> > Eric Wheeler <bcache@lists.ewheeler.net> wrote:
> > 
> > > On Tue, 12 Jul 2016, NeilBrown wrote:
> > > 
> > > > On Tue, Jul 12 2016, Lars Ellenberg wrote:
> > > > ....
> > > > >
> > > > > Instead, I suggest to distinguish between recursive calls to
> > > > > generic_make_request(), and pushing back the remainder part in
> > > > > blk_queue_split(), by pointing current->bio_lists to a
> > > > > 	struct recursion_to_iteration_bio_lists {
> > > > > 		struct bio_list recursion;
> > > > > 		struct bio_list queue;
> > > > > 	}
> > > > >
> > > > > By providing each q->make_request_fn() with an empty "recursion"
> > > > > bio_list, then merging any recursively submitted bios to the
> > > > > head of the "queue" list, we can make the recursion-to-iteration
> > > > > logic in generic_make_request() process deepest level bios first,
> > > > > and "sibling" bios of the same level in "natural" order.
> > > > >
> > > > > Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > > > > Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> > > > 
> > > > Reviewed-by: NeilBrown <neilb@suse.com>
> > > > 
> > > > Thanks again for doing this - I think this is a very significant
> > > > improvement and could allow other simplifications.
> > > 
> > > Thank you Lars for all of this work!  
> > > 
> > > It seems like there have been many 4.3+ blockdev stacking issues and this 
> > > will certainly address some of those (maybe all of them?).  (I think we 
> > > hit this while trying drbd in 4.4 so we dropped back to 4.1 without 
> > > issue.)  It would be great to hear 4.4.y stable pick this up if 
> > > compatible.
> > > 
> > > 
> > > Do you believe that this patch would solve any of the proposals by others 
> > > since 4.3 related to bio splitting/large bios?  I've been collecting a 
> > > list, none of which appear to have landed yet as of 4.7-rc7 (but correct me 
> > > if I'm wrong):

[... cut unrelated A,B ... ]

> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > > 		(was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now, this is yet another issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected by the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(&s->lock) that is held by Process B (in step 5).
> 
> There is another suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion:
> simply do not iterate over all pieces first,
> but process one piece, and return back to the
> iteration in generic_make_request().
> 
> A bit of conflict here may be that DM has all its own
> split and clone and queue magic, and wants to process
> "all of the bio" before returning back to generic_make_request().
> 
> To change that, __split_and_process_bio() and all its helpers
> would need to learn to "push back" (pieces of) the bio they are
> currently working on, and not push back via "DM_ENDIO_REQUEUE",
> but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).
> 
> Then, after they processed each piece,
> *return* all the way up to the top-level generic_make_request(),
> where the recursion-to-iteration logic would then
> make sure that all deeper level bios, submitted via
> recursive calls to generic_make_request() will be processed, before the
> next, pushed back, piece of the "original incoming" bio.
> 
> And *not* do their own iteration over all pieces first.
> 
> It is probably not quite as easy as just dropping the while loop,
> using bio_advance, and pushing that "advanced" bio back to
> current->...queue, but roughly:
> 
> static void __split_and_process_bio(struct mapped_device *md,
> 				    struct dm_table *map, struct bio *bio)
> ...
> 		ci.bio = bio;
> 		ci.sector_count = bio_sectors(bio);
> 		while (ci.sector_count && !error)
> 			error = __split_and_process_non_flush(&ci);
> ...
> 		/* instead: process a single piece, push back the rest */
> 		error = __split_and_process_non_flush(&ci);
> 		if (ci.sector_count) {
> 			/* skip what was just consumed; the byte count is a guess */
> 			bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
> 			bio_list_add_head(&current->bio_lists->queue, bio);
> 		}
> ...
> 
> Something like that, maybe?
> Just a thought.

Hello all,

Has anyone been able to make progress with resolution to this issue?  

Might the suggestions from Lars help solve BZ# 119841?
	https://bugzilla.kernel.org/show_bug.cgi?id=119841

--
Eric Wheeler

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
  2016-07-12  2:55     ` [dm-devel] " NeilBrown
@ 2016-12-23  8:49     ` Michael Wang
  2016-12-23 11:45       ` Lars Ellenberg
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Wang @ 2016-12-23  8:49 UTC (permalink / raw)
  To: Lars Ellenberg, Jens Axboe
  Cc: NeilBrown, linux-raid, Martin K. Petersen, Mike Snitzer,
	Peter Zijlstra, Jiri Kosina, Ming Lei, linux-kernel, Zheng Liu,
	linux-block, Takashi Iwai, linux-bcache, Ingo Molnar,
	Alasdair Kergon, Keith Busch, dm-devel, Shaohua Li,
	Kent Overstreet, Kirill A. Shutemov, Roland Kammerer

Dear Maintainers

I'd like to ask for the status of this patch since we hit the
issue too during our testing on md raid1.

The split remainder bio_A was queued ahead, followed by bio_B for
the lower device. At that moment the raid started freezing; the
loop took bio_A out first and delivered it, which hung since the
raid was freezing, while the freeze never ended since it was
waiting for bio_B to finish, and bio_B was still on the queue,
waiting for bio_A to finish...
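
In terms of the bio_list API, the ordering looks roughly like this
(bio_A/bio_B as above, everything else just for illustration):

	struct bio_list queue;

	bio_list_init(&queue);
	bio_list_add(&queue, bio_A);	/* split remainder, queued first */
	bio_list_add(&queue, bio_B);	/* bio for the lower device */

	/* generic_make_request() pops bio_A first; it re-enters the
	 * raid1 make_request_fn and blocks on the freeze, which in
	 * turn waits for bio_B, still queued behind bio_A: deadlock */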

We're looking for a good solution, and we found that this patch
has already progressed a lot, but we can't find it in linux-next,
so we'd like to ask: is this fix still planned to go upstream?

Regards,
Michael Wang


On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more newly allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list queue;
> 	}
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>  block/bio.c               | 20 +++++++++++--------
>  block/blk-core.c          | 49 +++++++++++++++++++++++++----------------------
>  block/blk-merge.c         |  5 ++++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 25 ++++++++++++++++++++++++
>  include/linux/sched.h     |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>  	 */
>  
>  	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
>  
> -	while ((bio = bio_list_pop(current->bio_list)))
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>  		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->recursion = nopunt;
>  
> -	*current->bio_list = nopunt;
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->queue)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->queue = nopunt;
>  
>  	spin_lock(&bs->rescue_lock);
>  	bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 *
>  		 * We solve this, and guarantee forward progress, with a rescuer
>  		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -		 * bios we would be blocking to the rescuer workqueue before
> -		 * we retry with the original gfp_flags.
> +		 * bios on current->bio_lists->{recursion,queue}, we first try the
> +		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +		 * punt those bios we would be blocking to the rescuer
> +		 * workqueue before we retry with the original gfp_flags.
>  		 */
>  
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> +		if (current_has_pending_bios())
>  			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3cfd67d..2886a59b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2040,7 +2040,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -	struct bio_list bio_list_on_stack;
> +	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>  	blk_qc_t ret = BLK_QC_T_NONE;
>  
>  	if (!generic_make_request_checks(bio))
> @@ -2049,15 +2049,20 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	/*
>  	 * We only want one ->make_request_fn to be active at a time, else
>  	 * stack usage with stacked devices could be a problem.  So use
> -	 * current->bio_list to keep a list of requests submited by a
> -	 * make_request_fn function.  current->bio_list is also used as a
> +	 * current->bio_lists to keep a list of requests submited by a
> +	 * make_request_fn function.  current->bio_lists is also used as a
>  	 * flag to say if generic_make_request is currently active in this
>  	 * task or not.  If it is NULL, then no make_request is active.  If
>  	 * it is non-NULL, then a make_request is active, and new requests
> -	 * should be added at the tail
> +	 * should be added at the tail of current->bio_lists->recursion;
> +	 * bios resulting from a call to blk_queue_split() from
> +	 * within ->make_request_fn() should be pushed back to the head of
> +	 * current->bio_lists->queue.
> +	 * After the current ->make_request_fn() returns, .recursion will be
> +	 * merged back to the head of .queue.
>  	 */
> -	if (current->bio_list) {
> -		bio_list_add(current->bio_list, bio);
> +	if (current->bio_lists) {
> +		bio_list_add(&current->bio_lists->recursion, bio);
>  		goto out;
>  	}
>  
> @@ -2066,35 +2071,33 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * Before entering the loop, bio->bi_next is NULL (as all callers
>  	 * ensure that) so we have a list with a single bio.
>  	 * We pretend that we have just taken it off a longer list, so
> -	 * we assign bio_list to a pointer to the bio_list_on_stack,
> -	 * thus initialising the bio_list of new bios to be
> -	 * added.  ->make_request() may indeed add some more bios
> -	 * through a recursive call to generic_make_request.  If it
> -	 * did, we find a non-NULL value in bio_list and re-enter the loop
> -	 * from the top.  In this case we really did just take the bio
> -	 * of the top of the list (no pretending) and so remove it from
> -	 * bio_list, and call into ->make_request() again.
> +	 * we assign bio_list to a pointer to the bio_lists_on_stack,
> +	 * thus initialising the bio_lists of new bios to be added.
> +	 * ->make_request() may indeed add some more bios to .recursion
> +	 * through a recursive call to generic_make_request.  If it did,
> +	 * we find a non-NULL value in .recursion, merge .recursion back to the
> +	 * head of .queue, and re-enter the loop from the top.  In this case we
> +	 * really did just take the bio of the top of the list (no pretending)
> +	 * and so remove it from .queue, and call into ->make_request() again.
>  	 */
>  	BUG_ON(bio->bi_next);
> -	bio_list_init(&bio_list_on_stack);
> -	current->bio_list = &bio_list_on_stack;
> +	bio_list_init(&bio_lists_on_stack.queue);
> +	current->bio_lists = &bio_lists_on_stack;
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			bio_list_init(&bio_lists_on_stack.recursion);
>  			ret = q->make_request_fn(q, bio);
> -
>  			blk_queue_exit(q);
> -
> -			bio = bio_list_pop(current->bio_list);
> +			bio_list_merge_head(&bio_lists_on_stack.queue,
> +					&bio_lists_on_stack.recursion);
>  		} else {
> -			struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>  			bio_io_error(bio);
> -			bio = bio_next;
>  		}
> +		bio = bio_list_pop(&current->bio_lists->queue);
>  	} while (bio);
> -	current->bio_list = NULL; /* deactivate */
> +	current->bio_lists = NULL; /* deactivate */
>  
>  out:
>  	return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index c265348..df96327 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  	struct bio *split, *res;
>  	unsigned nsegs;
>  
> +	BUG_ON(!current->bio_lists);
>  	if (bio_op(*bio) == REQ_OP_DISCARD)
>  		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>  	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
> @@ -190,7 +191,9 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  
>  		bio_chain(split, *bio);
>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -		generic_make_request(*bio);
> +		/* push back remainder, it may later be split further */
> +		bio_list_add_head(&current->bio_lists->queue, *bio);
> +		/* and fake submission of a suitably sized piece */
>  		*bio = split;
>  	}
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534..731ec3b 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>  
>  	trace_bcache_btree_write(b);
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(b->written >= btree_blocks(b));
>  	BUG_ON(b->written && !i->keys);
>  	BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>  
>  	/* Force write if set is too big */
>  	if (set_bytes(i) > PAGE_SIZE - 48 &&
> -	    !current->bio_list)
> +	    !current->bio_lists)
>  		bch_btree_node_write(b, NULL);
>  }
>  
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>  	struct btree *b;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  
>  	lockdep_assert_held(&c->bucket_lock);
>  
> @@ -976,7 +976,7 @@ retry:
>  	b = mca_find(c, k);
>  
>  	if (!b) {
> -		if (current->bio_list)
> +		if (current->bio_lists)
>  			return ERR_PTR(-EAGAIN);
>  
>  		mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>  
>  	return 0;
>  split:
> -	if (current->bio_list) {
> +	if (current->bio_lists) {
>  		op->lock = b->c->root->level + 1;
>  		return -EAGAIN;
>  	} else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>  	struct btree_insert_op op;
>  	int ret = 0;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(bch_keylist_empty(keys));
>  
>  	bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6571c81..ba0c325 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
>  
> -#define dm_bufio_in_request()	(!!current->bio_list)
> +#define dm_bufio_in_request()	(!!current->bio_lists)
>  
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 10e53cd..38790e3 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>  				    (!conf->barrier ||
>  				     ((conf->start_next_window <
>  				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> +				      current_has_pending_bios())),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r1conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 245640b..13a5341 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>  		wait_event_lock_irq(conf->wait_barrier,
>  				    !conf->barrier ||
>  				    (conf->nr_pending &&
> -				     current->bio_list &&
> -				     !bio_list_empty(current->bio_list)),
> +				     current_has_pending_bios()),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r10conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b7e1a008..2f8a361 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -541,6 +541,24 @@ struct bio_list {
>  	struct bio *tail;
>  };
>  
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +	/* For stacking drivers submitting to their respective backend,
> +	 * bios are added to the tail of .recursion, which is re-initialized
> +	 * before each call to ->make_request_fn() and after that returns,
> +	 * the whole .recursion queue is then merged back to the head of .queue.
> +	 *
> +	 * The recursion-to-iteration logic in generic_make_request() will
> +	 * peel off of .queue.head, processing bios in deepest-level-first
> +	 * "natural" order. */
> +	struct bio_list recursion;
> +
> +	/* This keeps a list of to-be-processed bios.
> +	 * The "remainder" part resulting from calling blk_queue_split()
> +	 * will be pushed back to its head. */
> +	struct bio_list queue;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>  	return bl->head == NULL;
> @@ -551,6 +569,13 @@ static inline void bio_list_init(struct bio_list *bl)
>  	bl->head = bl->tail = NULL;
>  }
>  
> +static inline bool current_has_pending_bios(void)
> +{
> +	return current->bio_lists &&
> +		(!bio_list_empty(&current->bio_lists->queue) ||
> +		 !bio_list_empty(&current->bio_lists->recursion));
> +}
> +
>  #define BIO_EMPTY_LIST	{ NULL, NULL }
>  
>  #define bio_list_for_each(bio, bl) \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>  
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>  	void *journal_info;
>  
>  /* stacked block device info */
> -	struct bio_list *bio_list;
> +	struct recursion_to_iteration_bio_lists *bio_lists;
>  
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-12-23  8:49     ` Michael Wang
@ 2016-12-23 11:45       ` Lars Ellenberg
  2017-01-02 14:33         ` [dm-devel] " Jack Wang
  0 siblings, 1 reply; 23+ messages in thread
From: Lars Ellenberg @ 2016-12-23 11:45 UTC (permalink / raw)
  To: Michael Wang
  Cc: Jens Axboe, NeilBrown, linux-raid, Martin K. Petersen,
	Mike Snitzer, Peter Zijlstra, Jiri Kosina, Ming Lei,
	linux-kernel, Zheng Liu, linux-block, Takashi Iwai, linux-bcache,
	Ingo Molnar, Alasdair Kergon, Keith Busch, dm-devel, Shaohua Li,
	Kent Overstreet, Kirill A. Shutemov, Roland Kammerer

On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> Dear Maintainers
> 
> I'd like to ask for the status of this patch since we hit the
> issue too during our testing on md raid1.
> 
> The split remainder bio_A was queued ahead, followed by bio_B for
> the lower device. At that moment the raid started freezing; the
> loop took bio_A out first and delivered it, which hung since the
> raid was freezing, while the freeze never ended since it was
> waiting for bio_B to finish, and bio_B was still on the queue,
> waiting for bio_A to finish...
> 
> We're looking for a good solution, and we found that this patch
> has already progressed a lot, but we can't find it in linux-next,
> so we'd like to ask: is this fix still planned to go upstream?

I don't see why not, I'd even like to have it in older kernels,
but did not have the time and energy to push it.

Thanks for the bump.

	Lars

On 07/11/2016 04:10 PM, Lars Ellenberg wrote:
> For a long time, generic_make_request() converts recursion into
> iteration by queuing recursive arguments on current->bio_list.
> 
> This is convenient for stacking drivers,
> the top-most driver would take the originally submitted bio,
> and re-submit a re-mapped version of it, or one or more clones,
> or one or more newly allocated bios to its backend(s). Which
> are then simply processed in turn, and each can again queue
> more "backend-bios" until we reach the bottom of the driver stack,
> and actually dispatch to the real backend device.
> 
> Any stacking driver ->make_request_fn() could expect that,
> once it returns, any backend-bios it submitted via recursive calls
> to generic_make_request() would now be processed and dispatched, before
> the current task would call into this driver again.
> 
> This is changed by commit
>   54efd50 block: make generic_make_request handle arbitrarily sized bios
> 
> Drivers may call blk_queue_split() inside their ->make_request_fn(),
> which may split the current bio into a front-part to be dealt with
> immediately, and a remainder-part, which may need to be split even
> further. That remainder-part will simply also be pushed to
> current->bio_list, and would end up being head-of-queue, in front
> of any backend-bios the current make_request_fn() might submit during
> processing of the front-part.
> 
> Which means the current task would immediately end up back in the same
> make_request_fn() of the same driver again, before any of its backend
> bios have even been processed.
> 
> This can lead to resource starvation deadlock.
> Drivers could avoid this by learning to not need blk_queue_split(),
> or by submitting their backend bios in a different context (dedicated
> kernel thread, work_queue context, ...). Or by playing funny re-ordering
> games with entries on current->bio_list.
> 
> Instead, I suggest to distinguish between recursive calls to
> generic_make_request(), and pushing back the remainder part in
> blk_queue_split(), by pointing current->bio_lists to a
> 	struct recursion_to_iteration_bio_lists {
> 		struct bio_list recursion;
> 		struct bio_list queue;
> 	}
> 
> By providing each q->make_request_fn() with an empty "recursion"
> bio_list, then merging any recursively submitted bios to the
> head of the "queue" list, we can make the recursion-to-iteration
> logic in generic_make_request() process deepest level bios first,
> and "sibling" bios of the same level in "natural" order.
> 
> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
> ---
>  block/bio.c               | 20 +++++++++++--------
>  block/blk-core.c          | 49 +++++++++++++++++++++++++----------------------
>  block/blk-merge.c         |  5 ++++-
>  drivers/md/bcache/btree.c | 12 ++++++------
>  drivers/md/dm-bufio.c     |  2 +-
>  drivers/md/raid1.c        |  5 ++---
>  drivers/md/raid10.c       |  5 ++---
>  include/linux/bio.h       | 25 ++++++++++++++++++++++++
>  include/linux/sched.h     |  4 ++--
>  9 files changed, 80 insertions(+), 47 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 848cd35..c2606fd 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -366,12 +366,16 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>  	 */
>  
>  	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
>  
> -	while ((bio = bio_list_pop(current->bio_list)))
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->recursion)))
>  		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->recursion = nopunt;
>  
> -	*current->bio_list = nopunt;
> +	bio_list_init(&nopunt);
> +	while ((bio = bio_list_pop(&current->bio_lists->queue)))
> +		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> +	current->bio_lists->queue = nopunt;
>  
>  	spin_lock(&bs->rescue_lock);
>  	bio_list_merge(&bs->rescue_list, &punt);
> @@ -453,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 *
>  		 * We solve this, and guarantee forward progress, with a rescuer
>  		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -		 * bios we would be blocking to the rescuer workqueue before
> -		 * we retry with the original gfp_flags.
> +		 * bios on current->bio_lists->{recursion,queue}, we first try the
> +		 * allocation without __GFP_DIRECT_RECLAIM; if that fails, we
> +		 * punt those bios we would be blocking to the rescuer
> +		 * workqueue before we retry with the original gfp_flags.
>  		 */
>  
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> +		if (current_has_pending_bios())
>  			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3cfd67d..2886a59b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2040,7 +2040,7 @@ end_io:
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -	struct bio_list bio_list_on_stack;
> +	struct recursion_to_iteration_bio_lists bio_lists_on_stack;
>  	blk_qc_t ret = BLK_QC_T_NONE;
>  
>  	if (!generic_make_request_checks(bio))
> @@ -2049,15 +2049,20 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	/*
>  	 * We only want one ->make_request_fn to be active at a time, else
>  	 * stack usage with stacked devices could be a problem.  So use
> -	 * current->bio_list to keep a list of requests submited by a
> -	 * make_request_fn function.  current->bio_list is also used as a
> +	 * current->bio_lists to keep a list of requests submited by a
> +	 * make_request_fn function.  current->bio_lists is also used as a
>  	 * flag to say if generic_make_request is currently active in this
>  	 * task or not.  If it is NULL, then no make_request is active.  If
>  	 * it is non-NULL, then a make_request is active, and new requests
> -	 * should be added at the tail
> +	 * should be added at the tail of current->bio_lists->recursion;
> +	 * bios resulting from a call to blk_queue_split() from
> +	 * within ->make_request_fn() should be pushed back to the head of
> +	 * current->bio_lists->queue.
> +	 * After the current ->make_request_fn() returns, .recursion will be
> +	 * merged back to the head of .queue.
>  	 */
> -	if (current->bio_list) {
> -		bio_list_add(current->bio_list, bio);
> +	if (current->bio_lists) {
> +		bio_list_add(&current->bio_lists->recursion, bio);
>  		goto out;
>  	}
>  
> @@ -2066,35 +2071,33 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * Before entering the loop, bio->bi_next is NULL (as all callers
>  	 * ensure that) so we have a list with a single bio.
>  	 * We pretend that we have just taken it off a longer list, so
> -	 * we assign bio_list to a pointer to the bio_list_on_stack,
> -	 * thus initialising the bio_list of new bios to be
> -	 * added.  ->make_request() may indeed add some more bios
> -	 * through a recursive call to generic_make_request.  If it
> -	 * did, we find a non-NULL value in bio_list and re-enter the loop
> -	 * from the top.  In this case we really did just take the bio
> -	 * of the top of the list (no pretending) and so remove it from
> -	 * bio_list, and call into ->make_request() again.
> +	 * we assign bio_list to a pointer to the bio_lists_on_stack,
> +	 * thus initialising the bio_lists of new bios to be added.
> +	 * ->make_request() may indeed add some more bios to .recursion
> +	 * through a recursive call to generic_make_request.  If it did,
> +	 * we find a non-NULL value in .recursion, merge .recursion back to the
> +	 * head of .queue, and re-enter the loop from the top.  In this case we
> +	 * really did just take the bio of the top of the list (no pretending)
> +	 * and so remove it from .queue, and call into ->make_request() again.
>  	 */
>  	BUG_ON(bio->bi_next);
> -	bio_list_init(&bio_list_on_stack);
> -	current->bio_list = &bio_list_on_stack;
> +	bio_list_init(&bio_lists_on_stack.queue);
> +	current->bio_lists = &bio_lists_on_stack;
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, false) == 0)) {
> +			bio_list_init(&bio_lists_on_stack.recursion);
>  			ret = q->make_request_fn(q, bio);
> -
>  			blk_queue_exit(q);
> -
> -			bio = bio_list_pop(current->bio_list);
> +			bio_list_merge_head(&bio_lists_on_stack.queue,
> +					&bio_lists_on_stack.recursion);
>  		} else {
> -			struct bio *bio_next = bio_list_pop(current->bio_list);
> -
>  			bio_io_error(bio);
> -			bio = bio_next;
>  		}
> +		bio = bio_list_pop(&current->bio_lists->queue);
>  	} while (bio);
> -	current->bio_list = NULL; /* deactivate */
> +	current->bio_lists = NULL; /* deactivate */
>  
>  out:
>  	return ret;
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index c265348..df96327 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -172,6 +172,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  	struct bio *split, *res;
>  	unsigned nsegs;
>  
> +	BUG_ON(!current->bio_lists);
>  	if (bio_op(*bio) == REQ_OP_DISCARD)
>  		split = blk_bio_discard_split(q, *bio, bs, &nsegs);
>  	else if (bio_op(*bio) == REQ_OP_WRITE_SAME)
> @@ -190,7 +191,9 @@ void blk_queue_split(struct request_queue *q, struct bio **bio,
>  
>  		bio_chain(split, *bio);
>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
> -		generic_make_request(*bio);
> +		/* push back remainder, it may later be split further */
> +		bio_list_add_head(&current->bio_lists->queue, *bio);
> +		/* and fake submission of a suitably sized piece */
>  		*bio = split;
>  	}
>  }
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> index 76f7534..731ec3b 100644
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -450,7 +450,7 @@ void __bch_btree_node_write(struct btree *b, struct closure *parent)
>  
>  	trace_bcache_btree_write(b);
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(b->written >= btree_blocks(b));
>  	BUG_ON(b->written && !i->keys);
>  	BUG_ON(btree_bset_first(b)->seq != i->seq);
> @@ -544,7 +544,7 @@ static void bch_btree_leaf_dirty(struct btree *b, atomic_t *journal_ref)
>  
>  	/* Force write if set is too big */
>  	if (set_bytes(i) > PAGE_SIZE - 48 &&
> -	    !current->bio_list)
> +	    !current->bio_lists)
>  		bch_btree_node_write(b, NULL);
>  }
>  
> @@ -889,7 +889,7 @@ static struct btree *mca_alloc(struct cache_set *c, struct btree_op *op,
>  {
>  	struct btree *b;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  
>  	lockdep_assert_held(&c->bucket_lock);
>  
> @@ -976,7 +976,7 @@ retry:
>  	b = mca_find(c, k);
>  
>  	if (!b) {
> -		if (current->bio_list)
> +		if (current->bio_lists)
>  			return ERR_PTR(-EAGAIN);
>  
>  		mutex_lock(&c->bucket_lock);
> @@ -2127,7 +2127,7 @@ static int bch_btree_insert_node(struct btree *b, struct btree_op *op,
>  
>  	return 0;
>  split:
> -	if (current->bio_list) {
> +	if (current->bio_lists) {
>  		op->lock = b->c->root->level + 1;
>  		return -EAGAIN;
>  	} else if (op->lock <= b->c->root->level) {
> @@ -2209,7 +2209,7 @@ int bch_btree_insert(struct cache_set *c, struct keylist *keys,
>  	struct btree_insert_op op;
>  	int ret = 0;
>  
> -	BUG_ON(current->bio_list);
> +	BUG_ON(current->bio_lists);
>  	BUG_ON(bch_keylist_empty(keys));
>  
>  	bch_btree_op_init(&op.op, 0);
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6571c81..ba0c325 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -174,7 +174,7 @@ static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
>  #define DM_BUFIO_CACHE(c)	(dm_bufio_caches[dm_bufio_cache_index(c)])
>  #define DM_BUFIO_CACHE_NAME(c)	(dm_bufio_cache_names[dm_bufio_cache_index(c)])
>  
> -#define dm_bufio_in_request()	(!!current->bio_list)
> +#define dm_bufio_in_request()	(!!current->bio_lists)
>  
>  static void dm_bufio_lock(struct dm_bufio_client *c)
>  {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 10e53cd..38790e3 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -876,8 +876,7 @@ static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>  				    (!conf->barrier ||
>  				     ((conf->start_next_window <
>  				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> +				      current_has_pending_bios())),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1014,7 +1013,7 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r1conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 245640b..13a5341 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -945,8 +945,7 @@ static void wait_barrier(struct r10conf *conf)
>  		wait_event_lock_irq(conf->wait_barrier,
>  				    !conf->barrier ||
>  				    (conf->nr_pending &&
> -				     current->bio_list &&
> -				     !bio_list_empty(current->bio_list)),
> +				     current_has_pending_bios()),
>  				    conf->resync_lock);
>  		conf->nr_waiting--;
>  	}
> @@ -1022,7 +1021,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	struct r10conf *conf = mddev->private;
>  	struct bio *bio;
>  
> -	if (from_schedule || current->bio_list) {
> +	if (from_schedule || current->bio_lists) {
>  		spin_lock_irq(&conf->device_lock);
>  		bio_list_merge(&conf->pending_bio_list, &plug->pending);
>  		conf->pending_count += plug->pending_cnt;
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index b7e1a008..2f8a361 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -541,6 +541,24 @@ struct bio_list {
>  	struct bio *tail;
>  };
>  
> +/* for generic_make_request() */
> +struct recursion_to_iteration_bio_lists {
> +	/* For stacking drivers submitting to their respective backend,
> +	 * bios are added to the tail of .recursion, which is re-initialized
> +	 * before each call to ->make_request_fn() and after that returns,
> +	 * the whole .recursion queue is then merged back to the head of .queue.
> +	 *
> +	 * The recursion-to-iteration logic in generic_make_request() will
> +	 * peel off of .queue.head, processing bios in deepest-level-first
> +	 * "natural" order. */
> +	struct bio_list recursion;
> +
> +	/* This keeps a list of to-be-processed bios.
> +	 * The "remainder" part resulting from calling blk_queue_split()
> +	 * will be pushed back to its head. */
> +	struct bio_list queue;
> +};
> +
>  static inline int bio_list_empty(const struct bio_list *bl)
>  {
>  	return bl->head == NULL;
> @@ -551,6 +569,13 @@ static inline void bio_list_init(struct bio_list *bl)
>  	bl->head = bl->tail = NULL;
>  }
>  
> +static inline bool current_has_pending_bios(void)
> +{
> +	return current->bio_lists &&
> +		(!bio_list_empty(&current->bio_lists->queue) ||
> +		 !bio_list_empty(&current->bio_lists->recursion));
> +}
> +
>  #define BIO_EMPTY_LIST	{ NULL, NULL }
>  
>  #define bio_list_for_each(bio, bl) \
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6e42ada..146eedc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -128,7 +128,7 @@ struct sched_attr {
>  
>  struct futex_pi_state;
>  struct robust_list_head;
> -struct bio_list;
> +struct recursion_to_iteration_bio_lists;
>  struct fs_struct;
>  struct perf_event_context;
>  struct blk_plug;
> @@ -1727,7 +1727,7 @@ struct task_struct {
>  	void *journal_info;
>  
>  /* stacked block device info */
> -	struct bio_list *bio_list;
> +	struct recursion_to_iteration_bio_lists *bio_lists;
>  
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread
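
To make the ordering produced by the two-list scheme above easier to
follow, here is a minimal user-space sketch of the same idea.  It is
purely illustrative: the bio/bio_list types and all helpers are
simplified stand-ins, not the kernel definitions.  Each call to the
fake make_request function queues its "backend bios" on .recursion,
which is then merged back to the head of .queue, exactly as described
in the comment above:

/* Toy model of generic_make_request()'s recursion-to-iteration loop.
 * Build with: cc -std=c99 toy.c */
#include <stdio.h>
#include <stdlib.h>

struct bio { int level; struct bio *next; };
struct bio_list { struct bio *head, *tail; };

static void bl_init(struct bio_list *bl) { bl->head = bl->tail = NULL; }

static void bl_add(struct bio_list *bl, struct bio *b)	/* add at tail */
{
	b->next = NULL;
	if (bl->tail)
		bl->tail->next = b;
	else
		bl->head = b;
	bl->tail = b;
}

static struct bio *bl_pop(struct bio_list *bl)
{
	struct bio *b = bl->head;
	if (b) {
		bl->head = b->next;
		if (!bl->head)
			bl->tail = NULL;
	}
	return b;
}

/* Splice src in front of dst, as generic_make_request() does with
 * .recursion after each ->make_request_fn() call. */
static void bl_merge_head(struct bio_list *dst, struct bio_list *src)
{
	if (!src->head)
		return;
	if (dst->head)
		src->tail->next = dst->head;
	else
		dst->tail = src->tail;
	dst->head = src->head;
	bl_init(src);
}

static struct bio_list queue, recursion;

static void fake_make_request(struct bio *b)	/* fake stacking driver */
{
	printf("dispatch level %d\n", b->level);
	if (b->level < 3) {	/* submit two clones to the "backend" */
		for (int i = 0; i < 2; i++) {
			struct bio *c = malloc(sizeof(*c));
			c->level = b->level + 1;
			bl_add(&recursion, c);	/* queued, not recursed */
		}
	}
	free(b);
}

int main(void)
{
	struct bio *b = malloc(sizeof(*b));
	b->level = 0;
	bl_init(&queue);
	do {	/* the iteration in generic_make_request() */
		bl_init(&recursion);
		fake_make_request(b);
		bl_merge_head(&queue, &recursion);
	} while ((b = bl_pop(&queue)));
	return 0;
}

Running it prints the levels in depth-first order (0 1 2 3 3 2 3 3 1 ...),
i.e. the "deepest-level-first" natural order the patch aims for.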

* Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-12-23 11:45       ` Lars Ellenberg
@ 2017-01-02 14:33         ` Jack Wang
  2017-01-04  5:12           ` NeilBrown
  0 siblings, 1 reply; 23+ messages in thread
From: Jack Wang @ 2017-01-02 14:33 UTC (permalink / raw)
  To: Lars Ellenberg
  Cc: Michael Wang, Jens Axboe, linux-block, Martin K. Petersen,
	Mike Snitzer, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, NeilBrown, linux-kernel, linux-raid,
	Takashi Iwai, linux-bcache, Zheng Liu, Kent Overstreet,
	Keith Busch, device-mapper development, Shaohua Li, Ingo Molnar,
	Alasdair Kergon, Roland Kammerer

[-- Attachment #1: Type: text/plain, Size: 1285 bytes --]

2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> Dear Maintainers
>>
>> I'd like to ask for the status of this patch since we hit the
>> issue too during our testing on md raid1.
>>
>> The split remainder bio_A was queued ahead, followed by bio_B for the
>> lower device. At that moment the raid started freezing; the loop took
>> out bio_A first and delivered it, which hung because the raid was
>> freezing, while the freezing never ended because it was waiting for
>> bio_B to finish, and bio_B was still on the queue, waiting for
>> bio_A to finish...
>>
>> We're looking for a good solution and found that this patch had
>> already progressed a lot, but we can't find it in linux-next,
>> so we'd like to ask whether this fix is still planned
>> for upstream?
>
> I don't see why not, I'd even like to have it in older kernels,
> but did not have the time and energy to push it.
>
> Thanks for the bump.
>
>         Lars
>
Hi folks,

As Michael mentioned, we hit a bug that this patch is trying to fix.
Neil suggested another way to fix it; I attached it below.
I personally prefer Neil's version as it's a smaller code change and straightforward.

Could you share your comments, so we can get one fix into mainline.

Thanks,
Jinpu

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2366 bytes --]

From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the freeze_array call from
completing. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
still were not submitted to the lower level, due to the
generic_make_request logic that avoids recursive calls. In such a
case, we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never complete.

To fix this, modify generic_make_request to always sort bio_list_on_stack
so that bios for lower-level devices are handled first, then bios for
the same level.

Sent to the linux-raid mailing list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9e3ac56..47ef373 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
 
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4
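
To see what the lower/same/hold reordering above does in isolation,
here is a small self-contained user-space sketch.  All types and
helpers are simplified stand-ins for the kernel's bio_list API, and
the integer qid stands in for the request_queue a bio is destined for:

/* Toy model of the sorting step in the patch above.
 * Build with: cc -std=c99 sort.c */
#include <stdio.h>

struct bio { int qid; struct bio *next; };
struct bio_list { struct bio *head, *tail; };

static void bl_init(struct bio_list *l) { l->head = l->tail = NULL; }

static void bl_add(struct bio_list *l, struct bio *b)	/* add at tail */
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct bio *bl_pop(struct bio_list *l)
{
	struct bio *b = l->head;
	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
	}
	return b;
}

static void bl_merge(struct bio_list *dst, struct bio_list *src)	/* append */
{
	if (!src->head)
		return;
	if (dst->tail)
		dst->tail->next = src->head;
	else
		dst->head = src->head;
	dst->tail = src->tail;
	bl_init(src);
}

/* After ->make_request_fn() returns: bios for lower-level queues first,
 * bios for the same queue next, previously held bios last. */
static void reorder(struct bio_list *on_stack, struct bio_list *hold, int cur_q)
{
	struct bio_list lower, same;
	struct bio *b;

	bl_init(&lower);
	bl_init(&same);
	while ((b = bl_pop(on_stack)) != NULL)
		bl_add(b->qid == cur_q ? &same : &lower, b);
	bl_merge(on_stack, &lower);
	bl_merge(on_stack, &same);
	bl_merge(on_stack, hold);
}

int main(void)
{
	struct bio_list on_stack, hold;
	struct bio same_lvl = { .qid = 1 }, lower_lvl = { .qid = 2 },
		   held = { .qid = 1 };

	bl_init(&on_stack);
	bl_init(&hold);
	bl_add(&on_stack, &same_lvl);	/* same level as cur_q == 1 */
	bl_add(&on_stack, &lower_lvl);	/* lower-level device */
	bl_add(&hold, &held);		/* was pending before this call */
	reorder(&on_stack, &hold, 1);
	for (struct bio *b = bl_pop(&on_stack); b; b = bl_pop(&on_stack))
		printf("qid %d\n", b->qid);	/* prints 2, then 1, then 1 */
	return 0;
}

This makes it easy to check that a split remainder (queued for the same
level) can no longer be dispatched ahead of the bios it depends on at a
lower level.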


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [dm-devel] [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-02 14:33         ` [dm-devel] " Jack Wang
@ 2017-01-04  5:12           ` NeilBrown
  2017-01-04 18:50             ` Mike Snitzer
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2017-01-04  5:12 UTC (permalink / raw)
  To: Jack Wang, Lars Ellenberg
  Cc: Michael Wang, Jens Axboe, linux-block, Martin K. Petersen,
	Mike Snitzer, Peter Zijlstra, Jiri Kosina, Ming Lei,
	Kirill A. Shutemov, linux-kernel, linux-raid, Takashi Iwai,
	linux-bcache, Zheng Liu, Kent Overstreet, Keith Busch,
	device-mapper development, Shaohua Li, Ingo Molnar,
	Alasdair Kergon, Roland Kammerer

[-- Attachment #1: Type: text/plain, Size: 4832 bytes --]

On Tue, Jan 03 2017, Jack Wang wrote:

> 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
>> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>>> Dear Maintainers
>>>
>>> I'd like to ask for the status of this patch since we hit the
>>> issue too during our testing on md raid1.
>>>
>>> The split remainder bio_A was queued ahead, followed by bio_B for the
>>> lower device. At that moment the raid started freezing; the loop took
>>> out bio_A first and delivered it, which hung because the raid was
>>> freezing, while the freezing never ended because it was waiting for
>>> bio_B to finish, and bio_B was still on the queue, waiting for
>>> bio_A to finish...
>>>
>>> We're looking for a good solution and found that this patch had
>>> already progressed a lot, but we can't find it in linux-next,
>>> so we'd like to ask whether this fix is still planned
>>> for upstream?
>>
>> I don't see why not, I'd even like to have it in older kernels,
>> but did not have the time and energy to push it.
>>
>> Thanks for the bump.
>>
>>         Lars
>>
> Hi folks,
>
> As Michael mentioned, we hit a bug that this patch is trying to fix.
> Neil suggested another way to fix it; I attached it below.
> I personally prefer Neil's version as it's a smaller code change and straightforward.
>
> Could you share your comments, so we can get one fix into mainline.
>
> Thanks,
> Jinpu
> From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.com>
> Date: Wed, 14 Dec 2016 16:55:52 +0100
> Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>
> When we call wait_barrier, we might have some bios waiting
> in current->bio_list, which prevents the freeze_array call from
> completing. Those can only be internal READs, which have already
> passed the wait_barrier call (thus incrementing nr_pending), but
> still were not submitted to the lower level, due to the
> generic_make_request logic that avoids recursive calls. In such a
> case, we have a deadlock:
> - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
> - internal READ bios will not be submitted, thus freeze_array will
> never complete.
>
> To fix this, modify generic_make_request to always sort bio_list_on_stack
> so that bios for lower-level devices are handled first, then bios for
> the same level.
>
> Sent to the linux-raid mailing list:
> https://marc.info/?l=linux-raid&m=148232453107685&w=2
>

This should probably also have

  Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>

or something like that, as I was building on Lars' ideas when I wrote this.

It would also be worth noting in the description that this addresses
issues with dm and drbd as well as md.

In fact, I think that with this patch in place, much of the need for the
rescue_workqueue won't exist any more.  I cannot promise it can be
removed completely, but it should not be too hard to make it optional and
only enabled for those few block devices that will still need it.
The rescuer should only be needed for a bioset which can be allocated
from twice in the one call to ->make_request_fn.  This would include
raid0 for example, though raid0_make_request could be re-written to not
use a loop and to just call generic_make_request(bio) if bio != split.

Thanks,
NeilBrown


> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> ---
>  block/blk-core.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9e3ac56..47ef373 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
>  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> +			struct bio_list lower, same, hold;
> +
> +			/* Create a fresh bio_list for all subordinate requests */
> +			bio_list_init(&hold);
> +			bio_list_merge(&hold, &bio_list_on_stack);
> +			bio_list_init(&bio_list_on_stack);
>  
>  			ret = q->make_request_fn(q, bio);
>  
>  			blk_queue_exit(q);
> +			/* sort new bios into those for a lower level
> +			 * and those for the same level
> +			 */
> +			bio_list_init(&lower);
> +			bio_list_init(&same);
> +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> +				if (q == bdev_get_queue(bio->bi_bdev))
> +					bio_list_add(&same, bio);
> +				else
> +					bio_list_add(&lower, bio);
> +			/* now assemble so we handle the lowest level first */
> +			bio_list_merge(&bio_list_on_stack, &lower);
> +			bio_list_merge(&bio_list_on_stack, &same);
> +			bio_list_merge(&bio_list_on_stack, &hold);
>  
>  			bio = bio_list_pop(current->bio_list);
>  		} else {
> -- 
> 2.7.4

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-04  5:12           ` NeilBrown
@ 2017-01-04 18:50             ` Mike Snitzer
  2017-01-05 10:54               ` 王金浦
  2017-01-06 16:50               ` Mikulas Patocka
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Snitzer @ 2017-01-04 18:50 UTC (permalink / raw)
  To: NeilBrown, Mikulas Patocka
  Cc: Jack Wang, Lars Ellenberg, Jens Axboe, linux-raid, Michael Wang,
	Peter Zijlstra, Jiri Kosina, Ming Lei, linux-kernel, Zheng Liu,
	linux-block, Takashi Iwai, linux-bcache, Ingo Molnar,
	Alasdair Kergon, Martin K. Petersen, Keith Busch,
	device-mapper development, Shaohua Li, Kent Overstreet,
	Kirill A. Shutemov, Roland Kammerer

On Wed, Jan 04 2017 at 12:12am -0500,
NeilBrown <neilb@suse.com> wrote:

> On Tue, Jan 03 2017, Jack Wang wrote:
> 
> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> >>> Dear Maintainers
> >>>
> >>> I'd like to ask for the status of this patch since we hit the
> >>> issue too during our testing on md raid1.
> >>>
> >>> The split remainder bio_A was queued ahead, followed by bio_B for the
> >>> lower device. At that moment the raid started freezing; the loop took
> >>> out bio_A first and delivered it, which hung because the raid was
> >>> freezing, while the freezing never ended because it was waiting for
> >>> bio_B to finish, and bio_B was still on the queue, waiting for
> >>> bio_A to finish...
> >>>
> >>> We're looking for a good solution and found that this patch had
> >>> already progressed a lot, but we can't find it in linux-next,
> >>> so we'd like to ask whether this fix is still planned
> >>> for upstream?
> >>
> >> I don't see why not, I'd even like to have it in older kernels,
> >> but did not have the time and energy to push it.
> >>
> >> Thanks for the bump.
> >>
> >>         Lars
> >>
> > Hi folks,
> >
> > As Michael mentioned, we hit a bug that this patch is trying to fix.
> > Neil suggested another way to fix it; I attached it below.
> > I personally prefer Neil's version as it's a smaller code change and straightforward.
> >
> > Could you share your comments, so we can get one fix into mainline.
> >
> > Thanks,
> > Jinpu
> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.com>
> > Date: Wed, 14 Dec 2016 16:55:52 +0100
> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
> >
> > When we call wait_barrier, we might have some bios waiting
> > in current->bio_list, which prevents the freeze_array call from
> > completing. Those can only be internal READs, which have already
> > passed the wait_barrier call (thus incrementing nr_pending), but
> > still were not submitted to the lower level, due to the
> > generic_make_request logic that avoids recursive calls. In such a
> > case, we have a deadlock:
> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
> > - internal READ bios will not be submitted, thus freeze_array will
> > never complete.
> >
> > To fix this, modify generic_make_request to always sort bio_list_on_stack
> > so that bios for lower-level devices are handled first, then bios for
> > the same level.
> >
> > Sent to the linux-raid mailing list:
> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
> >
> 
> This should probably also have
> 
>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> 
> or something like that, as I was building on Lars' ideas when I wrote this.
> 
> It would also be worth noting in the description that this addresses
> issues with dm and drbd as well as md.

I never saw this patch but certainly like the relative simplicity of the
solution when compared with other approaches taken, e.g. (5 topmost
commits on this branch):
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

> In fact, I think that with this patch in place, much of the need for the
> rescue_workqueue won't exist any more.  I cannot promise it can be
> removed completely, but it should not be too hard to make it optional and
> only enabled for those few block devices that will still need it.
> The rescuer should only be needed for a bioset which can be allocated
> from twice in the one call to ->make_request_fn.  This would include
> raid0 for example, though raid0_make_request could be re-written to not
> use a loop and to just call generic_make_request(bio) if bio != split.

Mikulas, would you be willing to try the below patch with the
dm-snapshot deadlock scenario and report back on whether it fixes that?

Patch below looks to be the same as here:
https://marc.info/?l=linux-raid&m=148232453107685&q=p3

Neil and/or others if that isn't the patch that should be tested please
provide a pointer to the latest.

Thanks,
Mike

> > Suggested-by: NeilBrown <neilb@suse.com>
> > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> > ---
> >  block/blk-core.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 9e3ac56..47ef373 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
> >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> >  
> >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> > +			struct bio_list lower, same, hold;
> > +
> > +			/* Create a fresh bio_list for all subordinate requests */
> > +			bio_list_init(&hold);
> > +			bio_list_merge(&hold, &bio_list_on_stack);
> > +			bio_list_init(&bio_list_on_stack);
> >  
> >  			ret = q->make_request_fn(q, bio);
> >  
> >  			blk_queue_exit(q);
> > +			/* sort new bios into those for a lower level
> > +			 * and those for the same level
> > +			 */
> > +			bio_list_init(&lower);
> > +			bio_list_init(&same);
> > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> > +				if (q == bdev_get_queue(bio->bi_bdev))
> > +					bio_list_add(&same, bio);
> > +				else
> > +					bio_list_add(&lower, bio);
> > +			/* now assemble so we handle the lowest level first */
> > +			bio_list_merge(&bio_list_on_stack, &lower);
> > +			bio_list_merge(&bio_list_on_stack, &same);
> > +			bio_list_merge(&bio_list_on_stack, &hold);
> >  
> >  			bio = bio_list_pop(current->bio_list);
> >  		} else {
> > -- 
> > 2.7.4

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-04 18:50             ` Mike Snitzer
@ 2017-01-05 10:54               ` 王金浦
  2017-01-06 16:50               ` Mikulas Patocka
  1 sibling, 0 replies; 23+ messages in thread
From: 王金浦 @ 2017-01-05 10:54 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Mikulas Patocka, Jack Wang, Lars Ellenberg,
	Jens Axboe, linux-raid, Michael Wang, Peter Zijlstra,
	Jiri Kosina, Ming Lei, LKML, Zheng Liu, linux-block,
	Takashi Iwai, linux-bcache, Ingo Molnar, Alasdair Kergon,
	Martin K. Petersen, Keith Busch, device-mapper development,
	Shaohua Li, Kent Overstreet, Kirill A. Shutemov, Roland Kammerer,
	Jinpu Wang

[-- Attachment #1: Type: text/plain, Size: 4392 bytes --]

2017-01-04 19:50 GMT+01:00 Mike Snitzer <snitzer@redhat.com>:
> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown <neilb@suse.com> wrote:
>
>> On Tue, Jan 03 2017, Jack Wang wrote:
>>
>> > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
>> >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
>> >>> Dear Maintainers
>> >>>
>> >>> I'd like to ask for the status of this patch since we hit the
>> >>> issue too during our testing on md raid1.
>> >>>
>> >>> The split remainder bio_A was queued ahead, followed by bio_B for the
>> >>> lower device. At that moment the raid started freezing; the loop took
>> >>> out bio_A first and delivered it, which hung because the raid was
>> >>> freezing, while the freezing never ended because it was waiting for
>> >>> bio_B to finish, and bio_B was still on the queue, waiting for
>> >>> bio_A to finish...
>> >>>
>> >>> We're looking for a good solution and found that this patch had
>> >>> already progressed a lot, but we can't find it in linux-next,
>> >>> so we'd like to ask whether this fix is still planned
>> >>> for upstream?
>> >>
>> >> I don't see why not, I'd even like to have it in older kernels,
>> >> but did not have the time and energy to push it.
>> >>
>> >> Thanks for the bump.
>> >>
>> >>         Lars
>> >>
>> > Hi folks,
>> >
>> > As Michael mentioned, we hit a bug that this patch is trying to fix.
>> > Neil suggested another way to fix it; I attached it below.
>> > I personally prefer Neil's version as it's a smaller code change and straightforward.
>> >
>> > Could you share your comments, so we can get one fix into mainline.
>> >
>> > Thanks,
>> > Jinpu
>> > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
>> > From: NeilBrown <neilb@suse.com>
>> > Date: Wed, 14 Dec 2016 16:55:52 +0100
>> > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
>> >
>> > When we call wait_barrier, we might have some bios waiting
>> > in current->bio_list, which prevents the freeze_array call from
>> > completing. Those can only be internal READs, which have already
>> > passed the wait_barrier call (thus incrementing nr_pending), but
>> > still were not submitted to the lower level, due to the
>> > generic_make_request logic that avoids recursive calls. In such a
>> > case, we have a deadlock:
>> > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
>> > - internal READ bios will not be submitted, thus freeze_array will
>> > never complete.
>> >
>> > To fix this, modify generic_make_request to always sort bio_list_on_stack
>> > so that bios for lower-level devices are handled first, then bios for
>> > the same level.
>> >
>> > Sent to the linux-raid mailing list:
>> > https://marc.info/?l=linux-raid&m=148232453107685&w=2
>> >
>>
>> This should probably also have
>>
>>   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
>>
>> or something like that, as I was building on Lars' ideas when I wrote this.
>>
>> It would also be worth noting in the description that this addresses
>> issues with dm and drbd as well as md.
>
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
>
>> In fact, I think that with this patch in place, much of the need for the
>> rescue_workqueue won't exist any more.  I cannot promise it can be
>> removed completely, but it should not be too hard to make it optional and
>> only enabled for those few block devices that will still need it.
>> The rescuer should only be needed for a bioset which can be allocated
>> from twice in the one call to ->make_request_fn.  This would include
>> raid0 for example, though raid0_make_request could be re-written to not
>> use a loop and to just call generic_make_request(bio) if bio != split.
>
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
>
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>
> Neil and/or others if that isn't the patch that should be tested please
> provide a pointer to the latest.
>
> Thanks,
> Mike

Thanks Mike,

I've rebased the patch onto Linux-4.10-rc2 and updated the
description as Neil suggested.
If Mikulas gets positive feedback, then we can go with it.

Cheers,
Jinpu

[-- Attachment #2: 0001-block-fix-deadlock-between-freeze_array-and-wait_bar.patch --]
[-- Type: text/x-patch, Size: 2467 bytes --]

From 4ffaefb719c129ed51f9fcb235b945caf56de8d1 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.com>
Date: Wed, 14 Dec 2016 16:55:52 +0100
Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()

When we call wait_barrier, we might have some bios waiting
in current->bio_list, which prevents the freeze_array call from
completing. Those can only be internal READs, which have already
passed the wait_barrier call (thus incrementing nr_pending), but
still were not submitted to the lower level, due to the
generic_make_request logic that avoids recursive calls. In such a
case, we have a deadlock:
- array_frozen is already set to 1, so wait_barrier unconditionally waits, so
- internal READ bios will not be submitted, thus freeze_array will
never complete.

To fix this, modify generic_make_request to always sort bio_list_on_stack
so that bios for lower-level devices are handled first, then bios for
the same level.

This would address issues with dm and drbd as well as md.

Sent to the linux-raid mailing list:
https://marc.info/?l=linux-raid&m=148232453107685&w=2

Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
---
 block/blk-core.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c..2f74129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2019,9 +2019,30 @@ blk_qc_t generic_make_request(struct bio *bio)
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
 		if (likely(blk_queue_enter(q, false) == 0)) {
+			struct bio_list lower, same, hold;
+
+			/* Create a fresh bio_list for all subordinate requests */
+			bio_list_init(&hold);
+			bio_list_merge(&hold, &bio_list_on_stack);
+			bio_list_init(&bio_list_on_stack);
+
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
+			/* sort new bios into those for a lower level
+			 * and those for the same level
+			 */
+			bio_list_init(&lower);
+			bio_list_init(&same);
+			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
+				if (q == bdev_get_queue(bio->bi_bdev))
+					bio_list_add(&same, bio);
+				else
+					bio_list_add(&lower, bio);
+			/* now assemble so we handle the lowest level first */
+			bio_list_merge(&bio_list_on_stack, &lower);
+			bio_list_merge(&bio_list_on_stack, &same);
+			bio_list_merge(&bio_list_on_stack, &hold);
 
 			bio = bio_list_pop(current->bio_list);
 		} else {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-04 18:50             ` Mike Snitzer
  2017-01-05 10:54               ` 王金浦
@ 2017-01-06 16:50               ` Mikulas Patocka
  2017-01-06 17:34                 ` Mikulas Patocka
  1 sibling, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2017-01-06 16:50 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jack Wang, Lars Ellenberg, Jens Axboe, linux-raid,
	Michael Wang, Peter Zijlstra, Jiri Kosina, Ming Lei,
	linux-kernel, Zheng Liu, linux-block, Takashi Iwai, linux-bcache,
	Ingo Molnar, Alasdair Kergon, Martin K. Petersen, Keith Busch,
	device-mapper development, Shaohua Li, Kent Overstreet,
	Kirill A. Shutemov, Roland Kammerer



On Wed, 4 Jan 2017, Mike Snitzer wrote:

> On Wed, Jan 04 2017 at 12:12am -0500,
> NeilBrown <neilb@suse.com> wrote:
> 
> > On Tue, Jan 03 2017, Jack Wang wrote:
> > 
> > > 2016-12-23 12:45 GMT+01:00 Lars Ellenberg <lars.ellenberg@linbit.com>:
> > >> On Fri, Dec 23, 2016 at 09:49:53AM +0100, Michael Wang wrote:
> > >>> Dear Maintainers
> > >>>
> > >>> I'd like to ask for the status of this patch since we hit the
> > >>> issue too during our testing on md raid1.
> > >>>
> > >>> The split remainder bio_A was queued ahead, followed by bio_B for the
> > >>> lower device. At that moment the raid started freezing; the loop took
> > >>> out bio_A first and delivered it, which hung because the raid was
> > >>> freezing, while the freezing never ended because it was waiting for
> > >>> bio_B to finish, and bio_B was still on the queue, waiting for
> > >>> bio_A to finish...
> > >>>
> > >>> We're looking for a good solution and found that this patch had
> > >>> already progressed a lot, but we can't find it in linux-next,
> > >>> so we'd like to ask whether this fix is still planned
> > >>> for upstream?
> > >>
> > >> I don't see why not, I'd even like to have it in older kernels,
> > >> but did not have the time and energy to push it.
> > >>
> > >> Thanks for the bump.
> > >>
> > >>         Lars
> > >>
> > > Hi folks,
> > >
> > > As Michael mentioned, we hit a bug that this patch is trying to fix.
> > > Neil suggested another way to fix it; I attached it below.
> > > I personally prefer Neil's version as it's a smaller code change and straightforward.
> > >
> > > Could you share your comments, so we can get one fix into mainline.
> > >
> > > Thanks,
> > > Jinpu
> > > From 69a4829a55503e496ce9c730d2c8e3dd8a08874a Mon Sep 17 00:00:00 2001
> > > From: NeilBrown <neilb@suse.com>
> > > Date: Wed, 14 Dec 2016 16:55:52 +0100
> > > Subject: [PATCH] block: fix deadlock between freeze_array() and wait_barrier()
> > >
> > > When we call wait_barrier, we might have some bios waiting
> > > in current->bio_list, which prevents the freeze_array call from
> > > completing. Those can only be internal READs, which have already
> > > passed the wait_barrier call (thus incrementing nr_pending), but
> > > still were not submitted to the lower level, due to the
> > > generic_make_request logic that avoids recursive calls. In such a
> > > case, we have a deadlock:
> > > - array_frozen is already set to 1, so wait_barrier unconditionally waits, so
> > > - internal READ bios will not be submitted, thus freeze_array will
> > > never complete.
> > >
> > > To fix this, modify generic_make_request to always sort bio_list_on_stack
> > > so that bios for lower-level devices are handled first, then bios for
> > > the same level.
> > >
> > > Sent to the linux-raid mailing list:
> > > https://marc.info/?l=linux-raid&m=148232453107685&w=2
> > >
> > 
> > This should probably also have
> > 
> >   Inspired-by: Lars Ellenberg <lars.ellenberg@linbit.com>
> > 
> > or something like that, as I was building on Lars' ideas when I wrote this.
> > 
> > It would also be worth noting in the description that this addresses
> > issues with dm and drbd as well as md.
> 
> I never saw this patch but certainly like the relative simplicity of the
> solution when compared with other approaches taken, e.g. (5 topmost
> commits on this branch):
> http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip
> 
> > In fact, I think that with this patch in place, much of the need for the
> > rescue_workqueue won't exist any more.  I cannot promise it can be
> > removed completely, but it should not be too hard to make it optional and
> > only enabled for those few block devices that will still need it.
> > The rescuer should only be needed for a bioset which can be allocated
> > from twice in the one call to ->make_request_fn.  This would include
> > raid0 for example, though raid0_make_request could be re-written to not
> > use a loop and to just call generic_make_request(bio) if bio != split.
> 
> > > Suggested-by: NeilBrown <neilb@suse.com>
> > > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> > > ---
> > >  block/blk-core.c | 20 ++++++++++++++++++++
> > >  1 file changed, 20 insertions(+)
> > >
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 9e3ac56..47ef373 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
> > >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > >  
> > >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> > > +			struct bio_list lower, same, hold;
> > > +
> > > +			/* Create a fresh bio_list for all subordinate requests */
> > > +			bio_list_init(&hold);
> > > +			bio_list_merge(&hold, &bio_list_on_stack);
> > > +			bio_list_init(&bio_list_on_stack);
> > >  
> > >  			ret = q->make_request_fn(q, bio);
> > >  
> > >  			blk_queue_exit(q);
> > > +			/* sort new bios into those for a lower level
> > > +			 * and those for the same level
> > > +			 */
> > > +			bio_list_init(&lower);
> > > +			bio_list_init(&same);
> > > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> > > +				if (q == bdev_get_queue(bio->bi_bdev))
> > > +					bio_list_add(&same, bio);
> > > +				else
> > > +					bio_list_add(&lower, bio);
> > > +			/* now assemble so we handle the lowest level first */
> > > +			bio_list_merge(&bio_list_on_stack, &lower);
> > > +			bio_list_merge(&bio_list_on_stack, &same);
> > > +			bio_list_merge(&bio_list_on_stack, &hold);
> > >  
> > >  			bio = bio_list_pop(current->bio_list);
> > >  		} else {
> > > -- 
> > > 2.7.4
> 
> Mikulas, would you be willing to try the below patch with the
> dm-snapshot deadlock scenario and report back on whether it fixes that?
> 
> Patch below looks to be the same as here:
> https://marc.info/?l=linux-raid&m=148232453107685&q=p3
> 
> Neil and/or others if that isn't the patch that should be tested please
> provide a pointer to the latest.
> 
> Thanks,
> Mike

The bad news is that this doesn't fix the snapshot deadlock.

I created a test program for the snapshot deadlock bug (it was originally
created years ago to test for a different bug, so it contains some cruft).
You also need to insert "if (ci->sector_count) msleep(100);" at the end of
__split_and_process_non_flush to make the kernel sleep when splitting the
bio.

And with the above patch, the snapshot deadlock bug still happens.

Mikulas


#define _XOPEN_SOURCE 500
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <malloc.h>
#include <pthread.h>
#include <asm/unistd.h>

/*
 * Change "VG" symbol to a volume group name that you are using.
 *
 * You must apply this patch to the kernel to trigger the bug:
 * Index: linux-4.10-rc2/drivers/md/dm.c
 * ===================================================================
 * --- linux-4.10-rc2.orig/drivers/md/dm.c
 * +++ linux-4.10-rc2/drivers/md/dm.c
 * @@ -1223,6 +1223,9 @@ static int __split_and_process_non_flush
 *         ci->sector += len;
 *         ci->sector_count -= len;
 * 
 * +       if (ci->sector_count)
 * +               msleep(100);
 * +
 *         return 0;
 *  }
 * 
 */

#define VG		"vg1"
#define LV		"test_lv"
#define LV_SNAP		"test_snap"
#define MEGABYTES	"12"
#define SNAP_MEGABYTES	"16"
#define THREADS		1
#define BS		4096
#define SKEW		512
#define ORIG_PATTERN	'p'
#define NEW_PATTERN	'n'

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
};

enum {
	IOPRIO_WHO_PROCESS = 1,
	IOPRIO_WHO_PGRP,
	IOPRIO_WHO_USER,
};

#define IOPRIO_CLASS_SHIFT	13

static inline int ioprio_set(int which, int who, int ioprio)
{
	return syscall(__NR_ioprio_set, which, who, ioprio);
}

static inline int ioprio_get(int which, int who)
{
	return syscall(__NR_ioprio_get, which, who);
}

#define PRIO_READER	((IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT) | 0xff)
#define PRIO_WRITER	(IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT)

static void do_cmd(char *cmd, int ign_err)
{
	int r;
	fprintf(stderr, "* %s\n", cmd);
	r = system(cmd);
	if (r) {
		if (r == -1) {
			perror("system");
		} else {
			if (ign_err) return;
			fprintf(stderr, "return code %x\n", r);
		}
		exit(1);
	}
}

static char pattern[BS];

static int h_orig, h_snap;
static int n;
static long long test_of;
static pthread_rwlock_t rw_lock_1;
static pthread_rwlock_t rw_lock_2;
static pthread_rwlock_t rw_lock_3;
static volatile int started = 0;

static void pthread_error(int r)
{
	fprintf(stderr, "pthread_error: %s\n", strerror(r));
	exit(1);
}

static void *test_read(long long of)
{
	int r;
	char *t = memalign(BS, BS);
	if (!t) perror("memalign"), exit(1);
	if ((r = pread(h_snap, t, BS, of)) != BS) {
		fprintf(stderr, "can't read (%d): %s\n", r, strerror(errno));
		exit(1);
	}
	if (memcmp(pattern, t, BS)) {
		int i;
		for (i = 0; i < BS; i++) if (t[i] != pattern[i]) break;
		fprintf(stderr, "!!!! SNAPSHOT VOLUME DAMAGE AT BLOCK OFFSET %llX, BYTE OFFSET %X: %02x != %02x\n", of, i, (unsigned char)t[i], (unsigned char)pattern[i]);
		exit(2);
	}
	free(t);
	return NULL;
}

static void *test_thread(void *_)
{
	int r;
	_ = _;
	//fprintf(stderr, "start\n");
	if ((r = ioprio_set(IOPRIO_WHO_PROCESS, 0, PRIO_READER))) perror("ioprio_set"), exit(1);
	if ((r = pthread_rwlock_rdlock(&rw_lock_2))) pthread_error(r);
	started = 1;
	if ((r = ioprio_get(IOPRIO_WHO_PROCESS, 0)) != PRIO_READER) {
		if (r == -1) perror("ioprio_get");
		else fprintf(stderr, "reader priority not set: %x\n", r);
		exit(1);
	}
	again:
	if ((r = pthread_rwlock_rdlock(&rw_lock_1))) pthread_error(r);
	if ((r = pthread_rwlock_unlock(&rw_lock_2))) pthread_error(r);
	if (test_of == -1) {
		if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
		//fprintf(stderr, "return\n");
		return NULL;
	}
	//fprintf(stderr, "test(%lld)\n", test_of);
	test_read(test_of);
	if ((r = pthread_rwlock_rdlock(&rw_lock_3))) pthread_error(r);
	if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
	if ((r = pthread_rwlock_rdlock(&rw_lock_2))) pthread_error(r);
	if ((r = pthread_rwlock_unlock(&rw_lock_3))) pthread_error(r);
	goto again;
}

int main(void)
{
	int i, j, r;
	char *np;
	pthread_t thr[THREADS];

	memset(pattern, ORIG_PATTERN, sizeof pattern);

	do_cmd("lvremove -f "VG"/"LV_SNAP"", 1);
	do_cmd("lvremove -f "VG"/"LV"", 1);
	do_cmd("lvcreate -L "MEGABYTES" -n "LV" "VG"", 0);

	h_orig = open("/dev/mapper/"VG"-"LV"", O_RDWR);
	if (h_orig < 0) perror("open orig"), exit(1);
	if (lseek(h_orig, SKEW, SEEK_SET) == -1) perror("lseek"), exit(1);
	n = 0;
	while (write(h_orig, pattern, BS) == BS) {
		n++;
		fprintf(stderr, "creating %llx...\r", (long long)n * BS + SKEW);
	}
	if (fsync(h_orig)) perror("fsync"), exit(1);
	fprintf(stderr,"\n");
	lseek(h_orig, 0, SEEK_SET);
	close(h_orig);

	do_cmd("lvcreate -L "SNAP_MEGABYTES" -n "LV_SNAP" -s "VG"/"LV"", 0);

	h_orig = open("/dev/mapper/"VG"-"LV"", O_RDWR | O_DIRECT);
	if (h_orig < 0) perror("open orig"), exit(1);

	h_snap = open("/dev/mapper/"VG"-"LV_SNAP"", O_RDONLY | O_DIRECT);
	if (h_snap < 0) perror("open snap"), exit(1);

	if ((r = pthread_rwlock_init(&rw_lock_1, NULL))) pthread_error(r);
	if ((r = pthread_rwlock_init(&rw_lock_2, NULL))) pthread_error(r);
	if ((r = pthread_rwlock_init(&rw_lock_3, NULL))) pthread_error(r);
	if ((r = pthread_rwlock_wrlock(&rw_lock_1))) pthread_error(r);
	if ((r = pthread_rwlock_wrlock(&rw_lock_3))) pthread_error(r);

	if ((r = ioprio_set(IOPRIO_WHO_PROCESS, 0, PRIO_WRITER))) perror("ioprio_set"), exit(1);

	for (j = 0; j < THREADS; j++) {
		if ((r = pthread_create(&thr[j], NULL, test_thread, NULL))) pthread_error(r);
	}
	while (!started) usleep(1000);

	if ((r = ioprio_get(IOPRIO_WHO_PROCESS, 0)) != PRIO_WRITER) {
		if (r == -1) perror("ioprio_get");
		else fprintf(stderr, "writer priority not set: %x\n", r);
		exit(1);
	}

	np = memalign(BS, BS);
	if (!np) perror("memalign"), exit(1);
	memset(np, NEW_PATTERN, BS);
	for (i = 0; i < n; i++) {
		test_of = (off_t)i * BS + SKEW;
		fprintf(stderr, "testing %llx...\r", test_of);
		if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
		sched_yield();
		if (pwrite(h_orig, np, BS, test_of) != BS) {
			fprintf(stderr, "can't write (%d): %s\n", r, strerror(errno));
			exit(1);
		}
		if ((r = pthread_rwlock_wrlock(&rw_lock_2))) pthread_error(r);
		if ((r = pthread_rwlock_unlock(&rw_lock_3))) pthread_error(r);
		if ((r = pthread_rwlock_wrlock(&rw_lock_1))) pthread_error(r);
		if ((r = pthread_rwlock_unlock(&rw_lock_2))) pthread_error(r);
		if ((r = pthread_rwlock_wrlock(&rw_lock_3))) pthread_error(r);
	}
	fprintf(stderr,"\n");

	test_of = -1;
	if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);

	for (j = 0; j < THREADS; j++) {
		if ((r = pthread_join(thr[j], NULL))) pthread_error(r);
	}

	fprintf(stderr, "TEST PASSED OK.\n");

	return 0;
}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-06 16:50               ` Mikulas Patocka
@ 2017-01-06 17:34                 ` Mikulas Patocka
  2017-01-06 19:52                   ` Mike Snitzer
  0 siblings, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2017-01-06 17:34 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: NeilBrown, Jack Wang, Lars Ellenberg, Jens Axboe, linux-raid,
	Michael Wang, Peter Zijlstra, Jiri Kosina, Ming Lei,
	linux-kernel, Zheng Liu, linux-block, Takashi Iwai, linux-bcache,
	Ingo Molnar, Alasdair Kergon, Martin K. Petersen, Keith Busch,
	device-mapper development, Shaohua Li, Kent Overstreet,
	Kirill A. Shutemov, Roland Kammerer



On Fri, 6 Jan 2017, Mikulas Patocka wrote:

> 
> 
> On Wed, 4 Jan 2017, Mike Snitzer wrote:
> 
> > On Wed, Jan 04 2017 at 12:12am -0500,
> > NeilBrown <neilb@suse.com> wrote:
> > 
> > > > Suggested-by: NeilBrown <neilb@suse.com>
> > > > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> > > > ---
> > > >  block/blk-core.c | 20 ++++++++++++++++++++
> > > >  1 file changed, 20 insertions(+)
> > > >
> > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > index 9e3ac56..47ef373 100644
> > > > --- a/block/blk-core.c
> > > > +++ b/block/blk-core.c
> > > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
> > > >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > > >  
> > > >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> > > > +			struct bio_list lower, same, hold;
> > > > +
> > > > +			/* Create a fresh bio_list for all subordinate requests */
> > > > +			bio_list_init(&hold);
> > > > +			bio_list_merge(&hold, &bio_list_on_stack);
> > > > +			bio_list_init(&bio_list_on_stack);
> > > >  
> > > >  			ret = q->make_request_fn(q, bio);
> > > >  
> > > >  			blk_queue_exit(q);
> > > > +			/* sort new bios into those for a lower level
> > > > +			 * and those for the same level
> > > > +			 */
> > > > +			bio_list_init(&lower);
> > > > +			bio_list_init(&same);
> > > > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> > > > +				if (q == bdev_get_queue(bio->bi_bdev))
> > > > +					bio_list_add(&same, bio);
> > > > +				else
> > > > +					bio_list_add(&lower, bio);
> > > > +			/* now assemble so we handle the lowest level first */
> > > > +			bio_list_merge(&bio_list_on_stack, &lower);
> > > > +			bio_list_merge(&bio_list_on_stack, &same);
> > > > +			bio_list_merge(&bio_list_on_stack, &hold);
> > > >  
> > > >  			bio = bio_list_pop(current->bio_list);
> > > >  		} else {
> > > > -- 
> > > > 2.7.4
> > 
> > Mikulas, would you be willing to try the below patch with the
> > dm-snapshot deadlock scenario and report back on whether it fixes that?
> > 
> > Patch below looks to be the same as here:
> > https://marc.info/?l=linux-raid&m=148232453107685&q=p3
> > 
> > Neil and/or others if that isn't the patch that should be tested please
> > provide a pointer to the latest.
> > 
> > Thanks,
> > Mike
> 
> The bad news is that this doesn't fix the snapshot deadlock.
> 
> I created a test program for the snapshot deadlock bug (it was originally
> created years ago to test for a different bug, so it contains some cruft).
> You also need to insert "if (ci->sector_count) msleep(100);" at the end of
> __split_and_process_non_flush to make the kernel sleep when splitting the
> bio.
> 
> And with the above patch, the snapshot deadlock bug still happens.
> 
> Mikulas
> 
> 
> #define _XOPEN_SOURCE 500
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <string.h>
> #include <errno.h>
> #include <malloc.h>
> #include <pthread.h>
> #include <asm/unistd.h>
> 
> /*
>  * Change "VG" symbol to a volume group name that you are using.
>  *
>  * You must apply this patch to the kernel to trigger the bug:
>  * Index: linux-4.10-rc2/drivers/md/dm.c
>  * ===================================================================
>  * --- linux-4.10-rc2.orig/drivers/md/dm.c
>  * +++ linux-4.10-rc2/drivers/md/dm.c
>  * @@ -1223,6 +1223,9 @@ static int __split_and_process_non_flush
>  *         ci->sector += len;
>  *         ci->sector_count -= len;
>  * 
>  * +       if (ci->sector_count)
>  * +               msleep(100);
>  * +
>  *         return 0;
>  *  }
>  * 
>  */
> 
> #define VG		"vg1"
> #define LV		"test_lv"
> #define LV_SNAP		"test_snap"
> #define MEGABYTES	"12"
> #define SNAP_MEGABYTES	"16"
> #define THREADS		1
> #define BS		4096
> #define SKEW		512
> #define ORIG_PATTERN	'p'
> #define NEW_PATTERN	'n'
> 
> enum {
> 	IOPRIO_CLASS_NONE,
> 	IOPRIO_CLASS_RT,
> 	IOPRIO_CLASS_BE,
> 	IOPRIO_CLASS_IDLE,
> };
> 
> enum {
> 	IOPRIO_WHO_PROCESS = 1,
> 	IOPRIO_WHO_PGRP,
> 	IOPRIO_WHO_USER,
> };
> 
> #define IOPRIO_CLASS_SHIFT	13
> 
> static inline int ioprio_set(int which, int who, int ioprio)
> {
> 	return syscall(__NR_ioprio_set, which, who, ioprio);
> }
> 
> static inline int ioprio_get(int which, int who)
> {
> 	return syscall(__NR_ioprio_get, which, who);
> }
> 
> #define PRIO_READER	((IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT) | 0xff)
> #define PRIO_WRITER	(IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT)
> 
> static void do_cmd(char *cmd, int ign_err)
> {
> 	int r;
> 	fprintf(stderr, "* %s\n", cmd);
> 	r = system(cmd);
> 	if (r) {
> 		if (r == -1) {
> 			perror("system");
> 		} else {
> 			if (ign_err) return;
> 			fprintf(stderr, "return code %x\n", r);
> 		}
> 		exit(1);
> 	}
> }
> 
> static char pattern[BS];
> 
> static int h_orig, h_snap;
> static int n;
> static long long test_of;
> static pthread_rwlock_t rw_lock_1;
> static pthread_rwlock_t rw_lock_2;
> static pthread_rwlock_t rw_lock_3;
> static volatile int started = 0;
> 
> static void pthread_error(int r)
> {
> 	fprintf(stderr, "pthread_error: %s\n", strerror(r));
> 	exit(1);
> }
> 
> static void *test_read(long long of)
> {
> 	int r;
> 	char *t = memalign(BS, BS);
> 	if (!t) perror("memalign"), exit(1);
> 	if ((r = pread(h_snap, t, BS, of)) != BS) {
> 		fprintf(stderr, "can't read (%d): %s\n", r, strerror(errno));
> 		exit(1);
> 	}
> 	if (memcmp(pattern, t, BS)) {
> 		int i;
> 		for (i = 0; i < BS; i++) if (t[i] != pattern[i]) break;
> 		fprintf(stderr, "!!!! SNAPSHOT VOLUME DAMAGE AT BLOCK OFFSET %llX, BYTE OFFSET %X: %02x != %02x\n", of, i, (unsigned char)t[i], (unsigned char)pattern[i]);
> 		exit(2);
> 	}
> 	free(t);
> 	return NULL;
> }
> 
> static void *test_thread(void *_)
> {
> 	int r;
> 	_ = _;
> 	//fprintf(stderr, "start\n");
> 	if ((r = ioprio_set(IOPRIO_WHO_PROCESS, 0, PRIO_READER))) perror("ioprio_set"), exit(1);
> 	if ((r = pthread_rwlock_rdlock(&rw_lock_2))) pthread_error(r);
> 	started = 1;
> 	if ((r = ioprio_get(IOPRIO_WHO_PROCESS, 0)) != PRIO_READER) {
> 		if (r == -1) perror("ioprio_get");
> 		else fprintf(stderr, "reader priority not set: %x\n", r);
> 		exit(1);
> 	}
> 	again:
> 	if ((r = pthread_rwlock_rdlock(&rw_lock_1))) pthread_error(r);
> 	if ((r = pthread_rwlock_unlock(&rw_lock_2))) pthread_error(r);
> 	if (test_of == -1) {
> 		if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
> 		//fprintf(stderr, "return\n");
> 		return NULL;
> 	}
> 	//fprintf(stderr, "test(%lld)\n", test_of);
> 	test_read(test_of);
> 	if ((r = pthread_rwlock_rdlock(&rw_lock_3))) pthread_error(r);
> 	if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
> 	if ((r = pthread_rwlock_rdlock(&rw_lock_2))) pthread_error(r);
> 	if ((r = pthread_rwlock_unlock(&rw_lock_3))) pthread_error(r);
> 	goto again;
> }
> 
> int main(void)
> {
> 	int i, j, r;
> 	char *np;
> 	pthread_t thr[THREADS];
> 
> 	memset(pattern, ORIG_PATTERN, sizeof pattern);
> 
> 	do_cmd("lvremove -f "VG"/"LV_SNAP"", 1);
> 	do_cmd("lvremove -f "VG"/"LV"", 1);
> 	do_cmd("lvcreate -L "MEGABYTES" -n "LV" "VG"", 0);
> 
> 	h_orig = open("/dev/mapper/"VG"-"LV"", O_RDWR);
> 	if (h_orig < 0) perror("open orig"), exit(1);
> 	if (lseek(h_orig, SKEW, SEEK_SET) == -1) perror("lseek"), exit(1);
> 	n = 0;
> 	while (write(h_orig, pattern, BS) == BS) {
> 		n++;
> 		fprintf(stderr, "creating %llx...\r", (long long)n * BS + SKEW);
> 	}
> 	if (fsync(h_orig)) perror("fsync"), exit(1);
> 	fprintf(stderr,"\n");
> 	lseek(h_orig, 0, SEEK_SET);
> 	close(h_orig);
> 
> 	do_cmd("lvcreate -L "SNAP_MEGABYTES" -n "LV_SNAP" -s "VG"/"LV"", 0);
> 
> 	h_orig = open("/dev/mapper/"VG"-"LV"", O_RDWR | O_DIRECT);
> 	if (h_orig < 0) perror("open orig"), exit(1);
> 
> 	h_snap = open("/dev/mapper/"VG"-"LV_SNAP"", O_RDONLY | O_DIRECT);
> 	if (h_snap < 0) perror("open snap"), exit(1);
> 
> 	if ((r = pthread_rwlock_init(&rw_lock_1, NULL))) pthread_error(r);
> 	if ((r = pthread_rwlock_init(&rw_lock_2, NULL))) pthread_error(r);
> 	if ((r = pthread_rwlock_init(&rw_lock_3, NULL))) pthread_error(r);
> 	if ((r = pthread_rwlock_wrlock(&rw_lock_1))) pthread_error(r);
> 	if ((r = pthread_rwlock_wrlock(&rw_lock_3))) pthread_error(r);
> 
> 	if ((r = ioprio_set(IOPRIO_WHO_PROCESS, 0, PRIO_WRITER))) perror("ioprio_set"), exit(1);
> 
> 	for (j = 0; j < THREADS; j++) {
> 		if ((r = pthread_create(&thr[j], NULL, test_thread, NULL))) pthread_error(r);
> 	}
> 	while (!started) usleep(1000);
> 
> 	if ((r = ioprio_get(IOPRIO_WHO_PROCESS, 0)) != PRIO_WRITER) {
> 		if (r == -1) perror("ioprio_get");
> 		else fprintf(stderr, "writer priority not set: %x\n", r);
> 		exit(1);
> 	}
> 
> 	np = memalign(BS, BS);
> 	if (!np) perror("memalign"), exit(1);
> 	memset(np, NEW_PATTERN, BS);
> 	for (i = 0; i < n; i++) {
> 		test_of = (off_t)i * BS + SKEW;
> 		fprintf(stderr, "testing %llx...\r", test_of);
> 		if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
> 		sched_yield();
> 		if (pwrite(h_orig, np, BS, test_of) != BS) {
> 			fprintf(stderr, "can't write (%d): %s\n", r, strerror(errno));
> 			exit(1);
> 		}
> 		if ((r = pthread_rwlock_wrlock(&rw_lock_2))) pthread_error(r);
> 		if ((r = pthread_rwlock_unlock(&rw_lock_3))) pthread_error(r);
> 		if ((r = pthread_rwlock_wrlock(&rw_lock_1))) pthread_error(r);
> 		if ((r = pthread_rwlock_unlock(&rw_lock_2))) pthread_error(r);
> 		if ((r = pthread_rwlock_wrlock(&rw_lock_3))) pthread_error(r);
> 	}
> 	fprintf(stderr,"\n");
> 
> 	test_of = -1;
> 	if ((r = pthread_rwlock_unlock(&rw_lock_1))) pthread_error(r);
> 
> 	for (j = 0; j < THREADS; j++) {
> 		if ((r = pthread_join(thr[j], NULL))) pthread_error(r);
> 	}
> 
> 	fprintf(stderr, "TEST PASSED OK.\n");
> 
> 	return 0;
> }
> 
> 

Here I post a patch that fixes the snapshot deadlock. On schedule(), it 
redirects bios on current->bio_list to helper workqueues.

Mikulas


From f126e182a053ef2e44a3e70b86df84d2b003530b Mon Sep 17 00:00:00 2001
From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 27 May 2014 11:03:36 -0400
Subject: block: flush queued bios when process blocks to avoid deadlock

The block layer uses a per-process bio list to avoid recursion in
generic_make_request.  When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately.  The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However, another deadlock (see below **) may
happen, without any low-memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call.  Consequently, when
the process blocks on a mutex, the bios queued on current->bio_list are
dispatched to independent workqueues and they can complete without
waiting for the mutex to be available.

Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
calls to it because bio_alloc_bioset() will implicitly punt all bios on
current->bio_list if it performs a blocking allocation.

** Here is the dm-snapshot deadlock that was observed:

1) Process A sends a one-page read bio to the dm-snapshot target. The bio
spans a snapshot chunk boundary and so is split into two bios by device
mapper.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (that allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
the bio to the underlying device and exits with DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
issued - it is added to current->bio_list instead.

5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
chunk affected by the first remapped bio; it takes down_write(&s->lock)
and then loops in __check_for_conflicting_io, waiting for the
dm_snap_tracked_chunk created in step 3) to be released.

6) Process A continues, it creates a second sub-bio for the rest of the
original bio.

7) snapshot_map is called for this new bio; it waits on
down_write(&s->lock), which is held by Process B (in step 5).

Process A is now blocked on the lock held by B, while B waits for I/O
that is stuck behind A on current->bio_list: a circular wait, so
neither can make progress.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Cc: stable@vger.kernel.org

---
 block/bio.c            |   77 +++++++++++++++++++------------------------------
 include/linux/blkdev.h |   24 ++++++++++-----
 kernel/sched/core.c    |    7 +---
 3 files changed, 50 insertions(+), 58 deletions(-)

Index: linux-4.9-rc3/block/bio.c
===================================================================
--- linux-4.9-rc3.orig/block/bio.c	2016-11-02 23:05:03.000000000 +0100
+++ linux-4.9-rc3/block/bio.c	2016-11-02 23:05:21.000000000 +0100
@@ -353,35 +353,37 @@ static void bio_alloc_rescue(struct work
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * Pop bios queued on @tsk->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * If the bio is allocated from fs_bio_set, we must leave it to avoid
+ * deadlock on loopback block device.
+ * Stacking bio drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *tsk->bio_list;
+	bio_list_init(tsk->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs) || bs == fs_bio_set) {
+			bio_list_add(tsk->bio_list, bio);
+			continue;
+		}
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_unlock(&bs->rescue_lock);
+	}
 }
 
 /**
@@ -421,7 +423,6 @@ static void punt_bios_to_rescuer(struct 
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	struct bio_vec *bvl = NULL;
@@ -455,23 +456,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		 * reserve.
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
+		 * workqueue per bio_set. If an allocation would block (due to
+		 * __GFP_DIRECT_RECLAIM) the scheduler will first punt all bios
+		 * on current->bio_list to the rescuer workqueue.
 		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -486,12 +475,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		unsigned long idx = 0;
 
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
Index: linux-4.9-rc3/include/linux/blkdev.h
===================================================================
--- linux-4.9-rc3.orig/include/linux/blkdev.h	2016-11-02 23:05:03.000000000 +0100
+++ linux-4.9-rc3/include/linux/blkdev.h	2016-11-02 23:05:21.000000000 +0100
@@ -1118,6 +1118,22 @@ static inline bool blk_needs_flush_plug(
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(struct task_struct *tsk);
+
+static inline void blk_flush_queued_io(struct task_struct *tsk)
+{
+	/*
+	 * Flush any queued bios to corresponding rescue threads.
+	 */
+	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+		blk_flush_bio_list(tsk);
+	/*
+	 * Flush any plugged IO that is queued.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
+}
+
 /*
  * tag stuff
  */
@@ -1729,16 +1745,10 @@ static inline void blk_flush_plug(struct
 {
 }
 
-static inline void blk_schedule_flush_plug(struct task_struct *task)
+static inline void blk_flush_queued_io(struct task_struct *tsk)
 {
 }
 
-
-static inline bool blk_needs_flush_plug(struct task_struct *tsk)
-{
-	return false;
-}
-
 static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 				     sector_t *error_sector)
 {
Index: linux-4.9-rc3/kernel/sched/core.c
===================================================================
--- linux-4.9-rc3.orig/kernel/sched/core.c	2016-11-02 23:05:03.000000000 +0100
+++ linux-4.9-rc3/kernel/sched/core.c	2016-11-02 23:05:21.000000000 +0100
@@ -3440,11 +3440,10 @@ static inline void sched_submit_work(str
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
-	 * If we are going to sleep and we have plugged IO queued,
+	 * If we are going to sleep and we have queued IO,
 	 * make sure to submit it to avoid deadlocks.
 	 */
-	if (blk_needs_flush_plug(tsk))
-		blk_schedule_flush_plug(tsk);
+	blk_flush_queued_io(tsk);
 }
 
 asmlinkage __visible void __sched schedule(void)
@@ -5067,7 +5066,7 @@ long __sched io_schedule_timeout(long ti
 	long ret;
 
 	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
+	blk_flush_queued_io(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-06 17:34                 ` Mikulas Patocka
@ 2017-01-06 19:52                   ` Mike Snitzer
  2017-01-06 23:01                     ` NeilBrown
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Snitzer @ 2017-01-06 19:52 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: NeilBrown, Jack Wang, Lars Ellenberg, Jens Axboe, linux-raid,
	Michael Wang, Peter Zijlstra, Jiri Kosina, Ming Lei,
	linux-kernel, Zheng Liu, linux-block, Takashi Iwai, linux-bcache,
	Ingo Molnar, Alasdair Kergon, Martin K. Petersen, Keith Busch,
	device-mapper development, Shaohua Li, Kent Overstreet,
	Kirill A. Shutemov, Roland Kammerer

On Fri, Jan 06 2017 at 12:34pm -0500,
Mikulas Patocka <mpatocka@redhat.com> wrote:

> 
> 
> On Fri, 6 Jan 2017, Mikulas Patocka wrote:
> 
> > 
> > 
> > On Wed, 4 Jan 2017, Mike Snitzer wrote:
> > 
> > > On Wed, Jan 04 2017 at 12:12am -0500,
> > > NeilBrown <neilb@suse.com> wrote:
> > > 
> > > > > Suggested-by: NeilBrown <neilb@suse.com>
> > > > > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
> > > > > ---
> > > > >  block/blk-core.c | 20 ++++++++++++++++++++
> > > > >  1 file changed, 20 insertions(+)
> > > > >
> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > > > index 9e3ac56..47ef373 100644
> > > > > --- a/block/blk-core.c
> > > > > +++ b/block/blk-core.c
> > > > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
> > > > >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > > > >  
> > > > >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
> > > > > +			struct bio_list lower, same, hold;
> > > > > +
> > > > > +			/* Create a fresh bio_list for all subordinate requests */
> > > > > +			bio_list_init(&hold);
> > > > > +			bio_list_merge(&hold, &bio_list_on_stack);
> > > > > +			bio_list_init(&bio_list_on_stack);
> > > > >  
> > > > >  			ret = q->make_request_fn(q, bio);
> > > > >  
> > > > >  			blk_queue_exit(q);
> > > > > +			/* sort new bios into those for a lower level
> > > > > +			 * and those for the same level
> > > > > +			 */
> > > > > +			bio_list_init(&lower);
> > > > > +			bio_list_init(&same);
> > > > > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
> > > > > +				if (q == bdev_get_queue(bio->bi_bdev))
> > > > > +					bio_list_add(&same, bio);
> > > > > +				else
> > > > > +					bio_list_add(&lower, bio);
> > > > > +			/* now assemble so we handle the lowest level first */
> > > > > +			bio_list_merge(&bio_list_on_stack, &lower);
> > > > > +			bio_list_merge(&bio_list_on_stack, &same);
> > > > > +			bio_list_merge(&bio_list_on_stack, &hold);
> > > > >  
> > > > >  			bio = bio_list_pop(current->bio_list);
> > > > >  		} else {
> > > > > -- 
> > > > > 2.7.4
> > > 
> > > Mikulas, would you be willing to try the below patch with the
> > > dm-snapshot deadlock scenario and report back on whether it fixes that?
> > > 
> > > Patch below looks to be the same as here:
> > > https://marc.info/?l=linux-raid&m=148232453107685&q=p3
> > > 
> > > Neil and/or others if that isn't the patch that should be tested please
> > > provide a pointer to the latest.
> > > 
> > > Thanks,
> > > Mike
> > 
> > The bad news is that this doesn't fix the snapshot deadlock.
> > 
> > I created a test program for the snapshot deadlock bug (it was originally 
> > created years ago to test for a different bug, so it contains some cruft). 
> > You also need to insert "if (ci->sector_count) msleep(100);" to the end of 
> > __split_and_process_non_flush to make the kernel sleep when splitting the 
> > bio.
> > 
> > And with the above patch, the snapshot deadlock bug still happens.

That is really unfortunate.  It would be useful to dig in and understand
why, because the ordering of the IO in generic_make_request() really
should take care of it.

<snip>
 
> Here I post a patch that fixes the snapshot deadlock. On schedule(), it 
> redirects bios on current->bio_list to helper workqueues.

<snip old patch>

That patch is included in the series of changes sequenced at the top of
this git branch:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/log/?h=wip

At the risk of repeating myself: unfortunately it doesn't have a way
forward with the timed offload implementation (which was done to appease
Ming Lei's concern about context switching causing reduced plugging that
results in less efficient IO).

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2017-01-06 19:52                   ` Mike Snitzer
@ 2017-01-06 23:01                     ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2017-01-06 23:01 UTC (permalink / raw)
  To: Mike Snitzer, Mikulas Patocka
  Cc: Jack Wang, Lars Ellenberg, Jens Axboe, linux-raid, Michael Wang,
	Peter Zijlstra, Jiri Kosina, Ming Lei, linux-kernel, Zheng Liu,
	linux-block, Takashi Iwai, linux-bcache, Ingo Molnar,
	Alasdair Kergon, Martin K. Petersen, Keith Busch,
	device-mapper development, Shaohua Li, Kent Overstreet,
	Kirill A. Shutemov, Roland Kammerer

On Sat, Jan 07 2017, Mike Snitzer wrote:

> On Fri, Jan 06 2017 at 12:34pm -0500,
> Mikulas Patocka <mpatocka@redhat.com> wrote:
>
>> 
>> 
>> On Fri, 6 Jan 2017, Mikulas Patocka wrote:
>> 
>> > 
>> > 
>> > On Wed, 4 Jan 2017, Mike Snitzer wrote:
>> > 
>> > > On Wed, Jan 04 2017 at 12:12am -0500,
>> > > NeilBrown <neilb@suse.com> wrote:
>> > > 
>> > > > > Suggested-by: NeilBrown <neilb@suse.com>
>> > > > > Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com>
>> > > > > ---
>> > > > >  block/blk-core.c | 20 ++++++++++++++++++++
>> > > > >  1 file changed, 20 insertions(+)
>> > > > >
>> > > > > diff --git a/block/blk-core.c b/block/blk-core.c
>> > > > > index 9e3ac56..47ef373 100644
>> > > > > --- a/block/blk-core.c
>> > > > > +++ b/block/blk-core.c
>> > > > > @@ -2138,10 +2138,30 @@ blk_qc_t generic_make_request(struct bio *bio)
>> > > > >  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>> > > > >  
>> > > > >  		if (likely(blk_queue_enter(q, __GFP_DIRECT_RECLAIM) == 0)) {
>> > > > > +			struct bio_list lower, same, hold;
>> > > > > +
>> > > > > +			/* Create a fresh bio_list for all subordinate requests */
>> > > > > +			bio_list_init(&hold);
>> > > > > +			bio_list_merge(&hold, &bio_list_on_stack);
>> > > > > +			bio_list_init(&bio_list_on_stack);
>> > > > >  
>> > > > >  			ret = q->make_request_fn(q, bio);
>> > > > >  
>> > > > >  			blk_queue_exit(q);
>> > > > > +			/* sort new bios into those for a lower level
>> > > > > +			 * and those for the same level
>> > > > > +			 */
>> > > > > +			bio_list_init(&lower);
>> > > > > +			bio_list_init(&same);
>> > > > > +			while ((bio = bio_list_pop(&bio_list_on_stack)) != NULL)
>> > > > > +				if (q == bdev_get_queue(bio->bi_bdev))
>> > > > > +					bio_list_add(&same, bio);
>> > > > > +				else
>> > > > > +					bio_list_add(&lower, bio);
>> > > > > +			/* now assemble so we handle the lowest level first */
>> > > > > +			bio_list_merge(&bio_list_on_stack, &lower);
>> > > > > +			bio_list_merge(&bio_list_on_stack, &same);
>> > > > > +			bio_list_merge(&bio_list_on_stack, &hold);
>> > > > >  
>> > > > >  			bio = bio_list_pop(current->bio_list);
>> > > > >  		} else {
>> > > > > -- 
>> > > > > 2.7.4
>> > > 
>> > > Mikulas, would you be willing to try the below patch with the
>> > > dm-snapshot deadlock scenario and report back on whether it fixes that?
>> > > 
>> > > Patch below looks to be the same as here:
>> > > https://marc.info/?l=linux-raid&m=148232453107685&q=p3
>> > > 
>> > > Neil and/or others if that isn't the patch that should be tested please
>> > > provide a pointer to the latest.
>> > > 
>> > > Thanks,
>> > > Mike
>> > 
>> > The bad news is that this doesn't fix the snapshot deadlock.
>> > 
>> > I created a test program for the snapshot deadlock bug (it was originally 
>> > created years ago to test for a different bug, so it contains some cruft). 
>> > You also need to insert "if (ci->sector_count) msleep(100);" to the end of 
>> > __split_and_process_non_flush to make the kernel sleep when splitting the 
>> > bio.
>> > 
>> > And with the above patch, the snapshot deadlock bug still happens.
>
> That is really unfortunate.  It would be useful to dig in and understand
> why, because the ordering of the IO in generic_make_request() really
> should take care of it.

I *think* you might be able to resolve this by changing
__split_and_process_bio() to only ever perform a single split.  No
looping.
i.e. if the bio is too big to handle directly, then split off the front
to a new bio, which you bio_chain to the original.  The original then
has bio_advance() called to step over the front, then
generic_make_request() so it is queued.
Then the code proceeds to __clone_and_map_data_bio() on the front that
got split off.
When that completes it *doesn't* loop round, but returns into
generic_make_request() which does the looping and makes sure to handle
the lowest-level bio next.

something vaguely like this:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664f3..06ee0960e415 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
 
 	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
 
+	if (len < ci->sector_count) {
+		struct bio *split = bio_split(ci->bio, len, GFP_NOIO, fs_bio_set);
+		bio_chain(split, ci->bio);
+		generic_make_request(ci->bio);
+		ci->bio = split;
+		ci->sector_count = len;
+	}
+
 	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
 	if (r < 0)
 		return r;

though I haven't tested, and the change (if it works) should probably be
more fully integrated into surrounding code.

You probably don't want to use "fs_bio_set" either - a target-local
pool would be best.
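
For illustration, with the 4.9-era bioset API a target-local pool could
look roughly like this (the split_bs field and the error label are made
up; a sketch, not a tested patch):

/* sketch only: a per-target bio_set instead of fs_bio_set */
struct dm_snapshot {
	/* ... existing fields ... */
	struct bio_set *split_bs;
};

/* in snapshot_ctr(): */
	s->split_bs = bioset_create(BIO_POOL_SIZE, 0);
	if (!s->split_bs)
		goto bad_split_bs;

/* in the split path, instead of fs_bio_set: */
	struct bio *split = bio_split(bio, len, GFP_NOIO, s->split_bs);

/* in snapshot_dtr(): */
	bioset_free(s->split_bs);

A private pool avoids competing with other users of fs_bio_set, which
is what the forward-progress guarantee depends on.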

NeilBrown

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 1/1] block: fix blk_queue_split() resource exhaustion
  2016-07-19  9:00           ` Lars Ellenberg
  2016-07-21 22:53             ` Eric Wheeler
  2016-08-11  4:16             ` Eric Wheeler
@ 2017-01-07 19:56             ` Lars Ellenberg
  2 siblings, 0 replies; 23+ messages in thread
From: Lars Ellenberg @ 2017-01-07 19:56 UTC (permalink / raw)
  To: Mike Snitzer, Mikulas Patocka, NeilBrown
  Cc: Eric Wheeler, Jens Axboe, linux-block, Martin K. Petersen,
	Peter Zijlstra, Jiri Kosina, Ming Lei, Kirill A. Shutemov,
	linux-kernel, linux-raid, Takashi Iwai, linux-bcache, Zheng Liu,
	Kent Overstreet, Keith Busch, dm-devel, Shaohua Li, Ingo Molnar,
	Alasdair Kergon, Roland Kammerer, Jack Wang, Michael Wang

On Sat, Jan 07, 2017 at 10:01:07AM +1100, NeilBrown wrote:
> On Sat, Jan 07 2017, Mike Snitzer wrote:
> > On Fri, Jan 06 2017 at 12:34pm -0500,
> > Mikulas Patocka <mpatocka@redhat.com> wrote:
> >> On Fri, 6 Jan 2017, Mikulas Patocka wrote:
> >> > On Wed, 4 Jan 2017, Mike Snitzer wrote:
> >> > > On Wed, Jan 04 2017 at 12:12am -0500,
> >> > > NeilBrown <neilb@suse.com> wrote:
...

> >> > And with the above patch, the snapshot deadlock bug still happens.
> >
> > That is really unfortunate.  It would be useful to dig in and understand
> > why, because the ordering of the IO in generic_make_request() really
> > should take care of it.
> 
> I *think* you might be able to resolve this by changing
> __split_and_process_bio() to only ever perform a single split.  No
> looping.
> i.e. if the bio is too big to handle directly, then split off the front
> to a new bio, which you bio_chain to the original.  The original then
> has bio_advance() called to step over the front, then
> generic_make_request() so it is queued.
> Then the code proceeds to __clone_and_map_data_bio() on the front that
> got split off.
> When that completes it *doesn't* loop round, but returns into
> generic_make_request() which does the looping and makes sure to handle
> the lowest-level bio next.
> 
> something vaguely like this:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664f3..06ee0960e415 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1216,6 +1216,14 @@ static int __split_and_process_non_flush(struct clone_info *ci)
>  
>  	len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
>  
> +	if (len < ci->sector_count) {
> +		struct bio *split = bio_split(ci->bio, len, GFP_NOIO, fs_bio_set);
> +		bio_chain(split, ci->bio);
> +		generic_make_request(ci->bio);
> +		ci->bio = split;
> +		ci->sector_count = len;
> +	}
> +
>  	r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
>  	if (r < 0)
>  		return r;
> 
> though I haven't tested, and the change (if it works) should probably be
> more fully integrated into surrounding code.
> 
> You probably don't want to use "fs_bio_set" either - a target-local
> pool would be best.
> 
> NeilBrown

Which is pretty much what I suggested in this thread
back in July already, see below.

Cheers,
	Lars

On Tue, Jul 19, 2016 at 11:00:24AM +0200, Lars Ellenberg wrote:
...

> > > C.  [1/3] block: flush queued bios when process blocks to avoid deadlock
> > > 	by Mikulas Patocka: https://patchwork.kernel.org/patch/9204125/
> > > 		(was https://patchwork.kernel.org/patch/7398411/)
> 
> As it stands now,
> this is yet an other issue, but related.
> 
> From the link above:
> 
> | ** Here is the dm-snapshot deadlock that was observed:
> | 
> | 1) Process A sends one-page read bio to the dm-snapshot target. The bio
> | spans snapshot chunk boundary and so it is split to two bios by device
> | mapper.
> | 
> | 2) Device mapper creates the first sub-bio and sends it to the snapshot
> | driver.
> | 
> | 3) The function snapshot_map calls track_chunk (that allocates a
> | structure
> | dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
> | the bio to the underlying device and exits with DM_MAPIO_REMAPPED.
> | 
> | 4) The remapped bio is submitted with generic_make_request, but it isn't
> | issued - it is added to current->bio_list instead.
> | 
> | 5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
> | chunk affected be the first remapped bio, it takes down_write(&s->lock)
> | and then loops in __check_for_conflicting_io, waiting for
> | dm_snap_tracked_chunk created in step 3) to be released.
> | 
> | 6) Process A continues, it creates a second sub-bio for the rest of the
> | original bio.
> 
> Aha.
> Here is the relation.
> If "A" had only ever processed "just the chunk it can handle now",
> and "pushed back" the rest of the incoming bio,
> it could rely on all deeper level bios to have been submitted already.
> 
> But this does not look like it easily fits into the current DM model.
> 
> | 7) snapshot_map is called for this new bio, it waits on
> | down_write(&s->lock) that is held by Process B (in step 5).
> 
> There is another suggestion:
> Use down_trylock (or down_timeout),
> and if it fails, push back the currently to-be-processed bio.
> We can introduce a new bio helper for that.
> Kind of what blk_queue_split() does with my patch applied.
> 
> Or even better, ignore the down_trylock suggestion,
> simply not iterate over all pieces first,
> but process one piece, and return back to the
> iteration in generic_make_request.
> 
> A bit of conflict here may be that DM has all its own
> split and clone and queue magic, and wants to process
> "all of the bio" before returning back to generic_make_request().
> 
> To change that, __split_and_process_bio() and all its helpers
> would need to learn to "push back" (pieces of) the bio they are
> currently working on, and not push back via "DM_ENDIO_REQUEUE",
> but by bio_list_add_head(&current->bio_lists->queue, piece_to_be_done_later).
> 
> Then, after they processed each piece,
> *return* all the way up to the top-level generic_make_request(),
> where the recursion-to-iteration logic would then
> make sure that all deeper level bios, submitted via
> recursive calls to generic_make_request() will be processed, before the
> next, pushed back, piece of the "original incoming" bio.
> 
> And *not* do their own iteration over all pieces first.
> 
> Probably not as easy as dropping the while loop,
> using bio_advance, and pushing that "advanced" bio back to
> current->...queue?
> 
> static void __split_and_process_bio(struct mapped_device *md,
> 				    struct dm_table *map, struct bio *bio)
> ...
> 		ci.bio = bio;
> 		ci.sector_count = bio_sectors(bio);
> 		while (ci.sector_count && !error)
> 			error = __split_and_process_non_flush(&ci);
> ...
> 		error = __split_and_process_non_flush(&ci);
> 		if (ci.sector_count)
> 			bio_advance()
> 			bio_list_add_head(&current->bio_lists->queue, )
> ...
> 
> Something like that, maybe?
> Just a thought.
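
Spelling that sketch out a little further (untested, and ignoring dm's
io accounting; bio_list_add_head() on current->bio_lists->queue assumes
the v2 patch from this thread is applied):

		ci.bio = bio;
		ci.sector_count = bio_sectors(bio);
		error = __split_and_process_non_flush(&ci);
		if (!error && ci.sector_count) {
			/* step over the piece that was just handled ... */
			bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
			/* ... and push the remainder back, head-of-queue, so
			 * the top-level generic_make_request() dispatches all
			 * deeper-level bios before picking this up again */
			bio_list_add_head(&current->bio_lists->queue, bio);
		}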

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-01-07 19:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-08 15:04 [PATCH 0/1] block: fix blk_queue_split() resource exhaustion Lars Ellenberg
2016-07-08 15:04 ` [PATCH 1/1] " Lars Ellenberg
2016-07-08 18:49   ` Mike Snitzer
2016-07-11 14:13     ` Lars Ellenberg
2016-07-11 14:10   ` [PATCH v2 " Lars Ellenberg
2016-07-12  2:55     ` [dm-devel] " NeilBrown
2016-07-13  2:18       ` Eric Wheeler
2016-07-13  2:32         ` Mike Snitzer
2016-07-19  9:00           ` Lars Ellenberg
2016-07-21 22:53             ` Eric Wheeler
2016-07-25 20:39               ` Jeff Moyer
2016-08-11  4:16             ` Eric Wheeler
2017-01-07 19:56             ` Lars Ellenberg
2016-12-23  8:49     ` Michael Wang
2016-12-23 11:45       ` Lars Ellenberg
2017-01-02 14:33         ` [dm-devel] " Jack Wang
2017-01-04  5:12           ` NeilBrown
2017-01-04 18:50             ` Mike Snitzer
2017-01-05 10:54               ` 王金浦
2017-01-06 16:50               ` Mikulas Patocka
2017-01-06 17:34                 ` Mikulas Patocka
2017-01-06 19:52                   ` Mike Snitzer
2017-01-06 23:01                     ` NeilBrown
