linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] blk plug fixes
@ 2015-04-30 17:45 Shaohua Li
  2015-04-30 17:45 ` [PATCH 1/5] blk: clean up plug Shaohua Li
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Shaohua Li @ 2015-04-30 17:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, jmoyer, hch, neilb

These are some block plug related patches. The first 3 are independent and can
be applied separately.

For the multiple queue case, I only handled the !BLK_MQ_F_DEFER_ISSUE case.
DEFER_ISSUE is really confusing; I'm not sure what the correct behavior
should be.

BTW, the call to blk_mq_merge_queue_io is confusing to me too. The request is
either dispatched immediately or handed to a workqueue to be dispatched. In
either case the request will be dispatched soon, so there is only a tiny
window in which requests sit in the queue and can be merged. I thought we
could delete that code in favor of plug merging. And if out-of-order request
merging is important, we really should add an ELEVATOR_INSERT_SORT_MERGE-like
mechanism. This makes BLK_MQ_F_DEFER_ISSUE even more confusing to me.

Thanks,
Shaohua

Jeff Moyer (1):
  blk-mq: fix plugging in blk_sq_make_request

Shaohua Li (4):
  blk: clean up plug
  sched: always use blk_schedule_flush_plug in io_schedule_out
  blk-mq: do limited block plug for multiple queue case
  blk-mq: make plug work for multiple disks and queues

 block/blk-core.c    |  34 ++++++++-------
 block/blk-mq.c      | 120 ++++++++++++++++++++++++++++++++--------------------
 block/blk.h         |   3 +-
 kernel/sched/core.c |   8 +---
 4 files changed, 98 insertions(+), 67 deletions(-)

-- 
1.8.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/5] blk: clean up plug
  2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
@ 2015-04-30 17:45 ` Shaohua Li
  2015-05-01 17:11   ` Christoph Hellwig
  2015-04-30 17:45 ` [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out Shaohua Li
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-30 17:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, jmoyer, hch, neilb

The current code makes it look like an inner plug gets flushed with
blk_finish_plug(). Actually it's a nop: all requests/callbacks are
added to current->plug, while only the outermost plug is assigned to
current->plug. So an inner plug always has an empty request/callback
list, which makes blk_flush_plug_list() a nop. This patch tries to
make the code clearer.
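
For illustration, the nesting pattern in question looks roughly like
this (the caller code is made up; only blk_start_plug/blk_finish_plug
and submit_bio are real interfaces):

        struct blk_plug outer, inner;

        blk_start_plug(&outer);   /* current->plug = &outer */
        submit_bio(bio1);         /* queued on the outer plug */
        blk_start_plug(&inner);   /* nested: current->plug stays &outer */
        submit_bio(bio2);         /* still queued on the outer plug */
        blk_finish_plug(&inner);  /* flushes nothing, inner lists are empty */
        blk_finish_plug(&outer);  /* the real flush happens here */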

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
---
 block/blk-core.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index fd154b9..d51ed61 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3031,21 +3031,20 @@ void blk_start_plug(struct blk_plug *plug)
 {
 	struct task_struct *tsk = current;
 
+	/*
+	 * If this is a nested plug, don't actually assign it.
+	 */
+	if (tsk->plug)
+		return;
+
 	INIT_LIST_HEAD(&plug->list);
 	INIT_LIST_HEAD(&plug->mq_list);
 	INIT_LIST_HEAD(&plug->cb_list);
-
 	/*
-	 * If this is a nested plug, don't actually assign it. It will be
-	 * flushed on its own.
+	 * Store ordering should not be needed here, since a potential
+	 * preempt will imply a full memory barrier
 	 */
-	if (!tsk->plug) {
-		/*
-		 * Store ordering should not be needed here, since a potential
-		 * preempt will imply a full memory barrier
-		 */
-		tsk->plug = plug;
-	}
+	tsk->plug = plug;
 }
 EXPORT_SYMBOL(blk_start_plug);
 
@@ -3192,10 +3191,11 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 void blk_finish_plug(struct blk_plug *plug)
 {
+	if (plug != current->plug)
+		return;
 	blk_flush_plug_list(plug, false);
 
-	if (plug == current->plug)
-		current->plug = NULL;
+	current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
  2015-04-30 17:45 ` [PATCH 1/5] blk: clean up plug Shaohua Li
@ 2015-04-30 17:45 ` Shaohua Li
  2015-05-01 17:14   ` Christoph Hellwig
  2015-05-01 17:42   ` Jeff Moyer
  2015-04-30 17:45 ` [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request Shaohua Li
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 22+ messages in thread
From: Shaohua Li @ 2015-04-30 17:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, jmoyer, hch, neilb

A block plug callback could sleep, so we introduced the 'from_schedule'
parameter so that the corresponding drivers can distinguish a
schedule-time plug flush from a plug finish. Unfortunately
io_schedule_out still uses blk_flush_plug(). This causes the output
below (note, I added a might_sleep() in raid1_unplug to make it trigger
faster, but the problem exists regardless of that might_sleep). In
raid1/10, this can cause a deadlock.

This patch makes io_schedule_out always use blk_schedule_flush_plug.
This should only impact drivers (as far as I know, raid 1/10) which are
sensitive to the 'from_schedule' parameter.
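
For reference, a 'from_schedule'-aware unplug callback follows this
pattern (a sketch only; the function and helper names below are
placeholders, not real kernel symbols):

        static void example_unplug(struct blk_plug_cb *cb, bool from_schedule)
        {
                if (from_schedule) {
                        /*
                         * Flushed from the scheduler: we must not sleep,
                         * so hand the actual work to a helper thread.
                         */
                        example_wake_helper_thread(cb);
                        return;
                }
                /*
                 * Flushed from blk_finish_plug(): process context,
                 * blocking operations are allowed.
                 */
                example_issue_pending(cb);
        }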

[  370.817949] ------------[ cut here ]------------
[  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
[  370.817971] Modules linked in: raid1
[  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
[  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[  370.817993] Call Trace:
[  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
[  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
[  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
[  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
[  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
[  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
[  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
[  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
[  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
[  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
[  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
[  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
[  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
[  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
[  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
[  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
[  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
[  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
[  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
[  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
[  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
[  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
[  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
[  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
[  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
[  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
[  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
[  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
[  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
[  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
[  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
[  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
[  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
[  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
[  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
[  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
[  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818132] ---[ end trace 7b4deb71e68b6605 ]---

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 kernel/sched/core.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a8..fef7eb2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4397,21 +4397,17 @@ EXPORT_SYMBOL_GPL(yield_to);
  */
 long __sched io_schedule_timeout(long timeout)
 {
-	int old_iowait = current->in_iowait;
 	struct rq *rq;
 	long ret;
 
 	current->in_iowait = 1;
-	if (old_iowait)
-		blk_schedule_flush_plug(current);
-	else
-		blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();
 	atomic_inc(&rq->nr_iowait);
 	ret = schedule_timeout(timeout);
-	current->in_iowait = old_iowait;
+	current->in_iowait = 0;
 	atomic_dec(&rq->nr_iowait);
 	delayacct_blkio_end();
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request
  2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
  2015-04-30 17:45 ` [PATCH 1/5] blk: clean up plug Shaohua Li
  2015-04-30 17:45 ` [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out Shaohua Li
@ 2015-04-30 17:45 ` Shaohua Li
  2015-05-01 17:16   ` Christoph Hellwig
  2015-04-30 17:45 ` [PATCH 4/5] blk-mq: do limited block plug for multiple queue case Shaohua Li
  2015-04-30 17:45 ` [PATCH 5/5] blk-mq: make plug work for multiple disks and queues Shaohua Li
  4 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-30 17:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, jmoyer, hch, neilb

From: Jeff Moyer <jmoyer@redhat.com>

The following appears in blk_sq_make_request:

	/*
	 * If we have multiple hardware queues, just go directly to
	 * one of those for sync IO.
	 */

We clearly don't have multiple hardware queues here!  This comment was
introduced by commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue):

    We want slightly different behavior from them:

    - On single queue devices, we currently use the per-process plug
      for deferred IO and for merging.

    - On multi queue devices, we don't use the per-process plug, but
      we want to go straight to hardware for SYNC IO.

The old code had this:

        use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

	use_plug = !is_flush_fua && !is_sync;

which is not equivalent.  For the single queue case, the second half of
the old && expression is always true.  So, what I think was actually intended
follows (and it more closely matches what is done in blk_queue_bio).
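
In other words, since blk_sq_make_request only runs with a single
hardware queue (q->nr_hw_queues == 1), the old expression reduces to:

        use_plug = !is_flush_fua;

i.e. sync I/O was plugged too on single queue devices, and that is the
behavior the change below restores.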

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ade8a2d..7f9d3a1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1309,16 +1309,11 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 {
 	const int is_sync = rw_is_sync(bio->bi_rw);
 	const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
-	unsigned int use_plug, request_count = 0;
+	struct blk_plug *plug;
+	unsigned int request_count = 0;
 	struct blk_map_ctx data;
 	struct request *rq;
 
-	/*
-	 * If we have multiple hardware queues, just go directly to
-	 * one of those for sync IO.
-	 */
-	use_plug = !is_flush_fua && !is_sync;
-
 	blk_queue_bounce(q, &bio);
 
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
@@ -1326,7 +1321,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
-	if (use_plug && !blk_queue_nomerges(q) &&
+	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
 	    blk_attempt_plug_merge(q, bio, &request_count))
 		return;
 
@@ -1345,21 +1340,18 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	 * utilize that to temporarily store requests until the task is
 	 * either done or scheduled away.
 	 */
-	if (use_plug) {
-		struct blk_plug *plug = current->plug;
-
-		if (plug) {
-			blk_mq_bio_to_request(rq, bio);
-			if (list_empty(&plug->mq_list))
-				trace_block_plug(q);
-			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
-				trace_block_plug(q);
-			}
-			list_add_tail(&rq->queuelist, &plug->mq_list);
-			blk_mq_put_ctx(data.ctx);
-			return;
+	plug = current->plug;
+	if (plug) {
+		blk_mq_bio_to_request(rq, bio);
+		if (list_empty(&plug->mq_list))
+			trace_block_plug(q);
+		else if (request_count >= BLK_MAX_REQUEST_COUNT) {
+			blk_flush_plug_list(plug, false);
+			trace_block_plug(q);
 		}
+		list_add_tail(&rq->queuelist, &plug->mq_list);
+		blk_mq_put_ctx(data.ctx);
+		return;
 	}
 
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
  2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
                   ` (2 preceding siblings ...)
  2015-04-30 17:45 ` [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request Shaohua Li
@ 2015-04-30 17:45 ` Shaohua Li
  2015-05-01 20:16   ` Jeff Moyer
  2015-04-30 17:45 ` [PATCH 5/5] blk-mq: make plug work for multiple disks and queues Shaohua Li
  4 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-30 17:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, jmoyer, hch, neilb

Plugging is still helpful for workloads with IO merging, but it can be
harmful otherwise, especially with multiple hardware queues, as there
is (supposedly) no lock contention in this case and plugging can
introduce latency. For multiple queues, we do limited plugging, i.e.
plug only if there is a request merge. If a request doesn't merge with
the following request, the request will be dispatched immediately.

This also fixes a bug. If we directly issue a request and it fails, we
fall back to blk_mq_merge_queue_io(). But we already assigned the bio
to a request in blk_mq_bio_to_request, so blk_mq_merge_queue_io
shouldn't run blk_mq_bio_to_request again.
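
The resulting flow in the multiqueue submit path is roughly the
following (pseudocode; see the diff for the real code):

        if (plug) {
                old_rq = request already on plug->mq_list, if any;
                remove old_rq from the plug list;
                add the new request to plug->mq_list;
                if (old_rq)
                        issue old_rq directly;  /* it did not get merged */
        } else if (is_sync) {
                issue the new request directly;
        }

So the plug list holds at most one request, and a request is only held
back for as long as it still has a chance to be merged.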

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c | 81 ++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f9d3a1..5545828 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1224,6 +1224,38 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	return rq;
 }
 
+static int blk_mq_direct_issue_request(struct request *rq)
+{
+	int ret;
+	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
+			rq->mq_ctx->cpu);
+	struct blk_mq_queue_data bd = {
+		.rq = rq,
+		.list = NULL,
+		.last = 1
+	};
+
+	/*
+	 * For OK queue, we are done. For error, kill it. Any other
+	 * error (busy), just add it to our list as we previously
+	 * would have done
+	 */
+	ret = q->mq_ops->queue_rq(hctx, &bd);
+	if (ret == BLK_MQ_RQ_QUEUE_OK)
+		return 0;
+	else {
+		__blk_mq_requeue_request(rq);
+
+		if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
+			rq->errors = -EIO;
+			blk_mq_end_request(rq, rq->errors);
+			return 0;
+		}
+		return -1;
+	}
+}
+
 /*
  * Multiple hardware queue variant. This will not use per-process plugs,
  * but will attempt to bypass the hctx queueing if we can go straight to
@@ -1235,6 +1267,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
 	struct blk_map_ctx data;
 	struct request *rq;
+	unsigned int request_count = 0;
+	struct blk_plug *plug;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1243,6 +1277,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
+	    blk_attempt_plug_merge(q, bio, &request_count))
+		return;
+
 	rq = blk_mq_map_request(q, bio, &data);
 	if (unlikely(!rq))
 		return;
@@ -1253,38 +1291,38 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto run_queue;
 	}
 
+	plug = current->plug;
 	/*
 	 * If the driver supports defer issued based on 'last', then
 	 * queue it up like normal since we can potentially save some
 	 * CPU this way.
 	 */
-	if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
-		struct blk_mq_queue_data bd = {
-			.rq = rq,
-			.list = NULL,
-			.last = 1
-		};
-		int ret;
+	if ((plug || is_sync) && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
-		 * For OK queue, we are done. For error, kill it. Any other
-		 * error (busy), just add it to our list as we previously
-		 * would have done
+		 * we do limited pluging. If bio can be merged, do merge.
+		 * Otherwise the existing request in the plug list will be
+		 * issued. So the plug list will have one request at most
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, &bd);
-		if (ret == BLK_MQ_RQ_QUEUE_OK)
-			goto done;
-		else {
-			__blk_mq_requeue_request(rq);
-
-			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
-				rq->errors = -EIO;
-				blk_mq_end_request(rq, rq->errors);
-				goto done;
+		if (plug) {
+			if (!list_empty(&plug->mq_list)) {
+				old_rq = list_first_entry(&plug->mq_list,
+					struct request, queuelist);
+				list_del_init(&old_rq->queuelist);
 			}
-		}
+			list_add_tail(&rq->queuelist, &plug->mq_list);
+		} else /* is_sync */
+			old_rq = rq;
+		blk_mq_put_ctx(data.ctx);
+		if (!old_rq)
+			return;
+		if (!blk_mq_direct_issue_request(old_rq))
+			return;
+		blk_mq_insert_request(old_rq, false, true, true);
+		return;
 	}
 
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
@@ -1297,7 +1335,6 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
-done:
 	blk_mq_put_ctx(data.ctx);
 }
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] blk-mq: make plug work for multiple disks and queues
  2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
                   ` (3 preceding siblings ...)
  2015-04-30 17:45 ` [PATCH 4/5] blk-mq: do limited block plug for multiple queue case Shaohua Li
@ 2015-04-30 17:45 ` Shaohua Li
  2015-05-01 20:55   ` Jeff Moyer
  4 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-04-30 17:45 UTC (permalink / raw)
  To: linux-kernel; +Cc: axboe, jmoyer, hch, neilb

The last patch makes plugging work for the multiple queue case.
However, it only works for the single disk case, because it assumes
there is only one request in the plug list. If a task is accessing
multiple disks, e.g. MD/DM, the assumption is wrong. Let
blk_attempt_plug_merge() record a request from the same queue.
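
For example, with a two-disk MD array the plug list can legitimately
hold one request per member disk:

        plug->mq_list:  rq_A (rq->q == disk A's queue)
                        rq_B (rq->q == disk B's queue)

so blindly taking list_first_entry() can pick a request that belongs to
a different queue than the bio being submitted. Recording same_queue_rq
in blk_attempt_plug_merge() avoids that.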

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c | 10 +++++++---
 block/blk-mq.c   | 11 ++++++-----
 block/blk.h      |  3 ++-
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d51ed61..a5e1574 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1521,7 +1521,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
  * Caller must ensure !blk_queue_nomerges(q) beforehand.
  */
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-			    unsigned int *request_count)
+			    unsigned int *request_count,
+			    struct request **same_queue_rq)
 {
 	struct blk_plug *plug;
 	struct request *rq;
@@ -1541,8 +1542,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	list_for_each_entry_reverse(rq, plug_list, queuelist) {
 		int el_ret;
 
-		if (rq->q == q)
+		if (rq->q == q) {
 			(*request_count)++;
+			*same_queue_rq = rq;
+		}
 
 		if (rq->q != q || !blk_rq_merge_ok(rq, bio))
 			continue;
@@ -1583,6 +1586,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	int el_ret, rw_flags, where = ELEVATOR_INSERT_SORT;
 	struct request *req;
 	unsigned int request_count = 0;
+	struct request *same_queue_rq;
 
 	/*
 	 * low level driver can indicate that it wants pages above a
@@ -1607,7 +1611,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * any locks.
 	 */
 	if (!blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &request_count))
+	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
 
 	spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5545828..c45739c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1269,6 +1269,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct request *rq;
 	unsigned int request_count = 0;
 	struct blk_plug *plug;
+	struct request *same_queue_rq = NULL;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1278,7 +1279,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &request_count))
+	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
 
 	rq = blk_mq_map_request(q, bio, &data);
@@ -1308,9 +1309,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		 * issued. So the plug list will have one request at most
 		 */
 		if (plug) {
-			if (!list_empty(&plug->mq_list)) {
-				old_rq = list_first_entry(&plug->mq_list,
-					struct request, queuelist);
+			if (same_queue_rq) {
+				old_rq = same_queue_rq;
 				list_del_init(&old_rq->queuelist);
 			}
 			list_add_tail(&rq->queuelist, &plug->mq_list);
@@ -1350,6 +1350,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	unsigned int request_count = 0;
 	struct blk_map_ctx data;
 	struct request *rq;
+	struct request *same_queue_rq;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1359,7 +1360,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &request_count))
+	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
 
 	rq = blk_mq_map_request(q, bio, &data);
diff --git a/block/blk.h b/block/blk.h
index 43b0361..aa8633c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -78,7 +78,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 			    struct bio *bio);
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-			    unsigned int *request_count);
+			    unsigned int *request_count,
+			    struct request **same_queue_rq);
 
 void blk_account_io_start(struct request *req, bool new_io);
 void blk_account_io_completion(struct request *req, unsigned int bytes);
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] blk: clean up plug
  2015-04-30 17:45 ` [PATCH 1/5] blk: clean up plug Shaohua Li
@ 2015-05-01 17:11   ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2015-05-01 17:11 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, jmoyer, hch, neilb

On Thu, Apr 30, 2015 at 10:45:14AM -0700, Shaohua Li wrote:
> The current code makes it look like an inner plug gets flushed with
> blk_finish_plug(). Actually it's a nop: all requests/callbacks are
> added to current->plug, while only the outermost plug is assigned to
> current->plug. So an inner plug always has an empty request/callback
> list, which makes blk_flush_plug_list() a nop. This patch tries to
> make the code clearer.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-04-30 17:45 ` [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out Shaohua Li
@ 2015-05-01 17:14   ` Christoph Hellwig
  2015-05-01 18:05     ` Shaohua Li
  2015-05-01 17:42   ` Jeff Moyer
  1 sibling, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2015-05-01 17:14 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, jmoyer, hch, neilb

On Thu, Apr 30, 2015 at 10:45:15AM -0700, Shaohua Li wrote:
>  long __sched io_schedule_timeout(long timeout)
>  {
> -	int old_iowait = current->in_iowait;
>  	struct rq *rq;
>  	long ret;
>  
>  	current->in_iowait = 1;
> -	if (old_iowait)
> -		blk_schedule_flush_plug(current);
> -	else
> -		blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  
>  	delayacct_blkio_start();
>  	rq = raw_rq();
>  	atomic_inc(&rq->nr_iowait);
>  	ret = schedule_timeout(timeout);
> -	current->in_iowait = old_iowait;
> +	current->in_iowait = 0;

Always clearing ->in_iowait is a behavior change not mentioned in the
changelog.  Even if it was intentional, it would be better done in
a separate patch.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request
  2015-04-30 17:45 ` [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request Shaohua Li
@ 2015-05-01 17:16   ` Christoph Hellwig
  2015-05-01 17:47     ` Jeff Moyer
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2015-05-01 17:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, jmoyer, hch, neilb

> -	if (use_plug && !blk_queue_nomerges(q) &&
> +	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&

Please don't sprinkle likely annotations for no good reason.  Especially
on metadata-write-heavy workloads (e.g. an NFS server) it might be very
likely.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-04-30 17:45 ` [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out Shaohua Li
  2015-05-01 17:14   ` Christoph Hellwig
@ 2015-05-01 17:42   ` Jeff Moyer
  2015-05-01 18:07     ` Jeff Moyer
  1 sibling, 1 reply; 22+ messages in thread
From: Jeff Moyer @ 2015-05-01 17:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, hch, neilb

Shaohua Li <shli@fb.com> writes:

> A block plug callback could sleep, so we introduced the 'from_schedule'
> parameter so that the corresponding drivers can distinguish a
> schedule-time plug flush from a plug finish. Unfortunately
> io_schedule_out still uses blk_flush_plug(). This causes the output
> below (note, I added a might_sleep() in raid1_unplug to make it trigger
> faster, but the problem exists regardless of that might_sleep). In
> raid1/10, this can cause a deadlock.
>
> This patch makes io_schedule_out always use blk_schedule_flush_plug.
> This should only impact drivers (as far as I know, raid 1/10) which are
> sensitive to the 'from_schedule' parameter.

Why wouldn't you change io_schedule_timeout to use blk_flush_plug_list
instead?

Cheers,
Jeff

>
> [  370.817949] ------------[ cut here ]------------
> [  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
> [  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
> [  370.817971] Modules linked in: raid1
> [  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
> [  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
> [  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
> [  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
> [  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
> [  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
> [  370.817993] Call Trace:
> [  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
> [  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
> [  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
> [  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
> [  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
> [  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
> [  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
> [  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
> [  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
> [  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
> [  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
> [  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
> [  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
> [  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
> [  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
> [  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
> [  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
> [  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
> [  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
> [  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
> [  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
> [  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
> [  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
> [  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
> [  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
> [  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
> [  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
> [  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
> [  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
> [  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
> [  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
> [  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
> [  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
> [  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
> [  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
> [  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
> [  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
> [  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
> [  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
> [  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
> [  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
> [  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
> [  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
> [  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
> [  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
> [  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
> [  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
> [  370.818132] ---[ end trace 7b4deb71e68b6605 ]---
>
> Cc: NeilBrown <neilb@suse.de>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  kernel/sched/core.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f9123a8..fef7eb2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4397,21 +4397,17 @@ EXPORT_SYMBOL_GPL(yield_to);
>   */
>  long __sched io_schedule_timeout(long timeout)
>  {
> -	int old_iowait = current->in_iowait;
>  	struct rq *rq;
>  	long ret;
>  
>  	current->in_iowait = 1;
> -	if (old_iowait)
> -		blk_schedule_flush_plug(current);
> -	else
> -		blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  
>  	delayacct_blkio_start();
>  	rq = raw_rq();
>  	atomic_inc(&rq->nr_iowait);
>  	ret = schedule_timeout(timeout);
> -	current->in_iowait = old_iowait;
> +	current->in_iowait = 0;
>  	atomic_dec(&rq->nr_iowait);
>  	delayacct_blkio_end();

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request
  2015-05-01 17:16   ` Christoph Hellwig
@ 2015-05-01 17:47     ` Jeff Moyer
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Moyer @ 2015-05-01 17:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Shaohua Li, linux-kernel, axboe, neilb

Christoph Hellwig <hch@lst.de> writes:

>> -	if (use_plug && !blk_queue_nomerges(q) &&
>> +	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
>
> Please don't sprinkle likely annotations for no good reason.  Especially
> on metadata-write-heavy workloads (e.g. an NFS server) it might be very
> likely.

I have no problem getting rid of the likely, but I was mimicking the mq
case:

blk_mq_make_request:
        if (unlikely(is_flush_fua)) {
                blk_mq_bio_to_request(rq, bio);
                blk_insert_flush(rq);
                goto run_queue;
        }

So would you also want that unlikely removed?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-05-01 17:14   ` Christoph Hellwig
@ 2015-05-01 18:05     ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2015-05-01 18:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, axboe, jmoyer, neilb

On Fri, May 01, 2015 at 07:14:10PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 30, 2015 at 10:45:15AM -0700, Shaohua Li wrote:
> >  long __sched io_schedule_timeout(long timeout)
> >  {
> > -	int old_iowait = current->in_iowait;
> >  	struct rq *rq;
> >  	long ret;
> >  
> >  	current->in_iowait = 1;
> > -	if (old_iowait)
> > -		blk_schedule_flush_plug(current);
> > -	else
> > -		blk_flush_plug(current);
> > +	blk_schedule_flush_plug(current);
> >  
> >  	delayacct_blkio_start();
> >  	rq = raw_rq();
> >  	atomic_inc(&rq->nr_iowait);
> >  	ret = schedule_timeout(timeout);
> > -	current->in_iowait = old_iowait;
> > +	current->in_iowait = 0;
> 
> Always clearing ->in_iowait is behavior change not mentioned in the
> changelog.  Even if it was intentional it would better be done in
> a separate patch.

Commit 9cff8adeaa34b5d28 changed recursive io_schedule to use
blk_schedule_flush_plug, with old_iowait used for recursion detection.
Since we always use blk_schedule_flush_plug now, I thought we don't
need the recursion detection any more. I agree it's better to do it in
another patch.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-05-01 17:42   ` Jeff Moyer
@ 2015-05-01 18:07     ` Jeff Moyer
  2015-05-01 18:28       ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Moyer @ 2015-05-01 18:07 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, hch, neilb

Jeff Moyer <jmoyer@redhat.com> writes:

> Shaohua Li <shli@fb.com> writes:
>
>> A block plug callback could sleep, so we introduced the 'from_schedule'
>> parameter so that the corresponding drivers can distinguish a
>> schedule-time plug flush from a plug finish. Unfortunately
>> io_schedule_out still uses blk_flush_plug(). This causes the output
>> below (note, I added a might_sleep() in raid1_unplug to make it trigger
>> faster, but the problem exists regardless of that might_sleep). In
>> raid1/10, this can cause a deadlock.
>>
>> This patch makes io_schedule_out always use blk_schedule_flush_plug.
>> This should only impact drivers (as far as I know, raid 1/10) which are
>> sensitive to the 'from_schedule' parameter.
>
> Why wouldn't you change io_schedule_timeout to use blk_flush_plug_list
> instead?

Sorry, I did not fully engage my brain on this one.  I see that's
exactly what you did.

I'll add my reviewed-by once you've addressed hch's concern.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-05-01 18:07     ` Jeff Moyer
@ 2015-05-01 18:28       ` Shaohua Li
  2015-05-01 19:37         ` Jeff Moyer
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-05-01 18:28 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, axboe, hch, neilb

On Fri, May 01, 2015 at 02:07:42PM -0400, Jeff Moyer wrote:
> Jeff Moyer <jmoyer@redhat.com> writes:
> 
> > Shaohua Li <shli@fb.com> writes:
> >
> >> A block plug callback could sleep, so we introduced the 'from_schedule'
> >> parameter so that the corresponding drivers can distinguish a
> >> schedule-time plug flush from a plug finish. Unfortunately
> >> io_schedule_out still uses blk_flush_plug(). This causes the output
> >> below (note, I added a might_sleep() in raid1_unplug to make it trigger
> >> faster, but the problem exists regardless of that might_sleep). In
> >> raid1/10, this can cause a deadlock.
> >>
> >> This patch makes io_schedule_out always use blk_schedule_flush_plug.
> >> This should only impact drivers (as far as I know, raid 1/10) which are
> >> sensitive to the 'from_schedule' parameter.
> >
> > Why wouldn't you change io_schedule_timeout to use blk_flush_plug_list
> > instead?
> 
> Sorry, I did not fully engage my brain on this one.  I see that's
> exactly what you did.
> 
> I'll add my reviewed-by once you've addressed hch's concern.

Thanks! Updated patch.


From c7c6e5ab4071e9fc20e898a3e3741bf315449d59 Mon Sep 17 00:00:00 2001
Message-Id: <c7c6e5ab4071e9fc20e898a3e3741bf315449d59.1430504716.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Wed, 29 Apr 2015 12:12:03 -0700
Subject: [PATCH] sched: always use blk_schedule_flush_plug in io_schedule_out

A block plug callback could sleep, so we introduced the 'from_schedule'
parameter so that the corresponding drivers can distinguish a
schedule-time plug flush from a plug finish. Unfortunately
io_schedule_out still uses blk_flush_plug(). This causes the output
below (note, I added a might_sleep() in raid1_unplug to make it trigger
faster, but the problem exists regardless of that might_sleep). In
raid1/10, this can cause a deadlock.

This patch makes io_schedule_out always use blk_schedule_flush_plug.
This should only impact drivers (as far as I know, raid 1/10) which are
sensitive to the 'from_schedule' parameter.

[  370.817949] ------------[ cut here ]------------
[  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
[  370.817971] Modules linked in: raid1
[  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
[  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[  370.817993] Call Trace:
[  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
[  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
[  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
[  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
[  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
[  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
[  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
[  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
[  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
[  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
[  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
[  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
[  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
[  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
[  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
[  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
[  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
[  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
[  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
[  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
[  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
[  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
[  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
[  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
[  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
[  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
[  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
[  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
[  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
[  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
[  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
[  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
[  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
[  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
[  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
[  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
[  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
[  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
[  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
[  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
[  370.818132] ---[ end trace 7b4deb71e68b6605 ]---

V2: don't change ->in_iowait

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f9123a8..1ab3d2d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4402,10 +4402,7 @@ long __sched io_schedule_timeout(long timeout)
 	long ret;
 
 	current->in_iowait = 1;
-	if (old_iowait)
-		blk_schedule_flush_plug(current);
-	else
-		blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out
  2015-05-01 18:28       ` Shaohua Li
@ 2015-05-01 19:37         ` Jeff Moyer
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff Moyer @ 2015-05-01 19:37 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, hch, neilb

Shaohua Li <shli@fb.com> writes:

> V2: don't change ->in_iowait
>
> Cc: NeilBrown <neilb@suse.de>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  kernel/sched/core.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f9123a8..1ab3d2d 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4402,10 +4402,7 @@ long __sched io_schedule_timeout(long timeout)
>  	long ret;
>  
>  	current->in_iowait = 1;
> -	if (old_iowait)
> -		blk_schedule_flush_plug(current);
> -	else
> -		blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  
>  	delayacct_blkio_start();
>  	rq = raw_rq();

Looks good to me.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
  2015-04-30 17:45 ` [PATCH 4/5] blk-mq: do limited block plug for multiple queue case Shaohua Li
@ 2015-05-01 20:16   ` Jeff Moyer
  2015-05-04 19:40     ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Moyer @ 2015-05-01 20:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, hch, neilb

Shaohua Li <shli@fb.com> writes:

> Plugging is still helpful for workloads with IO merging, but it can be
> harmful otherwise, especially with multiple hardware queues, as there
> is (supposedly) no lock contention in this case and plugging can
> introduce latency. For multiple queues, we do limited plugging, i.e.
> plug only if there is a request merge. If a request doesn't merge with
> the following request, the request will be dispatched immediately.
>
> This also fixes a bug. If we directly issue a request and it fails, we
> fall back to blk_mq_merge_queue_io(). But we already assigned the bio
> to a request in blk_mq_bio_to_request, so blk_mq_merge_queue_io
> shouldn't run blk_mq_bio_to_request again.

Good catch.  Might've been better to split that out first for easy
backport to stable kernels, but I won't hold you to that.

> @@ -1243,6 +1277,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  		return;
>  	}
>  
> +	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
> +	    blk_attempt_plug_merge(q, bio, &request_count))
> +		return;
> +
>  	rq = blk_mq_map_request(q, bio, &data);
>  	if (unlikely(!rq))
>  		return;

After this patch, everything up to this point in blk_mq_make_request and
blk_sq_make_request is the same.  This can be factored out (in another
patch) to a common function.

> @@ -1253,38 +1291,38 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
>  		goto run_queue;
>  	}
>  
> +	plug = current->plug;
>  	/*
>  	 * If the driver supports defer issued based on 'last', then
>  	 * queue it up like normal since we can potentially save some
>  	 * CPU this way.
>  	 */
> -	if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> -		struct blk_mq_queue_data bd = {
> -			.rq = rq,
> -			.list = NULL,
> -			.last = 1
> -		};
> -		int ret;
> +	if ((plug || is_sync) && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> +		struct request *old_rq = NULL;

I would add a !blk_queue_nomerges(q) to that conditional.  There's no
point holding back an I/O when we won't merge it anyway.
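
Something along these lines (untested):

        if ((plug || is_sync) && !blk_queue_nomerges(q) &&
            !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {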

That brings up another quirk of the current implementation (not your
patches) that bugs me.

BLK_MQ_F_SHOULD_MERGE
QUEUE_FLAG_NOMERGES

Those two flags are set independently, one via the driver and the other
via a sysfs file.  So the user could set the nomerges flag to 1 or 2,
and still potentially get merges (see blk_mq_merge_queue_io).  That's
something that should be fixed, albeit that can wait.

>  		blk_mq_bio_to_request(rq, bio);
>  
>  		/*
> -		 * For OK queue, we are done. For error, kill it. Any other
> -		 * error (busy), just add it to our list as we previously
> -		 * would have done
> +		 * we do limited pluging. If bio can be merged, do merge.
> +		 * Otherwise the existing request in the plug list will be
> +		 * issued. So the plug list will have one request at most
>  		 */
> -		ret = q->mq_ops->queue_rq(data.hctx, &bd);
> -		if (ret == BLK_MQ_RQ_QUEUE_OK)
> -			goto done;
> -		else {
> -			__blk_mq_requeue_request(rq);
> -
> -			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
> -				rq->errors = -EIO;
> -				blk_mq_end_request(rq, rq->errors);
> -				goto done;
> +		if (plug) {
> +			if (!list_empty(&plug->mq_list)) {
> +				old_rq = list_first_entry(&plug->mq_list,
> +					struct request, queuelist);
> +				list_del_init(&old_rq->queuelist);
>  			}
> -		}
> +			list_add_tail(&rq->queuelist, &plug->mq_list);
> +		} else /* is_sync */
> +			old_rq = rq;
> +		blk_mq_put_ctx(data.ctx);
> +		if (!old_rq)
> +			return;
> +		if (!blk_mq_direct_issue_request(old_rq))
> +			return;
> +		blk_mq_insert_request(old_rq, false, true, true);
> +		return;
>  	}

Now there is no way to exit that if block; we always return.  It may be
worth considering moving that block to its own function, if you can think
of a good name for it.

Other than those minor issues, this looks good to me.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] blk-mq: make plug work for multiple disks and queues
  2015-04-30 17:45 ` [PATCH 5/5] blk-mq: make plug work for multiple disks and queues Shaohua Li
@ 2015-05-01 20:55   ` Jeff Moyer
  2015-05-04 19:44     ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff Moyer @ 2015-05-01 20:55 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, axboe, hch, neilb

Shaohua Li <shli@fb.com> writes:

> Last patch makes plug work for multiple queue case. However it only
> works for single disk case, because it assumes only one request in the
> plug list. If a task is accessing multiple disks, eg MD/DM, the
> assumption is wrong. Let blk_attempt_plug_merge() record request from
> the same queue.

I understand the desire to be performant, and that's why you
piggy-backed the same_queue_rq onto this function, but it sure looks
hackish.  I'd almost rather walk the list a second time instead of
adding warts like this.  To be perfectly clear, this is what I really don't
like about it: we're relying on the nuance that we will only add a
single request per queue to the plug_list in the mq case.  When you
start to look at what happens for the sq case, it doesn't make much
sense at all, as you'll just return the first entry in the list (last
one visited walking the list backwards) that has the same request queue.
That's fine if this code were mq specific, but it isn't.  It will just
lead to confusion for others reviewing the code, and may trip up anyone
modifying the mq plugging code.

I'll leave it up to Jens to decide if this is fit for inclusion.  The
patch /functionally/ looks fine to me.

> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  block/blk-core.c | 10 +++++++---
>  block/blk-mq.c   | 11 ++++++-----
>  block/blk.h      |  3 ++-
>  3 files changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d51ed61..a5e1574 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1521,7 +1521,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
>   * Caller must ensure !blk_queue_nomerges(q) beforehand.
>   */
>  bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
> -			    unsigned int *request_count)
> +			    unsigned int *request_count,
> +			    struct request **same_queue_rq)
>  {
>  	struct blk_plug *plug;
>  	struct request *rq;
> @@ -1541,8 +1542,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>  	list_for_each_entry_reverse(rq, plug_list, queuelist) {
>  		int el_ret;
>  
> -		if (rq->q == q)
> +		if (rq->q == q) {
>  			(*request_count)++;
> +			*same_queue_rq = rq;
> +		}

Out of the 3 callers of blk_attempt_plug_merge, only one will use the
result, yet all of them have to provide the argument.  How about just
handling NULL in there?
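
For example (sketch only):

        if (rq->q == q) {
                (*request_count)++;
                if (same_queue_rq)
                        *same_queue_rq = rq;
        }

so callers that don't care could simply pass NULL.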

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
  2015-05-01 20:16   ` Jeff Moyer
@ 2015-05-04 19:40     ` Shaohua Li
  2015-05-04 19:46       ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-05-04 19:40 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, axboe, hch, neilb

On Fri, May 01, 2015 at 04:16:04PM -0400, Jeff Moyer wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > Plugging is still helpful for workloads with IO merging, but it can be
> > harmful otherwise, especially with multiple hardware queues, as there
> > is (supposedly) no lock contention in this case and plugging can
> > introduce latency. For multiple queues, we do limited plugging, i.e.
> > plug only if there is a request merge. If a request doesn't merge with
> > the following request, the request will be dispatched immediately.
> >
> > This also fixes a bug. If we directly issue a request and it fails, we
> > fall back to blk_mq_merge_queue_io(). But we already assigned the bio
> > to a request in blk_mq_bio_to_request, so blk_mq_merge_queue_io
> > shouldn't run blk_mq_bio_to_request again.
> 
> Good catch.  Might've been better to split that out first for easy
> backport to stable kernels, but I won't hold you to that.

It's not a severe bug, but I don't mind. Jens, please let me know if I
should split the patch into 2 patches.
 
> > @@ -1243,6 +1277,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >  		return;
> >  	}
> >  
> > +	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
> > +	    blk_attempt_plug_merge(q, bio, &request_count))
> > +		return;
> > +
> >  	rq = blk_mq_map_request(q, bio, &data);
> >  	if (unlikely(!rq))
> >  		return;
> 
> After this patch, everything up to this point in blk_mq_make_request and
> blk_sq_make_request is the same.  This can be factored out (in another
> patch) to a common function.

I'll leave this for a separate cleanup if a good function name is found.

> > @@ -1253,38 +1291,38 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >  		goto run_queue;
> >  	}
> >  
> > +	plug = current->plug;
> >  	/*
> >  	 * If the driver supports defer issued based on 'last', then
> >  	 * queue it up like normal since we can potentially save some
> >  	 * CPU this way.
> >  	 */
> > -	if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > -		struct blk_mq_queue_data bd = {
> > -			.rq = rq,
> > -			.list = NULL,
> > -			.last = 1
> > -		};
> > -		int ret;
> > +	if ((plug || is_sync) && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > +		struct request *old_rq = NULL;
> 
> I would add a !blk_queue_nomerges(q) to that conditional.  There's no
> point holding back an I/O when we won't merge it anyway.

Good catch! Fixed.
 
> That brings up another quirk of the current implementation (not your
> patches) that bugs me.
> 
> BLK_MQ_F_SHOULD_MERGE
> QUEUE_FLAG_NOMERGES
> 
> Those two flags are set independently, one via the driver and the other
> via a sysfs file.  So the user could set the nomerges flag to 1 or 2,
> and still potentially get merges (see blk_mq_merge_queue_io).  That's
> something that should be fixed, albeit that can wait.

Agreed.

> >  		blk_mq_bio_to_request(rq, bio);
> >  
> >  		/*
> > -		 * For OK queue, we are done. For error, kill it. Any other
> > -		 * error (busy), just add it to our list as we previously
> > -		 * would have done
> > +		 * we do limited pluging. If bio can be merged, do merge.
> > +		 * Otherwise the existing request in the plug list will be
> > +		 * issued. So the plug list will have one request at most
> >  		 */
> > -		ret = q->mq_ops->queue_rq(data.hctx, &bd);
> > -		if (ret == BLK_MQ_RQ_QUEUE_OK)
> > -			goto done;
> > -		else {
> > -			__blk_mq_requeue_request(rq);
> > -
> > -			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
> > -				rq->errors = -EIO;
> > -				blk_mq_end_request(rq, rq->errors);
> > -				goto done;
> > +		if (plug) {
> > +			if (!list_empty(&plug->mq_list)) {
> > +				old_rq = list_first_entry(&plug->mq_list,
> > +					struct request, queuelist);
> > +				list_del_init(&old_rq->queuelist);
> >  			}
> > -		}
> > +			list_add_tail(&rq->queuelist, &plug->mq_list);
> > +		} else /* is_sync */
> > +			old_rq = rq;
> > +		blk_mq_put_ctx(data.ctx);
> > +		if (!old_rq)
> > +			return;
> > +		if (!blk_mq_direct_issue_request(old_rq))
> > +			return;
> > +		blk_mq_insert_request(old_rq, false, true, true);
> > +		return;
> >  	}
> 
> Now there is no way to exit that if block; we always return.  It may be
> worth considering moving that block to its own function, if you can think
> of a good name for it.

I'll leave this for later work, but just to illustrate:
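
Roughly it would be something like the sketch below
(blk_mq_plug_or_direct_issue() is a made-up name; blk_mq_bio_to_request()
stays in the caller and the blk_mq_put_ctx() handling is ignored here, so
this is only an illustration):

#include <linux/blkdev.h>
#include <linux/blk-mq.h>

/*
 * Hypothetical helper for the "keep at most one request plugged, or
 * issue directly" tail of blk_mq_make_request().
 */
static void blk_mq_plug_or_direct_issue(struct blk_plug *plug,
					struct request *rq)
{
	struct request *old_rq = NULL;

	if (plug) {
		/* Displace whatever was plugged before and plug rq instead. */
		if (!list_empty(&plug->mq_list)) {
			old_rq = list_first_entry(&plug->mq_list,
						  struct request, queuelist);
			list_del_init(&old_rq->queuelist);
		}
		list_add_tail(&rq->queuelist, &plug->mq_list);
	} else {
		/* No plug: this is a sync request, issue it right away. */
		old_rq = rq;
	}

	if (!old_rq)
		return;
	/* Fall back to a normal insert if direct issue reports busy. */
	if (blk_mq_direct_issue_request(old_rq))
		blk_mq_insert_request(old_rq, false, true, true);
}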

> Other than those minor issues, this looks good to me.

Thanks for your time!


From b2f2f6fbf72e4b80dffbce3ada6a151754407044 Mon Sep 17 00:00:00 2001
Message-Id: <b2f2f6fbf72e4b80dffbce3ada6a151754407044.1430766392.git.shli@fb.com>
In-Reply-To: <f3bfe60a013827942790c89d658f63c920653437.1430766392.git.shli@fb.com>
References: <f3bfe60a013827942790c89d658f63c920653437.1430766392.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Wed, 29 Apr 2015 16:45:40 -0700
Subject: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case

Plugging is still helpful for workloads with IO merges, but it can be
harmful otherwise, especially with multiple hardware queues, as there is
(supposedly) no lock contention in this case and plugging can introduce
latency. For multiple queues, we do limited plugging, e.g. plug only if
there is a request merge. If a request doesn't merge with the following
request, the request will be dispatched immediately.

This also fixes a bug. If we directly issue a request and it fails, we
fall back to blk_mq_merge_queue_io(). But we have already assigned the
bio to a request in blk_mq_bio_to_request(); blk_mq_merge_queue_io()
shouldn't run blk_mq_bio_to_request() again.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c | 82 ++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f9d3a1..6a6b6d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1224,6 +1224,38 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	return rq;
 }
 
+static int blk_mq_direct_issue_request(struct request *rq)
+{
+	int ret;
+	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
+			rq->mq_ctx->cpu);
+	struct blk_mq_queue_data bd = {
+		.rq = rq,
+		.list = NULL,
+		.last = 1
+	};
+
+	/*
+	 * For OK queue, we are done. For error, kill it. Any other
+	 * error (busy), just add it to our list as we previously
+	 * would have done
+	 */
+	ret = q->mq_ops->queue_rq(hctx, &bd);
+	if (ret == BLK_MQ_RQ_QUEUE_OK)
+		return 0;
+	else {
+		__blk_mq_requeue_request(rq);
+
+		if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
+			rq->errors = -EIO;
+			blk_mq_end_request(rq, rq->errors);
+			return 0;
+		}
+		return -1;
+	}
+}
+
 /*
  * Multiple hardware queue variant. This will not use per-process plugs,
  * but will attempt to bypass the hctx queueing if we can go straight to
@@ -1235,6 +1267,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
 	struct blk_map_ctx data;
 	struct request *rq;
+	unsigned int request_count = 0;
+	struct blk_plug *plug;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1243,6 +1277,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
+	    blk_attempt_plug_merge(q, bio, &request_count))
+		return;
+
 	rq = blk_mq_map_request(q, bio, &data);
 	if (unlikely(!rq))
 		return;
@@ -1253,38 +1291,39 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto run_queue;
 	}
 
+	plug = current->plug;
 	/*
 	 * If the driver supports defer issued based on 'last', then
 	 * queue it up like normal since we can potentially save some
 	 * CPU this way.
 	 */
-	if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
-		struct blk_mq_queue_data bd = {
-			.rq = rq,
-			.list = NULL,
-			.last = 1
-		};
-		int ret;
+	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
+	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
-		 * For OK queue, we are done. For error, kill it. Any other
-		 * error (busy), just add it to our list as we previously
-		 * would have done
+		 * we do limited plugging. If bio can be merged, do merge.
+		 * Otherwise the existing request in the plug list will be
+		 * issued. So the plug list will have one request at most
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, &bd);
-		if (ret == BLK_MQ_RQ_QUEUE_OK)
-			goto done;
-		else {
-			__blk_mq_requeue_request(rq);
-
-			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
-				rq->errors = -EIO;
-				blk_mq_end_request(rq, rq->errors);
-				goto done;
+		if (plug) {
+			if (!list_empty(&plug->mq_list)) {
+				old_rq = list_first_entry(&plug->mq_list,
+					struct request, queuelist);
+				list_del_init(&old_rq->queuelist);
 			}
-		}
+			list_add_tail(&rq->queuelist, &plug->mq_list);
+		} else /* is_sync */
+			old_rq = rq;
+		blk_mq_put_ctx(data.ctx);
+		if (!old_rq)
+			return;
+		if (!blk_mq_direct_issue_request(old_rq))
+			return;
+		blk_mq_insert_request(old_rq, false, true, true);
+		return;
 	}
 
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
@@ -1297,7 +1336,6 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
-done:
 	blk_mq_put_ctx(data.ctx);
 }
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] blk-mq: make plug work for mutiple disks and queues
  2015-05-01 20:55   ` Jeff Moyer
@ 2015-05-04 19:44     ` Shaohua Li
  0 siblings, 0 replies; 22+ messages in thread
From: Shaohua Li @ 2015-05-04 19:44 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, axboe, hch, neilb

On Fri, May 01, 2015 at 04:55:39PM -0400, Jeff Moyer wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > The last patch makes plugging work for the multiple queue case. However,
> > it only works for the single disk case, because it assumes there is only
> > one request in the plug list. If a task is accessing multiple disks,
> > e.g. MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record
> > a request from the same queue.
> 
> I understand the desire to be performant, and that's why you
> piggy-backed the same_queue_rq onto this function, but it sure looks
> hackish.  I'd almost rather walk the list a second time instead of add
> warts like this.  To be perfectly clear, this is what I really don't
> like about it: we're relying on the nuance that we will only add a
> single request per queue to the plug_list in the mq case.  When you
> start to look at what happens for the sq case, it doesn't make much
> sense at all, as you'll just return the first entry in the list (last
> one visited walking the list backwards) that has the same request queue.
> That's fine if this code were mq specific, but it isn't.  It will just
> lead to confusion for others reviewing the code, and may trip up anyone
> modifying the mq plugging code.
> 
> I'll leave it up to Jens to decide if this is fit for inclusion.  The
> patch /functionally/ looks fine to me.

I really don't want to walk the list a second time, for performance
reasons. I've added some comments in the code; hopefully that makes it
clearer.
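
Just to show the trade-off, the re-walk variant would look roughly like
this (helper name made up, sketch only); it adds another walk over the
plug list for every submitted bio:

#include <linux/blkdev.h>

/*
 * Sketch of the alternative: find the plugged request from the same
 * queue with a second walk instead of having blk_attempt_plug_merge()
 * record it.
 */
static struct request *blk_plug_find_same_queue_rq(struct blk_plug *plug,
						   struct request_queue *q)
{
	struct request *rq;

	list_for_each_entry_reverse(rq, &plug->mq_list, queuelist) {
		if (rq->q == q)
			return rq;
	}

	return NULL;
}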
 
> > Cc: Jens Axboe <axboe@fb.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  block/blk-core.c | 10 +++++++---
> >  block/blk-mq.c   | 11 ++++++-----
> >  block/blk.h      |  3 ++-
> >  3 files changed, 15 insertions(+), 9 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index d51ed61..a5e1574 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1521,7 +1521,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
> >   * Caller must ensure !blk_queue_nomerges(q) beforehand.
> >   */
> >  bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
> > -			    unsigned int *request_count)
> > +			    unsigned int *request_count,
> > +			    struct request **same_queue_rq)
> >  {
> >  	struct blk_plug *plug;
> >  	struct request *rq;
> > @@ -1541,8 +1542,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
> >  	list_for_each_entry_reverse(rq, plug_list, queuelist) {
> >  		int el_ret;
> >  
> > -		if (rq->q == q)
> > +		if (rq->q == q) {
> >  			(*request_count)++;
> > +			*same_queue_rq = rq;
> > +		}
> 
> Out of the 3 callers of blk_attempt_plug_merge, only one will use the
> result, yet all of them have to provide the argument.  How about just
> handling NULL in there?

OK, fixed. The updated patch below also fixes a bug that was in the
previous version.


From 1218a50d39dbd11d0fbd15547516d1a4a92df0b5 Mon Sep 17 00:00:00 2001
Message-Id: <1218a50d39dbd11d0fbd15547516d1a4a92df0b5.1430766392.git.shli@fb.com>
In-Reply-To: <f3bfe60a013827942790c89d658f63c920653437.1430766392.git.shli@fb.com>
References: <f3bfe60a013827942790c89d658f63c920653437.1430766392.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Wed, 29 Apr 2015 16:58:20 -0700
Subject: [PATCH 5/5] blk-mq: make plug work for mutiple disks and queues

The last patch makes plugging work for the multiple queue case. However,
it only works for the single disk case, because it assumes there is only
one request in the plug list. If a task is accessing multiple disks,
e.g. MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record
a request from the same queue.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-core.c | 15 ++++++++++++---
 block/blk-mq.c   | 12 ++++++++----
 block/blk.h      |  3 ++-
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d51ed61..503927e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1521,7 +1521,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
  * Caller must ensure !blk_queue_nomerges(q) beforehand.
  */
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-			    unsigned int *request_count)
+			    unsigned int *request_count,
+			    struct request **same_queue_rq)
 {
 	struct blk_plug *plug;
 	struct request *rq;
@@ -1541,8 +1542,16 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	list_for_each_entry_reverse(rq, plug_list, queuelist) {
 		int el_ret;
 
-		if (rq->q == q)
+		if (rq->q == q) {
 			(*request_count)++;
+			/*
+			 * Only blk-mq multiple hardware queues case checks the
+			 * rq in the same queue, there should be only one such
+			 * rq in a queue
+			 */
+			if (same_queue_rq)
+				*same_queue_rq = rq;
+		}
 
 		if (rq->q != q || !blk_rq_merge_ok(rq, bio))
 			continue;
@@ -1607,7 +1616,7 @@ void blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * any locks.
 	 */
 	if (!blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &request_count))
+	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return;
 
 	spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a6b6d0..be1169f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1269,6 +1269,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	struct request *rq;
 	unsigned int request_count = 0;
 	struct blk_plug *plug;
+	struct request *same_queue_rq = NULL;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1278,7 +1279,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &request_count))
+	    blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
 		return;
 
 	rq = blk_mq_map_request(q, bio, &data);
@@ -1309,9 +1310,12 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		 * issued. So the plug list will have one request at most
 		 */
 		if (plug) {
+			/*
+			 * The plug list might get flushed before this. If that
+			 * happens, same_queue_rq is invalid and plug list is empty
+			 */
 			if (!list_empty(&plug->mq_list)) {
-				old_rq = list_first_entry(&plug->mq_list,
-					struct request, queuelist);
+				old_rq = same_queue_rq;
 				list_del_init(&old_rq->queuelist);
 			}
 			list_add_tail(&rq->queuelist, &plug->mq_list);
@@ -1360,7 +1364,7 @@ static void blk_sq_make_request(struct request_queue *q, struct bio *bio)
 	}
 
 	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
-	    blk_attempt_plug_merge(q, bio, &request_count))
+	    blk_attempt_plug_merge(q, bio, &request_count, NULL))
 		return;
 
 	rq = blk_mq_map_request(q, bio, &data);
diff --git a/block/blk.h b/block/blk.h
index 43b0361..aa8633c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -78,7 +78,8 @@ bool bio_attempt_front_merge(struct request_queue *q, struct request *req,
 bool bio_attempt_back_merge(struct request_queue *q, struct request *req,
 			    struct bio *bio);
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-			    unsigned int *request_count);
+			    unsigned int *request_count,
+			    struct request **same_queue_rq);
 
 void blk_account_io_start(struct request *req, bool new_io);
 void blk_account_io_completion(struct request *req, unsigned int bytes);
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
  2015-05-04 19:40     ` Shaohua Li
@ 2015-05-04 19:46       ` Jens Axboe
  2015-05-04 20:33         ` Shaohua Li
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2015-05-04 19:46 UTC (permalink / raw)
  To: Shaohua Li, Jeff Moyer; +Cc: linux-kernel, hch, neilb

On 05/04/2015 01:40 PM, Shaohua Li wrote:
> On Fri, May 01, 2015 at 04:16:04PM -0400, Jeff Moyer wrote:
>> Shaohua Li <shli@fb.com> writes:
>>
>>> Plugging is still helpful for workloads with IO merges, but it can be
>>> harmful otherwise, especially with multiple hardware queues, as there is
>>> (supposedly) no lock contention in this case and plugging can introduce
>>> latency. For multiple queues, we do limited plugging, e.g. plug only if
>>> there is a request merge. If a request doesn't merge with the following
>>> request, the request will be dispatched immediately.
>>>
>>> This also fixes a bug. If we directly issue a request and it fails, we
>>> fall back to blk_mq_merge_queue_io(). But we have already assigned the
>>> bio to a request in blk_mq_bio_to_request(); blk_mq_merge_queue_io()
>>> shouldn't run blk_mq_bio_to_request() again.
>>
>> Good catch.  Might've been better to split that out first for easy
>> backport to stable kernels, but I won't hold you to that.
>
> It's not a severe bug, but I don't mind. Jens, please let me know if I
> should split the patch into 2 patches.

I don't care that much for this particular case. But since one/more of 
the others need respin anyway, might be prudent to split it up in any case.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
  2015-05-04 19:46       ` Jens Axboe
@ 2015-05-04 20:33         ` Shaohua Li
  2015-05-04 20:35           ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Shaohua Li @ 2015-05-04 20:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Jeff Moyer, linux-kernel, hch, neilb

On Mon, May 04, 2015 at 01:46:49PM -0600, Jens Axboe wrote:
> On 05/04/2015 01:40 PM, Shaohua Li wrote:
> >On Fri, May 01, 2015 at 04:16:04PM -0400, Jeff Moyer wrote:
> >>Shaohua Li <shli@fb.com> writes:
> >>
> >>>Plugging is still helpful for workloads with IO merges, but it can be
> >>>harmful otherwise, especially with multiple hardware queues, as there is
> >>>(supposedly) no lock contention in this case and plugging can introduce
> >>>latency. For multiple queues, we do limited plugging, e.g. plug only if
> >>>there is a request merge. If a request doesn't merge with the following
> >>>request, the request will be dispatched immediately.
> >>>
> >>>This also fixes a bug. If we directly issue a request and it fails, we
> >>>fall back to blk_mq_merge_queue_io(). But we have already assigned the
> >>>bio to a request in blk_mq_bio_to_request(); blk_mq_merge_queue_io()
> >>>shouldn't run blk_mq_bio_to_request() again.
> >>
> >>Good catch.  Might've been better to split that out first for easy
> >>backport to stable kernels, but I won't hold you to that.
> >
> >It's not a severe bug, but I don't mind. Jens, please let me know if I
> >should split the patch into 2 patches.
> 
> I don't care that much for this particular case. But since one/more
> of the others need respin anyway, might be prudent to split it up in
> any case.

ok, done. I'll repost the patch 4/5. Please let me know if I should
repost 1-3.


From 5f8117b38ba423a4c746b05d40d3647c73b89a32 Mon Sep 17 00:00:00 2001
Message-Id: <5f8117b38ba423a4c746b05d40d3647c73b89a32.1430771485.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Mon, 4 May 2015 13:26:55 -0700
Subject: [PATCH] blk-mq: avoid re-initialize request which is failed in direct
 dispatch

If we directly issue a request and it fails, we fall back to
blk_mq_merge_queue_io(). But we have already assigned the bio to a
request in blk_mq_bio_to_request(); blk_mq_merge_queue_io() shouldn't
run blk_mq_bio_to_request() again.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e1a5b9e..2a411c9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1294,6 +1294,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 				blk_mq_end_request(rq, rq->errors);
 				goto done;
 			}
+			blk_mq_insert_request(rq, false, true, true);
+			return;
 		}
 	}
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
  2015-05-04 20:33         ` Shaohua Li
@ 2015-05-04 20:35           ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2015-05-04 20:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Jeff Moyer, linux-kernel, hch, neilb

On 05/04/2015 02:33 PM, Shaohua Li wrote:
> On Mon, May 04, 2015 at 01:46:49PM -0600, Jens Axboe wrote:
>> On 05/04/2015 01:40 PM, Shaohua Li wrote:
>>> On Fri, May 01, 2015 at 04:16:04PM -0400, Jeff Moyer wrote:
>>>> Shaohua Li <shli@fb.com> writes:
>>>>
>>>>> Plugging is still helpful for workloads with IO merges, but it can be
>>>>> harmful otherwise, especially with multiple hardware queues, as there is
>>>>> (supposedly) no lock contention in this case and plugging can introduce
>>>>> latency. For multiple queues, we do limited plugging, e.g. plug only if
>>>>> there is a request merge. If a request doesn't merge with the following
>>>>> request, the request will be dispatched immediately.
>>>>>
>>>>> This also fixes a bug. If we directly issue a request and it fails, we
>>>>> fall back to blk_mq_merge_queue_io(). But we have already assigned the
>>>>> bio to a request in blk_mq_bio_to_request(); blk_mq_merge_queue_io()
>>>>> shouldn't run blk_mq_bio_to_request() again.
>>>>
>>>> Good catch.  Might've been better to split that out first for easy
>>>> backport to stable kernels, but I won't hold you to that.
>>>
>>> It's not a severe bug, but I don't mind. Jens, please let me know if I
>>> should split the patch into 2 patches.
>>
>> I don't care that much for this particular case. But since one/more
>> of the others need respin anyway, might be prudent to split it up in
>> any case.
>
> ok, done. I'll repost the patch 4/5. Please let me know if I should
> repost 1-3.

That's fine, I'll grab 1-3 as-is. Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-05-04 20:35 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
2015-04-30 17:45 ` [PATCH 1/5] blk: clean up plug Shaohua Li
2015-05-01 17:11   ` Christoph Hellwig
2015-04-30 17:45 ` [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out Shaohua Li
2015-05-01 17:14   ` Christoph Hellwig
2015-05-01 18:05     ` Shaohua Li
2015-05-01 17:42   ` Jeff Moyer
2015-05-01 18:07     ` Jeff Moyer
2015-05-01 18:28       ` Shaohua Li
2015-05-01 19:37         ` Jeff Moyer
2015-04-30 17:45 ` [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request Shaohua Li
2015-05-01 17:16   ` Christoph Hellwig
2015-05-01 17:47     ` Jeff Moyer
2015-04-30 17:45 ` [PATCH 4/5] blk-mq: do limited block plug for multiple queue case Shaohua Li
2015-05-01 20:16   ` Jeff Moyer
2015-05-04 19:40     ` Shaohua Li
2015-05-04 19:46       ` Jens Axboe
2015-05-04 20:33         ` Shaohua Li
2015-05-04 20:35           ` Jens Axboe
2015-04-30 17:45 ` [PATCH 5/5] blk-mq: make plug work for mutiple disks and queues Shaohua Li
2015-05-01 20:55   ` Jeff Moyer
2015-05-04 19:44     ` Shaohua Li
