linux-kernel.vger.kernel.org archive mirror
From: Shaohua Li <shli@fb.com>
To: Jeff Moyer <jmoyer@redhat.com>
Cc: <linux-kernel@vger.kernel.org>, <axboe@fb.com>, <hch@lst.de>,
	<neilb@suse.de>
Subject: Re: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case
Date: Mon, 4 May 2015 12:40:39 -0700	[thread overview]
Message-ID: <20150504194037.GA3300441@devbig257.prn2.facebook.com> (raw)
In-Reply-To: <x49d22kyry3.fsf@segfault.boston.devel.redhat.com>

On Fri, May 01, 2015 at 04:16:04PM -0400, Jeff Moyer wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > Plugging is still helpful for workloads with IO merges, but it can be
> > harmful otherwise, especially with multiple hardware queues: there is
> > (supposedly) no lock contention in that case, and plugging can
> > introduce latency. For multiple queues, do a limited plug, e.g. plug
> > only if there is a request merge. If a request doesn't merge with the
> > following request, it is dispatched immediately.
> >
> > This also fixes a bug: if directly issuing a request fails, we fall
> > back to blk_mq_merge_queue_io(). But the bio was already assigned to
> > a request in blk_mq_bio_to_request(), so blk_mq_merge_queue_io()
> > must not run blk_mq_bio_to_request() again.
> 
> Good catch.  Might've been better to split that out first for easy
> backport to stable kernels, but I won't hold you to that.

It's not a severe bug, but I don't mind. Jens, please let me know if I
should split this into two patches.
 
> > @@ -1243,6 +1277,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >  		return;
> >  	}
> >  
> > +	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
> > +	    blk_attempt_plug_merge(q, bio, &request_count))
> > +		return;
> > +
> >  	rq = blk_mq_map_request(q, bio, &data);
> >  	if (unlikely(!rq))
> >  		return;
> 
> After this patch, everything up to this point in blk_mq_make_request and
> blk_sq_make_request is the same.  This can be factored out (in another
> patch) to a common function.

I'll leave this for a separate cleanup if a good function name is found.

> > @@ -1253,38 +1291,38 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
> >  		goto run_queue;
> >  	}
> >  
> > +	plug = current->plug;
> >  	/*
> >  	 * If the driver supports defer issued based on 'last', then
> >  	 * queue it up like normal since we can potentially save some
> >  	 * CPU this way.
> >  	 */
> > -	if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > -		struct blk_mq_queue_data bd = {
> > -			.rq = rq,
> > -			.list = NULL,
> > -			.last = 1
> > -		};
> > -		int ret;
> > +	if ((plug || is_sync) && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
> > +		struct request *old_rq = NULL;
> 
> I would add a !blk_queue_nomerges(q) to that conditional.  There's no
> point holding back an I/O when we won't merge it anyway.

Good catch! Fixed.
 
> That brings up another quirk of the current implementation (not your
> patches) that bugs me.
> 
> BLK_MQ_F_SHOULD_MERGE
> QUEUE_FLAG_NOMERGES
> 
> Those two flags are set independently, one via the driver and the other
> via a sysfs file.  So the user could set the nomerges flag to 1 or 2,
> and still potentially get merges (see blk_mq_merge_queue_io).  That's
> something that should be fixed, albeit that can wait.

Agreed.

> >  		blk_mq_bio_to_request(rq, bio);
> >  
> >  		/*
> > -		 * For OK queue, we are done. For error, kill it. Any other
> > -		 * error (busy), just add it to our list as we previously
> > -		 * would have done
> > +		 * We do limited plugging. If the bio can be merged, do the merge.
> > +		 * Otherwise the existing request in the plug list will be
> > +		 * issued, so the plug list holds at most one request.
> >  		 */
> > -		ret = q->mq_ops->queue_rq(data.hctx, &bd);
> > -		if (ret == BLK_MQ_RQ_QUEUE_OK)
> > -			goto done;
> > -		else {
> > -			__blk_mq_requeue_request(rq);
> > -
> > -			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
> > -				rq->errors = -EIO;
> > -				blk_mq_end_request(rq, rq->errors);
> > -				goto done;
> > +		if (plug) {
> > +			if (!list_empty(&plug->mq_list)) {
> > +				old_rq = list_first_entry(&plug->mq_list,
> > +					struct request, queuelist);
> > +				list_del_init(&old_rq->queuelist);
> >  			}
> > -		}
> > +			list_add_tail(&rq->queuelist, &plug->mq_list);
> > +		} else /* is_sync */
> > +			old_rq = rq;
> > +		blk_mq_put_ctx(data.ctx);
> > +		if (!old_rq)
> > +			return;
> > +		if (!blk_mq_direct_issue_request(old_rq))
> > +			return;
> > +		blk_mq_insert_request(old_rq, false, true, true);
> > +		return;
> >  	}
> 
> Now there is no way to exit that if block; we always return.  It may be
> worth considering moving that block to its own function, if you can think
> of a good name for it.

I'll leave this for later work.

> Other than those minor issues, this looks good to me.

Thanks for your time!


From b2f2f6fbf72e4b80dffbce3ada6a151754407044 Mon Sep 17 00:00:00 2001
Message-Id: <b2f2f6fbf72e4b80dffbce3ada6a151754407044.1430766392.git.shli@fb.com>
In-Reply-To: <f3bfe60a013827942790c89d658f63c920653437.1430766392.git.shli@fb.com>
References: <f3bfe60a013827942790c89d658f63c920653437.1430766392.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Wed, 29 Apr 2015 16:45:40 -0700
Subject: [PATCH 4/5] blk-mq: do limited block plug for multiple queue case

Plugging is still helpful for workloads with IO merges, but it can be
harmful otherwise, especially with multiple hardware queues: there is
(supposedly) no lock contention in that case, and plugging can
introduce latency. For multiple queues, do a limited plug, e.g. plug
only if there is a request merge. If a request doesn't merge with the
following request, it is dispatched immediately.

This also fixes a bug: if directly issuing a request fails, we fall
back to blk_mq_merge_queue_io(). But the bio was already assigned to
a request in blk_mq_bio_to_request(), so blk_mq_merge_queue_io()
must not run blk_mq_bio_to_request() again.

Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c | 82 ++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7f9d3a1..6a6b6d0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1224,6 +1224,38 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	return rq;
 }
 
+static int blk_mq_direct_issue_request(struct request *rq)
+{
+	int ret;
+	struct request_queue *q = rq->q;
+	struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
+			rq->mq_ctx->cpu);
+	struct blk_mq_queue_data bd = {
+		.rq = rq,
+		.list = NULL,
+		.last = 1
+	};
+
+	/*
+	 * For OK queue, we are done. For error, kill it. Any other
+	 * error (busy), just add it to our list as we previously
+	 * would have done
+	 */
+	ret = q->mq_ops->queue_rq(hctx, &bd);
+	if (ret == BLK_MQ_RQ_QUEUE_OK)
+		return 0;
+	else {
+		__blk_mq_requeue_request(rq);
+
+		if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
+			rq->errors = -EIO;
+			blk_mq_end_request(rq, rq->errors);
+			return 0;
+		}
+		return -1;
+	}
+}
+
 /*
  * Multiple hardware queue variant. This will not use per-process plugs,
  * but will attempt to bypass the hctx queueing if we can go straight to
@@ -1235,6 +1267,8 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 	const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);
 	struct blk_map_ctx data;
 	struct request *rq;
+	unsigned int request_count = 0;
+	struct blk_plug *plug;
 
 	blk_queue_bounce(q, &bio);
 
@@ -1243,6 +1277,10 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		return;
 	}
 
+	if (likely(!is_flush_fua) && !blk_queue_nomerges(q) &&
+	    blk_attempt_plug_merge(q, bio, &request_count))
+		return;
+
 	rq = blk_mq_map_request(q, bio, &data);
 	if (unlikely(!rq))
 		return;
@@ -1253,38 +1291,39 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 		goto run_queue;
 	}
 
+	plug = current->plug;
 	/*
 	 * If the driver supports defer issued based on 'last', then
 	 * queue it up like normal since we can potentially save some
 	 * CPU this way.
 	 */
-	if (is_sync && !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
-		struct blk_mq_queue_data bd = {
-			.rq = rq,
-			.list = NULL,
-			.last = 1
-		};
-		int ret;
+	if (((plug && !blk_queue_nomerges(q)) || is_sync) &&
+	    !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {
+		struct request *old_rq = NULL;
 
 		blk_mq_bio_to_request(rq, bio);
 
 		/*
-		 * For OK queue, we are done. For error, kill it. Any other
-		 * error (busy), just add it to our list as we previously
-		 * would have done
+		 * We do limited plugging. If the bio can be merged, do the merge.
+		 * Otherwise the existing request in the plug list will be
+		 * issued, so the plug list holds at most one request.
 		 */
-		ret = q->mq_ops->queue_rq(data.hctx, &bd);
-		if (ret == BLK_MQ_RQ_QUEUE_OK)
-			goto done;
-		else {
-			__blk_mq_requeue_request(rq);
-
-			if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
-				rq->errors = -EIO;
-				blk_mq_end_request(rq, rq->errors);
-				goto done;
+		if (plug) {
+			if (!list_empty(&plug->mq_list)) {
+				old_rq = list_first_entry(&plug->mq_list,
+					struct request, queuelist);
+				list_del_init(&old_rq->queuelist);
 			}
-		}
+			list_add_tail(&rq->queuelist, &plug->mq_list);
+		} else /* is_sync */
+			old_rq = rq;
+		blk_mq_put_ctx(data.ctx);
+		if (!old_rq)
+			return;
+		if (!blk_mq_direct_issue_request(old_rq))
+			return;
+		blk_mq_insert_request(old_rq, false, true, true);
+		return;
 	}
 
 	if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) {
@@ -1297,7 +1336,6 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 run_queue:
 		blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);
 	}
-done:
 	blk_mq_put_ctx(data.ctx);
 }
 
-- 
1.8.1


Thread overview: 22+ messages
2015-04-30 17:45 [PATCH 0/5] blk plug fixes Shaohua Li
2015-04-30 17:45 ` [PATCH 1/5] blk: clean up plug Shaohua Li
2015-05-01 17:11   ` Christoph Hellwig
2015-04-30 17:45 ` [PATCH 2/5] sched: always use blk_schedule_flush_plug in io_schedule_out Shaohua Li
2015-05-01 17:14   ` Christoph Hellwig
2015-05-01 18:05     ` Shaohua Li
2015-05-01 17:42   ` Jeff Moyer
2015-05-01 18:07     ` Jeff Moyer
2015-05-01 18:28       ` Shaohua Li
2015-05-01 19:37         ` Jeff Moyer
2015-04-30 17:45 ` [PATCH 3/5] blk-mq: fix plugging in blk_sq_make_request Shaohua Li
2015-05-01 17:16   ` Christoph Hellwig
2015-05-01 17:47     ` Jeff Moyer
2015-04-30 17:45 ` [PATCH 4/5] blk-mq: do limited block plug for multiple queue case Shaohua Li
2015-05-01 20:16   ` Jeff Moyer
2015-05-04 19:40     ` Shaohua Li [this message]
2015-05-04 19:46       ` Jens Axboe
2015-05-04 20:33         ` Shaohua Li
2015-05-04 20:35           ` Jens Axboe
2015-04-30 17:45 ` [PATCH 5/5] blk-mq: make plug work for mutiple disks and queues Shaohua Li
2015-05-01 20:55   ` Jeff Moyer
2015-05-04 19:44     ` Shaohua Li
