* [PATCH v2 0/5] Improve blk-mq performance
@ 2018-07-27 22:56 Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 1/5] blk-mq: Introduce the timeout handler return code BLK_EH_DONT_COMPLETE Bart Van Assche
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Bart Van Assche @ 2018-07-27 22:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hello Jens,

The blk-mq timeout handling rework that went upstream during the v4.18
development cycle introduced a performance regression due to the atomic
instructions that were added in the hot path. This patch series improves
blk-mq performance by reducing the number of atomic instructions in that
hot path: it increases the number of IOPS reported by a fio test against
the null_blk driver by 15% on an x86_64 system. The series applies on top
of a merge of your for-next and for-linus branches and passes the block
and srp blktests tests.
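
The fio job that produced the 15% number is not included here; a
representative invocation for this kind of null_blk measurement (the
null_blk parameters and fio settings below are assumptions, not the
exact values used for the result above) would be:

  modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
  fio --name=nullb --filename=/dev/nullb0 --direct=1 --rw=randread \
      --ioengine=libaio --iodepth=64 --numjobs=4 --runtime=30 \
      --time_based --group_reporting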

Bart.

Changes compared to v1:
- Dropped the patch that renames BLK_EH_DONE to BLK_EH_DONT_RESET_TIMER.
- Added a new patch that introduces the timeout handler return code
  BLK_EH_DONT_COMPLETE.
- If blk_mq_complete_request() is called from inside the request
  timeout handler, complete the request immediately to avoid triggering
  a deadlock.
- Instead of introducing one new request state, introduce two new request
  states, namely MQ_RQ_COMPLETION_DELAYED and MQ_RQ_TIMED_OUT.
- Fixed a race condition in blk_mq_change_rq_state().

Bart Van Assche (5):
  blk-mq: Introduce the timeout handler return code BLK_EH_DONT_COMPLETE
  block: Remove a superfluous #include directive
  block: Split blk_add_timer()
  block: Simplify blk_add_timer() and blk_mq_add_timer()
  blk-mq: Rework blk-mq timeout handling again

 Documentation/scsi/scsi_eh.txt            |   4 +-
 block/blk-core.c                          |   3 +
 block/blk-mq-debugfs.c                    |   8 +-
 block/blk-mq.c                            | 325 +++++++++++++++++-----
 block/blk-mq.h                            |  10 +-
 block/blk-timeout.c                       | 116 +++++---
 block/blk.h                               |   1 +
 drivers/block/nbd.c                       |   2 +-
 drivers/message/fusion/mptsas.c           |   2 +-
 drivers/s390/block/dasd.c                 |   6 +-
 drivers/scsi/gdth.c                       |   2 +-
 drivers/scsi/libiscsi.c                   |   6 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |   2 +-
 drivers/scsi/mvumi.c                      |   2 +-
 drivers/scsi/qla4xxx/ql4_os.c             |   2 +-
 drivers/scsi/scsi_error.c                 |  18 +-
 drivers/scsi/scsi_transport_fc.c          |   9 +-
 drivers/scsi/scsi_transport_srp.c         |   4 +-
 drivers/scsi/ufs/ufshcd.c                 |   6 +-
 include/linux/blk-mq.h                    |  14 -
 include/linux/blkdev.h                    |  60 +++-
 21 files changed, 417 insertions(+), 185 deletions(-)

-- 
2.18.0

* [PATCH v2 1/5] blk-mq: Introduce the timeout handler return code BLK_EH_DONT_COMPLETE
  2018-07-27 22:56 [PATCH v2 0/5] Improve blk-mq performance Bart Van Assche
@ 2018-07-27 22:56 ` Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 2/5] block: Remove a superfluous #include directive Bart Van Assche
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2018-07-27 22:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Ming Lei,
	Keith Busch, Hannes Reinecke, Johannes Thumshirn

If a block driver timeout handler returns BLK_EH_DONE, that can mean
either that the request has been completed or that the block driver
still owns the request and has not yet completed it. Since the name
BLK_EH_DONE is misleading in the latter case, introduce the return code
BLK_EH_DONT_COMPLETE for it. See also commits 88b0cfad2888 ("block:
document the blk_eh_timer_return values") and 6600593cbd93 ("block:
rename BLK_EH_NOT_HANDLED to BLK_EH_DONE").

Suggested-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 Documentation/scsi/scsi_eh.txt            |  4 ++--
 block/blk-mq.c                            | 10 ++++++++--
 block/blk-timeout.c                       |  9 ++++++++-
 drivers/block/nbd.c                       |  2 +-
 drivers/message/fusion/mptsas.c           |  2 +-
 drivers/s390/block/dasd.c                 |  6 +++---
 drivers/scsi/gdth.c                       |  2 +-
 drivers/scsi/libiscsi.c                   |  6 +++---
 drivers/scsi/megaraid/megaraid_sas_base.c |  2 +-
 drivers/scsi/mvumi.c                      |  2 +-
 drivers/scsi/qla4xxx/ql4_os.c             |  2 +-
 drivers/scsi/scsi_error.c                 |  4 ++--
 drivers/scsi/scsi_transport_fc.c          |  9 ++++++---
 drivers/scsi/scsi_transport_srp.c         |  4 ++--
 drivers/scsi/ufs/ufshcd.c                 |  6 +++---
 include/linux/blkdev.h                    |  3 ++-
 16 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
index 1b7436932a2b..20d6044b9a6c 100644
--- a/Documentation/scsi/scsi_eh.txt
+++ b/Documentation/scsi/scsi_eh.txt
@@ -86,9 +86,9 @@ function
 	This indicates that more time is required to finish the
 	command.  Timer is restarted.  This action is counted as a
 	retry and only allowed scmd->allowed + 1(!) times.  Once the
-	limit is reached, action for BLK_EH_DONE is taken instead.
+	limit is reached, action for BLK_EH_DONT_COMPLETE is taken instead.
 
-    - BLK_EH_DONE
+    - BLK_EH_DONT_COMPLETE
         eh_timed_out() callback did not handle the command.
 	Step #2 is taken.
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c92ce06fd565..8e9beb4ddeb0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -774,9 +774,15 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		enum blk_eh_timer_return ret;
 
 		ret = req->q->mq_ops->timeout(req, reserved);
-		if (ret == BLK_EH_DONE)
+		switch (ret) {
+		case BLK_EH_DONE:
+		case BLK_EH_DONT_COMPLETE:
 			return;
-		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
+		case BLK_EH_RESET_TIMER:
+			break;
+		default:
+			WARN_ON_ONCE(true);
+		}
 	}
 
 	blk_add_timer(req);
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index f2cfd56e1606..38992c098026 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -90,7 +90,7 @@ static void blk_rq_timed_out(struct request *req)
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
-	case BLK_EH_DONE:
+	case BLK_EH_DONT_COMPLETE:
 		/*
 		 * LLD handles this for now but in the future
 		 * we can send a request msg to abort the command
@@ -98,6 +98,13 @@ static void blk_rq_timed_out(struct request *req)
 		 * the blk layer.
 		 */
 		break;
+	case BLK_EH_DONE:
+		/*
+		 * The LLD completed the request by calling
+		 * __blk_complete_request().
+		 */
+		WARN_ON_ONCE(true);
+		break;
 	default:
 		printk(KERN_ERR "block: bad eh return: %d\n", ret);
 		break;
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 3fb95c8d9fd8..1a5a33e84dee 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -382,7 +382,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 			mutex_unlock(&cmd->lock);
 			nbd_requeue_cmd(cmd);
 			nbd_config_put(nbd);
-			return BLK_EH_DONE;
+			return BLK_EH_DONT_COMPLETE;
 		}
 	} else {
 		dev_err_ratelimited(nbd_to_dev(nbd),
diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 76a66da33996..eae1b2613d69 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -1929,7 +1929,7 @@ static enum blk_eh_timer_return mptsas_eh_timed_out(struct scsi_cmnd *sc)
 	MPT_SCSI_HOST *hd;
 	MPT_ADAPTER   *ioc;
 	VirtDevice    *vdevice;
-	enum blk_eh_timer_return rc = BLK_EH_DONE;
+	enum blk_eh_timer_return rc = BLK_EH_DONT_COMPLETE;
 
 	hd = shost_priv(sc->device->host);
 	if (hd == NULL) {
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index a9f60d0ee02e..5756c2fc3544 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -3010,7 +3010,7 @@ static blk_status_t do_dasd_request(struct blk_mq_hw_ctx *hctx,
  *
  * Return values:
  * BLK_EH_RESET_TIMER if the request should be left running
- * BLK_EH_DONE if the request is handled or terminated
+ * BLK_EH_DONT_COMPLETE if the request is handled or terminated
  *		      by the driver.
  */
 enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
@@ -3023,7 +3023,7 @@ enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
 
 	cqr = blk_mq_rq_to_pdu(req);
 	if (!cqr)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_COMPLETE;
 
 	spin_lock_irqsave(&cqr->dq->lock, flags);
 	device = cqr->startdev ? cqr->startdev : block->base;
@@ -3078,7 +3078,7 @@ enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
 	spin_unlock(&block->queue_lock);
 	spin_unlock_irqrestore(&cqr->dq->lock, flags);
 
-	return rc ? BLK_EH_RESET_TIMER : BLK_EH_DONE;
+	return rc ? BLK_EH_RESET_TIMER : BLK_EH_DONT_COMPLETE;
 }
 
 static int dasd_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 85604795d8ee..9a2bec642986 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -3882,7 +3882,7 @@ static enum blk_eh_timer_return gdth_timed_out(struct scsi_cmnd *scp)
 	struct gdth_cmndinfo *cmndinfo = gdth_cmnd_priv(scp);
 	u8 b, t;
 	unsigned long flags;
-	enum blk_eh_timer_return retval = BLK_EH_DONE;
+	enum blk_eh_timer_return retval = BLK_EH_DONT_COMPLETE;
 
 	TRACE(("%s() cmd 0x%x\n", scp->cmnd[0], __func__));
 	b = scp->device->channel;
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index d6093838f5f2..2db9c760835f 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1963,7 +1963,7 @@ static int iscsi_has_ping_timed_out(struct iscsi_conn *conn)
 
 enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 {
-	enum blk_eh_timer_return rc = BLK_EH_DONE;
+	enum blk_eh_timer_return rc = BLK_EH_DONT_COMPLETE;
 	struct iscsi_task *task = NULL, *running_task;
 	struct iscsi_cls_session *cls_session;
 	struct iscsi_session *session;
@@ -1982,7 +1982,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 		 * Raced with completion. Blk layer has taken ownership
 		 * so let timeout code complete it now.
 		 */
-		rc = BLK_EH_DONE;
+		rc = BLK_EH_DONT_COMPLETE;
 		goto done;
 	}
 
@@ -1997,7 +1997,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 		if (unlikely(system_state != SYSTEM_RUNNING)) {
 			sc->result = DID_NO_CONNECT << 16;
 			ISCSI_DBG_EH(session, "sc on shutdown, handled\n");
-			rc = BLK_EH_DONE;
+			rc = BLK_EH_DONT_COMPLETE;
 			goto done;
 		}
 		/*
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 71d97573a667..47c292f0b8ae 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -2768,7 +2768,7 @@ blk_eh_timer_return megasas_reset_timer(struct scsi_cmnd *scmd)
 
 	if (time_after(jiffies, scmd->jiffies_at_alloc +
 				(scmd_timeout * 2) * HZ)) {
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_COMPLETE;
 	}
 
 	instance = (struct megasas_instance *)scmd->device->host->hostdata;
diff --git a/drivers/scsi/mvumi.c b/drivers/scsi/mvumi.c
index b3cd9a6b1d30..2b41d2f64e50 100644
--- a/drivers/scsi/mvumi.c
+++ b/drivers/scsi/mvumi.c
@@ -2155,7 +2155,7 @@ static enum blk_eh_timer_return mvumi_timed_out(struct scsi_cmnd *scmd)
 	mvumi_return_cmd(mhba, cmd);
 	spin_unlock_irqrestore(mhba->shost->host_lock, flags);
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_COMPLETE;
 }
 
 static int
diff --git a/drivers/scsi/qla4xxx/ql4_os.c b/drivers/scsi/qla4xxx/ql4_os.c
index 0e13349dce57..0d5d13983217 100644
--- a/drivers/scsi/qla4xxx/ql4_os.c
+++ b/drivers/scsi/qla4xxx/ql4_os.c
@@ -1848,7 +1848,7 @@ static enum blk_eh_timer_return qla4xxx_eh_cmd_timed_out(struct scsi_cmnd *sc)
 	struct iscsi_cls_session *session;
 	struct iscsi_session *sess;
 	unsigned long flags;
-	enum blk_eh_timer_return ret = BLK_EH_DONE;
+	enum blk_eh_timer_return ret = BLK_EH_DONT_COMPLETE;
 
 	session = starget_to_session(scsi_target(sc->device));
 	sess = session->dd_data;
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2715cdaa669c..a885f4f23591 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -283,7 +283,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 enum blk_eh_timer_return scsi_times_out(struct request *req)
 {
 	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
-	enum blk_eh_timer_return rtn = BLK_EH_DONE;
+	enum blk_eh_timer_return rtn = BLK_EH_DONT_COMPLETE;
 	struct Scsi_Host *host = scmd->device->host;
 
 	trace_scsi_dispatch_cmd_timeout(scmd);
@@ -295,7 +295,7 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	if (host->hostt->eh_timed_out)
 		rtn = host->hostt->eh_timed_out(scmd);
 
-	if (rtn == BLK_EH_DONE) {
+	if (rtn == BLK_EH_DONT_COMPLETE) {
 		/*
 		 * For blk-mq, we must set the request state to complete now
 		 * before sending the request to the scsi error handler. This
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 13948102ca29..a23941f195e3 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2087,7 +2087,7 @@ fc_eh_timed_out(struct scsi_cmnd *scmd)
 	if (rport->port_state == FC_PORTSTATE_BLOCKED)
 		return BLK_EH_RESET_TIMER;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_COMPLETE;
 }
 EXPORT_SYMBOL(fc_eh_timed_out);
 
@@ -3591,9 +3591,12 @@ fc_bsg_job_timeout(struct request *req)
 	}
 
 	/* the blk_end_sync_io() doesn't check the error */
-	if (inflight)
+	if (inflight) {
 		__blk_complete_request(req);
-	return BLK_EH_DONE;
+		return BLK_EH_DONE;
+	} else {
+		return BLK_EH_DONT_COMPLETE;
+	}
 }
 
 /**
diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
index 4e46fdb2d7c9..6c73b4c520cf 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -604,7 +604,7 @@ EXPORT_SYMBOL(srp_reconnect_rport);
  *
  * If a timeout occurs while an rport is in the blocked state, ask the SCSI
  * EH to continue waiting (BLK_EH_RESET_TIMER). Otherwise let the SCSI core
- * handle the timeout (BLK_EH_DONE).
+ * handle the timeout (BLK_EH_DONT_COMPLETE).
  *
  * Note: This function is called from soft-IRQ context and with the request
  * queue lock held.
@@ -620,7 +620,7 @@ enum blk_eh_timer_return srp_timed_out(struct scsi_cmnd *scmd)
 	return rport && rport->fast_io_fail_tmo < 0 &&
 		rport->dev_loss_tmo < 0 &&
 		i->f->reset_timer_if_blocked && scsi_device_blocked(sdev) ?
-		BLK_EH_RESET_TIMER : BLK_EH_DONE;
+		BLK_EH_RESET_TIMER : BLK_EH_DONT_COMPLETE;
 }
 EXPORT_SYMBOL(srp_timed_out);
 
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 397081d320b1..1042fff318d5 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6682,12 +6682,12 @@ static enum blk_eh_timer_return ufshcd_eh_timed_out(struct scsi_cmnd *scmd)
 	bool found = false;
 
 	if (!scmd || !scmd->device || !scmd->device->host)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_COMPLETE;
 
 	host = scmd->device->host;
 	hba = shost_priv(host);
 	if (!hba)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_COMPLETE;
 
 	spin_lock_irqsave(host->host_lock, flags);
 
@@ -6705,7 +6705,7 @@ static enum blk_eh_timer_return ufshcd_eh_timed_out(struct scsi_cmnd *scmd)
 	 * SCSI command was not actually dispatched to UFS driver, otherwise
 	 * let SCSI layer handle the error as usual.
 	 */
-	return found ? BLK_EH_DONE : BLK_EH_RESET_TIMER;
+	return found ? BLK_EH_DONT_COMPLETE : BLK_EH_RESET_TIMER;
 }
 
 static const struct attribute_group *ufshcd_driver_groups[] = {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..8f924fba80b9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -328,7 +328,8 @@ typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
 typedef void (exit_rq_fn)(struct request_queue *, struct request *);
 
 enum blk_eh_timer_return {
-	BLK_EH_DONE,		/* drivers has completed the command */
+	BLK_EH_DONE,		/* driver has completed the command */
+	BLK_EH_DONT_COMPLETE,	/* driver owns the command */
 	BLK_EH_RESET_TIMER,	/* reset timer and try again */
 };
 
-- 
2.18.0

* [PATCH v2 2/5] block: Remove a superfluous #include directive
  2018-07-27 22:56 [PATCH v2 0/5] Improve blk-mq performance Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 1/5] blk-mq: Introduce the timeout handler return code BLK_EH_DONT_COMPLETE Bart Van Assche
@ 2018-07-27 22:56 ` Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 3/5] block: Split blk_add_timer() Bart Van Assche
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2018-07-27 22:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Commit 12f5b9314545 ("blk-mq: Remove generation seqeunce") removed the
only u64_stats_sync instance from <linux/blkdev.h> but did not remove
the corresponding #include directive. Since the #include directive is
no longer needed, remove it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 include/linux/blkdev.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8f924fba80b9..9aeda7e982c4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -28,7 +28,6 @@
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
 #include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
-- 
2.18.0

* [PATCH v2 3/5] block: Split blk_add_timer()
  2018-07-27 22:56 [PATCH v2 0/5] Improve blk-mq performance Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 1/5] blk-mq: Introduce the timeout handler return code BLK_EH_DONT_COMPLETE Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 2/5] block: Remove a superfluous #include directive Bart Van Assche
@ 2018-07-27 22:56 ` Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  4 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2018-07-27 22:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Split blk_add_timer() as follows:
- Introduce the helper function __blk_add_timer() that performs the
  tasks that apply both to the legacy block layer core and to blk-mq.
- Duplicate the remainder of blk_add_timer() into the new function
  blk_mq_add_timer().
- Change the blk_add_timer() calls in blk-mq code into
  blk_mq_add_timer() calls.

This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c      |  4 +--
 block/blk-timeout.c | 81 +++++++++++++++++++++++++++++++++++----------
 block/blk.h         |  1 +
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8e9beb4ddeb0..670f8960b88a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -644,7 +644,7 @@ void blk_mq_start_request(struct request *rq)
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
-	blk_add_timer(rq);
+	blk_mq_add_timer(rq);
 	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
@@ -785,7 +785,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		}
 	}
 
-	blk_add_timer(req);
+	blk_mq_add_timer(req);
 }
 
 static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 38992c098026..ac0255a70401 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -188,6 +188,33 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
+static void __blk_add_timer(struct request *req, unsigned long deadline)
+{
+	struct request_queue *q = req->q;
+	unsigned long expiry;
+
+	/*
+	 * If the timer isn't already pending or this timeout is earlier
+	 * than an existing one, modify the timer. Round up to next nearest
+	 * second.
+	 */
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
+	if (!timer_pending(&q->timeout) ||
+	    time_before(expiry, q->timeout.expires)) {
+		unsigned long diff = q->timeout.expires - expiry;
+
+		/*
+		 * Due to added timer slack to group timers, the timer
+		 * will often be a little in front of what we asked for.
+		 * So apply some tolerance here too, otherwise we keep
+		 * modifying the timer because expires for value X
+		 * will be X + something.
+		 */
+		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
+			mod_timer(&q->timeout, expiry);
+	}
+}
+
 /**
  * blk_add_timer - Start timeout timer for a single request
  * @req:	request that is about to start running.
@@ -199,7 +226,6 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 void blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
-	unsigned long expiry;
 
 	if (!q->mq_ops)
 		lockdep_assert_held(q->queue_lock);
@@ -227,26 +253,45 @@ void blk_add_timer(struct request *req)
 	if (!q->mq_ops)
 		list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
+	__blk_add_timer(req, blk_rq_deadline(req));
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ *
+ * Sets the deadline of a request. The caller must guarantee that the request
+ * state won't be modified while this function is in progress.
+ */
+void blk_mq_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+
+	if (!q->mq_ops)
+		lockdep_assert_held(q->queue_lock);
+
+	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
+	if (!q->mq_ops && !q->rq_timed_out_fn)
+		return;
+
+	BUG_ON(!list_empty(&req->timeout_list));
+
 	/*
-	 * If the timer isn't already pending or this timeout is earlier
-	 * than an existing one, modify the timer. Round up to next nearest
-	 * second.
+	 * Some LLDs, like scsi, peek at the timeout to prevent a
+	 * command from being retried forever.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
 
-	if (!timer_pending(&q->timeout) ||
-	    time_before(expiry, q->timeout.expires)) {
-		unsigned long diff = q->timeout.expires - expiry;
+	req->rq_flags &= ~RQF_TIMED_OUT;
+	blk_rq_set_deadline(req, jiffies + req->timeout);
 
-		/*
-		 * Due to added timer slack to group timers, the timer
-		 * will often be a little in front of what we asked for.
-		 * So apply some tolerance here too, otherwise we keep
-		 * modifying the timer because expires for value X
-		 * will be X + something.
-		 */
-		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
-			mod_timer(&q->timeout, expiry);
-	}
+	/*
+	 * Only the non-mq case needs to add the request to a protected list.
+	 * For the mq case we simply scan the tag map.
+	 */
+	if (!q->mq_ops)
+		list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
+	__blk_add_timer(req, blk_rq_deadline(req));
 }
diff --git a/block/blk.h b/block/blk.h
index 69b14cd2bb22..6adae8f94279 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,7 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req);
 void blk_delete_timer(struct request *);
 
 
-- 
2.18.0

* [PATCH v2 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer()
  2018-07-27 22:56 [PATCH v2 0/5] Improve blk-mq performance Bart Van Assche
                   ` (2 preceding siblings ...)
  2018-07-27 22:56 ` [PATCH v2 3/5] block: Split blk_add_timer() Bart Van Assche
@ 2018-07-27 22:56 ` Bart Van Assche
  2018-07-27 22:56 ` [PATCH v2 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  4 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2018-07-27 22:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Use the fact that q->mq_ops == NULL for the legacy block layer and that
q->mq_ops != NULL for blk-mq to simplify blk_add_timer() and
blk_mq_add_timer(). This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-timeout.c | 54 +++++++++------------------------------------
 1 file changed, 11 insertions(+), 43 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index ac0255a70401..8e6661031bc0 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -226,34 +226,21 @@ static void __blk_add_timer(struct request *req, unsigned long deadline)
 void blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
+	unsigned long deadline;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
+	lockdep_assert_held(q->queue_lock);
 
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
+	if (!q->rq_timed_out_fn)
 		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
 	req->rq_flags &= ~RQF_TIMED_OUT;
-	blk_rq_set_deadline(req, jiffies + req->timeout);
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
-	__blk_add_timer(req, blk_rq_deadline(req));
+	__blk_add_timer(req, deadline);
 }
 
 /**
@@ -266,32 +253,13 @@ void blk_add_timer(struct request *req)
 void blk_mq_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
+	unsigned long deadline;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
 	req->rq_flags &= ~RQF_TIMED_OUT;
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
-	__blk_add_timer(req, blk_rq_deadline(req));
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	__blk_add_timer(req, deadline);
 }
-- 
2.18.0

* [PATCH v2 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 22:56 [PATCH v2 0/5] Improve blk-mq performance Bart Van Assche
                   ` (3 preceding siblings ...)
  2018-07-27 22:56 ` [PATCH v2 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
@ 2018-07-27 22:56 ` Bart Van Assche
  4 siblings, 0 replies; 6+ messages in thread
From: Bart Van Assche @ 2018-07-27 22:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Keith Busch, Jianchao Wang, Ming Lei

Recently the blk-mq timeout handling code was reworked to prevent
completions that occur while a timeout handler is in progress from being
ignored. That rework moved the protection against such completions from
the block layer into the SCSI core. Move that protection back into the
block layer core by introducing two new request states, namely
MQ_RQ_COMPLETION_DELAYED and MQ_RQ_TIMED_OUT.

Other changes in this patch are:
- If blk_mq_complete_request() is called from inside the request
  timeout handler, complete the request immediately. Otherwise delay
  request completion until the request timeout handler has finished.
- Reintroduce the request generation number to prevent request timeout
  handling from happening after a request has already completed; a
  small model of this scheme follows the list of reverted commits
  below.
- Reintroduce deadline_seq to make reads and updates of the request
  deadline possible without having to use atomic instructions.
- Remove the request state information that became superfluous due to
  this patch, namely the RQF_TIMED_OUT flag and the ref member.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.

This patch reverts the following two commits:
* 0fc09f920983 ("blk-mq: export setting request completion state")
* 065990bd198e ("scsi: set timed out out mq requests to complete")
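
To illustrate why the generation number makes stale timeouts harmless,
here is a small stand-alone user-space model of the gstate scheme (this
is deliberately not kernel code; only the bitfield widths mirror the
patch below):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

union gstate {
	struct {
		uint32_t generation:29;
		uint32_t state:3;	/* 1 = in flight, 4 = timed out */
	};
	uint32_t val;
};

int main(void)
{
	_Atomic uint32_t rq_gstate;
	union gstate cur, snap, timed_out;
	uint32_t expected;

	/* Request starts: generation 5, in flight. */
	cur.val = 0;
	cur.generation = 5;
	cur.state = 1;
	atomic_store(&rq_gstate, cur.val);

	/* The timeout scan snapshots the generation-and-state word. */
	snap.val = atomic_load(&rq_gstate);

	/* The request completes and is recycled: generation becomes 6. */
	cur.generation = 6;
	atomic_store(&rq_gstate, cur.val);

	/*
	 * The timeout code's cmpxchg() now fails, so the stale timeout
	 * cannot claim the recycled request.
	 */
	timed_out = snap;
	timed_out.state = 4;
	expected = snap.val;
	if (atomic_compare_exchange_strong(&rq_gstate, &expected,
					   timed_out.val))
		printf("timeout claimed the request\n");
	else
		printf("stale timeout ignored\n");
	return 0;
}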

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          |   3 +
 block/blk-mq-debugfs.c    |   8 +-
 block/blk-mq.c            | 329 ++++++++++++++++++++++++++++----------
 block/blk-mq.h            |  10 +-
 block/blk-timeout.c       |  28 +++-
 drivers/scsi/scsi_error.c |  14 --
 include/linux/blk-mq.h    |  14 --
 include/linux/blkdev.h    |  56 ++++++-
 8 files changed, 332 insertions(+), 130 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03a4ea93a5f3..3c63e410ede4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,6 +198,9 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..99732f6ec592 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -334,9 +334,11 @@ static const char *const rqf_name[] = {
 #undef RQF_NAME
 
 static const char *const blk_mq_rq_state_name_array[] = {
-	[MQ_RQ_IDLE]		= "idle",
-	[MQ_RQ_IN_FLIGHT]	= "in_flight",
-	[MQ_RQ_COMPLETE]	= "complete",
+	[MQ_RQ_IDLE]			= "idle",
+	[MQ_RQ_IN_FLIGHT]		= "in flight",
+	[MQ_RQ_COMPLETE]		= "complete",
+	[MQ_RQ_COMPLETION_DELAYED]	= "completion delayed",
+	[MQ_RQ_TIMED_OUT]		= "timed out",
 };
 
 static const char *blk_mq_rq_state_name(enum mq_rq_state rq_state)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 670f8960b88a..3fd654655610 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -6,6 +6,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/atomic.h>	/* cmpxchg() */
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -322,6 +323,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
@@ -332,7 +334,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 #endif
 
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
-	refcount_set(&rq->ref, 1);
 	return rq;
 }
 
@@ -466,19 +467,43 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-static void __blk_mq_free_request(struct request *rq)
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old_state and the
+ * state has been changed into @new_state.
+ */
+static bool blk_mq_change_rq_state(struct request *rq,
+				   enum mq_rq_state old_state,
+				   enum mq_rq_state new_state)
 {
-	struct request_queue *q = rq->q;
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	const int sched_tag = rq->internal_tag;
+	union blk_generation_and_state old_val = READ_ONCE(rq->gstate);
+	union blk_generation_and_state new_val = old_val;
 
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-	blk_mq_sched_restart(hctx);
-	blk_queue_exit(q);
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight or timed out to another state
+	 * cmpxchg() must be used. For other state transitions it is safe to
+	 * use WRITE_ONCE().
+	 */
+	switch (old_state) {
+	case MQ_RQ_IN_FLIGHT:
+	case MQ_RQ_TIMED_OUT:
+		return blk_mq_set_rq_state(rq, old_val, new_val);
+	case MQ_RQ_IDLE:
+	case MQ_RQ_COMPLETE:
+	case MQ_RQ_COMPLETION_DELAYED:
+		WRITE_ONCE(rq->gstate.val, new_val.val);
+		return true;
+	}
+	WARN(true, "Invalid request state %d\n", old_state);
+	return false;
 }
 
 void blk_mq_free_request(struct request *rq)
@@ -487,6 +512,7 @@ void blk_mq_free_request(struct request *rq)
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	const int sched_tag = rq->internal_tag;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -509,9 +535,14 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
+	if (rq->tag != -1)
+		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	if (sched_tag != -1)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+	blk_mq_sched_restart(hctx);
+	blk_queue_exit(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
@@ -558,8 +589,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (!blk_mq_mark_complete(rq))
-		return;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
@@ -613,9 +644,47 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
  **/
 void blk_mq_complete_request(struct request *rq)
 {
-	if (unlikely(blk_should_fake_timeout(rq->q)))
+	struct request_queue *q = rq->q;
+
+	if (unlikely(blk_should_fake_timeout(q)))
 		return;
-	__blk_mq_complete_request(rq);
+again:
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+		__blk_mq_complete_request(rq);
+		return;
+	}
+	if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT,
+				   MQ_RQ_COMPLETION_DELAYED)) {
+		/*
+		 * If blk_mq_complete_request() is called from inside the
+		 * request timed out function, complete the request now to
+		 * avoid that a deadlock occurs if another function is called
+		 * from the same context that waits for request completion.
+		 * Otherwise delay request completion until the request
+		 * timeout function has returned.
+		 */
+		if (READ_ONCE(q->timeout_context) == current) {
+			WARN_ON_ONCE(!blk_mq_change_rq_state(rq,
+						MQ_RQ_COMPLETION_DELAYED,
+						MQ_RQ_COMPLETE));
+			__blk_mq_complete_request(rq);
+		}
+		return;
+	}
+	switch (blk_mq_rq_state(rq)) {
+	case MQ_RQ_IDLE:
+	case MQ_RQ_COMPLETE:
+	case MQ_RQ_COMPLETION_DELAYED:
+		WARN_ON_ONCE(true);
+		return;
+	case MQ_RQ_IN_FLIGHT:
+	case MQ_RQ_TIMED_OUT:
+		/*
+		 * In the unlikely case of a race with the timeout code,
+		 * retry.
+		 */
+		goto again;
+	}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -645,7 +714,7 @@ void blk_mq_start_request(struct request *rq)
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	blk_mq_add_timer(rq);
-	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
+	WARN_ON_ONCE(!blk_mq_change_rq_state(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT));
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -658,23 +727,46 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
+/**
+ * __blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ *
+ * This function is either called by blk_mq_requeue_request() or by the block
+ * layer core if .queue_rq() returned BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
+ * If the request state is MQ_RQ_IN_FLIGHT and if this function is called from
+ * inside .queue_rq() then it is guaranteed that the timeout code won't try to
+ * modify the request state while this function is in progress because an RCU
+ * read lock is held around .queue_rq() and because the timeout code calls
+ * synchronize_rcu() after having marked requests and before processing
+ * time-outs.
+ */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	rq_qos_requeue(q, rq);
 
-	if (blk_mq_request_started(rq)) {
-		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
-		rq->rq_flags &= ~RQF_TIMED_OUT;
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
 }
 
+/**
+ * blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ * @kick_requeue_list: whether or not to kick the requeue_list
+ *
+ * This function is called after a request has completed, after a request has
+ * timed out or from inside .queue_rq(). In the latter case the request may
+ * already have been started.
+ */
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
 	__blk_mq_requeue_request(rq);
@@ -767,88 +859,126 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
+struct blk_mq_timeout_data {
+	unsigned long next;
+	unsigned int next_set;
+	unsigned int nr_expired;
+};
+
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
-	req->rq_flags |= RQF_TIMED_OUT;
-	if (req->q->mq_ops->timeout) {
-		enum blk_eh_timer_return ret;
+	enum blk_eh_timer_return ret;
 
-		ret = req->q->mq_ops->timeout(req, reserved);
-		switch (ret) {
-		case BLK_EH_DONE:
-		case BLK_EH_DONT_COMPLETE:
-			return;
-		case BLK_EH_RESET_TIMER:
-			break;
-		default:
-			WARN_ON_ONCE(true);
-		}
+	if (!req->q->mq_ops->timeout) {
+		blk_mq_add_timer(req);
+		return;
 	}
 
-	blk_mq_add_timer(req);
+	ret = req->q->mq_ops->timeout(req, reserved);
+	switch (ret) {
+	case BLK_EH_DONE:
+		WARN_ON_ONCE(blk_mq_rq_state(req) == MQ_RQ_TIMED_OUT);
+		break;
+	case BLK_EH_DONT_COMPLETE:
+		return;
+	case BLK_EH_RESET_TIMER:
+		blk_mq_add_timer(req);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		return;
+	}
+again:
+	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
+		return;
+	if (blk_mq_change_rq_state(req, MQ_RQ_COMPLETION_DELAYED,
+				   MQ_RQ_COMPLETE)) {
+		__blk_mq_complete_request(req);
+		return;
+	}
+	switch (blk_mq_rq_state(req)) {
+	case MQ_RQ_IDLE:
+	case MQ_RQ_IN_FLIGHT:
+	case MQ_RQ_COMPLETE:
+		WARN_ON_ONCE(true);
+		return;
+	case MQ_RQ_TIMED_OUT:
+	case MQ_RQ_COMPLETION_DELAYED:
+		/*
+		 * In the unlikely case of a race with the request completion
+		 * code, retry.
+		 */
+		goto again;
+	}
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, struct request *rq,
+				 void *priv, bool reserved)
 {
+	struct blk_mq_timeout_data *data = priv;
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
 	unsigned long deadline;
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
-		return false;
-	if (rq->rq_flags & RQF_TIMED_OUT)
-		return false;
+	might_sleep();
 
-	deadline = blk_rq_deadline(rq);
-	if (time_after_eq(jiffies, deadline))
-		return true;
+#ifdef CONFIG_64BIT
+	deadline = rq->__deadline;
+#else
+	while (true) {
+		int start;
 
-	if (*next == 0)
-		*next = deadline;
-	else if (time_after(*next, deadline))
-		*next = deadline;
-	return false;
+		start = read_seqcount_begin(&rq->deadline_seq);
+		deadline = rq->__deadline;
+		if (!read_seqcount_retry(&rq->deadline_seq, start))
+			break;
+		cond_resched();
+	}
+#endif
+
+	/*
+	 * Make sure that rq->aborted_gstate != rq->gstate if a request gets
+	 * recycled before blk_mq_terminate_expired() is called.
+	 */
+	rq->aborted_gstate = gstate;
+	rq->aborted_gstate.generation ^= (1UL << 28);
+	if (gstate.state == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		/* request timed out */
+		rq->aborted_gstate = gstate;
+		data->nr_expired++;
+		hctx->nr_expired++;
+	} else if (!data->next_set || time_after(data->next, deadline)) {
+		data->next = deadline;
+		data->next_set = 1;
+	}
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
-	unsigned long *next = priv;
+	union blk_generation_and_state old_val = rq->aborted_gstate;
+	union blk_generation_and_state new_val = old_val;
 
-	/*
-	 * Just do a quick check if it is expired before locking the request in
-	 * so we're not unnecessarilly synchronizing across CPUs.
-	 */
-	if (!blk_mq_req_expired(rq, next))
-		return;
+	new_val.state = MQ_RQ_TIMED_OUT;
 
 	/*
-	 * We have reason to believe the request may be expired. Take a
-	 * reference on the request to lock this request lifetime into its
-	 * currently allocated context to prevent it from being reallocated in
-	 * the event the completion by-passes this timeout handler.
-	 *
-	 * If the reference was already released, then the driver beat the
-	 * timeout handler to posting a natural completion.
-	 */
-	if (!refcount_inc_not_zero(&rq->ref))
-		return;
-
-	/*
-	 * The request is now locked and cannot be reallocated underneath the
-	 * timeout handler's processing. Re-verify this exact request is truly
-	 * expired; if it is not expired, then the request was completed and
-	 * reallocated as a new request.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->gstate has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (blk_mq_req_expired(rq, next))
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	unsigned long next = 0;
+	struct blk_mq_timeout_data data = {
+		.next		= 0,
+		.next_set	= 0,
+		.nr_expired	= 0,
+	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -868,10 +998,43 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+	/* scan for the expired ones and set their ->aborted_gstate */
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
-	if (next != 0) {
-		mod_timer(&q->timeout, next);
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished. This is also
+		 * necessary to avoid races with blk_mq_requeue_request() for
+		 * requests that have already been started.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->srcu);
+
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		WRITE_ONCE(q->timeout_context, current);
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+		WRITE_ONCE(q->timeout_context, NULL);
+	}
+
+	if (data.next_set) {
+		data.next = blk_rq_timeout(round_jiffies_up(data.next));
+		mod_timer(&q->timeout, data.next);
 	} else {
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
@@ -2015,7 +2178,9 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47e2526..de92cddc147f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -90,13 +90,21 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_generation_and_state old_val,
+				       union blk_generation_and_state new_val)
+{
+	return cmpxchg(&rq->gstate.val, old_val.val, new_val.val) ==
+		old_val.val;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
 static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->state);
+	return READ_ONCE(rq->gstate).state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 8e6661031bc0..793592560159 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -149,6 +149,22 @@ void blk_timeout_work(struct work_struct *work)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
+/* Update deadline to @time. May be called from interrupt context. */
+static void blk_mq_rq_set_deadline(struct request *rq, unsigned long new_time)
+{
+#ifdef CONFIG_64BIT
+	rq->__deadline = new_time;
+#else
+	unsigned long flags;
+
+	local_irq_save(flags);
+	write_seqcount_begin(&rq->deadline_seq);
+	rq->__deadline = new_time;
+	write_seqcount_end(&rq->deadline_seq);
+	local_irq_restore(flags);
+#endif
+}
+
 /**
  * blk_abort_request -- Request request recovery for the specified command
  * @req:	pointer to the request of interest
@@ -162,11 +178,10 @@ void blk_abort_request(struct request *req)
 {
 	if (req->q->mq_ops) {
 		/*
-		 * All we need to ensure is that timeout scan takes place
-		 * immediately and that scan sees the new timeout value.
-		 * No need for fancy synchronizations.
+		 * Ensure that a timeout scan takes place immediately and that
+		 * that scan sees the new timeout value.
 		 */
-		blk_rq_set_deadline(req, jiffies);
+		blk_mq_rq_set_deadline(req, jiffies);
 		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
@@ -235,7 +250,6 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	req->rq_flags &= ~RQF_TIMED_OUT;
 	deadline = jiffies + req->timeout;
 	blk_rq_set_deadline(req, deadline);
 	list_add_tail(&req->timeout_list, &req->q->timeout_list);
@@ -257,9 +271,7 @@ void blk_mq_add_timer(struct request *req)
 
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
-
-	req->rq_flags &= ~RQF_TIMED_OUT;
 	deadline = jiffies + req->timeout;
-	blk_rq_set_deadline(req, deadline);
+	blk_mq_rq_set_deadline(req, deadline);
 	__blk_add_timer(req, deadline);
 }
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index a885f4f23591..963ad28257cc 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -296,20 +296,6 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 		rtn = host->hostt->eh_timed_out(scmd);
 
 	if (rtn == BLK_EH_DONT_COMPLETE) {
-		/*
-		 * For blk-mq, we must set the request state to complete now
-		 * before sending the request to the scsi error handler. This
-		 * will prevent a use-after-free in the event the LLD manages
-		 * to complete the request before the error handler finishes
-		 * processing this timed out request.
-		 *
-		 * If the request was already completed, then the LLD beat the
-		 * time out handler from transferring the request to the scsi
-		 * error handler. In that case we can return immediately as no
-		 * further action is required.
-		 */
-		if (req->q->mq_ops && !blk_mq_mark_complete(req))
-			return rtn;
 		if (scsi_abort_command(scmd) != SUCCESS) {
 			set_host_byte(scmd, DID_TIME_OUT);
 			scsi_eh_scmd_add(scmd);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1da59c16f637..d710e92874cc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,20 +289,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-	return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-			MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9aeda7e982c4..5915f7208cc6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -126,20 +126,44 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
-/* ->timeout has been called, don't expire again */
-#define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
-/*
- * Request state for blk-mq.
+union blk_generation_and_state {
+	struct {
+		uint32_t generation:29;
+		uint32_t state:3;
+	};
+	uint32_t val;
+};
+
+/**
+ * enum mq_rq_state - blk-mq request state
+ *
+ * The legal state transitions are:
+ * - idle      -> in flight: blk_mq_start_request()
+ * - in flight -> complete:  blk_mq_complete_request()
+ * - in flight -> timed out: request times out
+ * - timed out -> complete:  blk_mq_complete_request() called from inside
+ *			     the timeout handler
+ * - timed out -> cmpltn dlyd: blk_mq_complete_request() called from outside
+ *			     the timeout handler
+ * - cmpltn dlyd -> complete: blk_mq_rq_timed_out() completes request
+ * - complete  -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - in flight -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> in flight: request restart due to BLK_EH_RESET_TIMER
+ *
+ * See also blk_generation_and_state.state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_COMPLETION_DELAYED= 3,
+	MQ_RQ_TIMED_OUT		= 4,
 };
 
 /*
@@ -242,12 +266,26 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	enum mq_rq_state state;
-	refcount_t ref;
-
 	unsigned int timeout;
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	union blk_generation_and_state aborted_gstate;
+
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues. deadline_seq protects __deadline in blk-mq mode
+	 * only.
+	 */
+	union blk_generation_and_state gstate;
+#ifndef CONFIG_64BIT
+	seqcount_t deadline_seq;
+#endif
 	unsigned long __deadline;
 
 	struct list_head timeout_list;
@@ -581,6 +619,8 @@ struct request_queue {
 	struct work_struct	timeout_work;
 	struct list_head	timeout_list;
 
+	struct task_struct	*timeout_context;
+
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
 	DECLARE_BITMAP		(blkcg_pols, BLKCG_MAX_POLS);
-- 
2.18.0
