* [PATCH 0/5] Improve blk-mq performance
@ 2018-07-27 16:20 Bart Van Assche
  2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hello Jens,

The blk-mq timeout handling rework that went upstream during the v4.18
development cycle introduced a performance regression due to the atomic
instructions that were added in the hot path. This patch series improves
blk-mq performance by reducing the number of atomic instructions in the
hot path. This patch series increases the number of IOPS reported by the
following test by 15% (fifteen per cent) on an x86_64 system:

#!/bin/bash

if [ -e /sys/kernel/config/nullb ]; then
    for d in /sys/kernel/config/nullb/*; do
        [ -d "$d" ] && rmdir "$d"
    done
fi
modprobe -r null_blk
[ -e /sys/module/null_blk ] && exit 1 # bail out if the module failed to unload
modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 4096 > blocksize &&
    echo 0 > home_node &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 1 > power ||
    exit $?

(
    cd /sys/block/nullb0/queue &&
    echo 2 > rq_affinity
) || exit $?

iodepth=${1:-1}
runtime=30
args=()
if [ "$iodepth" = 1 ]; then
    args+=(--ioengine=psync)
else
    args+=(--ioengine=libaio --iodepth_batch=$((iodepth/2)))
fi
args+=(--iodepth=$iodepth --name=nullb0 --filename=/dev/nullb0\
    --rw=read --bs=4096 --loops=$((1<<20)) --direct=1 --numjobs=1 --thread\
    --runtime=$runtime --invalidate=1 --gtod_reduce=1 --ioscheduler=none)
numactl -m 0 -N 0 -- fio "${args[@]}"


Bart Van Assche (5):
  blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER
  block: Remove a superfluous #include directive
  block: Split blk_add_timer()
  block: Simplify blk_add_timer() and blk_mq_add_timer()
  blk-mq: Rework blk-mq timeout handling again

 Documentation/scsi/scsi_eh.txt            |   4 +-
 block/blk-core.c                          |   3 +
 block/blk-mq-debugfs.c                    |   3 +-
 block/blk-mq.c                            | 284 ++++++++++++++++------
 block/blk-mq.h                            |  10 +-
 block/blk-timeout.c                       | 109 +++++----
 block/blk.h                               |   1 +
 drivers/block/mtip32xx/mtip32xx.c         |   2 +-
 drivers/block/nbd.c                       |   4 +-
 drivers/block/null_blk_main.c             |   4 +-
 drivers/message/fusion/mptsas.c           |   2 +-
 drivers/mmc/core/queue.c                  |   2 +-
 drivers/nvme/host/pci.c                   |  10 +-
 drivers/nvme/host/rdma.c                  |   2 +-
 drivers/nvme/target/loop.c                |   2 +-
 drivers/s390/block/dasd.c                 |   6 +-
 drivers/scsi/gdth.c                       |   2 +-
 drivers/scsi/libiscsi.c                   |   6 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |   2 +-
 drivers/scsi/mvumi.c                      |   2 +-
 drivers/scsi/qla4xxx/ql4_os.c             |   2 +-
 drivers/scsi/scsi_error.c                 |  18 +-
 drivers/scsi/scsi_transport_fc.c          |   4 +-
 drivers/scsi/scsi_transport_srp.c         |   4 +-
 drivers/scsi/ufs/ufshcd.c                 |   6 +-
 include/linux/blk-mq.h                    |  14 --
 include/linux/blkdev.h                    |  52 +++-
 27 files changed, 367 insertions(+), 193 deletions(-)

-- 
2.18.0

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:20 ` [PATCH 2/5] block: Remove a superfluous #include directive Bart Van Assche
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Ming Lei,
	Hannes Reinecke

If a block driver timeout handler returns BLK_EH_DONE, that means
either that the request has been completed or that the block driver
still owns the request. Since the name BLK_EH_DONE is misleading,
rename it to BLK_EH_DONT_RESET_TIMER.

Fixes: 88b0cfad2888 ("block: document the blk_eh_timer_return values")
Fixes: 6600593cbd93 ("block: rename BLK_EH_NOT_HANDLED to BLK_EH_DONE")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.com>
---
 Documentation/scsi/scsi_eh.txt            |  4 ++--
 block/blk-mq.c                            |  2 +-
 block/blk-timeout.c                       |  2 +-
 drivers/block/mtip32xx/mtip32xx.c         |  2 +-
 drivers/block/nbd.c                       |  4 ++--
 drivers/block/null_blk_main.c             |  4 ++--
 drivers/message/fusion/mptsas.c           |  2 +-
 drivers/mmc/core/queue.c                  |  2 +-
 drivers/nvme/host/pci.c                   | 10 +++++-----
 drivers/nvme/host/rdma.c                  |  2 +-
 drivers/nvme/target/loop.c                |  2 +-
 drivers/s390/block/dasd.c                 |  6 +++---
 drivers/scsi/gdth.c                       |  2 +-
 drivers/scsi/libiscsi.c                   |  6 +++---
 drivers/scsi/megaraid/megaraid_sas_base.c |  2 +-
 drivers/scsi/mvumi.c                      |  2 +-
 drivers/scsi/qla4xxx/ql4_os.c             |  2 +-
 drivers/scsi/scsi_error.c                 |  4 ++--
 drivers/scsi/scsi_transport_fc.c          |  4 ++--
 drivers/scsi/scsi_transport_srp.c         |  4 ++--
 drivers/scsi/ufs/ufshcd.c                 |  6 +++---
 include/linux/blkdev.h                    |  2 +-
 22 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/Documentation/scsi/scsi_eh.txt b/Documentation/scsi/scsi_eh.txt
index 1b7436932a2b..59e085b28b31 100644
--- a/Documentation/scsi/scsi_eh.txt
+++ b/Documentation/scsi/scsi_eh.txt
@@ -86,9 +86,9 @@ function
 	This indicates that more time is required to finish the
 	command.  Timer is restarted.  This action is counted as a
 	retry and only allowed scmd->allowed + 1(!) times.  Once the
-	limit is reached, action for BLK_EH_DONE is taken instead.
+	limit is reached, action for BLK_EH_DONT_RESET_TIMER is taken instead.
 
-    - BLK_EH_DONE
+    - BLK_EH_DONT_RESET_TIMER
         eh_timed_out() callback did not handle the command.
 	Step #2 is taken.
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c92ce06fd565..96e1a7f25875 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -774,7 +774,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		enum blk_eh_timer_return ret;
 
 		ret = req->q->mq_ops->timeout(req, reserved);
-		if (ret == BLK_EH_DONE)
+		if (ret == BLK_EH_DONT_RESET_TIMER)
 			return;
 		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
 	}
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index f2cfd56e1606..37df7f8f8516 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -90,7 +90,7 @@ static void blk_rq_timed_out(struct request *req)
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
-	case BLK_EH_DONE:
+	case BLK_EH_DONT_RESET_TIMER:
 		/*
 		 * LLD handles this for now but in the future
 		 * we can send a request msg to abort the command
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index c73626decb46..e28be7821d0c 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3712,7 +3712,7 @@ static enum blk_eh_timer_return mtip_cmd_timeout(struct request *req,
 
 		cmd->status = BLK_STS_TIMEOUT;
 		blk_mq_complete_request(req);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	if (test_bit(req->tag, dd->port->cmds_to_issue))
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 3fb95c8d9fd8..67e17965e462 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -382,7 +382,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 			mutex_unlock(&cmd->lock);
 			nbd_requeue_cmd(cmd);
 			nbd_config_put(nbd);
-			return BLK_EH_DONE;
+			return BLK_EH_DONT_RESET_TIMER;
 		}
 	} else {
 		dev_err_ratelimited(nbd_to_dev(nbd),
@@ -395,7 +395,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
 	nbd_config_put(nbd);
 done:
 	blk_mq_complete_request(req);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 /*
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 86cafa6d3b41..3617c9ac251a 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1320,7 +1320,7 @@ static enum blk_eh_timer_return null_rq_timed_out_fn(struct request *rq)
 {
 	pr_info("null: rq %p timed out\n", rq);
 	__blk_complete_request(rq);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static int null_rq_prep_fn(struct request_queue *q, struct request *req)
@@ -1383,7 +1383,7 @@ static enum blk_eh_timer_return null_timeout_rq(struct request *rq, bool res)
 {
 	pr_info("null: rq %p timed out\n", rq);
 	blk_mq_complete_request(rq);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static blk_status_t null_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 76a66da33996..b703ad68f426 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -1929,7 +1929,7 @@ static enum blk_eh_timer_return mptsas_eh_timed_out(struct scsi_cmnd *sc)
 	MPT_SCSI_HOST *hd;
 	MPT_ADAPTER   *ioc;
 	VirtDevice    *vdevice;
-	enum blk_eh_timer_return rc = BLK_EH_DONE;
+	enum blk_eh_timer_return rc = BLK_EH_DONT_RESET_TIMER;
 
 	hd = shost_priv(sc->device->host);
 	if (hd == NULL) {
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 648eb6743ed5..ea32dcd6bfcf 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -113,7 +113,7 @@ static enum blk_eh_timer_return mmc_cqe_timed_out(struct request *req)
 		}
 		/* No timeout (XXX: huh? comment doesn't make much sense) */
 		blk_mq_complete_request(req);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	default:
 		/* Timeout is handled by mmc core */
 		return BLK_EH_RESET_TIMER;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8dcae11bbf3a..efb034b0b98e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1211,7 +1211,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		nvme_warn_reset(dev, csts);
 		nvme_dev_disable(dev, false);
 		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	/*
@@ -1221,14 +1221,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		dev_warn(dev->ctrl.device,
 			 "I/O %d QID %d timeout, completion polled\n",
 			 req->tag, nvmeq->qid);
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	/*
 	 * Shutdown immediately if controller times out while starting. The
 	 * reset work will see the pci device disabled when it gets the forced
 	 * cancellation error. All outstanding requests are completed on
-	 * shutdown, so we return BLK_EH_DONE.
+	 * shutdown, so we return BLK_EH_DONT_RESET_TIMER.
 	 */
 	switch (dev->ctrl.state) {
 	case NVME_CTRL_CONNECTING:
@@ -1238,7 +1238,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, false);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	default:
 		break;
 	}
@@ -1256,7 +1256,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		nvme_reset_ctrl(&dev->ctrl);
 
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0805fa6215ee..d18511427437 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1682,7 +1682,7 @@ nvme_rdma_timeout(struct request *rq, bool reserved)
 	/* fail with DNR on cmd timeout */
 	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..419107cd952a 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -148,7 +148,7 @@ nvme_loop_timeout(struct request *rq, bool reserved)
 	/* fail with DNR on admin cmd timeout */
 	nvme_req(rq)->status = NVME_SC_ABORT_REQ | NVME_SC_DNR;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static blk_status_t nvme_loop_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c
index a9f60d0ee02e..d9bfebcc6aa7 100644
--- a/drivers/s390/block/dasd.c
+++ b/drivers/s390/block/dasd.c
@@ -3010,7 +3010,7 @@ static blk_status_t do_dasd_request(struct blk_mq_hw_ctx *hctx,
  *
  * Return values:
  * BLK_EH_RESET_TIMER if the request should be left running
- * BLK_EH_DONE if the request is handled or terminated
+ * BLK_EH_DONT_RESET_TIMER if the request is handled or terminated
  *		      by the driver.
  */
 enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
@@ -3023,7 +3023,7 @@ enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
 
 	cqr = blk_mq_rq_to_pdu(req);
 	if (!cqr)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 
 	spin_lock_irqsave(&cqr->dq->lock, flags);
 	device = cqr->startdev ? cqr->startdev : block->base;
@@ -3078,7 +3078,7 @@ enum blk_eh_timer_return dasd_times_out(struct request *req, bool reserved)
 	spin_unlock(&block->queue_lock);
 	spin_unlock_irqrestore(&cqr->dq->lock, flags);
 
-	return rc ? BLK_EH_RESET_TIMER : BLK_EH_DONE;
+	return rc ? BLK_EH_RESET_TIMER : BLK_EH_DONT_RESET_TIMER;
 }
 
 static int dasd_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 85604795d8ee..fa18d3d1dc9c 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -3882,7 +3882,7 @@ static enum blk_eh_timer_return gdth_timed_out(struct scsi_cmnd *scp)
 	struct gdth_cmndinfo *cmndinfo = gdth_cmnd_priv(scp);
 	u8 b, t;
 	unsigned long flags;
-	enum blk_eh_timer_return retval = BLK_EH_DONE;
+	enum blk_eh_timer_return retval = BLK_EH_DONT_RESET_TIMER;
 
 	TRACE(("%s() cmd 0x%x\n", scp->cmnd[0], __func__));
 	b = scp->device->channel;
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index d6093838f5f2..6ede7e01df30 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -1963,7 +1963,7 @@ static int iscsi_has_ping_timed_out(struct iscsi_conn *conn)
 
 enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 {
-	enum blk_eh_timer_return rc = BLK_EH_DONE;
+	enum blk_eh_timer_return rc = BLK_EH_DONT_RESET_TIMER;
 	struct iscsi_task *task = NULL, *running_task;
 	struct iscsi_cls_session *cls_session;
 	struct iscsi_session *session;
@@ -1982,7 +1982,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 		 * Raced with completion. Blk layer has taken ownership
 		 * so let timeout code complete it now.
 		 */
-		rc = BLK_EH_DONE;
+		rc = BLK_EH_DONT_RESET_TIMER;
 		goto done;
 	}
 
@@ -1997,7 +1997,7 @@ enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd *sc)
 		if (unlikely(system_state != SYSTEM_RUNNING)) {
 			sc->result = DID_NO_CONNECT << 16;
 			ISCSI_DBG_EH(session, "sc on shutdown, handled\n");
-			rc = BLK_EH_DONE;
+			rc = BLK_EH_DONT_RESET_TIMER;
 			goto done;
 		}
 		/*
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 71d97573a667..bd5bc1da9be1 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -2768,7 +2768,7 @@ blk_eh_timer_return megasas_reset_timer(struct scsi_cmnd *scmd)
 
 	if (time_after(jiffies, scmd->jiffies_at_alloc +
 				(scmd_timeout * 2) * HZ)) {
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 	}
 
 	instance = (struct megasas_instance *)scmd->device->host->hostdata;
diff --git a/drivers/scsi/mvumi.c b/drivers/scsi/mvumi.c
index b3cd9a6b1d30..84cdadd4bd78 100644
--- a/drivers/scsi/mvumi.c
+++ b/drivers/scsi/mvumi.c
@@ -2155,7 +2155,7 @@ static enum blk_eh_timer_return mvumi_timed_out(struct scsi_cmnd *scmd)
 	mvumi_return_cmd(mhba, cmd);
 	spin_unlock_irqrestore(mhba->shost->host_lock, flags);
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 static int
diff --git a/drivers/scsi/qla4xxx/ql4_os.c b/drivers/scsi/qla4xxx/ql4_os.c
index 0e13349dce57..ecabca8d815b 100644
--- a/drivers/scsi/qla4xxx/ql4_os.c
+++ b/drivers/scsi/qla4xxx/ql4_os.c
@@ -1848,7 +1848,7 @@ static enum blk_eh_timer_return qla4xxx_eh_cmd_timed_out(struct scsi_cmnd *sc)
 	struct iscsi_cls_session *session;
 	struct iscsi_session *sess;
 	unsigned long flags;
-	enum blk_eh_timer_return ret = BLK_EH_DONE;
+	enum blk_eh_timer_return ret = BLK_EH_DONT_RESET_TIMER;
 
 	session = starget_to_session(scsi_target(sc->device));
 	sess = session->dd_data;
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2715cdaa669c..b38b8e62e618 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -283,7 +283,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 enum blk_eh_timer_return scsi_times_out(struct request *req)
 {
 	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(req);
-	enum blk_eh_timer_return rtn = BLK_EH_DONE;
+	enum blk_eh_timer_return rtn = BLK_EH_DONT_RESET_TIMER;
 	struct Scsi_Host *host = scmd->device->host;
 
 	trace_scsi_dispatch_cmd_timeout(scmd);
@@ -295,7 +295,7 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 	if (host->hostt->eh_timed_out)
 		rtn = host->hostt->eh_timed_out(scmd);
 
-	if (rtn == BLK_EH_DONE) {
+	if (rtn == BLK_EH_DONT_RESET_TIMER) {
 		/*
 		 * For blk-mq, we must set the request state to complete now
 		 * before sending the request to the scsi error handler. This
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 13948102ca29..63d283f298f3 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -2087,7 +2087,7 @@ fc_eh_timed_out(struct scsi_cmnd *scmd)
 	if (rport->port_state == FC_PORTSTATE_BLOCKED)
 		return BLK_EH_RESET_TIMER;
 
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 EXPORT_SYMBOL(fc_eh_timed_out);
 
@@ -3593,7 +3593,7 @@ fc_bsg_job_timeout(struct request *req)
 	/* the blk_end_sync_io() doesn't check the error */
 	if (inflight)
 		__blk_complete_request(req);
-	return BLK_EH_DONE;
+	return BLK_EH_DONT_RESET_TIMER;
 }
 
 /**
diff --git a/drivers/scsi/scsi_transport_srp.c b/drivers/scsi/scsi_transport_srp.c
index 4e46fdb2d7c9..03571512dca5 100644
--- a/drivers/scsi/scsi_transport_srp.c
+++ b/drivers/scsi/scsi_transport_srp.c
@@ -604,7 +604,7 @@ EXPORT_SYMBOL(srp_reconnect_rport);
  *
  * If a timeout occurs while an rport is in the blocked state, ask the SCSI
  * EH to continue waiting (BLK_EH_RESET_TIMER). Otherwise let the SCSI core
- * handle the timeout (BLK_EH_DONE).
+ * handle the timeout (BLK_EH_DONT_RESET_TIMER).
  *
  * Note: This function is called from soft-IRQ context and with the request
  * queue lock held.
@@ -620,7 +620,7 @@ enum blk_eh_timer_return srp_timed_out(struct scsi_cmnd *scmd)
 	return rport && rport->fast_io_fail_tmo < 0 &&
 		rport->dev_loss_tmo < 0 &&
 		i->f->reset_timer_if_blocked && scsi_device_blocked(sdev) ?
-		BLK_EH_RESET_TIMER : BLK_EH_DONE;
+		BLK_EH_RESET_TIMER : BLK_EH_DONT_RESET_TIMER;
 }
 EXPORT_SYMBOL(srp_timed_out);
 
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 397081d320b1..2184222953b2 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6682,12 +6682,12 @@ static enum blk_eh_timer_return ufshcd_eh_timed_out(struct scsi_cmnd *scmd)
 	bool found = false;
 
 	if (!scmd || !scmd->device || !scmd->device->host)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 
 	host = scmd->device->host;
 	hba = shost_priv(host);
 	if (!hba)
-		return BLK_EH_DONE;
+		return BLK_EH_DONT_RESET_TIMER;
 
 	spin_lock_irqsave(host->host_lock, flags);
 
@@ -6705,7 +6705,7 @@ static enum blk_eh_timer_return ufshcd_eh_timed_out(struct scsi_cmnd *scmd)
 	 * SCSI command was not actually dispatched to UFS driver, otherwise
 	 * let SCSI layer handle the error as usual.
 	 */
-	return found ? BLK_EH_DONE : BLK_EH_RESET_TIMER;
+	return found ? BLK_EH_DONT_RESET_TIMER : BLK_EH_RESET_TIMER;
 }
 
 static const struct attribute_group *ufshcd_driver_groups[] = {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..d5d31b7b6ba8 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -328,7 +328,7 @@ typedef int (init_rq_fn)(struct request_queue *, struct request *, gfp_t);
 typedef void (exit_rq_fn)(struct request_queue *, struct request *);
 
 enum blk_eh_timer_return {
-	BLK_EH_DONE,		/* drivers has completed the command */
+	BLK_EH_DONT_RESET_TIMER,/* driver completed or owns the command */
 	BLK_EH_RESET_TIMER,	/* reset timer and try again */
 };
 
-- 
2.18.0


* [PATCH 2/5] block: Remove a superfluous #include directive
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Commit 12f5b9314545 ("blk-mq: Remove generation seqeunce") removed the
only u64_stats_sync instance from <linux/blkdev.h> but did not remove
the corresponding #include directive. Since the #include directive is
no longer needed, remove it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 include/linux/blkdev.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d5d31b7b6ba8..8244a5a1aa5b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -28,7 +28,6 @@
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
 #include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
-- 
2.18.0


* [PATCH 3/5] block: Split blk_add_timer()
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
  2018-07-27 16:20 ` [PATCH 2/5] block: Remove a superfluous #include directive Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:29   ` Keith Busch
  2018-07-27 16:20 ` [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Split blk_add_timer() as follows:
- Introduce the helper function __blk_add_timer() that performs the
  tasks that apply both to the legacy block layer core and to blk-mq.
- Duplicate the remaining blk_add_timer() function into the new
  function blk_mq_add_timer().
- Change the blk_add_timer() calls in blk-mq code into
  blk_mq_add_timer() calls.

This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c      |  4 +--
 block/blk-timeout.c | 81 +++++++++++++++++++++++++++++++++++----------
 block/blk.h         |  1 +
 3 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 96e1a7f25875..1b49973629f6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -644,7 +644,7 @@ void blk_mq_start_request(struct request *rq)
 
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
-	blk_add_timer(rq);
+	blk_mq_add_timer(rq);
 	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
@@ -779,7 +779,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
 	}
 
-	blk_add_timer(req);
+	blk_mq_add_timer(req);
 }
 
 static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 37df7f8f8516..527670f2f985 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -181,6 +181,33 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
+static void __blk_add_timer(struct request *req, unsigned long deadline)
+{
+	struct request_queue *q = req->q;
+	unsigned long expiry;
+
+	/*
+	 * If the timer isn't already pending or this timeout is earlier
+	 * than an existing one, modify the timer. Round up to next nearest
+	 * second.
+	 */
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
+	if (!timer_pending(&q->timeout) ||
+	    time_before(expiry, q->timeout.expires)) {
+		unsigned long diff = q->timeout.expires - expiry;
+
+		/*
+		 * Due to added timer slack to group timers, the timer
+		 * will often be a little in front of what we asked for.
+		 * So apply some tolerance here too, otherwise we keep
+		 * modifying the timer because expires for value X
+		 * will be X + something.
+		 */
+		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
+			mod_timer(&q->timeout, expiry);
+	}
+}
+
 /**
  * blk_add_timer - Start timeout timer for a single request
  * @req:	request that is about to start running.
@@ -192,7 +219,6 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 void blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
-	unsigned long expiry;
 
 	if (!q->mq_ops)
 		lockdep_assert_held(q->queue_lock);
@@ -220,26 +246,45 @@ void blk_add_timer(struct request *req)
 	if (!q->mq_ops)
 		list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
+	__blk_add_timer(req, blk_rq_deadline(req));
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ *
+ * Sets the deadline of a request. The caller must guarantee that the request
+ * state won't be modified while this function is in progress.
+ */
+void blk_mq_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+
+	if (!q->mq_ops)
+		lockdep_assert_held(q->queue_lock);
+
+	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
+	if (!q->mq_ops && !q->rq_timed_out_fn)
+		return;
+
+	BUG_ON(!list_empty(&req->timeout_list));
+
 	/*
-	 * If the timer isn't already pending or this timeout is earlier
-	 * than an existing one, modify the timer. Round up to next nearest
-	 * second.
+	 * Some LLDs, like scsi, peek at the timeout to prevent a
+	 * command from being retried forever.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
 
-	if (!timer_pending(&q->timeout) ||
-	    time_before(expiry, q->timeout.expires)) {
-		unsigned long diff = q->timeout.expires - expiry;
+	req->rq_flags &= ~RQF_TIMED_OUT;
+	blk_rq_set_deadline(req, jiffies + req->timeout);
 
-		/*
-		 * Due to added timer slack to group timers, the timer
-		 * will often be a little in front of what we asked for.
-		 * So apply some tolerance here too, otherwise we keep
-		 * modifying the timer because expires for value X
-		 * will be X + something.
-		 */
-		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
-			mod_timer(&q->timeout, expiry);
-	}
+	/*
+	 * Only the non-mq case needs to add the request to a protected list.
+	 * For the mq case we simply scan the tag map.
+	 */
+	if (!q->mq_ops)
+		list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
+	__blk_add_timer(req, blk_rq_deadline(req));
 }
diff --git a/block/blk.h b/block/blk.h
index 69b14cd2bb22..6adae8f94279 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,7 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req);
 void blk_delete_timer(struct request *);
 
 
-- 
2.18.0


* [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer()
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
                   ` (2 preceding siblings ...)
  2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  2018-07-27 16:22 ` [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Keith Busch,
	Jianchao Wang, Ming Lei

Use the fact that q->mq_ops == NULL for the legacy block layer and
also that q->mq_ops != NULL for blk-mq to simplify blk_add_timer and
blk_mq_add_timer(). This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-timeout.c | 54 +++++++++------------------------------------
 1 file changed, 11 insertions(+), 43 deletions(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 527670f2f985..02d7e3427369 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -219,34 +219,21 @@ static void __blk_add_timer(struct request *req, unsigned long deadline)
 void blk_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
+	unsigned long deadline;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
+	lockdep_assert_held(q->queue_lock);
 
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
+	if (!q->rq_timed_out_fn)
 		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
 	req->rq_flags &= ~RQF_TIMED_OUT;
-	blk_rq_set_deadline(req, jiffies + req->timeout);
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
 
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
-	__blk_add_timer(req, blk_rq_deadline(req));
+	__blk_add_timer(req, deadline);
 }
 
 /**
@@ -259,32 +246,13 @@ void blk_add_timer(struct request *req)
 void blk_mq_add_timer(struct request *req)
 {
 	struct request_queue *q = req->q;
+	unsigned long deadline;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
 	req->rq_flags &= ~RQF_TIMED_OUT;
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
-	__blk_add_timer(req, blk_rq_deadline(req));
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	__blk_add_timer(req, deadline);
 }
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
                   ` (3 preceding siblings ...)
  2018-07-27 16:20 ` [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
@ 2018-07-27 16:20 ` Bart Van Assche
  2018-07-27 16:46   ` Keith Busch
  2018-07-27 16:22 ` [PATCH 0/5] Improve blk-mq performance Bart Van Assche
  5 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Keith Busch, Jianchao Wang, Ming Lei

Recently the blk-mq timeout handling code was reworked to prevent
completions that occur while a timeout handler is in progress from
being ignored. However, that rework removed the protection against
completions that occur while a timeout handler is in progress. Fix this
by introducing a new request state, namely MQ_RQ_TIMED_OUT. This state
means that the timeout handler is in progress.

Other changes in this patch are:
- Reintroduce the request generation number to prevent request timeout
  handling from happening after a request has already completed.
- Reintroduce deadline_seq to make reads and updates of the request
  deadline possible without having to use atomic instructions.
- Remove the request state information that became superfluous due to
  this patch, namely the RQF_TIMED_OUT flag and the ref member.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- Move the code for managing the request state back from the SCSI
  core into the block layer core.

This patch reverts the following two commits:
* 0fc09f920983 ("blk-mq: export setting request completion state")
* 065990bd198e ("scsi: set timed out out mq requests to complete")
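
The generation/state packing reintroduced by this patch can be sketched as a
userspace model (hypothetical names; C11 atomics stand in for the kernel's
cmpxchg()):

```c
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical model of union blk_generation_and_state: a 30-bit
 * generation counter packed with a 2-bit state in one 32-bit word. */
enum rq_state { RQ_IDLE = 0, RQ_IN_FLIGHT = 1, RQ_COMPLETE = 2, RQ_TIMED_OUT = 3 };

union gen_state {
	struct {
		uint32_t generation:30;
		uint32_t state:2;
	};
	uint32_t val;
};

/* Atomically move from old_state to new_state with one compare-and-swap
 * on the packed word; bump the generation whenever a request (re)enters
 * the in-flight state, so that a stale timeout scan can detect that the
 * request has been recycled in the meantime. */
static _Bool change_rq_state(_Atomic uint32_t *word, enum rq_state old_state,
			     enum rq_state new_state)
{
	union gen_state old_val = { .val = atomic_load(word) };
	union gen_state new_val = old_val;
	uint32_t expected;

	old_val.state = old_state;
	new_val.state = new_state;
	if (new_state == RQ_IN_FLIGHT)
		new_val.generation++;
	expected = old_val.val;
	return atomic_compare_exchange_strong(word, &expected, new_val.val);
}
```

Because generation and state live in the same word, a single compare-and-swap
both validates the expected state and detects recycling, which is what lets
the hot path avoid a separate reference count.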

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          |   3 +
 block/blk-mq-debugfs.c    |   3 +-
 block/blk-mq.c            | 280 ++++++++++++++++++++++++++++----------
 block/blk-mq.h            |  10 +-
 block/blk-timeout.c       |  28 ++--
 drivers/scsi/scsi_error.c |  14 --
 include/linux/blk-mq.h    |  14 --
 include/linux/blkdev.h    |  49 +++++--
 8 files changed, 281 insertions(+), 120 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03a4ea93a5f3..3c63e410ede4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,6 +198,9 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..72abc9a87c53 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -335,8 +335,9 @@ static const char *const rqf_name[] = {
 
 static const char *const blk_mq_rq_state_name_array[] = {
 	[MQ_RQ_IDLE]		= "idle",
-	[MQ_RQ_IN_FLIGHT]	= "in_flight",
+	[MQ_RQ_IN_FLIGHT]	= "in flight",
 	[MQ_RQ_COMPLETE]	= "complete",
+	[MQ_RQ_TIMED_OUT]	= "timed out",
 };
 
 static const char *blk_mq_rq_state_name(enum mq_rq_state rq_state)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1b49973629f6..a97ab5ba9d18 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -6,6 +6,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/atomic.h>	/* cmpxchg() */
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
@@ -322,6 +323,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	rq->end_io = NULL;
 	rq->end_io_data = NULL;
@@ -332,7 +334,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 #endif
 
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
-	refcount_set(&rq->ref, 1);
 	return rq;
 }
 
@@ -466,19 +467,42 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-static void __blk_mq_free_request(struct request *rq)
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old_state and if the
+ * state has been changed into @new_state.
+ */
+static bool blk_mq_change_rq_state(struct request *rq,
+				   enum mq_rq_state old_state,
+				   enum mq_rq_state new_state)
 {
-	struct request_queue *q = rq->q;
-	struct blk_mq_ctx *ctx = rq->mq_ctx;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
-	const int sched_tag = rq->internal_tag;
+	union blk_generation_and_state old_val = READ_ONCE(rq->gstate);
+	union blk_generation_and_state new_val = old_val;
 
-	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
-	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
-	blk_mq_sched_restart(hctx);
-	blk_queue_exit(q);
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight or timed out to another state
+	 * cmpxchg() must be used. For other state transitions it is safe to
+	 * use WRITE_ONCE().
+	 */
+	switch (old_state) {
+	case MQ_RQ_IN_FLIGHT:
+	case MQ_RQ_TIMED_OUT:
+		return blk_mq_set_rq_state(rq, old_val, new_val);
+	case MQ_RQ_IDLE:
+	case MQ_RQ_COMPLETE:
+		WRITE_ONCE(rq->gstate.val, new_val.val);
+		return true;
+	}
+	WARN(true, "Invalid request state %d\n", old_state);
+	return false;
 }
 
 void blk_mq_free_request(struct request *rq)
@@ -487,6 +511,7 @@ void blk_mq_free_request(struct request *rq)
 	struct elevator_queue *e = q->elevator;
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
+	const int sched_tag = rq->internal_tag;
 
 	if (rq->rq_flags & RQF_ELVPRIV) {
 		if (e && e->type->ops.mq.finish_request)
@@ -509,9 +534,14 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
+	if (rq->tag != -1)
+		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	if (sched_tag != -1)
+		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+	blk_mq_sched_restart(hctx);
+	blk_queue_exit(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
@@ -558,8 +588,8 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	if (!blk_mq_mark_complete(rq))
-		return;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 
@@ -615,7 +645,18 @@ void blk_mq_complete_request(struct request *rq)
 {
 	if (unlikely(blk_should_fake_timeout(rq->q)))
 		return;
-	__blk_mq_complete_request(rq);
+again:
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+		__blk_mq_complete_request(rq);
+		return;
+	}
+	if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+		return;
+	if (WARN_ON_ONCE(blk_mq_rq_state(rq) == MQ_RQ_IDLE ||
+			 blk_mq_rq_state(rq) == MQ_RQ_COMPLETE))
+		return;
+	/* In the unlikely case of a race with the timeout code, retry. */
+	goto again;
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -645,7 +686,7 @@ void blk_mq_start_request(struct request *rq)
 	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	blk_mq_add_timer(rq);
-	WRITE_ONCE(rq->state, MQ_RQ_IN_FLIGHT);
+	WARN_ON_ONCE(!blk_mq_change_rq_state(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT));
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -658,23 +699,46 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
+/**
+ * __blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ *
+ * This function is either called by blk_mq_requeue_request() or by the block
+ * layer core if .queue_rq() returned BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
+ * If the request state is MQ_RQ_IN_FLIGHT and if this function is called from
+ * inside .queue_rq() then it is guaranteed that the timeout code won't try to
+ * modify the request state while this function is in progress because an RCU
+ * read lock is held around .queue_rq() and because the timeout code calls
+ * synchronize_rcu() after having marked requests and before processing
+ * time-outs.
+ */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	rq_qos_requeue(q, rq);
 
-	if (blk_mq_request_started(rq)) {
-		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
-		rq->rq_flags &= ~RQF_TIMED_OUT;
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
 }
 
+/**
+ * blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ * @kick_requeue_list: whether or not to kick the requeue_list
+ *
+ * This function is called after a request has completed, after a request has
+ * timed out or from inside .queue_rq(). In the latter case the request may
+ * already have been started.
+ */
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
 	__blk_mq_requeue_request(rq);
@@ -767,82 +831,117 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 }
 EXPORT_SYMBOL(blk_mq_tag_to_rq);
 
+struct blk_mq_timeout_data {
+	unsigned long next;
+	unsigned int next_set;
+	unsigned int nr_expired;
+};
+
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
-	req->rq_flags |= RQF_TIMED_OUT;
-	if (req->q->mq_ops->timeout) {
-		enum blk_eh_timer_return ret;
+	enum blk_eh_timer_return ret;
 
-		ret = req->q->mq_ops->timeout(req, reserved);
-		if (ret == BLK_EH_DONT_RESET_TIMER)
-			return;
-		WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
+	if (!req->q->mq_ops->timeout) {
+		blk_mq_add_timer(req);
+		return;
 	}
 
+	ret = req->q->mq_ops->timeout(req, reserved);
+	/*
+	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
+	 * completed the request or still owns the request and will
+	 * continue processing the timeout asynchronously. In the
+	 * latter case, if blk_mq_complete_request() was called while
+	 * the timeout handler was in progress, ignore that call.
+	 */
+	if (ret == BLK_EH_DONT_RESET_TIMER)
+		return;
+	WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
 	blk_mq_add_timer(req);
+again:
+	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
+		return;
+	if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+		__blk_mq_complete_request(req);
+		return;
+	}
+	if (WARN_ON_ONCE(blk_mq_rq_state(req) == MQ_RQ_IDLE ||
+			 blk_mq_rq_state(req) == MQ_RQ_IN_FLIGHT))
+		return;
+	/*
+	 * In the unlikely case of a race with the request completion code,
+	 * retry.
+	 */
+	goto again;
 }
 
-static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
+static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx, struct request *rq,
+				 void *priv, bool reserved)
 {
+	struct blk_mq_timeout_data *data = priv;
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
 	unsigned long deadline;
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT)
-		return false;
-	if (rq->rq_flags & RQF_TIMED_OUT)
-		return false;
+	might_sleep();
 
-	deadline = blk_rq_deadline(rq);
-	if (time_after_eq(jiffies, deadline))
-		return true;
+#ifdef CONFIG_64BIT
+	deadline = rq->__deadline;
+#else
+	while (true) {
+		int start;
 
-	if (*next == 0)
-		*next = deadline;
-	else if (time_after(*next, deadline))
-		*next = deadline;
-	return false;
+		start = read_seqcount_begin(&rq->deadline_seq);
+		deadline = rq->__deadline;
+		if (!read_seqcount_retry(&rq->deadline_seq, start))
+			break;
+		cond_resched();
+	}
+#endif
+
+	/*
+	 * Make sure that rq->aborted_gstate != rq->gstate if a request gets
+	 * recycled before blk_mq_terminate_expired() is called.
+	 */
+	rq->aborted_gstate = gstate;
+	rq->aborted_gstate.generation ^= (1UL << 29);
+	if (gstate.state == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		/* request timed out */
+		rq->aborted_gstate = gstate;
+		data->nr_expired++;
+		hctx->nr_expired++;
+	} else if (!data->next_set || time_after(data->next, deadline)) {
+		data->next = deadline;
+		data->next_set = 1;
+	}
 }
 
-static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
-	unsigned long *next = priv;
+	union blk_generation_and_state old_val = rq->aborted_gstate;
+	union blk_generation_and_state new_val = old_val;
 
-	/*
-	 * Just do a quick check if it is expired before locking the request in
-	 * so we're not unnecessarilly synchronizing across CPUs.
-	 */
-	if (!blk_mq_req_expired(rq, next))
-		return;
-
-	/*
-	 * We have reason to believe the request may be expired. Take a
-	 * reference on the request to lock this request lifetime into its
-	 * currently allocated context to prevent it from being reallocated in
-	 * the event the completion by-passes this timeout handler.
-	 *
-	 * If the reference was already released, then the driver beat the
-	 * timeout handler to posting a natural completion.
-	 */
-	if (!refcount_inc_not_zero(&rq->ref))
-		return;
+	new_val.state = MQ_RQ_TIMED_OUT;
 
 	/*
-	 * The request is now locked and cannot be reallocated underneath the
-	 * timeout handler's processing. Re-verify this exact request is truly
-	 * expired; if it is not expired, then the request was completed and
-	 * reallocated as a new request.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->gstate has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (blk_mq_req_expired(rq, next))
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
-	if (refcount_dec_and_test(&rq->ref))
-		__blk_mq_free_request(rq);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	unsigned long next = 0;
+	struct blk_mq_timeout_data data = {
+		.next		= 0,
+		.next_set	= 0,
+		.nr_expired	= 0,
+	};
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -862,10 +961,41 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
-	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+	/* scan for the expired ones and set their ->aborted_gstate */
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
+
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished. This is also
+		 * necessary to avoid races with blk_mq_requeue_request() for
+		 * requests that have already been started.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->srcu);
 
-	if (next != 0) {
-		mod_timer(&q->timeout, next);
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
+	if (data.next_set) {
+		data.next = blk_rq_timeout(round_jiffies_up(data.next));
+		mod_timer(&q->timeout, data.next);
 	} else {
 		/*
 		 * Request timeouts are handled as a forward rolling timer. If
@@ -2009,7 +2139,9 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	WRITE_ONCE(rq->state, MQ_RQ_IDLE);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47e2526..de92cddc147f 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -90,13 +90,21 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_generation_and_state old_val,
+				       union blk_generation_and_state new_val)
+{
+	return cmpxchg(&rq->gstate.val, old_val.val, new_val.val) ==
+		old_val.val;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
 static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->state);
+	return READ_ONCE(rq->gstate).state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 02d7e3427369..b37b8090f3ae 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -142,6 +142,22 @@ void blk_timeout_work(struct work_struct *work)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
+/* Update deadline to @time. May be called from interrupt context. */
+static void blk_mq_rq_set_deadline(struct request *rq, unsigned long new_time)
+{
+#ifdef CONFIG_64BIT
+	rq->__deadline = new_time;
+#else
+	unsigned long flags;
+
+	local_irq_save(flags);
+	write_seqcount_begin(&rq->deadline_seq);
+	rq->__deadline = new_time;
+	write_seqcount_end(&rq->deadline_seq);
+	local_irq_restore(flags);
+#endif
+}
+
 /**
  * blk_abort_request -- Request request recovery for the specified command
  * @req:	pointer to the request of interest
@@ -155,11 +171,10 @@ void blk_abort_request(struct request *req)
 {
 	if (req->q->mq_ops) {
 		/*
-		 * All we need to ensure is that timeout scan takes place
-		 * immediately and that scan sees the new timeout value.
-		 * No need for fancy synchronizations.
+		 * Ensure that a timeout scan takes place immediately and that
+		 * that scan sees the new timeout value.
 		 */
-		blk_rq_set_deadline(req, jiffies);
+		blk_mq_rq_set_deadline(req, jiffies);
 		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
@@ -228,7 +243,6 @@ void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	req->rq_flags &= ~RQF_TIMED_OUT;
 	deadline = jiffies + req->timeout;
 	blk_rq_set_deadline(req, deadline);
 	list_add_tail(&req->timeout_list, &req->q->timeout_list);
@@ -250,9 +264,7 @@ void blk_mq_add_timer(struct request *req)
 
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
-
-	req->rq_flags &= ~RQF_TIMED_OUT;
 	deadline = jiffies + req->timeout;
-	blk_rq_set_deadline(req, deadline);
+	blk_mq_rq_set_deadline(req, deadline);
 	__blk_add_timer(req, deadline);
 }
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index b38b8e62e618..fa18f9d78170 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -296,20 +296,6 @@ enum blk_eh_timer_return scsi_times_out(struct request *req)
 		rtn = host->hostt->eh_timed_out(scmd);
 
 	if (rtn == BLK_EH_DONT_RESET_TIMER) {
-		/*
-		 * For blk-mq, we must set the request state to complete now
-		 * before sending the request to the scsi error handler. This
-		 * will prevent a use-after-free in the event the LLD manages
-		 * to complete the request before the error handler finishes
-		 * processing this timed out request.
-		 *
-		 * If the request was already completed, then the LLD beat the
-		 * time out handler from transferring the request to the scsi
-		 * error handler. In that case we can return immediately as no
-		 * further action is required.
-		 */
-		if (req->q->mq_ops && !blk_mq_mark_complete(req))
-			return rtn;
 		if (scsi_abort_command(scmd) != SUCCESS) {
 			set_host_byte(scmd, DID_TIME_OUT);
 			scsi_eh_scmd_add(scmd);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1da59c16f637..d710e92874cc 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -289,20 +289,6 @@ void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
 
 void blk_mq_quiesce_queue_nowait(struct request_queue *q);
 
-/**
- * blk_mq_mark_complete() - Set request state to complete
- * @rq: request to set to complete state
- *
- * Returns true if request state was successfully set to complete. If
- * successful, the caller is responsibile for seeing this request is ended, as
- * blk_mq_complete_request will not work again.
- */
-static inline bool blk_mq_mark_complete(struct request *rq)
-{
-	return cmpxchg(&rq->state, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE) ==
-			MQ_RQ_IN_FLIGHT;
-}
-
 /*
  * Driver command data is immediately after the request. So subtract request
  * size to get back to the original request, add request size to get the PDU.
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8244a5a1aa5b..58e5b22123a2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -126,20 +126,39 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
-/* ->timeout has been called, don't expire again */
-#define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
-/*
- * Request state for blk-mq.
+union blk_generation_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+	};
+	uint32_t val;
+};
+
+/**
+ * enum mq_rq_state - blk-mq request state
+ *
+ * The legal state transitions are:
+ * - idle      -> in flight: blk_mq_start_request()
+ * - in flight -> complete:  blk_mq_complete_request()
+ * - timed out -> complete:  blk_mq_complete_request()
+ * - in flight -> timed out: request times out
+ * - complete  -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - in flight -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> in flight: request restart due to BLK_EH_RESET_TIMER
+ *
+ * See also blk_generation_and_state.state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_TIMED_OUT		= 3,
 };
 
 /*
@@ -242,12 +261,26 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	enum mq_rq_state state;
-	refcount_t ref;
-
 	unsigned int timeout;
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	union blk_generation_and_state aborted_gstate;
+
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues. deadline_seq protects __deadline in blk-mq mode
+	 * only.
+	 */
+	union blk_generation_and_state gstate;
+#ifndef CONFIG_64BIT
+	seqcount_t deadline_seq;
+#endif
 	unsigned long __deadline;
 
 	struct list_head timeout_list;
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/5] Improve blk-mq performance
  2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
                   ` (4 preceding siblings ...)
  2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-07-27 16:22 ` Bart Van Assche
  5 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:22 UTC (permalink / raw)
  To: axboe; +Cc: hch, linux-block

On Fri, 2018-07-27 at 09:20 -0700, Bart Van Assche wrote:
> hot path. This patch series introduces the number of IOPS reported by the
                              ^^^^^^^^^^
This is a typo - I meant "improves".

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: Split blk_add_timer()
  2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
@ 2018-07-27 16:29   ` Keith Busch
  2018-07-27 16:31     ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 16:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang, Ming Lei

On Fri, Jul 27, 2018 at 09:20:40AM -0700, Bart Van Assche wrote:
> +void blk_mq_add_timer(struct request *req)
> +{
> +	struct request_queue *q = req->q;
> +
> +	if (!q->mq_ops)
> +		lockdep_assert_held(q->queue_lock);
> +
> +	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
> +	if (!q->mq_ops && !q->rq_timed_out_fn)
> +		return;

<snip>

> +	if (!q->mq_ops)
> +		list_add_tail(&req->timeout_list, &req->q->timeout_list);

Do you really need these checks for !q->mq_ops?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/5] block: Split blk_add_timer()
  2018-07-27 16:29   ` Keith Busch
@ 2018-07-27 16:31     ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:31 UTC (permalink / raw)
  To: keith.busch; +Cc: hch, ming.lei, linux-block, axboe, jianchao.w.wang

On Fri, 2018-07-27 at 10:29 -0600, Keith Busch wrote:
> On Fri, Jul 27, 2018 at 09:20:40AM -0700, Bart Van Assche wrote:
> > +	if (!q->mq_ops)
> > +		list_add_tail(&req->timeout_list, &req->q->timeout_list);
>
> Do you really need these checks for !q->mq_ops?

Hello Keith,

Have you noticed that patch 4/5 removes that check?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-07-27 16:46   ` Keith Busch
  2018-07-27 16:51     ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 16:46 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Jianchao Wang, Ming Lei

On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> +	ret = req->q->mq_ops->timeout(req, reserved);
> +	/*
> +	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
> +	 * completed the request or still owns the request and will
> +	 * continue processing the timeout asynchronously. In the
> +	 * latter case, if blk_mq_complete_request() was called while
> +	 * the timeout handler was in progress, ignore that call.
> +	 */
> +	if (ret == BLK_EH_DONT_RESET_TIMER)
> +		return;

This is how completions get lost.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:46   ` Keith Busch
@ 2018-07-27 16:51     ` Bart Van Assche
  2018-07-27 16:57       ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:51 UTC (permalink / raw)
  To: keith.busch
  Cc: hch, ming.lei, linux-block, martin.petersen, axboe, jianchao.w.wang

On Fri, 2018-07-27 at 10:46 -0600, Keith Busch wrote:
> On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> > +	ret = req->q->mq_ops->timeout(req, reserved);
> > +	/*
> > +	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
> > +	 * completed the request or still owns the request and will
> > +	 * continue processing the timeout asynchronously. In the
> > +	 * latter case, if blk_mq_complete_request() was called while
> > +	 * the timeout handler was in progress, ignore that call.
> > +	 */
> > +	if (ret == BLK_EH_DONT_RESET_TIMER)
> > +		return;
>
> This is how completions get lost.

The new approach for handling completions that occur while the .timeout()
callback in progress is as follows:
* blk_mq_complete_request() executes the following code:
       if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
               return;
* blk_mq_rq_timed_out() executes the following code:
       if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
               __blk_mq_complete_request(req);
               return;
       }

As one can see __blk_mq_complete_request() gets called if this race occurs.

Bart.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:51     ` Bart Van Assche
@ 2018-07-27 16:57       ` Keith Busch
  2018-07-27 16:59         ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 16:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, ming.lei, linux-block, martin.petersen, axboe, jianchao.w.wang

On Fri, Jul 27, 2018 at 04:51:05PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 10:46 -0600, Keith Busch wrote:
> > On Fri, Jul 27, 2018 at 09:20:42AM -0700, Bart Van Assche wrote:
> > > +	ret = req->q->mq_ops->timeout(req, reserved);
> > > +	/*
> > > +	 * BLK_EH_DONT_RESET_TIMER means that the block driver either
> > > +	 * completed the request or still owns the request and will
> > > +	 * continue processing the timeout asynchronously. In the
> > > +	 * latter case, if blk_mq_complete_request() was called while
> > > +	 * the timeout handler was in progress, ignore that call.
> > > +	 */
> > > +	if (ret == BLK_EH_DONT_RESET_TIMER)
> > > +		return;
> > 
> > This is how completions get lost.
> 
> The new approach for handling completions that occur while the .timeout()
> callback in progress is as follows:
> * blk_mq_complete_request() executes the following code:
>        if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
>                return;
> * blk_mq_rq_timed_out() executes the following code:
>        if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
>                __blk_mq_complete_request(req);
>                return;
>        }
> 
> As one can see __blk_mq_complete_request() gets called if this race occurs.

You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:57       ` Keith Busch
@ 2018-07-27 16:59         ` Bart Van Assche
  2018-07-27 17:04           ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 16:59 UTC (permalink / raw)
  To: keith.busch
  Cc: hch, jianchao.w.wang, linux-block, martin.petersen, ming.lei, axboe

On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.

How about applying the following patch on top of this series?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a97ab5ba9d18..aa66535604fd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -854,10 +854,10 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	 * latter case, if blk_mq_complete_request() was called while
 	 * the timeout handler was in progress, ignore that call.
 	 */
-	if (ret == BLK_EH_DONT_RESET_TIMER)
-		return;
-	WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
-	blk_mq_add_timer(req);
+	if (ret == BLK_EH_RESET_TIMER)
+		blk_mq_add_timer(req);
+	else
+		WARN_ON_ONCE(ret != BLK_EH_DONT_RESET_TIMER);
 again:
 	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
 		return;
Bart.


* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 16:59         ` Bart Van Assche
@ 2018-07-27 17:04           ` Keith Busch
  2018-07-27 17:14             ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 17:04 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, jianchao.w.wang, linux-block, martin.petersen, ming.lei, axboe

On Fri, Jul 27, 2018 at 04:59:34PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> 
> How about applying the following patch on top of this series?

That works for me, but it breaks scsi again when
scmd_eh_abort_handler completes the command a second time.

 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a97ab5ba9d18..aa66535604fd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -854,10 +854,10 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	 * latter case, if blk_mq_complete_request() was called while
>  	 * the timeout handler was in progress, ignore that call.
>  	 */
> -	if (ret == BLK_EH_DONT_RESET_TIMER)
> -		return;
> -	WARN_ON_ONCE(ret != BLK_EH_RESET_TIMER);
> -	blk_mq_add_timer(req);
> +	if (ret == BLK_EH_RESET_TIMER)
> +		blk_mq_add_timer(req);
> +	else
> +		WARN_ON_ONCE(ret != BLK_EH_DONT_RESET_TIMER);
>  again:
>  	if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_IN_FLIGHT))
>  		return;
> 
> Bart.


* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 17:04           ` Keith Busch
@ 2018-07-27 17:14             ` Bart Van Assche
  2018-07-27 17:58               ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 17:14 UTC (permalink / raw)
  To: keith.busch
  Cc: hch, axboe, linux-block, martin.petersen, jianchao.w.wang, ming.lei

On Fri, 2018-07-27 at 11:04 -0600, Keith Busch wrote:
> On Fri, Jul 27, 2018 at 04:59:34PM +0000, Bart Van Assche wrote:
> > On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> > 
> > How about applying the following patch on top of this series?
> 
> That works for me, but it breaks scsi again when
> scmd_eh_abort_handler completes the command a second time.

How about introducing a new request queue flag that chooses between the
behavior with or without the patch in my previous e-mail? I don't think that
it is possible to come up with a single implementation that covers the needs
of NVMe and SCSI without introducing such a flag. If a SCSI request times out
then request ownership is transferred from the LLD to the error handler. For
the NVMe driver however there is no such transfer of ownership.

Bart.


* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 17:14             ` Bart Van Assche
@ 2018-07-27 17:58               ` Keith Busch
  2018-07-27 18:09                 ` Bart Van Assche
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2018-07-27 17:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, axboe, linux-block, martin.petersen, jianchao.w.wang, ming.lei

On Fri, Jul 27, 2018 at 05:14:18PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 11:04 -0600, Keith Busch wrote:
> > On Fri, Jul 27, 2018 at 04:59:34PM +0000, Bart Van Assche wrote:
> > > On Fri, 2018-07-27 at 10:57 -0600, Keith Busch wrote:
> > > > You skip that code if the driver returns BLK_EH_DONT_RESET_TIMER.
> > > 
> > > How about applying the following patch on top of this series?
> > 
> > That works for me if you, but it breaks scsi again when
> > scmd_eh_abort_handler completes the command a second time.
> 
> How about introducing a new request queue flag that chooses between the
> behavior with or without the patch in my previous e-mail? I don't think that
> it is possible to come up with a single implementation that covers the needs
> of NVMe and SCSI without introducing such a flag. If a SCSI request times out
> then request ownership is transferred from the LLD to the error handler. For
> the NVMe driver however there is no such transfer of ownership.

Instead of PATCH 1/5, how about creating a new timeout return code like
"BLK_EH_DONT_COMPLETE"?


* Re: [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again
  2018-07-27 17:58               ` Keith Busch
@ 2018-07-27 18:09                 ` Bart Van Assche
  0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2018-07-27 18:09 UTC (permalink / raw)
  To: keith.busch
  Cc: hch, ming.lei, linux-block, martin.petersen, axboe, jianchao.w.wang

On Fri, 2018-07-27 at 11:58 -0600, Keith Busch wrote:
> Instead of PATCH 1/5, how about creating a new timeout return code like
> "BLK_EH_DONT_COMPLETE"?

That sounds like a good idea to me. I think this approach will avoid that we
have to introduce a request queue flag that chooses between the behavior
needed by the NVMe driver or the behavior needed by the SCSI core.

Bart.


end of thread, other threads:[~2018-07-27 18:09 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-27 16:20 [PATCH 0/5] Improve blk-mq performance Bart Van Assche
2018-07-27 16:20 ` [PATCH 1/5] blk-mq: Rename BLK_EH_DONE into BLK_EH_DONT_RESET_TIMER Bart Van Assche
2018-07-27 16:20 ` [PATCH 2/5] block: Remove a superfluous #include directive Bart Van Assche
2018-07-27 16:20 ` [PATCH 3/5] block: Split blk_add_timer() Bart Van Assche
2018-07-27 16:29   ` Keith Busch
2018-07-27 16:31     ` Bart Van Assche
2018-07-27 16:20 ` [PATCH 4/5] block: Simplify blk_add_timer() and blk_mq_add_timer() Bart Van Assche
2018-07-27 16:20 ` [PATCH 5/5] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-07-27 16:46   ` Keith Busch
2018-07-27 16:51     ` Bart Van Assche
2018-07-27 16:57       ` Keith Busch
2018-07-27 16:59         ` Bart Van Assche
2018-07-27 17:04           ` Keith Busch
2018-07-27 17:14             ` Bart Van Assche
2018-07-27 17:58               ` Keith Busch
2018-07-27 18:09                 ` Bart Van Assche
2018-07-27 16:22 ` [PATCH 0/5] Improve blk-mq performance Bart Van Assche
