* scsi-mq V2
@ 2014-06-25 16:51 Christoph Hellwig
  2014-06-25 16:51 ` [PATCH 01/14] sd: don't use rq->cmd_len before setting it up Christoph Hellwig
                   ` (15 more replies)
  0 siblings, 16 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

This is the second post of the scsi-mq series.

At this point the code is ready for merging and use by developers and early
adopters.  The core blk-mq code isn't that suitable for slow devices
yet, mostly due to the lack of an I/O scheduler, but Jens is working on it.
Similarly there is no dm-multipath support for drivers using blk-mq yet,
but I'm working on it.  It should also be noted that the code doesn't
actually support multiple hardware queues or fine-grained tuning of the
blk-mq parameters yet.  All of these could be added fairly easily as soon
as low-level drivers want to make use of them.

The amount of change to the existing code is fairly small, and mostly
consists of speedups or cleanups that apply to the old path as well.
Because of this I also haven't bothered to put it under a config option,
just like the blk-mq core.

Using blk-mq dramatically decreases CPU usage under all workloads, going
down from the 100% CPU usage that the old setup can easily hit to usually
less than 20% for maxing out storage subsystems with 512-byte reads and
writes, and it makes it easy to achieve millions of IOPS.  Bart and Robert
have helped with some very detailed measurements that they might be able
to send in reply to this, although these usually involve significantly
reworked low-level drivers to avoid other bottlenecks.

One major objection to previous iterations of this code was the simple
replacement of the host_lock with atomic counters for the host and
target busy counters.  The host_lock avoidance on its own already
improves performance, and with the patch to avoid maintaining the
per-target busy counter unless needed we now replace a lock round trip
on the host_lock with just a single atomic increment in the submission
path and a single atomic decrement in the completion path, which should
provide benefits even for the oddest RISC architecture.  Longer term
I'd still love to get rid of these entirely and use the counters in
blk-mq, but due to the difference in how they are maintained this
doesn't seem feasible as long as we still need to support the legacy
request code path.
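
To make the counting scheme concrete, here is a minimal userspace model
of that submit/complete pairing.  It uses C11 atomics and made-up names
rather than the kernel's atomic_t API, and leaves out the blocked and
starved-list handling the real scsi_host_queue_ready has to do:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint host_busy;

/* submission path: one atomic increment, rolled back if over the limit */
static bool host_queue_ready(unsigned int can_queue)
{
	/* atomic_fetch_add returns the count before our increment */
	unsigned int busy = atomic_fetch_add(&host_busy, 1);

	if (can_queue > 0 && busy >= can_queue) {
		atomic_fetch_sub(&host_busy, 1);
		return false;	/* caller requeues and retries later */
	}
	return true;
}

/* completion path: a single atomic decrement, no lock round trip */
static void host_queue_done(void)
{
	atomic_fetch_sub(&host_busy, 1);
}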

Changes from V1:
 - rebased on top of the core-for-3.17 branch, most notably the
   scsi logging changes
 - fixed handling of cmd_list to prevent crashes for some heavy
   workloads
 - fixed incorrect handling of !target->can_queue
 - avoid scheduling a workqueue on I/O completions when no queues
   are congested

In addition to the patches in this thread there is also a git tree
available at:

	git://git.infradead.org/users/hch/scsi.git scsi-mq.2

This work was sponsored by the ION division of Fusion IO.



* [PATCH 01/14] sd: don't use rq->cmd_len before setting it up
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:12     ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 02/14] scsi: split __scsi_queue_insert Christoph Hellwig
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Unlike the old request code, blk-mq doesn't initialize cmd_len to a
default value, so don't rely on it already being set in
sd_setup_write_same_cmnd.
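
The hazard is easy to model: if cmd_len still contains whatever the
allocation left behind, a memset sized by it clears the wrong number of
bytes.  A toy illustration in plain C, with a hypothetical struct rather
than the kernel's struct request:

#include <string.h>

struct toy_request {
	unsigned char cmd[16];
	unsigned short cmd_len;		/* blk-mq leaves this uninitialized */
};

static void toy_setup_write_same(struct toy_request *rq, int use_ws16)
{
	/* wrong: memset(rq->cmd, 0, rq->cmd_len) here reads a stale length */
	rq->cmd_len = use_ws16 ? 16 : 10;
	memset(rq->cmd, 0, rq->cmd_len);	/* right: length is set first */
	/* ... opcode, LBA and transfer length are filled in afterwards ... */
}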

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sd.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9c86e3d..6ec4ffe 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -824,15 +824,16 @@ static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)
 
 	rq->__data_len = sdp->sector_size;
 	rq->timeout = SD_WRITE_SAME_TIMEOUT;
-	memset(rq->cmd, 0, rq->cmd_len);
 
 	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff) {
 		rq->cmd_len = 16;
+		memset(rq->cmd, 0, rq->cmd_len);
 		rq->cmd[0] = WRITE_SAME_16;
 		put_unaligned_be64(sector, &rq->cmd[2]);
 		put_unaligned_be32(nr_sectors, &rq->cmd[10]);
 	} else {
 		rq->cmd_len = 10;
+		memset(rq->cmd, 0, rq->cmd_len);
 		rq->cmd[0] = WRITE_SAME;
 		put_unaligned_be32(sector, &rq->cmd[2]);
 		put_unaligned_be16(nr_sectors, &rq->cmd[7]);
-- 
1.7.10.4



* [PATCH 02/14] scsi: split __scsi_queue_insert
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
  2014-06-25 16:51 ` [PATCH 01/14] sd: don't use rq->cmd_len before setting it up Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:12   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn Christoph Hellwig
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Factor out a helper to set the _blocked values, which we'll reuse for the
blk-mq code path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c |   44 ++++++++++++++++++++++++++------------------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index d5d22e4..2667c75 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -75,28 +75,12 @@ struct kmem_cache *scsi_sdb_cache;
  */
 #define SCSI_QUEUE_DELAY	3
 
-/**
- * __scsi_queue_insert - private queue insertion
- * @cmd: The SCSI command being requeued
- * @reason:  The reason for the requeue
- * @unbusy: Whether the queue should be unbusied
- *
- * This is a private queue insertion.  The public interface
- * scsi_queue_insert() always assumes the queue should be unbusied
- * because it's always called before the completion.  This function is
- * for a requeue after completion, which should only occur in this
- * file.
- */
-static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
+static void
+scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 {
 	struct Scsi_Host *host = cmd->device->host;
 	struct scsi_device *device = cmd->device;
 	struct scsi_target *starget = scsi_target(device);
-	struct request_queue *q = device->request_queue;
-	unsigned long flags;
-
-	SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
-		"Inserting command %p into mlqueue\n", cmd));
 
 	/*
 	 * Set the appropriate busy bit for the device/host.
@@ -123,6 +107,30 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 		starget->target_blocked = starget->max_target_blocked;
 		break;
 	}
+}
+
+/**
+ * __scsi_queue_insert - private queue insertion
+ * @cmd: The SCSI command being requeued
+ * @reason:  The reason for the requeue
+ * @unbusy: Whether the queue should be unbusied
+ *
+ * This is a private queue insertion.  The public interface
+ * scsi_queue_insert() always assumes the queue should be unbusied
+ * because it's always called before the completion.  This function is
+ * for a requeue after completion, which should only occur in this
+ * file.
+ */
+static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
+{
+	struct scsi_device *device = cmd->device;
+	struct request_queue *q = device->request_queue;
+	unsigned long flags;
+
+	SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
+		"Inserting command %p into mlqueue\n", cmd));
+
+	scsi_set_blocked(cmd, reason);
 
 	/*
 	 * Decrement the counters, since these commands are no longer
-- 
1.7.10.4



* [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
  2014-06-25 16:51 ` [PATCH 01/14] sd: don't use rq->cmd_len before setting it up Christoph Hellwig
  2014-06-25 16:51 ` [PATCH 02/14] scsi: split __scsi_queue_insert Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-08 20:51   ` Elliott, Robert (Server Storage)
  2014-07-09 11:13   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd Christoph Hellwig
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Make sure we only have the logic for requeueing commands in one place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi.c     |   35 ++++++++++++-----------------------
 drivers/scsi/scsi_lib.c |    9 ++++++---
 2 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index ce5b4e5..dcc43fd 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 		 * returns an immediate error upwards, and signals
 		 * that the device is no longer present */
 		cmd->result = DID_NO_CONNECT << 16;
-		scsi_done(cmd);
-		/* return 0 (because the command has been processed) */
-		goto out;
+		goto done;
 	}
 
 	/* Check to see if the scsi lld made this device blocked. */
@@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 		 * occur until the device transitions out of the
 		 * suspend state.
 		 */
-
-		scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
-
 		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
 			"queuecommand : device blocked\n"));
-
-		/*
-		 * NOTE: rtn is still zero here because we don't need the
-		 * queue to be plugged on return (it's already stopped)
-		 */
-		goto out;
+		return SCSI_MLQUEUE_DEVICE_BUSY;
 	}
 
 	/*
@@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 			       "cdb_size=%d host->max_cmd_len=%d\n",
 			       cmd->cmd_len, cmd->device->host->max_cmd_len));
 		cmd->result = (DID_ABORT << 16);
-
-		scsi_done(cmd);
-		goto out;
+		goto done;
 	}
 
 	if (unlikely(host->shost_state == SHOST_DEL)) {
 		cmd->result = (DID_NO_CONNECT << 16);
-		scsi_done(cmd);
-	} else {
-		trace_scsi_dispatch_cmd_start(cmd);
-		cmd->scsi_done = scsi_done;
-		rtn = host->hostt->queuecommand(host, cmd);
+		goto done;
+
 	}
 
+	trace_scsi_dispatch_cmd_start(cmd);
+
+	cmd->scsi_done = scsi_done;
+	rtn = host->hostt->queuecommand(host, cmd);
 	if (rtn) {
 		trace_scsi_dispatch_cmd_error(cmd, rtn);
 		if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
@@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 
 		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
 			"queuecommand : request rejected\n"));
-
-		scsi_queue_insert(cmd, rtn);
 	}
 
- out:
 	return rtn;
+ done:
+	scsi_done(cmd);
+	return 0;
 }
 
 /**
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2667c75..63bf844 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1583,9 +1583,12 @@ static void scsi_request_fn(struct request_queue *q)
 		 * Dispatch the command to the low-level driver.
 		 */
 		rtn = scsi_dispatch_cmd(cmd);
-		spin_lock_irq(q->queue_lock);
-		if (rtn)
+		if (rtn) {
+			scsi_queue_insert(cmd, rtn);
+			spin_lock_irq(q->queue_lock);
 			goto out_delay;
+		}
+		spin_lock_irq(q->queue_lock);
 	}
 
 	return;
@@ -1605,7 +1608,7 @@ static void scsi_request_fn(struct request_queue *q)
 	blk_requeue_request(q, req);
 	sdev->device_busy--;
 out_delay:
-	if (sdev->device_busy == 0)
+	if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 
-- 
1.7.10.4



* [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:14   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready Christoph Hellwig
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

The blk-mq code path will set this to a different function, so make the
code simpler by setting it up in a legacy-request-specific place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi.c     |   23 +----------------------
 drivers/scsi/scsi_lib.c |   20 ++++++++++++++++++++
 2 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index dcc43fd..d3bd6cf 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -72,8 +72,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/scsi.h>
 
-static void scsi_done(struct scsi_cmnd *cmd);
-
 /*
  * Definitions and constants.
  */
@@ -696,8 +694,6 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 	}
 
 	trace_scsi_dispatch_cmd_start(cmd);
-
-	cmd->scsi_done = scsi_done;
 	rtn = host->hostt->queuecommand(host, cmd);
 	if (rtn) {
 		trace_scsi_dispatch_cmd_error(cmd, rtn);
@@ -711,28 +707,11 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 
 	return rtn;
  done:
-	scsi_done(cmd);
+	cmd->scsi_done(cmd);
 	return 0;
 }
 
 /**
- * scsi_done - Invoke completion on finished SCSI command.
- * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
- * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
- *
- * Description: This function is the mid-level's (SCSI Core) interrupt routine,
- * which regains ownership of the SCSI command (de facto) from a LLDD, and
- * calls blk_complete_request() for further processing.
- *
- * This function is interrupt context safe.
- */
-static void scsi_done(struct scsi_cmnd *cmd)
-{
-	trace_scsi_dispatch_cmd_done(cmd);
-	blk_complete_request(cmd->request);
-}
-
-/**
  * scsi_finish_command - cleanup and pass command back to upper layer
  * @cmd: the command
  *
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 63bf844..6989b6f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -29,6 +29,8 @@
 #include <scsi/scsi_eh.h>
 #include <scsi/scsi_host.h>
 
+#include <trace/events/scsi.h>
+
 #include "scsi_priv.h"
 #include "scsi_logging.h"
 
@@ -1480,6 +1482,23 @@ static void scsi_softirq_done(struct request *rq)
 	}
 }
 
+/**
+ * scsi_done - Invoke completion on finished SCSI command.
+ * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
+ * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
+ *
+ * Description: This function is the mid-level's (SCSI Core) interrupt routine,
+ * which regains ownership of the SCSI command (de facto) from a LLDD, and
+ * calls blk_complete_request() for further processing.
+ *
+ * This function is interrupt context safe.
+ */
+static void scsi_done(struct scsi_cmnd *cmd)
+{
+	trace_scsi_dispatch_cmd_done(cmd);
+	blk_complete_request(cmd->request);
+}
+
 /*
  * Function:    scsi_request_fn()
  *
@@ -1582,6 +1601,7 @@ static void scsi_request_fn(struct request_queue *q)
 		/*
 		 * Dispatch the command to the low-level driver.
 		 */
+		cmd->scsi_done = scsi_done;
 		rtn = scsi_dispatch_cmd(cmd);
 		if (rtn) {
 			scsi_queue_insert(cmd, rtn);
-- 
1.7.10.4



* [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:14   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 06/14] scsi: convert target_busy to an atomic_t Christoph Hellwig
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Prepare for not taking a host-wide lock in the dispatch path by pushing
the lock down into the places that actually need it.  Note that this
patch is just a preparation step, as it will actually increase lock
roundtrips and thus decrease performance on its own.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c |   75 ++++++++++++++++++++++++-----------------------
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6989b6f..18e6449 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1300,18 +1300,18 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 /*
  * scsi_target_queue_ready: checks if there we can send commands to target
  * @sdev: scsi device on starget to check.
- *
- * Called with the host lock held.
  */
 static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 					   struct scsi_device *sdev)
 {
 	struct scsi_target *starget = scsi_target(sdev);
+	int ret = 0;
 
+	spin_lock_irq(shost->host_lock);
 	if (starget->single_lun) {
 		if (starget->starget_sdev_user &&
 		    starget->starget_sdev_user != sdev)
-			return 0;
+			goto out;
 		starget->starget_sdev_user = sdev;
 	}
 
@@ -1319,57 +1319,66 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 		/*
 		 * unblock after target_blocked iterates to zero
 		 */
-		if (--starget->target_blocked == 0) {
-			SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
-					 "unblocking target at zero depth\n"));
-		} else
-			return 0;
+		if (--starget->target_blocked != 0)
+			goto out;
+
+		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
+				 "unblocking target at zero depth\n"));
 	}
 
 	if (scsi_target_is_busy(starget)) {
 		list_move_tail(&sdev->starved_entry, &shost->starved_list);
-		return 0;
+		goto out;
 	}
 
-	return 1;
+	scsi_target(sdev)->target_busy++;
+	ret = 1;
+out:
+	spin_unlock_irq(shost->host_lock);
+	return ret;
 }
 
 /*
  * scsi_host_queue_ready: if we can send requests to shost, return 1 else
  * return 0. We must end up running the queue again whenever 0 is
  * returned, else IO can hang.
- *
- * Called with host_lock held.
  */
 static inline int scsi_host_queue_ready(struct request_queue *q,
 				   struct Scsi_Host *shost,
 				   struct scsi_device *sdev)
 {
+	int ret = 0;
+
+	spin_lock_irq(shost->host_lock);
+
 	if (scsi_host_in_recovery(shost))
-		return 0;
+		goto out;
 	if (shost->host_busy == 0 && shost->host_blocked) {
 		/*
 		 * unblock after host_blocked iterates to zero
 		 */
-		if (--shost->host_blocked == 0) {
-			SCSI_LOG_MLQUEUE(3,
-				shost_printk(KERN_INFO, shost,
-					     "unblocking host at zero depth\n"));
-		} else {
-			return 0;
-		}
+		if (--shost->host_blocked != 0)
+			goto out;
+
+		SCSI_LOG_MLQUEUE(3,
+			shost_printk(KERN_INFO, shost,
+				     "unblocking host at zero depth\n"));
 	}
 	if (scsi_host_is_busy(shost)) {
 		if (list_empty(&sdev->starved_entry))
 			list_add_tail(&sdev->starved_entry, &shost->starved_list);
-		return 0;
+		goto out;
 	}
 
 	/* We're OK to process the command, so we can't be starved */
 	if (!list_empty(&sdev->starved_entry))
 		list_del_init(&sdev->starved_entry);
 
-	return 1;
+	shost->host_busy++;
+	ret = 1;
+out:
+	spin_unlock_irq(shost->host_lock);
+	return ret;
 }
 
 /*
@@ -1550,7 +1559,7 @@ static void scsi_request_fn(struct request_queue *q)
 			blk_start_request(req);
 		sdev->device_busy++;
 
-		spin_unlock(q->queue_lock);
+		spin_unlock_irq(q->queue_lock);
 		cmd = req->special;
 		if (unlikely(cmd == NULL)) {
 			printk(KERN_CRIT "impossible request in %s.\n"
@@ -1560,7 +1569,6 @@ static void scsi_request_fn(struct request_queue *q)
 			blk_dump_rq_flags(req, "foo");
 			BUG();
 		}
-		spin_lock(shost->host_lock);
 
 		/*
 		 * We hit this when the driver is using a host wide
@@ -1571,9 +1579,11 @@ static void scsi_request_fn(struct request_queue *q)
 		 * a run when a tag is freed.
 		 */
 		if (blk_queue_tagged(q) && !blk_rq_tagged(req)) {
+			spin_lock_irq(shost->host_lock);
 			if (list_empty(&sdev->starved_entry))
 				list_add_tail(&sdev->starved_entry,
 					      &shost->starved_list);
+			spin_unlock_irq(shost->host_lock);
 			goto not_ready;
 		}
 
@@ -1581,16 +1591,7 @@ static void scsi_request_fn(struct request_queue *q)
 			goto not_ready;
 
 		if (!scsi_host_queue_ready(q, shost, sdev))
-			goto not_ready;
-
-		scsi_target(sdev)->target_busy++;
-		shost->host_busy++;
-
-		/*
-		 * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
-		 *		take the lock again.
-		 */
-		spin_unlock_irq(shost->host_lock);
+			goto host_not_ready;
 
 		/*
 		 * Finally, initialize any error handling parameters, and set up
@@ -1613,9 +1614,11 @@ static void scsi_request_fn(struct request_queue *q)
 
 	return;
 
- not_ready:
+ host_not_ready:
+	spin_lock_irq(shost->host_lock);
+	scsi_target(sdev)->target_busy--;
 	spin_unlock_irq(shost->host_lock);
-
+ not_ready:
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
 	 * must return with queue_lock held.
-- 
1.7.10.4



* [PATCH 06/14] scsi: convert target_busy to an atomic_t
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:15     ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 07/14] scsi: convert host_busy to atomic_t Christoph Hellwig
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Avoid taking the host-wide host_lock to check the per-target queue limit.
Instead we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it again after finishing all checks.
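
Schematically the new check works as below; note that the kernel's
atomic_inc_return returns the value after the increment, hence the "- 1"
in the patch to recover the count before us.  This is just a sketch with
C11 atomics and an invented helper name:

#include <stdatomic.h>

static atomic_uint target_busy;

static int target_queue_ready(unsigned int can_queue)
{
	/* optimistically grab a slot; fetch_add returns the old count */
	unsigned int busy = atomic_fetch_add(&target_busy, 1);

	if (can_queue > 0 && busy >= can_queue)
		goto out_dec;
	return 1;

out_dec:
	/* no room after all: give the slot back and report busy */
	atomic_fetch_sub(&target_busy, 1);
	return 0;
}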

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c    |   52 ++++++++++++++++++++++++++------------------
 include/scsi/scsi_device.h |    4 ++--
 2 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 18e6449..5e269d6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -294,7 +294,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	shost->host_busy--;
-	starget->target_busy--;
+	atomic_dec(&starget->target_busy);
 	if (unlikely(scsi_host_in_recovery(shost) &&
 		     (shost->host_failed || shost->host_eh_scheduled)))
 		scsi_eh_wakeup(shost);
@@ -361,7 +361,7 @@ static inline int scsi_device_is_busy(struct scsi_device *sdev)
 static inline int scsi_target_is_busy(struct scsi_target *starget)
 {
 	return ((starget->can_queue > 0 &&
-		 starget->target_busy >= starget->can_queue) ||
+		 atomic_read(&starget->target_busy) >= starget->can_queue) ||
 		 starget->target_blocked);
 }
 
@@ -1305,37 +1305,49 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 					   struct scsi_device *sdev)
 {
 	struct scsi_target *starget = scsi_target(sdev);
-	int ret = 0;
+	unsigned int busy;
 
-	spin_lock_irq(shost->host_lock);
 	if (starget->single_lun) {
+		spin_lock_irq(shost->host_lock);
 		if (starget->starget_sdev_user &&
-		    starget->starget_sdev_user != sdev)
-			goto out;
+		    starget->starget_sdev_user != sdev) {
+			spin_unlock_irq(shost->host_lock);
+			return 0;
+		}
 		starget->starget_sdev_user = sdev;
+		spin_unlock_irq(shost->host_lock);
 	}
 
-	if (starget->target_busy == 0 && starget->target_blocked) {
+	busy = atomic_inc_return(&starget->target_busy) - 1;
+	if (busy == 0 && starget->target_blocked) {
 		/*
 		 * unblock after target_blocked iterates to zero
 		 */
-		if (--starget->target_blocked != 0)
-			goto out;
+		spin_lock_irq(shost->host_lock);
+		if (--starget->target_blocked != 0) {
+			spin_unlock_irq(shost->host_lock);
+			goto out_dec;
+		}
+		spin_unlock_irq(shost->host_lock);
 
 		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
 				 "unblocking target at zero depth\n"));
 	}
 
-	if (scsi_target_is_busy(starget)) {
-		list_move_tail(&sdev->starved_entry, &shost->starved_list);
-		goto out;
-	}
+	if (starget->can_queue > 0 && busy >= starget->can_queue)
+		goto starved;
+	if (starget->target_blocked)
+		goto starved;
 
-	scsi_target(sdev)->target_busy++;
-	ret = 1;
-out:
+	return 1;
+
+starved:
+	spin_lock_irq(shost->host_lock);
+	list_move_tail(&sdev->starved_entry, &shost->starved_list);
 	spin_unlock_irq(shost->host_lock);
-	return ret;
+out_dec:
+	atomic_dec(&starget->target_busy);
+	return 0;
 }
 
 /*
@@ -1445,7 +1457,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
 	spin_unlock(sdev->request_queue->queue_lock);
 	spin_lock(shost->host_lock);
 	shost->host_busy++;
-	starget->target_busy++;
+	atomic_inc(&starget->target_busy);
 	spin_unlock(shost->host_lock);
 	spin_lock(sdev->request_queue->queue_lock);
 
@@ -1615,9 +1627,7 @@ static void scsi_request_fn(struct request_queue *q)
 	return;
 
  host_not_ready:
-	spin_lock_irq(shost->host_lock);
-	scsi_target(sdev)->target_busy--;
-	spin_unlock_irq(shost->host_lock);
+	atomic_dec(&scsi_target(sdev)->target_busy);
  not_ready:
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 816e8a2..446f741 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -290,8 +290,8 @@ struct scsi_target {
 	unsigned int		expecting_lun_change:1;	/* A device has reported
 						 * a 3F/0E UA, other devices on
 						 * the same target will also. */
-	/* commands actually active on LLD. protected by host lock. */
-	unsigned int		target_busy;
+	/* commands actually active on LLD. */
+	atomic_t		target_busy;
 	/*
 	 * LLDs should set this in the slave_alloc host template callout.
 	 * If set to zero then there is not limit.
-- 
1.7.10.4



* [PATCH 07/14] scsi: convert host_busy to atomic_t
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 06/14] scsi: convert target_busy to an atomic_t Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:15   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 08/14] scsi: convert device_busy " Christoph Hellwig
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Avoid taking the host-wide host_lock to check the per-host queue limit.
Instead we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it again after finishing all checks.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/advansys.c             |    4 +-
 drivers/scsi/libiscsi.c             |    4 +-
 drivers/scsi/libsas/sas_scsi_host.c |    5 ++-
 drivers/scsi/qlogicpti.c            |    2 +-
 drivers/scsi/scsi.c                 |    2 +-
 drivers/scsi/scsi_error.c           |    7 ++--
 drivers/scsi/scsi_lib.c             |   71 +++++++++++++++++++++--------------
 drivers/scsi/scsi_sysfs.c           |    9 ++++-
 include/scsi/scsi_host.h            |   10 ++---
 9 files changed, 66 insertions(+), 48 deletions(-)

diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
index e716d0a..43761c1 100644
--- a/drivers/scsi/advansys.c
+++ b/drivers/scsi/advansys.c
@@ -2512,7 +2512,7 @@ static void asc_prt_scsi_host(struct Scsi_Host *s)
 
 	printk("Scsi_Host at addr 0x%p, device %s\n", s, dev_name(boardp->dev));
 	printk(" host_busy %u, host_no %d,\n",
-	       s->host_busy, s->host_no);
+	       atomic_read(&s->host_busy), s->host_no);
 
 	printk(" base 0x%lx, io_port 0x%lx, irq %d,\n",
 	       (ulong)s->base, (ulong)s->io_port, boardp->irq);
@@ -3346,7 +3346,7 @@ static void asc_prt_driver_conf(struct seq_file *m, struct Scsi_Host *shost)
 
 	seq_printf(m,
 		   " host_busy %u, max_id %u, max_lun %llu, max_channel %u\n",
-		   shost->host_busy, shost->max_id,
+		   atomic_read(&shost->host_busy), shost->max_id,
 		   shost->max_lun, shost->max_channel);
 
 	seq_printf(m,
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index f2db82b..f9f3a12 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2971,7 +2971,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
 	 */
 	for (;;) {
 		spin_lock_irqsave(session->host->host_lock, flags);
-		if (!session->host->host_busy) { /* OK for ERL == 0 */
+		if (!atomic_read(&session->host->host_busy)) { /* OK for ERL == 0 */
 			spin_unlock_irqrestore(session->host->host_lock, flags);
 			break;
 		}
@@ -2979,7 +2979,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
 		msleep_interruptible(500);
 		iscsi_conn_printk(KERN_INFO, conn, "iscsi conn_destroy(): "
 				  "host_busy %d host_failed %d\n",
-				  session->host->host_busy,
+				  atomic_read(&session->host->host_busy),
 				  session->host->host_failed);
 		/*
 		 * force eh_abort() to unblock
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 7d02a19..24e477d 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -813,7 +813,7 @@ retry:
 	spin_unlock_irq(shost->host_lock);
 
 	SAS_DPRINTK("Enter %s busy: %d failed: %d\n",
-		    __func__, shost->host_busy, shost->host_failed);
+		    __func__, atomic_read(&shost->host_busy), shost->host_failed);
 	/*
 	 * Deal with commands that still have SAS tasks (i.e. they didn't
 	 * complete via the normal sas_task completion mechanism),
@@ -858,7 +858,8 @@ out:
 		goto retry;
 
 	SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n",
-		    __func__, shost->host_busy, shost->host_failed, tries);
+		    __func__, atomic_read(&shost->host_busy),
+		    shost->host_failed, tries);
 }
 
 enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd)
diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
index 6d48d30..740ae49 100644
--- a/drivers/scsi/qlogicpti.c
+++ b/drivers/scsi/qlogicpti.c
@@ -959,7 +959,7 @@ static inline void update_can_queue(struct Scsi_Host *host, u_int in_ptr, u_int
 	/* Temporary workaround until bug is found and fixed (one bug has been found
 	   already, but fixing it makes things even worse) -jj */
 	int num_free = QLOGICPTI_REQ_QUEUE_LEN - REQ_QUEUE_DEPTH(in_ptr, out_ptr) - 64;
-	host->can_queue = host->host_busy + num_free;
+	host->can_queue = atomic_read(&host->host_busy) + num_free;
 	host->sg_tablesize = QLOGICPTI_MAX_SG(num_free);
 }
 
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d3bd6cf..35a23e2 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -603,7 +603,7 @@ void scsi_log_completion(struct scsi_cmnd *cmd, int disposition)
 			if (level > 3)
 				scmd_printk(KERN_INFO, cmd,
 					    "scsi host busy %d failed %d\n",
-					    cmd->device->host->host_busy,
+					    atomic_read(&cmd->device->host->host_busy),
 					    cmd->device->host->host_failed);
 		}
 	}
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index e4a5324..5db8454 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -59,7 +59,7 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *,
 /* called with shost->host_lock held */
 void scsi_eh_wakeup(struct Scsi_Host *shost)
 {
-	if (shost->host_busy == shost->host_failed) {
+	if (atomic_read(&shost->host_busy) == shost->host_failed) {
 		trace_scsi_eh_wakeup(shost);
 		wake_up_process(shost->ehandler);
 		SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
@@ -2164,7 +2164,7 @@ int scsi_error_handler(void *data)
 	while (!kthread_should_stop()) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
-		    shost->host_failed != shost->host_busy) {
+		    shost->host_failed != atomic_read(&shost->host_busy)) {
 			SCSI_LOG_ERROR_RECOVERY(1,
 				shost_printk(KERN_INFO, shost,
 					     "scsi_eh_%d: sleeping\n",
@@ -2178,7 +2178,8 @@ int scsi_error_handler(void *data)
 			shost_printk(KERN_INFO, shost,
 				     "scsi_eh_%d: waking up %d/%d/%d\n",
 				     shost->host_no, shost->host_eh_scheduled,
-				     shost->host_failed, shost->host_busy));
+				     shost->host_failed,
+				     atomic_read(&shost->host_busy)));
 
 		/*
 		 * We have a host that is failing for some reason.  Figure out
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5e269d6..5d37d79 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -292,14 +292,17 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 	struct scsi_target *starget = scsi_target(sdev);
 	unsigned long flags;
 
-	spin_lock_irqsave(shost->host_lock, flags);
-	shost->host_busy--;
+	atomic_dec(&shost->host_busy);
 	atomic_dec(&starget->target_busy);
+
 	if (unlikely(scsi_host_in_recovery(shost) &&
-		     (shost->host_failed || shost->host_eh_scheduled)))
+		     (shost->host_failed || shost->host_eh_scheduled))) {
+		spin_lock_irqsave(shost->host_lock, flags);
 		scsi_eh_wakeup(shost);
-	spin_unlock(shost->host_lock);
-	spin_lock(sdev->request_queue->queue_lock);
+		spin_unlock_irqrestore(shost->host_lock, flags);
+	}
+
+	spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
 	sdev->device_busy--;
 	spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
 }
@@ -367,7 +370,8 @@ static inline int scsi_target_is_busy(struct scsi_target *starget)
 
 static inline int scsi_host_is_busy(struct Scsi_Host *shost)
 {
-	if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
+	if ((shost->can_queue > 0 &&
+	     atomic_read(&shost->host_busy) >= shost->can_queue) ||
 	    shost->host_blocked || shost->host_self_blocked)
 		return 1;
 
@@ -1359,38 +1363,51 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 				   struct Scsi_Host *shost,
 				   struct scsi_device *sdev)
 {
-	int ret = 0;
-
-	spin_lock_irq(shost->host_lock);
+	unsigned int busy;
 
 	if (scsi_host_in_recovery(shost))
-		goto out;
-	if (shost->host_busy == 0 && shost->host_blocked) {
+		return 0;
+
+	busy = atomic_inc_return(&shost->host_busy) - 1;
+	if (busy == 0 && shost->host_blocked) {
 		/*
 		 * unblock after host_blocked iterates to zero
 		 */
-		if (--shost->host_blocked != 0)
-			goto out;
+		spin_lock_irq(shost->host_lock);
+		if (--shost->host_blocked != 0) {
+			spin_unlock_irq(shost->host_lock);
+			goto out_dec;
+		}
+		spin_unlock_irq(shost->host_lock);
 
 		SCSI_LOG_MLQUEUE(3,
 			shost_printk(KERN_INFO, shost,
 				     "unblocking host at zero depth\n"));
 	}
-	if (scsi_host_is_busy(shost)) {
-		if (list_empty(&sdev->starved_entry))
-			list_add_tail(&sdev->starved_entry, &shost->starved_list);
-		goto out;
-	}
+
+	if (shost->can_queue > 0 && busy >= shost->can_queue)
+		goto starved;
+	if (shost->host_blocked || shost->host_self_blocked)
+		goto starved;
 
 	/* We're OK to process the command, so we can't be starved */
-	if (!list_empty(&sdev->starved_entry))
-		list_del_init(&sdev->starved_entry);
+	if (!list_empty(&sdev->starved_entry)) {
+		spin_lock_irq(shost->host_lock);
+		if (!list_empty(&sdev->starved_entry))
+			list_del_init(&sdev->starved_entry);
+		spin_unlock_irq(shost->host_lock);
+	}
 
-	shost->host_busy++;
-	ret = 1;
-out:
+	return 1;
+
+starved:
+	spin_lock_irq(shost->host_lock);
+	if (list_empty(&sdev->starved_entry))
+		list_add_tail(&sdev->starved_entry, &shost->starved_list);
 	spin_unlock_irq(shost->host_lock);
-	return ret;
+out_dec:
+	atomic_dec(&shost->host_busy);
+	return 0;
 }
 
 /*
@@ -1454,12 +1471,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
 	 * with the locks as normal issue path does.
 	 */
 	sdev->device_busy++;
-	spin_unlock(sdev->request_queue->queue_lock);
-	spin_lock(shost->host_lock);
-	shost->host_busy++;
+	atomic_inc(&shost->host_busy);
 	atomic_inc(&starget->target_busy);
-	spin_unlock(shost->host_lock);
-	spin_lock(sdev->request_queue->queue_lock);
 
 	blk_complete_request(req);
 }
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 5f36788..7ec5e06 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -334,7 +334,6 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
 static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
 
 shost_rd_attr(unique_id, "%u\n");
-shost_rd_attr(host_busy, "%hu\n");
 shost_rd_attr(cmd_per_lun, "%hd\n");
 shost_rd_attr(can_queue, "%hd\n");
 shost_rd_attr(sg_tablesize, "%hu\n");
@@ -344,6 +343,14 @@ shost_rd_attr(prot_capabilities, "%u\n");
 shost_rd_attr(prot_guard_type, "%hd\n");
 shost_rd_attr2(proc_name, hostt->proc_name, "%s\n");
 
+static ssize_t
+show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct Scsi_Host *shost = class_to_shost(dev);
+	return snprintf(buf, 20, "%hu\n", atomic_read(&shost->host_busy));
+}
+static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
+
 static struct attribute *scsi_sysfs_shost_attrs[] = {
 	&dev_attr_unique_id.attr,
 	&dev_attr_host_busy.attr,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index abb6958..3d124f7 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -603,13 +603,9 @@ struct Scsi_Host {
 	 */
 	struct blk_queue_tag	*bqt;
 
-	/*
-	 * The following two fields are protected with host_lock;
-	 * however, eh routines can safely access during eh processing
-	 * without acquiring the lock.
-	 */
-	unsigned int host_busy;		   /* commands actually active on low-level */
-	unsigned int host_failed;	   /* commands that failed. */
+	atomic_t host_busy;		   /* commands actually active on low-level */
+	unsigned int host_failed;	   /* commands that failed.
+					      protected by host_lock */
 	unsigned int host_eh_scheduled;    /* EH scheduled without command */
     
 	unsigned int host_no;  /* Used for IOCTL_GET_IDLUN, /proc/scsi et al. */
-- 
1.7.10.4



* [PATCH 08/14] scsi: convert device_busy to atomic_t
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 07/14] scsi: convert host_busy to atomic_t Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:16   ` Hannes Reinecke
  2014-07-09 16:49   ` James Bottomley
  2014-06-25 16:51 ` [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess Christoph Hellwig
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Avoid taking the queue_lock to check the per-device queue limit.  Instead
we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it again after finishing all checks.

Unlike the host and target busy counters this doesn't allow us to avoid the
queue_lock in the request_fn due to the way the interface works, but it'll
allow us to prepare for using the blk-mq code, which doesn't use the
queue_lock at all, and it at least avoids a queue_lock round trip in
scsi_device_unbusy, which is still important given how busy the queue_lock
is.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/message/fusion/mptsas.c |    2 +-
 drivers/scsi/scsi_lib.c         |   50 ++++++++++++++++++++++-----------------
 drivers/scsi/scsi_sysfs.c       |   10 +++++++-
 drivers/scsi/sg.c               |    2 +-
 include/scsi/scsi_device.h      |    4 +---
 5 files changed, 40 insertions(+), 28 deletions(-)

diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
index 711fcb5..d636dbe 100644
--- a/drivers/message/fusion/mptsas.c
+++ b/drivers/message/fusion/mptsas.c
@@ -3763,7 +3763,7 @@ mptsas_send_link_status_event(struct fw_event_work *fw_event)
 						printk(MYIOC_s_DEBUG_FMT
 						"SDEV OUTSTANDING CMDS"
 						"%d\n", ioc->name,
-						sdev->device_busy));
+						atomic_read(&sdev->device_busy)));
 				}
 
 			}
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5d37d79..e23fef5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -302,9 +302,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 		spin_unlock_irqrestore(shost->host_lock, flags);
 	}
 
-	spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
-	sdev->device_busy--;
-	spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
+	atomic_dec(&sdev->device_busy);
 }
 
 /*
@@ -355,9 +353,10 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline int scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (sdev->device_busy >= sdev->queue_depth || sdev->device_blocked)
+	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+		return 1;
+	if (sdev->device_blocked)
 		return 1;
-
 	return 0;
 }
 
@@ -1224,7 +1223,7 @@ scsi_prep_return(struct request_queue *q, struct request *req, int ret)
 		 * queue must be restarted, so we schedule a callback to happen
 		 * shortly.
 		 */
-		if (sdev->device_busy == 0)
+		if (atomic_read(&sdev->device_busy) == 0)
 			blk_delay_queue(q, SCSI_QUEUE_DELAY);
 		break;
 	default:
@@ -1281,26 +1280,32 @@ static void scsi_unprep_fn(struct request_queue *q, struct request *req)
 static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
-	if (sdev->device_busy == 0 && sdev->device_blocked) {
+	unsigned int busy;
+
+	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (busy == 0 && sdev->device_blocked) {
 		/*
 		 * unblock after device_blocked iterates to zero
 		 */
-		if (--sdev->device_blocked == 0) {
-			SCSI_LOG_MLQUEUE(3,
-				   sdev_printk(KERN_INFO, sdev,
-				   "unblocking device at zero depth\n"));
-		} else {
+		if (--sdev->device_blocked != 0) {
 			blk_delay_queue(q, SCSI_QUEUE_DELAY);
-			return 0;
+			goto out_dec;
 		}
+		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
+				   "unblocking device at zero depth\n"));
 	}
-	if (scsi_device_is_busy(sdev))
-		return 0;
+
+	if (busy >= sdev->queue_depth)
+		goto out_dec;
+	if (sdev->device_blocked)
+		goto out_dec;
 
 	return 1;
+out_dec:
+	atomic_dec(&sdev->device_busy);
+	return 0;
 }
 
-
 /*
  * scsi_target_queue_ready: checks if there we can send commands to target
  * @sdev: scsi device on starget to check.
@@ -1470,7 +1475,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
 	 * bump busy counts.  To bump the counters, we need to dance
 	 * with the locks as normal issue path does.
 	 */
-	sdev->device_busy++;
+	atomic_inc(&sdev->device_busy);
 	atomic_inc(&shost->host_busy);
 	atomic_inc(&starget->target_busy);
 
@@ -1566,7 +1571,7 @@ static void scsi_request_fn(struct request_queue *q)
 		 * accept it.
 		 */
 		req = blk_peek_request(q);
-		if (!req || !scsi_dev_queue_ready(q, sdev))
+		if (!req)
 			break;
 
 		if (unlikely(!scsi_device_online(sdev))) {
@@ -1576,13 +1581,14 @@ static void scsi_request_fn(struct request_queue *q)
 			continue;
 		}
 
+		if (!scsi_dev_queue_ready(q, sdev))
+			break;
 
 		/*
 		 * Remove the request from the request list.
 		 */
 		if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
 			blk_start_request(req);
-		sdev->device_busy++;
 
 		spin_unlock_irq(q->queue_lock);
 		cmd = req->special;
@@ -1652,9 +1658,9 @@ static void scsi_request_fn(struct request_queue *q)
 	 */
 	spin_lock_irq(q->queue_lock);
 	blk_requeue_request(q, req);
-	sdev->device_busy--;
+	atomic_dec(&sdev->device_busy);
 out_delay:
-	if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
+	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 
@@ -2394,7 +2400,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
 		return err;
 
 	scsi_run_queue(sdev->request_queue);
-	while (sdev->device_busy) {
+	while (atomic_read(&sdev->device_busy)) {
 		msleep_interruptible(200);
 		scsi_run_queue(sdev->request_queue);
 	}
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 7ec5e06..54e3dac 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -585,13 +585,21 @@ static int scsi_sdev_check_buf_bit(const char *buf)
  * Create the actual show/store functions and data structures.
  */
 sdev_rd_attr (device_blocked, "%d\n");
-sdev_rd_attr (device_busy, "%d\n");
 sdev_rd_attr (type, "%d\n");
 sdev_rd_attr (scsi_level, "%d\n");
 sdev_rd_attr (vendor, "%.8s\n");
 sdev_rd_attr (model, "%.16s\n");
 sdev_rd_attr (rev, "%.4s\n");
 
+static ssize_t
+sdev_show_device_busy(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct scsi_device *sdev = to_scsi_device(dev);
+	return snprintf(buf, 20, "%d\n", atomic_read(&sdev->device_busy));
+}
+static DEVICE_ATTR(device_busy, S_IRUGO, sdev_show_device_busy, NULL);
+
 /*
  * TODO: can we make these symlinks to the block layer ones?
  */
diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
index cb2a18e..3db4fc9 100644
--- a/drivers/scsi/sg.c
+++ b/drivers/scsi/sg.c
@@ -2573,7 +2573,7 @@ static int sg_proc_seq_show_dev(struct seq_file *s, void *v)
 			      scsidp->id, scsidp->lun, (int) scsidp->type,
 			      1,
 			      (int) scsidp->queue_depth,
-			      (int) scsidp->device_busy,
+			      (int) atomic_read(&scsidp->device_busy),
 			      (int) scsi_device_online(scsidp));
 	}
 	read_unlock_irqrestore(&sg_index_lock, iflags);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 446f741..5ff3d24 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -81,9 +81,7 @@ struct scsi_device {
 	struct list_head    siblings;   /* list of all devices on this host */
 	struct list_head    same_target_siblings; /* just the devices sharing same target id */
 
-	/* this is now protected by the request_queue->queue_lock */
-	unsigned int device_busy;	/* commands actually active on
-					 * low-level. protected by queue_lock. */
+	atomic_t device_busy;		/* commands actually active on LLDD */
 	spinlock_t list_lock;
 	struct list_head cmd_list;	/* queue of in use SCSI Command structures */
 	struct list_head starved_entry;
-- 
1.7.10.4



* [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 08/14] scsi: convert device_busy " Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:12   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit Christoph Hellwig
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

These counters were missing any sort of synchronization for updates, as
an over-ten-year-old comment from me noted.  Fix this by using atomic
counters, and while we're at it also make sure they are in the same
cacheline as the _busy counters and not needlessly stored to on every
I/O completion.

With the new model the _blocked counters can temporarily go negative,
so all the readers are updated to check for > 0 values.  Longer term
every successful I/O completion will reset the counters to zero, so the
temporarily negative values will not cause any harm.
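
To see why the transiently negative values are benign, here is a small
model of the reader/writer pattern the patch adopts, again with C11
atomics and illustrative names rather than the kernel's atomic_t:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int device_blocked;

/* submission path: racing callers may drive the counter below zero */
static bool blocked_countdown_finished(void)
{
	/* mirrors atomic_dec_return(&sdev->device_blocked) > 0 */
	return atomic_fetch_sub(&device_blocked, 1) - 1 <= 0;
}

/* readers treat zero and negative values the same way: not blocked */
static bool device_is_blocked(void)
{
	return atomic_load(&device_blocked) > 0;
}

/* a successful completion resets the counter, erasing any negative value */
static void completion_clears_blocked(void)
{
	if (atomic_load(&device_blocked))
		atomic_store(&device_blocked, 0);
}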

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi.c        |   21 ++++++------
 drivers/scsi/scsi_lib.c    |   82 +++++++++++++++++++++-----------------------
 drivers/scsi/scsi_sysfs.c  |   10 +++++-
 include/scsi/scsi_device.h |    7 ++--
 include/scsi/scsi_host.h   |    7 ++--
 5 files changed, 64 insertions(+), 63 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 35a23e2..b362058 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -729,17 +729,16 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
 
 	scsi_device_unbusy(sdev);
 
-        /*
-         * Clear the flags which say that the device/host is no longer
-         * capable of accepting new commands.  These are set in scsi_queue.c
-         * for both the queue full condition on a device, and for a
-         * host full condition on the host.
-	 *
-	 * XXX(hch): What about locking?
-         */
-        shost->host_blocked = 0;
-	starget->target_blocked = 0;
-        sdev->device_blocked = 0;
+	/*
+	 * Clear the flags which say that the device/target/host is no longer
+	 * capable of accepting new commands.
+	 */
+	if (atomic_read(&shost->host_blocked))
+		atomic_set(&shost->host_blocked, 0);
+	if (atomic_read(&starget->target_blocked))
+		atomic_set(&starget->target_blocked, 0);
+	if (atomic_read(&sdev->device_blocked))
+		atomic_set(&sdev->device_blocked, 0);
 
 	/*
 	 * If we have valid sense information, then some kind of recovery
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e23fef5..a39d5ba 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -99,14 +99,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 	 */
 	switch (reason) {
 	case SCSI_MLQUEUE_HOST_BUSY:
-		host->host_blocked = host->max_host_blocked;
+		atomic_set(&host->host_blocked, host->max_host_blocked);
 		break;
 	case SCSI_MLQUEUE_DEVICE_BUSY:
 	case SCSI_MLQUEUE_EH_RETRY:
-		device->device_blocked = device->max_device_blocked;
+		atomic_set(&device->device_blocked,
+			   device->max_device_blocked);
 		break;
 	case SCSI_MLQUEUE_TARGET_BUSY:
-		starget->target_blocked = starget->max_target_blocked;
+		atomic_set(&starget->target_blocked,
+			   starget->max_target_blocked);
 		break;
 	}
 }
@@ -351,30 +353,39 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 	spin_unlock_irqrestore(shost->host_lock, flags);
 }
 
-static inline int scsi_device_is_busy(struct scsi_device *sdev)
+static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
 	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
-		return 1;
-	if (sdev->device_blocked)
-		return 1;
+		return true;
+	if (atomic_read(&sdev->device_blocked) > 0)
+		return true;
 	return 0;
 }
 
-static inline int scsi_target_is_busy(struct scsi_target *starget)
+static inline bool scsi_target_is_busy(struct scsi_target *starget)
 {
-	return ((starget->can_queue > 0 &&
-		 atomic_read(&starget->target_busy) >= starget->can_queue) ||
-		 starget->target_blocked);
+	if (starget->can_queue > 0) {
+		if (atomic_read(&starget->target_busy) >= starget->can_queue)
+			return true;
+		if (atomic_read(&starget->target_blocked) > 0)
+			return true;
+	}
+
+	return false;
 }
 
-static inline int scsi_host_is_busy(struct Scsi_Host *shost)
+static inline bool scsi_host_is_busy(struct Scsi_Host *shost)
 {
-	if ((shost->can_queue > 0 &&
-	     atomic_read(&shost->host_busy) >= shost->can_queue) ||
-	    shost->host_blocked || shost->host_self_blocked)
-		return 1;
+	if (shost->can_queue > 0) {
+		if (atomic_read(&shost->host_busy) >= shost->can_queue)
+			return true;
+		if (atomic_read(&shost->host_blocked) > 0)
+			return true;
+		if (shost->host_self_blocked)
+			return true;
+	}
 
-	return 0;
+	return false;
 }
 
 static void scsi_starved_list_run(struct Scsi_Host *shost)
@@ -1283,11 +1294,8 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 	unsigned int busy;
 
 	busy = atomic_inc_return(&sdev->device_busy) - 1;
-	if (busy == 0 && sdev->device_blocked) {
-		/*
-		 * unblock after device_blocked iterates to zero
-		 */
-		if (--sdev->device_blocked != 0) {
+	if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
+		if (atomic_dec_return(&sdev->device_blocked) > 0) {
 			blk_delay_queue(q, SCSI_QUEUE_DELAY);
 			goto out_dec;
 		}
@@ -1297,7 +1305,7 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
-	if (sdev->device_blocked)
+	if (atomic_read(&sdev->device_blocked) > 0)
 		goto out_dec;
 
 	return 1;
@@ -1328,16 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 	}
 
 	busy = atomic_inc_return(&starget->target_busy) - 1;
-	if (busy == 0 && starget->target_blocked) {
-		/*
-		 * unblock after target_blocked iterates to zero
-		 */
-		spin_lock_irq(shost->host_lock);
-		if (--starget->target_blocked != 0) {
-			spin_unlock_irq(shost->host_lock);
+	if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
+		if (atomic_dec_return(&starget->target_blocked) > 0)
 			goto out_dec;
-		}
-		spin_unlock_irq(shost->host_lock);
 
 		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
 				 "unblocking target at zero depth\n"));
@@ -1345,7 +1346,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 
 	if (starget->can_queue > 0 && busy >= starget->can_queue)
 		goto starved;
-	if (starget->target_blocked)
+	if (atomic_read(&starget->target_blocked) > 0)
 		goto starved;
 
 	return 1;
@@ -1374,16 +1375,9 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 		return 0;
 
 	busy = atomic_inc_return(&shost->host_busy) - 1;
-	if (busy == 0 && shost->host_blocked) {
-		/*
-		 * unblock after host_blocked iterates to zero
-		 */
-		spin_lock_irq(shost->host_lock);
-		if (--shost->host_blocked != 0) {
-			spin_unlock_irq(shost->host_lock);
+	if (busy == 0 && atomic_read(&shost->host_blocked) > 0) {
+		if (atomic_dec_return(&shost->host_blocked) > 0)
 			goto out_dec;
-		}
-		spin_unlock_irq(shost->host_lock);
 
 		SCSI_LOG_MLQUEUE(3,
 			shost_printk(KERN_INFO, shost,
@@ -1392,7 +1386,9 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 
 	if (shost->can_queue > 0 && busy >= shost->can_queue)
 		goto starved;
-	if (shost->host_blocked || shost->host_self_blocked)
+	if (atomic_read(&shost->host_blocked) > 0)
+		goto starved;
+	if (shost->host_self_blocked)
 		goto starved;
 
 	/* We're OK to process the command, so we can't be starved */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 54e3dac..deef063 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -584,7 +584,6 @@ static int scsi_sdev_check_buf_bit(const char *buf)
 /*
  * Create the actual show/store functions and data structures.
  */
-sdev_rd_attr (device_blocked, "%d\n");
 sdev_rd_attr (type, "%d\n");
 sdev_rd_attr (scsi_level, "%d\n");
 sdev_rd_attr (vendor, "%.8s\n");
@@ -600,6 +599,15 @@ sdev_show_device_busy(struct device *dev, struct device_attribute *attr,
 }
 static DEVICE_ATTR(device_busy, S_IRUGO, sdev_show_device_busy, NULL);
 
+static ssize_t
+sdev_show_device_blocked(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	struct scsi_device *sdev = to_scsi_device(dev);
+	return snprintf(buf, 20, "%d\n", atomic_read(&sdev->device_blocked));
+}
+static DEVICE_ATTR(device_blocked, S_IRUGO, sdev_show_device_blocked, NULL);
+
 /*
  * TODO: can we make these symlinks to the block layer ones?
  */
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 5ff3d24..a8a8981 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -82,6 +82,8 @@ struct scsi_device {
 	struct list_head    same_target_siblings; /* just the devices sharing same target id */
 
 	atomic_t device_busy;		/* commands actually active on LLDD */
+	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
+
 	spinlock_t list_lock;
 	struct list_head cmd_list;	/* queue of in use SCSI Command structures */
 	struct list_head starved_entry;
@@ -179,8 +181,6 @@ struct scsi_device {
 	struct list_head event_list;	/* asserted events */
 	struct work_struct event_work;
 
-	unsigned int device_blocked;	/* Device returned QUEUE_FULL. */
-
 	unsigned int max_device_blocked; /* what device_blocked counts down from  */
 #define SCSI_DEFAULT_DEVICE_BLOCKED	3
 
@@ -290,12 +290,13 @@ struct scsi_target {
 						 * the same target will also. */
 	/* commands actually active on LLD. */
 	atomic_t		target_busy;
+	atomic_t		target_blocked;
+
 	/*
 	 * LLDs should set this in the slave_alloc host template callout.
 	 * If set to zero then there is not limit.
 	 */
 	unsigned int		can_queue;
-	unsigned int		target_blocked;
 	unsigned int		max_target_blocked;
 #define SCSI_DEFAULT_TARGET_BLOCKED	3
 
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 3d124f7..7f9bbda 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -604,6 +604,8 @@ struct Scsi_Host {
 	struct blk_queue_tag	*bqt;
 
 	atomic_t host_busy;		   /* commands actually active on low-level */
+	atomic_t host_blocked;
+
 	unsigned int host_failed;	   /* commands that failed.
 					      protected by host_lock */
 	unsigned int host_eh_scheduled;    /* EH scheduled without command */
@@ -703,11 +705,6 @@ struct Scsi_Host {
 	struct workqueue_struct *tmf_work_q;
 
 	/*
-	 * Host has rejected a command because it was busy.
-	 */
-	unsigned int host_blocked;
-
-	/*
 	 * Value host_blocked counts down from
 	 */
 	unsigned int max_host_blocked;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:19   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls Christoph Hellwig
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

This saves us an atomic operation for each I/O submission and completion
for the usual case where the driver doesn't set a per-target can_queue
value.  Only a few iscsi hardware offload drivers set the per-target
can_queue value at the moment.
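
For illustration, the guarded accounting boils down to the pattern below
(a condensed sketch with hypothetical helper names; the real logic in
scsi_target_queue_ready() and scsi_device_unbusy() below also handles
the target_blocked counter and the starved list):

	/* sketch only: these helpers are hypothetical, the patch open-codes this */
	static inline void scsi_target_inc_busy(struct scsi_target *starget)
	{
		/* only pay for the atomic when a per-target limit is set */
		if (starget->can_queue > 0)
			atomic_inc(&starget->target_busy);
	}

	static inline void scsi_target_dec_busy(struct scsi_target *starget)
	{
		if (starget->can_queue > 0)
			atomic_dec(&starget->target_busy);
	}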

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a39d5ba..a64b9d3 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -295,7 +295,8 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 	unsigned long flags;
 
 	atomic_dec(&shost->host_busy);
-	atomic_dec(&starget->target_busy);
+	if (starget->can_queue > 0)
+		atomic_dec(&starget->target_busy);
 
 	if (unlikely(scsi_host_in_recovery(shost) &&
 		     (shost->host_failed || shost->host_eh_scheduled))) {
@@ -1335,6 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 		spin_unlock_irq(shost->host_lock);
 	}
 
+	if (starget->can_queue <= 0)
+		return 1;
+
 	busy = atomic_inc_return(&starget->target_busy) - 1;
 	if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
 		if (atomic_dec_return(&starget->target_blocked) > 0)
@@ -1344,7 +1348,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 				 "unblocking target at zero depth\n"));
 	}
 
-	if (starget->can_queue > 0 && busy >= starget->can_queue)
+	if (busy >= starget->can_queue)
 		goto starved;
 	if (atomic_read(&starget->target_blocked) > 0)
 		goto starved;
@@ -1356,7 +1360,8 @@ starved:
 	list_move_tail(&sdev->starved_entry, &shost->starved_list);
 	spin_unlock_irq(shost->host_lock);
 out_dec:
-	atomic_dec(&starget->target_busy);
+	if (starget->can_queue > 0)
+		atomic_dec(&starget->target_busy);
 	return 0;
 }
 
@@ -1473,7 +1478,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
 	 */
 	atomic_inc(&sdev->device_busy);
 	atomic_inc(&shost->host_busy);
-	atomic_inc(&starget->target_busy);
+	if (starget->can_queue > 0)
+		atomic_inc(&starget->target_busy);
 
 	blk_complete_request(req);
 }
@@ -1642,7 +1648,8 @@ static void scsi_request_fn(struct request_queue *q)
 	return;
 
  host_not_ready:
-	atomic_dec(&scsi_target(sdev)->target_busy);
+	if (scsi_target(sdev)->can_queue > 0)
+		atomic_dec(&scsi_target(sdev)->target_busy);
  not_ready:
 	/*
 	 * lock q, handle tag, requeue req, and decrement device_busy. We
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:20   ` Hannes Reinecke
  2014-06-25 16:51 ` [PATCH 12/14] scatterlist: allow chaining to preallocated chunks Christoph Hellwig
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Replace the calls to the various blk_end_request variants with open-coded
equivalents.  Blk-mq uses a model that gives the driver control between
the bio updates and the actual request completion, and making the old
code follow that same model allows us to keep the code more similar for
both paths.
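
Condensed from the scsi_end_request() helper added below, the open-coded
sequence is roughly (sketch only; the real function also handles bidi
requests and releases the buffers):

	if (blk_update_request(req, error, bytes))
		return true;	/* partial completion, request lives on */

	/* the driver regains control here, between the bio updates
	 * and the final completion */
	spin_lock_irqsave(q->queue_lock, flags);
	blk_finish_request(req, error);
	spin_unlock_irqrestore(q->queue_lock, flags);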

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c |   61 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 42 insertions(+), 19 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index a64b9d3..58534fd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -625,6 +625,37 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
 	cmd->request->next_rq->special = NULL;
 }
 
+static bool scsi_end_request(struct request *req, int error,
+		unsigned int bytes, unsigned int bidi_bytes)
+{
+	struct scsi_cmnd *cmd = req->special;
+	struct scsi_device *sdev = cmd->device;
+	struct request_queue *q = sdev->request_queue;
+	unsigned long flags;
+
+
+	if (blk_update_request(req, error, bytes))
+		return true;
+
+	/* Bidi request must be completed as a whole */
+	if (unlikely(bidi_bytes) &&
+	    blk_update_request(req->next_rq, error, bidi_bytes))
+		return true;
+
+	if (blk_queue_add_random(q))
+		add_disk_randomness(req->rq_disk);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	blk_finish_request(req, error);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	if (bidi_bytes)
+		scsi_release_bidi_buffers(cmd);
+	scsi_release_buffers(cmd);
+	scsi_next_command(cmd);
+	return false;
+}
+
 /**
  * __scsi_error_from_host_byte - translate SCSI error code into errno
  * @cmd:	SCSI command (unused)
@@ -697,7 +728,7 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd *cmd, int result)
  *		   be put back on the queue and retried using the same
  *		   command as before, possibly after a delay.
  *
- *		c) We can call blk_end_request() with -EIO to fail
+ *		c) We can call scsi_end_request() with -EIO to fail
  *		   the remainder of the request.
  */
 void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
@@ -749,13 +780,9 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			 * both sides at once.
 			 */
 			req->next_rq->resid_len = scsi_in(cmd)->resid;
-
-			scsi_release_buffers(cmd);
-			scsi_release_bidi_buffers(cmd);
-
-			blk_end_request_all(req, 0);
-
-			scsi_next_command(cmd);
+			if (scsi_end_request(req, 0, blk_rq_bytes(req),
+					blk_rq_bytes(req->next_rq)))
+				BUG();
 			return;
 		}
 	}
@@ -794,15 +821,16 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	/*
 	 * If we finished all bytes in the request we are done now.
 	 */
-	if (!blk_end_request(req, error, good_bytes))
-		goto next_command;
+	if (!scsi_end_request(req, error, good_bytes, 0))
+		return;
 
 	/*
 	 * Kill remainder if no retrys.
 	 */
 	if (error && scsi_noretry_cmd(cmd)) {
-		blk_end_request_all(req, error);
-		goto next_command;
+		if (scsi_end_request(req, error, blk_rq_bytes(req), 0))
+			BUG();
+		return;
 	}
 
 	/*
@@ -947,8 +975,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 				scsi_print_sense("", cmd);
 			scsi_print_command(cmd);
 		}
-		if (!blk_end_request_err(req, error))
-			goto next_command;
+		if (!scsi_end_request(req, error, blk_rq_err_bytes(req), 0))
+			return;
 		/*FALLTHRU*/
 	case ACTION_REPREP:
 	requeue:
@@ -967,11 +995,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		__scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0);
 		break;
 	}
-	return;
-
-next_command:
-	scsi_release_buffers(cmd);
-	scsi_next_command(cmd);
 }
 
 static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 12/14] scatterlist: allow chaining to preallocated chunks
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls Christoph Hellwig
@ 2014-06-25 16:51 ` Christoph Hellwig
  2014-07-09 11:21   ` Hannes Reinecke
  2014-06-25 16:52 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:51 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

Blk-mq drivers usually preallocate their S/G list as part of the request,
but if we want to support the very large S/G lists currently supported by
the SCSI code that would tie up a lot of memory in the preallocated request
pool.  Add support to the scatterlist code so that it can initialize an
S/G list that uses a preallocated first chunk and dynamically allocated
additional chunks.  That way the scsi-mq code can preallocate a first
page worth of S/G entries as part of the request, and dynamically extend
the S/G list when needed.
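
A usage sketch of the extended interface, modeled on the
scsi_alloc_sgtable()/scsi_free_sgtable() callers in this patch (the
skip_first_chunk=true variant is what the scsi-mq code in the next
patches relies on):

	/* chain dynamically allocated chunks onto a preallocated first
	 * chunk; on the free side skip_first_chunk leaves it alone */
	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
			       first_chunk, gfp_mask, scsi_sg_alloc);
	if (unlikely(ret))
		__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS,
				true, scsi_sg_free);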

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c     |   16 +++++++---------
 include/linux/scatterlist.h |    6 +++---
 lib/scatterlist.c           |   24 ++++++++++++++++--------
 3 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 58534fd..900b1c0 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -567,6 +567,11 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
 	return mempool_alloc(sgp->pool, gfp_mask);
 }
 
+static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
+{
+	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
+}
+
 static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
 			      gfp_t gfp_mask)
 {
@@ -575,19 +580,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
 	BUG_ON(!nents);
 
 	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
-			       gfp_mask, scsi_sg_alloc);
+			       NULL, gfp_mask, scsi_sg_alloc);
 	if (unlikely(ret))
-		__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS,
-				scsi_sg_free);
-
+		scsi_free_sgtable(sdb);
 	return ret;
 }
 
-static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
-{
-	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, scsi_sg_free);
-}
-
 /*
  * Function:    scsi_release_buffers()
  *
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index a964f72..f4ec8bb 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -229,10 +229,10 @@ void sg_init_one(struct scatterlist *, const void *, unsigned int);
 typedef struct scatterlist *(sg_alloc_fn)(unsigned int, gfp_t);
 typedef void (sg_free_fn)(struct scatterlist *, unsigned int);
 
-void __sg_free_table(struct sg_table *, unsigned int, sg_free_fn *);
+void __sg_free_table(struct sg_table *, unsigned int, bool, sg_free_fn *);
 void sg_free_table(struct sg_table *);
-int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int, gfp_t,
-		     sg_alloc_fn *);
+int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int,
+		     struct scatterlist *, gfp_t, sg_alloc_fn *);
 int sg_alloc_table(struct sg_table *, unsigned int, gfp_t);
 int sg_alloc_table_from_pages(struct sg_table *sgt,
 	struct page **pages, unsigned int n_pages,
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 3a8e8e8..48c15d2 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -165,6 +165,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
  * __sg_free_table - Free a previously mapped sg table
  * @table:	The sg table header to use
  * @max_ents:	The maximum number of entries per single scatterlist
+ * @skip_first_chunk: don't free the (preallocated) first scatterlist chunk
  * @free_fn:	Free function
  *
  *  Description:
@@ -174,7 +175,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
  *
  **/
 void __sg_free_table(struct sg_table *table, unsigned int max_ents,
-		     sg_free_fn *free_fn)
+		     bool skip_first_chunk, sg_free_fn *free_fn)
 {
 	struct scatterlist *sgl, *next;
 
@@ -202,7 +203,9 @@ void __sg_free_table(struct sg_table *table, unsigned int max_ents,
 		}
 
 		table->orig_nents -= sg_size;
-		free_fn(sgl, alloc_size);
+		if (!skip_first_chunk)
+			free_fn(sgl, alloc_size);
+		skip_first_chunk = false;
 		sgl = next;
 	}
 
@@ -217,7 +220,7 @@ EXPORT_SYMBOL(__sg_free_table);
  **/
 void sg_free_table(struct sg_table *table)
 {
-	__sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
+	__sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
 }
 EXPORT_SYMBOL(sg_free_table);
 
@@ -241,8 +244,8 @@ EXPORT_SYMBOL(sg_free_table);
  *
  **/
 int __sg_alloc_table(struct sg_table *table, unsigned int nents,
-		     unsigned int max_ents, gfp_t gfp_mask,
-		     sg_alloc_fn *alloc_fn)
+		     unsigned int max_ents, struct scatterlist *first_chunk,
+		     gfp_t gfp_mask, sg_alloc_fn *alloc_fn)
 {
 	struct scatterlist *sg, *prv;
 	unsigned int left;
@@ -269,7 +272,12 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
 
 		left -= sg_size;
 
-		sg = alloc_fn(alloc_size, gfp_mask);
+		if (first_chunk) {
+			sg = first_chunk;
+			first_chunk = NULL;
+		} else {
+			sg = alloc_fn(alloc_size, gfp_mask);
+		}
 		if (unlikely(!sg)) {
 			/*
 			 * Adjust entry count to reflect that the last
@@ -324,9 +332,9 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
 	int ret;
 
 	ret = __sg_alloc_table(table, nents, SG_MAX_SINGLE_ALLOC,
-			       gfp_mask, sg_kmalloc);
+			       NULL, gfp_mask, sg_kmalloc);
 	if (unlikely(ret))
-		__sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
+		__sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
 
 	return ret;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2014-06-25 16:51 ` [PATCH 12/14] scatterlist: allow chaining to preallocated chunks Christoph Hellwig
@ 2014-06-25 16:52 ` Christoph Hellwig
  2014-07-09 11:25   ` Hannes Reinecke
  2014-07-16 11:13   ` Mike Christie
  2014-06-25 16:52 ` [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case Christoph Hellwig
                   ` (2 subsequent siblings)
  15 siblings, 2 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

This patch adds support for an alternate I/O path in the scsi midlayer
which uses the blk-mq infrastructure instead of the legacy request code.

Use of blk-mq is fully transparent to drivers, although for now a host
template field is provided to opt out of blk-mq usage in case any unforeseen
incompatibilities arise.

In general replacing the legacy request code with blk-mq is a simple and
mostly mechanical transformation.  The biggest exception is the new code
that deals with the fact that I/O submissions in blk-mq must happen from
process context, which slightly complicates the I/O completion handler.
The second biggest difference is that blk-mq is built around the concept
of preallocated requests that also include driver specific data, which
in SCSI context means the scsi_cmnd structure.  This completely avoids
dynamic memory allocations for the fast path through I/O submission.
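
Concretely the command and its first S/G chunk are carved out of the
preallocated request payload without any allocation (condensed from
scsi_mq_prep_fn() below):

	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
	struct scatterlist *sg;

	/* the first S/G chunk sits right behind the scsi_cmnd and the
	 * LLD's per-command data */
	sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
	cmd->sdb.table.sgl = sg;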

Due to the preallocated requests the MQ code path exclusively uses the
host-wide shared tag allocator instead of a per-LUN one.  This only
affects drivers actually using the block layer provided tag allocator
instead of their own.  Unlike the old path blk-mq always provides a tag,
although drivers don't have to use it.

For now the blk-mq path is disabled by default and must be enabled using
the "use_blk_mq" module parameter.  Once the remaining work in the block
layer to make blk-mq more suitable for slow devices is complete I hope
to make it the default and eventually even remove the old code path.
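
(For early adopters that typically means booting with
scsi_mod.use_blk_mq=Y, or writing to
/sys/module/scsi_mod/parameters/use_blk_mq before hosts are probed, as
the flag is sampled at host allocation time; the read-only use_blk_mq
host attribute added below shows whether a given host picked it up.)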

Based on the earlier scsi-mq prototype by Nicholas Bellinger.

Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking and
various suggestions and code contributions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/hosts.c      |   30 ++-
 drivers/scsi/scsi.c       |    5 +-
 drivers/scsi/scsi_lib.c   |  475 +++++++++++++++++++++++++++++++++++++++------
 drivers/scsi/scsi_priv.h  |    3 +
 drivers/scsi/scsi_scan.c  |    5 +-
 drivers/scsi/scsi_sysfs.c |    2 +
 include/scsi/scsi_host.h  |   18 +-
 include/scsi/scsi_tcq.h   |   28 ++-
 8 files changed, 494 insertions(+), 72 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 0632eee..6322e6c 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 		goto fail;
 	}
 
+	if (shost_use_blk_mq(shost)) {
+		error = scsi_mq_setup_tags(shost);
+		if (error)
+			goto fail;
+	}
+
+	/*
+	 * Note that we allocate the freelist even for the MQ case for now,
+	 * as we need a command set aside for scsi_reset_provider.  Having
+	 * the full host freelist and one command available for that is a
+	 * little heavy-handed, but avoids introducing a special allocator
+	 * just for this.  Eventually the structure of scsi_reset_provider
+	 * will need a major overhaul.
+	 */
 	error = scsi_setup_command_freelist(shost);
 	if (error)
-		goto fail;
+		goto out_destroy_tags;
+
 
 	if (!shost->shost_gendev.parent)
 		shost->shost_gendev.parent = dev ? dev : &platform_bus;
@@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	error = device_add(&shost->shost_gendev);
 	if (error)
-		goto out;
+		goto out_destroy_freelist;
 
 	pm_runtime_set_active(&shost->shost_gendev);
 	pm_runtime_enable(&shost->shost_gendev);
@@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 	device_del(&shost->shost_dev);
  out_del_gendev:
 	device_del(&shost->shost_gendev);
- out:
+ out_destroy_freelist:
 	scsi_destroy_command_freelist(shost);
+ out_destroy_tags:
+	if (shost_use_blk_mq(shost))
+		scsi_mq_destroy_tags(shost);
  fail:
 	return error;
 }
@@ -309,7 +327,9 @@ static void scsi_host_dev_release(struct device *dev)
 	}
 
 	scsi_destroy_command_freelist(shost);
-	if (shost->bqt)
+	if (shost_use_blk_mq(shost) && shost->tag_set.tags)
+		scsi_mq_destroy_tags(shost);
+	else if (shost->bqt)
 		blk_free_tags(shost->bqt);
 
 	kfree(shost->shost_data);
@@ -436,6 +456,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	else
 		shost->dma_boundary = 0xffffffff;
 
+	shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
+
 	device_initialize(&shost->shost_gendev);
 	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
 	shost->shost_gendev.bus = &scsi_bus_type;
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index b362058..c089812 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -809,7 +809,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
 	 * is more IO than the LLD's can_queue (so there are not enuogh
 	 * tags) request_fn's host queue ready check will handle it.
 	 */
-	if (!sdev->host->bqt) {
+	if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
 		if (blk_queue_tagged(sdev->request_queue) &&
 		    blk_queue_resize_tags(sdev->request_queue, tags) != 0)
 			goto out;
@@ -1363,6 +1363,9 @@ MODULE_LICENSE("GPL");
 module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
 
+bool scsi_use_blk_mq = false;
+module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
+
 static int __init init_scsi(void)
 {
 	int error;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 900b1c0..5d39cfc 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1,5 +1,6 @@
 /*
- *  scsi_lib.c Copyright (C) 1999 Eric Youngdale
+ * Copyright (C) 1999 Eric Youngdale
+ * Copyright (C) 2014 Christoph Hellwig
  *
  *  SCSI queueing library.
  *      Initial versions: Eric Youngdale (eric@andante.org).
@@ -20,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/hardirq.h>
 #include <linux/scatterlist.h>
+#include <linux/blk-mq.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 	}
 }
 
+static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
+{
+	struct scsi_device *sdev = cmd->device;
+	struct request_queue *q = cmd->request->q;
+
+	blk_mq_requeue_request(cmd->request);
+	blk_mq_kick_requeue_list(q);
+	put_device(&sdev->sdev_gendev);
+}
+
 /**
  * __scsi_queue_insert - private queue insertion
  * @cmd: The SCSI command being requeued
@@ -150,6 +162,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	 * before blk_cleanup_queue() finishes.
 	 */
 	cmd->result = 0;
+	if (q->mq_ops) {
+		scsi_mq_requeue_cmd(cmd);
+		return;
+	}
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, cmd->request);
 	kblockd_schedule_work(&device->requeue_work);
@@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 	atomic_dec(&sdev->device_busy);
 }
 
+static void scsi_kick_queue(struct request_queue *q)
+{
+	if (q->mq_ops)
+		blk_mq_start_hw_queues(q);
+	else
+		blk_run_queue(q);
+}
+
 /*
  * Called for single_lun devices on IO completion. Clear starget_sdev_user,
  * and call blk_run_queue for all the scsi_devices on the target -
@@ -332,7 +356,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 	 * but in most cases, we will be first. Ideally, each LU on the
 	 * target would get some limited time or requests on the target.
 	 */
-	blk_run_queue(current_sdev->request_queue);
+	scsi_kick_queue(current_sdev->request_queue);
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	if (starget->starget_sdev_user)
@@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 			continue;
 
 		spin_unlock_irqrestore(shost->host_lock, flags);
-		blk_run_queue(sdev->request_queue);
+		scsi_kick_queue(sdev->request_queue);
 		spin_lock_irqsave(shost->host_lock, flags);
 	
 		scsi_device_put(sdev);
@@ -438,7 +462,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
 			continue;
 		spin_unlock_irqrestore(shost->host_lock, flags);
 
-		blk_run_queue(slq);
+		scsi_kick_queue(slq);
 		blk_put_queue(slq);
 
 		spin_lock_irqsave(shost->host_lock, flags);
@@ -469,7 +493,10 @@ static void scsi_run_queue(struct request_queue *q)
 	if (!list_empty(&sdev->host->starved_list))
 		scsi_starved_list_run(sdev->host);
 
-	blk_run_queue(q);
+	if (q->mq_ops)
+		blk_mq_start_stopped_hw_queues(q, false);
+	else
+		blk_run_queue(q);
 }
 
 void scsi_requeue_run_queue(struct work_struct *work)
@@ -567,25 +594,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
 	return mempool_alloc(sgp->pool, gfp_mask);
 }
 
-static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
+static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
 {
-	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
+	if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
+		return;
+	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
 }
 
 static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
-			      gfp_t gfp_mask)
+			      gfp_t gfp_mask, bool mq)
 {
+	struct scatterlist *first_chunk = NULL;
 	int ret;
 
 	BUG_ON(!nents);
 
+	if (mq) {
+		if (nents <= SCSI_MAX_SG_SEGMENTS) {
+			sdb->table.nents = nents;
+			sg_init_table(sdb->table.sgl, sdb->table.nents);
+			return 0;
+		}
+		first_chunk = sdb->table.sgl;
+	}
+
 	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
-			       NULL, gfp_mask, scsi_sg_alloc);
+			       first_chunk, gfp_mask, scsi_sg_alloc);
 	if (unlikely(ret))
-		scsi_free_sgtable(sdb);
+		scsi_free_sgtable(sdb, mq);
 	return ret;
 }
 
+static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
+{
+	if (cmd->request->cmd_type == REQ_TYPE_FS) {
+		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
+
+		if (drv->uninit_command)
+			drv->uninit_command(cmd);
+	}
+}
+
+static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
+{
+	if (cmd->sdb.table.nents)
+		scsi_free_sgtable(&cmd->sdb, true);
+	if (cmd->request->next_rq && cmd->request->next_rq->special)
+		scsi_free_sgtable(cmd->request->next_rq->special, true);
+	if (scsi_prot_sg_count(cmd))
+		scsi_free_sgtable(cmd->prot_sdb, true);
+}
+
+static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
+{
+	struct scsi_device *sdev = cmd->device;
+	unsigned long flags;
+
+	BUG_ON(list_empty(&cmd->list));
+
+	scsi_mq_free_sgtables(cmd);
+	scsi_uninit_cmd(cmd);
+
+	spin_lock_irqsave(&sdev->list_lock, flags);
+	list_del_init(&cmd->list);
+	spin_unlock_irqrestore(&sdev->list_lock, flags);
+}
+
 /*
  * Function:    scsi_release_buffers()
  *
@@ -605,12 +679,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
 void scsi_release_buffers(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		scsi_free_sgtable(&cmd->sdb);
+		scsi_free_sgtable(&cmd->sdb, false);
 
 	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
 
 	if (scsi_prot_sg_count(cmd))
-		scsi_free_sgtable(cmd->prot_sdb);
+		scsi_free_sgtable(cmd->prot_sdb, false);
 }
 EXPORT_SYMBOL(scsi_release_buffers);
 
@@ -618,7 +692,7 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
 {
 	struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;
 
-	scsi_free_sgtable(bidi_sdb);
+	scsi_free_sgtable(bidi_sdb, false);
 	kmem_cache_free(scsi_sdb_cache, bidi_sdb);
 	cmd->request->next_rq->special = NULL;
 }
@@ -629,8 +703,6 @@ static bool scsi_end_request(struct request *req, int error,
 	struct scsi_cmnd *cmd = req->special;
 	struct scsi_device *sdev = cmd->device;
 	struct request_queue *q = sdev->request_queue;
-	unsigned long flags;
-
 
 	if (blk_update_request(req, error, bytes))
 		return true;
@@ -643,14 +715,38 @@ static bool scsi_end_request(struct request *req, int error,
 	if (blk_queue_add_random(q))
 		add_disk_randomness(req->rq_disk);
 
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_finish_request(req, error);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (req->mq_ctx) {
+		/*
+		 * In the MQ case the command gets freed by __blk_mq_end_io,
+		 * so we have to do all cleanup that depends on it earlier.
+		 *
+		 * We also can't kick the queues from irq context, so we
+		 * will have to defer it to a workqueue.
+		 */
+		scsi_mq_uninit_cmd(cmd);
+
+		__blk_mq_end_io(req, error);
+
+		if (scsi_target(sdev)->single_lun ||
+		    !list_empty(&sdev->host->starved_list))
+			kblockd_schedule_work(&sdev->requeue_work);
+		else
+			blk_mq_start_stopped_hw_queues(q, true);
+
+		put_device(&sdev->sdev_gendev);
+	} else {
+		unsigned long flags;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_finish_request(req, error);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (bidi_bytes)
+			scsi_release_bidi_buffers(cmd);
+		scsi_release_buffers(cmd);
+		scsi_next_command(cmd);
+	}
 
-	if (bidi_bytes)
-		scsi_release_bidi_buffers(cmd);
-	scsi_release_buffers(cmd);
-	scsi_next_command(cmd);
 	return false;
 }
 
@@ -981,8 +1077,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		/* Unprep the request and put it back at the head of the queue.
 		 * A new command will be prepared and issued.
 		 */
-		scsi_release_buffers(cmd);
-		scsi_requeue_command(q, cmd);
+		if (q->mq_ops) {
+			cmd->request->cmd_flags &= ~REQ_DONTPREP;
+			scsi_mq_uninit_cmd(cmd);
+			scsi_mq_requeue_cmd(cmd);
+		} else {
+			scsi_release_buffers(cmd);
+			scsi_requeue_command(q, cmd);
+		}
 		break;
 	case ACTION_RETRY:
 		/* Retry the same command immediately */
@@ -1004,9 +1106,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
 	 * If sg table allocation fails, requeue request later.
 	 */
 	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
-					gfp_mask))) {
+					gfp_mask, req->mq_ctx != NULL)))
 		return BLKPREP_DEFER;
-	}
 
 	/* 
 	 * Next, walk the list, and fill in the addresses and sizes of
@@ -1034,21 +1135,27 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 {
 	struct scsi_device *sdev = cmd->device;
 	struct request *rq = cmd->request;
+	bool is_mq = (rq->mq_ctx != NULL);
+	int error;
 
-	int error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
+	error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
 	if (error)
 		goto err_exit;
 
 	if (blk_bidi_rq(rq)) {
-		struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
-			scsi_sdb_cache, GFP_ATOMIC);
-		if (!bidi_sdb) {
-			error = BLKPREP_DEFER;
-			goto err_exit;
+		if (!rq->q->mq_ops) {
+			struct scsi_data_buffer *bidi_sdb =
+				kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
+			if (!bidi_sdb) {
+				error = BLKPREP_DEFER;
+				goto err_exit;
+			}
+
+			rq->next_rq->special = bidi_sdb;
 		}
 
-		rq->next_rq->special = bidi_sdb;
-		error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
+		error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
+					  GFP_ATOMIC);
 		if (error)
 			goto err_exit;
 	}
@@ -1060,7 +1167,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		BUG_ON(prot_sdb == NULL);
 		ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
 
-		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
+		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
 			error = BLKPREP_DEFER;
 			goto err_exit;
 		}
@@ -1074,13 +1181,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		cmd->prot_sdb->table.nents = count;
 	}
 
-	return BLKPREP_OK ;
-
+	return BLKPREP_OK;
 err_exit:
-	scsi_release_buffers(cmd);
-	cmd->request->special = NULL;
-	scsi_put_command(cmd);
-	put_device(&sdev->sdev_gendev);
+	if (is_mq) {
+		scsi_mq_free_sgtables(cmd);
+	} else {
+		scsi_release_buffers(cmd);
+		cmd->request->special = NULL;
+		scsi_put_command(cmd);
+		put_device(&sdev->sdev_gendev);
+	}
 	return error;
 }
 EXPORT_SYMBOL(scsi_init_io);
@@ -1295,13 +1405,7 @@ out:
 
 static void scsi_unprep_fn(struct request_queue *q, struct request *req)
 {
-	if (req->cmd_type == REQ_TYPE_FS) {
-		struct scsi_cmnd *cmd = req->special;
-		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
-
-		if (drv->uninit_command)
-			drv->uninit_command(cmd);
-	}
+	scsi_uninit_cmd(req->special);
 }
 
 /*
@@ -1318,7 +1422,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 	busy = atomic_inc_return(&sdev->device_busy) - 1;
 	if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
 		if (atomic_dec_return(&sdev->device_blocked) > 0) {
-			blk_delay_queue(q, SCSI_QUEUE_DELAY);
+			/*
+			 * For the MQ case we take care of this in the caller.
+			 */
+			if (!q->mq_ops)
+				blk_delay_queue(q, SCSI_QUEUE_DELAY);
 			goto out_dec;
 		}
 		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
@@ -1688,6 +1796,188 @@ out_delay:
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 
+static inline int prep_to_mq(int ret)
+{
+	switch (ret) {
+	case BLKPREP_OK:
+		return 0;
+	case BLKPREP_DEFER:
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	default:
+		return BLK_MQ_RQ_QUEUE_ERROR;
+	}
+}
+
+static int scsi_mq_prep_fn(struct request *req)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	struct scsi_device *sdev = req->q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned char *sense_buf = cmd->sense_buffer;
+	struct scatterlist *sg;
+
+	memset(cmd, 0, sizeof(struct scsi_cmnd));
+
+	req->special = cmd;
+
+	cmd->request = req;
+	cmd->device = sdev;
+	cmd->sense_buffer = sense_buf;
+
+	cmd->tag = req->tag;
+
+	req->cmd = req->__cmd;
+	cmd->cmnd = req->cmd;
+	cmd->prot_op = SCSI_PROT_NORMAL;
+
+	INIT_LIST_HEAD(&cmd->list);
+	INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
+	cmd->jiffies_at_alloc = jiffies;
+
+	/*
+	 * XXX: cmd_list lookups are only used by two drivers, try to get
+	 * rid of this list in common code.
+	 */
+	spin_lock_irq(&sdev->list_lock);
+	list_add_tail(&cmd->list, &sdev->cmd_list);
+	spin_unlock_irq(&sdev->list_lock);
+
+	sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
+	cmd->sdb.table.sgl = sg;
+
+	if (scsi_host_get_prot(shost)) {
+		cmd->prot_sdb = (void *)sg +
+			shost->sg_tablesize * sizeof(struct scatterlist);
+		memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
+
+		cmd->prot_sdb->table.sgl =
+			(struct scatterlist *)(cmd->prot_sdb + 1);
+	}
+
+	if (blk_bidi_rq(req)) {
+		struct request *next_rq = req->next_rq;
+		struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
+
+		memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
+		bidi_sdb->table.sgl =
+			(struct scatterlist *)(bidi_sdb + 1);
+
+		next_rq->special = bidi_sdb;
+	}
+
+	switch (req->cmd_type) {
+	case REQ_TYPE_FS:
+		return scsi_cmd_to_driver(cmd)->init_command(cmd);
+	case REQ_TYPE_BLOCK_PC:
+		return scsi_setup_blk_pc_cmnd(cmd->device, req);
+	default:
+		return BLKPREP_KILL;
+	}
+}
+
+static void scsi_mq_done(struct scsi_cmnd *cmd)
+{
+	trace_scsi_dispatch_cmd_done(cmd);
+	blk_mq_complete_request(cmd->request);
+}
+
+static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+{
+	struct request_queue *q = req->q;
+	struct scsi_device *sdev = q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	int ret;
+	int reason;
+
+	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
+	if (ret)
+		goto out;
+
+	ret = BLK_MQ_RQ_QUEUE_BUSY;
+	if (!get_device(&sdev->sdev_gendev))
+		goto out;
+
+	if (!scsi_dev_queue_ready(q, sdev))
+		goto out_put_device;
+	if (!scsi_target_queue_ready(shost, sdev))
+		goto out_dec_device_busy;
+	if (!scsi_host_queue_ready(q, shost, sdev))
+		goto out_dec_target_busy;
+
+	if (!(req->cmd_flags & REQ_DONTPREP)) {
+		ret = prep_to_mq(scsi_mq_prep_fn(req));
+		if (ret)
+			goto out_dec_host_busy;
+		req->cmd_flags |= REQ_DONTPREP;
+	}
+
+	scsi_init_cmd_errh(cmd);
+	cmd->scsi_done = scsi_mq_done;
+
+	reason = scsi_dispatch_cmd(cmd);
+	if (reason) {
+		scsi_set_blocked(cmd, reason);
+		ret = BLK_MQ_RQ_QUEUE_BUSY;
+		goto out_dec_host_busy;
+	}
+
+	return BLK_MQ_RQ_QUEUE_OK;
+
+out_dec_host_busy:
+	cancel_delayed_work(&cmd->abort_work);
+	atomic_dec(&shost->host_busy);
+out_dec_target_busy:
+	if (scsi_target(sdev)->can_queue > 0)
+		atomic_dec(&scsi_target(sdev)->target_busy);
+out_dec_device_busy:
+	atomic_dec(&sdev->device_busy);
+out_put_device:
+	put_device(&sdev->sdev_gendev);
+out:
+	switch (ret) {
+	case BLK_MQ_RQ_QUEUE_BUSY:
+		blk_mq_stop_hw_queue(hctx);
+		if (atomic_read(&sdev->device_busy) == 0 &&
+		    !scsi_device_blocked(sdev))
+			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+		break;
+	case BLK_MQ_RQ_QUEUE_ERROR:
+		/*
+		 * Make sure to release all allocated resources when
+		 * we hit an error, as we will never see this command
+		 * again.
+		 */
+		if (req->cmd_flags & REQ_DONTPREP)
+			scsi_mq_uninit_cmd(cmd);
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+static int scsi_init_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx,
+		unsigned int numa_node)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
+			numa_node);
+	if (!cmd->sense_buffer)
+		return -ENOMEM;
+	return 0;
+}
+
+static void scsi_exit_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	kfree(cmd->sense_buffer);
+}
+
 u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 {
 	struct device *host_dev;
@@ -1710,16 +2000,10 @@ u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 }
 EXPORT_SYMBOL(scsi_calculate_bounce_limit);
 
-struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
-					 request_fn_proc *request_fn)
+static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
-	struct request_queue *q;
 	struct device *dev = shost->dma_dev;
 
-	q = blk_init_queue(request_fn, NULL);
-	if (!q)
-		return NULL;
-
 	/*
 	 * this limit is imposed by hardware restrictions
 	 */
@@ -1750,7 +2034,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	 * blk_queue_update_dma_alignment() later.
 	 */
 	blk_queue_dma_alignment(q, 0x03);
+}
 
+struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
+					 request_fn_proc *request_fn)
+{
+	struct request_queue *q;
+
+	q = blk_init_queue(request_fn, NULL);
+	if (!q)
+		return NULL;
+	__scsi_init_queue(shost, q);
 	return q;
 }
 EXPORT_SYMBOL(__scsi_alloc_queue);
@@ -1771,6 +2065,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 	return q;
 }
 
+static struct blk_mq_ops scsi_mq_ops = {
+	.map_queue	= blk_mq_map_queue,
+	.queue_rq	= scsi_queue_rq,
+	.complete	= scsi_softirq_done,
+	.timeout	= scsi_times_out,
+	.init_request	= scsi_init_request,
+	.exit_request	= scsi_exit_request,
+};
+
+struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
+{
+	sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
+	if (IS_ERR(sdev->request_queue))
+		return NULL;
+
+	sdev->request_queue->queuedata = sdev;
+	__scsi_init_queue(sdev->host, sdev->request_queue);
+	return sdev->request_queue;
+}
+
+int scsi_mq_setup_tags(struct Scsi_Host *shost)
+{
+	unsigned int cmd_size, sgl_size, tbl_size;
+
+	tbl_size = shost->sg_tablesize;
+	if (tbl_size > SCSI_MAX_SG_SEGMENTS)
+		tbl_size = SCSI_MAX_SG_SEGMENTS;
+	sgl_size = tbl_size * sizeof(struct scatterlist);
+	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
+	if (scsi_host_get_prot(shost))
+		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+
+	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
+	shost->tag_set.ops = &scsi_mq_ops;
+	shost->tag_set.nr_hw_queues = 1;
+	shost->tag_set.queue_depth = shost->can_queue;
+	shost->tag_set.cmd_size = cmd_size;
+	shost->tag_set.numa_node = NUMA_NO_NODE;
+	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	shost->tag_set.driver_data = shost;
+
+	return blk_mq_alloc_tag_set(&shost->tag_set);
+}
+
+void scsi_mq_destroy_tags(struct Scsi_Host *shost)
+{
+	blk_mq_free_tag_set(&shost->tag_set);
+}
+
 /*
  * Function:    scsi_block_requests()
  *
@@ -2516,9 +2859,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
 	 * block layer from calling the midlayer with this device's
 	 * request queue. 
 	 */
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_stop_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (q->mq_ops) {
+		blk_mq_stop_hw_queues(q);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_stop_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 
 	return 0;
 }
@@ -2564,9 +2911,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
 		 sdev->sdev_state != SDEV_OFFLINE)
 		return -EINVAL;
 
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_start_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (q->mq_ops) {
+		blk_mq_start_stopped_hw_queues(q, false);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_start_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 
 	return 0;
 }
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index a45d1c2..12b8e1b 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
 extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
 extern void scsi_run_host_queues(struct Scsi_Host *shost);
 extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
+extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
+extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
+extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
 extern int scsi_init_queue(void);
 extern void scsi_exit_queue(void);
 struct request_queue;
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 4a6e4ba..b91cfaf 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	 */
 	sdev->borken = 1;
 
-	sdev->request_queue = scsi_alloc_queue(sdev);
+	if (shost_use_blk_mq(shost))
+		sdev->request_queue = scsi_mq_alloc_queue(sdev);
+	else
+		sdev->request_queue = scsi_alloc_queue(sdev);
 	if (!sdev->request_queue) {
 		/* release fn is set up in scsi_sysfs_device_initialise, so
 		 * have to free and put manually here */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index deef063..6c9227f 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
 
 static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
 
+shost_rd_attr(use_blk_mq, "%d\n");
 shost_rd_attr(unique_id, "%u\n");
 shost_rd_attr(cmd_per_lun, "%hd\n");
 shost_rd_attr(can_queue, "%hd\n");
@@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
 static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
 
 static struct attribute *scsi_sysfs_shost_attrs[] = {
+	&dev_attr_use_blk_mq.attr,
 	&dev_attr_unique_id.attr,
 	&dev_attr_host_busy.attr,
 	&dev_attr_cmd_per_lun.attr,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7f9bbda..b54511e 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -7,6 +7,7 @@
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
 #include <linux/seq_file.h>
+#include <linux/blk-mq.h>
 #include <scsi/scsi.h>
 
 struct request_queue;
@@ -531,6 +532,9 @@ struct scsi_host_template {
 	 */
 	unsigned int cmd_size;
 	struct scsi_host_cmd_pool *cmd_pool;
+
+	/* temporary flag to disable blk-mq I/O path */
+	bool disable_blk_mq;
 };
 
 /*
@@ -601,7 +605,10 @@ struct Scsi_Host {
 	 * Area to keep a shared tag map (if needed, will be
 	 * NULL if not).
 	 */
-	struct blk_queue_tag	*bqt;
+	union {
+		struct blk_queue_tag	*bqt;
+		struct blk_mq_tag_set	tag_set;
+	};
 
 	atomic_t host_busy;		   /* commands actually active on low-level */
 	atomic_t host_blocked;
@@ -693,6 +700,8 @@ struct Scsi_Host {
 	/* The controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	unsigned use_blk_mq:1;
+
 	/*
 	 * Optional work queue to be utilized by the transport
 	 */
@@ -793,6 +802,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
 		shost->tmf_in_progress;
 }
 
+extern bool scsi_use_blk_mq;
+
+static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
+{
+	return shost->use_blk_mq;
+}
+
 extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 81dd12e..cdcc90b 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
 	if (!sdev->tagged_supported)
 		return;
 
-	if (!blk_queue_tagged(sdev->request_queue))
+	if (!shost_use_blk_mq(sdev->host) &&
+	    !blk_queue_tagged(sdev->request_queue))
 		blk_queue_init_tags(sdev->request_queue, depth,
 				    sdev->host->bqt);
 
@@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
  **/
 static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
 {
-	if (blk_queue_tagged(sdev->request_queue))
+	if (!shost_use_blk_mq(sdev->host) &&
+	    blk_queue_tagged(sdev->request_queue))
 		blk_queue_free_tags(sdev->request_queue);
 	scsi_adjust_queue_depth(sdev, 0, depth);
 }
@@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
 	return 0;
 }
 
+static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
+		unsigned int hw_ctx, int tag)
+{
+	struct request *req;
+
+	req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
+	return req ? (struct scsi_cmnd *)req->special : NULL;
+}
+
 /**
  * scsi_find_tag - find a tagged command by device
  * @SDpnt:	pointer to the ScSI device
@@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
  **/
 static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 {
-
         struct request *req;
 
         if (tag != SCSI_NO_TAG) {
+		if (shost_use_blk_mq(sdev->host))
+			return scsi_mq_find_tag(sdev->host, 0, tag);
+
         	req = blk_queue_find_tag(sdev->request_queue, tag);
 	        return req ? (struct scsi_cmnd *)req->special : NULL;
 	}
@@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 	return sdev->current_cmnd;
 }
 
+
 /**
  * scsi_init_shared_tag_map - create a shared tag map
  * @shost:	the host to share the tag map among all devices
@@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
 {
 	/*
+	 * We always have a shared tag map around when using blk-mq.
+	 */
+	if (shost_use_blk_mq(shost))
+		return 0;
+
+	/*
 	 * If the shared tag map isn't already initialized, do it now.
 	 * This saves callers from having to check ->bqt when setting up
 	 * devices on the shared host (for libata)
@@ -165,6 +185,8 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
 	struct request *req;
 
 	if (tag != SCSI_NO_TAG) {
+		if (shost_use_blk_mq(shost))
+			return scsi_mq_find_tag(shost, 0, tag);
 		req = blk_map_queue_find_tag(shost->bqt, tag);
 		return req ? (struct scsi_cmnd *)req->special : NULL;
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2014-06-25 16:52 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
@ 2014-06-25 16:52 ` Christoph Hellwig
  2014-07-09 11:27   ` Hannes Reinecke
  2014-06-26  4:50 ` scsi-mq V2 Jens Axboe
  2014-07-08 14:48 ` Christoph Hellwig
  15 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-25 16:52 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi,
	linux-kernel, Hiral Patel, Suma Ramars, Brian Uchino

Currently the midlayer fakes up a struct request for the explicit reset
ioctls, and those don't have a tag allocated to them.  The fnic driver pokes
into midlayer structures to paper over this design issue, but that won't
work for the blk-mq case.

Either someone who can actually test the hardware will have to come up with
a similar hack for the blk-mq case, or we'll have to bite the bullet and fix
the way the EH ioctls work for real, but until that happens we fail these
explicit requests here.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Hiral Patel <hiralpat@cisco.com>
Cc: Suma Ramars <sramars@cisco.com>
Cc: Brian Uchino <buchino@cisco.com>
---
 drivers/scsi/fnic/fnic_scsi.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
index 3f88f56..961bdf5 100644
--- a/drivers/scsi/fnic/fnic_scsi.c
+++ b/drivers/scsi/fnic/fnic_scsi.c
@@ -2224,6 +2224,22 @@ int fnic_device_reset(struct scsi_cmnd *sc)
 
 	tag = sc->request->tag;
 	if (unlikely(tag < 0)) {
+		/*
+		 * XXX(hch): currently the midlayer fakes up a struct
+		 * request for the explicit reset ioctls, and those
+		 * don't have a tag allocated to them.  The below
+		 * code pokes into midlayer structures to paper over
+		 * this design issue, but that won't work for blk-mq.
+		 *
+		 * Either someone who can actually test the hardware
+		 * will have to come up with a similar hack for the
+		 * blk-mq case, or we'll have to bite the bullet and
+		 * fix the way the EH ioctls work for real, but until
+		 * that happens we fail these explicit requests here.
+		 */
+		if (shost_use_blk_mq(sc->device->host))
+			goto fnic_device_reset_end;
+
 		tag = fnic_scsi_host_start_tag(fnic, sc);
 		if (unlikely(tag == SCSI_NO_TAG))
 			goto fnic_device_reset_end;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2014-06-25 16:52 ` [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case Christoph Hellwig
@ 2014-06-26  4:50 ` Jens Axboe
  2014-06-26 22:07   ` Elliott, Robert (Server Storage)
  2014-06-30 15:20   ` Jens Axboe
  2014-07-08 14:48 ` Christoph Hellwig
  15 siblings, 2 replies; 99+ messages in thread
From: Jens Axboe @ 2014-06-26  4:50 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 2014-06-25 10:51, Christoph Hellwig wrote:
> This is the second post of the scsi-mq series.
>
> At this point the code is ready for merging and use by developers and early
> adopters.  The core blk-mq code isn't that suitable for slow devices
> yet, mostly due to the lack of an I/O scheduler, but Jens is working on it.
> Similarly there is no dm-multipath support for drivers using blk-mq yet,
> but I'm working on it.  It should also be noted that the code doesn't
> actually support multiple hardware queues or fine grained tuning of the
> blk-mq parameters yet.  All these could be added fairly easily as soon
> as low-level drivers want to make use of them.
>
> The amount of chances to the existing code are fairly small, and mostly
> speedups or cleanups that also apply to the old path as well.  Because
> of this I also haven't bothered to put it under a config option, just
> like the blk-mq core.
>
> The usage of blk-mq dramatically decreases CPU usage under all workloads going
> down from 100% CPU usage that the old setup can hit easily to usually less
> than 20% for maxing out storage subsystems with 512byte reads and writes,
> and it allows to easily archive millions of IOPS.  Bart and Robert have
> helped with some very detailed measurements that they might be able to send
> in reply to this, although these usually involve significantly reworked low
> level drivers to avoid other bottle necks.
>
> One major objection to previous iterations of this code was the simple
> replacement of the host_lock with atomic counters for the host and busy
> counters.  The host_lock avoidance on it's own already improves performance,
> and with the patch to avoid maintaining the per-target busy counter unless
> needed we now replace a lock round trip on the host_lock with just a single
> atomic increment in the submission path, and a single atomic decrement in
> completion path, which should provide benefits even for the oddest RISC
> architecture.  Longer term I'd still love to get rid of these entirely
> and use the counters in blk-mq, but due to the difference in how they
> are maintained this doesn't seem feasible as long as we still need to
> support the legacy request code path.
>
> Changes from V1:
>   - rebased on top of the core-for-3.17 branch, most notable the
>     scsi logging changes
>   - fixed handling of cmd_list to prevent crashes for some heavy
>     workloads
>   - fixed incorrect handling of !target->can_queue
>   - avoid scheduling a workqueue on I/O completions when no queues
>     are congested
>
> In addition to the patches in this thread there also is a git available at:
>
> 	git://git.infradead.org/users/hch/scsi.git scsi-mq.2

You can add my acked/reviewed-by to the series.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-06-26  4:50 ` scsi-mq V2 Jens Axboe
@ 2014-06-26 22:07   ` Elliott, Robert (Server Storage)
  2014-06-27 14:42     ` Bart Van Assche
  2014-06-30 15:20   ` Jens Axboe
  1 sibling, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-06-26 22:07 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, James Bottomley
  Cc: Bart Van Assche, linux-scsi, linux-kernel



> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Wednesday, 25 June, 2014 11:51 PM
> To: Christoph Hellwig; James Bottomley
> Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
> scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On 2014-06-25 10:51, Christoph Hellwig wrote:
> > This is the second post of the scsi-mq series.
> >
...
> >
> > Changes from V1:
> >   - rebased on top of the core-for-3.17 branch, most notable the
> >     scsi logging changes
> >   - fixed handling of cmd_list to prevent crashes for some heavy
> >     workloads
> >   - fixed incorrect handling of !target->can_queue
> >   - avoid scheduling a workqueue on I/O completions when no queues
> >     are congested
> >
> > In addition to the patches in this thread there also is a git available at:
> >
> > 	git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> 
> You can add my acked/reviewed-by to the series.
> 
> --
> Jens Axboe

Since March 20th (circa LSF-MM 2014) we've run many hours of tests
with hpsa and the scsi-mq tree.  We've also done a little bit of 
testing with mpt3sas and, in the last few days, scsi_debug.

Although there are certainly more problems to find and improvements
to be made, it's become quite stable.  It's even been used on the
boot drives of our test servers.

For the patches in scsi-mq.2 you may add:
Tested-by: Robert Elliott <elliott@hp.com>


---
Rob Elliott    HP Server Storage




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-06-26 22:07   ` Elliott, Robert (Server Storage)
@ 2014-06-27 14:42     ` Bart Van Assche
  0 siblings, 0 replies; 99+ messages in thread
From: Bart Van Assche @ 2014-06-27 14:42 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Elliott, Robert (Server Storage), linux-scsi, linux-kernel

On 06/27/14 00:07, Elliott, Robert (Server Storage) wrote:
>> -----Original Message-----
>> From: Jens Axboe [mailto:axboe@kernel.dk]
>> Sent: Wednesday, 25 June, 2014 11:51 PM
>> To: Christoph Hellwig; James Bottomley
>> Cc: Bart Van Assche; Elliott, Robert (Server Storage); linux-
>> scsi@vger.kernel.org; linux-kernel@vger.kernel.org
>> Subject: Re: scsi-mq V2
>>
>> On 2014-06-25 10:51, Christoph Hellwig wrote:
>>> This is the second post of the scsi-mq series.
>>>
> ...
>>>
>>> Changes from V1:
>>>   - rebased on top of the core-for-3.17 branch, most notable the
>>>     scsi logging changes
>>>   - fixed handling of cmd_list to prevent crashes for some heavy
>>>     workloads
>>>   - fixed incorrect handling of !target->can_queue
>>>   - avoid scheduling a workqueue on I/O completions when no queues
>>>     are congested
>>>
>>> In addition to the patches in this thread there also is a git available at:
>>>
>>> 	git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>>
>> You can add my acked/reviewed-by to the series.
> 
> Since March 20th (circa LSF-MM 2014) we've run many hours of tests
> with hpsa and the scsi-mq tree.  We've also done a little bit of 
> testing with mpt3sas and, in the last few days, scsi_debug.
> 
> Although there are certainly more problems to find and improvements
> to be made, it's become quite stable.  It's even been used on the
> boot drives of our test servers.
> 
> For the patches in scsi-mq.2 you may add:
> Tested-by: Robert Elliott <elliott@hp.com>

Performance of scsi-mq-v2 looks even better than that of scsi-mq-v1. The
slight single-LUN regression is gone; peak IOPS with use_blk_mq=Y on my
test setup is now 3x the performance of use_blk_mq=N, and latency has
been reduced further. I think this means reducing the number of context
switches really did help :-) Detailed measurement results can be found
at https://drive.google.com/file/d/0B1YQOreL3_FxWmZfbl8xSzRfdGM/.

If you want you may add to the scsi-mq-v2 patch series:

Tested-by: Bart Van Assche <bvanassche@acm.org>

Bart.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-06-26  4:50 ` scsi-mq V2 Jens Axboe
  2014-06-26 22:07   ` Elliott, Robert (Server Storage)
@ 2014-06-30 15:20   ` Jens Axboe
  2014-06-30 15:25     ` Christoph Hellwig
  1 sibling, 1 reply; 99+ messages in thread
From: Jens Axboe @ 2014-06-30 15:20 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel, Chris Mason

On 06/25/2014 10:50 PM, Jens Axboe wrote:
> On 2014-06-25 10:51, Christoph Hellwig wrote:
>> This is the second post of the scsi-mq series.
>>
>> At this point the code is ready for merging and use by developers and
>> early
>> adopters.  The core blk-mq code isn't that suitable for slow devices
>> yet, mostly due to the lack of an I/O scheduler, but Jens is working
>> on it.
>> Similarly there is no dm-multipath support for drivers using blk-mq yet,
>> but I'm working on it.  It should also be noted that the code doesn't
>> actually support multiple hardware queues or fine grained tuning of the
>> blk-mq parameters yet.  All these could be added fairly easily as soon
>> as low-level drivers want to make use of them.
>>
>> The amount of changes to the existing code is fairly small, and mostly
>> speedups or cleanups that apply to the old path as well.  Because
>> of this I also haven't bothered to put it under a config option, just
>> like the blk-mq core.
>>
>> The usage of blk-mq dramatically decreases CPU usage under all
>> workloads going
>> down from 100% CPU usage that the old setup can hit easily to usually
>> less
>> than 20% for maxing out storage subsystems with 512-byte reads and writes,
>> and it allows easily achieving millions of IOPS.  Bart and Robert have
>> helped with some very detailed measurements that they might be able to
>> send
>> in reply to this, although these usually involve significantly
>> reworked low-level drivers to avoid other bottlenecks.
>>
>> One major objection to previous iterations of this code was the simple
>> replacement of the host_lock with atomic counters for the host and busy
>> counters.  The host_lock avoidance on its own already improves
>> performance,
>> and with the patch to avoid maintaining the per-target busy counter
>> unless
>> needed we now replace a lock round trip on the host_lock with just a
>> single
>> atomic increment in the submission path, and a single atomic decrement in
>> completion path, which should provide benefits even for the oddest RISC
>> architecture.  Longer term I'd still love to get rid of these entirely
>> and use the counters in blk-mq, but due to the difference in how they
>> are maintained this doesn't seem feasible as long as we still need to
>> support the legacy request code path.
>>
>> Changes from V1:
>>   - rebased on top of the core-for-3.17 branch, most notably the
>>     scsi logging changes
>>   - fixed handling of cmd_list to prevent crashes for some heavy
>>     workloads
>>   - fixed incorrect handling of !target->can_queue
>>   - avoid scheduling a workqueue on I/O completions when no queues
>>     are congested
>>
>> In addition to the patches in this thread there also is a git
>> available at:
>>
>>     git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> 
> You can add my acked/reviewed-by to the series.

Ran stress testing from Friday to now, 65h of beating up on it and no
problems observed. 47TB read and 20TB written for a total of 17.7
billion IOs issued and completed. Latencies look good. I officially
declare this code bug free.

Bug-free-by: Jens Axboe <axboe@fb.com>

Now let's get this queued up for inclusion, pretty please.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-06-30 15:20   ` Jens Axboe
@ 2014-06-30 15:25     ` Christoph Hellwig
  2014-06-30 15:54       ` Martin K. Petersen
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-30 15:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: James Bottomley, Bart Van Assche, Robert Elliott, linux-scsi,
	linux-kernel, Chris Mason

On Mon, Jun 30, 2014 at 09:20:51AM -0600, Jens Axboe wrote:
> Ran stress testing from Friday to now, 65h of beating up on it and no
> problems observed. 47TB read and 20TB written for a total of 17.7
> billion IOs issued and completed. Latencies look good. I officially
> declare this code bug free.
> 
> Bug-free-by: Jens Axboe <axboe@fb.com>
> 
> Now let's get this queued up for inclusion, pretty please.

I'm still looking for one (or better, two) people familiar with the
SCSI and/or block code to go over it and do a really detailed review.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-06-30 15:25     ` Christoph Hellwig
@ 2014-06-30 15:54       ` Martin K. Petersen
  0 siblings, 0 replies; 99+ messages in thread
From: Martin K. Petersen @ 2014-06-30 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, James Bottomley, Bart Van Assche, Robert Elliott,
	linux-scsi, linux-kernel, Chris Mason

>>>>> "Christoph" == Christoph Hellwig <hch@infradead.org> writes:

Christoph> I'm still looking for one (or better, two) people familiar
Christoph> with the SCSI and/or block code to go over it and do a
Christoph> really detailed review.

I'm on vacation for a couple of days. Will review Wednesday.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2014-06-26  4:50 ` scsi-mq V2 Jens Axboe
@ 2014-07-08 14:48 ` Christoph Hellwig
  2014-07-09 16:39   ` Douglas Gilbert
  2014-07-14  9:13   ` Sagi Grimberg
  15 siblings, 2 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-08 14:48 UTC (permalink / raw)
  To: James Bottomley, Jens Axboe, Bart Van Assche, Robert Elliott,
	linux-scsi, linux-kernel

On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
> Changes from V1:
>  - rebased on top of the core-for-3.17 branch, most notably the
>    scsi logging changes
>  - fixed handling of cmd_list to prevent crashes for some heavy
>    workloads
>  - fixed incorrect handling of !target->can_queue
>  - avoid scheduling a workqueue on I/O completions when no queues
>    are congested
> 
> In addition to the patches in this thread there also is a git available at:
> 
> 	git://git.infradead.org/users/hch/scsi.git scsi-mq.2


I've pushed out a new scsi-mq.3 branch, which has been rebased on the
latest core-for-3.17 tree + the "RFC: clean up command setup" series
from June 29th.  Robert Elliott found a problem with not fully zeroed
out UNMAP CDBs, which is fixed by the saner discard handling in that
series.

There is a new patch to factor the code from the above series for
blk-mq use, which I've attached below.  Besides that, the only changes
are minor merge fixups in the main blk-mq usage patch.

---
>From f925c317c74849666d599926d8ad8f34ef99d5cf Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 8 Jul 2014 13:16:17 +0200
Subject: scsi: add scsi_setup_cmnd helper

Factor out command setup code that will be shared with the blk-mq code path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c |   40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 116f541..61afae8 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1116,6 +1116,27 @@ static int scsi_setup_fs_cmnd(struct scsi_device *sdev, struct request *req)
 	return scsi_cmd_to_driver(cmd)->init_command(cmd);
 }
 
+static int scsi_setup_cmnd(struct scsi_device *sdev, struct request *req)
+{
+	struct scsi_cmnd *cmd = req->special;
+
+	if (!blk_rq_bytes(req))
+		cmd->sc_data_direction = DMA_NONE;
+	else if (rq_data_dir(req) == WRITE)
+		cmd->sc_data_direction = DMA_TO_DEVICE;
+	else
+		cmd->sc_data_direction = DMA_FROM_DEVICE;
+
+	switch (req->cmd_type) {
+	case REQ_TYPE_FS:
+		return scsi_setup_fs_cmnd(sdev, req);
+	case REQ_TYPE_BLOCK_PC:
+		return scsi_setup_blk_pc_cmnd(sdev, req);
+	default:
+		return BLKPREP_KILL;
+	}
+}
+
 static int
 scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 {
@@ -1219,24 +1240,7 @@ static int scsi_prep_fn(struct request_queue *q, struct request *req)
 		goto out;
 	}
 
-	if (!blk_rq_bytes(req))
-		cmd->sc_data_direction = DMA_NONE;
-	else if (rq_data_dir(req) == WRITE)
-		cmd->sc_data_direction = DMA_TO_DEVICE;
-	else
-		cmd->sc_data_direction = DMA_FROM_DEVICE;
-
-	switch (req->cmd_type) {
-	case REQ_TYPE_FS:
-		ret = scsi_setup_fs_cmnd(sdev, req);
-		break;
-	case REQ_TYPE_BLOCK_PC:
-		ret = scsi_setup_blk_pc_cmnd(sdev, req);
-		break;
-	default:
-		ret = BLKPREP_KILL;
-	}
-
+	ret = scsi_setup_cmnd(sdev, req);
 out:
 	return scsi_prep_return(q, req, ret);
 }
-- 
1.7.10.4
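
For context, a sketch of how this shared helper is meant to be called
from the blk-mq path (illustrative only; the actual queue_rq in the
series wires this up through a prep helper and differs in detail):

static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
{
	struct scsi_device *sdev = req->q->queuedata;
	int ret;

	/* same command setup as the legacy scsi_prep_fn() path */
	ret = scsi_setup_cmnd(sdev, req);
	if (ret != BLKPREP_OK)
		return BLK_MQ_RQ_QUEUE_ERROR;	/* sketch: real code maps ret */

	/* ... busy accounting and dispatch to the LLD go here ... */
	return BLK_MQ_RQ_QUEUE_OK;
}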


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* RE: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn
  2014-06-25 16:51 ` [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn Christoph Hellwig
@ 2014-07-08 20:51   ` Elliott, Robert (Server Storage)
  2014-07-09  6:40     ` Christoph Hellwig
  2014-07-09 11:13   ` Hannes Reinecke
  1 sibling, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-08 20:51 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, linux-scsi, linux-kernel



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@lst.de]
> Sent: Wednesday, 25 June, 2014 11:52 AM
> To: James Bottomley
> Cc: Jens Axboe; Bart Van Assche; Elliott, Robert (Server Storage); linux-
> scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH 03/14] scsi: centralize command re-queueing in
> scsi_dispatch_fn
> 
> Make sure we only have the logic for requeueing commands in one place.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/scsi/scsi.c     |   35 ++++++++++++-----------------------
>  drivers/scsi/scsi_lib.c |    9 ++++++---
>  2 files changed, 18 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index ce5b4e5..dcc43fd 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>  		 * returns an immediate error upwards, and signals
>  		 * that the device is no longer present */
>  		cmd->result = DID_NO_CONNECT << 16;
> -		scsi_done(cmd);
> -		/* return 0 (because the command has been processed) */
> -		goto out;
> +		goto done;
>  	}
> 
>  	/* Check to see if the scsi lld made this device blocked. */
> @@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>  		 * occur until the device transitions out of the
>  		 * suspend state.
>  		 */
> -
> -		scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> -
>  		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
>  			"queuecommand : device blocked\n"));
> -
> -		/*
> -		 * NOTE: rtn is still zero here because we don't need the
> -		 * queue to be plugged on return (it's already stopped)
> -		 */
> -		goto out;
> +		return SCSI_MLQUEUE_DEVICE_BUSY;
>  	}
> 
>  	/*
> @@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>  			       "cdb_size=%d host->max_cmd_len=%d\n",
>  			       cmd->cmd_len, cmd->device->host->max_cmd_len));
>  		cmd->result = (DID_ABORT << 16);
> -
> -		scsi_done(cmd);
> -		goto out;
> +		goto done;
>  	}
> 
>  	if (unlikely(host->shost_state == SHOST_DEL)) {
>  		cmd->result = (DID_NO_CONNECT << 16);
> -		scsi_done(cmd);
> -	} else {
> -		trace_scsi_dispatch_cmd_start(cmd);
> -		cmd->scsi_done = scsi_done;
> -		rtn = host->hostt->queuecommand(host, cmd);
> +		goto done;
> +
>  	}
> 
> +	trace_scsi_dispatch_cmd_start(cmd);
> +
> +	cmd->scsi_done = scsi_done;
> +	rtn = host->hostt->queuecommand(host, cmd);
>  	if (rtn) {
>  		trace_scsi_dispatch_cmd_error(cmd, rtn);
>  		if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
> @@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
> 
>  		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
>  			"queuecommand : request rejected\n"));
> -
> -		scsi_queue_insert(cmd, rtn);
>  	}
> 
> - out:
>  	return rtn;
> + done:
> +	scsi_done(cmd);
> +	return 0;
>  }
> 

Related to the position of the trace_scsi_dispatch_cmd_start()
call... this function does:

1. check sdev_state - goto done
2. check scsi_device_blocked() - return
3. put LUN into CDB for ancient SCSI-1 devices
4. scsi_log_send()
5. check cmd_len - goto done
6. check shost_state - goto done
7. trace_scsi_dispatch_cmd_start()
8. queuecommand()
9. return
10. done:
	cmd->scsi_done(cmd)  [PATCH 04/14 upgrades it to this]
	return 0;

It's inconsistent for logging and tracing to occur after a
different number of checks.

In scsi_lib.c, both scsi_done() and scsi_mq_done() always call
trace_scsi_dispatch_cmd_done(), so trace_scsi_dispatch_cmd_start()
should be called before scsi_done() is called.  That way the
trace will always have a submission to match each completion.

That means the trace should be called before the sdev_state check
(which calls scsi_done()).

I don't know about the scsi_device_blocked check (which just 
returns).  Should the trace record multiple submissions with 
one completion?  Maybe trace_scsi_dispatch_cmd_start() and
trace_scsi_dispatch_cmd_done() should both be called?

scsi_log_completion() is called by scsi_softirq_done() and
scsi_times_out() but not by scsi_done() and scsi_mq_done(), so 
scsi_log_send() should not be called unless all the checks 
pass and an IO is really queued.

That would lead to something like this (a rough C sketch follows the list):
1. check sdev_state - goto done
2. check scsi_device_blocked() - return
3. put LUN into CDB for ancient SCSI-1 devices
5. check cmd_len - goto done
6. check shost_state - goto done
7a. scsi_log_send()
7b. trace_scsi_dispatch_cmd_start()
8. queuecommand()
9. return
10. done:
	trace_scsi_dispatch_cmd_start()
	cmd->scsi_done(cmd);
	return 0;
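
Rendered as C (my reading of the proposal, untested; the SCSI-1
LUN-in-CDB fixup and the error-path logging are elided):

int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
{
	struct Scsi_Host *host = cmd->device->host;
	int rtn;

	if (unlikely(cmd->device->sdev_state == SDEV_DEL)) {	/* 1 */
		cmd->result = DID_NO_CONNECT << 16;
		goto done;
	}
	if (unlikely(scsi_device_blocked(cmd->device)))		/* 2 */
		return SCSI_MLQUEUE_DEVICE_BUSY;
	/* 3: LUN-in-CDB fixup for ancient devices, unchanged */
	if (cmd->cmd_len > cmd->device->host->max_cmd_len) {	/* 5 */
		cmd->result = DID_ABORT << 16;
		goto done;
	}
	if (unlikely(host->shost_state == SHOST_DEL)) {		/* 6 */
		cmd->result = DID_NO_CONNECT << 16;
		goto done;
	}

	scsi_log_send(cmd);					/* 7a */
	trace_scsi_dispatch_cmd_start(cmd);			/* 7b */
	rtn = host->hostt->queuecommand(host, cmd);		/* 8 */
	return rtn;						/* 9 */

 done:								/* 10 */
	/* emit the start trace so the done trace stays paired */
	trace_scsi_dispatch_cmd_start(cmd);
	cmd->scsi_done(cmd);
	return 0;
}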

---
Rob Elliott    HP Server Storage




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn
  2014-07-08 20:51   ` Elliott, Robert (Server Storage)
@ 2014-07-09  6:40     ` Christoph Hellwig
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-09  6:40 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: James Bottomley, Jens Axboe, Bart Van Assche, linux-scsi, linux-kernel

On Tue, Jul 08, 2014 at 08:51:30PM +0000, Elliott, Robert (Server Storage) wrote:
> In scsi_lib.c, both scsi_done() and scsi_mq_done() always call
> trace_scsi_dispatch_cmd_done(), so trace_scsi_dispatch_cmd_start()
> should be called before scsi_done() is called.  That way the
> trace will always have a submission to match each completion.
> 
> That means trace should be called before the sdev_state check 
> (which calls scsi_done()).  
>
> I don't know about the scsi_device_blocked check (which just 
> returns).  Should the trace record multiple submissions with 
> one completion?  Maybe both trace_scsi_dispatch_cmd_start() 
> and trace_scsi_dispatch_cmd_done() should both be called?

trace_scsi_dispatch_cmd_start is maybe a little misnamed as it traces
the command submission to the driver.  So getting a done trace without
this one sounds perfectly fine.  Adding another trace for an error
before submission could be done if you care about pairing.  The *_BUSY
returns don't fit this scheme at all.
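
If the pairing does matter, one minimal option (illustrative only, not
something in the posted series) would be to note pre-submission
failures through the existing error tracepoint:

 done:
	/* keep start/done accounting balanced for commands that
	 * are failed before they ever reach the LLD */
	trace_scsi_dispatch_cmd_error(cmd, host_byte(cmd->result));
	cmd->scsi_done(cmd);
	return 0;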

But none of this really belongs in this patch.  Hannes has some plans
to clean up the logging and tracing mess in scsi, and it might be a
good idea to address this there.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess
  2014-06-25 16:51 ` [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess Christoph Hellwig
@ 2014-07-09 11:12   ` Hannes Reinecke
  2014-07-10  6:06     ` Christoph Hellwig
  0 siblings, 1 reply; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:12 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Seems like these counters are missing any sort of synchronization for
> updates, as an over-10-year-old comment from me noted.  Fix this by
> using atomic counters, and while we're at it also make sure they are
> in the same cacheline as the _busy counters and not needlessly stored
> to in every I/O completion.
>
> With the new model the _blocked counters can temporarily go negative,
> so all the readers are updated to check for > 0 values.  Longer
> term every successful I/O completion will reset the counters to zero,
> so the temporarily negative values will not cause any harm.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi.c        |   21 ++++++------
>   drivers/scsi/scsi_lib.c    |   82 +++++++++++++++++++++-----------------------
>   drivers/scsi/scsi_sysfs.c  |   10 +++++-
>   include/scsi/scsi_device.h |    7 ++--
>   include/scsi/scsi_host.h   |    7 ++--
>   5 files changed, 64 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 35a23e2..b362058 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -729,17 +729,16 @@ void scsi_finish_command(struct scsi_cmnd *cmd)
>
>   	scsi_device_unbusy(sdev);
>
> -        /*
> -         * Clear the flags which say that the device/host is no longer
> -         * capable of accepting new commands.  These are set in scsi_queue.c
> -         * for both the queue full condition on a device, and for a
> -         * host full condition on the host.
> -	 *
> -	 * XXX(hch): What about locking?
> -         */
> -        shost->host_blocked = 0;
> -	starget->target_blocked = 0;
> -        sdev->device_blocked = 0;
> +	/*
> +	 * Clear the flags which say that the device/target/host is no longer
> +	 * capable of accepting new commands.
> +	 */
> +	if (atomic_read(&shost->host_blocked))
> +		atomic_set(&shost->host_blocked, 0);
> +	if (atomic_read(&starget->target_blocked))
> +		atomic_set(&starget->target_blocked, 0);
> +	if (atomic_read(&sdev->device_blocked))
> +		atomic_set(&sdev->device_blocked, 0);
>
>   	/*
>   	 * If we have valid sense information, then some kind of recovery
Hmm. I guess there is a race window between
atomic_read() and atomic_set().
Doesn't this cause issues when someone calls atomic_set() just 
before the call to atomic_read()?
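
To make the window concrete, here is a stand-alone user-space sketch
of the same read-then-set pair (C11 atomics standing in for the
kernel's atomic_t; all names are illustrative, this is not the kernel
code itself):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int host_blocked;		/* stand-in for shost->host_blocked */

/* completion path: the read-then-set pair from scsi_finish_command() */
static void *completion_path(void *unused)
{
	if (atomic_load(&host_blocked))		/* atomic_read() */
		atomic_store(&host_blocked, 0);	/* atomic_set()  */
	return NULL;
}

/* requeue path: scsi_set_blocked() re-arming the counter */
static void *requeue_path(void *unused)
{
	atomic_store(&host_blocked, 3);		/* max_host_blocked, say */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	atomic_store(&host_blocked, 1);
	pthread_create(&a, NULL, completion_path, NULL);
	pthread_create(&b, NULL, requeue_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/*
	 * If requeue_path() stores between the load and the store in
	 * completion_path(), the fresh block count is wiped out; if it
	 * stores afterwards, the count survives.  Each step is atomic,
	 * but the read-then-set pair as a whole is not.
	 */
	printf("host_blocked = %d\n", atomic_load(&host_blocked));
	return 0;
}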

> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index e23fef5..a39d5ba 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -99,14 +99,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
>   	 */
>   	switch (reason) {
>   	case SCSI_MLQUEUE_HOST_BUSY:
> -		host->host_blocked = host->max_host_blocked;
> +		atomic_set(&host->host_blocked, host->max_host_blocked);
>   		break;
>   	case SCSI_MLQUEUE_DEVICE_BUSY:
>   	case SCSI_MLQUEUE_EH_RETRY:
> -		device->device_blocked = device->max_device_blocked;
> +		atomic_set(&device->device_blocked,
> +			   device->max_device_blocked);
>   		break;
>   	case SCSI_MLQUEUE_TARGET_BUSY:
> -		starget->target_blocked = starget->max_target_blocked;
> +		atomic_set(&starget->target_blocked,
> +			   starget->max_target_blocked);
>   		break;
>   	}
>   }
> @@ -351,30 +353,39 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>   	spin_unlock_irqrestore(shost->host_lock, flags);
>   }
>
> -static inline int scsi_device_is_busy(struct scsi_device *sdev)
> +static inline bool scsi_device_is_busy(struct scsi_device *sdev)
>   {
>   	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> -		return 1;
> -	if (sdev->device_blocked)
> -		return 1;
> +		return true;
> +	if (atomic_read(&sdev->device_blocked) > 0)
> +		return true;
>   	return 0;
>   }
>
> -static inline int scsi_target_is_busy(struct scsi_target *starget)
> +static inline bool scsi_target_is_busy(struct scsi_target *starget)
>   {
> -	return ((starget->can_queue > 0 &&
> -		 atomic_read(&starget->target_busy) >= starget->can_queue) ||
> -		 starget->target_blocked);
> +	if (starget->can_queue > 0) {
> +		if (atomic_read(&starget->target_busy) >= starget->can_queue)
> +			return true;
> +		if (atomic_read(&starget->target_blocked) > 0)
> +			return true;
> +	}
> +
> +	return false;
>   }
>
> -static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> +static inline bool scsi_host_is_busy(struct Scsi_Host *shost)
>   {
> -	if ((shost->can_queue > 0 &&
> -	     atomic_read(&shost->host_busy) >= shost->can_queue) ||
> -	    shost->host_blocked || shost->host_self_blocked)
> -		return 1;
> +	if (shost->can_queue > 0) {
> +		if (atomic_read(&shost->host_busy) >= shost->can_queue)
> +			return true;
> +		if (atomic_read(&shost->host_blocked) > 0)
> +			return true;
> +		if (shost->host_self_blocked)
> +			return true;
> +	}
>
> -	return 0;
> +	return false;
>   }
>
>   static void scsi_starved_list_run(struct Scsi_Host *shost)
> @@ -1283,11 +1294,8 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>   	unsigned int busy;
>
>   	busy = atomic_inc_return(&sdev->device_busy) - 1;
> -	if (busy == 0 && sdev->device_blocked) {
> -		/*
> -		 * unblock after device_blocked iterates to zero
> -		 */
> -		if (--sdev->device_blocked != 0) {
> +	if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
> +		if (atomic_dec_return(&sdev->device_blocked) > 0) {
>   			blk_delay_queue(q, SCSI_QUEUE_DELAY);
>   			goto out_dec;
>   		}
> @@ -1297,7 +1305,7 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>
>   	if (busy >= sdev->queue_depth)
>   		goto out_dec;
> -	if (sdev->device_blocked)
> +	if (atomic_read(&sdev->device_blocked) > 0)
>   		goto out_dec;
>
>   	return 1;
> @@ -1328,16 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   	}
>
>   	busy = atomic_inc_return(&starget->target_busy) - 1;
> -	if (busy == 0 && starget->target_blocked) {
> -		/*
> -		 * unblock after target_blocked iterates to zero
> -		 */
> -		spin_lock_irq(shost->host_lock);
> -		if (--starget->target_blocked != 0) {
> -			spin_unlock_irq(shost->host_lock);
> +	if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
> +		if (atomic_dec_return(&starget->target_blocked) > 0)
>   			goto out_dec;
> -		}
> -		spin_unlock_irq(shost->host_lock);
>
>   		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
>   				 "unblocking target at zero depth\n"));
> @@ -1345,7 +1346,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>
>   	if (starget->can_queue > 0 && busy >= starget->can_queue)
>   		goto starved;
> -	if (starget->target_blocked)
> +	if (atomic_read(&starget->target_blocked) > 0)
>   		goto starved;
>
>   	return 1;
> @@ -1374,16 +1375,9 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
>   		return 0;
>
>   	busy = atomic_inc_return(&shost->host_busy) - 1;
> -	if (busy == 0 && shost->host_blocked) {
> -		/*
> -		 * unblock after host_blocked iterates to zero
> -		 */
> -		spin_lock_irq(shost->host_lock);
> -		if (--shost->host_blocked != 0) {
> -			spin_unlock_irq(shost->host_lock);
> +	if (busy == 0 && atomic_read(&shost->host_blocked) > 0) {
> +		if (atomic_dec_return(&shost->host_blocked) > 0)
>   			goto out_dec;
> -		}
> -		spin_unlock_irq(shost->host_lock);
>
>   		SCSI_LOG_MLQUEUE(3,
>   			shost_printk(KERN_INFO, shost,
Same with this one.
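
The atomic_dec_return() above is also where the commit message's
"temporarily go negative" note comes from; a small user-space sketch
(again C11 atomics and illustrative names, not the kernel code):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int blocked;		/* stand-in for shost->host_blocked */

/* two submitters racing through the unblock check */
static void *unblock_path(void *unused)
{
	/* atomic_fetch_sub(..., 1) - 1 mirrors atomic_dec_return() */
	int v = atomic_fetch_sub(&blocked, 1) - 1;

	printf("saw %d -> %s\n", v, v > 0 ? "still blocked" : "unblocked");
	return NULL;
}

int main(void)
{
	pthread_t t[2];
	int i;

	atomic_store(&blocked, 1);
	for (i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, unblock_path, NULL);
	for (i = 0; i < 2; i++)
		pthread_join(t[i], NULL);

	/*
	 * Starting from 1, one thread sees 0 and the other -1: the
	 * counter has gone negative, which is why every reader in the
	 * patch tests atomic_read(...) > 0 rather than != 0.
	 */
	printf("final blocked = %d\n", atomic_load(&blocked));
	return 0;
}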

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 01/14] sd: don't use rq->cmd_len before setting it up
  2014-06-25 16:51 ` [PATCH 01/14] sd: don't use rq->cmd_len before setting it up Christoph Hellwig
@ 2014-07-09 11:12     ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:12 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Unlike the old request code, blk-mq doesn't initialize cmd_len with a
> default value, so don't rely on it being set in sd_setup_write_same_cmnd.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/sd.c |    3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 9c86e3d..6ec4ffe 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -824,15 +824,16 @@ static int sd_setup_write_same_cmnd(struct scsi_device *sdp, struct request *rq)
>
>   	rq->__data_len = sdp->sector_size;
>   	rq->timeout = SD_WRITE_SAME_TIMEOUT;
> -	memset(rq->cmd, 0, rq->cmd_len);
>
>   	if (sdkp->ws16 || sector > 0xffffffff || nr_sectors > 0xffff) {
>   		rq->cmd_len = 16;
> +		memset(rq->cmd, 0, rq->cmd_len);
>   		rq->cmd[0] = WRITE_SAME_16;
>   		put_unaligned_be64(sector, &rq->cmd[2]);
>   		put_unaligned_be32(nr_sectors, &rq->cmd[10]);
>   	} else {
>   		rq->cmd_len = 10;
> +		memset(rq->cmd, 0, rq->cmd_len);
>   		rq->cmd[0] = WRITE_SAME;
>   		put_unaligned_be32(sector, &rq->cmd[2]);
>   		put_unaligned_be16(nr_sectors, &rq->cmd[7]);
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 02/14] scsi: split __scsi_queue_insert
  2014-06-25 16:51 ` [PATCH 02/14] scsi: split __scsi_queue_insert Christoph Hellwig
@ 2014-07-09 11:12   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:12 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Factor out a helper to set the _blocked values, which we'll reuse for the
> blk-mq code path.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c |   44 ++++++++++++++++++++++++++------------------
>   1 file changed, 26 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index d5d22e4..2667c75 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -75,28 +75,12 @@ struct kmem_cache *scsi_sdb_cache;
>    */
>   #define SCSI_QUEUE_DELAY	3
>
> -/**
> - * __scsi_queue_insert - private queue insertion
> - * @cmd: The SCSI command being requeued
> - * @reason:  The reason for the requeue
> - * @unbusy: Whether the queue should be unbusied
> - *
> - * This is a private queue insertion.  The public interface
> - * scsi_queue_insert() always assumes the queue should be unbusied
> - * because it's always called before the completion.  This function is
> - * for a requeue after completion, which should only occur in this
> - * file.
> - */
> -static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> +static void
> +scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
>   {
>   	struct Scsi_Host *host = cmd->device->host;
>   	struct scsi_device *device = cmd->device;
>   	struct scsi_target *starget = scsi_target(device);
> -	struct request_queue *q = device->request_queue;
> -	unsigned long flags;
> -
> -	SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
> -		"Inserting command %p into mlqueue\n", cmd));
>
>   	/*
>   	 * Set the appropriate busy bit for the device/host.
> @@ -123,6 +107,30 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
>   		starget->target_blocked = starget->max_target_blocked;
>   		break;
>   	}
> +}
> +
> +/**
> + * __scsi_queue_insert - private queue insertion
> + * @cmd: The SCSI command being requeued
> + * @reason:  The reason for the requeue
> + * @unbusy: Whether the queue should be unbusied
> + *
> + * This is a private queue insertion.  The public interface
> + * scsi_queue_insert() always assumes the queue should be unbusied
> + * because it's always called before the completion.  This function is
> + * for a requeue after completion, which should only occur in this
> + * file.
> + */
> +static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> +{
> +	struct scsi_device *device = cmd->device;
> +	struct request_queue *q = device->request_queue;
> +	unsigned long flags;
> +
> +	SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
> +		"Inserting command %p into mlqueue\n", cmd));
> +
> +	scsi_set_blocked(cmd, reason);
>
>   	/*
>   	 * Decrement the counters, since these commands are no longer
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn
  2014-06-25 16:51 ` [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn Christoph Hellwig
  2014-07-08 20:51   ` Elliott, Robert (Server Storage)
@ 2014-07-09 11:13   ` Hannes Reinecke
  1 sibling, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:13 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Make sure we only have the logic for requeueing commands in one place.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi.c     |   35 ++++++++++++-----------------------
>   drivers/scsi/scsi_lib.c |    9 ++++++---
>   2 files changed, 18 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index ce5b4e5..dcc43fd 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -648,9 +648,7 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>   		 * returns an immediate error upwards, and signals
>   		 * that the device is no longer present */
>   		cmd->result = DID_NO_CONNECT << 16;
> -		scsi_done(cmd);
> -		/* return 0 (because the command has been processed) */
> -		goto out;
> +		goto done;
>   	}
>
>   	/* Check to see if the scsi lld made this device blocked. */
> @@ -662,17 +660,9 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>   		 * occur until the device transitions out of the
>   		 * suspend state.
>   		 */
> -
> -		scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
> -
>   		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
>   			"queuecommand : device blocked\n"));
> -
> -		/*
> -		 * NOTE: rtn is still zero here because we don't need the
> -		 * queue to be plugged on return (it's already stopped)
> -		 */
> -		goto out;
> +		return SCSI_MLQUEUE_DEVICE_BUSY;
>   	}
>
>   	/*
> @@ -696,20 +686,19 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>   			       "cdb_size=%d host->max_cmd_len=%d\n",
>   			       cmd->cmd_len, cmd->device->host->max_cmd_len));
>   		cmd->result = (DID_ABORT << 16);
> -
> -		scsi_done(cmd);
> -		goto out;
> +		goto done;
>   	}
>
>   	if (unlikely(host->shost_state == SHOST_DEL)) {
>   		cmd->result = (DID_NO_CONNECT << 16);
> -		scsi_done(cmd);
> -	} else {
> -		trace_scsi_dispatch_cmd_start(cmd);
> -		cmd->scsi_done = scsi_done;
> -		rtn = host->hostt->queuecommand(host, cmd);
> +		goto done;
> +
>   	}
>
> +	trace_scsi_dispatch_cmd_start(cmd);
> +
> +	cmd->scsi_done = scsi_done;
> +	rtn = host->hostt->queuecommand(host, cmd);
>   	if (rtn) {
>   		trace_scsi_dispatch_cmd_error(cmd, rtn);
>   		if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
> @@ -718,12 +707,12 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>
>   		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
>   			"queuecommand : request rejected\n"));
> -
> -		scsi_queue_insert(cmd, rtn);
>   	}
>
> - out:
>   	return rtn;
> + done:
> +	scsi_done(cmd);
> +	return 0;
>   }
>
>   /**
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 2667c75..63bf844 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1583,9 +1583,12 @@ static void scsi_request_fn(struct request_queue *q)
>   		 * Dispatch the command to the low-level driver.
>   		 */
>   		rtn = scsi_dispatch_cmd(cmd);
> -		spin_lock_irq(q->queue_lock);
> -		if (rtn)
> +		if (rtn) {
> +			scsi_queue_insert(cmd, rtn);
> +			spin_lock_irq(q->queue_lock);
>   			goto out_delay;
> +		}
> +		spin_lock_irq(q->queue_lock);
>   	}
>
>   	return;
> @@ -1605,7 +1608,7 @@ static void scsi_request_fn(struct request_queue *q)
>   	blk_requeue_request(q, req);
>   	sdev->device_busy--;
>   out_delay:
> -	if (sdev->device_busy == 0)
> +	if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
>   		blk_delay_queue(q, SCSI_QUEUE_DELAY);
>   }
>
>

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd
  2014-06-25 16:51 ` [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd Christoph Hellwig
@ 2014-07-09 11:14   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:14 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> The blk-mq code path will set this to a different function, so make the
> code simpler by setting it up in a legacy-request-specific place.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi.c     |   23 +----------------------
>   drivers/scsi/scsi_lib.c |   20 ++++++++++++++++++++
>   2 files changed, 21 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index dcc43fd..d3bd6cf 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -72,8 +72,6 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/scsi.h>
>
> -static void scsi_done(struct scsi_cmnd *cmd);
> -
>   /*
>    * Definitions and constants.
>    */
> @@ -696,8 +694,6 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>   	}
>
>   	trace_scsi_dispatch_cmd_start(cmd);
> -
> -	cmd->scsi_done = scsi_done;
>   	rtn = host->hostt->queuecommand(host, cmd);
>   	if (rtn) {
>   		trace_scsi_dispatch_cmd_error(cmd, rtn);
> @@ -711,28 +707,11 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>
>   	return rtn;
>    done:
> -	scsi_done(cmd);
> +	cmd->scsi_done(cmd);
>   	return 0;
>   }
>
>   /**
> - * scsi_done - Invoke completion on finished SCSI command.
> - * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
> - * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
> - *
> - * Description: This function is the mid-level's (SCSI Core) interrupt routine,
> - * which regains ownership of the SCSI command (de facto) from a LLDD, and
> - * calls blk_complete_request() for further processing.
> - *
> - * This function is interrupt context safe.
> - */
> -static void scsi_done(struct scsi_cmnd *cmd)
> -{
> -	trace_scsi_dispatch_cmd_done(cmd);
> -	blk_complete_request(cmd->request);
> -}
> -
> -/**
>    * scsi_finish_command - cleanup and pass command back to upper layer
>    * @cmd: the command
>    *
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 63bf844..6989b6f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -29,6 +29,8 @@
>   #include <scsi/scsi_eh.h>
>   #include <scsi/scsi_host.h>
>
> +#include <trace/events/scsi.h>
> +
>   #include "scsi_priv.h"
>   #include "scsi_logging.h"
>
> @@ -1480,6 +1482,23 @@ static void scsi_softirq_done(struct request *rq)
>   	}
>   }
>
> +/**
> + * scsi_done - Invoke completion on finished SCSI command.
> + * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
> + * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
> + *
> + * Description: This function is the mid-level's (SCSI Core) interrupt routine,
> + * which regains ownership of the SCSI command (de facto) from a LLDD, and
> + * calls blk_complete_request() for further processing.
> + *
> + * This function is interrupt context safe.
> + */
> +static void scsi_done(struct scsi_cmnd *cmd)
> +{
> +	trace_scsi_dispatch_cmd_done(cmd);
> +	blk_complete_request(cmd->request);
> +}
> +
>   /*
>    * Function:    scsi_request_fn()
>    *
> @@ -1582,6 +1601,7 @@ static void scsi_request_fn(struct request_queue *q)
>   		/*
>   		 * Dispatch the command to the low-level driver.
>   		 */
> +		cmd->scsi_done = scsi_done;
>   		rtn = scsi_dispatch_cmd(cmd);
>   		if (rtn) {
>   			scsi_queue_insert(cmd, rtn);
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready
  2014-06-25 16:51 ` [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready Christoph Hellwig
@ 2014-07-09 11:14   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:14 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Prepare for not taking a host-wide lock in the dispatch path by pushing
> the lock down into the places that actually need it.  Note that this
> patch is just a preparation step, as it will actually increase lock
> roundtrips and thus decrease performance on its own.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c |   75 ++++++++++++++++++++++++-----------------------
>   1 file changed, 39 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 6989b6f..18e6449 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1300,18 +1300,18 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>   /*
>    * scsi_target_queue_ready: checks if there we can send commands to target
>    * @sdev: scsi device on starget to check.
> - *
> - * Called with the host lock held.
>    */
>   static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   					   struct scsi_device *sdev)
>   {
>   	struct scsi_target *starget = scsi_target(sdev);
> +	int ret = 0;
>
> +	spin_lock_irq(shost->host_lock);
>   	if (starget->single_lun) {
>   		if (starget->starget_sdev_user &&
>   		    starget->starget_sdev_user != sdev)
> -			return 0;
> +			goto out;
>   		starget->starget_sdev_user = sdev;
>   	}
>
> @@ -1319,57 +1319,66 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   		/*
>   		 * unblock after target_blocked iterates to zero
>   		 */
> -		if (--starget->target_blocked == 0) {
> -			SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
> -					 "unblocking target at zero depth\n"));
> -		} else
> -			return 0;
> +		if (--starget->target_blocked != 0)
> +			goto out;
> +
> +		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
> +				 "unblocking target at zero depth\n"));
>   	}
>
>   	if (scsi_target_is_busy(starget)) {
>   		list_move_tail(&sdev->starved_entry, &shost->starved_list);
> -		return 0;
> +		goto out;
>   	}
>
> -	return 1;
> +	scsi_target(sdev)->target_busy++;
> +	ret = 1;
> +out:
> +	spin_unlock_irq(shost->host_lock);
> +	return ret;
>   }
>
>   /*
>    * scsi_host_queue_ready: if we can send requests to shost, return 1 else
>    * return 0. We must end up running the queue again whenever 0 is
>    * returned, else IO can hang.
> - *
> - * Called with host_lock held.
>    */
>   static inline int scsi_host_queue_ready(struct request_queue *q,
>   				   struct Scsi_Host *shost,
>   				   struct scsi_device *sdev)
>   {
> +	int ret = 0;
> +
> +	spin_lock_irq(shost->host_lock);
> +
>   	if (scsi_host_in_recovery(shost))
> -		return 0;
> +		goto out;
>   	if (shost->host_busy == 0 && shost->host_blocked) {
>   		/*
>   		 * unblock after host_blocked iterates to zero
>   		 */
> -		if (--shost->host_blocked == 0) {
> -			SCSI_LOG_MLQUEUE(3,
> -				shost_printk(KERN_INFO, shost,
> -					     "unblocking host at zero depth\n"));
> -		} else {
> -			return 0;
> -		}
> +		if (--shost->host_blocked != 0)
> +			goto out;
> +
> +		SCSI_LOG_MLQUEUE(3,
> +			shost_printk(KERN_INFO, shost,
> +				     "unblocking host at zero depth\n"));
>   	}
>   	if (scsi_host_is_busy(shost)) {
>   		if (list_empty(&sdev->starved_entry))
>   			list_add_tail(&sdev->starved_entry, &shost->starved_list);
> -		return 0;
> +		goto out;
>   	}
>
>   	/* We're OK to process the command, so we can't be starved */
>   	if (!list_empty(&sdev->starved_entry))
>   		list_del_init(&sdev->starved_entry);
>
> -	return 1;
> +	shost->host_busy++;
> +	ret = 1;
> +out:
> +	spin_unlock_irq(shost->host_lock);
> +	return ret;
>   }
>
>   /*
> @@ -1550,7 +1559,7 @@ static void scsi_request_fn(struct request_queue *q)
>   			blk_start_request(req);
>   		sdev->device_busy++;
>
> -		spin_unlock(q->queue_lock);
> +		spin_unlock_irq(q->queue_lock);
>   		cmd = req->special;
>   		if (unlikely(cmd == NULL)) {
>   			printk(KERN_CRIT "impossible request in %s.\n"
> @@ -1560,7 +1569,6 @@ static void scsi_request_fn(struct request_queue *q)
>   			blk_dump_rq_flags(req, "foo");
>   			BUG();
>   		}
> -		spin_lock(shost->host_lock);
>
>   		/*
>   		 * We hit this when the driver is using a host wide
> @@ -1571,9 +1579,11 @@ static void scsi_request_fn(struct request_queue *q)
>   		 * a run when a tag is freed.
>   		 */
>   		if (blk_queue_tagged(q) && !blk_rq_tagged(req)) {
> +			spin_lock_irq(shost->host_lock);
>   			if (list_empty(&sdev->starved_entry))
>   				list_add_tail(&sdev->starved_entry,
>   					      &shost->starved_list);
> +			spin_unlock_irq(shost->host_lock);
>   			goto not_ready;
>   		}
>
> @@ -1581,16 +1591,7 @@ static void scsi_request_fn(struct request_queue *q)
>   			goto not_ready;
>
>   		if (!scsi_host_queue_ready(q, shost, sdev))
> -			goto not_ready;
> -
> -		scsi_target(sdev)->target_busy++;
> -		shost->host_busy++;
> -
> -		/*
> -		 * XXX(hch): This is rather suboptimal, scsi_dispatch_cmd will
> -		 *		take the lock again.
> -		 */
> -		spin_unlock_irq(shost->host_lock);
> +			goto host_not_ready;
>
>   		/*
>   		 * Finally, initialize any error handling parameters, and set up
> @@ -1613,9 +1614,11 @@ static void scsi_request_fn(struct request_queue *q)
>
>   	return;
>
> - not_ready:
> + host_not_ready:
> +	spin_lock_irq(shost->host_lock);
> +	scsi_target(sdev)->target_busy--;
>   	spin_unlock_irq(shost->host_lock);
> -
> + not_ready:
>   	/*
>   	 * lock q, handle tag, requeue req, and decrement device_busy. We
>   	 * must return with queue_lock held.
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 06/14] scsi: convert target_busy to an atomic_t
  2014-06-25 16:51 ` [PATCH 06/14] scsi: convert target_busy to an atomic_t Christoph Hellwig
@ 2014-07-09 11:15     ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:15 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the host-wide host_lock to check the per-target queue limit.
> Instead we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c    |   52 ++++++++++++++++++++++++++------------------
>   include/scsi/scsi_device.h |    4 ++--
>   2 files changed, 33 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 18e6449..5e269d6 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -294,7 +294,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>
>   	spin_lock_irqsave(shost->host_lock, flags);
>   	shost->host_busy--;
> -	starget->target_busy--;
> +	atomic_dec(&starget->target_busy);
>   	if (unlikely(scsi_host_in_recovery(shost) &&
>   		     (shost->host_failed || shost->host_eh_scheduled)))
>   		scsi_eh_wakeup(shost);
> @@ -361,7 +361,7 @@ static inline int scsi_device_is_busy(struct scsi_device *sdev)
>   static inline int scsi_target_is_busy(struct scsi_target *starget)
>   {
>   	return ((starget->can_queue > 0 &&
> -		 starget->target_busy >= starget->can_queue) ||
> +		 atomic_read(&starget->target_busy) >= starget->can_queue) ||
>   		 starget->target_blocked);
>   }
>
> @@ -1305,37 +1305,49 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   					   struct scsi_device *sdev)
>   {
>   	struct scsi_target *starget = scsi_target(sdev);
> -	int ret = 0;
> +	unsigned int busy;
>
> -	spin_lock_irq(shost->host_lock);
>   	if (starget->single_lun) {
> +		spin_lock_irq(shost->host_lock);
>   		if (starget->starget_sdev_user &&
> -		    starget->starget_sdev_user != sdev)
> -			goto out;
> +		    starget->starget_sdev_user != sdev) {
> +			spin_unlock_irq(shost->host_lock);
> +			return 0;
> +		}
>   		starget->starget_sdev_user = sdev;
> +		spin_unlock_irq(shost->host_lock);
>   	}
>
> -	if (starget->target_busy == 0 && starget->target_blocked) {
> +	busy = atomic_inc_return(&starget->target_busy) - 1;
> +	if (busy == 0 && starget->target_blocked) {
>   		/*
>   		 * unblock after target_blocked iterates to zero
>   		 */
> -		if (--starget->target_blocked != 0)
> -			goto out;
> +		spin_lock_irq(shost->host_lock);
> +		if (--starget->target_blocked != 0) {
> +			spin_unlock_irq(shost->host_lock);
> +			goto out_dec;
> +		}
> +		spin_unlock_irq(shost->host_lock);
>
>   		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
>   				 "unblocking target at zero depth\n"));
>   	}
>
> -	if (scsi_target_is_busy(starget)) {
> -		list_move_tail(&sdev->starved_entry, &shost->starved_list);
> -		goto out;
> -	}
> +	if (starget->can_queue > 0 && busy >= starget->can_queue)
> +		goto starved;
> +	if (starget->target_blocked)
> +		goto starved;
>
> -	scsi_target(sdev)->target_busy++;
> -	ret = 1;
> -out:
> +	return 1;
> +
> +starved:
> +	spin_lock_irq(shost->host_lock);
> +	list_move_tail(&sdev->starved_entry, &shost->starved_list);
>   	spin_unlock_irq(shost->host_lock);
> -	return ret;
> +out_dec:
> +	atomic_dec(&starget->target_busy);
> +	return 0;
>   }
>
>   /*
> @@ -1445,7 +1457,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
>   	spin_unlock(sdev->request_queue->queue_lock);
>   	spin_lock(shost->host_lock);
>   	shost->host_busy++;
> -	starget->target_busy++;
> +	atomic_inc(&starget->target_busy);
>   	spin_unlock(shost->host_lock);
>   	spin_lock(sdev->request_queue->queue_lock);
>
> @@ -1615,9 +1627,7 @@ static void scsi_request_fn(struct request_queue *q)
>   	return;
>
>    host_not_ready:
> -	spin_lock_irq(shost->host_lock);
> -	scsi_target(sdev)->target_busy--;
> -	spin_unlock_irq(shost->host_lock);
> +	atomic_dec(&scsi_target(sdev)->target_busy);
>    not_ready:
>   	/*
>   	 * lock q, handle tag, requeue req, and decrement device_busy. We
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 816e8a2..446f741 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -290,8 +290,8 @@ struct scsi_target {
>   	unsigned int		expecting_lun_change:1;	/* A device has reported
>   						 * a 3F/0E UA, other devices on
>   						 * the same target will also. */
> -	/* commands actually active on LLD. protected by host lock. */
> -	unsigned int		target_busy;
> +	/* commands actually active on LLD. */
> +	atomic_t		target_busy;
>   	/*
>   	 * LLDs should set this in the slave_alloc host template callout.
>   	 * If set to zero then there is not limit.
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 06/14] scsi: convert target_busy to an atomic_t
@ 2014-07-09 11:15     ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:15 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the host-wide host_lock to check the per-target queue limit.
> Instead we do an atomic_inc_return early on to grab our slot in the queue,
> and if nessecary decrement it after finishing all checks.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c    |   52 ++++++++++++++++++++++++++------------------
>   include/scsi/scsi_device.h |    4 ++--
>   2 files changed, 33 insertions(+), 23 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 18e6449..5e269d6 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -294,7 +294,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>
>   	spin_lock_irqsave(shost->host_lock, flags);
>   	shost->host_busy--;
> -	starget->target_busy--;
> +	atomic_dec(&starget->target_busy);
>   	if (unlikely(scsi_host_in_recovery(shost) &&
>   		     (shost->host_failed || shost->host_eh_scheduled)))
>   		scsi_eh_wakeup(shost);
> @@ -361,7 +361,7 @@ static inline int scsi_device_is_busy(struct scsi_device *sdev)
>   static inline int scsi_target_is_busy(struct scsi_target *starget)
>   {
>   	return ((starget->can_queue > 0 &&
> -		 starget->target_busy >= starget->can_queue) ||
> +		 atomic_read(&starget->target_busy) >= starget->can_queue) ||
>   		 starget->target_blocked);
>   }
>
> @@ -1305,37 +1305,49 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   					   struct scsi_device *sdev)
>   {
>   	struct scsi_target *starget = scsi_target(sdev);
> -	int ret = 0;
> +	unsigned int busy;
>
> -	spin_lock_irq(shost->host_lock);
>   	if (starget->single_lun) {
> +		spin_lock_irq(shost->host_lock);
>   		if (starget->starget_sdev_user &&
> -		    starget->starget_sdev_user != sdev)
> -			goto out;
> +		    starget->starget_sdev_user != sdev) {
> +			spin_unlock_irq(shost->host_lock);
> +			return 0;
> +		}
>   		starget->starget_sdev_user = sdev;
> +		spin_unlock_irq(shost->host_lock);
>   	}
>
> -	if (starget->target_busy == 0 && starget->target_blocked) {
> +	busy = atomic_inc_return(&starget->target_busy) - 1;
> +	if (busy == 0 && starget->target_blocked) {
>   		/*
>   		 * unblock after target_blocked iterates to zero
>   		 */
> -		if (--starget->target_blocked != 0)
> -			goto out;
> +		spin_lock_irq(shost->host_lock);
> +		if (--starget->target_blocked != 0) {
> +			spin_unlock_irq(shost->host_lock);
> +			goto out_dec;
> +		}
> +		spin_unlock_irq(shost->host_lock);
>
>   		SCSI_LOG_MLQUEUE(3, starget_printk(KERN_INFO, starget,
>   				 "unblocking target at zero depth\n"));
>   	}
>
> -	if (scsi_target_is_busy(starget)) {
> -		list_move_tail(&sdev->starved_entry, &shost->starved_list);
> -		goto out;
> -	}
> +	if (starget->can_queue > 0 && busy >= starget->can_queue)
> +		goto starved;
> +	if (starget->target_blocked)
> +		goto starved;
>
> -	scsi_target(sdev)->target_busy++;
> -	ret = 1;
> -out:
> +	return 1;
> +
> +starved:
> +	spin_lock_irq(shost->host_lock);
> +	list_move_tail(&sdev->starved_entry, &shost->starved_list);
>   	spin_unlock_irq(shost->host_lock);
> -	return ret;
> +out_dec:
> +	atomic_dec(&starget->target_busy);
> +	return 0;
>   }
>
>   /*
> @@ -1445,7 +1457,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
>   	spin_unlock(sdev->request_queue->queue_lock);
>   	spin_lock(shost->host_lock);
>   	shost->host_busy++;
> -	starget->target_busy++;
> +	atomic_inc(&starget->target_busy);
>   	spin_unlock(shost->host_lock);
>   	spin_lock(sdev->request_queue->queue_lock);
>
> @@ -1615,9 +1627,7 @@ static void scsi_request_fn(struct request_queue *q)
>   	return;
>
>    host_not_ready:
> -	spin_lock_irq(shost->host_lock);
> -	scsi_target(sdev)->target_busy--;
> -	spin_unlock_irq(shost->host_lock);
> +	atomic_dec(&scsi_target(sdev)->target_busy);
>    not_ready:
>   	/*
>   	 * lock q, handle tag, requeue req, and decrement device_busy. We
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 816e8a2..446f741 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -290,8 +290,8 @@ struct scsi_target {
>   	unsigned int		expecting_lun_change:1;	/* A device has reported
>   						 * a 3F/0E UA, other devices on
>   						 * the same target will also. */
> -	/* commands actually active on LLD. protected by host lock. */
> -	unsigned int		target_busy;
> +	/* commands actually active on LLD. */
> +	atomic_t		target_busy;
>   	/*
>   	 * LLDs should set this in the slave_alloc host template callout.
>   	 * If set to zero then there is no limit.
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread
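
A minimal userspace sketch of the slot-grab pattern used above: claim a
queue position with an atomic increment first, and give it back if a
later check fails, so the fast path needs no lock.  C11 atomics stand in
for the kernel's atomic_t; all names and limits here are invented for
illustration:

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct target {
		atomic_uint	busy;		/* commands in flight */
		unsigned int	can_queue;	/* 0 means no per-target limit */
	};

	static bool target_queue_ready(struct target *t)
	{
		/* fetch_add returns the old value, i.e. our queue position */
		unsigned int busy = atomic_fetch_add(&t->busy, 1);

		if (t->can_queue > 0 && busy >= t->can_queue) {
			/* over the limit: give the slot back, retry later */
			atomic_fetch_sub(&t->busy, 1);
			return false;
		}
		return true;
	}

	int main(void)
	{
		struct target t = { .can_queue = 2 };

		for (int i = 0; i < 4; i++)
			printf("cmd %d: %s\n", i,
			       target_queue_ready(&t) ? "queued" : "rejected");
		return 0;
	}

The decrement on the failure path plays the role of the out_dec label in
the patch.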

* Re: [PATCH 07/14] scsi: convert host_busy to atomic_t
  2014-06-25 16:51 ` [PATCH 07/14] scsi: convert host_busy to atomic_t Christoph Hellwig
@ 2014-07-09 11:15   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:15 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the host-wide host_lock to check the per-host queue limit.
> Instead we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/advansys.c             |    4 +-
>   drivers/scsi/libiscsi.c             |    4 +-
>   drivers/scsi/libsas/sas_scsi_host.c |    5 ++-
>   drivers/scsi/qlogicpti.c            |    2 +-
>   drivers/scsi/scsi.c                 |    2 +-
>   drivers/scsi/scsi_error.c           |    7 ++--
>   drivers/scsi/scsi_lib.c             |   71 +++++++++++++++++++++--------------
>   drivers/scsi/scsi_sysfs.c           |    9 ++++-
>   include/scsi/scsi_host.h            |   10 ++---
>   9 files changed, 66 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
> index e716d0a..43761c1 100644
> --- a/drivers/scsi/advansys.c
> +++ b/drivers/scsi/advansys.c
> @@ -2512,7 +2512,7 @@ static void asc_prt_scsi_host(struct Scsi_Host *s)
>
>   	printk("Scsi_Host at addr 0x%p, device %s\n", s, dev_name(boardp->dev));
>   	printk(" host_busy %u, host_no %d,\n",
> -	       s->host_busy, s->host_no);
> +	       atomic_read(&s->host_busy), s->host_no);
>
>   	printk(" base 0x%lx, io_port 0x%lx, irq %d,\n",
>   	       (ulong)s->base, (ulong)s->io_port, boardp->irq);
> @@ -3346,7 +3346,7 @@ static void asc_prt_driver_conf(struct seq_file *m, struct Scsi_Host *shost)
>
>   	seq_printf(m,
>   		   " host_busy %u, max_id %u, max_lun %llu, max_channel %u\n",
> -		   shost->host_busy, shost->max_id,
> +		   atomic_read(&shost->host_busy), shost->max_id,
>   		   shost->max_lun, shost->max_channel);
>
>   	seq_printf(m,
> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> index f2db82b..f9f3a12 100644
> --- a/drivers/scsi/libiscsi.c
> +++ b/drivers/scsi/libiscsi.c
> @@ -2971,7 +2971,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
>   	 */
>   	for (;;) {
>   		spin_lock_irqsave(session->host->host_lock, flags);
> -		if (!session->host->host_busy) { /* OK for ERL == 0 */
> +		if (!atomic_read(&session->host->host_busy)) { /* OK for ERL == 0 */
>   			spin_unlock_irqrestore(session->host->host_lock, flags);
>   			break;
>   		}
> @@ -2979,7 +2979,7 @@ void iscsi_conn_teardown(struct iscsi_cls_conn *cls_conn)
>   		msleep_interruptible(500);
>   		iscsi_conn_printk(KERN_INFO, conn, "iscsi conn_destroy(): "
>   				  "host_busy %d host_failed %d\n",
> -				  session->host->host_busy,
> +				  atomic_read(&session->host->host_busy),
>   				  session->host->host_failed);
>   		/*
>   		 * force eh_abort() to unblock
> diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
> index 7d02a19..24e477d 100644
> --- a/drivers/scsi/libsas/sas_scsi_host.c
> +++ b/drivers/scsi/libsas/sas_scsi_host.c
> @@ -813,7 +813,7 @@ retry:
>   	spin_unlock_irq(shost->host_lock);
>
>   	SAS_DPRINTK("Enter %s busy: %d failed: %d\n",
> -		    __func__, shost->host_busy, shost->host_failed);
> +		    __func__, atomic_read(&shost->host_busy), shost->host_failed);
>   	/*
>   	 * Deal with commands that still have SAS tasks (i.e. they didn't
>   	 * complete via the normal sas_task completion mechanism),
> @@ -858,7 +858,8 @@ out:
>   		goto retry;
>
>   	SAS_DPRINTK("--- Exit %s: busy: %d failed: %d tries: %d\n",
> -		    __func__, shost->host_busy, shost->host_failed, tries);
> +		    __func__, atomic_read(&shost->host_busy),
> +		    shost->host_failed, tries);
>   }
>
>   enum blk_eh_timer_return sas_scsi_timed_out(struct scsi_cmnd *cmd)
> diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
> index 6d48d30..740ae49 100644
> --- a/drivers/scsi/qlogicpti.c
> +++ b/drivers/scsi/qlogicpti.c
> @@ -959,7 +959,7 @@ static inline void update_can_queue(struct Scsi_Host *host, u_int in_ptr, u_int
>   	/* Temporary workaround until bug is found and fixed (one bug has been found
>   	   already, but fixing it makes things even worse) -jj */
>   	int num_free = QLOGICPTI_REQ_QUEUE_LEN - REQ_QUEUE_DEPTH(in_ptr, out_ptr) - 64;
> -	host->can_queue = host->host_busy + num_free;
> +	host->can_queue = atomic_read(&host->host_busy) + num_free;
>   	host->sg_tablesize = QLOGICPTI_MAX_SG(num_free);
>   }
>
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index d3bd6cf..35a23e2 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -603,7 +603,7 @@ void scsi_log_completion(struct scsi_cmnd *cmd, int disposition)
>   			if (level > 3)
>   				scmd_printk(KERN_INFO, cmd,
>   					    "scsi host busy %d failed %d\n",
> -					    cmd->device->host->host_busy,
> +					    atomic_read(&cmd->device->host->host_busy),
>   					    cmd->device->host->host_failed);
>   		}
>   	}
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index e4a5324..5db8454 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -59,7 +59,7 @@ static int scsi_try_to_abort_cmd(struct scsi_host_template *,
>   /* called with shost->host_lock held */
>   void scsi_eh_wakeup(struct Scsi_Host *shost)
>   {
> -	if (shost->host_busy == shost->host_failed) {
> +	if (atomic_read(&shost->host_busy) == shost->host_failed) {
>   		trace_scsi_eh_wakeup(shost);
>   		wake_up_process(shost->ehandler);
>   		SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
> @@ -2164,7 +2164,7 @@ int scsi_error_handler(void *data)
>   	while (!kthread_should_stop()) {
>   		set_current_state(TASK_INTERRUPTIBLE);
>   		if ((shost->host_failed == 0 && shost->host_eh_scheduled == 0) ||
> -		    shost->host_failed != shost->host_busy) {
> +		    shost->host_failed != atomic_read(&shost->host_busy)) {
>   			SCSI_LOG_ERROR_RECOVERY(1,
>   				shost_printk(KERN_INFO, shost,
>   					     "scsi_eh_%d: sleeping\n",
> @@ -2178,7 +2178,8 @@ int scsi_error_handler(void *data)
>   			shost_printk(KERN_INFO, shost,
>   				     "scsi_eh_%d: waking up %d/%d/%d\n",
>   				     shost->host_no, shost->host_eh_scheduled,
> -				     shost->host_failed, shost->host_busy));
> +				     shost->host_failed,
> +				     atomic_read(&shost->host_busy)));
>
>   		/*
>   		 * We have a host that is failing for some reason.  Figure out
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5e269d6..5d37d79 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -292,14 +292,17 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>   	struct scsi_target *starget = scsi_target(sdev);
>   	unsigned long flags;
>
> -	spin_lock_irqsave(shost->host_lock, flags);
> -	shost->host_busy--;
> +	atomic_dec(&shost->host_busy);
>   	atomic_dec(&starget->target_busy);
> +
>   	if (unlikely(scsi_host_in_recovery(shost) &&
> -		     (shost->host_failed || shost->host_eh_scheduled)))
> +		     (shost->host_failed || shost->host_eh_scheduled))) {
> +		spin_lock_irqsave(shost->host_lock, flags);
>   		scsi_eh_wakeup(shost);
> -	spin_unlock(shost->host_lock);
> -	spin_lock(sdev->request_queue->queue_lock);
> +		spin_unlock_irqrestore(shost->host_lock, flags);
> +	}
> +
> +	spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
>   	sdev->device_busy--;
>   	spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
>   }
> @@ -367,7 +370,8 @@ static inline int scsi_target_is_busy(struct scsi_target *starget)
>
>   static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>   {
> -	if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
> +	if ((shost->can_queue > 0 &&
> +	     atomic_read(&shost->host_busy) >= shost->can_queue) ||
>   	    shost->host_blocked || shost->host_self_blocked)
>   		return 1;
>
> @@ -1359,38 +1363,51 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
>   				   struct Scsi_Host *shost,
>   				   struct scsi_device *sdev)
>   {
> -	int ret = 0;
> -
> -	spin_lock_irq(shost->host_lock);
> +	unsigned int busy;
>
>   	if (scsi_host_in_recovery(shost))
> -		goto out;
> -	if (shost->host_busy == 0 && shost->host_blocked) {
> +		return 0;
> +
> +	busy = atomic_inc_return(&shost->host_busy) - 1;
> +	if (busy == 0 && shost->host_blocked) {
>   		/*
>   		 * unblock after host_blocked iterates to zero
>   		 */
> -		if (--shost->host_blocked != 0)
> -			goto out;
> +		spin_lock_irq(shost->host_lock);
> +		if (--shost->host_blocked != 0) {
> +			spin_unlock_irq(shost->host_lock);
> +			goto out_dec;
> +		}
> +		spin_unlock_irq(shost->host_lock);
>
>   		SCSI_LOG_MLQUEUE(3,
>   			shost_printk(KERN_INFO, shost,
>   				     "unblocking host at zero depth\n"));
>   	}
> -	if (scsi_host_is_busy(shost)) {
> -		if (list_empty(&sdev->starved_entry))
> -			list_add_tail(&sdev->starved_entry, &shost->starved_list);
> -		goto out;
> -	}
> +
> +	if (shost->can_queue > 0 && busy >= shost->can_queue)
> +		goto starved;
> +	if (shost->host_blocked || shost->host_self_blocked)
> +		goto starved;
>
>   	/* We're OK to process the command, so we can't be starved */
> -	if (!list_empty(&sdev->starved_entry))
> -		list_del_init(&sdev->starved_entry);
> +	if (!list_empty(&sdev->starved_entry)) {
> +		spin_lock_irq(shost->host_lock);
> +		if (!list_empty(&sdev->starved_entry))
> +			list_del_init(&sdev->starved_entry);
> +		spin_unlock_irq(shost->host_lock);
> +	}
>
> -	shost->host_busy++;
> -	ret = 1;
> -out:
> +	return 1;
> +
> +starved:
> +	spin_lock_irq(shost->host_lock);
> +	if (list_empty(&sdev->starved_entry))
> +		list_add_tail(&sdev->starved_entry, &shost->starved_list);
>   	spin_unlock_irq(shost->host_lock);
> -	return ret;
> +out_dec:
> +	atomic_dec(&shost->host_busy);
> +	return 0;
>   }
>
>   /*
> @@ -1454,12 +1471,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
>   	 * with the locks as normal issue path does.
>   	 */
>   	sdev->device_busy++;
> -	spin_unlock(sdev->request_queue->queue_lock);
> -	spin_lock(shost->host_lock);
> -	shost->host_busy++;
> +	atomic_inc(&shost->host_busy);
>   	atomic_inc(&starget->target_busy);
> -	spin_unlock(shost->host_lock);
> -	spin_lock(sdev->request_queue->queue_lock);
>
>   	blk_complete_request(req);
>   }
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 5f36788..7ec5e06 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -334,7 +334,6 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
>   static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
>
>   shost_rd_attr(unique_id, "%u\n");
> -shost_rd_attr(host_busy, "%hu\n");
>   shost_rd_attr(cmd_per_lun, "%hd\n");
>   shost_rd_attr(can_queue, "%hd\n");
>   shost_rd_attr(sg_tablesize, "%hu\n");
> @@ -344,6 +343,14 @@ shost_rd_attr(prot_capabilities, "%u\n");
>   shost_rd_attr(prot_guard_type, "%hd\n");
>   shost_rd_attr2(proc_name, hostt->proc_name, "%s\n");
>
> +static ssize_t
> +show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct Scsi_Host *shost = class_to_shost(dev);
> +	return snprintf(buf, 20, "%hu\n", atomic_read(&shost->host_busy));
> +}
> +static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
> +
>   static struct attribute *scsi_sysfs_shost_attrs[] = {
>   	&dev_attr_unique_id.attr,
>   	&dev_attr_host_busy.attr,
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index abb6958..3d124f7 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -603,13 +603,9 @@ struct Scsi_Host {
>   	 */
>   	struct blk_queue_tag	*bqt;
>
> -	/*
> -	 * The following two fields are protected with host_lock;
> -	 * however, eh routines can safely access during eh processing
> -	 * without acquiring the lock.
> -	 */
> -	unsigned int host_busy;		   /* commands actually active on low-level */
> -	unsigned int host_failed;	   /* commands that failed. */
> +	atomic_t host_busy;		   /* commands actually active on low-level */
> +	unsigned int host_failed;	   /* commands that failed.
> +					      protected by host_lock */
>   	unsigned int host_eh_scheduled;    /* EH scheduled without command */
>
>   	unsigned int host_no;  /* Used for IOCTL_GET_IDLUN, /proc/scsi et al. */
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread
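
The starved-list handling above keeps the host_lock off the fast path
with an unlocked test followed by a locked retest.  A hedged userspace
sketch of that double-checked pattern, with a pthread mutex and a flag
standing in for the host_lock and list_del_init(); everything here is
illustrative:

	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
	static bool on_starved_list;

	static void remove_if_starved(void)
	{
		/* fast path: unlocked test, no lock traffic at all */
		if (!on_starved_list)
			return;

		/* slow path: retest under the lock before acting, since
		 * another context may have removed the entry meanwhile */
		pthread_mutex_lock(&list_lock);
		if (on_starved_list)
			on_starved_list = false; /* stands in for list_del_init() */
		pthread_mutex_unlock(&list_lock);
	}

	int main(void)
	{
		on_starved_list = true;
		remove_if_starved();
		printf("on list: %d\n", on_starved_list);
		return 0;
	}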

* Re: [PATCH 08/14] scsi: convert device_busy to atomic_t
  2014-06-25 16:51 ` [PATCH 08/14] scsi: convert device_busy " Christoph Hellwig
@ 2014-07-09 11:16   ` Hannes Reinecke
  2014-07-09 16:49   ` James Bottomley
  1 sibling, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:16 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Avoid taking the queue_lock to check the per-device queue limit.  Instead
> we do an atomic_inc_return early on to grab our slot in the queue,
> and if necessary decrement it after finishing all checks.
>
> Unlike the host and target busy counters this doesn't allow us to avoid the
> queue_lock in the request_fn due to the way the interface works, but it'll
> allow us to prepare for using the blk-mq code, which doesn't use the
> queue_lock at all, and it at least avoids a queue_lock roundtrip in
> scsi_device_unbusy, which is still important given how busy the queue_lock
> is.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/message/fusion/mptsas.c |    2 +-
>   drivers/scsi/scsi_lib.c         |   50 ++++++++++++++++++++++-----------------
>   drivers/scsi/scsi_sysfs.c       |   10 +++++++-
>   drivers/scsi/sg.c               |    2 +-
>   include/scsi/scsi_device.h      |    4 +---
>   5 files changed, 40 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/message/fusion/mptsas.c b/drivers/message/fusion/mptsas.c
> index 711fcb5..d636dbe 100644
> --- a/drivers/message/fusion/mptsas.c
> +++ b/drivers/message/fusion/mptsas.c
> @@ -3763,7 +3763,7 @@ mptsas_send_link_status_event(struct fw_event_work *fw_event)
>   						printk(MYIOC_s_DEBUG_FMT
>   						"SDEV OUTSTANDING CMDS"
>   						"%d\n", ioc->name,
> -						sdev->device_busy));
> +						atomic_read(&sdev->device_busy)));
>   				}
>
>   			}
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5d37d79..e23fef5 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -302,9 +302,7 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>   		spin_unlock_irqrestore(shost->host_lock, flags);
>   	}
>
> -	spin_lock_irqsave(sdev->request_queue->queue_lock, flags);
> -	sdev->device_busy--;
> -	spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
> +	atomic_dec(&sdev->device_busy);
>   }
>
>   /*
> @@ -355,9 +353,10 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>
>   static inline int scsi_device_is_busy(struct scsi_device *sdev)
>   {
> -	if (sdev->device_busy >= sdev->queue_depth || sdev->device_blocked)
> +	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> +		return 1;
> +	if (sdev->device_blocked)
>   		return 1;
> -
>   	return 0;
>   }
>
> @@ -1224,7 +1223,7 @@ scsi_prep_return(struct request_queue *q, struct request *req, int ret)
>   		 * queue must be restarted, so we schedule a callback to happen
>   		 * shortly.
>   		 */
> -		if (sdev->device_busy == 0)
> +		if (atomic_read(&sdev->device_busy) == 0)
>   			blk_delay_queue(q, SCSI_QUEUE_DELAY);
>   		break;
>   	default:
> @@ -1281,26 +1280,32 @@ static void scsi_unprep_fn(struct request_queue *q, struct request *req)
>   static inline int scsi_dev_queue_ready(struct request_queue *q,
>   				  struct scsi_device *sdev)
>   {
> -	if (sdev->device_busy == 0 && sdev->device_blocked) {
> +	unsigned int busy;
> +
> +	busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	if (busy == 0 && sdev->device_blocked) {
>   		/*
>   		 * unblock after device_blocked iterates to zero
>   		 */
> -		if (--sdev->device_blocked == 0) {
> -			SCSI_LOG_MLQUEUE(3,
> -				   sdev_printk(KERN_INFO, sdev,
> -				   "unblocking device at zero depth\n"));
> -		} else {
> +		if (--sdev->device_blocked != 0) {
>   			blk_delay_queue(q, SCSI_QUEUE_DELAY);
> -			return 0;
> +			goto out_dec;
>   		}
> +		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
> +				   "unblocking device at zero depth\n"));
>   	}
> -	if (scsi_device_is_busy(sdev))
> -		return 0;
> +
> +	if (busy >= sdev->queue_depth)
> +		goto out_dec;
> +	if (sdev->device_blocked)
> +		goto out_dec;
>
>   	return 1;
> +out_dec:
> +	atomic_dec(&sdev->device_busy);
> +	return 0;
>   }
>
> -
>   /*
>    * scsi_target_queue_ready: checks if we can send commands to target
>    * @sdev: scsi device on starget to check.
> @@ -1470,7 +1475,7 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
>   	 * bump busy counts.  To bump the counters, we need to dance
>   	 * with the locks as normal issue path does.
>   	 */
> -	sdev->device_busy++;
> +	atomic_inc(&sdev->device_busy);
>   	atomic_inc(&shost->host_busy);
>   	atomic_inc(&starget->target_busy);
>
> @@ -1566,7 +1571,7 @@ static void scsi_request_fn(struct request_queue *q)
>   		 * accept it.
>   		 */
>   		req = blk_peek_request(q);
> -		if (!req || !scsi_dev_queue_ready(q, sdev))
> +		if (!req)
>   			break;
>
>   		if (unlikely(!scsi_device_online(sdev))) {
> @@ -1576,13 +1581,14 @@ static void scsi_request_fn(struct request_queue *q)
>   			continue;
>   		}
>
> +		if (!scsi_dev_queue_ready(q, sdev))
> +			break;
>
>   		/*
>   		 * Remove the request from the request list.
>   		 */
>   		if (!(blk_queue_tagged(q) && !blk_queue_start_tag(q, req)))
>   			blk_start_request(req);
> -		sdev->device_busy++;
>
>   		spin_unlock_irq(q->queue_lock);
>   		cmd = req->special;
> @@ -1652,9 +1658,9 @@ static void scsi_request_fn(struct request_queue *q)
>   	 */
>   	spin_lock_irq(q->queue_lock);
>   	blk_requeue_request(q, req);
> -	sdev->device_busy--;
> +	atomic_dec(&sdev->device_busy);
>   out_delay:
> -	if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
> +	if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev))
>   		blk_delay_queue(q, SCSI_QUEUE_DELAY);
>   }
>
> @@ -2394,7 +2400,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
>   		return err;
>
>   	scsi_run_queue(sdev->request_queue);
> -	while (sdev->device_busy) {
> +	while (atomic_read(&sdev->device_busy)) {
>   		msleep_interruptible(200);
>   		scsi_run_queue(sdev->request_queue);
>   	}
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 7ec5e06..54e3dac 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -585,13 +585,21 @@ static int scsi_sdev_check_buf_bit(const char *buf)
>    * Create the actual show/store functions and data structures.
>    */
>   sdev_rd_attr (device_blocked, "%d\n");
> -sdev_rd_attr (device_busy, "%d\n");
>   sdev_rd_attr (type, "%d\n");
>   sdev_rd_attr (scsi_level, "%d\n");
>   sdev_rd_attr (vendor, "%.8s\n");
>   sdev_rd_attr (model, "%.16s\n");
>   sdev_rd_attr (rev, "%.4s\n");
>
> +static ssize_t
> +sdev_show_device_busy(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	struct scsi_device *sdev = to_scsi_device(dev);
> +	return snprintf(buf, 20, "%d\n", atomic_read(&sdev->device_busy));
> +}
> +static DEVICE_ATTR(device_busy, S_IRUGO, sdev_show_device_busy, NULL);
> +
>   /*
>    * TODO: can we make these symlinks to the block layer ones?
>    */
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index cb2a18e..3db4fc9 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -2573,7 +2573,7 @@ static int sg_proc_seq_show_dev(struct seq_file *s, void *v)
>   			      scsidp->id, scsidp->lun, (int) scsidp->type,
>   			      1,
>   			      (int) scsidp->queue_depth,
> -			      (int) scsidp->device_busy,
> +			      (int) atomic_read(&scsidp->device_busy),
>   			      (int) scsi_device_online(scsidp));
>   	}
>   	read_unlock_irqrestore(&sg_index_lock, iflags);
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index 446f741..5ff3d24 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -81,9 +81,7 @@ struct scsi_device {
>   	struct list_head    siblings;   /* list of all devices on this host */
>   	struct list_head    same_target_siblings; /* just the devices sharing same target id */
>
> -	/* this is now protected by the request_queue->queue_lock */
> -	unsigned int device_busy;	/* commands actually active on
> -					 * low-level. protected by queue_lock. */
> +	atomic_t device_busy;		/* commands actually active on LLDD */
>   	spinlock_t list_lock;
>   	struct list_head cmd_list;	/* queue of in use SCSI Command structures */
>   	struct list_head starved_entry;
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread
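
A rough userspace model of the resulting lifecycle: submission and
completion pair a bare atomic increment and decrement, so a drain loop
like the one in scsi_device_quiesce() only needs to poll the counter.
Names and sleep intervals are made up:

	#include <stdatomic.h>
	#include <stdio.h>
	#include <unistd.h>

	static atomic_uint device_busy;

	static void submit_cmd(void)   { atomic_fetch_add(&device_busy, 1); }
	static void complete_cmd(void) { atomic_fetch_sub(&device_busy, 1); }

	static void quiesce(void)
	{
		/* no lock: a racing submit either lands before this read
		 * (we keep waiting) or after new I/O has been gated off */
		while (atomic_load(&device_busy) != 0)
			sleep(1);  /* stands in for msleep_interruptible(200) */
	}

	int main(void)
	{
		submit_cmd();
		complete_cmd();
		quiesce();
		printf("drained, busy=%u\n", atomic_load(&device_busy));
		return 0;
	}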

* Re: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit
  2014-06-25 16:51 ` [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit Christoph Hellwig
@ 2014-07-09 11:19   ` Hannes Reinecke
  2014-07-09 15:05     ` Christoph Hellwig
  0 siblings, 1 reply; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:19 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> This saves us an atomic operation for each I/O submission and completion
> for the usual case where the driver doesn't set a per-target can_queue
> value.  Only a few iscsi hardware offload drivers set the per-target
> can_queue value at the moment.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c |   17 ++++++++++++-----
>   1 file changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index a39d5ba..a64b9d3 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -295,7 +295,8 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>   	unsigned long flags;
>
>   	atomic_dec(&shost->host_busy);
> -	atomic_dec(&starget->target_busy);
> +	if (starget->can_queue > 0)
> +		atomic_dec(&starget->target_busy);
>
>   	if (unlikely(scsi_host_in_recovery(shost) &&
>   		     (shost->host_failed || shost->host_eh_scheduled))) {
> @@ -1335,6 +1336,9 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   		spin_unlock_irq(shost->host_lock);
>   	}
>
> +	if (starget->can_queue <= 0)
> +		return 1;
> +
>   	busy = atomic_inc_return(&starget->target_busy) - 1;
>   	if (busy == 0 && atomic_read(&starget->target_blocked) > 0) {
>   		if (atomic_dec_return(&starget->target_blocked) > 0)
> @@ -1344,7 +1348,7 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
>   				 "unblocking target at zero depth\n"));
>   	}
>
> -	if (starget->can_queue > 0 && busy >= starget->can_queue)
> +	if (busy >= starget->can_queue)
>   		goto starved;
>   	if (atomic_read(&starget->target_blocked) > 0)
>   		goto starved;
> @@ -1356,7 +1360,8 @@ starved:
>   	list_move_tail(&sdev->starved_entry, &shost->starved_list);
>   	spin_unlock_irq(shost->host_lock);
>   out_dec:
> -	atomic_dec(&starget->target_busy);
> +	if (starget->can_queue > 0)
> +		atomic_dec(&starget->target_busy);
>   	return 0;
>   }
>
> @@ -1473,7 +1478,8 @@ static void scsi_kill_request(struct request *req, struct request_queue *q)
>   	 */
>   	atomic_inc(&sdev->device_busy);
>   	atomic_inc(&shost->host_busy);
> -	atomic_inc(&starget->target_busy);
> +	if (starget->can_queue > 0)
> +		atomic_inc(&starget->target_busy);
>
>   	blk_complete_request(req);
>   }
> @@ -1642,7 +1648,8 @@ static void scsi_request_fn(struct request_queue *q)
>   	return;
>
>    host_not_ready:
> -	atomic_dec(&scsi_target(sdev)->target_busy);
> +	if (scsi_target(sdev)->can_queue > 0)
> +		atomic_dec(&scsi_target(sdev)->target_busy);
>    not_ready:
>   	/*
>   	 * lock q, handle tag, requeue req, and decrement device_busy. We
>
Hmm. 'can_queue' can be changed by the LLDD. Don't we need some sort 
of synchronization here?
(Or move that to atomic_t, too?)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread
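
A small model of the conditional-counter scheme that also makes the
question above concrete: both the get and put side test can_queue, so
changing it while commands are in flight would leak or underflow
target_busy.  Illustrative userspace code, not the kernel's:

	#include <stdatomic.h>

	struct target {
		int		can_queue;	/* <= 0: no per-target limit */
		atomic_uint	busy;
	};

	/* called on submission */
	static void target_get_slot(struct target *t)
	{
		if (t->can_queue > 0)	/* must stay stable during I/O */
			atomic_fetch_add(&t->busy, 1);
	}

	/* called on completion: must see the same can_queue as the
	 * matching get, or busy is never balanced */
	static void target_put_slot(struct target *t)
	{
		if (t->can_queue > 0)
			atomic_fetch_sub(&t->busy, 1);
	}

	int main(void)
	{
		struct target t = { .can_queue = 8 };

		target_get_slot(&t);
		target_put_slot(&t);
		return 0;
	}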

* Re: [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls
  2014-06-25 16:51 ` [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls Christoph Hellwig
@ 2014-07-09 11:20   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:20 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Replace the calls to the various blk_end_request variants with open-coded
> equivalents.  Blk-mq uses a model that gives the driver control
> between the bio updates and the actual completion, and making the old
> code follow that same model allows us to keep the code more similar for
> both paths.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c |   61 ++++++++++++++++++++++++++++++++---------------
>   1 file changed, 42 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index a64b9d3..58534fd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -625,6 +625,37 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
>   	cmd->request->next_rq->special = NULL;
>   }
>
> +static bool scsi_end_request(struct request *req, int error,
> +		unsigned int bytes, unsigned int bidi_bytes)
> +{
> +	struct scsi_cmnd *cmd = req->special;
> +	struct scsi_device *sdev = cmd->device;
> +	struct request_queue *q = sdev->request_queue;
> +	unsigned long flags;
> +
> +
> +	if (blk_update_request(req, error, bytes))
> +		return true;
> +
> +	/* Bidi request must be completed as a whole */
> +	if (unlikely(bidi_bytes) &&
> +	    blk_update_request(req->next_rq, error, bidi_bytes))
> +		return true;
> +
> +	if (blk_queue_add_random(q))
> +		add_disk_randomness(req->rq_disk);
> +
> +	spin_lock_irqsave(q->queue_lock, flags);
> +	blk_finish_request(req, error);
> +	spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +	if (bidi_bytes)
> +		scsi_release_bidi_buffers(cmd);
> +	scsi_release_buffers(cmd);
> +	scsi_next_command(cmd);
> +	return false;
> +}
> +
>   /**
>    * __scsi_error_from_host_byte - translate SCSI error code into errno
>    * @cmd:	SCSI command (unused)
> @@ -697,7 +728,7 @@ static int __scsi_error_from_host_byte(struct scsi_cmnd *cmd, int result)
>    *		   be put back on the queue and retried using the same
>    *		   command as before, possibly after a delay.
>    *
> - *		c) We can call blk_end_request() with -EIO to fail
> + *		c) We can call scsi_end_request() with -EIO to fail
>    *		   the remainder of the request.
>    */
>   void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
> @@ -749,13 +780,9 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
>   			 * both sides at once.
>   			 */
>   			req->next_rq->resid_len = scsi_in(cmd)->resid;
> -
> -			scsi_release_buffers(cmd);
> -			scsi_release_bidi_buffers(cmd);
> -
> -			blk_end_request_all(req, 0);
> -
> -			scsi_next_command(cmd);
> +			if (scsi_end_request(req, 0, blk_rq_bytes(req),
> +					blk_rq_bytes(req->next_rq)))
> +				BUG();
>   			return;
>   		}
>   	}
> @@ -794,15 +821,16 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
>   	/*
>   	 * If we finished all bytes in the request we are done now.
>   	 */
> -	if (!blk_end_request(req, error, good_bytes))
> -		goto next_command;
> +	if (!scsi_end_request(req, error, good_bytes, 0))
> +		return;
>
>   	/*
>   	 * Kill remainder if no retrys.
>   	 */
>   	if (error && scsi_noretry_cmd(cmd)) {
> -		blk_end_request_all(req, error);
> -		goto next_command;
> +		if (scsi_end_request(req, error, blk_rq_bytes(req), 0))
> +			BUG();
> +		return;
>   	}
>
>   	/*
> @@ -947,8 +975,8 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
>   				scsi_print_sense("", cmd);
>   			scsi_print_command(cmd);
>   		}
> -		if (!blk_end_request_err(req, error))
> -			goto next_command;
> +		if (!scsi_end_request(req, error, blk_rq_err_bytes(req), 0))
> +			return;
>   		/*FALLTHRU*/
>   	case ACTION_REPREP:
>   	requeue:
> @@ -967,11 +995,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
>   		__scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY, 0);
>   		break;
>   	}
> -	return;
> -
> -next_command:
> -	scsi_release_buffers(cmd);
> -	scsi_next_command(cmd);
>   }
>
>   static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
>
YES.

That code really was a mess.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread
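
A simplified model of the helper's contract: return true while the
request still has outstanding bytes (the caller keeps going), false once
everything is finished and released.  blk_update_request() is replaced
here by a plain byte counter, and the names are invented:

	#include <stdbool.h>
	#include <stdio.h>

	struct request { unsigned int remaining; };

	/* stands in for blk_update_request(): consume completed bytes
	 * and report whether any part of the request is outstanding */
	static bool update_request(struct request *req, unsigned int bytes)
	{
		unsigned int n = bytes < req->remaining ? bytes : req->remaining;

		req->remaining -= n;
		return req->remaining != 0;
	}

	static bool end_request(struct request *req, unsigned int bytes)
	{
		if (update_request(req, bytes))
			return true;	/* partial: caller retries */

		/* fully done: finish request, release buffers (elided) */
		printf("request complete\n");
		return false;
	}

	int main(void)
	{
		struct request req = { .remaining = 8192 };

		while (end_request(&req, 4096))
			;	/* keep completing until nothing is left */
		return 0;
	}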

* Re: [PATCH 12/14] scatterlist: allow chaining to preallocated chunks
  2014-06-25 16:51 ` [PATCH 12/14] scatterlist: allow chaining to preallocated chunks Christoph Hellwig
@ 2014-07-09 11:21   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:21 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:51 PM, Christoph Hellwig wrote:
> Blk-mq drivers usually preallocate their S/G list as part of the request,
> but if we want to support the very large S/G lists currently supported by
> the SCSI code that would tie up a lot of memory in the preallocated request
> pool.  Add support to the scatterlist code so that it can initialize a
> S/G list that uses a preallocated first chunk and dynamically allocated
> additional chunks.  That way the scsi-mq code can preallocate a first
> page worth of S/G entries as part of the request, and dynamically extend
> the S/G list when needed.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/scsi_lib.c     |   16 +++++++---------
>   include/linux/scatterlist.h |    6 +++---
>   lib/scatterlist.c           |   24 ++++++++++++++++--------
>   3 files changed, 26 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 58534fd..900b1c0 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -567,6 +567,11 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
>   	return mempool_alloc(sgp->pool, gfp_mask);
>   }
>
> +static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> +{
> +	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
> +}
> +
>   static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
>   			      gfp_t gfp_mask)
>   {
> @@ -575,19 +580,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
>   	BUG_ON(!nents);
>
>   	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
> -			       gfp_mask, scsi_sg_alloc);
> +			       NULL, gfp_mask, scsi_sg_alloc);
>   	if (unlikely(ret))
> -		__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS,
> -				scsi_sg_free);
> -
> +		scsi_free_sgtable(sdb);
>   	return ret;
>   }
>
> -static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> -{
> -	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, scsi_sg_free);
> -}
> -
>   /*
>    * Function:    scsi_release_buffers()
>    *
> diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
> index a964f72..f4ec8bb 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -229,10 +229,10 @@ void sg_init_one(struct scatterlist *, const void *, unsigned int);
>   typedef struct scatterlist *(sg_alloc_fn)(unsigned int, gfp_t);
>   typedef void (sg_free_fn)(struct scatterlist *, unsigned int);
>
> -void __sg_free_table(struct sg_table *, unsigned int, sg_free_fn *);
> +void __sg_free_table(struct sg_table *, unsigned int, bool, sg_free_fn *);
>   void sg_free_table(struct sg_table *);
> -int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int, gfp_t,
> -		     sg_alloc_fn *);
> +int __sg_alloc_table(struct sg_table *, unsigned int, unsigned int,
> +		     struct scatterlist *, gfp_t, sg_alloc_fn *);
>   int sg_alloc_table(struct sg_table *, unsigned int, gfp_t);
>   int sg_alloc_table_from_pages(struct sg_table *sgt,
>   	struct page **pages, unsigned int n_pages,
> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
> index 3a8e8e8..48c15d2 100644
> --- a/lib/scatterlist.c
> +++ b/lib/scatterlist.c
> @@ -165,6 +165,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
>    * __sg_free_table - Free a previously mapped sg table
>    * @table:	The sg table header to use
>    * @max_ents:	The maximum number of entries per single scatterlist
> + * @skip_first_chunk: don't free the (preallocated) first scatterlist chunk
>    * @free_fn:	Free function
>    *
>    *  Description:
> @@ -174,7 +175,7 @@ static void sg_kfree(struct scatterlist *sg, unsigned int nents)
>    *
>    **/
>   void __sg_free_table(struct sg_table *table, unsigned int max_ents,
> -		     sg_free_fn *free_fn)
> +		     bool skip_first_chunk, sg_free_fn *free_fn)
>   {
>   	struct scatterlist *sgl, *next;
>
> @@ -202,7 +203,9 @@ void __sg_free_table(struct sg_table *table, unsigned int max_ents,
>   		}
>
>   		table->orig_nents -= sg_size;
> -		free_fn(sgl, alloc_size);
> +		if (!skip_first_chunk)
> +			free_fn(sgl, alloc_size);
> +		skip_first_chunk = false;
>   		sgl = next;
>   	}
>
> @@ -217,7 +220,7 @@ EXPORT_SYMBOL(__sg_free_table);
>    **/
>   void sg_free_table(struct sg_table *table)
>   {
> -	__sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
> +	__sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
>   }
>   EXPORT_SYMBOL(sg_free_table);
>
> @@ -241,8 +244,8 @@ EXPORT_SYMBOL(sg_free_table);
>    *
>    **/
>   int __sg_alloc_table(struct sg_table *table, unsigned int nents,
> -		     unsigned int max_ents, gfp_t gfp_mask,
> -		     sg_alloc_fn *alloc_fn)
> +		     unsigned int max_ents, struct scatterlist *first_chunk,
> +		     gfp_t gfp_mask, sg_alloc_fn *alloc_fn)
>   {
>   	struct scatterlist *sg, *prv;
>   	unsigned int left;
> @@ -269,7 +272,12 @@ int __sg_alloc_table(struct sg_table *table, unsigned int nents,
>
>   		left -= sg_size;
>
> -		sg = alloc_fn(alloc_size, gfp_mask);
> +		if (first_chunk) {
> +			sg = first_chunk;
> +			first_chunk = NULL;
> +		} else {
> +			sg = alloc_fn(alloc_size, gfp_mask);
> +		}
>   		if (unlikely(!sg)) {
>   			/*
>   			 * Adjust entry count to reflect that the last
> @@ -324,9 +332,9 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
>   	int ret;
>
>   	ret = __sg_alloc_table(table, nents, SG_MAX_SINGLE_ALLOC,
> -			       gfp_mask, sg_kmalloc);
> +			       NULL, gfp_mask, sg_kmalloc);
>   	if (unlikely(ret))
> -		__sg_free_table(table, SG_MAX_SINGLE_ALLOC, sg_kfree);
> +		__sg_free_table(table, SG_MAX_SINGLE_ALLOC, false, sg_kfree);
>
>   	return ret;
>   }
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread
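
A userspace sketch of the first-chunk scheme: small lists live entirely
in a chunk embedded in the command, larger ones chain heap-allocated
chunks behind it, and the free path must skip the embedded chunk.  The
chunk size and all names are illustrative only:

	#include <stdlib.h>
	#include <stdbool.h>

	#define CHUNK_ENTS 4	/* entries per chunk; tiny for illustration */

	struct chunk {
		int		ents[CHUNK_ENTS];
		struct chunk	*next;
	};

	struct cmd {
		struct chunk	first;	/* preallocated with the command */
	};

	/* build a list big enough for nents entries, reusing cmd->first
	 * as the first chunk and chaining heap chunks behind it */
	static bool alloc_list(struct cmd *cmd, int nents)
	{
		struct chunk *c = &cmd->first;

		for (nents -= CHUNK_ENTS; nents > 0; nents -= CHUNK_ENTS) {
			c->next = calloc(1, sizeof(*c->next));
			if (!c->next)
				return false; /* caller unwinds via free_list() */
			c = c->next;
		}
		c->next = NULL;
		return true;
	}

	/* free only the dynamic tail; the first chunk is the command's */
	static void free_list(struct cmd *cmd)
	{
		struct chunk *c = cmd->first.next;

		while (c) {
			struct chunk *next = c->next;

			free(c);
			c = next;
		}
		cmd->first.next = NULL;
	}

	int main(void)
	{
		static struct cmd cmd;

		alloc_list(&cmd, 10);	/* first chunk plus two heap chunks */
		free_list(&cmd);	/* also safe after a failed alloc */
		return 0;
	}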

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-06-25 16:52 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
@ 2014-07-09 11:25   ` Hannes Reinecke
  2014-07-16 11:13   ` Mike Christie
  1 sibling, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:25 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On 06/25/2014 06:52 PM, Christoph Hellwig wrote:
> This patch adds support for an alternate I/O path in the scsi midlayer
> which uses the blk-mq infrastructure instead of the legacy request code.
>
> Use of blk-mq is fully transparent to drivers, although for now a host
> template field is provided to opt out of blk-mq usage in case any unforeseen
> incompatibilities arise.
>
> In general replacing the legacy request code with blk-mq is a simple and
> mostly mechanical transformation.  The biggest exception is the new code
> that deals with the fact that I/O submissions in blk-mq must happen from
> process context, which slightly complicates the I/O completion handler.
> The second biggest difference is that blk-mq is built around the concept
> of preallocated requests that also include driver specific data, which
> in SCSI context means the scsi_cmnd structure.  This completely avoids
> dynamic memory allocations for the fast path through I/O submission.
>
> Due to the preallocated requests the MQ code path exclusively uses the
> host-wide shared tag allocator instead of a per-LUN one.  This only
> affects drivers actually using the block layer provided tag allocator
> instead of their own.  Unlike the old path blk-mq always provides a tag,
> although drivers don't have to use it.
>
> For now the blk-mq path is disabled by default and must be enabled using
> the "use_blk_mq" module parameter.  Once the remaining work in the block
> layer to make blk-mq more suitable for slow devices is complete I hope
> to make it the default and eventually even remove the old code path.
>
> Based on the earlier scsi-mq prototype by Nicholas Bellinger.
>
> Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking and
> various suggestions and code contributions.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/scsi/hosts.c      |   30 ++-
>   drivers/scsi/scsi.c       |    5 +-
>   drivers/scsi/scsi_lib.c   |  475 +++++++++++++++++++++++++++++++++++++++------
>   drivers/scsi/scsi_priv.h  |    3 +
>   drivers/scsi/scsi_scan.c  |    5 +-
>   drivers/scsi/scsi_sysfs.c |    2 +
>   include/scsi/scsi_host.h  |   18 +-
>   include/scsi/scsi_tcq.h   |   28 ++-
>   8 files changed, 494 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index 0632eee..6322e6c 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>   		goto fail;
>   	}
>
> +	if (shost_use_blk_mq(shost)) {
> +		error = scsi_mq_setup_tags(shost);
> +		if (error)
> +			goto fail;
> +	}
> +
> +	/*
> +	 * Note that we allocate the freelist even for the MQ case for now,
> +	 * as we need a command set aside for scsi_reset_provider.  Having
> +	 * the full host freelist and one command available for that is a
> +	 * little heavy-handed, but avoids introducing a special allocator
> +	 * just for this.  Eventually the structure of scsi_reset_provider
> +	 * will need a major overhaul.
> +	 */
>   	error = scsi_setup_command_freelist(shost);
>   	if (error)
> -		goto fail;
> +		goto out_destroy_tags;
> +
>
>   	if (!shost->shost_gendev.parent)
>   		shost->shost_gendev.parent = dev ? dev : &platform_bus;
> @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>
>   	error = device_add(&shost->shost_gendev);
>   	if (error)
> -		goto out;
> +		goto out_destroy_freelist;
>
>   	pm_runtime_set_active(&shost->shost_gendev);
>   	pm_runtime_enable(&shost->shost_gendev);
> @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>   	device_del(&shost->shost_dev);
>    out_del_gendev:
>   	device_del(&shost->shost_gendev);
> - out:
> + out_destroy_freelist:
>   	scsi_destroy_command_freelist(shost);
> + out_destroy_tags:
> +	if (shost_use_blk_mq(shost))
> +		scsi_mq_destroy_tags(shost);
>    fail:
>   	return error;
>   }
> @@ -309,7 +327,9 @@ static void scsi_host_dev_release(struct device *dev)
>   	}
>
>   	scsi_destroy_command_freelist(shost);
> -	if (shost->bqt)
> +	if (shost_use_blk_mq(shost) && shost->tag_set.tags)
> +		scsi_mq_destroy_tags(shost);
> +	else if (shost->bqt)
>   		blk_free_tags(shost->bqt);
>
>   	kfree(shost->shost_data);
> @@ -436,6 +456,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>   	else
>   		shost->dma_boundary = 0xffffffff;
>
> +	shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
> +
>   	device_initialize(&shost->shost_gendev);
>   	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
>   	shost->shost_gendev.bus = &scsi_bus_type;
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index b362058..c089812 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -809,7 +809,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
>   	 * is more IO than the LLD's can_queue (so there are not enough
>   	 * tags) request_fn's host queue ready check will handle it.
>   	 */
> -	if (!sdev->host->bqt) {
> +	if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
>   		if (blk_queue_tagged(sdev->request_queue) &&
>   		    blk_queue_resize_tags(sdev->request_queue, tags) != 0)
>   			goto out;
> @@ -1363,6 +1363,9 @@ MODULE_LICENSE("GPL");
>   module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
>   MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
>
> +bool scsi_use_blk_mq = false;
> +module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
> +
>   static int __init init_scsi(void)
>   {
>   	int error;
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 900b1c0..5d39cfc 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1,5 +1,6 @@
>   /*
> - *  scsi_lib.c Copyright (C) 1999 Eric Youngdale
> + * Copyright (C) 1999 Eric Youngdale
> + * Copyright (C) 2014 Christoph Hellwig
>    *
>    *  SCSI queueing library.
>    *      Initial versions: Eric Youngdale (eric@andante.org).
> @@ -20,6 +21,7 @@
>   #include <linux/delay.h>
>   #include <linux/hardirq.h>
>   #include <linux/scatterlist.h>
> +#include <linux/blk-mq.h>
>
>   #include <scsi/scsi.h>
>   #include <scsi/scsi_cmnd.h>
> @@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
>   	}
>   }
>
> +static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
> +{
> +	struct scsi_device *sdev = cmd->device;
> +	struct request_queue *q = cmd->request->q;
> +
> +	blk_mq_requeue_request(cmd->request);
> +	blk_mq_kick_requeue_list(q);
> +	put_device(&sdev->sdev_gendev);
> +}
> +
>   /**
>    * __scsi_queue_insert - private queue insertion
>    * @cmd: The SCSI command being requeued
> @@ -150,6 +162,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
>   	 * before blk_cleanup_queue() finishes.
>   	 */
>   	cmd->result = 0;
> +	if (q->mq_ops) {
> +		scsi_mq_requeue_cmd(cmd);
> +		return;
> +	}
>   	spin_lock_irqsave(q->queue_lock, flags);
>   	blk_requeue_request(q, cmd->request);
>   	kblockd_schedule_work(&device->requeue_work);
> @@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>   	atomic_dec(&sdev->device_busy);
>   }
>
> +static void scsi_kick_queue(struct request_queue *q)
> +{
> +	if (q->mq_ops)
> +		blk_mq_start_hw_queues(q);
> +	else
> +		blk_run_queue(q);
> +}
> +
>   /*
>    * Called for single_lun devices on IO completion. Clear starget_sdev_user,
>    * and call blk_run_queue for all the scsi_devices on the target -
> @@ -332,7 +356,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>   	 * but in most cases, we will be first. Ideally, each LU on the
>   	 * target would get some limited time or requests on the target.
>   	 */
> -	blk_run_queue(current_sdev->request_queue);
> +	scsi_kick_queue(current_sdev->request_queue);
>
>   	spin_lock_irqsave(shost->host_lock, flags);
>   	if (starget->starget_sdev_user)
> @@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>   			continue;
>
>   		spin_unlock_irqrestore(shost->host_lock, flags);
> -		blk_run_queue(sdev->request_queue);
> +		scsi_kick_queue(sdev->request_queue);
>   		spin_lock_irqsave(shost->host_lock, flags);
>   	
>   		scsi_device_put(sdev);
> @@ -438,7 +462,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
>   			continue;
>   		spin_unlock_irqrestore(shost->host_lock, flags);
>
> -		blk_run_queue(slq);
> +		scsi_kick_queue(slq);
>   		blk_put_queue(slq);
>
>   		spin_lock_irqsave(shost->host_lock, flags);
> @@ -469,7 +493,10 @@ static void scsi_run_queue(struct request_queue *q)
>   	if (!list_empty(&sdev->host->starved_list))
>   		scsi_starved_list_run(sdev->host);
>
> -	blk_run_queue(q);
> +	if (q->mq_ops)
> +		blk_mq_start_stopped_hw_queues(q, false);
> +	else
> +		blk_run_queue(q);
>   }
>
>   void scsi_requeue_run_queue(struct work_struct *work)
> @@ -567,25 +594,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
>   	return mempool_alloc(sgp->pool, gfp_mask);
>   }
>
> -static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> +static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
>   {
> -	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
> +	if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
> +		return;
> +	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
>   }
>
>   static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> -			      gfp_t gfp_mask)
> +			      gfp_t gfp_mask, bool mq)
>   {
> +	struct scatterlist *first_chunk = NULL;
>   	int ret;
>
>   	BUG_ON(!nents);
>
> +	if (mq) {
> +		if (nents <= SCSI_MAX_SG_SEGMENTS) {
> +			sdb->table.nents = nents;
> +			sg_init_table(sdb->table.sgl, sdb->table.nents);
> +			return 0;
> +		}
> +		first_chunk = sdb->table.sgl;
> +	}
> +
>   	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
> -			       NULL, gfp_mask, scsi_sg_alloc);
> +			       first_chunk, gfp_mask, scsi_sg_alloc);
>   	if (unlikely(ret))
> -		scsi_free_sgtable(sdb);
> +		scsi_free_sgtable(sdb, mq);
>   	return ret;
>   }
>
> +static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
> +{
> +	if (cmd->request->cmd_type == REQ_TYPE_FS) {
> +		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> +
> +		if (drv->uninit_command)
> +			drv->uninit_command(cmd);
> +	}
> +}
> +
> +static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
> +{
> +	if (cmd->sdb.table.nents)
> +		scsi_free_sgtable(&cmd->sdb, true);
> +	if (cmd->request->next_rq && cmd->request->next_rq->special)
> +		scsi_free_sgtable(cmd->request->next_rq->special, true);
> +	if (scsi_prot_sg_count(cmd))
> +		scsi_free_sgtable(cmd->prot_sdb, true);
> +}
> +
> +static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
> +{
> +	struct scsi_device *sdev = cmd->device;
> +	unsigned long flags;
> +
> +	BUG_ON(list_empty(&cmd->list));
> +
> +	scsi_mq_free_sgtables(cmd);
> +	scsi_uninit_cmd(cmd);
> +
> +	spin_lock_irqsave(&sdev->list_lock, flags);
> +	list_del_init(&cmd->list);
> +	spin_unlock_irqrestore(&sdev->list_lock, flags);
> +}
> +
>   /*
>    * Function:    scsi_release_buffers()
>    *
> @@ -605,12 +679,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
>   void scsi_release_buffers(struct scsi_cmnd *cmd)
>   {
>   	if (cmd->sdb.table.nents)
> -		scsi_free_sgtable(&cmd->sdb);
> +		scsi_free_sgtable(&cmd->sdb, false);
>
>   	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
>
>   	if (scsi_prot_sg_count(cmd))
> -		scsi_free_sgtable(cmd->prot_sdb);
> +		scsi_free_sgtable(cmd->prot_sdb, false);
>   }
>   EXPORT_SYMBOL(scsi_release_buffers);
>
> @@ -618,7 +692,7 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
>   {
>   	struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;
>
> -	scsi_free_sgtable(bidi_sdb);
> +	scsi_free_sgtable(bidi_sdb, false);
>   	kmem_cache_free(scsi_sdb_cache, bidi_sdb);
>   	cmd->request->next_rq->special = NULL;
>   }
> @@ -629,8 +703,6 @@ static bool scsi_end_request(struct request *req, int error,
>   	struct scsi_cmnd *cmd = req->special;
>   	struct scsi_device *sdev = cmd->device;
>   	struct request_queue *q = sdev->request_queue;
> -	unsigned long flags;
> -
>
>   	if (blk_update_request(req, error, bytes))
>   		return true;
> @@ -643,14 +715,38 @@ static bool scsi_end_request(struct request *req, int error,
>   	if (blk_queue_add_random(q))
>   		add_disk_randomness(req->rq_disk);
>
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	blk_finish_request(req, error);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (req->mq_ctx) {
> +		/*
> +		 * In the MQ case the command gets freed by __blk_mq_end_io,
> +		 * so we have to do all cleanup that depends on it earlier.
> +		 *
> +		 * We also can't kick the queues from irq context, so we
> +		 * will have to defer it to a workqueue.
> +		 */
> +		scsi_mq_uninit_cmd(cmd);
> +
> +		__blk_mq_end_io(req, error);
> +
> +		if (scsi_target(sdev)->single_lun ||
> +		    !list_empty(&sdev->host->starved_list))
> +			kblockd_schedule_work(&sdev->requeue_work);
> +		else
> +			blk_mq_start_stopped_hw_queues(q, true);
> +
> +		put_device(&sdev->sdev_gendev);
> +	} else {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		blk_finish_request(req, error);
> +		spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +		if (bidi_bytes)
> +			scsi_release_bidi_buffers(cmd);
> +		scsi_release_buffers(cmd);
> +		scsi_next_command(cmd);
> +	}
>
> -	if (bidi_bytes)
> -		scsi_release_bidi_buffers(cmd);
> -	scsi_release_buffers(cmd);
> -	scsi_next_command(cmd);
>   	return false;
>   }
>
> @@ -981,8 +1077,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
>   		/* Unprep the request and put it back at the head of the queue.
>   		 * A new command will be prepared and issued.
>   		 */
> -		scsi_release_buffers(cmd);
> -		scsi_requeue_command(q, cmd);
> +		if (q->mq_ops) {
> +			cmd->request->cmd_flags &= ~REQ_DONTPREP;
> +			scsi_mq_uninit_cmd(cmd);
> +			scsi_mq_requeue_cmd(cmd);
> +		} else {
> +			scsi_release_buffers(cmd);
> +			scsi_requeue_command(q, cmd);
> +		}
>   		break;
>   	case ACTION_RETRY:
>   		/* Retry the same command immediately */
> @@ -1004,9 +1106,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
>   	 * If sg table allocation fails, requeue request later.
>   	 */
>   	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
> -					gfp_mask))) {
> +					gfp_mask, req->mq_ctx != NULL)))
>   		return BLKPREP_DEFER;
> -	}
>
>   	/*
>   	 * Next, walk the list, and fill in the addresses and sizes of
> @@ -1034,21 +1135,27 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>   {
>   	struct scsi_device *sdev = cmd->device;
>   	struct request *rq = cmd->request;
> +	bool is_mq = (rq->mq_ctx != NULL);
> +	int error;
>
> -	int error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
> +	error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
>   	if (error)
>   		goto err_exit;
>
>   	if (blk_bidi_rq(rq)) {
> -		struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
> -			scsi_sdb_cache, GFP_ATOMIC);
> -		if (!bidi_sdb) {
> -			error = BLKPREP_DEFER;
> -			goto err_exit;
> +		if (!rq->q->mq_ops) {
> +			struct scsi_data_buffer *bidi_sdb =
> +				kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
> +			if (!bidi_sdb) {
> +				error = BLKPREP_DEFER;
> +				goto err_exit;
> +			}
> +
> +			rq->next_rq->special = bidi_sdb;
>   		}
>
> -		rq->next_rq->special = bidi_sdb;
> -		error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
> +		error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
> +					  GFP_ATOMIC);
>   		if (error)
>   			goto err_exit;
>   	}
> @@ -1060,7 +1167,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>   		BUG_ON(prot_sdb == NULL);
>   		ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
>
> -		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
> +		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
>   			error = BLKPREP_DEFER;
>   			goto err_exit;
>   		}
> @@ -1074,13 +1181,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>   		cmd->prot_sdb->table.nents = count;
>   	}
>
> -	return BLKPREP_OK ;
> -
> +	return BLKPREP_OK;
>   err_exit:
> -	scsi_release_buffers(cmd);
> -	cmd->request->special = NULL;
> -	scsi_put_command(cmd);
> -	put_device(&sdev->sdev_gendev);
> +	if (is_mq) {
> +		scsi_mq_free_sgtables(cmd);
> +	} else {
> +		scsi_release_buffers(cmd);
> +		cmd->request->special = NULL;
> +		scsi_put_command(cmd);
> +		put_device(&sdev->sdev_gendev);
> +	}
>   	return error;
>   }
>   EXPORT_SYMBOL(scsi_init_io);
> @@ -1295,13 +1405,7 @@ out:
>
>   static void scsi_unprep_fn(struct request_queue *q, struct request *req)
>   {
> -	if (req->cmd_type == REQ_TYPE_FS) {
> -		struct scsi_cmnd *cmd = req->special;
> -		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> -
> -		if (drv->uninit_command)
> -			drv->uninit_command(cmd);
> -	}
> +	scsi_uninit_cmd(req->special);
>   }
>
>   /*
> @@ -1318,7 +1422,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>   	busy = atomic_inc_return(&sdev->device_busy) - 1;
>   	if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
>   		if (atomic_dec_return(&sdev->device_blocked) > 0) {
> -			blk_delay_queue(q, SCSI_QUEUE_DELAY);
> +			/*
> +			 * For the MQ case we take care of this in the caller.
> +			 */
> +			if (!q->mq_ops)
> +				blk_delay_queue(q, SCSI_QUEUE_DELAY);
>   			goto out_dec;
>   		}
>   		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
> @@ -1688,6 +1796,188 @@ out_delay:
>   		blk_delay_queue(q, SCSI_QUEUE_DELAY);
>   }
>
> +static inline int prep_to_mq(int ret)
> +{
> +	switch (ret) {
> +	case BLKPREP_OK:
> +		return 0;
> +	case BLKPREP_DEFER:
> +		return BLK_MQ_RQ_QUEUE_BUSY;
> +	default:
> +		return BLK_MQ_RQ_QUEUE_ERROR;
> +	}
> +}
> +
> +static int scsi_mq_prep_fn(struct request *req)
> +{
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> +	struct scsi_device *sdev = req->q->queuedata;
> +	struct Scsi_Host *shost = sdev->host;
> +	unsigned char *sense_buf = cmd->sense_buffer;
> +	struct scatterlist *sg;
> +
> +	memset(cmd, 0, sizeof(struct scsi_cmnd));
> +
> +	req->special = cmd;
> +
> +	cmd->request = req;
> +	cmd->device = sdev;
> +	cmd->sense_buffer = sense_buf;
> +
> +	cmd->tag = req->tag;
> +
> +	req->cmd = req->__cmd;
> +	cmd->cmnd = req->cmd;
> +	cmd->prot_op = SCSI_PROT_NORMAL;
> +
> +	INIT_LIST_HEAD(&cmd->list);
> +	INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
> +	cmd->jiffies_at_alloc = jiffies;
> +
> +	/*
> +	 * XXX: cmd_list lookups are only used by two drivers, try to get
> +	 * rid of this list in common code.
> +	 */
> +	spin_lock_irq(&sdev->list_lock);
> +	list_add_tail(&cmd->list, &sdev->cmd_list);
> +	spin_unlock_irq(&sdev->list_lock);
> +
> +	sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
> +	cmd->sdb.table.sgl = sg;
> +
> +	if (scsi_host_get_prot(shost)) {
> +		cmd->prot_sdb = (void *)sg +
> +			shost->sg_tablesize * sizeof(struct scatterlist);
> +		memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
> +
> +		cmd->prot_sdb->table.sgl =
> +			(struct scatterlist *)(cmd->prot_sdb + 1);
> +	}
> +
> +	if (blk_bidi_rq(req)) {
> +		struct request *next_rq = req->next_rq;
> +		struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
> +
> +		memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
> +		bidi_sdb->table.sgl =
> +			(struct scatterlist *)(bidi_sdb + 1);
> +
> +		next_rq->special = bidi_sdb;
> +	}
> +
> +	switch (req->cmd_type) {
> +	case REQ_TYPE_FS:
> +		return scsi_cmd_to_driver(cmd)->init_command(cmd);
> +	case REQ_TYPE_BLOCK_PC:
> +		return scsi_setup_blk_pc_cmnd(cmd->device, req);
> +	default:
> +		return BLKPREP_KILL;
> +	}
> +}
> +
> +static void scsi_mq_done(struct scsi_cmnd *cmd)
> +{
> +	trace_scsi_dispatch_cmd_done(cmd);
> +	blk_mq_complete_request(cmd->request);
> +}
> +
> +static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
> +{
> +	struct request_queue *q = req->q;
> +	struct scsi_device *sdev = q->queuedata;
> +	struct Scsi_Host *shost = sdev->host;
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> +	int ret;
> +	int reason;
> +
> +	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
> +	if (ret)
> +		goto out;
> +
> +	ret = BLK_MQ_RQ_QUEUE_BUSY;
> +	if (!get_device(&sdev->sdev_gendev))
> +		goto out;
> +
> +	if (!scsi_dev_queue_ready(q, sdev))
> +		goto out_put_device;
> +	if (!scsi_target_queue_ready(shost, sdev))
> +		goto out_dec_device_busy;
> +	if (!scsi_host_queue_ready(q, shost, sdev))
> +		goto out_dec_target_busy;
> +
> +	if (!(req->cmd_flags & REQ_DONTPREP)) {
> +		ret = prep_to_mq(scsi_mq_prep_fn(req));
> +		if (ret)
> +			goto out_dec_host_busy;
> +		req->cmd_flags |= REQ_DONTPREP;
> +	}
> +
> +	scsi_init_cmd_errh(cmd);
> +	cmd->scsi_done = scsi_mq_done;
> +
> +	reason = scsi_dispatch_cmd(cmd);
> +	if (reason) {
> +		scsi_set_blocked(cmd, reason);
> +		ret = BLK_MQ_RQ_QUEUE_BUSY;
> +		goto out_dec_host_busy;
> +	}
> +
> +	return BLK_MQ_RQ_QUEUE_OK;
> +
> +out_dec_host_busy:
> +	cancel_delayed_work(&cmd->abort_work);
> +	atomic_dec(&shost->host_busy);
> +out_dec_target_busy:
> +	if (scsi_target(sdev)->can_queue > 0)
> +		atomic_dec(&scsi_target(sdev)->target_busy);
> +out_dec_device_busy:
> +	atomic_dec(&sdev->device_busy);
> +out_put_device:
> +	put_device(&sdev->sdev_gendev);
> +out:
> +	switch (ret) {
> +	case BLK_MQ_RQ_QUEUE_BUSY:
> +		blk_mq_stop_hw_queue(hctx);
> +		if (atomic_read(&sdev->device_busy) == 0 &&
> +		    !scsi_device_blocked(sdev))
> +			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
> +		break;
> +	case BLK_MQ_RQ_QUEUE_ERROR:
> +		/*
> +		 * Make sure to release all allocated resources when
> +		 * we hit an error, as we will never see this command
> +		 * again.
> +		 */
> +		if (req->cmd_flags & REQ_DONTPREP)
> +			scsi_mq_uninit_cmd(cmd);
> +		break;
> +	default:
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int scsi_init_request(void *data, struct request *rq,
> +		unsigned int hctx_idx, unsigned int request_idx,
> +		unsigned int numa_node)
> +{
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> +	cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
> +			numa_node);
> +	if (!cmd->sense_buffer)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static void scsi_exit_request(void *data, struct request *rq,
> +		unsigned int hctx_idx, unsigned int request_idx)
> +{
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> +	kfree(cmd->sense_buffer);
> +}
> +
>   u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
>   {
>   	struct device *host_dev;
> @@ -1710,16 +2000,10 @@ u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
>   }
>   EXPORT_SYMBOL(scsi_calculate_bounce_limit);
>
> -struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> -					 request_fn_proc *request_fn)
> +static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
>   {
> -	struct request_queue *q;
>   	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> -	if (!q)
> -		return NULL;
> -
>   	/*
>   	 * this limit is imposed by hardware restrictions
>   	 */
> @@ -1750,7 +2034,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>   	 * blk_queue_update_dma_alignment() later.
>   	 */
>   	blk_queue_dma_alignment(q, 0x03);
> +}
>
> +struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> +					 request_fn_proc *request_fn)
> +{
> +	struct request_queue *q;
> +
> +	q = blk_init_queue(request_fn, NULL);
> +	if (!q)
> +		return NULL;
> +	__scsi_init_queue(shost, q);
>   	return q;
>   }
>   EXPORT_SYMBOL(__scsi_alloc_queue);
> @@ -1771,6 +2065,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
>   	return q;
>   }
>
> +static struct blk_mq_ops scsi_mq_ops = {
> +	.map_queue	= blk_mq_map_queue,
> +	.queue_rq	= scsi_queue_rq,
> +	.complete	= scsi_softirq_done,
> +	.timeout	= scsi_times_out,
> +	.init_request	= scsi_init_request,
> +	.exit_request	= scsi_exit_request,
> +};
> +
> +struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
> +{
> +	sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
> +	if (IS_ERR(sdev->request_queue))
> +		return NULL;
> +
> +	sdev->request_queue->queuedata = sdev;
> +	__scsi_init_queue(sdev->host, sdev->request_queue);
> +	return sdev->request_queue;
> +}
> +
> +int scsi_mq_setup_tags(struct Scsi_Host *shost)
> +{
> +	unsigned int cmd_size, sgl_size, tbl_size;
> +
> +	tbl_size = shost->sg_tablesize;
> +	if (tbl_size > SCSI_MAX_SG_SEGMENTS)
> +		tbl_size = SCSI_MAX_SG_SEGMENTS;
> +	sgl_size = tbl_size * sizeof(struct scatterlist);
> +	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
> +	if (scsi_host_get_prot(shost))
> +		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
> +
> +	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
> +	shost->tag_set.ops = &scsi_mq_ops;
> +	shost->tag_set.nr_hw_queues = 1;
> +	shost->tag_set.queue_depth = shost->can_queue;
> +	shost->tag_set.cmd_size = cmd_size;
> +	shost->tag_set.numa_node = NUMA_NO_NODE;
> +	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> +	shost->tag_set.driver_data = shost;
> +
> +	return blk_mq_alloc_tag_set(&shost->tag_set);
> +}
> +
> +void scsi_mq_destroy_tags(struct Scsi_Host *shost)
> +{
> +	blk_mq_free_tag_set(&shost->tag_set);
> +}
> +
>   /*
>    * Function:    scsi_block_requests()
>    *
> @@ -2516,9 +2859,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
>   	 * block layer from calling the midlayer with this device's
>   	 * request queue.
>   	 */
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	blk_stop_queue(q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (q->mq_ops) {
> +		blk_mq_stop_hw_queues(q);
> +	} else {
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		blk_stop_queue(q);
> +		spin_unlock_irqrestore(q->queue_lock, flags);
> +	}
>
>   	return 0;
>   }
> @@ -2564,9 +2911,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
>   		 sdev->sdev_state != SDEV_OFFLINE)
>   		return -EINVAL;
>
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	blk_start_queue(q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (q->mq_ops) {
> +		blk_mq_start_stopped_hw_queues(q, false);
> +	} else {
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		blk_start_queue(q);
> +		spin_unlock_irqrestore(q->queue_lock, flags);
> +	}
>
>   	return 0;
>   }
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index a45d1c2..12b8e1b 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
>   extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
>   extern void scsi_run_host_queues(struct Scsi_Host *shost);
>   extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
> +extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
> +extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
> +extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
>   extern int scsi_init_queue(void);
>   extern void scsi_exit_queue(void);
>   struct request_queue;
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 4a6e4ba..b91cfaf 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
>   	 */
>   	sdev->borken = 1;
>
> -	sdev->request_queue = scsi_alloc_queue(sdev);
> +	if (shost_use_blk_mq(shost))
> +		sdev->request_queue = scsi_mq_alloc_queue(sdev);
> +	else
> +		sdev->request_queue = scsi_alloc_queue(sdev);
>   	if (!sdev->request_queue) {
>   		/* release fn is set up in scsi_sysfs_device_initialise, so
>   		 * have to free and put manually here */
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index deef063..6c9227f 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
>
>   static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
>
> +shost_rd_attr(use_blk_mq, "%d\n");
>   shost_rd_attr(unique_id, "%u\n");
>   shost_rd_attr(cmd_per_lun, "%hd\n");
>   shost_rd_attr(can_queue, "%hd\n");
> @@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
>   static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
>
>   static struct attribute *scsi_sysfs_shost_attrs[] = {
> +	&dev_attr_use_blk_mq.attr,
>   	&dev_attr_unique_id.attr,
>   	&dev_attr_host_busy.attr,
>   	&dev_attr_cmd_per_lun.attr,
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 7f9bbda..b54511e 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -7,6 +7,7 @@
>   #include <linux/workqueue.h>
>   #include <linux/mutex.h>
>   #include <linux/seq_file.h>
> +#include <linux/blk-mq.h>
>   #include <scsi/scsi.h>
>
>   struct request_queue;
> @@ -531,6 +532,9 @@ struct scsi_host_template {
>   	 */
>   	unsigned int cmd_size;
>   	struct scsi_host_cmd_pool *cmd_pool;
> +
> +	/* temporary flag to disable blk-mq I/O path */
> +	bool disable_blk_mq;
>   };
>
>   /*
> @@ -601,7 +605,10 @@ struct Scsi_Host {
>   	 * Area to keep a shared tag map (if needed, will be
>   	 * NULL if not).
>   	 */
> -	struct blk_queue_tag	*bqt;
> +	union {
> +		struct blk_queue_tag	*bqt;
> +		struct blk_mq_tag_set	tag_set;
> +	};
>
>   	atomic_t host_busy;		   /* commands actually active on low-level */
>   	atomic_t host_blocked;
> @@ -693,6 +700,8 @@ struct Scsi_Host {
>   	/* The controller does not support WRITE SAME */
>   	unsigned no_write_same:1;
>
> +	unsigned use_blk_mq:1;
> +
>   	/*
>   	 * Optional work queue to be utilized by the transport
>   	 */
> @@ -793,6 +802,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
>   		shost->tmf_in_progress;
>   }
>
> +extern bool scsi_use_blk_mq;
> +
> +static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
> +{
> +	return shost->use_blk_mq;
> +}
> +
>   extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
>   extern void scsi_flush_work(struct Scsi_Host *);
>
> diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
> index 81dd12e..cdcc90b 100644
> --- a/include/scsi/scsi_tcq.h
> +++ b/include/scsi/scsi_tcq.h
> @@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
>   	if (!sdev->tagged_supported)
>   		return;
>
> -	if (!blk_queue_tagged(sdev->request_queue))
> +	if (!shost_use_blk_mq(sdev->host) &&
> +	    blk_queue_tagged(sdev->request_queue))
>   		blk_queue_init_tags(sdev->request_queue, depth,
>   				    sdev->host->bqt);
>
> @@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
>    **/
>   static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
>   {
> -	if (blk_queue_tagged(sdev->request_queue))
> +	if (!shost_use_blk_mq(sdev->host) &&
> +	    blk_queue_tagged(sdev->request_queue))
>   		blk_queue_free_tags(sdev->request_queue);
>   	scsi_adjust_queue_depth(sdev, 0, depth);
>   }
> @@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
>   	return 0;
>   }
>
> +static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
> +		unsigned int hw_ctx, int tag)
> +{
> +	struct request *req;
> +
> +	req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
> +	return req ? (struct scsi_cmnd *)req->special : NULL;
> +}
> +
>   /**
>    * scsi_find_tag - find a tagged command by device
>    * @SDpnt:	pointer to the ScSI device
> @@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
>    **/
>   static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
>   {
> -
>           struct request *req;
>
>           if (tag != SCSI_NO_TAG) {
> +		if (shost_use_blk_mq(sdev->host))
> +			return scsi_mq_find_tag(sdev->host, 0, tag);
> +
>           	req = blk_queue_find_tag(sdev->request_queue, tag);
>   	        return req ? (struct scsi_cmnd *)req->special : NULL;
>   	}
> @@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
>   	return sdev->current_cmnd;
>   }
>
> +
>   /**
>    * scsi_init_shared_tag_map - create a shared tag map
>    * @shost:	the host to share the tag map among all devices
> @@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
>   static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
>   {
>   	/*
> +	 * We always have a shared tag map around when using blk-mq.
> +	 */
> +	if (shost_use_blk_mq(shost))
> +		return 0;
> +
> +	/*
>   	 * If the shared tag map isn't already initialized, do it now.
>   	 * This saves callers from having to check ->bqt when setting up
>   	 * devices on the shared host (for libata)
> @@ -165,6 +185,8 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
>   	struct request *req;
>
>   	if (tag != SCSI_NO_TAG) {
> +		if (shost_use_blk_mq(shost))
> +			return scsi_mq_find_tag(shost, 0, tag);
>   		req = blk_map_queue_find_tag(shost->bqt, tag);
>   		return req ? (struct scsi_cmnd *)req->special : NULL;
>   	}
>
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case
  2014-06-25 16:52 ` [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case Christoph Hellwig
@ 2014-07-09 11:27   ` Hannes Reinecke
  0 siblings, 0 replies; 99+ messages in thread
From: Hannes Reinecke @ 2014-07-09 11:27 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi,
	linux-kernel, Hiral Patel, Suma Ramars, Brian Uchino

On 06/25/2014 06:52 PM, Christoph Hellwig wrote:
> Currently the midlayer fakes up a struct request for the explicit reset
> ioctls, and those don't have a tag allocated to them.  The fnic driver pokes
> into midlayer structures to paper over this design issue, but that won't
> work for the blk-mq case.
>
> Either someone who can actually test the hardware will have to come up with
> a similar hack for the blk-mq case, or we'll have to bite the bullet and fix
> the way the EH ioctls work for real, but until that happens we fail these
> explicit requests here.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Cc: Hiral Patel <hiralpat@cisco.com>
> Cc: Suma Ramars <sramars@cisco.com>
> Cc: Brian Uchino <buchino@cisco.com>
> ---
>   drivers/scsi/fnic/fnic_scsi.c |   16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
>
> diff --git a/drivers/scsi/fnic/fnic_scsi.c b/drivers/scsi/fnic/fnic_scsi.c
> index 3f88f56..961bdf5 100644
> --- a/drivers/scsi/fnic/fnic_scsi.c
> +++ b/drivers/scsi/fnic/fnic_scsi.c
> @@ -2224,6 +2224,22 @@ int fnic_device_reset(struct scsi_cmnd *sc)
>
>   	tag = sc->request->tag;
>   	if (unlikely(tag < 0)) {
> +		/*
> +		 * XXX(hch): currently the midlayer fakes up a struct
> +		 * request for the explicit reset ioctls, and those
> +		 * don't have a tag allocated to them.  The below
> +		 * code pokes into midlayer structures to paper over
> +		 * this design issue, but that won't work for blk-mq.
> +		 *
> +		 * Either someone who can actually test the hardware
> +		 * will have to come up with a similar hack for the
> +		 * blk-mq case, or we'll have to bite the bullet and
> +		 * fix the way the EH ioctls work for real, but until
> +		 * that happens we fail these explicit requests here.
> +		 */
> +		if (shost_use_blk_mq(sc->device->host))
> +			goto fnic_device_reset_end;
> +
>   		tag = fnic_scsi_host_start_tag(fnic, sc);
>   		if (unlikely(tag == SCSI_NO_TAG))
>   			goto fnic_device_reset_end;
>
The correct fix will be part of my EH redesign.
The plan is to allocate a real command/request for EH, which can then
be used to send down EH TMFs and related commands.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 01/14] sd: don't use rq->cmd_len before setting it up
  2014-07-09 11:12     ` Hannes Reinecke
  (?)
@ 2014-07-09 15:03     ` Christoph Hellwig
  -1 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-09 15:03 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel

FYI, this has been dropped from the series in favour of always memsetting
the cdb in common code.  Take a look at the "RFC: clean up command setup"
series, on top of which I have rebased the scsi-mq changes.
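
The gist of it, as a rough sketch rather than the literal patch: zero
the CDB once in the common setup path so the individual init_command
routines don't have to, e.g.

	memset(cmd->cmnd, 0, BLK_MAX_CDB);

before calling into the driver's init_command.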


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit
  2014-07-09 11:19   ` Hannes Reinecke
@ 2014-07-09 15:05     ` Christoph Hellwig
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-09 15:05 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel

On Wed, Jul 09, 2014 at 01:19:41PM +0200, Hannes Reinecke wrote:
>>    host_not_ready:
>> -	atomic_dec(&scsi_target(sdev)->target_busy);
>> +	if (scsi_target(sdev)->can_queue > 0)
>> +		atomic_dec(&scsi_target(sdev)->target_busy);
>>    not_ready:
>>   	/*
>>   	 * lock q, handle tag, requeue req, and decrement device_busy. We
>>
> Hmm. 'can_queue' can be changed by the LLDD. Don't we need some sort of 
> synchronization here?

While a few drivers change the host can_queue value at runtime, none
do so for the target.  And although I don't think drivers should even
change the host one, modifying the target one is perfectly fine as long
as no driver drops it to zero.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-08 14:48 ` Christoph Hellwig
@ 2014-07-09 16:39   ` Douglas Gilbert
  2014-07-09 19:38     ` Jens Axboe
  2014-07-14  9:13   ` Sagi Grimberg
  1 sibling, 1 reply; 99+ messages in thread
From: Douglas Gilbert @ 2014-07-09 16:39 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel

On 14-07-08 10:48 AM, Christoph Hellwig wrote:
> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
>> Changes from V1:
>>   - rebased on top of the core-for-3.17 branch, most notable the
>>     scsi logging changes
>>   - fixed handling of cmd_list to prevent crashes for some heavy
>>     workloads
>>   - fixed incorrect handling of !target->can_queue
>>   - avoid scheduling a workqueue on I/O completions when no queues
>>     are congested
>>
>> In addition to the patches in this thread there also is a git available at:
>>
>> 	git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>
>
> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> from June 29th.  Robert Elliott found a problem with not fully zeroed
> out UNMAP CDBs, which is fixed by the saner discard handling in that
> series.
>
> There is a new patch to factor the code from the above series for
> blk-mq use, which I've attached below.  Besides that the only changes
> are minor merge fixups in the main blk-mq usage patch.

Be warned: both Rob Elliott and I can easily break
the scsi-mq.3 branch. It seems as though a regression
has slipped in. I notice that Christoph has added a
new branch called "scsi-mq.3-no-rebase".

For those interested, watch this space.

Doug Gilbert


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 08/14] scsi: convert device_busy to atomic_t
  2014-06-25 16:51 ` [PATCH 08/14] scsi: convert device_busy " Christoph Hellwig
  2014-07-09 11:16   ` Hannes Reinecke
@ 2014-07-09 16:49   ` James Bottomley
  2014-07-10  6:01     ` Christoph Hellwig
  1 sibling, 1 reply; 99+ messages in thread
From: James Bottomley @ 2014-07-09 16:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

On Wed, 2014-06-25 at 18:51 +0200, Christoph Hellwig wrote:
> Avoid taking the queue_lock to check the per-device queue limit.  Instead
> we do an atomic_inc_return early on to grab our slot in the queue,
> and if nessecary decrement it after finishing all checks.
> 
> Unlike the host and target busy counters this doesn't allow us to avoid the
> queue_lock in the request_fn due to the way the interface works, but it'll
> allow us to prepare for using the blk-mq code, which doesn't use the
> queue_lock at all, and it at least avoids a queue_lock rountrip in
> scsi_device_unbusy, which is still important given how busy the queue_lock
> is.

Most of these patches look fine to me, but this one worries me largely
because of the expense of atomics.

As far as I can tell from the block MQ, we get one CPU thread per LUN.
Doesn't this mean that we only need true atomics for variables that
cross threads?  That does mean target and host, but shouldn't mean
device, since device == LUN.  As long as we protect from local
interrupts, we should be able to exclusively update all LUN local
variables without having to change them to atomics.

This view depends on correct CPU steering of returning interrupts, since
the LUN thread model only works if the same CPU handles issue and
completion, but it looks like that works in MQ, even if it doesn't work
in vanilla.
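
To illustrate -- a sketch only, with device_busy left as a plain
counter owned by the LUN's submission CPU, rather than the atomic_t
this patch introduces:

	unsigned long flags;

	local_irq_save(flags);
	sdev->device_busy++;	/* only ever touched by this CPU */
	local_irq_restore(flags);

i.e. local interrupt protection instead of a full atomic.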

James



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-09 16:39   ` Douglas Gilbert
@ 2014-07-09 19:38     ` Jens Axboe
  2014-07-10  0:53       ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Jens Axboe @ 2014-07-09 19:38 UTC (permalink / raw)
  To: dgilbert, Christoph Hellwig, James Bottomley, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel

On 2014-07-09 18:39, Douglas Gilbert wrote:
> On 14-07-08 10:48 AM, Christoph Hellwig wrote:
>> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
>>> Changes from V1:
>>>   - rebased on top of the core-for-3.17 branch, most notable the
>>>     scsi logging changes
>>>   - fixed handling of cmd_list to prevent crashes for some heavy
>>>     workloads
>>>   - fixed incorrect handling of !target->can_queue
>>>   - avoid scheduling a workqueue on I/O completions when no queues
>>>     are congested
>>>
>>> In addition to the patches in this thread there also is a git
>>> available at:
>>>
>>>     git://git.infradead.org/users/hch/scsi.git scsi-mq.2
>>
>>
>> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
>> latest core-for-3.17 tree + the "RFC: clean up command setup" series
>> from June 29th.  Robert Elliott found a problem with not fully zeroed
>> out UNMAP CDBs, which is fixed by the saner discard handling in that
>> series.
>>
>> There is a new patch to factor the code from the above series for
>> blk-mq use, which I've attached below.  Besides that the only changes
>> are minor merge fixups in the main blk-mq usage patch.
>
> Be warned: both Rob Elliott and I can easily break
> the scsi-mq.3 branch. It seems as though a regression
> has slipped in. I notice that Christoph has added a
> new branch called "scsi-mq.3-no-rebase".

Rob/Doug, those issues look very much like problems in the aio code. Can 
either/both of you try with:

f8567a3845ac05bb28f3c1b478ef752762bd39ef
edfbbf388f293d70bf4b7c0bc38774d05e6f711a

reverted (in that order) and see if that changes anything.
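
That is, from a scsi-mq.3 checkout, something like:

	git revert f8567a3845ac05bb28f3c1b478ef752762bd39ef
	git revert edfbbf388f293d70bf4b7c0bc38774d05e6f711a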


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-09 19:38     ` Jens Axboe
@ 2014-07-10  0:53       ` Elliott, Robert (Server Storage)
  2014-07-10  6:20         ` Christoph Hellwig
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-10  0:53 UTC (permalink / raw)
  To: Jens Axboe, dgilbert, Christoph Hellwig, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel



> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Wednesday, 09 July, 2014 2:38 PM
> To: dgilbert@interlog.com; Christoph Hellwig; James Bottomley; Bart Van
> Assche; Elliott, Robert (Server Storage); linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On 2014-07-09 18:39, Douglas Gilbert wrote:
> > On 14-07-08 10:48 AM, Christoph Hellwig wrote:
> >> On Wed, Jun 25, 2014 at 06:51:47PM +0200, Christoph Hellwig wrote:
> >>> Changes from V1:
> >>>   - rebased on top of the core-for-3.17 branch, most notable the
> >>>     scsi logging changes
> >>>   - fixed handling of cmd_list to prevent crashes for some heavy
> >>>     workloads
> >>>   - fixed incorrect handling of !target->can_queue
> >>>   - avoid scheduling a workqueue on I/O completions when no queues
> >>>     are congested
> >>>
> >>> In addition to the patches in this thread there also is a git
> >>> available at:
> >>>
> >>>     git://git.infradead.org/users/hch/scsi.git scsi-mq.2
> >>
> >>
> >> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> >> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> >> from June 29th.  Robert Elliott found a problem with not fully zeroed
> >> out UNMAP CDBs, which is fixed by the saner discard handling in that
> >> series.
> >>
> >> There is a new patch to factor the code from the above series for
> >> blk-mq use, which I've attached below.  Besides that the only changes
> >> are minor merge fixups in the main blk-mq usage patch.
> >
> > Be warned: both Rob Elliott and I can easily break
> > the scsi-mq.3 branch. It seems as though a regression
> > has slipped in. I notice that Christoph has added a
> > new branch called "scsi-mq.3-no-rebase".
> 
> Rob/Doug, those issues look very much like problems in the aio code. Can
> either/both of you try with:
> 
> f8567a3845ac05bb28f3c1b478ef752762bd39ef
> edfbbf388f293d70bf4b7c0bc38774d05e6f711a
> 
> reverted (in that order) and see if that changes anything.
> 
> 
> --
> Jens Axboe

scsi-mq.3-no-rebase, which has all the scsi updates from scsi-mq.3
but is based on 3.16.0-rc2 rather than 3.16.0-rc4, works fine:
* ^C exits fio cleanly with scsi_debug devices
* ^C exits fio cleanly with mpt3sas devices
* fio hits 1M IOPS with 16 hpsa devices
* fio hits 700K IOPS with 6 mpt3sas devices
* 38 device test to mpt3sas, hpsa, and scsi_debug devices runs OK


With:
* scsi-mq.3, which is based on 3.16.0-rc4
* [PATCH] x86-64: fix vDSO build from https://lkml.org/lkml/2014/7/3/738
* those two aio patches reverted

the problem still occurs - fio results in low or 0 IOPS, with perf top 
reporting unusual amounts of time spent in do_io_submit and io_submit.

perf top:
 14.38%  [kernel]              [k] do_io_submit
 13.71%  libaio.so.1.0.1       [.] io_submit
 13.32%  [kernel]              [k] system_call
 11.60%  [kernel]              [k] system_call_after_swapgs
  8.88%  [kernel]              [k] lookup_ioctx
  8.78%  [kernel]              [k] copy_user_generic_string
  7.78%  [kernel]              [k] io_submit_one
  5.97%  [kernel]              [k] blk_flush_plug_list
  2.73%  fio                   [.] fio_libaio_commit
  2.70%  [kernel]              [k] sysret_check
  2.68%  [kernel]              [k] blk_finish_plug
  1.98%  [kernel]              [k] blk_start_plug
  1.17%  [kernel]              [k] SyS_io_submit
  1.17%  [kernel]              [k] __get_user_4
  0.99%  fio                   [.] io_submit@plt
  0.85%  [kernel]              [k] _copy_from_user
  0.79%  [kernel]              [k] system_call_fastpath

Repeating some of last night's investigation details for the lists:

ftrace of one of the CPUs for all functions shows these 
are repeatedly being called:
 
           <...>-34508 [004] ....  6360.790714: io_submit_one <-do_io_submit
           <...>-34508 [004] ....  6360.790714: blk_finish_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790714: blk_flush_plug_list <-blk_finish_plug
           <...>-34508 [004] ....  6360.790714: SyS_io_submit <-system_call_fastpath
           <...>-34508 [004] ....  6360.790715: do_io_submit <-SyS_io_submit
           <...>-34508 [004] ....  6360.790715: lookup_ioctx <-do_io_submit
           <...>-34508 [004] ....  6360.790715: blk_start_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790715: io_submit_one <-do_io_submit
           <...>-34508 [004] ....  6360.790715: blk_finish_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790715: blk_flush_plug_list <-blk_finish_plug
           <...>-34508 [004] ....  6360.790715: SyS_io_submit <-system_call_fastpath
           <...>-34508 [004] ....  6360.790715: do_io_submit <-SyS_io_submit
           <...>-34508 [004] ....  6360.790715: lookup_ioctx <-do_io_submit
           <...>-34508 [004] ....  6360.790716: blk_start_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790716: io_submit_one <-do_io_submit
           <...>-34508 [004] ....  6360.790716: blk_finish_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790716: blk_flush_plug_list <-blk_finish_plug
           <...>-34508 [004] ....  6360.790716: SyS_io_submit <-system_call_fastpath
           <...>-34508 [004] ....  6360.790716: do_io_submit <-SyS_io_submit
           <...>-34508 [004] ....  6360.790716: lookup_ioctx <-do_io_submit
           <...>-34508 [004] ....  6360.790716: blk_start_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790717: io_submit_one <-do_io_submit
           <...>-34508 [004] ....  6360.790717: blk_finish_plug <-do_io_submit
           <...>-34508 [004] ....  6360.790717: blk_flush_plug_list <-blk_finish_plug
           <...>-34508 [004] ....  6360.790717: SyS_io_submit <-system_call_fastpath
           <...>-34508 [004] ....  6360.790717: do_io_submit <-SyS_io_submit

fs/aio.c do_io_submit is apparently completing (many times) - it's not
stuck in the for loop:
        blk_start_plug(&plug);

        /*
         * AKPM: should this return a partial result if some of the IOs were
         * successfully submitted?
         */
        for (i=0; i<nr; i++) {
                struct iocb __user *user_iocb;
                struct iocb tmp;

                if (unlikely(__get_user(user_iocb, iocbpp + i))) {
                        ret = -EFAULT;
                        break;
                }

                if (unlikely(copy_from_user(&tmp, user_iocb, sizeof(tmp)))) {
                        ret = -EFAULT;
                        break;
                }

                ret = io_submit_one(ctx, user_iocb, &tmp, compat);
                if (ret)
                        break;
        }
        blk_finish_plug(&plug);


fs/aio.c io_submit_one is not getting to fget, which is traceable:
        /* enforce forwards compatibility on users */
        if (unlikely(iocb->aio_reserved1 || iocb->aio_reserved2)) {
                pr_debug("EINVAL: reserve field set\n");
                return -EINVAL;
        }

        /* prevent overflows */
        if (unlikely(
            (iocb->aio_buf != (unsigned long)iocb->aio_buf) ||
            (iocb->aio_nbytes != (size_t)iocb->aio_nbytes) ||
            ((ssize_t)iocb->aio_nbytes < 0)
           )) {
                pr_debug("EINVAL: io_submit: overflow check\n");
                return -EINVAL;
        }

        req = aio_get_req(ctx);
        if (unlikely(!req))
                return -EAGAIN;

        req->ki_filp = fget(iocb->aio_fildes);

I don't have that file compiled with -DDEBUG so the pr_debug
prints are unavailable.  The -EAGAIN seems most likely to lead 
to a hang like this.
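
(They could presumably be enabled by rebuilding with CFLAGS_aio.o :=
-DDEBUG in fs/Makefile, or at runtime via dynamic debug, e.g.:

	echo 'file fs/aio.c +p' > /sys/kernel/debug/dynamic_debug/control

assuming CONFIG_DYNAMIC_DEBUG is enabled.)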

aio_get_req is not getting to kmem_cache_alloc, which is
traceable:
        if (!get_reqs_available(ctx))
                return NULL;

        req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);

get_reqs_available is probably returning false because not
enough reqs are available compared to req_batch:

        struct kioctx_cpu *kcpu;
        bool ret = false;

        preempt_disable();
        kcpu = this_cpu_ptr(ctx->cpu);

        if (!kcpu->reqs_available) {
                int old, avail = atomic_read(&ctx->reqs_available);

                do {
                        if (avail < ctx->req_batch)
                                goto out;

                        old = avail;
                        avail = atomic_cmpxchg(&ctx->reqs_available,
                                               avail, avail - ctx->req_batch);
                } while (avail != old);

                kcpu->reqs_available += ctx->req_batch;
        }

        ret = true;
        kcpu->reqs_available--;
out:
        preempt_enable();
        return ret;
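
One way to confirm that theory would be a rate-limited print in the
inner loop above, e.g. (hypothetical instrumentation, not in the tree):

	if (avail < ctx->req_batch) {
		printk_ratelimited("ctx %p avail=%d req_batch=%u\n",
				   ctx, avail, ctx->req_batch);
		goto out;
	}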


---
Rob Elliott    HP Server Storage




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 08/14] scsi: convert device_busy to atomic_t
  2014-07-09 16:49   ` James Bottomley
@ 2014-07-10  6:01     ` Christoph Hellwig
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-10  6:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Hellwig, Jens Axboe, Bart Van Assche, Robert Elliott,
	linux-scsi, linux-kernel

On Wed, Jul 09, 2014 at 09:49:56AM -0700, James Bottomley wrote:
> As far as I can tell from the block MQ, we get one CPU thread per LUN.

No, that's entirely incorrect.  IFF a device supports multiple hardware
queues we only submit I/O from the CPUs (there might be more than one)
that the queue is bound to.  With the single hardware queue supported
by most hardware, submissions can and will happen from any CPU.  Note that
this patchset doesn't even support multiple hardware queues yet, although
it should be fairly simple to add once the low level driver support is
ready.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess
  2014-07-09 11:12   ` Hannes Reinecke
@ 2014-07-10  6:06     ` Christoph Hellwig
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-10  6:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel

On Wed, Jul 09, 2014 at 01:12:17PM +0200, Hannes Reinecke wrote:
> Hmm. I guess there is a race window between
> atomic_read() and atomic_set().
> Doesn't this cause issues when someone calls atomic_set() just before the 
> call to atomic_read?

There is a race window just _after_ the atomic_read, but it's harmless.
The whole _blocked scheme is a backoff to avoid resubmitting I/O all
the time when the HBA or target returned a busy status.  If we race
and incorrectly reset it we will submit I/O and just get a busy indicator
again.

On the other hand doing the atomic_set all the time introduces three atomic
ops in the I/O completion path that are entirely pointless most of the time.

I guess I should add something like this as a comment to the code..
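
Something along these lines, say:

	/*
	 * The _blocked counters are only a backoff mechanism to avoid
	 * resubmitting I/O all the time when the HBA or target returned
	 * a busy indication.  Racing with a reader and resetting a
	 * counter incorrectly is harmless: we'll just dispatch I/O and
	 * get another busy indication.  In exchange we avoid pointless
	 * atomic updates in the I/O completion path for the common case.
	 */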

Note that the old code didn't use any sort of synchronization either.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10  0:53       ` Elliott, Robert (Server Storage)
@ 2014-07-10  6:20         ` Christoph Hellwig
  2014-07-10 13:36           ` Benjamin LaHaise
  2014-07-10 15:51           ` Elliott, Robert (Server Storage)
  0 siblings, 2 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-10  6:20 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jens Axboe, dgilbert, Christoph Hellwig, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel

On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage) wrote:
> the problem still occurs - fio results in low or 0 IOPS, with perf top 
> reporting unusual amounts of time spent in do_io_submit and io_submit.

The diff between the two version doesn't show too much other possible
interesting commits, the most interesting being some minor block
updates.

I guess we'll have to do a manual bisect.  I've pushed out a
scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
the block tree, and a scsi-mq.3-bisect-2 branch that is just after
the merge of the block tree to get started.
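
I.e. something like:

	git fetch git://git.infradead.org/users/hch/scsi.git \
		scsi-mq.3-bisect-1:scsi-mq.3-bisect-1 \
		scsi-mq.3-bisect-2:scsi-mq.3-bisect-2

and then test each branch in turn.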


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10  6:20         ` Christoph Hellwig
@ 2014-07-10 13:36           ` Benjamin LaHaise
  2014-07-10 13:39             ` Jens Axboe
  2014-07-10 13:50             ` Christoph Hellwig
  2014-07-10 15:51           ` Elliott, Robert (Server Storage)
  1 sibling, 2 replies; 99+ messages in thread
From: Benjamin LaHaise @ 2014-07-10 13:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Elliott, Robert (Server Storage),
	Jens Axboe, dgilbert, James Bottomley, Bart Van Assche,
	linux-scsi, linux-kernel

On Wed, Jul 09, 2014 at 11:20:40PM -0700, Christoph Hellwig wrote:
> On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage) wrote:
> > the problem still occurs - fio results in low or 0 IOPS, with perf top 
> > reporting unusual amounts of time spent in do_io_submit and io_submit.
> 
> The diff between the two version doesn't show too much other possible
> interesting commits, the most interesting being some minor block
> updates.
> 
> I guess we'll have to do a manual bisect.  I've pushed out a
> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> the block tree, and a scsi-mq.3-bisect-2 branch that is just after
> the merge of the block tree to get started.

There is one possible concern that could be exacerbated by other changes in 
the system: if the application is running close to the bare minimum number 
of requests allocated in io_setup(), the per cpu reference counters will 
have a hard time batching things.  It might be worth testing with an 
increased number of requests being allocated if this is the case.
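
In io_setup() terms the experiment is simply giving the context some
headroom over the actual queue depth, e.g. (sketch, with 'iodepth'
standing in for whatever the application really needs):

	io_context_t ctx = 0;
	int ret = io_setup(2 * iodepth, &ctx);	/* instead of io_setup(iodepth, &ctx) */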

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:36           ` Benjamin LaHaise
@ 2014-07-10 13:39             ` Jens Axboe
  2014-07-10 13:44               ` Benjamin LaHaise
  2014-07-10 13:50             ` Christoph Hellwig
  1 sibling, 1 reply; 99+ messages in thread
From: Jens Axboe @ 2014-07-10 13:39 UTC (permalink / raw)
  To: Benjamin LaHaise, Christoph Hellwig
  Cc: Elliott, Robert (Server Storage),
	dgilbert, James Bottomley, Bart Van Assche, linux-scsi,
	linux-kernel

On 2014-07-10 15:36, Benjamin LaHaise wrote:
> On Wed, Jul 09, 2014 at 11:20:40PM -0700, Christoph Hellwig wrote:
>> On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage) wrote:
>>> the problem still occurs - fio results in low or 0 IOPS, with perf top
>>> reporting unusual amounts of time spent in do_io_submit and io_submit.
>>
>> The diff between the two version doesn't show too much other possible
>> interesting commits, the most interesting being some minor block
>> updates.
>>
>> I guess we'll have to do a manual bisect.  I've pushed out a
>> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
>> the block tree, and a scsi-mq.3-bisect-2 branch that is just after
>> the merge of the block tree to get started.
>
> There is one possible concern that could be exacerbated by other changes in
> the system: if the application is running close to the bare minimum number
> of requests allocated in io_setup(), the per cpu reference counters will
> have a hard time batching things.  It might be worth testing with an
> increased number of requests being allocated if this is the case.

That's how fio always runs, it sets up the context with the exact queue 
depth that it needs. Do we have a good enough understanding of other aio 
use cases to say that this isn't the norm? I would expect it to be, it's 
the way that the API would most obviously be used.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:39             ` Jens Axboe
@ 2014-07-10 13:44               ` Benjamin LaHaise
  2014-07-10 13:48                 ` Jens Axboe
  0 siblings, 1 reply; 99+ messages in thread
From: Benjamin LaHaise @ 2014-07-10 13:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Elliott, Robert (Server Storage),
	dgilbert, James Bottomley, Bart Van Assche, linux-scsi,
	linux-kernel

On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
> That's how fio always runs, it sets up the context with the exact queue 
> depth that it needs. Do we have a good enough understanding of other aio 
> use cases to say that this isn't the norm? I would expect it to be, it's 
> the way that the API would most obviously be used.

The problem with this approach is that it works very poorly with per cpu 
reference counting's batching of references, which is pretty much a 
requirement now that many core systems are the norm.  Allocating the bare 
minimum is not the right thing to do today.  That said, the default limits 
on the number of requests probably needs to be raised.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:44               ` Benjamin LaHaise
@ 2014-07-10 13:48                 ` Jens Axboe
  2014-07-10 13:50                   ` Benjamin LaHaise
  0 siblings, 1 reply; 99+ messages in thread
From: Jens Axboe @ 2014-07-10 13:48 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Christoph Hellwig, Elliott, Robert (Server Storage),
	dgilbert, James Bottomley, Bart Van Assche, linux-scsi,
	linux-kernel

On 2014-07-10 15:44, Benjamin LaHaise wrote:
> On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
>> That's how fio always runs, it sets up the context with the exact queue
>> depth that it needs. Do we have a good enough understanding of other aio
>> use cases to say that this isn't the norm? I would expect it to be, it's
>> the way that the API would most obviously be used.
>
> The problem with this approach is that it works very poorly with per cpu
> reference counting's batching of references, which is pretty much a
> requirement now that many core systems are the norm.  Allocating the bare
> minimum is not the right thing to do today.  That said, the default limits
> on the number of requests probably needs to be raised.

Sorry, that's a complete cop-out. Then you handle this internally, 
allocate a bigger pool and cap the limit if you need to. Look at the 
API. You pass in the number of requests you will use. Do you expect 
anyone to double up, just in case? Will never happen.

But all of this is side stepping the point that there's a real bug 
reported here. The above could potentially explain the "it's using X 
more CPU, or it's Y slower". The above is a softlock, it never completes.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:48                 ` Jens Axboe
@ 2014-07-10 13:50                   ` Benjamin LaHaise
  2014-07-10 13:52                     ` Jens Axboe
  0 siblings, 1 reply; 99+ messages in thread
From: Benjamin LaHaise @ 2014-07-10 13:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Elliott, Robert (Server Storage),
	dgilbert, James Bottomley, Bart Van Assche, linux-scsi,
	linux-kernel

On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:
> On 2014-07-10 15:44, Benjamin LaHaise wrote:
> >On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
> >>That's how fio always runs, it sets up the context with the exact queue
> >>depth that it needs. Do we have a good enough understanding of other aio
> >>use cases to say that this isn't the norm? I would expect it to be, it's
> >>the way that the API would most obviously be used.
> >
> >The problem with this approach is that it works very poorly with per cpu
> >reference counting's batching of references, which is pretty much a
> >requirement now that many core systems are the norm.  Allocating the bare
> >minimum is not the right thing to do today.  That said, the default limits
> >on the number of requests probably needs to be raised.
> 
> Sorry, that's a complete cop-out. Then you handle this internally, 
> allocate a bigger pool and cap the limit if you need to. Look at the 
> API. You pass in the number of requests you will use. Do you expect 
> anyone to double up, just in case? Will never happen.
> 
> But all of this is side stepping the point that there's a real bug 
> reported here. The above could potentially explain the "it's using X 
> more CPU, or it's Y slower". The above is a softlock, it never completes.

I'm not trying to cop out on this -- I'm asking for a data point to see 
if changing the request limits has any effect.

		-ben

> -- 
> Jens Axboe

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:36           ` Benjamin LaHaise
  2014-07-10 13:39             ` Jens Axboe
@ 2014-07-10 13:50             ` Christoph Hellwig
  2014-07-10 13:52               ` Jens Axboe
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-10 13:50 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Elliott, Robert (Server Storage),
	Jens Axboe, dgilbert, James Bottomley, Bart Van Assche,
	linux-scsi, linux-kernel

On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> There is one possible concern that could be exacerbated by other changes in 
> the system: if the application is running close to the bare minimum number 
> of requests allocated in io_setup(), the per cpu reference counters will 
> have a hard time batching things.  It might be worth testing with an 
> increased number of requests being allocated if this is the case.

Well, Robert said reverting the two aio commits didn't help.  Either he
didn't manage to boot into the right kernel, or we need to look
elsewhere for the culprit.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:50                   ` Benjamin LaHaise
@ 2014-07-10 13:52                     ` Jens Axboe
  0 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2014-07-10 13:52 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Christoph Hellwig, Elliott, Robert (Server Storage),
	dgilbert, James Bottomley, Bart Van Assche, linux-scsi,
	linux-kernel

On 2014-07-10 15:50, Benjamin LaHaise wrote:
> On Thu, Jul 10, 2014 at 03:48:10PM +0200, Jens Axboe wrote:
>> On 2014-07-10 15:44, Benjamin LaHaise wrote:
>>> On Thu, Jul 10, 2014 at 03:39:57PM +0200, Jens Axboe wrote:
>>>> That's how fio always runs, it sets up the context with the exact queue
>>>> depth that it needs. Do we have a good enough understanding of other aio
>>>> use cases to say that this isn't the norm? I would expect it to be, it's
>>>> the way that the API would most obviously be used.
>>>
>>> The problem with this approach is that it works very poorly with per cpu
>>> reference counting's batching of references, which is pretty much a
>>> requirement now that many core systems are the norm.  Allocating the bare
>>> minimum is not the right thing to do today.  That said, the default limits
>>> on the number of requests probably needs to be raised.
>>
>> Sorry, that's a complete cop-out. Then you handle this internally,
>> allocate a bigger pool and cap the limit if you need to. Look at the
>> API. You pass in the number of requests you will use. Do you expect
>> anyone to double up, just in case? Will never happen.
>>
>> But all of this is side stepping the point that there's a real bug
>> reported here. The above could potentially explain the "it's using X
>> more CPU, or it's Y slower". The above is a softlock, it never completes.
>
> I'm not trying to cop out on this -- I'm asking for a data point to see
> if changing the request limits has any effect.

Fair enough, if the question is "does it solve the regression", then 
it's a valid data point. Rob/Doug, for fio, you can just double the 
iodepth passed in in engines/libaio:fio_libaio_init() and test with that 
and see if it makes a difference.
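
Assuming the engine still sets the context up with
io_queue_init(td->o.iodepth, &ld->aio_ctx), that's a one-liner:

	err = io_queue_init(td->o.iodepth * 2, &ld->aio_ctx);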

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 13:50             ` Christoph Hellwig
@ 2014-07-10 13:52               ` Jens Axboe
  2014-07-10 14:36                 ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Jens Axboe @ 2014-07-10 13:52 UTC (permalink / raw)
  To: Christoph Hellwig, Benjamin LaHaise
  Cc: Elliott, Robert (Server Storage),
	dgilbert, James Bottomley, Bart Van Assche, linux-scsi,
	linux-kernel

On 2014-07-10 15:50, Christoph Hellwig wrote:
> On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
>> There is one possible concern that could be exacerbated by other changes in
>> the system: if the application is running close to the bare minimum number
>> of requests allocated in io_setup(), the per cpu reference counters will
>> have a hard time batching things.  It might be worth testing with an
>> increased number of requests being allocated if this is the case.
>
> Well, Robert said reverting the two aio commits didn't help.  Either he
> didn't manage to boot into the right kernel, or we need to look
> elsewhere for the culprit.

Rob, let me know what scsi_debug setup you use, and I can try and 
reproduce it here as well.
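
For reference, a typical low-latency invocation here would be
something like:

	modprobe scsi_debug delay=0 dev_size_mb=256 num_tgts=6 max_queue=64

but I'd rather match your parameters exactly.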

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-10 13:52               ` Jens Axboe
@ 2014-07-10 14:36                 ` Elliott, Robert (Server Storage)
  2014-07-10 14:45                   ` Benjamin LaHaise
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-10 14:36 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Benjamin LaHaise
  Cc: dgilbert, James Bottomley, Bart Van Assche, linux-scsi, linux-kernel



> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Thursday, 10 July, 2014 8:53 AM
> To: Christoph Hellwig; Benjamin LaHaise
> Cc: Elliott, Robert (Server Storage); dgilbert@interlog.com; James Bottomley;
> Bart Van Assche; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On 2014-07-10 15:50, Christoph Hellwig wrote:
> > On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> >> There is one possible concern that could be exacerbated by other changes
> in
> >> the system: if the application is running close to the bare minimum number
> >> of requests allocated in io_setup(), the per cpu reference counters will
> >> have a hard time batching things.  It might be worth testing with an
> >> increased number of requests being allocated if this is the case.
> >
> > Well, Robert said reverting the two aio commits didn't help.  Either he
> > didn't manage to boot into the right kernel, or we need to look
> > elsewhere for the culprit.
> 
> Rob, let me know what scsi_debug setup you use, and I can try and
> reproduce it here as well.
> 
> --
> Jens Axboe

This system has 6 online CPUs and 64 possible CPUs.

Printing avail and req_batch in that loop results in many of these:
** 3813 printk messages dropped ** [10643.503772] ctx ffff88042d8d4cc0 avail=0 req_batch=2

Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
those pr_debug prints results in nothing extra printing,
so it's not hitting an error.

Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
these as fio starts:

[  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.435370] hrtimer: interrupt took 6868 ns
[  186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.449896] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449900] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449901] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.449909] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.449932] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.449933] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461147] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461176] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.461177] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461181] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.461181] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461184] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.461185] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.461185] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.461191] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.461192] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.474426] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.481353] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.483706] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.483707] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.483709] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.483710] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.483712] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.483717] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.495441] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.495444] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.495445] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.490] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.495451] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.495457] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.495457] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.495460] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.495461] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.495463] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.495464] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.499429] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.499437] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.619785] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.627371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.627374] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.627383] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.627384] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.627385] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.628371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.628372] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.630361] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.665329] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.666360] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.666361] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.666366] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.666367] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.666367] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.666369] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[  186.666370] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  186.670369] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.670372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.670373] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  186.767323] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[  187.211053] ioctx_alloc: nr_events=512 aio_max_nr=65536
[  187.213053] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]

Subsequently added prints showed nr_events coming in with
an initial value of 0x7FFFFFFF in those cases where it showed
as -2 above.  Since the last call had a reasonable value
of 512, it doesn't seem like a problem.
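
For reference, a minimal userspace sketch of the arithmetic behind
those -2 values, assuming the debug prints used %d and that
ioctx_alloc() clamps with max() and then doubles, as the "initial"
and "after max" prints in the second log below suggest:

/* cc -o wrap wrap.c */
#include <stdio.h>

int main(void)
{
	unsigned int nr_events = 2147483647u;	/* 0x7FFFFFFF from io_setup() */
	unsigned int cpus = 64;			/* num_possible_cpus() here */

	if (nr_events < cpus * 4)		/* max(nr_events, cpus * 4) */
		nr_events = cpus * 4;
	nr_events *= 2;				/* wraps to 4294967294 */

	printf("nr_events=%d\n", (int)nr_events);	/* prints nr_events=-2 */
	return 0;
}

The same arithmetic turns the iodepth=96 calls into max(96, 256) = 256
and then 512, matching the other prints.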


script to create the scsi_debug devices:
#!/bin/bash
devices=6
delay=0         # -2 = jiffy, -1 = hi jiffy, 0 = none, 1..n = longer
ndelay=20000    # 20 us
opts=0x4801     #
every_nth=3
capacity_mib=0
capacity_gib=$((2 * 1024))
lbpu=1
lbpws=1

modprobe -r scsi_debug
modprobe -v scsi_debug fake_rw=1 delay=$delay ndelay=$ndelay num_tgts=$devices opts=$opts every_nth=$every_nth physblk_exp=3 lbpu=$lbpu lbpws=$lbpws dev_size_mb=$capacity_mib virtual_gb=$capacity_gib
lsscsi -s
lsblk
# the assigned /dev names will vary...
for device in /sys/block/sda[h-m]
do
        echo none > $device/device/queue_type
done

fio script:
[global]
direct=1
invalidate=1
ioengine=libaio
norandommap
randrepeat=0
bs=4096
iodepth=96
numjobs=6
runtime=216000
time_based=1
group_reporting
thread
gtod_reduce=1
iodepth_batch=16
iodepth_batch_complete=16
cpus_allowed=0-5
cpus_allowed_policy=split
rw=randread

[4_KiB_RR_drive_ah]
filename=/dev/sdah

[4_KiB_RR_drive_ai]
filename=/dev/sdai

[4_KiB_RR_drive_aj]
filename=/dev/sdaj

[4_KiB_RR_drive_ak]
filename=/dev/sdak

[4_KiB_RR_drive_al]
filename=/dev/sdal

[4_KiB_RR_drive_am]
filename=/dev/sdam

kernel log with some prints in ioctx_alloc:
(2147483647 is 0x7FFFFFFF)

[   94.050877] ioctx_alloc: initial nr_events=2147483647
[   94.053610] ioctx_alloc: num_possible_cpus=64
[   94.055235] ioctx_alloc: after max nr_events=2147483647
[   94.057110] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.059189] ioctx_alloc: initial nr_events=96
[   94.059294] ioctx_alloc: initial nr_events=2147483647
[   94.059295] ioctx_alloc: num_possible_cpus=64
[   94.059295] ioctx_alloc: after max nr_events=2147483647
[   94.059296] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.059296] ioctx_alloc: initial nr_events=96
[   94.059297] ioctx_alloc: num_possible_cpus=64
[   94.059297] ioctx_alloc: after max nr_events=256
[   94.059298] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.075891] ioctx_alloc: num_possible_cpus=64
[   94.077529] ioctx_alloc: after max nr_events=256
[   94.079064] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.087777] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.087810] ioctx_alloc: initial nr_events=2147483647
[   94.087810] ioctx_alloc: num_possible_cpus=64
[   94.087811] ioctx_alloc: after max nr_events=2147483647
[   94.087811] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.087812] ioctx_alloc: initial nr_events=96
[   94.087812] ioctx_alloc: num_possible_cpus=64
[   94.087813] ioctx_alloc: after max nr_events=256
[   94.087813] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.087815] ioctx_alloc: initial nr_events=2147483647
[   94.087816] ioctx_alloc: initial nr_events=2147483647
[   94.087816] ioctx_alloc: num_possible_cpus=64
[   94.087817] ioctx_alloc: initial nr_events=2147483647
[   94.087818] ioctx_alloc: num_possible_cpus=64
[   94.087819] ioctx_alloc: after max nr_events=2147483647
[   94.087819] ioctx_alloc: num_possible_cpus=64
[   94.087820] ioctx_alloc: after max nr_events=2147483647
[   94.087820] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.087821] ioctx_alloc: after max nr_events=2147483647
[   94.087822] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.087822] ioctx_alloc: initial nr_events=96
[   94.087823] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.087824] ioctx_alloc: initial nr_events=96
[   94.087825] ioctx_alloc: initial nr_events=2147483647
[   94.087825] ioctx_alloc: num_possible_cpus=64
[   94.087826] ioctx_alloc: initial nr_events=96
[   94.087826] ioctx_alloc: num_possible_cpus=64
[   94.087827] ioctx_alloc: num_possible_cpus=64
[   94.087827] ioctx_alloc: after max nr_events=256
[   94.087828] ioctx_alloc: num_possible_cpus=64
[   94.087828] ioctx_alloc: after max nr_events=256
[   94.087829] ioctx_alloc: after max nr_events=2147483647
[   94.087829] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.087830] ioctx_alloc: after max nr_events=256
[   94.087831] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.087831] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.087832] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.087833] ioctx_alloc: initial nr_events=96
[   94.087833] ioctx_alloc: num_possible_cpus=64
[   94.087833] ioctx_alloc: after max nr_events=256
[   94.087834] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.090668] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.259433] ioctx_alloc: initial nr_events=2147483647
[   94.259435] ioctx_alloc: initial nr_events=2147483647
[   94.259436] ioctx_alloc: num_possible_cpus=64
[   94.259437] ioctx_alloc: after max nr_events=2147483647
[   94.259437] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.259438] ioctx_alloc: initial nr_events=96
[   94.259438] ioctx_alloc: num_possible_cpus=64
[   94.259438] ioctx_alloc: after max nr_events=256
[   94.259439] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.259446] ioctx_alloc: initial nr_events=2147483647
[   94.259448] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.259449] ioctx_alloc: initial nr_events=2147483647
[   94.259450] ioctx_alloc: initial nr_events=2147483647
[   94.259450] ioctx_alloc: num_possible_cpus=64
[   94.259451] ioctx_alloc: num_possible_cpus=64
[   94.259452] ioctx_alloc: num_possible_cpus=64
[   94.259452] ioctx_alloc: after max nr_events=2147483647
[   94.259453] ioctx_alloc: after max nr_events=2147483647
[   94.259453] ioctx_alloc: after max nr_events=2147483647
[   94.259454] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.259456] ioctx_alloc: initial nr_events=96
[   94.259456] ioctx_alloc: initial nr_events=96
[   94.259457] ioctx_alloc: initial nr_events=96
[   94.259457] ioctx_alloc: num_possible_cpus=64
[   94.259458] ioctx_alloc: num_possible_cpus=64
[   94.259458] ioctx_alloc: num_possible_cpus=64
[   94.259459] ioctx_alloc: after max nr_events=256
[   94.259459] ioctx_alloc: after max nr_events=256
[   94.259460] ioctx_alloc: after max nr_events=256
[   94.259460] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.259461] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.259462] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.260539] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.260544] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.262535] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.262550] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.423889] ioctx_alloc: num_possible_cpus=64
[   94.425386] ioctx_alloc: after max nr_events=2147483647
[   94.427327] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.429359] ioctx_alloc: initial nr_events=96
[   94.429448] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.429451] ioctx_alloc: initial nr_events=2147483647
[   94.429452] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.429453] ioctx_alloc: num_possible_cpus=64
[   94.429454] ioctx_alloc: initial nr_events=2147483647
[   94.429454] ioctx_alloc: after max nr_events=2147483647
[   94.429455] ioctx_alloc: num_possible_cpus=64
[   94.429456] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.429456] ioctx_alloc: after max nr_events=2147483647
[   94.429457] ioctx_alloc: initial nr_events=96
[   94.429458] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.429458] ioctx_alloc: num_possible_cpus=64
[   94.429459] ioctx_alloc: initial nr_events=96
[   94.429459] ioctx_alloc: after max nr_events=256
[   94.429460] ioctx_alloc: num_possible_cpus=64
[   94.429461] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.429461] ioctx_alloc: after max nr_events=256
[   94.429462] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.429463] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.430422] hrtimer: interrupt took 6115 ns
[   94.431463] ioctx_alloc: initial nr_events=2147483647
[   94.431464] ioctx_alloc: num_possible_cpus=64
[   94.431464] ioctx_alloc: after max nr_events=2147483647
[   94.431465] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.431465] ioctx_alloc: initial nr_events=96
[   94.431466] ioctx_alloc: num_possible_cpus=64
[   94.431466] ioctx_alloc: after max nr_events=256
[   94.431466] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.432641] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.580307] ioctx_alloc: num_possible_cpus=64
[   94.581844] ioctx_alloc: after max nr_events=256
[   94.583405] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.585313] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.585319] ioctx_alloc: initial nr_events=2147483647
[   94.585320] ioctx_alloc: num_possible_cpus=64
[   94.585320] ioctx_alloc: after max nr_events=2147483647
[   94.585321] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.585322] ioctx_alloc: initial nr_events=2147483647
[   94.585322] ioctx_alloc: initial nr_events=96
[   94.585323] ioctx_alloc: num_possible_cpus=64
[   94.585324] ioctx_alloc: num_possible_cpus=64
[   94.585324] ioctx_alloc: after max nr_events=2147483647
[   94.585325] ioctx_alloc: after max nr_events=256
[   94.585325] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.585326] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.585327] ioctx_alloc: initial nr_events=2147483647
[   94.585328] ioctx_alloc: initial nr_events=96
[   94.585328] ioctx_alloc: num_possible_cpus=64
[   94.585329] ioctx_alloc: num_possible_cpus=64
[   94.585329] ioctx_alloc: after max nr_events=2147483647
[   94.585330] ioctx_alloc: after max nr_events=256
[   94.585331] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.585331] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.585332] ioctx_alloc: initial nr_events=96
[   94.585332] ioctx_alloc: num_possible_cpus=64
[   94.585333] ioctx_alloc: after max nr_events=256
[   94.585333] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.585372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.585402] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.588377] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.632221] ioctx_alloc: initial nr_events=2147483647
[   94.632228] ioctx_alloc: initial nr_events=2147483647
[   94.632229] ioctx_alloc: num_possible_cpus=64
[   94.632229] ioctx_alloc: after max nr_events=2147483647
[   94.632230] ioctx_alloc: initial nr_events=2147483647
[   94.632231] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.632232] ioctx_alloc: num_possible_cpus=64
[   94.632232] ioctx_alloc: initial nr_events=96
[   94.632233] ioctx_alloc: after max nr_events=2147483647
[   94.632233] ioctx_alloc: num_possible_cpus=64
[   94.632234] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.632234] ioctx_alloc: after max nr_events=256
[   94.632235] ioctx_alloc: initial nr_events=96
[   94.632236] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.632236] ioctx_alloc: num_possible_cpus=64
[   94.632237] ioctx_alloc: after max nr_events=256
[   94.632237] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.632241] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.633350] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.764384] ioctx_alloc: num_possible_cpus=64
[   94.766038] ioctx_alloc: after max nr_events=2147483647
[   94.767807] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.769568] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.770328] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.773546] ioctx_alloc: initial nr_events=2147483647
[   94.773550] ioctx_alloc: initial nr_events=2147483647
[   94.773551] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.773552] ioctx_alloc: num_possible_cpus=64
[   94.773552] ioctx_alloc: after max nr_events=2147483647
[   94.773553] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.773553] ioctx_alloc: initial nr_events=96
[   94.773554] ioctx_alloc: num_possible_cpus=64
[   94.773554] ioctx_alloc: after max nr_events=256
[   94.773555] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.773569] ioctx_alloc: initial nr_events=2147483647
[   94.773569] ioctx_alloc: num_possible_cpus=64
[   94.773570] ioctx_alloc: after max nr_events=2147483647
[   94.773570] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.773571] ioctx_alloc: initial nr_events=96
[   94.773571] ioctx_alloc: num_possible_cpus=64
[   94.773572] ioctx_alloc: after max nr_events=256
[   94.773572] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.903978] ioctx_alloc: num_possible_cpus=64
[   94.905427] ioctx_alloc: after max nr_events=2147483647
[   94.907320] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.909300] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.909305] ioctx_alloc: initial nr_events=2147483647
[   94.909306] ioctx_alloc: num_possible_cpus=64
[   94.909306] ioctx_alloc: after max nr_events=2147483647
[   94.909307] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.909307] ioctx_alloc: initial nr_events=96
[   94.909308] ioctx_alloc: num_possible_cpus=64
[   94.909308] ioctx_alloc: after max nr_events=256
[   94.909309] ioctx_alloc: initial nr_events=2147483647
[   94.909310] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.909310] ioctx_alloc: num_possible_cpus=64
[   94.909311] ioctx_alloc: after max nr_events=2147483647
[   94.909311] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.909312] ioctx_alloc: initial nr_events=96
[   94.909312] ioctx_alloc: num_possible_cpus=64
[   94.909313] ioctx_alloc: after max nr_events=256
[   94.909313] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.912223] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.940281] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   94.940283] ioctx_alloc: initial nr_events=2147483647
[   94.940284] ioctx_alloc: num_possible_cpus=64
[   94.940285] ioctx_alloc: after max nr_events=2147483647
[   94.940286] ioctx_alloc: initial nr_events=2147483647
[   94.940286] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.940287] ioctx_alloc: num_possible_cpus=64
[   94.940288] ioctx_alloc: initial nr_events=96
[   94.940288] ioctx_alloc: after max nr_events=2147483647
[   94.940289] ioctx_alloc: num_possible_cpus=64
[   94.940290] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   94.940290] ioctx_alloc: after max nr_events=256
[   94.940291] ioctx_alloc: initial nr_events=96
[   94.940291] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.940292] ioctx_alloc: num_possible_cpus=64
[   94.940292] ioctx_alloc: after max nr_events=256
[   94.940293] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   94.942198] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.069096] ioctx_alloc: initial nr_events=96
[   95.069097] ioctx_alloc: num_possible_cpus=64
[   95.069097] ioctx_alloc: after max nr_events=256
[   95.069098] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.087101] ioctx_alloc: initial nr_events=2147483647
[   95.087108] ioctx_alloc: initial nr_events=2147483647
[   95.087108] ioctx_alloc: num_possible_cpus=64
[   95.087109] ioctx_alloc: after max nr_events=2147483647
[   95.087109] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.087110] ioctx_alloc: initial nr_events=96
[   95.087110] ioctx_alloc: num_possible_cpus=64
[   95.087111] ioctx_alloc: after max nr_events=256
[   95.087112] ioctx_alloc: initial nr_events=2147483647
[   95.087113] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.087113] ioctx_alloc: num_possible_cpus=64
[   95.087114] ioctx_alloc: after max nr_events=2147483647
[   95.087114] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.087115] ioctx_alloc: initial nr_events=96
[   95.087115] ioctx_alloc: num_possible_cpus=64
[   95.087116] ioctx_alloc: after max nr_events=256
[   95.087117] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.087117] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.087120] ioctx_alloc: initial nr_events=2147483647
[   95.087120] ioctx_alloc: num_possible_cpus=64
[   95.087121] ioctx_alloc: after max nr_events=2147483647
[   95.087121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.087122] ioctx_alloc: initial nr_events=96
[   95.087122] ioctx_alloc: num_possible_cpus=64
[   95.087123] ioctx_alloc: after max nr_events=256
[   95.087123] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.087126] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.091100] ioctx_alloc: initial nr_events=2147483647
[   95.091100] ioctx_alloc: num_possible_cpus=64
[   95.091100] ioctx_alloc: after max nr_events=2147483647
[   95.091101] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.091101] ioctx_alloc: initial nr_events=96
[   95.091102] ioctx_alloc: num_possible_cpus=64
[   95.091102] ioctx_alloc: after max nr_events=256
[   95.091103] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.145236] ioctx_alloc: num_possible_cpus=64
[   95.146754] ioctx_alloc: after max nr_events=2147483647
[   95.248567] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.250432] ioctx_alloc: initial nr_events=2147483647
[   95.250438] ioctx_alloc: initial nr_events=2147483647
[   95.250439] ioctx_alloc: num_possible_cpus=64
[   95.250439] ioctx_alloc: after max nr_events=2147483647
[   95.250440] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.250440] ioctx_alloc: initial nr_events=96
[   95.250441] ioctx_alloc: num_possible_cpus=64
[   95.250441] ioctx_alloc: after max nr_events=256
[   95.250442] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.250450] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.250457] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.251027] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.251038] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.252029] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.275430] ioctx_alloc: num_possible_cpus=64
[   95.277000] ioctx_alloc: after max nr_events=2147483647
[   95.278747] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.280540] ioctx_alloc: initial nr_events=2147483647
[   95.280554] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.284457] ioctx_alloc: num_possible_cpus=64
[   95.285998] ioctx_alloc: after max nr_events=2147483647
[   95.287764] ioctx_alloc: nr_events=-2 aio_max_nr=65536
[   95.289455] ioctx_alloc: initial nr_events=96
[   95.290901] ioctx_alloc: num_possible_cpus=64
[   95.292450] ioctx_alloc: after max nr_events=256
[   95.294013] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.295873] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.381941] ioctx_alloc: initial nr_events=96
[   95.383764] ioctx_alloc: num_possible_cpus=64
[   95.385303] ioctx_alloc: after max nr_events=256
[   95.386959] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.391935] ioctx_alloc: initial nr_events=96
[   95.393493] ioctx_alloc: num_possible_cpus=64
[   95.394994] ioctx_alloc: after max nr_events=256
[   95.396751] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.421964] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.425953] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
[   95.611825] ioctx_alloc: initial nr_events=96
[   95.613398] ioctx_alloc: num_possible_cpus=64
[   95.614893] ioctx_alloc: after max nr_events=256
[   95.616615] ioctx_alloc: nr_events=512 aio_max_nr=65536
[   95.645844] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 14:36                 ` Elliott, Robert (Server Storage)
@ 2014-07-10 14:45                   ` Benjamin LaHaise
  2014-07-10 15:11                       ` Jeff Moyer
  0 siblings, 1 reply; 99+ messages in thread
From: Benjamin LaHaise @ 2014-07-10 14:45 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jens Axboe, Christoph Hellwig, dgilbert, James Bottomley,
	Bart Van Assche, linux-scsi, linux-kernel

On Thu, Jul 10, 2014 at 02:36:40PM +0000, Elliott, Robert (Server Storage) wrote:
> 
> 
> > -----Original Message-----
> > From: Jens Axboe [mailto:axboe@kernel.dk]
> > Sent: Thursday, 10 July, 2014 8:53 AM
> > To: Christoph Hellwig; Benjamin LaHaise
> > Cc: Elliott, Robert (Server Storage); dgilbert@interlog.com; James Bottomley;
> > Bart Van Assche; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> > Subject: Re: scsi-mq V2
> > 
> > On 2014-07-10 15:50, Christoph Hellwig wrote:
> > > On Thu, Jul 10, 2014 at 09:36:09AM -0400, Benjamin LaHaise wrote:
> > >> There is one possible concern that could be exacerbated by other changes
> > in
> > >> the system: if the application is running close to the bare minimum number
> > >> of requests allocated in io_setup(), the per cpu reference counters will
> > >> have a hard time batching things.  It might be worth testing with an
> > >> increased number of requests being allocated if this is the case.
> > >
> > > Well, Robert said reverting the two aio commits didn't help.  Either he
> > > didn't manage to boot into the right kernel, or we need to look
> > > elsewhere for the culprit.
> > 
> > Rob, let me know what scsi_debug setup you use, and I can try and
> > reproduce it here as well.
> > 
> > --
> > Jens Axboe
> 
> This system has 6 online CPUs and 64 possible CPUs.
> 
> Printing avail and req_batch in that loop results in many of these:
> ** 3813 printk messages dropped ** [10643.503772] ctx ffff88042d8d4cc0 avail=0 req_batch=2
> 
> Adding CFLAGS_aio.o := -DDEBUG to the Makefile to enable
> those pr_debug prints results in nothing extra being printed,
> so it's not hitting an error.
> 
> Printing nr_events and aio_max_nr at the top of ioctx_alloc results in
> these as fio starts:
> 
> [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536

Something is horribly wrong here.  There is no way that value for nr_events 
should be passed in to ioctx_alloc().  This implies that userland is calling 
io_setup() with an impossibly large value for nr_events.  Can you post the 
actual diff for your fs/aio.c relative to linus' tree?

		-ben


> [  186.339070] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339071] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339074] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.339076] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.339076] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359772] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359971] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359972] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359985] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.359986] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359987] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359995] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359995] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.359998] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.359998] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.362529] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.362529] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.363510] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.363513] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.363520] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.363521] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.398113] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.398115] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.398121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398122] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398124] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398124] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398130] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398131] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398164] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.398165] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.398499] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.400489] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.401478] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.401491] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.434522] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.434523] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.434526] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.434533] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.435370] hrtimer: interrupt took 6868 ns
> [  186.435491] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.435492] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.447864] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.449896] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.449900] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.449901] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.449909] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.449932] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.449933] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.461147] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.461176] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.461177] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.461181] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.461181] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.461184] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.461185] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.461185] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.461191] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.461192] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.474426] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.481353] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.483706] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.483707] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.483709] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.483710] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.483712] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.483717] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.495441] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.495444] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.495445] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.490] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.495451] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.495457] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.495457] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.495460] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.495461] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.495463] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.495464] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.499429] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.499437] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.619785] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.627371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.627374] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.627383] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.627384] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.627385] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.628371] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.628372] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.630361] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.665329] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.666360] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.666361] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.666366] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.666367] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.666367] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.666369] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [  186.666370] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  186.670369] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.670372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.670373] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  186.767323] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [  187.211053] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [  187.213053] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> 
> Subsequently added prints showed nr_events coming in with
> an initial value of 0x7FFFFFFF in those cases where it showed
> as -2 above.  Since the last call had a reasonable value
> of 512, it doesn't seem like a problem.
> 
> 
> script to create the scsi_debug devices:
> #!/bin/bash
> devices=6
> delay=0         # -2 = jiffy, -1 = hi jiffy, 0 = none, 1..n = longer
> ndelay=20000    # 20 us
> opts=0x4801     #
> every_nth=3
> capacity_mib=0
> capacity_gib=$((2 * 1024))
> lbpu=1
> lbpws=1
> 
> modprobe -r scsi_debug
> modprobe -v scsi_debug fake_rw=1 delay=$delay ndelay=$ndelay num_tgts=$devices opts=$opts every_nth=$every_nth physblk_exp=3 lbpu=$lbpu lbpws=$lbpws dev_size_mb=$capacity_mib virtual_gb=$capacity_gib
> lsscsi -s
> lsblk
> # the assigned /dev names will vary...
> for device in /sys/block/sda[h-m]
> do
>         echo none > $device/device/queue_type
> done
> 
> fio script:
> [global]
> direct=1
> invalidate=1
> ioengine=libaio
> norandommap
> randrepeat=0
> bs=4096
> iodepth=96
> numjobs=6
> runtime=216000
> time_based=1
> group_reporting
> thread
> gtod_reduce=1
> iodepth_batch=16
> iodepth_batch_complete=16
> cpus_allowed=0-5
> cpus_allowed_policy=split
> rw=randread
> 
> [4_KiB_RR_drive_ah]
> filename=/dev/sdah
> 
> [4_KiB_RR_drive_ai]
> filename=/dev/sdai
> 
> [4_KiB_RR_drive_aj]
> filename=/dev/sdaj
> 
> [4_KiB_RR_drive_ak]
> filename=/dev/sdak
> 
> [4_KiB_RR_drive_al]
> filename=/dev/sdal
> 
> [4_KiB_RR_drive_am]
> filename=/dev/sdam
> 
> kernel log with some prints in ioctx_alloc:
> (2147483647 is 0x7FFFFFFF)
> 
> [   94.050877] ioctx_alloc: initial nr_events=2147483647
> [   94.053610] ioctx_alloc: num_possible_cpus=64
> [   94.055235] ioctx_alloc: after max nr_events=2147483647
> [   94.057110] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.059189] ioctx_alloc: initial nr_events=96
> [   94.059294] ioctx_alloc: initial nr_events=2147483647
> [   94.059295] ioctx_alloc: num_possible_cpus=64
> [   94.059295] ioctx_alloc: after max nr_events=2147483647
> [   94.059296] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.059296] ioctx_alloc: initial nr_events=96
> [   94.059297] ioctx_alloc: num_possible_cpus=64
> [   94.059297] ioctx_alloc: after max nr_events=256
> [   94.059298] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.075891] ioctx_alloc: num_possible_cpus=64
> [   94.077529] ioctx_alloc: after max nr_events=256
> [   94.079064] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.087777] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.087810] ioctx_alloc: initial nr_events=2147483647
> [   94.087810] ioctx_alloc: num_possible_cpus=64
> [   94.087811] ioctx_alloc: after max nr_events=2147483647
> [   94.087811] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.087812] ioctx_alloc: initial nr_events=96
> [   94.087812] ioctx_alloc: num_possible_cpus=64
> [   94.087813] ioctx_alloc: after max nr_events=256
> [   94.087813] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.087815] ioctx_alloc: initial nr_events=2147483647
> [   94.087816] ioctx_alloc: initial nr_events=2147483647
> [   94.087816] ioctx_alloc: num_possible_cpus=64
> [   94.087817] ioctx_alloc: initial nr_events=2147483647
> [   94.087818] ioctx_alloc: num_possible_cpus=64
> [   94.087819] ioctx_alloc: after max nr_events=2147483647
> [   94.087819] ioctx_alloc: num_possible_cpus=64
> [   94.087820] ioctx_alloc: after max nr_events=2147483647
> [   94.087820] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.087821] ioctx_alloc: after max nr_events=2147483647
> [   94.087822] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.087822] ioctx_alloc: initial nr_events=96
> [   94.087823] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.087824] ioctx_alloc: initial nr_events=96
> [   94.087825] ioctx_alloc: initial nr_events=2147483647
> [   94.087825] ioctx_alloc: num_possible_cpus=64
> [   94.087826] ioctx_alloc: initial nr_events=96
> [   94.087826] ioctx_alloc: num_possible_cpus=64
> [   94.087827] ioctx_alloc: num_possible_cpus=64
> [   94.087827] ioctx_alloc: after max nr_events=256
> [   94.087828] ioctx_alloc: num_possible_cpus=64
> [   94.087828] ioctx_alloc: after max nr_events=256
> [   94.087829] ioctx_alloc: after max nr_events=2147483647
> [   94.087829] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.087830] ioctx_alloc: after max nr_events=256
> [   94.087831] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.087831] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.087832] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.087833] ioctx_alloc: initial nr_events=96
> [   94.087833] ioctx_alloc: num_possible_cpus=64
> [   94.087833] ioctx_alloc: after max nr_events=256
> [   94.087834] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.090668] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.259433] ioctx_alloc: initial nr_events=2147483647
> [   94.259435] ioctx_alloc: initial nr_events=2147483647
> [   94.259436] ioctx_alloc: num_possible_cpus=64
> [   94.259437] ioctx_alloc: after max nr_events=2147483647
> [   94.259437] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.259438] ioctx_alloc: initial nr_events=96
> [   94.259438] ioctx_alloc: num_possible_cpus=64
> [   94.259438] ioctx_alloc: after max nr_events=256
> [   94.259439] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.259446] ioctx_alloc: initial nr_events=2147483647
> [   94.259448] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.259449] ioctx_alloc: initial nr_events=2147483647
> [   94.259450] ioctx_alloc: initial nr_events=2147483647
> [   94.259450] ioctx_alloc: num_possible_cpus=64
> [   94.259451] ioctx_alloc: num_possible_cpus=64
> [   94.259452] ioctx_alloc: num_possible_cpus=64
> [   94.259452] ioctx_alloc: after max nr_events=2147483647
> [   94.259453] ioctx_alloc: after max nr_events=2147483647
> [   94.259453] ioctx_alloc: after max nr_events=2147483647
> [   94.259454] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.259455] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.259456] ioctx_alloc: initial nr_events=96
> [   94.259456] ioctx_alloc: initial nr_events=96
> [   94.259457] ioctx_alloc: initial nr_events=96
> [   94.259457] ioctx_alloc: num_possible_cpus=64
> [   94.259458] ioctx_alloc: num_possible_cpus=64
> [   94.259458] ioctx_alloc: num_possible_cpus=64
> [   94.259459] ioctx_alloc: after max nr_events=256
> [   94.259459] ioctx_alloc: after max nr_events=256
> [   94.259460] ioctx_alloc: after max nr_events=256
> [   94.259460] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.259461] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.259462] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.260539] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.260544] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.262535] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.262550] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.423889] ioctx_alloc: num_possible_cpus=64
> [   94.425386] ioctx_alloc: after max nr_events=2147483647
> [   94.427327] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.429359] ioctx_alloc: initial nr_events=96
> [   94.429448] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.429451] ioctx_alloc: initial nr_events=2147483647
> [   94.429452] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.429453] ioctx_alloc: num_possible_cpus=64
> [   94.429454] ioctx_alloc: initial nr_events=2147483647
> [   94.429454] ioctx_alloc: after max nr_events=2147483647
> [   94.429455] ioctx_alloc: num_possible_cpus=64
> [   94.429456] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.429456] ioctx_alloc: after max nr_events=2147483647
> [   94.429457] ioctx_alloc: initial nr_events=96
> [   94.429458] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.429458] ioctx_alloc: num_possible_cpus=64
> [   94.429459] ioctx_alloc: initial nr_events=96
> [   94.429459] ioctx_alloc: after max nr_events=256
> [   94.429460] ioctx_alloc: num_possible_cpus=64
> [   94.429461] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.429461] ioctx_alloc: after max nr_events=256
> [   94.429462] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.429463] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.430422] hrtimer: interrupt took 6115 ns
> [   94.431463] ioctx_alloc: initial nr_events=2147483647
> [   94.431464] ioctx_alloc: num_possible_cpus=64
> [   94.431464] ioctx_alloc: after max nr_events=2147483647
> [   94.431465] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.431465] ioctx_alloc: initial nr_events=96
> [   94.431466] ioctx_alloc: num_possible_cpus=64
> [   94.431466] ioctx_alloc: after max nr_events=256
> [   94.431466] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.432641] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.580307] ioctx_alloc: num_possible_cpus=64
> [   94.581844] ioctx_alloc: after max nr_events=256
> [   94.583405] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.585313] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.585319] ioctx_alloc: initial nr_events=2147483647
> [   94.585320] ioctx_alloc: num_possible_cpus=64
> [   94.585320] ioctx_alloc: after max nr_events=2147483647
> [   94.585321] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.585322] ioctx_alloc: initial nr_events=2147483647
> [   94.585322] ioctx_alloc: initial nr_events=96
> [   94.585323] ioctx_alloc: num_possible_cpus=64
> [   94.585324] ioctx_alloc: num_possible_cpus=64
> [   94.585324] ioctx_alloc: after max nr_events=2147483647
> [   94.585325] ioctx_alloc: after max nr_events=256
> [   94.585325] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.585326] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.585327] ioctx_alloc: initial nr_events=2147483647
> [   94.585328] ioctx_alloc: initial nr_events=96
> [   94.585328] ioctx_alloc: num_possible_cpus=64
> [   94.585329] ioctx_alloc: num_possible_cpus=64
> [   94.585329] ioctx_alloc: after max nr_events=2147483647
> [   94.585330] ioctx_alloc: after max nr_events=256
> [   94.585331] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.585331] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.585332] ioctx_alloc: initial nr_events=96
> [   94.585332] ioctx_alloc: num_possible_cpus=64
> [   94.585333] ioctx_alloc: after max nr_events=256
> [   94.585333] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.585372] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.585402] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.588377] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.632221] ioctx_alloc: initial nr_events=2147483647
> [   94.632228] ioctx_alloc: initial nr_events=2147483647
> [   94.632229] ioctx_alloc: num_possible_cpus=64
> [   94.632229] ioctx_alloc: after max nr_events=2147483647
> [   94.632230] ioctx_alloc: initial nr_events=2147483647
> [   94.632231] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.632232] ioctx_alloc: num_possible_cpus=64
> [   94.632232] ioctx_alloc: initial nr_events=96
> [   94.632233] ioctx_alloc: after max nr_events=2147483647
> [   94.632233] ioctx_alloc: num_possible_cpus=64
> [   94.632234] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.632234] ioctx_alloc: after max nr_events=256
> [   94.632235] ioctx_alloc: initial nr_events=96
> [   94.632236] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.632236] ioctx_alloc: num_possible_cpus=64
> [   94.632237] ioctx_alloc: after max nr_events=256
> [   94.632237] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.632241] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.633350] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.764384] ioctx_alloc: num_possible_cpus=64
> [   94.766038] ioctx_alloc: after max nr_events=2147483647
> [   94.767807] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.769568] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.770328] sd 5:0:0:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.773546] ioctx_alloc: initial nr_events=2147483647
> [   94.773550] ioctx_alloc: initial nr_events=2147483647
> [   94.773551] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.773552] ioctx_alloc: num_possible_cpus=64
> [   94.773552] ioctx_alloc: after max nr_events=2147483647
> [   94.773553] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.773553] ioctx_alloc: initial nr_events=96
> [   94.773554] ioctx_alloc: num_possible_cpus=64
> [   94.773554] ioctx_alloc: after max nr_events=256
> [   94.773555] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.773569] ioctx_alloc: initial nr_events=2147483647
> [   94.773569] ioctx_alloc: num_possible_cpus=64
> [   94.773570] ioctx_alloc: after max nr_events=2147483647
> [   94.773570] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.773571] ioctx_alloc: initial nr_events=96
> [   94.773571] ioctx_alloc: num_possible_cpus=64
> [   94.773572] ioctx_alloc: after max nr_events=256
> [   94.773572] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.903978] ioctx_alloc: num_possible_cpus=64
> [   94.905427] ioctx_alloc: after max nr_events=2147483647
> [   94.907320] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.909300] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.909305] ioctx_alloc: initial nr_events=2147483647
> [   94.909306] ioctx_alloc: num_possible_cpus=64
> [   94.909306] ioctx_alloc: after max nr_events=2147483647
> [   94.909307] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.909307] ioctx_alloc: initial nr_events=96
> [   94.909308] ioctx_alloc: num_possible_cpus=64
> [   94.909308] ioctx_alloc: after max nr_events=256
> [   94.909309] ioctx_alloc: initial nr_events=2147483647
> [   94.909310] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.909310] ioctx_alloc: num_possible_cpus=64
> [   94.909311] ioctx_alloc: after max nr_events=2147483647
> [   94.909311] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.909312] ioctx_alloc: initial nr_events=96
> [   94.909312] ioctx_alloc: num_possible_cpus=64
> [   94.909313] ioctx_alloc: after max nr_events=256
> [   94.909313] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.912223] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.940281] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   94.940283] ioctx_alloc: initial nr_events=2147483647
> [   94.940284] ioctx_alloc: num_possible_cpus=64
> [   94.940285] ioctx_alloc: after max nr_events=2147483647
> [   94.940286] ioctx_alloc: initial nr_events=2147483647
> [   94.940286] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.940287] ioctx_alloc: num_possible_cpus=64
> [   94.940288] ioctx_alloc: initial nr_events=96
> [   94.940288] ioctx_alloc: after max nr_events=2147483647
> [   94.940289] ioctx_alloc: num_possible_cpus=64
> [   94.940290] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   94.940290] ioctx_alloc: after max nr_events=256
> [   94.940291] ioctx_alloc: initial nr_events=96
> [   94.940291] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.940292] ioctx_alloc: num_possible_cpus=64
> [   94.940292] ioctx_alloc: after max nr_events=256
> [   94.940293] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   94.942198] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.069096] ioctx_alloc: initial nr_events=96
> [   95.069097] ioctx_alloc: num_possible_cpus=64
> [   95.069097] ioctx_alloc: after max nr_events=256
> [   95.069098] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.087101] ioctx_alloc: initial nr_events=2147483647
> [   95.087108] ioctx_alloc: initial nr_events=2147483647
> [   95.087108] ioctx_alloc: num_possible_cpus=64
> [   95.087109] ioctx_alloc: after max nr_events=2147483647
> [   95.087109] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.087110] ioctx_alloc: initial nr_events=96
> [   95.087110] ioctx_alloc: num_possible_cpus=64
> [   95.087111] ioctx_alloc: after max nr_events=256
> [   95.087112] ioctx_alloc: initial nr_events=2147483647
> [   95.087113] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.087113] ioctx_alloc: num_possible_cpus=64
> [   95.087114] ioctx_alloc: after max nr_events=2147483647
> [   95.087114] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.087115] ioctx_alloc: initial nr_events=96
> [   95.087115] ioctx_alloc: num_possible_cpus=64
> [   95.087116] ioctx_alloc: after max nr_events=256
> [   95.087117] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.087117] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.087120] ioctx_alloc: initial nr_events=2147483647
> [   95.087120] ioctx_alloc: num_possible_cpus=64
> [   95.087121] ioctx_alloc: after max nr_events=2147483647
> [   95.087121] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.087122] ioctx_alloc: initial nr_events=96
> [   95.087122] ioctx_alloc: num_possible_cpus=64
> [   95.087123] ioctx_alloc: after max nr_events=256
> [   95.087123] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.087126] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.091100] ioctx_alloc: initial nr_events=2147483647
> [   95.091100] ioctx_alloc: num_possible_cpus=64
> [   95.091100] ioctx_alloc: after max nr_events=2147483647
> [   95.091101] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.091101] ioctx_alloc: initial nr_events=96
> [   95.091102] ioctx_alloc: num_possible_cpus=64
> [   95.091102] ioctx_alloc: after max nr_events=256
> [   95.091103] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.145236] ioctx_alloc: num_possible_cpus=64
> [   95.146754] ioctx_alloc: after max nr_events=2147483647
> [   95.248567] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.250432] ioctx_alloc: initial nr_events=2147483647
> [   95.250438] ioctx_alloc: initial nr_events=2147483647
> [   95.250439] ioctx_alloc: num_possible_cpus=64
> [   95.250439] ioctx_alloc: after max nr_events=2147483647
> [   95.250440] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.250440] ioctx_alloc: initial nr_events=96
> [   95.250441] ioctx_alloc: num_possible_cpus=64
> [   95.250441] ioctx_alloc: after max nr_events=256
> [   95.250442] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.250450] sd 5:0:5:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.250457] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.251027] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.251038] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.252029] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.275430] ioctx_alloc: num_possible_cpus=64
> [   95.277000] ioctx_alloc: after max nr_events=2147483647
> [   95.278747] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.280540] ioctx_alloc: initial nr_events=2147483647
> [   95.280554] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.284457] ioctx_alloc: num_possible_cpus=64
> [   95.285998] ioctx_alloc: after max nr_events=2147483647
> [   95.287764] ioctx_alloc: nr_events=-2 aio_max_nr=65536
> [   95.289455] ioctx_alloc: initial nr_events=96
> [   95.290901] ioctx_alloc: num_possible_cpus=64
> [   95.292450] ioctx_alloc: after max nr_events=256
> [   95.294013] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.295873] sd 5:0:3:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.381941] ioctx_alloc: initial nr_events=96
> [   95.383764] ioctx_alloc: num_possible_cpus=64
> [   95.385303] ioctx_alloc: after max nr_events=256
> [   95.386959] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.391935] ioctx_alloc: initial nr_events=96
> [   95.393493] ioctx_alloc: num_possible_cpus=64
> [   95.394994] ioctx_alloc: after max nr_events=256
> [   95.396751] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.421964] sd 5:0:2:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.425953] sd 5:0:4:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> [   95.611825] ioctx_alloc: initial nr_events=96
> [   95.613398] ioctx_alloc: num_possible_cpus=64
> [   95.614893] ioctx_alloc: after max nr_events=256
> [   95.616615] ioctx_alloc: nr_events=512 aio_max_nr=65536
> [   95.645844] sd 5:0:1:0: scsi_debug_ioctl: BLKFLSBUF [0x1261]
> 

-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 14:45                   ` Benjamin LaHaise
@ 2014-07-10 15:11                       ` Jeff Moyer
  0 siblings, 0 replies; 99+ messages in thread
From: Jeff Moyer @ 2014-07-10 15:11 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Elliott, Robert (Server Storage),
	Jens Axboe, Christoph Hellwig, dgilbert, James Bottomley,
	Bart Van Assche, linux-scsi, linux-kernel

Benjamin LaHaise <bcrl@kvack.org> writes:

>> 
>> [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>> [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>
> Something is horribly wrong here.  There is no way that value for nr_events 
> should be passed in to ioctx_alloc().  This implies that userland is calling 
> io_setup() with an impossibly large value for nr_events.  Can you post the 
> actual diff for your fs/aio.c relative to linus' tree?
>

fio does exactly this!  It passes INT_MAX.
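
A hypothetical sketch of that pattern (not fio's literal code), which
would explain each 2147483647 call being immediately followed by a
96/512 one in Robert's log: probe io_setup() with INT_MAX, let the
kernel reject it, then retry with the job's real iodepth:

/* cc -o probe probe.c -laio */
#include <libaio.h>
#include <limits.h>
#include <stdio.h>

int main(void)
{
	io_context_t ctx = 0;
	int ret;

	/* probe; the kernel rejects this (-EINVAL or -EAGAIN) */
	ret = io_setup(INT_MAX, &ctx);
	if (ret < 0) {
		ctx = 0;
		ret = io_setup(96, &ctx);	/* the job's iodepth */
	}
	if (ret < 0) {
		fprintf(stderr, "io_setup failed: %d\n", ret);
		return 1;
	}
	io_destroy(ctx);
	return 0;
}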

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-10  6:20         ` Christoph Hellwig
  2014-07-10 13:36           ` Benjamin LaHaise
@ 2014-07-10 15:51           ` Elliott, Robert (Server Storage)
  2014-07-10 16:04             ` Christoph Hellwig
  1 sibling, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-10 15:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, dgilbert, James Bottomley, Bart Van Assche,
	Benjamin LaHaise, linux-scsi, linux-kernel



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@infradead.org]
> Sent: Thursday, 10 July, 2014 1:21 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jens Axboe; dgilbert@interlog.com; Christoph Hellwig; James Bottomley;
> Bart Van Assche; Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On Thu, Jul 10, 2014 at 12:53:36AM +0000, Elliott, Robert (Server Storage)
> wrote:
> > the problem still occurs - fio results in low or 0 IOPS, with perf top
> > reporting unusual amounts of time spent in do_io_submit and io_submit.
> 
> The diff between the two version doesn't show too much other possible
> interesting commits, the most interesting being some minor block
> updates.
> 
> I guess we'll have to do a manual bisect, I've pushed out a

> scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> the block tree

good.

> and a scsi-mq.3-bisect-2 branch that is just after the merge of the 
> block tree to get started.

good.
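
For reference, an illustrative test loop for each bisect branch
(the remote name "hch" for Christoph's scsi tree is assumed):

git fetch hch
git checkout hch/scsi-mq.3-bisect-1
make olddefconfig && make -j
# install the kernel, reboot, re-run the fio job above, record good/bad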



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 15:51           ` Elliott, Robert (Server Storage)
@ 2014-07-10 16:04             ` Christoph Hellwig
  2014-07-10 16:14               ` Christoph Hellwig
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-10 16:04 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jens Axboe, dgilbert, James Bottomley, Bart Van Assche,
	Benjamin LaHaise, linux-scsi, linux-kernel

On Thu, Jul 10, 2014 at 03:51:44PM +0000, Elliott, Robert (Server Storage) wrote:
> > scsi-mq.3-bisect-1 branch that is rebased to just before the merge of
> > the block tree
> 
> good.
> 
> > and a scsi-mq.3-bisect-2 branch that is just after the merge of the 
> > block tree to get started.
> 
> good.

It's starting to look weird.  I'll prepare another two bisect branches
around some MM changes, which seems the only other possible candidate.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 16:04             ` Christoph Hellwig
@ 2014-07-10 16:14               ` Christoph Hellwig
  2014-07-10 18:49                 ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-10 16:14 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jens Axboe, dgilbert, James Bottomley, Bart Van Assche,
	Benjamin LaHaise, linux-scsi, linux-kernel

On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> It's starting to look weird.  I'll prepare another two bisect branches
> around some MM changes, which seems the only other possible candidate.

I've pushed out scsi-mq.3-bisect-3 and scsi-mq.3-bisect-4 for you.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-10 16:14               ` Christoph Hellwig
@ 2014-07-10 18:49                 ` Elliott, Robert (Server Storage)
  2014-07-10 19:14                     ` Jeff Moyer
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-10 18:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, dgilbert, James Bottomley, Bart Van Assche,
	Benjamin LaHaise, linux-scsi, linux-kernel



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@infradead.org]
> Sent: Thursday, 10 July, 2014 11:15 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jens Axboe; dgilbert@interlog.com; James Bottomley; Bart Van Assche;
> Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> > It's starting to look weird.  I'll prepare another two bisect branches
> > around some MM changes, which seems the only other possible candidate.
> 
> I've pushed out scsi-mq.3-bisect-3 

Good.

> and scsi-mq.3-bisect-4 for you.

Bad.

Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index df95a2f..11b65d4 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
	uint64_t flags = GET_LE(&in->sh_flags);

	bool copy = flags & SHF_ALLOC &&
+		(GET_LE(&in->sh_size) ||
+		(GET_LE(&in->sh_type) != SHT_RELA &&
+		GET_LE(&in->sh_type) != SHT_REL)) &&
		strcmp(name, ".altinstructions") &&
		strcmp(name, ".altinstr_replacement");

Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

perf top shows lookup_ioctx function alongside io_submit and
do_io_submit this time:
 14.96%  [kernel]             [k] lookup_ioctx
 14.71%  libaio.so.1.0.1      [.] io_submit
 13.78%  [kernel]             [k] system_call
 10.79%  [kernel]             [k] system_call_after_swapgs
 10.17%  [kernel]             [k] do_io_submit
  8.91%  [kernel]             [k] copy_user_generic_string
  4.24%  [kernel]             [k] io_submit_one
  3.93%  [kernel]             [k] blk_flush_plug_list
  3.32%  fio                  [.] fio_libaio_commit
  2.84%  [kernel]             [k] sysret_check
  2.06%  [kernel]             [k] blk_finish_plug
  1.89%  [kernel]             [k] SyS_io_submit
  1.48%  [kernel]             [k] blk_start_plug
  1.04%  fio                  [.] io_submit@plt
  0.84%  [kernel]             [k] __get_user_4
  0.74%  [kernel]             [k] system_call_fastpath
  0.60%  [kernel]             [k] _copy_from_user
  0.51%  diff                 [.] 0x0000000000007abb

ftrace on CPU 0 shows similar repetition to before:
             fio-4107  [000] ....   389.992300: lookup_ioctx <-do_io_submit
             fio-4107  [000] ....   389.992300: blk_start_plug <-do_io_submit
             fio-4107  [000] ....   389.992300: io_submit_one <-do_io_submit
             fio-4107  [000] ....   389.992300: blk_finish_plug <-do_io_submit
             fio-4107  [000] ....   389.992300: blk_flush_plug_list <-blk_finish_plug
             fio-4107  [000] ....   389.992301: SyS_io_submit <-system_call_fastpath
             fio-4107  [000] ....   389.992301: do_io_submit <-SyS_io_submit
             fio-4107  [000] ....   389.992301: lookup_ioctx <-do_io_submit
             fio-4107  [000] ....   389.992301: blk_start_plug <-do_io_submit
             fio-4107  [000] ....   389.992301: io_submit_one <-do_io_submit
             fio-4107  [000] ....   389.992301: blk_finish_plug <-do_io_submit
             fio-4107  [000] ....   389.992301: blk_flush_plug_list <-blk_finish_plug
             fio-4107  [000] ....   389.992301: SyS_io_submit <-system_call_fastpath
             fio-4107  [000] ....   389.992302: do_io_submit <-SyS_io_submit
             fio-4107  [000] ....   389.992302: lookup_ioctx <-do_io_submit
             fio-4107  [000] ....   389.992302: blk_start_plug <-do_io_submit
             fio-4107  [000] ....   389.992302: io_submit_one <-do_io_submit
             fio-4107  [000] ....   389.992302: blk_finish_plug <-do_io_submit
             fio-4107  [000] ....   389.992302: blk_flush_plug_list <-blk_finish_plug




^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 18:49                 ` Elliott, Robert (Server Storage)
@ 2014-07-10 19:14                     ` Jeff Moyer
  0 siblings, 0 replies; 99+ messages in thread
From: Jeff Moyer @ 2014-07-10 19:14 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, Jens Axboe, dgilbert, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel

"Elliott, Robert (Server Storage)" <Elliott@hp.com> writes:

>> -----Original Message-----
>> From: Christoph Hellwig [mailto:hch@infradead.org]
>> Sent: Thursday, 10 July, 2014 11:15 AM
>> To: Elliott, Robert (Server Storage)
>> Cc: Jens Axboe; dgilbert@interlog.com; James Bottomley; Bart Van Assche;
>> Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
>> Subject: Re: scsi-mq V2
>> 
>> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
>> > It's starting to look weird.  I'll prepare another two bisect branches
>> > around some MM changes, which seems the only other possible candidate.
>> 
>> I've pushed out scsi-mq.3-bisect-3 
>
> Good.
>
>> and scsi-mq.3-bisect-4 for you.
>
> Bad.
>
> Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
> diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
> index df95a2f..11b65d4 100644
> --- a/arch/x86/vdso/vdso2c.h
> +++ b/arch/x86/vdso/vdso2c.h
> @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct BITSFUNC(fake_sections) *out,
> 	uint64_t flags = GET_LE(&in->sh_flags);
>
> 	bool copy = flags & SHF_ALLOC &&
> +		(GET_LE(&in->sh_size) ||
> +		(GET_LE(&in->sh_type) != SHT_RELA &&
> +		GET_LE(&in->sh_type) != SHT_REL)) &&
> 		strcmp(name, ".altinstructions") &&
> 		strcmp(name, ".altinstr_replacement");
>
> Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
> an fio hang, with one CPU (CPU 0) stuck in io_submit loops.

Hi, Rob,

Can you get sysrq-t output for me?  I don't know how/why we'd continue
to get io_submits for an exiting process.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 19:14                     ` Jeff Moyer
@ 2014-07-10 19:36                       ` Jeff Moyer
  0 siblings, 0 replies; 99+ messages in thread
From: Jeff Moyer @ 2014-07-10 19:36 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, Jens Axboe, dgilbert, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel

Jeff Moyer <jmoyer@redhat.com> writes:

> Hi, Rob,
>
> Can you get sysrq-t output for me?  I don't know how/why we'd continue
> to get io_submits for an exiting process.

Also, do you know what sys_io_submit is returning?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 15:11                       ` Jeff Moyer
@ 2014-07-10 19:59                         ` Jens Axboe
  0 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2014-07-10 19:59 UTC (permalink / raw)
  To: Jeff Moyer, Benjamin LaHaise
  Cc: Elliott, Robert (Server Storage),
	Christoph Hellwig, dgilbert, James Bottomley, Bart Van Assche,
	linux-scsi, linux-kernel

On 2014-07-10 17:11, Jeff Moyer wrote:
> Benjamin LaHaise <bcrl@kvack.org> writes:
>
>>>
>>> [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>> [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>
>> Something is horribly wrong here.  There is no way that value for nr_events
>> should be passed in to ioctx_alloc().  This implies that userland is calling
>> io_setup() with an impossibly large value for nr_events.  Can you post the
>> actual diff for your fs/aio.c relative to linus' tree?
>>
>
> fio does exactly this!  it passes INT_MAX.

That's correct, I had actually forgotten about this. It was a change 
made a few years back, in correlation with the aio optimizations posted 
then, basically telling aio to ignore that silly (and broken) user ring.
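
For reference, the -2 in the log is just INT_MAX after ioctx_alloc()'s
internal doubling of nr_events: the product wraps to -2 when printed as
a signed int.  A minimal userspace sketch of the arithmetic (a
standalone illustration, not from fs/aio.c):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned int nr_events = INT_MAX;  /* what fio passes to io_setup() */

	nr_events *= 2;  /* ioctx_alloc() doubles it before sizing the ring */
	printf("nr_events=%d\n", (int)nr_events);  /* prints nr_events=-2 */
	return 0;
}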

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 19:59                         ` Jens Axboe
@ 2014-07-10 20:05                           ` Jeff Moyer
  0 siblings, 0 replies; 99+ messages in thread
From: Jeff Moyer @ 2014-07-10 20:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Benjamin LaHaise, Elliott, Robert (Server Storage),
	Christoph Hellwig, dgilbert, James Bottomley, Bart Van Assche,
	linux-scsi, linux-kernel

Jens Axboe <axboe@kernel.dk> writes:

> On 2014-07-10 17:11, Jeff Moyer wrote:
>> Benjamin LaHaise <bcrl@kvack.org> writes:
>>
>>>>
>>>> [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>> [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>
>>> Something is horribly wrong here.  There is no way that value for nr_events
>>> should be passed in to ioctx_alloc().  This implies that userland is calling
>>> io_setup() with an impossibly large value for nr_events.  Can you post the
>>> actual diff for your fs/aio.c relative to linus' tree?
>>>
>>
>> fio does exactly this!  it passes INT_MAX.
>
> That's correct, I had actually forgotten about this. It was a change
> made a few years back, in correlation with the aio optimizations
> posted then, basically telling aio to ignore that silly (and broken)
> user ring.

I still don't see how you accomplish that.  Making it bigger doesn't get
rid of it.  ;-)

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-10 20:05                           ` Jeff Moyer
@ 2014-07-10 20:06                             ` Jens Axboe
  0 siblings, 0 replies; 99+ messages in thread
From: Jens Axboe @ 2014-07-10 20:06 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Benjamin LaHaise, Elliott, Robert (Server Storage),
	Christoph Hellwig, dgilbert, James Bottomley, Bart Van Assche,
	linux-scsi, linux-kernel

On 2014-07-10 22:05, Jeff Moyer wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>
>> On 2014-07-10 17:11, Jeff Moyer wrote:
>>> Benjamin LaHaise <bcrl@kvack.org> writes:
>>>
>>>>>
>>>>> [  186.339064] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [  186.339065] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [  186.339067] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [  186.339068] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>> [  186.339069] ioctx_alloc: nr_events=-2 aio_max_nr=65536
>>>>
>>>> Something is horribly wrong here.  There is no way that value for nr_events
>>>> should be passed in to ioctx_alloc().  This implies that userland is calling
>>>> io_setup() with an impossibly large value for nr_events.  Can you post the
>>>> actual diff for your fs/aio.c relative to linus' tree?
>>>>
>>>
>>> fio does exactly this!  it passes INT_MAX.
>>
>> That's correct, I had actually forgotten about this. It was a change
>> made a few years back, in correlation with the aio optimizations
>> posted then, basically telling aio to ignore that silly (and broken)
>> user ring.
>
> I still don't see how you accomplish that.  Making it bigger doesn't get
> rid of it.  ;-)

See the patches from back then - INT_MAX basically just meant the same 
as 0, but 0 could not be used because of the (silly) setup with the 
wrappers around the syscalls. So INT_MAX was overloaded to mean "no ring 
events, I don't care".
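
For illustration, roughly what that looks like on the fio side - a
paraphrased sketch, not fio's exact code, and the -EINVAL fallback to
the real queue depth is an assumption about how fio copes with kernels
that reject the oversized nr_events:

#include <errno.h>
#include <limits.h>
#include <libaio.h>

/* Sketch: ask for "no ring events, I don't care" first; stock kernels
 * refuse INT_MAX with -EINVAL, so fall back to the real depth. */
static int setup_ctx(io_context_t *ctx, int iodepth)
{
	int err = io_queue_init(INT_MAX, ctx);

	if (err == -EINVAL)
		err = io_queue_init(iodepth, ctx);
	return err;
}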

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-10 19:14                     ` Jeff Moyer
  (?)
  (?)
@ 2014-07-10 21:10                     ` Elliott, Robert (Server Storage)
  2014-07-11  6:02                       ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-10 21:10 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Jens Axboe, dgilbert, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel



> -----Original Message-----
> From: Jeff Moyer [mailto:jmoyer@redhat.com]
> Sent: Thursday, 10 July, 2014 2:14 PM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; Jens Axboe; dgilbert@interlog.com; James Bottomley;
> Bart Van Assche; Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> "Elliott, Robert (Server Storage)" <Elliott@hp.com> writes:
> 
> >> -----Original Message-----
> >> From: Christoph Hellwig [mailto:hch@infradead.org]
> >> Sent: Thursday, 10 July, 2014 11:15 AM
> >> To: Elliott, Robert (Server Storage)
> >> Cc: Jens Axboe; dgilbert@interlog.com; James Bottomley; Bart Van Assche;
> >> Benjamin LaHaise; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org
> >> Subject: Re: scsi-mq V2
> >>
> >> On Thu, Jul 10, 2014 at 09:04:22AM -0700, Christoph Hellwig wrote:
> >> > It's starting to look weird.  I'll prepare another two bisect branches
> >> > around some MM changes, which seems the only other possible candidate.
> >>
> >> I've pushed out scsi-mq.3-bisect-3
> >
> > Good.
> >
> >> and scsi-mq.3-bisect-4 for you.
> >
> > Bad.
> >
> > Note: I had to apply the vdso2c.h patch to build this -rc3 based kernel:
> > diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
> > index df95a2f..11b65d4 100644
> > --- a/arch/x86/vdso/vdso2c.h
> > +++ b/arch/x86/vdso/vdso2c.h
> > @@ -93,6 +93,9 @@ static void BITSFUNC(copy_section)(struct
> BITSFUNC(fake_sections) *out,
> > 	uint64_t flags = GET_LE(&in->sh_flags);
> >
> > 	bool copy = flags & SHF_ALLOC &&
> > +		(GET_LE(&in->sh_size) ||
> > +		(GET_LE(&in->sh_type) != SHT_RELA &&
> > +		GET_LE(&in->sh_type) != SHT_REL)) &&
> > 		strcmp(name, ".altinstructions") &&
> > 		strcmp(name, ".altinstr_replacement");
> >
> > Results: fio started OK, getting 900K IOPS, but ^C led to 0 IOPS and
> > an fio hang, with one CPU (CPU 0) stuck in io_submit loops.
> 

I added some prints in aio_setup_ring and  ioctx_alloc and
rebooted.  This time it took much longer to hit the problem.  It 
survived dozens of ^Cs.  Running a few minutes, though, IOPS 
eventually dropped.  So, sometimes it happens immediately,
sometimes it takes time to develop.

I will rerun bisect-1 -2 and -3 for longer times to increase
confidence that they didn't just appear good.

On this bisect-4 run, as IOPS started to drop from 900K to 40K, 
I ran perf top when it was at 700K.  You can see io_submit times
creeping up.

  4.30%  [kernel]            [k] do_io_submit
  4.29%  [kernel]            [k] _raw_spin_lock_irqsave
  3.88%  libaio.so.1.0.1     [.] io_submit
  3.55%  [kernel]            [k] system_call
  3.34%  [kernel]            [k] put_compound_page
  3.11%  [kernel]            [k] io_submit_one
  3.06%  [kernel]            [k] system_call_after_swapgs
  2.89%  [kernel]            [k] copy_user_generic_string
  2.45%  [kernel]            [k] lookup_ioctx
  2.16%  [kernel]            [k] apic_timer_interrupt
  2.00%  [kernel]            [k] _raw_spin_lock
  1.97%  [scsi_debug]        [k] sdebug_q_cmd_hrt_complete
  1.84%  [kernel]            [k] __get_page_tail
  1.74%  [kernel]            [k] do_blockdev_direct_IO
  1.68%  [kernel]            [k] blk_flush_plug_list
  1.41%  [kernel]            [k] _raw_spin_unlock_irqrestore
  1.24%  [scsi_debug]        [k] schedule_resp

finally settling like before:
 14.15%  [kernel]                    [k] do_io_submit
 13.61%  libaio.so.1.0.1             [.] io_submit
 11.81%  [kernel]                    [k] system_call
 10.11%  [kernel]                    [k] system_call_after_swapgs
  8.59%  [kernel]                    [k] io_submit_one
  8.56%  [kernel]                    [k] copy_user_generic_string
  7.96%  [kernel]                    [k] lookup_ioctx
  5.33%  [kernel]                    [k] blk_flush_plug_list
  3.11%  [kernel]                    [k] blk_finish_plug
  2.84%  [kernel]                    [k] sysret_check
  2.63%  fio                         [.] fio_libaio_commit
  2.27%  [kernel]                    [k] blk_start_plug
  1.17%  [kernel]                    [k] SyS_io_submit


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-10 21:10                     ` Elliott, Robert (Server Storage)
@ 2014-07-11  6:02                       ` Elliott, Robert (Server Storage)
  2014-07-11  6:14                         ` Christoph Hellwig
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-11  6:02 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Jens Axboe, dgilbert, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Elliott, Robert (Server Storage)
> 
> I added some prints in aio_setup_ring and  ioctx_alloc and
> rebooted.  This time it took much longer to hit the problem.  It
> survived dozens of ^Cs.  Running a few minutes, though, IOPS
> eventually dropped.  So, sometimes it happens immediately,
> sometimes it takes time to develop.
> 
> I will rerun bisect-1 -2 and -3 for longer times to increase
> confidence that they didn't just appear good.

Allowing longer run times before declaring success, the problem 
does appear in all of the bisect trees.  I just let fio
continue to run for many minutes - no ^Cs necessary.

no-rebase: good for > 45 minutes (I will leave that running for
  8 more hours)
bisect-1: bad
bisect-2: bad
bisect-3: bad
bisect-4: bad



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-11  6:02                       ` Elliott, Robert (Server Storage)
@ 2014-07-11  6:14                         ` Christoph Hellwig
  2014-07-11 14:33                           ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-11  6:14 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Jeff Moyer, Christoph Hellwig, Jens Axboe, dgilbert,
	James Bottomley, Bart Van Assche, Benjamin LaHaise, linux-scsi,
	linux-kernel

On Fri, Jul 11, 2014 at 06:02:03AM +0000, Elliott, Robert (Server Storage) wrote:
> Allowing longer run times before declaring success, the problem 
> does appear in all of the bisect trees.  I just let fio
> continue to run for many minutes - no ^Cs necessary.
> 
> no-rebase: good for > 45 minutes (I will leave that running for
>   8 more hours)

Ok, thanks.  If it's still running tomorrow morning let's look into the
aio reverts again.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-11  6:14                         ` Christoph Hellwig
@ 2014-07-11 14:33                           ` Elliott, Robert (Server Storage)
  2014-07-11 14:55                             ` Benjamin LaHaise
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-11 14:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Moyer, Jens Axboe, dgilbert, James Bottomley,
	Bart Van Assche, Benjamin LaHaise, linux-scsi, linux-kernel



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@infradead.org]
> Sent: Friday, 11 July, 2014 1:15 AM
> To: Elliott, Robert (Server Storage)
> Cc: Jeff Moyer; Christoph Hellwig; Jens Axboe; dgilbert@interlog.com; James
> Bottomley; Bart Van Assche; Benjamin LaHaise; linux-scsi@vger.kernel.org;
> linux-kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
> 
> On Fri, Jul 11, 2014 at 06:02:03AM +0000, Elliott, Robert (Server Storage)
> wrote:
> > Allowing longer run times before declaring success, the problem
> > does appear in all of the bisect trees.  I just let fio
> > continue to run for many minutes - no ^Cs necessary.
> >
> > no-rebase: good for > 45 minutes (I will leave that running for
> >   8 more hours)
> 
> Ok, thanks.  If it's still running tomorrow morning let's look into the
> aio reverts again.

That ran 9 total hours with no problem.

Rather than revert in the bisect trees, I added just this single additional
patch to the no-rebase tree, and the problem appeared:


48a2e94154177286b3bcbed25ea802232527fa7c
aio: fix aio request leak when events are reaped by userspace

diff --git a/fs/aio.c b/fs/aio.c
index 4f078c0..e59bba8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1021,6 +1021,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

        /* everything turned out well, dispose of the aiocb. */
        kiocb_free(iocb);
+       put_reqs_available(ctx, 1);     /* added by patch f8567 */

        /*
         * We have to order our ring_info tail store above and test
@@ -1101,7 +1102,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

        pr_debug("%li  h%u t%u\n", ret, head, tail);

-       put_reqs_available(ctx, ret);
+       /* put_reqs_available(ctx, ret); removed by patch f8567 */
 out:
        mutex_unlock(&ctx->ring_lock);


---
Rob Elliott    HP Server Storage




^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-11 14:33                           ` Elliott, Robert (Server Storage)
@ 2014-07-11 14:55                             ` Benjamin LaHaise
  2014-07-12 21:50                               ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Benjamin LaHaise @ 2014-07-11 14:55 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, Jeff Moyer, Jens Axboe, dgilbert,
	James Bottomley, Bart Van Assche, linux-scsi, linux-kernel

On Fri, Jul 11, 2014 at 02:33:12PM +0000, Elliott, Robert (Server Storage) wrote:
> That ran 9 total hours with no problem.
> 
> Rather than revert in the bisect trees, I added just this single additional
> patch to the no-rebase tree, and the problem appeared:

Can you try the below totally untested patch instead?  It looks like
put_reqs_available() is not irq-safe: aio_complete() can run from
interrupt context, so the kcpu->reqs_available update can be interrupted
in the middle of its read-modify-write by a completion on the same CPU.

		-ben
-- 
"Thought is the essence of where you are now."


diff --git a/fs/aio.c b/fs/aio.c
index 955947e..4b97180 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
 	struct kioctx_cpu *kcpu;
+	unsigned long flags;
 
 	preempt_disable();
 	kcpu = this_cpu_ptr(ctx->cpu);
 
+	local_irq_save(flags);
 	kcpu->reqs_available += nr;
+
 	while (kcpu->reqs_available >= ctx->req_batch * 2) {
 		kcpu->reqs_available -= ctx->req_batch;
 		atomic_add(ctx->req_batch, &ctx->reqs_available);
 	}
 
+	local_irq_restore(flags);
 	preempt_enable();
 }
 

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-11 14:55                             ` Benjamin LaHaise
@ 2014-07-12 21:50                               ` Elliott, Robert (Server Storage)
  2014-07-12 23:20                                 ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-12 21:50 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Christoph Hellwig, Jeff Moyer, Jens Axboe, dgilbert,
	James Bottomley, Bart Van Assche, linux-scsi, linux-kernel



> -----Original Message-----
> From: Benjamin LaHaise [mailto:bcrl@kvack.org]
> Sent: Friday, 11 July, 2014 9:55 AM
> To: Elliott, Robert (Server Storage)
> Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; dgilbert@interlog.com; James
> Bottomley; Bart Van Assche; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: Re: scsi-mq V2
...
> Can you try the below totally untested patch instead?  It looks like
> put_reqs_available() is not irq-safe.
> 

With that addition alone, fio still runs into the same problem.

I added the same fix to get_reqs_available, which also accesses 
kcpu->reqs_available, and the test has run for 35 minutes with 
no problem.

Patch applied:

diff --git a/fs/aio.c b/fs/aio.c
index e59bba8..8e85e26 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -830,16 +830,20 @@ void exit_aio(struct mm_struct *mm)
 static void put_reqs_available(struct kioctx *ctx, unsigned nr)
 {
 	struct kioctx_cpu *kcpu;
+	unsigned long flags;
 
 	preempt_disable();
 	kcpu = this_cpu_ptr(ctx->cpu);
 
+	local_irq_save(flags);
 	kcpu->reqs_available += nr;
+
 	while (kcpu->reqs_available >= ctx->req_batch * 2) {
 		kcpu->reqs_available -= ctx->req_batch;
 		atomic_add(ctx->req_batch, &ctx->reqs_available);
 	}
 
+	local_irq_restore(flags);
 	preempt_enable();
 }
 
@@ -847,10 +851,12 @@ static bool get_reqs_available(struct kioctx *ctx)
 {
 	struct kioctx_cpu *kcpu;
 	bool ret = false;
+	unsigned long flags;
 
 	preempt_disable();
 	kcpu = this_cpu_ptr(ctx->cpu);
 
+	local_irq_save(flags);
 	if (!kcpu->reqs_available) {
 		int old, avail = atomic_read(&ctx->reqs_available);
 
@@ -869,6 +875,7 @@ static bool get_reqs_available(struct kioctx *ctx)
 	ret = true;
 	kcpu->reqs_available--;
 out:
+	local_irq_restore(flags);
 	preempt_enable();
 	return ret;
 }

--
I will see if that solves the problem with the scsi-mq-3 tree, or 
at least some of the bisect trees leading up to it.

A few other comments:

1. Those changes boost _raw_spin_lock_irqsave into first place
in perf top:

  6.59%  [kernel]                    [k] _raw_spin_lock_irqsave
  4.37%  [kernel]                    [k] put_compound_page
  2.87%  [scsi_debug]                [k] sdebug_q_cmd_hrt_complete
  2.74%  [kernel]                    [k] _raw_spin_lock
  2.73%  [kernel]                    [k] apic_timer_interrupt
  2.41%  [kernel]                    [k] do_blockdev_direct_IO
  2.24%  [kernel]                    [k] __get_page_tail
  1.97%  [kernel]                    [k] _raw_spin_unlock_irqrestore
  1.87%  [kernel]                    [k] scsi_queue_rq
  1.76%  [scsi_debug]                [k] schedule_resp

Maybe (later) kcpu->reqs_available should be converted to an atomic,
like ctx->reqs_available, to reduce that overhead?  (A rough sketch of
that idea follows after item 3 below.)

2. After the f8567a3 patch, aio_complete has one early return that 
bypasses the call to put_reqs_available.  Is that OK, or does
that mean that sync iocbs will now eat up reqs_available?

        /*
         * Special case handling for sync iocbs:
         *  - events go directly into the iocb for fast handling
         *  - the sync task with the iocb in its stack holds the single iocb
         *    ref, no other paths have a way to get another ref
         *  - the sync task helpfully left a reference to itself in the iocb
         */
        if (is_sync_kiocb(iocb)) {
                iocb->ki_user_data = res;
                smp_wmb();
                iocb->ki_ctx = ERR_PTR(-EXDEV);
                wake_up_process(iocb->ki_obj.tsk);
                return;
        }


3. The f8567a3 patch renders this comment in aio.c out of date - 
reqs_available is no longer incremented when an event is pulled off the 
ringbuffer, but is now incremented when aio_complete is called.

        struct {
                /*
                 * This counts the number of available slots in the ringbuffer,
                 * so we avoid overflowing it: it's decremented (if positive)
                 * when allocating a kiocb and incremented when the resulting
                 * io_event is pulled off the ringbuffer.
                 *
                 * We batch accesses to it with a percpu version.
                 */
                atomic_t        reqs_available;
        } ____cacheline_aligned_in_smp;
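
A rough, untested sketch of the conversion suggested in item 1
(hypothetical, not a proposed patch): irq-safe this_cpu_*() ops would
replace the explicit irq/preempt fencing, with a cmpxchg loop for the
batch drain.  Whether it actually beats the local_irq_save() pair would
need measuring:

static void put_reqs_available(struct kioctx *ctx, unsigned nr)
{
	unsigned int old, new;

	/* this_cpu_add() is irq-safe on its own, so no local_irq_save() */
	this_cpu_add(ctx->cpu->reqs_available, nr);

	/* Drain surplus batches back into the shared pool.  A racing
	 * update (irq or migration) just makes the cmpxchg fail and
	 * the loop retry. */
	for (;;) {
		old = this_cpu_read(ctx->cpu->reqs_available);
		if (old < ctx->req_batch * 2)
			break;
		new = old - ctx->req_batch;
		if (this_cpu_cmpxchg(ctx->cpu->reqs_available, old, new) == old)
			atomic_add(ctx->req_batch, &ctx->reqs_available);
	}
}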


---
Rob Elliott    HP Server Storage




^ permalink raw reply related	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-12 21:50                               ` Elliott, Robert (Server Storage)
@ 2014-07-12 23:20                                 ` Elliott, Robert (Server Storage)
  2014-07-13 17:15                                   ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-12 23:20 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Christoph Hellwig, Jeff Moyer, Jens Axboe, dgilbert,
	James Bottomley, Bart Van Assche, linux-scsi, linux-kernel

> I will see if that solves the problem with the scsi-mq-3 tree, or
> at least some of the bisect trees leading up to it.

scsi-mq-3 is still going after 45 minutes.  I'll leave it running
overnight.



^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: scsi-mq V2
  2014-07-12 23:20                                 ` Elliott, Robert (Server Storage)
@ 2014-07-13 17:15                                   ` Elliott, Robert (Server Storage)
  2014-07-14 17:15                                     ` Benjamin LaHaise
  0 siblings, 1 reply; 99+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-07-13 17:15 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Christoph Hellwig, Jeff Moyer, Jens Axboe, dgilbert,
	James Bottomley, Bart Van Assche, linux-scsi, linux-kernel



> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Elliott, Robert (Server Storage)
> Sent: Saturday, July 12, 2014 6:20 PM
> To: Benjamin LaHaise
> Cc: Christoph Hellwig; Jeff Moyer; Jens Axboe; dgilbert@interlog.com;
> James Bottomley; Bart Van Assche; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: RE: scsi-mq V2
> 
> > I will see if that solves the problem with the scsi-mq-3 tree, or
> > at least some of the bisect trees leading up to it.
> 
> scsi-mq-3 is still going after 45 minutes.  I'll leave it running
> overnight.
> 

That has been going strong for 18 hours, so I think that's the patch
we need.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-08 14:48 ` Christoph Hellwig
  2014-07-09 16:39   ` Douglas Gilbert
@ 2014-07-14  9:13   ` Sagi Grimberg
  2014-08-21 12:32       ` Sagi Grimberg
  1 sibling, 1 reply; 99+ messages in thread
From: Sagi Grimberg @ 2014-07-14  9:13 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel
  Cc: Or Gerlitz, Oren Duer, Nicholas A. Bellinger, Mike Christie,
	Bart Van Assche

On 7/8/2014 5:48 PM, Christoph Hellwig wrote:
<SNIP>
> I've pushed out a new scsi-mq.3 branch, which has been rebased on the
> latest core-for-3.17 tree + the "RFC: clean up command setup" series
> from June 29th.  Robert Elliott found a problem with not fully zeroed
> out UNMAP CDBs, which is fixed by the saner discard handling in that
> series.
>
> There is a new patch to factor the code from the above series for
> blk-mq use, which I've attached below.  Besides that the only changes
> are minor merge fixups in the main blk-mq usage patch.

Hey Christoph & Co,

I'd like to share some benchmarks I took on this patch set using iSER 
initiator (+2 pre-submitted performance improvements) vs LIO iSER target.
I ran workloads I think are interesting use-cases (single LUN with 1,2,4 
IO threads up to a fully occupied system doing IO to multiple LUNs).
Overall (except for 2 strange anomalies) it seems that the scsi-mq patches 
(use_blk_mq=N) roughly sustain traditional scsi performance.
On the other hand, the scsi-mq code path (use_blk_mq=Y) on its own clearly 
shows better performance (tables below).

At first I too hit the aio issues discussed in this thread and converted 
to scsi-mq.3-no-rebase for testing (thanks Doug & Rob for raising it).
I must say that for some reason I get very low numbers for writes vs. 
reads (write perf is stuck at ~20K IOPs per thread); this happens
on 3.16-rc2 even before the scsi-mq patches. Did anyone step on this as 
well, or is it just a weird problem I'm having in my setup?
Anyway, this is why my benchmarks show only the randread IO pattern 
(getting familiar numbers). I need to figure out what's wrong
with IO writes - I'll start bisecting on this.

I also reviewed the patch set and at this point, I don't have any 
comments. So you can add to the series:
Reviewed-by: Sagi Grimberg '<sagig@dev.mellanox.co.il>' (or Tested-by - 
whatever you choose).

I want to state that I tested a traditional iSER initiator - no scsi-mq 
adoption at all.
I started looking into adopting scsi-mq for iSCSI/iSER recently, and I 
must say that the scsi-mq adoption is not so
trivial due to iSCSI session-wide CmdSN/StatSN ordering constraints 
(we can't just use more RDMA channels per connection...)
I'll be on vacation for the next couple of weeks, so I'll start a 
separate thread to get the community input on this matter.


Results: table entries are KIOPS(CPU%)
3.16-rc2 (scsi-mq patches reverted)
    Threads/LUN   1           2            4
#LUNs
  1             231(6.5%)   355(18.5%)   337(31.1%)
  2             446(13.6%)  673(37.2%)   654(49.8%)
  4             594(25%)    960(49.41%) 1165(99.3%)
  8            1018(50.3%) 1563(99.6%)  1696(99.9%)
  16           1660(86.5%) 1731(99.6%)  1710(100%)


3.16-rc2 (scsi-mq included, use_blk_mq=N)
    Threads/LUN   1           2            4
#LUNs
  1             231(6.5%)   351(18.5%)   337(31.4%)
  2             446(13.6%)  660(37.3%)   647(50%)
  4             591(25%)    967(49.7%)  1136(98.1%)
  8            1014(52.1%) 1296(100%)   1470(100%)
  16           1741(100%)  1761(100%)   1853(100%)


3.16-rc2 (scsi-mq included, use_blk_mq=Y)
    Threads/LUN   1           2            4
#LUNs
  1             265(6.4%)   465(13.4%)   572(27.9%)
  2             507(13.4%)  902(27.8%)  1034(45.9%)
  4             697(25%)   1197(49.5%)  1477(98.6%)
  8            1257(53.6%) 1856(98.7%)  1906(100%)
  16           1991(100%)  2021(100%)   2020(100%)

Notes:
     - IOPs measurements are the average of a 60 seconds runs.
     - The CPU measurement is the total usage across all CPUs; to get
       per-CPU utilization, the value should be normalized to 16 cores.
     - scsi_mq (use_blk_mq=N) has roughly the same performance as the
       traditional scsi IO path, but I see an anomaly in the
       {8 LUNs, 2/4 threads per LUN} test cases. This may result from
       NUMA misalignment for threads/interrupts – requires further
       investigation.
     - iSER initiator has no Multi-Queue awareness.

Testing environment:
     - Initiator and target systems of 16 (8x2) cores (Hyperthreading
       disabled).
     - CPU model: Intel(R) Xeon(R) @ 2.60GHz
     - Block Layer settings:
         - scheduler=noop
         - rq_affinity=1
         - add_random=0
         - nomerges=1
     - Single FDR link between the target and initiator.
     - Device model: Mellanox ConnectIB (the numbers are also similar
       with Mellanox ConnectX-3).
     - MSIX interrupt vectors were spread across system cores.
     - irqbalancer was disabled.
     - scsi_host settings:
         - cmd_per_lun=32 (default)
         - can_queue=113 (default)
     - In the multi-LUN test cases, each LUN is exposed via a different
       scsi_host (iSCSI session).

Software:
     - fio version: 2.0.13
     - LIO iSER target (target-pending for-next)
     - Null backing devices (NULLIO)
     - Upstream based iSER initiator + internal pre-submitted
       performance enhancements.

fio configuration:
rw=randread
bs=1k
iodepth=128
loops=1
ioengine=libaio
direct=1
invalidate=1
fsync_on_close=1
randrepeat=1
norandommap

Cheers,
Sagi.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: scsi-mq V2
  2014-07-13 17:15                                   ` Elliott, Robert (Server Storage)
@ 2014-07-14 17:15                                     ` Benjamin LaHaise
  0 siblings, 0 replies; 99+ messages in thread
From: Benjamin LaHaise @ 2014-07-14 17:15 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: Christoph Hellwig, Jeff Moyer, Jens Axboe, dgilbert,
	James Bottomley, Bart Van Assche, linux-scsi, linux-kernel

Hi Robert,

On Sun, Jul 13, 2014 at 05:15:15PM +0000, Elliott, Robert (Server Storage) wrote:
> > > I will see if that solves the problem with the scsi-mq-3 tree, or
> > > at least some of the bisect trees leading up to it.
> > 
> > scsi-mq-3 is still going after 45 minutes.  I'll leave it running
> > overnight.
> > 
> 
> That has been going strong for 18 hours, so I think that's the patch
> we need.

Thanks for taking the time to narrow this down.  I've applied the fix to 
my aio-fixes tree at git://git.kvack.org/~bcrl/aio-fixes.git and forwarded 
it on to Linus as well.

		-ben
-- 
"Thought is the essence of where you are now."

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-06-25 16:52 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
  2014-07-09 11:25   ` Hannes Reinecke
@ 2014-07-16 11:13   ` Mike Christie
  2014-07-16 11:16     ` Christoph Hellwig
  1 sibling, 1 reply; 99+ messages in thread
From: Mike Christie @ 2014-07-16 11:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Jens Axboe, Bart Van Assche, Robert Elliott,
	linux-scsi, linux-kernel

On 06/25/2014 11:52 AM, Christoph Hellwig wrote:
> +static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
> +{
> +	struct request_queue *q = req->q;
> +	struct scsi_device *sdev = q->queuedata;
> +	struct Scsi_Host *shost = sdev->host;
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> +	int ret;
> +	int reason;
> +
> +	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
> +	if (ret)
> +		goto out;
> +
> +	ret = BLK_MQ_RQ_QUEUE_BUSY;
> +	if (!get_device(&sdev->sdev_gendev))
> +		goto out;
> +
> +	if (!scsi_dev_queue_ready(q, sdev))
> +		goto out_put_device;
> +	if (!scsi_target_queue_ready(shost, sdev))
> +		goto out_dec_device_busy;
> +	if (!scsi_host_queue_ready(q, shost, sdev))
> +		goto out_dec_target_busy;
> +
> +	if (!(req->cmd_flags & REQ_DONTPREP)) {
> +		ret = prep_to_mq(scsi_mq_prep_fn(req));
> +		if (ret)
> +			goto out_dec_host_busy;
> +		req->cmd_flags |= REQ_DONTPREP;
> +	}
> +
> +	scsi_init_cmd_errh(cmd);
> +	cmd->scsi_done = scsi_mq_done;
> +
> +	reason = scsi_dispatch_cmd(cmd);
> +	if (reason) {
> +		scsi_set_blocked(cmd, reason);
> +		ret = BLK_MQ_RQ_QUEUE_BUSY;
> +		goto out_dec_host_busy;
> +	}
> +
> +	return BLK_MQ_RQ_QUEUE_OK;
> +
> +out_dec_host_busy:
> +	cancel_delayed_work(&cmd->abort_work);

Hey Christoph,

I see the request timer is started before calling queue_rq, but I could
not figure out what the cancel_delayed_work here is for exactly. It
seems that if the request were to time out and the EH started while
queue_rq was running, we could end up with some nasty bugs, like the
request being requeued twice.

Is the cancel_delayed_work call just to be safe, or is it supposed to
handle a case where the abort_work could be queued at this point due to
a request timing out while queue_rq is running? Is this case mq-specific?

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-07-16 11:13   ` Mike Christie
@ 2014-07-16 11:16     ` Christoph Hellwig
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-16 11:16 UTC (permalink / raw)
  To: Mike Christie
  Cc: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel

On Wed, Jul 16, 2014 at 06:13:21AM -0500, Mike Christie wrote:
> I see the request timer is started before calling queue_rq, but I could
> not figure out what the cancel_delayed_work here is for exactly. It
> seems if the request were to time out and the eh started while queue_rq
> was running we could end up some nasty bugs like the requested requeued
> twice.
> 
> Is the cancel_delayed_work call just to be safe or is supposed to be
> handling a case where the abort_work could be queued at this time up due
> to a request timing out while queue_rq is running? Is this case mq specific?

It was cargo cult copy & paste from the old path.  I've merged a patch
from Bart to remove it from the old code, so it should go away here as well,
thanks for the reminder.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Performance degradation in IO writes vs. reads (was scsi-mq V2)
  2014-07-14  9:13   ` Sagi Grimberg
@ 2014-08-21 12:32       ` Sagi Grimberg
  0 siblings, 0 replies; 99+ messages in thread
From: Sagi Grimberg @ 2014-08-21 12:32 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel, viro
  Cc: Or Gerlitz, Oren Duer, Nicholas A. Bellinger, Mike Christie,
	linux-fsdevel, linux-kernel, Max Gurtovoy

On 7/14/2014 12:13 PM, Sagi Grimberg wrote:

<SNIP>

> I'd like to share some benchmarks I took on this patch set using iSER
> initiator (+2 pre-submitted performance improvements) vs LIO iSER target.
> I ran workloads I think are interesting use-cases (single LUN with 1,2,4
> IO threads up to a fully occupied system doing IO to multiple LUNs).
> Overall (except for 2 strange anomalies) it seems that the scsi-mq patches
> (use_blk_mq=N) roughly sustain traditional scsi performance.
> On the other hand, the scsi-mq code path (use_blk_mq=Y) on its own clearly
> shows better performance (tables below).
>
> At first I too hit the aio issues discussed in this thread and converted
> to scsi-mq.3-no-rebase for testing (thanks Doug & Rob for raising it).
> I must say that for some reason I get very low numbers for writes vs.
> reads (write perf is stuck at ~20K IOPs per thread); this happens
> on 3.16-rc2 even before the scsi-mq patches. Did anyone step on this as
> well, or is it just a weird problem I'm having in my setup?
> Anyway, this is why my benchmarks show only the randread IO pattern
> (getting familiar numbers). I need to figure out what's wrong
> with IO writes - I'll start bisecting on this.
>

Hi,

So I just got back to checking this issue of *extremely low* IO write
performance I got in 3.16-rc2.

Reminder:
I used iSER to benchmark the performance of Christoph's scsi-mq patches
and noticed that direct-IO writes are stuck at 20-50K IOPs instead of
the 350K IOPs I was used to seeing for a single device. The issue also
existed when I removed the scsi-mq patches, so I started bisecting to
see what broke.

Finally I narrowed it down to this completely unbisectable bulk of
commits that seems to introduce the issue:

/* IO write poor performance*/
2b777c9 ceph_sync_read: stop poking into iov_iter guts
f0d1bec new helper: copy_page_from_iter()
84c3d55 fuse: switch to ->write_iter()
b30ac0f btrfs: switch to ->write_iter()
3ef045c ocfs2: switch to ->write_iter()
bf97f3b xfs: switch to ->write_iter()
50b5551 afs: switch to ->write_iter()
da56e45 gfs2: switch to ->write_iter()
edaf436 nfs: switch to ->write_iter()
f5674c3 ubifs: switch to ->write_iter()
a8f3550 bury __generic_file_aio_write()
3dae875 cifs: switch to ->write_iter()
d4637bc udf: switch to ->write_iter()
9b88416 convert ext4 to ->write_iter()
a832475 Merge ext4 changes in ext4_file_write() into for-next
1456c0a blkdev_aio_write() - turn into blkdev_write_iter()
8174202 write_iter variants of {__,}generic_file_aio_write()
3644424 ceph: switch to ->read_iter()
3aa2d19 nfs: switch to ->read_iter()
a886038 fs/block_dev.c: switch to ->read_iter()
2ba5bbe shmem: switch to ->read_iter()
fb9096a pipe: switch to ->read_iter()
e6a7bcb cifs: switch to ->read_iter()
37c20f1 fuse_file_aio_read(): convert to ->read_iter()
3cd9ad5 ocfs2: switch to ->read_iter()
0279782 ecryptfs: switch to ->read_iter()
b4f5d2c xfs: switch to ->read_iter()
aad4f8b switch simple generic_file_aio_read() users to ->read_iter()
293bc98 new methods: ->read_iter() and ->write_iter()
7f7f25e replace checking for ->read/->aio_read presence with check in ->f_mode
b318891 xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()
37938463 blkdev_aio_read(): switch to generic_file_read_iter(), get rid of iov_shorten()
0c94933 iov_iter_truncate()
28060d5 btrfs: switch check_direct_IO() to iov_iter
91f79c4 new helper: iov_iter_get_pages_alloc()
f67da30 new helper: iov_iter_npages()
5b46f25 f2fs: switch to iov_iter_alignment()
c9c37e2 fuse: switch to iov_iter_get_pages()
d22a943 fuse: pull iov_iter initializations up
7b2c99d new helper: iov_iter_get_pages()
3320c60 dio: take updating ->result into do_direct_IO()
71d8e53 start adding the tag to iov_iter
ed978a8 new helper: generic_file_read_iter()
23faa7b fuse_file_aio_write(): merge initializations of iov_iter
05bb2e0 ceph_aio_read(): keep iov_iter across retries
886a391 new primitive: iov_iter_alignment()
26978b8 give ->direct_IO() a copy of iov_iter
31b1403 switch {__,}blockdev_direct_IO() to iov_iter
a6cbcd4 get rid of pointless iov_length() in ->direct_IO()
16b1f05 ext4: switch the guts of ->direct_IO() to iov_iter
619d30b convert the guts of nfs_direct_IO() to iov_iter
d8d3d94 pass iov_iter to ->direct_IO()
cb66a7a kill generic_segment_checks()
0ae5e4d __btrfs_direct_write(): switch to iov_iter
f8579f8 generic_file_direct_write(): switch to iov_iter
e7c2460 kill iov_iter_copy_from_user()
f6c0a19 fs/file.c: don't open-code kvfree()
/* IO write performance is OK*/

I tried to isolate the issue by running fio on a null_blk
device and also got performance degradation, although it wasn't
as severe as with iSER/iSCSI: IO write performance decreased
from 360K IOPs to 280K IOPs.

So at the moment I can't pinpoint the problem, but I figured
I'd raise the issue in case anyone else has stepped on this one
(hard to imagine that no one saw this...)

I'll run perf comparison to see if I get anything interesting.

CC'ing Al Viro, the author of all the above commits.

Cheers,
Sagi.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Performance degradation in IO writes vs. reads (was scsi-mq V2)
@ 2014-08-21 12:32       ` Sagi Grimberg
  0 siblings, 0 replies; 99+ messages in thread
From: Sagi Grimberg @ 2014-08-21 12:32 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, viro
  Cc: Or Gerlitz, Oren Duer, Nicholas A. Bellinger, Mike Christie,
	linux-fsdevel, linux-kernel, Max Gurtovoy

On 7/14/2014 12:13 PM, Sagi Grimberg wrote:

<SNIP>

> I'd like to share some benchmarks I took on this patch set using iSER
> initiator (+2 pre-submitted performance improvements) vs LIO iSER target.
> I ran workloads I think are interesting use-cases (single LUN with 1,2,4
> IO threads up to a fully occupied system doing IO to multiple LUNs).
> Overall (except 2 strange anomalies) seems that scsi-mq patches
> (use_blk_mq=N) roughly sustains traditional scsi performance.
> On the other hand scsi-mq code path (use_blk_mq=Y) on its own clearly
> shows better performance (tables below).
>
> At first I too hit the aio issues discussed in this thread and converted
> to scsi-mq.3-no-rebase for testing (thanks Doug & Rob for raising it).
> I must say that for some reason I get very low numbers for writes vs.
> reads (writes perf stuck at ~20K IOPs per thread), this happens
> on 3.16-rc2 even before scsi-mq patches. Did anyone step on this as well
> or is it just a weird problem I'm having in my setup?
> Anyway this is why my benchmarks shows only randread IO pattern (getting
> familiar numbers). I need to figure out whats wrong
> with IO writes - I'll start bisecting on this.
>

Hi,

So I just got back to checking this issue of *extremely low* IO write
performance I got in 3.16-rc2.

Reminder:
I used iSER to benchmark Christoph's scsi-mq patches performance and
noticed that direct-IO writes are stuck at 20-50K IOPs instead of 350K
IOPs I was used to see for a single device. This issue existed also when
I removed the scsi-mq patches, so I started bisecting to see what broke
stuff.

Finally I got to the completely unbisectable bulk that seems to yield
the issue:

/* IO write poor performance*/
2b777c9 ceph_sync_read: stop poking into iov_iter guts
f0d1bec new helper: copy_page_from_iter()
84c3d55 fuse: switch to ->write_iter()
b30ac0f btrfs: switch to ->write_iter()
3ef045c ocfs2: switch to ->write_iter()
bf97f3b xfs: switch to ->write_iter()
50b5551 afs: switch to ->write_iter()
da56e45 gfs2: switch to ->write_iter()
edaf436 nfs: switch to ->write_iter()
f5674c3 ubifs: switch to ->write_iter()
a8f3550 bury __generic_file_aio_write()
3dae875 cifs: switch to ->write_iter()
d4637bc udf: switch to ->write_iter()
9b88416 convert ext4 to ->write_iter()
a832475 Merge ext4 changes in ext4_file_write() into for-next
1456c0a blkdev_aio_write() - turn into blkdev_write_iter()
8174202 write_iter variants of {__,}generic_file_aio_write()
3644424 ceph: switch to ->read_iter()
3aa2d19 nfs: switch to ->read_iter()
a886038 fs/block_dev.c: switch to ->read_iter()
2ba5bbe shmem: switch to ->read_iter()
fb9096a pipe: switch to ->read_iter()
e6a7bcb cifs: switch to ->read_iter()
37c20f1 fuse_file_aio_read(): convert to ->read_iter()
3cd9ad5 ocfs2: switch to ->read_iter()
0279782 ecryptfs: switch to ->read_iter()
b4f5d2c xfs: switch to ->read_iter()
aad4f8b switch simple generic_file_aio_read() users to ->read_iter()
293bc98 new methods: ->read_iter() and ->write_iter()
7f7f25e replace checking for ->read/->aio_read presence with check in ->f_mode
b318891 xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()
37938463 blkdev_aio_read(): switch to generic_file_read_iter(), get rid of iov_shorten()
0c94933 iov_iter_truncate()
28060d5 btrfs: switch check_direct_IO() to iov_iter
91f79c4 new helper: iov_iter_get_pages_alloc()
f67da30 new helper: iov_iter_npages()
5b46f25 f2fs: switch to iov_iter_alignment()
c9c37e2 fuse: switch to iov_iter_get_pages()
d22a943 fuse: pull iov_iter initializations up
7b2c99d new helper: iov_iter_get_pages()
3320c60 dio: take updating ->result into do_direct_IO(
71d8e53 start adding the tag to iov_iter
ed978a8 new helper: generic_file_read_iter()
23faa7b fuse_file_aio_write(): merge initializations of iov_iter
05bb2e0 ceph_aio_read(): keep iov_iter across retries
886a391 new primitive: iov_iter_alignment()
26978b8 give ->direct_IO() a copy of iov_iter
31b1403 switch {__,}blockdev_direct_IO() to iov_iter
a6cbcd4 get rid of pointless iov_length() in ->direct_IO()
16b1f05 ext4: switch the guts of ->direct_IO() to iov_iter
619d30b convert the guts of nfs_direct_IO() to iov_iter
d8d3d94 pass iov_iter to ->direct_IO()
cb66a7a kill generic_segment_checks()
0ae5e4d __btrfs_direct_write(): switch to iov_iter
f8579f8 generic_file_direct_write(): switch to iov_iter
e7c2460 kill iov_iter_copy_from_user()
f6c0a19 fs/file.c: don't open-code kvfree()
/* IO write performance is OK*/

I tried to isolate the issue by running fio on a null_blk
device and also got performance degradation, although it wasn't
as severe as with iSER/iSCSI: IO write performance decreased
from 360K IOPs to 280K IOPs.

So at the moment I can't pinpoint the problem, but I figured
I'd raise the issue in case anyone else has stepped on this one
(hard to imagine that no one saw this...).

I'll run perf comparison to see if I get anything interesting.

CC'ing Al Viro, the author of all the above commits.

Cheers,
Sagi.
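
For context on the bisected range: it converts filesystems and helpers from
the old ->aio_read/->aio_write file operations to the iov_iter-based
->read_iter/->write_iter methods (commit 293bc98 in the list above). A
simplified, illustrative excerpt of the signature change as of the 3.16
merge window - not a complete struct definition:

	/* Before the series (struct file_operations members, pre-3.16): */
	ssize_t (*aio_read)(struct kiocb *, const struct iovec *,
			    unsigned long nr_segs, loff_t pos);
	ssize_t (*aio_write)(struct kiocb *, const struct iovec *,
			     unsigned long nr_segs, loff_t pos);

	/* After commit 293bc98 ("new methods: ->read_iter() and
	 * ->write_iter()"): the iterator replaces the raw iovec array. */
	ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);

The iov_iter argument replaces the raw iovec array and segment count, which
is why so many filesystems appear in the range: every ->aio_read/->aio_write
instance had to be converted, and the direct-IO path gained new iteration
primitives along the way.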

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Performance degradation in IO writes vs. reads (was scsi-mq V2)
  2014-08-21 12:32       ` Sagi Grimberg
  (?)
@ 2014-08-21 13:03       ` Christoph Hellwig
  2014-08-21 14:02         ` Sagi Grimberg
  -1 siblings, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-08-21 13:03 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, James Bottomley, Jens Axboe, Bart Van Assche,
	Robert Elliott, linux-scsi, linux-kernel, viro, Or Gerlitz,
	Oren Duer, Nicholas A. Bellinger, Mike Christie, linux-fsdevel,
	Max Gurtovoy

On Thu, Aug 21, 2014 at 03:32:09PM +0300, Sagi Grimberg wrote:
> So I just got back to checking this issue of *extremely low* IO write
> performance I got in 3.16-rc2.

Please test with 3.16 final.  There was one issue each in aio and dio
that caused bad I/O performance regressions that were only fixed at the
last minute in the 3.16 cycle.


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Performance degradation in IO writes vs. reads (was scsi-mq V2)
  2014-08-21 13:03       ` Christoph Hellwig
@ 2014-08-21 14:02         ` Sagi Grimberg
  2014-08-24 16:41           ` Sagi Grimberg
  0 siblings, 1 reply; 99+ messages in thread
From: Sagi Grimberg @ 2014-08-21 14:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Jens Axboe, Bart Van Assche, Robert Elliott,
	linux-scsi, linux-kernel, viro, Or Gerlitz, Oren Duer,
	Nicholas A. Bellinger, Mike Christie, linux-fsdevel,
	Max Gurtovoy

On 8/21/2014 4:03 PM, Christoph Hellwig wrote:
> On Thu, Aug 21, 2014 at 03:32:09PM +0300, Sagi Grimberg wrote:
>> So I just got back to checking this issue of *extremely low* IO write
>> performance I got in 3.16-rc2.
>
> Please test with 3.16 final.  There was one issue each in aio and dio
> that caused bad I/O performance regressions that were only fixed at the
> last minute in the 3.16 cycle.
>

I'll do that now.

Thanks,
Sagi.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Performance degradation in IO writes vs. reads (was scsi-mq V2)
  2014-08-21 14:02         ` Sagi Grimberg
@ 2014-08-24 16:41           ` Sagi Grimberg
  0 siblings, 0 replies; 99+ messages in thread
From: Sagi Grimberg @ 2014-08-24 16:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, Jens Axboe, Bart Van Assche, Robert Elliott,
	linux-scsi, linux-kernel, viro, Or Gerlitz, Oren Duer,
	Nicholas A. Bellinger, Mike Christie, linux-fsdevel,
	Max Gurtovoy

On 8/21/2014 5:02 PM, Sagi Grimberg wrote:
> On 8/21/2014 4:03 PM, Christoph Hellwig wrote:
>> On Thu, Aug 21, 2014 at 03:32:09PM +0300, Sagi Grimberg wrote:
>>> So I just got back to checking this issue of *extremely low* IO write
>>> performance I got in 3.16-rc2.
>>
>> Please test with 3.16 final.  There was one issue each in aio and dio
>> that caused bad I/O performance regressions that were only fixed at the
>> last minute in the 3.16 cycle.
>>
>
> I'll do that now.
>

Indeed this issue is resolved in 3.16. Sorry for the false alarm...
Thanks Christoph.

Sagi.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-08-19 16:06     ` Christoph Hellwig
@ 2014-08-19 16:11       ` Kashyap Desai
  0 siblings, 0 replies; 99+ messages in thread
From: Kashyap Desai @ 2014-08-19 16:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, linux-scsi, Jens Axboe, Bart Van Assche,
	Mike Christie, Martin K. Petersen, Robert Elliott, Webb Scales,
	linux-kernel

On Tue, Aug 19, 2014 at 9:36 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Aug 19, 2014 at 03:51:42AM +0530, Kashyap Desai wrote:
> I read this comment and found that very few drivers are using this
> cmd_list.  I think if we remove this cmd_list, performance will scale,
> as I am seeing major contention on this lock.
> Just thought to ping you to see if this is a known limitation for now,
> or if there is any plan to change this lock in the near future?
>
> Removing the lock entirely and pushing the list into the two drivers
> using it is on my TODO list.  Bart actually suggested keeping the code in
> the SCSI core and having a flag to enable it.  Given that I'm too busy to
> get my full version done in time, it might be a good idea if someone
> picks up Bart's idea.  Can you send me a patch to add an enable_cmd_list
> flag to the host template and only enable it for aacraid and dpt_i2o?
>

Sure. I will work on relevant code change and will post patch for review.

-- 
Device Driver Developer @ Avagotech
Kashyap D. Desai
Note - my new email address
kashyap.desai@avagotech.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-08-18 22:21   ` Kashyap Desai
  2014-08-19 15:41     ` Kashyap Desai
@ 2014-08-19 16:06     ` Christoph Hellwig
  2014-08-19 16:11       ` Kashyap Desai
  1 sibling, 1 reply; 99+ messages in thread
From: Christoph Hellwig @ 2014-08-19 16:06 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Jens Axboe,
	Bart Van Assche, Mike Christie, Martin K. Petersen,
	Robert Elliott, Webb Scales, linux-kernel

On Tue, Aug 19, 2014 at 03:51:42AM +0530, Kashyap Desai wrote:
> I read this comment and found that very few drivers are using this
> cmd_list.  I think if we remove this cmd_list, performance will scale,
> as I am seeing major contention on this lock.
> Just thought to ping you to see if this is a known limitation for now,
> or if there is any plan to change this lock in the near future?

Removing the lock entirely and pushing the list into the two drivers
using it is on my TODO list.  Bart actually suggested keeping the code in
the SCSI core and having a flag to enable it.  Given that I'm too busy to
get my full version done in time, it might be a good idea if someone
picks up Bart's idea.  Can you send me a patch to add an enable_cmd_list
flag to the host template and only enable it for aacraid and dpt_i2o?
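
A minimal sketch of what that opt-in could look like, purely illustrative
and assuming the enable_cmd_list name floated above (the actual merged code
may differ):

	/* Hypothetical sketch; enable_cmd_list is an assumed field name. */
	struct scsi_host_template {
		/* ... existing fields ... */
		/* opt in to sdev->cmd_list tracking (aacraid, dpt_i2o) */
		unsigned enable_cmd_list:1;
	};

	static inline void scsi_add_cmd_to_list(struct scsi_cmnd *cmd)
	{
		struct scsi_device *sdev = cmd->device;
		unsigned long flags;

		/* hosts that don't need the list skip the lock entirely */
		if (!sdev->host->hostt->enable_cmd_list)
			return;

		spin_lock_irqsave(&sdev->list_lock, flags);
		list_add_tail(&cmd->list, &sdev->cmd_list);
		spin_unlock_irqrestore(&sdev->list_lock, flags);
	}

scsi_mq_prep_fn() and scsi_mq_uninit_cmd() would then call this helper (and
a matching removal helper) instead of taking sdev->list_lock
unconditionally, so only the two drivers that need cmd_list pay the
locking cost.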


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-08-18 22:21   ` Kashyap Desai
@ 2014-08-19 15:41     ` Kashyap Desai
  2014-08-19 16:06     ` Christoph Hellwig
  1 sibling, 0 replies; 99+ messages in thread
From: Kashyap Desai @ 2014-08-19 15:41 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley, linux-scsi
  Cc: Jens Axboe, Bart Van Assche, Mike Christie, Martin K. Petersen,
	Robert Elliott, Webb Scales, linux-kernel

On Tue, Aug 19, 2014 at 3:51 AM, Kashyap Desai
<kashyap.desai@avagotech.com> wrote:
>
> > -----Original Message-----
> > From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> > owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> > Sent: Friday, July 18, 2014 3:43 PM
> > To: James Bottomley; linux-scsi@vger.kernel.org
> > Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen;
> > Robert Elliott; Webb Scales; linux-kernel@vger.kernel.org
> > Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
> >
> > This patch adds support for an alternate I/O path in the scsi midlayer
> > which uses the blk-mq infrastructure instead of the legacy request code.
> >
> > Use of blk-mq is fully transparent to drivers, although for now a host
> > template field is provided to opt out of blk-mq usage in case any
> > unforeseen incompatibilities arise.
> >
> > In general replacing the legacy request code with blk-mq is a simple and
> > mostly mechanical transformation.  The biggest exception is the new code
> > that deals with the fact that I/O submissions in blk-mq must happen from
> > process context, which slightly complicates the I/O completion handler.
> > The second biggest difference is that blk-mq is built around the concept
> > of preallocated requests that also include driver specific data, which
> > in SCSI context means the scsi_cmnd structure.  This completely avoids
> > dynamic memory allocations for the fast path through I/O submission.
> >
> > Due to the preallocated requests the MQ code path exclusively uses the
> > host-wide shared tag allocator instead of a per-LUN one.  This only
> > affects drivers actually using the block layer provided tag allocator
> > instead of their own.  Unlike the old path blk-mq always provides a tag,
> > although drivers don't have to use it.
> >
> > For now the blk-mq path is disabled by default and must be enabled using
> > the "use_blk_mq" module parameter.  Once the remaining work in the block
> > layer to make blk-mq more suitable for slow devices is complete I hope
> > to make it the default and eventually even remove the old code path.
> >
> > Based on the earlier scsi-mq prototype by Nicholas Bellinger.
> >
> > Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking
> > and various suggestions and code contributions.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> > Reviewed-by: Webb Scales <webbnh@hp.com>
> > Acked-by: Jens Axboe <axboe@kernel.dk>
> > Tested-by: Bart Van Assche <bvanassche@acm.org>
> > Tested-by: Robert Elliott <elliott@hp.com>
> > ---
> >  drivers/scsi/hosts.c      |  35 +++-
> >  drivers/scsi/scsi.c       |   5 +-
> >  drivers/scsi/scsi_lib.c   | 464 ++++++++++++++++++++++++++++++++++++++++------
> >  drivers/scsi/scsi_priv.h  |   3 +
> >  drivers/scsi/scsi_scan.c  |   5 +-
> >  drivers/scsi/scsi_sysfs.c |   2 +
> >  include/scsi/scsi_host.h  |  18 +-
> >  include/scsi/scsi_tcq.h   |  28 ++-
> >  8 files changed, 488 insertions(+), 72 deletions(-)
> >
> > diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> > index 0632eee..6de80e3 100644
> > --- a/drivers/scsi/hosts.c
> > +++ b/drivers/scsi/hosts.c
> > @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host
> > *shost, struct device *dev,
> >               goto fail;
> >       }
> >
> > +     if (shost_use_blk_mq(shost)) {
> > +             error = scsi_mq_setup_tags(shost);
> > +             if (error)
> > +                     goto fail;
> > +     }
> > +
> > +     /*
> > +      * Note that we allocate the freelist even for the MQ case for
> now,
> > +      * as we need a command set aside for scsi_reset_provider.  Having
> > +      * the full host freelist and one command available for that is a
> > +      * little heavy-handed, but avoids introducing a special allocator
> > +      * just for this.  Eventually the structure of scsi_reset_provider
> > +      * will need a major overhaul.
> > +      */
> >       error = scsi_setup_command_freelist(shost);
> >       if (error)
> > -             goto fail;
> > +             goto out_destroy_tags;
> > +
> >
> >       if (!shost->shost_gendev.parent)
> >               shost->shost_gendev.parent = dev ? dev : &platform_bus;
> > @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost,
> > struct device *dev,
> >
> >       error = device_add(&shost->shost_gendev);
> >       if (error)
> > -             goto out;
> > +             goto out_destroy_freelist;
> >
> >       pm_runtime_set_active(&shost->shost_gendev);
> >       pm_runtime_enable(&shost->shost_gendev);
> > @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host
> > *shost, struct device *dev,
> >       device_del(&shost->shost_dev);
> >   out_del_gendev:
> >       device_del(&shost->shost_gendev);
> > - out:
> > + out_destroy_freelist:
> >       scsi_destroy_command_freelist(shost);
> > + out_destroy_tags:
> > +     if (shost_use_blk_mq(shost))
> > +             scsi_mq_destroy_tags(shost);
> >   fail:
> >       return error;
> >  }
> > @@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device
> > *dev)
> >       }
> >
> >       scsi_destroy_command_freelist(shost);
> > -     if (shost->bqt)
> > -             blk_free_tags(shost->bqt);
> > +     if (shost_use_blk_mq(shost)) {
> > +             if (shost->tag_set.tags)
> > +                     scsi_mq_destroy_tags(shost);
> > +     } else {
> > +             if (shost->bqt)
> > +                     blk_free_tags(shost->bqt);
> > +     }
> >
> >       kfree(shost->shost_data);
> >
> > @@ -436,6 +459,8 @@ struct Scsi_Host *scsi_host_alloc(struct
> > scsi_host_template *sht, int privsize)
> >       else
> >               shost->dma_boundary = 0xffffffff;
> >
> > +     shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt-
> > >disable_blk_mq;
> > +
> >       device_initialize(&shost->shost_gendev);
> >       dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
> >       shost->shost_gendev.bus = &scsi_bus_type;
> >
> > diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> > index 3dde8a3..013709f 100644
> > --- a/drivers/scsi/scsi.c
> > +++ b/drivers/scsi/scsi.c
> > @@ -805,7 +805,7 @@ void scsi_adjust_queue_depth(struct scsi_device
> > *sdev, int tagged, int tags)
> >        * is more IO than the LLD's can_queue (so there are not enuogh
> >        * tags) request_fn's host queue ready check will handle it.
> >        */
> > -     if (!sdev->host->bqt) {
> > +     if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
> >               if (blk_queue_tagged(sdev->request_queue) &&
> >                   blk_queue_resize_tags(sdev->request_queue, tags) != 0)
> >                       goto out;
> > @@ -1361,6 +1361,9 @@ MODULE_LICENSE("GPL");
> > module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
> > MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
> >
> > +bool scsi_use_blk_mq = false;
> > +module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR |
> > +S_IRUGO);
> > +
> >  static int __init init_scsi(void)
> >  {
> >       int error;
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > bbd7a0a..9c44392 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1,5 +1,6 @@
> >  /*
> > - *  scsi_lib.c Copyright (C) 1999 Eric Youngdale
> > + * Copyright (C) 1999 Eric Youngdale
> > + * Copyright (C) 2014 Christoph Hellwig
> >   *
> >   *  SCSI queueing library.
> >   *      Initial versions: Eric Youngdale (eric@andante.org).
> > @@ -20,6 +21,7 @@
> >  #include <linux/delay.h>
> >  #include <linux/hardirq.h>
> >  #include <linux/scatterlist.h>
> > +#include <linux/blk-mq.h>
> >
> >  #include <scsi/scsi.h>
> >  #include <scsi/scsi_cmnd.h>
> > @@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
> >       }
> >  }
> >
> > +static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd) {
> > +     struct scsi_device *sdev = cmd->device;
> > +     struct request_queue *q = cmd->request->q;
> > +
> > +     blk_mq_requeue_request(cmd->request);
> > +     blk_mq_kick_requeue_list(q);
> > +     put_device(&sdev->sdev_gendev);
> > +}
> > +
> >  /**
> >   * __scsi_queue_insert - private queue insertion
> >   * @cmd: The SCSI command being requeued @@ -150,6 +162,10 @@ static
> > void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
> >        * before blk_cleanup_queue() finishes.
> >        */
> >       cmd->result = 0;
> > +     if (q->mq_ops) {
> > +             scsi_mq_requeue_cmd(cmd);
> > +             return;
> > +     }
> >       spin_lock_irqsave(q->queue_lock, flags);
> >       blk_requeue_request(q, cmd->request);
> >       kblockd_schedule_work(&device->requeue_work);
> > @@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> >       atomic_dec(&sdev->device_busy);
> >  }
> >
> > +static void scsi_kick_queue(struct request_queue *q) {
> > +     if (q->mq_ops)
> > +             blk_mq_start_hw_queues(q);
> > +     else
> > +             blk_run_queue(q);
> > +}
> > +
> >  /*
> >   * Called for single_lun devices on IO completion. Clear
> starget_sdev_user,
> >   * and call blk_run_queue for all the scsi_devices on the target - @@
> -332,7
> > +356,7 @@ static void scsi_single_lun_run(struct scsi_device
> *current_sdev)
> >        * but in most cases, we will be first. Ideally, each LU on the
> >        * target would get some limited time or requests on the target.
> >        */
> > -     blk_run_queue(current_sdev->request_queue);
> > +     scsi_kick_queue(current_sdev->request_queue);
> >
> >       spin_lock_irqsave(shost->host_lock, flags);
> >       if (starget->starget_sdev_user)
> > @@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device
> > *current_sdev)
> >                       continue;
> >
> >               spin_unlock_irqrestore(shost->host_lock, flags);
> > -             blk_run_queue(sdev->request_queue);
> > +             scsi_kick_queue(sdev->request_queue);
> >               spin_lock_irqsave(shost->host_lock, flags);
> >
> >               scsi_device_put(sdev);
> > @@ -435,7 +459,7 @@ static void scsi_starved_list_run(struct Scsi_Host
> > *shost)
> >                       continue;
> >               spin_unlock_irqrestore(shost->host_lock, flags);
> >
> > -             blk_run_queue(slq);
> > +             scsi_kick_queue(slq);
> >               blk_put_queue(slq);
> >
> >               spin_lock_irqsave(shost->host_lock, flags); @@ -466,7
> > +490,10 @@ static void scsi_run_queue(struct request_queue *q)
> >       if (!list_empty(&sdev->host->starved_list))
> >               scsi_starved_list_run(sdev->host);
> >
> > -     blk_run_queue(q);
> > +     if (q->mq_ops)
> > +             blk_mq_start_stopped_hw_queues(q, false);
> > +     else
> > +             blk_run_queue(q);
> >  }
> >
> >  void scsi_requeue_run_queue(struct work_struct *work) @@ -564,25
> > +591,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents,
> gfp_t
> > gfp_mask)
> >       return mempool_alloc(sgp->pool, gfp_mask);  }
> >
> > -static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> > +static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
> >  {
> > -     __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false,
> > scsi_sg_free);
> > +     if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
> > +             return;
> > +     __sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq,
> > scsi_sg_free);
> >  }
> >
> >  static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> > -                           gfp_t gfp_mask)
> > +                           gfp_t gfp_mask, bool mq)
> >  {
> > +     struct scatterlist *first_chunk = NULL;
> >       int ret;
> >
> >       BUG_ON(!nents);
> >
> > +     if (mq) {
> > +             if (nents <= SCSI_MAX_SG_SEGMENTS) {
> > +                     sdb->table.nents = nents;
> > +                     sg_init_table(sdb->table.sgl, sdb->table.nents);
> > +                     return 0;
> > +             }
> > +             first_chunk = sdb->table.sgl;
> > +     }
> > +
> >       ret = __sg_alloc_table(&sdb->table, nents,
> > SCSI_MAX_SG_SEGMENTS,
> > -                            NULL, gfp_mask, scsi_sg_alloc);
> > +                            first_chunk, gfp_mask, scsi_sg_alloc);
> >       if (unlikely(ret))
> > -             scsi_free_sgtable(sdb);
> > +             scsi_free_sgtable(sdb, mq);
> >       return ret;
> >  }
> >
> > +static void scsi_uninit_cmd(struct scsi_cmnd *cmd) {
> > +     if (cmd->request->cmd_type == REQ_TYPE_FS) {
> > +             struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> > +
> > +             if (drv->uninit_command)
> > +                     drv->uninit_command(cmd);
> > +     }
> > +}
> > +
> > +static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd) {
> > +     if (cmd->sdb.table.nents)
> > +             scsi_free_sgtable(&cmd->sdb, true);
> > +     if (cmd->request->next_rq && cmd->request->next_rq->special)
> > +             scsi_free_sgtable(cmd->request->next_rq->special, true);
> > +     if (scsi_prot_sg_count(cmd))
> > +             scsi_free_sgtable(cmd->prot_sdb, true); }
> > +
> > +static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd) {
> > +     struct scsi_device *sdev = cmd->device;
> > +     unsigned long flags;
> > +
> > +     BUG_ON(list_empty(&cmd->list));
> > +
> > +     scsi_mq_free_sgtables(cmd);
> > +     scsi_uninit_cmd(cmd);
> > +
> > +     spin_lock_irqsave(&sdev->list_lock, flags);
> > +     list_del_init(&cmd->list);
> > +     spin_unlock_irqrestore(&sdev->list_lock, flags); }
> > +
> >  /*
> >   * Function:    scsi_release_buffers()
> >   *
> > @@ -602,19 +676,19 @@ static int scsi_alloc_sgtable(struct
> scsi_data_buffer
> > *sdb, int nents,  static void scsi_release_buffers(struct scsi_cmnd
> *cmd)  {
> >       if (cmd->sdb.table.nents)
> > -             scsi_free_sgtable(&cmd->sdb);
> > +             scsi_free_sgtable(&cmd->sdb, false);
> >
> >       memset(&cmd->sdb, 0, sizeof(cmd->sdb));
> >
> >       if (scsi_prot_sg_count(cmd))
> > -             scsi_free_sgtable(cmd->prot_sdb);
> > +             scsi_free_sgtable(cmd->prot_sdb, false);
> >  }
> >
> >  static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)  {
> >       struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq-
> > >special;
> >
> > -     scsi_free_sgtable(bidi_sdb);
> > +     scsi_free_sgtable(bidi_sdb, false);
> >       kmem_cache_free(scsi_sdb_cache, bidi_sdb);
> >       cmd->request->next_rq->special = NULL;  } @@ -625,8 +699,6 @@
> > static bool scsi_end_request(struct request *req, int error,
> >       struct scsi_cmnd *cmd = req->special;
> >       struct scsi_device *sdev = cmd->device;
> >       struct request_queue *q = sdev->request_queue;
> > -     unsigned long flags;
> > -
> >
> >       if (blk_update_request(req, error, bytes))
> >               return true;
> > @@ -639,14 +711,38 @@ static bool scsi_end_request(struct request *req,
> > int error,
> >       if (blk_queue_add_random(q))
> >               add_disk_randomness(req->rq_disk);
> >
> > -     spin_lock_irqsave(q->queue_lock, flags);
> > -     blk_finish_request(req, error);
> > -     spin_unlock_irqrestore(q->queue_lock, flags);
> > +     if (req->mq_ctx) {
> > +             /*
> > +              * In the MQ case the command gets freed by
> > __blk_mq_end_io,
> > +              * so we have to do all cleanup that depends on it
> earlier.
> > +              *
> > +              * We also can't kick the queues from irq context, so we
> > +              * will have to defer it to a workqueue.
> > +              */
> > +             scsi_mq_uninit_cmd(cmd);
> > +
> > +             __blk_mq_end_io(req, error);
> > +
> > +             if (scsi_target(sdev)->single_lun ||
> > +                 !list_empty(&sdev->host->starved_list))
> > +                     kblockd_schedule_work(&sdev->requeue_work);
> > +             else
> > +                     blk_mq_start_stopped_hw_queues(q, true);
> > +
> > +             put_device(&sdev->sdev_gendev);
> > +     } else {
> > +             unsigned long flags;
> > +
> > +             spin_lock_irqsave(q->queue_lock, flags);
> > +             blk_finish_request(req, error);
> > +             spin_unlock_irqrestore(q->queue_lock, flags);
> > +
> > +             if (bidi_bytes)
> > +                     scsi_release_bidi_buffers(cmd);
> > +             scsi_release_buffers(cmd);
> > +             scsi_next_command(cmd);
> > +     }
> >
> > -     if (bidi_bytes)
> > -             scsi_release_bidi_buffers(cmd);
> > -     scsi_release_buffers(cmd);
> > -     scsi_next_command(cmd);
> >       return false;
> >  }
> >
> > @@ -953,8 +1049,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd,
> > unsigned int good_bytes)
> >               /* Unprep the request and put it back at the head of the
> > queue.
> >                * A new command will be prepared and issued.
> >                */
> > -             scsi_release_buffers(cmd);
> > -             scsi_requeue_command(q, cmd);
> > +             if (q->mq_ops) {
> > +                     cmd->request->cmd_flags &= ~REQ_DONTPREP;
> > +                     scsi_mq_uninit_cmd(cmd);
> > +                     scsi_mq_requeue_cmd(cmd);
> > +             } else {
> > +                     scsi_release_buffers(cmd);
> > +                     scsi_requeue_command(q, cmd);
> > +             }
> >               break;
> >       case ACTION_RETRY:
> >               /* Retry the same command immediately */ @@ -976,9
> > +1078,8 @@ static int scsi_init_sgtable(struct request *req, struct
> > scsi_data_buffer *sdb,
> >        * If sg table allocation fails, requeue request later.
> >        */
> >       if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
> > -                                     gfp_mask))) {
> > +                                     gfp_mask, req->mq_ctx != NULL)))
> >               return BLKPREP_DEFER;
> > -     }
> >
> >       /*
> >        * Next, walk the list, and fill in the addresses and sizes of @@
> -
> > 1006,6 +1107,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t
> gfp_mask)  {
> >       struct scsi_device *sdev = cmd->device;
> >       struct request *rq = cmd->request;
> > +     bool is_mq = (rq->mq_ctx != NULL);
> >       int error;
> >
> >       BUG_ON(!rq->nr_phys_segments);
> > @@ -1015,15 +1117,19 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t
> > gfp_mask)
> >               goto err_exit;
> >
> >       if (blk_bidi_rq(rq)) {
> > -             struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
> > -                     scsi_sdb_cache, GFP_ATOMIC);
> > -             if (!bidi_sdb) {
> > -                     error = BLKPREP_DEFER;
> > -                     goto err_exit;
> > +             if (!rq->q->mq_ops) {
> > +                     struct scsi_data_buffer *bidi_sdb =
> > +                             kmem_cache_zalloc(scsi_sdb_cache,
> > GFP_ATOMIC);
> > +                     if (!bidi_sdb) {
> > +                             error = BLKPREP_DEFER;
> > +                             goto err_exit;
> > +                     }
> > +
> > +                     rq->next_rq->special = bidi_sdb;
> >               }
> >
> > -             rq->next_rq->special = bidi_sdb;
> > -             error = scsi_init_sgtable(rq->next_rq, bidi_sdb,
> > GFP_ATOMIC);
> > +             error = scsi_init_sgtable(rq->next_rq,
> rq->next_rq->special,
> > +                                       GFP_ATOMIC);
> >               if (error)
> >                       goto err_exit;
> >       }
> > @@ -1035,7 +1141,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t
> > gfp_mask)
> >               BUG_ON(prot_sdb == NULL);
> >               ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
> >
> > -             if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
> > +             if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq))
> {
> >                       error = BLKPREP_DEFER;
> >                       goto err_exit;
> >               }
> > @@ -1049,13 +1155,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t
> > gfp_mask)
> >               cmd->prot_sdb->table.nents = count;
> >       }
> >
> > -     return BLKPREP_OK ;
> > -
> > +     return BLKPREP_OK;
> >  err_exit:
> > -     scsi_release_buffers(cmd);
> > -     cmd->request->special = NULL;
> > -     scsi_put_command(cmd);
> > -     put_device(&sdev->sdev_gendev);
> > +     if (is_mq) {
> > +             scsi_mq_free_sgtables(cmd);
> > +     } else {
> > +             scsi_release_buffers(cmd);
> > +             cmd->request->special = NULL;
> > +             scsi_put_command(cmd);
> > +             put_device(&sdev->sdev_gendev);
> > +     }
> >       return error;
> >  }
> >  EXPORT_SYMBOL(scsi_init_io);
> > @@ -1266,13 +1375,7 @@ out:
> >
> >  static void scsi_unprep_fn(struct request_queue *q, struct request
> *req)  {
> > -     if (req->cmd_type == REQ_TYPE_FS) {
> > -             struct scsi_cmnd *cmd = req->special;
> > -             struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> > -
> > -             if (drv->uninit_command)
> > -                     drv->uninit_command(cmd);
> > -     }
> > +     scsi_uninit_cmd(req->special);
> >  }
> >
> >  /*
> > @@ -1295,7 +1398,11 @@ static inline int scsi_dev_queue_ready(struct
> > request_queue *q,
> >                * unblock after device_blocked iterates to zero
> >                */
> >               if (atomic_dec_return(&sdev->device_blocked) > 0) {
> > -                     blk_delay_queue(q, SCSI_QUEUE_DELAY);
> > +                     /*
> > +                      * For the MQ case we take care of this in the
> caller.
> > +                      */
> > +                     if (!q->mq_ops)
> > +                             blk_delay_queue(q, SCSI_QUEUE_DELAY);
> >                       goto out_dec;
> >               }
> >               SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev, @@
> > -1671,6 +1778,180 @@ out_delay:
> >               blk_delay_queue(q, SCSI_QUEUE_DELAY);  }
> >
> > +static inline int prep_to_mq(int ret)
> > +{
> > +     switch (ret) {
> > +     case BLKPREP_OK:
> > +             return 0;
> > +     case BLKPREP_DEFER:
> > +             return BLK_MQ_RQ_QUEUE_BUSY;
> > +     default:
> > +             return BLK_MQ_RQ_QUEUE_ERROR;
> > +     }
> > +}
> > +
> > +static int scsi_mq_prep_fn(struct request *req) {
> > +     struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > +     struct scsi_device *sdev = req->q->queuedata;
> > +     struct Scsi_Host *shost = sdev->host;
> > +     unsigned char *sense_buf = cmd->sense_buffer;
> > +     struct scatterlist *sg;
> > +
> > +     memset(cmd, 0, sizeof(struct scsi_cmnd));
> > +
> > +     req->special = cmd;
> > +
> > +     cmd->request = req;
> > +     cmd->device = sdev;
> > +     cmd->sense_buffer = sense_buf;
> > +
> > +     cmd->tag = req->tag;
> > +
> > +     req->cmd = req->__cmd;
> > +     cmd->cmnd = req->cmd;
> > +     cmd->prot_op = SCSI_PROT_NORMAL;
> > +
> > +     INIT_LIST_HEAD(&cmd->list);
> > +     INIT_DELAYED_WORK(&cmd->abort_work,
> > scmd_eh_abort_handler);
> > +     cmd->jiffies_at_alloc = jiffies;
> > +
> > +     /*
> > +      * XXX: cmd_list lookups are only used by two drivers, try to get
> > +      * rid of this list in common code.
> > +      */
> > +     spin_lock_irq(&sdev->list_lock);
> > +     list_add_tail(&cmd->list, &sdev->cmd_list);
> > +     spin_unlock_irq(&sdev->list_lock);
>
> Hi Chris,
>
> I am using the scsi.mq.4 branch and doing profiling to find possible
> improvements in the low level driver to get the benefit of scsi-mq.  I am
> using an LSI/Avago 12G MegaRaid Invader and a total of 12 SSDs (12Gb/s).
> I have made some changes in the "megaraid_sas" driver to gain from the
> scsi-mq interface. I will send the list of changes later to get early
> feedback.
>
> I am replying in this thread as I found the relevant patch here to
> explain things better.
>
> Here are a few data points (I used a 4K random read FIO-libaio load on a
> two-socket Supermicro server):
>
> With the "null_blk" driver I was able to get 1800K IOPs on my setup.
> With the "megaraid_sas" driver in loopback mode (fake READ/WRITE) I see
> the numbers below: keeping the workers on Node-0 gives 1800K IOPs
> (similar to null_blk), but when I spread workers across Node-0 and
> Node-1 I see ~700K IOPs.
>
> The above experiment hints that there may be some difference in scsi-mq
> compared to blk-mq.
>
> My original problem was: "12 drives in R0 cannot scale beyond 750K IOPS,
> but it goes up to 1200K IOPS if I keep the workers on Node-0 using the
> cpus_allowed parameter of fio".
>
> Lock stats data - the data below is for the workload where I was not
> able to scale beyond 750K IOPS:
>
> class name: &(&sdev->list_lock)->rlock
>   con-bounces:     2307248      contentions:    2308395
>   waittime-min:    0.07         waittime-max:   158.89
>   waittime-total:  10435357.44  waittime-avg:   4.52
>   acq-bounces:     3849400      acquisitions:   3958002
>   holdtime-min:    0.04         holdtime-max:   26.02
>   holdtime-total:  1123671.56   holdtime-avg:   0.28
>
> contention points:
>   1105029  [<ffffffff814ac980>] scsi_queue_rq+0x560/0x750
>   1203366  [<ffffffff814abc97>] scsi_mq_uninit_cmd+0x47/0xb0
> acquisition points:
>   1176271  [<ffffffff814abc97>] scsi_mq_uninit_cmd+0x47/0xb0
>   1132124  [<ffffffff814ac980>] scsi_queue_rq+0x560/0x750
>
>
>
> I read this comment and found that very few drivers are using this
> cmd_list.  I think if we remove this cmd_list, performance will scale,
> as I am seeing major contention on this lock.
> Just thought to ping you to see if this is a known limitation for now,
> or if there is any plan to change this lock in the near future?

Additional info -

After removing the spinlock + list_add/del from scsi_mq_uninit_cmd()
and scsi_queue_rq(), IOPs now scale up to 1100K (earlier only 700K),
which is almost the same as the IO load running on a single NUMA node.
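
For reference, these are the two hot-path sites behind the lock_stat
numbers above, condensed from the quoted patch: every request takes
sdev->list_lock once at prep time and once at uninit time, so with a
single LUN all CPUs on both NUMA nodes serialize on the same lock
cacheline.

	/* submission path, in scsi_mq_prep_fn(): */
	spin_lock_irq(&sdev->list_lock);
	list_add_tail(&cmd->list, &sdev->cmd_list);
	spin_unlock_irq(&sdev->list_lock);

	/* completion path, in scsi_mq_uninit_cmd(): */
	spin_lock_irqsave(&sdev->list_lock, flags);
	list_del_init(&cmd->list);
	spin_unlock_irqrestore(&sdev->list_lock, flags);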


>
>
> ~ Kashyap
>
> > +
> > +     sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt-
> > >cmd_size;
> > +     cmd->sdb.table.sgl = sg;
> > +
> > +     if (scsi_host_get_prot(shost)) {
> > +             cmd->prot_sdb = (void *)sg +
> > +                     shost->sg_tablesize * sizeof(struct scatterlist);
> > +             memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
> > +
> > +             cmd->prot_sdb->table.sgl =
> > +                     (struct scatterlist *)(cmd->prot_sdb + 1);
> > +     }
> > +
> > +     if (blk_bidi_rq(req)) {
> > +             struct request *next_rq = req->next_rq;
> > +             struct scsi_data_buffer *bidi_sdb =
> > blk_mq_rq_to_pdu(next_rq);
> > +
> > +             memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
> > +             bidi_sdb->table.sgl =
> > +                     (struct scatterlist *)(bidi_sdb + 1);
> > +
> > +             next_rq->special = bidi_sdb;
> > +     }
> > +
> > +     return scsi_setup_cmnd(sdev, req);
> > +}
> > +
> > +static void scsi_mq_done(struct scsi_cmnd *cmd) {
> > +     trace_scsi_dispatch_cmd_done(cmd);
> > +     blk_mq_complete_request(cmd->request);
> > +}
> > +
> > +static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request
> > +*req) {
> > +     struct request_queue *q = req->q;
> > +     struct scsi_device *sdev = q->queuedata;
> > +     struct Scsi_Host *shost = sdev->host;
> > +     struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > +     int ret;
> > +     int reason;
> > +
> > +     ret = prep_to_mq(scsi_prep_state_check(sdev, req));
> > +     if (ret)
> > +             goto out;
> > +
> > +     ret = BLK_MQ_RQ_QUEUE_BUSY;
> > +     if (!get_device(&sdev->sdev_gendev))
> > +             goto out;
> > +
> > +     if (!scsi_dev_queue_ready(q, sdev))
> > +             goto out_put_device;
> > +     if (!scsi_target_queue_ready(shost, sdev))
> > +             goto out_dec_device_busy;
> > +     if (!scsi_host_queue_ready(q, shost, sdev))
> > +             goto out_dec_target_busy;
> > +
> > +     if (!(req->cmd_flags & REQ_DONTPREP)) {
> > +             ret = prep_to_mq(scsi_mq_prep_fn(req));
> > +             if (ret)
> > +                     goto out_dec_host_busy;
> > +             req->cmd_flags |= REQ_DONTPREP;
> > +     }
> > +
> > +     scsi_init_cmd_errh(cmd);
> > +     cmd->scsi_done = scsi_mq_done;
> > +
> > +     reason = scsi_dispatch_cmd(cmd);
> > +     if (reason) {
> > +             scsi_set_blocked(cmd, reason);
> > +             ret = BLK_MQ_RQ_QUEUE_BUSY;
> > +             goto out_dec_host_busy;
> > +     }
> > +
> > +     return BLK_MQ_RQ_QUEUE_OK;
> > +
> > +out_dec_host_busy:
> > +     atomic_dec(&shost->host_busy);
> > +out_dec_target_busy:
> > +     if (scsi_target(sdev)->can_queue > 0)
> > +             atomic_dec(&scsi_target(sdev)->target_busy);
> > +out_dec_device_busy:
> > +     atomic_dec(&sdev->device_busy);
> > +out_put_device:
> > +     put_device(&sdev->sdev_gendev);
> > +out:
> > +     switch (ret) {
> > +     case BLK_MQ_RQ_QUEUE_BUSY:
> > +             blk_mq_stop_hw_queue(hctx);
> > +             if (atomic_read(&sdev->device_busy) == 0 &&
> > +                 !scsi_device_blocked(sdev))
> > +                     blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
> > +             break;
> > +     case BLK_MQ_RQ_QUEUE_ERROR:
> > +             /*
> > +              * Make sure to release all allocated ressources when
> > +              * we hit an error, as we will never see this command
> > +              * again.
> > +              */
> > +             if (req->cmd_flags & REQ_DONTPREP)
> > +                     scsi_mq_uninit_cmd(cmd);
> > +             break;
> > +     default:
> > +             break;
> > +     }
> > +     return ret;
> > +}
> > +
> > +static int scsi_init_request(void *data, struct request *rq,
> > +             unsigned int hctx_idx, unsigned int request_idx,
> > +             unsigned int numa_node)
> > +{
> > +     struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> > +
> > +     cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE,
> > GFP_KERNEL,
> > +                     numa_node);
> > +     if (!cmd->sense_buffer)
> > +             return -ENOMEM;
> > +     return 0;
> > +}
> > +
> > +static void scsi_exit_request(void *data, struct request *rq,
> > +             unsigned int hctx_idx, unsigned int request_idx) {
> > +     struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> > +
> > +     kfree(cmd->sense_buffer);
> > +}
> > +
> >  static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)  {
> >       struct device *host_dev;
> > @@ -1692,16 +1973,10 @@ static u64 scsi_calculate_bounce_limit(struct
> > Scsi_Host *shost)
> >       return bounce_limit;
> >  }
> >
> > -struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> > -                                      request_fn_proc *request_fn)
> > +static void __scsi_init_queue(struct Scsi_Host *shost, struct
> > +request_queue *q)
> >  {
> > -     struct request_queue *q;
> >       struct device *dev = shost->dma_dev;
> >
> > -     q = blk_init_queue(request_fn, NULL);
> > -     if (!q)
> > -             return NULL;
> > -
> >       /*
> >        * this limit is imposed by hardware restrictions
> >        */
> > @@ -1732,7 +2007,17 @@ struct request_queue *__scsi_alloc_queue(struct
> > Scsi_Host *shost,
> >        * blk_queue_update_dma_alignment() later.
> >        */
> >       blk_queue_dma_alignment(q, 0x03);
> > +}
> >
> > +struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> > +                                      request_fn_proc *request_fn)
> > +{
> > +     struct request_queue *q;
> > +
> > +     q = blk_init_queue(request_fn, NULL);
> > +     if (!q)
> > +             return NULL;
> > +     __scsi_init_queue(shost, q);
> >       return q;
> >  }
> >  EXPORT_SYMBOL(__scsi_alloc_queue);
> > @@ -1753,6 +2038,55 @@ struct request_queue *scsi_alloc_queue(struct
> > scsi_device *sdev)
> >       return q;
> >  }
> >
> > +static struct blk_mq_ops scsi_mq_ops = {
> > +     .map_queue      = blk_mq_map_queue,
> > +     .queue_rq       = scsi_queue_rq,
> > +     .complete       = scsi_softirq_done,
> > +     .timeout        = scsi_times_out,
> > +     .init_request   = scsi_init_request,
> > +     .exit_request   = scsi_exit_request,
> > +};
> > +
> > +struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev) {
> > +     sdev->request_queue = blk_mq_init_queue(&sdev->host-
> > >tag_set);
> > +     if (IS_ERR(sdev->request_queue))
> > +             return NULL;
> > +
> > +     sdev->request_queue->queuedata = sdev;
> > +     __scsi_init_queue(sdev->host, sdev->request_queue);
> > +     return sdev->request_queue;
> > +}
> > +
> > +int scsi_mq_setup_tags(struct Scsi_Host *shost) {
> > +     unsigned int cmd_size, sgl_size, tbl_size;
> > +
> > +     tbl_size = shost->sg_tablesize;
> > +     if (tbl_size > SCSI_MAX_SG_SEGMENTS)
> > +             tbl_size = SCSI_MAX_SG_SEGMENTS;
> > +     sgl_size = tbl_size * sizeof(struct scatterlist);
> > +     cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size +
> > sgl_size;
> > +     if (scsi_host_get_prot(shost))
> > +             cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
> > +
> > +     memset(&shost->tag_set, 0, sizeof(shost->tag_set));
> > +     shost->tag_set.ops = &scsi_mq_ops;
> > +     shost->tag_set.nr_hw_queues = 1;
> > +     shost->tag_set.queue_depth = shost->can_queue;
> > +     shost->tag_set.cmd_size = cmd_size;
> > +     shost->tag_set.numa_node = NUMA_NO_NODE;
> > +     shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
> > BLK_MQ_F_SG_MERGE;
> > +     shost->tag_set.driver_data = shost;
> > +
> > +     return blk_mq_alloc_tag_set(&shost->tag_set);
> > +}
> > +
> > +void scsi_mq_destroy_tags(struct Scsi_Host *shost) {
> > +     blk_mq_free_tag_set(&shost->tag_set);
> > +}
> > +
> >  /*
> >   * Function:    scsi_block_requests()
> >   *
> > @@ -2498,9 +2832,13 @@ scsi_internal_device_block(struct scsi_device
> > *sdev)
> >        * block layer from calling the midlayer with this device's
> >        * request queue.
> >        */
> > -     spin_lock_irqsave(q->queue_lock, flags);
> > -     blk_stop_queue(q);
> > -     spin_unlock_irqrestore(q->queue_lock, flags);
> > +     if (q->mq_ops) {
> > +             blk_mq_stop_hw_queues(q);
> > +     } else {
> > +             spin_lock_irqsave(q->queue_lock, flags);
> > +             blk_stop_queue(q);
> > +             spin_unlock_irqrestore(q->queue_lock, flags);
> > +     }
> >
> >       return 0;
> >  }
> > @@ -2546,9 +2884,13 @@ scsi_internal_device_unblock(struct scsi_device
> > *sdev,
> >                sdev->sdev_state != SDEV_OFFLINE)
> >               return -EINVAL;
> >
> > -     spin_lock_irqsave(q->queue_lock, flags);
> > -     blk_start_queue(q);
> > -     spin_unlock_irqrestore(q->queue_lock, flags);
> > +     if (q->mq_ops) {
> > +             blk_mq_start_stopped_hw_queues(q, false);
> > +     } else {
> > +             spin_lock_irqsave(q->queue_lock, flags);
> > +             blk_start_queue(q);
> > +             spin_unlock_irqrestore(q->queue_lock, flags);
> > +     }
> >
> >       return 0;
> >  }
> > diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h index
> > a45d1c2..12b8e1b 100644
> > --- a/drivers/scsi/scsi_priv.h
> > +++ b/drivers/scsi/scsi_priv.h
> > @@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd
> > *cmd);  extern void scsi_io_completion(struct scsi_cmnd *, unsigned
> int);
> > extern void scsi_run_host_queues(struct Scsi_Host *shost);  extern
> struct
> > request_queue *scsi_alloc_queue(struct scsi_device *sdev);
> > +extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device
> > +*sdev); extern int scsi_mq_setup_tags(struct Scsi_Host *shost); extern
> > +void scsi_mq_destroy_tags(struct Scsi_Host *shost);
> >  extern int scsi_init_queue(void);
> >  extern void scsi_exit_queue(void);
> >  struct request_queue;
> > diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c index
> > 4a6e4ba..b91cfaf 100644
> > --- a/drivers/scsi/scsi_scan.c
> > +++ b/drivers/scsi/scsi_scan.c
> > @@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct
> > scsi_target *starget,
> >        */
> >       sdev->borken = 1;
> >
> > -     sdev->request_queue = scsi_alloc_queue(sdev);
> > +     if (shost_use_blk_mq(shost))
> > +             sdev->request_queue = scsi_mq_alloc_queue(sdev);
> > +     else
> > +             sdev->request_queue = scsi_alloc_queue(sdev);
> >       if (!sdev->request_queue) {
> >               /* release fn is set up in scsi_sysfs_device_initialise,
> so
> >                * have to free and put manually here */ diff --git
> > a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c index
> deef063..6c9227f
> > 100644
> > --- a/drivers/scsi/scsi_sysfs.c
> > +++ b/drivers/scsi/scsi_sysfs.c
> > @@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct
> > device_attribute *attr,
> >
> >  static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR,
> > show_shost_eh_deadline, store_shost_eh_deadline);
> >
> > +shost_rd_attr(use_blk_mq, "%d\n");
> >  shost_rd_attr(unique_id, "%u\n");
> >  shost_rd_attr(cmd_per_lun, "%hd\n");
> >  shost_rd_attr(can_queue, "%hd\n");
> > @@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct
> > device_attribute *attr, char *buf)  static DEVICE_ATTR(host_busy,
> S_IRUGO,
> > show_host_busy, NULL);
> >
> >  static struct attribute *scsi_sysfs_shost_attrs[] = {
> > +     &dev_attr_use_blk_mq.attr,
> >       &dev_attr_unique_id.attr,
> >       &dev_attr_host_busy.attr,
> >       &dev_attr_cmd_per_lun.attr,
> > diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h index
> > 5e8ebc1..ba20347 100644
> > --- a/include/scsi/scsi_host.h
> > +++ b/include/scsi/scsi_host.h
> > @@ -7,6 +7,7 @@
> >  #include <linux/workqueue.h>
> >  #include <linux/mutex.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/blk-mq.h>
> >  #include <scsi/scsi.h>
> >
> >  struct request_queue;
> > @@ -510,6 +511,9 @@ struct scsi_host_template {
> >        */
> >       unsigned int cmd_size;
> >       struct scsi_host_cmd_pool *cmd_pool;
> > +
> > +     /* temporary flag to disable blk-mq I/O path */
> > +     bool disable_blk_mq;
> >  };
> >
> >  /*
> > @@ -580,7 +584,10 @@ struct Scsi_Host {
> >        * Area to keep a shared tag map (if needed, will be
> >        * NULL if not).
> >        */
> > -     struct blk_queue_tag    *bqt;
> > +     union {
> > +             struct blk_queue_tag    *bqt;
> > +             struct blk_mq_tag_set   tag_set;
> > +     };
> >
> >       atomic_t host_busy;                /* commands actually active on
> low-
> > level */
> >       atomic_t host_blocked;
> > @@ -672,6 +679,8 @@ struct Scsi_Host {
> >       /* The controller does not support WRITE SAME */
> >       unsigned no_write_same:1;
> >
> > +     unsigned use_blk_mq:1;
> > +
> >       /*
> >        * Optional work queue to be utilized by the transport
> >        */
> > @@ -772,6 +781,13 @@ static inline int scsi_host_in_recovery(struct
> > Scsi_Host *shost)
> >               shost->tmf_in_progress;
> >  }
> >
> > +extern bool scsi_use_blk_mq;
> > +
> > +static inline bool shost_use_blk_mq(struct Scsi_Host *shost) {
> > +     return shost->use_blk_mq;
> > +}
> > +
> >  extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
> > extern void scsi_flush_work(struct Scsi_Host *);
> >
> > diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h index
> > 81dd12e..cdcc90b 100644
> > --- a/include/scsi/scsi_tcq.h
> > +++ b/include/scsi/scsi_tcq.h
> > @@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct
> scsi_device
> > *sdev, int depth)
> >       if (!sdev->tagged_supported)
> >               return;
> >
> > -     if (!blk_queue_tagged(sdev->request_queue))
> > +     if (!shost_use_blk_mq(sdev->host) &&
> > +         blk_queue_tagged(sdev->request_queue))
> >               blk_queue_init_tags(sdev->request_queue, depth,
> >                                   sdev->host->bqt);
> >
> > @@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct
> scsi_device
> > *sdev, int depth)
> >   **/
> >  static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int
> depth)  {
> > -     if (blk_queue_tagged(sdev->request_queue))
> > +     if (!shost_use_blk_mq(sdev->host) &&
> > +         blk_queue_tagged(sdev->request_queue))
> >               blk_queue_free_tags(sdev->request_queue);
> >       scsi_adjust_queue_depth(sdev, 0, depth);  } @@ -108,6 +110,15 @@
> > static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char
> *msg)
> >       return 0;
> >  }
> >
> > +static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host
> *shost,
> > +             unsigned int hw_ctx, int tag)
> > +{
> > +     struct request *req;
> > +
> > +     req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
> > +     return req ? (struct scsi_cmnd *)req->special : NULL; }
> > +
> >  /**
> >   * scsi_find_tag - find a tagged command by device
> >   * @SDpnt:   pointer to the ScSI device
> > @@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct
> > scsi_cmnd *cmd, char *msg)
> >   **/
> >  static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev,
> int tag)
> > {
> > -
> >          struct request *req;
> >
> >          if (tag != SCSI_NO_TAG) {
> > +             if (shost_use_blk_mq(sdev->host))
> > +                     return scsi_mq_find_tag(sdev->host, 0, tag);
> > +
> >               req = blk_queue_find_tag(sdev->request_queue, tag);
> >               return req ? (struct scsi_cmnd *)req->special : NULL;
> >       }
> > @@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct
> > scsi_device *sdev, int tag)
> >       return sdev->current_cmnd;
> >  }
> >
> > +
> >  /**
> >   * scsi_init_shared_tag_map - create a shared tag map
> >   * @shost:   the host to share the tag map among all devices
> > @@ -138,6 +152,12 @@ static inline struct scsi_cmnd
> *scsi_find_tag(struct
> > scsi_device *sdev, int tag)  static inline int
> scsi_init_shared_tag_map(struct
> > Scsi_Host *shost, int depth)  {
> >       /*
> > +      * We always have a shared tag map around when using blk-mq.
> > +      */
> > +     if (shost_use_blk_mq(shost))
> > +             return 0;
> > +
> > +     /*
> >        * If the shared tag map isn't already initialized, do it now.
> >        * This saves callers from having to check ->bqt when setting up
> >        * devices on the shared host (for libata) @@ -165,6 +185,8 @@
> static
> > inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
> >       struct request *req;
> >
> >       if (tag != SCSI_NO_TAG) {
> > +             if (shost_use_blk_mq(shost))
> > +                     return scsi_mq_find_tag(shost, 0, tag);
> >               req = blk_map_queue_find_tag(shost->bqt, tag);
> >               return req ? (struct scsi_cmnd *)req->special : NULL;
> >       }
> > --
> > 1.9.1
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the
> > body of a message to majordomo@vger.kernel.org More majordomo info at
> > http://vger.kernel.org/majordomo-info.html




-- 
Device Driver Developer @ Avagotech
Kashyap D. Desai
Note - my new email address
kashyap.desai@avagotech.com

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-07-18 10:13 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
  2014-07-25 19:29   ` Martin K. Petersen
@ 2014-08-18 22:21   ` Kashyap Desai
  2014-08-19 15:41     ` Kashyap Desai
  2014-08-19 16:06     ` Christoph Hellwig
  1 sibling, 2 replies; 99+ messages in thread
From: Kashyap Desai @ 2014-08-18 22:21 UTC (permalink / raw)
  To: Christoph Hellwig, James Bottomley, linux-scsi
  Cc: Jens Axboe, Bart Van Assche, Mike Christie, Martin K. Petersen,
	Robert Elliott, Webb Scales, linux-kernel

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Christoph Hellwig
> Sent: Friday, July 18, 2014 3:43 PM
> To: James Bottomley; linux-scsi@vger.kernel.org
> Cc: Jens Axboe; Bart Van Assche; Mike Christie; Martin K. Petersen;
> Robert Elliott; Webb Scales; linux-kernel@vger.kernel.org
> Subject: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
>
> This patch adds support for an alternate I/O path in the scsi midlayer
> which uses the blk-mq infrastructure instead of the legacy request code.
>
> Use of blk-mq is fully transparent to drivers, although for now a host
> template field is provided to opt out of blk-mq usage in case any
> unforeseen incompatibilities arise.
>
> In general replacing the legacy request code with blk-mq is a simple and
> mostly mechanical transformation.  The biggest exception is the new code
> that deals with the fact that I/O submissions in blk-mq must happen from
> process context, which slightly complicates the I/O completion handler.
> The second biggest difference is that blk-mq is built around the concept
> of preallocated requests that also include driver specific data, which
> in SCSI context means the scsi_cmnd structure.  This completely avoids
> dynamic memory allocations for the fast path through I/O submission.
>
> Due to the preallocated requests the MQ code path exclusively uses the
> host-wide shared tag allocator instead of a per-LUN one.  This only
> affects drivers actually using the block layer provided tag allocator
> instead of their own.  Unlike the old path blk-mq always provides a tag,
> although drivers don't have to use it.
>
> For now the blk-mq path is disabled by default and must be enabled using
> the "use_blk_mq" module parameter.  Once the remaining work in the block
> layer to make blk-mq more suitable for slow devices is complete I hope
> to make it the default and eventually even remove the old code path.
>
> Based on the earlier scsi-mq prototype by Nicholas Bellinger.
>
> Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking
> and various suggestions and code contributions.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Webb Scales <webbnh@hp.com>
> Acked-by: Jens Axboe <axboe@kernel.dk>
> Tested-by: Bart Van Assche <bvanassche@acm.org>
> Tested-by: Robert Elliott <elliott@hp.com>
> ---
>  drivers/scsi/hosts.c      |  35 +++-
>  drivers/scsi/scsi.c       |   5 +-
>  drivers/scsi/scsi_lib.c   | 464
> ++++++++++++++++++++++++++++++++++++++++------
>  drivers/scsi/scsi_priv.h  |   3 +
>  drivers/scsi/scsi_scan.c  |   5 +-
>  drivers/scsi/scsi_sysfs.c |   2 +
>  include/scsi/scsi_host.h  |  18 +-
>  include/scsi/scsi_tcq.h   |  28 ++-
>  8 files changed, 488 insertions(+), 72 deletions(-)
>
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c index
0632eee..6de80e3
> 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>  		goto fail;
>  	}
>
> +	if (shost_use_blk_mq(shost)) {
> +		error = scsi_mq_setup_tags(shost);
> +		if (error)
> +			goto fail;
> +	}
> +
> +	/*
> +	 * Note that we allocate the freelist even for the MQ case for now,
> +	 * as we need a command set aside for scsi_reset_provider.  Having
> +	 * the full host freelist and one command available for that is a
> +	 * little heavy-handed, but avoids introducing a special allocator
> +	 * just for this.  Eventually the structure of scsi_reset_provider
> +	 * will need a major overhaul.
> +	 */
>  	error = scsi_setup_command_freelist(shost);
>  	if (error)
> -		goto fail;
> +		goto out_destroy_tags;
> +
>
>  	if (!shost->shost_gendev.parent)
>  		shost->shost_gendev.parent = dev ? dev : &platform_bus;
> @@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>
>  	error = device_add(&shost->shost_gendev);
>  	if (error)
> -		goto out;
> +		goto out_destroy_freelist;
>
>  	pm_runtime_set_active(&shost->shost_gendev);
>  	pm_runtime_enable(&shost->shost_gendev);
> @@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
>  	device_del(&shost->shost_dev);
>   out_del_gendev:
>  	device_del(&shost->shost_gendev);
> - out:
> + out_destroy_freelist:
>  	scsi_destroy_command_freelist(shost);
> + out_destroy_tags:
> +	if (shost_use_blk_mq(shost))
> +		scsi_mq_destroy_tags(shost);
>   fail:
>  	return error;
>  }
> @@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device *dev)
>  	}
>
>  	scsi_destroy_command_freelist(shost);
> -	if (shost->bqt)
> -		blk_free_tags(shost->bqt);
> +	if (shost_use_blk_mq(shost)) {
> +		if (shost->tag_set.tags)
> +			scsi_mq_destroy_tags(shost);
> +	} else {
> +		if (shost->bqt)
> +			blk_free_tags(shost->bqt);
> +	}
>
>  	kfree(shost->shost_data);
>
> @@ -436,6 +459,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>  	else
>  		shost->dma_boundary = 0xffffffff;
>
> +	shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
> +
>  	device_initialize(&shost->shost_gendev);
>  	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
>  	shost->shost_gendev.bus = &scsi_bus_type;
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 3dde8a3..013709f 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -805,7 +805,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
>  	 * is more IO than the LLD's can_queue (so there are not enuogh
>  	 * tags) request_fn's host queue ready check will handle it.
>  	 */
> -	if (!sdev->host->bqt) {
> +	if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
>  		if (blk_queue_tagged(sdev->request_queue) &&
>  		    blk_queue_resize_tags(sdev->request_queue, tags) != 0)
>  			goto out;
> @@ -1361,6 +1361,9 @@ MODULE_LICENSE("GPL");
> module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
> MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
>
> +bool scsi_use_blk_mq = false;
> +module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
> +
>  static int __init init_scsi(void)
>  {
>  	int error;
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index bbd7a0a..9c44392 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1,5 +1,6 @@
>  /*
> - *  scsi_lib.c Copyright (C) 1999 Eric Youngdale
> + * Copyright (C) 1999 Eric Youngdale
> + * Copyright (C) 2014 Christoph Hellwig
>   *
>   *  SCSI queueing library.
>   *      Initial versions: Eric Youngdale (eric@andante.org).
> @@ -20,6 +21,7 @@
>  #include <linux/delay.h>
>  #include <linux/hardirq.h>
>  #include <linux/scatterlist.h>
> +#include <linux/blk-mq.h>
>
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
> @@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
>  	}
>  }
>
> +static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
> +{
> +	struct scsi_device *sdev = cmd->device;
> +	struct request_queue *q = cmd->request->q;
> +
> +	blk_mq_requeue_request(cmd->request);
> +	blk_mq_kick_requeue_list(q);
> +	put_device(&sdev->sdev_gendev);
> +}
> +
>  /**
>   * __scsi_queue_insert - private queue insertion
>   * @cmd: The SCSI command being requeued
> @@ -150,6 +162,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
>  	 * before blk_cleanup_queue() finishes.
>  	 */
>  	cmd->result = 0;
> +	if (q->mq_ops) {
> +		scsi_mq_requeue_cmd(cmd);
> +		return;
> +	}
>  	spin_lock_irqsave(q->queue_lock, flags);
>  	blk_requeue_request(q, cmd->request);
>  	kblockd_schedule_work(&device->requeue_work);
> @@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
>  	atomic_dec(&sdev->device_busy);
>  }
>
> +static void scsi_kick_queue(struct request_queue *q)
> +{
> +	if (q->mq_ops)
> +		blk_mq_start_hw_queues(q);
> +	else
> +		blk_run_queue(q);
> +}
> +
>  /*
>   * Called for single_lun devices on IO completion. Clear starget_sdev_user,
>   * and call blk_run_queue for all the scsi_devices on the target -
> @@ -332,7 +356,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>  	 * but in most cases, we will be first. Ideally, each LU on the
>  	 * target would get some limited time or requests on the target.
>  	 */
> -	blk_run_queue(current_sdev->request_queue);
> +	scsi_kick_queue(current_sdev->request_queue);
>
>  	spin_lock_irqsave(shost->host_lock, flags);
>  	if (starget->starget_sdev_user)
> @@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>  			continue;
>
>  		spin_unlock_irqrestore(shost->host_lock, flags);
> -		blk_run_queue(sdev->request_queue);
> +		scsi_kick_queue(sdev->request_queue);
>  		spin_lock_irqsave(shost->host_lock, flags);
>
>  		scsi_device_put(sdev);
> @@ -435,7 +459,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
>  			continue;
>  		spin_unlock_irqrestore(shost->host_lock, flags);
>
> -		blk_run_queue(slq);
> +		scsi_kick_queue(slq);
>  		blk_put_queue(slq);
>
>  		spin_lock_irqsave(shost->host_lock, flags);
> @@ -466,7 +490,10 @@ static void scsi_run_queue(struct request_queue *q)
>  	if (!list_empty(&sdev->host->starved_list))
>  		scsi_starved_list_run(sdev->host);
>
> -	blk_run_queue(q);
> +	if (q->mq_ops)
> +		blk_mq_start_stopped_hw_queues(q, false);
> +	else
> +		blk_run_queue(q);
>  }
>
>  void scsi_requeue_run_queue(struct work_struct *work)
> @@ -564,25 +591,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
>  	return mempool_alloc(sgp->pool, gfp_mask);
>  }
>
> -static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
> +static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
>  {
> -	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
> +	if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
> +		return;
> +	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
>  }
>
>  static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
> -			      gfp_t gfp_mask)
> +			      gfp_t gfp_mask, bool mq)
>  {
> +	struct scatterlist *first_chunk = NULL;
>  	int ret;
>
>  	BUG_ON(!nents);
>
> +	if (mq) {
> +		if (nents <= SCSI_MAX_SG_SEGMENTS) {
> +			sdb->table.nents = nents;
> +			sg_init_table(sdb->table.sgl, sdb->table.nents);
> +			return 0;
> +		}
> +		first_chunk = sdb->table.sgl;
> +	}
> +
>  	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
> -			       NULL, gfp_mask, scsi_sg_alloc);
> +			       first_chunk, gfp_mask, scsi_sg_alloc);
>  	if (unlikely(ret))
> -		scsi_free_sgtable(sdb);
> +		scsi_free_sgtable(sdb, mq);
>  	return ret;
>  }
>
> +static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
> +{
> +	if (cmd->request->cmd_type == REQ_TYPE_FS) {
> +		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> +
> +		if (drv->uninit_command)
> +			drv->uninit_command(cmd);
> +	}
> +}
> +
> +static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
> +{
> +	if (cmd->sdb.table.nents)
> +		scsi_free_sgtable(&cmd->sdb, true);
> +	if (cmd->request->next_rq && cmd->request->next_rq->special)
> +		scsi_free_sgtable(cmd->request->next_rq->special, true);
> +	if (scsi_prot_sg_count(cmd))
> +		scsi_free_sgtable(cmd->prot_sdb, true);
> +}
> +
> +static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
> +{
> +	struct scsi_device *sdev = cmd->device;
> +	unsigned long flags;
> +
> +	BUG_ON(list_empty(&cmd->list));
> +
> +	scsi_mq_free_sgtables(cmd);
> +	scsi_uninit_cmd(cmd);
> +
> +	spin_lock_irqsave(&sdev->list_lock, flags);
> +	list_del_init(&cmd->list);
> +	spin_unlock_irqrestore(&sdev->list_lock, flags);
> +}
> +
>  /*
>   * Function:    scsi_release_buffers()
>   *
> @@ -602,19 +676,19 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
>  static void scsi_release_buffers(struct scsi_cmnd *cmd)
>  {
>  	if (cmd->sdb.table.nents)
> -		scsi_free_sgtable(&cmd->sdb);
> +		scsi_free_sgtable(&cmd->sdb, false);
>
>  	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
>
>  	if (scsi_prot_sg_count(cmd))
> -		scsi_free_sgtable(cmd->prot_sdb);
> +		scsi_free_sgtable(cmd->prot_sdb, false);
>  }
>
>  static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
>  {
>  	struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;
>
> -	scsi_free_sgtable(bidi_sdb);
> +	scsi_free_sgtable(bidi_sdb, false);
>  	kmem_cache_free(scsi_sdb_cache, bidi_sdb);
>  	cmd->request->next_rq->special = NULL;
>  }
> @@ -625,8 +699,6 @@ static bool scsi_end_request(struct request *req, int error,
>  	struct scsi_cmnd *cmd = req->special;
>  	struct scsi_device *sdev = cmd->device;
>  	struct request_queue *q = sdev->request_queue;
> -	unsigned long flags;
> -
>
>  	if (blk_update_request(req, error, bytes))
>  		return true;
> @@ -639,14 +711,38 @@ static bool scsi_end_request(struct request *req, int error,
>  	if (blk_queue_add_random(q))
>  		add_disk_randomness(req->rq_disk);
>
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	blk_finish_request(req, error);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (req->mq_ctx) {
> +		/*
> +		 * In the MQ case the command gets freed by __blk_mq_end_io,
> +		 * so we have to do all cleanup that depends on it earlier.
> +		 *
> +		 * We also can't kick the queues from irq context, so we
> +		 * will have to defer it to a workqueue.
> +		 */
> +		scsi_mq_uninit_cmd(cmd);
> +
> +		__blk_mq_end_io(req, error);
> +
> +		if (scsi_target(sdev)->single_lun ||
> +		    !list_empty(&sdev->host->starved_list))
> +			kblockd_schedule_work(&sdev->requeue_work);
> +		else
> +			blk_mq_start_stopped_hw_queues(q, true);
> +
> +		put_device(&sdev->sdev_gendev);
> +	} else {
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		blk_finish_request(req, error);
> +		spin_unlock_irqrestore(q->queue_lock, flags);
> +
> +		if (bidi_bytes)
> +			scsi_release_bidi_buffers(cmd);
> +		scsi_release_buffers(cmd);
> +		scsi_next_command(cmd);
> +	}
>
> -	if (bidi_bytes)
> -		scsi_release_bidi_buffers(cmd);
> -	scsi_release_buffers(cmd);
> -	scsi_next_command(cmd);
>  	return false;
>  }
>
> @@ -953,8 +1049,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
>  		/* Unprep the request and put it back at the head of the queue.
>  		 * A new command will be prepared and issued.
>  		 */
> -		scsi_release_buffers(cmd);
> -		scsi_requeue_command(q, cmd);
> +		if (q->mq_ops) {
> +			cmd->request->cmd_flags &= ~REQ_DONTPREP;
> +			scsi_mq_uninit_cmd(cmd);
> +			scsi_mq_requeue_cmd(cmd);
> +		} else {
> +			scsi_release_buffers(cmd);
> +			scsi_requeue_command(q, cmd);
> +		}
>  		break;
>  	case ACTION_RETRY:
>  		/* Retry the same command immediately */
> @@ -976,9 +1078,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
>  	 * If sg table allocation fails, requeue request later.
>  	 */
>  	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
> -					gfp_mask))) {
> +					gfp_mask, req->mq_ctx != NULL)))
>  		return BLKPREP_DEFER;
> -	}
>
>  	/*
>  	 * Next, walk the list, and fill in the addresses and sizes of
> @@ -1006,6 +1107,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>  {
>  	struct scsi_device *sdev = cmd->device;
>  	struct request *rq = cmd->request;
> +	bool is_mq = (rq->mq_ctx != NULL);
>  	int error;
>
>  	BUG_ON(!rq->nr_phys_segments);
> @@ -1015,15 +1117,19 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>  		goto err_exit;
>
>  	if (blk_bidi_rq(rq)) {
> -		struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
> -			scsi_sdb_cache, GFP_ATOMIC);
> -		if (!bidi_sdb) {
> -			error = BLKPREP_DEFER;
> -			goto err_exit;
> +		if (!rq->q->mq_ops) {
> +			struct scsi_data_buffer *bidi_sdb =
> +				kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
> +			if (!bidi_sdb) {
> +				error = BLKPREP_DEFER;
> +				goto err_exit;
> +			}
> +
> +			rq->next_rq->special = bidi_sdb;
>  		}
>
> -		rq->next_rq->special = bidi_sdb;
> -		error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
> +		error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
> +					  GFP_ATOMIC);
>  		if (error)
>  			goto err_exit;
>  	}
> @@ -1035,7 +1141,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>  		BUG_ON(prot_sdb == NULL);
>  		ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
>
> -		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
> +		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
>  			error = BLKPREP_DEFER;
>  			goto err_exit;
>  		}
> @@ -1049,13 +1155,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
>  		cmd->prot_sdb->table.nents = count;
>  	}
>
> -	return BLKPREP_OK ;
> -
> +	return BLKPREP_OK;
>  err_exit:
> -	scsi_release_buffers(cmd);
> -	cmd->request->special = NULL;
> -	scsi_put_command(cmd);
> -	put_device(&sdev->sdev_gendev);
> +	if (is_mq) {
> +		scsi_mq_free_sgtables(cmd);
> +	} else {
> +		scsi_release_buffers(cmd);
> +		cmd->request->special = NULL;
> +		scsi_put_command(cmd);
> +		put_device(&sdev->sdev_gendev);
> +	}
>  	return error;
>  }
>  EXPORT_SYMBOL(scsi_init_io);
> @@ -1266,13 +1375,7 @@ out:
>
>  static void scsi_unprep_fn(struct request_queue *q, struct request *req)
>  {
> -	if (req->cmd_type == REQ_TYPE_FS) {
> -		struct scsi_cmnd *cmd = req->special;
> -		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
> -
> -		if (drv->uninit_command)
> -			drv->uninit_command(cmd);
> -	}
> +	scsi_uninit_cmd(req->special);
>  }
>
>  /*
> @@ -1295,7 +1398,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  		 * unblock after device_blocked iterates to zero
>  		 */
>  		if (atomic_dec_return(&sdev->device_blocked) > 0) {
> -			blk_delay_queue(q, SCSI_QUEUE_DELAY);
> +			/*
> +			 * For the MQ case we take care of this in the caller.
> +			 */
> +			if (!q->mq_ops)
> +				blk_delay_queue(q, SCSI_QUEUE_DELAY);
>  			goto out_dec;
>  		}
>  		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
> @@ -1671,6 +1778,180 @@ out_delay:
>  		blk_delay_queue(q, SCSI_QUEUE_DELAY);
>  }
>
> +static inline int prep_to_mq(int ret)
> +{
> +	switch (ret) {
> +	case BLKPREP_OK:
> +		return 0;
> +	case BLKPREP_DEFER:
> +		return BLK_MQ_RQ_QUEUE_BUSY;
> +	default:
> +		return BLK_MQ_RQ_QUEUE_ERROR;
> +	}
> +}
> +
> +static int scsi_mq_prep_fn(struct request *req)
> +{
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> +	struct scsi_device *sdev = req->q->queuedata;
> +	struct Scsi_Host *shost = sdev->host;
> +	unsigned char *sense_buf = cmd->sense_buffer;
> +	struct scatterlist *sg;
> +
> +	memset(cmd, 0, sizeof(struct scsi_cmnd));
> +
> +	req->special = cmd;
> +
> +	cmd->request = req;
> +	cmd->device = sdev;
> +	cmd->sense_buffer = sense_buf;
> +
> +	cmd->tag = req->tag;
> +
> +	req->cmd = req->__cmd;
> +	cmd->cmnd = req->cmd;
> +	cmd->prot_op = SCSI_PROT_NORMAL;
> +
> +	INIT_LIST_HEAD(&cmd->list);
> +	INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
> +	cmd->jiffies_at_alloc = jiffies;
> +
> +	/*
> +	 * XXX: cmd_list lookups are only used by two drivers, try to get
> +	 * rid of this list in common code.
> +	 */
> +	spin_lock_irq(&sdev->list_lock);
> +	list_add_tail(&cmd->list, &sdev->cmd_list);
> +	spin_unlock_irq(&sdev->list_lock);

Hi Chris,

I am using the scsi.mq.4 branch and profiling to find possible
improvements in the low-level driver so it can benefit from scsi-mq.  I am
using an LSI/Avago 12G MegaRaid Invader and a total of 12 SSDs (12Gb/s
each).  I have made some changes to the "megaraid_sas" driver to take
advantage of the scsi-mq interface; I will send the list of changes later
to get early feedback.

I am replying in this thread because this patch is the most relevant one
for explaining what I am seeing.

Here are a few data points (a 4K random read FIO/libaio load on a
two-socket Supermicro server):

With the "null_blk" driver I am able to get 1800K IOPS on my setup.  With
the "megaraid_sas" driver in loopback mode (fake READ/WRITE) I see the
numbers below: keeping the workers on node 0 gives 1800K IOPS (similar to
null_blk), but spreading the workers across node 0 and node 1 drops this
to ~700K IOPS.

The above experiment hints that there may be some difference between
scsi-mq and plain blk-mq.

My original problem was: a 12-drive R0 volume cannot scale beyond 750K
IOPS, but it goes up to 1200K IOPS if I keep the workers on node 0 using
fio's cpus_allowed parameter.
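
For reference, a fio invocation along these lines reproduces the
experiment; the device name, queue depth and CPU list here are
illustrative, not my exact job file:

	fio --name=randread --filename=/dev/sdX --direct=1 --rw=randread \
	    --bs=4k --ioengine=libaio --iodepth=32 --numjobs=8 \
	    --cpus_allowed=0-11 --time_based --runtime=60 --group_reporting

where cpus_allowed pins the workers to the node-0 CPUs.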

Lock stats - the data below is for the workload that would not scale
beyond 750K IOPS:


------------------------------------------------------------------------------
class name: &(&sdev->list_lock)->rlock
    con-bounces:    2307248         contentions:    2308395
    waittime-min:   0.07            waittime-max:   158.89
    waittime-total: 10435357.44     waittime-avg:   4.52
    acq-bounces:    3849400         acquisitions:   3958002
    holdtime-min:   0.04            holdtime-max:   26.02
    holdtime-total: 1123671.56      holdtime-avg:   0.28

    contention points:
        1105029  [<ffffffff814ac980>] scsi_queue_rq+0x560/0x750
        1203366  [<ffffffff814abc97>] scsi_mq_uninit_cmd+0x47/0xb0
    acquisition points:
        1176271  [<ffffffff814abc97>] scsi_mq_uninit_cmd+0x47/0xb0
        1132124  [<ffffffff814ac980>] scsi_queue_rq+0x560/0x750
------------------------------------------------------------------------------



I read this comment and see that very few drivers actually use cmd_list.
I think that if we remove cmd_list, performance will scale, as I am seeing
major contention on this lock.
Just thought I would ping you to see whether this is a known limitation
for now, or whether there is any plan to change this lock in the near
future?  One possible shape for such a change is sketched below.
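
Sketched only as an idea (the use_cmd_list flag is hypothetical, it is not
part of the posted series): let the two drivers that need the lookups opt
in, so everyone else skips the lock entirely:

	/* in scsi_mq_prep_fn(), only when the host asked for it */
	if (shost->use_cmd_list) {
		spin_lock_irq(&sdev->list_lock);
		list_add_tail(&cmd->list, &sdev->cmd_list);
		spin_unlock_irq(&sdev->list_lock);
	}

with a matching conditional around the list_del_init() in
scsi_mq_uninit_cmd().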


~ Kashyap

> +
> +	sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
> +	cmd->sdb.table.sgl = sg;
> +
> +	if (scsi_host_get_prot(shost)) {
> +		cmd->prot_sdb = (void *)sg +
> +			shost->sg_tablesize * sizeof(struct scatterlist);
> +		memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
> +
> +		cmd->prot_sdb->table.sgl =
> +			(struct scatterlist *)(cmd->prot_sdb + 1);
> +	}
> +
> +	if (blk_bidi_rq(req)) {
> +		struct request *next_rq = req->next_rq;
> +		struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
> +
> +		memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
> +		bidi_sdb->table.sgl =
> +			(struct scatterlist *)(bidi_sdb + 1);
> +
> +		next_rq->special = bidi_sdb;
> +	}
> +
> +	return scsi_setup_cmnd(sdev, req);
> +}
> +
> +static void scsi_mq_done(struct scsi_cmnd *cmd)
> +{
> +	trace_scsi_dispatch_cmd_done(cmd);
> +	blk_mq_complete_request(cmd->request);
> +}
> +
> +static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
> +{
> +	struct request_queue *q = req->q;
> +	struct scsi_device *sdev = q->queuedata;
> +	struct Scsi_Host *shost = sdev->host;
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> +	int ret;
> +	int reason;
> +
> +	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
> +	if (ret)
> +		goto out;
> +
> +	ret = BLK_MQ_RQ_QUEUE_BUSY;
> +	if (!get_device(&sdev->sdev_gendev))
> +		goto out;
> +
> +	if (!scsi_dev_queue_ready(q, sdev))
> +		goto out_put_device;
> +	if (!scsi_target_queue_ready(shost, sdev))
> +		goto out_dec_device_busy;
> +	if (!scsi_host_queue_ready(q, shost, sdev))
> +		goto out_dec_target_busy;
> +
> +	if (!(req->cmd_flags & REQ_DONTPREP)) {
> +		ret = prep_to_mq(scsi_mq_prep_fn(req));
> +		if (ret)
> +			goto out_dec_host_busy;
> +		req->cmd_flags |= REQ_DONTPREP;
> +	}
> +
> +	scsi_init_cmd_errh(cmd);
> +	cmd->scsi_done = scsi_mq_done;
> +
> +	reason = scsi_dispatch_cmd(cmd);
> +	if (reason) {
> +		scsi_set_blocked(cmd, reason);
> +		ret = BLK_MQ_RQ_QUEUE_BUSY;
> +		goto out_dec_host_busy;
> +	}
> +
> +	return BLK_MQ_RQ_QUEUE_OK;
> +
> +out_dec_host_busy:
> +	atomic_dec(&shost->host_busy);
> +out_dec_target_busy:
> +	if (scsi_target(sdev)->can_queue > 0)
> +		atomic_dec(&scsi_target(sdev)->target_busy);
> +out_dec_device_busy:
> +	atomic_dec(&sdev->device_busy);
> +out_put_device:
> +	put_device(&sdev->sdev_gendev);
> +out:
> +	switch (ret) {
> +	case BLK_MQ_RQ_QUEUE_BUSY:
> +		blk_mq_stop_hw_queue(hctx);
> +		if (atomic_read(&sdev->device_busy) == 0 &&
> +		    !scsi_device_blocked(sdev))
> +			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
> +		break;
> +	case BLK_MQ_RQ_QUEUE_ERROR:
> +		/*
> +		 * Make sure to release all allocated resources when
> +		 * we hit an error, as we will never see this command
> +		 * again.
> +		 */
> +		if (req->cmd_flags & REQ_DONTPREP)
> +			scsi_mq_uninit_cmd(cmd);
> +		break;
> +	default:
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int scsi_init_request(void *data, struct request *rq,
> +		unsigned int hctx_idx, unsigned int request_idx,
> +		unsigned int numa_node)
> +{
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> +	cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
> +			numa_node);
> +	if (!cmd->sense_buffer)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static void scsi_exit_request(void *data, struct request *rq,
> +		unsigned int hctx_idx, unsigned int request_idx)
> +{
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> +	kfree(cmd->sense_buffer);
> +}
> +
>  static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
>  {
>  	struct device *host_dev;
> @@ -1692,16 +1973,10 @@ static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
>  	return bounce_limit;
>  }
>
> -struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> -					 request_fn_proc *request_fn)
> +static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
>  {
> -	struct request_queue *q;
>  	struct device *dev = shost->dma_dev;
>
> -	q = blk_init_queue(request_fn, NULL);
> -	if (!q)
> -		return NULL;
> -
>  	/*
>  	 * this limit is imposed by hardware restrictions
>  	 */
> @@ -1732,7 +2007,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
>  	 * blk_queue_update_dma_alignment() later.
>  	 */
>  	blk_queue_dma_alignment(q, 0x03);
> +}
>
> +struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
> +					 request_fn_proc *request_fn)
> +{
> +	struct request_queue *q;
> +
> +	q = blk_init_queue(request_fn, NULL);
> +	if (!q)
> +		return NULL;
> +	__scsi_init_queue(shost, q);
>  	return q;
>  }
>  EXPORT_SYMBOL(__scsi_alloc_queue);
> @@ -1753,6 +2038,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
>  	return q;
>  }
>
> +static struct blk_mq_ops scsi_mq_ops = {
> +	.map_queue	= blk_mq_map_queue,
> +	.queue_rq	= scsi_queue_rq,
> +	.complete	= scsi_softirq_done,
> +	.timeout	= scsi_times_out,
> +	.init_request	= scsi_init_request,
> +	.exit_request	= scsi_exit_request,
> +};
> +
> +struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
> +{
> +	sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
> +	if (IS_ERR(sdev->request_queue))
> +		return NULL;
> +
> +	sdev->request_queue->queuedata = sdev;
> +	__scsi_init_queue(sdev->host, sdev->request_queue);
> +	return sdev->request_queue;
> +}
> +
> +int scsi_mq_setup_tags(struct Scsi_Host *shost)
> +{
> +	unsigned int cmd_size, sgl_size, tbl_size;
> +
> +	tbl_size = shost->sg_tablesize;
> +	if (tbl_size > SCSI_MAX_SG_SEGMENTS)
> +		tbl_size = SCSI_MAX_SG_SEGMENTS;
> +	sgl_size = tbl_size * sizeof(struct scatterlist);
> +	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
> +	if (scsi_host_get_prot(shost))
> +		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
> +
> +	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
> +	shost->tag_set.ops = &scsi_mq_ops;
> +	shost->tag_set.nr_hw_queues = 1;
> +	shost->tag_set.queue_depth = shost->can_queue;
> +	shost->tag_set.cmd_size = cmd_size;
> +	shost->tag_set.numa_node = NUMA_NO_NODE;
> +	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
> +	shost->tag_set.driver_data = shost;
> +
> +	return blk_mq_alloc_tag_set(&shost->tag_set);
> +}
> +
> +void scsi_mq_destroy_tags(struct Scsi_Host *shost)
> +{
> +	blk_mq_free_tag_set(&shost->tag_set);
> +}
> +
>  /*
>   * Function:    scsi_block_requests()
>   *
> @@ -2498,9 +2832,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
>  	 * block layer from calling the midlayer with this device's
>  	 * request queue.
>  	 */
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	blk_stop_queue(q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (q->mq_ops) {
> +		blk_mq_stop_hw_queues(q);
> +	} else {
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		blk_stop_queue(q);
> +		spin_unlock_irqrestore(q->queue_lock, flags);
> +	}
>
>  	return 0;
>  }
> @@ -2546,9 +2884,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
>  		 sdev->sdev_state != SDEV_OFFLINE)
>  		return -EINVAL;
>
> -	spin_lock_irqsave(q->queue_lock, flags);
> -	blk_start_queue(q);
> -	spin_unlock_irqrestore(q->queue_lock, flags);
> +	if (q->mq_ops) {
> +		blk_mq_start_stopped_hw_queues(q, false);
> +	} else {
> +		spin_lock_irqsave(q->queue_lock, flags);
> +		blk_start_queue(q);
> +		spin_unlock_irqrestore(q->queue_lock, flags);
> +	}
>
>  	return 0;
>  }
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index a45d1c2..12b8e1b 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
>  extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
>  extern void scsi_run_host_queues(struct Scsi_Host *shost);
>  extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
> +extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
> +extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
> +extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
>  extern int scsi_init_queue(void);
>  extern void scsi_exit_queue(void);
>  struct request_queue;
> diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
> index 4a6e4ba..b91cfaf 100644
> --- a/drivers/scsi/scsi_scan.c
> +++ b/drivers/scsi/scsi_scan.c
> @@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
>  	 */
>  	sdev->borken = 1;
>
> -	sdev->request_queue = scsi_alloc_queue(sdev);
> +	if (shost_use_blk_mq(shost))
> +		sdev->request_queue = scsi_mq_alloc_queue(sdev);
> +	else
> +		sdev->request_queue = scsi_alloc_queue(sdev);
>  	if (!sdev->request_queue) {
>  		/* release fn is set up in scsi_sysfs_device_initialise, so
>  		 * have to free and put manually here */
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index deef063..6c9227f 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
>
>  static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
>
> +shost_rd_attr(use_blk_mq, "%d\n");
>  shost_rd_attr(unique_id, "%u\n");
>  shost_rd_attr(cmd_per_lun, "%hd\n");
>  shost_rd_attr(can_queue, "%hd\n");
> @@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
>  static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
>
>  static struct attribute *scsi_sysfs_shost_attrs[] = {
> +	&dev_attr_use_blk_mq.attr,
>  	&dev_attr_unique_id.attr,
>  	&dev_attr_host_busy.attr,
>  	&dev_attr_cmd_per_lun.attr,
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index 5e8ebc1..ba20347 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -7,6 +7,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/mutex.h>
>  #include <linux/seq_file.h>
> +#include <linux/blk-mq.h>
>  #include <scsi/scsi.h>
>
>  struct request_queue;
> @@ -510,6 +511,9 @@ struct scsi_host_template {
>  	 */
>  	unsigned int cmd_size;
>  	struct scsi_host_cmd_pool *cmd_pool;
> +
> +	/* temporary flag to disable blk-mq I/O path */
> +	bool disable_blk_mq;
>  };
>
>  /*
> @@ -580,7 +584,10 @@ struct Scsi_Host {
>  	 * Area to keep a shared tag map (if needed, will be
>  	 * NULL if not).
>  	 */
> -	struct blk_queue_tag	*bqt;
> +	union {
> +		struct blk_queue_tag	*bqt;
> +		struct blk_mq_tag_set	tag_set;
> +	};
>
>  	atomic_t host_busy;		   /* commands actually active on low-level */
>  	atomic_t host_blocked;
> @@ -672,6 +679,8 @@ struct Scsi_Host {
>  	/* The controller does not support WRITE SAME */
>  	unsigned no_write_same:1;
>
> +	unsigned use_blk_mq:1;
> +
>  	/*
>  	 * Optional work queue to be utilized by the transport
>  	 */
> @@ -772,6 +781,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
>  		shost->tmf_in_progress;
>  }
>
> +extern bool scsi_use_blk_mq;
> +
> +static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
> +{
> +	return shost->use_blk_mq;
> +}
> +
>  extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
> extern void scsi_flush_work(struct Scsi_Host *);
>
> diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
> index 81dd12e..cdcc90b 100644
> --- a/include/scsi/scsi_tcq.h
> +++ b/include/scsi/scsi_tcq.h
> @@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
>  	if (!sdev->tagged_supported)
>  		return;
>
> -	if (!blk_queue_tagged(sdev->request_queue))
> +	if (!shost_use_blk_mq(sdev->host) &&
> +	    blk_queue_tagged(sdev->request_queue))
>  		blk_queue_init_tags(sdev->request_queue, depth,
>  				    sdev->host->bqt);
>
> @@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
>   **/
>  static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
>  {
> -	if (blk_queue_tagged(sdev->request_queue))
> +	if (!shost_use_blk_mq(sdev->host) &&
> +	    blk_queue_tagged(sdev->request_queue))
>  		blk_queue_free_tags(sdev->request_queue);
>  	scsi_adjust_queue_depth(sdev, 0, depth);
>  }
> @@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
>  	return 0;
>  }
>
> +static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
> +		unsigned int hw_ctx, int tag)
> +{
> +	struct request *req;
> +
> +	req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
> +	return req ? (struct scsi_cmnd *)req->special : NULL;
> +}
> +
>  /**
>   * scsi_find_tag - find a tagged command by device
>   * @SDpnt:	pointer to the ScSI device
> @@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
>   **/
>  static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
> {
> -
>          struct request *req;
>
>          if (tag != SCSI_NO_TAG) {
> +		if (shost_use_blk_mq(sdev->host))
> +			return scsi_mq_find_tag(sdev->host, 0, tag);
> +
>          	req = blk_queue_find_tag(sdev->request_queue, tag);
>  	        return req ? (struct scsi_cmnd *)req->special : NULL;
>  	}
> @@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
>  	return sdev->current_cmnd;
>  }
>
> +
>  /**
>   * scsi_init_shared_tag_map - create a shared tag map
>   * @shost:	the host to share the tag map among all devices
> @@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
>  static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
>  {
>  	/*
> +	 * We always have a shared tag map around when using blk-mq.
> +	 */
> +	if (shost_use_blk_mq(shost))
> +		return 0;
> +
> +	/*
>  	 * If the shared tag map isn't already initialized, do it now.
>  	 * This saves callers from having to check ->bqt when setting up
>  	 * devices on the shared host (for libata) @@ -165,6 +185,8 @@
static
> inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
>  	struct request *req;
>
>  	if (tag != SCSI_NO_TAG) {
> +		if (shost_use_blk_mq(shost))
> +			return scsi_mq_find_tag(shost, 0, tag);
>  		req = blk_map_queue_find_tag(shost->bqt, tag);
>  		return req ? (struct scsi_cmnd *)req->special : NULL;
>  	}
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-07-18 10:13 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
@ 2014-07-25 19:29   ` Martin K. Petersen
  2014-08-18 22:21   ` Kashyap Desai
  1 sibling, 0 replies; 99+ messages in thread
From: Martin K. Petersen @ 2014-07-25 19:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: James Bottomley, linux-scsi, Jens Axboe, Bart Van Assche,
	Mike Christie, Martin K. Petersen, Robert Elliott, Webb Scales,
	linux-kernel

>>>>> "Christoph" == Christoph Hellwig <hch@lst.de> writes:

Christoph> This patch adds support for an alternate I/O path in the scsi
Christoph> midlayer which uses the blk-mq infrastructure instead of the
Christoph> legacy request code.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-07-18 10:12 scsi-mq V4 Christoph Hellwig
@ 2014-07-18 10:13 ` Christoph Hellwig
  2014-07-25 19:29   ` Martin K. Petersen
  2014-08-18 22:21   ` Kashyap Desai
  0 siblings, 2 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-07-18 10:13 UTC (permalink / raw)
  To: James Bottomley, linux-scsi
  Cc: Jens Axboe, Bart Van Assche, Mike Christie, Martin K. Petersen,
	Robert Elliott, Webb Scales, linux-kernel

This patch adds support for an alternate I/O path in the scsi midlayer
which uses the blk-mq infrastructure instead of the legacy request code.

Use of blk-mq is fully transparent to drivers, although for now a host
template field is provided to opt out of blk-mq usage in case any
unforeseen incompatibilities arise.
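
A minimal sketch of opting out, with "foo" standing in for a real driver:

	static struct scsi_host_template foo_template = {
		.name		= "foo",
		/* ... the usual host template methods ... */
		.disable_blk_mq	= true,	/* stay on the legacy request path */
	};

The flag is sampled once in scsi_host_alloc(), so it is a per-host,
load-time decision.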

In general replacing the legacy request code with blk-mq is a simple and
mostly mechanical transformation.  The biggest exception is the new code
that deals with the fact that I/O submissions in blk-mq must happen from
process context, which slightly complicates the I/O completion handler.
The second biggest difference is that blk-mq is built around the concept
of preallocated requests that also include driver specific data, which
in SCSI context means the scsi_cmnd structure.  This completely avoids
dynamic memory allocations for the fast path through I/O submission.
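
The resulting per-request payload is laid out back to back in the blk-mq
pdu; roughly, matching the scsi_mq_setup_tags() arithmetic further down:

	struct scsi_cmnd
	| LLD per-command data (hostt->cmd_size)
	| scatterlist[sg_tablesize]
	| struct scsi_data_buffer + protection scatterlist (PI-capable
	  hosts only)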

Due to the preallocated requests the MQ code path exclusively uses the
host-wide shared tag allocator instead of a per-LUN one.  This only
affects drivers actually using the block layer provided tag allocator
instead of their own.  Unlike the old path blk-mq always provides a tag,
although drivers don't have to use it.
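
For a driver this means cmd->request->tag is always valid on the blk-mq
path, and an in-flight command can be looked up from a tag through the
scsi_host_find_tag() helper updated near the end of this patch; a small
sketch, where foo_handle_completion() stands in for driver-specific code:

	struct scsi_cmnd *scmd = scsi_host_find_tag(shost, tag);

	if (scmd)
		foo_handle_completion(scmd);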

For now the blk-mq path is disabled by default and must be enabled using
the "use_blk_mq" module parameter.  Once the remaining work in the block
layer to make blk-mq more suitable for slow devices is complete I hope
to make it the default and eventually even remove the old code path.
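
Concretely, enabling it means booting with scsi_mod.use_blk_mq=1 or, for a
modular build, loading scsi_mod with that option; for example:

	# on the kernel command line
	scsi_mod.use_blk_mq=1

	# or, for a modular scsi_mod
	modprobe scsi_mod use_blk_mq=1

The value is also visible under /sys/module/scsi_mod/parameters/use_blk_mq,
and each host reports the path it chose through the new use_blk_mq sysfs
attribute.  Note that the setting is sampled when a host is allocated, so
it only affects hosts registered afterwards.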

Based on the earlier scsi-mq prototype by Nicholas Bellinger.

Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking and
various suggestions and code contributions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Webb Scales <webbnh@hp.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Robert Elliott <elliott@hp.com>
---
 drivers/scsi/hosts.c      |  35 +++-
 drivers/scsi/scsi.c       |   5 +-
 drivers/scsi/scsi_lib.c   | 464 ++++++++++++++++++++++++++++++++++++++++------
 drivers/scsi/scsi_priv.h  |   3 +
 drivers/scsi/scsi_scan.c  |   5 +-
 drivers/scsi/scsi_sysfs.c |   2 +
 include/scsi/scsi_host.h  |  18 +-
 include/scsi/scsi_tcq.h   |  28 ++-
 8 files changed, 488 insertions(+), 72 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 0632eee..6de80e3 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 		goto fail;
 	}
 
+	if (shost_use_blk_mq(shost)) {
+		error = scsi_mq_setup_tags(shost);
+		if (error)
+			goto fail;
+	}
+
+	/*
+	 * Note that we allocate the freelist even for the MQ case for now,
+	 * as we need a command set aside for scsi_reset_provider.  Having
+	 * the full host freelist and one command available for that is a
+	 * little heavy-handed, but avoids introducing a special allocator
+	 * just for this.  Eventually the structure of scsi_reset_provider
+	 * will need a major overhaul.
+	 */
 	error = scsi_setup_command_freelist(shost);
 	if (error)
-		goto fail;
+		goto out_destroy_tags;
+
 
 	if (!shost->shost_gendev.parent)
 		shost->shost_gendev.parent = dev ? dev : &platform_bus;
@@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	error = device_add(&shost->shost_gendev);
 	if (error)
-		goto out;
+		goto out_destroy_freelist;
 
 	pm_runtime_set_active(&shost->shost_gendev);
 	pm_runtime_enable(&shost->shost_gendev);
@@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 	device_del(&shost->shost_dev);
  out_del_gendev:
 	device_del(&shost->shost_gendev);
- out:
+ out_destroy_freelist:
 	scsi_destroy_command_freelist(shost);
+ out_destroy_tags:
+	if (shost_use_blk_mq(shost))
+		scsi_mq_destroy_tags(shost);
  fail:
 	return error;
 }
@@ -309,8 +327,13 @@ static void scsi_host_dev_release(struct device *dev)
 	}
 
 	scsi_destroy_command_freelist(shost);
-	if (shost->bqt)
-		blk_free_tags(shost->bqt);
+	if (shost_use_blk_mq(shost)) {
+		if (shost->tag_set.tags)
+			scsi_mq_destroy_tags(shost);
+	} else {
+		if (shost->bqt)
+			blk_free_tags(shost->bqt);
+	}
 
 	kfree(shost->shost_data);
 
@@ -436,6 +459,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	else
 		shost->dma_boundary = 0xffffffff;
 
+	shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
+
 	device_initialize(&shost->shost_gendev);
 	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
 	shost->shost_gendev.bus = &scsi_bus_type;
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 3dde8a3..013709f 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -805,7 +805,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
 	 * is more IO than the LLD's can_queue (so there are not enuogh
 	 * tags) request_fn's host queue ready check will handle it.
 	 */
-	if (!sdev->host->bqt) {
+	if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
 		if (blk_queue_tagged(sdev->request_queue) &&
 		    blk_queue_resize_tags(sdev->request_queue, tags) != 0)
 			goto out;
@@ -1361,6 +1361,9 @@ MODULE_LICENSE("GPL");
 module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
 
+bool scsi_use_blk_mq = false;
+module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
+
 static int __init init_scsi(void)
 {
 	int error;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index bbd7a0a..9c44392 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1,5 +1,6 @@
 /*
- *  scsi_lib.c Copyright (C) 1999 Eric Youngdale
+ * Copyright (C) 1999 Eric Youngdale
+ * Copyright (C) 2014 Christoph Hellwig
  *
  *  SCSI queueing library.
  *      Initial versions: Eric Youngdale (eric@andante.org).
@@ -20,6 +21,7 @@
 #include <linux/delay.h>
 #include <linux/hardirq.h>
 #include <linux/scatterlist.h>
+#include <linux/blk-mq.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -113,6 +115,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 	}
 }
 
+static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
+{
+	struct scsi_device *sdev = cmd->device;
+	struct request_queue *q = cmd->request->q;
+
+	blk_mq_requeue_request(cmd->request);
+	blk_mq_kick_requeue_list(q);
+	put_device(&sdev->sdev_gendev);
+}
+
 /**
  * __scsi_queue_insert - private queue insertion
  * @cmd: The SCSI command being requeued
@@ -150,6 +162,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	 * before blk_cleanup_queue() finishes.
 	 */
 	cmd->result = 0;
+	if (q->mq_ops) {
+		scsi_mq_requeue_cmd(cmd);
+		return;
+	}
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, cmd->request);
 	kblockd_schedule_work(&device->requeue_work);
@@ -308,6 +324,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 	atomic_dec(&sdev->device_busy);
 }
 
+static void scsi_kick_queue(struct request_queue *q)
+{
+	if (q->mq_ops)
+		blk_mq_start_hw_queues(q);
+	else
+		blk_run_queue(q);
+}
+
 /*
  * Called for single_lun devices on IO completion. Clear starget_sdev_user,
  * and call blk_run_queue for all the scsi_devices on the target -
@@ -332,7 +356,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 	 * but in most cases, we will be first. Ideally, each LU on the
 	 * target would get some limited time or requests on the target.
 	 */
-	blk_run_queue(current_sdev->request_queue);
+	scsi_kick_queue(current_sdev->request_queue);
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	if (starget->starget_sdev_user)
@@ -345,7 +369,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 			continue;
 
 		spin_unlock_irqrestore(shost->host_lock, flags);
-		blk_run_queue(sdev->request_queue);
+		scsi_kick_queue(sdev->request_queue);
 		spin_lock_irqsave(shost->host_lock, flags);
 	
 		scsi_device_put(sdev);
@@ -435,7 +459,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
 			continue;
 		spin_unlock_irqrestore(shost->host_lock, flags);
 
-		blk_run_queue(slq);
+		scsi_kick_queue(slq);
 		blk_put_queue(slq);
 
 		spin_lock_irqsave(shost->host_lock, flags);
@@ -466,7 +490,10 @@ static void scsi_run_queue(struct request_queue *q)
 	if (!list_empty(&sdev->host->starved_list))
 		scsi_starved_list_run(sdev->host);
 
-	blk_run_queue(q);
+	if (q->mq_ops)
+		blk_mq_start_stopped_hw_queues(q, false);
+	else
+		blk_run_queue(q);
 }
 
 void scsi_requeue_run_queue(struct work_struct *work)
@@ -564,25 +591,72 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
 	return mempool_alloc(sgp->pool, gfp_mask);
 }
 
-static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
+static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
 {
-	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
+	if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
+		return;
+	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
 }
 
 static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
-			      gfp_t gfp_mask)
+			      gfp_t gfp_mask, bool mq)
 {
+	struct scatterlist *first_chunk = NULL;
 	int ret;
 
 	BUG_ON(!nents);
 
+	if (mq) {
+		if (nents <= SCSI_MAX_SG_SEGMENTS) {
+			sdb->table.nents = nents;
+			sg_init_table(sdb->table.sgl, sdb->table.nents);
+			return 0;
+		}
+		first_chunk = sdb->table.sgl;
+	}
+
 	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
-			       NULL, gfp_mask, scsi_sg_alloc);
+			       first_chunk, gfp_mask, scsi_sg_alloc);
 	if (unlikely(ret))
-		scsi_free_sgtable(sdb);
+		scsi_free_sgtable(sdb, mq);
 	return ret;
 }
 
+static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
+{
+	if (cmd->request->cmd_type == REQ_TYPE_FS) {
+		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
+
+		if (drv->uninit_command)
+			drv->uninit_command(cmd);
+	}
+}
+
+static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
+{
+	if (cmd->sdb.table.nents)
+		scsi_free_sgtable(&cmd->sdb, true);
+	if (cmd->request->next_rq && cmd->request->next_rq->special)
+		scsi_free_sgtable(cmd->request->next_rq->special, true);
+	if (scsi_prot_sg_count(cmd))
+		scsi_free_sgtable(cmd->prot_sdb, true);
+}
+
+static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd)
+{
+	struct scsi_device *sdev = cmd->device;
+	unsigned long flags;
+
+	BUG_ON(list_empty(&cmd->list));
+
+	scsi_mq_free_sgtables(cmd);
+	scsi_uninit_cmd(cmd);
+
+	spin_lock_irqsave(&sdev->list_lock, flags);
+	list_del_init(&cmd->list);
+	spin_unlock_irqrestore(&sdev->list_lock, flags);
+}
+
 /*
  * Function:    scsi_release_buffers()
  *
@@ -602,19 +676,19 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
 static void scsi_release_buffers(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		scsi_free_sgtable(&cmd->sdb);
+		scsi_free_sgtable(&cmd->sdb, false);
 
 	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
 
 	if (scsi_prot_sg_count(cmd))
-		scsi_free_sgtable(cmd->prot_sdb);
+		scsi_free_sgtable(cmd->prot_sdb, false);
 }
 
 static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
 {
 	struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;
 
-	scsi_free_sgtable(bidi_sdb);
+	scsi_free_sgtable(bidi_sdb, false);
 	kmem_cache_free(scsi_sdb_cache, bidi_sdb);
 	cmd->request->next_rq->special = NULL;
 }
@@ -625,8 +699,6 @@ static bool scsi_end_request(struct request *req, int error,
 	struct scsi_cmnd *cmd = req->special;
 	struct scsi_device *sdev = cmd->device;
 	struct request_queue *q = sdev->request_queue;
-	unsigned long flags;
-
 
 	if (blk_update_request(req, error, bytes))
 		return true;
@@ -639,14 +711,38 @@ static bool scsi_end_request(struct request *req, int error,
 	if (blk_queue_add_random(q))
 		add_disk_randomness(req->rq_disk);
 
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_finish_request(req, error);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (req->mq_ctx) {
+		/*
+		 * In the MQ case the command gets freed by __blk_mq_end_io,
+		 * so we have to do all cleanup that depends on it earlier.
+		 *
+		 * We also can't kick the queues from irq context, so we
+		 * will have to defer it to a workqueue.
+		 */
+		scsi_mq_uninit_cmd(cmd);
+
+		__blk_mq_end_io(req, error);
+
+		if (scsi_target(sdev)->single_lun ||
+		    !list_empty(&sdev->host->starved_list))
+			kblockd_schedule_work(&sdev->requeue_work);
+		else
+			blk_mq_start_stopped_hw_queues(q, true);
+
+		put_device(&sdev->sdev_gendev);
+	} else {
+		unsigned long flags;
+
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_finish_request(req, error);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (bidi_bytes)
+			scsi_release_bidi_buffers(cmd);
+		scsi_release_buffers(cmd);
+		scsi_next_command(cmd);
+	}
 
-	if (bidi_bytes)
-		scsi_release_bidi_buffers(cmd);
-	scsi_release_buffers(cmd);
-	scsi_next_command(cmd);
 	return false;
 }
 
@@ -953,8 +1049,14 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		/* Unprep the request and put it back at the head of the queue.
 		 * A new command will be prepared and issued.
 		 */
-		scsi_release_buffers(cmd);
-		scsi_requeue_command(q, cmd);
+		if (q->mq_ops) {
+			cmd->request->cmd_flags &= ~REQ_DONTPREP;
+			scsi_mq_uninit_cmd(cmd);
+			scsi_mq_requeue_cmd(cmd);
+		} else {
+			scsi_release_buffers(cmd);
+			scsi_requeue_command(q, cmd);
+		}
 		break;
 	case ACTION_RETRY:
 		/* Retry the same command immediately */
@@ -976,9 +1078,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
 	 * If sg table allocation fails, requeue request later.
 	 */
 	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
-					gfp_mask))) {
+					gfp_mask, req->mq_ctx != NULL)))
 		return BLKPREP_DEFER;
-	}
 
 	/* 
 	 * Next, walk the list, and fill in the addresses and sizes of
@@ -1006,6 +1107,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 {
 	struct scsi_device *sdev = cmd->device;
 	struct request *rq = cmd->request;
+	bool is_mq = (rq->mq_ctx != NULL);
 	int error;
 
 	BUG_ON(!rq->nr_phys_segments);
@@ -1015,15 +1117,19 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		goto err_exit;
 
 	if (blk_bidi_rq(rq)) {
-		struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
-			scsi_sdb_cache, GFP_ATOMIC);
-		if (!bidi_sdb) {
-			error = BLKPREP_DEFER;
-			goto err_exit;
+		if (!rq->q->mq_ops) {
+			struct scsi_data_buffer *bidi_sdb =
+				kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
+			if (!bidi_sdb) {
+				error = BLKPREP_DEFER;
+				goto err_exit;
+			}
+
+			rq->next_rq->special = bidi_sdb;
 		}
 
-		rq->next_rq->special = bidi_sdb;
-		error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
+		error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
+					  GFP_ATOMIC);
 		if (error)
 			goto err_exit;
 	}
@@ -1035,7 +1141,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		BUG_ON(prot_sdb == NULL);
 		ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
 
-		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
+		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
 			error = BLKPREP_DEFER;
 			goto err_exit;
 		}
@@ -1049,13 +1155,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		cmd->prot_sdb->table.nents = count;
 	}
 
-	return BLKPREP_OK ;
-
+	return BLKPREP_OK;
 err_exit:
-	scsi_release_buffers(cmd);
-	cmd->request->special = NULL;
-	scsi_put_command(cmd);
-	put_device(&sdev->sdev_gendev);
+	if (is_mq) {
+		scsi_mq_free_sgtables(cmd);
+	} else {
+		scsi_release_buffers(cmd);
+		cmd->request->special = NULL;
+		scsi_put_command(cmd);
+		put_device(&sdev->sdev_gendev);
+	}
 	return error;
 }
 EXPORT_SYMBOL(scsi_init_io);
@@ -1266,13 +1375,7 @@ out:
 
 static void scsi_unprep_fn(struct request_queue *q, struct request *req)
 {
-	if (req->cmd_type == REQ_TYPE_FS) {
-		struct scsi_cmnd *cmd = req->special;
-		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
-
-		if (drv->uninit_command)
-			drv->uninit_command(cmd);
-	}
+	scsi_uninit_cmd(req->special);
 }
 
 /*
@@ -1295,7 +1398,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 		 * unblock after device_blocked iterates to zero
 		 */
 		if (atomic_dec_return(&sdev->device_blocked) > 0) {
-			blk_delay_queue(q, SCSI_QUEUE_DELAY);
+			/*
+			 * For the MQ case we take care of this in the caller.
+			 */
+			if (!q->mq_ops)
+				blk_delay_queue(q, SCSI_QUEUE_DELAY);
 			goto out_dec;
 		}
 		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
@@ -1671,6 +1778,180 @@ out_delay:
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 
+static inline int prep_to_mq(int ret)
+{
+	switch (ret) {
+	case BLKPREP_OK:
+		return 0;
+	case BLKPREP_DEFER:
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	default:
+		return BLK_MQ_RQ_QUEUE_ERROR;
+	}
+}
+
+static int scsi_mq_prep_fn(struct request *req)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	struct scsi_device *sdev = req->q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned char *sense_buf = cmd->sense_buffer;
+	struct scatterlist *sg;
+
+	memset(cmd, 0, sizeof(struct scsi_cmnd));
+
+	req->special = cmd;
+
+	cmd->request = req;
+	cmd->device = sdev;
+	cmd->sense_buffer = sense_buf;
+
+	cmd->tag = req->tag;
+
+	req->cmd = req->__cmd;
+	cmd->cmnd = req->cmd;
+	cmd->prot_op = SCSI_PROT_NORMAL;
+
+	INIT_LIST_HEAD(&cmd->list);
+	INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
+	cmd->jiffies_at_alloc = jiffies;
+
+	/*
+	 * XXX: cmd_list lookups are only used by two drivers, try to get
+	 * rid of this list in common code.
+	 */
+	spin_lock_irq(&sdev->list_lock);
+	list_add_tail(&cmd->list, &sdev->cmd_list);
+	spin_unlock_irq(&sdev->list_lock);
+
+	sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
+	cmd->sdb.table.sgl = sg;
+
+	if (scsi_host_get_prot(shost)) {
+		cmd->prot_sdb = (void *)sg +
+			shost->sg_tablesize * sizeof(struct scatterlist);
+		memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
+
+		cmd->prot_sdb->table.sgl =
+			(struct scatterlist *)(cmd->prot_sdb + 1);
+	}
+
+	if (blk_bidi_rq(req)) {
+		struct request *next_rq = req->next_rq;
+		struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
+
+		memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
+		bidi_sdb->table.sgl =
+			(struct scatterlist *)(bidi_sdb + 1);
+
+		next_rq->special = bidi_sdb;
+	}
+
+	return scsi_setup_cmnd(sdev, req);
+}
+
+static void scsi_mq_done(struct scsi_cmnd *cmd)
+{
+	trace_scsi_dispatch_cmd_done(cmd);
+	blk_mq_complete_request(cmd->request);
+}
+
+static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+{
+	struct request_queue *q = req->q;
+	struct scsi_device *sdev = q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	int ret;
+	int reason;
+
+	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
+	if (ret)
+		goto out;
+
+	ret = BLK_MQ_RQ_QUEUE_BUSY;
+	if (!get_device(&sdev->sdev_gendev))
+		goto out;
+
+	if (!scsi_dev_queue_ready(q, sdev))
+		goto out_put_device;
+	if (!scsi_target_queue_ready(shost, sdev))
+		goto out_dec_device_busy;
+	if (!scsi_host_queue_ready(q, shost, sdev))
+		goto out_dec_target_busy;
+
+	if (!(req->cmd_flags & REQ_DONTPREP)) {
+		ret = prep_to_mq(scsi_mq_prep_fn(req));
+		if (ret)
+			goto out_dec_host_busy;
+		req->cmd_flags |= REQ_DONTPREP;
+	}
+
+	scsi_init_cmd_errh(cmd);
+	cmd->scsi_done = scsi_mq_done;
+
+	reason = scsi_dispatch_cmd(cmd);
+	if (reason) {
+		scsi_set_blocked(cmd, reason);
+		ret = BLK_MQ_RQ_QUEUE_BUSY;
+		goto out_dec_host_busy;
+	}
+
+	return BLK_MQ_RQ_QUEUE_OK;
+
+out_dec_host_busy:
+	atomic_dec(&shost->host_busy);
+out_dec_target_busy:
+	if (scsi_target(sdev)->can_queue > 0)
+		atomic_dec(&scsi_target(sdev)->target_busy);
+out_dec_device_busy:
+	atomic_dec(&sdev->device_busy);
+out_put_device:
+	put_device(&sdev->sdev_gendev);
+out:
+	switch (ret) {
+	case BLK_MQ_RQ_QUEUE_BUSY:
+		blk_mq_stop_hw_queue(hctx);
+		if (atomic_read(&sdev->device_busy) == 0 &&
+		    !scsi_device_blocked(sdev))
+			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+		break;
+	case BLK_MQ_RQ_QUEUE_ERROR:
+		/*
+		 * Make sure to release all allocated resources when
+		 * we hit an error, as we will never see this command
+		 * again.
+		 */
+		if (req->cmd_flags & REQ_DONTPREP)
+			scsi_mq_uninit_cmd(cmd);
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+static int scsi_init_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx,
+		unsigned int numa_node)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
+			numa_node);
+	if (!cmd->sense_buffer)
+		return -ENOMEM;
+	return 0;
+}
+
+static void scsi_exit_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	kfree(cmd->sense_buffer);
+}
+
 static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 {
 	struct device *host_dev;
@@ -1692,16 +1973,10 @@ static u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 	return bounce_limit;
 }
 
-struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
-					 request_fn_proc *request_fn)
+static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
-	struct request_queue *q;
 	struct device *dev = shost->dma_dev;
 
-	q = blk_init_queue(request_fn, NULL);
-	if (!q)
-		return NULL;
-
 	/*
 	 * this limit is imposed by hardware restrictions
 	 */
@@ -1732,7 +2007,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	 * blk_queue_update_dma_alignment() later.
 	 */
 	blk_queue_dma_alignment(q, 0x03);
+}
 
+struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
+					 request_fn_proc *request_fn)
+{
+	struct request_queue *q;
+
+	q = blk_init_queue(request_fn, NULL);
+	if (!q)
+		return NULL;
+	__scsi_init_queue(shost, q);
 	return q;
 }
 EXPORT_SYMBOL(__scsi_alloc_queue);
@@ -1753,6 +2038,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 	return q;
 }
 
+static struct blk_mq_ops scsi_mq_ops = {
+	.map_queue	= blk_mq_map_queue,
+	.queue_rq	= scsi_queue_rq,
+	.complete	= scsi_softirq_done,
+	.timeout	= scsi_times_out,
+	.init_request	= scsi_init_request,
+	.exit_request	= scsi_exit_request,
+};
+
+struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
+{
+	sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
+	if (IS_ERR(sdev->request_queue))
+		return NULL;
+
+	sdev->request_queue->queuedata = sdev;
+	__scsi_init_queue(sdev->host, sdev->request_queue);
+	return sdev->request_queue;
+}
+
+int scsi_mq_setup_tags(struct Scsi_Host *shost)
+{
+	unsigned int cmd_size, sgl_size, tbl_size;
+
+	tbl_size = shost->sg_tablesize;
+	if (tbl_size > SCSI_MAX_SG_SEGMENTS)
+		tbl_size = SCSI_MAX_SG_SEGMENTS;
+	sgl_size = tbl_size * sizeof(struct scatterlist);
+	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
+	if (scsi_host_get_prot(shost))
+		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+
+	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
+	shost->tag_set.ops = &scsi_mq_ops;
+	shost->tag_set.nr_hw_queues = 1;
+	shost->tag_set.queue_depth = shost->can_queue;
+	shost->tag_set.cmd_size = cmd_size;
+	shost->tag_set.numa_node = NUMA_NO_NODE;
+	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+	shost->tag_set.driver_data = shost;
+
+	return blk_mq_alloc_tag_set(&shost->tag_set);
+}
+
+void scsi_mq_destroy_tags(struct Scsi_Host *shost)
+{
+	blk_mq_free_tag_set(&shost->tag_set);
+}
+
 /*
  * Function:    scsi_block_requests()
  *
@@ -2498,9 +2832,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
 	 * block layer from calling the midlayer with this device's
 	 * request queue. 
 	 */
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_stop_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (q->mq_ops) {
+		blk_mq_stop_hw_queues(q);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_stop_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 
 	return 0;
 }
@@ -2546,9 +2884,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
 		 sdev->sdev_state != SDEV_OFFLINE)
 		return -EINVAL;
 
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_start_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (q->mq_ops) {
+		blk_mq_start_stopped_hw_queues(q, false);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_start_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 
 	return 0;
 }
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index a45d1c2..12b8e1b 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
 extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
 extern void scsi_run_host_queues(struct Scsi_Host *shost);
 extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
+extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
+extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
+extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
 extern int scsi_init_queue(void);
 extern void scsi_exit_queue(void);
 struct request_queue;
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 4a6e4ba..b91cfaf 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -273,7 +273,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	 */
 	sdev->borken = 1;
 
-	sdev->request_queue = scsi_alloc_queue(sdev);
+	if (shost_use_blk_mq(shost))
+		sdev->request_queue = scsi_mq_alloc_queue(sdev);
+	else
+		sdev->request_queue = scsi_alloc_queue(sdev);
 	if (!sdev->request_queue) {
 		/* release fn is set up in scsi_sysfs_device_initialise, so
 		 * have to free and put manually here */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index deef063..6c9227f 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
 
 static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
 
+shost_rd_attr(use_blk_mq, "%d\n");
 shost_rd_attr(unique_id, "%u\n");
 shost_rd_attr(cmd_per_lun, "%hd\n");
 shost_rd_attr(can_queue, "%hd\n");
@@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
 static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
 
 static struct attribute *scsi_sysfs_shost_attrs[] = {
+	&dev_attr_use_blk_mq.attr,
 	&dev_attr_unique_id.attr,
 	&dev_attr_host_busy.attr,
 	&dev_attr_cmd_per_lun.attr,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 5e8ebc1..ba20347 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -7,6 +7,7 @@
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
 #include <linux/seq_file.h>
+#include <linux/blk-mq.h>
 #include <scsi/scsi.h>
 
 struct request_queue;
@@ -510,6 +511,9 @@ struct scsi_host_template {
 	 */
 	unsigned int cmd_size;
 	struct scsi_host_cmd_pool *cmd_pool;
+
+	/* temporary flag to disable blk-mq I/O path */
+	bool disable_blk_mq;
 };
 
 /*
@@ -580,7 +584,10 @@ struct Scsi_Host {
 	 * Area to keep a shared tag map (if needed, will be
 	 * NULL if not).
 	 */
-	struct blk_queue_tag	*bqt;
+	union {
+		struct blk_queue_tag	*bqt;
+		struct blk_mq_tag_set	tag_set;
+	};
 
 	atomic_t host_busy;		   /* commands actually active on low-level */
 	atomic_t host_blocked;
@@ -672,6 +679,8 @@ struct Scsi_Host {
 	/* The controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	unsigned use_blk_mq:1;
+
 	/*
 	 * Optional work queue to be utilized by the transport
 	 */
@@ -772,6 +781,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
 		shost->tmf_in_progress;
 }
 
+extern bool scsi_use_blk_mq;
+
+static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
+{
+	return shost->use_blk_mq;
+}
+
 extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 81dd12e..cdcc90b 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
 	if (!sdev->tagged_supported)
 		return;
 
-	if (!blk_queue_tagged(sdev->request_queue))
+	if (!shost_use_blk_mq(sdev->host) &&
+	    !blk_queue_tagged(sdev->request_queue))
 		blk_queue_init_tags(sdev->request_queue, depth,
 				    sdev->host->bqt);
 
@@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
  **/
 static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
 {
-	if (blk_queue_tagged(sdev->request_queue))
+	if (!shost_use_blk_mq(sdev->host) &&
+	    blk_queue_tagged(sdev->request_queue))
 		blk_queue_free_tags(sdev->request_queue);
 	scsi_adjust_queue_depth(sdev, 0, depth);
 }
@@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
 	return 0;
 }
 
+static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
+		unsigned int hw_ctx, int tag)
+{
+	struct request *req;
+
+	req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
+	return req ? (struct scsi_cmnd *)req->special : NULL;
+}
+
 /**
  * scsi_find_tag - find a tagged command by device
  * @SDpnt:	pointer to the ScSI device
@@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
  **/
 static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 {
-
         struct request *req;
 
         if (tag != SCSI_NO_TAG) {
+		if (shost_use_blk_mq(sdev->host))
+			return scsi_mq_find_tag(sdev->host, 0, tag);
+
         	req = blk_queue_find_tag(sdev->request_queue, tag);
 	        return req ? (struct scsi_cmnd *)req->special : NULL;
 	}
@@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 	return sdev->current_cmnd;
 }
 
+
 /**
  * scsi_init_shared_tag_map - create a shared tag map
  * @shost:	the host to share the tag map among all devices
@@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
 {
 	/*
+	 * We always have a shared tag map around when using blk-mq.
+	 */
+	if (shost_use_blk_mq(shost))
+		return 0;
+
+	/*
 	 * If the shared tag map isn't already initialized, do it now.
 	 * This saves callers from having to check ->bqt when setting up
 	 * devices on the shared host (for libata)
@@ -165,6 +185,8 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
 	struct request *req;
 
 	if (tag != SCSI_NO_TAG) {
+		if (shost_use_blk_mq(shost))
+			return scsi_mq_find_tag(shost, 0, tag);
 		req = blk_map_queue_find_tag(shost->bqt, tag);
 		return req ? (struct scsi_cmnd *)req->special : NULL;
 	}
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH 13/14] scsi: add support for a blk-mq based I/O path.
  2014-06-12 13:48 scsi-mq Christoph Hellwig
@ 2014-06-12 13:49 ` Christoph Hellwig
  0 siblings, 0 replies; 99+ messages in thread
From: Christoph Hellwig @ 2014-06-12 13:49 UTC (permalink / raw)
  To: James Bottomley
  Cc: Jens Axboe, Bart Van Assche, Robert Elliott, linux-scsi, linux-kernel

This patch adds support for an alternate I/O path in the scsi midlayer
which uses the blk-mq infrastructure instead of the legacy request code.

Use of blk-mq is fully transparent to drivers, although for now a host
template field is provided to opt out of blk-mq usage in case any unforeseen
incompatibilities arise.
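
As a rough illustration (hypothetical driver code, not part of this
patch), opting out just means setting the new flag in the host
template:

	static struct scsi_host_template example_template = {
		.module		= THIS_MODULE,
		.name		= "example",
		.queuecommand	= example_queuecommand,
		.can_queue	= 64,
		.this_id	= -1,
		/* temporarily stay on the legacy request path */
		.disable_blk_mq	= true,
	};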

In general replacing the legacy request code with blk-mq is a simple and
mostly mechanical transformation.  The biggest exception is the new code
that deals with the fact that I/O submissions in blk-mq must happen from
process context, which slightly complicates the I/O completion handler.
The second biggest difference is that blk-mq is built around the concept
of preallocated requests that also include driver-specific data, which
in SCSI context means the scsi_cmnd structure.  This completely avoids
dynamic memory allocations for the fast path through I/O submission.
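
Sketched as a comment (with the sizes as computed by scsi_mq_setup_tags()
in this patch), each tag's preallocated request looks roughly like:

	/*
	 * One contiguous allocation per tag, handed out by blk-mq:
	 *
	 *   struct request            - block layer bookkeeping
	 *   struct scsi_cmnd          - midlayer state (blk_mq_rq_to_pdu())
	 *   hostt->cmd_size bytes     - LLD per-command data
	 *   scatterlist table         - up to SCSI_MAX_SG_SEGMENTS entries
	 *   scsi_data_buffer + table  - only if the host supports protection
	 */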

Due to the preallocated requests the MQ code path exclusively uses the
host-wide shared tag allocator instead of a per-LUN one.  This only
affects drivers actually using the block-layer-provided tag allocator
instead of their own.  Unlike the old path, blk-mq always provides a tag,
although drivers don't have to use it.
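
For example (a hypothetical LLD completion path, not from this patch),
a hardware-reported tag can be mapped back to its command with the now
blk-mq aware scsi_host_find_tag():

	static void example_complete_one(struct Scsi_Host *shost, int hw_tag)
	{
		/* goes through blk_mq_tag_to_rq() on the blk-mq path */
		struct scsi_cmnd *cmd = scsi_host_find_tag(shost, hw_tag);

		if (cmd)
			cmd->scsi_done(cmd);
	}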

For now the blk-mq path is disabled by default and must be enabled using
the "use_blk_mq" module parameter.  Once the remaining work in the block
layer to make blk-mq more suitable for slow devices is complete, I hope
to make it the default and eventually even remove the old code path.
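
Usage sketch (the sysfs path assumes the standard scsi_host class
layout):

	# enable the blk-mq path when loading the core module
	modprobe scsi_mod use_blk_mq=1

	# or equivalently on the kernel command line
	scsi_mod.use_blk_mq=1

	# the per-host result is exported read-only through sysfs
	cat /sys/class/scsi_host/host0/use_blk_mq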

Based on the earlier scsi-mq prototype by Nicholas Bellinger.

Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking and
various suggestions and code contributions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/hosts.c      |   30 ++-
 drivers/scsi/scsi.c       |    5 +-
 drivers/scsi/scsi_lib.c   |  460 +++++++++++++++++++++++++++++++++++++++------
 drivers/scsi/scsi_priv.h  |    3 +
 drivers/scsi/scsi_scan.c  |    5 +-
 drivers/scsi/scsi_sysfs.c |    2 +
 include/scsi/scsi_host.h  |   18 +-
 include/scsi/scsi_tcq.h   |   28 ++-
 8 files changed, 481 insertions(+), 70 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index 3cbb57a..0dd6874 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -213,9 +213,24 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 		goto fail;
 	}
 
+	if (shost_use_blk_mq(shost)) {
+		error = scsi_mq_setup_tags(shost);
+		if (error)
+			goto fail;
+	}
+
+	/*
+	 * Note that we allocate the freelist even for the MQ case for now,
+	 * as we need a command set aside for scsi_reset_provider.  Having
+	 * the full host freelist and one command available for that is a
+	 * little heavy-handed, but avoids introducing a special allocator
+	 * just for this.  Eventually the structure of scsi_reset_provider
+	 * will need a major overhaul.
+	 */
 	error = scsi_setup_command_freelist(shost);
 	if (error)
-		goto fail;
+		goto out_destroy_tags;
+
 
 	if (!shost->shost_gendev.parent)
 		shost->shost_gendev.parent = dev ? dev : &platform_bus;
@@ -226,7 +241,7 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	error = device_add(&shost->shost_gendev);
 	if (error)
-		goto out;
+		goto out_destroy_freelist;
 
 	pm_runtime_set_active(&shost->shost_gendev);
 	pm_runtime_enable(&shost->shost_gendev);
@@ -279,8 +294,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 	device_del(&shost->shost_dev);
  out_del_gendev:
 	device_del(&shost->shost_gendev);
- out:
+ out_destroy_freelist:
 	scsi_destroy_command_freelist(shost);
+ out_destroy_tags:
+	if (shost_use_blk_mq(shost))
+		scsi_mq_destroy_tags(shost);
  fail:
 	return error;
 }
@@ -309,7 +327,9 @@ static void scsi_host_dev_release(struct device *dev)
 	}
 
 	scsi_destroy_command_freelist(shost);
-	if (shost->bqt)
+	if (shost_use_blk_mq(shost) && shost->tag_set.tags)
+		scsi_mq_destroy_tags(shost);
+	else if (shost->bqt)
 		blk_free_tags(shost->bqt);
 
 	kfree(shost->shost_data);
@@ -436,6 +456,8 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	else
 		shost->dma_boundary = 0xffffffff;
 
+	shost->use_blk_mq = scsi_use_blk_mq && !shost->hostt->disable_blk_mq;
+
 	device_initialize(&shost->shost_gendev);
 	dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
 	shost->shost_gendev.bus = &scsi_bus_type;
diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index e30509a..cc55b74 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -810,7 +810,7 @@ void scsi_adjust_queue_depth(struct scsi_device *sdev, int tagged, int tags)
 	 * is more IO than the LLD's can_queue (so there are not enuogh
 	 * tags) request_fn's host queue ready check will handle it.
 	 */
-	if (!sdev->host->bqt) {
+	if (!shost_use_blk_mq(sdev->host) && !sdev->host->bqt) {
 		if (blk_queue_tagged(sdev->request_queue) &&
 		    blk_queue_resize_tags(sdev->request_queue, tags) != 0)
 			goto out;
@@ -1364,6 +1364,9 @@ MODULE_LICENSE("GPL");
 module_param(scsi_logging_level, int, S_IRUGO|S_IWUSR);
 MODULE_PARM_DESC(scsi_logging_level, "a bit mask of logging levels");
 
+bool scsi_use_blk_mq = false;
+module_param_named(use_blk_mq, scsi_use_blk_mq, bool, S_IWUSR | S_IRUGO);
+
 static int __init init_scsi(void)
 {
 	int error;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 32fbae4..aecc12e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -20,6 +20,7 @@
 #include <linux/delay.h>
 #include <linux/hardirq.h>
 #include <linux/scatterlist.h>
+#include <linux/blk-mq.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -113,6 +114,16 @@ scsi_set_blocked(struct scsi_cmnd *cmd, int reason)
 	}
 }
 
+static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
+{
+	struct scsi_device *sdev = cmd->device;
+	struct request_queue *q = cmd->request->q;
+
+	blk_mq_requeue_request(cmd->request);
+	blk_mq_kick_requeue_list(q);
+	put_device(&sdev->sdev_gendev);
+}
+
 /**
  * __scsi_queue_insert - private queue insertion
  * @cmd: The SCSI command being requeued
@@ -150,6 +161,10 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
 	 * before blk_cleanup_queue() finishes.
 	 */
 	cmd->result = 0;
+	if (q->mq_ops) {
+		scsi_mq_requeue_cmd(cmd);
+		return;
+	}
 	spin_lock_irqsave(q->queue_lock, flags);
 	blk_requeue_request(q, cmd->request);
 	kblockd_schedule_work(&device->requeue_work);
@@ -308,6 +323,14 @@ void scsi_device_unbusy(struct scsi_device *sdev)
 	atomic_dec(&sdev->device_busy);
 }
 
+static void scsi_kick_queue(struct request_queue *q)
+{
+	if (q->mq_ops)
+		blk_mq_start_hw_queues(q);
+	else
+		blk_run_queue(q);
+}
+
 /*
  * Called for single_lun devices on IO completion. Clear starget_sdev_user,
  * and call blk_run_queue for all the scsi_devices on the target -
@@ -332,7 +355,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 	 * but in most cases, we will be first. Ideally, each LU on the
 	 * target would get some limited time or requests on the target.
 	 */
-	blk_run_queue(current_sdev->request_queue);
+	scsi_kick_queue(current_sdev->request_queue);
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	if (starget->starget_sdev_user)
@@ -345,7 +368,7 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 			continue;
 
 		spin_unlock_irqrestore(shost->host_lock, flags);
-		blk_run_queue(sdev->request_queue);
+		scsi_kick_queue(sdev->request_queue);
 		spin_lock_irqsave(shost->host_lock, flags);
 	
 		scsi_device_put(sdev);
@@ -438,7 +461,7 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
 			continue;
 		spin_unlock_irqrestore(shost->host_lock, flags);
 
-		blk_run_queue(slq);
+		scsi_kick_queue(slq);
 		blk_put_queue(slq);
 
 		spin_lock_irqsave(shost->host_lock, flags);
@@ -469,7 +492,10 @@ static void scsi_run_queue(struct request_queue *q)
 	if (!list_empty(&sdev->host->starved_list))
 		scsi_starved_list_run(sdev->host);
 
-	blk_run_queue(q);
+	if (q->mq_ops)
+		blk_mq_start_stopped_hw_queues(q, false);
+	else
+		blk_run_queue(q);
 }
 
 void scsi_requeue_run_queue(struct work_struct *work)
@@ -567,25 +593,57 @@ static struct scatterlist *scsi_sg_alloc(unsigned int nents, gfp_t gfp_mask)
 	return mempool_alloc(sgp->pool, gfp_mask);
 }
 
-static void scsi_free_sgtable(struct scsi_data_buffer *sdb)
+static void scsi_free_sgtable(struct scsi_data_buffer *sdb, bool mq)
 {
-	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, false, scsi_sg_free);
+	if (mq && sdb->table.nents <= SCSI_MAX_SG_SEGMENTS)
+		return;
+	__sg_free_table(&sdb->table, SCSI_MAX_SG_SEGMENTS, mq, scsi_sg_free);
 }
 
 static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
-			      gfp_t gfp_mask)
+			      gfp_t gfp_mask, bool mq)
 {
+	struct scatterlist *first_chunk = NULL;
 	int ret;
 
 	BUG_ON(!nents);
 
+	if (mq) {
+		if (nents <= SCSI_MAX_SG_SEGMENTS) {
+			sdb->table.nents = nents;
+			sg_init_table(sdb->table.sgl, sdb->table.nents);
+			return 0;
+		}
+		first_chunk = sdb->table.sgl;
+	}
+
 	ret = __sg_alloc_table(&sdb->table, nents, SCSI_MAX_SG_SEGMENTS,
-			       NULL, gfp_mask, scsi_sg_alloc);
+			       first_chunk, gfp_mask, scsi_sg_alloc);
 	if (unlikely(ret))
-		scsi_free_sgtable(sdb);
+		scsi_free_sgtable(sdb, mq);
 	return ret;
 }
 
+static void scsi_mq_free_sgtables(struct scsi_cmnd *cmd)
+{
+	if (cmd->sdb.table.nents)
+		scsi_free_sgtable(&cmd->sdb, true);
+	if (cmd->request->next_rq && cmd->request->next_rq->special)
+		scsi_free_sgtable(cmd->request->next_rq->special, true);
+	if (scsi_prot_sg_count(cmd))
+		scsi_free_sgtable(cmd->prot_sdb, true);
+}
+
+static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
+{
+	if (cmd->request->cmd_type == REQ_TYPE_FS) {
+		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
+
+		if (drv->uninit_command)
+			drv->uninit_command(cmd);
+	}
+}
+
 /*
  * Function:    scsi_release_buffers()
  *
@@ -605,12 +663,12 @@ static int scsi_alloc_sgtable(struct scsi_data_buffer *sdb, int nents,
 void scsi_release_buffers(struct scsi_cmnd *cmd)
 {
 	if (cmd->sdb.table.nents)
-		scsi_free_sgtable(&cmd->sdb);
+		scsi_free_sgtable(&cmd->sdb, false);
 
 	memset(&cmd->sdb, 0, sizeof(cmd->sdb));
 
 	if (scsi_prot_sg_count(cmd))
-		scsi_free_sgtable(cmd->prot_sdb);
+		scsi_free_sgtable(cmd->prot_sdb, false);
 }
 EXPORT_SYMBOL(scsi_release_buffers);
 
@@ -618,7 +676,7 @@ static void scsi_release_bidi_buffers(struct scsi_cmnd *cmd)
 {
 	struct scsi_data_buffer *bidi_sdb = cmd->request->next_rq->special;
 
-	scsi_free_sgtable(bidi_sdb);
+	scsi_free_sgtable(bidi_sdb, false);
 	kmem_cache_free(scsi_sdb_cache, bidi_sdb);
 	cmd->request->next_rq->special = NULL;
 }
@@ -631,7 +689,6 @@ static bool scsi_end_request(struct request *req, int error,
 	struct request_queue *q = sdev->request_queue;
 	unsigned long flags;
 
-
 	if (blk_update_request(req, error, bytes))
 		return true;
 
@@ -643,14 +700,38 @@ static bool scsi_end_request(struct request *req, int error,
 	if (blk_queue_add_random(q))
 		add_disk_randomness(req->rq_disk);
 
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_finish_request(req, error);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (req->mq_ctx) {
+		/*
+		 * In the MQ case the command gets freed by __blk_mq_end_io,
+		 * so we have to do all cleanup that depends on it earlier.
+		 *
+		 * We also can't kick the queues from irq context, so we
+		 * will have to defer it to a workqueue.
+		 */
+		cancel_delayed_work(&cmd->abort_work);
+		scsi_mq_free_sgtables(cmd);
+		scsi_uninit_cmd(cmd);
+
+		spin_lock_irqsave(&sdev->list_lock, flags);
+		BUG_ON(list_empty(&cmd->list));
+		list_del_init(&cmd->list);
+		spin_unlock_irqrestore(&sdev->list_lock, flags);
+
+		__blk_mq_end_io(req, error);
+
+		kblockd_schedule_work(&sdev->requeue_work);
+		put_device(&sdev->sdev_gendev);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_finish_request(req, error);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+
+		if (bidi_bytes)
+			scsi_release_bidi_buffers(cmd);
+		scsi_release_buffers(cmd);
+		scsi_next_command(cmd);
+	}
 
-	if (bidi_bytes)
-		scsi_release_bidi_buffers(cmd);
-	scsi_release_buffers(cmd);
-	scsi_next_command(cmd);
 	return false;
 }
 
@@ -981,8 +1062,16 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		/* Unprep the request and put it back at the head of the queue.
 		 * A new command will be prepared and issued.
 		 */
-		scsi_release_buffers(cmd);
-		scsi_requeue_command(q, cmd);
+		if (q->mq_ops) {
+			cancel_delayed_work(&cmd->abort_work);
+			cmd->request->cmd_flags &= ~REQ_DONTPREP;
+			scsi_mq_free_sgtables(cmd);
+			scsi_uninit_cmd(cmd);
+			scsi_mq_requeue_cmd(cmd);
+		} else {
+			scsi_release_buffers(cmd);
+			scsi_requeue_command(q, cmd);
+		}
 		break;
 	case ACTION_RETRY:
 		/* Retry the same command immediately */
@@ -1004,9 +1093,8 @@ static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb,
 	 * If sg table allocation fails, requeue request later.
 	 */
 	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
-					gfp_mask))) {
+					gfp_mask, req->mq_ctx != NULL)))
 		return BLKPREP_DEFER;
-	}
 
 	/* 
 	 * Next, walk the list, and fill in the addresses and sizes of
@@ -1034,21 +1122,27 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 {
 	struct scsi_device *sdev = cmd->device;
 	struct request *rq = cmd->request;
+	bool is_mq = (rq->mq_ctx != NULL);
+	int error;
 
-	int error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
+	error = scsi_init_sgtable(rq, &cmd->sdb, gfp_mask);
 	if (error)
 		goto err_exit;
 
 	if (blk_bidi_rq(rq)) {
-		struct scsi_data_buffer *bidi_sdb = kmem_cache_zalloc(
-			scsi_sdb_cache, GFP_ATOMIC);
-		if (!bidi_sdb) {
-			error = BLKPREP_DEFER;
-			goto err_exit;
+		if (!rq->q->mq_ops) {
+			struct scsi_data_buffer *bidi_sdb =
+				kmem_cache_zalloc(scsi_sdb_cache, GFP_ATOMIC);
+			if (!bidi_sdb) {
+				error = BLKPREP_DEFER;
+				goto err_exit;
+			}
+
+			rq->next_rq->special = bidi_sdb;
 		}
 
-		rq->next_rq->special = bidi_sdb;
-		error = scsi_init_sgtable(rq->next_rq, bidi_sdb, GFP_ATOMIC);
+		error = scsi_init_sgtable(rq->next_rq, rq->next_rq->special,
+					  GFP_ATOMIC);
 		if (error)
 			goto err_exit;
 	}
@@ -1060,7 +1154,7 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		BUG_ON(prot_sdb == NULL);
 		ivecs = blk_rq_count_integrity_sg(rq->q, rq->bio);
 
-		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask)) {
+		if (scsi_alloc_sgtable(prot_sdb, ivecs, gfp_mask, is_mq)) {
 			error = BLKPREP_DEFER;
 			goto err_exit;
 		}
@@ -1074,13 +1168,16 @@ int scsi_init_io(struct scsi_cmnd *cmd, gfp_t gfp_mask)
 		cmd->prot_sdb->table.nents = count;
 	}
 
-	return BLKPREP_OK ;
-
+	return BLKPREP_OK;
 err_exit:
-	scsi_release_buffers(cmd);
-	cmd->request->special = NULL;
-	scsi_put_command(cmd);
-	put_device(&sdev->sdev_gendev);
+	if (is_mq) {
+		scsi_mq_free_sgtables(cmd);
+	} else {
+		scsi_release_buffers(cmd);
+		cmd->request->special = NULL;
+		scsi_put_command(cmd);
+		put_device(&sdev->sdev_gendev);
+	}
 	return error;
 }
 EXPORT_SYMBOL(scsi_init_io);
@@ -1295,13 +1392,7 @@ out:
 
 static void scsi_unprep_fn(struct request_queue *q, struct request *req)
 {
-	if (req->cmd_type == REQ_TYPE_FS) {
-		struct scsi_cmnd *cmd = req->special;
-		struct scsi_driver *drv = scsi_cmd_to_driver(cmd);
-
-		if (drv->uninit_command)
-			drv->uninit_command(cmd);
-	}
+	scsi_uninit_cmd(req->special);
 }
 
 /*
@@ -1318,7 +1409,11 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 	busy = atomic_inc_return(&sdev->device_busy) - 1;
 	if (busy == 0 && atomic_read(&sdev->device_blocked) > 0) {
 		if (atomic_dec_return(&sdev->device_blocked) > 0) {
-			blk_delay_queue(q, SCSI_QUEUE_DELAY);
+			/*
+			 * For the MQ case we take care of this in the caller.
+			 */
+			if (!q->mq_ops)
+				blk_delay_queue(q, SCSI_QUEUE_DELAY);
 			goto out_dec;
 		}
 		SCSI_LOG_MLQUEUE(3, sdev_printk(KERN_INFO, sdev,
@@ -1688,6 +1783,190 @@ out_delay:
 		blk_delay_queue(q, SCSI_QUEUE_DELAY);
 }
 
+static inline int prep_to_mq(int ret)
+{
+	switch (ret) {
+	case BLKPREP_OK:
+		return 0;
+	case BLKPREP_DEFER:
+		return BLK_MQ_RQ_QUEUE_BUSY;
+	default:
+		return BLK_MQ_RQ_QUEUE_ERROR;
+	}
+}
+
+static int scsi_mq_prep_fn(struct request *req)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	struct scsi_device *sdev = req->q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	unsigned char *sense_buf = cmd->sense_buffer;
+	struct scatterlist *sg;
+
+	memset(cmd, 0, sizeof(struct scsi_cmnd));
+
+	req->special = cmd;
+
+	cmd->request = req;
+	cmd->device = sdev;
+	cmd->sense_buffer = sense_buf;
+
+	cmd->tag = req->tag;
+
+	req->cmd = req->__cmd;
+	cmd->cmnd = req->cmd;
+	cmd->prot_op = SCSI_PROT_NORMAL;
+
+	INIT_LIST_HEAD(&cmd->list);
+	INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
+	cmd->jiffies_at_alloc = jiffies;
+
+	/*
+	 * XXX: cmd_list lookups are only used by two drivers, try to get
+	 * rid of this list in common code.
+	 */
+	spin_lock_irq(&sdev->list_lock);
+	list_add_tail(&cmd->list, &sdev->cmd_list);
+	spin_unlock_irq(&sdev->list_lock);
+
+	sg = (void *)cmd + sizeof(struct scsi_cmnd) + shost->hostt->cmd_size;
+	cmd->sdb.table.sgl = sg;
+
+	if (scsi_host_get_prot(shost)) {
+		cmd->prot_sdb = (void *)sg +
+			shost->sg_tablesize * sizeof(struct scatterlist);
+		memset(cmd->prot_sdb, 0, sizeof(struct scsi_data_buffer));
+
+		cmd->prot_sdb->table.sgl =
+			(struct scatterlist *)(cmd->prot_sdb + 1);
+	}
+
+	if (blk_bidi_rq(req)) {
+		struct request *next_rq = req->next_rq;
+		struct scsi_data_buffer *bidi_sdb = blk_mq_rq_to_pdu(next_rq);
+
+		memset(bidi_sdb, 0, sizeof(struct scsi_data_buffer));
+		bidi_sdb->table.sgl =
+			(struct scatterlist *)(bidi_sdb + 1);
+
+		next_rq->special = bidi_sdb;
+	}
+
+	switch (req->cmd_type) {
+	case REQ_TYPE_FS:
+		return scsi_cmd_to_driver(cmd)->init_command(cmd);
+	case REQ_TYPE_BLOCK_PC:
+		return scsi_setup_blk_pc_cmnd(cmd->device, req);
+	default:
+		return BLKPREP_KILL;
+	}
+}
+
+static void scsi_mq_done(struct scsi_cmnd *cmd)
+{
+	trace_scsi_dispatch_cmd_done(cmd);
+	blk_mq_complete_request(cmd->request);
+}
+
+static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
+{
+	struct request_queue *q = req->q;
+	struct scsi_device *sdev = q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
+	int ret;
+	int reason;
+
+	ret = prep_to_mq(scsi_prep_state_check(sdev, req));
+	if (ret)
+		goto out;
+
+	ret = BLK_MQ_RQ_QUEUE_BUSY;
+	if (!get_device(&sdev->sdev_gendev))
+		goto out;
+
+	if (!scsi_dev_queue_ready(q, sdev))
+		goto out_put_device;
+	if (!scsi_target_queue_ready(shost, sdev))
+		goto out_dec_device_busy;
+	if (!scsi_host_queue_ready(q, shost, sdev))
+		goto out_dec_target_busy;
+
+	if (!(req->cmd_flags & REQ_DONTPREP)) {
+		ret = prep_to_mq(scsi_mq_prep_fn(req));
+		if (ret)
+			goto out_dec_host_busy;
+		req->cmd_flags |= REQ_DONTPREP;
+	}
+
+	scsi_init_cmd_errh(cmd);
+	cmd->scsi_done = scsi_mq_done;
+
+	reason = scsi_dispatch_cmd(cmd);
+	if (reason) {
+		scsi_set_blocked(cmd, reason);
+		ret = BLK_MQ_RQ_QUEUE_BUSY;
+		goto out_dec_host_busy;
+	}
+
+	return BLK_MQ_RQ_QUEUE_OK;
+
+out_dec_host_busy:
+	cancel_delayed_work(&cmd->abort_work);
+	atomic_dec(&shost->host_busy);
+out_dec_target_busy:
+	if (scsi_target(sdev)->can_queue > 0)
+		atomic_dec(&scsi_target(sdev)->target_busy);
+out_dec_device_busy:
+	atomic_dec(&sdev->device_busy);
+out_put_device:
+	put_device(&sdev->sdev_gendev);
+out:
+	switch (ret) {
+	case BLK_MQ_RQ_QUEUE_BUSY:
+		blk_mq_stop_hw_queue(hctx);
+		if (atomic_read(&sdev->device_busy) == 0 &&
+		    !scsi_device_blocked(sdev))
+			blk_mq_delay_queue(hctx, SCSI_QUEUE_DELAY);
+		break;
+	case BLK_MQ_RQ_QUEUE_ERROR:
+		/*
+		 * Make sure to release all allocated resources when
+		 * we hit an error, as we will never see this command
+		 * again.
+		 */
+		if (req->cmd_flags & REQ_DONTPREP) {
+			scsi_mq_free_sgtables(cmd);
+			scsi_uninit_cmd(cmd);
+		}
+		break;
+	default:
+		break;
+	}
+	return ret;
+}
+
+static int scsi_init_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx,
+		unsigned int numa_node)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	cmd->sense_buffer = kzalloc_node(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL,
+			numa_node);
+	if (!cmd->sense_buffer)
+		return -ENOMEM;
+	return 0;
+}
+
+static void scsi_exit_request(void *data, struct request *rq,
+		unsigned int hctx_idx, unsigned int request_idx)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	kfree(cmd->sense_buffer);
+}
+
 u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 {
 	struct device *host_dev;
@@ -1710,16 +1989,10 @@ u64 scsi_calculate_bounce_limit(struct Scsi_Host *shost)
 }
 EXPORT_SYMBOL(scsi_calculate_bounce_limit);
 
-struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
-					 request_fn_proc *request_fn)
+static void __scsi_init_queue(struct Scsi_Host *shost, struct request_queue *q)
 {
-	struct request_queue *q;
 	struct device *dev = shost->dma_dev;
 
-	q = blk_init_queue(request_fn, NULL);
-	if (!q)
-		return NULL;
-
 	/*
 	 * this limit is imposed by hardware restrictions
 	 */
@@ -1750,7 +2023,17 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
 	 * blk_queue_update_dma_alignment() later.
 	 */
 	blk_queue_dma_alignment(q, 0x03);
+}
+
+struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
+					 request_fn_proc *request_fn)
+{
+	struct request_queue *q;
 
+	q = blk_init_queue(request_fn, NULL);
+	if (!q)
+		return NULL;
+	__scsi_init_queue(shost, q);
 	return q;
 }
 EXPORT_SYMBOL(__scsi_alloc_queue);
@@ -1771,6 +2054,55 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 	return q;
 }
 
+static struct blk_mq_ops scsi_mq_ops = {
+	.map_queue	= blk_mq_map_queue,
+	.queue_rq	= scsi_queue_rq,
+	.complete	= scsi_softirq_done,
+	.timeout	= scsi_times_out,
+	.init_request	= scsi_init_request,
+	.exit_request	= scsi_exit_request,
+};
+
+struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
+{
+	sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
+	if (IS_ERR(sdev->request_queue))
+		return NULL;
+
+	sdev->request_queue->queuedata = sdev;
+	__scsi_init_queue(sdev->host, sdev->request_queue);
+	return sdev->request_queue;
+}
+
+int scsi_mq_setup_tags(struct Scsi_Host *shost)
+{
+	unsigned int cmd_size, sgl_size, tbl_size;
+
+	tbl_size = shost->sg_tablesize;
+	if (tbl_size > SCSI_MAX_SG_SEGMENTS)
+		tbl_size = SCSI_MAX_SG_SEGMENTS;
+	sgl_size = tbl_size * sizeof(struct scatterlist);
+	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
+	if (scsi_host_get_prot(shost))
+		cmd_size += sizeof(struct scsi_data_buffer) + sgl_size;
+
+	memset(&shost->tag_set, 0, sizeof(shost->tag_set));
+	shost->tag_set.ops = &scsi_mq_ops;
+	shost->tag_set.nr_hw_queues = 1;
+	shost->tag_set.queue_depth = shost->can_queue;
+	shost->tag_set.cmd_size = cmd_size;
+	shost->tag_set.numa_node = NUMA_NO_NODE;
+	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	shost->tag_set.driver_data = shost;
+
+	return blk_mq_alloc_tag_set(&shost->tag_set);
+}
+
+void scsi_mq_destroy_tags(struct Scsi_Host *shost)
+{
+	blk_mq_free_tag_set(&shost->tag_set);
+}
+
 /*
  * Function:    scsi_block_requests()
  *
@@ -2516,9 +2848,13 @@ scsi_internal_device_block(struct scsi_device *sdev)
 	 * block layer from calling the midlayer with this device's
 	 * request queue. 
 	 */
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_stop_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (q->mq_ops) {
+		blk_mq_stop_hw_queues(q);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_stop_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 
 	return 0;
 }
@@ -2564,9 +2900,13 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
 		 sdev->sdev_state != SDEV_OFFLINE)
 		return -EINVAL;
 
-	spin_lock_irqsave(q->queue_lock, flags);
-	blk_start_queue(q);
-	spin_unlock_irqrestore(q->queue_lock, flags);
+	if (q->mq_ops) {
+		blk_mq_start_stopped_hw_queues(q, false);
+	} else {
+		spin_lock_irqsave(q->queue_lock, flags);
+		blk_start_queue(q);
+		spin_unlock_irqrestore(q->queue_lock, flags);
+	}
 
 	return 0;
 }
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 48e5b65..5d8353f 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -88,6 +88,9 @@ extern void scsi_next_command(struct scsi_cmnd *cmd);
 extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
 extern void scsi_run_host_queues(struct Scsi_Host *shost);
 extern struct request_queue *scsi_alloc_queue(struct scsi_device *sdev);
+extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
+extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
+extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
 extern int scsi_init_queue(void);
 extern void scsi_exit_queue(void);
 struct request_queue;
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index e02b3aa..e6ce3a1 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -277,7 +277,10 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	 */
 	sdev->borken = 1;
 
-	sdev->request_queue = scsi_alloc_queue(sdev);
+	if (shost_use_blk_mq(shost))
+		sdev->request_queue = scsi_mq_alloc_queue(sdev);
+	else
+		sdev->request_queue = scsi_alloc_queue(sdev);
 	if (!sdev->request_queue) {
 		/* release fn is set up in scsi_sysfs_device_initialise, so
 		 * have to free and put manually here */
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 9efa2b8..81f50b1 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -333,6 +333,7 @@ store_shost_eh_deadline(struct device *dev, struct device_attribute *attr,
 
 static DEVICE_ATTR(eh_deadline, S_IRUGO | S_IWUSR, show_shost_eh_deadline, store_shost_eh_deadline);
 
+shost_rd_attr(use_blk_mq, "%d\n");
 shost_rd_attr(unique_id, "%u\n");
 shost_rd_attr(cmd_per_lun, "%hd\n");
 shost_rd_attr(can_queue, "%hd\n");
@@ -352,6 +353,7 @@ show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
 static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);
 
 static struct attribute *scsi_sysfs_shost_attrs[] = {
+	&dev_attr_use_blk_mq.attr,
 	&dev_attr_unique_id.attr,
 	&dev_attr_host_busy.attr,
 	&dev_attr_cmd_per_lun.attr,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index c4e4875..f48f9ce 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -7,6 +7,7 @@
 #include <linux/workqueue.h>
 #include <linux/mutex.h>
 #include <linux/seq_file.h>
+#include <linux/blk-mq.h>
 #include <scsi/scsi.h>
 
 struct request_queue;
@@ -531,6 +532,9 @@ struct scsi_host_template {
 	 */
 	unsigned int cmd_size;
 	struct scsi_host_cmd_pool *cmd_pool;
+
+	/* temporary flag to disable blk-mq I/O path */
+	bool disable_blk_mq;
 };
 
 /*
@@ -601,7 +605,10 @@ struct Scsi_Host {
 	 * Area to keep a shared tag map (if needed, will be
 	 * NULL if not).
 	 */
-	struct blk_queue_tag	*bqt;
+	union {
+		struct blk_queue_tag	*bqt;
+		struct blk_mq_tag_set	tag_set;
+	};
 
 	atomic_t host_busy;		   /* commands actually active on low-level */
 	atomic_t host_blocked;
@@ -693,6 +700,8 @@ struct Scsi_Host {
 	/* The controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	unsigned use_blk_mq:1;
+
 	/*
 	 * Optional work queue to be utilized by the transport
 	 */
@@ -793,6 +802,13 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
 		shost->tmf_in_progress;
 }
 
+extern bool scsi_use_blk_mq;
+
+static inline bool shost_use_blk_mq(struct Scsi_Host *shost)
+{
+	return shost->use_blk_mq;
+}
+
 extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index 81dd12e..cdcc90b 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -67,7 +67,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
 	if (!sdev->tagged_supported)
 		return;
 
-	if (!blk_queue_tagged(sdev->request_queue))
+	if (!shost_use_blk_mq(sdev->host) &&
+	    !blk_queue_tagged(sdev->request_queue))
 		blk_queue_init_tags(sdev->request_queue, depth,
 				    sdev->host->bqt);
 
@@ -80,7 +81,8 @@ static inline void scsi_activate_tcq(struct scsi_device *sdev, int depth)
  **/
 static inline void scsi_deactivate_tcq(struct scsi_device *sdev, int depth)
 {
-	if (blk_queue_tagged(sdev->request_queue))
+	if (!shost_use_blk_mq(sdev->host) &&
+	    blk_queue_tagged(sdev->request_queue))
 		blk_queue_free_tags(sdev->request_queue);
 	scsi_adjust_queue_depth(sdev, 0, depth);
 }
@@ -108,6 +110,15 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
 	return 0;
 }
 
+static inline struct scsi_cmnd *scsi_mq_find_tag(struct Scsi_Host *shost,
+		unsigned int hw_ctx, int tag)
+{
+	struct request *req;
+
+	req = blk_mq_tag_to_rq(shost->tag_set.tags[hw_ctx], tag);
+	return req ? (struct scsi_cmnd *)req->special : NULL;
+}
+
 /**
  * scsi_find_tag - find a tagged command by device
  * @SDpnt:	pointer to the ScSI device
@@ -118,10 +129,12 @@ static inline int scsi_populate_tag_msg(struct scsi_cmnd *cmd, char *msg)
  **/
 static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 {
-
         struct request *req;
 
         if (tag != SCSI_NO_TAG) {
+		if (shost_use_blk_mq(sdev->host))
+			return scsi_mq_find_tag(sdev->host, 0, tag);
+
         	req = blk_queue_find_tag(sdev->request_queue, tag);
 	        return req ? (struct scsi_cmnd *)req->special : NULL;
 	}
@@ -130,6 +143,7 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 	return sdev->current_cmnd;
 }
 
+
 /**
  * scsi_init_shared_tag_map - create a shared tag map
  * @shost:	the host to share the tag map among all devices
@@ -138,6 +152,12 @@ static inline struct scsi_cmnd *scsi_find_tag(struct scsi_device *sdev, int tag)
 static inline int scsi_init_shared_tag_map(struct Scsi_Host *shost, int depth)
 {
 	/*
+	 * We always have a shared tag map around when using blk-mq.
+	 */
+	if (shost_use_blk_mq(shost))
+		return 0;
+
+	/*
 	 * If the shared tag map isn't already initialized, do it now.
 	 * This saves callers from having to check ->bqt when setting up
 	 * devices on the shared host (for libata)
@@ -165,6 +185,8 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
 	struct request *req;
 
 	if (tag != SCSI_NO_TAG) {
+		if (shost_use_blk_mq(shost))
+			return scsi_mq_find_tag(shost, 0, tag);
 		req = blk_map_queue_find_tag(shost->bqt, tag);
 		return req ? (struct scsi_cmnd *)req->special : NULL;
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 99+ messages in thread

end of thread

Thread overview: 99+ messages
2014-06-25 16:51 scsi-mq V2 Christoph Hellwig
2014-06-25 16:51 ` [PATCH 01/14] sd: don't use rq->cmd_len before setting it up Christoph Hellwig
2014-07-09 11:12   ` Hannes Reinecke
2014-07-09 11:12     ` Hannes Reinecke
2014-07-09 15:03     ` Christoph Hellwig
2014-06-25 16:51 ` [PATCH 02/14] scsi: split __scsi_queue_insert Christoph Hellwig
2014-07-09 11:12   ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 03/14] scsi: centralize command re-queueing in scsi_dispatch_fn Christoph Hellwig
2014-07-08 20:51   ` Elliott, Robert (Server Storage)
2014-07-09  6:40     ` Christoph Hellwig
2014-07-09 11:13   ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 04/14] scsi: set ->scsi_done before calling scsi_dispatch_cmd Christoph Hellwig
2014-07-09 11:14   ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 05/14] scsi: push host_lock down into scsi_{host,target}_queue_ready Christoph Hellwig
2014-07-09 11:14   ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 06/14] scsi: convert target_busy to an atomic_t Christoph Hellwig
2014-07-09 11:15   ` Hannes Reinecke
2014-07-09 11:15     ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 07/14] scsi: convert host_busy to atomic_t Christoph Hellwig
2014-07-09 11:15   ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 08/14] scsi: convert device_busy " Christoph Hellwig
2014-07-09 11:16   ` Hannes Reinecke
2014-07-09 16:49   ` James Bottomley
2014-07-10  6:01     ` Christoph Hellwig
2014-06-25 16:51 ` [PATCH 09/14] scsi: fix the {host,target,device}_blocked counter mess Christoph Hellwig
2014-07-09 11:12   ` Hannes Reinecke
2014-07-10  6:06     ` Christoph Hellwig
2014-06-25 16:51 ` [PATCH 10/14] scsi: only maintain target_blocked if the driver has a target queue limit Christoph Hellwig
2014-07-09 11:19   ` Hannes Reinecke
2014-07-09 15:05     ` Christoph Hellwig
2014-06-25 16:51 ` [PATCH 11/14] scsi: unwind blk_end_request_all and blk_end_request_err calls Christoph Hellwig
2014-07-09 11:20   ` Hannes Reinecke
2014-06-25 16:51 ` [PATCH 12/14] scatterlist: allow chaining to preallocated chunks Christoph Hellwig
2014-07-09 11:21   ` Hannes Reinecke
2014-06-25 16:52 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
2014-07-09 11:25   ` Hannes Reinecke
2014-07-16 11:13   ` Mike Christie
2014-07-16 11:16     ` Christoph Hellwig
2014-06-25 16:52 ` [PATCH 14/14] fnic: reject device resets without assigned tags for the blk-mq case Christoph Hellwig
2014-07-09 11:27   ` Hannes Reinecke
2014-06-26  4:50 ` scsi-mq V2 Jens Axboe
2014-06-26 22:07   ` Elliott, Robert (Server Storage)
2014-06-27 14:42     ` Bart Van Assche
2014-06-30 15:20   ` Jens Axboe
2014-06-30 15:25     ` Christoph Hellwig
2014-06-30 15:54       ` Martin K. Petersen
2014-07-08 14:48 ` Christoph Hellwig
2014-07-09 16:39   ` Douglas Gilbert
2014-07-09 19:38     ` Jens Axboe
2014-07-10  0:53       ` Elliott, Robert (Server Storage)
2014-07-10  6:20         ` Christoph Hellwig
2014-07-10 13:36           ` Benjamin LaHaise
2014-07-10 13:39             ` Jens Axboe
2014-07-10 13:44               ` Benjamin LaHaise
2014-07-10 13:48                 ` Jens Axboe
2014-07-10 13:50                   ` Benjamin LaHaise
2014-07-10 13:52                     ` Jens Axboe
2014-07-10 13:50             ` Christoph Hellwig
2014-07-10 13:52               ` Jens Axboe
2014-07-10 14:36                 ` Elliott, Robert (Server Storage)
2014-07-10 14:45                   ` Benjamin LaHaise
2014-07-10 15:11                     ` Jeff Moyer
2014-07-10 15:11                       ` Jeff Moyer
2014-07-10 19:59                       ` Jens Axboe
2014-07-10 19:59                         ` Jens Axboe
2014-07-10 20:05                         ` Jeff Moyer
2014-07-10 20:05                           ` Jeff Moyer
2014-07-10 20:06                           ` Jens Axboe
2014-07-10 20:06                             ` Jens Axboe
2014-07-10 15:51           ` Elliott, Robert (Server Storage)
2014-07-10 16:04             ` Christoph Hellwig
2014-07-10 16:14               ` Christoph Hellwig
2014-07-10 18:49                 ` Elliott, Robert (Server Storage)
2014-07-10 19:14                   ` Jeff Moyer
2014-07-10 19:14                     ` Jeff Moyer
2014-07-10 19:36                     ` Jeff Moyer
2014-07-10 19:36                       ` Jeff Moyer
2014-07-10 21:10                     ` Elliott, Robert (Server Storage)
2014-07-11  6:02                       ` Elliott, Robert (Server Storage)
2014-07-11  6:14                         ` Christoph Hellwig
2014-07-11 14:33                           ` Elliott, Robert (Server Storage)
2014-07-11 14:55                             ` Benjamin LaHaise
2014-07-12 21:50                               ` Elliott, Robert (Server Storage)
2014-07-12 23:20                                 ` Elliott, Robert (Server Storage)
2014-07-13 17:15                                   ` Elliott, Robert (Server Storage)
2014-07-14 17:15                                     ` Benjamin LaHaise
2014-07-14  9:13   ` Sagi Grimberg
2014-08-21 12:32     ` Performance degradation in IO writes vs. reads (was scsi-mq V2) Sagi Grimberg
2014-08-21 12:32       ` Sagi Grimberg
2014-08-21 13:03       ` Christoph Hellwig
2014-08-21 14:02         ` Sagi Grimberg
2014-08-24 16:41           ` Sagi Grimberg
  -- strict thread matches above, loose matches on Subject: below --
2014-07-18 10:12 scsi-mq V4 Christoph Hellwig
2014-07-18 10:13 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
2014-07-25 19:29   ` Martin K. Petersen
2014-08-18 22:21   ` Kashyap Desai
2014-08-19 15:41     ` Kashyap Desai
2014-08-19 16:06     ` Christoph Hellwig
2014-08-19 16:11       ` Kashyap Desai
2014-06-12 13:48 scsi-mq Christoph Hellwig
2014-06-12 13:49 ` [PATCH 13/14] scsi: add support for a blk-mq based I/O path Christoph Hellwig
