linux-block.vger.kernel.org archive mirror
* [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD
@ 2020-01-19  7:14 Ming Lei
  2020-01-19  7:14 ` [PATCH 1/6] scsi: mpt3sas: don't use .device_busy in device reset routine Ming Lei
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Hannes Reinecke, Bart Van Assche

Hi,

SCSI's per-LUN queue depth is usually used for improving IO merge and
for balancing IO load among LUNs. blk-mq already provides fair driver
tag allocation, and managed IRQ balances interrupt load among queues.
Meanwhile, IO merge doesn't play a big role for SSD, and NVMe doesn't
apply a per-namespace queue depth.

This patchset stops applying the per-LUN queue depth for SSDs behind
some high end HBAs, so that the expensive atomic operations on
sdev->device_busy can be avoided. This shared counter is known to hurt
IOPS a lot.
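
For reference, the cost in question is the shared atomic counter that
is touched on every dispatch and completion; a simplified sketch of
the current hot path (not the exact scsi_lib.c code):

	/* dispatch may run on any CPU; this cacheline bounces */
	busy = atomic_inc_return(&sdev->device_busy) - 1;
	if (busy >= sdev->queue_depth)
		goto out_dec;	/* per-LUN limit reached, back off */

	/* ... and on every completion: */
	atomic_dec(&sdev->device_busy);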

Thanks,
Ming

Ming Lei (6):
  scsi: mpt3sas: don't use .device_busy in device reset routine
  scsi: remove .for_blk_mq
  scsi: sd: register request queue after sd_revalidate_disk is done
  block: freeze queue for updating QUEUE_FLAG_NONROT
  scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  scsi: megaraid: set flag of no_device_queue_for_ssd

 block/blk-sysfs.c                         | 14 +++++++++-
 drivers/scsi/megaraid/megaraid_sas_base.c |  1 +
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      | 32 ++++++++++++++++++++-
 drivers/scsi/scsi_lib.c                   | 34 +++++++++++++++++++----
 drivers/scsi/sd.c                         |  7 ++++-
 drivers/scsi/virtio_scsi.c                |  1 -
 include/scsi/scsi_host.h                  |  4 +--
 7 files changed, 81 insertions(+), 12 deletions(-)

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
-- 
2.20.1



* [PATCH 1/6] scsi: mpt3sas: don't use .device_busy in device reset routine
  2020-01-19  7:14 [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD Ming Lei
@ 2020-01-19  7:14 ` Ming Lei
  2020-01-19 20:28   ` Bart Van Assche
  2020-01-19  7:14 ` [PATCH 2/6] scsi: remove .for_blk_mq Ming Lei
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Hannes Reinecke, Bart Van Assche

scsih_dev_reset() uses scsi_device->device_busy to check whether there
are in-flight commands aimed at this LUN.

Use the block layer helper blk_mq_tagset_busy_iter() to do that
instead, to prepare for bypassing .device_busy for SSD, since it is
quite expensive to inc/dec this global atomic counter in the IO path.

With this change, no driver uses scsi_device->device_busy any more.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 32 +++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index c597d544eb39..91766c172d8f 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -2994,6 +2994,36 @@ scsih_abort(struct scsi_cmnd *scmd)
 	return r;
 }
 
+struct device_busy {
+	struct scsi_device *dev;
+	unsigned int cnt;
+};
+
+static bool scsi_device_check_in_flight(struct request *rq, void *data,
+				      bool reserved)
+{
+	struct device_busy *busy = data;
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state) && cmd->device ==
+			busy->dev)
+		(busy->cnt)++;
+
+	return true;
+}
+
+static bool scsih_dev_busy(struct scsi_device *device)
+{
+	struct device_busy data = {
+		.dev =	device,
+		.cnt = 0,
+	};
+
+	blk_mq_tagset_busy_iter(&device->host->tag_set,
+				scsi_device_check_in_flight, &data);
+	return data.cnt > 0;
+}
+
 /**
  * scsih_dev_reset - eh threads main device reset routine
  * @scmd: pointer to scsi command object
@@ -3060,7 +3090,7 @@ scsih_dev_reset(struct scsi_cmnd *scmd)
 		MPI2_SCSITASKMGMT_TASKTYPE_LOGICAL_UNIT_RESET, 0, 0,
 		tr_timeout, tr_method);
 	/* Check for busy commands after reset */
-	if (r == SUCCESS && atomic_read(&scmd->device->device_busy))
+	if (r == SUCCESS && scsih_dev_busy(scmd->device))
 		r = FAILED;
  out:
 	sdev_printk(KERN_INFO, scmd->device, "device reset: %s scmd(0x%p)\n",
-- 
2.20.1



* [PATCH 2/6] scsi: remove .for_blk_mq
  2020-01-19  7:14 [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD Ming Lei
  2020-01-19  7:14 ` [PATCH 1/6] scsi: mpt3sas: don't use .device_busy in device reset routine Ming Lei
@ 2020-01-19  7:14 ` Ming Lei
  2020-01-19 20:29   ` Bart Van Assche
                     ` (3 more replies)
  2020-01-19  7:14 ` [PATCH 3/6] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

No one uses it any more, so remove the flag.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/virtio_scsi.c | 1 -
 include/scsi/scsi_host.h   | 3 ---
 2 files changed, 4 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index bfec84aacd90..0e0910c5b942 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -742,7 +742,6 @@ static struct scsi_host_template virtscsi_host_template = {
 	.dma_boundary = UINT_MAX,
 	.map_queues = virtscsi_map_queues,
 	.track_queue_depth = 1,
-	.force_blk_mq = 1,
 };
 
 #define virtscsi_config_get(vdev, fld) \
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index f577647bf5f2..7a97fb8104cf 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -426,9 +426,6 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
-	/* True if the low-level driver supports blk-mq only */
-	unsigned force_blk_mq:1;
-
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
-- 
2.20.1



* [PATCH 3/6] scsi: sd: register request queue after sd_revalidate_disk is done
  2020-01-19  7:14 [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD Ming Lei
  2020-01-19  7:14 ` [PATCH 1/6] scsi: mpt3sas: don't use .device_busy in device reset routine Ming Lei
  2020-01-19  7:14 ` [PATCH 2/6] scsi: remove .for_blk_mq Ming Lei
@ 2020-01-19  7:14 ` Ming Lei
  2020-01-19 20:36   ` Bart Van Assche
  2020-01-19  7:14 ` [PATCH 4/6] block: freeze queue for updating QUEUE_FLAG_NONROT Ming Lei
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

Prepare for improving SSD performance in the following patch, which
needs to read the QUEUE_FLAG_NONROT queue flag in the IO path, so the
queue has to be frozen before changing this flag in
sd_revalidate_disk().

However, queue freezing becomes quite slow once the queue is
registered, because an RCU grace period is involved.

So delay registering the queue until after sd_revalidate_disk() is
done, to avoid the slow queue freezing that the following patch adds
to sd_revalidate_disk().
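
For reference: blk_register_queue() switches q->q_usage_counter (a
percpu_ref) to percpu mode, so after registration every freeze has to
switch it back to atomic mode, which is what involves the RCU grace
period. Roughly (details vary by kernel version):

	blk_mq_freeze_queue(q)
	  -> blk_freeze_queue_start(q)
	       -> percpu_ref_kill(&q->q_usage_counter)  /* RCU grace period */
	  -> blk_mq_freeze_queue_wait(q)                /* drain in-flight refs */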

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/sd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 5afb0046b12a..f401ba96dcfd 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3380,11 +3380,12 @@ static int sd_probe(struct device *dev)
 		pm_runtime_set_autosuspend_delay(dev,
 			sdp->host->hostt->rpm_autosuspend_delay);
 	}
-	device_add_disk(dev, gd, NULL);
+	device_add_disk_no_queue_reg(dev, gd);
 	if (sdkp->capacity)
 		sd_dif_config_host(sdkp);
 
 	sd_revalidate_disk(gd);
+	blk_register_queue(gd);
 
 	if (sdkp->security) {
 		sdkp->opal_dev = init_opal_dev(sdp, &sd_sec_submit);
-- 
2.20.1



* [PATCH 4/6] block: freeze queue for updating QUEUE_FLAG_NONROT
  2020-01-19  7:14 [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD Ming Lei
                   ` (2 preceding siblings ...)
  2020-01-19  7:14 ` [PATCH 3/6] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
@ 2020-01-19  7:14 ` Ming Lei
  2020-01-19 20:40   ` Bart Van Assche
  2020-01-19  7:14 ` [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs Ming Lei
  2020-01-19  7:14 ` [PATCH 6/6] scsi: megaraid: set flag of no_device_queue_for_ssd Ming Lei
  5 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

We will read the NONROT flag to decide whether to bypass
sdev->device_busy in the SCSI IO path, so the queue has to be frozen
when updating this flag.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c | 14 +++++++++++++-
 drivers/scsi/sd.c |  4 ++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..9cc0dd5f88a1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
+		const char *page, size_t count)
+{
+	size_t ret;
+
+	blk_mq_freeze_queue(q);
+	ret = queue_store_nonrot(q, page, count);
+	blk_mq_unfreeze_queue(q);
+
+	return ret;
+}
+
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
 {
 	switch (blk_queue_zoned_model(q)) {
@@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
-	.store = queue_store_nonrot,
+	.store = queue_store_nonrot_freeze,
 };
 
 static struct queue_sysfs_entry queue_zoned_entry = {
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index f401ba96dcfd..349c944b29a3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2935,7 +2935,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
 	rot = get_unaligned_be16(&buffer[4]);
 
 	if (rot == 1) {
+		blk_mq_freeze_queue(q);
 		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+		blk_mq_unfreeze_queue(q);
 		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
 	}
 
@@ -3131,7 +3133,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		 * cause this to be updated correctly and any device which
 		 * doesn't support it should be treated as rotational.
 		 */
+		blk_mq_freeze_queue(q);
 		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+		blk_mq_unfreeze_queue(q);
 		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
 
 		if (scsi_device_supports_vpd(sdp)) {
-- 
2.20.1



* [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-19  7:14 [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD Ming Lei
                   ` (3 preceding siblings ...)
  2020-01-19  7:14 ` [PATCH 4/6] block: freeze queue for updating QUEUE_FLAG_NONROT Ming Lei
@ 2020-01-19  7:14 ` Ming Lei
  2020-01-19 20:58   ` Bart Van Assche
  2020-01-21  4:52   ` Martin K. Petersen
  2020-01-19  7:14 ` [PATCH 6/6] scsi: megaraid: set flag of no_device_queue_for_ssd Ming Lei
  5 siblings, 2 replies; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

SCSI core uses the atomic variable sdev->device_busy to track
in-flight IO requests dispatched to a scsi device. IO requests may be
submitted from any CPU, so the cost of maintaining this shared atomic
counter can be very high on a big NUMA machine with many CPU cores.

sdev->queue_depth is usually used for two purposes: 1) improving IO
merge; 2) spreading IO requests fairly among all LUNs.

blk-mq already provides fair request allocation among all active shared
request queues (LUNs), see hctx_may_queue().
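
For reference, the fairness check in hctx_may_queue() looks roughly
like this (trimmed sketch; details vary by kernel version):

	static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
					  struct sbitmap_queue *bt)
	{
		unsigned int depth, users;

		if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
			return true;

		users = atomic_read(&hctx->tags->active_queues);
		if (!users)
			return true;

		/* give each active queue a fair share of the tag space */
		depth = max((bt->sb.depth + users - 1) / users, 4U);
		return atomic_read(&hctx->nr_active) < depth;
	}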

NVMe doesn't have such per-request-queue(namespace) queue depth, so it
is reasonable to ignore the limit for SCSI SSD too. Also, IO merge
doesn't play a big role in reaching top SSD performance.

With this patch, the big cost of tracking in-flight per-LUN requests
via an atomic variable can be saved for some high end controllers with
SSDs.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c  | 34 ++++++++++++++++++++++++++++------
 include/scsi/scsi_host.h |  3 +++
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..0903697dd843 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -344,6 +344,16 @@ static void scsi_dec_host_busy(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
 	rcu_read_unlock();
 }
 
+static inline bool scsi_bypass_device_busy(struct scsi_device *sdev)
+{
+	struct Scsi_Host *shost = sdev->host;
+
+	if (!shost->hostt->no_device_queue_for_ssd)
+		return false;
+
+	return blk_queue_nonrot(sdev->request_queue);
+}
+
 void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 {
 	struct Scsi_Host *shost = sdev->host;
@@ -354,7 +364,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!scsi_bypass_device_busy(sdev))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +421,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!scsi_bypass_device_busy(sdev) &&
+			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1295,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = scsi_bypass_device_busy(sdev);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1314,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1644,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!scsi_bypass_device_busy(sdev))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1706,7 +1727,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!scsi_bypass_device_busy(sdev) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7a97fb8104cf..8e80edfd5a6d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -426,6 +426,9 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/* True if the controller need not maintain a device queue for SSDs */
+	unsigned no_device_queue_for_ssd:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
-- 
2.20.1



* [PATCH 6/6] scsi: megaraid: set flag of no_device_queue_for_ssd
  2020-01-19  7:14 [PATCH 0/6] scsi: support bypass device busy check for some high end HBA with SSD Ming Lei
                   ` (4 preceding siblings ...)
  2020-01-19  7:14 ` [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs Ming Lei
@ 2020-01-19  7:14 ` Ming Lei
  5 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2020-01-19  7:14 UTC (permalink / raw)
  To: James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

megaraid sas is a high end HBA, and its IOPS may reach the million
level. As discussed before, megaraid sas performance on SSDs can be
improved when the device busy check is bypassed.

So set the no_device_queue_for_ssd flag for megaraid sas.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 43cbc749f66c..8f3770af7382 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -3424,6 +3424,7 @@ static struct scsi_host_template megasas_template = {
 	.bios_param = megasas_bios_param,
 	.change_queue_depth = scsi_change_queue_depth,
 	.max_segment_size = 0xffffffff,
+	.no_device_queue_for_ssd = 1,
 };
 
 /**
-- 
2.20.1



* Re: [PATCH 1/6] scsi: mpt3sas: don't use .device_busy in device reset routine
  2020-01-19  7:14 ` [PATCH 1/6] scsi: mpt3sas: don't use .device_busy in device reset routine Ming Lei
@ 2020-01-19 20:28   ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2020-01-19 20:28 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Hannes Reinecke, Bart Van Assche

On 2020-01-18 23:14, Ming Lei wrote:
> Cc: Bart Van Assche <bart.vanassche@wdc.com>

As one can see in the .mailmap file in the kernel tree I use my @acm.org
email address for kernel contributions. The above email address is no
longer valid since I left WDC more than a year ago. This is something I
have mentioned before. See also
https://lore.kernel.org/linux-scsi/55c68a97-4393-fa63-e2da-d41a48237f96@acm.org/

> +static bool scsi_device_check_in_flight(struct request *rq, void *data,
> +				      bool reserved)
> +{
> +	struct device_busy *busy = data;
> +	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
> +
> +	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state) && cmd->device ==
> +			busy->dev)
> +		(busy->cnt)++;
> +
> +	return true;
> +}

scsi_host_check_in_flight() is almost identical to the above function.
Has it been considered to merge these two functions into a single
function? Same comment for scsi_host_busy() and scsih_dev_busy().
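
One possible shape for such a shared helper, as a sketch (a NULL sdev
meaning "count the whole host"):

	struct busy_data {
		struct scsi_device *sdev;	/* NULL: count whole host */
		unsigned int cnt;
	};

	static bool scsi_check_in_flight(struct request *rq, void *data,
					 bool reserved)
	{
		struct busy_data *bd = data;
		struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

		if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state) &&
		    (!bd->sdev || cmd->device == bd->sdev))
			bd->cnt++;
		return true;
	}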

Thanks,

Bart.


* Re: [PATCH 2/6] scsi: remove .for_blk_mq
  2020-01-19  7:14 ` [PATCH 2/6] scsi: remove .for_blk_mq Ming Lei
@ 2020-01-19 20:29   ` Bart Van Assche
  2020-01-20 10:17   ` John Garry
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2020-01-19 20:29 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

On 2020-01-18 23:14, Ming Lei wrote:
> No one uses it any more, so remove the flag.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>


* Re: [PATCH 3/6] scsi: sd: register request queue after sd_revalidate_disk is done
  2020-01-19  7:14 ` [PATCH 3/6] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
@ 2020-01-19 20:36   ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2020-01-19 20:36 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

On 2020-01-18 23:14, Ming Lei wrote:
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 5afb0046b12a..f401ba96dcfd 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3380,11 +3380,12 @@ static int sd_probe(struct device *dev)
>  		pm_runtime_set_autosuspend_delay(dev,
>  			sdp->host->hostt->rpm_autosuspend_delay);
>  	}
> -	device_add_disk(dev, gd, NULL);
> +	device_add_disk_no_queue_reg(dev, gd);
>  	if (sdkp->capacity)
>  		sd_dif_config_host(sdkp);
>  
>  	sd_revalidate_disk(gd);
> +	blk_register_queue(gd);
>  
>  	if (sdkp->security) {
>  		sdkp->opal_dev = init_opal_dev(sdp, &sd_sec_submit);

The effect of changing device_add_disk() into
device_add_disk_no_queue_reg() is as follows:
- __device_add_disk() skips the elevator_init_mq() call.
- __device_add_disk() skips the blk_register_queue() call.

The above patch introduces a blk_register_queue() call but no
elevator_init_mq() call. Is that perhaps an oversight?
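
If it is, a sketch of the missing piece (assuming elevator_init_mq()
were made callable from sd, which it currently is not) would be:

	sd_revalidate_disk(gd);
	elevator_init_mq(gd->queue);	/* pick the default scheduler */
	blk_register_queue(gd);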

Thanks,

Bart.


* Re: [PATCH 4/6] block: freeze queue for updating QUEUE_FLAG_NONROT
  2020-01-19  7:14 ` [PATCH 4/6] block: freeze queue for updating QUEUE_FLAG_NONROT Ming Lei
@ 2020-01-19 20:40   ` Bart Van Assche
  0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2020-01-19 20:40 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

On 2020-01-18 23:14, Ming Lei wrote:
> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> +		const char *page, size_t count)
> +{
> +	size_t ret;
> +
> +	blk_mq_freeze_queue(q);
> +	ret = queue_store_nonrot(q, page, count);
> +	blk_mq_unfreeze_queue(q);
> +
> +	return ret;
> +}

Both queue_store_nonrot() and queue_store_nonrot_freeze() return a
signed integer but 'ret' is an unsigned integer. Is that perhaps the
result of an oversight?
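
If so, the minimal fix is presumably just a signed type, so that a
negative return value from queue_store_nonrot() is propagated:

	ssize_t ret;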

Thanks,

Bart.


* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-19  7:14 ` [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs Ming Lei
@ 2020-01-19 20:58   ` Bart Van Assche
  2020-01-21  4:52   ` Martin K. Petersen
  1 sibling, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2020-01-19 20:58 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

On 2020-01-18 23:14, Ming Lei wrote:
> +static inline bool scsi_bypass_device_busy(struct scsi_device *sdev)
> +{
> +	struct Scsi_Host *shost = sdev->host;
> +
> +	if (!shost->hostt->no_device_queue_for_ssd)
> +		return false;
> +
> +	return blk_queue_nonrot(sdev->request_queue);
> +}

In other words, sdev->device_busy is maintained for all SCSI devices
except for those SSDs controlled by a SCSI LLD driver that has
no_device_queue_for_ssd set in its host template. I'd like to see
different behavior, namely that sdev->device_busy is not maintained for
any SSD except if that SSD really needs the sdev->device_busy counter.
The blacklist mechanism may be more appropriate to mark such SSDs than
the SCSI host template.

What I also noticed is that most scsi_bypass_device_busy() calls have an
exclamation mark (!) in front of these calls. I think that inverting the
return value and renaming this function into e.g.
scsi_maintain_device_busy() would result in code that is easier to read.
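
With that inversion, the helper could look like this (a sketch):

	static inline bool scsi_maintain_device_busy(struct scsi_device *sdev)
	{
		return !sdev->host->hostt->no_device_queue_for_ssd ||
		       !blk_queue_nonrot(sdev->request_queue);
	}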

Thanks,

Bart.


* Re: [PATCH 2/6] scsi: remove .for_blk_mq
  2020-01-19  7:14 ` [PATCH 2/6] scsi: remove .for_blk_mq Ming Lei
  2020-01-19 20:29   ` Bart Van Assche
@ 2020-01-20 10:17   ` John Garry
  2020-01-20 22:12   ` Elliott, Robert (Servers)
  2020-01-31  6:30   ` Christoph Hellwig
  3 siblings, 0 replies; 25+ messages in thread
From: John Garry @ 2020-01-20 10:17 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche, Jason Yan

On 19/01/2020 07:14, Ming Lei wrote:
> No one uses it any more, so remove the flag.
> 
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

I think that we can also delete drivers/scsi/scsi.c:scsi_use_blk_mq.

IIRC, a patch was already sent for that but never picked up.

Thanks,
John

> ---
>   drivers/scsi/virtio_scsi.c | 1 -
>   include/scsi/scsi_host.h   | 3 ---
>   2 files changed, 4 deletions(-)
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index bfec84aacd90..0e0910c5b942 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -742,7 +742,6 @@ static struct scsi_host_template virtscsi_host_template = {
>   	.dma_boundary = UINT_MAX,
>   	.map_queues = virtscsi_map_queues,
>   	.track_queue_depth = 1,
> -	.force_blk_mq = 1,
>   };
>   
>   #define virtscsi_config_get(vdev, fld) \
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index f577647bf5f2..7a97fb8104cf 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -426,9 +426,6 @@ struct scsi_host_template {
>   	/* True if the controller does not support WRITE SAME */
>   	unsigned no_write_same:1;
>   
> -	/* True if the low-level driver supports blk-mq only */
> -	unsigned force_blk_mq:1;
> -
>   	/*
>   	 * Countdown for host blocking with no commands outstanding.
>   	 */
> 



* RE: [PATCH 2/6] scsi: remove .for_blk_mq
  2020-01-19  7:14 ` [PATCH 2/6] scsi: remove .for_blk_mq Ming Lei
  2020-01-19 20:29   ` Bart Van Assche
  2020-01-20 10:17   ` John Garry
@ 2020-01-20 22:12   ` Elliott, Robert (Servers)
  2020-01-31  6:30   ` Christoph Hellwig
  3 siblings, 0 replies; 25+ messages in thread
From: Elliott, Robert (Servers) @ 2020-01-20 22:12 UTC (permalink / raw)
  To: Ming Lei, James Bottomley, linux-scsi, Martin K . Petersen
  Cc: linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

> -	/* True if the low-level driver supports blk-mq only */
> -	unsigned force_blk_mq:1;

Minor nit - in the patch Subject, "for" should be "force"





* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-19  7:14 ` [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs Ming Lei
  2020-01-19 20:58   ` Bart Van Assche
@ 2020-01-21  4:52   ` Martin K. Petersen
  2020-01-23  2:54     ` Ming Lei
       [not found]     ` <ab676c4c-03fb-7eb9-6212-129eb83d0ee8@broadcom.com>
  1 sibling, 2 replies; 25+ messages in thread
From: Martin K. Petersen @ 2020-01-21  4:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche


Ming,

> NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> is reasonable to ignore the limit for SCSI SSD too.

It is really not. A free host tag does not guarantee that the target
device can queue the command.

The assumption that SSDs are somehow special because they are "fast" is
not valid. Given the common hardware queue depth for a SAS device of
~128 it is often trivial to drive a device into a congestion
scenario. We see it all the time for non-rotational devices, SSDs and
arrays alike. The SSD heuristic is simply not going to fly.

Don't get me wrong, I am very sympathetic to obliterating device_busy in
the hot path. I just don't think it is as easy as just ignoring the
counter and hope for the best. Dynamic queue depth management is an
integral part of the SCSI protocol, not something we can just decide to
bypass because a device claims to be of a certain media type or speed.

I would prefer not to touch drivers that rely on cmd_per_lun / untagged
operation and focus exclusively on the ones that use .track_queue_depth.
For those we could consider an adaptive queue depth management scheme.
Something like not maintaining device_busy until we actually get a
QUEUE_FULL condition. And then rely on the existing queue depth ramp up
heuristics to determine when to disable the busy counter again. Maybe
with an additional watermark or time limit to avoid flip-flopping.

If that approach turns out to work, we should convert all remaining
non-legacy drivers to .track_queue_depth so we only have two driver
queuing flavors to worry about.
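
In rough pseudocode, with a hypothetical per-device flag and a
hypothetical scsi_pending_from_tags() helper (a sketch of the idea,
not actual kernel code):

	/* on a QUEUE_FULL (TASK SET FULL) completion: */
	if (!sdev->track_device_busy) {
		/* seed the counter from the tag set, then maintain it */
		atomic_set(&sdev->device_busy,
			   scsi_pending_from_tags(sdev));
		sdev->track_device_busy = true;
	}
	scsi_track_queue_full(sdev, sdev->queue_depth - 1);

	/* once the existing ramp-up heuristics have restored the full
	 * queue depth and it has stayed there for a while, clear the
	 * flag and stop maintaining the counter again. */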

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-21  4:52   ` Martin K. Petersen
@ 2020-01-23  2:54     ` Ming Lei
  2020-01-24  1:21       ` Martin K. Petersen
       [not found]     ` <ab676c4c-03fb-7eb9-6212-129eb83d0ee8@broadcom.com>
  1 sibling, 1 reply; 25+ messages in thread
From: Ming Lei @ 2020-01-23  2:54 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, linux-scsi, linux-block, Jens Axboe,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Ewan D . Milne,
	Christoph Hellwig, Hannes Reinecke, Bart Van Assche

Hi Martin,

On Mon, Jan 20, 2020 at 11:52:06PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> > is reasonable to ignore the limit for SCSI SSD too.
> 
> It is really not. A free host tag does not guarantee that the target
> device can queue the command.

Yeah, I agree.

> 
> The assumption that SSDs are somehow special because they are "fast" is
> not valid. Given the common hardware queue depth for a SAS device of
> ~128 it is often trivial to drive a device into a congestion
> scenario. We see it all the time for non-rotational devices, SSDs and
> arrays alike. The SSD heuristic is simply not going to fly.

In reality, the number of channels in an SSD is very limited, usually
dozens in an enterprise grade SSD, so the device itself can be
saturated by a limited number of in-flight IOs.

However, it depends on whether the target device reports the
congestion to the host. From my observation, there doesn't seem to
be such feedback from NVMe targets.

If the SSD target device doesn't provide such congestion feedback,
tracking in-flight per-LUN IO via .device_busy doesn't make any
difference.

Even if there were an SSD target that provided such congestion
feedback, bypassing .device_busy wouldn't have a big effect either,
since blk-mq's SCHED_RESTART will retry an IO that returns
STS_RESOURCE only after another in-flight one is completed.

> 
> Don't get me wrong, I am very sympathetic to obliterating device_busy in
> the hot path. I just don't think it is as easy as just ignoring the
> counter and hope for the best. Dynamic queue depth management is an
> integral part of the SCSI protocol, not something we can just decide to
> bypass because a device claims to be of a certain media type or speed.

At least, the Broadcom guys tested this patch on megaraid RAID and
the results show a big improvement; that is why the flag is only set
on the megaraid host.

> 
> I would prefer not to touch drivers that rely on cmd_per_lun / untagged
> operation and focus exclusively on the ones that use .track_queue_depth.
> For those we could consider an adaptive queue depth management scheme.
> Something like not maintaining device_busy until we actually get a
> QUEUE_FULL condition. And then rely on the existing queue depth ramp up
> heuristics to determine when to disable the busy counter again. Maybe
> with an additional watermark or time limit to avoid flip-flopping.
> 
> If that approach turns out to work, we should convert all remaining
> non-legacy drivers to .track_queue_depth so we only have two driver
> queuing flavors to worry about.

In theory, .track_queue_depth may only improve sequential IO
performance for HDDs; it is not very effective for SSDs, or just
saves a bit of CPU cycles in the SSD case.

I will study the related drivers/devices a bit more wrt. track_queue_depth().


Thanks,
Ming



* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-23  2:54     ` Ming Lei
@ 2020-01-24  1:21       ` Martin K. Petersen
  2020-01-24  1:59         ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Martin K. Petersen @ 2020-01-24  1:21 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin K. Petersen, James Bottomley, linux-scsi, linux-block,
	Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche


Ming,

> However, it depends on whether the target device reports the
> congestion to the host. From my observation, there doesn't seem to
> be such feedback from NVMe targets.

It happens all the time with SCSI devices. It is imperative that this
keeps working.

> Even if there were an SSD target that provided such congestion
> feedback, bypassing .device_busy wouldn't have a big effect either,
> since blk-mq's SCHED_RESTART will retry an IO that returns
> STS_RESOURCE only after another in-flight one is completed.

The reason we back off is that it allows the device to recover by
temporarily reducing its workload. In addition, the lower queue depth
alleviates the risk of commands timing out leading to application I/O
failures.

> At least, the Broadcom guys tested this patch on megaraid RAID and
> the results show a big improvement; that is why the flag is only set
> on the megaraid host.

I do not question that it improves performance. That's not my point.

> In theory, .track_queue_depth may only improve sequential IO
> performance for HDDs; it is not very effective for SSDs, or just
> saves a bit of CPU cycles in the SSD case.

This is not about performance. This is about how the system behaves when
a device is starved for resources or experiencing transient failures.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-24  1:21       ` Martin K. Petersen
@ 2020-01-24  1:59         ` Ming Lei
  2020-01-24 12:43           ` Sumit Saxena
  0 siblings, 1 reply; 25+ messages in thread
From: Ming Lei @ 2020-01-24  1:59 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, linux-scsi, linux-block, Jens Axboe,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Ewan D . Milne,
	Christoph Hellwig, Hannes Reinecke, Bart Van Assche

Hi Martin,

On Thu, Jan 23, 2020 at 08:21:42PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > However, it depends on whether the target device reports the
> > congestion to the host. From my observation, there doesn't seem to
> > be such feedback from NVMe targets.
> 
> It happens all the time with SCSI devices. It is imperative that this
> keeps working.
> 
> > Even if there were an SSD target that provided such congestion
> > feedback, bypassing .device_busy wouldn't have a big effect either,
> > since blk-mq's SCHED_RESTART will retry an IO that returns
> > STS_RESOURCE only after another in-flight one is completed.
> 
> The reason we back off is that it allows the device to recover by
> temporarily reducing its workload. In addition, the lower queue depth
> alleviates the risk of commands timing out leading to application I/O
> failures.

The timeout risk may only arise when the driver/device doesn't return
congestion feedback while the host queue depth is big enough.

So far we don't see such an issue on NVMe, whose hw queue depth is
1023 and whose hw queue count is often 32+, and we haven't seen such
timeout reports even with that many in-flight IOs (32 * 1023) on a
single LUN.

Also megaraid sas's queue depth is much less than (32 * 1023), so it
seems very unlikely to happen.

Megaraid guys, could you clarify whether this is an issue? Kashyap,
Sumit and Shivasharan?

> 
> > At least, the Broadcom guys tested this patch on megaraid RAID
> > and the results show a big improvement; that is why the flag is
> > only set on the megaraid host.
> 
> I do not question that it improves performance. That's not my point.
> 
> > In theory, .track_queue_depth may only improve sequential IO
> > performance for HDDs; it is not very effective for SSDs, or just
> > saves a bit of CPU cycles in the SSD case.
> 
> This is not about performance. This is about how the system behaves when
> a device is starved for resources or experiencing transient failures.

Could you explain a bit how this patch changes the system behavior? I
understand the EH just retries the incomplete requests, whose total
number is less than the host queue depth.


Thanks,
Ming



* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-24  1:59         ` Ming Lei
@ 2020-01-24 12:43           ` Sumit Saxena
  2020-01-28  4:04             ` Martin K. Petersen
  0 siblings, 1 reply; 25+ messages in thread
From: Sumit Saxena @ 2020-01-24 12:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Martin K. Petersen, James Bottomley, Linux SCSI List,
	linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Shivasharan S,
	Ewan D . Milne, Christoph Hellwig, Hannes Reinecke,
	Bart Van Assche

On Fri, Jan 24, 2020 at 7:30 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Hi Martin,
>
> On Thu, Jan 23, 2020 at 08:21:42PM -0500, Martin K. Petersen wrote:
> >
> > Ming,
> >
> > > However, it depends on whether the target device reports the
> > > congestion to the host. From my observation, there doesn't seem
> > > to be such feedback from NVMe targets.
> >
> > It happens all the time with SCSI devices. It is imperative that this
> > keeps working.
> >
> > > Even if there were an SSD target that provided such congestion
> > > feedback, bypassing .device_busy wouldn't have a big effect
> > > either, since blk-mq's SCHED_RESTART will retry an IO that
> > > returns STS_RESOURCE only after another in-flight one is
> > > completed.
> >
> > The reason we back off is that it allows the device to recover by
> > temporarily reducing its workload. In addition, the lower queue depth
> > alleviates the risk of commands timing out leading to application I/O
> > failures.
>
> The timeout risk may only arise when the driver/device doesn't return
> congestion feedback while the host queue depth is big enough.
>
> So far we don't see such an issue on NVMe, whose hw queue depth is
> 1023 and whose hw queue count is often 32+, and we haven't seen such
> timeout reports even with that many in-flight IOs (32 * 1023) on a
> single LUN.
>
> Also megaraid sas's queue depth is much less than (32 * 1023), so it
> seems very unlikely to happen.
>
> Megaraid guys, could you clarify whether this is an issue? Kashyap,
> Sumit and Shivasharan?

Hi Ming, Martin,

The megaraid_sas driver does not enable “.track_queue_depth”, so
megaraid_sas adapters never use the QUEUE FULL interface of the Linux
SCSI layer. Most of the QUEUE FULL handling is managed by the MegaRAID
controller firmware, which also manages reducing the drive level QD
(like ramp up/down).

"mpt3sas" adapters support QUEUE FULL based on IOCCapabilities of
Firmware.  Default configuration is Firmware will manage QUEUE FULL.
This is not same as Linux SCSI level handling. It is delayed retry in
Firmware. It means, we should not expect IO timeout in case of QUEUE
FULL from device since firmware can handle it as delayed retry. User
can disable Firmware handling QUEUE FULL condition (through customized
firmware) and allow QUEUE FULL return back to SCSI layer.  This
feature is called “MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING”.
So for mpt3sas driver, we may use QUEUE FULL handling of OS. We can
opt to enable “no_device_queue_for_ssd” for mpt3sas driver only if FW
does not expose MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING.
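
A sketch of such a conditional enable (assuming the flag moved from
the host template into the Scsi_Host so it could be set per adapter;
hypothetical, not actual mpt3sas code):

	if (!(ioc->facts.IOCCapabilities &
	      MPI2_IOCFACTS_CAPABILITY_TASK_SET_FULL_HANDLING))
		shost->no_device_queue_for_ssd = 1;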

Thanks,
Sumit

>
> >
> > > At least, the Broadcom guys tested this patch on megaraid RAID
> > > and the results show a big improvement; that is why the flag is
> > > only set on the megaraid host.
> >
> > I do not question that it improves performance. That's not my point.
> >
> > > In theory, .track_queue_depth may only improve sequential IO
> > > performance for HDDs; it is not very effective for SSDs, or just
> > > saves a bit of CPU cycles in the SSD case.
> >
> > This is not about performance. This is about how the system behaves when
> > a device is starved for resources or experiencing transient failures.
>
> Could you explain a bit how this patch changes the system behavior?
> I understand the EH just retries the incomplete requests, whose total
> number is less than the host queue depth.
>
>
> Thanks,
> Ming
>


* RE: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
       [not found]       ` <yq1iml1ehtl.fsf@oracle.com>
@ 2020-01-24 19:41         ` Sumanesh Samanta
  2020-01-28  4:22           ` Martin K. Petersen
  0 siblings, 1 reply; 25+ messages in thread
From: Sumanesh Samanta @ 2020-01-24 19:41 UTC (permalink / raw)
  To: Martin K. Petersen, linux-scsi
  Cc: James Bottomley, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Shivasharan S,
	Ewan D . Milne, Christoph Hellwig, Hannes Reinecke,
	Bart Van Assche, Ming Lei, linux-block, Sumit Saxena

Hi Martin,

>> Please read what I proposed. megaraid VDs don't report QUEUE_FULL so
>> you would not trigger the device_busy counter.

Yes, megaraid does not set track_queue_depth, and never reports
QUEUE_FULL. Instead of relying on QUEUE_FULL and some complex
heuristics of when to start tracking device_busy, why can't we simply
use "track_queue_depth" (along with the other flag that Ming added) to
decide which devices need queue depth tracking, and track device_busy
only for them?

>>Something like not maintaining device_busy until we actually get a
>>QUEUE_FULL condition. And then rely on the existing queue depth ramp up
>>heuristics to determine when to disable the busy counter again.

I am not sure how we can suddenly start tracking device_busy on the
fly if we do not know how many IOs are already pending for that
device. Won't device_busy have a high chance of going negative (or at
least getting a wrong value) when pending IOs start completing?

Thanks,
Sumanesh


-----Original Message-----
From: Martin K. Petersen <martin.petersen@oracle.com>
Sent: Thursday, January 23, 2020 6:59 PM
To: Sumanesh Samanta <sumanesh.samanta@broadcom.com>
Cc: linux-scsi@vger.kernel.org; Martin K. Petersen
<martin.petersen@oracle.com>
Subject: Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for
SSD when HBA needs


Sumanesh,

> The high-end controllers might expose a SCSI interface, but can have
> all kinds of devices (NVMe/SCSI/SATA) behind them, and have their own
> capability to queue IO and feed it to devices as needed. Those devices
> should not be penalized with the overhead of the device_busy counter
> just because they chose to expose themselves as SCSI devices (for
> historical and backward compatibility reasons).

Please read what I proposed. megaraid VDs don't report QUEUE_FULL so you
would not trigger the device_busy counter.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-24 12:43           ` Sumit Saxena
@ 2020-01-28  4:04             ` Martin K. Petersen
  0 siblings, 0 replies; 25+ messages in thread
From: Martin K. Petersen @ 2020-01-28  4:04 UTC (permalink / raw)
  To: Sumit Saxena
  Cc: Ming Lei, Martin K. Petersen, James Bottomley, Linux SCSI List,
	linux-block, Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Shivasharan S,
	Ewan D . Milne, Christoph Hellwig, Hannes Reinecke,
	Bart Van Assche


Sumit,

> "mpt3sas" adapters support QUEUE FULL based on IOCCapabilities of
> Firmware.  Default configuration is Firmware will manage QUEUE FULL.

Regrettably...

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-24 19:41         ` Sumanesh Samanta
@ 2020-01-28  4:22           ` Martin K. Petersen
  2020-01-31 11:39             ` Ming Lei
  0 siblings, 1 reply; 25+ messages in thread
From: Martin K. Petersen @ 2020-01-28  4:22 UTC (permalink / raw)
  To: Sumanesh Samanta
  Cc: Martin K. Petersen, linux-scsi, James Bottomley, Jens Axboe,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche, Ming Lei, linux-block,
	Sumit Saxena


Sumanesh,

> Instead of relying on QUEUE_FULL and some complex heuristics of when
> to start tracking device_busy, why can't we simply use
> "track_queue_depth" (along with the other flag that Ming added) to
> decide which devices need queue depth tracking, and track device_busy
> only for them?

Because I am interested in addressing the device_busy contention problem
for all of our non-legacy drivers. I.e. not just for controllers that
happen to queue internally.

> I am not sure how we can suddenly start tracking device_busy on the
> fly if we do not know how many IOs are already pending for that
> device.

We know that from the tags. It's just not hot path material.
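
For instance, when accounting is switched on, the counter could be
seeded from the tag set with the same iteration patch 1 uses, as a
sketch:

	struct device_busy data = { .dev = sdev, .cnt = 0 };

	blk_mq_tagset_busy_iter(&sdev->host->tag_set,
				scsi_device_check_in_flight, &data);
	atomic_set(&sdev->device_busy, data.cnt);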

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH 2/6] scsi: remove .for_blk_mq
  2020-01-19  7:14 ` [PATCH 2/6] scsi: remove .for_blk_mq Ming Lei
                     ` (2 preceding siblings ...)
  2020-01-20 22:12   ` Elliott, Robert (Servers)
@ 2020-01-31  6:30   ` Christoph Hellwig
  2020-02-05  2:13     ` Martin K. Petersen
  3 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-01-31  6:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: James Bottomley, linux-scsi, Martin K . Petersen, linux-block,
	Jens Axboe, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

On Sun, Jan 19, 2020 at 03:14:28PM +0800, Ming Lei wrote:
> No one uses it any more, so remove the flag.

Looks good modulo the subject typo; let's get this in ASAP even with
the outstanding issues on the rest of the series:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 5/6] scsi: core: don't limit per-LUN queue depth for SSD when HBA needs
  2020-01-28  4:22           ` Martin K. Petersen
@ 2020-01-31 11:39             ` Ming Lei
  0 siblings, 0 replies; 25+ messages in thread
From: Ming Lei @ 2020-01-31 11:39 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Sumanesh Samanta, Linux SCSI List, James Bottomley, Jens Axboe,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche, Ming Lei, linux-block,
	Sumit Saxena

Hi Martin,

On Tue, Jan 28, 2020 at 12:24 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Sumanesh,
>
> > Instead of relying on QUEUE_FULL and some complex heuristics of
> > when to start tracking device_busy, why can't we simply use
> > "track_queue_depth" (along with the other flag that Ming added) to
> > decide which devices need queue depth tracking, and track
> > device_busy only for them?
>
> Because I am interested in addressing the device_busy contention problem
> for all of our non-legacy drivers. I.e. not just for controllers that
> happen to queue internally.

Can we just do it now for controllers without 'track_queue_depth',
and only for SSDs?

>
> > I am not sure how we can suddenly start tracking device_busy on
> > the fly if we do not know how many IOs are already pending for that
> > device.
>
> We know that from the tags. It's just not hot path material.

In the case of 'track_queue_depth', the cost of tracking the queue
depth has to be paid, which can be too big to get the expected perf on
a high end HBA.

sbitmap might be used for this, but I am not sure it can scale well
enough for the purpose.


Thanks,
Ming Lei


* Re: [PATCH 2/6] scsi: remove .for_blk_mq
  2020-01-31  6:30   ` Christoph Hellwig
@ 2020-02-05  2:13     ` Martin K. Petersen
  0 siblings, 0 replies; 25+ messages in thread
From: Martin K. Petersen @ 2020-02-05  2:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, James Bottomley, linux-scsi,
	Martin K . Petersen, linux-block, Jens Axboe, Sathya Prakash,
	Chaitra P B, Suganath Prabu Subramani, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Ewan D . Milne, Hannes Reinecke,
	Bart Van Assche


>> No one uses it any more, so remove the flag.
>
> Looks good modulo the subject typo, lets get this in ASAP even with
> outstanding issue on the rest of the series:
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Applied to 5.7/scsi-queue. Thanks!

-- 
Martin K. Petersen	Oracle Linux Engineering

