* [PATCH 0/4] scsi: don't apply per-LUN queue depth for SSD
@ 2019-11-18 10:31 Ming Lei
  2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Ming Lei @ 2019-11-18 10:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

Hi,

SCSI's per-LUN queue depth is usually used for improving IO merge and
for balancing IO load among LUNs. blk-mq provides fair driver tag
allocation, and managed IRQs balance interrupt load among queues;
meanwhile, IO merge doesn't play a big role for SSD, and NVMe doesn't
apply a per-namespace queue depth.

This patchset stops applying the per-LUN queue depth for SSD, so that
we can avoid the expensive atomic operations on sdev->device_busy.
This shared counter is known to affect IOPS a lot.

The first two patches replace .device_busy with a private counter in
megaraid_sas and mpt3sas for tracking in-flight per-LUN commands, since
the two drivers need the counter for balancing load, even though it is
probably of little use.

The last two patches avoid operating on sdev->device_busy in the IO
path for SSD.


Ming Lei (4):
  scsi: megaraid_sas: use private counter for tracking inflight per-LUN
    commands
  scsi: mpt3sas: use private counter for tracking inflight per-LUN
    commands
  scsi: sd: register request queue after sd_revalidate_disk is done
  scsi: core: don't limit per-LUN queue depth for SSD

 block/blk-sysfs.c                           | 14 +++++++++++-
 drivers/scsi/megaraid/megaraid_sas.h        |  1 +
 drivers/scsi/megaraid/megaraid_sas_base.c   | 15 +++++++++++--
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 13 +++++++----
 drivers/scsi/mpt3sas/mpt3sas_base.c         |  3 ++-
 drivers/scsi/mpt3sas/mpt3sas_base.h         |  1 +
 drivers/scsi/mpt3sas/mpt3sas_scsih.c        | 15 +++++++++++--
 drivers/scsi/scsi_lib.c                     | 24 +++++++++++++++------
 drivers/scsi/sd.c                           |  8 ++++++-
 9 files changed, 77 insertions(+), 17 deletions(-)

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
-- 
2.20.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands
  2019-11-18 10:31 [PATCH 0/4] scsi: don't apply per-LUN queue depth for SSD Ming Lei
@ 2019-11-18 10:31 ` Ming Lei
  2019-11-20  9:54   ` Hannes Reinecke
  2019-11-18 10:31 ` [PATCH 2/4] scsi: mpt3sas: " Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-18 10:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

Prepare for bypassing sdev->device_busy to improve performance on
fast SCSI storage devices, after which sdev->device_busy can't be relied
on any more.

megaraid_sas may need such a counter for balancing load among LUNs in
some specific settings, so add one private counter for this purpose.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/megaraid/megaraid_sas.h        |  1 +
 drivers/scsi/megaraid/megaraid_sas_base.c   | 15 +++++++++++++--
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 13 +++++++++----
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index a6e788c02ff4..f9562c02705b 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2025,6 +2025,7 @@ struct MR_PRIV_DEVICE {
 	u8 interface_type;
 	u8 task_abort_tmo;
 	u8 target_reset_tmo;
+	atomic_t active_cmds;
 };
 struct megasas_cmd;
 
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 42cf38c1ea99..3afead3bc96e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -1759,6 +1759,7 @@ megasas_queue_command(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 {
 	struct megasas_instance *instance;
 	struct MR_PRIV_DEVICE *mr_device_priv_data;
+	int ret;
 
 	instance = (struct megasas_instance *)
 	    scmd->device->host->hostdata;
@@ -1821,7 +1822,11 @@ megasas_queue_command(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 		goto out_done;
 	}
 
-	return instance->instancet->build_and_issue_cmd(instance, scmd);
+	atomic_inc(&mr_device_priv_data->active_cmds);
+	ret = instance->instancet->build_and_issue_cmd(instance, scmd);
+	if (ret)
+		atomic_dec(&mr_device_priv_data->active_cmds);
+	return ret;
 
  out_done:
 	scmd->scsi_done(scmd);
@@ -2100,6 +2105,7 @@ static int megasas_slave_alloc(struct scsi_device *sdev)
 
 	atomic_set(&mr_device_priv_data->r1_ldio_hint,
 		   instance->r1_ldio_hint_default);
+	atomic_set(&mr_device_priv_data->active_cmds, 0);
 	return 0;
 }
 
@@ -3475,12 +3481,15 @@ megasas_complete_cmd(struct megasas_instance *instance, struct megasas_cmd *cmd,
 	unsigned long flags;
 	struct fusion_context *fusion = instance->ctrl_context;
 	u32 opcode, status;
+	struct MR_PRIV_DEVICE *mr_device_priv_data = NULL;
 
 	/* flag for the retry reset */
 	cmd->retry_for_fw_reset = 0;
 
-	if (cmd->scmd)
+	if (cmd->scmd) {
 		cmd->scmd->SCp.ptr = NULL;
+		mr_device_priv_data = cmd->scmd->device->hostdata;
+	}
 
 	switch (hdr->cmd) {
 	case MFI_CMD_INVALID:
@@ -3519,6 +3528,7 @@ megasas_complete_cmd(struct megasas_instance *instance, struct megasas_cmd *cmd,
 		if (exception) {
 
 			atomic_dec(&instance->fw_outstanding);
+			atomic_dec(&mr_device_priv_data->active_cmds);
 
 			scsi_dma_unmap(cmd->scmd);
 			cmd->scmd->scsi_done(cmd->scmd);
@@ -3567,6 +3577,7 @@ megasas_complete_cmd(struct megasas_instance *instance, struct megasas_cmd *cmd,
 		}
 
 		atomic_dec(&instance->fw_outstanding);
+		atomic_dec(&mr_device_priv_data->active_cmds);
 
 		scsi_dma_unmap(cmd->scmd);
 		cmd->scmd->scsi_done(cmd->scmd);
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index e301458bcbae..10ed3bc3b643 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -2728,7 +2728,7 @@ megasas_build_ldio_fusion(struct megasas_instance *instance,
 	u8 *raidLUN;
 	unsigned long spinlock_flags;
 	struct MR_LD_RAID *raid = NULL;
-	struct MR_PRIV_DEVICE *mrdev_priv;
+	struct MR_PRIV_DEVICE *mrdev_priv = scp->device->hostdata;
 	struct RAID_CONTEXT *rctx;
 	struct RAID_CONTEXT_G35 *rctx_g35;
 
@@ -2826,7 +2826,7 @@ megasas_build_ldio_fusion(struct megasas_instance *instance,
 	}
 
 	if ((instance->perf_mode == MR_BALANCED_PERF_MODE) &&
-		atomic_read(&scp->device->device_busy) >
+		atomic_read(&mrdev_priv->active_cmds) >
 		(io_info.data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
@@ -2849,7 +2849,6 @@ megasas_build_ldio_fusion(struct megasas_instance *instance,
 		 * with the SLD bit asserted.
 		 */
 		if (io_info.r1_alt_dev_handle != MR_DEVHANDLE_INVALID) {
-			mrdev_priv = scp->device->hostdata;
 
 			if (atomic_inc_return(&instance->fw_outstanding) >
 				(instance->host->can_queue)) {
@@ -3159,7 +3158,7 @@ megasas_build_syspd_fusion(struct megasas_instance *instance,
 	cmd->request_desc->SCSIIO.DevHandle = io_request->DevHandle;
 
 	if ((instance->perf_mode == MR_BALANCED_PERF_MODE) &&
-		atomic_read(&scmd->device->device_busy) > MR_DEVICE_HIGH_IOPS_DEPTH)
+		atomic_read(&mr_device_priv_data->active_cmds) > MR_DEVICE_HIGH_IOPS_DEPTH)
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
 				MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
@@ -3550,6 +3549,7 @@ complete_cmd_fusion(struct megasas_instance *instance, u32 MSIxIndex,
 
 	while (d_val.u.low != cpu_to_le32(UINT_MAX) &&
 	       d_val.u.high != cpu_to_le32(UINT_MAX)) {
+		struct MR_PRIV_DEVICE *mr_device_priv_data = NULL;
 
 		smid = le16_to_cpu(reply_desc->SMID);
 		cmd_fusion = fusion->cmd_list[smid - 1];
@@ -3585,6 +3585,8 @@ complete_cmd_fusion(struct megasas_instance *instance, u32 MSIxIndex,
 			}
 			/* Fall through - and complete IO */
 		case MEGASAS_MPI2_FUNCTION_LD_IO_REQUEST: /* LD-IO Path */
+			mr_device_priv_data = scmd_local->device->hostdata;
+			atomic_dec(&mr_device_priv_data->active_cmds);
 			atomic_dec(&instance->fw_outstanding);
 			if (cmd_fusion->r1_alt_dev_handle == MR_DEVHANDLE_INVALID) {
 				map_cmd_status(fusion, scmd_local, status,
@@ -4865,6 +4867,7 @@ int megasas_reset_fusion(struct Scsi_Host *shost, int reason)
 
 		/* Now return commands back to the OS */
 		for (i = 0 ; i < instance->max_scsi_cmds; i++) {
+			struct MR_PRIV_DEVICE *mr_device_priv_data = NULL;
 			cmd_fusion = fusion->cmd_list[i];
 			/*check for extra commands issued by driver*/
 			if (instance->adapter_type >= VENTURA_SERIES) {
@@ -4893,6 +4896,8 @@ int megasas_reset_fusion(struct Scsi_Host *shost, int reason)
 				megasas_return_cmd_fusion(instance, cmd_fusion);
 				scsi_dma_unmap(scmd_local);
 				scmd_local->scsi_done(scmd_local);
+				mr_device_priv_data = scmd_local->device->hostdata;
+				atomic_dec(&mr_device_priv_data->active_cmds);
 			}
 		}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/4] scsi: mpt3sas: use private counter for tracking inflight per-LUN commands
  2019-11-18 10:31 [PATCH 0/4] scsi: don't apply per-LUN queue depth for SSD Ming Lei
  2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
@ 2019-11-18 10:31 ` Ming Lei
  2019-11-20  9:55   ` Hannes Reinecke
  2019-11-18 10:31 ` [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
  2019-11-18 10:31 ` [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD Ming Lei
  3 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-18 10:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

Prepare for bypassing sdev->device_busy to improve performance on
fast SCSI storage devices, after which sdev->device_busy can't be relied
on any more.

mpt3sas may need such a counter for balancing load among LUNs in
some specific settings, so add one private counter for this purpose.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/mpt3sas/mpt3sas_base.c  |  3 ++-
 drivers/scsi/mpt3sas/mpt3sas_base.h  |  1 +
 drivers/scsi/mpt3sas/mpt3sas_scsih.c | 15 +++++++++++++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index fea3cb6a090b..9c2d374b3157 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3480,12 +3480,13 @@ static inline u8
 _base_get_high_iops_msix_index(struct MPT3SAS_ADAPTER *ioc,
 	struct scsi_cmnd *scmd)
 {
+	struct MPT3SAS_DEVICE *sas_device_priv_data = scmd->device->hostdata;
 	/**
 	 * Round robin the IO interrupts among the high iops
 	 * reply queues in terms of batch count 16 when outstanding
 	 * IOs on the target device is >=8.
 	 */
-	if (atomic_read(&scmd->device->device_busy) >
+	if (atomic_read(&sas_device_priv_data->active_cmds) >
 	    MPT3SAS_DEVICE_HIGH_IOPS_DEPTH)
 		return base_mod64((
 		    atomic64_add_return(1, &ioc->high_iops_outstanding) /
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
index faca0a5e71f8..9e9f319bb636 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.h
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
@@ -467,6 +467,7 @@ struct MPT3SAS_DEVICE {
 	 * thing while a SATL command is pending.
 	 */
 	unsigned long ata_command_pending;
+	atomic_t active_cmds;
 
 };
 
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index c8e512ba6d39..194960dae1d6 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -1770,6 +1770,7 @@ scsih_slave_alloc(struct scsi_device *sdev)
 
 	sas_device_priv_data->lun = sdev->lun;
 	sas_device_priv_data->flags = MPT_DEVICE_FLAGS_INIT;
+	atomic_set(&sas_device_priv_data->active_cmds, 0);
 
 	starget = scsi_target(sdev);
 	sas_target_priv_data = starget->hostdata;
@@ -2884,6 +2885,7 @@ scsih_abort(struct scsi_cmnd *scmd)
 	    ioc->remove_host) {
 		sdev_printk(KERN_INFO, scmd->device,
 			"device been deleted! scmd(%p)\n", scmd);
+		atomic_dec(&sas_device_priv_data->active_cmds);
 		scmd->result = DID_NO_CONNECT << 16;
 		scmd->scsi_done(scmd);
 		r = SUCCESS;
@@ -2993,7 +2995,7 @@ scsih_dev_reset(struct scsi_cmnd *scmd)
 		MPI2_SCSITASKMGMT_TASKTYPE_LOGICAL_UNIT_RESET, 0, 0,
 		tr_timeout, tr_method);
 	/* Check for busy commands after reset */
-	if (r == SUCCESS && atomic_read(&scmd->device->device_busy))
+	if (r == SUCCESS && atomic_read(&sas_device_priv_data->active_cmds))
 		r = FAILED;
  out:
 	sdev_printk(KERN_INFO, scmd->device, "device reset: %s scmd(%p)\n",
@@ -4517,9 +4519,12 @@ _scsih_flush_running_cmds(struct MPT3SAS_ADAPTER *ioc)
 	int count = 0;
 
 	for (smid = 1; smid <= ioc->scsiio_depth; smid++) {
+		struct MPT3SAS_DEVICE *sas_device_priv_data;
+
 		scmd = mpt3sas_scsih_scsi_lookup_get(ioc, smid);
 		if (!scmd)
 			continue;
+		sas_device_priv_data = scmd->device->hostdata;
 		count++;
 		_scsih_set_satl_pending(scmd, false);
 		st = scsi_cmd_priv(scmd);
@@ -4529,6 +4534,7 @@ _scsih_flush_running_cmds(struct MPT3SAS_ADAPTER *ioc)
 			scmd->result = DID_NO_CONNECT << 16;
 		else
 			scmd->result = DID_RESET << 16;
+		atomic_dec(&sas_device_priv_data->active_cmds);
 		scmd->scsi_done(scmd);
 	}
 	dtmprintk(ioc, ioc_info(ioc, "completing %d cmds\n", count));
@@ -4756,11 +4762,14 @@ scsih_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
 	    mpi_request->LUN);
 	memcpy(mpi_request->CDB.CDB32, scmd->cmnd, scmd->cmd_len);
 
+	atomic_inc(&sas_device_priv_data->active_cmds);
+
 	if (mpi_request->DataLength) {
 		pcie_device = sas_target_priv_data->pcie_dev;
 		if (ioc->build_sg_scmd(ioc, scmd, smid, pcie_device)) {
 			mpt3sas_base_free_smid(ioc, smid);
 			_scsih_set_satl_pending(scmd, false);
+			atomic_dec(&sas_device_priv_data->active_cmds);
 			goto out;
 		}
 	} else
@@ -5207,6 +5216,7 @@ _scsih_smart_predicted_fault(struct MPT3SAS_ADAPTER *ioc, u16 handle)
 static u8
 _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 {
+	struct MPT3SAS_DEVICE *sas_device_priv_data;
 	Mpi25SCSIIORequest_t *mpi_request;
 	Mpi2SCSIIOReply_t *mpi_reply;
 	struct scsi_cmnd *scmd;
@@ -5216,7 +5226,6 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 	u8 scsi_state;
 	u8 scsi_status;
 	u32 log_info;
-	struct MPT3SAS_DEVICE *sas_device_priv_data;
 	u32 response_code = 0;
 
 	mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);
@@ -5225,6 +5234,7 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 	if (scmd == NULL)
 		return 1;
 
+	sas_device_priv_data = scmd->device->hostdata;
 	_scsih_set_satl_pending(scmd, false);
 
 	mpi_request = mpt3sas_base_get_msg_frame(ioc, smid);
@@ -5431,6 +5441,7 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
 
 	scsi_dma_unmap(scmd);
 	mpt3sas_base_free_smid(ioc, smid);
+	atomic_dec(&sas_device_priv_data->active_cmds);
 	scmd->scsi_done(scmd);
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done
  2019-11-18 10:31 [PATCH 0/4] scsi: don't apply per-LUN queue depth for SSD Ming Lei
  2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
  2019-11-18 10:31 ` [PATCH 2/4] scsi: mpt3sas: " Ming Lei
@ 2019-11-18 10:31 ` Ming Lei
  2019-11-20  9:59   ` Hannes Reinecke
  2019-11-18 10:31 ` [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD Ming Lei
  3 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-18 10:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

Prepare for improving SSD performance in the following patch, which
needs to read the QUEUE_FLAG_NONROT queue flag in the IO path, so we have
to freeze the queue before changing this flag in sd_revalidate_disk().

However, queue freezing becomes quite slow after the queue is registered
because an RCU grace period is involved.

So delay registering the queue until sd_revalidate_disk() is done, to
avoid the slow queue freezing which will be added to sd_revalidate_disk()
in the following patch.

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/sd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 03163ac5fe95..0744c34468e1 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3368,12 +3368,14 @@ static int sd_probe(struct device *dev)
 	}
 
 	blk_pm_runtime_init(sdp->request_queue, dev);
-	device_add_disk(dev, gd, NULL);
+	device_add_disk_no_queue_reg(dev, gd);
 	if (sdkp->capacity)
 		sd_dif_config_host(sdkp);
 
 	sd_revalidate_disk(gd);
 
+	blk_register_queue(gd);
+
 	if (sdkp->security) {
 		sdkp->opal_dev = init_opal_dev(sdp, &sd_sec_submit);
 		if (sdkp->opal_dev)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-18 10:31 [PATCH 0/4] scsi: don't apply per-LUN queue depth for SSD Ming Lei
                   ` (2 preceding siblings ...)
  2019-11-18 10:31 ` [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
@ 2019-11-18 10:31 ` Ming Lei
  2019-11-20 10:05   ` Hannes Reinecke
  3 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-18 10:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Ming Lei, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Hannes Reinecke, Bart Van Assche

SCSI core uses the atomic variable sdev->device_busy to track
in-flight IO requests dispatched to this SCSI device. IO requests may be
submitted from any CPU, so the cost of maintaining the shared atomic
counter can be very high on a big NUMA machine with lots of CPU cores.

sdev->queue_depth is usually used for two purposes: 1) improving IO merge;
2) distributing IO requests fairly among all LUNs.

blk-mq already provides fair request allocation among all active shared
request queues (LUNs), see hctx_may_queue().

NVMe doesn't have such a per-request-queue (namespace) queue depth, so it
is reasonable to ignore the limit for SCSI SSD too. Also, IO merge won't
play a big role in reaching top SSD performance.

With this patch, the big cost of tracking in-flight per-LUN requests via an
atomic variable can be saved.

Given QUEUE_FLAG_NONROT is read in the IO path, we have to freeze the queue
before changing this flag.
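
For reference, the fairness check in hctx_may_queue() looks roughly like
the sketch below (paraphrased, not verbatim kernel code; see
block/blk-mq-tag.c for the authoritative version, field names may differ
slightly):

	/* rough paraphrase of hctx_may_queue() */
	static bool hctx_may_queue_sketch(struct blk_mq_hw_ctx *hctx,
					  struct sbitmap_queue *bt)
	{
		unsigned int depth, users;

		if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
			return true;	/* tags not shared among queues */

		users = atomic_read(&hctx->tags->active_queues);
		if (!users)
			return true;

		/* give each active queue (LUN) an equal share of the tags */
		depth = max((bt->sb.depth + users - 1) / users, 4U);
		return atomic_read(&hctx->nr_active) < depth;
	}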

Cc: Sathya Prakash <sathya.prakash@broadcom.com>
Cc: Chaitra P B <chaitra.basappa@broadcom.com>
Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Sumit Saxena <sumit.saxena@broadcom.com>
Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>,
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-sysfs.c       | 14 +++++++++++++-
 drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
 drivers/scsi/sd.c       |  4 ++++
 3 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..9cc0dd5f88a1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
 QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
 #undef QUEUE_SYSFS_BIT_FNS
 
+static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
+		const char *page, size_t count)
+{
+	size_t ret;
+
+	blk_mq_freeze_queue(q);
+	ret = queue_store_nonrot(q, page, count);
+	blk_mq_unfreeze_queue(q);
+
+	return ret;
+}
+
 static ssize_t queue_zoned_show(struct request_queue *q, char *page)
 {
 	switch (blk_queue_zoned_model(q)) {
@@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
-	.store = queue_store_nonrot,
+	.store = queue_store_nonrot_freeze,
 };
 
 static struct queue_sysfs_entry queue_zoned_entry = {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 62a86a82c38d..72655a049efd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
 	if (starget->can_queue > 0)
 		atomic_dec(&starget->target_busy);
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static void scsi_kick_queue(struct request_queue *q)
@@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
 
 static inline bool scsi_device_is_busy(struct scsi_device *sdev)
 {
-	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
+	if (!blk_queue_nonrot(sdev->request_queue) &&
+			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
 		return true;
 	if (atomic_read(&sdev->device_blocked) > 0)
 		return true;
@@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				  struct scsi_device *sdev)
 {
 	unsigned int busy;
+	bool bypass = blk_queue_nonrot(sdev->request_queue);
 
-	busy = atomic_inc_return(&sdev->device_busy) - 1;
+	if (!bypass)
+		busy = atomic_inc_return(&sdev->device_busy) - 1;
+	else
+		busy = 0;
 	if (atomic_read(&sdev->device_blocked)) {
 		if (busy)
 			goto out_dec;
@@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 				   "unblocking device at zero depth\n"));
 	}
 
+	if (bypass)
+		return 1;
+
 	if (busy >= sdev->queue_depth)
 		goto out_dec;
 
 	return 1;
 out_dec:
-	atomic_dec(&sdev->device_busy);
+	if (!bypass)
+		atomic_dec(&sdev->device_busy);
 	return 0;
 }
 
@@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
 	struct request_queue *q = hctx->queue;
 	struct scsi_device *sdev = q->queuedata;
 
-	atomic_dec(&sdev->device_busy);
+	if (!blk_queue_nonrot(sdev->request_queue))
+		atomic_dec(&sdev->device_busy);
 }
 
 static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
@@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
-		if (atomic_read(&sdev->device_busy) ||
+		if ((!blk_queue_nonrot(sdev->request_queue) &&
+		     atomic_read(&sdev->device_busy)) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
 		break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 0744c34468e1..c3d47117700d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
 	rot = get_unaligned_be16(&buffer[4]);
 
 	if (rot == 1) {
+		blk_mq_freeze_queue(q);
 		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
+		blk_mq_unfreeze_queue(q);
 		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
 	}
 
@@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		 * cause this to be updated correctly and any device which
 		 * doesn't support it should be treated as rotational.
 		 */
+		blk_mq_freeze_queue(q);
 		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
+		blk_mq_unfreeze_queue(q);
 		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
 
 		if (scsi_device_supports_vpd(sdp)) {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands
  2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
@ 2019-11-20  9:54   ` Hannes Reinecke
  2019-11-26  3:12     ` Kashyap Desai
  0 siblings, 1 reply; 34+ messages in thread
From: Hannes Reinecke @ 2019-11-20  9:54 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/18/19 11:31 AM, Ming Lei wrote:
> Prepare for bypassing sdev->device_busy to improve performance on
> fast SCSI storage devices, after which sdev->device_busy can't be relied
> on any more.
> 
> megaraid_sas may need such a counter for balancing load among LUNs in
> some specific settings, so add one private counter for this purpose.
> 
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/megaraid/megaraid_sas.h        |  1 +
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 15 +++++++++++++--
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 13 +++++++++----
>  3 files changed, 23 insertions(+), 6 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/4] scsi: mpt3sas: use private counter for tracking inflight per-LUN commands
  2019-11-18 10:31 ` [PATCH 2/4] scsi: mpt3sas: " Ming Lei
@ 2019-11-20  9:55   ` Hannes Reinecke
  0 siblings, 0 replies; 34+ messages in thread
From: Hannes Reinecke @ 2019-11-20  9:55 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/18/19 11:31 AM, Ming Lei wrote:
> Prepare for bypassing sdev->device_busy to improve performance on
> fast SCSI storage devices, after which sdev->device_busy can't be relied
> on any more.
> 
> mpt3sas may need such a counter for balancing load among LUNs in
> some specific settings, so add one private counter for this purpose.
> 
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/mpt3sas/mpt3sas_base.c  |  3 ++-
>  drivers/scsi/mpt3sas/mpt3sas_base.h  |  1 +
>  drivers/scsi/mpt3sas/mpt3sas_scsih.c | 15 +++++++++++++--
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
> index fea3cb6a090b..9c2d374b3157 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
> @@ -3480,12 +3480,13 @@ static inline u8
>  _base_get_high_iops_msix_index(struct MPT3SAS_ADAPTER *ioc,
>  	struct scsi_cmnd *scmd)
>  {
> +	struct MPT3SAS_DEVICE *sas_device_priv_data = scmd->device->hostdata;
>  	/**
>  	 * Round robin the IO interrupts among the high iops
>  	 * reply queues in terms of batch count 16 when outstanding
>  	 * IOs on the target device is >=8.
>  	 */
> -	if (atomic_read(&scmd->device->device_busy) >
> +	if (atomic_read(&sas_device_priv_data->active_cmds) >
>  	    MPT3SAS_DEVICE_HIGH_IOPS_DEPTH)
>  		return base_mod64((
>  		    atomic64_add_return(1, &ioc->high_iops_outstanding) /
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.h b/drivers/scsi/mpt3sas/mpt3sas_base.h
> index faca0a5e71f8..9e9f319bb636 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_base.h
> +++ b/drivers/scsi/mpt3sas/mpt3sas_base.h
> @@ -467,6 +467,7 @@ struct MPT3SAS_DEVICE {
>  	 * thing while a SATL command is pending.
>  	 */
>  	unsigned long ata_command_pending;
> +	atomic_t active_cmds;
>  
>  };
>  
> diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> index c8e512ba6d39..194960dae1d6 100644
> --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
> @@ -1770,6 +1770,7 @@ scsih_slave_alloc(struct scsi_device *sdev)
>  
>  	sas_device_priv_data->lun = sdev->lun;
>  	sas_device_priv_data->flags = MPT_DEVICE_FLAGS_INIT;
> +	atomic_set(&sas_device_priv_data->active_cmds, 0);
>  
>  	starget = scsi_target(sdev);
>  	sas_target_priv_data = starget->hostdata;
> @@ -2884,6 +2885,7 @@ scsih_abort(struct scsi_cmnd *scmd)
>  	    ioc->remove_host) {
>  		sdev_printk(KERN_INFO, scmd->device,
>  			"device been deleted! scmd(%p)\n", scmd);
> +		atomic_dec(&sas_device_priv_data->active_cmds);
>  		scmd->result = DID_NO_CONNECT << 16;
>  		scmd->scsi_done(scmd);
>  		r = SUCCESS;
> @@ -2993,7 +2995,7 @@ scsih_dev_reset(struct scsi_cmnd *scmd)
>  		MPI2_SCSITASKMGMT_TASKTYPE_LOGICAL_UNIT_RESET, 0, 0,
>  		tr_timeout, tr_method);
>  	/* Check for busy commands after reset */
> -	if (r == SUCCESS && atomic_read(&scmd->device->device_busy))
> +	if (r == SUCCESS && atomic_read(&sas_device_priv_data->active_cmds))
>  		r = FAILED;
>   out:
>  	sdev_printk(KERN_INFO, scmd->device, "device reset: %s scmd(%p)\n",
> @@ -4517,9 +4519,12 @@ _scsih_flush_running_cmds(struct MPT3SAS_ADAPTER *ioc)
>  	int count = 0;
>  
>  	for (smid = 1; smid <= ioc->scsiio_depth; smid++) {
> +		struct MPT3SAS_DEVICE *sas_device_priv_data;
> +
>  		scmd = mpt3sas_scsih_scsi_lookup_get(ioc, smid);
>  		if (!scmd)
>  			continue;
> +		sas_device_priv_data = scmd->device->hostdata;
>  		count++;
>  		_scsih_set_satl_pending(scmd, false);
>  		st = scsi_cmd_priv(scmd);
> @@ -4529,6 +4534,7 @@ _scsih_flush_running_cmds(struct MPT3SAS_ADAPTER *ioc)
>  			scmd->result = DID_NO_CONNECT << 16;
>  		else
>  			scmd->result = DID_RESET << 16;
> +		atomic_dec(&sas_device_priv_data->active_cmds);
>  		scmd->scsi_done(scmd);
>  	}
>  	dtmprintk(ioc, ioc_info(ioc, "completing %d cmds\n", count));
> @@ -4756,11 +4762,14 @@ scsih_qcmd(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
>  	    mpi_request->LUN);
>  	memcpy(mpi_request->CDB.CDB32, scmd->cmnd, scmd->cmd_len);
>  
> +	atomic_inc(&sas_device_priv_data->active_cmds);
> +
>  	if (mpi_request->DataLength) {
>  		pcie_device = sas_target_priv_data->pcie_dev;
>  		if (ioc->build_sg_scmd(ioc, scmd, smid, pcie_device)) {
>  			mpt3sas_base_free_smid(ioc, smid);
>  			_scsih_set_satl_pending(scmd, false);
> +			atomic_dec(&sas_device_priv_data->active_cmds);
>  			goto out;
>  		}
>  	} else
> @@ -5207,6 +5216,7 @@ _scsih_smart_predicted_fault(struct MPT3SAS_ADAPTER *ioc, u16 handle)
>  static u8
>  _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
>  {
> +	struct MPT3SAS_DEVICE *sas_device_priv_data;
>  	Mpi25SCSIIORequest_t *mpi_request;
>  	Mpi2SCSIIOReply_t *mpi_reply;
>  	struct scsi_cmnd *scmd;
> @@ -5216,7 +5226,6 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
>  	u8 scsi_state;
>  	u8 scsi_status;
>  	u32 log_info;
> -	struct MPT3SAS_DEVICE *sas_device_priv_data;
>  	u32 response_code = 0;
>  
>  	mpi_reply = mpt3sas_base_get_reply_virt_addr(ioc, reply);

The above two modifications are probably not needed ...

> @@ -5225,6 +5234,7 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
>  	if (scmd == NULL)
>  		return 1;
>  
> +	sas_device_priv_data = scmd->device->hostdata;
>  	_scsih_set_satl_pending(scmd, false);
>  
>  	mpi_request = mpt3sas_base_get_msg_frame(ioc, smid);
> @@ -5431,6 +5441,7 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply)
>  
>  	scsi_dma_unmap(scmd);
>  	mpt3sas_base_free_smid(ioc, smid);
> +	atomic_dec(&sas_device_priv_data->active_cmds);
>  	scmd->scsi_done(scmd);
>  	return 0;
>  }
> 
Otherwise:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done
  2019-11-18 10:31 ` [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
@ 2019-11-20  9:59   ` Hannes Reinecke
  0 siblings, 0 replies; 34+ messages in thread
From: Hannes Reinecke @ 2019-11-20  9:59 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/18/19 11:31 AM, Ming Lei wrote:
> Prepare for improving SSD performance in the following patch, which
> needs to read the QUEUE_FLAG_NONROT queue flag in the IO path, so we have
> to freeze the queue before changing this flag in sd_revalidate_disk().
> 
> However, queue freezing becomes quite slow after the queue is registered
> because an RCU grace period is involved.
> 
> So delay registering the queue until sd_revalidate_disk() is done, to
> avoid the slow queue freezing which will be added to sd_revalidate_disk()
> in the following patch.
> 
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/sd.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 03163ac5fe95..0744c34468e1 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3368,12 +3368,14 @@ static int sd_probe(struct device *dev)
>  	}
>  
>  	blk_pm_runtime_init(sdp->request_queue, dev);
> -	device_add_disk(dev, gd, NULL);
> +	device_add_disk_no_queue_reg(dev, gd);
>  	if (sdkp->capacity)
>  		sd_dif_config_host(sdkp);
>  
>  	sd_revalidate_disk(gd);
>  
> +	blk_register_queue(gd);
> +
>  	if (sdkp->security) {
>  		sdkp->opal_dev = init_opal_dev(sdp, &sd_sec_submit);
>  		if (sdkp->opal_dev)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-18 10:31 ` [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD Ming Lei
@ 2019-11-20 10:05   ` Hannes Reinecke
  2019-11-20 17:00     ` Ewan D. Milne
                       ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Hannes Reinecke @ 2019-11-20 10:05 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/18/19 11:31 AM, Ming Lei wrote:
> SCSI core uses the atomic variable sdev->device_busy to track
> in-flight IO requests dispatched to this SCSI device. IO requests may be
> submitted from any CPU, so the cost of maintaining the shared atomic
> counter can be very high on a big NUMA machine with lots of CPU cores.
> 
> sdev->queue_depth is usually used for two purposes: 1) improving IO merge;
> 2) distributing IO requests fairly among all LUNs.
> 
> blk-mq already provides fair request allocation among all active shared
> request queues (LUNs), see hctx_may_queue().
> 
> NVMe doesn't have such a per-request-queue (namespace) queue depth, so it
> is reasonable to ignore the limit for SCSI SSD too. Also, IO merge won't
> play a big role in reaching top SSD performance.
> 
> With this patch, the big cost of tracking in-flight per-LUN requests via an
> atomic variable can be saved.
> 
> Given QUEUE_FLAG_NONROT is read in the IO path, we have to freeze the queue
> before changing this flag.
> 
> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>,
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-sysfs.c       | 14 +++++++++++++-
>  drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
>  drivers/scsi/sd.c       |  4 ++++
>  3 files changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index fca9b158f4a0..9cc0dd5f88a1 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
>  QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
>  #undef QUEUE_SYSFS_BIT_FNS
>  
> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> +		const char *page, size_t count)
> +{
> +	size_t ret;
> +
> +	blk_mq_freeze_queue(q);
> +	ret = queue_store_nonrot(q, page, count);
> +	blk_mq_unfreeze_queue(q);
> +
> +	return ret;
> +}
> +
>  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
>  {
>  	switch (blk_queue_zoned_model(q)) {
> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
>  static struct queue_sysfs_entry queue_nonrot_entry = {
>  	.attr = {.name = "rotational", .mode = 0644 },
>  	.show = queue_show_nonrot,
> -	.store = queue_store_nonrot,
> +	.store = queue_store_nonrot_freeze,
>  };
>  
>  static struct queue_sysfs_entry queue_zoned_entry = {
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 62a86a82c38d..72655a049efd 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
>  	if (starget->can_queue > 0)
>  		atomic_dec(&starget->target_busy);
>  
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }
>  
>  static void scsi_kick_queue(struct request_queue *q)
> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>  
>  static inline bool scsi_device_is_busy(struct scsi_device *sdev)
>  {
> -	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> +	if (!blk_queue_nonrot(sdev->request_queue) &&
> +			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
>  		return true;
>  	if (atomic_read(&sdev->device_blocked) > 0)
>  		return true;
> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  				  struct scsi_device *sdev)
>  {
>  	unsigned int busy;
> +	bool bypass = blk_queue_nonrot(sdev->request_queue);
>  
> -	busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	if (!bypass)
> +		busy = atomic_inc_return(&sdev->device_busy) - 1;
> +	else
> +		busy = 0;
>  	if (atomic_read(&sdev->device_blocked)) {
>  		if (busy)
>  			goto out_dec;
> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>  				   "unblocking device at zero depth\n"));
>  	}
>  
> +	if (bypass)
> +		return 1;
> +
>  	if (busy >= sdev->queue_depth)
>  		goto out_dec;
>  
>  	return 1;
>  out_dec:
> -	atomic_dec(&sdev->device_busy);
> +	if (!bypass)
> +		atomic_dec(&sdev->device_busy);
>  	return 0;
>  }
>  
> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
>  	struct request_queue *q = hctx->queue;
>  	struct scsi_device *sdev = q->queuedata;
>  
> -	atomic_dec(&sdev->device_busy);
> +	if (!blk_queue_nonrot(sdev->request_queue))
> +		atomic_dec(&sdev->device_busy);
>  }
>  
>  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	case BLK_STS_OK:
>  		break;
>  	case BLK_STS_RESOURCE:
> -		if (atomic_read(&sdev->device_busy) ||
> +		if ((!blk_queue_nonrot(sdev->request_queue) &&
> +		     atomic_read(&sdev->device_busy)) ||
>  		    scsi_device_blocked(sdev))
>  			ret = BLK_STS_DEV_RESOURCE;
>  		break;
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 0744c34468e1..c3d47117700d 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
>  	rot = get_unaligned_be16(&buffer[4]);
>  
>  	if (rot == 1) {
> +		blk_mq_freeze_queue(q);
>  		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> +		blk_mq_unfreeze_queue(q);
>  		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
>  	}
>  
> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
>  		 * cause this to be updated correctly and any device which
>  		 * doesn't support it should be treated as rotational.
>  		 */
> +		blk_mq_freeze_queue(q);
>  		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> +		blk_mq_unfreeze_queue(q);
>  		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
>  
>  		if (scsi_device_supports_vpd(sdp)) {
> 
Hmm.

I must admit I patently don't like this explicit dependency on
blk_nonrot(). Having a conditional counter is just an open invitation to
getting things wrong...

I would far prefer if we could delegate any queueing decision to the
elevators, and completely drop the device_busy flag for all devices.
But I do see that this might not be easy to do, nor necessarily
something which others agree on.
So alternatively, can't we make this a per-host flag, instead of having
it per device characteristic?

I'm just thinking of a storage array with a massive cache; it's anyone's
guess if 'nonrot' is set there, but at the same time these things can be
fast, so they might be profiting from dropping the device_busy counter
there, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 10:05   ` Hannes Reinecke
@ 2019-11-20 17:00     ` Ewan D. Milne
  2019-11-20 20:56       ` Bart Van Assche
                         ` (2 more replies)
  2019-11-21  0:53     ` Ming Lei
  2019-11-22  2:18     ` Martin K. Petersen
  2 siblings, 3 replies; 34+ messages in thread
From: Ewan D. Milne @ 2019-11-20 17:00 UTC (permalink / raw)
  To: Hannes Reinecke, Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche

On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> 
> Hmm.
> 
> I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation to
> getting things wrong...
> 

This concerns me as well, it seems like the SCSI ML should have its
own per-device attribute if we actually need to control this per-device
instead of on a per-host or per-driver basis.  And it seems like this
is something that is specific to high-performance drivers, so changing
the way this works for all drivers seems a bit much.

Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
but I dislike wrapping the examination of that and the queue flag in
a macro that makes it not obvious how the behavior is affected.
(Plus Hannes just submitted the patches to remove .use_cmd_list,
which was another piece of ML functionality used by only a few drivers.)

Ming's patch does freeze the queue if NONROT is changed by sysfs, but
the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
clears it and then calls sd_read_block_characteristics() which may set
it again.  So it's not clear to me how reliable this is.

-Ewan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 17:00     ` Ewan D. Milne
@ 2019-11-20 20:56       ` Bart Van Assche
  2019-11-20 21:36         ` Ewan D. Milne
  2019-11-21  1:07         ` Ming Lei
  2019-11-21  0:08       ` Sumanesh Samanta
  2019-11-21  0:54       ` Ming Lei
  2 siblings, 2 replies; 34+ messages in thread
From: Bart Van Assche @ 2019-11-20 20:56 UTC (permalink / raw)
  To: Ewan D. Milne, Hannes Reinecke, Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche

On 11/20/19 9:00 AM, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>> I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation to
>> getting things wrong...
> 
> This concerns me as well, it seems like the SCSI ML should have its
> own per-device attribute if we actually need to control this per-device
> instead of on a per-host or per-driver basis.  And it seems like this
> is something that is specific to high-performance drivers, so changing
> the way this works for all drivers seems a bit much.
> 
> Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> but I dislike wrapping the examination of that and the queue flag in
> a macro that makes it not obvious how the behavior is affected.
> (Plus Hannes just submitted the patches to remove .use_cmd_list,
> which was another piece of ML functionality used by only a few drivers.)
> 
> Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics() which may set
> it again.  So it's not clear to me how reliable this is.

How about changing the default behavior into ignoring sdev->queue_depth 
and only honoring sdev->queue_depth if a "quirk" flag is set or if 
overridden by setting a sysfs attribute? My understanding is that the 
goal of the queue ramp-up/ramp-down mechanism is to reduce the number of 
times a SCSI device responds "BUSY". An alternative for queue 
ramp-up/ramp-down is a delayed queue re-run. I think if scsi_queue_rq() 
returns BLK_STS_RESOURCE that the queue is only rerun after a delay. 
 From blk_mq_dispatch_rq_list():

	[ ... ]
	else if (needs_restart && (ret == BLK_STS_RESOURCE))
		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

Bart.
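
[ Purely illustrative: the per-device sysfs override mentioned above
  could look something like the sketch below. The honor_queue_depth
  field and the attribute are made up and do not exist in the kernel. ]

	/* hypothetical per-device knob: honor sdev->queue_depth only on request */
	static ssize_t
	sdev_store_honor_queue_depth(struct device *dev,
				     struct device_attribute *attr,
				     const char *buf, size_t count)
	{
		struct scsi_device *sdev = to_scsi_device(dev);
		bool v;

		if (kstrtobool(buf, &v))
			return -EINVAL;
		sdev->honor_queue_depth = v;	/* made-up field */
		return count;
	}
	static DEVICE_ATTR(honor_queue_depth, S_IWUSR, NULL,
			   sdev_store_honor_queue_depth);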

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 20:56       ` Bart Van Assche
@ 2019-11-20 21:36         ` Ewan D. Milne
  2019-11-22  2:25           ` Martin K. Petersen
  2019-11-21  1:07         ` Ming Lei
  1 sibling, 1 reply; 34+ messages in thread
From: Ewan D. Milne @ 2019-11-20 21:36 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche

On Wed, 2019-11-20 at 12:56 -0800, Bart Van Assche wrote:
> On 11/20/19 9:00 AM, Ewan D. Milne wrote:
> > On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > > I must admit I patently don't like this explicit dependency on
> > > blk_nonrot(). Having a conditional counter is just an open invitation to
> > > getting things wrong...
> > 
> > This concerns me as well, it seems like the SCSI ML should have its
> > own per-device attribute if we actually need to control this per-device
> > instead of on a per-host or per-driver basis.  And it seems like this
> > is something that is specific to high-performance drivers, so changing
> > the way this works for all drivers seems a bit much.
> > 
> > Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> > but I dislike wrapping the examination of that and the queue flag in
> > a macro that makes it not obvious how the behavior is affected.
> > (Plus Hannes just submitted the patches to remove .use_cmd_list,
> > which was another piece of ML functionality used by only a few drivers.)
> > 
> > Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> > the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> > clears it and then calls sd_read_block_characteristics() which may set
> > it again.  So it's not clear to me how reliable this is.
> 
> How about changing the default behavior into ignoring sdev->queue_depth 
> and only honoring sdev->queue_depth if a "quirk" flag is set or if 
> overridden by setting a sysfs attribute? My understanding is that the 
> goal of the queue ramp-up/ramp-down mechanism is to reduce the number of 
> times a SCSI device responds "BUSY". An alternative for queue 
> ramp-up/ramp-down is a delayed queue re-run. I think if scsi_queue_rq() 
> returns BLK_STS_RESOURCE that the queue is only rerun after a delay. 
>  From blk_mq_dispatch_rq_list():
> 
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
> 
> Bart.

In general it is better not to put in changes that affect older drivers
that are not regularly tested, I think.  I would prefer an opt-in for
drivers desiring higher performance.

Delaying the queue re-run vs. a ramp down might negatively affect performance.
I'm not sure how accurate the ramp is at discovering the optimum though.

-Ewan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 17:00     ` Ewan D. Milne
  2019-11-20 20:56       ` Bart Van Assche
@ 2019-11-21  0:08       ` Sumanesh Samanta
  2019-11-21  0:54       ` Ming Lei
  2 siblings, 0 replies; 34+ messages in thread
From: Sumanesh Samanta @ 2019-11-21  0:08 UTC (permalink / raw)
  To: linux-scsi

> Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> but I dislike wrapping the examination of that and the queue flag in
> a macro that makes it not obvious how the behavior is affected.

I think we can have a host template attribute and discard this check
altogether, that is, not check device_busy for both SSDs and HDDs.

As both you and Hannes mentioned, this change affects high-end controllers
most, and most of them have some storage IO size limit. Also, for HDDs
sequential IO is almost always large and touches the controller's max IO
size limit, so I am not sure merging matters for these kinds of
controllers. Databases use REDO logs and small sequential IO, but those
are targeted at SSDs, where latency and IOPS are far more important than
IO merging.
If this patch is made opt-in for drivers, any LLD that cannot take
advantage of the flag need not set it and would work as-is.
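
A sketch of what such an opt-in could look like (the no_device_busy bit
in the host template is hypothetical; it does not exist in the kernel
today):

	/* hypothetical: assumes a new "unsigned no_device_busy:1" bit in
	 * struct scsi_host_template, set only by LLDs that opt in
	 */
	static inline bool scsi_device_is_busy(struct scsi_device *sdev)
	{
		if (!sdev->host->hostt->no_device_busy &&
		    atomic_read(&sdev->device_busy) >= sdev->queue_depth)
			return true;
		return atomic_read(&sdev->device_blocked) > 0;
	}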


Thanks,

Sumanesh

On 11/20/2019 10:00 AM, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
>> Hmm.
>>
>> I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation to
>> getting things wrong...
>>
> This concerns me as well, it seems like the SCSI ML should have its
> own per-device attribute if we actually need to control this per-device
> instead of on a per-host or per-driver basis.  And it seems like this
> is something that is specific to high-performance drivers, so changing
> the way this works for all drivers seems a bit much.
>
> Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> but I dislike wrapping the examination of that and the queue flag in
> a macro that makes it not obvious how the behavior is affected.
> (Plus Hannes just submitted the patches to remove .use_cmd_list,
> which was another piece of ML functionality used by only a few drivers.)
>
> Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics() which may set
> it again.  So it's not clear to me how reliable this is.
>
> -Ewan
>
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 10:05   ` Hannes Reinecke
  2019-11-20 17:00     ` Ewan D. Milne
@ 2019-11-21  0:53     ` Ming Lei
  2019-11-21 15:45       ` Hannes Reinecke
  2019-11-22  2:18     ` Martin K. Petersen
  2 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-21  0:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> On 11/18/19 11:31 AM, Ming Lei wrote:
> > SCSI core uses the atomic variable of sdev->device_busy to track
> > in-flight IO requests dispatched to this scsi device. IO request may be
> > submitted from any CPU, so the cost for maintaining the shared atomic
> > counter can be very big on big NUMA machine with lots of CPU cores.
> > 
> > sdev->queue_depth is usually used for two purposes: 1) improve IO merge;
> > 2) fair IO request scattered among all LUNs.
> > 
> > blk-mq already provides fair request allocation among all active shared
> > request queues(LUNs), see hctx_may_queue().
> > 
> > NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> > is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't
> > play big role for reaching top SSD performance.
> > 
> > With this patch, big cost for tracking in-flight per-LUN requests via
> > atomic variable can be saved.
> > 
> > Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue
> > before changing this flag.
> > 
> > Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> > Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> > Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> > Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> > Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> > Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> > Cc: Ewan D. Milne <emilne@redhat.com>
> > Cc: Christoph Hellwig <hch@lst.de>,
> > Cc: Hannes Reinecke <hare@suse.de>
> > Cc: Bart Van Assche <bart.vanassche@wdc.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-sysfs.c       | 14 +++++++++++++-
> >  drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
> >  drivers/scsi/sd.c       |  4 ++++
> >  3 files changed, 35 insertions(+), 7 deletions(-)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index fca9b158f4a0..9cc0dd5f88a1 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
> >  QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
> >  #undef QUEUE_SYSFS_BIT_FNS
> >  
> > +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> > +		const char *page, size_t count)
> > +{
> > +	size_t ret;
> > +
> > +	blk_mq_freeze_queue(q);
> > +	ret = queue_store_nonrot(q, page, count);
> > +	blk_mq_unfreeze_queue(q);
> > +
> > +	return ret;
> > +}
> > +
> >  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
> >  {
> >  	switch (blk_queue_zoned_model(q)) {
> > @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
> >  static struct queue_sysfs_entry queue_nonrot_entry = {
> >  	.attr = {.name = "rotational", .mode = 0644 },
> >  	.show = queue_show_nonrot,
> > -	.store = queue_store_nonrot,
> > +	.store = queue_store_nonrot_freeze,
> >  };
> >  
> >  static struct queue_sysfs_entry queue_zoned_entry = {
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 62a86a82c38d..72655a049efd 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
> >  	if (starget->can_queue > 0)
> >  		atomic_dec(&starget->target_busy);
> >  
> > -	atomic_dec(&sdev->device_busy);
> > +	if (!blk_queue_nonrot(sdev->request_queue))
> > +		atomic_dec(&sdev->device_busy);
> >  }
> >  
> >  static void scsi_kick_queue(struct request_queue *q)
> > @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> >  
> >  static inline bool scsi_device_is_busy(struct scsi_device *sdev)
> >  {
> > -	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> > +	if (!blk_queue_nonrot(sdev->request_queue) &&
> > +			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> >  		return true;
> >  	if (atomic_read(&sdev->device_blocked) > 0)
> >  		return true;
> > @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> >  				  struct scsi_device *sdev)
> >  {
> >  	unsigned int busy;
> > +	bool bypass = blk_queue_nonrot(sdev->request_queue);
> >  
> > -	busy = atomic_inc_return(&sdev->device_busy) - 1;
> > +	if (!bypass)
> > +		busy = atomic_inc_return(&sdev->device_busy) - 1;
> > +	else
> > +		busy = 0;
> >  	if (atomic_read(&sdev->device_blocked)) {
> >  		if (busy)
> >  			goto out_dec;
> > @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> >  				   "unblocking device at zero depth\n"));
> >  	}
> >  
> > +	if (bypass)
> > +		return 1;
> > +
> >  	if (busy >= sdev->queue_depth)
> >  		goto out_dec;
> >  
> >  	return 1;
> >  out_dec:
> > -	atomic_dec(&sdev->device_busy);
> > +	if (!bypass)
> > +		atomic_dec(&sdev->device_busy);
> >  	return 0;
> >  }
> >  
> > @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
> >  	struct request_queue *q = hctx->queue;
> >  	struct scsi_device *sdev = q->queuedata;
> >  
> > -	atomic_dec(&sdev->device_busy);
> > +	if (!blk_queue_nonrot(sdev->request_queue))
> > +		atomic_dec(&sdev->device_busy);
> >  }
> >  
> >  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> > @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> >  	case BLK_STS_OK:
> >  		break;
> >  	case BLK_STS_RESOURCE:
> > -		if (atomic_read(&sdev->device_busy) ||
> > +		if ((!blk_queue_nonrot(sdev->request_queue) &&
> > +		     atomic_read(&sdev->device_busy)) ||
> >  		    scsi_device_blocked(sdev))
> >  			ret = BLK_STS_DEV_RESOURCE;
> >  		break;
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 0744c34468e1..c3d47117700d 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
> >  	rot = get_unaligned_be16(&buffer[4]);
> >  
> >  	if (rot == 1) {
> > +		blk_mq_freeze_queue(q);
> >  		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > +		blk_mq_unfreeze_queue(q);
> >  		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
> >  	}
> >  
> > @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
> >  		 * cause this to be updated correctly and any device which
> >  		 * doesn't support it should be treated as rotational.
> >  		 */
> > +		blk_mq_freeze_queue(q);
> >  		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> > +		blk_mq_unfreeze_queue(q);
> >  		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
> >  
> >  		if (scsi_device_supports_vpd(sdp)) {
> > 
> Hmm.
> 
> I must admit I patently don't like this explicit dependency on
> blk_nonrot(). Having a conditional counter is just an open invitation to
> getting things wrong...

That won't be an issue given this patch freezes queue when the
NONROT queue flag is changed.

> 
> I would far prefer if we could delegate any queueing decision to the
> elevators, and completely drop the device_busy flag for all devices.

If you drop it, you may cause a big sequential IO performance drop
on HDD; that is why this patch only bypasses sdev->queue_depth on
SSD. NVMe bypasses it because no one uses HDDs via NVMe.

> But I do see that this might not be easy to do, nor necessarily
> something which others agree on.
> So alternatively, can't we make this a per-host flag, instead of having
> it per device characteristic?

Are you sure every LUN shares the same NONROT flag? I think different
LUNs can have different NONROT flags.

> 
> I'm just thinking of a storage array with a massive cache; it's anyone's
> guess if 'nonrot' is set there, but at the same time these things can be
> fast, so they might be profiting from dropping the device_busy counter
> there, too.

If you are worrying about slow queue freeze, that won't be an issue,
given we don't switch to percpu mode until the flag is updated during
sd probe, and blk_mq_freeze_queue_wait() is pretty quick in ATOMIC
mode.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 17:00     ` Ewan D. Milne
  2019-11-20 20:56       ` Bart Van Assche
  2019-11-21  0:08       ` Sumanesh Samanta
@ 2019-11-21  0:54       ` Ming Lei
  2019-11-21 19:19         ` Ewan D. Milne
  2 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-21  0:54 UTC (permalink / raw)
  To: Ewan D. Milne
  Cc: Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Christoph Hellwig,
	Bart Van Assche

On Wed, Nov 20, 2019 at 12:00:19PM -0500, Ewan D. Milne wrote:
> On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > 
> > Hmm.
> > 
> > I must admit I patently don't like this explicit dependency on
> > blk_nonrot(). Having a conditional counter is just an open invitation to
> > getting things wrong...
> > 
> 
> This concerns me as well, it seems like the SCSI ML should have it's
> own per-device attribute if we actually need to control this per-device
> instead of on a per-host or per-driver basis.  And it seems like this
> is something that is specific to high-performance drivers, so changing
> the way this works for all drivers seems a bit much.
> 
> Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> but I dislike wrapping the examination of that and the queue flag in
> a macro that makes it not obvious how the behavior is affected.
> (Plus Hannes just submitted submitted the patches to remove .use_cmd_list,
> which was another piece of ML functionality used by only a few drivers.)
> 
> Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> clears it and then calls sd_read_block_characteristics() which may set
> it again.  So it's not clear to me how reliable this is.

The queue freeze is applied in sd_revalidate_disk() too, isn't it?

Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 20:56       ` Bart Van Assche
  2019-11-20 21:36         ` Ewan D. Milne
@ 2019-11-21  1:07         ` Ming Lei
  2019-11-22  2:59           ` Martin K. Petersen
  1 sibling, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-21  1:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Ewan D. Milne, Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Christoph Hellwig,
	Bart Van Assche

On Wed, Nov 20, 2019 at 12:56:21PM -0800, Bart Van Assche wrote:
> On 11/20/19 9:00 AM, Ewan D. Milne wrote:
> > On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > > I must admit I patently don't like this explicit dependency on
> > > blk_nonrot(). Having a conditional counter is just an open invitation to
> > > getting things wrong...
> > 
> > This concerns me as well, it seems like the SCSI ML should have it's
> > own per-device attribute if we actually need to control this per-device
> > instead of on a per-host or per-driver basis.  And it seems like this
> > is something that is specific to high-performance drivers, so changing
> > the way this works for all drivers seems a bit much.
> > 
> > Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> > but I dislike wrapping the examination of that and the queue flag in
> > a macro that makes it not obvious how the behavior is affected.
> > (Plus Hannes just submitted submitted the patches to remove .use_cmd_list,
> > which was another piece of ML functionality used by only a few drivers.)
> > 
> > Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> > the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> > clears it and then calls sd_read_block_characteristics() which may set
> > it again.  So it's not clear to me how reliable this is.
> 
> How about changing the default behavior into ignoring sdev->queue_depth and
> only honoring sdev->queue_depth if a "quirk" flag is set or if overridden by
> setting a sysfs attribute?

Using a 'quirk' was my first idea when we started to discuss the issue, but
the problem is that it isn't flexible; for example, one HBA may connect HDDs
in one setting and SSDs in another.

> My understanding is that the goal of the queue
> ramp-up/ramp-down mechanism is to reduce the number of times a SCSI device
> responds "BUSY".

I don't understand the motivation of ramp-up/ramp-down, maybe it is just
for fairness among LUNs. So far, blk-mq provides fair IO submission
among LUNs. One problem with ignoring it is that sequential IO performance
may drop a lot compared with before.

> An alternative for queue ramp-up/ramp-down is a delayed
> queue re-run. I think if scsi_queue_rq() returns BLK_STS_RESOURCE that the
> queue is only rerun after a delay. From blk_mq_dispatch_rq_list():
> 
> 	[ ... ]
> 	else if (needs_restart && (ret == BLK_STS_RESOURCE))
> 		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

The delayed re-run can't work, given that we call blk_mq_get_dispatch_budget()
before dequeuing a request from the scheduler/sw queue in order to improve IO merge.
At that time, run queue won't be involved.
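
To make the ordering concrete, below is a heavily simplified, paraphrased
sketch of the dispatch-from-scheduler loop (not the exact blk-mq-sched.c
source): the budget is taken before anything is dequeued, so a denied
budget leaves the request in the scheduler queue, where later bios can
still be merged into it, and no delayed re-run is scheduled at that point.

	/* paraphrased from blk_mq_do_dispatch_sched(), details trimmed */
	do {
		struct request *rq;

		if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
			break;

		if (!blk_mq_get_dispatch_budget(hctx))
			break;	/* no budget: nothing dequeued, rq stays mergeable */

		rq = e->type->ops.dispatch_request(hctx);
		if (!rq) {
			blk_mq_put_dispatch_budget(hctx);
			break;
		}

		list_add(&rq->queuelist, &rq_list);
	} while (blk_mq_dispatch_rq_list(q, &rq_list, true));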


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-21  0:53     ` Ming Lei
@ 2019-11-21 15:45       ` Hannes Reinecke
  2019-11-22  8:09         ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Hannes Reinecke @ 2019-11-21 15:45 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/21/19 1:53 AM, Ming Lei wrote:
> On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
>> On 11/18/19 11:31 AM, Ming Lei wrote:
>>> SCSI core uses the atomic variable of sdev->device_busy to track
>>> in-flight IO requests dispatched to this scsi device. IO request may be
>>> submitted from any CPU, so the cost for maintaining the shared atomic
>>> counter can be very big on big NUMA machine with lots of CPU cores.
>>>
>>> sdev->queue_depth is usually used for two purposes: 1) improve IO merge;
>>> 2) fair IO request scattered among all LUNs.
>>>
>>> blk-mq already provides fair request allocation among all active shared
>>> request queues(LUNs), see hctx_may_queue().
>>>
>>> NVMe doesn't have such per-request-queue(namespace) queue depth, so it
>>> is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't
>>> play big role for reaching top SSD performance.
>>>
>>> With this patch, big cost for tracking in-flight per-LUN requests via
>>> atomic variable can be saved.
>>>
>>> Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue
>>> before changing this flag.
>>>
>>> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
>>> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
>>> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
>>> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
>>> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
>>> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
>>> Cc: Ewan D. Milne <emilne@redhat.com>
>>> Cc: Christoph Hellwig <hch@lst.de>,
>>> Cc: Hannes Reinecke <hare@suse.de>
>>> Cc: Bart Van Assche <bart.vanassche@wdc.com>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  block/blk-sysfs.c       | 14 +++++++++++++-
>>>  drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
>>>  drivers/scsi/sd.c       |  4 ++++
>>>  3 files changed, 35 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>>> index fca9b158f4a0..9cc0dd5f88a1 100644
>>> --- a/block/blk-sysfs.c
>>> +++ b/block/blk-sysfs.c
>>> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
>>>  QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
>>>  #undef QUEUE_SYSFS_BIT_FNS
>>>  
>>> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
>>> +		const char *page, size_t count)
>>> +{
>>> +	size_t ret;
>>> +
>>> +	blk_mq_freeze_queue(q);
>>> +	ret = queue_store_nonrot(q, page, count);
>>> +	blk_mq_unfreeze_queue(q);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>>  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
>>>  {
>>>  	switch (blk_queue_zoned_model(q)) {
>>> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
>>>  static struct queue_sysfs_entry queue_nonrot_entry = {
>>>  	.attr = {.name = "rotational", .mode = 0644 },
>>>  	.show = queue_show_nonrot,
>>> -	.store = queue_store_nonrot,
>>> +	.store = queue_store_nonrot_freeze,
>>>  };
>>>  
>>>  static struct queue_sysfs_entry queue_zoned_entry = {
>>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>>> index 62a86a82c38d..72655a049efd 100644
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
>>>  	if (starget->can_queue > 0)
>>>  		atomic_dec(&starget->target_busy);
>>>  
>>> -	atomic_dec(&sdev->device_busy);
>>> +	if (!blk_queue_nonrot(sdev->request_queue))
>>> +		atomic_dec(&sdev->device_busy);
>>>  }
>>>  
>>>  static void scsi_kick_queue(struct request_queue *q)
>>> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
>>>  
>>>  static inline bool scsi_device_is_busy(struct scsi_device *sdev)
>>>  {
>>> -	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
>>> +	if (!blk_queue_nonrot(sdev->request_queue) &&
>>> +			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
>>>  		return true;
>>>  	if (atomic_read(&sdev->device_blocked) > 0)
>>>  		return true;
>>> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>>>  				  struct scsi_device *sdev)
>>>  {
>>>  	unsigned int busy;
>>> +	bool bypass = blk_queue_nonrot(sdev->request_queue);
>>>  
>>> -	busy = atomic_inc_return(&sdev->device_busy) - 1;
>>> +	if (!bypass)
>>> +		busy = atomic_inc_return(&sdev->device_busy) - 1;
>>> +	else
>>> +		busy = 0;
>>>  	if (atomic_read(&sdev->device_blocked)) {
>>>  		if (busy)
>>>  			goto out_dec;
>>> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
>>>  				   "unblocking device at zero depth\n"));
>>>  	}
>>>  
>>> +	if (bypass)
>>> +		return 1;
>>> +
>>>  	if (busy >= sdev->queue_depth)
>>>  		goto out_dec;
>>>  
>>>  	return 1;
>>>  out_dec:
>>> -	atomic_dec(&sdev->device_busy);
>>> +	if (!bypass)
>>> +		atomic_dec(&sdev->device_busy);
>>>  	return 0;
>>>  }
>>>  
>>> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
>>>  	struct request_queue *q = hctx->queue;
>>>  	struct scsi_device *sdev = q->queuedata;
>>>  
>>> -	atomic_dec(&sdev->device_busy);
>>> +	if (!blk_queue_nonrot(sdev->request_queue))
>>> +		atomic_dec(&sdev->device_busy);
>>>  }
>>>  
>>>  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
>>> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
>>>  	case BLK_STS_OK:
>>>  		break;
>>>  	case BLK_STS_RESOURCE:
>>> -		if (atomic_read(&sdev->device_busy) ||
>>> +		if ((!blk_queue_nonrot(sdev->request_queue) &&
>>> +		     atomic_read(&sdev->device_busy)) ||
>>>  		    scsi_device_blocked(sdev))
>>>  			ret = BLK_STS_DEV_RESOURCE;
>>>  		break;
>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>> index 0744c34468e1..c3d47117700d 100644
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
>>>  	rot = get_unaligned_be16(&buffer[4]);
>>>  
>>>  	if (rot == 1) {
>>> +		blk_mq_freeze_queue(q);
>>>  		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
>>> +		blk_mq_unfreeze_queue(q);
>>>  		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
>>>  	}
>>>  
>>> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
>>>  		 * cause this to be updated correctly and any device which
>>>  		 * doesn't support it should be treated as rotational.
>>>  		 */
>>> +		blk_mq_freeze_queue(q);
>>>  		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
>>> +		blk_mq_unfreeze_queue(q);
>>>  		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
>>>  
>>>  		if (scsi_device_supports_vpd(sdp)) {
>>>
>> Hmm.
>>
>> I must admit I patently don't like this explicit dependency on
>> blk_nonrot(). Having a conditional counter is just an open invitation to
>> getting things wrong...
> 
> That won't be an issue given this patch freezes queue when the
> NONROT queue flag is changed.
> 
That's not what I'm saying.

My issue is that we're turning the device_busy counter into a
conditional counter, which only must be accessed when that condition is
true.
This not only puts a burden on the developer (as he has to remember to
always check the condition), but also has possible issues when switching
between modes.
The latter you've addressed with ensuring that the queue must be frozen
when modifying that flag, but the former is a conceptual problem.

And that was what I had been referring to.

>>
>> I would far prefer if we could delegate any queueing decision to the
>> elevators, and completely drop the device_busy flag for all devices.
> 
> If you drop it, you may cause a big sequential IO performance drop
> on HDD; that is why this patch only bypasses sdev->queue_depth on
> SSD. NVMe bypasses it because no one uses HDDs via NVMe.
> 
I still wonder how much performance drop we actually see; what seems to
happen is that device_busy just arbitrarily pushes back to the block
layer, giving it more time to do merging.
I do think we can do better than that...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-21  0:54       ` Ming Lei
@ 2019-11-21 19:19         ` Ewan D. Milne
  0 siblings, 0 replies; 34+ messages in thread
From: Ewan D. Milne @ 2019-11-21 19:19 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Christoph Hellwig,
	Bart Van Assche

On Thu, 2019-11-21 at 08:54 +0800, Ming Lei wrote:
> On Wed, Nov 20, 2019 at 12:00:19PM -0500, Ewan D. Milne wrote:
> > On Wed, 2019-11-20 at 11:05 +0100, Hannes Reinecke wrote:
> > > 
> > > Hmm.
> > > 
> > > I must admit I patently don't like this explicit dependency on
> > > blk_nonrot(). Having a conditional counter is just an open invitation to
> > > getting things wrong...
> > > 
> > 
> > This concerns me as well, it seems like the SCSI ML should have it's
> > own per-device attribute if we actually need to control this per-device
> > instead of on a per-host or per-driver basis.  And it seems like this
> > is something that is specific to high-performance drivers, so changing
> > the way this works for all drivers seems a bit much.
> > 
> > Ordinarily I'd prefer a host template attribute as Sumanesh proposed,
> > but I dislike wrapping the examination of that and the queue flag in
> > a macro that makes it not obvious how the behavior is affected.
> > (Plus Hannes just submitted submitted the patches to remove .use_cmd_list,
> > which was another piece of ML functionality used by only a few drivers.)
> > 
> > Ming's patch does freeze the queue if NONROT is changed by sysfs, but
> > the flag can be changed by other kernel code, e.g. sd_revalidate_disk()
> > clears it and then calls sd_read_block_characteristics() which may set
> > it again.  So it's not clear to me how reliable this is.
> 
> The queue freeze is applied in sd_revalidate_disk() too, isn't it?
> 

Yes, sorry, you are right, your patch does add this.  But if anything else changes
the NONROT attribute for a queue associated with a SCSI device in the future,
it would have to freeze the queue, because the device_busy counter mechanism
would rely on it to work right.  This isn't an obvious connection, it seems to me.
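
One way to make that connection harder to miss (purely a sketch of a
possible follow-up, not part of the posted series) would be to route every
NONROT update on a SCSI-owned queue through a helper that does the freeze
itself:

	/* hypothetical helper, not in the posted patches */
	static void scsi_set_queue_nonrot(struct request_queue *q, bool nonrot)
	{
		blk_mq_freeze_queue(q);
		if (nonrot)
			blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
		else
			blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
		blk_mq_unfreeze_queue(q);
	}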

-Ewan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 10:05   ` Hannes Reinecke
  2019-11-20 17:00     ` Ewan D. Milne
  2019-11-21  0:53     ` Ming Lei
@ 2019-11-22  2:18     ` Martin K. Petersen
  2 siblings, 0 replies; 34+ messages in thread
From: Martin K. Petersen @ 2019-11-22  2:18 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Ming Lei, Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche


Hannes,

> I must admit I patently don't like this explicit dependency on
> blk_nonrot().

The whole idea behind using rotational/non-rotational as a trigger for
this is a red herring.

Fast devices also have internal queuing limitations. And more
importantly: Fast devices also experience temporary shortages which can
lead to congestion.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-20 21:36         ` Ewan D. Milne
@ 2019-11-22  2:25           ` Martin K. Petersen
  0 siblings, 0 replies; 34+ messages in thread
From: Martin K. Petersen @ 2019-11-22  2:25 UTC (permalink / raw)
  To: Ewan D. Milne
  Cc: Bart Van Assche, Hannes Reinecke, Ming Lei, Jens Axboe,
	linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche


Ewan,

> Delaying the queue re-run vs. a ramp down might negatively affect
> performance.  I'm not sure how accurate the ramp is at discovering the
> optimum though.

The optimum - as well as the actual limit - might change over time in a
multi-initiator setup as other hosts grab resources on the device.

I do think that the only way forward here, if we want to avoid counting
outstanding commands for performance reasons, is to ensure that the
BUSY/QUEUE_FULL/TASK_SET_FULL handling is relatively fast path and not
something deep in the bowels of error handling. Which it actually
isn't. But I do think we'll need to take a closer look.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-21  1:07         ` Ming Lei
@ 2019-11-22  2:59           ` Martin K. Petersen
  2019-11-22  3:24             ` Ming Lei
  2019-11-22 16:38             ` Sumanesh Samanta
  0 siblings, 2 replies; 34+ messages in thread
From: Martin K. Petersen @ 2019-11-22  2:59 UTC (permalink / raw)
  To: Ming Lei
  Cc: Bart Van Assche, Ewan D. Milne, Hannes Reinecke, Jens Axboe,
	linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche


Ming,

> I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> for fairness among LUNs.

Congestion control. Devices have actual, physical limitations that are
different from the tag context limitations on the HBA. You don't have
that problem on NVMe because (at least for PCIe) the storage device and
the controller are one and the same.

If you submit 100000 concurrent requests to a SCSI drive that does 100
IOPS, some requests will time out before they get serviced.
Consequently we have the ability to raise and lower the queue depth to
constrain the amount of requests in flight to a given device at any
point in time.

Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
off. We frequently see issues where the host can submit burst I/O much
faster than the device can de-stage from cache. In that scenario the
device reports BUSY/QF/TSF and we will back off so the device gets a
chance to recover. If we just let the application submit new I/O without
bounds, the system would never actually recover.

Note that the actual, physical limitations for how many commands a
target can handle are typically much, much lower than the number of tags
the HBA can manage. SATA devices can only express 32 concurrent
commands. SAS devices typically 128 concurrent commands per
port. Arrays differ.

If we ignore the RAID controller use case where the controller
internally queues and arbitrates commands between many devices, how is
submitting 1000 concurrent requests to a device which only has 128
command slots going to work?

Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
blindly stop restricting the number of I/Os in flight in the ML, we may
exceed either the capabilities of what the transport protocol can
express or internal device resources.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22  2:59           ` Martin K. Petersen
@ 2019-11-22  3:24             ` Ming Lei
  2019-11-22 16:38             ` Sumanesh Samanta
  1 sibling, 0 replies; 34+ messages in thread
From: Ming Lei @ 2019-11-22  3:24 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bart Van Assche, Ewan D. Milne, Hannes Reinecke, Jens Axboe,
	linux-block, James E . J . Bottomley, linux-scsi, Sathya Prakash,
	Chaitra P B, Suganath Prabu Subramani, Kashyap Desai,
	Sumit Saxena, Shivasharan S, Christoph Hellwig, Bart Van Assche

Hi Martin,

On Thu, Nov 21, 2019 at 09:59:53PM -0500, Martin K. Petersen wrote:
> 
> Ming,
> 
> > I don't understand the motivation of ramp-up/ramp-down, maybe it is just
> > for fairness among LUNs.
> 
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
> 
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.

blk-mq already puts a limit on each LUN; the number is
host_queue_depth / nr_active_LUNs, see hctx_may_queue().
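
Roughly speaking (this is a paraphrase, not the exact blk-mq-tag.c source),
the fairness check for shared tag sets gives each active queue about an
equal share of the driver tags, with a small minimum:

	/* paraphrased sketch of the hctx_may_queue() fairness logic */
	static bool may_queue(struct blk_mq_hw_ctx *hctx, unsigned int depth)
	{
		unsigned int users, fair;

		if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
			return true;

		users = atomic_read(&hctx->tags->active_queues);
		if (!users)
			return true;

		/* allow roughly depth / users tags per active queue */
		fair = max((depth + users - 1) / users, 4U);
		return atomic_read(&hctx->nr_active) < fair;
	}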

It looks like this way works for NVMe; that is why I try to bypass
.device_busy for SSD, since it is too expensive on fast storage. Even
Hannes wants to kill it completely.

> 
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
> 
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.

I understand SATA's host queue depth is set as 32.

But a SAS HBA's queue depth is often big, so do we rely on .device_busy for
throttling requests to SAS?

> 
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?

For SSD, I guess it might be fine, given NVMe sets per-hw-queue depth
as 1023 usually. That means the concurrent requests can be as many as 
1023 * nr_hw_queues in case of single namespace.

> 
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.

OK, one conservative approach may be to bypass .device_busy only in the
case of SSDs, and only for some high-end HBAs.

Or maybe we can wire up sdev->queue_depth with the block layer's scheduler
queue depth? One issue is that sdev->queue_depth may be updated from time
to time.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-21 15:45       ` Hannes Reinecke
@ 2019-11-22  8:09         ` Ming Lei
  2019-11-22 18:14           ` Bart Van Assche
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-22  8:09 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote:
> On 11/21/19 1:53 AM, Ming Lei wrote:
> > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> >> On 11/18/19 11:31 AM, Ming Lei wrote:
> >>> SCSI core uses the atomic variable of sdev->device_busy to track
> >>> in-flight IO requests dispatched to this scsi device. IO request may be
> >>> submitted from any CPU, so the cost for maintaining the shared atomic
> >>> counter can be very big on big NUMA machine with lots of CPU cores.
> >>>
> >>> sdev->queue_depth is usually used for two purposes: 1) improve IO merge;
> >>> 2) fair IO request scattered among all LUNs.
> >>>
> >>> blk-mq already provides fair request allocation among all active shared
> >>> request queues(LUNs), see hctx_may_queue().
> >>>
> >>> NVMe doesn't have such per-request-queue(namespace) queue depth, so it
> >>> is reasonable to ignore the limit for SCSI SSD too. Also IO merge won't
> >>> play big role for reaching top SSD performance.
> >>>
> >>> With this patch, big cost for tracking in-flight per-LUN requests via
> >>> atomic variable can be saved.
> >>>
> >>> Given QUEUE_FLAG_NONROT is read in IO path, we have to freeze queue
> >>> before changing this flag.
> >>>
> >>> Cc: Sathya Prakash <sathya.prakash@broadcom.com>
> >>> Cc: Chaitra P B <chaitra.basappa@broadcom.com>
> >>> Cc: Suganath Prabu Subramani <suganath-prabu.subramani@broadcom.com>
> >>> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> >>> Cc: Sumit Saxena <sumit.saxena@broadcom.com>
> >>> Cc: Shivasharan S <shivasharan.srikanteshwara@broadcom.com>
> >>> Cc: Ewan D. Milne <emilne@redhat.com>
> >>> Cc: Christoph Hellwig <hch@lst.de>,
> >>> Cc: Hannes Reinecke <hare@suse.de>
> >>> Cc: Bart Van Assche <bart.vanassche@wdc.com>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  block/blk-sysfs.c       | 14 +++++++++++++-
> >>>  drivers/scsi/scsi_lib.c | 24 ++++++++++++++++++------
> >>>  drivers/scsi/sd.c       |  4 ++++
> >>>  3 files changed, 35 insertions(+), 7 deletions(-)
> >>>
> >>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> >>> index fca9b158f4a0..9cc0dd5f88a1 100644
> >>> --- a/block/blk-sysfs.c
> >>> +++ b/block/blk-sysfs.c
> >>> @@ -281,6 +281,18 @@ QUEUE_SYSFS_BIT_FNS(random, ADD_RANDOM, 0);
> >>>  QUEUE_SYSFS_BIT_FNS(iostats, IO_STAT, 0);
> >>>  #undef QUEUE_SYSFS_BIT_FNS
> >>>  
> >>> +static ssize_t queue_store_nonrot_freeze(struct request_queue *q,
> >>> +		const char *page, size_t count)
> >>> +{
> >>> +	size_t ret;
> >>> +
> >>> +	blk_mq_freeze_queue(q);
> >>> +	ret = queue_store_nonrot(q, page, count);
> >>> +	blk_mq_unfreeze_queue(q);
> >>> +
> >>> +	return ret;
> >>> +}
> >>> +
> >>>  static ssize_t queue_zoned_show(struct request_queue *q, char *page)
> >>>  {
> >>>  	switch (blk_queue_zoned_model(q)) {
> >>> @@ -642,7 +654,7 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
> >>>  static struct queue_sysfs_entry queue_nonrot_entry = {
> >>>  	.attr = {.name = "rotational", .mode = 0644 },
> >>>  	.show = queue_show_nonrot,
> >>> -	.store = queue_store_nonrot,
> >>> +	.store = queue_store_nonrot_freeze,
> >>>  };
> >>>  
> >>>  static struct queue_sysfs_entry queue_zoned_entry = {
> >>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> >>> index 62a86a82c38d..72655a049efd 100644
> >>> --- a/drivers/scsi/scsi_lib.c
> >>> +++ b/drivers/scsi/scsi_lib.c
> >>> @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev, struct scsi_cmnd *cmd)
> >>>  	if (starget->can_queue > 0)
> >>>  		atomic_dec(&starget->target_busy);
> >>>  
> >>> -	atomic_dec(&sdev->device_busy);
> >>> +	if (!blk_queue_nonrot(sdev->request_queue))
> >>> +		atomic_dec(&sdev->device_busy);
> >>>  }
> >>>  
> >>>  static void scsi_kick_queue(struct request_queue *q)
> >>> @@ -410,7 +411,8 @@ static void scsi_single_lun_run(struct scsi_device *current_sdev)
> >>>  
> >>>  static inline bool scsi_device_is_busy(struct scsi_device *sdev)
> >>>  {
> >>> -	if (atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> >>> +	if (!blk_queue_nonrot(sdev->request_queue) &&
> >>> +			atomic_read(&sdev->device_busy) >= sdev->queue_depth)
> >>>  		return true;
> >>>  	if (atomic_read(&sdev->device_blocked) > 0)
> >>>  		return true;
> >>> @@ -1283,8 +1285,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> >>>  				  struct scsi_device *sdev)
> >>>  {
> >>>  	unsigned int busy;
> >>> +	bool bypass = blk_queue_nonrot(sdev->request_queue);
> >>>  
> >>> -	busy = atomic_inc_return(&sdev->device_busy) - 1;
> >>> +	if (!bypass)
> >>> +		busy = atomic_inc_return(&sdev->device_busy) - 1;
> >>> +	else
> >>> +		busy = 0;
> >>>  	if (atomic_read(&sdev->device_blocked)) {
> >>>  		if (busy)
> >>>  			goto out_dec;
> >>> @@ -1298,12 +1304,16 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
> >>>  				   "unblocking device at zero depth\n"));
> >>>  	}
> >>>  
> >>> +	if (bypass)
> >>> +		return 1;
> >>> +
> >>>  	if (busy >= sdev->queue_depth)
> >>>  		goto out_dec;
> >>>  
> >>>  	return 1;
> >>>  out_dec:
> >>> -	atomic_dec(&sdev->device_busy);
> >>> +	if (!bypass)
> >>> +		atomic_dec(&sdev->device_busy);
> >>>  	return 0;
> >>>  }
> >>>  
> >>> @@ -1624,7 +1634,8 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
> >>>  	struct request_queue *q = hctx->queue;
> >>>  	struct scsi_device *sdev = q->queuedata;
> >>>  
> >>> -	atomic_dec(&sdev->device_busy);
> >>> +	if (!blk_queue_nonrot(sdev->request_queue))
> >>> +		atomic_dec(&sdev->device_busy);
> >>>  }
> >>>  
> >>>  static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
> >>> @@ -1731,7 +1742,8 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>>  	case BLK_STS_OK:
> >>>  		break;
> >>>  	case BLK_STS_RESOURCE:
> >>> -		if (atomic_read(&sdev->device_busy) ||
> >>> +		if ((!blk_queue_nonrot(sdev->request_queue) &&
> >>> +		     atomic_read(&sdev->device_busy)) ||
> >>>  		    scsi_device_blocked(sdev))
> >>>  			ret = BLK_STS_DEV_RESOURCE;
> >>>  		break;
> >>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> >>> index 0744c34468e1..c3d47117700d 100644
> >>> --- a/drivers/scsi/sd.c
> >>> +++ b/drivers/scsi/sd.c
> >>> @@ -2927,7 +2927,9 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp)
> >>>  	rot = get_unaligned_be16(&buffer[4]);
> >>>  
> >>>  	if (rot == 1) {
> >>> +		blk_mq_freeze_queue(q);
> >>>  		blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> >>> +		blk_mq_unfreeze_queue(q);
> >>>  		blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
> >>>  	}
> >>>  
> >>> @@ -3123,7 +3125,9 @@ static int sd_revalidate_disk(struct gendisk *disk)
> >>>  		 * cause this to be updated correctly and any device which
> >>>  		 * doesn't support it should be treated as rotational.
> >>>  		 */
> >>> +		blk_mq_freeze_queue(q);
> >>>  		blk_queue_flag_clear(QUEUE_FLAG_NONROT, q);
> >>> +		blk_mq_unfreeze_queue(q);
> >>>  		blk_queue_flag_set(QUEUE_FLAG_ADD_RANDOM, q);
> >>>  
> >>>  		if (scsi_device_supports_vpd(sdp)) {
> >>>
> >> Hmm.
> >>
> >> I must admit I patently don't like this explicit dependency on
> >> blk_nonrot(). Having a conditional counter is just an open invitation to
> >> getting things wrong...
> > 
> > That won't be an issue given this patch freezes queue when the
> > NONROT queue flag is changed.
> > 
> That's not what I'm saying.
> 
> My issue is that we're turning the device_busy counter into a
> conditional counter, which only must be accessed when that condition is
> true.
> This not only puts a burden on the developer (as he has to remember to
> always check the condition), but also has possible issues when switching
> between modes.

The conditional check can be avoided only if we can kill device_busy
completely. However, as Martin commented, that isn't realistic so far.

> The latter you've addressed with ensuring that the queue must be frozen
> when modifying that flag, but the former is a conceptual problem.

It can't be perfect, but it can be made good enough by the latter.

Someone said 'Perfect is the enemy of the good', :-)

> 
> And that was what I had been referring to.
> 
> >>
> >> I would far prefer if we could delegate any queueing decision to the
> >> elevators, and completely drop the device_busy flag for all devices.
> > 
> > If you drop it, you may cause a big sequential IO performance drop
> > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > SSD. NVMe bypasses it because no one uses HDDs via NVMe.
> > 
> I still wonder how much performance drop we actually see; what seems to
> happen is that device_busy just arbitrarily pushes back to the block
> layer, giving it more time to do merging.
> I do think we can do better than that...

For example, running the following script[1] on 4-core VM:

------------------------------------------
                    | QD:255    | QD: 32  | 
------------------------------------------
fio read throughput | 825MB/s   | 1432MB/s|   
------------------------------------------

note:

QD: scsi_device's queue depth, which can be passed to test.sh.
scsi_debug's device queue depth is 255 by default, which is >= .can_queue

[1] test.sh

#!/bin/bash
QD=$1

modprobe scsi_debug ndelay=200000
sleep 2
DEVN=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DEV="/dev/"$DEVN

echo $QD > /sys/block/$DEVN/device/queue_depth
SQD=`cat /sys/block/$DEVN/device/queue_depth`
echo "scsi device queue depth: $SQD"

fio --readwrite=read --filename=$DEV --runtime=20s --time_based --group_reporting=1 \
    --ioengine=libaio --direct=1 --iodepth=64 --bs=4k --numjobs=4 --name=seq_perf
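
The two columns in the table above come from two runs of the script, with
the scsi_debug module removed in between, along these lines (the exact
sequence may vary):

	./test.sh 255          # default scsi_debug queue depth
	modprobe -r scsi_debug
	./test.sh 32           # per-LUN queue depth limited to 32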



Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22  2:59           ` Martin K. Petersen
  2019-11-22  3:24             ` Ming Lei
@ 2019-11-22 16:38             ` Sumanesh Samanta
  1 sibling, 0 replies; 34+ messages in thread
From: Sumanesh Samanta @ 2019-11-22 16:38 UTC (permalink / raw)
  To: Martin K. Petersen, Ming Lei
  Cc: Bart Van Assche, Ewan D. Milne, Hannes Reinecke, Jens Axboe,
	linux-block, James E . J . Bottomley, linux-scsi, Sathya Prakash,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche

>>If we ignore the RAID controller use case where the controller
>>internally queues and arbitrates commands between many devices

These controllers should not be "ignored", but rather enabled. Many of them virtualize both HDD and NVMe devices behind them, and are thus forced to expose themselves as SCSI controllers.
However, they have their own queue management and IO merging capabilities. Many can hold IOs in a queue and pull them as needed (just like NVMe), and thus do not care whether many IOs are sent to a device or controller, or whether there is congestion. In case of congestion, the IO simply waits in the queue, backed by advanced timeout handling.
Besides, as Ming pointed out, the block layer (hctx_may_queue()) already limits IO on a per-controller and per-LUN basis.

Overall, if the proposal does not work for all cases, then at least it should be made optional for high-end controllers, so that they are not disadvantaged vis-a-vis NVMe just because they expose themselves as SCSI in order to support a wide range of devices behind them.

thanks,
Sumanesh

On 11/21/2019 7:59 PM, Martin K. Petersen wrote:
> Ming,
>
>> I don't understand the motivation of ramp-up/ramp-down, maybe it is just
>> for fairness among LUNs.
> Congestion control. Devices have actual, physical limitations that are
> different from the tag context limitations on the HBA. You don't have
> that problem on NVMe because (at least for PCIe) the storage device and
> the controller are one and the same.
>
> If you submit 100000 concurrent requests to a SCSI drive that does 100
> IOPS, some requests will time out before they get serviced.
> Consequently we have the ability to raise and lower the queue depth to
> constrain the amount of requests in flight to a given device at any
> point in time.
>
> Also, devices use BUSY/QUEUE_FULL/TASK_SET_FULL to cause the OS to back
> off. We frequently see issues where the host can submit burst I/O much
> faster than the device can de-stage from cache. In that scenario the
> device reports BUSY/QF/TSF and we will back off so the device gets a
> chance to recover. If we just let the application submit new I/O without
> bounds, the system would never actually recover.
>
> Note that the actual, physical limitations for how many commands a
> target can handle are typically much, much lower than the number of tags
> the HBA can manage. SATA devices can only express 32 concurrent
> commands. SAS devices typically 128 concurrent commands per
> port. Arrays differ.
>
> If we ignore the RAID controller use case where the controller
> internally queues and arbitrates commands between many devices, how is
> submitting 1000 concurrent requests to a device which only has 128
> command slots going to work?
>
> Some HBAs have special sauce to manage BUSY/QF/TSF, some don't. If we
> blindly stop restricting the number of I/Os in flight in the ML, we may
> exceed either the capabilities of what the transport protocol can
> express or internal device resources.
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22  8:09         ` Ming Lei
@ 2019-11-22 18:14           ` Bart Van Assche
  2019-11-22 18:26             ` James Smart
                               ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Bart Van Assche @ 2019-11-22 18:14 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/22/19 12:09 AM, Ming Lei wrote:
> On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote:
>> On 11/21/19 1:53 AM, Ming Lei wrote:
>>> On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
>>>> I would far prefer if we could delegate any queueing decision to the
>>>> elevators, and completely drop the device_busy flag for all devices.
>>>
>>> If you drop it, you may cause a big sequential IO performance drop
>>> on HDD; that is why this patch only bypasses sdev->queue_depth on
>>> SSD. NVMe bypasses it because no one uses HDDs via NVMe.
>>>
>> I still wonder how much performance drop we actually see; what seems to
>> happen is that device_busy just arbitrarily pushes back to the block
>> layer, giving it more time to do merging.
>> I do think we can do better than that...
> 
> For example, running the following script[1] on 4-core VM:
> 
> ------------------------------------------
>                      | QD:255    | QD: 32  |
> ------------------------------------------
> fio read throughput | 825MB/s   | 1432MB/s|
> ------------------------------------------
> 
> [ ... ]

Hi Ming,

Thanks for having shared these numbers. I think this is very useful 
information. Do these results show the performance drop that happens if 
/sys/block/.../device/queue_depth exceeds .can_queue? What I am 
wondering about is how important these results are in the context of 
this discussion. Are there any modern SCSI devices for which a SCSI LLD 
sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the 
device responds with BUSY? What surprised me is that only three SCSI 
LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that 
mean that BUSY responses from a SCSI device or HBA are rare?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22 18:14           ` Bart Van Assche
@ 2019-11-22 18:26             ` James Smart
  2019-11-22 20:46               ` Bart Van Assche
  2019-11-22 22:00             ` Ming Lei
  2019-11-25 18:28             ` Ewan D. Milne
  2 siblings, 1 reply; 34+ messages in thread
From: James Smart @ 2019-11-22 18:26 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche



On 11/22/2019 10:14 AM, Bart Van Assche wrote:
> Thanks for having shared these numbers. I think this is very useful 
> information. Do these results show the performance drop that happens 
> if /sys/block/.../device/queue_depth exceeds .can_queue? What I am 
> wondering about is how important these results are in the context of 
> this discussion. Are there any modern SCSI devices for which a SCSI 
> LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the 
> device responds with BUSY? What surprised me is that only three SCSI 
> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that 
> mean that BUSY responses from a SCSI device or HBA are rare?

That's because most of the drivers, which had queue full ramp up/ramp 
down in them and would have called scsi_track_queue_full() converted 
over to the moved-queue-full-handling-in-the-mid-layer, indicated by 
sht->track_queue_depth = 1.

Yes - it is still hit a lot!

-- james


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22 18:26             ` James Smart
@ 2019-11-22 20:46               ` Bart Van Assche
  2019-11-22 22:04                 ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2019-11-22 20:46 UTC (permalink / raw)
  To: James Smart, Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On 11/22/19 10:26 AM, James Smart wrote:
> On 11/22/2019 10:14 AM, Bart Van Assche wrote:
>> Thanks for having shared these numbers. I think this is very useful 
>> information. Do these results show the performance drop that happens 
>> if /sys/block/.../device/queue_depth exceeds .can_queue? What I am 
>> wondering about is how important these results are in the context of 
>> this discussion. Are there any modern SCSI devices for which a SCSI 
>> LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the 
>> device responds with BUSY? What surprised me is that only three SCSI 
>> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that 
>> mean that BUSY responses from a SCSI device or HBA are rare?
> 
> That's because most of the drivers, which had queue full ramp up/ramp 
> down in them and would have called scsi_track_queue_full() converted 
> over to the moved-queue-full-handling-in-the-mid-layer, indicated by 
> sht->track_queue_depth = 1.
> 
> Yes - it is still hit a lot!

Hi James,

In the systems that I have been working on myself I made sure that the 
BUSY condition is rarely or never encountered. Anyway, since there are 
setups in which this condition is hit frequently we need to make sure 
that these setups keep performing well. I'm wondering now whether we 
should try to come up with an algorithm for maintaining 
sdev->device_busy only if it improves performance and for not 
maintaining sdev->device_busy for devices/HBAs that don't need it.

Bart.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22 18:14           ` Bart Van Assche
  2019-11-22 18:26             ` James Smart
@ 2019-11-22 22:00             ` Ming Lei
  2019-11-25 18:28             ` Ewan D. Milne
  2 siblings, 0 replies; 34+ messages in thread
From: Ming Lei @ 2019-11-22 22:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Ewan D . Milne,
	Christoph Hellwig, Bart Van Assche

On Fri, Nov 22, 2019 at 10:14:51AM -0800, Bart Van Assche wrote:
> On 11/22/19 12:09 AM, Ming Lei wrote:
> > On Thu, Nov 21, 2019 at 04:45:48PM +0100, Hannes Reinecke wrote:
> > > On 11/21/19 1:53 AM, Ming Lei wrote:
> > > > On Wed, Nov 20, 2019 at 11:05:24AM +0100, Hannes Reinecke wrote:
> > > > > I would far prefer if we could delegate any queueing decision to the
> > > > > elevators, and completely drop the device_busy flag for all devices.
> > > > 
> > > > If you drop it, you may cause a big sequential IO performance drop
> > > > on HDD; that is why this patch only bypasses sdev->queue_depth on
> > > > SSD. NVMe bypasses it because no one uses HDDs via NVMe.
> > > > 
> > > I still wonder how much performance drop we actually see; what seems to
> > > happen is that device_busy just arbitrarily pushes back to the block
> > > layer, giving it more time to do merging.
> > > I do think we can do better than that...
> > 
> > For example, running the following script[1] on 4-core VM:
> > 
> > ------------------------------------------
> >                      | QD:255    | QD: 32  |
> > ------------------------------------------
> > fio read throughput | 825MB/s   | 1432MB/s|
> > ------------------------------------------
> > 
> > [ ... ]
> 
> Hi Ming,
> 
> Thanks for having shared these numbers. I think this is very useful
> information. Do these results show the performance drop that happens if
> /sys/block/.../device/queue_depth exceeds .can_queue? What I am wondering

The above test just shows that IO merging plays an important role here, and
one important point for triggering IO merge is that .get_budget returns
false.

If sdev->queue_depth is too big, .get_budget may never return false.

That is why this patch just bypasses .device_busy for SSD.
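
In code terms, "returning false" happens in the budget path once the
per-LUN counter reaches queue_depth; a heavily simplified sketch of the
scsi_mq_get_budget()/scsi_dev_queue_ready() pair (device_blocked and
restart handling trimmed) looks like:

	/* simplified: how the budget ties to device_busy, details trimmed */
	static bool get_budget_simplified(struct scsi_device *sdev)
	{
		unsigned int busy = atomic_inc_return(&sdev->device_busy) - 1;

		if (busy >= sdev->queue_depth) {
			atomic_dec(&sdev->device_busy);
			return false;	/* request stays in the scheduler and can merge */
		}
		return true;		/* budget granted, dispatch proceeds */
	}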

> about is how important these results are in the context of this discussion.
> Are there any modern SCSI devices for which a SCSI LLD sets
> scsi_host->can_queue and scsi_host->cmd_per_lun such that the device
> responds with BUSY? What surprised me is that only three SCSI LLDs call

There are many such HBAs for which sdev->queue_depth is smaller than
.can_queue, especially in the case of a small number of LUNs.

> scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that mean that BUSY
> responses from a SCSI device or HBA are rare?

It is only true for some HBAs.

thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22 20:46               ` Bart Van Assche
@ 2019-11-22 22:04                 ` Ming Lei
  0 siblings, 0 replies; 34+ messages in thread
From: Ming Lei @ 2019-11-22 22:04 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: James Smart, Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash, Chaitra P B, Suganath Prabu Subramani,
	Kashyap Desai, Sumit Saxena, Shivasharan S, Ewan D . Milne,
	Christoph Hellwig, Bart Van Assche

On Fri, Nov 22, 2019 at 12:46:48PM -0800, Bart Van Assche wrote:
> On 11/22/19 10:26 AM, James Smart wrote:
> > On 11/22/2019 10:14 AM, Bart Van Assche wrote:
> > > Thanks for having shared these numbers. I think this is very useful
> > > information. Do these results show the performance drop that happens
> > > if /sys/block/.../device/queue_depth exceeds .can_queue? What I am
> > > wondering about is how important these results are in the context of
> > > this discussion. Are there any modern SCSI devices for which a SCSI
> > > LLD sets scsi_host->can_queue and scsi_host->cmd_per_lun such that
> > > the device responds with BUSY? What surprised me is that only three
> > > SCSI LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does
> > > that mean that BUSY responses from a SCSI device or HBA are rare?
> > 
> > That's because most of the drivers that had queue full ramp up/ramp
> > down in them, and would have called scsi_track_queue_full(),
> > converted over to the queue-full handling that moved into the
> > midlayer, indicated by sht->track_queue_depth = 1.
> > 
> > Yes - it is still hit a lot!
> 
> Hi James,
> 
> In the systems that I have been working on myself I made sure that the BUSY
> condition is rarely or never encountered. Anyway, since there are setups in
> which this condition is hit frequently we need to make sure that these
> setups keep performing well. I'm wondering now whether we should try to come
> up with an algorithm for maintaining sdev->device_busy only if it improves
> performance and for not maintaining sdev->device_busy for devices/HBAs that
> don't need it.

The simplest policy could be to maintain sdev->device_busy only for HDDs.
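
For illustration only (the helper is made up, and "SSD" is assumed to
mean a queue flagged non-rotational), that policy boils down to a
per-device predicate evaluated once, e.g. at revalidate time, instead
of a decision taken on every I/O:

#include <linux/blkdev.h>
#include <scsi/scsi_device.h>

/* Sketch: "maintain sdev->device_busy only for HDDs". */
static bool sketch_needs_device_busy(struct scsi_device *sdev)
{
        return !blk_queue_nonrot(sdev->request_queue);
}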


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-22 18:14           ` Bart Van Assche
  2019-11-22 18:26             ` James Smart
  2019-11-22 22:00             ` Ming Lei
@ 2019-11-25 18:28             ` Ewan D. Milne
  2019-11-25 22:14               ` James Smart
  2 siblings, 1 reply; 34+ messages in thread
From: Ewan D. Milne @ 2019-11-25 18:28 UTC (permalink / raw)
  To: Bart Van Assche, Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche

On Fri, 2019-11-22 at 10:14 -0800, Bart Van Assche wrote:
> 
> Hi Ming,
> 
> Thanks for having shared these numbers. I think this is very useful 
> information. Do these results show the performance drop that happens if 
> /sys/block/.../device/queue_depth exceeds .can_queue? What I am 
> wondering about is how important these results are in the context of 
> this discussion. Are there any modern SCSI devices for which a SCSI LLD 
> sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the 
> device responds with BUSY? What surprised me is that only three SCSI 
> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that 
> mean that BUSY responses from a SCSI device or HBA are rare?
> 

Some FC HBAs end up returning busy from ->queuecommand(), but I think
this is more commonly due to there being an issue with the rport rather
than with the device.

-Ewan


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD
  2019-11-25 18:28             ` Ewan D. Milne
@ 2019-11-25 22:14               ` James Smart
  0 siblings, 0 replies; 34+ messages in thread
From: James Smart @ 2019-11-25 22:14 UTC (permalink / raw)
  To: Ewan D. Milne, Bart Van Assche, Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, James E . J . Bottomley,
	Martin K . Petersen, linux-scsi, Sathya Prakash, Chaitra P B,
	Suganath Prabu Subramani, Kashyap Desai, Sumit Saxena,
	Shivasharan S, Christoph Hellwig, Bart Van Assche

On 11/25/2019 10:28 AM, Ewan D. Milne wrote:
> On Fri, 2019-11-22 at 10:14 -0800, Bart Van Assche wrote:
>> Hi Ming,
>>
>> Thanks for having shared these numbers. I think this is very useful
>> information. Do these results show the performance drop that happens if
>> /sys/block/.../device/queue_depth exceeds .can_queue? What I am
>> wondering about is how important these results are in the context of
>> this discussion. Are there any modern SCSI devices for which a SCSI LLD
>> sets scsi_host->can_queue and scsi_host->cmd_per_lun such that the
>> device responds with BUSY? What surprised me is that only three SCSI
>> LLDs call scsi_track_queue_full() (mptsas, bfa, esp_scsi). Does that
>> mean that BUSY responses from a SCSI device or HBA are rare?
>>
> Some FC HBAs end up returning busy from ->queuecommand(), but I think
> this is more commonly due to there being an issue with the rport rather
> than with the device.
>
> -Ewan
>

True - but I would assume busy from queuecommand() is different from 
BUSY/QUEUE_FULL via a SCSI response.

Adapter queuecommand busies can be due to out-of-resource limits in the
driver - such as when I_T I/O count limits enforced by the driver are
reached, or when some other adapter resource limit is hit. .can_queue
covers most of those - but we sometimes overcommit the adapter, with a
can_queue on the physical port as well as per NPIV port, or with SCSI
and NVMe on the same port.
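
To make that distinction concrete, here is a hypothetical LLD sketch
(the resource counter and its limit are made up for illustration):

#include <linux/atomic.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static atomic_t demo_inflight = ATOMIC_INIT(0);
#define DEMO_CAN_QUEUE 128      /* made-up adapter-wide limit */

static int demo_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmd)
{
        /*
         * Case 1: the driver itself is out of resources (I_T limits,
         * an overcommitted can_queue, ...).  The command never reaches
         * the target; the midlayer requeues and retries it later.
         */
        if (atomic_inc_return(&demo_inflight) > DEMO_CAN_QUEUE) {
                atomic_dec(&demo_inflight);
                return SCSI_MLQUEUE_HOST_BUSY;
        }

        /*
         * Case 2 shows up on the completion side instead: the target
         * returns SAM_STAT_TASK_SET_FULL (QUEUE_FULL) in the SCSI
         * status byte, and with sht->track_queue_depth = 1 the
         * midlayer ramps the per-LUN queue depth down in response.
         */
        /* ... hand scmd to the hardware here ... */
        return 0;
}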

Going back to Bart's question - with SANs, multiple initiators sharing
a target, and lots of LUNs on that target, it's very common to hit
bursty conditions where the target replies with QUEUE_FULL.  Many
arrays provide tuning guides on how to set up values on multiple
hosts, but that is mainly to help the target avoid being completely
overrun, as some didn't cope so well.  In the end, it's very hard to
predict multi-initiator load, and in a lot of cases things are left a
bit overcommitted because the performance downside otherwise is
significant.

-- james


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands
  2019-11-20  9:54   ` Hannes Reinecke
@ 2019-11-26  3:12     ` Kashyap Desai
  2019-11-26  3:37       ` Ming Lei
  0 siblings, 1 reply; 34+ messages in thread
From: Kashyap Desai @ 2019-11-26  3:12 UTC (permalink / raw)
  To: Hannes Reinecke, Ming Lei, Jens Axboe
  Cc: linux-block, James E . J . Bottomley, Martin K . Petersen,
	linux-scsi, Sathya Prakash Veerichetty, Chaitra P B,
	Suganath Prabu Subramani, Sumit Saxena,
	Shivasharan Srikanteshwara, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

> >  drivers/scsi/megaraid/megaraid_sas.h        |  1 +
> >  drivers/scsi/megaraid/megaraid_sas_base.c   | 15 +++++++++++++--
> >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 13 +++++++++----
> >  3 files changed, 23 insertions(+), 6 deletions(-)
> >
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Ming - Sorry for the delay. I will update this patch. We prefer the
driver to avoid a per-sdev counter if possible. We are currently
testing the driver with the changes below.

inline unsigned long sdev_nr_inflight_request(struct request_queue *q)
{
    struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[0];

    return atomic_read(&hctx->nr_active);
}
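
A call site could then look roughly like the following (continuing the
snippet above; the 8-command threshold is an arbitrary example, not a
recommended value):

/* Sketch: consult the block layer's in-flight count instead of
 * sdev->device_busy when balancing load. */
static bool demo_lun_is_busy(struct scsi_device *sdev)
{
    return sdev_nr_inflight_request(sdev->request_queue) > 8;
}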

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke		      Teamlead Storage & Networking
> hare@suse.de			                  +49 911 74053 688
> SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB
> 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands
  2019-11-26  3:12     ` Kashyap Desai
@ 2019-11-26  3:37       ` Ming Lei
  2019-12-05 10:32         ` Kashyap Desai
  0 siblings, 1 reply; 34+ messages in thread
From: Ming Lei @ 2019-11-26  3:37 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash Veerichetty, Chaitra P B,
	Suganath Prabu Subramani, Sumit Saxena,
	Shivasharan Srikanteshwara, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

On Tue, Nov 26, 2019 at 08:42:59AM +0530, Kashyap Desai wrote:
> > >  drivers/scsi/megaraid/megaraid_sas.h        |  1 +
> > >  drivers/scsi/megaraid/megaraid_sas_base.c   | 15 +++++++++++++--
> > >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 13 +++++++++----
> > >  3 files changed, 23 insertions(+), 6 deletions(-)
> > >
> > Reviewed-by: Hannes Reinecke <hare@suse.de>
> 
> Ming - Sorry for the delay. I will update this patch. We prefer the
> driver to avoid a per-sdev counter if possible. We are currently
> testing the driver with the changes below.
> 
> inline unsigned long sdev_nr_inflight_request(struct request_queue *q)
> {
>     struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[0];
> 
>     return atomic_read(&hctx->nr_active);
> }

OK, I am fine with this way, given it is just used for balancing irq load.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands
  2019-11-26  3:37       ` Ming Lei
@ 2019-12-05 10:32         ` Kashyap Desai
  0 siblings, 0 replies; 34+ messages in thread
From: Kashyap Desai @ 2019-12-05 10:32 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, Jens Axboe, linux-block,
	James E . J . Bottomley, Martin K . Petersen, linux-scsi,
	Sathya Prakash Veerichetty, Chaitra P B,
	Suganath Prabu Subramani, Sumit Saxena,
	Shivasharan Srikanteshwara, Ewan D . Milne, Christoph Hellwig,
	Bart Van Assche

> > Ming - Sorry for the delay. I will update this patch. We prefer the
> > driver to avoid a per-sdev counter if possible. We are currently
> > testing the driver with the changes below.
> >
> > inline unsigned long sdev_nr_inflight_request(struct request_queue *q)
> > {
> >     struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[0];
> >
> >     return atomic_read(&hctx->nr_active);
> > }
>
> OK, I am fine with this way, given it is just used for balancing irq
> load.

We have removed the sdev->device_busy check from the megaraid_sas and
mpt3sas drivers.  I am able to get the required performance of 3.0M
IOPS on the Aero controller using <hctx->nr_active>.
Since this is not urgent, we will wait for some testing to be done by
Broadcom.  We will post this specific change as a driver update.

>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2019-12-05 10:32 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-18 10:31 [PATCH 0/4] scis: don't apply per-LUN queue depth for SSD Ming Lei
2019-11-18 10:31 ` [PATCH 1/4] scsi: megaraid_sas: use private counter for tracking inflight per-LUN commands Ming Lei
2019-11-20  9:54   ` Hannes Reinecke
2019-11-26  3:12     ` Kashyap Desai
2019-11-26  3:37       ` Ming Lei
2019-12-05 10:32         ` Kashyap Desai
2019-11-18 10:31 ` [PATCH 2/4] scsi: mpt3sas: " Ming Lei
2019-11-20  9:55   ` Hannes Reinecke
2019-11-18 10:31 ` [PATCH 3/4] scsi: sd: register request queue after sd_revalidate_disk is done Ming Lei
2019-11-20  9:59   ` Hannes Reinecke
2019-11-18 10:31 ` [PATCH 4/4] scsi: core: don't limit per-LUN queue depth for SSD Ming Lei
2019-11-20 10:05   ` Hannes Reinecke
2019-11-20 17:00     ` Ewan D. Milne
2019-11-20 20:56       ` Bart Van Assche
2019-11-20 21:36         ` Ewan D. Milne
2019-11-22  2:25           ` Martin K. Petersen
2019-11-21  1:07         ` Ming Lei
2019-11-22  2:59           ` Martin K. Petersen
2019-11-22  3:24             ` Ming Lei
2019-11-22 16:38             ` Sumanesh Samanta
2019-11-21  0:08       ` Sumanesh Samanta
2019-11-21  0:54       ` Ming Lei
2019-11-21 19:19         ` Ewan D. Milne
2019-11-21  0:53     ` Ming Lei
2019-11-21 15:45       ` Hannes Reinecke
2019-11-22  8:09         ` Ming Lei
2019-11-22 18:14           ` Bart Van Assche
2019-11-22 18:26             ` James Smart
2019-11-22 20:46               ` Bart Van Assche
2019-11-22 22:04                 ` Ming Lei
2019-11-22 22:00             ` Ming Lei
2019-11-25 18:28             ` Ewan D. Milne
2019-11-25 22:14               ` James Smart
2019-11-22  2:18     ` Martin K. Petersen
