linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC v1 0/5] Add SCSI per device tagsets
@ 2022-02-18 18:41 Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 1/5] scsi: core: Rename host_tagset to hctx_share_tags Melanie Plageman (Microsoft)
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Melanie Plageman (Microsoft) @ 2022-02-18 18:41 UTC (permalink / raw)
  To: mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	john.garry, kashyap.desai, mpe, njavali, pbonzini, paulus,
	sathya.prakash, shivasharan.srikanteshwara, sreekanth.reddy,
	stefanha, sthemmin, suganath-prabu.subramani, sumit.saxena,
	tyreld, wei.liu, linuxppc-dev, megaraidlinux.pdl,
	mpi3mr-linuxdrv.pdl, storagedev, virtualization, linux-hyperv,
	linux-kernel, linux-scsi, MPT-FusionLinux.pdl
  Cc: andres

Currently a single blk_mq_tag_set is associated with a Scsi_Host. When SCSI
controllers are limited, attaching multiple devices to the same controller is
required. In cloud environments with relatively high latency persistent storage,
requiring all devices on a controller to share a single blk_mq_tag_set
negatively impacts performance.

For example: a device provisioned with high IOPS and BW limits on the same
controller as a smaller and slower device can starve the slower device of tags.
This is especially noticeable when the slower device's workload has low I/O
depth tasks.
A common configuration for a journaling database application would be to
configure all I/O except journaling writes to target one device and target the
journaling writes to another device. This can decrease transaction commit
latency and improve performance. However, an I/O-bound database workload, for
example one with a large number of random reads on the device with high
provisioned IOPS, can consume all of the tags in the Scsi_Host tag set,
resulting in poor overall performance as the journaling writes experience high
latency.

Given a 16-core VM with two devices attached to the same controller, the first
with a combined (VM + disk) provisioned bandwidth of 750 MBps and 18000 IOPS
mounted at /mnt/big_device and the second with a combined provisioned bandwidth
of 170 MBps and 3500 IOPS mounted at /mnt/small_device:

The following fio job description demonstrates the benefit of per device tag
sets:

[global]
time_based=1
ioengine=io_uring
direct=1
runtime=1000

[read_hogs]
bs=16k
iodepth=10000
rw=randread
filesize=10G
numjobs=32
directory=/mnt/big_device

[wal]
bs=8k
iodepth=3
filesize=4G
rw=write
numjobs=1
directory=/mnt/small_device

Also note that for this example I have configured:
small device LUN queue depth: 10
large device LUN queue depth: 70
nr_hw_queues: 1
nr_requests: 170

On master, the sequential write job averages around 5 MBps. The random reads hit
the provisioned IOPS on both master and with the patch. With the patch for per
device tag sets, the sequential write job averages 29 MBps -- the same as this
job running alone on the VM.

Open questions and TODOs:

The following open items are for the "Add per device tag sets" patch:

- show_nr_hw_queues() does not work in this implementation. I wasn't
  sure how to access the scsi_device to get the blk_mq_tag_set. I also
  assume there are other sysfs changes that need to be made but I wasn't
  sure which.

- scsi_host_busy(): I've modified scsi_host_busy() such that, when device
  tag sets are in use, its remaining callers will iterate over all the
  attached devices and check their associated blk_mq_tag_sets. I don't
  know if this is the correct thing to do.

  What does the concept of the host being busy mean with device tag
  sets? Does it depend on the caller and the context? Should this form a
  new concept of scsi device busy?

  Also, could this cause deadlocks since this iteration requires the
  host_lock?

  I assume that there still needs to be a concept of host failed
  (Scsi_Host->host_failed) even with device tag sets because if all of
  the inflight requests for a host have failed, then something is
  probably wrong with the controller?

- scsi_host_queue_ready(): I've modified scsi_host_queue_ready() such
  that, when device tag sets are in use, it will only check the device
  tag set when determining starvation. It seemed to me that if a request
  can only acquire a tag from its target device's tag set, then it
  doesn't matter if other tag sets on the host have available tags.

Melanie Plageman (Microsoft) (5):
  scsi: core: Rename host_tagset to hctx_share_tags
  scsi: map_queues() takes tag set instead of host
  scsi: core: Add per device tag sets
  scsi: storvsc: use per device tag sets
  scsi: storvsc: Hardware queues share blk_mq_tags

 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c    |  7 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    |  7 +-
 drivers/scsi/hosts.c                      | 32 ++++++---
 drivers/scsi/ibmvscsi/ibmvfc.c            |  2 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |  8 ++-
 drivers/scsi/mpi3mr/mpi3mr_os.c           |  4 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c       |  6 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      |  9 +--
 drivers/scsi/qla2xxx/qla_os.c             |  7 +-
 drivers/scsi/scsi_debug.c                 |  7 +-
 drivers/scsi/scsi_lib.c                   | 30 +++++---
 drivers/scsi/scsi_priv.h                  |  2 +-
 drivers/scsi/scsi_scan.c                  | 30 ++++++--
 drivers/scsi/scsi_sysfs.c                 | 11 ++-
 drivers/scsi/smartpqi/smartpqi_init.c     |  7 +-
 drivers/scsi/storvsc_drv.c                | 86 +++++++++++++++++++++--
 drivers/scsi/virtio_scsi.c                |  5 +-
 include/scsi/scsi_device.h                |  1 +
 include/scsi/scsi_host.h                  | 53 ++++++++++++--
 include/scsi/scsi_tcq.h                   | 15 ++--
 20 files changed, 258 insertions(+), 71 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH RFC v1 1/5] scsi: core: Rename host_tagset to hctx_share_tags
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
@ 2022-02-18 18:41 ` Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 2/5] scsi: map_queues() takes tag set instead of host Melanie Plageman (Microsoft)
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Melanie Plageman (Microsoft) @ 2022-02-18 18:41 UTC (permalink / raw)
  To: mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	john.garry, kashyap.desai, mpe, njavali, pbonzini, paulus,
	sathya.prakash, shivasharan.srikanteshwara, sreekanth.reddy,
	stefanha, sthemmin, suganath-prabu.subramani, sumit.saxena,
	tyreld, wei.liu, linuxppc-dev, megaraidlinux.pdl,
	mpi3mr-linuxdrv.pdl, storagedev, virtualization, linux-hyperv,
	linux-kernel, linux-scsi, MPT-FusionLinux.pdl
  Cc: andres

Scsi_Host->host_tagset, which is set by LLD using the
scsi_host_template, determines whether or not to set the
BLK_MQ_F_TAG_HCTX_SHARED flag in the blk_mq_tag_set. This then
determines whether or not hardware contexts share blk_mq_tags.

The name host_tagset is misleading, as it sounds like it indicates
whether there is a host-wide blk_mq_tag_set.  Rename it to
hctx_share_tags to make this more evident, and so it doesn’t sound like
the opposite of the per_device_tag_set option that is introduced in a
subsequent patch.  No functional change.

Signed-off-by: Melanie Plageman <melanieplageman@gmail.com>
---
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c    |  2 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    |  2 +-
 drivers/scsi/hosts.c                      |  2 +-
 drivers/scsi/ibmvscsi/ibmvfc.c            |  2 +-
 drivers/scsi/megaraid/megaraid_sas_base.c |  4 ++--
 drivers/scsi/mpt3sas/mpt3sas_base.c       |  6 +++---
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      |  6 +++---
 drivers/scsi/scsi_debug.c                 |  2 +-
 drivers/scsi/scsi_lib.c                   |  2 +-
 drivers/scsi/smartpqi/smartpqi_init.c     |  2 +-
 include/scsi/scsi_host.h                  | 10 ++++------
 11 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index 64ed3e472e65..a3cf413f6990 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3584,7 +3584,7 @@ static struct scsi_host_template sht_v2_hw = {
 	.shost_groups		= host_v2_hw_groups,
 	.host_reset		= hisi_sas_host_reset,
 	.map_queues		= map_queues_v2_hw,
-	.host_tagset		= 1,
+	.hctx_share_tags	= 1,
 };
 
 static const struct hisi_sas_hw hisi_sas_v2_hw = {
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index a01a3a7b706b..169eb5fbecfc 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -3172,7 +3172,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.shost_groups		= host_v3_hw_groups,
 	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,
 	.host_reset             = hisi_sas_host_reset,
-	.host_tagset		= 1,
+	.hctx_share_tags	= 1,
 };
 
 static const struct hisi_sas_hw hisi_sas_v3_hw = {
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index f69b77cbf538..a364243fac7c 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -426,7 +426,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	shost->sg_prot_tablesize = sht->sg_prot_tablesize;
 	shost->cmd_per_lun = sht->cmd_per_lun;
 	shost->no_write_same = sht->no_write_same;
-	shost->host_tagset = sht->host_tagset;
+	shost->hctx_share_tags = sht->hctx_share_tags;
 
 	if (shost_eh_deadline == -1 || !sht->eh_host_reset_handler)
 		shost->eh_deadline = -1;
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index d0eab5700dc5..950be46c041f 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -3625,7 +3625,7 @@ static struct scsi_host_template driver_template = {
 	.max_sectors = IBMVFC_MAX_SECTORS,
 	.shost_groups = ibmvfc_host_groups,
 	.track_queue_depth = 1,
-	.host_tagset = 1,
+	.hctx_share_tags = 1,
 };
 
 /**
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 82e1e24257bc..05db8f0b1e7e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -6950,14 +6950,14 @@ static int megasas_io_attach(struct megasas_instance *instance)
 	 * Single msix_vectors in kdump, so shared host tag is also disabled.
 	 */
 
-	host->host_tagset = 0;
+	host->hctx_share_tags = 0;
 	host->nr_hw_queues = 1;
 
 	if ((instance->adapter_type != MFI_SERIES) &&
 		(instance->msix_vectors > instance->low_latency_index_start) &&
 		host_tagset_enable &&
 		instance->smp_affinity_enable) {
-		host->host_tagset = 1;
+		host->hctx_share_tags = 1;
 		host->nr_hw_queues = instance->msix_vectors -
 			instance->low_latency_index_start + instance->iopoll_q_count;
 		if (instance->iopoll_q_count)
diff --git a/drivers/scsi/mpt3sas/mpt3sas_base.c b/drivers/scsi/mpt3sas/mpt3sas_base.c
index 511726f92d9a..eb85d74bd9d8 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -3414,12 +3414,12 @@ _base_enable_msix(struct MPT3SAS_ADAPTER *ioc)
 		ioc->smp_affinity_enable = 0;
 
 	if (!ioc->smp_affinity_enable || ioc->reply_queue_count <= 1)
-		ioc->shost->host_tagset = 0;
+		ioc->shost->hctx_share_tags = 0;
 
 	/*
-	 * Enable io uring poll queues only if host_tagset is enabled.
+	 * Enable io uring poll queues only if hctx_share_tags is enabled.
 	 */
-	if (ioc->shost->host_tagset)
+	if (ioc->shost->hctx_share_tags)
 		iopoll_q_count = poll_queues;
 
 	if (iopoll_q_count) {
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 00792767c620..74cdf72ef837 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -12322,10 +12322,10 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto out_thread_fail;
 	}
 
-	shost->host_tagset = 0;
+	shost->hctx_share_tags = 0;
 
 	if (ioc->is_gen35_ioc && host_tagset_enable)
-		shost->host_tagset = 1;
+		shost->hctx_share_tags = 1;
 
 	ioc->is_driver_loading = 1;
 	if ((mpt3sas_base_attach(ioc))) {
@@ -12351,7 +12351,7 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	shost->nr_hw_queues = 1;
 
-	if (shost->host_tagset) {
+	if (shost->hctx_share_tags) {
 		shost->nr_hw_queues =
 		    ioc->reply_queue_count - ioc->high_iops_queues;
 
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 2104973a35cd..af7ad912fabc 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -7614,7 +7614,7 @@ static int sdebug_driver_probe(struct device *dev)
 	 */
 	hpnt->nr_hw_queues = submit_queues;
 	if (sdebug_host_max_queue)
-		hpnt->host_tagset = 1;
+		hpnt->hctx_share_tags = 1;
 
 	/* poll queues are possible for nr_hw_queues > 1 */
 	if (hpnt->nr_hw_queues == 1 || (poll_queues < 1)) {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0a70aa763a96..61795bab83f7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1987,7 +1987,7 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	tag_set->flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
 	tag_set->driver_data = shost;
-	if (shost->host_tagset)
+	if (shost->hctx_share_tags)
 		tag_set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;
 
 	return blk_mq_alloc_tag_set(tag_set);
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index f0897d587454..31c0c2054d6a 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -7187,7 +7187,7 @@ static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
 	shost->irq = pci_irq_vector(ctrl_info->pci_dev, 0);
 	shost->unique_id = shost->irq;
 	shost->nr_hw_queues = ctrl_info->num_queue_groups;
-	shost->host_tagset = 1;
+	shost->hctx_share_tags = 1;
 	shost->hostdata[0] = (unsigned long)ctrl_info;
 
 	rc = scsi_add_host(shost, &ctrl_info->pci_dev->dev);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 72e1a347baa6..f7f330f9255b 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -456,8 +456,7 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
-	/* True if the host uses host-wide tagspace */
-	unsigned host_tagset:1;
+	unsigned hctx_share_tags:1;
 
 	/*
 	 * Countdown for host blocking with no commands outstanding.
@@ -617,8 +616,8 @@ struct Scsi_Host {
 	 * In scsi-mq mode, the number of hardware queues supported by the LLD.
 	 *
 	 * Note: it is assumed that each hardware queue has a queue depth of
-	 * can_queue. In other words, the total queue depth per host
-	 * is nr_hw_queues * can_queue. However, for when host_tagset is set,
+	 * can_queue. In other words, the total queue depth per host is
+	 * nr_hw_queues * can_queue. However, when hctx_share_tags is set,
 	 * the total queue depth is can_queue.
 	 */
 	unsigned nr_hw_queues;
@@ -650,8 +649,7 @@ struct Scsi_Host {
 	/* The controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
-	/* True if the host uses host-wide tagspace */
-	unsigned host_tagset:1;
+	unsigned hctx_share_tags:1;
 
 	/* Host responded with short (<36 bytes) INQUIRY result */
 	unsigned short_inquiry:1;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH RFC v1 2/5] scsi: map_queues() takes tag set instead of host
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 1/5] scsi: core: Rename host_tagset to hctx_share_tags Melanie Plageman (Microsoft)
@ 2022-02-18 18:41 ` Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 3/5] scsi: core: Add per device tag sets Melanie Plageman (Microsoft)
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Melanie Plageman (Microsoft) @ 2022-02-18 18:41 UTC (permalink / raw)
  To: mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	john.garry, kashyap.desai, mpe, njavali, pbonzini, paulus,
	sathya.prakash, shivasharan.srikanteshwara, sreekanth.reddy,
	stefanha, sthemmin, suganath-prabu.subramani, sumit.saxena,
	tyreld, wei.liu, linuxppc-dev, megaraidlinux.pdl,
	mpi3mr-linuxdrv.pdl, storagedev, virtualization, linux-hyperv,
	linux-kernel, linux-scsi, MPT-FusionLinux.pdl
  Cc: andres

Change the scsi_host_template->map_queues() to accept a blk_mq_tag_set
instead of a Scsi_Host as a function parameter.

A future commit will introduce the concept of device tag sets. Thus
map_queues() cannot assume that the target blk_mq_tag_set is accessible
through Scsi_Host->tag_set.

Generalize map_queues() by changing its input to a blk_mq_tag_set.

This commit makes no functional change, as the Scsi_Host->tag_set is
passed as the argument to the LLD map_queues function.

Signed-off-by: Melanie Plageman <melanieplageman@gmail.com>
---
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c    | 5 +++--
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 5 +++--
 drivers/scsi/megaraid/megaraid_sas_base.c | 4 +++-
 drivers/scsi/mpi3mr/mpi3mr_os.c           | 4 ++--
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      | 3 ++-
 drivers/scsi/qla2xxx/qla_os.c             | 7 ++++---
 drivers/scsi/scsi_debug.c                 | 5 +++--
 drivers/scsi/scsi_lib.c                   | 4 ++--
 drivers/scsi/smartpqi/smartpqi_init.c     | 5 +++--
 drivers/scsi/virtio_scsi.c                | 5 +++--
 include/scsi/scsi_host.h                  | 2 +-
 11 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index a3cf413f6990..e850797e4c18 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3538,10 +3538,11 @@ static struct attribute *host_v2_hw_attrs[] = {
 
 ATTRIBUTE_GROUPS(host_v2_hw);
 
-static int map_queues_v2_hw(struct Scsi_Host *shost)
+static int map_queues_v2_hw(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost = set->driver_data;
 	struct hisi_hba *hisi_hba = shost_priv(shost);
-	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+	struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
 	const struct cpumask *mask;
 	unsigned int queue, cpu;
 
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 169eb5fbecfc..75a2b4ccbd95 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -3135,10 +3135,11 @@ static int debugfs_set_bist_v3_hw(struct hisi_hba *hisi_hba, bool enable)
 	return 0;
 }
 
-static int hisi_sas_map_queues(struct Scsi_Host *shost)
+static int hisi_sas_map_queues(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost = set->driver_data;
 	struct hisi_hba *hisi_hba = shost_priv(shost);
-	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+	struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
 
 	return blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
 				     BASE_VECTORS_V3_HW);
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 05db8f0b1e7e..d579cb41355e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -3168,12 +3168,14 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
 	return 0;
 }
 
-static int megasas_map_queues(struct Scsi_Host *shost)
+static int megasas_map_queues(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost;
 	struct megasas_instance *instance;
 	int qoff = 0, offset;
 	struct blk_mq_queue_map *map;
 
+	shost = set->driver_data;
 	instance = (struct megasas_instance *)shost->hostdata;
 
 	if (shost->nr_hw_queues == 1)
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index 284117da9086..38e27e3b392e 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -3167,8 +3167,9 @@ static int mpi3mr_bios_param(struct scsi_device *sdev,
  *
  * Return: return zero.
  */
-static int mpi3mr_map_queues(struct Scsi_Host *shost)
+static int mpi3mr_map_queues(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost = set->driver_data;
 	struct mpi3mr_ioc *mrioc = shost_priv(shost);
 	int i, qoff, offset;
 	struct blk_mq_queue_map *map = NULL;
@@ -3205,7 +3206,6 @@ static int mpi3mr_map_queues(struct Scsi_Host *shost)
 	}
 
 	return 0;
-
 }
 
 /**
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 74cdf72ef837..4e3a9fff023e 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -11868,8 +11868,9 @@ scsih_scan_finished(struct Scsi_Host *shost, unsigned long time)
  * scsih_map_queues - map reply queues with request queues
  * @shost: SCSI host pointer
  */
-static int scsih_map_queues(struct Scsi_Host *shost)
+static int scsih_map_queues(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost = set->driver_data;
 	struct MPT3SAS_ADAPTER *ioc =
 	    (struct MPT3SAS_ADAPTER *)shost->hostdata;
 	struct blk_mq_queue_map *map;
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index abcd30917263..915fee7f9d08 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -335,7 +335,7 @@ MODULE_PARM_DESC(ql2xabts_wait_nvme,
 
 static void qla2x00_clear_drv_active(struct qla_hw_data *);
 static void qla2x00_free_device(scsi_qla_host_t *);
-static int qla2xxx_map_queues(struct Scsi_Host *shost);
+static int qla2xxx_map_queues(struct blk_mq_tag_set *tag_set);
 static void qla2x00_destroy_deferred_work(struct qla_hw_data *);
 
 
@@ -7881,11 +7881,12 @@ qla_pci_reset_done(struct pci_dev *pdev)
 	clear_bit(ABORT_ISP_ACTIVE, &base_vha->dpc_flags);
 }
 
-static int qla2xxx_map_queues(struct Scsi_Host *shost)
+static int qla2xxx_map_queues(struct blk_mq_tag_set *set)
 {
 	int rc;
+	struct Scsi_Host *shost = set->driver_data;
 	scsi_qla_host_t *vha = (scsi_qla_host_t *)shost->hostdata;
-	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+	struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
 
 	if (USER_CTRL_IRQ(vha->hw) || !vha->hw->mqiobase)
 		rc = blk_mq_map_queues(qmap);
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index af7ad912fabc..a0327ef6ef83 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -7272,15 +7272,16 @@ static int resp_not_ready(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	return check_condition_result;
 }
 
-static int sdebug_map_queues(struct Scsi_Host *shost)
+static int sdebug_map_queues(struct blk_mq_tag_set *set)
 {
 	int i, qoff;
+	struct Scsi_Host *shost = set->driver_data;
 
 	if (shost->nr_hw_queues == 1)
 		return 0;
 
 	for (i = 0, qoff = 0; i < HCTX_MAX_TYPES; i++) {
-		struct blk_mq_queue_map *map = &shost->tag_set.map[i];
+		struct blk_mq_queue_map *map = &set->map[i];
 
 		map->nr_queues  = 0;
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 61795bab83f7..bba66e29d38c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1864,10 +1864,10 @@ static int scsi_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 
 static int scsi_map_queues(struct blk_mq_tag_set *set)
 {
-	struct Scsi_Host *shost = container_of(set, struct Scsi_Host, tag_set);
+	struct Scsi_Host *shost = set->driver_data;
 
 	if (shost->hostt->map_queues)
-		return shost->hostt->map_queues(shost);
+		return shost->hostt->map_queues(set);
 	return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
 }
 
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 31c0c2054d6a..d81ead36d9f1 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -6332,11 +6332,12 @@ static int pqi_slave_alloc(struct scsi_device *sdev)
 	return 0;
 }
 
-static int pqi_map_queues(struct Scsi_Host *shost)
+static int pqi_map_queues(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost = set->driver_data;
 	struct pqi_ctrl_info *ctrl_info = shost_to_hba(shost);
 
-	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+	return blk_mq_pci_map_queues(&set->map[HCTX_TYPE_DEFAULT],
 					ctrl_info->pci_dev, 0);
 }
 
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 0e6110da69e7..6595e142e2e4 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -711,10 +711,11 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
 	return virtscsi_tmf(vscsi, cmd);
 }
 
-static int virtscsi_map_queues(struct Scsi_Host *shost)
+static int virtscsi_map_queues(struct blk_mq_tag_set *set)
 {
+	struct Scsi_Host *shost = set->driver_data;
 	struct virtio_scsi *vscsi = shost_priv(shost);
-	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+	struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
 
 	return blk_mq_virtio_map_queues(qmap, vscsi->vdev, 2);
 }
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index f7f330f9255b..1255e8c164f6 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -277,7 +277,7 @@ struct scsi_host_template {
 	 *
 	 * Status: OPTIONAL
 	 */
-	int (* map_queues)(struct Scsi_Host *shost);
+	int (* map_queues)(struct blk_mq_tag_set *set);
 
 	/*
 	 * SCSI interface of blk_poll - poll for IO completions.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH RFC v1 3/5] scsi: core: Add per device tag sets
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 1/5] scsi: core: Rename host_tagset to hctx_share_tags Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 2/5] scsi: map_queues() takes tag set instead of host Melanie Plageman (Microsoft)
@ 2022-02-18 18:41 ` Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 4/5] scsi: storvsc: use " Melanie Plageman (Microsoft)
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Melanie Plageman (Microsoft) @ 2022-02-18 18:41 UTC (permalink / raw)
  To: mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	john.garry, kashyap.desai, mpe, njavali, pbonzini, paulus,
	sathya.prakash, shivasharan.srikanteshwara, sreekanth.reddy,
	stefanha, sthemmin, suganath-prabu.subramani, sumit.saxena,
	tyreld, wei.liu, linuxppc-dev, megaraidlinux.pdl,
	mpi3mr-linuxdrv.pdl, storagedev, virtualization, linux-hyperv,
	linux-kernel, linux-scsi, MPT-FusionLinux.pdl
  Cc: andres

Currently a single blk_mq_tag_set is associated with a Scsi_Host. When
SCSI controllers are limited, attaching multiple devices to the same
controller is required. In cloud environments with relatively high
latency persistent storage, requiring all devices on a controller to
share a single blk_mq_tag_set negatively impacts performance.

For example: a device provisioned with high IOPS and BW limits on the
same controller as a smaller and slower device can starve the slower
device of tags. This is especially noticeable when the slower device's
workload has low iodepth tasks.

A common configuration for a journaling database application would be to
configure all IO except journaling writes to target one device and
target the journaling writes to another device. This can decrease
transaction commit latency and improve performance. However, given an
IO-bound database workload, for example one with a large number of
random reads on the device with high provisioned IOPS, the high IO depth
workload can consume all of the tags in the Scsi_Host tag set, resulting
in poor overall performance as the journaling writes experience high
latency.

To provide more independence for devices on a SCSI controller, introduce
the option of per-SCSI-device tag sets.  A LLD can opt-in to per-device
tag sets, in which case there is no Scsi_Host level tag set.

scsi_alloc_sdev() allocates the per-device’s blk_mq_tag_set and
__scsi_remove_device() frees it.

A LLD that opts-in to this behavior is responsible for submitting I/Os
to the device associated with a particular tag set and for matching
completions to tags in the correct tag set. For LLDs that do not opt-in
to this behavior, the existing Scsi_Host tag set continues to be used
and there is no functional change.

Signed-off-by: Melanie Plageman <melanieplageman@gmail.com>
---
 drivers/scsi/hosts.c       | 30 +++++++++++++++++++++-------
 drivers/scsi/scsi_lib.c    | 24 ++++++++++++++++------
 drivers/scsi/scsi_priv.h   |  2 +-
 drivers/scsi/scsi_scan.c   | 30 ++++++++++++++++++++++------
 drivers/scsi/scsi_sysfs.c  | 11 +++++++++-
 include/scsi/scsi_device.h |  1 +
 include/scsi/scsi_host.h   | 41 ++++++++++++++++++++++++++++++++++++++
 include/scsi/scsi_tcq.h    | 15 +++++++++-----
 8 files changed, 128 insertions(+), 26 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index a364243fac7c..ce2899e4c1c8 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -229,9 +229,11 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 	if (error)
 		goto fail;
 
-	error = scsi_mq_setup_tags(shost);
-	if (error)
-		goto fail;
+	if (!shost->per_device_tag_set) {
+		error = scsi_mq_setup_tags(shost, &shost->tag_set);
+		if (error)
+			goto fail;
+	}
 
 	if (!shost->shost_gendev.parent)
 		shost->shost_gendev.parent = dev ? dev : &platform_bus;
@@ -345,7 +347,7 @@ static void scsi_host_dev_release(struct device *dev)
 		kfree(dev_name(&shost->shost_dev));
 	}
 
-	if (shost->tag_set.tags)
+	if (!shost->per_device_tag_set && shost->tag_set.tags)
 		scsi_mq_destroy_tags(shost);
 
 	kfree(shost->shost_data);
@@ -427,6 +429,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
 	shost->cmd_per_lun = sht->cmd_per_lun;
 	shost->no_write_same = sht->no_write_same;
 	shost->hctx_share_tags = sht->hctx_share_tags;
+	shost->per_device_tag_set = sht->per_device_tag_set;
 
 	if (shost_eh_deadline == -1 || !sht->eh_host_reset_handler)
 		shost->eh_deadline = -1;
@@ -566,7 +569,7 @@ struct Scsi_Host *scsi_host_get(struct Scsi_Host *shost)
 }
 EXPORT_SYMBOL(scsi_host_get);
 
-static bool scsi_host_check_in_flight(struct request *rq, void *data,
+bool scsi_check_in_flight(struct request *rq, void *data,
 				      bool reserved)
 {
 	int *count = data;
@@ -585,9 +588,17 @@ static bool scsi_host_check_in_flight(struct request *rq, void *data,
 int scsi_host_busy(struct Scsi_Host *shost)
 {
 	int cnt = 0;
+	if (shost->per_device_tag_set) {
 
-	blk_mq_tagset_busy_iter(&shost->tag_set,
-				scsi_host_check_in_flight, &cnt);
+		struct scsi_device *sdev;
+
+		shost_for_each_device(sdev, shost) {
+			blk_mq_tagset_busy_iter(sdev->request_queue->tag_set,
+						scsi_check_in_flight, &cnt);
+		}
+	} else
+		blk_mq_tagset_busy_iter(&shost->tag_set, scsi_check_in_flight,
+					&cnt);
 	return cnt;
 }
 EXPORT_SYMBOL(scsi_host_busy);
@@ -687,6 +698,8 @@ static bool complete_all_cmds_iter(struct request *rq, void *data, bool rsvd)
 void scsi_host_complete_all_commands(struct Scsi_Host *shost,
 				     enum scsi_host_status status)
 {
+	if (shost->per_device_tag_set)
+		return;
 	blk_mq_tagset_busy_iter(&shost->tag_set, complete_all_cmds_iter,
 				&status);
 }
@@ -724,6 +737,9 @@ void scsi_host_busy_iter(struct Scsi_Host *shost,
 		.priv = priv,
 	};
 
+	if (shost->per_device_tag_set)
+		return;
+
 	blk_mq_tagset_busy_iter(&shost->tag_set, __scsi_host_busy_iter_fn,
 				&iter_data);
 }
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index bba66e29d38c..ec69a526c397 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1369,6 +1369,17 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 	return 0;
 }
 
+static int scsi_tag_set_busy(struct scsi_device *sdevice)
+{
+	struct blk_mq_tag_set *tag_set = sdevice->request_queue->tag_set;
+
+	int cnt = 0;
+
+	blk_mq_tagset_busy_iter(tag_set,
+				scsi_check_in_flight, &cnt);
+	return cnt;
+}
+
 /*
  * scsi_host_queue_ready: if we can send requests to shost, return 1 else
  * return 0. We must end up running the queue again whenever 0 is
@@ -1383,9 +1394,11 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 		return 0;
 
 	if (atomic_read(&shost->host_blocked) > 0) {
-		if (scsi_host_busy(shost) > 0)
+		if (shost->per_device_tag_set) {
+			if (scsi_tag_set_busy(sdev) > 0)
+				goto starved;
+		} else if (scsi_host_busy(shost) > 0)
 			goto starved;
-
 		/*
 		 * unblock after host_blocked iterates to zero
 		 */
@@ -1961,11 +1974,9 @@ static const struct blk_mq_ops scsi_mq_ops = {
 	.get_rq_budget_token = scsi_mq_get_rq_budget_token,
 };
 
-int scsi_mq_setup_tags(struct Scsi_Host *shost)
+int scsi_mq_setup_tags(struct Scsi_Host *shost, struct blk_mq_tag_set *tag_set)
 {
 	unsigned int cmd_size, sgl_size;
-	struct blk_mq_tag_set *tag_set = &shost->tag_set;
-
 	sgl_size = max_t(unsigned int, sizeof(struct scatterlist),
 				scsi_mq_inline_sgl_size(shost));
 	cmd_size = sizeof(struct scsi_cmnd) + shost->hostt->cmd_size + sgl_size;
@@ -2955,7 +2966,8 @@ scsi_host_block(struct Scsi_Host *shost)
 	 * SCSI never enables blk-mq's BLK_MQ_F_BLOCKING flag so
 	 * calling synchronize_rcu() once is enough.
 	 */
-	WARN_ON_ONCE(shost->tag_set.flags & BLK_MQ_F_BLOCKING);
+	if (!shost->per_device_tag_set)
+		WARN_ON_ONCE(shost->tag_set.flags & BLK_MQ_F_BLOCKING);
 
 	if (!ret)
 		synchronize_rcu();
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 5c4786310a31..aac2fbfb7d94 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -93,7 +93,7 @@ extern void scsi_io_completion(struct scsi_cmnd *, unsigned int);
 extern void scsi_run_host_queues(struct Scsi_Host *shost);
 extern void scsi_requeue_run_queue(struct work_struct *work);
 extern void scsi_start_queue(struct scsi_device *sdev);
-extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
+extern int scsi_mq_setup_tags(struct Scsi_Host *shost, struct blk_mq_tag_set *tag_set);
 extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
 extern void scsi_exit_queue(void);
 extern void scsi_evt_thread(struct work_struct *work);
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index f4e6c68ac99e..3d213da4a87f 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -276,6 +276,7 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	unsigned int depth;
 	struct scsi_device *sdev;
 	struct request_queue *q;
+	struct blk_mq_tag_set *tag_set;
 	int display_failure_msg = 1, ret;
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 
@@ -324,16 +325,27 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	 * doesn't
 	 */
 	sdev->borken = 1;
+	if (shost->per_device_tag_set) {
+		tag_set = kzalloc(sizeof(struct blk_mq_tag_set), GFP_KERNEL);
+		sdev->tag_set = tag_set;
+		if (!tag_set)
+			goto out;
+		if (scsi_mq_setup_tags(shost, tag_set))
+			goto out_free_tag_set;
+	} else {
+		tag_set = &shost->tag_set;
+		sdev->tag_set = NULL;
+	}
 
 	sdev->sg_reserved_size = INT_MAX;
 
-	q = blk_mq_init_queue(&sdev->host->tag_set);
+	q = blk_mq_init_queue(tag_set);
 	if (IS_ERR(q)) {
 		/* release fn is set up in scsi_sysfs_device_initialise, so
 		 * have to free and put manually here */
 		put_device(&starget->dev);
 		kfree(sdev);
-		goto out;
+		goto out_free_tag_set;
 	}
 	sdev->request_queue = q;
 	q->queuedata = sdev;
@@ -351,7 +363,7 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	if (scsi_realloc_sdev_budget_map(sdev, depth)) {
 		put_device(&starget->dev);
 		kfree(sdev);
-		goto out;
+		goto out_free_tag_set;
 	}
 
 	scsi_change_queue_depth(sdev, depth);
@@ -367,14 +379,20 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 			 */
 			if (ret == -ENXIO)
 				display_failure_msg = 0;
-			goto out_device_destroy;
+			__scsi_remove_device(sdev);
+			/*
+			 * __scsi_remove_device() will free the tag_set, so go
+			 * to "out" label.
+			 */
+			goto out;
 		}
 	}
 
 	return sdev;
 
-out_device_destroy:
-	__scsi_remove_device(sdev);
+out_free_tag_set:
+	if (shost->per_device_tag_set)
+		kfree(tag_set);
 out:
 	if (display_failure_msg)
 		printk(ALLOC_FAILURE_MSG, __func__);
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index f1e0c131b77c..dd11f3d59663 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -399,7 +399,10 @@ show_nr_hw_queues(struct device *dev, struct device_attribute *attr, char *buf)
 	struct Scsi_Host *shost = class_to_shost(dev);
 	struct blk_mq_tag_set *tag_set = &shost->tag_set;
 
-	return snprintf(buf, 20, "%d\n", tag_set->nr_hw_queues);
+	if (shost->per_device_tag_set)
+		return 0;
+	else
+		return snprintf(buf, 20, "%d\n", tag_set->nr_hw_queues);
 }
 static DEVICE_ATTR(nr_hw_queues, S_IRUGO, show_nr_hw_queues, NULL);
 
@@ -1467,6 +1470,12 @@ void __scsi_remove_device(struct scsi_device *sdev)
 		sdev->host->hostt->slave_destroy(sdev);
 	transport_destroy_device(dev);
 
+	if (sdev->host->per_device_tag_set) {
+		blk_mq_free_tag_set(sdev->tag_set);
+		kfree(sdev->tag_set);
+		sdev->tag_set = NULL;
+	}
+
 	/*
 	 * Paired with the kref_get() in scsi_sysfs_initialize().  We have
 	 * removed sysfs visibility from the device, so make the target
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 647c53b26105..aa8c7f860a34 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -103,6 +103,7 @@ struct scsi_vpd {
 struct scsi_device {
 	struct Scsi_Host *host;
 	struct request_queue *request_queue;
+	struct blk_mq_tag_set *tag_set;
 
 	/* the next two are protected by the host->host_lock */
 	struct list_head    siblings;   /* list of all devices on this host */
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 1255e8c164f6..a625a8490ea8 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -456,7 +456,39 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/*
+	 * True if hardware queues will share blk_mq_tags
+	 * (BLK_MQ_F_TAG_HCTX_SHARED will be set)
+	 */
 	unsigned hctx_share_tags:1;
+	/*
+	 * True if the LLD will create blk_mq_tag_sets for each scsi_device
+	 * If False, request_queues will share the single blk_mq_tag_set in the
+	 * Scsi_Host.
+	 *
+	 * If per_device_tag_set is False and hctx_share_tags is True, all
+	 * devices on the Scsi_Host share a single blk_mq_tag_set referenced
+	 * through the host and all hardware queues share a blk_mq_tags.
+	 *
+	 * If per_device_tag_set is True and hctx_share_tags is True, all
+	 * devices on the Scsi_Host have their own blk_mq_tag_set, allocated and
+	 * maintained by the LLD, and all hardware queues on a given device
+	 * share one blk_mq_tags.
+	 *
+	 * If per_device_tag_set is False and hctx_share_tags is False, all
+	 * devices on the Scsi_Host share a single blk_mq_tag_set referenced
+	 * through the host and hardware queues have their own blk_mq_tags. That
+	 * is, hardware context 1 from device 1 and hardware context 1 from
+	 * device 2 will share the same blk_mq_tags in the host blk_mq_tag_set
+	 * but, hardware context 2 from device 1 and hardware context 2 from
+	 * device 2 will share another blk_mq_tags in the host blk_mq_tag_set.
+	 *
+	 * If per_device_tag_set is True and hctx_share_tags is False, all
+	 * devices on the Scsi_Host will have their own blk_mq_tag_set and all
+	 * hardware queues will have their own blk_mq_tags.
+	 */
+	unsigned per_device_tag_set:1;
+
 
 	/*
 	 * Countdown for host blocking with no commands outstanding.
@@ -649,8 +681,14 @@ struct Scsi_Host {
 	/* The controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/*
+	 * See comment in struct scsi_host_template for details on how
+	 * hctx_share_tags and per_device_tag_set interact.
+	 */
 	unsigned hctx_share_tags:1;
 
+	unsigned per_device_tag_set:1;
+
 	/* Host responded with short (<36 bytes) INQUIRY result */
 	unsigned short_inquiry:1;
 
@@ -754,6 +792,9 @@ extern void scsi_rescan_device(struct device *);
 extern void scsi_remove_host(struct Scsi_Host *);
 extern struct Scsi_Host *scsi_host_get(struct Scsi_Host *);
 extern int scsi_host_busy(struct Scsi_Host *shost);
+// TODO: does this belong in another file?
+extern bool scsi_check_in_flight(struct request *rq, void *data,
+				      bool reserved);
 extern void scsi_host_put(struct Scsi_Host *t);
 extern struct Scsi_Host *scsi_host_lookup(unsigned short);
 extern const char *scsi_host_state_name(enum scsi_host_state);
diff --git a/include/scsi/scsi_tcq.h b/include/scsi/scsi_tcq.h
index ea7848e74d25..443e4d917dd3 100644
--- a/include/scsi/scsi_tcq.h
+++ b/include/scsi/scsi_tcq.h
@@ -19,18 +19,17 @@
  * Note: for devices using multiple hardware queues tag must have been
  * generated by blk_mq_unique_tag().
  **/
-static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
-		int tag)
+static inline struct scsi_cmnd *scsi_find_tag(struct blk_mq_tag_set *tag_set, int tag)
 {
 	struct request *req = NULL;
 	u16 hwq;
 
 	if (tag == SCSI_NO_TAG)
 		return NULL;
-
 	hwq = blk_mq_unique_tag_to_hwq(tag);
-	if (hwq < shost->tag_set.nr_hw_queues) {
-		req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
+
+	if (hwq < tag_set->nr_hw_queues) {
+		req = blk_mq_tag_to_rq(tag_set->tags[hwq],
 					blk_mq_unique_tag_to_tag(tag));
 	}
 
@@ -39,5 +38,11 @@ static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
 	return blk_mq_rq_to_pdu(req);
 }
 
+static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
+		int tag)
+{
+	return scsi_find_tag(&shost->tag_set, tag);
+}
+
 #endif /* CONFIG_BLOCK */
 #endif /* _SCSI_SCSI_TCQ_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH RFC v1 4/5] scsi: storvsc: use per device tag sets
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
                   ` (2 preceding siblings ...)
  2022-02-18 18:41 ` [PATCH RFC v1 3/5] scsi: core: Add per device tag sets Melanie Plageman (Microsoft)
@ 2022-02-18 18:41 ` Melanie Plageman (Microsoft)
  2022-02-18 18:41 ` [PATCH RFC v1 5/5] scsi: storvsc: Hardware queues share blk_mq_tags Melanie Plageman (Microsoft)
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Melanie Plageman (Microsoft) @ 2022-02-18 18:41 UTC (permalink / raw)
  To: mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	john.garry, kashyap.desai, mpe, njavali, pbonzini, paulus,
	sathya.prakash, shivasharan.srikanteshwara, sreekanth.reddy,
	stefanha, sthemmin, suganath-prabu.subramani, sumit.saxena,
	tyreld, wei.liu, linuxppc-dev, megaraidlinux.pdl,
	mpi3mr-linuxdrv.pdl, storagedev, virtualization, linux-hyperv,
	linux-kernel, linux-scsi, MPT-FusionLinux.pdl
  Cc: andres

storvsc operates in virtual environments where multiple devices with
potentially widely varying performance can be attached to the same SCSI
controller.

As such, it can benefit from the device independence
provided by per-device tag sets.

Map each device tag set to a small integer index which can be combined
with the tag number to uniquely identify an I/O request and can be
quickly looked up and validated when a request completes.

Signed-off-by: Melanie Plageman <melanieplageman@gmail.com>
---
 drivers/scsi/storvsc_drv.c | 85 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 9a0bba5a51a7..0ed764bcabab 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -399,6 +399,7 @@ static void storvsc_on_channel_callback(void *context);
 #define STORVSC_MAX_LUNS_PER_TARGET			255
 #define STORVSC_MAX_TARGETS				2
 #define STORVSC_MAX_CHANNELS				8
+#define STORVSC_MAX_DEVICES_PER_CONTROLLER		64
 
 #define STORVSC_FC_MAX_LUNS_PER_TARGET			255
 #define STORVSC_FC_MAX_TARGETS				128
@@ -431,6 +432,14 @@ struct storvsc_cmd_request {
 	struct vstor_packet vstor_packet;
 };
 
+union storvsc_scsi_device_tag {
+	u64 as_uint64;
+	struct {
+		u32 tag;
+		u32 idx;
+	} __packed;
+};
+
 
 /* A storvsc device is a device object that contains a vmbus channel */
 struct storvsc_device {
@@ -502,6 +511,7 @@ struct hv_host_device {
 	struct workqueue_struct *handle_error_wq;
 	struct work_struct host_scan_work;
 	struct Scsi_Host *host;
+	struct blk_mq_tag_set *device_tag_sets[STORVSC_MAX_DEVICES_PER_CONTROLLER];
 };
 
 struct storvsc_scan_work {
@@ -700,6 +710,7 @@ static void storvsc_change_target_cpu(struct vmbus_channel *channel, u32 old,
 
 static u64 storvsc_next_request_id(struct vmbus_channel *channel, u64 rqst_addr)
 {
+	union storvsc_scsi_device_tag tag;
 	struct storvsc_cmd_request *request =
 		(struct storvsc_cmd_request *)(unsigned long)rqst_addr;
 
@@ -708,11 +719,15 @@ static u64 storvsc_next_request_id(struct vmbus_channel *channel, u64 rqst_addr)
 	if (rqst_addr == VMBUS_RQST_RESET)
 		return VMBUS_RQST_RESET;
 
+	tag.tag = blk_mq_unique_tag(scsi_cmd_to_rq(request->cmd));
+
 	/*
 	 * Cannot return an ID of 0, which is reserved for an unsolicited
 	 * message from Hyper-V.
 	 */
-	return (u64)blk_mq_unique_tag(scsi_cmd_to_rq(request->cmd)) + 1;
+	tag.tag++;
+	tag.idx = (u64) request->cmd->device->hostdata;
+	return tag.as_uint64;
 }
 
 static void handle_sc_creation(struct vmbus_channel *new_sc)
@@ -1272,6 +1287,7 @@ static void storvsc_on_channel_callback(void *context)
 	struct hv_device *device;
 	struct storvsc_device *stor_device;
 	struct Scsi_Host *shost;
+	struct hv_host_device *host_dev;
 
 	if (channel->primary_channel != NULL)
 		device = channel->primary_channel->device_obj;
@@ -1283,6 +1299,7 @@ static void storvsc_on_channel_callback(void *context)
 		return;
 
 	shost = stor_device->host;
+	host_dev = shost_priv(shost);
 
 	foreach_vmbus_pkt(desc, channel) {
 		struct vstor_packet *packet = hv_pkt_data(desc);
@@ -1330,13 +1347,34 @@ static void storvsc_on_channel_callback(void *context)
 				}
 			} else {
 				struct scsi_cmnd *scmnd;
+				union storvsc_scsi_device_tag tag;
+				struct blk_mq_tag_set *tag_set;
 
 				/* Transaction 'rqst_id' corresponds to tag 'rqst_id - 1' */
-				scmnd = scsi_host_find_tag(shost, rqst_id - 1);
-				if (scmnd == NULL) {
+				tag.as_uint64 = rqst_id;
+				tag.tag--;
+
+				if (tag.idx >= STORVSC_MAX_DEVICES_PER_CONTROLLER) {
+					dev_err(&device->device,
+						"SCSI Device Tag Set index %d is invalid.\n",
+						tag.idx);
+					continue;
+				}
+
+				tag_set = READ_ONCE(host_dev->device_tag_sets[tag.idx]);
+				if (!tag_set) {
+					dev_err(&device->device,
+						"No SCSI Device Tag Set found at idx %d.\n",
+						tag.idx);
+					continue;
+				}
+
+				scmnd = scsi_find_tag(tag_set, tag.tag);
+				if (!scmnd) {
 					dev_err(&device->device, "Incorrect transaction ID\n");
 					continue;
 				}
+
 				request = (struct storvsc_cmd_request *)scsi_cmd_priv(scmnd);
 				scsi_dma_unmap(scmnd);
 			}
@@ -1596,6 +1634,10 @@ static int storvsc_do_io(struct hv_device *device,
 
 static int storvsc_device_alloc(struct scsi_device *sdevice)
 {
+	u64 i;
+	struct hv_host_device *host_dev = shost_priv(sdevice->host);
+	struct hv_device *dev = host_dev->dev;
+
 	/*
 	 * Set blist flag to permit the reading of the VPD pages even when
 	 * the target may claim SPC-2 compliance. MSFT targets currently
@@ -1607,7 +1649,31 @@ static int storvsc_device_alloc(struct scsi_device *sdevice)
 	 */
 	sdevice->sdev_bflags = BLIST_REPORTLUN2 | BLIST_TRY_VPD_PAGES;
 
-	return 0;
+	if (!sdevice->tag_set)
+		return -EINVAL;
+	/*
+	 * Find an open spot in the array to save the pointer to the
+	 * blk_mq_tag_set allocated in scsi_alloc_sdev, then save that spot's
+	 * index. The index comprises part of the tag that is eventually
+	 * assigned to a dispatched SCSI command.
+	 *
+	 * This can be done without a lock because scanning is serialized with
+	 * the Scsi_Host->scan_mutex.
+	 */
+	for (i = 0; i < STORVSC_MAX_DEVICES_PER_CONTROLLER; i++) {
+		if (READ_ONCE(host_dev->device_tag_sets[i]))
+			continue;
+
+		WRITE_ONCE(host_dev->device_tag_sets[i], sdevice->tag_set);
+
+		sdevice->hostdata = (void *) i;
+		return 0;
+	}
+
+	dev_err(&dev->device,
+		"The storvsc SCSI Device Tag Set array is full. No additional tag sets can be allocated.");
+
+	return -EINVAL;
 }
 
 static int storvsc_device_configure(struct scsi_device *sdevice)
@@ -1636,6 +1702,15 @@ static int storvsc_device_configure(struct scsi_device *sdevice)
 	return 0;
 }
 
+static void storvsc_device_destroy(struct scsi_device *sdevice)
+{
+	struct Scsi_Host *host = sdevice->host;
+	struct hv_host_device *host_dev = shost_priv(host);
+	u64 idx = (u64) sdevice->hostdata;
+
+	WRITE_ONCE(host_dev->device_tag_sets[idx], NULL);
+}
+
 static int storvsc_get_chs(struct scsi_device *sdev, struct block_device * bdev,
 			   sector_t capacity, int *info)
 {
@@ -1912,6 +1987,7 @@ static struct scsi_host_template scsi_driver = {
 	.proc_name =		"storvsc_host",
 	.eh_timed_out =		storvsc_eh_timed_out,
 	.slave_alloc =		storvsc_device_alloc,
+	.slave_destroy =	storvsc_device_destroy,
 	.slave_configure =	storvsc_device_configure,
 	.cmd_per_lun =		2048,
 	.this_id =		-1,
@@ -1920,6 +1996,7 @@ static struct scsi_host_template scsi_driver = {
 	.no_write_same =	1,
 	.track_queue_depth =	1,
 	.change_queue_depth =	storvsc_change_queue_depth,
+	.per_device_tag_set =	1,
 };
 
 enum {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH RFC v1 5/5] scsi: storvsc: Hardware queues share blk_mq_tags
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
                   ` (3 preceding siblings ...)
  2022-02-18 18:41 ` [PATCH RFC v1 4/5] scsi: storvsc: use " Melanie Plageman (Microsoft)
@ 2022-02-18 18:41 ` Melanie Plageman (Microsoft)
  2022-02-19  7:37 ` [PATCH RFC v1 0/5] Add SCSI per device tagsets Christoph Hellwig
  2022-02-21 17:40 ` John Garry
  6 siblings, 0 replies; 8+ messages in thread
From: Melanie Plageman (Microsoft) @ 2022-02-18 18:41 UTC (permalink / raw)
  To: mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	john.garry, kashyap.desai, mpe, njavali, pbonzini, paulus,
	sathya.prakash, shivasharan.srikanteshwara, sreekanth.reddy,
	stefanha, sthemmin, suganath-prabu.subramani, sumit.saxena,
	tyreld, wei.liu, linuxppc-dev, megaraidlinux.pdl,
	mpi3mr-linuxdrv.pdl, storagedev, virtualization, linux-hyperv,
	linux-kernel, linux-scsi, MPT-FusionLinux.pdl
  Cc: andres

Decouple the number of tags available from the number of hardware queues
by sharing a single blk_mq_tags amongst all hardware queues.

When storage latency is relatively high, having too many tags available
can harm the performance of mixed workloads.
By sharing blk_mq_tags amongst hardware queues, nr_requests can be set
to the appropriate number of tags for the device.

Signed-off-by: Melanie Plageman <melanieplageman@gmail.com>
---
As an example, on a 16-core VM coupled with a 1 TiB storage device having a
combined (VM + disk) max BW of 200 MB/s and IOPS of 5000, configured with 16
hardware queues and with nr_requests set to 56 and queue_depth set to 15, the
following fio job description illustrates the benefit of hardware queues sharing
blk_mq_tags:

[global]
time_based=1
ioengine=io_uring
direct=1
runtime=60

[read_hogs]
bs=16k
iodepth=10000
rw=randread
filesize=10G
numjobs=15
directory=/mnt/test

[wal]
bs=8k
iodepth=3
filesize=4G
rw=write
numjobs=1
directory=/mnt/test

with hctx_share_tags set, the "wal" job does 271 IOPS, averaging 13120 usec
completion latency and the "read_hogs" jobs average around 4700 IOPS.

without hctx_share_tags set, the "wal" job does 85 IOPS and averages around
45308 usec completion latency and the "read_hogs" job average around 4900 IOPS.

Note that reducing nr_requests to a number sufficient to increase WAL IOPS
results in unacceptably low IOPS for the random reads when only one random read
job is running.

 drivers/scsi/storvsc_drv.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 0ed764bcabab..5048e7fcf959 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1997,6 +1997,7 @@ static struct scsi_host_template scsi_driver = {
 	.track_queue_depth =	1,
 	.change_queue_depth =	storvsc_change_queue_depth,
 	.per_device_tag_set =	1,
+	.hctx_share_tags = 1,
 };
 
 enum {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH RFC v1 0/5] Add SCSI per device tagsets
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
                   ` (4 preceding siblings ...)
  2022-02-18 18:41 ` [PATCH RFC v1 5/5] scsi: storvsc: Hardware queues share blk_mq_tags Melanie Plageman (Microsoft)
@ 2022-02-19  7:37 ` Christoph Hellwig
  2022-02-21 17:40 ` John Garry
  6 siblings, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2022-02-19  7:37 UTC (permalink / raw)
  To: Melanie Plageman (Microsoft)
  Cc: tyreld, linux-hyperv, mst, jasowang, mikelley,
	R-QLogic-Storage-Upstream, paulus, kys, wei.liu, sthemmin,
	linux-scsi, sathya.prakash, decui, kashyap.desai, njavali,
	MPT-FusionLinux.pdl, haiyangz, mpi3mr-linuxdrv.pdl,
	suganath-prabu.subramani, jejb, john.garry, stefanha, storagedev,
	virtualization, megaraidlinux.pdl, sreekanth.reddy,
	martin.petersen, shivasharan.srikanteshwara, linuxppc-dev,
	linux-kernel, sumit.saxena, andres, pbonzini, don.brace

On Fri, Feb 18, 2022 at 06:41:52PM +0000, Melanie Plageman (Microsoft) wrote:
> Currently a single blk_mq_tag_set is associated with a Scsi_Host. When SCSI
> controllers are limited, attaching multiple devices to the same controller is
> required. In cloud environments with relatively high latency persistent storage,
> requiring all devices on a controller to share a single blk_mq_tag_set
> negatively impacts performance.

So add more controllers instead of completely breaking the architecture.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH RFC v1 0/5] Add SCSI per device tagsets
  2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
                   ` (5 preceding siblings ...)
  2022-02-19  7:37 ` [PATCH RFC v1 0/5] Add SCSI per device tagsets Christoph Hellwig
@ 2022-02-21 17:40 ` John Garry
  6 siblings, 0 replies; 8+ messages in thread
From: John Garry @ 2022-02-21 17:40 UTC (permalink / raw)
  To: Melanie Plageman (Microsoft),
	mikelley, jejb, kys, martin.petersen, mst, benh, decui,
	don.brace, R-QLogic-Storage-Upstream, haiyangz, jasowang,
	kashyap.desai, mpe, njavali, pbonzini, paulus, sathya.prakash,
	shivasharan.srikanteshwara, sreekanth.reddy, stefanha, sthemmin,
	suganath-prabu.subramani, sumit.saxena, tyreld, wei.liu,
	linuxppc-dev, megaraidlinux.pdl, mpi3mr-linuxdrv.pdl, storagedev,
	virtualization, linux-hyperv, linux-kernel, linux-scsi,
	MPT-FusionLinux.pdl
  Cc: andres

On 18/02/2022 18:41, Melanie Plageman (Microsoft) wrote:
> For example: a device provisioned with high IOPS and BW limits on the same
> controller as a smaller and slower device can starve the slower device of tags.
> This is especially noticeable when the slower device's workload has low I/O
> depth tasks.

If you check hctx_may_queue() in the block code then it is noticeable 
that a fair allocation of HW queue depth is allocated per request queue 
to ensure we don't get starvation.

Thanks,
John

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-02-21 23:04 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-18 18:41 [PATCH RFC v1 0/5] Add SCSI per device tagsets Melanie Plageman (Microsoft)
2022-02-18 18:41 ` [PATCH RFC v1 1/5] scsi: core: Rename host_tagset to hctx_share_tags Melanie Plageman (Microsoft)
2022-02-18 18:41 ` [PATCH RFC v1 2/5] scsi: map_queues() takes tag set instead of host Melanie Plageman (Microsoft)
2022-02-18 18:41 ` [PATCH RFC v1 3/5] scsi: core: Add per device tag sets Melanie Plageman (Microsoft)
2022-02-18 18:41 ` [PATCH RFC v1 4/5] scsi: storvsc: use " Melanie Plageman (Microsoft)
2022-02-18 18:41 ` [PATCH RFC v1 5/5] scsi: storvsc: Hardware queues share blk_mq_tags Melanie Plageman (Microsoft)
2022-02-19  7:37 ` [PATCH RFC v1 0/5] Add SCSI per device tagsets Christoph Hellwig
2022-02-21 17:40 ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).