* [PATCH v1] scsi: storvsc: Parameterize nr_hw_queues
From: Melanie Plageman @ 2021-02-02 20:25 UTC
To: linux-hyperv
Proposed patch attached.
While doing some performance tuning of various block device parameters
on Azure to optimize database workloads, we found that reducing the
number of hardware queues (nr_hw_queues used in the block-mq layer and
by storvsc) improved performance of short queue depth writes, like
database journaling.
The Azure premium disks we were testing on often had around 2-3ms
latency for a single small write. Given the IOPS and bandwidth
provisioned on a typical Azure VM paired with an Azure premium disk,
we found that decreasing the LUN queue_depth substantially from its
default of 2048 was often required to get reasonable IOPS out of a
small sequential synchronous write issued at an I/O depth of 1. In our
investigation, the high default queue_depth resulted in such a large
number of outstanding requests against the device that a journaling
write incurred very high latency, slowing overall database performance.
However, even with tuning these defaults (including
/sys/block/sdX/queue/nr_requests), we still incurred high latency,
especially on machines with high core counts, because storvsc sets the
number of hardware queues, nr_hw_queues, to the number of CPUs.
nr_hw_queues is, in turn, used to calculate the number of tags available
on the block device, meaning that we still ended up with deeper queues
than intended when requests were submitted to more than one hardware
queue.
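As a rough illustration of why nr_hw_queues has this effect, here is a
simplified sketch of how the SCSI midlayer hands nr_hw_queues to blk-mq
(approximating scsi_mq_setup_tags() in kernels around v5.11; this is not
verbatim kernel code, and the helper name below is made up):

/*
 * Simplified sketch, not verbatim kernel code: each hardware queue gets
 * its own set of shost->can_queue tags (absent a host-wide shared tag
 * set), so the total number of requests that can be outstanding scales
 * with nr_hw_queues.
 */
static int scsi_tag_setup_sketch(struct Scsi_Host *shost)
{
        struct blk_mq_tag_set *tag_set = &shost->tag_set;

        memset(tag_set, 0, sizeof(*tag_set));
        tag_set->nr_hw_queues = shost->nr_hw_queues ? : 1;
        tag_set->queue_depth = shost->can_queue; /* tags per hardware queue */
        tag_set->numa_node = NUMA_NO_NODE;

        return blk_mq_alloc_tag_set(tag_set);
}

So even after lowering nr_requests, a host with many CPUs (and therefore
many hardware queues) can still end up with more requests in flight than
intended once I/O is spread across multiple hardware queues.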
We calculated the optimal block device settings, including our intended
queue depth, to utilize as much of the provisioned bandwidth as possible
while still getting a reasonable number of IOPS from low queue depth
sequential IO and random IO, but, without parameterizing the number of
hardware queues, we weren't able to fully control this queue_depth.
Attached is a patch which adds a module param to control nr_hw_queues.
The default is the current value (number of CPUs), so it should have no
impact on users who do not set the param.
As a minimal example of the potential benefit, we found that, starting
with a baseline of optimized block device tunings, we could
substantially increase the IOPS of a sequential write job for both a
large Azure premium SSD and a small Azure premium SSD.
On an Azure Standard_F32s_v2 with a single 16 TiB disk, which has a
provisioned max bandwidth of 750 MB/s and provisioned max IOPS of
18000, running Debian 10 with a Linux kernel built from master
(v5.11-rc6 at the time of patch testing) with the patch applied, and with the
following block device settings:
/sys/block/sdX/device/queue_depth=55
/sys/block/sdX/queue/max_sectors_kb=64,
read_ahead_kb=2296,
nr_requests=55,
wbt_lat_usec=0,
scheduler=mq-deadline
And this fio job file:
[global]
time_based=1
ioengine=libaio
direct=1
runtime=20
[job1]
name=seq_read
bs=32k
size=23G
rw=read
numjobs=2
iodepth=110
iodepth_batch_submit=55
iodepth_batch_complete=55
[job2]
name=seq_write
bs=8k
size=10G
rw=write
numjobs=1
iodepth=1
overwrite=1
With ncpu hardware queues configured, we measured an average of 764 MB/s
read throughput and 153 write IOPS.
With one hardware queue configured, we measured an average of 763 MB/s
read throughput and 270 write IOPS.
And on an Azure Standard_F32s_v2 with a single 16 GiB disk, the
combination having a provisioned max bandwidth of 170 MB/s and a
provisioned max IOPS of 3500, with the following block device settings:
/sys/block/sdX/device/queue_depth=11
/sys/block/sdX/queue/max_sectors_kb=65,
read_ahead_kb=520,
nr_requests=11,
wbt_lat_usec=0,
scheduler=mq-deadline
And with this fio job file:
[global]
time_based=1
ioengine=libaio
direct=1
runtime=60
[job1]
name=seq_read
bs=32k
size=5G
rw=read
numjobs=2
iodepth=22
iodepth_batch_submit=11
iodepth_batch_complete=11
[job2]
name=seq_write
bs=8k
size=3G
rw=write
numjobs=1
iodepth=1
overwrite=1
With ncpu hardware queues configured, we measured an average of 123 MB/s
read throughput and 56 write IOPS.
With one hardware queue configured, we measured an average of 165 MB/s
read throughput and 346 write IOPS.
Providing this option as a module param will help improve performance of
certain workloads on certain devices.
In the attached patch, I check that the value provided for
storvsc_nr_hw_queues is within a valid range at init time and error out
if it is not. I noticed this warning from scripts/checkpatch.pl
WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then
dev_err(dev, ... then pr_err(... to printk(KERN_ERR ...
#64: FILE: drivers/scsi/storvsc_drv.c:2183:
printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.
Should I be using a different function for printing this message?
Regards,
Melanie (Microsoft)
[-- Attachment #2: v1-0001-scsi-storvsc-Parameterize-number-hardware-queues.patch --]
From 3232c69d0cb702c538d8654ed8607cec15adfe00 Mon Sep 17 00:00:00 2001
From: "Melanie Plageman (Microsoft)" <melanieplageman@gmail.com>
Date: Tue, 2 Feb 2021 19:19:36 +0000
Subject: [PATCH v1] scsi: storvsc: Parameterize number of hardware queues
Add ability to set the number of hardware queues. The default value
remains the number of CPUs. This functionality is useful in some
environments (e.g. Microsoft Azure) where decreasing the number of
hardware queues has been shown to improve performance.
Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com>
---
drivers/scsi/storvsc_drv.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 2e4fa77445fd..d72ab6daf9ae 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -378,10 +378,14 @@ static u32 max_outstanding_req_per_channel;
static int storvsc_change_queue_depth(struct scsi_device *sdev, int queue_depth);
static int storvsc_vcpus_per_sub_channel = 4;
+static int storvsc_nr_hw_queues = -1;
module_param(storvsc_ringbuffer_size, int, S_IRUGO);
MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
+module_param(storvsc_nr_hw_queues, int, S_IRUGO);
+MODULE_PARM_DESC(storvsc_nr_hw_queues, "Number of hardware queues");
+
module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs to subchannels");
@@ -2004,8 +2008,12 @@ static int storvsc_probe(struct hv_device *device,
* For non-IDE disks, the host supports multiple channels.
* Set the number of HW queues we are supporting.
*/
-        if (!dev_is_ide)
-                host->nr_hw_queues = num_present_cpus();
+        if (!dev_is_ide) {
+                if (storvsc_nr_hw_queues == -1)
+                        host->nr_hw_queues = num_present_cpus();
+                else
+                        host->nr_hw_queues = storvsc_nr_hw_queues;
+        }
/*
* Set the error handler work queue.
@@ -2155,6 +2163,7 @@ static struct fc_function_template fc_transport_functions = {
static int __init storvsc_drv_init(void)
{
int ret;
+ int ncpus = num_present_cpus();
/*
* Divide the ring buffer data size (which is 1 page less
@@ -2169,6 +2178,14 @@ static int __init storvsc_drv_init(void)
vmscsi_size_delta,
sizeof(u64)));
+        if (storvsc_nr_hw_queues > ncpus || storvsc_nr_hw_queues == 0 ||
+            storvsc_nr_hw_queues < -1) {
+                printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d. "
+                       "Valid values include -1 and 1-%d.\n",
+                       storvsc_nr_hw_queues, ncpus);
+                return -EINVAL;
+        }
+
#if IS_ENABLED(CONFIG_SCSI_FC_ATTRS)
fc_transport_template = fc_attach_transport(&fc_transport_functions);
if (!fc_transport_template)
--
2.20.1
* RE: [PATCH v1] scsi: storvsc: Parameterize nr_hw_queues
From: Michael Kelley @ 2021-02-03 0:50 UTC
To: melanieplageman, linux-hyperv
From: Melanie Plageman <melanieplageman@gmail.com> Sent: Tuesday, February 2, 2021 12:26 PM
>
> Proposed patch attached.
>
For public mailing lists like linux-hyperv@vger.kernel.org that are
associated with the Linux kernel project, proposed patches should
generally go inline in the email, rather than as an attachment.
>
> [...]
>
> Providing this option as a module param will help improve performance of
> certain workloads on certain devices.
I'm in agreement that the current handling of I/O queuing in the storvsc
has problems. Your data definitely confirms that, and there are other
data points that indicate that we need to more fundamentally rethink
what I/Os get queued where. Storvsc is letting far too many I/Os get
queued in the VMbus ring buffers and in the underlying Hyper-V.
Adding a module parameter to specify the number of hardware queues
may be part of the solution. But I really want to step back a bit and
take into account all the data points we have before deciding what to
change, what additional parameters to offer (if any), etc. There are
other ways of limiting the number of I/Os being queued at the driver
level, and I'm wondering how those trade off against adding a module
parameter. I'm planning to jump in on this topic in just a few weeks,
and would like to coordinate with you.
>
> In the attached patch, I check that the value provided for
> storvsc_nr_hw_queues is within a valid range at init time and error out
> if it is not. I noticed this warning from scripts/checkpatch.pl
>
> WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then
> dev_err(dev, ... then pr_err(... to printk(KERN_ERR ...
> #64: FILE: drivers/scsi/storvsc_drv.c:2183:
> printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d.
>
> Should I be using a different function for printing this message?
Yes. If you look at other code in storvsc_drv.c, you'll see the use of
"dev_err" for outputting warning messages. Follow the pattern of
these other uses of "dev_err", and that should eliminate the
checkpatch warning.
Michael
>
> Regards,
> Melanie (Microsoft)
* Re: [PATCH v1] scsi: storvsc: Parameterize nr_hw_queues
From: Melanie Plageman (Microsoft) @ 2021-02-06 0:25 UTC
To: mikelley
Cc: andres, haiyangz, kys, linux-hyperv, melanieplageman, sashal, sthemmin
> Yes. If you look at other code in storvsc_drv.c, you'll see the use of
> "dev_err" for outputting warning messages. Follow the pattern of
> these other uses of "dev_err", and that should eliminate the
> checkpatch warning.
I'm not sure I can use the dev_warn/dev_err macros (or the storvsc_log
macro, which uses dev_warn) in storvsc_drv_init(), as they seem to be
meant to be used with a device, and, at driver initialization, I'm not
sure how to access a specific device.
I originally used printk() after looking for an example of a driver
printing some logging message before erroring out in its init function
and had found this in drivers/scsi/iscsi_tcp.c
static int __init iscsi_sw_tcp_init(void)
{
        if (iscsi_max_lun < 1) {
                printk(KERN_ERR "iscsi_tcp: Invalid max_lun value of %u\n",
                       iscsi_max_lun);
                return -EINVAL;
        }
        ...
}
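For comparison, here is a rough sketch of what a pr_err()-based check
could look like (the pr_fmt() define and the standalone helper below are
illustrative only, not part of the posted patch); checkpatch's suggestion
chain ends in pr_err(), which does not need a struct device:

/*
 * Illustrative sketch only -- not part of the posted patch. pr_err()
 * with a pr_fmt() prefix produces a tagged message without needing a
 * struct device, so it is usable at module init time. In a real file,
 * pr_fmt() must be defined before the printk.h/kernel.h includes.
 */
#define pr_fmt(fmt) "storvsc: " fmt

static int storvsc_validate_nr_hw_queues(void)
{
        int ncpus = num_present_cpus();

        if (storvsc_nr_hw_queues > ncpus || storvsc_nr_hw_queues == 0 ||
            storvsc_nr_hw_queues < -1) {
                pr_err("Invalid storvsc_nr_hw_queues value of %d. Valid values include -1 and 1-%d.\n",
                       storvsc_nr_hw_queues, ncpus);
                return -EINVAL;
        }
        return 0;
}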
Unfortunately, the thread has split in two because of some issues with my
original mail. This message has the other updates I've made to the patch [1].
> I'm in agreement that the current handling of I/O queuing in the storvsc
> has problems. Your data definitely confirms that, and there are other
> data points that indicate that we need to more fundamentally rethink
> what I/Os get queued where. Storvsc is letting far too many I/Os get
> queued in the VMbus ring buffers and in the underlying Hyper-V.
>
> Adding a module parameter to specify the number of hardware queues
> may be part of the solution. But I really want to step back a bit and
> take into account all the data points we have before deciding what to
> change, what additional parameters to offer (if any), etc. There are
> other ways of limiting the number of I/Os being queued at the driver
> level, and I'm wondering how those trade off against adding a module
> parameter. I'm planning to jump in on this topic in just a few weeks,
> and would like to coordinate with you.
There are tradeoffs to reducing the number of hardware queues. It may
not be the right solution for many workloads. However, parameterizing
the number of hardware queues, given that the default would be the same
as the current behavior, doesn't seem like it would be harmful.
virtio-scsi, for example, seems to provide num_queues as a configuration
option in virtscsi_probe() in drivers/scsi/virtio_scsi.c:

        num_queues = virtscsi_config_get(vdev, num_queues) ? : 1;
        num_queues = min_t(unsigned int, nr_cpu_ids, num_queues);

and then uses that value to set nr_hw_queues in the scsi host:

        ...
        shost->nr_hw_queues = num_queues;
In fact, it may be worth modifying the patch to allow setting the number
of hardware queues to any number above zero.
I made it a module parameter to avoid having to change the interface at
any point above storvsc in the stack to allow for configuration of this
parameter.
[1] https://lore.kernel.org/linux-hyperv/20210205212447.3327-3-melanieplageman@gmail.com/
* Re: [PATCH v1] scsi: storvsc: Parameterize nr_hw_queues
From: Andres Freund @ 2021-02-03 1:09 UTC
To: Melanie Plageman; +Cc: linux-hyperv, kys, haiyangz, sthemmin, sashal
Hi,
On 2021-02-02 15:19:08 -0500, Melanie Plageman wrote:
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 2e4fa77445fd..d72ab6daf9ae 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -378,10 +378,14 @@ static u32 max_outstanding_req_per_channel;
> static int storvsc_change_queue_depth(struct scsi_device *sdev, int queue_depth);
>
> static int storvsc_vcpus_per_sub_channel = 4;
> +static int storvsc_nr_hw_queues = -1;
>
> module_param(storvsc_ringbuffer_size, int, S_IRUGO);
> MODULE_PARM_DESC(storvsc_ringbuffer_size, "Ring buffer size (bytes)");
>
> +module_param(storvsc_nr_hw_queues, int, S_IRUGO);
> +MODULE_PARM_DESC(storvsc_nr_hw_queues, "Number of hardware queues");
Would be nice if the parameter could be changed in a running
system. Obviously that's only going to affect devices that'll be
attached in the future (or detached and reattached), but that's still
nicer than not being able to change at all.
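Something along these lines, perhaps (just a sketch of the idea, not a
tested change; the sysfs path assumes the module is named hv_storvsc):

/*
 * Sketch only: 0644 instead of S_IRUGO makes the parameter writable by
 * root at runtime via
 * /sys/module/hv_storvsc/parameters/storvsc_nr_hw_queues; the new value
 * would only apply to devices probed after the write.
 */
module_param(storvsc_nr_hw_queues, int, 0644);
MODULE_PARM_DESC(storvsc_nr_hw_queues, "Number of hardware queues");

A real version would presumably also want to validate writes (e.g. via a
custom kernel_param_ops) rather than only checking the value at module
init.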
> @@ -2155,6 +2163,7 @@ static struct fc_function_template fc_transport_functions = {
> static int __init storvsc_drv_init(void)
> {
> int ret;
> + int ncpus = num_present_cpus();
>
> /*
> * Divide the ring buffer data size (which is 1 page less
> @@ -2169,6 +2178,14 @@ static int __init storvsc_drv_init(void)
> vmscsi_size_delta,
> sizeof(u64)));
>
> +        if (storvsc_nr_hw_queues > ncpus || storvsc_nr_hw_queues == 0 ||
> +            storvsc_nr_hw_queues < -1) {
> +                printk(KERN_ERR "storvsc: Invalid storvsc_nr_hw_queues value of %d. "
> +                       "Valid values include -1 and 1-%d.\n",
> +                       storvsc_nr_hw_queues, ncpus);
> +                return -EINVAL;
> +        }
> +
I have a mild preference for just capping the effective value to
num_present_cpus(), rather than erroring out. Would allow you to
e.g. cap the number of queues to 4, with the same param specified on
smaller and bigger systems. Perhaps renaming it to max_hw_queues would
make that more obvious?
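Roughly this shape in storvsc_probe(), as a sketch only (reusing the
names from your patch, untested):

/*
 * Sketch of the capping approach: treat the module parameter as an
 * upper bound and clamp it to the number of present CPUs instead of
 * failing module init when it is too large.
 */
if (!dev_is_ide) {
        if (storvsc_nr_hw_queues > 0)
                host->nr_hw_queues = min_t(unsigned int,
                                           storvsc_nr_hw_queues,
                                           num_present_cpus());
        else
                host->nr_hw_queues = num_present_cpus();
}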
Greetings,
Andres Freund