* Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Wunderlich, Mark @ 2020-03-09 17:38 UTC
  To: linux-nvme; +Cc: Jens Axboe, Sagi Grimberg

A recent attempt to use the 'disable_sqflow' connection option showed a greatly reduced I/O rate when a queue depth of 1 or 2 was used.
I verified this behavior against the nvme 5.5-rc baseline branch with no other patches applied.

The test used fio with the io_uring engine, as follows:

fio --filename=/dev/nvme0n1 --time_based --runtime=20 --thread --rw=randrw --rwmixread=100 --refill_buffers --direct=1 --ioengine=io_uring --hipri --fixedbufs --bs=4k --iodepth=1 --numjobs=1 --group_reporting --gtod_reduce=0 --disable_lat=0 --name=cpu3 --cpus_allowed=3

With a queue depth of 1 the result is 4 IOPS; a queue depth of 2 yields 9 IOPS; a queue depth of 3 yields 54K IOPS.  Without this option, a queue depth of 1 yields around 32K IOPS with all busy polling disabled.

Interestingly, I noticed that with the following patches applied to the baseline branch (done to correct a different stability issue), the performance at a queue depth of 2 changed from 9 to ~27K IOPS.  Performance at a queue depth of 1 remained at 4 IOPS.

- io-wq-re-add-io_wq_current_is_worker
- io_uring-ensure-workqueue-offload-grabs-ring-mutex
- io_uring-clear-req-result-always-before-issuing

The 'disable_sqflow' option was enabled via a scripted echo to /dev/nvme-fabrics (see below), along with the other connection options.  A tcpdump trace verified that the target received the flag for this option and, as expected, set the SUCCESS flag in the read data frames received by the initiator.

echo "transport=tcp,traddr=${TARGET},trsvcid=${PORT},nqn=${NQN}${EXTRA},nr_io_queues=${QUEUES},nr_poll_queues=${POLL_QUEUES},disable_sqflow" > /dev/nvme-fabrics

So, before I launch into root-cause debugging, I thought a shout-out would be good just in case I am using or understanding this option incorrectly.  Maybe this is expected behavior?  Or hopefully someone else has already seen this, or can replicate the failure in their environment.

Cheers ---- Mark



* Re: Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Sagi Grimberg @ 2020-03-09 19:32 UTC
  To: Wunderlich, Mark, linux-nvme; +Cc: Jens Axboe

Hey Mark, thanks for reporting!

> A recent attempt to use the 'disable_sqflow' connection option showed a greatly reduced I/O rate when a queue depth of 1 or 2 was used.
> I verified this behavior against the nvme 5.5-rc baseline branch with no other patches applied.
> 
> The test used fio with the io_uring engine, as follows:
> 
> fio --filename=/dev/nvme0n1 --time_based --runtime=20 --thread --rw=randrw --rwmixread=100 --refill_buffers --direct=1 --ioengine=io_uring --hipri --fixedbufs --bs=4k --iodepth=1 --numjobs=1 --group_reporting --gtod_reduce=0 --disable_lat=0 --name=cpu3 --cpus_allowed=3
> 
> With a queue depth of 1 the result is 4 IOPS; a queue depth of 2 yields 9 IOPS; a queue depth of 3 yields 54K IOPS.  Without this option, a queue depth of 1 yields around 32K IOPS with all busy polling disabled.
> 
> Interestingly, I noticed that with the following patches applied to the baseline branch (done to correct a different stability issue), the performance at a queue depth of 2 changed from 9 to ~27K IOPS.  Performance at a queue depth of 1 remained at 4 IOPS.
> 
> - io-wq-re-add-io_wq_current_is_worker
> - io_uring-ensure-workqueue-offload-grabs-ring-mutex
> - io_uring-clear-req-result-always-before-issuing

This is highly unlikely to be related...

> The 'disable_sqflow' option was enabled via a scripted echo to /dev/nvme-fabrics (see below), along with the other connection options.  A tcpdump trace verified that the target received the flag for this option and, as expected, set the SUCCESS flag in the read data frames received by the initiator.

There is a problem that I see with the nvmet-tcp target's usage of msg flags
when the SUCCESS flag optimization is in effect.

Does this patch solve the issue?
--
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 1f4322ee7dbe..af1889641d45 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -533,9 +533,13 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
         while (cmd->cur_sg) {
                 struct page *page = sg_page(cmd->cur_sg);
                 u32 left = cmd->cur_sg->length - cmd->offset;
+               int flags = MSG_DONTWAIT;
+
+               if (cmd->wbytes_done + left < cmd->req.transfer_len)
+                       flags |= MSG_MORE;

                 ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
-                                       left, MSG_DONTWAIT | MSG_MORE);
+                                       left, flags);
                 if (ret <= 0)
                         return ret;
--
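
As background on why the flags matter here (a rough sketch of the mechanism, not part of the patch): MSG_MORE tells the TCP stack to hold back a partial segment in the expectation that more data will follow.  Normally the response capsule that follows the data flushes it out, but with disable_sqflow negotiated and the SUCCESS flag set there is no response PDU, so the tail of the last C2HData PDU can sit in the socket until some later event pushes it out.  The ~4 IOPS you see at queue depth 1 (roughly 250 ms per I/O) is at least consistent with that kind of stall rather than a slow link.  A minimal userspace illustration of the flag behavior, with a made-up loopback peer on port 4420, might look like this:

/*
 * Userspace sketch of MSG_MORE behavior on a TCP socket.  Assumes
 * something is listening on 127.0.0.1:4420 (address and port chosen
 * arbitrarily); this is an illustration only, not nvmet code.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port = htons(4420),
	};
	char chunk[4096];
	int fd;

	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	memset(chunk, 0xab, sizeof(chunk));

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}

	/*
	 * Send a chunk with MSG_MORE: full MSS-sized segments still go out,
	 * but the final partial segment is held back waiting for more data.
	 * This mirrors the tail of the last C2HData PDU on the target.
	 */
	if (send(fd, chunk, sizeof(chunk), MSG_MORE) < 0)
		perror("send MSG_MORE");

	sleep(1);	/* during this window the tail may still sit in the send queue */

	/*
	 * A follow-up send without MSG_MORE flushes it -- the role normally
	 * played by the response capsule when sq flow control is enabled.
	 */
	if (send(fd, chunk, 1, 0) < 0)
		perror("send flush");

	close(fd);
	return 0;
}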



* RE: Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Wunderlich, Mark @ 2020-03-09 22:17 UTC
  To: Sagi Grimberg, linux-nvme; +Cc: Jens Axboe

> There is a problem that I see with the nvmet-tcp target's usage of msg flags when the SUCCESS flag optimization is in effect.

> Does this patch solve the issue?

This does improve things.  For a queue depth of 1 I now see ~49K IOPS, but the rate is not constant; fio output will at times show the IOPS rate drop low and then come back up.  Also, a test run with a queue depth of 32 and a batch size of 8 shows lower performance with this fix than without it: ~287K IOPS before, ~249K now.  This got me thinking about why that would be, in relation to 'disable_sqflow'.

Because the send data frame could be the last in the batch, much like the response or R2T frame, it made sense to fold a 'last_in_batch' check into your suggested logic for setting the flag.  So I modified it to the following (a note on where 'last_in_batch' comes from follows the diff).  This results in good, steady performance for a queue depth of 1, and what appears to be a slight performance improvement for larger queue depths.  All good!

diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index cbff1038bdb3..7b2027cdd715 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -525,7 +525,8 @@ static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
        return 1;
 }

-static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
+static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd,
+               bool last_in_batch)
 {
        struct nvmet_tcp_queue *queue = cmd->queue;
        int ret;
@@ -533,9 +534,16 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
        while (cmd->cur_sg) {
                struct page *page = sg_page(cmd->cur_sg);
                u32 left = cmd->cur_sg->length - cmd->offset;
+               int flags = MSG_DONTWAIT;
+
+               if (cmd->wbytes_done + left < cmd->req.transfer_len ||
+                       (!last_in_batch && cmd->queue->send_list_len))
+                       flags |= MSG_MORE;
+               else if (queue->nvme_sq.sqhd_disabled)
+                       flags |= MSG_EOR;

                ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
-                                       left, MSG_DONTWAIT | MSG_MORE);
+                                       left, flags);
                if (ret <= 0)
                        return ret;

@@ -670,7 +678,7 @@ static int nvmet_tcp_try_send_one(struct nvmet_tcp_queue *queue,
        }

        if (cmd->state == NVMET_TCP_SEND_DATA) {
-               ret = nvmet_try_send_data(cmd);
+               ret = nvmet_try_send_data(cmd, last_in_batch);
                if (ret <= 0)
                        goto done_send;
        }
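
For reference, the new parameter is fed from the send budget loop in nvmet_tcp_try_send(); roughly (reconstructed from memory for context, so treat it as approximate rather than a verbatim quote of the tree):

/*
 * Approximate shape of the caller in drivers/nvme/target/tcp.c, written
 * from memory for context only -- not copied from the tree.
 */
static int nvmet_tcp_try_send(struct nvmet_tcp_queue *queue,
		int budget, int *sends)
{
	int i, ret = 0;

	for (i = 0; i < budget; i++) {
		/* only the final budgeted send is flagged last_in_batch */
		ret = nvmet_tcp_try_send_one(queue, i == budget - 1);
		if (ret <= 0)
			break;
		(*sends)++;
	}

	return ret;
}

In other words, only the final send of the budgeted batch skips MSG_MORE on account of the batch, since io_work may not immediately loop around to push out whatever was left corked.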

--- Mark


* Re: Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Sagi Grimberg @ 2020-03-09 22:49 UTC
  To: Wunderlich, Mark, linux-nvme; +Cc: Jens Axboe


>> There is a problem that I see with the nvmet-tcp target's usage of msg flags when the SUCCESS flag optimization is in effect.
> 
>> Does this patch solve the issue?
> 
> This does improve things.  For a queue depth of 1 I now see ~49K IOPS, but the rate is not constant; fio output will at times show the IOPS rate drop low and then come back up.  Also, a test run with a queue depth of 32 and a batch size of 8 shows lower performance with this fix than without it: ~287K IOPS before, ~249K now.  This got me thinking about why that would be, in relation to 'disable_sqflow'.
> 
> Because the send data frame could be the last in the batch, much like the response or R2T frame, it made sense to fold a 'last_in_batch' check into your suggested logic for setting the flag.  So I modified it to the following.  This results in good, steady performance for a queue depth of 1, and what appears to be a slight performance improvement for larger queue depths.  All good!

Thanks for testing, Mark.

> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index cbff1038bdb3..7b2027cdd715 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -525,7 +525,8 @@ static int nvmet_try_send_data_pdu(struct nvmet_tcp_cmd *cmd)
>          return 1;
>   }
> 
> -static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
> +static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd,
> +               bool last_in_batch)
>   {
>          struct nvmet_tcp_queue *queue = cmd->queue;
>          int ret;
> @@ -533,9 +534,16 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
>          while (cmd->cur_sg) {
>                  struct page *page = sg_page(cmd->cur_sg);
>                  u32 left = cmd->cur_sg->length - cmd->offset;
> +               int flags = MSG_DONTWAIT;
> +
> +               if (cmd->wbytes_done + left < cmd->req.transfer_len ||
> +                       (!last_in_batch && cmd->queue->send_list_len))
> +                       flags |= MSG_MORE;
> +               else if (queue->nvme_sq.sqhd_disabled)
> +                       flags |= MSG_EOR;

Is the MSG_EOR necessary here? Can you check without it?

> 
>                  ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
> -                                       left, MSG_DONTWAIT | MSG_MORE);
> +                                       left, flags);
>                  if (ret <= 0)
>                          return ret;
> 
> @@ -670,7 +678,7 @@ static int nvmet_tcp_try_send_one(struct nvmet_tcp_queue *queue,
>          }
> 
>          if (cmd->state == NVMET_TCP_SEND_DATA) {
> -               ret = nvmet_try_send_data(cmd);
> +               ret = nvmet_try_send_data(cmd, last_in_batch);
>                  if (ret <= 0)
>                          goto done_send;
>          }
> 
> --- Mark



* Re: Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Sagi Grimberg @ 2020-03-11 22:56 UTC
  To: Wunderlich, Mark, linux-nvme; +Cc: Jens Axboe


>> @@ -533,9 +534,16 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
>>          while (cmd->cur_sg) {
>>                  struct page *page = sg_page(cmd->cur_sg);
>>                  u32 left = cmd->cur_sg->length - cmd->offset;
>> +               int flags = MSG_DONTWAIT;
>> +
>> +               if (cmd->wbytes_done + left < cmd->req.transfer_len ||
>> +                       (!last_in_batch && cmd->queue->send_list_len))
>> +                       flags |= MSG_MORE;
>> +               else if (queue->nvme_sq.sqhd_disabled)
>> +                       flags |= MSG_EOR;
> 
> Is the MSG_EOR necessary here? Can you check without it?

Mark? Did you get a chance to check that?



* RE: Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Wunderlich, Mark @ 2020-03-12  0:21 UTC
  To: Sagi Grimberg, linux-nvme; +Cc: Jens Axboe

>>> @@ -533,9 +534,16 @@ static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd)
>>>          while (cmd->cur_sg) {
>>>                  struct page *page = sg_page(cmd->cur_sg);
>>>                  u32 left = cmd->cur_sg->length - cmd->offset;
>>> +               int flags = MSG_DONTWAIT;
>>> +
>>> +               if (cmd->wbytes_done + left < cmd->req.transfer_len ||
>>> +                       (!last_in_batch && cmd->queue->send_list_len))
>>> +                       flags |= MSG_MORE;
>>> +               else if (queue->nvme_sq.sqhd_disabled)
>>> +                       flags |= MSG_EOR;
>> 
>> Is the MSG_EOR necessary here? Can you check without it?

> Mark? Did you get a chance to check that?

Yes, I modified the 'if' conditions a bit, and testing does not show a need for setting MSG_EOR.

New if condition used:
+               if (cmd->wbytes_done + left < cmd->req.transfer_len ||   /* more data in this xfer to send */
+                       !queue->nvme_sq.sqhd_disabled || /* then a response frame is coming */
+                       (!last_in_batch && cmd->queue->send_list_len))  /* more requests to process in this batch window */
+                       flags |= MSG_MORE;

For a queue depth of 1, I will at times still see the IOPS rate dip below the high mark but recover quickly.  I do not see this on every run, and it seems to be more frequent when testing a ramdisk rather than a real NVMe device.  But for the most part this greatly improves the original issue I reported.
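
For review convenience, here is roughly how the send loop reads with that condition folded in.  This is pieced together only from the hunks in this thread (the unchanged parts are elided), so treat it as approximate rather than the final patch:

/*
 * Approximate, consolidated view of nvmet_try_send_data() with the
 * proposed flag logic, reconstructed from the diff hunks above for
 * readability.  Parts not shown in this thread are elided.
 */
static int nvmet_try_send_data(struct nvmet_tcp_cmd *cmd, bool last_in_batch)
{
	struct nvmet_tcp_queue *queue = cmd->queue;
	int ret;

	while (cmd->cur_sg) {
		struct page *page = sg_page(cmd->cur_sg);
		u32 left = cmd->cur_sg->length - cmd->offset;
		int flags = MSG_DONTWAIT;

		/*
		 * Keep the socket corked only if something else will follow
		 * and flush it: more data in this transfer, a response
		 * capsule (sq flow control still enabled), or further
		 * commands queued within this batch window.
		 */
		if (cmd->wbytes_done + left < cmd->req.transfer_len ||
		    !queue->nvme_sq.sqhd_disabled ||
		    (!last_in_batch && cmd->queue->send_list_len))
			flags |= MSG_MORE;

		ret = kernel_sendpage(cmd->queue->sock, page, cmd->offset,
					left, flags);
		if (ret <= 0)
			return ret;

		/* ... offset/wbytes_done bookkeeping and sg advance as before ... */
	}

	/* ... remainder of the function unchanged ... */
}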



* Re: Sighting: Use of 'disable_sqflow' option shows drastically reduced I/O rate for small queue depths
From: Sagi Grimberg @ 2020-03-12  6:29 UTC
  To: Wunderlich, Mark, linux-nvme; +Cc: Jens Axboe


> Yes, I modified the 'if' conditions a bit, and testing does not show a need for setting MSG_EOR.
> 
> New if condition used:
> +               if (cmd->wbytes_done + left < cmd->req.transfer_len ||   /* more data in this xfer to send */
> +                       !queue->nvme_sq.sqhd_disabled || /* then a response frame is coming */
> +                       (!last_in_batch && cmd->queue->send_list_len))  /* more requests to process in this batch window */
> +                       flags |= MSG_MORE;
> 
> For a queue depth of 1, I will at times still see the IOPS rate dip below the high mark but recover quickly.  I do not see this on every run, and it seems to be more frequent when testing a ramdisk rather than a real NVMe device.  But for the most part this greatly improves the original issue I reported.

Thanks, sent a patch for this.

For the short dip you see at times, does it happen when you
use sync/pvsync2 io engines?

Maybe this could be related to some NIC effects like adaptive moderation 
or LRO?


