* Re: Precise disk statistics
       [not found] <1650661324247.40468@akamai.com>
@ 2022-04-25 19:35 ` Jayaramappa, Srilakshmi
  2022-04-26  6:25   ` yukuai (C)
  0 siblings, 1 reply; 6+ messages in thread
From: Jayaramappa, Srilakshmi @ 2022-04-25 19:35 UTC (permalink / raw)
  To: axboe, snitzer, linux-block; +Cc: Hunt, Joshua

________________________________________
From: Jayaramappa, Srilakshmi
Sent: Friday, April 22, 2022 5:02 PM
To: axboe@kernel.dk; snitzer@redhat.com; linux-block@vger.kernel.org
Cc: Hunt, Joshua
Subject: Precise disk statistics

Hi,

We install LTS kernels on our machines. While moving from 4.19.x to 5.4.x
we noticed a performance drop in one of our applications. We tracked the
root cause down to the commit series that removed the pending-I/O
accounting (80a787ba3809 to 6f75723190d8), which includes 5b18b5a73760
("block: delete part_round_stats and switch to less precise counting").

The application (which runs on non-dm machines) tracks disk utilization to
estimate how much additional load it can take on. After the commits in
question, we see over-reporting of disk utilization [1] compared to the
older reporting method based on the inflight counter [2] for the same
load. The over-reporting is observed in v5.4.190 and in v5.15.35 as well.
I've attached the config file used to build the kernel.

We understand that the disk util% does not give a true picture of how much
more work a flash-based device can do, and we are planning to use a
different model to observe the performance potential. In the interim we
are having to revert the above commit series to bring back the original
reporting method.

In the hope of getting our application's performance back with a new
change on top of the 5.4.x reporting (as opposed to reverting commits), I
tried checking whether the request queue is busy before updating io_ticks
[3]. With this change the application's throughput is closer to what we
observe with the commits reverted, but still behind by ~6%. However, I am
not sure this change is safe overall.

I'd appreciate your expert opinion on this matter. Could you please let us
know if there is some other idea we could explore to report precise disk
stats, which we could build on top of the existing reporting in the kernel
and submit as a patch, or whether going back to using the inflight
counters is indeed our best bet.

Thank you
-Sri

[1]
root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
...
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.721532 s, 3.0 GB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.13    0.00    0.28    1.53    0.00   98.07

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
nvme3n1       16383.00   2047.88     0.00   0.00    0.21   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  72.20

[2]
root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
...
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.702101 s, 3.1 GB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.18    1.57    0.00   98.22

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
nvme3n1       16380.00   2047.50     0.00   0.00    0.20   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  64.20

[3]
diff --git a/block/bio.c b/block/bio.c
index cb38d6f3acce..8275b10a1c9a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio)
 		schedule_work(&bio_dirty_work);
 }

-void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
+void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end)
 {
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->stamp);
 	if (unlikely(stamp != now)) {
 		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+			if (blk_mq_queue_inflight(q)) {
 				__part_stat_add(part, io_ticks, end ? now - stamp : 1);
+			}
 		}
 	}
 	if (part->partno) {
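(For reference on how the %util numbers in [1] and [2] come about -- this
is standard iostat behaviour, stated here from memory: iostat reads the
"time spent doing I/Os" counter (io_ticks) exposed via /proc/diskstats
and reports

    %util = delta(io_ticks) / sampling interval * 100

so [1] accounted ~722 ms of busy time during the 1 s sample -- close to
the entire 0.72 s the dd ran -- while [2] accounted only ~642 ms for the
same workload.)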
Sorry, resending without the config attachment since my original email
bounced from linux-block.

Thanks
-Sri

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Precise disk statistics
  2022-04-25 19:35 ` Precise disk statistics Jayaramappa, Srilakshmi
@ 2022-04-26  6:25   ` yukuai (C)
  2022-04-26 23:56     ` Jayaramappa, Srilakshmi
  0 siblings, 1 reply; 6+ messages in thread
From: yukuai (C) @ 2022-04-26  6:25 UTC (permalink / raw)
  To: Jayaramappa, Srilakshmi, axboe, snitzer, linux-block; +Cc: Hunt, Joshua

On 2022/04/26 3:35, Jayaramappa, Srilakshmi wrote:
> We install LTS kernels on our machines. While moving from 4.19.x to 5.4.x
> we noticed a performance drop in one of our applications. We tracked the
> root cause down to the commit series that removed the pending-I/O
> accounting (80a787ba3809 to 6f75723190d8), which includes 5b18b5a73760
> ("block: delete part_round_stats and switch to less precise counting").
>
[...]
>
> [3]
> diff --git a/block/bio.c b/block/bio.c
> index cb38d6f3acce..8275b10a1c9a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio)
>  		schedule_work(&bio_dirty_work);
>  }
>
> -void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> +void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end)
>  {
>  	unsigned long stamp;
>  again:
>  	stamp = READ_ONCE(part->stamp);
>  	if (unlikely(stamp != now)) {
>  		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> +			if (blk_mq_queue_inflight(q)) {
>  				__part_stat_add(part, io_ticks, end ? now - stamp : 1);

Hi,

We hit the same problem, and I'm pretty sure the root cause is the code
above: when the first I/O in a new jiffy is started, io_ticks is always
charged one extra jiffy, which is wrong. In your test case, dd issues
I/O one request at a time, so whenever a new I/O is issued in a later
jiffy than the one in which the previous I/O completed, io_ticks is
miscalculated.

We reintroduced part_round_stats() to fix the problem. However,
iterating the tags when starting each I/O is not a good idea, and we
haven't yet found a solution that does not affect the fast path.
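For illustration, the rounding logic has roughly this shape -- a
simplified sketch of the v4.19-era approach written from memory, not our
exact patch (helper names and signatures differ across kernel versions):

	/*
	 * Sketch only: charge io_ticks for the stamp..now window iff at
	 * least one request was in flight during that window.  This keeps
	 * idle gaps out of %util -- but it also means the in-flight state
	 * must be evaluated on every update.
	 */
	static void part_round_stats_sketch(struct request_queue *q,
					    struct hd_struct *part)
	{
		unsigned long now = jiffies;

		if (now == READ_ONCE(part->stamp))
			return;
		if (blk_mq_queue_inflight(q))
			__part_stat_add(part, io_ticks, now - part->stamp);
		part->stamp = now;
	}

Called from the request start/completion paths, this only credits jiffies
in which requests were actually pending, instead of unconditionally
charging one jiffy for the first I/O of a new jiffy -- and the in-flight
check on every update is exactly the fast-path cost mentioned above.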
Thanks,
Kuai

> +			}
>  		}
>  	}
>  	if (part->partno) {

[...]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Precise disk statistics
  2022-04-26  6:25   ` yukuai (C)
@ 2022-04-26 23:56     ` Jayaramappa, Srilakshmi
  2022-04-27  9:57       ` Mikulas Patocka
  0 siblings, 1 reply; 6+ messages in thread
From: Jayaramappa, Srilakshmi @ 2022-04-26 23:56 UTC (permalink / raw)
  To: yukuai (C), axboe, snitzer, linux-block, mpatocka
  Cc: Hunt, Joshua, ming.lei

________________________________________
From: yukuai (C) <yukuai3@huawei.com>
Sent: Tuesday, April 26, 2022 2:25 AM
To: Jayaramappa, Srilakshmi; axboe@kernel.dk; snitzer@redhat.com; linux-block@vger.kernel.org
Cc: Hunt, Joshua
Subject: Re: Precise disk statistics

[full quote of the earlier exchange trimmed; Kuai's response was:]
> We hit the same problem, and I'm pretty sure the root cause is the code
> above: when the first I/O in a new jiffy is started, io_ticks is always
> charged one extra jiffy, which is wrong.
>
> We reintroduced part_round_stats() to fix the problem. However,
> iterating the tags when starting each I/O is not a good idea, and we
> haven't yet found a solution that does not affect the fast path.

[+ Mikulas and Ming]

I see. Thank you for the response, Kuai, I appreciate it.

The conversation at https://lkml.org/lkml/2020/3/24/1870 hints at
potential improvements to io_ticks tracking.

@Mikulas, Mike, please let us know if you have plans for more accurate
accounting, or if there is some idea we can work on and submit as a patch.

Thanks
-Sri

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Precise disk statistics
  2022-04-26 23:56     ` Jayaramappa, Srilakshmi
@ 2022-04-27  9:57       ` Mikulas Patocka
  2022-04-27 19:32         ` Josh Hunt
  0 siblings, 1 reply; 6+ messages in thread
From: Mikulas Patocka @ 2022-04-27  9:57 UTC (permalink / raw)
  To: Jayaramappa, Srilakshmi
  Cc: yukuai (C), axboe, snitzer, linux-block, Hunt, Joshua, ming.lei

> [+ Mikulas and Ming]
>
> I see. Thank you for the response, Kuai, I appreciate it.
>
> The conversation at https://lkml.org/lkml/2020/3/24/1870 hints at
> potential improvements to io_ticks tracking.
>
> @Mikulas, Mike, please let us know if you have plans for more accurate
> accounting, or if there is some idea we can work on and submit as a patch.

I know that the accounting is not accurate, but more accurate accounting
needed a shared atomic variable, and that caused performance degradation.
So we don't plan to improve the accounting.
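Schematically, the trade-off looks like this (hypothetical names, not
actual kernel code):

	/* Precise: one shared counter, touched by every CPU on every I/O. */
	atomic_t inflight;

	void io_start(void) { atomic_inc(&inflight); } /* submitting CPU */
	void io_done(void)  { atomic_dec(&inflight); } /* often another CPU */

	/*
	 * The shared cacheline ping-pongs between cores on the fast path.
	 * Per-cpu counters avoid that: writes stay local and only reads
	 * pay, by summing over all CPUs -- but then a cheap "is anything
	 * in flight right now?" test is no longer free on every update.
	 */
	DEFINE_PER_CPU(unsigned long, inflight_pcpu);

	void io_start_pcpu(void) { this_cpu_inc(inflight_pcpu); }

	unsigned long inflight_sum(void)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu(inflight_pcpu, cpu);
		return sum; /* reads are the expensive side */
	}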
Mikulas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Precise disk statistics
  2022-04-27  9:57       ` Mikulas Patocka
@ 2022-04-27 19:32         ` Josh Hunt
  2022-04-28  0:12           ` Ming Lei
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Hunt @ 2022-04-27 19:32 UTC (permalink / raw)
  To: Mikulas Patocka, Jayaramappa, Srilakshmi
  Cc: yukuai (C), axboe, snitzer, linux-block, ming.lei

On 4/27/22 02:57, Mikulas Patocka wrote:
> I know that the accounting is not accurate, but more accurate accounting
> needed a shared atomic variable, and that caused performance degradation.
> So we don't plan to improve the accounting.

Thanks, this is all very helpful.

If we know the accounting is not accurate, then is there any reason to
keep it around? What value is it providing? Also, should we update tools
that use io_ticks, like iostat, to report that disk utilization is
deprecated and should not be relied on going forward?

Josh

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Precise disk statistics
  2022-04-27 19:32         ` Josh Hunt
@ 2022-04-28  0:12           ` Ming Lei
  0 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2022-04-28  0:12 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Mikulas Patocka, Jayaramappa, Srilakshmi, yukuai (C), axboe,
	snitzer, linux-block

On Wed, Apr 27, 2022 at 12:32:25PM -0700, Josh Hunt wrote:
> On 4/27/22 02:57, Mikulas Patocka wrote:
> > I know that the accounting is not accurate, but more accurate accounting
> > needed a shared atomic variable, and that caused performance degradation.
> > So we don't plan to improve the accounting.
>
> Thanks, this is all very helpful.
>
> If we know the accounting is not accurate, then is there any reason to
> keep it around? What value is it providing? Also, should we update tools
> that use io_ticks, like iostat, to report that disk utilization is
> deprecated and should not be relied on going forward?

man iostat
...
    %util
        Percentage of elapsed time during which I/O requests were issued
        to the device (bandwidth utilization for the device). Device
        saturation occurs when this value is close to 100% for devices
        serving requests serially. But for devices serving requests in
        parallel, such as RAID arrays and modern SSDs, this number does
        not reflect their performance limits.
...

As you can see, %util was never accurate to begin with.

If you want the somewhat more accurate accounting from before
5b18b5a73760 ("block: delete part_round_stats and switch to less precise
counting") was applied, it can be reproduced easily with one simple
bcc/bpftrace program, at some extra cost.
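For example, something along these lines (an untested sketch; it counts
all block devices together -- per-device filtering via the tracepoint
args is left out for brevity) measures the time during which at least
one request is in flight, which is the old notion of busy time:

root@xxx:~# bpftrace -e '
tracepoint:block:block_rq_issue
{
	/* first request of a busy period: remember when it began */
	if (@inflight == 0) { @busy_start = nsecs; }
	@inflight = @inflight + 1;
}
tracepoint:block:block_rq_complete
/@inflight > 0/
{
	@inflight = @inflight - 1;
	/* busy period ends when the queue drains */
	if (@inflight == 0) { @busy_ns = @busy_ns + (nsecs - @busy_start); }
}
interval:s:1
{
	/* 1% of a second == 10,000,000 ns */
	printf("util percent: %d\n", @busy_ns / 10000000);
	@busy_ns = 0;
}'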
Thanks,
Ming

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-04-28  0:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1650661324247.40468@akamai.com>
2022-04-25 19:35 ` Precise disk statistics Jayaramappa, Srilakshmi
2022-04-26  6:25   ` yukuai (C)
2022-04-26 23:56     ` Jayaramappa, Srilakshmi
2022-04-27  9:57       ` Mikulas Patocka
2022-04-27 19:32         ` Josh Hunt
2022-04-28  0:12           ` Ming Lei