From: "yukuai (C)" <yukuai3@huawei.com>
To: "Jayaramappa, Srilakshmi" <sjayaram@akamai.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"snitzer@redhat.com" <snitzer@redhat.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Cc: "Hunt, Joshua" <johunt@akamai.com>
Subject: Re: Precise disk statistics
Date: Tue, 26 Apr 2022 14:25:32 +0800	[thread overview]
Message-ID: <87031651-ba75-2b6f-8a5e-b0b4ef41c65f@huawei.com> (raw)
In-Reply-To: <1650915337169.63486@akamai.com>

On 2022/04/26 3:35, Jayaramappa, Srilakshmi wrote:
> ________________________________________
> From: Jayaramappa, Srilakshmi
> Sent: Friday, April 22, 2022 5:02 PM
> To: axboe@kernel.dk; snitzer@redhat.com; linux-block@vger.kernel.org
> Cc: Hunt, Joshua
> Subject: Precise disk statistics
> 
> Hi,
> 
> We install LTS kernels on our machines. While moving from 4.19.x to 5.4.x we noticed a performance drop in one of our applications.
> We tracked down the root cause to the commit series removing the pending IO accounting (80a787ba3809 to 6f75723190d8),
> which includes 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting").
> 
> The application (which runs on non-dm machines) tracks disk utilization to estimate how much more load it can take on. After the commits in question,
> we see an over-reporting of disk utilization [1] compared to the older method of reporting based on the inflight counter [2] for the same load.
> The over-reporting is observed in v5.4.190 and in v5.15.35 as well. I've attached the config file used to build the kernel.
> 
> We understand that the disk util% does not provide a true picture of how much more work the device is capable of doing in flash based
> devices and we are planning to use a different model to observe the performance potential.
> In the interim we are having to revert the above commit series to bring back the original reporting method.
> 
> In the hopes of getting back our application's performance with a new change on top of the 5.4.x reporting (as opposed to reverting the commits),
> I tried checking whether the request queue is busy before updating io_ticks [3]. With this change the application's throughput is closer to
> what we observe with the commits reverted, but still behind by ~6%. However, I am not sure this change is safe overall.
> 
> I'd appreciate your expert opinion on this matter. Could you please let us know if there is some other approach we could explore to report precise disk stats,
> building on top of the existing reporting in the kernel (we would submit a patch), or if going back to the inflight counters is indeed our best bet.
> 
> Thank you
> -Sri
> 
> [1]
> root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
> ...
> 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.721532 s, 3.0 GB/s
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.13    0.00    0.28    1.53    0.00   98.07
> 
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
> nvme3n1       16383.00   2047.88     0.00   0.00    0.21   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  72.20
> 
> [2]
> 
> root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
> ...
> 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.702101 s, 3.1 GB/s
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.03    0.00    0.18    1.57    0.00   98.22
> 
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
> nvme3n1       16380.00   2047.50     0.00   0.00    0.20   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  64.20
> 
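For context on the %util numbers above: iostat derives %util from the io_ticks field of /proc/diskstats (field 10 after the device name, in milliseconds of time the device had IO in flight), divided by the sampling interval. A minimal sketch of that arithmetic follows; parse_io_ticks and util_percent are illustrative helpers, not iostat's actual code:

```python
# Sketch of how %util is derived from /proc/diskstats io_ticks
# (field layout per Documentation/admin-guide/iostats.rst).

def parse_io_ticks(diskstats_text, dev):
    """Return io_ticks (ms the device had IO in flight) for `dev`."""
    for line in diskstats_text.splitlines():
        parts = line.split()
        # parts[2] is the device name; field 10 after it is io_ticks.
        if len(parts) > 12 and parts[2] == dev:
            return int(parts[12])
    raise ValueError(f"device {dev!r} not found")

def util_percent(ticks_before_ms, ticks_after_ms, interval_ms):
    """%util over the interval: fraction of wall time with IO in flight."""
    return 100.0 * (ticks_after_ms - ticks_before_ms) / interval_ms
```

For example, the 72.20 %util in [1] corresponds to an io_ticks delta of 722 ms over the 1-second sampling interval.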
> 
> [3]
> diff --git a/block/bio.c b/block/bio.c
> index cb38d6f3acce..8275b10a1c9a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio)
>          schedule_work(&bio_dirty_work);
>   }
> 
> -void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> +void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end)
>   {
>          unsigned long stamp;
>   again:
>          stamp = READ_ONCE(part->stamp);
>          if (unlikely(stamp != now)) {
>                  if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> +                      if (blk_mq_queue_inflight(q)) {
>                                  __part_stat_add(part, io_ticks, end ? now - stamp : 1);
Hi,

We hit the same problem, and I'm fairly sure the root cause is the code
above: when the first IO is started in a new jiffy, io_ticks is always
incremented by one extra jiffy, which is wrong. In your test case, dd
issues IOs one at a time, so whenever a new IO is issued in a later
jiffy than the one in which the previous IO completed, io_ticks is
miscalculated.

We reintroduced part_round_stats() to fix the problem. However,
iterating the tags when starting each IO is not a good idea, and we
haven't yet found a solution that avoids impacting the fast path.

Thanks,
Kuai
> +            }
>                  }
>          }
>          if (part->partno) {
> 
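To make the miscounting concrete, here is a minimal user-space simulation (illustration only, not kernel code) comparing exact inflight-based busy accounting against the v5.4-style update_io_ticks() heuristic; the simulate() helper and the workload are hypothetical:

```python
# Illustration only: exact inflight-based busy time versus the
# v5.4-style update_io_ticks() heuristic. Times are integer jiffies.

def simulate(ios):
    """ios: list of (start, end) jiffies for sequential one-at-a-time IOs."""
    # Exact: jiffies during which an IO was actually in flight.
    exact = sum(end - start for start, end in ios)

    # Heuristic: on IO start, credit 1 jiffy if the stamp is stale;
    # on IO end, credit the whole (now - stamp) interval.
    stamp = ios[0][0]
    ticks = 0
    for start, end in ios:
        if start != stamp:        # update_io_ticks(..., end=false)
            ticks += 1
            stamp = start
        if end != stamp:          # update_io_ticks(..., end=true)
            ticks += end - stamp
            stamp = end
    return exact, ticks

# Ten 1-jiffy IOs separated by 1-jiffy idle gaps: the device is busy
# for 10 of 20 jiffies (50%), but the heuristic credits 19 jiffies.
print(simulate([(t, t + 1) for t in range(0, 20, 2)]))  # (10, 19)
```

With no idle gaps (back-to-back IOs) the two methods agree; the extra jiffy credited at each IO start only inflates io_ticks when IOs start in a later jiffy than the previous completion, which is exactly the dd one-at-a-time pattern described above.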
> 
> 
> Sorry, resending without the config attachment since my original email bounced from linux-block.
> 
> 
> Thanks
> -Sri.
> 

Thread overview: 6+ messages
     [not found] <1650661324247.40468@akamai.com>
2022-04-25 19:35 ` Precise disk statistics Jayaramappa, Srilakshmi
2022-04-26  6:25   ` yukuai (C) [this message]
2022-04-26 23:56     ` Jayaramappa, Srilakshmi
2022-04-27  9:57       ` Mikulas Patocka
2022-04-27 19:32         ` Josh Hunt
2022-04-28  0:12           ` Ming Lei
