From: "yukuai (C)" <yukuai3@huawei.com>
To: "Jayaramappa, Srilakshmi" <sjayaram@akamai.com>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"snitzer@redhat.com" <snitzer@redhat.com>,
	"linux-block@vger.kernel.org" <linux-block@vger.kernel.org>
Cc: "Hunt, Joshua" <johunt@akamai.com>
Subject: Re: Precise disk statistics
Date: Tue, 26 Apr 2022 14:25:32 +0800	[thread overview]
Message-ID: <87031651-ba75-2b6f-8a5e-b0b4ef41c65f@huawei.com> (raw)
In-Reply-To: <1650915337169.63486@akamai.com>

On 2022/04/26 3:35, Jayaramappa, Srilakshmi wrote:
> ________________________________________
> From: Jayaramappa, Srilakshmi
> Sent: Friday, April 22, 2022 5:02 PM
> To: axboe@kernel.dk; snitzer@redhat.com; linux-block@vger.kernel.org
> Cc: Hunt, Joshua
> Subject: Precise disk statistics
> 
> Hi,
> 
> We install LTS kernels on our machines. While moving from 4.19.x to 5.4.x we noticed a performance drop in one of our applications.
> We tracked down the root cause to the commit series removing the pending IO accounting (80a787ba3809 to 6f75723190d8),
> which includes 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting").
> 
> The application (which runs on non-dm machines) tracks disk utilization to estimate how much more load it can take on. After the commits in question,
> we see an over-reporting of disk utilization [1] compared to the older method of reporting based on the inflight counter [2] for the same load.
> The over-reporting is observed in v5.4.190 and in v5.15.35 as well. I've attached the config file used to build the kernel.
> 
> We understand that the disk util% does not provide a true picture of how much more work the device is capable of doing in flash based
> devices and we are planning to use a different model to observe the performance potential.
> In the interim we are having to revert the above commit series to bring back the original reporting method.
> 
> In the hopes of getting back our application's performance with a new change on top of the 5.4.x reporting (as opposed to reverting the commits),
> I tried checking whether the request queue is busy before updating io_ticks [3]. With this change the application's throughput is closer to
> what we observe with the commits reverted, but still behind by ~6%. However, I am not sure this change is safe overall.
> 
> I'd appreciate your expert opinion on this matter. Could you please let us know if there is some other approach we could explore to report precise disk stats,
> building on top of the existing reporting in the kernel (we would submit a patch), or if going back to the inflight counters is indeed our best bet.
> 
> Thank you
> -Sri
> 
> [1]
> root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
> ...
> 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.721532 s, 3.0 GB/s
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.13    0.00    0.28    1.53    0.00   98.07
> 
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
> nvme3n1       16383.00   2047.88     0.00   0.00    0.21   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  72.20
> 
> [2]
> 
> root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
> ...
> 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.702101 s, 3.1 GB/s
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.03    0.00    0.18    1.57    0.00   98.22
> 
> Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
> nvme3n1       16380.00   2047.50     0.00   0.00    0.20   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  64.20
> 
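For context on the %util numbers above: iostat derives %util from the io_ticks field of /proc/diskstats (field 10 after the device name, in milliseconds of time the device had IO in flight), divided by the sampling interval. A minimal sketch of that arithmetic follows; parse_io_ticks and util_percent are illustrative helpers, not iostat's actual code:

```python
# Sketch of how %util is derived from /proc/diskstats io_ticks
# (field layout per Documentation/admin-guide/iostats.rst).

def parse_io_ticks(diskstats_text, dev):
    """Return io_ticks (ms the device had IO in flight) for `dev`."""
    for line in diskstats_text.splitlines():
        parts = line.split()
        # parts[2] is the device name; field 10 after it is io_ticks.
        if len(parts) > 12 and parts[2] == dev:
            return int(parts[12])
    raise ValueError(f"device {dev!r} not found")

def util_percent(ticks_before_ms, ticks_after_ms, interval_ms):
    """%util over the interval: fraction of wall time with IO in flight."""
    return 100.0 * (ticks_after_ms - ticks_before_ms) / interval_ms
```

For example, the 72.20 %util in [1] corresponds to an io_ticks delta of 722 ms over the 1-second sampling interval.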
> 
> [3]
> diff --git a/block/bio.c b/block/bio.c
> index cb38d6f3acce..8275b10a1c9a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio)
>          schedule_work(&bio_dirty_work);
>   }
> 
> -void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> +void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end)
>   {
>          unsigned long stamp;
>   again:
>          stamp = READ_ONCE(part->stamp);
>          if (unlikely(stamp != now)) {
>                  if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> +                      if (blk_mq_queue_inflight(q)) {
>                                  __part_stat_add(part, io_ticks, end ? now - stamp : 1);
Hi,

We hit the same problem, and I'm fairly sure the root cause is the code
above: when the first IO is started in a new jiffy, io_ticks is always
incremented by one extra jiffy, which is wrong. In your test case, dd
issues IOs one at a time, so whenever a new IO is issued in a later
jiffy than the one in which the previous IO completed, io_ticks is
miscalculated.

We reintroduced part_round_stats() to fix the problem. However,
iterating the tags when starting each IO is not a good idea, and we
haven't yet found a solution that avoids impacting the fast path.

Thanks,
Kuai
> +            }
>                  }
>          }
>          if (part->partno) {
> 
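To make the miscounting concrete, here is a minimal user-space simulation (illustration only, not kernel code) comparing exact inflight-based busy accounting against the v5.4-style update_io_ticks() heuristic; the simulate() helper and the workload are hypothetical:

```python
# Illustration only: exact inflight-based busy time versus the
# v5.4-style update_io_ticks() heuristic. Times are integer jiffies.

def simulate(ios):
    """ios: list of (start, end) jiffies for sequential one-at-a-time IOs."""
    # Exact: jiffies during which an IO was actually in flight.
    exact = sum(end - start for start, end in ios)

    # Heuristic: on IO start, credit 1 jiffy if the stamp is stale;
    # on IO end, credit the whole (now - stamp) interval.
    stamp = ios[0][0]
    ticks = 0
    for start, end in ios:
        if start != stamp:        # update_io_ticks(..., end=false)
            ticks += 1
            stamp = start
        if end != stamp:          # update_io_ticks(..., end=true)
            ticks += end - stamp
            stamp = end
    return exact, ticks

# Ten 1-jiffy IOs separated by 1-jiffy idle gaps: the device is busy
# for 10 of 20 jiffies (50%), but the heuristic credits 19 jiffies.
print(simulate([(t, t + 1) for t in range(0, 20, 2)]))  # (10, 19)
```

With no idle gaps (back-to-back IOs) the two methods agree; the extra jiffy credited at each IO start only inflates io_ticks when IOs start in a later jiffy than the previous completion, which is exactly the dd one-at-a-time pattern described above.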
> 
> 
> Sorry, resending without the config attachment since my original email bounced from linux-block.
> 
> 
> Thanks
> -Sri.
> 

Thread overview: 6+ messages
     [not found] <1650661324247.40468@akamai.com>
2022-04-25 19:35 ` Precise disk statistics Jayaramappa, Srilakshmi
2022-04-26  6:25   ` yukuai (C) [this message]
2022-04-26 23:56     ` Jayaramappa, Srilakshmi
2022-04-27  9:57       ` Mikulas Patocka
2022-04-27 19:32         ` Josh Hunt
2022-04-28  0:12           ` Ming Lei
