* Re: Precise disk statistics
       [not found] <1650661324247.40468@akamai.com>
@ 2022-04-25 19:35 ` Jayaramappa, Srilakshmi
  2022-04-26  6:25   ` yukuai (C)
  0 siblings, 1 reply; 6+ messages in thread
From: Jayaramappa, Srilakshmi @ 2022-04-25 19:35 UTC (permalink / raw)
  To: axboe, snitzer, linux-block; +Cc: Hunt, Joshua

________________________________________
From: Jayaramappa, Srilakshmi
Sent: Friday, April 22, 2022 5:02 PM
To: axboe@kernel.dk; snitzer@redhat.com; linux-block@vger.kernel.org
Cc: Hunt, Joshua
Subject: Precise disk statistics

Hi,

We install LTS kernels on our machines. While moving from 4.19.x to 5.4.x
we noticed a performance drop in one of our applications. We tracked the
root cause down to the commit series that removed the pending-I/O
accounting (80a787ba3809 to 6f75723190d8), which includes 5b18b5a73760
("block: delete part_round_stats and switch to less precise counting").

The application (which runs on non-dm machines) tracks disk utilization to
estimate how much additional load it can take on. After the commits in
question, we see over-reporting of disk utilization [1] compared to the
older reporting method based on the inflight counter [2] for the same
load. The over-reporting is observed in v5.4.190 and in v5.15.35 as well.
I've attached the config file used to build the kernel.

We understand that the disk util% does not give a true picture of how much
more work a flash-based device can do, and we are planning to use a
different model to observe the performance potential. In the interim we
are having to revert the above commit series to bring back the original
reporting method.

In the hope of getting our application's performance back with a new
change on top of the 5.4.x reporting (as opposed to reverting commits), I
tried checking whether the request queue is busy before updating io_ticks
[3]. With this change the application's throughput is closer to what we
observe with the commits reverted, but still behind by ~6%. However, I am
not sure this change is safe overall.

I'd appreciate your expert opinion on this matter. Could you please let us
know if there is some other idea we could explore to report precise disk
stats, which we could build on top of the existing reporting in the kernel
and submit as a patch, or whether going back to using the inflight
counters is indeed our best bet.

Thank you
-Sri

[1]
root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
...
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.721532 s, 3.0 GB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.13    0.00    0.28    1.53    0.00   98.07

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
nvme3n1       16383.00   2047.88     0.00   0.00    0.21   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  72.20

[2]
root@xxx:~# DISK=nvme3n1; dd if=/dev/$DISK of=/dev/null bs=1048576 iflag=direct count=2048 & iostat -yxm /dev/$DISK 1 1 ; wait
...
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.702101 s, 3.1 GB/s

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.03    0.00    0.18    1.57    0.00   98.22

Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
nvme3n1       16380.00   2047.50     0.00   0.00    0.20   128.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00  64.20

[3]
diff --git a/block/bio.c b/block/bio.c
index cb38d6f3acce..8275b10a1c9a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio)
 		schedule_work(&bio_dirty_work);
 }

-void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
+void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end)
 {
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->stamp);
 	if (unlikely(stamp != now)) {
 		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+			if (blk_mq_queue_inflight(q)) {
 				__part_stat_add(part, io_ticks, end ? now - stamp : 1);
+			}
 		}
 	}
 	if (part->partno) {
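(For reference on how the %util numbers in [1] and [2] come about -- this
is standard iostat behaviour, stated here from memory: iostat reads the
"time spent doing I/Os" counter (io_ticks) exposed via /proc/diskstats
and reports

    %util = delta(io_ticks) / sampling interval * 100

so [1] accounted ~722 ms of busy time during the 1 s sample -- close to
the entire 0.72 s the dd ran -- while [2] accounted only ~642 ms for the
same workload.)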
Sorry, resending without the config attachment since my original email
bounced from linux-block.

Thanks
-Sri

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Precise disk statistics
  2022-04-25 19:35 ` Precise disk statistics Jayaramappa, Srilakshmi
@ 2022-04-26  6:25   ` yukuai (C)
  2022-04-26 23:56     ` Jayaramappa, Srilakshmi
  0 siblings, 1 reply; 6+ messages in thread
From: yukuai (C) @ 2022-04-26  6:25 UTC (permalink / raw)
  To: Jayaramappa, Srilakshmi, axboe, snitzer, linux-block; +Cc: Hunt, Joshua

On 2022/04/26 3:35, Jayaramappa, Srilakshmi wrote:
> We install LTS kernels on our machines. While moving from 4.19.x to 5.4.x
> we noticed a performance drop in one of our applications. We tracked the
> root cause down to the commit series that removed the pending-I/O
> accounting (80a787ba3809 to 6f75723190d8), which includes 5b18b5a73760
> ("block: delete part_round_stats and switch to less precise counting").
>
[...]
>
> [3]
> diff --git a/block/bio.c b/block/bio.c
> index cb38d6f3acce..8275b10a1c9a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1754,14 +1754,17 @@ void bio_check_pages_dirty(struct bio *bio)
>  		schedule_work(&bio_dirty_work);
>  }
>
> -void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
> +void update_io_ticks(struct request_queue *q, struct hd_struct *part, unsigned long now, bool end)
>  {
>  	unsigned long stamp;
>  again:
>  	stamp = READ_ONCE(part->stamp);
>  	if (unlikely(stamp != now)) {
>  		if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
> +			if (blk_mq_queue_inflight(q)) {
>  				__part_stat_add(part, io_ticks, end ? now - stamp : 1);

Hi,

We hit the same problem, and I'm pretty sure the root cause is the code
above: when the first I/O in a new jiffy is started, io_ticks is always
charged one extra jiffy, which is wrong. In your test case, dd issues
I/O one request at a time, so whenever a new I/O is issued in a later
jiffy than the one in which the previous I/O completed, io_ticks is
miscalculated.

We reintroduced part_round_stats() to fix the problem. However,
iterating the tags when starting each I/O is not a good idea, and we
haven't yet found a solution that does not affect the fast path.
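For illustration, the rounding logic has roughly this shape -- a
simplified sketch of the v4.19-era approach written from memory, not our
exact patch (helper names and signatures differ across kernel versions):

	/*
	 * Sketch only: charge io_ticks for the stamp..now window iff at
	 * least one request was in flight during that window.  This keeps
	 * idle gaps out of %util -- but it also means the in-flight state
	 * must be evaluated on every update.
	 */
	static void part_round_stats_sketch(struct request_queue *q,
					    struct hd_struct *part)
	{
		unsigned long now = jiffies;

		if (now == READ_ONCE(part->stamp))
			return;
		if (blk_mq_queue_inflight(q))
			__part_stat_add(part, io_ticks, now - part->stamp);
		part->stamp = now;
	}

Called from the request start/completion paths, this only credits jiffies
in which requests were actually pending, instead of unconditionally
charging one jiffy for the first I/O of a new jiffy -- and the in-flight
check on every update is exactly the fast-path cost mentioned above.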
Thanks,
Kuai

> +			}
>  		}
>  	}
>  	if (part->partno) {

[...]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Precise disk statistics
  2022-04-26  6:25   ` yukuai (C)
@ 2022-04-26 23:56     ` Jayaramappa, Srilakshmi
  2022-04-27  9:57       ` Mikulas Patocka
  0 siblings, 1 reply; 6+ messages in thread
From: Jayaramappa, Srilakshmi @ 2022-04-26 23:56 UTC (permalink / raw)
  To: yukuai (C), axboe, snitzer, linux-block, mpatocka
  Cc: Hunt, Joshua, ming.lei

________________________________________
From: yukuai (C) <yukuai3@huawei.com>
Sent: Tuesday, April 26, 2022 2:25 AM
To: Jayaramappa, Srilakshmi; axboe@kernel.dk; snitzer@redhat.com; linux-block@vger.kernel.org
Cc: Hunt, Joshua
Subject: Re: Precise disk statistics

[full quote of the earlier exchange trimmed; Kuai's response was:]
> We hit the same problem, and I'm pretty sure the root cause is the code
> above: when the first I/O in a new jiffy is started, io_ticks is always
> charged one extra jiffy, which is wrong.
>
> We reintroduced part_round_stats() to fix the problem. However,
> iterating the tags when starting each I/O is not a good idea, and we
> haven't yet found a solution that does not affect the fast path.

[+ Mikulas and Ming]

I see. Thank you for the response, Kuai, I appreciate it.

The conversation at https://lkml.org/lkml/2020/3/24/1870 hints at
potential improvements to io_ticks tracking.

@Mikulas, Mike, please let us know if you have plans for more accurate
accounting, or if there is some idea we can work on and submit as a patch.

Thanks
-Sri

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Precise disk statistics
  2022-04-26 23:56     ` Jayaramappa, Srilakshmi
@ 2022-04-27  9:57       ` Mikulas Patocka
  2022-04-27 19:32         ` Josh Hunt
  0 siblings, 1 reply; 6+ messages in thread
From: Mikulas Patocka @ 2022-04-27  9:57 UTC (permalink / raw)
  To: Jayaramappa, Srilakshmi
  Cc: yukuai (C), axboe, snitzer, linux-block, Hunt, Joshua, ming.lei

> [+ Mikulas and Ming]
>
> I see. Thank you for the response, Kuai, I appreciate it.
>
> The conversation at https://lkml.org/lkml/2020/3/24/1870 hints at
> potential improvements to io_ticks tracking.
>
> @Mikulas, Mike, please let us know if you have plans for more accurate
> accounting, or if there is some idea we can work on and submit as a patch.

I know that the accounting is not accurate, but more accurate accounting
needed a shared atomic variable, and that caused performance degradation.
So we don't plan to improve the accounting.
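Schematically, the trade-off looks like this (hypothetical names, not
actual kernel code):

	/* Precise: one shared counter, touched by every CPU on every I/O. */
	atomic_t inflight;

	void io_start(void) { atomic_inc(&inflight); } /* submitting CPU */
	void io_done(void)  { atomic_dec(&inflight); } /* often another CPU */

	/*
	 * The shared cacheline ping-pongs between cores on the fast path.
	 * Per-cpu counters avoid that: writes stay local and only reads
	 * pay, by summing over all CPUs -- but then a cheap "is anything
	 * in flight right now?" test is no longer free on every update.
	 */
	DEFINE_PER_CPU(unsigned long, inflight_pcpu);

	void io_start_pcpu(void) { this_cpu_inc(inflight_pcpu); }

	unsigned long inflight_sum(void)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu(inflight_pcpu, cpu);
		return sum; /* reads are the expensive side */
	}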
Mikulas

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Precise disk statistics
  2022-04-27  9:57       ` Mikulas Patocka
@ 2022-04-27 19:32         ` Josh Hunt
  2022-04-28  0:12           ` Ming Lei
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Hunt @ 2022-04-27 19:32 UTC (permalink / raw)
  To: Mikulas Patocka, Jayaramappa, Srilakshmi
  Cc: yukuai (C), axboe, snitzer, linux-block, ming.lei

On 4/27/22 02:57, Mikulas Patocka wrote:
> I know that the accounting is not accurate, but more accurate accounting
> needed a shared atomic variable, and that caused performance degradation.
> So we don't plan to improve the accounting.

Thanks, this is all very helpful.

If we know the accounting is not accurate, then is there any reason to
keep it around? What value is it providing? Also, should we update tools
that use io_ticks, like iostat, to report that disk utilization is
deprecated and should not be relied on going forward?

Josh

^ permalink raw reply	[flat|nested] 6+ messages in thread
* Re: Precise disk statistics
  2022-04-27 19:32         ` Josh Hunt
@ 2022-04-28  0:12           ` Ming Lei
  0 siblings, 0 replies; 6+ messages in thread
From: Ming Lei @ 2022-04-28  0:12 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Mikulas Patocka, Jayaramappa, Srilakshmi, yukuai (C), axboe,
	snitzer, linux-block

On Wed, Apr 27, 2022 at 12:32:25PM -0700, Josh Hunt wrote:
> On 4/27/22 02:57, Mikulas Patocka wrote:
> > I know that the accounting is not accurate, but more accurate accounting
> > needed a shared atomic variable, and that caused performance degradation.
> > So we don't plan to improve the accounting.
>
> Thanks, this is all very helpful.
>
> If we know the accounting is not accurate, then is there any reason to
> keep it around? What value is it providing? Also, should we update tools
> that use io_ticks, like iostat, to report that disk utilization is
> deprecated and should not be relied on going forward?

man iostat
...
    %util
        Percentage of elapsed time during which I/O requests were issued
        to the device (bandwidth utilization for the device). Device
        saturation occurs when this value is close to 100% for devices
        serving requests serially. But for devices serving requests in
        parallel, such as RAID arrays and modern SSDs, this number does
        not reflect their performance limits.
...

As you can see, %util was never accurate to begin with.

If you want the somewhat more accurate accounting from before
5b18b5a73760 ("block: delete part_round_stats and switch to less precise
counting") was applied, it can be reproduced easily with one simple
bcc/bpftrace program, at some extra cost.
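For example, something along these lines (an untested sketch; it counts
all block devices together -- per-device filtering via the tracepoint
args is left out for brevity) measures the time during which at least
one request is in flight, which is the old notion of busy time:

root@xxx:~# bpftrace -e '
tracepoint:block:block_rq_issue
{
	/* first request of a busy period: remember when it began */
	if (@inflight == 0) { @busy_start = nsecs; }
	@inflight = @inflight + 1;
}
tracepoint:block:block_rq_complete
/@inflight > 0/
{
	@inflight = @inflight - 1;
	/* busy period ends when the queue drains */
	if (@inflight == 0) { @busy_ns = @busy_ns + (nsecs - @busy_start); }
}
interval:s:1
{
	/* 1% of a second == 10,000,000 ns */
	printf("util percent: %d\n", @busy_ns / 10000000);
	@busy_ns = 0;
}'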
Thanks,
Ming

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-04-28  0:12 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1650661324247.40468@akamai.com>
2022-04-25 19:35 ` Precise disk statistics Jayaramappa, Srilakshmi
2022-04-26  6:25   ` yukuai (C)
2022-04-26 23:56     ` Jayaramappa, Srilakshmi
2022-04-27  9:57       ` Mikulas Patocka
2022-04-27 19:32         ` Josh Hunt
2022-04-28  0:12           ` Ming Lei