* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write throughput caused by writeback cgroup and dirty throttle
[not found] <57333E75.3080309@huawei.com>
@ 2016-05-12 1:11 ` Miao Xie
2016-05-12 15:32 ` Tejun Heo
0 siblings, 1 reply; 7+ messages in thread
From: Miao Xie @ 2016-05-12 1:11 UTC (permalink / raw)
To: Fengguang Wu, Tejun Heo; +Cc: linux-kernel
Cc'ing the linux-kernel mailing list.
on 2016/5/11 at 22:15, Miao Xie wrote:
> Hi, Tejun and Fengguang
>
> I found that buffered write throughput drops sharply with the writeback cgroup and dirty throttling on
> a 4.6-rc7 kernel. When I ran the benchmark in the top-level block cgroup, the throughput was more than 1500MB/s.
> When I ran the benchmark in a new block cgroup, the throughput dropped to 4MB/s.
>
> Steps to reproduce:
> # mount -t cgroup2 cgroup <cgrp_mnt>
> # echo "+io +memory" > <cgrp_mnt>/cgroup.subtree_control
> # mkdir <cgrp_mnt>/aaa
> # echo $$ > <cgrp_mnt>/aaa/cgroup.procs
> # fio test.config
> job0: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> fio-2.2.8
> Starting 1 thread
> Jobs: 1 (f=1): [W(1)] [3.7% done] [0KB/4000KB/0KB /s] [0/1000/0 iops] [eta 04m:50s]
>
> Fio configuration is:
> [global]
> bs=4K
> direct=0
> ioengine=psync
> iodepth=1
> directory=/mnt/ext4/tstdir0
> time_based
> runtime=300
> group_reporting
> size=16G
> sync=0
> max_latency=120000000
> thread
>
> [job0]
> numjobs=1
> rw=write
>
> My box has 48 cores and 188GB memory, but I set
> vm.dirty_background_bytes = 268435456
> vm.dirty_bytes = 536870912
>
> If I set vm.dirty_background_bytes and vm.dirty_bytes to large values (vm.dirty_background_bytes = 3GB,
> vm.dirty_bytes = 4GB), fio throughput is more than 1500MB/s. If I then reset them to the original
> values (the ones above), the throughput drops to 500MB/s.
>
> According to my debugging, fio slept for 1ms every time it dirtied a page (balance dirty pages) when
> the throughput was down to 4MB/s. I think it might be a bug in dirty throttling when cgroup writeback is enabled.
>
> Tejun and Fengguang, please let me know what you think about this issue and whether you have
> any suggestions for possible solutions. Any input is greatly appreciated!
>
> Thanks
> Miao
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write throughput caused by writeback cgroup and dirty throttle
2016-05-12 1:11 ` [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write throughput caused by writeback cgroup and dirty throttle Miao Xie
@ 2016-05-12 15:32 ` Tejun Heo
2016-05-13 6:11 ` Miao Xie
2016-05-27 18:34 ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
0 siblings, 2 replies; 7+ messages in thread
From: Tejun Heo @ 2016-05-12 15:32 UTC (permalink / raw)
To: Miao Xie; +Cc: Fengguang Wu, linux-kernel
Hello,
On Thu, May 12, 2016 at 09:11:33AM +0800, Miao Xie wrote:
> >My box has 48 cores and 188GB memory, but I set
> >vm.dirty_background_bytes = 268435456
> >vm.dirty_bytes = 536870912
> >
> >If I set vm.dirty_background_bytes and vm.dirty_bytes to large values (vm.dirty_background_bytes = 3GB,
> >vm.dirty_bytes = 4GB), fio throughput is more than 1500MB/s. If I then reset them to the original
> >values (the ones above), the throughput drops to 500MB/s.
> >
> >According to my debugging, fio slept for 1ms every time it dirtied a page (balance dirty pages) when
> >the throughput was down to 4MB/s. I think it might be a bug in dirty throttling when cgroup writeback is enabled.
Heh, so, for cgroups, the absolute byte limits can't be applied directly
and are converted to percentage values before being applied. You're
specifying 0.27% for the threshold. Unfortunately, the ratio is
translated into an integer percentage and 0.27% becomes 0, so your
cgroups are always over the limit and being throttled.
Can you please see whether the following patch fixes the issue?
Thanks.
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d..a455a21 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -369,8 +369,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
 	unsigned long bytes = vm_dirty_bytes;
 	unsigned long bg_bytes = dirty_background_bytes;
-	unsigned long ratio = vm_dirty_ratio;
-	unsigned long bg_ratio = dirty_background_ratio;
+	/* convert ratios to per-PAGE_SIZE for higher precision */
+	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
+	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
 	unsigned long thresh;
 	unsigned long bg_thresh;
 	struct task_struct *tsk;
@@ -382,26 +383,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		/*
 		 * The byte settings can't be applied directly to memcg
 		 * domains. Convert them to ratios by scaling against
-		 * globally available memory.
+		 * globally available memory. As the ratios are in
+		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
+		 * pages.
 		 */
 		if (bytes)
-			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
-				    global_avail, 100UL);
+			ratio = min(DIV_ROUND_UP(bytes, global_avail),
+				    PAGE_SIZE);
 		if (bg_bytes)
-			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
-				       global_avail, 100UL);
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
+				       PAGE_SIZE);
 		bytes = bg_bytes = 0;
 	}

 	if (bytes)
 		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
 	else
-		thresh = (ratio * available_memory) / 100;
+		thresh = (ratio * available_memory) / PAGE_SIZE;

 	if (bg_bytes)
 		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
 	else
-		bg_thresh = (bg_ratio * available_memory) / 100;
+		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;

 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write throughput caused by writeback cgroup and dirty throttle
2016-05-12 15:32 ` Tejun Heo
@ 2016-05-13 6:11 ` Miao Xie
2016-05-27 18:24 ` Tejun Heo
2016-05-27 18:34 ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
1 sibling, 1 reply; 7+ messages in thread
From: Miao Xie @ 2016-05-13 6:11 UTC (permalink / raw)
To: Tejun Heo; +Cc: Fengguang Wu, linux-kernel
on 2016/5/12 at 23:32, Tejun Heo wrote:
> On Thu, May 12, 2016 at 09:11:33AM +0800, Miao Xie wrote:
>>> My box has 48 cores and 188GB memory, but I set
>>> vm.dirty_background_bytes = 268435456
>>> vm.dirty_bytes = 536870912
>>>
>>> If I set vm.dirty_background_bytes and vm.dirty_bytes to large values (vm.dirty_background_bytes = 3GB,
>>> vm.dirty_bytes = 4GB), fio throughput is more than 1500MB/s. If I then reset them to the original
>>> values (the ones above), the throughput drops to 500MB/s.
>>>
>>> According to my debugging, fio slept for 1ms every time it dirtied a page (balance dirty pages) when
>>> the throughput was down to 4MB/s. I think it might be a bug in dirty throttling when cgroup writeback is enabled.
>
> Heh, so, for cgroups, the absolute byte limits can't be applied directly
> and are converted to percentage values before being applied. You're
> specifying 0.27% for the threshold. Unfortunately, the ratio is
> translated into an integer percentage and 0.27% becomes 0, so your
> cgroups are always over the limit and being throttled.
>
> Can you please see whether the following patch fixes the issue?
Better than the kernel without the patch. Now the benchmark can reach the device bandwidth after 5-8 seconds.
But at the beginning it is still very slow: throughput is only 4MB/s for ~4 seconds, and then it
ramps up over the next 1-3 seconds.
Thanks
Miao
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write throughput caused by writeback cgroup and dirty throttle
2016-05-13 6:11 ` Miao Xie
@ 2016-05-27 18:24 ` Tejun Heo
0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2016-05-27 18:24 UTC (permalink / raw)
To: Miao Xie; +Cc: Fengguang Wu, linux-kernel
Hello,
Sorry about the delay. I forgot about this thread.
On Fri, May 13, 2016 at 02:11:53PM +0800, Miao Xie wrote:
> Better than the kernel without the patch. Now the benchmark can reach
> the device bandwidth after 5-8 seconds. But at the beginning it
> is still very slow: throughput is only 4MB/s for ~4
> seconds, and then it ramps up over the next 1-3 seconds.
I see. As this fix is needed anyway, I'll send it up. As for the
ramp-up, it could be normal. There are estimators which keep a running
average and modulate the threshold accordingly, and their starting values
are conservative, so a short ramp-up time could be coming from that.
Thanks.
--
tejun
* [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
2016-05-12 15:32 ` Tejun Heo
2016-05-13 6:11 ` Miao Xie
@ 2016-05-27 18:34 ` Tejun Heo
2016-05-30 8:05 ` Jan Kara
2016-05-30 14:55 ` Jens Axboe
1 sibling, 2 replies; 7+ messages in thread
From: Tejun Heo @ 2016-05-27 18:34 UTC (permalink / raw)
To: Jens Axboe, Jan Kara; +Cc: Fengguang Wu, linux-kernel, Miao Xie, kernel-team
As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background_]ratio. However, if the specified bytes are lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.
Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations. Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.
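The minimum can be worked out as follows; this is a sketch of the arithmetic assuming PAGE_SIZE = 4096, with B for the byte setting, G for the globally available pages, and A for the domain's available pages (symbols introduced here for illustration only).

```latex
\[
\frac{B / \text{PAGE\_SIZE}}{G} \cdot \text{PAGE\_SIZE} = \frac{B}{G},
\qquad
\text{ratio} = \min\!\left(\left\lceil \frac{B}{G} \right\rceil,\ \text{PAGE\_SIZE}\right) \ge 1
\quad \text{for } B > 0,
\]
\[
\frac{\text{ratio}}{\text{PAGE\_SIZE}} \ge \frac{1}{4096} \approx 0.0244\%,
\qquad
\text{thresh} = \left\lfloor \frac{\text{ratio} \cdot A}{\text{PAGE\_SIZE}} \right\rfloor .
\]
```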
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Miao Xie <miaoxie@huawei.com>
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Cc: stable@vger.kernel.org # v4.2+
Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
---
mm/page-writeback.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b9956fd..9f914e9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -373,8 +373,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
 	unsigned long bytes = vm_dirty_bytes;
 	unsigned long bg_bytes = dirty_background_bytes;
-	unsigned long ratio = vm_dirty_ratio;
-	unsigned long bg_ratio = dirty_background_ratio;
+	/* convert ratios to per-PAGE_SIZE for higher precision */
+	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
+	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
 	unsigned long thresh;
 	unsigned long bg_thresh;
 	struct task_struct *tsk;
@@ -386,26 +387,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		/*
 		 * The byte settings can't be applied directly to memcg
 		 * domains. Convert them to ratios by scaling against
-		 * globally available memory.
+		 * globally available memory. As the ratios are in
+		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
+		 * pages.
 		 */
 		if (bytes)
-			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
-				    global_avail, 100UL);
+			ratio = min(DIV_ROUND_UP(bytes, global_avail),
+				    PAGE_SIZE);
 		if (bg_bytes)
-			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
-				       global_avail, 100UL);
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
+				       PAGE_SIZE);
 		bytes = bg_bytes = 0;
 	}

 	if (bytes)
 		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
 	else
-		thresh = (ratio * available_memory) / 100;
+		thresh = (ratio * available_memory) / PAGE_SIZE;

 	if (bg_bytes)
 		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
 	else
-		bg_thresh = (bg_ratio * available_memory) / 100;
+		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;

 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
* Re: [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
2016-05-27 18:34 ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
@ 2016-05-30 8:05 ` Jan Kara
2016-05-30 14:55 ` Jens Axboe
1 sibling, 0 replies; 7+ messages in thread
From: Jan Kara @ 2016-05-30 8:05 UTC (permalink / raw)
To: Tejun Heo
Cc: Jens Axboe, Jan Kara, Fengguang Wu, linux-kernel, Miao Xie, kernel-team
On Fri 27-05-16 14:34:46, Tejun Heo wrote:
> As vm.dirty_[background_]bytes can't be applied verbatim to multiple
> cgroup writeback domains, they get converted to percentages in
> domain_dirty_limits() and applied the same way as
> vm.dirty_[background_]ratio. However, if the specified bytes are lower
> than 1% of available memory, the calculated ratios become zero and the
> writeback domain gets throttled constantly.
>
> Fix it by using per-PAGE_SIZE instead of percentage for ratio
> calculations. Also, the updated DIV_ROUND_UP() usages now should
> yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
> bytes are above zero.
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Just one nit below:
> @@ -386,26 +387,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
> /*
> * The byte settings can't be applied directly to memcg
> * domains. Convert them to ratios by scaling against
> - * globally available memory.
> + * globally available memory. As the ratios are in
> + * per-PAGE_SIZE, they can be obtained by dividing bytes by
> + * pages.
The comment would be more comprehensible to me if the last sentence were
"... by dividing bytes by number of pages".
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
2016-05-27 18:34 ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
2016-05-30 8:05 ` Jan Kara
@ 2016-05-30 14:55 ` Jens Axboe
1 sibling, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2016-05-30 14:55 UTC (permalink / raw)
To: Tejun Heo, Jan Kara; +Cc: Fengguang Wu, linux-kernel, Miao Xie, kernel-team
On 05/27/2016 12:34 PM, Tejun Heo wrote:
> As vm.dirty_[background_]bytes can't be applied verbatim to multiple
> cgroup writeback domains, they get converted to percentages in
> domain_dirty_limits() and applied the same way as
> vm.dirty_[background_]ratio. However, if the specified bytes are lower
> than 1% of available memory, the calculated ratios become zero and the
> writeback domain gets throttled constantly.
>
> Fix it by using per-PAGE_SIZE instead of percentage for ratio
> calculations. Also, the updated DIV_ROUND_UP() usages now should
> yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
> bytes are above zero.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Miao Xie <miaoxie@huawei.com>
> Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
> Cc: stable@vger.kernel.org # v4.2+
> Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
Queued up for this series, with the minor comment tweak that Jan suggested.
--
Jens Axboe