linux-kernel.vger.kernel.org archive mirror
* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
       [not found] <57333E75.3080309@huawei.com>
@ 2016-05-12  1:11 ` Miao Xie
  2016-05-12 15:32   ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Miao Xie @ 2016-05-12  1:11 UTC (permalink / raw)
  To: Fengguang Wu, Tejun Heo; +Cc: linux-kernel

Cc'ing the linux-kernel mailing list.

on 2016/5/11 at 22:15, Miao Xie wrote:
> Hi, Tejun and Fengguang
>
> I found that buffered write throughput dropped sharply with writeback cgroup and dirty throttling on
> the 4.6-rc7 kernel. When I ran the benchmark in the top-level block cgroup, throughput was more than 1500MB/s.
> When I ran the benchmark in a new block cgroup, throughput dropped to 4MB/s.
>
> Steps to reproduce:
> # mount -t cgroup2 cgroup <cgrp_mnt>
> # echo "+io +memory" > <cgrp_mnt>/cgroup.subtree_control
> # mkdir <cgrp_mnt>/aaa
> # echo $$ > <cgrp_mnt>/aaa/cgroup.procs
> # fio test.config
> job0: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
> fio-2.2.8
> Starting 1 thread
> Jobs: 1 (f=1): [W(1)] [3.7% done] [0KB/4000KB/0KB /s] [0/1000/0 iops] [eta 04m:50s]
>
> Fio configuration is:
> [global]
> bs=4K
> direct=0
> ioengine=psync
> iodepth=1
> directory=/mnt/ext4/tstdir0
> time_based
> runtime=300
> group_reporting
> size=16G
> sync=0
> max_latency=120000000
> thread
>
> [job0]
> numjobs=1
> rw=write
>
> My box has 48 cores and 188GB memory, but I set
> vm.dirty_background_bytes = 268435456
> vm.dirty_bytes = 536870912
>
> If I set vm.dirty_background_bytes and vm.dirty_bytes to larger values (vm.dirty_background_bytes = 3GB,
> vm.dirty_bytes = 4GB), fio throughput is more than 1500MB/s. If I then reset them to the original
> values (the ones above), throughput drops to 500MB/s.
>
> According to my debugging, fio slept for 1ms every time it dirtied a page (balance dirty pages) when
> throughput was down to 4MB/s. I think it might be a bug in dirty throttling when cgroup writeback is enabled.
>
> Tejun and Fengguang, please let me know what you think about this issue and whether you have
> any suggestions for possible solutions. Any input is greatly appreciated!
>
> Thanks
> Miao
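
Note on the numbers above: a 1ms pause per 4K write works out to roughly
1000 writes per second, or 4000KB/s, which matches the fio status line
([0KB/4000KB/0KB /s] [0/1000/0 iops]).  A minimal C sketch of that
arithmetic follows.  The function name is made up for this note, and the
time spent actually copying data is ignored, so treat it as a rough
estimate only.

```c
#include <assert.h>

/*
 * Approximate throughput in KB/s for a writer that dirties one block of
 * block_kb kilobytes and then sleeps for pause_us microseconds.
 * Hypothetical helper for illustration; copy time is not modelled.
 */
unsigned long throttled_kb_per_sec(unsigned long block_kb,
				   unsigned long pause_us)
{
	assert(pause_us > 0);
	/* Number of writes per second, times the size of each write. */
	return block_kb * (1000000UL / pause_us);
}
```

Calling throttled_kb_per_sec(4, 1000) models the reported case of 4K
writes with a 1ms sleep after each one.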


* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
  2016-05-12  1:11 ` [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle Miao Xie
@ 2016-05-12 15:32   ` Tejun Heo
  2016-05-13  6:11     ` Miao Xie
  2016-05-27 18:34     ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
  0 siblings, 2 replies; 7+ messages in thread
From: Tejun Heo @ 2016-05-12 15:32 UTC (permalink / raw)
  To: Miao Xie; +Cc: Fengguang Wu, linux-kernel

Hello,

On Thu, May 12, 2016 at 09:11:33AM +0800, Miao Xie wrote:
> >My box has 48 cores and 188GB memory, but I set
> >vm.dirty_background_bytes = 268435456
> >vm.dirty_bytes = 536870912
> >
> >If I set vm.dirty_background_bytes and vm.dirty_bytes to larger values (vm.dirty_background_bytes = 3GB,
> >vm.dirty_bytes = 4GB), fio throughput is more than 1500MB/s. If I then reset them to the original
> >values (the ones above), throughput drops to 500MB/s.
> >
> >According to my debugging, fio slept for 1ms every time it dirtied a page (balance dirty pages) when
> >throughput was down to 4MB/s. I think it might be a bug in dirty throttling when cgroup writeback is enabled.

Heh, so, for cgroups, the absolute byte limits can't be applied directly
and are converted to a percentage value before being applied.  You're
specifying 0.27% for the threshold.  Unfortunately, the ratio is
truncated to a whole-number percentage and 0.27% becomes 0, so your
cgroups are always over the limit and being throttled.
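
To see the truncation concretely, here is a rough standalone sketch of
the pre-patch conversion.  It assumes 4KB pages and, for simplicity,
that all 188GB count as globally available memory (the thread does not
give the real figure).  PAGE_SZ, the local DIV_ROUND_UP and
old_ratio_percent() are stand-ins written for this note, not kernel
code.

```c
#include <assert.h>

#define PAGE_SZ 4096UL				/* assumption: 4KB pages */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Pre-patch style conversion: turn a byte limit into a whole-number
 * percentage of global_avail (counted in pages), capped at 100.
 * Integer division drops everything below 1%.
 */
unsigned long old_ratio_percent(unsigned long bytes,
				unsigned long global_avail)
{
	unsigned long pct;

	assert(global_avail > 0);
	pct = DIV_ROUND_UP(bytes, PAGE_SZ) * 100 / global_avail;
	return pct < 100UL ? pct : 100UL;
}
```

With bytes = 536870912 and global_avail = 188 << 18 pages (188GB in 4KB
pages), this computes 131072 * 100 / 49283072, which is 0 in integer
arithmetic.  The 256MB background limit gives 0 as well, so the
resulting thresholds are both 0 pages.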

Can you please see whether the following patch fixes the issue?

Thanks.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 999792d..a455a21 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -369,8 +369,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
 	unsigned long bytes = vm_dirty_bytes;
 	unsigned long bg_bytes = dirty_background_bytes;
-	unsigned long ratio = vm_dirty_ratio;
-	unsigned long bg_ratio = dirty_background_ratio;
+	/* convert ratios to per-PAGE_SIZE for higher precision */
+	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
+	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
 	unsigned long thresh;
 	unsigned long bg_thresh;
 	struct task_struct *tsk;
@@ -382,26 +383,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		/*
 		 * The byte settings can't be applied directly to memcg
 		 * domains.  Convert them to ratios by scaling against
-		 * globally available memory.
+		 * globally available memory.  As the ratios are in
+		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
+		 * pages.
 		 */
 		if (bytes)
-			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
-				    global_avail, 100UL);
+			ratio = min(DIV_ROUND_UP(bytes, global_avail),
+				    PAGE_SIZE);
 		if (bg_bytes)
-			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
-				       global_avail, 100UL);
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
+				       PAGE_SIZE);
 		bytes = bg_bytes = 0;
 	}
 
 	if (bytes)
 		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
 	else
-		thresh = (ratio * available_memory) / 100;
+		thresh = (ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_bytes)
 		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
 	else
-		bg_thresh = (bg_ratio * available_memory) / 100;
+		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;


* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
  2016-05-12 15:32   ` Tejun Heo
@ 2016-05-13  6:11     ` Miao Xie
  2016-05-27 18:24       ` Tejun Heo
  2016-05-27 18:34     ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
  1 sibling, 1 reply; 7+ messages in thread
From: Miao Xie @ 2016-05-13  6:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Fengguang Wu, linux-kernel

on 2016/5/12 at 23:32, Tejun Heo wrote:
> On Thu, May 12, 2016 at 09:11:33AM +0800, Miao Xie wrote:
>>> My box has 48 cores and 188GB memory, but I set
>>> vm.dirty_background_bytes = 268435456
>>> vm.dirty_bytes = 536870912
>>>
>>> If I set vm.dirty_background_bytes and vm.dirty_bytes to larger values (vm.dirty_background_bytes = 3GB,
>>> vm.dirty_bytes = 4GB), fio throughput is more than 1500MB/s. If I then reset them to the original
>>> values (the ones above), throughput drops to 500MB/s.
>>>
>>> According to my debugging, fio slept for 1ms every time it dirtied a page (balance dirty pages) when
>>> throughput was down to 4MB/s. I think it might be a bug in dirty throttling when cgroup writeback is enabled.
>
> Heh, so, for cgroups, the absolute byte limits can't be applied directly
> and are converted to a percentage value before being applied.  You're
> specifying 0.27% for the threshold.  Unfortunately, the ratio is
> truncated to a whole-number percentage and 0.27% becomes 0, so your
> cgroups are always over the limit and being throttled.
>
> Can you please see whether the following patch fixes the issue?

Better than the kernel without the patch. Now the benchmark reaches the device bandwidth after 5-8 seconds.
But at the beginning it is still very slow: throughput is only 4MB/s for ~4 seconds, and then it
ramps up over 1-3 seconds.

Thanks
Miao

> Thanks.
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 999792d..a455a21 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -369,8 +369,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>   	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
>   	unsigned long bytes = vm_dirty_bytes;
>   	unsigned long bg_bytes = dirty_background_bytes;
> -	unsigned long ratio = vm_dirty_ratio;
> -	unsigned long bg_ratio = dirty_background_ratio;
> +	/* convert ratios to per-PAGE_SIZE for higher precision */
> +	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
> +	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
>   	unsigned long thresh;
>   	unsigned long bg_thresh;
>   	struct task_struct *tsk;
> @@ -382,26 +383,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>   		/*
>   		 * The byte settings can't be applied directly to memcg
>   		 * domains.  Convert them to ratios by scaling against
> -		 * globally available memory.
> +		 * globally available memory.  As the ratios are in
> +		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
> +		 * pages.
>   		 */
>   		if (bytes)
> -			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
> -				    global_avail, 100UL);
> +			ratio = min(DIV_ROUND_UP(bytes, global_avail),
> +				    PAGE_SIZE);
>   		if (bg_bytes)
> -			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
> -				       global_avail, 100UL);
> +			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
> +				       PAGE_SIZE);
>   		bytes = bg_bytes = 0;
>   	}
>
>   	if (bytes)
>   		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
>   	else
> -		thresh = (ratio * available_memory) / 100;
> +		thresh = (ratio * available_memory) / PAGE_SIZE;
>
>   	if (bg_bytes)
>   		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
>   	else
> -		bg_thresh = (bg_ratio * available_memory) / 100;
> +		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
>
>   	if (bg_thresh >= thresh)
>   		bg_thresh = thresh / 2;
>
> .
>


* Re: [BUG]Writeback Cgroup/Dirty Throttle: very small buffered write thoughput caused by writeback cgroup and dirty thottle
  2016-05-13  6:11     ` Miao Xie
@ 2016-05-27 18:24       ` Tejun Heo
  0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2016-05-27 18:24 UTC (permalink / raw)
  To: Miao Xie; +Cc: Fengguang Wu, linux-kernel

Hello,

Sorry about the delay.  I forgot about this thread.

On Fri, May 13, 2016 at 02:11:53PM +0800, Miao Xie wrote:
> Better than the kernel without the patch. Now the benchmark reaches
> the device bandwidth after 5-8 seconds.  But at the beginning it is
> still very slow: throughput is only 4MB/s for ~4 seconds, and then
> it ramps up over 1-3 seconds.

I see.  As this fix is needed anyway, I'll send it up.  As for the
ramp-up, it could be normal.  There are estimators which keep a running
average and modulate the threshold accordingly, and their starting
values are conservative, so a short ramp-up time could come from that.
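
As a purely illustrative toy (this is not the kernel's estimator, and
the 1/8 weight, the function names and the step model are all
assumptions made up for this note), the following C sketch shows why a
running average that starts from a low value needs several update steps
before it gets close to a much higher steady input.

```c
#include <assert.h>

/*
 * One update of a simple running average: move the estimate 1/8 of the
 * way toward the new sample.  The 1/8 weight is arbitrary.
 */
unsigned long running_avg_step(unsigned long estimate, unsigned long sample)
{
	if (sample >= estimate)
		return estimate + (sample - estimate) / 8;
	return estimate - (estimate - sample) / 8;
}

/*
 * Count the updates needed until the estimate reaches target_pct percent
 * of a constant sample value, giving up after max_steps updates.
 */
unsigned int steps_to_reach(unsigned long estimate, unsigned long sample,
			    unsigned int target_pct, unsigned int max_steps)
{
	unsigned int steps = 0;

	assert(target_pct <= 100);
	while (steps < max_steps && estimate * 100 < sample * target_pct) {
		estimate = running_avg_step(estimate, sample);
		steps++;
	}
	return steps;
}
```

For example, steps_to_reach(4, 1500, 90, 1000) counts how many updates
such an average would need to climb from 4 to 90% of 1500.  How that
maps onto wall-clock time depends on how often the real estimators are
updated, which this sketch does not model.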

Thanks.

-- 
tejun


* [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
  2016-05-12 15:32   ` Tejun Heo
  2016-05-13  6:11     ` Miao Xie
@ 2016-05-27 18:34     ` Tejun Heo
  2016-05-30  8:05       ` Jan Kara
  2016-05-30 14:55       ` Jens Axboe
  1 sibling, 2 replies; 7+ messages in thread
From: Tejun Heo @ 2016-05-27 18:34 UTC (permalink / raw)
  To: Jens Axboe, Jan Kara; +Cc: Fengguang Wu, linux-kernel, Miao Xie, kernel-team

As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background_]ratio.  However, if the specified bytes is lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.

Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations.  Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Miao Xie <miaoxie@huawei.com>
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Cc: stable@vger.kernel.org # v4.2+
Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")
---
 mm/page-writeback.c |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b9956fd..9f914e9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -373,8 +373,9 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 	struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
 	unsigned long bytes = vm_dirty_bytes;
 	unsigned long bg_bytes = dirty_background_bytes;
-	unsigned long ratio = vm_dirty_ratio;
-	unsigned long bg_ratio = dirty_background_ratio;
+	/* convert ratios to per-PAGE_SIZE for higher precision */
+	unsigned long ratio = (vm_dirty_ratio * PAGE_SIZE) / 100;
+	unsigned long bg_ratio = (dirty_background_ratio * PAGE_SIZE) / 100;
 	unsigned long thresh;
 	unsigned long bg_thresh;
 	struct task_struct *tsk;
@@ -386,26 +387,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
 		/*
 		 * The byte settings can't be applied directly to memcg
 		 * domains.  Convert them to ratios by scaling against
-		 * globally available memory.
+		 * globally available memory.  As the ratios are in
+		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
+		 * pages.
 		 */
 		if (bytes)
-			ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
-				    global_avail, 100UL);
+			ratio = min(DIV_ROUND_UP(bytes, global_avail),
+				    PAGE_SIZE);
 		if (bg_bytes)
-			bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
-				       global_avail, 100UL);
+			bg_ratio = min(DIV_ROUND_UP(bg_bytes, global_avail),
+				       PAGE_SIZE);
 		bytes = bg_bytes = 0;
 	}
 
 	if (bytes)
 		thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
 	else
-		thresh = (ratio * available_memory) / 100;
+		thresh = (ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_bytes)
 		bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
 	else
-		bg_thresh = (bg_ratio * available_memory) / 100;
+		bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
 
 	if (bg_thresh >= thresh)
 		bg_thresh = thresh / 2;
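
For comparison, a rough standalone sketch of the per-PAGE_SIZE
calculation the patch switches to.  As before it assumes 4KB pages and
treats all 188GB as available memory, neither of which is confirmed in
the thread.  PAGE_SZ, the local DIV_ROUND_UP, new_ratio() and
thresh_pages() are illustrative stand-ins, not the kernel functions.

```c
#include <assert.h>

#define PAGE_SZ 4096UL				/* assumption: 4KB pages */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Post-patch style conversion: bytes divided by pages gives a ratio in
 * units of 1/PAGE_SZ.  Rounding up keeps any nonzero byte value at a
 * minimum of 1/PAGE_SZ; the cap at PAGE_SZ corresponds to 100%.
 */
unsigned long new_ratio(unsigned long bytes, unsigned long global_avail)
{
	unsigned long r;

	assert(global_avail > 0);
	r = DIV_ROUND_UP(bytes, global_avail);
	return r < PAGE_SZ ? r : PAGE_SZ;
}

/* Threshold in pages for a domain with available_memory pages. */
unsigned long thresh_pages(unsigned long ratio,
			   unsigned long available_memory)
{
	return ratio * available_memory / PAGE_SZ;
}
```

Under these assumptions, 512MB against 188 << 18 pages yields a ratio of
11/4096 and 256MB yields 6/4096, so neither threshold collapses to zero,
and a limit of a single byte still yields 1/4096.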


* Re: [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
  2016-05-27 18:34     ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
@ 2016-05-30  8:05       ` Jan Kara
  2016-05-30 14:55       ` Jens Axboe
  1 sibling, 0 replies; 7+ messages in thread
From: Jan Kara @ 2016-05-30  8:05 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Jan Kara, Fengguang Wu, linux-kernel, Miao Xie, kernel-team

On Fri 27-05-16 14:34:46, Tejun Heo wrote:
> As vm.dirty_[background_]bytes can't be applied verbatim to multiple
> cgroup writeback domains, they get converted to percentages in
> domain_dirty_limits() and applied the same way as
> vm.dirty_[background_]ratio.  However, if the specified bytes is lower
> than 1% of available memory, the calculated ratios become zero and the
> writeback domain gets throttled constantly.
> 
> Fix it by using per-PAGE_SIZE instead of percentage for ratio
> calculations.  Also, the updated DIV_ROUND_UP() usages now should
> yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
> bytes are above zero.

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just one nit below:

> @@ -386,26 +387,28 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
>  		/*
>  		 * The byte settings can't be applied directly to memcg
>  		 * domains.  Convert them to ratios by scaling against
> -		 * globally available memory.
> +		 * globally available memory.  As the ratios are in
> +		 * per-PAGE_SIZE, they can be obtained by dividing bytes by
> +		 * pages.

The comment would be more comprehensible to me if the last sentence were
"... by dividing bytes by number of pages".

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits()
  2016-05-27 18:34     ` [PATCH block/for-4.7-fixes] writeback: use higher precision calculation in domain_dirty_limits() Tejun Heo
  2016-05-30  8:05       ` Jan Kara
@ 2016-05-30 14:55       ` Jens Axboe
  1 sibling, 0 replies; 7+ messages in thread
From: Jens Axboe @ 2016-05-30 14:55 UTC (permalink / raw)
  To: Tejun Heo, Jan Kara; +Cc: Fengguang Wu, linux-kernel, Miao Xie, kernel-team

On 05/27/2016 12:34 PM, Tejun Heo wrote:
> As vm.dirty_[background_]bytes can't be applied verbatim to multiple
> cgroup writeback domains, they get converted to percentages in
> domain_dirty_limits() and applied the same way as
> vm.dirty_[background_]ratio.  However, if the specified bytes is lower
> than 1% of available memory, the calculated ratios become zero and the
> writeback domain gets throttled constantly.
>
> Fix it by using per-PAGE_SIZE instead of percentage for ratio
> calculations.  Also, the updated DIV_ROUND_UP() usages now should
> yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
> bytes are above zero.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Miao Xie <miaoxie@huawei.com>
> Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
> Cc: stable@vger.kernel.org # v4.2+
> Fixes: 9fc3a43e1757 ("writeback: separate out domain_dirty_limits()")

Queued up for this series, with the minor comment tweak that Jan suggested.

-- 
Jens Axboe

