linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v2 0/4] Fix and cleanups to page-writeback
@ 2024-04-25 13:17 Kemeng Shi
  2024-04-25 13:17 ` [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold Kemeng Shi
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Kemeng Shi @ 2024-04-25 13:17 UTC (permalink / raw)
  To: willy, akpm
  Cc: tj, jack, hcochran, axboe, mszeredi, linux-fsdevel, linux-mm,
	linux-kernel

v1->v2:
-rebase on up-to-date tree.
-add test result in "mm: correct calculation of wb's bg_thresh in cgroup
domain"
-drop "mm: remove redundant check in wb_min_max_ratio"
-collect RVB from Matthew to "mm: remove stale comment __folio_mark_dirty"

This series contains some random cleanups and a fix to correct the
calculation of a wb's bg_thresh in the cgroup domain. More details can
be found in the respective patches. Thanks!

Kemeng Shi (4):
  mm: enable __wb_calc_thresh to calculate dirty background threshold
  mm: correct calculation of wb's bg_thresh in cgroup domain
  mm: call __wb_calc_thresh instead of wb_calc_thresh in
    wb_over_bg_thresh
  mm: remove stale comment __folio_mark_dirty

 mm/page-writeback.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

-- 
2.30.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold
  2024-04-25 13:17 [PATCH v2 0/4] Fix and cleanups to page-writeback Kemeng Shi
@ 2024-04-25 13:17 ` Kemeng Shi
  2024-05-03  9:11   ` Jan Kara
  2024-04-25 13:17 ` [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain Kemeng Shi
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Kemeng Shi @ 2024-04-25 13:17 UTC (permalink / raw)
  To: willy, akpm
  Cc: tj, jack, hcochran, axboe, mszeredi, linux-fsdevel, linux-mm,
	linux-kernel

Originally, __wb_calc_thresh always calculates the wb's share of the
dirty throttling threshold. By taking the threshold of the wb_domain
from the caller, __wb_calc_thresh can be used for both the dirty
throttling and the dirty background threshold.

This is a preparation for correcting the threshold calculation of a wb
in the cgroup domain.
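As a sketch of the resulting calling convention (illustrative names and a
simplified share calculation, not the actual kernel code, which also derives
the fraction from fprop statistics and applies the bdi min/max ratios):

```c
/*
 * Simplified model of the refactoring: the per-wb share helper takes the
 * domain threshold from the caller instead of always reading dtc->thresh,
 * so the same helper can scale either the dirty throttling threshold or
 * the dirty background threshold. Illustrative only.
 */
struct dtc_model {
	unsigned long thresh;		/* domain dirty throttling threshold */
	unsigned long bg_thresh;	/* domain dirty background threshold */
	unsigned long numerator;	/* wb's fraction of domain completions */
	unsigned long denominator;
};

static unsigned long wb_share(const struct dtc_model *dtc, unsigned long thresh)
{
	/* wb's proportional share of whichever threshold was passed in */
	return (unsigned long)((unsigned long long)thresh *
			       dtc->numerator / dtc->denominator);
}

/* Throttling share, as wb_calc_thresh()-style callers would compute it. */
static unsigned long wb_throttle_share(const struct dtc_model *dtc)
{
	return wb_share(dtc, dtc->thresh);
}

/* Background share, now possible with the same helper. */
static unsigned long wb_bg_share(const struct dtc_model *dtc)
{
	return wb_share(dtc, dtc->bg_thresh);
}
```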

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 33 ++++++++++++++++++---------------
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index db5769cb12fd..2a3b68aae336 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -838,13 +838,15 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
 }
 
 /**
- * __wb_calc_thresh - @wb's share of dirty throttling threshold
+ * __wb_calc_thresh - @wb's share of dirty threshold
  * @dtc: dirty_throttle_context of interest
+ * @thresh: dirty throttling or dirty background threshold of wb_domain in @dtc
  *
- * Note that balance_dirty_pages() will only seriously take it as a hard limit
- * when sleeping max_pause per page is not enough to keep the dirty pages under
- * control. For example, when the device is completely stalled due to some error
- * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
+ * Note that balance_dirty_pages() will only seriously take dirty throttling
+ * threshold as a hard limit when sleeping max_pause per page is not enough
+ * to keep the dirty pages under control. For example, when the device is
+ * completely stalled due to some error conditions, or when there are 1000
+ * dd tasks writing to a slow 10MB/s USB key.
  * In the other normal situations, it acts more gently by throttling the tasks
  * more (rather than completely block them) when the wb dirty pages go high.
  *
@@ -855,19 +857,20 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
  * The wb's share of dirty limit will be adapting to its throughput and
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
  *
- * Return: @wb's dirty limit in pages. The term "dirty" in the context of
- * dirty balancing includes all PG_dirty and PG_writeback pages.
+ * Return: @wb's dirty limit in pages. For dirty throttling limit, the term
+ * "dirty" in the context of dirty balancing includes all PG_dirty and
+ * PG_writeback pages.
  */
-static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
+static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
+				      unsigned long thresh)
 {
 	struct wb_domain *dom = dtc_dom(dtc);
-	unsigned long thresh = dtc->thresh;
 	u64 wb_thresh;
 	unsigned long numerator, denominator;
 	unsigned long wb_min_ratio, wb_max_ratio;
 
 	/*
-	 * Calculate this BDI's share of the thresh ratio.
+	 * Calculate this wb's share of the thresh ratio.
 	 */
 	fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
 			      &numerator, &denominator);
@@ -887,9 +890,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
 
 unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
 {
-	struct dirty_throttle_control gdtc = { GDTC_INIT(wb),
-					       .thresh = thresh };
-	return __wb_calc_thresh(&gdtc);
+	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
+
+	return __wb_calc_thresh(&gdtc, thresh);
 }
 
 unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
@@ -908,7 +911,7 @@ unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
 	mdtc_calc_avail(&mdtc, filepages, headroom);
 	domain_dirty_limits(&mdtc);
 
-	return __wb_calc_thresh(&mdtc);
+	return __wb_calc_thresh(&mdtc, mdtc.thresh);
 }
 
 /*
@@ -1655,7 +1658,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
 	 *   wb_position_ratio() will let the dirtier task progress
 	 *   at some rate <= (write_bw / 2) for bringing down wb_dirty.
 	 */
-	dtc->wb_thresh = __wb_calc_thresh(dtc);
+	dtc->wb_thresh = __wb_calc_thresh(dtc, dtc->thresh);
 	dtc->wb_bg_thresh = dtc->thresh ?
 		div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
 
-- 
2.30.0



* [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain
  2024-04-25 13:17 [PATCH v2 0/4] Fix and cleanups to page-writeback Kemeng Shi
  2024-04-25 13:17 ` [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold Kemeng Shi
@ 2024-04-25 13:17 ` Kemeng Shi
  2024-05-03  9:30   ` Jan Kara
  2024-04-25 13:17 ` [PATCH v2 3/4] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh Kemeng Shi
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Kemeng Shi @ 2024-04-25 13:17 UTC (permalink / raw)
  To: willy, akpm
  Cc: tj, jack, hcochran, axboe, mszeredi, linux-fsdevel, linux-mm,
	linux-kernel

wb_calc_thresh() is supposed to calculate a wb's share of bg_thresh in
the global domain. To calculate a wb's share of bg_thresh in a cgroup
domain, it's more reasonable to use __wb_calc_thresh(), which is how we
calculate dirty_thresh in the cgroup domain in balance_dirty_pages().

Consider following domain hierarchy:
                global domain (> 20G)
                /                 \
        cgroup domain1(10G)     cgroup domain2(10G)
                |                 |
bdi            wb1               wb2
Assume wb1 and wb2 have the same bandwidth.
We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
Then we have:
wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
= 2G * 1/2 = 1G
wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
= 1G * 1/2 = 0.5G
In the end, wb1 and wb2 will each be limited to 0.5G, so the system will
be limited to 1G, which is less than the global domain bg_thresh of 2G.
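The arithmetic above can be reproduced with a small user-space model
(illustrative helpers; the kernel derives the wb fraction from fprop
completion statistics rather than raw bandwidth, but the proportions are
the same):

```c
#define GB (1024UL * 1024 * 1024)

/*
 * A wb's proportional share of a domain threshold:
 *   share = thresh * (wb bandwidth) / (total bandwidth of the domain).
 * Illustrative model only, not the kernel implementation.
 */
static unsigned long wb_domain_share(unsigned long thresh,
				     unsigned long wb_bw,
				     unsigned long domain_bw)
{
	return (unsigned long)((unsigned long long)thresh * wb_bw / domain_bw);
}

/*
 * Buggy path: the cgroup bg_thresh (1G) is scaled by wb1's fraction of
 * the *global* domain bandwidth (1/2), giving 0.5G, even though wb1 is
 * the only writer in cgroup domain1.
 */
static unsigned long buggy_wb1_bg_thresh(void)
{
	return wb_domain_share(1 * GB, 1, 2);	/* 0.5G */
}

/*
 * Fixed path: wb1's fraction is taken within its own cgroup domain,
 * where it is the sole writer, so it gets the full 1G.
 */
static unsigned long fixed_wb1_bg_thresh(void)
{
	return wb_domain_share(1 * GB, 1, 1);	/* 1G */
}
```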

Test as follows:
/* make it easier to observe the issue */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs

/* run fio in wb1 */
cd /sys/fs/cgroup
echo "+memory +io" > cgroup.subtree_control
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

/* run fio in wb2 with a new shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
-iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0

Before the fix, the written pages of wb1 and wb2 reported by
tools/writeback/wb_monitor.py keep growing. After the fix, only a few
written pages accumulate.
There is no obvious change in the fio result.

Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2a3b68aae336..14893b20d38c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		if (mdtc->dirty > mdtc->bg_thresh)
 			return true;
 
-		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
+		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
 		if (thresh < 2 * wb_stat_error())
 			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
 		else
-- 
2.30.0



* [PATCH v2 3/4] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh
  2024-04-25 13:17 [PATCH v2 0/4] Fix and cleanups to page-writeback Kemeng Shi
  2024-04-25 13:17 ` [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold Kemeng Shi
  2024-04-25 13:17 ` [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain Kemeng Shi
@ 2024-04-25 13:17 ` Kemeng Shi
  2024-05-03  9:31   ` Jan Kara
  2024-04-25 13:17 ` [PATCH v2 4/4] mm: remove stale comment __folio_mark_dirty Kemeng Shi
  2024-05-01 16:16 ` [PATCH v2 0/4] Fix and cleanups to page-writeback Tejun Heo
  4 siblings, 1 reply; 13+ messages in thread
From: Kemeng Shi @ 2024-04-25 13:17 UTC (permalink / raw)
  To: willy, akpm
  Cc: tj, jack, hcochran, axboe, mszeredi, linux-fsdevel, linux-mm,
	linux-kernel

Call __wb_calc_thresh to calculate the wb bg_thresh of gdtc in
wb_over_bg_thresh, removing the unnecessary wrapper wb_calc_thresh.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
---
 mm/page-writeback.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 14893b20d38c..22e1acec899e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2117,7 +2117,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	if (gdtc->dirty > gdtc->bg_thresh)
 		return true;
 
-	thresh = wb_calc_thresh(gdtc->wb, gdtc->bg_thresh);
+	thresh = __wb_calc_thresh(gdtc, gdtc->bg_thresh);
 	if (thresh < 2 * wb_stat_error())
 		reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
 	else
-- 
2.30.0



* [PATCH v2 4/4] mm: remove stale comment __folio_mark_dirty
  2024-04-25 13:17 [PATCH v2 0/4] Fix and cleanups to page-writeback Kemeng Shi
                   ` (2 preceding siblings ...)
  2024-04-25 13:17 ` [PATCH v2 3/4] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh Kemeng Shi
@ 2024-04-25 13:17 ` Kemeng Shi
  2024-05-03  9:31   ` Jan Kara
  2024-05-01 16:16 ` [PATCH v2 0/4] Fix and cleanups to page-writeback Tejun Heo
  4 siblings, 1 reply; 13+ messages in thread
From: Kemeng Shi @ 2024-04-25 13:17 UTC (permalink / raw)
  To: willy, akpm
  Cc: tj, jack, hcochran, axboe, mszeredi, linux-fsdevel, linux-mm,
	linux-kernel

__folio_mark_dirty no longer marks the inode dirty. Remove its stale
comment.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/page-writeback.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 22e1acec899e..692c0da04cbd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2721,8 +2721,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
 }
 
 /*
- * Mark the folio dirty, and set it dirty in the page cache, and mark
- * the inode dirty.
+ * Mark the folio dirty, and set it dirty in the page cache.
  *
  * If warn is true, then emit a warning if the folio is not uptodate and has
  * not been truncated.
-- 
2.30.0



* Re: [PATCH v2 0/4] Fix and cleanups to page-writeback
  2024-04-25 13:17 [PATCH v2 0/4] Fix and cleanups to page-writeback Kemeng Shi
                   ` (3 preceding siblings ...)
  2024-04-25 13:17 ` [PATCH v2 4/4] mm: remove stale comment __folio_mark_dirty Kemeng Shi
@ 2024-05-01 16:16 ` Tejun Heo
  2024-05-06  1:25   ` Kemeng Shi
  4 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2024-05-01 16:16 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: willy, akpm, jack, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel

On Thu, Apr 25, 2024 at 09:17:20PM +0800, Kemeng Shi wrote:
> v1->v2:
> -rebase on up-to-date tree.
> -add test result in "mm: correct calculation of wb's bg_thresh in cgroup
> domain"
> -drop "mm: remove redundant check in wb_min_max_ratio"
> -collect RVB from Matthew to "mm: remove stale comment __folio_mark_dirty"
> 
> This series contains some random cleanups and a fix to correct the
> calculation of a wb's bg_thresh in the cgroup domain. More details can
> be found in the respective patches. Thanks!

Isn't this series already in -mm? Why is this being reposted? What tree is
this based on? Please provide more context to help reviewing the patches.

Thanks.

-- 
tejun


* Re: [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold
  2024-04-25 13:17 ` [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold Kemeng Shi
@ 2024-05-03  9:11   ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2024-05-03  9:11 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: willy, akpm, tj, jack, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel

On Thu 25-04-24 21:17:21, Kemeng Shi wrote:
> Originally, __wb_calc_thresh always calculates the wb's share of the
> dirty throttling threshold. By taking the threshold of the wb_domain
> from the caller, __wb_calc_thresh can be used for both the dirty
> throttling and the dirty background threshold.
> 
> This is a preparation for correcting the threshold calculation of a wb
> in the cgroup domain.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c | 33 ++++++++++++++++++---------------
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index db5769cb12fd..2a3b68aae336 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -838,13 +838,15 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
>  }
>  
>  /**
> - * __wb_calc_thresh - @wb's share of dirty throttling threshold
> + * __wb_calc_thresh - @wb's share of dirty threshold
>   * @dtc: dirty_throttle_context of interest
> + * @thresh: dirty throttling or dirty background threshold of wb_domain in @dtc
>   *
> - * Note that balance_dirty_pages() will only seriously take it as a hard limit
> - * when sleeping max_pause per page is not enough to keep the dirty pages under
> - * control. For example, when the device is completely stalled due to some error
> - * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
> + * Note that balance_dirty_pages() will only seriously take dirty throttling
> + * threshold as a hard limit when sleeping max_pause per page is not enough
> + * to keep the dirty pages under control. For example, when the device is
> + * completely stalled due to some error conditions, or when there are 1000
> + * dd tasks writing to a slow 10MB/s USB key.
>   * In the other normal situations, it acts more gently by throttling the tasks
>   * more (rather than completely block them) when the wb dirty pages go high.
>   *
> @@ -855,19 +857,20 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
>   * The wb's share of dirty limit will be adapting to its throughput and
>   * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
>   *
> - * Return: @wb's dirty limit in pages. The term "dirty" in the context of
> - * dirty balancing includes all PG_dirty and PG_writeback pages.
> + * Return: @wb's dirty limit in pages. For dirty throttling limit, the term
> + * "dirty" in the context of dirty balancing includes all PG_dirty and
> + * PG_writeback pages.
>   */
> -static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
> +static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc,
> +				      unsigned long thresh)
>  {
>  	struct wb_domain *dom = dtc_dom(dtc);
> -	unsigned long thresh = dtc->thresh;
>  	u64 wb_thresh;
>  	unsigned long numerator, denominator;
>  	unsigned long wb_min_ratio, wb_max_ratio;
>  
>  	/*
> -	 * Calculate this BDI's share of the thresh ratio.
> +	 * Calculate this wb's share of the thresh ratio.
>  	 */
>  	fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
>  			      &numerator, &denominator);
> @@ -887,9 +890,9 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
>  
>  unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
>  {
> -	struct dirty_throttle_control gdtc = { GDTC_INIT(wb),
> -					       .thresh = thresh };
> -	return __wb_calc_thresh(&gdtc);
> +	struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
> +
> +	return __wb_calc_thresh(&gdtc, thresh);
>  }
>  
>  unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
> @@ -908,7 +911,7 @@ unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
>  	mdtc_calc_avail(&mdtc, filepages, headroom);
>  	domain_dirty_limits(&mdtc);
>  
> -	return __wb_calc_thresh(&mdtc);
> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
>  }
>  
>  /*
> @@ -1655,7 +1658,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
>  	 *   wb_position_ratio() will let the dirtier task progress
>  	 *   at some rate <= (write_bw / 2) for bringing down wb_dirty.
>  	 */
> -	dtc->wb_thresh = __wb_calc_thresh(dtc);
> +	dtc->wb_thresh = __wb_calc_thresh(dtc, dtc->thresh);
>  	dtc->wb_bg_thresh = dtc->thresh ?
>  		div64_u64(dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;
>  
> -- 
> 2.30.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain
  2024-04-25 13:17 ` [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain Kemeng Shi
@ 2024-05-03  9:30   ` Jan Kara
  2024-05-07  1:16     ` Kemeng Shi
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2024-05-03  9:30 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: willy, akpm, tj, jack, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel

On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
> global domain. To calculate wb's share of bg_thresh in cgroup domain,
> it's more reasonable to use __wb_calc_thresh in which way we calculate
> dirty_thresh in cgroup domain in balance_dirty_pages().
> 
> Consider following domain hierarchy:
>                 global domain (> 20G)
>                 /                 \
>         cgroup domain1(10G)     cgroup domain2(10G)
>                 |                 |
> bdi            wb1               wb2
> Assume wb1 and wb2 have the same bandwidth.
> We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
> Then we have:
> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
> = 2G * 1/2 = 1G
> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
> = 1G * 1/2 = 0.5G
> At last, wb1 and wb2 will be limited at 0.5G, the system will be limited
> at 1G which is less than global domain bg_thresh 2G.

This was a bit hard to understand for me so I'd rephrase it as:

wb_calc_thresh() is calculating wb's share of bg_thresh in the global
domain. However in case of cgroup writeback this is not the right thing to
do. Consider the following domain hierarchy:

                global domain (> 20G)
                /                 \
          cgroup1 (10G)     cgroup2 (10G)
                |                 |
bdi            wb1               wb2

and assume wb1 and wb2 have the same bandwidth and the background threshold
is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
in cgroup1.

Fix the problem by calculating wb's share of bg_thresh in the cgroup
domain.

> Test as follows:
> /* make it easier to observe the issue */
> echo 300000 > /proc/sys/vm/dirty_expire_centisecs
> echo 100 > /proc/sys/vm/dirty_writeback_centisecs
> 
> /* run fio in wb1 */
> cd /sys/fs/cgroup
> echo "+memory +io" > cgroup.subtree_control
> mkdir group1
> cd group1
> echo 10G > memory.high
> echo 10G > memory.max
> echo $$ > cgroup.procs
> mkfs.ext4 -F /dev/vdb
> mount /dev/vdb /bdi1/
> fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
> 
> /* run fio in wb2 with a new shell */
> cd /sys/fs/cgroup
> mkdir group2
> cd group2
> echo 10G > memory.high
> echo 10G > memory.max
> echo $$ > cgroup.procs
> mkfs.ext4 -F /dev/vdc
> mount /dev/vdc /bdi2/
> fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
> 
> Before the fix, the written pages of wb1 and wb2 reported by
> tools/writeback/wb_monitor.py keep growing. After the fix, only a few
> written pages accumulate.
> There is no obvious change in the fio result.
> 
> Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Besides the changelog rephrasing the change looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2a3b68aae336..14893b20d38c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  		if (mdtc->dirty > mdtc->bg_thresh)
>  			return true;
>  
> -		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
> +		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
>  		if (thresh < 2 * wb_stat_error())
>  			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
>  		else
> -- 
> 2.30.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 3/4] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh
  2024-04-25 13:17 ` [PATCH v2 3/4] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh Kemeng Shi
@ 2024-05-03  9:31   ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2024-05-03  9:31 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: willy, akpm, tj, jack, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel

On Thu 25-04-24 21:17:23, Kemeng Shi wrote:
> Call __wb_calc_thresh to calculate the wb bg_thresh of gdtc in
> wb_over_bg_thresh, removing the unnecessary wrapper wb_calc_thresh.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 14893b20d38c..22e1acec899e 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2117,7 +2117,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  	if (gdtc->dirty > gdtc->bg_thresh)
>  		return true;
>  
> -	thresh = wb_calc_thresh(gdtc->wb, gdtc->bg_thresh);
> +	thresh = __wb_calc_thresh(gdtc, gdtc->bg_thresh);
>  	if (thresh < 2 * wb_stat_error())
>  		reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
>  	else
> -- 
> 2.30.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 4/4] mm: remove stale comment __folio_mark_dirty
  2024-04-25 13:17 ` [PATCH v2 4/4] mm: remove stale comment __folio_mark_dirty Kemeng Shi
@ 2024-05-03  9:31   ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2024-05-03  9:31 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: willy, akpm, tj, jack, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel

On Thu 25-04-24 21:17:24, Kemeng Shi wrote:
> __folio_mark_dirty no longer marks the inode dirty. Remove its stale
> comment.
> 
> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  mm/page-writeback.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 22e1acec899e..692c0da04cbd 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2721,8 +2721,7 @@ void folio_account_cleaned(struct folio *folio, struct bdi_writeback *wb)
>  }
>  
>  /*
> - * Mark the folio dirty, and set it dirty in the page cache, and mark
> - * the inode dirty.
> + * Mark the folio dirty, and set it dirty in the page cache.
>   *
>   * If warn is true, then emit a warning if the folio is not uptodate and has
>   * not been truncated.
> -- 
> 2.30.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v2 0/4] Fix and cleanups to page-writeback
  2024-05-01 16:16 ` [PATCH v2 0/4] Fix and cleanups to page-writeback Tejun Heo
@ 2024-05-06  1:25   ` Kemeng Shi
  0 siblings, 0 replies; 13+ messages in thread
From: Kemeng Shi @ 2024-05-06  1:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: willy, akpm, jack, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel

Hi Tejun,

on 5/2/2024 12:16 AM, Tejun Heo wrote:
> On Thu, Apr 25, 2024 at 09:17:20PM +0800, Kemeng Shi wrote:
>> v1->v2:
>> -rebase on up-to-date tree.
>> -add test result in "mm: correct calculation of wb's bg_thresh in cgroup
>> domain"
>> -drop "mm: remove redundant check in wb_min_max_ratio"
>> -collect RVB from Matthew to "mm: remove stale comment __folio_mark_dirty"
>>
>> This series contains some random cleanups and a fix to correct the
>> calculation of a wb's bg_thresh in the cgroup domain. More details can
>> be found in the respective patches. Thanks!
> 
> Isn't this series already in -mm? Why is this being reposted? What tree is
> this based on? Please provide more context to help reviewing the patches.
> 
Sorry for the late reply; I was on vacation these days. This series is
based on mm-unstable and was applied to the -mm mm-unstable branch after
this v2 series was posted.
Thanks
> Thanks.
> 



* Re: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain
  2024-05-03  9:30   ` Jan Kara
@ 2024-05-07  1:16     ` Kemeng Shi
  2024-05-07 13:28       ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Kemeng Shi @ 2024-05-07  1:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: willy, akpm, tj, hcochran, axboe, mszeredi, linux-fsdevel,
	linux-mm, linux-kernel


Hi Jan,
on 5/3/2024 5:30 PM, Jan Kara wrote:
> On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
>> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
>> global domain. To calculate wb's share of bg_thresh in cgroup domain,
>> it's more reasonable to use __wb_calc_thresh in which way we calculate
>> dirty_thresh in cgroup domain in balance_dirty_pages().
>>
>> Consider following domain hierarchy:
>>                 global domain (> 20G)
>>                 /                 \
>>         cgroup domain1(10G)     cgroup domain2(10G)
>>                 |                 |
>> bdi            wb1               wb2
>> Assume wb1 and wb2 have the same bandwidth.
>> We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
>> Then we have:
>> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
>> = 2G * 1/2 = 1G
>> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
>> = 1G * 1/2 = 0.5G
>> At last, wb1 and wb2 will be limited at 0.5G, the system will be limited
>> at 1G which is less than global domain bg_thresh 2G.
> 
> This was a bit hard to understand for me so I'd rephrase it as:
> 
> wb_calc_thresh() is calculating wb's share of bg_thresh in the global
> domain. However in case of cgroup writeback this is not the right thing to
> do. Consider the following domain hierarchy:
> 
>                 global domain (> 20G)
>                 /                 \
>           cgroup1 (10G)     cgroup2 (10G)
>                 |                 |
> bdi            wb1               wb2
> 
> and assume wb1 and wb2 have the same bandwidth and the background threshold
> is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
> because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
> threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
> returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
> in cgroup1.
> 
> Fix the problem by calculating wb's share of bg_thresh in the cgroup
> domain.
Thanks for improving the changelog. As this was merged into the -mm
mm-unstable branch, I'm not sure if a new patch is needed. If there is
anything I should do, please let me know. Thanks.

> 
>> Test as follows:
>> /* make it easier to observe the issue */
>> echo 300000 > /proc/sys/vm/dirty_expire_centisecs
>> echo 100 > /proc/sys/vm/dirty_writeback_centisecs
>>
>> /* run fio in wb1 */
>> cd /sys/fs/cgroup
>> echo "+memory +io" > cgroup.subtree_control
>> mkdir group1
>> cd group1
>> echo 10G > memory.high
>> echo 10G > memory.max
>> echo $$ > cgroup.procs
>> mkfs.ext4 -F /dev/vdb
>> mount /dev/vdb /bdi1/
>> fio -name test -filename=/bdi1/file -size=600M -ioengine=libaio -bs=4K \
>> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>>
>> /* run fio in wb2 with a new shell */
>> cd /sys/fs/cgroup
>> mkdir group2
>> cd group2
>> echo 10G > memory.high
>> echo 10G > memory.max
>> echo $$ > cgroup.procs
>> mkfs.ext4 -F /dev/vdc
>> mount /dev/vdc /bdi2/
>> fio -name test -filename=/bdi2/file -size=600M -ioengine=libaio -bs=4K \
>> -iodepth=1 -rw=write -direct=0 --time_based -runtime=600 -invalidate=0
>>
>> Before the fix, the written pages of wb1 and wb2 reported by
>> tools/writeback/wb_monitor.py keep growing. After the fix, only a few
>> written pages accumulate.
>> There is no obvious change in the fio result.
>>
>> Fixes: 74d369443325 ("writeback: Fix performance regression in wb_over_bg_thresh()")
>> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> 
> Besides the changelog rephrasing the change looks good. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> 								Honza
> 
>> ---
>>  mm/page-writeback.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 2a3b68aae336..14893b20d38c 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -2137,7 +2137,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>>  		if (mdtc->dirty > mdtc->bg_thresh)
>>  			return true;
>>  
>> -		thresh = wb_calc_thresh(mdtc->wb, mdtc->bg_thresh);
>> +		thresh = __wb_calc_thresh(mdtc, mdtc->bg_thresh);
>>  		if (thresh < 2 * wb_stat_error())
>>  			reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
>>  		else
>> -- 
>> 2.30.0
>>



* Re: [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain
  2024-05-07  1:16     ` Kemeng Shi
@ 2024-05-07 13:28       ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2024-05-07 13:28 UTC (permalink / raw)
  To: Kemeng Shi
  Cc: Jan Kara, willy, akpm, tj, hcochran, axboe, mszeredi,
	linux-fsdevel, linux-mm, linux-kernel

On Tue 07-05-24 09:16:39, Kemeng Shi wrote:
> 
> Hi Jan,
> on 5/3/2024 5:30 PM, Jan Kara wrote:
> > On Thu 25-04-24 21:17:22, Kemeng Shi wrote:
> >> The wb_calc_thresh is supposed to calculate wb's share of bg_thresh in
> >> global domain. To calculate wb's share of bg_thresh in cgroup domain,
> >> it's more reasonable to use __wb_calc_thresh in which way we calculate
> >> dirty_thresh in cgroup domain in balance_dirty_pages().
> >>
> >> Consider following domain hierarchy:
> >>                 global domain (> 20G)
> >>                 /                 \
> >>         cgroup domain1(10G)     cgroup domain2(10G)
> >>                 |                 |
> >> bdi            wb1               wb2
> >> Assume wb1 and wb2 has the same bandwidth.
> >> We have global domain bg_thresh > 2G, cgroup domain bg_thresh 1G.
> >> Then we have:
> >> wb's thresh in global domain = 2G * (wb bandwidth) / (system bandwidth)
> >> = 2G * 1/2 = 1G
> >> wb's thresh in cgroup domain = 1G * (wb bandwidth) / (system bandwidth)
> >> = 1G * 1/2 = 0.5G
> >> At last, wb1 and wb2 will be limited at 0.5G, the system will be limited
> >> at 1G which is less than global domain bg_thresh 2G.
> > 
> > This was a bit hard to understand for me so I'd rephrase it as:
> > 
> > wb_calc_thresh() is calculating wb's share of bg_thresh in the global
> > domain. However in case of cgroup writeback this is not the right thing to
> > do. Consider the following domain hierarchy:
> > 
> >                 global domain (> 20G)
> >                 /                 \
> >           cgroup1 (10G)     cgroup2 (10G)
> >                 |                 |
> > bdi            wb1               wb2
> > 
> > and assume wb1 and wb2 have the same bandwidth and the background threshold
> > is set at 10%. The bg_thresh of cgroup1 and cgroup2 is going to be 1G. Now
> > because wb_calc_thresh(mdtc->wb, mdtc->bg_thresh) calculates per-wb
> > threshold in the global domain as (wb bandwidth) / (domain bandwidth) it
> > returns bg_thresh for wb1 as 0.5G although it has nobody to compete against
> > in cgroup1.
> > 
> > Fix the problem by calculating wb's share of bg_thresh in the cgroup
> > domain.
> Thanks for improving the changelog. As this was merged into the -mm and
> mm-unstable trees, I'm not sure whether a new patch is needed. If there is
> anything I should do, please let me know. Thanks.

No need to do anything here. Andrew has picked up these updates.

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


end of thread, other threads:[~2024-05-07 13:28 UTC | newest]

Thread overview: 13+ messages
2024-04-25 13:17 [PATCH v2 0/4] Fix and cleanups to page-writeback Kemeng Shi
2024-04-25 13:17 ` [PATCH v2 1/4] mm: enable __wb_calc_thresh to calculate dirty background threshold Kemeng Shi
2024-05-03  9:11   ` Jan Kara
2024-04-25 13:17 ` [PATCH v2 2/4] mm: correct calculation of wb's bg_thresh in cgroup domain Kemeng Shi
2024-05-03  9:30   ` Jan Kara
2024-05-07  1:16     ` Kemeng Shi
2024-05-07 13:28       ` Jan Kara
2024-04-25 13:17 ` [PATCH v2 3/4] mm: call __wb_calc_thresh instead of wb_calc_thresh in wb_over_bg_thresh Kemeng Shi
2024-05-03  9:31   ` Jan Kara
2024-04-25 13:17 ` [PATCH v2 4/4] mm: remove stale comment __folio_mark_dirty Kemeng Shi
2024-05-03  9:31   ` Jan Kara
2024-05-01 16:16 ` [PATCH v2 0/4] Fix and cleanups to page-writeback Tejun Heo
2024-05-06  1:25   ` Kemeng Shi
