linux-mm.kvack.org archive mirror
* [PATCH -next v3 0/2] watermark related improvement on zone movable
@ 2022-09-05  3:28 Wupeng Ma
  2022-09-05  3:28 ` [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
  2022-09-05  3:28 ` [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
  0 siblings, 2 replies; 14+ messages in thread
From: Wupeng Ma @ 2022-09-05  3:28 UTC (permalink / raw)
  To: akpm, david, npiggin, ying.huang, hannes
  Cc: corbet, mcgrof, mgorman, keescook, yzaikin, songmuchun,
	mike.kravetz, osalvador, surenb, mawupeng1, rppt, charante,
	jsavitz, linux-kernel, linux-mm

From: Ma Wupeng <mawupeng1@huawei.com>

The first patch caps zone movable's min watermark to a small value, since no
one can use that reserve anyway.

The second patch introduces a per-zone watermark to replace the vanilla
watermark_scale_factor, bringing the flexibility to tune each zone's
watermark separately and leading to more efficient kswapd.

Detailed information on each patch can be found in its own changelog.

changelog since v2:
- add comment in __setup_per_zone_wmarks

changelog since v1:
- fix compile error if CONFIG_SYSCTL is not enabled
- remove useless function comment

Ma Wupeng (2):
  mm: Cap zone movable's min wmark to small value
  mm: sysctl: Introduce per zone watermark_scale_factor

 Documentation/admin-guide/sysctl/vm.rst |  6 +++
 include/linux/mm.h                      |  2 +-
 kernel/sysctl.c                         |  2 -
 mm/page_alloc.c                         | 49 ++++++++++++++++++-------
 4 files changed, 43 insertions(+), 16 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value
  2022-09-05  3:28 [PATCH -next v3 0/2] watermark related improvement on zone movable Wupeng Ma
@ 2022-09-05  3:28 ` Wupeng Ma
  2022-09-05  9:26   ` Mel Gorman
  2022-09-05  3:28 ` [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
  1 sibling, 1 reply; 14+ messages in thread
From: Wupeng Ma @ 2022-09-05  3:28 UTC (permalink / raw)
  To: akpm, david, npiggin, ying.huang, hannes
  Cc: corbet, mcgrof, mgorman, keescook, yzaikin, songmuchun,
	mike.kravetz, osalvador, surenb, mawupeng1, rppt, charante,
	jsavitz, linux-kernel, linux-mm

From: Ma Wupeng <mawupeng1@huawei.com>

min_free_kbytes is based on gfp_zone(GFP_USER), which does not include
zone movable. However, zone movable still gets its min share in
__setup_per_zone_wmarks(), which does not make any sense.

And like highmem pages, __GFP_HIGH and PF_MEMALLOC allocations usually
don't need movable pages, so there is no need to assign min pages to zone
movable.

Let's cap pages_min for zone movable to a small value here, just like
highmem pages.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 mm/page_alloc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..f1e4474879f1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8636,9 +8636,9 @@ static void __setup_per_zone_wmarks(void)
 	struct zone *zone;
 	unsigned long flags;
 
-	/* Calculate total number of !ZONE_HIGHMEM pages */
+	/* Calculate total number of non-highmem/non-movable pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
 			lowmem_pages += zone_managed_pages(zone);
 	}
 
@@ -8648,15 +8648,15 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone_managed_pages(zone);
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone)) {
+		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
-			 * need highmem pages, so cap pages_min to a small
-			 * value here.
+			 * need highmem/movable pages, so cap pages_min to a
+			 * small value here.
 			 *
 			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
 			 * deltas control async page reclaim, and so should
-			 * not be capped for highmem.
+			 * not be capped for highmem/movable zone.
 			 */
 			unsigned long min_pages;
 
-- 
2.25.1




* [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-05  3:28 [PATCH -next v3 0/2] watermark related improvement on zone movable Wupeng Ma
  2022-09-05  3:28 ` [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
@ 2022-09-05  3:28 ` Wupeng Ma
  2022-09-05  3:45   ` Matthew Wilcox
                     ` (2 more replies)
  1 sibling, 3 replies; 14+ messages in thread
From: Wupeng Ma @ 2022-09-05  3:28 UTC (permalink / raw)
  To: akpm, david, npiggin, ying.huang, hannes
  Cc: corbet, mcgrof, mgorman, keescook, yzaikin, songmuchun,
	mike.kravetz, osalvador, surenb, mawupeng1, rppt, charante,
	jsavitz, linux-kernel, linux-mm, kernel test robot

From: Ma Wupeng <mawupeng1@huawei.com>

A system may have little normal zone memory and huge movable memory in the
following situations:
  - for a system with kernelcore=nn% or kernelcore=mirror, a movable zone
  will be added, and the movable zone is bigger than the normal zone in
  most cases.
  - a system with movable nodes will have multiple NUMA nodes with only a
  movable zone, and those movable zones will hold plenty of memory.

Since the kernel/drivers can only use memory from non-movable zones in most
cases, the normal zone needs to increase its watermark to reserve more
memory.

However, the current watermark_scale_factor controls all zones at once and
can't be set separately. To reserve memory in non-movable zones, the
watermark is increased in movable zones as well, which leads to inefficient
kswapd.

To solve this problem, a per-zone watermark is introduced to tune each
zone's watermark separately. This brings the following advantages:
  - each zone can set its own watermark, which brings flexibility
  - more efficient kswapd if the watermarks are tuned well

Here is real watermark data from my qemu machine (with THP disabled).

With watermark_scale_factor = 10, there are only 1440 (772-68 + 807-71)
pages (5.76M) reserved for a system with 96G of memory. However, if the
watermark_scale_factor is set to 100, the movable zone's watermark
increases to 231908 (93M), which is too much. This situation is even worse
with 32G of normal zone memory and 1T of movable zone memory.

       Modified        | Vanilla wm_factor = 10 | Vanilla wm_factor = 30
Node 0, zone      DMA  | Node 0, zone      DMA  | Node 0, zone      DMA
        min      68    |         min      68    |         min      68
        low      7113  |         low      772   |         low      7113
        high **14158** |         high **1476**  |         high **14158**
Node 0, zone   Normal  | Node 0, zone   Normal  | Node 0, zone   Normal
        min      71    |         min      71    |         min      71
        low      7438  |         low      807   |         low      7438
        high     14805 |         high     1543  |         high     14805
Node 0, zone  Movable  | Node 0, zone  Movable  | Node 0, zone  Movable
        min      1455  |         min      1455  |         min      1455
        low      16388 |         low      16386 |         low      150787
        high **31321** |         high **31317** |         high **300119**
Node 1, zone  Movable  | Node 1, zone  Movable  | Node 1, zone  Movable
        min      804   |         min      804   |         min      804
        low      9061  |         low      9061  |         low      83379
        high **17318** |         high **17318** |         high **165954**

With the modified per-zone watermark_scale_factor, only the dma/normal
zones increase their watermarks via the following command, while the huge
movable zones stay the same:

  % echo 100 100 100 10 > /proc/sys/vm/watermark_scale_factor

THP is disabled because khugepaged_min_free_kbytes_update() would
otherwise update the min watermark.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  6 ++++
 include/linux/mm.h                      |  2 +-
 kernel/sysctl.c                         |  2 --
 mm/page_alloc.c                         | 37 ++++++++++++++++++++-----
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 9b833e439f09..ec240aa45322 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -1002,6 +1002,12 @@ that the number of free pages kswapd maintains for latency reasons is
 too small for the allocation bursts occurring in the system. This knob
 can then be used to tune kswapd aggressiveness accordingly.
 
+watermark_scale_factor is an array: each zone's watermark can be set
+separately, and the current values can be seen by reading this file::
+
+	% cat /proc/sys/vm/watermark_scale_factor
+	10	10	10	10
+
 
 zone_reclaim_mode
 =================
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 21f8b27bd9fd..b291c795f9db 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2532,7 +2532,7 @@ extern void setup_per_cpu_pageset(void);
 /* page_alloc.c */
 extern int min_free_kbytes;
 extern int watermark_boost_factor;
-extern int watermark_scale_factor;
+extern int watermark_scale_factor[MAX_NR_ZONES];
 extern bool arch_has_descending_max_zone_pfns(void);
 
 /* nommu.c */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 205d605cacc5..d16d06c71e5a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(watermark_scale_factor),
 		.mode		= 0644,
 		.proc_handler	= watermark_scale_factor_sysctl_handler,
-		.extra1		= SYSCTL_ONE,
-		.extra2		= SYSCTL_THREE_THOUSAND,
 	},
 	{
 		.procname	= "percpu_pagelist_high_fraction",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f1e4474879f1..7a6ac3b4ebb6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -421,7 +421,6 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_boost_factor __read_mostly = 15000;
-int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __initdata;
 static unsigned long nr_all_pages __initdata;
@@ -449,6 +448,20 @@ EXPORT_SYMBOL(nr_online_nodes);
 
 int page_group_by_mobility_disabled __read_mostly;
 
+int watermark_scale_factor[MAX_NR_ZONES] = {
+#ifdef CONFIG_ZONE_DMA
+	[ZONE_DMA] = 10,
+#endif
+#ifdef CONFIG_ZONE_DMA32
+	[ZONE_DMA32] = 10,
+#endif
+	[ZONE_NORMAL] = 10,
+#ifdef CONFIG_HIGHMEM
+	[ZONE_HIGHMEM] = 10,
+#endif
+	[ZONE_MOVABLE] = 10,
+};
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /*
  * During boot we initialize deferred pages on-demand, as needed, but once
@@ -8643,6 +8656,7 @@ static void __setup_per_zone_wmarks(void)
 	}
 
 	for_each_zone(zone) {
+		int zone_wm_factor;
 		u64 tmp;
 
 		spin_lock_irqsave(&zone->lock, flags);
@@ -8676,9 +8690,10 @@ static void __setup_per_zone_wmarks(void)
 		 * scale factor in proportion to available memory, but
 		 * ensure a minimum size on small systems.
 		 */
+		zone_wm_factor = watermark_scale_factor[zone_idx(zone)];
 		tmp = max_t(u64, tmp >> 2,
-			    mult_frac(zone_managed_pages(zone),
-				      watermark_scale_factor, 10000));
+			    mult_frac(zone_managed_pages(zone), zone_wm_factor,
+				      10000));
 
 		zone->watermark_boost = 0;
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
@@ -8798,11 +8813,19 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *length, loff_t *ppos)
 {
-	int rc;
+	int i;
 
-	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
-	if (rc)
-		return rc;
+	proc_dointvec_minmax(table, write, buffer, length, ppos);
+
+	/*
+	 * The unit is in fractions of 10,000. The default value of 10
+	 * means the distances between watermarks are 0.1% of the available
+	 * memory in the node/system. The maximum value is 3000, or 30% of
+	 * memory.
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		watermark_scale_factor[i] =
+			clamp(watermark_scale_factor[i], 1, 3000);
 
 	if (write)
 		setup_per_zone_wmarks();
-- 
2.25.1




* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-05  3:28 ` [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
@ 2022-09-05  3:45   ` Matthew Wilcox
  2022-09-05  6:39     ` mawupeng
  2022-09-06 18:23   ` Luis Chamberlain
  2022-09-09 21:41   ` Khalid Aziz
  2 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2022-09-05  3:45 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: akpm, david, npiggin, ying.huang, hannes, corbet, mcgrof,
	mgorman, keescook, yzaikin, songmuchun, mike.kravetz, osalvador,
	surenb, rppt, charante, jsavitz, linux-kernel, linux-mm,
	kernel test robot

On Mon, Sep 05, 2022 at 11:28:58AM +0800, Wupeng Ma wrote:
> The reason to disable THP is khugepaged_min_free_kbytes_update() will
> update min watermark.
> 
> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> Reported-by: kernel test robot <lkp@intel.com>

Don't include this 'Reported-by'.  The kernel test robot did not
tell you to write this patch.



* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-05  3:45   ` Matthew Wilcox
@ 2022-09-05  6:39     ` mawupeng
  0 siblings, 0 replies; 14+ messages in thread
From: mawupeng @ 2022-09-05  6:39 UTC (permalink / raw)
  To: willy
  Cc: mawupeng1, akpm, david, npiggin, ying.huang, hannes, corbet,
	mcgrof, mgorman, keescook, yzaikin, songmuchun, mike.kravetz,
	osalvador, surenb, rppt, charante, jsavitz, linux-kernel,
	linux-mm, lkp



On 2022/9/5 11:45, Matthew Wilcox wrote:
> On Mon, Sep 05, 2022 at 11:28:58AM +0800, Wupeng Ma wrote:
>> The reason to disable THP is khugepaged_min_free_kbytes_update() will
>> update min watermark.
>>
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> Reported-by: kernel test robot <lkp@intel.com>
> 
> Don't include this 'Reported-by'.  The kernel test robot did not
> tell you to write this patch.
> 

Oh, I see.

For this patch, I will add this information to the changelog.



* Re: [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value
  2022-09-05  3:28 ` [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
@ 2022-09-05  9:26   ` Mel Gorman
  2022-09-06 10:12     ` mawupeng
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2022-09-05  9:26 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: akpm, david, ying.huang, hannes, corbet, mcgrof, keescook,
	yzaikin, songmuchun, mike.kravetz, osalvador, surenb, rppt,
	charante, jsavitz, linux-kernel, linux-mm

On Mon, Sep 05, 2022 at 11:28:57AM +0800, Wupeng Ma wrote:
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> Since min_free_kbytes is based on gfp_zone(GFP_USER) which does not include
> zone movable. However zone movable will get its min share in
> __setup_per_zone_wmarks() which does not make any sense.
> 
> And like highmem pages, __GFP_HIGH and PF_MEMALLOC allocations usually
> don't need movable pages, so there is no need to assign min pages for zone
> movable.
> 
> Let's cap pages_min for zone movable to a small value here just link
> highmem pages.
> 

I think there is a misunderstanding why the higher zones have a watermark
and why it might be large.

It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
that few of those allocations may be movable. It's because high memory
allocations indirectly pin pages in lower zones. User-mapped memory allocated
from ZONE_MOVABLE still needs page table pages allocated from a lower zone
so there is a ratio between the size of ZONE_MOVABLE and lower zones
that limits the total amount of memory that can be allocated. Similarly,
file backed pages that may be allocated from ZONE_MOVABLE still requires
pages from lower memory for the inode and other associated kernel
objects that are allocated from lower zones.

The intent behind the higher zones having a large min watermark is so
that kswapd reclaims pages from there first to *potentially* release
pages from lower memory. By capping pages_min for zone_movable, there is
the potential for lower memory pressure to be higher and to reach a point
where a ZONE_MOVABLE page cannot be allocated simply because there isn't
enough low memory available. Once the lower zones are all unreclaimable
(e.g. page table pages, or the movable pages have not been reclaimed to
free the associated kernel structures), the system goes OOM.

It's possible that there are safe adjustments that could be made that
would detect when there is no choice except to reclaim zone reclaimable
but it would be tricky and it's not this patch. This patch changelog states

	However zone movable will get its min share in
	__setup_per_zone_wmarks() which does not make any sense.

It makes sense, higher zones allocations indirectly pin pages in lower
zones and there is a bias in reclaim to free the higher zone pages first
on the *possibility* that lower zone pages get indirectly released later.

-- 
Mel Gorman
SUSE Labs



* Re: [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value
  2022-09-05  9:26   ` Mel Gorman
@ 2022-09-06 10:12     ` mawupeng
  2022-09-06 12:22       ` Mel Gorman
  0 siblings, 1 reply; 14+ messages in thread
From: mawupeng @ 2022-09-06 10:12 UTC (permalink / raw)
  To: mgorman
  Cc: mawupeng1, akpm, david, ying.huang, hannes, corbet, mcgrof,
	keescook, yzaikin, songmuchun, mike.kravetz, osalvador, surenb,
	rppt, charante, jsavitz, linux-kernel, linux-mm



On 2022/9/5 17:26, Mel Gorman wrote:
> On Mon, Sep 05, 2022 at 11:28:57AM +0800, Wupeng Ma wrote:
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Since min_free_kbytes is based on gfp_zone(GFP_USER) which does not include
>> zone movable. However zone movable will get its min share in
>> __setup_per_zone_wmarks() which does not make any sense.
>>
>> And like highmem pages, __GFP_HIGH and PF_MEMALLOC allocations usually
>> don't need movable pages, so there is no need to assign min pages for zone
>> movable.
>>
>> Let's cap pages_min for zone movable to a small value here just link
>> highmem pages.
>>
> 
> I think there is a misunderstanding why the higher zones have a watermark
> and why it might be large.
> 
> It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
> that few of those allocations may be movable. It's because high memory
> allocations indirectly pin pages in lower zones. User-mapped memory allocated
> from ZONE_MOVABLE still needs page table pages allocated from a lower zone
> so there is a ratio between the size of ZONE_MOVABLE and lower zones
> that limits the total amount of memory that can be allocated. Similarly,
> file backed pages that may be allocated from ZONE_MOVABLE still requires
> pages from lower memory for the inode and other associated kernel
> objects that are allocated from lower zones.
> 
> The intent behind the higher zones having a large min watermark is so
> that kswapd reclaims pages from there first to *potentially* release
> pages from lower memory. By capping pages_min for zone_movable, there is
> the potential for lower memory pressure to be higher and to reach a point
> where a ZONE_MOVABLE page cannot be allocated simply because there isn't
> enough low memory available. Once the lower zones are all unreclaimable
> (e.g. page table pages or the movable pages are not been reclaimed to free
> the associated kernel structures), the system goes OOM.

On this I do agree with you: the lower zone is actually "more important"
than the higher one.

But a higher min watermark for zone movable will not work, since no memory
allocation can use the reserved memory below min. Allocations with a
watermark modifier (__GFP_ATOMIC, __GFP_HIGH, ...) can use it in the
slowpath, but the standard movable memory allocation (gfp flag:
GFP_HIGHUSER_MOVABLE) does not carry those flags.

Second, lowmem_reserve_ratio is already used to "reserve" memory for lower
zones. And the second patch introduces a per-zone watermark_scale_factor to
boost the normal/movable zone's watermark, which can trigger kswapd early
for zone movable.

> 
> It's possible that there are safe adjustments that could be made that
> would detect when there is no choice except to reclaim zone reclaimable
> but it would be tricky and it's not this patch. This patch changelog states
> 
> 	However zone movable will get its min share in
> 	__setup_per_zone_wmarks() which does not make any sense.
> 
> It makes sense, higher zones allocations indirectly pin pages in lower
> zones and there is a bias in reclaim to free the higher zone pages first
> on the *possibility* that lower zone pages get indirectly released later.
> 

In our test VM with 16G of mirrored memory (normal zone) and 256G of normal
memory (movable zone), the min share for the normal zone is too small,
since the size of the min watermark is calculated from zone dma/normal but
is then shared by all zones (including zone movable) in proportion to their
managed pages.

Node 0, zone      DMA
        min      39
        low      743
        high     1447
Node 0, zone   Normal
        min      180
        low      3372
        high     6564
Node 1, zone  Movable
        min      3728
        low      69788
        high     135848



* Re: [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value
  2022-09-06 10:12     ` mawupeng
@ 2022-09-06 12:22       ` Mel Gorman
  2022-09-07  8:42         ` mawupeng
  0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2022-09-06 12:22 UTC (permalink / raw)
  To: mawupeng
  Cc: akpm, david, ying.huang, hannes, corbet, mcgrof, keescook,
	yzaikin, songmuchun, mike.kravetz, osalvador, surenb, rppt,
	charante, jsavitz, linux-kernel, linux-mm

On Tue, Sep 06, 2022 at 06:12:23PM +0800, mawupeng wrote:
> > I think there is a misunderstanding why the higher zones have a watermark
> > and why it might be large.
> > 
> > It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
> > that few of those allocations may be movable. It's because high memory
> > allocations indirectly pin pages in lower zones. User-mapped memory allocated
> > from ZONE_MOVABLE still needs page table pages allocated from a lower zone
> > so there is a ratio between the size of ZONE_MOVABLE and lower zones
> > that limits the total amount of memory that can be allocated. Similarly,
> > file backed pages that may be allocated from ZONE_MOVABLE still requires
> > pages from lower memory for the inode and other associated kernel
> > objects that are allocated from lower zones.
> > 
> > The intent behind the higher zones having a large min watermark is so
> > that kswapd reclaims pages from there first to *potentially* release
> > pages from lower memory. By capping pages_min for zone_movable, there is
> > the potential for lower memory pressure to be higher and to reach a point
> > where a ZONE_MOVABLE page cannot be allocated simply because there isn't
> > enough low memory available. Once the lower zones are all unreclaimable
> > (e.g. page table pages or the movable pages are not been reclaimed to free
> > the associated kernel structures), the system goes OOM.
> 
> This i do agree with you, lower zone is actually "more important" than the
> higher one.
> 

Very often yes.

> But higher min watermark for zone movable will not work since no memory
> allocation can use this reserve memory below min. Memory allocation
> with specify watermark modifier(__GFP_ATOMIC ,__GFP_HIGH ...) can use this
> in slowpath, however the standard movable memory allocation
> (gfp flag: GFP_HIGHUSER_MOVABLE) does not contain this.
> 

Then a more appropriate solution may be to alter how the gap between min
and low is calculated. That gap determines when kswapd is active but
allocations are still allowed.

> Second, lowmem_reserve_ratio is used to "reserve" memory for lower zone.
> And the second patch introduce per zone watermark_scale_factor to boost
> normal/movable zone's watermark which can trigger early kswapd for zone
> movable.
> 

The problem with the tunable is that this patch introduces a potentially
serious problem that must then be corrected by a system administrator, and
it'll be non-obvious what the root of the problem is or what the solution
is. All some users will be able to determine is that OOM triggers when
there is plenty of free memory, or that kswapd is consuming a lot more CPU
than expected. They will not necessarily be able to determine that
watermark_scale_factor is the solution.

> > 
> > It's possible that there are safe adjustments that could be made that
> > would detect when there is no choice except to reclaim zone reclaimable
> > but it would be tricky and it's not this patch. This patch changelog states
> > 
> > 	However zone movable will get its min share in
> > 	__setup_per_zone_wmarks() which does not make any sense.
> > 
> > It makes sense, higher zones allocations indirectly pin pages in lower
> > zones and there is a bias in reclaim to free the higher zone pages first
> > on the *possibility* that lower zone pages get indirectly released later.
> > 
> 
> In our Test vm with 16G of mirrored memory(normal zone) and 256 of normal
> momory(Movable zone), the min share for normal zone is too few since the
> size of min watermark is calc by zone dma/normal while this will be shared
> by zones(include zone movable) based on managed pages.
> 
> Node 0, zone      DMA
>         min      39
>         low      743
>         high     1447
> Node 0, zone   Normal
>         min      180
>         low      3372
>         high     6564
> Node 1, zone  Movable
>         min      3728
>         low      69788
>         high     135848

The gap between min and low is massive so either adjust how that gap is
calculated or to avoid side-effects for other users, consider special
casing the gap for ZONE_MOVABLE with a comment explaining why it is
treated differently. To mitigate the risk further, it could be further
special cased to only apply when there is a massive ratio between
ALL_ZONES_EXCEPT_MOVABLE:ZONE_MOVABLE. Document in the changelog the
potential downside of more lowmem potentially getting pinned by MOVABLE
allocations leading to excessive kswapd activity or premature OOM.

-- 
Mel Gorman
SUSE Labs



* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-05  3:28 ` [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
  2022-09-05  3:45   ` Matthew Wilcox
@ 2022-09-06 18:23   ` Luis Chamberlain
  2022-09-07  3:29     ` mawupeng
  2022-09-09 21:41   ` Khalid Aziz
  2 siblings, 1 reply; 14+ messages in thread
From: Luis Chamberlain @ 2022-09-06 18:23 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: akpm, david, npiggin, ying.huang, hannes, corbet, mgorman,
	keescook, yzaikin, songmuchun, mike.kravetz, osalvador, surenb,
	rppt, charante, jsavitz, linux-kernel, linux-mm,
	kernel test robot

On Mon, Sep 05, 2022 at 11:28:58AM +0800, Wupeng Ma wrote:
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 205d605cacc5..d16d06c71e5a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
>  		.maxlen		= sizeof(watermark_scale_factor),
>  		.mode		= 0644,
>  		.proc_handler	= watermark_scale_factor_sysctl_handler,
> -		.extra1		= SYSCTL_ONE,
> -		.extra2		= SYSCTL_THREE_THOUSAND,
>  	},
>  	{
>  		.procname	= "percpu_pagelist_high_fraction",

Please move the sysctl from kernel/sysctl.c to mm/page_alloc.c while
at it, you can git log the kernel/sysctl.c for prior moves and for
the motivations. No need to keep expanding on the existing table.

  Luis



* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-06 18:23   ` Luis Chamberlain
@ 2022-09-07  3:29     ` mawupeng
  0 siblings, 0 replies; 14+ messages in thread
From: mawupeng @ 2022-09-07  3:29 UTC (permalink / raw)
  To: mcgrof
  Cc: mawupeng1, akpm, david, npiggin, ying.huang, hannes, corbet,
	mgorman, keescook, yzaikin, songmuchun, mike.kravetz, osalvador,
	surenb, rppt, charante, jsavitz, linux-kernel, linux-mm, lkp



On 2022/9/7 2:23, Luis Chamberlain wrote:
> On Mon, Sep 05, 2022 at 11:28:58AM +0800, Wupeng Ma wrote:
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 205d605cacc5..d16d06c71e5a 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
>>  		.maxlen		= sizeof(watermark_scale_factor),
>>  		.mode		= 0644,
>>  		.proc_handler	= watermark_scale_factor_sysctl_handler,
>> -		.extra1		= SYSCTL_ONE,
>> -		.extra2		= SYSCTL_THREE_THOUSAND,
>>  	},
>>  	{
>>  		.procname	= "percpu_pagelist_high_fraction",
> 
> Please move the sysctl from kernel/sysctl.c to mm/page_alloc.c while
> at it, you can git log the kernel/sysctl.c for prior moves and for
> the motivations. No need to keep expanding on the existing table.
> 
>   Luis

Ok.

I will move this sysctl to mm/page_alloc.c next version.



* Re: [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value
  2022-09-06 12:22       ` Mel Gorman
@ 2022-09-07  8:42         ` mawupeng
  0 siblings, 0 replies; 14+ messages in thread
From: mawupeng @ 2022-09-07  8:42 UTC (permalink / raw)
  To: mgorman
  Cc: mawupeng1, akpm, david, ying.huang, hannes, corbet, mcgrof,
	keescook, yzaikin, songmuchun, mike.kravetz, osalvador, surenb,
	rppt, charante, jsavitz, linux-kernel, linux-mm



On 2022/9/6 20:22, Mel Gorman wrote:
> On Tue, Sep 06, 2022 at 06:12:23PM +0800, mawupeng wrote:
>>> I think there is a misunderstanding why the higher zones have a watermark
>>> and why it might be large.
>>>
>>> It's not about a __GFP_HIGH or PF_MEMALLOC allocations because it's known
>>> that few of those allocations may be movable. It's because high memory
>>> allocations indirectly pin pages in lower zones. User-mapped memory allocated
>>> from ZONE_MOVABLE still needs page table pages allocated from a lower zone
>>> so there is a ratio between the size of ZONE_MOVABLE and lower zones
>>> that limits the total amount of memory that can be allocated. Similarly,
>>> file backed pages that may be allocated from ZONE_MOVABLE still requires
>>> pages from lower memory for the inode and other associated kernel
>>> objects that are allocated from lower zones.
>>>
>>> The intent behind the higher zones having a large min watermark is so
>>> that kswapd reclaims pages from there first to *potentially* release
>>> pages from lower memory. By capping pages_min for zone_movable, there is
>>> the potential for lower memory pressure to be higher and to reach a point
>>> where a ZONE_MOVABLE page cannot be allocated simply because there isn't
>>> enough low memory available. Once the lower zones are all unreclaimable
>>> (e.g. page table pages or the movable pages are not been reclaimed to free
>>> the associated kernel structures), the system goes OOM.
>>
>> This i do agree with you, lower zone is actually "more important" than the
>> higher one.
>>
> 
> Very often yes.
> 
>> But higher min watermark for zone movable will not work since no memory
>> allocation can use this reserve memory below min. Memory allocation
>> with specify watermark modifier(__GFP_ATOMIC ,__GFP_HIGH ...) can use this
>> in slowpath, however the standard movable memory allocation
>> (gfp flag: GFP_HIGHUSER_MOVABLE) does not contain this.
>>
> 
> Then a more appropriate solution may be to alter how the gap between min
> and low is calculated. That gap determines when kswapd is active but
> allocations are still allowed.
> 
>> Second, lowmem_reserve_ratio is used to "reserve" memory for lower zone.
>> And the second patch introduce per zone watermark_scale_factor to boost
>> normal/movable zone's watermark which can trigger early kswapd for zone
>> movable.
>>
> 
> The problem with the tunable is that this patch introduces a potentially
> seriously problem that must then be corrected by a system administrator and
> it'll be non-obvious what the root of the problem is or the solution. For
> some users, they will only be able to determine is that OOM triggers
> when there is plenty of free memory or kswapd is consuming a lot more
> CPU than expected. They will not necessarily be able to determine that
> watermark_scale_factor is the solution.
> 
>>>
>>> It's possible that there are safe adjustments that could be made that
>>> would detect when there is no choice except to reclaim zone reclaimable
>>> but it would be tricky and it's not this patch. This patch changelog states
>>>
>>> 	However zone movable will get its min share in
>>> 	__setup_per_zone_wmarks() which does not make any sense.
>>>
>>> It makes sense, higher zones allocations indirectly pin pages in lower
>>> zones and there is a bias in reclaim to free the higher zone pages first
>>> on the *possibility* that lower zone pages get indirectly released later.
>>>
>>
>> In our Test vm with 16G of mirrored memory(normal zone) and 256 of normal
>> momory(Movable zone), the min share for normal zone is too few since the
>> size of min watermark is calc by zone dma/normal while this will be shared
>> by zones(include zone movable) based on managed pages.
>>
>> Node 0, zone      DMA
>>         min      39
>>         low      743
>>         high     1447
>> Node 0, zone   Normal
>>         min      180
>>         low      3372
>>         high     6564
>> Node 1, zone  Movable
>>         min      3728
>>         low      69788
>>         high     135848
> 
> The gap between min and low is massive so either adjust how that gap is
> calculated or to avoid side-effects for other users, consider special
> casing the gap for ZONE_MOVABLE with a comment explaining why it is
> treated differently. To mitigate the risk further, it could be further
> special cased to only apply when there is a massive ratio between
> ALL_ZONES_EXCEPT_MOVABLE:ZONE_MOVABLE. Document in the changelog the
> potential downside of more lowmem potentially getting pinned by MOVABLE
> allocations leading to excessive kswapd activity or premature OOM.
What I'm trying to say is that the min watermark is too low for zone normal,
since it is shared with the other zones based on managed pages.

        Vanilla          |         Modified         
Node 0, zone      DMA    | Node 0, zone      DMA    
        min      39      |         min      713     
        low      743     |         low      1417    
        high     1447    |         high     2121    
Node 0, zone   Normal    | Node 0, zone   Normal    
        min    **180**   |         min    **3234**    
        low      3372    |         low      6426    
        high     6564    |         high     9618    
Node 1, zone  Movable    | Node 1, zone  Movable    
        min    **3728**  |         min    **128**     
        low      69788   |         low      66188   
        high     135848  |         high     132248 

You can see that after this patch the movable zone's min watermark is set to a
small value (128), while zone dma/normal's min watermarks increase a lot, which
will be useful if the system is low on memory.

The gap between min and low is about 1/1000 of the zone's memory, which will not
be affected by this patch.

In this patch I want to do something about the min watermark for zone movable.
In the next patch I want to reserve memory for zone normal, or just make
watermark_scale_factor more flexible for a little normal zone and a huge movable zone.

What is your idea on "Cap zone movable's min wmark to small value"?


> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-05  3:28 ` [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
  2022-09-05  3:45   ` Matthew Wilcox
  2022-09-06 18:23   ` Luis Chamberlain
@ 2022-09-09 21:41   ` Khalid Aziz
  2022-09-13  2:09     ` mawupeng
  2 siblings, 1 reply; 14+ messages in thread
From: Khalid Aziz @ 2022-09-09 21:41 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: akpm, david, npiggin, ying.huang, hannes, corbet, mcgrof,
	mgorman, keescook, yzaikin, songmuchun, mike.kravetz, osalvador,
	surenb, rppt, charante, jsavitz, linux-kernel, linux-mm,
	kernel test robot

On Mon, 2022-09-05 at 11:28 +0800, Wupeng Ma wrote:
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> System may have little normal zone memory and huge movable memory in the
> following situations:
>   - for system with kernelcore=nn% or kernelcore=mirror, movable zone will
>   be added and movable zone is bigger than normal zone in most cases.
>   - system with movable nodes, they will have multiple numa nodes with
>   only movable zone and movable zone will have plenty of memory.
> 
> Since kernel/driver can only use memory from non-movable zone in most
> cases, normal zone need to increase its watermark to reserve more
> memory.
> 
> However, current watermark_scale_factor is used to control all zones
> at once and can't be set separately. To reserve memory in non-movable
> zones, the watermark is increased in movable zones as well. Which will
> lead to inefficient kswapd.

Similar issues happen on systems with large amount of memory (1 TB or
more). Most of the memory ends up in Normal zone while DMA and DMA32
zones have very little memory. For the watermark scale factor that
results in a reasonable low watermark for Normal zone, low watermark
can be too low for DMA and DMA32 zones. Being able to tune those
watermarks independently can be helpful. The trouble with this approach
is it introduces another level of complexity to tuning knobs with no
clear guidelines for system admins on how to tune these. I already see
multiple customers struggling with setting simple min_free_kbytes or
watermark_scale_factor. Once we add the complexity of per zone
watermark scale factor, it only gets to be a more daunting task for
system admins. NUMA systems with multiple nodes with hot-pluggable
memory can have sizeable number of zones.

I see the usefulness of per-zone watermark but what guidance would you
give to a sysadmin on how to set these values for their systems?

Thanks,
Khalid


> 
> To solve this problem, per zone watermark is introduced to tune each
> zone's watermark separately. This can bring the following advantages:
>   - each zone can set its own watermark which bring flexibility
>   - lead to more efficient kswapd if this watermark is set fine
> 
> Here is real watermark data in my qemu machine(with THP disabled).
> 
> With watermark_scale_factor = 10, there is only 1440(772-68+807-71)
> pages(5.76M) reserved for a system with 96G of memory. However if the
> watermark is set to 100, the movable zone's watermark increased to
> 231908(93M), which is too much.
> This situation is even worse with 32G of normal zone memory and 1T of
> movable zone memory.
> 
>        Modified        | Vanilla wm_factor = 10 | Vanilla wm_factor = 30
> Node 0, zone      DMA  | Node 0, zone      DMA  | Node 0, zone      DMA
>         min      68    |         min      68    |         min      68
>         low      7113  |         low      772   |         low      7113
>         high **14158** |         high **1476**  |         high **14158**
> Node 0, zone   Normal  | Node 0, zone   Normal  | Node 0, zone   Normal
>         min      71    |         min      71    |         min      71
>         low      7438  |         low      807   |         low      7438
>         high     14805 |         high     1543  |         high     14805
> Node 0, zone  Movable  | Node 0, zone  Movable  | Node 0, zone  Movable
>         min      1455  |         min      1455  |         min      1455
>         low      16388 |         low      16386 |         low      150787
>         high **31321** |         high **31317** |         high **300119**
> Node 1, zone  Movable  | Node 1, zone  Movable  | Node 1, zone  Movable
>         min      804   |         min      804   |         min      804
>         low      9061  |         low      9061  |         low      83379
>         high **17318** |         high **17318** |         high **165954**
> 
> With the modified per zone watermark_scale_factor, only dma/normal zone
> will increase its watermark via the following command which the huge
> movable zone stay the same.
> 
>   % echo 100 100 100 10 > /proc/sys/vm/watermark_scale_factor
> 
> The reason to disable THP is khugepaged_min_free_kbytes_update() will
> update min watermark.
> 
> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> Reported-by: kernel test robot <lkp@intel.com>
> ---
>  Documentation/admin-guide/sysctl/vm.rst |  6 ++++
>  include/linux/mm.h                      |  2 +-
>  kernel/sysctl.c                         |  2 --
>  mm/page_alloc.c                         | 37 ++++++++++++++++++++-----
>  4 files changed, 37 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index 9b833e439f09..ec240aa45322 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -1002,6 +1002,12 @@ that the number of free pages kswapd maintains for latency reasons is
>  too small for the allocation bursts occurring in the system. This knob
>  can then be used to tune kswapd aggressiveness accordingly.
>  
> +The watermark_scale_factor is an array. You can set each zone's watermark
> +separately and can be seen by reading this file::
> +
> +       % cat /proc/sys/vm/watermark_scale_factor
> +       10      10      10      10
> +
>  
>  zone_reclaim_mode
>  =================
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 21f8b27bd9fd..b291c795f9db 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2532,7 +2532,7 @@ extern void setup_per_cpu_pageset(void);
>  /* page_alloc.c */
>  extern int min_free_kbytes;
>  extern int watermark_boost_factor;
> -extern int watermark_scale_factor;
> +extern int watermark_scale_factor[MAX_NR_ZONES];
>  extern bool arch_has_descending_max_zone_pfns(void);
>  
>  /* nommu.c */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 205d605cacc5..d16d06c71e5a 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
>                 .maxlen         = sizeof(watermark_scale_factor),
>                 .mode           = 0644,
>                 .proc_handler   = watermark_scale_factor_sysctl_handler,
> -               .extra1         = SYSCTL_ONE,
> -               .extra2         = SYSCTL_THREE_THOUSAND,
>         },
>         {
>                 .procname       = "percpu_pagelist_high_fraction",
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f1e4474879f1..7a6ac3b4ebb6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -421,7 +421,6 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
>  int min_free_kbytes = 1024;
>  int user_min_free_kbytes = -1;
>  int watermark_boost_factor __read_mostly = 15000;
> -int watermark_scale_factor = 10;
>  
>  static unsigned long nr_kernel_pages __initdata;
>  static unsigned long nr_all_pages __initdata;
> @@ -449,6 +448,20 @@ EXPORT_SYMBOL(nr_online_nodes);
>  
>  int page_group_by_mobility_disabled __read_mostly;
>  
> +int watermark_scale_factor[MAX_NR_ZONES] = {
> +#ifdef CONFIG_ZONE_DMA
> +       [ZONE_DMA] = 10,
> +#endif
> +#ifdef CONFIG_ZONE_DMA32
> +       [ZONE_DMA32] = 10,
> +#endif
> +       [ZONE_NORMAL] = 10,
> +#ifdef CONFIG_HIGHMEM
> +       [ZONE_HIGHMEM] = 10,
> +#endif
> +       [ZONE_MOVABLE] = 10,
> +};
> +
>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>  /*
>   * During boot we initialize deferred pages on-demand, as needed, but once
> @@ -8643,6 +8656,7 @@ static void __setup_per_zone_wmarks(void)
>         }
>  
>         for_each_zone(zone) {
> +               int zone_wm_factor;
>                 u64 tmp;
>  
>                 spin_lock_irqsave(&zone->lock, flags);
> @@ -8676,9 +8690,10 @@ static void __setup_per_zone_wmarks(void)
>                  * scale factor in proportion to available memory, but
>                  * ensure a minimum size on small systems.
>                  */
> +               zone_wm_factor = watermark_scale_factor[zone_idx(zone)];
>                 tmp = max_t(u64, tmp >> 2,
> -                           mult_frac(zone_managed_pages(zone),
> -                                     watermark_scale_factor, 10000));
> +                           mult_frac(zone_managed_pages(zone), zone_wm_factor,
> +                                     10000));
>  
>                 zone->watermark_boost = 0;
>                 zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
> @@ -8798,11 +8813,19 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
>  int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
>                 void *buffer, size_t *length, loff_t *ppos)
>  {
> -       int rc;
> +       int i;
>  
> -       rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
> -       if (rc)
> -               return rc;
> +       proc_dointvec_minmax(table, write, buffer, length, ppos);
> +
> +       /*
> +        * The unit is in fractions of 10,000. The default value of 10
> +        * means the distances between watermarks are 0.1% of the available
> +        * memory in the node/system. The maximum value is 3000, or 30% of
> +        * memory.
> +        */
> +       for (i = 0; i < MAX_NR_ZONES; i++)
> +               watermark_scale_factor[i] =
> +                       clamp(watermark_scale_factor[i], 1, 3000);
>  
>         if (write)
>                 setup_per_zone_wmarks();


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-09 21:41   ` Khalid Aziz
@ 2022-09-13  2:09     ` mawupeng
  2022-09-14 22:42       ` Khalid Aziz
  0 siblings, 1 reply; 14+ messages in thread
From: mawupeng @ 2022-09-13  2:09 UTC (permalink / raw)
  To: gonehacking
  Cc: mawupeng1, akpm, david, npiggin, ying.huang, hannes, corbet,
	mcgrof, mgorman, keescook, yzaikin, songmuchun, mike.kravetz,
	osalvador, surenb, rppt, charante, jsavitz, linux-kernel,
	linux-mm, lkp



On 2022/9/10 5:41, Khalid Aziz wrote:
> On Mon, 2022-09-05 at 11:28 +0800, Wupeng Ma wrote:
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> System may have little normal zone memory and huge movable memory in the
>> following situations:
>>   - for system with kernelcore=nn% or kernelcore=mirror, movable zone will
>>   be added and movable zone is bigger than normal zone in most cases.
>>   - system with movable nodes, they will have multiple numa nodes with
>>   only movable zone and movable zone will have plenty of memory.
>>
>> Since kernel/driver can only use memory from non-movable zone in most
>> cases, normal zone need to increase its watermark to reserve more
>> memory.
>>
>> However, current watermark_scale_factor is used to control all zones
>> at once and can't be set separately. To reserve memory in non-movable
>> zones, the watermark is increased in movable zones as well. Which will
>> lead to inefficient kswapd.
> 
> Similar issues happen on systems with large amount of memory (1 TB or
> more). Most of the memory ends up in Normal zone while DMA and DMA32
> zones have very little memory. For the watermark scale factor that
> results in a reasonable low watermark for Normal zone, low watermark
> can be too low for DMA and DMA32 zones. Being able to tune those
> watermarks independently can be helpful. The trouble with this approach
> is it introduces another level of complexity to tuning knobs with no
> clear guidelines for system admins on how to tune these. I already see
> multiple customers struggling with setting simple min_free_kbytes or
> watermark_scale_factor. Once we add the complexity of per zone
> watermark scale factor, it only gets to be a more daunting task for
> system admins. NUMA systems with multiple nodes with hot-pluggable
> memory can have sizeable number of zones.
> 
> I see the usefulness of per-zone watermark but what guidance would you
> give to a sysadmin on how to set these values for their systems?
> 
> Thanks,
> Khalid

Thanks for your reply.

Like the vanilla watermark_scale_factor, this per-zone one just introduces
the ability to tune the watermark of each zone separately. The overall
usage of the per-zone one is similar to the vanilla one.

A memory allocation below the low watermark will wake kswapd, which drops
caches or swaps out until free memory reaches the high watermark. Since
allocations can then get their pages in the fast path (watermark above
low), subsequent allocations benefit from this background kswapd.

Since memory below min is reserved for emergencies (being below min means
the system is low on memory), and higher zone (movable) allocations
indirectly pin pages in the lower zones (page tables, ...), boosting the
watermarks of the lower zones can reserve enough memory for the
kernel/drivers to use.

So a bigger watermark for the lower zones when the higher zones are huge
may be the solution.

> 
> 
>>
>> To solve this problem, per zone watermark is introduced to tune each
>> zone's watermark separately. This can bring the following advantages:
>>   - each zone can set its own watermark which bring flexibility
>>   - lead to more efficient kswapd if this watermark is set fine
>>
>> Here is real watermark data in my qemu machine(with THP disabled).
>>
>> With watermark_scale_factor = 10, there is only 1440(772-68+807-71)
>> pages(5.76M) reserved for a system with 96G of memory. However if the
>> watermark is set to 100, the movable zone's watermark increased to
>> 231908(93M), which is too much.
>> This situation is even worse with 32G of normal zone memory and 1T of
>> movable zone memory.
>>
>>        Modified        | Vanilla wm_factor = 10 | Vanilla wm_factor = 30
>> Node 0, zone      DMA  | Node 0, zone      DMA  | Node 0, zone      DMA
>>         min      68    |         min      68    |         min      68
>>         low      7113  |         low      772   |         low      7113
>>         high **14158** |         high **1476**  |         high **14158**
>> Node 0, zone   Normal  | Node 0, zone   Normal  | Node 0, zone   Normal
>>         min      71    |         min      71    |         min      71
>>         low      7438  |         low      807   |         low      7438
>>         high     14805 |         high     1543  |         high     14805
>> Node 0, zone  Movable  | Node 0, zone  Movable  | Node 0, zone  Movable
>>         min      1455  |         min      1455  |         min      1455
>>         low      16388 |         low      16386 |         low      150787
>>         high **31321** |         high **31317** |         high **300119**
>> Node 1, zone  Movable  | Node 1, zone  Movable  | Node 1, zone  Movable
>>         min      804   |         min      804   |         min      804
>>         low      9061  |         low      9061  |         low      83379
>>         high **17318** |         high **17318** |         high **165954**
>>
>> With the modified per zone watermark_scale_factor, only dma/normal zone
>> will increase its watermark via the following command which the huge
>> movable zone stay the same.
>>
>>   % echo 100 100 100 10 > /proc/sys/vm/watermark_scale_factor
>>
>> The reason to disable THP is khugepaged_min_free_kbytes_update() will
>> update min watermark.
>>
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> Reported-by: kernel test robot <lkp@intel.com>
>> ---
>>  Documentation/admin-guide/sysctl/vm.rst |  6 ++++
>>  include/linux/mm.h                      |  2 +-
>>  kernel/sysctl.c                         |  2 --
>>  mm/page_alloc.c                         | 37 ++++++++++++++++++++-----
>>  4 files changed, 37 insertions(+), 10 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
>> index 9b833e439f09..ec240aa45322 100644
>> --- a/Documentation/admin-guide/sysctl/vm.rst
>> +++ b/Documentation/admin-guide/sysctl/vm.rst
>> @@ -1002,6 +1002,12 @@ that the number of free pages kswapd maintains for latency reasons is
>>  too small for the allocation bursts occurring in the system. This knob
>>  can then be used to tune kswapd aggressiveness accordingly.
>>  
>> +The watermark_scale_factor is an array. You can set each zone's watermark
>> +separately and can be seen by reading this file::
>> +
>> +       % cat /proc/sys/vm/watermark_scale_factor
>> +       10      10      10      10
>> +
>>  
>>  zone_reclaim_mode
>>  =================
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 21f8b27bd9fd..b291c795f9db 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2532,7 +2532,7 @@ extern void setup_per_cpu_pageset(void);
>>  /* page_alloc.c */
>>  extern int min_free_kbytes;
>>  extern int watermark_boost_factor;
>> -extern int watermark_scale_factor;
>> +extern int watermark_scale_factor[MAX_NR_ZONES];
>>  extern bool arch_has_descending_max_zone_pfns(void);
>>  
>>  /* nommu.c */
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index 205d605cacc5..d16d06c71e5a 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
>>                 .maxlen         = sizeof(watermark_scale_factor),
>>                 .mode           = 0644,
>>                 .proc_handler   = watermark_scale_factor_sysctl_handler,
>> -               .extra1         = SYSCTL_ONE,
>> -               .extra2         = SYSCTL_THREE_THOUSAND,
>>         },
>>         {
>>                 .procname       = "percpu_pagelist_high_fraction",
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index f1e4474879f1..7a6ac3b4ebb6 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -421,7 +421,6 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
>>  int min_free_kbytes = 1024;
>>  int user_min_free_kbytes = -1;
>>  int watermark_boost_factor __read_mostly = 15000;
>> -int watermark_scale_factor = 10;
>>  
>>  static unsigned long nr_kernel_pages __initdata;
>>  static unsigned long nr_all_pages __initdata;
>> @@ -449,6 +448,20 @@ EXPORT_SYMBOL(nr_online_nodes);
>>  
>>  int page_group_by_mobility_disabled __read_mostly;
>>  
>> +int watermark_scale_factor[MAX_NR_ZONES] = {
>> +#ifdef CONFIG_ZONE_DMA
>> +       [ZONE_DMA] = 10,
>> +#endif
>> +#ifdef CONFIG_ZONE_DMA32
>> +       [ZONE_DMA32] = 10,
>> +#endif
>> +       [ZONE_NORMAL] = 10,
>> +#ifdef CONFIG_HIGHMEM
>> +       [ZONE_HIGHMEM] = 10,
>> +#endif
>> +       [ZONE_MOVABLE] = 10,
>> +};
>> +
>>  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>>  /*
>>   * During boot we initialize deferred pages on-demand, as needed, but once
>> @@ -8643,6 +8656,7 @@ static void __setup_per_zone_wmarks(void)
>>         }
>>  
>>         for_each_zone(zone) {
>> +               int zone_wm_factor;
>>                 u64 tmp;
>>  
>>                 spin_lock_irqsave(&zone->lock, flags);
>> @@ -8676,9 +8690,10 @@ static void __setup_per_zone_wmarks(void)
>>                  * scale factor in proportion to available memory, but
>>                  * ensure a minimum size on small systems.
>>                  */
>> +               zone_wm_factor = watermark_scale_factor[zone_idx(zone)];
>>                 tmp = max_t(u64, tmp >> 2,
>> -                           mult_frac(zone_managed_pages(zone),
>> -                                     watermark_scale_factor, 10000));
>> +                           mult_frac(zone_managed_pages(zone), zone_wm_factor,
>> +                                     10000));
>>  
>>                 zone->watermark_boost = 0;
>>                 zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
>> @@ -8798,11 +8813,19 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
>>  int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
>>                 void *buffer, size_t *length, loff_t *ppos)
>>  {
>> -       int rc;
>> +       int i;
>>  
>> -       rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
>> -       if (rc)
>> -               return rc;
>> +       proc_dointvec_minmax(table, write, buffer, length, ppos);
>> +
>> +       /*
>> +        * The unit is in fractions of 10,000. The default value of 10
>> +        * means the distances between watermarks are 0.1% of the available
>> +        * memory in the node/system. The maximum value is 3000, or 30% of
>> +        * memory.
>> +        */
>> +       for (i = 0; i < MAX_NR_ZONES; i++)
>> +               watermark_scale_factor[i] =
>> +                       clamp(watermark_scale_factor[i], 1, 3000);
>>  
>>         if (write)
>>                 setup_per_zone_wmarks();
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-09-13  2:09     ` mawupeng
@ 2022-09-14 22:42       ` Khalid Aziz
  0 siblings, 0 replies; 14+ messages in thread
From: Khalid Aziz @ 2022-09-14 22:42 UTC (permalink / raw)
  To: mawupeng
  Cc: akpm, david, npiggin, ying.huang, hannes, corbet, mcgrof,
	mgorman, keescook, yzaikin, songmuchun, mike.kravetz, osalvador,
	surenb, rppt, charante, jsavitz, linux-kernel, linux-mm, lkp

On Tue, 2022-09-13 at 10:09 +0800, mawupeng wrote:
> 
> 
> On 2022/9/10 5:41, Khalid Aziz wrote:
> > On Mon, 2022-09-05 at 11:28 +0800, Wupeng Ma wrote:
> > > From: Ma Wupeng <mawupeng1@huawei.com>
> > > 
> > > System may have little normal zone memory and huge movable memory in the
> > > following situations:
> > >   - for system with kernelcore=nn% or kernelcore=mirror, movable zone will
> > >   be added and movable zone is bigger than normal zone in most cases.
> > >   - system with movable nodes, they will have multiple numa nodes with
> > >   only movable zone and movable zone will have plenty of memory.
> > > 
> > > Since kernel/driver can only use memory from non-movable zone in most
> > > cases, normal zone need to increase its watermark to reserve more
> > > memory.
> > > 
> > > However, current watermark_scale_factor is used to control all zones
> > > at once and can't be set separately. To reserve memory in non-movable
> > > zones, the watermark is increased in movable zones as well. Which will
> > > lead to inefficient kswapd.
> > 
> > Similar issues happen on systems with large amount of memory (1 TB or
> > more). Most of the memory ends up in Normal zone while DMA and DMA32
> > zones have very little memory. For the watermark scale factor that
> > results in a reasonable low watermark for Normal zone, low watermark
> > can be too low for DMA and DMA32 zones. Being able to tune those
> > watermarks independently can be helpful. The trouble with this approach
> > is it introduces another level of complexity to tuning knobs with no
> > clear guidelines for system admins on how to tune these. I already see
> > multiple customers struggling with setting simple min_free_kbytes or
> > watermark_scale_factor. Once we add the complexity of per zone
> > watermark scale factor, it only gets to be a more daunting task for
> > system admins. NUMA systems with multiple nodes with hot-pluggable
> > memory can have sizeable number of zones.
> > 
> > I see the usefulness of per-zone watermark but what guidance would
> > you
> > give to a sysadmin on how to set these values for their systems?
> > 
> > Thanks,
> > Khalid
> 
> Thanks for your reply.
> 
> Like the vanilla watermark_scale_factor, this per-zone one just introduces
> the ability to tune the watermark of each zone separately. The overall
> usage of the per-zone one is similar to the vanilla one.
> 
> A memory allocation below the low watermark will wake kswapd, which drops
> caches or swaps out until free memory reaches the high watermark. Since
> allocations can then get their pages in the fast path (watermark above
> low), subsequent allocations benefit from this background kswapd.
> 
> Since memory below min is reserved for emergencies (being below min means
> the system is low on memory), and higher zone (movable) allocations
> indirectly pin pages in the lower zones (page tables, ...), boosting the
> watermarks of the lower zones can reserve enough memory for the
> kernel/drivers to use.
> 
> So a bigger watermark for the lower zones when the higher zones are huge
> may be the solution.

I understand how it works. My question is how a system admin would look
at the workload on their system and know what appropriate values to set
for each zone would be. "Bigger watermark for lower zones if the higher
zones is huge" is one way to look at this, but it is vague: how would a
system admin arrive at those numbers, and how would they know if they
got it right? Documentation for watermark_scale_factor says "A high
rate of threads entering direct reclaim (allocstall) or kswapd going to
sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate that the
number of free pages kswapd maintains for latency reasons is too small
for the allocation bursts occurring in the system. This knob can then
be used to tune kswapd aggressiveness accordingly." That provides some
guidance on where to observe problems that can possibly be mitigated
with watermark_scale_factor. It is still vague about how much to tune
the knob and in which direction, but at least one can observe the
effects of tuning it. With the number of knobs going up, how would you
know which zone(s) to tune?

Thanks,
Khalid

> 
> > 
> > 
> > > 
> > > To solve this problem, per zone watermark is introduced to tune each
> > > zone's watermark separately. This can bring the following advantages:
> > >   - each zone can set its own watermark which bring flexibility
> > >   - lead to more efficient kswapd if this watermark is set fine
> > > 
> > > Here is real watermark data in my qemu machine(with THP disabled).
> > > 
> > > With watermark_scale_factor = 10, there is only 1440(772-68+807-71)
> > > pages(5.76M) reserved for a system with 96G of memory. However if the
> > > watermark is set to 100, the movable zone's watermark increased to
> > > 231908(93M), which is too much.
> > > This situation is even worse with 32G of normal zone memory and 1T of
> > > movable zone memory.
> > > 
> > >        Modified        | Vanilla wm_factor = 10 | Vanilla wm_factor = 30
> > > Node 0, zone      DMA  | Node 0, zone      DMA  | Node 0, zone      DMA
> > >         min      68    |         min      68    |         min      68
> > >         low      7113  |         low      772   |         low      7113
> > >         high **14158** |         high **1476**  |         high **14158**
> > > Node 0, zone   Normal  | Node 0, zone   Normal  | Node 0, zone   Normal
> > >         min      71    |         min      71    |         min      71
> > >         low      7438  |         low      807   |         low      7438
> > >         high     14805 |         high     1543  |         high     14805
> > > Node 0, zone  Movable  | Node 0, zone  Movable  | Node 0, zone  Movable
> > >         min      1455  |         min      1455  |         min      1455
> > >         low      16388 |         low      16386 |         low      150787
> > >         high **31321** |         high **31317** |         high **300119**
> > > Node 1, zone  Movable  | Node 1, zone  Movable  | Node 1, zone  Movable
> > >         min      804   |         min      804   |         min      804
> > >         low      9061  |         low      9061  |         low      83379
> > >         high **17318** |         high **17318** |         high **165954**
> > > 
> > > With the modified per-zone watermark_scale_factor, only the
> > > dma/normal zones increase their watermarks via the following
> > > command, while the huge movable zone stays the same:
> > > 
> > >   % echo 100 100 100 10 > /proc/sys/vm/watermark_scale_factor
> > > 
> > > The reason to disable THP is that
> > > khugepaged_min_free_kbytes_update() will update the min watermark.
> > > 
> > > Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> > > Reported-by: kernel test robot <lkp@intel.com>
> > > ---
> > >  Documentation/admin-guide/sysctl/vm.rst |  6 ++++
> > >  include/linux/mm.h                      |  2 +-
> > >  kernel/sysctl.c                         |  2 --
> > >  mm/page_alloc.c                         | 37 ++++++++++++++++++++-----
> > >  4 files changed, 37 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> > > index 9b833e439f09..ec240aa45322 100644
> > > --- a/Documentation/admin-guide/sysctl/vm.rst
> > > +++ b/Documentation/admin-guide/sysctl/vm.rst
> > > @@ -1002,6 +1002,12 @@ that the number of free pages kswapd maintains for latency reasons is
> > >  too small for the allocation bursts occurring in the system. This knob
> > >  can then be used to tune kswapd aggressiveness accordingly.
> > >  
> > > +The watermark_scale_factor is an array. Each zone's watermark can be set
> > > +separately, and the current values can be seen by reading this file::
> > > +
> > > +       % cat /proc/sys/vm/watermark_scale_factor
> > > +       10      10      10      10
> > > +
> > >  
> > >  zone_reclaim_mode
> > >  =================
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 21f8b27bd9fd..b291c795f9db 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2532,7 +2532,7 @@ extern void setup_per_cpu_pageset(void);
> > >  /* page_alloc.c */
> > >  extern int min_free_kbytes;
> > >  extern int watermark_boost_factor;
> > > -extern int watermark_scale_factor;
> > > +extern int watermark_scale_factor[MAX_NR_ZONES];
> > >  extern bool arch_has_descending_max_zone_pfns(void);
> > >  
> > >  /* nommu.c */
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 205d605cacc5..d16d06c71e5a 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
> > >                 .maxlen         = sizeof(watermark_scale_factor),
> > >                 .mode           = 0644,
> > >                 .proc_handler   = watermark_scale_factor_sysctl_handler,
> > > -               .extra1         = SYSCTL_ONE,
> > > -               .extra2         = SYSCTL_THREE_THOUSAND,
> > >         },
> > >         {
> > >                 .procname       = "percpu_pagelist_high_fraction",
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index f1e4474879f1..7a6ac3b4ebb6 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -421,7 +421,6 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
> > >  int min_free_kbytes = 1024;
> > >  int user_min_free_kbytes = -1;
> > >  int watermark_boost_factor __read_mostly = 15000;
> > > -int watermark_scale_factor = 10;
> > >  
> > >  static unsigned long nr_kernel_pages __initdata;
> > >  static unsigned long nr_all_pages __initdata;
> > > @@ -449,6 +448,20 @@ EXPORT_SYMBOL(nr_online_nodes);
> > >  
> > >  int page_group_by_mobility_disabled __read_mostly;
> > >  
> > > +int watermark_scale_factor[MAX_NR_ZONES] = {
> > > +#ifdef CONFIG_ZONE_DMA
> > > +       [ZONE_DMA] = 10,
> > > +#endif
> > > +#ifdef CONFIG_ZONE_DMA32
> > > +       [ZONE_DMA32] = 10,
> > > +#endif
> > > +       [ZONE_NORMAL] = 10,
> > > +#ifdef CONFIG_HIGHMEM
> > > +       [ZONE_HIGHMEM] = 10,
> > > +#endif
> > > +       [ZONE_MOVABLE] = 10,
> > > +};
> > > +
> > >  #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > >  /*
> > >   * During boot we initialize deferred pages on-demand, as needed, but once
> > > @@ -8643,6 +8656,7 @@ static void __setup_per_zone_wmarks(void)
> > >         }
> > >  
> > >         for_each_zone(zone) {
> > > +               int zone_wm_factor;
> > >                 u64 tmp;
> > >  
> > >                 spin_lock_irqsave(&zone->lock, flags);
> > > @@ -8676,9 +8690,10 @@ static void __setup_per_zone_wmarks(void)
> > >                  * scale factor in proportion to available memory, but
> > >                  * ensure a minimum size on small systems.
> > >                  */
> > > +               zone_wm_factor = watermark_scale_factor[zone_idx(zone)];
> > >                 tmp = max_t(u64, tmp >> 2,
> > > -                           mult_frac(zone_managed_pages(zone),
> > > -                                     watermark_scale_factor, 10000));
> > > +                           mult_frac(zone_managed_pages(zone), zone_wm_factor,
> > > +                                     10000));
> > >  
> > >                 zone->watermark_boost = 0;
> > >                 zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
> > > @@ -8798,11 +8813,19 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
> > >  int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
> > >                 void *buffer, size_t *length, loff_t *ppos)
> > >  {
> > > -       int rc;
> > > +       int i;
> > >  
> > > -       rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
> > > -       if (rc)
> > > -               return rc;
> > > +       proc_dointvec_minmax(table, write, buffer, length, ppos);
> > > +
> > > +       /*
> > > +        * The unit is in fractions of 10,000. The default value of 10
> > > +        * means the distances between watermarks are 0.1% of the available
> > > +        * memory in the node/system. The maximum value is 3000, or 30% of
> > > +        * memory.
> > > +        */
> > > +       for (i = 0; i < MAX_NR_ZONES; i++)
> > > +               watermark_scale_factor[i] =
> > > +                       clamp(watermark_scale_factor[i], 1, 3000);
> > >  
> > >         if (write)
> > >                 setup_per_zone_wmarks();
> > 
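The watermark arithmetic in the __setup_per_zone_wmarks() hunk quoted above can be sketched as a small Python model: the gap between the min, low, and high watermarks is the larger of min/4 and managed_pages * scale_factor / 10000. The zone sizes and factors below are illustrative inputs, not measurements from the patch author's machine:

```python
def zone_watermarks(managed_pages: int, min_wmark: int, scale_factor: int) -> dict:
    """Model of the low/high watermark computation in __setup_per_zone_wmarks().

    scale_factor is in fractions of 10,000 (the sysctl default of 10
    means a gap of 0.1% of the zone's managed pages).
    """
    # Gap between successive watermarks: at least min/4, but scaled up
    # in proportion to the zone's managed memory by the factor.
    gap = max(min_wmark >> 2, managed_pages * scale_factor // 10000)
    return {"min": min_wmark,
            "low": min_wmark + gap,
            "high": min_wmark + 2 * gap}

# A hypothetical movable zone of 24,000,000 4 KiB pages (~91 GiB):
# the default factor of 10 vs. a factor of 100.
print(zone_watermarks(24_000_000, 1455, 10))
print(zone_watermarks(24_000_000, 1455, 100))
```

This illustrates the thread's point: on a large movable zone, raising the single global factor inflates the gap tenfold, while a per-zone array lets the small kernel zones be tuned without touching the movable zone.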



end of thread, other threads:[~2022-09-14 22:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-05  3:28 [PATCH -next v3 0/2] watermark related improvement on zone movable Wupeng Ma
2022-09-05  3:28 ` [PATCH -next v3 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
2022-09-05  9:26   ` Mel Gorman
2022-09-06 10:12     ` mawupeng
2022-09-06 12:22       ` Mel Gorman
2022-09-07  8:42         ` mawupeng
2022-09-05  3:28 ` [PATCH -next v3 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
2022-09-05  3:45   ` Matthew Wilcox
2022-09-05  6:39     ` mawupeng
2022-09-06 18:23   ` Luis Chamberlain
2022-09-07  3:29     ` mawupeng
2022-09-09 21:41   ` Khalid Aziz
2022-09-13  2:09     ` mawupeng
2022-09-14 22:42       ` Khalid Aziz
