All of lore.kernel.org
* [PATCH v2 0/2] watermark related improvement on zone movable
@ 2022-08-19  9:30 Wupeng Ma
  2022-08-19  9:30 ` [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Wupeng Ma @ 2022-08-19  9:30 UTC
  To: akpm
  Cc: corbet, mcgrof, keescook, yzaikin, songmuchun, mike.kravetz,
	osalvador, rppt, surenb, mawupeng1, jsavitz, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel, wangkefeng.wang

From: Ma Wupeng <mawupeng1@huawei.com>

The first patch caps zone movable's min watermark to a small value, since
no one can use it.

The second patch introduces a per-zone watermark_scale_factor to replace
the vanilla global one. This makes it possible to tune each zone's
watermark separately and leads to a more efficient kswapd.

Details of each patch can be found in its own changelog.

changelog since v1:
- fix compile error if CONFIG_SYSCTL is not enabled
- remove useless function comment

Ma Wupeng (2):
  mm: Cap zone movable's min wmark to small value
  mm: sysctl: Introduce per zone watermark_scale_factor

 Documentation/admin-guide/sysctl/vm.rst |  6 ++++
 include/linux/mm.h                      |  2 +-
 kernel/sysctl.c                         |  2 --
 mm/page_alloc.c                         | 41 +++++++++++++++++++------
 4 files changed, 39 insertions(+), 12 deletions(-)

-- 
2.25.1



* [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value
  2022-08-19  9:30 [PATCH v2 0/2] watermark related improvement on zone movable Wupeng Ma
@ 2022-08-19  9:30 ` Wupeng Ma
  2022-08-24  8:10   ` David Hildenbrand
  2022-08-19  9:30 ` [PATCH v2 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
  2022-08-24  7:27 ` [PATCH v2 0/2] watermark related improvement on zone movable mawupeng
  2 siblings, 1 reply; 6+ messages in thread
From: Wupeng Ma @ 2022-08-19  9:30 UTC
  To: akpm
  Cc: corbet, mcgrof, keescook, yzaikin, songmuchun, mike.kravetz,
	osalvador, rppt, surenb, mawupeng1, jsavitz, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel, wangkefeng.wang

From: Ma Wupeng <mawupeng1@huawei.com>

min_free_kbytes is based on gfp_zone(GFP_USER), which does not include
zone movable. However, zone movable still gets its share of the min
watermark in __setup_per_zone_wmarks(), which does not make sense.

Like highmem pages, movable pages are usually not needed by __GFP_HIGH
and PF_MEMALLOC allocations, so there is no need to assign min pages to
zone movable.

Let's cap pages_min for zone movable to a small value, just as is done
for highmem pages.
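
As a rough illustration of the accounting described above, here is a
userspace sketch in C, not kernel code. The zone_model type and the helper
names are hypothetical. The cap itself (managed pages / 1024, bounded to
[32, 128] pages) is an assumption modeled on the existing highmem branch;
it is not visible in this patch's context lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct zone; not a kernel type. */
enum zone_kind { KIND_LOWMEM, KIND_HIGHMEM, KIND_MOVABLE };

struct zone_model {
	enum zone_kind kind;
	uint64_t managed_pages;
};

/* Zones whose min watermark is capped instead of taking a share. */
static int min_is_capped(const struct zone_model *z)
{
	return z->kind == KIND_HIGHMEM || z->kind == KIND_MOVABLE;
}

/* Sum of managed pages over the zones that share pages_min. */
static uint64_t lowmem_pages(const struct zone_model *zones, size_t n)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (!min_is_capped(&zones[i]))
			total += zones[i].managed_pages;
	return total;
}

/* Min watermark of one zone, in pages. */
static uint64_t zone_min_pages(const struct zone_model *z, uint64_t pages_min,
			       uint64_t lowmem)
{
	if (min_is_capped(z)) {
		/* Assumed cap: managed / 1024, bounded to [32, 128] pages. */
		uint64_t capped = z->managed_pages / 1024;

		if (capped < 32)
			capped = 32;
		if (capped > 128)
			capped = 128;
		return capped;
	}
	assert(lowmem > 0);
	/* Proportional share of pages_min among the uncapped zones. */
	return pages_min * z->managed_pages / lowmem;
}
```

Under these assumptions a movable zone no longer dilutes lowmem_pages, so
the non-movable zones share all of pages_min, while the movable zone's min
watermark stays within a small fixed bound.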

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
---
 mm/page_alloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..ff644205370f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8638,7 +8638,7 @@ static void __setup_per_zone_wmarks(void)
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
-		if (!is_highmem(zone))
+		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
 			lowmem_pages += zone_managed_pages(zone);
 	}
 
@@ -8648,7 +8648,7 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone_managed_pages(zone);
 		do_div(tmp, lowmem_pages);
-		if (is_highmem(zone)) {
+		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
 			 * need highmem pages, so cap pages_min to a small
-- 
2.25.1



* [PATCH v2 2/2] mm: sysctl: Introduce per zone watermark_scale_factor
  2022-08-19  9:30 [PATCH v2 0/2] watermark related improvement on zone movable Wupeng Ma
  2022-08-19  9:30 ` [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
@ 2022-08-19  9:30 ` Wupeng Ma
  2022-08-24  7:27 ` [PATCH v2 0/2] watermark related improvement on zone movable mawupeng
  2 siblings, 0 replies; 6+ messages in thread
From: Wupeng Ma @ 2022-08-19  9:30 UTC
  To: akpm
  Cc: corbet, mcgrof, keescook, yzaikin, songmuchun, mike.kravetz,
	osalvador, rppt, surenb, mawupeng1, jsavitz, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel, wangkefeng.wang,
	kernel test robot

From: Ma Wupeng <mawupeng1@huawei.com>

A system may have little normal zone memory and a huge amount of movable
memory in the following situations:
  - On a system booted with kernelcore=nn% or kernelcore=mirror, a movable
    zone is added, and in most cases it is bigger than the normal zone.
  - On a system with movable nodes, multiple NUMA nodes contain only a
    movable zone, and that zone has plenty of memory.

Since the kernel and drivers can only use memory from non-movable zones in
most cases, the normal zone needs a higher watermark to reserve more
memory.

However, the current watermark_scale_factor controls all zones at once and
cannot be set per zone. To reserve memory in non-movable zones, the
watermark has to be increased in movable zones as well, which leads to an
inefficient kswapd.

To solve this problem, a per-zone watermark_scale_factor is introduced to
tune each zone's watermark separately. This brings the following
advantages:
  - Each zone can set its own watermark, which brings flexibility.
  - kswapd becomes more efficient if the watermarks are tuned well.
Here is real watermark data from my qemu machine (with THP disabled).

With watermark_scale_factor = 10, only 1440 (772-68+807-71) pages (5.76M)
are reserved in the DMA and Normal zones of a system with 96G of memory.
However, if watermark_scale_factor is set to 100, the movable zones'
reserve increases to 231908 pages (about 927M), which is too much.
The situation is even worse with 32G of normal zone memory and 1T of
movable zone memory.

       Modified        | Vanilla wm_factor = 10 | Vanilla wm_factor = 100
Node 0, zone      DMA  | Node 0, zone      DMA  | Node 0, zone      DMA
        min      68    |         min      68    |         min      68
        low      7113  |         low      772   |         low      7113
        high **14158** |         high **1476**  |         high **14158**
Node 0, zone   Normal  | Node 0, zone   Normal  | Node 0, zone   Normal
        min      71    |         min      71    |         min      71
        low      7438  |         low      807   |         low      7438
        high     14805 |         high     1543  |         high     14805
Node 0, zone  Movable  | Node 0, zone  Movable  | Node 0, zone  Movable
        min      1455  |         min      1455  |         min      1455
        low      16388 |         low      16386 |         low      150787
        high **31321** |         high **31317** |         high **300119**
Node 1, zone  Movable  | Node 1, zone  Movable  | Node 1, zone  Movable
        min      804   |         min      804   |         min      804
        low      9061  |         low      9061  |         low      83379
        high **17318** |         high **17318** |         high **165954**
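
The DMA rows above can be reproduced with a small userspace sketch of the
low/high calculation. This is only a model of the formula in
__setup_per_zone_wmarks(): the names are hypothetical, the quarter-of-min
floor is applied to the min watermark itself (a simplification), and the
managed page count of 704500 used in the usage note is back-derived from
the table, not taken from the machine.

```c
#include <assert.h>
#include <stdint.h>

struct wmarks {
	uint64_t min;
	uint64_t low;
	uint64_t high;
};

/* x * numer / denom without overflowing the intermediate product. */
static uint64_t mult_frac_u64(uint64_t x, uint64_t numer, uint64_t denom)
{
	uint64_t quot = x / denom;
	uint64_t rem = x % denom;

	return quot * numer + rem * numer / denom;
}

/* Derive low/high from min, zone size and a scale factor in 1/10000 units. */
static struct wmarks zone_wmarks(uint64_t min_pages, uint64_t managed_pages,
				 unsigned int scale_factor)
{
	struct wmarks w;
	uint64_t gap = mult_frac_u64(managed_pages, scale_factor, 10000);

	assert(scale_factor >= 1 && scale_factor <= 3000);
	/* Never let the gap fall below a quarter of the min watermark. */
	if (gap < min_pages / 4)
		gap = min_pages / 4;

	w.min = min_pages;
	w.low = min_pages + gap;
	w.high = min_pages + 2 * gap;
	return w;
}
```

With min = 68 and an assumed 704500 managed pages, a factor of 10 gives
low = 772 and high = 1476, and a factor of 100 gives low = 7113 and
high = 14158, matching the DMA columns.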

With the modified per-zone watermark_scale_factor, the following command
raises the watermarks of only the DMA/Normal zones, while the huge movable
zones stay the same.

  % echo 100 100 100 10 > /proc/sys/vm/watermark_scale_factor

THP is disabled because khugepaged_min_free_kbytes_update() would
otherwise update the min watermark.
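
To make the semantics of the multi-value write concrete, here is a
userspace sketch of parsing such a line and clamping each entry to the
range [1, 3000]. It is not the sysctl handler: parse_scale_factors() and
NR_ZONES_MODEL are hypothetical names, and the zone count of 4 is an
assumption chosen to match the four-value example above.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_ZONES_MODEL 4	/* hypothetical zone count for this sketch */

/*
 * Parse up to NR_ZONES_MODEL whitespace-separated integers from buf into
 * factors[], clamping each to [1, 3000]. Entries not present in buf keep
 * their previous value. Returns the number of entries written.
 */
static int parse_scale_factors(const char *buf, int factors[NR_ZONES_MODEL])
{
	int n = 0;

	assert(buf && factors);
	while (n < NR_ZONES_MODEL) {
		char *end;
		long v = strtol(buf, &end, 10);

		if (end == buf)		/* no more numbers */
			break;
		if (v < 1)
			v = 1;
		if (v > 3000)
			v = 3000;
		factors[n++] = (int)v;
		buf = end;
	}
	return n;
}
```

In this model, parsing "100 100 100 10" over an array of defaults yields
{100, 100, 100, 10}, and out-of-range values such as 0 or 5000 are pulled
back to 1 and 3000 respectively.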

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  6 ++++
 include/linux/mm.h                      |  2 +-
 kernel/sysctl.c                         |  2 --
 mm/page_alloc.c                         | 37 ++++++++++++++++++++-----
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 9b833e439f09..ec240aa45322 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -1002,6 +1002,12 @@ that the number of free pages kswapd maintains for latency reasons is
 too small for the allocation bursts occurring in the system. This knob
 can then be used to tune kswapd aggressiveness accordingly.
 
+The watermark_scale_factor is an array. You can set each zone's watermark
+separately and can be seen by reading this file::
+
+	% cat /proc/sys/vm/watermark_scale_factor
+	10	10	10	10
+
 
 zone_reclaim_mode
 =================
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3bedc449c14d..7f1eba1541f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2525,7 +2525,7 @@ extern void setup_per_cpu_pageset(void);
 /* page_alloc.c */
 extern int min_free_kbytes;
 extern int watermark_boost_factor;
-extern int watermark_scale_factor;
+extern int watermark_scale_factor[MAX_NR_ZONES];
 extern bool arch_has_descending_max_zone_pfns(void);
 
 /* nommu.c */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 205d605cacc5..d16d06c71e5a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2251,8 +2251,6 @@ static struct ctl_table vm_table[] = {
 		.maxlen		= sizeof(watermark_scale_factor),
 		.mode		= 0644,
 		.proc_handler	= watermark_scale_factor_sysctl_handler,
-		.extra1		= SYSCTL_ONE,
-		.extra2		= SYSCTL_THREE_THOUSAND,
 	},
 	{
 		.procname	= "percpu_pagelist_high_fraction",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff644205370f..21459256dab6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -421,7 +421,6 @@ compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 int watermark_boost_factor __read_mostly = 15000;
-int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __initdata;
 static unsigned long nr_all_pages __initdata;
@@ -449,6 +448,20 @@ EXPORT_SYMBOL(nr_online_nodes);
 
 int page_group_by_mobility_disabled __read_mostly;
 
+int watermark_scale_factor[MAX_NR_ZONES] = {
+#ifdef CONFIG_ZONE_DMA
+	[ZONE_DMA] = 10,
+#endif
+#ifdef CONFIG_ZONE_DMA32
+	[ZONE_DMA32] = 10,
+#endif
+	[ZONE_NORMAL] = 10,
+#ifdef CONFIG_HIGHMEM
+	[ZONE_HIGHMEM] = 10,
+#endif
+	[ZONE_MOVABLE] = 10,
+};
+
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 /*
  * During boot we initialize deferred pages on-demand, as needed, but once
@@ -8643,6 +8656,7 @@ static void __setup_per_zone_wmarks(void)
 	}
 
 	for_each_zone(zone) {
+		int zone_wm_factor;
 		u64 tmp;
 
 		spin_lock_irqsave(&zone->lock, flags);
@@ -8676,9 +8690,10 @@ static void __setup_per_zone_wmarks(void)
 		 * scale factor in proportion to available memory, but
 		 * ensure a minimum size on small systems.
 		 */
+		zone_wm_factor = watermark_scale_factor[zone_idx(zone)];
 		tmp = max_t(u64, tmp >> 2,
-			    mult_frac(zone_managed_pages(zone),
-				      watermark_scale_factor, 10000));
+			    mult_frac(zone_managed_pages(zone), zone_wm_factor,
+				      10000));
 
 		zone->watermark_boost = 0;
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
@@ -8798,11 +8813,19 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *length, loff_t *ppos)
 {
-	int rc;
+	int i;
 
-	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
-	if (rc)
-		return rc;
+	proc_dointvec_minmax(table, write, buffer, length, ppos);
+
+	/*
+	 * The unit is in fractions of 10,000. The default value of 10
+	 * means the distances between watermarks are 0.1% of the available
+	 * memory in the node/system. The maximum value is 3000, or 30% of
+	 * memory.
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++)
+		watermark_scale_factor[i] =
+			clamp(watermark_scale_factor[i], 1, 3000);
 
 	if (write)
 		setup_per_zone_wmarks();
-- 
2.25.1



* Re: [PATCH v2 0/2] watermark related improvement on zone movable
  2022-08-19  9:30 [PATCH v2 0/2] watermark related improvement on zone movable Wupeng Ma
  2022-08-19  9:30 ` [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
  2022-08-19  9:30 ` [PATCH v2 2/2] mm: sysctl: Introduce per zone watermark_scale_factor Wupeng Ma
@ 2022-08-24  7:27 ` mawupeng
  2 siblings, 0 replies; 6+ messages in thread
From: mawupeng @ 2022-08-24  7:27 UTC
  To: akpm
  Cc: mawupeng1, corbet, mcgrof, keescook, yzaikin, songmuchun,
	mike.kravetz, osalvador, rppt, surenb, jsavitz, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel, wangkefeng.wang

Hi, maintainers, kindly ping...

Thanks.
Ma.

On 2022/8/19 17:30, Wupeng Ma wrote:
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> The first patch caps zone movable's min watermark to a small value, since
> no one can use it.
> 
> The second patch introduces a per-zone watermark_scale_factor to replace
> the vanilla global one. This makes it possible to tune each zone's
> watermark separately and leads to a more efficient kswapd.
> 
> Details of each patch can be found in its own changelog.
> 
> changelog since v1:
> - fix compile error if CONFIG_SYSCTL is not enabled
> - remove useless function comment
> 
> Ma Wupeng (2):
>   mm: Cap zone movable's min wmark to small value
>   mm: sysctl: Introduce per zone watermark_scale_factor
> 
>  Documentation/admin-guide/sysctl/vm.rst |  6 ++++
>  include/linux/mm.h                      |  2 +-
>  kernel/sysctl.c                         |  2 --
>  mm/page_alloc.c                         | 41 +++++++++++++++++++------
>  4 files changed, 39 insertions(+), 12 deletions(-)
> 


* Re: [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value
  2022-08-19  9:30 ` [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value Wupeng Ma
@ 2022-08-24  8:10   ` David Hildenbrand
  2022-08-25  0:49     ` mawupeng
  0 siblings, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2022-08-24  8:10 UTC
  To: Wupeng Ma, akpm
  Cc: corbet, mcgrof, keescook, yzaikin, songmuchun, mike.kravetz,
	osalvador, rppt, surenb, jsavitz, linux-doc, linux-kernel,
	linux-mm, linux-fsdevel, wangkefeng.wang

On 19.08.22 11:30, Wupeng Ma wrote:
> From: Ma Wupeng <mawupeng1@huawei.com>
> 
> min_free_kbytes is based on gfp_zone(GFP_USER), which does not include
> zone movable. However, zone movable still gets its share of the min
> watermark in __setup_per_zone_wmarks(), which does not make sense.
> 
> Like highmem pages, movable pages are usually not needed by __GFP_HIGH
> and PF_MEMALLOC allocations, so there is no need to assign min pages to
> zone movable.
> 
> Let's cap pages_min for zone movable to a small value, just as is done
> for highmem pages.
> 
> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
>  mm/page_alloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..ff644205370f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8638,7 +8638,7 @@ static void __setup_per_zone_wmarks(void)
>  
>  	/* Calculate total number of !ZONE_HIGHMEM pages */
>  	for_each_zone(zone) {
> -		if (!is_highmem(zone))
> +		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
>  			lowmem_pages += zone_managed_pages(zone);
>  	}
>  
> @@ -8648,7 +8648,7 @@ static void __setup_per_zone_wmarks(void)
>  		spin_lock_irqsave(&zone->lock, flags);
>  		tmp = (u64)pages_min * zone_managed_pages(zone);
>  		do_div(tmp, lowmem_pages);
> -		if (is_highmem(zone)) {
> +		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
>  			/*
>  			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
>  			 * need highmem pages, so cap pages_min to a small

This kind of makes sense to me, but I'm not completely sure about all the
implications. We most certainly should update the comment as well.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 1/2] mm: Cap zone movable's min wmark to small value
  2022-08-24  8:10   ` David Hildenbrand
@ 2022-08-25  0:49     ` mawupeng
  0 siblings, 0 replies; 6+ messages in thread
From: mawupeng @ 2022-08-25  0:49 UTC
  To: david, akpm
  Cc: mawupeng1, corbet, mcgrof, keescook, yzaikin, songmuchun,
	mike.kravetz, osalvador, rppt, surenb, jsavitz, linux-doc,
	linux-kernel, linux-mm, linux-fsdevel, wangkefeng.wang



On 2022/8/24 16:10, David Hildenbrand wrote:
> On 19.08.22 11:30, Wupeng Ma wrote:
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> min_free_kbytes is based on gfp_zone(GFP_USER), which does not include
>> zone movable. However, zone movable still gets its share of the min
>> watermark in __setup_per_zone_wmarks(), which does not make sense.
>>
>> Like highmem pages, movable pages are usually not needed by __GFP_HIGH
>> and PF_MEMALLOC allocations, so there is no need to assign min pages to
>> zone movable.
>>
>> Let's cap pages_min for zone movable to a small value, just as is done
>> for highmem pages.
>>
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>>  mm/page_alloc.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index e5486d47406e..ff644205370f 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -8638,7 +8638,7 @@ static void __setup_per_zone_wmarks(void)
>>  
>>  	/* Calculate total number of !ZONE_HIGHMEM pages */
>>  	for_each_zone(zone) {
>> -		if (!is_highmem(zone))
>> +		if (!is_highmem(zone) && zone_idx(zone) != ZONE_MOVABLE)
>>  			lowmem_pages += zone_managed_pages(zone);
>>  	}
>>  
>> @@ -8648,7 +8648,7 @@ static void __setup_per_zone_wmarks(void)
>>  		spin_lock_irqsave(&zone->lock, flags);
>>  		tmp = (u64)pages_min * zone_managed_pages(zone);
>>  		do_div(tmp, lowmem_pages);
>> -		if (is_highmem(zone)) {
>> +		if (is_highmem(zone) || zone_idx(zone) == ZONE_MOVABLE) {
>>  			/*
>>  			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
>>  			 * need highmem pages, so cap pages_min to a small
> 
> This kind of makes sense to me, but I'm not completely sure about all the
> implications. We most certainly should update the comment as well.

Yes, we should certainly do this.

Thanks for reviewing.

> 
