linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v3] mm/compaction:let proactive compaction order configurable
@ 2021-04-25  1:21 chukaiping
  2021-04-26  1:15 ` David Rientjes
  2021-04-26  1:31 ` Rafael Aquini
  0 siblings, 2 replies; 12+ messages in thread
From: chukaiping @ 2021-04-25  1:21 UTC (permalink / raw)
  To: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def
  Cc: linux-kernel, linux-fsdevel, linux-mm

Currently the proactive compaction order is fixed to
COMPACTION_HPAGE_ORDER (9). That is fine on most machines with plenty of
normal 4KB memory, but it is too high for machines with little normal
memory, for example machines with most of their memory configured as
1GB hugetlbfs huge pages. On these machines the maximum order of free
pages is often below 9, and it stays below 9 even after hard
compaction. As a result, proactive compaction is triggered very
frequently. On these machines we only care about order 3 or 4.
This patch exports the order to proc and makes it configurable by the
user; the default value is still COMPACTION_HPAGE_ORDER.

Signed-off-by: chukaiping <chukaiping@baidu.com>
Reported-by: kernel test robot <lkp@intel.com>
---

Changes in v3:
    - change the min value of compaction_order to 1 because the fragmentation
      index of order 0 is always 0
    - move the definition of max_buddy_zone into #ifdef CONFIG_COMPACTION

Changes in v2:
    - fix the compile error in ia64 and powerpc, move the initialization
      of sysctl_compaction_order to kcompactd_init because 
      COMPACTION_HPAGE_ORDER is a variable in these architectures
    - change the hard coded max order number from 10 to MAX_ORDER - 1

 include/linux/compaction.h |    1 +
 kernel/sysctl.c            |   10 ++++++++++
 mm/compaction.c            |    9 ++++++---
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index ed4070e..151ccd1 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order)
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern unsigned int sysctl_compaction_proactiveness;
+extern unsigned int sysctl_compaction_order;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void *buffer, size_t *length, loff_t *ppos);
 extern int sysctl_extfrag_threshold;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62fbd09..e50f7d2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -196,6 +196,7 @@ enum sysctl_writes_mode {
 #endif /* CONFIG_SCHED_DEBUG */
 
 #ifdef CONFIG_COMPACTION
+static int max_buddy_zone = MAX_ORDER - 1;
 static int min_extfrag_threshold;
 static int max_extfrag_threshold = 1000;
 #endif
@@ -2871,6 +2872,15 @@ int proc_do_static_key(struct ctl_table *table, int write,
 		.extra2		= &one_hundred,
 	},
 	{
+		.procname       = "compaction_order",
+		.data           = &sysctl_compaction_order,
+		.maxlen         = sizeof(sysctl_compaction_order),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec_minmax,
+		.extra1         = SYSCTL_ONE,
+		.extra2         = &max_buddy_zone,
+	},
+	{
 		.procname	= "extfrag_threshold",
 		.data		= &sysctl_extfrag_threshold,
 		.maxlen		= sizeof(int),
diff --git a/mm/compaction.c b/mm/compaction.c
index e04f447..70c0acd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
 
 /*
  * A zone's fragmentation score is the external fragmentation wrt to the
- * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
+ * sysctl_compaction_order. It returns a value in the range [0, 100].
  */
 static unsigned int fragmentation_score_zone(struct zone *zone)
 {
-	return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+	return extfrag_for_order(zone, sysctl_compaction_order);
 }
 
 /*
  * A weighted zone's fragmentation score is the external fragmentation
- * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
+ * wrt to the sysctl_compaction_order scaled by the zone's size. It
  * returns a value in the range [0, 100].
  *
  * The scaling factor ensures that proactive compaction focuses on larger
@@ -2666,6 +2666,7 @@ static void compact_nodes(void)
  * background. It takes values in the range [0, 100].
  */
 unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
+unsigned int __read_mostly sysctl_compaction_order;
 
 /*
  * This is the entry point for compacting all nodes via
@@ -2958,6 +2959,8 @@ static int __init kcompactd_init(void)
 	int nid;
 	int ret;
 
+	sysctl_compaction_order = COMPACTION_HPAGE_ORDER;
+
 	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
 					"mm/compaction:online",
 					kcompactd_cpu_online, NULL);
-- 
1.7.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-25  1:21 [PATCH v3] mm/compaction:let proactive compaction order configurable chukaiping
@ 2021-04-26  1:15 ` David Rientjes
  2021-04-26  1:29   ` Re: " Chu,Kaiping
  2021-05-06 21:27   ` Khalid Aziz
  2021-04-26  1:31 ` Rafael Aquini
  1 sibling, 2 replies; 12+ messages in thread
From: David Rientjes @ 2021-04-26  1:15 UTC (permalink / raw)
  To: chukaiping
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

On Sun, 25 Apr 2021, chukaiping wrote:

> Currently the proactive compaction order is fixed to
> COMPACTION_HPAGE_ORDER (9). That is fine on most machines with plenty of
> normal 4KB memory, but it is too high for machines with little normal
> memory, for example machines with most of their memory configured as
> 1GB hugetlbfs huge pages. On these machines the maximum order of free
> pages is often below 9, and it stays below 9 even after hard
> compaction. As a result, proactive compaction is triggered very
> frequently. On these machines we only care about order 3 or 4.
> This patch exports the order to proc and makes it configurable by the
> user; the default value is still COMPACTION_HPAGE_ORDER.
> 

As asked in the review of the v1 of the patch, why is this not a userspace 
policy decision?  If you are interested in order-3 or order-4 
fragmentation, for whatever reason, you could periodically check 
/proc/buddyinfo and manually invoke compaction on the system.

In other words, why does this need to live in the kernel?
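
As a rough illustration of the userspace approach suggested above, the sketch below computes an external fragmentation score for a chosen order from per-order free block counts, such as one zone's row in /proc/buddyinfo. The formula follows what extfrag_for_order() is understood to compute in kernels of this era; that correspondence, the helper name extfrag_score and the NR_ORDERS constant are assumptions made for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed number of buddy orders (MAX_ORDER is 11 on many configs). */
#define NR_ORDERS 11

/*
 * Hypothetical helper: percentage (0..100) of free memory that sits in
 * blocks smaller than 2^order pages. nr_free[i] is the number of free
 * blocks of order i, as listed per zone in /proc/buddyinfo.
 */
unsigned int extfrag_score(const unsigned long nr_free[NR_ORDERS],
			   unsigned int order)
{
	unsigned long long free_pages = 0;
	unsigned long long suitable_pages = 0;
	unsigned int i;

	assert(nr_free != NULL);

	for (i = 0; i < NR_ORDERS; i++) {
		/* A free block of order i holds 2^i pages. */
		unsigned long long pages = (unsigned long long)nr_free[i] << i;

		free_pages += pages;
		if (i >= order)
			suitable_pages += pages;
	}

	if (free_pages == 0)
		return 0;

	return (unsigned int)((free_pages - suitable_pages) * 100 / free_pages);
}
```

Under these assumptions, a zone whose free blocks are all below order 4 scores 100 at order 9 no matter how well it is compacted, while order 3 gives a graded value. Order 0 always scores 0, which is consistent with the v3 changelog note about the minimum value.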


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-26  1:15 ` David Rientjes
@ 2021-04-26  1:29   ` Chu,Kaiping
  2021-04-26  1:48     ` David Rientjes
  2021-05-06 21:27   ` Khalid Aziz
  1 sibling, 1 reply; 12+ messages in thread
From: Chu,Kaiping @ 2021-04-26  1:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

Hi Rientjes,
I already answered your question in 4.19.
"We turn off transparent huge pages on our machines, so we don't care about order 9.
There are many user space applications, and different applications may allocate memory of different orders, so we can't know the "known order of interest" in advance. Our purpose is to keep the overall fragmentation index as low as possible, not to target a specific order.
Although the current proactive compaction mechanism only checks the fragmentation index of a specific order, it can compact memory for all orders (.order = -1 in proactive_compact_node), so it's still useful for us.
We set compaction_order according to the average fragmentation index across all our machines. It is an empirical value, a compromise between keeping the memory fragmentation index low and not triggering background compaction too often, and it can be changed in the future.
We previously ran memory compaction periodically with the command "echo 1 > /proc/sys/vm/compact_memory", but that is not good enough: it compacts all memory forcibly, which may move a lot of memory in a short time and affect application performance."
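
To make the trade-off between low fragmentation and infrequent background compaction more concrete: kernels of this era are understood to derive two score thresholds from vm.compaction_proactiveness and to start background compaction once a node's score exceeds the upper one. The sketch below reproduces that arithmetic in plain C. The function names and the exact formulas are assumptions based on a reading of mm/compaction.c; they are not stated anywhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed lower threshold: max(100 - proactiveness, 5). */
unsigned int score_wmark_low(unsigned int proactiveness)
{
	unsigned int low;

	assert(proactiveness <= 100);
	low = 100 - proactiveness;
	return low > 5 ? low : 5;
}

/* Assumed upper threshold: min(low + 10, 100). */
unsigned int score_wmark_high(unsigned int proactiveness)
{
	unsigned int high = score_wmark_low(proactiveness) + 10;

	return high < 100 ? high : 100;
}

/* Background compaction is assumed to start when score > upper threshold. */
bool should_start_compaction(unsigned int score, unsigned int proactiveness)
{
	return score > score_wmark_high(proactiveness);
}
```

With the default proactiveness of 20 these helpers give thresholds of 80 and 90. A node whose score at order 9 is pinned near 100 would then exceed the upper threshold after every pass, whereas a score measured at a lower order has room to fall below it.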


BR,
Chu Kaiping


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-25  1:21 [PATCH v3] mm/compaction:let proactive compaction order configurable chukaiping
  2021-04-26  1:15 ` David Rientjes
@ 2021-04-26  1:31 ` Rafael Aquini
  2021-04-28  1:17   ` Re: " Chu,Kaiping
  1 sibling, 1 reply; 12+ messages in thread
From: Rafael Aquini @ 2021-04-26  1:31 UTC (permalink / raw)
  To: chukaiping
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

On Sun, Apr 25, 2021 at 09:21:02AM +0800, chukaiping wrote:
> Currently the proactive compaction order is fixed to
> COMPACTION_HPAGE_ORDER (9). That is fine on most machines with plenty of
> normal 4KB memory, but it is too high for machines with little normal
> memory, for example machines with most of their memory configured as
> 1GB hugetlbfs huge pages. On these machines the maximum order of free
> pages is often below 9, and it stays below 9 even after hard
> compaction. As a result, proactive compaction is triggered very
> frequently. On these machines we only care about order 3 or 4.
> This patch exports the order to proc and makes it configurable by the
> user; the default value is still COMPACTION_HPAGE_ORDER.
> 
> Signed-off-by: chukaiping <chukaiping@baidu.com>
> Reported-by: kernel test robot <lkp@intel.com>

Two minor nits on the commit log message: 
* there seems to be a whitespace missing in your short log: 
  "... mm/compaction:let ..."

* has the patch really been reported by a test robot?


A note on the sysctl name: I'd suggest that it should reflect 
the fact that we're adjusting the order for proactive compaction.
How about "proactive_compaction_order"?

Cheers,



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-26  1:29   ` Re: " Chu,Kaiping
@ 2021-04-26  1:48     ` David Rientjes
  2021-04-28  1:38       ` Re: " Chu,Kaiping
  0 siblings, 1 reply; 12+ messages in thread
From: David Rientjes @ 2021-04-26  1:48 UTC (permalink / raw)
  To: Chu,Kaiping
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

On Mon, 26 Apr 2021, Chu,Kaiping wrote:

> Hi Rientjes,
> I already answered your question in 4.19.
> "We turn off transparent huge pages on our machines, so we don't care about order 9.
> There are many user space applications, and different applications may allocate memory of different orders, so we can't know the "known order of interest" in advance. Our purpose is to keep the overall fragmentation index as low as possible, not to target a specific order.

Ok, so you don't care about a specific order but you are adding a 
vm.compaction_order sysctl?

I think what you're trying to do is invoke full compaction (cc.order = -1) 
at some point in time that will (1) keep node-wide fragmentation low over 
the long run and (2) be relatively lightweight at the time it is done.

I can certainly understand (1) on your configuration that is mostly 
consumed by 1GB gigantic pages, you are likely dealing with significant 
memory pressure that causes fragmentation to increase over time and 
eventually become unrecoverable for the most part.

And for (2), yes, using vm.compact_memory will become very heavyweight if 
it's done too late.

So since proactive compaction uses cc.order = -1, same as 
vm.compact_memory, it should be possible to monitor extfrag_index under 
debugfs and manually trigger compaction when necessary without 
intervention of the kernel.

I think we can both agree that we wouldn't want to add obscure and 
undocumented sysctls that can easily be replaced by a userspace 
implementation.
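
A minimal sketch of the userspace building blocks such a monitor might use is shown below. For simplicity it reads per-order free counts from a /proc/buddyinfo row rather than extfrag_index under debugfs, and it requests compaction by writing to /proc/sys/vm/compact_memory, the same knob mentioned in this thread. The function names, the NR_ORDERS constant and the assumed row layout are illustrative assumptions, not an existing tool.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_ORDERS 11	/* assumed MAX_ORDER; adjust to the running kernel */

/*
 * Hypothetical parser for one /proc/buddyinfo row, e.g.
 * "Node 0, zone   Normal   12 7 3 1 0 0 0 0 0 0 0".
 * Returns the number of per-order counts stored in nr_free, or -1 if the
 * row header does not match the assumed layout.
 */
int parse_buddyinfo_line(const char *line, unsigned long nr_free[NR_ORDERS])
{
	int node, offset = 0, n = 0;
	char zone[32];
	const char *p;
	char *end;

	assert(line != NULL && nr_free != NULL);

	if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &offset) != 2)
		return -1;

	for (p = line + offset; n < NR_ORDERS; p = end) {
		unsigned long v = strtoul(p, &end, 10);

		if (end == p)
			break;	/* no more numbers on this row */
		nr_free[n++] = v;
	}
	return n;
}

/* Request full compaction, as "echo 1 > /proc/sys/vm/compact_memory" does. */
int trigger_compaction(void)
{
	FILE *f = fopen("/proc/sys/vm/compact_memory", "w");

	if (!f)
		return -1;	/* typically needs root and CONFIG_COMPACTION */
	fputs("1\n", f);
	return fclose(f) == 0 ? 0 : -1;
}
```

A monitor built on these pieces would parse each zone's row at some interval, derive a score for the order it cares about, and call trigger_compaction() only when that score crosses a site-specific threshold, so the policy stays entirely in userspace.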

> Although the current proactive compaction mechanism only checks the fragmentation index of a specific order, it can compact memory for all orders (.order = -1 in proactive_compact_node), so it's still useful for us.
> We set compaction_order according to the average fragmentation index across all our machines. It is an empirical value, a compromise between keeping the memory fragmentation index low and not triggering background compaction too often, and it can be changed in the future.
> We previously ran memory compaction periodically with the command "echo 1 > /proc/sys/vm/compact_memory", but that is not good enough: it compacts all memory forcibly, which may move a lot of memory in a short time and affect application performance."
> 
> 
> BR,
> Chu Kaiping
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-26  1:31 ` Rafael Aquini
@ 2021-04-28  1:17   ` Chu,Kaiping
  2021-04-29 19:45     ` Rafael Aquini
  0 siblings, 1 reply; 12+ messages in thread
From: Chu,Kaiping @ 2021-04-28  1:17 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

Please see my answer inline.

-----Original Message-----
From: Rafael Aquini <aquini@redhat.com>
Sent: April 26, 2021 9:31
To: Chu,Kaiping <chukaiping@baidu.com>
Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com; bhe@redhat.com; khalid.aziz@oracle.com; iamjoonsoo.kim@lge.com; mateusznosek0@gmail.com; sh_def@163.com; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable

On Sun, Apr 25, 2021 at 09:21:02AM +0800, chukaiping wrote:
> Currently the proactive compaction order is fixed to 
> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of 
> normal 4KB memory, but it's too high for the machines with small 
> normal memory, for example the machines with most memory configured as 
> 1GB hugetlbfs huge pages. In these machines the max order of free 
> pages is often below 9, and it's always below 9 even with hard 
> compaction. This will lead to proactive compaction be triggered very 
> frequently. In these machines we only care about order of 3 or 4.
> This patch export the oder to proc and let it configurable by user, 
> and the default value is still COMPACTION_HPAGE_ORDER.
> 
> Signed-off-by: chukaiping <chukaiping@baidu.com>
> Reported-by: kernel test robot <lkp@intel.com>

Two minor nits on the commit log message:
* there seems to be a whitespace missing in your short log:
  "... mm/compaction:let ..."
--> I will fix it in the next version of the patch.

* has the patch really been reported by a test robot?
--> Yes. There was a compile error in v1, which I fixed in v2.

A note on the sysctl name: I'd suggest that it should reflect the fact that we're adjusting the order for proactive compaction.
How about "proactive_compaction_order"?
--> I will rename it in the next version of the patch.

Cheers,

> ---
> 
> Changes in v3:
>     - change the min value of compaction_order to 1 because the fragmentation
>       index of order 0 is always 0
>     - move the definition of max_buddy_zone into #ifdef CONFIG_COMPACTION
> 
> Changes in v2:
>     - fix the compile error in ia64 and powerpc, move the initialization
>       of sysctl_compaction_order to kcompactd_init because 
>       COMPACTION_HPAGE_ORDER is a variable in these architectures
>     - change the hard coded max order number from 10 to MAX_ORDER - 1
> 
>  include/linux/compaction.h |    1 +
>  kernel/sysctl.c            |   10 ++++++++++
>  mm/compaction.c            |    9 ++++++---
>  3 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h 
> index ed4070e..151ccd1 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order)
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
> +extern unsigned int sysctl_compaction_order;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  			void *buffer, size_t *length, loff_t *ppos);
>  extern int sysctl_extfrag_threshold;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 62fbd09..e50f7d2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -196,6 +196,7 @@ enum sysctl_writes_mode {
>  #endif /* CONFIG_SCHED_DEBUG */
>  
>  #ifdef CONFIG_COMPACTION
> +static int max_buddy_zone = MAX_ORDER - 1;
>  static int min_extfrag_threshold;
>  static int max_extfrag_threshold = 1000;
>  #endif
> @@ -2871,6 +2872,15 @@ int proc_do_static_key(struct ctl_table *table, int write,
>  		.extra2		= &one_hundred,
>  	},
>  	{
> +		.procname       = "compaction_order",
> +		.data           = &sysctl_compaction_order,
> +		.maxlen         = sizeof(sysctl_compaction_order),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +		.extra1         = SYSCTL_ONE,
> +		.extra2         = &max_buddy_zone,
> +	},
> +	{
>  		.procname	= "extfrag_threshold",
>  		.data		= &sysctl_extfrag_threshold,
>  		.maxlen		= sizeof(int),
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f447..70c0acd 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
>  
>  /*
>   * A zone's fragmentation score is the external fragmentation wrt to the
> - * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
> + * sysctl_compaction_order. It returns a value in the range [0, 100].
>   */
>  static unsigned int fragmentation_score_zone(struct zone *zone)
>  {
> -	return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
> +	return extfrag_for_order(zone, sysctl_compaction_order);
>  }
>  
>  /*
>   * A weighted zone's fragmentation score is the external fragmentation
> - * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
> + * wrt to the sysctl_compaction_order scaled by the zone's size. It
>   * returns a value in the range [0, 100].
>   *
>   * The scaling factor ensures that proactive compaction focuses on larger
> @@ -2666,6 +2666,7 @@ static void compact_nodes(void)
>   * background. It takes values in the range [0, 100].
>   */
>  unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
> +unsigned int __read_mostly sysctl_compaction_order;
>  
>  /*
>   * This is the entry point for compacting all nodes via
> @@ -2958,6 +2959,8 @@ static int __init kcompactd_init(void)
>  	int nid;
>  	int ret;
>  
> +	sysctl_compaction_order = COMPACTION_HPAGE_ORDER;
> +
>  	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>  					"mm/compaction:online",
>  					kcompactd_cpu_online, NULL);
> --
> 1.7.1
> 



* Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-26  1:48     ` David Rientjes
@ 2021-04-28  1:38       ` Chu,Kaiping
  0 siblings, 0 replies; 12+ messages in thread
From: Chu,Kaiping @ 2021-04-28  1:38 UTC (permalink / raw)
  To: David Rientjes
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

Please see my answer inline.

-----Original Message-----
From: David Rientjes <rientjes@google.com>
Sent: April 26, 2021 9:48
To: Chu,Kaiping <chukaiping@baidu.com>
Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com; bhe@redhat.com; khalid.aziz@oracle.com; iamjoonsoo.kim@lge.com; mateusznosek0@gmail.com; sh_def@163.com; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
Subject: Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable

On Mon, 26 Apr 2021, Chu,Kaiping wrote:

> Hi Rientjes
> I already answered your question in 4.19.
> " We turn off transparent huge pages on our machines, so we don't care about order 9.
> There are many user space applications, and different applications may allocate memory of different orders, so we can't know the "known order of interest" in advance. Our purpose is to keep the overall fragmentation index as low as possible, not to target a specific order.

Ok, so you don't care about a specific order but you are adding a vm.compaction_order sysctl?

I think what you're trying to do is invoke full compaction (cc.order = -1) at some point in time that will (1) keep node-wide fragmentation low over the long run and (2) be relatively lightweight at the time it is done.

I can certainly understand (1) on your configuration that is mostly consumed by 1GB gigantic pages, you are likely dealing with significant memory pressure that causes fragmentation to increase over time and eventually become unrecoverable for the most part.

And for (2), yes, using vm.compact_memory will become very heavyweight if it's done too late.

So since proactive compaction uses cc.order = -1, same as vm.compact_memory, it should be possible to monitor extfrag_index under debugfs and manually trigger compaction when necessary without intervention of the kernel.
--> Adding a userspace monitor would put extra load on the machines. We can make use of the existing proactive compaction mechanism in the kernel with only a small modification.

I think we can both agree that we wouldn't want to add obscure and undocumented sysctls that can easily be replaced by a userspace implementation.
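
The userspace alternative discussed here can be sketched roughly as follows. This is a minimal illustration, not code from the patch or from any existing tool. It assumes the per-zone score is the percentage of free pages sitting in blocks smaller than the requested order, which is my reading of the kernel's extfrag_for_order(); the names extfrag_score() and should_compact() and the threshold parameter are hypothetical. The input array stands for one zone's row of free-block counts as printed in /proc/buddyinfo.

```c
#include <assert.h>
#include <stddef.h>

/*
 * free_blocks[o] is the number of free blocks of order o in one zone,
 * i.e. one row of /proc/buddyinfo. nr_orders is the number of columns.
 *
 * Returns a score in [0, 100]: the percentage of free pages that sit in
 * blocks too small to satisfy an allocation of the given order.
 * (Assumed to mirror the kernel's per-zone fragmentation score.)
 */
unsigned int extfrag_score(const unsigned long *free_blocks,
			   size_t nr_orders, unsigned int order)
{
	unsigned long long free_pages = 0;
	unsigned long long usable_pages = 0;
	size_t o;

	assert(free_blocks != NULL);

	for (o = 0; o < nr_orders; o++) {
		unsigned long long pages =
			(unsigned long long)free_blocks[o] << o;

		free_pages += pages;
		if (o >= order)
			usable_pages += pages;
	}

	if (free_pages == 0)
		return 0;

	return (unsigned int)((free_pages - usable_pages) * 100 / free_pages);
}

/* Hypothetical policy helper: compact once the score exceeds a threshold. */
int should_compact(const unsigned long *free_blocks, size_t nr_orders,
		   unsigned int order, unsigned int threshold)
{
	return extfrag_score(free_blocks, nr_orders, order) > threshold;
}
```

With this definition every free page is usable at order 0, so the order-0 score is always 0, which matches the v3 changelog's reason for setting the minimum to 1. A monitor built on this would write to /proc/sys/vm/compact_memory whenever should_compact() returns nonzero; how often to poll is a policy choice left open here.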

> Although the current proactive compaction mechanism only checks the fragmentation index of a specific order, it compacts memory for all orders (.order = -1 in proactive_compact_node), so it's still useful for us.
> We set the compaction_order according to the average fragmentation index of all our machines. It is an empirical value, a compromise between keeping the memory fragmentation index low and not triggering background compaction too often, and it can be changed in the future.
> Previously we compacted memory periodically with the command "echo 1 > /proc/sys/vm/compact_memory", but that is not good enough: it compacts all memory forcibly, which may move a lot of memory in a short time and affect application performance."
> 
> 
> BR,
> Chu Kaiping
> 
> -----Original Message-----
> From: David Rientjes <rientjes@google.com>
> Sent: April 26, 2021 9:15
> To: Chu,Kaiping <chukaiping@baidu.com>
> Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; 
> akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com; 
> bhe@redhat.com; khalid.aziz@oracle.com; iamjoonsoo.kim@lge.com; 
> mateusznosek0@gmail.com; sh_def@163.com; linux-kernel@vger.kernel.org; 
> linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order 
> configurable
> 
> On Sun, 25 Apr 2021, chukaiping wrote:
> 
> > Currently the proactive compaction order is fixed to 
> > COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of 
> > normal 4KB memory, but it's too high for the machines with small 
> > normal memory, for example the machines with most memory configured 
> > as 1GB hugetlbfs huge pages. In these machines the max order of free 
> > pages is often below 9, and it's always below 9 even with hard 
> > compaction. This will lead to proactive compaction be triggered very 
> > frequently. In these machines we only care about order of 3 or 4.
> > This patch export the oder to proc and let it configurable by user, 
> > and the default value is still COMPACTION_HPAGE_ORDER.
> > 
> 
> As asked in the review of the v1 of the patch, why is this not a userspace policy decision?  If you are interested in order-3 or order-4 fragmentation, for whatever reason, you could periodically check /proc/buddyinfo and manually invoke compaction on the system.
> 
> In other words, why does this need to live in the kernel?
> 
> > Signed-off-by: chukaiping <chukaiping@baidu.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> > ---
> > 
> > Changes in v3:
> >     - change the min value of compaction_order to 1 because the fragmentation
> >       index of order 0 is always 0
> >     - move the definition of max_buddy_zone into #ifdef CONFIG_COMPACTION
> > 
> > Changes in v2:
> >     - fix the compile error in ia64 and powerpc, move the initialization
> >       of sysctl_compaction_order to kcompactd_init because 
> >       COMPACTION_HPAGE_ORDER is a variable in these architectures
> >     - change the hard coded max order number from 10 to MAX_ORDER - 1
> > 
> >  include/linux/compaction.h |    1 +
> >  kernel/sysctl.c            |   10 ++++++++++
> >  mm/compaction.c            |    9 ++++++---
> >  3 files changed, 17 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h 
> > index ed4070e..151ccd1 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order)
> >  #ifdef CONFIG_COMPACTION
> >  extern int sysctl_compact_memory;
> >  extern unsigned int sysctl_compaction_proactiveness;
> > +extern unsigned int sysctl_compaction_order;
> >  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> >  			void *buffer, size_t *length, loff_t *ppos);
> >  extern int sysctl_extfrag_threshold;
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 62fbd09..e50f7d2 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -196,6 +196,7 @@ enum sysctl_writes_mode {
> >  #endif /* CONFIG_SCHED_DEBUG */
> >  
> >  #ifdef CONFIG_COMPACTION
> > +static int max_buddy_zone = MAX_ORDER - 1;
> >  static int min_extfrag_threshold;
> >  static int max_extfrag_threshold = 1000;
> >  #endif
> > @@ -2871,6 +2872,15 @@ int proc_do_static_key(struct ctl_table *table, int write,
> >  		.extra2		= &one_hundred,
> >  	},
> >  	{
> > +		.procname       = "compaction_order",
> > +		.data           = &sysctl_compaction_order,
> > +		.maxlen         = sizeof(sysctl_compaction_order),
> > +		.mode           = 0644,
> > +		.proc_handler   = proc_dointvec_minmax,
> > +		.extra1         = SYSCTL_ONE,
> > +		.extra2         = &max_buddy_zone,
> > +	},
> > +	{
> >  		.procname	= "extfrag_threshold",
> >  		.data		= &sysctl_extfrag_threshold,
> >  		.maxlen		= sizeof(int),
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index e04f447..70c0acd 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
> >  
> >  /*
> >   * A zone's fragmentation score is the external fragmentation wrt to the
> > - * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
> > + * sysctl_compaction_order. It returns a value in the range [0, 100].
> >   */
> >  static unsigned int fragmentation_score_zone(struct zone *zone)
> >  {
> > -	return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
> > +	return extfrag_for_order(zone, sysctl_compaction_order);
> >  }
> >  
> >  /*
> >   * A weighted zone's fragmentation score is the external fragmentation
> > - * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
> > + * wrt to the sysctl_compaction_order scaled by the zone's size. It
> >   * returns a value in the range [0, 100].
> >   *
> >   * The scaling factor ensures that proactive compaction focuses on larger
> > @@ -2666,6 +2666,7 @@ static void compact_nodes(void)
> >   * background. It takes values in the range [0, 100].
> >   */
> >  unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
> > +unsigned int __read_mostly sysctl_compaction_order;
> >  
> >  /*
> >   * This is the entry point for compacting all nodes via
> > @@ -2958,6 +2959,8 @@ static int __init kcompactd_init(void)
> >  	int nid;
> >  	int ret;
> >  
> > +	sysctl_compaction_order = COMPACTION_HPAGE_ORDER;
> > +
> >  	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> >  					"mm/compaction:online",
> >  					kcompactd_cpu_online, NULL);
> > --
> > 1.7.1
> > 
> > 
> 


* Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-28  1:17   ` Re: " Chu,Kaiping
@ 2021-04-29 19:45     ` Rafael Aquini
  2021-05-06  1:08       ` Re: " Chu,Kaiping
  0 siblings, 1 reply; 12+ messages in thread
From: Rafael Aquini @ 2021-04-29 19:45 UTC (permalink / raw)
  To: Chu,Kaiping
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

On Wed, Apr 28, 2021 at 01:17:40AM +0000, Chu,Kaiping wrote:
> Please see my answer inline.
> 
> -----Original Message-----
> From: Rafael Aquini <aquini@redhat.com>
> Sent: April 26, 2021 9:31
> To: Chu,Kaiping <chukaiping@baidu.com>
> Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com; bhe@redhat.com; khalid.aziz@oracle.com; iamjoonsoo.kim@lge.com; mateusznosek0@gmail.com; sh_def@163.com; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
> 
> On Sun, Apr 25, 2021 at 09:21:02AM +0800, chukaiping wrote:
> > Currently the proactive compaction order is fixed to 
> > COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of 
> > normal 4KB memory, but it's too high for the machines with small 
> > normal memory, for example the machines with most memory configured as 
> > 1GB hugetlbfs huge pages. In these machines the max order of free 
> > pages is often below 9, and it's always below 9 even with hard 
> > compaction. This will lead to proactive compaction be triggered very 
> > frequently. In these machines we only care about order of 3 or 4.
> > This patch export the oder to proc and let it configurable by user, 
> > and the default value is still COMPACTION_HPAGE_ORDER.
> > 
> > Signed-off-by: chukaiping <chukaiping@baidu.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> 
> Two minor nits on the commit log message:
> * there seems to be a whitespace missing in your short log:
>   "... mm/compaction:let ..."
> --> I will fix it in the next version of the patch.
> 
> * has the patch really been reported by a test robot?
> --> Yes. There was a compile error in v1, which I fixed in v2.
>

So, no... the test robot should not be listed as Reported-by. 



* Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-29 19:45     ` Rafael Aquini
@ 2021-05-06  1:08       ` Chu,Kaiping
  0 siblings, 0 replies; 12+ messages in thread
From: Chu,Kaiping @ 2021-05-06  1:08 UTC (permalink / raw)
  To: Rafael Aquini
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	khalid.aziz, iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm



-----Original Message-----
From: Rafael Aquini <aquini@redhat.com>
Sent: April 30, 2021 3:46
To: Chu,Kaiping <chukaiping@baidu.com>
Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com; bhe@redhat.com; khalid.aziz@oracle.com; iamjoonsoo.kim@lge.com; mateusznosek0@gmail.com; sh_def@163.com; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
Subject: Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable

On Wed, Apr 28, 2021 at 01:17:40AM +0000, Chu,Kaiping wrote:
> Please see my answer inline.
> 
> -----Original Message-----
> From: Rafael Aquini <aquini@redhat.com>
> Sent: April 26, 2021 9:31
> To: Chu,Kaiping <chukaiping@baidu.com>
> Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; 
> akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com; 
> bhe@redhat.com; khalid.aziz@oracle.com; iamjoonsoo.kim@lge.com; 
> mateusznosek0@gmail.com; sh_def@163.com; linux-kernel@vger.kernel.org; 
> linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order 
> configurable
> 
> On Sun, Apr 25, 2021 at 09:21:02AM +0800, chukaiping wrote:
> > Currently the proactive compaction order is fixed to 
> > COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of 
> > normal 4KB memory, but it's too high for the machines with small 
> > normal memory, for example the machines with most memory configured 
> > as 1GB hugetlbfs huge pages. In these machines the max order of free 
> > pages is often below 9, and it's always below 9 even with hard 
> > compaction. This will lead to proactive compaction be triggered very 
> > frequently. In these machines we only care about order of 3 or 4.
> > This patch export the oder to proc and let it configurable by user, 
> > and the default value is still COMPACTION_HPAGE_ORDER.
> > 
> > Signed-off-by: chukaiping <chukaiping@baidu.com>
> > Reported-by: kernel test robot <lkp@intel.com>
> 
> Two minor nits on the commit log message:
> * there seems to be a whitespace missing in your short log:
>   "... mm/compaction:let ..."
> --> I will fix it in the next version of the patch.
> 
> * has the patch really been reported by a test robot?
> --> Yes. There was a compile error in v1, which I fixed in v2.
>

> So, no... the test robot should not be listed as Reported-by. 
I added the tag because the build error notification email sent by the kernel test robot suggested it:
"If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>"



* Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-04-26  1:15 ` David Rientjes
  2021-04-26  1:29   ` Re: " Chu,Kaiping
@ 2021-05-06 21:27   ` Khalid Aziz
  2021-05-11  7:48     ` Re: " Chu,Kaiping
  1 sibling, 1 reply; 12+ messages in thread
From: Khalid Aziz @ 2021-05-06 21:27 UTC (permalink / raw)
  To: David Rientjes, chukaiping
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

On 4/25/21 9:15 PM, David Rientjes wrote:
> On Sun, 25 Apr 2021, chukaiping wrote:
> 
>> Currently the proactive compaction order is fixed to
>> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of
>> normal 4KB memory, but it's too high for the machines with small
>> normal memory, for example the machines with most memory configured
>> as 1GB hugetlbfs huge pages. In these machines the max order of
>> free pages is often below 9, and it's always below 9 even with hard
>> compaction. This will lead to proactive compaction be triggered very
>> frequently. In these machines we only care about order of 3 or 4.
>> This patch export the oder to proc and let it configurable
>> by user, and the default value is still COMPACTION_HPAGE_ORDER.
>>
> 
> As asked in the review of the v1 of the patch, why is this not a userspace
> policy decision?  If you are interested in order-3 or order-4
> fragmentation, for whatever reason, you could periodically check
> /proc/buddyinfo and manually invoke compaction on the system.
> 
> In other words, why does this need to live in the kernel?
> 

I have struggled with this question. Fragmentation and allocation stalls are significant issues on large database
systems which also happen to use memory in similar ways (90+% of memory is allocated as hugepages), leaving just enough
memory to run the rest of the userspace processes. I had originally proposed a kernel patch to monitor memory usage, do
a trend analysis of it and take proactive action -
<https://lore.kernel.org/lkml/20190813014012.30232-1-khalid.aziz@oracle.com/>. Based upon feedback, I moved the
implementation to userspace - <https://github.com/oracle/memoptimizer>. Test results across multiple workloads have been
very good. Results from one of the workloads are in this blog -
<https://blogs.oracle.com/linux/anticipating-your-memory-needs>. It works well from userspace but it has limited ways to
influence reclamation and compaction. It uses watermark_scale_factor to boost watermarks and cause reclamation to kick
in earlier and run longer. It uses /sys/devices/system/node/node%d/compact to force compaction on the node expected to
reach a high level of fragmentation soon. Neither of these is very efficient from userspace even though they get the job
done. Scaling the watermark has a longer-lasting impact than temporarily raising the scanning priority in balance_pgdat().
I plan to experiment with watermark_boost_factor to see if I can use it in place of
/sys/devices/system/node/node%d/compact and get the same results. Doing all of this in the kernel can be more efficient
and lessen the potential negative impact on the system. On the other hand, it is easier to fix and update such policies
in userspace, although at the cost of having a performance-critical component live outside the kernel and thus not be
active on the system by default.
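
As a rough illustration of the per-node trigger mentioned above, the sketch below writes to /sys/devices/system/node/node%d/compact from userspace. It is not taken from memoptimizer; the helper names node_compact_path() and compact_node() are hypothetical, and the code assumes the process has permission to write to sysfs on a kernel built with compaction and NUMA node entries.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the per-node compaction trigger path for node nid.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int node_compact_path(char *buf, size_t len, int nid)
{
	int n;

	assert(buf != NULL);

	n = snprintf(buf, len, "/sys/devices/system/node/node%d/compact", nid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Ask the kernel to compact one node by writing to its trigger file.
 * Returns 0 on success, -1 on any failure (bad path, open or write error).
 */
int compact_node(int nid)
{
	char path[64];
	FILE *f;
	int ok;

	if (node_compact_path(path, sizeof(path), nid) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs("1", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

Each call compacts the whole node, which is part of why this interface is a blunt tool compared with a policy running inside the kernel.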

--
Khalid



* Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-05-06 21:27   ` Khalid Aziz
@ 2021-05-11  7:48     ` Chu,Kaiping
  2021-05-11 15:00       ` Khalid Aziz
  0 siblings, 1 reply; 12+ messages in thread
From: Chu,Kaiping @ 2021-05-11  7:48 UTC (permalink / raw)
  To: Khalid Aziz, David Rientjes
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm



> -----Original Message-----
> From: Khalid Aziz <khalid.aziz@oracle.com>
> Sent: May 7, 2021 5:27
> To: David Rientjes <rientjes@google.com>; Chu,Kaiping
> <chukaiping@baidu.com>
> Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com;
> akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com;
> bhe@redhat.com; iamjoonsoo.kim@lge.com; mateusznosek0@gmail.com;
> sh_def@163.com; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org;
> linux-mm@kvack.org
> Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order
> configurable
> 
> On 4/25/21 9:15 PM, David Rientjes wrote:
> > On Sun, 25 Apr 2021, chukaiping wrote:
> >
> >> Currently the proactive compaction order is fixed to
> >> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of
> >> normal 4KB memory, but it's too high for the machines with small
> >> normal memory, for example the machines with most memory configured
> >> as 1GB hugetlbfs huge pages. In these machines the max order of free
> >> pages is often below 9, and it's always below 9 even with hard
> >> compaction. This will lead to proactive compaction be triggered very
> >> frequently. In these machines we only care about order of 3 or 4.
> >> This patch export the oder to proc and let it configurable by user,
> >> and the default value is still COMPACTION_HPAGE_ORDER.
> >>
> >
> > As asked in the review of the v1 of the patch, why is this not a
> > userspace policy decision?  If you are interested in order-3 or
> > order-4 fragmentation, for whatever reason, you could periodically
> > check /proc/buddyinfo and manually invoke compaction on the system.
> >
> > In other words, why does this need to live in the kernel?
> >
> 
> I have struggled with this question. Fragmentation and allocation stalls are
> significant issues on large database systems which also happen to use memory
> in similar ways (90+% of memory is allocated as hugepages) leaving just
> enough memory to run rest of the userspace processes. I had originally
> proposed a kernel patch to monitor, do a trend analysis of memory usage and
> take proactive action -
> <https://lore.kernel.org/lkml/20190813014012.30232-1-khalid.aziz@oracle.c
> om/>. Based upon feedback, I moved the implementation to userspace -
> <https://github.com/oracle/memoptimizer>. Test results across multiple
> workloads have been very good. Results from one of the workloads are in this
> blog - <https://blogs.oracle.com/linux/anticipating-your-memory-needs>. It
> works well from userspace but it has limited ways to influence reclamation and
> compaction. It uses watermark_scale_factor to boost watermarks and cause
> reclamation to kick in earlier and run longer. It uses
> /sys/devices/system/node/node%d/compact to force compaction on the node
> expected to reach high level of fragmentation soon. Neither of these is very
> efficient from userspace even though they get the job done. Scaling watermark
> has longer lasting impact than raising scanning priority in balance_pgdat()
> temporarily. I plan to experiment with watermark_boost_factor to see if I can
> use it in place of /sys/devices/system/node/node%d/compact and get the
> same results. Doing all of this in the kernel can be more efficient and lessen
> potential negative impact on the system. On the other hand, it is easier to fix
> and update such policies in userspace although at the cost of having a
> performance critical component live outside the kernel and thus not be active
> on the system by default.
>
I have been studying your memoptimizer recently, and I agree that moving the implementation into the kernel, where it can work together with the current proactive compaction mechanism, would be more efficient.
By the way, I am interested in memoptimizer and would like to test it. How do you evaluate its effectiveness?




* Re: Re: [PATCH v3] mm/compaction:let proactive compaction order configurable
  2021-05-11  7:48     ` Re: " Chu,Kaiping
@ 2021-05-11 15:00       ` Khalid Aziz
  0 siblings, 0 replies; 12+ messages in thread
From: Khalid Aziz @ 2021-05-11 15:00 UTC (permalink / raw)
  To: Chu,Kaiping, David Rientjes
  Cc: mcgrof, keescook, yzaikin, akpm, vbabka, nigupta, bhe,
	iamjoonsoo.kim, mateusznosek0, sh_def, linux-kernel,
	linux-fsdevel, linux-mm

On 5/11/21 1:48 AM, Chu,Kaiping wrote:
> 
> 
>> -----Original Message-----
>> From: Khalid Aziz <khalid.aziz@oracle.com>
>> Sent: May 7, 2021 5:27
>> To: David Rientjes <rientjes@google.com>; Chu,Kaiping
>> <chukaiping@baidu.com>
>> Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com;
>> akpm@linux-foundation.org; vbabka@suse.cz; nigupta@nvidia.com;
>> bhe@redhat.com; iamjoonsoo.kim@lge.com; mateusznosek0@gmail.com;
>> sh_def@163.com; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org;
>> linux-mm@kvack.org
>> Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order
>> configurable
>>
>> On 4/25/21 9:15 PM, David Rientjes wrote:
>>> On Sun, 25 Apr 2021, chukaiping wrote:
>>>
>>>> Currently the proactive compaction order is fixed to
>>>> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of
>>>> normal 4KB memory, but it's too high for the machines with small
>>>> normal memory, for example the machines with most memory configured
>>>> as 1GB hugetlbfs huge pages. In these machines the max order of free
>>>> pages is often below 9, and it's always below 9 even with hard
>>>> compaction. This will lead to proactive compaction be triggered very
>>>> frequently. In these machines we only care about order of 3 or 4.
>>>> This patch export the oder to proc and let it configurable by user,
>>>> and the default value is still COMPACTION_HPAGE_ORDER.
>>>>
>>>
>>> As asked in the review of the v1 of the patch, why is this not a
>>> userspace policy decision?  If you are interested in order-3 or
>>> order-4 fragmentation, for whatever reason, you could periodically
>>> check /proc/buddyinfo and manually invoke compaction on the system.
>>>
>>> In other words, why does this need to live in the kernel?
>>>
>>
>> I have struggled with this question. Fragmentation and allocation stalls are
>> significant issues on large database systems which also happen to use memory
>> in similar ways (90+% of memory is allocated as hugepages) leaving just
>> enough memory to run rest of the userspace processes. I had originally
>> proposed a kernel patch to monitor, do a trend analysis of memory usage and
>> take proactive action -
>> <https://lore.kernel.org/lkml/20190813014012.30232-1-khalid.aziz@oracle.c
>> om/>. Based upon feedback, I moved the implementation to userspace -
>> <https://github.com/oracle/memoptimizer>. Test results across multiple
>> workloads have been very good. Results from one of the workloads are in this
>> blog - <https://blogs.oracle.com/linux/anticipating-your-memory-needs>. It
>> works well from userspace but it has limited ways to influence reclamation and
>> compaction. It uses watermark_scale_factor to boost watermarks and cause
>> reclamation to kick in earlier and run longer. It uses
>> /sys/devices/system/node/node%d/compact to force compaction on the node
>> expected to reach a high level of fragmentation soon. Neither of these is very
>> efficient from userspace even though they get the job done. Scaling watermarks
>> has a longer-lasting impact than raising scanning priority in balance_pgdat()
>> temporarily. I plan to experiment with watermark_boost_factor to see if I can
>> use it in place of /sys/devices/system/node/node%d/compact and get the
>> same results. Doing all of this in the kernel can be more efficient and lessen
>> potential negative impact on the system. On the other hand, it is easier to fix
>> and update such policies in userspace although at the cost of having a
>> performance critical component live outside the kernel and thus not be active
>> on the system by default.
>>
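
For readers following the userspace option discussed in the quoted text above (check /proc/buddyinfo, then trigger compaction through /sys/devices/system/node/node%d/compact), here is a minimal sketch of that loop's decision step. It is not how memoptimizer works; the function names and the `min_blocks` threshold are hypothetical and chosen only for illustration. It assumes the usual /proc/buddyinfo layout of one line per zone with free-block counts listed from order 0 upward.

```python
import re

# Matches lines such as: "Node 0, zone   Normal   100  50  20 ..."
BUDDY_LINE = re.compile(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)")


def free_blocks_by_node(buddyinfo_text, min_order):
    """Sum free blocks of order >= min_order per NUMA node (all zones)."""
    totals = {}
    for line in buddyinfo_text.splitlines():
        match = BUDDY_LINE.match(line.strip())
        if not match:
            continue
        node = int(match.group(1))
        counts = [int(c) for c in match.group(3).split()]
        totals[node] = totals.get(node, 0) + sum(counts[min_order:])
    return totals


def nodes_needing_compaction(buddyinfo_text, min_order, min_blocks):
    """Nodes with fewer than min_blocks free blocks of order >= min_order.

    min_blocks is a hypothetical policy knob, not a kernel parameter.
    """
    totals = free_blocks_by_node(buddyinfo_text, min_order)
    return sorted(node for node, total in totals.items() if total < min_blocks)


def compact_node(node):
    """Ask the kernel to compact one node (requires root)."""
    with open(f"/sys/devices/system/node/node{node}/compact", "w") as f:
        f.write("1")
```

A periodic job could read /proc/buddyinfo, call nodes_needing_compaction(text, 3, some_threshold) for the order-3 case mentioned in this thread, and call compact_node() on each returned node. How often to poll and what threshold to use are policy choices this sketch leaves open.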
> I have been studying your memoptimizer recently, and I also agree with moving
> the implementation into the kernel so that it works together with the current
> proactive compaction mechanism for higher efficiency.
> By the way, I am interested in memoptimizer and would like to test it. How
> should I evaluate its effectiveness?
> 
> 

If you look at this blog I wrote on memoptimizer - <https://blogs.oracle.com/linux/anticipating-your-memory-needs>, 
under the section "Measuring stalls" I describe the workload I used to measure its effectiveness. The metric I use is
the number of allocation/compaction stalls over a multi-hour run of the workload. The number of allocation/compaction stalls
gives an idea of how effective the system is at keeping free order-0 and larger pages available proactively. Any workload
that runs into a significant number of stalls is a good workload for measuring effectiveness.
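
As a rough illustration of counting stalls over a run, the sketch below compares two snapshots of /proc/vmstat. It assumes the stalls show up in the `allocstall*` and `compact_stall` counters (recent kernels split allocation stalls per zone, older ones expose a single `allocstall`). This may not be exactly how the blog gathers its numbers, and the function names are hypothetical.

```python
from pathlib import Path


def parse_stalls(vmstat_text):
    """Return (allocation stalls, compaction stalls) from /proc/vmstat text."""
    alloc = 0
    compact = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        name, value = parts[0], int(parts[1])
        if name.startswith("allocstall"):
            # Covers both "allocstall" and per-zone "allocstall_<zone>".
            alloc += value
        elif name == "compact_stall":
            compact = value
    return alloc, compact


def stall_delta(before_text, after_text):
    """Stalls accumulated between two /proc/vmstat snapshots."""
    alloc_before, compact_before = parse_stalls(before_text)
    alloc_after, compact_after = parse_stalls(after_text)
    return alloc_after - alloc_before, compact_after - compact_before


def read_vmstat(path="/proc/vmstat"):
    """Read the current counters; take one snapshot before and one after the run."""
    return Path(path).read_text()
```

Taking one read_vmstat() snapshot before the workload starts and one after it ends, then passing both to stall_delta(), gives the two numbers to compare between a run with memoptimizer and a run without it.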

--
Khalid

end of thread, other threads:[~2021-05-11 15:01 UTC | newest]

Thread overview: 12+ messages
2021-04-25  1:21 [PATCH v3] mm/compaction:let proactive compaction order configurable chukaiping
2021-04-26  1:15 ` David Rientjes
2021-04-26  1:29   ` Reply: " Chu,Kaiping
2021-04-26  1:48     ` David Rientjes
2021-04-28  1:38       ` Reply: " Chu,Kaiping
2021-05-06 21:27   ` Khalid Aziz
2021-05-11  7:48     ` Reply: " Chu,Kaiping
2021-05-11 15:00       ` Khalid Aziz
2021-04-26  1:31 ` Rafael Aquini
2021-04-28  1:17   ` Reply: " Chu,Kaiping
2021-04-29 19:45     ` Rafael Aquini
2021-05-06  1:08       ` Reply: " Chu,Kaiping
