Linux-Fsdevel Archive on lore.kernel.org
From: "Chu,Kaiping" <chukaiping@baidu.com>
To: David Rientjes <rientjes@google.com>
Cc: "mcgrof@kernel.org" <mcgrof@kernel.org>,
	"keescook@chromium.org" <keescook@chromium.org>,
	"yzaikin@google.com" <yzaikin@google.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [PATCH v2] mm/compaction:let proactive compaction order configurable
Date: Mon, 19 Apr 2021 12:02:14 +0000
Message-ID: <1e686e75fe71471aa94705e76bec76a5@baidu.com>
In-Reply-To: <7efa316c-d39b-59a5-bc52-62325127a917@google.com>

Hi Rientjes,
We turn off transparent huge pages on our machines, so we don't care about order 9.
There are many user-space applications, and different applications may allocate memory of different orders, so we can't know the "known order of interest" in advance. Our goal is to keep the overall fragmentation index as low as possible, rather than to target a specific order.
Although the current proactive compaction mechanism only checks the fragmentation index of one specific order, it compacts memory for all orders (.order = -1 in proactive_compact_node), so it's still useful for us.
We set compaction_order according to the average fragmentation index across all our machines. It's an empirical value, a compromise between keeping the fragmentation index low and not triggering background compaction too often, and it can be changed in the future.
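
To make the dependence on the chosen order concrete, here is a minimal sketch of the kind of score being discussed. It assumes the score is the percentage of free pages that sit in free blocks smaller than the chosen order, which is my reading of extfrag_for_order. This is userspace C, not kernel code; the helper name frag_score and the sample counts in the note below are hypothetical.

```c
#include <stdio.h>

/*
 * Illustrative userspace approximation (not kernel code).
 * free_blocks[i] is the number of free blocks of order i, as in one
 * row of /proc/buddyinfo. Returns the percentage [0, 100] of free
 * pages that sit in blocks too small to satisfy an allocation of
 * 'order'.
 */
unsigned int frag_score(const unsigned long *free_blocks,
			unsigned int nr_orders, unsigned int order)
{
	unsigned long long free_pages = 0, suitable_pages = 0;
	unsigned int i;

	for (i = 0; i < nr_orders; i++) {
		/* A free block of order i holds 2^i pages. */
		unsigned long long pages =
			(unsigned long long)free_blocks[i] << i;

		free_pages += pages;
		if (i >= order)
			suitable_pages += pages;
	}
	if (!free_pages)
		return 0;
	return (unsigned int)((free_pages - suitable_pages) * 100 /
			      free_pages);
}
```

With made-up per-order counts of 100, 50, 20, 10 and 5 free blocks for orders 0 to 4 and nothing above, this gives 100 for order 9 and 63 for order 3. When no block of order 9 can ever be formed, the order-9 score stays at 100 however much compaction runs, while a lower order still tracks real fragmentation.
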
Previously we ran memory compaction periodically with the command "echo 1 > /proc/sys/vm/compact_memory", but that's not good enough: it forcibly compacts all memory, which may cause a lot of memory movement in a short time and hurt application performance.

BR,
Chu Kaiping

-----Original Message-----
From: David Rientjes <rientjes@google.com>
Sent: April 17, 2021 5:31
To: Chu,Kaiping <chukaiping@baidu.com>
Cc: mcgrof@kernel.org; keescook@chromium.org; yzaikin@google.com; akpm@linux-foundation.org; linux-kernel@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-mm@kvack.org
Subject: Re: [PATCH v2] mm/compaction:let proactive compaction order configurable

On Sat, 17 Apr 2021, chukaiping wrote:

> Currently the proactive compaction order is fixed to 
> COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of 
> normal 4KB memory, but it's too high for the machines with small 
> normal memory, for example the machines with most memory configured as 
> 1GB hugetlbfs huge pages. In these machines the max order of free 
> pages is often below 9, and it's always below 9 even with hard 
> compaction. This will lead to proactive compaction being triggered very 
> frequently. In these machines we only care about order of 3 or 4.
> This patch exports the order to proc and lets the user configure it; 
> the default value is still COMPACTION_HPAGE_ORDER.
> 

Still not entirely clear on the use case beyond hugepages.  In your response from v1, you indicated you were not concerned with allocation latency of hugepages but rather had a thundering herd problem where once fragmentation got bad, many threads started compacting all at once.

I'm not sure that tuning the proactive compaction order is the right solution.  I think the proactive compaction order is more about starting compaction when a known order of interest (like a hugepage) is fully depleted and we want a page of that order so the idea is to start recovering from that situation.

Is this not a userspace policy decision?  I'm wondering if it would simply be better to manually invoke compaction periodically or when the fragmentation ratio has reached a certain level.  You can manually invoke compaction yourself through sysfs for each node on the system.
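
A rough sketch of what such a userspace policy could look like, assuming the per-node trigger is the /sys/devices/system/node/node<N>/compact file and that the caller already has some fragmentation score in the range 0..100 (for example derived from /proc/buddyinfo). The function names and the threshold semantics are hypothetical, chosen only for illustration:

```c
#include <stdio.h>

/* Build the sysfs path that triggers compaction of one NUMA node. */
int node_compact_path(char *buf, size_t len, int nid)
{
	int n = snprintf(buf, len,
			 "/sys/devices/system/node/node%d/compact", nid);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Ask the kernel to compact node 'nid' (needs root). 0 on success. */
int compact_node(int nid)
{
	char path[64];
	FILE *f;
	int ok;

	if (node_compact_path(path, sizeof(path), nid) < 0)
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs("1", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Hypothetical policy hook: compact only when the caller's score
 * (0..100, however it is measured) exceeds 'threshold'.
 * Returns 1 if compaction was triggered, 0 if skipped, -1 on error.
 */
int maybe_compact(int nid, unsigned int score, unsigned int threshold)
{
	if (score <= threshold)
		return 0;
	return compact_node(nid) == 0 ? 1 : -1;
}
```

A daemon or cron job could call maybe_compact() for each node on whatever schedule suits the workload, which keeps the choice of order and threshold entirely in userspace.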

> Signed-off-by: chukaiping <chukaiping@baidu.com>
> Reported-by: kernel test robot <lkp@intel.com>
> ---
> 
> Changes in v2:
>     - fix the compile error in ia64 and powerpc
>     - change the hard coded max order number from 10 to MAX_ORDER - 1
> 
>  include/linux/compaction.h |    1 +
>  kernel/sysctl.c            |   11 +++++++++++
>  mm/compaction.c            |   14 +++++++++++---
>  3 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h 
> index ed4070e..151ccd1 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order)
>  #ifdef CONFIG_COMPACTION
>  extern int sysctl_compact_memory;
>  extern unsigned int sysctl_compaction_proactiveness;
> +extern unsigned int sysctl_compaction_order;
>  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  			void *buffer, size_t *length, loff_t *ppos);
>  extern int sysctl_extfrag_threshold;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 62fbd09..a607d4d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -195,6 +195,8 @@ enum sysctl_writes_mode {
>  #endif /* CONFIG_SMP */
>  #endif /* CONFIG_SCHED_DEBUG */
>  
> +static int max_buddy_zone = MAX_ORDER - 1;
> +
>  #ifdef CONFIG_COMPACTION
>  static int min_extfrag_threshold;
>  static int max_extfrag_threshold = 1000;
> @@ -2871,6 +2873,15 @@ int proc_do_static_key(struct ctl_table *table, int write,
>  		.extra2		= &one_hundred,
>  	},
>  	{
> +		.procname       = "compaction_order",
> +		.data           = &sysctl_compaction_order,
> +		.maxlen         = sizeof(sysctl_compaction_order),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec_minmax,
> +		.extra1         = SYSCTL_ZERO,
> +		.extra2         = &max_buddy_zone,
> +	},
> +	{
>  		.procname	= "extfrag_threshold",
>  		.data		= &sysctl_extfrag_threshold,
>  		.maxlen		= sizeof(int),
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e04f447..bfd1d5e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
>  
>  /*
>   * A zone's fragmentation score is the external fragmentation wrt to the
> - * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
> + * sysctl_compaction_order. It returns a value in the range [0, 100].
>   */
>  static unsigned int fragmentation_score_zone(struct zone *zone)
>  {
> -	return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
> +	return extfrag_for_order(zone, sysctl_compaction_order);
>  }
>  
>  /*
>   * A weighted zone's fragmentation score is the external fragmentation
> - * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
> + * wrt to the sysctl_compaction_order scaled by the zone's size. It
>   * returns a value in the range [0, 100].
>   *
>   * The scaling factor ensures that proactive compaction focuses on larger
> @@ -2666,6 +2666,7 @@ static void compact_nodes(void)
>   * background. It takes values in the range [0, 100].
>   */
>  unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
> +unsigned int __read_mostly sysctl_compaction_order;
>  
>  /*
>   * This is the entry point for compacting all nodes via
> @@ -2958,6 +2959,13 @@ static int __init kcompactd_init(void)
>  	int nid;
>  	int ret;
>  
> +	/*
> +	 * move the initialization of sysctl_compaction_order to here to
> +	 * eliminate compile error in ia64 and powerpc architecture because
> +	 * COMPACTION_HPAGE_ORDER is a variable in this architecture
> +	 */
> +	sysctl_compaction_order = COMPACTION_HPAGE_ORDER;
> +
>  	ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>  					"mm/compaction:online",
>  					kcompactd_cpu_online, NULL);
> --
> 1.7.1
> 
> 
> 

Thread overview: 6+ messages
2021-04-16 17:22 chukaiping
2021-04-16 21:31 ` David Rientjes
2021-04-19 12:02   ` Chu,Kaiping [this message]
2021-04-21  7:21 chukaiping
2021-04-22 16:27 ` Nitin Gupta
2021-04-22 20:33 ` Rafael Aquini
