linux-mm.kvack.org archive mirror
* [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy
@ 2020-05-29  1:06 Feng Tang
  2020-05-29  1:06 ` [PATCH v4 1/4] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Feng Tang @ 2020-05-29  1:06 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, andi.kleen, tim.c.chen,
	dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

When checking a performance change for will-it-scale scalability
mmap test [1], we found very high lock contention for spinlock of
percpu counter 'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

This heavy lock contention is not always necessary: 'vm_committed_as'
only needs to be very precise when the strict OVERCOMMIT_NEVER policy
is set, which requires a rather small batch number for the percpu
counter.

So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
policy, and enlarge it for the not-so-strict OVERCOMMIT_ALWAYS and
OVERCOMMIT_GUESS policies.

Benchmarking with the same testcase in [1] shows a 53% improvement on an
8C/16T desktop and 2097% (20X) on a 4S/72C/144T server. For that case,
whether a platform shows improvement depends on whether the test's mmap
size is bigger than the computed batch number.

We tested 10+ platforms in 0day (server, desktop and laptop). If we
lift the batch to 64X, 80%+ of the platforms show improvements; with a
16X lift, about 1/3 of the platforms do.

And generally it should help mmap/munmap usage, as Michal Hocko
mentioned:
"
I believe that there are non-synthetic workloads which would benefit
from a larger batch. E.g. large in-memory databases which do large
mmaps during startup from multiple threads.
"

Note: there are some style complaints from checkpatch for patch 3, as
the sysctl handler declaration follows the same format as its sibling
functions.

patch1: a cleanup for /proc/meminfo
patch2: a preparation patch which also improves the accuracy of
        vm_memory_committed()
patch3: remove the VM_WARN_ONCE for vm_committed_as underflow check
patch4: main change

This is against today's linux-mm git tree on github.

Please help to review, thanks!

- Feng

----------------------------------------------------------------
Changelog:

  v4:
    * Remove the VM_WARN_ONCE check for vm_committed_as underflow,
      thanks to Qian Cai for finding and testing the warning

  v3:
    * refine commit log and cleanup code, according to comments
      from Michal Hocko and Matthew Wilcox
    * change the lift from 16X to 64X after testing
  
  v2:
     * add the sysctl handler to cover runtime overcommit policy
       changes, as suggested by Andrew Morton
     * address the accuracy concern of vm_memory_committed()
       from Andi Kleen


Feng Tang (4):
  proc/meminfo: avoid open coded reading of vm_committed_as
  mm/util.c: make vm_memory_committed() more accurate
  mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check
  mm: adjust vm_committed_as_batch according to vm overcommit policy

 fs/proc/meminfo.c    |  2 +-
 include/linux/mm.h   |  2 ++
 include/linux/mman.h |  4 ++++
 kernel/sysctl.c      |  2 +-
 mm/mm_init.c         | 18 ++++++++++++++----
 mm/util.c            | 22 +++++++++++++---------
 6 files changed, 35 insertions(+), 15 deletions(-)

-- 
2.7.4



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v4 1/4] proc/meminfo: avoid open coded reading of vm_committed_as
  2020-05-29  1:06 [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
@ 2020-05-29  1:06 ` Feng Tang
  2020-05-29  1:06 ` [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate Feng Tang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Feng Tang @ 2020-05-29  1:06 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, andi.kleen, tim.c.chen,
	dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

Use the existing vm_memory_committed() instead, which is also
convenient for future changes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/meminfo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index b030d8b..e3d14ee 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -41,7 +41,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = percpu_counter_read_positive(&vm_committed_as);
+	committed = vm_memory_committed();
 
 	cached = global_node_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate
  2020-05-29  1:06 [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
  2020-05-29  1:06 ` [PATCH v4 1/4] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
@ 2020-05-29  1:06 ` Feng Tang
  2020-06-03 13:35   ` Michal Hocko
  2020-06-03 14:28   ` Andi Kleen
  2020-05-29  1:06 ` [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check Feng Tang
  2020-05-29  1:06 ` [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  3 siblings, 2 replies; 12+ messages in thread
From: Feng Tang @ 2020-05-29  1:06 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, andi.kleen, tim.c.chen,
	dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

percpu_counter_sum_positive() provides more accurate info.

With percpu_counter_read_positive(), the worst-case deviation could be
'batch * nr_cpus', which is totalram_pages/256 for now, and will grow
further when the batch is enlarged.

The time cost of percpu_counter_sum_positive() is about 800 nanoseconds
on a 2C/4T platform and 2~3 microseconds on a 2S/36C/72T server in the
normal case; in the worst case, where vm_committed_as's spinlock is
under severe contention, it costs 30~40 microseconds on the 2S/36C/72T
server. That should be fine for its only two users: /proc/meminfo and
the Hyper-V balloon driver's once-per-second status trace.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 9b3be03..3c7a08c 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -790,7 +790,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	return percpu_counter_sum_positive(&vm_committed_as);
 }
 EXPORT_SYMBOL_GPL(vm_memory_committed);
 
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check
  2020-05-29  1:06 [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
  2020-05-29  1:06 ` [PATCH v4 1/4] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
  2020-05-29  1:06 ` [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate Feng Tang
@ 2020-05-29  1:06 ` Feng Tang
  2020-05-29  2:49   ` Qian Cai
  2020-05-29  1:06 ` [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  3 siblings, 1 reply; 12+ messages in thread
From: Feng Tang @ 2020-05-29  1:06 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, andi.kleen, tim.c.chen,
	dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang, Konstantin Khlebnikov

As is explained by Michal Hocko:

: Looking at the history, this has been added by 82f71ae4a2b8
: ("mm: catch memory commitment underflow") to have a safety check
: for issues which have been fixed. There doesn't seem to be any bug
: reports mentioning this splat since then so it is likely just
: spending cycles for a hot path (yes many people run with DEBUG_VM)
: without a strong reason.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <andi.kleen@intel.com>
---
 mm/util.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 3c7a08c..fe63271 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -814,14 +814,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long allowed;
 
-	/*
-	 * A transient decrease in the value is unlikely, so no need
-	 * READ_ONCE() for vm_committed_as.count.
-	 */
-	VM_WARN_ONCE(data_race(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus()),
-			"memory commitment underflow");
-
 	vm_acct_memory(pages);
 
 	/*
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-29  1:06 [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
                   ` (2 preceding siblings ...)
  2020-05-29  1:06 ` [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check Feng Tang
@ 2020-05-29  1:06 ` Feng Tang
  2020-06-03 13:38   ` Michal Hocko
  3 siblings, 1 reply; 12+ messages in thread
From: Feng Tang @ 2020-05-29  1:06 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, andi.kleen, tim.c.chen,
	dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

When checking a performance change for will-it-scale scalability mmap test
[1], we found very high lock contention for spinlock of percpu counter
'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

This heavy lock contention is not always necessary: 'vm_committed_as'
only needs to be very precise when the strict OVERCOMMIT_NEVER policy is
set, which requires a rather small batch number for the percpu counter.

So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies.  Also
add a sysctl handler to adjust it when the policy is reconfigured.

Benchmarking with the same testcase in [1] shows a 53% improvement on an
8C/16T desktop and 2097% (20X) on a 4S/72C/144T server.  We tested with
the test platforms in 0day (server, desktop and laptop), and 80%+ of the
platforms show improvements with that test.  Whether a platform shows
improvement depends on whether the test's mmap size is bigger than the
computed batch number.

If the lift is 16X instead, 1/3 of the platforms will show improvements,
though it should help mmap/munmap usage generally, as Michal Hocko
mentioned:

: I believe that there are non-synthetic workloads which would benefit
: from a larger batch.  E.g.  large in-memory databases which do large
: mmaps during startup from multiple threads.

[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mm.h   |  2 ++
 include/linux/mman.h |  4 ++++
 kernel/sysctl.c      |  2 +-
 mm/mm_init.c         | 18 ++++++++++++++----
 mm/util.c            | 12 ++++++++++++
 5 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 573947c..c2efea6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -206,6 +206,8 @@ int overcommit_ratio_handler(struct ctl_table *, int, void *, size_t *,
 		loff_t *);
 int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
 		loff_t *);
+int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
+		loff_t *);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c..91c93c1 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
 
 #ifdef CONFIG_SMP
 extern s32 vm_committed_as_batch;
+extern void mm_compute_batch(void);
 #else
 #define vm_committed_as_batch 0
+static inline void mm_compute_batch(void)
+{
+}
 #endif
 
 unsigned long vm_memory_committed(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index db1ce7a..9456c86 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2650,7 +2650,7 @@ static struct ctl_table vm_table[] = {
 		.data		= &sysctl_overcommit_memory,
 		.maxlen		= sizeof(sysctl_overcommit_memory),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= overcommit_policy_handler,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 435e5f7..c5a6fb1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -13,6 +13,7 @@
 #include <linux/memory.h>
 #include <linux/notifier.h>
 #include <linux/sched.h>
+#include <linux/mman.h>
 #include "internal.h"
 
 #ifdef CONFIG_DEBUG_MEMORY_INIT
@@ -144,14 +145,23 @@ EXPORT_SYMBOL_GPL(mm_kobj);
 #ifdef CONFIG_SMP
 s32 vm_committed_as_batch = 32;
 
-static void __meminit mm_compute_batch(void)
+void mm_compute_batch(void)
 {
 	u64 memsized_batch;
 	s32 nr = num_present_cpus();
 	s32 batch = max_t(s32, nr*2, 32);
-
-	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
-	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
+	unsigned long ram_pages = totalram_pages();
+
+	/*
+	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
+	 * of (total memory/#cpus), and lift it to 25% for other
+	 * policies to ease the possible lock contention for percpu_counter
+	 * vm_committed_as, while the max limit is INT_MAX
+	 */
+	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
+	else
+		memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX);
 
 	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
 }
diff --git a/mm/util.c b/mm/util.c
index fe63271..580d268 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -746,6 +746,18 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
 	return ret;
 }
 
+int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
+		size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret == 0 && write)
+		mm_compute_batch();
+
+	return ret;
+}
+
 int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos)
 {
-- 
2.7.4



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check
  2020-05-29  1:06 ` [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check Feng Tang
@ 2020-05-29  2:49   ` Qian Cai
  2020-05-29  5:37     ` Feng Tang
  2020-06-02  3:37     ` Feng Tang
  0 siblings, 2 replies; 12+ messages in thread
From: Qian Cai @ 2020-05-29  2:49 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel, Konstantin Khlebnikov

On Fri, May 29, 2020 at 09:06:09AM +0800, Feng Tang wrote:
> As is explained by Michal Hocko:
> 
> : Looking at the history, this has been added by 82f71ae4a2b8
> : ("mm: catch memory commitment underflow") to have a safety check
> : for issues which have been fixed. There doesn't seem to be any bug
> : reports mentioning this splat since then so it is likely just
> : spending cycles for a hot path (yes many people run with DEBUG_VM)
> : without a strong reason.

Hmm, it looks like the warning is still useful to catch issues in,

https://lore.kernel.org/linux-mm/20140624201606.18273.44270.stgit@zurg
https://lore.kernel.org/linux-mm/54BB9A32.7080703@oracle.com/

After reading the whole discussion in that thread, I actually disagree
with Michal. In order to get rid of this existing warning, someone
rather needs a strong reason that could prove the performance hit is
noticeable, with some data.

> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andi Kleen <andi.kleen@intel.com>
> ---
>  mm/util.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 3c7a08c..fe63271 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -814,14 +814,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  {
>  	long allowed;
>  
> -	/*
> -	 * A transient decrease in the value is unlikely, so no need
> -	 * READ_ONCE() for vm_committed_as.count.
> -	 */
> -	VM_WARN_ONCE(data_race(percpu_counter_read(&vm_committed_as) <
> -			-(s64)vm_committed_as_batch * num_online_cpus()),
> -			"memory commitment underflow");
> -
>  	vm_acct_memory(pages);
>  
>  	/*
> -- 
> 2.7.4
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check
  2020-05-29  2:49   ` Qian Cai
@ 2020-05-29  5:37     ` Feng Tang
  2020-06-02  3:37     ` Feng Tang
  1 sibling, 0 replies; 12+ messages in thread
From: Feng Tang @ 2020-05-29  5:37 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel, Konstantin Khlebnikov

On Thu, May 28, 2020 at 10:49:28PM -0400, Qian Cai wrote:
> On Fri, May 29, 2020 at 09:06:09AM +0800, Feng Tang wrote:
> > As is explained by Michal Hocko:
> > 
> > : Looking at the history, this has been added by 82f71ae4a2b8
> > : ("mm: catch memory commitment underflow") to have a safety check
> > : for issues which have been fixed. There doesn't seem to be any bug
> > : reports mentioning this splat since then so it is likely just
> > : spending cycles for a hot path (yes many people run with DEBUG_VM)
> > : without a strong reason.
> 
> Hmm, it looks like the warning is still useful to catch issues in,
> 
> https://lore.kernel.org/linux-mm/20140624201606.18273.44270.stgit@zurg
> https://lore.kernel.org/linux-mm/54BB9A32.7080703@oracle.com/
> 
> After reading the whole discussion in that thread, I actually disagree
> with Michal. In order to get rid of this existing warning, someone
> rather needs a strong reason that could prove the performance hit is
> noticeable, with some data.

One problem with current check is percpu_counter_read(&vm_committed_as)
is not accurate, and percpu_counter_sum() is way too heavy.

Thanks,
Feng


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check
  2020-05-29  2:49   ` Qian Cai
  2020-05-29  5:37     ` Feng Tang
@ 2020-06-02  3:37     ` Feng Tang
  1 sibling, 0 replies; 12+ messages in thread
From: Feng Tang @ 2020-06-02  3:37 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel, Konstantin Khlebnikov

Hi Qian,

On Thu, May 28, 2020 at 10:49:28PM -0400, Qian Cai wrote:
> On Fri, May 29, 2020 at 09:06:09AM +0800, Feng Tang wrote:
> > As is explained by Michal Hocko:
> > 
> > : Looking at the history, this has been added by 82f71ae4a2b8
> > : ("mm: catch memory commitment underflow") to have a safety check
> > : for issues which have been fixed. There doesn't seem to be any bug
> > : reports mentioning this splat since then so it is likely just
> > : spending cycles for a hot path (yes many people run with DEBUG_VM)
> > : without a strong reason.
> 
> Hmm, it looks like the warning is still useful to catch issues in,
> 
> https://lore.kernel.org/linux-mm/20140624201606.18273.44270.stgit@zurg
> https://lore.kernel.org/linux-mm/54BB9A32.7080703@oracle.com/
> 
> After reading the whole discussion in that thread, I actually disagree
> with Michal. In order to get rid of this existing warning, someone
> rather needs a strong reason that could prove the performance hit is
> noticeable, with some data.

I re-ran the same benchmark with the v5.7 and v5.7+remove_warning
kernels; the overall performance change is trivial (which is expected):

   1330147            +0.1%    1331032        will-it-scale.72.processes

But the perf "self" stats show a big change for __vm_enough_memory():

      0.27            -0.3        0.00        pp.self.__vm_enough_memory

I post the full compare result in the end.

Thanks,
Feng


=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode:
  lkp-skl-2sp7/will-it-scale/debian-x86_64-20191114.cgz/x86_64-rhel-7.6-vm-debug/gcc-7/100%/process/mmap2/performance/0x2000065

commit: 
  v5.7
  af3eca72dc43078e1ee4a38b0ecc0225b659f345

            v5.7 af3eca72dc43078e1ee4a38b0ec 
---------------- --------------------------- 
       fail:runs  %reproduction    fail:runs
           |             |             |    
        850:3       -12130%         486:2     dmesg.timestamp:last
          2:3          -67%            :2     kmsg.Firmware_Bug]:the_BIOS_has_corrupted_hw-PMU_resources(MSR#is#bb)
           :3           33%           1:2     kmsg.Firmware_Bug]:the_BIOS_has_corrupted_hw-PMU_resources(MSR#is#e08)
          5:3         -177%            :2     kmsg.timestamp:Firmware_Bug]:the_BIOS_has_corrupted_hw-PMU_resources(MSR#is#bb)
           :3           88%           2:2     kmsg.timestamp:Firmware_Bug]:the_BIOS_has_corrupted_hw-PMU_resources(MSR#is#e08)
        398:3        -4444%         265:2     kmsg.timestamp:last
         %stddev     %change         %stddev
             \          |                \  
   1330147            +0.1%    1331032        will-it-scale.72.processes
      0.02            +0.0%       0.02        will-it-scale.72.processes_idle
     18474            +0.1%      18486        will-it-scale.per_process_ops
    301.18            -0.0%     301.16        will-it-scale.time.elapsed_time
    301.18            -0.0%     301.16        will-it-scale.time.elapsed_time.max
      1.00 ± 81%    +100.0%       2.00        will-it-scale.time.involuntary_context_switches
      9452            +0.0%       9452        will-it-scale.time.maximum_resident_set_size
      5925            +0.1%       5932        will-it-scale.time.minor_page_faults
      4096            +0.0%       4096        will-it-scale.time.page_size
      0.01 ± 35%     +12.5%       0.01 ± 33%  will-it-scale.time.system_time
      0.03 ± 14%      +5.0%       0.04 ± 14%  will-it-scale.time.user_time
     83.33            +0.2%      83.50        will-it-scale.time.voluntary_context_switches
   1330147            +0.1%    1331032        will-it-scale.workload
      0.45 ± 29%      +0.0        0.50 ± 28%  mpstat.cpu.all.idle%
     98.41            -0.1       98.34        mpstat.cpu.all.sys%
      1.14            +0.0        1.16        mpstat.cpu.all.usr%
    200395 ± 18%     +11.9%     224282 ± 14%  cpuidle.C1.time
      4008 ± 38%      -2.1%       3924 ± 15%  cpuidle.C1.usage
 1.222e+08 ± 19%     -29.2%   86444161        cpuidle.C1E.time
    254203 ± 19%     -23.2%     195198 ±  4%  cpuidle.C1E.usage
   8145747 ± 31%    +339.9%   35830338 ± 72%  cpuidle.C6.time
     22878 ±  9%    +288.2%      88823 ± 70%  cpuidle.C6.usage
      8891 ±  7%      -7.4%       8229        cpuidle.POLL.time
      3111 ± 18%     -11.1%       2766        cpuidle.POLL.usage
      0.00          -100.0%       0.00        numa-numastat.node0.interleave_hit
    314399 ±  2%      -1.0%     311244 ±  3%  numa-numastat.node0.local_node
    322209            +2.4%     329909        numa-numastat.node0.numa_hit
      7814 ± 73%    +138.9%      18670 ± 24%  numa-numastat.node0.other_node
      0.00          -100.0%       0.00        numa-numastat.node1.interleave_hit
    343026 ±  2%      -0.3%     341980        numa-numastat.node1.local_node
    358632            -3.3%     346708        numa-numastat.node1.numa_hit
     15613 ± 36%     -69.7%       4728 ± 98%  numa-numastat.node1.other_node
    301.18            -0.0%     301.16        time.elapsed_time
    301.18            -0.0%     301.16        time.elapsed_time.max
      1.00 ± 81%    +100.0%       2.00        time.involuntary_context_switches
      9452            +0.0%       9452        time.maximum_resident_set_size
      5925            +0.1%       5932        time.minor_page_faults
      4096            +0.0%       4096        time.page_size
      0.01 ± 35%     +12.5%       0.01 ± 33%  time.system_time
      0.03 ± 14%      +5.0%       0.04 ± 14%  time.user_time
     83.33            +0.2%      83.50        time.voluntary_context_switches
      0.33 ±141%     +50.0%       0.50 ±100%  vmstat.cpu.id
     97.00            +0.0%      97.00        vmstat.cpu.sy
      1.00            +0.0%       1.00        vmstat.cpu.us
      0.00          -100.0%       0.00        vmstat.io.bi
      4.00            +0.0%       4.00        vmstat.memory.buff
   1391751            +0.1%    1392746        vmstat.memory.cache
 1.294e+08            -0.0%  1.294e+08        vmstat.memory.free
     71.00            +0.0%      71.00        vmstat.procs.r
      1315            -0.7%       1305        vmstat.system.cs
    147433            -0.0%     147369        vmstat.system.in
      0.00          -100.0%       0.00        proc-vmstat.compact_isolated
     85060            +0.4%      85431        proc-vmstat.nr_active_anon
     37.00            -1.4%      36.50        proc-vmstat.nr_active_file
     71111            +0.1%      71200        proc-vmstat.nr_anon_pages
     77.33 ± 17%     +12.5%      87.00        proc-vmstat.nr_anon_transparent_hugepages
     54.00            +1.9%      55.00        proc-vmstat.nr_dirtied
      5.00            +0.0%       5.00        proc-vmstat.nr_dirty
   3215506            -0.0%    3215471        proc-vmstat.nr_dirty_background_threshold
   6438875            -0.0%    6438805        proc-vmstat.nr_dirty_threshold
    327936            +0.1%     328237        proc-vmstat.nr_file_pages
     50398            +0.0%      50398        proc-vmstat.nr_free_cma
  32356721            -0.0%   32356374        proc-vmstat.nr_free_pages
      4640            -0.1%       4636        proc-vmstat.nr_inactive_anon
     82.67 ±  2%      -0.8%      82.00 ±  2%  proc-vmstat.nr_inactive_file
     13256            -0.3%      13211        proc-vmstat.nr_kernel_stack
      8057            -0.5%       8017        proc-vmstat.nr_mapped
    134.00 ±141%     +49.3%     200.00 ±100%  proc-vmstat.nr_mlock
      2229            +0.2%       2234        proc-vmstat.nr_page_table_pages
     18609 ±  3%      +1.6%      18898 ±  3%  proc-vmstat.nr_shmem
     19964            -0.3%      19901        proc-vmstat.nr_slab_reclaimable
     34003            +0.1%      34025        proc-vmstat.nr_slab_unreclaimable
    309227            +0.0%     309249        proc-vmstat.nr_unevictable
      0.00          -100.0%       0.00        proc-vmstat.nr_writeback
     53.00            +0.0%      53.00        proc-vmstat.nr_written
     85060            +0.4%      85431        proc-vmstat.nr_zone_active_anon
     37.00            -1.4%      36.50        proc-vmstat.nr_zone_active_file
      4640            -0.1%       4636        proc-vmstat.nr_zone_inactive_anon
     82.67 ±  2%      -0.8%      82.00 ±  2%  proc-vmstat.nr_zone_inactive_file
    309227            +0.0%     309249        proc-vmstat.nr_zone_unevictable
      5.00            +0.0%       5.00        proc-vmstat.nr_zone_write_pending
      2181 ±124%     -68.6%     685.50 ± 80%  proc-vmstat.numa_hint_faults
     37.67 ±109%     +77.9%      67.00 ± 91%  proc-vmstat.numa_hint_faults_local
    702373            +0.7%     707116        proc-vmstat.numa_hit
     35.33 ± 85%     -70.3%      10.50 ±  4%  proc-vmstat.numa_huge_pte_updates
      0.00          -100.0%       0.00        proc-vmstat.numa_interleave
    678938            +0.7%     683714        proc-vmstat.numa_local
     23435            -0.1%      23401        proc-vmstat.numa_other
      4697 ± 68%     -86.8%     618.50 ± 98%  proc-vmstat.numa_pages_migrated
     25844 ± 52%     -79.1%       5406 ±  4%  proc-vmstat.numa_pte_updates
     20929 ±  4%      +1.9%      21332 ±  5%  proc-vmstat.pgactivate
      0.00          -100.0%       0.00        proc-vmstat.pgalloc_dma32
    760325            -0.4%     756908        proc-vmstat.pgalloc_normal
    801566            -0.5%     797832        proc-vmstat.pgfault
    714690            -0.2%     713286        proc-vmstat.pgfree
      4697 ± 68%     -86.8%     618.50 ± 98%  proc-vmstat.pgmigrate_success
      0.00          -100.0%       0.00        proc-vmstat.pgpgin
    103.00            +0.5%     103.50        proc-vmstat.thp_collapse_alloc
      5.00            +0.0%       5.00        proc-vmstat.thp_fault_alloc
      0.00          -100.0%       0.00        proc-vmstat.thp_zero_page_alloc
     41.00 ± 98%     +35.4%      55.50 ± 80%  proc-vmstat.unevictable_pgs_culled
    183.00 ±141%     +50.0%     274.50 ±100%  proc-vmstat.unevictable_pgs_mlocked
      2.59            +0.5%       2.60        perf-stat.i.MPKI
 4.854e+09            +0.0%  4.856e+09        perf-stat.i.branch-instructions
      0.45            -0.0        0.43        perf-stat.i.branch-miss-rate%
  21296577            -2.3%   20817170        perf-stat.i.branch-misses
     39.98            -0.2       39.81        perf-stat.i.cache-miss-rate%
  21372778            +0.0%   21380457        perf-stat.i.cache-misses
  53441942            +0.5%   53705724        perf-stat.i.cache-references
      1285            -0.7%       1277        perf-stat.i.context-switches
     10.67            -0.0%      10.67        perf-stat.i.cpi
     71998            -0.0%      71998        perf-stat.i.cpu-clock
  2.21e+11            +0.0%   2.21e+11        perf-stat.i.cpu-cycles
    117.36            +0.3%     117.71        perf-stat.i.cpu-migrations
     10322            -0.0%      10321        perf-stat.i.cycles-between-cache-misses
      0.05            +0.0        0.05        perf-stat.i.dTLB-load-miss-rate%
   2709233            +0.1%    2712427        perf-stat.i.dTLB-load-misses
 5.785e+09            +0.0%  5.787e+09        perf-stat.i.dTLB-loads
      0.00            +0.0        0.00 ±  2%  perf-stat.i.dTLB-store-miss-rate%
      8967            -3.0%       8701        perf-stat.i.dTLB-store-misses
  1.97e+09            +0.1%  1.971e+09        perf-stat.i.dTLB-stores
     94.02            +0.2       94.24        perf-stat.i.iTLB-load-miss-rate%
   2732366            -1.4%    2694372        perf-stat.i.iTLB-load-misses
    173049            -5.7%     163172 ±  2%  perf-stat.i.iTLB-loads
  2.07e+10            +0.0%  2.071e+10        perf-stat.i.instructions
      7671            +1.0%       7747        perf-stat.i.instructions-per-iTLB-miss
      0.10            +0.1%       0.10        perf-stat.i.ipc
      3.07            +0.0%       3.07        perf-stat.i.metric.GHz
      0.42            +0.5%       0.43        perf-stat.i.metric.K/sec
    175.98            +0.1%     176.08        perf-stat.i.metric.M/sec
      2565            -0.7%       2547        perf-stat.i.minor-faults
     99.55            -0.0       99.53        perf-stat.i.node-load-miss-rate%
   5949351            +0.3%    5969805        perf-stat.i.node-load-misses
     22301 ±  6%      +5.6%      23543 ±  8%  perf-stat.i.node-loads
     99.73            -0.0       99.72        perf-stat.i.node-store-miss-rate%
   5314673            -0.1%    5310449        perf-stat.i.node-store-misses
      4704 ±  4%      -1.8%       4619        perf-stat.i.node-stores
      2565            -0.7%       2547        perf-stat.i.page-faults
     71998            -0.0%      71998        perf-stat.i.task-clock
      2.58            +0.5%       2.59        perf-stat.overall.MPKI
      0.44            -0.0        0.43        perf-stat.overall.branch-miss-rate%
     39.99            -0.2       39.81        perf-stat.overall.cache-miss-rate%
     10.67            -0.0%      10.67        perf-stat.overall.cpi
     10340            -0.0%      10337        perf-stat.overall.cycles-between-cache-misses
      0.05            +0.0        0.05        perf-stat.overall.dTLB-load-miss-rate%
      0.00            -0.0        0.00        perf-stat.overall.dTLB-store-miss-rate%
     94.04            +0.2       94.29        perf-stat.overall.iTLB-load-miss-rate%
      7577            +1.4%       7686        perf-stat.overall.instructions-per-iTLB-miss
      0.09            +0.0%       0.09        perf-stat.overall.ipc
     99.62            -0.0       99.60        perf-stat.overall.node-load-miss-rate%
     99.91            +0.0       99.91        perf-stat.overall.node-store-miss-rate%
   4691551            +0.0%    4693151        perf-stat.overall.path-length
 4.838e+09            +0.0%   4.84e+09        perf-stat.ps.branch-instructions
  21230930            -2.3%   20750859        perf-stat.ps.branch-misses
  21302195            +0.0%   21309444        perf-stat.ps.cache-misses
  53273375            +0.5%   53531696        perf-stat.ps.cache-references
      1281            -0.7%       1273        perf-stat.ps.context-switches
     71760            -0.0%      71759        perf-stat.ps.cpu-clock
 2.203e+11            +0.0%  2.203e+11        perf-stat.ps.cpu-cycles
    117.02            +0.3%     117.33        perf-stat.ps.cpu-migrations
   2702184            +0.1%    2704689        perf-stat.ps.dTLB-load-misses
 5.766e+09            +0.0%  5.767e+09        perf-stat.ps.dTLB-loads
      9028            -3.2%       8736        perf-stat.ps.dTLB-store-misses
 1.963e+09            +0.1%  1.965e+09        perf-stat.ps.dTLB-stores
   2723237            -1.4%    2685365        perf-stat.ps.iTLB-load-misses
    172573            -5.7%     162735 ±  2%  perf-stat.ps.iTLB-loads
 2.063e+10            +0.0%  2.064e+10        perf-stat.ps.instructions
      2559            -0.7%       2540        perf-stat.ps.minor-faults
   5929506            +0.3%    5949863        perf-stat.ps.node-load-misses
     22689 ±  5%      +4.7%      23766 ±  8%  perf-stat.ps.node-loads
   5296902            -0.1%    5292690        perf-stat.ps.node-store-misses
      4724 ±  4%      -2.2%       4622        perf-stat.ps.node-stores
      2559            -0.7%       2540        perf-stat.ps.page-faults
     71760            -0.0%      71759        perf-stat.ps.task-clock
  6.24e+12            +0.1%  6.247e+12        perf-stat.total.instructions
     47.20            -0.2       47.05        pp.bt.percpu_counter_add_batch.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
     50.12            -0.2       49.97        pp.bt.entry_SYSCALL_64_after_hwframe.munmap
     50.10            -0.1       49.95        pp.bt.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
     46.75            -0.1       46.60        pp.bt._raw_spin_lock_irqsave.percpu_counter_add_batch.__do_munmap.__vm_munmap.__x64_sys_munmap
     49.36            -0.1       49.22        pp.bt.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
     49.78            -0.1       49.64        pp.bt.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
     49.75            -0.1       49.61        pp.bt.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.munmap
     50.48            -0.1       50.34        pp.bt.munmap
     46.56            -0.1       46.41        pp.bt.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__do_munmap.__vm_munmap
      1.88            -0.0        1.88        pp.bt.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64
      1.32            +0.0        1.33        pp.bt.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.__vm_munmap
      1.42            +0.0        1.44        pp.bt.unmap_vmas.unmap_region.__do_munmap.__vm_munmap.__x64_sys_munmap
      0.51            +0.0        0.53        pp.bt.___might_sleep.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
     48.27            +0.1       48.39        pp.bt.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
     48.67            +0.1       48.80        pp.bt.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.mmap64
     48.51            +0.1       48.65        pp.bt.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
     47.10            +0.1       47.24        pp.bt.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
     48.74            +0.1       48.88        pp.bt.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.mmap64
     49.08            +0.1       49.23        pp.bt.entry_SYSCALL_64_after_hwframe.mmap64
     46.48            +0.1       46.62        pp.bt.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region
     49.06            +0.1       49.21        pp.bt.do_syscall_64.entry_SYSCALL_64_after_hwframe.mmap64
     49.46            +0.1       49.61        pp.bt.mmap64
     46.66            +0.1       46.80        pp.bt._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap
     46.84            +0.4       47.23        pp.bt.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff
     49.78            -0.1       49.64        pp.child.__x64_sys_munmap
     50.51            -0.1       50.36        pp.child.munmap
     49.76            -0.1       49.61        pp.child.__vm_munmap
     49.36            -0.1       49.22        pp.child.__do_munmap
      0.45            -0.0        0.41        pp.child.perf_event_mmap
      0.03 ± 70%      -0.0        0.00        pp.child.strlen
      0.02 ±141%      -0.0        0.00        pp.child.common_file_perm
      0.30            -0.0        0.29        pp.child.free_pgd_range
      1.88            -0.0        1.88        pp.child.unmap_region
      0.28            -0.0        0.27        pp.child.free_p4d_range
      0.28 ±  4%      -0.0        0.27        pp.child.apic_timer_interrupt
      0.19 ±  4%      -0.0        0.18 ±  2%  pp.child.hrtimer_interrupt
      0.09            -0.0        0.08 ±  5%  pp.child.find_vma
      0.40 ±  2%      -0.0        0.40 ±  3%  pp.child.vm_area_alloc
      0.35            -0.0        0.34        pp.child.syscall_return_via_sysret
      0.35 ±  2%      -0.0        0.34        pp.child.up_read
     93.41            -0.0       93.41        pp.child._raw_spin_lock_irqsave
      0.05 ±  8%      -0.0        0.05        pp.child.kmem_cache_alloc_trace
      0.11 ±  4%      -0.0        0.11        pp.child.d_path
      0.11 ±  4%      -0.0        0.11        pp.child.perf_iterate_sb
      0.08 ±  5%      -0.0        0.08        pp.child.prepend_path
     99.23            -0.0       99.23        pp.child.entry_SYSCALL_64_after_hwframe
     93.04            -0.0       93.04        pp.child.native_queued_spin_lock_slowpath
      0.09            -0.0        0.09        pp.child.tick_sched_timer
      0.05            -0.0        0.05        pp.child.unlink_file_vma
      0.05            -0.0        0.05        pp.child.task_tick_fair
      0.05            -0.0        0.05        pp.child.perf_event_mmap_output
     99.20            +0.0       99.20        pp.child.do_syscall_64
      0.21 ±  3%      +0.0        0.21        pp.child.smp_apic_timer_interrupt
      0.08            +0.0        0.08        pp.child.update_process_times
      0.06            +0.0        0.06        pp.child.down_write_killable
      0.15            +0.0        0.15 ±  6%  pp.child.rcu_all_qs
      0.37 ±  2%      +0.0        0.37 ±  5%  pp.child.kmem_cache_alloc
      0.28            +0.0        0.29 ±  5%  pp.child._cond_resched
      0.12 ±  3%      +0.0        0.12 ±  4%  pp.child.vma_link
      0.09 ±  5%      +0.0        0.10 ±  5%  pp.child.security_mmap_file
      0.06 ±  7%      +0.0        0.07 ±  7%  pp.child.scheduler_tick
      0.13 ±  3%      +0.0        0.14 ±  3%  pp.child.__hrtimer_run_queues
      0.32 ±  2%      +0.0        0.32        pp.child._raw_spin_unlock_irqrestore
      0.08            +0.0        0.08 ±  5%  pp.child.tick_sched_handle
      0.06            +0.0        0.07 ±  7%  pp.child.down_write
      0.06            +0.0        0.07 ±  7%  pp.child.remove_vma
      1.43            +0.0        1.44        pp.child.unmap_vmas
      0.05            +0.0        0.06        pp.child.__vma_rb_erase
      0.08            +0.0        0.09        pp.child.free_pgtables
      1.39            +0.0        1.40        pp.child.unmap_page_range
      0.35            +0.0        0.36 ±  4%  pp.child.entry_SYSCALL_64
      0.10 ±  4%      +0.0        0.11        pp.child.arch_get_unmapped_area_topdown
      0.58            +0.0        0.59        pp.child.___might_sleep
      0.06            +0.0        0.08 ±  6%  pp.child.shmem_mmap
      0.03 ± 70%      +0.0        0.05        pp.child.up_write
      0.03 ± 70%      +0.0        0.05        pp.child.vm_unmapped_area
      0.03 ± 70%      +0.0        0.05        pp.child.__vma_link_rb
      0.15 ±  3%      +0.0        0.17 ±  3%  pp.child.shmem_get_unmapped_area
      0.18 ±  2%      +0.0        0.20 ±  2%  pp.child.get_unmapped_area
      0.00            +0.0        0.03 ±100%  pp.child.prepend_name
      0.02 ±141%      +0.0        0.05        pp.child.touch_atime
     48.28            +0.1       48.39        pp.child.mmap_region
     47.10            +0.1       47.24        pp.child.__vm_enough_memory
     48.51            +0.1       48.65        pp.child.do_mmap
     48.74            +0.1       48.88        pp.child.ksys_mmap_pgoff
     48.67            +0.1       48.81        pp.child.vm_mmap_pgoff
     49.49            +0.1       49.64        pp.child.mmap64
     94.04            +0.2       94.28        pp.child.percpu_counter_add_batch
      0.27            -0.3        0.00        pp.self.__vm_enough_memory
      0.03 ± 70%      -0.0        0.00        pp.self.strlen
      0.02 ±141%      -0.0        0.00        pp.self.prepend_path
      0.28            -0.0        0.27        pp.self.free_p4d_range
      0.07            -0.0        0.06        pp.self.perf_iterate_sb
      0.08            -0.0        0.07        pp.self.perf_event_mmap
      0.35 ±  2%      -0.0        0.34 ±  5%  pp.self.kmem_cache_alloc
      0.11            -0.0        0.11 ±  4%  pp.self.rcu_all_qs
      0.37            -0.0        0.36        pp.self._raw_spin_lock_irqsave
      0.35            -0.0        0.34        pp.self.syscall_return_via_sysret
      0.35 ±  2%      -0.0        0.34        pp.self.up_read
      0.14 ±  3%      -0.0        0.14 ±  3%  pp.self._cond_resched
     93.04            -0.0       93.04        pp.self.native_queued_spin_lock_slowpath
      0.31            +0.0        0.32 ±  4%  pp.self.entry_SYSCALL_64
      0.69            +0.0        0.69        pp.self.unmap_page_range
      0.10 ±  4%      +0.0        0.10        pp.self._raw_spin_unlock_irqrestore
      0.06 ±  8%      +0.0        0.06        pp.self.find_vma
      0.06 ±  8%      +0.0        0.06        pp.self.__do_munmap
      0.06 ±  7%      +0.0        0.07        pp.self.mmap_region
      0.61 ±  3%      +0.0        0.62        pp.self.do_syscall_64
      0.02 ±141%      +0.0        0.03 ±100%  pp.self.up_write
      0.05            +0.0        0.06        pp.self.__vma_rb_erase
      0.54            +0.0        0.56        pp.self.___might_sleep
      0.03 ± 70%      +0.0        0.05        pp.self.vm_unmapped_area
      0.03 ± 70%      +0.0        0.05        pp.self.shmem_get_unmapped_area
      0.03 ± 70%      +0.0        0.05        pp.self.__vma_link_rb
      0.00            +0.0        0.03 ±100%  pp.self.prepend_name
      0.00            +0.0        0.03 ±100%  pp.self.do_mmap
      0.00            +0.0        0.03 ±100%  pp.self.arch_get_unmapped_area_topdown
      0.02 ±141%      +0.0        0.05        pp.self.perf_event_mmap_output
      0.32 ±  2%      +0.2        0.56        pp.self.percpu_counter_add_batch
    552.67 ±  2%      -5.2%     524.00 ±  6%  softirqs.BLOCK
      2.00            +0.0%       2.00        softirqs.HI
    911.00 ± 47%     -31.4%     625.00 ±  2%  softirqs.NET_RX
     63.67 ±  3%      -5.0%      60.50 ±  4%  softirqs.NET_TX
    312414            -1.1%     309101        softirqs.RCU
    228903            -1.4%     225602 ±  3%  softirqs.SCHED
    265.67            -1.0%     263.00        softirqs.TASKLET
   8777267            +0.1%    8789634        softirqs.TIMER
     23504            -0.1%      23472        interrupts.CAL:Function_call_interrupts
    144.00            -0.3%     143.50        interrupts.IWI:IRQ_work_interrupts
  43641147            -0.0%   43621882        interrupts.LOC:Local_timer_interrupts
      0.00          -100.0%       0.00        interrupts.MCP:Machine_check_polls
    570736 ±  5%      +2.1%     582977 ±  5%  interrupts.NMI:Non-maskable_interrupts
    570736 ±  5%      +2.1%     582977 ±  5%  interrupts.PMI:Performance_monitoring_interrupts
     45097            +0.7%      45407        interrupts.RES:Rescheduling_interrupts
      0.00          -100.0%       0.00        interrupts.RTR:APIC_ICR_read_retries
    193.67 ± 26%     -21.3%     152.50 ± 29%  interrupts.TLB:TLB_shootdowns

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate
  2020-05-29  1:06 ` [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate Feng Tang
@ 2020-06-03 13:35   ` Michal Hocko
  2020-06-03 14:28   ` Andi Kleen
  1 sibling, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2020-06-03 13:35 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Qian Cai, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel, K. Y. Srinivasan,
	Haiyang Zhang

On Fri 29-05-20 09:06:08, Feng Tang wrote:
> percpu_counter_sum_positive() will provide more accurate info.
> 
> With percpu_counter_read_positive(), the worst-case deviation can be
> 'batch * nr_cpus', which is totalram_pages/256 for now, and will grow
> when the batch is enlarged.
> 
> Its time cost is about 800 nanoseconds on a 2C/4T platform and
> 2~3 microseconds on a 2S/36C/72T server in the normal case, and in
> the worst case where vm_committed_as's spinlock is under severe
> contention, it costs 30~40 microseconds on the 2S/36C/72T server,
> which should be fine for its only two users: /proc/meminfo and
> the HyperV balloon driver's status trace per second.
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>

I cannot speak for the HyperV part, so Cc'ing the maintainers, but this
shouldn't be a problem for meminfo.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/util.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 9b3be03..3c7a08c 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -790,7 +790,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
>   */
>  unsigned long vm_memory_committed(void)
>  {
> -	return percpu_counter_read_positive(&vm_committed_as);
> +	return percpu_counter_sum_positive(&vm_committed_as);
>  }
>  EXPORT_SYMBOL_GPL(vm_memory_committed);
>  
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-29  1:06 ` [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
@ 2020-06-03 13:38   ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2020-06-03 13:38 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Qian Cai, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel

On Fri 29-05-20 09:06:10, Feng Tang wrote:
> When checking a performance change for will-it-scale scalability mmap test
> [1], we found very high lock contention for spinlock of percpu counter
> 'vm_committed_as':
> 
>     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
>     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
>     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> 
> Actually this heavy lock contention is not always necessary:
> 'vm_committed_as' only needs to be very precise when the strict
> OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
> for the percpu counter.
> 
> So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
> lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies.  Also
> add a sysctl handler to adjust it when the policy is reconfigured.
> 
> Benchmarking with the same testcase in [1] shows a 53% improvement on an
> 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server.  We tested with
> the test platforms in 0day (server, desktop and laptop), and 80%+ of the
> platforms show improvements with that test.  Whether a platform shows an
> improvement depends on whether the test mmap size is bigger than the
> computed batch number.
> 
> With a 16X lift instead, 1/3 of the platforms show improvements, though
> it should help mmap/munmap usage generally, as Michal Hocko mentioned:
> 
> : I believe that there are non-synthetic workloads which would benefit from
> : a larger batch.  E.g.  large in memory databases which do large mmaps
> : during startups from multiple threads.
> 
> [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
> 
> Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andi Kleen <andi.kleen@intel.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mm.h   |  2 ++
>  include/linux/mman.h |  4 ++++
>  kernel/sysctl.c      |  2 +-
>  mm/mm_init.c         | 18 ++++++++++++++----
>  mm/util.c            | 12 ++++++++++++
>  5 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 573947c..c2efea6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -206,6 +206,8 @@ int overcommit_ratio_handler(struct ctl_table *, int, void *, size_t *,
>  		loff_t *);
>  int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
>  		loff_t *);
> +int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
> +		loff_t *);
>  
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>  
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 4b08e9c..91c93c1 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
>  
>  #ifdef CONFIG_SMP
>  extern s32 vm_committed_as_batch;
> +extern void mm_compute_batch(void);
>  #else
>  #define vm_committed_as_batch 0
> +static inline void mm_compute_batch(void)
> +{
> +}
>  #endif
>  
>  unsigned long vm_memory_committed(void);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index db1ce7a..9456c86 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2650,7 +2650,7 @@ static struct ctl_table vm_table[] = {
>  		.data		= &sysctl_overcommit_memory,
>  		.maxlen		= sizeof(sysctl_overcommit_memory),
>  		.mode		= 0644,
> -		.proc_handler	= proc_dointvec_minmax,
> +		.proc_handler	= overcommit_policy_handler,
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= &two,
>  	},
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 435e5f7..c5a6fb1 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -13,6 +13,7 @@
>  #include <linux/memory.h>
>  #include <linux/notifier.h>
>  #include <linux/sched.h>
> +#include <linux/mman.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_DEBUG_MEMORY_INIT
> @@ -144,14 +145,23 @@ EXPORT_SYMBOL_GPL(mm_kobj);
>  #ifdef CONFIG_SMP
>  s32 vm_committed_as_batch = 32;
>  
> -static void __meminit mm_compute_batch(void)
> +void mm_compute_batch(void)
>  {
>  	u64 memsized_batch;
>  	s32 nr = num_present_cpus();
>  	s32 batch = max_t(s32, nr*2, 32);
> -
> -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> +	unsigned long ram_pages = totalram_pages();
> +
> +	/*
> +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> +	 * of (total memory/#cpus), and lift it to 25% for other
> +	 * policies to ease the possible lock contention for percpu_counter
> +	 * vm_committed_as, while the max limit is INT_MAX
> +	 */
> +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> +	else
> +		memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX);
>  
>  	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
>  }
> diff --git a/mm/util.c b/mm/util.c
> index fe63271..580d268 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -746,6 +746,18 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
>  	return ret;
>  }
>  
> +int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
> +		size_t *lenp, loff_t *ppos)
> +{
> +	int ret;
> +
> +	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (ret == 0 && write)
> +		mm_compute_batch();
> +
> +	return ret;
> +}
> +
>  int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer,
>  		size_t *lenp, loff_t *ppos)
>  {
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate
  2020-05-29  1:06 ` [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate Feng Tang
  2020-06-03 13:35   ` Michal Hocko
@ 2020-06-03 14:28   ` Andi Kleen
  2020-06-04  1:38     ` Feng Tang
  1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2020-06-03 14:28 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel

> Its time cost is about 800 nanoseconds on a 2C/4T platform and
> 2~3 microseconds on a 2S/36C/72T server in normal case, and in
> worst case where vm_committed_as's spinlock is under severe
> contention, it costs 30~40 microseconds for the 2S/36C/72T server,

This will likely be 40-80us on larger systems, although the overhead
is often non-linear so it might get worse.

> which should be fine for its only two users: /proc/meminfo and
> HyperV balloon driver's status trace per second.

There are some setups that do frequent sampling of /proc/meminfo
in the background.  Increased overhead could be a problem for them.
But I'm not proposing a change now.  If someone complains we'll have
to revisit, I guess, perhaps adding a rate limit of some sort.

-Andi




* Re: [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate
  2020-06-03 14:28   ` Andi Kleen
@ 2020-06-04  1:38     ` Feng Tang
  0 siblings, 0 replies; 12+ messages in thread
From: Feng Tang @ 2020-06-04  1:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Qian Cai, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel

On Wed, Jun 03, 2020 at 07:28:53AM -0700, Andi Kleen wrote:
> > Its time cost is about 800 nanoseconds on a 2C/4T platform and
> > 2~3 microseconds on a 2S/36C/72T server in normal case, and in
> > worst case where vm_committed_as's spinlock is under severe
> > contention, it costs 30~40 microseconds for the 2S/36C/72T server,
> 
> This will likely be 40-80us on larger systems, although the overhead
> is often non-linear so it might get worse.
> 
> > which should be fine for its only two users: /proc/meminfo and
> > HyperV balloon driver's status trace per second.
> 
> There are some setups that do frequent sampling of /proc/meminfo
> in the background.  Increased overhead could be a problem for them.
> But I'm not proposing a change now.  If someone complains we'll have
> to revisit, I guess, perhaps adding a rate limit of some sort.

Agree. Maybe I should also put the time cost info into the code
comments in case someone notices the slowdown.

Thanks,
Feng

> 
> -Andi



end of thread, other threads:[~2020-06-04  1:38 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-29  1:06 [PATCH v4 0/4] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
2020-05-29  1:06 ` [PATCH v4 1/4] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
2020-05-29  1:06 ` [PATCH v4 2/4] mm/util.c: make vm_memory_committed() more accurate Feng Tang
2020-06-03 13:35   ` Michal Hocko
2020-06-03 14:28   ` Andi Kleen
2020-06-04  1:38     ` Feng Tang
2020-05-29  1:06 ` [PATCH v4 3/4] mm/util.c: remove the VM_WARN_ONCE for vm_committed_as underflow check Feng Tang
2020-05-29  2:49   ` Qian Cai
2020-05-29  5:37     ` Feng Tang
2020-06-02  3:37     ` Feng Tang
2020-05-29  1:06 ` [PATCH v4 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
2020-06-03 13:38   ` Michal Hocko
