linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
@ 2020-05-08  7:25 Feng Tang
  2020-05-08  7:25 ` [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-08  7:25 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Luis Chamberlain, Iurii Zaikin,
	andi.kleen, tim.c.chen, dave.hansen, ying.huang, linux-mm,
	linux-kernel
  Cc: Feng Tang

When checking a performance change for will-it-scale scalability
mmap test [1], we found very high lock contention for spinlock of
percpu counter 'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch
number for the percpu counter.

So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
when the policy is reconfigured.

Benchmark with the same testcase in [1] shows 53% improvement on a
8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
for some platforms, due to the test mmap size of the case is bigger
than the batch number computed, though the patch will help mmap/munmap
generally.

[1] https://lkml.org/lkml/2020/3/5/57

There are some style complain from checkpatch for patch 3, as it
follows the similar format of sibling functions

patch1: a cleanup for /proc/meminfo
patch2: a preparation patch which also improve the accuracy of
        vm_memory_committed
patch3: the main change

Please help to review, thanks!

- Feng

----------------------------------------------------------------
Changelog:
  v2:
     * add the sysctl handler to cover runtime overcommit policy
       change, as suggested by Andres Morton 
     * address the accuracy concern of vm_memory_committed()
       from Andi Kleen 

Feng Tang (3):
  proc/meminfo: avoid open coded reading of vm_committed_as
  mm/util.c: make vm_memory_committed() more accurate
  mm: adjust vm_committed_as_batch according to vm overcommit policy

 fs/proc/meminfo.c    |  2 +-
 include/linux/mm.h   |  2 ++
 include/linux/mman.h |  4 ++++
 kernel/sysctl.c      |  2 +-
 mm/mm_init.c         | 19 +++++++++++++++----
 mm/util.c            | 15 ++++++++++++++-
 6 files changed, 37 insertions(+), 7 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as
  2020-05-08  7:25 [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
@ 2020-05-08  7:25 ` Feng Tang
  2020-05-15  7:22   ` Michal Hocko
  2020-05-08  7:25 ` [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate Feng Tang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-08  7:25 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Luis Chamberlain, Iurii Zaikin,
	andi.kleen, tim.c.chen, dave.hansen, ying.huang, linux-mm,
	linux-kernel
  Cc: Feng Tang

Use the existing vm_memory_committed() instead, which is also
convenient for future change.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 fs/proc/meminfo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb..578c0b8 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -42,7 +42,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = percpu_counter_read_positive(&vm_committed_as);
+	committed = vm_memory_committed();
 
 	cached = global_node_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate
  2020-05-08  7:25 [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
  2020-05-08  7:25 ` [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
@ 2020-05-08  7:25 ` Feng Tang
  2020-05-15  7:23   ` Michal Hocko
  2020-05-08  7:25 ` [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  2020-05-21 21:27 ` [PATCH 0/3] make vm_committed_as_batch aware of " Qian Cai
  3 siblings, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-08  7:25 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Luis Chamberlain, Iurii Zaikin,
	andi.kleen, tim.c.chen, dave.hansen, ying.huang, linux-mm,
	linux-kernel
  Cc: Feng Tang

percpu_counter_sum_positive() will provide more accurate info.

Its time cost is about 800 nanoseconds on a 2C/4T platform and
2~3 microseconds on a 2S/36C/72T server in normal case, and in
worst case where vm_committed_as's spinlock is under severe
contention, it costs 30~40 microseconds for the 2S/36C/72T sever,
which should be fine for its two users: /proc/meminfo and HV ballon
driver's status trace per second.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 988d11e..3de78e9 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -774,7 +774,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	return percpu_counter_sum_positive(&vm_committed_as);
 }
 EXPORT_SYMBOL_GPL(vm_memory_committed);
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-08  7:25 [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
  2020-05-08  7:25 ` [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
  2020-05-08  7:25 ` [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate Feng Tang
@ 2020-05-08  7:25 ` Feng Tang
  2020-05-08 11:24   ` Matthew Wilcox
                     ` (2 more replies)
  2020-05-21 21:27 ` [PATCH 0/3] make vm_committed_as_batch aware of " Qian Cai
  3 siblings, 3 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-08  7:25 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Luis Chamberlain, Iurii Zaikin,
	andi.kleen, tim.c.chen, dave.hansen, ying.huang, linux-mm,
	linux-kernel
  Cc: Feng Tang

When checking a performance change for will-it-scale scalability
mmap test [1], we found very high lock contention for spinlock of
percpu counter 'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

Actually this heavy lock contention is not always necessary. The
'vm_committed_as' needs to be very precise when the strict
OVERCOMMIT_NEVER policy is set, which requires a rather small batch
number for the percpu counter.

So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
when the policy is reconfigured.

Benchmark with the same testcase in [1] shows 53% improvement on a
8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
for some platforms, due to the test mmap size of the case is bigger
than the batch number computed, though the patch will help mmap/munmap
generally.

[1] https://lkml.org/lkml/2020/3/5/57

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/mm.h   |  2 ++
 include/linux/mman.h |  4 ++++
 kernel/sysctl.c      |  2 +-
 mm/mm_init.c         | 19 +++++++++++++++----
 mm/util.c            | 13 +++++++++++++
 5 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a32342..bc3722f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
+extern int overcommit_policy_handler(struct ctl_table *, int, void __user *,
+				    size_t *, loff_t *);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c..91c93c1 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
 
 #ifdef CONFIG_SMP
 extern s32 vm_committed_as_batch;
+extern void mm_compute_batch(void);
 #else
 #define vm_committed_as_batch 0
+static inline void mm_compute_batch(void)
+{
+}
 #endif
 
 unsigned long vm_memory_committed(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8..6fa552d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1278,7 +1278,7 @@ static struct ctl_table vm_table[] = {
 		.data		= &sysctl_overcommit_memory,
 		.maxlen		= sizeof(sysctl_overcommit_memory),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= overcommit_policy_handler,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7da6991..1654358 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -13,6 +13,7 @@
 #include <linux/memory.h>
 #include <linux/notifier.h>
 #include <linux/sched.h>
+#include <linux/mman.h>
 #include "internal.h"
 
 #ifdef CONFIG_DEBUG_MEMORY_INIT
@@ -140,16 +141,26 @@ EXPORT_SYMBOL_GPL(mm_kobj);
 #ifdef CONFIG_SMP
 s32 vm_committed_as_batch = 32;
 
-static void __meminit mm_compute_batch(void)
+void mm_compute_batch(void)
 {
 	u64 memsized_batch;
 	s32 nr = num_present_cpus();
 	s32 batch = max_t(s32, nr*2, 32);
-
-	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
-	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
+	unsigned long ram_pages = totalram_pages();
+
+	/*
+	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
+	 * of (total memory/#cpus), and lift it to 6.25% for other
+	 * policies to easy the possible lock contention for percpu_counter
+	 * vm_committed_as, while the max limit is INT_MAX
+	 */
+	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
+	else
+		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);
 
 	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
+	printk("vm_committed_as_batch = %d\n", vm_committed_as_batch);
 }
 
 static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
diff --git a/mm/util.c b/mm/util.c
index 3de78e9..99936d3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -729,6 +729,19 @@ int overcommit_ratio_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int overcommit_policy_handler(struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp,
+			     loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret == 0 && write)
+		mm_compute_batch();
+
+	return ret;
+}
+
 int overcommit_kbytes_handler(struct ctl_table *table, int write,
 			     void __user *buffer, size_t *lenp,
 			     loff_t *ppos)
-- 
2.7.4


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-08  7:25 ` [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
@ 2020-05-08 11:24   ` Matthew Wilcox
  2020-05-08 12:33     ` Feng Tang
  2020-05-15  7:41   ` Michal Hocko
  2020-05-15  7:44   ` Michal Hocko
  2 siblings, 1 reply; 34+ messages in thread
From: Matthew Wilcox @ 2020-05-08 11:24 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, andi.kleen,
	tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri, May 08, 2020 at 03:25:17PM +0800, Feng Tang wrote:
> +void mm_compute_batch(void)
>  {
>  	u64 memsized_batch;
>  	s32 nr = num_present_cpus();
>  	s32 batch = max_t(s32, nr*2, 32);
> -
> -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> +	unsigned long ram_pages = totalram_pages();
> +
> +	/*
> +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> +	 * of (total memory/#cpus), and lift it to 6.25% for other
> +	 * policies to easy the possible lock contention for percpu_counter
> +	 * vm_committed_as, while the max limit is INT_MAX
> +	 */
> +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> +	else
> +		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);
>  
>  	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
> +	printk("vm_committed_as_batch = %d\n", vm_committed_as_batch);
>  }

You left a debugging printk in ...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-08 11:24   ` Matthew Wilcox
@ 2020-05-08 12:33     ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-08 12:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri, May 08, 2020 at 07:24:48PM +0800, Matthew Wilcox wrote:
> On Fri, May 08, 2020 at 03:25:17PM +0800, Feng Tang wrote:
> > +void mm_compute_batch(void)
> >  {
> >  	u64 memsized_batch;
> >  	s32 nr = num_present_cpus();
> >  	s32 batch = max_t(s32, nr*2, 32);
> > -
> > -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> > -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> > +	unsigned long ram_pages = totalram_pages();
> > +
> > +	/*
> > +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> > +	 * of (total memory/#cpus), and lift it to 6.25% for other
> > +	 * policies to easy the possible lock contention for percpu_counter
> > +	 * vm_committed_as, while the max limit is INT_MAX
> > +	 */
> > +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> > +	else
> > +		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);
> >  
> >  	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
> > +	printk("vm_committed_as_batch = %d\n", vm_committed_as_batch);
> >  }
> 
> You left a debugging printk in ...

Yep, my bad, thanks for the catching, will remove it.

- Feng

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as
  2020-05-08  7:25 ` [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
@ 2020-05-15  7:22   ` Michal Hocko
  0 siblings, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2020-05-15  7:22 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, andi.kleen,
	tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri 08-05-20 15:25:15, Feng Tang wrote:
> Use the existing vm_memory_committed() instead, which is also
> convenient for future change.
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  fs/proc/meminfo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 8c1f1bb..578c0b8 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -42,7 +42,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  
>  	si_meminfo(&i);
>  	si_swapinfo(&i);
> -	committed = percpu_counter_read_positive(&vm_committed_as);
> +	committed = vm_memory_committed();
>  
>  	cached = global_node_page_state(NR_FILE_PAGES) -
>  			total_swapcache_pages() - i.bufferram;
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate
  2020-05-08  7:25 ` [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate Feng Tang
@ 2020-05-15  7:23   ` Michal Hocko
  2020-05-15  8:11     ` Feng Tang
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-05-15  7:23 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, andi.kleen,
	tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri 08-05-20 15:25:16, Feng Tang wrote:
> percpu_counter_sum_positive() will provide more accurate info.

Why do we need that?

> Its time cost is about 800 nanoseconds on a 2C/4T platform and
> 2~3 microseconds on a 2S/36C/72T server in normal case, and in
> worst case where vm_committed_as's spinlock is under severe
> contention, it costs 30~40 microseconds for the 2S/36C/72T sever,
> which should be fine for its two users: /proc/meminfo and HV ballon
> driver's status trace per second.

OK, this explains that the additional overhead is not terrible but there
is no actual information on why the imprecision is a problem.

> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/util.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 988d11e..3de78e9 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -774,7 +774,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
>   */
>  unsigned long vm_memory_committed(void)
>  {
> -	return percpu_counter_read_positive(&vm_committed_as);
> +	return percpu_counter_sum_positive(&vm_committed_as);
>  }
>  EXPORT_SYMBOL_GPL(vm_memory_committed);
>  
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-08  7:25 ` [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  2020-05-08 11:24   ` Matthew Wilcox
@ 2020-05-15  7:41   ` Michal Hocko
  2020-05-15  8:02     ` Feng Tang
  2020-05-15  7:44   ` Michal Hocko
  2 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-05-15  7:41 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, andi.kleen,
	tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri 08-05-20 15:25:17, Feng Tang wrote:
> When checking a performance change for will-it-scale scalability
> mmap test [1], we found very high lock contention for spinlock of
> percpu counter 'vm_committed_as':
> 
>     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
>     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
>     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> 
> Actually this heavy lock contention is not always necessary. The
> 'vm_committed_as' needs to be very precise when the strict
> OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> number for the percpu counter.
> 
> So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> when the policy is reconfigured.

Increasing the batch size for weaker overcommit modes makes sense. But
your patch is also tuning OVERCOMMIT_NEVER without any explanation why
that is still "small enough to be precise".

> Benchmark with the same testcase in [1] shows 53% improvement on a
> 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> for some platforms, due to the test mmap size of the case is bigger
> than the batch number computed, though the patch will help mmap/munmap
> generally.
> 
> [1] https://lkml.org/lkml/2020/3/5/57

Please do not use lkml.org links in the changelog. Use
http://lkml.kernel.org/r/$msg instead.

> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  include/linux/mm.h   |  2 ++
>  include/linux/mman.h |  4 ++++
>  kernel/sysctl.c      |  2 +-
>  mm/mm_init.c         | 19 +++++++++++++++----
>  mm/util.c            | 13 +++++++++++++
>  5 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5a32342..bc3722f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -205,6 +205,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
>  				    size_t *, loff_t *);
>  extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
>  				    size_t *, loff_t *);
> +extern int overcommit_policy_handler(struct ctl_table *, int, void __user *,
> +				    size_t *, loff_t *);
>  
>  #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
>  
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 4b08e9c..91c93c1 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
>  
>  #ifdef CONFIG_SMP
>  extern s32 vm_committed_as_batch;
> +extern void mm_compute_batch(void);
>  #else
>  #define vm_committed_as_batch 0
> +static inline void mm_compute_batch(void)
> +{
> +}
>  #endif
>  
>  unsigned long vm_memory_committed(void);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8..6fa552d 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1278,7 +1278,7 @@ static struct ctl_table vm_table[] = {
>  		.data		= &sysctl_overcommit_memory,
>  		.maxlen		= sizeof(sysctl_overcommit_memory),
>  		.mode		= 0644,
> -		.proc_handler	= proc_dointvec_minmax,
> +		.proc_handler	= overcommit_policy_handler,
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= &two,
>  	},
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 7da6991..1654358 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -13,6 +13,7 @@
>  #include <linux/memory.h>
>  #include <linux/notifier.h>
>  #include <linux/sched.h>
> +#include <linux/mman.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_DEBUG_MEMORY_INIT
> @@ -140,16 +141,26 @@ EXPORT_SYMBOL_GPL(mm_kobj);
>  #ifdef CONFIG_SMP
>  s32 vm_committed_as_batch = 32;
>  
> -static void __meminit mm_compute_batch(void)
> +void mm_compute_batch(void)
>  {
>  	u64 memsized_batch;
>  	s32 nr = num_present_cpus();
>  	s32 batch = max_t(s32, nr*2, 32);
> -
> -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> +	unsigned long ram_pages = totalram_pages();
> +
> +	/*
> +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> +	 * of (total memory/#cpus), and lift it to 6.25% for other
> +	 * policies to easy the possible lock contention for percpu_counter
> +	 * vm_committed_as, while the max limit is INT_MAX
> +	 */
> +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> +	else
> +		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);
>  
>  	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
> +	printk("vm_committed_as_batch = %d\n", vm_committed_as_batch);
>  }
>  
>  static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
> diff --git a/mm/util.c b/mm/util.c
> index 3de78e9..99936d3 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -729,6 +729,19 @@ int overcommit_ratio_handler(struct ctl_table *table, int write,
>  	return ret;
>  }
>  
> +int overcommit_policy_handler(struct ctl_table *table, int write,
> +			     void __user *buffer, size_t *lenp,
> +			     loff_t *ppos)
> +{
> +	int ret;
> +
> +	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> +	if (ret == 0 && write)
> +		mm_compute_batch();
> +
> +	return ret;
> +}
> +
>  int overcommit_kbytes_handler(struct ctl_table *table, int write,
>  			     void __user *buffer, size_t *lenp,
>  			     loff_t *ppos)
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-08  7:25 ` [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  2020-05-08 11:24   ` Matthew Wilcox
  2020-05-15  7:41   ` Michal Hocko
@ 2020-05-15  7:44   ` Michal Hocko
  2020-05-15  8:38     ` Feng Tang
  2 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-05-15  7:44 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, andi.kleen,
	tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri 08-05-20 15:25:17, Feng Tang wrote:
> When checking a performance change for will-it-scale scalability
> mmap test [1], we found very high lock contention for spinlock of
> percpu counter 'vm_committed_as':

Btw. you are focusing on a microbenchmark here but I believe that there
are non-synthetic worklaods which would benefit from a larger batch.
E.g. large in memory databases which do large mmaps during startups
from multiple threads.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-15  7:41   ` Michal Hocko
@ 2020-05-15  8:02     ` Feng Tang
  2020-05-15  9:08       ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-15  8:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

Hi Michal,

Thanks for the thorough reviews for these 3 patches!

On Fri, May 15, 2020 at 03:41:25PM +0800, Michal Hocko wrote:
> On Fri 08-05-20 15:25:17, Feng Tang wrote:
> > When checking a performance change for will-it-scale scalability
> > mmap test [1], we found very high lock contention for spinlock of
> > percpu counter 'vm_committed_as':
> > 
> >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > 
> > Actually this heavy lock contention is not always necessary. The
> > 'vm_committed_as' needs to be very precise when the strict
> > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > number for the percpu counter.
> > 
> > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > when the policy is reconfigured.
> 
> Increasing the batch size for weaker overcommit modes makes sense. But
> your patch is also tuning OVERCOMMIT_NEVER without any explanation why
> that is still "small enough to be precise".

Actually, it keeps the batch algorithm for "OVERCOMMIT_NEVER", but
change the other 2 policies, which I should set it clear in the
commit log.

> > Benchmark with the same testcase in [1] shows 53% improvement on a
> > 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> > for some platforms, due to the test mmap size of the case is bigger
> > than the batch number computed, though the patch will help mmap/munmap
> > generally.
> > 
> > [1] https://lkml.org/lkml/2020/3/5/57
> 
> Please do not use lkml.org links in the changelog. Use
> http://lkml.kernel.org/r/$msg instead.

Thanks, will keep that in mind for this and future patches.

> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> >  s32 vm_committed_as_batch = 32;
> >  
> > -static void __meminit mm_compute_batch(void)
> > +void mm_compute_batch(void)
> >  {
> >  	u64 memsized_batch;
> >  	s32 nr = num_present_cpus();
> >  	s32 batch = max_t(s32, nr*2, 32);
> > -
> > -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> > -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> > +	unsigned long ram_pages = totalram_pages();
> > +
> > +	/*
> > +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> > +	 * of (total memory/#cpus), and lift it to 6.25% for other
> > +	 * policies to easy the possible lock contention for percpu_counter
> > +	 * vm_committed_as, while the max limit is INT_MAX
> > +	 */
> > +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> > +	else
> > +		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);

Also as you mentioned there are real-world work loads with big mmap
size and multi-threading, can we lift it even further ?
	memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX)

Thanks,
Feng


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate
  2020-05-15  7:23   ` Michal Hocko
@ 2020-05-15  8:11     ` Feng Tang
  2020-05-15  9:04       ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-15  8:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri, May 15, 2020 at 03:23:07PM +0800, Michal Hocko wrote:
> On Fri 08-05-20 15:25:16, Feng Tang wrote:
> > percpu_counter_sum_positive() will provide more accurate info.
> 
> Why do we need that?

This is a preparation for patch 3/3, which will enlarge the batch size
of percpu-counter 'vm_committed_as'. Andi Kleen concerned that may
hurt the accuracy for those readers (currently only /proc/meminfo
and hyperV balloon drivers for status check)

As in worst case, the deviation could be 'batch * nr_cpus', and
with patch 3/3, that could be totalram_pages/16.

Thanks,
Feng

> > Its time cost is about 800 nanoseconds on a 2C/4T platform and
> > 2~3 microseconds on a 2S/36C/72T server in normal case, and in
> > worst case where vm_committed_as's spinlock is under severe
> > contention, it costs 30~40 microseconds for the 2S/36C/72T sever,
> > which should be fine for its two users: /proc/meminfo and HV ballon
> > driver's status trace per second.
> 
> OK, this explains that the additional overhead is not terrible but there
> is no actual information on why the imprecision is a problem.
> 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  mm/util.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/util.c b/mm/util.c
> > index 988d11e..3de78e9 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -774,7 +774,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
> >   */
> >  unsigned long vm_memory_committed(void)
> >  {
> > -	return percpu_counter_read_positive(&vm_committed_as);
> > +	return percpu_counter_sum_positive(&vm_committed_as);
> >  }
> >  EXPORT_SYMBOL_GPL(vm_memory_committed);
> >  
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-15  7:44   ` Michal Hocko
@ 2020-05-15  8:38     ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-15  8:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri, May 15, 2020 at 03:44:43PM +0800, Michal Hocko wrote:
> On Fri 08-05-20 15:25:17, Feng Tang wrote:
> > When checking a performance change for will-it-scale scalability
> > mmap test [1], we found very high lock contention for spinlock of
> > percpu counter 'vm_committed_as':
> 
> Btw. you are focusing on a microbenchmark here but I believe that there
> are non-synthetic worklaods which would benefit from a larger batch.
> E.g. large in memory databases which do large mmaps during startups
> from multiple threads.

I imagined cases like this too :) but was not sure there are such cases.
So I only put down the the benchmark I've worked with in 0day.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate
  2020-05-15  8:11     ` Feng Tang
@ 2020-05-15  9:04       ` Michal Hocko
  2020-05-15 13:01         ` Feng Tang
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-05-15  9:04 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri 15-05-20 16:11:09, Feng Tang wrote:
> On Fri, May 15, 2020 at 03:23:07PM +0800, Michal Hocko wrote:
> > On Fri 08-05-20 15:25:16, Feng Tang wrote:
> > > percpu_counter_sum_positive() will provide more accurate info.
> > 
> > Why do we need that?
> 
> This is a preparation for patch 3/3, which will enlarge the batch size
> of percpu-counter 'vm_committed_as'. Andi Kleen concerned that may
> hurt the accuracy for those readers (currently only /proc/meminfo
> and hyperV balloon drivers for status check)
> 
> As in worst case, the deviation could be 'batch * nr_cpus', and
> with patch 3/3, that could be totalram_pages/16.

This information belongs to the changelog.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-15  8:02     ` Feng Tang
@ 2020-05-15  9:08       ` Michal Hocko
  2020-05-15 11:26         ` Feng Tang
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-05-15  9:08 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri 15-05-20 16:02:10, Feng Tang wrote:
> Hi Michal,
> 
> Thanks for the thorough reviews for these 3 patches!
> 
> On Fri, May 15, 2020 at 03:41:25PM +0800, Michal Hocko wrote:
> > On Fri 08-05-20 15:25:17, Feng Tang wrote:
> > > When checking a performance change for will-it-scale scalability
> > > mmap test [1], we found very high lock contention for spinlock of
> > > percpu counter 'vm_committed_as':
> > > 
> > >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> > >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> > >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > > 
> > > Actually this heavy lock contention is not always necessary. The
> > > 'vm_committed_as' needs to be very precise when the strict
> > > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > > number for the percpu counter.
> > > 
> > > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > > when the policy is reconfigured.
> > 
> > Increasing the batch size for weaker overcommit modes makes sense. But
> > your patch is also tuning OVERCOMMIT_NEVER without any explanation why
> > that is still "small enough to be precise".
> 
> Actually, it keeps the batch algorithm for "OVERCOMMIT_NEVER", but
> change the other 2 policies, which I should set it clear in the
> commit log.

Yeah, I have misread that part. Sorry about that.

[...]
> > > +void mm_compute_batch(void)
> > >  {
> > >  	u64 memsized_batch;
> > >  	s32 nr = num_present_cpus();
> > >  	s32 batch = max_t(s32, nr*2, 32);
> > > -
> > > -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> > > -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> > > +	unsigned long ram_pages = totalram_pages();
> > > +
> > > +	/*
> > > +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> > > +	 * of (total memory/#cpus), and lift it to 6.25% for other
> > > +	 * policies to easy the possible lock contention for percpu_counter
> > > +	 * vm_committed_as, while the max limit is INT_MAX
> > > +	 */
> > > +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > > +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> > > +	else
> > > +		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);
> 
> Also as you mentioned there are real-world work loads with big mmap
> size and multi-threading, can we lift it even further ?
> 	memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX)

Try to measure those and see what numbers look like.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-15  9:08       ` Michal Hocko
@ 2020-05-15 11:26         ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-15 11:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri, May 15, 2020 at 05:08:21PM +0800, Michal Hocko wrote:
> > > > +void mm_compute_batch(void)
> > > >  {
> > > >  	u64 memsized_batch;
> > > >  	s32 nr = num_present_cpus();
> > > >  	s32 batch = max_t(s32, nr*2, 32);
> > > > -
> > > > -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> > > > -	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
> > > > +	unsigned long ram_pages = totalram_pages();
> > > > +
> > > > +	/*
> > > > +	 * For policy of OVERCOMMIT_NEVER, set batch size to 0.4%
> > > > +	 * of (total memory/#cpus), and lift it to 6.25% for other
> > > > +	 * policies to easy the possible lock contention for percpu_counter
> > > > +	 * vm_committed_as, while the max limit is INT_MAX
> > > > +	 */
> > > > +	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > > > +		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
> > > > +	else
> > > > +		memsized_batch = min_t(u64, ram_pages/nr/16, INT_MAX);
> > 
> > Also as you mentioned there are real-world work loads with big mmap
> > size and multi-threading, can we lift it even further ?
> > 	memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX)
> 
> Try to measure those and see what numbers look like.

With the same benchmark, for the 16X lifting in this patch, about
1/3 of the test platforms (servers, desktops, laptops) will show
improvements (up to 20X for servers, much less on platform with
fewer CPUs). 

If we further lift it to 64X, most of the test platforms will
show improvements. 

Thanks,
Feng

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate
  2020-05-15  9:04       ` Michal Hocko
@ 2020-05-15 13:01         ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-15 13:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	Kees Cook, Luis Chamberlain, Iurii Zaikin, Kleen, Andi, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Fri, May 15, 2020 at 05:04:40PM +0800, Michal Hocko wrote:
> On Fri 15-05-20 16:11:09, Feng Tang wrote:
> > On Fri, May 15, 2020 at 03:23:07PM +0800, Michal Hocko wrote:
> > > On Fri 08-05-20 15:25:16, Feng Tang wrote:
> > > > percpu_counter_sum_positive() will provide more accurate info.
> > > 
> > > Why do we need that?
> > 
> > This is a preparation for patch 3/3, which will enlarge the batch size
> > of percpu-counter 'vm_committed_as'. Andi Kleen concerned that may
> > hurt the accuracy for those readers (currently only /proc/meminfo
> > and hyperV balloon drivers for status check)
> > 
> > As in worst case, the deviation could be 'batch * nr_cpus', and
> > with patch 3/3, that could be totalram_pages/16.
> 
> This information belongs to the changelog.

Will add it in v3, thanks

- Feng

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-08  7:25 [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
                   ` (2 preceding siblings ...)
  2020-05-08  7:25 ` [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
@ 2020-05-21 21:27 ` Qian Cai
  2020-05-26 18:14   ` Qian Cai
  3 siblings, 1 reply; 34+ messages in thread
From: Qian Cai @ 2020-05-21 21:27 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, Kees Cook, Luis Chamberlain, Iurii Zaikin,
	andi.kleen, tim.c.chen, dave.hansen, ying.huang, linux-mm,
	linux-kernel

On Fri, May 08, 2020 at 03:25:14PM +0800, Feng Tang wrote:
> When checking a performance change for will-it-scale scalability
> mmap test [1], we found very high lock contention for spinlock of
> percpu counter 'vm_committed_as':
> 
>     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
>     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
>     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> 
> Actually this heavy lock contention is not always necessary. The
> 'vm_committed_as' needs to be very precise when the strict
> OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> number for the percpu counter.
> 
> So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> when the policy is reconfigured.
> 
> Benchmark with the same testcase in [1] shows 53% improvement on a
> 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> for some platforms, due to the test mmap size of the case is bigger
> than the batch number computed, though the patch will help mmap/munmap
> generally.
> 
> [1] https://lkml.org/lkml/2020/3/5/57
> 
> There are some style complain from checkpatch for patch 3, as it
> follows the similar format of sibling functions
> 
> patch1: a cleanup for /proc/meminfo
> patch2: a preparation patch which also improve the accuracy of
>         vm_memory_committed
> patch3: the main change
> 
> Please help to review, thanks!

Reverted this series fixed a warning under memory pressue.

[ 3319.257898] LTP: starting oom01
[ 3319.284417] ------------[ cut here ]------------
[ 3319.284439] memory commitment underflow
[ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
[ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
[ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
[ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
[ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
[ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
[ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
               GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
               GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
               GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
               GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
               GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
               GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
               GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
               GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
[ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
[ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
[ 3319.284872] Call Trace:
[ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
[ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
[ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
[ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
[ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
[ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
[ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
[ 3319.285093] Instruction dump:
[ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
[ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 
[ 3319.285166] irq event stamp: 0
[ 3319.285183] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[ 3319.285195] hardirqs last disabled at (0): [<c0000000001154c8>] copy_process+0x6f8/0x1a00
[ 3319.285216] softirqs last  enabled at (0): [<c0000000001154c8>] copy_process+0x6f8/0x1a00
[ 3319.285238] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 3319.285267] ---[ end trace 115e3f7fbb0653c2 ]---

> 
> - Feng
> 
> ----------------------------------------------------------------
> Changelog:
>   v2:
>      * add the sysctl handler to cover runtime overcommit policy
>        change, as suggested by Andres Morton 
>      * address the accuracy concern of vm_memory_committed()
>        from Andi Kleen 
> 
> Feng Tang (3):
>   proc/meminfo: avoid open coded reading of vm_committed_as
>   mm/util.c: make vm_memory_committed() more accurate
>   mm: adjust vm_committed_as_batch according to vm overcommit policy
> 
>  fs/proc/meminfo.c    |  2 +-
>  include/linux/mm.h   |  2 ++
>  include/linux/mman.h |  4 ++++
>  kernel/sysctl.c      |  2 +-
>  mm/mm_init.c         | 19 +++++++++++++++----
>  mm/util.c            | 15 ++++++++++++++-
>  6 files changed, 37 insertions(+), 7 deletions(-)
> 
> -- 
> 2.7.4
> 
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-21 21:27 ` [PATCH 0/3] make vm_committed_as_batch aware of " Qian Cai
@ 2020-05-26 18:14   ` Qian Cai
  2020-05-27  1:14     ` Andi Kleen
  2020-05-27  1:46     ` Feng Tang
  0 siblings, 2 replies; 34+ messages in thread
From: Qian Cai @ 2020-05-26 18:14 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Thu, May 21, 2020 at 05:27:26PM -0400, Qian Cai wrote:
> On Fri, May 08, 2020 at 03:25:14PM +0800, Feng Tang wrote:
> > When checking a performance change for will-it-scale scalability
> > mmap test [1], we found very high lock contention for spinlock of
> > percpu counter 'vm_committed_as':
> > 
> >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > 
> > Actually this heavy lock contention is not always necessary. The
> > 'vm_committed_as' needs to be very precise when the strict
> > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > number for the percpu counter.
> > 
> > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > when the policy is reconfigured.
> > 
> > Benchmark with the same testcase in [1] shows 53% improvement on a
> > 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> > for some platforms, due to the test mmap size of the case is bigger
> > than the batch number computed, though the patch will help mmap/munmap
> > generally.
> > 
> > [1] https://lkml.org/lkml/2020/3/5/57
> > 
> > There are some style complain from checkpatch for patch 3, as it
> > follows the similar format of sibling functions
> > 
> > patch1: a cleanup for /proc/meminfo
> > patch2: a preparation patch which also improve the accuracy of
> >         vm_memory_committed
> > patch3: the main change
> > 
> > Please help to review, thanks!
> 
> Reverted this series fixed a warning under memory pressue.

Andrew, Stephen, can you drop this series?

> 
> [ 3319.257898] LTP: starting oom01
> [ 3319.284417] ------------[ cut here ]------------
> [ 3319.284439] memory commitment underflow
> [ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
> [ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> [ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
> [ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
> [ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
> [ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
> [ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
>                GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
>                GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
>                GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
>                GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
>                GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
>                GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
>                GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
>                GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
> [ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
> [ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
> [ 3319.284872] Call Trace:
> [ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
> [ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
> [ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
> [ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
> [ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
> [ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> [ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
> [ 3319.285093] Instruction dump:
> [ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
> [ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 
> [ 3319.285166] irq event stamp: 0
> [ 3319.285183] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [ 3319.285195] hardirqs last disabled at (0): [<c0000000001154c8>] copy_process+0x6f8/0x1a00
> [ 3319.285216] softirqs last  enabled at (0): [<c0000000001154c8>] copy_process+0x6f8/0x1a00
> [ 3319.285238] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [ 3319.285267] ---[ end trace 115e3f7fbb0653c2 ]---
> 
> > 
> > - Feng
> > 
> > ----------------------------------------------------------------
> > Changelog:
> >   v2:
> >      * add the sysctl handler to cover runtime overcommit policy
> >        change, as suggested by Andres Morton 
> >      * address the accuracy concern of vm_memory_committed()
> >        from Andi Kleen 
> > 
> > Feng Tang (3):
> >   proc/meminfo: avoid open coded reading of vm_committed_as
> >   mm/util.c: make vm_memory_committed() more accurate
> >   mm: adjust vm_committed_as_batch according to vm overcommit policy
> > 
> >  fs/proc/meminfo.c    |  2 +-
> >  include/linux/mm.h   |  2 ++
> >  include/linux/mman.h |  4 ++++
> >  kernel/sysctl.c      |  2 +-
> >  mm/mm_init.c         | 19 +++++++++++++++----
> >  mm/util.c            | 15 ++++++++++++++-
> >  6 files changed, 37 insertions(+), 7 deletions(-)
> > 
> > -- 
> > 2.7.4
> > 
> > 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-26 18:14   ` Qian Cai
@ 2020-05-27  1:14     ` Andi Kleen
  2020-05-27  1:43       ` Feng Tang
  2020-05-27  1:46     ` Feng Tang
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2020-05-27  1:14 UTC (permalink / raw)
  To: Qian Cai
  Cc: Feng Tang, Andrew Morton, Michal Hocko, Johannes Weiner,
	Stephen Rothwell, Matthew Wilcox, Mel Gorman, Kees Cook,
	Luis Chamberlain, Iurii Zaikin, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel

On Tue, May 26, 2020 at 02:14:59PM -0400, Qian Cai wrote:
> On Thu, May 21, 2020 at 05:27:26PM -0400, Qian Cai wrote:
> > On Fri, May 08, 2020 at 03:25:14PM +0800, Feng Tang wrote:
> > > When checking a performance change for will-it-scale scalability
> > > mmap test [1], we found very high lock contention for spinlock of
> > > percpu counter 'vm_committed_as':
> > > 
> > >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> > >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> > >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > > 
> > > Actually this heavy lock contention is not always necessary. The
> > > 'vm_committed_as' needs to be very precise when the strict
> > > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > > number for the percpu counter.
> > > 
> > > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > > when the policy is reconfigured.
> > > 
> > > Benchmark with the same testcase in [1] shows 53% improvement on a
> > > 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> > > for some platforms, due to the test mmap size of the case is bigger
> > > than the batch number computed, though the patch will help mmap/munmap
> > > generally.
> > > 
> > > [1] https://lkml.org/lkml/2020/3/5/57
> > > 
> > > There are some style complain from checkpatch for patch 3, as it
> > > follows the similar format of sibling functions
> > > 
> > > patch1: a cleanup for /proc/meminfo
> > > patch2: a preparation patch which also improve the accuracy of
> > >         vm_memory_committed
> > > patch3: the main change
> > > 
> > > Please help to review, thanks!
> > 
> > Reverted this series fixed a warning under memory pressue.
> 
> Andrew, Stephen, can you drop this series?

Hope you're happy now.

The warning is just not needed for the always case. The whole point
of the patch was to maintain these counters only approximately.
We could of course use _sum or _sum_positive in the warning, but that would
just be unnecessary overhead in this moderately hot path.

So this patch should fix it: (untested)

Possibly might need to move it even further down because the patch
also affected the _GUESS case.


diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..fa78f90e29a1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -798,10 +798,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long allowed;
 
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus(),
-			"memory commitment underflow");
-
 	vm_acct_memory(pages);
 
 	/*
@@ -810,6 +806,10 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
 		return 0;
 
+	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
+			-(s64)vm_committed_as_batch * num_online_cpus(),
+			"memory commitment underflow");
+
 	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
 		if (pages > totalram_pages() + total_swap_pages)
 			goto error;




> 
> > 
> > [ 3319.257898] LTP: starting oom01
> > [ 3319.284417] ------------[ cut here ]------------
> > [ 3319.284439] memory commitment underflow
> > [ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
> > [ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> > [ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
> > [ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
> > [ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
> > [ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
> > [ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
> >                GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
> >                GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
> >                GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
> >                GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
> >                GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
> >                GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
> >                GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
> >                GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
> > [ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
> > [ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
> > [ 3319.284872] Call Trace:
> > [ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
> > [ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
> > [ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
> > [ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
> > [ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
> > [ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> > [ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
> > [ 3319.285093] Instruction dump:
> > [ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
> > [ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 
> > [ 3319.285166] irq event stamp: 0
> > [ 3319.285183] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> > [ 3319.285195] hardirqs last disabled at (0): [<c0000000001154c8>] copy_process+0x6f8/0x1a00
> > [ 3319.285216] softirqs last  enabled at (0): [<c0000000001154c8>] copy_process+0x6f8/0x1a00
> > [ 3319.285238] softirqs last disabled at (0): [<0000000000000000>] 0x0
> > [ 3319.285267] ---[ end trace 115e3f7fbb0653c2 ]---
> > 
> > > 
> > > - Feng
> > > 
> > > ----------------------------------------------------------------
> > > Changelog:
> > >   v2:
> > >      * add the sysctl handler to cover runtime overcommit policy
> > >        change, as suggested by Andres Morton 
> > >      * address the accuracy concern of vm_memory_committed()
> > >        from Andi Kleen 
> > > 
> > > Feng Tang (3):
> > >   proc/meminfo: avoid open coded reading of vm_committed_as
> > >   mm/util.c: make vm_memory_committed() more accurate
> > >   mm: adjust vm_committed_as_batch according to vm overcommit policy
> > > 
> > >  fs/proc/meminfo.c    |  2 +-
> > >  include/linux/mm.h   |  2 ++
> > >  include/linux/mman.h |  4 ++++
> > >  kernel/sysctl.c      |  2 +-
> > >  mm/mm_init.c         | 19 +++++++++++++++----
> > >  mm/util.c            | 15 ++++++++++++++-
> > >  6 files changed, 37 insertions(+), 7 deletions(-)
> > > 
> > > -- 
> > > 2.7.4
> > > 
> > > 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27  1:14     ` Andi Kleen
@ 2020-05-27  1:43       ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-27  1:43 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Qian Cai, Andrew Morton, Michal Hocko, Johannes Weiner,
	Stephen Rothwell, Matthew Wilcox, Mel Gorman, Kees Cook,
	Luis Chamberlain, Iurii Zaikin, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel

On Tue, May 26, 2020 at 06:14:13PM -0700, Andi Kleen wrote:
> On Tue, May 26, 2020 at 02:14:59PM -0400, Qian Cai wrote:
> > On Thu, May 21, 2020 at 05:27:26PM -0400, Qian Cai wrote:
> > > On Fri, May 08, 2020 at 03:25:14PM +0800, Feng Tang wrote:
> > > > When checking a performance change for will-it-scale scalability
> > > > mmap test [1], we found very high lock contention for spinlock of
> > > > percpu counter 'vm_committed_as':
> > > > 
> > > >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> > > >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> > > >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > > > 
> > > > Actually this heavy lock contention is not always necessary. The
> > > > 'vm_committed_as' needs to be very precise when the strict
> > > > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > > > number for the percpu counter.
> > > > 
> > > > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > > > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > > > when the policy is reconfigured.
> > > > 
> > > > Benchmark with the same testcase in [1] shows 53% improvement on a
> > > > 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> > > > for some platforms, due to the test mmap size of the case is bigger
> > > > than the batch number computed, though the patch will help mmap/munmap
> > > > generally.
> > > > 
> > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > 
> > > > There are some style complain from checkpatch for patch 3, as it
> > > > follows the similar format of sibling functions
> > > > 
> > > > patch1: a cleanup for /proc/meminfo
> > > > patch2: a preparation patch which also improve the accuracy of
> > > >         vm_memory_committed
> > > > patch3: the main change
> > > > 
> > > > Please help to review, thanks!
> > > 
> > > Reverted this series fixed a warning under memory pressue.
> > 
> > Andrew, Stephen, can you drop this series?
> 
> Hope you're happy now.
> 
> The warning is just not needed for the always case. The whole point
> of the patch was to maintain these counters only approximately.
> We could of course use _sum or _sum_positive in the warning, but that would
> just be unnecessary overhead in this moderately hot path.
> 
> So this patch should fix it: (untested)
> 
> Possibly might need to move it even further down because the patch
> also affected the _GUESS case.

Thanks Andi for the patch. And I agree that it could be after _GUESS
case, as existig check for _GUESS is loose:

	if (pages > totalram_pages() + total_swap_pages)
	         goto error;

Thanks,
Feng

> 
> 
> diff --git a/mm/util.c b/mm/util.c
> index 988d11e6c17c..fa78f90e29a1 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -798,10 +798,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  {
>  	long allowed;
>  
> -	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> -			-(s64)vm_committed_as_batch * num_online_cpus(),
> -			"memory commitment underflow");
> -
>  	vm_acct_memory(pages);
>  
>  	/*
> @@ -810,6 +806,10 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
>  		return 0;
>  
> +	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> +			-(s64)vm_committed_as_batch * num_online_cpus(),
> +			"memory commitment underflow");
> +
>  	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
>  		if (pages > totalram_pages() + total_swap_pages)
>  			goto error;

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-26 18:14   ` Qian Cai
  2020-05-27  1:14     ` Andi Kleen
@ 2020-05-27  1:46     ` Feng Tang
  2020-05-27  2:25       ` Qian Cai
  1 sibling, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-27  1:46 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

Hi Qian,

On Tue, May 26, 2020 at 02:14:59PM -0400, Qian Cai wrote:
> On Thu, May 21, 2020 at 05:27:26PM -0400, Qian Cai wrote:
> > On Fri, May 08, 2020 at 03:25:14PM +0800, Feng Tang wrote:
> > > When checking a performance change for will-it-scale scalability
> > > mmap test [1], we found very high lock contention for spinlock of
> > > percpu counter 'vm_committed_as':
> > > 
> > >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> > >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> > >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > > 
> > > Actually this heavy lock contention is not always necessary. The
> > > 'vm_committed_as' needs to be very precise when the strict
> > > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > > number for the percpu counter.
> > > 
> > > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > > when the policy is reconfigured.
> > > 
> > > Benchmark with the same testcase in [1] shows 53% improvement on a
> > > 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> > > for some platforms, due to the test mmap size of the case is bigger
> > > than the batch number computed, though the patch will help mmap/munmap
> > > generally.
> > > 
> > > [1] https://lkml.org/lkml/2020/3/5/57
> > 
> > Reverted this series fixed a warning under memory pressue.
> 
> Andrew, Stephen, can you drop this series?
> 
> > 
> > [ 3319.257898] LTP: starting oom01
> > [ 3319.284417] ------------[ cut here ]------------
> > [ 3319.284439] memory commitment underflow

Thanks for the catch!

Could you share the info about the platform, like the CPU numbers
and RAM size, and what's the mmap test size of your test program.
It would be great if you can point me the link to the test program.

Thanks,
Feng

> > [ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
> > [ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> > [ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
> > [ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
> > [ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
> > [ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
> > [ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
> >                GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
> >                GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
> >                GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
> >                GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
> >                GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
> >                GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
> >                GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
> >                GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
> > [ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
> > [ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
> > [ 3319.284872] Call Trace:
> > [ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
> > [ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
> > [ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
> > [ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
> > [ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
> > [ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> > [ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
> > [ 3319.285093] Instruction dump:
> > [ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
> > [ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27  1:46     ` Feng Tang
@ 2020-05-27  2:25       ` Qian Cai
  2020-05-27 10:46         ` Feng Tang
  0 siblings, 1 reply; 34+ messages in thread
From: Qian Cai @ 2020-05-27  2:25 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Wed, May 27, 2020 at 09:46:47AM +0800, Feng Tang wrote:
> Hi Qian,
> 
> On Tue, May 26, 2020 at 02:14:59PM -0400, Qian Cai wrote:
> > On Thu, May 21, 2020 at 05:27:26PM -0400, Qian Cai wrote:
> > > On Fri, May 08, 2020 at 03:25:14PM +0800, Feng Tang wrote:
> > > > When checking a performance change for will-it-scale scalability
> > > > mmap test [1], we found very high lock contention for spinlock of
> > > > percpu counter 'vm_committed_as':
> > > > 
> > > >     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
> > > >     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
> > > >     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> > > > 
> > > > Actually this heavy lock contention is not always necessary. The
> > > > 'vm_committed_as' needs to be very precise when the strict
> > > > OVERCOMMIT_NEVER policy is set, which requires a rather small batch
> > > > number for the percpu counter.
> > > > 
> > > > So lift the batch number to 16X for OVERCOMMIT_ALWAYS and
> > > > OVERCOMMIT_GUESS policies, and add a sysctl handler to adjust it
> > > > when the policy is reconfigured.
> > > > 
> > > > Benchmark with the same testcase in [1] shows 53% improvement on a
> > > > 8C/16T desktop, and 2097%(20X) on a 4S/72C/144T server. And no change
> > > > for some platforms, due to the test mmap size of the case is bigger
> > > > than the batch number computed, though the patch will help mmap/munmap
> > > > generally.
> > > > 
> > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > 
> > > Reverted this series fixed a warning under memory pressue.
> > 
> > Andrew, Stephen, can you drop this series?
> > 
> > > 
> > > [ 3319.257898] LTP: starting oom01
> > > [ 3319.284417] ------------[ cut here ]------------
> > > [ 3319.284439] memory commitment underflow
> 
> Thanks for the catch!
> 
> Could you share the info about the platform, like the CPU numbers
> and RAM size, and what's the mmap test size of your test program.
> It would be great if you can point me the link to the test program.

I have been reproduced this on both AMD and Intel. The test just
allocating memory and swapping.

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c

It might be better to run the whole LTP mm tests if none of the above
triggers it for you which has quite a few memory pressurers.

/opt/ltp/runltp -f mm

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  12
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-4650 v3 @ 2.10GHz
Stepping:            2
CPU MHz:             1198.498
BogoMIPS:            4195.14
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-5,24-29
NUMA node1 CPU(s):   6-11,30-35
NUMA node2 CPU(s):   12-17,36-41
NUMA node3 CPU(s):   18-23,42-47

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 27768 MB
node 0 free: 27539 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 32225 MB
node 1 free: 32009 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 32253 MB
node 2 free: 32174 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 32252 MB
node 3 free: 31330 MB
node distances:
node   0   1   2   3
  0:  10  21  21  21
  1:  21  10  21  21
  2:  21  21  10  21
  3:  21  21  21  10

> 
> Thanks,
> Feng
> 
> > > [ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
> > > [ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> > > [ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
> > > [ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
> > > [ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
> > > [ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
> > > [ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
> > >                GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
> > >                GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
> > >                GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
> > >                GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
> > >                GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
> > >                GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
> > >                GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
> > >                GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
> > > [ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
> > > [ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
> > > [ 3319.284872] Call Trace:
> > > [ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
> > > [ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
> > > [ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
> > > [ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
> > > [ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
> > > [ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> > > [ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
> > > [ 3319.285093] Instruction dump:
> > > [ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
> > > [ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27  2:25       ` Qian Cai
@ 2020-05-27 10:46         ` Feng Tang
  2020-05-27 12:05           ` Qian Cai
  2020-05-28 14:18           ` Qian Cai
  0 siblings, 2 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-27 10:46 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

Hi Qian,

On Tue, May 26, 2020 at 10:25:39PM -0400, Qian Cai wrote:
> > > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > 
> > > > Reverted this series fixed a warning under memory pressue.
> > > 
> > > Andrew, Stephen, can you drop this series?
> > > 
> > > > 
> > > > [ 3319.257898] LTP: starting oom01
> > > > [ 3319.284417] ------------[ cut here ]------------
> > > > [ 3319.284439] memory commitment underflow
> > 
> > Thanks for the catch!
> > 
> > Could you share the info about the platform, like the CPU numbers
> > and RAM size, and what's the mmap test size of your test program.
> > It would be great if you can point me the link to the test program.
> 
> I have been reproduced this on both AMD and Intel. The test just
> allocating memory and swapping.
> 
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> 
> It might be better to run the whole LTP mm tests if none of the above
> triggers it for you which has quite a few memory pressurers.
> 
> /opt/ltp/runltp -f mm

Thanks for sharing. I tried to reproduce this on 2 server plaforms,
but can't reproduce it, and they are still under testing.

Meanwhile, could you help to try the below patch, which is based on
Andi's suggestion and have some debug info. The warning is a little
strange, as the condition is

	(percpu_counter_read(&vm_committed_as) <
                       -(s64)vm_committed_as_batch * num_online_cpus())

while for your platform (48 CPU + 128 GB RAM), the
'-(s64)vm_committed_as_batch * num_online_cpus()'
is a s64 value: '-32G', which makes the condition hard to be true,
and when it is,  it could be triggered by some magic for s32/s64
operations around the percpu-counter. 

Thanks,
Feng

---
 mm/util.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..2f3a2e0f12ae 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -797,10 +797,15 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
 int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long allowed;
-
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
+	if (percpu_counter_read(&vm_committed_as) <
+			-(s64)vm_committed_as_batch * num_online_cpus()) {
+		printk("as:%lld  as_sum:%lld  check:%lld batch:%d\n",
+			percpu_counter_read(&vm_committed_as),
+			percpu_counter_sum(&vm_committed_as),
 			-(s64)vm_committed_as_batch * num_online_cpus(),
-			"memory commitment underflow");
+			vm_committed_as_batch);
+
+	}
 
 	vm_acct_memory(pages);
 
@@ -816,6 +821,10 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		return 0;
 	}
 
+	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
+			-(s64)vm_committed_as_batch * num_online_cpus(),
+			"memory commitment underflow");
+
 	allowed = vm_commit_limit();
 	/*
 	 * Reserve some for root
-- 
2.14.1

> 
> # lscpu
> Architecture:        x86_64
> CPU op-mode(s):      32-bit, 64-bit
> Byte Order:          Little Endian
> CPU(s):              48
> On-line CPU(s) list: 0-47
> Thread(s) per core:  1
> Core(s) per socket:  12
> Socket(s):           4
> NUMA node(s):        4
> Vendor ID:           GenuineIntel
> CPU family:          6
> Model:               63
> Model name:          Intel(R) Xeon(R) CPU E5-4650 v3 @ 2.10GHz
> Stepping:            2
> CPU MHz:             1198.498
> BogoMIPS:            4195.14
> Virtualization:      VT-x
> L1d cache:           32K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            30720K
> NUMA node0 CPU(s):   0-5,24-29
> NUMA node1 CPU(s):   6-11,30-35
> NUMA node2 CPU(s):   12-17,36-41
> NUMA node3 CPU(s):   18-23,42-47
> 
> # numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
> node 0 size: 27768 MB
> node 0 free: 27539 MB
> node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
> node 1 size: 32225 MB
> node 1 free: 32009 MB
> node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
> node 2 size: 32253 MB
> node 2 free: 32174 MB
> node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
> node 3 size: 32252 MB
> node 3 free: 31330 MB
> node distances:
> node   0   1   2   3
>   0:  10  21  21  21
>   1:  21  10  21  21
>   2:  21  21  10  21
>   3:  21  21  21  10
> 
> > 
> > Thanks,
> > Feng
> > 
> > > > [ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
> > > > [ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> > > > [ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
> > > > [ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
> > > > [ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
> > > > [ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
> > > > [ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
> > > >                GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
> > > >                GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
> > > >                GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
> > > >                GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
> > > >                GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
> > > >                GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
> > > >                GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
> > > >                GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
> > > > [ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
> > > > [ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
> > > > [ 3319.284872] Call Trace:
> > > > [ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
> > > > [ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
> > > > [ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
> > > > [ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
> > > > [ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
> > > > [ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> > > > [ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
> > > > [ 3319.285093] Instruction dump:
> > > > [ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
> > > > [ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27 10:46         ` Feng Tang
@ 2020-05-27 12:05           ` Qian Cai
  2020-05-27 13:33             ` Feng Tang
  2020-05-28 14:18           ` Qian Cai
  1 sibling, 1 reply; 34+ messages in thread
From: Qian Cai @ 2020-05-27 12:05 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Wed, May 27, 2020 at 06:46:06PM +0800, Feng Tang wrote:
> Hi Qian,
> 
> On Tue, May 26, 2020 at 10:25:39PM -0400, Qian Cai wrote:
> > > > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > > 
> > > > > Reverted this series fixed a warning under memory pressue.
> > > > 
> > > > Andrew, Stephen, can you drop this series?
> > > > 
> > > > > 
> > > > > [ 3319.257898] LTP: starting oom01
> > > > > [ 3319.284417] ------------[ cut here ]------------
> > > > > [ 3319.284439] memory commitment underflow
> > > 
> > > Thanks for the catch!
> > > 
> > > Could you share the info about the platform, like the CPU numbers
> > > and RAM size, and what's the mmap test size of your test program.
> > > It would be great if you can point me the link to the test program.
> > 
> > I have been reproduced this on both AMD and Intel. The test just
> > allocating memory and swapping.
> > 
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> > 
> > It might be better to run the whole LTP mm tests if none of the above
> > triggers it for you which has quite a few memory pressurers.
> > 
> > /opt/ltp/runltp -f mm
> 
> Thanks for sharing. I tried to reproduce this on 2 server plaforms,
> but can't reproduce it, and they are still under testing.
> 
> Meanwhile, could you help to try the below patch, which is based on
> Andi's suggestion and have some debug info. The warning is a little
> strange, as the condition is
> 
> 	(percpu_counter_read(&vm_committed_as) <
>                        -(s64)vm_committed_as_batch * num_online_cpus())
> 
> while for your platform (48 CPU + 128 GB RAM), the
> '-(s64)vm_committed_as_batch * num_online_cpus()'
> is a s64 value: '-32G', which makes the condition hard to be true,
> and when it is,  it could be triggered by some magic for s32/s64
> operations around the percpu-counter. 

Here is the information on AMD and powerpc below affected by this. It
could need a bit patient to reproduce, but our usual daily CI would
trigger it eventually after a few tries.

# git clone https://github.com/cailca/linux-mm.git
# cd linux-mm
# ./compile.sh
# systemctl reboot
# ./test.sh

== AMD ==
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           2
NUMA node(s):        8
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7601 32-Core Processor
Stepping:            2
CPU MHz:             3184.974
CPU max MHz:         2200.0000
CPU min MHz:         1200.0000
BogoMIPS:            4391.87
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7,64-71
NUMA node1 CPU(s):   8-15,72-79
NUMA node2 CPU(s):   16-23,80-87
NUMA node3 CPU(s):   24-31,88-95
NUMA node4 CPU(s):   32-39,96-103
NUMA node5 CPU(s):   40-47,104-111
NUMA node6 CPU(s):   48-55,112-119
NUMA node7 CPU(s):   56-63,120-127

# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 20028 MB
node 0 free: 3837 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 31391 MB
node 4 free: 21309 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 0 MB
node 7 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  16  32  32  32  32
  1:  16  10  16  16  32  32  32  32
  2:  16  16  10  16  32  32  32  32
  3:  16  16  16  10  32  32  32  32
  4:  32  32  32  32  10  16  16  16
  5:  32  32  32  32  16  10  16  16
  6:  32  32  32  32  16  16  10  16
  7:  32  32  32  32  16  16  16  10

== powerpc ==
# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  4
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)
Model name:          POWER9 (raw), altivec supported
CPU max MHz:         3800.0000
CPU min MHz:         2300.0000
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node8 CPU(s):   64-127

# numactl -H
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 130138 MB
node 0 free: 125654 MB
node 8 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 8 size: 130781 MB
node 8 free: 124756 MB
node distances:
node   0   8
  0:  10  40
  8:  40  10

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27 12:05           ` Qian Cai
@ 2020-05-27 13:33             ` Feng Tang
  2020-05-27 15:42               ` Qian Cai
  0 siblings, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-27 13:33 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

Hi Qian,

On Wed, May 27, 2020 at 08:05:49AM -0400, Qian Cai wrote:
> On Wed, May 27, 2020 at 06:46:06PM +0800, Feng Tang wrote:
> > Hi Qian,
> > 
> > On Tue, May 26, 2020 at 10:25:39PM -0400, Qian Cai wrote:
> > > > > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > > > 
> > > > > > Reverted this series fixed a warning under memory pressue.
> > > > > 
> > > > > Andrew, Stephen, can you drop this series?
> > > > > 
> > > > > > 
> > > > > > [ 3319.257898] LTP: starting oom01
> > > > > > [ 3319.284417] ------------[ cut here ]------------
> > > > > > [ 3319.284439] memory commitment underflow
> > > > 
> > > > Thanks for the catch!
> > > > 
> > > > Could you share the info about the platform, like the CPU numbers
> > > > and RAM size, and what's the mmap test size of your test program.
> > > > It would be great if you can point me the link to the test program.
> > > 
> > > I have been reproduced this on both AMD and Intel. The test just
> > > allocating memory and swapping.
> > > 
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> > > 
> > > It might be better to run the whole LTP mm tests if none of the above
> > > triggers it for you which has quite a few memory pressurers.
> > > 
> > > /opt/ltp/runltp -f mm
> > 
> > Thanks for sharing. I tried to reproduce this on 2 server plaforms,
> > but can't reproduce it, and they are still under testing.
> > 
> > Meanwhile, could you help to try the below patch, which is based on
> > Andi's suggestion and have some debug info. The warning is a little
> > strange, as the condition is
> > 
> > 	(percpu_counter_read(&vm_committed_as) <
> >                        -(s64)vm_committed_as_batch * num_online_cpus())
> > 
> > while for your platform (48 CPU + 128 GB RAM), the
> > '-(s64)vm_committed_as_batch * num_online_cpus()'
> > is a s64 value: '-32G', which makes the condition hard to be true,
> > and when it is,  it could be triggered by some magic for s32/s64
> > operations around the percpu-counter. 
> 
> Here is the information on AMD and powerpc below affected by this. It
> could need a bit patient to reproduce, but our usual daily CI would
> trigger it eventually after a few tries.
> 
> # git clone https://github.com/cailca/linux-mm.git
> # cd linux-mm
> # ./compile.sh
> # systemctl reboot
> # ./test.sh

I just downloaded it, and it failed on my desktop machine as it failed
in 'yum' and 'grub2' setup. The difficulty for me to reproduce is the
test platforms are behind the 0day framework, and I can hardly  setup
external test suits, though I have been trying for all day today :) 

So if possible, please help to try the patch in my last email. thanks!

- Feng


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27 13:33             ` Feng Tang
@ 2020-05-27 15:42               ` Qian Cai
  0 siblings, 0 replies; 34+ messages in thread
From: Qian Cai @ 2020-05-27 15:42 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Wed, May 27, 2020 at 09:33:32PM +0800, Feng Tang wrote:
> Hi Qian,
> 
> On Wed, May 27, 2020 at 08:05:49AM -0400, Qian Cai wrote:
> > On Wed, May 27, 2020 at 06:46:06PM +0800, Feng Tang wrote:
> > > Hi Qian,
> > > 
> > > On Tue, May 26, 2020 at 10:25:39PM -0400, Qian Cai wrote:
> > > > > > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > > > > 
> > > > > > > Reverted this series fixed a warning under memory pressue.
> > > > > > 
> > > > > > Andrew, Stephen, can you drop this series?
> > > > > > 
> > > > > > > 
> > > > > > > [ 3319.257898] LTP: starting oom01
> > > > > > > [ 3319.284417] ------------[ cut here ]------------
> > > > > > > [ 3319.284439] memory commitment underflow
> > > > > 
> > > > > Thanks for the catch!
> > > > > 
> > > > > Could you share the info about the platform, like the CPU numbers
> > > > > and RAM size, and what's the mmap test size of your test program.
> > > > > It would be great if you can point me the link to the test program.
> > > > 
> > > > I have been reproduced this on both AMD and Intel. The test just
> > > > allocating memory and swapping.
> > > > 
> > > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> > > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> > > > 
> > > > It might be better to run the whole LTP mm tests if none of the above
> > > > triggers it for you which has quite a few memory pressurers.
> > > > 
> > > > /opt/ltp/runltp -f mm
> > > 
> > > Thanks for sharing. I tried to reproduce this on 2 server plaforms,
> > > but can't reproduce it, and they are still under testing.
> > > 
> > > Meanwhile, could you help to try the below patch, which is based on
> > > Andi's suggestion and have some debug info. The warning is a little
> > > strange, as the condition is
> > > 
> > > 	(percpu_counter_read(&vm_committed_as) <
> > >                        -(s64)vm_committed_as_batch * num_online_cpus())
> > > 
> > > while for your platform (48 CPU + 128 GB RAM), the
> > > '-(s64)vm_committed_as_batch * num_online_cpus()'
> > > is a s64 value: '-32G', which makes the condition hard to be true,
> > > and when it is,  it could be triggered by some magic for s32/s64
> > > operations around the percpu-counter. 
> > 
> > Here is the information on AMD and powerpc below affected by this. It
> > could need a bit patient to reproduce, but our usual daily CI would
> > trigger it eventually after a few tries.
> > 
> > # git clone https://github.com/cailca/linux-mm.git
> > # cd linux-mm
> > # ./compile.sh
> > # systemctl reboot
> > # ./test.sh
> 
> I just downloaded it, and it failed on my desktop machine as it failed
> in 'yum' and 'grub2' setup. The difficulty for me to reproduce is the
> test platforms are behind the 0day framework, and I can hardly  setup
> external test suits, though I have been trying for all day today :) 

I tried your debug patch and it did not even compile on linux-next
(where the issue was happened) and I am running out of time today. It
probably need to reproduce on large systems as it did not happen on one
of our small s390 system here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-27 10:46         ` Feng Tang
  2020-05-27 12:05           ` Qian Cai
@ 2020-05-28 14:18           ` Qian Cai
  2020-05-28 15:10             ` Feng Tang
  1 sibling, 1 reply; 34+ messages in thread
From: Qian Cai @ 2020-05-28 14:18 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Wed, May 27, 2020 at 06:46:06PM +0800, Feng Tang wrote:
> Hi Qian,
> 
> On Tue, May 26, 2020 at 10:25:39PM -0400, Qian Cai wrote:
> > > > > > [1] https://lkml.org/lkml/2020/3/5/57
> > > > > 
> > > > > Reverted this series fixed a warning under memory pressue.
> > > > 
> > > > Andrew, Stephen, can you drop this series?
> > > > 
> > > > > 
> > > > > [ 3319.257898] LTP: starting oom01
> > > > > [ 3319.284417] ------------[ cut here ]------------
> > > > > [ 3319.284439] memory commitment underflow
> > > 
> > > Thanks for the catch!
> > > 
> > > Could you share the info about the platform, like the CPU numbers
> > > and RAM size, and what's the mmap test size of your test program.
> > > It would be great if you can point me the link to the test program.
> > 
> > I have been reproduced this on both AMD and Intel. The test just
> > allocating memory and swapping.
> > 
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> > 
> > It might be better to run the whole LTP mm tests if none of the above
> > triggers it for you which has quite a few memory pressurers.
> > 
> > /opt/ltp/runltp -f mm
> 
> Thanks for sharing. I tried to reproduce this on 2 server plaforms,
> but can't reproduce it, and they are still under testing.
> 
> Meanwhile, could you help to try the below patch, which is based on
> Andi's suggestion and have some debug info. The warning is a little
> strange, as the condition is
> 
> 	(percpu_counter_read(&vm_committed_as) <
>                        -(s64)vm_committed_as_batch * num_online_cpus())
> 
> while for your platform (48 CPU + 128 GB RAM), the
> '-(s64)vm_committed_as_batch * num_online_cpus()'
> is a s64 value: '-32G', which makes the condition hard to be true,
> and when it is,  it could be triggered by some magic for s32/s64
> operations around the percpu-counter. 

The patch below does not fix anything.

[ 3160.275230][ T7955] LTP: starting oom01
[ 3160.306977][T106365] KK as:-59683  as_sum:6896  check:-32768 batch:256
[ 3160.307161][T106365] ------------[ cut here ]------------
[ 3160.307184][T106365] memory commitment underflow
[ 3160.307216][T106365] WARNING: CPU: 103 PID: 106365 at mm/util.c:858 __vm_enough_memory+0x204/0x250
[ 3160.307275][T106365] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci mdio tg3 libata libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
[ 3160.307341][T106365] CPU: 103 PID: 106365 Comm: oom01 Not tainted 5.7.0-rc7-next-20200528+ #3
[ 3160.307368][T106365] NIP:  c0000000003ee654 LR: c0000000003ee650 CTR: c000000000745b40
[ 3160.307382][T106365] REGS: c0002017f940f730 TRAP: 0700   Not tainted  (5.7.0-rc7-next-20200528+)
[ 3160.307409][T106365] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222482  XER: 00000000
[ 3160.307442][T106365] CFAR: c00000000010e428 IRQMASK: 0
[ 3160.307442][T106365] GPR00: c0000000003ee650 c0002017f940f9c0 c00000000130a300 000000000000001b
[ 3160.307442][T106365] GPR04: c0000000017b8d68 0000000000000006 0000000078154885 ffffffff6ca32090
[ 3160.307442][T106365] GPR08: 0000201cc61a0000 0000000000000000 0000000000000000 0000000000000003
[ 3160.307442][T106365] GPR12: 0000000000002000 c000201fff675100 0000000000000000 0000000000000000
[ 3160.307442][T106365] GPR16: 0000000000000000 0000000000000000 c000200350fe3b60 fffffffffff7dfff
[ 3160.307442][T106365] GPR20: c000201a36824928 c000201a3682c128 c000200ee33ad3a0 c000200ee33ad3a8
[ 3160.307442][T106365] GPR24: c000200ee33ad390 ffffffffffff16dd 0000000000000000 0000000000000001
[ 3160.307442][T106365] GPR28: c000201a36824880 0000000000000001 c000000003fa2a80 c0000000011fb0a8
[ 3160.307723][T106365] NIP [c0000000003ee654] __vm_enough_memory+0x204/0x250
[ 3160.307759][T106365] LR [c0000000003ee650] __vm_enough_memory+0x200/0x250
[ 3160.307793][T106365] Call Trace:
[ 3160.307814][T106365] [c0002017f940f9c0] [c0000000003ee650] __vm_enough_memory+0x200/0x250 (unreliable)
[ 3160.307875][T106365] [c0002017f940fa60] [c000000000109ab8] dup_mm+0x2a8/0x700
[ 3160.307901][T106365] [c0002017f940fb10] [c00000000010b3f0] copy_process+0xeb0/0x1920
[ 3160.307927][T106365] [c0002017f940fc20] [c00000000010c1cc] _do_fork+0xac/0xb20
[ 3160.307941][T106365] [c0002017f940fd00] [c00000000010cdf8] __do_sys_clone+0x98/0xe0
[ 3160.307979][T106365] [c0002017f940fdc0] [c000000000038a88] system_call_exception+0x108/0x1d0
[ 3160.308018][T106365] [c0002017f940fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
[ 3160.308054][T106365] Instruction dump:
[ 3160.308076][T106365] 60000000 60000000 60000000 73490001 4082fee4 3c62ff8c 39200001 3d42fff9
[ 3160.308128][T106365] 38632788 992a6b4a 4bd1fd75 60000000 <0fe00000> 4bfffec0 7fc3f378 483218f9
[ 3160.308158][T106365] irq event stamp: 760
[ 3160.308181][T106365] hardirqs last  enabled at (759): [<c0000000001e9838>] vprintk_emit+0x118/0x430
[ 3160.308231][T106365] hardirqs last disabled at (760): [<c00000000000963c>] program_check_common_virt+0x2bc/0x310
[ 3160.308281][T106365] softirqs last  enabled at (0): [<c00000000010ac38>] copy_process+0x6f8/0x1920
[ 3160.308297][T106365] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 3160.308310][T106365] ---[ end trace e2152aa44c190593 ]---
[ 3160.308478][T106365] KK as:-59683  as_sum:6897  check:-32768 batch:256
[ 3160.308614][T106365] KK as:-59683  as_sum:6898  check:-32768 batch:256
[ 3160.308714][T106365] KK as:-59683  as_sum:6901  check:-32768 batch:256
[ 3160.308806][T106365] KK as:-59683  as_sum:6902  check:-32768 batch:256
[ 3160.308900][T106365] KK as:-59683  as_sum:6903  check:-32768 batch:256
[ 3160.308979][T106365] KK as:-59683  as_sum:6904  check:-32768 batch:256
[ 3160.309064][T106365] KK as:-59683  as_sum:6905  check:-32768 batch:256
[ 3160.309160][T106365] KK as:-59683  as_sum:6906  check:-32768 batch:256
[ 3160.309275][T106365] KK as:-59683  as_sum:6907  check:-32768 batch:256
[ 3160.309356][T106365] KK as:-59683  as_sum:6908  check:-32768 batch:256
[ 3160.309437][T106365] KK as:-59683  as_sum:6909  check:-32768 batch:256
[ 3160.310939][T106366] KK as:-59683  as_sum:6912  check:-32768 batch:256
[ 3160.311134][T106366] KK as:-59270  as_sum:7040  check:-32768 batch:256
[ 3160.311240][T106367] KK as:-59270  as_sum:7168  check:-32768 batch:256

> 
> Thanks,
> Feng
> 
> ---
>  mm/util.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 988d11e6c17c..2f3a2e0f12ae 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -797,10 +797,15 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>  int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  {
>  	long allowed;
> -
> -	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> +	if (percpu_counter_read(&vm_committed_as) <
> +			-(s64)vm_committed_as_batch * num_online_cpus()) {
> +		printk("as:%lld  as_sum:%lld  check:%lld batch:%d\n",
> +			percpu_counter_read(&vm_committed_as),
> +			percpu_counter_sum(&vm_committed_as),
>  			-(s64)vm_committed_as_batch * num_online_cpus(),
> -			"memory commitment underflow");
> +			vm_committed_as_batch);
> +
> +	}
>  
>  	vm_acct_memory(pages);
>  
> @@ -816,6 +821,10 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  		return 0;
>  	}
>  
> +	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> +			-(s64)vm_committed_as_batch * num_online_cpus(),
> +			"memory commitment underflow");
> +
>  	allowed = vm_commit_limit();
>  	/*
>  	 * Reserve some for root
> -- 
> 2.14.1
> 
> > 
> > # lscpu
> > Architecture:        x86_64
> > CPU op-mode(s):      32-bit, 64-bit
> > Byte Order:          Little Endian
> > CPU(s):              48
> > On-line CPU(s) list: 0-47
> > Thread(s) per core:  1
> > Core(s) per socket:  12
> > Socket(s):           4
> > NUMA node(s):        4
> > Vendor ID:           GenuineIntel
> > CPU family:          6
> > Model:               63
> > Model name:          Intel(R) Xeon(R) CPU E5-4650 v3 @ 2.10GHz
> > Stepping:            2
> > CPU MHz:             1198.498
> > BogoMIPS:            4195.14
> > Virtualization:      VT-x
> > L1d cache:           32K
> > L1i cache:           32K
> > L2 cache:            256K
> > L3 cache:            30720K
> > NUMA node0 CPU(s):   0-5,24-29
> > NUMA node1 CPU(s):   6-11,30-35
> > NUMA node2 CPU(s):   12-17,36-41
> > NUMA node3 CPU(s):   18-23,42-47
> > 
> > # numactl -H
> > available: 4 nodes (0-3)
> > node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
> > node 0 size: 27768 MB
> > node 0 free: 27539 MB
> > node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
> > node 1 size: 32225 MB
> > node 1 free: 32009 MB
> > node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
> > node 2 size: 32253 MB
> > node 2 free: 32174 MB
> > node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
> > node 3 size: 32252 MB
> > node 3 free: 31330 MB
> > node distances:
> > node   0   1   2   3
> >   0:  10  21  21  21
> >   1:  21  10  21  21
> >   2:  21  21  10  21
> >   3:  21  21  21  10
> > 
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > > > [ 3319.284456] WARNING: CPU: 13 PID: 130949 at mm/util.c:835 __vm_enough_memory+0x1dc/0x1f0
> > > > > [ 3319.284485] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod ahci libahci bnx2x libata mdio tg3 libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> > > > > [ 3319.284538] CPU: 13 PID: 130949 Comm: oom01 Not tainted 5.7.0-rc6-next-20200521+ #115
> > > > > [ 3319.284551] NIP:  c00000000043027c LR: c000000000430278 CTR: c0000000007bd120
> > > > > [ 3319.284572] REGS: c00000064a42f710 TRAP: 0700   Not tainted  (5.7.0-rc6-next-20200521+)
> > > > > [ 3319.284602] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222422  XER: 20040000
> > > > > [ 3319.284638] CFAR: c000000000119098 IRQMASK: 0 
> > > > >                GPR00: c000000000430278 c00000064a42f9a0 c000000001765500 000000000000001b 
> > > > >                GPR04: c00000000179c3b0 000000000000000b 0000000066726564 c00000000429ac28 
> > > > >                GPR08: 0000000000000001 0000000000000000 0000000000000000 c000001301facc80 
> > > > >                GPR12: 0000000000002000 c000001fffff3d80 0000000000000000 0000000000000000 
> > > > >                GPR16: 0000000000000000 0000000000000000 c0000014f358dd60 fffffffffff7dfff 
> > > > >                GPR20: c0000003725a9128 c0000003725a3928 c000000b0a0c0670 c000000b0a0c0678 
> > > > >                GPR24: c000000b0a0c0660 c00000064a42fa58 c000000b0a0c0650 c0000003725a9080 
> > > > >                GPR28: 0000000000000001 0000000000000000 c000000001620178 0000000000000001 
> > > > > [ 3319.284825] NIP [c00000000043027c] __vm_enough_memory+0x1dc/0x1f0
> > > > > [ 3319.284853] LR [c000000000430278] __vm_enough_memory+0x1d8/0x1f0
> > > > > [ 3319.284872] Call Trace:
> > > > > [ 3319.284889] [c00000064a42f9a0] [c000000000430278] __vm_enough_memory+0x1d8/0x1f0 (unreliable)
> > > > > [ 3319.284930] [c00000064a42fa30] [c000000000114258] dup_mm+0x2a8/0x700
> > > > > [ 3319.284960] [c00000064a42fae0] [c000000000115c7c] copy_process+0xeac/0x1a00
> > > > > [ 3319.284981] [c00000064a42fbf0] [c000000000116b3c] _do_fork+0xac/0xce0
> > > > > [ 3319.285002] [c00000064a42fd00] [c000000000117928] __do_sys_clone+0x98/0xe0
> > > > > [ 3319.285032] [c00000064a42fdc0] [c000000000039e78] system_call_exception+0x108/0x1d0
> > > > > [ 3319.285064] [c00000064a42fe20] [c00000000000c9f0] system_call_common+0xf0/0x278
> > > > > [ 3319.285093] Instruction dump:
> > > > > [ 3319.285109] 60000000 73a90001 4182000c e8be0002 4bfffeb8 3c62ff51 39200001 3d42fff9 
> > > > > [ 3319.285142] 38636690 992a6ad6 4bce8dbd 60000000 <0fe00000> e8be0002 4bfffe90 60000000 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-28 14:18           ` Qian Cai
@ 2020-05-28 15:10             ` Feng Tang
  2020-05-28 15:21               ` Kleen, Andi
  2020-05-28 15:48               ` Michal Hocko
  0 siblings, 2 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-28 15:10 UTC (permalink / raw)
  To: Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Thu, May 28, 2020 at 10:18:02AM -0400, Qian Cai wrote:
> > > I have been reproduced this on both AMD and Intel. The test just
> > > allocating memory and swapping.
> > > 
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/oom/oom01.c
> > > https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/tunable/overcommit_memory.c
> > > 
> > > It might be better to run the whole LTP mm tests if none of the above
> > > triggers it for you which has quite a few memory pressurers.
> > > 
> > > /opt/ltp/runltp -f mm
> > 
> > Thanks for sharing. I tried to reproduce this on 2 server plaforms,
> > but can't reproduce it, and they are still under testing.
> > 
> > Meanwhile, could you help to try the below patch, which is based on
> > Andi's suggestion and have some debug info. The warning is a little
> > strange, as the condition is
> > 
> > 	(percpu_counter_read(&vm_committed_as) <
> >                        -(s64)vm_committed_as_batch * num_online_cpus())
> > 
> > while for your platform (48 CPU + 128 GB RAM), the
> > '-(s64)vm_committed_as_batch * num_online_cpus()'
> > is a s64 value: '-32G', which makes the condition hard to be true,
> > and when it is,  it could be triggered by some magic for s32/s64
> > operations around the percpu-counter. 
> 
> The patch below does not fix anything.

Thanks for the info.

> 
> [ 3160.275230][ T7955] LTP: starting oom01
> [ 3160.306977][T106365] KK as:-59683  as_sum:6896 

This does show that percpu_counter_read() is not accurate
comparing to percpu_counter_sum(). And if we use 
percpu_counter_sum() then the warning won't happen, as
6896 >> -32768

> check:-32768 batch:256

-32768 means the RAM of the test platform  is either 512M
or 32GB (more likely) depending on overcommit policy, and
it has 128 CPUs. 

So my guess of the root cause is the overcommit policy is
changing while doing the memory stress test, so the
vm_commited_as.count could have bigger deviation when the
overcommit policy is !OVERCOMMIT_NEVER, and when the policy
is changed to OVERCOMMIT_NEVER, the check value is hugely
reduced, which trigger the warning.

If it's true, then there could be 2 solutions, one is to
skip the WARN_ONCE as it has no practical value, as the real
check is the following code, the other is to rectify the
percpu counter when the policy is changing to OVERCOMMIT_NEVER. 

Thanks,
Feng

> [ 3160.307161][T106365] ------------[ cut here ]------------
> [ 3160.307184][T106365] memory commitment underflow
> [ 3160.307216][T106365] WARNING: CPU: 103 PID: 106365 at mm/util.c:858 __vm_enough_memory+0x204/0x250
> [ 3160.307275][T106365] Modules linked in: brd ext4 crc16 mbcache jbd2 loop kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci mdio tg3 libata libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
> [ 3160.307341][T106365] CPU: 103 PID: 106365 Comm: oom01 Not tainted 5.7.0-rc7-next-20200528+ #3
> [ 3160.307368][T106365] NIP:  c0000000003ee654 LR: c0000000003ee650 CTR: c000000000745b40
> [ 3160.307382][T106365] REGS: c0002017f940f730 TRAP: 0700   Not tainted  (5.7.0-rc7-next-20200528+)
> [ 3160.307409][T106365] MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 28222482  XER: 00000000
> [ 3160.307442][T106365] CFAR: c00000000010e428 IRQMASK: 0
> [ 3160.307442][T106365] GPR00: c0000000003ee650 c0002017f940f9c0 c00000000130a300 000000000000001b
> [ 3160.307442][T106365] GPR04: c0000000017b8d68 0000000000000006 0000000078154885 ffffffff6ca32090
> [ 3160.307442][T106365] GPR08: 0000201cc61a0000 0000000000000000 0000000000000000 0000000000000003
> [ 3160.307442][T106365] GPR12: 0000000000002000 c000201fff675100 0000000000000000 0000000000000000
> [ 3160.307442][T106365] GPR16: 0000000000000000 0000000000000000 c000200350fe3b60 fffffffffff7dfff
> [ 3160.307442][T106365] GPR20: c000201a36824928 c000201a3682c128 c000200ee33ad3a0 c000200ee33ad3a8
> [ 3160.307442][T106365] GPR24: c000200ee33ad390 ffffffffffff16dd 0000000000000000 0000000000000001
> [ 3160.307442][T106365] GPR28: c000201a36824880 0000000000000001 c000000003fa2a80 c0000000011fb0a8
> [ 3160.307723][T106365] NIP [c0000000003ee654] __vm_enough_memory+0x204/0x250
> [ 3160.307759][T106365] LR [c0000000003ee650] __vm_enough_memory+0x200/0x250
> [ 3160.308310][T106365] ---[ end trace e2152aa44c190593 ]---
> [ 3160.308478][T106365] KK as:-59683  as_sum:6897  check:-32768 batch:256
> [ 3160.308614][T106365] KK as:-59683  as_sum:6898  check:-32768 batch:256
> [ 3160.308714][T106365] KK as:-59683  as_sum:6901  check:-32768 batch:256
> [ 3160.308806][T106365] KK as:-59683  as_sum:6902  check:-32768 batch:256
> [ 3160.308900][T106365] KK as:-59683  as_sum:6903  check:-32768 batch:256
> [ 3160.308979][T106365] KK as:-59683  as_sum:6904  check:-32768 batch:256
> [ 3160.309064][T106365] KK as:-59683  as_sum:6905  check:-32768 batch:256
> [ 3160.309160][T106365] KK as:-59683  as_sum:6906  check:-32768 batch:256
> [ 3160.309275][T106365] KK as:-59683  as_sum:6907  check:-32768 batch:256
> [ 3160.309356][T106365] KK as:-59683  as_sum:6908  check:-32768 batch:256
> [ 3160.309437][T106365] KK as:-59683  as_sum:6909  check:-32768 batch:256
> [ 3160.310939][T106366] KK as:-59683  as_sum:6912  check:-32768 batch:256
> [ 3160.311134][T106366] KK as:-59270  as_sum:7040  check:-32768 batch:256
> [ 3160.311240][T106367] KK as:-59270  as_sum:7168  check:-32768 batch:256
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-28 15:10             ` Feng Tang
@ 2020-05-28 15:21               ` Kleen, Andi
  2020-05-29 15:43                 ` Feng Tang
  2020-05-28 15:48               ` Michal Hocko
  1 sibling, 1 reply; 34+ messages in thread
From: Kleen, Andi @ 2020-05-28 15:21 UTC (permalink / raw)
  To: Tang, Feng, Qian Cai
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, Chen, Tim C, Hansen, Dave, Huang, Ying, linux-mm,
	linux-kernel



>If it's true, then there could be 2 solutions, one is to skip the WARN_ONCE as it has no practical value, as the real >check is the following code, the other is to rectify the percpu counter when the policy is changing to >OVERCOMMIT_NEVER.

I think it's better to fix it up when the policy changes. That's the right place. The WARN_ON might be useful to catch other bugs.

-Andi
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-28 15:10             ` Feng Tang
  2020-05-28 15:21               ` Kleen, Andi
@ 2020-05-28 15:48               ` Michal Hocko
  1 sibling, 0 replies; 34+ messages in thread
From: Michal Hocko @ 2020-05-28 15:48 UTC (permalink / raw)
  To: Feng Tang
  Cc: Qian Cai, Andrew Morton, Johannes Weiner, Stephen Rothwell,
	Matthew Wilcox, Mel Gorman, Kees Cook, Luis Chamberlain,
	Iurii Zaikin, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Thu 28-05-20 23:10:20, Feng Tang wrote:
[...]
> If it's true, then there could be 2 solutions, one is to
> skip the WARN_ONCE as it has no practical value, as the real
> check is the following code, the other is to rectify the
> percpu counter when the policy is changing to OVERCOMMIT_NEVER. 

I would simply drop the WARN_ONCE. Looking at the history this has been
added by 82f71ae4a2b8 ("mm: catch memory commitment underflow") to have
a safety check for issues which have been fixed. There doesn't seem to
be any bug reports mentioning this splat since then so it is likely just
spending cycles for a hot path (yes many people run with DEBUG_VM)
without a strong reason.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-28 15:21               ` Kleen, Andi
@ 2020-05-29 15:43                 ` Feng Tang
  2020-05-29 15:50                   ` Andi Kleen
  0 siblings, 1 reply; 34+ messages in thread
From: Feng Tang @ 2020-05-29 15:43 UTC (permalink / raw)
  To: Kleen, Andi, Qian Cai, Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Stephen Rothwell, Matthew Wilcox,
	Mel Gorman, Kees Cook, Luis Chamberlain, Iurii Zaikin, Chen,
	Tim C, Hansen, Dave, Huang, Ying, linux-mm, linux-kernel

On Thu, May 28, 2020 at 11:21:36PM +0800, Kleen, Andi wrote:
> 
> 
> >If it's true, then there could be 2 solutions, one is to skip the WARN_ONCE as it has no practical value, as the real >check is the following code, the other is to rectify the percpu counter when the policy is changing to >OVERCOMMIT_NEVER.
> 
> I think it's better to fix it up when the policy changes. That's the right place. The WARN_ON might be useful to catch other bugs.

If we keep the WARN_ON, then the draft fix patch I can think of looks like:

diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index a66595b..02d87fc 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -98,6 +98,20 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 }
 EXPORT_SYMBOL(percpu_counter_add_batch);
 
+void percpu_counter_sync(struct percpu_counter *fbc)
+{
+	unsigned long flags;
+	s64 count;
+
+	raw_spin_lock_irqsave(&fbc->lock, flags);
+	count = __this_cpu_read(*fbc->counters);
+	fbc->count += count;
+	__this_cpu_sub(*fbc->counters, count);
+	raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_sync);
+
+
 /*
  * Add up all the per-cpu counts, return the result.  This is a more accurate
  * but much slower version of percpu_counter_read_positive()
diff --git a/mm/util.c b/mm/util.c
index 580d268..24322da 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -746,14 +746,24 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
 	return ret;
 }
 
+static  void sync_overcommit_as(struct work_struct *dummy)
+{
+	percpu_counter_sync(&vm_committed_as);
+}
+
 int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos)
 {
 	int ret;
 
 	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
-	if (ret == 0 && write)
+	if (ret == 0 && write) {
+		if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+			schedule_on_each_cpu(sync_overcommit_as);
+
 		mm_compute_batch();
+	}
 
 	return ret;
 }

Any comments?

Thanks,
Feng

> -Andi
>  

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-29 15:43                 ` Feng Tang
@ 2020-05-29 15:50                   ` Andi Kleen
  2020-05-29 16:04                     ` Feng Tang
  0 siblings, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2020-05-29 15:50 UTC (permalink / raw)
  To: Feng Tang
  Cc: Qian Cai, Michal Hocko, Andrew Morton, Johannes Weiner,
	Stephen Rothwell, Matthew Wilcox, Mel Gorman, Kees Cook,
	Luis Chamberlain, Iurii Zaikin, Chen, Tim C, Hansen, Dave, Huang,
	Ying, linux-mm, linux-kernel

>  
>  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> -	if (ret == 0 && write)
> +	if (ret == 0 && write) {
> +		if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> +			schedule_on_each_cpu(sync_overcommit_as);

The schedule is not atomic.
There's still a race window here over all the CPUs where the WARN_ON could
happen because you change the global first.

Probably you would need another global that says "i'm currently changing
the mode" and then skip the WARN_ON in that window. Maybe a sequence lock.

Seems all overkill to me. Better to kill the warning.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy
  2020-05-29 15:50                   ` Andi Kleen
@ 2020-05-29 16:04                     ` Feng Tang
  0 siblings, 0 replies; 34+ messages in thread
From: Feng Tang @ 2020-05-29 16:04 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Qian Cai, Michal Hocko, Andrew Morton, Johannes Weiner,
	Stephen Rothwell, Matthew Wilcox, Mel Gorman, Kees Cook,
	Luis Chamberlain, Iurii Zaikin, Chen, Tim C, Hansen, Dave, Huang,
	Ying, linux-mm, linux-kernel

On Fri, May 29, 2020 at 08:50:25AM -0700, Andi Kleen wrote:
> >  
> >  	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
> > -	if (ret == 0 && write)
> > +	if (ret == 0 && write) {
> > +		if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
> > +			schedule_on_each_cpu(sync_overcommit_as);
> 
> The schedule is not atomic.
> There's still a race window here over all the CPUs where the WARN_ON could
> happen because you change the global first.
 
The re-computing of batch number comes after this sync, so at this
point the batch is still the bigger one, and won't trigger the warning.

> Probably you would need another global that says "i'm currently changing
> the mode" and then skip the WARN_ON in that window. Maybe a sequence lock.
> 
> Seems all overkill to me. Better to kill the warning.

Yes, the cost is high, schedule_on_each_cpu is labeled as "very slow"
in the code comments.

Thanks,
Feng

> 
> -Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2020-05-29 16:04 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-08  7:25 [PATCH 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
2020-05-08  7:25 ` [PATCH 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
2020-05-15  7:22   ` Michal Hocko
2020-05-08  7:25 ` [PATCH 2/3] mm/util.c: make vm_memory_committed() more accurate Feng Tang
2020-05-15  7:23   ` Michal Hocko
2020-05-15  8:11     ` Feng Tang
2020-05-15  9:04       ` Michal Hocko
2020-05-15 13:01         ` Feng Tang
2020-05-08  7:25 ` [PATCH 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
2020-05-08 11:24   ` Matthew Wilcox
2020-05-08 12:33     ` Feng Tang
2020-05-15  7:41   ` Michal Hocko
2020-05-15  8:02     ` Feng Tang
2020-05-15  9:08       ` Michal Hocko
2020-05-15 11:26         ` Feng Tang
2020-05-15  7:44   ` Michal Hocko
2020-05-15  8:38     ` Feng Tang
2020-05-21 21:27 ` [PATCH 0/3] make vm_committed_as_batch aware of " Qian Cai
2020-05-26 18:14   ` Qian Cai
2020-05-27  1:14     ` Andi Kleen
2020-05-27  1:43       ` Feng Tang
2020-05-27  1:46     ` Feng Tang
2020-05-27  2:25       ` Qian Cai
2020-05-27 10:46         ` Feng Tang
2020-05-27 12:05           ` Qian Cai
2020-05-27 13:33             ` Feng Tang
2020-05-27 15:42               ` Qian Cai
2020-05-28 14:18           ` Qian Cai
2020-05-28 15:10             ` Feng Tang
2020-05-28 15:21               ` Kleen, Andi
2020-05-29 15:43                 ` Feng Tang
2020-05-29 15:50                   ` Andi Kleen
2020-05-29 16:04                     ` Feng Tang
2020-05-28 15:48               ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).