linux-mm.kvack.org archive mirror
* [PATCH v3 0/3] make vm_committed_as_batch aware of vm overcommit policy
@ 2020-05-16  6:47 Feng Tang
  2020-05-16  6:47 ` [PATCH v3 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Feng Tang @ 2020-05-16  6:47 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, Johannes Weiner,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

When checking a performance change for the will-it-scale scalability
mmap test [1], we found very high contention on the spinlock of the
percpu counter 'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

This heavy lock contention is not always necessary: 'vm_committed_as'
only needs to be very precise when the strict OVERCOMMIT_NEVER policy
is in effect, which is what requires a rather small batch number for
the percpu counter.
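
The batch matters because percpu_counter_add_batch() only falls back
to the shared spinlock once a CPU's local delta reaches the batch,
and any single update of 'batch' pages or more takes the lock
directly. Roughly (a simplified sketch of lib/percpu_counter.c from
around this time, shown here for context only):

void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
			      s32 batch)
{
	s64 count;

	preempt_disable();
	count = __this_cpu_read(*fbc->counters) + amount;
	if (abs(count) >= batch) {
		unsigned long flags;

		/* Slow path: fold the local delta into the shared
		 * count under the (possibly contended) spinlock. */
		raw_spin_lock_irqsave(&fbc->lock, flags);
		fbc->count += count;
		__this_cpu_sub(*fbc->counters, count - amount);
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
	} else {
		/* Fast path: stay within the per-CPU counter. */
		this_cpu_add(*fbc->counters, amount);
	}
	preempt_enable();
}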

So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
policy, and enlarge it for the not-so-strict OVERCOMMIT_ALWAYS and
OVERCOMMIT_GUESS policies.

Benchmarking with the same testcase as [1] shows a 53% improvement on
an 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. Whether a
given platform shows an improvement depends on whether the test's
mmap size is bigger than the computed batch number.

We tested 10+ platforms in 0day (server, desktop and laptop). If we
lift the batch to 64X, 80%+ of the platforms show improvements, while
with a 16X lift about 1/3 of the platforms do.
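
To make that concrete (illustrative numbers only, assuming 4KB pages
and ignoring reserved memory): on a 16-CPU, 16GB machine
totalram_pages is ~4M pages, so the current batch is
(4M/16)/256 = 1024 pages, i.e. a CPU takes the spinlock after
accumulating ~4MB of commit change, and any single mmap of 4MB or
more hits it directly. A 64X lift raises that threshold to ~256MB, so
a test mapping e.g. 128MB chunks stops touching the lock, while one
mapping 1GB chunks still takes it on every call.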

And generally it should help the mmap/unmap usage, as Michal Hocko
mentioned:
"
I believe that there are non-synthetic workloads which would benefit
from a larger batch. E.g. large in memory databases which do large
mmaps during startups from multiple threads.
"

Note: checkpatch complains about some style issues in patch 3, as
the sysctl handler declaration follows the same format as its
sibling functions.

patch1: a cleanup for /proc/meminfo
patch2: a preparation patch which also improves the accuracy of
        vm_memory_committed()
patch3: the main change
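
[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/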

Please help to review, thanks!

- Feng

----------------------------------------------------------------
Changelog:
  v3:
    * refine commit log and clean up code, according to comments
      from Michal Hocko and Matthew Wilcox
    * change the lift from 16X to 64X after testing
  
  v2:
     * add the sysctl handler to cover runtime overcommit policy
       change, as suggested by Andrew Morton
     * address the accuracy concern of vm_memory_committed()
       from Andi Kleen

Feng Tang (3):
  proc/meminfo: avoid open coded reading of vm_committed_as
  mm/util.c: make vm_memory_committed() more accurate
  mm: adjust vm_committed_as_batch according to vm overcommit policy

 fs/proc/meminfo.c    |  2 +-
 include/linux/mm.h   |  2 ++
 include/linux/mman.h |  4 ++++
 kernel/sysctl.c      |  2 +-
 mm/mm_init.c         | 18 ++++++++++++++----
 mm/util.c            | 15 ++++++++++++++-
 6 files changed, 36 insertions(+), 7 deletions(-)

-- 
2.7.4




* [PATCH v3 1/3] proc/meminfo: avoid open coded reading of vm_committed_as
  2020-05-16  6:47 [PATCH v3 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
@ 2020-05-16  6:47 ` Feng Tang
  2020-05-16  6:47 ` [PATCH v3 2/3] mm/util.c: make vm_memory_committed() more accurate Feng Tang
  2020-05-16  6:47 ` [PATCH v3 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  2 siblings, 0 replies; 5+ messages in thread
From: Feng Tang @ 2020-05-16  6:47 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, Johannes Weiner,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

Use the existing vm_memory_committed() instead, which is also
convenient for future changes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/meminfo.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb..578c0b8 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -42,7 +42,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = percpu_counter_read_positive(&vm_committed_as);
+	committed = vm_memory_committed();
 
 	cached = global_node_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
-- 
2.7.4




* [PATCH v3 2/3] mm/util.c: make vm_memory_committed() more accurate
  2020-05-16  6:47 [PATCH v3 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
  2020-05-16  6:47 ` [PATCH v3 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
@ 2020-05-16  6:47 ` Feng Tang
  2020-05-16  6:47 ` [PATCH v3 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
  2 siblings, 0 replies; 5+ messages in thread
From: Feng Tang @ 2020-05-16  6:47 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, Johannes Weiner,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

percpu_counter_sum_positive() provides more accurate info.

With percpu_counter_read_positive(), the worst-case deviation can be
'batch * nr_cpus', which is totalram_pages/256 for now, and will grow
once the batch is enlarged.

The sum's time cost is about 800 nanoseconds on a 2C/4T platform and
2~3 microseconds on a 2S/36C/72T server in the normal case; in the
worst case, where vm_committed_as's spinlock is under severe
contention, it costs 30~40 microseconds on the 2S/36C/72T server.
That should be fine for its only two users: /proc/meminfo and the
HyperV balloon driver's once-per-second status trace.
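
For reference, the difference between the two readers (simplified
from the percpu_counter code of this era; percpu_counter_sum_positive()
is a thin wrapper that clamps __percpu_counter_sum() to >= 0):

/* Cheap but stale: ignores the per-CPU deltas, so it can be off by
 * up to batch * nr_cpus. */
static inline s64 percpu_counter_read_positive(struct percpu_counter *fbc)
{
	s64 ret = fbc->count;

	barrier();		/* Prevent reloads of fbc->count */
	if (ret >= 0)
		return ret;
	return 0;
}

/* Accurate but O(nr_cpus), and takes the counter's spinlock. */
s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_online_cpu(cpu) {
		s32 *pcount = per_cpu_ptr(fbc->counters, cpu);
		ret += *pcount;
	}
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}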

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/util.c b/mm/util.c
index 988d11e..3de78e9 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -774,7 +774,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	return percpu_counter_sum_positive(&vm_committed_as);
 }
 EXPORT_SYMBOL_GPL(vm_memory_committed);
 
-- 
2.7.4




* [PATCH v3 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-16  6:47 [PATCH v3 0/3] make vm_committed_as_batch aware of vm overcommit policy Feng Tang
  2020-05-16  6:47 ` [PATCH v3 1/3] proc/meminfo: avoid open coded reading of vm_committed_as Feng Tang
  2020-05-16  6:47 ` [PATCH v3 2/3] mm/util.c: make vm_memory_committed() more accurate Feng Tang
@ 2020-05-16  6:47 ` Feng Tang
  2020-05-18 22:38   ` Andrew Morton
  2 siblings, 1 reply; 5+ messages in thread
From: Feng Tang @ 2020-05-16  6:47 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Matthew Wilcox, Johannes Weiner,
	Mel Gorman, Kees Cook, andi.kleen, tim.c.chen, dave.hansen,
	ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

When checking a performance change for the will-it-scale scalability
mmap test [1], we found very high contention on the spinlock of the
percpu counter 'vm_committed_as':

    94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
    48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
    45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;

This heavy lock contention is not always necessary: 'vm_committed_as'
only needs to be very precise when the strict OVERCOMMIT_NEVER policy
is in effect, which is what requires a rather small batch number for
the percpu counter.
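
To see why only OVERCOMMIT_NEVER needs the precision, here is a
condensed sketch of __vm_enough_memory() in mm/util.c (reserve
handling omitted); the other two policies never compare the counter
against a limit:

int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
{
	long allowed;

	vm_acct_memory(pages);

	/* These two policies don't look at the counter's value. */
	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
		return 0;
	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
		if (pages > totalram_pages() + total_swap_pages)
			goto error;
		return 0;
	}

	/*
	 * OVERCOMMIT_NEVER: committed pages are compared against the
	 * commit limit, so a reading that lags by a large batch could
	 * wrongly allow or refuse an allocation.
	 */
	allowed = vm_commit_limit();
	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
		return 0;
error:
	vm_acct_memory(-pages);
	return -ENOMEM;
}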

So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
policy, and lift it to 64X for the OVERCOMMIT_ALWAYS and
OVERCOMMIT_GUESS policies. Also add a sysctl handler to readjust it
when the policy is changed at runtime.
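
For example, with this applied, switching the policy at runtime via
'echo 2 > /proc/sys/vm/overcommit_memory' (2 being OVERCOMMIT_NEVER)
recomputes vm_committed_as_batch right away, where previously it was
only computed at boot and on memory hotplug.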

Benchmarking with the same testcase as [1] shows a 53% improvement on
an 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. We tested
on the 0day platforms (server, desktop and laptop), and 80%+ of them
show improvements with that test. Whether a platform shows an
improvement depends on whether the test's mmap size is bigger than
the computed batch number.

If the lift is 16X instead, 1/3 of the platforms show improvements,
though it should help the mmap/unmap usage generally, as Michal Hocko
mentioned:
"
I believe that there are non-synthetic workloads which would benefit
from a larger batch. E.g. large in memory databases which do large
mmaps during startups from multiple threads.
"

[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/mm.h   |  2 ++
 include/linux/mman.h |  4 ++++
 kernel/sysctl.c      |  2 +-
 mm/mm_init.c         | 18 ++++++++++++++----
 mm/util.c            | 13 +++++++++++++
 5 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a32342..bc3722f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,8 @@ extern int overcommit_ratio_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
 extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
 				    size_t *, loff_t *);
+extern int overcommit_policy_handler(struct ctl_table *, int, void __user *,
+				    size_t *, loff_t *);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 4b08e9c..91c93c1 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
 
 #ifdef CONFIG_SMP
 extern s32 vm_committed_as_batch;
+extern void mm_compute_batch(void);
 #else
 #define vm_committed_as_batch 0
+static inline void mm_compute_batch(void)
+{
+}
 #endif
 
 unsigned long vm_memory_committed(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a176d8..6fa552d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1278,7 +1278,7 @@ static struct ctl_table vm_table[] = {
 		.data		= &sysctl_overcommit_memory,
 		.maxlen		= sizeof(sysctl_overcommit_memory),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
+		.proc_handler	= overcommit_policy_handler,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= &two,
 	},
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7da6991..b48dafd 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -13,6 +13,7 @@
 #include <linux/memory.h>
 #include <linux/notifier.h>
 #include <linux/sched.h>
+#include <linux/mman.h>
 #include "internal.h"
 
 #ifdef CONFIG_DEBUG_MEMORY_INIT
@@ -140,14 +141,23 @@ EXPORT_SYMBOL_GPL(mm_kobj);
 #ifdef CONFIG_SMP
 s32 vm_committed_as_batch = 32;
 
-static void __meminit mm_compute_batch(void)
+void mm_compute_batch(void)
 {
 	u64 memsized_batch;
 	s32 nr = num_present_cpus();
 	s32 batch = max_t(s32, nr*2, 32);
-
-	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
-	memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
+	unsigned long ram_pages = totalram_pages();
+
+	/*
+	 * For the OVERCOMMIT_NEVER policy, set the batch size to 0.4%
+	 * of (total memory/#cpus), and lift it to 25% for the other
+	 * policies, to ease the possible lock contention on the
+	 * percpu counter vm_committed_as; the max limit is INT_MAX.
+	 */
+	if (sysctl_overcommit_memory == OVERCOMMIT_NEVER)
+		memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
+	else
+		memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX);
 
 	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
 }
diff --git a/mm/util.c b/mm/util.c
index 3de78e9..99936d3 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -729,6 +729,19 @@ int overcommit_ratio_handler(struct ctl_table *table, int write,
 	return ret;
 }
 
+int overcommit_policy_handler(struct ctl_table *table, int write,
+			     void __user *buffer, size_t *lenp,
+			     loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret == 0 && write)
+		mm_compute_batch();
+
+	return ret;
+}
+
 int overcommit_kbytes_handler(struct ctl_table *table, int write,
 			     void __user *buffer, size_t *lenp,
 			     loff_t *ppos)
-- 
2.7.4




* Re: [PATCH v3 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy
  2020-05-16  6:47 ` [PATCH v3 3/3] mm: adjust vm_committed_as_batch according to vm overcommit policy Feng Tang
@ 2020-05-18 22:38   ` Andrew Morton
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Morton @ 2020-05-18 22:38 UTC (permalink / raw)
  To: Feng Tang
  Cc: Michal Hocko, Matthew Wilcox, Johannes Weiner, Mel Gorman,
	Kees Cook, andi.kleen, tim.c.chen, dave.hansen, ying.huang,
	linux-mm, linux-kernel

On Sat, 16 May 2020 14:47:40 +0800 Feng Tang <feng.tang@intel.com> wrote:

> When checking a performance change for the will-it-scale scalability
> mmap test [1], we found very high contention on the spinlock of the
> percpu counter 'vm_committed_as':
> 
>     94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
>     48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
>     45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
> 
> This heavy lock contention is not always necessary: 'vm_committed_as'
> only needs to be very precise when the strict OVERCOMMIT_NEVER policy
> is in effect, which is what requires a rather small batch number for
> the percpu counter.
> 
> So keep the 'batch' number unchanged for the strict OVERCOMMIT_NEVER
> policy, and lift it to 64X for the OVERCOMMIT_ALWAYS and
> OVERCOMMIT_GUESS policies. Also add a sysctl handler to readjust it
> when the policy is changed at runtime.
> 
> Benchmarking with the same testcase as [1] shows a 53% improvement on
> an 8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. We tested
> on the 0day platforms (server, desktop and laptop), and 80%+ of them
> show improvements with that test. Whether a platform shows an
> improvement depends on whether the test's mmap size is bigger than
> the computed batch number.
> 
> If the lift is 16X instead, 1/3 of the platforms show improvements,
> though it should help the mmap/unmap usage generally, as Michal Hocko
> mentioned:
> "
> I believe that there are non-synthetic workloads which would benefit
> from a larger batch. E.g. large in memory databases which do large
> mmaps during startups from multiple threads.
> "
> 

This needed some adjustments to overcommit_policy_handler() after
linux-next's 32927393dc1c ("sysctl: pass kernel pointers to
->proc_handler").  Relevant parts are below.

--- a/include/linux/mm.h~mm-adjust-vm_committed_as_batch-according-to-vm-overcommit-policy
+++ a/include/linux/mm.h
@@ -205,6 +205,8 @@ int overcommit_ratio_handler(struct ctl_
 		loff_t *);
 int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
 		loff_t *);
+int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
+		loff_t *);
 
 #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
 

--- a/mm/util.c~mm-adjust-vm_committed_as_batch-according-to-vm-overcommit-policy
+++ a/mm/util.c
@@ -746,6 +746,18 @@ int overcommit_ratio_handler(struct ctl_
 	return ret;
 }
 
+int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
+		size_t *lenp, loff_t *ppos)
+{
+	int ret;
+
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret == 0 && write)
+		mm_compute_batch();
+
+	return ret;
+}
+
 int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos)
 {
_



