* [PATCH v6 2/4] mm/util.c: make vm_memory_committed() more accurate
From: Feng Tang @ 2020-07-10 14:01 UTC (permalink / raw)
To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
Mel Gorman, Kees Cook, Qian Cai, Dennis Zhou, andi.kleen,
tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel
Cc: Feng Tang, K. Y. Srinivasan, Haiyang Zhang
percpu_counter_sum_positive() provides more accurate information.
With percpu_counter_read_positive(), the worst-case deviation could be
'batch * nr_cpus', which is totalram_pages/256 for now, and will grow
further when the batch gets enlarged.
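As an illustration only (a sketch, not part of the patch) of where that
bound comes from:

	/*
	 * percpu_counter_read_positive() only returns fbc->count, while
	 * each CPU may still hold up to 'batch' not-yet-folded pages in
	 * its local counter, so the fast read can be off by up to
	 * batch * nr_cpus.  With mm_compute_batch() currently sizing the
	 * batch at roughly totalram_pages / (nr_cpus * 256), that worst
	 * case is about totalram_pages / 256.
	 */
	s64 fast  = percpu_counter_read_positive(&vm_committed_as);
	s64 exact = percpu_counter_sum_positive(&vm_committed_as);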
Its time cost is about 800 nanoseconds on a 2C/4T platform and 2~3
microseconds on a 2S/36C/72T Skylake server in the normal case. In the
worst case, where vm_committed_as's spinlock is under severe contention,
it costs 30~40 microseconds on the 2S/36C/72T Skylake server, which should
be fine for its only two users: /proc/meminfo and the Hyper-V balloon
driver's once-per-second status trace.
Link: http://lkml.kernel.org/r/1592725000-73486-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com> # for /proc/meminfo
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/util.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/util.c b/mm/util.c
index c856f5f..d076218 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -787,10 +787,15 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
* balancing memory across competing virtual machines that are hosted.
* Several metrics drive this policy engine including the guest reported
* memory commitment.
+ *
+ * The time cost of this is very low for small platforms, and for a big
+ * platform like a 2S/36C/72T Skylake server, in the worst case where
+ * vm_committed_as's spinlock is under severe contention, the time cost
+ * could be about 30~40 microseconds.
*/
unsigned long vm_memory_committed(void)
{
- return percpu_counter_read_positive(&vm_committed_as);
+ return percpu_counter_sum_positive(&vm_committed_as);
}
EXPORT_SYMBOL_GPL(vm_memory_committed);
--
2.7.4
* [PATCH v6 3/4] percpu_counter: add percpu_counter_sync()
From: Feng Tang @ 2020-07-10 14:01 UTC (permalink / raw)
To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
Mel Gorman, Kees Cook, Qian Cai, Dennis Zhou, andi.kleen,
tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel
Cc: Feng Tang, Tejun Heo, Christoph Lameter
A percpu_counter's accuracy is related to its batch size. For a
percpu_counter with a big batch, its deviation could be big, so when the
counter's batch is changed at runtime to a smaller value for better
accuracy, there may also be a requirement to reduce the accumulated
deviation.
So add a percpu_counter sync function, to be run on each CPU.
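As a minimal usage sketch (mirroring how the last patch of this series
drives it; the helper name here is illustrative), the function is meant to
be run once on every CPU, e.g. via schedule_on_each_cpu():

	/* work callback: fold this CPU's local count back into fbc->count */
	static void sync_vm_committed_as(struct work_struct *dummy)
	{
		percpu_counter_sync(&vm_committed_as);
	}

	/* queue the callback on each online CPU and wait for completion */
	schedule_on_each_cpu(sync_vm_committed_as);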
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
---
include/linux/percpu_counter.h | 4 ++++
lib/percpu_counter.c | 19 +++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 0a4f54d..01861ee 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -44,6 +44,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
s32 batch);
s64 __percpu_counter_sum(struct percpu_counter *fbc);
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+void percpu_counter_sync(struct percpu_counter *fbc);
static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
{
@@ -172,6 +173,9 @@ static inline bool percpu_counter_initialized(struct percpu_counter *fbc)
return true;
}
+static inline void percpu_counter_sync(struct percpu_counter *fbc)
+{
+}
#endif /* CONFIG_SMP */
static inline void percpu_counter_inc(struct percpu_counter *fbc)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index a66595b..a2345de 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -99,6 +99,25 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
EXPORT_SYMBOL(percpu_counter_add_batch);
/*
+ * For a percpu_counter with a big batch, the deviation of its count could
+ * be big, and there may be a requirement to reduce the deviation, e.g. when
+ * the counter's batch is decreased at runtime to get better accuracy,
+ * which can be achieved by running this sync function on each CPU.
+ */
+void percpu_counter_sync(struct percpu_counter *fbc)
+{
+ unsigned long flags;
+ s64 count;
+
+ raw_spin_lock_irqsave(&fbc->lock, flags);
+ count = __this_cpu_read(*fbc->counters);
+ fbc->count += count;
+ __this_cpu_sub(*fbc->counters, count);
+ raw_spin_unlock_irqrestore(&fbc->lock, flags);
+}
+EXPORT_SYMBOL(percpu_counter_sync);
+
+/*
* Add up all the per-cpu counts, return the result. This is a more accurate
* but much slower version of percpu_counter_read_positive()
*/
--
2.7.4
* [PATCH v6 4/4] mm: adjust vm_committed_as_batch according to vm overcommit policy
From: Feng Tang @ 2020-07-10 14:01 UTC (permalink / raw)
To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
Mel Gorman, Kees Cook, Qian Cai, Dennis Zhou, andi.kleen,
tim.c.chen, dave.hansen, ying.huang, linux-mm, linux-kernel
Cc: Feng Tang
When checking a performance change for the will-it-scale scalability mmap
test [1], we found very high lock contention on the spinlock of the percpu
counter 'vm_committed_as':
94.14% 0.35% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
This heavy lock contention is not always necessary: 'vm_committed_as'
only needs to be very precise when the strict OVERCOMMIT_NEVER policy is
set, which requires a rather small batch number for the percpu counter.
So keep 'batch' number unchanged for strict OVERCOMMIT_NEVER policy, and
lift it to 64X for OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies. Also
add a sysctl handler to adjust it when the policy is reconfigured.
A benchmark with the same testcase in [1] shows a 53% improvement on an
8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server. We tested on the
0day test platforms (server, desktop and laptop), and 80%+ of the
platforms show improvements with that test. Whether a platform shows an
improvement depends on whether the test's mmap size is bigger than the
computed batch number.
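To make that concrete with purely illustrative numbers (assuming a
72-thread server with 128GB of RAM, i.e. ~32M 4KB pages):

	OVERCOMMIT_NEVER batch:         32M / 72 / 256 ~=   1820 pages (~7MB)
	OVERCOMMIT_GUESS/ALWAYS batch:  32M / 72 / 4   ~= 116500 pages (~455MB)

A single mmap() smaller than the per-CPU batch usually only updates the
local count, while one larger than the batch has to take the
vm_committed_as spinlock.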
If the lift is only 16X, 1/3 of the platforms show improvements, though it
should still help mmap/munmap usage generally, as Michal Hocko mentioned:
: I believe that there are non-synthetic workloads which would benefit from
: a larger batch. E.g. large in memory databases which do large mmaps
: during startups from multiple threads.
[1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Qian Cai <cai@lca.pw>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
---
include/linux/mm.h | 2 ++
include/linux/mman.h | 4 ++++
kernel/sysctl.c | 2 +-
mm/mm_init.c | 22 ++++++++++++++++------
mm/util.c | 41 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 64 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e529e90..678ea25 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,8 @@ int overcommit_ratio_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
int overcommit_kbytes_handler(struct ctl_table *, int, void *, size_t *,
loff_t *);
+int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
+ loff_t *);
#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 6733f2f..629cefc 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -57,8 +57,12 @@ extern struct percpu_counter vm_committed_as;
#ifdef CONFIG_SMP
extern s32 vm_committed_as_batch;
+extern void mm_compute_batch(int overcommit_policy);
#else
#define vm_committed_as_batch 0
+static inline void mm_compute_batch(int overcommit_policy)
+{
+}
#endif
unsigned long vm_memory_committed(void);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8dca889..51ad1aa 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2664,7 +2664,7 @@ static struct ctl_table vm_table[] = {
.data = &sysctl_overcommit_memory,
.maxlen = sizeof(sysctl_overcommit_memory),
.mode = 0644,
- .proc_handler = proc_dointvec_minmax,
+ .proc_handler = overcommit_policy_handler,
.extra1 = SYSCTL_ZERO,
.extra2 = &two,
},
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 435e5f7..b06a30f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -13,6 +13,7 @@
#include <linux/memory.h>
#include <linux/notifier.h>
#include <linux/sched.h>
+#include <linux/mman.h>
#include "internal.h"
#ifdef CONFIG_DEBUG_MEMORY_INIT
@@ -144,14 +145,23 @@ EXPORT_SYMBOL_GPL(mm_kobj);
#ifdef CONFIG_SMP
s32 vm_committed_as_batch = 32;
-static void __meminit mm_compute_batch(void)
+void mm_compute_batch(int overcommit_policy)
{
u64 memsized_batch;
s32 nr = num_present_cpus();
s32 batch = max_t(s32, nr*2, 32);
-
- /* batch size set to 0.4% of (total memory/#cpus), or max int32 */
- memsized_batch = min_t(u64, (totalram_pages()/nr)/256, 0x7fffffff);
+ unsigned long ram_pages = totalram_pages();
+
+ /*
+ * For policy OVERCOMMIT_NEVER, set the batch size to 0.4% of
+ * (total memory/#cpus), and lift it to 25% for other policies to
+ * ease the possible lock contention on the percpu_counter
+ * vm_committed_as, while the max limit is INT_MAX.
+ */
+ if (overcommit_policy == OVERCOMMIT_NEVER)
+ memsized_batch = min_t(u64, ram_pages/nr/256, INT_MAX);
+ else
+ memsized_batch = min_t(u64, ram_pages/nr/4, INT_MAX);
vm_committed_as_batch = max_t(s32, memsized_batch, batch);
}
@@ -162,7 +172,7 @@ static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
switch (action) {
case MEM_ONLINE:
case MEM_OFFLINE:
- mm_compute_batch();
+ mm_compute_batch(sysctl_overcommit_memory);
default:
break;
}
@@ -176,7 +186,7 @@ static struct notifier_block compute_batch_nb __meminitdata = {
static int __init mm_compute_batch_init(void)
{
- mm_compute_batch();
+ mm_compute_batch(sysctl_overcommit_memory);
register_hotmemory_notifier(&compute_batch_nb);
return 0;
diff --git a/mm/util.c b/mm/util.c
index d076218..79c965d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -746,6 +746,47 @@ int overcommit_ratio_handler(struct ctl_table *table, int write, void *buffer,
return ret;
}
+static void sync_overcommit_as(struct work_struct *dummy)
+{
+ percpu_counter_sync(&vm_committed_as);
+}
+
+int overcommit_policy_handler(struct ctl_table *table, int write, void *buffer,
+ size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table t;
+ int new_policy;
+ int ret;
+
+ /*
+ * The deviation of vm_committed_as could be big with a loose policy
+ * like OVERCOMMIT_ALWAYS/OVERCOMMIT_GUESS. When changing the policy
+ * to strict OVERCOMMIT_NEVER, we need to reduce the deviation to
+ * comply with the strict "NEVER", and to avoid a possible race
+ * condition (even though users usually won't switch to
+ * OVERCOMMIT_NEVER very frequently), the switch is done in the
+ * following order:
+ * 1. change the batch
+ * 2. sync the percpu count on each CPU
+ * 3. switch the policy
+ */
+ if (write) {
+ t = *table;
+ t.data = &new_policy;
+ ret = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+ if (ret)
+ return ret;
+
+ mm_compute_batch(new_policy);
+ if (new_policy == OVERCOMMIT_NEVER)
+ schedule_on_each_cpu(sync_overcommit_as);
+ sysctl_overcommit_memory = new_policy;
+ } else {
+ ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ }
+
+ return ret;
+}
+
int overcommit_kbytes_handler(struct ctl_table *table, int write, void *buffer,
size_t *lenp, loff_t *ppos)
{
--
2.7.4