* [PATCH 1/3] mm: move max_map_count bits into mm.h
@ 2016-02-10 14:52 ` Andrey Ryabinin
  0 siblings, 0 replies; 30+ messages in thread
From: Andrey Ryabinin @ 2016-02-10 14:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm; +Cc: linux-kernel, Andrey Ryabinin

The max_map_count sysctl is unrelated to the scheduler. Move its bits from
include/linux/sched/sysctl.h to include/linux/mm.h.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 include/linux/mm.h           | 21 +++++++++++++++++++++
 include/linux/sched/sysctl.h | 21 ---------------------
 mm/mmap.c                    |  1 -
 mm/mremap.c                  |  1 -
 mm/nommu.c                   |  1 -
 5 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 516e149..979bc83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -82,6 +82,27 @@ extern int mmap_rnd_compat_bits __read_mostly;
 #define mm_forbids_zeropage(X)	(0)
 #endif
 
+/*
+ * Default maximum number of active map areas, this limits the number of vmas
+ * per mm struct. Users can overwrite this number by sysctl but there is a
+ * problem.
+ *
+ * When a program's coredump is generated as ELF format, a section is created
+ * per a vma. In ELF, the number of sections is represented in unsigned short.
+ * This means the number of sections should be smaller than 65535 at coredump.
+ * Because the kernel adds some informative sections to a image of program at
+ * generating coredump, we need some margin. The number of extra sections is
+ * 1-3 now and depends on arch. We use "5" as safe margin, here.
+ *
+ * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
+ * not a hard limit any more. Although some userspace tools can be surprised by
+ * that.
+ */
+#define MAPCOUNT_ELF_CORE_MARGIN	(5)
+#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
+
+extern int sysctl_max_map_count;
+
 extern unsigned long sysctl_user_reserve_kbytes;
 extern unsigned long sysctl_admin_reserve_kbytes;
 
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..5f7d334 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -14,27 +14,6 @@ extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 enum { sysctl_hung_task_timeout_secs = 0 };
 #endif
 
-/*
- * Default maximum number of active map areas, this limits the number of vmas
- * per mm struct. Users can overwrite this number by sysctl but there is a
- * problem.
- *
- * When a program's coredump is generated as ELF format, a section is created
- * per a vma. In ELF, the number of sections is represented in unsigned short.
- * This means the number of sections should be smaller than 65535 at coredump.
- * Because the kernel adds some informative sections to a image of program at
- * generating coredump, we need some margin. The number of extra sections is
- * 1-3 now and depends on arch. We use "5" as safe margin, here.
- *
- * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
- * not a hard limit any more. Although some userspace tools can be surprised by
- * that.
- */
-#define MAPCOUNT_ELF_CORE_MARGIN	(5)
-#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-
-extern int sysctl_max_map_count;
-
 extern unsigned int sysctl_sched_latency;
 extern unsigned int sysctl_sched_min_granularity;
 extern unsigned int sysctl_sched_wakeup_granularity;
diff --git a/mm/mmap.c b/mm/mmap.c
index 2f2415a..4afb2a2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -37,7 +37,6 @@
 #include <linux/khugepaged.h>
 #include <linux/uprobes.h>
 #include <linux/rbtree_augmented.h>
-#include <linux/sched/sysctl.h>
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
diff --git a/mm/mremap.c b/mm/mremap.c
index d77946a..e8329d1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -20,7 +20,6 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/mmu_notifier.h>
-#include <linux/sched/sysctl.h>
 #include <linux/uaccess.h>
 #include <linux/mm-arch-hooks.h>
 
diff --git a/mm/nommu.c b/mm/nommu.c
index fbf6f0f1..9bdf8b1 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -33,7 +33,6 @@
 #include <linux/security.h>
 #include <linux/syscalls.h>
 #include <linux/audit.h>
-#include <linux/sched/sysctl.h>
 #include <linux/printk.h>
 
 #include <asm/uaccess.h>
-- 
2.4.10
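
As an aside, a minimal userspace sketch (not part of the patch) showing the
arithmetic behind DEFAULT_MAX_MAP_COUNT from the comment above and how to read
the value currently configured through the vm.max_map_count sysctl; the /proc
path is the usual one, the rest is illustrative:

	/* Prints the default implied above (USHRT_MAX - 5 = 65530) and the
	 * value currently set in /proc/sys/vm/max_map_count. */
	#include <limits.h>
	#include <stdio.h>

	#define MAPCOUNT_ELF_CORE_MARGIN	5
	#define DEFAULT_MAX_MAP_COUNT	(USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)

	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
		long cur = -1;

		if (f) {
			if (fscanf(f, "%ld", &cur) != 1)
				cur = -1;
			fclose(f);
		}
		printf("default max_map_count: %d\n", DEFAULT_MAX_MAP_COUNT);
		printf("current max_map_count: %ld\n", cur);
		return 0;
	}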

* [PATCH 2/3] mm: deduplicate memory overcommitment code
  2016-02-10 14:52 ` Andrey Ryabinin
@ 2016-02-10 14:52   ` Andrey Ryabinin
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrey Ryabinin @ 2016-02-10 14:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm; +Cc: linux-kernel, Andrey Ryabinin

Currently we have two copies of the same code implementing the memory
overcommitment logic. Move it into mm/util.c to avoid the duplication.
No functional changes.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 mm/mmap.c  | 124 -------------------------------------------------------------
 mm/nommu.c | 116 ---------------------------------------------------------
 mm/util.c  | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 124 insertions(+), 240 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 4afb2a2..f088c60 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -122,130 +122,6 @@ void vma_set_page_prot(struct vm_area_struct *vma)
 	}
 }
 
-
-int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;  /* heuristic overcommit */
-int sysctl_overcommit_ratio __read_mostly = 50;	/* default is 50% */
-unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
-unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
-/*
- * Make sure vm_committed_as in one cacheline and not cacheline shared with
- * other variables. It can be updated by several CPUs frequently.
- */
-struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
-
-/*
- * The global memory commitment made in the system can be a metric
- * that can be used to drive ballooning decisions when Linux is hosted
- * as a guest. On Hyper-V, the host implements a policy engine for dynamically
- * balancing memory across competing virtual machines that are hosted.
- * Several metrics drive this policy engine including the guest reported
- * memory commitment.
- */
-unsigned long vm_memory_committed(void)
-{
-	return percpu_counter_read_positive(&vm_committed_as);
-}
-EXPORT_SYMBOL_GPL(vm_memory_committed);
-
-/*
- * Check that a process has enough memory to allocate a new virtual
- * mapping. 0 means there is enough memory for the allocation to
- * succeed and -ENOMEM implies there is not.
- *
- * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
- *
- * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
- * Additional code 2002 Jul 20 by Robert Love.
- *
- * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
- *
- * Note this is a helper function intended to be used by LSMs which
- * wish to use this logic.
- */
-int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
-{
-	long free, allowed, reserve;
-
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus(),
-			"memory commitment underflow");
-
-	vm_acct_memory(pages);
-
-	/*
-	 * Sometimes we want to use more memory than we have
-	 */
-	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
-		return 0;
-
-	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
-		free = global_page_state(NR_FREE_PAGES);
-		free += global_page_state(NR_FILE_PAGES);
-
-		/*
-		 * shmem pages shouldn't be counted as free in this
-		 * case, they can't be purged, only swapped out, and
-		 * that won't affect the overall amount of available
-		 * memory in the system.
-		 */
-		free -= global_page_state(NR_SHMEM);
-
-		free += get_nr_swap_pages();
-
-		/*
-		 * Any slabs which are created with the
-		 * SLAB_RECLAIM_ACCOUNT flag claim to have contents
-		 * which are reclaimable, under pressure.  The dentry
-		 * cache and most inode caches should fall into this
-		 */
-		free += global_page_state(NR_SLAB_RECLAIMABLE);
-
-		/*
-		 * Leave reserved pages. The pages are not for anonymous pages.
-		 */
-		if (free <= totalreserve_pages)
-			goto error;
-		else
-			free -= totalreserve_pages;
-
-		/*
-		 * Reserve some for root
-		 */
-		if (!cap_sys_admin)
-			free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
-		if (free > pages)
-			return 0;
-
-		goto error;
-	}
-
-	allowed = vm_commit_limit();
-	/*
-	 * Reserve some for root
-	 */
-	if (!cap_sys_admin)
-		allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
-	/*
-	 * Don't let a single process grow so big a user can't recover
-	 */
-	if (mm) {
-		reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
-		allowed -= min_t(long, mm->total_vm / 32, reserve);
-	}
-
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
-		return 0;
-error:
-	vm_unacct_memory(pages);
-
-	return -ENOMEM;
-}
-
 /*
  * Requires inode->i_mapping->i_mmap_rwsem
  */
diff --git a/mm/nommu.c b/mm/nommu.c
index 9bdf8b1..6402f27 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -47,33 +47,11 @@ struct page *mem_map;
 unsigned long max_mapnr;
 EXPORT_SYMBOL(max_mapnr);
 unsigned long highest_memmap_pfn;
-struct percpu_counter vm_committed_as;
-int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
-unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count = DEFAULT_MAX_MAP_COUNT;
 int sysctl_nr_trim_pages = CONFIG_NOMMU_INITIAL_TRIM_EXCESS;
-unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
-unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
 int heap_stack_gap = 0;
 
 atomic_long_t mmap_pages_allocated;
 
-/*
- * The global memory commitment made in the system can be a metric
- * that can be used to drive ballooning decisions when Linux is hosted
- * as a guest. On Hyper-V, the host implements a policy engine for dynamically
- * balancing memory across competing virtual machines that are hosted.
- * Several metrics drive this policy engine including the guest reported
- * memory commitment.
- */
-unsigned long vm_memory_committed(void)
-{
-	return percpu_counter_read_positive(&vm_committed_as);
-}
-
-EXPORT_SYMBOL_GPL(vm_memory_committed);
-
 EXPORT_SYMBOL(mem_map);
 
 /* list of mapped, potentially shareable regions */
@@ -1828,100 +1806,6 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
-/*
- * Check that a process has enough memory to allocate a new virtual
- * mapping. 0 means there is enough memory for the allocation to
- * succeed and -ENOMEM implies there is not.
- *
- * We currently support three overcommit policies, which are set via the
- * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
- *
- * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
- * Additional code 2002 Jul 20 by Robert Love.
- *
- * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
- *
- * Note this is a helper function intended to be used by LSMs which
- * wish to use this logic.
- */
-int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
-{
-	long free, allowed, reserve;
-
-	vm_acct_memory(pages);
-
-	/*
-	 * Sometimes we want to use more memory than we have
-	 */
-	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
-		return 0;
-
-	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
-		free = global_page_state(NR_FREE_PAGES);
-		free += global_page_state(NR_FILE_PAGES);
-
-		/*
-		 * shmem pages shouldn't be counted as free in this
-		 * case, they can't be purged, only swapped out, and
-		 * that won't affect the overall amount of available
-		 * memory in the system.
-		 */
-		free -= global_page_state(NR_SHMEM);
-
-		free += get_nr_swap_pages();
-
-		/*
-		 * Any slabs which are created with the
-		 * SLAB_RECLAIM_ACCOUNT flag claim to have contents
-		 * which are reclaimable, under pressure.  The dentry
-		 * cache and most inode caches should fall into this
-		 */
-		free += global_page_state(NR_SLAB_RECLAIMABLE);
-
-		/*
-		 * Leave reserved pages. The pages are not for anonymous pages.
-		 */
-		if (free <= totalreserve_pages)
-			goto error;
-		else
-			free -= totalreserve_pages;
-
-		/*
-		 * Reserve some for root
-		 */
-		if (!cap_sys_admin)
-			free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
-		if (free > pages)
-			return 0;
-
-		goto error;
-	}
-
-	allowed = vm_commit_limit();
-	/*
-	 * Reserve some 3% for root
-	 */
-	if (!cap_sys_admin)
-		allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
-
-	/*
-	 * Don't let a single process grow so big a user can't recover
-	 */
-	if (mm) {
-		reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
-		allowed -= min_t(long, mm->total_vm / 32, reserve);
-	}
-
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
-		return 0;
-
-error:
-	vm_unacct_memory(pages);
-
-	return -ENOMEM;
-}
-
 int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
 	BUG();
diff --git a/mm/util.c b/mm/util.c
index 4fb14ca..47a57e5 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -396,6 +396,13 @@ int __page_mapcount(struct page *page)
 }
 EXPORT_SYMBOL_GPL(__page_mapcount);
 
+int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
+int sysctl_overcommit_ratio __read_mostly = 50;
+unsigned long sysctl_overcommit_kbytes __read_mostly;
+int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
+unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
+unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
+
 int overcommit_ratio_handler(struct ctl_table *table, int write,
 			     void __user *buffer, size_t *lenp,
 			     loff_t *ppos)
@@ -437,6 +444,123 @@ unsigned long vm_commit_limit(void)
 	return allowed;
 }
 
+/*
+ * Make sure vm_committed_as in one cacheline and not cacheline shared with
+ * other variables. It can be updated by several CPUs frequently.
+ */
+struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
+
+/*
+ * The global memory commitment made in the system can be a metric
+ * that can be used to drive ballooning decisions when Linux is hosted
+ * as a guest. On Hyper-V, the host implements a policy engine for dynamically
+ * balancing memory across competing virtual machines that are hosted.
+ * Several metrics drive this policy engine including the guest reported
+ * memory commitment.
+ */
+unsigned long vm_memory_committed(void)
+{
+	return percpu_counter_read_positive(&vm_committed_as);
+}
+EXPORT_SYMBOL_GPL(vm_memory_committed);
+
+/*
+ * Check that a process has enough memory to allocate a new virtual
+ * mapping. 0 means there is enough memory for the allocation to
+ * succeed and -ENOMEM implies there is not.
+ *
+ * We currently support three overcommit policies, which are set via the
+ * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
+ *
+ * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
+ * Additional code 2002 Jul 20 by Robert Love.
+ *
+ * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
+ *
+ * Note this is a helper function intended to be used by LSMs which
+ * wish to use this logic.
+ */
+int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
+{
+	long free, allowed, reserve;
+
+	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
+			-(s64)vm_committed_as_batch * num_online_cpus(),
+			"memory commitment underflow");
+
+	vm_acct_memory(pages);
+
+	/*
+	 * Sometimes we want to use more memory than we have
+	 */
+	if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
+		return 0;
+
+	if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
+		free = global_page_state(NR_FREE_PAGES);
+		free += global_page_state(NR_FILE_PAGES);
+
+		/*
+		 * shmem pages shouldn't be counted as free in this
+		 * case, they can't be purged, only swapped out, and
+		 * that won't affect the overall amount of available
+		 * memory in the system.
+		 */
+		free -= global_page_state(NR_SHMEM);
+
+		free += get_nr_swap_pages();
+
+		/*
+		 * Any slabs which are created with the
+		 * SLAB_RECLAIM_ACCOUNT flag claim to have contents
+		 * which are reclaimable, under pressure.  The dentry
+		 * cache and most inode caches should fall into this
+		 */
+		free += global_page_state(NR_SLAB_RECLAIMABLE);
+
+		/*
+		 * Leave reserved pages. The pages are not for anonymous pages.
+		 */
+		if (free <= totalreserve_pages)
+			goto error;
+		else
+			free -= totalreserve_pages;
+
+		/*
+		 * Reserve some for root
+		 */
+		if (!cap_sys_admin)
+			free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
+
+		if (free > pages)
+			return 0;
+
+		goto error;
+	}
+
+	allowed = vm_commit_limit();
+	/*
+	 * Reserve some for root
+	 */
+	if (!cap_sys_admin)
+		allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
+
+	/*
+	 * Don't let a single process grow so big a user can't recover
+	 */
+	if (mm) {
+		reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
+		allowed -= min_t(long, mm->total_vm / 32, reserve);
+	}
+
+	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+		return 0;
+error:
+	vm_unacct_memory(pages);
+
+	return -ENOMEM;
+}
+
 /**
  * get_cmdline() - copy the cmdline value to a buffer.
  * @task:     the task whose cmdline value to copy.
-- 
2.4.10

* [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-10 14:52 ` Andrey Ryabinin
@ 2016-02-10 14:52   ` Andrey Ryabinin
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrey Ryabinin @ 2016-02-10 14:52 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: linux-kernel, Andrey Ryabinin, Andi Kleen, Tim Chen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

Currently we use a percpu_counter to account committed memory. Changing
the committed memory by more than vm_committed_as_batch pages takes the
counter's spinlock. The batch size is quite small - from 32 pages up to
0.4% of memory per CPU (usually several MBs even on large machines).

So map/munmap of several MBs of anonymous memory in multiple processes
leads to high contention on that spinlock.

Instead of a percpu_counter we could use ordinary per-cpu variables.
A simple test case (8 processes running map/munmap of 4MB,
vm_committed_as_batch = 2MB on the test setup) showed a 2.5x performance
improvement.

The downside of this approach is that vm_memory_committed() becomes
slower. However, that doesn't matter much since it is usually not on a
hot path. The only exception is __vm_enough_memory() with overcommit set
to OVERCOMMIT_NEVER; in that case the brk1 test from the will-it-scale
benchmark shows a 1.1x - 1.3x performance regression.

So I think it's a good tradeoff: significantly increased scalability for
the price of some overhead in vm_memory_committed().
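
For illustration, a minimal sketch of the kind of test case described above;
the iteration count, mmap flags, and absence of timing code are assumptions,
not the exact benchmark setup:

	/* N processes repeatedly map and unmap 4MB of anonymous memory,
	 * stressing the committed-memory accounting path. */
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define NPROC	8
	#define SIZE	(4UL << 20)	/* 4MB, above vm_committed_as_batch */
	#define ITERS	100000		/* assumed iteration count */

	static void worker(void)
	{
		for (int i = 0; i < ITERS; i++) {
			void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				exit(1);
			munmap(p, SIZE);
		}
		exit(0);
	}

	int main(void)
	{
		for (int i = 0; i < NPROC; i++)
			if (fork() == 0)
				worker();
		for (int i = 0; i < NPROC; i++)
			wait(NULL);
		return 0;
	}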

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
---
 fs/proc/meminfo.c    |  2 +-
 include/linux/mm.h   |  4 ++++
 include/linux/mman.h | 13 +++----------
 mm/mm_init.c         | 45 ---------------------------------------------
 mm/mmap.c            | 11 -----------
 mm/nommu.c           |  4 ----
 mm/util.c            | 20 ++++++++------------
 7 files changed, 16 insertions(+), 83 deletions(-)

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index df4661a..f30e387 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -41,7 +41,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = percpu_counter_read_positive(&vm_committed_as);
+	committed = vm_memory_committed();
 
 	cached = global_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 979bc83..82dac6e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1881,7 +1881,11 @@ extern void memmap_init_zone(unsigned long, int, unsigned long,
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
+#ifdef CONFIG_MMU
+static inline void mmap_init(void) {}
+#else
 extern void __init mmap_init(void);
+#endif
 extern void show_mem(unsigned int flags);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..436ab11 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -2,7 +2,7 @@
 #define _LINUX_MMAN_H
 
 #include <linux/mm.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu.h>
 
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
@@ -10,19 +10,12 @@
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
-extern struct percpu_counter vm_committed_as;
-
-#ifdef CONFIG_SMP
-extern s32 vm_committed_as_batch;
-#else
-#define vm_committed_as_batch 0
-#endif
-
 unsigned long vm_memory_committed(void);
+DECLARE_PER_CPU(int, vm_committed_as);
 
 static inline void vm_acct_memory(long pages)
 {
-	__percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
+	this_cpu_add(vm_committed_as, pages);
 }
 
 static inline void vm_unacct_memory(long pages)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fdadf91..d96c71f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -142,51 +142,6 @@ early_param("mminit_loglevel", set_mminit_loglevel);
 struct kobject *mm_kobj;
 EXPORT_SYMBOL_GPL(mm_kobj);
 
-#ifdef CONFIG_SMP
-s32 vm_committed_as_batch = 32;
-
-static void __meminit mm_compute_batch(void)
-{
-	u64 memsized_batch;
-	s32 nr = num_present_cpus();
-	s32 batch = max_t(s32, nr*2, 32);
-
-	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
-	memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff);
-
-	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
-}
-
-static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
-					unsigned long action, void *arg)
-{
-	switch (action) {
-	case MEM_ONLINE:
-	case MEM_OFFLINE:
-		mm_compute_batch();
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block compute_batch_nb __meminitdata = {
-	.notifier_call = mm_compute_batch_notifier,
-	.priority = IPC_CALLBACK_PRI, /* use lowest priority */
-};
-
-static int __init mm_compute_batch_init(void)
-{
-	mm_compute_batch();
-	register_hotmemory_notifier(&compute_batch_nb);
-
-	return 0;
-}
-
-__initcall(mm_compute_batch_init);
-
-#endif
-
 static int __init mm_sysfs_init(void)
 {
 	mm_kobj = kobject_create_and_add("mm", kernel_kobj);
diff --git a/mm/mmap.c b/mm/mmap.c
index f088c60..c796d73 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3184,17 +3184,6 @@ void mm_drop_all_locks(struct mm_struct *mm)
 }
 
 /*
- * initialise the VMA slab
- */
-void __init mmap_init(void)
-{
-	int ret;
-
-	ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
-	VM_BUG_ON(ret);
-}
-
-/*
  * Initialise sysctl_user_reserve_kbytes.
  *
  * This is intended to prevent a user from starting a single memory hogging
diff --git a/mm/nommu.c b/mm/nommu.c
index 6402f27..2d52dbc 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -533,10 +533,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
  */
 void __init mmap_init(void)
 {
-	int ret;
-
-	ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
-	VM_BUG_ON(ret);
 	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT);
 }
 
diff --git a/mm/util.c b/mm/util.c
index 47a57e5..418e68f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -402,6 +402,7 @@ unsigned long sysctl_overcommit_kbytes __read_mostly;
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
+DEFINE_PER_CPU(int, vm_committed_as);
 
 int overcommit_ratio_handler(struct ctl_table *table, int write,
 			     void __user *buffer, size_t *lenp,
@@ -445,12 +446,6 @@ unsigned long vm_commit_limit(void)
 }
 
 /*
- * Make sure vm_committed_as in one cacheline and not cacheline shared with
- * other variables. It can be updated by several CPUs frequently.
- */
-struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
-
-/*
  * The global memory commitment made in the system can be a metric
  * that can be used to drive ballooning decisions when Linux is hosted
  * as a guest. On Hyper-V, the host implements a policy engine for dynamically
@@ -460,7 +455,12 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	int cpu, sum = 0;
+
+	for_each_possible_cpu(cpu)
+		sum += *per_cpu_ptr(&vm_committed_as, cpu);
+
+	return sum < 0 ? 0 : sum;
 }
 EXPORT_SYMBOL_GPL(vm_memory_committed);
 
@@ -484,10 +484,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long free, allowed, reserve;
 
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus(),
-			"memory commitment underflow");
-
 	vm_acct_memory(pages);
 
 	/*
@@ -553,7 +549,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
 	}
 
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+	if (vm_memory_committed() < allowed)
 		return 0;
 error:
 	vm_unacct_memory(pages);
-- 
2.4.10

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-10 14:52   ` Andrey Ryabinin
@ 2016-02-10 17:46     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 30+ messages in thread
From: Konstantin Khlebnikov @ 2016-02-10 17:46 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, Andi Kleen,
	Tim Chen, Mel Gorman, Vladimir Davydov

On Wed, Feb 10, 2016 at 5:52 PM, Andrey Ryabinin
<aryabinin@virtuozzo.com> wrote:
> Currently we use a percpu_counter to account committed memory. Changing
> the committed memory by more than vm_committed_as_batch pages takes the
> counter's spinlock. The batch size is quite small - from 32 pages up to
> 0.4% of memory per CPU (usually several MBs even on large machines).
>
> So map/munmap of several MBs of anonymous memory in multiple processes
> leads to high contention on that spinlock.
>
> Instead of a percpu_counter we could use ordinary per-cpu variables.
> A simple test case (8 processes running map/munmap of 4MB,
> vm_committed_as_batch = 2MB on the test setup) showed a 2.5x performance
> improvement.
>
> The downside of this approach is that vm_memory_committed() becomes
> slower. However, that doesn't matter much since it is usually not on a
> hot path. The only exception is __vm_enough_memory() with overcommit set
> to OVERCOMMIT_NEVER; in that case the brk1 test from the will-it-scale
> benchmark shows a 1.1x - 1.3x performance regression.
>
> So I think it's a good tradeoff: significantly increased scalability for
> the price of some overhead in vm_memory_committed().

I think that's a no-go: a 30% regression on your not-so-big machine.
For 4096 cores the regression will be enormous. Link: https://xkcd.com/619/

There are already three per-cpu page counters: in memcg, in vmstat, and
this one. Maybe more. And zero universal fast resource counters with a
quota and a per-cpu fast path.
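
For concreteness, a rough sketch of the kind of counter described here - a
shared limit with a per-CPU fast path and batched flushes. This is purely
illustrative and not an existing kernel API; the limit check is approximate by
up to batch * nr_cpus, which is exactly the accuracy/scalability tradeoff
being discussed:

	#include <linux/atomic.h>
	#include <linux/errno.h>
	#include <linux/percpu.h>

	/* c->cache is allocated with alloc_percpu(long) at init (not shown). */
	struct fast_res_counter {
		atomic_long_t	used;	/* global, slow path */
		long		limit;
		long		batch;	/* per-CPU slack */
		long __percpu	*cache;	/* not yet flushed to 'used' */
	};

	static int fast_res_charge(struct fast_res_counter *c, long pages)
	{
		long *cached = get_cpu_ptr(c->cache);
		int ret = 0;

		*cached += pages;
		if (*cached > c->batch || *cached < -c->batch) {
			/* flush the local delta and check the limit globally */
			if (atomic_long_add_return(*cached, &c->used) > c->limit)
				ret = -ENOMEM;	/* caller rolls back the charge */
			*cached = 0;
		}
		put_cpu_ptr(c->cache);
		return ret;
	}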

> -       VM_BUG_ON(ret);
>         vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT);
>  }
>
> diff --git a/mm/util.c b/mm/util.c
> index 47a57e5..418e68f 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -402,6 +402,7 @@ unsigned long sysctl_overcommit_kbytes __read_mostly;
>  int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
>  unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
>  unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
> +DEFINE_PER_CPU(int, vm_committed_as);
>
>  int overcommit_ratio_handler(struct ctl_table *table, int write,
>                              void __user *buffer, size_t *lenp,
> @@ -445,12 +446,6 @@ unsigned long vm_commit_limit(void)
>  }
>
>  /*
> - * Make sure vm_committed_as in one cacheline and not cacheline shared with
> - * other variables. It can be updated by several CPUs frequently.
> - */
> -struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
> -
> -/*
>   * The global memory commitment made in the system can be a metric
>   * that can be used to drive ballooning decisions when Linux is hosted
>   * as a guest. On Hyper-V, the host implements a policy engine for dynamically
> @@ -460,7 +455,12 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
>   */
>  unsigned long vm_memory_committed(void)
>  {
> -       return percpu_counter_read_positive(&vm_committed_as);
> +       int cpu, sum = 0;
> +
> +       for_each_possible_cpu(cpu)
> +               sum += *per_cpu_ptr(&vm_committed_as, cpu);
> +
> +       return sum < 0 ? 0 : sum;
>  }
>  EXPORT_SYMBOL_GPL(vm_memory_committed);
>
> @@ -484,10 +484,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  {
>         long free, allowed, reserve;
>
> -       VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> -                       -(s64)vm_committed_as_batch * num_online_cpus(),
> -                       "memory commitment underflow");
> -
>         vm_acct_memory(pages);
>
>         /*
> @@ -553,7 +549,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>                 allowed -= min_t(long, mm->total_vm / 32, reserve);
>         }
>
> -       if (percpu_counter_read_positive(&vm_committed_as) < allowed)
> +       if (vm_memory_committed() < allowed)
>                 return 0;
>  error:
>         vm_unacct_memory(pages);
> --
> 2.4.10
>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-10 14:52   ` Andrey Ryabinin
@ 2016-02-10 18:00     ` Tim Chen
  -1 siblings, 0 replies; 30+ messages in thread
From: Tim Chen @ 2016-02-10 18:00 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

On Wed, 2016-02-10 at 17:52 +0300, Andrey Ryabinin wrote:
> Currently we use percpu_counter for accounting committed memory. Change
> of committed memory on more than vm_committed_as_batch pages leads to
> grab of counter's spinlock. The batch size is quite small - from 32 pages
> up to 0.4% of the memory/cpu (usually several MBs even on large machines).
> 
> So map/munmap of several MBs anonymous memory in multiple processes leads
> to high contention on that spinlock.
> 
> Instead of percpu_counter we could use ordinary per-cpu variables.
> Dump test case (8-proccesses running map/munmap of 4MB,
> vm_committed_as_batch = 2MB on test setup) showed 2.5x performance
> improvement.
> 
> The downside of this approach is slowdown of vm_memory_committed().
> However, it doesn't matter much since it usually is not in a hot path.
> The only exception is __vm_enough_memory() with overcommit set to
> OVERCOMMIT_NEVER. In that case brk1 test from will-it-scale benchmark
> shows 1.1x - 1.3x performance regression.
> 
> So I think it's a good tradeoff. We've got significantly increased
> scalability for the price of some overhead in vm_memory_committed().

It is a trade-off between the counter read speed and the counter update
speed.  With this change, reading the counter is slower because we need to
sum over all the CPUs each time we need the counter value.  So this read
overhead will grow with the number of CPUs and may not be a good trade-off
in that case.
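
As a rough userspace model of that read-side cost (a sketch, not the kernel
code): the precise read has to walk every possible CPU, so it costs
O(nr_cpus) per call, while percpu_counter_read() is a single load of the
cached central value.

#include <stdio.h>

#define NR_CPUS	8192			/* think of a large machine */

static long percpu_delta[NR_CPUS];	/* per-cpu deltas, cheap to update */
static long cached_global;		/* percpu_counter's central count */

/* precise read: O(NR_CPUS) work on every call */
static long read_precise(void)
{
	long sum = cached_global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += percpu_delta[cpu];
	return sum < 0 ? 0 : sum;
}

/* approximate read: one load; the error is bounded by batch * NR_CPUS */
static long read_fast(void)
{
	return cached_global < 0 ? 0 : cached_global;
}

int main(void)
{
	cached_global = 100;
	percpu_delta[3] = 42;
	printf("precise=%ld fast=%ld\n", read_precise(), read_fast());
	return 0;
}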

I wonder if you have tried tweaking the batch size of the per-cpu counter
to make it a little larger?

Tim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-10 18:00     ` Tim Chen
@ 2016-02-10 21:28       ` Andrew Morton
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2016-02-10 21:28 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrey Ryabinin, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

On Wed, 10 Feb 2016 10:00:53 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> On Wed, 2016-02-10 at 17:52 +0300, Andrey Ryabinin wrote:
> > Currently we use percpu_counter for accounting committed memory. Change
> > of committed memory on more than vm_committed_as_batch pages leads to
> > grab of counter's spinlock. The batch size is quite small - from 32 pages
> > up to 0.4% of the memory/cpu (usually several MBs even on large machines).
> > 
> > So map/munmap of several MBs anonymous memory in multiple processes leads
> > to high contention on that spinlock.
> > 
> > Instead of percpu_counter we could use ordinary per-cpu variables.
> > Dump test case (8-proccesses running map/munmap of 4MB,
> > vm_committed_as_batch = 2MB on test setup) showed 2.5x performance
> > improvement.
> > 
> > The downside of this approach is slowdown of vm_memory_committed().
> > However, it doesn't matter much since it usually is not in a hot path.
> > The only exception is __vm_enough_memory() with overcommit set to
> > OVERCOMMIT_NEVER. In that case brk1 test from will-it-scale benchmark
> > shows 1.1x - 1.3x performance regression.
> > 
> > So I think it's a good tradeoff. We've got significantly increased
> > scalability for the price of some overhead in vm_memory_committed().
> 
> It is a trade-off between the counter read speed and the counter update
> speed.  With this change, reading the counter is slower because we need to
> sum over all the CPUs each time we need the counter value.  So this read
> overhead will grow with the number of CPUs and may not be a good trade-off
> in that case.
> 
> I wonder if you have tried tweaking the batch size of the per-cpu counter
> to make it a little larger?

If a process is unmapping 4MB then it's pretty crazy for us to be
hitting the percpu_counter 32 separate times for that single operation.

Is there some way in which we can batch up the modifications within the
caller and update the counter less frequently?  Perhaps even in a
single hit?
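
One way to picture that (a hypothetical sketch; acct_batch and its helpers
are not existing kernel functions, and vm_unacct_memory() is stubbed here):
the caller accumulates the whole operation's delta locally and charges the
shared counter once when it is done.

#include <stdio.h>

static long vm_committed_as;			/* stand-in for the shared counter */

static void vm_unacct_memory(long pages)	/* one hit on shared state */
{
	vm_committed_as -= pages;
}

struct acct_batch {
	long pages;
};

static void acct_batch_add(struct acct_batch *b, long pages)
{
	b->pages += pages;			/* purely local, no contention */
}

static void acct_batch_commit(struct acct_batch *b)
{
	if (b->pages)
		vm_unacct_memory(b->pages);	/* single update per operation */
	b->pages = 0;
}

int main(void)
{
	struct acct_batch b = { 0 };

	/* e.g. unmapping 1024 pages (4MB) in 32-page steps */
	for (int i = 0; i < 1024; i += 32)
		acct_batch_add(&b, 32);
	acct_batch_commit(&b);
	printf("vm_committed_as delta: %ld\n", vm_committed_as);
	return 0;
}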

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-10 21:28       ` Andrew Morton
@ 2016-02-11  0:24         ` Tim Chen
  -1 siblings, 0 replies; 30+ messages in thread
From: Tim Chen @ 2016-02-11  0:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrey Ryabinin, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:

> 
> If a process is unmapping 4MB then it's pretty crazy for us to be
> hitting the percpu_counter 32 separate times for that single operation.
> 
> Is there some way in which we can batch up the modifications within the
> caller and update the counter less frequently?  Perhaps even in a
> single hit?

I think the problem is that the batch size is too small, so we overflow
the local counter into the global counter for 4MB allocations.
The reason for the small batch size is that we use
percpu_counter_read_positive() in __vm_enough_memory(); it is not precise,
and the error could grow with a large batch size.

Let's switch to the precise __percpu_counter_compare(), which is
unaffected by the batch size.  It does a precise comparison and only adds
up the local per-cpu counters when the global count is not precise
enough.
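
Roughly, in a userspace model (a sketch of the idea, not the
lib/percpu_counter.c code): trust the cheap cached count whenever it is
farther from the limit than the worst-case per-cpu error, and pay for the
precise sum only when the decision is close.

#include <stdio.h>
#include <stdlib.h>

#define NR_CPUS	8
#define BATCH	256

static long cached_count;		/* like fbc->count */
static long percpu_delta[NR_CPUS];	/* like the per-cpu counters */

static long precise_sum(void)
{
	long sum = cached_count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += percpu_delta[cpu];
	return sum;
}

/* returns <0, 0, >0, in the spirit of __percpu_counter_compare() */
static int counter_compare(long rhs)
{
	long count = cached_count;

	/* cheap path: the cached value is conclusive despite per-cpu error */
	if (labs(count - rhs) > (long)BATCH * NR_CPUS)
		return count > rhs ? 1 : -1;

	/* close call: pay for the precise sum */
	count = precise_sum();
	if (count > rhs)
		return 1;
	return count < rhs ? -1 : 0;
}

int main(void)
{
	cached_count = 1000;
	percpu_delta[2] = 300;
	printf("vs 5000: %d, vs 1200: %d\n",
	       counter_compare(5000), counter_compare(1200));
	return 0;
}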

So maybe something like the following patch with a relaxed batch size.
I have not tested this patch much beyond compiling and booting the kernel,
so I wonder whether it works for Andrey.  We could relax the batch size
further, but that would mean we incur the overhead of summing the per-cpu
counters earlier, when the global count gets close to the allowed limit.

Thanks.

Tim

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fdadf91..996c332 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -151,8 +151,8 @@ static void __meminit mm_compute_batch(void)
 	s32 nr = num_present_cpus();
 	s32 batch = max_t(s32, nr*2, 32);
 
-	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
-	memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff);
+	/* batch size set to 3% of (total memory/#cpus), or max int32 */
+	memsized_batch = min_t(u64, (totalram_pages/nr)/32, 0x7fffffff);
 
 	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 79bcc9f..eec9dfd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -131,7 +131,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	return percpu_counter_sum(&vm_committed_as);
 }
 EXPORT_SYMBOL_GPL(vm_memory_committed);
 
@@ -155,10 +155,6 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long free, allowed, reserve;
 
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus(),
-			"memory commitment underflow");
-
 	vm_acct_memory(pages);
 
 	/*
@@ -224,7 +220,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
 	}
 
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+	if (__percpu_counter_compare(&vm_committed_as, allowed, vm_committed_as_batch) < 0)
 		return 0;
 error:
 	vm_unacct_memory(pages);
diff --git a/mm/nommu.c b/mm/nommu.c
index ab14a20..53e4cae 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -70,7 +70,7 @@ atomic_long_t mmap_pages_allocated;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	return percpu_counter_sum(&vm_committed_as);
 }
 
 EXPORT_SYMBOL_GPL(vm_memory_committed);
@@ -1914,7 +1914,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
 	}
 
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+	if (__percpu_counter_compare(&vm_committed_as, allowed, vm_committed_as_batch) < 0)
 		return 0;
 
 error:

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-10 17:46     ` Konstantin Khlebnikov
@ 2016-02-11 13:36       ` Andrey Ryabinin
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrey Ryabinin @ 2016-02-11 13:36 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, Andi Kleen,
	Tim Chen, Mel Gorman, Vladimir Davydov

On 02/10/2016 08:46 PM, Konstantin Khlebnikov wrote:
> On Wed, Feb 10, 2016 at 5:52 PM, Andrey Ryabinin
> <aryabinin@virtuozzo.com> wrote:
>> Currently we use percpu_counter for accounting committed memory. Change
>> of committed memory on more than vm_committed_as_batch pages leads to
>> grab of counter's spinlock. The batch size is quite small - from 32 pages
>> up to 0.4% of the memory/cpu (usually several MBs even on large machines).
>>
>> So map/munmap of several MBs anonymous memory in multiple processes leads
>> to high contention on that spinlock.
>>
>> Instead of percpu_counter we could use ordinary per-cpu variables.
>> Dump test case (8-proccesses running map/munmap of 4MB,
>> vm_committed_as_batch = 2MB on test setup) showed 2.5x performance
>> improvement.
>>
>> The downside of this approach is slowdown of vm_memory_committed().
>> However, it doesn't matter much since it usually is not in a hot path.
>> The only exception is __vm_enough_memory() with overcommit set to
>> OVERCOMMIT_NEVER. In that case brk1 test from will-it-scale benchmark
>> shows 1.1x - 1.3x performance regression.
>>
>> So I think it's a good tradeoff. We've got significantly increased
>> scalability for the price of some overhead in vm_memory_committed().
> 
> I think that's a no-go. A 30% regression on your not-so-big machine;
> for 4096 cores the regression will be enormous. Link: https://xkcd.com/619/
> 

An old one. Linux already supports 8192 CPUs, so I set possible_cpus=8192 to see how bad it is.
The brk1 test with overcommit disabled (OVERCOMMIT_NEVER) showed a ~500x regression. I guess that's too much.

I've tried another approach - converting 'vm_committed_as' to an atomic_t variable.
On an 8-process map/munmap of 4K this shows only a 2%-3% regression (compared to mainline).
And for 4MB map/munmap it gives a 125% improvement.

So, to me, this sounds like a good way to go, although it's worth checking the regression of
small allocations on bigger machines.

---

diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index df4661a..f30e387 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -41,7 +41,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	si_meminfo(&i);
 	si_swapinfo(&i);
-	committed = percpu_counter_read_positive(&vm_committed_as);
+	committed = vm_memory_committed();
 
 	cached = global_page_state(NR_FILE_PAGES) -
 			total_swapcache_pages() - i.bufferram;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 979bc83..82dac6e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1881,7 +1881,11 @@ extern void memmap_init_zone(unsigned long, int, unsigned long,
 extern void setup_per_zone_wmarks(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
+#ifdef CONFIG_MMU
+static inline void mmap_init(void) {}
+#else
 extern void __init mmap_init(void);
+#endif
 extern void show_mem(unsigned int flags);
 extern void si_meminfo(struct sysinfo * val);
 extern void si_meminfo_node(struct sysinfo *val, int nid);
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 16373c8..21b68e8 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -2,7 +2,7 @@
 #define _LINUX_MMAN_H
 
 #include <linux/mm.h>
-#include <linux/percpu_counter.h>
+#include <linux/percpu.h>
 
 #include <linux/atomic.h>
 #include <uapi/linux/mman.h>
@@ -10,19 +10,12 @@
 extern int sysctl_overcommit_memory;
 extern int sysctl_overcommit_ratio;
 extern unsigned long sysctl_overcommit_kbytes;
-extern struct percpu_counter vm_committed_as;
-
-#ifdef CONFIG_SMP
-extern s32 vm_committed_as_batch;
-#else
-#define vm_committed_as_batch 0
-#endif
-
 unsigned long vm_memory_committed(void);
+extern atomic_t vm_committed_as;
 
 static inline void vm_acct_memory(long pages)
 {
-	__percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
+	atomic_add(pages, &vm_committed_as);
 }
 
 static inline void vm_unacct_memory(long pages)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index fdadf91..d96c71f 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -142,51 +142,6 @@ early_param("mminit_loglevel", set_mminit_loglevel);
 struct kobject *mm_kobj;
 EXPORT_SYMBOL_GPL(mm_kobj);
 
-#ifdef CONFIG_SMP
-s32 vm_committed_as_batch = 32;
-
-static void __meminit mm_compute_batch(void)
-{
-	u64 memsized_batch;
-	s32 nr = num_present_cpus();
-	s32 batch = max_t(s32, nr*2, 32);
-
-	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
-	memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff);
-
-	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
-}
-
-static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
-					unsigned long action, void *arg)
-{
-	switch (action) {
-	case MEM_ONLINE:
-	case MEM_OFFLINE:
-		mm_compute_batch();
-	default:
-		break;
-	}
-	return NOTIFY_OK;
-}
-
-static struct notifier_block compute_batch_nb __meminitdata = {
-	.notifier_call = mm_compute_batch_notifier,
-	.priority = IPC_CALLBACK_PRI, /* use lowest priority */
-};
-
-static int __init mm_compute_batch_init(void)
-{
-	mm_compute_batch();
-	register_hotmemory_notifier(&compute_batch_nb);
-
-	return 0;
-}
-
-__initcall(mm_compute_batch_init);
-
-#endif
-
 static int __init mm_sysfs_init(void)
 {
 	mm_kobj = kobject_create_and_add("mm", kernel_kobj);
diff --git a/mm/mmap.c b/mm/mmap.c
index f088c60..c796d73 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3184,17 +3184,6 @@ void mm_drop_all_locks(struct mm_struct *mm)
 }
 
 /*
- * initialise the VMA slab
- */
-void __init mmap_init(void)
-{
-	int ret;
-
-	ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
-	VM_BUG_ON(ret);
-}
-
-/*
  * Initialise sysctl_user_reserve_kbytes.
  *
  * This is intended to prevent a user from starting a single memory hogging
diff --git a/mm/nommu.c b/mm/nommu.c
index 6402f27..2d52dbc 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -533,10 +533,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
  */
 void __init mmap_init(void)
 {
-	int ret;
-
-	ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
-	VM_BUG_ON(ret);
 	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT);
 }
 
diff --git a/mm/util.c b/mm/util.c
index 47a57e5..9130983 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -402,6 +402,7 @@ unsigned long sysctl_overcommit_kbytes __read_mostly;
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
 unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
+atomic_t vm_committed_as;
 
 int overcommit_ratio_handler(struct ctl_table *table, int write,
 			     void __user *buffer, size_t *lenp,
@@ -445,12 +446,6 @@ unsigned long vm_commit_limit(void)
 }
 
 /*
- * Make sure vm_committed_as in one cacheline and not cacheline shared with
- * other variables. It can be updated by several CPUs frequently.
- */
-struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
-
-/*
  * The global memory commitment made in the system can be a metric
  * that can be used to drive ballooning decisions when Linux is hosted
  * as a guest. On Hyper-V, the host implements a policy engine for dynamically
@@ -460,7 +455,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
  */
 unsigned long vm_memory_committed(void)
 {
-	return percpu_counter_read_positive(&vm_committed_as);
+	return atomic_read(&vm_committed_as);
 }
 EXPORT_SYMBOL_GPL(vm_memory_committed);
 
@@ -484,8 +479,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 {
 	long free, allowed, reserve;
 
-	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
-			-(s64)vm_committed_as_batch * num_online_cpus(),
+	VM_WARN_ONCE(atomic_read(&vm_committed_as) < 0,
 			"memory commitment underflow");
 
 	vm_acct_memory(pages);
@@ -553,7 +547,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
 	}
 
-	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
+	if (vm_memory_committed() < allowed)
 		return 0;
 error:
 	vm_unacct_memory(pages);
-- 
2.4.10

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11  0:24         ` Tim Chen
@ 2016-02-11 13:54           ` Andrey Ryabinin
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrey Ryabinin @ 2016-02-11 13:54 UTC (permalink / raw)
  To: Tim Chen, Andrew Morton
  Cc: linux-mm, linux-kernel, Andi Kleen, Mel Gorman, Vladimir Davydov,
	Konstantin Khlebnikov



On 02/11/2016 03:24 AM, Tim Chen wrote:
> On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:
> 
>>
>> If a process is unmapping 4MB then it's pretty crazy for us to be
>> hitting the percpu_counter 32 separate times for that single operation.
>>
>> Is there some way in which we can batch up the modifications within the
>> caller and update the counter less frequently?  Perhaps even in a
>> single hit?
> 
> I think the problem is that the batch size is too small, so we overflow
> the local counter into the global counter for 4MB allocations.
> The reason for the small batch size is that we use
> percpu_counter_read_positive() in __vm_enough_memory(); it is not precise,
> and the error could grow with a large batch size.
> 
> Let's switch to the precise __percpu_counter_compare(), which is
> unaffected by the batch size.  It does a precise comparison and only adds
> up the local per-cpu counters when the global count is not precise
> enough.
> 

I'm not certain about this: for_each_online_cpu() under a spinlock seems somewhat doubtful.
And if we are close to the limit, we will be hitting the slow path all the time.


> So maybe something like the following patch with a relaxed batch size.
> I have not tested this patch much beyond compiling and booting the kernel,
> so I wonder whether it works for Andrey.  We could relax the batch size
> further, but that would mean we incur the overhead of summing the per-cpu
> counters earlier, when the global count gets close to the allowed limit.
> 
> Thanks.
> 
> Tim
> 
 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11 13:36       ` Andrey Ryabinin
@ 2016-02-11 16:57         ` Tim Chen
  -1 siblings, 0 replies; 30+ messages in thread
From: Tim Chen @ 2016-02-11 16:57 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Konstantin Khlebnikov, Andrew Morton, linux-mm,
	Linux Kernel Mailing List, Andi Kleen, Mel Gorman,
	Vladimir Davydov

On Thu, 2016-02-11 at 16:36 +0300, Andrey Ryabinin wrote:
> On 02/10/2016 08:46 PM, Konstantin Khlebnikov wrote:
> > On Wed, Feb 10, 2016 at 5:52 PM, Andrey Ryabinin
> > <aryabinin@virtuozzo.com> wrote:
> >> Currently we use percpu_counter for accounting committed memory. Change
> >> of committed memory on more than vm_committed_as_batch pages leads to
> >> grab of counter's spinlock. The batch size is quite small - from 32 pages
> >> up to 0.4% of the memory/cpu (usually several MBs even on large machines).
> >>
> >> So map/munmap of several MBs anonymous memory in multiple processes leads
> >> to high contention on that spinlock.
> >>
> >> Instead of percpu_counter we could use ordinary per-cpu variables.
> >> Dump test case (8-proccesses running map/munmap of 4MB,
> >> vm_committed_as_batch = 2MB on test setup) showed 2.5x performance
> >> improvement.
> >>
> >> The downside of this approach is slowdown of vm_memory_committed().
> >> However, it doesn't matter much since it usually is not in a hot path.
> >> The only exception is __vm_enough_memory() with overcommit set to
> >> OVERCOMMIT_NEVER. In that case brk1 test from will-it-scale benchmark
> >> shows 1.1x - 1.3x performance regression.
> >>
> >> So I think it's a good tradeoff. We've got significantly increased
> >> scalability for the price of some overhead in vm_memory_committed().
> > 
> > I think that's a no-go. A 30% regression on your not-so-big machine.
> > For 4096 cores the regression will be enormous. Link: https://xkcd.com/619/
> > 
> 
> Old news. Linux already supports 8192 cpus. So I set possible_cpus=8192 to see how bad it is.
> The brk1 test with overcommit disabled (OVERCOMMIT_NEVER) showed a ~500x regression. I guess that's too much.
> 
> I've tried another approach - converting 'vm_committed_as' to an atomic_t variable.
> With 8 processes doing map/munmap of 4K this shows only a 2%-3% regression (compared to mainline).
> And for 4MB map/munmap it gives a 125% improvement.

In our experience, atomic variables are very slow on larger multi-socket
machines due to cross-node cache synchronization.  So I'm afraid
that this approach will not scale on those larger machines.

Tim

> 
> So, for me, this sounds like a good way to go, although it's worth checking the regression of small
> allocations on bigger machines.
> 
> ---
> 
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index df4661a..f30e387 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -41,7 +41,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>  	si_meminfo(&i);
>  	si_swapinfo(&i);
> -	committed = percpu_counter_read_positive(&vm_committed_as);
> +	committed = vm_memory_committed();
>  
>  	cached = global_page_state(NR_FILE_PAGES) -
>  			total_swapcache_pages() - i.bufferram;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 979bc83..82dac6e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1881,7 +1881,11 @@ extern void memmap_init_zone(unsigned long, int, unsigned long,
>  extern void setup_per_zone_wmarks(void);
>  extern int __meminit init_per_zone_wmark_min(void);
>  extern void mem_init(void);
> +#ifdef CONFIG_MMU
> +static inline void mmap_init(void) {}
> +#else
>  extern void __init mmap_init(void);
> +#endif
>  extern void show_mem(unsigned int flags);
>  extern void si_meminfo(struct sysinfo * val);
>  extern void si_meminfo_node(struct sysinfo *val, int nid);
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 16373c8..21b68e8 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -2,7 +2,7 @@
>  #define _LINUX_MMAN_H
>  
>  #include <linux/mm.h>
> -#include <linux/percpu_counter.h>
> +#include <linux/percpu.h>
>  
>  #include <linux/atomic.h>
>  #include <uapi/linux/mman.h>
> @@ -10,19 +10,12 @@
>  extern int sysctl_overcommit_memory;
>  extern int sysctl_overcommit_ratio;
>  extern unsigned long sysctl_overcommit_kbytes;
> -extern struct percpu_counter vm_committed_as;
> -
> -#ifdef CONFIG_SMP
> -extern s32 vm_committed_as_batch;
> -#else
> -#define vm_committed_as_batch 0
> -#endif
> -
>  unsigned long vm_memory_committed(void);
> +extern atomic_t vm_committed_as;
>  
>  static inline void vm_acct_memory(long pages)
>  {
> -	__percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
> +	atomic_add(pages, &vm_committed_as);
>  }
>  
>  static inline void vm_unacct_memory(long pages)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index fdadf91..d96c71f 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -142,51 +142,6 @@ early_param("mminit_loglevel", set_mminit_loglevel);
>  struct kobject *mm_kobj;
>  EXPORT_SYMBOL_GPL(mm_kobj);
>  
> -#ifdef CONFIG_SMP
> -s32 vm_committed_as_batch = 32;
> -
> -static void __meminit mm_compute_batch(void)
> -{
> -	u64 memsized_batch;
> -	s32 nr = num_present_cpus();
> -	s32 batch = max_t(s32, nr*2, 32);
> -
> -	/* batch size set to 0.4% of (total memory/#cpus), or max int32 */
> -	memsized_batch = min_t(u64, (totalram_pages/nr)/256, 0x7fffffff);
> -
> -	vm_committed_as_batch = max_t(s32, memsized_batch, batch);
> -}
> -
> -static int __meminit mm_compute_batch_notifier(struct notifier_block *self,
> -					unsigned long action, void *arg)
> -{
> -	switch (action) {
> -	case MEM_ONLINE:
> -	case MEM_OFFLINE:
> -		mm_compute_batch();
> -	default:
> -		break;
> -	}
> -	return NOTIFY_OK;
> -}
> -
> -static struct notifier_block compute_batch_nb __meminitdata = {
> -	.notifier_call = mm_compute_batch_notifier,
> -	.priority = IPC_CALLBACK_PRI, /* use lowest priority */
> -};
> -
> -static int __init mm_compute_batch_init(void)
> -{
> -	mm_compute_batch();
> -	register_hotmemory_notifier(&compute_batch_nb);
> -
> -	return 0;
> -}
> -
> -__initcall(mm_compute_batch_init);
> -
> -#endif
> -
>  static int __init mm_sysfs_init(void)
>  {
>  	mm_kobj = kobject_create_and_add("mm", kernel_kobj);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f088c60..c796d73 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3184,17 +3184,6 @@ void mm_drop_all_locks(struct mm_struct *mm)
>  }
>  
>  /*
> - * initialise the VMA slab
> - */
> -void __init mmap_init(void)
> -{
> -	int ret;
> -
> -	ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
> -	VM_BUG_ON(ret);
> -}
> -
> -/*
>   * Initialise sysctl_user_reserve_kbytes.
>   *
>   * This is intended to prevent a user from starting a single memory hogging
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 6402f27..2d52dbc 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -533,10 +533,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
>   */
>  void __init mmap_init(void)
>  {
> -	int ret;
> -
> -	ret = percpu_counter_init(&vm_committed_as, 0, GFP_KERNEL);
> -	VM_BUG_ON(ret);
>  	vm_region_jar = KMEM_CACHE(vm_region, SLAB_PANIC|SLAB_ACCOUNT);
>  }
>  
> diff --git a/mm/util.c b/mm/util.c
> index 47a57e5..9130983 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -402,6 +402,7 @@ unsigned long sysctl_overcommit_kbytes __read_mostly;
>  int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
>  unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
>  unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
> +atomic_t vm_committed_as;
>  
>  int overcommit_ratio_handler(struct ctl_table *table, int write,
>  			     void __user *buffer, size_t *lenp,
> @@ -445,12 +446,6 @@ unsigned long vm_commit_limit(void)
>  }
>  
>  /*
> - * Make sure vm_committed_as in one cacheline and not cacheline shared with
> - * other variables. It can be updated by several CPUs frequently.
> - */
> -struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
> -
> -/*
>   * The global memory commitment made in the system can be a metric
>   * that can be used to drive ballooning decisions when Linux is hosted
>   * as a guest. On Hyper-V, the host implements a policy engine for dynamically
> @@ -460,7 +455,7 @@ struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
>   */
>  unsigned long vm_memory_committed(void)
>  {
> -	return percpu_counter_read_positive(&vm_committed_as);
> +	return atomic_read(&vm_committed_as);
>  }
>  EXPORT_SYMBOL_GPL(vm_memory_committed);
>  
> @@ -484,8 +479,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  {
>  	long free, allowed, reserve;
>  
> -	VM_WARN_ONCE(percpu_counter_read(&vm_committed_as) <
> -			-(s64)vm_committed_as_batch * num_online_cpus(),
> +	VM_WARN_ONCE(atomic_read(&vm_committed_as) < 0,
>  			"memory commitment underflow");
>  
>  	vm_acct_memory(pages);
> @@ -553,7 +547,7 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  		allowed -= min_t(long, mm->total_vm / 32, reserve);
>  	}
>  
> -	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
> +	if (vm_memory_committed() < allowed)
>  		return 0;
>  error:
>  	vm_unacct_memory(pages);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11 13:54           ` Andrey Ryabinin
@ 2016-02-11 18:20             ` Tim Chen
  -1 siblings, 0 replies; 30+ messages in thread
From: Tim Chen @ 2016-02-11 18:20 UTC (permalink / raw)
  To: Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov, Dave Hansen

On Thu, 2016-02-11 at 16:54 +0300, Andrey Ryabinin wrote:
> 
> On 02/11/2016 03:24 AM, Tim Chen wrote:
> > On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:
> > 
> >>
> >> If a process is unmapping 4MB then it's pretty crazy for us to be
> >> hitting the percpu_counter 32 separate times for that single operation.
> >>
> >> Is there some way in which we can batch up the modifications within the
> >> caller and update the counter less frequently?  Perhaps even in a
> >> single hit?
> > 
> > I think the problem is the batch size is too small and we overflow
> > the local counter into the global counter for 4M allocations.
> > The reason for the small batch size was because we use
> > percpu_counter_read_positive in __vm_enough_memory and it is not precise
> > and the error could grow with large batch size.
> > 
> > Let's switch to the precise __percpu_counter_compare that is 
> > unaffected by batch size.  It will do precise comparison and only add up
> > the local per cpu counters when the global count is not precise
> > enough.  
> > 
> 
> I'm not certain about this. Calling for_each_online_cpu() under a spinlock seems somewhat doubtful.
> And if we are close to the limit we will be hitting the slow path all the time.
> 

Yes, it is a trade-off between faster allocation for the general case vs
being on the slow path when we are within 3% of the memory limit. I'm
thinking that when we are that close to the memory limit, it probably
takes more time to do page reclaim, so this slow path might be a
secondary effect.  But it will still be better than the original
proposal that strictly uses per-cpu variables, as we will then
need to sum the variables up all the time.

The brk1 test is also somewhat pathological.  It
does nothing but brk, which is unlikely for a real workload.
So we have to be careful when tuning our system
behavior for brk1 throughput. We'll need to make sure
whatever changes we make don't adversely impact other,
more useful workloads.

Tim
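
As a rough reference for what "does nothing but brk" means, here is a
simplified stand-in for the brk1 workload (an assumption about its shape,
not the actual will-it-scale source): each iteration grows and shrinks the
heap by one page, so every pass goes through the vm_committed_as accounting
this thread is about.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
        void *start = sbrk(0);                  /* current program break */

        for (unsigned long i = 0; i < 10000000UL; i++) {
                if (sbrk(4096) == (void *)-1)   /* extend heap by one page */
                        break;
                sbrk(-4096);                    /* and give it right back */
        }

        printf("break: started at %p, now at %p\n", start, sbrk(0));
        return 0;
}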

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11 18:20             ` Tim Chen
@ 2016-02-11 19:45               ` Dave Hansen
  -1 siblings, 0 replies; 30+ messages in thread
From: Dave Hansen @ 2016-02-11 19:45 UTC (permalink / raw)
  To: Tim Chen, Andrey Ryabinin
  Cc: Andrew Morton, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

On 02/11/2016 10:20 AM, Tim Chen wrote:
> The brk1 test is also somewhat pathological.  It
> does nothing but brk, which is unlikely for a real workload.
> So we have to be careful when tuning our system
> behavior for brk1 throughput. We'll need to make sure
> whatever changes we make don't adversely impact other,
> more useful workloads.

Yeah, there are *so* many alternatives to using brk() or mmap()/munmap()
frequently.

glibc has tunables that control how tightly coupled malloc()/free() are with
virtual address space allocation.  Raising those can reduce the brk() frequency.

There are also other allocators that take much larger chunks of virtual
address space and then "free" memory with MADV_FREE instead of brk().  I
think jemalloc does this, for instance.
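
A small sketch of the userspace knobs being referred to; the specific values
are made-up examples rather than recommendations, and the madvise() note
describes the general jemalloc-style strategy, not any particular
allocator's exact behaviour.

#include <malloc.h>        /* mallopt(), M_TRIM_THRESHOLD, M_TOP_PAD */
#include <sys/mman.h>      /* madvise(), MADV_DONTNEED / MADV_FREE */

static void tune_glibc_malloc(void)
{
        /*
         * Let glibc keep up to 64MB of freed heap cached instead of trimming
         * it back to the kernel with brk() after every burst of free()s.
         */
        mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);

        /* Grow the heap in bigger steps so brk() is called less often. */
        mallopt(M_TOP_PAD, 16 * 1024 * 1024);
}

/*
 * Allocators that manage their own large virtual ranges can instead return
 * memory with madvise(chunk, chunk_len, MADV_FREE) (or MADV_DONTNEED on
 * kernels before 4.5), which drops the backing pages but leaves the VMA --
 * and therefore vm_committed_as -- untouched.
 */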

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11  0:24         ` Tim Chen
@ 2016-02-11 20:51           ` Andrew Morton
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrew Morton @ 2016-02-11 20:51 UTC (permalink / raw)
  To: Tim Chen
  Cc: Andrey Ryabinin, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

On Wed, 10 Feb 2016 16:24:16 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:

> On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:
> 
> > 
> > If a process is unmapping 4MB then it's pretty crazy for us to be
> > hitting the percpu_counter 32 separate times for that single operation.
> > 
> > Is there some way in which we can batch up the modifications within the
> > caller and update the counter less frequently?  Perhaps even in a
> > single hit?
> 
> I think the problem is the batch size is too small and we overflow
> the local counter into the global counter for 4M allocations.

That's one way of looking at the issue.  The other way (which I point
out above) is that we're calling vm_[un]_acct_memory too frequently
when mapping/unmapping 4M segments.

Exactly which mmap.c callsite is causing this issue?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11 20:51           ` Andrew Morton
@ 2016-02-11 21:18             ` Tim Chen
  -1 siblings, 0 replies; 30+ messages in thread
From: Tim Chen @ 2016-02-11 21:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrey Ryabinin, linux-mm, linux-kernel, Andi Kleen, Mel Gorman,
	Vladimir Davydov, Konstantin Khlebnikov

On Thu, 2016-02-11 at 12:51 -0800, Andrew Morton wrote:
> On Wed, 10 Feb 2016 16:24:16 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
> > On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:
> > 
> > > 
> > > If a process is unmapping 4MB then it's pretty crazy for us to be
> > > hitting the percpu_counter 32 separate times for that single operation.
> > > 
> > > Is there some way in which we can batch up the modifications within the
> > > caller and update the counter less frequently?  Perhaps even in a
> > > single hit?
> > 
> > I think the problem is the batch size is too small and we overflow
> > the local counter into the global counter for 4M allocations.
> 
> That's one way of looking at the issue.  The other way (which I point
> out above) is that we're calling vm_[un]_acct_memory too frequently
> when mapping/unmapping 4M segments.
> 
> Exactly which mmap.c callsite is causing this issue?

I suspect it is __vm_enough_memory called from do_brk or mmap_region in
Andrey's test case.

Tim

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting
  2016-02-11 20:51           ` Andrew Morton
@ 2016-02-12 12:24             ` Andrey Ryabinin
  -1 siblings, 0 replies; 30+ messages in thread
From: Andrey Ryabinin @ 2016-02-12 12:24 UTC (permalink / raw)
  To: Andrew Morton, Tim Chen
  Cc: linux-mm, linux-kernel, Andi Kleen, Mel Gorman, Vladimir Davydov,
	Konstantin Khlebnikov

On 02/11/2016 11:51 PM, Andrew Morton wrote:
> On Wed, 10 Feb 2016 16:24:16 -0800 Tim Chen <tim.c.chen@linux.intel.com> wrote:
> 
>> On Wed, 2016-02-10 at 13:28 -0800, Andrew Morton wrote:
>>
>>>
>>> If a process is unmapping 4MB then it's pretty crazy for us to be
>>> hitting the percpu_counter 32 separate times for that single operation.
>>>
>>> Is there some way in which we can batch up the modifications within the
>>> caller and update the counter less frequently?  Perhaps even in a
>>> single hit?
>>
>> I think the problem is the batch size is too small and we overflow
>> the local counter into the global counter for 4M allocations.
> 
> That's one way of looking at the issue.  The other way (which I point
> out above) is that we're calling vm_[un]_acct_memory too frequently
> when mapping/unmapping 4M segments.
> 

We call it only once per mmap() or munmap(), so there is nothing to improve.

> Exactly which mmap.c callsite is causing this issue?
> 


mmap_region() (or do_brk()) ->
	security_vm_enough_memory() ->
		__vm_enough_memory() ->
			vm_acct_memory()
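
To make the cost at the bottom of that chain concrete, this is the
vm_acct_memory() of this era (the same line removed in the patch hunks
quoted above); only the explanatory comment is new:

static inline void vm_acct_memory(long pages)
{
        /*
         * Adds 'pages' to this CPU's local delta; once the local delta
         * reaches vm_committed_as_batch (as small as 32 pages), it is
         * folded into the global count under vm_committed_as's spinlock.
         * With 4K pages a single 4MB mapping is 1024 pages, so at the
         * small batch sizes discussed here every such mmap()/munmap()
         * takes the lock -- the contention Andrey measured.
         */
        __percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);
}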

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2016-02-12 12:23 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-10 14:52 [PATCH 1/3] mm: move max_map_count bits into mm.h Andrey Ryabinin
2016-02-10 14:52 ` Andrey Ryabinin
2016-02-10 14:52 ` [PATCH 2/3] mm: dedupclicate memory overcommitment code Andrey Ryabinin
2016-02-10 14:52   ` Andrey Ryabinin
2016-02-10 14:52 ` [RFC PATCH 3/3] mm: increase scalability of global memory commitment accounting Andrey Ryabinin
2016-02-10 14:52   ` Andrey Ryabinin
2016-02-10 17:46   ` Konstantin Khlebnikov
2016-02-10 17:46     ` Konstantin Khlebnikov
2016-02-11 13:36     ` Andrey Ryabinin
2016-02-11 13:36       ` Andrey Ryabinin
2016-02-11 16:57       ` Tim Chen
2016-02-11 16:57         ` Tim Chen
2016-02-10 18:00   ` Tim Chen
2016-02-10 18:00     ` Tim Chen
2016-02-10 21:28     ` Andrew Morton
2016-02-10 21:28       ` Andrew Morton
2016-02-11  0:24       ` Tim Chen
2016-02-11  0:24         ` Tim Chen
2016-02-11 13:54         ` Andrey Ryabinin
2016-02-11 13:54           ` Andrey Ryabinin
2016-02-11 18:20           ` Tim Chen
2016-02-11 18:20             ` Tim Chen
2016-02-11 19:45             ` Dave Hansen
2016-02-11 19:45               ` Dave Hansen
2016-02-11 20:51         ` Andrew Morton
2016-02-11 20:51           ` Andrew Morton
2016-02-11 21:18           ` Tim Chen
2016-02-11 21:18             ` Tim Chen
2016-02-12 12:24           ` Andrey Ryabinin
2016-02-12 12:24             ` Andrey Ryabinin
