* [PATCH v2 0/9] Mitigate a vmap lock contention v2
@ 2023-08-29  8:11 Uladzislau Rezki (Sony)
  2023-08-29  8:11 ` [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper Uladzislau Rezki (Sony)
                   ` (11 more replies)
  0 siblings, 12 replies; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

Hello, folks!

This is v2 of the series that aims to minimize the vmap
lock contention. It is based on the tag: v6.5-rc6. Documentation
about it can be found here:

wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

Even though it is a bit outdated (it follows v1), it still gives a
good overview of the problem and how it can be solved. On demand
and by request I can update it.

The v1 is here: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/

Delta v1 -> v2:
  - open-coded locking;
  - switched to an array of nodes instead of a per-cpu definition;
  - density is 2 cores per node (not equal to the number of CPUs);
  - on the free path VAs first go back to their owner node and only
    to the global heap once a block is fully freed; the nid is saved
    in va->flags (see the sketch right after this list);
  - added helpers to drain lazily-freed areas faster under high pressure;
  - picked up all Reviewed-by tags.
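
For reference, a minimal user-space sketch of how the nid round-trips
through va->flags is shown below. The helper names mirror the
encode_vn_id()/decode_vn_id() pair introduced in patch 6; the shift
leaves the low bits free, so the existing VMAP_* flags can still be
OR-ed in (see "va->flags |= va_flags" in that patch):

#include <stdio.h>

#define BITS_PER_BYTE 8

/* Pack a node id into the upper bits of va->flags; 0 means "no node". */
static unsigned long encode_vn_id(int node_id)
{
	return (unsigned long) (node_id + 1) << BITS_PER_BYTE;
}

static int decode_vn_id(unsigned long val)
{
	return (int) (val >> BITS_PER_BYTE) - 1;
}

int main(void)
{
	unsigned long flags = encode_vn_id(3);

	printf("flags=0x%lx nid=%d\n", flags, decode_vn_id(flags));
	return 0;
}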

Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
sudo ./test_vmalloc.sh run_test_mask=127 nr_threads=64

<v6.5-rc6 perf>
  94.17%     0.90%  [kernel]    [k] _raw_spin_lock
  93.27%    93.05%  [kernel]    [k] native_queued_spin_lock_slowpath
  74.69%     0.25%  [kernel]    [k] __vmalloc_node_range
  72.64%     0.01%  [kernel]    [k] __get_vm_area_node
  72.04%     0.89%  [kernel]    [k] alloc_vmap_area
  42.17%     0.00%  [kernel]    [k] vmalloc
  32.53%     0.00%  [kernel]    [k] __vmalloc_node
  24.91%     0.25%  [kernel]    [k] vfree
  24.32%     0.01%  [kernel]    [k] remove_vm_area
  22.63%     0.21%  [kernel]    [k] find_unlink_vmap_area
  15.51%     0.00%  [unknown]   [k] 0xffffffffc09a74ac
  14.35%     0.00%  [kernel]    [k] ret_from_fork_asm
  14.35%     0.00%  [kernel]    [k] ret_from_fork
  14.35%     0.00%  [kernel]    [k] kthread
<v6.5-rc6 perf>
   vs
<v6.5-rc6+v2 perf>
  74.32%     2.42%  [kernel]    [k] __vmalloc_node_range
  69.58%     0.01%  [kernel]    [k] vmalloc
  54.21%     1.17%  [kernel]    [k] __alloc_pages_bulk
  48.13%    47.91%  [kernel]    [k] clear_page_orig
  43.60%     0.01%  [unknown]   [k] 0xffffffffc082f16f
  32.06%     0.00%  [kernel]    [k] ret_from_fork_asm
  32.06%     0.00%  [kernel]    [k] ret_from_fork
  32.06%     0.00%  [kernel]    [k] kthread
  31.30%     0.00%  [unknown]   [k] 0xffffffffc082f889
  22.98%     4.16%  [kernel]    [k] vfree
  14.36%     0.28%  [kernel]    [k] __get_vm_area_node
  13.43%     3.35%  [kernel]    [k] alloc_vmap_area
  10.86%     0.04%  [kernel]    [k] remove_vm_area
   8.89%     2.75%  [kernel]    [k] _raw_spin_lock
   7.19%     0.00%  [unknown]   [k] 0xffffffffc082fba3
   6.65%     1.37%  [kernel]    [k] free_unref_page
   6.13%     6.11%  [kernel]    [k] native_queued_spin_lock_slowpath
<v6.5-rc6+v2 perf>

On smaller systems, for example an 8-CPU Hikey960 board, the
contention is not that high and is approximately 16 percent.

Uladzislau Rezki (Sony) (9):
  mm: vmalloc: Add va_alloc() helper
  mm: vmalloc: Rename adjust_va_to_fit_type() function
  mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  mm: vmalloc: Remove global vmap_area_root rb-tree
  mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  mm: vmalloc: Offload free_vmap_area_lock lock
  mm: vmalloc: Support multiple nodes in vread_iter
  mm: vmalloc: Support multiple nodes in vmallocinfo
  mm: vmalloc: Set nr_nodes/node_size based on CPU-cores

 mm/vmalloc.c | 929 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 683 insertions(+), 246 deletions(-)

-- 
2.30.2


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-06  5:51   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function Uladzislau Rezki (Sony)
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko, Christoph Hellwig

Currently the __alloc_vmap_area() function contains open-coded
logic that finds and adjusts a VA based on the allocation request.

Introduce a va_alloc() helper that only adjusts a found VA. It
will be used later in at least two places.

There is no functional change as a result of this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 41 ++++++++++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 13 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 93cf99aba335..00afc1ee4756 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1481,6 +1481,32 @@ adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
 	return 0;
 }
 
+static unsigned long
+va_alloc(struct vmap_area *va,
+		struct rb_root *root, struct list_head *head,
+		unsigned long size, unsigned long align,
+		unsigned long vstart, unsigned long vend)
+{
+	unsigned long nva_start_addr;
+	int ret;
+
+	if (va->va_start > vstart)
+		nva_start_addr = ALIGN(va->va_start, align);
+	else
+		nva_start_addr = ALIGN(vstart, align);
+
+	/* Check the "vend" restriction. */
+	if (nva_start_addr + size > vend)
+		return vend;
+
+	/* Update the free vmap_area. */
+	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
+	if (WARN_ON_ONCE(ret))
+		return vend;
+
+	return nva_start_addr;
+}
+
 /*
  * Returns a start address of the newly allocated area, if success.
  * Otherwise a vend is returned that indicates failure.
@@ -1493,7 +1519,6 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
 	bool adjust_search_size = true;
 	unsigned long nva_start_addr;
 	struct vmap_area *va;
-	int ret;
 
 	/*
 	 * Do not adjust when:
@@ -1511,18 +1536,8 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
 	if (unlikely(!va))
 		return vend;
 
-	if (va->va_start > vstart)
-		nva_start_addr = ALIGN(va->va_start, align);
-	else
-		nva_start_addr = ALIGN(vstart, align);
-
-	/* Check the "vend" restriction. */
-	if (nva_start_addr + size > vend)
-		return vend;
-
-	/* Update the free vmap_area. */
-	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
-	if (WARN_ON_ONCE(ret))
+	nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend);
+	if (nva_start_addr == vend)
 		return vend;
 
 #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
  2023-08-29  8:11 ` [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-06  5:51   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c Uladzislau Rezki (Sony)
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko, Christoph Hellwig

This patch renames the adjust_va_to_fit_type() function
to va_clip(), which is shorter and more expressive.

There is no functional change as a result of this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 00afc1ee4756..09e315f8ea34 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1382,9 +1382,9 @@ classify_va_fit_type(struct vmap_area *va,
 }
 
 static __always_inline int
-adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
-		      struct vmap_area *va, unsigned long nva_start_addr,
-		      unsigned long size)
+va_clip(struct rb_root *root, struct list_head *head,
+		struct vmap_area *va, unsigned long nva_start_addr,
+		unsigned long size)
 {
 	struct vmap_area *lva = NULL;
 	enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
@@ -1500,7 +1500,7 @@ va_alloc(struct vmap_area *va,
 		return vend;
 
 	/* Update the free vmap_area. */
-	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
+	ret = va_clip(root, head, va, nva_start_addr, size);
 	if (WARN_ON_ONCE(ret))
 		return vend;
 
@@ -4151,9 +4151,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 			/* It is a BUG(), but trigger recovery instead. */
 			goto recovery;
 
-		ret = adjust_va_to_fit_type(&free_vmap_area_root,
-					    &free_vmap_area_list,
-					    va, start, size);
+		ret = va_clip(&free_vmap_area_root,
+			&free_vmap_area_list, va, start, size);
 		if (WARN_ON_ONCE(unlikely(ret)))
 			/* It is a BUG(), but trigger recovery instead. */
 			goto recovery;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
  2023-08-29  8:11 ` [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper Uladzislau Rezki (Sony)
  2023-08-29  8:11 ` [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-06  5:52   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko, Christoph Hellwig

vmap_init_free_space() is a function that sets up the free vmap
space and is considered part of the initialization phase. Since the
main entry point, vmalloc_init(), has been moved down in vmalloc.c,
it makes sense to follow the same pattern.

There is no functional change as a result of this patch.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------------
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 09e315f8ea34..b7deacca1483 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2512,47 +2512,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }
 
-static void vmap_init_free_space(void)
-{
-	unsigned long vmap_start = 1;
-	const unsigned long vmap_end = ULONG_MAX;
-	struct vmap_area *busy, *free;
-
-	/*
-	 *     B     F     B     B     B     F
-	 * -|-----|.....|-----|-----|-----|.....|-
-	 *  |           The KVA space           |
-	 *  |<--------------------------------->|
-	 */
-	list_for_each_entry(busy, &vmap_area_list, list) {
-		if (busy->va_start - vmap_start > 0) {
-			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
-			if (!WARN_ON_ONCE(!free)) {
-				free->va_start = vmap_start;
-				free->va_end = busy->va_start;
-
-				insert_vmap_area_augment(free, NULL,
-					&free_vmap_area_root,
-						&free_vmap_area_list);
-			}
-		}
-
-		vmap_start = busy->va_end;
-	}
-
-	if (vmap_end - vmap_start > 0) {
-		free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
-		if (!WARN_ON_ONCE(!free)) {
-			free->va_start = vmap_start;
-			free->va_end = vmap_end;
-
-			insert_vmap_area_augment(free, NULL,
-				&free_vmap_area_root,
-					&free_vmap_area_list);
-		}
-	}
-}
-
 static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 	struct vmap_area *va, unsigned long flags, const void *caller)
 {
@@ -4443,6 +4402,47 @@ module_init(proc_vmalloc_init);
 
 #endif
 
+static void vmap_init_free_space(void)
+{
+	unsigned long vmap_start = 1;
+	const unsigned long vmap_end = ULONG_MAX;
+	struct vmap_area *busy, *free;
+
+	/*
+	 *     B     F     B     B     B     F
+	 * -|-----|.....|-----|-----|-----|.....|-
+	 *  |           The KVA space           |
+	 *  |<--------------------------------->|
+	 */
+	list_for_each_entry(busy, &vmap_area_list, list) {
+		if (busy->va_start - vmap_start > 0) {
+			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+			if (!WARN_ON_ONCE(!free)) {
+				free->va_start = vmap_start;
+				free->va_end = busy->va_start;
+
+				insert_vmap_area_augment(free, NULL,
+					&free_vmap_area_root,
+						&free_vmap_area_list);
+			}
+		}
+
+		vmap_start = busy->va_end;
+	}
+
+	if (vmap_end - vmap_start > 0) {
+		free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
+		if (!WARN_ON_ONCE(!free)) {
+			free->va_start = vmap_start;
+			free->va_end = vmap_end;
+
+			insert_vmap_area_augment(free, NULL,
+				&free_vmap_area_root,
+					&free_vmap_area_list);
+		}
+	}
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (2 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-08-29 14:30   ` kernel test robot
                     ` (2 more replies)
  2023-08-29  8:11 ` [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree Uladzislau Rezki (Sony)
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

Store allocated objects in separate nodes. A va->va_start
address is converted into the node where it should be placed
and reside. The addr_to_node() function is used to do the
address conversion and determine the node that contains a VA.

Such an approach balances VAs across the nodes; as a result
access becomes scalable. The number of nodes in a system is
the number of CPUs divided by two, i.e. the density factor
is 1/2.
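
To make the mapping concrete, below is a minimal user-space sketch of
the conversion done by addr_to_node_id(); the nr_nodes and node_size
values are purely illustrative, the real ones are computed at boot
time in the last patch of this series:

#include <stdio.h>

/* Illustrative values only; the real ones are set at boot time. */
#define NR_NODES	4U
#define NODE_SIZE	(4UL << 20)	/* 4 MB of KVA per node */

static unsigned int addr_to_node_id(unsigned long addr)
{
	return (addr / NODE_SIZE) % NR_NODES;
}

int main(void)
{
	unsigned long addr;

	/* Consecutive NODE_SIZE blocks of KVA round-robin over the nodes. */
	for (addr = 0; addr < 8 * NODE_SIZE; addr += NODE_SIZE)
		printf("addr=0x%08lx -> node %u\n", addr, addr_to_node_id(addr));

	return 0;
}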

Please note:

1. As of now allocated VAs are bound to node-0, therefore this
   patch does not make any difference compared with the current
   behavior;

2. The global vmap_area_lock and vmap_area_root are removed as
   there is no need for them anymore. The vmap_area_list is still
   kept and is _empty_; it is exported for kexec only;

3. vmallocinfo and vread() have to be reworked to be able to
   handle multiple nodes.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 209 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 161 insertions(+), 48 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index b7deacca1483..ae0368c314ff 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
 
 
-static DEFINE_SPINLOCK(vmap_area_lock);
 static DEFINE_SPINLOCK(free_vmap_area_lock);
 /* Export for kexec only */
 LIST_HEAD(vmap_area_list);
-static struct rb_root vmap_area_root = RB_ROOT;
 static bool vmap_initialized __read_mostly;
 
 static struct rb_root purge_vmap_area_root = RB_ROOT;
@@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
  */
 static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
 
+/*
+ * An effective vmap-node logic. Users make use of nodes instead
+ * of a global heap. It allows to balance an access and mitigate
+ * contention.
+ */
+struct rb_list {
+	struct rb_root root;
+	struct list_head head;
+	spinlock_t lock;
+};
+
+struct vmap_node {
+	/* Bookkeeping data of this node. */
+	struct rb_list busy;
+};
+
+static struct vmap_node *nodes, snode;
+static __read_mostly unsigned int nr_nodes = 1;
+static __read_mostly unsigned int node_size = 1;
+
+static inline unsigned int
+addr_to_node_id(unsigned long addr)
+{
+	return (addr / node_size) % nr_nodes;
+}
+
+static inline struct vmap_node *
+addr_to_node(unsigned long addr)
+{
+	return &nodes[addr_to_node_id(addr)];
+}
+
 static __always_inline unsigned long
 va_size(struct vmap_area *va)
 {
@@ -803,10 +833,11 @@ unsigned long vmalloc_nr_pages(void)
 }
 
 /* Look up the first VA which satisfies addr < va_end, NULL if none. */
-static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr)
+static struct vmap_area *
+find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
 {
 	struct vmap_area *va = NULL;
-	struct rb_node *n = vmap_area_root.rb_node;
+	struct rb_node *n = root->rb_node;
 
 	addr = (unsigned long)kasan_reset_tag((void *)addr);
 
@@ -1552,12 +1583,14 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
  */
 static void free_vmap_area(struct vmap_area *va)
 {
+	struct vmap_node *vn = addr_to_node(va->va_start);
+
 	/*
 	 * Remove from the busy tree/list.
 	 */
-	spin_lock(&vmap_area_lock);
-	unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	spin_lock(&vn->busy.lock);
+	unlink_va(va, &vn->busy.root);
+	spin_unlock(&vn->busy.lock);
 
 	/*
 	 * Insert/Merge it back to the free tree/list.
@@ -1600,6 +1633,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 				int node, gfp_t gfp_mask,
 				unsigned long va_flags)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
 	unsigned long freed;
 	unsigned long addr;
@@ -1645,9 +1679,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = va_flags;
 
-	spin_lock(&vmap_area_lock);
-	insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
-	spin_unlock(&vmap_area_lock);
+	vn = addr_to_node(va->va_start);
+
+	spin_lock(&vn->busy.lock);
+	insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
+	spin_unlock(&vn->busy.lock);
 
 	BUG_ON(!IS_ALIGNED(va->va_start, align));
 	BUG_ON(va->va_start < vstart);
@@ -1871,26 +1907,61 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 
 struct vmap_area *find_vmap_area(unsigned long addr)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
+	int i, j;
+
+	/*
+	 * An addr_to_node_id(addr) converts an address to a node index
+	 * where a VA is located. If VA spans several zones and passed
+	 * addr is not the same as va->va_start, what is not common, we
+	 * may need to scan an extra nodes. See an example:
+	 *
+	 *      <--va-->
+	 * -|-----|-----|-----|-----|-
+	 *     1     2     0     1
+	 *
+	 * VA resides in node 1 whereas it spans 1 and 2. If passed
+	 * addr is within a second node we should do extra work. We
+	 * should mention that it is rare and is a corner case from
+	 * the other hand it has to be covered.
+	 */
+	i = j = addr_to_node_id(addr);
+	do {
+		vn = &nodes[i];
 
-	spin_lock(&vmap_area_lock);
-	va = __find_vmap_area(addr, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+		spin_lock(&vn->busy.lock);
+		va = __find_vmap_area(addr, &vn->busy.root);
+		spin_unlock(&vn->busy.lock);
 
-	return va;
+		if (va)
+			return va;
+	} while ((i = (i + 1) % nr_nodes) != j);
+
+	return NULL;
 }
 
 static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
+	int i, j;
 
-	spin_lock(&vmap_area_lock);
-	va = __find_vmap_area(addr, &vmap_area_root);
-	if (va)
-		unlink_va(va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	i = j = addr_to_node_id(addr);
+	do {
+		vn = &nodes[i];
 
-	return va;
+		spin_lock(&vn->busy.lock);
+		va = __find_vmap_area(addr, &vn->busy.root);
+		if (va)
+			unlink_va(va, &vn->busy.root);
+		spin_unlock(&vn->busy.lock);
+
+		if (va)
+			return va;
+	} while ((i = (i + 1) % nr_nodes) != j);
+
+	return NULL;
 }
 
 /*** Per cpu kva allocator ***/
@@ -2092,6 +2163,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 
 static void free_vmap_block(struct vmap_block *vb)
 {
+	struct vmap_node *vn;
 	struct vmap_block *tmp;
 	struct xarray *xa;
 
@@ -2099,9 +2171,10 @@ static void free_vmap_block(struct vmap_block *vb)
 	tmp = xa_erase(xa, addr_to_vb_idx(vb->va->va_start));
 	BUG_ON(tmp != vb);
 
-	spin_lock(&vmap_area_lock);
-	unlink_va(vb->va, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	vn = addr_to_node(vb->va->va_start);
+	spin_lock(&vn->busy.lock);
+	unlink_va(vb->va, &vn->busy.root);
+	spin_unlock(&vn->busy.lock);
 
 	free_vmap_area_noflush(vb->va);
 	kfree_rcu(vb, rcu_head);
@@ -2525,9 +2598,11 @@ static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
 			      unsigned long flags, const void *caller)
 {
-	spin_lock(&vmap_area_lock);
+	struct vmap_node *vn = addr_to_node(va->va_start);
+
+	spin_lock(&vn->busy.lock);
 	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vmap_area_lock);
+	spin_unlock(&vn->busy.lock);
 }
 
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
@@ -3711,6 +3786,7 @@ static size_t vmap_ram_vread_iter(struct iov_iter *iter, const char *addr,
  */
 long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
 	struct vm_struct *vm;
 	char *vaddr;
@@ -3724,8 +3800,11 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
 	remains = count;
 
-	spin_lock(&vmap_area_lock);
-	va = find_vmap_area_exceed_addr((unsigned long)addr);
+	/* Hooked to node_0 so far. */
+	vn = addr_to_node(0);
+	spin_lock(&vn->busy.lock);
+
+	va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
 	if (!va)
 		goto finished_zero;
 
@@ -3733,7 +3812,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 	if ((unsigned long)addr + remains <= va->va_start)
 		goto finished_zero;
 
-	list_for_each_entry_from(va, &vmap_area_list, list) {
+	list_for_each_entry_from(va, &vn->busy.head, list) {
 		size_t copied;
 
 		if (remains == 0)
@@ -3792,12 +3871,12 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 	}
 
 finished_zero:
-	spin_unlock(&vmap_area_lock);
+	spin_unlock(&vn->busy.lock);
 	/* zero-fill memory holes */
 	return count - remains + zero_iter(iter, remains);
 finished:
 	/* Nothing remains, or We couldn't copy/zero everything. */
-	spin_unlock(&vmap_area_lock);
+	spin_unlock(&vn->busy.lock);
 
 	return count - remains;
 }
@@ -4131,14 +4210,15 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 	}
 
 	/* insert all vm's */
-	spin_lock(&vmap_area_lock);
 	for (area = 0; area < nr_vms; area++) {
-		insert_vmap_area(vas[area], &vmap_area_root, &vmap_area_list);
+		struct vmap_node *vn = addr_to_node(vas[area]->va_start);
 
+		spin_lock(&vn->busy.lock);
+		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
 		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
+		spin_unlock(&vn->busy.lock);
 	}
-	spin_unlock(&vmap_area_lock);
 
 	/*
 	 * Mark allocated areas as accessible. Do it now as a best-effort
@@ -4261,25 +4341,26 @@ bool vmalloc_dump_obj(void *object)
 
 #ifdef CONFIG_PROC_FS
 static void *s_start(struct seq_file *m, loff_t *pos)
-	__acquires(&vmap_purge_lock)
-	__acquires(&vmap_area_lock)
 {
+	struct vmap_node *vn = addr_to_node(0);
+
 	mutex_lock(&vmap_purge_lock);
-	spin_lock(&vmap_area_lock);
+	spin_lock(&vn->busy.lock);
 
-	return seq_list_start(&vmap_area_list, *pos);
+	return seq_list_start(&vn->busy.head, *pos);
 }
 
 static void *s_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	return seq_list_next(p, &vmap_area_list, pos);
+	struct vmap_node *vn = addr_to_node(0);
+	return seq_list_next(p, &vn->busy.head, pos);
 }
 
 static void s_stop(struct seq_file *m, void *p)
-	__releases(&vmap_area_lock)
-	__releases(&vmap_purge_lock)
 {
-	spin_unlock(&vmap_area_lock);
+	struct vmap_node *vn = addr_to_node(0);
+
+	spin_unlock(&vn->busy.lock);
 	mutex_unlock(&vmap_purge_lock);
 }
 
@@ -4322,9 +4403,11 @@ static void show_purge_info(struct seq_file *m)
 
 static int s_show(struct seq_file *m, void *p)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
 	struct vm_struct *v;
 
+	vn = addr_to_node(0);
 	va = list_entry(p, struct vmap_area, list);
 
 	if (!va->vm) {
@@ -4375,7 +4458,7 @@ static int s_show(struct seq_file *m, void *p)
 	 * As a final step, dump "unpurged" areas.
 	 */
 final:
-	if (list_is_last(&va->list, &vmap_area_list))
+	if (list_is_last(&va->list, &vn->busy.head))
 		show_purge_info(m);
 
 	return 0;
@@ -4406,7 +4489,8 @@ static void vmap_init_free_space(void)
 {
 	unsigned long vmap_start = 1;
 	const unsigned long vmap_end = ULONG_MAX;
-	struct vmap_area *busy, *free;
+	struct vmap_area *free;
+	struct vm_struct *busy;
 
 	/*
 	 *     B     F     B     B     B     F
@@ -4414,12 +4498,12 @@ static void vmap_init_free_space(void)
 	 *  |           The KVA space           |
 	 *  |<--------------------------------->|
 	 */
-	list_for_each_entry(busy, &vmap_area_list, list) {
-		if (busy->va_start - vmap_start > 0) {
+	for (busy = vmlist; busy; busy = busy->next) {
+		if (busy->addr - vmap_start > 0) {
 			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
 			if (!WARN_ON_ONCE(!free)) {
 				free->va_start = vmap_start;
-				free->va_end = busy->va_start;
+				free->va_end = (unsigned long) busy->addr;
 
 				insert_vmap_area_augment(free, NULL,
 					&free_vmap_area_root,
@@ -4427,7 +4511,7 @@ static void vmap_init_free_space(void)
 			}
 		}
 
-		vmap_start = busy->va_end;
+		vmap_start = (unsigned long) busy->addr + busy->size;
 	}
 
 	if (vmap_end - vmap_start > 0) {
@@ -4443,9 +4527,31 @@ static void vmap_init_free_space(void)
 	}
 }
 
+static void vmap_init_nodes(void)
+{
+	struct vmap_node *vn;
+	int i;
+
+	nodes = &snode;
+
+	if (nr_nodes > 1) {
+		vn = kmalloc_array(nr_nodes, sizeof(*vn), GFP_NOWAIT);
+		if (vn)
+			nodes = vn;
+	}
+
+	for (i = 0; i < nr_nodes; i++) {
+		vn = &nodes[i];
+		vn->busy.root = RB_ROOT;
+		INIT_LIST_HEAD(&vn->busy.head);
+		spin_lock_init(&vn->busy.lock);
+	}
+}
+
 void __init vmalloc_init(void)
 {
 	struct vmap_area *va;
+	struct vmap_node *vn;
 	struct vm_struct *tmp;
 	int i;
 
@@ -4467,6 +4573,11 @@ void __init vmalloc_init(void)
 		xa_init(&vbq->vmap_blocks);
 	}
 
+	/*
+	 * Setup nodes before importing vmlist.
+	 */
+	vmap_init_nodes();
+
 	/* Import existing vmlist entries. */
 	for (tmp = vmlist; tmp; tmp = tmp->next) {
 		va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
@@ -4476,7 +4587,9 @@ void __init vmalloc_init(void)
 		va->va_start = (unsigned long)tmp->addr;
 		va->va_end = va->va_start + tmp->size;
 		va->vm = tmp;
-		insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
+
+		vn = addr_to_node(va->va_start);
+		insert_vmap_area(va, &vn->busy.root, &vn->busy.head);
 	}
 
 	/*
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (3 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-11  2:57   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock Uladzislau Rezki (Sony)
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

Similar to busy VAs, a lazily-freed area is stored in the node
it belongs to. Such an approach does not require any global
locking primitive; instead, access becomes scalable, which
mitigates the contention.

This patch removes the global purge-lock, the global purge-tree
and the global purge-list.
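
In short, the purge path now snapshots each node's lazy list under
that node's lock and purges the snapshot afterwards. A condensed
kernel-style sketch of the pattern is below; the real code in the
diff additionally batches a single TLB flush for the whole range
before purging:

	for (i = 0; i < nr_nodes; i++) {
		vn = &nodes[i];

		if (RB_EMPTY_ROOT(&vn->lazy.root))
			continue;

		/* Detach the whole lazy set; new frees can proceed in parallel. */
		spin_lock(&vn->lazy.lock);
		WRITE_ONCE(vn->lazy.root.rb_node, NULL);
		list_replace_init(&vn->lazy.head, &vn->purge_list);
		spin_unlock(&vn->lazy.lock);

		/* Purge the detached areas without holding the lazy lock. */
		num_purged_areas += purge_vmap_node(vn);
	}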

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 135 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 82 insertions(+), 53 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ae0368c314ff..5a8a9c1370b6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -733,10 +733,6 @@ static DEFINE_SPINLOCK(free_vmap_area_lock);
 LIST_HEAD(vmap_area_list);
 static bool vmap_initialized __read_mostly;
 
-static struct rb_root purge_vmap_area_root = RB_ROOT;
-static LIST_HEAD(purge_vmap_area_list);
-static DEFINE_SPINLOCK(purge_vmap_area_lock);
-
 /*
  * This kmem_cache is used for vmap_area objects. Instead of
  * allocating from slab we reuse an object from this cache to
@@ -784,6 +780,12 @@ struct rb_list {
 struct vmap_node {
 	/* Bookkeeping data of this node. */
 	struct rb_list busy;
+	struct rb_list lazy;
+
+	/*
+	 * Ready-to-free areas.
+	 */
+	struct list_head purge_list;
 };
 
 static struct vmap_node *nodes, snode;
@@ -1768,40 +1770,22 @@ static DEFINE_MUTEX(vmap_purge_lock);
 
 /* for per-CPU blocks */
 static void purge_fragmented_blocks_allcpus(void);
+static cpumask_t purge_nodes;
 
 /*
  * Purges all lazily-freed vmap areas.
  */
-static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+static unsigned long
+purge_vmap_node(struct vmap_node *vn)
 {
-	unsigned long resched_threshold;
-	unsigned int num_purged_areas = 0;
-	struct list_head local_purge_list;
+	unsigned long num_purged_areas = 0;
 	struct vmap_area *va, *n_va;
 
-	lockdep_assert_held(&vmap_purge_lock);
-
-	spin_lock(&purge_vmap_area_lock);
-	purge_vmap_area_root = RB_ROOT;
-	list_replace_init(&purge_vmap_area_list, &local_purge_list);
-	spin_unlock(&purge_vmap_area_lock);
-
-	if (unlikely(list_empty(&local_purge_list)))
-		goto out;
-
-	start = min(start,
-		list_first_entry(&local_purge_list,
-			struct vmap_area, list)->va_start);
-
-	end = max(end,
-		list_last_entry(&local_purge_list,
-			struct vmap_area, list)->va_end);
-
-	flush_tlb_kernel_range(start, end);
-	resched_threshold = lazy_max_pages() << 1;
+	if (list_empty(&vn->purge_list))
+		return 0;
 
 	spin_lock(&free_vmap_area_lock);
-	list_for_each_entry_safe(va, n_va, &local_purge_list, list) {
+	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
 		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 		unsigned long orig_start = va->va_start;
 		unsigned long orig_end = va->va_end;
@@ -1823,13 +1807,55 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 
 		atomic_long_sub(nr, &vmap_lazy_nr);
 		num_purged_areas++;
-
-		if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
-			cond_resched_lock(&free_vmap_area_lock);
 	}
 	spin_unlock(&free_vmap_area_lock);
 
-out:
+	return num_purged_areas;
+}
+
+/*
+ * Purges all lazily-freed vmap areas.
+ */
+static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
+{
+	unsigned long num_purged_areas = 0;
+	struct vmap_node *vn;
+	int i;
+
+	lockdep_assert_held(&vmap_purge_lock);
+	purge_nodes = CPU_MASK_NONE;
+
+	for (i = 0; i < nr_nodes; i++) {
+		vn = &nodes[i];
+
+		INIT_LIST_HEAD(&vn->purge_list);
+
+		if (RB_EMPTY_ROOT(&vn->lazy.root))
+			continue;
+
+		spin_lock(&vn->lazy.lock);
+		WRITE_ONCE(vn->lazy.root.rb_node, NULL);
+		list_replace_init(&vn->lazy.head, &vn->purge_list);
+		spin_unlock(&vn->lazy.lock);
+
+		start = min(start, list_first_entry(&vn->purge_list,
+			struct vmap_area, list)->va_start);
+
+		end = max(end, list_last_entry(&vn->purge_list,
+			struct vmap_area, list)->va_end);
+
+		cpumask_set_cpu(i, &purge_nodes);
+	}
+
+	if (cpumask_weight(&purge_nodes) > 0) {
+		flush_tlb_kernel_range(start, end);
+
+		for_each_cpu(i, &purge_nodes) {
+			vn = &nodes[i];
+			num_purged_areas += purge_vmap_node(vn);
+		}
+	}
+
 	trace_purge_vmap_area_lazy(start, end, num_purged_areas);
 	return num_purged_areas > 0;
 }
@@ -1848,16 +1874,9 @@ static void reclaim_and_purge_vmap_areas(void)
 
 static void drain_vmap_area_work(struct work_struct *work)
 {
-	unsigned long nr_lazy;
-
-	do {
-		mutex_lock(&vmap_purge_lock);
-		__purge_vmap_area_lazy(ULONG_MAX, 0);
-		mutex_unlock(&vmap_purge_lock);
-
-		/* Recheck if further work is required. */
-		nr_lazy = atomic_long_read(&vmap_lazy_nr);
-	} while (nr_lazy > lazy_max_pages());
+	mutex_lock(&vmap_purge_lock);
+	__purge_vmap_area_lazy(ULONG_MAX, 0);
+	mutex_unlock(&vmap_purge_lock);
 }
 
 /*
@@ -1867,6 +1886,7 @@ static void drain_vmap_area_work(struct work_struct *work)
  */
 static void free_vmap_area_noflush(struct vmap_area *va)
 {
+	struct vmap_node *vn = addr_to_node(va->va_start);
 	unsigned long nr_lazy_max = lazy_max_pages();
 	unsigned long va_start = va->va_start;
 	unsigned long nr_lazy;
@@ -1880,10 +1900,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
 	/*
 	 * Merge or place it to the purge tree/list.
 	 */
-	spin_lock(&purge_vmap_area_lock);
-	merge_or_add_vmap_area(va,
-		&purge_vmap_area_root, &purge_vmap_area_list);
-	spin_unlock(&purge_vmap_area_lock);
+	spin_lock(&vn->lazy.lock);
+	merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
+	spin_unlock(&vn->lazy.lock);
 
 	trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
 
@@ -4390,15 +4409,21 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
 
 static void show_purge_info(struct seq_file *m)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
+	int i;
 
-	spin_lock(&purge_vmap_area_lock);
-	list_for_each_entry(va, &purge_vmap_area_list, list) {
-		seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
-			(void *)va->va_start, (void *)va->va_end,
-			va->va_end - va->va_start);
+	for (i = 0; i < nr_nodes; i++) {
+		vn = &nodes[i];
+
+		spin_lock(&vn->lazy.lock);
+		list_for_each_entry(va, &vn->lazy.head, list) {
+			seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
+				(void *)va->va_start, (void *)va->va_end,
+				va->va_end - va->va_start);
+		}
+		spin_unlock(&vn->lazy.lock);
 	}
-	spin_unlock(&purge_vmap_area_lock);
 }
 
 static int s_show(struct seq_file *m, void *p)
@@ -4545,6 +4570,10 @@ static void vmap_init_nodes(void)
 		vn->busy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->busy.head);
 		spin_lock_init(&vn->busy.lock);
+
+		vn->lazy.root = RB_ROOT;
+		INIT_LIST_HEAD(&vn->lazy.head);
+		spin_lock_init(&vn->lazy.lock);
 	}
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (4 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-06  6:04   ` Baoquan He
  2023-09-11  3:25   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter Uladzislau Rezki (Sony)
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

Concurrent access to the global vmap space is a bottleneck.
High contention can be simulated by running the vmalloc test
suite.

To address it, introduce an effective vmap-node logic. Each
node behaves as an independent entity. When a node is accessed
it serves a request directly (if possible); it can also fetch
a new block from the global heap into its internals if no space
or only low capacity is left.

This technique reduces the pressure on the global vmap lock.
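
A condensed kernel-style sketch of the resulting allocation path is
below; locking and the one-fetcher-per-node guard are simplified
here, see node_alloc() and node_alloc_fill() in the diff for the
real code:

	/* 1. Fast path: try the current CPU's node first. */
	vn = id_to_node(this_node_id());
	addr = __alloc_vmap_area(&vn->free.root, &vn->free.head,
		size, align, vstart, vend);

	/*
	 * 2. The node is empty or too fragmented: carve a node_size
	 *    block out of the global heap, shrink it by one page on
	 *    both sides so it never merges with foreign blocks, hang
	 *    it on this node and retry the allocation from there.
	 */
	if (addr == vend)
		addr = node_alloc_fill(vn, size, align, gfp_mask, node);

	/* 3. Last resort: fall back to the global free tree. */
	if (addr == vend)
		addr = __alloc_vmap_area(&free_vmap_area_root,
			&free_vmap_area_list, size, align, vstart, vend);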

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 316 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 279 insertions(+), 37 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5a8a9c1370b6..4fd4915c532d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -779,6 +779,7 @@ struct rb_list {
 
 struct vmap_node {
 	/* Bookkeeping data of this node. */
+	struct rb_list free;
 	struct rb_list busy;
 	struct rb_list lazy;
 
@@ -786,6 +787,13 @@ struct vmap_node {
 	 * Ready-to-free areas.
 	 */
 	struct list_head purge_list;
+	struct work_struct purge_work;
+	unsigned long nr_purged;
+
+	/*
+	 * Control that only one user can pre-fetch this node.
+	 */
+	atomic_t fill_in_progress;
 };
 
 static struct vmap_node *nodes, snode;
@@ -804,6 +812,32 @@ addr_to_node(unsigned long addr)
 	return &nodes[addr_to_node_id(addr)];
 }
 
+static inline struct vmap_node *
+id_to_node(int id)
+{
+	return &nodes[id % nr_nodes];
+}
+
+static inline int
+this_node_id(void)
+{
+	return raw_smp_processor_id() % nr_nodes;
+}
+
+static inline unsigned long
+encode_vn_id(int node_id)
+{
+	/* Can store U8_MAX [0:254] nodes. */
+	return (node_id + 1) << BITS_PER_BYTE;
+}
+
+static inline int
+decode_vn_id(unsigned long val)
+{
+	/* Can store U8_MAX [0:254] nodes. */
+	return (val >> BITS_PER_BYTE) - 1;
+}
+
 static __always_inline unsigned long
 va_size(struct vmap_area *va)
 {
@@ -1586,6 +1620,7 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
 static void free_vmap_area(struct vmap_area *va)
 {
 	struct vmap_node *vn = addr_to_node(va->va_start);
+	int vn_id = decode_vn_id(va->flags);
 
 	/*
 	 * Remove from the busy tree/list.
@@ -1594,12 +1629,19 @@ static void free_vmap_area(struct vmap_area *va)
 	unlink_va(va, &vn->busy.root);
 	spin_unlock(&vn->busy.lock);
 
-	/*
-	 * Insert/Merge it back to the free tree/list.
-	 */
-	spin_lock(&free_vmap_area_lock);
-	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
-	spin_unlock(&free_vmap_area_lock);
+	if (vn_id >= 0) {
+		vn = id_to_node(vn_id);
+
+		/* Belongs to this node. */
+		spin_lock(&vn->free.lock);
+		merge_or_add_vmap_area_augment(va, &vn->free.root, &vn->free.head);
+		spin_unlock(&vn->free.lock);
+	} else {
+		/* Goes to global. */
+		spin_lock(&free_vmap_area_lock);
+		merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
+		spin_unlock(&free_vmap_area_lock);
+	}
 }
 
 static inline void
@@ -1625,6 +1667,134 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
 		kmem_cache_free(vmap_area_cachep, va);
 }
 
+static unsigned long
+node_alloc_fill(struct vmap_node *vn,
+		unsigned long size, unsigned long align,
+		gfp_t gfp_mask, int node)
+{
+	struct vmap_area *va;
+	unsigned long addr;
+
+	va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
+	if (unlikely(!va))
+		return VMALLOC_END;
+
+	/*
+	 * Please note, an allocated block is not aligned to its size.
+	 * Therefore it can span several zones what means addr_to_node()
+	 * can point to two different nodes:
+	 *      <----->
+	 * -|-----|-----|-----|-----|-
+	 *     1     2     0     1
+	 *
+	 * an alignment would just increase fragmentation thus more heap
+	 * consumption what we would like to avoid.
+	 */
+	spin_lock(&free_vmap_area_lock);
+	addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
+		node_size, 1, VMALLOC_START, VMALLOC_END);
+	spin_unlock(&free_vmap_area_lock);
+
+	if (addr == VMALLOC_END) {
+		kmem_cache_free(vmap_area_cachep, va);
+		return VMALLOC_END;
+	}
+
+	/*
+	 * Statement and condition of the problem:
+	 *
+	 * a) where to free allocated areas from a node:
+	 *   - directly to a global heap;
+	 *   - to a node that we got a VA from;
+	 *     - what is a condition to return allocated areas
+	 *       to a global heap then;
+	 * b) how to properly handle left small free fragments
+	 *    of a node in order to mitigate a fragmentation.
+	 *
+	 * How to address described points:
+	 * When a new block is allocated(from a global heap) we shrink
+	 * it deliberately by one page from both sides and place it to
+	 * this node to serve a request.
+	 *
+	 * Why we shrink. We would like to distinguish VAs which were
+	 * obtained from a node and a global heap. This is for a free
+	 * path. A va->flags contains a node-id it belongs to. No VAs
+	 * merging is possible between each other unless they are part
+	 * of same block.
+	 *
+	 * A free-path in its turn can detect a correct node where a
+	 * VA has to be returned. Thus as a block is freed entirely,
+	 * its size becomes(merging): node_size - (2 * PAGE_SIZE) it
+	 * recovers its edges, thus is released to a global heap for
+	 * reuse elsewhere. In partly freed case, VAs go back to the
+	 * node not bothering a global vmap space.
+	 *
+	 *        1               2              3
+	 * |<------------>|<------------>|<------------>|
+	 * |..<-------->..|..<-------->..|..<-------->..|
+	 */
+	va->va_start = addr + PAGE_SIZE;
+	va->va_end = (addr + node_size) - PAGE_SIZE;
+
+	spin_lock(&vn->free.lock);
+	/* Never merges. See explanation above. */
+	insert_vmap_area_augment(va, NULL, &vn->free.root, &vn->free.head);
+	addr = va_alloc(va, &vn->free.root, &vn->free.head,
+		size, align, VMALLOC_START, VMALLOC_END);
+	spin_unlock(&vn->free.lock);
+
+	return addr;
+}
+
+static unsigned long
+node_alloc(int vn_id, unsigned long size, unsigned long align,
+		unsigned long vstart, unsigned long vend,
+		gfp_t gfp_mask, int node)
+{
+	struct vmap_node *vn = id_to_node(vn_id);
+	unsigned long extra = align > PAGE_SIZE ? align : 0;
+	bool do_alloc_fill = false;
+	unsigned long addr;
+
+	/*
+	 * Fallback to a global heap if not vmalloc.
+	 */
+	if (vstart != VMALLOC_START || vend != VMALLOC_END)
+		return vend;
+
+	/*
+	 * A maximum allocation limit is 1/4 of capacity. This
+	 * is done in order to prevent a fast depleting of zone
+	 * by few requests.
+	 */
+	if (size + extra > (node_size >> 2))
+		return vend;
+
+	spin_lock(&vn->free.lock);
+	addr = __alloc_vmap_area(&vn->free.root, &vn->free.head,
+		size, align, vstart, vend);
+
+	if (addr == vend) {
+		/*
+		 * Set the fetch flag under the critical section.
+		 * This guarantees that only one user is eligible
+		 * to perform a pre-fetch. A reset operation can
+		 * be concurrent.
+		 */
+		if (!atomic_xchg(&vn->fill_in_progress, 1))
+			do_alloc_fill = true;
+	}
+	spin_unlock(&vn->free.lock);
+
+	/* Only if fails a previous attempt. */
+	if (do_alloc_fill) {
+		addr = node_alloc_fill(vn, size, align, gfp_mask, node);
+		atomic_set(&vn->fill_in_progress, 0);
+	}
+
+	return addr;
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend.
@@ -1640,7 +1810,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	unsigned long freed;
 	unsigned long addr;
 	int purged = 0;
-	int ret;
+	int ret, vn_id;
 
 	if (unlikely(!size || offset_in_page(size) || !is_power_of_2(align)))
 		return ERR_PTR(-EINVAL);
@@ -1661,11 +1831,17 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	 */
 	kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
 
+	vn_id = this_node_id();
+	addr = node_alloc(vn_id, size, align, vstart, vend, gfp_mask, node);
+	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
+
 retry:
-	preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
-	addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
-		size, align, vstart, vend);
-	spin_unlock(&free_vmap_area_lock);
+	if (addr == vend) {
+		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
+		addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
+			size, align, vstart, vend);
+		spin_unlock(&free_vmap_area_lock);
+	}
 
 	trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);
 
@@ -1679,7 +1855,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->va_start = addr;
 	va->va_end = addr + size;
 	va->vm = NULL;
-	va->flags = va_flags;
+	va->flags |= va_flags;
 
 	vn = addr_to_node(va->va_start);
 
@@ -1772,31 +1948,58 @@ static DEFINE_MUTEX(vmap_purge_lock);
 static void purge_fragmented_blocks_allcpus(void);
 static cpumask_t purge_nodes;
 
-/*
- * Purges all lazily-freed vmap areas.
- */
-static unsigned long
-purge_vmap_node(struct vmap_node *vn)
+static void
+reclaim_list_global(struct list_head *head)
+{
+	struct vmap_area *va, *n;
+
+	if (list_empty(head))
+		return;
+
+	spin_lock(&free_vmap_area_lock);
+	list_for_each_entry_safe(va, n, head, list)
+		merge_or_add_vmap_area_augment(va,
+			&free_vmap_area_root, &free_vmap_area_list);
+	spin_unlock(&free_vmap_area_lock);
+}
+
+static void purge_vmap_node(struct work_struct *work)
 {
-	unsigned long num_purged_areas = 0;
+	struct vmap_node *vn = container_of(work,
+		struct vmap_node, purge_work);
 	struct vmap_area *va, *n_va;
+	LIST_HEAD(global);
+
+	vn->nr_purged = 0;
 
 	if (list_empty(&vn->purge_list))
-		return 0;
+		return;
 
-	spin_lock(&free_vmap_area_lock);
+	spin_lock(&vn->free.lock);
 	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
 		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
 		unsigned long orig_start = va->va_start;
 		unsigned long orig_end = va->va_end;
+		int vn_id = decode_vn_id(va->flags);
 
-		/*
-		 * Finally insert or merge lazily-freed area. It is
-		 * detached and there is no need to "unlink" it from
-		 * anything.
-		 */
-		va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
-				&free_vmap_area_list);
+		list_del_init(&va->list);
+
+		if (vn_id >= 0) {
+			if (va_size(va) != node_size - (2 * PAGE_SIZE))
+				va = merge_or_add_vmap_area_augment(va, &vn->free.root, &vn->free.head);
+
+			if (va_size(va) == node_size - (2 * PAGE_SIZE)) {
+				if (!list_empty(&va->list))
+					unlink_va_augment(va, &vn->free.root);
+
+				/* Restore the block size. */
+				va->va_start -= PAGE_SIZE;
+				va->va_end += PAGE_SIZE;
+				list_add(&va->list, &global);
+			}
+		} else {
+			list_add(&va->list, &global);
+		}
 
 		if (!va)
 			continue;
@@ -1806,11 +2009,10 @@ purge_vmap_node(struct vmap_node *vn)
 					      va->va_start, va->va_end);
 
 		atomic_long_sub(nr, &vmap_lazy_nr);
-		num_purged_areas++;
+		vn->nr_purged++;
 	}
-	spin_unlock(&free_vmap_area_lock);
-
-	return num_purged_areas;
+	spin_unlock(&vn->free.lock);
+	reclaim_list_global(&global);
 }
 
 /*
@@ -1818,11 +2020,17 @@ purge_vmap_node(struct vmap_node *vn)
  */
 static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 {
-	unsigned long num_purged_areas = 0;
+	unsigned long nr_purged_areas = 0;
+	unsigned int nr_purge_helpers;
+	unsigned int nr_purge_nodes;
 	struct vmap_node *vn;
 	int i;
 
 	lockdep_assert_held(&vmap_purge_lock);
+
+	/*
+	 * Use cpumask to mark which node has to be processed.
+	 */
 	purge_nodes = CPU_MASK_NONE;
 
 	for (i = 0; i < nr_nodes; i++) {
@@ -1847,17 +2055,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		cpumask_set_cpu(i, &purge_nodes);
 	}
 
-	if (cpumask_weight(&purge_nodes) > 0) {
+	nr_purge_nodes = cpumask_weight(&purge_nodes);
+	if (nr_purge_nodes > 0) {
 		flush_tlb_kernel_range(start, end);
 
+		/* One extra worker is per a lazy_max_pages() full set minus one. */
+		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
+		nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
+
+		for_each_cpu(i, &purge_nodes) {
+			vn = &nodes[i];
+
+			if (nr_purge_helpers > 0) {
+				INIT_WORK(&vn->purge_work, purge_vmap_node);
+
+				if (cpumask_test_cpu(i, cpu_online_mask))
+					schedule_work_on(i, &vn->purge_work);
+				else
+					schedule_work(&vn->purge_work);
+
+				nr_purge_helpers--;
+			} else {
+				vn->purge_work.func = NULL;
+				purge_vmap_node(&vn->purge_work);
+				nr_purged_areas += vn->nr_purged;
+			}
+		}
+
 		for_each_cpu(i, &purge_nodes) {
 			vn = &nodes[i];
-			num_purged_areas += purge_vmap_node(vn);
+
+			if (vn->purge_work.func) {
+				flush_work(&vn->purge_work);
+				nr_purged_areas += vn->nr_purged;
+			}
 		}
 	}
 
-	trace_purge_vmap_area_lazy(start, end, num_purged_areas);
-	return num_purged_areas > 0;
+	trace_purge_vmap_area_lazy(start, end, nr_purged_areas);
+	return nr_purged_areas > 0;
 }
 
 /*
@@ -1886,9 +2122,11 @@ static void drain_vmap_area_work(struct work_struct *work)
  */
 static void free_vmap_area_noflush(struct vmap_area *va)
 {
-	struct vmap_node *vn = addr_to_node(va->va_start);
 	unsigned long nr_lazy_max = lazy_max_pages();
 	unsigned long va_start = va->va_start;
+	int vn_id = decode_vn_id(va->flags);
+	struct vmap_node *vn = vn_id >= 0 ? id_to_node(vn_id):
+		addr_to_node(va->va_start);;
 	unsigned long nr_lazy;
 
 	if (WARN_ON_ONCE(!list_empty(&va->list)))
@@ -4574,6 +4812,10 @@ static void vmap_init_nodes(void)
 		vn->lazy.root = RB_ROOT;
 		INIT_LIST_HEAD(&vn->lazy.head);
 		spin_lock_init(&vn->lazy.lock);
+
+		vn->free.root = RB_ROOT;
+		INIT_LIST_HEAD(&vn->free.head);
+		spin_lock_init(&vn->free.lock);
 	}
 }
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (5 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-11  3:58   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo Uladzislau Rezki (Sony)
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

Extend vread_iter() to be able to perform a sequential
read of VAs which are spread among multiple nodes, so that
data read over /dev/kmem correctly reflects the vmalloc
memory layout.
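
The resulting read loop hops from node to node in address order. A
condensed sketch is below, where copy_out_one_va() is a hypothetical
placeholder for the existing copy/zero logic of vread_iter():

	vn = find_vmap_area_exceed_addr_lock((unsigned long) addr, &va);

	while (vn) {
		/* vn->busy.lock is held here, so "va" cannot disappear. */
		copy_out_one_va(iter, va);	/* hypothetical placeholder */

		next = va->va_end;
		spin_unlock(&vn->busy.lock);

		/* The next VA may reside on any node: re-search and re-lock. */
		vn = find_vmap_area_exceed_addr_lock(next, &va);
	}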

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 14 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4fd4915c532d..968144c16237 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -870,7 +870,7 @@ unsigned long vmalloc_nr_pages(void)
 
 /* Look up the first VA which satisfies addr < va_end, NULL if none. */
 static struct vmap_area *
-find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
+__find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
 {
 	struct vmap_area *va = NULL;
 	struct rb_node *n = root->rb_node;
@@ -894,6 +894,41 @@ find_vmap_area_exceed_addr(unsigned long addr, struct rb_root *root)
 	return va;
 }
 
+/*
+ * Returns a node where a first VA, that satisfies addr < va_end, resides.
+ * If success, a node is locked. A user is responsible to unlock it when a
+ * VA is no longer needed to be accessed.
+ *
+ * Returns NULL if nothing found.
+ */
+static struct vmap_node *
+find_vmap_area_exceed_addr_lock(unsigned long addr, struct vmap_area **va)
+{
+	struct vmap_node *vn, *va_node = NULL;
+	struct vmap_area *va_lowest;
+	int i;
+
+	for (i = 0; i < nr_nodes; i++) {
+		vn = &nodes[i];
+
+		spin_lock(&vn->busy.lock);
+		va_lowest = __find_vmap_area_exceed_addr(addr, &vn->busy.root);
+		if (va_lowest) {
+			if (!va_node || va_lowest->va_start < (*va)->va_start) {
+				if (va_node)
+					spin_unlock(&va_node->busy.lock);
+
+				*va = va_lowest;
+				va_node = vn;
+				continue;
+			}
+		}
+		spin_unlock(&vn->busy.lock);
+	}
+
+	return va_node;
+}
+
 static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root)
 {
 	struct rb_node *n = root->rb_node;
@@ -4048,6 +4083,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 	struct vm_struct *vm;
 	char *vaddr;
 	size_t n, size, flags, remains;
+	unsigned long next;
 
 	addr = kasan_reset_tag(addr);
 
@@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
 	remains = count;
 
-	/* Hooked to node_0 so far. */
-	vn = addr_to_node(0);
-	spin_lock(&vn->busy.lock);
-
-	va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
-	if (!va)
+	vn = find_vmap_area_exceed_addr_lock((unsigned long) addr, &va);
+	if (!vn)
 		goto finished_zero;
 
 	/* no intersects with alive vmap_area */
 	if ((unsigned long)addr + remains <= va->va_start)
 		goto finished_zero;
 
-	list_for_each_entry_from(va, &vn->busy.head, list) {
+	do {
 		size_t copied;
 
 		if (remains == 0)
@@ -4084,10 +4116,10 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 		WARN_ON(flags == VMAP_BLOCK);
 
 		if (!vm && !flags)
-			continue;
+			goto next_va;
 
 		if (vm && (vm->flags & VM_UNINITIALIZED))
-			continue;
+			goto next_va;
 
 		/* Pair with smp_wmb() in clear_vm_uninitialized_flag() */
 		smp_rmb();
@@ -4096,7 +4128,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 		size = vm ? get_vm_area_size(vm) : va_size(va);
 
 		if (addr >= vaddr + size)
-			continue;
+			goto next_va;
 
 		if (addr < vaddr) {
 			size_t to_zero = min_t(size_t, vaddr - addr, remains);
@@ -4125,15 +4157,22 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
 
 		if (copied != n)
 			goto finished;
-	}
+
+	next_va:
+		next = va->va_end;
+		spin_unlock(&vn->busy.lock);
+	} while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));
 
 finished_zero:
-	spin_unlock(&vn->busy.lock);
+	if (vn)
+		spin_unlock(&vn->busy.lock);
+
 	/* zero-fill memory holes */
 	return count - remains + zero_iter(iter, remains);
 finished:
 	/* Nothing remains, or We couldn't copy/zero everything. */
-	spin_unlock(&vn->busy.lock);
+	if (vn)
+		spin_unlock(&vn->busy.lock);
 
 	return count - remains;
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (6 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-15 13:02   ` Baoquan He
  2023-08-29  8:11 ` [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores Uladzislau Rezki (Sony)
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

Allocated areas are spread among the nodes, which implies that
the scanning has to be performed on each node individually in
order to dump all existing VAs.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 120 ++++++++++++++++++++-------------------------------
 1 file changed, 47 insertions(+), 73 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 968144c16237..9cce012aecdb 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4636,30 +4636,6 @@ bool vmalloc_dump_obj(void *object)
 #endif
 
 #ifdef CONFIG_PROC_FS
-static void *s_start(struct seq_file *m, loff_t *pos)
-{
-	struct vmap_node *vn = addr_to_node(0);
-
-	mutex_lock(&vmap_purge_lock);
-	spin_lock(&vn->busy.lock);
-
-	return seq_list_start(&vn->busy.head, *pos);
-}
-
-static void *s_next(struct seq_file *m, void *p, loff_t *pos)
-{
-	struct vmap_node *vn = addr_to_node(0);
-	return seq_list_next(p, &vn->busy.head, pos);
-}
-
-static void s_stop(struct seq_file *m, void *p)
-{
-	struct vmap_node *vn = addr_to_node(0);
-
-	spin_unlock(&vn->busy.lock);
-	mutex_unlock(&vmap_purge_lock);
-}
-
 static void show_numa_info(struct seq_file *m, struct vm_struct *v)
 {
 	if (IS_ENABLED(CONFIG_NUMA)) {
@@ -4703,84 +4679,82 @@ static void show_purge_info(struct seq_file *m)
 	}
 }
 
-static int s_show(struct seq_file *m, void *p)
+static int vmalloc_info_show(struct seq_file *m, void *p)
 {
 	struct vmap_node *vn;
 	struct vmap_area *va;
 	struct vm_struct *v;
+	int i;
 
-	vn = addr_to_node(0);
-	va = list_entry(p, struct vmap_area, list);
+	for (i = 0; i < nr_nodes; i++) {
+		vn = &nodes[i];
 
-	if (!va->vm) {
-		if (va->flags & VMAP_RAM)
-			seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
-				(void *)va->va_start, (void *)va->va_end,
-				va->va_end - va->va_start);
+		spin_lock(&vn->busy.lock);
+		list_for_each_entry(va, &vn->busy.head, list) {
+			if (!va->vm) {
+				if (va->flags & VMAP_RAM)
+					seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
+						(void *)va->va_start, (void *)va->va_end,
+						va->va_end - va->va_start);
 
-		goto final;
-	}
+				continue;
+			}
 
-	v = va->vm;
+			v = va->vm;
 
-	seq_printf(m, "0x%pK-0x%pK %7ld",
-		v->addr, v->addr + v->size, v->size);
+			seq_printf(m, "0x%pK-0x%pK %7ld",
+				v->addr, v->addr + v->size, v->size);
 
-	if (v->caller)
-		seq_printf(m, " %pS", v->caller);
+			if (v->caller)
+				seq_printf(m, " %pS", v->caller);
 
-	if (v->nr_pages)
-		seq_printf(m, " pages=%d", v->nr_pages);
+			if (v->nr_pages)
+				seq_printf(m, " pages=%d", v->nr_pages);
 
-	if (v->phys_addr)
-		seq_printf(m, " phys=%pa", &v->phys_addr);
+			if (v->phys_addr)
+				seq_printf(m, " phys=%pa", &v->phys_addr);
 
-	if (v->flags & VM_IOREMAP)
-		seq_puts(m, " ioremap");
+			if (v->flags & VM_IOREMAP)
+				seq_puts(m, " ioremap");
 
-	if (v->flags & VM_ALLOC)
-		seq_puts(m, " vmalloc");
+			if (v->flags & VM_ALLOC)
+				seq_puts(m, " vmalloc");
 
-	if (v->flags & VM_MAP)
-		seq_puts(m, " vmap");
+			if (v->flags & VM_MAP)
+				seq_puts(m, " vmap");
 
-	if (v->flags & VM_USERMAP)
-		seq_puts(m, " user");
+			if (v->flags & VM_USERMAP)
+				seq_puts(m, " user");
 
-	if (v->flags & VM_DMA_COHERENT)
-		seq_puts(m, " dma-coherent");
+			if (v->flags & VM_DMA_COHERENT)
+				seq_puts(m, " dma-coherent");
 
-	if (is_vmalloc_addr(v->pages))
-		seq_puts(m, " vpages");
+			if (is_vmalloc_addr(v->pages))
+				seq_puts(m, " vpages");
 
-	show_numa_info(m, v);
-	seq_putc(m, '\n');
+			show_numa_info(m, v);
+			seq_putc(m, '\n');
+		}
+		spin_unlock(&vn->busy.lock);
+	}
 
 	/*
 	 * As a final step, dump "unpurged" areas.
 	 */
-final:
-	if (list_is_last(&va->list, &vn->busy.head))
-		show_purge_info(m);
-
+	show_purge_info(m);
 	return 0;
 }
 
-static const struct seq_operations vmalloc_op = {
-	.start = s_start,
-	.next = s_next,
-	.stop = s_stop,
-	.show = s_show,
-};
-
 static int __init proc_vmalloc_init(void)
 {
+	void *priv_data = NULL;
+
 	if (IS_ENABLED(CONFIG_NUMA))
-		proc_create_seq_private("vmallocinfo", 0400, NULL,
-				&vmalloc_op,
-				nr_node_ids * sizeof(unsigned int), NULL);
-	else
-		proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
+		priv_data = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL);
+
+	proc_create_single_data("vmallocinfo",
+		0400, NULL, vmalloc_info_show, priv_data);
+
 	return 0;
 }
 module_init(proc_vmalloc_init);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (7 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo Uladzislau Rezki (Sony)
@ 2023-08-29  8:11 ` Uladzislau Rezki (Sony)
  2023-09-15 13:03   ` Baoquan He
  2023-08-31  1:15 ` [PATCH v2 0/9] Mitigate a vmap lock contention v2 Baoquan He
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki (Sony) @ 2023-08-29  8:11 UTC (permalink / raw)
  To: linux-mm, Andrew Morton
  Cc: LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Uladzislau Rezki,
	Oleksiy Avramchenko

The density ratio is set to 2, i.e. two CPUs share one node.
For example, if there are 6 cores in a system, "nr_nodes"
is 3.

The "node_size" also depends on the number of physical cores.
A high-threshold limit is hard-coded and set to SZ_4M.

For 32-bit or single/dual-core systems, access to a global
vmap heap is not balanced across nodes. Such small systems do
not suffer from lock contention anyway, given their limited
number of CPU cores.
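
A rough stand-alone check of the sizing rules above (illustrative
userspace C, not kernel code: fls_approx() stands in for the kernel's
fls(), the SZ_* constants are written out by hand, and nr_cpus = 64 is
just an example matching a 32-core/64-thread box):

<snip>
#include <stdio.h>

/* Userspace stand-in for the kernel's fls(): highest set bit, 1-based. */
static int fls_approx(unsigned long x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}

	return r;
}

int main(void)
{
	const unsigned long SZ_64K = 64UL << 10;
	const unsigned long SZ_4M = 4UL << 20;
	unsigned int nr_cpus = 64;	/* example: 32-core/64-thread box */

	/* Density factor: two CPUs per node, clamped to [1, U8_MAX]. */
	unsigned int nr_nodes = nr_cpus >> 1;
	if (nr_nodes < 1)
		nr_nodes = 1;
	if (nr_nodes > 255)
		nr_nodes = 255;

	/* SZ_64K << fls(64) = 64K << 7 = 8M, capped by the 4M threshold. */
	unsigned long node_size = SZ_64K << fls_approx(nr_cpus);
	if (node_size > SZ_4M)
		node_size = SZ_4M;

	/* prints: nr_nodes=32 node_size=4096K */
	printf("nr_nodes=%u node_size=%luK\n", nr_nodes, node_size >> 10);
	return 0;
}
<snip>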

Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
sudo ./test_vmalloc.sh run_test_mask=127 nr_threads=64

<default perf>
 94.17%     0.90%  [kernel]    [k] _raw_spin_lock
 93.27%    93.05%  [kernel]    [k] native_queued_spin_lock_slowpath
 74.69%     0.25%  [kernel]    [k] __vmalloc_node_range
 72.64%     0.01%  [kernel]    [k] __get_vm_area_node
 72.04%     0.89%  [kernel]    [k] alloc_vmap_area
 42.17%     0.00%  [kernel]    [k] vmalloc
 32.53%     0.00%  [kernel]    [k] __vmalloc_node
 24.91%     0.25%  [kernel]    [k] vfree
 24.32%     0.01%  [kernel]    [k] remove_vm_area
 22.63%     0.21%  [kernel]    [k] find_unlink_vmap_area
 15.51%     0.00%  [unknown]   [k] 0xffffffffc09a74ac
 14.35%     0.00%  [kernel]    [k] ret_from_fork_asm
 14.35%     0.00%  [kernel]    [k] ret_from_fork
 14.35%     0.00%  [kernel]    [k] kthread
<default perf>
   vs
<patch-series perf>
 74.32%     2.42%  [kernel]    [k] __vmalloc_node_range
 69.58%     0.01%  [kernel]    [k] vmalloc
 54.21%     1.17%  [kernel]    [k] __alloc_pages_bulk
 48.13%    47.91%  [kernel]    [k] clear_page_orig
 43.60%     0.01%  [unknown]   [k] 0xffffffffc082f16f
 32.06%     0.00%  [kernel]    [k] ret_from_fork_asm
 32.06%     0.00%  [kernel]    [k] ret_from_fork
 32.06%     0.00%  [kernel]    [k] kthread
 31.30%     0.00%  [unknown]   [k] 0xffffffffc082f889
 22.98%     4.16%  [kernel]    [k] vfree
 14.36%     0.28%  [kernel]    [k] __get_vm_area_node
 13.43%     3.35%  [kernel]    [k] alloc_vmap_area
 10.86%     0.04%  [kernel]    [k] remove_vm_area
  8.89%     2.75%  [kernel]    [k] _raw_spin_lock
  7.19%     0.00%  [unknown]   [k] 0xffffffffc082fba3
  6.65%     1.37%  [kernel]    [k] free_unref_page
  6.13%     6.11%  [kernel]    [k] native_queued_spin_lock_slowpath
<patch-series perf>

This confirms that the native_queued_spin_lock_slowpath bottleneck
can be considered negligible for the patch-series version.

The throughput is ~15x higher (first run: default kernel, second run:
with the patch series):

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    24m3.305s
user    0m0.361s
sys     0m0.013s
urezki@pc638:~$

urezki@pc638:~$ time sudo ./test_vmalloc.sh run_test_mask=127 nr_threads=64
Run the test with following parameters: run_test_mask=127 nr_threads=64
Done.
Check the kernel ring buffer to see the summary.

real    1m28.382s
user    0m0.014s
sys     0m0.026s
urezki@pc638:~$

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9cce012aecdb..08990f630c21 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -796,6 +796,9 @@ struct vmap_node {
 	atomic_t fill_in_progress;
 };
 
+#define MAX_NODES U8_MAX
+#define MAX_NODE_SIZE SZ_4M
+
 static struct vmap_node *nodes, snode;
 static __read_mostly unsigned int nr_nodes = 1;
 static __read_mostly unsigned int node_size = 1;
@@ -4803,11 +4806,24 @@ static void vmap_init_free_space(void)
 	}
 }
 
+static unsigned int calculate_nr_nodes(void)
+{
+	unsigned int nr_cpus;
+
+	nr_cpus = num_present_cpus();
+	if (nr_cpus <= 1)
+		nr_cpus = num_possible_cpus();
+
+	/* Density factor. Two users per a node. */
+	return clamp_t(unsigned int, nr_cpus >> 1, 1, MAX_NODES);
+}
+
 static void vmap_init_nodes(void)
 {
 	struct vmap_node *vn;
 	int i;
 
+	nr_nodes = calculate_nr_nodes();
 	nodes = &snode;
 
 	if (nr_nodes > 1) {
@@ -4830,6 +4846,16 @@ static void vmap_init_nodes(void)
 		INIT_LIST_HEAD(&vn->free.head);
 		spin_lock_init(&vn->free.lock);
 	}
+
+	/*
+	 * Scale a node size to number of CPUs. Each power of two
+	 * value doubles a node size. A high-threshold limit is set
+	 * to 4M.
+	 */
+#if BITS_PER_LONG == 64
+	if (nr_nodes > 1)
+		node_size = min(SZ_64K << fls(num_possible_cpus()), SZ_4M);
+#endif
 }
 
 void __init vmalloc_init(void)
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-08-29  8:11 ` [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
@ 2023-08-29 14:30   ` kernel test robot
  2023-08-30 14:48     ` Uladzislau Rezki
  2023-09-07  2:17     ` Baoquan He
  2023-09-11  2:38   ` Baoquan He
  2 siblings, 1 reply; 74+ messages in thread
From: kernel test robot @ 2023-08-29 14:30 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony), linux-mm, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, LKML, Baoquan He,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Uladzislau Rezki, Oleksiy Avramchenko

Hi Uladzislau,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Uladzislau-Rezki-Sony/mm-vmalloc-Add-va_alloc-helper/20230829-161248
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230829081142.3619-5-urezki%40gmail.com
patch subject: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
config: csky-randconfig-r024-20230829 (https://download.01.org/0day-ci/archive/20230829/202308292228.RRrGUYyB-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230829/202308292228.RRrGUYyB-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308292228.RRrGUYyB-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/vmalloc.c: In function 'vmap_init_free_space':
>> mm/vmalloc.c:4506:45: warning: ordered comparison of pointer with integer zero [-Wextra]
    4506 |                 if (busy->addr - vmap_start > 0) {
         |                                             ^


vim +4506 mm/vmalloc.c

  4491	
  4492	static void vmap_init_free_space(void)
  4493	{
  4494		unsigned long vmap_start = 1;
  4495		const unsigned long vmap_end = ULONG_MAX;
  4496		struct vmap_area *free;
  4497		struct vm_struct *busy;
  4498	
  4499		/*
  4500		 *     B     F     B     B     B     F
  4501		 * -|-----|.....|-----|-----|-----|.....|-
  4502		 *  |           The KVA space           |
  4503		 *  |<--------------------------------->|
  4504		 */
  4505		for (busy = vmlist; busy; busy = busy->next) {
> 4506			if (busy->addr - vmap_start > 0) {
  4507				free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
  4508				if (!WARN_ON_ONCE(!free)) {
  4509					free->va_start = vmap_start;
  4510					free->va_end = (unsigned long) busy->addr;
  4511	
  4512					insert_vmap_area_augment(free, NULL,
  4513						&free_vmap_area_root,
  4514							&free_vmap_area_list);
  4515				}
  4516			}
  4517	
  4518			vmap_start = (unsigned long) busy->addr + busy->size;
  4519		}
  4520	
  4521		if (vmap_end - vmap_start > 0) {
  4522			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
  4523			if (!WARN_ON_ONCE(!free)) {
  4524				free->va_start = vmap_start;
  4525				free->va_end = vmap_end;
  4526	
  4527				insert_vmap_area_augment(free, NULL,
  4528					&free_vmap_area_root,
  4529						&free_vmap_area_list);
  4530			}
  4531		}
  4532	}
  4533	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-08-29 14:30   ` kernel test robot
@ 2023-08-30 14:48     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-08-30 14:48 UTC (permalink / raw)
  To: kernel test robot
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, oe-kbuild-all, LKML, Baoquan He,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko

On Tue, Aug 29, 2023 at 10:30:19PM +0800, kernel test robot wrote:
> Hi Uladzislau,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on akpm-mm/mm-everything]
> [also build test WARNING on linus/master v6.5]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Uladzislau-Rezki-Sony/mm-vmalloc-Add-va_alloc-helper/20230829-161248
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20230829081142.3619-5-urezki%40gmail.com
> patch subject: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
> config: csky-randconfig-r024-20230829 (https://download.01.org/0day-ci/archive/20230829/202308292228.RRrGUYyB-lkp@intel.com/config)
> compiler: csky-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230829/202308292228.RRrGUYyB-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202308292228.RRrGUYyB-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
>    mm/vmalloc.c: In function 'vmap_init_free_space':
> >> mm/vmalloc.c:4506:45: warning: ordered comparison of pointer with integer zero [-Wextra]
>     4506 |                 if (busy->addr - vmap_start > 0) {
>          |                                             ^
>
Right. I will fix it. We should cast busy->addr to unsigned long.
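
For reference, the warning class is easy to reproduce stand-alone
(illustrative only; the struct below is a stripped-down stand-in for
vm_struct, compile with "gcc -Wextra -c warn.c"):

<snip>
/* warn.c - same expression shape as the flagged line above. */
struct vm_struct {
	void *addr;
	unsigned long size;
	struct vm_struct *next;
};

int has_gap(struct vm_struct *busy, unsigned long vmap_start)
{
	/*
	 * busy->addr is a pointer, so the subtraction is pointer
	 * arithmetic and its result gets ordered-compared with zero:
	 * "ordered comparison of pointer with integer zero" under -Wextra.
	 */
	return busy->addr - vmap_start > 0;
}

int has_gap_fixed(struct vm_struct *busy, unsigned long vmap_start)
{
	/* Casting to unsigned long makes it a plain integer comparison. */
	return (unsigned long) busy->addr - vmap_start > 0;
}
<snip>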

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (8 preceding siblings ...)
  2023-08-29  8:11 ` [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores Uladzislau Rezki (Sony)
@ 2023-08-31  1:15 ` Baoquan He
  2023-08-31 16:26   ` Uladzislau Rezki
  2023-09-04 14:55 ` Uladzislau Rezki
  2023-09-06 20:04 ` Lorenzo Stoakes
  11 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-08-31  1:15 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Hello, folk!
> 
> This is the v2, the series which tends to minimize the vmap
> lock contention. It is based on the tag: v6.5-rc6. Here you
> can find a documentation about it:
> 
> wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

Seems the wget command doesn't work for me. Not sure if other people can
retrieve it successfully.

--2023-08-30 21:14:20--  ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
           => ‘Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf’
Resolving vps418301.ovh.net (vps418301.ovh.net)... 37.187.244.100
Connecting to vps418301.ovh.net (vps418301.ovh.net)|37.187.244.100|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /incoming ... done.
==> SIZE Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ... done.

==> PASV ... done.    ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ... 
No such file ‘Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf’.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-08-31  1:15 ` [PATCH v2 0/9] Mitigate a vmap lock contention v2 Baoquan He
@ 2023-08-31 16:26   ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-08-31 16:26 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Thu, Aug 31, 2023 at 09:15:46AM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Hello, folk!
> > 
> > This is the v2, the series which tends to minimize the vmap
> > lock contention. It is based on the tag: v6.5-rc6. Here you
> > can find a documentation about it:
> > 
> > wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
> 
> Seems the wget command doesn't work for me. Not sure if other people can
> retrieve it successfully.
> 
> --2023-08-30 21:14:20--  ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
>            => ‘Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf’
> Resolving vps418301.ovh.net (vps418301.ovh.net)... 37.187.244.100
> Connecting to vps418301.ovh.net (vps418301.ovh.net)|37.187.244.100|:21... connected.
> Logging in as anonymous ... Logged in!
> ==> SYST ... done.    ==> PWD ... done.
> ==> TYPE I ... done.  ==> CWD (1) /incoming ... done.
> ==> SIZE Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ... done.
> 
> ==> PASV ... done.    ==> RETR Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf ... 
> No such file ‘Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf’.
> 
Right. Same issue as last time. I renamed the file but pointed
to the old name. Here we go:

wget ftp://vps418301.ovh.net/incoming/Mitigate_a_vmalloc_lock_contention_in_SMP_env_v2.pdf

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (9 preceding siblings ...)
  2023-08-31  1:15 ` [PATCH v2 0/9] Mitigate a vmap lock contention v2 Baoquan He
@ 2023-09-04 14:55 ` Uladzislau Rezki
  2023-09-04 19:53   ` Andrew Morton
  2023-09-06 20:04 ` Lorenzo Stoakes
  11 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-04 14:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andrew Morton, LKML, Baoquan He, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

Hello, Andrew!

> Hello, folk!
> 
> This is the v2, the series which tends to minimize the vmap
> lock contention. It is based on the tag: v6.5-rc6. Here you
> can find a documentation about it:
> 
> wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
> 
> even though it is a bit outdated(it follows v1), it still gives a
> good overview on the problem and how it can be solved. On demand
> and by request i can update it.
> 
> The v1 is here: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
> 
> Delta v1 -> v2:
>   - open coded locking;
>   - switch to array of nodes instead of per-cpu definition;
>   - density is 2 cores per one node(not equal to number of CPUs);
>   - VAs first go back(free path) to an owner node and later to
>     a global heap if a block is fully freed, nid is saved in va->flags;
>   - add helpers to drain lazily-freed areas faster, if high pressure;
>   - picked al Reviewed-by.
> 
> Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
> sudo ./test_vmalloc.sh run_test_mask=127 nr_threads=64
> 
> <v6.5-rc6 perf>
>   94.17%     0.90%  [kernel]    [k] _raw_spin_lock
>   93.27%    93.05%  [kernel]    [k] native_queued_spin_lock_slowpath
>   74.69%     0.25%  [kernel]    [k] __vmalloc_node_range
>   72.64%     0.01%  [kernel]    [k] __get_vm_area_node
>   72.04%     0.89%  [kernel]    [k] alloc_vmap_area
>   42.17%     0.00%  [kernel]    [k] vmalloc
>   32.53%     0.00%  [kernel]    [k] __vmalloc_node
>   24.91%     0.25%  [kernel]    [k] vfree
>   24.32%     0.01%  [kernel]    [k] remove_vm_area
>   22.63%     0.21%  [kernel]    [k] find_unlink_vmap_area
>   15.51%     0.00%  [unknown]   [k] 0xffffffffc09a74ac
>   14.35%     0.00%  [kernel]    [k] ret_from_fork_asm
>   14.35%     0.00%  [kernel]    [k] ret_from_fork
>   14.35%     0.00%  [kernel]    [k] kthread
> <v6.5-rc6 perf>
>    vs
> <v6.5-rc6+v2 perf>
>   74.32%     2.42%  [kernel]    [k] __vmalloc_node_range
>   69.58%     0.01%  [kernel]    [k] vmalloc
>   54.21%     1.17%  [kernel]    [k] __alloc_pages_bulk
>   48.13%    47.91%  [kernel]    [k] clear_page_orig
>   43.60%     0.01%  [unknown]   [k] 0xffffffffc082f16f
>   32.06%     0.00%  [kernel]    [k] ret_from_fork_asm
>   32.06%     0.00%  [kernel]    [k] ret_from_fork
>   32.06%     0.00%  [kernel]    [k] kthread
>   31.30%     0.00%  [unknown]   [k] 0xffffffffc082f889
>   22.98%     4.16%  [kernel]    [k] vfree
>   14.36%     0.28%  [kernel]    [k] __get_vm_area_node
>   13.43%     3.35%  [kernel]    [k] alloc_vmap_area
>   10.86%     0.04%  [kernel]    [k] remove_vm_area
>    8.89%     2.75%  [kernel]    [k] _raw_spin_lock
>    7.19%     0.00%  [unknown]   [k] 0xffffffffc082fba3
>    6.65%     1.37%  [kernel]    [k] free_unref_page
>    6.13%     6.11%  [kernel]    [k] native_queued_spin_lock_slowpath
> <v6.5-rc6+v2 perf>
> 
> On smaller systems, for example, 8xCPU Hikey960 board the
> contention is not that high and is approximately ~16 percent.
> 
> Uladzislau Rezki (Sony) (9):
>   mm: vmalloc: Add va_alloc() helper
>   mm: vmalloc: Rename adjust_va_to_fit_type() function
>   mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
>   mm: vmalloc: Remove global vmap_area_root rb-tree
>   mm: vmalloc: Remove global purge_vmap_area_root rb-tree
>   mm: vmalloc: Offload free_vmap_area_lock lock
>   mm: vmalloc: Support multiple nodes in vread_iter
>   mm: vmalloc: Support multiple nodes in vmallocinfo
>   mm: vmalloc: Set nr_nodes/node_size based on CPU-cores
> 
>  mm/vmalloc.c | 929 +++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 683 insertions(+), 246 deletions(-)
> 
> -- 
> 2.30.2
> 
It would be good if this series could get some runtime testing from people.
So far there has been one warning from the test robot:

https://lore.kernel.org/lkml/202308292228.RRrGUYyB-lkp@intel.com/T/#m397b3834cb3b7a0a53b8dffb3624384c8e278007

<snip>
urezki@pc638:~/data/raid0/coding/linux.git$ git diff
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 08990f630c21..7105d7bcd37e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4778,7 +4778,7 @@ static void vmap_init_free_space(void)
         *  |<--------------------------------->|
         */
        for (busy = vmlist; busy; busy = busy->next) {
-               if (busy->addr - vmap_start > 0) {
+               if ((unsigned long) busy->addr - vmap_start > 0) {
                        free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
                        if (!WARN_ON_ONCE(!free)) {
                                free->va_start = vmap_start;
urezki@pc638:~/data/raid0/coding/linux.git$
<snip>

This extra patch has to be applied to fix the warning. 

From my side I have tested it as much as I can. Can it be plugged
into linux-next to get some runtime? Or is there any other way you
would prefer to go?

Thank you in advance!

--
Uladzislau Rezki

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-09-04 14:55 ` Uladzislau Rezki
@ 2023-09-04 19:53   ` Andrew Morton
  2023-09-05  6:53     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Andrew Morton @ 2023-09-04 19:53 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, LKML, Baoquan He, Lorenzo Stoakes, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Oleksiy Avramchenko

On Mon, 4 Sep 2023 16:55:38 +0200 Uladzislau Rezki <urezki@gmail.com> wrote:

> It would be good if this series could get some runtime testing from people.

I grabbed it.  We're supposed to avoid adding new material to -next until
after -rc1 is released, but I've cheated before ;)

That (inaccessible) pdf file is awkward.  Could you please send out
a suitable [0/N] cover letter for this series, which can be incorporated
into the git record?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-09-04 19:53   ` Andrew Morton
@ 2023-09-05  6:53     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-05  6:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Uladzislau Rezki, linux-mm, LKML, Baoquan He, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Mon, Sep 04, 2023 at 12:53:21PM -0700, Andrew Morton wrote:
> On Mon, 4 Sep 2023 16:55:38 +0200 Uladzislau Rezki <urezki@gmail.com> wrote:
> 
> > It would be good if this series could get some runtime testing from people.
> 
> I grabbed it.  We're supposed to avoid adding new material to -next until
> after -rc1 is released, but I've cheated before ;)
> 
> That (inaccessible) pdf file is awkward.  Could you please send out
> a suitable [0/N] cover letter for this series, which can be incorporated
> into the git record?
>
There will be a v3 anyway where I update the cover letter. The v2 is not
adapted to Joel's recently introduced patch, which is not in linux-next
yet but will land soon:

<snip>
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: mm/vmalloc: add a safer version of find_vm_area() for debug
Date: Mon, 4 Sep 2023 18:08:04 +0000

It is unsafe to dump vmalloc area information when trying to do so from
some contexts.  Add a safer trylock version of the same function to do a
best-effort VMA finding and use it from vmalloc_dump_obj().
<snip>

Also, some extra reviews and comments for v2 might still come in.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper
  2023-08-29  8:11 ` [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper Uladzislau Rezki (Sony)
@ 2023-09-06  5:51   ` Baoquan He
  2023-09-06 15:06     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-06  5:51 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, Christoph Hellwig

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Currently __alloc_vmap_area() function contains an open codded
> logic that finds and adjusts a VA based on allocation request.
> 
> Introduce a va_alloc() helper that adjusts found VA only. It
> will be used later at least in two places.
> 
> There is no a functional change as a result of this patch.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 41 ++++++++++++++++++++++++++++-------------
>  1 file changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 93cf99aba335..00afc1ee4756 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1481,6 +1481,32 @@ adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
>  	return 0;
>  }
>  
> +static unsigned long
> +va_alloc(struct vmap_area *va,
> +		struct rb_root *root, struct list_head *head,
> +		unsigned long size, unsigned long align,
> +		unsigned long vstart, unsigned long vend)
> +{
> +	unsigned long nva_start_addr;
> +	int ret;
> +
> +	if (va->va_start > vstart)
> +		nva_start_addr = ALIGN(va->va_start, align);
> +	else
> +		nva_start_addr = ALIGN(vstart, align);
> +
> +	/* Check the "vend" restriction. */
> +	if (nva_start_addr + size > vend)
> +		return vend;
> +
> +	/* Update the free vmap_area. */
> +	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> +	if (WARN_ON_ONCE(ret))
> +		return vend;
> +
> +	return nva_start_addr;
> +}
> +
>  /*
>   * Returns a start address of the newly allocated area, if success.
>   * Otherwise a vend is returned that indicates failure.
> @@ -1493,7 +1519,6 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
>  	bool adjust_search_size = true;
>  	unsigned long nva_start_addr;
>  	struct vmap_area *va;
> -	int ret;
>  
>  	/*
>  	 * Do not adjust when:
> @@ -1511,18 +1536,8 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
>  	if (unlikely(!va))
>  		return vend;
>  
> -	if (va->va_start > vstart)
> -		nva_start_addr = ALIGN(va->va_start, align);
> -	else
> -		nva_start_addr = ALIGN(vstart, align);
> -
> -	/* Check the "vend" restriction. */
> -	if (nva_start_addr + size > vend)
> -		return vend;
> -
> -	/* Update the free vmap_area. */
> -	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> -	if (WARN_ON_ONCE(ret))
> +	nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend);
> +	if (nva_start_addr == vend)
>  		return vend;
>  
>  #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
> -- 
> 2.30.2

Reviewed-by: Baoquan He <bhe@redhat.com>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function
  2023-08-29  8:11 ` [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function Uladzislau Rezki (Sony)
@ 2023-09-06  5:51   ` Baoquan He
  2023-09-06 16:27     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-06  5:51 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, Christoph Hellwig

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> This patch renames the adjust_va_to_fit_type() function
> to va_clip() which is shorter and more expressive.
> 
> There is no a functional change as a result of this patch.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 00afc1ee4756..09e315f8ea34 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1382,9 +1382,9 @@ classify_va_fit_type(struct vmap_area *va,
>  }
>  
>  static __always_inline int
> -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> -		      struct vmap_area *va, unsigned long nva_start_addr,
> -		      unsigned long size)
> +va_clip(struct rb_root *root, struct list_head *head,
> +		struct vmap_area *va, unsigned long nva_start_addr,
> +		unsigned long size)
>  {
>  	struct vmap_area *lva = NULL;
>  	enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
> @@ -1500,7 +1500,7 @@ va_alloc(struct vmap_area *va,
>  		return vend;
>  
>  	/* Update the free vmap_area. */
> -	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> +	ret = va_clip(root, head, va, nva_start_addr, size);
>  	if (WARN_ON_ONCE(ret))
>  		return vend;
>  
> @@ -4151,9 +4151,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  			/* It is a BUG(), but trigger recovery instead. */
>  			goto recovery;
>  
> -		ret = adjust_va_to_fit_type(&free_vmap_area_root,
> -					    &free_vmap_area_list,
> -					    va, start, size);
> +		ret = va_clip(&free_vmap_area_root,
> +			&free_vmap_area_list, va, start, size);
>  		if (WARN_ON_ONCE(unlikely(ret)))
>  			/* It is a BUG(), but trigger recovery instead. */
>  			goto recovery;
> -- 
> 2.30.2
> 

Reviewed-by: Baoquan He <bhe@redhat.com>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  2023-08-29  8:11 ` [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c Uladzislau Rezki (Sony)
@ 2023-09-06  5:52   ` Baoquan He
  2023-09-06 16:29     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-06  5:52 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, Christoph Hellwig

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> A vmap_init_free_space() is a function that setups a vmap space
> and is considered as part of initialization phase. Since a main
> entry which is vmalloc_init(), has been moved down in vmalloc.c
> it makes sense to follow the pattern.
> 
> There is no a functional change as a result of this patch.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------------
>  1 file changed, 41 insertions(+), 41 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 09e315f8ea34..b7deacca1483 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2512,47 +2512,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
>  	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
>  }
>  
> -static void vmap_init_free_space(void)
> -{
> -	unsigned long vmap_start = 1;
> -	const unsigned long vmap_end = ULONG_MAX;
> -	struct vmap_area *busy, *free;
> -
> -	/*
> -	 *     B     F     B     B     B     F
> -	 * -|-----|.....|-----|-----|-----|.....|-
> -	 *  |           The KVA space           |
> -	 *  |<--------------------------------->|
> -	 */
> -	list_for_each_entry(busy, &vmap_area_list, list) {
> -		if (busy->va_start - vmap_start > 0) {
> -			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> -			if (!WARN_ON_ONCE(!free)) {
> -				free->va_start = vmap_start;
> -				free->va_end = busy->va_start;
> -
> -				insert_vmap_area_augment(free, NULL,
> -					&free_vmap_area_root,
> -						&free_vmap_area_list);
> -			}
> -		}
> -
> -		vmap_start = busy->va_end;
> -	}
> -
> -	if (vmap_end - vmap_start > 0) {
> -		free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> -		if (!WARN_ON_ONCE(!free)) {
> -			free->va_start = vmap_start;
> -			free->va_end = vmap_end;
> -
> -			insert_vmap_area_augment(free, NULL,
> -				&free_vmap_area_root,
> -					&free_vmap_area_list);
> -		}
> -	}
> -}
> -
>  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
>  	struct vmap_area *va, unsigned long flags, const void *caller)
>  {
> @@ -4443,6 +4402,47 @@ module_init(proc_vmalloc_init);
>  
>  #endif
>  
> +static void vmap_init_free_space(void)
> +{
> +	unsigned long vmap_start = 1;
> +	const unsigned long vmap_end = ULONG_MAX;
> +	struct vmap_area *busy, *free;
> +
> +	/*
> +	 *     B     F     B     B     B     F
> +	 * -|-----|.....|-----|-----|-----|.....|-
> +	 *  |           The KVA space           |
> +	 *  |<--------------------------------->|
> +	 */
> +	list_for_each_entry(busy, &vmap_area_list, list) {
> +		if (busy->va_start - vmap_start > 0) {
> +			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> +			if (!WARN_ON_ONCE(!free)) {
> +				free->va_start = vmap_start;
> +				free->va_end = busy->va_start;
> +
> +				insert_vmap_area_augment(free, NULL,
> +					&free_vmap_area_root,
> +						&free_vmap_area_list);
> +			}
> +		}
> +
> +		vmap_start = busy->va_end;
> +	}
> +
> +	if (vmap_end - vmap_start > 0) {
> +		free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> +		if (!WARN_ON_ONCE(!free)) {
> +			free->va_start = vmap_start;
> +			free->va_end = vmap_end;
> +
> +			insert_vmap_area_augment(free, NULL,
> +				&free_vmap_area_root,
> +					&free_vmap_area_list);
> +		}
> +	}
> +}
> +
>  void __init vmalloc_init(void)
>  {
>  	struct vmap_area *va;
> -- 
> 2.30.2
> 

Reviewed-by: Baoquan He <bhe@redhat.com>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-08-29  8:11 ` [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock Uladzislau Rezki (Sony)
@ 2023-09-06  6:04   ` Baoquan He
  2023-09-06 19:16     ` Uladzislau Rezki
  2023-09-11  3:25   ` Baoquan He
  1 sibling, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-06  6:04 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Concurrent access to a global vmap space is a bottle-neck.
> We can simulate a high contention by running a vmalloc test
> suite.
> 
> To address it, introduce an effective vmap node logic. Each
> node behaves as independent entity. When a node is accessed
> it serves a request directly(if possible) also it can fetch
> a new block from a global heap to its internals if no space
> or low capacity is left.
> 
> This technique reduces a pressure on the global vmap lock.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 316 +++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 279 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5a8a9c1370b6..4fd4915c532d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -779,6 +779,7 @@ struct rb_list {
>  
>  struct vmap_node {
>  	/* Bookkeeping data of this node. */
> +	struct rb_list free;
>  	struct rb_list busy;
>  	struct rb_list lazy;
>  
> @@ -786,6 +787,13 @@ struct vmap_node {
>  	 * Ready-to-free areas.
>  	 */
>  	struct list_head purge_list;
> +	struct work_struct purge_work;
> +	unsigned long nr_purged;
> +
> +	/*
> +	 * Control that only one user can pre-fetch this node.
> +	 */
> +	atomic_t fill_in_progress;
>  };
>  
>  static struct vmap_node *nodes, snode;
> @@ -804,6 +812,32 @@ addr_to_node(unsigned long addr)
>  	return &nodes[addr_to_node_id(addr)];
>  }
>  
> +static inline struct vmap_node *
> +id_to_node(int id)
> +{
> +	return &nodes[id % nr_nodes];
> +}
> +
> +static inline int
> +this_node_id(void)
> +{
> +	return raw_smp_processor_id() % nr_nodes;
> +}
> +
> +static inline unsigned long
> +encode_vn_id(int node_id)
> +{
> +	/* Can store U8_MAX [0:254] nodes. */
> +	return (node_id + 1) << BITS_PER_BYTE;
> +}
> +
> +static inline int
> +decode_vn_id(unsigned long val)
> +{
> +	/* Can store U8_MAX [0:254] nodes. */
> +	return (val >> BITS_PER_BYTE) - 1;
> +}
> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {
> @@ -1586,6 +1620,7 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
>  static void free_vmap_area(struct vmap_area *va)
>  {
>  	struct vmap_node *vn = addr_to_node(va->va_start);
> +	int vn_id = decode_vn_id(va->flags);
>  
>  	/*
>  	 * Remove from the busy tree/list.
> @@ -1594,12 +1629,19 @@ static void free_vmap_area(struct vmap_area *va)
>  	unlink_va(va, &vn->busy.root);
>  	spin_unlock(&vn->busy.lock);
>  
> -	/*
> -	 * Insert/Merge it back to the free tree/list.
> -	 */
> -	spin_lock(&free_vmap_area_lock);
> -	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> -	spin_unlock(&free_vmap_area_lock);
> +	if (vn_id >= 0) {

In alloc_vmap_area(), the vn_id is encoded into va->flags. When the
allocation fails, vn_id is 0. Here, should we change the check to
'if (vn_id > 0)', because vn_id == 0 means no vn_id was encoded? And
I do not get how we treat the case where vn_id truly is 0.

	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
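
For reference, the arithmetic of these helpers can be checked in
isolation (plain userspace sketch; the two helper bodies are copied
from the hunk quoted above, the rest is illustrative):

<snip>
#include <stdio.h>

#define BITS_PER_BYTE 8

static unsigned long encode_vn_id(int node_id)
{
	/* Can store U8_MAX [0:254] nodes. */
	return (node_id + 1) << BITS_PER_BYTE;
}

static int decode_vn_id(unsigned long val)
{
	/* Can store U8_MAX [0:254] nodes. */
	return (val >> BITS_PER_BYTE) - 1;
}

int main(void)
{
	/* prints "100 0 -1" on a typical gcc/LP64 setup */
	printf("%lx %d %d\n", encode_vn_id(0),
		decode_vn_id(encode_vn_id(0)), decode_vn_id(0));
	return 0;
}
<snip>

Reading just these helpers, vn_id == 0 (node 0) encodes to a non-zero
flags value, while a flags value of 0 decodes to -1, which is what the
'vn_id >= 0' checks appear to rely on.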

> +		vn = id_to_node(vn_id);
> +
> +		/* Belongs to this node. */
> +		spin_lock(&vn->free.lock);
> +		merge_or_add_vmap_area_augment(va, &vn->free.root, &vn->free.head);
> +		spin_unlock(&vn->free.lock);
> +	} else {
> +		/* Goes to global. */
> +		spin_lock(&free_vmap_area_lock);
> +		merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> +		spin_unlock(&free_vmap_area_lock);
> +	}
>  }
>  
>  static inline void
......
> @@ -1640,7 +1810,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	unsigned long freed;
>  	unsigned long addr;
>  	int purged = 0;
> -	int ret;
> +	int ret, vn_id;
>  
>  	if (unlikely(!size || offset_in_page(size) || !is_power_of_2(align)))
>  		return ERR_PTR(-EINVAL);
> @@ -1661,11 +1831,17 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	 */
>  	kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
>  
> +	vn_id = this_node_id();
> +	addr = node_alloc(vn_id, size, align, vstart, vend, gfp_mask, node);
> +	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
> +
>  retry:
> -	preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> -	addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> -		size, align, vstart, vend);
> -	spin_unlock(&free_vmap_area_lock);
> +	if (addr == vend) {
> +		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> +		addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> +			size, align, vstart, vend);
> +		spin_unlock(&free_vmap_area_lock);
> +	}
>  
>  	trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);
>  
> @@ -1679,7 +1855,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	va->va_start = addr;
>  	va->va_end = addr + size;
>  	va->vm = NULL;
> -	va->flags = va_flags;
> +	va->flags |= va_flags;
>  
>  	vn = addr_to_node(va->va_start);
>  
> @@ -1772,31 +1948,58 @@ static DEFINE_MUTEX(vmap_purge_lock);
>  static void purge_fragmented_blocks_allcpus(void);
>  static cpumask_t purge_nodes;
>  
> -/*
> - * Purges all lazily-freed vmap areas.
> - */
> -static unsigned long
> -purge_vmap_node(struct vmap_node *vn)
> +static void
> +reclaim_list_global(struct list_head *head)
> +{
> +	struct vmap_area *va, *n;
> +
> +	if (list_empty(head))
> +		return;
> +
> +	spin_lock(&free_vmap_area_lock);
> +	list_for_each_entry_safe(va, n, head, list)
> +		merge_or_add_vmap_area_augment(va,
> +			&free_vmap_area_root, &free_vmap_area_list);
> +	spin_unlock(&free_vmap_area_lock);
> +}
> +
> +static void purge_vmap_node(struct work_struct *work)
>  {
> -	unsigned long num_purged_areas = 0;
> +	struct vmap_node *vn = container_of(work,
> +		struct vmap_node, purge_work);
>  	struct vmap_area *va, *n_va;
> +	LIST_HEAD(global);
> +
> +	vn->nr_purged = 0;
>  
>  	if (list_empty(&vn->purge_list))
> -		return 0;
> +		return;
>  
> -	spin_lock(&free_vmap_area_lock);
> +	spin_lock(&vn->free.lock);
>  	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
>  		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
>  		unsigned long orig_start = va->va_start;
>  		unsigned long orig_end = va->va_end;
> +		int vn_id = decode_vn_id(va->flags);
>  
> -		/*
> -		 * Finally insert or merge lazily-freed area. It is
> -		 * detached and there is no need to "unlink" it from
> -		 * anything.
> -		 */
> -		va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
> -				&free_vmap_area_list);
> +		list_del_init(&va->list);
> +
> +		if (vn_id >= 0) {
> +			if (va_size(va) != node_size - (2 * PAGE_SIZE))
> +				va = merge_or_add_vmap_area_augment(va, &vn->free.root, &vn->free.head);
> +
> +			if (va_size(va) == node_size - (2 * PAGE_SIZE)) {
> +				if (!list_empty(&va->list))
> +					unlink_va_augment(va, &vn->free.root);
> +
> +				/* Restore the block size. */
> +				va->va_start -= PAGE_SIZE;
> +				va->va_end += PAGE_SIZE;
> +				list_add(&va->list, &global);
> +			}
> +		} else {
> +			list_add(&va->list, &global);
> +		}
>  
>  		if (!va)
>  			continue;
> @@ -1806,11 +2009,10 @@ purge_vmap_node(struct vmap_node *vn)
>  					      va->va_start, va->va_end);
>  
>  		atomic_long_sub(nr, &vmap_lazy_nr);
> -		num_purged_areas++;
> +		vn->nr_purged++;
>  	}
> -	spin_unlock(&free_vmap_area_lock);
> -
> -	return num_purged_areas;
> +	spin_unlock(&vn->free.lock);
> +	reclaim_list_global(&global);
>  }
>  
>  /*
> @@ -1818,11 +2020,17 @@ purge_vmap_node(struct vmap_node *vn)
>   */
>  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  {
> -	unsigned long num_purged_areas = 0;
> +	unsigned long nr_purged_areas = 0;
> +	unsigned int nr_purge_helpers;
> +	unsigned int nr_purge_nodes;
>  	struct vmap_node *vn;
>  	int i;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
> +
> +	/*
> +	 * Use cpumask to mark which node has to be processed.
> +	 */
>  	purge_nodes = CPU_MASK_NONE;
>  
>  	for (i = 0; i < nr_nodes; i++) {
> @@ -1847,17 +2055,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  		cpumask_set_cpu(i, &purge_nodes);
>  	}
>  
> -	if (cpumask_weight(&purge_nodes) > 0) {
> +	nr_purge_nodes = cpumask_weight(&purge_nodes);
> +	if (nr_purge_nodes > 0) {
>  		flush_tlb_kernel_range(start, end);
>  
> +		/* One extra worker is per a lazy_max_pages() full set minus one. */
> +		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
> +		nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
> +
> +		for_each_cpu(i, &purge_nodes) {
> +			vn = &nodes[i];
> +
> +			if (nr_purge_helpers > 0) {
> +				INIT_WORK(&vn->purge_work, purge_vmap_node);
> +
> +				if (cpumask_test_cpu(i, cpu_online_mask))
> +					schedule_work_on(i, &vn->purge_work);
> +				else
> +					schedule_work(&vn->purge_work);
> +
> +				nr_purge_helpers--;
> +			} else {
> +				vn->purge_work.func = NULL;
> +				purge_vmap_node(&vn->purge_work);
> +				nr_purged_areas += vn->nr_purged;
> +			}
> +		}
> +
>  		for_each_cpu(i, &purge_nodes) {
>  			vn = &nodes[i];
> -			num_purged_areas += purge_vmap_node(vn);
> +
> +			if (vn->purge_work.func) {
> +				flush_work(&vn->purge_work);
> +				nr_purged_areas += vn->nr_purged;
> +			}
>  		}
>  	}
>  
> -	trace_purge_vmap_area_lazy(start, end, num_purged_areas);
> -	return num_purged_areas > 0;
> +	trace_purge_vmap_area_lazy(start, end, nr_purged_areas);
> +	return nr_purged_areas > 0;
>  }
>  
>  /*
> @@ -1886,9 +2122,11 @@ static void drain_vmap_area_work(struct work_struct *work)
>   */
>  static void free_vmap_area_noflush(struct vmap_area *va)
>  {
> -	struct vmap_node *vn = addr_to_node(va->va_start);
>  	unsigned long nr_lazy_max = lazy_max_pages();
>  	unsigned long va_start = va->va_start;
> +	int vn_id = decode_vn_id(va->flags);
> +	struct vmap_node *vn = vn_id >= 0 ? id_to_node(vn_id):
> +		addr_to_node(va->va_start);;
>  	unsigned long nr_lazy;
>  
>  	if (WARN_ON_ONCE(!list_empty(&va->list)))
> @@ -4574,6 +4812,10 @@ static void vmap_init_nodes(void)
>  		vn->lazy.root = RB_ROOT;
>  		INIT_LIST_HEAD(&vn->lazy.head);
>  		spin_lock_init(&vn->lazy.lock);
> +
> +		vn->free.root = RB_ROOT;
> +		INIT_LIST_HEAD(&vn->free.head);
> +		spin_lock_init(&vn->free.lock);
>  	}
>  }
>  
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper
  2023-09-06  5:51   ` Baoquan He
@ 2023-09-06 15:06     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-06 15:06 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, Christoph Hellwig

On Wed, Sep 06, 2023 at 01:51:03PM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Currently __alloc_vmap_area() function contains an open codded
> > logic that finds and adjusts a VA based on allocation request.
> > 
> > Introduce a va_alloc() helper that adjusts found VA only. It
> > will be used later at least in two places.
> > 
> > There is no a functional change as a result of this patch.
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 41 ++++++++++++++++++++++++++++-------------
> >  1 file changed, 28 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 93cf99aba335..00afc1ee4756 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1481,6 +1481,32 @@ adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> >  	return 0;
> >  }
> >  
> > +static unsigned long
> > +va_alloc(struct vmap_area *va,
> > +		struct rb_root *root, struct list_head *head,
> > +		unsigned long size, unsigned long align,
> > +		unsigned long vstart, unsigned long vend)
> > +{
> > +	unsigned long nva_start_addr;
> > +	int ret;
> > +
> > +	if (va->va_start > vstart)
> > +		nva_start_addr = ALIGN(va->va_start, align);
> > +	else
> > +		nva_start_addr = ALIGN(vstart, align);
> > +
> > +	/* Check the "vend" restriction. */
> > +	if (nva_start_addr + size > vend)
> > +		return vend;
> > +
> > +	/* Update the free vmap_area. */
> > +	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> > +	if (WARN_ON_ONCE(ret))
> > +		return vend;
> > +
> > +	return nva_start_addr;
> > +}
> > +
> >  /*
> >   * Returns a start address of the newly allocated area, if success.
> >   * Otherwise a vend is returned that indicates failure.
> > @@ -1493,7 +1519,6 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> >  	bool adjust_search_size = true;
> >  	unsigned long nva_start_addr;
> >  	struct vmap_area *va;
> > -	int ret;
> >  
> >  	/*
> >  	 * Do not adjust when:
> > @@ -1511,18 +1536,8 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
> >  	if (unlikely(!va))
> >  		return vend;
> >  
> > -	if (va->va_start > vstart)
> > -		nva_start_addr = ALIGN(va->va_start, align);
> > -	else
> > -		nva_start_addr = ALIGN(vstart, align);
> > -
> > -	/* Check the "vend" restriction. */
> > -	if (nva_start_addr + size > vend)
> > -		return vend;
> > -
> > -	/* Update the free vmap_area. */
> > -	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> > -	if (WARN_ON_ONCE(ret))
> > +	nva_start_addr = va_alloc(va, root, head, size, align, vstart, vend);
> > +	if (nva_start_addr == vend)
> >  		return vend;
> >  
> >  #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
> > -- 
> > 2.30.2
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Thanks, I picked it up for V3.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function
  2023-09-06  5:51   ` Baoquan He
@ 2023-09-06 16:27     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-06 16:27 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, Christoph Hellwig

On Wed, Sep 06, 2023 at 01:51:42PM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > This patch renames the adjust_va_to_fit_type() function
> > to va_clip() which is shorter and more expressive.
> > 
> > There is no a functional change as a result of this patch.
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 13 ++++++-------
> >  1 file changed, 6 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 00afc1ee4756..09e315f8ea34 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1382,9 +1382,9 @@ classify_va_fit_type(struct vmap_area *va,
> >  }
> >  
> >  static __always_inline int
> > -adjust_va_to_fit_type(struct rb_root *root, struct list_head *head,
> > -		      struct vmap_area *va, unsigned long nva_start_addr,
> > -		      unsigned long size)
> > +va_clip(struct rb_root *root, struct list_head *head,
> > +		struct vmap_area *va, unsigned long nva_start_addr,
> > +		unsigned long size)
> >  {
> >  	struct vmap_area *lva = NULL;
> >  	enum fit_type type = classify_va_fit_type(va, nva_start_addr, size);
> > @@ -1500,7 +1500,7 @@ va_alloc(struct vmap_area *va,
> >  		return vend;
> >  
> >  	/* Update the free vmap_area. */
> > -	ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size);
> > +	ret = va_clip(root, head, va, nva_start_addr, size);
> >  	if (WARN_ON_ONCE(ret))
> >  		return vend;
> >  
> > @@ -4151,9 +4151,8 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
> >  			/* It is a BUG(), but trigger recovery instead. */
> >  			goto recovery;
> >  
> > -		ret = adjust_va_to_fit_type(&free_vmap_area_root,
> > -					    &free_vmap_area_list,
> > -					    va, start, size);
> > +		ret = va_clip(&free_vmap_area_root,
> > +			&free_vmap_area_list, va, start, size);
> >  		if (WARN_ON_ONCE(unlikely(ret)))
> >  			/* It is a BUG(), but trigger recovery instead. */
> >  			goto recovery;
> > -- 
> > 2.30.2
> > 
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Thank you for the review. Picked it up.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
  2023-09-06  5:52   ` Baoquan He
@ 2023-09-06 16:29     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-06 16:29 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, Christoph Hellwig

On Wed, Sep 06, 2023 at 01:52:08PM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > A vmap_init_free_space() is a function that setups a vmap space
> > and is considered as part of initialization phase. Since a main
> > entry which is vmalloc_init(), has been moved down in vmalloc.c
> > it makes sense to follow the pattern.
> > 
> > There is no a functional change as a result of this patch.
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 82 ++++++++++++++++++++++++++--------------------------
> >  1 file changed, 41 insertions(+), 41 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 09e315f8ea34..b7deacca1483 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2512,47 +2512,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
> >  	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
> >  }
> >  
> > -static void vmap_init_free_space(void)
> > -{
> > -	unsigned long vmap_start = 1;
> > -	const unsigned long vmap_end = ULONG_MAX;
> > -	struct vmap_area *busy, *free;
> > -
> > -	/*
> > -	 *     B     F     B     B     B     F
> > -	 * -|-----|.....|-----|-----|-----|.....|-
> > -	 *  |           The KVA space           |
> > -	 *  |<--------------------------------->|
> > -	 */
> > -	list_for_each_entry(busy, &vmap_area_list, list) {
> > -		if (busy->va_start - vmap_start > 0) {
> > -			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> > -			if (!WARN_ON_ONCE(!free)) {
> > -				free->va_start = vmap_start;
> > -				free->va_end = busy->va_start;
> > -
> > -				insert_vmap_area_augment(free, NULL,
> > -					&free_vmap_area_root,
> > -						&free_vmap_area_list);
> > -			}
> > -		}
> > -
> > -		vmap_start = busy->va_end;
> > -	}
> > -
> > -	if (vmap_end - vmap_start > 0) {
> > -		free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> > -		if (!WARN_ON_ONCE(!free)) {
> > -			free->va_start = vmap_start;
> > -			free->va_end = vmap_end;
> > -
> > -			insert_vmap_area_augment(free, NULL,
> > -				&free_vmap_area_root,
> > -					&free_vmap_area_list);
> > -		}
> > -	}
> > -}
> > -
> >  static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> >  	struct vmap_area *va, unsigned long flags, const void *caller)
> >  {
> > @@ -4443,6 +4402,47 @@ module_init(proc_vmalloc_init);
> >  
> >  #endif
> >  
> > +static void vmap_init_free_space(void)
> > +{
> > +	unsigned long vmap_start = 1;
> > +	const unsigned long vmap_end = ULONG_MAX;
> > +	struct vmap_area *busy, *free;
> > +
> > +	/*
> > +	 *     B     F     B     B     B     F
> > +	 * -|-----|.....|-----|-----|-----|.....|-
> > +	 *  |           The KVA space           |
> > +	 *  |<--------------------------------->|
> > +	 */
> > +	list_for_each_entry(busy, &vmap_area_list, list) {
> > +		if (busy->va_start - vmap_start > 0) {
> > +			free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> > +			if (!WARN_ON_ONCE(!free)) {
> > +				free->va_start = vmap_start;
> > +				free->va_end = busy->va_start;
> > +
> > +				insert_vmap_area_augment(free, NULL,
> > +					&free_vmap_area_root,
> > +						&free_vmap_area_list);
> > +			}
> > +		}
> > +
> > +		vmap_start = busy->va_end;
> > +	}
> > +
> > +	if (vmap_end - vmap_start > 0) {
> > +		free = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
> > +		if (!WARN_ON_ONCE(!free)) {
> > +			free->va_start = vmap_start;
> > +			free->va_end = vmap_end;
> > +
> > +			insert_vmap_area_augment(free, NULL,
> > +				&free_vmap_area_root,
> > +					&free_vmap_area_list);
> > +		}
> > +	}
> > +}
> > +
> >  void __init vmalloc_init(void)
> >  {
> >  	struct vmap_area *va;
> > -- 
> > 2.30.2
> > 
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-09-06  6:04   ` Baoquan He
@ 2023-09-06 19:16     ` Uladzislau Rezki
  2023-09-07  0:06       ` Baoquan He
  0 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-06 19:16 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

> >  static void free_vmap_area(struct vmap_area *va)
> >  {
> >  	struct vmap_node *vn = addr_to_node(va->va_start);
> > +	int vn_id = decode_vn_id(va->flags);
> >  
> >  	/*
> >  	 * Remove from the busy tree/list.
> > @@ -1594,12 +1629,19 @@ static void free_vmap_area(struct vmap_area *va)
> >  	unlink_va(va, &vn->busy.root);
> >  	spin_unlock(&vn->busy.lock);
> >  
> > -	/*
> > -	 * Insert/Merge it back to the free tree/list.
> > -	 */
> > -	spin_lock(&free_vmap_area_lock);
> > -	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> > -	spin_unlock(&free_vmap_area_lock);
> > +	if (vn_id >= 0) {
> 
> In alloc_vmap_area(), the vn_id is encoded into va->flags. When
> allocation failed, the vn_id = 0. Here should we change to check 'if
> (vn_id > 0)' becasue the vn_id == 0 means no available vn_id encoded
> into. And I do not get how we treat the case vn_id truly is 0.
> 
> 	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
>
Yes, vn_id is always >= 0, i.e. non-negative, since it is an index.
We encode a vn_id as vn_id + 1; for example, if it is zero we write 1.

If the allocation did not go through a node path, or on an error, zero
is written. Decoding then gives: zero - 1 = -1, a negative value, i.e.
the decode_vn_id() function returns -1.
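
For illustration, a minimal sketch of that scheme (the helper names are
the ones used in the patch, but how the value is packed into va->flags
is simplified away here, so treat it as an assumption rather than the
real code):

  /* 0 is reserved for "no node", so a node index N is stored as N + 1. */
  static inline unsigned int encode_vn_id(unsigned int node_id)
  {
          return node_id + 1;
  }

  /* Returns the node index, or -1 if the VA did not come from a node. */
  static inline int decode_vn_id(unsigned int val)
  {
          return (int)val - 1;
  }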

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
                   ` (10 preceding siblings ...)
  2023-09-04 14:55 ` Uladzislau Rezki
@ 2023-09-06 20:04 ` Lorenzo Stoakes
  2023-09-07  9:15   ` Uladzislau Rezki
  11 siblings, 1 reply; 74+ messages in thread
From: Lorenzo Stoakes @ 2023-09-06 20:04 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Baoquan He, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Oleksiy Avramchenko

On Tue, Aug 29, 2023 at 10:11:33AM +0200, Uladzislau Rezki (Sony) wrote:
> Hello, folk!
>
> This is the v2, the series which tends to minimize the vmap
> lock contention. It is based on the tag: v6.5-rc6. Here you
> can find a documentation about it:

Will take a look at v3 when I get a chance, as I gather you're spinning
another version :)

Cheers!

>
> wget ftp://vps418301.ovh.net/incoming/Fix_a_vmalloc_lock_contention_in_SMP_env_v2.pdf
>
> even though it is a bit outdated(it follows v1), it still gives a
> good overview on the problem and how it can be solved. On demand
> and by request i can update it.
>
> The v1 is here: https://lore.kernel.org/linux-mm/ZIAqojPKjChJTssg@pc636/T/
>
> Delta v1 -> v2:
>   - open coded locking;
>   - switch to array of nodes instead of per-cpu definition;
>   - density is 2 cores per one node(not equal to number of CPUs);
>   - VAs first go back(free path) to an owner node and later to
>     a global heap if a block is fully freed, nid is saved in va->flags;
>   - add helpers to drain lazily-freed areas faster, if high pressure;
>   - picked al Reviewed-by.
>
> Test on AMD Ryzen Threadripper 3970X 32-Core Processor:
> sudo ./test_vmalloc.sh run_test_mask=127 nr_threads=64
>
> <v6.5-rc6 perf>
>   94.17%     0.90%  [kernel]    [k] _raw_spin_lock
>   93.27%    93.05%  [kernel]    [k] native_queued_spin_lock_slowpath
>   74.69%     0.25%  [kernel]    [k] __vmalloc_node_range
>   72.64%     0.01%  [kernel]    [k] __get_vm_area_node
>   72.04%     0.89%  [kernel]    [k] alloc_vmap_area
>   42.17%     0.00%  [kernel]    [k] vmalloc
>   32.53%     0.00%  [kernel]    [k] __vmalloc_node
>   24.91%     0.25%  [kernel]    [k] vfree
>   24.32%     0.01%  [kernel]    [k] remove_vm_area
>   22.63%     0.21%  [kernel]    [k] find_unlink_vmap_area
>   15.51%     0.00%  [unknown]   [k] 0xffffffffc09a74ac
>   14.35%     0.00%  [kernel]    [k] ret_from_fork_asm
>   14.35%     0.00%  [kernel]    [k] ret_from_fork
>   14.35%     0.00%  [kernel]    [k] kthread
> <v6.5-rc6 perf>
>    vs
> <v6.5-rc6+v2 perf>
>   74.32%     2.42%  [kernel]    [k] __vmalloc_node_range
>   69.58%     0.01%  [kernel]    [k] vmalloc
>   54.21%     1.17%  [kernel]    [k] __alloc_pages_bulk
>   48.13%    47.91%  [kernel]    [k] clear_page_orig
>   43.60%     0.01%  [unknown]   [k] 0xffffffffc082f16f
>   32.06%     0.00%  [kernel]    [k] ret_from_fork_asm
>   32.06%     0.00%  [kernel]    [k] ret_from_fork
>   32.06%     0.00%  [kernel]    [k] kthread
>   31.30%     0.00%  [unknown]   [k] 0xffffffffc082f889
>   22.98%     4.16%  [kernel]    [k] vfree
>   14.36%     0.28%  [kernel]    [k] __get_vm_area_node
>   13.43%     3.35%  [kernel]    [k] alloc_vmap_area
>   10.86%     0.04%  [kernel]    [k] remove_vm_area
>    8.89%     2.75%  [kernel]    [k] _raw_spin_lock
>    7.19%     0.00%  [unknown]   [k] 0xffffffffc082fba3
>    6.65%     1.37%  [kernel]    [k] free_unref_page
>    6.13%     6.11%  [kernel]    [k] native_queued_spin_lock_slowpath
> <v6.5-rc6+v2 perf>
>
> On smaller systems, for example, 8xCPU Hikey960 board the
> contention is not that high and is approximately ~16 percent.
>
> Uladzislau Rezki (Sony) (9):
>   mm: vmalloc: Add va_alloc() helper
>   mm: vmalloc: Rename adjust_va_to_fit_type() function
>   mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c
>   mm: vmalloc: Remove global vmap_area_root rb-tree
>   mm: vmalloc: Remove global purge_vmap_area_root rb-tree
>   mm: vmalloc: Offload free_vmap_area_lock lock
>   mm: vmalloc: Support multiple nodes in vread_iter
>   mm: vmalloc: Support multiple nodes in vmallocinfo
>   mm: vmalloc: Set nr_nodes/node_size based on CPU-cores
>
>  mm/vmalloc.c | 929 +++++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 683 insertions(+), 246 deletions(-)
>
> --
> 2.30.2
>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-09-06 19:16     ` Uladzislau Rezki
@ 2023-09-07  0:06       ` Baoquan He
  2023-09-07  9:33         ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-07  0:06 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/06/23 at 09:16pm, Uladzislau Rezki wrote:
> > >  static void free_vmap_area(struct vmap_area *va)
> > >  {
> > >  	struct vmap_node *vn = addr_to_node(va->va_start);
> > > +	int vn_id = decode_vn_id(va->flags);
> > >  
> > >  	/*
> > >  	 * Remove from the busy tree/list.
> > > @@ -1594,12 +1629,19 @@ static void free_vmap_area(struct vmap_area *va)
> > >  	unlink_va(va, &vn->busy.root);
> > >  	spin_unlock(&vn->busy.lock);
> > >  
> > > -	/*
> > > -	 * Insert/Merge it back to the free tree/list.
> > > -	 */
> > > -	spin_lock(&free_vmap_area_lock);
> > > -	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> > > -	spin_unlock(&free_vmap_area_lock);
> > > +	if (vn_id >= 0) {
> > 
> > In alloc_vmap_area(), the vn_id is encoded into va->flags. When
> > allocation failed, the vn_id = 0. Here should we change to check 'if
> > (vn_id > 0)' becasue the vn_id == 0 means no available vn_id encoded
> > into. And I do not get how we treat the case vn_id truly is 0.
> > 
> > 	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
> >
> Yes, vn_id always >= 0, so it is positive since it is an index.
> We encode a vn_id as vn_id + 1. For example if it is zero we write 1.
> 
> If not node allocation path or an error zero is written. Decoding
> is done as: zero - 1 = -1, so it is negative value, i.e. decode_vn_id()
> function returns -1.

Ah, I see it now, thanks. It would be helpful to add some explanation
above decode_vn_id() lest people misunderstand this like me?


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-08-29  8:11 ` [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
@ 2023-09-07  2:17     ` Baoquan He
  2023-09-07  2:17     ` Baoquan He
  2023-09-11  2:38   ` Baoquan He
  2 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-07  2:17 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony), k-hagio-ab, lijiang
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, kexec

Add Kazu and Lianbo to CC, and kexec mailing list

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Store allocated objects in a separate nodes. A va->va_start
> address is converted into a correct node where it should
> be placed and resided. An addr_to_node() function is used
> to do a proper address conversion to determine a node that
> contains a VA.
> 
> Such approach balances VAs across nodes as a result an access
> becomes scalable. Number of nodes in a system depends on number
> of CPUs divided by two. The density factor in this case is 1/2.
> 
> Please note:
> 
> 1. As of now allocated VAs are bound to a node-0. It means the
>    patch does not give any difference comparing with a current
>    behavior;
> 
> 2. The global vmap_area_lock, vmap_area_root are removed as there
>    is no need in it anymore. The vmap_area_list is still kept and
>    is _empty_. It is exported for a kexec only;

I haven't run a test, but accessing all nodes' busy trees to get the va
with the lowest address could severely impact kcore reading efficiency
on a system with many vmap nodes. People doing live debugging via
/proc/kcore will get a little surprise.

An empty vmap_area_list will break the makedumpfile utility, and the
Crash utility could be impacted too. I checked the makedumpfile code;
it relies on vmap_area_list to deduce the vmalloc_start value.
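
To illustrate the concern: with per-node busy trees, finding the lowest
busy va_start (which is effectively what the kcore path needs) means
walking every node instead of reading one global list head. A rough
sketch only; the node array and the list-head field name are
assumptions, not the patch's actual code:

  unsigned long lowest = ULONG_MAX;
  struct vmap_area *va;
  int i;

  for (i = 0; i < nr_vmap_nodes; i++) {
          struct vmap_node *vn = &vmap_nodes[i];

          spin_lock(&vn->busy.lock);
          va = list_first_entry_or_null(&vn->busy.head,
                                        struct vmap_area, list);
          if (va && va->va_start < lowest)
                  lowest = va->va_start;
          spin_unlock(&vn->busy.lock);
  }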

> 
> 3. The vmallocinfo and vread() have to be reworked to be able to
>    handle multiple nodes.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 0/9] Mitigate a vmap lock contention v2
  2023-09-06 20:04 ` Lorenzo Stoakes
@ 2023-09-07  9:15   ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-07  9:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Baoquan He, Christoph Hellwig,
	Matthew Wilcox, Liam R . Howlett, Dave Chinner,
	Paul E . McKenney, Joel Fernandes, Oleksiy Avramchenko

On Wed, Sep 06, 2023 at 09:04:26PM +0100, Lorenzo Stoakes wrote:
> On Tue, Aug 29, 2023 at 10:11:33AM +0200, Uladzislau Rezki (Sony) wrote:
> > Hello, folk!
> >
> > This is the v2, the series which tends to minimize the vmap
> > lock contention. It is based on the tag: v6.5-rc6. Here you
> > can find a documentation about it:
> 
> Will take a look when I get a chance at v3 as I gather you're spinning
> another version :)
> 
Correct. I will do that :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-09-07  0:06       ` Baoquan He
@ 2023-09-07  9:33         ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-07  9:33 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Thu, Sep 07, 2023 at 08:06:09AM +0800, Baoquan He wrote:
> On 09/06/23 at 09:16pm, Uladzislau Rezki wrote:
> > > >  static void free_vmap_area(struct vmap_area *va)
> > > >  {
> > > >  	struct vmap_node *vn = addr_to_node(va->va_start);
> > > > +	int vn_id = decode_vn_id(va->flags);
> > > >  
> > > >  	/*
> > > >  	 * Remove from the busy tree/list.
> > > > @@ -1594,12 +1629,19 @@ static void free_vmap_area(struct vmap_area *va)
> > > >  	unlink_va(va, &vn->busy.root);
> > > >  	spin_unlock(&vn->busy.lock);
> > > >  
> > > > -	/*
> > > > -	 * Insert/Merge it back to the free tree/list.
> > > > -	 */
> > > > -	spin_lock(&free_vmap_area_lock);
> > > > -	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> > > > -	spin_unlock(&free_vmap_area_lock);
> > > > +	if (vn_id >= 0) {
> > > 
> > > In alloc_vmap_area(), the vn_id is encoded into va->flags. When
> > > allocation failed, the vn_id = 0. Here should we change to check 'if
> > > (vn_id > 0)' becasue the vn_id == 0 means no available vn_id encoded
> > > into. And I do not get how we treat the case vn_id truly is 0.
> > > 
> > > 	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
> > >
> > Yes, vn_id always >= 0, so it is positive since it is an index.
> > We encode a vn_id as vn_id + 1. For example if it is zero we write 1.
> > 
> > If not node allocation path or an error zero is written. Decoding
> > is done as: zero - 1 = -1, so it is negative value, i.e. decode_vn_id()
> > function returns -1.
> 
> Ah, I see it now, thanks. It would be helpful to add some explanation
> above decode_vn_id() lest people misunderstand this like me?
> 
I got that feeling also. This makes sense, so I will add a comment!
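
For example, such a comment could read something along these lines (the
wording here is only a suggestion, not the final text):

  /*
   * va->flags stores the owning node index biased by one: a node index
   * N is encoded as N + 1, and 0 means the VA was not allocated through
   * a per-node path (or the allocation failed). decode_vn_id()
   * therefore returns -1 for such VAs.
   */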

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-07  2:17     ` Baoquan He
@ 2023-09-07  9:38       ` Baoquan He
  -1 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-07  9:38 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony), k-hagio-ab, lijiang
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, kexec

On 09/07/23 at 10:17am, Baoquan He wrote:
> Add Kazu and Lianbo to CC, and kexec mailing list
> 
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Store allocated objects in a separate nodes. A va->va_start
> > address is converted into a correct node where it should
> > be placed and resided. An addr_to_node() function is used
> > to do a proper address conversion to determine a node that
> > contains a VA.
> > 
> > Such approach balances VAs across nodes as a result an access
> > becomes scalable. Number of nodes in a system depends on number
> > of CPUs divided by two. The density factor in this case is 1/2.
> > 
> > Please note:
> > 
> > 1. As of now allocated VAs are bound to a node-0. It means the
> >    patch does not give any difference comparing with a current
> >    behavior;
> > 
> > 2. The global vmap_area_lock, vmap_area_root are removed as there
> >    is no need in it anymore. The vmap_area_list is still kept and
> >    is _empty_. It is exported for a kexec only;
> 
> I haven't taken a test, while accessing all nodes' busy tree to get
> va of the lowest address could severely impact kcore reading efficiency
> on system with many vmap nodes. People doing live debugging via
> /proc/kcore will get a little surprise.
> 
> Empty vmap_area_list will break makedumpfile utility, Crash utility
> could be impactd too. I checked makedumpfile code, it relys on
> vmap_area_list to deduce the vmalloc_start value. 

Except for the empty vmap_area_list, this patch looks good to me.

We may need to think of another way to export the vmalloc_start value,
or deduce it in the makedumpfile/Crash utilities, and then remove the
useless vmap_area_list. I am not sure if we should remove vmap_area_list
in this patch because the empty value will cause breakage anyway.
Otherwise,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> > 
> > 3. The vmallocinfo and vread() have to be reworked to be able to
> >    handle multiple nodes.
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-07  2:17     ` Baoquan He
@ 2023-09-07  9:39       ` Uladzislau Rezki
  -1 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-07  9:39 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	k-hagio-ab, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> Add Kazu and Lianbo to CC, and kexec mailing list
> 
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Store allocated objects in a separate nodes. A va->va_start
> > address is converted into a correct node where it should
> > be placed and resided. An addr_to_node() function is used
> > to do a proper address conversion to determine a node that
> > contains a VA.
> > 
> > Such approach balances VAs across nodes as a result an access
> > becomes scalable. Number of nodes in a system depends on number
> > of CPUs divided by two. The density factor in this case is 1/2.
> > 
> > Please note:
> > 
> > 1. As of now allocated VAs are bound to a node-0. It means the
> >    patch does not give any difference comparing with a current
> >    behavior;
> > 
> > 2. The global vmap_area_lock, vmap_area_root are removed as there
> >    is no need in it anymore. The vmap_area_list is still kept and
> >    is _empty_. It is exported for a kexec only;
> 
> I haven't taken a test, while accessing all nodes' busy tree to get
> va of the lowest address could severely impact kcore reading efficiency
> on system with many vmap nodes. People doing live debugging via
> /proc/kcore will get a little surprise.
> 
>
> Empty vmap_area_list will break makedumpfile utility, Crash utility
> could be impactd too. I checked makedumpfile code, it relys on
> vmap_area_list to deduce the vmalloc_start value. 
>
That part is still left over, and I hope to fix it in v3. The problem
here is that we cannot give outside code an opportunity to access vmap
internals. That is just not correct, i.e. you are not allowed to access
the list directly.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-07  9:38       ` Baoquan He
@ 2023-09-07  9:40         ` Uladzislau Rezki
  -1 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-07  9:40 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	k-hagio-ab, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On Thu, Sep 07, 2023 at 05:38:07PM +0800, Baoquan He wrote:
> On 09/07/23 at 10:17am, Baoquan He wrote:
> > Add Kazu and Lianbo to CC, and kexec mailing list
> > 
> > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > Store allocated objects in a separate nodes. A va->va_start
> > > address is converted into a correct node where it should
> > > be placed and resided. An addr_to_node() function is used
> > > to do a proper address conversion to determine a node that
> > > contains a VA.
> > > 
> > > Such approach balances VAs across nodes as a result an access
> > > becomes scalable. Number of nodes in a system depends on number
> > > of CPUs divided by two. The density factor in this case is 1/2.
> > > 
> > > Please note:
> > > 
> > > 1. As of now allocated VAs are bound to a node-0. It means the
> > >    patch does not give any difference comparing with a current
> > >    behavior;
> > > 
> > > 2. The global vmap_area_lock, vmap_area_root are removed as there
> > >    is no need in it anymore. The vmap_area_list is still kept and
> > >    is _empty_. It is exported for a kexec only;
> > 
> > I haven't taken a test, while accessing all nodes' busy tree to get
> > va of the lowest address could severely impact kcore reading efficiency
> > on system with many vmap nodes. People doing live debugging via
> > /proc/kcore will get a little surprise.
> > 
> > Empty vmap_area_list will break makedumpfile utility, Crash utility
> > could be impactd too. I checked makedumpfile code, it relys on
> > vmap_area_list to deduce the vmalloc_start value. 
> 
> Except of the empty vmap_area_list, this patch looks good to me.
> 
> We may need think of another way to export the vmalloc_start value or
> deduce it in makedumpfile/Crash utility. And then remove the useless
> vmap_area_list. I am not sure if we should remove vmap_area_list in this
> patch because the empty value will cause breakage anyway. Otherwise,
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Thanks for the review!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-07  9:39       ` Uladzislau Rezki
@ 2023-09-07  9:58         ` Baoquan He
  -1 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-07  9:58 UTC (permalink / raw)
  To: Uladzislau Rezki, k-hagio-ab
  Cc: lijiang, linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, kexec

On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> > Add Kazu and Lianbo to CC, and kexec mailing list
> > 
> > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > Store allocated objects in a separate nodes. A va->va_start
> > > address is converted into a correct node where it should
> > > be placed and resided. An addr_to_node() function is used
> > > to do a proper address conversion to determine a node that
> > > contains a VA.
> > > 
> > > Such approach balances VAs across nodes as a result an access
> > > becomes scalable. Number of nodes in a system depends on number
> > > of CPUs divided by two. The density factor in this case is 1/2.
> > > 
> > > Please note:
> > > 
> > > 1. As of now allocated VAs are bound to a node-0. It means the
> > >    patch does not give any difference comparing with a current
> > >    behavior;
> > > 
> > > 2. The global vmap_area_lock, vmap_area_root are removed as there
> > >    is no need in it anymore. The vmap_area_list is still kept and
> > >    is _empty_. It is exported for a kexec only;
> > 
> > I haven't taken a test, while accessing all nodes' busy tree to get
> > va of the lowest address could severely impact kcore reading efficiency
> > on system with many vmap nodes. People doing live debugging via
> > /proc/kcore will get a little surprise.
> > 
> >
> > Empty vmap_area_list will break makedumpfile utility, Crash utility
> > could be impactd too. I checked makedumpfile code, it relys on
> > vmap_area_list to deduce the vmalloc_start value. 
> >
> It is left part and i hope i fix it in v3. The problem here is
> we can not give an opportunity to access to vmap internals from
> outside. This is just not correct, i.e. you are not allowed to
> access the list directly.

Right. Thanks for the fix in v3, that is a relief for makedumpfile and
crash.

Hi Kazu,

Meanwhile, I am wondering whether we should evaluate the necessity of
vmap_area_list in makedumpfile and Crash. In makedumpfile, we only use
vmap_area_list to deduce VMALLOC_START, so perhaps we can export
VMALLOC_START directly. Admittedly, the lowest va->va_start in
vmap_area_list is a tighter low boundary of the vmalloc area and can
reduce unnecessary scanning below the lowest va. I am not sure if this
is the reason people decided to export vmap_area_list.
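
Just to spell the idea out in kernel terms, the current derivation boils
down to something like the sketch below (not makedumpfile's actual code,
which of course reads the dump image rather than live structures):

  /* The first (lowest) busy VA on vmap_area_list bounds the vmalloc area. */
  static unsigned long deduce_vmalloc_start(void)
  {
          struct vmap_area *va;

          va = list_first_entry_or_null(&vmap_area_list,
                                        struct vmap_area, list);
          return va ? va->va_start : VMALLOC_START;
  }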

Thanks
Baoquan


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-07  9:58         ` Baoquan He
@ 2023-09-08  1:51           ` HAGIO KAZUHITO(萩尾 一仁)
  -1 siblings, 0 replies; 74+ messages in thread
From: HAGIO KAZUHITO(萩尾 一仁) @ 2023-09-08  1:51 UTC (permalink / raw)
  To: Baoquan He, Uladzislau Rezki
  Cc: lijiang, linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko, kexec

On 2023/09/07 18:58, Baoquan He wrote:
> On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
>> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
>>> Add Kazu and Lianbo to CC, and kexec mailing list
>>>
>>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
>>>> Store allocated objects in a separate nodes. A va->va_start
>>>> address is converted into a correct node where it should
>>>> be placed and resided. An addr_to_node() function is used
>>>> to do a proper address conversion to determine a node that
>>>> contains a VA.
>>>>
>>>> Such approach balances VAs across nodes as a result an access
>>>> becomes scalable. Number of nodes in a system depends on number
>>>> of CPUs divided by two. The density factor in this case is 1/2.
>>>>
>>>> Please note:
>>>>
>>>> 1. As of now allocated VAs are bound to a node-0. It means the
>>>>     patch does not give any difference comparing with a current
>>>>     behavior;
>>>>
>>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
>>>>     is no need in it anymore. The vmap_area_list is still kept and
>>>>     is _empty_. It is exported for a kexec only;
>>>
>>> I haven't taken a test, while accessing all nodes' busy tree to get
>>> va of the lowest address could severely impact kcore reading efficiency
>>> on system with many vmap nodes. People doing live debugging via
>>> /proc/kcore will get a little surprise.
>>>
>>>
>>> Empty vmap_area_list will break makedumpfile utility, Crash utility
>>> could be impactd too. I checked makedumpfile code, it relys on
>>> vmap_area_list to deduce the vmalloc_start value.
>>>
>> It is left part and i hope i fix it in v3. The problem here is
>> we can not give an opportunity to access to vmap internals from
>> outside. This is just not correct, i.e. you are not allowed to
>> access the list directly.
> 
> Right. Thanks for the fix in v3, that is a relief of makedumpfile and
> crash.
> 
> Hi Kazu,
> 
> Meanwhile, I am thinking if we should evaluate the necessity of
> vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
> vmap_area_list to deduce VMALLOC_START. Wondering if we can export
> VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
> is a tighter low boundary of vmalloc area and can reduce unnecessary
> scanning below the lowest va. Not sure if this is the reason people
> decided to export vmap_area_list.

The kernel commit acd99dbf5402 introduced the original vmlist entry to 
vmcoreinfo, but there is no information about why it did not export 
VMALLOC_START directly.

If VMALLOC_START is exported directly to vmcoreinfo, I think it would be 
enough for makedumpfile.

Thanks,
Kazu

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-08  1:51           ` HAGIO KAZUHITO(萩尾 一仁)
@ 2023-09-08  4:43             ` Baoquan He
  -1 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-08  4:43 UTC (permalink / raw)
  To: HAGIO KAZUHITO(萩尾 一仁)
  Cc: Uladzislau Rezki, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On 09/08/23 at 01:51am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> On 2023/09/07 18:58, Baoquan He wrote:
> > On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
> >> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> >>> Add Kazu and Lianbo to CC, and kexec mailing list
> >>>
> >>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> >>>> Store allocated objects in a separate nodes. A va->va_start
> >>>> address is converted into a correct node where it should
> >>>> be placed and resided. An addr_to_node() function is used
> >>>> to do a proper address conversion to determine a node that
> >>>> contains a VA.
> >>>>
> >>>> Such approach balances VAs across nodes as a result an access
> >>>> becomes scalable. Number of nodes in a system depends on number
> >>>> of CPUs divided by two. The density factor in this case is 1/2.
> >>>>
> >>>> Please note:
> >>>>
> >>>> 1. As of now allocated VAs are bound to a node-0. It means the
> >>>>     patch does not give any difference comparing with a current
> >>>>     behavior;
> >>>>
> >>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
> >>>>     is no need in it anymore. The vmap_area_list is still kept and
> >>>>     is _empty_. It is exported for a kexec only;
> >>>
> >>> I haven't taken a test, while accessing all nodes' busy tree to get
> >>> va of the lowest address could severely impact kcore reading efficiency
> >>> on system with many vmap nodes. People doing live debugging via
> >>> /proc/kcore will get a little surprise.
> >>>
> >>>
> >>> Empty vmap_area_list will break makedumpfile utility, Crash utility
> >>> could be impactd too. I checked makedumpfile code, it relys on
> >>> vmap_area_list to deduce the vmalloc_start value.
> >>>
> >> It is left part and i hope i fix it in v3. The problem here is
> >> we can not give an opportunity to access to vmap internals from
> >> outside. This is just not correct, i.e. you are not allowed to
> >> access the list directly.
> > 
> > Right. Thanks for the fix in v3, that is a relief of makedumpfile and
> > crash.
> > 
> > Hi Kazu,
> > 
> > Meanwhile, I am thinking if we should evaluate the necessity of
> > vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
> > vmap_area_list to deduce VMALLOC_START. Wondering if we can export
> > VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
> > is a tighter low boundary of vmalloc area and can reduce unnecessary
> > scanning below the lowest va. Not sure if this is the reason people
> > decided to export vmap_area_list.
> 
> The kernel commit acd99dbf5402 introduced the original vmlist entry to 
> vmcoreinfo, but there is no information about why it did not export 
> VMALLOC_START directly.
> 
> If VMALLOC_START is exported directly to vmcoreinfo, I think it would be 
> enough for makedumpfile.

Thanks for the confirmation, Kazu.

Then the draft patch below should be enough to export VMALLOC_START
instead and remove vmap_area_list. Just to get the base address of the
vmalloc area, constructing a vmap_area_list from the multiple busy trees
does not seem worth it.

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 599e8d3bcbc3..3cb1ea09ff26 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
 the kernel start address. Used to convert a virtual address from the
 direct kernel map to a physical address.
 
-vmap_area_list
---------------
+VMALLOC_START
+-------------
 
-Stores the virtual area list. makedumpfile gets the vmalloc start value
-from this variable and its value is necessary for vmalloc translation.
+Stores the base address of vmalloc area. makedumpfile gets this value and
+its value is necessary for vmalloc translation.
 
 mem_map
 -------
diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index 66cde752cd74..2a24199a9b81 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
 	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
 	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
 	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
-	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
index 55f1d7856b54..5c39cedd2c5c 100644
--- a/arch/riscv/kernel/crash_core.c
+++ b/arch/riscv/kernel/crash_core.c
@@ -9,7 +9,6 @@ void arch_crash_save_vmcoreinfo(void)
 	VMCOREINFO_NUMBER(phys_ram_base);
 
 	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
-	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..91810b4e9510 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
 /*
  *	Internals.  Don't use..
  */
-extern struct list_head vmap_area_list;
 extern __init void vm_area_add_early(struct vm_struct *vm);
 extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
 
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 03a7932cde0a..91af87930770 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
 #endif
 	VMCOREINFO_SYMBOL(_stext);
-	VMCOREINFO_SYMBOL(vmap_area_list);
+	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 
 #ifndef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(mem_map);
diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
index b4cac76ea5e9..8a689b4ff4f9 100644
--- a/kernel/kallsyms_selftest.c
+++ b/kernel/kallsyms_selftest.c
@@ -89,7 +89,6 @@ static struct test_item test_items[] = {
 	ITEM_DATA(kallsyms_test_var_data_static),
 	ITEM_DATA(kallsyms_test_var_bss),
 	ITEM_DATA(kallsyms_test_var_data),
-	ITEM_DATA(vmap_area_list),
 #endif
 };
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 7f9e9e5a0e12..8c6686176ebd 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL(follow_pfn);
 
-LIST_HEAD(vmap_area_list);
-
 void vfree(const void *addr)
 {
 	kfree(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 50d8239b82df..0a02633a9566 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -729,8 +729,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 
 
 static DEFINE_SPINLOCK(free_vmap_area_lock);
-/* Export for kexec only */
-LIST_HEAD(vmap_area_list);
+
 static bool vmap_initialized __read_mostly;
 
 /*
-- 
2.41.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-08  4:43             ` Baoquan He
@ 2023-09-08  5:01               ` HAGIO KAZUHITO(萩尾 一仁)
  -1 siblings, 0 replies; 74+ messages in thread
From: HAGIO KAZUHITO(萩尾 一仁) @ 2023-09-08  5:01 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On 2023/09/08 13:43, Baoquan He wrote:
> On 09/08/23 at 01:51am, HAGIO KAZUHITO(萩尾 一仁) wrote:
>> On 2023/09/07 18:58, Baoquan He wrote:
>>> On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
>>>> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
>>>>> Add Kazu and Lianbo to CC, and kexec mailing list
>>>>>
>>>>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
>>>>>> Store allocated objects in a separate nodes. A va->va_start
>>>>>> address is converted into a correct node where it should
>>>>>> be placed and resided. An addr_to_node() function is used
>>>>>> to do a proper address conversion to determine a node that
>>>>>> contains a VA.
>>>>>>
>>>>>> Such approach balances VAs across nodes as a result an access
>>>>>> becomes scalable. Number of nodes in a system depends on number
>>>>>> of CPUs divided by two. The density factor in this case is 1/2.
>>>>>>
>>>>>> Please note:
>>>>>>
>>>>>> 1. As of now allocated VAs are bound to a node-0. It means the
>>>>>>      patch does not give any difference comparing with a current
>>>>>>      behavior;
>>>>>>
>>>>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
>>>>>>      is no need in it anymore. The vmap_area_list is still kept and
>>>>>>      is _empty_. It is exported for a kexec only;
>>>>>
>>>>> I haven't taken a test, while accessing all nodes' busy tree to get
>>>>> va of the lowest address could severely impact kcore reading efficiency
>>>>> on system with many vmap nodes. People doing live debugging via
>>>>> /proc/kcore will get a little surprise.
>>>>>
>>>>>
>>>>> Empty vmap_area_list will break makedumpfile utility, Crash utility
>>>>> could be impactd too. I checked makedumpfile code, it relys on
>>>>> vmap_area_list to deduce the vmalloc_start value.
>>>>>
>>>> It is left part and i hope i fix it in v3. The problem here is
>>>> we can not give an opportunity to access to vmap internals from
>>>> outside. This is just not correct, i.e. you are not allowed to
>>>> access the list directly.
>>>
>>> Right. Thanks for the fix in v3, that is a relief of makedumpfile and
>>> crash.
>>>
>>> Hi Kazu,
>>>
>>> Meanwhile, I am thinking if we should evaluate the necessity of
>>> vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
>>> vmap_area_list to deduce VMALLOC_START. Wondering if we can export
>>> VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
>>> is a tighter low boundary of vmalloc area and can reduce unnecessary
>>> scanning below the lowest va. Not sure if this is the reason people
>>> decided to export vmap_area_list.
>>
>> The kernel commit acd99dbf5402 introduced the original vmlist entry to
>> vmcoreinfo, but there is no information about why it did not export
>> VMALLOC_START directly.
>>
>> If VMALLOC_START is exported directly to vmcoreinfo, I think it would be
>> enough for makedumpfile.
> 
> Thanks for confirmation, Kazu.
> 
> Then, below draft patch should be enough to export VMALLOC_START
> instead, and remove vmap_area_list. 

Also, the following entries can be removed.

         VMCOREINFO_OFFSET(vmap_area, va_start);
         VMCOREINFO_OFFSET(vmap_area, list);

Thanks,
Kazu
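
Those two entries only put plain "OFFSET(vmap_area.va_start)=..." and
"OFFSET(vmap_area.list)=..." lines into the vmcoreinfo note, which a dump
tool needs solely to walk vmap_area_list and read each va_start in the
memory image. Below is a minimal, self-contained sketch of what such an
entry records; the struct layout is a simplified stand-in, not the
kernel's real vmap_area:

#include <stdio.h>
#include <stddef.h>

/*
 * Simplified stand-in for the kernel's struct vmap_area; the real layout
 * differs.  This only illustrates what an OFFSET() vmcoreinfo entry is.
 */
struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;
	struct list_head { void *next, *prev; } list;
};

int main(void)
{
	/*
	 * The kernel emits lines of this form into the vmcoreinfo note;
	 * a dump tool used the offsets to walk vmap_area_list.  With the
	 * list gone, the entries have no consumer left.
	 */
	printf("OFFSET(vmap_area.va_start)=%zu\n",
	       offsetof(struct vmap_area, va_start));
	printf("OFFSET(vmap_area.list)=%zu\n",
	       offsetof(struct vmap_area, list));
	return 0;
}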

> In order to get the base address of
> vmalloc area, constructing a vmap_area_list from multiple busy-tree
> seems not worth.
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index 599e8d3bcbc3..3cb1ea09ff26 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
>   the kernel start address. Used to convert a virtual address from the
>   direct kernel map to a physical address.
>   
> -vmap_area_list
> ---------------
> +VMALLOC_START
> +-------------
>   
> -Stores the virtual area list. makedumpfile gets the vmalloc start value
> -from this variable and its value is necessary for vmalloc translation.
> +Stores the base address of vmalloc area. makedumpfile gets this value and
> +its value is necessary for vmalloc translation.
>   
>   mem_map
>   -------
> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index 66cde752cd74..2a24199a9b81 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
>   	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>   	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
>   	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
> -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>   	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
>   	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
>   	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
> index 55f1d7856b54..5c39cedd2c5c 100644
> --- a/arch/riscv/kernel/crash_core.c
> +++ b/arch/riscv/kernel/crash_core.c
> @@ -9,7 +9,6 @@ void arch_crash_save_vmcoreinfo(void)
>   	VMCOREINFO_NUMBER(phys_ram_base);
>   
>   	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
> -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>   	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
>   	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
>   	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..91810b4e9510 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
>   /*
>    *	Internals.  Don't use..
>    */
> -extern struct list_head vmap_area_list;
>   extern __init void vm_area_add_early(struct vm_struct *vm);
>   extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
>   
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 03a7932cde0a..91af87930770 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>   	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
>   #endif
>   	VMCOREINFO_SYMBOL(_stext);
> -	VMCOREINFO_SYMBOL(vmap_area_list);
> +	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>   
>   #ifndef CONFIG_NUMA
>   	VMCOREINFO_SYMBOL(mem_map);
> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> index b4cac76ea5e9..8a689b4ff4f9 100644
> --- a/kernel/kallsyms_selftest.c
> +++ b/kernel/kallsyms_selftest.c
> @@ -89,7 +89,6 @@ static struct test_item test_items[] = {
>   	ITEM_DATA(kallsyms_test_var_data_static),
>   	ITEM_DATA(kallsyms_test_var_bss),
>   	ITEM_DATA(kallsyms_test_var_data),
> -	ITEM_DATA(vmap_area_list),
>   #endif
>   };
>   
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 7f9e9e5a0e12..8c6686176ebd 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>   }
>   EXPORT_SYMBOL(follow_pfn);
>   
> -LIST_HEAD(vmap_area_list);
> -
>   void vfree(const void *addr)
>   {
>   	kfree(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 50d8239b82df..0a02633a9566 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -729,8 +729,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
>   
>   
>   static DEFINE_SPINLOCK(free_vmap_area_lock);
> -/* Export for kexec only */
> -LIST_HEAD(vmap_area_list);
> +
>   static bool vmap_initialized __read_mostly;
>   
>   /*

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-08  5:01               ` HAGIO KAZUHITO(萩尾 一仁)
@ 2023-09-08  6:44                 ` Baoquan He
  -1 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-08  6:44 UTC (permalink / raw)
  To: HAGIO KAZUHITO(萩尾 一仁)
  Cc: Uladzislau Rezki, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On 09/08/23 at 05:01am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> On 2023/09/08 13:43, Baoquan He wrote:
> > On 09/08/23 at 01:51am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> >> On 2023/09/07 18:58, Baoquan He wrote:
> >>> On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
> >>>> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> >>>>> Add Kazu and Lianbo to CC, and kexec mailing list
> >>>>>
> >>>>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> >>>>>> Store allocated objects in a separate nodes. A va->va_start
> >>>>>> address is converted into a correct node where it should
> >>>>>> be placed and resided. An addr_to_node() function is used
> >>>>>> to do a proper address conversion to determine a node that
> >>>>>> contains a VA.
> >>>>>>
> >>>>>> Such approach balances VAs across nodes as a result an access
> >>>>>> becomes scalable. Number of nodes in a system depends on number
> >>>>>> of CPUs divided by two. The density factor in this case is 1/2.
> >>>>>>
> >>>>>> Please note:
> >>>>>>
> >>>>>> 1. As of now allocated VAs are bound to a node-0. It means the
> >>>>>>      patch does not give any difference comparing with a current
> >>>>>>      behavior;
> >>>>>>
> >>>>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
> >>>>>>      is no need in it anymore. The vmap_area_list is still kept and
> >>>>>>      is _empty_. It is exported for a kexec only;
> >>>>>
> >>>>> I haven't taken a test, while accessing all nodes' busy tree to get
> >>>>> va of the lowest address could severely impact kcore reading efficiency
> >>>>> on system with many vmap nodes. People doing live debugging via
> >>>>> /proc/kcore will get a little surprise.
> >>>>>
> >>>>>
> >>>>> Empty vmap_area_list will break makedumpfile utility, Crash utility
> >>>>> could be impactd too. I checked makedumpfile code, it relys on
> >>>>> vmap_area_list to deduce the vmalloc_start value.
> >>>>>
> >>>> It is left part and i hope i fix it in v3. The problem here is
> >>>> we can not give an opportunity to access to vmap internals from
> >>>> outside. This is just not correct, i.e. you are not allowed to
> >>>> access the list directly.
> >>>
> >>> Right. Thanks for the fix in v3, that is a relief of makedumpfile and
> >>> crash.
> >>>
> >>> Hi Kazu,
> >>>
> >>> Meanwhile, I am thinking if we should evaluate the necessity of
> >>> vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
> >>> vmap_area_list to deduce VMALLOC_START. Wondering if we can export
> >>> VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
> >>> is a tighter low boundary of vmalloc area and can reduce unnecessary
> >>> scanning below the lowest va. Not sure if this is the reason people
> >>> decided to export vmap_area_list.
> >>
> >> The kernel commit acd99dbf5402 introduced the original vmlist entry to
> >> vmcoreinfo, but there is no information about why it did not export
> >> VMALLOC_START directly.
> >>
> >> If VMALLOC_START is exported directly to vmcoreinfo, I think it would be
> >> enough for makedumpfile.
> > 
> > Thanks for confirmation, Kazu.
> > 
> > Then, below draft patch should be enough to export VMALLOC_START
> > instead, and remove vmap_area_list. 
> 
> also the following entries can be removed.
> 
>          VMCOREINFO_OFFSET(vmap_area, va_start);
>          VMCOREINFO_OFFSET(vmap_area, list);

Right, they are useless now. I have updated the patch below to remove them.

From a867fada34fd9e96528fcc5e72ae50b3b5685015 Mon Sep 17 00:00:00 2001
From: Baoquan He <bhe@redhat.com>
Date: Fri, 8 Sep 2023 11:53:22 +0800
Subject: [PATCH] mm/vmalloc: remove vmap_area_list
Content-type: text/plain

Earlier, vmap_area_list was exported to vmcoreinfo so that makedumpfile
could get the base address of the vmalloc area. Now vmap_area_list is
empty, so export VMALLOC_START to vmcoreinfo instead, and remove
vmap_area_list.

Signed-off-by: Baoquan He <bhe@redhat.com>
---
 Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
 arch/arm64/kernel/crash_core.c                 | 1 -
 arch/riscv/kernel/crash_core.c                 | 1 -
 include/linux/vmalloc.h                        | 1 -
 kernel/crash_core.c                            | 4 +---
 kernel/kallsyms_selftest.c                     | 1 -
 mm/nommu.c                                     | 2 --
 mm/vmalloc.c                                   | 3 +--
 8 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index 599e8d3bcbc3..c11bd4b1ceb1 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
 the kernel start address. Used to convert a virtual address from the
 direct kernel map to a physical address.
 
-vmap_area_list
---------------
+VMALLOC_START
+-------------
 
-Stores the virtual area list. makedumpfile gets the vmalloc start value
-from this variable and its value is necessary for vmalloc translation.
+Stores the base address of the vmalloc area. makedumpfile gets this value
+since it is necessary for vmalloc translation.
 
 mem_map
 -------
diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
index 66cde752cd74..2a24199a9b81 100644
--- a/arch/arm64/kernel/crash_core.c
+++ b/arch/arm64/kernel/crash_core.c
@@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
 	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
 	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
 	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
-	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
index 55f1d7856b54..5c39cedd2c5c 100644
--- a/arch/riscv/kernel/crash_core.c
+++ b/arch/riscv/kernel/crash_core.c
@@ -9,7 +9,6 @@ void arch_crash_save_vmcoreinfo(void)
 	VMCOREINFO_NUMBER(phys_ram_base);
 
 	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
-	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
 	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index c720be70c8dd..91810b4e9510 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
 /*
  *	Internals.  Don't use..
  */
-extern struct list_head vmap_area_list;
 extern __init void vm_area_add_early(struct vm_struct *vm);
 extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
 
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 03a7932cde0a..a9faaf7e5f7d 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
 #endif
 	VMCOREINFO_SYMBOL(_stext);
-	VMCOREINFO_SYMBOL(vmap_area_list);
+	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
 
 #ifndef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(mem_map);
@@ -658,8 +658,6 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_OFFSET(free_area, free_list);
 	VMCOREINFO_OFFSET(list_head, next);
 	VMCOREINFO_OFFSET(list_head, prev);
-	VMCOREINFO_OFFSET(vmap_area, va_start);
-	VMCOREINFO_OFFSET(vmap_area, list);
 	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
 	log_buf_vmcoreinfo_setup();
 	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
index b4cac76ea5e9..8a689b4ff4f9 100644
--- a/kernel/kallsyms_selftest.c
+++ b/kernel/kallsyms_selftest.c
@@ -89,7 +89,6 @@ static struct test_item test_items[] = {
 	ITEM_DATA(kallsyms_test_var_data_static),
 	ITEM_DATA(kallsyms_test_var_bss),
 	ITEM_DATA(kallsyms_test_var_data),
-	ITEM_DATA(vmap_area_list),
 #endif
 };
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 7f9e9e5a0e12..8c6686176ebd 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 }
 EXPORT_SYMBOL(follow_pfn);
 
-LIST_HEAD(vmap_area_list);
-
 void vfree(const void *addr)
 {
 	kfree(addr);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 50d8239b82df..0a02633a9566 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -729,8 +729,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
 
 
 static DEFINE_SPINLOCK(free_vmap_area_lock);
-/* Export for kexec only */
-LIST_HEAD(vmap_area_list);
+
 static bool vmap_initialized __read_mostly;
 
 /*
-- 
2.41.0
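
With the patch above, the vmalloc start is available only as the
NUMBER(VMALLOC_START) line in the vmcoreinfo note, so a consumer parses
that value instead of resolving the vmap_area_list symbol. A rough,
self-contained sketch of that consumer side, assuming the tool already
has the note as text; the helper name and sample address are invented and
are not makedumpfile code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: pull VMALLOC_START out of the vmcoreinfo text. */
static int vmcoreinfo_read_vmalloc_start(const char *info,
					 unsigned long long *val)
{
	static const char key[] = "NUMBER(VMALLOC_START)=";
	const char *p = strstr(info, key);

	if (!p)
		return -1;

	*val = strtoull(p + strlen(key), NULL, 16);
	return 0;
}

int main(void)
{
	/* Sample note content; the address is an arbitrary example. */
	const char *vmcoreinfo = "NUMBER(VMALLOC_START)=0xffffc90000000000\n";
	unsigned long long vmalloc_start;

	if (vmcoreinfo_read_vmalloc_start(vmcoreinfo, &vmalloc_start) == 0)
		printf("vmalloc area starts at 0x%llx\n", vmalloc_start);

	return 0;
}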


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-08  6:44                 ` Baoquan He
@ 2023-09-08 11:25                   ` Uladzislau Rezki
  -1 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-08 11:25 UTC (permalink / raw)
  To: Baoquan He, k-hagio-ab
  Cc: HAGIO KAZUHITO(萩尾 一仁),
	Uladzislau Rezki, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On Fri, Sep 08, 2023 at 02:44:56PM +0800, Baoquan He wrote:
> On 09/08/23 at 05:01am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > On 2023/09/08 13:43, Baoquan He wrote:
> > > On 09/08/23 at 01:51am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > >> On 2023/09/07 18:58, Baoquan He wrote:
> > >>> On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
> > >>>> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> > >>>>> Add Kazu and Lianbo to CC, and kexec mailing list
> > >>>>>
> > >>>>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > >>>>>> Store allocated objects in a separate nodes. A va->va_start
> > >>>>>> address is converted into a correct node where it should
> > >>>>>> be placed and resided. An addr_to_node() function is used
> > >>>>>> to do a proper address conversion to determine a node that
> > >>>>>> contains a VA.
> > >>>>>>
> > >>>>>> Such approach balances VAs across nodes as a result an access
> > >>>>>> becomes scalable. Number of nodes in a system depends on number
> > >>>>>> of CPUs divided by two. The density factor in this case is 1/2.
> > >>>>>>
> > >>>>>> Please note:
> > >>>>>>
> > >>>>>> 1. As of now allocated VAs are bound to a node-0. It means the
> > >>>>>>      patch does not give any difference comparing with a current
> > >>>>>>      behavior;
> > >>>>>>
> > >>>>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
> > >>>>>>      is no need in it anymore. The vmap_area_list is still kept and
> > >>>>>>      is _empty_. It is exported for a kexec only;
> > >>>>>
> > >>>>> I haven't taken a test, while accessing all nodes' busy tree to get
> > >>>>> va of the lowest address could severely impact kcore reading efficiency
> > >>>>> on system with many vmap nodes. People doing live debugging via
> > >>>>> /proc/kcore will get a little surprise.
> > >>>>>
> > >>>>>
> > >>>>> Empty vmap_area_list will break makedumpfile utility, Crash utility
> > >>>>> could be impactd too. I checked makedumpfile code, it relys on
> > >>>>> vmap_area_list to deduce the vmalloc_start value.
> > >>>>>
> > >>>> It is left part and i hope i fix it in v3. The problem here is
> > >>>> we can not give an opportunity to access to vmap internals from
> > >>>> outside. This is just not correct, i.e. you are not allowed to
> > >>>> access the list directly.
> > >>>
> > >>> Right. Thanks for the fix in v3, that is a relief of makedumpfile and
> > >>> crash.
> > >>>
> > >>> Hi Kazu,
> > >>>
> > >>> Meanwhile, I am thinking if we should evaluate the necessity of
> > >>> vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
> > >>> vmap_area_list to deduce VMALLOC_START. Wondering if we can export
> > >>> VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
> > >>> is a tighter low boundary of vmalloc area and can reduce unnecessary
> > >>> scanning below the lowest va. Not sure if this is the reason people
> > >>> decided to export vmap_area_list.
> > >>
> > >> The kernel commit acd99dbf5402 introduced the original vmlist entry to
> > >> vmcoreinfo, but there is no information about why it did not export
> > >> VMALLOC_START directly.
> > >>
> > >> If VMALLOC_START is exported directly to vmcoreinfo, I think it would be
> > >> enough for makedumpfile.
> > > 
> > > Thanks for confirmation, Kazu.
> > > 
> > > Then, below draft patch should be enough to export VMALLOC_START
> > > instead, and remove vmap_area_list. 
> > 
> > also the following entries can be removed.
> > 
> >          VMCOREINFO_OFFSET(vmap_area, va_start);
> >          VMCOREINFO_OFFSET(vmap_area, list);
> 
> Right, they are useless now. I updated to remove them in below patch.
> 
> From a867fada34fd9e96528fcc5e72ae50b3b5685015 Mon Sep 17 00:00:00 2001
> From: Baoquan He <bhe@redhat.com>
> Date: Fri, 8 Sep 2023 11:53:22 +0800
> Subject: [PATCH] mm/vmalloc: remove vmap_area_list
> Content-type: text/plain
> 
> Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile
> get the base address of vmalloc area. Now, vmap_area_list is empty, so
> export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
>  arch/arm64/kernel/crash_core.c                 | 1 -
>  arch/riscv/kernel/crash_core.c                 | 1 -
>  include/linux/vmalloc.h                        | 1 -
>  kernel/crash_core.c                            | 4 +---
>  kernel/kallsyms_selftest.c                     | 1 -
>  mm/nommu.c                                     | 2 --
>  mm/vmalloc.c                                   | 3 +--
>  8 files changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index 599e8d3bcbc3..c11bd4b1ceb1 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
>  the kernel start address. Used to convert a virtual address from the
>  direct kernel map to a physical address.
>  
> -vmap_area_list
> ---------------
> +VMALLOC_START
> +-------------
>  
> -Stores the virtual area list. makedumpfile gets the vmalloc start value
> -from this variable and its value is necessary for vmalloc translation.
> +Stores the base address of vmalloc area. makedumpfile gets this value
> +since is necessary for vmalloc translation.
>  
>  mem_map
>  -------
> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index 66cde752cd74..2a24199a9b81 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
>  	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>  	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
>  	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
> -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>  	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
> index 55f1d7856b54..5c39cedd2c5c 100644
> --- a/arch/riscv/kernel/crash_core.c
> +++ b/arch/riscv/kernel/crash_core.c
> @@ -9,7 +9,6 @@ void arch_crash_save_vmcoreinfo(void)
>  	VMCOREINFO_NUMBER(phys_ram_base);
>  
>  	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
> -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>  	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..91810b4e9510 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
>  /*
>   *	Internals.  Don't use..
>   */
> -extern struct list_head vmap_area_list;
>  extern __init void vm_area_add_early(struct vm_struct *vm);
>  extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
>  
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 03a7932cde0a..a9faaf7e5f7d 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
>  #endif
>  	VMCOREINFO_SYMBOL(_stext);
> -	VMCOREINFO_SYMBOL(vmap_area_list);
> +	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>  
>  #ifndef CONFIG_NUMA
>  	VMCOREINFO_SYMBOL(mem_map);
> @@ -658,8 +658,6 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_OFFSET(free_area, free_list);
>  	VMCOREINFO_OFFSET(list_head, next);
>  	VMCOREINFO_OFFSET(list_head, prev);
> -	VMCOREINFO_OFFSET(vmap_area, va_start);
> -	VMCOREINFO_OFFSET(vmap_area, list);
>  	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
>  	log_buf_vmcoreinfo_setup();
>  	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> index b4cac76ea5e9..8a689b4ff4f9 100644
> --- a/kernel/kallsyms_selftest.c
> +++ b/kernel/kallsyms_selftest.c
> @@ -89,7 +89,6 @@ static struct test_item test_items[] = {
>  	ITEM_DATA(kallsyms_test_var_data_static),
>  	ITEM_DATA(kallsyms_test_var_bss),
>  	ITEM_DATA(kallsyms_test_var_data),
> -	ITEM_DATA(vmap_area_list),
>  #endif
>  };
>  
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 7f9e9e5a0e12..8c6686176ebd 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>  }
>  EXPORT_SYMBOL(follow_pfn);
>  
> -LIST_HEAD(vmap_area_list);
> -
>  void vfree(const void *addr)
>  {
>  	kfree(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 50d8239b82df..0a02633a9566 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -729,8 +729,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
>  
>  
>  static DEFINE_SPINLOCK(free_vmap_area_lock);
> -/* Export for kexec only */
> -LIST_HEAD(vmap_area_list);
> +
>  static bool vmap_initialized __read_mostly;
>  
>  /*
> -- 
> 2.41.0
> 
I appreciate your great input. This patch can go in as a standalone one
with a slight commit-message update, or I can take it and send it out
as part of v3.

Either way I am totally fine. What do you prefer?

Thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
@ 2023-09-08 11:25                   ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-08 11:25 UTC (permalink / raw)
  To: Baoquan He, k-hagio-ab
  Cc: HAGIO KAZUHITO(萩尾 一仁),
	Uladzislau Rezki, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On Fri, Sep 08, 2023 at 02:44:56PM +0800, Baoquan He wrote:
> On 09/08/23 at 05:01am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > On 2023/09/08 13:43, Baoquan He wrote:
> > > On 09/08/23 at 01:51am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > >> On 2023/09/07 18:58, Baoquan He wrote:
> > >>> On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
> > >>>> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> > >>>>> Add Kazu and Lianbo to CC, and kexec mailing list
> > >>>>>
> > >>>>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > >>>>>> Store allocated objects in a separate nodes. A va->va_start
> > >>>>>> address is converted into a correct node where it should
> > >>>>>> be placed and resided. An addr_to_node() function is used
> > >>>>>> to do a proper address conversion to determine a node that
> > >>>>>> contains a VA.
> > >>>>>>
> > >>>>>> Such approach balances VAs across nodes as a result an access
> > >>>>>> becomes scalable. Number of nodes in a system depends on number
> > >>>>>> of CPUs divided by two. The density factor in this case is 1/2.
> > >>>>>>
> > >>>>>> Please note:
> > >>>>>>
> > >>>>>> 1. As of now allocated VAs are bound to a node-0. It means the
> > >>>>>>      patch does not give any difference comparing with a current
> > >>>>>>      behavior;
> > >>>>>>
> > >>>>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
> > >>>>>>      is no need in it anymore. The vmap_area_list is still kept and
> > >>>>>>      is _empty_. It is exported for a kexec only;
> > >>>>>
> > >>>>> I haven't taken a test, while accessing all nodes' busy tree to get
> > >>>>> va of the lowest address could severely impact kcore reading efficiency
> > >>>>> on system with many vmap nodes. People doing live debugging via
> > >>>>> /proc/kcore will get a little surprise.
> > >>>>>
> > >>>>>
> > >>>>> Empty vmap_area_list will break makedumpfile utility, Crash utility
> > >>>>> could be impactd too. I checked makedumpfile code, it relys on
> > >>>>> vmap_area_list to deduce the vmalloc_start value.
> > >>>>>
> > >>>> It is left part and i hope i fix it in v3. The problem here is
> > >>>> we can not give an opportunity to access to vmap internals from
> > >>>> outside. This is just not correct, i.e. you are not allowed to
> > >>>> access the list directly.
> > >>>
> > >>> Right. Thanks for the fix in v3, that is a relief of makedumpfile and
> > >>> crash.
> > >>>
> > >>> Hi Kazu,
> > >>>
> > >>> Meanwhile, I am thinking if we should evaluate the necessity of
> > >>> vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
> > >>> vmap_area_list to deduce VMALLOC_START. Wondering if we can export
> > >>> VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
> > >>> is a tighter low boundary of vmalloc area and can reduce unnecessary
> > >>> scanning below the lowest va. Not sure if this is the reason people
> > >>> decided to export vmap_area_list.
> > >>
> > >> The kernel commit acd99dbf5402 introduced the original vmlist entry to
> > >> vmcoreinfo, but there is no information about why it did not export
> > >> VMALLOC_START directly.
> > >>
> > >> If VMALLOC_START is exported directly to vmcoreinfo, I think it would be
> > >> enough for makedumpfile.
> > > 
> > > Thanks for confirmation, Kazu.
> > > 
> > > Then, below draft patch should be enough to export VMALLOC_START
> > > instead, and remove vmap_area_list. 
> > 
> > also the following entries can be removed.
> > 
> >          VMCOREINFO_OFFSET(vmap_area, va_start);
> >          VMCOREINFO_OFFSET(vmap_area, list);
> 
> Right, they are useless now. I updated to remove them in below patch.
> 
> From a867fada34fd9e96528fcc5e72ae50b3b5685015 Mon Sep 17 00:00:00 2001
> From: Baoquan He <bhe@redhat.com>
> Date: Fri, 8 Sep 2023 11:53:22 +0800
> Subject: [PATCH] mm/vmalloc: remove vmap_area_list
> Content-type: text/plain
> 
> Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile
> get the base address of vmalloc area. Now, vmap_area_list is empty, so
> export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.
> 
> Signed-off-by: Baoquan He <bhe@redhat.com>
> ---
>  Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
>  arch/arm64/kernel/crash_core.c                 | 1 -
>  arch/riscv/kernel/crash_core.c                 | 1 -
>  include/linux/vmalloc.h                        | 1 -
>  kernel/crash_core.c                            | 4 +---
>  kernel/kallsyms_selftest.c                     | 1 -
>  mm/nommu.c                                     | 2 --
>  mm/vmalloc.c                                   | 3 +--
>  8 files changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> index 599e8d3bcbc3..c11bd4b1ceb1 100644
> --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> @@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
>  the kernel start address. Used to convert a virtual address from the
>  direct kernel map to a physical address.
>  
> -vmap_area_list
> ---------------
> +VMALLOC_START
> +-------------
>  
> -Stores the virtual area list. makedumpfile gets the vmalloc start value
> -from this variable and its value is necessary for vmalloc translation.
> +Stores the base address of the vmalloc area. makedumpfile gets this value
> +since it is necessary for vmalloc translation.
>  
>  mem_map
>  -------
> diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> index 66cde752cd74..2a24199a9b81 100644
> --- a/arch/arm64/kernel/crash_core.c
> +++ b/arch/arm64/kernel/crash_core.c
> @@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
>  	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
>  	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
>  	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
> -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>  	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
> index 55f1d7856b54..5c39cedd2c5c 100644
> --- a/arch/riscv/kernel/crash_core.c
> +++ b/arch/riscv/kernel/crash_core.c
> @@ -9,7 +9,6 @@ void arch_crash_save_vmcoreinfo(void)
>  	VMCOREINFO_NUMBER(phys_ram_base);
>  
>  	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
> -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>  	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
>  	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index c720be70c8dd..91810b4e9510 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
>  /*
>   *	Internals.  Don't use..
>   */
> -extern struct list_head vmap_area_list;
>  extern __init void vm_area_add_early(struct vm_struct *vm);
>  extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
>  
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 03a7932cde0a..a9faaf7e5f7d 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
>  #endif
>  	VMCOREINFO_SYMBOL(_stext);
> -	VMCOREINFO_SYMBOL(vmap_area_list);
> +	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
>  
>  #ifndef CONFIG_NUMA
>  	VMCOREINFO_SYMBOL(mem_map);
> @@ -658,8 +658,6 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_OFFSET(free_area, free_list);
>  	VMCOREINFO_OFFSET(list_head, next);
>  	VMCOREINFO_OFFSET(list_head, prev);
> -	VMCOREINFO_OFFSET(vmap_area, va_start);
> -	VMCOREINFO_OFFSET(vmap_area, list);
>  	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
>  	log_buf_vmcoreinfo_setup();
>  	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> index b4cac76ea5e9..8a689b4ff4f9 100644
> --- a/kernel/kallsyms_selftest.c
> +++ b/kernel/kallsyms_selftest.c
> @@ -89,7 +89,6 @@ static struct test_item test_items[] = {
>  	ITEM_DATA(kallsyms_test_var_data_static),
>  	ITEM_DATA(kallsyms_test_var_bss),
>  	ITEM_DATA(kallsyms_test_var_data),
> -	ITEM_DATA(vmap_area_list),
>  #endif
>  };
>  
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 7f9e9e5a0e12..8c6686176ebd 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
>  }
>  EXPORT_SYMBOL(follow_pfn);
>  
> -LIST_HEAD(vmap_area_list);
> -
>  void vfree(const void *addr)
>  {
>  	kfree(addr);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 50d8239b82df..0a02633a9566 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -729,8 +729,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
>  
>  
>  static DEFINE_SPINLOCK(free_vmap_area_lock);
> -/* Export for kexec only */
> -LIST_HEAD(vmap_area_list);
> +
>  static bool vmap_initialized __read_mostly;
>  
>  /*
> -- 
> 2.41.0
> 
I appreciate your great input. This patch can go as a standalone one
with a slight commit message update, or I can take it and send it
out as part of v3.

Either way I am totally fine. What do you prefer?

Thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-08 11:25                   ` Uladzislau Rezki
@ 2023-09-08 11:38                     ` Baoquan He
  -1 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-08 11:38 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: k-hagio-ab, lijiang, linux-mm, Andrew Morton, LKML,
	Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

On 09/08/23 at 01:25pm, Uladzislau Rezki wrote:
> On Fri, Sep 08, 2023 at 02:44:56PM +0800, Baoquan He wrote:
> > On 09/08/23 at 05:01am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > > On 2023/09/08 13:43, Baoquan He wrote:
> > > > On 09/08/23 at 01:51am, HAGIO KAZUHITO(萩尾 一仁) wrote:
> > > >> On 2023/09/07 18:58, Baoquan He wrote:
> > > >>> On 09/07/23 at 11:39am, Uladzislau Rezki wrote:
> > > >>>> On Thu, Sep 07, 2023 at 10:17:39AM +0800, Baoquan He wrote:
> > > >>>>> Add Kazu and Lianbo to CC, and kexec mailing list
> > > >>>>>
> > > >>>>> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > >>>>>> Store allocated objects in a separate nodes. A va->va_start
> > > >>>>>> address is converted into a correct node where it should
> > > >>>>>> be placed and resided. An addr_to_node() function is used
> > > >>>>>> to do a proper address conversion to determine a node that
> > > >>>>>> contains a VA.
> > > >>>>>>
> > > >>>>>> Such approach balances VAs across nodes as a result an access
> > > >>>>>> becomes scalable. Number of nodes in a system depends on number
> > > >>>>>> of CPUs divided by two. The density factor in this case is 1/2.
> > > >>>>>>
> > > >>>>>> Please note:
> > > >>>>>>
> > > >>>>>> 1. As of now allocated VAs are bound to a node-0. It means the
> > > >>>>>>      patch does not give any difference comparing with a current
> > > >>>>>>      behavior;
> > > >>>>>>
> > > >>>>>> 2. The global vmap_area_lock, vmap_area_root are removed as there
> > > >>>>>>      is no need in it anymore. The vmap_area_list is still kept and
> > > >>>>>>      is _empty_. It is exported for a kexec only;
> > > >>>>>
> > > >>>>> I haven't run a test, but accessing all nodes' busy trees to get the
> > > >>>>> va with the lowest address could severely impact kcore reading
> > > >>>>> efficiency on a system with many vmap nodes. People doing live
> > > >>>>> debugging via /proc/kcore will get a little surprise.
> > > >>>>>
> > > >>>>>
> > > >>>>> An empty vmap_area_list will break the makedumpfile utility; the Crash
> > > >>>>> utility could be impacted too. I checked the makedumpfile code, it relies
> > > >>>>> on vmap_area_list to deduce the vmalloc_start value.
> > > >>>>>
> > > >>>> It is the remaining part and I hope to fix it in v3. The problem here
> > > >>>> is that we cannot give an opportunity to access vmap internals from
> > > >>>> outside. This is just not correct, i.e. you are not allowed to
> > > >>>> access the list directly.
> > > >>>
> > > >>> Right. Thanks for the fix in v3, that is a relief for makedumpfile and
> > > >>> crash.
> > > >>>
> > > >>> Hi Kazu,
> > > >>>
> > > >>> Meanwhile, I am thinking if we should evaluate the necessity of
> > > >>> vmap_area_list in makedumpfile and Crash. In makedumpfile, we just use
> > > >>> vmap_area_list to deduce VMALLOC_START. Wondering if we can export
> > > >>> VMALLOC_START directly. Surely, the lowest va->va_start in vmap_area_list
> > > >>> is a tighter low boundary of vmalloc area and can reduce unnecessary
> > > >>> scanning below the lowest va. Not sure if this is the reason people
> > > >>> decided to export vmap_area_list.
> > > >>
> > > >> The kernel commit acd99dbf5402 introduced the original vmlist entry to
> > > >> vmcoreinfo, but there is no information about why it did not export
> > > >> VMALLOC_START directly.
> > > >>
> > > >> If VMALLOC_START is exported directly to vmcoreinfo, I think it would be
> > > >> enough for makedumpfile.
> > > > 
> > > > Thanks for confirmation, Kazu.
> > > > 
> > > > Then, below draft patch should be enough to export VMALLOC_START
> > > > instead, and remove vmap_area_list. 
> > > 
> > > also the following entries can be removed.
> > > 
> > >          VMCOREINFO_OFFSET(vmap_area, va_start);
> > >          VMCOREINFO_OFFSET(vmap_area, list);
> > 
> > Right, they are useless now. I updated to remove them in below patch.
> > 
> > From a867fada34fd9e96528fcc5e72ae50b3b5685015 Mon Sep 17 00:00:00 2001
> > From: Baoquan He <bhe@redhat.com>
> > Date: Fri, 8 Sep 2023 11:53:22 +0800
> > Subject: [PATCH] mm/vmalloc: remove vmap_area_list
> > Content-type: text/plain
> > 
> > Earlier, vmap_area_list is exported to vmcoreinfo so that makedumpfile
> > get the base address of vmalloc area. Now, vmap_area_list is empty, so
> > export VMALLOC_START to vmcoreinfo instead, and remove vmap_area_list.
> > 
> > Signed-off-by: Baoquan He <bhe@redhat.com>
> > ---
> >  Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 ++++----
> >  arch/arm64/kernel/crash_core.c                 | 1 -
> >  arch/riscv/kernel/crash_core.c                 | 1 -
> >  include/linux/vmalloc.h                        | 1 -
> >  kernel/crash_core.c                            | 4 +---
> >  kernel/kallsyms_selftest.c                     | 1 -
> >  mm/nommu.c                                     | 2 --
> >  mm/vmalloc.c                                   | 3 +--
> >  8 files changed, 6 insertions(+), 15 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> > index 599e8d3bcbc3..c11bd4b1ceb1 100644
> > --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
> > +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
> > @@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
> >  the kernel start address. Used to convert a virtual address from the
> >  direct kernel map to a physical address.
> >  
> > -vmap_area_list
> > ---------------
> > +VMALLOC_START
> > +-------------
> >  
> > -Stores the virtual area list. makedumpfile gets the vmalloc start value
> > -from this variable and its value is necessary for vmalloc translation.
> > +Stores the base address of the vmalloc area. makedumpfile gets this value
> > +since it is necessary for vmalloc translation.
> >  
> >  mem_map
> >  -------
> > diff --git a/arch/arm64/kernel/crash_core.c b/arch/arm64/kernel/crash_core.c
> > index 66cde752cd74..2a24199a9b81 100644
> > --- a/arch/arm64/kernel/crash_core.c
> > +++ b/arch/arm64/kernel/crash_core.c
> > @@ -23,7 +23,6 @@ void arch_crash_save_vmcoreinfo(void)
> >  	/* Please note VMCOREINFO_NUMBER() uses "%d", not "%x" */
> >  	vmcoreinfo_append_str("NUMBER(MODULES_VADDR)=0x%lx\n", MODULES_VADDR);
> >  	vmcoreinfo_append_str("NUMBER(MODULES_END)=0x%lx\n", MODULES_END);
> > -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
> >  	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
> >  	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
> >  	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> > diff --git a/arch/riscv/kernel/crash_core.c b/arch/riscv/kernel/crash_core.c
> > index 55f1d7856b54..5c39cedd2c5c 100644
> > --- a/arch/riscv/kernel/crash_core.c
> > +++ b/arch/riscv/kernel/crash_core.c
> > @@ -9,7 +9,6 @@ void arch_crash_save_vmcoreinfo(void)
> >  	VMCOREINFO_NUMBER(phys_ram_base);
> >  
> >  	vmcoreinfo_append_str("NUMBER(PAGE_OFFSET)=0x%lx\n", PAGE_OFFSET);
> > -	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
> >  	vmcoreinfo_append_str("NUMBER(VMALLOC_END)=0x%lx\n", VMALLOC_END);
> >  	vmcoreinfo_append_str("NUMBER(VMEMMAP_START)=0x%lx\n", VMEMMAP_START);
> >  	vmcoreinfo_append_str("NUMBER(VMEMMAP_END)=0x%lx\n", VMEMMAP_END);
> > diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> > index c720be70c8dd..91810b4e9510 100644
> > --- a/include/linux/vmalloc.h
> > +++ b/include/linux/vmalloc.h
> > @@ -253,7 +253,6 @@ extern long vread_iter(struct iov_iter *iter, const char *addr, size_t count);
> >  /*
> >   *	Internals.  Don't use..
> >   */
> > -extern struct list_head vmap_area_list;
> >  extern __init void vm_area_add_early(struct vm_struct *vm);
> >  extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
> >  
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 03a7932cde0a..a9faaf7e5f7d 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -617,7 +617,7 @@ static int __init crash_save_vmcoreinfo_init(void)
> >  	VMCOREINFO_SYMBOL_ARRAY(swapper_pg_dir);
> >  #endif
> >  	VMCOREINFO_SYMBOL(_stext);
> > -	VMCOREINFO_SYMBOL(vmap_area_list);
> > +	vmcoreinfo_append_str("NUMBER(VMALLOC_START)=0x%lx\n", VMALLOC_START);
> >  
> >  #ifndef CONFIG_NUMA
> >  	VMCOREINFO_SYMBOL(mem_map);
> > @@ -658,8 +658,6 @@ static int __init crash_save_vmcoreinfo_init(void)
> >  	VMCOREINFO_OFFSET(free_area, free_list);
> >  	VMCOREINFO_OFFSET(list_head, next);
> >  	VMCOREINFO_OFFSET(list_head, prev);
> > -	VMCOREINFO_OFFSET(vmap_area, va_start);
> > -	VMCOREINFO_OFFSET(vmap_area, list);
> >  	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER + 1);
> >  	log_buf_vmcoreinfo_setup();
> >  	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> > diff --git a/kernel/kallsyms_selftest.c b/kernel/kallsyms_selftest.c
> > index b4cac76ea5e9..8a689b4ff4f9 100644
> > --- a/kernel/kallsyms_selftest.c
> > +++ b/kernel/kallsyms_selftest.c
> > @@ -89,7 +89,6 @@ static struct test_item test_items[] = {
> >  	ITEM_DATA(kallsyms_test_var_data_static),
> >  	ITEM_DATA(kallsyms_test_var_bss),
> >  	ITEM_DATA(kallsyms_test_var_data),
> > -	ITEM_DATA(vmap_area_list),
> >  #endif
> >  };
> >  
> > diff --git a/mm/nommu.c b/mm/nommu.c
> > index 7f9e9e5a0e12..8c6686176ebd 100644
> > --- a/mm/nommu.c
> > +++ b/mm/nommu.c
> > @@ -131,8 +131,6 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
> >  }
> >  EXPORT_SYMBOL(follow_pfn);
> >  
> > -LIST_HEAD(vmap_area_list);
> > -
> >  void vfree(const void *addr)
> >  {
> >  	kfree(addr);
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 50d8239b82df..0a02633a9566 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -729,8 +729,7 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
> >  
> >  
> >  static DEFINE_SPINLOCK(free_vmap_area_lock);
> > -/* Export for kexec only */
> > -LIST_HEAD(vmap_area_list);
> > +
> >  static bool vmap_initialized __read_mostly;
> >  
> >  /*
> > -- 
> > 2.41.0
> > 
> I appreciate your great input. This patch can go as a standalone one
> with a slight commit message update, or I can take it and send it
> out as part of v3.
> 
> Either way I am totally fine. What do you prefer?

Maybe take it together with this patchset in v3. Thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-08 11:38                     ` Baoquan He
@ 2023-09-08 13:23                       ` Uladzislau Rezki
  -1 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-08 13:23 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, k-hagio-ab, lijiang, linux-mm, Andrew Morton,
	LKML, Lorenzo Stoakes, Christoph Hellwig, Matthew Wilcox,
	Liam R . Howlett, Dave Chinner, Paul E . McKenney,
	Joel Fernandes, Oleksiy Avramchenko, kexec

> > > 
> > I appreciate your great input. This patch can go as a standalone one
> > with a slight commit message update, or I can take it and send it
> > out as part of v3.
> > 
> > Either way I am totally fine. What do you prefer?
> 
> Maybe take it together with this patchset in v3. Thanks.
> 
OK, I will deliver it with v3. Again, thank you very much.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-08-29  8:11 ` [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
  2023-08-29 14:30   ` kernel test robot
  2023-09-07  2:17     ` Baoquan He
@ 2023-09-11  2:38   ` Baoquan He
  2023-09-11 16:53     ` Uladzislau Rezki
  2 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-11  2:38 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Store allocated objects in a separate nodes. A va->va_start
> address is converted into a correct node where it should
> be placed and resided. An addr_to_node() function is used
> to do a proper address conversion to determine a node that
> contains a VA.
> 
> Such approach balances VAs across nodes as a result an access
> becomes scalable. Number of nodes in a system depends on number
> of CPUs divided by two. The density factor in this case is 1/2.
> 
> Please note:
> 
> 1. As of now allocated VAs are bound to a node-0. It means the
>    patch does not give any difference comparing with a current
>    behavior;
> 
> 2. The global vmap_area_lock, vmap_area_root are removed as there
>    is no need in it anymore. The vmap_area_list is still kept and
>    is _empty_. It is exported for a kexec only;
> 
> 3. The vmallocinfo and vread() have to be reworked to be able to
>    handle multiple nodes.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 209 +++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 161 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index b7deacca1483..ae0368c314ff 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
>  #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
>  
>  
> -static DEFINE_SPINLOCK(vmap_area_lock);
>  static DEFINE_SPINLOCK(free_vmap_area_lock);
>  /* Export for kexec only */
>  LIST_HEAD(vmap_area_list);
> -static struct rb_root vmap_area_root = RB_ROOT;
>  static bool vmap_initialized __read_mostly;
>  
>  static struct rb_root purge_vmap_area_root = RB_ROOT;
> @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
>   */
>  static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
>  
> +/*
> + * An effective vmap-node logic. Users make use of nodes instead
> + * of a global heap. It allows to balance an access and mitigate
> + * contention.
> + */
> +struct rb_list {
> +	struct rb_root root;
> +	struct list_head head;
> +	spinlock_t lock;
> +};
> +
> +struct vmap_node {
> +	/* Bookkeeping data of this node. */
> +	struct rb_list busy;
> +};
> +
> +static struct vmap_node *nodes, snode;
> +static __read_mostly unsigned int nr_nodes = 1;
> +static __read_mostly unsigned int node_size = 1;

It could be better to give these global variables more meaningful names,
e.g. vmap_nodes, static_vmap_nodes, nr_vmap_nodes. When I use vim+cscope
to reference them, it gives me a super long list. Aside from that, a
generic name often makes me mistake it for a local variable. A weak
opinion.
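
For illustration only (not part of the posted patch), the renaming
suggested above might look like this; vmap_node_size is an extra guess
for node_size:

	/* Illustrative names only, same definitions as in the patch. */
	static struct vmap_node *vmap_nodes, static_vmap_nodes;
	static __read_mostly unsigned int nr_vmap_nodes = 1;
	static __read_mostly unsigned int vmap_node_size = 1;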


> +
> +static inline unsigned int
> +addr_to_node_id(unsigned long addr)
> +{
> +	return (addr / node_size) % nr_nodes;
> +}
> +
> +static inline struct vmap_node *
> +addr_to_node(unsigned long addr)
> +{
> +	return &nodes[addr_to_node_id(addr)];
> +}
> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  2023-08-29  8:11 ` [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree Uladzislau Rezki (Sony)
@ 2023-09-11  2:57   ` Baoquan He
  2023-09-11 17:00     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-11  2:57 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Similar to busy VA, lazily-freed area is stored to a node
> it belongs to. Such approach does not require any global
> locking primitive, instead an access becomes scalable what
> mitigates a contention.
> 
> This patch removes a global purge-lock, global purge-tree
> and global purge list.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 135 +++++++++++++++++++++++++++++++--------------------
>  1 file changed, 82 insertions(+), 53 deletions(-)

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ae0368c314ff..5a8a9c1370b6 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -733,10 +733,6 @@ static DEFINE_SPINLOCK(free_vmap_area_lock);
>  LIST_HEAD(vmap_area_list);
>  static bool vmap_initialized __read_mostly;
>  
> -static struct rb_root purge_vmap_area_root = RB_ROOT;
> -static LIST_HEAD(purge_vmap_area_list);
> -static DEFINE_SPINLOCK(purge_vmap_area_lock);
> -
>  /*
>   * This kmem_cache is used for vmap_area objects. Instead of
>   * allocating from slab we reuse an object from this cache to
> @@ -784,6 +780,12 @@ struct rb_list {
>  struct vmap_node {
>  	/* Bookkeeping data of this node. */
>  	struct rb_list busy;
> +	struct rb_list lazy;
> +
> +	/*
> +	 * Ready-to-free areas.
> +	 */
> +	struct list_head purge_list;
>  };
>  
>  static struct vmap_node *nodes, snode;
> @@ -1768,40 +1770,22 @@ static DEFINE_MUTEX(vmap_purge_lock);
>  
>  /* for per-CPU blocks */
>  static void purge_fragmented_blocks_allcpus(void);
> +static cpumask_t purge_nodes;
>  
>  /*
>   * Purges all lazily-freed vmap areas.
>   */
> -static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> +static unsigned long
> +purge_vmap_node(struct vmap_node *vn)
>  {
> -	unsigned long resched_threshold;
> -	unsigned int num_purged_areas = 0;
> -	struct list_head local_purge_list;
> +	unsigned long num_purged_areas = 0;
>  	struct vmap_area *va, *n_va;
>  
> -	lockdep_assert_held(&vmap_purge_lock);
> -
> -	spin_lock(&purge_vmap_area_lock);
> -	purge_vmap_area_root = RB_ROOT;
> -	list_replace_init(&purge_vmap_area_list, &local_purge_list);
> -	spin_unlock(&purge_vmap_area_lock);
> -
> -	if (unlikely(list_empty(&local_purge_list)))
> -		goto out;
> -
> -	start = min(start,
> -		list_first_entry(&local_purge_list,
> -			struct vmap_area, list)->va_start);
> -
> -	end = max(end,
> -		list_last_entry(&local_purge_list,
> -			struct vmap_area, list)->va_end);
> -
> -	flush_tlb_kernel_range(start, end);
> -	resched_threshold = lazy_max_pages() << 1;
> +	if (list_empty(&vn->purge_list))
> +		return 0;
>  
>  	spin_lock(&free_vmap_area_lock);
> -	list_for_each_entry_safe(va, n_va, &local_purge_list, list) {
> +	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
>  		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
>  		unsigned long orig_start = va->va_start;
>  		unsigned long orig_end = va->va_end;
> @@ -1823,13 +1807,55 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  
>  		atomic_long_sub(nr, &vmap_lazy_nr);
>  		num_purged_areas++;
> -
> -		if (atomic_long_read(&vmap_lazy_nr) < resched_threshold)
> -			cond_resched_lock(&free_vmap_area_lock);
>  	}
>  	spin_unlock(&free_vmap_area_lock);
>  
> -out:
> +	return num_purged_areas;
> +}
> +
> +/*
> + * Purges all lazily-freed vmap areas.
> + */
> +static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
> +{
> +	unsigned long num_purged_areas = 0;
> +	struct vmap_node *vn;
> +	int i;
> +
> +	lockdep_assert_held(&vmap_purge_lock);
> +	purge_nodes = CPU_MASK_NONE;
> +
> +	for (i = 0; i < nr_nodes; i++) {
> +		vn = &nodes[i];
> +
> +		INIT_LIST_HEAD(&vn->purge_list);
> +
> +		if (RB_EMPTY_ROOT(&vn->lazy.root))
> +			continue;
> +
> +		spin_lock(&vn->lazy.lock);
> +		WRITE_ONCE(vn->lazy.root.rb_node, NULL);
> +		list_replace_init(&vn->lazy.head, &vn->purge_list);
> +		spin_unlock(&vn->lazy.lock);
> +
> +		start = min(start, list_first_entry(&vn->purge_list,
> +			struct vmap_area, list)->va_start);
> +
> +		end = max(end, list_last_entry(&vn->purge_list,
> +			struct vmap_area, list)->va_end);
> +
> +		cpumask_set_cpu(i, &purge_nodes);
> +	}
> +
> +	if (cpumask_weight(&purge_nodes) > 0) {
> +		flush_tlb_kernel_range(start, end);
> +
> +		for_each_cpu(i, &purge_nodes) {
> +			vn = &nodes[i];
> +			num_purged_areas += purge_vmap_node(vn);
> +		}
> +	}
> +
>  	trace_purge_vmap_area_lazy(start, end, num_purged_areas);
>  	return num_purged_areas > 0;
>  }
> @@ -1848,16 +1874,9 @@ static void reclaim_and_purge_vmap_areas(void)
>  
>  static void drain_vmap_area_work(struct work_struct *work)
>  {
> -	unsigned long nr_lazy;
> -
> -	do {
> -		mutex_lock(&vmap_purge_lock);
> -		__purge_vmap_area_lazy(ULONG_MAX, 0);
> -		mutex_unlock(&vmap_purge_lock);
> -
> -		/* Recheck if further work is required. */
> -		nr_lazy = atomic_long_read(&vmap_lazy_nr);
> -	} while (nr_lazy > lazy_max_pages());
> +	mutex_lock(&vmap_purge_lock);
> +	__purge_vmap_area_lazy(ULONG_MAX, 0);
> +	mutex_unlock(&vmap_purge_lock);
>  }
>  
>  /*
> @@ -1867,6 +1886,7 @@ static void drain_vmap_area_work(struct work_struct *work)
>   */
>  static void free_vmap_area_noflush(struct vmap_area *va)
>  {
> +	struct vmap_node *vn = addr_to_node(va->va_start);
>  	unsigned long nr_lazy_max = lazy_max_pages();
>  	unsigned long va_start = va->va_start;
>  	unsigned long nr_lazy;
> @@ -1880,10 +1900,9 @@ static void free_vmap_area_noflush(struct vmap_area *va)
>  	/*
>  	 * Merge or place it to the purge tree/list.
>  	 */
> -	spin_lock(&purge_vmap_area_lock);
> -	merge_or_add_vmap_area(va,
> -		&purge_vmap_area_root, &purge_vmap_area_list);
> -	spin_unlock(&purge_vmap_area_lock);
> +	spin_lock(&vn->lazy.lock);
> +	merge_or_add_vmap_area(va, &vn->lazy.root, &vn->lazy.head);
> +	spin_unlock(&vn->lazy.lock);
>  
>  	trace_free_vmap_area_noflush(va_start, nr_lazy, nr_lazy_max);
>  
> @@ -4390,15 +4409,21 @@ static void show_numa_info(struct seq_file *m, struct vm_struct *v)
>  
>  static void show_purge_info(struct seq_file *m)
>  {
> +	struct vmap_node *vn;
>  	struct vmap_area *va;
> +	int i;
>  
> -	spin_lock(&purge_vmap_area_lock);
> -	list_for_each_entry(va, &purge_vmap_area_list, list) {
> -		seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
> -			(void *)va->va_start, (void *)va->va_end,
> -			va->va_end - va->va_start);
> +	for (i = 0; i < nr_nodes; i++) {
> +		vn = &nodes[i];
> +
> +		spin_lock(&vn->lazy.lock);
> +		list_for_each_entry(va, &vn->lazy.head, list) {
> +			seq_printf(m, "0x%pK-0x%pK %7ld unpurged vm_area\n",
> +				(void *)va->va_start, (void *)va->va_end,
> +				va->va_end - va->va_start);
> +		}
> +		spin_unlock(&vn->lazy.lock);
>  	}
> -	spin_unlock(&purge_vmap_area_lock);
>  }
>  
>  static int s_show(struct seq_file *m, void *p)
> @@ -4545,6 +4570,10 @@ static void vmap_init_nodes(void)
>  		vn->busy.root = RB_ROOT;
>  		INIT_LIST_HEAD(&vn->busy.head);
>  		spin_lock_init(&vn->busy.lock);
> +
> +		vn->lazy.root = RB_ROOT;
> +		INIT_LIST_HEAD(&vn->lazy.head);
> +		spin_lock_init(&vn->lazy.lock);
>  	}
>  }
>  
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-08-29  8:11 ` [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock Uladzislau Rezki (Sony)
  2023-09-06  6:04   ` Baoquan He
@ 2023-09-11  3:25   ` Baoquan He
  2023-09-11 17:10     ` Uladzislau Rezki
  1 sibling, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-11  3:25 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Concurrent access to a global vmap space is a bottle-neck.
> We can simulate a high contention by running a vmalloc test
> suite.
> 
> To address it, introduce an effective vmap node logic. Each
> node behaves as independent entity. When a node is accessed
> it serves a request directly(if possible) also it can fetch
> a new block from a global heap to its internals if no space
> or low capacity is left.
> 
> This technique reduces a pressure on the global vmap lock.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 316 +++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 279 insertions(+), 37 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 5a8a9c1370b6..4fd4915c532d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -779,6 +779,7 @@ struct rb_list {
>  
>  struct vmap_node {
>  	/* Bookkeeping data of this node. */
> +	struct rb_list free;
>  	struct rb_list busy;
>  	struct rb_list lazy;
>  
> @@ -786,6 +787,13 @@ struct vmap_node {
>  	 * Ready-to-free areas.
>  	 */
>  	struct list_head purge_list;
> +	struct work_struct purge_work;
> +	unsigned long nr_purged;
> +
> +	/*
> +	 * Control that only one user can pre-fetch this node.
> +	 */
> +	atomic_t fill_in_progress;
>  };
>  
>  static struct vmap_node *nodes, snode;
> @@ -804,6 +812,32 @@ addr_to_node(unsigned long addr)
>  	return &nodes[addr_to_node_id(addr)];
>  }
>  
> +static inline struct vmap_node *
> +id_to_node(int id)
> +{
> +	return &nodes[id % nr_nodes];
> +}
> +
> +static inline int
> +this_node_id(void)
> +{
> +	return raw_smp_processor_id() % nr_nodes;
> +}
> +
> +static inline unsigned long
> +encode_vn_id(int node_id)
> +{
> +	/* Can store U8_MAX [0:254] nodes. */
> +	return (node_id + 1) << BITS_PER_BYTE;
> +}
> +
> +static inline int
> +decode_vn_id(unsigned long val)
> +{
> +	/* Can store U8_MAX [0:254] nodes. */
> +	return (val >> BITS_PER_BYTE) - 1;
> +}

This patch looks good to me. However, should we split out the encoding
of vn_id into va->flags as a separate patch? It looks like an
independent optimization which can be described better with a specific
changelog. At least, in the pasted pdf file or the patch log, it is not
obvious that:
1) a node's free tree could contain any address range;
2) a node's busy tree only contains address ranges belonging to this node;
   - it could contain a range crossing nodes, as a corner case.
3) a node's purge tree could contain any address range;
   - decided by the encoded vn_id in va->flags;
   - or by the address via addr_to_node(va->va_start).

Just a personal opinion; I feel it would make reviewing easier (a short
illustration of the flags encoding follows below).
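
For reference, a small sketch (not taken from the patch) of how a node
id round-trips through va->flags with the encode_vn_id()/decode_vn_id()
helpers quoted above:

	/* Illustrative values only. */
	unsigned long flags = encode_vn_id(3);	/* (3 + 1) << BITS_PER_BYTE == 0x400 */
	int vn_id = decode_vn_id(flags);	/* back to 3 */

	/*
	 * A va->flags value that carries only low-byte flags (e.g. VMAP_RAM)
	 * decodes to -1, i.e. "no owning node", so the free path falls back
	 * to addr_to_node(va->va_start).
	 */
	vn_id = decode_vn_id(VMAP_RAM);		/* -1 */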

> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {
> @@ -1586,6 +1620,7 @@ __alloc_vmap_area(struct rb_root *root, struct list_head *head,
>  static void free_vmap_area(struct vmap_area *va)
>  {
>  	struct vmap_node *vn = addr_to_node(va->va_start);
> +	int vn_id = decode_vn_id(va->flags);
>  
>  	/*
>  	 * Remove from the busy tree/list.
> @@ -1594,12 +1629,19 @@ static void free_vmap_area(struct vmap_area *va)
>  	unlink_va(va, &vn->busy.root);
>  	spin_unlock(&vn->busy.lock);
>  
> -	/*
> -	 * Insert/Merge it back to the free tree/list.
> -	 */
> -	spin_lock(&free_vmap_area_lock);
> -	merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> -	spin_unlock(&free_vmap_area_lock);
> +	if (vn_id >= 0) {
> +		vn = id_to_node(vn_id);
> +
> +		/* Belongs to this node. */
> +		spin_lock(&vn->free.lock);
> +		merge_or_add_vmap_area_augment(va, &vn->free.root, &vn->free.head);
> +		spin_unlock(&vn->free.lock);
> +	} else {
> +		/* Goes to global. */
> +		spin_lock(&free_vmap_area_lock);
> +		merge_or_add_vmap_area_augment(va, &free_vmap_area_root, &free_vmap_area_list);
> +		spin_unlock(&free_vmap_area_lock);
> +	}
>  }
>  
>  static inline void
> @@ -1625,6 +1667,134 @@ preload_this_cpu_lock(spinlock_t *lock, gfp_t gfp_mask, int node)
>  		kmem_cache_free(vmap_area_cachep, va);
>  }
>  
> +static unsigned long
> +node_alloc_fill(struct vmap_node *vn,
> +		unsigned long size, unsigned long align,
> +		gfp_t gfp_mask, int node)
> +{
> +	struct vmap_area *va;
> +	unsigned long addr;
> +
> +	va = kmem_cache_alloc_node(vmap_area_cachep, gfp_mask, node);
> +	if (unlikely(!va))
> +		return VMALLOC_END;
> +
> +	/*
> +	 * Please note, an allocated block is not aligned to its size.
> +	 * Therefore it can span several zones what means addr_to_node()
> +	 * can point to two different nodes:
> +	 *      <----->
> +	 * -|-----|-----|-----|-----|-
> +	 *     1     2     0     1
> +	 *
> +	 * an alignment would just increase fragmentation thus more heap
> +	 * consumption what we would like to avoid.
> +	 */
> +	spin_lock(&free_vmap_area_lock);
> +	addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> +		node_size, 1, VMALLOC_START, VMALLOC_END);
> +	spin_unlock(&free_vmap_area_lock);
> +
> +	if (addr == VMALLOC_END) {
> +		kmem_cache_free(vmap_area_cachep, va);
> +		return VMALLOC_END;
> +	}
> +
> +	/*
> +	 * Statement and condition of the problem:
> +	 *
> +	 * a) where to free allocated areas from a node:
> +	 *   - directly to a global heap;
> +	 *   - to a node that we got a VA from;
> +	 *     - what is a condition to return allocated areas
> +	 *       to a global heap then;
> +	 * b) how to properly handle left small free fragments
> +	 *    of a node in order to mitigate a fragmentation.
> +	 *
> +	 * How to address described points:
> +	 * When a new block is allocated(from a global heap) we shrink
> +	 * it deliberately by one page from both sides and place it to
> +	 * this node to serve a request.
> +	 *
> +	 * Why we shrink. We would like to distinguish VAs which were
> +	 * obtained from a node and a global heap. This is for a free
> +	 * path. A va->flags contains a node-id it belongs to. No VAs
> +	 * merging is possible between each other unless they are part
> +	 * of same block.
> +	 *
> +	 * A free-path in its turn can detect a correct node where a
> +	 * VA has to be returned. Thus as a block is freed entirely,
> +	 * its size becomes(merging): node_size - (2 * PAGE_SIZE) it
> +	 * recovers its edges, thus is released to a global heap for
> +	 * reuse elsewhere. In partly freed case, VAs go back to the
> +	 * node not bothering a global vmap space.
> +	 *
> +	 *        1               2              3
> +	 * |<------------>|<------------>|<------------>|
> +	 * |..<-------->..|..<-------->..|..<-------->..|
> +	 */
> +	va->va_start = addr + PAGE_SIZE;
> +	va->va_end = (addr + node_size) - PAGE_SIZE;
> +
> +	spin_lock(&vn->free.lock);
> +	/* Never merges. See explanation above. */
> +	insert_vmap_area_augment(va, NULL, &vn->free.root, &vn->free.head);
> +	addr = va_alloc(va, &vn->free.root, &vn->free.head,
> +		size, align, VMALLOC_START, VMALLOC_END);
> +	spin_unlock(&vn->free.lock);
> +
> +	return addr;
> +}
> +
> +static unsigned long
> +node_alloc(int vn_id, unsigned long size, unsigned long align,
> +		unsigned long vstart, unsigned long vend,
> +		gfp_t gfp_mask, int node)
> +{
> +	struct vmap_node *vn = id_to_node(vn_id);
> +	unsigned long extra = align > PAGE_SIZE ? align : 0;
> +	bool do_alloc_fill = false;
> +	unsigned long addr;
> +
> +	/*
> +	 * Fallback to a global heap if not vmalloc.
> +	 */
> +	if (vstart != VMALLOC_START || vend != VMALLOC_END)
> +		return vend;
> +
> +	/*
> +	 * A maximum allocation limit is 1/4 of capacity. This
> +	 * is done in order to prevent a fast depleting of zone
> +	 * by few requests.
> +	 */
> +	if (size + extra > (node_size >> 2))
> +		return vend;
> +
> +	spin_lock(&vn->free.lock);
> +	addr = __alloc_vmap_area(&vn->free.root, &vn->free.head,
> +		size, align, vstart, vend);
> +
> +	if (addr == vend) {
> +		/*
> +		 * Set the fetch flag under the critical section.
> +		 * This guarantees that only one user is eligible
> +		 * to perform a pre-fetch. A reset operation can
> +		 * be concurrent.
> +		 */
> +		if (!atomic_xchg(&vn->fill_in_progress, 1))
> +			do_alloc_fill = true;
> +	}
> +	spin_unlock(&vn->free.lock);
> +
> +	/* Only if fails a previous attempt. */
> +	if (do_alloc_fill) {
> +		addr = node_alloc_fill(vn, size, align, gfp_mask, node);
> +		atomic_set(&vn->fill_in_progress, 0);
> +	}
> +
> +	return addr;
> +}
> +
>  /*
>   * Allocate a region of KVA of the specified size and alignment, within the
>   * vstart and vend.
> @@ -1640,7 +1810,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	unsigned long freed;
>  	unsigned long addr;
>  	int purged = 0;
> -	int ret;
> +	int ret, vn_id;
>  
>  	if (unlikely(!size || offset_in_page(size) || !is_power_of_2(align)))
>  		return ERR_PTR(-EINVAL);
> @@ -1661,11 +1831,17 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	 */
>  	kmemleak_scan_area(&va->rb_node, SIZE_MAX, gfp_mask);
>  
> +	vn_id = this_node_id();
> +	addr = node_alloc(vn_id, size, align, vstart, vend, gfp_mask, node);
> +	va->flags = (addr != vend) ? encode_vn_id(vn_id) : 0;
> +
>  retry:
> -	preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> -	addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> -		size, align, vstart, vend);
> -	spin_unlock(&free_vmap_area_lock);
> +	if (addr == vend) {
> +		preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node);
> +		addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list,
> +			size, align, vstart, vend);
> +		spin_unlock(&free_vmap_area_lock);
> +	}
>  
>  	trace_alloc_vmap_area(addr, size, align, vstart, vend, addr == vend);
>  
> @@ -1679,7 +1855,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	va->va_start = addr;
>  	va->va_end = addr + size;
>  	va->vm = NULL;
> -	va->flags = va_flags;
> +	va->flags |= va_flags;
>  
>  	vn = addr_to_node(va->va_start);
>  
> @@ -1772,31 +1948,58 @@ static DEFINE_MUTEX(vmap_purge_lock);
>  static void purge_fragmented_blocks_allcpus(void);
>  static cpumask_t purge_nodes;
>  
> -/*
> - * Purges all lazily-freed vmap areas.
> - */
> -static unsigned long
> -purge_vmap_node(struct vmap_node *vn)
> +static void
> +reclaim_list_global(struct list_head *head)
> +{
> +	struct vmap_area *va, *n;
> +
> +	if (list_empty(head))
> +		return;
> +
> +	spin_lock(&free_vmap_area_lock);
> +	list_for_each_entry_safe(va, n, head, list)
> +		merge_or_add_vmap_area_augment(va,
> +			&free_vmap_area_root, &free_vmap_area_list);
> +	spin_unlock(&free_vmap_area_lock);
> +}
> +
> +static void purge_vmap_node(struct work_struct *work)
>  {
> -	unsigned long num_purged_areas = 0;
> +	struct vmap_node *vn = container_of(work,
> +		struct vmap_node, purge_work);
>  	struct vmap_area *va, *n_va;
> +	LIST_HEAD(global);
> +
> +	vn->nr_purged = 0;
>  
>  	if (list_empty(&vn->purge_list))
> -		return 0;
> +		return;
>  
> -	spin_lock(&free_vmap_area_lock);
> +	spin_lock(&vn->free.lock);
>  	list_for_each_entry_safe(va, n_va, &vn->purge_list, list) {
>  		unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT;
>  		unsigned long orig_start = va->va_start;
>  		unsigned long orig_end = va->va_end;
> +		int vn_id = decode_vn_id(va->flags);
>  
> -		/*
> -		 * Finally insert or merge lazily-freed area. It is
> -		 * detached and there is no need to "unlink" it from
> -		 * anything.
> -		 */
> -		va = merge_or_add_vmap_area_augment(va, &free_vmap_area_root,
> -				&free_vmap_area_list);
> +		list_del_init(&va->list);
> +
> +		if (vn_id >= 0) {
> +			if (va_size(va) != node_size - (2 * PAGE_SIZE))
> +				va = merge_or_add_vmap_area_augment(va, &vn->free.root, &vn->free.head);
> +
> +			if (va_size(va) == node_size - (2 * PAGE_SIZE)) {
> +				if (!list_empty(&va->list))
> +					unlink_va_augment(va, &vn->free.root);
> +
> +				/* Restore the block size. */
> +				va->va_start -= PAGE_SIZE;
> +				va->va_end += PAGE_SIZE;
> +				list_add(&va->list, &global);
> +			}
> +		} else {
> +			list_add(&va->list, &global);
> +		}
>  
>  		if (!va)
>  			continue;
> @@ -1806,11 +2009,10 @@ purge_vmap_node(struct vmap_node *vn)
>  					      va->va_start, va->va_end);
>  
>  		atomic_long_sub(nr, &vmap_lazy_nr);
> -		num_purged_areas++;
> +		vn->nr_purged++;
>  	}
> -	spin_unlock(&free_vmap_area_lock);
> -
> -	return num_purged_areas;
> +	spin_unlock(&vn->free.lock);
> +	reclaim_list_global(&global);
>  }
>  
>  /*
> @@ -1818,11 +2020,17 @@ purge_vmap_node(struct vmap_node *vn)
>   */
>  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  {
> -	unsigned long num_purged_areas = 0;
> +	unsigned long nr_purged_areas = 0;
> +	unsigned int nr_purge_helpers;
> +	unsigned int nr_purge_nodes;
>  	struct vmap_node *vn;
>  	int i;
>  
>  	lockdep_assert_held(&vmap_purge_lock);
> +
> +	/*
> +	 * Use cpumask to mark which node has to be processed.
> +	 */
>  	purge_nodes = CPU_MASK_NONE;
>  
>  	for (i = 0; i < nr_nodes; i++) {
> @@ -1847,17 +2055,45 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
>  		cpumask_set_cpu(i, &purge_nodes);
>  	}
>  
> -	if (cpumask_weight(&purge_nodes) > 0) {
> +	nr_purge_nodes = cpumask_weight(&purge_nodes);
> +	if (nr_purge_nodes > 0) {
>  		flush_tlb_kernel_range(start, end);
>  
> +		/* One extra worker is per a lazy_max_pages() full set minus one. */
> +		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
> +		nr_purge_helpers = clamp(nr_purge_helpers, 1U, nr_purge_nodes) - 1;
> +
> +		for_each_cpu(i, &purge_nodes) {
> +			vn = &nodes[i];
> +
> +			if (nr_purge_helpers > 0) {
> +				INIT_WORK(&vn->purge_work, purge_vmap_node);
> +
> +				if (cpumask_test_cpu(i, cpu_online_mask))
> +					schedule_work_on(i, &vn->purge_work);
> +				else
> +					schedule_work(&vn->purge_work);
> +
> +				nr_purge_helpers--;
> +			} else {
> +				vn->purge_work.func = NULL;
> +				purge_vmap_node(&vn->purge_work);
> +				nr_purged_areas += vn->nr_purged;
> +			}
> +		}
> +
>  		for_each_cpu(i, &purge_nodes) {
>  			vn = &nodes[i];
> -			num_purged_areas += purge_vmap_node(vn);
> +
> +			if (vn->purge_work.func) {
> +				flush_work(&vn->purge_work);
> +				nr_purged_areas += vn->nr_purged;
> +			}
>  		}
>  	}
>  
> -	trace_purge_vmap_area_lazy(start, end, num_purged_areas);
> -	return num_purged_areas > 0;
> +	trace_purge_vmap_area_lazy(start, end, nr_purged_areas);
> +	return nr_purged_areas > 0;
>  }
>  
>  /*
> @@ -1886,9 +2122,11 @@ static void drain_vmap_area_work(struct work_struct *work)
>   */
>  static void free_vmap_area_noflush(struct vmap_area *va)
>  {
> -	struct vmap_node *vn = addr_to_node(va->va_start);
>  	unsigned long nr_lazy_max = lazy_max_pages();
>  	unsigned long va_start = va->va_start;
> +	int vn_id = decode_vn_id(va->flags);
> +	struct vmap_node *vn = vn_id >= 0 ? id_to_node(vn_id):
> +		addr_to_node(va->va_start);;
>  	unsigned long nr_lazy;
>  
>  	if (WARN_ON_ONCE(!list_empty(&va->list)))
> @@ -4574,6 +4812,10 @@ static void vmap_init_nodes(void)
>  		vn->lazy.root = RB_ROOT;
>  		INIT_LIST_HEAD(&vn->lazy.head);
>  		spin_lock_init(&vn->lazy.lock);
> +
> +		vn->free.root = RB_ROOT;
> +		INIT_LIST_HEAD(&vn->free.head);
> +		spin_lock_init(&vn->free.lock);
>  	}
>  }
>  
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-08-29  8:11 ` [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter Uladzislau Rezki (Sony)
@ 2023-09-11  3:58   ` Baoquan He
  2023-09-11 18:16     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-11  3:58 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Extend the vread_iter() to be able to perform a sequential
> reading of VAs which are spread among multiple nodes. So a
> data read over the /dev/kmem correctly reflects a vmalloc
> memory layout.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 4fd4915c532d..968144c16237 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
......  
> @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  
>  	remains = count;
>  
> -	/* Hooked to node_0 so far. */
> -	vn = addr_to_node(0);
> -	spin_lock(&vn->busy.lock);

This could change the vread behaviour a little bit. Before, once we
took vmap_area_lock, vread would read out a consistent snapshot taken
at that moment. Now, reading one node's tree won't block accesses to
the other nodes' trees. Not sure if this matters when people need to
access /proc/kcore, e.g. for dynamic debugging.

And, the reading will be a little slower because finding each va needs
to iterate over all vmap_nodes[].

Otherwise, the patch itself looks good to me.

Reviewed-by: Baoquan He <bhe@redhat.com>

> -
> -	va = find_vmap_area_exceed_addr((unsigned long)addr, &vn->busy.root);
> -	if (!va)
> +	vn = find_vmap_area_exceed_addr_lock((unsigned long) addr, &va);
> +	if (!vn)
>  		goto finished_zero;
>  
>  	/* no intersects with alive vmap_area */
>  	if ((unsigned long)addr + remains <= va->va_start)
>  		goto finished_zero;
>  
> -	list_for_each_entry_from(va, &vn->busy.head, list) {
> +	do {
>  		size_t copied;
>  
>  		if (remains == 0)
> @@ -4084,10 +4116,10 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  		WARN_ON(flags == VMAP_BLOCK);
>  
>  		if (!vm && !flags)
> -			continue;
> +			goto next_va;
>  
>  		if (vm && (vm->flags & VM_UNINITIALIZED))
> -			continue;
> +			goto next_va;
>  
>  		/* Pair with smp_wmb() in clear_vm_uninitialized_flag() */
>  		smp_rmb();
> @@ -4096,7 +4128,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  		size = vm ? get_vm_area_size(vm) : va_size(va);
>  
>  		if (addr >= vaddr + size)
> -			continue;
> +			goto next_va;
>  
>  		if (addr < vaddr) {
>  			size_t to_zero = min_t(size_t, vaddr - addr, remains);
> @@ -4125,15 +4157,22 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
>  
>  		if (copied != n)
>  			goto finished;
> -	}
> +
> +	next_va:
> +		next = va->va_end;
> +		spin_unlock(&vn->busy.lock);
> +	} while ((vn = find_vmap_area_exceed_addr_lock(next, &va)));
>  
>  finished_zero:
> -	spin_unlock(&vn->busy.lock);
> +	if (vn)
> +		spin_unlock(&vn->busy.lock);
> +
>  	/* zero-fill memory holes */
>  	return count - remains + zero_iter(iter, remains);
>  finished:
>  	/* Nothing remains, or We couldn't copy/zero everything. */
> -	spin_unlock(&vn->busy.lock);
> +	if (vn)
> +		spin_unlock(&vn->busy.lock);
>  
>  	return count - remains;
>  }
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-11  2:38   ` Baoquan He
@ 2023-09-11 16:53     ` Uladzislau Rezki
  2023-09-12 13:19       ` Baoquan He
  0 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-11 16:53 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Mon, Sep 11, 2023 at 10:38:29AM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Store allocated objects in a separate nodes. A va->va_start
> > address is converted into a correct node where it should
> > be placed and resided. An addr_to_node() function is used
> > to do a proper address conversion to determine a node that
> > contains a VA.
> > 
> > Such approach balances VAs across nodes as a result an access
> > becomes scalable. Number of nodes in a system depends on number
> > of CPUs divided by two. The density factor in this case is 1/2.
> > 
> > Please note:
> > 
> > 1. As of now allocated VAs are bound to a node-0. It means the
> >    patch does not give any difference comparing with a current
> >    behavior;
> > 
> > 2. The global vmap_area_lock, vmap_area_root are removed as there
> >    is no need in it anymore. The vmap_area_list is still kept and
> >    is _empty_. It is exported for a kexec only;
> > 
> > 3. The vmallocinfo and vread() have to be reworked to be able to
> >    handle multiple nodes.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 209 +++++++++++++++++++++++++++++++++++++++------------
> >  1 file changed, 161 insertions(+), 48 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index b7deacca1483..ae0368c314ff 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
> >  #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
> >  
> >  
> > -static DEFINE_SPINLOCK(vmap_area_lock);
> >  static DEFINE_SPINLOCK(free_vmap_area_lock);
> >  /* Export for kexec only */
> >  LIST_HEAD(vmap_area_list);
> > -static struct rb_root vmap_area_root = RB_ROOT;
> >  static bool vmap_initialized __read_mostly;
> >  
> >  static struct rb_root purge_vmap_area_root = RB_ROOT;
> > @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
> >   */
> >  static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> >  
> > +/*
> > + * An effective vmap-node logic. Users make use of nodes instead
> > + * of a global heap. It allows to balance an access and mitigate
> > + * contention.
> > + */
> > +struct rb_list {
> > +	struct rb_root root;
> > +	struct list_head head;
> > +	spinlock_t lock;
> > +};
> > +
> > +struct vmap_node {
> > +	/* Bookkeeping data of this node. */
> > +	struct rb_list busy;
> > +};
> > +
> > +static struct vmap_node *nodes, snode;
> > +static __read_mostly unsigned int nr_nodes = 1;
> > +static __read_mostly unsigned int node_size = 1;
> 
> It could be better if calling these global variables a meaningful name,
> e.g vmap_nodes, static_vmap_nodes, nr_vmap_nodes. When I use vim+cscope
> to reference them, it gives me a super long list. Aside from that, a
> simple name often makes me mistake it as a local virable. A weak
> opinion.
> 
I am OK with adding the "vmap_" prefix:

vmap_nodes;
vmap_nr_nodes;
vmap_node_size;
..

If you are not OK with that, feel free to propose other variants.

Thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree
  2023-09-11  2:57   ` Baoquan He
@ 2023-09-11 17:00     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-11 17:00 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Similar to busy VA, lazily-freed area is stored to a node
> > it belongs to. Such approach does not require any global
> > locking primitive, instead an access becomes scalable what
> > mitigates a contention.
> > 
> > This patch removes a global purge-lock, global purge-tree
> > and global purge list.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 135 +++++++++++++++++++++++++++++++--------------------
> >  1 file changed, 82 insertions(+), 53 deletions(-)
> 
> LGTM,
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Applied.

Thank you for review!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-09-11  3:25   ` Baoquan He
@ 2023-09-11 17:10     ` Uladzislau Rezki
  2023-09-12 13:21       ` Baoquan He
  0 siblings, 1 reply; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-11 17:10 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Mon, Sep 11, 2023 at 11:25:01AM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Concurrent access to a global vmap space is a bottle-neck.
> > We can simulate a high contention by running a vmalloc test
> > suite.
> > 
> > To address it, introduce an effective vmap node logic. Each
> > node behaves as independent entity. When a node is accessed
> > it serves a request directly(if possible) also it can fetch
> > a new block from a global heap to its internals if no space
> > or low capacity is left.
> > 
> > This technique reduces a pressure on the global vmap lock.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 316 +++++++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 279 insertions(+), 37 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 5a8a9c1370b6..4fd4915c532d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -779,6 +779,7 @@ struct rb_list {
> >  
> >  struct vmap_node {
> >  	/* Bookkeeping data of this node. */
> > +	struct rb_list free;
> >  	struct rb_list busy;
> >  	struct rb_list lazy;
> >  
> > @@ -786,6 +787,13 @@ struct vmap_node {
> >  	 * Ready-to-free areas.
> >  	 */
> >  	struct list_head purge_list;
> > +	struct work_struct purge_work;
> > +	unsigned long nr_purged;
> > +
> > +	/*
> > +	 * Control that only one user can pre-fetch this node.
> > +	 */
> > +	atomic_t fill_in_progress;
> >  };
> >  
> >  static struct vmap_node *nodes, snode;
> > @@ -804,6 +812,32 @@ addr_to_node(unsigned long addr)
> >  	return &nodes[addr_to_node_id(addr)];
> >  }
> >  
> > +static inline struct vmap_node *
> > +id_to_node(int id)
> > +{
> > +	return &nodes[id % nr_nodes];
> > +}
> > +
> > +static inline int
> > +this_node_id(void)
> > +{
> > +	return raw_smp_processor_id() % nr_nodes;
> > +}
> > +
> > +static inline unsigned long
> > +encode_vn_id(int node_id)
> > +{
> > +	/* Can store U8_MAX [0:254] nodes. */
> > +	return (node_id + 1) << BITS_PER_BYTE;
> > +}
> > +
> > +static inline int
> > +decode_vn_id(unsigned long val)
> > +{
> > +	/* Can store U8_MAX [0:254] nodes. */
> > +	return (val >> BITS_PER_BYTE) - 1;
> > +}
> 
> This patch looks good to me. However, should we split out the encoding
> vn_id into va->flags optimization into another patch? It looks like an
> independent optimization which can be described better with specific
> log. At least, in the pdf file pasted or patch log, it's not obvious
> that:
> 1) node's free tree could contains any address range;
> 2) nodes' busy tree only contains address range belonging to this node;
>    - could contain crossing node range, corner case.
> 3) nodes' purge tree could contain any address range;
>    - decided by encoded vn_id in va->flags.
>    - decided by address via addr_to_node(va->va_start).
> 
> Personal opinion, feel it will make reviewing easier.
> 
Sure, if it is easier to review, then I will split these two parts.
All three statements are correct and valid. The pdf file only covers
v1, so it is not up to date.

Anyway, I will update the cover letter in v3 with more details.
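
For the vn_id encoding itself, the round trip is small enough to check
in isolation. Below is a self-contained sketch, assuming only
BITS_PER_BYTE == 8 and node ids in [0:254] as the comments above state;
a negative decode result is the case where free_vmap_area_noflush()
falls back to addr_to_node():

#include <assert.h>
#include <stdio.h>

#define BITS_PER_BYTE 8

static unsigned long encode_vn_id(int node_id)
{
	/* Can store U8_MAX [0:254] nodes. */
	return (unsigned long)(node_id + 1) << BITS_PER_BYTE;
}

static int decode_vn_id(unsigned long val)
{
	/* Can store U8_MAX [0:254] nodes. */
	return (int)(val >> BITS_PER_BYTE) - 1;
}

int main(void)
{
	/* Node id 3 survives a round trip even with low flag bits set. */
	unsigned long flags = encode_vn_id(3) | 0x1;

	assert(decode_vn_id(flags) == 3);

	/*
	 * A va->flags value that never had a node id encoded decodes
	 * to -1, i.e. "fall back to addr_to_node()".
	 */
	assert(decode_vn_id(0x1) == -1);

	printf("ok\n");
	return 0;
}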

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-11  3:58   ` Baoquan He
@ 2023-09-11 18:16     ` Uladzislau Rezki
  2023-09-12 13:42       ` Baoquan He
  2023-09-13 10:59       ` Baoquan He
  0 siblings, 2 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-11 18:16 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Extend the vread_iter() to be able to perform a sequential
> > reading of VAs which are spread among multiple nodes. So a
> > data read over the /dev/kmem correctly reflects a vmalloc
> > memory layout.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> >  1 file changed, 53 insertions(+), 14 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 4fd4915c532d..968144c16237 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> ......  
> > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> >  
> >  	remains = count;
> >  
> > -	/* Hooked to node_0 so far. */
> > -	vn = addr_to_node(0);
> > -	spin_lock(&vn->busy.lock);
> 
> This could change the vread behaviour a little bit. Before, once we take
> vmap_area_lock, the vread will read out the content of snapshot at the
> moment. Now, reading out in one node's tree won't disrupt other nodes'
> tree accessing. Not sure if this matters when people need access
> /proc/kcore, e.g dynamic debugging.
>
With one big tree you anyway drop the lock after one cycle of reading.
As far as I can see, kcore.c's read granularity is PAGE_SIZE.

Please correct me if I am wrong.

> 
> And, the reading will be a little slower because each va finding need
> iterate all vmap_nodes[].
> 
Right. It is a bit tough here, because we have multiple nodes which
represent zones (address spaces), i.e. there is an offset between them.
It means that fully reading one tree will not provide a sequential
reading.
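
To illustrate that point, here is a rough userspace sketch. It assumes
a striping rule of the form (addr / node_size) % nr_nodes -- the real
addr_to_node_id() is not quoted in this thread, so treat the rule and
all the numbers as assumptions. Walking a single node's busy tree then
skips every stripe owned by the other nodes, which is why a sequential
read has to consult all of them:

#include <stdio.h>

int main(void)
{
	unsigned long node_size = 4UL << 20;		/* 4M, example value    */
	unsigned int nr_nodes = 4;			/* example node count   */
	unsigned long base = 0xffffc90000000000UL;	/* example vmalloc base */
	int i;

	for (i = 0; i < 8; i++) {
		unsigned long addr = base + i * node_size;
		unsigned int id = (addr / node_size) % nr_nodes;

		printf("block %d (addr %#lx) -> node %u\n", i, addr, id);
	}

	/*
	 * Consecutive node_size blocks land on nodes 0,1,2,3,0,1,2,3,...
	 * so node 0 alone only sees every 4th block of the range.
	 */
	return 0;
}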

>
> Otherwise, the patch itself looks good to me.
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
>
Applied.

Thank you for looking at it!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree
  2023-09-11 16:53     ` Uladzislau Rezki
@ 2023-09-12 13:19       ` Baoquan He
  0 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-12 13:19 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/11/23 at 06:53pm, Uladzislau Rezki wrote:
> On Mon, Sep 11, 2023 at 10:38:29AM +0800, Baoquan He wrote:
> > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > Store allocated objects in a separate nodes. A va->va_start
> > > address is converted into a correct node where it should
> > > be placed and resided. An addr_to_node() function is used
> > > to do a proper address conversion to determine a node that
> > > contains a VA.
> > > 
> > > Such approach balances VAs across nodes as a result an access
> > > becomes scalable. Number of nodes in a system depends on number
> > > of CPUs divided by two. The density factor in this case is 1/2.
> > > 
> > > Please note:
> > > 
> > > 1. As of now allocated VAs are bound to a node-0. It means the
> > >    patch does not give any difference comparing with a current
> > >    behavior;
> > > 
> > > 2. The global vmap_area_lock, vmap_area_root are removed as there
> > >    is no need in it anymore. The vmap_area_list is still kept and
> > >    is _empty_. It is exported for a kexec only;
> > > 
> > > 3. The vmallocinfo and vread() have to be reworked to be able to
> > >    handle multiple nodes.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 209 +++++++++++++++++++++++++++++++++++++++------------
> > >  1 file changed, 161 insertions(+), 48 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index b7deacca1483..ae0368c314ff 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -728,11 +728,9 @@ EXPORT_SYMBOL(vmalloc_to_pfn);
> > >  #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0
> > >  
> > >  
> > > -static DEFINE_SPINLOCK(vmap_area_lock);
> > >  static DEFINE_SPINLOCK(free_vmap_area_lock);
> > >  /* Export for kexec only */
> > >  LIST_HEAD(vmap_area_list);
> > > -static struct rb_root vmap_area_root = RB_ROOT;
> > >  static bool vmap_initialized __read_mostly;
> > >  
> > >  static struct rb_root purge_vmap_area_root = RB_ROOT;
> > > @@ -772,6 +770,38 @@ static struct rb_root free_vmap_area_root = RB_ROOT;
> > >   */
> > >  static DEFINE_PER_CPU(struct vmap_area *, ne_fit_preload_node);
> > >  
> > > +/*
> > > + * An effective vmap-node logic. Users make use of nodes instead
> > > + * of a global heap. It allows to balance an access and mitigate
> > > + * contention.
> > > + */
> > > +struct rb_list {
> > > +	struct rb_root root;
> > > +	struct list_head head;
> > > +	spinlock_t lock;
> > > +};
> > > +
> > > +struct vmap_node {
> > > +	/* Bookkeeping data of this node. */
> > > +	struct rb_list busy;
> > > +};
> > > +
> > > +static struct vmap_node *nodes, snode;
> > > +static __read_mostly unsigned int nr_nodes = 1;
> > > +static __read_mostly unsigned int node_size = 1;
> > 
> > It could be better if calling these global variables a meaningful name,
> > e.g vmap_nodes, static_vmap_nodes, nr_vmap_nodes. When I use vim+cscope
> > to reference them, it gives me a super long list. Aside from that, a
> > simple name often makes me mistake it as a local virable. A weak
> > opinion.
> > 
> I am OK to add "vmap_" prefix:
> 
> vmap_nodes;
> vmap_nr_nodes;
           ~ nr_vmap_nodes?
> vmap_node_size;
> ..
> 
> If you are not OK with that, feel free to propose other variants.

Other than the nr_nodes one, the others look good to me, thanks a lot.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock
  2023-09-11 17:10     ` Uladzislau Rezki
@ 2023-09-12 13:21       ` Baoquan He
  0 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-12 13:21 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/11/23 at 07:10pm, Uladzislau Rezki wrote:
> On Mon, Sep 11, 2023 at 11:25:01AM +0800, Baoquan He wrote:
> > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > Concurrent access to a global vmap space is a bottle-neck.
> > > We can simulate a high contention by running a vmalloc test
> > > suite.
> > > 
> > > To address it, introduce an effective vmap node logic. Each
> > > node behaves as independent entity. When a node is accessed
> > > it serves a request directly(if possible) also it can fetch
> > > a new block from a global heap to its internals if no space
> > > or low capacity is left.
> > > 
> > > This technique reduces a pressure on the global vmap lock.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 316 +++++++++++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 279 insertions(+), 37 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 5a8a9c1370b6..4fd4915c532d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -779,6 +779,7 @@ struct rb_list {
> > >  
> > >  struct vmap_node {
> > >  	/* Bookkeeping data of this node. */
> > > +	struct rb_list free;
> > >  	struct rb_list busy;
> > >  	struct rb_list lazy;
> > >  
> > > @@ -786,6 +787,13 @@ struct vmap_node {
> > >  	 * Ready-to-free areas.
> > >  	 */
> > >  	struct list_head purge_list;
> > > +	struct work_struct purge_work;
> > > +	unsigned long nr_purged;
> > > +
> > > +	/*
> > > +	 * Control that only one user can pre-fetch this node.
> > > +	 */
> > > +	atomic_t fill_in_progress;
> > >  };
> > >  
> > >  static struct vmap_node *nodes, snode;
> > > @@ -804,6 +812,32 @@ addr_to_node(unsigned long addr)
> > >  	return &nodes[addr_to_node_id(addr)];
> > >  }
> > >  
> > > +static inline struct vmap_node *
> > > +id_to_node(int id)
> > > +{
> > > +	return &nodes[id % nr_nodes];
> > > +}
> > > +
> > > +static inline int
> > > +this_node_id(void)
> > > +{
> > > +	return raw_smp_processor_id() % nr_nodes;
> > > +}
> > > +
> > > +static inline unsigned long
> > > +encode_vn_id(int node_id)
> > > +{
> > > +	/* Can store U8_MAX [0:254] nodes. */
> > > +	return (node_id + 1) << BITS_PER_BYTE;
> > > +}
> > > +
> > > +static inline int
> > > +decode_vn_id(unsigned long val)
> > > +{
> > > +	/* Can store U8_MAX [0:254] nodes. */
> > > +	return (val >> BITS_PER_BYTE) - 1;
> > > +}
> > 
> > This patch looks good to me. However, should we split out the encoding
> > vn_id into va->flags optimization into another patch? It looks like an
> > independent optimization which can be described better with specific
> > log. At least, in the pdf file pasted or patch log, it's not obvious
> > that:
> > 1) node's free tree could contains any address range;
> > 2) nodes' busy tree only contains address range belonging to this node;
> >    - could contain crossing node range, corner case.
> > 3) nodes' purge tree could contain any address range;
> >    - decided by encoded vn_id in va->flags.
> >    - decided by address via addr_to_node(va->va_start).
> > 
> > Personal opinion, feel it will make reviewing easier.
> > 
> Sure, if it is easier to review, then i will split these two parts.
> All three statements are correct and valid. The pdf file only covers
> v1, so it is not up to date.
> 
> Anyway i will update a cover letter in v3 with more details.

Maybe providing these details in the patch log or cover letter is
enough. I'll leave it to you to decide whether splitting the patch is
still needed. Thanks.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-11 18:16     ` Uladzislau Rezki
@ 2023-09-12 13:42       ` Baoquan He
  2023-09-13 15:42         ` Uladzislau Rezki
  2023-09-13 10:59       ` Baoquan He
  1 sibling, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-12 13:42 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > Extend the vread_iter() to be able to perform a sequential
> > > reading of VAs which are spread among multiple nodes. So a
> > > data read over the /dev/kmem correctly reflects a vmalloc
> > > memory layout.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 4fd4915c532d..968144c16237 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > ......  
> > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > >  
> > >  	remains = count;
> > >  
> > > -	/* Hooked to node_0 so far. */
> > > -	vn = addr_to_node(0);
> > > -	spin_lock(&vn->busy.lock);
> > 
> > This could change the vread behaviour a little bit. Before, once we take
> > vmap_area_lock, the vread will read out the content of snapshot at the
> > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > tree accessing. Not sure if this matters when people need access
> > /proc/kcore, e.g dynamic debugging.
> >
> With one big tree you anyway drop the lock after one cycle of reading.
> As far as i see, kcore.c's read granularity is a PAGE_SIZE.

In my understanding, kcore reading of vmalloc memory does read page by
page; it continues after reading one page if the required size is
bigger than one page. Please see the aligned_vread_iter() code. Before
this patch, vmap_area_lock is held during the complete process.

> 
> > 
> > And, the reading will be a little slower because each va finding need
> > iterate all vmap_nodes[].
> > 
> Right. It is a bit tough here, because we have multiple nodes which
> represent zones(address space), i.e. there is an offset between them,
> it means that, reading fully one tree, will not provide a sequential
> reading.

Understood. I suppose the kcore reading of vmalloc memory is not
critical. If I get a chance to test on a machine with 256 CPUs, I will
report here.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-11 18:16     ` Uladzislau Rezki
  2023-09-12 13:42       ` Baoquan He
@ 2023-09-13 10:59       ` Baoquan He
  2023-09-13 15:38         ` Uladzislau Rezki
  1 sibling, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-13 10:59 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > Extend the vread_iter() to be able to perform a sequential
> > > reading of VAs which are spread among multiple nodes. So a
> > > data read over the /dev/kmem correctly reflects a vmalloc
> > > memory layout.
> > > 
> > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > ---
> > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 4fd4915c532d..968144c16237 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > ......  
> > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > >  
> > >  	remains = count;
> > >  
> > > -	/* Hooked to node_0 so far. */
> > > -	vn = addr_to_node(0);
> > > -	spin_lock(&vn->busy.lock);
> > 
> > This could change the vread behaviour a little bit. Before, once we take
> > vmap_area_lock, the vread will read out the content of snapshot at the
> > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > tree accessing. Not sure if this matters when people need access
> > /proc/kcore, e.g dynamic debugging.
> >
> With one big tree you anyway drop the lock after one cycle of reading.
> As far as i see, kcore.c's read granularity is a PAGE_SIZE.

You are right, kcore.c's reading granularity is indeed PAGE_SIZE.
I don't know procfs well and still need to study the code. So the
multiple nodes in vread_iter() don't matter much here. Sorry for the
noise.

static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
{  
	......
        start = kc_offset_to_vaddr(*fpos - data_offset);
        if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
                tsz = buflen;

	m = NULL;
        while (buflen) {
	}
	...
}
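
The tsz clamp above is easy to try out on its own. A tiny userspace
sketch with made-up values (PAGE_SIZE pinned to 4096 here just for
illustration), showing that each iteration reads at most up to the
next page boundary:

#include <stdio.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long start = 0x1234;	/* example, not page aligned */
	size_t buflen = 10000;		/* example requested length  */
	size_t tsz;

	tsz = PAGE_SIZE - (start & ~PAGE_MASK);
	if (tsz > buflen)
		tsz = buflen;

	/* 0x1234 & 0xfff = 0x234, so the first chunk is 4096 - 0x234 = 3532 bytes. */
	printf("first chunk: %zu bytes\n", tsz);
	return 0;
}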


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-13 10:59       ` Baoquan He
@ 2023-09-13 15:38         ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-13 15:38 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Wed, Sep 13, 2023 at 06:59:42PM +0800, Baoquan He wrote:
> On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> > On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > > Extend the vread_iter() to be able to perform a sequential
> > > > reading of VAs which are spread among multiple nodes. So a
> > > > data read over the /dev/kmem correctly reflects a vmalloc
> > > > memory layout.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > ---
> > > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 4fd4915c532d..968144c16237 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > ......  
> > > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > > >  
> > > >  	remains = count;
> > > >  
> > > > -	/* Hooked to node_0 so far. */
> > > > -	vn = addr_to_node(0);
> > > > -	spin_lock(&vn->busy.lock);
> > > 
> > > This could change the vread behaviour a little bit. Before, once we take
> > > vmap_area_lock, the vread will read out the content of snapshot at the
> > > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > > tree accessing. Not sure if this matters when people need access
> > > /proc/kcore, e.g dynamic debugging.
> > >
> > With one big tree you anyway drop the lock after one cycle of reading.
> > As far as i see, kcore.c's read granularity is a PAGE_SIZE.
> 
> You are right, kcore.c's reading granularity is truly PAGE_SIZE.
> I don't know procfs well, still need to study the code. Then it doesn't
> matter much with the multiple nodes in vread_iter(). Sorry for the noise.
> 
> static ssize_t read_kcore_iter(struct kiocb *iocb, struct iov_iter *iter)
> {  
> 	......
>         start = kc_offset_to_vaddr(*fpos - data_offset);
>         if ((tsz = (PAGE_SIZE - (start & ~PAGE_MASK))) > buflen)
>                 tsz = buflen;
> 
> 	m = NULL;
>         while (buflen) {
> 	}
> 	...
> }
> 
Good. Then we are on the same page :)

Thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-12 13:42       ` Baoquan He
@ 2023-09-13 15:42         ` Uladzislau Rezki
  2023-09-14  3:02           ` Baoquan He
  2023-09-14  3:36           ` Baoquan He
  0 siblings, 2 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-13 15:42 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Tue, Sep 12, 2023 at 09:42:32PM +0800, Baoquan He wrote:
> On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> > On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > > Extend the vread_iter() to be able to perform a sequential
> > > > reading of VAs which are spread among multiple nodes. So a
> > > > data read over the /dev/kmem correctly reflects a vmalloc
> > > > memory layout.
> > > > 
> > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > ---
> > > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 4fd4915c532d..968144c16237 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > ......  
> > > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > > >  
> > > >  	remains = count;
> > > >  
> > > > -	/* Hooked to node_0 so far. */
> > > > -	vn = addr_to_node(0);
> > > > -	spin_lock(&vn->busy.lock);
> > > 
> > > This could change the vread behaviour a little bit. Before, once we take
> > > vmap_area_lock, the vread will read out the content of snapshot at the
> > > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > > tree accessing. Not sure if this matters when people need access
> > > /proc/kcore, e.g dynamic debugging.
> > >
> > With one big tree you anyway drop the lock after one cycle of reading.
> > As far as i see, kcore.c's read granularity is a PAGE_SIZE.
> 
> With my understanding, kcore reading on vmalloc does read page by page,
> it will continue after one page reading if the required size is bigger
> than one page. Please see aligned_vread_iter() code. During the complete
> process, vmap_area_lock is held before this patch.
> 
> > 
> > > 
> > > And, the reading will be a little slower because each va finding need
> > > iterate all vmap_nodes[].
> > > 
> > Right. It is a bit tough here, because we have multiple nodes which
> > represent zones(address space), i.e. there is an offset between them,
> > it means that, reading fully one tree, will not provide a sequential
> > reading.
> 
> Understood. Suppose the kcore reading on vmalloc is not critical. If I
> get chance to test on a machine with 256 cpu, I will report here.
> 
It would be great! Unfortunately I do not have access to such big
systems. What I have is a 64-CPU system at most. If you, by chance, can
test on bigger systems or can provide temporary ssh access, that would
be awesome.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-13 15:42         ` Uladzislau Rezki
@ 2023-09-14  3:02           ` Baoquan He
  2023-09-14  3:36           ` Baoquan He
  1 sibling, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-14  3:02 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/13/23 at 05:42pm, Uladzislau Rezki wrote:
> On Tue, Sep 12, 2023 at 09:42:32PM +0800, Baoquan He wrote:
> > On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> > > On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > > > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > > > Extend the vread_iter() to be able to perform a sequential
> > > > > reading of VAs which are spread among multiple nodes. So a
> > > > > data read over the /dev/kmem correctly reflects a vmalloc
> > > > > memory layout.
> > > > > 
> > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > ---
> > > > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > > > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 4fd4915c532d..968144c16237 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > ......  
> > > > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > > > >  
> > > > >  	remains = count;
> > > > >  
> > > > > -	/* Hooked to node_0 so far. */
> > > > > -	vn = addr_to_node(0);
> > > > > -	spin_lock(&vn->busy.lock);
> > > > 
> > > > This could change the vread behaviour a little bit. Before, once we take
> > > > vmap_area_lock, the vread will read out the content of snapshot at the
> > > > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > > > tree accessing. Not sure if this matters when people need access
> > > > /proc/kcore, e.g dynamic debugging.
> > > >
> > > With one big tree you anyway drop the lock after one cycle of reading.
> > > As far as i see, kcore.c's read granularity is a PAGE_SIZE.
> > 
> > With my understanding, kcore reading on vmalloc does read page by page,
> > it will continue after one page reading if the required size is bigger
> > than one page. Please see aligned_vread_iter() code. During the complete
> > process, vmap_area_lock is held before this patch.
> > 
> > > 
> > > > 
> > > > And, the reading will be a little slower because each va finding need
> > > > iterate all vmap_nodes[].
> > > > 
> > > Right. It is a bit tough here, because we have multiple nodes which
> > > represent zones(address space), i.e. there is an offset between them,
> > > it means that, reading fully one tree, will not provide a sequential
> > > reading.
> > 
> > Understood. Suppose the kcore reading on vmalloc is not critical. If I
> > get chance to test on a machine with 256 cpu, I will report here.
> > 
> It would be great! Unfortunately i do not have an access to such big
> systems. What i have is 64 CPUs max system. If you, by chance can test
> on bigger systems or can provide a temporary ssh access that would be
> awesome.

I got one with 288 CPUs, and have sent you the IP address in a private mail.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-13 15:42         ` Uladzislau Rezki
  2023-09-14  3:02           ` Baoquan He
@ 2023-09-14  3:36           ` Baoquan He
  2023-09-14  3:38             ` Baoquan He
  1 sibling, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-14  3:36 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/13/23 at 05:42pm, Uladzislau Rezki wrote:
> On Tue, Sep 12, 2023 at 09:42:32PM +0800, Baoquan He wrote:
> > On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> > > On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > > > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > > > Extend the vread_iter() to be able to perform a sequential
> > > > > reading of VAs which are spread among multiple nodes. So a
> > > > > data read over the /dev/kmem correctly reflects a vmalloc
> > > > > memory layout.
> > > > > 
> > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > ---
> > > > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > > > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > > > 
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index 4fd4915c532d..968144c16237 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > ......  
> > > > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > > > >  
> > > > >  	remains = count;
> > > > >  
> > > > > -	/* Hooked to node_0 so far. */
> > > > > -	vn = addr_to_node(0);
> > > > > -	spin_lock(&vn->busy.lock);
> > > > 
> > > > This could change the vread behaviour a little bit. Before, once we take
> > > > vmap_area_lock, the vread will read out the content of snapshot at the
> > > > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > > > tree accessing. Not sure if this matters when people need access
> > > > /proc/kcore, e.g dynamic debugging.
> > > >
> > > With one big tree you anyway drop the lock after one cycle of reading.
> > > As far as i see, kcore.c's read granularity is a PAGE_SIZE.
> > 
> > With my understanding, kcore reading on vmalloc does read page by page,
> > it will continue after one page reading if the required size is bigger
> > than one page. Please see aligned_vread_iter() code. During the complete
> > process, vmap_area_lock is held before this patch.
> > 
> > > 
> > > > 
> > > > And, the reading will be a little slower because each va finding need
> > > > iterate all vmap_nodes[].
> > > > 
> > > Right. It is a bit tough here, because we have multiple nodes which
> > > represent zones(address space), i.e. there is an offset between them,
> > > it means that, reading fully one tree, will not provide a sequential
> > > reading.
> > 
> > Understood. Suppose the kcore reading on vmalloc is not critical. If I
> > get chance to test on a machine with 256 cpu, I will report here.
> > 
> It would be great! Unfortunately i do not have an access to such big
> systems. What i have is 64 CPUs max system. If you, by chance can test
> on bigger systems or can provide a temporary ssh access that would be
> awesome.

10.16.216.205
user:root
password:redhat

This is a testing server in our lab; we apply for usage each time and
it reinstalls the OS, so the root user should be OK. I will take it for
two days.

If access is not available, I can do some testing if you want me to
run some commands.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter
  2023-09-14  3:36           ` Baoquan He
@ 2023-09-14  3:38             ` Baoquan He
  0 siblings, 0 replies; 74+ messages in thread
From: Baoquan He @ 2023-09-14  3:38 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 09/14/23 at 11:36am, Baoquan He wrote:
> On 09/13/23 at 05:42pm, Uladzislau Rezki wrote:
> > On Tue, Sep 12, 2023 at 09:42:32PM +0800, Baoquan He wrote:
> > > On 09/11/23 at 08:16pm, Uladzislau Rezki wrote:
> > > > On Mon, Sep 11, 2023 at 11:58:13AM +0800, Baoquan He wrote:
> > > > > On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > > > > > Extend the vread_iter() to be able to perform a sequential
> > > > > > reading of VAs which are spread among multiple nodes. So a
> > > > > > data read over the /dev/kmem correctly reflects a vmalloc
> > > > > > memory layout.
> > > > > > 
> > > > > > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > ---
> > > > > >  mm/vmalloc.c | 67 +++++++++++++++++++++++++++++++++++++++++-----------
> > > > > >  1 file changed, 53 insertions(+), 14 deletions(-)
> > > > > > 
> > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > > index 4fd4915c532d..968144c16237 100644
> > > > > > --- a/mm/vmalloc.c
> > > > > > +++ b/mm/vmalloc.c
> > > > > ......  
> > > > > > @@ -4057,19 +4093,15 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
> > > > > >  
> > > > > >  	remains = count;
> > > > > >  
> > > > > > -	/* Hooked to node_0 so far. */
> > > > > > -	vn = addr_to_node(0);
> > > > > > -	spin_lock(&vn->busy.lock);
> > > > > 
> > > > > This could change the vread behaviour a little bit. Before, once we take
> > > > > vmap_area_lock, the vread will read out the content of snapshot at the
> > > > > moment. Now, reading out in one node's tree won't disrupt other nodes'
> > > > > tree accessing. Not sure if this matters when people need access
> > > > > /proc/kcore, e.g dynamic debugging.
> > > > >
> > > > With one big tree you anyway drop the lock after one cycle of reading.
> > > > As far as i see, kcore.c's read granularity is a PAGE_SIZE.
> > > 
> > > With my understanding, kcore reading on vmalloc does read page by page,
> > > it will continue after one page reading if the required size is bigger
> > > than one page. Please see aligned_vread_iter() code. During the complete
> > > process, vmap_area_lock is held before this patch.
> > > 
> > > > 
> > > > > 
> > > > > And, the reading will be a little slower because each va finding need
> > > > > iterate all vmap_nodes[].
> > > > > 
> > > > Right. It is a bit tough here, because we have multiple nodes which
> > > > represent zones(address space), i.e. there is an offset between them,
> > > > it means that, reading fully one tree, will not provide a sequential
> > > > reading.
> > > 
> > > Understood. Suppose the kcore reading on vmalloc is not critical. If I
> > > get chance to test on a machine with 256 cpu, I will report here.
> > > 
> > It would be great! Unfortunately i do not have an access to such big
> > systems. What i have is 64 CPUs max system. If you, by chance can test
> > on bigger systems or can provide a temporary ssh access that would be
> > awesome.
> 
> 10.16.216.205
> user:root
> password:redhat
> 
> This is a testing server in our lab, we apply for usage each time and it
> will reinstall OS, root user should be OK. I will take it for two days.

Oops, I sent it out publicly.

> 
> If accessing is not available, I can do some testing if you want me to
> run some commands.
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo
  2023-08-29  8:11 ` [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo Uladzislau Rezki (Sony)
@ 2023-09-15 13:02   ` Baoquan He
  2023-09-15 18:32     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-15 13:02 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> Allocated areas are spread among nodes, it implies that
> the scanning has to be performed individually of each node
> in order to dump all existing VAs.
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 120 ++++++++++++++++++++-------------------------------
>  1 file changed, 47 insertions(+), 73 deletions(-)

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 968144c16237..9cce012aecdb 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -4636,30 +4636,6 @@ bool vmalloc_dump_obj(void *object)
>  #endif
>  
>  #ifdef CONFIG_PROC_FS
> -static void *s_start(struct seq_file *m, loff_t *pos)
> -{
> -	struct vmap_node *vn = addr_to_node(0);
> -
> -	mutex_lock(&vmap_purge_lock);
> -	spin_lock(&vn->busy.lock);
> -
> -	return seq_list_start(&vn->busy.head, *pos);
> -}
> -
> -static void *s_next(struct seq_file *m, void *p, loff_t *pos)
> -{
> -	struct vmap_node *vn = addr_to_node(0);
> -	return seq_list_next(p, &vn->busy.head, pos);
> -}
> -
> -static void s_stop(struct seq_file *m, void *p)
> -{
> -	struct vmap_node *vn = addr_to_node(0);
> -
> -	spin_unlock(&vn->busy.lock);
> -	mutex_unlock(&vmap_purge_lock);
> -}
> -
>  static void show_numa_info(struct seq_file *m, struct vm_struct *v)
>  {
>  	if (IS_ENABLED(CONFIG_NUMA)) {
> @@ -4703,84 +4679,82 @@ static void show_purge_info(struct seq_file *m)
>  	}
>  }
>  
> -static int s_show(struct seq_file *m, void *p)
> +static int vmalloc_info_show(struct seq_file *m, void *p)
>  {
>  	struct vmap_node *vn;
>  	struct vmap_area *va;
>  	struct vm_struct *v;
> +	int i;
>  
> -	vn = addr_to_node(0);
> -	va = list_entry(p, struct vmap_area, list);
> +	for (i = 0; i < nr_nodes; i++) {
> +		vn = &nodes[i];
>  
> -	if (!va->vm) {
> -		if (va->flags & VMAP_RAM)
> -			seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
> -				(void *)va->va_start, (void *)va->va_end,
> -				va->va_end - va->va_start);
> +		spin_lock(&vn->busy.lock);
> +		list_for_each_entry(va, &vn->busy.head, list) {
> +			if (!va->vm) {
> +				if (va->flags & VMAP_RAM)
> +					seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
> +						(void *)va->va_start, (void *)va->va_end,
> +						va->va_end - va->va_start);
>  
> -		goto final;
> -	}
> +				continue;
> +			}
>  
> -	v = va->vm;
> +			v = va->vm;
>  
> -	seq_printf(m, "0x%pK-0x%pK %7ld",
> -		v->addr, v->addr + v->size, v->size);
> +			seq_printf(m, "0x%pK-0x%pK %7ld",
> +				v->addr, v->addr + v->size, v->size);
>  
> -	if (v->caller)
> -		seq_printf(m, " %pS", v->caller);
> +			if (v->caller)
> +				seq_printf(m, " %pS", v->caller);
>  
> -	if (v->nr_pages)
> -		seq_printf(m, " pages=%d", v->nr_pages);
> +			if (v->nr_pages)
> +				seq_printf(m, " pages=%d", v->nr_pages);
>  
> -	if (v->phys_addr)
> -		seq_printf(m, " phys=%pa", &v->phys_addr);
> +			if (v->phys_addr)
> +				seq_printf(m, " phys=%pa", &v->phys_addr);
>  
> -	if (v->flags & VM_IOREMAP)
> -		seq_puts(m, " ioremap");
> +			if (v->flags & VM_IOREMAP)
> +				seq_puts(m, " ioremap");
>  
> -	if (v->flags & VM_ALLOC)
> -		seq_puts(m, " vmalloc");
> +			if (v->flags & VM_ALLOC)
> +				seq_puts(m, " vmalloc");
>  
> -	if (v->flags & VM_MAP)
> -		seq_puts(m, " vmap");
> +			if (v->flags & VM_MAP)
> +				seq_puts(m, " vmap");
>  
> -	if (v->flags & VM_USERMAP)
> -		seq_puts(m, " user");
> +			if (v->flags & VM_USERMAP)
> +				seq_puts(m, " user");
>  
> -	if (v->flags & VM_DMA_COHERENT)
> -		seq_puts(m, " dma-coherent");
> +			if (v->flags & VM_DMA_COHERENT)
> +				seq_puts(m, " dma-coherent");
>  
> -	if (is_vmalloc_addr(v->pages))
> -		seq_puts(m, " vpages");
> +			if (is_vmalloc_addr(v->pages))
> +				seq_puts(m, " vpages");
>  
> -	show_numa_info(m, v);
> -	seq_putc(m, '\n');
> +			show_numa_info(m, v);
> +			seq_putc(m, '\n');
> +		}
> +		spin_unlock(&vn->busy.lock);
> +	}
>  
>  	/*
>  	 * As a final step, dump "unpurged" areas.
>  	 */
> -final:
> -	if (list_is_last(&va->list, &vn->busy.head))
> -		show_purge_info(m);
> -
> +	show_purge_info(m);
>  	return 0;
>  }
>  
> -static const struct seq_operations vmalloc_op = {
> -	.start = s_start,
> -	.next = s_next,
> -	.stop = s_stop,
> -	.show = s_show,
> -};
> -
>  static int __init proc_vmalloc_init(void)
>  {
> +	void *priv_data = NULL;
> +
>  	if (IS_ENABLED(CONFIG_NUMA))
> -		proc_create_seq_private("vmallocinfo", 0400, NULL,
> -				&vmalloc_op,
> -				nr_node_ids * sizeof(unsigned int), NULL);
> -	else
> -		proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op);
> +		priv_data = kmalloc(nr_node_ids * sizeof(unsigned int), GFP_KERNEL);
> +
> +	proc_create_single_data("vmallocinfo",
> +		0400, NULL, vmalloc_info_show, priv_data);
> +
>  	return 0;
>  }
>  module_init(proc_vmalloc_init);
> -- 
> 2.30.2
> 
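
As a side note on the seq_ops -> single-show conversion above:
proc_create_single_data() only needs one show callback for a read-only
file. A bare-bones module sketch of that pattern (the entry name and
its contents are made up; this is not the vmallocinfo code itself):

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

/* Single show callback: invoked for every read of the proc entry. */
static int demo_info_show(struct seq_file *m, void *p)
{
	seq_puts(m, "hello from a single-show proc entry\n");
	return 0;
}

static int __init demo_init(void)
{
	proc_create_single_data("demo_info", 0400, NULL, demo_info_show, NULL);
	return 0;
}

static void __exit demo_exit(void)
{
	remove_proc_entry("demo_info", NULL);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");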


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores
  2023-08-29  8:11 ` [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores Uladzislau Rezki (Sony)
@ 2023-09-15 13:03   ` Baoquan He
  2023-09-15 18:31     ` Uladzislau Rezki
  0 siblings, 1 reply; 74+ messages in thread
From: Baoquan He @ 2023-09-15 13:03 UTC (permalink / raw)
  To: Uladzislau Rezki (Sony)
  Cc: linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
......
> real    1m28.382s
> user    0m0.014s
> sys     0m0.026s
> urezki@pc638:~$
> 
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> ---
>  mm/vmalloc.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 9cce012aecdb..08990f630c21 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -796,6 +796,9 @@ struct vmap_node {
>  	atomic_t fill_in_progress;
>  };
>  
> +#define MAX_NODES U8_MAX
> +#define MAX_NODE_SIZE SZ_4M
> +
>  static struct vmap_node *nodes, snode;
>  static __read_mostly unsigned int nr_nodes = 1;
>  static __read_mostly unsigned int node_size = 1;
> @@ -4803,11 +4806,24 @@ static void vmap_init_free_space(void)
>  	}
>  }
>  
> +static unsigned int calculate_nr_nodes(void)
> +{
> +	unsigned int nr_cpus;
> +
> +	nr_cpus = num_present_cpus();
> +	if (nr_cpus <= 1)
> +		nr_cpus = num_possible_cpus();
> +
> +	/* Density factor. Two users per a node. */
> +	return clamp_t(unsigned int, nr_cpus >> 1, 1, MAX_NODES);
> +}
> +
>  static void vmap_init_nodes(void)
>  {
>  	struct vmap_node *vn;
>  	int i;
>  
> +	nr_nodes = calculate_nr_nodes();
>  	nodes = &snode;
>  
>  	if (nr_nodes > 1) {
> @@ -4830,6 +4846,16 @@ static void vmap_init_nodes(void)
>  		INIT_LIST_HEAD(&vn->free.head);
>  		spin_lock_init(&vn->free.lock);
>  	}
> +
> +	/*
> +	 * Scale a node size to number of CPUs. Each power of two
> +	 * value doubles a node size. A high-threshold limit is set
> +	 * to 4M.
> +	 */
> +#if BITS_PER_LONG == 64
> +	if (nr_nodes > 1)
> +		node_size = min(SZ_64K << fls(num_possible_cpus()), SZ_4M);
> +#endif
>  }
>  
>  void __init vmalloc_init(void)
> -- 
> 2.30.2
> 
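
To visualize how the two knobs above scale, here is a quick userspace
sketch; fls() is re-implemented locally, a single CPU count is used for
both calculations (a simplification -- the patch uses
num_present_cpus()/num_possible_cpus()), and the CPU counts are
arbitrary examples:

#include <stdio.h>

#define MAX_NODES	255U			/* U8_MAX */
#define SZ_64K		(64UL * 1024)
#define SZ_4M		(4UL * 1024 * 1024)

/* Local stand-in for the kernel's fls(): 1-based index of the highest set bit. */
static unsigned int fls_local(unsigned int x)
{
	unsigned int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

int main(void)
{
	unsigned int cpus[] = { 8, 32, 128, 512 };
	unsigned int i;

	for (i = 0; i < 4; i++) {
		unsigned int nr_cpus = cpus[i];

		/* Density factor: two CPUs per node, clamped to [1, MAX_NODES]. */
		unsigned int nr_nodes = nr_cpus >> 1;
		if (nr_nodes < 1)
			nr_nodes = 1;
		if (nr_nodes > MAX_NODES)
			nr_nodes = MAX_NODES;

		/*
		 * Each power of two in the CPU count doubles the node size,
		 * capped at 4M; with a single node the default size is kept.
		 */
		unsigned long node_size = 1;
		if (nr_nodes > 1) {
			node_size = SZ_64K << fls_local(nr_cpus);
			if (node_size > SZ_4M)
				node_size = SZ_4M;
		}

		printf("cpus=%3u nodes=%3u node_size=%4lu KB\n",
		       nr_cpus, nr_nodes, node_size / 1024);
	}
	return 0;
}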


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores
  2023-09-15 13:03   ` Baoquan He
@ 2023-09-15 18:31     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-15 18:31 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Fri, Sep 15, 2023 at 09:03:29PM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> ......
> > real    1m28.382s
> > user    0m0.014s
> > sys     0m0.026s
> > urezki@pc638:~$
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 26 ++++++++++++++++++++++++++
> >  1 file changed, 26 insertions(+)
> 
> LGTM,
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Applied. Thank you!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo
  2023-09-15 13:02   ` Baoquan He
@ 2023-09-15 18:32     ` Uladzislau Rezki
  0 siblings, 0 replies; 74+ messages in thread
From: Uladzislau Rezki @ 2023-09-15 18:32 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki (Sony),
	linux-mm, Andrew Morton, LKML, Lorenzo Stoakes,
	Christoph Hellwig, Matthew Wilcox, Liam R . Howlett,
	Dave Chinner, Paul E . McKenney, Joel Fernandes,
	Oleksiy Avramchenko

On Fri, Sep 15, 2023 at 09:02:37PM +0800, Baoquan He wrote:
> On 08/29/23 at 10:11am, Uladzislau Rezki (Sony) wrote:
> > Allocated areas are spread among nodes, it implies that
> > the scanning has to be performed individually of each node
> > in order to dump all existing VAs.
> > 
> > Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > ---
> >  mm/vmalloc.c | 120 ++++++++++++++++++++-------------------------------
> >  1 file changed, 47 insertions(+), 73 deletions(-)
> 
> LGTM,
> 
> Reviewed-by: Baoquan He <bhe@redhat.com>
> 
Thank you for review, applied for v3.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2023-09-15 18:32 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-29  8:11 [PATCH v2 0/9] Mitigate a vmap lock contention v2 Uladzislau Rezki (Sony)
2023-08-29  8:11 ` [PATCH v2 1/9] mm: vmalloc: Add va_alloc() helper Uladzislau Rezki (Sony)
2023-09-06  5:51   ` Baoquan He
2023-09-06 15:06     ` Uladzislau Rezki
2023-08-29  8:11 ` [PATCH v2 2/9] mm: vmalloc: Rename adjust_va_to_fit_type() function Uladzislau Rezki (Sony)
2023-09-06  5:51   ` Baoquan He
2023-09-06 16:27     ` Uladzislau Rezki
2023-08-29  8:11 ` [PATCH v2 3/9] mm: vmalloc: Move vmap_init_free_space() down in vmalloc.c Uladzislau Rezki (Sony)
2023-09-06  5:52   ` Baoquan He
2023-09-06 16:29     ` Uladzislau Rezki
2023-08-29  8:11 ` [PATCH v2 4/9] mm: vmalloc: Remove global vmap_area_root rb-tree Uladzislau Rezki (Sony)
2023-08-29 14:30   ` kernel test robot
2023-08-30 14:48     ` Uladzislau Rezki
2023-09-07  2:17   ` Baoquan He
2023-09-07  2:17     ` Baoquan He
2023-09-07  9:38     ` Baoquan He
2023-09-07  9:38       ` Baoquan He
2023-09-07  9:40       ` Uladzislau Rezki
2023-09-07  9:40         ` Uladzislau Rezki
2023-09-07  9:39     ` Uladzislau Rezki
2023-09-07  9:39       ` Uladzislau Rezki
2023-09-07  9:58       ` Baoquan He
2023-09-07  9:58         ` Baoquan He
2023-09-08  1:51         ` HAGIO KAZUHITO(萩尾 一仁)
2023-09-08  1:51           ` HAGIO KAZUHITO(萩尾 一仁)
2023-09-08  4:43           ` Baoquan He
2023-09-08  4:43             ` Baoquan He
2023-09-08  5:01             ` HAGIO KAZUHITO(萩尾 一仁)
2023-09-08  5:01               ` HAGIO KAZUHITO(萩尾 一仁)
2023-09-08  6:44               ` Baoquan He
2023-09-08  6:44                 ` Baoquan He
2023-09-08 11:25                 ` Uladzislau Rezki
2023-09-08 11:25                   ` Uladzislau Rezki
2023-09-08 11:38                   ` Baoquan He
2023-09-08 11:38                     ` Baoquan He
2023-09-08 13:23                     ` Uladzislau Rezki
2023-09-08 13:23                       ` Uladzislau Rezki
2023-09-11  2:38   ` Baoquan He
2023-09-11 16:53     ` Uladzislau Rezki
2023-09-12 13:19       ` Baoquan He
2023-08-29  8:11 ` [PATCH v2 5/9] mm: vmalloc: Remove global purge_vmap_area_root rb-tree Uladzislau Rezki (Sony)
2023-09-11  2:57   ` Baoquan He
2023-09-11 17:00     ` Uladzislau Rezki
2023-08-29  8:11 ` [PATCH v2 6/9] mm: vmalloc: Offload free_vmap_area_lock lock Uladzislau Rezki (Sony)
2023-09-06  6:04   ` Baoquan He
2023-09-06 19:16     ` Uladzislau Rezki
2023-09-07  0:06       ` Baoquan He
2023-09-07  9:33         ` Uladzislau Rezki
2023-09-11  3:25   ` Baoquan He
2023-09-11 17:10     ` Uladzislau Rezki
2023-09-12 13:21       ` Baoquan He
2023-08-29  8:11 ` [PATCH v2 7/9] mm: vmalloc: Support multiple nodes in vread_iter Uladzislau Rezki (Sony)
2023-09-11  3:58   ` Baoquan He
2023-09-11 18:16     ` Uladzislau Rezki
2023-09-12 13:42       ` Baoquan He
2023-09-13 15:42         ` Uladzislau Rezki
2023-09-14  3:02           ` Baoquan He
2023-09-14  3:36           ` Baoquan He
2023-09-14  3:38             ` Baoquan He
2023-09-13 10:59       ` Baoquan He
2023-09-13 15:38         ` Uladzislau Rezki
2023-08-29  8:11 ` [PATCH v2 8/9] mm: vmalloc: Support multiple nodes in vmallocinfo Uladzislau Rezki (Sony)
2023-09-15 13:02   ` Baoquan He
2023-09-15 18:32     ` Uladzislau Rezki
2023-08-29  8:11 ` [PATCH v2 9/9] mm: vmalloc: Set nr_nodes/node_size based on CPU-cores Uladzislau Rezki (Sony)
2023-09-15 13:03   ` Baoquan He
2023-09-15 18:31     ` Uladzislau Rezki
2023-08-31  1:15 ` [PATCH v2 0/9] Mitigate a vmap lock contention v2 Baoquan He
2023-08-31 16:26   ` Uladzislau Rezki
2023-09-04 14:55 ` Uladzislau Rezki
2023-09-04 19:53   ` Andrew Morton
2023-09-05  6:53     ` Uladzislau Rezki
2023-09-06 20:04 ` Lorenzo Stoakes
2023-09-07  9:15   ` Uladzislau Rezki
