[PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading

* [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading
@ 2024-03-01 15:54 rulinhuang
  2024-03-01 15:54 ` [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened rulinhuang
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: rulinhuang @ 2024-03-01 15:54 UTC (permalink / raw)
  To: urezki, bhe
  Cc: akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	rulin.huang, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

Hi,

This version rearranges the macros relative to the previous one.

We are not sure whether we have moved all of these macros and their
corresponding helpers to the correct position. Could you please check
whether the placement is correct?

~

1. Motivation

When allocating a new memory area whose mapping address range is
known, vmap_node->busy.lock is acquired twice, although one of the
two acquisitions is unnecessary.

2. Design

Of the two acquisitions, the first occurs in alloc_vmap_area() when
the vm area is inserted into the vm mapping red-black tree, and the
second occurs in setup_vmalloc_vm() when the properties of the vm,
such as flags and address, are updated.

Combine these two operations in alloc_vmap_area(), which improves
scalability when vmap_node->busy.lock is contended. This reduces the
number of lock acquisitions on this path from two to one.
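
The design described above can be sketched in user space as follows. This is only an illustration under stated assumptions, not the kernel code: `struct area`, `insert_then_setup()`, `insert_and_setup()` and the `lock_count` counter are hypothetical names, and a pthread mutex stands in for vmap_node->busy.lock.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for a vmap area; not the kernel's struct vmap_area. */
struct area {
	unsigned long addr;
	unsigned long flags;
	int inserted;
};

/* A pthread mutex plays the role of vmap_node->busy.lock in this sketch. */
static pthread_mutex_t busy_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long lock_count;	/* acquisitions, for illustration only */

static void busy_lock_acquire(void)
{
	pthread_mutex_lock(&busy_lock);
	lock_count++;
}

static void busy_lock_release(void)
{
	pthread_mutex_unlock(&busy_lock);
}

/* Before: insertion and property setup each take the lock. */
static void insert_then_setup(struct area *a, unsigned long addr,
			      unsigned long flags)
{
	assert(a != NULL);

	busy_lock_acquire();
	a->addr = addr;		/* "insert into the tree" */
	a->inserted = 1;
	busy_lock_release();

	busy_lock_acquire();
	a->flags = flags;	/* "set up the properties" */
	busy_lock_release();
}

/* After: both steps share a single critical section. */
static void insert_and_setup(struct area *a, unsigned long addr,
			     unsigned long flags)
{
	assert(a != NULL);

	busy_lock_acquire();
	a->addr = addr;
	a->inserted = 1;
	a->flags = flags;
	busy_lock_release();
}
```

In this sketch one call to insert_then_setup() raises lock_count by two, while one call to insert_and_setup() raises it by one; under contention each avoided acquisition is one less chance to wait on the lock.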

3. Test results

With the above change, a 4% performance improvement was measured on an
Intel Sapphire Rapids platform (224 vCPUs) with stress-ng/pthread
(https://github.com/ColinIanKing/stress-ng), which is a stress test of
thread creation.

rulinhuang

[v1] https://lore.kernel.org/all/20240207033059.1565623-1-rulin.huang@intel.com/
[v2] https://lore.kernel.org/all/20240220090521.3316345-1-rulin.huang@intel.com/
[v3] https://lore.kernel.org/all/20240221032905.11392-1-rulin.huang@intel.com/
[v4] https://lore.kernel.org/all/20240222120536.216166-1-rulin.huang@intel.com/
[v5] https://lore.kernel.org/all/20240223130318.112198-2-rulin.huang@intel.com/
[v6] https://lore.kernel.org/lkml/aa8f0413-d055-4b49-bcd3-401e93e01c6d@intel.com/

rulinhuang (2):
  mm/vmalloc: Moved macros with no functional change happened
  mm/vmalloc: Eliminated the lock contention from twice to once

 mm/vmalloc.c | 314 +++++++++++++++++++++++++--------------------------
 1 file changed, 155 insertions(+), 159 deletions(-)

base-commit: 10c2cf5fe97647d68ee89b1f921e982e71519f20
-- 
2.43.0

* [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-01 15:54 [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading rulinhuang
@ 2024-03-01 15:54 ` rulinhuang
  2024-03-06 13:23   ` Baoquan He
  2024-03-06 19:01   ` Uladzislau Rezki
  2024-03-01 15:54 ` [PATCH v7 2/2] mm/vmalloc: Eliminated the lock contention from twice to once rulinhuang
  2024-03-06  9:18 ` [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading Huang, Rulin
  2 siblings, 2 replies; 16+ messages in thread
From: rulinhuang @ 2024-03-01 15:54 UTC (permalink / raw)
  To: urezki, bhe
  Cc: akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	rulin.huang, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

Moved data structures and basic helpers related to the per cpu kva
allocator up too, along with these macros, with no functional change.

Signed-off-by: rulinhuang <rulin.huang@intel.com>
---
V6 -> V7: Adjusted the macros
---
 mm/vmalloc.c | 262 +++++++++++++++++++++++++--------------------------
 1 file changed, 131 insertions(+), 131 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 25a8df497255..fc027a61c12e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -887,6 +887,137 @@ is_vn_id_valid(unsigned int node_id)
 	return false;
 }
 
+/*
+ * vmap space is limited especially on 32 bit architectures. Ensure there is
+ * room for at least 16 percpu vmap blocks per CPU.
+ */
+/*
+ * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
+ * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
+ * instead (we just need a rough idea)
+ */
+#if BITS_PER_LONG == 32
+#define VMALLOC_SPACE		(128UL*1024*1024)
+#else
+#define VMALLOC_SPACE		(128UL*1024*1024*1024)
+#endif
+
+#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
+#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
+#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
+#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
+#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
+#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
+#define VMAP_BBMAP_BITS		\
+		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
+		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
+			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
+
+#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
+
+/*
+ * Purge threshold to prevent overeager purging of fragmented blocks for
+ * regular operations: Purge if vb->free is less than 1/4 of the capacity.
+ */
+#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
+
+#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
+#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
+#define VMAP_FLAGS_MASK		0x3
+
+struct vmap_block_queue {
+	spinlock_t lock;
+	struct list_head free;
+
+	/*
+	 * An xarray requires an extra memory dynamically to
+	 * be allocated. If it is an issue, we can use rb-tree
+	 * instead.
+	 */
+	struct xarray vmap_blocks;
+};
+
+struct vmap_block {
+	spinlock_t lock;
+	struct vmap_area *va;
+	unsigned long free, dirty;
+	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
+	unsigned long dirty_min, dirty_max; /*< dirty range */
+	struct list_head free_list;
+	struct rcu_head rcu_head;
+	struct list_head purge;
+};
+
+/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
+static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
+
+/*
+ * In order to fast access to any "vmap_block" associated with a
+ * specific address, we use a hash.
+ *
+ * A per-cpu vmap_block_queue is used in both ways, to serialize
+ * an access to free block chains among CPUs(alloc path) and it
+ * also acts as a vmap_block hash(alloc/free paths). It means we
+ * overload it, since we already have the per-cpu array which is
+ * used as a hash table. When used as a hash a 'cpu' passed to
+ * per_cpu() is not actually a CPU but rather a hash index.
+ *
+ * A hash function is addr_to_vb_xa() which hashes any address
+ * to a specific index(in a hash) it belongs to. This then uses a
+ * per_cpu() macro to access an array with generated index.
+ *
+ * An example:
+ *
+ *  CPU_1  CPU_2  CPU_0
+ *    |      |      |
+ *    V      V      V
+ * 0     10     20     30     40     50     60
+ * |------|------|------|------|------|------|...<vmap address space>
+ *   CPU0   CPU1   CPU2   CPU0   CPU1   CPU2
+ *
+ * - CPU_1 invokes vm_unmap_ram(6), 6 belongs to CPU0 zone, thus
+ *   it access: CPU0/INDEX0 -> vmap_blocks -> xa_lock;
+ *
+ * - CPU_2 invokes vm_unmap_ram(11), 11 belongs to CPU1 zone, thus
+ *   it access: CPU1/INDEX1 -> vmap_blocks -> xa_lock;
+ *
+ * - CPU_0 invokes vm_unmap_ram(20), 20 belongs to CPU2 zone, thus
+ *   it access: CPU2/INDEX2 -> vmap_blocks -> xa_lock.
+ *
+ * This technique almost always avoids lock contention on insert/remove,
+ * however xarray spinlocks protect against any contention that remains.
+ */
+static struct xarray *
+addr_to_vb_xa(unsigned long addr)
+{
+	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+
+	return &per_cpu(vmap_block_queue, index).vmap_blocks;
+}
+
+/*
+ * We should probably have a fallback mechanism to allocate virtual memory
+ * out of partially filled vmap blocks. However vmap block sizing should be
+ * fairly reasonable according to the vmalloc size, so it shouldn't be a
+ * big problem.
+ */
+
+static unsigned long addr_to_vb_idx(unsigned long addr)
+{
+	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
+	addr /= VMAP_BLOCK_SIZE;
+	return addr;
+}
+
+static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
+{
+	unsigned long addr;
+
+	addr = va_start + (pages_off << PAGE_SHIFT);
+	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
+	return (void *)addr;
+}
+
 static __always_inline unsigned long
 va_size(struct vmap_area *va)
 {
@@ -2327,137 +2458,6 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
 
 /*** Per cpu kva allocator ***/
 
-/*
- * vmap space is limited especially on 32 bit architectures. Ensure there is
- * room for at least 16 percpu vmap blocks per CPU.
- */
-/*
- * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
- * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
- * instead (we just need a rough idea)
- */
-#if BITS_PER_LONG == 32
-#define VMALLOC_SPACE		(128UL*1024*1024)
-#else
-#define VMALLOC_SPACE		(128UL*1024*1024*1024)
-#endif
-
-#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
-#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
-#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
-#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
-#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
-#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
-#define VMAP_BBMAP_BITS		\
-		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
-		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
-			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
-
-#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
-
-/*
- * Purge threshold to prevent overeager purging of fragmented blocks for
- * regular operations: Purge if vb->free is less than 1/4 of the capacity.
- */
-#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
-
-#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
-#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
-#define VMAP_FLAGS_MASK		0x3
-
-struct vmap_block_queue {
-	spinlock_t lock;
-	struct list_head free;
-
-	/*
-	 * An xarray requires an extra memory dynamically to
-	 * be allocated. If it is an issue, we can use rb-tree
-	 * instead.
-	 */
-	struct xarray vmap_blocks;
-};
-
-struct vmap_block {
-	spinlock_t lock;
-	struct vmap_area *va;
-	unsigned long free, dirty;
-	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
-	unsigned long dirty_min, dirty_max; /*< dirty range */
-	struct list_head free_list;
-	struct rcu_head rcu_head;
-	struct list_head purge;
-};
-
-/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
-static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
-
-/*
- * In order to fast access to any "vmap_block" associated with a
- * specific address, we use a hash.
- *
- * A per-cpu vmap_block_queue is used in both ways, to serialize
- * an access to free block chains among CPUs(alloc path) and it
- * also acts as a vmap_block hash(alloc/free paths). It means we
- * overload it, since we already have the per-cpu array which is
- * used as a hash table. When used as a hash a 'cpu' passed to
- * per_cpu() is not actually a CPU but rather a hash index.
- *
- * A hash function is addr_to_vb_xa() which hashes any address
- * to a specific index(in a hash) it belongs to. This then uses a
- * per_cpu() macro to access an array with generated index.
- *
- * An example:
- *
- *  CPU_1  CPU_2  CPU_0
- *    |      |      |
- *    V      V      V
- * 0     10     20     30     40     50     60
- * |------|------|------|------|------|------|...<vmap address space>
- *   CPU0   CPU1   CPU2   CPU0   CPU1   CPU2
- *
- * - CPU_1 invokes vm_unmap_ram(6), 6 belongs to CPU0 zone, thus
- *   it access: CPU0/INDEX0 -> vmap_blocks -> xa_lock;
- *
- * - CPU_2 invokes vm_unmap_ram(11), 11 belongs to CPU1 zone, thus
- *   it access: CPU1/INDEX1 -> vmap_blocks -> xa_lock;
- *
- * - CPU_0 invokes vm_unmap_ram(20), 20 belongs to CPU2 zone, thus
- *   it access: CPU2/INDEX2 -> vmap_blocks -> xa_lock.
- *
- * This technique almost always avoids lock contention on insert/remove,
- * however xarray spinlocks protect against any contention that remains.
- */
-static struct xarray *
-addr_to_vb_xa(unsigned long addr)
-{
-	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
-
-	return &per_cpu(vmap_block_queue, index).vmap_blocks;
-}
-
-/*
- * We should probably have a fallback mechanism to allocate virtual memory
- * out of partially filled vmap blocks. However vmap block sizing should be
- * fairly reasonable according to the vmalloc size, so it shouldn't be a
- * big problem.
- */
-
-static unsigned long addr_to_vb_idx(unsigned long addr)
-{
-	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
-	addr /= VMAP_BLOCK_SIZE;
-	return addr;
-}
-
-static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
-{
-	unsigned long addr;
-
-	addr = va_start + (pages_off << PAGE_SHIFT);
-	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
-	return (void *)addr;
-}
-
 /**
  * new_vmap_block - allocates new vmap_block and occupies 2^order pages in this
  *                  block. Of course pages number can't exceed VMAP_BBMAP_BITS
-- 
2.43.0
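
The sizing macros and the addr_to_vb_xa() hash moved by this patch can be checked outside the kernel with a small sketch. Everything prefixed `ex_` is a hypothetical helper written for this illustration; the inputs (vmalloc space guess, page size, CPU count, word size) are passed in explicitly as assumptions, whereas the kernel takes them from its configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest power of two >= n (n > 0). */
static uint64_t ex_roundup_pow_of_two(uint64_t n)
{
	uint64_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Mirrors the VMAP_BBMAP_BITS expression with every input passed in
 * explicitly, so the arithmetic can be checked outside the kernel.
 */
static uint64_t ex_bbmap_bits(uint64_t vmalloc_space, uint64_t page_size,
			      uint64_t nr_cpus, uint64_t bits_per_long)
{
	uint64_t bits_max = 1024;		/* VMAP_BBMAP_BITS_MAX */
	uint64_t bits_min = bits_per_long * 2;	/* VMAP_BBMAP_BITS_MIN */
	uint64_t guess;

	assert(page_size != 0 && nr_cpus != 0);
	guess = vmalloc_space / page_size / ex_roundup_pow_of_two(nr_cpus) / 16;
	if (guess < bits_min)
		guess = bits_min;
	return guess < bits_max ? guess : bits_max;
}

/*
 * Same arithmetic as addr_to_vb_xa(): block number modulo the number of
 * slots. The kernel uses VMAP_BLOCK_SIZE and num_possible_cpus() here.
 */
static uint64_t ex_addr_to_index(uint64_t addr, uint64_t block_size,
				 uint64_t nr_slots)
{
	assert(block_size != 0 && nr_slots != 0);
	return (addr / block_size) % nr_slots;
}
```

With the values from the comment's diagram (blocks of 10 units, three slots), ex_addr_to_index(6, 10, 3), ex_addr_to_index(11, 10, 3) and ex_addr_to_index(20, 10, 3) give 0, 1 and 2, matching the CPU0, CPU1 and CPU2 zones. Assuming a 64-bit build with 4K pages and 64 CPUs, ex_bbmap_bits(128ULL << 30, 4096, 64, 64) evaluates to 1024, which corresponds to the "4MB with 4K pages" upper bound.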


* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-01 15:54 ` [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened rulinhuang
@ 2024-03-06 13:23   ` Baoquan He
  2024-03-06 19:01   ` Uladzislau Rezki
  1 sibling, 0 replies; 16+ messages in thread
From: Baoquan He @ 2024-03-06 13:23 UTC (permalink / raw)
  To: rulinhuang
  Cc: urezki, akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

Sorry, I missed this patchset in my mail box.

On 03/01/24 at 10:54am, rulinhuang wrote:
> Moved data structures and basic helpers related to per cpu kva allocator
  ~~~ s/Moved/move/? And the subject too?
> up too to along with these macros with no functional change happened.

Maybe we should add the line below to explain why the move is needed.

This is in preparation for later VMAP_RAM checking in alloc_vmap_area().

Other than above nitpicks, this looks good to me. If you update
this patch log and post a new version, please feel free to add:

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> Signed-off-by: rulinhuang <rulin.huang@intel.com>
> ---
> V6 -> V7: Adjusted the macros
> ---
>  mm/vmalloc.c | 262 +++++++++++++++++++++++++--------------------------
>  1 file changed, 131 insertions(+), 131 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 25a8df497255..fc027a61c12e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -887,6 +887,137 @@ is_vn_id_valid(unsigned int node_id)
>  	return false;
>  }
>  
> +/*
> + * vmap space is limited especially on 32 bit architectures. Ensure there is
> + * room for at least 16 percpu vmap blocks per CPU.
> + */
> +/*
> + * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
> + * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
> + * instead (we just need a rough idea)
> + */
> +#if BITS_PER_LONG == 32
> +#define VMALLOC_SPACE		(128UL*1024*1024)
> +#else
> +#define VMALLOC_SPACE		(128UL*1024*1024*1024)
> +#endif
> +
> +#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
> +#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
> +#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> +#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
> +#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> +#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
> +#define VMAP_BBMAP_BITS		\
> +		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
> +		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
> +			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
> +
> +#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> +
> +/*
> + * Purge threshold to prevent overeager purging of fragmented blocks for
> + * regular operations: Purge if vb->free is less than 1/4 of the capacity.
> + */
> +#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
> +
> +#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> +#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> +#define VMAP_FLAGS_MASK		0x3
> +
> +struct vmap_block_queue {
> +	spinlock_t lock;
> +	struct list_head free;
> +
> +	/*
> +	 * An xarray requires an extra memory dynamically to
> +	 * be allocated. If it is an issue, we can use rb-tree
> +	 * instead.
> +	 */
> +	struct xarray vmap_blocks;
> +};
> +
> +struct vmap_block {
> +	spinlock_t lock;
> +	struct vmap_area *va;
> +	unsigned long free, dirty;
> +	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
> +	unsigned long dirty_min, dirty_max; /*< dirty range */
> +	struct list_head free_list;
> +	struct rcu_head rcu_head;
> +	struct list_head purge;
> +};
> +
> +/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
> +static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> +
> +/*
> + * In order to fast access to any "vmap_block" associated with a
> + * specific address, we use a hash.
> + *
> + * A per-cpu vmap_block_queue is used in both ways, to serialize
> + * an access to free block chains among CPUs(alloc path) and it
> + * also acts as a vmap_block hash(alloc/free paths). It means we
> + * overload it, since we already have the per-cpu array which is
> + * used as a hash table. When used as a hash a 'cpu' passed to
> + * per_cpu() is not actually a CPU but rather a hash index.
> + *
> + * A hash function is addr_to_vb_xa() which hashes any address
> + * to a specific index(in a hash) it belongs to. This then uses a
> + * per_cpu() macro to access an array with generated index.
> + *
> + * An example:
> + *
> + *  CPU_1  CPU_2  CPU_0
> + *    |      |      |
> + *    V      V      V
> + * 0     10     20     30     40     50     60
> + * |------|------|------|------|------|------|...<vmap address space>
> + *   CPU0   CPU1   CPU2   CPU0   CPU1   CPU2
> + *
> + * - CPU_1 invokes vm_unmap_ram(6), 6 belongs to CPU0 zone, thus
> + *   it access: CPU0/INDEX0 -> vmap_blocks -> xa_lock;
> + *
> + * - CPU_2 invokes vm_unmap_ram(11), 11 belongs to CPU1 zone, thus
> + *   it access: CPU1/INDEX1 -> vmap_blocks -> xa_lock;
> + *
> + * - CPU_0 invokes vm_unmap_ram(20), 20 belongs to CPU2 zone, thus
> + *   it access: CPU2/INDEX2 -> vmap_blocks -> xa_lock.
> + *
> + * This technique almost always avoids lock contention on insert/remove,
> + * however xarray spinlocks protect against any contention that remains.
> + */
> +static struct xarray *
> +addr_to_vb_xa(unsigned long addr)
> +{
> +	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +
> +	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> +}
> +
> +/*
> + * We should probably have a fallback mechanism to allocate virtual memory
> + * out of partially filled vmap blocks. However vmap block sizing should be
> + * fairly reasonable according to the vmalloc size, so it shouldn't be a
> + * big problem.
> + */
> +
> +static unsigned long addr_to_vb_idx(unsigned long addr)
> +{
> +	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
> +	addr /= VMAP_BLOCK_SIZE;
> +	return addr;
> +}
> +
> +static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
> +{
> +	unsigned long addr;
> +
> +	addr = va_start + (pages_off << PAGE_SHIFT);
> +	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
> +	return (void *)addr;
> +}
> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {
> @@ -2327,137 +2458,6 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
>  
>  /*** Per cpu kva allocator ***/
>  
> -/*
> - * vmap space is limited especially on 32 bit architectures. Ensure there is
> - * room for at least 16 percpu vmap blocks per CPU.
> - */
> -/*
> - * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
> - * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
> - * instead (we just need a rough idea)
> - */
> -#if BITS_PER_LONG == 32
> -#define VMALLOC_SPACE		(128UL*1024*1024)
> -#else
> -#define VMALLOC_SPACE		(128UL*1024*1024*1024)
> -#endif
> -
> -#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
> -#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
> -#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> -#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
> -#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> -#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
> -#define VMAP_BBMAP_BITS		\
> -		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
> -		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
> -			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
> -
> -#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> -
> -/*
> - * Purge threshold to prevent overeager purging of fragmented blocks for
> - * regular operations: Purge if vb->free is less than 1/4 of the capacity.
> - */
> -#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
> -
> -#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> -#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> -#define VMAP_FLAGS_MASK		0x3
> -
> -struct vmap_block_queue {
> -	spinlock_t lock;
> -	struct list_head free;
> -
> -	/*
> -	 * An xarray requires an extra memory dynamically to
> -	 * be allocated. If it is an issue, we can use rb-tree
> -	 * instead.
> -	 */
> -	struct xarray vmap_blocks;
> -};
> -
> -struct vmap_block {
> -	spinlock_t lock;
> -	struct vmap_area *va;
> -	unsigned long free, dirty;
> -	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
> -	unsigned long dirty_min, dirty_max; /*< dirty range */
> -	struct list_head free_list;
> -	struct rcu_head rcu_head;
> -	struct list_head purge;
> -};
> -
> -/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
> -static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> -
> -/*
> - * In order to fast access to any "vmap_block" associated with a
> - * specific address, we use a hash.
> - *
> - * A per-cpu vmap_block_queue is used in both ways, to serialize
> - * an access to free block chains among CPUs(alloc path) and it
> - * also acts as a vmap_block hash(alloc/free paths). It means we
> - * overload it, since we already have the per-cpu array which is
> - * used as a hash table. When used as a hash a 'cpu' passed to
> - * per_cpu() is not actually a CPU but rather a hash index.
> - *
> - * A hash function is addr_to_vb_xa() which hashes any address
> - * to a specific index(in a hash) it belongs to. This then uses a
> - * per_cpu() macro to access an array with generated index.
> - *
> - * An example:
> - *
> - *  CPU_1  CPU_2  CPU_0
> - *    |      |      |
> - *    V      V      V
> - * 0     10     20     30     40     50     60
> - * |------|------|------|------|------|------|...<vmap address space>
> - *   CPU0   CPU1   CPU2   CPU0   CPU1   CPU2
> - *
> - * - CPU_1 invokes vm_unmap_ram(6), 6 belongs to CPU0 zone, thus
> - *   it access: CPU0/INDEX0 -> vmap_blocks -> xa_lock;
> - *
> - * - CPU_2 invokes vm_unmap_ram(11), 11 belongs to CPU1 zone, thus
> - *   it access: CPU1/INDEX1 -> vmap_blocks -> xa_lock;
> - *
> - * - CPU_0 invokes vm_unmap_ram(20), 20 belongs to CPU2 zone, thus
> - *   it access: CPU2/INDEX2 -> vmap_blocks -> xa_lock.
> - *
> - * This technique almost always avoids lock contention on insert/remove,
> - * however xarray spinlocks protect against any contention that remains.
> - */
> -static struct xarray *
> -addr_to_vb_xa(unsigned long addr)
> -{
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> -
> -	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> -}
> -
> -/*
> - * We should probably have a fallback mechanism to allocate virtual memory
> - * out of partially filled vmap blocks. However vmap block sizing should be
> - * fairly reasonable according to the vmalloc size, so it shouldn't be a
> - * big problem.
> - */
> -
> -static unsigned long addr_to_vb_idx(unsigned long addr)
> -{
> -	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
> -	addr /= VMAP_BLOCK_SIZE;
> -	return addr;
> -}
> -
> -static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
> -{
> -	unsigned long addr;
> -
> -	addr = va_start + (pages_off << PAGE_SHIFT);
> -	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
> -	return (void *)addr;
> -}
> -
>  /**
>   * new_vmap_block - allocates new vmap_block and occupies 2^order pages in this
>   *                  block. Of course pages number can't exceed VMAP_BBMAP_BITS
> -- 
> 2.43.0
> 


* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-01 15:54 ` [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened rulinhuang
  2024-03-06 13:23   ` Baoquan He
@ 2024-03-06 19:01   ` Uladzislau Rezki
  2024-03-07  1:23     ` Baoquan He
  1 sibling, 1 reply; 16+ messages in thread
From: Uladzislau Rezki @ 2024-03-06 19:01 UTC (permalink / raw)
  To: rulinhuang
  Cc: urezki, bhe, akpm, colin.king, hch, linux-kernel, linux-mm,
	lstoakes, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

On Fri, Mar 01, 2024 at 10:54:16AM -0500, rulinhuang wrote:
> Moved data structures and basic helpers related to per cpu kva allocator
> up too to along with these macros with no functional change happened.
> 
> Signed-off-by: rulinhuang <rulin.huang@intel.com>
> ---
> V6 -> V7: Adjusted the macros
> ---
>  mm/vmalloc.c | 262 +++++++++++++++++++++++++--------------------------
>  1 file changed, 131 insertions(+), 131 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 25a8df497255..fc027a61c12e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -887,6 +887,137 @@ is_vn_id_valid(unsigned int node_id)
>  	return false;
>  }
>  
> +/*
> + * vmap space is limited especially on 32 bit architectures. Ensure there is
> + * room for at least 16 percpu vmap blocks per CPU.
> + */
> +/*
> + * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
> + * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
> + * instead (we just need a rough idea)
> + */
> +#if BITS_PER_LONG == 32
> +#define VMALLOC_SPACE		(128UL*1024*1024)
> +#else
> +#define VMALLOC_SPACE		(128UL*1024*1024*1024)
> +#endif
> +
> +#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
> +#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
> +#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> +#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
> +#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> +#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
> +#define VMAP_BBMAP_BITS		\
> +		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
> +		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
> +			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
> +
> +#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> +
> +/*
> + * Purge threshold to prevent overeager purging of fragmented blocks for
> + * regular operations: Purge if vb->free is less than 1/4 of the capacity.
> + */
> +#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
> +
> +#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> +#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> +#define VMAP_FLAGS_MASK		0x3
> +
> +struct vmap_block_queue {
> +	spinlock_t lock;
> +	struct list_head free;
> +
> +	/*
> +	 * An xarray requires an extra memory dynamically to
> +	 * be allocated. If it is an issue, we can use rb-tree
> +	 * instead.
> +	 */
> +	struct xarray vmap_blocks;
> +};
> +
> +struct vmap_block {
> +	spinlock_t lock;
> +	struct vmap_area *va;
> +	unsigned long free, dirty;
> +	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
> +	unsigned long dirty_min, dirty_max; /*< dirty range */
> +	struct list_head free_list;
> +	struct rcu_head rcu_head;
> +	struct list_head purge;
> +};
> +
> +/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
> +static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> +
> +/*
> + * In order to fast access to any "vmap_block" associated with a
> + * specific address, we use a hash.
> + *
> + * A per-cpu vmap_block_queue is used in both ways, to serialize
> + * an access to free block chains among CPUs(alloc path) and it
> + * also acts as a vmap_block hash(alloc/free paths). It means we
> + * overload it, since we already have the per-cpu array which is
> + * used as a hash table. When used as a hash a 'cpu' passed to
> + * per_cpu() is not actually a CPU but rather a hash index.
> + *
> + * A hash function is addr_to_vb_xa() which hashes any address
> + * to a specific index(in a hash) it belongs to. This then uses a
> + * per_cpu() macro to access an array with generated index.
> + *
> + * An example:
> + *
> + *  CPU_1  CPU_2  CPU_0
> + *    |      |      |
> + *    V      V      V
> + * 0     10     20     30     40     50     60
> + * |------|------|------|------|------|------|...<vmap address space>
> + *   CPU0   CPU1   CPU2   CPU0   CPU1   CPU2
> + *
> + * - CPU_1 invokes vm_unmap_ram(6), 6 belongs to CPU0 zone, thus
> + *   it access: CPU0/INDEX0 -> vmap_blocks -> xa_lock;
> + *
> + * - CPU_2 invokes vm_unmap_ram(11), 11 belongs to CPU1 zone, thus
> + *   it access: CPU1/INDEX1 -> vmap_blocks -> xa_lock;
> + *
> + * - CPU_0 invokes vm_unmap_ram(20), 20 belongs to CPU2 zone, thus
> + *   it access: CPU2/INDEX2 -> vmap_blocks -> xa_lock.
> + *
> + * This technique almost always avoids lock contention on insert/remove,
> + * however xarray spinlocks protect against any contention that remains.
> + */
> +static struct xarray *
> +addr_to_vb_xa(unsigned long addr)
> +{
> +	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> +
> +	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> +}
> +
> +/*
> + * We should probably have a fallback mechanism to allocate virtual memory
> + * out of partially filled vmap blocks. However vmap block sizing should be
> + * fairly reasonable according to the vmalloc size, so it shouldn't be a
> + * big problem.
> + */
> +
> +static unsigned long addr_to_vb_idx(unsigned long addr)
> +{
> +	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
> +	addr /= VMAP_BLOCK_SIZE;
> +	return addr;
> +}
> +
> +static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
> +{
> +	unsigned long addr;
> +
> +	addr = va_start + (pages_off << PAGE_SHIFT);
> +	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
> +	return (void *)addr;
> +}
> +
>  static __always_inline unsigned long
>  va_size(struct vmap_area *va)
>  {
> @@ -2327,137 +2458,6 @@ static struct vmap_area *find_unlink_vmap_area(unsigned long addr)
>  
>  /*** Per cpu kva allocator ***/
>  
> -/*
> - * vmap space is limited especially on 32 bit architectures. Ensure there is
> - * room for at least 16 percpu vmap blocks per CPU.
> - */
> -/*
> - * If we had a constant VMALLOC_START and VMALLOC_END, we'd like to be able
> - * to #define VMALLOC_SPACE		(VMALLOC_END-VMALLOC_START). Guess
> - * instead (we just need a rough idea)
> - */
> -#if BITS_PER_LONG == 32
> -#define VMALLOC_SPACE		(128UL*1024*1024)
> -#else
> -#define VMALLOC_SPACE		(128UL*1024*1024*1024)
> -#endif
> -
> -#define VMALLOC_PAGES		(VMALLOC_SPACE / PAGE_SIZE)
> -#define VMAP_MAX_ALLOC		BITS_PER_LONG	/* 256K with 4K pages */
> -#define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> -#define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
> -#define VMAP_MIN(x, y)		((x) < (y) ? (x) : (y)) /* can't use min() */
> -#define VMAP_MAX(x, y)		((x) > (y) ? (x) : (y)) /* can't use max() */
> -#define VMAP_BBMAP_BITS		\
> -		VMAP_MIN(VMAP_BBMAP_BITS_MAX,	\
> -		VMAP_MAX(VMAP_BBMAP_BITS_MIN,	\
> -			VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
> -
> -#define VMAP_BLOCK_SIZE		(VMAP_BBMAP_BITS * PAGE_SIZE)
> -
> -/*
> - * Purge threshold to prevent overeager purging of fragmented blocks for
> - * regular operations: Purge if vb->free is less than 1/4 of the capacity.
> - */
> -#define VMAP_PURGE_THRESHOLD	(VMAP_BBMAP_BITS / 4)
> -
> -#define VMAP_RAM		0x1 /* indicates vm_map_ram area*/
> -#define VMAP_BLOCK		0x2 /* mark out the vmap_block sub-type*/
> -#define VMAP_FLAGS_MASK		0x3
> -
> -struct vmap_block_queue {
> -	spinlock_t lock;
> -	struct list_head free;
> -
> -	/*
> -	 * An xarray requires an extra memory dynamically to
> -	 * be allocated. If it is an issue, we can use rb-tree
> -	 * instead.
> -	 */
> -	struct xarray vmap_blocks;
> -};
> -
> -struct vmap_block {
> -	spinlock_t lock;
> -	struct vmap_area *va;
> -	unsigned long free, dirty;
> -	DECLARE_BITMAP(used_map, VMAP_BBMAP_BITS);
> -	unsigned long dirty_min, dirty_max; /*< dirty range */
> -	struct list_head free_list;
> -	struct rcu_head rcu_head;
> -	struct list_head purge;
> -};
> -
> -/* Queue of free and dirty vmap blocks, for allocation and flushing purposes */
> -static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> -
> -/*
> - * In order to fast access to any "vmap_block" associated with a
> - * specific address, we use a hash.
> - *
> - * A per-cpu vmap_block_queue is used in both ways, to serialize
> - * an access to free block chains among CPUs(alloc path) and it
> - * also acts as a vmap_block hash(alloc/free paths). It means we
> - * overload it, since we already have the per-cpu array which is
> - * used as a hash table. When used as a hash a 'cpu' passed to
> - * per_cpu() is not actually a CPU but rather a hash index.
> - *
> - * A hash function is addr_to_vb_xa() which hashes any address
> - * to a specific index(in a hash) it belongs to. This then uses a
> - * per_cpu() macro to access an array with generated index.
> - *
> - * An example:
> - *
> - *  CPU_1  CPU_2  CPU_0
> - *    |      |      |
> - *    V      V      V
> - * 0     10     20     30     40     50     60
> - * |------|------|------|------|------|------|...<vmap address space>
> - *   CPU0   CPU1   CPU2   CPU0   CPU1   CPU2
> - *
> - * - CPU_1 invokes vm_unmap_ram(6), 6 belongs to CPU0 zone, thus
> - *   it access: CPU0/INDEX0 -> vmap_blocks -> xa_lock;
> - *
> - * - CPU_2 invokes vm_unmap_ram(11), 11 belongs to CPU1 zone, thus
> - *   it access: CPU1/INDEX1 -> vmap_blocks -> xa_lock;
> - *
> - * - CPU_0 invokes vm_unmap_ram(20), 20 belongs to CPU2 zone, thus
> - *   it access: CPU2/INDEX2 -> vmap_blocks -> xa_lock.
> - *
> - * This technique almost always avoids lock contention on insert/remove,
> - * however xarray spinlocks protect against any contention that remains.
> - */
> -static struct xarray *
> -addr_to_vb_xa(unsigned long addr)
> -{
> -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> -
> -	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> -}
> -
> -/*
> - * We should probably have a fallback mechanism to allocate virtual memory
> - * out of partially filled vmap blocks. However vmap block sizing should be
> - * fairly reasonable according to the vmalloc size, so it shouldn't be a
> - * big problem.
> - */
> -
> -static unsigned long addr_to_vb_idx(unsigned long addr)
> -{
> -	addr -= VMALLOC_START & ~(VMAP_BLOCK_SIZE-1);
> -	addr /= VMAP_BLOCK_SIZE;
> -	return addr;
> -}
> -
> -static void *vmap_block_vaddr(unsigned long va_start, unsigned long pages_off)
> -{
> -	unsigned long addr;
> -
> -	addr = va_start + (pages_off << PAGE_SHIFT);
> -	BUG_ON(addr_to_vb_idx(addr) != addr_to_vb_idx(va_start));
> -	return (void *)addr;
> -}
> -
>  /**
>   * new_vmap_block - allocates new vmap_block and occupies 2^order pages in this
>   *                  block. Of course pages number can't exceed VMAP_BBMAP_BITS
> -- 
> 2.43.0
> 

Sorry for the late answer, I only just noticed this email. It was not in
my inbox...

OK, so now you move part of the per-cpu allocator to the top and leave
the other part further down, which splits it. This is done only so that
the VMAP_RAM macro is visible to:

BUG_ON(va_flags & VMAP_RAM);

Do we really need this BUG_ON()?

--
Uladzislau Rezki

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-06 19:01   ` Uladzislau Rezki
@ 2024-03-07  1:23     ` Baoquan He
  2024-03-07  3:01       ` Huang, Rulin
  2024-03-07 19:16       ` Uladzislau Rezki
  0 siblings, 2 replies; 16+ messages in thread
From: Baoquan He @ 2024-03-07  1:23 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rulinhuang, akpm, colin.king, hch, linux-kernel, linux-mm,
	lstoakes, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

On 03/06/24 at 08:01pm, Uladzislau Rezki wrote:
> On Fri, Mar 01, 2024 at 10:54:16AM -0500, rulinhuang wrote:
......
> 
> Sorry for the late answer, i also just noticed this email. It was not in
> my inbox...
> 
> OK, now you move part of the per-cpu allocator on the top and leave
> another part down making it split. This is just for the:
> 
> BUG_ON(va_flags & VMAP_RAM);
> 
> VMAP_RAM macro. Do we really need this BUG_ON()?

Sorry, I suggested that when reviewing v5:
https://lore.kernel.org/all/ZdiltpK5fUvwVWtD@MiWiFi-R3L-srv/T/#u

As for moving part of the per-cpu kva allocator and thereby splitting it, I
would argue that we have the vmap_nodes definition and basic helper
functions such as addr_to_node_id() at the top, with other parts such as
size_to_va_pool() and node_pool_add_va() further down. This is similar.

As for whether we should add 'BUG_ON(va_flags & VMAP_RAM);', I am not
sure. I was already hesitant when I suggested it. In the current code,
alloc_vmap_area() is called from the three functions below, and only
__get_vm_area_node() passes a non-NULL vm.
 new_vmap_block()     -|
 vm_map_ram()         ----> alloc_vmap_area()
 __get_vm_area_node() -|

Could vm be passed wrongly in the future? Checking only that vm is
non-NULL feels a little unsafe to me. Still, I am fine with removing the
BUG_ON(), because the current code gives no cause for worry. We can wait
and see.

       if (vm) {
               BUG_ON(va_flags & VMAP_RAM);
               setup_vmalloc_vm(vm, va, flags, caller);
       }
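
To make the concern concrete, here is a small userspace sketch (not
kernel code) of the condition the BUG_ON() would trap. The flag values
match the VMAP_RAM and VMAP_BLOCK definitions in mm/vmalloc.c; the
helper name vm_conflicts_with_vmap_ram() and the empty struct vm_struct
stand-in are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Flag values as defined in mm/vmalloc.c. */
#define VMAP_RAM	0x1
#define VMAP_BLOCK	0x2

/* Hypothetical stand-in; the real struct vm_struct is not needed here. */
struct vm_struct {
	int unused;
};

/*
 * Returns 1 for the combination the proposed BUG_ON() would trap:
 * a non-NULL vm passed together with the VMAP_RAM flag.
 */
static int vm_conflicts_with_vmap_ram(const struct vm_struct *vm,
				      unsigned long va_flags)
{
	return vm != NULL && (va_flags & VMAP_RAM) != 0;
}
```

With the arguments used by the three current callers, that is
(NULL, VMAP_RAM|VMAP_BLOCK), (NULL, VMAP_RAM) and (area, 0), the helper
returns 0 each time. It returns 1 only for a caller that passes a
vm_struct together with VMAP_RAM, and no such caller exists today.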

Thanks
Baoquan

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-07  1:23     ` Baoquan He
@ 2024-03-07  3:01       ` Huang, Rulin
  2024-03-07  3:32         ` Baoquan He
  2024-03-07 19:16       ` Uladzislau Rezki
  1 sibling, 1 reply; 16+ messages in thread
From: Huang, Rulin @ 2024-03-07  3:01 UTC (permalink / raw)
  To: Baoquan He, Uladzislau Rezki
  Cc: akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

We have made changes based on your latest suggestions:
1. Removed the BUG_ON().
2. Dropped the rearrangement of the macros.

We have submitted patch v8 based on this. Thanks to Baoquan for the
discussion. Could you please review the latest version and confirm
whether there are any problems?

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-07  3:01       ` Huang, Rulin
@ 2024-03-07  3:32         ` Baoquan He
  2024-03-07  5:48           ` Huang, Rulin
  0 siblings, 1 reply; 16+ messages in thread
From: Baoquan He @ 2024-03-07  3:32 UTC (permalink / raw)
  To: Huang, Rulin
  Cc: Uladzislau Rezki, akpm, colin.king, hch, linux-kernel, linux-mm,
	lstoakes, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

On 03/07/24 at 11:01am, Huang, Rulin wrote:
> We have made changes based on your latest suggestions.
> 1.Removed bugs_on.
> 2.Removed adjustion of macros.
> 
> We submitted patch v8 based on this. Thanks to Baoquan for the
> discussion, and could you please help to review and confirm if there are
> any problems on the latest version?

Looks good to me, I don't want to exhaust a newcomer's enthusiasm and
patience before you get used to this :-). Will ack, thanks for the
awesome work.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-07  3:32         ` Baoquan He
@ 2024-03-07  5:48           ` Huang, Rulin
  2024-03-07 19:53             ` Uladzislau Rezki
  0 siblings, 1 reply; 16+ messages in thread
From: Huang, Rulin @ 2024-03-07  5:48 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, akpm, colin.king, hch, linux-kernel, linux-mm,
	lstoakes, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou,
	rulin.huang

Thanks for your guidance and encouragement!

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-07  5:48           ` Huang, Rulin
@ 2024-03-07 19:53             ` Uladzislau Rezki
  0 siblings, 0 replies; 16+ messages in thread
From: Uladzislau Rezki @ 2024-03-07 19:53 UTC (permalink / raw)
  To: Huang, Rulin
  Cc: Baoquan He, Uladzislau Rezki, akpm, colin.king, hch,
	linux-kernel, linux-mm, lstoakes, tianyou.li, tim.c.chen,
	wangyang.guo, zhiguo.zhou

>
> Thanks for your guiding and encouragement!
> 
Thank you again. v8 looks good to me :)

--
Uladzislau Rezki

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-07  1:23     ` Baoquan He
  2024-03-07  3:01       ` Huang, Rulin
@ 2024-03-07 19:16       ` Uladzislau Rezki
  2024-03-08  8:23         ` Baoquan He
  1 sibling, 1 reply; 16+ messages in thread
From: Uladzislau Rezki @ 2024-03-07 19:16 UTC (permalink / raw)
  To: Baoquan He, rulinhuang
  Cc: Uladzislau Rezki, rulinhuang, akpm, colin.king, hch,
	linux-kernel, linux-mm, lstoakes, tianyou.li, tim.c.chen,
	wangyang.guo, zhiguo.zhou

On Thu, Mar 07, 2024 at 09:23:10AM +0800, Baoquan He wrote:
> On 03/06/24 at 08:01pm, Uladzislau Rezki wrote:
> > On Fri, Mar 01, 2024 at 10:54:16AM -0500, rulinhuang wrote:
> ......
> > 
> > Sorry for the late answer, i also just noticed this email. It was not in
> > my inbox...
> > 
> > OK, now you move part of the per-cpu allocator on the top and leave
> > another part down making it split. This is just for the:
> > 
> > BUG_ON(va_flags & VMAP_RAM);
> > 
> > VMAP_RAM macro. Do we really need this BUG_ON()?
> 
> Sorry, I suggested that when reviewing v5:
> https://lore.kernel.org/all/ZdiltpK5fUvwVWtD@MiWiFi-R3L-srv/T/#u
> 
> About part of per-cpu kva allocator moving and the split making, I would
> argue that we will have vmap_nodes defintion and basic helper functions
> like addr_to_node_id() etc at top, and leave other part like
> size_to_va_pool(), node_pool_add_va() etc down. These are similar.
> 
> While about whether we should add 'BUG_ON(va_flags & VMAP_RAM);', I am
> not sure about it. When I suggested that, I am also hesitant. From the
> current code, alloc_vmap_area() is called in below three functions, only
> __get_vm_area_node() will pass the non-NULL vm. 
>  new_vmap_block()     -|
>  vm_map_ram()         ----> alloc_vmap_area()
>  __get_vm_area_node() -|
> 
> It could be wrongly passed in the future? Only checking if vm is
> non-NULL makes me feel a little unsafe. While I am fine if removing the
> BUG_ON, because there's no worry in the current code. We can wait and
> see in the future.
> 
>        if (vm) {
>                BUG_ON(va_flags & VMAP_RAM);
>                setup_vmalloc_vm(vm, va, flags, caller);
>        }
> 
I would remove it, because it is really hard to get this wrong: there is
only one place that passes a non-NULL vm. Also, BUG_ON() is really a show
stopper. I really appreciate what rulinhuang <rulin.huang@intel.com> is
doing and I understand that it might not be so easy.

So, if we can avoid moving the code, which it looks to me we can, and
if we can pass fewer arguments into alloc_vmap_area(), since it is
already overloaded, that would be great.

Just an example:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 25a8df497255..b6050e018539 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1841,6 +1841,30 @@ node_alloc(unsigned long size, unsigned long align,
 	return va;
 }
 
+static inline void
+__pre_setup_vmalloc_vm(struct vm_struct *vm,
+		unsigned long flags, const void *caller)
+{
+	vm->flags = flags;
+	vm->caller = caller;
+}
+
+static inline void
+__post_setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va)
+{
+	vm->addr = (void *)va->va_start;
+	vm->size = va->va_end - va->va_start;
+	va->vm = vm;
+}
+
+static inline void
+setup_vmalloc_vm_locked(struct vm_struct *vm, struct vmap_area *va,
+		unsigned long flags, const void *caller)
+{
+	__pre_setup_vmalloc_vm(vm, flags, caller);
+	__post_setup_vmalloc_vm(vm, va);
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
  * vstart and vend.
@@ -1849,7 +1873,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
 				int node, gfp_t gfp_mask,
-				unsigned long va_flags)
+				unsigned long va_flags, struct vm_struct *vm)
 {
 	struct vmap_node *vn;
 	struct vmap_area *va;
@@ -1912,6 +1936,9 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = (va_flags | vn_id);
 
+	if (vm)
+		__post_setup_vmalloc_vm(vm, va);
+
 	vn = addr_to_node(va->va_start);
 
 	spin_lock(&vn->busy.lock);
@@ -2486,7 +2513,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
 					node, gfp_mask,
-					VMAP_RAM|VMAP_BLOCK);
+					VMAP_RAM|VMAP_BLOCK, NULL);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -2843,7 +2870,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
 		struct vmap_area *va;
 		va = alloc_vmap_area(size, PAGE_SIZE,
 				VMALLOC_START, VMALLOC_END,
-				node, GFP_KERNEL, VMAP_RAM);
+				node, GFP_KERNEL, VMAP_RAM, NULL);
+
 		if (IS_ERR(va))
 			return NULL;
 
@@ -2946,26 +2974,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }
 
-static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
-	struct vmap_area *va, unsigned long flags, const void *caller)
-{
-	vm->flags = flags;
-	vm->addr = (void *)va->va_start;
-	vm->size = va->va_end - va->va_start;
-	vm->caller = caller;
-	va->vm = vm;
-}
-
-static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
-			      unsigned long flags, const void *caller)
-{
-	struct vmap_node *vn = addr_to_node(va->va_start);
-
-	spin_lock(&vn->busy.lock);
-	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vn->busy.lock);
-}
-
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
 {
 	/*
@@ -3002,14 +3010,15 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (!(flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
 
-	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
+	/* post-setup is done in the alloc_vmap_area(). */
+	__pre_setup_vmalloc_vm(area, flags, caller);
+
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area);
 	if (IS_ERR(va)) {
 		kfree(area);
 		return NULL;
 	}
 
-	setup_vmalloc_vm(area, va, flags, caller);
-
 	/*
 	 * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a
 	 * best-effort approach, as they can be mapped outside of vmalloc code.
<snip>
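
The pre/post split in the diff above can also be tried outside the
kernel. The following is a minimal userspace sketch, assuming trimmed
stand-ins for struct vm_struct and struct vmap_area that carry only the
fields the helpers touch. The names pre_setup_vm(), post_setup_vm() and
model_alloc_area() are hypothetical, and the "allocator" simply accepts
the range it is given instead of searching for free space.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: only the fields the setup helpers touch. */
struct vm_struct;

struct vmap_area {
	unsigned long va_start;
	unsigned long va_end;
	struct vm_struct *vm;
};

struct vm_struct {
	void *addr;
	unsigned long size;
	unsigned long flags;
	const void *caller;
};

/* Fields that are known before any address range exists. */
static void pre_setup_vm(struct vm_struct *vm, unsigned long flags,
			 const void *caller)
{
	vm->flags = flags;
	vm->caller = caller;
}

/* Fields that depend on the range the allocator picked. */
static void post_setup_vm(struct vm_struct *vm, struct vmap_area *va)
{
	vm->addr = (void *)va->va_start;
	vm->size = va->va_end - va->va_start;
	va->vm = vm;
}

/* Hypothetical allocator: takes the given range, links vm if present. */
static void model_alloc_area(struct vmap_area *va, unsigned long start,
			     unsigned long size, struct vm_struct *vm)
{
	assert(size > 0);
	va->va_start = start;
	va->va_end = start + size;
	va->vm = NULL;

	if (vm)
		post_setup_vm(vm, va);
}
```

The caller fills in flags and caller first, then hands the vm_struct to
the allocator, which completes addr, size and the back pointer. Callers
that have no vm_struct pass NULL and va->vm stays NULL.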

--
Uladzislau Rezki

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-07 19:16       ` Uladzislau Rezki
@ 2024-03-08  8:23         ` Baoquan He
  2024-03-08 10:28           ` Uladzislau Rezki
  0 siblings, 1 reply; 16+ messages in thread
From: Baoquan He @ 2024-03-08  8:23 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rulinhuang, akpm, colin.king, hch, linux-kernel, linux-mm,
	lstoakes, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

On 03/07/24 at 08:16pm, Uladzislau Rezki wrote:
> On Thu, Mar 07, 2024 at 09:23:10AM +0800, Baoquan He wrote:
> > On 03/06/24 at 08:01pm, Uladzislau Rezki wrote:
> > > On Fri, Mar 01, 2024 at 10:54:16AM -0500, rulinhuang wrote:
> > ......
> > > 
> > > Sorry for the late answer, i also just noticed this email. It was not in
> > > my inbox...
> > > 
> > > OK, now you move part of the per-cpu allocator on the top and leave
> > > another part down making it split. This is just for the:
> > > 
> > > BUG_ON(va_flags & VMAP_RAM);
> > > 
> > > VMAP_RAM macro. Do we really need this BUG_ON()?
> > 
> > Sorry, I suggested that when reviewing v5:
> > https://lore.kernel.org/all/ZdiltpK5fUvwVWtD@MiWiFi-R3L-srv/T/#u
> > 
> > About part of per-cpu kva allocator moving and the split making, I would
> > argue that we will have vmap_nodes defintion and basic helper functions
> > like addr_to_node_id() etc at top, and leave other part like
> > size_to_va_pool(), node_pool_add_va() etc down. These are similar.
> > 
> > While about whether we should add 'BUG_ON(va_flags & VMAP_RAM);', I am
> > not sure about it. When I suggested that, I am also hesitant. From the
> > current code, alloc_vmap_area() is called in below three functions, only
> > __get_vm_area_node() will pass the non-NULL vm. 
> >  new_vmap_block()     -|
> >  vm_map_ram()         ----> alloc_vmap_area()
> >  __get_vm_area_node() -|
> > 
> > It could be wrongly passed in the future? Only checking if vm is
> > non-NULL makes me feel a little unsafe. While I am fine if removing the
> > BUG_ON, because there's no worry in the current code. We can wait and
> > see in the future.
> > 
> >        if (vm) {
> >                BUG_ON(va_flags & VMAP_RAM);
> >                setup_vmalloc_vm(vm, va, flags, caller);
> >        }
> > 
> I would remove it, because it is really hard to mess it, there is only
> one place also BUG_ON() is really a show stopper. I really appreciate
> what rulinhuang <rulin.huang@intel.com> is doing and i understand that
> it might be not so easy.

I agree. I was hesitant, and this settles it for me.

> 
> So, if we can avoid of moving the code, that looks to me that we can do,
> if we can pass less arguments into alloc_vmap_area() since it is overloaded 
> that would be great.

Agreed too, fewer arguments is much better. Personally, though, I slightly
prefer open coding, as below. The __pre/__post_setup_vmalloc_vm()
wrappers look like excessive packaging to me; they are only a few very
simple assignments, after all.

---
 mm/vmalloc.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fd8ebaad17b..0c738423976d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1924,8 +1924,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
 				int node, gfp_t gfp_mask,
-				unsigned long va_flags, struct vm_struct *vm,
-				unsigned long flags, const void *caller)
+				unsigned long va_flags, struct vm_struct *vm)
 {
 	struct vmap_node *vn;
 	struct vmap_area *va;
@@ -1988,8 +1987,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = (va_flags | vn_id);
 
-	if (vm)
-		setup_vmalloc_vm(vm, va, flags, caller);
+	if (vm) {
+		vm->addr = (void *)va->va_start;
+		vm->size = va->va_end - va->va_start;
+		va->vm = vm;
+	}
 
 	vn = addr_to_node(va->va_start);
 
@@ -2565,8 +2567,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
 					node, gfp_mask,
-					VMAP_RAM|VMAP_BLOCK, NULL,
-					0, NULL);
+					VMAP_RAM|VMAP_BLOCK, NULL);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -2924,7 +2925,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
 		va = alloc_vmap_area(size, PAGE_SIZE,
 				VMALLOC_START, VMALLOC_END,
 				node, GFP_KERNEL, VMAP_RAM,
-				NULL, 0, NULL);
+				NULL);
 		if (IS_ERR(va))
 			return NULL;
 
@@ -3063,7 +3064,10 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (!(flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
 
-	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller);
+	area->flags = flags;
+	area->caller = caller;
+
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area);
 	if (IS_ERR(va)) {
 		kfree(area);
 		return NULL;
-- 
2.41.0
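
As a rough illustration of why the ordering in this patch needs only one
lock acquisition, here is a userspace sketch under stated assumptions:
struct model_lock just counts acquisitions instead of spinning, struct
model_node stands in for a vmap_node, and all function names are
hypothetical. It compares the old flow (insert under the lock, then take
the lock again to link vm) with the flow above (link vm while va is
still private, then insert once).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lock model: counts acquisitions instead of spinning. */
struct model_lock {
	int held;
	unsigned int acquisitions;
};

static void model_lock_acquire(struct model_lock *lock)
{
	assert(!lock->held);
	lock->held = 1;
	lock->acquisitions++;
}

static void model_lock_release(struct model_lock *lock)
{
	assert(lock->held);
	lock->held = 0;
}

struct vm_struct;

struct vmap_area {
	unsigned long va_start, va_end;
	struct vm_struct *vm;
};

struct vm_struct {
	void *addr;
	unsigned long size;
};

/* Stand-in for a vmap_node: one lock plus the last area published. */
struct model_node {
	struct model_lock busy;
	struct vmap_area *published;
};

static void link_vm(struct vm_struct *vm, struct vmap_area *va)
{
	vm->addr = (void *)va->va_start;
	vm->size = va->va_end - va->va_start;
	va->vm = vm;
}

/* Old flow: publish under the lock, then lock again to link vm. */
static void publish_then_setup(struct model_node *vn, struct vmap_area *va,
			       struct vm_struct *vm)
{
	model_lock_acquire(&vn->busy);
	vn->published = va;
	model_lock_release(&vn->busy);

	model_lock_acquire(&vn->busy);
	link_vm(vm, va);
	model_lock_release(&vn->busy);
}

/* New flow: link vm while va is still private, then publish once. */
static void setup_then_publish(struct model_node *vn, struct vmap_area *va,
			       struct vm_struct *vm)
{
	if (vm)
		link_vm(vm, va);

	model_lock_acquire(&vn->busy);
	vn->published = va;
	model_lock_release(&vn->busy);
}
```

Running both functions on fresh nodes leaves the acquisition counter at
two for publish_then_setup() and at one for setup_then_publish(), while
vm and va end up linked identically in both cases.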


* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-08  8:23         ` Baoquan He
@ 2024-03-08 10:28           ` Uladzislau Rezki
  2024-03-09  4:54             ` Baoquan He
  0 siblings, 1 reply; 16+ messages in thread
From: Uladzislau Rezki @ 2024-03-08 10:28 UTC (permalink / raw)
  To: Baoquan He
  Cc: Uladzislau Rezki, rulinhuang, akpm, colin.king, hch,
	linux-kernel, linux-mm, lstoakes, tianyou.li, tim.c.chen,
	wangyang.guo, zhiguo.zhou

> > I would remove it, because it is really hard to mess it, there is only
> > one place also BUG_ON() is really a show stopper. I really appreciate
> > what rulinhuang <rulin.huang@intel.com> is doing and i understand that
> > it might be not so easy.
> 
> I agree, I was hesitant, now it firms up my mind.
> 
> > 
> > So, if we can avoid of moving the code, that looks to me that we can do,
> > if we can pass less arguments into alloc_vmap_area() since it is overloaded 
> > that would be great.
> 
> Agree too, less arguments is much better. While I personnally prefer the open
> coding a little bit like below. There is suspicion of excessive packaging in
> __pre/__post_setup_vmalloc_vm() wrapping. They are very simple and few
> assignments after all. 
> 
> ---
>  mm/vmalloc.c | 20 ++++++++++++--------
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 0fd8ebaad17b..0c738423976d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1924,8 +1924,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  				unsigned long align,
>  				unsigned long vstart, unsigned long vend,
>  				int node, gfp_t gfp_mask,
> -				unsigned long va_flags, struct vm_struct *vm,
> -				unsigned long flags, const void *caller)
> +				unsigned long va_flags, struct vm_struct *vm)
>  {
>  	struct vmap_node *vn;
>  	struct vmap_area *va;
> @@ -1988,8 +1987,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	va->vm = NULL;
>  	va->flags = (va_flags | vn_id);
>  
> -	if (vm)
> -		setup_vmalloc_vm(vm, va, flags, caller);
> +	if (vm) {
> +		vm->addr = (void *)va->va_start;
> +		vm->size = va->va_end - va->va_start;
> +		va->vm = vm;
> +	}
>  
>  	vn = addr_to_node(va->va_start);
>  
> @@ -2565,8 +2567,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
>  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
>  					VMALLOC_START, VMALLOC_END,
>  					node, gfp_mask,
> -					VMAP_RAM|VMAP_BLOCK, NULL,
> -					0, NULL);
> +					VMAP_RAM|VMAP_BLOCK, NULL);
>  	if (IS_ERR(va)) {
>  		kfree(vb);
>  		return ERR_CAST(va);
> @@ -2924,7 +2925,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
>  		va = alloc_vmap_area(size, PAGE_SIZE,
>  				VMALLOC_START, VMALLOC_END,
>  				node, GFP_KERNEL, VMAP_RAM,
> -				NULL, 0, NULL);
> +				NULL);
>  		if (IS_ERR(va))
>  			return NULL;
>  
> @@ -3063,7 +3064,10 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
>  	if (!(flags & VM_NO_GUARD))
>  		size += PAGE_SIZE;
>  
> -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller);
> +	area->flags = flags;
> +	area->caller = caller;
> +
> +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area);
>  	if (IS_ERR(va)) {
>  		kfree(area);
>  		return NULL;
> -- 
> 2.41.0
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Looks even better :) It can be applied on top of:

[PATCH v8] mm/vmalloc: Eliminated the lock contention from twice to once

We are a bit ahead, since v8 will be taken later. Anyway, please add the
Reviewed-by tag when you send a complete patch.

Thanks!

--
Uladzislau Rezki

* Re: [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened
  2024-03-08 10:28           ` Uladzislau Rezki
@ 2024-03-09  4:54             ` Baoquan He
  0 siblings, 0 replies; 16+ messages in thread
From: Baoquan He @ 2024-03-09  4:54 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rulinhuang, akpm, colin.king, hch, linux-kernel, linux-mm,
	lstoakes, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

On 03/08/24 at 11:28am, Uladzislau Rezki wrote:
> > > I would remove it, because it is really hard to mess it, there is only
> > > one place also BUG_ON() is really a show stopper. I really appreciate
> > > what rulinhuang <rulin.huang@intel.com> is doing and i understand that
> > > it might be not so easy.
> > 
> > I agree, I was hesitant, now it firms up my mind.
> > 
> > > 
> > > So, if we can avoid of moving the code, that looks to me that we can do,
> > > if we can pass less arguments into alloc_vmap_area() since it is overloaded 
> > > that would be great.
> > 
> > Agree too, less arguments is much better. While I personnally prefer the open
> > coding a little bit like below. There is suspicion of excessive packaging in
> > __pre/__post_setup_vmalloc_vm() wrapping. They are very simple and few
> > assignments after all. 
> > 
> > ---
> >  mm/vmalloc.c | 20 ++++++++++++--------
> >  1 file changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 0fd8ebaad17b..0c738423976d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -1924,8 +1924,7 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> >  				unsigned long align,
> >  				unsigned long vstart, unsigned long vend,
> >  				int node, gfp_t gfp_mask,
> > -				unsigned long va_flags, struct vm_struct *vm,
> > -				unsigned long flags, const void *caller)
> > +				unsigned long va_flags, struct vm_struct *vm)
> >  {
> >  	struct vmap_node *vn;
> >  	struct vmap_area *va;
> > @@ -1988,8 +1987,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
> >  	va->vm = NULL;
> >  	va->flags = (va_flags | vn_id);
> >  
> > -	if (vm)
> > -		setup_vmalloc_vm(vm, va, flags, caller);
> > +	if (vm) {
> > +		vm->addr = (void *)va->va_start;
> > +		vm->size = va->va_end - va->va_start;
> > +		va->vm = vm;
> > +	}
> >  
> >  	vn = addr_to_node(va->va_start);
> >  
> > @@ -2565,8 +2567,7 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
> >  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
> >  					VMALLOC_START, VMALLOC_END,
> >  					node, gfp_mask,
> > -					VMAP_RAM|VMAP_BLOCK, NULL,
> > -					0, NULL);
> > +					VMAP_RAM|VMAP_BLOCK, NULL);
> >  	if (IS_ERR(va)) {
> >  		kfree(vb);
> >  		return ERR_CAST(va);
> > @@ -2924,7 +2925,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
> >  		va = alloc_vmap_area(size, PAGE_SIZE,
> >  				VMALLOC_START, VMALLOC_END,
> >  				node, GFP_KERNEL, VMAP_RAM,
> > -				NULL, 0, NULL);
> > +				NULL);
> >  		if (IS_ERR(va))
> >  			return NULL;
> >  
> > @@ -3063,7 +3064,10 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
> >  	if (!(flags & VM_NO_GUARD))
> >  		size += PAGE_SIZE;
> >  
> > -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller);
> > +	area->flags = flags;
> > +	area->caller = caller;
> > +
> > +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area);
> >  	if (IS_ERR(va)) {
> >  		kfree(area);
> >  		return NULL;
> > -- 
> > 2.41.0
> > 
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
> 
> Looks even better :) It can be applied on top of:
> 
> [PATCH v8] mm/vmalloc: Eliminated the lock contention from twice to once
> 
> We are a bit ahead since v8 will be taken later. Anyway please use the
> reviewed-by tag once you send a complete patch. 

Thanks, have posted.
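
For readers following along, the division of labour in the open-coded variant quoted above can be modelled in user space. This is only a sketch under stated assumptions: vm_sketch, va_sketch, alloc_bind_sketch() and get_area_sketch() are hypothetical stand-ins, not kernel definitions, and locking and tree insertion are left out entirely. It shows the caller filling in flags and caller before the call, while the allocator side only sets addr and size and binds the two objects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for struct vm_struct and struct vmap_area. */
struct vm_sketch {
	unsigned long flags;
	void *addr;
	unsigned long size;
	const void *caller;
};

struct va_sketch {
	unsigned long va_start;
	unsigned long va_end;
	struct vm_sketch *vm;
};

/*
 * Allocator side: it no longer receives flags or caller. When a vm is
 * passed in, it only records the address range and links va to vm.
 */
static void alloc_bind_sketch(struct va_sketch *va, struct vm_sketch *vm)
{
	va->vm = NULL;
	if (vm) {
		vm->addr = (void *)va->va_start;
		vm->size = va->va_end - va->va_start;
		va->vm = vm;
	}
}

/*
 * Caller side, loosely modelled on the __get_vm_area_node() hunk: the
 * fields the allocator does not need are assigned before the call.
 */
static void get_area_sketch(struct va_sketch *va, struct vm_sketch *area,
			    unsigned long flags, const void *caller)
{
	area->flags = flags;
	area->caller = caller;
	alloc_bind_sketch(va, area);
}
```

Passing NULL as the vm, as the VMAP_RAM call sites in the diff do, leaves va->vm unset in this model.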



* [PATCH v7 2/2] mm/vmalloc: Eliminated the lock contention from twice to once
  2024-03-01 15:54 [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading rulinhuang
  2024-03-01 15:54 ` [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened rulinhuang
@ 2024-03-01 15:54 ` rulinhuang
  2024-03-06 13:55   ` Baoquan He
  2024-03-06  9:18 ` [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading Huang, Rulin
  2 siblings, 1 reply; 16+ messages in thread
From: rulinhuang @ 2024-03-01 15:54 UTC (permalink / raw)
  To: urezki, bhe
  Cc: akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	rulin.huang, tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

When allocating a new memory area where the mapping address range is
known, it is observed that the vmap_node->busy.lock is acquired twice.

The first acquisition occurs in the alloc_vmap_area() function when
inserting the vm area into the vm mapping red-black tree. The second
acquisition occurs in the setup_vmalloc_vm() function when updating the
properties of the vm, such as flags and address.

Combine these two operations in alloc_vmap_area() so that the lock is
acquired once instead of twice, which improves scalability when the
vmap_node->busy.lock is contended.

With the above change, tested on an Intel Sapphire Rapids platform
(224 vCPUs), a 4% performance improvement is gained on
stress-ng/pthread (https://github.com/ColinIanKing/stress-ng),
which is a stress test of thread creation.

Co-developed-by: "Chen, Tim C" <tim.c.chen@intel.com>
Signed-off-by: "Chen, Tim C" <tim.c.chen@intel.com>
Co-developed-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: "King, Colin" <colin.king@intel.com>
Signed-off-by: rulinhuang <rulin.huang@intel.com>
---
V1 -> V2: Avoided the partial initialization issue of vm and
separated insert_vmap_area() from alloc_vmap_area()
V2 -> V3: Rebased on 6.8-rc5
V3 -> V4: Rebased on mm-unstable branch
V4 -> V5: Canceled the split of alloc_vmap_area()
and kept insert_vmap_area()
V5 -> V6: Added BUG_ON()
V6 -> V7: Adjusted the macros
---
 mm/vmalloc.c | 52 ++++++++++++++++++++++++----------------------------
 1 file changed, 24 insertions(+), 28 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index fc027a61c12e..5b7c9156d8da 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1972,15 +1972,26 @@ node_alloc(unsigned long size, unsigned long align,
 	return va;
 }
 
+static inline void setup_vmalloc_vm(struct vm_struct *vm,
+	struct vmap_area *va, unsigned long flags, const void *caller)
+{
+	vm->flags = flags;
+	vm->addr = (void *)va->va_start;
+	vm->size = va->va_end - va->va_start;
+	vm->caller = caller;
+	va->vm = vm;
+}
+
 /*
  * Allocate a region of KVA of the specified size and alignment, within the
- * vstart and vend.
+ * vstart and vend. If vm is passed in, the two will also be bound.
  */
 static struct vmap_area *alloc_vmap_area(unsigned long size,
 				unsigned long align,
 				unsigned long vstart, unsigned long vend,
 				int node, gfp_t gfp_mask,
-				unsigned long va_flags)
+				unsigned long va_flags, struct vm_struct *vm,
+				unsigned long flags, const void *caller)
 {
 	struct vmap_node *vn;
 	struct vmap_area *va;
@@ -2043,6 +2054,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
 	va->vm = NULL;
 	va->flags = (va_flags | vn_id);
 
+	if (vm) {
+		BUG_ON(va_flags & VMAP_RAM);
+		setup_vmalloc_vm(vm, va, flags, caller);
+	}
+
 	vn = addr_to_node(va->va_start);
 
 	spin_lock(&vn->busy.lock);
@@ -2486,7 +2502,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
 	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
 					VMALLOC_START, VMALLOC_END,
 					node, gfp_mask,
-					VMAP_RAM|VMAP_BLOCK);
+					VMAP_RAM|VMAP_BLOCK, NULL,
+					0, NULL);
 	if (IS_ERR(va)) {
 		kfree(vb);
 		return ERR_CAST(va);
@@ -2843,7 +2860,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
 		struct vmap_area *va;
 		va = alloc_vmap_area(size, PAGE_SIZE,
 				VMALLOC_START, VMALLOC_END,
-				node, GFP_KERNEL, VMAP_RAM);
+				node, GFP_KERNEL, VMAP_RAM,
+				NULL, 0, NULL);
 		if (IS_ERR(va))
 			return NULL;
 
@@ -2946,26 +2964,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }
 
-static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
-	struct vmap_area *va, unsigned long flags, const void *caller)
-{
-	vm->flags = flags;
-	vm->addr = (void *)va->va_start;
-	vm->size = va->va_end - va->va_start;
-	vm->caller = caller;
-	va->vm = vm;
-}
-
-static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
-			      unsigned long flags, const void *caller)
-{
-	struct vmap_node *vn = addr_to_node(va->va_start);
-
-	spin_lock(&vn->busy.lock);
-	setup_vmalloc_vm_locked(vm, va, flags, caller);
-	spin_unlock(&vn->busy.lock);
-}
-
 static void clear_vm_uninitialized_flag(struct vm_struct *vm)
 {
 	/*
@@ -3002,14 +3000,12 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
 	if (!(flags & VM_NO_GUARD))
 		size += PAGE_SIZE;
 
-	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
+	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller);
 	if (IS_ERR(va)) {
 		kfree(area);
 		return NULL;
 	}
 
-	setup_vmalloc_vm(area, va, flags, caller);
-
 	/*
 	 * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a
 	 * best-effort approach, as they can be mapped outside of vmalloc code.
@@ -4584,7 +4580,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
 
 		spin_lock(&vn->busy.lock);
 		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
-		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
+		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
 				 pcpu_get_vm_areas);
 		spin_unlock(&vn->busy.lock);
 	}
-- 
2.43.0
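
The locking change described in the commit message can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: node_sketch, va_sketch, vm_sketch and the helper functions are hypothetical names, a pthread mutex stands in for vmap_node->busy.lock, and a single pointer stands in for the busy red-black tree. The acquisitions counter exists only to make the difference between the two paths observable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct vm_sketch {
	unsigned long flags;
	void *addr;
	unsigned long size;
	const void *caller;
};

struct va_sketch {
	unsigned long va_start;
	unsigned long va_end;
	struct vm_sketch *vm;
};

struct node_sketch {
	pthread_mutex_t lock;       /* plays the role of vmap_node->busy.lock */
	unsigned long acquisitions; /* how many times the lock was taken */
	struct va_sketch *inserted; /* stands in for the busy tree */
};

static void node_lock(struct node_sketch *vn)
{
	pthread_mutex_lock(&vn->lock);
	vn->acquisitions++;
}

static void node_unlock(struct node_sketch *vn)
{
	pthread_mutex_unlock(&vn->lock);
}

/* Same assignments as the lock-free setup_vmalloc_vm() helper in the patch. */
static void bind_vm(struct vm_sketch *vm, struct va_sketch *va,
		    unsigned long flags, const void *caller)
{
	vm->flags = flags;
	vm->addr = (void *)va->va_start;
	vm->size = va->va_end - va->va_start;
	vm->caller = caller;
	va->vm = vm;
}

/* Old shape: insert under the lock, then take the lock again to bind. */
static void alloc_then_setup(struct node_sketch *vn, struct va_sketch *va,
			     struct vm_sketch *vm, unsigned long flags,
			     const void *caller)
{
	node_lock(vn);
	vn->inserted = va;
	node_unlock(vn);

	node_lock(vn);
	bind_vm(vm, va, flags, caller);
	node_unlock(vn);
}

/* New shape: bind while va is not yet published, then insert once. */
static void alloc_with_vm(struct node_sketch *vn, struct va_sketch *va,
			  struct vm_sketch *vm, unsigned long flags,
			  const void *caller)
{
	va->vm = NULL;
	if (vm)
		bind_vm(vm, va, flags, caller);

	node_lock(vn);
	vn->inserted = va;
	node_unlock(vn);
}
```

In this model the binding can happen outside the lock because nothing else can reach va before it is inserted; whether that reasoning carries over to every kernel call site is for the patch review to establish, not this sketch.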



* Re: [PATCH v7 2/2] mm/vmalloc: Eliminated the lock contention from twice to once
  2024-03-01 15:54 ` [PATCH v7 2/2] mm/vmalloc: Eliminated the lock contention from twice to once rulinhuang
@ 2024-03-06 13:55   ` Baoquan He
  0 siblings, 0 replies; 16+ messages in thread
From: Baoquan He @ 2024-03-06 13:55 UTC (permalink / raw)
  To: rulinhuang
  Cc: urezki, akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou

On 03/01/24 at 10:54am, rulinhuang wrote:
> When allocating a new memory area where the mapping address range is
> known, it is observed that the vmap_node->busy.lock is acquired twice.
> 
> The first acquisition occurs in the alloc_vmap_area() function when
> inserting the vm area into the vm mapping red-black tree. The second
> acquisition occurs in the setup_vmalloc_vm() function when updating the
> properties of the vm, such as flags and address.
> 
> Combine these two operations in alloc_vmap_area() so that the lock is
> acquired once instead of twice, which improves scalability when the
> vmap_node->busy.lock is contended.
> 
> With the above change, tested on an Intel Sapphire Rapids platform
> (224 vCPUs), a 4% performance improvement is gained on
> stress-ng/pthread (https://github.com/ColinIanKing/stress-ng),
> which is a stress test of thread creation.
> 
> Co-developed-by: "Chen, Tim C" <tim.c.chen@intel.com>
> Signed-off-by: "Chen, Tim C" <tim.c.chen@intel.com>
> Co-developed-by: "King, Colin" <colin.king@intel.com>
> Signed-off-by: "King, Colin" <colin.king@intel.com>
> Signed-off-by: rulinhuang <rulin.huang@intel.com>
> ---
> V1 -> V2: Avoided the partial initialization issue of vm and
> separated insert_vmap_area() from alloc_vmap_area()
> V2 -> V3: Rebased on 6.8-rc5
> V3 -> V4: Rebased on mm-unstable branch
> V4 -> V5: Canceled the split of alloc_vmap_area()
> and kept insert_vmap_area()
> V5 -> V6: Added BUG_ON()
> V6 -> V7: Adjusted the macros
> ---
>  mm/vmalloc.c | 52 ++++++++++++++++++++++++----------------------------
>  1 file changed, 24 insertions(+), 28 deletions(-)

LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index fc027a61c12e..5b7c9156d8da 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1972,15 +1972,26 @@ node_alloc(unsigned long size, unsigned long align,
>  	return va;
>  }
>  
> +static inline void setup_vmalloc_vm(struct vm_struct *vm,
> +	struct vmap_area *va, unsigned long flags, const void *caller)
> +{
> +	vm->flags = flags;
> +	vm->addr = (void *)va->va_start;
> +	vm->size = va->va_end - va->va_start;
> +	vm->caller = caller;
> +	va->vm = vm;
> +}
> +
>  /*
>   * Allocate a region of KVA of the specified size and alignment, within the
> - * vstart and vend.
> + * vstart and vend. If vm is passed in, the two will also be bound.
>   */
>  static struct vmap_area *alloc_vmap_area(unsigned long size,
>  				unsigned long align,
>  				unsigned long vstart, unsigned long vend,
>  				int node, gfp_t gfp_mask,
> -				unsigned long va_flags)
> +				unsigned long va_flags, struct vm_struct *vm,
> +				unsigned long flags, const void *caller)
>  {
>  	struct vmap_node *vn;
>  	struct vmap_area *va;
> @@ -2043,6 +2054,11 @@ static struct vmap_area *alloc_vmap_area(unsigned long size,
>  	va->vm = NULL;
>  	va->flags = (va_flags | vn_id);
>  
> +	if (vm) {
> +		BUG_ON(va_flags & VMAP_RAM);
> +		setup_vmalloc_vm(vm, va, flags, caller);
> +	}
> +
>  	vn = addr_to_node(va->va_start);
>  
>  	spin_lock(&vn->busy.lock);
> @@ -2486,7 +2502,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
>  	va = alloc_vmap_area(VMAP_BLOCK_SIZE, VMAP_BLOCK_SIZE,
>  					VMALLOC_START, VMALLOC_END,
>  					node, gfp_mask,
> -					VMAP_RAM|VMAP_BLOCK);
> +					VMAP_RAM|VMAP_BLOCK, NULL,
> +					0, NULL);
>  	if (IS_ERR(va)) {
>  		kfree(vb);
>  		return ERR_CAST(va);
> @@ -2843,7 +2860,8 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node)
>  		struct vmap_area *va;
>  		va = alloc_vmap_area(size, PAGE_SIZE,
>  				VMALLOC_START, VMALLOC_END,
> -				node, GFP_KERNEL, VMAP_RAM);
> +				node, GFP_KERNEL, VMAP_RAM,
> +				NULL, 0, NULL);
>  		if (IS_ERR(va))
>  			return NULL;
>  
> @@ -2946,26 +2964,6 @@ void __init vm_area_register_early(struct vm_struct *vm, size_t align)
>  	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
>  }
>  
> -static inline void setup_vmalloc_vm_locked(struct vm_struct *vm,
> -	struct vmap_area *va, unsigned long flags, const void *caller)
> -{
> -	vm->flags = flags;
> -	vm->addr = (void *)va->va_start;
> -	vm->size = va->va_end - va->va_start;
> -	vm->caller = caller;
> -	va->vm = vm;
> -}
> -
> -static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
> -			      unsigned long flags, const void *caller)
> -{
> -	struct vmap_node *vn = addr_to_node(va->va_start);
> -
> -	spin_lock(&vn->busy.lock);
> -	setup_vmalloc_vm_locked(vm, va, flags, caller);
> -	spin_unlock(&vn->busy.lock);
> -}
> -
>  static void clear_vm_uninitialized_flag(struct vm_struct *vm)
>  {
>  	/*
> @@ -3002,14 +3000,12 @@ static struct vm_struct *__get_vm_area_node(unsigned long size,
>  	if (!(flags & VM_NO_GUARD))
>  		size += PAGE_SIZE;
>  
> -	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0);
> +	va = alloc_vmap_area(size, align, start, end, node, gfp_mask, 0, area, flags, caller);
>  	if (IS_ERR(va)) {
>  		kfree(area);
>  		return NULL;
>  	}
>  
> -	setup_vmalloc_vm(area, va, flags, caller);
> -
>  	/*
>  	 * Mark pages for non-VM_ALLOC mappings as accessible. Do it now as a
>  	 * best-effort approach, as they can be mapped outside of vmalloc code.
> @@ -4584,7 +4580,7 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets,
>  
>  		spin_lock(&vn->busy.lock);
>  		insert_vmap_area(vas[area], &vn->busy.root, &vn->busy.head);
> -		setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
> +		setup_vmalloc_vm(vms[area], vas[area], VM_ALLOC,
>  				 pcpu_get_vm_areas);
>  		spin_unlock(&vn->busy.lock);
>  	}
> -- 
> 2.43.0
> 



* Re: [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading
  2024-03-01 15:54 [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading rulinhuang
  2024-03-01 15:54 ` [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened rulinhuang
  2024-03-01 15:54 ` [PATCH v7 2/2] mm/vmalloc: Eliminated the lock contention from twice to once rulinhuang
@ 2024-03-06  9:18 ` Huang, Rulin
  2 siblings, 0 replies; 16+ messages in thread
From: Huang, Rulin @ 2024-03-06  9:18 UTC (permalink / raw)
  To: urezki, bhe
  Cc: akpm, colin.king, hch, linux-kernel, linux-mm, lstoakes,
	tianyou.li, tim.c.chen, wangyang.guo, zhiguo.zhou, rulin.huang

Hello, are there any issues with this patch that need to be addressed? If
there are, we will fix them as soon as possible. Thank you.

On 2024/3/1 23:54, rulinhuang wrote:
> Hi,
> 
> This version rearranges the macros relative to the previous one.
> 
> We are not sure whether we have completely moved these macros and 
> their corresponding helper to the correct position. Could you please 
> help to check whether they are correct?
> 
> ~
> 
> 1. Motivation
> 
> When allocating a new memory area where the mapping address range is 
> known, it is observed that the vmap_node->busy.lock is acquired twice 
> but one of the acquisitions is actually unnecessary.
> 
> 2. Design
> 
> Among the two acquisitions, the first one occurs in the 
> alloc_vmap_area() function when inserting the vm area into the vm 
> mapping red-black tree, and the second one occurs in the 
> setup_vmalloc_vm() function when updating the properties of the vm, 
> such as flags and address.
> 
> Combine these two operations in alloc_vmap_area() so that the lock is
> acquired once instead of twice, which improves scalability when the
> vmap_node->busy.lock is contended.
> 
> 3. Test results
> 
> With the above change, tested on an Intel Sapphire Rapids platform
> (224 vCPUs), a 4% performance improvement is gained on
> stress-ng/pthread (https://github.com/ColinIanKing/stress-ng),
> which is a stress test of thread creation.
> 
> rulinhuang
> 
> [v1] https://lore.kernel.org/all/20240207033059.1565623-1-rulin.huang@intel.com/
> [v2] https://lore.kernel.org/all/20240220090521.3316345-1-rulin.huang@intel.com/
> [v3] https://lore.kernel.org/all/20240221032905.11392-1-rulin.huang@intel.com/
> [v4] https://lore.kernel.org/all/20240222120536.216166-1-rulin.huang@intel.com/
> [v5] https://lore.kernel.org/all/20240223130318.112198-2-rulin.huang@intel.com/
> [v6] https://lore.kernel.org/lkml/aa8f0413-d055-4b49-bcd3-401e93e01c6d@intel.com/
> 
> 
> rulinhuang (2):
>   mm/vmalloc: Moved macros with no functional change happened
>   mm/vmalloc: Eliminated the lock contention from twice to once
> 
>  mm/vmalloc.c | 314 +++++++++++++++++++++++++--------------------------
>  1 file changed, 155 insertions(+), 159 deletions(-)
> 
> 
> base-commit: 10c2cf5fe97647d68ee89b1f921e982e71519f20


Thread overview: 16+ messages
2024-03-01 15:54 [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading rulinhuang
2024-03-01 15:54 ` [PATCH v7 1/2] mm/vmalloc: Moved macros with no functional change happened rulinhuang
2024-03-06 13:23   ` Baoquan He
2024-03-06 19:01   ` Uladzislau Rezki
2024-03-07  1:23     ` Baoquan He
2024-03-07  3:01       ` Huang, Rulin
2024-03-07  3:32         ` Baoquan He
2024-03-07  5:48           ` Huang, Rulin
2024-03-07 19:53             ` Uladzislau Rezki
2024-03-07 19:16       ` Uladzislau Rezki
2024-03-08  8:23         ` Baoquan He
2024-03-08 10:28           ` Uladzislau Rezki
2024-03-09  4:54             ` Baoquan He
2024-03-01 15:54 ` [PATCH v7 2/2] mm/vmalloc: Eliminated the lock contention from twice to once rulinhuang
2024-03-06 13:55   ` Baoquan He
2024-03-06  9:18 ` [PATCH v7 0/2] mm/vmalloc: lock contention optimization under multi-threading Huang, Rulin
