[PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork()

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork()
@ 2023-08-30 12:56 Peng Zhang
  2023-08-30 12:56 ` [PATCH v2 1/6] maple_tree: Add two helpers Peng Zhang
                   ` (7 more replies)
  0 siblings, 8 replies; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

In the process of duplicating mmap in fork(), VMAs will be inserted into the new
maple tree one by one. When inserting into the maple tree, the maple tree will
be rebalanced multiple times. The rebalancing of maple tree is not as fast as
the rebalancing of red-black tree and will be slower. Therefore, __mt_dup() is
introduced to directly duplicate the structure of the old maple tree, and then
modify each element of the new maple tree. This avoids rebalancing and some extra
copying, so is faster than the original method.
More information can refer to [1].

There is a "spawn" in byte-unixbench[2], which can be used to test the performance
of fork(). I modified it slightly to make it work with different number of VMAs.

Below are the test numbers. There are 21 VMAs by default. The first row indicates
the number of added VMAs. The following two lines are the number of fork() calls
every 10 seconds. These numbers are different from the test results in v1 because
this time the benchmark is bound to a CPU. This way the numbers are more stable.

  Increment of VMAs: 0      100     200     400     800     1600    3200    6400
6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
                     +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%

Todo:
  - Update the documentation.

Changes since v1:
 - Reimplement __mt_dup() and mtree_dup(). Loops are implemented without using
   goto instructions.
 - The new tree also needs to be locked to avoid some lockdep warnings.
 - Drop and add some helpers.
 - Add test for duplicating full tree.
 - Drop mas_replace_entry(), it doesn't seem to have a big impact on the
   performance of fork().

[1] https://lore.kernel.org/lkml/463899aa-6cbd-f08e-0aca-077b0e4e4475@bytedance.com/
[2] https://github.com/kdlucas/byte-unixbench/tree/master

v1: https://lore.kernel.org/lkml/20230726080916.17454-1-zhangpeng.00@bytedance.com/

Peng Zhang (6):
  maple_tree: Add two helpers
  maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  maple_tree: Add test for mtree_dup()
  maple_tree: Skip other tests when BENCH is enabled
  maple_tree: Update check_forking() and bench_forking()
  fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

 include/linux/maple_tree.h       |   3 +
 kernel/fork.c                    |  34 ++-
 lib/maple_tree.c                 | 277 ++++++++++++++++++++++++-
 lib/test_maple_tree.c            |  69 +++---
 mm/mmap.c                        |  14 +-
 tools/testing/radix-tree/maple.c | 346 +++++++++++++++++++++++++++++++
 6 files changed, 697 insertions(+), 46 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 1/6] maple_tree: Add two helpers
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
@ 2023-08-30 12:56 ` Peng Zhang
  2023-09-07 20:13   ` Liam R. Howlett
  2023-08-30 12:56 ` [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup() Peng Zhang
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

Add two helpers, which will be used later.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 lib/maple_tree.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ee1ff0c59fd7..ef234cf02e3e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -165,6 +165,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
 	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
 }
 
+static inline void mt_free_one(struct maple_node *node)
+{
+	kmem_cache_free(maple_node_cache, node);
+}
+
 static inline void mt_free_bulk(size_t size, void __rcu **nodes)
 {
 	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
@@ -205,6 +210,11 @@ static unsigned int mas_mt_height(struct ma_state *mas)
 	return mt_height(mas->tree);
 }
 
+static inline unsigned int mt_attr(struct maple_tree *mt)
+{
+	return mt->ma_flags & ~MT_FLAGS_HEIGHT_MASK;
+}
+
 static inline enum maple_type mte_node_type(const struct maple_enode *entry)
 {
 	return ((unsigned long)entry >> MAPLE_NODE_TYPE_SHIFT) &
@@ -5520,7 +5530,7 @@ void mas_destroy(struct ma_state *mas)
 			mt_free_bulk(count, (void __rcu **)&node->slot[1]);
 			total -= count;
 		}
-		kmem_cache_free(maple_node_cache, node);
+		mt_free_one(ma_mnode_ptr(node));
 		total--;
 	}
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
  2023-08-30 12:56 ` [PATCH v2 1/6] maple_tree: Add two helpers Peng Zhang
@ 2023-08-30 12:56 ` Peng Zhang
  2023-09-07 20:13   ` Liam R. Howlett
  2023-08-30 12:56 ` [PATCH v2 3/6] maple_tree: Add test for mtree_dup() Peng Zhang
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree. Compared with traversing the source tree and
reinserting entry by entry in the new tree, it has better performance.
The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 include/linux/maple_tree.h |   3 +
 lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
 2 files changed, 268 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index e41c70ac7744..44fe8a57ecbd 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
 		void *entry, gfp_t gfp);
 void *mtree_erase(struct maple_tree *mt, unsigned long index);
 
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+
 void mtree_destroy(struct maple_tree *mt);
 void __mt_destroy(struct maple_tree *mt);
 
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ef234cf02e3e..8f841682269c 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
 }
 EXPORT_SYMBOL(mtree_erase);
 
+/*
+ * mas_dup_free() - Free a half-constructed tree.
+ * @mas: Points to the last node of the half-constructed tree.
+ *
+ * This function frees all nodes starting from @mas->node in the reverse order
+ * of mas_dup_build(). There is no need to hold the source tree lock at this
+ * time.
+ */
+static void mas_dup_free(struct ma_state *mas)
+{
+	struct maple_node *node;
+	enum maple_type type;
+	void __rcu **slots;
+	unsigned char count, i;
+
+	/* Maybe the first node allocation failed. */
+	if (!mas->node)
+		return;
+
+	while (!mte_is_root(mas->node)) {
+		mas_ascend(mas);
+
+		if (mas->offset) {
+			mas->offset--;
+			do {
+				mas_descend(mas);
+				mas->offset = mas_data_end(mas);
+			} while (!mte_is_leaf(mas->node));
+
+			mas_ascend(mas);
+		}
+
+		node = mte_to_node(mas->node);
+		type = mte_node_type(mas->node);
+		slots = (void **)ma_slots(node, type);
+		count = mas_data_end(mas) + 1;
+		for (i = 0; i < count; i++)
+			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
+
+		mt_free_bulk(count, slots);
+	}
+
+	node = mte_to_node(mas->node);
+	mt_free_one(node);
+}
+
+/*
+ * mas_copy_node() - Copy a maple node and allocate child nodes.
+ * @mas: Points to the source node.
+ * @new_mas: Points to the new node.
+ * @parent: The parent node of the new node.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * Copy @mas->node to @new_mas->node, set @parent to be the parent of
+ * @new_mas->node and allocate new child nodes for @new_mas->node.
+ * If memory allocation fails, @mas is set to -ENOMEM.
+ */
+static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
+		struct maple_node *parent, gfp_t gfp)
+{
+	struct maple_node *node = mte_to_node(mas->node);
+	struct maple_node *new_node = mte_to_node(new_mas->node);
+	enum maple_type type;
+	unsigned long val;
+	unsigned char request, count, i;
+	void __rcu **slots;
+	void __rcu **new_slots;
+
+	/* Copy the node completely. */
+	memcpy(new_node, node, sizeof(struct maple_node));
+
+	/* Update the parent node pointer. */
+	if (unlikely(ma_is_root(node)))
+		val = MA_ROOT_PARENT;
+	else
+		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
+
+	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
+
+	if (mte_is_leaf(mas->node))
+		return;
+
+	/* Allocate memory for child nodes. */
+	type = mte_node_type(mas->node);
+	new_slots = ma_slots(new_node, type);
+	request = mas_data_end(mas) + 1;
+	count = mt_alloc_bulk(gfp, request, new_slots);
+	if (unlikely(count < request)) {
+		if (count)
+			mt_free_bulk(count, new_slots);
+		mas_set_err(mas, -ENOMEM);
+		return;
+	}
+
+	/* Restore node type information in slots. */
+	slots = ma_slots(node, type);
+	for (i = 0; i < count; i++)
+		((unsigned long *)new_slots)[i] |=
+			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
+			MAPLE_NODE_MASK);
+}
+
+/*
+ * mas_dup_build() - Build a new maple tree from a source tree
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * This function builds a new tree in DFS preorder. If the memory allocation
+ * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
+ * last node. mas_dup_free() will free the half-constructed tree.
+ *
+ * Note that the attributes of the two trees must be exactly the same, and the
+ * new tree must be empty, otherwise -EINVAL will be returned.
+ */
+static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
+		gfp_t gfp)
+{
+	struct maple_node *node, *parent;
+	struct maple_enode *root;
+	enum maple_type type;
+
+	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
+	    unlikely(!mtree_empty(new_mas->tree))) {
+		mas_set_err(mas, -EINVAL);
+		return;
+	}
+
+	mas_start(mas);
+	if (mas_is_ptr(mas) || mas_is_none(mas)) {
+		/*
+		 * The attributes of the two trees must be the same before this.
+		 * The following assignment makes them the same height.
+		 */
+		new_mas->tree->ma_flags = mas->tree->ma_flags;
+		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
+		return;
+	}
+
+	node = mt_alloc_one(gfp);
+	if (!node) {
+		new_mas->node = NULL;
+		mas_set_err(mas, -ENOMEM);
+		return;
+	}
+
+	type = mte_node_type(mas->node);
+	root = mt_mk_node(node, type);
+	new_mas->node = root;
+	new_mas->min = 0;
+	new_mas->max = ULONG_MAX;
+	parent = ma_mnode_ptr(new_mas->tree);
+
+	while (1) {
+		mas_copy_node(mas, new_mas, parent, gfp);
+
+		if (unlikely(mas_is_err(mas)))
+			return;
+
+		/* Once we reach a leaf, we need to ascend, or end the loop. */
+		if (mte_is_leaf(mas->node)) {
+			if (mas->max == ULONG_MAX) {
+				new_mas->tree->ma_flags = mas->tree->ma_flags;
+				rcu_assign_pointer(new_mas->tree->ma_root,
+						   mte_mk_root(root));
+				break;
+			}
+
+			do {
+				/*
+				 * Must not at the root node, because we've
+				 * already end the loop when we reach the last
+				 * leaf.
+				 */
+				mas_ascend(mas);
+				mas_ascend(new_mas);
+			} while (mas->offset == mas_data_end(mas));
+
+			mas->offset++;
+			new_mas->offset++;
+		}
+
+		mas_descend(mas);
+		parent = mte_to_node(new_mas->node);
+		mas_descend(new_mas);
+		mas->offset = 0;
+		new_mas->offset = 0;
+	}
+}
+
+/**
+ * __mt_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree using a faster method than traversing
+ * the source tree and inserting entries into the new tree one by one.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ * Note that the user needs to manually lock the source tree and the new tree.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
+ * the attributes of the two trees are different or the new tree is not an empty
+ * tree.
+ */
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+	int ret = 0;
+	MA_STATE(mas, mt, 0, 0);
+	MA_STATE(new_mas, new, 0, 0);
+
+	mas_dup_build(&mas, &new_mas, gfp);
+
+	if (unlikely(mas_is_err(&mas))) {
+		ret = xa_err(mas.node);
+		if (ret == -ENOMEM)
+			mas_dup_free(&new_mas);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(__mt_dup);
+
+/**
+ * mtree_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree using a faster method than traversing
+ * the source tree and inserting entries into the new tree one by one.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
+ * the attributes of the two trees are different or the new tree is not an empty
+ * tree.
+ */
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+	int ret = 0;
+	MA_STATE(mas, mt, 0, 0);
+	MA_STATE(new_mas, new, 0, 0);
+
+	mas_lock(&new_mas);
+	mas_lock(&mas);
+
+	mas_dup_build(&mas, &new_mas, gfp);
+	mas_unlock(&mas);
+
+	if (unlikely(mas_is_err(&mas))) {
+		ret = xa_err(mas.node);
+		if (ret == -ENOMEM)
+			mas_dup_free(&new_mas);
+	}
+
+	mas_unlock(&new_mas);
+
+	return ret;
+}
+EXPORT_SYMBOL(mtree_dup);
+
 /**
  * __mt_destroy() - Walk and free all nodes of a locked maple tree.
  * @mt: The maple tree
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 3/6] maple_tree: Add test for mtree_dup()
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
  2023-08-30 12:56 ` [PATCH v2 1/6] maple_tree: Add two helpers Peng Zhang
  2023-08-30 12:56 ` [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup() Peng Zhang
@ 2023-08-30 12:56 ` Peng Zhang
  2023-09-07 20:13   ` Liam R. Howlett
  2023-08-30 12:56 ` [PATCH v2 4/6] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

Add test for mtree_dup().

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 tools/testing/radix-tree/maple.c | 344 +++++++++++++++++++++++++++++++
 1 file changed, 344 insertions(+)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index e5da1cad70ba..38455916331e 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35857,6 +35857,346 @@ static noinline void __init check_locky(struct maple_tree *mt)
 	mt_clear_in_rcu(mt);
 }
 
+/*
+ * Compare two nodes and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_node(struct maple_enode *enode_a,
+			       struct maple_enode *enode_b)
+{
+	struct maple_node *node_a, *node_b;
+	struct maple_node a, b;
+	void **slots_a, **slots_b; /* Do not use the rcu tag. */
+	enum maple_type type;
+	int i;
+
+	if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
+	    ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
+		pr_err("The lower 8 bits of enode are different.\n");
+		return -1;
+	}
+
+	type = mte_node_type(enode_a);
+	node_a = mte_to_node(enode_a);
+	node_b = mte_to_node(enode_b);
+	a = *node_a;
+	b = *node_b;
+
+	/* Do not compare addresses. */
+	if (ma_is_root(node_a) || ma_is_root(node_b)) {
+		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+						  MA_ROOT_PARENT);
+		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+						  MA_ROOT_PARENT);
+	} else {
+		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+						  MAPLE_NODE_MASK);
+		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+						  MAPLE_NODE_MASK);
+	}
+
+	if (a.parent != b.parent) {
+		pr_err("The lower 8 bits of parents are different. %p %p\n",
+			a.parent, b.parent);
+		return -1;
+	}
+
+	/*
+	 * If it is a leaf node, the slots do not contain the node address, and
+	 * no special processing of slots is required.
+	 */
+	if (ma_is_leaf(type))
+		goto cmp;
+
+	slots_a = ma_slots(&a, type);
+	slots_b = ma_slots(&b, type);
+
+	for (i = 0; i < mt_slots[type]; i++) {
+		if (!slots_a[i] && !slots_b[i])
+			break;
+
+		if (!slots_a[i] || !slots_b[i]) {
+			pr_err("The number of slots is different.\n");
+			return -1;
+		}
+
+		/* Do not compare addresses in slots. */
+		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
+		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
+	}
+
+cmp:
+	/*
+	 * Compare all contents of two nodes, including parent (except address),
+	 * slots (except address), pivots, gaps and metadata.
+	 */
+	return memcmp(&a, &b, sizeof(struct maple_node));
+}
+
+/*
+ * Compare two trees and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
+{
+	MA_STATE(mas_a, mt_a, 0, 0);
+	MA_STATE(mas_b, mt_b, 0, 0);
+
+	if (mt_a->ma_flags != mt_b->ma_flags) {
+		pr_err("The flags of the two trees are different.\n");
+		return -1;
+	}
+
+	mas_dfs_preorder(&mas_a);
+	mas_dfs_preorder(&mas_b);
+
+	if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
+		if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
+			pr_err("One is MAS_ROOT and the other is not.\n");
+			return -1;
+		}
+		return 0;
+	}
+
+	while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
+
+		if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
+			pr_err("One is MAS_NONE and the other is not.\n");
+			return -1;
+		}
+
+		if (mas_a.min != mas_b.min ||
+		    mas_a.max != mas_b.max) {
+			pr_err("mas->min, mas->max do not match.\n");
+			return -1;
+		}
+
+		if (compare_node(mas_a.node, mas_b.node)) {
+			pr_err("The contents of nodes %p and %p are different.\n",
+			       mas_a.node, mas_b.node);
+			mt_dump(mt_a, mt_dump_dec);
+			mt_dump(mt_b, mt_dump_dec);
+			return -1;
+		}
+
+		mas_dfs_preorder(&mas_a);
+		mas_dfs_preorder(&mas_b);
+	}
+
+	return 0;
+}
+
+static __init void mas_subtree_max_range(struct ma_state *mas)
+{
+	unsigned long limit = mas->max;
+	MA_STATE(newmas, mas->tree, 0, 0);
+	void *entry;
+
+	mas_for_each(mas, entry, limit) {
+		if (mas->last - mas->index >=
+		    newmas.last - newmas.index) {
+			newmas = *mas;
+		}
+	}
+
+	*mas = newmas;
+}
+
+/*
+ * build_full_tree() - Build a full tree.
+ * @mt: The tree to build.
+ * @flags: Use @flags to build the tree.
+ * @height: The height of the tree to build.
+ *
+ * Build a tree with full leaf nodes and internal nodes. Note that the height
+ * should not exceed 3, otherwise it will take a long time to build.
+ * Return: zero if the build is successful, non-zero if it fails.
+ */
+static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
+		int height)
+{
+	MA_STATE(mas, mt, 0, 0);
+	unsigned long step;
+	int ret = 0, cnt = 1;
+	enum maple_type type;
+
+	mt_init_flags(mt, flags);
+	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
+
+	mtree_lock(mt);
+
+	while (1) {
+		mas_set(&mas, 0);
+		if (mt_height(mt) < height) {
+			mas.max = ULONG_MAX;
+			goto store;
+		}
+
+		while (1) {
+			mas_dfs_preorder(&mas);
+			if (mas_is_none(&mas))
+				goto unlock;
+
+			type = mte_node_type(mas.node);
+			if (mas_data_end(&mas) + 1 < mt_slots[type]) {
+				mas_set(&mas, mas.min);
+				goto store;
+			}
+		}
+store:
+		mas_subtree_max_range(&mas);
+		step = mas.last - mas.index;
+		if (step < 1) {
+			ret = -1;
+			goto unlock;
+		}
+
+		step /= 2;
+		mas.last = mas.index + step;
+		mas_store_gfp(&mas, xa_mk_value(5),
+				GFP_KERNEL);
+		++cnt;
+	}
+unlock:
+	mtree_unlock(mt);
+
+	MT_BUG_ON(mt, mt_height(mt) != height);
+	/* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
+	return ret;
+}
+
+static noinline void __init check_mtree_dup(struct maple_tree *mt)
+{
+	DEFINE_MTREE(new);
+	int i, j, ret, count = 0;
+	unsigned int rand_seed = 17, rand;
+
+	/* store a value at [0, 0] */
+	mt_init_flags(&tree, 0);
+	mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
+	ret = mtree_dup(&tree, &new, GFP_KERNEL);
+	MT_BUG_ON(&new, ret);
+	mt_validate(&new);
+	if (compare_tree(&tree, &new))
+		MT_BUG_ON(&new, 1);
+
+	mtree_destroy(&tree);
+	mtree_destroy(&new);
+
+	/* The two trees have different attributes. */
+	mt_init_flags(&tree, 0);
+	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+	ret = mtree_dup(&tree, &new, GFP_KERNEL);
+	MT_BUG_ON(&new, ret != -EINVAL);
+	mtree_destroy(&tree);
+	mtree_destroy(&new);
+
+	/* The new tree is not empty */
+	mt_init_flags(&tree, 0);
+	mt_init_flags(&new, 0);
+	mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
+	ret = mtree_dup(&tree, &new, GFP_KERNEL);
+	MT_BUG_ON(&new, ret != -EINVAL);
+	mtree_destroy(&tree);
+	mtree_destroy(&new);
+
+	/* Test for duplicating full trees. */
+	for (i = 1; i <= 3; i++) {
+		ret = build_full_tree(&tree, 0, i);
+		MT_BUG_ON(&tree, ret);
+		mt_init_flags(&new, 0);
+
+		ret = mtree_dup(&tree, &new, GFP_KERNEL);
+		MT_BUG_ON(&new, ret);
+		mt_validate(&new);
+		if (compare_tree(&tree, &new))
+			MT_BUG_ON(&new, 1);
+
+		mtree_destroy(&tree);
+		mtree_destroy(&new);
+	}
+
+	for (i = 1; i <= 3; i++) {
+		ret = build_full_tree(&tree, MT_FLAGS_ALLOC_RANGE, i);
+		MT_BUG_ON(&tree, ret);
+		mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+
+		ret = mtree_dup(&tree, &new, GFP_KERNEL);
+		MT_BUG_ON(&new, ret);
+		mt_validate(&new);
+		if (compare_tree(&tree, &new))
+			MT_BUG_ON(&new, 1);
+
+		mtree_destroy(&tree);
+		mtree_destroy(&new);
+	}
+
+	/* Test for normal duplicating. */
+	for (i = 0; i < 1000; i += 3) {
+		if (i & 1) {
+			mt_init_flags(&tree, 0);
+			mt_init_flags(&new, 0);
+		} else {
+			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+		}
+
+		for (j = 0; j < i; j++) {
+			mtree_store_range(&tree, j * 10, j * 10 + 5,
+					  xa_mk_value(j), GFP_KERNEL);
+		}
+
+		ret = mtree_dup(&tree, &new, GFP_KERNEL);
+		MT_BUG_ON(&new, ret);
+		mt_validate(&new);
+		if (compare_tree(&tree, &new))
+			MT_BUG_ON(&new, 1);
+
+		mtree_destroy(&tree);
+		mtree_destroy(&new);
+	}
+
+	/* Test memory allocation failed. */
+	for (i = 0; i < 1000; i += 3) {
+		if (i & 1) {
+			mt_init_flags(&tree, 0);
+			mt_init_flags(&new, 0);
+		} else {
+			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+		}
+
+		for (j = 0; j < i; j++) {
+			mtree_store_range(&tree, j * 10, j * 10 + 5,
+					  xa_mk_value(j), GFP_KERNEL);
+		}
+		/*
+		 * The rand() library function is not used, so we can generate
+		 * the same random numbers on any platform.
+		 */
+		rand_seed = rand_seed * 1103515245 + 12345;
+		rand = rand_seed / 65536 % 128;
+		mt_set_non_kernel(rand);
+
+		ret = mtree_dup(&tree, &new, GFP_NOWAIT);
+		mt_set_non_kernel(0);
+		if (ret != 0) {
+			MT_BUG_ON(&new, ret != -ENOMEM);
+			count++;
+			mtree_destroy(&tree);
+			continue;
+		}
+
+		mt_validate(&new);
+		if (compare_tree(&tree, &new))
+			MT_BUG_ON(&new, 1);
+
+		mtree_destroy(&tree);
+		mtree_destroy(&new);
+	}
+
+	/* pr_info("mtree_dup() fail %d times\n", count); */
+	BUG_ON(!count);
+}
+
 extern void test_kmem_cache_bulk(void);
 
 void farmer_tests(void)
@@ -35904,6 +36244,10 @@ void farmer_tests(void)
 	check_null_expand(&tree);
 	mtree_destroy(&tree);
 
+	mt_init_flags(&tree, 0);
+	check_mtree_dup(&tree);
+	mtree_destroy(&tree);
+
 	/* RCU testing */
 	mt_init_flags(&tree, 0);
 	check_erase_testset(&tree);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 4/6] maple_tree: Skip other tests when BENCH is enabled
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
                   ` (2 preceding siblings ...)
  2023-08-30 12:56 ` [PATCH v2 3/6] maple_tree: Add test for mtree_dup() Peng Zhang
@ 2023-08-30 12:56 ` Peng Zhang
  2023-08-30 12:56 ` [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking() Peng Zhang
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

Skip other tests when BENCH is enabled so that performance can be
measured in user space.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 lib/test_maple_tree.c            | 8 ++++----
 tools/testing/radix-tree/maple.c | 2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 0674aebd4423..0ec0c6a7c0b5 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -3514,10 +3514,6 @@ static int __init maple_tree_seed(void)
 
 	pr_info("\nTEST STARTING\n\n");
 
-	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
-	check_root_expand(&tree);
-	mtree_destroy(&tree);
-
 #if defined(BENCH_SLOT_STORE)
 #define BENCH
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
@@ -3575,6 +3571,10 @@ static int __init maple_tree_seed(void)
 	goto skip;
 #endif
 
+	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+	check_root_expand(&tree);
+	mtree_destroy(&tree);
+
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 	check_iteration(&tree);
 	mtree_destroy(&tree);
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 38455916331e..57f153b8bf4b 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -36282,7 +36282,9 @@ void farmer_tests(void)
 
 void maple_tree_tests(void)
 {
+#if !defined(BENCH)
 	farmer_tests();
+#endif
 	maple_tree_seed();
 	maple_tree_harvest();
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
                   ` (3 preceding siblings ...)
  2023-08-30 12:56 ` [PATCH v2 4/6] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
@ 2023-08-30 12:56 ` Peng Zhang
  2023-08-31 13:40   ` kernel test robot
  2023-09-07 20:14   ` Liam R. Howlett
  2023-08-30 12:56 ` [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

Updated check_forking() and bench_forking() to use __mt_dup() to
duplicate maple tree. Also increased the number of VMAs, because the
new way is faster.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 lib/test_maple_tree.c | 61 +++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 31 deletions(-)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 0ec0c6a7c0b5..72fba7cce148 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -1837,36 +1837,37 @@ static noinline void __init check_forking(struct maple_tree *mt)
 {
 
 	struct maple_tree newmt;
-	int i, nr_entries = 134;
+	int i, nr_entries = 300, ret;
 	void *val;
 	MA_STATE(mas, mt, 0, 0);
-	MA_STATE(newmas, mt, 0, 0);
+	MA_STATE(newmas, &newmt, 0, 0);
+
+	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
 
 	for (i = 0; i <= nr_entries; i++)
 		mtree_store_range(mt, i*10, i*10 + 5,
 				  xa_mk_value(i), GFP_KERNEL);
 
+
 	mt_set_non_kernel(99999);
-	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
-	newmas.tree = &newmt;
-	mas_reset(&newmas);
-	mas_reset(&mas);
 	mas_lock(&newmas);
-	mas.index = 0;
-	mas.last = 0;
-	if (mas_expected_entries(&newmas, nr_entries)) {
+	mas_lock(&mas);
+
+	ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
+	if (ret) {
 		pr_err("OOM!");
 		BUG_ON(1);
 	}
-	rcu_read_lock();
-	mas_for_each(&mas, val, ULONG_MAX) {
-		newmas.index = mas.index;
-		newmas.last = mas.last;
+
+	mas_set(&newmas, 0);
+	mas_for_each(&newmas, val, ULONG_MAX) {
 		mas_store(&newmas, val);
 	}
-	rcu_read_unlock();
-	mas_destroy(&newmas);
+
+	mas_unlock(&mas);
 	mas_unlock(&newmas);
+
+	mas_destroy(&newmas);
 	mt_validate(&newmt);
 	mt_set_non_kernel(0);
 	mtree_destroy(&newmt);
@@ -1974,12 +1975,11 @@ static noinline void __init check_mas_store_gfp(struct maple_tree *mt)
 #if defined(BENCH_FORK)
 static noinline void __init bench_forking(struct maple_tree *mt)
 {
-
 	struct maple_tree newmt;
-	int i, nr_entries = 134, nr_fork = 80000;
+	int i, nr_entries = 300, nr_fork = 80000, ret;
 	void *val;
 	MA_STATE(mas, mt, 0, 0);
-	MA_STATE(newmas, mt, 0, 0);
+	MA_STATE(newmas, &newmt, 0, 0);
 
 	for (i = 0; i <= nr_entries; i++)
 		mtree_store_range(mt, i*10, i*10 + 5,
@@ -1988,25 +1988,24 @@ static noinline void __init bench_forking(struct maple_tree *mt)
 	for (i = 0; i < nr_fork; i++) {
 		mt_set_non_kernel(99999);
 		mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
-		newmas.tree = &newmt;
-		mas_reset(&newmas);
-		mas_reset(&mas);
-		mas.index = 0;
-		mas.last = 0;
-		rcu_read_lock();
+
 		mas_lock(&newmas);
-		if (mas_expected_entries(&newmas, nr_entries)) {
-			printk("OOM!");
+		mas_lock(&mas);
+		ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
+		if (ret) {
+			pr_err("OOM!");
 			BUG_ON(1);
 		}
-		mas_for_each(&mas, val, ULONG_MAX) {
-			newmas.index = mas.index;
-			newmas.last = mas.last;
+
+		mas_set(&newmas, 0);
+		mas_for_each(&newmas, val, ULONG_MAX) {
 			mas_store(&newmas, val);
 		}
-		mas_destroy(&newmas);
+
+		mas_unlock(&mas);
 		mas_unlock(&newmas);
-		rcu_read_unlock();
+
+		mas_destroy(&newmas);
 		mt_validate(&newmt);
 		mt_set_non_kernel(0);
 		mtree_destroy(&newmt);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
                   ` (4 preceding siblings ...)
  2023-08-30 12:56 ` [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking() Peng Zhang
@ 2023-08-30 12:56 ` Peng Zhang
  2023-09-07 20:14   ` Liam R. Howlett
  2023-08-30 13:05 ` [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
  2023-09-07 20:19 ` Liam R. Howlett
  7 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 12:56 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel, Peng Zhang

Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
directly modify the entries of VMAs in the new maple tree, which can
get better performance. The optimization effect is proportional to the
number of VMAs.

There is a "spawn" in byte-unixbench[1], which can be used to test the
performance of fork(). I modified it slightly to make it work with
different number of VMAs.

Below are the test numbers. There are 21 VMAs by default. The first row
indicates the number of added VMAs. The following two lines are the
number of fork() calls every 10 seconds. These numbers are different
from the test results in v1 because this time the benchmark is bound to
a CPU. This way the numbers are more stable.

  Increment of VMAs: 0      100     200     400     800     1600    3200    6400
6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
                     +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%

[1] https://github.com/kdlucas/byte-unixbench/tree/master

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 kernel/fork.c | 34 ++++++++++++++++++++++++++--------
 mm/mmap.c     | 14 ++++++++++++--
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..e6299adefbd8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge = 0;
 	LIST_HEAD(uf);
-	VMA_ITERATOR(old_vmi, oldmm, 0);
 	VMA_ITERATOR(vmi, mm, 0);
 
 	uprobe_start_dup_mmap();
@@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		goto out;
 	khugepaged_fork(mm, oldmm);
 
-	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
-	if (retval)
+	/* Use __mt_dup() to efficiently build an identical maple tree. */
+	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
+	if (unlikely(retval))
 		goto out;
 
 	mt_clear_in_rcu(vmi.mas.tree);
-	for_each_vma(old_vmi, mpnt) {
+	for_each_vma(vmi, mpnt) {
 		struct file *file;
 
 		vma_start_write(mpnt);
 		if (mpnt->vm_flags & VM_DONTCOPY) {
 			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+
+			/*
+			 * Since the new tree is exactly the same as the old one,
+			 * we need to remove the unneeded VMAs.
+			 */
+			mas_store(&vmi.mas, NULL);
+
+			/*
+			 * Even removing an entry may require memory allocation,
+			 * and if removal fails, we use XA_ZERO_ENTRY to mark
+			 * from which VMA it failed. The case of encountering
+			 * XA_ZERO_ENTRY will be handled in exit_mmap().
+			 */
+			if (unlikely(mas_is_err(&vmi.mas))) {
+				retval = xa_err(vmi.mas.node);
+				mas_reset(&vmi.mas);
+				if (mas_find(&vmi.mas, ULONG_MAX))
+					mas_store(&vmi.mas, XA_ZERO_ENTRY);
+				goto loop_out;
+			}
+
 			continue;
 		}
 		charge = 0;
@@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			hugetlb_dup_vma_private(tmp);
 
 		/* Link the vma into the MT */
-		if (vma_iter_bulk_store(&vmi, tmp))
-			goto fail_nomem_vmi_store;
+		mas_store(&vmi.mas, tmp);
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	uprobe_end_dup_mmap();
 	return retval;
 
-fail_nomem_vmi_store:
-	unlink_anon_vmas(tmp);
 fail_nomem_anon_vma_fork:
 	mpol_put(vma_policy(tmp));
 fail_nomem_policy:
diff --git a/mm/mmap.c b/mm/mmap.c
index b56a7f0c9f85..dfc6881be81c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
 	arch_exit_mmap(mm);
 
 	vma = mas_find(&mas, ULONG_MAX);
-	if (!vma) {
+	/*
+	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
+	 * xa_is_zero(vma) may be true.
+	 */
+	if (!vma || xa_is_zero(vma)) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_read_unlock(mm);
 		return;
@@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
 		remove_vma(vma, true);
 		count++;
 		cond_resched();
-	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+		vma = mas_find(&mas, ULONG_MAX);
+		/*
+		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
+		 * donot need to be removed. Can happen if dup_mmap() fails to
+		 * remove a VMA marked VM_DONTCOPY.
+		 */
+	} while (vma != NULL && !xa_is_zero(vma));
 
 	BUG_ON(count != mm->map_count);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork()
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
                   ` (5 preceding siblings ...)
  2023-08-30 12:56 ` [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
@ 2023-08-30 13:05 ` Peng Zhang
  2023-09-07 20:19 ` Liam R. Howlett
  7 siblings, 0 replies; 35+ messages in thread
From: Peng Zhang @ 2023-08-30 13:05 UTC (permalink / raw)
  To: Liam.Howlett, corbet, akpm, willy, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin
  Cc: linux-mm, linux-doc, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 55 bytes --]

See the attachment for the slightly modified benchmark.

[-- Attachment #2: spawn.c --]
[-- Type: text/plain, Size: 2393 bytes --]

/*******************************************************************************
 *  The BYTE UNIX Benchmarks - Release 3
 *          Module: spawn.c   SID: 3.3 5/15/91 19:30:20
 *
 *******************************************************************************
 * Bug reports, patches, comments, suggestions should be sent to:
 *
 *	Ben Smith, Rick Grehan or Tom Yagerat BYTE Magazine
 *	ben@bytepb.byte.com   rick_g@bytepb.byte.com   tyager@bytepb.byte.com
 *
 *******************************************************************************
 *  Modification Log:
 *  $Header: spawn.c,v 3.4 87/06/22 14:32:48 kjmcdonell Beta $
 *  August 29, 1990 - Modified timing routines (ty)
 *  October 22, 1997 - code cleanup to remove ANSI C compiler warnings
 *                     Andy Kahn <kahn@zk3.dec.com>
 *
 ******************************************************************************/
char SCCSid[] = "@(#) @(#)spawn.c:3.3 -- 5/15/91 19:30:20";
/*
 *  Process creation
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/mman.h>

volatile int stop;
unsigned long iter;

void wake_me(int seconds, void (*func)())
{
	/* set up the signal handler */
	signal(SIGALRM, func);
	/* get the clock running */
	alarm(seconds);
}

void report()
{
	fprintf(stderr,"COUNT: %lu\n", iter);
	iter = 0;
	stop = 1;
}

void spawn()
{
	int status, slave;

	while (!stop) {
		if ((slave = fork()) == 0) {
			/* slave .. boring */
			exit(0);
		} else if (slave < 0) {
			/* woops ... */
			fprintf(stderr,"Fork failed at iteration %lu\n", iter);
			perror("Reason");
			exit(2);
		} else
			/* master */
			wait(&status);
		if (status != 0) {
			fprintf(stderr,"Bad wait status: 0x%x\n", status);
			exit(2);
		}
		iter++;
	}
}

int main(int argc, char	*argv[])
{
	int duration, nr_vmas = 0;
	size_t size;
	void *addr;

	if (argc != 2) {
		fprintf(stderr,"Usage: %s duration \n", argv[0]);
		exit(1);
	}
	duration = atoi(argv[1]);

	size = 10 * getpagesize();
	for (int i = 0; i <= 7000; ++i) {
		if (i == nr_vmas) {
			stop = 0;
			fprintf(stderr,"VMAs: %d\n", i);
			wake_me(duration, report);
			spawn();
			if (nr_vmas == 0)
				nr_vmas = 100;
			else nr_vmas *= 2;
		}
		addr = mmap(NULL, size, i & 1 ? PROT_READ : PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (addr == MAP_FAILED) {
			perror("mmap");
			exit(2);
		}
	}
}

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-08-30 12:56 ` [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking() Peng Zhang
@ 2023-08-31 13:40   ` kernel test robot
  2023-09-01 10:58     ` Peng Zhang
  2023-09-07 20:14   ` Liam R. Howlett
  1 sibling, 1 reply; 35+ messages in thread
From: kernel test robot @ 2023-08-31 13:40 UTC (permalink / raw)
  To: Peng Zhang
  Cc: oe-lkp, lkp, maple-tree, linux-mm, Liam.Howlett, corbet, akpm,
	willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-doc, linux-kernel,
	linux-fsdevel, Peng Zhang, oliver.sang



Hello,

kernel test robot noticed "WARNING:possible_recursive_locking_detected" on:

commit: 2730245bd6b13a94a67e84c10832a9f52fad0aa5 ("[PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()")
url: https://github.com/intel-lab-lkp/linux/commits/Peng-Zhang/maple_tree-Add-two-helpers/20230830-205847
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20230830125654.21257-6-zhangpeng.00@bytedance.com/
patch subject: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()

in testcase: boot

compiler: clang-16
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202308312115.cad34fed-oliver.sang@intel.com


[   25.146957][    T1] WARNING: possible recursive locking detected
[   25.147110][    T1] 6.5.0-rc4-00632-g2730245bd6b1 #1 Tainted: G                TN
[   25.147110][    T1] --------------------------------------------
[   25.147110][    T1] swapper/1 is trying to acquire lock:
[ 25.147110][ T1] ffffffff86485058 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854) 
[   25.147110][    T1]
[   25.147110][    T1] but task is already holding lock:
[ 25.147110][ T1] ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854) 
[   25.147110][    T1]
[   25.147110][    T1] other info that might help us debug this:
[   25.147110][    T1]  Possible unsafe locking scenario:
[   25.147110][    T1]
[   25.147110][    T1]        CPU0
[   25.147110][    T1]        ----
[   25.147110][    T1]   lock(&mt->ma_lock);
[   25.147110][    T1]
[   25.147110][    T1]  *** DEADLOCK ***
[   25.147110][    T1]
[   25.147110][    T1]  May be due to missing lock nesting notation
[   25.147110][    T1]
[   25.147110][    T1] 1 lock held by swapper/1:
[ 25.147110][ T1] #0: ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854) 
[   25.147110][    T1]
[   25.147110][    T1] stack backtrace:
[   25.147110][    T1] CPU: 0 PID: 1 Comm: swapper Tainted: G                TN 6.5.0-rc4-00632-g2730245bd6b1 #1
[   25.147110][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   25.147110][    T1] Call Trace:
[   25.147110][    T1]  <TASK>
[ 25.147110][ T1] dump_stack_lvl (lib/dump_stack.c:? lib/dump_stack.c:106) 
[ 25.147110][ T1] validate_chain (kernel/locking/lockdep.c:?) 
[ 25.147110][ T1] ? look_up_lock_class (kernel/locking/lockdep.c:926) 
[ 25.147110][ T1] ? mark_lock (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4655) 
[ 25.147110][ T1] __lock_acquire (kernel/locking/lockdep.c:?) 
[ 25.147110][ T1] lock_acquire (kernel/locking/lockdep.c:5753) 
[ 25.147110][ T1] ? check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854) 
[ 25.147110][ T1] _raw_spin_lock (include/linux/spinlock_api_smp.h:133 kernel/locking/spinlock.c:154) 
[ 25.147110][ T1] ? check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854) 
[ 25.147110][ T1] check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854) 
[ 25.147110][ T1] maple_tree_seed (lib/test_maple_tree.c:3583) 
[ 25.147110][ T1] do_one_initcall (init/main.c:1232) 
[ 25.147110][ T1] ? __cfi_maple_tree_seed (lib/test_maple_tree.c:3508) 
[ 25.147110][ T1] do_initcall_level (init/main.c:1293) 
[ 25.147110][ T1] do_initcalls (init/main.c:1307) 
[ 25.147110][ T1] kernel_init_freeable (init/main.c:1550) 
[ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429) 
[ 25.147110][ T1] kernel_init (init/main.c:1439) 
[ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429) 
[ 25.147110][ T1] ret_from_fork (arch/x86/kernel/process.c:151) 
[ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429) 
[ 25.147110][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:312) 
[   25.147110][    T1]  </TASK>
[   28.697241][   T32] clocksource_wdtest: --- Verify jiffies-like uncertainty margin.
[   28.698316][   T32] clocksource: wdtest-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
[   29.714980][   T32] clocksource_wdtest: --- Verify tsc-like uncertainty margin.
[   29.716387][   T32] clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[   29.721896][   T32] clocksource_wdtest: --- tsc-like times: 1693478138832947444 - 1693478138832945950 = 1494.
[   29.723570][   T32] clocksource_wdtest: --- Watchdog with 0x error injection, 2 retries.
[   31.898906][   T32] clocksource_wdtest: --- Watchdog with 1x error injection, 2 retries.
[   34.043415][   T32] clocksource_wdtest: --- Watchdog with 2x error injection, 2 retries, expect message.
[   34.512462][    C0] clocksource: timekeeping watchdog on CPU0: kvm-clock retried 2 times before success
[   36.169157][   T32] clocksource_wdtest: --- Watchdog with 3x error injection, 2 retries, expect clock skew.
[   36.513464][    C0] clocksource: timekeeping watchdog on CPU0: wd-wdtest-ktime-wd excessive read-back delay of 1000880ns vs. limit of 125000ns, wd-wd read-back delay only 46ns, attempt 3, marking wdtest-ktime unstable
[   36.516829][    C0] clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
[   38.412889][   T32] clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[   38.421249][   T32] clocksource_wdtest: --- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.
[   38.990462][    C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large:
[   38.992698][    C0] clocksource:                       'kvm-clock' wd_nsec: 479996388 wd_now: 9454aecf2 wd_last: 928aec30e mask: ffffffffffffffff
[   38.994924][    C0] clocksource:                       'wdtest-ktime' cs_nsec: 679996638 cs_now: 17807167426ff864 cs_last: 1780716719e80b86 mask: ffffffffffffffff
[   38.997374][    C0] clocksource:                       Clocksource 'wdtest-ktime' skewed 200000250 ns (200 ms) over watchdog 'kvm-clock' interval of 479996388 ns (479 ms)
[   38.999919][    C0] clocksource:                       'kvm-clock' (not 'wdtest-ktime') is current clocksource.
[   39.001696][    C0] clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
[   40.441815][   T32] clocksource: Not enough CPUs to check clocksource 'wdtest-ktime'.
[   40.443303][   T32] clocksource_wdtest: --- Done with test.
[  293.673815][    T1] swapper invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[  293.675628][    T1] CPU: 0 PID: 1 Comm: swapper Tainted: G                TN 6.5.0-rc4-00632-g2730245bd6b1 #1
[  293.677082][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  293.677082][    T1] Call Trace:
[  293.677082][    T1]  <TASK>
[ 293.677082][ T1] dump_stack_lvl (lib/dump_stack.c:107) 
[ 293.677082][ T1] dump_header (mm/oom_kill.c:?) 
[ 293.677082][ T1] out_of_memory (mm/oom_kill.c:1159) 
[ 293.677082][ T1] __alloc_pages_slowpath (mm/page_alloc.c:3372 mm/page_alloc.c:4132) 
[ 293.677082][ T1] __alloc_pages (mm/page_alloc.c:4469) 
[ 293.677082][ T1] alloc_slab_page (mm/slub.c:1866) 
[ 293.677082][ T1] new_slab (mm/slub.c:2017 mm/slub.c:2062) 
[ 293.677082][ T1] ? mas_alloc_nodes (lib/maple_tree.c:1282) 
[ 293.677082][ T1] ___slab_alloc (arch/x86/include/asm/preempt.h:80 mm/slub.c:3216) 
[ 293.677082][ T1] ? mas_alloc_nodes (lib/maple_tree.c:1282) 
[ 293.677082][ T1] kmem_cache_alloc_bulk (mm/slub.c:? mm/slub.c:4041) 
[ 293.677082][ T1] mas_alloc_nodes (lib/maple_tree.c:1282) 
[ 293.677082][ T1] mas_nomem (lib/maple_tree.c:?) 
[ 293.677082][ T1] mtree_store_range (lib/maple_tree.c:6191) 
[ 293.677082][ T1] check_dup_gaps (lib/test_maple_tree.c:2623) 
[ 293.677082][ T1] check_dup (lib/test_maple_tree.c:2707) 
[ 293.677082][ T1] maple_tree_seed (lib/test_maple_tree.c:3766) 
[ 293.677082][ T1] do_one_initcall (init/main.c:1232) 
[ 293.677082][ T1] ? __cfi_maple_tree_seed (lib/test_maple_tree.c:3508) 
[ 293.677082][ T1] do_initcall_level (init/main.c:1293) 
[ 293.677082][ T1] do_initcalls (init/main.c:1307) 
[ 293.677082][ T1] kernel_init_freeable (init/main.c:1550) 
[ 293.677082][ T1] ? __cfi_kernel_init (init/main.c:1429) 


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230831/202308312115.cad34fed-oliver.sang@intel.com



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-08-31 13:40   ` kernel test robot
@ 2023-09-01 10:58     ` Peng Zhang
  2023-09-07 18:03       ` Liam R. Howlett
  0 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-01 10:58 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, maple-tree, linux-mm, Liam.Howlett, corbet, akpm,
	willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-doc, Peng Zhang,
	linux-kernel, linux-fsdevel



在 2023/8/31 21:40, kernel test robot 写道:
> 
> 
> Hello,
> 
> kernel test robot noticed "WARNING:possible_recursive_locking_detected" on:
> 
> commit: 2730245bd6b13a94a67e84c10832a9f52fad0aa5 ("[PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()")
> url: https://github.com/intel-lab-lkp/linux/commits/Peng-Zhang/maple_tree-Add-two-helpers/20230830-205847
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/20230830125654.21257-6-zhangpeng.00@bytedance.com/
> patch subject: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
> 
> in testcase: boot
> 
> compiler: clang-16
> test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> 
> (please refer to attached dmesg/kmsg for entire log/backtrace)
> 
> 
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202308312115.cad34fed-oliver.sang@intel.com
> 
> 
> [   25.146957][    T1] WARNING: possible recursive locking detected
> [   25.147110][    T1] 6.5.0-rc4-00632-g2730245bd6b1 #1 Tainted: G                TN
> [   25.147110][    T1] --------------------------------------------
> [   25.147110][    T1] swapper/1 is trying to acquire lock:
> [ 25.147110][ T1] ffffffff86485058 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> [   25.147110][    T1]
> [   25.147110][    T1] but task is already holding lock:
> [ 25.147110][ T1] ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854)
Thanks for the test. I checked that these are two different locks, why
is this warning reported? Did I miss something?
> [   25.147110][    T1]
> [   25.147110][    T1] other info that might help us debug this:
> [   25.147110][    T1]  Possible unsafe locking scenario:
> [   25.147110][    T1]
> [   25.147110][    T1]        CPU0
> [   25.147110][    T1]        ----
> [   25.147110][    T1]   lock(&mt->ma_lock);
> [   25.147110][    T1]
> [   25.147110][    T1]  *** DEADLOCK ***
> [   25.147110][    T1]
> [   25.147110][    T1]  May be due to missing lock nesting notation
> [   25.147110][    T1]
> [   25.147110][    T1] 1 lock held by swapper/1:
> [ 25.147110][ T1] #0: ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854)
> [   25.147110][    T1]
> [   25.147110][    T1] stack backtrace:
> [   25.147110][    T1] CPU: 0 PID: 1 Comm: swapper Tainted: G                TN 6.5.0-rc4-00632-g2730245bd6b1 #1
> [   25.147110][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [   25.147110][    T1] Call Trace:
> [   25.147110][    T1]  <TASK>
> [ 25.147110][ T1] dump_stack_lvl (lib/dump_stack.c:? lib/dump_stack.c:106)
> [ 25.147110][ T1] validate_chain (kernel/locking/lockdep.c:?)
> [ 25.147110][ T1] ? look_up_lock_class (kernel/locking/lockdep.c:926)
> [ 25.147110][ T1] ? mark_lock (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4655)
> [ 25.147110][ T1] __lock_acquire (kernel/locking/lockdep.c:?)
> [ 25.147110][ T1] lock_acquire (kernel/locking/lockdep.c:5753)
> [ 25.147110][ T1] ? check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> [ 25.147110][ T1] _raw_spin_lock (include/linux/spinlock_api_smp.h:133 kernel/locking/spinlock.c:154)
> [ 25.147110][ T1] ? check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> [ 25.147110][ T1] check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> [ 25.147110][ T1] maple_tree_seed (lib/test_maple_tree.c:3583)
> [ 25.147110][ T1] do_one_initcall (init/main.c:1232)
> [ 25.147110][ T1] ? __cfi_maple_tree_seed (lib/test_maple_tree.c:3508)
> [ 25.147110][ T1] do_initcall_level (init/main.c:1293)
> [ 25.147110][ T1] do_initcalls (init/main.c:1307)
> [ 25.147110][ T1] kernel_init_freeable (init/main.c:1550)
> [ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429)
> [ 25.147110][ T1] kernel_init (init/main.c:1439)
> [ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429)
> [ 25.147110][ T1] ret_from_fork (arch/x86/kernel/process.c:151)
> [ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429)
> [ 25.147110][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> [   25.147110][    T1]  </TASK>
> [   28.697241][   T32] clocksource_wdtest: --- Verify jiffies-like uncertainty margin.
> [   28.698316][   T32] clocksource: wdtest-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
> [   29.714980][   T32] clocksource_wdtest: --- Verify tsc-like uncertainty margin.
> [   29.716387][   T32] clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [   29.721896][   T32] clocksource_wdtest: --- tsc-like times: 1693478138832947444 - 1693478138832945950 = 1494.
> [   29.723570][   T32] clocksource_wdtest: --- Watchdog with 0x error injection, 2 retries.
> [   31.898906][   T32] clocksource_wdtest: --- Watchdog with 1x error injection, 2 retries.
> [   34.043415][   T32] clocksource_wdtest: --- Watchdog with 2x error injection, 2 retries, expect message.
> [   34.512462][    C0] clocksource: timekeeping watchdog on CPU0: kvm-clock retried 2 times before success
> [   36.169157][   T32] clocksource_wdtest: --- Watchdog with 3x error injection, 2 retries, expect clock skew.
> [   36.513464][    C0] clocksource: timekeeping watchdog on CPU0: wd-wdtest-ktime-wd excessive read-back delay of 1000880ns vs. limit of 125000ns, wd-wd read-back delay only 46ns, attempt 3, marking wdtest-ktime unstable
> [   36.516829][    C0] clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
> [   38.412889][   T32] clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [   38.421249][   T32] clocksource_wdtest: --- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.
> [   38.990462][    C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large:
> [   38.992698][    C0] clocksource:                       'kvm-clock' wd_nsec: 479996388 wd_now: 9454aecf2 wd_last: 928aec30e mask: ffffffffffffffff
> [   38.994924][    C0] clocksource:                       'wdtest-ktime' cs_nsec: 679996638 cs_now: 17807167426ff864 cs_last: 1780716719e80b86 mask: ffffffffffffffff
> [   38.997374][    C0] clocksource:                       Clocksource 'wdtest-ktime' skewed 200000250 ns (200 ms) over watchdog 'kvm-clock' interval of 479996388 ns (479 ms)
> [   38.999919][    C0] clocksource:                       'kvm-clock' (not 'wdtest-ktime') is current clocksource.
> [   39.001696][    C0] clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
> [   40.441815][   T32] clocksource: Not enough CPUs to check clocksource 'wdtest-ktime'.
> [   40.443303][   T32] clocksource_wdtest: --- Done with test.
> [  293.673815][    T1] swapper invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> [  293.675628][    T1] CPU: 0 PID: 1 Comm: swapper Tainted: G                TN 6.5.0-rc4-00632-g2730245bd6b1 #1
> [  293.677082][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [  293.677082][    T1] Call Trace:
> [  293.677082][    T1]  <TASK>
> [ 293.677082][ T1] dump_stack_lvl (lib/dump_stack.c:107)
> [ 293.677082][ T1] dump_header (mm/oom_kill.c:?)
> [ 293.677082][ T1] out_of_memory (mm/oom_kill.c:1159)
> [ 293.677082][ T1] __alloc_pages_slowpath (mm/page_alloc.c:3372 mm/page_alloc.c:4132)
> [ 293.677082][ T1] __alloc_pages (mm/page_alloc.c:4469)
> [ 293.677082][ T1] alloc_slab_page (mm/slub.c:1866)
> [ 293.677082][ T1] new_slab (mm/slub.c:2017 mm/slub.c:2062)
> [ 293.677082][ T1] ? mas_alloc_nodes (lib/maple_tree.c:1282)
> [ 293.677082][ T1] ___slab_alloc (arch/x86/include/asm/preempt.h:80 mm/slub.c:3216)
> [ 293.677082][ T1] ? mas_alloc_nodes (lib/maple_tree.c:1282)
> [ 293.677082][ T1] kmem_cache_alloc_bulk (mm/slub.c:? mm/slub.c:4041)
> [ 293.677082][ T1] mas_alloc_nodes (lib/maple_tree.c:1282)
> [ 293.677082][ T1] mas_nomem (lib/maple_tree.c:?)
> [ 293.677082][ T1] mtree_store_range (lib/maple_tree.c:6191)
> [ 293.677082][ T1] check_dup_gaps (lib/test_maple_tree.c:2623)
> [ 293.677082][ T1] check_dup (lib/test_maple_tree.c:2707)
> [ 293.677082][ T1] maple_tree_seed (lib/test_maple_tree.c:3766)
> [ 293.677082][ T1] do_one_initcall (init/main.c:1232)
> [ 293.677082][ T1] ? __cfi_maple_tree_seed (lib/test_maple_tree.c:3508)
> [ 293.677082][ T1] do_initcall_level (init/main.c:1293)
> [ 293.677082][ T1] do_initcalls (init/main.c:1307)
> [ 293.677082][ T1] kernel_init_freeable (init/main.c:1550)
> [ 293.677082][ T1] ? __cfi_kernel_init (init/main.c:1429)
> 
> 
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230831/202308312115.cad34fed-oliver.sang@intel.com
> 
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-09-01 10:58     ` Peng Zhang
@ 2023-09-07 18:03       ` Liam R. Howlett
  2023-09-07 18:16         ` Matthew Wilcox
  0 siblings, 1 reply; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 18:03 UTC (permalink / raw)
  To: Peng Zhang
  Cc: kernel test robot, oe-lkp, lkp, maple-tree, linux-mm, corbet,
	akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-doc, linux-kernel,
	linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230901 06:58]:
> 
> 
> 在 2023/8/31 21:40, kernel test robot 写道:
> > 
> > 
> > Hello,
> > 
> > kernel test robot noticed "WARNING:possible_recursive_locking_detected" on:
> > 
> > commit: 2730245bd6b13a94a67e84c10832a9f52fad0aa5 ("[PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()")
> > url: https://github.com/intel-lab-lkp/linux/commits/Peng-Zhang/maple_tree-Add-two-helpers/20230830-205847
> > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > patch link: https://lore.kernel.org/all/20230830125654.21257-6-zhangpeng.00@bytedance.com/
> > patch subject: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
> > 
> > in testcase: boot
> > 
> > compiler: clang-16
> > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> > 
> > (please refer to attached dmesg/kmsg for entire log/backtrace)
> > 
> > 
> > 
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202308312115.cad34fed-oliver.sang@intel.com
> > 
> > 
> > [   25.146957][    T1] WARNING: possible recursive locking detected
> > [   25.147110][    T1] 6.5.0-rc4-00632-g2730245bd6b1 #1 Tainted: G                TN
> > [   25.147110][    T1] --------------------------------------------
> > [   25.147110][    T1] swapper/1 is trying to acquire lock:
> > [ 25.147110][ T1] ffffffff86485058 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> > [   25.147110][    T1]
> > [   25.147110][    T1] but task is already holding lock:
> > [ 25.147110][ T1] ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854)
> Thanks for the test. I checked that these are two different locks, why
> is this warning reported? Did I miss something?

I don't think you can nest spinlocks like this.  In my previous test I
avoided nesting, but in your case we cannot avoid having both locks at
the same time.

You can get around this by using an rwsemaphore, set the two trees as
external and use down_write_nested(&lock2, SINGLE_DEPTH_NESTING) like
the real fork.  Basically, switch the locking to exactly what fork does.

> > [   25.147110][    T1]
> > [   25.147110][    T1] other info that might help us debug this:
> > [   25.147110][    T1]  Possible unsafe locking scenario:
> > [   25.147110][    T1]
> > [   25.147110][    T1]        CPU0
> > [   25.147110][    T1]        ----
> > [   25.147110][    T1]   lock(&mt->ma_lock);
> > [   25.147110][    T1]
> > [   25.147110][    T1]  *** DEADLOCK ***
> > [   25.147110][    T1]
> > [   25.147110][    T1]  May be due to missing lock nesting notation
> > [   25.147110][    T1]
> > [   25.147110][    T1] 1 lock held by swapper/1:
> > [ 25.147110][ T1] #0: ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854)
> > [   25.147110][    T1]
> > [   25.147110][    T1] stack backtrace:
> > [   25.147110][    T1] CPU: 0 PID: 1 Comm: swapper Tainted: G                TN 6.5.0-rc4-00632-g2730245bd6b1 #1
> > [   25.147110][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [   25.147110][    T1] Call Trace:
> > [   25.147110][    T1]  <TASK>
> > [ 25.147110][ T1] dump_stack_lvl (lib/dump_stack.c:? lib/dump_stack.c:106)
> > [ 25.147110][ T1] validate_chain (kernel/locking/lockdep.c:?)
> > [ 25.147110][ T1] ? look_up_lock_class (kernel/locking/lockdep.c:926)
> > [ 25.147110][ T1] ? mark_lock (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228 kernel/locking/lockdep.c:4655)
> > [ 25.147110][ T1] __lock_acquire (kernel/locking/lockdep.c:?)
> > [ 25.147110][ T1] lock_acquire (kernel/locking/lockdep.c:5753)
> > [ 25.147110][ T1] ? check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> > [ 25.147110][ T1] _raw_spin_lock (include/linux/spinlock_api_smp.h:133 kernel/locking/spinlock.c:154)
> > [ 25.147110][ T1] ? check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> > [ 25.147110][ T1] check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> > [ 25.147110][ T1] maple_tree_seed (lib/test_maple_tree.c:3583)
> > [ 25.147110][ T1] do_one_initcall (init/main.c:1232)
> > [ 25.147110][ T1] ? __cfi_maple_tree_seed (lib/test_maple_tree.c:3508)
> > [ 25.147110][ T1] do_initcall_level (init/main.c:1293)
> > [ 25.147110][ T1] do_initcalls (init/main.c:1307)
> > [ 25.147110][ T1] kernel_init_freeable (init/main.c:1550)
> > [ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429)
> > [ 25.147110][ T1] kernel_init (init/main.c:1439)
> > [ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429)
> > [ 25.147110][ T1] ret_from_fork (arch/x86/kernel/process.c:151)
> > [ 25.147110][ T1] ? __cfi_kernel_init (init/main.c:1429)
> > [ 25.147110][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
> > [   25.147110][    T1]  </TASK>
> > [   28.697241][   T32] clocksource_wdtest: --- Verify jiffies-like uncertainty margin.
> > [   28.698316][   T32] clocksource: wdtest-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370867519511994 ns
> > [   29.714980][   T32] clocksource_wdtest: --- Verify tsc-like uncertainty margin.
> > [   29.716387][   T32] clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [   29.721896][   T32] clocksource_wdtest: --- tsc-like times: 1693478138832947444 - 1693478138832945950 = 1494.
> > [   29.723570][   T32] clocksource_wdtest: --- Watchdog with 0x error injection, 2 retries.
> > [   31.898906][   T32] clocksource_wdtest: --- Watchdog with 1x error injection, 2 retries.
> > [   34.043415][   T32] clocksource_wdtest: --- Watchdog with 2x error injection, 2 retries, expect message.
> > [   34.512462][    C0] clocksource: timekeeping watchdog on CPU0: kvm-clock retried 2 times before success
> > [   36.169157][   T32] clocksource_wdtest: --- Watchdog with 3x error injection, 2 retries, expect clock skew.
> > [   36.513464][    C0] clocksource: timekeeping watchdog on CPU0: wd-wdtest-ktime-wd excessive read-back delay of 1000880ns vs. limit of 125000ns, wd-wd read-back delay only 46ns, attempt 3, marking wdtest-ktime unstable
> > [   36.516829][    C0] clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
> > [   38.412889][   T32] clocksource: wdtest-ktime: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> > [   38.421249][   T32] clocksource_wdtest: --- Watchdog clock-value-fuzz error injection, expect clock skew and per-CPU mismatches.
> > [   38.990462][    C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'wdtest-ktime' as unstable because the skew is too large:
> > [   38.992698][    C0] clocksource:                       'kvm-clock' wd_nsec: 479996388 wd_now: 9454aecf2 wd_last: 928aec30e mask: ffffffffffffffff
> > [   38.994924][    C0] clocksource:                       'wdtest-ktime' cs_nsec: 679996638 cs_now: 17807167426ff864 cs_last: 1780716719e80b86 mask: ffffffffffffffff
> > [   38.997374][    C0] clocksource:                       Clocksource 'wdtest-ktime' skewed 200000250 ns (200 ms) over watchdog 'kvm-clock' interval of 479996388 ns (479 ms)
> > [   38.999919][    C0] clocksource:                       'kvm-clock' (not 'wdtest-ktime') is current clocksource.
> > [   39.001696][    C0] clocksource_wdtest: --- Marking wdtest-ktime unstable due to clocksource watchdog.
> > [   40.441815][   T32] clocksource: Not enough CPUs to check clocksource 'wdtest-ktime'.
> > [   40.443303][   T32] clocksource_wdtest: --- Done with test.
> > [  293.673815][    T1] swapper invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> > [  293.675628][    T1] CPU: 0 PID: 1 Comm: swapper Tainted: G                TN 6.5.0-rc4-00632-g2730245bd6b1 #1
> > [  293.677082][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [  293.677082][    T1] Call Trace:
> > [  293.677082][    T1]  <TASK>
> > [ 293.677082][ T1] dump_stack_lvl (lib/dump_stack.c:107)
> > [ 293.677082][ T1] dump_header (mm/oom_kill.c:?)
> > [ 293.677082][ T1] out_of_memory (mm/oom_kill.c:1159)
> > [ 293.677082][ T1] __alloc_pages_slowpath (mm/page_alloc.c:3372 mm/page_alloc.c:4132)
> > [ 293.677082][ T1] __alloc_pages (mm/page_alloc.c:4469)
> > [ 293.677082][ T1] alloc_slab_page (mm/slub.c:1866)
> > [ 293.677082][ T1] new_slab (mm/slub.c:2017 mm/slub.c:2062)
> > [ 293.677082][ T1] ? mas_alloc_nodes (lib/maple_tree.c:1282)
> > [ 293.677082][ T1] ___slab_alloc (arch/x86/include/asm/preempt.h:80 mm/slub.c:3216)
> > [ 293.677082][ T1] ? mas_alloc_nodes (lib/maple_tree.c:1282)
> > [ 293.677082][ T1] kmem_cache_alloc_bulk (mm/slub.c:? mm/slub.c:4041)
> > [ 293.677082][ T1] mas_alloc_nodes (lib/maple_tree.c:1282)
> > [ 293.677082][ T1] mas_nomem (lib/maple_tree.c:?)
> > [ 293.677082][ T1] mtree_store_range (lib/maple_tree.c:6191)
> > [ 293.677082][ T1] check_dup_gaps (lib/test_maple_tree.c:2623)
> > [ 293.677082][ T1] check_dup (lib/test_maple_tree.c:2707)
> > [ 293.677082][ T1] maple_tree_seed (lib/test_maple_tree.c:3766)
> > [ 293.677082][ T1] do_one_initcall (init/main.c:1232)
> > [ 293.677082][ T1] ? __cfi_maple_tree_seed (lib/test_maple_tree.c:3508)
> > [ 293.677082][ T1] do_initcall_level (init/main.c:1293)
> > [ 293.677082][ T1] do_initcalls (init/main.c:1307)
> > [ 293.677082][ T1] kernel_init_freeable (init/main.c:1550)
> > [ 293.677082][ T1] ? __cfi_kernel_init (init/main.c:1429)
> > 
> > 
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20230831/202308312115.cad34fed-oliver.sang@intel.com
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-09-07 18:03       ` Liam R. Howlett
@ 2023-09-07 18:16         ` Matthew Wilcox
  2023-09-08  9:47           ` Peng Zhang
  0 siblings, 1 reply; 35+ messages in thread
From: Matthew Wilcox @ 2023-09-07 18:16 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, kernel test robot, oe-lkp, lkp,
	maple-tree, linux-mm, corbet, akpm, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin,
	linux-doc, linux-kernel, linux-fsdevel

On Thu, Sep 07, 2023 at 02:03:01PM -0400, Liam R. Howlett wrote:
> > >  WARNING: possible recursive locking detected
> > >  6.5.0-rc4-00632-g2730245bd6b1 #1 Tainted: G                TN
> > >  --------------------------------------------
> > >  swapper/1 is trying to acquire lock:
> > > ffffffff86485058 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
> > > 
> > >  but task is already holding lock:
> > >  ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854)
> > Thanks for the test. I checked that these are two different locks, why
> > is this warning reported? Did I miss something?
> 
> I don't think you can nest spinlocks like this.  In my previous test I
> avoided nesting, but in your case we cannot avoid having both locks at
> the same time.
> 
> You can get around this by using an rwsemaphore, set the two trees as
> external and use down_write_nested(&lock2, SINGLE_DEPTH_NESTING) like
> the real fork.  Basically, switch the locking to exactly what fork does.

spin_lock_nested() exists.

You should probably both read through
Documentation/locking/lockdep-design.rst It's not the best user
documentation in the world, but it's what we have.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 1/6] maple_tree: Add two helpers
  2023-08-30 12:56 ` [PATCH v2 1/6] maple_tree: Add two helpers Peng Zhang
@ 2023-09-07 20:13   ` Liam R. Howlett
  2023-09-08  2:45     ` Peng Zhang
  0 siblings, 1 reply; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 20:13 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> Add two helpers, which will be used later.

Can you please change the subject to something like:
Add mt_free_one() and mt_attr() helpers

for easier git log readability?

> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  lib/maple_tree.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index ee1ff0c59fd7..ef234cf02e3e 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -165,6 +165,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>  	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>  }
>  
> +static inline void mt_free_one(struct maple_node *node)
> +{
> +	kmem_cache_free(maple_node_cache, node);
> +}
> +
>  static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>  {
>  	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> @@ -205,6 +210,11 @@ static unsigned int mas_mt_height(struct ma_state *mas)
>  	return mt_height(mas->tree);
>  }
>  
> +static inline unsigned int mt_attr(struct maple_tree *mt)
> +{
> +	return mt->ma_flags & ~MT_FLAGS_HEIGHT_MASK;
> +}
> +
>  static inline enum maple_type mte_node_type(const struct maple_enode *entry)
>  {
>  	return ((unsigned long)entry >> MAPLE_NODE_TYPE_SHIFT) &
> @@ -5520,7 +5530,7 @@ void mas_destroy(struct ma_state *mas)
>  			mt_free_bulk(count, (void __rcu **)&node->slot[1]);
>  			total -= count;
>  		}
> -		kmem_cache_free(maple_node_cache, node);
> +		mt_free_one(ma_mnode_ptr(node));
>  		total--;
>  	}
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  2023-08-30 12:56 ` [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup() Peng Zhang
@ 2023-09-07 20:13   ` Liam R. Howlett
  2023-09-08  9:26     ` Peng Zhang
  2023-09-11 12:59     ` Peng Zhang
  0 siblings, 2 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 20:13 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> duplicate a maple tree. Compared with traversing the source tree and
> reinserting entry by entry in the new tree, it has better performance.
> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> handles locks internally.

__mt_dup() should be called mas_dup() to indicate the advanced interface
which requires users to handle their own locks.

> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  include/linux/maple_tree.h |   3 +
>  lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 268 insertions(+)
> 
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index e41c70ac7744..44fe8a57ecbd 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>  		void *entry, gfp_t gfp);
>  void *mtree_erase(struct maple_tree *mt, unsigned long index);
>  
> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +
>  void mtree_destroy(struct maple_tree *mt);
>  void __mt_destroy(struct maple_tree *mt);
>  
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index ef234cf02e3e..8f841682269c 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>  }
>  EXPORT_SYMBOL(mtree_erase);
>  
> +/*
> + * mas_dup_free() - Free a half-constructed tree.

Maybe "Free an incomplete duplication of a tree" ?

> + * @mas: Points to the last node of the half-constructed tree.

Your use of "Points to" seems to indicate someone knows you are talking
about a "maple state that has a node pointing to".  Can this be made
more clear?
@mas: The maple state of a incomplete tree.

Then add a note that @mas->node points to the last successfully
allocated node?

Or something along those lines.

> + *
> + * This function frees all nodes starting from @mas->node in the reverse order
> + * of mas_dup_build(). There is no need to hold the source tree lock at this
> + * time.
> + */
> +static void mas_dup_free(struct ma_state *mas)
> +{
> +	struct maple_node *node;
> +	enum maple_type type;
> +	void __rcu **slots;
> +	unsigned char count, i;
> +
> +	/* Maybe the first node allocation failed. */
> +	if (!mas->node)
> +		return;
> +
> +	while (!mte_is_root(mas->node)) {
> +		mas_ascend(mas);
> +
> +		if (mas->offset) {
> +			mas->offset--;
> +			do {
> +				mas_descend(mas);
> +				mas->offset = mas_data_end(mas);
> +			} while (!mte_is_leaf(mas->node));

Can you blindly descend and check !mte_is_leaf()?  What happens when the
tree duplication fails at random internal nodes?  Maybe I missed how
this cannot happen?

> +
> +			mas_ascend(mas);
> +		}
> +
> +		node = mte_to_node(mas->node);
> +		type = mte_node_type(mas->node);
> +		slots = (void **)ma_slots(node, type);
> +		count = mas_data_end(mas) + 1;
> +		for (i = 0; i < count; i++)
> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> +
> +		mt_free_bulk(count, slots);
> +	}


> +
> +	node = mte_to_node(mas->node);
> +	mt_free_one(node);
> +}
> +
> +/*
> + * mas_copy_node() - Copy a maple node and allocate child nodes.

if required. "..and allocate child nodes if required."

> + * @mas: Points to the source node.
> + * @new_mas: Points to the new node.
> + * @parent: The parent node of the new node.
> + * @gfp: The GFP_FLAGS to use for allocations.
> + *
> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> + * @new_mas->node and allocate new child nodes for @new_mas->node.
> + * If memory allocation fails, @mas is set to -ENOMEM.
> + */
> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> +		struct maple_node *parent, gfp_t gfp)
> +{
> +	struct maple_node *node = mte_to_node(mas->node);
> +	struct maple_node *new_node = mte_to_node(new_mas->node);
> +	enum maple_type type;
> +	unsigned long val;
> +	unsigned char request, count, i;
> +	void __rcu **slots;
> +	void __rcu **new_slots;
> +
> +	/* Copy the node completely. */
> +	memcpy(new_node, node, sizeof(struct maple_node));
> +
> +	/* Update the parent node pointer. */
> +	if (unlikely(ma_is_root(node)))
> +		val = MA_ROOT_PARENT;
> +	else
> +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;

If you treat the root as special and outside the loop, then you can
avoid the check for root for every non-root node.  For root, you just
need to copy and do this special parent thing before the main loop in
mas_dup_build().  This will avoid an extra branch for each VMA over 14,
so that would add up to a lot of instructions.

> +
> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> +
> +	if (mte_is_leaf(mas->node))
> +		return;

You are checking here and in mas_dup_build() for the leaf, splitting the
function into parent assignment and allocate would allow you to check
once. Copy could be moved to the main loop or with the parent setting,
depending on how you handle the root suggestion above.

> +
> +	/* Allocate memory for child nodes. */
> +	type = mte_node_type(mas->node);
> +	new_slots = ma_slots(new_node, type);
> +	request = mas_data_end(mas) + 1;
> +	count = mt_alloc_bulk(gfp, request, new_slots);
> +	if (unlikely(count < request)) {
> +		if (count)
> +			mt_free_bulk(count, new_slots);

The new_slots will still contain the addresses of the freed nodes.
Don't you need to clear it here to avoid a double free?  Is there a
test case for this in your testing?  Again, I may have missed how this
is not possible..

> +		mas_set_err(mas, -ENOMEM);
> +		return;
> +	}
> +
> +	/* Restore node type information in slots. */
> +	slots = ma_slots(node, type);
> +	for (i = 0; i < count; i++)
> +		((unsigned long *)new_slots)[i] |=
> +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> +			MAPLE_NODE_MASK);

Can you expand this to multiple lines to make it more clear what is
going on?

> +}
> +
> +/*
> + * mas_dup_build() - Build a new maple tree from a source tree
> + * @mas: The maple state of source tree.
> + * @new_mas: The maple state of new tree.
> + * @gfp: The GFP_FLAGS to use for allocations.
> + *
> + * This function builds a new tree in DFS preorder. If the memory allocation
> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> + * last node. mas_dup_free() will free the half-constructed tree.
> + *
> + * Note that the attributes of the two trees must be exactly the same, and the
> + * new tree must be empty, otherwise -EINVAL will be returned.
> + */
> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> +		gfp_t gfp)
> +{
> +	struct maple_node *node, *parent;

Could parent be struct maple_pnode?

> +	struct maple_enode *root;
> +	enum maple_type type;
> +
> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> +	    unlikely(!mtree_empty(new_mas->tree))) {
> +		mas_set_err(mas, -EINVAL);
> +		return;
> +	}
> +
> +	mas_start(mas);
> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
> +		/*
> +		 * The attributes of the two trees must be the same before this.
> +		 * The following assignment makes them the same height.
> +		 */
> +		new_mas->tree->ma_flags = mas->tree->ma_flags;
> +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> +		return;
> +	}
> +
> +	node = mt_alloc_one(gfp);
> +	if (!node) {
> +		new_mas->node = NULL;

We don't have checks around for node == NULL, MAS_NONE would be a safer
choice.  It is unlikely that someone would dup the tree and fail then
call something else, but I avoid setting node to NULL.

> +		mas_set_err(mas, -ENOMEM);
> +		return;
> +	}
> +
> +	type = mte_node_type(mas->node);
> +	root = mt_mk_node(node, type);
> +	new_mas->node = root;
> +	new_mas->min = 0;
> +	new_mas->max = ULONG_MAX;
> +	parent = ma_mnode_ptr(new_mas->tree);
> +
> +	while (1) {
> +		mas_copy_node(mas, new_mas, parent, gfp);
> +
> +		if (unlikely(mas_is_err(mas)))
> +			return;
> +
> +		/* Once we reach a leaf, we need to ascend, or end the loop. */
> +		if (mte_is_leaf(mas->node)) {
> +			if (mas->max == ULONG_MAX) {
> +				new_mas->tree->ma_flags = mas->tree->ma_flags;
> +				rcu_assign_pointer(new_mas->tree->ma_root,
> +						   mte_mk_root(root));
> +				break;

If you move this to the end of the function, you can replace the same
block above with a goto.  That will avoid breaking the line up.

> +			}
> +
> +			do {
> +				/*
> +				 * Must not at the root node, because we've
> +				 * already end the loop when we reach the last
> +				 * leaf.
> +				 */

I'm not sure what the comment above is trying to say.  Do you mean "This
won't reach the root node because the loop will break when the last leaf
is hit"?  I don't think that is accurate.. it will hit the root node but
not the end of the root node, right?  Anyways, the comment isn't clear
so please have a look.

> +				mas_ascend(mas);
> +				mas_ascend(new_mas);
> +			} while (mas->offset == mas_data_end(mas));
> +
> +			mas->offset++;
> +			new_mas->offset++;
> +		}
> +
> +		mas_descend(mas);
> +		parent = mte_to_node(new_mas->node);
> +		mas_descend(new_mas);
> +		mas->offset = 0;
> +		new_mas->offset = 0;
> +	}
> +}
> +
> +/**
> + * __mt_dup(): Duplicate a maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree using a faster method than traversing
> + * the source tree and inserting entries into the new tree one by one.

Can you make this comment more about what your code does instead of the
"one by one" description?

> + * The user needs to ensure that the attributes of the source tree and the new
> + * tree are the same, and the new tree needs to be an empty tree, otherwise
> + * -EINVAL will be returned.
> + * Note that the user needs to manually lock the source tree and the new tree.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> + * the attributes of the two trees are different or the new tree is not an empty
> + * tree.
> + */
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> +	int ret = 0;
> +	MA_STATE(mas, mt, 0, 0);
> +	MA_STATE(new_mas, new, 0, 0);
> +
> +	mas_dup_build(&mas, &new_mas, gfp);
> +
> +	if (unlikely(mas_is_err(&mas))) {
> +		ret = xa_err(mas.node);
> +		if (ret == -ENOMEM)
> +			mas_dup_free(&new_mas);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(__mt_dup);
> +
> +/**
> + * mtree_dup(): Duplicate a maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree using a faster method than traversing
> + * the source tree and inserting entries into the new tree one by one.

Again, it's more interesting to state it uses the DFS preorder copy.

It is also worth mentioning the superior allocation behaviour since that
is a desirable trait for many.  In fact, you should add the allocation
behaviour in your cover letter.

> + * The user needs to ensure that the attributes of the source tree and the new
> + * tree are the same, and the new tree needs to be an empty tree, otherwise
> + * -EINVAL will be returned.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> + * the attributes of the two trees are different or the new tree is not an empty
> + * tree.
> + */
> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> +	int ret = 0;
> +	MA_STATE(mas, mt, 0, 0);
> +	MA_STATE(new_mas, new, 0, 0);
> +
> +	mas_lock(&new_mas);
> +	mas_lock(&mas);
> +
> +	mas_dup_build(&mas, &new_mas, gfp);
> +	mas_unlock(&mas);
> +
> +	if (unlikely(mas_is_err(&mas))) {
> +		ret = xa_err(mas.node);
> +		if (ret == -ENOMEM)
> +			mas_dup_free(&new_mas);
> +	}
> +
> +	mas_unlock(&new_mas);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(mtree_dup);
> +
>  /**
>   * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>   * @mt: The maple tree
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/6] maple_tree: Add test for mtree_dup()
  2023-08-30 12:56 ` [PATCH v2 3/6] maple_tree: Add test for mtree_dup() Peng Zhang
@ 2023-09-07 20:13   ` Liam R. Howlett
  2023-09-08  9:38     ` Peng Zhang
  2023-09-25  4:06     ` Peng Zhang
  0 siblings, 2 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 20:13 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> Add test for mtree_dup().

Please add a better description of what tests are included.

> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  tools/testing/radix-tree/maple.c | 344 +++++++++++++++++++++++++++++++
>  1 file changed, 344 insertions(+)
> 
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index e5da1cad70ba..38455916331e 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c

Why not lib/test_maple_tree.c?

If they are included there then they will be built into the test module.
I try to include any tests that I can in the test module, within reason.


> @@ -35857,6 +35857,346 @@ static noinline void __init check_locky(struct maple_tree *mt)
>  	mt_clear_in_rcu(mt);
>  }
>  
> +/*
> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.

The slots can be different, right?  That seems worth mentioning here.
It's also worth mentioning this is destructive.

> + */
> +static int __init compare_node(struct maple_enode *enode_a,
> +			       struct maple_enode *enode_b)
> +{
> +	struct maple_node *node_a, *node_b;
> +	struct maple_node a, b;
> +	void **slots_a, **slots_b; /* Do not use the rcu tag. */
> +	enum maple_type type;
> +	int i;
> +
> +	if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
> +	    ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
> +		pr_err("The lower 8 bits of enode are different.\n");
> +		return -1;
> +	}
> +
> +	type = mte_node_type(enode_a);
> +	node_a = mte_to_node(enode_a);
> +	node_b = mte_to_node(enode_b);
> +	a = *node_a;
> +	b = *node_b;
> +
> +	/* Do not compare addresses. */
> +	if (ma_is_root(node_a) || ma_is_root(node_b)) {
> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> +						  MA_ROOT_PARENT);
> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> +						  MA_ROOT_PARENT);
> +	} else {
> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> +						  MAPLE_NODE_MASK);
> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> +						  MAPLE_NODE_MASK);
> +	}
> +
> +	if (a.parent != b.parent) {
> +		pr_err("The lower 8 bits of parents are different. %p %p\n",
> +			a.parent, b.parent);
> +		return -1;
> +	}
> +
> +	/*
> +	 * If it is a leaf node, the slots do not contain the node address, and
> +	 * no special processing of slots is required.
> +	 */
> +	if (ma_is_leaf(type))
> +		goto cmp;
> +
> +	slots_a = ma_slots(&a, type);
> +	slots_b = ma_slots(&b, type);
> +
> +	for (i = 0; i < mt_slots[type]; i++) {
> +		if (!slots_a[i] && !slots_b[i])
> +			break;
> +
> +		if (!slots_a[i] || !slots_b[i]) {
> +			pr_err("The number of slots is different.\n");
> +			return -1;
> +		}
> +
> +		/* Do not compare addresses in slots. */
> +		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
> +		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
> +	}
> +
> +cmp:
> +	/*
> +	 * Compare all contents of two nodes, including parent (except address),
> +	 * slots (except address), pivots, gaps and metadata.
> +	 */
> +	return memcmp(&a, &b, sizeof(struct maple_node));
> +}
> +
> +/*
> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
> + */
> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
> +{
> +	MA_STATE(mas_a, mt_a, 0, 0);
> +	MA_STATE(mas_b, mt_b, 0, 0);
> +
> +	if (mt_a->ma_flags != mt_b->ma_flags) {
> +		pr_err("The flags of the two trees are different.\n");
> +		return -1;
> +	}
> +
> +	mas_dfs_preorder(&mas_a);
> +	mas_dfs_preorder(&mas_b);
> +
> +	if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
> +		if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
> +			pr_err("One is MAS_ROOT and the other is not.\n");
> +			return -1;
> +		}
> +		return 0;
> +	}
> +
> +	while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
> +
> +		if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
> +			pr_err("One is MAS_NONE and the other is not.\n");
> +			return -1;
> +		}
> +
> +		if (mas_a.min != mas_b.min ||
> +		    mas_a.max != mas_b.max) {
> +			pr_err("mas->min, mas->max do not match.\n");
> +			return -1;
> +		}
> +
> +		if (compare_node(mas_a.node, mas_b.node)) {
> +			pr_err("The contents of nodes %p and %p are different.\n",
> +			       mas_a.node, mas_b.node);
> +			mt_dump(mt_a, mt_dump_dec);
> +			mt_dump(mt_b, mt_dump_dec);
> +			return -1;
> +		}
> +
> +		mas_dfs_preorder(&mas_a);
> +		mas_dfs_preorder(&mas_b);
> +	}
> +
> +	return 0;
> +}
> +
> +static __init void mas_subtree_max_range(struct ma_state *mas)
> +{
> +	unsigned long limit = mas->max;
> +	MA_STATE(newmas, mas->tree, 0, 0);
> +	void *entry;
> +
> +	mas_for_each(mas, entry, limit) {
> +		if (mas->last - mas->index >=
> +		    newmas.last - newmas.index) {
> +			newmas = *mas;
> +		}
> +	}
> +
> +	*mas = newmas;
> +}
> +
> +/*
> + * build_full_tree() - Build a full tree.
> + * @mt: The tree to build.
> + * @flags: Use @flags to build the tree.
> + * @height: The height of the tree to build.
> + *
> + * Build a tree with full leaf nodes and internal nodes. Note that the height
> + * should not exceed 3, otherwise it will take a long time to build.
> + * Return: zero if the build is successful, non-zero if it fails.
> + */
> +static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
> +		int height)
> +{
> +	MA_STATE(mas, mt, 0, 0);
> +	unsigned long step;
> +	int ret = 0, cnt = 1;
> +	enum maple_type type;
> +
> +	mt_init_flags(mt, flags);
> +	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
> +
> +	mtree_lock(mt);
> +
> +	while (1) {
> +		mas_set(&mas, 0);
> +		if (mt_height(mt) < height) {
> +			mas.max = ULONG_MAX;
> +			goto store;
> +		}
> +
> +		while (1) {
> +			mas_dfs_preorder(&mas);
> +			if (mas_is_none(&mas))
> +				goto unlock;
> +
> +			type = mte_node_type(mas.node);
> +			if (mas_data_end(&mas) + 1 < mt_slots[type]) {
> +				mas_set(&mas, mas.min);
> +				goto store;
> +			}
> +		}
> +store:
> +		mas_subtree_max_range(&mas);
> +		step = mas.last - mas.index;
> +		if (step < 1) {
> +			ret = -1;
> +			goto unlock;
> +		}
> +
> +		step /= 2;
> +		mas.last = mas.index + step;
> +		mas_store_gfp(&mas, xa_mk_value(5),
> +				GFP_KERNEL);
> +		++cnt;
> +	}
> +unlock:
> +	mtree_unlock(mt);
> +
> +	MT_BUG_ON(mt, mt_height(mt) != height);
> +	/* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
> +	return ret;
> +}
> +
> +static noinline void __init check_mtree_dup(struct maple_tree *mt)
> +{
> +	DEFINE_MTREE(new);
> +	int i, j, ret, count = 0;
> +	unsigned int rand_seed = 17, rand;
> +
> +	/* store a value at [0, 0] */
> +	mt_init_flags(&tree, 0);
> +	mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
> +	MT_BUG_ON(&new, ret);
> +	mt_validate(&new);
> +	if (compare_tree(&tree, &new))
> +		MT_BUG_ON(&new, 1);
> +
> +	mtree_destroy(&tree);
> +	mtree_destroy(&new);
> +
> +	/* The two trees have different attributes. */
> +	mt_init_flags(&tree, 0);
> +	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
> +	MT_BUG_ON(&new, ret != -EINVAL);
> +	mtree_destroy(&tree);
> +	mtree_destroy(&new);
> +
> +	/* The new tree is not empty */
> +	mt_init_flags(&tree, 0);
> +	mt_init_flags(&new, 0);
> +	mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
> +	MT_BUG_ON(&new, ret != -EINVAL);
> +	mtree_destroy(&tree);
> +	mtree_destroy(&new);
> +
> +	/* Test for duplicating full trees. */
> +	for (i = 1; i <= 3; i++) {
> +		ret = build_full_tree(&tree, 0, i);
> +		MT_BUG_ON(&tree, ret);
> +		mt_init_flags(&new, 0);
> +
> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
> +		MT_BUG_ON(&new, ret);
> +		mt_validate(&new);
> +		if (compare_tree(&tree, &new))
> +			MT_BUG_ON(&new, 1);
> +
> +		mtree_destroy(&tree);
> +		mtree_destroy(&new);
> +	}
> +
> +	for (i = 1; i <= 3; i++) {
> +		ret = build_full_tree(&tree, MT_FLAGS_ALLOC_RANGE, i);
> +		MT_BUG_ON(&tree, ret);
> +		mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> +
> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
> +		MT_BUG_ON(&new, ret);
> +		mt_validate(&new);
> +		if (compare_tree(&tree, &new))
> +			MT_BUG_ON(&new, 1);
> +
> +		mtree_destroy(&tree);
> +		mtree_destroy(&new);
> +	}
> +
> +	/* Test for normal duplicating. */
> +	for (i = 0; i < 1000; i += 3) {
> +		if (i & 1) {
> +			mt_init_flags(&tree, 0);
> +			mt_init_flags(&new, 0);
> +		} else {
> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> +		}
> +
> +		for (j = 0; j < i; j++) {
> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
> +					  xa_mk_value(j), GFP_KERNEL);
> +		}
> +
> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
> +		MT_BUG_ON(&new, ret);
> +		mt_validate(&new);
> +		if (compare_tree(&tree, &new))
> +			MT_BUG_ON(&new, 1);
> +
> +		mtree_destroy(&tree);
> +		mtree_destroy(&new);
> +	}
> +
> +	/* Test memory allocation failed. */

It might be worth while having specific allocations fail.  At a leaf
node, intermediate nodes, first node come to mind.

> +	for (i = 0; i < 1000; i += 3) {
> +		if (i & 1) {
> +			mt_init_flags(&tree, 0);
> +			mt_init_flags(&new, 0);
> +		} else {
> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> +		}
> +
> +		for (j = 0; j < i; j++) {
> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
> +					  xa_mk_value(j), GFP_KERNEL);
> +		}
> +		/*
> +		 * The rand() library function is not used, so we can generate
> +		 * the same random numbers on any platform.
> +		 */
> +		rand_seed = rand_seed * 1103515245 + 12345;
> +		rand = rand_seed / 65536 % 128;
> +		mt_set_non_kernel(rand);
> +
> +		ret = mtree_dup(&tree, &new, GFP_NOWAIT);
> +		mt_set_non_kernel(0);
> +		if (ret != 0) {
> +			MT_BUG_ON(&new, ret != -ENOMEM);
> +			count++;
> +			mtree_destroy(&tree);
> +			continue;
> +		}
> +
> +		mt_validate(&new);
> +		if (compare_tree(&tree, &new))
> +			MT_BUG_ON(&new, 1);
> +
> +		mtree_destroy(&tree);
> +		mtree_destroy(&new);
> +	}
> +
> +	/* pr_info("mtree_dup() fail %d times\n", count); */
> +	BUG_ON(!count);
> +}
> +
>  extern void test_kmem_cache_bulk(void);
>  
>  void farmer_tests(void)
> @@ -35904,6 +36244,10 @@ void farmer_tests(void)
>  	check_null_expand(&tree);
>  	mtree_destroy(&tree);
>  
> +	mt_init_flags(&tree, 0);
> +	check_mtree_dup(&tree);
> +	mtree_destroy(&tree);
> +
>  	/* RCU testing */
>  	mt_init_flags(&tree, 0);
>  	check_erase_testset(&tree);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-08-30 12:56 ` [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking() Peng Zhang
  2023-08-31 13:40   ` kernel test robot
@ 2023-09-07 20:14   ` Liam R. Howlett
  1 sibling, 0 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 20:14 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> Updated check_forking() and bench_forking() to use __mt_dup() to
> duplicate maple tree. Also increased the number of VMAs, because the
> new way is faster.
> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  lib/test_maple_tree.c | 61 +++++++++++++++++++++----------------------
>  1 file changed, 30 insertions(+), 31 deletions(-)
> 
> diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
> index 0ec0c6a7c0b5..72fba7cce148 100644
> --- a/lib/test_maple_tree.c
> +++ b/lib/test_maple_tree.c
> @@ -1837,36 +1837,37 @@ static noinline void __init check_forking(struct maple_tree *mt)
>  {
>  
>  	struct maple_tree newmt;
> -	int i, nr_entries = 134;
> +	int i, nr_entries = 300, ret;

check_forking can probably remain at 134, I set it to to 134 as a
'reasonable' value.  Unless you want 300 to test some specific case in
your case?

>  	void *val;
>  	MA_STATE(mas, mt, 0, 0);
> -	MA_STATE(newmas, mt, 0, 0);
> +	MA_STATE(newmas, &newmt, 0, 0);
> +
> +	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
>  
>  	for (i = 0; i <= nr_entries; i++)
>  		mtree_store_range(mt, i*10, i*10 + 5,
>  				  xa_mk_value(i), GFP_KERNEL);
>  
> +
>  	mt_set_non_kernel(99999);
> -	mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
> -	newmas.tree = &newmt;
> -	mas_reset(&newmas);
> -	mas_reset(&mas);
>  	mas_lock(&newmas);
> -	mas.index = 0;
> -	mas.last = 0;
> -	if (mas_expected_entries(&newmas, nr_entries)) {
> +	mas_lock(&mas);
> +
> +	ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
> +	if (ret) {
>  		pr_err("OOM!");
>  		BUG_ON(1);
>  	}
> -	rcu_read_lock();
> -	mas_for_each(&mas, val, ULONG_MAX) {
> -		newmas.index = mas.index;
> -		newmas.last = mas.last;
> +
> +	mas_set(&newmas, 0);
> +	mas_for_each(&newmas, val, ULONG_MAX) {
>  		mas_store(&newmas, val);
>  	}
> -	rcu_read_unlock();
> -	mas_destroy(&newmas);
> +
> +	mas_unlock(&mas);
>  	mas_unlock(&newmas);
> +
> +	mas_destroy(&newmas);
>  	mt_validate(&newmt);
>  	mt_set_non_kernel(0);
>  	mtree_destroy(&newmt);
> @@ -1974,12 +1975,11 @@ static noinline void __init check_mas_store_gfp(struct maple_tree *mt)
>  #if defined(BENCH_FORK)
>  static noinline void __init bench_forking(struct maple_tree *mt)
>  {
> -
>  	struct maple_tree newmt;
> -	int i, nr_entries = 134, nr_fork = 80000;
> +	int i, nr_entries = 300, nr_fork = 80000, ret;
>  	void *val;
>  	MA_STATE(mas, mt, 0, 0);
> -	MA_STATE(newmas, mt, 0, 0);
> +	MA_STATE(newmas, &newmt, 0, 0);
>  
>  	for (i = 0; i <= nr_entries; i++)
>  		mtree_store_range(mt, i*10, i*10 + 5,
> @@ -1988,25 +1988,24 @@ static noinline void __init bench_forking(struct maple_tree *mt)
>  	for (i = 0; i < nr_fork; i++) {
>  		mt_set_non_kernel(99999);
>  		mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
> -		newmas.tree = &newmt;
> -		mas_reset(&newmas);
> -		mas_reset(&mas);
> -		mas.index = 0;
> -		mas.last = 0;
> -		rcu_read_lock();
> +
>  		mas_lock(&newmas);
> -		if (mas_expected_entries(&newmas, nr_entries)) {
> -			printk("OOM!");
> +		mas_lock(&mas);

Should probably switch this locking to not nest as well, since you have
to make the test framework cope with it already :/


> +		ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
> +		if (ret) {
> +			pr_err("OOM!");
>  			BUG_ON(1);
>  		}
> -		mas_for_each(&mas, val, ULONG_MAX) {
> -			newmas.index = mas.index;
> -			newmas.last = mas.last;
> +
> +		mas_set(&newmas, 0);
> +		mas_for_each(&newmas, val, ULONG_MAX) {
>  			mas_store(&newmas, val);
>  		}
> -		mas_destroy(&newmas);
> +
> +		mas_unlock(&mas);
>  		mas_unlock(&newmas);
> -		rcu_read_unlock();
> +
> +		mas_destroy(&newmas);
>  		mt_validate(&newmt);
>  		mt_set_non_kernel(0);
>  		mtree_destroy(&newmt);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-08-30 12:56 ` [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
@ 2023-09-07 20:14   ` Liam R. Howlett
  2023-09-08  9:58     ` Peng Zhang
  2023-09-15 10:51     ` Peng Zhang
  0 siblings, 2 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 20:14 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> directly modify the entries of VMAs in the new maple tree, which can
> get better performance. The optimization effect is proportional to the
> number of VMAs.
> 
> There is a "spawn" in byte-unixbench[1], which can be used to test the
> performance of fork(). I modified it slightly to make it work with
> different number of VMAs.
> 
> Below are the test numbers. There are 21 VMAs by default. The first row
> indicates the number of added VMAs. The following two lines are the
> number of fork() calls every 10 seconds. These numbers are different
> from the test results in v1 because this time the benchmark is bound to
> a CPU. This way the numbers are more stable.
> 
>   Increment of VMAs: 0      100     200     400     800     1600    3200    6400
> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>                      +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%

Thanks!

Can you include 21 in this table since it's the default?

> 
> [1] https://github.com/kdlucas/byte-unixbench/tree/master
> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>  mm/mmap.c     | 14 ++++++++++++--
>  2 files changed, 38 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 3b6d20dfb9a8..e6299adefbd8 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	int retval;
>  	unsigned long charge = 0;
>  	LIST_HEAD(uf);
> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>  	VMA_ITERATOR(vmi, mm, 0);
>  
>  	uprobe_start_dup_mmap();
> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  		goto out;
>  	khugepaged_fork(mm, oldmm);
>  
> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> -	if (retval)
> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);

Apparently the flags should be GFP_KERNEL here so that compaction can
run.

> +	if (unlikely(retval))
>  		goto out;
>  
>  	mt_clear_in_rcu(vmi.mas.tree);
> -	for_each_vma(old_vmi, mpnt) {
> +	for_each_vma(vmi, mpnt) {
>  		struct file *file;
>  
>  		vma_start_write(mpnt);
>  		if (mpnt->vm_flags & VM_DONTCOPY) {
>  			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> +
> +			/*
> +			 * Since the new tree is exactly the same as the old one,
> +			 * we need to remove the unneeded VMAs.
> +			 */
> +			mas_store(&vmi.mas, NULL);
> +
> +			/*
> +			 * Even removing an entry may require memory allocation,
> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
> +			 * from which VMA it failed. The case of encountering
> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
> +			 */
> +			if (unlikely(mas_is_err(&vmi.mas))) {
> +				retval = xa_err(vmi.mas.node);
> +				mas_reset(&vmi.mas);
> +				if (mas_find(&vmi.mas, ULONG_MAX))
> +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
> +				goto loop_out;
> +			}
> +

Storing NULL may need extra space as you noted, so we need to be careful
what happens if we don't have that space.  We should have a testcase to
test this scenario.

mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
in this function, see vm_area_dup().

Don't use the exit_mmap() path to undo a failed fork.  You've added
checks and complications to the exit path for all tasks in the very
unlikely event that we run out of memory when we hit a very unlikely
VM_DONTCOPY flag.

I see the issue with having a portion of the tree with new VMAs that are
accounted and a portion of the tree that has old VMAs that should not be
looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
we cannot add that complication to the exit path and then there is the
OOM race to worry about (maybe, I am not sure since this MM isn't
active yet).

Using what is done in exit_mmap() and do_vmi_align_munmap() as a
prototype, we can do something like the *untested* code below:

if (unlikely(mas_is_err(&vmi.mas))) {
	unsigned long max = vmi.index;

	retval = xa_err(vmi.mas.node);
	mas_set(&vmi.mas, 0);
	tmp = mas_find(&vmi.mas, ULONG_MAX);
	if (tmp) { /* Not the first VMA failed */
		unsigned long nr_accounted = 0;

		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
				true);
		do {
			if (vma->vm_flags & VM_ACCOUNT)
				nr_accounted += vma_pages(vma);
			remove_vma(vma, true);
			cond_resched();
			vma = mas_find(&vmi.mas, max - 1);
		} while (vma != NULL);

		vm_unacct_memory(nr_accounted);
	}
	__mt_destroy(&mm->mm_mt);
	goto loop_out;
}

Once exit_mmap() is called, the check for OOM (no vma) will catch that
nothing is left to do.

It might be worth making an inline function to do this to keep the fork
code clean.  We should test this by detecting a specific task name and
returning a failure at a given interval:

if (!strcmp(current->comm, "fork_test") {
...
}


>  			continue;
>  		}
>  		charge = 0;
> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			hugetlb_dup_vma_private(tmp);
>  
>  		/* Link the vma into the MT */
> -		if (vma_iter_bulk_store(&vmi, tmp))
> -			goto fail_nomem_vmi_store;
> +		mas_store(&vmi.mas, tmp);
>  
>  		mm->map_count++;
>  		if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  	uprobe_end_dup_mmap();
>  	return retval;
>  
> -fail_nomem_vmi_store:
> -	unlink_anon_vmas(tmp);
>  fail_nomem_anon_vma_fork:
>  	mpol_put(vma_policy(tmp));
>  fail_nomem_policy:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index b56a7f0c9f85..dfc6881be81c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>  	arch_exit_mmap(mm);
>  
>  	vma = mas_find(&mas, ULONG_MAX);
> -	if (!vma) {
> +	/*
> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> +	 * xa_is_zero(vma) may be true.
> +	 */
> +	if (!vma || xa_is_zero(vma)) {
>  		/* Can happen if dup_mmap() received an OOM */
>  		mmap_read_unlock(mm);
>  		return;
> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>  		remove_vma(vma, true);
>  		count++;
>  		cond_resched();
> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> +		vma = mas_find(&mas, ULONG_MAX);
> +		/*
> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
> +		 * donot need to be removed. Can happen if dup_mmap() fails to
> +		 * remove a VMA marked VM_DONTCOPY.
> +		 */
> +	} while (vma != NULL && !xa_is_zero(vma));
>  
>  	BUG_ON(count != mm->map_count);
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork()
  2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
                   ` (6 preceding siblings ...)
  2023-08-30 13:05 ` [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
@ 2023-09-07 20:19 ` Liam R. Howlett
  7 siblings, 0 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-07 20:19 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> In the process of duplicating mmap in fork(), VMAs will be inserted into the new
> maple tree one by one. When inserting into the maple tree, the maple tree will
> be rebalanced multiple times. The rebalancing of maple tree is not as fast as
> the rebalancing of red-black tree and will be slower. Therefore, __mt_dup() is
> introduced to directly duplicate the structure of the old maple tree, and then
> modify each element of the new maple tree. This avoids rebalancing and some extra
> copying, so is faster than the original method.
> More information can refer to [1].

Thanks for this patch set, it's really coming along nicely.

> 
> There is a "spawn" in byte-unixbench[2], which can be used to test the performance
> of fork(). I modified it slightly to make it work with different number of VMAs.
> 
> Below are the test numbers. There are 21 VMAs by default. The first row indicates
> the number of added VMAs. The following two lines are the number of fork() calls
> every 10 seconds. These numbers are different from the test results in v1 because
> this time the benchmark is bound to a CPU. This way the numbers are more stable.
> 
>   Increment of VMAs: 0      100     200     400     800     1600    3200    6400
> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>                      +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%

Can you run this with the default 21 as well?

> 
> Todo:
>   - Update the documentation.
> 
> Changes since v1:
>  - Reimplement __mt_dup() and mtree_dup(). Loops are implemented without using
>    goto instructions.
>  - The new tree also needs to be locked to avoid some lockdep warnings.
>  - Drop and add some helpers.

I guess this also includes the changes to remove the new ways of finding
a node end and using that extra bit in the address?  Those were
significant and welcome changes.  Thanks.

>  - Add test for duplicating full tree.
>  - Drop mas_replace_entry(), it doesn't seem to have a big impact on the
>    performance of fork().
> 
> [1] https://lore.kernel.org/lkml/463899aa-6cbd-f08e-0aca-077b0e4e4475@bytedance.com/
> [2] https://github.com/kdlucas/byte-unixbench/tree/master
> 
> v1: https://lore.kernel.org/lkml/20230726080916.17454-1-zhangpeng.00@bytedance.com/
> 
> Peng Zhang (6):
>   maple_tree: Add two helpers
>   maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
>   maple_tree: Add test for mtree_dup()
>   maple_tree: Skip other tests when BENCH is enabled
>   maple_tree: Update check_forking() and bench_forking()
>   fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
> 
>  include/linux/maple_tree.h       |   3 +
>  kernel/fork.c                    |  34 ++-
>  lib/maple_tree.c                 | 277 ++++++++++++++++++++++++-
>  lib/test_maple_tree.c            |  69 +++---
>  mm/mmap.c                        |  14 +-
>  tools/testing/radix-tree/maple.c | 346 +++++++++++++++++++++++++++++++
>  6 files changed, 697 insertions(+), 46 deletions(-)
> 
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 1/6] maple_tree: Add two helpers
  2023-09-07 20:13   ` Liam R. Howlett
@ 2023-09-08  2:45     ` Peng Zhang
  0 siblings, 0 replies; 35+ messages in thread
From: Peng Zhang @ 2023-09-08  2:45 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Add two helpers, which will be used later.
> 
> Can you please change the subject to something like:
> Add mt_free_one() and mt_attr() helpers
> 
> for easier git log readability?
OK, I'll do that.
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   lib/maple_tree.c | 12 +++++++++++-
>>   1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index ee1ff0c59fd7..ef234cf02e3e 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -165,6 +165,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>>   	return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>>   }
>>   
>> +static inline void mt_free_one(struct maple_node *node)
>> +{
>> +	kmem_cache_free(maple_node_cache, node);
>> +}
>> +
>>   static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>>   {
>>   	kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>> @@ -205,6 +210,11 @@ static unsigned int mas_mt_height(struct ma_state *mas)
>>   	return mt_height(mas->tree);
>>   }
>>   
>> +static inline unsigned int mt_attr(struct maple_tree *mt)
>> +{
>> +	return mt->ma_flags & ~MT_FLAGS_HEIGHT_MASK;
>> +}
>> +
>>   static inline enum maple_type mte_node_type(const struct maple_enode *entry)
>>   {
>>   	return ((unsigned long)entry >> MAPLE_NODE_TYPE_SHIFT) &
>> @@ -5520,7 +5530,7 @@ void mas_destroy(struct ma_state *mas)
>>   			mt_free_bulk(count, (void __rcu **)&node->slot[1]);
>>   			total -= count;
>>   		}
>> -		kmem_cache_free(maple_node_cache, node);
>> +		mt_free_one(ma_mnode_ptr(node));
>>   		total--;
>>   	}
>>   
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  2023-09-07 20:13   ` Liam R. Howlett
@ 2023-09-08  9:26     ` Peng Zhang
  2023-09-08 16:05       ` Liam R. Howlett
  2023-09-11 12:59     ` Peng Zhang
  1 sibling, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-08  9:26 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
>> duplicate a maple tree. Compared with traversing the source tree and
>> reinserting entry by entry in the new tree, it has better performance.
>> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
>> handles locks internally.
> 
> __mt_dup() should be called mas_dup() to indicate the advanced interface
> which requires users to handle their own locks.
Ok, I'll change __mt_dup() to mas_dup().
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   include/linux/maple_tree.h |   3 +
>>   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 268 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index e41c70ac7744..44fe8a57ecbd 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>   		void *entry, gfp_t gfp);
>>   void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>   
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +
>>   void mtree_destroy(struct maple_tree *mt);
>>   void __mt_destroy(struct maple_tree *mt);
>>   
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index ef234cf02e3e..8f841682269c 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>   }
>>   EXPORT_SYMBOL(mtree_erase);
>>   
>> +/*
>> + * mas_dup_free() - Free a half-constructed tree.
> 
> Maybe "Free an incomplete duplication of a tree" ?
> 
>> + * @mas: Points to the last node of the half-constructed tree.
> 
> Your use of "Points to" seems to indicate someone knows you are talking
> about a "maple state that has a node pointing to".  Can this be made
> more clear?
> @mas: The maple state of a incomplete tree.
> 
> Then add a note that @mas->node points to the last successfully
> allocated node?
> 
> Or something along those lines.
Ok, I'll revise the comment.
> 
>> + *
>> + * This function frees all nodes starting from @mas->node in the reverse order
>> + * of mas_dup_build(). There is no need to hold the source tree lock at this
>> + * time.
>> + */
>> +static void mas_dup_free(struct ma_state *mas)
>> +{
>> +	struct maple_node *node;
>> +	enum maple_type type;
>> +	void __rcu **slots;
>> +	unsigned char count, i;
>> +
>> +	/* Maybe the first node allocation failed. */
>> +	if (!mas->node)
>> +		return;
>> +
>> +	while (!mte_is_root(mas->node)) {
>> +		mas_ascend(mas);
>> +
>> +		if (mas->offset) {
>> +			mas->offset--;
>> +			do {
>> +				mas_descend(mas);
>> +				mas->offset = mas_data_end(mas);
>> +			} while (!mte_is_leaf(mas->node));
> 
> Can you blindly descend and check !mte_is_leaf()?  What happens when the
> tree duplication fails at random internal nodes?  Maybe I missed how
> this cannot happen?
This cannot happen. Note the mas_ascend(mas) at the beginning of the
outermost loop.

> 
>> +
>> +			mas_ascend(mas);
>> +		}
>> +
>> +		node = mte_to_node(mas->node);
>> +		type = mte_node_type(mas->node);
>> +		slots = (void **)ma_slots(node, type);
>> +		count = mas_data_end(mas) + 1;
>> +		for (i = 0; i < count; i++)
>> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>> +
>> +		mt_free_bulk(count, slots);
>> +	}
> 
> 
>> +
>> +	node = mte_to_node(mas->node);
>> +	mt_free_one(node);
>> +}
>> +
>> +/*
>> + * mas_copy_node() - Copy a maple node and allocate child nodes.
> 
> if required. "..and allocate child nodes if required."
> 
>> + * @mas: Points to the source node.
>> + * @new_mas: Points to the new node.
>> + * @parent: The parent node of the new node.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
>> + * @new_mas->node and allocate new child nodes for @new_mas->node.
>> + * If memory allocation fails, @mas is set to -ENOMEM.
>> + */
>> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
>> +		struct maple_node *parent, gfp_t gfp)
>> +{
>> +	struct maple_node *node = mte_to_node(mas->node);
>> +	struct maple_node *new_node = mte_to_node(new_mas->node);
>> +	enum maple_type type;
>> +	unsigned long val;
>> +	unsigned char request, count, i;
>> +	void __rcu **slots;
>> +	void __rcu **new_slots;
>> +
>> +	/* Copy the node completely. */
>> +	memcpy(new_node, node, sizeof(struct maple_node));
>> +
>> +	/* Update the parent node pointer. */
>> +	if (unlikely(ma_is_root(node)))
>> +		val = MA_ROOT_PARENT;
>> +	else
>> +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> 
> If you treat the root as special and outside the loop, then you can
> avoid the check for root for every non-root node.  For root, you just
> need to copy and do this special parent thing before the main loop in
> mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> so that would add up to a lot of instructions.
I'll handle the root node outside.
However, do you think it makes sense to have the parent of the root node
point to the struct maple_tree? I don't see it used anywhere.

> 
>> +
>> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
>> +
>> +	if (mte_is_leaf(mas->node))
>> +		return;
> 
> You are checking here and in mas_dup_build() for the leaf, splitting the
> function into parent assignment and allocate would allow you to check
> once. Copy could be moved to the main loop or with the parent setting,
> depending on how you handle the root suggestion above.
I'll try to reduce some checks.
> 
>> +
>> +	/* Allocate memory for child nodes. */
>> +	type = mte_node_type(mas->node);
>> +	new_slots = ma_slots(new_node, type);
>> +	request = mas_data_end(mas) + 1;
>> +	count = mt_alloc_bulk(gfp, request, new_slots);
>> +	if (unlikely(count < request)) {
>> +		if (count)
>> +			mt_free_bulk(count, new_slots);
> 
> The new_slots will still contain the addresses of the freed nodes.
> Don't you need to clear it here to avoid a double free?  Is there a
> test case for this in your testing?  Again, I may have missed how this
> is not possible..
It's impossible, because in mt_free_bulk(), the first thing to do with
the incoming node is to go up. We free all child nodes at the parent
node.

We guarantee that the node passed to mas_dup_free() is "clean".
mas_dup_free() also follows this so will not free children of this node.

The child nodes of this node cannot be freed in mt_free_bulk() because
the node is not completely constructed and data_end cannot be obtained.
data_end cannot be set on this node because the number of successfully
allocated child nodes can be 0.
> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	/* Restore node type information in slots. */
>> +	slots = ma_slots(node, type);
>> +	for (i = 0; i < count; i++)
>> +		((unsigned long *)new_slots)[i] |=
>> +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
>> +			MAPLE_NODE_MASK);
> 
> Can you expand this to multiple lines to make it more clear what is
> going on?
I will try to do that.

> 
>> +}
>> +
>> +/*
>> + * mas_dup_build() - Build a new maple tree from a source tree
>> + * @mas: The maple state of source tree.
>> + * @new_mas: The maple state of new tree.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * This function builds a new tree in DFS preorder. If the memory allocation
>> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
>> + * last node. mas_dup_free() will free the half-constructed tree.
>> + *
>> + * Note that the attributes of the two trees must be exactly the same, and the
>> + * new tree must be empty, otherwise -EINVAL will be returned.
>> + */
>> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>> +		gfp_t gfp)
>> +{
>> +	struct maple_node *node, *parent;
> 
> Could parent be struct maple_pnode?
I'll rename it.

> 
>> +	struct maple_enode *root;
>> +	enum maple_type type;
>> +
>> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
>> +	    unlikely(!mtree_empty(new_mas->tree))) {
>> +		mas_set_err(mas, -EINVAL);
>> +		return;
>> +	}
>> +
>> +	mas_start(mas);
>> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
>> +		/*
>> +		 * The attributes of the two trees must be the same before this.
>> +		 * The following assignment makes them the same height.
>> +		 */
>> +		new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
>> +		return;
>> +	}
>> +
>> +	node = mt_alloc_one(gfp);
>> +	if (!node) {
>> +		new_mas->node = NULL;
> 
> We don't have checks around for node == NULL, MAS_NONE would be a safer
> choice.  It is unlikely that someone would dup the tree and fail then
> call something else, but I avoid setting node to NULL.
I will set it to MAS_NONE in the next version.

> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	type = mte_node_type(mas->node);
>> +	root = mt_mk_node(node, type);
>> +	new_mas->node = root;
>> +	new_mas->min = 0;
>> +	new_mas->max = ULONG_MAX;
>> +	parent = ma_mnode_ptr(new_mas->tree);
>> +
>> +	while (1) {
>> +		mas_copy_node(mas, new_mas, parent, gfp);
>> +
>> +		if (unlikely(mas_is_err(mas)))
>> +			return;
>> +
>> +		/* Once we reach a leaf, we need to ascend, or end the loop. */
>> +		if (mte_is_leaf(mas->node)) {
>> +			if (mas->max == ULONG_MAX) {
>> +				new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +				rcu_assign_pointer(new_mas->tree->ma_root,
>> +						   mte_mk_root(root));
>> +				break;
> 
> If you move this to the end of the function, you can replace the same
> block above with a goto.  That will avoid breaking the line up.
I can do this, but it doesn't seem to make a difference.
> 
>> +			}
>> +
>> +			do {
>> +				/*
>> +				 * Must not at the root node, because we've
>> +				 * already end the loop when we reach the last
>> +				 * leaf.
>> +				 */
> 
> I'm not sure what the comment above is trying to say.  Do you mean "This
> won't reach the root node because the loop will break when the last leaf
> is hit"?  I don't think that is accurate.. it will hit the root node but
> not the end of the root node, right?  Anyways, the comment isn't clear
> so please have a look.
Yes, it will hit the root node but not the end of the root node. I'll
fix this comment. Thanks.

> 
>> +				mas_ascend(mas);
>> +				mas_ascend(new_mas);
>> +			} while (mas->offset == mas_data_end(mas));
>> +
>> +			mas->offset++;
>> +			new_mas->offset++;
>> +		}
>> +
>> +		mas_descend(mas);
>> +		parent = mte_to_node(new_mas->node);
>> +		mas_descend(new_mas);
>> +		mas->offset = 0;
>> +		new_mas->offset = 0;
>> +	}
>> +}
>> +
>> +/**
>> + * __mt_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Can you make this comment more about what your code does instead of the
> "one by one" description?
> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + * Note that the user needs to manually lock the source tree and the new tree.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(__mt_dup);
>> +
>> +/**
>> + * mtree_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Again, it's more interesting to state it uses the DFS preorder copy.
> 
> It is also worth mentioning the superior allocation behaviour since that
> is a desirable trait for many.  In fact, you should add the allocation
> behaviour in your cover letter.
Okay, I will describe more in the next version.

> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_lock(&new_mas);
>> +	mas_lock(&mas);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +	mas_unlock(&mas);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	mas_unlock(&new_mas);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(mtree_dup);
>> +
>>   /**
>>    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>    * @mt: The maple tree
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/6] maple_tree: Add test for mtree_dup()
  2023-09-07 20:13   ` Liam R. Howlett
@ 2023-09-08  9:38     ` Peng Zhang
  2023-09-25  4:06     ` Peng Zhang
  1 sibling, 0 replies; 35+ messages in thread
From: Peng Zhang @ 2023-09-08  9:38 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Add test for mtree_dup().
> 
> Please add a better description of what tests are included.
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   tools/testing/radix-tree/maple.c | 344 +++++++++++++++++++++++++++++++
>>   1 file changed, 344 insertions(+)
>>
>> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
>> index e5da1cad70ba..38455916331e 100644
>> --- a/tools/testing/radix-tree/maple.c
>> +++ b/tools/testing/radix-tree/maple.c
> 
> Why not lib/test_maple_tree.c?
Because I used mas_dfs_preorder() in user space, which is implemented in 
maple.c
> 
> If they are included there then they will be built into the test module.
> I try to include any tests that I can in the test module, within reason.
> 
> 
>> @@ -35857,6 +35857,346 @@ static noinline void __init check_locky(struct maple_tree *mt)
>>   	mt_clear_in_rcu(mt);
>>   }
>>   
>> +/*
>> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> 
> The slots can be different, right?  That seems worth mentioning here.
> It's also worth mentioning this is destructive.
Ok, I'll mention this.
> 
>> + */
>> +static int __init compare_node(struct maple_enode *enode_a,
>> +			       struct maple_enode *enode_b)
>> +{
>> +	struct maple_node *node_a, *node_b;
>> +	struct maple_node a, b;
>> +	void **slots_a, **slots_b; /* Do not use the rcu tag. */
>> +	enum maple_type type;
>> +	int i;
>> +
>> +	if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
>> +	    ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
>> +		pr_err("The lower 8 bits of enode are different.\n");
>> +		return -1;
>> +	}
>> +
>> +	type = mte_node_type(enode_a);
>> +	node_a = mte_to_node(enode_a);
>> +	node_b = mte_to_node(enode_b);
>> +	a = *node_a;
>> +	b = *node_b;
>> +
>> +	/* Do not compare addresses. */
>> +	if (ma_is_root(node_a) || ma_is_root(node_b)) {
>> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> +						  MA_ROOT_PARENT);
>> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> +						  MA_ROOT_PARENT);
>> +	} else {
>> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> +						  MAPLE_NODE_MASK);
>> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> +						  MAPLE_NODE_MASK);
>> +	}
>> +
>> +	if (a.parent != b.parent) {
>> +		pr_err("The lower 8 bits of parents are different. %p %p\n",
>> +			a.parent, b.parent);
>> +		return -1;
>> +	}
>> +
>> +	/*
>> +	 * If it is a leaf node, the slots do not contain the node address, and
>> +	 * no special processing of slots is required.
>> +	 */
>> +	if (ma_is_leaf(type))
>> +		goto cmp;
>> +
>> +	slots_a = ma_slots(&a, type);
>> +	slots_b = ma_slots(&b, type);
>> +
>> +	for (i = 0; i < mt_slots[type]; i++) {
>> +		if (!slots_a[i] && !slots_b[i])
>> +			break;
>> +
>> +		if (!slots_a[i] || !slots_b[i]) {
>> +			pr_err("The number of slots is different.\n");
>> +			return -1;
>> +		}
>> +
>> +		/* Do not compare addresses in slots. */
>> +		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
>> +		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
>> +	}
>> +
>> +cmp:
>> +	/*
>> +	 * Compare all contents of two nodes, including parent (except address),
>> +	 * slots (except address), pivots, gaps and metadata.
>> +	 */
>> +	return memcmp(&a, &b, sizeof(struct maple_node));
>> +}
>> +
>> +/*
>> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
>> + */
>> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
>> +{
>> +	MA_STATE(mas_a, mt_a, 0, 0);
>> +	MA_STATE(mas_b, mt_b, 0, 0);
>> +
>> +	if (mt_a->ma_flags != mt_b->ma_flags) {
>> +		pr_err("The flags of the two trees are different.\n");
>> +		return -1;
>> +	}
>> +
>> +	mas_dfs_preorder(&mas_a);
>> +	mas_dfs_preorder(&mas_b);
>> +
>> +	if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
>> +		if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
>> +			pr_err("One is MAS_ROOT and the other is not.\n");
>> +			return -1;
>> +		}
>> +		return 0;
>> +	}
>> +
>> +	while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
>> +
>> +		if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
>> +			pr_err("One is MAS_NONE and the other is not.\n");
>> +			return -1;
>> +		}
>> +
>> +		if (mas_a.min != mas_b.min ||
>> +		    mas_a.max != mas_b.max) {
>> +			pr_err("mas->min, mas->max do not match.\n");
>> +			return -1;
>> +		}
>> +
>> +		if (compare_node(mas_a.node, mas_b.node)) {
>> +			pr_err("The contents of nodes %p and %p are different.\n",
>> +			       mas_a.node, mas_b.node);
>> +			mt_dump(mt_a, mt_dump_dec);
>> +			mt_dump(mt_b, mt_dump_dec);
>> +			return -1;
>> +		}
>> +
>> +		mas_dfs_preorder(&mas_a);
>> +		mas_dfs_preorder(&mas_b);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static __init void mas_subtree_max_range(struct ma_state *mas)
>> +{
>> +	unsigned long limit = mas->max;
>> +	MA_STATE(newmas, mas->tree, 0, 0);
>> +	void *entry;
>> +
>> +	mas_for_each(mas, entry, limit) {
>> +		if (mas->last - mas->index >=
>> +		    newmas.last - newmas.index) {
>> +			newmas = *mas;
>> +		}
>> +	}
>> +
>> +	*mas = newmas;
>> +}
>> +
>> +/*
>> + * build_full_tree() - Build a full tree.
>> + * @mt: The tree to build.
>> + * @flags: Use @flags to build the tree.
>> + * @height: The height of the tree to build.
>> + *
>> + * Build a tree with full leaf nodes and internal nodes. Note that the height
>> + * should not exceed 3, otherwise it will take a long time to build.
>> + * Return: zero if the build is successful, non-zero if it fails.
>> + */
>> +static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
>> +		int height)
>> +{
>> +	MA_STATE(mas, mt, 0, 0);
>> +	unsigned long step;
>> +	int ret = 0, cnt = 1;
>> +	enum maple_type type;
>> +
>> +	mt_init_flags(mt, flags);
>> +	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
>> +
>> +	mtree_lock(mt);
>> +
>> +	while (1) {
>> +		mas_set(&mas, 0);
>> +		if (mt_height(mt) < height) {
>> +			mas.max = ULONG_MAX;
>> +			goto store;
>> +		}
>> +
>> +		while (1) {
>> +			mas_dfs_preorder(&mas);
>> +			if (mas_is_none(&mas))
>> +				goto unlock;
>> +
>> +			type = mte_node_type(mas.node);
>> +			if (mas_data_end(&mas) + 1 < mt_slots[type]) {
>> +				mas_set(&mas, mas.min);
>> +				goto store;
>> +			}
>> +		}
>> +store:
>> +		mas_subtree_max_range(&mas);
>> +		step = mas.last - mas.index;
>> +		if (step < 1) {
>> +			ret = -1;
>> +			goto unlock;
>> +		}
>> +
>> +		step /= 2;
>> +		mas.last = mas.index + step;
>> +		mas_store_gfp(&mas, xa_mk_value(5),
>> +				GFP_KERNEL);
>> +		++cnt;
>> +	}
>> +unlock:
>> +	mtree_unlock(mt);
>> +
>> +	MT_BUG_ON(mt, mt_height(mt) != height);
>> +	/* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
>> +	return ret;
>> +}
>> +
>> +static noinline void __init check_mtree_dup(struct maple_tree *mt)
>> +{
>> +	DEFINE_MTREE(new);
>> +	int i, j, ret, count = 0;
>> +	unsigned int rand_seed = 17, rand;
>> +
>> +	/* store a value at [0, 0] */
>> +	mt_init_flags(&tree, 0);
>> +	mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +	MT_BUG_ON(&new, ret);
>> +	mt_validate(&new);
>> +	if (compare_tree(&tree, &new))
>> +		MT_BUG_ON(&new, 1);
>> +
>> +	mtree_destroy(&tree);
>> +	mtree_destroy(&new);
>> +
>> +	/* The two trees have different attributes. */
>> +	mt_init_flags(&tree, 0);
>> +	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +	MT_BUG_ON(&new, ret != -EINVAL);
>> +	mtree_destroy(&tree);
>> +	mtree_destroy(&new);
>> +
>> +	/* The new tree is not empty */
>> +	mt_init_flags(&tree, 0);
>> +	mt_init_flags(&new, 0);
>> +	mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +	MT_BUG_ON(&new, ret != -EINVAL);
>> +	mtree_destroy(&tree);
>> +	mtree_destroy(&new);
>> +
>> +	/* Test for duplicating full trees. */
>> +	for (i = 1; i <= 3; i++) {
>> +		ret = build_full_tree(&tree, 0, i);
>> +		MT_BUG_ON(&tree, ret);
>> +		mt_init_flags(&new, 0);
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +		MT_BUG_ON(&new, ret);
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	for (i = 1; i <= 3; i++) {
>> +		ret = build_full_tree(&tree, MT_FLAGS_ALLOC_RANGE, i);
>> +		MT_BUG_ON(&tree, ret);
>> +		mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +		MT_BUG_ON(&new, ret);
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	/* Test for normal duplicating. */
>> +	for (i = 0; i < 1000; i += 3) {
>> +		if (i & 1) {
>> +			mt_init_flags(&tree, 0);
>> +			mt_init_flags(&new, 0);
>> +		} else {
>> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +		}
>> +
>> +		for (j = 0; j < i; j++) {
>> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
>> +					  xa_mk_value(j), GFP_KERNEL);
>> +		}
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +		MT_BUG_ON(&new, ret);
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	/* Test memory allocation failed. */
> 
> It might be worth while having specific allocations fail.  At a leaf
> node, intermediate nodes, first node come to mind.
In fact, the random number has covered the first node. I'll write some
test cases later.
> 
>> +	for (i = 0; i < 1000; i += 3) {
>> +		if (i & 1) {
>> +			mt_init_flags(&tree, 0);
>> +			mt_init_flags(&new, 0);
>> +		} else {
>> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +		}
>> +
>> +		for (j = 0; j < i; j++) {
>> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
>> +					  xa_mk_value(j), GFP_KERNEL);
>> +		}
>> +		/*
>> +		 * The rand() library function is not used, so we can generate
>> +		 * the same random numbers on any platform.
>> +		 */
>> +		rand_seed = rand_seed * 1103515245 + 12345;
>> +		rand = rand_seed / 65536 % 128;
>> +		mt_set_non_kernel(rand);
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_NOWAIT);
>> +		mt_set_non_kernel(0);
>> +		if (ret != 0) {
>> +			MT_BUG_ON(&new, ret != -ENOMEM);
>> +			count++;
>> +			mtree_destroy(&tree);
>> +			continue;
>> +		}
>> +
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	/* pr_info("mtree_dup() fail %d times\n", count); */
>> +	BUG_ON(!count);
>> +}
>> +
>>   extern void test_kmem_cache_bulk(void);
>>   
>>   void farmer_tests(void)
>> @@ -35904,6 +36244,10 @@ void farmer_tests(void)
>>   	check_null_expand(&tree);
>>   	mtree_destroy(&tree);
>>   
>> +	mt_init_flags(&tree, 0);
>> +	check_mtree_dup(&tree);
>> +	mtree_destroy(&tree);
>> +
>>   	/* RCU testing */
>>   	mt_init_flags(&tree, 0);
>>   	check_erase_testset(&tree);
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking()
  2023-09-07 18:16         ` Matthew Wilcox
@ 2023-09-08  9:47           ` Peng Zhang
  0 siblings, 0 replies; 35+ messages in thread
From: Peng Zhang @ 2023-09-08  9:47 UTC (permalink / raw)
  To: Matthew Wilcox, Liam R. Howlett, Peng Zhang, kernel test robot,
	oe-lkp, lkp, maple-tree, linux-mm, corbet, akpm, brauner, surenb,
	michael.christie, peterz, mathieu.desnoyers, npiggin, avagin,
	linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 02:16, Matthew Wilcox 写道:
> On Thu, Sep 07, 2023 at 02:03:01PM -0400, Liam R. Howlett wrote:
>>>>   WARNING: possible recursive locking detected
>>>>   6.5.0-rc4-00632-g2730245bd6b1 #1 Tainted: G                TN
>>>>   --------------------------------------------
>>>>   swapper/1 is trying to acquire lock:
>>>> ffffffff86485058 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:? lib/test_maple_tree.c:1854)
>>>>
>>>>   but task is already holding lock:
>>>>   ffff888110847a30 (&mt->ma_lock){+.+.}-{2:2}, at: check_forking (include/linux/spinlock.h:351 lib/test_maple_tree.c:1854)
>>> Thanks for the test. I checked that these are two different locks, why
>>> is this warning reported? Did I miss something?
>>
>> I don't think you can nest spinlocks like this.  In my previous test I
>> avoided nesting, but in your case we cannot avoid having both locks at
>> the same time.
>>
>> You can get around this by using an rwsemaphore, set the two trees as
>> external and use down_write_nested(&lock2, SINGLE_DEPTH_NESTING) like
>> the real fork.  Basically, switch the locking to exactly what fork does.
Here I can use rwsemaphore to avoid the warning. But what about in
mtree_dup()? mtree_dup() handles locks internally.

Maybe spin_lock_nested() mentioned by Matthew can be used in
mtree_dup().
> 
> spin_lock_nested() exists.
Thanks for mentioning this, I'll have a look.
> 
> You should probably both read through
> Documentation/locking/lockdep-design.rst It's not the best user
> documentation in the world, but it's what we have.
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-07 20:14   ` Liam R. Howlett
@ 2023-09-08  9:58     ` Peng Zhang
  2023-09-08 16:07       ` Liam R. Howlett
  2023-09-15 10:51     ` Peng Zhang
  1 sibling, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-08  9:58 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:14, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. The optimization effect is proportional to the
>> number of VMAs.
>>
>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>> performance of fork(). I modified it slightly to make it work with
>> different number of VMAs.
>>
>> Below are the test numbers. There are 21 VMAs by default. The first row
>> indicates the number of added VMAs. The following two lines are the
>> number of fork() calls every 10 seconds. These numbers are different
>> from the test results in v1 because this time the benchmark is bound to
>> a CPU. This way the numbers are more stable.
>>
>>    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
>> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
>> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>>                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
> 
> Thanks!
> 
> Can you include 21 in this table since it's the default?
Maybe I didn't express clearly, "Increment of VMAs" means the number of
VMAs added on the basis of 21 VMAs.
> 
>>
>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>   mm/mmap.c     | 14 ++++++++++++--
>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 3b6d20dfb9a8..e6299adefbd8 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	int retval;
>>   	unsigned long charge = 0;
>>   	LIST_HEAD(uf);
>> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>>   	VMA_ITERATOR(vmi, mm, 0);
>>   
>>   	uprobe_start_dup_mmap();
>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		goto out;
>>   	khugepaged_fork(mm, oldmm);
>>   
>> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> -	if (retval)
>> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
>> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> 
> Apparently the flags should be GFP_KERNEL here so that compaction can
> run.
OK, I'll change it to GFP_KERNEL.
> 
>> +	if (unlikely(retval))
>>   		goto out;
>>   
>>   	mt_clear_in_rcu(vmi.mas.tree);
>> -	for_each_vma(old_vmi, mpnt) {
>> +	for_each_vma(vmi, mpnt) {
>>   		struct file *file;
>>   
>>   		vma_start_write(mpnt);
>>   		if (mpnt->vm_flags & VM_DONTCOPY) {
>>   			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> +			/*
>> +			 * Since the new tree is exactly the same as the old one,
>> +			 * we need to remove the unneeded VMAs.
>> +			 */
>> +			mas_store(&vmi.mas, NULL);
>> +
>> +			/*
>> +			 * Even removing an entry may require memory allocation,
>> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
>> +			 * from which VMA it failed. The case of encountering
>> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
>> +			 */
>> +			if (unlikely(mas_is_err(&vmi.mas))) {
>> +				retval = xa_err(vmi.mas.node);
>> +				mas_reset(&vmi.mas);
>> +				if (mas_find(&vmi.mas, ULONG_MAX))
>> +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
>> +				goto loop_out;
>> +			}
>> +
> 
> Storing NULL may need extra space as you noted, so we need to be careful
> what happens if we don't have that space.  We should have a testcase to
> test this scenario.
> 
> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> in this function, see vm_area_dup().
> 
> Don't use the exit_mmap() path to undo a failed fork.  You've added
> checks and complications to the exit path for all tasks in the very
> unlikely event that we run out of memory when we hit a very unlikely
> VM_DONTCOPY flag.
> 
> I see the issue with having a portion of the tree with new VMAs that are
> accounted and a portion of the tree that has old VMAs that should not be
> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> we cannot add that complication to the exit path and then there is the
> OOM race to worry about (maybe, I am not sure since this MM isn't
> active yet).
> 
> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> prototype, we can do something like the *untested* code below:
> 
> if (unlikely(mas_is_err(&vmi.mas))) {
> 	unsigned long max = vmi.index;
> 
> 	retval = xa_err(vmi.mas.node);
> 	mas_set(&vmi.mas, 0);
> 	tmp = mas_find(&vmi.mas, ULONG_MAX);
> 	if (tmp) { /* Not the first VMA failed */
> 		unsigned long nr_accounted = 0;
> 
> 		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> 				true);
> 		do {
> 			if (vma->vm_flags & VM_ACCOUNT)
> 				nr_accounted += vma_pages(vma);
> 			remove_vma(vma, true);
> 			cond_resched();
> 			vma = mas_find(&vmi.mas, max - 1);
> 		} while (vma != NULL);
> 
> 		vm_unacct_memory(nr_accounted);
> 	}
> 	__mt_destroy(&mm->mm_mt);
> 	goto loop_out;
> }
> 
> Once exit_mmap() is called, the check for OOM (no vma) will catch that
> nothing is left to do.
> 
> It might be worth making an inline function to do this to keep the fork
> code clean.  We should test this by detecting a specific task name and
> returning a failure at a given interval:
> 
> if (!strcmp(current->comm, "fork_test") {
> ...
> }

Thank you for your suggestion, I will do this in the next version.
> 
> 
>>   			continue;
>>   		}
>>   		charge = 0;
>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   			hugetlb_dup_vma_private(tmp);
>>   
>>   		/* Link the vma into the MT */
>> -		if (vma_iter_bulk_store(&vmi, tmp))
>> -			goto fail_nomem_vmi_store;
>> +		mas_store(&vmi.mas, tmp);
>>   
>>   		mm->map_count++;
>>   		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	uprobe_end_dup_mmap();
>>   	return retval;
>>   
>> -fail_nomem_vmi_store:
>> -	unlink_anon_vmas(tmp);
>>   fail_nomem_anon_vma_fork:
>>   	mpol_put(vma_policy(tmp));
>>   fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b56a7f0c9f85..dfc6881be81c 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>   	arch_exit_mmap(mm);
>>   
>>   	vma = mas_find(&mas, ULONG_MAX);
>> -	if (!vma) {
>> +	/*
>> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> +	 * xa_is_zero(vma) may be true.
>> +	 */
>> +	if (!vma || xa_is_zero(vma)) {
>>   		/* Can happen if dup_mmap() received an OOM */
>>   		mmap_read_unlock(mm);
>>   		return;
>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>   		remove_vma(vma, true);
>>   		count++;
>>   		cond_resched();
>> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> +		vma = mas_find(&mas, ULONG_MAX);
>> +		/*
>> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> +		 * donot need to be removed. Can happen if dup_mmap() fails to
>> +		 * remove a VMA marked VM_DONTCOPY.
>> +		 */
>> +	} while (vma != NULL && !xa_is_zero(vma));
>>   
>>   	BUG_ON(count != mm->map_count);
>>   
>> -- 
>> 2.20.1
>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  2023-09-08  9:26     ` Peng Zhang
@ 2023-09-08 16:05       ` Liam R. Howlett
  0 siblings, 0 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-08 16:05 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230908 05:26]:
> 
> 
> 在 2023/9/8 04:13, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> > > Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> > > duplicate a maple tree. Compared with traversing the source tree and
> > > reinserting entry by entry in the new tree, it has better performance.
> > > The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> > > handles locks internally.
> > 
> > __mt_dup() should be called mas_dup() to indicate the advanced interface
> > which requires users to handle their own locks.
> Ok, I'll change __mt_dup() to mas_dup().
> > 
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   include/linux/maple_tree.h |   3 +
> > >   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 268 insertions(+)
> > > 
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index e41c70ac7744..44fe8a57ecbd 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > >   		void *entry, gfp_t gfp);
> > >   void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +
> > >   void mtree_destroy(struct maple_tree *mt);
> > >   void __mt_destroy(struct maple_tree *mt);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index ef234cf02e3e..8f841682269c 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > >   }
> > >   EXPORT_SYMBOL(mtree_erase);
> > > +/*
> > > + * mas_dup_free() - Free a half-constructed tree.
> > 
> > Maybe "Free an incomplete duplication of a tree" ?
> > 
> > > + * @mas: Points to the last node of the half-constructed tree.
> > 
> > Your use of "Points to" seems to indicate someone knows you are talking
> > about a "maple state that has a node pointing to".  Can this be made
> > more clear?
> > @mas: The maple state of a incomplete tree.
> > 
> > Then add a note that @mas->node points to the last successfully
> > allocated node?
> > 
> > Or something along those lines.
> Ok, I'll revise the comment.
> > 
> > > + *
> > > + * This function frees all nodes starting from @mas->node in the reverse order
> > > + * of mas_dup_build(). There is no need to hold the source tree lock at this
> > > + * time.
> > > + */
> > > +static void mas_dup_free(struct ma_state *mas)
> > > +{
> > > +	struct maple_node *node;
> > > +	enum maple_type type;
> > > +	void __rcu **slots;
> > > +	unsigned char count, i;
> > > +
> > > +	/* Maybe the first node allocation failed. */
> > > +	if (!mas->node)
> > > +		return;
> > > +
> > > +	while (!mte_is_root(mas->node)) {
> > > +		mas_ascend(mas);
> > > +
> > > +		if (mas->offset) {
> > > +			mas->offset--;
> > > +			do {
> > > +				mas_descend(mas);
> > > +				mas->offset = mas_data_end(mas);
> > > +			} while (!mte_is_leaf(mas->node));
> > 
> > Can you blindly descend and check !mte_is_leaf()?  What happens when the
> > tree duplication fails at random internal nodes?  Maybe I missed how
> > this cannot happen?
> This cannot happen. Note the mas_ascend(mas) at the beginning of the
> outermost loop.
> 
> > 
> > > +
> > > +			mas_ascend(mas);
> > > +		}
> > > +
> > > +		node = mte_to_node(mas->node);
> > > +		type = mte_node_type(mas->node);
> > > +		slots = (void **)ma_slots(node, type);
> > > +		count = mas_data_end(mas) + 1;
> > > +		for (i = 0; i < count; i++)
> > > +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > +
> > > +		mt_free_bulk(count, slots);
> > > +	}
> > 
> > 
> > > +
> > > +	node = mte_to_node(mas->node);
> > > +	mt_free_one(node);
> > > +}
> > > +
> > > +/*
> > > + * mas_copy_node() - Copy a maple node and allocate child nodes.
> > 
> > if required. "..and allocate child nodes if required."
> > 
> > > + * @mas: Points to the source node.
> > > + * @new_mas: Points to the new node.
> > > + * @parent: The parent node of the new node.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> > > + * @new_mas->node and allocate new child nodes for @new_mas->node.
> > > + * If memory allocation fails, @mas is set to -ENOMEM.
> > > + */
> > > +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> > > +		struct maple_node *parent, gfp_t gfp)
> > > +{
> > > +	struct maple_node *node = mte_to_node(mas->node);
> > > +	struct maple_node *new_node = mte_to_node(new_mas->node);
> > > +	enum maple_type type;
> > > +	unsigned long val;
> > > +	unsigned char request, count, i;
> > > +	void __rcu **slots;
> > > +	void __rcu **new_slots;
> > > +
> > > +	/* Copy the node completely. */
> > > +	memcpy(new_node, node, sizeof(struct maple_node));
> > > +
> > > +	/* Update the parent node pointer. */
> > > +	if (unlikely(ma_is_root(node)))
> > > +		val = MA_ROOT_PARENT;
> > > +	else
> > > +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> > 
> > If you treat the root as special and outside the loop, then you can
> > avoid the check for root for every non-root node.  For root, you just
> > need to copy and do this special parent thing before the main loop in
> > mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> > so that would add up to a lot of instructions.
> I'll handle the root node outside.
> However, do you think it makes sense to have the parent of the root node
> point to the struct maple_tree? I don't see it used anywhere.

I'm not sure.  It needs to not point to itself (indicating it is dead),
and we need to tell it's the root node, but I'm not entirely sure it is
necessary to point to the maple_tree.. although it is useful in dumps
sometimes.

> 
> > 
> > > +
> > > +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> > > +
> > > +	if (mte_is_leaf(mas->node))
> > > +		return;
> > 
> > You are checking here and in mas_dup_build() for the leaf, splitting the
> > function into parent assignment and allocate would allow you to check
> > once. Copy could be moved to the main loop or with the parent setting,
> > depending on how you handle the root suggestion above.
> I'll try to reduce some checks.
> > 
> > > +
> > > +	/* Allocate memory for child nodes. */
> > > +	type = mte_node_type(mas->node);
> > > +	new_slots = ma_slots(new_node, type);
> > > +	request = mas_data_end(mas) + 1;
> > > +	count = mt_alloc_bulk(gfp, request, new_slots);
> > > +	if (unlikely(count < request)) {
> > > +		if (count)
> > > +			mt_free_bulk(count, new_slots);
> > 
> > The new_slots will still contain the addresses of the freed nodes.
> > Don't you need to clear it here to avoid a double free?  Is there a
> > test case for this in your testing?  Again, I may have missed how this
> > is not possible..
> It's impossible, because in mt_free_bulk(), the first thing to do with
> the incoming node is to go up. We free all child nodes at the parent
> node.
> 
> We guarantee that the node passed to mas_dup_free() is "clean".

You mean there are no allocations below?

> mas_dup_free() also follows this so will not free children of this node.
> 
> The child nodes of this node cannot be freed in mt_free_bulk() because
> the node is not completely constructed and data_end cannot be obtained.
> data_end cannot be set on this node because the number of successfully
> allocated child nodes can be 0.

It still seems unwise to keep pointers pointing to unallocated memory
here, can we just clear the slots?

It's a bit odd because our choice is to leave it with pointers to nodes
in another tree or potential unallocated memory that isn't anywhere.
Both are a bit unnerving passing into another function that cleans
things up.  Since it's the error path, we won't have a performance
penalty in wiping the slots and it doesn't really matter that the node
isn't valid.  It seems more likely we would catch the error in a more
identifiable place if we set the slots to NULL.

> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	/* Restore node type information in slots. */
> > > +	slots = ma_slots(node, type);
> > > +	for (i = 0; i < count; i++)
> > > +		((unsigned long *)new_slots)[i] |=
> > > +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> > > +			MAPLE_NODE_MASK);
> > 
> > Can you expand this to multiple lines to make it more clear what is
> > going on?
> I will try to do that.
> 
> > 
> > > +}
> > > +
> > > +/*
> > > + * mas_dup_build() - Build a new maple tree from a source tree
> > > + * @mas: The maple state of source tree.
> > > + * @new_mas: The maple state of new tree.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * This function builds a new tree in DFS preorder. If the memory allocation
> > > + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> > > + * last node. mas_dup_free() will free the half-constructed tree.
> > > + *
> > > + * Note that the attributes of the two trees must be exactly the same, and the
> > > + * new tree must be empty, otherwise -EINVAL will be returned.
> > > + */
> > > +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> > > +		gfp_t gfp)
> > > +{
> > > +	struct maple_node *node, *parent;
> > 
> > Could parent be struct maple_pnode?
> I'll rename it.
> 
> > 
> > > +	struct maple_enode *root;
> > > +	enum maple_type type;
> > > +
> > > +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> > > +	    unlikely(!mtree_empty(new_mas->tree))) {
> > > +		mas_set_err(mas, -EINVAL);
> > > +		return;
> > > +	}
> > > +
> > > +	mas_start(mas);
> > > +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
> > > +		/*
> > > +		 * The attributes of the two trees must be the same before this.
> > > +		 * The following assignment makes them the same height.
> > > +		 */
> > > +		new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> > > +		return;
> > > +	}
> > > +
> > > +	node = mt_alloc_one(gfp);
> > > +	if (!node) {
> > > +		new_mas->node = NULL;
> > 
> > We don't have checks around for node == NULL, MAS_NONE would be a safer
> > choice.  It is unlikely that someone would dup the tree and fail then
> > call something else, but I avoid setting node to NULL.
> I will set it to MAS_NONE in the next version.
> 
> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	type = mte_node_type(mas->node);
> > > +	root = mt_mk_node(node, type);
> > > +	new_mas->node = root;
> > > +	new_mas->min = 0;
> > > +	new_mas->max = ULONG_MAX;
> > > +	parent = ma_mnode_ptr(new_mas->tree);
> > > +
> > > +	while (1) {
> > > +		mas_copy_node(mas, new_mas, parent, gfp);
> > > +
> > > +		if (unlikely(mas_is_err(mas)))
> > > +			return;
> > > +
> > > +		/* Once we reach a leaf, we need to ascend, or end the loop. */
> > > +		if (mte_is_leaf(mas->node)) {
> > > +			if (mas->max == ULONG_MAX) {
> > > +				new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +				rcu_assign_pointer(new_mas->tree->ma_root,
> > > +						   mte_mk_root(root));
> > > +				break;
> > 
> > If you move this to the end of the function, you can replace the same
> > block above with a goto.  That will avoid breaking the line up.
> I can do this, but it doesn't seem to make a difference.

Thanks, Just for clarity of keeping it all on one line and sine there's
going to be a respin of the set anyways..

> > 
> > > +			}
> > > +
> > > +			do {
> > > +				/*
> > > +				 * Must not at the root node, because we've
> > > +				 * already end the loop when we reach the last
> > > +				 * leaf.
> > > +				 */
> > 
> > I'm not sure what the comment above is trying to say.  Do you mean "This
> > won't reach the root node because the loop will break when the last leaf
> > is hit"?  I don't think that is accurate.. it will hit the root node but
> > not the end of the root node, right?  Anyways, the comment isn't clear
> > so please have a look.
> Yes, it will hit the root node but not the end of the root node. I'll
> fix this comment. Thanks.
> 
> > 
> > > +				mas_ascend(mas);
> > > +				mas_ascend(new_mas);
> > > +			} while (mas->offset == mas_data_end(mas));
> > > +
> > > +			mas->offset++;
> > > +			new_mas->offset++;
> > > +		}
> > > +
> > > +		mas_descend(mas);
> > > +		parent = mte_to_node(new_mas->node);
> > > +		mas_descend(new_mas);
> > > +		mas->offset = 0;
> > > +		new_mas->offset = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * __mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Can you make this comment more about what your code does instead of the
> > "one by one" description?
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + * Note that the user needs to manually lock the source tree and the new tree.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(__mt_dup);
> > > +
> > > +/**
> > > + * mtree_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Again, it's more interesting to state it uses the DFS preorder copy.
> > 
> > It is also worth mentioning the superior allocation behaviour since that
> > is a desirable trait for many.  In fact, you should add the allocation
> > behaviour in your cover letter.
> Okay, I will describe more in the next version.
> 
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_lock(&new_mas);
> > > +	mas_lock(&mas);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +	mas_unlock(&mas);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	mas_unlock(&new_mas);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(mtree_dup);
> > > +
> > >   /**
> > >    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> > >    * @mt: The maple tree
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-08  9:58     ` Peng Zhang
@ 2023-09-08 16:07       ` Liam R. Howlett
  0 siblings, 0 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-08 16:07 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230908 05:59]:
> 
> 
> 在 2023/9/8 04:14, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
> > > Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> > > directly modify the entries of VMAs in the new maple tree, which can
> > > get better performance. The optimization effect is proportional to the
> > > number of VMAs.
> > > 
> > > There is a "spawn" in byte-unixbench[1], which can be used to test the
> > > performance of fork(). I modified it slightly to make it work with
> > > different number of VMAs.
> > > 
> > > Below are the test numbers. There are 21 VMAs by default. The first row
> > > indicates the number of added VMAs. The following two lines are the
> > > number of fork() calls every 10 seconds. These numbers are different
> > > from the test results in v1 because this time the benchmark is bound to
> > > a CPU. This way the numbers are more stable.
> > > 
> > >    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
> > > 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
> > > Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
> > >                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
> > 
> > Thanks!
> > 
> > Can you include 21 in this table since it's the default?
> Maybe I didn't express clearly, "Increment of VMAs" means the number of
> VMAs added on the basis of 21 VMAs.

Ah, I see.  Thanks.

> > 
> > > 
> > > [1] https://github.com/kdlucas/byte-unixbench/tree/master
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
> > >   mm/mmap.c     | 14 ++++++++++++--
> > >   2 files changed, 38 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 3b6d20dfb9a8..e6299adefbd8 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   	int retval;
> > >   	unsigned long charge = 0;
> > >   	LIST_HEAD(uf);
> > > -	VMA_ITERATOR(old_vmi, oldmm, 0);
> > >   	VMA_ITERATOR(vmi, mm, 0);
> > >   	uprobe_start_dup_mmap();
> > > @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   		goto out;
> > >   	khugepaged_fork(mm, oldmm);
> > > -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> > > -	if (retval)
> > > +	/* Use __mt_dup() to efficiently build an identical maple tree. */
> > > +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> > 
> > Apparently the flags should be GFP_KERNEL here so that compaction can
> > run.
> OK, I'll change it to GFP_KERNEL.
> > 
> > > +	if (unlikely(retval))
> > >   		goto out;
> > >   	mt_clear_in_rcu(vmi.mas.tree);
> > > -	for_each_vma(old_vmi, mpnt) {
> > > +	for_each_vma(vmi, mpnt) {
> > >   		struct file *file;
> > >   		vma_start_write(mpnt);
> > >   		if (mpnt->vm_flags & VM_DONTCOPY) {
> > >   			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > +
> > > +			/*
> > > +			 * Since the new tree is exactly the same as the old one,
> > > +			 * we need to remove the unneeded VMAs.
> > > +			 */
> > > +			mas_store(&vmi.mas, NULL);
> > > +
> > > +			/*
> > > +			 * Even removing an entry may require memory allocation,
> > > +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
> > > +			 * from which VMA it failed. The case of encountering
> > > +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
> > > +			 */
> > > +			if (unlikely(mas_is_err(&vmi.mas))) {
> > > +				retval = xa_err(vmi.mas.node);
> > > +				mas_reset(&vmi.mas);
> > > +				if (mas_find(&vmi.mas, ULONG_MAX))
> > > +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > +				goto loop_out;
> > > +			}
> > > +
> > 
> > Storing NULL may need extra space as you noted, so we need to be careful
> > what happens if we don't have that space.  We should have a testcase to
> > test this scenario.
> > 
> > mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> > in this function, see vm_area_dup().
> > 
> > Don't use the exit_mmap() path to undo a failed fork.  You've added
> > checks and complications to the exit path for all tasks in the very
> > unlikely event that we run out of memory when we hit a very unlikely
> > VM_DONTCOPY flag.
> > 
> > I see the issue with having a portion of the tree with new VMAs that are
> > accounted and a portion of the tree that has old VMAs that should not be
> > looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> > we cannot add that complication to the exit path and then there is the
> > OOM race to worry about (maybe, I am not sure since this MM isn't
> > active yet).
> > 
> > Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> > prototype, we can do something like the *untested* code below:
> > 
> > if (unlikely(mas_is_err(&vmi.mas))) {
> > 	unsigned long max = vmi.index;
> > 
> > 	retval = xa_err(vmi.mas.node);
> > 	mas_set(&vmi.mas, 0);
> > 	tmp = mas_find(&vmi.mas, ULONG_MAX);
> > 	if (tmp) { /* Not the first VMA failed */
> > 		unsigned long nr_accounted = 0;
> > 
> > 		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> > 				true);
> > 		do {
> > 			if (vma->vm_flags & VM_ACCOUNT)
> > 				nr_accounted += vma_pages(vma);
> > 			remove_vma(vma, true);
> > 			cond_resched();
> > 			vma = mas_find(&vmi.mas, max - 1);
> > 		} while (vma != NULL);
> > 
> > 		vm_unacct_memory(nr_accounted);
> > 	}
> > 	__mt_destroy(&mm->mm_mt);
> > 	goto loop_out;
> > }
> > 
> > Once exit_mmap() is called, the check for OOM (no vma) will catch that
> > nothing is left to do.
> > 
> > It might be worth making an inline function to do this to keep the fork
> > code clean.  We should test this by detecting a specific task name and
> > returning a failure at a given interval:
> > 
> > if (!strcmp(current->comm, "fork_test") {
> > ...
> > }
> 
> Thank you for your suggestion, I will do this in the next version.
> > 
> > 
> > >   			continue;
> > >   		}
> > >   		charge = 0;
> > > @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   			hugetlb_dup_vma_private(tmp);
> > >   		/* Link the vma into the MT */
> > > -		if (vma_iter_bulk_store(&vmi, tmp))
> > > -			goto fail_nomem_vmi_store;
> > > +		mas_store(&vmi.mas, tmp);
> > >   		mm->map_count++;
> > >   		if (!(tmp->vm_flags & VM_WIPEONFORK))
> > > @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >   	uprobe_end_dup_mmap();
> > >   	return retval;
> > > -fail_nomem_vmi_store:
> > > -	unlink_anon_vmas(tmp);
> > >   fail_nomem_anon_vma_fork:
> > >   	mpol_put(vma_policy(tmp));
> > >   fail_nomem_policy:
> > > diff --git a/mm/mmap.c b/mm/mmap.c
> > > index b56a7f0c9f85..dfc6881be81c 100644
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
> > >   	arch_exit_mmap(mm);
> > >   	vma = mas_find(&mas, ULONG_MAX);
> > > -	if (!vma) {
> > > +	/*
> > > +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> > > +	 * xa_is_zero(vma) may be true.
> > > +	 */
> > > +	if (!vma || xa_is_zero(vma)) {
> > >   		/* Can happen if dup_mmap() received an OOM */
> > >   		mmap_read_unlock(mm);
> > >   		return;
> > > @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
> > >   		remove_vma(vma, true);
> > >   		count++;
> > >   		cond_resched();
> > > -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> > > +		vma = mas_find(&mas, ULONG_MAX);
> > > +		/*
> > > +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
> > > +		 * donot need to be removed. Can happen if dup_mmap() fails to
> > > +		 * remove a VMA marked VM_DONTCOPY.
> > > +		 */
> > > +	} while (vma != NULL && !xa_is_zero(vma));
> > >   	BUG_ON(count != mm->map_count);
> > > -- 
> > > 2.20.1
> > > 
> > 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  2023-09-07 20:13   ` Liam R. Howlett
  2023-09-08  9:26     ` Peng Zhang
@ 2023-09-11 12:59     ` Peng Zhang
  2023-09-11 13:36       ` Liam R. Howlett
  1 sibling, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-11 12:59 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
>> duplicate a maple tree. Compared with traversing the source tree and
>> reinserting entry by entry in the new tree, it has better performance.
>> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
>> handles locks internally.
> 
> __mt_dup() should be called mas_dup() to indicate the advanced interface
> which requires users to handle their own locks.
Changing to the mas_dup() interface may look like this:
mas_dup(mas_old, mas_new)

This still encounters the problem we discussed before. You expect both
mas_old and mas_new to point to the first element after the function
returns, but for_each_vma(vmi, mpnt) in dup_mmap() does not support
this, and will skip the first element.

Unless we have an iterator similar to "do {} while()", we have to reset 
mas_new. There is still additional overhead in making both mas_old and
mas_new point to the first element, because mas will point to the last
node after dfs order traversal.

In fact, I think mtree_dup() and __mt_dup() are enough. They seem to
match mtree_destroy() and __mt_destroy() very well. Underlines indicate
that users need to handle the lock themselves.
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   include/linux/maple_tree.h |   3 +
>>   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 268 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index e41c70ac7744..44fe8a57ecbd 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>   		void *entry, gfp_t gfp);
>>   void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>   
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +
>>   void mtree_destroy(struct maple_tree *mt);
>>   void __mt_destroy(struct maple_tree *mt);
>>   
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index ef234cf02e3e..8f841682269c 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>   }
>>   EXPORT_SYMBOL(mtree_erase);
>>   
>> +/*
>> + * mas_dup_free() - Free a half-constructed tree.
> 
> Maybe "Free an incomplete duplication of a tree" ?
> 
>> + * @mas: Points to the last node of the half-constructed tree.
> 
> Your use of "Points to" seems to indicate someone knows you are talking
> about a "maple state that has a node pointing to".  Can this be made
> more clear?
> @mas: The maple state of a incomplete tree.
> 
> Then add a note that @mas->node points to the last successfully
> allocated node?
> 
> Or something along those lines.
> 
>> + *
>> + * This function frees all nodes starting from @mas->node in the reverse order
>> + * of mas_dup_build(). There is no need to hold the source tree lock at this
>> + * time.
>> + */
>> +static void mas_dup_free(struct ma_state *mas)
>> +{
>> +	struct maple_node *node;
>> +	enum maple_type type;
>> +	void __rcu **slots;
>> +	unsigned char count, i;
>> +
>> +	/* Maybe the first node allocation failed. */
>> +	if (!mas->node)
>> +		return;
>> +
>> +	while (!mte_is_root(mas->node)) {
>> +		mas_ascend(mas);
>> +
>> +		if (mas->offset) {
>> +			mas->offset--;
>> +			do {
>> +				mas_descend(mas);
>> +				mas->offset = mas_data_end(mas);
>> +			} while (!mte_is_leaf(mas->node));
> 
> Can you blindly descend and check !mte_is_leaf()?  What happens when the
> tree duplication fails at random internal nodes?  Maybe I missed how
> this cannot happen?
> 
>> +
>> +			mas_ascend(mas);
>> +		}
>> +
>> +		node = mte_to_node(mas->node);
>> +		type = mte_node_type(mas->node);
>> +		slots = (void **)ma_slots(node, type);
>> +		count = mas_data_end(mas) + 1;
>> +		for (i = 0; i < count; i++)
>> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>> +
>> +		mt_free_bulk(count, slots);
>> +	}
> 
> 
>> +
>> +	node = mte_to_node(mas->node);
>> +	mt_free_one(node);
>> +}
>> +
>> +/*
>> + * mas_copy_node() - Copy a maple node and allocate child nodes.
> 
> if required. "..and allocate child nodes if required."
> 
>> + * @mas: Points to the source node.
>> + * @new_mas: Points to the new node.
>> + * @parent: The parent node of the new node.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
>> + * @new_mas->node and allocate new child nodes for @new_mas->node.
>> + * If memory allocation fails, @mas is set to -ENOMEM.
>> + */
>> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
>> +		struct maple_node *parent, gfp_t gfp)
>> +{
>> +	struct maple_node *node = mte_to_node(mas->node);
>> +	struct maple_node *new_node = mte_to_node(new_mas->node);
>> +	enum maple_type type;
>> +	unsigned long val;
>> +	unsigned char request, count, i;
>> +	void __rcu **slots;
>> +	void __rcu **new_slots;
>> +
>> +	/* Copy the node completely. */
>> +	memcpy(new_node, node, sizeof(struct maple_node));
>> +
>> +	/* Update the parent node pointer. */
>> +	if (unlikely(ma_is_root(node)))
>> +		val = MA_ROOT_PARENT;
>> +	else
>> +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> 
> If you treat the root as special and outside the loop, then you can
> avoid the check for root for every non-root node.  For root, you just
> need to copy and do this special parent thing before the main loop in
> mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> so that would add up to a lot of instructions.
> 
>> +
>> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
>> +
>> +	if (mte_is_leaf(mas->node))
>> +		return;
> 
> You are checking here and in mas_dup_build() for the leaf, splitting the
> function into parent assignment and allocate would allow you to check
> once. Copy could be moved to the main loop or with the parent setting,
> depending on how you handle the root suggestion above.
> 
>> +
>> +	/* Allocate memory for child nodes. */
>> +	type = mte_node_type(mas->node);
>> +	new_slots = ma_slots(new_node, type);
>> +	request = mas_data_end(mas) + 1;
>> +	count = mt_alloc_bulk(gfp, request, new_slots);
>> +	if (unlikely(count < request)) {
>> +		if (count)
>> +			mt_free_bulk(count, new_slots);
> 
> The new_slots will still contain the addresses of the freed nodes.
> Don't you need to clear it here to avoid a double free?  Is there a
> test case for this in your testing?  Again, I may have missed how this
> is not possible..
> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	/* Restore node type information in slots. */
>> +	slots = ma_slots(node, type);
>> +	for (i = 0; i < count; i++)
>> +		((unsigned long *)new_slots)[i] |=
>> +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
>> +			MAPLE_NODE_MASK);
> 
> Can you expand this to multiple lines to make it more clear what is
> going on?
> 
>> +}
>> +
>> +/*
>> + * mas_dup_build() - Build a new maple tree from a source tree
>> + * @mas: The maple state of source tree.
>> + * @new_mas: The maple state of new tree.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * This function builds a new tree in DFS preorder. If the memory allocation
>> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
>> + * last node. mas_dup_free() will free the half-constructed tree.
>> + *
>> + * Note that the attributes of the two trees must be exactly the same, and the
>> + * new tree must be empty, otherwise -EINVAL will be returned.
>> + */
>> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>> +		gfp_t gfp)
>> +{
>> +	struct maple_node *node, *parent;
> 
> Could parent be struct maple_pnode?
> 
>> +	struct maple_enode *root;
>> +	enum maple_type type;
>> +
>> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
>> +	    unlikely(!mtree_empty(new_mas->tree))) {
>> +		mas_set_err(mas, -EINVAL);
>> +		return;
>> +	}
>> +
>> +	mas_start(mas);
>> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
>> +		/*
>> +		 * The attributes of the two trees must be the same before this.
>> +		 * The following assignment makes them the same height.
>> +		 */
>> +		new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
>> +		return;
>> +	}
>> +
>> +	node = mt_alloc_one(gfp);
>> +	if (!node) {
>> +		new_mas->node = NULL;
> 
> We don't have checks around for node == NULL, MAS_NONE would be a safer
> choice.  It is unlikely that someone would dup the tree and fail then
> call something else, but I avoid setting node to NULL.
> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	type = mte_node_type(mas->node);
>> +	root = mt_mk_node(node, type);
>> +	new_mas->node = root;
>> +	new_mas->min = 0;
>> +	new_mas->max = ULONG_MAX;
>> +	parent = ma_mnode_ptr(new_mas->tree);
>> +
>> +	while (1) {
>> +		mas_copy_node(mas, new_mas, parent, gfp);
>> +
>> +		if (unlikely(mas_is_err(mas)))
>> +			return;
>> +
>> +		/* Once we reach a leaf, we need to ascend, or end the loop. */
>> +		if (mte_is_leaf(mas->node)) {
>> +			if (mas->max == ULONG_MAX) {
>> +				new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +				rcu_assign_pointer(new_mas->tree->ma_root,
>> +						   mte_mk_root(root));
>> +				break;
> 
> If you move this to the end of the function, you can replace the same
> block above with a goto.  That will avoid breaking the line up.
> 
>> +			}
>> +
>> +			do {
>> +				/*
>> +				 * Must not at the root node, because we've
>> +				 * already end the loop when we reach the last
>> +				 * leaf.
>> +				 */
> 
> I'm not sure what the comment above is trying to say.  Do you mean "This
> won't reach the root node because the loop will break when the last leaf
> is hit"?  I don't think that is accurate.. it will hit the root node but
> not the end of the root node, right?  Anyways, the comment isn't clear
> so please have a look.
> 
>> +				mas_ascend(mas);
>> +				mas_ascend(new_mas);
>> +			} while (mas->offset == mas_data_end(mas));
>> +
>> +			mas->offset++;
>> +			new_mas->offset++;
>> +		}
>> +
>> +		mas_descend(mas);
>> +		parent = mte_to_node(new_mas->node);
>> +		mas_descend(new_mas);
>> +		mas->offset = 0;
>> +		new_mas->offset = 0;
>> +	}
>> +}
>> +
>> +/**
>> + * __mt_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Can you make this comment more about what your code does instead of the
> "one by one" description?
> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + * Note that the user needs to manually lock the source tree and the new tree.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(__mt_dup);
>> +
>> +/**
>> + * mtree_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Again, it's more interesting to state it uses the DFS preorder copy.
> 
> It is also worth mentioning the superior allocation behaviour since that
> is a desirable trait for many.  In fact, you should add the allocation
> behaviour in your cover letter.
> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_lock(&new_mas);
>> +	mas_lock(&mas);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +	mas_unlock(&mas);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	mas_unlock(&new_mas);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(mtree_dup);
>> +
>>   /**
>>    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>    * @mt: The maple tree
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
  2023-09-11 12:59     ` Peng Zhang
@ 2023-09-11 13:36       ` Liam R. Howlett
  0 siblings, 0 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-11 13:36 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230911 08:59]:
> 
> 
> 在 2023/9/8 04:13, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> > > Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> > > duplicate a maple tree. Compared with traversing the source tree and
> > > reinserting entry by entry in the new tree, it has better performance.
> > > The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> > > handles locks internally.
> > 
> > __mt_dup() should be called mas_dup() to indicate the advanced interface
> > which requires users to handle their own locks.
> Changing to the mas_dup() interface may look like this:
> mas_dup(mas_old, mas_new)
> 
> This still encounters the problem we discussed before. You expect both
> mas_old and mas_new to point to the first element after the function
> returns, but for_each_vma(vmi, mpnt) in dup_mmap() does not support
> this, and will skip the first element.
> 
> Unless we have an iterator similar to "do {} while()", we have to reset
> mas_new. There is still additional overhead in making both mas_old and
> mas_new point to the first element, because mas will point to the last
> node after dfs order traversal.

I was only looking for the name change.  Although, I think we could have
written in a way to avoid skipping the first element.

> 
> In fact, I think mtree_dup() and __mt_dup() are enough. They seem to
> match mtree_destroy() and __mt_destroy() very well. Underlines indicate
> that users need to handle the lock themselves.

I think you are correct, __mt_dup() doesn't take a maple state.  Thanks
for pointing that out.  Please leave it the way you have it.

> > 
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   include/linux/maple_tree.h |   3 +
> > >   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 268 insertions(+)
> > > 
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index e41c70ac7744..44fe8a57ecbd 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > >   		void *entry, gfp_t gfp);
> > >   void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +
> > >   void mtree_destroy(struct maple_tree *mt);
> > >   void __mt_destroy(struct maple_tree *mt);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index ef234cf02e3e..8f841682269c 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > >   }
> > >   EXPORT_SYMBOL(mtree_erase);
> > > +/*
> > > + * mas_dup_free() - Free a half-constructed tree.
> > 
> > Maybe "Free an incomplete duplication of a tree" ?
> > 
> > > + * @mas: Points to the last node of the half-constructed tree.
> > 
> > Your use of "Points to" seems to indicate someone knows you are talking
> > about a "maple state that has a node pointing to".  Can this be made
> > more clear?
> > @mas: The maple state of a incomplete tree.
> > 
> > Then add a note that @mas->node points to the last successfully
> > allocated node?
> > 
> > Or something along those lines.
> > 
> > > + *
> > > + * This function frees all nodes starting from @mas->node in the reverse order
> > > + * of mas_dup_build(). There is no need to hold the source tree lock at this
> > > + * time.
> > > + */
> > > +static void mas_dup_free(struct ma_state *mas)
> > > +{
> > > +	struct maple_node *node;
> > > +	enum maple_type type;
> > > +	void __rcu **slots;
> > > +	unsigned char count, i;
> > > +
> > > +	/* Maybe the first node allocation failed. */
> > > +	if (!mas->node)
> > > +		return;
> > > +
> > > +	while (!mte_is_root(mas->node)) {
> > > +		mas_ascend(mas);
> > > +
> > > +		if (mas->offset) {
> > > +			mas->offset--;
> > > +			do {
> > > +				mas_descend(mas);
> > > +				mas->offset = mas_data_end(mas);
> > > +			} while (!mte_is_leaf(mas->node));
> > 
> > Can you blindly descend and check !mte_is_leaf()?  What happens when the
> > tree duplication fails at random internal nodes?  Maybe I missed how
> > this cannot happen?
> > 
> > > +
> > > +			mas_ascend(mas);
> > > +		}
> > > +
> > > +		node = mte_to_node(mas->node);
> > > +		type = mte_node_type(mas->node);
> > > +		slots = (void **)ma_slots(node, type);
> > > +		count = mas_data_end(mas) + 1;
> > > +		for (i = 0; i < count; i++)
> > > +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > +
> > > +		mt_free_bulk(count, slots);
> > > +	}
> > 
> > 
> > > +
> > > +	node = mte_to_node(mas->node);
> > > +	mt_free_one(node);
> > > +}
> > > +
> > > +/*
> > > + * mas_copy_node() - Copy a maple node and allocate child nodes.
> > 
> > if required. "..and allocate child nodes if required."
> > 
> > > + * @mas: Points to the source node.
> > > + * @new_mas: Points to the new node.
> > > + * @parent: The parent node of the new node.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> > > + * @new_mas->node and allocate new child nodes for @new_mas->node.
> > > + * If memory allocation fails, @mas is set to -ENOMEM.
> > > + */
> > > +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> > > +		struct maple_node *parent, gfp_t gfp)
> > > +{
> > > +	struct maple_node *node = mte_to_node(mas->node);
> > > +	struct maple_node *new_node = mte_to_node(new_mas->node);
> > > +	enum maple_type type;
> > > +	unsigned long val;
> > > +	unsigned char request, count, i;
> > > +	void __rcu **slots;
> > > +	void __rcu **new_slots;
> > > +
> > > +	/* Copy the node completely. */
> > > +	memcpy(new_node, node, sizeof(struct maple_node));
> > > +
> > > +	/* Update the parent node pointer. */
> > > +	if (unlikely(ma_is_root(node)))
> > > +		val = MA_ROOT_PARENT;
> > > +	else
> > > +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> > 
> > If you treat the root as special and outside the loop, then you can
> > avoid the check for root for every non-root node.  For root, you just
> > need to copy and do this special parent thing before the main loop in
> > mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> > so that would add up to a lot of instructions.
> > 
> > > +
> > > +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> > > +
> > > +	if (mte_is_leaf(mas->node))
> > > +		return;
> > 
> > You are checking here and in mas_dup_build() for the leaf, splitting the
> > function into parent assignment and allocate would allow you to check
> > once. Copy could be moved to the main loop or with the parent setting,
> > depending on how you handle the root suggestion above.
> > 
> > > +
> > > +	/* Allocate memory for child nodes. */
> > > +	type = mte_node_type(mas->node);
> > > +	new_slots = ma_slots(new_node, type);
> > > +	request = mas_data_end(mas) + 1;
> > > +	count = mt_alloc_bulk(gfp, request, new_slots);
> > > +	if (unlikely(count < request)) {
> > > +		if (count)
> > > +			mt_free_bulk(count, new_slots);
> > 
> > The new_slots will still contain the addresses of the freed nodes.
> > Don't you need to clear it here to avoid a double free?  Is there a
> > test case for this in your testing?  Again, I may have missed how this
> > is not possible..
> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	/* Restore node type information in slots. */
> > > +	slots = ma_slots(node, type);
> > > +	for (i = 0; i < count; i++)
> > > +		((unsigned long *)new_slots)[i] |=
> > > +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> > > +			MAPLE_NODE_MASK);
> > 
> > Can you expand this to multiple lines to make it more clear what is
> > going on?
> > 
> > > +}
> > > +
> > > +/*
> > > + * mas_dup_build() - Build a new maple tree from a source tree
> > > + * @mas: The maple state of source tree.
> > > + * @new_mas: The maple state of new tree.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * This function builds a new tree in DFS preorder. If the memory allocation
> > > + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> > > + * last node. mas_dup_free() will free the half-constructed tree.
> > > + *
> > > + * Note that the attributes of the two trees must be exactly the same, and the
> > > + * new tree must be empty, otherwise -EINVAL will be returned.
> > > + */
> > > +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> > > +		gfp_t gfp)
> > > +{
> > > +	struct maple_node *node, *parent;
> > 
> > Could parent be struct maple_pnode?
> > 
> > > +	struct maple_enode *root;
> > > +	enum maple_type type;
> > > +
> > > +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> > > +	    unlikely(!mtree_empty(new_mas->tree))) {
> > > +		mas_set_err(mas, -EINVAL);
> > > +		return;
> > > +	}
> > > +
> > > +	mas_start(mas);
> > > +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
> > > +		/*
> > > +		 * The attributes of the two trees must be the same before this.
> > > +		 * The following assignment makes them the same height.
> > > +		 */
> > > +		new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> > > +		return;
> > > +	}
> > > +
> > > +	node = mt_alloc_one(gfp);
> > > +	if (!node) {
> > > +		new_mas->node = NULL;
> > 
> > We don't have checks around for node == NULL, MAS_NONE would be a safer
> > choice.  It is unlikely that someone would dup the tree and fail then
> > call something else, but I avoid setting node to NULL.
> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	type = mte_node_type(mas->node);
> > > +	root = mt_mk_node(node, type);
> > > +	new_mas->node = root;
> > > +	new_mas->min = 0;
> > > +	new_mas->max = ULONG_MAX;
> > > +	parent = ma_mnode_ptr(new_mas->tree);
> > > +
> > > +	while (1) {
> > > +		mas_copy_node(mas, new_mas, parent, gfp);
> > > +
> > > +		if (unlikely(mas_is_err(mas)))
> > > +			return;
> > > +
> > > +		/* Once we reach a leaf, we need to ascend, or end the loop. */
> > > +		if (mte_is_leaf(mas->node)) {
> > > +			if (mas->max == ULONG_MAX) {
> > > +				new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +				rcu_assign_pointer(new_mas->tree->ma_root,
> > > +						   mte_mk_root(root));
> > > +				break;
> > 
> > If you move this to the end of the function, you can replace the same
> > block above with a goto.  That will avoid breaking the line up.
> > 
> > > +			}
> > > +
> > > +			do {
> > > +				/*
> > > +				 * Must not at the root node, because we've
> > > +				 * already end the loop when we reach the last
> > > +				 * leaf.
> > > +				 */
> > 
> > I'm not sure what the comment above is trying to say.  Do you mean "This
> > won't reach the root node because the loop will break when the last leaf
> > is hit"?  I don't think that is accurate.. it will hit the root node but
> > not the end of the root node, right?  Anyways, the comment isn't clear
> > so please have a look.
> > 
> > > +				mas_ascend(mas);
> > > +				mas_ascend(new_mas);
> > > +			} while (mas->offset == mas_data_end(mas));
> > > +
> > > +			mas->offset++;
> > > +			new_mas->offset++;
> > > +		}
> > > +
> > > +		mas_descend(mas);
> > > +		parent = mte_to_node(new_mas->node);
> > > +		mas_descend(new_mas);
> > > +		mas->offset = 0;
> > > +		new_mas->offset = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * __mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Can you make this comment more about what your code does instead of the
> > "one by one" description?
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + * Note that the user needs to manually lock the source tree and the new tree.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(__mt_dup);
> > > +
> > > +/**
> > > + * mtree_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Again, it's more interesting to state it uses the DFS preorder copy.
> > 
> > It is also worth mentioning the superior allocation behaviour since that
> > is a desirable trait for many.  In fact, you should add the allocation
> > behaviour in your cover letter.
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_lock(&new_mas);
> > > +	mas_lock(&mas);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +	mas_unlock(&mas);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	mas_unlock(&new_mas);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(mtree_dup);
> > > +
> > >   /**
> > >    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> > >    * @mt: The maple tree
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-07 20:14   ` Liam R. Howlett
  2023-09-08  9:58     ` Peng Zhang
@ 2023-09-15 10:51     ` Peng Zhang
  2023-09-15 10:56       ` Peng Zhang
  1 sibling, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-15 10:51 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:14, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. The optimization effect is proportional to the
>> number of VMAs.
>>
>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>> performance of fork(). I modified it slightly to make it work with
>> different number of VMAs.
>>
>> Below are the test numbers. There are 21 VMAs by default. The first row
>> indicates the number of added VMAs. The following two lines are the
>> number of fork() calls every 10 seconds. These numbers are different
>> from the test results in v1 because this time the benchmark is bound to
>> a CPU. This way the numbers are more stable.
>>
>>    Increment of VMAs: 0      100     200     400     800     1600    3200    6400
>> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   6110    3158
>> Apply this patchset: 114531 85420   64541   44592   28660   16371   9038    4831
>>                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% +47.92% +52.98%
> 
> Thanks!
> 
> Can you include 21 in this table since it's the default?
> 
>>
>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>   mm/mmap.c     | 14 ++++++++++++--
>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 3b6d20dfb9a8..e6299adefbd8 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	int retval;
>>   	unsigned long charge = 0;
>>   	LIST_HEAD(uf);
>> -	VMA_ITERATOR(old_vmi, oldmm, 0);
>>   	VMA_ITERATOR(vmi, mm, 0);
>>   
>>   	uprobe_start_dup_mmap();
>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   		goto out;
>>   	khugepaged_fork(mm, oldmm);
>>   
>> -	retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> -	if (retval)
>> +	/* Use __mt_dup() to efficiently build an identical maple tree. */
>> +	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> 
> Apparently the flags should be GFP_KERNEL here so that compaction can
> run.
> 
>> +	if (unlikely(retval))
>>   		goto out;
>>   
>>   	mt_clear_in_rcu(vmi.mas.tree);
>> -	for_each_vma(old_vmi, mpnt) {
>> +	for_each_vma(vmi, mpnt) {
>>   		struct file *file;
>>   
>>   		vma_start_write(mpnt);
>>   		if (mpnt->vm_flags & VM_DONTCOPY) {
>>   			vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> +			/*
>> +			 * Since the new tree is exactly the same as the old one,
>> +			 * we need to remove the unneeded VMAs.
>> +			 */
>> +			mas_store(&vmi.mas, NULL);
>> +
>> +			/*
>> +			 * Even removing an entry may require memory allocation,
>> +			 * and if removal fails, we use XA_ZERO_ENTRY to mark
>> +			 * from which VMA it failed. The case of encountering
>> +			 * XA_ZERO_ENTRY will be handled in exit_mmap().
>> +			 */
>> +			if (unlikely(mas_is_err(&vmi.mas))) {
>> +				retval = xa_err(vmi.mas.node);
>> +				mas_reset(&vmi.mas);
>> +				if (mas_find(&vmi.mas, ULONG_MAX))
>> +					mas_store(&vmi.mas, XA_ZERO_ENTRY);
>> +				goto loop_out;
>> +			}
>> +
> 
> Storing NULL may need extra space as you noted, so we need to be careful
> what happens if we don't have that space.  We should have a testcase to
> test this scenario.
> 
> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> in this function, see vm_area_dup().
> 
> Don't use the exit_mmap() path to undo a failed fork.  You've added
> checks and complications to the exit path for all tasks in the very
> unlikely event that we run out of memory when we hit a very unlikely
> VM_DONTCOPY flag.
> 
> I see the issue with having a portion of the tree with new VMAs that are
> accounted and a portion of the tree that has old VMAs that should not be
> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> we cannot add that complication to the exit path and then there is the
> OOM race to worry about (maybe, I am not sure since this MM isn't
> active yet).
I encountered some errors after implementing the scheme you mentioned
below. This would also clutter fork.c and mmap.c, as some internal
functions would need to be made global.

I thought of another way to put everything into maple tree. In non-RCU
mode, we can remove the last half of the tree without allocating any
memory. This requires modifications to the internal implementation of
mas_store().
Then remove the second half of the tree like this:

mas.index = 0;
mas.last = ULONGN_MAX;
mas_store(&mas, NULL).

At least in non-RCU mode, we can do this, since we only need to merge
some nodes, or move some items to adjacent nodes.
However, this will increase the workload significantly.

> 
> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> prototype, we can do something like the *untested* code below:
> 
> if (unlikely(mas_is_err(&vmi.mas))) {
> 	unsigned long max = vmi.index;
> 
> 	retval = xa_err(vmi.mas.node);
> 	mas_set(&vmi.mas, 0);
> 	tmp = mas_find(&vmi.mas, ULONG_MAX);
> 	if (tmp) { /* Not the first VMA failed */
> 		unsigned long nr_accounted = 0;
> 
> 		unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> 				true);
> 		do {
> 			if (vma->vm_flags & VM_ACCOUNT)
> 				nr_accounted += vma_pages(vma);
> 			remove_vma(vma, true);
> 			cond_resched();
> 			vma = mas_find(&vmi.mas, max - 1);
> 		} while (vma != NULL);
> 
> 		vm_unacct_memory(nr_accounted);
> 	}
> 	__mt_destroy(&mm->mm_mt);
> 	goto loop_out;
> }
> 
> Once exit_mmap() is called, the check for OOM (no vma) will catch that
> nothing is left to do.
> 
> It might be worth making an inline function to do this to keep the fork
> code clean.  We should test this by detecting a specific task name and
> returning a failure at a given interval:
> 
> if (!strcmp(current->comm, "fork_test") {
> ...
> }
> 
> 
>>   			continue;
>>   		}
>>   		charge = 0;
>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   			hugetlb_dup_vma_private(tmp);
>>   
>>   		/* Link the vma into the MT */
>> -		if (vma_iter_bulk_store(&vmi, tmp))
>> -			goto fail_nomem_vmi_store;
>> +		mas_store(&vmi.mas, tmp);
>>   
>>   		mm->map_count++;
>>   		if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>   	uprobe_end_dup_mmap();
>>   	return retval;
>>   
>> -fail_nomem_vmi_store:
>> -	unlink_anon_vmas(tmp);
>>   fail_nomem_anon_vma_fork:
>>   	mpol_put(vma_policy(tmp));
>>   fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b56a7f0c9f85..dfc6881be81c 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>   	arch_exit_mmap(mm);
>>   
>>   	vma = mas_find(&mas, ULONG_MAX);
>> -	if (!vma) {
>> +	/*
>> +	 * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> +	 * xa_is_zero(vma) may be true.
>> +	 */
>> +	if (!vma || xa_is_zero(vma)) {
>>   		/* Can happen if dup_mmap() received an OOM */
>>   		mmap_read_unlock(mm);
>>   		return;
>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>   		remove_vma(vma, true);
>>   		count++;
>>   		cond_resched();
>> -	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> +		vma = mas_find(&mas, ULONG_MAX);
>> +		/*
>> +		 * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> +		 * donot need to be removed. Can happen if dup_mmap() fails to
>> +		 * remove a VMA marked VM_DONTCOPY.
>> +		 */
>> +	} while (vma != NULL && !xa_is_zero(vma));
>>   
>>   	BUG_ON(count != mm->map_count);
>>   
>> -- 
>> 2.20.1
>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-15 10:51     ` Peng Zhang
@ 2023-09-15 10:56       ` Peng Zhang
  2023-09-15 20:00         ` Liam R. Howlett
  0 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-15 10:56 UTC (permalink / raw)
  To: Peng Zhang, Liam R. Howlett, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/15 18:51, Peng Zhang 写道:
> 
> 
> 在 2023/9/8 04:14, Liam R. Howlett 写道:
>> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:58]:
>>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>>> directly modify the entries of VMAs in the new maple tree, which can
>>> get better performance. The optimization effect is proportional to the
>>> number of VMAs.
>>>
>>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>>> performance of fork(). I modified it slightly to make it work with
>>> different number of VMAs.
>>>
>>> Below are the test numbers. There are 21 VMAs by default. The first row
>>> indicates the number of added VMAs. The following two lines are the
>>> number of fork() calls every 10 seconds. These numbers are different
>>> from the test results in v1 because this time the benchmark is bound to
>>> a CPU. This way the numbers are more stable.
>>>
>>>    Increment of VMAs: 0      100     200     400     800     1600    
>>> 3200    6400
>>> 6.5.0-next-20230829: 111878 75531   53683   35282   20741   11317   
>>> 6110    3158
>>> Apply this patchset: 114531 85420   64541   44592   28660   16371   
>>> 9038    4831
>>>                       +2.37% +13.09% +20.23% +26.39% +38.18% +44.66% 
>>> +47.92% +52.98%
>>
>> Thanks!
>>
>> Can you include 21 in this table since it's the default?
>>
>>>
>>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>>
>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>> ---
>>>   kernel/fork.c | 34 ++++++++++++++++++++++++++--------
>>>   mm/mmap.c     | 14 ++++++++++++--
>>>   2 files changed, 38 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 3b6d20dfb9a8..e6299adefbd8 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct 
>>> mm_struct *mm,
>>>       int retval;
>>>       unsigned long charge = 0;
>>>       LIST_HEAD(uf);
>>> -    VMA_ITERATOR(old_vmi, oldmm, 0);
>>>       VMA_ITERATOR(vmi, mm, 0);
>>>       uprobe_start_dup_mmap();
>>> @@ -678,17 +677,39 @@ static __latent_entropy int dup_mmap(struct 
>>> mm_struct *mm,
>>>           goto out;
>>>       khugepaged_fork(mm, oldmm);
>>> -    retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>>> -    if (retval)
>>> +    /* Use __mt_dup() to efficiently build an identical maple tree. */
>>> +    retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | 
>>> __GFP_NOWARN);
>>
>> Apparently the flags should be GFP_KERNEL here so that compaction can
>> run.
>>
>>> +    if (unlikely(retval))
>>>           goto out;
>>>       mt_clear_in_rcu(vmi.mas.tree);
>>> -    for_each_vma(old_vmi, mpnt) {
>>> +    for_each_vma(vmi, mpnt) {
>>>           struct file *file;
>>>           vma_start_write(mpnt);
>>>           if (mpnt->vm_flags & VM_DONTCOPY) {
>>>               vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>>> +
>>> +            /*
>>> +             * Since the new tree is exactly the same as the old one,
>>> +             * we need to remove the unneeded VMAs.
>>> +             */
>>> +            mas_store(&vmi.mas, NULL);
>>> +
>>> +            /*
>>> +             * Even removing an entry may require memory allocation,
>>> +             * and if removal fails, we use XA_ZERO_ENTRY to mark
>>> +             * from which VMA it failed. The case of encountering
>>> +             * XA_ZERO_ENTRY will be handled in exit_mmap().
>>> +             */
>>> +            if (unlikely(mas_is_err(&vmi.mas))) {
>>> +                retval = xa_err(vmi.mas.node);
>>> +                mas_reset(&vmi.mas);
>>> +                if (mas_find(&vmi.mas, ULONG_MAX))
>>> +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
>>> +                goto loop_out;
>>> +            }
>>> +
>>
>> Storing NULL may need extra space as you noted, so we need to be careful
>> what happens if we don't have that space.  We should have a testcase to
>> test this scenario.
>>
>> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
>> in this function, see vm_area_dup().
>>
>> Don't use the exit_mmap() path to undo a failed fork.  You've added
>> checks and complications to the exit path for all tasks in the very
>> unlikely event that we run out of memory when we hit a very unlikely
>> VM_DONTCOPY flag.
>>
>> I see the issue with having a portion of the tree with new VMAs that are
>> accounted and a portion of the tree that has old VMAs that should not be
>> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
>> we cannot add that complication to the exit path and then there is the
>> OOM race to worry about (maybe, I am not sure since this MM isn't
>> active yet).
> I encountered some errors after implementing the scheme you mentioned
> below. This would also clutter fork.c and mmap.c, as some internal
> functions would need to be made global.
> 
> I thought of another way to put everything into maple tree. In non-RCU
> mode, we can remove the last half of the tree without allocating any
> memory. This requires modifications to the internal implementation of
> mas_store().
> Then remove the second half of the tree like this:
> 
> mas.index = 0;
Sorry, typo.
Change to: mas.index = vma->start
> mas.last = ULONGN_MAX;
> mas_store(&mas, NULL).

> 
> At least in non-RCU mode, we can do this, since we only need to merge
> some nodes, or move some items to adjacent nodes.
> However, this will increase the workload significantly.
> 
>>
>> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
>> prototype, we can do something like the *untested* code below:
>>
>> if (unlikely(mas_is_err(&vmi.mas))) {
>>     unsigned long max = vmi.index;
>>
>>     retval = xa_err(vmi.mas.node);
>>     mas_set(&vmi.mas, 0);
>>     tmp = mas_find(&vmi.mas, ULONG_MAX);
>>     if (tmp) { /* Not the first VMA failed */
>>         unsigned long nr_accounted = 0;
>>
>>         unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
>>                 true);
>>         do {
>>             if (vma->vm_flags & VM_ACCOUNT)
>>                 nr_accounted += vma_pages(vma);
>>             remove_vma(vma, true);
>>             cond_resched();
>>             vma = mas_find(&vmi.mas, max - 1);
>>         } while (vma != NULL);
>>
>>         vm_unacct_memory(nr_accounted);
>>     }
>>     __mt_destroy(&mm->mm_mt);
>>     goto loop_out;
>> }
>>
>> Once exit_mmap() is called, the check for OOM (no vma) will catch that
>> nothing is left to do.
>>
>> It might be worth making an inline function to do this to keep the fork
>> code clean.  We should test this by detecting a specific task name and
>> returning a failure at a given interval:
>>
>> if (!strcmp(current->comm, "fork_test") {
>> ...
>> }
>>
>>
>>>               continue;
>>>           }
>>>           charge = 0;
>>> @@ -750,8 +771,7 @@ static __latent_entropy int dup_mmap(struct 
>>> mm_struct *mm,
>>>               hugetlb_dup_vma_private(tmp);
>>>           /* Link the vma into the MT */
>>> -        if (vma_iter_bulk_store(&vmi, tmp))
>>> -            goto fail_nomem_vmi_store;
>>> +        mas_store(&vmi.mas, tmp);
>>>           mm->map_count++;
>>>           if (!(tmp->vm_flags & VM_WIPEONFORK))
>>> @@ -778,8 +798,6 @@ static __latent_entropy int dup_mmap(struct 
>>> mm_struct *mm,
>>>       uprobe_end_dup_mmap();
>>>       return retval;
>>> -fail_nomem_vmi_store:
>>> -    unlink_anon_vmas(tmp);
>>>   fail_nomem_anon_vma_fork:
>>>       mpol_put(vma_policy(tmp));
>>>   fail_nomem_policy:
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index b56a7f0c9f85..dfc6881be81c 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -3196,7 +3196,11 @@ void exit_mmap(struct mm_struct *mm)
>>>       arch_exit_mmap(mm);
>>>       vma = mas_find(&mas, ULONG_MAX);
>>> -    if (!vma) {
>>> +    /*
>>> +     * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>>> +     * xa_is_zero(vma) may be true.
>>> +     */
>>> +    if (!vma || xa_is_zero(vma)) {
>>>           /* Can happen if dup_mmap() received an OOM */
>>>           mmap_read_unlock(mm);
>>>           return;
>>> @@ -3234,7 +3238,13 @@ void exit_mmap(struct mm_struct *mm)
>>>           remove_vma(vma, true);
>>>           count++;
>>>           cond_resched();
>>> -    } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>>> +        vma = mas_find(&mas, ULONG_MAX);
>>> +        /*
>>> +         * If xa_is_zero(vma) is true, it means that subsequent VMAs
>>> +         * donot need to be removed. Can happen if dup_mmap() fails to
>>> +         * remove a VMA marked VM_DONTCOPY.
>>> +         */
>>> +    } while (vma != NULL && !xa_is_zero(vma));
>>>       BUG_ON(count != mm->map_count);
>>> -- 
>>> 2.20.1
>>>
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-15 10:56       ` Peng Zhang
@ 2023-09-15 20:00         ` Liam R. Howlett
  2023-09-18 13:14           ` Peng Zhang
  0 siblings, 1 reply; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-15 20:00 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230915 06:57]:
> 
> 

...

> > > > +    if (unlikely(retval))
> > > >           goto out;
> > > >       mt_clear_in_rcu(vmi.mas.tree);
> > > > -    for_each_vma(old_vmi, mpnt) {
> > > > +    for_each_vma(vmi, mpnt) {
> > > >           struct file *file;
> > > >           vma_start_write(mpnt);
> > > >           if (mpnt->vm_flags & VM_DONTCOPY) {
> > > >               vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > > +
> > > > +            /*
> > > > +             * Since the new tree is exactly the same as the old one,
> > > > +             * we need to remove the unneeded VMAs.
> > > > +             */
> > > > +            mas_store(&vmi.mas, NULL);
> > > > +
> > > > +            /*
> > > > +             * Even removing an entry may require memory allocation,
> > > > +             * and if removal fails, we use XA_ZERO_ENTRY to mark
> > > > +             * from which VMA it failed. The case of encountering
> > > > +             * XA_ZERO_ENTRY will be handled in exit_mmap().
> > > > +             */
> > > > +            if (unlikely(mas_is_err(&vmi.mas))) {
> > > > +                retval = xa_err(vmi.mas.node);
> > > > +                mas_reset(&vmi.mas);
> > > > +                if (mas_find(&vmi.mas, ULONG_MAX))
> > > > +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > > +                goto loop_out;
> > > > +            }
> > > > +
> > > 
> > > Storing NULL may need extra space as you noted, so we need to be careful
> > > what happens if we don't have that space.  We should have a testcase to
> > > test this scenario.
> > > 
> > > mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> > > in this function, see vm_area_dup().
> > > 
> > > Don't use the exit_mmap() path to undo a failed fork.  You've added
> > > checks and complications to the exit path for all tasks in the very
> > > unlikely event that we run out of memory when we hit a very unlikely
> > > VM_DONTCOPY flag.
> > > 
> > > I see the issue with having a portion of the tree with new VMAs that are
> > > accounted and a portion of the tree that has old VMAs that should not be
> > > looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> > > we cannot add that complication to the exit path and then there is the
> > > OOM race to worry about (maybe, I am not sure since this MM isn't
> > > active yet).
> > I encountered some errors after implementing the scheme you mentioned
> > below.

What were the errors?  Maybe I missed something or there is another way.

> > This would also clutter fork.c and mmap.c, as some internal
> > functions would need to be made global.

Could it not be a new function in mm/mmap.c and added to mm/internal.h
that does the accounting and VMA freeing from [0 - vma->vm_start)?

Maybe we could use it in the other areas that do this sort of work?
do_vmi_align_munmap() does something similar to what we need after the
"Point of no return".

> > 
> > I thought of another way to put everything into maple tree. In non-RCU
> > mode, we can remove the last half of the tree without allocating any
> > memory. This requires modifications to the internal implementation of
> > mas_store().
> > Then remove the second half of the tree like this:
> > 
> > mas.index = 0;
> Sorry, typo.
> Change to: mas.index = vma->start
> > mas.last = ULONGN_MAX;
> > mas_store(&mas, NULL).

Well, we know we are not in RCU mode here, but I am concerned about this
going poorly.

> 
> > 
> > At least in non-RCU mode, we can do this, since we only need to merge
> > some nodes, or move some items to adjacent nodes.
> > However, this will increase the workload significantly.

In the unlikely event of an issue allocating memory, this would be
unwelcome.  If we can avoid it, it would be best.  I don't mind being
slow in error paths, but a significant workload would be rather bad on
an overloaded system.

> > 
> > > 
> > > Using what is done in exit_mmap() and do_vmi_align_munmap() as a
> > > prototype, we can do something like the *untested* code below:
> > > 
> > > if (unlikely(mas_is_err(&vmi.mas))) {
> > >     unsigned long max = vmi.index;
> > > 
> > >     retval = xa_err(vmi.mas.node);
> > >     mas_set(&vmi.mas, 0);
> > >     tmp = mas_find(&vmi.mas, ULONG_MAX);
> > >     if (tmp) { /* Not the first VMA failed */
> > >         unsigned long nr_accounted = 0;
> > > 
> > >         unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
> > >                 true);
> > >         do {
> > >             if (vma->vm_flags & VM_ACCOUNT)
> > >                 nr_accounted += vma_pages(vma);
> > >             remove_vma(vma, true);
> > >             cond_resched();
> > >             vma = mas_find(&vmi.mas, max - 1);
> > >         } while (vma != NULL);
> > > 
> > >         vm_unacct_memory(nr_accounted);
> > >     }
> > >     __mt_destroy(&mm->mm_mt);
> > >     goto loop_out;
> > > }
> > > 
> > > Once exit_mmap() is called, the check for OOM (no vma) will catch that
> > > nothing is left to do.
> > > 
> > > It might be worth making an inline function to do this to keep the fork
> > > code clean.  We should test this by detecting a specific task name and
> > > returning a failure at a given interval:
> > > 
> > > if (!strcmp(current->comm, "fork_test") {
> > > ...
> > > }
> > > 
...


Thanks,
Liam

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-15 20:00         ` Liam R. Howlett
@ 2023-09-18 13:14           ` Peng Zhang
  2023-09-18 17:59             ` Liam R. Howlett
  0 siblings, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-18 13:14 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/16 04:00, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230915 06:57]:
>>
>>
> 
> ...
> 
>>>>> +    if (unlikely(retval))
>>>>>            goto out;
>>>>>        mt_clear_in_rcu(vmi.mas.tree);
>>>>> -    for_each_vma(old_vmi, mpnt) {
>>>>> +    for_each_vma(vmi, mpnt) {
>>>>>            struct file *file;
>>>>>            vma_start_write(mpnt);
>>>>>            if (mpnt->vm_flags & VM_DONTCOPY) {
>>>>>                vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>>>>> +
>>>>> +            /*
>>>>> +             * Since the new tree is exactly the same as the old one,
>>>>> +             * we need to remove the unneeded VMAs.
>>>>> +             */
>>>>> +            mas_store(&vmi.mas, NULL);
>>>>> +
>>>>> +            /*
>>>>> +             * Even removing an entry may require memory allocation,
>>>>> +             * and if removal fails, we use XA_ZERO_ENTRY to mark
>>>>> +             * from which VMA it failed. The case of encountering
>>>>> +             * XA_ZERO_ENTRY will be handled in exit_mmap().
>>>>> +             */
>>>>> +            if (unlikely(mas_is_err(&vmi.mas))) {
>>>>> +                retval = xa_err(vmi.mas.node);
>>>>> +                mas_reset(&vmi.mas);
>>>>> +                if (mas_find(&vmi.mas, ULONG_MAX))
>>>>> +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
>>>>> +                goto loop_out;
>>>>> +            }
>>>>> +
>>>>
>>>> Storing NULL may need extra space as you noted, so we need to be careful
>>>> what happens if we don't have that space.  We should have a testcase to
>>>> test this scenario.
>>>>
>>>> mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
>>>> in this function, see vm_area_dup().
>>>>
>>>> Don't use the exit_mmap() path to undo a failed fork.  You've added
>>>> checks and complications to the exit path for all tasks in the very
>>>> unlikely event that we run out of memory when we hit a very unlikely
>>>> VM_DONTCOPY flag.
>>>>
>>>> I see the issue with having a portion of the tree with new VMAs that are
>>>> accounted and a portion of the tree that has old VMAs that should not be
>>>> looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
>>>> we cannot add that complication to the exit path and then there is the
>>>> OOM race to worry about (maybe, I am not sure since this MM isn't
>>>> active yet).
>>> I encountered some errors after implementing the scheme you mentioned
>>> below.
> 
> What were the errors?  Maybe I missed something or there is another way.
I found the cause of the problem and fixed it, tested the error path and
it seems to be working fine now.

The reason is that "free_pgd_range(tlb, addr, vma->vm_end,floor, next?
next->vm_start: ceiling);" in free_pgtables() does not free all page
tables due to the existence of the last false VMA. I've fixed it.
Thanks.

> 
>>> This would also clutter fork.c and mmap.c, as some internal
>>> functions would need to be made global.
> 
> Could it not be a new function in mm/mmap.c and added to mm/internal.h
> that does the accounting and VMA freeing from [0 - vma->vm_start)?
> 
> Maybe we could use it in the other areas that do this sort of work?
> do_vmi_align_munmap() does something similar to what we need after the
> "Point of no return".
> 
>>>
>>> I thought of another way to put everything into maple tree. In non-RCU
>>> mode, we can remove the last half of the tree without allocating any
>>> memory. This requires modifications to the internal implementation of
>>> mas_store().
>>> Then remove the second half of the tree like this:
>>>
>>> mas.index = 0;
>> Sorry, typo.
>> Change to: mas.index = vma->start
>>> mas.last = ULONGN_MAX;
>>> mas_store(&mas, NULL).
> 
> Well, we know we are not in RCU mode here, but I am concerned about this
> going poorly.
> 
>>
>>>
>>> At least in non-RCU mode, we can do this, since we only need to merge
>>> some nodes, or move some items to adjacent nodes.
>>> However, this will increase the workload significantly.
> 
> In the unlikely event of an issue allocating memory, this would be
> unwelcome.  If we can avoid it, it would be best.  I don't mind being
> slow in error paths, but a significant workload would be rather bad on
> an overloaded system.
> 
>>>
>>>>
>>>> Using what is done in exit_mmap() and do_vmi_align_munmap() as a
>>>> prototype, we can do something like the *untested* code below:
>>>>
>>>> if (unlikely(mas_is_err(&vmi.mas))) {
>>>>      unsigned long max = vmi.index;
>>>>
>>>>      retval = xa_err(vmi.mas.node);
>>>>      mas_set(&vmi.mas, 0);
>>>>      tmp = mas_find(&vmi.mas, ULONG_MAX);
>>>>      if (tmp) { /* Not the first VMA failed */
>>>>          unsigned long nr_accounted = 0;
>>>>
>>>>          unmap_region(mm, &vmi.mas, vma, NULL, mpnt, 0, max, max,
>>>>                  true);
>>>>          do {
>>>>              if (vma->vm_flags & VM_ACCOUNT)
>>>>                  nr_accounted += vma_pages(vma);
>>>>              remove_vma(vma, true);
>>>>              cond_resched();
>>>>              vma = mas_find(&vmi.mas, max - 1);
>>>>          } while (vma != NULL);
>>>>
>>>>          vm_unacct_memory(nr_accounted);
>>>>      }
>>>>      __mt_destroy(&mm->mm_mt);
>>>>      goto loop_out;
>>>> }
>>>>
>>>> Once exit_mmap() is called, the check for OOM (no vma) will catch that
>>>> nothing is left to do.
>>>>
>>>> It might be worth making an inline function to do this to keep the fork
>>>> code clean.  We should test this by detecting a specific task name and
>>>> returning a failure at a given interval:
>>>>
>>>> if (!strcmp(current->comm, "fork_test") {
>>>> ...
>>>> }
>>>>
> ...
> 
> 
> Thanks,
> Liam
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
  2023-09-18 13:14           ` Peng Zhang
@ 2023-09-18 17:59             ` Liam R. Howlett
  0 siblings, 0 replies; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-18 17:59 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230918 09:15]:
> 
> 
> 在 2023/9/16 04:00, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230915 06:57]:
> > > 
> > > 
> > 
> > ...
> > 
> > > > > > +    if (unlikely(retval))
> > > > > >            goto out;
> > > > > >        mt_clear_in_rcu(vmi.mas.tree);
> > > > > > -    for_each_vma(old_vmi, mpnt) {
> > > > > > +    for_each_vma(vmi, mpnt) {
> > > > > >            struct file *file;
> > > > > >            vma_start_write(mpnt);
> > > > > >            if (mpnt->vm_flags & VM_DONTCOPY) {
> > > > > >                vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> > > > > > +
> > > > > > +            /*
> > > > > > +             * Since the new tree is exactly the same as the old one,
> > > > > > +             * we need to remove the unneeded VMAs.
> > > > > > +             */
> > > > > > +            mas_store(&vmi.mas, NULL);
> > > > > > +
> > > > > > +            /*
> > > > > > +             * Even removing an entry may require memory allocation,
> > > > > > +             * and if removal fails, we use XA_ZERO_ENTRY to mark
> > > > > > +             * from which VMA it failed. The case of encountering
> > > > > > +             * XA_ZERO_ENTRY will be handled in exit_mmap().
> > > > > > +             */
> > > > > > +            if (unlikely(mas_is_err(&vmi.mas))) {
> > > > > > +                retval = xa_err(vmi.mas.node);
> > > > > > +                mas_reset(&vmi.mas);
> > > > > > +                if (mas_find(&vmi.mas, ULONG_MAX))
> > > > > > +                    mas_store(&vmi.mas, XA_ZERO_ENTRY);
> > > > > > +                goto loop_out;
> > > > > > +            }
> > > > > > +
> > > > > 
> > > > > Storing NULL may need extra space as you noted, so we need to be careful
> > > > > what happens if we don't have that space.  We should have a testcase to
> > > > > test this scenario.
> > > > > 
> > > > > mas_store_gfp() should be used with GFP_KERNEL.  The VMAs use GFP_KERNEL
> > > > > in this function, see vm_area_dup().
> > > > > 
> > > > > Don't use the exit_mmap() path to undo a failed fork.  You've added
> > > > > checks and complications to the exit path for all tasks in the very
> > > > > unlikely event that we run out of memory when we hit a very unlikely
> > > > > VM_DONTCOPY flag.
> > > > > 
> > > > > I see the issue with having a portion of the tree with new VMAs that are
> > > > > accounted and a portion of the tree that has old VMAs that should not be
> > > > > looked at.  It was clever to use the XA_ZERO_ENTRY as a stop point, but
> > > > > we cannot add that complication to the exit path and then there is the
> > > > > OOM race to worry about (maybe, I am not sure since this MM isn't
> > > > > active yet).
> > > > I encountered some errors after implementing the scheme you mentioned
> > > > below.
> > 
> > What were the errors?  Maybe I missed something or there is another way.
> I found the cause of the problem and fixed it, tested the error path and
> it seems to be working fine now.
> 
> The reason is that "free_pgd_range(tlb, addr, vma->vm_end,floor, next?
> next->vm_start: ceiling);" in free_pgtables() does not free all page
> tables due to the existence of the last false VMA. I've fixed it.
> Thanks.

Sounds good.

Please Cc the maple tree mailing (maple-tree@lists.infradead.org) list
on v3 - we are looking forward to seeing it.

Thanks,
Liam



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/6] maple_tree: Add test for mtree_dup()
  2023-09-07 20:13   ` Liam R. Howlett
  2023-09-08  9:38     ` Peng Zhang
@ 2023-09-25  4:06     ` Peng Zhang
  2023-09-25  7:44       ` Liam R. Howlett
  1 sibling, 1 reply; 35+ messages in thread
From: Peng Zhang @ 2023-09-25  4:06 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Add test for mtree_dup().
> 
> Please add a better description of what tests are included.
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   tools/testing/radix-tree/maple.c | 344 +++++++++++++++++++++++++++++++
>>   1 file changed, 344 insertions(+)
>>
>> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
>> index e5da1cad70ba..38455916331e 100644
>> --- a/tools/testing/radix-tree/maple.c
>> +++ b/tools/testing/radix-tree/maple.c
> 
> Why not lib/test_maple_tree.c?
> 
> If they are included there then they will be built into the test module.
> I try to include any tests that I can in the test module, within reason.
> 
> 
>> @@ -35857,6 +35857,346 @@ static noinline void __init check_locky(struct maple_tree *mt)
>>   	mt_clear_in_rcu(mt);
>>   }
>>   
>> +/*
>> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> 
> The slots can be different, right?  That seems worth mentioning here.
> It's also worth mentioning this is destructive.
I compared the type information in the slots, but the addresses cannot
be compared because they are different.
> 
>> + */
>> +static int __init compare_node(struct maple_enode *enode_a,
>> +			       struct maple_enode *enode_b)
>> +{
>> +	struct maple_node *node_a, *node_b;
>> +	struct maple_node a, b;
>> +	void **slots_a, **slots_b; /* Do not use the rcu tag. */
>> +	enum maple_type type;
>> +	int i;
>> +
>> +	if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
>> +	    ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
>> +		pr_err("The lower 8 bits of enode are different.\n");
>> +		return -1;
>> +	}
>> +
>> +	type = mte_node_type(enode_a);
>> +	node_a = mte_to_node(enode_a);
>> +	node_b = mte_to_node(enode_b);
>> +	a = *node_a;
>> +	b = *node_b;
>> +
>> +	/* Do not compare addresses. */
>> +	if (ma_is_root(node_a) || ma_is_root(node_b)) {
>> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> +						  MA_ROOT_PARENT);
>> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> +						  MA_ROOT_PARENT);
>> +	} else {
>> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> +						  MAPLE_NODE_MASK);
>> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> +						  MAPLE_NODE_MASK);
>> +	}
>> +
>> +	if (a.parent != b.parent) {
>> +		pr_err("The lower 8 bits of parents are different. %p %p\n",
>> +			a.parent, b.parent);
>> +		return -1;
>> +	}
>> +
>> +	/*
>> +	 * If it is a leaf node, the slots do not contain the node address, and
>> +	 * no special processing of slots is required.
>> +	 */
>> +	if (ma_is_leaf(type))
>> +		goto cmp;
>> +
>> +	slots_a = ma_slots(&a, type);
>> +	slots_b = ma_slots(&b, type);
>> +
>> +	for (i = 0; i < mt_slots[type]; i++) {
>> +		if (!slots_a[i] && !slots_b[i])
>> +			break;
>> +
>> +		if (!slots_a[i] || !slots_b[i]) {
>> +			pr_err("The number of slots is different.\n");
>> +			return -1;
>> +		}
>> +
>> +		/* Do not compare addresses in slots. */
>> +		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
>> +		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
>> +	}
>> +
>> +cmp:
>> +	/*
>> +	 * Compare all contents of two nodes, including parent (except address),
>> +	 * slots (except address), pivots, gaps and metadata.
>> +	 */
>> +	return memcmp(&a, &b, sizeof(struct maple_node));
>> +}
>> +
>> +/*
>> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
>> + */
>> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
>> +{
>> +	MA_STATE(mas_a, mt_a, 0, 0);
>> +	MA_STATE(mas_b, mt_b, 0, 0);
>> +
>> +	if (mt_a->ma_flags != mt_b->ma_flags) {
>> +		pr_err("The flags of the two trees are different.\n");
>> +		return -1;
>> +	}
>> +
>> +	mas_dfs_preorder(&mas_a);
>> +	mas_dfs_preorder(&mas_b);
>> +
>> +	if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
>> +		if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
>> +			pr_err("One is MAS_ROOT and the other is not.\n");
>> +			return -1;
>> +		}
>> +		return 0;
>> +	}
>> +
>> +	while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
>> +
>> +		if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
>> +			pr_err("One is MAS_NONE and the other is not.\n");
>> +			return -1;
>> +		}
>> +
>> +		if (mas_a.min != mas_b.min ||
>> +		    mas_a.max != mas_b.max) {
>> +			pr_err("mas->min, mas->max do not match.\n");
>> +			return -1;
>> +		}
>> +
>> +		if (compare_node(mas_a.node, mas_b.node)) {
>> +			pr_err("The contents of nodes %p and %p are different.\n",
>> +			       mas_a.node, mas_b.node);
>> +			mt_dump(mt_a, mt_dump_dec);
>> +			mt_dump(mt_b, mt_dump_dec);
>> +			return -1;
>> +		}
>> +
>> +		mas_dfs_preorder(&mas_a);
>> +		mas_dfs_preorder(&mas_b);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static __init void mas_subtree_max_range(struct ma_state *mas)
>> +{
>> +	unsigned long limit = mas->max;
>> +	MA_STATE(newmas, mas->tree, 0, 0);
>> +	void *entry;
>> +
>> +	mas_for_each(mas, entry, limit) {
>> +		if (mas->last - mas->index >=
>> +		    newmas.last - newmas.index) {
>> +			newmas = *mas;
>> +		}
>> +	}
>> +
>> +	*mas = newmas;
>> +}
>> +
>> +/*
>> + * build_full_tree() - Build a full tree.
>> + * @mt: The tree to build.
>> + * @flags: Use @flags to build the tree.
>> + * @height: The height of the tree to build.
>> + *
>> + * Build a tree with full leaf nodes and internal nodes. Note that the height
>> + * should not exceed 3, otherwise it will take a long time to build.
>> + * Return: zero if the build is successful, non-zero if it fails.
>> + */
>> +static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
>> +		int height)
>> +{
>> +	MA_STATE(mas, mt, 0, 0);
>> +	unsigned long step;
>> +	int ret = 0, cnt = 1;
>> +	enum maple_type type;
>> +
>> +	mt_init_flags(mt, flags);
>> +	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
>> +
>> +	mtree_lock(mt);
>> +
>> +	while (1) {
>> +		mas_set(&mas, 0);
>> +		if (mt_height(mt) < height) {
>> +			mas.max = ULONG_MAX;
>> +			goto store;
>> +		}
>> +
>> +		while (1) {
>> +			mas_dfs_preorder(&mas);
>> +			if (mas_is_none(&mas))
>> +				goto unlock;
>> +
>> +			type = mte_node_type(mas.node);
>> +			if (mas_data_end(&mas) + 1 < mt_slots[type]) {
>> +				mas_set(&mas, mas.min);
>> +				goto store;
>> +			}
>> +		}
>> +store:
>> +		mas_subtree_max_range(&mas);
>> +		step = mas.last - mas.index;
>> +		if (step < 1) {
>> +			ret = -1;
>> +			goto unlock;
>> +		}
>> +
>> +		step /= 2;
>> +		mas.last = mas.index + step;
>> +		mas_store_gfp(&mas, xa_mk_value(5),
>> +				GFP_KERNEL);
>> +		++cnt;
>> +	}
>> +unlock:
>> +	mtree_unlock(mt);
>> +
>> +	MT_BUG_ON(mt, mt_height(mt) != height);
>> +	/* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
>> +	return ret;
>> +}
>> +
>> +static noinline void __init check_mtree_dup(struct maple_tree *mt)
>> +{
>> +	DEFINE_MTREE(new);
>> +	int i, j, ret, count = 0;
>> +	unsigned int rand_seed = 17, rand;
>> +
>> +	/* store a value at [0, 0] */
>> +	mt_init_flags(&tree, 0);
>> +	mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +	MT_BUG_ON(&new, ret);
>> +	mt_validate(&new);
>> +	if (compare_tree(&tree, &new))
>> +		MT_BUG_ON(&new, 1);
>> +
>> +	mtree_destroy(&tree);
>> +	mtree_destroy(&new);
>> +
>> +	/* The two trees have different attributes. */
>> +	mt_init_flags(&tree, 0);
>> +	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +	MT_BUG_ON(&new, ret != -EINVAL);
>> +	mtree_destroy(&tree);
>> +	mtree_destroy(&new);
>> +
>> +	/* The new tree is not empty */
>> +	mt_init_flags(&tree, 0);
>> +	mt_init_flags(&new, 0);
>> +	mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +	MT_BUG_ON(&new, ret != -EINVAL);
>> +	mtree_destroy(&tree);
>> +	mtree_destroy(&new);
>> +
>> +	/* Test for duplicating full trees. */
>> +	for (i = 1; i <= 3; i++) {
>> +		ret = build_full_tree(&tree, 0, i);
>> +		MT_BUG_ON(&tree, ret);
>> +		mt_init_flags(&new, 0);
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +		MT_BUG_ON(&new, ret);
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	for (i = 1; i <= 3; i++) {
>> +		ret = build_full_tree(&tree, MT_FLAGS_ALLOC_RANGE, i);
>> +		MT_BUG_ON(&tree, ret);
>> +		mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +		MT_BUG_ON(&new, ret);
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	/* Test for normal duplicating. */
>> +	for (i = 0; i < 1000; i += 3) {
>> +		if (i & 1) {
>> +			mt_init_flags(&tree, 0);
>> +			mt_init_flags(&new, 0);
>> +		} else {
>> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +		}
>> +
>> +		for (j = 0; j < i; j++) {
>> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
>> +					  xa_mk_value(j), GFP_KERNEL);
>> +		}
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>> +		MT_BUG_ON(&new, ret);
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	/* Test memory allocation failed. */
> 
> It might be worth while having specific allocations fail.  At a leaf
> node, intermediate nodes, first node come to mind.
Memory allocation is only possible in non-leaf nodes. It is impossible
to fail in leaf nodes.
> 
>> +	for (i = 0; i < 1000; i += 3) {
>> +		if (i & 1) {
>> +			mt_init_flags(&tree, 0);
>> +			mt_init_flags(&new, 0);
>> +		} else {
>> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>> +		}
>> +
>> +		for (j = 0; j < i; j++) {
>> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
>> +					  xa_mk_value(j), GFP_KERNEL);
>> +		}
>> +		/*
>> +		 * The rand() library function is not used, so we can generate
>> +		 * the same random numbers on any platform.
>> +		 */
>> +		rand_seed = rand_seed * 1103515245 + 12345;
>> +		rand = rand_seed / 65536 % 128;
>> +		mt_set_non_kernel(rand);
>> +
>> +		ret = mtree_dup(&tree, &new, GFP_NOWAIT);
>> +		mt_set_non_kernel(0);
>> +		if (ret != 0) {
>> +			MT_BUG_ON(&new, ret != -ENOMEM);
>> +			count++;
>> +			mtree_destroy(&tree);
>> +			continue;
>> +		}
>> +
>> +		mt_validate(&new);
>> +		if (compare_tree(&tree, &new))
>> +			MT_BUG_ON(&new, 1);
>> +
>> +		mtree_destroy(&tree);
>> +		mtree_destroy(&new);
>> +	}
>> +
>> +	/* pr_info("mtree_dup() fail %d times\n", count); */
>> +	BUG_ON(!count);
>> +}
>> +
>>   extern void test_kmem_cache_bulk(void);
>>   
>>   void farmer_tests(void)
>> @@ -35904,6 +36244,10 @@ void farmer_tests(void)
>>   	check_null_expand(&tree);
>>   	mtree_destroy(&tree);
>>   
>> +	mt_init_flags(&tree, 0);
>> +	check_mtree_dup(&tree);
>> +	mtree_destroy(&tree);
>> +
>>   	/* RCU testing */
>>   	mt_init_flags(&tree, 0);
>>   	check_erase_testset(&tree);
>> -- 
>> 2.20.1
>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/6] maple_tree: Add test for mtree_dup()
  2023-09-25  4:06     ` Peng Zhang
@ 2023-09-25  7:44       ` Liam R. Howlett
  2023-09-25  8:30         ` Peng Zhang
  0 siblings, 1 reply; 35+ messages in thread
From: Liam R. Howlett @ 2023-09-25  7:44 UTC (permalink / raw)
  To: Peng Zhang
  Cc: corbet, akpm, willy, brauner, surenb, michael.christie, peterz,
	mathieu.desnoyers, npiggin, avagin, linux-mm, linux-doc,
	linux-kernel, linux-fsdevel

* Peng Zhang <zhangpeng.00@bytedance.com> [230925 00:06]:
> 
> 
> 在 2023/9/8 04:13, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> > > Add test for mtree_dup().
> > 
> > Please add a better description of what tests are included.
> > 
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   tools/testing/radix-tree/maple.c | 344 +++++++++++++++++++++++++++++++
> > >   1 file changed, 344 insertions(+)
> > > 
> > > diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> > > index e5da1cad70ba..38455916331e 100644
> > > --- a/tools/testing/radix-tree/maple.c
> > > +++ b/tools/testing/radix-tree/maple.c
> > 
> > Why not lib/test_maple_tree.c?
> > 
> > If they are included there then they will be built into the test module.
> > I try to include any tests that I can in the test module, within reason.
> > 
> > 
> > > @@ -35857,6 +35857,346 @@ static noinline void __init check_locky(struct maple_tree *mt)
> > >   	mt_clear_in_rcu(mt);
> > >   }
> > > +/*
> > > + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> > 
> > The slots can be different, right?  That seems worth mentioning here.
> > It's also worth mentioning this is destructive.
> I compared the type information in the slots, but the addresses cannot
> be compared because they are different.

Yes, but that is not what the comment says, it states that it will
return 0 if they are the same.  It doesn't check the memory addresses or
the parent.  I don't expect it to, but your comment is misleading.

> > 
> > > + */
> > > +static int __init compare_node(struct maple_enode *enode_a,
> > > +			       struct maple_enode *enode_b)
> > > +{
> > > +	struct maple_node *node_a, *node_b;
> > > +	struct maple_node a, b;
> > > +	void **slots_a, **slots_b; /* Do not use the rcu tag. */
> > > +	enum maple_type type;
> > > +	int i;
> > > +
> > > +	if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
> > > +	    ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
> > > +		pr_err("The lower 8 bits of enode are different.\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	type = mte_node_type(enode_a);
> > > +	node_a = mte_to_node(enode_a);
> > > +	node_b = mte_to_node(enode_b);
> > > +	a = *node_a;
> > > +	b = *node_b;
> > > +
> > > +	/* Do not compare addresses. */
> > > +	if (ma_is_root(node_a) || ma_is_root(node_b)) {
> > > +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> > > +						  MA_ROOT_PARENT);
> > > +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> > > +						  MA_ROOT_PARENT);
> > > +	} else {
> > > +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> > > +						  MAPLE_NODE_MASK);
> > > +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> > > +						  MAPLE_NODE_MASK);
> > > +	}
> > > +
> > > +	if (a.parent != b.parent) {
> > > +		pr_err("The lower 8 bits of parents are different. %p %p\n",
> > > +			a.parent, b.parent);
> > > +		return -1;
> > > +	}
> > > +
> > > +	/*
> > > +	 * If it is a leaf node, the slots do not contain the node address, and
> > > +	 * no special processing of slots is required.
> > > +	 */
> > > +	if (ma_is_leaf(type))
> > > +		goto cmp;
> > > +
> > > +	slots_a = ma_slots(&a, type);
> > > +	slots_b = ma_slots(&b, type);
> > > +
> > > +	for (i = 0; i < mt_slots[type]; i++) {
> > > +		if (!slots_a[i] && !slots_b[i])
> > > +			break;
> > > +
> > > +		if (!slots_a[i] || !slots_b[i]) {
> > > +			pr_err("The number of slots is different.\n");
> > > +			return -1;
> > > +		}
> > > +
> > > +		/* Do not compare addresses in slots. */
> > > +		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
> > > +		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
> > > +	}
> > > +
> > > +cmp:
> > > +	/*
> > > +	 * Compare all contents of two nodes, including parent (except address),
> > > +	 * slots (except address), pivots, gaps and metadata.
> > > +	 */
> > > +	return memcmp(&a, &b, sizeof(struct maple_node));
> > > +}
> > > +
> > > +/*
> > > + * Compare two trees and return 0 if they are the same, non-zero otherwise.
> > > + */
> > > +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
> > > +{
> > > +	MA_STATE(mas_a, mt_a, 0, 0);
> > > +	MA_STATE(mas_b, mt_b, 0, 0);
> > > +
> > > +	if (mt_a->ma_flags != mt_b->ma_flags) {
> > > +		pr_err("The flags of the two trees are different.\n");
> > > +		return -1;
> > > +	}
> > > +
> > > +	mas_dfs_preorder(&mas_a);
> > > +	mas_dfs_preorder(&mas_b);
> > > +
> > > +	if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
> > > +		if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
> > > +			pr_err("One is MAS_ROOT and the other is not.\n");
> > > +			return -1;
> > > +		}
> > > +		return 0;
> > > +	}
> > > +
> > > +	while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
> > > +
> > > +		if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
> > > +			pr_err("One is MAS_NONE and the other is not.\n");
> > > +			return -1;
> > > +		}
> > > +
> > > +		if (mas_a.min != mas_b.min ||
> > > +		    mas_a.max != mas_b.max) {
> > > +			pr_err("mas->min, mas->max do not match.\n");
> > > +			return -1;
> > > +		}
> > > +
> > > +		if (compare_node(mas_a.node, mas_b.node)) {
> > > +			pr_err("The contents of nodes %p and %p are different.\n",
> > > +			       mas_a.node, mas_b.node);
> > > +			mt_dump(mt_a, mt_dump_dec);
> > > +			mt_dump(mt_b, mt_dump_dec);
> > > +			return -1;
> > > +		}
> > > +
> > > +		mas_dfs_preorder(&mas_a);
> > > +		mas_dfs_preorder(&mas_b);
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static __init void mas_subtree_max_range(struct ma_state *mas)
> > > +{
> > > +	unsigned long limit = mas->max;
> > > +	MA_STATE(newmas, mas->tree, 0, 0);
> > > +	void *entry;
> > > +
> > > +	mas_for_each(mas, entry, limit) {
> > > +		if (mas->last - mas->index >=
> > > +		    newmas.last - newmas.index) {
> > > +			newmas = *mas;
> > > +		}
> > > +	}
> > > +
> > > +	*mas = newmas;
> > > +}
> > > +
> > > +/*
> > > + * build_full_tree() - Build a full tree.
> > > + * @mt: The tree to build.
> > > + * @flags: Use @flags to build the tree.
> > > + * @height: The height of the tree to build.
> > > + *
> > > + * Build a tree with full leaf nodes and internal nodes. Note that the height
> > > + * should not exceed 3, otherwise it will take a long time to build.
> > > + * Return: zero if the build is successful, non-zero if it fails.
> > > + */
> > > +static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
> > > +		int height)
> > > +{
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	unsigned long step;
> > > +	int ret = 0, cnt = 1;
> > > +	enum maple_type type;
> > > +
> > > +	mt_init_flags(mt, flags);
> > > +	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
> > > +
> > > +	mtree_lock(mt);
> > > +
> > > +	while (1) {
> > > +		mas_set(&mas, 0);
> > > +		if (mt_height(mt) < height) {
> > > +			mas.max = ULONG_MAX;
> > > +			goto store;
> > > +		}
> > > +
> > > +		while (1) {
> > > +			mas_dfs_preorder(&mas);
> > > +			if (mas_is_none(&mas))
> > > +				goto unlock;
> > > +
> > > +			type = mte_node_type(mas.node);
> > > +			if (mas_data_end(&mas) + 1 < mt_slots[type]) {
> > > +				mas_set(&mas, mas.min);
> > > +				goto store;
> > > +			}
> > > +		}
> > > +store:
> > > +		mas_subtree_max_range(&mas);
> > > +		step = mas.last - mas.index;
> > > +		if (step < 1) {
> > > +			ret = -1;
> > > +			goto unlock;
> > > +		}
> > > +
> > > +		step /= 2;
> > > +		mas.last = mas.index + step;
> > > +		mas_store_gfp(&mas, xa_mk_value(5),
> > > +				GFP_KERNEL);
> > > +		++cnt;
> > > +	}
> > > +unlock:
> > > +	mtree_unlock(mt);
> > > +
> > > +	MT_BUG_ON(mt, mt_height(mt) != height);
> > > +	/* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
> > > +	return ret;
> > > +}
> > > +
> > > +static noinline void __init check_mtree_dup(struct maple_tree *mt)
> > > +{
> > > +	DEFINE_MTREE(new);
> > > +	int i, j, ret, count = 0;
> > > +	unsigned int rand_seed = 17, rand;
> > > +
> > > +	/* store a value at [0, 0] */
> > > +	mt_init_flags(&tree, 0);
> > > +	mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
> > > +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
> > > +	MT_BUG_ON(&new, ret);
> > > +	mt_validate(&new);
> > > +	if (compare_tree(&tree, &new))
> > > +		MT_BUG_ON(&new, 1);
> > > +
> > > +	mtree_destroy(&tree);
> > > +	mtree_destroy(&new);
> > > +
> > > +	/* The two trees have different attributes. */
> > > +	mt_init_flags(&tree, 0);
> > > +	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> > > +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
> > > +	MT_BUG_ON(&new, ret != -EINVAL);
> > > +	mtree_destroy(&tree);
> > > +	mtree_destroy(&new);
> > > +
> > > +	/* The new tree is not empty */
> > > +	mt_init_flags(&tree, 0);
> > > +	mt_init_flags(&new, 0);
> > > +	mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
> > > +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
> > > +	MT_BUG_ON(&new, ret != -EINVAL);
> > > +	mtree_destroy(&tree);
> > > +	mtree_destroy(&new);
> > > +
> > > +	/* Test for duplicating full trees. */
> > > +	for (i = 1; i <= 3; i++) {
> > > +		ret = build_full_tree(&tree, 0, i);
> > > +		MT_BUG_ON(&tree, ret);
> > > +		mt_init_flags(&new, 0);
> > > +
> > > +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
> > > +		MT_BUG_ON(&new, ret);
> > > +		mt_validate(&new);
> > > +		if (compare_tree(&tree, &new))
> > > +			MT_BUG_ON(&new, 1);
> > > +
> > > +		mtree_destroy(&tree);
> > > +		mtree_destroy(&new);
> > > +	}
> > > +
> > > +	for (i = 1; i <= 3; i++) {
> > > +		ret = build_full_tree(&tree, MT_FLAGS_ALLOC_RANGE, i);
> > > +		MT_BUG_ON(&tree, ret);
> > > +		mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> > > +
> > > +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
> > > +		MT_BUG_ON(&new, ret);
> > > +		mt_validate(&new);
> > > +		if (compare_tree(&tree, &new))
> > > +			MT_BUG_ON(&new, 1);
> > > +
> > > +		mtree_destroy(&tree);
> > > +		mtree_destroy(&new);
> > > +	}
> > > +
> > > +	/* Test for normal duplicating. */
> > > +	for (i = 0; i < 1000; i += 3) {
> > > +		if (i & 1) {
> > > +			mt_init_flags(&tree, 0);
> > > +			mt_init_flags(&new, 0);
> > > +		} else {
> > > +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> > > +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> > > +		}
> > > +
> > > +		for (j = 0; j < i; j++) {
> > > +			mtree_store_range(&tree, j * 10, j * 10 + 5,
> > > +					  xa_mk_value(j), GFP_KERNEL);
> > > +		}
> > > +
> > > +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
> > > +		MT_BUG_ON(&new, ret);
> > > +		mt_validate(&new);
> > > +		if (compare_tree(&tree, &new))
> > > +			MT_BUG_ON(&new, 1);
> > > +
> > > +		mtree_destroy(&tree);
> > > +		mtree_destroy(&new);
> > > +	}
> > > +
> > > +	/* Test memory allocation failed. */
> > 
> > It might be worth while having specific allocations fail.  At a leaf
> > node, intermediate nodes, first node come to mind.
> Memory allocation is only possible in non-leaf nodes. It is impossible
> to fail in leaf nodes.

I understand that's your intent and probably what happens today - but
it'd be good to have testing for that, if not for this code then for
future potential changes.

> > 
> > > +	for (i = 0; i < 1000; i += 3) {
> > > +		if (i & 1) {
> > > +			mt_init_flags(&tree, 0);
> > > +			mt_init_flags(&new, 0);
> > > +		} else {
> > > +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> > > +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
> > > +		}
> > > +
> > > +		for (j = 0; j < i; j++) {
> > > +			mtree_store_range(&tree, j * 10, j * 10 + 5,
> > > +					  xa_mk_value(j), GFP_KERNEL);
> > > +		}
> > > +		/*
> > > +		 * The rand() library function is not used, so we can generate
> > > +		 * the same random numbers on any platform.
> > > +		 */
> > > +		rand_seed = rand_seed * 1103515245 + 12345;
> > > +		rand = rand_seed / 65536 % 128;
> > > +		mt_set_non_kernel(rand);
> > > +
> > > +		ret = mtree_dup(&tree, &new, GFP_NOWAIT);
> > > +		mt_set_non_kernel(0);
> > > +		if (ret != 0) {
> > > +			MT_BUG_ON(&new, ret != -ENOMEM);
> > > +			count++;
> > > +			mtree_destroy(&tree);
> > > +			continue;
> > > +		}
> > > +
> > > +		mt_validate(&new);
> > > +		if (compare_tree(&tree, &new))
> > > +			MT_BUG_ON(&new, 1);
> > > +
> > > +		mtree_destroy(&tree);
> > > +		mtree_destroy(&new);
> > > +	}
> > > +
> > > +	/* pr_info("mtree_dup() fail %d times\n", count); */
> > > +	BUG_ON(!count);
> > > +}
> > > +
> > >   extern void test_kmem_cache_bulk(void);
> > >   void farmer_tests(void)
> > > @@ -35904,6 +36244,10 @@ void farmer_tests(void)
> > >   	check_null_expand(&tree);
> > >   	mtree_destroy(&tree);
> > > +	mt_init_flags(&tree, 0);
> > > +	check_mtree_dup(&tree);
> > > +	mtree_destroy(&tree);
> > > +
> > >   	/* RCU testing */
> > >   	mt_init_flags(&tree, 0);
> > >   	check_erase_testset(&tree);
> > > -- 
> > > 2.20.1
> > > 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 3/6] maple_tree: Add test for mtree_dup()
  2023-09-25  7:44       ` Liam R. Howlett
@ 2023-09-25  8:30         ` Peng Zhang
  0 siblings, 0 replies; 35+ messages in thread
From: Peng Zhang @ 2023-09-25  8:30 UTC (permalink / raw)
  To: Liam R. Howlett, Peng Zhang, corbet, akpm, willy, brauner,
	surenb, michael.christie, peterz, mathieu.desnoyers, npiggin,
	avagin, linux-mm, linux-doc, linux-kernel, linux-fsdevel



在 2023/9/25 15:44, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230925 00:06]:
>>
>>
>> 在 2023/9/8 04:13, Liam R. Howlett 写道:
>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>>>> Add test for mtree_dup().
>>>
>>> Please add a better description of what tests are included.
>>>
>>>>
>>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>>> ---
>>>>    tools/testing/radix-tree/maple.c | 344 +++++++++++++++++++++++++++++++
>>>>    1 file changed, 344 insertions(+)
>>>>
>>>> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
>>>> index e5da1cad70ba..38455916331e 100644
>>>> --- a/tools/testing/radix-tree/maple.c
>>>> +++ b/tools/testing/radix-tree/maple.c
>>>
>>> Why not lib/test_maple_tree.c?
>>>
>>> If they are included there then they will be built into the test module.
>>> I try to include any tests that I can in the test module, within reason.
>>>
>>>
>>>> @@ -35857,6 +35857,346 @@ static noinline void __init check_locky(struct maple_tree *mt)
>>>>    	mt_clear_in_rcu(mt);
>>>>    }
>>>> +/*
>>>> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
>>>
>>> The slots can be different, right?  That seems worth mentioning here.
>>> It's also worth mentioning this is destructive.
>> I compared the type information in the slots, but the addresses cannot
>> be compared because they are different.
> 
> Yes, but that is not what the comment says, it states that it will
> return 0 if they are the same.  It doesn't check the memory addresses or
> the parent.  I don't expect it to, but your comment is misleading.
OK, I have made the modifications in v3. Thanks.
> 
>>>
>>>> + */
>>>> +static int __init compare_node(struct maple_enode *enode_a,
>>>> +			       struct maple_enode *enode_b)
>>>> +{
>>>> +	struct maple_node *node_a, *node_b;
>>>> +	struct maple_node a, b;
>>>> +	void **slots_a, **slots_b; /* Do not use the rcu tag. */
>>>> +	enum maple_type type;
>>>> +	int i;
>>>> +
>>>> +	if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
>>>> +	    ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
>>>> +		pr_err("The lower 8 bits of enode are different.\n");
>>>> +		return -1;
>>>> +	}
>>>> +
>>>> +	type = mte_node_type(enode_a);
>>>> +	node_a = mte_to_node(enode_a);
>>>> +	node_b = mte_to_node(enode_b);
>>>> +	a = *node_a;
>>>> +	b = *node_b;
>>>> +
>>>> +	/* Do not compare addresses. */
>>>> +	if (ma_is_root(node_a) || ma_is_root(node_b)) {
>>>> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>>>> +						  MA_ROOT_PARENT);
>>>> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>>>> +						  MA_ROOT_PARENT);
>>>> +	} else {
>>>> +		a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>>>> +						  MAPLE_NODE_MASK);
>>>> +		b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>>>> +						  MAPLE_NODE_MASK);
>>>> +	}
>>>> +
>>>> +	if (a.parent != b.parent) {
>>>> +		pr_err("The lower 8 bits of parents are different. %p %p\n",
>>>> +			a.parent, b.parent);
>>>> +		return -1;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * If it is a leaf node, the slots do not contain the node address, and
>>>> +	 * no special processing of slots is required.
>>>> +	 */
>>>> +	if (ma_is_leaf(type))
>>>> +		goto cmp;
>>>> +
>>>> +	slots_a = ma_slots(&a, type);
>>>> +	slots_b = ma_slots(&b, type);
>>>> +
>>>> +	for (i = 0; i < mt_slots[type]; i++) {
>>>> +		if (!slots_a[i] && !slots_b[i])
>>>> +			break;
>>>> +
>>>> +		if (!slots_a[i] || !slots_b[i]) {
>>>> +			pr_err("The number of slots is different.\n");
>>>> +			return -1;
>>>> +		}
>>>> +
>>>> +		/* Do not compare addresses in slots. */
>>>> +		((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
>>>> +		((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
>>>> +	}
>>>> +
>>>> +cmp:
>>>> +	/*
>>>> +	 * Compare all contents of two nodes, including parent (except address),
>>>> +	 * slots (except address), pivots, gaps and metadata.
>>>> +	 */
>>>> +	return memcmp(&a, &b, sizeof(struct maple_node));
>>>> +}
>>>> +
>>>> +/*
>>>> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
>>>> + */
>>>> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
>>>> +{
>>>> +	MA_STATE(mas_a, mt_a, 0, 0);
>>>> +	MA_STATE(mas_b, mt_b, 0, 0);
>>>> +
>>>> +	if (mt_a->ma_flags != mt_b->ma_flags) {
>>>> +		pr_err("The flags of the two trees are different.\n");
>>>> +		return -1;
>>>> +	}
>>>> +
>>>> +	mas_dfs_preorder(&mas_a);
>>>> +	mas_dfs_preorder(&mas_b);
>>>> +
>>>> +	if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
>>>> +		if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
>>>> +			pr_err("One is MAS_ROOT and the other is not.\n");
>>>> +			return -1;
>>>> +		}
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
>>>> +
>>>> +		if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
>>>> +			pr_err("One is MAS_NONE and the other is not.\n");
>>>> +			return -1;
>>>> +		}
>>>> +
>>>> +		if (mas_a.min != mas_b.min ||
>>>> +		    mas_a.max != mas_b.max) {
>>>> +			pr_err("mas->min, mas->max do not match.\n");
>>>> +			return -1;
>>>> +		}
>>>> +
>>>> +		if (compare_node(mas_a.node, mas_b.node)) {
>>>> +			pr_err("The contents of nodes %p and %p are different.\n",
>>>> +			       mas_a.node, mas_b.node);
>>>> +			mt_dump(mt_a, mt_dump_dec);
>>>> +			mt_dump(mt_b, mt_dump_dec);
>>>> +			return -1;
>>>> +		}
>>>> +
>>>> +		mas_dfs_preorder(&mas_a);
>>>> +		mas_dfs_preorder(&mas_b);
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static __init void mas_subtree_max_range(struct ma_state *mas)
>>>> +{
>>>> +	unsigned long limit = mas->max;
>>>> +	MA_STATE(newmas, mas->tree, 0, 0);
>>>> +	void *entry;
>>>> +
>>>> +	mas_for_each(mas, entry, limit) {
>>>> +		if (mas->last - mas->index >=
>>>> +		    newmas.last - newmas.index) {
>>>> +			newmas = *mas;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	*mas = newmas;
>>>> +}
>>>> +
>>>> +/*
>>>> + * build_full_tree() - Build a full tree.
>>>> + * @mt: The tree to build.
>>>> + * @flags: Use @flags to build the tree.
>>>> + * @height: The height of the tree to build.
>>>> + *
>>>> + * Build a tree with full leaf nodes and internal nodes. Note that the height
>>>> + * should not exceed 3, otherwise it will take a long time to build.
>>>> + * Return: zero if the build is successful, non-zero if it fails.
>>>> + */
>>>> +static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
>>>> +		int height)
>>>> +{
>>>> +	MA_STATE(mas, mt, 0, 0);
>>>> +	unsigned long step;
>>>> +	int ret = 0, cnt = 1;
>>>> +	enum maple_type type;
>>>> +
>>>> +	mt_init_flags(mt, flags);
>>>> +	mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
>>>> +
>>>> +	mtree_lock(mt);
>>>> +
>>>> +	while (1) {
>>>> +		mas_set(&mas, 0);
>>>> +		if (mt_height(mt) < height) {
>>>> +			mas.max = ULONG_MAX;
>>>> +			goto store;
>>>> +		}
>>>> +
>>>> +		while (1) {
>>>> +			mas_dfs_preorder(&mas);
>>>> +			if (mas_is_none(&mas))
>>>> +				goto unlock;
>>>> +
>>>> +			type = mte_node_type(mas.node);
>>>> +			if (mas_data_end(&mas) + 1 < mt_slots[type]) {
>>>> +				mas_set(&mas, mas.min);
>>>> +				goto store;
>>>> +			}
>>>> +		}
>>>> +store:
>>>> +		mas_subtree_max_range(&mas);
>>>> +		step = mas.last - mas.index;
>>>> +		if (step < 1) {
>>>> +			ret = -1;
>>>> +			goto unlock;
>>>> +		}
>>>> +
>>>> +		step /= 2;
>>>> +		mas.last = mas.index + step;
>>>> +		mas_store_gfp(&mas, xa_mk_value(5),
>>>> +				GFP_KERNEL);
>>>> +		++cnt;
>>>> +	}
>>>> +unlock:
>>>> +	mtree_unlock(mt);
>>>> +
>>>> +	MT_BUG_ON(mt, mt_height(mt) != height);
>>>> +	/* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static noinline void __init check_mtree_dup(struct maple_tree *mt)
>>>> +{
>>>> +	DEFINE_MTREE(new);
>>>> +	int i, j, ret, count = 0;
>>>> +	unsigned int rand_seed = 17, rand;
>>>> +
>>>> +	/* store a value at [0, 0] */
>>>> +	mt_init_flags(&tree, 0);
>>>> +	mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
>>>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>>>> +	MT_BUG_ON(&new, ret);
>>>> +	mt_validate(&new);
>>>> +	if (compare_tree(&tree, &new))
>>>> +		MT_BUG_ON(&new, 1);
>>>> +
>>>> +	mtree_destroy(&tree);
>>>> +	mtree_destroy(&new);
>>>> +
>>>> +	/* The two trees have different attributes. */
>>>> +	mt_init_flags(&tree, 0);
>>>> +	mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>>>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>>>> +	MT_BUG_ON(&new, ret != -EINVAL);
>>>> +	mtree_destroy(&tree);
>>>> +	mtree_destroy(&new);
>>>> +
>>>> +	/* The new tree is not empty */
>>>> +	mt_init_flags(&tree, 0);
>>>> +	mt_init_flags(&new, 0);
>>>> +	mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
>>>> +	ret = mtree_dup(&tree, &new, GFP_KERNEL);
>>>> +	MT_BUG_ON(&new, ret != -EINVAL);
>>>> +	mtree_destroy(&tree);
>>>> +	mtree_destroy(&new);
>>>> +
>>>> +	/* Test for duplicating full trees. */
>>>> +	for (i = 1; i <= 3; i++) {
>>>> +		ret = build_full_tree(&tree, 0, i);
>>>> +		MT_BUG_ON(&tree, ret);
>>>> +		mt_init_flags(&new, 0);
>>>> +
>>>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>>>> +		MT_BUG_ON(&new, ret);
>>>> +		mt_validate(&new);
>>>> +		if (compare_tree(&tree, &new))
>>>> +			MT_BUG_ON(&new, 1);
>>>> +
>>>> +		mtree_destroy(&tree);
>>>> +		mtree_destroy(&new);
>>>> +	}
>>>> +
>>>> +	for (i = 1; i <= 3; i++) {
>>>> +		ret = build_full_tree(&tree, MT_FLAGS_ALLOC_RANGE, i);
>>>> +		MT_BUG_ON(&tree, ret);
>>>> +		mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>>>> +
>>>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>>>> +		MT_BUG_ON(&new, ret);
>>>> +		mt_validate(&new);
>>>> +		if (compare_tree(&tree, &new))
>>>> +			MT_BUG_ON(&new, 1);
>>>> +
>>>> +		mtree_destroy(&tree);
>>>> +		mtree_destroy(&new);
>>>> +	}
>>>> +
>>>> +	/* Test for normal duplicating. */
>>>> +	for (i = 0; i < 1000; i += 3) {
>>>> +		if (i & 1) {
>>>> +			mt_init_flags(&tree, 0);
>>>> +			mt_init_flags(&new, 0);
>>>> +		} else {
>>>> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>>>> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>>>> +		}
>>>> +
>>>> +		for (j = 0; j < i; j++) {
>>>> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
>>>> +					  xa_mk_value(j), GFP_KERNEL);
>>>> +		}
>>>> +
>>>> +		ret = mtree_dup(&tree, &new, GFP_KERNEL);
>>>> +		MT_BUG_ON(&new, ret);
>>>> +		mt_validate(&new);
>>>> +		if (compare_tree(&tree, &new))
>>>> +			MT_BUG_ON(&new, 1);
>>>> +
>>>> +		mtree_destroy(&tree);
>>>> +		mtree_destroy(&new);
>>>> +	}
>>>> +
>>>> +	/* Test memory allocation failed. */
>>>
>>> It might be worth while having specific allocations fail.  At a leaf
>>> node, intermediate nodes, first node come to mind.
>> Memory allocation is only possible in non-leaf nodes. It is impossible
>> to fail in leaf nodes.
> 
> I understand that's your intent and probably what happens today - but
> it'd be good to have testing for that, if not for this code then for
> future potential changes.
But currently, it's not possible to have tests that fail at leaf nodes
because they don't fail at leaf nodes. What is done at leaf nodes is
simply copying the node and replacing the parent pointer.
> 
>>>
>>>> +	for (i = 0; i < 1000; i += 3) {
>>>> +		if (i & 1) {
>>>> +			mt_init_flags(&tree, 0);
>>>> +			mt_init_flags(&new, 0);
>>>> +		} else {
>>>> +			mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>>>> +			mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
>>>> +		}
>>>> +
>>>> +		for (j = 0; j < i; j++) {
>>>> +			mtree_store_range(&tree, j * 10, j * 10 + 5,
>>>> +					  xa_mk_value(j), GFP_KERNEL);
>>>> +		}
>>>> +		/*
>>>> +		 * The rand() library function is not used, so we can generate
>>>> +		 * the same random numbers on any platform.
>>>> +		 */
>>>> +		rand_seed = rand_seed * 1103515245 + 12345;
>>>> +		rand = rand_seed / 65536 % 128;
>>>> +		mt_set_non_kernel(rand);
>>>> +
>>>> +		ret = mtree_dup(&tree, &new, GFP_NOWAIT);
>>>> +		mt_set_non_kernel(0);
>>>> +		if (ret != 0) {
>>>> +			MT_BUG_ON(&new, ret != -ENOMEM);
>>>> +			count++;
>>>> +			mtree_destroy(&tree);
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		mt_validate(&new);
>>>> +		if (compare_tree(&tree, &new))
>>>> +			MT_BUG_ON(&new, 1);
>>>> +
>>>> +		mtree_destroy(&tree);
>>>> +		mtree_destroy(&new);
>>>> +	}
>>>> +
>>>> +	/* pr_info("mtree_dup() fail %d times\n", count); */
>>>> +	BUG_ON(!count);
>>>> +}
>>>> +
>>>>    extern void test_kmem_cache_bulk(void);
>>>>    void farmer_tests(void)
>>>> @@ -35904,6 +36244,10 @@ void farmer_tests(void)
>>>>    	check_null_expand(&tree);
>>>>    	mtree_destroy(&tree);
>>>> +	mt_init_flags(&tree, 0);
>>>> +	check_mtree_dup(&tree);
>>>> +	mtree_destroy(&tree);
>>>> +
>>>>    	/* RCU testing */
>>>>    	mt_init_flags(&tree, 0);
>>>>    	check_erase_testset(&tree);
>>>> -- 
>>>> 2.20.1
>>>>

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2023-09-25  8:30 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-30 12:56 [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
2023-08-30 12:56 ` [PATCH v2 1/6] maple_tree: Add two helpers Peng Zhang
2023-09-07 20:13   ` Liam R. Howlett
2023-09-08  2:45     ` Peng Zhang
2023-08-30 12:56 ` [PATCH v2 2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup() Peng Zhang
2023-09-07 20:13   ` Liam R. Howlett
2023-09-08  9:26     ` Peng Zhang
2023-09-08 16:05       ` Liam R. Howlett
2023-09-11 12:59     ` Peng Zhang
2023-09-11 13:36       ` Liam R. Howlett
2023-08-30 12:56 ` [PATCH v2 3/6] maple_tree: Add test for mtree_dup() Peng Zhang
2023-09-07 20:13   ` Liam R. Howlett
2023-09-08  9:38     ` Peng Zhang
2023-09-25  4:06     ` Peng Zhang
2023-09-25  7:44       ` Liam R. Howlett
2023-09-25  8:30         ` Peng Zhang
2023-08-30 12:56 ` [PATCH v2 4/6] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
2023-08-30 12:56 ` [PATCH v2 5/6] maple_tree: Update check_forking() and bench_forking() Peng Zhang
2023-08-31 13:40   ` kernel test robot
2023-09-01 10:58     ` Peng Zhang
2023-09-07 18:03       ` Liam R. Howlett
2023-09-07 18:16         ` Matthew Wilcox
2023-09-08  9:47           ` Peng Zhang
2023-09-07 20:14   ` Liam R. Howlett
2023-08-30 12:56 ` [PATCH v2 6/6] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
2023-09-07 20:14   ` Liam R. Howlett
2023-09-08  9:58     ` Peng Zhang
2023-09-08 16:07       ` Liam R. Howlett
2023-09-15 10:51     ` Peng Zhang
2023-09-15 10:56       ` Peng Zhang
2023-09-15 20:00         ` Liam R. Howlett
2023-09-18 13:14           ` Peng Zhang
2023-09-18 17:59             ` Liam R. Howlett
2023-08-30 13:05 ` [PATCH v2 0/6] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
2023-09-07 20:19 ` Liam R. Howlett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).