linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peng Zhang <zhangpeng.00@bytedance.com>
To: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
Cc: Peng Zhang <zhangpeng.00@bytedance.com>,
	corbet@lwn.net, akpm@linux-foundation.org, willy@infradead.org,
	brauner@kernel.org, surenb@google.com,
	michael.christie@oracle.com, mjguzik@gmail.com,
	mathieu.desnoyers@efficios.com, npiggin@gmail.com,
	peterz@infradead.org, oliver.sang@intel.com,
	maple-tree@lists.infradead.org, linux-mm@kvack.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 3/9] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
Date: Thu, 5 Oct 2023 23:55:22 +0800	[thread overview]
Message-ID: <867f3fb6-22c5-4dd1-479d-5b148163f2d0@bytedance.com> (raw)
In-Reply-To: <20231004142500.gz2552r74aiphl4z@revolver>



在 2023/10/4 22:25, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [231004 05:09]:
>>
>>
>> 在 2023/10/4 02:45, Liam R. Howlett 写道:
>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230924 23:58]:
>>>> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
>>>> duplicate a maple tree. They duplicate a maple tree in Depth-First
>>>> Search (DFS) pre-order traversal. It uses memcopy() to copy nodes in the
>>>> source tree and allocate new child nodes in non-leaf nodes. The new node
>>>> is exactly the same as the source node except for all the addresses
>>>> stored in it. It will be faster than traversing all elements in the
>>>> source tree and inserting them one by one into the new tree. The time
>>>> complexity of these two functions is O(n).
>>>>
>>>> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
>>>> handles locks internally.
>>>>
>>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>>> ---
>>>>    include/linux/maple_tree.h |   3 +
>>>>    lib/maple_tree.c           | 286 +++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 289 insertions(+)
>>>>
>>>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>>>> index 666a3764ed89..de5a4056503a 100644
>>>> --- a/include/linux/maple_tree.h
>>>> +++ b/include/linux/maple_tree.h
>>>> @@ -329,6 +329,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>>>    		void *entry, gfp_t gfp);
>>>>    void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>>>> +
>>>>    void mtree_destroy(struct maple_tree *mt);
>>>>    void __mt_destroy(struct maple_tree *mt);
>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>> index 3fe5652a8c6c..ed8847b4f1ff 100644
>>>> --- a/lib/maple_tree.c
>>>> +++ b/lib/maple_tree.c
>>>> @@ -6370,6 +6370,292 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>>>    }
>>>>    EXPORT_SYMBOL(mtree_erase);
>>>> +/*
>>>> + * mas_dup_free() - Free an incomplete duplication of a tree.
>>>> + * @mas: The maple state of a incomplete tree.
>>>> + *
>>>> + * The parameter @mas->node passed in indicates that the allocation failed on
>>>> + * this node. This function frees all nodes starting from @mas->node in the
>>>> + * reverse order of mas_dup_build(). There is no need to hold the source tree
>>>> + * lock at this time.
>>>> + */
>>>> +static void mas_dup_free(struct ma_state *mas)
>>>> +{
>>>> +	struct maple_node *node;
>>>> +	enum maple_type type;
>>>> +	void __rcu **slots;
>>>> +	unsigned char count, i;
>>>> +
>>>> +	/* Maybe the first node allocation failed. */
>>>> +	if (mas_is_none(mas))
>>>> +		return;
>>>> +
>>>> +	while (!mte_is_root(mas->node)) {
>>>> +		mas_ascend(mas);
>>>> +
>>>> +		if (mas->offset) {
>>>> +			mas->offset--;
>>>> +			do {
>>>> +				mas_descend(mas);
>>>> +				mas->offset = mas_data_end(mas);
>>>> +			} while (!mte_is_leaf(mas->node));
>>>> +
>>>> +			mas_ascend(mas);
>>>> +		}
>>>> +
>>>> +		node = mte_to_node(mas->node);
>>>> +		type = mte_node_type(mas->node);
>>>> +		slots = ma_slots(node, type);
>>>> +		count = mas_data_end(mas) + 1;
>>>> +		for (i = 0; i < count; i++)
>>>> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>>>> +
>>>> +		mt_free_bulk(count, slots);
>>>> +	}
>>>> +
>>>> +	node = mte_to_node(mas->node);
>>>> +	mt_free_one(node);
>>>> +}
>>>> +
>>>> +/*
>>>> + * mas_copy_node() - Copy a maple node and replace the parent.
>>>> + * @mas: The maple state of source tree.
>>>> + * @new_mas: The maple state of new tree.
>>>> + * @parent: The parent of the new node.
>>>> + *
>>>> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
>>>> + * @new_mas->node. If memory allocation fails, @mas is set to -ENOMEM.
>>>> + */
>>>> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
>>>> +		struct maple_pnode *parent)
>>>> +{
>>>> +	struct maple_node *node = mte_to_node(mas->node);
>>>> +	struct maple_node *new_node = mte_to_node(new_mas->node);
>>>> +	unsigned long val;
>>>> +
>>>> +	/* Copy the node completely. */
>>>> +	memcpy(new_node, node, sizeof(struct maple_node));
>>>> +
>>>> +	/* Update the parent node pointer. */
>>>> +	val = (unsigned long)node->parent & MAPLE_NODE_MASK;
>>>> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
>>>> +}
>>>> +
>>>> +/*
>>>> + * mas_dup_alloc() - Allocate child nodes for a maple node.
>>>> + * @mas: The maple state of source tree.
>>>> + * @new_mas: The maple state of new tree.
>>>> + * @gfp: The GFP_FLAGS to use for allocations.
>>>> + *
>>>> + * This function allocates child nodes for @new_mas->node during the duplication
>>>> + * process. If memory allocation fails, @mas is set to -ENOMEM.
>>>> + */
>>>> +static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
>>>> +		gfp_t gfp)
>>>> +{
>>>> +	struct maple_node *node = mte_to_node(mas->node);
>>>> +	struct maple_node *new_node = mte_to_node(new_mas->node);
>>>> +	enum maple_type type;
>>>> +	unsigned char request, count, i;
>>>> +	void __rcu **slots;
>>>> +	void __rcu **new_slots;
>>>> +	unsigned long val;
>>>> +
>>>> +	/* Allocate memory for child nodes. */
>>>> +	type = mte_node_type(mas->node);
>>>> +	new_slots = ma_slots(new_node, type);
>>>> +	request = mas_data_end(mas) + 1;
>>>> +	count = mt_alloc_bulk(gfp, request, (void **)new_slots);
>>>> +	if (unlikely(count < request)) {
>>>> +		if (count) {
>>>> +			mt_free_bulk(count, new_slots);
>>>
>>> If you look at mm/slab.c: kmem_cache_alloc(), you will see that the
>>> error path already bulk frees for you - but does not zero the array.
>>> This bulk free will lead to double free, but you do need the below
>>> memset().  Also, it will return !count or request. So, I think this code
>>> is never executed as it is written.
>> If kmem_cache_alloc() is called to allocate memory in mt_alloc_bulk(),
>> then this code will not be executed because it only returns 0 or
>> request. However, I am concerned that changes to mt_alloc_bulk() like
>> [1] may be merged, which could potentially lead to memory leaks. To
>> improve robustness, I wrote it this way.
>>
>> How do you think it should be handled? Is it okay to do this like the
>> code below?
>>
>> if (unlikely(count < request)) {
>> 	memset(new_slots, 0, request * sizeof(unsigned long));
>> 	mas_set_err(mas, -ENOMEM);
>> 	return;
>> }
>>
>> [1] https://lore.kernel.org/lkml/20230810163627.6206-13-vbabka@suse.cz/
> 
> Ah, I see.
> 
> We should keep the same functionality as before.  The code you are
> referencing is an RFC and won't be merged as-is.  We should be sure to
> keep an eye on this happening.
> 
> I think the code you have there is correct.
> 
>>>
>>> I don't think this will show up in your testcases because the test code
>>> doesn't leave dangling pointers and simply returns 0 if there isn't
>>> enough nodes.
>> Yes, no testing here.
> 
> Yeah :/  I think we should update the test code at some point to behave
> the same as the real code.  Don't worry about it here though.
If we want to test this here, we need to modify the
kmem_cache_alloc_bulk() in the user space to allocate a portion of
memory. This will cause it to behave differently from the corresponding
function in the kernel space. I'm not sure if this modification is
acceptable.

Also, I might need to move the memset() outside of the if
statement (if (unlikely(count < request)){}) to use it for cleaning up
residual pointers.

> 
>>>
>>>> +			memset(new_slots, 0, count * sizeof(unsigned long));
>>>> +		}
>>>> +		mas_set_err(mas, -ENOMEM);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	/* Restore node type information in slots. */
>>>> +	slots = ma_slots(node, type);
>>>> +	for (i = 0; i < count; i++) {
>>>> +		val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
>>>> +		val &= MAPLE_NODE_MASK;
>>>> +		((unsigned long *)new_slots)[i] |= val;
>>>> +	}
>>>> +}
>>>> +
>>>> +/*
>>>> + * mas_dup_build() - Build a new maple tree from a source tree
>>>> + * @mas: The maple state of source tree.
>>>> + * @new_mas: The maple state of new tree.
>>>> + * @gfp: The GFP_FLAGS to use for allocations.
>>>> + *
>>>> + * This function builds a new tree in DFS preorder. If the memory allocation
>>>> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
>>>> + * last node. mas_dup_free() will free the incomplete duplication of a tree.
>>>> + *
>>>> + * Note that the attributes of the two trees need to be exactly the same, and the
>>>> + * new tree needs to be empty, otherwise -EINVAL will be set in @mas.
>>>> + */
>>>> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>>>> +		gfp_t gfp)
>>>> +{
>>>> +	struct maple_node *node;
>>>> +	struct maple_pnode *parent = NULL;
>>>> +	struct maple_enode *root;
>>>> +	enum maple_type type;
>>>> +
>>>> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
>>>> +	    unlikely(!mtree_empty(new_mas->tree))) {
>>>
>>> Would it be worth checking mas_is_start() for both mas and new_mas here?
>>> Otherwise mas_start() will not do what you want below.  I think it is
>>> implied that both are at MAS_START but never checked?
>> This function is an internal function and is currently only called by
>> {mtree,__mt}_dup(). It is ensured that both 'mas' and 'new_mas' are
>> MAS_START when called. Do you think we really need to check it? Maybe we
>> just need to explain it in the comments?
> 
> Yes, just document that it is expected to be MAS_START.
> 
>>>
>>>> +		mas_set_err(mas, -EINVAL);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	mas_start(mas);
>>>> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
>>>> +		root = mt_root_locked(mas->tree);
>>>> +		goto set_new_tree;
>>>> +	}
>>>> +
>>>> +	node = mt_alloc_one(gfp);
>>>> +	if (!node) {
>>>> +		new_mas->node = MAS_NONE;
>>>> +		mas_set_err(mas, -ENOMEM);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	type = mte_node_type(mas->node);
>>>> +	root = mt_mk_node(node, type);
>>>> +	new_mas->node = root;
>>>> +	new_mas->min = 0;
>>>> +	new_mas->max = ULONG_MAX;
>>>> +	root = mte_mk_root(root);
>>>> +
>>>> +	while (1) {
>>>> +		mas_copy_node(mas, new_mas, parent);
>>>> +
>>>> +		if (!mte_is_leaf(mas->node)) {
>>>> +			/* Only allocate child nodes for non-leaf nodes. */
>>>> +			mas_dup_alloc(mas, new_mas, gfp);
>>>> +			if (unlikely(mas_is_err(mas)))
>>>> +				return;
>>>> +		} else {
>>>> +			/*
>>>> +			 * This is the last leaf node and duplication is
>>>> +			 * completed.
>>>> +			 */
>>>> +			if (mas->max == ULONG_MAX)
>>>> +				goto done;
>>>> +
>>>> +			/* This is not the last leaf node and needs to go up. */
>>>> +			do {
>>>> +				mas_ascend(mas);
>>>> +				mas_ascend(new_mas);
>>>> +			} while (mas->offset == mas_data_end(mas));
>>>> +
>>>> +			/* Move to the next subtree. */
>>>> +			mas->offset++;
>>>> +			new_mas->offset++;
>>>> +		}
>>>> +
>>>> +		mas_descend(mas);
>>>> +		parent = ma_parent_ptr(mte_to_node(new_mas->node));
>>>> +		mas_descend(new_mas);
>>>> +		mas->offset = 0;
>>>> +		new_mas->offset = 0;
>>>> +	}
>>>> +done:
>>>> +	/* Specially handle the parent of the root node. */
>>>> +	mte_to_node(root)->parent = ma_parent_ptr(mas_tree_parent(new_mas));
>>>> +set_new_tree:
>>>> +	/* Make them the same height */
>>>> +	new_mas->tree->ma_flags = mas->tree->ma_flags;
>>>> +	rcu_assign_pointer(new_mas->tree->ma_root, root);
>>>> +}
>>>> +
>>>> +/**
>>>> + * __mt_dup(): Duplicate a maple tree
>>>> + * @mt: The source maple tree
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + *
>>>> + * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
>>>> + * traversal. It uses memcopy() to copy nodes in the source tree and allocate
>>>> + * new child nodes in non-leaf nodes. The new node is exactly the same as the
>>>> + * source node except for all the addresses stored in it. It will be faster than
>>>> + * traversing all elements in the source tree and inserting them one by one into
>>>> + * the new tree.
>>>> + * The user needs to ensure that the attributes of the source tree and the new
>>>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>>>> + * -EINVAL will be returned.
>>>> + * Note that the user needs to manually lock the source tree and the new tree.
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>>>> + * the attributes of the two trees are different or the new tree is not an empty
>>>> + * tree.
>>>> + */
>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>> +{
>>>> +	int ret = 0;
>>>> +	MA_STATE(mas, mt, 0, 0);
>>>> +	MA_STATE(new_mas, new, 0, 0);
>>>> +
>>>> +	mas_dup_build(&mas, &new_mas, gfp);
>>>> +
>>>> +	if (unlikely(mas_is_err(&mas))) {
>>>> +		ret = xa_err(mas.node);
>>>> +		if (ret == -ENOMEM)
>>>> +			mas_dup_free(&new_mas);
>>>> +	}
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(__mt_dup);
>>>> +
>>>> +/**
>>>> + * mtree_dup(): Duplicate a maple tree
>>>> + * @mt: The source maple tree
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + *
>>>> + * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
>>>> + * traversal. It uses memcopy() to copy nodes in the source tree and allocate
>>>> + * new child nodes in non-leaf nodes. The new node is exactly the same as the
>>>> + * source node except for all the addresses stored in it. It will be faster than
>>>> + * traversing all elements in the source tree and inserting them one by one into
>>>> + * the new tree.
>>>> + * The user needs to ensure that the attributes of the source tree and the new
>>>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>>>> + * -EINVAL will be returned.
>>>
>>> The requirement to duplicate the entire tree should be mentioned and
>>> maybe the mas_is_start() requirement (as I asked about above?)
>> Okay, I will add a comment saying 'This duplicates the entire tree'. But
>> 'mas_is_start()' is not a requirement for calling this function because
>> the function's parameter is 'maple_tree', not 'ma_state'. I think
>> 'mas_is_start()' should be added to the comment for 'mas_dup_build()'.
> 
> Oh right, thanks.
> 
>>>
>>> I can see someone thinking they are going to make a super fast sub-tree
>>> of existing data using this - which won't (always?) work.
>>>
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>>>> + * the attributes of the two trees are different or the new tree is not an empty
>>>> + * tree.
>>>> + */
>>>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>> +{
>>>> +	int ret = 0;
>>>> +	MA_STATE(mas, mt, 0, 0);
>>>> +	MA_STATE(new_mas, new, 0, 0);
>>>> +
>>>> +	mas_lock(&new_mas);
>>>> +	mas_lock_nested(&mas, SINGLE_DEPTH_NESTING);
>>>> +
>>>> +	mas_dup_build(&mas, &new_mas, gfp);
>>>> +	mas_unlock(&mas);
>>>> +
>>>> +	if (unlikely(mas_is_err(&mas))) {
>>>> +		ret = xa_err(mas.node);
>>>> +		if (ret == -ENOMEM)
>>>> +			mas_dup_free(&new_mas);
>>>> +	}
>>>> +
>>>> +	mas_unlock(&new_mas);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(mtree_dup);
>>>> +
>>>>    /**
>>>>     * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>>>     * @mt: The maple tree
>>>> -- 
>>>> 2.20.1
>>>>
>>>
> 

  reply	other threads:[~2023-10-05 16:11 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-25  3:56 [PATCH v3 0/9] Introduce __mt_dup() to improve the performance of fork() Peng Zhang
2023-09-25  3:56 ` [PATCH v3 1/9] maple_tree: Add mt_free_one() and mt_attr() helpers Peng Zhang
2023-09-25  3:56 ` [PATCH v3 2/9] maple_tree: Introduce {mtree,mas}_lock_nested() Peng Zhang
2023-09-25  3:56 ` [PATCH v3 3/9] maple_tree: Introduce interfaces __mt_dup() and mtree_dup() Peng Zhang
2023-10-03 18:45   ` Liam R. Howlett
2023-10-04  9:09     ` Peng Zhang
2023-10-04 14:25       ` Liam R. Howlett
2023-10-05 15:55         ` Peng Zhang [this message]
2023-10-05 19:09           ` Liam R. Howlett
2023-09-25  3:56 ` [PATCH v3 4/9] maple_tree: Add test for mtree_dup() Peng Zhang
2023-09-25  3:56 ` [PATCH v3 5/9] maple_tree: Update the documentation of maple tree Peng Zhang
2023-10-03 18:46   ` Liam R. Howlett
2023-10-04  9:09     ` Peng Zhang
2023-09-25  3:56 ` [PATCH v3 6/9] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
2023-09-25  3:56 ` [PATCH v3 7/9] maple_tree: Update check_forking() and bench_forking() Peng Zhang
2023-09-25  3:56 ` [PATCH v3 8/9] maple_tree: Preserve the tree attributes when destroying maple tree Peng Zhang
2023-09-25  3:56 ` [PATCH v3 9/9] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
2023-10-03 18:46   ` Liam R. Howlett
2023-10-04  9:10     ` Peng Zhang
2023-10-04 19:53       ` Liam R. Howlett
2023-10-05 15:56         ` Peng Zhang
2023-10-07  1:11           ` Liam R. Howlett
2023-10-07  1:32             ` Liam R. Howlett
2023-10-07  4:25               ` Peng Zhang
2023-10-08  7:54               ` Peng Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=867f3fb6-22c5-4dd1-479d-5b148163f2d0@bytedance.com \
    --to=zhangpeng.00@bytedance.com \
    --cc=Liam.Howlett@Oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=maple-tree@lists.infradead.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=michael.christie@oracle.com \
    --cc=mjguzik@gmail.com \
    --cc=npiggin@gmail.com \
    --cc=oliver.sang@intel.com \
    --cc=peterz@infradead.org \
    --cc=surenb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).