linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peng Zhang <zhangpeng.00@bytedance.com>
To: "Liam R. Howlett" <Liam.Howlett@Oracle.com>,
	Peng Zhang <zhangpeng.00@bytedance.com>,
	willy@infradead.org, michael.christie@oracle.com,
	surenb@google.com, npiggin@gmail.com, corbet@lwn.net,
	mathieu.desnoyers@efficios.com, avagin@gmail.com,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, brauner@kernel.org,
	peterz@infradead.org
Subject: Re: [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup()
Date: Wed, 16 Aug 2023 21:41:46 +0800	[thread overview]
Message-ID: <3f4e73cc-1a98-95a8-9ab2-47797d236585@bytedance.com> (raw)
In-Reply-To: <20230731162714.4x3lzymuyvu2mter@revolver>



在 2023/8/1 00:27, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230731 08:24]:
>>
>>
>> 在 2023/7/27 00:03, Liam R. Howlett 写道:
>>> * Peng Zhang <zhangpeng.00@bytedance.com> [230726 04:10]:
>>>> Introduce interfaces __mt_dup() and mt_dup(), which are used to
>>>> duplicate a maple tree. Compared with traversing the source tree and
>>>> reinserting entry by entry in the new tree, it has better performance.
>>>> The difference between __mt_dup() and mt_dup() is that mt_dup() holds
>>>> an internal lock.
>>>>
>>>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>>>> ---
>>>>    include/linux/maple_tree.h |   3 +
>>>>    lib/maple_tree.c           | 211 +++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 214 insertions(+)
>>>>
>>>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>>>> index c962af188681..229fe78e4c89 100644
>>>> --- a/include/linux/maple_tree.h
>>>> +++ b/include/linux/maple_tree.h
>>>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>>>    		void *entry, gfp_t gfp);
>>>>    void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>>> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>>>> +
>>>>    void mtree_destroy(struct maple_tree *mt);
>>>>    void __mt_destroy(struct maple_tree *mt);
>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>> index da3a2fb405c0..efac6761ae37 100644
>>>> --- a/lib/maple_tree.c
>>>> +++ b/lib/maple_tree.c
>>>> @@ -6595,6 +6595,217 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>>>    }
>>>>    EXPORT_SYMBOL(mtree_erase);
>>>> +/*
>>>> + * mt_dup_free() - Free the nodes of a incomplete maple tree.
>>>> + * @mt: The incomplete maple tree
>>>> + * @node: Free nodes from @node
>>>> + *
>>>> + * This function frees all nodes starting from @node in the reverse order of
>>>> + * mt_dup_build(). At this point we don't need to hold the source tree lock.
>>>> + */
>>>> +static void mt_dup_free(struct maple_tree *mt, struct maple_node *node)
>>>> +{
>>>> +	void **slots;
>>>> +	unsigned char offset;
>>>> +	struct maple_enode *enode;
>>>> +	enum maple_type type;
>>>> +	unsigned char count = 0, i;
>>>> +
>>>
>>> Can we make these labels inline functions and try to make this a loop?
>> I did this just to make things easier. Refer to the implementation of
>> walk_tg_tree_from() in sched/core.c. Using some loops and inline
>> functions probably doesn't simplify things. I'll try to do that and give
>> up if it complicates things.
> 
> Thanks, I'd like to try and simplify the code instead of adding goto
> label loops. The code you are referencing is from 2008 and goto loops
> are not common.
> 
>>>
>>>> +try_ascend:
>>>> +	if (ma_is_root(node)) {
>>>> +		mt_free_one(node);
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	offset = ma_parent_slot(node);
>>>> +	type = ma_parent_type(mt, node);
>>>> +	node = ma_parent(node);
>>>> +	if (!offset)
>>>> +		goto free;
>>>> +
>>>> +	offset--;
>>>> +
>>>> +descend:
>>>> +	slots = (void **)ma_slots(node, type);
>>>> +	enode = slots[offset];
>>>> +	if (mte_is_leaf(enode))
>>>> +		goto free;
>>>> +
>>>> +	type = mte_node_type(enode);
>>>> +	node = mte_to_node(enode);
>>>> +	offset = ma_nonleaf_data_end_nocheck(node, type);
>>>> +	goto descend;
>>>> +
>>>> +free:
>>>> +	slots = (void **)ma_slots(node, type);
>>>> +	count = ma_nonleaf_data_end_nocheck(node, type) + 1;
>>>> +	for (i = 0; i < count; i++)
>>>> +		((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>>>> +
>>>> +	/* Cast to __rcu to avoid sparse checker complaining. */
>>>> +	mt_free_bulk(count, (void __rcu **)slots);
>>>> +	goto try_ascend;
>>>> +}
>>>> +
>>>> +/*
>>>> + * mt_dup_build() - Build a new maple tree from a source tree
>>>> + * @mt: The source maple tree to copy from
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + * @to_free: Free nodes starting from @to_free if the build fails
>>>> + *
>>>> + * This function builds a new tree in DFS preorder. If it fails due to memory
>>>> + * allocation, @to_free will store the last failed node to free the incomplete
>>>> + * tree. Use mt_dup_free() to free nodes.
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>> + */
>>>> +static inline int mt_dup_build(struct maple_tree *mt, struct maple_tree *new,
>>>> +			       gfp_t gfp, struct maple_node **to_free)
>>>
>>> I am trying to change the functions to be two tabs of indent for
>>> arguments from now on.  It allows for more to fit on a single line and
>>> still maintains a clear separation between code and argument lists.
>> I'm not too concerned about code formatting. . . At least in this
>> patchset.
> 
> I have a mess of it in the tree and wanted to communicate my desire to
> shift to using two tabs for extra arguments in the future.
> 
>>>
>>>> +{
>>>> +	struct maple_enode *enode;
>>>> +	struct maple_node *new_node, *new_parent = NULL, *node;
>>>> +	enum maple_type type;
>>>> +	void __rcu **slots;
>>>> +	void **new_slots;
>>>> +	unsigned char count, request, i, offset;
>>>> +	unsigned long *set_parent;
>>>> +	unsigned long new_root;
>>>> +
>>>> +	mt_init_flags(new, mt->ma_flags);
>>>> +	enode = mt_root_locked(mt);
>>>> +	if (unlikely(!xa_is_node(enode))) {
>>>> +		rcu_assign_pointer(new->ma_root, enode);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	new_node = mt_alloc_one(gfp);
>>>> +	if (!new_node)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	new_root = (unsigned long)new_node;
>>>> +	new_root |= (unsigned long)enode & MAPLE_NODE_MASK;
>>>> +
>>>> +copy_node:
>>>
>>> Can you make copy_node, descend, ascend inline functions instead of the
>>> goto jumping please?  It's better to have loops over jumping around a
>>> lot.  Gotos are good for undoing things and retry, but constructing
>>> loops with them makes it difficult to follow.
>> Same as above.
>>>
>>>> +	node = mte_to_node(enode);
>>>> +	type = mte_node_type(enode);
>>>> +	memcpy(new_node, node, sizeof(struct maple_node));
>>>> +
>>>> +	set_parent = (unsigned long *)&(new_node->parent);
>>>> +	*set_parent &= MAPLE_NODE_MASK;
>>>> +	*set_parent |= (unsigned long)new_parent;
>>>
>>> Maybe make a small inline to set the parent instead of this?
>>>
>>> There are some defined helpers for setting the types like
>>> ma_parent_ptr() and ma_enode_ptr() to make casting more type-safe.
>> Ok, I'll try to do that.
>>>
>>>> +	if (ma_is_leaf(type))
>>>> +		goto ascend;
>>>> +
>>>> +	new_slots = (void **)ma_slots(new_node, type);
>>>> +	slots = ma_slots(node, type);
>>>> +	request = ma_nonleaf_data_end(mt, node, type) + 1;
>>>> +	count = mt_alloc_bulk(gfp, request, new_slots);
>>>> +	if (!count) {
>>>> +		*to_free = new_node;
>>>> +		return -ENOMEM;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < count; i++)
>>>> +		((unsigned long *)new_slots)[i] |=
>>>> +				((unsigned long)mt_slot_locked(mt, slots, i) &
>>>> +				 MAPLE_NODE_MASK);
>>>> +	offset = 0;
>>>> +
>>>> +descend:
>>>> +	new_parent = new_node;
>>>> +	enode = mt_slot_locked(mt, slots, offset);
>>>> +	new_node = mte_to_node(new_slots[offset]);
>>>> +	goto copy_node;
>>>> +
>>>> +ascend:
>>>> +	if (ma_is_root(node)) {
>>>> +		new_node = mte_to_node((void *)new_root);
>>>> +		new_node->parent = ma_parent_ptr((unsigned long)new |
>>>> +						 MA_ROOT_PARENT);
>>>> +		rcu_assign_pointer(new->ma_root, (void *)new_root);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	offset = ma_parent_slot(node);
>>>> +	type = ma_parent_type(mt, node);
>>>> +	node = ma_parent(node);
>>>> +	new_node = ma_parent(new_node);
>>>> +	if (offset < ma_nonleaf_data_end(mt, node, type)) {
>>>> +		offset++;
>>>> +		new_slots = (void **)ma_slots(new_node, type);
>>>> +		slots = ma_slots(node, type);
>>>> +		goto descend;
>>>> +	}
>>>> +
>>>> +	goto ascend;
>>>> +}
>>>> +
>>>> +/**
>>>> + * __mt_dup(): Duplicate a maple tree
>>>> + * @mt: The source maple tree
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + *
>>>> + * This function duplicates a maple tree using a faster method than traversing
>>>> + * the source tree and inserting entries into the new tree one by one. The user
>>>> + * needs to lock the source tree manually. Before calling this function, @new
>>>> + * must be an empty tree or an uninitialized tree. If @mt uses an external lock,
>>>> + * we may also need to manually set @new's external lock using
>>>> + * mt_set_external_lock().
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>> + */
>>>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>
>>> We use mas_ for things that won't handle the locking and pass in a maple
>>> state.  Considering the leaves need to be altered once this is returned,
>>> I would expect passing in a maple state should be feasible?
>> But we don't really need mas here. What do you think the state of mas
>> should be when this function returns? Make it point to the first entry,
>> or the last entry?
> 
> I would write it to point to the first element so that the call to
> replace the first element can just do that without an extra walk and
> document the maple state end point.
Unfortunately, this does not seem to be convenient. Users usually use
mas_for_each() to replace elements. If we set mas to the first element,
the first call to mas_find() in mas_for_each() will get the next
element.

There may also be other scenarios where the user does not necessarily
have to replace every element.

Finally, getting the first element in __mt_dup() requires an additional
check to check whether the first element has already been recorded. Such
a check will be performed at each leaf node, which is unnecessary
overhead.

Of course, the first reason is the main reason, which prevents us from
using mas_for_each(). So I don't want to record the first element.
> 
>>>
>>>> +{
>>>> +	int ret;
>>>> +	struct maple_node *to_free = NULL;
>>>> +
>>>> +	ret = mt_dup_build(mt, new, gfp, &to_free);
>>>> +
>>>> +	if (unlikely(ret == -ENOMEM)) {
>>>
>>> On other errors, will the half constructed tree be returned?  Is this
>>> safe?
>> Of course, mt_dup_free() is carefully designed to handle this.
>>>
>>>> +		if (to_free)
>>>> +			mt_dup_free(new, to_free);
>>>> +	}
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(__mt_dup);
>>>> +
>>>> +/**
>>>> + * mt_dup(): Duplicate a maple tree
>>>> + * @mt: The source maple tree
>>>> + * @new: The new maple tree
>>>> + * @gfp: The GFP_FLAGS to use for allocations
>>>> + *
>>>> + * This function duplicates a maple tree using a faster method than traversing
>>>> + * the source tree and inserting entries into the new tree one by one. The
>>>> + * function will lock the source tree with an internal lock, and the user does
>>>> + * not need to manually handle the lock. Before calling this function, @new must
>>>> + * be an empty tree or an uninitialized tree. If @mt uses an external lock, we
>>>> + * may also need to manually set @new's external lock using
>>>> + * mt_set_external_lock().
>>>> + *
>>>> + * Return: 0 on success, -ENOMEM if memory could not be allocated.
>>>> + */
>>>> +int mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>>>
>>> mtree_ ususually used to indicate locking is handled.
>> Before unifying mtree_* and mt_*, I don't think I can see any difference
>> between them. At least mt_set_in_rcu() and mt_clear_in_rcu() will hold
>> the lock.
> 
> Fair enough.  I was thinking this closely matches __mt_destroy() and
> mtree_destroy().  We could be consistent in our inconsistency, at least.
> 
>>>
>>>> +{
>>>> +	int ret;
>>>> +	struct maple_node *to_free = NULL;
>>>> +
>>>> +	mtree_lock(mt);
>>>> +	ret = mt_dup_build(mt, new, gfp, &to_free);
>>>> +	mtree_unlock(mt);
>>>> +
>>>> +	if (unlikely(ret == -ENOMEM)) {
>>>> +		if (to_free)
>>>> +			mt_dup_free(new, to_free);
>>>
>>> Again, is a half constructed tree safe to return?  Since each caller
>>> checks to_free is NULL, could that be in mt_dup_free() instead?
>> Yes, this check can be put in mt_dup_free().
>>>
>>>> +	}
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL(mt_dup);
>>>> +
>>>>    /**
>>>>     * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>>>     * @mt: The maple tree
>>>> -- 
>>>> 2.20.1
>>>>
>>>>

  reply	other threads:[~2023-08-16 13:43 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-26  8:09 [PATCH 00/11] Introduce mt_dup() to improve the performance of fork() Peng Zhang
2023-07-26  8:09 ` [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}() Peng Zhang
2023-07-26 14:58   ` Liam R. Howlett
2023-07-31  9:52     ` Peng Zhang
2023-07-31 16:08       ` Liam R. Howlett
2023-07-26  8:09 ` [PATCH 02/11] maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end() Peng Zhang
2023-07-26  8:09 ` [PATCH 03/11] maple_tree: Add some helper functions Peng Zhang
2023-07-26 15:02   ` Liam R. Howlett
2023-07-26 15:08     ` Matthew Wilcox
2023-07-31 11:45       ` Peng Zhang
2023-08-11 17:28         ` Liam R. Howlett
2023-07-31 11:40     ` Peng Zhang
2023-07-26  8:09 ` [PATCH 04/11] maple_tree: Introduce interfaces __mt_dup() and mt_dup() Peng Zhang
2023-07-26 16:03   ` Liam R. Howlett
2023-07-31 12:24     ` Peng Zhang
2023-07-31 16:27       ` Liam R. Howlett
2023-08-16 13:41         ` Peng Zhang [this message]
2023-08-16 18:30           ` Liam R. Howlett
2023-08-18 11:53             ` Peng Zhang
2023-08-18 16:13               ` Liam R. Howlett
2023-07-26  8:09 ` [PATCH 05/11] maple_tree: Add test for mt_dup() Peng Zhang
2023-07-26 16:06   ` Liam R. Howlett
2023-07-31 12:32     ` Peng Zhang
2023-07-31 16:41       ` Liam R. Howlett
2023-07-26  8:09 ` [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry Peng Zhang
2023-07-26 16:08   ` Liam R. Howlett
2023-07-31 12:39     ` Peng Zhang
2023-07-31 16:48       ` Liam R. Howlett
2023-08-16 13:11         ` Peng Zhang
2023-08-16 17:40           ` Liam R. Howlett
2023-08-18  9:39             ` Peng Zhang
2023-08-18 16:15               ` Liam R. Howlett
2023-07-26  8:09 ` [PATCH 07/11] maple_tree: Update the documentation of maple tree Peng Zhang
2023-07-26  8:09 ` [PATCH 08/11] maple_tree: Skip other tests when BENCH is enabled Peng Zhang
2023-07-26  8:09 ` [PATCH 09/11] maple_tree: Update check_forking() and bench_forking() Peng Zhang
2023-07-26  8:09 ` [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree Peng Zhang
2023-07-26 16:39   ` Liam R. Howlett
2023-07-31 12:55     ` Peng Zhang
2023-07-31 20:55       ` Liam R. Howlett
2023-07-26  8:09 ` [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap() Peng Zhang
2023-07-26 17:06   ` Liam R. Howlett
2023-07-31 12:59     ` Peng Zhang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3f4e73cc-1a98-95a8-9ab2-47797d236585@bytedance.com \
    --to=zhangpeng.00@bytedance.com \
    --cc=Liam.Howlett@Oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=avagin@gmail.com \
    --cc=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=michael.christie@oracle.com \
    --cc=npiggin@gmail.com \
    --cc=peterz@infradead.org \
    --cc=surenb@google.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).