All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hou Tao <houtao@huaweicloud.com>
To: paulmck@kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
	bpf@vger.kernel.org, Martin KaFai Lau <kafai@fb.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Song Liu <songliubraving@fb.com>, Hao Luo <haoluo@google.com>,
	Yonghong Song <yhs@fb.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	KP Singh <kpsingh@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Jiri Olsa <jolsa@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Delyan Kratunov <delyank@fb.com>,
	rcu@vger.kernel.org, houtao1@huawei.com
Subject: Re: [PATCH bpf-next 1/3] bpf: Free elements after one RCU-tasks-trace grace period
Date: Thu, 13 Oct 2022 09:25:31 +0800	[thread overview]
Message-ID: <ca5f2973-e8b5-0d73-fd23-849f0dfc4347@huaweicloud.com> (raw)
In-Reply-To: <20221012161103.GU4221@paulmck-ThinkPad-P17-Gen-1>

Hi,

On 10/13/2022 12:11 AM, Paul E. McKenney wrote:
> On Wed, Oct 12, 2022 at 05:26:26PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
>>> On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
>>>>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
>>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>>
>>>>>> According to the implementation of RCU Tasks Trace, it inovkes
>>>>>> ->postscan_func() to wait for one RCU-tasks-trace grace period and
>>>>>> rcu_tasks_trace_postscan() inovkes synchronize_rcu() to wait for one
>>>>>> normal RCU grace period in turn, so one RCU-tasks-trace grace period
>>>>>> will imply one RCU grace period.
>>>>>>
>>>>>> So there is no need to do call_rcu() again in the callback of
>>>>>> call_rcu_tasks_trace() and it can just free these elements directly.
>>>>> This is true, but this is an implementation detail that is not guaranteed
>>>>> in future versions of the kernel.  But if this additional call_rcu()
>>>>> is causing trouble, I could add some API member that returned true in
>>>>> kernels where it does happen to be the case that call_rcu_tasks_trace()
>>>>> implies a call_rcu()-style grace period.
>>>>>
>>>>> The BPF memory allocator could then complain or adapt, as appropriate.
>>>>>
>>>>> Thoughts?
>>>> It is indeed an implementation details. But In an idle KVM guest, for memory
>>>> reclamation in bpf memory allocator a RCU tasks trace grace period is about 30ms
>>>> and RCU grace period is about 20 ms. Under stress condition, the grace period
>>>> will be much longer. If the extra RCU grace period can be removed, these memory
>>>> can be reclaimed more quickly and it will be beneficial for memory pressure.
>>> I understand the benefits.  We just need to get a safe way to take
>>> advantage of them.
>>>
>>>> Now it seems we can use RCU poll APIs (e.g. get_state_synchronize_rcu() and
>>>> poll_state_synchronize_rcu() ) to check whether or not a RCU grace period has
>>>> passed. But It needs to add at least one unsigned long into the freeing object.
>>>> The extra memory overhead may be OK for bpf memory allocator, but it is not for
>>>> small object. So could you please show example on how these new APIs work ? Does
>>>> it need to modify the to-be-free object ?
>>> Good point on the polling APIs, more on this below.
>>>
>>> I was thinking in terms of an API like this:
>>>
>>> 	static inline bool rcu_trace_implies_rcu_gp(void)
>>> 	{
>>> 		return true;
>>> 	}
>>>
>>> Along with comments on the synchronize_rcu() pointing at the
>>> rcu_trace_implies_rcu_gp().
>> It is a simple API and the modifications for call_rcu_tasks_trace() users will
>> also be simple. The callback of call_rcu_tasks_trace() will invoke
>> rcu_trace_implies_rcu_gp(), and it returns true, the callback invokes the
>> callback for call_rcu() directly, else it does so through call_rcu().
> Sounds good!
>
> Please note that if the callback function just does kfree() or equivalent,
> this will work fine.  If it acquires spinlocks, you may need to do
> local_bh_disable() before invoking it directly and local_bh_enable()
> afterwards.
What is the purpose for invoking local_bh_disable() ? Is it trying to ensure the
callback is called under soft-irq context or something else ? For all I know,
task rcu already invokes its callback with soft-irq disabled.
>
>>> Another approach is to wait for the grace periods concurrently, but this
>>> requires not one but two rcu_head structures.
>> Beside the extra space usage, does it also complicate the logic in callback
>> function because it needs to handle the concurrency problem ?
> Definitely!!!
>
> Perhaps something like this:
>
> 	static void cbf(struct rcu_head *rhp)
> 	{
> 		p = container_of(rhp, ...);
>
> 		if (atomic_dec_and_test(&p->cbs_awaiting))
> 			kfree(p);
> 	}
>
> 	atomic_set(&p->cbs_awating, 2);
> 	call_rcu(p->rh1, cbf);
> 	call_rcu_tasks_trace(p->rh2, cbf);
>
> Is this worth it?  I have no idea.  I must defer to you.
I still prefer the simple solution.
>
>>> Back to the polling API.  Are these things freed individually, or can
>>> they be grouped?  If they can be grouped, the storage for the grace-period
>>> state could be associated with the group.
>> As said above, for bpf memory allocator it may be OK because it frees elements
>> in batch, but for bpf local storage and its element these memories are freed
>> individually. So I think if the implication of RCU tasks trace grace period will
>> not be changed in the foreseeable future, adding rcu_trace_implies_rcu_gp() and
>> using it in bpf is a good idea. What do you think ?
> Maybe the BPF memory allocator does it one way and BPF local storage
> does it another way.
Why not. Maybe bpf expert think the space overhead is also reasonable in the BPF
local storage case.
>
> How about if I produce a patch for rcu_trace_implies_rcu_gp() and let
> you carry it with your series?  That way I don't have an unused function
> in -rcu and you don't have to wait for me to send it upstream?
Sound reasonable to me. Also thanks for your suggestions.
>
> 							Thanx, Paul
>
>>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>>>> ---
>>>>>>  kernel/bpf/memalloc.c | 17 ++++++-----------
>>>>>>  1 file changed, 6 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
>>>>>> index 5f83be1d2018..6f32dddc804f 100644
>>>>>> --- a/kernel/bpf/memalloc.c
>>>>>> +++ b/kernel/bpf/memalloc.c
>>>>>> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
>>>>>>  	kfree(obj);
>>>>>>  }
>>>>>>  
>>>>>> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do
>>>>>> + * an extra call_rcu().
>>>>>> + */
>>>>>>  static void __free_rcu(struct rcu_head *head)
>>>>>>  {
>>>>>>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>>>> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head)
>>>>>>  	atomic_set(&c->call_rcu_in_progress, 0);
>>>>>>  }
>>>>>>  
>>>>>> -static void __free_rcu_tasks_trace(struct rcu_head *head)
>>>>>> -{
>>>>>> -	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
>>>>>> -
>>>>>> -	call_rcu(&c->rcu, __free_rcu);
>>>>>> -}
>>>>>> -
>>>>>>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
>>>>>>  {
>>>>>>  	struct llist_node *llnode = obj;
>>>>>> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c)
>>>>>>  		 * from __free_rcu() and from drain_mem_cache().
>>>>>>  		 */
>>>>>>  		__llist_add(llnode, &c->waiting_for_gp);
>>>>>> -	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
>>>>>> -	 * Then use call_rcu() to wait for normal progs to finish
>>>>>> -	 * and finally do free_one() on each element.
>>>>>> +	/* Use call_rcu_tasks_trace() to wait for both sleepable and normal
>>>>>> +	 * progs to finish and finally do free_one() on each element.
>>>>>>  	 */
>>>>>> -	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
>>>>>> +	call_rcu_tasks_trace(&c->rcu, __free_rcu);
>>>>>>  }
>>>>>>  
>>>>>>  static void free_bulk(struct bpf_mem_cache *c)
>>>>>> -- 
>>>>>> 2.29.2
>>>>>>


  reply	other threads:[~2022-10-13  1:25 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-11  7:11 [PATCH bpf-next 0/3] Remove unnecessary RCU grace period chaining Hou Tao
2022-10-11  7:11 ` [PATCH bpf-next 1/3] bpf: Free elements after one RCU-tasks-trace grace period Hou Tao
2022-10-11  9:07   ` Paul E. McKenney
2022-10-11 11:31     ` Hou Tao
2022-10-12  6:36       ` Paul E. McKenney
2022-10-12  9:26         ` Hou Tao
2022-10-12 16:11           ` Paul E. McKenney
2022-10-13  1:25             ` Hou Tao [this message]
2022-10-13  5:07               ` Martin KaFai Lau
2022-10-13  9:02                 ` Hou Tao
2022-10-13 19:00               ` Paul E. McKenney
2022-10-14  4:20                 ` Hou Tao
2022-10-14 12:15                   ` Paul E. McKenney
2022-10-17  6:55                     ` Hou Tao
2022-10-17 10:23                       ` Paul E. McKenney
2022-10-13  1:41             ` Hou Tao
2022-10-13 19:04               ` Paul E. McKenney
2022-10-14  4:04                 ` Hou Tao
2022-10-11  7:11 ` [PATCH bpf-next 2/3] bpf: Free local storage memory " Hou Tao
2022-10-11  9:09   ` Paul E. McKenney
2022-10-11  7:11 ` [PATCH bpf-next 3/3] bpf: Free trace program array " Hou Tao
2022-10-11  9:09   ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ca5f2973-e8b5-0d73-fd23-849f0dfc4347@huaweicloud.com \
    --to=houtao@huaweicloud.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=delyank@fb.com \
    --cc=haoluo@google.com \
    --cc=houtao1@huawei.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=songliubraving@fb.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.