From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 12 Oct 2022 09:11:03 -0700
From: "Paul E. McKenney"
To: Hou Tao
Cc: Alexei Starovoitov, bpf@vger.kernel.org, Martin KaFai Lau,
 Andrii Nakryiko, Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann,
 KP Singh, "David S. Miller", Jakub Kicinski, Stanislav Fomichev,
 Jiri Olsa, John Fastabend, Delyan Kratunov, rcu@vger.kernel.org,
 houtao1@huawei.com
Subject: Re: [PATCH bpf-next 1/3] bpf: Free elements after one RCU-tasks-trace grace period
Message-ID: <20221012161103.GU4221@paulmck-ThinkPad-P17-Gen-1>
Reply-To: paulmck@kernel.org
References: <20221011071128.3470622-1-houtao@huaweicloud.com>
 <20221011071128.3470622-2-houtao@huaweicloud.com>
 <20221011090742.GG4221@paulmck-ThinkPad-P17-Gen-1>
 <20221012063607.GM4221@paulmck-ThinkPad-P17-Gen-1>
In-Reply-To:
X-Mailing-List: bpf@vger.kernel.org

On Wed, Oct 12, 2022 at 05:26:26PM +0800, Hou Tao wrote:
> Hi,
>
> On 10/12/2022 2:36 PM, Paul E. McKenney wrote:
> > On Tue, Oct 11, 2022 at 07:31:28PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 10/11/2022 5:07 PM, Paul E. McKenney wrote:
> >>> On Tue, Oct 11, 2022 at 03:11:26PM +0800, Hou Tao wrote:
> >>>> From: Hou Tao
> >>>>
> >>>> According to the implementation of RCU Tasks Trace, it invokes
> >>>> ->postscan_func() to wait for one RCU-tasks-trace grace period, and
> >>>> rcu_tasks_trace_postscan() in turn invokes synchronize_rcu() to wait
> >>>> for one normal RCU grace period, so one RCU-tasks-trace grace period
> >>>> implies one RCU grace period.
> >>>>
> >>>> So there is no need to do call_rcu() again in the callback of
> >>>> call_rcu_tasks_trace(); it can just free these elements directly.
> >>> This is true, but this is an implementation detail that is not
> >>> guaranteed in future versions of the kernel. But if this additional
> >>> call_rcu() is causing trouble, I could add some API member that
> >>> returned true in kernels where it does happen to be the case that
> >>> call_rcu_tasks_trace() implies a call_rcu()-style grace period.
> >>>
> >>> The BPF memory allocator could then complain or adapt, as appropriate.
> >>>
> >>> Thoughts?
> >> It is indeed an implementation detail. But in an idle KVM guest, for
> >> memory reclamation in the bpf memory allocator, an RCU-tasks-trace
> >> grace period is about 30 ms and an RCU grace period is about 20 ms.
> >> Under stress conditions, the grace periods will be much longer. If the
> >> extra RCU grace period can be removed, this memory can be reclaimed
> >> more quickly, which is beneficial under memory pressure.
> > I understand the benefits. We just need to get a safe way to take
> > advantage of them.
> >
> >> Now it seems we can use the RCU polling APIs (e.g.,
> >> get_state_synchronize_rcu() and poll_state_synchronize_rcu()) to check
> >> whether or not an RCU grace period has passed. But this needs to add
> >> at least one unsigned long to the to-be-freed object. The extra memory
> >> overhead may be OK for the bpf memory allocator, but it is not for a
> >> small object. So could you please show an example of how these new
> >> APIs work? Does it need to modify the to-be-freed object?
> > Good point on the polling APIs, more on this below.
> >
> > I was thinking in terms of an API like this:
> >
> >	static inline bool rcu_trace_implies_rcu_gp(void)
> >	{
> >		return true;
> >	}
> >
> > Along with comments on the synchronize_rcu() pointing at the
> > rcu_trace_implies_rcu_gp().
> It is a simple API and the modifications for call_rcu_tasks_trace() users
> will also be simple. The callback of call_rcu_tasks_trace() will invoke
> rcu_trace_implies_rcu_gp(); if it returns true, the callback invokes the
> callback for call_rcu() directly, else it does so through call_rcu().

Sounds good! Please note that if the callback function just does kfree()
or equivalent, this will work fine. If it acquires spinlocks, you may
need to do local_bh_disable() before invoking it directly and
local_bh_enable() afterwards.
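For concreteness, a minimal sketch of the dispatch Hou describes, in the
style of the quoted memalloc.c code. It assumes the proposed
rcu_trace_implies_rcu_gp() exists and reuses __free_rcu() from the patch
above; this is an illustration of the idea under discussion, not merged
code:

	static void __free_rcu_tasks_trace(struct rcu_head *head)
	{
		/* If the RCU-tasks-trace grace period implied a full RCU
		 * grace period, the final callback can run immediately.
		 * __free_rcu() only frees memory; a callback that acquires
		 * spinlocks may need local_bh_disable()/local_bh_enable()
		 * around the direct call, per the note above.
		 */
		if (rcu_trace_implies_rcu_gp())
			__free_rcu(head);
		else
			call_rcu(head, __free_rcu);
	}

Callers would still pass __free_rcu_tasks_trace to call_rcu_tasks_trace();
only the callback's internals change.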
> > Another approach is to wait for the grace periods concurrently, but
> > this requires not one but two rcu_head structures.
> Besides the extra space usage, does it also complicate the logic in the
> callback function, because it needs to handle the concurrency problem?

Definitely!!! Perhaps something like this:

	static void cbf(struct rcu_head *rhp)
	{
		p = container_of(rhp, ...);

		if (atomic_dec_and_test(&p->cbs_awaiting))
			kfree(p);
	}

	atomic_set(&p->cbs_awaiting, 2);
	call_rcu(&p->rh1, cbf);
	call_rcu_tasks_trace(&p->rh2, cbf);

Is this worth it? I have no idea. I must defer to you.
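Spelled out with an assumed object layout (struct foo, its field names,
and the callback names are hypothetical), the concurrent-wait approach
might look like this:

	struct foo {
		struct rcu_head rh1;	/* for call_rcu() */
		struct rcu_head rh2;	/* for call_rcu_tasks_trace() */
		atomic_t cbs_awaiting;	/* grace periods still outstanding */
	};

	static void cbf_rcu(struct rcu_head *rhp)
	{
		struct foo *p = container_of(rhp, struct foo, rh1);

		if (atomic_dec_and_test(&p->cbs_awaiting))
			kfree(p);
	}

	static void cbf_tasks_trace(struct rcu_head *rhp)
	{
		struct foo *p = container_of(rhp, struct foo, rh2);

		if (atomic_dec_and_test(&p->cbs_awaiting))
			kfree(p);
	}

	/* Freeing path: start both grace periods concurrently, and
	 * whichever callback runs last does the kfree().
	 */
	atomic_set(&p->cbs_awaiting, 2);
	call_rcu(&p->rh1, cbf_rcu);
	call_rcu_tasks_trace(&p->rh2, cbf_tasks_trace);

Two wrapper callbacks are needed because a single callback cannot tell
which rcu_head it was invoked on; and since atomic_dec_and_test() is
fully ordered, the kfree() cannot happen until both grace periods have
elapsed.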
> > Back to the polling API. Are these things freed individually, or can
> > they be grouped? If they can be grouped, the storage for the
> > grace-period state could be associated with the group.
> As said above, for the bpf memory allocator it may be OK because it
> frees elements in batches, but bpf local storage and its elements are
> freed individually. So I think that if the implication of the RCU tasks
> trace grace period will not change in the foreseeable future, adding
> rcu_trace_implies_rcu_gp() and using it in bpf is a good idea. What do
> you think?

Maybe the BPF memory allocator does it one way and BPF local storage
does it another way.

How about if I produce a patch for rcu_trace_implies_rcu_gp() and let
you carry it with your series? That way I don't have an unused function
in -rcu and you don't have to wait for me to send it upstream.

							Thanx, Paul

> >>>> Signed-off-by: Hou Tao
> >>>> ---
> >>>>  kernel/bpf/memalloc.c | 17 ++++++-----------
> >>>>  1 file changed, 6 insertions(+), 11 deletions(-)
> >>>>
> >>>> diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
> >>>> index 5f83be1d2018..6f32dddc804f 100644
> >>>> --- a/kernel/bpf/memalloc.c
> >>>> +++ b/kernel/bpf/memalloc.c
> >>>> @@ -209,6 +209,9 @@ static void free_one(struct bpf_mem_cache *c, void *obj)
> >>>>  	kfree(obj);
> >>>>  }
> >>>>
> >>>> +/* Now RCU Tasks grace period implies RCU grace period, so no need to do
> >>>> + * an extra call_rcu().
> >>>> + */
> >>>>  static void __free_rcu(struct rcu_head *head)
> >>>>  {
> >>>>  	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> >>>> @@ -220,13 +223,6 @@ static void __free_rcu(struct rcu_head *head)
> >>>>  	atomic_set(&c->call_rcu_in_progress, 0);
> >>>>  }
> >>>>
> >>>> -static void __free_rcu_tasks_trace(struct rcu_head *head)
> >>>> -{
> >>>> -	struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
> >>>> -
> >>>> -	call_rcu(&c->rcu, __free_rcu);
> >>>> -}
> >>>> -
> >>>>  static void enque_to_free(struct bpf_mem_cache *c, void *obj)
> >>>>  {
> >>>>  	struct llist_node *llnode = obj;
> >>>> @@ -252,11 +248,10 @@ static void do_call_rcu(struct bpf_mem_cache *c)
> >>>>  	 * from __free_rcu() and from drain_mem_cache().
> >>>>  	 */
> >>>>  	__llist_add(llnode, &c->waiting_for_gp);
> >>>> -	/* Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
> >>>> -	 * Then use call_rcu() to wait for normal progs to finish
> >>>> -	 * and finally do free_one() on each element.
> >>>> +	/* Use call_rcu_tasks_trace() to wait for both sleepable and normal
> >>>> +	 * progs to finish and finally do free_one() on each element.
> >>>>  	 */
> >>>> -	call_rcu_tasks_trace(&c->rcu, __free_rcu_tasks_trace);
> >>>> +	call_rcu_tasks_trace(&c->rcu, __free_rcu);
> >>>>  }
> >>>>
> >>>>  static void free_bulk(struct bpf_mem_cache *c)
> >>>> --
> >>>> 2.29.2
> >>>>
>
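Returning to the polling APIs discussed above, a rough sketch of the
grouped approach, which also answers the question about modifying the
to-be-freed object: only the batch carries the grace-period cookie, not
each element. Here struct free_batch and free_all() are hypothetical;
get_state_synchronize_rcu() and poll_state_synchronize_rcu() are the
real RCU polling APIs:

	struct free_batch {
		struct llist_head objs;	/* elements awaiting a grace period */
		unsigned long gp_state;	/* RCU grace-period cookie */
	};

	static void batch_retire(struct free_batch *b)
	{
		/* Record the grace-period state when the batch is queued. */
		b->gp_state = get_state_synchronize_rcu();
	}

	static bool batch_try_free(struct free_batch *b)
	{
		/* Free the whole batch only if a full RCU grace period has
		 * elapsed since batch_retire(); otherwise the caller
		 * retries later.
		 */
		if (!poll_state_synchronize_rcu(b->gp_state))
			return false;
		free_all(llist_del_all(&b->objs));
		return true;
	}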