From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.4 required=3.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 069D4C433E1 for ; Wed, 27 May 2020 20:38:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DC802207BC for ; Wed, 27 May 2020 20:38:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="PQ1PF938" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727922AbgE0Ui1 (ORCPT ); Wed, 27 May 2020 16:38:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45626 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726129AbgE0Ui0 (ORCPT ); Wed, 27 May 2020 16:38:26 -0400 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 842FCC03E97D for ; Wed, 27 May 2020 13:38:25 -0700 (PDT) Received: by mail-yb1-xb49.google.com with SMTP id 5so25809966ybe.17 for ; Wed, 27 May 2020 13:38:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc; bh=1P1IVYYqD7Lmk9JnpJwXopeg+n6aCPm5iww2pLfGskc=; b=PQ1PF9386ekV9M0GlLbVeudomlp8UYMvwHlWTTfmFvHcwC1XRE3WipoS0AOLefNVYa PCDZtZC3xEFqr3jVCKQLwEClapvSN4ZW3Fxm8JRE1bJ7U5PRKgx/MBdmcOtkfAjh4f9H YEPHdeZmTIreyzOW15DD3ejUrIjKX2+GfFMMbR4LA/PGFxDrN4fzkoZjT5co9HLsPX1B f9qvaoPx8zqYPILMrnVuwXiK8RBgYurSA0hIX5xhp77TPXc4hNf/CcFoTEZhpdyZvVG+ XeFUpGxLlap3HyQ5umybDCIrR5TM9YCb6qY/Upj3YWbenSwKATdzbjZXQEewjL+cxQOc ddyA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc; bh=1P1IVYYqD7Lmk9JnpJwXopeg+n6aCPm5iww2pLfGskc=; b=JE0zGERTmr6FAPOoOoRx4TJv7jYaGUAlY2Vft/NjCBmvGL2G2WNbTrL1n12ddEah4k Eyt41o5uPBZ3D8pNkpkTUl1WvZDyJPvn+1OVXNNhfF++V3TV/epoV2YPohuROpMLtxKa DZ45ckU3Seh7nenE5KR5v2MUWIAhJveYIm3kAj1HShVykct9bPPZAhZp2ddeUEFgT4yu jPGkjKsbtwXLRymnJM7U6+Pyr786QrH/p6gRSw+Wi51oYXjrMQkLCJ5fB9tzYMKTDBmN iAE7rCy31rUsEEGmCQNqlmEDEewUyMDludYa2C2Zb/cxOBkQa5/rq0i5obFcd4FNihUF +0iA== X-Gm-Message-State: AOAM531Lgx9qXnO9hSOU5Gg0l7JObbj96vNV4TLB9oDKumFKWncAuF2A TOtTGddHTlwp63QMHEG6akFuBEI= X-Google-Smtp-Source: ABdhPJwhXIsHOW4VAq44rrgzdJVTNLBJ2Bt0DHNfkP+5zIV37LoxeqwMUISPBtDgagnouAEJkcKKHQA= X-Received: by 2002:a25:148b:: with SMTP id 133mr100655ybu.38.1590611904629; Wed, 27 May 2020 13:38:24 -0700 (PDT) Date: Wed, 27 May 2020 13:38:23 -0700 In-Reply-To: <87tv012lxs.fsf@cloudflare.com> Message-Id: <20200527203823.GB57268@google.com> Mime-Version: 1.0 References: <20200527170840.1768178-1-jakub@cloudflare.com> <20200527170840.1768178-6-jakub@cloudflare.com> <20200527174805.GG49942@google.com> <87tv012lxs.fsf@cloudflare.com> Subject: Re: [PATCH bpf-next 5/8] bpf: Add link-based BPF program attachment to network namespace From: sdf@google.com To: Jakub Sitnicki Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, kernel-team@cloudflare.com Content-Type: text/plain; charset="UTF-8"; format=flowed; delsp=yes Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On 05/27, Jakub Sitnicki wrote: > On Wed, May 27, 2020 at 07:48 PM CEST, sdf@google.com wrote: > > On 05/27, Jakub Sitnicki wrote: > >> Add support for bpf() syscall subcommands that operate on > >> bpf_link (LINK_CREATE, LINK_UPDATE, OBJ_GET_INFO) for attach points > tied to > >> network namespaces (that is flow dissector at the moment). > > > >> Link-based and prog-based attachment can be used interchangeably, but > only > >> one can be in use at a time. Attempts to attach a link when a prog is > >> already attached directly, and the other way around, will be met with > >> -EBUSY. > > > >> Attachment of multiple links of same attach type to one netns is not > >> supported, with the intention to lift it when a use-case presents > >> itself. Because of that attempts to create a netns link, when one > already > >> exists result in -E2BIG error, signifying that there is no space left > for > >> another attachment. > > > >> Link-based attachments to netns don't keep a netns alive by holding a > ref > >> to it. Instead links get auto-detached from netns when the latter is > being > >> destroyed by a pernet pre_exit callback. > > > >> When auto-detached, link lives in defunct state as long there are open > FDs > >> for it. -ENOLINK is returned if a user tries to update a defunct link. > > > >> Because bpf_link to netns doesn't hold a ref to struct net, special > care is > >> taken when releasing the link. The netns might be getting torn down > when > >> the release function tries to access it to detach the link. > > > >> To ensure the struct net object is alive when release function > accesses it > >> we rely on the fact that cleanup_net(), struct net destructor, calls > >> synchronize_rcu() after invoking pre_exit callbacks. If auto-detach > from > >> pre_exit happens first, link release will not attempt to access struct > net. > > > >> Same applies the other way around, network namespace doesn't keep an > >> attached link alive because by not holding a ref to it. Instead > bpf_links > >> to netns are RCU-freed, so that pernet pre_exit callback can safely > access > >> and auto-detach the link when racing with link release/free. > > > > [..] > >> + rcu_read_lock(); > >> for (type = 0; type < MAX_NETNS_BPF_ATTACH_TYPE; type++) { > >> - if (rcu_access_pointer(net->bpf.progs[type])) > >> + if (rcu_access_pointer(net->bpf.links[type])) > >> + bpf_netns_link_auto_detach(net, type); > >> + else if (rcu_access_pointer(net->bpf.progs[type])) > >> __netns_bpf_prog_detach(net, type); > >> } > >> + rcu_read_unlock(); > > Aren't you doing RCU_INIT_POINTER in __netns_bpf_prog_detach? > > Is it allowed under rcu_read_load? > Yes, that's true. __netns_bpf_prog_detach does > RCU_INIT_POINTER(net->bpf.progs[type], NULL); > RCU read lock is here for the rcu_dereference() that happens in > bpf_netns_link_auto_detach (netns doesn't hold a ref to bpf_link): > /* Called with RCU read lock. */ > static void __net_exit > bpf_netns_link_auto_detach(struct net *net, enum netns_bpf_attach_type > type) > { > struct bpf_netns_link *net_link; > struct bpf_link *link; > link = rcu_dereference(net->bpf.links[type]); > if (!link) > return; > net_link = to_bpf_netns_link(link); > RCU_INIT_POINTER(net_link->net, NULL); > } > I've pulled it up, out of the loop, perhaps too eagerly and just made it > confusing, considering we're iterating over a 1-item array :-) > Now, I'm also doing RCU_INIT_POINTER on the *contents of bpf_link* in > bpf_netns_link_auto_detach. Is that allowed? I'm not sure, that bit is > hazy to me. > There are no concurrent writers to net_link->net, just readers, i.e. > bpf_netns_link_release(). And I know bpf_link won't be freed until the > grace period elapses. > sparse and CONFIG_PROVE_RCU are not shouting at me, but please do if it > doesn't hold up or make sense. > I certainly can push the rcu_read_lock() down into > bpf_netns_link_auto_detach(). I think it would be much nicer if you push them down to preserve the assumption that nothing is modified under read lock and you flip the pointers only when holding the mutexes. I'll do another pass on this patch, I think I don't understand a bunch of bits where you do: mutex_lock rcu_read_lock -> why? you're already in the update section, can use rcu_dereference_protected ... rcu_read_unlock mutex_unlock But I'll post those comments inline shortly.