From: Stanislav Fomichev <sdf@fomichev.me>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Stanislav Fomichev <sdf@google.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Petar Penkov <ppenkov@google.com>
Subject: Re: [PATCH bpf-next 1/2] bpf/flow_dissector: add mode to enforce global BPF flow dissector
Date: Thu, 3 Oct 2019 09:01:37 -0700
Message-ID: <20191003160137.GD3223377@mini-arch>
In-Reply-To: <CAEf4BzZnWkdFpSUsSBenDDfrvgjGvBxUnJmQRwb7xjNQBaKXdQ@mail.gmail.com>

On 10/02, Andrii Nakryiko wrote:
> On Wed, Oct 2, 2019 at 6:43 PM Stanislav Fomichev <sdf@fomichev.me> wrote:
> >
> > On 10/02, Andrii Nakryiko wrote:
> > > On Wed, Oct 2, 2019 at 10:35 AM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > Always use init_net flow dissector BPF program if it's attached and fall
> > > > back to the per-net namespace one. Also, deny installing new programs if
> > > > there is already one attached to the root namespace.
> > > > Users can still detach their BPF programs, but can't attach any
> > > > new ones (-EPERM).
>
> I find this quite confusing for users, honestly. If there is no root
> namespace dissector we'll successfully attach per-net ones and they
> will be working fine. That some process will attach root one and all
> the previously successfully working ones will suddenly "break" without
> users potentially not realizing why. I bet this will be hair-pulling
> investigation for someone. Furthermore, if root net dissector is
> already attached, all subsequent attachment will now start failing.

The idea is that if a sysadmin decides to use a system-wide dissector, it
would be attached from the init scripts/systemd early in the boot process.
So the users in your example would always get EPERM/EBUSY/EEXIST.
I don't really see a realistic use-case where root and non-root
namespaces attach/detach flow dissector programs at non-boot
time (or why non-root containers could have BPF dissector and root
could have C dissector; multi-nic machine?).

But I totally see your point about confusion. See below.

> I'm not sure what's the better behavior here is, but maybe at least
> forcibly detach already attached ones, so when someone goes and tries
> to investigate, they will see that their BPF program is not attached
> anymore. Printing dmesg warning would be hugely useful here as well.

We can do for_each_net and detach non-root ones; that sounds
feasible and may avoid the confusion (at least when you query
non-root ns to see if the prog is still there, you get a valid
indication that it's not).

> Alternatively, if there is any per-net dissector attached, we might
> disallow root net dissector to be installed. Sort of "too late to the
> party" way, but at least not surprising to successfully installed
> dissectors.

We can do this as well.

> Thoughts?

Let me try to implement both of your suggestions and see which one makes
more sense. I'm leaning towards the latter (a simple check to see if any
non-root ns has the prog attached).

I'll follow up with a v2 if all goes well.

> > > > Cc: Petar Penkov <ppenkov@google.com>
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >  Documentation/bpf/prog_flow_dissector.rst |  3 +++
> > > >  net/core/flow_dissector.c                 | 11 ++++++++++-
> > > >  2 files changed, 13 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/Documentation/bpf/prog_flow_dissector.rst b/Documentation/bpf/prog_flow_dissector.rst
> > > > index a78bf036cadd..4d86780ab0f1 100644
> > > > --- a/Documentation/bpf/prog_flow_dissector.rst
> > > > +++ b/Documentation/bpf/prog_flow_dissector.rst
> > > > @@ -142,3 +142,6 @@ BPF flow dissector doesn't support exporting all the metadata that in-kernel
> > > >  C-based implementation can export. Notable example is single VLAN (802.1Q)
> > > >  and double VLAN (802.1AD) tags. Please refer to the ``struct bpf_flow_keys``
> > > >  for a set of information that's currently can be exported from the BPF context.
> > > > +
> > > > +When BPF flow dissector is attached to the root network namespace (machine-wide
> > > > +policy), users can't override it in their child network namespaces.
> > > > diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> > > > index 7c09d87d3269..494e2016fe84 100644
> > > > --- a/net/core/flow_dissector.c
> > > > +++ b/net/core/flow_dissector.c
> > > > @@ -115,6 +115,11 @@ int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
> > > >  	struct bpf_prog *attached;
> > > >  	struct net *net;
> > > >
> > > > +	if (rcu_access_pointer(init_net.flow_dissector_prog)) {
> > > > +		/* Can't override root flow dissector program */
> > > > +		return -EPERM;
> > > > +	}
> > >
> > > This is racy, shouldn't this be checked after grabbing a lock below?
> > What kind of race do you have in mind?
>
> I was thinking about the case of two competing attaches for root
> init_net, but it seems like we will double-check again under lock, so
> this is fine as is.
>
> >
> > Even if I put this check under the mutex, it's still possible that if
> > two cpus concurrently start attaching flow dissector programs (i.e. call
> > sys_bpf(BPF_PROG_ATTACH)) at the same time (one to root ns, the other
> > to non-root ns), the cpu that is attaching to non-root can grab mutex first,
> > pass all the checks and attach the prog (higher frequency, tubo boost, etc).
> >
> > The mutex is there to protect only against concurrent attaches to the
> > _same_ netns. For the sake of simplicity we have a global one instead
> > of a mutex per net-ns.
> >
> > So I'd rather not grab the mutex and keep it simple. Even in there is a
> > race, in __skb_flow_dissect we always check init_net first.
> >
> > > > +
> > > >  	net = current->nsproxy->net_ns;
> > > >  	mutex_lock(&flow_dissector_mutex);
> > > >  	attached = rcu_dereference_protected(net->flow_dissector_prog,
> > > > @@ -910,7 +915,11 @@ bool __skb_flow_dissect(const struct net *net,
> > > >  		WARN_ON_ONCE(!net);
> > > >  		if (net) {
> > > >  			rcu_read_lock();
> > > > -			attached = rcu_dereference(net->flow_dissector_prog);
> > > > +			attached =
> > > > +				rcu_dereference(init_net.flow_dissector_prog);
> > > > +
> > > > +			if (!attached)
> > > > +				attached = rcu_dereference(net->flow_dissector_prog);
> > > >
> > > >  			if (attached) {
> > > >  				struct bpf_flow_keys flow_keys;
> > > > --
> > > > 2.23.0.444.g18eeb5a265-goog
> > > >
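
For illustration only, here is a rough userspace sketch of the two
alternatives discussed above, next to the lookup order from the patch.
It is a toy model, not kernel code and not the eventual v2: the names
netns_model, model_lookup, model_attach and model_attach_root_evict are
hypothetical, slot 0 merely stands in for init_net, and returning -EBUSY
for a "too late" root attach is an assumption (the thread does not pick
an error code). There is no locking or RCU here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_NETNS 4	/* hypothetical fixed number of namespaces */
#define ROOT_NS   0	/* slot 0 stands in for init_net */

/* Hypothetical model: prog[i] != NULL means namespace i has a flow
 * dissector program attached.
 */
struct netns_model {
	const char *prog[MAX_NETNS];
};

/* Lookup order from the patch: root program first, then per-namespace. */
static const char *model_lookup(const struct netns_model *m, size_t ns)
{
	assert(ns < MAX_NETNS);
	return m->prog[ROOT_NS] ? m->prog[ROOT_NS] : m->prog[ns];
}

/* "Too late to the party" variant: a root attach is refused while any
 * non-root namespace still has a program attached.
 */
static int model_attach(struct netns_model *m, size_t ns, const char *prog)
{
	size_t i;

	assert(ns < MAX_NETNS && prog);
	if (m->prog[ROOT_NS])
		return -EPERM;	/* can't override root program (as in the patch) */
	if (m->prog[ns])
		return -EEXIST;	/* one program per namespace */
	if (ns == ROOT_NS)
		for (i = 1; i < MAX_NETNS; i++)
			if (m->prog[i])
				return -EBUSY;	/* assumed error code */
	m->prog[ns] = prog;
	return 0;
}

/* Forced-detach variant: a root attach evicts every non-root program, so
 * querying a child namespace afterwards shows nothing attached.
 */
static int model_attach_root_evict(struct netns_model *m, const char *prog)
{
	size_t i;

	assert(prog);
	if (m->prog[ROOT_NS])
		return -EPERM;
	for (i = 1; i < MAX_NETNS; i++)
		m->prog[i] = NULL;	/* stands in for a for_each_net detach */
	m->prog[ROOT_NS] = prog;
	return 0;
}
```

In this model, attaching to namespace 1 first makes a later
model_attach() to ROOT_NS fail, while model_attach_root_evict() succeeds
and leaves namespace 1 empty; either way, child attaches are refused
once the root program is in place.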