From: Andrii Nakryiko <andrii.nakryiko@gmail.com> To: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>, Martynas Pumputis <m@lambda.lt>, Joe Stringer <joe@wand.net.nz>, bpf <bpf@vger.kernel.org>, Networking <netdev@vger.kernel.org> Subject: Re: [PATCH bpf-next 3/7] bpf: add netns cookie and enable it for bpf cgroup hooks Date: Fri, 27 Mar 2020 17:32:18 -0700 Message-ID: <CAEf4BzbZiZwfae+B2gu4WkWVRoVkLJUYhFf0rorx0jVCf_kiQw@mail.gmail.com> (raw) In-Reply-To: <c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net> On Fri, Mar 27, 2020 at 8:59 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > In Cilium we're mainly using BPF cgroup hooks today in order to implement > kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), > ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic > between Cilium managed nodes. While this works in its current shape and avoids > packet-level NAT for inter Cilium managed node traffic, there is one major > limitation we're facing today, that is, lack of netns awareness. > > In Kubernetes, the concept of Pods (which hold one or multiple containers) > has been built around network namespaces, so while we can use the global scope > of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing > NodePort ports on loopback addresses), we also have the need to differentiate > between initial network namespaces and non-initial one. For example, ExternalIP > services mandate that non-local service IPs are not to be translated from the > host (initial) network namespace as one example. Right now, we have an ugly > work-around in place where non-local service IPs for ExternalIP services are > not xlated from connect() and friends BPF hooks but instead via less efficient > packet-level NAT on the veth tc ingress hook for Pod traffic. > > On top of determining whether we're in initial or non-initial network namespace > we also have a need for a socket-cookie like mechanism for network namespaces > scope. Socket cookies have the nice property that they can be combined as part > of the key structure e.g. for BPF LRU maps without having to worry that the > cookie could be recycled. We are planning to use this for our sessionAffinity > implementation for services. Therefore, add a new bpf_get_netns_cookie() helper > which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would > provide the cookie for the initial network namespace while passing the context > instead of NULL would provide the cookie from the application's network namespace. > We're using a hole, so no size increase; the assignment happens only once. > Therefore this allows for a comparison on initial namespace as well as regular > cookie usage as we have today with socket cookies. We could later on enable > this helper for other program types as well as we would see need. > > (*) Both externalTrafficPolicy={Local|Cluster} types > [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> > --- > include/linux/bpf.h | 1 + > include/net/net_namespace.h | 10 +++++++++ > include/uapi/linux/bpf.h | 16 ++++++++++++++- > kernel/bpf/verifier.c | 16 +++++++++------ > net/core/filter.c | 37 ++++++++++++++++++++++++++++++++++ > net/core/net_namespace.c | 15 ++++++++++++++ > tools/include/uapi/linux/bpf.h | 16 ++++++++++++++- > 7 files changed, 103 insertions(+), 8 deletions(-) > [...] > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 745f3cfdf3b2..cb30b34d1466 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, > expected_type = CONST_PTR_TO_MAP; > if (type != expected_type) > goto err_type; > - } else if (arg_type == ARG_PTR_TO_CTX) { > + } else if (arg_type == ARG_PTR_TO_CTX || > + arg_type == ARG_PTR_TO_CTX_OR_NULL) { > expected_type = PTR_TO_CTX; > - if (type != expected_type) > - goto err_type; > - err = check_ctx_reg(env, reg, regno); > - if (err < 0) > - return err; > + if (!(register_is_null(reg) && > + arg_type == ARG_PTR_TO_CTX_OR_NULL)) { Other parts of check_func_arg() have different pattern that doesn't negate this complicated condition: if (register_is_null(reg) && arg_type == ARG_PTR_TO_CTX_OR_NULL) ; else { ... } It's marginally easier to follow, though still increases nestedness :( > + if (type != expected_type) > + goto err_type; > + err = check_ctx_reg(env, reg, regno); > + if (err < 0) > + return err; > + } > } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) { > expected_type = PTR_TO_SOCK_COMMON; > /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ [...] > +static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = { > + .func = bpf_get_netns_cookie_sock_addr, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, Just for completeness sake, have you considered using two helpers - one accepting context and other accepting nothing instead of adding ARG_PTR_TO_CTX_OR_NULL? Would it be too bad? > +}; > + [...] > +static atomic64_t cookie_gen; > + > +u64 net_gen_cookie(struct net *net) > +{ > + while (1) { > + u64 res = atomic64_read(&net->net_cookie); > + > + if (res) > + return res; > + res = atomic64_inc_return(&cookie_gen); > + atomic64_cmpxchg(&net->net_cookie, 0, res); you'll always do extra atomic64_read, even if you succeed on the first try. Why not: while (1) { u64 res = atomic64_read(&net->net_cookie); if (res) return res; res = atomic64_inc_return(&cookie_gen); if (atomic64_cmpxchg(&net->net_cookie, 0, res) == 0) return res; } > + } > +} > + [...]
next prev parent reply index Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-03-27 15:58 [PATCH bpf-next 0/7] Various improvements to cgroup helpers Daniel Borkmann 2020-03-27 15:58 ` [PATCH bpf-next 1/7] bpf: enable retrieval of socket cookie for bind/post-bind hook Daniel Borkmann 2020-03-27 15:58 ` [PATCH bpf-next 2/7] bpf: enable perf event rb output for bpf cgroup progs Daniel Borkmann 2020-03-27 15:58 ` [PATCH bpf-next 3/7] bpf: add netns cookie and enable it for bpf cgroup hooks Daniel Borkmann 2020-03-28 0:32 ` Andrii Nakryiko [this message] 2020-03-28 1:30 ` Daniel Borkmann 2020-03-28 1:48 ` Alexei Starovoitov 2020-03-28 2:16 ` Daniel Borkmann 2020-03-28 2:56 ` Alexei Starovoitov 2020-03-27 15:58 ` [PATCH bpf-next 4/7] bpf: allow to retrieve cgroup v1 classid from v2 hooks Daniel Borkmann 2020-03-28 0:41 ` Andrii Nakryiko 2020-03-28 1:56 ` Daniel Borkmann 2020-03-28 20:27 ` Andrii Nakryiko 2020-03-27 15:58 ` [PATCH bpf-next 5/7] bpf: enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id Daniel Borkmann 2020-03-28 0:43 ` Andrii Nakryiko 2020-03-27 15:58 ` [PATCH bpf-next 6/7] bpf: enable retrival of pid/tgid/comm from bpf cgroup hooks Daniel Borkmann 2020-03-28 0:49 ` Andrii Nakryiko 2020-03-28 1:40 ` Daniel Borkmann 2020-03-27 15:58 ` [PATCH bpf-next 7/7] bpf: add selftest cases for ctx_or_null argument type Daniel Borkmann 2020-03-28 0:51 ` Andrii Nakryiko
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=CAEf4BzbZiZwfae+B2gu4WkWVRoVkLJUYhFf0rorx0jVCf_kiQw@mail.gmail.com \ --to=andrii.nakryiko@gmail.com \ --cc=alexei.starovoitov@gmail.com \ --cc=bpf@vger.kernel.org \ --cc=daniel@iogearbox.net \ --cc=joe@wand.net.nz \ --cc=m@lambda.lt \ --cc=netdev@vger.kernel.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
BPF Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \ bpf@vger.kernel.org public-inbox-index bpf Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.bpf AGPL code for this site: git clone https://public-inbox.org/public-inbox.git