From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 72D9EC43331 for ; Sat, 28 Mar 2020 01:31:04 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 45FF92076B for ; Sat, 28 Mar 2020 01:31:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727191AbgC1BbD (ORCPT ); Fri, 27 Mar 2020 21:31:03 -0400 Received: from www62.your-server.de ([213.133.104.62]:54556 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726212AbgC1BbD (ORCPT ); Fri, 27 Mar 2020 21:31:03 -0400 Received: from sslproxy01.your-server.de ([78.46.139.224]) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1jI0JL-0002SR-OL; Sat, 28 Mar 2020 02:30:55 +0100 Received: from [178.195.186.98] (helo=pc-9.home) by sslproxy01.your-server.de with esmtpsa (TLSv1.3:TLS_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jI0JL-000W9f-Fi; Sat, 28 Mar 2020 02:30:55 +0100 Subject: Re: [PATCH bpf-next 3/7] bpf: add netns cookie and enable it for bpf cgroup hooks To: Andrii Nakryiko Cc: Alexei Starovoitov , Martynas Pumputis , Joe Stringer , bpf , Networking References: From: Daniel Borkmann Message-ID: <6c712f1a-60bc-d5c6-d6c3-311bb4df9faf@iogearbox.net> Date: Sat, 28 Mar 2020 02:30:54 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.2/25764/Fri Mar 27 14:11:26 2020) Sender: bpf-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On 3/28/20 1:32 AM, Andrii Nakryiko wrote: > On Fri, Mar 27, 2020 at 8:59 AM Daniel Borkmann wrote: >> >> In Cilium we're mainly using BPF cgroup hooks today in order to implement >> kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), >> ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic >> between Cilium managed nodes. While this works in its current shape and avoids >> packet-level NAT for inter Cilium managed node traffic, there is one major >> limitation we're facing today, that is, lack of netns awareness. >> >> In Kubernetes, the concept of Pods (which hold one or multiple containers) >> has been built around network namespaces, so while we can use the global scope >> of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing >> NodePort ports on loopback addresses), we also have the need to differentiate >> between initial network namespaces and non-initial one. For example, ExternalIP >> services mandate that non-local service IPs are not to be translated from the >> host (initial) network namespace as one example. Right now, we have an ugly >> work-around in place where non-local service IPs for ExternalIP services are >> not xlated from connect() and friends BPF hooks but instead via less efficient >> packet-level NAT on the veth tc ingress hook for Pod traffic. >> >> On top of determining whether we're in initial or non-initial network namespace >> we also have a need for a socket-cookie like mechanism for network namespaces >> scope. Socket cookies have the nice property that they can be combined as part >> of the key structure e.g. for BPF LRU maps without having to worry that the >> cookie could be recycled. We are planning to use this for our sessionAffinity >> implementation for services. Therefore, add a new bpf_get_netns_cookie() helper >> which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would >> provide the cookie for the initial network namespace while passing the context >> instead of NULL would provide the cookie from the application's network namespace. >> We're using a hole, so no size increase; the assignment happens only once. >> Therefore this allows for a comparison on initial namespace as well as regular >> cookie usage as we have today with socket cookies. We could later on enable >> this helper for other program types as well as we would see need. >> >> (*) Both externalTrafficPolicy={Local|Cluster} types >> [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c >> >> Signed-off-by: Daniel Borkmann >> --- >> include/linux/bpf.h | 1 + >> include/net/net_namespace.h | 10 +++++++++ >> include/uapi/linux/bpf.h | 16 ++++++++++++++- >> kernel/bpf/verifier.c | 16 +++++++++------ >> net/core/filter.c | 37 ++++++++++++++++++++++++++++++++++ >> net/core/net_namespace.c | 15 ++++++++++++++ >> tools/include/uapi/linux/bpf.h | 16 ++++++++++++++- >> 7 files changed, 103 insertions(+), 8 deletions(-) >> > > [...] > >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c >> index 745f3cfdf3b2..cb30b34d1466 100644 >> --- a/kernel/bpf/verifier.c >> +++ b/kernel/bpf/verifier.c >> @@ -3461,13 +3461,17 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno, >> expected_type = CONST_PTR_TO_MAP; >> if (type != expected_type) >> goto err_type; >> - } else if (arg_type == ARG_PTR_TO_CTX) { >> + } else if (arg_type == ARG_PTR_TO_CTX || >> + arg_type == ARG_PTR_TO_CTX_OR_NULL) { >> expected_type = PTR_TO_CTX; >> - if (type != expected_type) >> - goto err_type; >> - err = check_ctx_reg(env, reg, regno); >> - if (err < 0) >> - return err; >> + if (!(register_is_null(reg) && >> + arg_type == ARG_PTR_TO_CTX_OR_NULL)) { > > Other parts of check_func_arg() have different pattern that doesn't > negate this complicated condition: > > if (register_is_null(reg) && arg_type == ARG_PTR_TO_CTX_OR_NULL) > ; > else { > ... > } > > It's marginally easier to follow, though still increases nestedness :( Yeah, that looks a bit ugly tbh, but perhaps personal style/preference. I tend to avoid such empty bodies. >> + if (type != expected_type) >> + goto err_type; >> + err = check_ctx_reg(env, reg, regno); >> + if (err < 0) >> + return err; >> + } >> } else if (arg_type == ARG_PTR_TO_SOCK_COMMON) { >> expected_type = PTR_TO_SOCK_COMMON; >> /* Any sk pointer can be ARG_PTR_TO_SOCK_COMMON */ > > [...] > >> +static const struct bpf_func_proto bpf_get_netns_cookie_sock_addr_proto = { >> + .func = bpf_get_netns_cookie_sock_addr, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_PTR_TO_CTX_OR_NULL, > > Just for completeness sake, have you considered using two helpers - > one accepting context and other accepting nothing instead of adding > ARG_PTR_TO_CTX_OR_NULL? Would it be too bad? I haven't considered it since it feels a bit too much churn to the helper side for just this simple addition. Both helpers would be doing the same and I would need to have it duplicated for sock_addr and sock once again given we need it in both connect et al and bind contexts. >> +}; >> + > > [...] > >> +static atomic64_t cookie_gen; >> + >> +u64 net_gen_cookie(struct net *net) >> +{ >> + while (1) { >> + u64 res = atomic64_read(&net->net_cookie); >> + >> + if (res) >> + return res; >> + res = atomic64_inc_return(&cookie_gen); >> + atomic64_cmpxchg(&net->net_cookie, 0, res); > > you'll always do extra atomic64_read, even if you succeed on the first > try. Why not: > > while (1) { > u64 res = atomic64_read(&net->net_cookie); > if (res) > return res; > res = atomic64_inc_return(&cookie_gen); > if (atomic64_cmpxchg(&net->net_cookie, 0, res) == 0) > return res; > } Right, though it's on same CPU, so might not make too much of a noticeable difference. I've used same scheme here as with socket cookies actually. I thought about consolidating the generator into a separate function for both as next step and reuse some of the ideas from get_next_ino() for batching to optimize the atomic op cost, which should make a difference, but will do as a separate patch. >> + } >> +} >> + > > [...] >