From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH, MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 57B70C2BB1D for ; Mon, 16 Mar 2020 10:08:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 1AEA62051A for ; Mon, 16 Mar 2020 10:08:27 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=cloudflare.com header.i=@cloudflare.com header.b="a3EEd1Zo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730531AbgCPKI0 (ORCPT ); Mon, 16 Mar 2020 06:08:26 -0400 Received: from mail-lj1-f196.google.com ([209.85.208.196]:39890 "EHLO mail-lj1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730430AbgCPKI0 (ORCPT ); Mon, 16 Mar 2020 06:08:26 -0400 Received: by mail-lj1-f196.google.com with SMTP id f10so17905097ljn.6 for ; Mon, 16 Mar 2020 03:08:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cloudflare.com; s=google; h=references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=wiPq37TQcmv/j8DXzF82UVls7gLD7nfxkRjphi/xnhQ=; b=a3EEd1ZoH77ZtJyojDDciIWfx0K+gfSuEQa80gQ+k9Sg8vcpLF7kfuymsl3fY+hn/c H/+NGRu4McYttlAAxFgjWm8cBMmHB5Z34LCMHAJ7ORFSErJ6R/1TKQS/AytaUhI7dJzF X386bLZfXrpxs25+7f4ahKvTTldEq3bT+ysmM= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=wiPq37TQcmv/j8DXzF82UVls7gLD7nfxkRjphi/xnhQ=; b=sRFrXqhzHW1RIk4u7GQIRHntcb5eaNpA+tF5jGJ2699lSAg6VT+5gCyfr/mFpPR2dO skWIvc6RMZbnEnWWChYegPGcgChWzCCUNShrF0HoLpCekFd/12iQuufnbZ1i9Fy7Hi14 SEH/aB1pwcWmWhSOLxGyUYtz7Q/+8n99tbWO8caVagmaGSlVlhEHiKGdL2+RQpK/1bkp 4q//552UdvpggXVktLzAm3Mu4RA802DO+9OU0DaUAhyxWjEROiALVftpjPwJ5tfdjr0t 9t3yTh83Ju7fEOiC9Q0D/9pGDI9mJEPFbdigBnD1A5BDMFioNLa3r+DB0YoEFddlgW5D /VHg== X-Gm-Message-State: ANhLgQ3ViSHI6f370vUuu+aALrYKnqTJmyAxdIXSrWB99DZ8eaXHdrE+ FCFAAcVnbYoVSqFtoSkXFqc9MA== X-Google-Smtp-Source: ADFU+vu18xobEfN8sFNIDLEVd+xzp0/BQLPfyDuKUUEutLRlyTQzFrs6hMVm7lfKDA5ZB1kP74e5Qw== X-Received: by 2002:a2e:b88d:: with SMTP id r13mr15140800ljp.66.1584353299513; Mon, 16 Mar 2020 03:08:19 -0700 (PDT) Received: from cloudflare.com ([2a02:a310:c262:aa00:b35e:8938:2c2a:ba8b]) by smtp.gmail.com with ESMTPSA id y18sm34344515ljm.93.2020.03.16.03.08.18 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 Mar 2020 03:08:18 -0700 (PDT) References: <20200312233648.1767-1-joe@wand.net.nz> <20200312233648.1767-4-joe@wand.net.nz> User-agent: mu4e 1.1.0; emacs 26.3 From: Jakub Sitnicki To: Joe Stringer Cc: bpf@vger.kernel.org, netdev@vger.kernel.org, daniel@iogearbox.net, ast@kernel.org, eric.dumazet@gmail.com, lmb@cloudflare.com, Florian Westphal Subject: Re: [PATCH bpf-next 3/7] bpf: Add socket assign support In-reply-to: <20200312233648.1767-4-joe@wand.net.nz> Date: Mon, 16 Mar 2020 11:08:17 +0100 Message-ID: <87mu8gy5m6.fsf@cloudflare.com> MIME-Version: 1.0 Content-Type: text/plain Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org [+CC Florian] Hey Joe, On Fri, Mar 13, 2020 at 12:36 AM CET, Joe Stringer wrote: > Add support for TPROXY via a new bpf helper, bpf_sk_assign(). > > This helper requires the BPF program to discover the socket via a call > to bpf_sk*_lookup_*(), then pass this socket to the new helper. The > helper takes its own reference to the socket in addition to any existing > reference that may or may not currently be obtained for the duration of > BPF processing. For the destination socket to receive the traffic, the > traffic must be routed towards that socket via local route, the socket > must have the transparent option enabled out-of-band, and the socket > must not be closing. If all of these conditions hold, the socket will be > assigned to the skb to allow delivery to the socket. My impression from the last time we have been discussing TPROXY is that the check for IP_TRANSPARENT on ingress doesn't serve any purpose [0]. The socket option only has effect on output, when there is a need to source traffic from a non-local address. Setting IP_TRANSPARENT requires CAP_NET_{RAW|ADMIN}, which grant a wider range of capabilities than needed to build a transparent proxy app. This is problematic because you to lock down your application with seccomp. It seems it should be enough to use a port number from a privileged range, if you want to ensure that only the designed process can receive the proxied traffic. Or, alternatively, instead of using socket lookup + IP_TRANSPARENT check, get the socket from sockmap and apply control to who can update the BPF map. Thanks, -jkbs [0] https://lore.kernel.org/bpf/20190621125155.2sdw7pugepj3ityx@breakpoint.cc/ > > The recently introduced dst_sk_prefetch is used to communicate from the > TC layer to the IP receive layer that the socket should be retained > across the receive. The dst_sk_prefetch destination wraps any existing > destination (if available) and stores it temporarily in a per-cpu var. > > To ensure that no dst references held by the skb prior to sk_assign() > are lost, they are stored in the per-cpu variable associated with > dst_sk_prefetch. When the BPF program invocation from the TC action > completes, we check the return code against TC_ACT_OK and if any other > return code is used, we restore the dst to avoid unintentionally leaking > the reference held in the per-CPU variable. If the packet is cloned or > dropped before reaching ip{,6}_rcv_core(), the original dst will also be > restored from the per-cpu variable to avoid the leak; if the packet makes > its way to the receive function for the protocol, then the destination > (if any) will be restored to the packet at that point. > > Signed-off-by: Joe Stringer > --- > include/uapi/linux/bpf.h | 23 ++++++++++++++++++++++- > net/core/filter.c | 28 ++++++++++++++++++++++++++++ > net/core/skbuff.c | 3 +++ > net/ipv4/ip_input.c | 5 ++++- > net/ipv6/ip6_input.c | 5 ++++- > net/sched/act_bpf.c | 3 +++ > tools/include/uapi/linux/bpf.h | 18 +++++++++++++++++- > 7 files changed, 81 insertions(+), 4 deletions(-) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 40b2d9476268..35f282cc745e 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -2914,6 +2914,26 @@ union bpf_attr { > * of sizeof(struct perf_branch_entry). > * > * **-ENOENT** if architecture does not support branch records. > + * > + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags) > + * Description > + * Assign the *sk* to the *skb*. When combined with appropriate > + * routing configuration to receive the packet towards the socket, > + * will cause *skb* to be delivered to the specified socket. > + * Subsequent redirection of *skb* via **bpf_redirect**\ (), > + * **bpf_clone_redirect**\ () or other methods outside of BPF may > + * interfere with successful delivery to the socket. > + * > + * This operation is only valid from TC ingress path. > + * > + * The *flags* argument must be zero. > + * Return > + * 0 on success, or a negative errno in case of failure. > + * > + * * **-EINVAL** Unsupported flags specified. > + * * **-EOPNOTSUPP**: Unsupported operation, for example a > + * call from outside of TC ingress. > + * * **-ENOENT** The socket cannot be assigned. > */ > #define __BPF_FUNC_MAPPER(FN) \ > FN(unspec), \ > @@ -3035,7 +3055,8 @@ union bpf_attr { > FN(tcp_send_ack), \ > FN(send_signal_thread), \ > FN(jiffies64), \ > - FN(read_branch_records), > + FN(read_branch_records), \ > + FN(sk_assign), > > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > * function eBPF program intends to call > diff --git a/net/core/filter.c b/net/core/filter.c > index cd0a532db4e7..bae0874289d8 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -5846,6 +5846,32 @@ static const struct bpf_func_proto bpf_tcp_gen_syncookie_proto = { > .arg5_type = ARG_CONST_SIZE, > }; > > +BPF_CALL_3(bpf_sk_assign, struct sk_buff *, skb, struct sock *, sk, u64, flags) > +{ > + if (flags != 0) > + return -EINVAL; > + if (!skb_at_tc_ingress(skb)) > + return -EOPNOTSUPP; > + if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt))) > + return -ENOENT; > + > + skb_orphan(skb); > + skb->sk = sk; > + skb->destructor = sock_edemux; > + dst_sk_prefetch_store(skb); > + > + return 0; > +} > + > +static const struct bpf_func_proto bpf_sk_assign_proto = { > + .func = bpf_sk_assign, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_CTX, > + .arg2_type = ARG_PTR_TO_SOCK_COMMON, > + .arg3_type = ARG_ANYTHING, > +}; > + > #endif /* CONFIG_INET */ > > bool bpf_helper_changes_pkt_data(void *func) > @@ -6139,6 +6165,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > return &bpf_skb_ecn_set_ce_proto; > case BPF_FUNC_tcp_gen_syncookie: > return &bpf_tcp_gen_syncookie_proto; > + case BPF_FUNC_sk_assign: > + return &bpf_sk_assign_proto; > #endif > default: > return bpf_base_func_proto(func_id); > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 6b2798450fd4..80ee8f7b6a19 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -63,6 +63,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -1042,6 +1043,7 @@ EXPORT_SYMBOL_GPL(alloc_skb_for_msg); > */ > void skb_dst_drop(struct sk_buff *skb) > { > + dst_sk_prefetch_reset(skb); > if (skb->_skb_refdst) { > refdst_drop(skb->_skb_refdst); > skb->_skb_refdst = 0UL; > @@ -1466,6 +1468,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask) > n->fclone = SKB_FCLONE_UNAVAILABLE; > } > > + dst_sk_prefetch_reset(skb); > return __skb_clone(n, skb); > } > EXPORT_SYMBOL(skb_clone); > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c > index aa438c6758a7..9bd4858d20fc 100644 > --- a/net/ipv4/ip_input.c > +++ b/net/ipv4/ip_input.c > @@ -509,7 +509,10 @@ static struct sk_buff *ip_rcv_core(struct sk_buff *skb, struct net *net) > IPCB(skb)->iif = skb->skb_iif; > > /* Must drop socket now because of tproxy. */ > - skb_orphan(skb); > + if (skb_dst_is_sk_prefetch(skb)) > + dst_sk_prefetch_fetch(skb); > + else > + skb_orphan(skb); > > return skb; > > diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c > index 7b089d0ac8cd..f7b42adca9d0 100644 > --- a/net/ipv6/ip6_input.c > +++ b/net/ipv6/ip6_input.c > @@ -285,7 +285,10 @@ static struct sk_buff *ip6_rcv_core(struct sk_buff *skb, struct net_device *dev, > rcu_read_unlock(); > > /* Must drop socket now because of tproxy. */ > - skb_orphan(skb); > + if (skb_dst_is_sk_prefetch(skb)) > + dst_sk_prefetch_fetch(skb); > + else > + skb_orphan(skb); > > return skb; > err: > diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c > index 46f47e58b3be..b4c557e6158d 100644 > --- a/net/sched/act_bpf.c > +++ b/net/sched/act_bpf.c > @@ -11,6 +11,7 @@ > #include > #include > > +#include > #include > #include > #include > @@ -53,6 +54,8 @@ static int tcf_bpf_act(struct sk_buff *skb, const struct tc_action *act, > bpf_compute_data_pointers(skb); > filter_res = BPF_PROG_RUN(filter, skb); > } > + if (filter_res != TC_ACT_OK) > + dst_sk_prefetch_reset(skb); > rcu_read_unlock(); > > /* A BPF program may overwrite the default action opcode. > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index 40b2d9476268..546e9e1368ff 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -2914,6 +2914,21 @@ union bpf_attr { > * of sizeof(struct perf_branch_entry). > * > * **-ENOENT** if architecture does not support branch records. > + * > + * int bpf_sk_assign(struct sk_buff *skb, struct bpf_sock *sk, u64 flags) > + * Description > + * Assign the *sk* to the *skb*. > + * > + * This operation is only valid from TC ingress path. > + * > + * The *flags* argument must be zero. > + * Return > + * 0 on success, or a negative errno in case of failure. > + * > + * * **-EINVAL** Unsupported flags specified. > + * * **-EOPNOTSUPP**: Unsupported operation, for example a > + * call from outside of TC ingress. > + * * **-ENOENT** The socket cannot be assigned. > */ > #define __BPF_FUNC_MAPPER(FN) \ > FN(unspec), \ > @@ -3035,7 +3050,8 @@ union bpf_attr { > FN(tcp_send_ack), \ > FN(send_signal_thread), \ > FN(jiffies64), \ > - FN(read_branch_records), > + FN(read_branch_records), \ > + FN(sk_assign), > > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > * function eBPF program intends to call