netdev.vger.kernel.org archive mirror
From: Jakub Sitnicki <jakub@cloudflare.com>
To: Florian Westphal <fw@strlen.de>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
Date: Wed, 19 Jun 2019 11:13:48 +0200
Message-ID: <87sgs6ey43.fsf@cloudflare.com>
In-Reply-To: <20190618135258.spo6c457h6dfknt2@breakpoint.cc>

Hey Florian,

Thanks for taking a look at it.

On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>>    find the listening socket to check for SYN cookies with TPROXY redirect.
>
> Sorry for the question, but where is the problem?
> (i.e., is it with TPROXY or bpf side)?

The way I see it, the problem is that the mappings for steering traffic
into sockets are split between two places: (1) the socket lookup tables,
and (2) the TPROXY rules.

BPF programs that need to check if there is a socket the packet is
destined for have access to the socket lookup tables, via the mentioned
bpf_sk_lookup helper, but are unaware of TPROXY redirects.

For TCP we're able to look up from BPF if there are any established,
request, and "normal" listening sockets. The listening sockets that
receive connections via TPROXY are invisible to BPF progs.

Why are we interested in finding all listening sockets? To check if any
of them had SYN queue overflow recently and if we should honor SYN
cookies.
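
To make the split concrete, here is a toy userspace model in C. It is
not kernel or BPF code, and every name in it (struct listener, struct
tproxy_rule, has_listener, has_listener_via_tproxy) is hypothetical. It
assumes one listener bound to 127.0.0.1:81 and one TPROXY rule that
redirects 192.0.2.0/24 port 80 to it. A lookup that consults only the
socket table misses the listener; one that also applies the redirect
finds it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
/* Host-order IPv4 address, good enough for a toy model. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* (1) Socket lookup table: what a socket-table-only lookup can see. */
struct listener {
	uint32_t addr;		/* 0 means bound to any address */
	uint16_t port;
};

/* (2) TPROXY rule: destination net/port mapped to a local listener. */
struct tproxy_rule {
	uint32_t net, mask;
	uint16_t dport;
	uint32_t on_ip;
	uint16_t on_port;
};

static const struct listener listeners[] = {
	{ IP4(127, 0, 0, 1), 81 },
};

static const struct tproxy_rule tproxy_rules[] = {
	{ IP4(192, 0, 2, 0), 0xffffff00u, 80, IP4(127, 0, 0, 1), 81 },
};

/* Lookup that only consults the socket table. */
static bool has_listener(uint32_t daddr, uint16_t dport)
{
	for (size_t i = 0; i < ARRAY_SIZE(listeners); i++) {
		const struct listener *l = &listeners[i];

		if (l->port == dport && (l->addr == 0 || l->addr == daddr))
			return true;
	}
	return false;
}

/* Lookup that applies the first matching TPROXY redirect first. */
static bool has_listener_via_tproxy(uint32_t daddr, uint16_t dport)
{
	for (size_t i = 0; i < ARRAY_SIZE(tproxy_rules); i++) {
		const struct tproxy_rule *r = &tproxy_rules[i];

		if ((daddr & r->mask) == r->net && dport == r->dport)
			return has_listener(r->on_ip, r->on_port);
	}
	return has_listener(daddr, dport);
}
```

For a SYN to 192.0.2.1:80, has_listener() reports no socket while
has_listener_via_tproxy() finds the one on 127.0.0.1:81, which is the
listener whose SYN queue state we would want to check.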

>>  - TPROXY takes a reference to the listening socket on dispatch, which
>>    raises lock contention concerns.
>
> FWIW this could be avoided in similar way as to how we handle noref dsts.
>
> The only reason we need to take the reference at the moment is because
> once skb leaves the TPROXY target hook, the skb could leave rcu
> protection as well at some point (nfqueue for example).
>
> Maybe it's even enough to move reference taking to nfqueue and add
> 'noref' destructor, that would allow skb_steal_sock to propagate
> refcounted value in __inet_lookup_skb.
>
> So, at least for this part I don't see a technical reason why this
> has to grab a reference for listener socket.

That's helpful, thanks! We rely on TPROXY, so I would like to help with
that. Let me see if I can get time to work on it.

>
>>  - Traffic steering configuration is split over several iptables rules, at
>>    least one per service, which makes configuration changes error prone.
>
> Could you perhaps sketch an example ruleset (doesn't have to be complete
> nor parse-able by iptables-restore), I would just like to understand if
> there is any room for improvement on netfilter/iptables/nft side.

Happy to. Scenarios that are of interest to us:

1) Port sharing, while accepting on a set of subnets
   (same as the demo BPF prog from cover letter)

  ip route add local 192.0.2.0/24 dev lo
  ip route add local 198.51.100.0/24 dev lo
  ip route add local 203.0.113.0/24 dev lo

  ipset create net1 hash:net
  ipset create net2 hash:net
  ipset create net3 hash:net

  ipset add net1 192.0.2.0/24
  ipset add net2 198.51.100.0/24
  ipset add net3 203.0.113.0/24

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net1 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=81

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net2 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=82

2) Receiving on all ports, except some

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=81

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=1

3) Steering part of the traffic to a different socket (A/B testing)

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -m statistic --mode random --probability 0.01 \
           -j TPROXY --on-ip=127.0.0.1 --on-port=2

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=1
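
Scenarios 2 and 3 depend on rule order: the more specific rule has to
come before the catch-all. A minimal sketch of that dependence in plain
C, assuming that the first matching TPROXY rule decides where the packet
goes. The names struct steer_rule and steer() are made up for this
illustration, and the net3 match and the statistic match are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* One steering rule: match on destination port, redirect to on_port. */
struct steer_rule {
	uint16_t dport;		/* 0 matches any destination port */
	uint16_t on_port;	/* local port of the receiving socket */
};

/* Scenario 2 as written above: port 80 rule first, catch-all second. */
static const struct steer_rule scenario2[] = {
	{ 80, 81 },
	{ 0, 1 },
};

/* Same rules in the wrong order: the catch-all shadows port 80. */
static const struct steer_rule scenario2_swapped[] = {
	{ 0, 1 },
	{ 80, 81 },
};

/* Return the on_port of the first matching rule, or 0 if none match. */
static uint16_t steer(const struct steer_rule *rules, size_t n,
		      uint16_t dport)
{
	for (size_t i = 0; i < n; i++) {
		if (rules[i].dport == 0 || rules[i].dport == dport)
			return rules[i].on_port;
	}
	return 0;
}
```

With scenario2, port 80 traffic lands on port 81 and everything else on
port 1. With scenario2_swapped, port 80 traffic also lands on port 1,
which is the kind of mistake that is easy to make when rules for many
services are appended independently.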

One thing I haven't touched on in the cover letter is that to use TPROXY
you need to set IP_TRANSPARENT on the listening socket. This requires
that your process runs with CAP_NET_RAW or CAP_NET_ADMIN, or that you
get the socket from systemd.

I haven't been able to explain why the process needs to be privileged to
receive traffic steered with TPROXY, but it turns out to be a pain point
too. We end up having to lock down the service to ensure it doesn't use
the elevated privileges for anything other than setting IP_TRANSPARENT.
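
For reference, a minimal sketch of the privileged step, assuming Linux
and glibc headers. The helper name open_transparent_socket() is
hypothetical. Binding and listening are omitted, since only the
setsockopt() call needs the capability.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef IP_TRANSPARENT
#define IP_TRANSPARENT 19	/* fallback, value from <linux/in.h> */
#endif

/*
 * Create a TCP socket and mark it transparent so that it can receive
 * connections steered by TPROXY.
 *
 * Returns the socket fd on success or a negative errno on failure.
 * Without CAP_NET_RAW or CAP_NET_ADMIN the setsockopt() call is
 * expected to fail with EPERM.
 */
static int open_transparent_socket(void)
{
	int one = 1;
	int fd, err;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	if (setsockopt(fd, IPPROTO_IP, IP_TRANSPARENT, &one, sizeof(one)) < 0) {
		err = errno;
		close(fd);
		return -err;
	}

	return fd;
}
```

A service could call this once at startup and drop the capability right
after, which is roughly the lock-down described above.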

Thanks,
Jakub

