Re: [RFC bpf-next 0/7] Programming socket lookup with BPF

From: Jakub Sitnicki <jakub@cloudflare.com>
To: Florian Westphal <fw@strlen.de>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
Date: Wed, 19 Jun 2019 11:13:48 +0200
Message-ID: <87sgs6ey43.fsf@cloudflare.com>
In-Reply-To: <20190618135258.spo6c457h6dfknt2@breakpoint.cc>

Hey Florian,

Thanks for taking a look at it.

On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>>    find the listening socket to check for SYN cookies with TPROXY redirect.
>
> Sorry for the question, but where is the problem?
> (i.e., is it with TPROXY or bpf side)?

The way I see it, the problem is that the mappings for steering traffic
into sockets are split between two places: (1) the socket lookup tables,
and (2) the TPROXY rules.

BPF programs that need to check if there is a socket the packet is
destined for have access to the socket lookup tables, via the mentioned
bpf_sk_lookup helper, but are unaware of TPROXY redirects.

For TCP we're able to look up from BPF if there are any established,
request, and "normal" listening sockets. The listening sockets that
receive connections via TPROXY are invisible to BPF progs.

Why are we interested in finding all listening sockets? To check if any
of them had a SYN queue overflow recently and whether we should honor
SYN cookies.

>>  - TPROXY takes a reference to the listening socket on dispatch, which
>>    raises lock contention concerns.
>
> FWIW this could be avoided in similar way as to how we handle noref dsts.
>
> The only reason we need to take the reference at the moment is because
> once skb leaves the TPROXY target hook, the skb could leave rcu
> protection as well at some point (nfqueue for example).
>
> Maybe it's even enough to move reference taking to nfqueue and add
> 'noref' destructor, that would allow skb_steal_sock to propagate
> refcounted value in __inet_lookup_skb.
>
> So, at least for this part I don't see a technical reason why this
> has to grab a reference for listener socket.

That's helpful, thanks! We rely on TPROXY, so I would like to help with
that. Let me see if I can get time to work on it.

>
>>  - Traffic steering configuration is split over several iptables rules, at
>>    least one per service, which makes configuration changes error prone.
>
> Could you perhaps sketch an example ruleset (doesn't have to be complete
> nor parse-able by iptables-restore), I would just like to understand if
> there is any room for improvement on netfilter/iptables/nft side.

Happy to. Scenarios that are of interest to us:

1) Port sharing, while accepting on a set of subnets
   (same as the demo BPF prog from cover letter)

  ip route add local 192.0.2.0/24 dev lo
  ip route add local 198.51.100.0/24 dev lo
  ip route add local 203.0.113.0/24 dev lo

  ipset create net1 hash:net
  ipset create net2 hash:net
  ipset create net3 hash:net

  ipset add net1 192.0.2.0/24
  ipset add net2 198.51.100.0/24
  ipset add net3 203.0.113.0/24

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net1 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=81

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net2 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=82

2) Receiving on all ports, except some

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=81

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=1

3) Steering part of the traffic to a different socket (A/B testing)

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -m statistic --mode random --probability 0.01 \
           -j TPROXY --on-ip=127.0.0.1 --on-port=2

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=1
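
To make the dependence on rule order in scenarios 1 and 2 concrete, here
is a minimal Python model of the dispatch those rules encode. It is only
a sketch: it assumes rules are evaluated top to bottom and that the first
matching TPROXY rule decides the outcome. The names RULES and steer() are
made up for illustration, and the random split from scenario 3 is left
out.

```python
import ipaddress

# Each entry: (destination network, destination port or None for "any
# port", port given to TPROXY --on-port). Order matters: first match wins.
RULES = [
    ("192.0.2.0/24", 80, 81),     # scenario 1, net1
    ("198.51.100.0/24", 80, 82),  # scenario 1, net2
    ("203.0.113.0/24", 80, 81),   # scenario 2, port 80 exception on net3
    ("203.0.113.0/24", None, 1),  # scenario 2, every other port on net3
]


def steer(dst_ip, dst_port, rules=RULES):
    """Return the local port the first matching rule redirects to, or None."""
    addr = ipaddress.ip_address(dst_ip)
    for net, port, on_port in rules:
        # A rule with port None matches any destination port.
        if addr in ipaddress.ip_network(net) and port in (None, dst_port):
            return on_port
    return None
```

In this model, swapping the last two entries sends port 80 traffic for
203.0.113.0/24 to port 1 instead of 81, which is the kind of ordering
mistake that is easy to make when the rules live in separate places.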

One thing I haven't touched on in the cover letter is that to use TPROXY
you need to set IP_TRANSPARENT on the listening socket. This requires
that your process runs with CAP_NET_RAW or CAP_NET_ADMIN, or that you
get the socket from systemd.
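
For illustration, a minimal Python sketch of the privileged step. The
helper name enable_transparent() is hypothetical, and the fallback value
19 is the IP_TRANSPARENT constant from <linux/in.h>, used only if the
Python build does not export it. The sketch assumes a Linux host.

```python
import socket

# IP_TRANSPARENT is 19 in <linux/in.h>; fall back to that value if this
# Python build does not expose the constant.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)


def enable_transparent(sock):
    """Try to set IP_TRANSPARENT; return True on success, False if refused."""
    try:
        sock.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
    except OSError:
        # EPERM is the expected failure without CAP_NET_RAW or
        # CAP_NET_ADMIN. This sketch treats any other failure (for
        # example an unsupported platform) the same way.
        return False
    return True
```

Calling enable_transparent() on a freshly created TCP socket from an
unprivileged process is expected to return False, which is why the
service either needs one of the capabilities or has to be handed an
already configured socket.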

I haven't been able to explain why the process needs to be privileged to
receive traffic steered with TPROXY, but it turns out to be a pain point
too. We end up having to lock down the service to ensure it doesn't use
the elevated privileges for anything other than setting IP_TRANSPARENT.

Thanks,
Jakub