Bringing the SSL VPN data path back in-kernel

From: David Woodhouse <dwmw2@infradead.org>
To: David Miller <davem@davemloft.net>
Cc: mst@redhat.com, herbert@gondor.apana.org.au,
	eric.dumazet@gmail.com, jan.kiszka@siemens.com,
	netdev@vger.kernel.org, Jason Wang <jasowang@redhat.com>,
	Eugenio Perez Martin <eperezma@redhat.com>,
	ahmeddan@amazon.co.uk
Subject: Bringing the SSL VPN data path back in-kernel
Date: Fri, 25 Jun 2021 16:56:22 +0100
Message-ID: <291763b92bd198e867145b72d08dda1b5853e1af.camel@infradead.org>
In-Reply-To: <20150201.210716.588479604128207372.davem@davemloft.net>

Reviving an 11-year-old thread, which was 5 years old last time I
revived it, 6 years ago. But I figure this DaveM quote is a good place
to start:

On Sun, 2015-02-01 at 21:07 -0800, David Miller wrote:
> From: David Woodhouse <dwmw2@infradead.org>
> Date: Sun, 01 Feb 2015 21:29:43 +0000
> 
> > I really was looking for some way to push down something like an XFRM
> > state into the tun device and just say "shove them out here until I tell
> > you otherwise".
> 
> People decided to use TUN and push VPN stuff back into userspace,
> and there are repercussions for that decision.
> 
> I'm not saying this to be mean or whatever, but I was very
> disappointed when userland IPSEC solutions using TUN started showing
> up.
> 
> We might as well have not have implemented the IPSEC stack at all,
> because as a result of the userland VPN stuff our IPSEC stack is
> largely unused except by a very narrow group of users.

I periodically come back to optimising OpenConnect, bumping up against
the fact that *most* of its time is spent pointlessly copying packets
up to userspace and back, and thinking about how much I'd *love* to use
the kernel IPSEC stack.

This is one of those times, as I've just been playing with using
vhost-net for optimising the tun device access directly from userspace:
https://gitlab.com/openconnect/openconnect/-/compare/master...vhost

I hate the fact that userspace is in the data path; the XFRM packet
flow is pretty much exactly what we want for the VPN fast path. We just
need to work out how to glue it together.

To recap on the problem statement briefly: WireGuard is great and all,
but SSL VPNs solve a slightly different problem — they provide a
versatile client VPN which works *even* when you're stuck on a crappy
airport wifi and all you can establish is an HTTPS connection, and they
will *also* opportunistically upgrade to a datagram path based on
DTLS/ESP *if* they can.

We support a bunch of these protocols now — Cisco AnyConnect, Pulse
Secure, Palo Alto's GlobalProtect, Fortinet, etc., and we have an open
source server implementation of the Cisco one. All of them use either
DTLS or ESP for the datagram path¹, with a fairly simple header of a
few bytes in front of the packet inside the DTLS payload.

I do not think we can ditch the tun device completely. All the
complexity of keepalives and rekeying and multiplexing on the TCP & UDP
sockets needs to remain in userspace, and some packets *will* have to
go through userspace on the traditional 'tun' read/write path. And from
an administrative point of view the fact that VPN packets are seen as
ingress/egress on that specific device is useful, and is a security
model that users are used to.

So what I'd like to do is find a way to optimise the *fast* path of
DTLS/ESP by keeping those packets entirely in the kernel using XFRM.
Probably starting with ESP since that's what XFRM already supports.

Some design criteria...

 • VPN clients can currently run as an unprivileged user, using a
   *persistent* tun device which is set up for that user in advance.
   I would like to retain this model.

 • XFRM state (the SPIs in use, etc.) is NOT GLOBAL. The same SPI
   pairs can be in use multiple times over multiple different UDP
   sockets at the same time.

Currently, to set up ESP over UDP I think I have to set up my incoming
and outgoing state as global state (with globally unique SPIs?), then
add policies which cause certain packets to be encrypted/decrypted
using the corresponding state. For an ESP-in-UDP tunnel the xfrm_state
needs to be given the public src and dst {IP,port} pairs *and* I also
need to bind/connect a real UDP socket and use setsockopt(UDP_ENCAP) so
that *incoming* packets get fed to xfrm_input() to be handled. Right?
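
For reference, the socket half of that existing setup looks roughly like
the sketch below. enable_espinudp() is a made-up helper name; the
UDP_ENCAP option and the UDP_ENCAP_ESPINUDP value are the existing Linux
ones. The xfrm_state and policy are configured separately over netlink
and are not shown, and the call only succeeds on a kernel built with
XFRM support.

```c
#include <errno.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef UDP_ENCAP
#define UDP_ENCAP 100
#endif
#ifndef UDP_ENCAP_ESPINUDP
#define UDP_ENCAP_ESPINUDP 2
#endif

/*
 * Hypothetical helper: mark an already bound/connected UDP socket as
 * carrying ESP-in-UDP, so that the kernel diverts incoming ESP packets
 * on it to the XFRM input path instead of queueing them on the socket.
 * Returns 0 on success or a negative errno value.
 */
int enable_espinudp(int udp_fd)
{
	int encap = UDP_ENCAP_ESPINUDP;

	if (setsockopt(udp_fd, IPPROTO_UDP, UDP_ENCAP,
		       &encap, sizeof(encap)) < 0)
		return -errno;
	return 0;
}
```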

So... we don't want global state, we don't want generic xfrm policies.

Let's imagine a ULP-like sockopt (or sockopts) on the UDP socket which
provides all the information needed to build XFRM state for both
directions, *and* the fd of the tun device.

It generates a *private* xfrm_state using that information. No
privileges are required for this, since the user already has access to
the tun device and the UDP socket.

It registers its own ->encap_rcv() function on the UDP socket, which
sets up the skb and instead of calling xfrm_input() (which would try to
look up the xfrm_state in the global state table), it calls esp_input()
directly with the appropriate private xfrm_state. Or maybe we extend
the 'negative udp_encap' trick in xfrm_input() to make it bypass the
xfrm_state_lookup() and use the state it's told, but still following
the non-resume path? But that isn't really doing much *except* calling
esp_input() directly, which was my first suggestion.

I haven't quite worked out how, but then it needs to hook into the
xfrm_rcv_cb() or inner_mode handling, such that the decrypted skb is
handed back to us and we can do something vaguely equivalent to this
with it so that it appears to have been received on the tunnel:
  'skb->dev = tundev; netif_rx(skb);'

In the DTLS cases there is always a VPN protocol-specific header in the
encrypted payload. For Cisco it's a single byte, with zero meaning a
data packet and anything else being control stuff (keepalives, MTU
probes, etc.) to be handed up to userspace². So decrypted non-data
packets would ideally get fed back up to userspace on the UDP socket or
maybe some other fd. (Would have been nice in some ways if it was all
synchronous and we could just return an appropriate value from the UDP
->encap_rcv() function, but it's much too late by the time the packet
is decrypted).
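
To make that post-decryption decision concrete, here is a minimal
sketch of the Cisco case in plain C. The enum and function names are
invented for illustration, and it only encodes the rule stated above
(first byte zero means data, anything else is control); it says nothing
about where in the kernel such a check would live.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts for a decrypted DTLS payload. */
enum vpn_pkt_class {
	VPN_PKT_DROP,    /* too short to carry the one-byte header */
	VPN_PKT_DATA,    /* inner IP packet: inject on the tun device */
	VPN_PKT_CONTROL, /* keepalive, MTU probe, etc.: hand to userspace */
};

/*
 * Classify a decrypted Cisco-style payload by its leading type byte.
 * For data packets, *data_off is set to the offset of the inner IP
 * packet (just past the header); otherwise it is left untouched.
 */
enum vpn_pkt_class cisco_dtls_classify(const uint8_t *payload, size_t len,
				       size_t *data_off)
{
	if (len < 1)
		return VPN_PKT_DROP;
	if (payload[0] != 0x00)
		return VPN_PKT_CONTROL;
	*data_off = 1;
	return VPN_PKT_DATA;
}
```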

The transmit path is a bit simpler, taking outbound skbs from the tun
queue and feeding them to a variant of xfrm_output() which takes the
xfrm_state from the skb (like xfrm_input() does in the resume case).

For DTLS encapsulations, the inner encap could be prepended to the skb
before it's fed to xfrm_output(). We would *also* need to be able to
handle control messages from userspace (perhaps sendmsg() on the UDP
socket, or again perhaps on a different fd).
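
As an illustration of the "prepend the inner encap" step, the sketch
below models it in userspace C for the one-byte Cisco header. struct
pktbuf and pktbuf_push() are stand-ins invented here to mimic the
headroom handling that skb_push() gives you in the kernel; none of this
is existing kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy packet buffer with headroom, loosely modelled on an skb. */
struct pktbuf {
	uint8_t *head; /* start of the allocation */
	uint8_t *data; /* start of the current packet */
	size_t len;    /* bytes of packet starting at data */
};

/* Grow the packet by n bytes at the front; NULL if no headroom left. */
uint8_t *pktbuf_push(struct pktbuf *p, size_t n)
{
	if ((size_t)(p->data - p->head) < n)
		return NULL;
	p->data -= n;
	p->len += n;
	return p->data;
}

/*
 * Prepend the single-byte Cisco-style header marking a data packet
 * (type 0), ready to be handed on for encryption.
 * Returns 0 on success, -1 if there is no headroom.
 */
int cisco_dtls_encap_data(struct pktbuf *p)
{
	uint8_t *hdr = pktbuf_push(p, 1);

	if (!hdr)
		return -1;
	hdr[0] = 0x00;
	return 0;
}
```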

Does that sound at all sane? Clearly there are details I haven't worked
out yet, but it seems like it might be a good start. Or is there a
better way to do it?

I'm happy to be told I'm a moron, but prefer for such observations to
come with better suggestions. Where "better suggestions" should be both
anatomically possible *and*, ideally, actually solve the problem at
hand.

-- 
dwmw2

¹ (apart from the Array Networks one, which has an unencrypted mode 
   too, and also a legacy "some engineer rolled his own crypto" mode 
   which AFAICT was basically equivalent in security to the unencrypted
   mode but I had to stop looking because I was saying too many naughty
   words in front of the children).

² Presumably a BPF program makes that decision, unless we want to
  hard-code the specifics for different protocols in a kernel module.
  It *could* be done with a simpler "memcmp <these> bytes and expect
  the length at <offset> to match the packet length minus <delta>" but 
  expressing that in BPF is probably the better choice these days.
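
  For comparison, the simpler hard-coded matcher could look something
  like the sketch below. The struct layout and names are made up, and
  it assumes a 16-bit big-endian length field, which is an assumption
  rather than a property of any particular protocol.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-protocol rule describing a "data packet" header. */
struct data_match {
	const uint8_t *magic; /* bytes the header must contain */
	size_t magic_off;     /* where in the packet they must appear */
	size_t magic_len;     /* how many bytes to compare (may be 0) */
	size_t len_off;       /* offset of the 16-bit big-endian length */
	size_t len_delta;     /* field must equal packet length minus this */
};

/* True if pkt matches the fixed bytes and its length field is consistent. */
bool is_data_packet(const struct data_match *m, const uint8_t *pkt, size_t len)
{
	uint16_t field;

	if (m->magic_off > len || m->magic_len > len - m->magic_off)
		return false;
	if (m->magic_len &&
	    memcmp(pkt + m->magic_off, m->magic, m->magic_len) != 0)
		return false;
	if (len < 2 || m->len_off > len - 2 || m->len_delta > len)
		return false;

	field = (uint16_t)((pkt[m->len_off] << 8) | pkt[m->len_off + 1]);
	return field == len - m->len_delta;
}
```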
