On Mon, 2021-06-28 at 12:23 +0100, David Woodhouse wrote:
>
> To be clear: from the point of view of my *application* I don't care
> about any of this; my only motivation here is to clean up the kernel
> behaviour and make life easier for potential future users. I have found
> a setup that works in today's kernels (even though I have to disable
> XDP, and have to use a virtio header that I don't want), and will stick
> with that for now, if I actually commit it to my master branch at all:
> https://gitlab.com/openconnect/openconnect/-/commit/0da4fe43b886403e6
>
> I might yet abandon it because I haven't *yet* seen it go any faster
> than the code which just does read()/write() on the tun device from
> userspace. And without XDP or zerocopy it's not clear that it could
> ever give me any benefit that I couldn't achieve purely in userspace by
> having a separate thread to do tun device I/O. But we'll see...

I managed to do some proper testing, between EC2 c5 (Skylake) virtual
instances.

The kernel on a c5.metal can transmit (AES128-SHA1) ESP at about
1.2Gb/s from iperf, as it seems to be doing it all from the iperf
thread.

Before I started messing with OpenConnect, it could transmit 1.6Gb/s.

When I pull in the 'stitched' AES+SHA code from OpenSSL instead of
doing the encryption and the HMAC in separate passes, I get to
2.1Gb/s.

Adding vhost support on top of that takes me to 2.46Gb/s, which is a
decent enough win. That's with OpenConnect taking 100% CPU, iperf3
taking 50% of another one, and the vhost kernel thread taking ~20%.
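
For anyone who wants to play with the 'stitched' mode: OpenSSL exposes
it through the EVP layer as the AES-128-CBC-HMAC-SHA1 "cipher". The
sketch below is the generic TLS-record-style usage of that interface,
not necessarily how it ends up wired into OpenConnect's ESP path, and
all of the key/buffer names are made up for illustration:

	#include <stddef.h>
	#include <openssl/evp.h>

	/*
	 * Minimal sketch of OpenSSL's stitched AES-128-CBC + HMAC-SHA1
	 * cipher, driven through the EVP interface in the generic
	 * TLS-record style. Key, IV and buffer names are illustrative.
	 */
	static int stitched_encrypt(const unsigned char aes_key[16],
				    const unsigned char mac_key[20],
				    const unsigned char iv[16],
				    unsigned char aad[13], /* seq(8) | type | ver(2) | len(2) */
				    unsigned char *buf,    /* plaintext in, ciphertext out */
				    size_t plen)           /* plaintext length */
	{
		/* Returns NULL unless the AES-NI stitched code is usable */
		const EVP_CIPHER *ciph = EVP_aes_128_cbc_hmac_sha1();
		EVP_CIPHER_CTX *ctx;
		int pad, ret = -1;

		if (!ciph || !(ctx = EVP_CIPHER_CTX_new()))
			return -1;

		if (!EVP_EncryptInit_ex(ctx, ciph, NULL, aes_key, iv) ||
		    !EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_SET_MAC_KEY, 20,
					 (void *)mac_key))
			goto out;

		/*
		 * Describe the record. The control returns how many bytes
		 * of MAC and CBC padding will be appended, so buf needs
		 * plen + pad bytes of space.
		 */
		pad = EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_AEAD_TLS1_AAD, 13, aad);
		if (pad < 0)
			goto out;

		/* One call: AES-CBC and SHA-1 interleaved over the data */
		if (EVP_Cipher(ctx, buf, buf, (unsigned int)(plen + pad)) <= 0)
			goto out;

		ret = (int)(plen + pad);
	out:
		EVP_CIPHER_CTX_free(ctx);
		return ret;
	}

The win comes from the interleaved assembly doing the AES-CBC rounds
and the SHA-1 block function in a single pass over the buffer, rather
than the two separate passes of EVP_EncryptUpdate() followed by
HMAC_Update().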