On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> On 2021/6/24 12:12 AM, David Woodhouse wrote:
> > We *should* eventually expand this test case to attach an AF_PACKET
> > device to the vhost-net, instead of using a tun device as the back end.
> > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
>
> It's just because nobody wrote the code. And we're lacking the real use
> case.

Hm, what code? For AF_PACKET I haven't actually spotted that there *is*
any.

As I've been refactoring the interaction between vhost and tun/tap, and
fixing it up for different vhdr lengths, PI, and (just now) frowning in
horror at the concept that tun and vhost can have *different*
endiannesses, I hadn't spotted that there was anything special about the
packet socket. For that case, sock_hlen is just zero and we send/receive
plain packets... or so I thought? Did I miss something?

As far as I was aware, that ought to have worked with any datagram
socket. I was pondering not just AF_UNIX but also UDP (since that's my
main transport for VPN data, at least in the case where I care about
performance).

An interesting use case for a non-packet socket might be to establish a
tunnel. A guest's virtio-net device is just connected to a UDP socket on
the host, and that tunnels all its packets to/from a remote endpoint,
which is where that guest is logically connected to the network. It
might be useful for live migration cases, perhaps? (There's a rough
sketch of what the userspace side might look like at the end of this
mail.)

I don't have an overriding desire to *make* it work, mind you; I just
wanted to make sure I understand *why* it doesn't, if indeed it doesn't.
As far as I could tell, it *should* work if we just dropped the check?

> Vhost_net is basically used for accepting packets from userspace into
> the kernel networking stack.
>
> Using AF_UNIX makes it look more like a case of inter-process
> communication (without a vnet header it won't be used efficiently by a
> VM). In this case, using io_uring is much more suitable.
>
> Or, thinking about it another way: instead of depending on vhost_net,
> we could expose the TUN/TAP socket to userspace, and then io_uring
> could be used for the OpenConnect case as well?

That would work, I suppose. Although as noted, I *can* use vhost_net
with tun today from userspace as long as I disable XDP and PI, and use a
virtio_net_hdr that I don't really want. (And pull a value for TASK_SIZE
out of my posterior; qv.)

So I *can* ship a version of OpenConnect that works on existing
production kernels with those workarounds, and I'm fixing up the other
permutations of vhost/tun stuff in the kernel just because I figured we
*should*.

If I'm going to *require* new kernel support for OpenConnect then I
might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
have the data packets never go to userspace in the first place.

> > > > + /*
> > > > + * I just want to map the *whole* of userspace address space. But
> > > > + * from userspace I don't know what that is. On x86_64 it would be:
> > > > + *
> > > > + * vmem->regions[0].guest_phys_addr = 4096;
> > > > + * vmem->regions[0].memory_size = 0x7fffffffe000;
> > > > + * vmem->regions[0].userspace_addr = 4096;
> > > > + *
> > > > + * For now, just ensure we put everything inside a single BSS region.
> > > > + */
> > > > + vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > > > + vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > > > + vmem->regions[0].memory_size = sizeof(rings);
>
> > > Instead of doing tricks like this, we can do it in another way:
> > >
> > > 1) enable device IOTLB
> > > 2) wait for the IOTLB miss request (iova, len) and update the identity
> > >    mapping accordingly
> > >
> > > This should work for all archs (with some performance hit).
> >
> > Ick. For my actual application (OpenConnect) I'm either going to suck
> > it up and put in the arch-specific limits like in the comment above, or
> > I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> > about elsewhere.
>
> The feature could be useful for the case of vhost-vDPA as well.
>
> > (Probably the former, since if I'm requiring kernel
> > changes then I have grander plans around extending AF_TLS to do DTLS,
> > then hooking that directly up to the tun socket via BPF and a sockmap
> > without the data frames ever going to userspace at all.)
>
> Ok, I guess we need to make sockmap work for the tun socket.

Hm, I need to work out the ideal data flow here. I don't know if
sendmsg() / recvmsg() on the tun socket are even what I want for full
zero-copy.

In the case where the NIC supports encryption, we want true zero-copy
from the moment the "encrypted" packet arrives over UDP on the public
network, through the DTLS processing and seqno checking, to being
processed as netif_receive_skb() on the tun device. Likewise, skbs from
tun_net_xmit() should have the appropriate DTLS and IP/UDP headers
prepended to them, and that *same* skb (or at least the same frags)
should be handed to the NIC to encrypt and send.

In the case where we have software crypto in the kernel, we can tolerate
precisely *one* copy, because the crypto doesn't have to be done
in-place: moving from the input to the output crypto buffers can be that
one "copy", and we can use it to move data around (from the incoming skb
to the outgoing skb) if we need to.

Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
being *given* ownership of the buffers, rather than copying from them.
Or just being given an skb and pushing/pulling their headers. I'm
looking at skb_send_sock() and it *doesn't* seem to support "just steal
the frags from the initial skb and give them to the new one", but there
may be ways to make that work.

> > I think I can fix *all* those test cases by making tun_get_socket()
> > take an extra 'int *' argument, and use that to return the *actual*
> > value of sock_hlen. Here's the updated test case in the meantime:
>
> It would be better if you could post a new version of the whole series
> to ease reviewing.

Yep. I was working on that... until I got even more distracted by
looking at how we can do the true in-kernel zero-copy option ;)
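For reference, the tun_get_socket() change I mean above would be roughly
this shape. It's only a sketch: the extra parameter name is made up, and
the in-tree prototype takes no such argument today.

/* Today, include/linux/if_tun.h has:
 *
 *	struct socket *tun_get_socket(struct file *file);
 *
 * The idea is to also report the header length that tun will actually
 * prepend/expect on its socket, so vhost-net can use the real value
 * instead of assuming sizeof(struct virtio_net_hdr). The same would
 * presumably apply to tap_get_socket() for macvtap.
 */
struct socket *tun_get_socket(struct file *file, int *sock_hlen);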
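And here's the rough sketch I promised of the userspace side of the
UDP-tunnel idea. Hypothetical, obviously: current kernels reject
anything that isn't a tun/tap or packet socket in get_socket() in
drivers/vhost/net.c, the function name below is mine, and it assumes the
vhost fd has already had its owner, memory table and vrings set up.

#include <unistd.h>
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/vhost.h>

/* Attach a connected UDP socket as the backend for both vhost-net
 * virtqueues, so every frame the guest sends or receives is carried as
 * a datagram to/from the remote tunnel endpoint. */
static int attach_udp_backend(int vhost_fd, const struct sockaddr_in *peer)
{
	struct vhost_vring_file backend;
	int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (udp_fd < 0)
		return -1;

	/* The remote endpoint where this guest is logically attached
	 * to the network. */
	if (connect(udp_fd, (const struct sockaddr *)peer, sizeof(*peer)))
		goto err;

	/* RX queue (0) and TX queue (1), both served by the same socket. */
	backend.fd = udp_fd;
	backend.index = 0;
	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend))
		goto err;

	backend.index = 1;
	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend))
		goto err;

	return 0;
 err:
	close(udp_fd);
	return -1;
}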