On Thu, 2021-06-24 at 14:12 +0800, Jason Wang wrote:
> On 2021/6/24 12:12 AM, David Woodhouse wrote:
> > We *should* eventually expand this test case to attach an AF_PACKET
> > device to the vhost-net, instead of using a tun device as the back end.
> > (Although I don't really see *why* vhost is limited to AF_PACKET. Why
> > *can't* I attach anything else, like an AF_UNIX socket, to vhost-net?)
>
> It's just because nobody wrote the code. And we're lacking the real use
> case.

Hm, what code? For AF_PACKET I haven't actually spotted that there *is*
any.

As I've been refactoring the interaction between vhost and tun/tap, and
fixing it up for different vhdr lengths, PI, and (just now) frowning in
horror at the concept that tun and vhost can have *different*
endiannesses, I hadn't spotted that there was anything special about the
packet socket. For that case, sock_hlen is just zero and we send/receive
plain packets... or so I thought? Did I miss something?

As far as I was aware, that ought to have worked with any datagram
socket. I was pondering not just AF_UNIX but also UDP (since that's my
main transport for VPN data, at least in the case where I care about
performance).

An interesting use case for a non-packet socket might be to establish a
tunnel. A guest's virtio-net device is just connected to a UDP socket on
the host, and that tunnels all its packets to/from a remote endpoint,
which is where that guest is logically connected to the network. It
might be useful for live migration cases, perhaps? (There's a rough
sketch of what the userspace side might look like at the end of this
mail.)

I don't have an overriding desire to *make* it work, mind you; I just
wanted to make sure I understand *why* it doesn't, if indeed it doesn't.
As far as I could tell, it *should* work if we just dropped the check?

> Vhost_net is basically used for accepting packets from userspace into
> the kernel networking stack.
>
> Using AF_UNIX makes it look more like a case of inter-process
> communication (without a vnet header it won't be used efficiently by a
> VM). In this case, using io_uring is much more suitable.
>
> Or, thinking about it another way: instead of depending on vhost_net,
> we could expose the TUN/TAP socket to userspace, and then io_uring
> could be used for the OpenConnect case as well?

That would work, I suppose. Although as noted, I *can* use vhost_net
with tun today from userspace as long as I disable XDP and PI, and use a
virtio_net_hdr that I don't really want. (And pull a value for TASK_SIZE
out of my posterior; qv.)

So I *can* ship a version of OpenConnect that works on existing
production kernels with those workarounds, and I'm fixing up the other
permutations of vhost/tun stuff in the kernel just because I figured we
*should*.

If I'm going to *require* new kernel support for OpenConnect then I
might as well go straight to the AF_TLS/DTLS + BPF + sockmap plan and
have the data packets never go to userspace in the first place.

> > > > + /*
> > > > + * I just want to map the *whole* of userspace address space. But
> > > > + * from userspace I don't know what that is. On x86_64 it would be:
> > > > + *
> > > > + * vmem->regions[0].guest_phys_addr = 4096;
> > > > + * vmem->regions[0].memory_size = 0x7fffffffe000;
> > > > + * vmem->regions[0].userspace_addr = 4096;
> > > > + *
> > > > + * For now, just ensure we put everything inside a single BSS region.
> > > > + */
> > > > + vmem->regions[0].guest_phys_addr = (uint64_t)&rings;
> > > > + vmem->regions[0].userspace_addr = (uint64_t)&rings;
> > > > + vmem->regions[0].memory_size = sizeof(rings);
>
> > > Instead of doing tricks like this, we can do it in another way:
> > >
> > > 1) enable device IOTLB
> > > 2) wait for the IOTLB miss request (iova, len) and update the identity
> > >    mapping accordingly
> > >
> > > This should work for all archs (with some performance hit).
> >
> > Ick. For my actual application (OpenConnect) I'm either going to suck
> > it up and put in the arch-specific limits like in the comment above, or
> > I'll fix things to do the VHOST_F_IDENTITY_MAPPING thing we're talking
> > about elsewhere.
>
> The feature could be useful for the case of vhost-vDPA as well.
>
> > (Probably the former, since if I'm requiring kernel
> > changes then I have grander plans around extending AF_TLS to do DTLS,
> > then hooking that directly up to the tun socket via BPF and a sockmap
> > without the data frames ever going to userspace at all.)
>
> Ok, I guess we need to make sockmap work for the tun socket.

Hm, I need to work out the ideal data flow here. I don't know if
sendmsg() / recvmsg() on the tun socket are even what I want for full
zero-copy.

In the case where the NIC supports encryption, we want true zero-copy
from the moment the "encrypted" packet arrives over UDP on the public
network, through the DTLS processing and seqno checking, to being
processed as netif_receive_skb() on the tun device. Likewise, skbs from
tun_net_xmit() should have the appropriate DTLS and IP/UDP headers
prepended to them, and that *same* skb (or at least the same frags)
should be handed to the NIC to encrypt and send.

In the case where we have software crypto in the kernel, we can tolerate
precisely *one* copy, because the crypto doesn't have to be done
in-place: moving from the input to the output crypto buffers can be that
one "copy", and we can use it to move data around (from the incoming skb
to the outgoing skb) if we need to.

Ultimately I think we want udp_sendmsg() and tun_sendmsg() to support
being *given* ownership of the buffers, rather than copying from them.
Or just being given an skb and pushing/pulling their headers. I'm
looking at skb_send_sock() and it *doesn't* seem to support "just steal
the frags from the initial skb and give them to the new one", but there
may be ways to make that work.

> > I think I can fix *all* those test cases by making tun_get_socket()
> > take an extra 'int *' argument, and use that to return the *actual*
> > value of sock_hlen. Here's the updated test case in the meantime:
>
> It would be better if you could post a new version of the whole series
> to ease reviewing.

Yep. I was working on that... until I got even more distracted by
looking at how we can do the true in-kernel zero-copy option ;)
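For reference, the tun_get_socket() change I mean above would be roughly
this shape. It's only a sketch: the extra parameter name is made up, and
the in-tree prototype takes no such argument today.

/* Today, include/linux/if_tun.h has:
 *
 *	struct socket *tun_get_socket(struct file *file);
 *
 * The idea is to also report the header length that tun will actually
 * prepend/expect on its socket, so vhost-net can use the real value
 * instead of assuming sizeof(struct virtio_net_hdr). The same would
 * presumably apply to tap_get_socket() for macvtap.
 */
struct socket *tun_get_socket(struct file *file, int *sock_hlen);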
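And here's the rough sketch I promised of the userspace side of the
UDP-tunnel idea. Hypothetical, obviously: current kernels reject
anything that isn't a tun/tap or packet socket in get_socket() in
drivers/vhost/net.c, the function name below is mine, and it assumes the
vhost fd has already had its owner, memory table and vrings set up.

#include <unistd.h>
#include <netinet/in.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/vhost.h>

/* Attach a connected UDP socket as the backend for both vhost-net
 * virtqueues, so every frame the guest sends or receives is carried as
 * a datagram to/from the remote tunnel endpoint. */
static int attach_udp_backend(int vhost_fd, const struct sockaddr_in *peer)
{
	struct vhost_vring_file backend;
	int udp_fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (udp_fd < 0)
		return -1;

	/* The remote endpoint where this guest is logically attached
	 * to the network. */
	if (connect(udp_fd, (const struct sockaddr *)peer, sizeof(*peer)))
		goto err;

	/* RX queue (0) and TX queue (1), both served by the same socket. */
	backend.fd = udp_fd;
	backend.index = 0;
	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend))
		goto err;

	backend.index = 1;
	if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend))
		goto err;

	return 0;
 err:
	close(udp_fd);
	return -1;
}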