On Thu, 2021-07-01 at 12:13 +0800, Jason Wang wrote:
> On 2021/6/30 at 6:02 PM, David Woodhouse wrote:
> > On Wed, 2021-06-30 at 12:39 +0800, Jason Wang wrote:
> > > On 2021/6/29 at 6:49 PM, David Woodhouse wrote:
> > > > So as I expected, the throughput is better with vhost-net once I get to
> > > > the point of 100% CPU usage in my main thread, because it offloads the
> > > > kernel←→user copies. But latency is somewhat worse.
> > > > 
> > > > I'm still using select() instead of epoll() which would give me a
> > > > little back, but only a little, as I only poll on 3-4 fds, and more to
> > > > the point it'll give me just as much win in the non-vhost case too, so
> > > > it won't make much difference to the vhost vs. non-vhost comparison.
> > > > 
> > > > Perhaps I really should look into that trick of "if the vhost TX ring
> > > > is already stopped and would need a kick, and I only have a few packets
> > > > in the batch, just write them directly to /dev/net/tun".
> > > 
> > > That should work on low throughput.
> > 
> > Indeed it works remarkably well, as I noted in my follow-up. I also
> > fixed a minor stupidity where I was reading from the 'call' eventfd
> > *before* doing the real work of moving packets around. And that gives
> > me a few tens of microseconds back too.
> > 
> > > > I'm wondering how that optimisation would translate to actual guests,
> > > > which presumably have the same problem. Perhaps it would be an
> > > > operation on the vhost fd, which ends up processing the ring right
> > > > there in the context of *that* process instead of doing a wakeup?
> > > 
> > > It might improve the latency in an ideal case but there are several
> > > possible issues:
> > > 
> > > 1) this will block the vCPU from running until the send is done
> > > 2) copy_from_user() may sleep, which will block the vCPU thread further
> > 
> > Yes, it would block the vCPU for a short period of time, but we could
> > limit that.
> > The real win is to improve latency of single, short packets
> > like a first SYN, or SYNACK. It should work fine even if it's limited
> > to *one* *short* packet which *is* resident in memory.
> 
> This looks tricky since we need to poke both virtqueue metadata as well
> as the payload.

That's OK as we'd *only* do it if the thread is quiescent anyway.

> And we need to let the packet traverse the network stack which might have
> extra latency (qdiscs, eBPF, switch/OVS).
> 
> So it looks to me it's better to use vhost_net busy polling instead
> (VHOST_SET_VRING_BUSYLOOP_TIMEOUT).

Or something very similar, with the *trylock* and bailing out.

> Userspace can detect this feature by validating the existence of the ioctl.

Yep. Or if we want to get fancy, we could even offer it to the guest.
As a *different* doorbell register to poke if they want to relinquish
the physical CPU to process the packet quicker. We wouldn't even *need*
to go through userspace at all, if we put that into a separate page...
but that probably *is* overengineering it :)

> > Although actually I'm not *overly* worried about the 'resident' part.
> > For a transmit packet, especially a short one not a sendpage(), it's
> > fairly likely that the guest has touched the buffer right before sending
> > it. And taken the hit of faulting it in then, if necessary. If the host
> > is paging out memory which is in *active* use by a guest, that guest is
> > screwed anyway :)
> 
> Right, but there could be a workload that is unrelated to the networking.
> Blocking the vCPU thread in this case seems sub-optimal.

Right, but the VMM (or the guest, if we're letting the guest choose)
wouldn't have to use it for those cases.
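
For what it's worth, the "detect it by probing the ioctl" approach mentioned
above could look roughly like the sketch below. The helper name
vhost_try_busyloop() is made up for illustration, and I'm assuming two
things: that the installed <linux/vhost.h> is recent enough to define
VHOST_SET_VRING_BUSYLOOP_TIMEOUT, and that a kernel which doesn't know the
ioctl reports ENOTTY. I believe the fd also needs VHOST_SET_OWNER done first,
otherwise the call fails for an unrelated reason.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/*
 * Hypothetical helper: try to enable vhost busy polling on one ring.
 *
 * Returns  1 if the timeout (in microseconds) was accepted,
 *          0 if the kernel does not seem to know the ioctl (assumed ENOTTY),
 *         -1 on any other failure, with errno left as set by ioctl().
 */
int vhost_try_busyloop(int vhost_fd, unsigned int ring, unsigned int usecs)
{
	struct vhost_vring_state state = {
		.index = ring,	/* which virtqueue */
		.num = usecs,	/* busy-loop timeout */
	};

	if (ioctl(vhost_fd, VHOST_SET_VRING_BUSYLOOP_TIMEOUT, &state) == 0)
		return 1;

	return errno == ENOTTY ? 0 : -1;
}
```

A caller would fall back to the ordinary kick/wakeup path when this returns
0, and treat -1 as a real error.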
> > Alternatively, there's still the memory map thing I need to fix before
> > I can commit this in my application:
> > 
> > #ifdef __x86_64__
> > 	vmem->regions[0].guest_phys_addr = 4096;
> > 	vmem->regions[0].memory_size = 0x7fffffffe000;
> > 	vmem->regions[0].userspace_addr = 4096;
> > #else
> > #error FIXME
> > #endif
> > 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {
> > 
> > Perhaps if we end up with a user-visible feature to deal with that,
> > then I could use the presence of *that* feature to infer that the tun
> > bugs are fixed.
> 
> As we discussed before it could be a new backend feature. VHOST_NET_SVA
> (shared virtual address)?

Yeah, I'll take a look.

> > Another random thought as I stare at this... can't we handle checksums
> > in tun_get_user() / tun_put_user()? We could always set NETIF_F_HW_CSUM
> > on the tun device, and just do it *while* we're copying the packet to
> > userspace, if userspace doesn't support it. That would be better than
> > having the kernel complete the checksum in a separate pass *before*
> > handing the skb to tun_net_xmit().
> 
> I'm not sure I get this, but for performance reasons we don't do any csum
> in this case?

I think we have to; the packets can't leave the box without a valid
checksum. If the skb isn't CHECKSUM_COMPLETE at the time it's handed
off to the ->hard_start_xmit of a netdev which doesn't advertise
hardware checksum support, the network stack will do it manually in an
extra pass.

Which is kind of silly if the tun device is going to do a pass over all
the data *anyway* as it copies it up to userspace. Even in the normal
case without vhost-net.

> > We could similarly do a partial checksum in tun_get_user() and hand it
> > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> 
> I think that is how it is expected to work (via vnet header), see
> virtio_net_hdr_to_skb().

But only if the "guest" supports it; it doesn't get handled by the tun
device.
It *could*, since we already have the helpers to checksum *as* we
copy to/from userspace.

It doesn't help for me to advertise that I support TX checksums in
userspace because I'd have to do an extra pass for that. I only do one
pass over the data as I encrypt it, and in many block cipher modes the
encryption of the early blocks affects the IV for the subsequent
blocks... so I can't just go back and "fix" the checksum at the start
of the packet, once I'm finished.

So doing the checksum as the packet is copied up to userspace would be
very useful.
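
To make the "checksum while copying" idea concrete, here is a plain
userspace sketch of the technique. It is not the kernel's helpers and not
what tun does today: copy_and_csum() and csum_fold16() are names invented
for this example, and it simply accumulates the 16-bit ones' complement
Internet checksum in the same loop that moves the bytes, so the data is
only touched once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: copy len bytes from src to dst while adding them,
 * as big-endian 16-bit words, into the running sum. Chunks may be
 * chained by passing the returned value back in, provided every chunk
 * except the last has an even length. Assumes len < 64 KiB per call so
 * the 32-bit accumulator cannot overflow.
 */
uint32_t copy_and_csum(void *dst, const void *src, size_t len, uint32_t sum)
{
	const uint8_t *s = src;
	uint8_t *d = dst;
	size_t i;

	assert(dst != NULL && src != NULL);

	for (i = 0; i + 1 < len; i += 2) {
		d[i] = s[i];
		d[i + 1] = s[i + 1];
		sum += ((uint32_t)s[i] << 8) | s[i + 1];
	}
	if (i < len) {
		/* Odd trailing byte is padded with a zero low byte. */
		d[i] = s[i];
		sum += (uint32_t)s[i] << 8;
	}

	/* Partial fold keeps the running sum small for the next chunk. */
	return (sum & 0xffff) + (sum >> 16);
}

/* Fold the running sum to 16 bits and complement it. */
uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The point is only that the addition rides along with the copy for free; in
my case the userspace side would then never need a second pass over the
plaintext before encrypting it.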