On Fri, 2021-07-02 at 11:13 +0800, Jason Wang wrote:
> On 2021/7/2 at 1:39 AM, David Woodhouse wrote:
> > 
> > Right, but the VMM (or the guest, if we're letting the guest choose)
> > wouldn't have to use it for those cases.
> 
> 
> I'm not sure I follow. If so, simply writing to TUN directly would work.

As noted, that works nicely for me in OpenConnect; I just write it to
the tun device *instead* of putting it in the vring. My TX latency is
now fine; it's just RX which takes *two* scheduler wakeups (tun wakes
vhost thread, wakes guest).

But it's not clear to me that a VMM could use it, because the guest has
already put that packet *into* the vring. Now if the VMM is in the path
of all wakeups for that vring, I suppose we *might* be able to contrive
some hackish way to be 'sure' that the kernel isn't servicing it, so we
could try to 'steal' that packet from the ring in order to send it
directly... but no. That's awful :)

I do think it'd be interesting to look at a way to reduce the latency
of the vring wakeup especially for that case of a virtio-net guest with
a single small packet to send. But realistically speaking, I'm unlikely
to get to it any time soon except for showing the numbers with the
userspace equivalent and observing that there's probably a similar win
to be had for guests too.

In the short term, I should focus on what we want to do to finish off
my existing fixes. Did we have a consensus on whether to bother
supporting PI? As I said, I'm mildly inclined to do so just because it
mostly comes out in the wash as we fix everything else, and making it
gracefully *refuse* that setup reliably is just as hard.

I think I'll try to make the vhost-net code much more resilient to
finding that tun_recvmsg() returns a header other than the sock_hlen it
expects, and see how much still actually needs "fixing" if we can do
that.


> I think the design is to delay the tx checksum as much as possible:
> 
> 1) host RX -> TAP TX -> Guest RX -> Guest TX -> TAP RX -> host TX
> 2) VM1 TX -> TAP RX -> switch -> TAP TX -> VM2 TX
> 
> E.g. if checksum offload is supported in all those paths, we don't need
> any software checksum at all in the above path. And if any part is not
> capable of doing the checksum, it will be done by the networking core
> before calling the hard_start_xmit of that device.

Right, but in *any* case where the 'device' is going to memcpy the data
around (like tun_put_user() does), it's a waste of time having the
networking core do a *separate* pass over the data just to checksum it.

> > > > We could similarly do a partial checksum in tun_get_user() and hand it
> > > > off to the network stack with ->ip_summed == CHECKSUM_PARTIAL.
> > > 
> > > I think that is how it is expected to work (via vnet header), see
> > > virtio_net_hdr_to_skb().
> > 
> > But only if the "guest" supports it; it doesn't get handled by the tun
> > device. It *could*, since we already have the helpers to checksum *as*
> > we copy to/from userspace.
> > 
> > It doesn't help for me to advertise that I support TX checksums in
> > userspace because I'd have to do an extra pass for that. I only do one
> > pass over the data as I encrypt it, and in many block cipher modes the
> > encryption of the early blocks affects the IV for the subsequent
> > blocks... so I can't just go back and "fix" the checksum at the start
> > of the packet, once I'm finished.
> > 
> > So doing the checksum as the packet is copied up to userspace would be
> > very useful.
> 
> 
> I think I get this, but it requires a new TUN feature (and maybe making
> it userspace controllable via tun_set_csum()).

I don't think it's visible to userspace at all; it's purely between the
tun driver and the network stack. We *always* set NETIF_F_HW_CSUM,
regardless of what the user can cope with. And if the user *didn't*
support checksum offload then tun will transparently do the checksum
*during* the copy_to_iter() (in either tun_put_user_xdp() or
tun_put_user()).

Userspace sees precisely what it did before. If it doesn't support
checksum offload then it gets a pre-checksummed packet just as before.
It's just that the kernel will do that checksum *while* it's already
touching the data as it copies it to userspace, instead of in a
separate pass.
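
To illustrate what I mean by checksumming *while* copying, here's a
minimal userspace sketch. It is not the kernel's code; copy_and_csum()
is a made-up name, and it just computes a plain RFC 1071 Internet
checksum over the bytes as it moves them, so the data is only touched
once.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only: copy len bytes from src
 * to dst and return the folded, complemented RFC 1071 checksum of the
 * data, computed in the same pass as the copy.
 */
static uint16_t copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    /* Copy and accumulate 16-bit big-endian words together. */
    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A real implementation would work a word at a time and carry a partial
sum across fragments, but the shape is the same: one loop, one pass.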

Although actually, for my *benchmark* case with iperf3 sending UDP, I
spotted in the perf traces that we do the checksum as we're
copying from userspace in the udp_sendmsg() call. There's a check in
__ip_append_data() which looks to see if the destination device has
HW_CSUM|IP_CSUM features, and does the copy-and-checksum if not. There
are definitely use cases which *don't* have that kind of optimisation
though, and packets that would reach tun_net_xmit() with CHECKSUM_NONE.
So I think it's worth looking at.
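
Very roughly, the decision I'm describing in __ip_append_data() has the
following shape. This is a simplified model, not the kernel source: the
SKETCH_* names and bit values are stand-ins I've made up here, and the
real check has further conditions which I've left out.

```c
/*
 * Stand-in feature bits and modes for this sketch only. The names and
 * values are illustrative and are NOT the kernel's definitions.
 */
#define SKETCH_F_IP_CSUM (1u << 0)
#define SKETCH_F_HW_CSUM (1u << 1)

enum sketch_csum_mode {
    SKETCH_CSUM_NONE,    /* checksum in software, during the copy */
    SKETCH_CSUM_PARTIAL, /* leave the checksum for the device to finish */
};

/* Pick a checksum mode from the destination device's feature bits. */
static enum sketch_csum_mode sketch_pick_csum_mode(unsigned int dev_features)
{
    if (dev_features & (SKETCH_F_HW_CSUM | SKETCH_F_IP_CSUM))
        return SKETCH_CSUM_PARTIAL;

    return SKETCH_CSUM_NONE;
}
```

Since tun always advertises HW_CSUM, a sender going through this kind
of check takes the deferred branch, and the checksum has to be done
somewhere later on.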