Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()

From: David Woodhouse <dwmw2@infradead.org>
To: Jason Wang <jasowang@redhat.com>, netdev@vger.kernel.org
Cc: "Eugenio Pérez" <eperezma@redhat.com>,
	"Willem de Bruijn" <willemb@google.com>
Subject: Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
Date: Fri, 25 Jun 2021 09:23:27 +0100
Message-ID: <4b33ed9ac98c28e8980043d482cc3549acfba799.camel@infradead.org>
In-Reply-To: <8bc0d9b7-b3d8-ddbb-bcdc-e0169fac7111@redhat.com>

On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote:
> On 2021/6/24 8:30 PM, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The vhost-net driver was making wild assumptions about the header length
> > of the underlying tun/tap socket.
> 
> 
> By design, vhost-net depends on userspace to coordinate the vnet header
> setting with the underlying sockets.
> 
> 
> >   Then it was discarding packets if
> > the number of bytes it got from sock_recvmsg() didn't precisely match
> > its guess.
> 
> 
> Is anything broken by this? The failure is a hint to userspace
> that something went wrong during that coordination.

I am not a fan of this approach. I firmly believe that for a given
configuration, the kernel should either *work* or it should gracefully
refuse to set it up that way. And the requirements should be clearly
documented.

Having been on the receiving end of this "hint" of which you speak, I
found it distinctly suboptimal as a user interface. I was left
scrabbling around trying to find a set of options which *would* work,
and it was only through debugging the kernel that I managed to work out
that I:

  • MUST set IFF_NO_PI
  • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX
  • MUST use a virtio_net_hdr that I don't want

If my application failed to do any of those things, I got a silent
failure to transport any packets. The only thing I could do *without*
debugging the kernel was tcpdump on the 'tun0' interface and see if the
TX packets I put into the ring were even making it to the interface,
and what they looked like if they did. (Losing the first 14 bytes and
having the *next* 14 bytes scribbled on by an Ethernet header was a fun
one.)
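
For illustration, a minimal userspace sketch of those three requirements,
assuming the descriptor was already opened from /dev/net/tun. The helper
name configure_tun_for_vhost() and the sndbuf value chosen by the caller
are hypothetical and not from this thread; TUNSETIFF, TUNSETSNDBUF,
IFF_NO_PI and IFF_VNET_HDR are the existing <linux/if_tun.h> interfaces.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/*
 * Hypothetical helper: apply the three settings listed above to an
 * already-open /dev/net/tun descriptor. Returns 0 on success, or -1 on
 * the first ioctl that fails (errno is left as set by ioctl()).
 */
int configure_tun_for_vhost(int tun_fd, const char *ifname, int sndbuf)
{
	struct ifreq ifr;

	assert(ifname != NULL);

	memset(&ifr, 0, sizeof(ifr));
	/* No tun_pi header, but do prepend a virtio_net_hdr. */
	ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_VNET_HDR;
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

	if (ioctl(tun_fd, TUNSETIFF, &ifr) < 0)
		return -1;

	/* Any finite value chosen by the caller; the point is to move
	 * away from the INT_MAX default. */
	if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0)
		return -1;

	return 0;
}
```

With this in place a caller would see an explicit error from the helper
if any step is rejected, rather than discovering it via tcpdump.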

> > 
> > Fix it to get the correct information along with the socket itself.
> 
> 
> I'm not sure what is fixed by this. It looks to me like it tries to let
> packets through even if userspace set the wrong attributes on tap or
> vhost. That is even less optimal than explicitly failing the RX.

I'm OK with explicit failure. But once I'd let it *get* the information
from the underlying socket in order to decide whether it should fail or
not, it turned out to be easy enough just to make those configs work
anyway.

The main case where that "easy enough" is stretched a little (IMO) was
when there's a tun_pi header. I have one more of your emails to reply
to after this, and I'll address that there.

> 
> > As a side-effect, this means that tun_get_socket() won't work if the
> > tun file isn't actually connected to a device, since there's no 'tun'
> > yet in that case to get the information from.
> 
> 
> This may break existing applications. Vhost-net is tied to the socket
> rather than to the device, to which the socket is only loosely coupled.

Hm. Perhaps the PI and vnet hdr should be considered an option of the
*socket* (which is tied to the tfile), not purely an option of the
underlying device?

Or maybe it's sufficient just to get the flags from *either* tfile->tun 
or tfile->detached, so that it works when the queue is detached. I'll
take a look.

I suppose we could even have a fallback that makes stuff up like we do
today. If the user attempts to attach a tun file descriptor to vhost
without ever calling TUNSETIFF on it first, *then* we make the same
assumptions we do today?
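
A rough sketch of that lookup order, using stand-in types rather than
the kernel's real struct tun_file and struct tun_struct. Everything
prefixed ex_/EX_ is hypothetical; the flag values and the 4-byte tun_pi
size mirror the existing uapi definitions. A return of -1 means "never
attached", which is where the caller would fall back to today's
assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins, NOT the kernel structures. */
struct ex_tun {
	unsigned int flags;
	int vnet_hdr_sz;
};

struct ex_tfile {
	struct ex_tun *tun;      /* set while the queue is attached */
	struct ex_tun *detached; /* set while the queue is detached */
};

#define EX_IFF_NO_PI    0x1000
#define EX_IFF_VNET_HDR 0x4000
#define EX_TUN_PI_LEN   4	/* sizeof(struct tun_pi) */

/*
 * Header length in front of each packet on this socket, or -1 if the
 * file was never attached to a device (no TUNSETIFF yet).
 */
int ex_tfile_hdr_len(const struct ex_tfile *tfile)
{
	const struct ex_tun *tun;
	int len = 0;

	assert(tfile != NULL);

	/* Prefer the attached device, else the one we were detached from. */
	tun = tfile->tun ? tfile->tun : tfile->detached;
	if (!tun)
		return -1;

	if (!(tun->flags & EX_IFF_NO_PI))
		len += EX_TUN_PI_LEN;
	if (tun->flags & EX_IFF_VNET_HDR)
		len += tun->vnet_hdr_sz;

	return len;
}
```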

> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)
> >   
> >   	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
> >   		vq->log : NULL;
> > -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
> > +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&
> > +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));
> 
> 
> So this changes the behavior. When mergeable buffers are enabled, userspace
> expects vhost to merge buffers. If the feature is disabled silently,
> it violates the virtio spec.
> 
> If anything is wrong in the setup, userspace just breaks itself.
> 
> E.g. if sock_hlen is less than struct virtio_net_hdr_mrg_rxbuf, the packet
> header might be overwritten by the vnet header.

This wasn't intended to change the behaviour of any code path that is
already working today. If *either* vhost or the underlying device has
provided a vnet header, we still merge.

If *neither* provides a vnet hdr, there's nowhere to put num_buffers and
we can't merge.

That code path doesn't work at all today, but does after my patches.
But you're right, we should explicitly refuse to negotiate
VIRTIO_NET_F_MRG_RXBUF in that case.
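
One possible shape for such an explicit refusal, as a standalone sketch.
The function name ex_check_mrg_rxbuf() and the point at which it would
be called are assumptions; the condition is the same one used for
'mergeable' in the hunk above, and 15 is the VIRTIO_NET_F_MRG_RXBUF bit
number from the virtio spec.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_VIRTIO_NET_F_MRG_RXBUF 15	/* feature bit number */

/*
 * Hypothetical check: fail with -EINVAL if mergeable RX buffers were
 * requested but neither vhost nor the socket supplies a header with
 * room for the 16-bit num_buffers field. Returns 0 otherwise.
 */
int ex_check_mrg_rxbuf(uint64_t features, size_t vhost_hlen, size_t sock_hlen)
{
	bool want_merge = features & (1ULL << EX_VIRTIO_NET_F_MRG_RXBUF);
	bool have_room = vhost_hlen || sock_hlen >= sizeof(uint16_t);

	if (want_merge && !have_room)
		return -EINVAL;

	return 0;
}
```

Failing here, rather than silently clearing 'mergeable' in handle_rx(),
keeps the negotiated feature set and the actual behaviour in agreement.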

> 
> >   
> >   	do {
> >   		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
> > @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)
> >   			}
> >   		} else {
> >   			/* Header came from socket; we'll need to patch
> > -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
> > +			 * ->num_buffers over the last two bytes if
> > +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.
> >   			 */
> > -			iov_iter_advance(&fixup, sizeof(hdr));
> > +			iov_iter_advance(&fixup, sock_hlen - 2);
> 
> 
> I'm not sure what the above code is meant to fix. It doesn't change
> anything if the vnet header is set correctly in TUN. It only prevents
> the packet header from being rewritten.
> 

It fixes the case where the virtio_net_hdr isn't at the start of the
tun header, because the tun actually puts the tun_pi struct *first*,
and *then* the virtio_net_hdr. 

The num_buffers field needs to go at the *end* of sock_hlen. Not at a
fixed offset from the *start* of it.

At least, that's true unless we want to just declare that we *only*
support TUN with the IFF_NO_PI flag. (qv).
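
To make the arithmetic concrete, a small sketch with hypothetical ex_
helpers. With IFF_NO_PI and a 12-byte virtio_net_hdr_mrg_rxbuf,
sock_hlen is 12 and num_buffers sits at offset 10, which happens to
equal sizeof(struct virtio_net_hdr), so the old fixed offset works. With
a 4-byte tun_pi in front, sock_hlen is 16 and the field is at offset 14;
offset 10 would land inside the virtio_net_hdr instead. The little-endian
store is an assumption made for the example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the trailing 16-bit num_buffers field within the header
 * that the socket prepends (tun_pi, if any, then the vnet header). */
size_t ex_num_buffers_offset(size_t sock_hlen)
{
	assert(sock_hlen >= sizeof(uint16_t));
	return sock_hlen - sizeof(uint16_t);
}

/* Patch num_buffers over the last two bytes of the socket header.
 * Assumes a little-endian layout for the purposes of this sketch. */
void ex_patch_num_buffers(uint8_t *buf, size_t sock_hlen, uint16_t num_buffers)
{
	size_t off = ex_num_buffers_offset(sock_hlen);

	buf[off] = (uint8_t)(num_buffers & 0xff);
	buf[off + 1] = (uint8_t)(num_buffers >> 8);
}
```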
