Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()

From: Jason Wang <jasowang@redhat.com>
To: David Woodhouse <dwmw2@infradead.org>, netdev@vger.kernel.org
Cc: "Eugenio Pérez" <eperezma@redhat.com>,
	"Willem de Bruijn" <willemb@google.com>,
	"Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [PATCH v3 1/5] net: add header len parameter to tun_get_socket(), tap_get_socket()
Date: Mon, 28 Jun 2021 12:22:20 +0800
Message-ID: <8b274bbf-56d8-554e-3aac-077883245e7f@redhat.com>
In-Reply-To: <4b33ed9ac98c28e8980043d482cc3549acfba799.camel@infradead.org>

On 2021/6/25 4:23 PM, David Woodhouse wrote:
> On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote:
>> On 2021/6/24 8:30 PM, David Woodhouse wrote:
>>> From: David Woodhouse <dwmw@amazon.co.uk>
>>>
>>> The vhost-net driver was making wild assumptions about the header length
>>> of the underlying tun/tap socket.
>>
>> It's by design to depend on the userspace to co-ordinate the vnet header
>> setting with the underlying sockets.
>>
>>
>>>    Then it was discarding packets if
>>> the number of bytes it got from sock_recvmsg() didn't precisely match
>>> its guess.
>>
>> Is anything broken by this? The failure is a hint to userspace
>> that something went wrong during the coordination.
> I am not a fan of this approach. I firmly believe that for a given
> configuration, the kernel should either *work* or it should gracefully
> refuse to set it up that way. And the requirements should be clearly
> documented.

That works only if all the logic is implemented in the same module, which
is not the case in e.g. the networking stack, where a packet has to
traverse several modules.

E.g. in this case, the vnet header size of the TAP can be changed at any
time via TUNSETVNETHDRSZ, and tuntap is unaware of the existence of
vhost_net. This makes it impossible to refuse the configuration at setup
time (SET_BACKEND).
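
As a rough illustration of the point above (a sketch only, not a proposal),
userspace can resize the vnet header on a tap fd with a single ioctl. The
helper names tap_set_vnet_hdr_sz() and tap_get_vnet_hdr_sz() are
hypothetical; TUNSETVNETHDRSZ and TUNGETVNETHDRSZ are the existing ioctls
from <linux/if_tun.h>.

```c
#include <assert.h>		/* only for caller-side checks */
#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Hypothetical helper: change the vnet header size of an open tap fd.
 * Returns 0 on success, -1 on error (errno is set by ioctl()).
 * tuntap handles this on its own; it has no knowledge of vhost_net.
 */
int tap_set_vnet_hdr_sz(int tap_fd, int hdr_sz)
{
	return ioctl(tap_fd, TUNSETVNETHDRSZ, &hdr_sz);
}

/* Hypothetical helper: read back the current vnet header size.
 * Returns 0 on success and stores the size in *hdr_sz, -1 on error.
 */
int tap_get_vnet_hdr_sz(int tap_fd, int *hdr_sz)
{
	return ioctl(tap_fd, TUNGETVNETHDRSZ, hdr_sz);
}
```

Since this can be called before or after the fd has been handed to vhost,
a check done only once at SET_BACKEND time could go stale.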

>
> Having been on the receiving end of this "hint" of which you speak, I
> found it distinctly suboptimal as a user interface. I was left
> scrabbling around trying to find a set of options which *would* work,
> and it was only through debugging the kernel that I managed to work out
> that I:
>
>    • MUST set IFF_NO_PI
>    • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX
>    • MUST use a virtio_net_hdr that I don't want
>
> If my application failed to do any of those things, I got a silent
> failure to transport any packets.

Yes, this is because of a bug when using vhost_net + PI/TUN. And I guess
the reason is that nobody has tried to use that combination in the past.

I'm not even sure it's a valid setup, since vhost-net is a virtio-net
kernel server which is not expected to handle L3 packets or the PI header
(which is Linux specific and out of the scope of the virtio spec).

>   The only thing I could do *without*
> debugging the kernel was tcpdump on the 'tun0' interface and see if the
> TX packets I put into the ring were even making it to the interface,
> and what they looked like if they did. (Losing the first 14 bytes and
> having the *next* 14 bytes scribbled on by an Ethernet header was a fun
> one.)

The tricky part is that the networking stack thinks the packet was
successfully received, but it was actually dropped by vhost-net.

And there's no obvious userspace API to report such drops, such as
statistics counters or tracepoints. Maybe we can tweak vhost for better
logging in this case.

>
>>> Fix it to get the correct information along with the socket itself.
>>
>> I'm not sure what is fixed by this. It looks to me like it tries to let
>> packets through even if userspace set the wrong attributes on tap or
>> vhost. This is even less optimal than explicitly failing the RX.
> I'm OK with explicit failure. But once I'd let it *get* the information
> from the underlying socket in order to decide whether it should fail or
> not, it turned out to be easy enough just to make those configs work
> anyway.

The problem is that this change may make some wrong configurations
"work" silently at the level of vhost or TAP. When using this for a VM,
it would make debugging even harder.

>
> The main case where that "easy enough" is stretched a little (IMO) was
> when there's a tun_pi header. I have one more of your emails to reply
> to after this, and I'll address that there.
>
>
>>> As a side-effect, this means that tun_get_socket() won't work if the
>>> tun file isn't actually connected to a device, since there's no 'tun'
>>> yet in that case to get the information from.
>>
>> This may break existing applications. Vhost-net is tied to the socket
>> rather than to the device, to which the socket is only loosely coupled.
> Hm. Perhaps the PI and vnet hdr should be considered an option of the
> *socket* (which is tied to the tfile), not purely an option of the
> underlying device?

That is how it is done in macvtap, though. It's probably too late to
change tuntap.

>
> Or maybe it's sufficient just to get the flags from *either* tfile->tun
> or tfile->detached, so that it works when the queue is detached. I'll
> take a look.
>
> I suppose we could even have a fallback that makes stuff up like we do
> today. If the user attempts to attach a tun file descriptor to vhost
> without ever calling TUNSETIFF on it first, *then* we make the same
> assumptions we do today?

Then I would rather keep using the assumption, because:

1) the value obtained from get_socket() might not be correct
2) the complexity and risk buy only a very small improvement in
debuggability (which is itself still questionable).

>
>>> --- a/drivers/vhost/net.c
>>> +++ b/drivers/vhost/net.c
>>> @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)
>>>    
>>>    	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
>>>    		vq->log : NULL;
>>> -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
>>> +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&
>>> +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));
>>
>> So this changes the behavior. When mergeable buffers are enabled,
>> userspace expects vhost to merge buffers. If the feature is silently
>> disabled, it violates the virtio spec.
>>
>> If anything is wrong in the setup, userspace just breaks itself.
>>
>> E.g. if sock_hlen is less than struct virtio_net_hdr_mrg_rxbuf, the
>> packet header might be overwritten by the vnet header.
> This wasn't intended to change the behaviour of any code path that is
> already working today. If *either* vhost or the underlying device have
> provided a vnet header, we still merge.
>
> If *neither* provide a vnet hdr, there's nowhere to put num_buffers and
> we can't merge.
>
> That code path doesn't work at all today, but does after my patches.

It looks to me like it's a bug that userspace can keep working in this
case. After mergeable rx buffers are negotiated, userspace should always
assume that vhost-net provides num_buffers.
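
To make the condition under discussion concrete, here is a small userspace
model of the expression from the hunk above. It is only a sketch:
model_mergeable() is a hypothetical name, and the parameters stand in for
the feature bit, vhost_hlen and sock_hlen in handle_rx().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the patched expression; not the vhost code.
 * Buffers are merged only when the feature was negotiated AND some
 * header (from vhost or from the socket) has room for the 16-bit
 * num_buffers field.
 */
bool model_mergeable(bool mrg_rxbuf_negotiated,
		     size_t vhost_hlen, size_t sock_hlen)
{
	return mrg_rxbuf_negotiated &&
	       (vhost_hlen != 0 || sock_hlen >= sizeof(uint16_t));
}
```

With the feature negotiated and both lengths zero, the model returns
false, i.e. merging is dropped silently. That is the case I consider a
bug rather than something to keep working.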

> But you're right, we should explicitly refuse to negotiate
> VIRTIO_NET_F_MRG_RXBUF in that case.

This would be very hard:

1) VHOST_SET_FEATURES and VHOST_NET_SET_BACKEND are two different ioctls
2) vhost_net is not tightly coupled with tuntap, vnet header size could 
be changed by userspace at any time

>
>>>    
>>>    	do {
>>>    		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,
>>> @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)
>>>    			}
>>>    		} else {
>>>    			/* Header came from socket; we'll need to patch
>>> -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF
>>> +			 * ->num_buffers over the last two bytes if
>>> +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.
>>>    			 */
>>> -			iov_iter_advance(&fixup, sizeof(hdr));
>>> +			iov_iter_advance(&fixup, sock_hlen - 2);
>>
>> I'm not sure what the above code is meant to fix. It doesn't change
>> anything if the vnet header is set correctly in TUN. It only prevents
>> the packet header from being rewritten.
>>
> It fixes the case where the virtio_net_hdr isn't at the start of the
> tun header, because the tun actually puts the tun_pi struct *first*,
> and *then* the virtio_net_hdr.

Right.
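
For the record, the offsets work out as in the sketch below. The structs
are local mirrors of struct tun_pi, struct virtio_net_hdr and struct
virtio_net_hdr_mrg_rxbuf written out by hand for illustration, and the
function names are hypothetical. It assumes the tap vnet header size
equals the mergeable header size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-written mirrors of the UAPI layouts, for illustration only. */
struct model_tun_pi {
	uint16_t flags;
	uint16_t proto;
};

struct model_vnet_hdr {
	uint8_t flags;
	uint8_t gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

struct model_vnet_hdr_mrg {
	struct model_vnet_hdr hdr;
	uint16_t num_buffers;
};

/* Socket header length: optional tun_pi first, then the vnet header. */
size_t model_sock_hlen(int has_pi)
{
	return (has_pi ? sizeof(struct model_tun_pi) : 0) +
	       sizeof(struct model_vnet_hdr_mrg);
}

/* Old code: fixed offset from the start of the socket header. */
size_t num_buffers_off_old(void)
{
	return sizeof(struct model_vnet_hdr);
}

/* Patched code: last two bytes of the socket header. */
size_t num_buffers_off_new(size_t sock_hlen)
{
	assert(sock_hlen >= sizeof(uint16_t));
	return sock_hlen - sizeof(uint16_t);
}
```

Without PI both offsets are 10, so nothing changes. With PI the socket
header is 16 bytes and num_buffers sits at offset 14, while the old fixed
offset of 10 lands inside the virtio_net_hdr itself.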

> The num_buffers field needs to go at the *end* of sock_hlen. Not at a
> fixed offset from the *start* of it.
>
> At least, that's true unless we want to just declare that we *only*
> support TUN with the IFF_NO_PI flag. (qv).

Yes, that's a good question. This is probably a hint that "vhost-net was
never designed to work with PI", and even if that's not true, I'm not sure
whether it's too late to fix.

Thanks