Re: [PATCH net v3 2/2] vhost_net: fix tx queue stuck when sendmsg fails

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: wangyunjian <wangyunjian@huawei.com>
Cc: Network Development <netdev@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	virtualization@lists.linux-foundation.org,
	"Lilijun (Jerry)" <jerry.lilijun@huawei.com>,
	chenchanghu <chenchanghu@huawei.com>,
	xudingke <xudingke@huawei.com>,
	"huangbin (J)" <brian.huangbin@huawei.com>
Subject: Re: [PATCH net v3 2/2] vhost_net: fix tx queue stuck when sendmsg fails
Date: Wed, 23 Dec 2020 12:05:38 -0500
Message-ID: <CAF=yD-KSm4fTWUZy1F2gFOw-qLmMV76rHmzcr05Upz9WV=SXvg@mail.gmail.com>
In-Reply-To: <1608734856-12516-1-git-send-email-wangyunjian@huawei.com>

On Wed, Dec 23, 2020 at 9:47 AM wangyunjian <wangyunjian@huawei.com> wrote:
>
> From: Yunjian Wang <wangyunjian@huawei.com>
>
> Currently the driver don't drop a packet which can't be send by tun
> (e.g bad packet). In this case, the driver will always process the
> same packet lead to the tx queue stuck.
>
> To fix this issue:
> 1. in the case of persistent failure (e.g bad packet), the driver
> can skip this descriptior by ignoring the error.
> 2. in the case of transient failure (e.g -EAGAIN and -ENOMEM), the
> driver schedules the worker to try again.
>

Fixes: 3a4d5c94e959 ("vhost_net: a kernel-level virtio server")

Since I have a few other comments, a few minor typo corrections too:
don't -> doesn't, send -> sent, descriptior -> descriptor.

> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
>
>  drivers/vhost/net.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c8784dfafdd7..e49dd64d086a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -827,9 +827,8 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
>                                 msg.msg_flags &= ~MSG_MORE;
>                 }
>
> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>                 err = sock->ops->sendmsg(sock, &msg, len);
> -               if (unlikely(err < 0)) {
> +               if (unlikely(err == -EAGAIN || err == -ENOMEM)) {
>                         vhost_discard_vq_desc(vq, 1);
>                         vhost_net_enable_vq(net, vq);
>                         break;
> @@ -922,7 +921,6 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>                         msg.msg_flags &= ~MSG_MORE;
>                 }
>
> -               /* TODO: Check specific error and bomb out unless ENOBUFS? */
>                 err = sock->ops->sendmsg(sock, &msg, len);
>                 if (unlikely(err < 0)) {
>                         if (zcopy_used) {
> @@ -931,9 +929,11 @@ static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
>                                 nvq->upend_idx = ((unsigned)nvq->upend_idx - 1)
>                                         % UIO_MAXIOV;
>                         }
> -                       vhost_discard_vq_desc(vq, 1);
> -                       vhost_net_enable_vq(net, vq);
> -                       break;
> +                       if (err == -EAGAIN || err == -ENOMEM) {
> +                               vhost_discard_vq_desc(vq, 1);
> +                               vhost_net_enable_vq(net, vq);
> +                               break;
> +                       }
>                 }
>                 if (err != len)
>                         pr_debug("Truncated TX packet: "

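Probably my bad for the feedback on patch 2/2, but with this change
vhost will incorrectly log bad packets as truncated packets: a
persistent error no longer breaks out of the loop, so a negative err
falls through to the err != len check.

This will need to be if (err >= 0 && err != len).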
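
To make the intended behaviour concrete, here is a minimal sketch in
standalone userspace C, not kernel code. The names tx_action,
tx_classify and tx_truncated are hypothetical and exist only for
illustration; the actual change would stay inline in handle_tx_copy()
and handle_tx_zerocopy().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only; not part of drivers/vhost/net.c. */
enum tx_action {
	TX_RETRY,	/* transient error: discard the descriptor and retry later */
	TX_CONSUME,	/* sent, or persistent error: move on to the next descriptor */
};

/* The patch treats -EAGAIN and -ENOMEM as transient; anything else is consumed. */
enum tx_action tx_classify(int err)
{
	if (err == -EAGAIN || err == -ENOMEM)
		return TX_RETRY;
	return TX_CONSUME;
}

/* Only a successful but short send counts as truncated; errors never do. */
bool tx_truncated(int err, size_t len)
{
	return err >= 0 && (size_t)err != len;
}
```

With this split, a bad packet (say -EINVAL) is consumed without a
"Truncated TX packet" message, while a short send is still reported.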

It would be nice if we could notify the guest in the transmit
descriptor when a packet was dropped due to failing integrity checks
(bad packet). But I don't think we easily can, so that is out of scope
for this fix.