* [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
@ 2017-08-19  6:38 Koichiro Den
  2017-08-20 20:49 ` Willem de Bruijn
  2017-08-21 12:33 ` Jason Wang
  0 siblings, 2 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-19  6:38 UTC (permalink / raw)
  To: mst; +Cc: netdev, virtualization

Since freeing on the xmit path can be subject to unbounded delay, we had
better invoke and clear the upper layer's zerocopy callback beforehand,
to keep it from waiting in vain for an unbounded duration.
For instance, this removes a possible deadlock in the case where the
upper layer is a zerocopy-enabled vhost-net.
This does not apply if napi_tx is enabled, since the callback will then
be invoked in reasonable time.

Signed-off-by: Koichiro Den <den@klaipeden.com>
---
 drivers/net/virtio_net.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4302f313d9a7..f7deaa5b7b50 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* Don't wait up for transmitted skbs to be freed. */
 	if (!use_napi) {
+		if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
+			struct ubuf_info *uarg;
+			uarg = skb_shinfo(skb)->destructor_arg;
+			if (uarg->callback)
+			    uarg->callback(uarg, true);
+			skb_shinfo(skb)->destructor_arg = NULL;
+			skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+		}
 		skb_orphan(skb);
 		nf_reset(skb);
 	}
-- 
2.9.4


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-19  6:38 [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi Koichiro Den
@ 2017-08-20 20:49 ` Willem de Bruijn
  2017-08-21 12:40   ` Koichiro Den
  2017-08-22 12:11   ` Willem de Bruijn
  2017-08-21 12:33 ` Jason Wang
  1 sibling, 2 replies; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-20 20:49 UTC (permalink / raw)
  To: Koichiro Den; +Cc: Network Development, virtualization, Michael S. Tsirkin

On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den <den@klaipeden.com> wrote:
> Facing the possible unbounded delay relying on freeing on xmit path,
> we also better to invoke and clear the upper layer zerocopy callback
> beforehand to keep them from waiting for unbounded duration in vain.

Good point.

> For instance, this removes the possible deadlock in the case that the
> upper layer is a zerocopy-enabled vhost-net.
> This does not apply if napi_tx is enabled since it will be called in
> reasonale time.

Indeed. Btw, I am gathering data to eventually make napi the default
mode. But that is taking some time.

>
> Signed-off-by: Koichiro Den <den@klaipeden.com>
> ---
>  drivers/net/virtio_net.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4302f313d9a7..f7deaa5b7b50 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>
>         /* Don't wait up for transmitted skbs to be freed. */
>         if (!use_napi) {
> +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> +                       struct ubuf_info *uarg;
> +                       uarg = skb_shinfo(skb)->destructor_arg;
> +                       if (uarg->callback)
> +                           uarg->callback(uarg, true);
> +                       skb_shinfo(skb)->destructor_arg = NULL;
> +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> +               }

Instead of open coding, this can use skb_zcopy_clear.
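
For illustration, a minimal sketch of that substitution (assuming the
skb_zcopy_clear() helper from the MSG_ZEROCOPY series is available in this
tree; untested):

	/* Don't wait up for transmitted skbs to be freed. */
	if (!use_napi) {
		/* invoke any zerocopy callback with success == true and
		 * detach the zerocopy state from the skb, then orphan it
		 * as before
		 */
		skb_zcopy_clear(skb, true);
		skb_orphan(skb);
		nf_reset(skb);
	}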

>                 skb_orphan(skb);
>                 nf_reset(skb);
>         }
> --
> 2.9.4
>
>


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-19  6:38 [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi Koichiro Den
  2017-08-20 20:49 ` Willem de Bruijn
@ 2017-08-21 12:33 ` Jason Wang
  2017-08-21 12:58   ` Koichiro Den
  2017-08-21 15:41   ` Willem de Bruijn
  1 sibling, 2 replies; 47+ messages in thread
From: Jason Wang @ 2017-08-21 12:33 UTC (permalink / raw)
  To: Koichiro Den, mst; +Cc: virtualization, netdev



On 2017年08月19日 14:38, Koichiro Den wrote:
> Facing the possible unbounded delay relying on freeing on xmit path,
> we also better to invoke and clear the upper layer zerocopy callback
> beforehand to keep them from waiting for unbounded duration in vain.
> For instance, this removes the possible deadlock in the case that the
> upper layer is a zerocopy-enabled vhost-net.
> This does not apply if napi_tx is enabled since it will be called in
> reasonale time.
>
> Signed-off-by: Koichiro Den <den@klaipeden.com>
> ---
>   drivers/net/virtio_net.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4302f313d9a7..f7deaa5b7b50 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>   
>   	/* Don't wait up for transmitted skbs to be freed. */
>   	if (!use_napi) {
> +		if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> +			struct ubuf_info *uarg;
> +			uarg = skb_shinfo(skb)->destructor_arg;
> +			if (uarg->callback)
> +			    uarg->callback(uarg, true);
> +			skb_shinfo(skb)->destructor_arg = NULL;
> +			skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> +		}
>   		skb_orphan(skb);
>   		nf_reset(skb);
>   	}


Interesting, deadlock could be treated as a radical case of the 
discussion here: https://patchwork.kernel.org/patch/3787671/.

git grep tells more similar skb_orphan() cases. Do we need to change 
them all (or part)?

Actually, we may meet similar issues in many other places (e.g. netem). 
We need to consider a complete solution for this. Figuring out all the 
places that could delay a packet is one approach.

Thanks


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-20 20:49 ` Willem de Bruijn
@ 2017-08-21 12:40   ` Koichiro Den
  2017-08-22 12:11   ` Willem de Bruijn
  1 sibling, 0 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-21 12:40 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael S. Tsirkin, virtualization, Network Development

On Sun, 2017-08-20 at 16:49 -0400, Willem de Bruijn wrote:
> On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den <den@klaipeden.com> wrote:
> > Facing the possible unbounded delay relying on freeing on xmit path,
> > we also better to invoke and clear the upper layer zerocopy callback
> > beforehand to keep them from waiting for unbounded duration in vain.
> 
> Good point.
> 
> > For instance, this removes the possible deadlock in the case that the
> > upper layer is a zerocopy-enabled vhost-net.
> > This does not apply if napi_tx is enabled since it will be called in
> > reasonale time.
> 
> Indeed. Btw, I am gathering data to eventually make napi the default
> mode. But that is taking some time.
I'm looking forward to it. Btw, I am just guessing that somewhat delaying napi
disabling would relieve much of the interrupt cost, though it might need to be
stateful, which sounds a bit invasive. That is to say, not-yet-used entries are
not necessarily going to be used in the near future, so we have to limit the
delay with some sort of counter. Just guessing as a virtio novice, so please
take it with a grain of salt ;)
> 
> > 
> > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > ---
> >  drivers/net/virtio_net.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 4302f313d9a7..f7deaa5b7b50 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> > struct net_device *dev)
> > 
> >         /* Don't wait up for transmitted skbs to be freed. */
> >         if (!use_napi) {
> > +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > +                       struct ubuf_info *uarg;
> > +                       uarg = skb_shinfo(skb)->destructor_arg;
> > +                       if (uarg->callback)
> > +                           uarg->callback(uarg, true);
> > +                       skb_shinfo(skb)->destructor_arg = NULL;
> > +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > +               }
> 
> Instead of open coding, this can use skb_zcopy_clear.
I didn't notice it; it seems necessary and sufficient. Please let me repost.
Thank you.
> 
> >                 skb_orphan(skb);
> >                 nf_reset(skb);
> >         }
> > --
> > 2.9.4
> > 
> > 


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-21 12:33 ` Jason Wang
@ 2017-08-21 12:58   ` Koichiro Den
  2017-08-21 15:41   ` Willem de Bruijn
  1 sibling, 0 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-21 12:58 UTC (permalink / raw)
  To: Jason Wang, mst; +Cc: virtualization, netdev

On Mon, 2017-08-21 at 20:33 +0800, Jason Wang wrote:
> 
> On 2017年08月19日 14:38, Koichiro Den wrote:
> > Facing the possible unbounded delay relying on freeing on xmit path,
> > we also better to invoke and clear the upper layer zerocopy callback
> > beforehand to keep them from waiting for unbounded duration in vain.
> > For instance, this removes the possible deadlock in the case that the
> > upper layer is a zerocopy-enabled vhost-net.
> > This does not apply if napi_tx is enabled since it will be called in
> > reasonale time.
> > 
> > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > ---
> >   drivers/net/virtio_net.c | 8 ++++++++
> >   1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 4302f313d9a7..f7deaa5b7b50 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> > struct net_device *dev)
> >   
> >   	/* Don't wait up for transmitted skbs to be freed. */
> >   	if (!use_napi) {
> > +		if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > +			struct ubuf_info *uarg;
> > +			uarg = skb_shinfo(skb)->destructor_arg;
> > +			if (uarg->callback)
> > +			    uarg->callback(uarg, true);
> > +			skb_shinfo(skb)->destructor_arg = NULL;
> > +			skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > +		}
> >   		skb_orphan(skb);
> >   		nf_reset(skb);
> >   	}
> 
> 
> Interesting, deadlock could be treated as a a radical case of the 
> discussion here https://patchwork.kernel.org/patch/3787671/.
Thanks for letting me know this one. I am going to read it.
> 
> git grep tells more similar skb_orphan() cases. Do we need to change 
> them all (or part)?
Maybe, even though it seems less likely that we would hit it there than in the
example I described, i.e. virtio-net sandwiched between vhost-net layers.
> 
> Actually, we may meet similar issues at many other places (e.g netem). 
> Need to consider a complete solution for this. Figuring out all places 
> that could delay a packet is a method.
Okay I will do it and post the result and a suggestion if possible. Thanks.
> 
> Thanks


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-21 12:33 ` Jason Wang
  2017-08-21 12:58   ` Koichiro Den
@ 2017-08-21 15:41   ` Willem de Bruijn
  2017-08-22  2:50     ` Jason Wang
  1 sibling, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-21 15:41 UTC (permalink / raw)
  To: Jason Wang
  Cc: Koichiro Den, Michael S. Tsirkin, virtualization, Network Development

On Mon, Aug 21, 2017 at 8:33 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017年08月19日 14:38, Koichiro Den wrote:
>>
>> Facing the possible unbounded delay relying on freeing on xmit path,
>> we also better to invoke and clear the upper layer zerocopy callback
>> beforehand to keep them from waiting for unbounded duration in vain.
>> For instance, this removes the possible deadlock in the case that the
>> upper layer is a zerocopy-enabled vhost-net.
>> This does not apply if napi_tx is enabled since it will be called in
>> reasonale time.
>>
>> Signed-off-by: Koichiro Den <den@klaipeden.com>
>> ---
>>   drivers/net/virtio_net.c | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 4302f313d9a7..f7deaa5b7b50 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
>> struct net_device *dev)
>>         /* Don't wait up for transmitted skbs to be freed. */
>>         if (!use_napi) {
>> +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
>> +                       struct ubuf_info *uarg;
>> +                       uarg = skb_shinfo(skb)->destructor_arg;
>> +                       if (uarg->callback)
>> +                           uarg->callback(uarg, true);
>> +                       skb_shinfo(skb)->destructor_arg = NULL;
>> +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
>> +               }
>>                 skb_orphan(skb);
>>                 nf_reset(skb);
>>         }
>
>
>
> Interesting, deadlock could be treated as a a radical case of the discussion
> here https://patchwork.kernel.org/patch/3787671/.
>
> git grep tells more similar skb_orphan() cases. Do we need to change them
> all (or part)?

Most skb_orphan calls are not relevant to the issue of transmit delay.

> Actually, we may meet similar issues at many other places (e.g netem).

Netem is an interesting case. Because it is intended to mimic network
delay, at least in the case where it calls skb_orphan, it may make
sense to release all references, including calling skb_zcopy_clear.

In general, zerocopy reverts to copy on all paths that may cause
unbounded delay due to another process. Guarding against delay
induced by the administrator is infeasible. It is always possible to
just pause the nic. Netem is one instance of that, and not unbounded.

> Need
> to consider a complete solution for this. Figuring out all places that could
> delay a packet is a method.

The issue described in the referenced patch seems like head of line
blocking between two flows. If one flow delays zerocopy descriptor
release from the vhost-net pool, it blocks all subsequent descriptors
in that pool from being released, also delaying other flows that use
the same descriptor pool. If the pool is empty, all transmission stops.
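
To illustrate the head-of-line blocking, a simplified stand-alone sketch
(not the actual vhost code; signal_used() and struct desc are made up for
illustration): completions are signalled strictly in submission order, so
the scan stops at the first descriptor whose DMA has not finished, even if
many later ones already have.

	#include <stdbool.h>

	#define POOL_SIZE 64

	struct desc { bool dma_done; };

	static int signal_used(const struct desc *heads, int done_idx, int upend_idx)
	{
		int i, signalled = 0;

		for (i = done_idx; i != upend_idx; i = (i + 1) % POOL_SIZE) {
			if (!heads[i].dma_done)
				break;       /* one delayed skb holds back the rest */
			signalled++;         /* stands in for vhost_add_used_and_signal() */
		}
		return signalled;
	}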

Reverting to copy tx when the pool reaches a low watermark, as the
patch does, fixes this. Perhaps the descriptor pool should also be
revised to allow out of order completions. Then there is no need to
copy zerocopy packets whenever they may experience delay.

On the point of counting copy vs zerocopy: the new msg_zerocopy
variant of ubuf_info has a field to record whether a deep copy was
made. This can be used with vhost-net zerocopy, too.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-21 15:41   ` Willem de Bruijn
@ 2017-08-22  2:50     ` Jason Wang
  2017-08-22  3:10       ` Willem de Bruijn
  2017-08-22 17:55       ` Michael S. Tsirkin
  0 siblings, 2 replies; 47+ messages in thread
From: Jason Wang @ 2017-08-22  2:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Michael S. Tsirkin, virtualization, Network Development



On 2017年08月21日 23:41, Willem de Bruijn wrote:
> On Mon, Aug 21, 2017 at 8:33 AM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017年08月19日 14:38, Koichiro Den wrote:
>>> Facing the possible unbounded delay relying on freeing on xmit path,
>>> we also better to invoke and clear the upper layer zerocopy callback
>>> beforehand to keep them from waiting for unbounded duration in vain.
>>> For instance, this removes the possible deadlock in the case that the
>>> upper layer is a zerocopy-enabled vhost-net.
>>> This does not apply if napi_tx is enabled since it will be called in
>>> reasonale time.
>>>
>>> Signed-off-by: Koichiro Den <den@klaipeden.com>
>>> ---
>>>    drivers/net/virtio_net.c | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 4302f313d9a7..f7deaa5b7b50 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
>>> struct net_device *dev)
>>>          /* Don't wait up for transmitted skbs to be freed. */
>>>          if (!use_napi) {
>>> +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
>>> +                       struct ubuf_info *uarg;
>>> +                       uarg = skb_shinfo(skb)->destructor_arg;
>>> +                       if (uarg->callback)
>>> +                           uarg->callback(uarg, true);
>>> +                       skb_shinfo(skb)->destructor_arg = NULL;
>>> +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
>>> +               }
>>>                  skb_orphan(skb);
>>>                  nf_reset(skb);
>>>          }
>>
>>
>> Interesting, deadlock could be treated as a a radical case of the discussion
>> here https://patchwork.kernel.org/patch/3787671/.
>>
>> git grep tells more similar skb_orphan() cases. Do we need to change them
>> all (or part)?
> Most skb_orphan calls are not relevant to the issue of transmit delay.

Yes, but at least we should audit the ones in drivers/net.

>
>> Actually, we may meet similar issues at many other places (e.g netem).
> Netem is an interesting case. Because it is intended to mimic network
> delay, at least in the case where it calls skb_orphan, it may make
> sense to release all references, including calling skb_zcopy_clear.
>
> In general, zerocopy reverts to copy on all paths that may cause
> unbounded delay due to another process. Guarding against delay
> induced by the administrator is infeasible. It is always possible to
> just pause the nic. Netem is one instance of that, and not unbounded.

The problem is, the admin may only delay the traffic on e.g. one interface, 
but it actually delays or stalls all traffic inside a VM.

>
>> Need
>> to consider a complete solution for this. Figuring out all places that could
>> delay a packet is a method.
> The issue described in the referenced patch seems like head of line
> blocking between two flows. If one flow delays zerocopy descriptor
> release from the vhost-net pool, it blocks all subsequent descriptors
> in that pool from being released, also delaying other flows that use
> the same descriptor pool. If the pool is empty, all transmission stopped.
>
> Reverting to copy tx when the pool reaches a low watermark, as the
> patch does, fixes this.

An issue of the referenced patch is that sndbuf could be smaller than 
low watermark.

> Perhaps the descriptor pool should also be
> revised to allow out of order completions. Then there is no need to
> copy zerocopy packets whenever they may experience delay.

Yes, but as replied in the referenced thread, windows driver may treat 
out of order completion as a bug.

>
> On the point of counting copy vs zerocopy: the new msg_zerocopy
> variant of ubuf_info has a field to record whether a deep copy was
> made. This can be used with vhost-net zerocopy, too.

Just to make sure I understand. It's still not clear to me how to reuse 
this for vhost-net, e.g zerocopy flag is in a union which is not used by 
vhost_net.

Thanks


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22  2:50     ` Jason Wang
@ 2017-08-22  3:10       ` Willem de Bruijn
  2017-08-22 11:47         ` Jason Wang
  2017-08-22 13:42         ` Koichiro Den
  2017-08-22 17:55       ` Michael S. Tsirkin
  1 sibling, 2 replies; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-22  3:10 UTC (permalink / raw)
  To: Jason Wang
  Cc: Koichiro Den, Michael S. Tsirkin, virtualization, Network Development

>>> Interesting, deadlock could be treated as a a radical case of the
>>> discussion
>>> here https://patchwork.kernel.org/patch/3787671/.
>>>
>>> git grep tells more similar skb_orphan() cases. Do we need to change them
>>> all (or part)?
>>
>> Most skb_orphan calls are not relevant to the issue of transmit delay.
>
>
> Yes, but at least we should audit the ones in drivers/net.

Do you mean other virtual device driver transmit paths, like xen,
specifically?

>>> Actually, we may meet similar issues at many other places (e.g netem).
>>
>> Netem is an interesting case. Because it is intended to mimic network
>> delay, at least in the case where it calls skb_orphan, it may make
>> sense to release all references, including calling skb_zcopy_clear.
>>
>> In general, zerocopy reverts to copy on all paths that may cause
>> unbounded delay due to another process. Guarding against delay
>> induced by the administrator is infeasible. It is always possible to
>> just pause the nic. Netem is one instance of that, and not unbounded.
>
>
> The problem is, admin may only delay the traffic in e.g one interface, but
> it actually delay or stall all traffic inside a VM.

Understood. Ideally we can remove the HoL blocking cause of this,
itself.

>>> Need
>>> to consider a complete solution for this. Figuring out all places that
>>> could
>>> delay a packet is a method.
>>
>> The issue described in the referenced patch seems like head of line
>> blocking between two flows. If one flow delays zerocopy descriptor
>> release from the vhost-net pool, it blocks all subsequent descriptors
>> in that pool from being released, also delaying other flows that use
>> the same descriptor pool. If the pool is empty, all transmission stopped.
>>
>> Reverting to copy tx when the pool reaches a low watermark, as the
>> patch does, fixes this.
>
>
> An issue of the referenced patch is that sndbuf could be smaller than low
> watermark.
>
>> Perhaps the descriptor pool should also be
>> revised to allow out of order completions. Then there is no need to
>> copy zerocopy packets whenever they may experience delay.
>
>
> Yes, but as replied in the referenced thread, windows driver may treat out
> of order completion as a bug.

Interesting. I missed that. Perhaps the zerocopy optimization
could be gated on guest support for out of order completions.

>> On the point of counting copy vs zerocopy: the new msg_zerocopy
>> variant of ubuf_info has a field to record whether a deep copy was
>> made. This can be used with vhost-net zerocopy, too.
>
>
> Just to make sure I understand. It's still not clear to me how to reuse this
> for vhost-net, e.g zerocopy flag is in a union which is not used by
> vhost_net.

True, but that is not set in stone. I went back and forth on that when
preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
with msg_zerocopy"). The field can be moved outside the union and
initialized in the other zerocopy paths.
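
For context, roughly the layout in question (paraphrased from
include/linux/skbuff.h after the MSG_ZEROCOPY series; field order and
details may differ, and memory-accounting state is omitted):

	struct ubuf_info {
		void (*callback)(struct ubuf_info *, bool zerocopy_success);
		union {
			struct {                /* used by vhost-net zerocopy */
				unsigned long desc;
				void *ctx;
			};
			struct {                /* used by MSG_ZEROCOPY */
				u32 id;
				u16 len;
				u16 zerocopy:1; /* cleared if a deep copy was made */
				u32 bytelen;
			};
		};
		refcount_t refcnt;
	};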


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22  3:10       ` Willem de Bruijn
@ 2017-08-22 11:47         ` Jason Wang
  2017-08-22 13:42         ` Koichiro Den
  1 sibling, 0 replies; 47+ messages in thread
From: Jason Wang @ 2017-08-22 11:47 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, virtualization, Koichiro Den, Michael S. Tsirkin



On 2017年08月22日 11:10, Willem de Bruijn wrote:
>>>> Interesting, deadlock could be treated as a a radical case of the
>>>> discussion
>>>> here https://patchwork.kernel.org/patch/3787671/.
>>>>
>>>> git grep tells more similar skb_orphan() cases. Do we need to change them
>>>> all (or part)?
>>> Most skb_orphan calls are not relevant to the issue of transmit delay.
>>
>> Yes, but at least we should audit the ones in drivers/net.
> Do you mean other virtual device driver transmit paths, like xen,
> specifically?

Git grep does not show skb_orphan() being used for xen, at least for me. But 
looking at cxgb4/sge.c, it seems to call skb_orphan() for large packets and 
to reclaim transmitted packets when:

- doing ndo_start_xmit()
- or a timer fires.

>>>> Actually, we may meet similar issues at many other places (e.g netem).
>>> Netem is an interesting case. Because it is intended to mimic network
>>> delay, at least in the case where it calls skb_orphan, it may make
>>> sense to release all references, including calling skb_zcopy_clear.
>>>
>>> In general, zerocopy reverts to copy on all paths that may cause
>>> unbounded delay due to another process. Guarding against delay
>>> induced by the administrator is infeasible. It is always possible to
>>> just pause the nic. Netem is one instance of that, and not unbounded.
>>
>> The problem is, admin may only delay the traffic in e.g one interface, but
>> it actually delay or stall all traffic inside a VM.
> Understood. Ideally we can remove the HoL blocking cause of this,
> itself.
>
>>>> Need
>>>> to consider a complete solution for this. Figuring out all places that
>>>> could
>>>> delay a packet is a method.
>>> The issue described in the referenced patch seems like head of line
>>> blocking between two flows. If one flow delays zerocopy descriptor
>>> release from the vhost-net pool, it blocks all subsequent descriptors
>>> in that pool from being released, also delaying other flows that use
>>> the same descriptor pool. If the pool is empty, all transmission stopped.
>>>
>>> Reverting to copy tx when the pool reaches a low watermark, as the
>>> patch does, fixes this.
>>
>> An issue of the referenced patch is that sndbuf could be smaller than low
>> watermark.
>>
>>> Perhaps the descriptor pool should also be
>>> revised to allow out of order completions. Then there is no need to
>>> copy zerocopy packets whenever they may experience delay.
>>
>> Yes, but as replied in the referenced thread, windows driver may treat out
>> of order completion as a bug.
> Interesting. I missed that. Perhaps the zerocopy optimization
> could be gated on guest support for out of order completions.

Yes, we may plan to explicitly notify the driver about out-of-order 
completion in a future virtio version.

>
>>> On the point of counting copy vs zerocopy: the new msg_zerocopy
>>> variant of ubuf_info has a field to record whether a deep copy was
>>> made. This can be used with vhost-net zerocopy, too.
>>
>> Just to make sure I understand. It's still not clear to me how to reuse this
>> for vhost-net, e.g zerocopy flag is in a union which is not used by
>> vhost_net.
> True, but that is not set in stone. I went back and forth on that when
> preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
> with msg_zerocopy"). The field can be moved outside the union and
> initialized in the other zerocopy paths.

Ok. I see.

Thanks


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-20 20:49 ` Willem de Bruijn
  2017-08-21 12:40   ` Koichiro Den
@ 2017-08-22 12:11   ` Willem de Bruijn
  2017-08-22 14:04     ` Koichiro Den
  1 sibling, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-22 12:11 UTC (permalink / raw)
  To: Koichiro Den; +Cc: Michael S. Tsirkin, virtualization, Network Development

On Sun, Aug 20, 2017 at 4:49 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den <den@klaipeden.com> wrote:
>> Facing the possible unbounded delay relying on freeing on xmit path,
>> we also better to invoke and clear the upper layer zerocopy callback
>> beforehand to keep them from waiting for unbounded duration in vain.
>
> Good point.
>
>> For instance, this removes the possible deadlock in the case that the
>> upper layer is a zerocopy-enabled vhost-net.
>> This does not apply if napi_tx is enabled since it will be called in
>> reasonale time.
>
> Indeed. Btw, I am gathering data to eventually make napi the default
> mode. But that is taking some time.
>
>>
>> Signed-off-by: Koichiro Den <den@klaipeden.com>
>> ---
>>  drivers/net/virtio_net.c | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 4302f313d9a7..f7deaa5b7b50 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>>
>>         /* Don't wait up for transmitted skbs to be freed. */
>>         if (!use_napi) {
>> +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
>> +                       struct ubuf_info *uarg;
>> +                       uarg = skb_shinfo(skb)->destructor_arg;
>> +                       if (uarg->callback)
>> +                           uarg->callback(uarg, true);
>> +                       skb_shinfo(skb)->destructor_arg = NULL;
>> +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
>> +               }
>
> Instead of open coding, this can use skb_zcopy_clear.

It is not correct to only send the zerocopy completion here, as
the skb will still be shared with the nic. The only safe approach
to notifying early is to create a copy with skb_orphan_frags_rx.
That will call skb_zcopy_clear(skb, false) internally. But the
copy will be very expensive.
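
(Roughly what that helper does, paraphrased and simplified from
include/linux/skbuff.h:)

	static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
	{
		if (likely(!skb_zcopy(skb)))
			return 0;
		/* copy the user frags into kernel pages, then notify the
		 * ubuf owner with success == false (a copy was needed)
		 */
		return skb_copy_ubufs(skb, gfp_mask);
	}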

Is the deadlock you refer to the case where tx interrupts are
disabled, so transmit completions are only handled in start_xmit
and somehow the socket(s) are unable to send new data? The
question is what is blocking them. If it is zerocopy, it may be a
too low optmem_max or hitting the per-user locked pages limit.
We may have to keep interrupts enabled when zerocopy skbs
are in flight.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22  3:10       ` Willem de Bruijn
  2017-08-22 11:47         ` Jason Wang
@ 2017-08-22 13:42         ` Koichiro Den
  2017-08-22 17:16           ` Willem de Bruijn
  1 sibling, 1 reply; 47+ messages in thread
From: Koichiro Den @ 2017-08-22 13:42 UTC (permalink / raw)
  To: Willem de Bruijn, Jason Wang
  Cc: Network Development, virtualization, Michael S. Tsirkin

On Mon, 2017-08-21 at 23:10 -0400, Willem de Bruijn wrote:
> > > > Interesting, deadlock could be treated as a a radical case of the
> > > > discussion
> > > > here https://patchwork.kernel.org/patch/3787671/.
> > > > 
> > > > git grep tells more similar skb_orphan() cases. Do we need to change
> > > > them
> > > > all (or part)?
> > > 
> > > Most skb_orphan calls are not relevant to the issue of transmit delay.
> > 
> > 
> > Yes, but at least we should audit the ones in drivers/net.
> 
> Do you mean other virtual device driver transmit paths, like xen,
> specifically?
> 
> > > > Actually, we may meet similar issues at many other places (e.g netem).
> > > 
> > > Netem is an interesting case. Because it is intended to mimic network
> > > delay, at least in the case where it calls skb_orphan, it may make
> > > sense to release all references, including calling skb_zcopy_clear.
> > > 
> > > In general, zerocopy reverts to copy on all paths that may cause
> > > unbounded delay due to another process. Guarding against delay
> > > induced by the administrator is infeasible. It is always possible to
> > > just pause the nic. Netem is one instance of that, and not unbounded.
> > 
> > 
> > The problem is, admin may only delay the traffic in e.g one interface, but
> > it actually delay or stall all traffic inside a VM.
> 
> Understood. Ideally we can remove the HoL blocking cause of this,
> itself.
> 
> > > > Need
> > > > to consider a complete solution for this. Figuring out all places that
> > > > could
> > > > delay a packet is a method.
> > > 
> > > The issue described in the referenced patch seems like head of line
> > > blocking between two flows. If one flow delays zerocopy descriptor
> > > release from the vhost-net pool, it blocks all subsequent descriptors
> > > in that pool from being released, also delaying other flows that use
> > > the same descriptor pool. If the pool is empty, all transmission stopped.
> > > 
> > > Reverting to copy tx when the pool reaches a low watermark, as the
> > > patch does, fixes this.
> > 
> > 
> > An issue of the referenced patch is that sndbuf could be smaller than low
> > watermark.
We cannot determine the low watermark properly, because of not only the sndbuf
size issue but also the fact that the upper vhost-net cannot directly see how
many descriptors are currently available at the virtio-net tx queue. It depends
on multiqueue settings or on other senders which are also using the same tx queue.
Note that in the latter case, if they are constantly transmitting, the deadlock
cannot occur(*); however, if they have only temporarily filled some portion of the
pool in the meantime, then the low watermark cannot help.
(*: That is because it is reliable enough in the sense I mention below.)

Keeping this in mind, let me briefly describe the possible deadlock I mentioned:
(1). vhost-net on the L1 guest has nothing to sendmsg until the upper layer sets
new descriptors, which depends only on the vhost-net zerocopy callback and on
adding newly used descriptors.
(2). the vhost-net callback depends on the skb freeing on the xmit path only.
(3). the xmit path depends (possibly only) on the vhost-net sendmsg.
As you can see, it is enough to bring about the situation above that the L1
virtio-net side reaches its limit earlier than the L0 host processing does. The
vhost-net pool could be almost full or empty, whatever.

Also note that relaxing restriction (2) (and thus removing the dependency (3)
-> (1)) seems quite natural, but it needs to be reliable enough to never miss the
last trigger if it is based on interrupts. We might need to do some
modification of virtio interrupt mitigation in the non tx-napi case, but that is
too costly and not fair to cover an upper-layer peculiarity.

So I think drivers which have characteristics such as (2) are worth checking to
see whether we should revise them.
> > 
> > > Perhaps the descriptor pool should also be
> > > revised to allow out of order completions. Then there is no need to
> > > copy zerocopy packets whenever they may experience delay.
> > 
> > 
> > Yes, but as replied in the referenced thread, windows driver may treat out
> > of order completion as a bug.
> 
> Interesting. I missed that. Perhaps the zerocopy optimization
> could be gated on guest support for out of order completions.
> 
> > > On the point of counting copy vs zerocopy: the new msg_zerocopy
> > > variant of ubuf_info has a field to record whether a deep copy was
> > > made. This can be used with vhost-net zerocopy, too.
> > 
> > 
> > Just to make sure I understand. It's still not clear to me how to reuse this
> > for vhost-net, e.g zerocopy flag is in a union which is not used by
> > vhost_net.
> 
> True, but that is not set in stone. I went back and forth on that when
> preparing fix 0a4a060bb204 ("sock: fix zerocopy_success regression
> with msg_zerocopy"). The field can be moved outside the union and
> initialized in the other zerocopy paths.



* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 12:11   ` Willem de Bruijn
@ 2017-08-22 14:04     ` Koichiro Den
  2017-08-22 17:19       ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Koichiro Den @ 2017-08-22 14:04 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael S. Tsirkin, virtualization, Network Development

On Tue, 2017-08-22 at 08:11 -0400, Willem de Bruijn wrote:
> On Sun, Aug 20, 2017 at 4:49 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> > On Sat, Aug 19, 2017 at 2:38 AM, Koichiro Den <den@klaipeden.com> wrote:
> > > Facing the possible unbounded delay relying on freeing on xmit path,
> > > we also better to invoke and clear the upper layer zerocopy callback
> > > beforehand to keep them from waiting for unbounded duration in vain.
> > 
> > Good point.
> > 
> > > For instance, this removes the possible deadlock in the case that the
> > > upper layer is a zerocopy-enabled vhost-net.
> > > This does not apply if napi_tx is enabled since it will be called in
> > > reasonale time.
> > 
> > Indeed. Btw, I am gathering data to eventually make napi the default
> > mode. But that is taking some time.
> > 
> > > 
> > > Signed-off-by: Koichiro Den <den@klaipeden.com>
> > > ---
> > >  drivers/net/virtio_net.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 4302f313d9a7..f7deaa5b7b50 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -1290,6 +1290,14 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
> > > struct net_device *dev)
> > > 
> > >         /* Don't wait up for transmitted skbs to be freed. */
> > >         if (!use_napi) {
> > > +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > > +                       struct ubuf_info *uarg;
> > > +                       uarg = skb_shinfo(skb)->destructor_arg;
> > > +                       if (uarg->callback)
> > > +                           uarg->callback(uarg, true);
> > > +                       skb_shinfo(skb)->destructor_arg = NULL;
> > > +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> > > +               }
> > 
> > Instead of open coding, this can use skb_zcopy_clear.
> 
> It is not correct to only send the zerocopy completion here, as
> the skb will still be shared with the nic. The only safe approach
> to notifying early is to create a copy with skb_orphan_frags_rx.
> That will call skb_zcopy_clear(skb, false) internally. But the
> copy will be very expensive.
I noticed this email after my last post. I cannot imagine how it could be
unsafe in the virtio case. Sorry, could you explain a bit more?
> 
> Is the deadlock you refer to the case where tx interrupts are
> disabled, so transmit completions are only handled in start_xmit
> and somehow the socket(s) are unable to send new data? The
> question is what is blocking them. If it is zerocopy, it may be a
> too low optmem_max or hitting the per-user locked pages limit.
> We may have to keep interrupts enabled when zerocopy skbs
> are in flight.
Even if we keep interrupts enabled, we miss the completion without start_xmit.
And it is also likely that the next start_xmit depends on the completion itself,
as I described in my last post.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 13:42         ` Koichiro Den
@ 2017-08-22 17:16           ` Willem de Bruijn
  2017-08-23 14:24             ` Koichiro Den
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-22 17:16 UTC (permalink / raw)
  To: Koichiro Den
  Cc: Jason Wang, Michael S. Tsirkin, virtualization, Network Development

>> > An issue of the referenced patch is that sndbuf could be smaller than low
>> > watermark.
> We cannot determine the low watermark properly because of not only sndbuf size
> issue but also the fact that the upper vhost-net cannot directly see how much
> descriptor is currently available at the virtio-net tx queue. It depends on
> multiqueue settings or other senders which are also using the same tx queue.
> Note that in the latter case if they constantly transmitting, the deadlock could
> not occur(*), however if it has just temporarily fulfill some portion of the
> pool in the mean time, then the low watermark cannot be helpful.
> (*: That is because it's reliable enough in the sense I mention below.)
>
> Keep in this in mind, let me briefly describe the possible deadlock I mentioned:
> (1). vhost-net on L1 guest has nothing to do sendmsg until the upper layer sets
> new descriptors, which depends only on the vhost-net zcopy callback and adding
> newly used descriptors.
> (2). vhost-net callback depends on the skb freeing on the xmit path only.
> (3). the xmit path depends (possibly only) on the vhost-net sendmsg.
> As you see, it's enough to bring about the situation above that L1 virtio-net
> reaches its limit earlier than the L0 host processing. The vhost-net pool could
> be almost full or empty, whatever.

Thanks for the context. This issue is very similar to the one that used to
exist when running out of transmit descriptors, before the removal of
the timer and introduction of skb_orphan in start_xmit.

To make sure that I understand correctly, let me paraphrase:

A. guest socket cannot send because it exhausted its sk budget (sndbuf, tsq, ..)

B. budget is not freed up until guest receives tx completion for this flow

C. tx completion is held back on the host side in vhost_zerocopy_signal_used
   behind the completion for an unrelated skb

D. unrelated packet is delayed somewhere in the host stack,
   e.g., netem

The issue that is specific to vhost-net zerocopy is that (C) enforces strict
ordering of transmit completions causing head of line blocking behind
vhost-net zerocopy callbacks.

This is a different problem from

C1. tx completion is delayed until guest sends another packet and
       triggers free_old_xmit_skb

Both in host and guest, zerocopy packets should never be able to loop
to a receive path where they can cause unbounded delay.

The obvious cases of latency are queueing, like netem. That leads
to poor performance for unrelated flows, but I don't see how this
could cause deadlock.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 14:04     ` Koichiro Den
@ 2017-08-22 17:19       ` Willem de Bruijn
  2017-08-23 14:26         ` Koichiro Den
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-22 17:19 UTC (permalink / raw)
  To: Koichiro Den; +Cc: Michael S. Tsirkin, virtualization, Network Development

>> > >         /* Don't wait up for transmitted skbs to be freed. */
>> > >         if (!use_napi) {
>> > > +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
>> > > +                       struct ubuf_info *uarg;
>> > > +                       uarg = skb_shinfo(skb)->destructor_arg;
>> > > +                       if (uarg->callback)
>> > > +                           uarg->callback(uarg, true);
>> > > +                       skb_shinfo(skb)->destructor_arg = NULL;
>> > > +                       skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
>> > > +               }
>> >
>> > Instead of open coding, this can use skb_zcopy_clear.
>>
>> It is not correct to only send the zerocopy completion here, as
>> the skb will still be shared with the nic. The only safe approach
>> to notifying early is to create a copy with skb_orphan_frags_rx.
>> That will call skb_zcopy_clear(skb, false) internally. But the
>> copy will be very expensive.
> I noticed this email after my last post. I cannot not imagine how it could be
> unsafe in virtio case. Sorry could you explain a bit more?

A guest process sends a packet with MSG_ZEROCOPY to the
virtio-net device. As soon as the process receives the completion
notification, it is allowed to reuse the memory backing the packet.

A call to skb_zcopy_clear in virtio-net start_xmit will notify the
process that it is allowed to reuse the memory. But the user pages
are still linked into the skb frags and are about to be shared with
the host os.
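
A rough timeline of the problem with the patch as posted (illustrative
comment only):

	/* start_xmit(), !use_napi, with the proposed early completion:
	 *
	 *   xmit_skb(sq, skb);       // user pages queued to the host via the vq
	 *   uarg->callback(uarg, 1); // guest process told the pages are free
	 *   ...                      // process may now overwrite those pages
	 *                            // while the host is still reading them
	 */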

>> Is the deadlock you refer to the case where tx interrupts are
>> disabled, so transmit completions are only handled in start_xmit
>> and somehow the socket(s) are unable to send new data? The
>> question is what is blocking them. If it is zerocopy, it may be a
>> too low optmem_max or hitting the per-user locked pages limit.
>> We may have to keep interrupts enabled when zerocopy skbs
>> are in flight.
> Even if we keep interrupts enabled, We miss the completion without start_xmit.
> And it's also likely that the next start_xmit depends on the completion itself
> as I described in my last post.
>


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22  2:50     ` Jason Wang
  2017-08-22  3:10       ` Willem de Bruijn
@ 2017-08-22 17:55       ` Michael S. Tsirkin
  2017-08-22 18:01         ` David Miller
  2017-08-23 14:28         ` Koichiro Den
  1 sibling, 2 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-22 17:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Willem de Bruijn, Koichiro Den, virtualization, Network Development

On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > Perhaps the descriptor pool should also be
> > revised to allow out of order completions. Then there is no need to
> > copy zerocopy packets whenever they may experience delay.
> 
> Yes, but as replied in the referenced thread, windows driver may treat out
> of order completion as a bug.

That would be a windows driver bug then, but I don't think it makes this
assumption. What the referenced thread
(https://patchwork.kernel.org/patch/3787671/) is saying is that host
must use any buffers made available on a tx vq within a reasonable
timeframe otherwise windows guests panic.

Ideally we would detect that a packet is actually experiencing delay and
trigger the copy at that point, e.g. by calling skb_linearize. But it
isn't easy to track these packets, and it is even harder to do a data
copy without races.
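
(For reference, skb_linearize() itself is roughly the following thin
wrapper; simplified from include/linux/skbuff.h:)

	static inline int skb_linearize(struct sk_buff *skb)
	{
		/* pull all paged frags into the skb's own linear data area,
		 * dropping the skb's references to the original frag pages
		 */
		return skb_is_nonlinear(skb) ? __skb_linearize(skb) : 0;
	}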

Which reminds me that skb_linearize in net core seems to be
fundamentally racy - I suspect that if skb is cloned, and someone is
trying to use the shared frags while another thread calls skb_linearize,
we get some use after free bugs which likely mostly go undetected
because the corrupted packets mostly go on wire and get dropped
by checksum code.

-- 
MST


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 17:55       ` Michael S. Tsirkin
@ 2017-08-22 18:01         ` David Miller
  2017-08-22 18:28           ` Eric Dumazet
  2017-08-23 14:28         ` Koichiro Den
  1 sibling, 1 reply; 47+ messages in thread
From: David Miller @ 2017-08-22 18:01 UTC (permalink / raw)
  To: mst; +Cc: jasowang, willemdebruijn.kernel, den, virtualization, netdev

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 22 Aug 2017 20:55:56 +0300

> Which reminds me that skb_linearize in net core seems to be
> fundamentally racy - I suspect that if skb is cloned, and someone is
> trying to use the shared frags while another thread calls skb_linearize,
> we get some use after free bugs which likely mostly go undetected
> because the corrupted packets mostly go on wire and get dropped
> by checksum code.

Indeed, it does assume that the skb from which the clone was made
never has its geometry changed.

I don't think even the TCP retransmit queue has this guarantee.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 18:01         ` David Miller
@ 2017-08-22 18:28           ` Eric Dumazet
  2017-08-22 18:39             ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Eric Dumazet @ 2017-08-22 18:28 UTC (permalink / raw)
  To: David Miller; +Cc: willemdebruijn.kernel, mst, netdev, den, virtualization

On Tue, 2017-08-22 at 11:01 -0700, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Tue, 22 Aug 2017 20:55:56 +0300
> 
> > Which reminds me that skb_linearize in net core seems to be
> > fundamentally racy - I suspect that if skb is cloned, and someone is
> > trying to use the shared frags while another thread calls skb_linearize,
> > we get some use after free bugs which likely mostly go undetected
> > because the corrupted packets mostly go on wire and get dropped
> > by checksum code.
> 
> Indeed, it does assume that the skb from which the clone was made
> never has it's geometry changed.
> 
> I don't think even the TCP retransmit queue has this guarantee.

TCP retransmit makes sure to avoid that.

if (skb_unclone(skb, GFP_ATOMIC))
     return -ENOMEM;

( Before cloning again skb )
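
(skb_unclone() is roughly the following, simplified; it gives the caller a
private copy of the header and frag array before any geometry change:)

	static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
	{
		if (skb_cloned(skb))
			return pskb_expand_head(skb, 0, 0, pri);
		return 0;
	}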


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 18:28           ` Eric Dumazet
@ 2017-08-22 18:39             ` Michael S. Tsirkin
  0 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-22 18:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, jasowang, willemdebruijn.kernel, den,
	virtualization, netdev

On Tue, Aug 22, 2017 at 11:28:28AM -0700, Eric Dumazet wrote:
> On Tue, 2017-08-22 at 11:01 -0700, David Miller wrote:
> > From: "Michael S. Tsirkin" <mst@redhat.com>
> > Date: Tue, 22 Aug 2017 20:55:56 +0300
> > 
> > > Which reminds me that skb_linearize in net core seems to be
> > > fundamentally racy - I suspect that if skb is cloned, and someone is
> > > trying to use the shared frags while another thread calls skb_linearize,
> > > we get some use after free bugs which likely mostly go undetected
> > > because the corrupted packets mostly go on wire and get dropped
> > > by checksum code.
> > 
> > Indeed, it does assume that the skb from which the clone was made
> > never has it's geometry changed.
> > 
> > I don't think even the TCP retransmit queue has this guarantee.
> 
> TCP retransmit makes sure to avoid that.
> 
> if (skb_unclone(skb, GFP_ATOMIC))
>      return -ENOMEM;
> 
> ( Before cloning again skb )
> 
> 

I'm pretty sure not all users of skb_clone or generally
__pskb_pull_tail are careful like this.

E.g. skb_cow_data actually intentionally pulls pages when
skb is cloned.


-- 
MST


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 17:16           ` Willem de Bruijn
@ 2017-08-23 14:24             ` Koichiro Den
  0 siblings, 0 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-23 14:24 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jason Wang, Michael S. Tsirkin, virtualization, Network Development

On Tue, 2017-08-22 at 13:16 -0400, Willem de Bruijn wrote:
> > > > An issue of the referenced patch is that sndbuf could be smaller than
> > > > low
> > > > watermark.
> > 
> > We cannot determine the low watermark properly because of not only sndbuf
> > size
> > issue but also the fact that the upper vhost-net cannot directly see how
> > much
> > descriptor is currently available at the virtio-net tx queue. It depends on
> > multiqueue settings or other senders which are also using the same tx queue.
> > Note that in the latter case if they constantly transmitting, the deadlock
> > could
> > not occur(*), however if it has just temporarily fulfill some portion of the
> > pool in the mean time, then the low watermark cannot be helpful.
> > (*: That is because it's reliable enough in the sense I mention below.)
> > 
> > Keep in this in mind, let me briefly describe the possible deadlock I
> > mentioned:
> > (1). vhost-net on L1 guest has nothing to do sendmsg until the upper layer
> > sets
> > new descriptors, which depends only on the vhost-net zcopy callback and
> > adding
> > newly used descriptors.
> > (2). vhost-net callback depends on the skb freeing on the xmit path only.
> > (3). the xmit path depends (possibly only) on the vhost-net sendmsg.
> > As you see, it's enough to bring about the situation above that L1 virtio-
> > net
> > reaches its limit earlier than the L0 host processing. The vhost-net pool
> > could
> > be almost full or empty, whatever.
> 
> Thanks for the context. This issue is very similar to the one that used to
> exist when running out of transmit descriptors, before the removal of
> the timer and introduction of skb_orphan in start_xmit.
> 
> To make sure that I understand correctly, let me paraphrase:
> 
> A. guest socket cannot send because it exhausted its sk budget (sndbuf, tsq,
> ..)
> 
> B. budget is not freed up until guest receives tx completion for this flow
> 
> C. tx completion is held back on the host side in vhost_zerocopy_signal_used
>    behind the completion for an unrelated skb
> 
> D. unrelated packet is delayed somewhere in the host stackf zerocopy
> completions.
>    e.g., netem
> 
> The issue that is specific to vhost-net zerocopy is that (C) enforces strict
> ordering of transmit completions causing head of line blocking behind
> vhost-net zerocopy callbacks.
> 
> This is a different problem from
> 
> C1. tx completion is delayed until guest sends another packet and
>        triggers free_old_xmit_skb
> 
> Both in host and guest, zerocopy packets should never be able to loop
> to a receive path where they can cause unbounded delay.
> 
> The obvious cases of latency are queueing, like netem. That leads
> to poor performance for unrelated flows, but I don't see how this
> could cause deadlock.

Thanks for the wrap-up. I see all the points now and also that C1 should not
cause deadlock.


* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 17:19       ` Willem de Bruijn
@ 2017-08-23 14:26         ` Koichiro Den
  0 siblings, 0 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-23 14:26 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Michael S. Tsirkin, virtualization, Network Development

On Tue, 2017-08-22 at 13:19 -0400, Willem de Bruijn wrote:
> > > > >         /* Don't wait up for transmitted skbs to be freed. */
> > > > >         if (!use_napi) {
> > > > > +               if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> > > > > +                       struct ubuf_info *uarg;
> > > > > +                       uarg = skb_shinfo(skb)->destructor_arg;
> > > > > +                       if (uarg->callback)
> > > > > +                           uarg->callback(uarg, true);
> > > > > +                       skb_shinfo(skb)->destructor_arg = NULL;
> > > > > +                       skb_shinfo(skb)->tx_flags &=
> > > > > ~SKBTX_DEV_ZEROCOPY;
> > > > > +               }
> > > > 
> > > > Instead of open coding, this can use skb_zcopy_clear.
> > > 
> > > It is not correct to only send the zerocopy completion here, as
> > > the skb will still be shared with the nic. The only safe approach
> > > to notifying early is to create a copy with skb_orphan_frags_rx.
> > > That will call skb_zcopy_clear(skb, false) internally. But the
> > > copy will be very expensive.
> > 
> > I noticed this email after my last post. I cannot not imagine how it could
> > be
> > unsafe in virtio case. Sorry could you explain a bit more?
> 
> A guest process sends a packet with MSG_ZEROCOPY to the
> virtio-net device. As soon as the process receives the completion
> notification, it is allowed to reuse the memory backing the packet.
> 
> A call to skb_zcopy_clear in virtio-net start_xmit will notify the
> process that it is allowed to reuse the memory. But the user pages
> are still linked into the skb frags and are about to be shared with
> the host os.
> 
> > > Is the deadlock you refer to the case where tx interrupts are
> > > disabled, so transmit completions are only handled in start_xmit
> > > and somehow the socket(s) are unable to send new data? The
> > > question is what is blocking them. If it is zerocopy, it may be a
> > > too low optmem_max or hitting the per-user locked pages limit.
> > > We may have to keep interrupts enabled when zerocopy skbs
> > > are in flight.
> > 
> > Even if we keep interrupts enabled, We miss the completion without
> > start_xmit.
> > And it's also likely that the next start_xmit depends on the completion
> > itself
> > as I described in my last post.
> > 
Thanks for the explanation. I misinterpreted the "nic" part; now it's clear.
Sorry for the noise.
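
For reference, a minimal sketch of the "safe but expensive" alternative
described above, i.e. copying the user frags before the skb is shared with
the host. This is not the posted patch; placement and error handling are
illustrative only:

	/* in start_xmit(), before xmit_skb(): the copy also signals the
	 * zerocopy completion internally, so the guest process may reuse
	 * its buffer right away.
	 */
	if (!use_napi &&
	    unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) {
		dev_kfree_skb_any(skb);
		dev->stats.tx_dropped++;
		return NETDEV_TX_OK;
	}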

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-22 17:55       ` Michael S. Tsirkin
  2017-08-22 18:01         ` David Miller
@ 2017-08-23 14:28         ` Koichiro Den
  2017-08-23 14:47           ` Koichiro Den
  2017-08-23 15:20           ` Willem de Bruijn
  1 sibling, 2 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-23 14:28 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Willem de Bruijn, virtualization, Network Development

On Tue, 2017-08-22 at 20:55 +0300, Michael S. Tsirkin wrote:
> On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > > Perhaps the descriptor pool should also be
> > > revised to allow out of order completions. Then there is no need to
> > > copy zerocopy packets whenever they may experience delay.
> > 
> > Yes, but as replied in the referenced thread, windows driver may treat out
> > of order completion as a bug.
> 
> That would be a windows driver bug then, but I don't think it makes this
> assumption. What the referenced thread
> (https://patchwork.kernel.org/patch/3787671/) is saying is that host
> must use any buffers made available on a tx vq within a reasonable
> timeframe otherwise windows guests panic.
> 
> Ideally we would detect that a packet is actually experiencing delay and
> trigger the copy at that point e.g. by calling skb_linearize. But it
> isn't easy to track these packets though and even harder to do a data
> copy without races.
> 
> Which reminds me that skb_linearize in net core seems to be
> fundamentally racy - I suspect that if skb is cloned, and someone is
> trying to use the shared frags while another thread calls skb_linearize,
> we get some use after free bugs which likely mostly go undetected
> because the corrupted packets mostly go on wire and get dropped
> by checksum code.
> 
Please let me make sure if I understand it correctly:
* always do copy with skb_orphan_frags_rx as Willem mentioned in the earlier
post, before the xmit_skb as opposed to my original patch, is safe but too
costly so cannot be adopted.
* as a generic solution, if we were to somehow overcome the safety issue, track
the delay and do copy if some threshold is reached could be an answer, but it's
hard for now.
* so things like the current vhost-net implementation of deciding whether or not
to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
practical compromise.

Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-23 14:28         ` Koichiro Den
@ 2017-08-23 14:47           ` Koichiro Den
  2017-08-23 15:20           ` Willem de Bruijn
  1 sibling, 0 replies; 47+ messages in thread
From: Koichiro Den @ 2017-08-23 14:47 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang
  Cc: Willem de Bruijn, virtualization, Network Development

On Wed, 2017-08-23 at 23:28 +0900, Koichiro Den wrote:
> On Tue, 2017-08-22 at 20:55 +0300, Michael S. Tsirkin wrote:
> > On Tue, Aug 22, 2017 at 10:50:41AM +0800, Jason Wang wrote:
> > > > Perhaps the descriptor pool should also be
> > > > revised to allow out of order completions. Then there is no need to
> > > > copy zerocopy packets whenever they may experience delay.
> > > 
> > > Yes, but as replied in the referenced thread, windows driver may treat out
> > > of order completion as a bug.
> > 
> > That would be a windows driver bug then, but I don't think it makes this
> > assumption. What the referenced thread
> > (https://patchwork.kernel.org/patch/3787671/) is saying is that host
> > must use any buffers made available on a tx vq within a reasonable
> > timeframe otherwise windows guests panic.
> > 
> > Ideally we would detect that a packet is actually experiencing delay and
> > trigger the copy at that point e.g. by calling skb_linearize. But it
> > isn't easy to track these packets though and even harder to do a data
> > copy without races.
> > 
> > Which reminds me that skb_linearize in net core seems to be
> > fundamentally racy - I suspect that if skb is cloned, and someone is
> > trying to use the shared frags while another thread calls skb_linearize,
> > we get some use after free bugs which likely mostly go undetected
> > because the corrupted packets mostly go on wire and get dropped
> > by checksum code.
> > 
> 
> Please let me make sure if I understand it correctly:
> * always do copy with skb_orphan_frags_rx as Willem mentioned in the earlier
> post, before the xmit_skb as opposed to my original patch, is safe but too
> costly so cannot be adopted.
> * as a generic solution, if we were to somehow overcome the safety issue,
> track
> the delay and do copy if some threshold is reached could be an answer, but
> it's
> hard for now.
> * so things like the current vhost-net implementation of deciding whether or
> not
> to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> practical compromise.
<- I forgot to mention the max pend checking part.
> 
> Thanks.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-23 14:28         ` Koichiro Den
  2017-08-23 14:47           ` Koichiro Den
@ 2017-08-23 15:20           ` Willem de Bruijn
  2017-08-23 22:57             ` Michael S. Tsirkin
  1 sibling, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-23 15:20 UTC (permalink / raw)
  To: Koichiro Den
  Cc: Michael S. Tsirkin, Jason Wang, virtualization, Network Development

> Please let me make sure if I understand it correctly:
> * always do copy with skb_orphan_frags_rx as Willem mentioned in the earlier
> post, before the xmit_skb as opposed to my original patch, is safe but too
> costly so cannot be adopted.

One more point about msg_zerocopy in the guest. This does add new allocation
limits on optmem and locked pages rlimit.

Hitting these should be extremely rare. The tcp small queues limit normally
throttles well before this.

Virtio-net is an exception because it breaks the tsq signal by calling
skb_orphan before transmission.

As a result hitting these limits is more likely here. But, in this edge case the
sendmsg call will not block, either, but fail with -ENOBUFS. The caller can
send without zerocopy to make forward progress and
trigger free_old_xmit_skbs from start_xmit.
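
A sketch of that caller-side fallback (assuming a socket that already has
SO_ZEROCOPY enabled; variable names are illustrative):

	ssize_t n = send(fd, buf, len, MSG_ZEROCOPY);
	if (n < 0 && errno == ENOBUFS)
		n = send(fd, buf, len, 0);  /* plain copy keeps making progress */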

> * as a generic solution, if we were to somehow overcome the safety issue, track
> the delay and do copy if some threshold is reached could be an answer, but it's
> hard for now.> * so things like the current vhost-net implementation of deciding whether or not
> to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> practical compromise.

The fragility of this mechanism is another argument for switching to tx napi
as default.

Is there any more data about the windows guest issues when completions
are not queued within a reasonable timeframe? What is this timescale and
do we really need to work around this. That is the only thing keeping us from
removing the HoL blocking in vhost-net zerocopy.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-23 15:20           ` Willem de Bruijn
@ 2017-08-23 22:57             ` Michael S. Tsirkin
  2017-08-24  3:28               ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-23 22:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Wed, Aug 23, 2017 at 11:20:45AM -0400, Willem de Bruijn wrote:
> > Please let me make sure if I understand it correctly:
> > * always do copy with skb_orphan_frags_rx as Willem mentioned in the earlier
> > post, before the xmit_skb as opposed to my original patch, is safe but too
> > costly so cannot be adopted.
> 
> One more point about msg_zerocopy in the guest. This does add new allocation
> limits on optmem and locked pages rlimit.
> 
> Hitting these should be extremely rare. The tcp small queues limit normally
> throttles well before this.
> 
> Virtio-net is an exception because it breaks the tsq signal by calling
> skb_orphan before transmission.
> 
> As a result hitting these limits is more likely here. But, in this edge case the
> sendmsg call will not block, either, but fail with -ENOBUFS. The caller can
> send without zerocopy to make forward progress and
> trigger free_old_xmit_skbs from start_xmit.
> 
> > * as a generic solution, if we were to somehow overcome the safety issue, track
> > the delay and do copy if some threshold is reached could be an answer, but it's
> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> > practical compromise.
> 
> The fragility of this mechanism is another argument for switching to tx napi
> as default.
>
> Is there any more data about the windows guest issues when completions
> are not queued within a reasonable timeframe? What is this timescale and
> do we really need to work around this. 

I think it's pretty large, many milliseconds.

But I wonder what do you mean by "work around". Using buffers within
limited time frame sounds like a reasonable requirement to me. Neither
do I see why would using tx interrupts within guest be a work around -
AFAIK windows driver uses tx interrupts.

> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.

We don't enable network watchdog on virtio but we could and maybe
should.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-23 22:57             ` Michael S. Tsirkin
@ 2017-08-24  3:28               ` Willem de Bruijn
  2017-08-24  4:34                 ` Michael S. Tsirkin
  2017-08-24 13:50                 ` Michael S. Tsirkin
  0 siblings, 2 replies; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-24  3:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

>> > * as a generic solution, if we were to somehow overcome the safety issue, track
>> > the delay and do copy if some threshold is reached could be an answer, but it's
>> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
>> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
>> > practical compromise.
>>
>> The fragility of this mechanism is another argument for switching to tx napi
>> as default.
>>
>> Is there any more data about the windows guest issues when completions
>> are not queued within a reasonable timeframe? What is this timescale and
>> do we really need to work around this.
>
> I think it's pretty large, many milliseconds.
>
> But I wonder what do you mean by "work around". Using buffers within
> limited time frame sounds like a reasonable requirement to me.

Vhost-net zerocopy delays completions until the skb is really
sent. Traffic shaping can introduce msec timescale latencies.

The delay may actually be a useful signal. If the guest does not
orphan skbs early, TSQ will throttle the socket causing host
queue build up.

But, if completions are queued in-order, unrelated flows may be
throttled as well. Allowing out of order completions would resolve
this HoL blocking.
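
For reference, an abridged sketch of that in-order scan in
vhost_zerocopy_signal_used() (simplified from memory; error handling and
the used-ring flush are omitted):

	int i, done = 0;

	for (i = nvq->done_idx; i != nvq->upend_idx;
	     i = (i + 1) % UIO_MAXIOV) {
		if (!VHOST_DMA_IS_DONE(vq->heads[i].len))
			break;	/* one slow skb blocks everything behind it */
		vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
		++done;
	}
	/* only the in-order prefix of 'done' entries is signalled to the guest */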

> Neither
> do I see why would using tx interrupts within guest be a work around -
> AFAIK windows driver uses tx interrupts.

It does not address completion latency itself. What I meant was
that in an interrupt-driven model, additional starvation issues,
such as the potential deadlock raised at the start of this thread,
or the timer delay observed before packets were orphaned in
virtio-net in commit b0c39dbdc204, are mitigated.

Specifically, it breaks the potential deadlock where sockets are
blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
yet completion handling is blocked waiting for a new packet to
trigger free_old_xmit_skbs from start_xmit.

>> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
>
> We don't enable network watchdog on virtio but we could and maybe
> should.

Can you elaborate?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-24  3:28               ` Willem de Bruijn
@ 2017-08-24  4:34                 ` Michael S. Tsirkin
  2017-08-24 13:50                 ` Michael S. Tsirkin
  1 sibling, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24  4:34 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Wed, Aug 23, 2017 at 11:28:24PM -0400, Willem de Bruijn wrote:
> >> > * as a generic solution, if we were to somehow overcome the safety issue, track
> >> > the delay and do copy if some threshold is reached could be an answer, but it's
> >> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
> >> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> >> > practical compromise.
> >>
> >> The fragility of this mechanism is another argument for switching to tx napi
> >> as default.
> >>
> >> Is there any more data about the windows guest issues when completions
> >> are not queued within a reasonable timeframe? What is this timescale and
> >> do we really need to work around this.
> >
> > I think it's pretty large, many milliseconds.
> >
> > But I wonder what do you mean by "work around". Using buffers within
> > limited time frame sounds like a reasonable requirement to me.
> 
> Vhost-net zerocopy delays completions until the skb is really
> sent. Traffic shaping can introduce msec timescale latencies.
> 
> The delay may actually be a useful signal. If the guest does not
> orphan skbs early, TSQ will throttle the socket causing host
> queue build up.
> 
> But, if completions are queued in-order, unrelated flows may be
> throttled as well. Allowing out of order completions would resolve
> this HoL blocking.

There's no issue with out of order. It does not break any guests AFAIK.

> > Neither
> > do I see why would using tx interrupts within guest be a work around -
> > AFAIK windows driver uses tx interrupts.
> 
> It does not address completion latency itself. What I meant was
> that in an interrupt-driver model, additional starvation issues,
> such as the potential deadlock raised at the start of this thread,
> or the timer delay observed before packets were orphaned in
> virtio-net in commit b0c39dbdc204, are mitigated.
> 
> Specifically, it breaks the potential deadlock where sockets are
> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
> yet completion handling is blocked waiting for a new packet to
> trigger free_old_xmit_skbs from start_xmit.
> 
> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
> >
> > We don't enable network watchdog on virtio but we could and maybe
> > should.
> 
> Can you elaborate?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-24  3:28               ` Willem de Bruijn
  2017-08-24  4:34                 ` Michael S. Tsirkin
@ 2017-08-24 13:50                 ` Michael S. Tsirkin
  2017-08-24 20:20                   ` Willem de Bruijn
  1 sibling, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24 13:50 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, Koichiro Den, virtualization

On Wed, Aug 23, 2017 at 11:28:24PM -0400, Willem de Bruijn wrote:
> >> > * as a generic solution, if we were to somehow overcome the safety issue, track
> >> > the delay and do copy if some threshold is reached could be an answer, but it's
> >> > hard for now.> * so things like the current vhost-net implementation of deciding whether or not
> >> > to do zerocopy beforehand referring the zerocopy tx error ratio is a point of
> >> > practical compromise.
> >>
> >> The fragility of this mechanism is another argument for switching to tx napi
> >> as default.
> >>
> >> Is there any more data about the windows guest issues when completions
> >> are not queued within a reasonable timeframe? What is this timescale and
> >> do we really need to work around this.
> >
> > I think it's pretty large, many milliseconds.
> >
> > But I wonder what do you mean by "work around". Using buffers within
> > limited time frame sounds like a reasonable requirement to me.
> 
> Vhost-net zerocopy delays completions until the skb is really
> sent.

This is fundamental in any solution. Guest/application can not
write over a memory buffer as long as hardware might be reading it.

> Traffic shaping can introduce msec timescale latencies.
> 
> The delay may actually be a useful signal. If the guest does not
> orphan skbs early, TSQ will throttle the socket causing host
> queue build up.
> 
> But, if completions are queued in-order, unrelated flows may be
> throttled as well. Allowing out of order completions would resolve
> this HoL blocking.

We can allow out of order, no guests that follow virtio spec
will break. But this won't help in all cases
- a single slow flow can occupy the whole ring, you will not
  be able to make any new buffers available for the fast flow
- what host considers a single flow can be multiple flows for guest

There are many other examples.

> > Neither
> > do I see why would using tx interrupts within guest be a work around -
> > AFAIK windows driver uses tx interrupts.
> 
> It does not address completion latency itself. What I meant was
> that in an interrupt-driver model, additional starvation issues,
> such as the potential deadlock raised at the start of this thread,
> or the timer delay observed before packets were orphaned in
> virtio-net in commit b0c39dbdc204, are mitigated.
> 
> Specifically, it breaks the potential deadlock where sockets are
> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
> yet completion handling is blocked waiting for a new packet to
> trigger free_old_xmit_skbs from start_xmit.

This talk of potential deadlock confuses me - I think you mean we would
deadlock if we did not orphan skbs in !use_napi - is that right?  If you
mean that you can drop skb orphan and this won't lead to a deadlock if
free skbs upon a tx interrupt, I agree, for sure.

> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
> >
> > We don't enable network watchdog on virtio but we could and maybe
> > should.
> 
> Can you elaborate?

The issue is that holding onto buffers for very long times makes guests
think they are stuck. This is fundamentally because from guest point of
view this is a NIC, so it is supposed to transmit things out in
a timely manner. If host backs the virtual NIC by something that is not
a NIC, with traffic shaping etc introducing unbounded latencies,
guest will be confused.

For example, we could set ndo_tx_timeout within guest. Then
if tx queue is stopped for too long, a watchdog would fire.
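
A hypothetical sketch of what that could look like in the guest driver
(neither the handler nor the 5 second value exists in virtio-net today;
shown with the ndo_tx_timeout signature of this kernel):

	static void virtnet_tx_timeout(struct net_device *dev)
	{
		netdev_warn(dev, "tx timeout: host held tx buffers too long?\n");
		/* a real handler might reset queues or just count the event */
	}

	/* in virtnet_probe(), next to the existing netdev_ops setup: */
	dev->watchdog_timeo = 5 * HZ;
	/* plus .ndo_tx_timeout = virtnet_tx_timeout in virtnet_netdev */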

We worked around most of the issues by introducing guest/host
copy. This copy, done by vhost-net, allows us to pretend that
a not-nic backend (e.g. a qdisc) is a nic (virtio-net).
This way you can both do traffic shaping in host with
unbounded latencies and limit latency from guest point of view.

Cost is both data copies and loss of end to end credit accounting.

Changing Linux as a host to limit latencies while not doing copies will
not be an easy task but that's the only fix that comes to mind.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-24 13:50                 ` Michael S. Tsirkin
@ 2017-08-24 20:20                   ` Willem de Bruijn
  2017-08-24 20:50                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-24 20:20 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

>> Traffic shaping can introduce msec timescale latencies.
>>
>> The delay may actually be a useful signal. If the guest does not
>> orphan skbs early, TSQ will throttle the socket causing host
>> queue build up.
>>
>> But, if completions are queued in-order, unrelated flows may be
>> throttled as well. Allowing out of order completions would resolve
>> this HoL blocking.
>
> We can allow out of order, no guests that follow virtio spec
> will break. But this won't help in all cases
> - a single slow flow can occupy the whole ring, you will not
>   be able to make any new buffers available for the fast flow
> - what host considers a single flow can be multiple flows for guest
>
> There are many other examples.

These examples are due to exhaustion of the fixed ubuf_info pool,
right? We could use dynamic allocation or a resizable pool if these
issues are serious enough.

>> > Neither
>> > do I see why would using tx interrupts within guest be a work around -
>> > AFAIK windows driver uses tx interrupts.
>>
>> It does not address completion latency itself. What I meant was
>> that in an interrupt-driver model, additional starvation issues,
>> such as the potential deadlock raised at the start of this thread,
>> or the timer delay observed before packets were orphaned in
>> virtio-net in commit b0c39dbdc204, are mitigated.
>>
>> Specifically, it breaks the potential deadlock where sockets are
>> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
>> yet completion handling is blocked waiting for a new packet to
>> trigger free_old_xmit_skbs from start_xmit.
>
> This talk of potential deadlock confuses me - I think you mean we would
> deadlock if we did not orphan skbs in !use_napi - is that right?  If you
> mean that you can drop skb orphan and this won't lead to a deadlock if
> free skbs upon a tx interrupt, I agree, for sure.

Yes, that is what I meant.

>> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
>> >
>> > We don't enable network watchdog on virtio but we could and maybe
>> > should.
>>
>> Can you elaborate?
>
> The issue is that holding onto buffers for very long times makes guests
> think they are stuck. This is funamentally because from guest point of
> view this is a NIC, so it is supposed to transmit things out in
> a timely manner. If host backs the virtual NIC by something that is not
> a NIC, with traffic shaping etc introducing unbounded latencies,
> guest will be confused.

That assumes that guests are fragile in this regard. A linux guest
does not make such assumptions. There are NICs with hardware
rate limiting, so I'm not sure how much of a leap host os rate
limiting is.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-24 20:20                   ` Willem de Bruijn
@ 2017-08-24 20:50                     ` Michael S. Tsirkin
  2017-08-25 22:44                       ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-24 20:50 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Thu, Aug 24, 2017 at 04:20:39PM -0400, Willem de Bruijn wrote:
> >> Traffic shaping can introduce msec timescale latencies.
> >>
> >> The delay may actually be a useful signal. If the guest does not
> >> orphan skbs early, TSQ will throttle the socket causing host
> >> queue build up.
> >>
> >> But, if completions are queued in-order, unrelated flows may be
> >> throttled as well. Allowing out of order completions would resolve
> >> this HoL blocking.
> >
> > We can allow out of order, no guests that follow virtio spec
> > will break. But this won't help in all cases
> > - a single slow flow can occupy the whole ring, you will not
> >   be able to make any new buffers available for the fast flow
> > - what host considers a single flow can be multiple flows for guest
> >
> > There are many other examples.
> 
> These examples are due to exhaustion of the fixed ubuf_info pool,
> right?

No - the ring size itself.

> We could use dynamic allocation or a resizable pool if these
> issues are serious enough.

We need some kind of limit on how many requests a guest can queue in the
host, or it's an obvious DoS attack vector. We used the ring size for
that.

> >> > Neither
> >> > do I see why would using tx interrupts within guest be a work around -
> >> > AFAIK windows driver uses tx interrupts.
> >>
> >> It does not address completion latency itself. What I meant was
> >> that in an interrupt-driver model, additional starvation issues,
> >> such as the potential deadlock raised at the start of this thread,
> >> or the timer delay observed before packets were orphaned in
> >> virtio-net in commit b0c39dbdc204, are mitigated.
> >>
> >> Specifically, it breaks the potential deadlock where sockets are
> >> blocked waiting for completions (to free up budget in sndbuf, tsq, ..),
> >> yet completion handling is blocked waiting for a new packet to
> >> trigger free_old_xmit_skbs from start_xmit.
> >
> > This talk of potential deadlock confuses me - I think you mean we would
> > deadlock if we did not orphan skbs in !use_napi - is that right?  If you
> > mean that you can drop skb orphan and this won't lead to a deadlock if
> > free skbs upon a tx interrupt, I agree, for sure.
> 
> Yes, that is what I meant.
> 
> >> >> That is the only thing keeping us from removing the HoL blocking in vhost-net zerocopy.
> >> >
> >> > We don't enable network watchdog on virtio but we could and maybe
> >> > should.
> >>
> >> Can you elaborate?
> >
> > The issue is that holding onto buffers for very long times makes guests
> > think they are stuck. This is funamentally because from guest point of
> > view this is a NIC, so it is supposed to transmit things out in
> > a timely manner. If host backs the virtual NIC by something that is not
> > a NIC, with traffic shaping etc introducing unbounded latencies,
> > guest will be confused.
> 
> That assumes that guests are fragile in this regard. A linux guest
> does not make such assumptions.

Yes it does. Examples above:
	> > - a single slow flow can occupy the whole ring, you will not
	> >   be able to make any new buffers available for the fast flow
	> > - what host considers a single flow can be multiple flows for guest

it's easier to see if you enable the watchdog timer for virtio. Linux
supports that.

> There are NICs with hardware
> rate limiting, so I'm not sure how much of a leap host os rate
> limiting is.

I don't know what happens if these NICs hold onto packets for seconds.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-24 20:50                     ` Michael S. Tsirkin
@ 2017-08-25 22:44                       ` Willem de Bruijn
  2017-08-25 23:32                         ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-25 22:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

>> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> > should.
>> >>
>> >> Can you elaborate?
>> >
>> > The issue is that holding onto buffers for very long times makes guests
>> > think they are stuck. This is funamentally because from guest point of
>> > view this is a NIC, so it is supposed to transmit things out in
>> > a timely manner. If host backs the virtual NIC by something that is not
>> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> > guest will be confused.
>>
>> That assumes that guests are fragile in this regard. A linux guest
>> does not make such assumptions.
>
> Yes it does. Examples above:
>         > > - a single slow flow can occupy the whole ring, you will not
>         > >   be able to make any new buffers available for the fast flow

Oh, right. Though those are due to vring_desc pool exhaustion
rather than an upper bound on latency of any single packet.

Limiting the number of zerocopy packets in flight to some fraction
of the ring ensures that fast flows can always grab a slot. Running
out of ubuf_info slots reverts to copy, so indirectly does this. But if
I read it correctly, the zerocopy pool may be equal to or larger than
the descriptor pool. Should we refine the zcopy_used test

    (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx

to also return false if the number of outstanding ubuf_info is greater
than, say, vq->num >> 1?
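
Roughly, with the threshold added (the outstanding count and the
vq->num >> 1 value are illustrative, not existing vhost-net code):

    int outstanding = (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx)
                      % UIO_MAXIOV;

    zcopy_used = (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
                 && outstanding <= (vq->num >> 1);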

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-25 22:44                       ` Willem de Bruijn
@ 2017-08-25 23:32                         ` Michael S. Tsirkin
  2017-08-26  1:03                           ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-25 23:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
> >> >> > We don't enable network watchdog on virtio but we could and maybe
> >> >> > should.
> >> >>
> >> >> Can you elaborate?
> >> >
> >> > The issue is that holding onto buffers for very long times makes guests
> >> > think they are stuck. This is funamentally because from guest point of
> >> > view this is a NIC, so it is supposed to transmit things out in
> >> > a timely manner. If host backs the virtual NIC by something that is not
> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
> >> > guest will be confused.
> >>
> >> That assumes that guests are fragile in this regard. A linux guest
> >> does not make such assumptions.
> >
> > Yes it does. Examples above:
> >         > > - a single slow flow can occupy the whole ring, you will not
> >         > >   be able to make any new buffers available for the fast flow
> 
> Oh, right. Though those are due to vring_desc pool exhaustion
> rather than an upper bound on latency of any single packet.
> 
> Limiting the number of zerocopy packets in flight to some fraction
> of the ring ensures that fast flows can always grab a slot.
> Running
> out of ubuf_info slots reverts to copy, so indirectly does this. But
> I read it correclty the zerocopy pool may be equal to or larger than
> the descriptor pool. Should we refine the zcopy_used test
> 
>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
> 
> to also return false if the number of outstanding ubuf_info is greater
> than, say, vq->num >> 1?


We'll need to think about where to put the threshold, but I think it's
a good idea.

Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
resources.

In a sense it still means once you run out of slots zcopy gets disabled, possibly permanently.

Need to experiment with some numbers.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-25 23:32                         ` Michael S. Tsirkin
@ 2017-08-26  1:03                           ` Willem de Bruijn
  2017-08-29 19:35                             ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> >> > should.
>> >> >>
>> >> >> Can you elaborate?
>> >> >
>> >> > The issue is that holding onto buffers for very long times makes guests
>> >> > think they are stuck. This is funamentally because from guest point of
>> >> > view this is a NIC, so it is supposed to transmit things out in
>> >> > a timely manner. If host backs the virtual NIC by something that is not
>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> >> > guest will be confused.
>> >>
>> >> That assumes that guests are fragile in this regard. A linux guest
>> >> does not make such assumptions.
>> >
>> > Yes it does. Examples above:
>> >         > > - a single slow flow can occupy the whole ring, you will not
>> >         > >   be able to make any new buffers available for the fast flow
>>
>> Oh, right. Though those are due to vring_desc pool exhaustion
>> rather than an upper bound on latency of any single packet.
>>
>> Limiting the number of zerocopy packets in flight to some fraction
>> of the ring ensures that fast flows can always grab a slot.
>> Running
>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>> I read it correclty the zerocopy pool may be equal to or larger than
>> the descriptor pool. Should we refine the zcopy_used test
>>
>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>
>> to also return false if the number of outstanding ubuf_info is greater
>> than, say, vq->num >> 1?
>
>
> We'll need to think about where to put the threshold, but I think it's
> a good idea.
>
> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
> resources.
>
> In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.
>
> Need to experiment with some numbers.

I can take a stab with two flows, one delayed in a deep host qdisc
queue. See how this change affects the other flow and also how
sensitive that is to the chosen threshold value.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-26  1:03                           ` Willem de Bruijn
@ 2017-08-29 19:35                             ` Willem de Bruijn
  2017-08-29 19:42                               ` Michael S. Tsirkin
                                                 ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-29 19:35 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Network Development, Koichiro Den, virtualization

On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>>> >> >> > should.
>>> >> >>
>>> >> >> Can you elaborate?
>>> >> >
>>> >> > The issue is that holding onto buffers for very long times makes guests
>>> >> > think they are stuck. This is funamentally because from guest point of
>>> >> > view this is a NIC, so it is supposed to transmit things out in
>>> >> > a timely manner. If host backs the virtual NIC by something that is not
>>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>>> >> > guest will be confused.
>>> >>
>>> >> That assumes that guests are fragile in this regard. A linux guest
>>> >> does not make such assumptions.
>>> >
>>> > Yes it does. Examples above:
>>> >         > > - a single slow flow can occupy the whole ring, you will not
>>> >         > >   be able to make any new buffers available for the fast flow
>>>
>>> Oh, right. Though those are due to vring_desc pool exhaustion
>>> rather than an upper bound on latency of any single packet.
>>>
>>> Limiting the number of zerocopy packets in flight to some fraction
>>> of the ring ensures that fast flows can always grab a slot.
>>> Running
>>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>>> I read it correclty the zerocopy pool may be equal to or larger than
>>> the descriptor pool. Should we refine the zcopy_used test
>>>
>>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>>
>>> to also return false if the number of outstanding ubuf_info is greater
>>> than, say, vq->num >> 1?
>>
>>
>> We'll need to think about where to put the threshold, but I think it's
>> a good idea.
>>
>> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
>> resources.
>>
>> In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.
>>
>> Need to experiment with some numbers.
>
> I can take a stab with two flows, one delayed in a deep host qdisc
> queue. See how this change affects the other flow and also how
> sensitive that is to the chosen threshold value.

Incomplete results at this stage, but I do see this correlation between
flows. It occurs even while not running out of zerocopy descriptors,
which I cannot yet explain.

Running two threads in a guest, each with a udp socket, each
sending up to 100 datagrams, or until EAGAIN, every msec.

Sender A sends 1B datagrams.
Sender B sends VHOST_GOODCOPY_LEN, which is enough
to trigger zcopy_used in vhost net.
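
A rough reconstruction of one sender thread (not the actual test program;
destination, port and payload size stand in for the real values):

	/* needs: sys/socket.h, netinet/in.h, arpa/inet.h, errno.h, unistd.h */
	static void sender(const char *dest_ip, int port, size_t payload)
	{
		char buf[4096] = {0};	/* 1 byte for A, VHOST_GOODCOPY_LEN for B */
		struct sockaddr_in dst = { .sin_family = AF_INET,
					   .sin_port = htons(port) };
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		inet_pton(AF_INET, dest_ip, &dst.sin_addr);
		connect(fd, (struct sockaddr *)&dst, sizeof(dst));

		for (;;) {
			for (int i = 0; i < 100; i++)
				if (send(fd, buf, payload, MSG_DONTWAIT) < 0 &&
				    errno == EAGAIN)
					break;	/* throttled, wait for next tick */
			usleep(1000);		/* one burst per msec */
		}
	}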

A local receive process on the host receives both flows. To avoid
a deep copy when looping the packet onto the receive path,
changed skb_orphan_frags_rx to always return false (gross hack).

The flow with the larger packets is redirected through netem on ifb0:

  modprobe ifb
  ip link set dev ifb0 up
  tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit

  tc qdisc add dev tap0 ingress
  tc filter add dev tap0 parent ffff: protocol ip \
      u32 match ip dport 8000 0xffff \
      action mirred egress redirect dev ifb0

For 10 second run, packet count with various ifb0 queue lengths $LIMIT:

no filter
  rx.A: ~840,000
  rx.B: ~840,000

limit 1
  rx.A: ~500,000
  rx.B: ~3100
  ifb0: 3273 sent, 371141 dropped

limit 100
  rx.A: ~9000
  rx.B: ~4200
  ifb0: 4630 sent, 1491 dropped

limit 1000
  rx.A: ~6800
  rx.B: ~4200
  ifb0: 4651 sent, 0 dropped

Sender B is always correctly rate limited to 1 MBps or less. With a
short queue, it ends up dropping a lot and sending even less.

When a queue builds up for sender B, sender A throughput is strongly
correlated with queue length. With queue length 1, it can send almost
at unthrottled speed. But even at limit 100 its throughput is on the
same order as sender B.

What is surprising to me is that this happens even though the number
of ubuf_info in use at limit 100 is around 100 at all times. In other words,
it does not exhaust the pool.

When forcing zcopy_used to be false for all packets, this effect of
sender A throughput being correlated with sender B does not happen.

no filter
  rx.A: ~850,000
  rx.B: ~850,000

limit 100
  rx.A: ~850,000
  rx.B: ~4200
  ifb0: 4518 sent, 876182 dropped

Also relevant is that with zerocopy, the sender processes back off
and report the same count as the receiver. Without zerocopy,
both senders send at full speed, even if only 4200 packets from flow
B arrive at the receiver.

This is with the default virtio_net driver, so without napi-tx.

It appears that the zerocopy notifications are pausing the guest.
Will look at that now.

By the way, I have had an unrelated patch outstanding for a while
to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
command. Will send that as RFC.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-29 19:35                             ` Willem de Bruijn
@ 2017-08-29 19:42                               ` Michael S. Tsirkin
  2017-08-29 19:53                                 ` Willem de Bruijn
  2017-08-30  1:45                               ` Jason Wang
  2017-08-31 14:30                               ` Willem de Bruijn
  2 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-29 19:42 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
> By the way, I have had an unrelated patch outstanding for a while
> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
> command. Will send that as RFC.

Oh nice. One needs to be careful about locking there which is why
no devices support that yet.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-29 19:42                               ` Michael S. Tsirkin
@ 2017-08-29 19:53                                 ` Willem de Bruijn
  2017-08-29 20:40                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-29 19:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Tue, Aug 29, 2017 at 3:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
>> By the way, I have had an unrelated patch outstanding for a while
>> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
>> command. Will send that as RFC.
>
> Oh nice.

Great :)

> One needs to be careful about locking there which is why
> no devices support that yet.

I originally wrote it based on the virtnet_reset function introduced
for xdp. Calling this from virtnet_config_changed_work is non trivial,
as virtnet_freeze_down waits until no config worker is running.

Otherwise, I could not find any constraints on when freeze may be
called, and it largely follows the same path. I hope I didn't miss anything.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-29 19:53                                 ` Willem de Bruijn
@ 2017-08-29 20:40                                   ` Michael S. Tsirkin
  2017-08-29 22:55                                     ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-08-29 20:40 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Tue, Aug 29, 2017 at 03:53:08PM -0400, Willem de Bruijn wrote:
> On Tue, Aug 29, 2017 at 3:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
> >> By the way, I have had an unrelated patch outstanding for a while
> >> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
> >> command. Will send that as RFC.
> >
> > Oh nice.
> 
> Great :)
> 
> > One needs to be careful about locking there which is why
> > no devices support that yet.
> 
> I originally wrote it based on the virtnet_reset function introduced
> for xdp. Calling this from virtnet_config_changed_work is non trivial,
> as virtnet_freeze_down waits until no config worker is running.
> 
> Otherwise, I could not find any constraints on when freeze may be
> called, and it largely follows the same path. I hope I didn't miss anything.

The issue is that on freeze processes are not running so we
generally know no new packets will arrive (might be wrong
for bridging, then it's a bug). On device error you must
prevent new skbs from coming in, etc.

-- 
MST

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-29 20:40                                   ` Michael S. Tsirkin
@ 2017-08-29 22:55                                     ` Willem de Bruijn
  0 siblings, 0 replies; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-29 22:55 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Network Development, Koichiro Den, virtualization

On Tue, Aug 29, 2017 at 4:40 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Aug 29, 2017 at 03:53:08PM -0400, Willem de Bruijn wrote:
>> On Tue, Aug 29, 2017 at 3:42 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > On Tue, Aug 29, 2017 at 03:35:38PM -0400, Willem de Bruijn wrote:
>> >> By the way, I have had an unrelated patch outstanding for a while
>> >> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
>> >> command. Will send that as RFC.
>> >
>> > Oh nice.
>>
>> Great :)
>>
>> > One needs to be careful about locking there which is why
>> > no devices support that yet.
>>
>> I originally wrote it based on the virtnet_reset function introduced
>> for xdp. Calling this from virtnet_config_changed_work is non trivial,
>> as virtnet_freeze_down waits until no config worker is running.
>>
>> Otherwise, I could not find any constraints on when freeze may be
>> called, and it largely follows the same path. I hope I didn't miss anything.
>
> The issue is that on freeze processes are not running so we
> generally know no new packets will arrive (might be wrong
> for bridging, then it's a bug). On device error you must
> prevent new skbs from coming in, etc.

Thanks a lot for the quick review. I had indeed not yet figured out
which invariants freeze can depend on that are not universal. Same
for the virtnet_reset call from the .ndo_xdp. Will need to take a much
better look at that.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-29 19:35                             ` Willem de Bruijn
  2017-08-29 19:42                               ` Michael S. Tsirkin
@ 2017-08-30  1:45                               ` Jason Wang
  2017-08-30  3:11                                 ` Willem de Bruijn
  2017-08-31 14:30                               ` Willem de Bruijn
  2 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2017-08-30  1:45 UTC (permalink / raw)
  To: Willem de Bruijn, Michael S. Tsirkin
  Cc: Koichiro Den, virtualization, Network Development



On 2017-08-30 03:35, Willem de Bruijn wrote:
> On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>>>>>>>>> We don't enable network watchdog on virtio but we could and maybe
>>>>>>>>> should.
>>>>>>>> Can you elaborate?
>>>>>>> The issue is that holding onto buffers for very long times makes guests
>>>>>>> think they are stuck. This is funamentally because from guest point of
>>>>>>> view this is a NIC, so it is supposed to transmit things out in
>>>>>>> a timely manner. If host backs the virtual NIC by something that is not
>>>>>>> a NIC, with traffic shaping etc introducing unbounded latencies,
>>>>>>> guest will be confused.
>>>>>> That assumes that guests are fragile in this regard. A linux guest
>>>>>> does not make such assumptions.
>>>>> Yes it does. Examples above:
>>>>>          > > - a single slow flow can occupy the whole ring, you will not
>>>>>          > >   be able to make any new buffers available for the fast flow
>>>> Oh, right. Though those are due to vring_desc pool exhaustion
>>>> rather than an upper bound on latency of any single packet.
>>>>
>>>> Limiting the number of zerocopy packets in flight to some fraction
>>>> of the ring ensures that fast flows can always grab a slot.
>>>> Running
>>>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>>>> I read it correclty the zerocopy pool may be equal to or larger than
>>>> the descriptor pool. Should we refine the zcopy_used test
>>>>
>>>>      (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>>>
>>>> to also return false if the number of outstanding ubuf_info is greater
>>>> than, say, vq->num >> 1?
>>>
>>> We'll need to think about where to put the threshold, but I think it's
>>> a good idea.
>>>
>>> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
>>> resources.
>>>
>>> In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.
>>>
>>> Need to experiment with some numbers.
>> I can take a stab with two flows, one delayed in a deep host qdisc
>> queue. See how this change affects the other flow and also how
>> sensitive that is to the chosen threshold value.
> Incomplete results at this stage, but I do see this correlation between
> flows. It occurs even while not running out of zerocopy descriptors,
> which I cannot yet explain.
>
> Running two threads in a guest, each with a udp socket, each
> sending up to 100 datagrams, or until EAGAIN, every msec.
>
> Sender A sends 1B datagrams.
> Sender B sends VHOST_GOODCOPY_LEN, which is enough
> to trigger zcopy_used in vhost net.
>
> A local receive process on the host receives both flows. To avoid
> a deep copy when looping the packet onto the receive path,
> changed skb_orphan_frags_rx to always return false (gross hack).
>
> The flow with the larger packets is redirected through netem on ifb0:
>
>    modprobe ifb
>    ip link set dev ifb0 up
>    tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>
>    tc qdisc add dev tap0 ingress
>    tc filter add dev tap0 parent ffff: protocol ip \
>        u32 match ip dport 8000 0xffff \
>        action mirred egress redirect dev ifb0
>
> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>
> no filter
>    rx.A: ~840,000
>    rx.B: ~840,000

Just to make sure I understand the case here. What did rx.B mean here? I 
thought all traffic sent by Sender B has been redirected to ifb0?

>
> limit 1
>    rx.A: ~500,000
>    rx.B: ~3100
>    ifb0: 3273 sent, 371141 dropped
>
> limit 100
>    rx.A: ~9000
>    rx.B: ~4200
>    ifb0: 4630 sent, 1491 dropped
>
> limit 1000
>    rx.A: ~6800
>    rx.B: ~4200
>    ifb0: 4651 sent, 0 dropped
>
> Sender B is always correctly rate limited to 1 MBps or less. With a
> short queue, it ends up dropping a lot and sending even less.
>
> When a queue builds up for sender B, sender A throughput is strongly
> correlated with queue length. With queue length 1, it can send almost
> at unthrottled speed. But even at limit 100 its throughput is on the
> same order as sender B.
>
> What is surprising to me is that this happens even though the number
> of ubuf_info in use at limit 100 is around 100 at all times. In other words,
> it does not exhaust the pool.
>
> When forcing zcopy_used to be false for all packets, this effect of
> sender A throughput being correlated with sender B does not happen.
>
> no filter
>    rx.A: ~850,000
>    rx.B: ~850,000
>
> limit 100
>    rx.A: ~850,000
>    rx.B: ~4200
>    ifb0: 4518 sent, 876182 dropped
>
> Also relevant is that with zerocopy, the sender processes back off
> and report the same count as the receiver. Without zerocopy,
> both senders send at full speed, even if only 4200 packets from flow
> B arrive at the receiver.
>
> This is with the default virtio_net driver, so without napi-tx.

What kind of qdisc do you use in guest? I suspect we should use 
something which could do fair queueing (e.g sfq).
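
For example, inside the guest (the device name is assumed):

  tc qdisc replace dev eth0 root sfq perturb 10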

>
> It appears that the zerocopy notifications are pausing the guest.
> Will look at that now.

Another factor is the tx interrupt coalescing parameters of ifb0, maybe 
we should disable it during the test.

Thanks

>
> By the way, I have had an unrelated patch outstanding for a while
> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
> command. Will send that as RFC.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-30  1:45                               ` Jason Wang
@ 2017-08-30  3:11                                 ` Willem de Bruijn
  2017-09-01  3:08                                   ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-30  3:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization, Network Development

On Tue, Aug 29, 2017 at 9:45 PM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017-08-30 03:35, Willem de Bruijn wrote:
>>
>> On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
>> <willemdebruijn.kernel@gmail.com> wrote:
>>>
>>> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com>
>>> wrote:
>>>>
>>>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>>>>>>>>>>
>>>>>>>>>> We don't enable network watchdog on virtio but we could and maybe
>>>>>>>>>> should.
>>>>>>>>>
>>>>>>>>> Can you elaborate?
>>>>>>>>
>>>>>>>> The issue is that holding onto buffers for very long times makes
>>>>>>>> guests
>>>>>>>> think they are stuck. This is funamentally because from guest point
>>>>>>>> of
>>>>>>>> view this is a NIC, so it is supposed to transmit things out in
>>>>>>>> a timely manner. If host backs the virtual NIC by something that is
>>>>>>>> not
>>>>>>>> a NIC, with traffic shaping etc introducing unbounded latencies,
>>>>>>>> guest will be confused.
>>>>>>>
>>>>>>> That assumes that guests are fragile in this regard. A linux guest
>>>>>>> does not make such assumptions.
>>>>>>
>>>>>> Yes it does. Examples above:
>>>>>>          > > - a single slow flow can occupy the whole ring, you will
>>>>>> not
>>>>>>          > >   be able to make any new buffers available for the fast
>>>>>> flow
>>>>>
>>>>> Oh, right. Though those are due to vring_desc pool exhaustion
>>>>> rather than an upper bound on latency of any single packet.
>>>>>
>>>>> Limiting the number of zerocopy packets in flight to some fraction
>>>>> of the ring ensures that fast flows can always grab a slot.
>>>>> Running
>>>>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>>>>> I read it correclty the zerocopy pool may be equal to or larger than
>>>>> the descriptor pool. Should we refine the zcopy_used test
>>>>>
>>>>>      (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>>>>
>>>>> to also return false if the number of outstanding ubuf_info is greater
>>>>> than, say, vq->num >> 1?
>>>>
>>>>
>>>> We'll need to think about where to put the threshold, but I think it's
>>>> a good idea.
>>>>
>>>> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
>>>> resources.
>>>>
>>>> In a sense it still means once you run out of slots zcopt gets disabled
>>>> possibly permanently.
>>>>
>>>> Need to experiment with some numbers.
>>>
>>> I can take a stab with two flows, one delayed in a deep host qdisc
>>> queue. See how this change affects the other flow and also how
>>> sensitive that is to the chosen threshold value.
>>
>> Incomplete results at this stage, but I do see this correlation between
>> flows. It occurs even while not running out of zerocopy descriptors,
>> which I cannot yet explain.
>>
>> Running two threads in a guest, each with a udp socket, each
>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>
>> Sender A sends 1B datagrams.
>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>> to trigger zcopy_used in vhost net.
>>
>> A local receive process on the host receives both flows. To avoid
>> a deep copy when looping the packet onto the receive path,
>> changed skb_orphan_frags_rx to always return false (gross hack).
>>
>> The flow with the larger packets is redirected through netem on ifb0:
>>
>>    modprobe ifb
>>    ip link set dev ifb0 up
>>    tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>
>>    tc qdisc add dev tap0 ingress
>>    tc filter add dev tap0 parent ffff: protocol ip \
>>        u32 match ip dport 8000 0xffff \
>>        action mirred egress redirect dev ifb0
>>
>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>
>> no filter
>>    rx.A: ~840,000
>>    rx.B: ~840,000
>
>
> Just to make sure I understand the case here. What did rx.B mean here? I
> thought all traffic sent by Sender B has been redirected to ifb0?

It has been, but the packet still arrives at the destination socket.
IFB is a special virtual device that applies traffic shaping and
then reinjects it back at the point where it was intercepted by mirred.

rx.B is indeed arrival rate at the receiver, similar to rx.A.

>>
>> limit 1
>>    rx.A: ~500,000
>>    rx.B: ~3100
>>    ifb0: 3273 sent, 371141 dropped
>>
>> limit 100
>>    rx.A: ~9000
>>    rx.B: ~4200
>>    ifb0: 4630 sent, 1491 dropped
>>
>> limit 1000
>>    rx.A: ~6800
>>    rx.B: ~4200
>>    ifb0: 4651 sent, 0 dropped
>>
>> Sender B is always correctly rate limited to 1 MBps or less. With a
>> short queue, it ends up dropping a lot and sending even less.
>>
>> When a queue builds up for sender B, sender A throughput is strongly
>> correlated with queue length. With queue length 1, it can send almost
>> at unthrottled speed. But even at limit 100 its throughput is on the
>> same order as sender B.
>>
>> What is surprising to me is that this happens even though the number
>> of ubuf_info in use at limit 100 is around 100 at all times. In other
>> words,
>> it does not exhaust the pool.
>>
>> When forcing zcopy_used to be false for all packets, this effect of
>> sender A throughput being correlated with sender B does not happen.
>>
>> no filter
>>    rx.A: ~850,000
>>    rx.B: ~850,000
>>
>> limit 100
>>    rx.A: ~850,000
>>    rx.B: ~4200
>>    ifb0: 4518 sent, 876182 dropped
>>
>> Also relevant is that with zerocopy, the sender processes back off
>> and report the same count as the receiver. Without zerocopy,
>> both senders send at full speed, even if only 4200 packets from flow
>> B arrive at the receiver.
>>
>> This is with the default virtio_net driver, so without napi-tx.
>
>
> What kind of qdisc do you use in guest? I suspect we should use something
> which could do fair queueing (e.g sfq).

Or fq. The test was using the default, pfifo_fast.

This particular two flow test probably would not be affected,
as something else is delaying both flows equally once some
completions are delayed.

One obvious candidate would be hitting VHOST_MAX_PEND.
But I instrumented that and handle_tx is never throttled by
vhost_exceeds_maxpend.

Btw, vhost_exceeds_maxpend implements almost what I
suggested earlier and was planning to test here: ensure that
only part of the descriptor pool is filled with zerocopy requests.

Only, it currently breaks out of the loop when the max is
reached. I think that we should move it into the main
zcopy_used calculation, so that hitting this threshold reverts to
copy-based transmission, instead of delaying all tx until
zerocopy budget opens up.
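
Roughly, a sketch of that change (untested; modeled on the existing
zcopy_used computation in handle_tx, with the maxpend condition folded in):

	/* Revert to copying for this packet once too many zerocopy
	 * completions are outstanding, instead of breaking out of the
	 * tx loop and stalling all transmission.
	 */
	zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
		     && !vhost_exceeds_maxpend(net)
		     && (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
		     && vhost_net_tx_select_zcopy(net);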


>>
>> It appears that the zerocopy notifications are pausing the guest.
>> Will look at that now.
>
>
> Another factor is the tx interrupt coalescing parameters of ifb0, maybe we
> should disable it during the test.
>
> Thanks
>
>
>>
>> By the way, I have had an unrelated patch outstanding for a while
>> to have virtio-net support the VIRTIO_CONFIG_S_NEEDS_RESET
>> command. Will send that as RFC.
>
>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-29 19:35                             ` Willem de Bruijn
  2017-08-29 19:42                               ` Michael S. Tsirkin
  2017-08-30  1:45                               ` Jason Wang
@ 2017-08-31 14:30                               ` Willem de Bruijn
  2017-09-01  3:25                                 ` Jason Wang
  2 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-08-31 14:30 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development

On Tue, Aug 29, 2017 at 3:35 PM, Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
> On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
>> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>>>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>>>> >> >> > should.
>>>> >> >>
>>>> >> >> Can you elaborate?
>>>> >> >
>>>> >> > The issue is that holding onto buffers for very long times makes guests
>>>> >> > think they are stuck. This is fundamentally because from guest point of
>>>> >> > view this is a NIC, so it is supposed to transmit things out in
>>>> >> > a timely manner. If host backs the virtual NIC by something that is not
>>>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>>>> >> > guest will be confused.
>>>> >>
>>>> >> That assumes that guests are fragile in this regard. A linux guest
>>>> >> does not make such assumptions.
>>>> >
>>>> > Yes it does. Examples above:
>>>> >         > > - a single slow flow can occupy the whole ring, you will not
>>>> >         > >   be able to make any new buffers available for the fast flow
>>>>
>>>> Oh, right. Though those are due to vring_desc pool exhaustion
>>>> rather than an upper bound on latency of any single packet.
>>>>
>>>> Limiting the number of zerocopy packets in flight to some fraction
>>>> of the ring ensures that fast flows can always grab a slot.
>>>> Running
>>>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>>>> if I read it correctly, the zerocopy pool may be equal to or larger than
>>>> the descriptor pool. Should we refine the zcopy_used test
>>>>
>>>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>>>
>>>> to also return false if the number of outstanding ubuf_info is greater
>>>> than, say, vq->num >> 1?
>>>
>>>
>>> We'll need to think about where to put the threshold, but I think it's
>>> a good idea.
>>>
>>> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
>>> resources.
>>>
>>> In a sense it still means once you run out of slots zcopy gets disabled possibly permanently.
>>>
>>> Need to experiment with some numbers.
>>
>> I can take a stab with two flows, one delayed in a deep host qdisc
>> queue. See how this change affects the other flow and also how
>> sensitive that is to the chosen threshold value.
>
> Incomplete results at this stage, but I do see this correlation between
> flows. It occurs even while not running out of zerocopy descriptors,
> which I cannot yet explain.
>
> Running two threads in a guest, each with a udp socket, each
> sending up to 100 datagrams, or until EAGAIN, every msec.
>
> Sender A sends 1B datagrams.
> Sender B sends VHOST_GOODCOPY_LEN, which is enough
> to trigger zcopy_used in vhost net.
>
> A local receive process on the host receives both flows. To avoid
> a deep copy when looping the packet onto the receive path,
> changed skb_orphan_frags_rx to always return false (gross hack).
>
> The flow with the larger packets is redirected through netem on ifb0:
>
>   modprobe ifb
>   ip link set dev ifb0 up
>   tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>
>   tc qdisc add dev tap0 ingress
>   tc filter add dev tap0 parent ffff: protocol ip \
>       u32 match ip dport 8000 0xffff \
>       action mirred egress redirect dev ifb0
>
> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>
> no filter
>   rx.A: ~840,000
>   rx.B: ~840,000
>
> limit 1
>   rx.A: ~500,000
>   rx.B: ~3100
>   ifb0: 3273 sent, 371141 dropped
>
> limit 100
>   rx.A: ~9000
>   rx.B: ~4200
>   ifb0: 4630 sent, 1491 dropped
>
> limit 1000
>   rx.A: ~6800
>   rx.B: ~4200
>   ifb0: 4651 sent, 0 dropped
>
> Sender B is always correctly rate limited to 1 MBps or less. With a
> short queue, it ends up dropping a lot and sending even less.
>
> When a queue builds up for sender B, sender A throughput is strongly
> correlated with queue length. With queue length 1, it can send almost
> at unthrottled speed. But even at limit 100 its throughput is on the
> same order as sender B.
>
> What is surprising to me is that this happens even though the number
> of ubuf_info in use at limit 100 is around 100 at all times. In other words,
> it does not exhaust the pool.
>
> When forcing zcopy_used to be false for all packets, this effect of
> sender A throughput being correlated with sender B does not happen.
>
> no filter
>   rx.A: ~850,000
>   rx.B: ~850,000
>
> limit 100
>   rx.A: ~850,000
>   rx.B: ~4200
>   ifb0: 4518 sent, 876182 dropped
>
> Also relevant is that with zerocopy, the sender processes back off
> and report the same count as the receiver. Without zerocopy,
> both senders send at full speed, even if only 4200 packets from flow
> B arrive at the receiver.
>
> This is with the default virtio_net driver, so without napi-tx.
>
> It appears that the zerocopy notifications are pausing the guest.
> Will look at that now.

It was indeed as simple as that. With 256 descriptors, queuing even
a hundred or so packets causes the guest to stall the device as soon
as the qdisc is installed.

Adding this check

+                       in_use = nvq->upend_idx - nvq->done_idx;
+                       if (nvq->upend_idx < nvq->done_idx)
+                               in_use += UIO_MAXIOV;
+
+                       if (in_use > (vq->num >> 2))
+                               zcopy_used = false;

Has the desired behavior of reverting zerocopy requests to copying.

Without this change, the result is, as previously reported, throughput
dropping to hundreds of packets per second on both flows.

With the change, pps as observed for a few seconds at handle_tx is

zerocopy=165 copy=168435
zerocopy=0 copy=168500
zerocopy=65 copy=168535

Both flows continue to send at more or less normal rate, with only
sender B observing massive drops at the netem.

With the queue removed the rate reverts to

zerocopy=58878 copy=110239
zerocopy=58833 copy=110207

This is not a 50/50 split, which implies that some packets from the large
packet flow are still converted to copying. Without the change the rate
without queue was 80k zerocopy vs 80k copy, so this choice of
(vq->num >> 2) appears too conservative.

However, testing with (vq->num >> 1) was not as effective at mitigating
stalls. I did not save that data, unfortunately. Can run more tests on fine
tuning this variable, if the idea sounds good.

I also compiled qemu with 1024 descriptors in the tx ring instead of 256.
In that case the test

                if (unlikely(vhost_exceeds_maxpend(net)))
                        break;

which is

        return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
                == nvq->done_idx;

Is hit before my proposed change, again stalling the device instead of
reverting to copying.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-30  3:11                                 ` Willem de Bruijn
@ 2017-09-01  3:08                                   ` Jason Wang
  0 siblings, 0 replies; 47+ messages in thread
From: Jason Wang @ 2017-09-01  3:08 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization, Network Development



On 2017-08-30 11:11, Willem de Bruijn wrote:
> On Tue, Aug 29, 2017 at 9:45 PM, Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2017-08-30 03:35, Willem de Bruijn wrote:
>>> On Fri, Aug 25, 2017 at 9:03 PM, Willem de Bruijn
>>> <willemdebruijn.kernel@gmail.com> wrote:
>>>> On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com>
>>>> wrote:
>>>>> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:

[...]

>>> Incomplete results at this stage, but I do see this correlation between
>>> flows. It occurs even while not running out of zerocopy descriptors,
>>> which I cannot yet explain.
>>>
>>> Running two threads in a guest, each with a udp socket, each
>>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>>
>>> Sender A sends 1B datagrams.
>>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>>> to trigger zcopy_used in vhost net.
>>>
>>> A local receive process on the host receives both flows. To avoid
>>> a deep copy when looping the packet onto the receive path,
>>> changed skb_orphan_frags_rx to always return false (gross hack).
>>>
>>> The flow with the larger packets is redirected through netem on ifb0:
>>>
>>>     modprobe ifb
>>>     ip link set dev ifb0 up
>>>     tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>>
>>>     tc qdisc add dev tap0 ingress
>>>     tc filter add dev tap0 parent ffff: protocol ip \
>>>         u32 match ip dport 8000 0xffff \
>>>         action mirred egress redirect dev ifb0
>>>
>>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>>
>>> no filter
>>>     rx.A: ~840,000
>>>     rx.B: ~840,000
>>
>> Just to make sure I understand the case here. What did rx.B mean here? I
>> thought all traffic sent by Sender B has been redirected to ifb0?
> It has been, but the packet still arrives at the destination socket.
> IFB is a special virtual device that applies traffic shaping and
> then reinjects it back at the point where it was intercepted by mirred.
>
> rx.B is indeed arrival rate at the receiver, similar to rx.A.
>

I see, then ifb looks like a pretty good fit for the test.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-08-31 14:30                               ` Willem de Bruijn
@ 2017-09-01  3:25                                 ` Jason Wang
  2017-09-01 16:15                                   ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2017-09-01  3:25 UTC (permalink / raw)
  To: Willem de Bruijn, Michael S. Tsirkin
  Cc: Koichiro Den, virtualization, Network Development



On 2017-08-31 22:30, Willem de Bruijn wrote:
>> Incomplete results at this stage, but I do see this correlation between
>> flows. It occurs even while not running out of zerocopy descriptors,
>> which I cannot yet explain.
>>
>> Running two threads in a guest, each with a udp socket, each
>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>
>> Sender A sends 1B datagrams.
>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>> to trigger zcopy_used in vhost net.
>>
>> A local receive process on the host receives both flows. To avoid
>> a deep copy when looping the packet onto the receive path,
>> changed skb_orphan_frags_rx to always return false (gross hack).
>>
>> The flow with the larger packets is redirected through netem on ifb0:
>>
>>    modprobe ifb
>>    ip link set dev ifb0 up
>>    tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>
>>    tc qdisc add dev tap0 ingress
>>    tc filter add dev tap0 parent ffff: protocol ip \
>>        u32 match ip dport 8000 0xffff \
>>        action mirred egress redirect dev ifb0
>>
>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>
>> no filter
>>    rx.A: ~840,000
>>    rx.B: ~840,000
>>
>> limit 1
>>    rx.A: ~500,000
>>    rx.B: ~3100
>>    ifb0: 3273 sent, 371141 dropped
>>
>> limit 100
>>    rx.A: ~9000
>>    rx.B: ~4200
>>    ifb0: 4630 sent, 1491 dropped
>>
>> limit 1000
>>    rx.A: ~6800
>>    rx.B: ~4200
>>    ifb0: 4651 sent, 0 dropped
>>
>> Sender B is always correctly rate limited to 1 MBps or less. With a
>> short queue, it ends up dropping a lot and sending even less.
>>
>> When a queue builds up for sender B, sender A throughput is strongly
>> correlated with queue length. With queue length 1, it can send almost
>> at unthrottled speed. But even at limit 100 its throughput is on the
>> same order as sender B.
>>
>> What is surprising to me is that this happens even though the number
>> of ubuf_info in use at limit 100 is around 100 at all times. In other words,
>> it does not exhaust the pool.
>>
>> When forcing zcopy_used to be false for all packets, this effect of
>> sender A throughput being correlated with sender B does not happen.
>>
>> no filter
>>    rx.A: ~850,000
>>    rx.B: ~850,000
>>
>> limit 100
>>    rx.A: ~850,000
>>    rx.B: ~4200
>>    ifb0: 4518 sent, 876182 dropped
>>
>> Also relevant is that with zerocopy, the sender processes back off
>> and report the same count as the receiver. Without zerocopy,
>> both senders send at full speed, even if only 4200 packets from flow
>> B arrive at the receiver.
>>
>> This is with the default virtio_net driver, so without napi-tx.
>>
>> It appears that the zerocopy notifications are pausing the guest.
>> Will look at that now.
> It was indeed as simple as that. With 256 descriptors, queuing even
> a hundred or so packets causes the guest to stall the device as soon
> as the qdisc is installed.
>
> Adding this check
>
> +                       in_use = nvq->upend_idx - nvq->done_idx;
> +                       if (nvq->upend_idx < nvq->done_idx)
> +                               in_use += UIO_MAXIOV;
> +
> +                       if (in_use > (vq->num >> 2))
> +                               zcopy_used = false;
>
> Has the desired behavior of reverting zerocopy requests to copying.
>
> Without this change, the result is, as previously reported, throughput
> dropping to hundreds of packets per second on both flows.
>
> With the change, pps as observed for a few seconds at handle_tx is
>
> zerocopy=165 copy=168435
> zerocopy=0 copy=168500
> zerocopy=65 copy=168535
>
> Both flows continue to send at more or less normal rate, with only
> sender B observing massive drops at the netem.
>
> With the queue removed the rate reverts to
>
> zerocopy=58878 copy=110239
> zerocopy=58833 copy=110207
>
> This is not a 50/50 split, which implies that some packets from the large
> packet flow are still converted to copying. Without the change the rate
> without queue was 80k zerocopy vs 80k copy, so this choice of
> (vq->num >> 2) appears too conservative.
>
> However, testing with (vq->num >> 1) was not as effective at mitigating
> stalls. I did not save that data, unfortunately. Can run more tests on fine
> tuning this variable, if the idea sounds good.

Looks like there are still two cases left:

1) sndbuf is not INT_MAX
2) tx napi is used for virtio-net

1) could be a corner case, and for 2) what you suggest here may not
solve the issue since it still does in-order completion.

Thanks

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-09-01  3:25                                 ` Jason Wang
@ 2017-09-01 16:15                                   ` Willem de Bruijn
  2017-09-01 16:17                                     ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-09-01 16:15 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization, Network Development

On Thu, Aug 31, 2017 at 11:25 PM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017-08-31 22:30, Willem de Bruijn wrote:
>>>
>>> Incomplete results at this stage, but I do see this correlation between
>>> flows. It occurs even while not running out of zerocopy descriptors,
>>> which I cannot yet explain.
>>>
>>> Running two threads in a guest, each with a udp socket, each
>>> sending up to 100 datagrams, or until EAGAIN, every msec.
>>>
>>> Sender A sends 1B datagrams.
>>> Sender B sends VHOST_GOODCOPY_LEN, which is enough
>>> to trigger zcopy_used in vhost net.
>>>
>>> A local receive process on the host receives both flows. To avoid
>>> a deep copy when looping the packet onto the receive path,
>>> changed skb_orphan_frags_rx to always return false (gross hack).
>>>
>>> The flow with the larger packets is redirected through netem on ifb0:
>>>
>>>    modprobe ifb
>>>    ip link set dev ifb0 up
>>>    tc qdisc add dev ifb0 root netem limit $LIMIT rate 1MBit
>>>
>>>    tc qdisc add dev tap0 ingress
>>>    tc filter add dev tap0 parent ffff: protocol ip \
>>>        u32 match ip dport 8000 0xffff \
>>>        action mirred egress redirect dev ifb0
>>>
>>> For 10 second run, packet count with various ifb0 queue lengths $LIMIT:
>>>
>>> no filter
>>>    rx.A: ~840,000
>>>    rx.B: ~840,000
>>>
>>> limit 1
>>>    rx.A: ~500,000
>>>    rx.B: ~3100
>>>    ifb0: 3273 sent, 371141 dropped
>>>
>>> limit 100
>>>    rx.A: ~9000
>>>    rx.B: ~4200
>>>    ifb0: 4630 sent, 1491 dropped
>>>
>>> limit 1000
>>>    rx.A: ~6800
>>>    rx.B: ~4200
>>>    ifb0: 4651 sent, 0 dropped
>>>
>>> Sender B is always correctly rate limited to 1 MBps or less. With a
>>> short queue, it ends up dropping a lot and sending even less.
>>>
>>> When a queue builds up for sender B, sender A throughput is strongly
>>> correlated with queue length. With queue length 1, it can send almost
>>> at unthrottled speed. But even at limit 100 its throughput is on the
>>> same order as sender B.
>>>
>>> What is surprising to me is that this happens even though the number
>>> of ubuf_info in use at limit 100 is around 100 at all times. In other
>>> words,
>>> it does not exhaust the pool.
>>>
>>> When forcing zcopy_used to be false for all packets, this effect of
>>> sender A throughput being correlated with sender B does not happen.
>>>
>>> no filter
>>>    rx.A: ~850,000
>>>    rx.B: ~850,000
>>>
>>> limit 100
>>>    rx.A: ~850,000
>>>    rx.B: ~4200
>>>    ifb0: 4518 sent, 876182 dropped
>>>
>>> Also relevant is that with zerocopy, the sender processes back off
>>> and report the same count as the receiver. Without zerocopy,
>>> both senders send at full speed, even if only 4200 packets from flow
>>> B arrive at the receiver.
>>>
>>> This is with the default virtio_net driver, so without napi-tx.
>>>
>>> It appears that the zerocopy notifications are pausing the guest.
>>> Will look at that now.
>>
>> It was indeed as simple as that. With 256 descriptors, queuing even
>> a hundred or so packets causes the guest to stall the device as soon
>> as the qdisc is installed.
>>
>> Adding this check
>>
>> +                       in_use = nvq->upend_idx - nvq->done_idx;
>> +                       if (nvq->upend_idx < nvq->done_idx)
>> +                               in_use += UIO_MAXIOV;
>> +
>> +                       if (in_use > (vq->num >> 2))
>> +                               zcopy_used = false;
>>
>> Has the desired behavior of reverting zerocopy requests to copying.
>>
>> Without this change, the result is, as previously reported, throughput
>> dropping to hundreds of packets per second on both flows.
>>
>> With the change, pps as observed for a few seconds at handle_tx is
>>
>> zerocopy=165 copy=168435
>> zerocopy=0 copy=168500
>> zerocopy=65 copy=168535
>>
>> Both flows continue to send at more or less normal rate, with only
>> sender B observing massive drops at the netem.
>>
>> With the queue removed the rate reverts to
>>
>> zerocopy=58878 copy=110239
>> zerocopy=58833 copy=110207
>>
>> This is not a 50/50 split, which implies that some packets from the
>> large
>> packet flow are still converted to copying. Without the change the rate
>> without queue was 80k zerocopy vs 80k copy, so this choice of
>> (vq->num >> 2) appears too conservative.
>>
>> However, testing with (vq->num >> 1) was not as effective at mitigating
>> stalls. I did not save that data, unfortunately. Can run more tests on
>> fine
>> tuning this variable, if the idea sounds good.
>
>
> Looks like there are still two cases left:

To be clear, this patch is not intended to fix all issues. It is a small
improvement to avoid HoL blocking due to queued zerocopy skbs.

The trade-off is that reverting to copying in these cases increases
cycle cost. I think that that is a trade-off worth making compared to
the alternative drop in throughput. It probably would be good to be
able to measure this without kernel instrumentation: export
counters similar to net->tx_zcopy_err and net->tx_packets (though
without reset to zero, as in vhost_net_tx_packet).
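
For example (sketch only; the two new fields below, and how they would
be exposed, e.g. via debugfs, are made up for illustration):

	/* In struct vhost_net, next to the existing tx_packets and
	 * tx_zcopy_err, but never reset by vhost_net_tx_packet():
	 */
	u64 tx_zcopy_total;
	u64 tx_copy_total;

	/* in handle_tx(), after the zcopy_used decision: */
	if (zcopy_used)
		net->tx_zcopy_total++;
	else
		net->tx_copy_total++;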

> 1) sndbuf is not INT_MAX

You mean the case where the device stalls, later zerocopy notifications
are queued, but these are never cleaned in free_old_xmit_skbs,
because it requires a start_xmit and by now the (only) socket is out of
descriptors?

A watchdog would help somewhat. With tx-napi, this case cannot occur,
either, as free_old_xmit_skbs no longer depends on a call to start_xmit.

> 2) tx napi is used for virtio-net

I am not aware of any issue specific to the use of tx-napi?

>> 1) could be a corner case, and for 2) what you suggest here may not solve
>> the issue since it still does in-order completion.

Somewhat tangential, but it might also help to break the in-order
completion processing in vhost_zerocopy_signal_used. Complete
all descriptors between done_idx and upend_idx. done_idx should
then only be forwarded to the oldest still not-completed descriptor.

In the test I ran, where the oldest descriptors are held in a queue and
all newer ones are tail-dropped, this would avoid blocking a full ring
of completions, when only a small number (or 1) is actually delayed.

Dynamic switching between copy and zerocopy using zcopy_used
already returns completions out-of-order, so this is not a huge leap.
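
A sketch of what that could look like (untested; modeled on the existing
vhost_zerocopy_signal_used, and signaling each completed buffer
individually rather than batching, for simplicity):

	static void vhost_zerocopy_signal_used_ooo(struct vhost_net *net,
						   struct vhost_virtqueue *vq)
	{
		struct vhost_net_virtqueue *nvq =
			container_of(vq, struct vhost_net_virtqueue, vq);
		bool prefix_done = true;
		int i;

		for (i = nvq->done_idx; i != nvq->upend_idx;
		     i = (i + 1) % UIO_MAXIOV) {
			if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
				vhost_net_tx_err(net);

			if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
				/* Completed: push to the used ring right away,
				 * even if an older buffer is still in flight.
				 * CLEAR_LEN marks the slot as already signaled.
				 */
				vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
				vhost_add_used_and_signal_n(&net->dev, vq,
							    &vq->heads[i], 1);
			} else if (vq->heads[i].len != VHOST_DMA_CLEAR_LEN) {
				/* Still in flight: neither this slot nor any
				 * newer one can be recycled yet.
				 */
				prefix_done = false;
			}

			/* Advance done_idx only up to the oldest buffer that
			 * has not yet completed.
			 */
			if (prefix_done)
				nvq->done_idx = (i + 1) % UIO_MAXIOV;
		}
	}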

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-09-01 16:15                                   ` Willem de Bruijn
@ 2017-09-01 16:17                                     ` Willem de Bruijn
  2017-09-04  3:03                                       ` Jason Wang
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-09-01 16:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization, Network Development

>>> This is not a 50/50 split, which implies that some packets from the
>>> large
>>> packet flow are still converted to copying. Without the change the rate
>>> without queue was 80k zerocopy vs 80k copy, so this choice of
>>> (vq->num >> 2) appears too conservative.
>>>
>>> However, testing with (vq->num >> 1) was not as effective at mitigating
>>> stalls. I did not save that data, unfortunately. Can run more tests on
>>> fine
>>> tuning this variable, if the idea sounds good.
>>
>>
>> Looks like there are still two cases left:
>
> To be clear, this patch is not intended to fix all issues. It is a small
> improvement to avoid HoL blocking due to queued zerocopy skbs.
>
> The trade-off is that reverting to copying in these cases increases
> cycle cost. I think that that is a trade-off worth making compared to
> the alternative drop in throughput. It probably would be good to be
> able to measure this without kernel instrumentation: export
> counters similar to net->tx_zcopy_err and net->tx_packets (though
> without reset to zero, as in vhost_net_tx_packet).
>
>> 1) sndbuf is not INT_MAX
>
> You mean the case where the device stalls, later zerocopy notifications
> are queued, but these are never cleaned in free_old_xmit_skbs,
> because it requires a start_xmit and by now the (only) socket is out of
> descriptors?

Typo, sorry. I meant out of sndbuf.

> A watchdog would help somewhat. With tx-napi, this case cannot occur,
> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
>
>> 2) tx napi is used for virtio-net
>
> I am not aware of any issue specific to the use of tx-napi?
>
>>> 1) could be a corner case, and for 2) what you suggest here may not solve
>>> the issue since it still does in-order completion.
>
> Somewhat tangential, but it might also help to break the in-order
> completion processing in vhost_zerocopy_signal_used. Complete
> all descriptors between done_idx and upend_idx. done_idx should
>> then only be forwarded to the oldest still not-completed descriptor.
>
> In the test I ran, where the oldest descriptors are held in a queue and
> all newer ones are tail-dropped, this would avoid blocking a full ring
> of completions, when only a small number (or 1) is actually delayed.
>
> Dynamic switching between copy and zerocopy using zcopy_used
> already returns completions out-of-order, so this is not a huge leap.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-09-01 16:17                                     ` Willem de Bruijn
@ 2017-09-04  3:03                                       ` Jason Wang
  2017-09-05 14:09                                         ` Willem de Bruijn
  0 siblings, 1 reply; 47+ messages in thread
From: Jason Wang @ 2017-09-04  3:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Michael S. Tsirkin, Koichiro Den, virtualization, Network Development



On 2017-09-02 00:17, Willem de Bruijn wrote:
>>>> This is not a 50/50 split, which implies that some packets from the
>>>> large
>>>> packet flow are still converted to copying. Without the change the rate
>>>> without queue was 80k zerocopy vs 80k copy, so this choice of
>>>> (vq->num >> 2) appears too conservative.
>>>>
>>>> However, testing with (vq->num >> 1) was not as effective at mitigating
>>>> stalls. I did not save that data, unfortunately. Can run more tests on
>>>> fine
>>>> tuning this variable, if the idea sounds good.
>>>
>>> Looks like there are still two cases left:
>> To be clear, this patch is not intended to fix all issues. It is a small
>> improvement to avoid HoL blocking due to queued zerocopy skbs.

Right, just want to see if there's anything left.

>>
>> The trade-off is that reverting to copying in these cases increases
>> cycle cost. I think that that is a trade-off worth making compared to
>> the alternative drop in throughput. It probably would be good to be
>> able to measure this without kernel instrumentation: export
>> counters similar to net->tx_zcopy_err and net->tx_packets (though
>> without reset to zero, as in vhost_net_tx_packet).

I think it's acceptable if extra cycles were spent if we detect HOL anyhow.

>>
>>> 1) sndbuf is not INT_MAX
>> You mean the case where the device stalls, later zerocopy notifications
>> are queued, but these are never cleaned in free_old_xmit_skbs,
>> because it requires a start_xmit and by now the (only) socket is out of
>> descriptors?
> Typo, sorry. I meant out of sndbuf.

I mean e.g. for tun. If its sndbuf is smaller than e.g. (vq->num >> 1) *
$pkt_size and all packets were held by some module, a limitation like
vq->num >> 1 won't work since we hit sndbuf before it.
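
(For illustration, with made-up numbers: with vq->num = 256 the threshold
would allow 128 outstanding zerocopy packets, which at ~1500 bytes each is
roughly 190KB in flight, so a tun sndbuf configured below that is exhausted
first and transmission stalls on sndbuf before the limit ever triggers the
copy fallback.)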

>
>> A watchdog would help somewhat. With tx-napi, this case cannot occur,
>> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
>>
>>> 2) tx napi is used for virtio-net
>> I am not aware of any issue specific to the use of tx-napi?

I might not have been clear here; I mean e.g. virtio_net (tx-napi) in the
guest + vhost_net (zerocopy) in the host. In this case, even if we switch
to datacopy when the ubuf count exceeds vq->num >> 1, we still complete tx
buffers in order, so the tx interrupt could be delayed for an indefinite time.

>>
>>>> 1) could be a corner case, and for 2) what you suggest here may not solve
>>>> the issue since it still does in-order completion.
>> Somewhat tangential, but it might also help to break the in-order
>> completion processing in vhost_zerocopy_signal_used. Complete
>> all descriptors between done_idx and upend_idx. done_idx should
>>> then only be forwarded to the oldest still not-completed descriptor.
>>
>> In the test I ran, where the oldest descriptors are held in a queue and
>> all newer ones are tail-dropped,

Do you mean the descriptors were tail-dropped by vhost?

>> this would avoid blocking a full ring
>> of completions, when only a small number (or 1) is actually delayed.
>>
>> Dynamic switching between copy and zerocopy using zcopy_used
>> already returns completions out-of-order, so this is not a huge leap.

Yes.

Thanks

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-09-04  3:03                                       ` Jason Wang
@ 2017-09-05 14:09                                         ` Willem de Bruijn
  2017-09-06  3:27                                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 47+ messages in thread
From: Willem de Bruijn @ 2017-09-05 14:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: Network Development, virtualization, Koichiro Den, Michael S. Tsirkin

On Mon, Sep 4, 2017 at 5:03 AM, Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2017-09-02 00:17, Willem de Bruijn wrote:
>>>>>
>>>>> This is not a 50/50 split, which implies that some packets from the
>>>>> large
>>>>> packet flow are still converted to copying. Without the change the rate
>>>>> without queue was 80k zerocopy vs 80k copy, so this choice of
>>>>> (vq->num >> 2) appears too conservative.
>>>>>
>>>>> However, testing with (vq->num >> 1) was not as effective at mitigating
>>>>> stalls. I did not save that data, unfortunately. Can run more tests on
>>>>> fine
>>>>> tuning this variable, if the idea sounds good.
>>>>
>>>>
>>>> Looks like there are still two cases left:
>>>
>>> To be clear, this patch is not intended to fix all issues. It is a small
>>> improvement to avoid HoL blocking due to queued zerocopy skbs.
>
>
> Right, just want to see if there's anything left.
>
>>>
>>> The trade-off is that reverting to copying in these cases increases
>>> cycle cost. I think that that is a trade-off worth making compared to
>>> the alternative drop in throughput. It probably would be good to be
>>> able to measure this without kernel instrumentation: export
>>> counters similar to net->tx_zcopy_err and net->tx_packets (though
>>> without reset to zero, as in vhost_net_tx_packet).
>
>
> I think it's acceptable if extra cycles were spent if we detect HOL anyhow.
>
>>>
>>>> 1) sndbuf is not INT_MAX
>>>
>>> You mean the case where the device stalls, later zerocopy notifications
>>> are queued, but these are never cleaned in free_old_xmit_skbs,
>>> because it requires a start_xmit and by now the (only) socket is out of
>>> descriptors?
>>
>> Typo, sorry. I meant out of sndbuf.
>
>
> I mean e.g. for tun. If its sndbuf is smaller than e.g. (vq->num >> 1) *
> $pkt_size and all packets were held by some module, a limitation like
> vq->num >> 1 won't work since we hit sndbuf before it.

Good point.

>>
>>> A watchdog would help somewhat. With tx-napi, this case cannot occur,
>>> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
>>>
>>>> 2) tx napi is used for virtio-net
>>>
>>> I am not aware of any issue specific to the use of tx-napi?
>
>
> I might not have been clear here; I mean e.g. virtio_net (tx-napi) in the
> guest + vhost_net (zerocopy) in the host. In this case, even if we switch to
> datacopy when the ubuf count exceeds vq->num >> 1, we still complete tx
> buffers in order, so the tx interrupt could be delayed for an indefinite time.

Copied buffers are completed immediately in handle_tx.

Do you mean when a process sends fewer packets than vq->num >> 1,
so that all are queued? Yes, then the latency is indeed that of the last
element leaving the qdisc.

>>>
>>>> 1) could be a corner case, and for 2) what you suggest here may not
>>>> solve the issue since it still does in-order completion.
>>>
>>> Somewhat tangential, but it might also help to break the in-order
>>> completion processing in vhost_zerocopy_signal_used. Complete
>>> all descriptors between done_idx and upend_idx. done_idx should
>>> then only be forwarded to the oldest still not-completed descriptor.
>>>
>>> In the test I ran, where the oldest descriptors are held in a queue and
>>> all newer ones are tail-dropped,
>
>
> Do you mean the descriptors were tail-dropped by vhost?

Tail-dropped by netem. The dropped items are completed out of
order by vhost before the held items.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
  2017-09-05 14:09                                         ` Willem de Bruijn
@ 2017-09-06  3:27                                           ` Michael S. Tsirkin
  0 siblings, 0 replies; 47+ messages in thread
From: Michael S. Tsirkin @ 2017-09-06  3:27 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jason Wang, Koichiro Den, virtualization, Network Development

On Tue, Sep 05, 2017 at 04:09:19PM +0200, Willem de Bruijn wrote:
> On Mon, Sep 4, 2017 at 5:03 AM, Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2017-09-02 00:17, Willem de Bruijn wrote:
> >>>>>
> >>>>> This is not a 50/50 split, which implies that some packets from the
> >>>>> large
> >>>>> packet flow are still converted to copying. Without the change the rate
> >>>>> without queue was 80k zerocopy vs 80k copy, so this choice of
> >>>>> (vq->num >> 2) appears too conservative.
> >>>>>
> >>>>> However, testing with (vq->num >> 1) was not as effective at mitigating
> >>>>> stalls. I did not save that data, unfortunately. Can run more tests on
> >>>>> fine
> >>>>> tuning this variable, if the idea sounds good.
> >>>>
> >>>>
> >>>> Looks like there are still two cases left:
> >>>
> >>> To be clear, this patch is not intended to fix all issues. It is a small
> >>> improvement to avoid HoL blocking due to queued zerocopy skbs.
> >
> >
> > Right, just want to see if there's anything left.
> >
> >>>
> >>> The trade-off is that reverting to copying in these cases increases
> >>> cycle cost. I think that that is a trade-off worth making compared to
> >>> the alternative drop in throughput. It probably would be good to be
> >>> able to measure this without kernel instrumentation: export
> >>> counters similar to net->tx_zcopy_err and net->tx_packets (though
> >>> without reset to zero, as in vhost_net_tx_packet).
> >
> >
> > I think it's acceptable if extra cycles were spent if we detect HOL anyhow.
> >
> >>>
> >>>> 1) sndbuf is not INT_MAX
> >>>
> >>> You mean the case where the device stalls, later zerocopy notifications
> >>> are queued, but these are never cleaned in free_old_xmit_skbs,
> >>> because it requires a start_xmit and by now the (only) socket is out of
> >>> descriptors?
> >>
> >> Typo, sorry. I meant out of sndbuf.
> >
> >
> > I mean e.g. for tun. If its sndbuf is smaller than e.g. (vq->num >> 1) *
> > $pkt_size and all packets were held by some module, a limitation like
> > vq->num >> 1 won't work since we hit sndbuf before it.
> 
> Good point.

I agree, however anyone using SNDBUF < infinity already hits HOQ blocking
in some scenarios.


> >>
> >>> A watchdog would help somewhat. With tx-napi, this case cannot occur,
> >>> either, as free_old_xmit_skbs no longer depends on a call to start_xmit.
> >>>
> >>>> 2) tx napi is used for virtio-net
> >>>
> >>> I am not aware of any issue specific to the use of tx-napi?
> >
> >
> > I might not have been clear here; I mean e.g. virtio_net (tx-napi) in the
> > guest + vhost_net (zerocopy) in the host. In this case, even if we switch to
> > datacopy when the ubuf count exceeds vq->num >> 1, we still complete tx
> > buffers in order, so the tx interrupt could be delayed for an indefinite time.
> 
> Copied buffers are completed immediately in handle_tx.
> 
> Do you mean when a process sends fewer packets than vq->num >> 1,
> so that all are queued? Yes, then the latency is indeed that of the last
> element leaving the qdisc.
> 
> >>>
> >>>> 1) could be a corner case, and for 2) what you suggest here may not
> >>>> solve the issue since it still does in-order completion.
> >>>
> >>> Somewhat tangential, but it might also help to break the in-order
> >>> completion processing in vhost_zerocopy_signal_used. Complete
> >>> all descriptors between done_idx and upend_idx. done_idx should
> >>> then only be forwarded to the oldest still not-completed descriptor.
> >>>
> >>> In the test I ran, where the oldest descriptors are held in a queue and
> >>> all newer ones are tail-dropped,
> >
> >
> > Do you mean the descriptors were tail-dropped by vhost?
> 
> Tail-dropped by netem. The dropped items are completed out of
> order by vhost before the held items.

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2017-09-06  3:27 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-19  6:38 [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi Koichiro Den
2017-08-20 20:49 ` Willem de Bruijn
2017-08-21 12:40   ` Koichiro Den
2017-08-22 12:11   ` Willem de Bruijn
2017-08-22 14:04     ` Koichiro Den
2017-08-22 17:19       ` Willem de Bruijn
2017-08-23 14:26         ` Koichiro Den
2017-08-21 12:33 ` Jason Wang
2017-08-21 12:58   ` Koichiro Den
2017-08-21 15:41   ` Willem de Bruijn
2017-08-22  2:50     ` Jason Wang
2017-08-22  3:10       ` Willem de Bruijn
2017-08-22 11:47         ` Jason Wang
2017-08-22 13:42         ` Koichiro Den
2017-08-22 17:16           ` Willem de Bruijn
2017-08-23 14:24             ` Koichiro Den
2017-08-22 17:55       ` Michael S. Tsirkin
2017-08-22 18:01         ` David Miller
2017-08-22 18:28           ` Eric Dumazet
2017-08-22 18:39             ` Michael S. Tsirkin
2017-08-23 14:28         ` Koichiro Den
2017-08-23 14:47           ` Koichiro Den
2017-08-23 15:20           ` Willem de Bruijn
2017-08-23 22:57             ` Michael S. Tsirkin
2017-08-24  3:28               ` Willem de Bruijn
2017-08-24  4:34                 ` Michael S. Tsirkin
2017-08-24 13:50                 ` Michael S. Tsirkin
2017-08-24 20:20                   ` Willem de Bruijn
2017-08-24 20:50                     ` Michael S. Tsirkin
2017-08-25 22:44                       ` Willem de Bruijn
2017-08-25 23:32                         ` Michael S. Tsirkin
2017-08-26  1:03                           ` Willem de Bruijn
2017-08-29 19:35                             ` Willem de Bruijn
2017-08-29 19:42                               ` Michael S. Tsirkin
2017-08-29 19:53                                 ` Willem de Bruijn
2017-08-29 20:40                                   ` Michael S. Tsirkin
2017-08-29 22:55                                     ` Willem de Bruijn
2017-08-30  1:45                               ` Jason Wang
2017-08-30  3:11                                 ` Willem de Bruijn
2017-09-01  3:08                                   ` Jason Wang
2017-08-31 14:30                               ` Willem de Bruijn
2017-09-01  3:25                                 ` Jason Wang
2017-09-01 16:15                                   ` Willem de Bruijn
2017-09-01 16:17                                     ` Willem de Bruijn
2017-09-04  3:03                                       ` Jason Wang
2017-09-05 14:09                                         ` Willem de Bruijn
2017-09-06  3:27                                           ` Michael S. Tsirkin
