* [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
@ 2022-01-06  0:46 Daniel Borkmann
From: Daniel Borkmann @ 2022-01-06  0:46 UTC
  To: davem, kuba
  Cc: netdev, laurent.bernaille, maciej.fijalkowski, toshiaki.makita1,
	eric.dumazet, pabeni, john.fastabend, willemb, Daniel Borkmann

Laurent reported that they have seen a significant number of TCP retransmissions
at high throughput from applications residing in network namespaces talking to
the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
as per systemd default) of the phys device such as ena or virtio_net due to all
traffic hitting a _single_ TX queue _despite_ a multi-queue device. (Note that
the setup was _not_ using XDP on veths as the issue is generic.)

More specifically, after edbea9220251 ("veth: Store queue_mapping independently
of XDP prog presence"), which made it all the way back to v4.19.184+,
skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
queue by default for veths) instead of leaving it at 0.
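
For reference, the helpers involved are tiny; paraphrased from
include/linux/skbuff.h (exact definitions may vary by kernel version),
queue_mapping treats 0 as "not recorded", so recording RX queue N stores N + 1:

  static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
  {
          skb->queue_mapping = rx_queue + 1;
  }

  static inline u16 skb_get_rx_queue(const struct sk_buff *skb)
  {
          return skb->queue_mapping - 1;
  }

  static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
  {
          return skb->queue_mapping != 0;
  }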

This is eventually retained, and callbacks like ena_select_queue() will also
pick a single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the
traffic is forwarded to that device via the upper stack or other means.
Similarly, for drivers not implementing ndo_select_queue(), if XPS is disabled,
netdev_pick_tx() might call into skb_tx_hash() and check for a prior
skb_rx_queue_recorded() as well.
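
In the skb_tx_hash() case the recorded RX queue is then reused verbatim as the
TX queue index; roughly sketched from net/core/dev.c (sb_dev/queue offset
handling omitted, so treat this as a condensed sketch):

  u16 skb_tx_hash(const struct net_device *dev, struct sk_buff *skb)
  {
          u16 qoffset = 0;
          u16 qcount = dev->real_num_tx_queues;

          if (skb_rx_queue_recorded(skb)) {
                  u32 hash = skb_get_rx_queue(skb);

                  /* recorded rx queue wins over the flow hash */
                  while (unlikely(hash >= qcount))
                          hash -= qcount;
                  return hash + qoffset;
          }
          return (u16)reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
  }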

In general, it is a _bad_ idea for virtual devices like veth to mess around
with queue selection [by default]. Given dev->real_num_tx_queues is 1 by
default, skb->queue_mapping was left untouched, and so prior to edbea9220251
netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys
device.

Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
from veth_xmit() altogether.

If the veth peer has an XDP program attached, then it would return the first RX
queue index in xdp_md->rx_queue_index (unless configured in a non-default manner).
However, this is still better than breaking the generic case.

Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence")
Fixes: 638264dc9022 ("veth: Support per queue XDP ring")
Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
---
 drivers/net/veth.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d21dd25f429e..354a963075c5 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -335,7 +335,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		 */
 		use_napi = rcu_access_pointer(rq->napi) &&
 			   veth_skb_is_eligible_for_gro(dev, rcv, skb);
-		skb_record_rx_queue(skb, rxq);
 	}
 
 	skb_tx_timestamp(skb);
-- 
2.21.0



* RE: [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
From: John Fastabend @ 2022-01-06  3:22 UTC
  To: Daniel Borkmann, davem, kuba
  Cc: netdev, laurent.bernaille, maciej.fijalkowski, toshiaki.makita1,
	eric.dumazet, pabeni, john.fastabend, willemb, Daniel Borkmann

Daniel Borkmann wrote:
> Laurent reported that they have seen a significant number of TCP retransmissions
> at high throughput from applications residing in network namespaces talking to
> the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
> as per systemd default) of the phys device such as ena or virtio_net due to all
> traffic hitting a _single_ TX queue _despite_ a multi-queue device. (Note that
> the setup was _not_ using XDP on veths as the issue is generic.)
> 
> More specifically, after edbea9220251 ("veth: Store queue_mapping independently
> of XDP prog presence"), which made it all the way back to v4.19.184+,
> skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
> queue by default for veths) instead of leaving it at 0.
> 
> This is eventually retained, and callbacks like ena_select_queue() will also
> pick a single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the
> traffic is forwarded to that device via the upper stack or other means.
> Similarly, for drivers not implementing ndo_select_queue(), if XPS is disabled,
> netdev_pick_tx() might call into skb_tx_hash() and check for a prior
> skb_rx_queue_recorded() as well.
> 
> In general, it is a _bad_ idea for virtual devices like veth to mess around
> with queue selection [by default]. Given dev->real_num_tx_queues is 1 by
> default, skb->queue_mapping was left untouched, and so prior to edbea9220251
> netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys
> device.
> 
> Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
> from veth_xmit() altogether.
> 
> If the veth peer has an XDP program attached, then it would return the first RX
> queue index in xdp_md->rx_queue_index (unless configured in a non-default manner).
> However, this is still better than breaking the generic case.

Agree on all the above. Fix LGTM thanks!

Acked-by: John Fastabend <john.fastabend@gmail.com>

> 
> Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence")
> Fixes: 638264dc9022 ("veth: Support per queue XDP ring")
> Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Willem de Bruijn <willemb@google.com>
> ---
>  drivers/net/veth.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index d21dd25f429e..354a963075c5 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -335,7 +335,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>  		 */
>  		use_napi = rcu_access_pointer(rq->napi) &&
>  			   veth_skb_is_eligible_for_gro(dev, rcv, skb);
> -		skb_record_rx_queue(skb, rxq);
>  	}
>  
>  	skb_tx_timestamp(skb);
> -- 
> 2.21.0
> 




* Re: [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
From: Eric Dumazet @ 2022-01-06  9:00 UTC
  To: Daniel Borkmann, davem, kuba
  Cc: netdev, laurent.bernaille, maciej.fijalkowski, toshiaki.makita1,
	pabeni, john.fastabend, willemb


On 1/5/22 16:46, Daniel Borkmann wrote:
> Laurent reported that they have seen a significant number of TCP retransmissions
> at high throughput from applications residing in network namespaces talking to
> the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
> as per systemd default) of the phys device such as ena or virtio_net due to all
> traffic hitting a _single_ TX queue _despite_ a multi-queue device. (Note that
> the setup was _not_ using XDP on veths as the issue is generic.)
>
> More specifically, after edbea9220251 ("veth: Store queue_mapping independently
> of XDP prog presence"), which made it all the way back to v4.19.184+,
> skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
> queue by default for veths) instead of leaving it at 0.
>
> This is eventually retained, and callbacks like ena_select_queue() will also
> pick a single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the
> traffic is forwarded to that device via the upper stack or other means.
> Similarly, for drivers not implementing ndo_select_queue(), if XPS is disabled,
> netdev_pick_tx() might call into skb_tx_hash() and check for a prior
> skb_rx_queue_recorded() as well.
>
> In general, it is a _bad_ idea for virtual devices like veth to mess around
> with queue selection [by default]. Given dev->real_num_tx_queues is 1 by
> default, skb->queue_mapping was left untouched, and so prior to edbea9220251
> netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys
> device.


Nice changelog and fix, thanks Daniel !

Reviewed-by: Eric Dumazet <edumazet@google.com>

> Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
> from veth_xmit() altogether.
>
> If the veth peer has an XDP program attached, then it would return the first RX
> queue index in xdp_md->rx_queue_index (unless configured in a non-default manner).
> However, this is still better than breaking the generic case.
>
> Fixes: edbea9220251 ("veth: Store queue_mapping independently of XDP prog presence")
> Fixes: 638264dc9022 ("veth: Support per queue XDP ring")
> Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Willem de Bruijn <willemb@google.com>
> ---
>   drivers/net/veth.c | 1 -
>   1 file changed, 1 deletion(-)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index d21dd25f429e..354a963075c5 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -335,7 +335,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
>   		 */
>   		use_napi = rcu_access_pointer(rq->napi) &&
>   			   veth_skb_is_eligible_for_gro(dev, rcv, skb);
> -		skb_record_rx_queue(skb, rxq);
>   	}
>   
>   	skb_tx_timestamp(skb);


* Re: [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
From: Toshiaki Makita @ 2022-01-06 12:57 UTC
  To: Daniel Borkmann
  Cc: netdev, laurent.bernaille, maciej.fijalkowski, eric.dumazet,
	pabeni, john.fastabend, willemb, davem, kuba

On 2022/01/06 9:46, Daniel Borkmann wrote:

Thank you for the fix.

> (unless configured in a non-default manner).

Just curious, is there any special configuration to record the rx queue index on
veth after your patch?

Anyway,

Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>


* Re: [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
From: patchwork-bot+netdevbpf @ 2022-01-06 14:00 UTC
  To: Daniel Borkmann
  Cc: davem, kuba, netdev, laurent.bernaille, maciej.fijalkowski,
	toshiaki.makita1, eric.dumazet, pabeni, john.fastabend, willemb

Hello:

This patch was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Thu,  6 Jan 2022 01:46:06 +0100 you wrote:
> Laurent reported that they have seen a significant number of TCP retransmissions
> at high throughput from applications residing in network namespaces talking to
> the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
> as per systemd default) of the phys device such as ena or virtio_net due to all
> traffic hitting a _single_ TX queue _despite_ a multi-queue device. (Note that
> the setup was _not_ using XDP on veths as the issue is generic.)
> 
> [...]

Here is the summary with links:
  - [net-next] veth: Do not record rx queue hint in veth_xmit
    https://git.kernel.org/netdev/net-next/c/710ad98c363a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
From: Daniel Borkmann @ 2022-01-07 13:56 UTC
  To: Toshiaki Makita
  Cc: netdev, laurent.bernaille, maciej.fijalkowski, eric.dumazet,
	pabeni, john.fastabend, willemb, davem, kuba, magnus.karlsson

On 1/6/22 1:57 PM, Toshiaki Makita wrote:
> On 2022/01/06 9:46, Daniel Borkmann wrote:
> 
> Thank you for the fix.
> 
>> (unless configured in a non-default manner).
> 
> Just curious, is there any special configuration to record the rx queue index on veth after your patch?

Right now skb->queue_mapping represents the tx queue number. So assuming
dev->real_num_tx_queues == dev->real_num_rx_queues for the veth creation,
veth_xmit() picks the rx queue via rcv_priv->rq[rxq] based on the tx queue
number. So, by default veth is created with dev->real_num_tx_queues ==
dev->real_num_rx_queues == 1, in which case the queue_mapping stays 0.
Checking with e.g. ...

   ip link add foo numtxqueues 64 numrxqueues 64 type veth peer numtxqueues 64 numrxqueues 64

... then the stack inside the netns on the veth dev picks a TX queue via
netdev_core_pick_tx(). Due to the skb_record_rx_queue() / skb_get_rx_queue()
pair it is off by one in generic XDP [in hostns] at bpf_prog_run_generic_xdp()
for the case where veth has more than a single queue. (In the single-queue case
netif_get_rxqueue() will see that skb_rx_queue_recorded() is false and just
pick the first queue, so at least there it's correct.)
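
For context, the relevant veth_xmit() bits (condensed from drivers/net/veth.c
as of this patch, not the full function) are:

  rxq = skb_get_queue_mapping(skb);       /* tx queue chosen by the stack */
  if (rxq < rcv->real_num_rx_queues) {
          rq = &rcv_priv->rq[rxq];        /* doubles as the peer's rx queue */
          use_napi = rcu_access_pointer(rq->napi) &&
                     veth_skb_is_eligible_for_gro(dev, rcv, skb);
  }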

Not sure if there is a good way to detect inside bpf_prog_run_generic_xdp()
that the skb was looped to the host from a netns and then fix up the offset ..
another option could potentially be an extra device parameter, and only when
enabled would we do the skb_record_rx_queue(), so it's an explicit opt-in where
the admin needs to be aware of the potential implications.
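
Purely as a sketch of that latter idea (hypothetical, no such knob exists
today), veth_xmit() could gate the hint on a per-device flag:

  /* hypothetical opt-in, e.g. set via a new netlink attribute and
   * stored in the veth's private data
   */
  if (READ_ONCE(priv->record_rx_queue))
          skb_record_rx_queue(skb, rxq);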

My worry is that if the skb ends up being pushed out the phys dev, then 1) if
veth was provisioned with >1 TX queues, the admin must also take into account
the TX queues e.g. on phys devices like ena, so that we're not under-utilizing
them (or have something like BPF clear the queue mapping before it's
forwarded), and 2) if we do skb_record_rx_queue() and, say, the TX queue
numbers of veth and phys dev are provisioned to be the same, then we jump +1 on
the phys dev due to skb_record_rx_queue(), which may be unintentional. Hence
maybe an explicit opt-in would be the better option?
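
(Clearing the hint from BPF before forwarding would be a one-liner at tc
egress, given skb->queue_mapping is writable from tc programs on recent
kernels; minimal sketch:)

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>

  SEC("tc")
  int clear_queue_hint(struct __sk_buff *skb)
  {
          /* 0 means "not recorded", so netdev_pick_tx() selects freely */
          skb->queue_mapping = 0;
          return TC_ACT_OK;
  }

  char _license[] SEC("license") = "GPL";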

Thanks,
Daniel


* Re: [PATCH net-next] veth: Do not record rx queue hint in veth_xmit
From: Toshiaki Makita @ 2022-01-10  6:52 UTC
  To: Daniel Borkmann
  Cc: netdev, laurent.bernaille, maciej.fijalkowski, eric.dumazet,
	pabeni, john.fastabend, willemb, davem, kuba, magnus.karlsson

On 2022/01/07 22:56, Daniel Borkmann wrote:
> On 1/6/22 1:57 PM, Toshiaki Makita wrote:
>> On 2022/01/06 9:46, Daniel Borkmann wrote:
>>
>> Thank you for the fix.
>>
>>> (unless configured in a non-default manner).
>>
>> Just curious, is there any special configuration to record the rx queue index
>> on veth after your patch?
> 
> Right now skb->queue_mapping represents the tx queue number. So assuming
> dev->real_num_tx_queues == dev->real_num_rx_queues for the veth creation,
> veth_xmit() picks the rx queue via rcv_priv->rq[rxq] based on the tx queue
> number. So, by default veth is created with dev->real_num_tx_queues ==
> dev->real_num_rx_queues == 1, in which case the queue_mapping stays 0.
> Checking with e.g. ...
> 
>    ip link add foo numtxqueues 64 numrxqueues 64 type veth peer numtxqueues 64 numrxqueues 64
> 
> ... then the stack inside the netns on the veth dev picks a TX queue via
> netdev_core_pick_tx(). Due to the skb_record_rx_queue() / skb_get_rx_queue()
> pair it is off by one in generic XDP [in hostns] at bpf_prog_run_generic_xdp()
> for the case where veth has more than a single queue. (In the single-queue case
> netif_get_rxqueue() will see that skb_rx_queue_recorded() is false and just
> pick the first queue, so at least there it's correct.)
> 
> Not sure if there is a good way to detect inside bpf_prog_run_generic_xdp()
> that the skb was looped to the host from a netns and then fix up the offset ..
> another option could potentially be an extra device parameter, and only when
> enabled would we do the skb_record_rx_queue(), so it's an explicit opt-in where
> the admin needs to be aware of the potential implications.
> 
> My worry is that if the skb ends up being pushed out the phys dev, then 1) if
> veth was provisioned with >1 TX queues, the admin must also take into account
> the TX queues e.g. on phys devices like ena, so that we're not under-utilizing
> them (or have something like BPF clear the queue mapping before it's
> forwarded), and 2) if we do skb_record_rx_queue() and, say, the TX queue
> numbers of veth and phys dev are provisioned to be the same, then we jump +1 on
> the phys dev due to skb_record_rx_queue(), which may be unintentional. Hence
> maybe an explicit opt-in would be the better option?

Thank you for the detailed explanation!
I agree with you that an opt-in would be better.
I thought you might mean there was already some way to do that via "configured
in a non-default manner", which is why I asked. Now I understand we would need
an additional knob if we really want to extract the veth rx queue number from
BPF.

Toshiaki Makita

