* [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Paul Blakey @ 2022-09-21  8:50 UTC (permalink / raw)
  To: Daniel Borkmann, Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan,
	netdev, Saeed Mahameed
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni

Currently qdisc ingress handling (sch_handle_ingress()) doesn't set
a return value, so the caller (__netif_receive_skb_core()) is left
with its preset return value, which is RX drop. If the packet is
consumed by the ingress handling, the caller therefore stops and
returns this value as if the packet was dropped.

This causes a problem in the kernel TCP stack when an egress tc rule
forwards to an ingress tc rule. The TCP stack sending packets on the
device with the egress rule will see the packets as not successfully
transmitted (although they actually were), will not advance its
internal state of sent data, and packets returning on such a TCP
stream will be dropped by the TCP stack with drop reason
ack-of-unsent-data. See reproduction steps in [0] below.

Fix that by setting the return value to RX success if the packet was
handled successfully.
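
For context, the caller's flow is roughly the following (a simplified
sketch for illustration, not a verbatim copy of the source):

 static int __netif_receive_skb_core(struct sk_buff **pskb, ...)
 {
 	struct sk_buff *skb = *pskb;
 	int ret = NET_RX_DROP;		/* preset return value */
 	...
 	if (static_branch_unlikely(&ingress_needed_key)) {
 		bool another = false;

 		skb = sch_handle_ingress(skb, &pt_prev, &ret, orig_dev,
 					 &another);
 		if (another)
 			goto another_round;
 		if (!skb)
 			/* Packet consumed by ingress handling: return ret,
 			 * which stays NET_RX_DROP unless sch_handle_ingress()
 			 * updates it.
 			 */
 			goto out;
 	}
 	...
 out:
 	return ret;
 }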

[0] Reproduction steps:
 $ ip link add veth1 type veth peer name peer1
 $ ip link add veth2 type veth peer name peer2
 $ ifconfig peer1 5.5.5.6/24 up
 $ ip netns add ns0
 $ ip link set dev peer2 netns ns0
 $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
 $ ifconfig veth2 0 up
 $ ifconfig veth1 0 up

 #ingress forwarding veth1 <-> veth2
 $ tc qdisc add dev veth2 ingress
 $ tc qdisc add dev veth1 ingress
 $ tc filter add dev veth2 ingress prio 1 proto all flower \
   action mirred egress redirect dev veth1
 $ tc filter add dev veth1 ingress prio 1 proto all flower \
   action mirred egress redirect dev veth2

 #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
 $ tc qdisc add dev peer1 clsact
 $ tc filter add dev peer1 egress prio 20 proto ip flower \
   action mirred ingress redirect dev veth1

 #run iperf and see connection not running
 $ iperf3 -s&
 $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1

 #delete egress rule, and run again, now should work
 $ tc filter del dev peer1 egress
 $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1

Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 56c8b0921c9f..c58ab657b164 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5141,6 +5141,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 	case TC_ACT_QUEUED:
 	case TC_ACT_TRAP:
 		consume_skb(skb);
+		*ret = NET_RX_SUCCESS;
 		return NULL;
 	case TC_ACT_REDIRECT:
 		/* skb_mac_header check was done by cls/act_bpf, so
@@ -5153,8 +5154,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
 			*another = true;
 			break;
 		}
+		*ret = NET_RX_SUCCESS;
 		return NULL;
 	case TC_ACT_CONSUMED:
+		*ret = NET_RX_SUCCESS;
 		return NULL;
 	default:
 		break;
-- 
2.30.1



* Re: [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Daniel Borkmann @ 2022-09-21  9:11 UTC (permalink / raw)
  To: Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan, netdev, Saeed Mahameed
  Cc: Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni

On 9/21/22 10:50 AM, Paul Blakey wrote:
> Currently qdisc ingress handling (sch_handle_ingress()) doesn't set
> a return value, so the caller (__netif_receive_skb_core()) is left
> with its preset return value, which is RX drop. If the packet is
> consumed by the ingress handling, the caller therefore stops and
> returns this value as if the packet was dropped.
> 
> This causes a problem in the kernel TCP stack when an egress tc rule
> forwards to an ingress tc rule. The TCP stack sending packets on the
> device with the egress rule will see the packets as not successfully
> transmitted (although they actually were), will not advance its
> internal state of sent data, and packets returning on such a TCP
> stream will be dropped by the TCP stack with drop reason
> ack-of-unsent-data. See reproduction steps in [0] below.
> 
> Fix that by setting the return value to RX success if the packet was
> handled successfully.
> 
> [0] Reproduction steps:
>   $ ip link add veth1 type veth peer name peer1
>   $ ip link add veth2 type veth peer name peer2
>   $ ifconfig peer1 5.5.5.6/24 up
>   $ ip netns add ns0
>   $ ip link set dev peer2 netns ns0
>   $ ip netns exec ns0 ifconfig peer2 5.5.5.5/24 up
>   $ ifconfig veth2 0 up
>   $ ifconfig veth1 0 up
> 
>   #ingress forwarding veth1 <-> veth2
>   $ tc qdisc add dev veth2 ingress
>   $ tc qdisc add dev veth1 ingress
>   $ tc filter add dev veth2 ingress prio 1 proto all flower \
>     action mirred egress redirect dev veth1
>   $ tc filter add dev veth1 ingress prio 1 proto all flower \
>     action mirred egress redirect dev veth2
> 
>   #steal packet from peer1 egress to veth2 ingress, bypassing the veth pipe
>   $ tc qdisc add dev peer1 clsact
>   $ tc filter add dev peer1 egress prio 20 proto ip flower \
>     action mirred ingress redirect dev veth1
> 
>   #run iperf and see connection not running
>   $ iperf3 -s&
>   $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
> 
>   #delete egress rule, and run again, now should work
>   $ tc filter del dev peer1 egress
>   $ ip netns exec ns0 iperf3 -c 5.5.5.6 -i 1
> 
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Paul Blakey <paulb@nvidia.com>
> ---
>   net/core/dev.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 56c8b0921c9f..c58ab657b164 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5141,6 +5141,7 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>   	case TC_ACT_QUEUED:
>   	case TC_ACT_TRAP:
>   		consume_skb(skb);
> +		*ret = NET_RX_SUCCESS;
>   		return NULL;
>   	case TC_ACT_REDIRECT:
>   		/* skb_mac_header check was done by cls/act_bpf, so
> @@ -5153,8 +5154,10 @@ sch_handle_ingress(struct sk_buff *skb, struct packet_type **pt_prev, int *ret,
>   			*another = true;
>   			break;
>   		}
> +		*ret = NET_RX_SUCCESS;
>   		return NULL;
>   	case TC_ACT_CONSUMED:
> +		*ret = NET_RX_SUCCESS;
>   		return NULL;
>   	default:

Looks reasonable and aligns with sch_handle_egress() fwiw. I think your Fixes tag is wrong
since that commit didn't modify any of the above. This patch should also rather go to net-next
tree to make sure it has enough soak time to catch potential regressions from this change in
behavior. Given the change under TC_ACT_REDIRECT is BPF specific, please also add a BPF selftest
for tc BPF program to assert the new behavior so we can run it in our BPF CI for every patch.
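
For illustration, a minimal tc BPF program along those lines could look
roughly like the sketch below (the section name, the rodata constant and
the bpf_redirect()/BPF_F_INGRESS usage are assumptions, not the actual
selftest):

 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>

 /* Set by the userspace part of the test before the program is attached. */
 volatile const __u32 peer_ifindex = 0;

 SEC("tc")
 int tc_redirect_to_ingress(struct __sk_buff *skb)
 {
 	/* Redirect the packet to the ingress path of peer_ifindex; the test
 	 * would then assert that the sender observes a successful transmit
 	 * rather than a drop.
 	 */
 	return bpf_redirect(peer_ifindex, BPF_F_INGRESS);
 }

 char _license[] SEC("license") = "GPL";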

Thanks,
Daniel


* Re: [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Jakub Kicinski @ 2022-09-21 14:48 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan, netdev,
	Saeed Mahameed, Eric Dumazet, David S. Miller, Paolo Abeni

On Wed, 21 Sep 2022 11:11:03 +0200 Daniel Borkmann wrote:
> On 9/21/22 10:50 AM, Paul Blakey wrote:
> > [...]
> > 
> > Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> > Signed-off-by: Paul Blakey <paulb@nvidia.com>
>
> Looks reasonable and aligns with sch_handle_egress() fwiw. I think your Fixes tag is wrong
> since that commit didn't modify any of the above. This patch should also rather go to net-next
> tree to make sure it has enough soak time to catch potential regressions from this change in
> behavior.

I don't think we do "soak time" in networking. Perhaps we can try 
to use the "CC: stable # after 4 weeks" delay mechanism which Greg
promised at LPC?

> Given the change under TC_ACT_REDIRECT is BPF specific, please also add a BPF selftest
> for tc BPF program to assert the new behavior so we can run it in our BPF CI for every patch.

And it would be great to turn the commands from the commit message
into a selftest as well :S


* Re: [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Daniel Borkmann @ 2022-09-22 14:47 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan, netdev,
	Saeed Mahameed, Eric Dumazet, David S. Miller, Paolo Abeni

On 9/21/22 4:48 PM, Jakub Kicinski wrote:
> On Wed, 21 Sep 2022 11:11:03 +0200 Daniel Borkmann wrote:
>> On 9/21/22 10:50 AM, Paul Blakey wrote:
>>> [...]
>>>
>>> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
>>> Signed-off-by: Paul Blakey <paulb@nvidia.com>
>>
>> Looks reasonable and aligns with sch_handle_egress() fwiw. I think your Fixes tag is wrong
>> since that commit didn't modify any of the above. This patch should also rather go to net-next
>> tree to make sure it has enough soak time to catch potential regressions from this change in
>> behavior.
> 
> I don't think we do "soak time" in networking. Perhaps we can try
> to use the "CC: stable # after 4 weeks" delay mechanism which Greg
> promised at LPC?

Isn't that implicit? If the commit has a Fixes tag and lands in net-next, the stable
team automatically pulls it anyway once everything lands in Linus' tree via the merge
window, and then does the backporting for stable.

>> Given the change under TC_ACT_REDIRECT is BPF specific, please also add a BPF selftest
>> for tc BPF program to assert the new behavior so we can run it in our BPF CI for every patch.
> 
> And it would be great to turn the commands from the commit message
> into a selftest as well :S


* Re: [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Jakub Kicinski @ 2022-09-22 15:23 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan, netdev,
	Saeed Mahameed, Eric Dumazet, David S. Miller, Paolo Abeni

On Thu, 22 Sep 2022 16:47:14 +0200 Daniel Borkmann wrote:
> >> Looks reasonable and aligns with sch_handle_egress() fwiw. I think your Fixes tag is wrong
> >> since that commit didn't modify any of the above. This patch should also rather go to net-next
> >> tree to make sure it has enough soak time to catch potential regressions from this change in
> >> behavior.  
> > 
> > I don't think we do "soak time" in networking. Perhaps we can try
> > to use the "CC: stable # after 4 weeks" delay mechanism which Greg
> > promised at LPC?  
> 
> Isn't that implicit? If the commit has Fixes tag and lands in net-next, stable team
> anyway automatically pulls it once everything lands in Linus' tree via merge win and
> then does the backporting for stable.

What I meant is we don't merge fixes into net-next directly.
Perhaps that's my personal view, not shared by other netdev maintainers.

To me the 8-rc release process is fairly arbitrary timing-wise.
Fixes continue flowing in after Linus cuts the final release, and only
after a few stable releases does the kernel make it to a wide audience.

Putting a fix in -next gives us anywhere between 0 and 8 weeks of delay.
An explicit delay on the tag seems much more precise and independent of
where we are in the release cycle.

The cases where we put something in -next and it later becomes urgent,
but we can't get it into stable, stand out in my memory much more than
problems introduced late in the rc cycle.


* Re: [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Jakub Kicinski @ 2022-09-22 21:06 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan, netdev,
	Saeed Mahameed, Eric Dumazet, David S. Miller, Paolo Abeni

On Thu, 22 Sep 2022 08:23:49 -0700 Jakub Kicinski wrote:
> What I meant is we don't merge fixes into net-next directly.
> Perhaps that's my personal view, not shared by other netdev maintainers.

FWIW I've seen Eric posting a fix against net-next today 
so I may indeed be the only one who thinks this way :)


* Re: [PATCH net 1/1] net: Fix return value of qdisc ingress handling on success
From: Eric Dumazet @ 2022-09-22 23:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Paul Blakey, Vlad Buslov, Oz Shlomo, Roi Dayan,
	netdev, Saeed Mahameed, David S. Miller, Paolo Abeni

On Thu, Sep 22, 2022 at 2:06 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 22 Sep 2022 08:23:49 -0700 Jakub Kicinski wrote:
> > What I meant is we don't merge fixes into net-next directly.
> > Perhaps that's my personal view, not shared by other netdev maintainers.
>
> FWIW I've seen Eric posting a fix against net-next today
> so I may indeed be the only one who thinks this way :)

If you are referring to the TCP fix, that is because the blamed patch is
only in net-next.

These are very specific changes that are unlikely to cause real issues.
Only packetdrill tests were unhappy with the glitches.
