netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
@ 2024-03-14 11:17 renmingshuai
  2024-03-14 11:43 ` Jiri Pirko
  2024-03-14 17:14 ` Jamal Hadi Salim
  0 siblings, 2 replies; 24+ messages in thread
From: renmingshuai @ 2024-03-14 11:17 UTC (permalink / raw)
  To: jhs, xiyou.wangcong, jiri, davem, vladbu
  Cc: netdev, yanan, liaichun, caowangbao, renmingshuai

As we all know the mirred action is used to mirroring or redirecting the
packet it receives. Howerver, add mirred action to a filter attached to
a egress qdisc might cause a deadlock. To reproduce the problem, perform
the following steps:
(1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
(2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
     action police rate 100mbit burst 12m conform-exceed jump 1 \
     / pipe mirred egress redirect dev eth2 action drop

The stack is show as below:
[28848.883915]  _raw_spin_lock+0x1e/0x30
[28848.884367]  __dev_queue_xmit+0x160/0x850
[28848.884851]  ? 0xffffffffc031906a
[28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
[28848.885863]  tcf_action_exec.part.0+0x88/0x130
[28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
[28848.886970]  ? dequeue_entity+0x145/0x9e0
[28848.887464]  ? newidle_balance+0x23f/0x2f0
[28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
[28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
[28848.889137]  ? __flush_work.isra.0+0x35/0x80
[28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
[28848.890293]  ? do_select+0x637/0x870
[28848.890735]  tcf_classify+0x52/0xf0
[28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
[28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
[28848.892251]  __dev_queue_xmit+0x2d8/0x850
[28848.892738]  ? nf_hook_slow+0x3c/0xb0
[28848.893198]  ip_finish_output2+0x272/0x580
[28848.893692]  __ip_queue_xmit+0x193/0x420
[28848.894179]  __tcp_transmit_skb+0x8cc/0x970

In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
before the egress packets are mirred, and it will attempt to obtain the
spin lock again after packets are mirred, which cause a deadlock.

Fix the issue by forbidding assigning mirred action to a filter attached
to the egress.

Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
---
 net/sched/act_mirred.c                        |  4 +++
 .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5b3814365924..fc96705285fb 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 		NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
 		return -EINVAL;
 	}
+	if (tp->chain->block->q->parent != TC_H_INGRESS) {
+		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
+		return -EINVAL;
+	}
 	ret = nla_parse_nested_deprecated(tb, TCA_MIRRED_MAX, nla,
 					  mirred_policy, extack);
 	if (ret < 0)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
index b73bd255ea36..50c6153bf34c 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
@@ -1052,5 +1052,37 @@
             "$TC qdisc del dev $DEV1 ingress_block 21 clsact",
             "$TC actions flush action mirred"
         ]
+    },
+    {
+        "id": "fdda",
+        "name": "Add mirred mirror to the filter which attached to engress",
+        "category": [
+            "actions",
+            "mirred"
+        ],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            [
+                "$TC actions flush action mirred",
+                0,
+                1,
+                255
+            ],
+            [
+                "$TC qdisc add dev $DEV1 root handle 1: htb default 1",
+                0
+            ]
+        ],
+        "cmdUnderTest": "$TC filter add dev $DEV1 protocol ip u32 match ip protocol 1 0xff action mirred egress mirror dev $DEV1",
+        "expExitCode": "2",
+        "verifyCmd": "$TC action list action mirred",
+        "matchPattern": "^[ \t]+index [0-9]+ ref",
+        "matchCount": "0",
+        "teardown": [
+            "$TC qdisc del dev $DEV1 root handle 1: htb default 1",
+            "$TC actions flush action mirred"
+        ]
     }
 ]
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 11:17 [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress renmingshuai
@ 2024-03-14 11:43 ` Jiri Pirko
  2024-03-14 14:04   ` renmingshuai
  2024-03-14 17:14 ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread
From: Jiri Pirko @ 2024-03-14 11:43 UTC (permalink / raw)
  To: renmingshuai
  Cc: jhs, xiyou.wangcong, davem, vladbu, netdev, yanan, liaichun, caowangbao

Thu, Mar 14, 2024 at 12:17:13PM CET, renmingshuai@huawei.com wrote:
>As we all know the mirred action is used to mirroring or redirecting the
>packet it receives. Howerver, add mirred action to a filter attached to
>a egress qdisc might cause a deadlock. To reproduce the problem, perform
>the following steps:
>(1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
>(2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
>     action police rate 100mbit burst 12m conform-exceed jump 1 \
>     / pipe mirred egress redirect dev eth2 action drop
>
>The stack is show as below:
>[28848.883915]  _raw_spin_lock+0x1e/0x30
>[28848.884367]  __dev_queue_xmit+0x160/0x850
>[28848.884851]  ? 0xffffffffc031906a
>[28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
>[28848.885863]  tcf_action_exec.part.0+0x88/0x130
>[28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
>[28848.886970]  ? dequeue_entity+0x145/0x9e0
>[28848.887464]  ? newidle_balance+0x23f/0x2f0
>[28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
>[28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
>[28848.889137]  ? __flush_work.isra.0+0x35/0x80
>[28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
>[28848.890293]  ? do_select+0x637/0x870
>[28848.890735]  tcf_classify+0x52/0xf0
>[28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
>[28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
>[28848.892251]  __dev_queue_xmit+0x2d8/0x850
>[28848.892738]  ? nf_hook_slow+0x3c/0xb0
>[28848.893198]  ip_finish_output2+0x272/0x580
>[28848.893692]  __ip_queue_xmit+0x193/0x420
>[28848.894179]  __tcp_transmit_skb+0x8cc/0x970
>
>In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
>before the egress packets are mirred, and it will attempt to obtain the
>spin lock again after packets are mirred, which cause a deadlock.
>
>Fix the issue by forbidding assigning mirred action to a filter attached
>to the egress.
>
>Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>

You are missing "Fixes" tag.


>---
> net/sched/act_mirred.c                        |  4 +++
> .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> 2 files changed, 36 insertions(+)
>
>diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>index 5b3814365924..fc96705285fb 100644
>--- a/net/sched/act_mirred.c
>+++ b/net/sched/act_mirred.c
>@@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> 		NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> 		return -EINVAL;
> 	}
>+	if (tp->chain->block->q->parent != TC_H_INGRESS) {
>+		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");

Hmm, that is quite restrictive. I'm pretty sure you would break some
valid usecases.


>+		return -EINVAL;
>+	}
> 	ret = nla_parse_nested_deprecated(tb, TCA_MIRRED_MAX, nla,
> 					  mirred_policy, extack);
> 	if (ret < 0)
>diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
>index b73bd255ea36..50c6153bf34c 100644
>--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
>+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
>@@ -1052,5 +1052,37 @@
>             "$TC qdisc del dev $DEV1 ingress_block 21 clsact",
>             "$TC actions flush action mirred"
>         ]
>+    },
>+    {
>+        "id": "fdda",
>+        "name": "Add mirred mirror to the filter which attached to engress",
>+        "category": [
>+            "actions",
>+            "mirred"
>+        ],
>+        "plugins": {
>+            "requires": "nsPlugin"
>+        },
>+        "setup": [
>+            [
>+                "$TC actions flush action mirred",
>+                0,
>+                1,
>+                255
>+            ],
>+            [
>+                "$TC qdisc add dev $DEV1 root handle 1: htb default 1",
>+                0
>+            ]
>+        ],
>+        "cmdUnderTest": "$TC filter add dev $DEV1 protocol ip u32 match ip protocol 1 0xff action mirred egress mirror dev $DEV1",
>+        "expExitCode": "2",
>+        "verifyCmd": "$TC action list action mirred",
>+        "matchPattern": "^[ \t]+index [0-9]+ ref",
>+        "matchCount": "0",
>+        "teardown": [
>+            "$TC qdisc del dev $DEV1 root handle 1: htb default 1",
>+            "$TC actions flush action mirred"
>+        ]
>     }
> ]
>-- 
>2.33.0
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 11:43 ` Jiri Pirko
@ 2024-03-14 14:04   ` renmingshuai
  2024-03-14 14:47     ` Pedro Tammela
  0 siblings, 1 reply; 24+ messages in thread
From: renmingshuai @ 2024-03-14 14:04 UTC (permalink / raw)
  To: jiri
  Cc: caowangbao, davem, jhs, liaichun, netdev, renmingshuai, vladbu,
	xiyou.wangcong, yanan

>>---
>> net/sched/act_mirred.c                        |  4 +++
>> .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
>> 2 files changed, 36 insertions(+)
>>
>>diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>>index 5b3814365924..fc96705285fb 100644
>>--- a/net/sched/act_mirred.c
>>+++ b/net/sched/act_mirred.c
>>@@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>> 		NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
>> 		return -EINVAL;
>> 	}
>>+	if (tp->chain->block->q->parent != TC_H_INGRESS) {
>>+		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
>
>Hmm, that is quite restrictive. I'm pretty sure you would break some
>valid usecases.
Hmm, that is really quite restrictive. It might be better to Forbid mirred attached to egress filter
to mirror or redirect packets to the egress.

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index fc96705285fb..ab3841470992 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -152,6 +152,11 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
                return -EINVAL;
        }

+       if ((tp->chain->block->q->parent != TC_H_INGRESS) && (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
+               NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
+               return -EINVAL;
+       }
+
        switch (parm->eaction) {
        case TCA_EGRESS_MIRROR:
        case TCA_EGRESS_REDIR:

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 14:04   ` renmingshuai
@ 2024-03-14 14:47     ` Pedro Tammela
  0 siblings, 0 replies; 24+ messages in thread
From: Pedro Tammela @ 2024-03-14 14:47 UTC (permalink / raw)
  To: renmingshuai, jiri
  Cc: caowangbao, davem, jhs, liaichun, netdev, vladbu, xiyou.wangcong, yanan

On 14/03/2024 11:04, renmingshuai wrote:
>>> ---
>>> net/sched/act_mirred.c                        |  4 +++
>>> .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
>>> 2 files changed, 36 insertions(+)
>>>
>>> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>>> index 5b3814365924..fc96705285fb 100644
>>> --- a/net/sched/act_mirred.c
>>> +++ b/net/sched/act_mirred.c
>>> @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>>> 		NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
>>> 		return -EINVAL;
>>> 	}
>>> +	if (tp->chain->block->q->parent != TC_H_INGRESS) {
>>> +		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
>>
>> Hmm, that is quite restrictive. I'm pretty sure you would break some
>> valid usecases.
> Hmm, that is really quite restrictive. It might be better to Forbid mirred attached to egress filter
> to mirror or redirect packets to the egress.
> 
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index fc96705285fb..ab3841470992 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -152,6 +152,11 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>                  return -EINVAL;
>          }
> 
> +       if ((tp->chain->block->q->parent != TC_H_INGRESS) && (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
> +               NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
> +               return -EINVAL;
> +       }
> +
>          switch (parm->eaction) {
>          case TCA_EGRESS_MIRROR:
>          case TCA_EGRESS_REDIR:


Are you sure this happens to egress mirred to netdevs other than the one 
used to attach the filter?
It seems like in your example you are forcing mirred egress to the same 
netdev as the filter

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 11:17 [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress renmingshuai
  2024-03-14 11:43 ` Jiri Pirko
@ 2024-03-14 17:14 ` Jamal Hadi Salim
  2024-03-15  1:56   ` renmingshuai
  2024-03-17 16:10   ` Jamal Hadi Salim
  1 sibling, 2 replies; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-14 17:14 UTC (permalink / raw)
  To: renmingshuai
  Cc: xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao

On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
>
> As we all know the mirred action is used to mirroring or redirecting the
> packet it receives. Howerver, add mirred action to a filter attached to
> a egress qdisc might cause a deadlock. To reproduce the problem, perform
> the following steps:
> (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
>      action police rate 100mbit burst 12m conform-exceed jump 1 \
>      / pipe mirred egress redirect dev eth2 action drop
>

I think you meant both to be the same device eth0 or eth2?

> The stack is show as below:
> [28848.883915]  _raw_spin_lock+0x1e/0x30
> [28848.884367]  __dev_queue_xmit+0x160/0x850
> [28848.884851]  ? 0xffffffffc031906a
> [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> [28848.886970]  ? dequeue_entity+0x145/0x9e0
> [28848.887464]  ? newidle_balance+0x23f/0x2f0
> [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> [28848.890293]  ? do_select+0x637/0x870
> [28848.890735]  tcf_classify+0x52/0xf0
> [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> [28848.893198]  ip_finish_output2+0x272/0x580
> [28848.893692]  __ip_queue_xmit+0x193/0x420
> [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
>
> In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> before the egress packets are mirred, and it will attempt to obtain the
> spin lock again after packets are mirred, which cause a deadlock.
>
> Fix the issue by forbidding assigning mirred action to a filter attached
> to the egress.
>
> Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> ---
>  net/sched/act_mirred.c                        |  4 +++
>  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
>  2 files changed, 36 insertions(+)
>
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 5b3814365924..fc96705285fb 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
>                 return -EINVAL;
>         }
> +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> +               return -EINVAL;
> +       }

Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
am almost certain this used to work in the old days.

cheers,
jamal

PS:- thanks for the tdc test, you are a hero just for submitting that!



>         ret = nla_parse_nested_deprecated(tb, TCA_MIRRED_MAX, nla,
>                                           mirred_policy, extack);
>         if (ret < 0)
> diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
> index b73bd255ea36..50c6153bf34c 100644
> --- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
> +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
> @@ -1052,5 +1052,37 @@
>              "$TC qdisc del dev $DEV1 ingress_block 21 clsact",
>              "$TC actions flush action mirred"
>          ]
> +    },
> +    {
> +        "id": "fdda",
> +        "name": "Add mirred mirror to the filter which attached to engress",
> +        "category": [
> +            "actions",
> +            "mirred"
> +        ],
> +        "plugins": {
> +            "requires": "nsPlugin"
> +        },
> +        "setup": [
> +            [
> +                "$TC actions flush action mirred",
> +                0,
> +                1,
> +                255
> +            ],
> +            [
> +                "$TC qdisc add dev $DEV1 root handle 1: htb default 1",
> +                0
> +            ]
> +        ],
> +        "cmdUnderTest": "$TC filter add dev $DEV1 protocol ip u32 match ip protocol 1 0xff action mirred egress mirror dev $DEV1",
> +        "expExitCode": "2",
> +        "verifyCmd": "$TC action list action mirred",
> +        "matchPattern": "^[ \t]+index [0-9]+ ref",
> +        "matchCount": "0",
> +        "teardown": [
> +            "$TC qdisc del dev $DEV1 root handle 1: htb default 1",
> +            "$TC actions flush action mirred"
> +        ]
>      }
>  ]
> --
> 2.33.0
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 17:14 ` Jamal Hadi Salim
@ 2024-03-15  1:56   ` renmingshuai
  2024-03-15 22:34     ` Jamal Hadi Salim
  2024-03-17 16:10   ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread
From: renmingshuai @ 2024-03-15  1:56 UTC (permalink / raw)
  To: jhs
  Cc: caowangbao, davem, jiri, liaichun, netdev, renmingshuai, vladbu,
	xiyou.wangcong, yanan

>> As we all know the mirred action is used to mirroring or redirecting the
>> packet it receives. Howerver, add mirred action to a filter attached to
>> a egress qdisc might cause a deadlock. To reproduce the problem, perform
>> the following steps:
>> (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
>> (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
>>      action police rate 100mbit burst 12m conform-exceed jump 1 \
>>      / pipe mirred egress redirect dev eth2 action drop
>>
>
>I think you meant both to be the same device eth0 or eth2?

Sorry, a careless mistake. eth2 in step 2 should be eth0.
(2)tc filter add dev eth0 protocol ip prio 2 flower verbose \
     action police rate 100mbit burst 12m conform-exceed jump 1 \
     / pipe mirred egress redirect dev eth0 action drop

>> The stack is show as below:
>> [28848.883915]  _raw_spin_lock+0x1e/0x30
>> [28848.884367]  __dev_queue_xmit+0x160/0x850
>> [28848.884851]  ? 0xffffffffc031906a
>> [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
>> [28848.885863]  tcf_action_exec.part.0+0x88/0x130
>> [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
>> [28848.886970]  ? dequeue_entity+0x145/0x9e0
>> [28848.887464]  ? newidle_balance+0x23f/0x2f0
>> [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
>> [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
>> [28848.889137]  ? __flush_work.isra.0+0x35/0x80
>> [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
>> [28848.890293]  ? do_select+0x637/0x870
>> [28848.890735]  tcf_classify+0x52/0xf0
>> [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
>> [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
>> [28848.892251]  __dev_queue_xmit+0x2d8/0x850
>> [28848.892738]  ? nf_hook_slow+0x3c/0xb0
>> [28848.893198]  ip_finish_output2+0x272/0x580
>> [28848.893692]  __ip_queue_xmit+0x193/0x420
>> [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
>>
>> In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
>> before the egress packets are mirred, and it will attempt to obtain the
>> spin lock again after packets are mirred, which cause a deadlock.
>>
>> Fix the issue by forbidding assigning mirred action to a filter attached
>> to the egress.
>>
>> Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
>> ---
>>  net/sched/act_mirred.c                        |  4 +++
>>  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
>>  2 files changed, 36 insertions(+)
>>
>> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>> index 5b3814365924..fc96705285fb 100644
>> --- a/net/sched/act_mirred.c
>> +++ b/net/sched/act_mirred.c
>> @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>>                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
>>                 return -EINVAL;
>>         }
>> +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
>> +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
>> +               return -EINVAL;
>> +       }
>
>Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
>am almost certain this used to work in the old days.
>
>cheers,
>jamal
>
>PS:- thanks for the tdc test, you are a hero just for submitting that!

As Jiri said, that is really quite restrictive. It might be better to forbid mirred attached to egress filter
to mirror or redirect packets to the egress. Just like:

--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -152,6 +152,12 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
                return -EINVAL;
        }

+       if ((tp->chain->block->q->parent != TC_H_INGRESS) &&
+           (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
+               NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
+               return -EINVAL;
+       }
+
        switch (parm->eaction) {
        case TCA_EGRESS_MIRROR:
        case TCA_EGRESS_REDIR:

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-15  1:56   ` renmingshuai
@ 2024-03-15 22:34     ` Jamal Hadi Salim
  0 siblings, 0 replies; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-15 22:34 UTC (permalink / raw)
  To: renmingshuai
  Cc: caowangbao, davem, jiri, liaichun, netdev, vladbu, xiyou.wangcong, yanan

On Thu, Mar 14, 2024 at 10:10 PM renmingshuai <renmingshuai@huawei.com> wrote:
>
> >> As we all know the mirred action is used to mirroring or redirecting the
> >> packet it receives. Howerver, add mirred action to a filter attached to
> >> a egress qdisc might cause a deadlock. To reproduce the problem, perform
> >> the following steps:
> >> (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> >> (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> >>      action police rate 100mbit burst 12m conform-exceed jump 1 \
> >>      / pipe mirred egress redirect dev eth2 action drop
> >>
> >
> >I think you meant both to be the same device eth0 or eth2?
>
> Sorry, a careless mistake. eth2 in step 2 should be eth0.
> (2)tc filter add dev eth0 protocol ip prio 2 flower verbose \
>      action police rate 100mbit burst 12m conform-exceed jump 1 \
>      / pipe mirred egress redirect dev eth0 action drop
>
> >> The stack is show as below:
> >> [28848.883915]  _raw_spin_lock+0x1e/0x30
> >> [28848.884367]  __dev_queue_xmit+0x160/0x850
> >> [28848.884851]  ? 0xffffffffc031906a
> >> [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> >> [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> >> [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> >> [28848.886970]  ? dequeue_entity+0x145/0x9e0
> >> [28848.887464]  ? newidle_balance+0x23f/0x2f0
> >> [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> >> [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> >> [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> >> [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> >> [28848.890293]  ? do_select+0x637/0x870
> >> [28848.890735]  tcf_classify+0x52/0xf0
> >> [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> >> [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> >> [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> >> [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> >> [28848.893198]  ip_finish_output2+0x272/0x580
> >> [28848.893692]  __ip_queue_xmit+0x193/0x420
> >> [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> >>
> >> In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> >> before the egress packets are mirred, and it will attempt to obtain the
> >> spin lock again after packets are mirred, which cause a deadlock.
> >>
> >> Fix the issue by forbidding assigning mirred action to a filter attached
> >> to the egress.
> >>
> >> Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> >> ---
> >>  net/sched/act_mirred.c                        |  4 +++
> >>  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> >>  2 files changed, 36 insertions(+)
> >>
> >> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> >> index 5b3814365924..fc96705285fb 100644
> >> --- a/net/sched/act_mirred.c
> >> +++ b/net/sched/act_mirred.c
> >> @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> >>                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> >>                 return -EINVAL;
> >>         }
> >> +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> >> +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> >> +               return -EINVAL;
> >> +       }
> >
> >Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> >am almost certain this used to work in the old days.
> >

> >cheers,
> >jamal
> >
> >PS:- thanks for the tdc test, you are a hero just for submitting that!
>
> As Jiri said, that is really quite restrictive. It might be better to forbid mirred attached to egress filter
> to mirror or redirect packets to the egress. Just like:
>

Please, no. You are continuing to restrict many valid use cases. We'll
look at it this weekend.

cheers,
jamal

> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -152,6 +152,12 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>                 return -EINVAL;
>         }
>
> +       if ((tp->chain->block->q->parent != TC_H_INGRESS) &&
> +           (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
> +               NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
> +               return -EINVAL;
> +       }
> +
>         switch (parm->eaction) {
>         case TCA_EGRESS_MIRROR:
>         case TCA_EGRESS_REDIR:
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 17:14 ` Jamal Hadi Salim
  2024-03-15  1:56   ` renmingshuai
@ 2024-03-17 16:10   ` Jamal Hadi Salim
  2024-03-18 14:27     ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-17 16:10 UTC (permalink / raw)
  To: renmingshuai
  Cc: xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao

On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> >
> > As we all know the mirred action is used to mirroring or redirecting the
> > packet it receives. Howerver, add mirred action to a filter attached to
> > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > the following steps:
> > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> >      / pipe mirred egress redirect dev eth2 action drop
> >
>
> I think you meant both to be the same device eth0 or eth2?
>
> > The stack is show as below:
> > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > [28848.884851]  ? 0xffffffffc031906a
> > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > [28848.890293]  ? do_select+0x637/0x870
> > [28848.890735]  tcf_classify+0x52/0xf0
> > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > [28848.893198]  ip_finish_output2+0x272/0x580
> > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> >
> > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > before the egress packets are mirred, and it will attempt to obtain the
> > spin lock again after packets are mirred, which cause a deadlock.
> >
> > Fix the issue by forbidding assigning mirred action to a filter attached
> > to the egress.
> >
> > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > ---
> >  net/sched/act_mirred.c                        |  4 +++
> >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> >  2 files changed, 36 insertions(+)
> >
> > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > index 5b3814365924..fc96705285fb 100644
> > --- a/net/sched/act_mirred.c
> > +++ b/net/sched/act_mirred.c
> > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> >                 return -EINVAL;
> >         }
> > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > +               return -EINVAL;
> > +       }
>
> Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> am almost certain this used to work in the old days.

Ok, i looked at old notes - it did work at "some point" pre-tdc.
Conclusion is things broke around this time frame:
https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/

Looking further into it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-17 16:10   ` Jamal Hadi Salim
@ 2024-03-18 14:27     ` Jamal Hadi Salim
  2024-03-18 15:46       ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-18 14:27 UTC (permalink / raw)
  To: renmingshuai
  Cc: xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun,
	caowangbao, Eric Dumazet, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Victor Nogueira, Pedro Tammela

On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > >
> > > As we all know the mirred action is used to mirroring or redirecting the
> > > packet it receives. Howerver, add mirred action to a filter attached to
> > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > the following steps:
> > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > >      / pipe mirred egress redirect dev eth2 action drop
> > >
> >
> > I think you meant both to be the same device eth0 or eth2?
> >
> > > The stack is show as below:
> > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > [28848.884851]  ? 0xffffffffc031906a
> > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > [28848.890293]  ? do_select+0x637/0x870
> > > [28848.890735]  tcf_classify+0x52/0xf0
> > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > >
> > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > before the egress packets are mirred, and it will attempt to obtain the
> > > spin lock again after packets are mirred, which cause a deadlock.
> > >
> > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > to the egress.
> > >
> > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > ---
> > >  net/sched/act_mirred.c                        |  4 +++
> > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > >  2 files changed, 36 insertions(+)
> > >
> > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > index 5b3814365924..fc96705285fb 100644
> > > --- a/net/sched/act_mirred.c
> > > +++ b/net/sched/act_mirred.c
> > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > >                 return -EINVAL;
> > >         }
> > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > +               return -EINVAL;
> > > +       }
> >
> > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > am almost certain this used to work in the old days.
>
> Ok, i looked at old notes - it did work at "some point" pre-tdc.
> Conclusion is things broke around this time frame:
> https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
>
> Looking further into it.

This is what we came up with. Eric, please take a look...

cheers,
jamal


--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
*skb, struct Qdisc *q,
        if (unlikely(contended))
                spin_lock(&q->busylock);

+       if (dev_recursion_level()) {
+               rc = NET_XMIT_DROP;
+               __qdisc_drop(skb, &to_free);
+               goto free;
+       }
+
        spin_lock(root_lock);
+       dev_xmit_recursion_inc();
        if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                __qdisc_drop(skb, &to_free);
                rc = NET_XMIT_DROP;
@@ -3824,8 +3831,10 @@ static inline int __dev_xmit_skb(struct sk_buff
*skb, struct Qdisc *q,
                        qdisc_run_end(q);
                }
        }
+       dev_xmit_recursion_dec();
        spin_unlock(root_lock);
        if (unlikely(to_free))
+free:
                kfree_skb_list_reason(to_free,
                                      tcf_get_drop_reason(to_free));
        if (unlikely(contended))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 14:27     ` Jamal Hadi Salim
@ 2024-03-18 15:46       ` Eric Dumazet
  2024-03-18 17:36         ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-18 15:46 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela

On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > >
> > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > the following steps:
> > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > >      / pipe mirred egress redirect dev eth2 action drop
> > > >
> > >
> > > I think you meant both to be the same device eth0 or eth2?
> > >
> > > > The stack is show as below:
> > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > [28848.884851]  ? 0xffffffffc031906a
> > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > [28848.890293]  ? do_select+0x637/0x870
> > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > >
> > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > spin lock again after packets are mirred, which cause a deadlock.
> > > >
> > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > to the egress.
> > > >
> > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > ---
> > > >  net/sched/act_mirred.c                        |  4 +++
> > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > >  2 files changed, 36 insertions(+)
> > > >
> > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > index 5b3814365924..fc96705285fb 100644
> > > > --- a/net/sched/act_mirred.c
> > > > +++ b/net/sched/act_mirred.c
> > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > >                 return -EINVAL;
> > > >         }
> > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > +               return -EINVAL;
> > > > +       }
> > >
> > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > am almost certain this used to work in the old days.
> >
> > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > Conclusion is things broke around this time frame:
> > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> >
> > Looking further into it.
>
> This is what we came up with. Eric, please take a look...
>
> cheers,
> jamal
>
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>         if (unlikely(contended))
>                 spin_lock(&q->busylock);
>
> +       if (dev_recursion_level()) {

I am not sure what your intent is, but this seems wrong to me.

Some valid setup use :

A bonding device, with HTB qdisc (or other qdisc)
  (This also could be a tunnel device with a qdisc)

-> one or multiple physical NIC, wth FQ or other qdisc.

Packets would be dropped here when we try to reach the physical device.

> +               rc = NET_XMIT_DROP;
> +               __qdisc_drop(skb, &to_free);
> +               goto free;
> +       }
> +
>         spin_lock(root_lock);
> +       dev_xmit_recursion_inc();
>         if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>                 __qdisc_drop(skb, &to_free);
>                 rc = NET_XMIT_DROP;
> @@ -3824,8 +3831,10 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>                         qdisc_run_end(q);
>                 }
>         }
> +       dev_xmit_recursion_dec();
>         spin_unlock(root_lock);
>         if (unlikely(to_free))
> +free:
>                 kfree_skb_list_reason(to_free,
>                                       tcf_get_drop_reason(to_free));
>         if (unlikely(contended))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 15:46       ` Eric Dumazet
@ 2024-03-18 17:36         ` Jamal Hadi Salim
  2024-03-18 19:11           ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-18 17:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela

On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > >
> > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > the following steps:
> > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > >
> > > >
> > > > I think you meant both to be the same device eth0 or eth2?
> > > >
> > > > > The stack is show as below:
> > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > >
> > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > >
> > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > to the egress.
> > > > >
> > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > ---
> > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > >  2 files changed, 36 insertions(+)
> > > > >
> > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > index 5b3814365924..fc96705285fb 100644
> > > > > --- a/net/sched/act_mirred.c
> > > > > +++ b/net/sched/act_mirred.c
> > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > >                 return -EINVAL;
> > > > >         }
> > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > +               return -EINVAL;
> > > > > +       }
> > > >
> > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > am almost certain this used to work in the old days.
> > >
> > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > Conclusion is things broke around this time frame:
> > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > >
> > > Looking further into it.
> >
> > This is what we came up with. Eric, please take a look...
> >
> > cheers,
> > jamal
> >
> >
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> >         if (unlikely(contended))
> >                 spin_lock(&q->busylock);
> >
> > +       if (dev_recursion_level()) {
>
> I am not sure what your intent is, but this seems wrong to me.
>

There is a deadlock if you reenter the same device which has a qdisc
attached to it more than once.
Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
run some action which requires it to grab the root qdisc lock (again).
This is easy to show with mirred (although i am wondering if syzkaller
may have produced this at some point)..
$TC qdisc add dev $DEV root handle 1: htb default 1
$TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
action mirred egress mirror dev $DEV

Above example is essentially egress $DEV-> egress $DEV in both cases
"egress $DEV" grabs the root qdisc lock. You could also create another
example with egress($DEV1->$DEV2->back to $DEV1).

> Some valid setup use :
>
> A bonding device, with HTB qdisc (or other qdisc)
>   (This also could be a tunnel device with a qdisc)
>
> -> one or multiple physical NIC, wth FQ or other qdisc.
>
> Packets would be dropped here when we try to reach the physical device.
>

If you have an example handy please send it. I am trying to imagine
how those would have worked if they have to reenter the root qdisc of
the same dev multiple times..

cheers,
jamal


> > +               rc = NET_XMIT_DROP;
> > +               __qdisc_drop(skb, &to_free);
> > +               goto free;
> > +       }
> > +
> >         spin_lock(root_lock);
> > +       dev_xmit_recursion_inc();
> >         if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> >                 __qdisc_drop(skb, &to_free);
> >                 rc = NET_XMIT_DROP;
> > @@ -3824,8 +3831,10 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> >                         qdisc_run_end(q);
> >                 }
> >         }
> > +       dev_xmit_recursion_dec();
> >         spin_unlock(root_lock);
> >         if (unlikely(to_free))
> > +free:
> >                 kfree_skb_list_reason(to_free,
> >                                       tcf_get_drop_reason(to_free));
> >         if (unlikely(contended))

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 17:36         ` Jamal Hadi Salim
@ 2024-03-18 19:11           ` Eric Dumazet
  2024-03-18 22:05             ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-18 19:11 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela

On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > >
> > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > the following steps:
> > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > >
> > > > >
> > > > > I think you meant both to be the same device eth0 or eth2?
> > > > >
> > > > > > The stack is show as below:
> > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > >
> > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > >
> > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > to the egress.
> > > > > >
> > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > ---
> > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > >  2 files changed, 36 insertions(+)
> > > > > >
> > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > --- a/net/sched/act_mirred.c
> > > > > > +++ b/net/sched/act_mirred.c
> > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > >                 return -EINVAL;
> > > > > >         }
> > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > +               return -EINVAL;
> > > > > > +       }
> > > > >
> > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > am almost certain this used to work in the old days.
> > > >
> > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > Conclusion is things broke around this time frame:
> > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > >
> > > > Looking further into it.
> > >
> > > This is what we came up with. Eric, please take a look...
> > >
> > > cheers,
> > > jamal
> > >
> > >
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > *skb, struct Qdisc *q,
> > >         if (unlikely(contended))
> > >                 spin_lock(&q->busylock);
> > >
> > > +       if (dev_recursion_level()) {
> >
> > I am not sure what your intent is, but this seems wrong to me.
> >
>
> There is a deadlock if you reenter the same device which has a qdisc
> attached to it more than once.
> Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> run some action which requires it to grab the root qdisc lock (again).
> This is easy to show with mirred (although i am wondering if syzkaller
> may have produced this at some point)..
> $TC qdisc add dev $DEV root handle 1: htb default 1
> $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> action mirred egress mirror dev $DEV
>
> Above example is essentially egress $DEV-> egress $DEV in both cases
> "egress $DEV" grabs the root qdisc lock. You could also create another
> example with egress($DEV1->$DEV2->back to $DEV1).
>
> > Some valid setup use :
> >
> > A bonding device, with HTB qdisc (or other qdisc)
> >   (This also could be a tunnel device with a qdisc)
> >
> > -> one or multiple physical NIC, wth FQ or other qdisc.
> >
> > Packets would be dropped here when we try to reach the physical device.
> >
>
> If you have an example handy please send it. I am trying to imagine
> how those would have worked if they have to reenter the root qdisc of
> the same dev multiple times..

Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?

dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.

A stack of devices A -> B -> C  would elevate the recursion level to
three just fine.

After your patch, a stack of devices would no longer work.

It seems mirred correctly injects packets to the top of the stack for
ingress (via netif_rx() / netif_receive_skb()),
but thinks it is okay to call dev_queue_xmit(), regardless of the context ?

Perhaps safe-guard mirred, instead of adding more code to fast path.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 19:11           ` Eric Dumazet
@ 2024-03-18 22:05             ` Jamal Hadi Salim
  2024-03-19  9:38               ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-18 22:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > >
> > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > the following steps:
> > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > >
> > > > > >
> > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > >
> > > > > > > The stack is show as below:
> > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > >
> > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > >
> > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > to the egress.
> > > > > > >
> > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > ---
> > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > >  2 files changed, 36 insertions(+)
> > > > > > >
> > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > >                 return -EINVAL;
> > > > > > >         }
> > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > +               return -EINVAL;
> > > > > > > +       }
> > > > > >
> > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > am almost certain this used to work in the old days.
> > > > >
> > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > Conclusion is things broke around this time frame:
> > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > >
> > > > > Looking further into it.
> > > >
> > > > This is what we came up with. Eric, please take a look...
> > > >
> > > > cheers,
> > > > jamal
> > > >
> > > >
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > *skb, struct Qdisc *q,
> > > >         if (unlikely(contended))
> > > >                 spin_lock(&q->busylock);
> > > >
> > > > +       if (dev_recursion_level()) {
> > >
> > > I am not sure what your intent is, but this seems wrong to me.
> > >
> >
> > There is a deadlock if you reenter the same device which has a qdisc
> > attached to it more than once.
> > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > run some action which requires it to grab the root qdisc lock (again).
> > This is easy to show with mirred (although i am wondering if syzkaller
> > may have produced this at some point)..
> > $TC qdisc add dev $DEV root handle 1: htb default 1
> > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > action mirred egress mirror dev $DEV
> >
> > Above example is essentially egress $DEV-> egress $DEV in both cases
> > "egress $DEV" grabs the root qdisc lock. You could also create another
> > example with egress($DEV1->$DEV2->back to $DEV1).
> >
> > > Some valid setup use :
> > >
> > > A bonding device, with HTB qdisc (or other qdisc)
> > >   (This also could be a tunnel device with a qdisc)
> > >
> > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > >
> > > Packets would be dropped here when we try to reach the physical device.
> > >
> >
> > If you have an example handy please send it. I am trying to imagine
> > how those would have worked if they have to reenter the root qdisc of
> > the same dev multiple times..
>
> Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
>
> dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
>
> A stack of devices A -> B -> C  would elevate the recursion level to
> three just fine.
>
> After your patch, a stack of devices would no longer work.
>
> It seems mirred correctly injects packets to the top of the stack for
> ingress (via netif_rx() / netif_receive_skb()),
> but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
>
> Perhaps safe-guard mirred, instead of adding more code to fast path.

I agree not to penalize everybody for a "bad config" like this
(surprising syzkaller hasnt caught this). But i dont see how doing the
checking within mirred will catch this (we cant detect the A->B->A
case).
I think you are suggesting a backlog-like queue for mirred? Not far
off from that is how it used to work before
(https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
- i.e we had a trylock for the qdisc lock and if it failed we tagged
the rx softirq for a reschedule. That in itself is insufficient, we
would need a loop check which is per-skb (which we had before
https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
There are other gotchas there, potentially packet reordering.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 22:05             ` Jamal Hadi Salim
@ 2024-03-19  9:38               ` Eric Dumazet
  2024-03-19 20:54                 ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-19  9:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > >
> > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > the following steps:
> > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > >
> > > > > > >
> > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > >
> > > > > > > > The stack is show as below:
> > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > >
> > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > >
> > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > to the egress.
> > > > > > > >
> > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > ---
> > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > >                 return -EINVAL;
> > > > > > > >         }
> > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > +               return -EINVAL;
> > > > > > > > +       }
> > > > > > >
> > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > am almost certain this used to work in the old days.
> > > > > >
> > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > Conclusion is things broke around this time frame:
> > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > >
> > > > > > Looking further into it.
> > > > >
> > > > > This is what we came up with. Eric, please take a look...
> > > > >
> > > > > cheers,
> > > > > jamal
> > > > >
> > > > >
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > *skb, struct Qdisc *q,
> > > > >         if (unlikely(contended))
> > > > >                 spin_lock(&q->busylock);
> > > > >
> > > > > +       if (dev_recursion_level()) {
> > > >
> > > > I am not sure what your intent is, but this seems wrong to me.
> > > >
> > >
> > > There is a deadlock if you reenter the same device which has a qdisc
> > > attached to it more than once.
> > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > run some action which requires it to grab the root qdisc lock (again).
> > > This is easy to show with mirred (although i am wondering if syzkaller
> > > may have produced this at some point)..
> > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > action mirred egress mirror dev $DEV
> > >
> > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > example with egress($DEV1->$DEV2->back to $DEV1).
> > >
> > > > Some valid setup use :
> > > >
> > > > A bonding device, with HTB qdisc (or other qdisc)
> > > >   (This also could be a tunnel device with a qdisc)
> > > >
> > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > >
> > > > Packets would be dropped here when we try to reach the physical device.
> > > >
> > >
> > > If you have an example handy please send it. I am trying to imagine
> > > how those would have worked if they have to reenter the root qdisc of
> > > the same dev multiple times..
> >
> > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> >
> > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> >
> > A stack of devices A -> B -> C  would elevate the recursion level to
> > three just fine.
> >
> > After your patch, a stack of devices would no longer work.
> >
> > It seems mirred correctly injects packets to the top of the stack for
> > ingress (via netif_rx() / netif_receive_skb()),
> > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> >
> > Perhaps safe-guard mirred, instead of adding more code to fast path.
>
> I agree not to penalize everybody for a "bad config" like this
> (surprising syzkaller hasnt caught this). But i dont see how doing the
> checking within mirred will catch this (we cant detect the A->B->A
> case).
> I think you are suggesting a backlog-like queue for mirred? Not far
> off from that is how it used to work before
> (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)


spin_trylock() had to go. There is no way we could keep this.

> - i.e we had a trylock for the qdisc lock and if it failed we tagged
> the rx softirq for a reschedule. That in itself is insufficient, we
> would need a loop check which is per-skb (which we had before
> https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> There are other gotchas there, potentially packet reordering.

If we want to make sure dev_queue_xmit() is called from the top (no
spinlock held),
then we need a queue, serviced from another context.

This extra queueing could happen if
__this_cpu_read(softnet_data.xmit.recursion) > 0

This can be done from net/sched/act_mirred.c with no additional change
in net/core/dev.c,
and no new skb field.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-19  9:38               ` Eric Dumazet
@ 2024-03-19 20:54                 ` Jamal Hadi Salim
  2024-03-20 16:46                   ` Jamal Hadi Salim
  2024-03-20 16:57                   ` Eric Dumazet
  0 siblings, 2 replies; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-19 20:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > > >
> > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > > the following steps:
> > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > > >
> > > > > > > >
> > > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > > >
> > > > > > > > > The stack is show as below:
> > > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > > >
> > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > > >
> > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > > to the egress.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > > ---
> > > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > > >                 return -EINVAL;
> > > > > > > > >         }
> > > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > > +               return -EINVAL;
> > > > > > > > > +       }
> > > > > > > >
> > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > > am almost certain this used to work in the old days.
> > > > > > >
> > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > Conclusion is things broke around this time frame:
> > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > >
> > > > > > > Looking further into it.
> > > > > >
> > > > > > This is what we came up with. Eric, please take a look...
> > > > > >
> > > > > > cheers,
> > > > > > jamal
> > > > > >
> > > > > >
> > > > > > --- a/net/core/dev.c
> > > > > > +++ b/net/core/dev.c
> > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > > *skb, struct Qdisc *q,
> > > > > >         if (unlikely(contended))
> > > > > >                 spin_lock(&q->busylock);
> > > > > >
> > > > > > +       if (dev_recursion_level()) {
> > > > >
> > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > >
> > > >
> > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > attached to it more than once.
> > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > run some action which requires it to grab the root qdisc lock (again).
> > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > may have produced this at some point)..
> > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > > action mirred egress mirror dev $DEV
> > > >
> > > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > example with egress($DEV1->$DEV2->back to $DEV1).
> > > >
> > > > > Some valid setup use :
> > > > >
> > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > >   (This also could be a tunnel device with a qdisc)
> > > > >
> > > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > > >
> > > > > Packets would be dropped here when we try to reach the physical device.
> > > > >
> > > >
> > > > If you have an example handy please send it. I am trying to imagine
> > > > how those would have worked if they have to reenter the root qdisc of
> > > > the same dev multiple times..
> > >
> > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > >
> > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > >
> > > A stack of devices A -> B -> C  would elevate the recursion level to
> > > three just fine.
> > >
> > > After your patch, a stack of devices would no longer work.
> > >
> > > It seems mirred correctly injects packets to the top of the stack for
> > > ingress (via netif_rx() / netif_receive_skb()),
> > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > >
> > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> >
> > I agree not to penalize everybody for a "bad config" like this
> > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > checking within mirred will catch this (we cant detect the A->B->A
> > case).
> > I think you are suggesting a backlog-like queue for mirred? Not far
> > off from that is how it used to work before
> > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
>
>
> spin_trylock() had to go. There is no way we could keep this.
>

Not asking for it to come back... just pointing out why it worked before.

> > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > the rx softirq for a reschedule. That in itself is insufficient, we
> > would need a loop check which is per-skb (which we had before
> > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > There are other gotchas there, potentially packet reordering.
>
> If we want to make sure dev_queue_xmit() is called from the top (no
> spinlock held),
> then we need a queue, serviced from another context.
>
> This extra queueing could happen if
> __this_cpu_read(softnet_data.xmit.recursion) > 0
>

I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
level. The first time we enter it will be 0. The second time we would
deadlock before we hit mirred. I am also worried about packet
reordering avoidance logic code amount needed just to deal with a
buggy configuration. Dropping the packet for this exact config should
be ok.
How about adding a per-qdisc per-cpu recursion counter? Incremented
when we enter the cpu from the top, decremented when we leave. If we
detect __this_cpu_read(qdisc.xmit_recursion) > 0 we drop as per that
patch?

cheers,
jamal
> This can be done from net/sched/act_mirred.c with no additional change
> in net/core/dev.c,
> and no new skb field.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-19 20:54                 ` Jamal Hadi Salim
@ 2024-03-20 16:46                   ` Jamal Hadi Salim
  2024-03-20 16:57                   ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-20 16:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Tue, Mar 19, 2024 at 4:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > > > >
> > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > > > the following steps:
> > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > > > >
> > > > > > > > > > The stack is show as below:
> > > > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > > > >
> > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > > > >
> > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > > > to the egress.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > > > ---
> > > > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > > > >                 return -EINVAL;
> > > > > > > > > >         }
> > > > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > > > +               return -EINVAL;
> > > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > > > am almost certain this used to work in the old days.
> > > > > > > >
> > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > > Conclusion is things broke around this time frame:
> > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > > >
> > > > > > > > Looking further into it.
> > > > > > >
> > > > > > > This is what we came up with. Eric, please take a look...
> > > > > > >
> > > > > > > cheers,
> > > > > > > jamal
> > > > > > >
> > > > > > >
> > > > > > > --- a/net/core/dev.c
> > > > > > > +++ b/net/core/dev.c
> > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > > > *skb, struct Qdisc *q,
> > > > > > >         if (unlikely(contended))
> > > > > > >                 spin_lock(&q->busylock);
> > > > > > >
> > > > > > > +       if (dev_recursion_level()) {
> > > > > >
> > > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > > >
> > > > >
> > > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > > attached to it more than once.
> > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > > run some action which requires it to grab the root qdisc lock (again).
> > > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > > may have produced this at some point)..
> > > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > > > action mirred egress mirror dev $DEV
> > > > >
> > > > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > > example with egress($DEV1->$DEV2->back to $DEV1).
> > > > >
> > > > > > Some valid setup use :
> > > > > >
> > > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > > >   (This also could be a tunnel device with a qdisc)
> > > > > >
> > > > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > > > >
> > > > > > Packets would be dropped here when we try to reach the physical device.
> > > > > >
> > > > >
> > > > > If you have an example handy please send it. I am trying to imagine
> > > > > how those would have worked if they have to reenter the root qdisc of
> > > > > the same dev multiple times..
> > > >
> > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > > >
> > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > > >
> > > > A stack of devices A -> B -> C  would elevate the recursion level to
> > > > three just fine.
> > > >
> > > > After your patch, a stack of devices would no longer work.
> > > >
> > > > It seems mirred correctly injects packets to the top of the stack for
> > > > ingress (via netif_rx() / netif_receive_skb()),
> > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > > >
> > > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> > >
> > > I agree not to penalize everybody for a "bad config" like this
> > > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > > checking within mirred will catch this (we cant detect the A->B->A
> > > case).
> > > I think you are suggesting a backlog-like queue for mirred? Not far
> > > off from that is how it used to work before
> > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
> >
> >
> > spin_trylock() had to go. There is no way we could keep this.
> >
>
> Not asking for it to come back... just pointing out why it worked before.
>
> > > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > > the rx softirq for a reschedule. That in itself is insufficient, we
> > > would need a loop check which is per-skb (which we had before
> > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > > There are other gotchas there, potentially packet reordering.
> >
> > If we want to make sure dev_queue_xmit() is called from the top (no
> > spinlock held),
> > then we need a queue, serviced from another context.
> >
> > This extra queueing could happen if
> > __this_cpu_read(softnet_data.xmit.recursion) > 0
> >
>
> I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
> level. The first time we enter it will be 0. The second time we would
> deadlock before we hit mirred. I am also worried about packet
> reordering avoidance logic code amount needed just to deal with a
> buggy configuration. Dropping the packet for this exact config should
> be ok.
> How about adding a per-qdisc per-cpu recursion counter? Incremented
> when we enter the cpu from the top, decremented when we leave. If we
> detect __this_cpu_read(qdisc.xmit_recursion) > 0 we drop as per that
> patch?
>

Actually scratch this idea. It wont work since we added tc block
support in mirred which complicates things further.
Need some more thinking...

cheers,
jamal


> cheers,
> jamal
> > This can be done from net/sched/act_mirred.c with no additional change
> > in net/core/dev.c,
> > and no new skb field.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-19 20:54                 ` Jamal Hadi Salim
  2024-03-20 16:46                   ` Jamal Hadi Salim
@ 2024-03-20 16:57                   ` Eric Dumazet
  2024-03-20 17:31                     ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-20 16:57 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > > > >
> > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > > > the following steps:
> > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > > > >
> > > > > > > > > > The stack is show as below:
> > > > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > > > >
> > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > > > >
> > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > > > to the egress.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > > > ---
> > > > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > > > >
> > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > > > >                 return -EINVAL;
> > > > > > > > > >         }
> > > > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > > > +               return -EINVAL;
> > > > > > > > > > +       }
> > > > > > > > >
> > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > > > am almost certain this used to work in the old days.
> > > > > > > >
> > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > > Conclusion is things broke around this time frame:
> > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > > >
> > > > > > > > Looking further into it.
> > > > > > >
> > > > > > > This is what we came up with. Eric, please take a look...
> > > > > > >
> > > > > > > cheers,
> > > > > > > jamal
> > > > > > >
> > > > > > >
> > > > > > > --- a/net/core/dev.c
> > > > > > > +++ b/net/core/dev.c
> > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > > > *skb, struct Qdisc *q,
> > > > > > >         if (unlikely(contended))
> > > > > > >                 spin_lock(&q->busylock);
> > > > > > >
> > > > > > > +       if (dev_recursion_level()) {
> > > > > >
> > > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > > >
> > > > >
> > > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > > attached to it more than once.
> > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > > run some action which requires it to grab the root qdisc lock (again).
> > > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > > may have produced this at some point)..
> > > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > > > action mirred egress mirror dev $DEV
> > > > >
> > > > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > > example with egress($DEV1->$DEV2->back to $DEV1).
> > > > >
> > > > > > Some valid setup use :
> > > > > >
> > > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > > >   (This also could be a tunnel device with a qdisc)
> > > > > >
> > > > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > > > >
> > > > > > Packets would be dropped here when we try to reach the physical device.
> > > > > >
> > > > >
> > > > > If you have an example handy please send it. I am trying to imagine
> > > > > how those would have worked if they have to reenter the root qdisc of
> > > > > the same dev multiple times..
> > > >
> > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > > >
> > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > > >
> > > > A stack of devices A -> B -> C  would elevate the recursion level to
> > > > three just fine.
> > > >
> > > > After your patch, a stack of devices would no longer work.
> > > >
> > > > It seems mirred correctly injects packets to the top of the stack for
> > > > ingress (via netif_rx() / netif_receive_skb()),
> > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > > >
> > > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> > >
> > > I agree not to penalize everybody for a "bad config" like this
> > > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > > checking within mirred will catch this (we cant detect the A->B->A
> > > case).
> > > I think you are suggesting a backlog-like queue for mirred? Not far
> > > off from that is how it used to work before
> > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
> >
> >
> > spin_trylock() had to go. There is no way we could keep this.
> >
>
> Not asking for it to come back... just pointing out why it worked before.
>
> > > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > > the rx softirq for a reschedule. That in itself is insufficient, we
> > > would need a loop check which is per-skb (which we had before
> > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > > There are other gotchas there, potentially packet reordering.
> >
> > If we want to make sure dev_queue_xmit() is called from the top (no
> > spinlock held),
> > then we need a queue, serviced from another context.
> >
> > This extra queueing could happen if
> > __this_cpu_read(softnet_data.xmit.recursion) > 0
> >
>
> I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
> level. The first time we enter it will be 0.

Then it is fine, no qdisc spinlock is held at this point.

 The second time we would
> deadlock before we hit mirred.

This is not how I see the trace.

Mirred would detect that and either drop or queue the packet to a work
queue or something.

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e
100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool
want_ingress, struct sk_buff *skb)
 {
        int err;

-       if (!want_ingress)
+       if (!want_ingress) {
+               if (__this_cpu_read(softnet_data.xmit.recursion) > 0) {
+                       // TODO increment a drop counter perhaps ?
+                       kfree_skb(skb);
+                       return -EINVAL;
+               }
+
                err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
-       else if (!at_ingress)
+       } else if (!at_ingress)
                err = netif_rx(skb);
        else
                err = netif_receive_skb(skb);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 16:57                   ` Eric Dumazet
@ 2024-03-20 17:31                     ` Jamal Hadi Salim
  2024-03-20 17:50                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-20 17:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Wed, Mar 20, 2024 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > > > > the following steps:
> > > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > > > > >
> > > > > > > > > > > The stack is show as below:
> > > > > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > > > > >
> > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > > > > >
> > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > > > > to the egress.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > > > > >                 return -EINVAL;
> > > > > > > > > > >         }
> > > > > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > > > > +               return -EINVAL;
> > > > > > > > > > > +       }
> > > > > > > > > >
> > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > > > > am almost certain this used to work in the old days.
> > > > > > > > >
> > > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > > > Conclusion is things broke around this time frame:
> > > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > > > >
> > > > > > > > > Looking further into it.
> > > > > > > >
> > > > > > > > This is what we came up with. Eric, please take a look...
> > > > > > > >
> > > > > > > > cheers,
> > > > > > > > jamal
> > > > > > > >
> > > > > > > >
> > > > > > > > --- a/net/core/dev.c
> > > > > > > > +++ b/net/core/dev.c
> > > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > > > > *skb, struct Qdisc *q,
> > > > > > > >         if (unlikely(contended))
> > > > > > > >                 spin_lock(&q->busylock);
> > > > > > > >
> > > > > > > > +       if (dev_recursion_level()) {
> > > > > > >
> > > > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > > > >
> > > > > >
> > > > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > > > attached to it more than once.
> > > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > > > run some action which requires it to grab the root qdisc lock (again).
> > > > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > > > may have produced this at some point)..
> > > > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > > > > action mirred egress mirror dev $DEV
> > > > > >
> > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > > > example with egress($DEV1->$DEV2->back to $DEV1).
> > > > > >
> > > > > > > Some valid setup use :
> > > > > > >
> > > > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > > > >   (This also could be a tunnel device with a qdisc)
> > > > > > >
> > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > > > > >
> > > > > > > Packets would be dropped here when we try to reach the physical device.
> > > > > > >
> > > > > >
> > > > > > If you have an example handy please send it. I am trying to imagine
> > > > > > how those would have worked if they have to reenter the root qdisc of
> > > > > > the same dev multiple times..
> > > > >
> > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > > > >
> > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > > > >
> > > > > A stack of devices A -> B -> C  would elevate the recursion level to
> > > > > three just fine.
> > > > >
> > > > > After your patch, a stack of devices would no longer work.
> > > > >
> > > > > It seems mirred correctly injects packets to the top of the stack for
> > > > > ingress (via netif_rx() / netif_receive_skb()),
> > > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > > > >
> > > > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> > > >
> > > > I agree not to penalize everybody for a "bad config" like this
> > > > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > > > checking within mirred will catch this (we cant detect the A->B->A
> > > > case).
> > > > I think you are suggesting a backlog-like queue for mirred? Not far
> > > > off from that is how it used to work before
> > > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
> > >
> > >
> > > spin_trylock() had to go. There is no way we could keep this.
> > >
> >
> > Not asking for it to come back... just pointing out why it worked before.
> >
> > > > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > > > the rx softirq for a reschedule. That in itself is insufficient, we
> > > > would need a loop check which is per-skb (which we had before
> > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > > > There are other gotchas there, potentially packet reordering.
> > >
> > > If we want to make sure dev_queue_xmit() is called from the top (no
> > > spinlock held),
> > > then we need a queue, serviced from another context.
> > >
> > > This extra queueing could happen if
> > > __this_cpu_read(softnet_data.xmit.recursion) > 0
> > >
> >
> > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
> > level. The first time we enter it will be 0.
>
> Then it is fine, no qdisc spinlock is held at this point.
>
>  The second time we would
> > deadlock before we hit mirred.
>
> This is not how I see the trace.
>
> Mirred would detect that and either drop or queue the packet to a work
> queue or something.
>
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e
> 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool
> want_ingress, struct sk_buff *skb)
>  {
>         int err;
>
> -       if (!want_ingress)
> +       if (!want_ingress) {
> +               if (__this_cpu_read(softnet_data.xmit.recursion) > 0) {
> +                       // TODO increment a drop counter perhaps ?
> +                       kfree_skb(skb);
> +                       return -EINVAL;
> +               }
> +
>                 err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> -       else if (!at_ingress)
> +       } else if (!at_ingress)
>                 err = netif_rx(skb);
>         else
>                 err = netif_receive_skb(skb);


I doubt this will work - who increments softnet_data.xmit.recursion?
We enter __dev_xmit_skb (grab lock) --> qdisc_enq-->classify-->mirred
(recursion is zero) which redirects entering back into  __dev_xmit_skb
again and deadlocks trying to grab lock.
 Maybe something is not clear to me, trying your suggestion...

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 17:31                     ` Jamal Hadi Salim
@ 2024-03-20 17:50                       ` Jamal Hadi Salim
  2024-03-20 18:13                         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-20 17:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Wed, Mar 20, 2024 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Mar 20, 2024 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > >
> > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > > > > > the following steps:
> > > > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > > > > > >
> > > > > > > > > > > > The stack is show as below:
> > > > > > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > > > > > >
> > > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > > > > > >
> > > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > > > > > to the egress.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > > > > > >                 return -EINVAL;
> > > > > > > > > > > >         }
> > > > > > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > > > > > +               return -EINVAL;
> > > > > > > > > > > > +       }
> > > > > > > > > > >
> > > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > > > > > am almost certain this used to work in the old days.
> > > > > > > > > >
> > > > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > > > > Conclusion is things broke around this time frame:
> > > > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > > > > >
> > > > > > > > > > Looking further into it.
> > > > > > > > >
> > > > > > > > > This is what we came up with. Eric, please take a look...
> > > > > > > > >
> > > > > > > > > cheers,
> > > > > > > > > jamal
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --- a/net/core/dev.c
> > > > > > > > > +++ b/net/core/dev.c
> > > > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > > > > > *skb, struct Qdisc *q,
> > > > > > > > >         if (unlikely(contended))
> > > > > > > > >                 spin_lock(&q->busylock);
> > > > > > > > >
> > > > > > > > > +       if (dev_recursion_level()) {
> > > > > > > >
> > > > > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > > > > >
> > > > > > >
> > > > > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > > > > attached to it more than once.
> > > > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > > > > run some action which requires it to grab the root qdisc lock (again).
> > > > > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > > > > may have produced this at some point)..
> > > > > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > > > > > action mirred egress mirror dev $DEV
> > > > > > >
> > > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > > > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > > > > example with egress($DEV1->$DEV2->back to $DEV1).
> > > > > > >
> > > > > > > > Some valid setup use :
> > > > > > > >
> > > > > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > > > > >   (This also could be a tunnel device with a qdisc)
> > > > > > > >
> > > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > > > > > >
> > > > > > > > Packets would be dropped here when we try to reach the physical device.
> > > > > > > >
> > > > > > >
> > > > > > > If you have an example handy please send it. I am trying to imagine
> > > > > > > how those would have worked if they have to reenter the root qdisc of
> > > > > > > the same dev multiple times..
> > > > > >
> > > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > > > > >
> > > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > > > > >
> > > > > > A stack of devices A -> B -> C  would elevate the recursion level to
> > > > > > three just fine.
> > > > > >
> > > > > > After your patch, a stack of devices would no longer work.
> > > > > >
> > > > > > It seems mirred correctly injects packets to the top of the stack for
> > > > > > ingress (via netif_rx() / netif_receive_skb()),
> > > > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > > > > >
> > > > > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> > > > >
> > > > > I agree not to penalize everybody for a "bad config" like this
> > > > > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > > > > checking within mirred will catch this (we cant detect the A->B->A
> > > > > case).
> > > > > I think you are suggesting a backlog-like queue for mirred? Not far
> > > > > off from that is how it used to work before
> > > > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
> > > >
> > > >
> > > > spin_trylock() had to go. There is no way we could keep this.
> > > >
> > >
> > > Not asking for it to come back... just pointing out why it worked before.
> > >
> > > > > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > > > > the rx softirq for a reschedule. That in itself is insufficient, we
> > > > > would need a loop check which is per-skb (which we had before
> > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > > > > There are other gotchas there, potentially packet reordering.
> > > >
> > > > If we want to make sure dev_queue_xmit() is called from the top (no
> > > > spinlock held),
> > > > then we need a queue, serviced from another context.
> > > >
> > > > This extra queueing could happen if
> > > > __this_cpu_read(softnet_data.xmit.recursion) > 0
> > > >
> > >
> > > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
> > > level. The first time we enter it will be 0.
> >
> > Then it is fine, no qdisc spinlock is held at this point.
> >
> >  The second time we would
> > > deadlock before we hit mirred.
> >
> > This is not how I see the trace.
> >
> > Mirred would detect that and either drop or queue the packet to a work
> > queue or something.
> >
> > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e
> > 100644
> > --- a/net/sched/act_mirred.c
> > +++ b/net/sched/act_mirred.c
> > @@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool
> > want_ingress, struct sk_buff *skb)
> >  {
> >         int err;
> >
> > -       if (!want_ingress)
> > +       if (!want_ingress) {
> > +               if (__this_cpu_read(softnet_data.xmit.recursion) > 0) {
> > +                       // TODO increment a drop counter perhaps ?
> > +                       kfree_skb(skb);
> > +                       return -EINVAL;
> > +               }
> > +
> >                 err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> > -       else if (!at_ingress)
> > +       } else if (!at_ingress)
> >                 err = netif_rx(skb);
> >         else
> >                 err = netif_receive_skb(skb);
>
>
> I doubt this will work - who increments softnet_data.xmit.recursion?
> We enter __dev_xmit_skb (grab lock) --> qdisc_enq-->classify-->mirred
> (recursion is zero) which redirects entering back into  __dev_xmit_skb
> again and deadlocks trying to grab lock.
>  Maybe something is not clear to me, trying your suggestion...
>

jhs@mojaone:~$
[   82.890330] __this_cpu_read(softnet_data.xmit.recursion) 0 in
tcf_mirred_forward
[   82.890906]
[   82.890906] ============================================
[   82.890906] WARNING: possible recursive locking detected
[   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
[   82.890906] --------------------------------------------
[   82.890906] ping/418 is trying to acquire lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] but task is already holding lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
__dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] other info that might help us debug this:
[   82.890906]  Possible unsafe locking scenario:
[   82.890906]
[   82.890906]        CPU0
[   82.890906]        ----
[   82.890906]   lock(&sch->q.lock);
[   82.890906]   lock(&sch->q.lock);
[   82.890906]
[   82.890906]  *** DEADLOCK ***
[   82.890906]
[..... other info removed for brevity....]

Needs more thinking...
A fool proof solution is to add the per-recursion counter to be per
netdevice but that maybe considered blasphemy? ;->

cheers,
jamal

> cheers,
> jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 17:50                       ` Jamal Hadi Salim
@ 2024-03-20 18:13                         ` Eric Dumazet
  2024-03-20 18:25                           ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-20 18:13 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Mar 20, 2024 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the
> > > > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to
> > > > > > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform
> > > > > > > > > > > > > the following steps:
> > > > > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> > > > > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > > > > > > > > > > > >      action police rate 100mbit burst 12m conform-exceed jump 1 \
> > > > > > > > > > > > >      / pipe mirred egress redirect dev eth2 action drop
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I think you meant both to be the same device eth0 or eth2?
> > > > > > > > > > > >
> > > > > > > > > > > > > The stack is show as below:
> > > > > > > > > > > > > [28848.883915]  _raw_spin_lock+0x1e/0x30
> > > > > > > > > > > > > [28848.884367]  __dev_queue_xmit+0x160/0x850
> > > > > > > > > > > > > [28848.884851]  ? 0xffffffffc031906a
> > > > > > > > > > > > > [28848.885279]  tcf_mirred_act+0x3ab/0x596 [act_mirred]
> > > > > > > > > > > > > [28848.885863]  tcf_action_exec.part.0+0x88/0x130
> > > > > > > > > > > > > [28848.886401]  fl_classify+0x1ca/0x1e0 [cls_flower]
> > > > > > > > > > > > > [28848.886970]  ? dequeue_entity+0x145/0x9e0
> > > > > > > > > > > > > [28848.887464]  ? newidle_balance+0x23f/0x2f0
> > > > > > > > > > > > > [28848.887973]  ? nft_lookup_eval+0x57/0x170 [nf_tables]
> > > > > > > > > > > > > [28848.888566]  ? nft_do_chain+0xef/0x430 [nf_tables]
> > > > > > > > > > > > > [28848.889137]  ? __flush_work.isra.0+0x35/0x80
> > > > > > > > > > > > > [28848.889657]  ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
> > > > > > > > > > > > > [28848.890293]  ? do_select+0x637/0x870
> > > > > > > > > > > > > [28848.890735]  tcf_classify+0x52/0xf0
> > > > > > > > > > > > > [28848.891177]  htb_classify+0x9d/0x1c0 [sch_htb]
> > > > > > > > > > > > > [28848.891722]  htb_enqueue+0x3a/0x1c0 [sch_htb]
> > > > > > > > > > > > > [28848.892251]  __dev_queue_xmit+0x2d8/0x850
> > > > > > > > > > > > > [28848.892738]  ? nf_hook_slow+0x3c/0xb0
> > > > > > > > > > > > > [28848.893198]  ip_finish_output2+0x272/0x580
> > > > > > > > > > > > > [28848.893692]  __ip_queue_xmit+0x193/0x420
> > > > > > > > > > > > > [28848.894179]  __tcp_transmit_skb+0x8cc/0x970
> > > > > > > > > > > > >
> > > > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> > > > > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the
> > > > > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached
> > > > > > > > > > > > > to the egress.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  net/sched/act_mirred.c                        |  4 +++
> > > > > > > > > > > > >  .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
> > > > > > > > > > > > >  2 files changed, 36 insertions(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > > > > > > > > > > > index 5b3814365924..fc96705285fb 100644
> > > > > > > > > > > > > --- a/net/sched/act_mirred.c
> > > > > > > > > > > > > +++ b/net/sched/act_mirred.c
> > > > > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
> > > > > > > > > > > > >                 NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
> > > > > > > > > > > > >                 return -EINVAL;
> > > > > > > > > > > > >         }
> > > > > > > > > > > > > +       if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > > > > > > > > > > > +               NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > > > > > > > > > > > +               return -EINVAL;
> > > > > > > > > > > > > +       }
> > > > > > > > > > > >
> > > > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> > > > > > > > > > > > am almost certain this used to work in the old days.
> > > > > > > > > > >
> > > > > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > > > > > Conclusion is things broke around this time frame:
> > > > > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > > > > > >
> > > > > > > > > > > Looking further into it.
> > > > > > > > > >
> > > > > > > > > > This is what we came up with. Eric, please take a look...
> > > > > > > > > >
> > > > > > > > > > cheers,
> > > > > > > > > > jamal
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --- a/net/core/dev.c
> > > > > > > > > > +++ b/net/core/dev.c
> > > > > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff
> > > > > > > > > > *skb, struct Qdisc *q,
> > > > > > > > > >         if (unlikely(contended))
> > > > > > > > > >                 spin_lock(&q->busylock);
> > > > > > > > > >
> > > > > > > > > > +       if (dev_recursion_level()) {
> > > > > > > > >
> > > > > > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > > > > > >
> > > > > > > >
> > > > > > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > > > > > attached to it more than once.
> > > > > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > > > > > run some action which requires it to grab the root qdisc lock (again).
> > > > > > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > > > > > may have produced this at some point)..
> > > > > > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff
> > > > > > > > action mirred egress mirror dev $DEV
> > > > > > > >
> > > > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases
> > > > > > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > > > > > example with egress($DEV1->$DEV2->back to $DEV1).
> > > > > > > >
> > > > > > > > > Some valid setup use :
> > > > > > > > >
> > > > > > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > > > > > >   (This also could be a tunnel device with a qdisc)
> > > > > > > > >
> > > > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc.
> > > > > > > > >
> > > > > > > > > Packets would be dropped here when we try to reach the physical device.
> > > > > > > > >
> > > > > > > >
> > > > > > > > If you have an example handy please send it. I am trying to imagine
> > > > > > > > how those would have worked if they have to reenter the root qdisc of
> > > > > > > > the same dev multiple times..
> > > > > > >
> > > > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > > > > > >
> > > > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > > > > > >
> > > > > > > A stack of devices A -> B -> C  would elevate the recursion level to
> > > > > > > three just fine.
> > > > > > >
> > > > > > > After your patch, a stack of devices would no longer work.
> > > > > > >
> > > > > > > It seems mirred correctly injects packets to the top of the stack for
> > > > > > > ingress (via netif_rx() / netif_receive_skb()),
> > > > > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > > > > > >
> > > > > > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> > > > > >
> > > > > > I agree not to penalize everybody for a "bad config" like this
> > > > > > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > > > > > checking within mirred will catch this (we cant detect the A->B->A
> > > > > > case).
> > > > > > I think you are suggesting a backlog-like queue for mirred? Not far
> > > > > > off from that is how it used to work before
> > > > > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
> > > > >
> > > > >
> > > > > spin_trylock() had to go. There is no way we could keep this.
> > > > >
> > > >
> > > > Not asking for it to come back... just pointing out why it worked before.
> > > >
> > > > > > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > > > > > the rx softirq for a reschedule. That in itself is insufficient, we
> > > > > > would need a loop check which is per-skb (which we had before
> > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > > > > > There are other gotchas there, potentially packet reordering.
> > > > >
> > > > > If we want to make sure dev_queue_xmit() is called from the top (no
> > > > > spinlock held),
> > > > > then we need a queue, serviced from another context.
> > > > >
> > > > > This extra queueing could happen if
> > > > > __this_cpu_read(softnet_data.xmit.recursion) > 0
> > > > >
> > > >
> > > > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
> > > > level. The first time we enter it will be 0.
> > >
> > > Then it is fine, no qdisc spinlock is held at this point.
> > >
> > >  The second time we would
> > > > deadlock before we hit mirred.
> > >
> > > This is not how I see the trace.
> > >
> > > Mirred would detect that and either drop or queue the packet to a work
> > > queue or something.
> > >
> > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > > index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e
> > > 100644
> > > --- a/net/sched/act_mirred.c
> > > +++ b/net/sched/act_mirred.c
> > > @@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool
> > > want_ingress, struct sk_buff *skb)
> > >  {
> > >         int err;
> > >
> > > -       if (!want_ingress)
> > > +       if (!want_ingress) {
> > > +               if (__this_cpu_read(softnet_data.xmit.recursion) > 0) {
> > > +                       // TODO increment a drop counter perhaps ?
> > > +                       kfree_skb(skb);
> > > +                       return -EINVAL;
> > > +               }
> > > +
> > >                 err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> > > -       else if (!at_ingress)
> > > +       } else if (!at_ingress)
> > >                 err = netif_rx(skb);
> > >         else
> > >                 err = netif_receive_skb(skb);
> >
> >
> > I doubt this will work - who increments softnet_data.xmit.recursion?
> > We enter __dev_xmit_skb (grab lock) --> qdisc_enq-->classify-->mirred
> > (recursion is zero) which redirects entering back into  __dev_xmit_skb
> > again and deadlocks trying to grab lock.
> >  Maybe something is not clear to me, trying your suggestion...
> >
>
> jhs@mojaone:~$
> [   82.890330] __this_cpu_read(softnet_data.xmit.recursion) 0 in
> tcf_mirred_forward
> [   82.890906]
> [   82.890906] ============================================
> [   82.890906] WARNING: possible recursive locking detected
> [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> [   82.890906] --------------------------------------------
> [   82.890906] ping/418 is trying to acquire lock:
> [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> __dev_queue_xmit+0x1778/0x3550
> [   82.890906]
> [   82.890906] but task is already holding lock:
> [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> __dev_queue_xmit+0x1778/0x3550
> [   82.890906]
> [   82.890906] other info that might help us debug this:
> [   82.890906]  Possible unsafe locking scenario:
> [   82.890906]
> [   82.890906]        CPU0
> [   82.890906]        ----
> [   82.890906]   lock(&sch->q.lock);
> [   82.890906]   lock(&sch->q.lock);
> [   82.890906]
> [   82.890906]  *** DEADLOCK ***
> [   82.890906]
> [..... other info removed for brevity....]
>
> Needs more thinking...
> A fool proof solution is to add the per-recursion counter to be per
> netdevice but that maybe considered blasphemy? ;->

Nope, you just have to complete the patch, moving around
dev_xmit_recursion_inc() and dev_xmit_recursion_dec()

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 18:13                         ` Eric Dumazet
@ 2024-03-20 18:25                           ` Eric Dumazet
  2024-03-20 19:34                             ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-20 18:25 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> Nope, you just have to complete the patch, moving around
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec()

Untested part would be:

diff --git a/net/core/dev.c b/net/core/dev.c
index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
net_device *sb_dev)
         */
        rcu_read_lock_bh();

+       dev_xmit_recursion_inc();
+
        skb_update_prio(skb);

        qdisc_pkt_len_init(skb);
@@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
net_device *sb_dev)
                        HARD_TX_LOCK(dev, txq, cpu);

                        if (!netif_xmit_stopped(txq)) {
-                               dev_xmit_recursion_inc();
                                skb = dev_hard_start_xmit(skb, dev, txq, &rc);
-                               dev_xmit_recursion_dec();
                                if (dev_xmit_complete(rc)) {
                                        HARD_TX_UNLOCK(dev, txq);
                                        goto out;
@@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb,
struct net_device *sb_dev)
        }

        rc = -ENETDOWN;
+       dev_xmit_recursion_dec();
        rcu_read_unlock_bh();

        dev_core_stats_tx_dropped_inc(dev);
        kfree_skb_list(skb);
        return rc;
 out:
+       dev_xmit_recursion_dec();
        rcu_read_unlock_bh();
        return rc;
 }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 18:25                           ` Eric Dumazet
@ 2024-03-20 19:34                             ` Jamal Hadi Salim
  2024-03-24 15:27                               ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-20 19:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Wed, Mar 20, 2024 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> > Nope, you just have to complete the patch, moving around
> > dev_xmit_recursion_inc() and dev_xmit_recursion_dec()
>
> Untested part would be:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> net_device *sb_dev)
>          */
>         rcu_read_lock_bh();
>
> +       dev_xmit_recursion_inc();
> +
>         skb_update_prio(skb);
>
>         qdisc_pkt_len_init(skb);
> @@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> net_device *sb_dev)
>                         HARD_TX_LOCK(dev, txq, cpu);
>
>                         if (!netif_xmit_stopped(txq)) {
> -                               dev_xmit_recursion_inc();
>                                 skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> -                               dev_xmit_recursion_dec();
>                                 if (dev_xmit_complete(rc)) {
>                                         HARD_TX_UNLOCK(dev, txq);
>                                         goto out;
> @@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb,
> struct net_device *sb_dev)
>         }
>
>         rc = -ENETDOWN;
> +       dev_xmit_recursion_dec();
>         rcu_read_unlock_bh();
>
>         dev_core_stats_tx_dropped_inc(dev);
>         kfree_skb_list(skb);
>         return rc;
>  out:
> +       dev_xmit_recursion_dec();
>         rcu_read_unlock_bh();
>         return rc;
>  }

This removed the deadlock but now every packet being redirected will
be dropped. I was wrong earlier on the tc block because that only
works on clsact and ingress which are fine not needing this lock.
Here's a variation of the earlier patch that may work but comes at the
cost of a new per-cpu increment on the qdisc.

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3789,6 +3789,11 @@ static inline int __dev_xmit_skb(struct sk_buff
*skb, struct Qdisc *q,
        if (unlikely(contended))
                spin_lock(&q->busylock);

+       if (__this_cpu_read(q->recursion_xmit) > 0) {
+               //drop
+       }
+
+       __this_cpu_inc(q->recursion_xmit);
        spin_lock(root_lock);
        if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                __qdisc_drop(skb, &to_free);
@@ -3825,6 +3830,7 @@ static inline int __dev_xmit_skb(struct sk_buff
*skb, struct Qdisc *q,
                }
        }
        spin_unlock(root_lock);
+       __this_cpu_dec(q->recursion_xmit);
        if (unlikely(to_free))
                kfree_skb_list_reason(to_free,
                                      tcf_get_drop_reason(to_free));


cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 19:34                             ` Jamal Hadi Salim
@ 2024-03-24 15:27                               ` Jamal Hadi Salim
  2024-03-26 23:18                                 ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-24 15:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

[-- Attachment #1: Type: text/plain, Size: 3419 bytes --]

On Wed, Mar 20, 2024 at 3:34 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Mar 20, 2024 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > > Nope, you just have to complete the patch, moving around
> > > dev_xmit_recursion_inc() and dev_xmit_recursion_dec()
> >
> > Untested part would be:
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> > net_device *sb_dev)
> >          */
> >         rcu_read_lock_bh();
> >
> > +       dev_xmit_recursion_inc();
> > +
> >         skb_update_prio(skb);
> >
> >         qdisc_pkt_len_init(skb);
> > @@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> > net_device *sb_dev)
> >                         HARD_TX_LOCK(dev, txq, cpu);
> >
> >                         if (!netif_xmit_stopped(txq)) {
> > -                               dev_xmit_recursion_inc();
> >                                 skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> > -                               dev_xmit_recursion_dec();
> >                                 if (dev_xmit_complete(rc)) {
> >                                         HARD_TX_UNLOCK(dev, txq);
> >                                         goto out;
> > @@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb,
> > struct net_device *sb_dev)
> >         }
> >
> >         rc = -ENETDOWN;
> > +       dev_xmit_recursion_dec();
> >         rcu_read_unlock_bh();
> >
> >         dev_core_stats_tx_dropped_inc(dev);
> >         kfree_skb_list(skb);
> >         return rc;
> >  out:
> > +       dev_xmit_recursion_dec();
> >         rcu_read_unlock_bh();
> >         return rc;
> >  }
>
> This removed the deadlock but now every packet being redirected will
> be dropped. I was wrong earlier on the tc block because that only
> works on clsact and ingress which are fine not needing this lock.
> Here's a variation of the earlier patch that may work but comes at the
> cost of a new per-cpu increment on the qdisc.
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3789,6 +3789,11 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>         if (unlikely(contended))
>                 spin_lock(&q->busylock);
>
> +       if (__this_cpu_read(q->recursion_xmit) > 0) {
> +               //drop
> +       }
> +
> +       __this_cpu_inc(q->recursion_xmit);
>         spin_lock(root_lock);
>         if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
>                 __qdisc_drop(skb, &to_free);
> @@ -3825,6 +3830,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>                 }
>         }
>         spin_unlock(root_lock);
> +       __this_cpu_dec(q->recursion_xmit);
>         if (unlikely(to_free))
>                 kfree_skb_list_reason(to_free,
>                                       tcf_get_drop_reason(to_free));
>

And here's a tested version(attached) that fixes both the A->A and A->B->A.

cheers,
jamal

[-- Attachment #2: qdisc-rec-xmit-var-fix.patch --]
[-- Type: text/x-patch, Size: 2619 bytes --]

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index cefe0c4bdae3..f9f99df037ed 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -125,6 +125,8 @@ struct Qdisc {
 	spinlock_t		busylock ____cacheline_aligned_in_smp;
 	spinlock_t		seqlock;
 
+	u16 __percpu            *xmit_recursion;
+
 	struct rcu_head		rcu;
 	netdevice_tracker	dev_tracker;
 	/* private data */
diff --git a/net/core/dev.c b/net/core/dev.c
index 9a67003e49db..2b712388c06f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	if (unlikely(contended))
 		spin_lock(&q->busylock);
 
+	if (__this_cpu_read(*q->xmit_recursion) > 0) {
+		__qdisc_drop(skb, &to_free);
+		rc = NET_XMIT_DROP;
+		goto free_skb_list;
+	}
+
+	__this_cpu_inc(*q->xmit_recursion);
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
@@ -3825,7 +3832,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+	__this_cpu_dec(*q->xmit_recursion);
 	if (unlikely(to_free))
+free_skb_list:
 		kfree_skb_list_reason(to_free,
 				      tcf_get_drop_reason(to_free));
 	if (unlikely(contended))
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 65e05b0c98e4..6c3bc1aff89a 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1260,6 +1260,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	struct Qdisc *sch;
 	struct Qdisc_ops *ops;
 	struct qdisc_size_table *stab;
+	int cpu;
 
 	ops = qdisc_lookup_ops(kind);
 #ifdef CONFIG_MODULES
@@ -1376,11 +1377,22 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 		}
 	}
 
+	sch->xmit_recursion = alloc_percpu(u16);
+	if (!sch->xmit_recursion) {
+		err = -ENOMEM;
+		goto err_out5;
+	}
+	for_each_possible_cpu(cpu)
+		(*per_cpu_ptr(sch->xmit_recursion, cpu)) = 0;
+
 	qdisc_hash_add(sch, false);
 	trace_qdisc_create(ops, dev, parent);
 
 	return sch;
 
+err_out5:
+	if (tca[TCA_RATE])
+		gen_kill_estimator(&sch->rate_est);
 err_out4:
 	/* Even if ops->init() failed, we call ops->destroy()
 	 * like qdisc_create_dflt().
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff5336493777..afbbd2e885a4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1070,6 +1070,8 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
 	module_put(ops->owner);
 	netdev_put(dev, &qdisc->dev_tracker);
 
+	free_percpu(qdisc->xmit_recursion);
+
 	trace_qdisc_destroy(qdisc);
 
 	call_rcu(&qdisc->rcu, qdisc_free_cb);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-24 15:27                               ` Jamal Hadi Salim
@ 2024-03-26 23:18                                 ` Jamal Hadi Salim
  0 siblings, 0 replies; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-26 23:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Sun, Mar 24, 2024 at 11:27 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Mar 20, 2024 at 3:34 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Wed, Mar 20, 2024 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > > Nope, you just have to complete the patch, moving around
> > > > dev_xmit_recursion_inc() and dev_xmit_recursion_dec()
> > >
> > > Untested part would be:
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332
> > > 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> > > net_device *sb_dev)
> > >          */
> > >         rcu_read_lock_bh();
> > >
> > > +       dev_xmit_recursion_inc();
> > > +
> > >         skb_update_prio(skb);
> > >
> > >         qdisc_pkt_len_init(skb);
> > > @@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> > > net_device *sb_dev)
> > >                         HARD_TX_LOCK(dev, txq, cpu);
> > >
> > >                         if (!netif_xmit_stopped(txq)) {
> > > -                               dev_xmit_recursion_inc();
> > >                                 skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> > > -                               dev_xmit_recursion_dec();
> > >                                 if (dev_xmit_complete(rc)) {
> > >                                         HARD_TX_UNLOCK(dev, txq);
> > >                                         goto out;
> > > @@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb,
> > > struct net_device *sb_dev)
> > >         }
> > >
> > >         rc = -ENETDOWN;
> > > +       dev_xmit_recursion_dec();
> > >         rcu_read_unlock_bh();
> > >
> > >         dev_core_stats_tx_dropped_inc(dev);
> > >         kfree_skb_list(skb);
> > >         return rc;
> > >  out:
> > > +       dev_xmit_recursion_dec();
> > >         rcu_read_unlock_bh();
> > >         return rc;
> > >  }
> >
> > This removed the deadlock but now every packet being redirected will
> > be dropped. I was wrong earlier on the tc block because that only
> > works on clsact and ingress which are fine not needing this lock.
> > Here's a variation of the earlier patch that may work but comes at the
> > cost of a new per-cpu increment on the qdisc.
> >
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3789,6 +3789,11 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> >         if (unlikely(contended))
> >                 spin_lock(&q->busylock);
> >
> > +       if (__this_cpu_read(q->recursion_xmit) > 0) {
> > +               //drop
> > +       }
> > +
> > +       __this_cpu_inc(q->recursion_xmit);
> >         spin_lock(root_lock);
> >         if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> >                 __qdisc_drop(skb, &to_free);
> > @@ -3825,6 +3830,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> >                 }
> >         }
> >         spin_unlock(root_lock);
> > +       __this_cpu_dec(q->recursion_xmit);
> >         if (unlikely(to_free))
> >                 kfree_skb_list_reason(to_free,
> >                                       tcf_get_drop_reason(to_free));
> >
>
> And here's a tested version(attached) that fixes both the A->A and A->B->A.

Sent a proper patch as RFC, Eric please take a look.

cheers,
jamal

> cheers,
> jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2024-03-26 23:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-14 11:17 [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress renmingshuai
2024-03-14 11:43 ` Jiri Pirko
2024-03-14 14:04   ` renmingshuai
2024-03-14 14:47     ` Pedro Tammela
2024-03-14 17:14 ` Jamal Hadi Salim
2024-03-15  1:56   ` renmingshuai
2024-03-15 22:34     ` Jamal Hadi Salim
2024-03-17 16:10   ` Jamal Hadi Salim
2024-03-18 14:27     ` Jamal Hadi Salim
2024-03-18 15:46       ` Eric Dumazet
2024-03-18 17:36         ` Jamal Hadi Salim
2024-03-18 19:11           ` Eric Dumazet
2024-03-18 22:05             ` Jamal Hadi Salim
2024-03-19  9:38               ` Eric Dumazet
2024-03-19 20:54                 ` Jamal Hadi Salim
2024-03-20 16:46                   ` Jamal Hadi Salim
2024-03-20 16:57                   ` Eric Dumazet
2024-03-20 17:31                     ` Jamal Hadi Salim
2024-03-20 17:50                       ` Jamal Hadi Salim
2024-03-20 18:13                         ` Eric Dumazet
2024-03-20 18:25                           ` Eric Dumazet
2024-03-20 19:34                             ` Jamal Hadi Salim
2024-03-24 15:27                               ` Jamal Hadi Salim
2024-03-26 23:18                                 ` Jamal Hadi Salim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).