* [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
@ 2024-03-14 11:17 renmingshuai
  2024-03-14 11:43 ` Jiri Pirko
  2024-03-14 17:14 ` Jamal Hadi Salim
  0 siblings, 2 replies; 24+ messages in thread

From: renmingshuai @ 2024-03-14 11:17 UTC (permalink / raw)
  To: jhs, xiyou.wangcong, jiri, davem, vladbu
  Cc: netdev, yanan, liaichun, caowangbao, renmingshuai

As we all know, the mirred action is used to mirror or redirect the
packets it receives. However, adding a mirred action to a filter attached
to an egress qdisc might cause a deadlock. To reproduce the problem,
perform the following steps:
(1) tc qdisc add dev eth0 root handle 1: htb default 30
(2) tc filter add dev eth2 protocol ip prio 2 flower verbose \
      action police rate 100mbit burst 12m conform-exceed jump 1 \
      / pipe mirred egress redirect dev eth2 action drop

The stack trace is shown below:
[28848.883915] _raw_spin_lock+0x1e/0x30
[28848.884367] __dev_queue_xmit+0x160/0x850
[28848.884851] ? 0xffffffffc031906a
[28848.885279] tcf_mirred_act+0x3ab/0x596 [act_mirred]
[28848.885863] tcf_action_exec.part.0+0x88/0x130
[28848.886401] fl_classify+0x1ca/0x1e0 [cls_flower]
[28848.886970] ? dequeue_entity+0x145/0x9e0
[28848.887464] ? newidle_balance+0x23f/0x2f0
[28848.887973] ? nft_lookup_eval+0x57/0x170 [nf_tables]
[28848.888566] ? nft_do_chain+0xef/0x430 [nf_tables]
[28848.889137] ? __flush_work.isra.0+0x35/0x80
[28848.889657] ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack]
[28848.890293] ? do_select+0x637/0x870
[28848.890735] tcf_classify+0x52/0xf0
[28848.891177] htb_classify+0x9d/0x1c0 [sch_htb]
[28848.891722] htb_enqueue+0x3a/0x1c0 [sch_htb]
[28848.892251] __dev_queue_xmit+0x2d8/0x850
[28848.892738] ? nf_hook_slow+0x3c/0xb0
[28848.893198] ip_finish_output2+0x272/0x580
[28848.893692] __ip_queue_xmit+0x193/0x420
[28848.894179] __tcp_transmit_skb+0x8cc/0x970

In this case, the process already holds the qdisc spin lock in
__dev_queue_xmit before the egress packets are mirred, and it attempts to
obtain the same spin lock again after the packets are mirred, which
causes a deadlock.

Fix the issue by forbidding assigning a mirred action to a filter
attached to the egress.

Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
---
 net/sched/act_mirred.c                        |  4 +++
 .../tc-testing/tc-tests/actions/mirred.json   | 32 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5b3814365924..fc96705285fb 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 		NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed");
 		return -EINVAL;
 	}
+	if (tp->chain->block->q->parent != TC_H_INGRESS) {
+		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
+		return -EINVAL;
+	}
 	ret = nla_parse_nested_deprecated(tb, TCA_MIRRED_MAX, nla,
 					  mirred_policy, extack);
 	if (ret < 0)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
index b73bd255ea36..50c6153bf34c 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
@@ -1052,5 +1052,37 @@
             "$TC qdisc del dev $DEV1 ingress_block 21 clsact",
             "$TC actions flush action mirred"
         ]
+    },
+    {
+        "id": "fdda",
+        "name": "Add mirred mirror to a filter attached to egress",
+        "category": [
+            "actions",
+            "mirred"
+        ],
+        "plugins": {
+            "requires": "nsPlugin"
+        },
+        "setup": [
+            [
+                "$TC actions flush action mirred",
+                0,
+                1,
+                255
+            ],
+            [
+                "$TC qdisc add dev $DEV1 root handle 1: htb default 1",
+                0
+            ]
+        ],
+        "cmdUnderTest": "$TC filter add dev $DEV1 protocol ip u32 match ip protocol 1 0xff action mirred egress mirror dev $DEV1",
+        "expExitCode": "2",
+        "verifyCmd": "$TC action list action mirred",
+        "matchPattern": "^[ \t]+index [0-9]+ ref",
+        "matchCount": "0",
+        "teardown": [
+            "$TC qdisc del dev $DEV1 root handle 1: htb default 1",
+            "$TC actions flush action mirred"
+        ]
     }
 ]
--
2.33.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread
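[Editor's note] The deadlock in the trace above — __dev_queue_xmit takes the
qdisc root lock, classification under that lock runs mirred, and mirred
re-enters __dev_queue_xmit on the same device — can be modeled outside the
kernel. The following is a minimal Python sketch, not kernel code; all names
are illustrative, and a non-blocking acquire stands in for the spinlock so
the demo reports the deadlock instead of hanging:

```python
import threading

class Qdisc:
    """Toy model of an egress qdisc protected by a non-reentrant root lock."""
    def __init__(self, name):
        self.name = name
        self.root_lock = threading.Lock()  # like the qdisc spinlock: not reentrant

def mirred_redirect(skb, target, depth):
    # act_mirred hands the packet straight back to the target device's
    # transmit path -- on the SAME thread, while the caller still holds
    # the root lock of the qdisc it was called from.
    return dev_queue_xmit(skb, target, depth + 1)

def dev_queue_xmit(skb, qdisc, depth=0):
    # A second acquisition of the same non-reentrant lock on this thread
    # would block forever; detect it instead of hanging the demo.
    if not qdisc.root_lock.acquire(blocking=False):
        return "DEADLOCK"   # in the kernel: the CPU spins on a lock it owns
    try:
        # "Classification" runs under the lock; an egress mirred redirect
        # back to the same device re-enters dev_queue_xmit right here.
        if skb.get("redirect_to") is qdisc and depth == 0:
            return mirred_redirect(skb, qdisc, depth)
        return "SENT"
    finally:
        qdisc.root_lock.release()

eth0 = Qdisc("eth0")
print(dev_queue_xmit({"redirect_to": eth0}, eth0))  # DEADLOCK
print(dev_queue_xmit({}, eth0))                     # SENT
```

A plain transmit succeeds, while the self-redirect hits the already-held
lock on the same call stack — the same shape as the hang in the report.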
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 11:17 [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress renmingshuai
@ 2024-03-14 11:43 ` Jiri Pirko
  2024-03-14 14:04   ` renmingshuai
  2024-03-14 17:14 ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread

From: Jiri Pirko @ 2024-03-14 11:43 UTC (permalink / raw)
  To: renmingshuai
  Cc: jhs, xiyou.wangcong, davem, vladbu, netdev, yanan, liaichun,
      caowangbao

Thu, Mar 14, 2024 at 12:17:13PM CET, renmingshuai@huawei.com wrote:
>As we all know the mirred action is used to mirroring or redirecting the
>packet it receives. Howerver, add mirred action to a filter attached to
>a egress qdisc might cause a deadlock.
[...]
>Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>

You are missing "Fixes" tag.

[...]
>+	if (tp->chain->block->q->parent != TC_H_INGRESS) {
>+		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");

Hmm, that is quite restrictive. I'm pretty sure you would break some
valid usecases.

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 11:43 ` Jiri Pirko
@ 2024-03-14 14:04   ` renmingshuai
  2024-03-14 14:47     ` Pedro Tammela
  0 siblings, 1 reply; 24+ messages in thread

From: renmingshuai @ 2024-03-14 14:04 UTC (permalink / raw)
  To: jiri
  Cc: caowangbao, davem, jhs, liaichun, netdev, renmingshuai, vladbu,
      xiyou.wangcong, yanan

[...]
>>+	if (tp->chain->block->q->parent != TC_H_INGRESS) {
>>+		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
>
>Hmm, that is quite restrictive. I'm pretty sure you would break some
>valid usecases.

Hmm, that is really quite restrictive. It might be better to forbid
mirred attached to an egress filter from mirroring or redirecting packets
to the egress.

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index fc96705285fb..ab3841470992 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -152,6 +152,11 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 	}

+	if ((tp->chain->block->q->parent != TC_H_INGRESS) &&
+	    (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
+		NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
+		return -EINVAL;
+	}
+
 	switch (parm->eaction) {
 	case TCA_EGRESS_MIRROR:
 	case TCA_EGRESS_REDIR:

^ permalink raw reply related	[flat|nested] 24+ messages in thread
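[Editor's note] The refined check in the reply above reduces to a predicate
on two inputs: where the filter is attached, and which mirred variant is
requested. A hedged Python model of that rule — the numeric constants match
the values in the kernel's uapi headers (pkt_sched.h, tc_mirred.h), but the
function itself is purely illustrative:

```python
# TC_H_INGRESS from include/uapi/linux/pkt_sched.h;
# eaction values from include/uapi/linux/tc_act/tc_mirred.h.
TC_H_INGRESS = 0xFFFFFFF1
TCA_EGRESS_REDIR, TCA_EGRESS_MIRROR = 1, 2
TCA_INGRESS_REDIR, TCA_INGRESS_MIRROR = 3, 4

def mirred_allowed(qdisc_parent, eaction):
    """Model of the proposed rule: a filter on an egress qdisc may not
    use an egress-direction mirred action (mirror or redirect)."""
    attached_to_ingress = (qdisc_parent == TC_H_INGRESS)
    egress_action = eaction in (TCA_EGRESS_MIRROR, TCA_EGRESS_REDIR)
    return attached_to_ingress or not egress_action

# Egress filter + egress redirect: rejected (the deadlock case).
print(mirred_allowed(0x10010, TCA_EGRESS_REDIR))    # False
# Ingress filter + egress redirect: still allowed.
print(mirred_allowed(TC_H_INGRESS, TCA_EGRESS_REDIR))  # True
```

As the rest of the thread argues, even this predicate is broader than the
actual bug (it also rejects redirects to *other* devices), so treat it as a
model of the proposal, not of the eventual fix.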
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 14:04   ` renmingshuai
@ 2024-03-14 14:47     ` Pedro Tammela
  0 siblings, 0 replies; 24+ messages in thread

From: Pedro Tammela @ 2024-03-14 14:47 UTC (permalink / raw)
  To: renmingshuai, jiri
  Cc: caowangbao, davem, jhs, liaichun, netdev, vladbu, xiyou.wangcong,
      yanan

On 14/03/2024 11:04, renmingshuai wrote:
[...]
>> Hmm, that is quite restrictive. I'm pretty sure you would break some
>> valid usecases.
> Hmm, that is really quite restrictive. It might be better to forbid
> mirred attached to an egress filter from mirroring or redirecting
> packets to the egress.
>
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -152,6 +152,11 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
>  		return -EINVAL;
>  	}
>
> +	if ((tp->chain->block->q->parent != TC_H_INGRESS) &&
> +	    (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
> +		NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
> +		return -EINVAL;
> +	}
> +
>  	switch (parm->eaction) {
>  	case TCA_EGRESS_MIRROR:
>  	case TCA_EGRESS_REDIR:

Are you sure this happens for egress mirred to netdevs other than the one
used to attach the filter? It seems like in your example you are forcing
mirred egress to the same netdev as the filter.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 11:17 [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress renmingshuai
  2024-03-14 11:43 ` Jiri Pirko
@ 2024-03-14 17:14 ` Jamal Hadi Salim
  2024-03-15  1:56   ` renmingshuai
  2024-03-17 16:10   ` Jamal Hadi Salim
  1 sibling, 2 replies; 24+ messages in thread

From: Jamal Hadi Salim @ 2024-03-14 17:14 UTC (permalink / raw)
  To: renmingshuai
  Cc: xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun,
      caowangbao

On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
>
> As we all know the mirred action is used to mirroring or redirecting the
> packet it receives. Howerver, add mirred action to a filter attached to
> a egress qdisc might cause a deadlock. To reproduce the problem, perform
> the following steps:
> (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
> (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
>    action police rate 100mbit burst 12m conform-exceed jump 1 \
>    / pipe mirred egress redirect dev eth2 action drop
>

I think you meant both to be the same device eth0 or eth2?

> The stack is show as below:
[...]
> In this case, the process has hold the qdisc spin lock in __dev_queue_xmit
> before the egress packets are mirred, and it will attempt to obtain the
> spin lock again after packets are mirred, which cause a deadlock.
>
> Fix the issue by forbidding assigning mirred action to a filter attached
> to the egress.
>
> Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
[...]
> +	if (tp->chain->block->q->parent != TC_H_INGRESS) {
> +		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> +		return -EINVAL;
> +	}

Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
am almost certain this used to work in the old days.

cheers,
jamal

PS:- thanks for the tdc test, you are a hero just for submitting that!

[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 17:14 ` Jamal Hadi Salim
@ 2024-03-15  1:56   ` renmingshuai
  2024-03-15 22:34     ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread

From: renmingshuai @ 2024-03-15 1:56 UTC (permalink / raw)
  To: jhs
  Cc: caowangbao, davem, jiri, liaichun, netdev, renmingshuai, vladbu,
      xiyou.wangcong, yanan

>> As we all know the mirred action is used to mirroring or redirecting the
>> packet it receives. Howerver, add mirred action to a filter attached to
>> a egress qdisc might cause a deadlock. To reproduce the problem, perform
>> the following steps:
>> (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n
>> (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \
>>    action police rate 100mbit burst 12m conform-exceed jump 1 \
>>    / pipe mirred egress redirect dev eth2 action drop
>>
>
>I think you meant both to be the same device eth0 or eth2?

Sorry, a careless mistake. eth2 in step 2 should be eth0.
(2) tc filter add dev eth0 protocol ip prio 2 flower verbose \
      action police rate 100mbit burst 12m conform-exceed jump 1 \
      / pipe mirred egress redirect dev eth0 action drop

[...]
>Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
>am almost certain this used to work in the old days.
>
>cheers,
>jamal
>
>PS:- thanks for the tdc test, you are a hero just for submitting that!

As Jiri said, that is really quite restrictive. It might be better to
forbid mirred attached to an egress filter from mirroring or redirecting
packets to the egress. Just like:

--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -152,6 +152,12 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla,
 		return -EINVAL;
 	}

+	if ((tp->chain->block->q->parent != TC_H_INGRESS) &&
+	    (parm->eaction == TCA_EGRESS_MIRROR || parm->eaction == TCA_EGRESS_REDIR)) {
+		NL_SET_ERR_MSG_MOD(extack, "Mirred assigned to egress filter can only mirror or redirect to ingress");
+		return -EINVAL;
+	}
+
 	switch (parm->eaction) {
 	case TCA_EGRESS_MIRROR:
 	case TCA_EGRESS_REDIR:

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-15  1:56   ` renmingshuai
@ 2024-03-15 22:34     ` Jamal Hadi Salim
  0 siblings, 0 replies; 24+ messages in thread

From: Jamal Hadi Salim @ 2024-03-15 22:34 UTC (permalink / raw)
  To: renmingshuai
  Cc: caowangbao, davem, jiri, liaichun, netdev, vladbu, xiyou.wangcong,
      yanan

On Thu, Mar 14, 2024 at 10:10 PM renmingshuai <renmingshuai@huawei.com> wrote:
[...]
> As Jiri said, that is really quite restrictive. It might be better to
> forbid mirred attached to an egress filter from mirroring or redirecting
> packets to the egress. Just like:
>

Please, no. You are continuing to restrict many valid use cases. We'll
look at it this weekend.

cheers,
jamal

> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
[...]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-14 17:14 ` Jamal Hadi Salim
  2024-03-15  1:56   ` renmingshuai
@ 2024-03-17 16:10   ` Jamal Hadi Salim
  2024-03-18 14:27     ` Jamal Hadi Salim
  1 sibling, 1 reply; 24+ messages in thread

From: Jamal Hadi Salim @ 2024-03-17 16:10 UTC (permalink / raw)
  To: renmingshuai
  Cc: xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun,
      caowangbao

On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
[...]
> > +	if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > +		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > +		return -EINVAL;
> > +	}
>
> Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I
> am almost certain this used to work in the old days.

Ok, i looked at old notes - it did work at "some point" pre-tdc.
Conclusion is things broke around this time frame:
https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/

Looking further into it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-17 16:10             ` Jamal Hadi Salim
@ 2024-03-18 14:27               ` Jamal Hadi Salim
  2024-03-18 15:46                 ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-18 14:27 UTC (permalink / raw)
  To: renmingshuai
  Cc: xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun,
	caowangbao, Eric Dumazet, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Victor Nogueira, Pedro Tammela

On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote:
> > >
> > > As we all know, the mirred action is used to mirror or redirect the
> > > packet it receives. However, adding a mirred action to a filter attached
> > > to an egress qdisc might cause a deadlock. To reproduce the problem,
> > > perform the following steps:
> > > (1) tc qdisc add dev eth0 root handle 1: htb default 30
> > > (2) tc filter add dev eth2 protocol ip prio 2 flower verbose \
> > >     action police rate 100mbit burst 12m conform-exceed jump 1 \
> > >     / pipe mirred egress redirect dev eth2 action drop
> >
> > I think you meant both to be the same device, eth0 or eth2?
> >
> > > The stack is shown below:
> > > [...]
> > >
> > > In this case, the process already holds the qdisc spin lock in
> > > __dev_queue_xmit before the egress packets are mirred, and it will
> > > attempt to obtain the spin lock again after the packets are mirred,
> > > which causes a deadlock.
> > >
> > > Fix the issue by forbidding assigning a mirred action to a filter
> > > attached to the egress.
> > >
> > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com>
> > > [...]
> > > +	if (tp->chain->block->q->parent != TC_H_INGRESS) {
> > > +		NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress");
> > > +		return -EINVAL;
> > > +	}
> > > [...]
> >
> > Sorry, this is too restrictive, as Jiri said. We'll try to reproduce. I
> > am almost certain this used to work in the old days.
>
> Ok, i looked at old notes - it did work at "some point" pre-tdc.
> Conclusion is things broke around this time frame:
> https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
>
> Looking further into it.

This is what we came up with. Eric, please take a look...

cheers,
jamal

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	if (unlikely(contended))
 		spin_lock(&q->busylock);

+	if (dev_recursion_level()) {
+		rc = NET_XMIT_DROP;
+		__qdisc_drop(skb, &to_free);
+		goto free;
+	}
+
 	spin_lock(root_lock);
+	dev_xmit_recursion_inc();
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
 		rc = NET_XMIT_DROP;
@@ -3824,8 +3831,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 			qdisc_run_end(q);
 		}
 	}
+	dev_xmit_recursion_dec();
 	spin_unlock(root_lock);
 	if (unlikely(to_free))
+free:
 		kfree_skb_list_reason(to_free,
 				      tcf_get_drop_reason(to_free));
 	if (unlikely(contended))

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 14:27               ` Jamal Hadi Salim
@ 2024-03-18 15:46                 ` Eric Dumazet
  2024-03-18 17:36                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-18 15:46 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela

On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [...]
>
> This is what we came up with. Eric, please take a look...
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>  	if (unlikely(contended))
>  		spin_lock(&q->busylock);
>
> +	if (dev_recursion_level()) {

I am not sure what your intent is, but this seems wrong to me.

Some valid setups use:

A bonding device, with HTB qdisc (or other qdisc)
(This also could be a tunnel device with a qdisc)

-> one or multiple physical NICs, with FQ or other qdisc.

Packets would be dropped here when we try to reach the physical device.

> +		rc = NET_XMIT_DROP;
> +		__qdisc_drop(skb, &to_free);
> +		goto free;
> +	}
> +
>  	spin_lock(root_lock);
> +	dev_xmit_recursion_inc();
> [...]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 15:46                 ` Eric Dumazet
@ 2024-03-18 17:36                   ` Jamal Hadi Salim
  2024-03-18 19:11                     ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-18 17:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela

On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
>
> I am not sure what your intent is, but this seems wrong to me.

There is a deadlock if you reenter the same device, which has a qdisc
attached to it, more than once.
Essentially, entering __dev_xmit_skb() we grab the root qdisc lock, then
run some action which requires it to grab the root qdisc lock (again).
This is easy to show with mirred (although i am wondering if syzkaller
may have produced this at some point)..

$TC qdisc add dev $DEV root handle 1: htb default 1
$TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff \
	action mirred egress mirror dev $DEV

The above example is essentially egress $DEV -> egress $DEV; in both
cases "egress $DEV" grabs the root qdisc lock. You could also create
another example with egress($DEV1 -> $DEV2 -> back to $DEV1).

> Some valid setups use:
>
> A bonding device, with HTB qdisc (or other qdisc)
> (This also could be a tunnel device with a qdisc)
>
> -> one or multiple physical NICs, with FQ or other qdisc.
>
> Packets would be dropped here when we try to reach the physical device.

If you have an example handy please send it. I am trying to imagine
how those would have worked if they have to reenter the root qdisc of
the same dev multiple times..

cheers,
jamal

> [...]

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 17:36                   ` Jamal Hadi Salim
@ 2024-03-18 19:11                     ` Eric Dumazet
  2024-03-18 22:05                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-18 19:11 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela

On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [...]
>
> There is a deadlock if you reenter the same device, which has a qdisc
> attached to it, more than once.
> Essentially, entering __dev_xmit_skb() we grab the root qdisc lock, then
> run some action which requires it to grab the root qdisc lock (again).
> This is easy to show with mirred (although i am wondering if syzkaller
> may have produced this at some point)..
>
> $TC qdisc add dev $DEV root handle 1: htb default 1
> $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff \
> 	action mirred egress mirror dev $DEV
>
> The above example is essentially egress $DEV -> egress $DEV; in both
> cases "egress $DEV" grabs the root qdisc lock. You could also create
> another example with egress($DEV1 -> $DEV2 -> back to $DEV1).
>
> If you have an example handy please send it. I am trying to imagine
> how those would have worked if they have to reenter the root qdisc of
> the same dev multiple times..

Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it?

dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.

A stack of devices A -> B -> C would elevate the recursion level to
three just fine.

After your patch, a stack of devices would no longer work.

It seems mirred correctly injects packets to the top of the stack for
ingress (via netif_rx() / netif_receive_skb()), but thinks it is okay
to call dev_queue_xmit(), regardless of the context?

Perhaps safe-guard mirred, instead of adding more code to the fast path.

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 19:11                     ` Eric Dumazet
@ 2024-03-18 22:05                       ` Jamal Hadi Salim
  2024-03-19  9:38                         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Jamal Hadi Salim @ 2024-03-18 22:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
>
> Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it?
>
> dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
>
> A stack of devices A -> B -> C would elevate the recursion level to
> three just fine.
>
> After your patch, a stack of devices would no longer work.
>
> It seems mirred correctly injects packets to the top of the stack for
> ingress (via netif_rx() / netif_receive_skb()), but thinks it is okay
> to call dev_queue_xmit(), regardless of the context?
>
> Perhaps safe-guard mirred, instead of adding more code to the fast path.

I agree not to penalize everybody for a "bad config" like this
(surprising syzkaller hasn't caught this). But i don't see how doing
the checking within mirred will catch this (we can't detect the
A->B->A case). I think you are suggesting a backlog-like queue for
mirred? Not far off from that is how it used to work before
(https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
- i.e. we had a trylock for the qdisc lock, and if it failed we tagged
the rx softirq for a reschedule. That in itself is insufficient; we
would need a loop check which is per-skb (which we had before
https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
There are other gotchas there, potentially packet reordering.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-18 22:05                       ` Jamal Hadi Salim
@ 2024-03-19  9:38                         ` Eric Dumazet
  2024-03-19 20:54                           ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-19  9:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [...]
> > > > > > Sorry, this is too restrictive, as Jiri said. We'll try to reproduce. I
> > > > > > am almost certain this used to work in the old days.
> > > > >
> > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > Conclusion is things broke around this time frame:
> > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > >
> > > > > Looking further into it.
> > > >
> > > > This is what we came up with. Eric, please take a look...
> > > > > > > > > > cheers, > > > > > jamal > > > > > > > > > > > > > > > --- a/net/core/dev.c > > > > > +++ b/net/core/dev.c > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff > > > > > *skb, struct Qdisc *q, > > > > > if (unlikely(contended)) > > > > > spin_lock(&q->busylock); > > > > > > > > > > + if (dev_recursion_level()) { > > > > > > > > I am not sure what your intent is, but this seems wrong to me. > > > > > > > > > > There is a deadlock if you reenter the same device which has a qdisc > > > attached to it more than once. > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then > > > run some action which requires it to grab the root qdisc lock (again). > > > This is easy to show with mirred (although i am wondering if syzkaller > > > may have produced this at some point).. > > > $TC qdisc add dev $DEV root handle 1: htb default 1 > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff > > > action mirred egress mirror dev $DEV > > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases > > > "egress $DEV" grabs the root qdisc lock. You could also create another > > > example with egress($DEV1->$DEV2->back to $DEV1). > > > > > > > Some valid setup use : > > > > > > > > A bonding device, with HTB qdisc (or other qdisc) > > > > (This also could be a tunnel device with a qdisc) > > > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc. > > > > > > > > Packets would be dropped here when we try to reach the physical device. > > > > > > > > > > If you have an example handy please send it. I am trying to imagine > > > how those would have worked if they have to reenter the root qdisc of > > > the same dev multiple times.. > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ? > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device. 
> > > > A stack of devices A -> B -> C would elevate the recursion level to > > three just fine. > > > > After your patch, a stack of devices would no longer work. > > > > It seems mirred correctly injects packets to the top of the stack for > > ingress (via netif_rx() / netif_receive_skb()), > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ? > > > > Perhaps safe-guard mirred, instead of adding more code to fast path. > > I agree not to penalize everybody for a "bad config" like this > (surprising syzkaller hasnt caught this). But i dont see how doing the > checking within mirred will catch this (we cant detect the A->B->A > case). > I think you are suggesting a backlog-like queue for mirred? Not far > off from that is how it used to work before > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/) spin_trylock() had to go. There is no way we could keep this. > - i.e we had a trylock for the qdisc lock and if it failed we tagged > the rx softirq for a reschedule. That in itself is insufficient, we > would need a loop check which is per-skb (which we had before > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/). > There are other gotchas there, potentially packet reordering. If we want to make sure dev_queue_xmit() is called from the top (no spinlock held), then we need a queue, serviced from another context. This extra queueing could happen if __this_cpu_read(softnet_data.xmit.recursion) > 0 This can be done from net/sched/act_mirred.c with no additional change in net/core/dev.c, and no new skb field. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-19 9:38 ` Eric Dumazet @ 2024-03-19 20:54 ` Jamal Hadi Salim 2024-03-20 16:46 ` Jamal Hadi Salim 2024-03-20 16:57 ` Eric Dumazet 0 siblings, 2 replies; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-19 20:54 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote: > > > > > > > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to > > > > > > > > > a egress qdisc might cause a deadlock. 
To reproduce the problem, perform > > > > > > > > > the following steps: > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \ > > > > > > > > > action police rate 100mbit burst 12m conform-exceed jump 1 \ > > > > > > > > > / pipe mirred egress redirect dev eth2 action drop > > > > > > > > > > > > > > > > > > > > > > > > > I think you meant both to be the same device eth0 or eth2? > > > > > > > > > > > > > > > > > The stack is show as below: > > > > > > > > > [28848.883915] _raw_spin_lock+0x1e/0x30 > > > > > > > > > [28848.884367] __dev_queue_xmit+0x160/0x850 > > > > > > > > > [28848.884851] ? 0xffffffffc031906a > > > > > > > > > [28848.885279] tcf_mirred_act+0x3ab/0x596 [act_mirred] > > > > > > > > > [28848.885863] tcf_action_exec.part.0+0x88/0x130 > > > > > > > > > [28848.886401] fl_classify+0x1ca/0x1e0 [cls_flower] > > > > > > > > > [28848.886970] ? dequeue_entity+0x145/0x9e0 > > > > > > > > > [28848.887464] ? newidle_balance+0x23f/0x2f0 > > > > > > > > > [28848.887973] ? nft_lookup_eval+0x57/0x170 [nf_tables] > > > > > > > > > [28848.888566] ? nft_do_chain+0xef/0x430 [nf_tables] > > > > > > > > > [28848.889137] ? __flush_work.isra.0+0x35/0x80 > > > > > > > > > [28848.889657] ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack] > > > > > > > > > [28848.890293] ? do_select+0x637/0x870 > > > > > > > > > [28848.890735] tcf_classify+0x52/0xf0 > > > > > > > > > [28848.891177] htb_classify+0x9d/0x1c0 [sch_htb] > > > > > > > > > [28848.891722] htb_enqueue+0x3a/0x1c0 [sch_htb] > > > > > > > > > [28848.892251] __dev_queue_xmit+0x2d8/0x850 > > > > > > > > > [28848.892738] ? 
nf_hook_slow+0x3c/0xb0 > > > > > > > > > [28848.893198] ip_finish_output2+0x272/0x580 > > > > > > > > > [28848.893692] __ip_queue_xmit+0x193/0x420 > > > > > > > > > [28848.894179] __tcp_transmit_skb+0x8cc/0x970 > > > > > > > > > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock. > > > > > > > > > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached > > > > > > > > > to the egress. > > > > > > > > > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com> > > > > > > > > > --- > > > > > > > > > net/sched/act_mirred.c | 4 +++ > > > > > > > > > .../tc-testing/tc-tests/actions/mirred.json | 32 +++++++++++++++++++ > > > > > > > > > 2 files changed, 36 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > > > > > > > > > index 5b3814365924..fc96705285fb 100644 > > > > > > > > > --- a/net/sched/act_mirred.c > > > > > > > > > +++ b/net/sched/act_mirred.c > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, > > > > > > > > > NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed"); > > > > > > > > > return -EINVAL; > > > > > > > > > } > > > > > > > > > + if (tp->chain->block->q->parent != TC_H_INGRESS) { > > > > > > > > > + NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress"); > > > > > > > > > + return -EINVAL; > > > > > > > > > + } > > > > > > > > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I > > > > > > > > am almost certain this used to work in the old days. > > > > > > > > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc. 
> > > > > > > Conclusion is things broke around this time frame: > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/ > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/ > > > > > > > > > > > > > > Looking further into it. > > > > > > > > > > > > This is what we came up with. Eric, please take a look... > > > > > > > > > > > > cheers, > > > > > > jamal > > > > > > > > > > > > > > > > > > --- a/net/core/dev.c > > > > > > +++ b/net/core/dev.c > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff > > > > > > *skb, struct Qdisc *q, > > > > > > if (unlikely(contended)) > > > > > > spin_lock(&q->busylock); > > > > > > > > > > > > + if (dev_recursion_level()) { > > > > > > > > > > I am not sure what your intent is, but this seems wrong to me. > > > > > > > > > > > > > There is a deadlock if you reenter the same device which has a qdisc > > > > attached to it more than once. > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then > > > > run some action which requires it to grab the root qdisc lock (again). > > > > This is easy to show with mirred (although i am wondering if syzkaller > > > > may have produced this at some point).. > > > > $TC qdisc add dev $DEV root handle 1: htb default 1 > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff > > > > action mirred egress mirror dev $DEV > > > > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases > > > > "egress $DEV" grabs the root qdisc lock. You could also create another > > > > example with egress($DEV1->$DEV2->back to $DEV1). > > > > > > > > > Some valid setup use : > > > > > > > > > > A bonding device, with HTB qdisc (or other qdisc) > > > > > (This also could be a tunnel device with a qdisc) > > > > > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc. 
> > > > > > > > > > Packets would be dropped here when we try to reach the physical device. > > > > > > > > > > > > > If you have an example handy please send it. I am trying to imagine > > > > how those would have worked if they have to reenter the root qdisc of > > > > the same dev multiple times.. > > > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ? > > > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device. > > > > > > A stack of devices A -> B -> C would elevate the recursion level to > > > three just fine. > > > > > > After your patch, a stack of devices would no longer work. > > > > > > It seems mirred correctly injects packets to the top of the stack for > > > ingress (via netif_rx() / netif_receive_skb()), > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ? > > > > > > Perhaps safe-guard mirred, instead of adding more code to fast path. > > > > I agree not to penalize everybody for a "bad config" like this > > (surprising syzkaller hasnt caught this). But i dont see how doing the > > checking within mirred will catch this (we cant detect the A->B->A > > case). > > I think you are suggesting a backlog-like queue for mirred? Not far > > off from that is how it used to work before > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/) > > > spin_trylock() had to go. There is no way we could keep this. > Not asking for it to come back... just pointing out why it worked before. > > - i.e we had a trylock for the qdisc lock and if it failed we tagged > > the rx softirq for a reschedule. That in itself is insufficient, we > > would need a loop check which is per-skb (which we had before > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/). > > There are other gotchas there, potentially packet reordering. 
> > If we want to make sure dev_queue_xmit() is called from the top (no > spinlock held), > then we need a queue, serviced from another context. > > This extra queueing could happen if > __this_cpu_read(softnet_data.xmit.recursion) > 0 > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred level. The first time we enter it will be 0. The second time we would deadlock before we hit mirred. I am also worried about packet reordering avoidance logic code amount needed just to deal with a buggy configuration. Dropping the packet for this exact config should be ok. How about adding a per-qdisc per-cpu recursion counter? Incremented when we enter the cpu from the top, decremented when we leave. If we detect __this_cpu_read(qdisc.xmit_recursion) > 0 we drop as per that patch? cheers, jamal > This can be done from net/sched/act_mirred.c with no additional change > in net/core/dev.c, > and no new skb field. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-19 20:54 ` Jamal Hadi Salim @ 2024-03-20 16:46 ` Jamal Hadi Salim 2024-03-20 16:57 ` Eric Dumazet 1 sibling, 0 replies; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-20 16:46 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Tue, Mar 19, 2024 at 4:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote: > > > > > > > > > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to > > > > > > > > > > a egress qdisc might cause a deadlock. 
To reproduce the problem, perform > > > > > > > > > > the following steps: > > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n > > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \ > > > > > > > > > > action police rate 100mbit burst 12m conform-exceed jump 1 \ > > > > > > > > > > / pipe mirred egress redirect dev eth2 action drop > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think you meant both to be the same device eth0 or eth2? > > > > > > > > > > > > > > > > > > > The stack is show as below: > > > > > > > > > > [28848.883915] _raw_spin_lock+0x1e/0x30 > > > > > > > > > > [28848.884367] __dev_queue_xmit+0x160/0x850 > > > > > > > > > > [28848.884851] ? 0xffffffffc031906a > > > > > > > > > > [28848.885279] tcf_mirred_act+0x3ab/0x596 [act_mirred] > > > > > > > > > > [28848.885863] tcf_action_exec.part.0+0x88/0x130 > > > > > > > > > > [28848.886401] fl_classify+0x1ca/0x1e0 [cls_flower] > > > > > > > > > > [28848.886970] ? dequeue_entity+0x145/0x9e0 > > > > > > > > > > [28848.887464] ? newidle_balance+0x23f/0x2f0 > > > > > > > > > > [28848.887973] ? nft_lookup_eval+0x57/0x170 [nf_tables] > > > > > > > > > > [28848.888566] ? nft_do_chain+0xef/0x430 [nf_tables] > > > > > > > > > > [28848.889137] ? __flush_work.isra.0+0x35/0x80 > > > > > > > > > > [28848.889657] ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack] > > > > > > > > > > [28848.890293] ? do_select+0x637/0x870 > > > > > > > > > > [28848.890735] tcf_classify+0x52/0xf0 > > > > > > > > > > [28848.891177] htb_classify+0x9d/0x1c0 [sch_htb] > > > > > > > > > > [28848.891722] htb_enqueue+0x3a/0x1c0 [sch_htb] > > > > > > > > > > [28848.892251] __dev_queue_xmit+0x2d8/0x850 > > > > > > > > > > [28848.892738] ? 
nf_hook_slow+0x3c/0xb0 > > > > > > > > > > [28848.893198] ip_finish_output2+0x272/0x580 > > > > > > > > > > [28848.893692] __ip_queue_xmit+0x193/0x420 > > > > > > > > > > [28848.894179] __tcp_transmit_skb+0x8cc/0x970 > > > > > > > > > > > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit > > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the > > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock. > > > > > > > > > > > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached > > > > > > > > > > to the egress. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com> > > > > > > > > > > --- > > > > > > > > > > net/sched/act_mirred.c | 4 +++ > > > > > > > > > > .../tc-testing/tc-tests/actions/mirred.json | 32 +++++++++++++++++++ > > > > > > > > > > 2 files changed, 36 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > > > > > > > > > > index 5b3814365924..fc96705285fb 100644 > > > > > > > > > > --- a/net/sched/act_mirred.c > > > > > > > > > > +++ b/net/sched/act_mirred.c > > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, > > > > > > > > > > NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed"); > > > > > > > > > > return -EINVAL; > > > > > > > > > > } > > > > > > > > > > + if (tp->chain->block->q->parent != TC_H_INGRESS) { > > > > > > > > > > + NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress"); > > > > > > > > > > + return -EINVAL; > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I > > > > > > > > > am almost certain this used to work in the old days. 
> > > > > > > > > > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc. > > > > > > > > Conclusion is things broke around this time frame: > > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/ > > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/ > > > > > > > > > > > > > > > > Looking further into it. > > > > > > > > > > > > > > This is what we came up with. Eric, please take a look... > > > > > > > > > > > > > > cheers, > > > > > > > jamal > > > > > > > > > > > > > > > > > > > > > --- a/net/core/dev.c > > > > > > > +++ b/net/core/dev.c > > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff > > > > > > > *skb, struct Qdisc *q, > > > > > > > if (unlikely(contended)) > > > > > > > spin_lock(&q->busylock); > > > > > > > > > > > > > > + if (dev_recursion_level()) { > > > > > > > > > > > > I am not sure what your intent is, but this seems wrong to me. > > > > > > > > > > > > > > > > There is a deadlock if you reenter the same device which has a qdisc > > > > > attached to it more than once. > > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then > > > > > run some action which requires it to grab the root qdisc lock (again). > > > > > This is easy to show with mirred (although i am wondering if syzkaller > > > > > may have produced this at some point).. > > > > > $TC qdisc add dev $DEV root handle 1: htb default 1 > > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff > > > > > action mirred egress mirror dev $DEV > > > > > > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases > > > > > "egress $DEV" grabs the root qdisc lock. You could also create another > > > > > example with egress($DEV1->$DEV2->back to $DEV1). 
> > > > > > > > > > > Some valid setup use : > > > > > > > > > > > > A bonding device, with HTB qdisc (or other qdisc) > > > > > > (This also could be a tunnel device with a qdisc) > > > > > > > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc. > > > > > > > > > > > > Packets would be dropped here when we try to reach the physical device. > > > > > > > > > > > > > > > > If you have an example handy please send it. I am trying to imagine > > > > > how those would have worked if they have to reenter the root qdisc of > > > > > the same dev multiple times.. > > > > > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ? > > > > > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device. > > > > > > > > A stack of devices A -> B -> C would elevate the recursion level to > > > > three just fine. > > > > > > > > After your patch, a stack of devices would no longer work. > > > > > > > > It seems mirred correctly injects packets to the top of the stack for > > > > ingress (via netif_rx() / netif_receive_skb()), > > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ? > > > > > > > > Perhaps safe-guard mirred, instead of adding more code to fast path. > > > > > > I agree not to penalize everybody for a "bad config" like this > > > (surprising syzkaller hasnt caught this). But i dont see how doing the > > > checking within mirred will catch this (we cant detect the A->B->A > > > case). > > > I think you are suggesting a backlog-like queue for mirred? Not far > > > off from that is how it used to work before > > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/) > > > > > > spin_trylock() had to go. There is no way we could keep this. > > > > Not asking for it to come back... just pointing out why it worked before. > > > > - i.e we had a trylock for the qdisc lock and if it failed we tagged > > > the rx softirq for a reschedule. 
That in itself is insufficient, we > > > would need a loop check which is per-skb (which we had before > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/). > > > There are other gotchas there, potentially packet reordering. > > > > If we want to make sure dev_queue_xmit() is called from the top (no > > spinlock held), > > then we need a queue, serviced from another context. > > > > This extra queueing could happen if > > __this_cpu_read(softnet_data.xmit.recursion) > 0 > > > > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred > level. The first time we enter it will be 0. The second time we would > deadlock before we hit mirred. I am also worried about packet > reordering avoidance logic code amount needed just to deal with a > buggy configuration. Dropping the packet for this exact config should > be ok. > How about adding a per-qdisc per-cpu recursion counter? Incremented > when we enter the cpu from the top, decremented when we leave. If we > detect __this_cpu_read(qdisc.xmit_recursion) > 0 we drop as per that > patch? > Actually scratch this idea. It wont work since we added tc block support in mirred which complicates things further. Need some more thinking... cheers, jamal > cheers, > jamal > > This can be done from net/sched/act_mirred.c with no additional change > > in net/core/dev.c, > > and no new skb field. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-19 20:54 ` Jamal Hadi Salim 2024-03-20 16:46 ` Jamal Hadi Salim @ 2024-03-20 16:57 ` Eric Dumazet 2024-03-20 17:31 ` Jamal Hadi Salim 1 sibling, 1 reply; 24+ messages in thread From: Eric Dumazet @ 2024-03-20 16:57 UTC (permalink / raw) To: Jamal Hadi Salim Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote: > > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote: > > > > > > > > > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to > > > > > > > > > > a egress qdisc might cause a deadlock. 
To reproduce the problem, perform > > > > > > > > > > the following steps: > > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n > > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \ > > > > > > > > > > action police rate 100mbit burst 12m conform-exceed jump 1 \ > > > > > > > > > > / pipe mirred egress redirect dev eth2 action drop > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think you meant both to be the same device eth0 or eth2? > > > > > > > > > > > > > > > > > > > The stack is show as below: > > > > > > > > > > [28848.883915] _raw_spin_lock+0x1e/0x30 > > > > > > > > > > [28848.884367] __dev_queue_xmit+0x160/0x850 > > > > > > > > > > [28848.884851] ? 0xffffffffc031906a > > > > > > > > > > [28848.885279] tcf_mirred_act+0x3ab/0x596 [act_mirred] > > > > > > > > > > [28848.885863] tcf_action_exec.part.0+0x88/0x130 > > > > > > > > > > [28848.886401] fl_classify+0x1ca/0x1e0 [cls_flower] > > > > > > > > > > [28848.886970] ? dequeue_entity+0x145/0x9e0 > > > > > > > > > > [28848.887464] ? newidle_balance+0x23f/0x2f0 > > > > > > > > > > [28848.887973] ? nft_lookup_eval+0x57/0x170 [nf_tables] > > > > > > > > > > [28848.888566] ? nft_do_chain+0xef/0x430 [nf_tables] > > > > > > > > > > [28848.889137] ? __flush_work.isra.0+0x35/0x80 > > > > > > > > > > [28848.889657] ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack] > > > > > > > > > > [28848.890293] ? do_select+0x637/0x870 > > > > > > > > > > [28848.890735] tcf_classify+0x52/0xf0 > > > > > > > > > > [28848.891177] htb_classify+0x9d/0x1c0 [sch_htb] > > > > > > > > > > [28848.891722] htb_enqueue+0x3a/0x1c0 [sch_htb] > > > > > > > > > > [28848.892251] __dev_queue_xmit+0x2d8/0x850 > > > > > > > > > > [28848.892738] ? 
nf_hook_slow+0x3c/0xb0 > > > > > > > > > > [28848.893198] ip_finish_output2+0x272/0x580 > > > > > > > > > > [28848.893692] __ip_queue_xmit+0x193/0x420 > > > > > > > > > > [28848.894179] __tcp_transmit_skb+0x8cc/0x970 > > > > > > > > > > > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit > > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the > > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock. > > > > > > > > > > > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached > > > > > > > > > > to the egress. > > > > > > > > > > > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com> > > > > > > > > > > --- > > > > > > > > > > net/sched/act_mirred.c | 4 +++ > > > > > > > > > > .../tc-testing/tc-tests/actions/mirred.json | 32 +++++++++++++++++++ > > > > > > > > > > 2 files changed, 36 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > > > > > > > > > > index 5b3814365924..fc96705285fb 100644 > > > > > > > > > > --- a/net/sched/act_mirred.c > > > > > > > > > > +++ b/net/sched/act_mirred.c > > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, > > > > > > > > > > NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed"); > > > > > > > > > > return -EINVAL; > > > > > > > > > > } > > > > > > > > > > + if (tp->chain->block->q->parent != TC_H_INGRESS) { > > > > > > > > > > + NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress"); > > > > > > > > > > + return -EINVAL; > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I > > > > > > > > > am almost certain this used to work in the old days. 
> > > > > > > >
> > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc.
> > > > > > > > Conclusion is things broke around this time frame:
> > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/
> > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/
> > > > > > > >
> > > > > > > > Looking further into it.
> > > > > > >
> > > > > > > This is what we came up with. Eric, please take a look...
> > > > > > >
> > > > > > > cheers,
> > > > > > > jamal
> > > > > > >
> > > > > > > --- a/net/core/dev.c
> > > > > > > +++ b/net/core/dev.c
> > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > > > > > >  	if (unlikely(contended))
> > > > > > >  		spin_lock(&q->busylock);
> > > > > > >
> > > > > > > +	if (dev_recursion_level()) {
> > > > > >
> > > > > > I am not sure what your intent is, but this seems wrong to me.
> > > > > >
> > > > >
> > > > > There is a deadlock if you reenter the same device which has a qdisc
> > > > > attached to it more than once.
> > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then
> > > > > run some action which requires it to grab the root qdisc lock (again).
> > > > > This is easy to show with mirred (although i am wondering if syzkaller
> > > > > may have produced this at some point)..
> > > > > $TC qdisc add dev $DEV root handle 1: htb default 1
> > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff \
> > > > >   action mirred egress mirror dev $DEV
> > > > >
> > > > > Above example is essentially egress $DEV -> egress $DEV; in both cases
> > > > > "egress $DEV" grabs the root qdisc lock. You could also create another
> > > > > example with egress($DEV1 -> $DEV2 -> back to $DEV1).
> > > > > >
> > > > > > Some valid setups use:
> > > > > >
> > > > > > A bonding device, with HTB qdisc (or other qdisc)
> > > > > > (This also could be a tunnel device with a qdisc)
> > > > > >
> > > > > > -> one or multiple physical NICs, with FQ or other qdisc.
> > > > > >
> > > > > > Packets would be dropped here when we try to reach the physical device.
> > > > >
> > > > > If you have an example handy please send it. I am trying to imagine
> > > > > how those would have worked if they have to reenter the root qdisc of
> > > > > the same dev multiple times..
> > > >
> > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ?
> > > >
> > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device.
> > > >
> > > > A stack of devices A -> B -> C would elevate the recursion level to
> > > > three just fine.
> > > >
> > > > After your patch, a stack of devices would no longer work.
> > > >
> > > > It seems mirred correctly injects packets to the top of the stack for
> > > > ingress (via netif_rx() / netif_receive_skb()),
> > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ?
> > > >
> > > > Perhaps safe-guard mirred, instead of adding more code to fast path.
> > >
> > > I agree not to penalize everybody for a "bad config" like this
> > > (surprising syzkaller hasnt caught this). But i dont see how doing the
> > > checking within mirred will catch this (we cant detect the A->B->A
> > > case).
> > > I think you are suggesting a backlog-like queue for mirred? Not far
> > > off from that is how it used to work before
> > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/)
> > >
> > > > spin_trylock() had to go. There is no way we could keep this.
> > >
> > > Not asking for it to come back... just pointing out why it worked before.
> > > - i.e we had a trylock for the qdisc lock and if it failed we tagged
> > > the rx softirq for a reschedule. That in itself is insufficient, we
> > > would need a loop check which is per-skb (which we had before
> > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/).
> > > There are other gotchas there, potentially packet reordering.
> >
> > If we want to make sure dev_queue_xmit() is called from the top (no
> > spinlock held),
> > then we need a queue, serviced from another context.
> >
> > This extra queueing could happen if
> > __this_cpu_read(softnet_data.xmit.recursion) > 0
> >
>
> I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred
> level. The first time we enter it will be 0.

Then it is fine, no qdisc spinlock is held at this point.

> The second time we would
> deadlock before we hit mirred.

This is not how I see the trace.

Mirred would detect that and either drop or queue the packet to a work
queue or something.

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool want_ingress, struct sk_buff *skb)
 {
 	int err;

-	if (!want_ingress)
+	if (!want_ingress) {
+		if (__this_cpu_read(softnet_data.xmit.recursion) > 0) {
+			// TODO increment a drop counter perhaps ?
+			kfree_skb(skb);
+			return -EINVAL;
+		}
+
 		err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
-	else if (!at_ingress)
+	} else if (!at_ingress)
 		err = netif_rx(skb);
 	else
 		err = netif_receive_skb(skb);

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-20 16:57 ` Eric Dumazet @ 2024-03-20 17:31 ` Jamal Hadi Salim 2024-03-20 17:50 ` Jamal Hadi Salim 0 siblings, 1 reply; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-20 17:31 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Wed, Mar 20, 2024 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote: > > On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote: > > > > > > > > > > > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the > > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to > > > > > > > > > > > a egress qdisc might cause a deadlock. 
> > > > > > - i.e we had a trylock for the qdisc lock and if it failed we tagged > > > > the rx softirq for a reschedule. That in itself is insufficient, we > > > > would need a loop check which is per-skb (which we had before > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/). > > > > There are other gotchas there, potentially packet reordering. > > > > > > If we want to make sure dev_queue_xmit() is called from the top (no > > > spinlock held), > > > then we need a queue, serviced from another context. > > > > > > This extra queueing could happen if > > > __this_cpu_read(softnet_data.xmit.recursion) > 0 > > > > > > > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred > > level. The first time we enter it will be 0. > > Then it is fine, no qdisc spinlock is held at this point. > > The second time we would > > deadlock before we hit mirred. > > This is not how I see the trace. > > Mirred would detect that and either drop or queue the packet to a work > queue or something. > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e > 100644 > --- a/net/sched/act_mirred.c > +++ b/net/sched/act_mirred.c > @@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool > want_ingress, struct sk_buff *skb) > { > int err; > > - if (!want_ingress) > + if (!want_ingress) { > + if (__this_cpu_read(softnet_data.xmit.recursion) > 0) { > + // TODO increment a drop counter perhaps ? > + kfree_skb(skb); > + return -EINVAL; > + } > + > err = tcf_dev_queue_xmit(skb, dev_queue_xmit); > - else if (!at_ingress) > + } else if (!at_ingress) > err = netif_rx(skb); > else > err = netif_receive_skb(skb); I doubt this will work - who increments softnet_data.xmit.recursion? 
We enter __dev_xmit_skb (grab lock) --> qdisc_enq --> classify --> mirred
(recursion is zero) which redirects, entering back into __dev_xmit_skb
again and deadlocks trying to grab lock.
Maybe something is not clear to me, trying your suggestion...

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-20 17:31 ` Jamal Hadi Salim @ 2024-03-20 17:50 ` Jamal Hadi Salim 2024-03-20 18:13 ` Eric Dumazet 0 siblings, 1 reply; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-20 17:50 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Wed, Mar 20, 2024 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Wed, Mar 20, 2024 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the > > > > > > > > > > > > packet it receives. Howerver, add mirred action to a filter attached to > > > > > > > > > > > > a egress qdisc might cause a deadlock. 
> > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > > index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e > > 100644 > > --- a/net/sched/act_mirred.c > > +++ b/net/sched/act_mirred.c > > @@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool > > want_ingress, struct sk_buff *skb) > > { > > int err; > > > > - if (!want_ingress) > > + if (!want_ingress) { > > + if (__this_cpu_read(softnet_data.xmit.recursion) > 0) { > > + // TODO increment a drop counter perhaps ? > > + kfree_skb(skb); > > + return -EINVAL; > > + } > > + > > err = tcf_dev_queue_xmit(skb, dev_queue_xmit); > > - else if (!at_ingress) > > + } else if (!at_ingress) > > err = netif_rx(skb); > > else > > err = netif_receive_skb(skb); > > > I doubt this will work - who increments softnet_data.xmit.recursion? > We enter __dev_xmit_skb (grab lock) --> qdisc_enq-->classify-->mirred > (recursion is zero) which redirects entering back into __dev_xmit_skb > again and deadlocks trying to grab lock. > Maybe something is not clear to me, trying your suggestion... 
>

jhs@mojaone:~$
[   82.890330] __this_cpu_read(softnet_data.xmit.recursion) 0 in tcf_mirred_forward
[   82.890906]
[   82.890906] ============================================
[   82.890906] WARNING: possible recursive locking detected
[   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
[   82.890906] --------------------------------------------
[   82.890906] ping/418 is trying to acquire lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] but task is already holding lock:
[   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550
[   82.890906]
[   82.890906] other info that might help us debug this:
[   82.890906]  Possible unsafe locking scenario:
[   82.890906]
[   82.890906]        CPU0
[   82.890906]        ----
[   82.890906]   lock(&sch->q.lock);
[   82.890906]   lock(&sch->q.lock);
[   82.890906]
[   82.890906]  *** DEADLOCK ***
[   82.890906]
[..... other info removed for brevity....]

Needs more thinking...
A foolproof solution is to add the per-recursion counter to be per
netdevice, but that may be considered blasphemy? ;->

cheers,
jamal

> cheers,
> jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-20 17:50 ` Jamal Hadi Salim @ 2024-03-20 18:13 ` Eric Dumazet 2024-03-20 18:25 ` Eric Dumazet 0 siblings, 1 reply; 24+ messages in thread From: Eric Dumazet @ 2024-03-20 18:13 UTC (permalink / raw) To: Jamal Hadi Salim Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Wed, Mar 20, 2024 at 1:31 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Wed, Mar 20, 2024 at 12:58 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Tue, Mar 19, 2024 at 9:54 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > On Tue, Mar 19, 2024 at 5:38 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > On Mon, Mar 18, 2024 at 11:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > On Mon, Mar 18, 2024 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > On Mon, Mar 18, 2024 at 6:36 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > On Mon, Mar 18, 2024 at 11:46 AM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > > > > > > > > > > > On Mon, Mar 18, 2024 at 3:27 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > > > On Sun, Mar 17, 2024 at 12:10 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 1:14 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Mar 14, 2024 at 7:18 AM renmingshuai <renmingshuai@huawei.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > As we all know the mirred action is used to mirroring or redirecting the > > > > > > > > > > > > > packet it receives. 
Howerver, add mirred action to a filter attached to > > > > > > > > > > > > > a egress qdisc might cause a deadlock. To reproduce the problem, perform > > > > > > > > > > > > > the following steps: > > > > > > > > > > > > > (1)tc qdisc add dev eth0 root handle 1: htb default 30 \n > > > > > > > > > > > > > (2)tc filter add dev eth2 protocol ip prio 2 flower verbose \ > > > > > > > > > > > > > action police rate 100mbit burst 12m conform-exceed jump 1 \ > > > > > > > > > > > > > / pipe mirred egress redirect dev eth2 action drop > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think you meant both to be the same device eth0 or eth2? > > > > > > > > > > > > > > > > > > > > > > > > > The stack is show as below: > > > > > > > > > > > > > [28848.883915] _raw_spin_lock+0x1e/0x30 > > > > > > > > > > > > > [28848.884367] __dev_queue_xmit+0x160/0x850 > > > > > > > > > > > > > [28848.884851] ? 0xffffffffc031906a > > > > > > > > > > > > > [28848.885279] tcf_mirred_act+0x3ab/0x596 [act_mirred] > > > > > > > > > > > > > [28848.885863] tcf_action_exec.part.0+0x88/0x130 > > > > > > > > > > > > > [28848.886401] fl_classify+0x1ca/0x1e0 [cls_flower] > > > > > > > > > > > > > [28848.886970] ? dequeue_entity+0x145/0x9e0 > > > > > > > > > > > > > [28848.887464] ? newidle_balance+0x23f/0x2f0 > > > > > > > > > > > > > [28848.887973] ? nft_lookup_eval+0x57/0x170 [nf_tables] > > > > > > > > > > > > > [28848.888566] ? nft_do_chain+0xef/0x430 [nf_tables] > > > > > > > > > > > > > [28848.889137] ? __flush_work.isra.0+0x35/0x80 > > > > > > > > > > > > > [28848.889657] ? nf_ct_get_tuple+0x1cf/0x210 [nf_conntrack] > > > > > > > > > > > > > [28848.890293] ? 
do_select+0x637/0x870 > > > > > > > > > > > > > [28848.890735] tcf_classify+0x52/0xf0 > > > > > > > > > > > > > [28848.891177] htb_classify+0x9d/0x1c0 [sch_htb] > > > > > > > > > > > > > [28848.891722] htb_enqueue+0x3a/0x1c0 [sch_htb] > > > > > > > > > > > > > [28848.892251] __dev_queue_xmit+0x2d8/0x850 > > > > > > > > > > > > > [28848.892738] ? nf_hook_slow+0x3c/0xb0 > > > > > > > > > > > > > [28848.893198] ip_finish_output2+0x272/0x580 > > > > > > > > > > > > > [28848.893692] __ip_queue_xmit+0x193/0x420 > > > > > > > > > > > > > [28848.894179] __tcp_transmit_skb+0x8cc/0x970 > > > > > > > > > > > > > > > > > > > > > > > > > > In this case, the process has hold the qdisc spin lock in __dev_queue_xmit > > > > > > > > > > > > > before the egress packets are mirred, and it will attempt to obtain the > > > > > > > > > > > > > spin lock again after packets are mirred, which cause a deadlock. > > > > > > > > > > > > > > > > > > > > > > > > > > Fix the issue by forbidding assigning mirred action to a filter attached > > > > > > > > > > > > > to the egress. 
> > > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Mingshuai Ren <renmingshuai@huawei.com> > > > > > > > > > > > > > --- > > > > > > > > > > > > > net/sched/act_mirred.c | 4 +++ > > > > > > > > > > > > > .../tc-testing/tc-tests/actions/mirred.json | 32 +++++++++++++++++++ > > > > > > > > > > > > > 2 files changed, 36 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > > > > > > > > > > > > > index 5b3814365924..fc96705285fb 100644 > > > > > > > > > > > > > --- a/net/sched/act_mirred.c > > > > > > > > > > > > > +++ b/net/sched/act_mirred.c > > > > > > > > > > > > > @@ -120,6 +120,10 @@ static int tcf_mirred_init(struct net *net, struct nlattr *nla, > > > > > > > > > > > > > NL_SET_ERR_MSG_MOD(extack, "Mirred requires attributes to be passed"); > > > > > > > > > > > > > return -EINVAL; > > > > > > > > > > > > > } > > > > > > > > > > > > > + if (tp->chain->block->q->parent != TC_H_INGRESS) { > > > > > > > > > > > > > + NL_SET_ERR_MSG_MOD(extack, "Mirred can only be assigned to the filter attached to ingress"); > > > > > > > > > > > > > + return -EINVAL; > > > > > > > > > > > > > + } > > > > > > > > > > > > > > > > > > > > > > > > Sorry, this is too restrictive as Jiri said. We'll try to reproduce. I > > > > > > > > > > > > am almost certain this used to work in the old days. > > > > > > > > > > > > > > > > > > > > > > Ok, i looked at old notes - it did work at "some point" pre-tdc. > > > > > > > > > > > Conclusion is things broke around this time frame: > > > > > > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/ > > > > > > > > > > > https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/ > > > > > > > > > > > > > > > > > > > > > > Looking further into it. > > > > > > > > > > > > > > > > > > > > This is what we came up with. Eric, please take a look... 
> > > > > > > > > > > > > > > > > > > > cheers, > > > > > > > > > > jamal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --- a/net/core/dev.c > > > > > > > > > > +++ b/net/core/dev.c > > > > > > > > > > @@ -3789,7 +3789,14 @@ static inline int __dev_xmit_skb(struct sk_buff > > > > > > > > > > *skb, struct Qdisc *q, > > > > > > > > > > if (unlikely(contended)) > > > > > > > > > > spin_lock(&q->busylock); > > > > > > > > > > > > > > > > > > > > + if (dev_recursion_level()) { > > > > > > > > > > > > > > > > > > I am not sure what your intent is, but this seems wrong to me. > > > > > > > > > > > > > > > > > > > > > > > > > There is a deadlock if you reenter the same device which has a qdisc > > > > > > > > attached to it more than once. > > > > > > > > Essentially entering __dev_xmit_skb() we grab the root qdisc lock then > > > > > > > > run some action which requires it to grab the root qdisc lock (again). > > > > > > > > This is easy to show with mirred (although i am wondering if syzkaller > > > > > > > > may have produced this at some point).. > > > > > > > > $TC qdisc add dev $DEV root handle 1: htb default 1 > > > > > > > > $TC filter add dev $DEV protocol ip u32 match ip protocol 1 0xff > > > > > > > > action mirred egress mirror dev $DEV > > > > > > > > > > > > > > > > Above example is essentially egress $DEV-> egress $DEV in both cases > > > > > > > > "egress $DEV" grabs the root qdisc lock. You could also create another > > > > > > > > example with egress($DEV1->$DEV2->back to $DEV1). > > > > > > > > > > > > > > > > > Some valid setup use : > > > > > > > > > > > > > > > > > > A bonding device, with HTB qdisc (or other qdisc) > > > > > > > > > (This also could be a tunnel device with a qdisc) > > > > > > > > > > > > > > > > > > -> one or multiple physical NIC, wth FQ or other qdisc. > > > > > > > > > > > > > > > > > > Packets would be dropped here when we try to reach the physical device. 
> > > > > > > > > > > > > > > > > > > > > > > > > If you have an example handy please send it. I am trying to imagine > > > > > > > > how those would have worked if they have to reenter the root qdisc of > > > > > > > > the same dev multiple times.. > > > > > > > > > > > > > > Any virtual device like a GRE/SIT/IPIP/... tunnel, add a qdisc on it ? > > > > > > > > > > > > > > dev_xmit_recursion_inc() is global (per-cpu), it is not per-device. > > > > > > > > > > > > > > A stack of devices A -> B -> C would elevate the recursion level to > > > > > > > three just fine. > > > > > > > > > > > > > > After your patch, a stack of devices would no longer work. > > > > > > > > > > > > > > It seems mirred correctly injects packets to the top of the stack for > > > > > > > ingress (via netif_rx() / netif_receive_skb()), > > > > > > > but thinks it is okay to call dev_queue_xmit(), regardless of the context ? > > > > > > > > > > > > > > Perhaps safe-guard mirred, instead of adding more code to fast path. > > > > > > > > > > > > I agree not to penalize everybody for a "bad config" like this > > > > > > (surprising syzkaller hasnt caught this). But i dont see how doing the > > > > > > checking within mirred will catch this (we cant detect the A->B->A > > > > > > case). > > > > > > I think you are suggesting a backlog-like queue for mirred? Not far > > > > > > off from that is how it used to work before > > > > > > (https://lore.kernel.org/netdev/1465095748.2968.45.camel@edumazet-glaptop3.roam.corp.google.com/) > > > > > > > > > > > > > > > spin_trylock() had to go. There is no way we could keep this. > > > > > > > > > > > > > Not asking for it to come back... just pointing out why it worked before. > > > > > > > > > > - i.e we had a trylock for the qdisc lock and if it failed we tagged > > > > > > the rx softirq for a reschedule. 
That in itself is insufficient, we > > > > > > would need a loop check which is per-skb (which we had before > > > > > > https://lore.kernel.org/netdev/1431679850-31896-1-git-send-email-fw@strlen.de/). > > > > > > There are other gotchas there, potentially packet reordering. > > > > > > > > > > If we want to make sure dev_queue_xmit() is called from the top (no > > > > > spinlock held), > > > > > then we need a queue, serviced from another context. > > > > > > > > > > This extra queueing could happen if > > > > > __this_cpu_read(softnet_data.xmit.recursion) > 0 > > > > > > > > > > > > > I dont see a way to detect softnet_data.xmit.recursion > 0 at mirred > > > > level. The first time we enter it will be 0. > > > > > > Then it is fine, no qdisc spinlock is held at this point. > > > > > > The second time we would > > > > deadlock before we hit mirred. > > > > > > This is not how I see the trace. > > > > > > Mirred would detect that and either drop or queue the packet to a work > > > queue or something. > > > > > > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c > > > index 5b38143659249e66718348e0ec4ed3c7bc21c13d..a2c53e200629a17130f38246ab3cdb8c89c6d30e > > > 100644 > > > --- a/net/sched/act_mirred.c > > > +++ b/net/sched/act_mirred.c > > > @@ -237,9 +237,15 @@ tcf_mirred_forward(bool at_ingress, bool > > > want_ingress, struct sk_buff *skb) > > > { > > > int err; > > > > > > - if (!want_ingress) > > > + if (!want_ingress) { > > > + if (__this_cpu_read(softnet_data.xmit.recursion) > 0) { > > > + // TODO increment a drop counter perhaps ? > > > + kfree_skb(skb); > > > + return -EINVAL; > > > + } > > > + > > > err = tcf_dev_queue_xmit(skb, dev_queue_xmit); > > > - else if (!at_ingress) > > > + } else if (!at_ingress) > > > err = netif_rx(skb); > > > else > > > err = netif_receive_skb(skb); > > > > > > I doubt this will work - who increments softnet_data.xmit.recursion? 
> > We enter __dev_xmit_skb (grab lock) --> qdisc_enq-->classify-->mirred > > (recursion is zero) which redirects entering back into __dev_xmit_skb > > again and deadlocks trying to grab lock. > > Maybe something is not clear to me, trying your suggestion... > > > > jhs@mojaone:~$ > [ 82.890330] __this_cpu_read(softnet_data.xmit.recursion) 0 in > tcf_mirred_forward > [ 82.890906] > [ 82.890906] ============================================ > [ 82.890906] WARNING: possible recursive locking detected > [ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W > [ 82.890906] -------------------------------------------- > [ 82.890906] ping/418 is trying to acquire lock: > [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: > __dev_queue_xmit+0x1778/0x3550 > [ 82.890906] > [ 82.890906] but task is already holding lock: > [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: > __dev_queue_xmit+0x1778/0x3550 > [ 82.890906] > [ 82.890906] other info that might help us debug this: > [ 82.890906] Possible unsafe locking scenario: > [ 82.890906] > [ 82.890906] CPU0 > [ 82.890906] ---- > [ 82.890906] lock(&sch->q.lock); > [ 82.890906] lock(&sch->q.lock); > [ 82.890906] > [ 82.890906] *** DEADLOCK *** > [ 82.890906] > [..... other info removed for brevity....] > > Needs more thinking... > A fool proof solution is to add the per-recursion counter to be per > netdevice but that maybe considered blasphemy? ;-> Nope, you just have to complete the patch, moving around dev_xmit_recursion_inc() and dev_xmit_recursion_dec() ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress
  2024-03-20 18:13             ` Eric Dumazet
@ 2024-03-20 18:25             ` Eric Dumazet
  2024-03-20 19:34               ` Jamal Hadi Salim
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2024-03-20 18:25 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan,
	liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Victor Nogueira, Pedro Tammela, Davide Caratti

On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> Nope, you just have to complete the patch, moving around
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec()

Untested part would be:

diff --git a/net/core/dev.c b/net/core/dev.c
index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	 */
 	rcu_read_lock_bh();
 
+	dev_xmit_recursion_inc();
+
 	skb_update_prio(skb);
 
 	qdisc_pkt_len_init(skb);
@@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			HARD_TX_LOCK(dev, txq, cpu);
 
 			if (!netif_xmit_stopped(txq)) {
-				dev_xmit_recursion_inc();
 				skb = dev_hard_start_xmit(skb, dev, txq, &rc);
-				dev_xmit_recursion_dec();
 				if (dev_xmit_complete(rc)) {
 					HARD_TX_UNLOCK(dev, txq);
 					goto out;
@@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	}
 
 	rc = -ENETDOWN;
+	dev_xmit_recursion_dec();
 	rcu_read_unlock_bh();
 
 	dev_core_stats_tx_dropped_inc(dev);
 	kfree_skb_list(skb);
 	return rc;
 out:
+	dev_xmit_recursion_dec();
 	rcu_read_unlock_bh();
 	return rc;
 }

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-20 18:25 ` Eric Dumazet @ 2024-03-20 19:34 ` Jamal Hadi Salim 2024-03-24 15:27 ` Jamal Hadi Salim 0 siblings, 1 reply; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-20 19:34 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Wed, Mar 20, 2024 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote: > > On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > Nope, you just have to complete the patch, moving around > > dev_xmit_recursion_inc() and dev_xmit_recursion_dec() > > Untested part would be: > > diff --git a/net/core/dev.c b/net/core/dev.c > index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332 > 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct > net_device *sb_dev) > */ > rcu_read_lock_bh(); > > + dev_xmit_recursion_inc(); > + > skb_update_prio(skb); > > qdisc_pkt_len_init(skb); > @@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct > net_device *sb_dev) > HARD_TX_LOCK(dev, txq, cpu); > > if (!netif_xmit_stopped(txq)) { > - dev_xmit_recursion_inc(); > skb = dev_hard_start_xmit(skb, dev, txq, &rc); > - dev_xmit_recursion_dec(); > if (dev_xmit_complete(rc)) { > HARD_TX_UNLOCK(dev, txq); > goto out; > @@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb, > struct net_device *sb_dev) > } > > rc = -ENETDOWN; > + dev_xmit_recursion_dec(); > rcu_read_unlock_bh(); > > dev_core_stats_tx_dropped_inc(dev); > kfree_skb_list(skb); > return rc; > out: > + dev_xmit_recursion_dec(); > rcu_read_unlock_bh(); > return rc; > } This removed the deadlock but now 
every packet being redirected will be dropped. I was wrong earlier on the
tc block because that only works on clsact and ingress which are fine not
needing this lock.
Here's a variation of the earlier patch that may work but comes at the
cost of a new per-cpu increment on the qdisc.

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3789,6 +3789,11 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	if (unlikely(contended))
 		spin_lock(&q->busylock);
 
+	if (__this_cpu_read(q->recursion_xmit) > 0) {
+		//drop
+	}
+
+	__this_cpu_inc(q->recursion_xmit);
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
@@ -3825,6 +3830,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+	__this_cpu_dec(q->recursion_xmit);
 	if (unlikely(to_free))
 		kfree_skb_list_reason(to_free,
 				      tcf_get_drop_reason(to_free));

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-20 19:34 ` Jamal Hadi Salim @ 2024-03-24 15:27 ` Jamal Hadi Salim 2024-03-26 23:18 ` Jamal Hadi Salim 0 siblings, 1 reply; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-24 15:27 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti [-- Attachment #1: Type: text/plain, Size: 3419 bytes --] On Wed, Mar 20, 2024 at 3:34 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Wed, Mar 20, 2024 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > Nope, you just have to complete the patch, moving around > > > dev_xmit_recursion_inc() and dev_xmit_recursion_dec() > > > > Untested part would be: > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332 > > 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct > > net_device *sb_dev) > > */ > > rcu_read_lock_bh(); > > > > + dev_xmit_recursion_inc(); > > + > > skb_update_prio(skb); > > > > qdisc_pkt_len_init(skb); > > @@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct > > net_device *sb_dev) > > HARD_TX_LOCK(dev, txq, cpu); > > > > if (!netif_xmit_stopped(txq)) { > > - dev_xmit_recursion_inc(); > > skb = dev_hard_start_xmit(skb, dev, txq, &rc); > > - dev_xmit_recursion_dec(); > > if (dev_xmit_complete(rc)) { > > HARD_TX_UNLOCK(dev, txq); > > goto out; > > @@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb, > > struct net_device *sb_dev) > > } > > > > rc = -ENETDOWN; > > + 
dev_xmit_recursion_dec(); > > rcu_read_unlock_bh(); > > > > dev_core_stats_tx_dropped_inc(dev); > > kfree_skb_list(skb); > > return rc; > > out: > > + dev_xmit_recursion_dec(); > > rcu_read_unlock_bh(); > > return rc; > > } > > This removed the deadlock but now every packet being redirected will > be dropped. I was wrong earlier on the tc block because that only > works on clsact and ingress which are fine not needing this lock. > Here's a variation of the earlier patch that may work but comes at the > cost of a new per-cpu increment on the qdisc. > > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -3789,6 +3789,11 @@ static inline int __dev_xmit_skb(struct sk_buff > *skb, struct Qdisc *q, > if (unlikely(contended)) > spin_lock(&q->busylock); > > + if (__this_cpu_read(q->recursion_xmit) > 0) { > + //drop > + } > + > + __this_cpu_inc(q->recursion_xmit); > spin_lock(root_lock); > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { > __qdisc_drop(skb, &to_free); > @@ -3825,6 +3830,7 @@ static inline int __dev_xmit_skb(struct sk_buff > *skb, struct Qdisc *q, > } > } > spin_unlock(root_lock); > + __this_cpu_dec(q->recursion_xmit); > if (unlikely(to_free)) > kfree_skb_list_reason(to_free, > tcf_get_drop_reason(to_free)); > And here's a tested version(attached) that fixes both the A->A and A->B->A. 
cheers,
jamal

[-- Attachment #2: qdisc-rec-xmit-var-fix.patch --]
[-- Type: text/x-patch, Size: 2619 bytes --]

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index cefe0c4bdae3..f9f99df037ed 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -125,6 +125,8 @@ struct Qdisc {
 	spinlock_t		busylock ____cacheline_aligned_in_smp;
 	spinlock_t		seqlock;
 
+	u16 __percpu		*xmit_recursion;
+
 	struct rcu_head		rcu;
 	netdevice_tracker	dev_tracker;
 	/* private data */
diff --git a/net/core/dev.c b/net/core/dev.c
index 9a67003e49db..2b712388c06f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	if (unlikely(contended))
 		spin_lock(&q->busylock);
 
+	if (__this_cpu_read(*q->xmit_recursion) > 0) {
+		__qdisc_drop(skb, &to_free);
+		rc = NET_XMIT_DROP;
+		goto free_skb_list;
+	}
+
+	__this_cpu_inc(*q->xmit_recursion);
 	spin_lock(root_lock);
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
 		__qdisc_drop(skb, &to_free);
@@ -3825,7 +3832,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		}
 	}
 	spin_unlock(root_lock);
+	__this_cpu_dec(*q->xmit_recursion);
 	if (unlikely(to_free))
+free_skb_list:
 		kfree_skb_list_reason(to_free,
 				      tcf_get_drop_reason(to_free));
 	if (unlikely(contended))
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 65e05b0c98e4..6c3bc1aff89a 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1260,6 +1260,7 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	struct Qdisc *sch;
 	struct Qdisc_ops *ops;
 	struct qdisc_size_table *stab;
+	int cpu;
 
 	ops = qdisc_lookup_ops(kind);
 #ifdef CONFIG_MODULES
@@ -1376,11 +1377,22 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 		}
 	}
 
+	sch->xmit_recursion = alloc_percpu(u16);
+	if (!sch->xmit_recursion) {
+		err = -ENOMEM;
+		goto err_out5;
+	}
+	for_each_possible_cpu(cpu)
+		(*per_cpu_ptr(sch->xmit_recursion, cpu)) = 0;
+
 	qdisc_hash_add(sch, false);
 	trace_qdisc_create(ops, dev, parent);
 
 	return sch;
 
+err_out5:
+	if (tca[TCA_RATE])
+		gen_kill_estimator(&sch->rate_est);
 err_out4:
 	/* Even if ops->init() failed, we call ops->destroy()
 	 * like qdisc_create_dflt().
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ff5336493777..afbbd2e885a4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1070,6 +1070,8 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
 	module_put(ops->owner);
 	netdev_put(dev, &qdisc->dev_tracker);
 
+	free_percpu(qdisc->xmit_recursion);
+
 	trace_qdisc_destroy(qdisc);
 
 	call_rcu(&qdisc->rcu, qdisc_free_cb);

^ permalink raw reply related	[flat|nested] 24+ messages in thread
* Re: [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress 2024-03-24 15:27 ` Jamal Hadi Salim @ 2024-03-26 23:18 ` Jamal Hadi Salim 0 siblings, 0 replies; 24+ messages in thread From: Jamal Hadi Salim @ 2024-03-26 23:18 UTC (permalink / raw) To: Eric Dumazet Cc: renmingshuai, xiyou.wangcong, jiri, davem, vladbu, netdev, yanan, liaichun, caowangbao, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Victor Nogueira, Pedro Tammela, Davide Caratti On Sun, Mar 24, 2024 at 11:27 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > On Wed, Mar 20, 2024 at 3:34 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > On Wed, Mar 20, 2024 at 2:26 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > On Wed, Mar 20, 2024 at 7:13 PM Eric Dumazet <edumazet@google.com> wrote: > > > > > > > > On Wed, Mar 20, 2024 at 6:50 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote: > > > > > > > Nope, you just have to complete the patch, moving around > > > > dev_xmit_recursion_inc() and dev_xmit_recursion_dec() > > > > > > Untested part would be: > > > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > > index 303a6ff46e4e16296e94ed6b726621abe093e567..dbeaf67282e8b6ec164d00d796c9fd8e4fd7c332 > > > 100644 > > > --- a/net/core/dev.c > > > +++ b/net/core/dev.c > > > @@ -4259,6 +4259,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct > > > net_device *sb_dev) > > > */ > > > rcu_read_lock_bh(); > > > > > > + dev_xmit_recursion_inc(); > > > + > > > skb_update_prio(skb); > > > > > > qdisc_pkt_len_init(skb); > > > @@ -4331,9 +4333,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct > > > net_device *sb_dev) > > > HARD_TX_LOCK(dev, txq, cpu); > > > > > > if (!netif_xmit_stopped(txq)) { > > > - dev_xmit_recursion_inc(); > > > skb = dev_hard_start_xmit(skb, dev, txq, &rc); > > > - dev_xmit_recursion_dec(); > > > if (dev_xmit_complete(rc)) { > > > HARD_TX_UNLOCK(dev, txq); > > > goto out; > > > @@ -4353,12 +4353,14 @@ int __dev_queue_xmit(struct sk_buff *skb, > > > struct 
net_device *sb_dev) > > > } > > > > > > rc = -ENETDOWN; > > > + dev_xmit_recursion_dec(); > > > rcu_read_unlock_bh(); > > > > > > dev_core_stats_tx_dropped_inc(dev); > > > kfree_skb_list(skb); > > > return rc; > > > out: > > > + dev_xmit_recursion_dec(); > > > rcu_read_unlock_bh(); > > > return rc; > > > } > > > > This removed the deadlock but now every packet being redirected will > > be dropped. I was wrong earlier on the tc block because that only > > works on clsact and ingress which are fine not needing this lock. > > Here's a variation of the earlier patch that may work but comes at the > > cost of a new per-cpu increment on the qdisc. > > > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -3789,6 +3789,11 @@ static inline int __dev_xmit_skb(struct sk_buff > > *skb, struct Qdisc *q, > > if (unlikely(contended)) > > spin_lock(&q->busylock); > > > > + if (__this_cpu_read(q->recursion_xmit) > 0) { > > + //drop > > + } > > + > > + __this_cpu_inc(q->recursion_xmit); > > spin_lock(root_lock); > > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { > > __qdisc_drop(skb, &to_free); > > @@ -3825,6 +3830,7 @@ static inline int __dev_xmit_skb(struct sk_buff > > *skb, struct Qdisc *q, > > } > > } > > spin_unlock(root_lock); > > + __this_cpu_dec(q->recursion_xmit); > > if (unlikely(to_free)) > > kfree_skb_list_reason(to_free, > > tcf_get_drop_reason(to_free)); > > > > And here's a tested version(attached) that fixes both the A->A and A->B->A. Sent a proper patch as RFC, Eric please take a look. cheers, jamal > cheers, > jamal ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-03-26 23:19 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-14 11:17 [PATCH] net/sched: Forbid assigning mirred action to a filter attached to the egress renmingshuai
2024-03-14 11:43 ` Jiri Pirko
2024-03-14 14:04   ` renmingshuai
2024-03-14 14:47     ` Pedro Tammela
2024-03-14 17:14 ` Jamal Hadi Salim
2024-03-15  1:56   ` renmingshuai
2024-03-15 22:34     ` Jamal Hadi Salim
2024-03-17 16:10       ` Jamal Hadi Salim
2024-03-18 14:27         ` Jamal Hadi Salim
2024-03-18 15:46           ` Eric Dumazet
2024-03-18 17:36             ` Jamal Hadi Salim
2024-03-18 19:11               ` Eric Dumazet
2024-03-18 22:05                 ` Jamal Hadi Salim
2024-03-19  9:38                   ` Eric Dumazet
2024-03-19 20:54                     ` Jamal Hadi Salim
2024-03-20 16:46                       ` Jamal Hadi Salim
2024-03-20 16:57                         ` Eric Dumazet
2024-03-20 17:31                           ` Jamal Hadi Salim
2024-03-20 17:50                             ` Jamal Hadi Salim
2024-03-20 18:13                               ` Eric Dumazet
2024-03-20 18:25                                 ` Eric Dumazet
2024-03-20 19:34                                   ` Jamal Hadi Salim
2024-03-24 15:27                                     ` Jamal Hadi Salim
2024-03-26 23:18                                       ` Jamal Hadi Salim