From: Jamal Hadi Salim <jhs@mojatatu.com>
To: Eric Dumazet <edumazet@google.com>
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	jiri@resnulli.us,  xiyou.wangcong@gmail.com,
	netdev@vger.kernel.org, renmingshuai@huawei.com,
	 Victor Nogueira <victor@mojatatu.com>
Subject: Re: [PATCH RFC net 1/1] net/sched: Fix mirred to self recursion
Date: Wed, 27 Mar 2024 19:12:59 -0400	[thread overview]
Message-ID: <CAM0EoMmO9pz7Y3ZMVL-QYAUB3kuxNQyK8h4OY-V1T99aE_Qbig@mail.gmail.com>
In-Reply-To: <CAM0EoMmQHsucU6n1O3XEd50zUB4TENkEH0+J-cZ=5Bbv9298mA@mail.gmail.com>

On Wed, Mar 27, 2024 at 6:57 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, Mar 27, 2024 at 9:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Mar 27, 2024 at 12:03 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > When the mirred action is used on a classful egress qdisc and a packet is
> > > mirrored or redirected to self, we hit a qdisc lock deadlock.
> > > See the trace below.
> > >
> > > [..... other info removed for brevity....]
> > > [   82.890906]
> > > [   82.890906] ============================================
> > > [   82.890906] WARNING: possible recursive locking detected
> > > [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> > > [   82.890906] --------------------------------------------
> > > [   82.890906] ping/418 is trying to acquire lock:
> > > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550
> > > [   82.890906]
> > > [   82.890906] but task is already holding lock:
> > > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550
> > > [   82.890906]
> > > [   82.890906] other info that might help us debug this:
> > > [   82.890906]  Possible unsafe locking scenario:
> > > [   82.890906]
> > > [   82.890906]        CPU0
> > > [   82.890906]        ----
> > > [   82.890906]   lock(&sch->q.lock);
> > > [   82.890906]   lock(&sch->q.lock);
> > > [   82.890906]
> > > [   82.890906]  *** DEADLOCK ***
> > > [   82.890906]
> > > [..... other info removed for brevity....]
> > >
> > > Example setup (eth0->eth0) to recreate
> > > tc qdisc add dev eth0 root handle 1: htb default 30
> > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> > >      action mirred egress redirect dev eth0
> > >
> > > Another example (eth0->eth1->eth0) to recreate
> > > tc qdisc add dev eth0 root handle 1: htb default 30
> > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> > >      action mirred egress redirect dev eth1
> > >
> > > tc qdisc add dev eth1 root handle 1: htb default 30
> > > tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
> > >      action mirred egress redirect dev eth0
> > >
> > > We fix this by adding a per-cpu, per-qdisc recursion counter which is
> > > incremented the first time a root qdisc is entered; on a second attempt to
> > > enter the same root qdisc from the top, the packet is dropped to break the
> > > loop.
> > >
> > > Reported-by: renmingshuai@huawei.com
> > > Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
> > > Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
> > > Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
> > > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > > ---
> > >  include/net/sch_generic.h |  2 ++
> > >  net/core/dev.c            |  9 +++++++++
> > >  net/sched/sch_api.c       | 12 ++++++++++++
> > >  net/sched/sch_generic.c   |  2 ++
> > >  4 files changed, 25 insertions(+)
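The sch_api.c and sch_generic.c hunks from the diffstat above (and the
increment/decrement side of the change) are not quoted in this reply; a
rough sketch of the counter lifecycle the commit message implies --
hypothetical, not the verbatim patch -- would be:

	/* Sketch only: the shape implied by the commit message above.
	 * The per-cpu counter lives and dies with the qdisc and is
	 * bracketed around entry into the root qdisc. */
	sch->xmit_recursion = alloc_percpu(u16);  /* at qdisc alloc time */
	...
	__this_cpu_inc(*q->xmit_recursion);       /* entering root qdisc */
	/* ... enqueue skb to q and run the qdisc ... */
	__this_cpu_dec(*q->xmit_recursion);       /* leaving root qdisc */
	...
	free_percpu(sch->xmit_recursion);         /* at qdisc destroy time */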
> > >
> > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > > index cefe0c4bdae3..f9f99df037ed 100644
> > > --- a/include/net/sch_generic.h
> > > +++ b/include/net/sch_generic.h
> > > @@ -125,6 +125,8 @@ struct Qdisc {
> > >         spinlock_t              busylock ____cacheline_aligned_in_smp;
> > >         spinlock_t              seqlock;
> > >
> > > +       u16 __percpu            *xmit_recursion;
> > > +
> > >         struct rcu_head         rcu;
> > >         netdevice_tracker       dev_tracker;
> > >         /* private data */
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 9a67003e49db..2b712388c06f 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > >         if (unlikely(contended))
> > >                 spin_lock(&q->busylock);
> >
> > This could hang here (busylock)
>
> Notice the goto free_skb_list has a spin_unlock(&q->busylock) in
> its code vicinity. Am I missing something?
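For reference, the drop path in question ends up at a common exit that
does the unlock; roughly (a simplified sketch of the tail of
__dev_xmit_skb() with the patch applied, not verbatim kernel code):

	free_skb_list:
		/* common exit: free any dropped skbs, then release
		 * busylock if this CPU had taken it above */
		if (unlikely(to_free))
			kfree_skb_list(to_free);
		if (unlikely(contended))
			spin_unlock(&q->busylock);
		return rc;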
>
> >
> > >
> > > +       if (__this_cpu_read(*q->xmit_recursion) > 0) {
> > > +               __qdisc_drop(skb, &to_free);
> > > +               rc = NET_XMIT_DROP;
> > > +               goto free_skb_list;
> > > +       }
> >
> >
> > I do not think we want to add yet another cache line miss and
> > complexity in tx fast path.
> >
>
> I empathize. Is the cache miss due to the per-cpu variable? Otherwise
> the field seems to be in the vicinity of the other fields being
> accessed in __dev_xmit_skb().
>
> > I think that mirred should  use a separate queue to kick a transmit
> > from the top level.
> >
> > (Like netif_rx() does)
> >
>
> Eric, here's my concern: this would entail restructuring mirred
> totally just to cater for one use case which is in itself _a bad
> config_, and for the egress qdisc case only. Mirred is very heavily
> used, and changing its behavior could likely introduce other corner
> cases which we would be chasing for a while. Not to forget that we
> would now have to go via an extra transient queue.
> If I understood correctly, you are suggesting we add an equivalent of
> the backlog queue for the tx side? I am assuming something very
> similar in nature to the backlog, meaning per-cpu and fired by
> softirq? Or is it something closer to qdisc->gso_skb?
> For either of those cases, the infrastructure needed is more than a
> few lines of code. And then there's still the need to break the loop,
> etc.
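To make the comparison concrete, a minimal sketch of what such a
per-cpu deferred tx queue might look like (all names here are
hypothetical; this is not existing kernel code):

	/* Hypothetical sketch: a per-cpu skb list that mirred could
	 * append to instead of re-entering the device's qdisc, flushed
	 * later from the top level, similar in spirit to the netif_rx()
	 * backlog.  Assumes the queue heads are initialized at boot and
	 * that callers run with BHs disabled. */
	static DEFINE_PER_CPU(struct sk_buff_head, tx_defer_queue);

	static void tx_defer_skb(struct sk_buff *skb)
	{
		__skb_queue_tail(this_cpu_ptr(&tx_defer_queue), skb);
		raise_softirq(NET_TX_SOFTIRQ);	/* kick a later flush */
	}

	/* runs in softirq context, outside any qdisc lock */
	static void tx_defer_flush(void)
	{
		struct sk_buff *skb;

		while ((skb = __skb_dequeue(this_cpu_ptr(&tx_defer_queue))))
			dev_queue_xmit(skb);
	}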
>
> Some questions regarding your proposal - something I am not following.
> I may have misunderstood what you are suggesting, but I am missing in
> what scenario mirred can directly call tcf_dev_queue_xmit() (see my
> comment below).
>
> > Using a softnet.xmit_qdisc_recursion (not a per-qdisc per-cpu thing)
> > would allow mirred to bypass this additional queue
> > in most cases.
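(Worth noting: softnet_data is per-cpu and already touched on this
path, so the new u8 presumably rides on an already-hot cache line,
whereas a freshly allocated per-qdisc percpu counter is a separate
allocation and hence the potential extra miss mentioned above.)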
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index cb37817d6382c29117afd8ce54db6dba94f8c930..62ba5ef554860496ee928f7ed6b7c3ea46b8ee1d 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3217,7 +3217,8 @@ struct softnet_data {
> >  #endif
> >         /* written and read only by owning cpu: */
> >         struct {
> > -               u16 recursion;
> > +               u8 recursion;
> > +               u8 qdisc_recursion;
> >                 u8  more;
> >  #ifdef CONFIG_NET_EGRESS
> >                 u8  skip_txqueue;
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 9a67003e49db87f3f92b6c6296b3e7a5ca9d9171..7ac59835edef657e9558d4d4fc0a76b171aace93 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4298,7 +4298,9 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >
> >         trace_net_dev_queue(skb);
> >         if (q->enqueue) {
> > +               __this_cpu_inc(softnet_data.xmit.qdisc_recursion);
>
> This increments the count by 1..
>
> >                 rc = __dev_xmit_skb(skb, q, dev, txq);
> > +               __this_cpu_dec(softnet_data.xmit.qdisc_recursion);
> >                 goto out;
> >         }
> >
> > diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> > index 5b38143659249e66718348e0ec4ed3c7bc21c13d..0f5f02e6744397d33ae2a72670ba7131aaa6942e 100644
> > --- a/net/sched/act_mirred.c
> > +++ b/net/sched/act_mirred.c
> > @@ -237,8 +237,13 @@ tcf_mirred_forward(bool at_ingress, bool want_ingress, struct sk_buff *skb)
> >  {
> >         int err;
> >
> > -       if (!want_ingress)
> > -               err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> > +       if (!want_ingress) {
> > +               if (__this_cpu_read(softnet_data.xmit.qdisc_recursion)) {
>
> Where does the deferred skb get picked up? So this will always be 1,
> assuming the defer queue will have to be something like a workqueue.

Sorry, sent too fast - I meant we would always enter here.
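Spelling out the call chain for the egress qdisc case with the diff
above applied:

	__dev_queue_xmit()
	  __this_cpu_inc(softnet_data.xmit.qdisc_recursion)  /* now 1 */
	  __dev_xmit_skb() -> qdisc enqueue/dequeue
	    -> tcf_mirred_act() -> tcf_mirred_forward()
	       /* reads qdisc_recursion == 1, takes the queue/drop branch */
	  __this_cpu_dec(softnet_data.xmit.qdisc_recursion)

so the direct tcf_dev_queue_xmit() branch below is never taken when
mirred runs from an egress qdisc.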

> > +                       // Queue to top level, or drop
> > +               } else {
>
> and we'll never enter this..
>
> > +                       err = tcf_dev_queue_xmit(skb, dev_queue_xmit);
> > +               }
> > +       }
> >         else if (!at_ingress)
> >                 err = netif_rx(skb);
> >         else
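(The "// Queue to top level, or drop" placeholder above is where
something like the hypothetical tx_defer_skb() sketched earlier would
slot in, if the deferred-queue route were taken.)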
>
> cheers,
> jamal


Thread overview: 21+ messages
2024-03-26 23:03 [PATCH RFC net 1/1] net/sched: Fix mirred to self recursion Jamal Hadi Salim
2024-03-27 13:23 ` Eric Dumazet
2024-03-27 22:57   ` Jamal Hadi Salim
2024-03-27 23:12     ` Jamal Hadi Salim [this message]
2024-04-02  2:00       ` renmingshuai
2024-04-02 16:38         ` Jamal Hadi Salim
2024-04-02 16:47     ` Eric Dumazet
2024-04-02 17:35       ` Jamal Hadi Salim
2024-04-10 20:30         ` Jamal Hadi Salim
2024-04-15  9:20           ` Eric Dumazet
2024-04-15  9:29             ` Eric Dumazet
2024-04-15 13:59             ` Jamal Hadi Salim
2024-04-15 14:01               ` Jamal Hadi Salim
2024-04-15 14:11                 ` Eric Dumazet
2024-04-15 14:17                   ` Jamal Hadi Salim
2024-04-15 21:14                   ` Jamal Hadi Salim
2024-04-16  8:05                     ` Davide Caratti
2024-04-16  9:14                       ` Eric Dumazet
2024-04-16  9:28                         ` Davide Caratti
2024-04-16  9:37                           ` Eric Dumazet
2024-04-16 10:30                       ` Jamal Hadi Salim
