From: Eric Dumazet <edumazet@google.com>
To: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
	jiri@resnulli.us,  xiyou.wangcong@gmail.com,
	netdev@vger.kernel.org, renmingshuai@huawei.com,
	 Victor Nogueira <victor@mojatatu.com>
Subject: Re: [PATCH RFC net 1/1] net/sched: Fix mirred to self recursion
Date: Mon, 15 Apr 2024 11:20:27 +0200
Message-ID: <CANn89iLmhaC8fuu4UpPdELOAapBzLv0+S50gr0Rs+J+=4+9j=g@mail.gmail.com>
In-Reply-To: <CAM0EoMk33ga5dh12ViZz8QeFwjwNQBvykM53VQo1B3BdfAZtaQ@mail.gmail.com>

On Wed, Apr 10, 2024 at 10:30 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Apr 2, 2024 at 1:35 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > On Tue, Apr 2, 2024 at 12:47 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 11:58 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > >
> > > > On Wed, Mar 27, 2024 at 9:23 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Wed, Mar 27, 2024 at 12:03 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > When the mirred action is used on a classful egress qdisc and a packet is
> > > > > > mirrored or redirected to self we hit a qdisc lock deadlock.
> > > > > > See trace below.
> > > > > >
> > > > > > [..... other info removed for brevity....]
> > > > > > [   82.890906]
> > > > > > [   82.890906] ============================================
> > > > > > [   82.890906] WARNING: possible recursive locking detected
> > > > > > [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> > > > > > [   82.890906] --------------------------------------------
> > > > > > [   82.890906] ping/418 is trying to acquire lock:
> > > > > > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550
> > > > > > [   82.890906]
> > > > > > [   82.890906] but task is already holding lock:
> > > > > > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550
> > > > > > [   82.890906]
> > > > > > [   82.890906] other info that might help us debug this:
> > > > > > [   82.890906]  Possible unsafe locking scenario:
> > > > > > [   82.890906]
> > > > > > [   82.890906]        CPU0
> > > > > > [   82.890906]        ----
> > > > > > [   82.890906]   lock(&sch->q.lock);
> > > > > > [   82.890906]   lock(&sch->q.lock);
> > > > > > [   82.890906]
> > > > > > [   82.890906]  *** DEADLOCK ***
> > > > > > [   82.890906]
> > > > > > [..... other info removed for brevity....]
> > > > > >
> > > > > > Example setup (eth0->eth0) to recreate
> > > > > > tc qdisc add dev eth0 root handle 1: htb default 30
> > > > > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> > > > > >      action mirred egress redirect dev eth0
> > > > > >
> > > > > > Another example (eth0->eth1->eth0) to recreate
> > > > > > tc qdisc add dev eth0 root handle 1: htb default 30
> > > > > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> > > > > >      action mirred egress redirect dev eth1
> > > > > >
> > > > > > tc qdisc add dev eth1 root handle 1: htb default 30
> > > > > > tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
> > > > > >      action mirred egress redirect dev eth0
> > > > > >
> > > > > > We fix this by adding a per-cpu, per-qdisc recursion counter which is
> > > > > > incremented the first time a root qdisc is entered; on a second attempt
> > > > > > to enter the same root qdisc from the top, the packet is dropped to break
> > > > > > the loop.
> > > > > >
> > > > > > Reported-by: renmingshuai@huawei.com
> > > > > > Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
> > > > > > Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
> > > > > > Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
> > > > > > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > > > > > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > > > > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > > > > > ---
> > > > > >  include/net/sch_generic.h |  2 ++
> > > > > >  net/core/dev.c            |  9 +++++++++
> > > > > >  net/sched/sch_api.c       | 12 ++++++++++++
> > > > > >  net/sched/sch_generic.c   |  2 ++
> > > > > >  4 files changed, 25 insertions(+)
> > > > > >
> > > > > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > > > > > index cefe0c4bdae3..f9f99df037ed 100644
> > > > > > --- a/include/net/sch_generic.h
> > > > > > +++ b/include/net/sch_generic.h
> > > > > > @@ -125,6 +125,8 @@ struct Qdisc {
> > > > > >         spinlock_t              busylock ____cacheline_aligned_in_smp;
> > > > > >         spinlock_t              seqlock;
> > > > > >
> > > > > > +       u16 __percpu            *xmit_recursion;
> > > > > > +
> > > > > >         struct rcu_head         rcu;
> > > > > >         netdevice_tracker       dev_tracker;
> > > > > >         /* private data */
> > > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > > index 9a67003e49db..2b712388c06f 100644
> > > > > > --- a/net/core/dev.c
> > > > > > +++ b/net/core/dev.c
> > > > > > @@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > > > > >         if (unlikely(contended))
> > > > > >                 spin_lock(&q->busylock);
> > > > >
> > > > > This could hang here (busylock)
> > > >
> > > > Notice the goto free_skb_list has a spin_unlock(&q->busylock); in
> > > > its code vicinity. Am I missing something?
> > >
> > > The hang would happen in above spin_lock(&q->busylock), before you can
> > > get a chance...
> > >
> > > If you want to test your patch, add this debugging feature, pretending
> > > the spinlock is contended.
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 818699dea9d7040ee74532ccdebf01c4fd6887cc..b2fe3aa2716f0fe128ef10f9d06c2431b3246933 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3816,7 +3816,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > >          * sent after the qdisc owner is scheduled again. To prevent this
> > >          * scenario the task always serialize on the lock.
> > >          */
> > > -       contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> > > +       contended = true; // DEBUG for Jamal
> > >         if (unlikely(contended))
> > >                 spin_lock(&q->busylock);
> >
> > Will do.
>
> Finally got time to look again. Probably being too clever, but moving
> the check before the contended check resolves it as well. The only
> strange thing is that the latest net-next now seems to spit out a
> false-positive lockdep splat for the A->B->A test (I am sure it's
> fixable).
>
> See attached. Didn't try the other idea; see if you like this one.

A spinlock can only be held by one CPU at a time, so recording the CPU
number of the lock owner should be enough to avoid a deadlock.

So I really do not understand your push for a per-cpu variable with
extra cache line misses.

I think the following would work just fine? What do you think?

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 76db6be1608315102495dd6372fc30e6c9d41a99..dcd92ed7f69fae00deaca2c88fed248a559108ea 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,6 +117,7 @@ struct Qdisc {
        struct qdisc_skb_head   q;
        struct gnet_stats_basic_sync bstats;
        struct gnet_stats_queue qstats;
+       int                     owner;
        unsigned long           state;
        unsigned long           state2; /* must be written under qdisc spinlock */
        struct Qdisc            *next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index 854a3a28a8d85b335a9158378ae0cca6dfbf8b36..d77cac53df4b4af478548dd17e7a3a7cfe4bd792 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3808,6 +3808,11 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                return rc;
        }

+       if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
+               /* add a specific drop_reason later in net-next */
+               kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
+               return NET_XMIT_DROP;
+       }
        /*
         * Heuristic to force contended enqueues to serialize on a
         * separate lock before trying to get qdisc main lock.
@@ -3847,7 +3852,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                qdisc_run_end(q);
                rc = NET_XMIT_SUCCESS;
        } else {
+               WRITE_ONCE(q->owner, smp_processor_id());
                rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
+               WRITE_ONCE(q->owner, -1);
                if (qdisc_run_begin(q)) {
                        if (unlikely(contended)) {
                                spin_unlock(&q->busylock);
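
One thing not shown in the diff above: the new owner field would also need
to be initialised to -1 when the qdisc is allocated (e.g. in qdisc_alloc()),
otherwise the zero-initialised field would make CPU 0 look like the owner of
a freshly allocated qdisc and its packets would be wrongly dropped.

For illustration only, here is an untested, self-contained userspace sketch
of the same owner-recording pattern (pthread-based, all names made up, not
kernel code):

/* recursion_guard.c - sketch of the "record the lock owner" idea.
 * Before blocking on the queue lock, check whether the current thread
 * already owns it; if so, drop instead of deadlocking on re-entry.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_t q_owner;      /* valid only while q_owned is true */
static volatile bool q_owned;  /* set while the lock is held, like q->owner */

/* Returns false when the caller already owns the lock (recursion detected). */
static bool enqueue(int pkt)
{
	/* Racy read, like READ_ONCE(q->owner): a non-owner thread can never
	 * observe its own id here, so there are no false positives; at worst
	 * it misses the fast path and blocks on the lock as usual.
	 */
	if (q_owned && pthread_equal(q_owner, pthread_self()))
		return false;           /* drop to break the loop */

	pthread_mutex_lock(&q_lock);
	q_owner = pthread_self();
	q_owned = true;

	printf("enqueued packet %d\n", pkt);
	/* A mirred-style redirect back to the same queue would call
	 * enqueue() again right here and hit the check above.
	 */

	q_owned = false;
	pthread_mutex_unlock(&q_lock);
	return true;
}

int main(void)
{
	return enqueue(1) ? 0 : 1;
}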

Thread overview: 21+ messages
2024-03-26 23:03 [PATCH RFC net 1/1] net/sched: Fix mirred to self recursion Jamal Hadi Salim
2024-03-27 13:23 ` Eric Dumazet
2024-03-27 22:57   ` Jamal Hadi Salim
2024-03-27 23:12     ` Jamal Hadi Salim
2024-04-02  2:00       ` renmingshuai
2024-04-02 16:38         ` Jamal Hadi Salim
2024-04-02 16:47     ` Eric Dumazet
2024-04-02 17:35       ` Jamal Hadi Salim
2024-04-10 20:30         ` Jamal Hadi Salim
2024-04-15  9:20           ` Eric Dumazet [this message]
2024-04-15  9:29             ` Eric Dumazet
2024-04-15 13:59             ` Jamal Hadi Salim
2024-04-15 14:01               ` Jamal Hadi Salim
2024-04-15 14:11                 ` Eric Dumazet
2024-04-15 14:17                   ` Jamal Hadi Salim
2024-04-15 21:14                   ` Jamal Hadi Salim
2024-04-16  8:05                     ` Davide Caratti
2024-04-16  9:14                       ` Eric Dumazet
2024-04-16  9:28                         ` Davide Caratti
2024-04-16  9:37                           ` Eric Dumazet
2024-04-16 10:30                       ` Jamal Hadi Salim
