On Tue, Apr 2, 2024 at 1:35 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Apr 2, 2024 at 12:47 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Wed, Mar 27, 2024 at 11:58 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 9:23 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Wed, Mar 27, 2024 at 12:03 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > When the mirred action is used on a classful egress qdisc and a packet is
> > > > > mirrored or redirected to self we hit a qdisc lock deadlock.
> > > > > See trace below.
> > > > >
> > > > > [..... other info removed for brevity....]
> > > > > [   82.890906]
> > > > > [   82.890906] ============================================
> > > > > [   82.890906] WARNING: possible recursive locking detected
> > > > > [   82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G        W
> > > > > [   82.890906] --------------------------------------------
> > > > > [   82.890906] ping/418 is trying to acquire lock:
> > > > > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> > > > > __dev_queue_xmit+0x1778/0x3550
> > > > > [   82.890906]
> > > > > [   82.890906] but task is already holding lock:
> > > > > [   82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at:
> > > > > __dev_queue_xmit+0x1778/0x3550
> > > > > [   82.890906]
> > > > > [   82.890906] other info that might help us debug this:
> > > > > [   82.890906]  Possible unsafe locking scenario:
> > > > > [   82.890906]
> > > > > [   82.890906]        CPU0
> > > > > [   82.890906]        ----
> > > > > [   82.890906]   lock(&sch->q.lock);
> > > > > [   82.890906]   lock(&sch->q.lock);
> > > > > [   82.890906]
> > > > > [   82.890906]  *** DEADLOCK ***
> > > > > [   82.890906]
> > > > > [..... other info removed for brevity....]
> > > > >
> > > > > Example setup (eth0->eth0) to recreate
> > > > > tc qdisc add dev eth0 root handle 1: htb default 30
> > > > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> > > > >      action mirred egress redirect dev eth0
> > > > >
> > > > > Another example(eth0->eth1->eth0) to recreate
> > > > > tc qdisc add dev eth0 root handle 1: htb default 30
> > > > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \
> > > > >      action mirred egress redirect dev eth1
> > > > >
> > > > > tc qdisc add dev eth1 root handle 1: htb default 30
> > > > > tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \
> > > > >      action mirred egress redirect dev eth0
> > > > >
> > > > > We fix this by adding a per-cpu, per-qdisc recursion counter which is
> > > > > incremented the first time a root qdisc is entered and on a second attempt
> > > > > enter the same root qdisc from the top, the packet is dropped to break the
> > > > > loop.
> > > > >
> > > > > Reported-by: renmingshuai@huawei.com
> > > > > Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/
> > > > > Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()")
> > > > > Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops")
> > > > > Co-developed-by: Victor Nogueira <victor@mojatatu.com>
> > > > > Signed-off-by: Victor Nogueira <victor@mojatatu.com>
> > > > > Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> > > > > ---
> > > > >  include/net/sch_generic.h |  2 ++
> > > > >  net/core/dev.c            |  9 +++++++++
> > > > >  net/sched/sch_api.c       | 12 ++++++++++++
> > > > >  net/sched/sch_generic.c   |  2 ++
> > > > >  4 files changed, 25 insertions(+)
> > > > >
> > > > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > > > > index cefe0c4bdae3..f9f99df037ed 100644
> > > > > --- a/include/net/sch_generic.h
> > > > > +++ b/include/net/sch_generic.h
> > > > > @@ -125,6 +125,8 @@ struct Qdisc {
> > > > >         spinlock_t              busylock ____cacheline_aligned_in_smp;
> > > > >         spinlock_t              seqlock;
> > > > >
> > > > > +       u16 __percpu            *xmit_recursion;
> > > > > +
> > > > >         struct rcu_head         rcu;
> > > > >         netdevice_tracker       dev_tracker;
> > > > >         /* private data */
> > > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > > index 9a67003e49db..2b712388c06f 100644
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > @@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > > > >         if (unlikely(contended))
> > > > >                 spin_lock(&q->busylock);
> > > >
> > > > This could hang here (busylock)
> > >
> > > Notice the goto free_skb_list has an spin_unlock(&q->busylock);  in
> > > its code vicinity. Am I missing something?
> >
> > The hang would happen in above spin_lock(&q->busylock), before you can
> > get a chance...
> >
> > If you want to test your patch, add this debugging feature, pretending
> > the spinlock is contended.
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 818699dea9d7040ee74532ccdebf01c4fd6887cc..b2fe3aa2716f0fe128ef10f9d06c2431b3246933
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3816,7 +3816,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> >          * sent after the qdisc owner is scheduled again. To prevent this
> >          * scenario the task always serialize on the lock.
> >          */
> > -       contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
> > +       contended = true; // DEBUG for Jamal
> >         if (unlikely(contended))
> >                 spin_lock(&q->busylock);
>
> Will do.

Finally got time to look again. Probably being too clever, but moving
the check before the contended check resolves it as well. The only
strange thing is now with the latest net-next seems to be spitting
some false positive lockdep splat for the test of A->B->A (i am sure
it's fixable).

See attached. Didnt try the other idea, see if you like this one.

cheers,
jamal
> > >
> > > >
> > > > >
> > > > > +       if (__this_cpu_read(*q->xmit_recursion) > 0) {
> > > > > +               __qdisc_drop(skb, &to_free);
> > > > > +               rc = NET_XMIT_DROP;
> > > > > +               goto free_skb_list;
> > > > > +       }
> > > >
> > > >
> > > > I do not think we want to add yet another cache line miss and
> > > > complexity in tx fast path.
> > > >
> > >
> > > I empathize. The cache miss is due to a per-cpu variable? Otherwise
> > > that seems to be in the vicinity of the other fields being accessed in
> > > __dev_xmit_skb()
> > >
> > > > I think that mirred should  use a separate queue to kick a transmit
> > > > from the top level.
> > > >
> > > > (Like netif_rx() does)
> > > >
> > >
> > > Eric, here's my concern: this would entail restructuring mirred
> > > totally just to cater for one use case which is in itself _a bad
> > > config_ for egress qdisc case only.
> >
> > Why can't the bad config be detected in the ctl path ?
> >
>
> We actually discussed this. It looks simple until you dig in. To catch
> this issue we will need to store some state across ports. This can be
> done with a graph construct in the control plane. Assume the following
> cases matching header "foo" (all using htb or other classful qdiscs):
>
> case 1:
> flower match foo on eth0 redirect to egress of eth0
>
> This is easy to do. Your graph can check if src is eth0 and dst is eth0.
>
> Case 2:
> flower match foo on eth0 redirect to eth1
> flower match foo on eth1 redirect to eth0
>
> Now your graph has to keep state across netdevs. You can catch this
> issue as well (but your "checking" code is now growing).
>
> case 3:
> flower match foo on eth0 redirect to eth1
> flower match bar on eth1 redirect to eth0
>
> There is clearly no loop here because we have different headers, but
> you will have to add code to check for such a case.
>
> case 4:
> flower match foo on eth0 redirect to eth1
> u32 match foo on eth1 redirect to eth0
>
> Now you have to add code that checks all other classifiers(worse would
> be ebpf) for matching headers. Worse is you have to consider wild
> card. I am also likely missing some other use cases.
> Also, I am sure we'll very quickly hear from people who want very fast
> insertion rates of filters which now are going to be massively slowed
> down when using mirred.
> So i am sure it can be done, just not worth the effort given how many
> lines of code would need to be added to catch a corner case of bad
> config.
>
> >  Mirred is very heavily used in
> > > many use cases and changing its behavior could likely introduce other
> > > corner cases for some use cases which we would be chasing for a while.
> > > Not to forget now we have to go via an extra transient queue.
> > > If i understood what you are suggesting is to add an equivalent of
> > > backlog queu for the tx side? I am assuming in a very similar nature
> > > to backlog, meaning per cpu fired by softirq? or is it something
> > > closer to qdisc->gso_skb
> > > For either of those cases, the amount of infrastructure code there is
> > > not a few lines of code. And then there's the desire to break the loop
> > > etc.
> > >
> > > Some questions regarding your proposal - something I am not following
> > > And i may have misunderstood what you are suggesting, but i am missing
> > > what scenario mirred can directly call tcf_dev_queue_xmit() (see my
> > > comment below)..
> >
> > I wish the act path would run without qdisc spinlock held, and use RCU instead.
> >
>
> The main (only?) reason we need spinlock for qdiscs is for packet queues.
>
> > Instead, we are adding cost in fast path only to detect bad configs.
>
> Agreed but I dont know how we can totally avoid it. Would you be ok
> with us using something that looks like qdisc->gso_skb and we check
> such a list after releasing the qdisc lock in __dev_xmit_skb() to
> invoke the redirect? The loop then can be caught inside mirred.
>
> cheers,
> jamal