From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-yb1-f178.google.com (mail-yb1-f178.google.com [209.85.219.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4004815A4B3 for ; Tue, 2 Apr 2024 17:35:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.178 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712079338; cv=none; b=SGJBTrknJOUBovW0As6XGk1fKpSfosIqryRq7p05bk2O0iKP2Bh5qD0w6Nxi739bCeya4YCrDDGv7wA3+oAHT1iZCKogECty0Vt4SYA2eHKrDhmGMcAYE5RfFsQwqg16q+llIO96UhWriE8s5wmdToLYbtljB/GqVaIw4AIjRzI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712079338; c=relaxed/simple; bh=oK8MjyuNe6ZeoJ3MGK90LouWpavgSylqwcwMaIkQxYQ=; h=MIME-Version:References:In-Reply-To:From:Date:Message-ID:Subject: To:Cc:Content-Type; b=VuA2ekKP18mmrxJMagAXsGFWmfyytjBqt661xCe4cFWtmaBeiMYp83VCMJljTx7V/MijxGy7CpZ39c8AMRy0rgl66lrvP2b7zPAv1YKS+03XFHIYErC+97MNEh+pgwrsMT2TXlpefsSl7KtJNFsYi9rZ1DHTN+gUSWJz5gFJrWE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=mojatatu.com; spf=none smtp.mailfrom=mojatatu.com; dkim=pass (2048-bit key) header.d=mojatatu-com.20230601.gappssmtp.com header.i=@mojatatu-com.20230601.gappssmtp.com header.b=osejXlnR; arc=none smtp.client-ip=209.85.219.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=mojatatu.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=mojatatu.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=mojatatu-com.20230601.gappssmtp.com header.i=@mojatatu-com.20230601.gappssmtp.com header.b="osejXlnR" Received: by mail-yb1-f178.google.com with SMTP id 3f1490d57ef6-dcbc6a6808fso5079948276.2 for ; Tue, 02 Apr 2024 10:35:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mojatatu-com.20230601.gappssmtp.com; s=20230601; t=1712079335; x=1712684135; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=hWde6IPtBKdKAhJje1cmSJIbcu8x901SsVx48dEY2dw=; b=osejXlnRcXsABeAToZaFx405GlmwEYtlam6e1/xNOrFhk+ptD57eAfLV9/pGeZyF3b Q+a9dX8CXH26t4q9jUbPAmshSkZXBPnQXUx27B7R+jaNODy3UZ0GX6Ode9dt4ByVSuH7 F1t50JBBDcGpEJD2SEGxG08gizm6QURx2XYgfVl/w2Tax+UoZtsoyVcjDWWPrOW65bRZ EKLkG9ogds25cruixCPFol+ECVPNTRVgmGMoAb0U4C0SOzgGrV/6m4su893VgjEvwA9n coZyW23TZK/ODXlxNSmD5ezgTIcr/qCktP3sZfoANBA8Q8J+8o8RumdYn9xcOIhZgImw Oytg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712079335; x=1712684135; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=hWde6IPtBKdKAhJje1cmSJIbcu8x901SsVx48dEY2dw=; b=Mm5TM7nKz66Ehdk1uKdm1l4S8/yKKThv1SkWUgF1oMQXXgLOSyYXtpa2jDSB5a2ky/ qUjDan4NGBHaOujaASBKwgBOAYVCEj0KzkSFVZ3NMnRbptH8pxD9vZ+guvFGsBqupbnJ 5PVh9DYAOWtNSjO3krEJlcaZwtwdfi3ujS+KRWlgDnLWwUCKPNhfC4zlsPm+vBbA1scN DQ75Ny8Xmyo1CJxZEsavQ/SmE4vvtQCqreRXnqBJoUGWWn4MoN+IOeGHWDwh0GMsvvUJ p+MM9BJXBCivf42/R7c8sglJ+hi8OymZ/k00h3S4yj07uB4qkmMM44pTAK3waY1pAP6F aLMw== X-Forwarded-Encrypted: i=1; AJvYcCWPSLZ8bYZq0pvMQ6p4+ri/opZiFzTlkftJiUx1VtxCMirW8s3mUiwZhEdw9WNMD+g1sRFodq+XKB6XHvUF1LKFabHfpa4m X-Gm-Message-State: AOJu0YwpbyllXycs3+d1zuG9NqPM79/oInUii+a1aQkchQ9YnnzfJbXS wjq0kbATgwtVA+dapQoJdE/tomsreK9xkKjH2V4BG9Eo79g4gSBwBM77uXb/dYO8CvK1ZohiMKn ziySnCZWr3zzzfYrCRCEjS7O2IHSyVuQlYaCH X-Google-Smtp-Source: AGHT+IHS7kTRN36DYSWL+ijWgARRsbQhKfjvJWr45SwYdlGZo7f498jKew3ZGrBsJxCSoj/0gp782xgwdJfw1TZlX4s= X-Received: by 2002:a05:6902:4e9:b0:dcf:bc86:1020 with SMTP id w9-20020a05690204e900b00dcfbc861020mr10630160ybs.53.1712079335077; Tue, 02 Apr 2024 10:35:35 -0700 (PDT) Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 References: <20240326230319.190117-1-jhs@mojatatu.com> In-Reply-To: From: Jamal Hadi Salim Date: Tue, 2 Apr 2024 13:35:23 -0400 Message-ID: Subject: Re: [PATCH RFC net 1/1] net/sched: Fix mirred to self recursion To: Eric Dumazet Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, jiri@resnulli.us, xiyou.wangcong@gmail.com, netdev@vger.kernel.org, renmingshuai@huawei.com, Victor Nogueira Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Tue, Apr 2, 2024 at 12:47=E2=80=AFPM Eric Dumazet = wrote: > > On Wed, Mar 27, 2024 at 11:58=E2=80=AFPM Jamal Hadi Salim wrote: > > > > On Wed, Mar 27, 2024 at 9:23=E2=80=AFAM Eric Dumazet wrote: > > > > > > On Wed, Mar 27, 2024 at 12:03=E2=80=AFAM Jamal Hadi Salim wrote: > > > > > > > > When the mirred action is used on a classful egress qdisc and a pac= ket is > > > > mirrored or redirected to self we hit a qdisc lock deadlock. > > > > See trace below. > > > > > > > > [..... other info removed for brevity....] > > > > [ 82.890906] > > > > [ 82.890906] =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D > > > > [ 82.890906] WARNING: possible recursive locking detected > > > > [ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G = W > > > > [ 82.890906] -------------------------------------------- > > > > [ 82.890906] ping/418 is trying to acquire lock: > > > > [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: > > > > __dev_queue_xmit+0x1778/0x3550 > > > > [ 82.890906] > > > > [ 82.890906] but task is already holding lock: > > > > [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: > > > > __dev_queue_xmit+0x1778/0x3550 > > > > [ 82.890906] > > > > [ 82.890906] other info that might help us debug this: > > > > [ 82.890906] Possible unsafe locking scenario: > > > > [ 82.890906] > > > > [ 82.890906] CPU0 > > > > [ 82.890906] ---- > > > > [ 82.890906] lock(&sch->q.lock); > > > > [ 82.890906] lock(&sch->q.lock); > > > > [ 82.890906] > > > > [ 82.890906] *** DEADLOCK *** > > > > [ 82.890906] > > > > [..... other info removed for brevity....] > > > > > > > > Example setup (eth0->eth0) to recreate > > > > tc qdisc add dev eth0 root handle 1: htb default 30 > > > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \ > > > > action mirred egress redirect dev eth0 > > > > > > > > Another example(eth0->eth1->eth0) to recreate > > > > tc qdisc add dev eth0 root handle 1: htb default 30 > > > > tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \ > > > > action mirred egress redirect dev eth1 > > > > > > > > tc qdisc add dev eth1 root handle 1: htb default 30 > > > > tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \ > > > > action mirred egress redirect dev eth0 > > > > > > > > We fix this by adding a per-cpu, per-qdisc recursion counter which = is > > > > incremented the first time a root qdisc is entered and on a second = attempt > > > > enter the same root qdisc from the top, the packet is dropped to br= eak the > > > > loop. > > > > > > > > Reported-by: renmingshuai@huawei.com > > > > Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmin= gshuai@huawei.com/ > > > > Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_acti= on()") > > > > Fixes: e578d9c02587 ("net: sched: use counter to break reclassify l= oops") > > > > Co-developed-by: Victor Nogueira > > > > Signed-off-by: Victor Nogueira > > > > Signed-off-by: Jamal Hadi Salim > > > > --- > > > > include/net/sch_generic.h | 2 ++ > > > > net/core/dev.c | 9 +++++++++ > > > > net/sched/sch_api.c | 12 ++++++++++++ > > > > net/sched/sch_generic.c | 2 ++ > > > > 4 files changed, 25 insertions(+) > > > > > > > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h > > > > index cefe0c4bdae3..f9f99df037ed 100644 > > > > --- a/include/net/sch_generic.h > > > > +++ b/include/net/sch_generic.h > > > > @@ -125,6 +125,8 @@ struct Qdisc { > > > > spinlock_t busylock ____cacheline_aligned_in_s= mp; > > > > spinlock_t seqlock; > > > > > > > > + u16 __percpu *xmit_recursion; > > > > + > > > > struct rcu_head rcu; > > > > netdevice_tracker dev_tracker; > > > > /* private data */ > > > > diff --git a/net/core/dev.c b/net/core/dev.c > > > > index 9a67003e49db..2b712388c06f 100644 > > > > --- a/net/core/dev.c > > > > +++ b/net/core/dev.c > > > > @@ -3789,6 +3789,13 @@ static inline int __dev_xmit_skb(struct sk_b= uff *skb, struct Qdisc *q, > > > > if (unlikely(contended)) > > > > spin_lock(&q->busylock); > > > > > > This could hang here (busylock) > > > > Notice the goto free_skb_list has an spin_unlock(&q->busylock); in > > its code vicinity. Am I missing something? > > The hang would happen in above spin_lock(&q->busylock), before you can > get a chance... > > If you want to test your patch, add this debugging feature, pretending > the spinlock is contended. > > diff --git a/net/core/dev.c b/net/core/dev.c > index 818699dea9d7040ee74532ccdebf01c4fd6887cc..b2fe3aa2716f0fe128ef10f9d= 06c2431b3246933 > 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -3816,7 +3816,7 @@ static inline int __dev_xmit_skb(struct sk_buff > *skb, struct Qdisc *q, > * sent after the qdisc owner is scheduled again. To prevent this > * scenario the task always serialize on the lock. > */ > - contended =3D qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT= ); > + contended =3D true; // DEBUG for Jamal > if (unlikely(contended)) > spin_lock(&q->busylock); Will do. > > > > > > > > > > > > > + if (__this_cpu_read(*q->xmit_recursion) > 0) { > > > > + __qdisc_drop(skb, &to_free); > > > > + rc =3D NET_XMIT_DROP; > > > > + goto free_skb_list; > > > > + } > > > > > > > > > I do not think we want to add yet another cache line miss and > > > complexity in tx fast path. > > > > > > > I empathize. The cache miss is due to a per-cpu variable? Otherwise > > that seems to be in the vicinity of the other fields being accessed in > > __dev_xmit_skb() > > > > > I think that mirred should use a separate queue to kick a transmit > > > from the top level. > > > > > > (Like netif_rx() does) > > > > > > > Eric, here's my concern: this would entail restructuring mirred > > totally just to cater for one use case which is in itself _a bad > > config_ for egress qdisc case only. > > Why can't the bad config be detected in the ctl path ? > We actually discussed this. It looks simple until you dig in. To catch this issue we will need to store some state across ports. This can be done with a graph construct in the control plane. Assume the following cases matching header "foo" (all using htb or other classful qdiscs): case 1: flower match foo on eth0 redirect to egress of eth0 This is easy to do. Your graph can check if src is eth0 and dst is eth0. Case 2: flower match foo on eth0 redirect to eth1 flower match foo on eth1 redirect to eth0 Now your graph has to keep state across netdevs. You can catch this issue as well (but your "checking" code is now growing). case 3: flower match foo on eth0 redirect to eth1 flower match bar on eth1 redirect to eth0 There is clearly no loop here because we have different headers, but you will have to add code to check for such a case. case 4: flower match foo on eth0 redirect to eth1 u32 match foo on eth1 redirect to eth0 Now you have to add code that checks all other classifiers(worse would be ebpf) for matching headers. Worse is you have to consider wild card. I am also likely missing some other use cases. Also, I am sure we'll very quickly hear from people who want very fast insertion rates of filters which now are going to be massively slowed down when using mirred. So i am sure it can be done, just not worth the effort given how many lines of code would need to be added to catch a corner case of bad config. > Mirred is very heavily used in > > many use cases and changing its behavior could likely introduce other > > corner cases for some use cases which we would be chasing for a while. > > Not to forget now we have to go via an extra transient queue. > > If i understood what you are suggesting is to add an equivalent of > > backlog queu for the tx side? I am assuming in a very similar nature > > to backlog, meaning per cpu fired by softirq? or is it something > > closer to qdisc->gso_skb > > For either of those cases, the amount of infrastructure code there is > > not a few lines of code. And then there's the desire to break the loop > > etc. > > > > Some questions regarding your proposal - something I am not following > > And i may have misunderstood what you are suggesting, but i am missing > > what scenario mirred can directly call tcf_dev_queue_xmit() (see my > > comment below).. > > I wish the act path would run without qdisc spinlock held, and use RCU in= stead. > The main (only?) reason we need spinlock for qdiscs is for packet queues. > Instead, we are adding cost in fast path only to detect bad configs. Agreed but I dont know how we can totally avoid it. Would you be ok with us using something that looks like qdisc->gso_skb and we check such a list after releasing the qdisc lock in __dev_xmit_skb() to invoke the redirect? The loop then can be caught inside mirred. cheers, jamal