netdev.vger.kernel.org archive mirror
* [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact
@ 2023-05-24  1:16 Peilin Ye
  2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
                   ` (6 more replies)
  0 siblings, 7 replies; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:16 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

Link to v4: https://lore.kernel.org/r/cover.1684825171.git.peilin.ye@bytedance.com/
Link to v3 (incomplete): https://lore.kernel.org/r/cover.1684821877.git.peilin.ye@bytedance.com/
Link to v2: https://lore.kernel.org/r/cover.1684796705.git.peilin.ye@bytedance.com/
Link to v1: https://lore.kernel.org/r/cover.1683326865.git.peilin.ye@bytedance.com/

Hi all,

These are v5 fixes for the ingress and clsact Qdiscs.  Please take another
look at patches 1, 2, and 6, thanks!

Changes in v5:
  - for [6/6], reinitialize @q, @p (suggested by Vlad) and @tcm after the
    "replay:" tag
  - for [1,2/6], do nothing in ->destroy() if ->parent isn't ffff:fff1, as
    reported by Pedro

Change in v3, v4:
  - add in-body From: tags

Changes in v2:
  - for [1-5/6], include tags from Jamal and Pedro
  - for [6/6], as suggested by Vlad, replay the request if the current
    Qdisc has any ongoing (RTNL-unlocked) filter requests, instead of
    returning -EBUSY to the user
  - use Closes: tag as warned by checkpatch

[1,2/6]: ingress and clsact Qdiscs should only be created under ffff:fff1
  [3/6]: Under ffff:fff1, only create ingress and clsact Qdiscs (for now,
         at least)
  [4/6]: After creating ingress and clsact Qdiscs under ffff:fff1, do not
         graft them again to anywhere else (e.g. as the inner Qdisc of a
         TBF Qdisc)
  [5/6]: Prepare for [6/6], do not reuse that for-loop in qdisc_graft()
         for ingress and clsact Qdiscs
  [6/6]: Fix use-after-free [a] in mini_qdisc_pair_swap()

[a] https://syzkaller.appspot.com/bug?extid=b53a9c0d1ea4ad62da8b

Thanks,
Peilin Ye (6):
  net/sched: sch_ingress: Only create under TC_H_INGRESS
  net/sched: sch_clsact: Only create under TC_H_CLSACT
  net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact)
    Qdiscs
  net/sched: Prohibit regrafting ingress or clsact Qdiscs
  net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
  net/sched: qdisc_destroy() old ingress and clsact Qdiscs before
    grafting

 include/net/sch_generic.h |  8 +++++
 net/sched/sch_api.c       | 68 ++++++++++++++++++++++++++++-----------
 net/sched/sch_generic.c   | 14 ++++++--
 net/sched/sch_ingress.c   | 16 +++++++--
 4 files changed, 83 insertions(+), 23 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
@ 2023-05-24  1:17 ` Peilin Ye
  2023-05-24 15:37   ` Pedro Tammela
  2023-05-24 15:57   ` Jamal Hadi Salim
  2023-05-24  1:18 ` [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:17 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
Return -EOPNOTSUPP if 'parent' is not TC_H_INGRESS, similar to
mq_init().

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v5:
  - avoid underflowing @ingress_needed_key in ->destroy(), reported by
    Pedro

change in v3, v4:
  - add in-body From: tag

 net/sched/sch_ingress.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 84838128b9c5..f9ef6deb2770 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
 	struct net_device *dev = qdisc_dev(sch);
 	int err;
 
+	if (sch->parent != TC_H_INGRESS)
+		return -EOPNOTSUPP;
+
 	net_inc_ingress_queue();
 
 	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
@@ -101,6 +104,9 @@ static void ingress_destroy(struct Qdisc *sch)
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
 
+	if (sch->parent != TC_H_INGRESS)
+		return;
+
 	tcf_block_put_ext(q->block, sch, &q->block_info);
 	net_dec_ingress_queue();
 }
-- 
2.20.1


* [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
  2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
@ 2023-05-24  1:18 ` Peilin Ye
  2023-05-24 15:38   ` Pedro Tammela
  2023-05-24  1:19 ` [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:18 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
equals TC_H_INGRESS).  Return -EOPNOTSUPP if 'parent' is not
TC_H_CLSACT.

Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v5:
  - avoid underflowing @egress_needed_key in ->destroy(), reported by
    Pedro

change in v3, v4:
  - add in-body From: tag

 net/sched/sch_ingress.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index f9ef6deb2770..35963929e117 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -225,6 +225,9 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
 	struct net_device *dev = qdisc_dev(sch);
 	int err;
 
+	if (sch->parent != TC_H_CLSACT)
+		return -EOPNOTSUPP;
+
 	net_inc_ingress_queue();
 	net_inc_egress_queue();
 
@@ -254,6 +257,9 @@ static void clsact_destroy(struct Qdisc *sch)
 {
 	struct clsact_sched_data *q = qdisc_priv(sch);
 
+	if (sch->parent != TC_H_CLSACT)
+		return;
+
 	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
 	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
 
-- 
2.20.1


* [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
  2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
  2023-05-24  1:18 ` [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
@ 2023-05-24  1:19 ` Peilin Ye
  2023-05-24 15:38   ` Pedro Tammela
  2023-05-24  1:19 ` [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:19 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
(TC_H_INGRESS, TC_H_CLSACT):

  $ ip link add name ifb0 type ifb
  $ tc qdisc add dev ifb0 parent ffff:fff1 htb
  $ tc qdisc add dev ifb0 clsact
  Error: Exclusivity flag on, cannot modify.
  $ drgn
  ...
  >>> ifb0 = netdev_get_by_name(prog, "ifb0")
  >>> qdisc = ifb0.ingress_queue.qdisc_sleeping
  >>> print(qdisc.ops.id.string_().decode())
  htb
  >>> qdisc.flags.value_() # TCQ_F_INGRESS
  2

Only allow ingress and clsact Qdiscs under ffff:fff1.  Return -EINVAL
for everything else.  Make TCQ_F_INGRESS a static flag of ingress and
clsact Qdiscs.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v3, v4:
  - add in-body From: tag

 net/sched/sch_api.c     | 7 ++++++-
 net/sched/sch_ingress.c | 4 ++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index fdb8f429333d..383195955b7d 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1252,7 +1252,12 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
 	sch->parent = parent;
 
 	if (handle == TC_H_INGRESS) {
-		sch->flags |= TCQ_F_INGRESS;
+		if (!(sch->flags & TCQ_F_INGRESS)) {
+			NL_SET_ERR_MSG(extack,
+				       "Specified parent ID is reserved for ingress and clsact Qdiscs");
+			err = -EINVAL;
+			goto err_out3;
+		}
 		handle = TC_H_MAKE(TC_H_INGRESS, 0);
 	} else {
 		if (handle == 0) {
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 35963929e117..e43a45499372 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -140,7 +140,7 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
 	.cl_ops			=	&ingress_class_ops,
 	.id			=	"ingress",
 	.priv_size		=	sizeof(struct ingress_sched_data),
-	.static_flags		=	TCQ_F_CPUSTATS,
+	.static_flags		=	TCQ_F_INGRESS | TCQ_F_CPUSTATS,
 	.init			=	ingress_init,
 	.destroy		=	ingress_destroy,
 	.dump			=	ingress_dump,
@@ -281,7 +281,7 @@ static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
 	.cl_ops			=	&clsact_class_ops,
 	.id			=	"clsact",
 	.priv_size		=	sizeof(struct clsact_sched_data),
-	.static_flags		=	TCQ_F_CPUSTATS,
+	.static_flags		=	TCQ_F_INGRESS | TCQ_F_CPUSTATS,
 	.init			=	clsact_init,
 	.destroy		=	clsact_destroy,
 	.dump			=	ingress_dump,
-- 
2.20.1


* [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
                   ` (2 preceding siblings ...)
  2023-05-24  1:19 ` [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
@ 2023-05-24  1:19 ` Peilin Ye
  2023-05-24 15:38   ` Pedro Tammela
  2023-05-24  1:20 ` [PATCH v5 net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:19 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Currently, after creating an ingress (or clsact) Qdisc and grafting it
under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
e.g. a TBF Qdisc:

  $ ip link add ifb0 type ifb
  $ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
  $ tc qdisc add dev ifb0 clsact
  $ tc qdisc link dev ifb0 handle ffff: parent 1:1
  $ tc qdisc show dev ifb0
  qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
  qdisc clsact ffff: parent ffff:fff1 refcnt 2
                                      ^^^^^^^^

clsact's refcount has increased: it is now grafted under both
TC_H_CLSACT and 1:1.

ingress and clsact Qdiscs should only be used under TC_H_INGRESS
(TC_H_CLSACT).  Prohibit regrafting them.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v3, v4:
  - add in-body From: tag

 net/sched/sch_api.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 383195955b7d..49b9c1bbfdd9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1596,6 +1596,11 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 					NL_SET_ERR_MSG(extack, "Invalid qdisc name");
 					return -EINVAL;
 				}
+				if (q->flags & TCQ_F_INGRESS) {
+					NL_SET_ERR_MSG(extack,
+						       "Cannot regraft ingress or clsact Qdiscs");
+					return -EINVAL;
+				}
 				if (q == p ||
 				    (p && check_loop(q, p, 0))) {
 					NL_SET_ERR_MSG(extack, "Qdisc parent/child loop detected");
-- 
2.20.1


* [PATCH v5 net 5/6] net/sched: Refactor qdisc_graft() for ingress and clsact Qdiscs
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
                   ` (3 preceding siblings ...)
  2023-05-24  1:19 ` [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
@ 2023-05-24  1:20 ` Peilin Ye
  2023-05-24  1:20 ` [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
  2023-05-25 17:16 ` [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Vlad Buslov
  6 siblings, 0 replies; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:20 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Grafting ingress and clsact Qdiscs does not need a for-loop in
qdisc_graft().  Refactor it.  No functional changes intended.

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Pedro Tammela <pctammela@mojatatu.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v3, v4:
  - add in-body From: tag

 net/sched/sch_api.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 49b9c1bbfdd9..f72a581666a2 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1073,12 +1073,12 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 	if (parent == NULL) {
 		unsigned int i, num_q, ingress;
+		struct netdev_queue *dev_queue;
 
 		ingress = 0;
 		num_q = dev->num_tx_queues;
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
-			num_q = 1;
 			ingress = 1;
 			if (!dev_ingress_queue(dev)) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
@@ -1094,18 +1094,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if (new && new->ops->attach && !ingress)
 			goto skip;
 
-		for (i = 0; i < num_q; i++) {
-			struct netdev_queue *dev_queue = dev_ingress_queue(dev);
-
-			if (!ingress)
+		if (!ingress) {
+			for (i = 0; i < num_q; i++) {
 				dev_queue = netdev_get_tx_queue(dev, i);
+				old = dev_graft_qdisc(dev_queue, new);
 
-			old = dev_graft_qdisc(dev_queue, new);
-			if (new && i > 0)
-				qdisc_refcount_inc(new);
-
-			if (!ingress)
+				if (new && i > 0)
+					qdisc_refcount_inc(new);
 				qdisc_put(old);
+			}
+		} else {
+			dev_queue = dev_ingress_queue(dev);
+			old = dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
-- 
2.20.1


* [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
                   ` (4 preceding siblings ...)
  2023-05-24  1:20 ` [PATCH v5 net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
@ 2023-05-24  1:20 ` Peilin Ye
  2023-05-24 15:39   ` Pedro Tammela
  2023-05-25 17:16 ` [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Vlad Buslov
  6 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-24  1:20 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Pedro Tammela, Hillf Danton, netdev, linux-kernel, Cong Wang,
	Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
clsact Qdiscs and miniq_egress.

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state.  Fixing without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions).  Instead, make sure
@new's first call always happens after @old's last call, in
qdisc_destroy(), has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdiscs a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdiscs b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v5:
  - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
    just like @flags in tc_new_tfilter()

change in v3, v4:
  - add in-body From: tag

changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..286b7c58f5b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* Replay if the current ingress (or clsact) Qdisc has ongoing
+			 * RTNL-unlocked filter request(s).  This is the counterpart of that
+			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+				return -EAGAIN;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct tcmsg *tcm = nlmsg_data(n);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
+	struct Qdisc *q, *p;
+	struct tcmsg *tcm;
 	u32 clid;
-	struct Qdisc *q = NULL;
-	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
+	tcm = nlmsg_data(n);
+	q = p = NULL;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return -ENODEV;
@@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
-- 
2.20.1


* Re: [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS
  2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
@ 2023-05-24 15:37   ` Pedro Tammela
  2023-05-24 15:57   ` Jamal Hadi Salim
  1 sibling, 0 replies; 45+ messages in thread
From: Pedro Tammela @ 2023-05-24 15:37 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Hillf Danton, netdev, linux-kernel, Cong Wang

On 23/05/2023 22:17, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
> 
> ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
> Return -EOPNOTSUPP if 'parent' is not TC_H_INGRESS, similar to
> mq_init().
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v5:
>    - avoid underflowing @ingress_needed_key in ->destroy(), reported by
>      Pedro
> 
> change in v3, v4:
>    - add in-body From: tag
> 
>   net/sched/sch_ingress.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..f9ef6deb2770 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>   	struct net_device *dev = qdisc_dev(sch);
>   	int err;
>   
> +	if (sch->parent != TC_H_INGRESS)
> +		return -EOPNOTSUPP;
> +
>   	net_inc_ingress_queue();
>   
>   	mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> @@ -101,6 +104,9 @@ static void ingress_destroy(struct Qdisc *sch)
>   {
>   	struct ingress_sched_data *q = qdisc_priv(sch);
>   
> +	if (sch->parent != TC_H_INGRESS)
> +		return;
> +
>   	tcf_block_put_ext(q->block, sch, &q->block_info);
>   	net_dec_ingress_queue();
>   }


* Re: [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT
  2023-05-24  1:18 ` [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
@ 2023-05-24 15:38   ` Pedro Tammela
  2023-05-24 15:58     ` Jamal Hadi Salim
  0 siblings, 1 reply; 45+ messages in thread
From: Pedro Tammela @ 2023-05-24 15:38 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Hillf Danton, netdev, linux-kernel, Cong Wang

On 23/05/2023 22:18, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
> 
> clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
> equals TC_H_INGRESS).  Return -EOPNOTSUPP if 'parent' is not
> TC_H_CLSACT.
> 
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v5:
>    - avoid underflowing @egress_needed_key in ->destroy(), reported by
>      Pedro
> 
> change in v3, v4:
>    - add in-body From: tag
> 
>   net/sched/sch_ingress.c | 6 ++++++
>   1 file changed, 6 insertions(+)
> 
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index f9ef6deb2770..35963929e117 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -225,6 +225,9 @@ static int clsact_init(struct Qdisc *sch, struct nlattr *opt,
>   	struct net_device *dev = qdisc_dev(sch);
>   	int err;
>   
> +	if (sch->parent != TC_H_CLSACT)
> +		return -EOPNOTSUPP;
> +
>   	net_inc_ingress_queue();
>   	net_inc_egress_queue();
>   
> @@ -254,6 +257,9 @@ static void clsact_destroy(struct Qdisc *sch)
>   {
>   	struct clsact_sched_data *q = qdisc_priv(sch);
>   
> +	if (sch->parent != TC_H_CLSACT)
> +		return;
> +
>   	tcf_block_put_ext(q->egress_block, sch, &q->egress_block_info);
>   	tcf_block_put_ext(q->ingress_block, sch, &q->ingress_block_info);
>   


* Re: [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs
  2023-05-24  1:19 ` [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
@ 2023-05-24 15:38   ` Pedro Tammela
  0 siblings, 0 replies; 45+ messages in thread
From: Pedro Tammela @ 2023-05-24 15:38 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Hillf Danton, netdev, linux-kernel, Cong Wang

On 23/05/2023 22:19, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
> 
> Currently it is possible to add e.g. an HTB Qdisc under ffff:fff1
> (TC_H_INGRESS, TC_H_CLSACT):
> 
>    $ ip link add name ifb0 type ifb
>    $ tc qdisc add dev ifb0 parent ffff:fff1 htb
>    $ tc qdisc add dev ifb0 clsact
>    Error: Exclusivity flag on, cannot modify.
>    $ drgn
>    ...
>    >>> ifb0 = netdev_get_by_name(prog, "ifb0")
>    >>> qdisc = ifb0.ingress_queue.qdisc_sleeping
>    >>> print(qdisc.ops.id.string_().decode())
>    htb
>    >>> qdisc.flags.value_() # TCQ_F_INGRESS
>    2
> 
> Only allow ingress and clsact Qdiscs under ffff:fff1.  Return -EINVAL
> for everything else.  Make TCQ_F_INGRESS a static flag of ingress and
> clsact Qdiscs.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v3, v4:
>    - add in-body From: tag
> 
>   net/sched/sch_api.c     | 7 ++++++-
>   net/sched/sch_ingress.c | 4 ++--
>   2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index fdb8f429333d..383195955b7d 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1252,7 +1252,12 @@ static struct Qdisc *qdisc_create(struct net_device *dev,
>   	sch->parent = parent;
>   
>   	if (handle == TC_H_INGRESS) {
> -		sch->flags |= TCQ_F_INGRESS;
> +		if (!(sch->flags & TCQ_F_INGRESS)) {
> +			NL_SET_ERR_MSG(extack,
> +				       "Specified parent ID is reserved for ingress and clsact Qdiscs");
> +			err = -EINVAL;
> +			goto err_out3;
> +		}
>   		handle = TC_H_MAKE(TC_H_INGRESS, 0);
>   	} else {
>   		if (handle == 0) {
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 35963929e117..e43a45499372 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -140,7 +140,7 @@ static struct Qdisc_ops ingress_qdisc_ops __read_mostly = {
>   	.cl_ops			=	&ingress_class_ops,
>   	.id			=	"ingress",
>   	.priv_size		=	sizeof(struct ingress_sched_data),
> -	.static_flags		=	TCQ_F_CPUSTATS,
> +	.static_flags		=	TCQ_F_INGRESS | TCQ_F_CPUSTATS,
>   	.init			=	ingress_init,
>   	.destroy		=	ingress_destroy,
>   	.dump			=	ingress_dump,
> @@ -281,7 +281,7 @@ static struct Qdisc_ops clsact_qdisc_ops __read_mostly = {
>   	.cl_ops			=	&clsact_class_ops,
>   	.id			=	"clsact",
>   	.priv_size		=	sizeof(struct clsact_sched_data),
> -	.static_flags		=	TCQ_F_CPUSTATS,
> +	.static_flags		=	TCQ_F_INGRESS | TCQ_F_CPUSTATS,
>   	.init			=	clsact_init,
>   	.destroy		=	clsact_destroy,
>   	.dump			=	ingress_dump,


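[Editor's note] Patch 3's two changes work together: ->static_flags is OR-ed into sch->flags when the Qdisc is allocated, and qdisc_create() then refuses a TC_H_INGRESS handle unless TCQ_F_INGRESS is already set. A hedged userspace sketch (function name invented; TCQ_F_INGRESS == 2 matches the drgn output in the commit message):

```c
#include <stdint.h>

#define TC_H_INGRESS  0xFFFFFFF1U
#define TCQ_F_INGRESS 2  /* matches the drgn output above */
#define EINVAL        22

/* Illustrative model: sch->flags inherits ops->static_flags at
 * allocation time, so after patch 3 only ingress/clsact carry
 * TCQ_F_INGRESS, and only they pass this check. */
static int qdisc_create_check(uint32_t handle, uint32_t static_flags)
{
	uint32_t flags = static_flags;

	if (handle == TC_H_INGRESS && !(flags & TCQ_F_INGRESS))
		return -EINVAL; /* e.g. HTB under ffff:fff1 is now rejected */
	return 0;
}
```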

* Re: [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs
  2023-05-24  1:19 ` [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
@ 2023-05-24 15:38   ` Pedro Tammela
  0 siblings, 0 replies; 45+ messages in thread
From: Pedro Tammela @ 2023-05-24 15:38 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Hillf Danton, netdev, linux-kernel, Cong Wang

On 23/05/2023 22:19, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
> 
> Currently, after creating an ingress (or clsact) Qdisc and grafting it
> under TC_H_INGRESS (TC_H_CLSACT), it is possible to graft it again under
> e.g. a TBF Qdisc:
> 
>    $ ip link add ifb0 type ifb
>    $ tc qdisc add dev ifb0 handle 1: root tbf rate 20kbit buffer 1600 limit 3000
>    $ tc qdisc add dev ifb0 clsact
>    $ tc qdisc link dev ifb0 handle ffff: parent 1:1
>    $ tc qdisc show dev ifb0
>    qdisc tbf 1: root refcnt 2 rate 20Kbit burst 1600b lat 560.0ms
>    qdisc clsact ffff: parent ffff:fff1 refcnt 2
>                                        ^^^^^^^^
> 
> clsact's refcount has increased: it is now grafted under both
> TC_H_CLSACT and 1:1.
> 
> ingress and clsact Qdiscs should only be used under TC_H_INGRESS
> (TC_H_CLSACT).  Prohibit regrafting them.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v3, v4:
>    - add in-body From: tag
> 
>   net/sched/sch_api.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 383195955b7d..49b9c1bbfdd9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1596,6 +1596,11 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   					NL_SET_ERR_MSG(extack, "Invalid qdisc name");
>   					return -EINVAL;
>   				}
> +				if (q->flags & TCQ_F_INGRESS) {
> +					NL_SET_ERR_MSG(extack,
> +						       "Cannot regraft ingress or clsact Qdiscs");
> +					return -EINVAL;
> +				}
>   				if (q == p ||
>   				    (p && check_loop(q, p, 0))) {
>   					NL_SET_ERR_MSG(extack, "Qdisc parent/child loop detected");


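[Editor's note] Patch 4's check can be pictured with a toy refcount model (hypothetical names, single-threaded, not kernel code): before the fix, `tc qdisc link` on an already-grafted clsact simply took another reference, which is what the `refcnt 2` in the commit message shows.

```c
#include <stdint.h>

#define TCQ_F_INGRESS 2
#define EINVAL        22

struct toy_qdisc {
	uint32_t flags;
	int refcnt; /* plain int; this is only a single-threaded model */
};

/* Model of grafting a qdisc as the child of another (e.g. TBF's 1:1).
 * After patch 4, ingress/clsact are rejected instead of re-grafted. */
static int toy_graft_as_child(struct toy_qdisc *q)
{
	if (q->flags & TCQ_F_INGRESS)
		return -EINVAL; /* "Cannot regraft ingress or clsact Qdiscs" */
	q->refcnt++; /* before the fix, this path just took another reference */
	return 0;
}
```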

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-24  1:20 ` [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
@ 2023-05-24 15:39   ` Pedro Tammela
  2023-05-24 16:09     ` Jamal Hadi Salim
  2023-05-26 12:20     ` Jamal Hadi Salim
  0 siblings, 2 replies; 45+ messages in thread
From: Pedro Tammela @ 2023-05-24 15:39 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, Daniel Borkmann, John Fastabend, Vlad Buslov,
	Hillf Danton, netdev, linux-kernel, Cong Wang

On 23/05/2023 22:20, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
> 
> mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> access this per-net_device pointer in mini_qdisc_pair_swap().  The same
> applies to clsact Qdiscs and miniq_egress.
> 
> Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> requests (thanks to Hillf Danton for the hint), when replacing ingress or
> clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> causing race conditions [1] including a use-after-free bug in
> mini_qdisc_pair_swap() reported by syzbot:
> 
>   BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>   Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> ...
>   Call Trace:
>    <TASK>
>    __dump_stack lib/dump_stack.c:88 [inline]
>    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
>    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
>    print_report mm/kasan/report.c:430 [inline]
>    kasan_report+0x11c/0x130 mm/kasan/report.c:536
>    mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
>    tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
>    tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
>    tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
>    tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
>    tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> ...
> 
> @old and @new should not affect each other.  In other words, @old should
> never modify miniq_{in,e}gress after @new does, and @new should not update
> @old's RCU state.  Fixing this without changing sch_api.c turned out to be
> difficult (please refer to the Closes: link for discussions).  Instead,
> make sure that @new's first call always happens after @old's last call, in
> qdisc_destroy(), has finished:
> 
> In qdisc_graft(), return -EAGAIN and tell the caller to replay
> (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> requests, and call qdisc_destroy() for @old before grafting @new.
> 
> Introduce qdisc_refcount_dec_if_one() as the counterpart of
> qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
> a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> just like qdisc_put() etc.
> 
> Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> Qdiscs".
> 
> [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> create under TC_H_INGRESS") on eth0, which has 8 transmission queues:
> 
>    Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
>    adds a flower filter X to A.
> 
>    Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
>    b2) to replace A, then adds a flower filter Y to B.
> 
>   Thread 1               A's refcnt   Thread 2
>    RTM_NEWQDISC (A, RTNL-locked)
>     qdisc_create(A)               1
>     qdisc_graft(A)                9
> 
>    RTM_NEWTFILTER (X, RTNL-unlocked)
>     __tcf_qdisc_find(A)          10
>     tcf_chain0_head_change(A)
>     mini_qdisc_pair_swap(A) (1st)
>              |
>              |                         RTM_NEWQDISC (B, RTNL-locked)
>           RCU sync                2     qdisc_graft(B)
>              |                    1     notify_and_destroy(A)
>              |
>     tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
>     qdisc_destroy(A)                    tcf_chain0_head_change(B)
>     tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
>     mini_qdisc_pair_swap(A) (3rd)                |
>             ...                                 ...
> 
> Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> on eth0 will not find filter Y in sch_handle_ingress().
> 
> This is only one of the possible consequences of concurrently accessing
> miniq_{in,e}gress pointers.  The point is clear though: again, A should
> never modify those per-net_device pointers after B does, and B should not
> update A's RCU state.
> 
> Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Cc: Hillf Danton <hdanton@sina.com>
> Cc: Vlad Buslov <vladbu@mellanox.com>
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Tested-by: Pedro Tammela <pctammela@mojatatu.com>

> ---
> change in v5:
>    - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
>      just like @flags in tc_new_tfilter()
> 
> change in v3, v4:
>    - add in-body From: tag
> 
> changes in v2:
>    - replay the request if the current Qdisc has any ongoing RTNL-unlocked
>      filter requests (Vlad)
>    - minor changes in code comments and commit log
> 
>   include/net/sch_generic.h |  8 ++++++++
>   net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
>   net/sched/sch_generic.c   | 14 +++++++++++---
>   3 files changed, 50 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index fab5ba3e61b7..3e9cc43cbc90 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
>   	refcount_inc(&qdisc->refcnt);
>   }
>   
> +static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return true;
> +	return refcount_dec_if_one(&qdisc->refcnt);
> +}
> +
>   /* Intended to be used by unlocked users, when concurrent qdisc release is
>    * possible.
>    */
> @@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
>   struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
>   			      struct Qdisc *qdisc);
>   void qdisc_reset(struct Qdisc *qdisc);
> +void qdisc_destroy(struct Qdisc *qdisc);
>   void qdisc_put(struct Qdisc *qdisc);
>   void qdisc_put_unlocked(struct Qdisc *qdisc);
>   void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index f72a581666a2..286b7c58f5b9 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   		if ((q && q->flags & TCQ_F_INGRESS) ||
>   		    (new && new->flags & TCQ_F_INGRESS)) {
>   			ingress = 1;
> -			if (!dev_ingress_queue(dev)) {
> +			dev_queue = dev_ingress_queue(dev);
> +			if (!dev_queue) {
>   				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
>   				return -ENOENT;
>   			}
> +
> +			/* Replay if the current ingress (or clsact) Qdisc has ongoing
> +			 * RTNL-unlocked filter request(s).  This is the counterpart of that
> +			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +			 */
> +			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +				return -EAGAIN;
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   				qdisc_put(old);
>   			}
>   		} else {
> -			dev_queue = dev_ingress_queue(dev);
> -			old = dev_graft_qdisc(dev_queue, new);
> +			old = dev_graft_qdisc(dev_queue, NULL);
> +
> +			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
> +			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
> +			 * pointer(s) in mini_qdisc_pair_swap().
> +			 */
> +			qdisc_notify(net, skb, n, classid, old, new, extack);
> +			qdisc_destroy(old);
> +
> +			dev_graft_qdisc(dev_queue, new);
>   		}
>   
>   skip:
> @@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>   
>   			if (new && new->ops->attach)
>   				new->ops->attach(new);
> -		} else {
> -			notify_and_destroy(net, skb, n, classid, old, new, extack);
>   		}
>   
>   		if (dev->flags & IFF_UP)
> @@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   			struct netlink_ext_ack *extack)
>   {
>   	struct net *net = sock_net(skb->sk);
> -	struct tcmsg *tcm = nlmsg_data(n);
>   	struct nlattr *tca[TCA_MAX + 1];
>   	struct net_device *dev;
> +	struct Qdisc *q, *p;
> +	struct tcmsg *tcm;
>   	u32 clid;
> -	struct Qdisc *q = NULL;
> -	struct Qdisc *p = NULL;
>   	int err;
>   
> +replay:
>   	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
>   				     rtm_tca_policy, extack);
>   	if (err < 0)
>   		return err;
>   
> +	tcm = nlmsg_data(n);
> +	q = p = NULL;
> +
>   	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
>   	if (!dev)
>   		return -ENODEV;
> @@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   			return -ENOENT;
>   		}
>   		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
> -		if (err != 0)
> +		if (err != 0) {
> +			if (err == -EAGAIN)
> +				goto replay;
>   			return err;
> +		}
>   	} else {
>   		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
>   	}
> @@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
>   	if (err) {
>   		if (q)
>   			qdisc_put(q);
> +		if (err == -EAGAIN)
> +			goto replay;
>   		return err;
>   	}
>   
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 37e41f972f69..e14ed47f961c 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
>   	qdisc_free(q);
>   }
>   
> -static void qdisc_destroy(struct Qdisc *qdisc)
> +static void __qdisc_destroy(struct Qdisc *qdisc)
>   {
>   	const struct Qdisc_ops  *ops = qdisc->ops;
>   
> @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
>   	call_rcu(&qdisc->rcu, qdisc_free_cb);
>   }
>   
> +void qdisc_destroy(struct Qdisc *qdisc)
> +{
> +	if (qdisc->flags & TCQ_F_BUILTIN)
> +		return;
> +
> +	__qdisc_destroy(qdisc);
> +}
> +
>   void qdisc_put(struct Qdisc *qdisc)
>   {
>   	if (!qdisc)
> @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
>   	    !refcount_dec_and_test(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   }
>   EXPORT_SYMBOL(qdisc_put);
>   
> @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
>   	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
>   		return;
>   
> -	qdisc_destroy(qdisc);
> +	__qdisc_destroy(qdisc);
>   	rtnl_unlock();
>   }
>   EXPORT_SYMBOL(qdisc_put_unlocked);


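[Editor's note] The semantics of the new qdisc_refcount_dec_if_one() helper in patch 6 — drop the reference only if it is the last one, so the grafting path can tell whether an RTNL-unlocked filter request still holds the Qdisc — can be sketched in userspace with C11 atomics. The kernel implementation uses refcount_dec_if_one(); the struct and function names below are invented for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TCQ_F_BUILTIN 1

struct toy_qdisc {
	unsigned int flags;
	atomic_int refcnt;
};

/* Succeeds (and takes refcnt 1 -> 0) only when ours is the last
 * reference; it fails if e.g. __tcf_qdisc_find() still holds one via
 * qdisc_refcount_inc_nz().  On failure, qdisc_graft() returns -EAGAIN
 * and the RTNL-locked request is replayed from the "replay:" label. */
static bool toy_refcount_dec_if_one(struct toy_qdisc *q)
{
	int expected = 1;

	if (q->flags & TCQ_F_BUILTIN)
		return true;
	return atomic_compare_exchange_strong(&q->refcnt, &expected, 0);
}
```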

* Re: [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS
  2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
  2023-05-24 15:37   ` Pedro Tammela
@ 2023-05-24 15:57   ` Jamal Hadi Salim
  1 sibling, 0 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-24 15:57 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Vlad Buslov, Pedro Tammela, Hillf Danton, netdev,
	linux-kernel, Cong Wang

On Tue, May 23, 2023 at 9:18 PM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> ingress Qdiscs are only supposed to be created under TC_H_INGRESS.
> Return -EOPNOTSUPP if 'parent' is not TC_H_INGRESS, similar to
> mq_init().
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
> ---
> change in v5:
>   - avoid underflowing @ingress_needed_key in ->destroy(), reported by
>     Pedro
>
> change in v3, v4:
>   - add in-body From: tag
>
>  net/sched/sch_ingress.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 84838128b9c5..f9ef6deb2770 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -80,6 +80,9 @@ static int ingress_init(struct Qdisc *sch, struct nlattr *opt,
>         struct net_device *dev = qdisc_dev(sch);
>         int err;
>
> +       if (sch->parent != TC_H_INGRESS)
> +               return -EOPNOTSUPP;
> +
>         net_inc_ingress_queue();
>
>         mini_qdisc_pair_init(&q->miniqp, sch, &dev->miniq_ingress);
> @@ -101,6 +104,9 @@ static void ingress_destroy(struct Qdisc *sch)
>  {
>         struct ingress_sched_data *q = qdisc_priv(sch);
>
> +       if (sch->parent != TC_H_INGRESS)
> +               return;
> +
>         tcf_block_put_ext(q->block, sch, &q->block_info);
>         net_dec_ingress_queue();
>  }
> --
> 2.20.1
>


* Re: [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT
  2023-05-24 15:38   ` Pedro Tammela
@ 2023-05-24 15:58     ` Jamal Hadi Salim
  0 siblings, 0 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-24 15:58 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Vlad Buslov, Hillf Danton, netdev, linux-kernel,
	Cong Wang

On Wed, May 24, 2023 at 11:38 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> On 23/05/2023 22:18, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > clsact Qdiscs are only supposed to be created under TC_H_CLSACT (which
> > equals TC_H_INGRESS).  Return -EOPNOTSUPP if 'parent' is not
> > TC_H_CLSACT.
> >
> > Fixes: 1f211a1b929c ("net, sched: add clsact qdisc")
> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>
>

Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

cheers,
jamal
> > [patch changelog and diff trimmed; quoted in full in Pedro's message above]


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-24 15:39   ` Pedro Tammela
@ 2023-05-24 16:09     ` Jamal Hadi Salim
  2023-05-25  9:25       ` Paolo Abeni
  2023-05-26 12:20     ` Jamal Hadi Salim
  1 sibling, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-24 16:09 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Vlad Buslov, Hillf Danton, netdev, linux-kernel,
	Cong Wang

Pedro,
When you have a moment - could you run tc monitor in parallel with the
reproducer and double-check that it generates the correct events...

cheers,
jamal

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
> [Pedro's reply, quoting the patch in full, trimmed; see his message above]
> >       }
> >
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 37e41f972f69..e14ed47f961c 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
> >       qdisc_free(q);
> >   }
> >
> > -static void qdisc_destroy(struct Qdisc *qdisc)
> > +static void __qdisc_destroy(struct Qdisc *qdisc)
> >   {
> >       const struct Qdisc_ops  *ops = qdisc->ops;
> >
> > @@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
> >       call_rcu(&qdisc->rcu, qdisc_free_cb);
> >   }
> >
> > +void qdisc_destroy(struct Qdisc *qdisc)
> > +{
> > +     if (qdisc->flags & TCQ_F_BUILTIN)
> > +             return;
> > +
> > +     __qdisc_destroy(qdisc);
> > +}
> > +
> >   void qdisc_put(struct Qdisc *qdisc)
> >   {
> >       if (!qdisc)
> > @@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
> >           !refcount_dec_and_test(&qdisc->refcnt))
> >               return;
> >
> > -     qdisc_destroy(qdisc);
> > +     __qdisc_destroy(qdisc);
> >   }
> >   EXPORT_SYMBOL(qdisc_put);
> >
> > @@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
> >           !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
> >               return;
> >
> > -     qdisc_destroy(qdisc);
> > +     __qdisc_destroy(qdisc);
> >       rtnl_unlock();
> >   }
> >   EXPORT_SYMBOL(qdisc_put_unlocked);
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-24 16:09     ` Jamal Hadi Salim
@ 2023-05-25  9:25       ` Paolo Abeni
  2023-05-26 12:19         ` Jamal Hadi Salim
  0 siblings, 1 reply; 45+ messages in thread
From: Paolo Abeni @ 2023-05-25  9:25 UTC (permalink / raw)
  To: Jamal Hadi Salim, Pedro Tammela
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Vlad Buslov, Hillf Danton, netdev, linux-kernel,
	Cong Wang

On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> When you have a moment - could you run tc monitor in parallel to the
> reproducer and double check it generates the correct events...

FTR, I'll wait a bit to apply this series, to allow for the above
tests. Unless someone screams very loudly very soon, it's not going
to enter today's PR. Since the addressed issue is an ancient one, it
should not be a problem, I hope.

Cheers,

Paolo



* Re: [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact
  2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
                   ` (5 preceding siblings ...)
  2023-05-24  1:20 ` [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
@ 2023-05-25 17:16 ` Vlad Buslov
  6 siblings, 0 replies; 45+ messages in thread
From: Vlad Buslov @ 2023-05-25 17:16 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Vlad Buslov, Pedro Tammela,
	Hillf Danton, netdev, linux-kernel, Cong Wang

On Tue 23 May 2023 at 18:16, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> Link to v4: https://lore.kernel.org/r/cover.1684825171.git.peilin.ye@bytedance.com/
> Link to v3 (incomplete): https://lore.kernel.org/r/cover.1684821877.git.peilin.ye@bytedance.com/
> Link to v2: https://lore.kernel.org/r/cover.1684796705.git.peilin.ye@bytedance.com/
> Link to v1: https://lore.kernel.org/r/cover.1683326865.git.peilin.ye@bytedance.com/
>
> Hi all,
>
> These are v5 fixes for ingress and clsact Qdiscs.  Please take another
> look at patch 1, 2 and 6, thanks!
>

Thanks again!

Reviewed-by: Vlad Buslov <vladbu@nvidia.com>


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-25  9:25       ` Paolo Abeni
@ 2023-05-26 12:19         ` Jamal Hadi Salim
  0 siblings, 0 replies; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-26 12:19 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Pedro Tammela, Peilin Ye, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Vlad Buslov, Hillf Danton,
	netdev, linux-kernel, Cong Wang

On Thu, May 25, 2023 at 5:25 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Wed, 2023-05-24 at 12:09 -0400, Jamal Hadi Salim wrote:
> > When you have a moment - could you run tc monitor in parallel to the
> > reproducer and double check it generates the correct events...
>
> FTR, I'll wait a bit to apply this series, to allow for the above
> tests. Unless someone will scream very loudly very soon, it's not going
> to enter today's PR. Since the addressed issue is an ancient one, it
> should not a problem, I hope.

Sorry, I do not mean to hold this back - I think we should have applied
v1 and then worried about making things better.  We can worry about the
events afterwards.
So, Acking this.

cheers,
jamal

> Cheers,
>
> Paolo
>


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-24 15:39   ` Pedro Tammela
  2023-05-24 16:09     ` Jamal Hadi Salim
@ 2023-05-26 12:20     ` Jamal Hadi Salim
  2023-05-26 19:47       ` Jamal Hadi Salim
  1 sibling, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-26 12:20 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Vlad Buslov, Hillf Danton, netdev, linux-kernel,
	Cong Wang

On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
>
> On 23/05/2023 22:20, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
> > ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
> > access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
> > clsact Qdiscs and miniq_egress.
> >
> > Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
> > requests (thanks Hillf Danton for the hint), when replacing ingress or
> > clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
> > miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
> > causing race conditions [1] including a use-after-free bug in
> > mini_qdisc_pair_swap() reported by syzbot:
> >
> >   BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> >   Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
> > ...
> >   Call Trace:
> >    <TASK>
> >    __dump_stack lib/dump_stack.c:88 [inline]
> >    dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
> >    print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
> >    print_report mm/kasan/report.c:430 [inline]
> >    kasan_report+0x11c/0x130 mm/kasan/report.c:536
> >    mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
> >    tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
> >    tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
> >    tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
> >    tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
> >    tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
> > ...
> >
> > @old and @new should not affect each other.  In other words, @old should
> > never modify miniq_{in,e}gress after @new, and @new should not update
> > @old's RCU state.  Fixing without changing sch_api.c turned out to be
> > difficult (please refer to Closes: for discussions).  Instead, make sure
> > @new's first call always happen after @old's last call, in
> > qdisc_destroy(), has finished:
> >
> > In qdisc_graft(), return -EAGAIN and tell the caller to replay
> > (suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
> > requests, and call qdisc_destroy() for @old before grafting @new.
> >
> > Introduce qdisc_refcount_dec_if_one() as the counterpart of
> > qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
> > a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
> > just like qdisc_put() etc.
> >
> > Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
> > Qdiscs".
> >
> > [1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
> > TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
> > create under TC_H_INGRESS") on eth0 that has 8 transmission queues:
> >
> >    Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
> >    adds a flower filter X to A.
> >
> >    Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
> >    b2) to replace A, then adds a flower filter Y to B.
> >
> >   Thread 1               A's refcnt   Thread 2
> >    RTM_NEWQDISC (A, RTNL-locked)
> >     qdisc_create(A)               1
> >     qdisc_graft(A)                9
> >
> >    RTM_NEWTFILTER (X, RTNL-unlocked)
> >     __tcf_qdisc_find(A)          10
> >     tcf_chain0_head_change(A)
> >     mini_qdisc_pair_swap(A) (1st)
> >              |
> >              |                         RTM_NEWQDISC (B, RTNL-locked)
> >           RCU sync                2     qdisc_graft(B)
> >              |                    1     notify_and_destroy(A)
> >              |
> >     tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
> >     qdisc_destroy(A)                    tcf_chain0_head_change(B)
> >     tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
> >     mini_qdisc_pair_swap(A) (3rd)                |
> >             ...                                 ...
> >
> > Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
> > mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
> > ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
> > on eth0 will not find filter Y in sch_handle_ingress().
> >
> > This is only one of the possible consequences of concurrently accessing
> > miniq_{in,e}gress pointers.  The point is clear though: again, A should
> > never modify those per-net_device pointers after B, and B should not
> > update A's RCU state.
> >
> > Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
> > Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
> > Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
> > Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
> > Cc: Hillf Danton <hdanton@sina.com>
> > Cc: Vlad Buslov <vladbu@mellanox.com>
> > Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
>
> Tested-by: Pedro Tammela <pctammela@mojatatu.com>


Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>


cheers,
jamal


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-26 12:20     ` Jamal Hadi Salim
@ 2023-05-26 19:47       ` Jamal Hadi Salim
  2023-05-26 20:21         ` Pedro Tammela
  0 siblings, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-26 19:47 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Vlad Buslov, Hillf Danton, netdev, linux-kernel,
	Cong Wang

On Fri, May 26, 2023 at 8:20 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Wed, May 24, 2023 at 11:39 AM Pedro Tammela <pctammela@mojatatu.com> wrote:
> >
> > On 23/05/2023 22:20, Peilin Ye wrote:
> > > From: Peilin Ye <peilin.ye@bytedance.com>

> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>

I apologize, I am going to take this back; let's please hold the series
for now.

In pursuit of the effect on events, Pedro and I spent a few _hours_
chasing this - and regardless of the events, there are still
challenges with the concurrency issue. The current reproducer
unfortunately can't cause damage after patch 2, so patch 6 was really
not being tested. We hacked the repro to hit the codepath patch 6
fixes. We are not sure what the root cause is - but it is certainly
due to the series. Peilin, Pedro will post the new repro.

cheers,
jamal


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-26 19:47       ` Jamal Hadi Salim
@ 2023-05-26 20:21         ` Pedro Tammela
  2023-05-26 23:09           ` Peilin Ye
  0 siblings, 1 reply; 45+ messages in thread
From: Pedro Tammela @ 2023-05-26 20:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang, Vlad Buslov

[-- Attachment #1: Type: text/plain, Size: 2386 bytes --]

On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> [...] Peilin, Pedro will post the new repro.

Hi!

We tweaked the reproducer to:
---
r0 = socket$netlink(0x10, 0x3, 0x0)
r1 = socket(0x10, 0x803, 0x0)
sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0, 
&(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0, 
0x6, @broadcast}, &(0x7f0000000440)=0x14)
sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0, 
&(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20, 
@ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"], 
0x48}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0, 
&(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0, 
0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}}, 
[@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0, 
&(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27, 
0x0, 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}}, 
[@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2, 
[@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
r4 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
r5 = socket$netlink(0x10, 0x3, 0x0)
sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
---

We then generate the C program with:
syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog 
peilin.syz > repro.c

Now here comes a very important detail. The above will create a new net
namespace in which to shoot the netlink messages. We are only able to
reproduce the deadlock with your patches if we comment out the creation
of the new namespace:
---
diff --git a/repro.c b/repro.c
index ee8eb0726..5cdbfb289 100644
--- a/repro.c
+++ b/repro.c
@@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
    sandbox_common();
    drop_caps();
    initialize_netdevices_init();
-  if (unshare(CLONE_NEWNET)) {
-  }
+  // Doesn't seem to deadlock in a new netns
+  // if (unshare(CLONE_NEWNET)) {
+  // }
    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
    initialize_netdevices();
    setup_binderfs();

---

The reason we did this was to check on the events with 'tc mon'.
The splat is quite big, see attached. It has all the indications of a
deadlock on the rtnl_lock.

Thanks,
Pedro

[-- Attachment #2: deadlock-splat --]
[-- Type: text/plain, Size: 27730 bytes --]

[  291.882988][   T64] INFO: task kworker/1:2:4513 blocked for more than 143 seconds.
[  291.885716][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.887658][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.890484][   T64] task:kworker/1:2     state:D stack:28272 pid:4513  ppid:2      flags:0x00004000
[  291.893628][   T64] Workqueue: events_power_efficient reg_check_chans_work
[  291.896167][   T64] Call Trace:
[  291.897406][   T64]  <TASK>
[  291.898476][   T64]  __schedule+0xecb/0x59d0
[  291.900101][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.902218][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.904563][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.906437][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  291.908241][   T64]  ? _raw_spin_unlock_irq+0x23/0x50
[  291.909564][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  291.910912][   T64]  schedule+0xe7/0x1b0
[  291.912060][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.913608][   T64]  __mutex_lock+0x907/0x12d0
[  291.914850][   T64]  ? reg_check_chans_work+0x7d/0xfe0
[  291.916249][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.917689][   T64]  ? reg_check_chans_work+0x7d/0xfe0
[  291.919051][   T64]  ? rtnl_lock+0x9/0x20
[  291.920164][   T64]  reg_check_chans_work+0x7d/0xfe0
[  291.921451][   T64]  ? lock_sync+0x190/0x190
[  291.922567][   T64]  ? freq_reg_info+0x1d0/0x1d0
[  291.923802][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.924910][   T64]  process_one_work+0x9f9/0x15f0
[  291.926155][   T64]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[  291.927544][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.928706][   T64]  worker_thread+0x687/0x1110
[  291.929867][   T64]  ? process_one_work+0x15f0/0x15f0
[  291.930968][   T64]  kthread+0x334/0x430
[  291.931869][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  291.933092][   T64]  ret_from_fork+0x1f/0x30
[  291.934037][   T64]  </TASK>
[  291.934675][   T64] INFO: task kworker/3:2:6525 blocked for more than 143 seconds.
[  291.936271][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.937672][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.939514][   T64] task:kworker/3:2     state:D stack:28272 pid:6525  ppid:2      flags:0x00004000
[  291.941718][   T64] Workqueue: events linkwatch_event
[  291.943042][   T64] Call Trace:
[  291.943834][   T64]  <TASK>
[  291.944526][   T64]  __schedule+0xecb/0x59d0
[  291.945582][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.946952][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.948389][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.949615][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  291.950847][   T64]  ? _raw_spin_unlock_irq+0x23/0x50
[  291.952077][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  291.953344][   T64]  schedule+0xe7/0x1b0
[  291.954286][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.955512][   T64]  __mutex_lock+0x907/0x12d0
[  291.956603][   T64]  ? linkwatch_event+0xf/0x70
[  291.957680][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.958953][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.959996][   T64]  ? linkwatch_event+0xf/0x70
[  291.960738][   T64]  ? rtnl_lock+0x9/0x20
[  291.961249][   T64]  linkwatch_event+0xf/0x70
[  291.961770][   T64]  process_one_work+0x9f9/0x15f0
[  291.962306][   T64]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[  291.962928][   T64]  ? spin_bug+0x1d0/0x1d0
[  291.963409][   T64]  worker_thread+0x687/0x1110
[  291.963982][   T64]  ? __kthread_parkme+0x152/0x220
[  291.964567][   T64]  ? process_one_work+0x15f0/0x15f0
[  291.965163][   T64]  kthread+0x334/0x430
[  291.965641][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  291.966303][   T64]  ret_from_fork+0x1f/0x30
[  291.966836][   T64]  </TASK>
[  291.967189][   T64] INFO: task systemd-udevd:6764 blocked for more than 143 seconds.
[  291.968062][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  291.968813][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  291.969691][   T64] task:systemd-udevd   state:D stack:25968 pid:6764  ppid:4544   flags:0x00000006
[  291.970609][   T64] Call Trace:
[  291.970939][   T64]  <TASK>
[  291.971239][   T64]  __schedule+0xecb/0x59d0
[  291.971718][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.972341][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  291.972965][   T64]  ? io_schedule_timeout+0x150/0x150
[  291.973572][   T64]  ? __mutex_lock+0x902/0x12d0
[  291.974120][   T64]  schedule+0xe7/0x1b0
[  291.974579][   T64]  schedule_preempt_disabled+0x13/0x20
[  291.975174][   T64]  __mutex_lock+0x907/0x12d0
[  291.975702][   T64]  ? rtnetlink_rcv_msg+0x3e2/0xd30
[  291.976299][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  291.976938][   T64]  ? rtnetlink_rcv_msg+0x3e2/0xd30
[  291.977506][   T64]  rtnetlink_rcv_msg+0x3e2/0xd30
[  291.978042][   T64]  ? rtnl_getlink+0xb10/0xb10
[  291.978573][   T64]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  291.979295][   T64]  ? __module_address+0x55/0x3b0
[  291.979937][   T64]  ? netdev_core_pick_tx+0x390/0x390
[  291.980580][   T64]  netlink_rcv_skb+0x166/0x440
[  291.981164][   T64]  ? rtnl_getlink+0xb10/0xb10
[  291.981726][   T64]  ? netlink_ack+0x1370/0x1370
[  291.982310][   T64]  ? kasan_set_track+0x25/0x30
[  291.982973][   T64]  ? netlink_deliver_tap+0x1b1/0xd00
[  291.983579][   T64]  netlink_unicast+0x530/0x800
[  291.984132][   T64]  ? netlink_attachskb+0x880/0x880
[  291.984682][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  291.985330][   T64]  ? __phys_addr_symbol+0x30/0x70
[  291.985904][   T64]  ? __check_object_size+0x323/0x740
[  291.986488][   T64]  netlink_sendmsg+0x90b/0xe10
[  291.987023][   T64]  ? netlink_unicast+0x800/0x800
[  291.987584][   T64]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  291.988189][   T64]  ? netlink_unicast+0x800/0x800
[  291.988737][   T64]  sock_sendmsg+0xd9/0x180
[  291.989228][   T64]  __sys_sendto+0x201/0x2f0
[  291.989714][   T64]  ? __ia32_sys_getpeername+0xb0/0xb0
[  291.990296][   T64]  ? get_random_bytes+0x20/0x20
[  291.990833][   T64]  ? rcu_is_watching+0x12/0xb0
[  291.991368][   T64]  ? __rseq_handle_notify_resume+0x5c7/0xfe0
[  291.992062][   T64]  __x64_sys_sendto+0xe0/0x1b0
[  291.992600][   T64]  ? syscall_enter_from_user_mode+0x26/0x80
[  291.993324][   T64]  do_syscall_64+0x38/0xb0
[  291.993816][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  291.994436][   T64] RIP: 0033:0x7f1de823fcf7
[  291.994902][   T64] RSP: 002b:00007ffd8c5d3208 EFLAGS: 00000202 ORIG_RAX: 000000000000002c
[  291.995834][   T64] RAX: ffffffffffffffda RBX: 000055d9974d14e0 RCX: 00007f1de823fcf7
[  291.996711][   T64] RDX: 0000000000000020 RSI: 000055d997332930 RDI: 0000000000000005
[  291.997593][   T64] RBP: 00007ffd8c5d33b8 R08: 00007ffd8c5d3280 R09: 0000000000000080
[  291.998477][   T64] R10: 0000000000000000 R11: 0000000000000202 R12: 00000000043abe07
[  291.999360][   T64] R13: 0000000000000001 R14: 000055d9974845a0 R15: 0000000000000000
[  292.000254][   T64]  </TASK>
[  292.000609][   T64] INFO: task systemd-udevd:6773 blocked for more than 143 seconds.
[  292.001482][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.002233][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  292.003237][   T64] task:systemd-udevd   state:D stack:25968 pid:6773  ppid:4544   flags:0x00004006
[  292.004292][   T64] Call Trace:
[  292.004743][   T64]  <TASK>
[  292.005128][   T64]  __schedule+0xecb/0x59d0
[  292.005637][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.006304][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.007005][   T64]  ? io_schedule_timeout+0x150/0x150
[  292.007601][   T64]  ? __mutex_lock+0x902/0x12d0
[  292.008174][   T64]  schedule+0xe7/0x1b0
[  292.008648][   T64]  schedule_preempt_disabled+0x13/0x20
[  292.009257][   T64]  __mutex_lock+0x907/0x12d0
[  292.009793][   T64]  ? dev_ethtool+0x21d/0x53e0
[  292.010374][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  292.011105][   T64]  ? kasan_unpoison+0x44/0x70
[  292.011987][   T64]  ? __kmem_cache_alloc_node+0x19e/0x330
[  292.012994][   T64]  ? dev_ethtool+0x170/0x53e0
[  292.013757][   T64]  ? dev_ethtool+0x21d/0x53e0
[  292.014470][   T64]  ? rtnl_lock+0x9/0x20
[  292.015032][   T64]  dev_ethtool+0x21d/0x53e0
[  292.015633][   T64]  ? filter_irq_stacks+0x90/0x90
[  292.016269][   T64]  ? print_usage_bug.part.0+0x670/0x670
[  292.016941][   T64]  ? ethtool_get_module_info_call+0x1c0/0x1c0
[  292.017674][   T64]  ? kasan_set_track+0x25/0x30
[  292.018228][   T64]  ? ____kasan_slab_free+0x15e/0x1b0
[  292.018794][   T64]  ? slab_free_freelist_hook+0x10b/0x1e0
[  292.019287][   T64]  ? __kmem_cache_free+0xaf/0x2e0
[  292.019729][   T64]  ? tomoyo_path_number_perm+0x43a/0x530
[  292.020250][   T64]  ? security_file_ioctl+0x72/0xb0
[  292.020700][   T64]  ? __x64_sys_ioctl+0xbb/0x210
[  292.021128][   T64]  ? do_syscall_64+0x38/0xb0
[  292.021538][   T64]  ? mark_lock+0x105/0x1960
[  292.021950][   T64]  ? __lock_acquire+0xc42/0x5cf0
[  292.022399][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.022990][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.023514][   T64]  ? lockdep_hardirqs_on+0x7d/0x100
[  292.023995][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.024513][   T64]  ? find_held_lock+0x2d/0x110
[  292.024940][   T64]  ? dev_load+0x7d/0x240
[  292.025322][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.025801][   T64]  ? full_name_hash+0xbc/0x110
[  292.026220][   T64]  dev_ioctl+0x29e/0x1090
[  292.026601][   T64]  sock_do_ioctl+0x1be/0x250
[  292.027006][   T64]  ? get_user_ifreq+0x250/0x250
[  292.027433][   T64]  ? do_vfs_ioctl+0x34a/0x17f0
[  292.027870][   T64]  ? vfs_fileattr_set+0xbf0/0xbf0
[  292.028318][   T64]  sock_ioctl+0x205/0x6a0
[  292.028697][   T64]  ? br_ioctl_call+0xb0/0xb0
[  292.029145][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.029827][   T64]  ? handle_mm_fault+0x32c/0x9e0
[  292.030501][   T64]  ? bpf_lsm_file_ioctl+0x9/0x10
[  292.031054][   T64]  ? br_ioctl_call+0xb0/0xb0
[  292.031527][   T64]  __x64_sys_ioctl+0x189/0x210
[  292.032091][   T64]  do_syscall_64+0x38/0xb0
[  292.032609][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.033390][   T64] RIP: 0033:0x7f1de82319ef
[  292.033908][   T64] RSP: 002b:00007ffd8c5d32f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  292.034751][   T64] RAX: ffffffffffffffda RBX: 000055d997476d20 RCX: 00007f1de82319ef
[  292.035448][   T64] RDX: 00007ffd8c5d3430 RSI: 0000000000008946 RDI: 000000000000000b
[  292.036179][   T64] RBP: 000055d997488a50 R08: 0000000000000000 R09: 0000000000000006
[  292.036884][   T64] R10: 00007ffd8c5d3400 R11: 0000000000000246 R12: 000055d99732a230
[  292.037581][   T64] R13: 00007ffd8c5d3430 R14: 0000000000000000 R15: 000055d99732a238
[  292.038268][   T64]  </TASK>
[  292.038557][   T64] INFO: task repro:7652 blocked for more than 143 seconds.
[  292.039202][   T64]       Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.039776][   T64] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  292.040509][   T64] task:repro           state:D stack:25984 pid:7652  ppid:6753   flags:0x00004006
[  292.041289][   T64] Call Trace:
[  292.041575][   T64]  <TASK>
[  292.041848][   T64]  __schedule+0xecb/0x59d0
[  292.042297][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.043017][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.043711][   T64]  ? io_schedule_timeout+0x150/0x150
[  292.044360][   T64]  ? __mutex_lock+0x902/0x12d0
[  292.044845][   T64]  schedule+0xe7/0x1b0
[  292.045300][   T64]  schedule_preempt_disabled+0x13/0x20
[  292.045872][   T64]  __mutex_lock+0x907/0x12d0
[  292.046396][   T64]  ? fl_change+0x2102/0x51e0
[  292.046833][   T64]  ? tc_setup_cb_reoffload+0x100/0x100
[  292.047353][   T64]  ? mutex_lock_io_nested+0x1120/0x1120
[  292.047915][   T64]  ? _raw_spin_unlock+0x28/0x40
[  292.048368][   T64]  ? tcf_exts_init_ex+0x1bc/0x610
[  292.048828][   T64]  ? fl_change+0x2102/0x51e0
[  292.049328][   T64]  ? rtnl_lock+0x9/0x20
[  292.049768][   T64]  fl_change+0x2102/0x51e0
[  292.050232][   T64]  ? find_held_lock+0x2d/0x110
[  292.050681][   T64]  ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[  292.051416][   T64]  ? fl_get+0x21c/0x3c0
[  292.051791][   T64]  ? fl_put+0x20/0x20
[  292.052170][   T64]  ? mini_qdisc_pair_swap+0x128/0x1f0
[  292.052655][   T64]  ? __rhashtable_insert_fast.constprop.0.isra.0+0x14a0/0x14a0
[  292.053395][   T64]  tc_new_tfilter+0x992/0x22b0
[  292.053858][   T64]  ? tc_del_tfilter+0x1570/0x1570
[  292.054419][   T64]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.055066][   T64]  ? kasan_quarantine_put+0x102/0x230
[  292.055660][   T64]  ? rtnetlink_rcv_msg+0x94a/0xd30
[  292.056224][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.056799][   T64]  ? bpf_lsm_capable+0x9/0x10
[  292.057311][   T64]  ? tc_del_tfilter+0x1570/0x1570
[  292.057843][   T64]  rtnetlink_rcv_msg+0x98a/0xd30
[  292.058381][   T64]  ? rtnl_getlink+0xb10/0xb10
[  292.058833][   T64]  ? __x64_sys_sendmmsg+0x9c/0x100
[  292.059305][   T64]  ? do_syscall_64+0x38/0xb0
[  292.059752][   T64]  ? netdev_core_pick_tx+0x390/0x390
[  292.060297][   T64]  netlink_rcv_skb+0x166/0x440
[  292.060733][   T64]  ? rtnl_getlink+0xb10/0xb10
[  292.061146][   T64]  ? netlink_ack+0x1370/0x1370
[  292.061566][   T64]  ? kasan_set_track+0x25/0x30
[  292.062003][   T64]  ? netlink_deliver_tap+0x1b1/0xd00
[  292.062468][   T64]  netlink_unicast+0x530/0x800
[  292.062955][   T64]  ? netlink_attachskb+0x880/0x880
[  292.063416][   T64]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.063958][   T64]  ? __phys_addr_symbol+0x30/0x70
[  292.064468][   T64]  ? __check_object_size+0x323/0x740
[  292.065009][   T64]  netlink_sendmsg+0x90b/0xe10
[  292.065517][   T64]  ? netlink_unicast+0x800/0x800
[  292.066010][   T64]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  292.066467][   T64]  ? netlink_unicast+0x800/0x800
[  292.066933][   T64]  sock_sendmsg+0xd9/0x180
[  292.067358][   T64]  ____sys_sendmsg+0x264/0x910
[  292.067794][   T64]  ? kernel_sendmsg+0x50/0x50
[  292.068252][   T64]  ? __copy_msghdr+0x460/0x460
[  292.068690][   T64]  ___sys_sendmsg+0x11d/0x1b0
[  292.069111][   T64]  ? do_recvmmsg+0x700/0x700
[  292.069551][   T64]  ? find_held_lock+0x2d/0x110
[  292.069972][   T64]  ? __might_fault+0xe5/0x190
[  292.070397][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.070862][   T64]  __sys_sendmmsg+0x18e/0x430
[  292.071272][   T64]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  292.071705][   T64]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.072204][   T64]  ? rcu_is_watching+0x12/0xb0
[  292.072620][   T64]  ? xfd_validate_state+0x5d/0x180
[  292.073099][   T64]  ? restore_fpregs_from_fpstate+0xc1/0x1d0
[  292.073611][   T64]  ? unlock_page_memcg+0x2d0/0x2d0
[  292.074062][   T64]  ? do_futex+0x350/0x350
[  292.074461][   T64]  __x64_sys_sendmmsg+0x9c/0x100
[  292.074893][   T64]  ? syscall_enter_from_user_mode+0x26/0x80
[  292.075494][   T64]  do_syscall_64+0x38/0xb0
[  292.075960][   T64]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.076473][   T64] RIP: 0033:0x7f3f4319289d
[  292.076850][   T64] RSP: 002b:00007f3f43056e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[  292.077557][   T64] RAX: ffffffffffffffda RBX: 0000558de6727248 RCX: 00007f3f4319289d
[  292.078234][   T64] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
[  292.078924][   T64] RBP: 0000000000000006 R08: 0000000000000000 R09: 0000000000000000
[  292.079586][   T64] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672724c
[  292.080259][   T64] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727240
[  292.080943][   T64]  </TASK>
[  292.081221][   T64]
[  292.081221][   T64] Showing all locks held in the system:
[  292.081868][   T64] 1 lock held by rcu_tasks_kthre/14:
[  292.082316][   T64]  #0: ffffffff8c798430 (rcu_tasks.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[  292.083266][   T64] 1 lock held by rcu_tasks_trace/15:
[  292.083721][   T64]  #0: ffffffff8c798130 (rcu_tasks_trace.tasks_gp_mutex){+.+.}-{3:3}, at: rcu_tasks_one_gp+0x2c/0xe20
[  292.084702][   T64] 5 locks held by ksoftirqd/4/38:
[  292.085152][   T64] 3 locks held by kworker/4:0/39:
[  292.085598][   T64]  #0: ffff88811b78b938 ((wq_completion)ipv6_addrconf){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.086704][   T64]  #1: ffffc9000139fda0 ((work_completion)(&(&net->ipv6.addr_chk_work)->work)){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.088126][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: addrconf_verify_work+0x12/0x30
[  292.089159][   T64] 1 lock held by khungtaskd/64:
[  292.089656][   T64]  #0: ffffffff8c799040 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x340
[  292.090713][   T64] 3 locks held by kworker/1:2/4513:
[  292.091175][   T64]  #0: ffff888100079d38 ((wq_completion)events_power_efficient){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.092274][   T64]  #1: ffffc90016fcfda0 ((reg_check_chans).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.093319][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: reg_check_chans_work+0x7d/0xfe0
[  292.094269][   T64] 3 locks held by kworker/3:2/6525:
[  292.094801][   T64]  #0: ffff888100078d38 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x8e2/0x15f0
[  292.095834][   T64]  #1: ffffc9000648fda0 ((linkwatch_work).work){+.+.}-{0:0}, at: process_one_work+0x916/0x15f0
[  292.096754][   T64]  #2: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: linkwatch_event+0xf/0x70
[  292.097662][   T64] 1 lock held by systemd-udevd/6764:
[  292.098148][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x3e2/0xd30
[  292.098984][   T64] 1 lock held by systemd-udevd/6773:
[  292.099462][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: dev_ethtool+0x21d/0x53e0
[  292.100300][   T64] 1 lock held by repro/7652:
[  292.100700][   T64]  #0: ffffffff8e115668 (rtnl_mutex){+.+.}-{3:3}, at: fl_change+0x2102/0x51e0
[  292.101494][   T64] 2 locks held by repro/7653:
[  292.101905][   T64]
[  292.102135][   T64] =============================================
[  292.102135][   T64]
[  292.102905][   T64] NMI backtrace for cpu 5
[  292.103277][   T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.104054][   T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.104931][   T64] Call Trace:
[  292.105222][   T64]  <TASK>
[  292.105481][   T64]  dump_stack_lvl+0xd9/0x1b0
[  292.105889][   T64]  nmi_cpu_backtrace+0x277/0x380
[  292.106318][   T64]  ? lapic_can_unplug_cpu+0xa0/0xa0
[  292.106770][   T64]  nmi_trigger_cpumask_backtrace+0x2ac/0x310
[  292.107290][   T64]  watchdog+0xed1/0x1150
[  292.107715][   T64]  ? proc_dohung_task_timeout_secs+0x90/0x90
[  292.108330][   T64]  kthread+0x334/0x430
[  292.108736][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  292.109285][   T64]  ret_from_fork+0x1f/0x30
[  292.109720][   T64]  </TASK>
[  292.109991][   T64] Sending NMI from CPU 5 to CPUs 0-4,6-7:
[  292.110569][    C3] NMI backtrace for cpu 3
[  292.110577][    C3] CPU: 3 PID: 7653 Comm: repro Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.110586][    C3] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.110591][    C3] RIP: 0010:unwind_get_return_address+0x6e/0xa0
[  292.110604][    C3] Code: ea 03 80 3c 02 00 75 32 48 8b 7b 48 e8 3b 46 1c 00 85 c0 74 d3 48 b8 00 00 00 00 00 fc ff df 48 89 ea 48 c1 ea 03 80 3c 02 00 <75> 18 48 8b 43 48 5b 5d c3 e8 14 c5 9e 00 eb a8 48 89 ef e8 2a c5
[  292.110611][    C3] RSP: 0018:ffffc900095d6ed0 EFLAGS: 00000246
[  292.110618][    C3] RAX: dffffc0000000000 RBX: ffffc900095d6ee8 RCX: 0000000000000000
[  292.110623][    C3] RDX: 1ffff920012bade6 RSI: ffffc900095d72a0 RDI: ffffffff88370cd5
[  292.110628][    C3] RBP: ffffc900095d6f30 R08: ffffc900095d6f1c R09: ffffffff8f831a34
[  292.110633][    C3] R10: ffffc900095d6ee8 R11: 0000000000072d0d R12: ffffc900095d6fa0
[  292.110637][    C3] R13: 0000000000000000 R14: ffff888173271dc0 R15: ffff888171b140a0
[  292.110645][    C3] FS:  00007f3f430376c0(0000) GS:ffff888437c80000(0000) knlGS:0000000000000000
[  292.110653][    C3] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  292.110658][    C3] CR2: 0000558de83422d8 CR3: 0000000116ff7000 CR4: 0000000000350ee0
[  292.110664][    C3] Call Trace:
[  292.110667][    C3]  <TASK>
[  292.110670][    C3]  ? write_profile+0x440/0x440
[  292.110684][    C3]  arch_stack_walk+0x97/0xf0
[  292.110697][    C3]  ? ingress_init+0x165/0x220
[  292.110706][    C3]  stack_trace_save+0x96/0xd0
[  292.110716][    C3]  ? filter_irq_stacks+0x90/0x90
[  292.110727][    C3]  kasan_save_stack+0x20/0x40
[  292.110741][    C3]  ? kasan_save_stack+0x20/0x40
[  292.110753][    C3]  ? kasan_set_track+0x25/0x30
[  292.110766][    C3]  ? __kasan_kmalloc+0xa2/0xb0
[  292.110776][    C3]  ? tcf_block_get_ext+0x14f/0x1900
[  292.110785][    C3]  ? ingress_init+0x165/0x220
[  292.110792][    C3]  ? qdisc_create+0x4f7/0x1090
[  292.110799][    C3]  ? tc_modify_qdisc+0xd38/0x1b70
[  292.110806][    C3]  ? rtnetlink_rcv_msg+0x439/0xd30
[  292.110813][    C3]  ? netlink_rcv_skb+0x166/0x440
[  292.110824][    C3]  ? netlink_unicast+0x530/0x800
[  292.110835][    C3]  ? netlink_sendmsg+0x90b/0xe10
[  292.110847][    C3]  ? find_held_lock+0x2d/0x110
[  292.110857][    C3]  ? __kmem_cache_alloc_node+0x49/0x330
[  292.110867][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.110878][    C3]  ? kasan_unpoison+0x44/0x70
[  292.110885][    C3]  ? __kasan_slab_alloc+0x4e/0x90
[  292.110896][    C3]  ? __kmem_cache_alloc_node+0x19e/0x330
[  292.110906][    C3]  ? tcf_block_get_ext+0x14f/0x1900
[  292.110915][    C3]  kasan_set_track+0x25/0x30
[  292.110925][    C3]  __kasan_kmalloc+0xa2/0xb0
[  292.110935][    C3]  tcf_block_get_ext+0x14f/0x1900
[  292.110945][    C3]  ingress_init+0x165/0x220
[  292.110952][    C3]  ? clsact_init+0x330/0x330
[  292.110959][    C3]  qdisc_create+0x4f7/0x1090
[  292.110967][    C3]  ? tc_get_qdisc+0xb80/0xb80
[  292.110975][    C3]  tc_modify_qdisc+0xd38/0x1b70
[  292.110984][    C3]  ? qdisc_create+0x1090/0x1090
[  292.110994][    C3]  ? bpf_lsm_capable+0x9/0x10
[  292.111004][    C3]  ? qdisc_create+0x1090/0x1090
[  292.111012][    C3]  rtnetlink_rcv_msg+0x439/0xd30
[  292.111019][    C3]  ? rtnl_getlink+0xb10/0xb10
[  292.111025][    C3]  ? __x64_sys_sendmmsg+0x9c/0x100
[  292.111034][    C3]  ? do_syscall_64+0x38/0xb0
[  292.111043][    C3]  ? netdev_core_pick_tx+0x390/0x390
[  292.111056][    C3]  netlink_rcv_skb+0x166/0x440
[  292.111067][    C3]  ? rtnl_getlink+0xb10/0xb10
[  292.111074][    C3]  ? netlink_ack+0x1370/0x1370
[  292.111085][    C3]  ? kasan_set_track+0x25/0x30
[  292.111097][    C3]  ? netlink_deliver_tap+0x1b1/0xd00
[  292.111109][    C3]  netlink_unicast+0x530/0x800
[  292.111121][    C3]  ? netlink_attachskb+0x880/0x880
[  292.111132][    C3]  ? __sanitizer_cov_trace_switch+0x54/0x90
[  292.111142][    C3]  ? __phys_addr_symbol+0x30/0x70
[  292.111153][    C3]  ? __check_object_size+0x323/0x740
[  292.111161][    C3]  netlink_sendmsg+0x90b/0xe10
[  292.111173][    C3]  ? netlink_unicast+0x800/0x800
[  292.111185][    C3]  ? bpf_lsm_socket_sendmsg+0x9/0x10
[  292.111193][    C3]  ? netlink_unicast+0x800/0x800
[  292.111204][    C3]  sock_sendmsg+0xd9/0x180
[  292.111216][    C3]  ____sys_sendmsg+0x264/0x910
[  292.111227][    C3]  ? kernel_sendmsg+0x50/0x50
[  292.111238][    C3]  ? __copy_msghdr+0x460/0x460
[  292.111246][    C3]  ? lockdep_hardirqs_on_prepare+0x410/0x410
[  292.111258][    C3]  ___sys_sendmsg+0x11d/0x1b0
[  292.111266][    C3]  ? do_recvmmsg+0x700/0x700
[  292.111274][    C3]  ? __fget_files+0x260/0x420
[  292.111284][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.111296][    C3]  ? __fget_files+0x282/0x420
[  292.111307][    C3]  ? __fget_light+0xe6/0x270
[  292.111318][    C3]  __sys_sendmmsg+0x18e/0x430
[  292.111327][    C3]  ? __ia32_sys_sendmsg+0xb0/0xb0
[  292.111334][    C3]  ? reacquire_held_locks+0x4b0/0x4b0
[  292.111346][    C3]  ? rcu_is_watching+0x12/0xb0
[  292.111358][    C3]  ? __rseq_handle_notify_resume+0x5c7/0xfe0
[  292.111368][    C3]  ? __do_sys_rseq+0x750/0x750
[  292.111376][    C3]  ? unlock_page_memcg+0x2d0/0x2d0
[  292.111387][    C3]  ? do_futex+0x350/0x350
[  292.111395][    C3]  __x64_sys_sendmmsg+0x9c/0x100
[  292.111403][    C3]  ? syscall_enter_from_user_mode+0x26/0x80
[  292.111414][    C3]  do_syscall_64+0x38/0xb0
[  292.111421][    C3]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  292.111433][    C3] RIP: 0033:0x7f3f4319289d
[  292.111440][    C3] Code: 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4b 05 0e 00 f7 d8 64 89 01 48
[  292.111447][    C3] RSP: 002b:00007f3f43035e38 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
[  292.111454][    C3] RAX: ffffffffffffffda RBX: 0000558de6727258 RCX: 00007f3f4319289d
[  292.111459][    C3] RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000007
[  292.111463][    C3] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  292.111467][    C3] R10: 0000000000000000 R11: 0000000000000246 R12: 0000558de672725c
[  292.111472][    C3] R13: 0000558de6723030 R14: 040000000000009f R15: 0000558de6727250
[  292.111479][    C3]  </TASK>
[  292.111481][    C4] NMI backtrace for cpu 4 skipped: idling at default_idle+0xf/0x20
[  292.092844][    C7] NMI backtrace for cpu 7 skipped: idling at default_idle+0xf/0x20
[  292.092832][    C2] NMI backtrace for cpu 2 skipped: idling at default_idle+0xf/0x20
[  292.092832][    C2] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.014 msecs
[  292.092820][    C1] NMI backtrace for cpu 1 skipped: idling at default_idle+0xf/0x20
[  292.092820][    C1] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.033 msecs
[  292.102824][    C6] NMI backtrace for cpu 6 skipped: idling at default_idle+0xf/0x20
[  292.102824][    C6] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.054 msecs
[  292.102812][    C0] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20
[  292.102812][    C0] INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.092 msecs
[  292.112559][   T64] Kernel panic - not syncing: hung_task: blocked tasks
[  292.112565][   T64] CPU: 5 PID: 64 Comm: khungtaskd Not tainted 6.4.0-rc2-00215-g6078d01d2926 #13
[  292.112575][   T64] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  292.112582][   T64] Call Trace:
[  292.112586][   T64]  <TASK>
[  292.112590][   T64]  dump_stack_lvl+0xd9/0x1b0
[  292.112603][   T64]  panic+0x689/0x730
[  292.112615][   T64]  ? panic_smp_self_stop+0xa0/0xa0
[  292.112631][   T64]  ? __irq_work_queue_local+0x132/0x3f0
[  292.112641][   T64]  ? irq_work_queue+0x2a/0x70
[  292.112649][   T64]  ? __wake_up_klogd.part.0+0x99/0xf0
[  292.112663][   T64]  ? watchdog+0xc89/0x1150
[  292.112678][   T64]  watchdog+0xc9a/0x1150
[  292.112698][   T64]  ? proc_dohung_task_timeout_secs+0x90/0x90
[  292.112713][   T64]  kthread+0x334/0x430
[  292.112723][   T64]  ? kthread_complete_and_exit+0x40/0x40
[  292.112734][   T64]  ret_from_fork+0x1f/0x30
[  292.112750][   T64]  </TASK>
[  292.112866][   T64] Kernel Offset: disabled
[  292.112866][   T64] Rebooting in 86400 seconds..

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-26 20:21         ` Pedro Tammela
@ 2023-05-26 23:09           ` Peilin Ye
  2023-05-27  2:33             ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-26 23:09 UTC (permalink / raw)
  To: Pedro Tammela, Jamal Hadi Salim
  Cc: Jamal Hadi Salim, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang, Vlad Buslov

On Fri, May 26, 2023 at 05:21:34PM -0300, Pedro Tammela wrote:
> On 26/05/2023 16:47, Jamal Hadi Salim wrote:
> > [...] Peilin, Pedro will post the new repro.
> 
> We tweaked the reproducer to:
> ---
> r0 = socket$netlink(0x10, 0x3, 0x0)
> r1 = socket(0x10, 0x803, 0x0)
> sendmsg$nl_route_sched(r1, &(0x7f0000000300)={0x0, 0x0,
> &(0x7f0000000240)={&(0x7f0000000380)=ANY=[], 0x24}}, 0x0)
> getsockname$packet(r1, &(0x7f0000000200)={0x11, 0x0, <r2=>0x0, 0x1, 0x0,
> 0x6, @broadcast}, &(0x7f0000000440)=0x14)
> sendmsg$nl_route(r0, &(0x7f0000000040)={0x0, 0x0,
> &(0x7f0000000000)={&(0x7f0000000080)=ANY=[@ANYBLOB="480000001000050700"/20,
> @ANYRES32=r2, @ANYBLOB="0000000000000000280012000900010076657468"], 0x48}},
> 0x0)
> sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f00000002c0)={0x0, 0x0,
> &(0x7f0000000280)={&(0x7f0000000540)=@newqdisc={0x30, 0x24, 0xf0b, 0x0, 0x0,
> {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}},
> [@qdisc_kind_options=@q_ingress={0xc}]}, 0x30}}, 0x0)
> sendmsg$nl_route_sched(0xffffffffffffffff, &(0x7f0000000340)={0x0, 0x0,
> &(0x7f00000000c0)={&(0x7f0000000580)=@newtfilter={0x3c, 0x2c, 0xd27, 0x0,
> 0x0, {0x0, 0x0, 0x0, r2, {}, {0xfff1, 0xffff}, {0xc}},
> [@filter_kind_options=@f_flower={{0xb}, {0xc, 0x2,
> [@TCA_FLOWER_CLASSID={0x8}]}}]}, 0x3c}}, 0x0)
> r4 = socket$netlink(0x10, 0x3, 0x0)
> sendmmsg(r4, &(0x7f00000002c0), 0x40000000000009f, 0x0)
> r5 = socket$netlink(0x10, 0x3, 0x0)
> sendmmsg(r5, &(0x7f00000002c0), 0x40000000000009f, 0x0)
> ---
> 
> We then generate the C program with:
> syz-prog2c -sandbox none -enable net_dev -threaded -repeat 0 -prog
> peilin.syz > repro.c
> 
> Now here comes a very important detail. The above will create a new net
> namespace to shoot the netlink messages. We are only able to reproduce the
> deadlock with your patches if we comment the creation of the new namespace
> out:
> ---
> diff --git a/repro.c b/repro.c
> index ee8eb0726..5cdbfb289 100644
> --- a/repro.c
> +++ b/repro.c
> @@ -1121,9 +1121,8 @@ static int do_sandbox_none(void)
>    sandbox_common();
>    drop_caps();
>    initialize_netdevices_init();
> -  if (unshare(CLONE_NEWNET)) {
> -  }
> +  // Doesn't seem to deadlock in a new netns
> +  // if (unshare(CLONE_NEWNET)) {
> +  // }
>    write_file("/proc/sys/net/ipv4/ping_group_range", "0 65535");
>    initialize_netdevices();
>    setup_binderfs();
> 
> ---
> 
> The reason we did this was to check on the event with 'tc mon'.
> The splat is quite big, see attached. It has all the indications of a
> deadlock in the rtnl_lock.

Thanks a lot, I'll get right on it.

Peilin Ye


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-26 23:09           ` Peilin Ye
@ 2023-05-27  2:33             ` Jakub Kicinski
  2023-05-27  8:23               ` Peilin Ye
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-05-27  2:33 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Pedro Tammela, Jamal Hadi Salim, David S. Miller, Eric Dumazet,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang, Vlad Buslov

On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> Thanks a lot, I'll get right on it.

Any insights? Is it just a live-lock inherent to the retry scheme,
or do we actually forget to release the lock/refcnt?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-27  2:33             ` Jakub Kicinski
@ 2023-05-27  8:23               ` Peilin Ye
  2023-05-28 18:54                 ` Jamal Hadi Salim
  0 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-27  8:23 UTC (permalink / raw)
  To: Jakub Kicinski, Jamal Hadi Salim, Pedro Tammela
  Cc: Pedro Tammela, Jamal Hadi Salim, David S. Miller, Eric Dumazet,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang, Vlad Buslov

Hi Jakub and all,

On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > Thanks a lot, I'll get right on it.
>
> Any insights? Is it just a live-lock inherent to the retry scheme
> or we actually forget to release the lock/refcnt?

I think it's just a thread holding the RTNL mutex for too long (replaying
too many times).  We could replay an arbitrary number of times in
tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
requests for the old Qdisc.

I tested the new reproducer Pedro posted, on:

1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported

2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
   any issues (in about 30 minutes).

3. All 6 v5 patches, plus this diff:

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 286b7c58f5b9..988718ba5abe 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                         * RTNL-unlocked filter request(s).  This is the counterpart of that
                         * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
                         */
-                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
+                               rtnl_unlock();
+                               rtnl_lock();
                                return -EAGAIN;
+                       }
                }

                if (dev->flags & IFF_UP)

   Did not trigger any issues (in about 30 minutes) either.

What would you suggest?

Thanks,
Peilin Ye


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-27  8:23               ` Peilin Ye
@ 2023-05-28 18:54                 ` Jamal Hadi Salim
  2023-05-29 11:50                   ` Vlad Buslov
  0 siblings, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-28 18:54 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Jakub Kicinski, Pedro Tammela, David S. Miller, Eric Dumazet,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang, Vlad Buslov

On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> Hi Jakub and all,
>
> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> > > Thanks a lot, I'll get right on it.
> >
> > Any insights? Is it just a live-lock inherent to the retry scheme
> > or we actually forget to release the lock/refcnt?
>
> I think it's just a thread holding the RTNL mutex for too long (replaying
> too many times).  We could replay for arbitrary times in
> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> requests for the old Qdisc.
>
> I tested the new reproducer Pedro posted, on:
>
> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>
> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>    any issues (in about 30 minutes).
>
> 3. All 6 v5 patches, plus this diff:
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..988718ba5abe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>                          */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> +                               rtnl_unlock();
> +                               rtnl_lock();
>                                 return -EAGAIN;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)
>
>    Did not trigger any issues (in about 30 minutes) either.
>
> What would you suggest?


I am more worried it is a whack-a-mole situation. We fixed the first
reproducer with essentially patches 1-4, but we opened a new one which
the second reproducer catches. One thing the current reproducer does
is create a lot of rtnl contention in the beginning by creating all
those devices; after that it is just creating/deleting qdiscs and doing
updates with flower, where such contention is reduced, i.e. it may just
take longer for the mole to pop up.

Why don't we push the V1 patch in and then worry about getting clever
with EAGAIN afterwards? Can you test the V1 version with the repro Pedro
posted? It shouldn't have these issues. Also, it would be interesting to
see how the performance of parallel flower updates is affected.

cheers,
jamal

> Thanks,
> Peilin Ye
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-28 18:54                 ` Jamal Hadi Salim
@ 2023-05-29 11:50                   ` Vlad Buslov
  2023-05-29 12:58                     ` Vlad Buslov
  2023-05-29 13:55                     ` Jamal Hadi Salim
  0 siblings, 2 replies; 45+ messages in thread
From: Vlad Buslov @ 2023-05-29 11:50 UTC (permalink / raw)
  To: Peilin Ye, Jamal Hadi Salim
  Cc: Jakub Kicinski, Pedro Tammela, David S. Miller, Eric Dumazet,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang

On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>
>> Hi Jakub and all,
>>
>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>> > > Thanks a lot, I'll get right on it.
>> >
>> > Any insights? Is it just a live-lock inherent to the retry scheme
>> > or we actually forget to release the lock/refcnt?
>>
>> I think it's just a thread holding the RTNL mutex for too long (replaying
>> too many times).  We could replay for arbitrary times in
>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>> requests for the old Qdisc.

After looking very carefully at the code I think I know what the issue
might be:

   Task 1 graft Qdisc   Task 2 new filter
           +                    +
           |                    |
           v                    v
        rtnl_lock()       take  q->refcnt
           +                    +
           |                    |
           v                    v
Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN

This will cause a real deadlock with the proposed patch. I'll try to
come up with a better approach. Sorry for not seeing it earlier.

>>
>> I tested the new reproducer Pedro posted, on:
>>
>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>
>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>    any issues (in about 30 minutes).
>>
>> 3. All 6 v5 patches, plus this diff:
>>
>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>> index 286b7c58f5b9..988718ba5abe 100644
>> --- a/net/sched/sch_api.c
>> +++ b/net/sched/sch_api.c
>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>                          */
>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>> +                               rtnl_unlock();
>> +                               rtnl_lock();
>>                                 return -EAGAIN;
>> +                       }
>>                 }
>>
>>                 if (dev->flags & IFF_UP)
>>
>>    Did not trigger any issues (in about 30 mintues) either.
>>
>> What would you suggest?
>
>
> I am more worried it is a wackamole situation. We fixed the first
> reproducer with essentially patches 1-4 but we opened a new one which
> the second reproducer catches. One thing the current reproducer does
> is create a lot rtnl contention in the beggining by creating all those
> devices and then after it is just creating/deleting qdisc and doing
> update with flower where such contention is reduced. i.e it may just
> take longer for the mole to pop up.
>
> Why dont we push the V1 patch in and then worry about getting clever
> with EAGAIN after? Can you test the V1 version with the repro Pedro
> posted? It shouldnt have these issues. Also it would be interesting to
> see how performance of the parallel updates to flower is affected.

This or at least push first 4 patches of this series. They target other
older commits and fix straightforward issues with the API.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-29 11:50                   ` Vlad Buslov
@ 2023-05-29 12:58                     ` Vlad Buslov
  2023-05-30  1:03                       ` Jakub Kicinski
  2023-05-30  9:11                       ` Peilin Ye
  2023-05-29 13:55                     ` Jamal Hadi Salim
  1 sibling, 2 replies; 45+ messages in thread
From: Vlad Buslov @ 2023-05-29 12:58 UTC (permalink / raw)
  To: Peilin Ye, Jamal Hadi Salim
  Cc: Jakub Kicinski, Pedro Tammela, David S. Miller, Eric Dumazet,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang

On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@nvidia.com> wrote:
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>>
>>> Hi Jakub and all,
>>>
>>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>>> > > Thanks a lot, I'll get right on it.
>>> >
>>> > Any insights? Is it just a live-lock inherent to the retry scheme
>>> > or we actually forget to release the lock/refcnt?
>>>
>>> I think it's just a thread holding the RTNL mutex for too long (replaying
>>> too many times).  We could replay for arbitrary times in
>>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>>> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>

Follow-up: I considered two approaches for preventing the deadlock:

- Refactor cls_api to always obtain the lock before taking a reference
  to Qdisc. I started implementing PoC moving the rtnl_lock() call in
  tc_new_tfilter() before __tcf_qdisc_find() and decided it is not
  feasible because cls_api will still try to obtain rtnl_lock when
  offloading a filter to a device with non-unlocked driver or after
  releasing the lock when loading a classifier module.

- Account for such cls_api behavior in sch_api by dropping and
  re-taking the lock before replaying. This actually seems quite
  straightforward, since the 'replay' functionality that we are reusing
  here was designed for similar behavior: it releases the rtnl lock
  before loading a sch module, takes the lock again and safely replays
  the function by re-obtaining all the necessary data.

If livelock with concurrent filter insertion is an issue, then it can
be remedied by setting a new Qdisc->flags bit
"DELETED-REJECT-NEW-FILTERS" and checking for it together with
QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
insertion coming after the flag is set to synchronize on the rtnl lock.

Thoughts?

>>>
>>> I tested the new reproducer Pedro posted, on:
>>>
>>> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
>>>
>>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>>    any issues (in about 30 minutes).
>>>
>>> 3. All 6 v5 patches, plus this diff:
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 286b7c58f5b9..988718ba5abe 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>>                          */
>>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>>> +                               rtnl_unlock();
>>> +                               rtnl_lock();
>>>                                 return -EAGAIN;
>>> +                       }
>>>                 }
>>>
>>>                 if (dev->flags & IFF_UP)
>>>
>>>    Did not trigger any issues (in about 30 minutes) either.
>>>
>>> What would you suggest?
>>
>>
>> I am more worried it is a whack-a-mole situation. We fixed the first
>> reproducer with essentially patches 1-4, but we opened a new one which
>> the second reproducer catches. One thing the current reproducer does
>> is create a lot of rtnl contention in the beginning by creating all those
>> devices, and then after that it is just creating/deleting qdiscs and doing
>> updates with flower, where such contention is reduced. I.e. it may just
>> take longer for the mole to pop up.
>>
>> Why don't we push the V1 patch in and then worry about getting clever
>> with EAGAIN after? Can you test the V1 version with the repro Pedro
>> posted? It shouldn't have these issues. Also it would be interesting to
>> see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-29 11:50                   ` Vlad Buslov
  2023-05-29 12:58                     ` Vlad Buslov
@ 2023-05-29 13:55                     ` Jamal Hadi Salim
  2023-05-29 19:14                       ` Peilin Ye
  1 sibling, 1 reply; 45+ messages in thread
From: Jamal Hadi Salim @ 2023-05-29 13:55 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Peilin Ye, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote:
>
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >>
> >> Hi Jakub and all,
> >>
> >> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
> >> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
> >> > > Thanks a lot, I'll get right on it.
> >> >
> >> > Any insights? Is it just a live-lock inherent to the retry scheme
> >> > or we actually forget to release the lock/refcnt?
> >>
> >> I think it's just a thread holding the RTNL mutex for too long (replaying
> >> too many times).  We could replay an arbitrary number of times in
> >> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
> >> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>
> >>
> >> I tested the new reproducer Pedro posted, on:
> >>
> >> 1. All 6 v5 patches, FWIW, which caused a similar hang as Pedro reported
> >>
> >> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
> >>    any issues (in about 30 minutes).
> >>
> >> 3. All 6 v5 patches, plus this diff:
> >>
> >> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> >> index 286b7c58f5b9..988718ba5abe 100644
> >> --- a/net/sched/sch_api.c
> >> +++ b/net/sched/sch_api.c
> >> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
> >>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
> >>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> >>                          */
> >> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> >> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
> >> +                               rtnl_unlock();
> >> +                               rtnl_lock();
> >>                                 return -EAGAIN;
> >> +                       }
> >>                 }
> >>
> >>                 if (dev->flags & IFF_UP)
> >>
> >>    Did not trigger any issues (in about 30 minutes) either.
> >>
> >> What would you suggest?
> >
> >
> > I am more worried it is a whack-a-mole situation. We fixed the first
> > reproducer with essentially patches 1-4, but we opened a new one which
> > the second reproducer catches. One thing the current reproducer does
> > is create a lot of rtnl contention in the beginning by creating all those
> > devices, and then after that it is just creating/deleting qdiscs and doing
> > updates with flower, where such contention is reduced. I.e. it may just
> > take longer for the mole to pop up.
> >
> > Why don't we push the V1 patch in and then worry about getting clever
> > with EAGAIN after? Can you test the V1 version with the repro Pedro
> > posted? It shouldn't have these issues. Also it would be interesting to
> > see how performance of the parallel updates to flower is affected.
>
> This or at least push first 4 patches of this series. They target other
> older commits and fix straightforward issues with the API.


Yes, let's get patches 1-4 in first ...

cheers,
jamal


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-29 13:55                     ` Jamal Hadi Salim
@ 2023-05-29 19:14                       ` Peilin Ye
  0 siblings, 0 replies; 45+ messages in thread
From: Peilin Ye @ 2023-05-29 19:14 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Vlad Buslov, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Mon, May 29, 2023 at 09:55:51AM -0400, Jamal Hadi Salim wrote:
> On Mon, May 29, 2023 at 8:06 AM Vlad Buslov <vladbu@nvidia.com> wrote:
> > On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > Why don't we push the V1 patch in and then worry about getting clever
> > > with EAGAIN after? Can you test the V1 version with the repro Pedro
> > > posted? It shouldn't have these issues. Also it would be interesting to
> > > see how performance of the parallel updates to flower is affected.
> >
> > This or at least push first 4 patches of this series. They target other
> > older commits and fix straightforward issues with the API.
> 
> Yes, lets get patch 1-4 in first ...

Sure, I'll send 1-4 as v6 first.

Thanks,
Peilin Ye



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-29 12:58                     ` Vlad Buslov
@ 2023-05-30  1:03                       ` Jakub Kicinski
  2023-05-30  9:11                       ` Peilin Ye
  1 sibling, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-05-30  1:03 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Peilin Ye, Jamal Hadi Salim, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Mon, 29 May 2023 15:58:50 +0300 Vlad Buslov wrote:
> If livelock with concurrent filters insertion is an issue, then it can
> be remedied by setting a new Qdisc->flags bit
> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> insertion coming after the flag is set to synchronize on rtnl lock.

Sounds very nice, yes.


* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-29 12:58                     ` Vlad Buslov
  2023-05-30  1:03                       ` Jakub Kicinski
@ 2023-05-30  9:11                       ` Peilin Ye
  2023-05-30 12:18                         ` Vlad Buslov
  1 sibling, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-05-30  9:11 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote:
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.

Thanks a lot for pointing this out!  The reproducers add flower filters to
ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests...

On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
> - Account for such cls_api behavior in sch_api by dropping and
>   re-tacking the lock before replaying. This actually seems to be quite
>   straightforward since 'replay' functionality that we are reusing for
>   this is designed for similar behavior - it releases rtnl lock before
>   loading a sch module, takes the lock again and safely replays the
>   function by re-obtaining all the necessary data.

Yes, I've tested this using that reproducer Pedro posted.

On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
> If livelock with concurrent filters insertion is an issue, then it can
> be remedied by setting a new Qdisc->flags bit
> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> insertion coming after the flag is set to synchronize on rtnl lock.

Thanks for the suggestion!  I'll try this approach.

Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
the "being-deleted" Qdisc.  I'll try forcing "late" requests (those that
arrive after the Qdisc is flagged as being-deleted) to sync on the RTNL
lock without (i.e. before) taking the Qdisc refcnt (otherwise I think
Task 1 will replay for even longer?).

Thanks,
Peilin Ye



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-30  9:11                       ` Peilin Ye
@ 2023-05-30 12:18                         ` Vlad Buslov
  2023-05-31  0:29                           ` Peilin Ye
  2023-06-01  3:57                           ` Peilin Ye
  0 siblings, 2 replies; 45+ messages in thread
From: Vlad Buslov @ 2023-05-30 12:18 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Tue 30 May 2023 at 02:11, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Mon, May 29, 2023 at 02:50:26PM +0300, Vlad Buslov wrote:
>> After looking very carefully at the code I think I know what the issue
>> might be:
>>
>>    Task 1 graft Qdisc   Task 2 new filter
>>            +                    +
>>            |                    |
>>            v                    v
>>         rtnl_lock()       take  q->refcnt
>>            +                    +
>>            |                    |
>>            v                    v
>> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>>
>> This will cause a real deadlock with the proposed patch. I'll try to
>> come up with a better approach. Sorry for not seeing it earlier.
>
> Thanks a lot for pointing this out!  The reproducers add flower filters to
> ingress Qdiscs so I didn't think of rtnl_lock()'ed filter requests...
>
> On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
>> - Account for such cls_api behavior in sch_api by dropping and
>>   re-tacking the lock before replaying. This actually seems to be quite
>>   straightforward since 'replay' functionality that we are reusing for
>>   this is designed for similar behavior - it releases rtnl lock before
>>   loading a sch module, takes the lock again and safely replays the
>>   function by re-obtaining all the necessary data.
>
> Yes, I've tested this using that reproducer Pedro posted.
>
> On Mon, May 29, 2023 at 03:58:50PM +0300, Vlad Buslov wrote:
>> If livelock with concurrent filters insertion is an issue, then it can
>> be remedied by setting a new Qdisc->flags bit
>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> insertion coming after the flag is set to synchronize on rtnl lock.
>
> Thanks for the suggestion!  I'll try this approach.
>
> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> even longer?).

Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
already returns -EINVAL when q->refcnt is zero, so maybe returning
-EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
set is also fine? Would be much easier to implement as opposed to moving
rtnl_lock there.



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-30 12:18                         ` Vlad Buslov
@ 2023-05-31  0:29                           ` Peilin Ye
  2023-06-01  3:57                           ` Peilin Ye
  1 sibling, 0 replies; 45+ messages in thread
From: Peilin Ye @ 2023-05-31  0:29 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
> >> If livelock with concurrent filters insertion is an issue, then it can
> >> be remedied by setting a new Qdisc->flags bit
> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >
> > Thanks for the suggestion!  I'll try this approach.
> >
> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> > even longer?).
> 
> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> already returns -EINVAL when q->refcnt is zero, so maybe returning
> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
> set is also fine? Would be much easier to implement as opposed to moving
> rtnl_lock there.

Ah, I see, sure.

Thanks,
Peilin Ye



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-05-30 12:18                         ` Vlad Buslov
  2023-05-31  0:29                           ` Peilin Ye
@ 2023-06-01  3:57                           ` Peilin Ye
  2023-06-01  6:20                             ` Vlad Buslov
  2023-06-01 13:03                             ` Pedro Tammela
  1 sibling, 2 replies; 45+ messages in thread
From: Peilin Ye @ 2023-06-01  3:57 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

Hi Vlad and all,

On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
> >> If livelock with concurrent filters insertion is an issue, then it can
> >> be remedied by setting a new Qdisc->flags bit
> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >
> > Thanks for the suggestion!  I'll try this approach.
> >
> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> > even longer?).
> 
> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> already returns -EINVAL when q->refcnt is zero, so maybe returning
> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
> set is also fine? Would be much easier to implement as opposed to moving
> rtnl_lock there.

I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
"request A") while it has a lot of ongoing filter requests, and here's the
result:

                        #1         #2         #3         #4
  ----------------------------------------------------------
   a. refcnt            89         93        230        571
   b. replayed     167,568    196,450    336,291    878,027
   c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
           user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
            sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s

   a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
   b. is the number of times A has been replayed;
   c. is the time(1) output for A.

a. and b. are collected from printk() output.  This is better than before,
but A could still be replayed hundreds of thousands of times and hang
for a few seconds.

Is this okay?  If not, is it possible (or should we) to make A really
_wait_ on Qdisc refcnt, instead of "busy-replaying"?

Thanks,
Peilin Ye

[1] Diff against v5 patch 6 (printk() calls not included):

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 3e9cc43cbc90..de7b0538b309 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -94,6 +94,7 @@ struct Qdisc {
 #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
        u32                     limit;
        const struct Qdisc_ops  *ops;
        struct qdisc_size_table __rcu *stab;
@@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
        return !READ_ONCE(qdisc->q.qlen);
 }

+static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
+{
+       return qdisc->flags & TCQ_F_DESTROYING;
+}
+
 /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
  * the qdisc root lock acquired.
  */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2621550bfddc..3e7f6f286ac0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
                *parent = (*q)->handle;
        } else {
                *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
-               if (!*q) {
+               if (!*q || qdisc_is_destroying(*q)) {
                        NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
                        err = -EINVAL;
                        goto errout_rcu;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 286b7c58f5b9..d6e47546c7fe 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
                                return -ENOENT;
                        }

-                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
-                        * RTNL-unlocked filter request(s).  This is the counterpart of that
-                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
+                        * accepting any more by marking it as "being destroyed", then tell the
+                        * caller to replay by returning -EAGAIN.
                         */
-                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+                       q = dev_queue->qdisc_sleeping;
+                       if (!qdisc_refcount_dec_if_one(q)) {
+                               q->flags |= TCQ_F_DESTROYING;
+                               rtnl_unlock();
+                               schedule();
+                               rtnl_lock();
                                return -EAGAIN;
+                       }
                }

                if (dev->flags & IFF_UP)



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-01  3:57                           ` Peilin Ye
@ 2023-06-01  6:20                             ` Vlad Buslov
  2023-06-07  0:57                               ` Peilin Ye
  2023-06-08  0:39                               ` Peilin Ye
  2023-06-01 13:03                             ` Pedro Tammela
  1 sibling, 2 replies; 45+ messages in thread
From: Vlad Buslov @ 2023-06-01  6:20 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> Hi Vlad and all,
>
> On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
>> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> be remedied by setting a new Qdisc->flags bit
>> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >
>> > Thanks for the suggestion!  I'll try this approach.
>> >
>> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> > even longer?).
>> 
>> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> set is also fine? Would be much easier to implement as opposed to moving
>> rtnl_lock there.
>
> I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> "request A") while it has a lot of ongoing filter requests, and here's the
> result:
>
>                         #1         #2         #3         #4
>   ----------------------------------------------------------
>    a. refcnt            89         93        230        571
>    b. replayed     167,568    196,450    336,291    878,027
>    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>
>    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>    b. is the number of times A has been replayed;
>    c. is the time(1) output for A.
>
> a. and b. are collected from printk() output.  This is better than before,
> but A could still be replayed for hundreds of thousands of times and hang
> for a few seconds.

I don't get where the few seconds of waiting time come from. I'm probably
missing something obvious here, but the waiting time should be the
maximum filter op latency of new/get/del filter request that is already
in-flight (i.e. already passed qdisc_is_destroying() check) and it
should take several orders of magnitude less time.

>
> Is this okay?  If not, is it possible (or should we) to make A really
> _wait_ on Qdisc refcnt, instead of "busy-replaying"?
>
> Thanks,
> Peilin Ye
>
> [1] Diff against v5 patch 6 (printk() calls not included):
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3e9cc43cbc90..de7b0538b309 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -94,6 +94,7 @@ struct Qdisc {
>  #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>  #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>  #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
>         u32                     limit;
>         const struct Qdisc_ops  *ops;
>         struct qdisc_size_table __rcu *stab;
> @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>         return !READ_ONCE(qdisc->q.qlen);
>  }
>
> +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> +{
> +       return qdisc->flags & TCQ_F_DESTROYING;

Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
flags since they are now used in the unlocked filter code path?

> +}
> +
>  /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
>   * the qdisc root lock acquired.
>   */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..3e7f6f286ac0 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
>                 *parent = (*q)->handle;
>         } else {
>                 *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
> -               if (!*q) {
> +               if (!*q || qdisc_is_destroying(*q)) {
>                         NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
>                         err = -EINVAL;
>                         goto errout_rcu;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..d6e47546c7fe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                 return -ENOENT;
>                         }
>
> -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> +                        * accepting any more by marking it as "being destroyed", then tell the
> +                        * caller to replay by returning -EAGAIN.
>                          */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       q = dev_queue->qdisc_sleeping;
> +                       if (!qdisc_refcount_dec_if_one(q)) {
> +                               q->flags |= TCQ_F_DESTROYING;
> +                               rtnl_unlock();
> +                               schedule();
> +                               rtnl_lock();
>                                 return -EAGAIN;
> +                       }
>                 }
>
>                 if (dev->flags & IFF_UP)



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-01  3:57                           ` Peilin Ye
  2023-06-01  6:20                             ` Vlad Buslov
@ 2023-06-01 13:03                             ` Pedro Tammela
  2023-06-07  4:25                               ` Peilin Ye
  1 sibling, 1 reply; 45+ messages in thread
From: Pedro Tammela @ 2023-06-01 13:03 UTC (permalink / raw)
  To: Peilin Ye, Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, David S. Miller, Eric Dumazet,
	Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye, Daniel Borkmann,
	John Fastabend, Hillf Danton, netdev, Cong Wang

On 01/06/2023 00:57, Peilin Ye wrote:
> Hi Vlad and all,
> 
> On Tue, May 30, 2023 at 03:18:19PM +0300, Vlad Buslov wrote:
>>>> If livelock with concurrent filters insertion is an issue, then it can
>>>> be remedied by setting a new Qdisc->flags bit
>>>> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>>>> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>>>> insertion coming after the flag is set to synchronize on rtnl lock.
>>>
>>> Thanks for the suggestion!  I'll try this approach.
>>>
>>> Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>>> the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>>> later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>>> (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>>> even longer?).
>>
>> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> set is also fine? Would be much easier to implement as opposed to moving
>> rtnl_lock there.
> 
> I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> "request A") while it has a lot of ongoing filter requests, and here's the
> result:
> 
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            89         93        230        571
>     b. replayed     167,568    196,450    336,291    878,027
>     c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>             user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>              sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
> 
>     a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>     b. is the number of times A has been replayed;
>     c. is the time(1) output for A.
> 
> a. and b. are collected from printk() output.  This is better than before,
> but A could still be replayed for hundreds of thousands of times and hang
> for a few seconds.
> 
> Is this okay?  If not, is it possible (or should we) to make A really
> _wait_ on Qdisc refcnt, instead of "busy-replaying"?
> 
> Thanks,
> Peilin Ye
> 
> [1] Diff against v5 patch 6 (printk() calls not included):
> 
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 3e9cc43cbc90..de7b0538b309 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -94,6 +94,7 @@ struct Qdisc {
>   #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>   #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>   #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DESTROYING       0x400 /* destroying, reject filter requests */
>          u32                     limit;
>          const struct Qdisc_ops  *ops;
>          struct qdisc_size_table __rcu *stab;
> @@ -185,6 +186,11 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
>          return !READ_ONCE(qdisc->q.qlen);
>   }
> 
> +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> +{
> +       return qdisc->flags & TCQ_F_DESTROYING;
> +}
> +
>   /* For !TCQ_F_NOLOCK qdisc, qdisc_run_begin/end() must be invoked with
>    * the qdisc root lock acquired.
>    */
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2621550bfddc..3e7f6f286ac0 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -1172,7 +1172,7 @@ static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
>                  *parent = (*q)->handle;
>          } else {
>                  *q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
> -               if (!*q) {
> +               if (!*q || qdisc_is_destroying(*q)) {
>                          NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
>                          err = -EINVAL;
>                          goto errout_rcu;
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 286b7c58f5b9..d6e47546c7fe 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1086,12 +1086,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>                                  return -ENOENT;
>                          }
> 
> -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> +                        * accepting any more by marking it as "being destroyed", then tell the
> +                        * caller to replay by returning -EAGAIN.
>                           */
> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> +                       q = dev_queue->qdisc_sleeping;
> +                       if (!qdisc_refcount_dec_if_one(q)) {
> +                               q->flags |= TCQ_F_DESTROYING;
> +                               rtnl_unlock();
> +                               schedule();
Was this intended, or just a leftover?
rtnl_lock() would reschedule if needed, since it is a mutex lock.
> +                               rtnl_lock();
>                                  return -EAGAIN;
> +                       }
>                  }
> 
>                  if (dev->flags & IFF_UP)
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-01  6:20                             ` Vlad Buslov
@ 2023-06-07  0:57                               ` Peilin Ye
  2023-06-07  8:18                                 ` Vlad Buslov
  2023-06-08  0:39                               ` Peilin Ye
  1 sibling, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-06-07  0:57 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
> > +{
> > +       return qdisc->flags & TCQ_F_DESTROYING;
> 
> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
> flags since they are now used in unlocked filter code path?

Thanks, after taking another look at cls_api.c, I noticed this code in
tc_new_tfilter():

	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
			      flags, extack);
	if (err == 0) {
		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
			       RTM_NEWTFILTER, false, rtnl_held, extack);
		tfilter_put(tp, fh);
		/* q pointer is NULL for shared blocks */
		if (q)
			q->flags &= ~TCQ_F_CAN_BYPASS;
	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
isn't atomic [1].

We also have this:

  ->dequeue()
    htb_dequeue()
      htb_dequeue_tree()
        qdisc_warn_nonwc():

  void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
  {
          if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
                  pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
                          txt, qdisc->ops->id, qdisc->handle >> 16);
                  qdisc->flags |= TCQ_F_WARN_NONWC;
          }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  }
  EXPORT_SYMBOL(qdisc_warn_nonwc);

Also non-atomic; isn't it possible for the above 2 underlined statements to
race with each other?  If true, I think we need to change Qdisc::flags to
use atomic bitops, just like what we're doing for Qdisc::state and
::state2.  It feels like a separate TODO, however.

I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
::state2, but not sure if it's okay to extend it for our purpose.

Thanks,
Peilin Ye

[1] Compiled to this on my Intel 64:

   0x0000000000017788 <+6472>:	83 62 10 fb        	andl   $0xfffffffb,0x10(%rdx)



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-01 13:03                             ` Pedro Tammela
@ 2023-06-07  4:25                               ` Peilin Ye
  0 siblings, 0 replies; 45+ messages in thread
From: Peilin Ye @ 2023-06-07  4:25 UTC (permalink / raw)
  To: Pedro Tammela
  Cc: Vlad Buslov, Jamal Hadi Salim, Jakub Kicinski, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

Hi Pedro,

On Thu, Jun 01, 2023 at 10:03:22AM -0300, Pedro Tammela wrote:
> On 01/06/2023 00:57, Peilin Ye wrote:
> > -                       /* Replay if the current ingress (or clsact) Qdisc has ongoing
> > -                        * RTNL-unlocked filter request(s).  This is the counterpart of that
> > -                        * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
> > +                       /* If current ingress (clsact) Qdisc has ongoing filter requests, stop
> > +                        * accepting any more by marking it as "being destroyed", then tell the
> > +                        * caller to replay by returning -EAGAIN.
> >                           */
> > -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
> > +                       q = dev_queue->qdisc_sleeping;
> > +                       if (!qdisc_refcount_dec_if_one(q)) {
> > +                               q->flags |= TCQ_F_DESTROYING;
> > +                               rtnl_unlock();
> > +                               schedule();
>
> Was this intended, or just a leftover?
> rtnl_lock() would reschedule if needed, since it is a mutex lock.

In qdisc_create():

			rtnl_unlock();
			request_module("sch_%s", name);
			rtnl_lock();
			ops = qdisc_lookup_ops(kind);
			if (ops != NULL) {
				/* We will try again qdisc_lookup_ops,
				 * so don't keep a reference.
				 */
				module_put(ops->owner);
				err = -EAGAIN;
				goto err_out;

If qdisc_lookup_ops() returns !NULL, it means we've successfully loaded the
module, so we know that the next replay should (normally) succeed.

However, in the diff I posted, if qdisc_refcount_dec_if_one() returns
false, we know that we should step back and wait a bit - especially when
there are ongoing RTNL-locked filter requests: we want them to grab the RTNL
mutex before us.  That was my intention here: unconditionally give up
the CPU.  I haven't tested whether it really makes a difference, though.

Thanks,
Peilin Ye



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-07  0:57                               ` Peilin Ye
@ 2023-06-07  8:18                                 ` Vlad Buslov
  2023-06-08  1:08                                   ` Peilin Ye
  0 siblings, 1 reply; 45+ messages in thread
From: Vlad Buslov @ 2023-06-07  8:18 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Tue 06 Jun 2023 at 17:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> On Wed 31 May 2023 at 20:57, Peilin Ye <yepeilin.cs@gmail.com> wrote:
>> > +static inline bool qdisc_is_destroying(const struct Qdisc *qdisc)
>> > +{
>> > +       return qdisc->flags & TCQ_F_DESTROYING;
>> 
>> Hmm, do we need at least some kind of {READ|WRITE}_ONCE() for accessing
>> flags since they are now used in unlocked filter code path?
>
> Thanks, after taking another look at cls_api.c, I noticed this code in
> tc_new_tfilter():
>
> 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> 			      flags, extack);
> 	if (err == 0) {
> 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> 			       RTM_NEWTFILTER, false, rtnl_held, extack);
> 		tfilter_put(tp, fh);
> 		/* q pointer is NULL for shared blocks */
> 		if (q)
> 			q->flags &= ~TCQ_F_CAN_BYPASS;
> 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
> isn't atomic [1].

Yeah, I see we have already got such behavior in 3f05e6886a59
("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").

>
> We also have this:
>
>   ->dequeue()
>     htb_dequeue()
>       htb_dequeue_tree()
>         qdisc_warn_nonwc():
>
>   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
>   {
>           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
>                           txt, qdisc->ops->id, qdisc->handle >> 16);
>                   qdisc->flags |= TCQ_F_WARN_NONWC;
>           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   }
>   EXPORT_SYMBOL(qdisc_warn_nonwc);
>
> Also non-atomic; isn't it possible for the above 2 underlined statements to
> race with each other?  If true, I think we need to change Qdisc::flags to
> use atomic bitops, just like what we're doing for Qdisc::state and
> ::state2.  It feels like a separate TODO, however.

It looks like even though 3f05e6886a59 ("net_sched: unset
TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
unlock by now we have these in exactly the same list of supported
kernels (5.4 LTS and newer). Considering this, the conversion to the
atomic bitops can be done as a standalone fix for cited commit and after
it will have been accepted and backported the qdisc fix can just assume
that qdisc->flags is an atomic bitops field in all target kernels and
use it as-is. WDYT?

>
> I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
> ::state2, but not sure if it's okay to extend it for our purpose.

As you described above qdisc->flags is already used to interact with cls
api (including changing it dynamically), so I don't see why not.



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-01  6:20                             ` Vlad Buslov
  2023-06-07  0:57                               ` Peilin Ye
@ 2023-06-08  0:39                               ` Peilin Ye
  2023-06-08  9:17                                 ` Vlad Buslov
  1 sibling, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-06-08  0:39 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
> >> >> If livelock with concurrent filters insertion is an issue, then it can
> >> >> be remedied by setting a new Qdisc->flags bit
> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
> >> >
> >> > Thanks for the suggestion!  I'll try this approach.
> >> >
> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
> >> > even longer?).
> >> 
> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
> >> set is also fine? Would be much easier to implement as opposed to moving
> >> rtnl_lock there.
> >
> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
> > "request A") while it has a lot of ongoing filter requests, and here's the
> > result:
> >
> >                         #1         #2         #3         #4
> >   ----------------------------------------------------------
> >    a. refcnt            89         93        230        571
> >    b. replayed     167,568    196,450    336,291    878,027
> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
> >
> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
> >    b. is the number of times A has been replayed;
> >    c. is the time(1) output for A.
> >
> > a. and b. are collected from printk() output.  This is better than before,
> > but A could still be replayed for hundreds of thousands of times and hang
> > for a few seconds.
> 
> I don't get where the few seconds of waiting time come from. I'm probably
> missing something obvious here, but the waiting time should be the
> maximum filter op latency of a new/get/del filter request that is already
> in-flight (i.e. has already passed the qdisc_is_destroying() check), and it
> should take several orders of magnitude less time.

Yeah I agree, here's what I did:

In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
loop:

  $ echo "1 1 32" > /sys/bus/netdevsim/new_device
  $ tc qdisc add dev eth1 ingress
  $ for (( i=1; i<=3000; i++ ))
  > do
  > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
  > done

When the loop is running, I delete the Qdisc in Terminal 2:

  $ time tc qdisc delete dev eth1 ingress

Which took seconds on average.  However, if I specify a unique "prio" when
adding filters in that loop, e.g.:

  $ for (( i=1; i<=3000; i++ ))
  > do
  > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
  > done				     ^^^^^^^

Then deleting the Qdisc in Terminal 2 becomes a lot faster:

  real  0m0.712s
  user  0m0.000s
  sys   0m0.152s 

In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
yet another test [1], which looks a lot better.

When I didn't specify "prio", sometimes that
rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
-EEXIST.  Is it because concurrent add-filter requests auto-allocated
the same "prio" number and collided with each other?  Do you think
this is related to why it's slow?

Thanks,
Peilin Ye

[1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
in -batch mode, each repeatedly adding a unique filter (with "prio" and
"handle" specified) and then deleting it.  Again, while they are running I
delete the ingress Qdisc, and here's the result:

                         #1         #2         #3         #4
   ----------------------------------------------------------
    a. refcnt            64         63         64         64
    b. replayed         169      5,630        887      3,442
    c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
            user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
             sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-07  8:18                                 ` Vlad Buslov
@ 2023-06-08  1:08                                   ` Peilin Ye
  2023-06-08  7:48                                     ` Vlad Buslov
  0 siblings, 1 reply; 45+ messages in thread
From: Peilin Ye @ 2023-06-08  1:08 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
> > ::state2, but not sure if it's okay to extend it for our purpose.
>
> As you described above qdisc->flags is already used to interact with cls
> api (including changing it dynamically), so I don't see why not.

Sorry, I don't follow; I meant qdisc->state2:

  enum qdisc_state2_t {
          /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
           * Use qdisc_run_begin/end() or qdisc_is_running() instead.
           */
          __QDISC_STATE2_RUNNING,
  };

NVM, I think using qdisc->flags after making it atomic sounds better.

On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
> > 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
> > 			      flags, extack);
> > 	if (err == 0) {
> > 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
> > 			       RTM_NEWTFILTER, false, rtnl_held, extack);
> > 		tfilter_put(tp, fh);
> > 		/* q pointer is NULL for shared blocks */
> > 		if (q)
> > 			q->flags &= ~TCQ_F_CAN_BYPASS;
> > 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
> > isn't atomic [1].
> 
> Yeah, I see we have already got such behavior in 3f05e6886a59
> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").
> 
> > We also have this:
> >
> >   ->dequeue()
> >     htb_dequeue()
> >       htb_dequeue_tree()
> >         qdisc_warn_nonwc():
> >
> >   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
> >   {
> >           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
> >                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
> >                           txt, qdisc->ops->id, qdisc->handle >> 16);
> >                   qdisc->flags |= TCQ_F_WARN_NONWC;
> >           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >   }
> >   EXPORT_SYMBOL(qdisc_warn_nonwc);
> >
> > Also non-atomic; isn't it possible for the above 2 underlined statements to
> > race with each other?  If true, I think we need to change Qdisc::flags to
> > use atomic bitops, just like what we're doing for Qdisc::state and
> > ::state2.  It feels like a separate TODO, however.
> 
> It looks like even though 3f05e6886a59 ("net_sched: unset
> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
> unlock by now we have these in exactly the same list of supported
> kernels (5.4 LTS and newer). Considering this, the conversion to the
> atomic bitops can be done as a standalone fix for cited commit and after
> it will have been accepted and backported the qdisc fix can just assume
> that qdisc->flags is an atomic bitops field in all target kernels and
> use it as-is. WDYT?

Sounds great, how about:

  1. I'll post the non-replay version of the fix (after updating the commit
     message), and we apply that first, as suggested by Jamal

  2. Make qdisc->flags atomic

  3. Make the fix better by replaying and using the (now atomic)
     IS-DESTROYING flag with test_and_set_bit() and friends

?

Thanks,
Peilin Ye



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-08  1:08                                   ` Peilin Ye
@ 2023-06-08  7:48                                     ` Vlad Buslov
  2023-06-11  3:25                                       ` Peilin Ye
  0 siblings, 1 reply; 45+ messages in thread
From: Vlad Buslov @ 2023-06-08  7:48 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Wed 07 Jun 2023 at 18:08, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
>> > I also thought about adding the new DELETED-REJECT-NEW-FILTERS flag to
>> > ::state2, but not sure if it's okay to extend it for our purpose.
>>
>> As you described above qdisc->flags is already used to interact with cls
>> api (including changing it dynamically), so I don't see why not.
>
> Sorry, I don't follow, I meant qdisc->state2:
>
>   enum qdisc_state2_t {
>           /* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
>            * Use qdisc_run_begin/end() or qdisc_is_running() instead.
>            */
>           __QDISC_STATE2_RUNNING,
>   };

Sorry, I misunderstood what you were suggesting. Got it now.

>
> NVM, I think using qdisc->flags after making it atomic sounds better.

Agree.

>
> On Wed, Jun 07, 2023 at 11:18:32AM +0300, Vlad Buslov wrote:
>> > 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
>> > 			      flags, extack);
>> > 	if (err == 0) {
>> > 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
>> > 			       RTM_NEWTFILTER, false, rtnl_held, extack);
>> > 		tfilter_put(tp, fh);
>> > 		/* q pointer is NULL for shared blocks */
>> > 		if (q)
>> > 			q->flags &= ~TCQ_F_CAN_BYPASS;
>> > 	}               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >
>> > TCQ_F_CAN_BYPASS is cleared after e.g. adding a filter to the Qdisc, and it
>> > isn't atomic [1].
>> 
>> Yeah, I see we have already got such behavior in 3f05e6886a59
>> ("net_sched: unset TCQ_F_CAN_BYPASS when adding filters").
>> 
>> > We also have this:
>> >
>> >   ->dequeue()
>> >     htb_dequeue()
>> >       htb_dequeue_tree()
>> >         qdisc_warn_nonwc():
>> >
>> >   void qdisc_warn_nonwc(const char *txt, struct Qdisc *qdisc)
>> >   {
>> >           if (!(qdisc->flags & TCQ_F_WARN_NONWC)) {
>> >                   pr_warn("%s: %s qdisc %X: is non-work-conserving?\n",
>> >                           txt, qdisc->ops->id, qdisc->handle >> 16);
>> >                   qdisc->flags |= TCQ_F_WARN_NONWC;
>> >           }       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> >   }
>> >   EXPORT_SYMBOL(qdisc_warn_nonwc);
>> >
>> > Also non-atomic; isn't it possible for the above 2 underlined statements to
>> > race with each other?  If true, I think we need to change Qdisc::flags to
>> > use atomic bitops, just like what we're doing for Qdisc::state and
>> > ::state2.  It feels like a separate TODO, however.
>> 
>> It looks like even though 3f05e6886a59 ("net_sched: unset
>> TCQ_F_CAN_BYPASS when adding filters") was introduced after cls api
>> unlock by now we have these in exactly the same list of supported
>> kernels (5.4 LTS and newer). Considering this, the conversion to the
>> atomic bitops can be done as a standalone fix for cited commit and after
>> it will have been accepted and backported the qdisc fix can just assume
>> that qdisc->flags is an atomic bitops field in all target kernels and
>> use it as-is. WDYT?
>
> Sounds great, how about:
>
>   1. I'll post the non-replay version of the fix (after updating the commit
>      message), and we apply that first, as suggested by Jamal

From my side there are no objections to any of the proposed approaches,
since we have never had users with a legitimate use case that requires
replacing/deleting a qdisc concurrently with a filter update, so
returning -EBUSY (or -EAGAIN) to the user in such a case would work as
either a temporary or the final fix.  However, Jakub had reservations
about this approach, so I don't know where we stand now.

>
>   2. Make qdisc->flags atomic
>
>   3. Make the fix better by replaying and using the (now atomic)
>      IS-DESTROYING flag with test_and_set_bit() and friends
>
> ?

Again, no objections from my side. Ping me if you need help with any of
these.



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-08  0:39                               ` Peilin Ye
@ 2023-06-08  9:17                                 ` Vlad Buslov
  2023-06-10  0:20                                   ` Peilin Ye
  0 siblings, 1 reply; 45+ messages in thread
From: Vlad Buslov @ 2023-06-08  9:17 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang


On Wed 07 Jun 2023 at 17:39, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> On Thu, Jun 01, 2023 at 09:20:39AM +0300, Vlad Buslov wrote:
>> >> >> If livelock with concurrent filters insertion is an issue, then it can
>> >> >> be remedied by setting a new Qdisc->flags bit
>> >> >> "DELETED-REJECT-NEW-FILTERS" and checking for it together with
>> >> >> QDISC_CLASS_OPS_DOIT_UNLOCKED in order to force any concurrent filter
>> >> >> insertion coming after the flag is set to synchronize on rtnl lock.
>> >> >
>> >> > Thanks for the suggestion!  I'll try this approach.
>> >> >
>> >> > Currently QDISC_CLASS_OPS_DOIT_UNLOCKED is checked after taking a refcnt of
>> >> > the "being-deleted" Qdisc.  I'll try forcing "late" requests (that arrive
>> >> > later than Qdisc is flagged as being-deleted) sync on RTNL lock without
>> >> > (before) taking the Qdisc refcnt (otherwise I think Task 1 will replay for
>> >> > even longer?).
>> >> 
>> >> Yeah, I see what you mean. Looking at the code __tcf_qdisc_find()
>> >> already returns -EINVAL when q->refcnt is zero, so maybe returning
>> >> -EINVAL from that function when "DELETED-REJECT-NEW-FILTERS" flags is
>> >> set is also fine? Would be much easier to implement as opposed to moving
>> >> rtnl_lock there.
>> >
>> > I implemented [1] this suggestion and tested the livelock issue in QEMU (-m
>> > 16G, CONFIG_NR_CPUS=8).  I tried deleting the ingress Qdisc (let's call it
>> > "request A") while it has a lot of ongoing filter requests, and here's the
>> > result:
>> >
>> >                         #1         #2         #3         #4
>> >   ----------------------------------------------------------
>> >    a. refcnt            89         93        230        571
>> >    b. replayed     167,568    196,450    336,291    878,027
>> >    c. time real   0m2.478s   0m2.746s   0m3.693s   0m9.461s
>> >            user   0m0.000s   0m0.000s   0m0.000s   0m0.000s
>> >             sys   0m0.623s   0m0.681s   0m1.119s   0m2.770s
>> >
>> >    a. is the Qdisc refcnt when A calls qdisc_graft() for the first time;
>> >    b. is the number of times A has been replayed;
>> >    c. is the time(1) output for A.
>> >
>> > a. and b. are collected from printk() output.  This is better than before,
>> > but A could still be replayed for hundreds of thousands of times and hang
>> > for a few seconds.
>> 
>> I don't get where the few seconds of waiting time come from. I'm probably
>> missing something obvious here, but the waiting time should be the
>> maximum filter op latency of a new/get/del filter request that is already
>> in-flight (i.e. has already passed the qdisc_is_destroying() check), and it
>> should take several orders of magnitude less time.
>
> Yeah I agree, here's what I did:
>
> In Terminal 1 I keep adding filters to eth1 in a naive and unrealistic
> loop:
>
>   $ echo "1 1 32" > /sys/bus/netdevsim/new_device
>   $ tc qdisc add dev eth1 ingress
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done
>
> When the loop is running, I delete the Qdisc in Terminal 2:
>
>   $ time tc qdisc delete dev eth1 ingress
>
> Which took seconds on average.  However, if I specify a unique "prio" when
> adding filters in that loop, e.g.:
>
>   $ for (( i=1; i<=3000; i++ ))
>   > do
>   > tc filter add dev eth1 ingress proto all prio $i flower src_mac 00:11:22:33:44:55 action pass > /dev/null 2>&1 &
>   > done				     ^^^^^^^
>
> Then deleting the Qdisc in Terminal 2 becomes a lot faster:
>
>   real  0m0.712s
>   user  0m0.000s
>   sys   0m0.152s 
>
> In fact it's so fast that I couldn't even make qdisc->refcnt > 1, so I did
> yet another test [1], which looks a lot better.

That makes sense, thanks for explaining.

>
> When I didn't specify "prio", sometimes that
> rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returns
> -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> the same "prio" number and collided with each other?  Do you think
> this is related to why it's slow?

It is slow because, when creating a filter without providing a priority, you
are basically measuring the latency of creating a whole flower
classifier instance (multiple memory allocations, initialization of all
kinds of idrs, hash tables and locks, updating the tp list in the chain,
etc.), not just a single filter, so significantly higher latency is expected.

But my point still stands: with the latest version of your fix, the maximum
time spent 'spinning' in sch_api is the maximum latency of a concurrent
tcf_{new|del|get}_tfilter op that has already obtained the qdisc, and
any concurrent filter API messages coming after the qdisc->flags
"DELETED-REJECT-NEW-FILTERS" bit has been set will fail and can't livelock
the concurrent qdisc del/replace.

>
> Thanks,
> Peilin Ye
>
> [1] In a beefier QEMU setup (64 cores, -m 128G), I started 64 tc instances
> in -batch mode, each repeatedly adding a unique filter (with "prio" and
> "handle" specified) and then deleting it.  Again, while they are running I
> delete the ingress Qdisc, and here's the result:
>
>                          #1         #2         #3         #4
>    ----------------------------------------------------------
>     a. refcnt            64         63         64         64
>     b. replayed         169      5,630        887      3,442
>     c. time real   0m0.171s   0m0.147s   0m0.186s   0m0.111s
>             user   0m0.000s   0m0.009s   0m0.001s   0m0.000s
>              sys   0m0.112s   0m0.108s   0m0.115s   0m0.104s



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-08  9:17                                 ` Vlad Buslov
@ 2023-06-10  0:20                                   ` Peilin Ye
  0 siblings, 0 replies; 45+ messages in thread
From: Peilin Ye @ 2023-06-10  0:20 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Thu, Jun 08, 2023 at 12:17:27PM +0300, Vlad Buslov wrote:
> > When I didn't specify "prio", sometimes the
> > rhashtable_lookup_insert_fast() call in fl_ht_insert_unique() returned
> > -EEXIST.  Is it because concurrent add-filter requests auto-allocated
> > the same "prio" number and therefore collided with each other?  Do you
> > think this is related to why it's slow?
> 
> It is slow because, when creating a filter without providing a priority,
> you are basically measuring the latency of creating a whole flower
> classifier instance (multiple memory allocations, initialization of all
> kinds of idrs, hash tables and locks, updating the tp list in the chain,
> etc.), not just a single filter, so significantly higher latency is
> expected.

Ah, I see.  Thanks for the explanation.

Thanks,
Peilin Ye



* Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
  2023-06-08  7:48                                     ` Vlad Buslov
@ 2023-06-11  3:25                                       ` Peilin Ye
  0 siblings, 0 replies; 45+ messages in thread
From: Peilin Ye @ 2023-06-11  3:25 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Jamal Hadi Salim, Jakub Kicinski, Pedro Tammela, David S. Miller,
	Eric Dumazet, Paolo Abeni, Cong Wang, Jiri Pirko, Peilin Ye,
	Daniel Borkmann, John Fastabend, Hillf Danton, netdev, Cong Wang

On Thu, Jun 08, 2023 at 10:48:12AM +0300, Vlad Buslov wrote:
> >> It looks like even though 3f05e6886a59 ("net_sched: unset
> >> TCQ_F_CAN_BYPASS when adding filters") was introduced after the cls
> >> api unlock, by now both are in exactly the same list of supported
> >> kernels (5.4 LTS and newer). Considering this, the conversion to
> >> atomic bitops can be done as a standalone fix for the cited commit;
> >> after it has been accepted and backported, the qdisc fix can simply
> >> assume that qdisc->flags is an atomic bitops field in all target
> >> kernels and use it as-is. WDYT?
> >
> > Sounds great, how about:
> >
> >   1. I'll post the non-replay version of the fix (after updating the commit
> >      message), and we apply that first, as suggested by Jamal
>
> From my side there are no objections to any of the proposed approaches,
> since we have never had any users with a legitimate use case where they
> need to replace/delete a qdisc concurrently with a filter update, so
> returning -EBUSY (or -EAGAIN) to the user in such a case would work as
> either a temporary or the final fix.

I see, yeah.

> However, Jakub had reservations about that approach, so I don't know
> where we stand now regarding this.

Either way, I'd say applying that non-replay version first is better than
leaving the bug unfixed; it has been many days since the root cause of
the issue was figured out.  I'll post it, and start making qdisc->flags
atomic.
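As a rough illustration of what the atomic conversion buys us, here is a
sketch in Python (standing in for the kernel's atomic bitops; the
`AtomicFlags` class and `IS_DESTROYING` bit name are illustrative, and a
lock emulates the hardware atomicity): test_and_set_bit() makes "was the
bit already set?" and "set it" a single step, so two racing callers
cannot both conclude they were first.

```python
import threading

class AtomicFlags:
    def __init__(self):
        self._bits = 0
        self._lock = threading.Lock()   # emulates atomicity of the bitop

    def test_and_set_bit(self, bit):
        """Returns the previous value of the bit, like the kernel helper."""
        with self._lock:
            old = bool(self._bits & (1 << bit))
            self._bits |= 1 << bit
            return old

IS_DESTROYING = 0
flags = AtomicFlags()
winners = []

def try_destroy(name):
    # check-and-claim is one atomic step: only the first caller proceeds
    if not flags.test_and_set_bit(IS_DESTROYING):
        winners.append(name)

threads = [threading.Thread(target=try_destroy, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(winners) == 1                # exactly one destroyer wins the race
```

With a plain (non-atomic) read-then-write on qdisc->flags, two callers
could both observe the bit clear before either sets it; the combined
test-and-set closes that window.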

> >   2. Make qdisc->flags atomic
> >
> >   3. Make the fix better by replaying and using the (now atomic)
> >      IS-DESTROYING flag with test_and_set_bit() and friends
> >
> > ?
>
> Again, no objections from my side. Ping me if you need help with any of
> these.

Sure, thanks, Vlad!
Peilin Ye



end of thread, other threads:[~2023-06-11  3:25 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
2023-05-24 15:37   ` Pedro Tammela
2023-05-24 15:57   ` Jamal Hadi Salim
2023-05-24  1:18 ` [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24 15:58     ` Jamal Hadi Salim
2023-05-24  1:19 ` [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24  1:19 ` [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24  1:20 ` [PATCH v5 net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
2023-05-24  1:20 ` [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
2023-05-24 15:39   ` Pedro Tammela
2023-05-24 16:09     ` Jamal Hadi Salim
2023-05-25  9:25       ` Paolo Abeni
2023-05-26 12:19         ` Jamal Hadi Salim
2023-05-26 12:20     ` Jamal Hadi Salim
2023-05-26 19:47       ` Jamal Hadi Salim
2023-05-26 20:21         ` Pedro Tammela
2023-05-26 23:09           ` Peilin Ye
2023-05-27  2:33             ` Jakub Kicinski
2023-05-27  8:23               ` Peilin Ye
2023-05-28 18:54                 ` Jamal Hadi Salim
2023-05-29 11:50                   ` Vlad Buslov
2023-05-29 12:58                     ` Vlad Buslov
2023-05-30  1:03                       ` Jakub Kicinski
2023-05-30  9:11                       ` Peilin Ye
2023-05-30 12:18                         ` Vlad Buslov
2023-05-31  0:29                           ` Peilin Ye
2023-06-01  3:57                           ` Peilin Ye
2023-06-01  6:20                             ` Vlad Buslov
2023-06-07  0:57                               ` Peilin Ye
2023-06-07  8:18                                 ` Vlad Buslov
2023-06-08  1:08                                   ` Peilin Ye
2023-06-08  7:48                                     ` Vlad Buslov
2023-06-11  3:25                                       ` Peilin Ye
2023-06-08  0:39                               ` Peilin Ye
2023-06-08  9:17                                 ` Vlad Buslov
2023-06-10  0:20                                   ` Peilin Ye
2023-06-01 13:03                             ` Pedro Tammela
2023-06-07  4:25                               ` Peilin Ye
2023-05-29 13:55                     ` Jamal Hadi Salim
2023-05-29 19:14                       ` Peilin Ye
2023-05-25 17:16 ` [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Vlad Buslov
