All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peilin Ye <yepeilin.cs@gmail.com>
To: "David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>
Cc: Peilin Ye <peilin.ye@bytedance.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Vlad Buslov <vladbu@mellanox.com>,
	Pedro Tammela <pctammela@mojatatu.com>,
	Hillf Danton <hdanton@sina.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	Cong Wang <cong.wang@bytedance.com>,
	Peilin Ye <yepeilin.cs@gmail.com>
Subject: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Date: Tue, 23 May 2023 18:20:20 -0700	[thread overview]
Message-ID: <429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com> (raw)
In-Reply-To: <cover.1684887977.git.peilin.ye@bytedance.com>

From: Peilin Ye <peilin.ye@bytedance.com>

mini_Qdisc_pair::p_miniq is a double pointer to mini_Qdisc, initialized in
ingress_init() to point to net_device::miniq_ingress.  ingress Qdiscs
access this per-net_device pointer in mini_qdisc_pair_swap().  Similar for
clsact Qdiscs and miniq_egress.

Unfortunately, after introducing RTNL-unlocked RTM_{NEW,DEL,GET}TFILTER
requests (thanks Hillf Danton for the hint), when replacing ingress or
clsact Qdiscs, for example, the old Qdisc ("@old") could access the same
miniq_{in,e}gress pointer(s) concurrently with the new Qdisc ("@new"),
causing race conditions [1] including a use-after-free bug in
mini_qdisc_pair_swap() reported by syzbot:

 BUG: KASAN: slab-use-after-free in mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
 Write of size 8 at addr ffff888045b31308 by task syz-executor690/14901
...
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
  print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:319
  print_report mm/kasan/report.c:430 [inline]
  kasan_report+0x11c/0x130 mm/kasan/report.c:536
  mini_qdisc_pair_swap+0x1c2/0x1f0 net/sched/sch_generic.c:1573
  tcf_chain_head_change_item net/sched/cls_api.c:495 [inline]
  tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:509
  tcf_chain_tp_insert net/sched/cls_api.c:1826 [inline]
  tcf_chain_tp_insert_unique net/sched/cls_api.c:1875 [inline]
  tc_new_tfilter+0x1de6/0x2290 net/sched/cls_api.c:2266
...

@old and @new should not affect each other.  In other words, @old should
never modify miniq_{in,e}gress after @new, and @new should not update
@old's RCU state.  Fixing without changing sch_api.c turned out to be
difficult (please refer to Closes: for discussions).  Instead, make sure
@new's first call always happen after @old's last call, in
qdisc_destroy(), has finished:

In qdisc_graft(), return -EAGAIN and tell the caller to replay
(suggested by Vlad Buslov) if @old has any ongoing RTNL-unlocked filter
requests, and call qdisc_destroy() for @old before grafting @new.

Introduce qdisc_refcount_dec_if_one() as the counterpart of
qdisc_refcount_inc_nz() used for RTNL-unlocked filter requests.  Introduce
a non-static version of qdisc_destroy() that does a TCQ_F_BUILTIN check,
just like qdisc_put() etc.

Depends on patch "net/sched: Refactor qdisc_graft() for ingress and clsact
Qdiscs".

[1] To illustrate, the syzkaller reproducer adds ingress Qdiscs under
TC_H_ROOT (no longer possible after patch "net/sched: sch_ingress: Only
create under TC_H_INGRESS") on eth0 that has 8 transmission queues:

  Thread 1 creates ingress Qdisc A (containing mini Qdisc a1 and a2), then
  adds a flower filter X to A.

  Thread 2 creates another ingress Qdisc B (containing mini Qdisc b1 and
  b2) to replace A, then adds a flower filter Y to B.

 Thread 1               A's refcnt   Thread 2
  RTM_NEWQDISC (A, RTNL-locked)
   qdisc_create(A)               1
   qdisc_graft(A)                9

  RTM_NEWTFILTER (X, RTNL-unlocked)
   __tcf_qdisc_find(A)          10
   tcf_chain0_head_change(A)
   mini_qdisc_pair_swap(A) (1st)
            |
            |                         RTM_NEWQDISC (B, RTNL-locked)
         RCU sync                2     qdisc_graft(B)
            |                    1     notify_and_destroy(A)
            |
   tcf_block_release(A)          0    RTM_NEWTFILTER (Y, RTNL-unlocked)
   qdisc_destroy(A)                    tcf_chain0_head_change(B)
   tcf_chain0_head_change_cb_del(A)    mini_qdisc_pair_swap(B) (2nd)
   mini_qdisc_pair_swap(A) (3rd)                |
           ...                                 ...

Here, B calls mini_qdisc_pair_swap(), pointing eth0->miniq_ingress to its
mini Qdisc, b1.  Then, A calls mini_qdisc_pair_swap() again during
ingress_destroy(), setting eth0->miniq_ingress to NULL, so ingress packets
on eth0 will not find filter Y in sch_handle_ingress().

This is only one of the possible consequences of concurrently accessing
miniq_{in,e}gress pointers.  The point is clear though: again, A should
never modify those per-net_device pointers after B, and B should not
update A's RCU state.

Fixes: 7a096d579e8e ("net: sched: ingress: set 'unlocked' flag for Qdisc ops")
Fixes: 87f373921c4e ("net: sched: ingress: set 'unlocked' flag for clsact Qdisc ops")
Reported-by: syzbot+b53a9c0d1ea4ad62da8b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/0000000000006cf87705f79acf1a@google.com/
Cc: Hillf Danton <hdanton@sina.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
change in v5:
  - reinitialize @q, @p (suggested by Vlad) and @tcm before replaying,
    just like @flags in tc_new_tfilter()

change in v3, v4:
  - add in-body From: tag

changes in v2:
  - replay the request if the current Qdisc has any ongoing RTNL-unlocked
    filter requests (Vlad)
  - minor changes in code comments and commit log

 include/net/sch_generic.h |  8 ++++++++
 net/sched/sch_api.c       | 40 ++++++++++++++++++++++++++++++---------
 net/sched/sch_generic.c   | 14 +++++++++++---
 3 files changed, 50 insertions(+), 12 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index fab5ba3e61b7..3e9cc43cbc90 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -137,6 +137,13 @@ static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
 	refcount_inc(&qdisc->refcnt);
 }
 
+static inline bool qdisc_refcount_dec_if_one(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return true;
+	return refcount_dec_if_one(&qdisc->refcnt);
+}
+
 /* Intended to be used by unlocked users, when concurrent qdisc release is
  * possible.
  */
@@ -652,6 +659,7 @@ void dev_deactivate_many(struct list_head *head);
 struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 			      struct Qdisc *qdisc);
 void qdisc_reset(struct Qdisc *qdisc);
+void qdisc_destroy(struct Qdisc *qdisc);
 void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, int n, int len);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index f72a581666a2..286b7c58f5b9 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1080,10 +1080,18 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 		if ((q && q->flags & TCQ_F_INGRESS) ||
 		    (new && new->flags & TCQ_F_INGRESS)) {
 			ingress = 1;
-			if (!dev_ingress_queue(dev)) {
+			dev_queue = dev_ingress_queue(dev);
+			if (!dev_queue) {
 				NL_SET_ERR_MSG(extack, "Device does not have an ingress queue");
 				return -ENOENT;
 			}
+
+			/* Replay if the current ingress (or clsact) Qdisc has ongoing
+			 * RTNL-unlocked filter request(s).  This is the counterpart of that
+			 * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
+			 */
+			if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
+				return -EAGAIN;
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1104,8 +1112,16 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 				qdisc_put(old);
 			}
 		} else {
-			dev_queue = dev_ingress_queue(dev);
-			old = dev_graft_qdisc(dev_queue, new);
+			old = dev_graft_qdisc(dev_queue, NULL);
+
+			/* {ingress,clsact}_destroy() @old before grafting @new to avoid
+			 * unprotected concurrent accesses to net_device::miniq_{in,e}gress
+			 * pointer(s) in mini_qdisc_pair_swap().
+			 */
+			qdisc_notify(net, skb, n, classid, old, new, extack);
+			qdisc_destroy(old);
+
+			dev_graft_qdisc(dev_queue, new);
 		}
 
 skip:
@@ -1119,8 +1135,6 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 
 			if (new && new->ops->attach)
 				new->ops->attach(new);
-		} else {
-			notify_and_destroy(net, skb, n, classid, old, new, extack);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -1450,19 +1464,22 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			struct netlink_ext_ack *extack)
 {
 	struct net *net = sock_net(skb->sk);
-	struct tcmsg *tcm = nlmsg_data(n);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct net_device *dev;
+	struct Qdisc *q, *p;
+	struct tcmsg *tcm;
 	u32 clid;
-	struct Qdisc *q = NULL;
-	struct Qdisc *p = NULL;
 	int err;
 
+replay:
 	err = nlmsg_parse_deprecated(n, sizeof(*tcm), tca, TCA_MAX,
 				     rtm_tca_policy, extack);
 	if (err < 0)
 		return err;
 
+	tcm = nlmsg_data(n);
+	q = p = NULL;
+
 	dev = __dev_get_by_index(net, tcm->tcm_ifindex);
 	if (!dev)
 		return -ENODEV;
@@ -1515,8 +1532,11 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 			return -ENOENT;
 		}
 		err = qdisc_graft(dev, p, skb, n, clid, NULL, q, extack);
-		if (err != 0)
+		if (err != 0) {
+			if (err == -EAGAIN)
+				goto replay;
 			return err;
+		}
 	} else {
 		qdisc_notify(net, skb, n, clid, NULL, q, NULL);
 	}
@@ -1704,6 +1724,8 @@ static int tc_modify_qdisc(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err) {
 		if (q)
 			qdisc_put(q);
+		if (err == -EAGAIN)
+			goto replay;
 		return err;
 	}
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 37e41f972f69..e14ed47f961c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1046,7 +1046,7 @@ static void qdisc_free_cb(struct rcu_head *head)
 	qdisc_free(q);
 }
 
-static void qdisc_destroy(struct Qdisc *qdisc)
+static void __qdisc_destroy(struct Qdisc *qdisc)
 {
 	const struct Qdisc_ops  *ops = qdisc->ops;
 
@@ -1070,6 +1070,14 @@ static void qdisc_destroy(struct Qdisc *qdisc)
 	call_rcu(&qdisc->rcu, qdisc_free_cb);
 }
 
+void qdisc_destroy(struct Qdisc *qdisc)
+{
+	if (qdisc->flags & TCQ_F_BUILTIN)
+		return;
+
+	__qdisc_destroy(qdisc);
+}
+
 void qdisc_put(struct Qdisc *qdisc)
 {
 	if (!qdisc)
@@ -1079,7 +1087,7 @@ void qdisc_put(struct Qdisc *qdisc)
 	    !refcount_dec_and_test(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 }
 EXPORT_SYMBOL(qdisc_put);
 
@@ -1094,7 +1102,7 @@ void qdisc_put_unlocked(struct Qdisc *qdisc)
 	    !refcount_dec_and_rtnl_lock(&qdisc->refcnt))
 		return;
 
-	qdisc_destroy(qdisc);
+	__qdisc_destroy(qdisc);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(qdisc_put_unlocked);
-- 
2.20.1


  parent reply	other threads:[~2023-05-24  1:20 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
2023-05-24 15:37   ` Pedro Tammela
2023-05-24 15:57   ` Jamal Hadi Salim
2023-05-24  1:18 ` [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24 15:58     ` Jamal Hadi Salim
2023-05-24  1:19 ` [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24  1:19 ` [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24  1:20 ` [PATCH v5 net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
2023-05-24  1:20 ` Peilin Ye [this message]
2023-05-24 15:39   ` [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Pedro Tammela
2023-05-24 16:09     ` Jamal Hadi Salim
2023-05-25  9:25       ` Paolo Abeni
2023-05-26 12:19         ` Jamal Hadi Salim
2023-05-26 12:20     ` Jamal Hadi Salim
2023-05-26 19:47       ` Jamal Hadi Salim
2023-05-26 20:21         ` Pedro Tammela
2023-05-26 23:09           ` Peilin Ye
2023-05-27  2:33             ` Jakub Kicinski
2023-05-27  8:23               ` Peilin Ye
2023-05-28 18:54                 ` Jamal Hadi Salim
2023-05-29 11:50                   ` Vlad Buslov
2023-05-29 12:58                     ` Vlad Buslov
2023-05-30  1:03                       ` Jakub Kicinski
2023-05-30  9:11                       ` Peilin Ye
2023-05-30 12:18                         ` Vlad Buslov
2023-05-31  0:29                           ` Peilin Ye
2023-06-01  3:57                           ` Peilin Ye
2023-06-01  6:20                             ` Vlad Buslov
2023-06-07  0:57                               ` Peilin Ye
2023-06-07  8:18                                 ` Vlad Buslov
2023-06-08  1:08                                   ` Peilin Ye
2023-06-08  7:48                                     ` Vlad Buslov
2023-06-11  3:25                                       ` Peilin Ye
2023-06-08  0:39                               ` Peilin Ye
2023-06-08  9:17                                 ` Vlad Buslov
2023-06-10  0:20                                   ` Peilin Ye
2023-06-01 13:03                             ` Pedro Tammela
2023-06-07  4:25                               ` Peilin Ye
2023-05-29 13:55                     ` Jamal Hadi Salim
2023-05-29 19:14                       ` Peilin Ye
2023-05-25 17:16 ` [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Vlad Buslov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=429357af094297abbc45f47b8e606f11206df049.1684887977.git.peilin.ye@bytedance.com \
    --to=yepeilin.cs@gmail.com \
    --cc=cong.wang@bytedance.com \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hdanton@sina.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=pctammela@mojatatu.com \
    --cc=peilin.ye@bytedance.com \
    --cc=vladbu@mellanox.com \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.