* [Patch net 00/15] net_sched: remove RCU callbacks from TC
@ 2017-10-23 22:02 Cong Wang
  2017-10-23 22:02 ` [Patch net 01/15] net_sched: remove RCU callbacks in basic filter Cong Wang
                   ` (16 more replies)
  0 siblings, 17 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang

Recently, the RCU callbacks used in TC filters and TC actions have
kept drawing my attention; they have introduced at least four
race-condition bugs:

1. A simple one fixed by Daniel:

commit c78e1746d3ad7d548bdf3fe491898cc453911a49
Author: Daniel Borkmann <daniel@iogearbox.net>
Date:   Wed May 20 17:13:33 2015 +0200

    net: sched: fix call_rcu() race on classifier module unloads

2. A very nasty one fixed by me:

commit 1697c4bb5245649a23f06a144cc38c06715e1b65
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Sep 11 16:33:32 2017 -0700

    net_sched: carefully handle tcf_block_put()

3. Two more bugs found by Chris:
https://patchwork.ozlabs.org/patch/826696/
https://patchwork.ozlabs.org/patch/826695/

Usually RCU callbacks are simple, but for TC filters and actions they
are complex, because TC actions can be destroyed together with the TC
filter in a single callback. RCU callbacks are also invoked in BH
context and, without any locking, run in parallel with each other. All
of this contributed to these nasty bugs. It looks like the callbacks
bring us more trouble than benefit.

Alternatively, we could also:

a) Introduce a spinlock to serialize these RCU callbacks. But as I
said in commit 1697c4bb5245 ("net_sched: carefully handle
tcf_block_put()"), this is very hard to do because of tcf_chain_dump():
it would take a lot of work to make it possible, if it is possible
at all.

b) As suggested by Paul, defer the work to a workqueue and gain the
ability to hold the RTNL lock again without any performance impact.
However, this seems impossible too: as lockdep complains, flushing a
workqueue while holding the RTNL lock deadlocks, see the rcu_barrier()
in tcf_block_put().

Therefore, the simplest solution is probably to just get rid of these
RCU callbacks: they are not necessary at all, since the callers of
these call_rcu() invocations are all on slow paths and hold the RTNL
lock, so blocking is allowed in their contexts.

The downside is that this could hurt the performance of deleting TC
filters, but again, that is not a hot path, and I already batch the
synchronize_rcu() calls wherever needed.

Different filters have different data structures, so please also
see each patch for details.

Chris Mi (2):
  selftests: Introduce a new script to generate tc batch file
  selftests: Introduce a new test case to tc testsuite

Cong Wang (13):
  net_sched: remove RCU callbacks in basic filter
  net_sched: remove RCU callbacks in bpf filter
  net_sched: remove RCU callbacks in flower filter
  net_sched: remove RCU callbacks in matchall filter
  net_sched: remove RCU callbacks in cgroup filter
  net_sched: remove RCU callbacks in rsvp filter
  net_sched: remove RCU callbacks in flow filter
  net_sched: remove RCU callbacks in tcindex filter
  net_sched: remove RCU callbacks in u32 filter
  net_sched: remove RCU callbacks in fw filter
  net_sched: remove RCU callbacks in route filter
  net_sched: remove RCU callbacks in sample action
  net_sched: add rtnl assertion to tcf_exts_destroy()

 include/net/tc_act/tc_sample.h                     |  1 -
 net/sched/act_sample.c                             | 12 +---
 net/sched/cls_api.c                                |  1 +
 net/sched/cls_basic.c                              | 23 ++++----
 net/sched/cls_bpf.c                                | 22 ++++----
 net/sched/cls_cgroup.c                             | 18 +++---
 net/sched/cls_flow.c                               | 24 ++++----
 net/sched/cls_flower.c                             | 46 +++++-----------
 net/sched/cls_fw.c                                 | 18 +++---
 net/sched/cls_matchall.c                           |  9 +--
 net/sched/cls_route.c                              | 21 +++----
 net/sched/cls_rsvp.h                               | 13 +----
 net/sched/cls_tcindex.c                            | 35 +++++-------
 net/sched/cls_u32.c                                | 64 ++++++----------------
 .../tc-testing/tc-tests/filters/tests.json         | 23 +++++++-
 tools/testing/selftests/tc-testing/tdc.py          | 20 +++++--
 tools/testing/selftests/tc-testing/tdc_batch.py    | 62 +++++++++++++++++++++
 tools/testing/selftests/tc-testing/tdc_config.py   |  2 +
 18 files changed, 222 insertions(+), 192 deletions(-)
 create mode 100644 tools/testing/selftests/tc-testing/tdc_batch.py

-- 
2.13.0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [Patch net 01/15] net_sched: remove RCU callbacks in basic filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 02/15] net_sched: remove RCU callbacks in bpf filter Cong Wang
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang, Daniel Borkmann

Replace call_rcu() with synchronize_rcu(), except that in
basic_destroy() we have to use list_splice_init_rcu().

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_basic.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index d89ebafd2239..11bca8346e0f 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -24,7 +24,6 @@
 struct basic_head {
 	u32			hgenerator;
 	struct list_head	flist;
-	struct rcu_head		rcu;
 };
 
 struct basic_filter {
@@ -34,7 +33,6 @@ struct basic_filter {
 	struct tcf_result	res;
 	struct tcf_proto	*tp;
 	struct list_head	link;
-	struct rcu_head		rcu;
 };
 
 static int basic_classify(struct sk_buff *skb, const struct tcf_proto *tp,
@@ -82,10 +80,8 @@ static int basic_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void basic_delete_filter(struct rcu_head *head)
+static void basic_delete_filter(struct basic_filter *f)
 {
-	struct basic_filter *f = container_of(head, struct basic_filter, rcu);
-
 	tcf_exts_destroy(&f->exts);
 	tcf_em_tree_destroy(&f->ematches);
 	kfree(f);
@@ -95,13 +91,16 @@ static void basic_destroy(struct tcf_proto *tp)
 {
 	struct basic_head *head = rtnl_dereference(tp->root);
 	struct basic_filter *f, *n;
+	LIST_HEAD(local);
+
+	list_splice_init_rcu(&head->flist, &local, synchronize_rcu);
 
-	list_for_each_entry_safe(f, n, &head->flist, link) {
-		list_del_rcu(&f->link);
+	list_for_each_entry_safe(f, n, &local, link) {
+		list_del(&f->link);
 		tcf_unbind_filter(tp, &f->res);
-		call_rcu(&f->rcu, basic_delete_filter);
+		basic_delete_filter(f);
 	}
-	kfree_rcu(head, rcu);
+	kfree(head);
 }
 
 static int basic_delete(struct tcf_proto *tp, void *arg, bool *last)
@@ -111,7 +110,8 @@ static int basic_delete(struct tcf_proto *tp, void *arg, bool *last)
 
 	list_del_rcu(&f->link);
 	tcf_unbind_filter(tp, &f->res);
-	call_rcu(&f->rcu, basic_delete_filter);
+	synchronize_rcu();
+	basic_delete_filter(f);
 	*last = list_empty(&head->flist);
 	return 0;
 }
@@ -205,7 +205,8 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	if (fold) {
 		list_replace_rcu(&fold->link, &fnew->link);
 		tcf_unbind_filter(tp, &fold->res);
-		call_rcu(&fold->rcu, basic_delete_filter);
+		synchronize_rcu();
+		basic_delete_filter(fold);
 	} else {
 		list_add_rcu(&fnew->link, &head->flist);
 	}
-- 
2.13.0


* [Patch net 02/15] net_sched: remove RCU callbacks in bpf filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
  2017-10-23 22:02 ` [Patch net 01/15] net_sched: remove RCU callbacks in basic filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 03/15] net_sched: remove RCU callbacks in flower filter Cong Wang
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang, Daniel Borkmann

Replace call_rcu() with synchronize_rcu(), except that in
cls_bpf_destroy() we have to use list_splice_init_rcu().

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_bpf.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 520c5027646a..11571b6539f2 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -33,7 +33,6 @@ MODULE_DESCRIPTION("TC BPF based classifier");
 struct cls_bpf_head {
 	struct list_head plist;
 	u32 hgen;
-	struct rcu_head rcu;
 };
 
 struct cls_bpf_prog {
@@ -49,7 +48,6 @@ struct cls_bpf_prog {
 	struct sock_filter *bpf_ops;
 	const char *bpf_name;
 	struct tcf_proto *tp;
-	struct rcu_head rcu;
 };
 
 static const struct nla_policy bpf_policy[TCA_BPF_MAX + 1] = {
@@ -257,17 +255,11 @@ static void __cls_bpf_delete_prog(struct cls_bpf_prog *prog)
 	kfree(prog);
 }
 
-static void cls_bpf_delete_prog_rcu(struct rcu_head *rcu)
-{
-	__cls_bpf_delete_prog(container_of(rcu, struct cls_bpf_prog, rcu));
-}
-
 static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
 	cls_bpf_stop_offload(tp, prog);
 	list_del_rcu(&prog->link);
 	tcf_unbind_filter(tp, &prog->res);
-	call_rcu(&prog->rcu, cls_bpf_delete_prog_rcu);
 }
 
 static int cls_bpf_delete(struct tcf_proto *tp, void *arg, bool *last)
@@ -275,6 +267,8 @@ static int cls_bpf_delete(struct tcf_proto *tp, void *arg, bool *last)
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 
 	__cls_bpf_delete(tp, arg);
+	synchronize_rcu();
+	__cls_bpf_delete_prog((struct cls_bpf_prog *)arg);
 	*last = list_empty(&head->plist);
 	return 0;
 }
@@ -283,11 +277,16 @@ static void cls_bpf_destroy(struct tcf_proto *tp)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 	struct cls_bpf_prog *prog, *tmp;
+	LIST_HEAD(local);
 
-	list_for_each_entry_safe(prog, tmp, &head->plist, link)
+	list_splice_init_rcu(&head->plist, &local, synchronize_rcu);
+
+	list_for_each_entry_safe(prog, tmp, &local, link) {
 		__cls_bpf_delete(tp, prog);
+		__cls_bpf_delete_prog(prog);
+	}
 
-	kfree_rcu(head, rcu);
+	kfree(head);
 }
 
 static void *cls_bpf_get(struct tcf_proto *tp, u32 handle)
@@ -501,7 +500,8 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (oldprog) {
 		list_replace_rcu(&oldprog->link, &prog->link);
 		tcf_unbind_filter(tp, &oldprog->res);
-		call_rcu(&oldprog->rcu, cls_bpf_delete_prog_rcu);
+		synchronize_rcu();
+		__cls_bpf_delete_prog(oldprog);
 	} else {
 		list_add_rcu(&prog->link, &head->plist);
 	}
-- 
2.13.0


* [Patch net 03/15] net_sched: remove RCU callbacks in flower filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
  2017-10-23 22:02 ` [Patch net 01/15] net_sched: remove RCU callbacks in basic filter Cong Wang
  2017-10-23 22:02 ` [Patch net 02/15] net_sched: remove RCU callbacks in bpf filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 04/15] net_sched: remove RCU callbacks in matchall filter Cong Wang
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu(), except that in
fl_destroy() we have to use list_splice_init_rcu().

As a bonus, this also drops the ugly code from commit
d936377414fa ("net, sched: respect rcu grace period on cls destruction").

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_flower.c | 46 ++++++++++++++--------------------------------
 1 file changed, 14 insertions(+), 32 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index b480d7c792ba..ad33bd00b4f0 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -61,7 +61,6 @@ struct fl_flow_mask_range {
 struct fl_flow_mask {
 	struct fl_flow_key key;
 	struct fl_flow_mask_range range;
-	struct rcu_head	rcu;
 };
 
 struct cls_fl_head {
@@ -71,10 +70,6 @@ struct cls_fl_head {
 	bool mask_assigned;
 	struct list_head filters;
 	struct rhashtable_params ht_params;
-	union {
-		struct work_struct work;
-		struct rcu_head	rcu;
-	};
 	struct idr handle_idr;
 };
 
@@ -87,7 +82,6 @@ struct cls_fl_filter {
 	struct list_head list;
 	u32 handle;
 	u32 flags;
-	struct rcu_head	rcu;
 	struct net_device *hw_dev;
 };
 
@@ -215,10 +209,8 @@ static int fl_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void fl_destroy_filter(struct rcu_head *head)
+static void fl_destroy_filter(struct cls_fl_filter *f)
 {
-	struct cls_fl_filter *f = container_of(head, struct cls_fl_filter, rcu);
-
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -305,38 +297,25 @@ static void __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f)
 	if (!tc_skip_hw(f->flags))
 		fl_hw_destroy_filter(tp, f);
 	tcf_unbind_filter(tp, &f->res);
-	call_rcu(&f->rcu, fl_destroy_filter);
-}
-
-static void fl_destroy_sleepable(struct work_struct *work)
-{
-	struct cls_fl_head *head = container_of(work, struct cls_fl_head,
-						work);
-	if (head->mask_assigned)
-		rhashtable_destroy(&head->ht);
-	kfree(head);
-	module_put(THIS_MODULE);
-}
-
-static void fl_destroy_rcu(struct rcu_head *rcu)
-{
-	struct cls_fl_head *head = container_of(rcu, struct cls_fl_head, rcu);
-
-	INIT_WORK(&head->work, fl_destroy_sleepable);
-	schedule_work(&head->work);
 }
 
 static void fl_destroy(struct tcf_proto *tp)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct cls_fl_filter *f, *next;
+	LIST_HEAD(local);
+
+	list_splice_init_rcu(&head->filters, &local, synchronize_rcu);
 
-	list_for_each_entry_safe(f, next, &head->filters, list)
+	list_for_each_entry_safe(f, next, &local, list) {
 		__fl_delete(tp, f);
+		fl_destroy_filter(f);
+	}
 	idr_destroy(&head->handle_idr);
 
-	__module_get(THIS_MODULE);
-	call_rcu(&head->rcu, fl_destroy_rcu);
+	if (head->mask_assigned)
+		rhashtable_destroy(&head->ht);
+	kfree(head);
 }
 
 static void *fl_get(struct tcf_proto *tp, u32 handle)
@@ -975,7 +954,8 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 		idr_replace_ext(&head->handle_idr, fnew, fnew->handle);
 		list_replace_rcu(&fold->list, &fnew->list);
 		tcf_unbind_filter(tp, &fold->res);
-		call_rcu(&fold->rcu, fl_destroy_filter);
+		synchronize_rcu();
+		fl_destroy_filter(fold);
 	} else {
 		list_add_tail_rcu(&fnew->list, &head->filters);
 	}
@@ -1003,6 +983,8 @@ static int fl_delete(struct tcf_proto *tp, void *arg, bool *last)
 		rhashtable_remove_fast(&head->ht, &f->ht_node,
 				       head->ht_params);
 	__fl_delete(tp, f);
+	synchronize_rcu();
+	fl_destroy_filter(f);
 	*last = list_empty(&head->filters);
 	return 0;
 }
-- 
2.13.0


* [Patch net 04/15] net_sched: remove RCU callbacks in matchall filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (2 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 03/15] net_sched: remove RCU callbacks in flower filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 05/15] net_sched: remove RCU callbacks in cgroup filter Cong Wang
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu().

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_matchall.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index eeac606c95ab..459614cf4186 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -21,7 +21,6 @@ struct cls_mall_head {
 	struct tcf_result res;
 	u32 handle;
 	u32 flags;
-	struct rcu_head	rcu;
 };
 
 static int mall_classify(struct sk_buff *skb, const struct tcf_proto *tp,
@@ -41,11 +40,8 @@ static int mall_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void mall_destroy_rcu(struct rcu_head *rcu)
+static void __mall_destroy(struct cls_mall_head *head)
 {
-	struct cls_mall_head *head = container_of(rcu, struct cls_mall_head,
-						  rcu);
-
 	tcf_exts_destroy(&head->exts);
 	kfree(head);
 }
@@ -96,7 +92,8 @@ static void mall_destroy(struct tcf_proto *tp)
 	if (tc_should_offload(dev, head->flags))
 		mall_destroy_hw_filter(tp, head, (unsigned long) head);
 
-	call_rcu(&head->rcu, mall_destroy_rcu);
+	synchronize_rcu();
+	__mall_destroy(head);
 }
 
 static void *mall_get(struct tcf_proto *tp, u32 handle)
-- 
2.13.0


* [Patch net 05/15] net_sched: remove RCU callbacks in cgroup filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (3 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 04/15] net_sched: remove RCU callbacks in matchall filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 06/15] net_sched: remove RCU callbacks in rsvp filter Cong Wang
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu().

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_cgroup.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index d48452f87975..fb001799eca6 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -23,7 +23,6 @@ struct cls_cgroup_head {
 	struct tcf_exts		exts;
 	struct tcf_ematch_tree	ematches;
 	struct tcf_proto	*tp;
-	struct rcu_head		rcu;
 };
 
 static int cls_cgroup_classify(struct sk_buff *skb, const struct tcf_proto *tp,
@@ -57,11 +56,8 @@ static const struct nla_policy cgroup_policy[TCA_CGROUP_MAX + 1] = {
 	[TCA_CGROUP_EMATCHES]	= { .type = NLA_NESTED },
 };
 
-static void cls_cgroup_destroy_rcu(struct rcu_head *root)
+static void __cls_cgroup_destroy(struct cls_cgroup_head *head)
 {
-	struct cls_cgroup_head *head = container_of(root,
-						    struct cls_cgroup_head,
-						    rcu);
 
 	tcf_exts_destroy(&head->exts);
 	tcf_em_tree_destroy(&head->ematches);
@@ -110,8 +106,10 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 		goto errout;
 
 	rcu_assign_pointer(tp->root, new);
-	if (head)
-		call_rcu(&head->rcu, cls_cgroup_destroy_rcu);
+	if (head) {
+		synchronize_rcu();
+		__cls_cgroup_destroy(head);
+	}
 	return 0;
 errout:
 	tcf_exts_destroy(&new->exts);
@@ -124,8 +122,10 @@ static void cls_cgroup_destroy(struct tcf_proto *tp)
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 
 	/* Head can still be NULL due to cls_cgroup_init(). */
-	if (head)
-		call_rcu(&head->rcu, cls_cgroup_destroy_rcu);
+	if (head) {
+		synchronize_rcu();
+		__cls_cgroup_destroy(head);
+	}
 }
 
 static int cls_cgroup_delete(struct tcf_proto *tp, void *arg, bool *last)
-- 
2.13.0


* [Patch net 06/15] net_sched: remove RCU callbacks in rsvp filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (4 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 05/15] net_sched: remove RCU callbacks in cgroup filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 07/15] net_sched: remove RCU callbacks in flow filter Cong Wang
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu(). I leave the remaining
kfree_rcu() calls untouched, because they are not relevant here
and are more complicated.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_rsvp.h | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index b1f6ed48bc72..5013e9a58f30 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -97,7 +97,6 @@ struct rsvp_filter {
 
 	u32				handle;
 	struct rsvp_session		*sess;
-	struct rcu_head			rcu;
 };
 
 static inline unsigned int hash_dst(__be32 *dst, u8 protocol, u8 tunnelid)
@@ -282,14 +281,6 @@ static int rsvp_init(struct tcf_proto *tp)
 	return -ENOBUFS;
 }
 
-static void rsvp_delete_filter_rcu(struct rcu_head *head)
-{
-	struct rsvp_filter *f = container_of(head, struct rsvp_filter, rcu);
-
-	tcf_exts_destroy(&f->exts);
-	kfree(f);
-}
-
 static void rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
 {
 	tcf_unbind_filter(tp, &f->res);
@@ -297,7 +288,9 @@ static void rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
 	 * grace period, since converted-to-rcu actions are relying on that
 	 * in cleanup() callback
 	 */
-	call_rcu(&f->rcu, rsvp_delete_filter_rcu);
+	synchronize_rcu();
+	tcf_exts_destroy(&f->exts);
+	kfree(f);
 }
 
 static void rsvp_destroy(struct tcf_proto *tp)
-- 
2.13.0


* [Patch net 07/15] net_sched: remove RCU callbacks in flow filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (5 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 06/15] net_sched: remove RCU callbacks in rsvp filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 08/15] net_sched: remove RCU callbacks in tcindex filter Cong Wang
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang, Daniel Borkmann

Replace call_rcu() with synchronize_rcu(), except that in
flow_destroy() we have to use list_splice_init_rcu().

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_flow.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 2a3a60ec5b86..c33643289fa9 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -35,7 +35,6 @@
 
 struct flow_head {
 	struct list_head	filters;
-	struct rcu_head		rcu;
 };
 
 struct flow_filter {
@@ -57,7 +56,6 @@ struct flow_filter {
 	u32			divisor;
 	u32			baseclass;
 	u32			hashrnd;
-	struct rcu_head		rcu;
 };
 
 static inline u32 addr_fold(void *addr)
@@ -369,10 +367,8 @@ static const struct nla_policy flow_policy[TCA_FLOW_MAX + 1] = {
 	[TCA_FLOW_PERTURB]	= { .type = NLA_U32 },
 };
 
-static void flow_destroy_filter(struct rcu_head *head)
+static void flow_destroy_filter(struct flow_filter *f)
 {
-	struct flow_filter *f = container_of(head, struct flow_filter, rcu);
-
 	del_timer_sync(&f->perturb_timer);
 	tcf_exts_destroy(&f->exts);
 	tcf_em_tree_destroy(&f->ematches);
@@ -539,8 +535,10 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 
 	*arg = fnew;
 
-	if (fold)
-		call_rcu(&fold->rcu, flow_destroy_filter);
+	if (fold) {
+		synchronize_rcu();
+		flow_destroy_filter(fold);
+	}
 	return 0;
 
 err2:
@@ -557,7 +555,8 @@ static int flow_delete(struct tcf_proto *tp, void *arg, bool *last)
 	struct flow_filter *f = arg;
 
 	list_del_rcu(&f->list);
-	call_rcu(&f->rcu, flow_destroy_filter);
+	synchronize_rcu();
+	flow_destroy_filter(f);
 	*last = list_empty(&head->filters);
 	return 0;
 }
@@ -578,12 +577,15 @@ static void flow_destroy(struct tcf_proto *tp)
 {
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f, *next;
+	LIST_HEAD(local);
+
+	list_splice_init_rcu(&head->filters, &local, synchronize_rcu);
 
-	list_for_each_entry_safe(f, next, &head->filters, list) {
+	list_for_each_entry_safe(f, next, &local, list) {
 		list_del_rcu(&f->list);
-		call_rcu(&f->rcu, flow_destroy_filter);
+		flow_destroy_filter(f);
 	}
-	kfree_rcu(head, rcu);
+	kfree(head);
 }
 
 static void *flow_get(struct tcf_proto *tp, u32 handle)
-- 
2.13.0


* [Patch net 08/15] net_sched: remove RCU callbacks in tcindex filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (6 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 07/15] net_sched: remove RCU callbacks in flow filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 09/15] net_sched: remove RCU callbacks in u32 filter Cong Wang
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang, Daniel Borkmann

Replace call_rcu() with synchronize_rcu().

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_tcindex.c | 35 +++++++++++++----------------------
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index 14a7e08b2fa9..32aaa5e459d3 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -27,14 +27,12 @@
 struct tcindex_filter_result {
 	struct tcf_exts		exts;
 	struct tcf_result	res;
-	struct rcu_head		rcu;
 };
 
 struct tcindex_filter {
 	u16 key;
 	struct tcindex_filter_result result;
 	struct tcindex_filter __rcu *next;
-	struct rcu_head rcu;
 };
 
 
@@ -47,7 +45,6 @@ struct tcindex_data {
 	u32 hash;		/* hash table size; 0 if undefined */
 	u32 alloc_hash;		/* allocated size */
 	u32 fall_through;	/* 0: only classify if explicit match */
-	struct rcu_head rcu;
 };
 
 static inline int tcindex_filter_is_set(struct tcindex_filter_result *r)
@@ -133,19 +130,13 @@ static int tcindex_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void tcindex_destroy_rexts(struct rcu_head *head)
+static void tcindex_destroy_rexts(struct tcindex_filter_result *r)
 {
-	struct tcindex_filter_result *r;
-
-	r = container_of(head, struct tcindex_filter_result, rcu);
 	tcf_exts_destroy(&r->exts);
 }
 
-static void tcindex_destroy_fexts(struct rcu_head *head)
+static void tcindex_destroy_fexts(struct tcindex_filter *f)
 {
-	struct tcindex_filter *f = container_of(head, struct tcindex_filter,
-						rcu);
-
 	tcf_exts_destroy(&f->result.exts);
 	kfree(f);
 }
@@ -182,10 +173,11 @@ static int tcindex_delete(struct tcf_proto *tp, void *arg, bool *last)
 	 * grace period, since converted-to-rcu actions are relying on that
 	 * in cleanup() callback
 	 */
+	synchronize_rcu();
 	if (f)
-		call_rcu(&f->rcu, tcindex_destroy_fexts);
+		tcindex_destroy_fexts(f);
 	else
-		call_rcu(&r->rcu, tcindex_destroy_rexts);
+		tcindex_destroy_rexts(r);
 
 	*last = false;
 	return 0;
@@ -199,10 +191,8 @@ static int tcindex_destroy_element(struct tcf_proto *tp,
 	return tcindex_delete(tp, arg, &last);
 }
 
-static void __tcindex_destroy(struct rcu_head *head)
+static void __tcindex_destroy(struct tcindex_data *p)
 {
-	struct tcindex_data *p = container_of(head, struct tcindex_data, rcu);
-
 	kfree(p->perfect);
 	kfree(p->h);
 	kfree(p);
@@ -228,10 +218,8 @@ static int tcindex_filter_result_init(struct tcindex_filter_result *r)
 	return tcf_exts_init(&r->exts, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
 }
 
-static void __tcindex_partial_destroy(struct rcu_head *head)
+static void __tcindex_partial_destroy(struct tcindex_data *p)
 {
-	struct tcindex_data *p = container_of(head, struct tcindex_data, rcu);
-
 	kfree(p->perfect);
 	kfree(p);
 }
@@ -449,8 +437,10 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 		rcu_assign_pointer(*fp, f);
 	}
 
-	if (oldp)
-		call_rcu(&oldp->rcu, __tcindex_partial_destroy);
+	if (oldp) {
+		synchronize_rcu();
+		 __tcindex_partial_destroy(oldp);
+	}
 	return 0;
 
 errout_alloc:
@@ -540,7 +530,8 @@ static void tcindex_destroy(struct tcf_proto *tp)
 	walker.fn = tcindex_destroy_element;
 	tcindex_walk(tp, &walker);
 
-	call_rcu(&p->rcu, __tcindex_destroy);
+	synchronize_rcu();
+	__tcindex_destroy(p);
 }
 
 
-- 
2.13.0


* [Patch net 09/15] net_sched: remove RCU callbacks in u32 filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (7 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 08/15] net_sched: remove RCU callbacks in tcindex filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:02 ` [Patch net 10/15] net_sched: remove RCU callbacks in fw filter Cong Wang
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu().

Note, u32_clear_hnode() is special here: it deletes all elements
from a hashtable, and we definitely don't want to wait for a grace
period per element. Instead, we can unpublish the whole hashtable
from the upper layer first, so that after a single grace period the
whole hashtable, too, is safe to remove.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_u32.c | 64 ++++++++++++++---------------------------------------
 1 file changed, 17 insertions(+), 47 deletions(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 10b8d851fc6b..393c0aead78a 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -68,7 +68,6 @@ struct tc_u_knode {
 	u32 __percpu		*pcpu_success;
 #endif
 	struct tcf_proto	*tp;
-	struct rcu_head		rcu;
 	/* The 'sel' field MUST be the last field in structure to allow for
 	 * tc_u32_keys allocated at end of structure.
 	 */
@@ -82,7 +81,6 @@ struct tc_u_hnode {
 	struct tc_u_common	*tp_c;
 	int			refcnt;
 	unsigned int		divisor;
-	struct rcu_head		rcu;
 	/* The 'ht' field MUST be the last field in structure to allow for
 	 * more entries allocated at end of structure.
 	 */
@@ -95,7 +93,6 @@ struct tc_u_common {
 	int			refcnt;
 	u32			hgenerator;
 	struct hlist_node	hnode;
-	struct rcu_head		rcu;
 };
 
 static inline unsigned int u32_hash_fold(__be32 key,
@@ -410,35 +407,6 @@ static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
 	return 0;
 }
 
-/* u32_delete_key_rcu should be called when free'ing a copied
- * version of a tc_u_knode obtained from u32_init_knode(). When
- * copies are obtained from u32_init_knode() the statistics are
- * shared between the old and new copies to allow readers to
- * continue to update the statistics during the copy. To support
- * this the u32_delete_key_rcu variant does not free the percpu
- * statistics.
- */
-static void u32_delete_key_rcu(struct rcu_head *rcu)
-{
-	struct tc_u_knode *key = container_of(rcu, struct tc_u_knode, rcu);
-
-	u32_destroy_key(key->tp, key, false);
-}
-
-/* u32_delete_key_freepf_rcu is the rcu callback variant
- * that free's the entire structure including the statistics
- * percpu variables. Only use this if the key is not a copy
- * returned by u32_init_knode(). See u32_delete_key_rcu()
- * for the variant that should be used with keys return from
- * u32_init_knode()
- */
-static void u32_delete_key_freepf_rcu(struct rcu_head *rcu)
-{
-	struct tc_u_knode *key = container_of(rcu, struct tc_u_knode, rcu);
-
-	u32_destroy_key(key->tp, key, true);
-}
-
 static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
 {
 	struct tc_u_knode __rcu **kp;
@@ -453,7 +421,8 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
 				RCU_INIT_POINTER(*kp, key->next);
 
 				tcf_unbind_filter(tp, &key->res);
-				call_rcu(&key->rcu, u32_delete_key_freepf_rcu);
+				synchronize_rcu();
+				u32_destroy_key(key->tp, key, true);
 				return 0;
 			}
 		}
@@ -565,7 +534,7 @@ static void u32_clear_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 					 rtnl_dereference(n->next));
 			tcf_unbind_filter(tp, &n->res);
 			u32_remove_hw_knode(tp, n->handle);
-			call_rcu(&n->rcu, u32_delete_key_freepf_rcu);
+			u32_destroy_key(n->tp, n, true);
 		}
 	}
 }
@@ -578,8 +547,6 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 
 	WARN_ON(ht->refcnt);
 
-	u32_clear_hnode(tp, ht);
-
 	hn = &tp_c->hlist;
 	for (phn = rtnl_dereference(*hn);
 	     phn;
@@ -587,7 +554,10 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht)
 		if (phn == ht) {
 			u32_clear_hw_hnode(tp, ht);
 			RCU_INIT_POINTER(*hn, ht->next);
-			kfree_rcu(ht, rcu);
+			synchronize_rcu();
+
+			u32_clear_hnode(tp, ht);
+			kfree(ht);
 			return 0;
 		}
 	}
@@ -617,20 +587,19 @@ static void u32_destroy(struct tcf_proto *tp)
 		u32_destroy_hnode(tp, root_ht);
 
 	if (--tp_c->refcnt == 0) {
-		struct tc_u_hnode *ht;
+		struct tc_u_hnode *ht, *tmp;
 
 		hlist_del(&tp_c->hnode);
 
-		for (ht = rtnl_dereference(tp_c->hlist);
-		     ht;
-		     ht = rtnl_dereference(ht->next)) {
+		tmp = rtnl_dereference(tp_c->hlist);
+		RCU_INIT_POINTER(tp_c->hlist, NULL);
+		synchronize_rcu();
+
+		while ((ht = tmp) != NULL) {
+			tmp = rtnl_dereference(ht->next);
 			ht->refcnt--;
 			u32_clear_hnode(tp, ht);
-		}
-
-		while ((ht = rtnl_dereference(tp_c->hlist)) != NULL) {
-			RCU_INIT_POINTER(tp_c->hlist, ht->next);
-			kfree_rcu(ht, rcu);
+			kfree(ht);
 		}
 
 		kfree(tp_c);
@@ -926,7 +895,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 
 		u32_replace_knode(tp, tp_c, new);
 		tcf_unbind_filter(tp, &n->res);
-		call_rcu(&n->rcu, u32_delete_key_rcu);
+		synchronize_rcu();
+		u32_destroy_key(n->tp, n, false);
 		return 0;
 	}
 
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Patch net 10/15] net_sched: remove RCU callbacks in fw filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (8 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 09/15] net_sched: remove RCU callbacks in u32 filter Cong Wang
@ 2017-10-23 22:02 ` Cong Wang
  2017-10-23 22:03 ` [Patch net 11/15] net_sched: remove RCU callbacks in route filter Cong Wang
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:02 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu().

Similarly, fw_destroy() is special here: because it deletes
all elements from a hashtable, we definitely don't want to
wait for each element. Instead, we can unpublish the whole
hashtable from the upper layer first, so that after one
grace period the whole hashtable is safe to remove too.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_fw.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 941245ad07fd..2e7171d3da6d 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -46,7 +46,6 @@ struct fw_filter {
 #endif /* CONFIG_NET_CLS_IND */
 	struct tcf_exts		exts;
 	struct tcf_proto	*tp;
-	struct rcu_head		rcu;
 };
 
 static u32 fw_hash(u32 handle)
@@ -119,10 +118,8 @@ static int fw_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void fw_delete_filter(struct rcu_head *head)
+static void fw_delete_filter(struct fw_filter *f)
 {
-	struct fw_filter *f = container_of(head, struct fw_filter, rcu);
-
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -136,15 +133,18 @@ static void fw_destroy(struct tcf_proto *tp)
 	if (head == NULL)
 		return;
 
+	RCU_INIT_POINTER(tp->root, NULL);
+	synchronize_rcu();
+
 	for (h = 0; h < HTSIZE; h++) {
 		while ((f = rtnl_dereference(head->ht[h])) != NULL) {
 			RCU_INIT_POINTER(head->ht[h],
 					 rtnl_dereference(f->next));
 			tcf_unbind_filter(tp, &f->res);
-			call_rcu(&f->rcu, fw_delete_filter);
+			fw_delete_filter(f);
 		}
 	}
-	kfree_rcu(head, rcu);
+	kfree(head);
 }
 
 static int fw_delete(struct tcf_proto *tp, void *arg, bool *last)
@@ -166,7 +166,8 @@ static int fw_delete(struct tcf_proto *tp, void *arg, bool *last)
 		if (pfp == f) {
 			RCU_INIT_POINTER(*fp, rtnl_dereference(f->next));
 			tcf_unbind_filter(tp, &f->res);
-			call_rcu(&f->rcu, fw_delete_filter);
+			synchronize_rcu();
+			fw_delete_filter(f);
 			ret = 0;
 			break;
 		}
@@ -286,7 +287,8 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 		RCU_INIT_POINTER(fnew->next, rtnl_dereference(pfp->next));
 		rcu_assign_pointer(*fp, fnew);
 		tcf_unbind_filter(tp, &f->res);
-		call_rcu(&f->rcu, fw_delete_filter);
+		synchronize_rcu();
+		fw_delete_filter(f);
 
 		*arg = fnew;
 		return err;
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Patch net 11/15] net_sched: remove RCU callbacks in route filter
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (9 preceding siblings ...)
  2017-10-23 22:02 ` [Patch net 10/15] net_sched: remove RCU callbacks in fw filter Cong Wang
@ 2017-10-23 22:03 ` Cong Wang
  2017-10-23 22:03 ` [Patch net 12/15] net_sched: remove RCU callbacks in sample action Cong Wang
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:03 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang,
	Daniel Borkmann, Jiri Pirko

Replace call_rcu() with synchronize_rcu().

Similarly, route4_destroy() is special here: because it deletes
all elements from a hashtable, we definitely don't want to
wait for each element. Instead, we can unpublish the whole
hashtable from the upper layer first, so that after one
grace period the whole hashtable is safe to remove too.

Reported-by: Chris Mi <chrism@mellanox.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_route.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index 9ddde65915d2..b9b77baac1d2 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -38,7 +38,6 @@ struct route4_fastmap {
 struct route4_head {
 	struct route4_fastmap		fastmap[16];
 	struct route4_bucket __rcu	*table[256 + 1];
-	struct rcu_head			rcu;
 };
 
 struct route4_bucket {
@@ -57,7 +56,6 @@ struct route4_filter {
 	u32			handle;
 	struct route4_bucket	*bkt;
 	struct tcf_proto	*tp;
-	struct rcu_head		rcu;
 };
 
 #define ROUTE4_FAILURE ((struct route4_filter *)(-1L))
@@ -254,10 +252,8 @@ static int route4_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void route4_delete_filter(struct rcu_head *head)
+static void route4_delete_filter(struct route4_filter *f)
 {
-	struct route4_filter *f = container_of(head, struct route4_filter, rcu);
-
 	tcf_exts_destroy(&f->exts);
 	kfree(f);
 }
@@ -270,6 +266,9 @@ static void route4_destroy(struct tcf_proto *tp)
 	if (head == NULL)
 		return;
 
+	RCU_INIT_POINTER(tp->root, NULL);
+	synchronize_rcu();
+
 	for (h1 = 0; h1 <= 256; h1++) {
 		struct route4_bucket *b;
 
@@ -284,14 +283,14 @@ static void route4_destroy(struct tcf_proto *tp)
 					next = rtnl_dereference(f->next);
 					RCU_INIT_POINTER(b->ht[h2], next);
 					tcf_unbind_filter(tp, &f->res);
-					call_rcu(&f->rcu, route4_delete_filter);
+					route4_delete_filter(f);
 				}
 			}
 			RCU_INIT_POINTER(head->table[h1], NULL);
-			kfree_rcu(b, rcu);
+			kfree(b);
 		}
 	}
-	kfree_rcu(head, rcu);
+	kfree(head);
 }
 
 static int route4_delete(struct tcf_proto *tp, void *arg, bool *last)
@@ -325,7 +324,8 @@ static int route4_delete(struct tcf_proto *tp, void *arg, bool *last)
 
 			/* Delete it */
 			tcf_unbind_filter(tp, &f->res);
-			call_rcu(&f->rcu, route4_delete_filter);
+			synchronize_rcu();
+			route4_delete_filter(f);
 
 			/* Strip RTNL protected tree */
 			for (i = 0; i <= 32; i++) {
@@ -528,7 +528,8 @@ static int route4_change(struct net *net, struct sk_buff *in_skb,
 	*arg = f;
 	if (fold) {
 		tcf_unbind_filter(tp, &fold->res);
-		call_rcu(&fold->rcu, route4_delete_filter);
+		synchronize_rcu();
+		route4_delete_filter(fold);
 	}
 	return 0;
 
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Patch net 12/15] net_sched: remove RCU callbacks in sample action
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (10 preceding siblings ...)
  2017-10-23 22:03 ` [Patch net 11/15] net_sched: remove RCU callbacks in route filter Cong Wang
@ 2017-10-23 22:03 ` Cong Wang
  2017-10-23 22:03 ` [Patch net 13/15] net_sched: add rtnl assertion to tcf_exts_destroy() Cong Wang
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:03 UTC (permalink / raw)
  To: netdev
  Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang, Yotam Gigi,
	Jiri Pirko

The ->cleanup() callback can now block, so it is okay to
change it to synchronize_rcu(). This could also fix a
use-after-free on module unload.

Cc: Yotam Gigi <yotamg@mellanox.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/tc_act/tc_sample.h |  1 -
 net/sched/act_sample.c         | 12 +++---------
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/include/net/tc_act/tc_sample.h b/include/net/tc_act/tc_sample.h
index 89e9305be880..eb44804ff5f0 100644
--- a/include/net/tc_act/tc_sample.h
+++ b/include/net/tc_act/tc_sample.h
@@ -13,7 +13,6 @@ struct tcf_sample {
 	struct psample_group __rcu *psample_group;
 	u32 psample_group_num;
 	struct list_head tcfm_list;
-	struct rcu_head rcu;
 };
 #define to_sample(a) ((struct tcf_sample *)a)
 
diff --git a/net/sched/act_sample.c b/net/sched/act_sample.c
index ec986ae52808..b8863e3f6776 100644
--- a/net/sched/act_sample.c
+++ b/net/sched/act_sample.c
@@ -96,23 +96,17 @@ static int tcf_sample_init(struct net *net, struct nlattr *nla,
 	return ret;
 }
 
-static void tcf_sample_cleanup_rcu(struct rcu_head *rcu)
+static void tcf_sample_cleanup(struct tc_action *a, int bind)
 {
-	struct tcf_sample *s = container_of(rcu, struct tcf_sample, rcu);
+	struct tcf_sample *s = to_sample(a);
 	struct psample_group *psample_group;
 
+	synchronize_rcu();
 	psample_group = rcu_dereference_protected(s->psample_group, 1);
 	RCU_INIT_POINTER(s->psample_group, NULL);
 	psample_group_put(psample_group);
 }
 
-static void tcf_sample_cleanup(struct tc_action *a, int bind)
-{
-	struct tcf_sample *s = to_sample(a);
-
-	call_rcu(&s->rcu, tcf_sample_cleanup_rcu);
-}
-
 static bool tcf_sample_dev_ok_push(struct net_device *dev)
 {
 	switch (dev->type) {
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Patch net 13/15] net_sched: add rtnl assertion to tcf_exts_destroy()
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (11 preceding siblings ...)
  2017-10-23 22:03 ` [Patch net 12/15] net_sched: remove RCU callbacks in sample action Cong Wang
@ 2017-10-23 22:03 ` Cong Wang
  2017-10-23 22:03 ` [Patch net 14/15] selftests: Introduce a new script to generate tc batch file Cong Wang
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:03 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang, Jiri Pirko

After the previous patches, it is now safe to claim that
tcf_exts_destroy() is always called with the RTNL lock held.

Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_api.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0b2219adf520..c48482942995 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -879,6 +879,7 @@ void tcf_exts_destroy(struct tcf_exts *exts)
 #ifdef CONFIG_NET_CLS_ACT
 	LIST_HEAD(actions);
 
+	ASSERT_RTNL();
 	tcf_exts_to_list(exts, &actions);
 	tcf_action_destroy(&actions, TCA_ACT_UNBIND);
 	kfree(exts->actions);
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Patch net 14/15] selftests: Introduce a new script to generate tc batch file
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (12 preceding siblings ...)
  2017-10-23 22:03 ` [Patch net 13/15] net_sched: add rtnl assertion to tcf_exts_destroy() Cong Wang
@ 2017-10-23 22:03 ` Cong Wang
  2017-10-24  4:56   ` Chris Mi
  2017-10-23 22:03 ` [Patch net 15/15] selftests: Introduce a new test case to tc testsuite Cong Wang
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:03 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang

From: Chris Mi <chrism@mellanox.com>

  # ./tdc_batch.py -h
  usage: tdc_batch.py [-h] [-n NUMBER] [-o] [-s] [-p] device file

  TC batch file generator

  positional arguments:
    device                device name
    file                  batch file name

  optional arguments:
    -h, --help            show this help message and exit
    -n NUMBER, --number NUMBER
                          how many lines in batch file
    -o, --skip_sw         skip_sw (offload), by default skip_hw
    -s, --share_action    all filters share the same action
    -p, --prio            all filters have different prio
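
The core of the generator enumerates unique MAC address pairs and
emits one flower filter command per line. A condensed sketch of the
same loop (the helper name `batch_lines` is hypothetical, mirroring
the script added by the diff below):

```python
def batch_lines(device, number, skip="skip_hw", prio="prio 1", share=""):
    """Yield up to `number` tc filter commands, one unique MAC pair each."""
    emitted = 0
    for i in range(0x100):
        for j in range(0x100):
            for k in range(0x100):
                mac = "%02x:%02x:%02x" % (i, j, k)
                yield ("filter add dev %s %s protocol ip parent ffff: "
                       "flower %s src_mac e4:11:00:%s dst_mac e4:12:00:%s "
                       "action drop %s"
                       % (device, prio, skip, mac, mac, share))
                emitted += 1
                if emitted >= number:
                    return
```

Three nested 0x100 loops give 2^24 possible MAC pairs, far more than
the 1M lines the test case in this series needs.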

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 tools/testing/selftests/tc-testing/tdc_batch.py | 62 +++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 tools/testing/selftests/tc-testing/tdc_batch.py

diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py b/tools/testing/selftests/tc-testing/tdc_batch.py
new file mode 100644
index 000000000000..707c6bfef689
--- /dev/null
+++ b/tools/testing/selftests/tc-testing/tdc_batch.py
@@ -0,0 +1,62 @@
+#!/usr/bin/python3
+
+"""
+tdc_batch.py - a script to generate TC batch file
+
+Copyright (C) 2017 Chris Mi <chrism@mellanox.com>
+"""
+
+import argparse
+
+parser = argparse.ArgumentParser(description='TC batch file generator')
+parser.add_argument("device", help="device name")
+parser.add_argument("file", help="batch file name")
+parser.add_argument("-n", "--number", type=int,
+                    help="how many lines in batch file")
+parser.add_argument("-o", "--skip_sw",
+                    help="skip_sw (offload), by default skip_hw",
+                    action="store_true")
+parser.add_argument("-s", "--share_action",
+                    help="all filters share the same action",
+                    action="store_true")
+parser.add_argument("-p", "--prio",
+                    help="all filters have different prio",
+                    action="store_true")
+args = parser.parse_args()
+
+device = args.device
+file = open(args.file, 'w')
+
+number = 1
+if args.number:
+    number = args.number
+
+skip = "skip_hw"
+if args.skip_sw:
+    skip = "skip_sw"
+
+share_action = ""
+if args.share_action:
+    share_action = "index 1"
+
+prio = "prio 1"
+if args.prio:
+    prio = ""
+    if number > 0x4000:
+        number = 0x4000
+
+index = 0
+for i in range(0x100):
+    for j in range(0x100):
+        for k in range(0x100):
+            mac = ("%02x:%02x:%02x" % (i, j, k))
+            src_mac = "e4:11:00:" + mac
+            dst_mac = "e4:12:00:" + mac
+            cmd = ("filter add dev %s %s protocol ip parent ffff: flower %s "
+                   "src_mac %s dst_mac %s action drop %s" %
+                   (device, prio, skip, src_mac, dst_mac, share_action))
+            file.write("%s\n" % cmd)
+            index += 1
+            if index >= number:
+                file.close()
+                exit(0)
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [Patch net 15/15] selftests: Introduce a new test case to tc testsuite
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (13 preceding siblings ...)
  2017-10-23 22:03 ` [Patch net 14/15] selftests: Introduce a new script to generate tc batch file Cong Wang
@ 2017-10-23 22:03 ` Cong Wang
  2017-10-23 23:16 ` [Patch net 00/15] net_sched: remove RCU callbacks from TC Eric Dumazet
  2017-10-25  1:43 ` David Miller
  16 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-23 22:03 UTC (permalink / raw)
  To: netdev; +Cc: paulmck, jhs, john.fastabend, Chris Mi, Cong Wang

From: Chris Mi <chrism@mellanox.com>

In this patchset, we fixed a tc bug. This patch adds a test case
that reproduces the bug. To run this test case, the user should
specify an existing NIC device:
  # sudo ./tdc.py -d enp4s0f0

This test case belongs to the "flower" category. If the user
doesn't specify a NIC device, the test cases belonging to
"flower" will not be run.

In this test case, we create 1M filters, all sharing the same
action. When destroying all filters, the kernel should not panic.
It takes about 18s to run.

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Lucas Bates <lucasb@mojatatu.com>
Signed-off-by: Chris Mi <chrism@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 .../tc-testing/tc-tests/filters/tests.json         | 23 +++++++++++++++++++++-
 tools/testing/selftests/tc-testing/tdc.py          | 20 +++++++++++++++----
 tools/testing/selftests/tc-testing/tdc_config.py   |  2 ++
 3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/filters/tests.json b/tools/testing/selftests/tc-testing/tc-tests/filters/tests.json
index c727b96a59b0..5fa02d86b35f 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/filters/tests.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/filters/tests.json
@@ -17,5 +17,26 @@
         "teardown": [
             "$TC qdisc del dev $DEV1 ingress"
         ]
+    },
+    {
+        "id": "d052",
+        "name": "Add 1M filters with the same action",
+        "category": [
+            "filter",
+            "flower"
+        ],
+        "setup": [
+            "$TC qdisc add dev $DEV2 ingress",
+            "./tdc_batch.py $DEV2 $BATCH_FILE --share_action -n 1000000"
+        ],
+        "cmdUnderTest": "$TC -b $BATCH_FILE",
+        "expExitCode": "0",
+        "verifyCmd": "$TC actions list action gact",
+        "matchPattern": "action order 0: gact action drop.*index 1 ref 1000000 bind 1000000",
+        "matchCount": "1",
+        "teardown": [
+            "$TC qdisc del dev $DEV2 ingress",
+            "/bin/rm $BATCH_FILE"
+        ]
     }
-]
\ No newline at end of file
+]
diff --git a/tools/testing/selftests/tc-testing/tdc.py b/tools/testing/selftests/tc-testing/tdc.py
index cd61b7844c0d..5f11f5d7456e 100755
--- a/tools/testing/selftests/tc-testing/tdc.py
+++ b/tools/testing/selftests/tc-testing/tdc.py
@@ -88,7 +88,7 @@ USE_NS = True
             exit(1)
 
 
-def test_runner(filtered_tests):
+def test_runner(filtered_tests, args):
     """
     Driver function for the unit tests.
 
@@ -105,6 +105,8 @@ USE_NS = True
     for tidx in testlist:
         result = True
         tresult = ""
+        if "flower" in tidx["category"] and args.device == None:
+            continue
         print("Test " + tidx["id"] + ": " + tidx["name"])
         prepare_env(tidx["setup"])
         (p, procout) = exec_cmd(tidx["cmdUnderTest"])
@@ -152,6 +154,10 @@ USE_NS = True
         exec_cmd(cmd, False)
         cmd = 'ip -s $NS link set $DEV1 up'
         exec_cmd(cmd, False)
+        cmd = 'ip link set $DEV2 netns $NS'
+        exec_cmd(cmd, False)
+        cmd = 'ip -s $NS link set $DEV2 up'
+        exec_cmd(cmd, False)
 
 
 def ns_destroy():
@@ -211,7 +217,8 @@ USE_NS = True
                         help='Execute the single test case with specified ID')
     parser.add_argument('-i', '--id', action='store_true', dest='gen_id',
                         help='Generate ID numbers for new test cases')
-    return parser
+    parser.add_argument('-d', '--device',
+                        help='Execute the test case in flower category')
     return parser
 
 
@@ -225,6 +232,8 @@ USE_NS = True
 
     if args.path != None:
          NAMES['TC'] = args.path
+    if args.device != None:
+         NAMES['DEV2'] = args.device
     if not os.path.isfile(NAMES['TC']):
         print("The specified tc path " + NAMES['TC'] + " does not exist.")
         exit(1)
@@ -381,14 +390,17 @@ USE_NS = True
             if (len(alltests) == 0):
                 print("Cannot find a test case with ID matching " + target_id)
                 exit(1)
-        catresults = test_runner(alltests)
+        catresults = test_runner(alltests, args)
         print("All test results: " + "\n\n" + catresults)
     elif (len(target_category) > 0):
+        if (target_category == "flower") and args.device == None:
+            print("Please specify a NIC device (-d) to run category flower")
+            exit(1)
         if (target_category not in ucat):
             print("Specified category is not present in this file.")
             exit(1)
         else:
-            catresults = test_runner(testcases[target_category])
+            catresults = test_runner(testcases[target_category], args)
             print("Category " + target_category + "\n\n" + catresults)
 
     ns_destroy()
diff --git a/tools/testing/selftests/tc-testing/tdc_config.py b/tools/testing/selftests/tc-testing/tdc_config.py
index 01087375a7c3..b6352515c1b5 100644
--- a/tools/testing/selftests/tc-testing/tdc_config.py
+++ b/tools/testing/selftests/tc-testing/tdc_config.py
@@ -12,6 +12,8 @@ NAMES = {
           # Name of veth devices to be created for the namespace
           'DEV0': 'v0p0',
           'DEV1': 'v0p1',
+          'DEV2': '',
+          'BATCH_FILE': './batch.txt',
           # Name of the namespace to use
           'NS': 'tcut'
         }
-- 
2.13.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (14 preceding siblings ...)
  2017-10-23 22:03 ` [Patch net 15/15] selftests: Introduce a new test case to tc testsuite Cong Wang
@ 2017-10-23 23:16 ` Eric Dumazet
  2017-10-23 23:23   ` Cong Wang
  2017-10-25  1:43 ` David Miller
  16 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2017-10-23 23:16 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev, paulmck, jhs, john.fastabend, Chris Mi

On Mon, 2017-10-23 at 15:02 -0700, Cong Wang wrote:

> b) As suggested by Paul, we could defer the work to a workqueue and
> gain the permission of holding RTNL again without any performance
> impact, however, this seems impossible too, because as lockdep
complains we have a deadlock when flushing workqueue while holding
> RTNL lock, see the rcu_barrier() in tcf_block_put().
> 
> Therefore, the simplest solution we have is probably just getting
> rid of these RCU callbacks, because they are not necessary at all,
> callers of these call_rcu() are all on slow paths and have RTNL
> lock, so blocking is allowed in their contexts.

I am against these pessimistic changes, sorry for not following past
discussions last week.

I am asking a talk during upcoming netdev/netconf about this, if we need
to take a decision.

RTNL is already a big problem for many of us, adding synchronize_rcu()
calls while holding RTNL is a no-go, unless we have clear evidence it
can not be avoided.

Thanks !

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-23 23:16 ` [Patch net 00/15] net_sched: remove RCU callbacks from TC Eric Dumazet
@ 2017-10-23 23:23   ` Cong Wang
  2017-10-23 23:31     ` Eric Dumazet
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Wang @ 2017-10-23 23:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Network Developers, Paul E. McKenney,
	Jamal Hadi Salim, John Fastabend, Chris Mi

On Mon, Oct 23, 2017 at 4:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-10-23 at 15:02 -0700, Cong Wang wrote:
>
>> b) As suggested by Paul, we could defer the work to a workqueue and
>> gain the permission of holding RTNL again without any performance
>> impact, however, this seems impossible too, because as lockdep
>> complains we have a deadlock when flushing workqueue while holding
>> RTNL lock, see the rcu_barrier() in tcf_block_put().
>>
>> Therefore, the simplest solution we have is probably just getting
>> rid of these RCU callbacks, because they are not necessary at all,
>> callers of these call_rcu() are all on slow paths and have RTNL
>> lock, so blocking is allowed in their contexts.
>
> I am against these pessimistic changes, sorry for not following past
> discussions last week.

So even Cc'ing you doesn't work. :-D

>
> I am asking a talk during upcoming netdev/netconf about this, if we need
> to take a decision.

I won't be able to make it.

>
> RTNL is already a big problem for many of us, adding synchronize_rcu()
> calls while holding RTNL is a no-go, unless we have clear evidence it
> can not be avoided.

You omitted too much, including the evidence I provide. In short it is very
hard to do, otherwise I should have already done it. I am very open to
any simple solution to avoid it, but so far none...

Saying no without giving a possible solution does not help anything
here.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-23 23:23   ` Cong Wang
@ 2017-10-23 23:31     ` Eric Dumazet
  2017-10-25 20:46       ` Cong Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Eric Dumazet @ 2017-10-23 23:31 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Paul E. McKenney,
	Jamal Hadi Salim, John Fastabend, Chris Mi

On Mon, 2017-10-23 at 16:23 -0700, Cong Wang wrote:
> On Mon, Oct 23, 2017 at 4:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2017-10-23 at 15:02 -0700, Cong Wang wrote:
> >
> >> b) As suggested by Paul, we could defer the work to a workqueue and
> >> gain the permission of holding RTNL again without any performance
> >> impact, however, this seems impossible too, because as lockdep
> >> complains we have a deadlock when flushing workqueue while holding
> >> RTNL lock, see the rcu_barrier() in tcf_block_put().
> >>
> >> Therefore, the simplest solution we have is probably just getting
> >> rid of these RCU callbacks, because they are not necessary at all,
> >> callers of these call_rcu() are all on slow paths and have RTNL
> >> lock, so blocking is allowed in their contexts.
> >
> > I am against these pessimistic changes, sorry for not following past
> > discussions last week.
> 
> So even Cc'ing you doesn't work. :-D

Nope. At the end of the day, there are only 24 hours per day.

> 
> >
> > I am asking a talk during upcoming netdev/netconf about this, if we need
> > to take a decision.
> 
> I won't be able to make it.
> 
> >
> > RTNL is already a big problem for many of us, adding synchronize_rcu()
> > calls while holding RTNL is a no-go, unless we have clear evidence it
> > can not be avoided.
> 
> You omitted too much, including the evidence I provide. In short it is very
> hard to do, otherwise I should have already done it. I am very open to
> any simple solution to avoid it, but so far none...
> 
> Saying no but without giving a possible solution does not help anything
> here.

I did not pretend to give a bug fix, I simply said your patch series was
probably not the right way.

Sure, we could add back BKL and solve a lot of lockdep issues.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [Patch net 14/15] selftests: Introduce a new script to generate tc batch file
  2017-10-23 22:03 ` [Patch net 14/15] selftests: Introduce a new script to generate tc batch file Cong Wang
@ 2017-10-24  4:56   ` Chris Mi
  0 siblings, 0 replies; 27+ messages in thread
From: Chris Mi @ 2017-10-24  4:56 UTC (permalink / raw)
  To: Cong Wang, netdev; +Cc: paulmck, jhs, john.fastabend

> -----Original Message-----
> From: Cong Wang [mailto:xiyou.wangcong@gmail.com]
> Sent: Tuesday, October 24, 2017 6:03 AM
> To: netdev@vger.kernel.org
> Cc: paulmck@linux.vnet.ibm.com; jhs@mojatatu.com;
> john.fastabend@gmail.com; Chris Mi <chrism@mellanox.com>; Cong Wang
> <xiyou.wangcong@gmail.com>
> Subject: [Patch net 14/15] selftests: Introduce a new script to generate tc
> batch file
> 
> From: Chris Mi <chrism@mellanox.com>
> 
>   # ./tdc_batch.py -h
>   usage: tdc_batch.py [-h] [-n NUMBER] [-o] [-s] [-p] device file
> 
>   TC batch file generator
> 
>   positional arguments:
>     device                device name
>     file                  batch file name
> 
>   optional arguments:
>     -h, --help            show this help message and exit
>     -n NUMBER, --number NUMBER
>                           how many lines in batch file
>     -o, --skip_sw         skip_sw (offload), by default skip_hw
>     -s, --share_action    all filters share the same action
>     -p, --prio            all filters have different prio
> 
> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
> Acked-by: Lucas Bates <lucasb@mojatatu.com>
> Signed-off-by: Chris Mi <chrism@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  tools/testing/selftests/tc-testing/tdc_batch.py | 62 +++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
>  create mode 100644 tools/testing/selftests/tc-testing/tdc_batch.py
> 
> diff --git a/tools/testing/selftests/tc-testing/tdc_batch.py b/tools/testing/selftests/tc-testing/tdc_batch.py
> new file mode 100644
File mode should be 755.
> index 000000000000..707c6bfef689
> --- /dev/null
> +++ b/tools/testing/selftests/tc-testing/tdc_batch.py
> @@ -0,0 +1,62 @@
> +#!/usr/bin/python3
> +
> +"""
> +tdc_batch.py - a script to generate TC batch file
> +
> +Copyright (C) 2017 Chris Mi <chrism@mellanox.com>
> +"""
> +
> +import argparse
> +
> +parser = argparse.ArgumentParser(description='TC batch file generator')
> +parser.add_argument("device", help="device name")
> +parser.add_argument("file", help="batch file name")
> +parser.add_argument("-n", "--number", type=int,
> +                    help="how many lines in batch file")
> +parser.add_argument("-o", "--skip_sw",
> +                    help="skip_sw (offload), by default skip_hw",
> +                    action="store_true")
> +parser.add_argument("-s", "--share_action",
> +                    help="all filters share the same action",
> +                    action="store_true")
> +parser.add_argument("-p", "--prio",
> +                    help="all filters have different prio",
> +                    action="store_true")
> +args = parser.parse_args()
> +
> +device = args.device
> +file = open(args.file, 'w')
> +
> +number = 1
> +if args.number:
> +    number = args.number
> +
> +skip = "skip_hw"
> +if args.skip_sw:
> +    skip = "skip_sw"
> +
> +share_action = ""
> +if args.share_action:
> +    share_action = "index 1"
> +
> +prio = "prio 1"
> +if args.prio:
> +    prio = ""
> +    if number > 0x4000:
> +        number = 0x4000
> +
> +index = 0
> +for i in range(0x100):
> +    for j in range(0x100):
> +        for k in range(0x100):
> +            mac = ("%02x:%02x:%02x" % (i, j, k))
> +            src_mac = "e4:11:00:" + mac
> +            dst_mac = "e4:12:00:" + mac
> +            cmd = ("filter add dev %s %s protocol ip parent ffff: flower %s "
> +                   "src_mac %s dst_mac %s action drop %s" %
> +                   (device, prio, skip, src_mac, dst_mac, share_action))
> +            file.write("%s\n" % cmd)
> +            index += 1
> +            if index >= number:
> +                file.close()
> +                exit(0)
> --
> 2.13.0
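For reference, here is a minimal standalone Python sketch (not part of the patch; the device name "eth0" is an assumption) that mirrors the script's formatting logic and shows the first command such a batch file contains under the defaults (prio 1, skip_hw, no shared action):

```python
# Reconstruct the first flower command tdc_batch.py emits with default options.
# "eth0" is a placeholder device name chosen for illustration.
device = "eth0"
prio = "prio 1"          # default: all filters use prio 1
skip = "skip_hw"         # default: skip_hw (no -o flag)
share_action = ""        # default: no shared action index

i = j = k = 0            # first iteration of the three nested MAC loops
mac = "%02x:%02x:%02x" % (i, j, k)
src_mac = "e4:11:00:" + mac
dst_mac = "e4:12:00:" + mac
cmd = ("filter add dev %s %s protocol ip parent ffff: flower %s "
       "src_mac %s dst_mac %s action drop %s" %
       (device, prio, skip, src_mac, dst_mac, share_action))
print(cmd)
```

The generated file is then meant to be fed to tc in batch mode, which is what makes the filter-insertion stress test fast enough to be usable.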


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
                   ` (15 preceding siblings ...)
  2017-10-23 23:16 ` [Patch net 00/15] net_sched: remove RCU callbacks from TC Eric Dumazet
@ 2017-10-25  1:43 ` David Miller
  2017-10-25 22:37   ` Cong Wang
  16 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2017-10-25  1:43 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, paulmck, jhs, john.fastabend, chrism

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon, 23 Oct 2017 15:02:49 -0700

> Recently, the RCU callbacks used in TC filters and TC actions keep
> drawing my attention, they introduce at least 4 race condition bugs:

Like Eric, I think doing a full RCU sync on every delete is too big
a pill to swallow.  This is a major control plane performance
regression.

Please find another reasonable way to fix this.

Thank you.


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-23 23:31     ` Eric Dumazet
@ 2017-10-25 20:46       ` Cong Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-25 20:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Linux Kernel Network Developers, Paul E. McKenney,
	Jamal Hadi Salim, John Fastabend, Chris Mi

On Mon, Oct 23, 2017 at 4:31 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I did not pretend to give a bug fix, I simply said your patch series was
> probably not the right way.

Generally I agree with you on avoiding synchronize_rcu(), but this case
is very special; you need to consider it case by case, not just talk
generally.


>
> Sure, we could add back BKL and solve a lot of lockdep issues.
>
>

For the record, this case is about race conditions which lead to
real bugs, not merely lockdep warnings.


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-25  1:43 ` David Miller
@ 2017-10-25 22:37   ` Cong Wang
  2017-10-26  0:19     ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Wang @ 2017-10-25 22:37 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Kernel Network Developers, Paul E. McKenney,
	Jamal Hadi Salim, John Fastabend, Chris Mi

On Tue, Oct 24, 2017 at 6:43 PM, David Miller <davem@davemloft.net> wrote:
> From: Cong Wang <xiyou.wangcong@gmail.com>
> Date: Mon, 23 Oct 2017 15:02:49 -0700
>
>> Recently, the RCU callbacks used in TC filters and TC actions keep
>> drawing my attention, they introduce at least 4 race condition bugs:
>
> Like Eric, I think doing a full RCU sync on every delete is too big
> a pill to swallow.  This is a major control plane performance
> regression.
>
> Please find another reasonable way to fix this.
>

Alright... I finally found a way to make everyone happy.

My solution is introducing a workqueue for tc filters
and let each RCU callback defer the work to this
workqueue. I solve the flush_workqueue() deadlock
by queuing another work in the same workqueue
at the end, so the execution order should be as same
as it is now. The ugly part is now tcf_block_put() which
looks like below:


static void tcf_block_put_final(struct work_struct *work)
{
        struct tcf_block *block = container_of(work, struct tcf_block, work);
        struct tcf_chain *chain, *tmp;

        /* At this point, all the chains should have refcnt == 1. */
        rtnl_lock();
        list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
                tcf_chain_put(chain);
        rtnl_unlock();
        kfree(block);
}

static void tcf_block_put_deferred(struct work_struct *work)
{
        struct tcf_block *block = container_of(work, struct tcf_block, work);
        struct tcf_chain *chain;

        rtnl_lock();
        /* Hold a refcnt for all chains, except 0, in case they are gone. */
        list_for_each_entry(chain, &block->chain_list, list)
                if (chain->index)
                        tcf_chain_hold(chain);

        /* No race on the list, because no chain could be destroyed. */
        list_for_each_entry(chain, &block->chain_list, list)
                tcf_chain_flush(chain);

        INIT_WORK(&block->work, tcf_block_put_final);
        /* Wait for RCU callbacks to release the reference count and make
         * sure their works have been queued before this.
         */
        rcu_barrier();
        tcf_queue_work(&block->work);
        rtnl_unlock();
}

void tcf_block_put(struct tcf_block *block)
{
        if (!block)
                return;

        INIT_WORK(&block->work, tcf_block_put_deferred);
        /* Wait for existing RCU callbacks to cool down, make sure their works
         * have been queued before this. We can not flush pending works here
         * because we are holding the RTNL lock.
         */
        rcu_barrier();
        tcf_queue_work(&block->work);
}


Paul, does this make any sense to you? ;)
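As an aside, the ordering claim above (the finalizer queued last on the same ordered, single-worker queue runs only after every work already queued by the RCU callbacks) can be sketched in plain userspace Python; the queue and worker thread are illustrative stand-ins, not the kernel workqueue API:

```python
import queue
import threading

work_q = queue.Queue()   # stands in for the ordered tc filter workqueue
log = []

def worker():
    # A single worker drains the queue in FIFO order, one item at a time,
    # which is what guarantees the final work runs last.
    while True:
        fn = work_q.get()
        if fn is None:
            break
        fn()

t = threading.Thread(target=worker)
t.start()

# Each RCU callback would queue a deferred destroy work:
for i in range(3):
    work_q.put(lambda i=i: log.append("destroy filter %d" % i))

# tcf_block_put_deferred() queues the finalizer *after* them on the same
# queue, so it cannot run before any previously queued destroy work.
work_q.put(lambda: log.append("free block"))

work_q.put(None)  # stop the worker
t.join()
print(log)
```

This only models the FIFO property; in the kernel, rcu_barrier() is still needed beforehand to guarantee the callbacks have queued their works at all.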


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-25 22:37   ` Cong Wang
@ 2017-10-26  0:19     ` Paul E. McKenney
  2017-10-26  4:49       ` Cong Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2017-10-26  0:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Jamal Hadi Salim,
	John Fastabend, Chris Mi

On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> On Tue, Oct 24, 2017 at 6:43 PM, David Miller <davem@davemloft.net> wrote:
> > From: Cong Wang <xiyou.wangcong@gmail.com>
> > Date: Mon, 23 Oct 2017 15:02:49 -0700
> >
> >> Recently, the RCU callbacks used in TC filters and TC actions keep
> >> drawing my attention, they introduce at least 4 race condition bugs:
> >
> > Like Eric, I think doing a full RCU sync on every delete is too big
> > a pill to swallow.  This is a major control plane performance
> > regression.
> >
> > Please find another reasonable way to fix this.
> >
> 
> Alright... I finally found a way to make everyone happy.
> 
> My solution is introducing a workqueue for tc filters
> and let each RCU callback defer the work to this
> workqueue. I solve the flush_workqueue() deadlock
> by queuing another work in the same workqueue
> at the end, so the execution order should be as same
> as it is now. The ugly part is now tcf_block_put() which
> looks like below:
> 
> 
> static void tcf_block_put_final(struct work_struct *work)
> {
>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>         struct tcf_chain *chain, *tmp;
> 
>         /* At this point, all the chains should have refcnt == 1. */
>         rtnl_lock();
>         list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
>                 tcf_chain_put(chain);
>         rtnl_unlock();
>         kfree(block);
> }

I am guessing that tcf_chain_put() sometimes does a call_rcu(),
and the callback function in turn calls schedule_work(), and that
tcf_block_put_deferred() is the workqueue handler function.

> static void tcf_block_put_deferred(struct work_struct *work)
> {
>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>         struct tcf_chain *chain;
> 
>         rtnl_lock();
>         /* Hold a refcnt for all chains, except 0, in case they are gone. */
>         list_for_each_entry(chain, &block->chain_list, list)
>                 if (chain->index)
>                         tcf_chain_hold(chain);
> 
>         /* No race on the list, because no chain could be destroyed. */
>         list_for_each_entry(chain, &block->chain_list, list)
>                 tcf_chain_flush(chain);
> 
>         INIT_WORK(&block->work, tcf_block_put_final);
>         /* Wait for RCU callbacks to release the reference count and make
>          * sure their works have been queued before this.
>          */
>         rcu_barrier();

This one can take awhile...  Though in recent kernels it will often
be a bit faster than synchronize_rcu().

Note that rcu_barrier() only waits for callbacks posted via call_rcu()
before the rcu_barrier() starts, if that matters.

>         tcf_queue_work(&block->work);
>         rtnl_unlock();
> }

And it looks like tcf_block_put_deferred() queues itself as work as well.
Or maybe instead?

> void tcf_block_put(struct tcf_block *block)
> {
>         if (!block)
>                 return;
> 
>         INIT_WORK(&block->work, tcf_block_put_deferred);
>         /* Wait for existing RCU callbacks to cool down, make sure their works
>          * have been queued before this. We can not flush pending works here
>          * because we are holding the RTNL lock.
>          */
>         rcu_barrier();
>         tcf_queue_work(&block->work);
> }
> 
> 
> Paul, does this make any sense to you? ;)

I would be surprised if I fully understand the problem to be solved,
but my current guess is that the constraints are as follows:

1.	Things removed must be processed in order.

2.	Things removed must not be processed until a grace period
	has elapsed.

3.	Things being processed after a grace period should not be
	processed concurrently with each other or with subsequent
	removals.

4.	A given removal is not finalized until its reference count
	reaches zero.

5.	RTNL might not be held when the reference count reaches zero.

Or did I lose the thread somewhere?

							Thanx, Paul


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-26  0:19     ` Paul E. McKenney
@ 2017-10-26  4:49       ` Cong Wang
  2017-10-26 13:58         ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Cong Wang @ 2017-10-26  4:49 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: David Miller, Linux Kernel Network Developers, Jamal Hadi Salim,
	John Fastabend, Chris Mi

On Wed, Oct 25, 2017 at 5:19 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
>> My solution is introducing a workqueue for tc filters
>> and let each RCU callback defer the work to this
>> workqueue. I solve the flush_workqueue() deadlock
>> by queuing another work in the same workqueue
>> at the end, so the execution order should be as same
>> as it is now. The ugly part is now tcf_block_put() which
>> looks like below:
>>
>>
>> static void tcf_block_put_final(struct work_struct *work)
>> {
>>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>>         struct tcf_chain *chain, *tmp;
>>
>>         /* At this point, all the chains should have refcnt == 1. */
>>         rtnl_lock();
>>         list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
>>                 tcf_chain_put(chain);
>>         rtnl_unlock();
>>         kfree(block);
>> }
>
> I am guessing that tcf_chain_put() sometimes does a call_rcu(),
> and the callback function in turn calls schedule_work(), and that
> tcf_block_put_deferred() is the workqueue handler function.


Yes, tcf_chain_put() triggers call_rcu() indirectly during flush,
this is why we have rcu_barrier()'s in the current code base.


>
>> static void tcf_block_put_deferred(struct work_struct *work)
>> {
>>         struct tcf_block *block = container_of(work, struct tcf_block, work);
>>         struct tcf_chain *chain;
>>
>>         rtnl_lock();
>>         /* Hold a refcnt for all chains, except 0, in case they are gone. */
>>         list_for_each_entry(chain, &block->chain_list, list)
>>                 if (chain->index)
>>                         tcf_chain_hold(chain);
>>
>>         /* No race on the list, because no chain could be destroyed. */
>>         list_for_each_entry(chain, &block->chain_list, list)
>>                 tcf_chain_flush(chain);
>>
>>         INIT_WORK(&block->work, tcf_block_put_final);
>>         /* Wait for RCU callbacks to release the reference count and make
>>          * sure their works have been queued before this.
>>          */
>>         rcu_barrier();
>
> This one can take awhile...  Though in recent kernels it will often
> be a bit faster than synchronize_rcu().
>

It is already in the current code base, so it is not introduced here.


> Note that rcu_barrier() only waits for callbacks posted via call_rcu()
> before the rcu_barrier() starts, if that matters.
>

Yes, this is exactly what I expect.


>>         tcf_queue_work(&block->work);
>>         rtnl_unlock();
>> }
>
> And it looks like tcf_block_put_deferred() queues itself as work as well.
> Or maybe instead?

Yes, it queues itself after all the works queued via call_rcu(),
to ensure it is the last.


>
>> void tcf_block_put(struct tcf_block *block)
>> {
>>         if (!block)
>>                 return;
>>
>>         INIT_WORK(&block->work, tcf_block_put_deferred);
>>         /* Wait for existing RCU callbacks to cool down, make sure their works
>>          * have been queued before this. We can not flush pending works here
>>          * because we are holding the RTNL lock.
>>          */
>>         rcu_barrier();
>>         tcf_queue_work(&block->work);
>> }
>>
>>
>> Paul, does this make any sense to you? ;)
>
> > I would be surprised if I fully understand the problem to be solved,
> but my current guess is that the constraints are as follows:
>
> 1.      Things removed must be processed in order.
>

Sort of. RCU callbacks themselves don't have any order, they only
need to be serialized with the RTNL lock, so we have to defer them to a
workqueue.

What needs a strict order is tcf_block_put() vs. flying works. Before
tcf_block_put() finishes, all the flying works must be done otherwise
use-after-free. :-/



> 2.      Things removed must not be processed until a grace period
>         has elapsed.
>

For these RCU callbacks yes, but for tcf_block_put() no, it uses
refcnt not RCU.


> 3.      Things being processed after a grace period should not be
>         processed concurrently with each other or with subsequent
>         removals.


Yeah, this is the cause of the problem. They have to be serialized
with each other and with other tc action paths (holding RTNL) too.


>
> 4.      A given removal is not finalized until its reference count
>         reaches zero.

There are two kinds of removal here:

1) Filter removal. This is not reference counted, it needs RCU grace
period.

2) Filter chain removal. This is refcount'ed and filters hold its refcnt,
the actions attached to filters could hold its refcnt too.

So when we remove a filter chain, it flushes all the filters and actions
attached, so RCU callbacks will be flying; tcf_block_put() needs to wait
for the refcnt to hit zero, especially the references held by those
actions, before freeing it.
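A toy userspace model of the refcounting described above (all names are hypothetical; a plain deferred list stands in for the in-flight RCU callbacks and works, and no grace periods are modeled):

```python
class Chain:
    """Filter chain: refcounted; the block itself holds one reference."""
    def __init__(self):
        self.refcnt = 1
        self.freed = False

def chain_hold(c):
    c.refcnt += 1

def chain_put(c):
    c.refcnt -= 1
    if c.refcnt == 0:
        c.freed = True   # only the last put frees the chain

deferred = []            # stands in for flying RCU callbacks / works
chain = Chain()

# Two filters take references on their chain (attached actions can too):
chain_hold(chain)
chain_hold(chain)

# Filter removal defers each refcnt release rather than dropping it inline:
deferred.append(lambda: chain_put(chain))
deferred.append(lambda: chain_put(chain))

# tcf_block_put() must let every deferred release run first ...
for work in deferred:
    work()

# ... and only then drop the block's own reference, which frees the chain.
assert chain.refcnt == 1
chain_put(chain)
print(chain.freed)
```

The point of the sketch is the ordering: freeing before the deferred puts have run would be the use-after-free being discussed.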


>
> 5.      RTNL might not be held when the reference count reaches zero.
>

Without my patch, no, RTNL is not held in RCU callbacks where actions
are destroyed and releasing refcnt.

With my patch, RTNL will always be held: RCU callbacks will merely do
schedule_work(), and RTNL is held in each work.


> Or did I lose the thread somewhere?
>

I don't think you missed anything.

Thanks!


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-26  4:49       ` Cong Wang
@ 2017-10-26 13:58         ` Paul E. McKenney
  2017-10-26 19:10           ` Cong Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2017-10-26 13:58 UTC (permalink / raw)
  To: Cong Wang
  Cc: David Miller, Linux Kernel Network Developers, Jamal Hadi Salim,
	John Fastabend, Chris Mi

On Wed, Oct 25, 2017 at 09:49:09PM -0700, Cong Wang wrote:
> On Wed, Oct 25, 2017 at 5:19 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> >> My solution is introducing a workqueue for tc filters
> >> and let each RCU callback defer the work to this
> >> workqueue. I solve the flush_workqueue() deadlock
> >> by queuing another work in the same workqueue
> >> at the end, so the execution order should be as same
> >> as it is now. The ugly part is now tcf_block_put() which
> >> looks like below:
> >>
> >>
> >> static void tcf_block_put_final(struct work_struct *work)
> >> {
> >>         struct tcf_block *block = container_of(work, struct tcf_block, work);
> >>         struct tcf_chain *chain, *tmp;
> >>
> >>         /* At this point, all the chains should have refcnt == 1. */
> >>         rtnl_lock();
> >>         list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> >>                 tcf_chain_put(chain);
> >>         rtnl_unlock();
> >>         kfree(block);
> >> }
> >
> > I am guessing that tcf_chain_put() sometimes does a call_rcu(),
> > and the callback function in turn calls schedule_work(), and that
> > tcf_block_put_deferred() is the workqueue handler function.
> 
> Yes, tcf_chain_put() triggers call_rcu() indirectly during flush,
> this is why we have rcu_barrier()'s in the current code base.

OK, got it.

> >> static void tcf_block_put_deferred(struct work_struct *work)
> >> {
> >>         struct tcf_block *block = container_of(work, struct tcf_block, work);
> >>         struct tcf_chain *chain;
> >>
> >>         rtnl_lock();
> >>         /* Hold a refcnt for all chains, except 0, in case they are gone. */
> >>         list_for_each_entry(chain, &block->chain_list, list)
> >>                 if (chain->index)
> >>                         tcf_chain_hold(chain);
> >>
> >>         /* No race on the list, because no chain could be destroyed. */
> >>         list_for_each_entry(chain, &block->chain_list, list)
> >>                 tcf_chain_flush(chain);
> >>
> >>         INIT_WORK(&block->work, tcf_block_put_final);
> >>         /* Wait for RCU callbacks to release the reference count and make
> >>          * sure their works have been queued before this.
> >>          */
> >>         rcu_barrier();
> >
> > This one can take awhile...  Though in recent kernels it will often
> > be a bit faster than synchronize_rcu().
> 
> It is already in the current code base, so it is not introduced here.

Very good, then no problems with added overhead from rcu_barrier().  ;-)

> > Note that rcu_barrier() only waits for callbacks posted via call_rcu()
> > before the rcu_barrier() starts, if that matters.
> 
> Yes, this is exactly what I expect.

Good.

> >>         tcf_queue_work(&block->work);
> >>         rtnl_unlock();
> >> }
> >
> > And it looks like tcf_block_put_deferred() queues itself as work as well.
> > Or maybe instead?
> 
> Yes, it queues itself after all the works queued via call_rcu(),
> to ensure it is the last.

OK.

> >> void tcf_block_put(struct tcf_block *block)
> >> {
> >>         if (!block)
> >>                 return;
> >>
> >>         INIT_WORK(&block->work, tcf_block_put_deferred);
> >>         /* Wait for existing RCU callbacks to cool down, make sure their works
> >>          * have been queued before this. We can not flush pending works here
> >>          * because we are holding the RTNL lock.
> >>          */
> >>         rcu_barrier();
> >>         tcf_queue_work(&block->work);
> >> }
> >>
> >>
> >> Paul, does this make any sense to you? ;)
> >
> > I would be surprised if I fully understand the problem to be solved,
> > but my current guess is that the constraints are as follows:
> >
> > 1.      Things removed must be processed in order.
> 
> Sort of. RCU callbacks themselves don't have any order, they only
> need to be serialized with the RTNL lock, so we have to defer them to a
> workqueue.

OK, got it.

> What needs a strict order is tcf_block_put() vs. flying works. Before
> tcf_block_put() finishes, all the flying works must be done otherwise
> use-after-free. :-/

So when removing an entire chain, you flush any queued workqueue handlers
to make sure that any operations using elements on that chain have also
completed, correct?  This might also motivate the rcu_barrier() calls.

> > 2.      Things removed must not be processed until a grace period
> >         has elapsed.
> 
> For these RCU callbacks yes, but for tcf_block_put() no, it uses
> refcnt not RCU.

OK, got it, give or take.

> > 3.      Things being processed after a grace period should not be
> >         processed concurrently with each other or with subsequent
> >         removals.
> 
> Yeah, this is the cause of the problem. They have to be serialized
> with each other and with other tc action paths (holding RTNL) too.

OK, makes sense.

> > 4.      A given removal is not finalized until its reference count
> >         reaches zero.
> 
> There are two kinds of removal here:
> 
> 1) Filter removal. This is not reference counted, it needs RCU grace
> period.
> 
> 2) Filter chain removal. This is refcount'ed and filters hold its refcnt,
> the actions attached to filters could hold its refcnt too
> 
> So when we remove a filter chain, it flushes all the filters and actions
> attached, so RCU callbacks will be flying; tcf_block_put() needs to wait
> for the refcnt to hit zero, especially the references held by those
> actions, before freeing it.

Very good, and I wasn't aware of the distinction between individual
filters and full filter chains before, thank you!

> > 5.      RTNL might not be held when the reference count reaches zero.
> 
> Without my patch, no, RTNL is not held in RCU callbacks where actions
> are destroyed and releasing refcnt.
> 
> With my patch, RTNL will always be held: RCU callbacks will merely do
> schedule_work(), and RTNL is held in each work.

OK, got it.

> > Or did I lose the thread somewhere?
> 
> I don't think you missed anything.

Seems to me that your proposed patch is at least worth a try, then.

							Thanx, Paul


* Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC
  2017-10-26 13:58         ` Paul E. McKenney
@ 2017-10-26 19:10           ` Cong Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Cong Wang @ 2017-10-26 19:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: David Miller, Linux Kernel Network Developers, Jamal Hadi Salim,
	John Fastabend, Chris Mi

On Thu, Oct 26, 2017 at 6:58 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> So when removing an entire chain, you flush any queued workqueue handlers
> to make sure that any operations using elements on that chain have also
> completed, correct?  This might also motivate the rcu_barrier() calls.

Yes, we can only free it after all pending RCU callbacks and works finish,
because they are in the same chain and we are freeing the head of the
chain in tcf_block_put().

>
> Seems to me that your proposed patch is at least worth a try, then.
>

Thank you! Then it makes some basic sense. I did a quick test and
didn't see any crash or lockdep warning.


end of thread, other threads:[~2017-10-26 19:10 UTC | newest]

Thread overview: 27+ messages
2017-10-23 22:02 [Patch net 00/15] net_sched: remove RCU callbacks from TC Cong Wang
2017-10-23 22:02 ` [Patch net 01/15] net_sched: remove RCU callbacks in basic filter Cong Wang
2017-10-23 22:02 ` [Patch net 02/15] net_sched: remove RCU callbacks in bpf filter Cong Wang
2017-10-23 22:02 ` [Patch net 03/15] net_sched: remove RCU callbacks in flower filter Cong Wang
2017-10-23 22:02 ` [Patch net 04/15] net_sched: remove RCU callbacks in matchall filter Cong Wang
2017-10-23 22:02 ` [Patch net 05/15] net_sched: remove RCU callbacks in cgroup filter Cong Wang
2017-10-23 22:02 ` [Patch net 06/15] net_sched: remove RCU callbacks in rsvp filter Cong Wang
2017-10-23 22:02 ` [Patch net 07/15] net_sched: remove RCU callbacks in flow filter Cong Wang
2017-10-23 22:02 ` [Patch net 08/15] net_sched: remove RCU callbacks in tcindex filter Cong Wang
2017-10-23 22:02 ` [Patch net 09/15] net_sched: remove RCU callbacks in u32 filter Cong Wang
2017-10-23 22:02 ` [Patch net 10/15] net_sched: remove RCU callbacks in fw filter Cong Wang
2017-10-23 22:03 ` [Patch net 11/15] net_sched: remove RCU callbacks in route filter Cong Wang
2017-10-23 22:03 ` [Patch net 12/15] net_sched: remove RCU callbacks in sample action Cong Wang
2017-10-23 22:03 ` [Patch net 13/15] net_sched: add rtnl assertion to tcf_exts_destroy() Cong Wang
2017-10-23 22:03 ` [Patch net 14/15] selftests: Introduce a new script to generate tc batch file Cong Wang
2017-10-24  4:56   ` Chris Mi
2017-10-23 22:03 ` [Patch net 15/15] selftests: Introduce a new test case to tc testsuite Cong Wang
2017-10-23 23:16 ` [Patch net 00/15] net_sched: remove RCU callbacks from TC Eric Dumazet
2017-10-23 23:23   ` Cong Wang
2017-10-23 23:31     ` Eric Dumazet
2017-10-25 20:46       ` Cong Wang
2017-10-25  1:43 ` David Miller
2017-10-25 22:37   ` Cong Wang
2017-10-26  0:19     ` Paul E. McKenney
2017-10-26  4:49       ` Cong Wang
2017-10-26 13:58         ` Paul E. McKenney
2017-10-26 19:10           ` Cong Wang
