Netdev Archive on lore.kernel.org
 help / Atom feed
* [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock
@ 2019-02-11  8:55 Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 01/17] net: sched: protect block state with mutex Vlad Buslov
                   ` (17 more replies)
  0 siblings, 18 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Currently, all netlink protocol handlers for updating rules, actions and
qdiscs are protected with single global rtnl lock which removes any
possibility for parallelism. This patch set is a third step to remove
rtnl lock dependency from TC rules update path.

Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
Handlers registered with this flag are called without RTNL taken. End
goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
etc.) to be registered with UNLOCKED flag to allow parallel execution.
However, there is no intention to completely remove or split rtnl lock
itself. This patch set addresses specific problems in implementation of
classifiers API that prevent its control path from being executed
concurrently, and completes refactoring of cls API rules update handlers
by removing rtnl lock dependency from code that handles chains and
classifiers. Rules update handlers are registered with
RTNL_FLAG_DOIT_UNLOCKED flag.

This patch set substitutes global rtnl lock dependency on rules update
path in cls API by extending its data structures with following locks:
- tcf_block with 'lock' mutex. It is used to protect block state and
  life-time management fields of chains on the block (chain->refcnt,
  chain->action_refcnt, chain->explicitly crated, etc.).
- tcf_chain with 'filter_chain_lock' mutex, that is used to protect list
  of classifier instances attached to chain. chain0->filter_chain_lock
  serializes calls to head change callbacks and allows them to rely on
  filter_chain_lock for serialization instead of rtnl lock.
- tcf_proto with 'lock' spinlock that is intended to be used to
  synchronize access to classifiers that support unlocked execution.

Classifiers are extended with reference counting to accommodate parallel
access by unlocked cls API. Classifier ops structure is extended with
additional 'put' function to allow reference counting of filters and
intended to be used by classifiers that implement rtnl-unlocked API.
Users of classifiers and individual filter instances are modified to
always hold reference while working with them.

Classifiers that support unlocked execution still need to know the
status of rtnl lock, so their API is extended with additional
'rtnl_held' argument that is used to indicate that caller holds rtnl
lock. Cls API propagates rtnl lock status across its helper functions
and passes it to classifier.

Changes from V3 to V4:
- Patch 1:
  - Extract code that manages chain 'explicitly_created' flag into
    standalone patch.
- Patch 2 - new.

Changes from V2 to V3:
- Change block->lock and chain->filter_chain_lock type to mutex. This
  removes the need for async miniqp refactoring and allows calling
  sleeping functions while holding the block->lock and
  chain->filter_chain_lock locks.
- Previous patch 1 - async miniqp is no longer needed, remove the patch.
- Patch 1:
  - Change block->lock type to mutex.
  - Implement tcf_block_destroy() helper function that destroys
    block->lock mutex before deallocating the block.
  - Revert GFP_KERNEL->GFP_ATOMIC memory allocation flags of tcf_chain
    which is no longer needed after block->lock type change.
- Patch 6:
  - Change chain->filter_chain_lock type to mutex.
  - Assume chain0->filter_chain_lock synchronizations instead of rtnl
    lock in mini_qdisc_pair_swap() function that is called from head
    change callback of ingress Qdisc. With filter_chain_lock type
    changed to mutex it is now possible to call sleeping function while
    holding it, so it is now used instead of async implementation from
    previous versions of this patch set.
- Patch 7:
  - Add local tp_next var to tcf_chain_flush() and use it to store
    tp->next pointer dereferenced with rcu_dereference_protected() to
    satisfy kbuild test robot.
  - Reset tp pointer to NULL at the beginning of tc_new_tfilter() to
    prevent its uninitialized usage in error handling code. This code
    was already implemented in patch 10, but must be in patch 8 to
    preserve code bisectability.
  - Put parent chain in tcf_proto_destroy(). In previous version this
    code was implemented in patch 1 which was removed in V3.

Changes from V1 to V2:
- Patch 1:
  - Use smp_store_release() instead of xchg() for setting
    miniqp->tp_head.
  - Move Qdisc deallocation to tc_proto_wq ordered workqueue that is
    used to destroy tcf proto instances. This is necessary to ensure
    that Qdisc is destroyed after all instances of chain/proto that it
    contains in order to prevent use-after-free error in
    tc_chain_notify_delete().
  - Cache parent net device ifindex in block to prevent use-after-free
    of dev queue in tc_chain_notify_delete().
- Patch 2:
  - Use lockdep_assert_held() instead of spin_is_locked() for assertion.
  - Use 'free_block' argument in tcf_chain_destroy() instead of checking
    block's reference count and chain_list for second time.
- Patch 7:
  - Refactor tcf_chain0_head_change_cb_add() to not take block->lock and
    chain0->filter_chain_lock in correct order.
- Patch 10:
  - Always set 'tp_created' flag when creating tp to prevent releasing
    the chain twice when tp with same priority was inserted
    concurrently.
- Patch 11:
  - Add additional check to prevent creation of new proto instance when
    parent chain is being flushed to reduce CPU usage.
  - Don't call tcf_chain_delete_empty() if tp insertion failed.
- Patch 16 - new.
- Patch 17:
  - Refactor to only lock take rtnl lock once (at the beginning of rule
    update handlers).
  - Always release rtnl mutex in the same function that locked it.
    Remove unlock code from tcf_block_release().

Github:
[https://github.com/vbuslov/linux/tree/cls_api_rtnl_lock_remove_for_cong_v4]

Vlad Buslov (17):
  net: sched: protect block state with mutex
  net: sched: protect chain->explicitly_created with block->lock
  net: sched: refactor tc_ctl_chain() to use block->lock
  net: sched: protect block->chain0 with block->lock
  net: sched: traverse chains in block with tcf_get_next_chain()
  net: sched: protect chain template accesses with block lock
  net: sched: protect filter_chain list with filter_chain_lock mutex
  net: sched: introduce reference counting for tcf_proto
  net: sched: traverse classifiers in chain with tcf_get_next_proto()
  net: sched: refactor tp insert/delete for concurrent execution
  net: sched: prevent insertion of new classifiers during chain flush
  net: sched: track rtnl lock status when validating extensions
  net: sched: extend proto ops with 'put' callback
  net: sched: extend proto ops to support unlocked classifiers
  net: sched: add flags to Qdisc class ops struct
  net: sched: refactor tcf_block_find() into standalone functions
  net: sched: unlock rules update API

 include/net/pkt_cls.h     |    6 +-
 include/net/sch_generic.h |   68 ++-
 net/sched/cls_api.c       | 1214 ++++++++++++++++++++++++++++++++++-----------
 net/sched/cls_basic.c     |   14 +-
 net/sched/cls_bpf.c       |   15 +-
 net/sched/cls_cgroup.c    |   13 +-
 net/sched/cls_flow.c      |   15 +-
 net/sched/cls_flower.c    |   16 +-
 net/sched/cls_fw.c        |   15 +-
 net/sched/cls_matchall.c  |   16 +-
 net/sched/cls_route.c     |   14 +-
 net/sched/cls_rsvp.h      |   16 +-
 net/sched/cls_tcindex.c   |   17 +-
 net/sched/cls_u32.c       |   14 +-
 net/sched/sch_api.c       |   10 +-
 net/sched/sch_generic.c   |    6 +-
 16 files changed, 1103 insertions(+), 366 deletions(-)

-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 01/17] net: sched: protect block state with mutex
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11 14:15   ` Jiri Pirko
  2019-02-11  8:55 ` [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock Vlad Buslov
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Currently, tcf_block doesn't use any synchronization mechanisms to protect
critical sections that manage lifetime of its chains. block->chain_list and
multiple variables in tcf_chain that control its lifetime assume external
synchronization provided by global rtnl lock. Converting chain reference
counting to atomic reference counters is not possible because cls API uses
multiple counters and flags to control chain lifetime, so all of them must
be synchronized in chain get/put code.

Use single per-block lock to protect block data and manage lifetime of all
chains on the block. Always take block->lock when accessing chain_list.
Chain get and put modify chain lifetime-management data and parent block's
chain_list, so take the lock in these functions. Verify block->lock state
with assertions in functions that expect to be called with the lock taken
and are called from multiple places. Take block->lock when accessing
filter_chain_list.

In order to allow parallel update of rules on single block, move all calls
to classifiers outside of critical sections protected by new block->lock.
Rearrange chain get and put functions code to only access protected chain
data while holding block lock:
- Rearrange code to only access chain reference counter and chain action
  reference counter while holding block lock.
- Extract code that requires block->lock from tcf_chain_destroy() into
  standalone tcf_chain_destroy() function that is called by
  __tcf_chain_put() in same critical section that changes chain reference
  counters.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---

Changes from V3 to V4:
  - Extract code that manages chain 'explicitly_created' flag into
    standalone patch.

 include/net/sch_generic.h |  5 +++
 net/sched/cls_api.c       | 84 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 76 insertions(+), 13 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 7a4957599874..31b8ea66a47d 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/refcount.h>
 #include <linux/workqueue.h>
+#include <linux/mutex.h>
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 
@@ -352,6 +353,10 @@ struct tcf_chain {
 };
 
 struct tcf_block {
+	/* Lock protects tcf_block and lifetime-management data of chains
+	 * attached to the block (refcnt, action_refcnt, explicitly_created).
+	 */
+	struct mutex lock;
 	struct list_head chain_list;
 	u32 index; /* block index for shared blocks */
 	refcount_t refcnt;
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 02cf6d2fa0e1..806e7158a7e8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -201,6 +201,9 @@ static void tcf_proto_destroy(struct tcf_proto *tp,
 	kfree_rcu(tp, rcu);
 }
 
+#define ASSERT_BLOCK_LOCKED(block)					\
+	lockdep_assert_held(&(block)->lock)
+
 struct tcf_filter_chain_list_item {
 	struct list_head list;
 	tcf_chain_head_change_t *chain_head_change;
@@ -212,6 +215,8 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
 {
 	struct tcf_chain *chain;
 
+	ASSERT_BLOCK_LOCKED(block);
+
 	chain = kzalloc(sizeof(*chain), GFP_KERNEL);
 	if (!chain)
 		return NULL;
@@ -243,25 +248,51 @@ static void tcf_chain0_head_change(struct tcf_chain *chain,
 		tcf_chain_head_change_item(item, tp_head);
 }
 
-static void tcf_chain_destroy(struct tcf_chain *chain)
+/* Returns true if block can be safely freed. */
+
+static bool tcf_chain_detach(struct tcf_chain *chain)
 {
 	struct tcf_block *block = chain->block;
 
+	ASSERT_BLOCK_LOCKED(block);
+
 	list_del(&chain->list);
 	if (!chain->index)
 		block->chain0.chain = NULL;
+
+	if (list_empty(&block->chain_list) &&
+	    refcount_read(&block->refcnt) == 0)
+		return true;
+
+	return false;
+}
+
+static void tcf_block_destroy(struct tcf_block *block)
+{
+	mutex_destroy(&block->lock);
+	kfree_rcu(block, rcu);
+}
+
+static void tcf_chain_destroy(struct tcf_chain *chain, bool free_block)
+{
+	struct tcf_block *block = chain->block;
+
 	kfree(chain);
-	if (list_empty(&block->chain_list) && !refcount_read(&block->refcnt))
-		kfree_rcu(block, rcu);
+	if (free_block)
+		tcf_block_destroy(block);
 }
 
 static void tcf_chain_hold(struct tcf_chain *chain)
 {
+	ASSERT_BLOCK_LOCKED(chain->block);
+
 	++chain->refcnt;
 }
 
 static bool tcf_chain_held_by_acts_only(struct tcf_chain *chain)
 {
+	ASSERT_BLOCK_LOCKED(chain->block);
+
 	/* In case all the references are action references, this
 	 * chain should not be shown to the user.
 	 */
@@ -273,6 +304,8 @@ static struct tcf_chain *tcf_chain_lookup(struct tcf_block *block,
 {
 	struct tcf_chain *chain;
 
+	ASSERT_BLOCK_LOCKED(block);
+
 	list_for_each_entry(chain, &block->chain_list, list) {
 		if (chain->index == chain_index)
 			return chain;
@@ -287,31 +320,40 @@ static struct tcf_chain *__tcf_chain_get(struct tcf_block *block,
 					 u32 chain_index, bool create,
 					 bool by_act)
 {
-	struct tcf_chain *chain = tcf_chain_lookup(block, chain_index);
+	struct tcf_chain *chain = NULL;
+	bool is_first_reference;
 
+	mutex_lock(&block->lock);
+	chain = tcf_chain_lookup(block, chain_index);
 	if (chain) {
 		tcf_chain_hold(chain);
 	} else {
 		if (!create)
-			return NULL;
+			goto errout;
 		chain = tcf_chain_create(block, chain_index);
 		if (!chain)
-			return NULL;
+			goto errout;
 	}
 
 	if (by_act)
 		++chain->action_refcnt;
+	is_first_reference = chain->refcnt - chain->action_refcnt == 1;
+	mutex_unlock(&block->lock);
 
 	/* Send notification only in case we got the first
 	 * non-action reference. Until then, the chain acts only as
 	 * a placeholder for actions pointing to it and user ought
 	 * not know about them.
 	 */
-	if (chain->refcnt - chain->action_refcnt == 1 && !by_act)
+	if (is_first_reference && !by_act)
 		tc_chain_notify(chain, NULL, 0, NLM_F_CREATE | NLM_F_EXCL,
 				RTM_NEWCHAIN, false);
 
 	return chain;
+
+errout:
+	mutex_unlock(&block->lock);
+	return chain;
 }
 
 static struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
@@ -330,17 +372,31 @@ static void tc_chain_tmplt_del(struct tcf_chain *chain);
 
 static void __tcf_chain_put(struct tcf_chain *chain, bool by_act)
 {
+	struct tcf_block *block = chain->block;
+	bool is_last, free_block = false;
+	unsigned int refcnt;
+
+	mutex_lock(&block->lock);
 	if (by_act)
 		chain->action_refcnt--;
-	chain->refcnt--;
+
+	/* tc_chain_notify_delete can't be called while holding block lock.
+	 * However, when block is unlocked chain can be changed concurrently, so
+	 * save these to temporary variables.
+	 */
+	refcnt = --chain->refcnt;
+	is_last = refcnt - chain->action_refcnt == 0;
+	if (refcnt == 0)
+		free_block = tcf_chain_detach(chain);
+	mutex_unlock(&block->lock);
 
 	/* The last dropped non-action reference will trigger notification. */
-	if (chain->refcnt - chain->action_refcnt == 0 && !by_act)
+	if (is_last && !by_act)
 		tc_chain_notify(chain, NULL, 0, 0, RTM_DELCHAIN, false);
 
-	if (chain->refcnt == 0) {
+	if (refcnt == 0) {
 		tc_chain_tmplt_del(chain);
-		tcf_chain_destroy(chain);
+		tcf_chain_destroy(chain, free_block);
 	}
 }
 
@@ -772,6 +828,7 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
 		NL_SET_ERR_MSG(extack, "Memory allocation for block failed");
 		return ERR_PTR(-ENOMEM);
 	}
+	mutex_init(&block->lock);
 	INIT_LIST_HEAD(&block->chain_list);
 	INIT_LIST_HEAD(&block->cb_list);
 	INIT_LIST_HEAD(&block->owner_list);
@@ -835,7 +892,7 @@ static void tcf_block_put_all_chains(struct tcf_block *block)
 static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 			    struct tcf_block_ext_info *ei)
 {
-	if (refcount_dec_and_test(&block->refcnt)) {
+	if (refcount_dec_and_mutex_lock(&block->refcnt, &block->lock)) {
 		/* Flushing/putting all chains will cause the block to be
 		 * deallocated when last chain is freed. However, if chain_list
 		 * is empty, block has to be manually deallocated. After block
@@ -844,6 +901,7 @@ static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 		 */
 		bool free_block = list_empty(&block->chain_list);
 
+		mutex_unlock(&block->lock);
 		if (tcf_block_shared(block))
 			tcf_block_remove(block, block->net);
 		if (!free_block)
@@ -853,7 +911,7 @@ static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 			tcf_block_offload_unbind(block, q, ei);
 
 		if (free_block)
-			kfree_rcu(block, rcu);
+			tcf_block_destroy(block);
 		else
 			tcf_block_put_all_chains(block);
 	} else if (q) {
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 01/17] net: sched: protect block state with mutex Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11 14:15   ` Jiri Pirko
  2019-02-11  8:55 ` [PATCH net-next v4 03/17] net: sched: refactor tc_ctl_chain() to use block->lock Vlad Buslov
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

In order to remove dependency on rtnl lock, protect
tcf_chain->explicitly_created flag with block->lock. Consolidate code that
checks and resets 'explicitly_created' flag into __tcf_chain_put() to
execute it atomically with rest of code that puts chain reference.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 net/sched/cls_api.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 806e7158a7e8..2ebf8e53038a 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -370,13 +370,22 @@ EXPORT_SYMBOL(tcf_chain_get_by_act);
 
 static void tc_chain_tmplt_del(struct tcf_chain *chain);
 
-static void __tcf_chain_put(struct tcf_chain *chain, bool by_act)
+static void __tcf_chain_put(struct tcf_chain *chain, bool by_act,
+			    bool explicitly_created)
 {
 	struct tcf_block *block = chain->block;
 	bool is_last, free_block = false;
 	unsigned int refcnt;
 
 	mutex_lock(&block->lock);
+	if (explicitly_created) {
+		if (!chain->explicitly_created) {
+			mutex_unlock(&block->lock);
+			return;
+		}
+		chain->explicitly_created = false;
+	}
+
 	if (by_act)
 		chain->action_refcnt--;
 
@@ -402,19 +411,18 @@ static void __tcf_chain_put(struct tcf_chain *chain, bool by_act)
 
 static void tcf_chain_put(struct tcf_chain *chain)
 {
-	__tcf_chain_put(chain, false);
+	__tcf_chain_put(chain, false, false);
 }
 
 void tcf_chain_put_by_act(struct tcf_chain *chain)
 {
-	__tcf_chain_put(chain, true);
+	__tcf_chain_put(chain, true, false);
 }
 EXPORT_SYMBOL(tcf_chain_put_by_act);
 
 static void tcf_chain_put_explicitly_created(struct tcf_chain *chain)
 {
-	if (chain->explicitly_created)
-		tcf_chain_put(chain);
+	__tcf_chain_put(chain, false, true);
 }
 
 static void tcf_chain_flush(struct tcf_chain *chain)
@@ -2305,7 +2313,6 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 		 * to the chain previously taken during addition.
 		 */
 		tcf_chain_put_explicitly_created(chain);
-		chain->explicitly_created = false;
 		break;
 	case RTM_GETCHAIN:
 		err = tc_chain_notify(chain, skb, n->nlmsg_seq,
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 03/17] net: sched: refactor tc_ctl_chain() to use block->lock
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 01/17] net: sched: protect block state with mutex Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 04/17] net: sched: protect block->chain0 with block->lock Vlad Buslov
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

In order to remove dependency on rtnl lock, modify chain API to use
block->lock to protect chain from concurrent modification. Rearrange
tc_ctl_chain() code to call tcf_chain_hold() while holding block->lock to
prevent concurrent chain removal.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_api.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 2ebf8e53038a..b5db0f79db14 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2255,6 +2255,8 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 		err = -EINVAL;
 		goto errout_block;
 	}
+
+	mutex_lock(&block->lock);
 	chain = tcf_chain_lookup(block, chain_index);
 	if (n->nlmsg_type == RTM_NEWCHAIN) {
 		if (chain) {
@@ -2266,41 +2268,49 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 			} else {
 				NL_SET_ERR_MSG(extack, "Filter chain already exists");
 				err = -EEXIST;
-				goto errout_block;
+				goto errout_block_locked;
 			}
 		} else {
 			if (!(n->nlmsg_flags & NLM_F_CREATE)) {
 				NL_SET_ERR_MSG(extack, "Need both RTM_NEWCHAIN and NLM_F_CREATE to create a new chain");
 				err = -ENOENT;
-				goto errout_block;
+				goto errout_block_locked;
 			}
 			chain = tcf_chain_create(block, chain_index);
 			if (!chain) {
 				NL_SET_ERR_MSG(extack, "Failed to create filter chain");
 				err = -ENOMEM;
-				goto errout_block;
+				goto errout_block_locked;
 			}
 		}
 	} else {
 		if (!chain || tcf_chain_held_by_acts_only(chain)) {
 			NL_SET_ERR_MSG(extack, "Cannot find specified filter chain");
 			err = -EINVAL;
-			goto errout_block;
+			goto errout_block_locked;
 		}
 		tcf_chain_hold(chain);
 	}
 
+	if (n->nlmsg_type == RTM_NEWCHAIN) {
+		/* Modifying chain requires holding parent block lock. In case
+		 * the chain was successfully added, take a reference to the
+		 * chain. This ensures that an empty chain does not disappear at
+		 * the end of this function.
+		 */
+		tcf_chain_hold(chain);
+		chain->explicitly_created = true;
+	}
+	mutex_unlock(&block->lock);
+
 	switch (n->nlmsg_type) {
 	case RTM_NEWCHAIN:
 		err = tc_chain_tmplt_add(chain, net, tca, extack);
-		if (err)
+		if (err) {
+			tcf_chain_put_explicitly_created(chain);
 			goto errout;
-		/* In case the chain was successfully added, take a reference
-		 * to the chain. This ensures that an empty chain
-		 * does not disappear at the end of this function.
-		 */
-		tcf_chain_hold(chain);
-		chain->explicitly_created = true;
+		}
+
 		tc_chain_notify(chain, NULL, 0, NLM_F_CREATE | NLM_F_EXCL,
 				RTM_NEWCHAIN, false);
 		break;
@@ -2334,6 +2344,10 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 		/* Replay the request. */
 		goto replay;
 	return err;
+
+errout_block_locked:
+	mutex_unlock(&block->lock);
+	goto errout_block;
 }
 
 /* called with RTNL */
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 04/17] net: sched: protect block->chain0 with block->lock
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (2 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 03/17] net: sched: refactor tc_ctl_chain() to use block->lock Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain() Vlad Buslov
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

In order to remove dependency on rtnl lock, use block->lock to protect
chain0 struct from concurrent modification. Rearrange code in chain0
callback add and del functions to only access chain0 when block->lock is
held.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_api.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index b5db0f79db14..869ae44d7631 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -244,8 +244,11 @@ static void tcf_chain0_head_change(struct tcf_chain *chain,
 
 	if (chain->index)
 		return;
+
+	mutex_lock(&block->lock);
 	list_for_each_entry(item, &block->chain0.filter_chain_list, list)
 		tcf_chain_head_change_item(item, tp_head);
+	mutex_unlock(&block->lock);
 }
 
 /* Returns true if block can be safely freed. */
@@ -756,8 +759,8 @@ tcf_chain0_head_change_cb_add(struct tcf_block *block,
 			      struct tcf_block_ext_info *ei,
 			      struct netlink_ext_ack *extack)
 {
-	struct tcf_chain *chain0 = block->chain0.chain;
 	struct tcf_filter_chain_list_item *item;
+	struct tcf_chain *chain0;
 
 	item = kmalloc(sizeof(*item), GFP_KERNEL);
 	if (!item) {
@@ -766,9 +769,14 @@ tcf_chain0_head_change_cb_add(struct tcf_block *block,
 	}
 	item->chain_head_change = ei->chain_head_change;
 	item->chain_head_change_priv = ei->chain_head_change_priv;
+
+	mutex_lock(&block->lock);
+	chain0 = block->chain0.chain;
 	if (chain0 && chain0->filter_chain)
 		tcf_chain_head_change_item(item, chain0->filter_chain);
 	list_add(&item->list, &block->chain0.filter_chain_list);
+	mutex_unlock(&block->lock);
+
 	return 0;
 }
 
@@ -776,20 +784,23 @@ static void
 tcf_chain0_head_change_cb_del(struct tcf_block *block,
 			      struct tcf_block_ext_info *ei)
 {
-	struct tcf_chain *chain0 = block->chain0.chain;
 	struct tcf_filter_chain_list_item *item;
 
+	mutex_lock(&block->lock);
 	list_for_each_entry(item, &block->chain0.filter_chain_list, list) {
 		if ((!ei->chain_head_change && !ei->chain_head_change_priv) ||
 		    (item->chain_head_change == ei->chain_head_change &&
 		     item->chain_head_change_priv == ei->chain_head_change_priv)) {
-			if (chain0)
+			if (block->chain0.chain)
 				tcf_chain_head_change_item(item, NULL);
 			list_del(&item->list);
+			mutex_unlock(&block->lock);
+
 			kfree(item);
 			return;
 		}
 	}
+	mutex_unlock(&block->lock);
 	WARN_ON(1);
 }
 
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain()
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (3 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 04/17] net: sched: protect block->chain0 with block->lock Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-15 22:21   ` Cong Wang
  2019-02-11  8:55 ` [PATCH net-next v4 06/17] net: sched: protect chain template accesses with block lock Vlad Buslov
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

All users of block->chain_list rely on rtnl lock and assume that no new
chains are added when traversing the list. Use tcf_get_next_chain() to
traverse chain list without relying on rtnl mutex. This function iterates
over chains by taking reference to current iterator chain only and doesn't
assume external synchronization of chain list.

Don't take reference to all chains in block when flushing and use
tcf_get_next_chain() to safely iterate over chain list instead. Remove
tcf_block_put_all_chains() that is no longer used.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h |  2 ++
 net/sched/cls_api.c   | 96 +++++++++++++++++++++++++++++++++++++--------------
 net/sched/sch_api.c   |  4 ++-
 3 files changed, 76 insertions(+), 26 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index cb8be396a11f..38bee7dd21d1 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -44,6 +44,8 @@ bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 struct tcf_chain *tcf_chain_get_by_act(struct tcf_block *block,
 				       u32 chain_index);
 void tcf_chain_put_by_act(struct tcf_chain *chain);
+struct tcf_chain *tcf_get_next_chain(struct tcf_block *block,
+				     struct tcf_chain *chain);
 void tcf_block_netif_keep_dst(struct tcf_block *block);
 int tcf_block_get(struct tcf_block **p_block,
 		  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 869ae44d7631..8e2ac785f6fd 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -883,28 +883,62 @@ static struct tcf_block *tcf_block_refcnt_get(struct net *net, u32 block_index)
 	return block;
 }
 
-static void tcf_block_flush_all_chains(struct tcf_block *block)
+static struct tcf_chain *
+__tcf_get_next_chain(struct tcf_block *block, struct tcf_chain *chain)
 {
-	struct tcf_chain *chain;
+	mutex_lock(&block->lock);
+	if (chain)
+		chain = list_is_last(&chain->list, &block->chain_list) ?
+			NULL : list_next_entry(chain, list);
+	else
+		chain = list_first_entry_or_null(&block->chain_list,
+						 struct tcf_chain, list);
 
-	/* Hold a refcnt for all chains, so that they don't disappear
-	 * while we are iterating.
-	 */
-	list_for_each_entry(chain, &block->chain_list, list)
+	/* skip all action-only chains */
+	while (chain && tcf_chain_held_by_acts_only(chain))
+		chain = list_is_last(&chain->list, &block->chain_list) ?
+			NULL : list_next_entry(chain, list);
+
+	if (chain)
 		tcf_chain_hold(chain);
+	mutex_unlock(&block->lock);
 
-	list_for_each_entry(chain, &block->chain_list, list)
-		tcf_chain_flush(chain);
+	return chain;
 }
 
-static void tcf_block_put_all_chains(struct tcf_block *block)
+/* Function to be used by all clients that want to iterate over all chains on
+ * block. It properly obtains block->lock and takes reference to chain before
+ * returning it. Users of this function must be tolerant to concurrent chain
+ * insertion/deletion or ensure that no concurrent chain modification is
+ * possible. Note that all netlink dump callbacks cannot guarantee to provide
+ * consistent dump because rtnl lock is released each time skb is filled with
+ * data and sent to user-space.
+ */
+
+struct tcf_chain *
+tcf_get_next_chain(struct tcf_block *block, struct tcf_chain *chain)
 {
-	struct tcf_chain *chain, *tmp;
+	struct tcf_chain *chain_next = __tcf_get_next_chain(block, chain);
 
-	/* At this point, all the chains should have refcnt >= 1. */
-	list_for_each_entry_safe(chain, tmp, &block->chain_list, list) {
-		tcf_chain_put_explicitly_created(chain);
+	if (chain)
 		tcf_chain_put(chain);
+
+	return chain_next;
+}
+EXPORT_SYMBOL(tcf_get_next_chain);
+
+static void tcf_block_flush_all_chains(struct tcf_block *block)
+{
+	struct tcf_chain *chain;
+
+	/* Last reference to block. At this point chains cannot be added or
+	 * removed concurrently.
+	 */
+	for (chain = tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain = tcf_get_next_chain(block, chain)) {
+		tcf_chain_put_explicitly_created(chain);
+		tcf_chain_flush(chain);
 	}
 }
 
@@ -923,8 +957,6 @@ static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 		mutex_unlock(&block->lock);
 		if (tcf_block_shared(block))
 			tcf_block_remove(block, block->net);
-		if (!free_block)
-			tcf_block_flush_all_chains(block);
 
 		if (q)
 			tcf_block_offload_unbind(block, q, ei);
@@ -932,7 +964,7 @@ static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 		if (free_block)
 			tcf_block_destroy(block);
 		else
-			tcf_block_put_all_chains(block);
+			tcf_block_flush_all_chains(block);
 	} else if (q) {
 		tcf_block_offload_unbind(block, q, ei);
 	}
@@ -1266,11 +1298,15 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 			    void *cb_priv, bool add, bool offload_in_use,
 			    struct netlink_ext_ack *extack)
 {
-	struct tcf_chain *chain;
+	struct tcf_chain *chain, *chain_prev;
 	struct tcf_proto *tp;
 	int err;
 
-	list_for_each_entry(chain, &block->chain_list, list) {
+	for (chain = __tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain_prev = chain,
+		     chain = __tcf_get_next_chain(block, chain),
+		     tcf_chain_put(chain_prev)) {
 		for (tp = rtnl_dereference(chain->filter_chain); tp;
 		     tp = rtnl_dereference(tp->next)) {
 			if (tp->ops->reoffload) {
@@ -1289,6 +1325,7 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 	return 0;
 
 err_playback_remove:
+	tcf_chain_put(chain);
 	tcf_block_playback_offloads(block, cb, cb_priv, false, offload_in_use,
 				    extack);
 	return err;
@@ -2023,11 +2060,11 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 /* called with RTNL */
 static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct tcf_chain *chain, *chain_prev;
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct Qdisc *q = NULL;
 	struct tcf_block *block;
-	struct tcf_chain *chain;
 	struct tcmsg *tcm = nlmsg_data(cb->nlh);
 	long index_start;
 	long index;
@@ -2091,12 +2128,17 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	index_start = cb->args[0];
 	index = 0;
 
-	list_for_each_entry(chain, &block->chain_list, list) {
+	for (chain = __tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain_prev = chain,
+		     chain = __tcf_get_next_chain(block, chain),
+		     tcf_chain_put(chain_prev)) {
 		if (tca[TCA_CHAIN] &&
 		    nla_get_u32(tca[TCA_CHAIN]) != chain->index)
 			continue;
 		if (!tcf_chain_dump(chain, q, parent, skb, cb,
 				    index_start, &index)) {
+			tcf_chain_put(chain);
 			err = -EMSGSIZE;
 			break;
 		}
@@ -2364,11 +2406,11 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 /* called with RTNL */
 static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct tcf_chain *chain, *chain_prev;
 	struct net *net = sock_net(skb->sk);
 	struct nlattr *tca[TCA_MAX + 1];
 	struct Qdisc *q = NULL;
 	struct tcf_block *block;
-	struct tcf_chain *chain;
 	struct tcmsg *tcm = nlmsg_data(cb->nlh);
 	long index_start;
 	long index;
@@ -2432,7 +2474,11 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 	index_start = cb->args[0];
 	index = 0;
 
-	list_for_each_entry(chain, &block->chain_list, list) {
+	for (chain = __tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain_prev = chain,
+		     chain = __tcf_get_next_chain(block, chain),
+		     tcf_chain_put(chain_prev)) {
 		if ((tca[TCA_CHAIN] &&
 		     nla_get_u32(tca[TCA_CHAIN]) != chain->index))
 			continue;
@@ -2440,14 +2486,14 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 			index++;
 			continue;
 		}
-		if (tcf_chain_held_by_acts_only(chain))
-			continue;
 		err = tc_chain_fill_node(chain, net, skb, block,
 					 NETLINK_CB(cb->skb).portid,
 					 cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					 RTM_NEWCHAIN);
-		if (err <= 0)
+		if (err <= 0) {
+			tcf_chain_put(chain);
 			break;
+		}
 		index++;
 	}
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 03e26e8d0ec9..80058abc729f 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1909,7 +1909,9 @@ static void tc_bind_tclass(struct Qdisc *q, u32 portid, u32 clid,
 	block = cops->tcf_block(q, cl, NULL);
 	if (!block)
 		return;
-	list_for_each_entry(chain, &block->chain_list, list) {
+	for (chain = tcf_get_next_chain(block, NULL);
+	     chain;
+	     chain = tcf_get_next_chain(block, chain)) {
 		struct tcf_proto *tp;
 
 		for (tp = rtnl_dereference(chain->filter_chain);
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 06/17] net: sched: protect chain template accesses with block lock
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (4 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain() Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

When cls API is called without protection of rtnl lock, parallel
modification of chain is possible, which means that chain template can be
changed concurrently in certain circumstances. For example, when chain is
'deleted' by new user-space chain API, the chain might continue to be used
if it is referenced by actions, and can be 're-created' again by user. In
such case same chain structure is reused and its template is changed. To
protect from described scenario, cache chain template while holding block
lock. Introduce standalone tc_chain_notify_delete() function that works
with cached template values, instead of chains themselves.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_api.c | 73 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 16 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8e2ac785f6fd..0dcce8b0c7b4 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -371,14 +371,22 @@ struct tcf_chain *tcf_chain_get_by_act(struct tcf_block *block, u32 chain_index)
 }
 EXPORT_SYMBOL(tcf_chain_get_by_act);
 
-static void tc_chain_tmplt_del(struct tcf_chain *chain);
+static void tc_chain_tmplt_del(const struct tcf_proto_ops *tmplt_ops,
+			       void *tmplt_priv);
+static int tc_chain_notify_delete(const struct tcf_proto_ops *tmplt_ops,
+				  void *tmplt_priv, u32 chain_index,
+				  struct tcf_block *block, struct sk_buff *oskb,
+				  u32 seq, u16 flags, bool unicast);
 
 static void __tcf_chain_put(struct tcf_chain *chain, bool by_act,
 			    bool explicitly_created)
 {
 	struct tcf_block *block = chain->block;
+	const struct tcf_proto_ops *tmplt_ops;
 	bool is_last, free_block = false;
 	unsigned int refcnt;
+	void *tmplt_priv;
+	u32 chain_index;
 
 	mutex_lock(&block->lock);
 	if (explicitly_created) {
@@ -398,16 +406,21 @@ static void __tcf_chain_put(struct tcf_chain *chain, bool by_act,
 	 */
 	refcnt = --chain->refcnt;
 	is_last = refcnt - chain->action_refcnt == 0;
+	tmplt_ops = chain->tmplt_ops;
+	tmplt_priv = chain->tmplt_priv;
+	chain_index = chain->index;
+
 	if (refcnt == 0)
 		free_block = tcf_chain_detach(chain);
 	mutex_unlock(&block->lock);
 
 	/* The last dropped non-action reference will trigger notification. */
 	if (is_last && !by_act)
-		tc_chain_notify(chain, NULL, 0, 0, RTM_DELCHAIN, false);
+		tc_chain_notify_delete(tmplt_ops, tmplt_priv, chain_index,
+				       block, NULL, 0, 0, false);
 
 	if (refcnt == 0) {
-		tc_chain_tmplt_del(chain);
+		tc_chain_tmplt_del(tmplt_ops, tmplt_priv);
 		tcf_chain_destroy(chain, free_block);
 	}
 }
@@ -2155,8 +2168,10 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
-static int tc_chain_fill_node(struct tcf_chain *chain, struct net *net,
-			      struct sk_buff *skb, struct tcf_block *block,
+static int tc_chain_fill_node(const struct tcf_proto_ops *tmplt_ops,
+			      void *tmplt_priv, u32 chain_index,
+			      struct net *net, struct sk_buff *skb,
+			      struct tcf_block *block,
 			      u32 portid, u32 seq, u16 flags, int event)
 {
 	unsigned char *b = skb_tail_pointer(skb);
@@ -2165,8 +2180,8 @@ static int tc_chain_fill_node(struct tcf_chain *chain, struct net *net,
 	struct tcmsg *tcm;
 	void *priv;
 
-	ops = chain->tmplt_ops;
-	priv = chain->tmplt_priv;
+	ops = tmplt_ops;
+	priv = tmplt_priv;
 
 	nlh = nlmsg_put(skb, portid, seq, event, sizeof(*tcm), flags);
 	if (!nlh)
@@ -2184,7 +2199,7 @@ static int tc_chain_fill_node(struct tcf_chain *chain, struct net *net,
 		tcm->tcm_block_index = block->index;
 	}
 
-	if (nla_put_u32(skb, TCA_CHAIN, chain->index))
+	if (nla_put_u32(skb, TCA_CHAIN, chain_index))
 		goto nla_put_failure;
 
 	if (ops) {
@@ -2215,7 +2230,8 @@ static int tc_chain_notify(struct tcf_chain *chain, struct sk_buff *oskb,
 	if (!skb)
 		return -ENOBUFS;
 
-	if (tc_chain_fill_node(chain, net, skb, block, portid,
+	if (tc_chain_fill_node(chain->tmplt_ops, chain->tmplt_priv,
+			       chain->index, net, skb, block, portid,
 			       seq, flags, event) <= 0) {
 		kfree_skb(skb);
 		return -EINVAL;
@@ -2227,6 +2243,31 @@ static int tc_chain_notify(struct tcf_chain *chain, struct sk_buff *oskb,
 	return rtnetlink_send(skb, net, portid, RTNLGRP_TC, flags & NLM_F_ECHO);
 }
 
+static int tc_chain_notify_delete(const struct tcf_proto_ops *tmplt_ops,
+				  void *tmplt_priv, u32 chain_index,
+				  struct tcf_block *block, struct sk_buff *oskb,
+				  u32 seq, u16 flags, bool unicast)
+{
+	u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
+	struct net *net = block->net;
+	struct sk_buff *skb;
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOBUFS;
+
+	if (tc_chain_fill_node(tmplt_ops, tmplt_priv, chain_index, net, skb,
+			       block, portid, seq, flags, RTM_DELCHAIN) <= 0) {
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	if (unicast)
+		return netlink_unicast(net->rtnl, skb, portid, MSG_DONTWAIT);
+
+	return rtnetlink_send(skb, net, portid, RTNLGRP_TC, flags & NLM_F_ECHO);
+}
+
 static int tc_chain_tmplt_add(struct tcf_chain *chain, struct net *net,
 			      struct nlattr **tca,
 			      struct netlink_ext_ack *extack)
@@ -2256,16 +2297,15 @@ static int tc_chain_tmplt_add(struct tcf_chain *chain, struct net *net,
 	return 0;
 }
 
-static void tc_chain_tmplt_del(struct tcf_chain *chain)
+static void tc_chain_tmplt_del(const struct tcf_proto_ops *tmplt_ops,
+			       void *tmplt_priv)
 {
-	const struct tcf_proto_ops *ops = chain->tmplt_ops;
-
 	/* If template ops are set, no work to do for us. */
-	if (!ops)
+	if (!tmplt_ops)
 		return;
 
-	ops->tmplt_destroy(chain->tmplt_priv);
-	module_put(ops->owner);
+	tmplt_ops->tmplt_destroy(tmplt_priv);
+	module_put(tmplt_ops->owner);
 }
 
 /* Add/delete/get a chain */
@@ -2486,7 +2526,8 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 			index++;
 			continue;
 		}
-		err = tc_chain_fill_node(chain, net, skb, block,
+		err = tc_chain_fill_node(chain->tmplt_ops, chain->tmplt_priv,
+					 chain->index, net, skb, block,
 					 NETLINK_CB(cb->skb).portid,
 					 cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					 RTM_NEWCHAIN);
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (5 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 06/17] net: sched: protect chain template accesses with block lock Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-14 18:24   ` Ido Schimmel
  2019-02-15 22:35   ` Cong Wang
  2019-02-11  8:55 ` [PATCH net-next v4 08/17] net: sched: introduce reference counting for tcf_proto Vlad Buslov
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
when accessing filter_chain list, instead of relying on rtnl lock.
Dereference filter_chain with tcf_chain_dereference() lockdep macro to
verify that all users of chain_list have the lock taken.

Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
all necessary code while holding chain lock in order to prevent
invalidation of chain_info structure by potential concurrent change. This
also serializes calls to tcf_chain0_head_change(), which allows head change
callbacks to rely on filter_chain_lock for synchronization instead of rtnl
mutex.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  17 +++++++
 net/sched/cls_api.c       | 111 +++++++++++++++++++++++++++++++++-------------
 net/sched/sch_generic.c   |   6 ++-
 3 files changed, 101 insertions(+), 33 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31b8ea66a47d..85993d7efee6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -341,6 +341,8 @@ struct qdisc_skb_cb {
 typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv);
 
 struct tcf_chain {
+	/* Protects filter_chain. */
+	struct mutex filter_chain_lock;
 	struct tcf_proto __rcu *filter_chain;
 	struct list_head list;
 	struct tcf_block *block;
@@ -374,6 +376,21 @@ struct tcf_block {
 	struct rcu_head rcu;
 };
 
+#ifdef CONFIG_PROVE_LOCKING
+static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
+{
+	return lockdep_is_held(&chain->filter_chain_lock);
+}
+#else
+static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
+{
+	return true;
+}
+#endif /* #ifdef CONFIG_PROVE_LOCKING */
+
+#define tcf_chain_dereference(p, chain)					\
+	rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
+
 static inline void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
 {
 	if (*flags & TCA_CLS_FLAGS_IN_HW)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0dcce8b0c7b4..3fce30ae9a9b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -221,6 +221,7 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
 	if (!chain)
 		return NULL;
 	list_add_tail(&chain->list, &block->chain_list);
+	mutex_init(&chain->filter_chain_lock);
 	chain->block = block;
 	chain->index = chain_index;
 	chain->refcnt = 1;
@@ -280,6 +281,7 @@ static void tcf_chain_destroy(struct tcf_chain *chain, bool free_block)
 {
 	struct tcf_block *block = chain->block;
 
+	mutex_destroy(&chain->filter_chain_lock);
 	kfree(chain);
 	if (free_block)
 		tcf_block_destroy(block);
@@ -443,9 +445,13 @@ static void tcf_chain_put_explicitly_created(struct tcf_chain *chain)
 
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
-	struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
+	struct tcf_proto *tp;
 
+	mutex_lock(&chain->filter_chain_lock);
+	tp = tcf_chain_dereference(chain->filter_chain, chain);
 	tcf_chain0_head_change(chain, NULL);
+	mutex_unlock(&chain->filter_chain_lock);
+
 	while (tp) {
 		RCU_INIT_POINTER(chain->filter_chain, tp->next);
 		tcf_proto_destroy(tp, NULL);
@@ -785,11 +791,29 @@ tcf_chain0_head_change_cb_add(struct tcf_block *block,
 
 	mutex_lock(&block->lock);
 	chain0 = block->chain0.chain;
-	if (chain0 && chain0->filter_chain)
-		tcf_chain_head_change_item(item, chain0->filter_chain);
-	list_add(&item->list, &block->chain0.filter_chain_list);
+	if (chain0)
+		tcf_chain_hold(chain0);
+	else
+		list_add(&item->list, &block->chain0.filter_chain_list);
 	mutex_unlock(&block->lock);
 
+	if (chain0) {
+		struct tcf_proto *tp_head;
+
+		mutex_lock(&chain0->filter_chain_lock);
+
+		tp_head = tcf_chain_dereference(chain0->filter_chain, chain0);
+		if (tp_head)
+			tcf_chain_head_change_item(item, tp_head);
+
+		mutex_lock(&block->lock);
+		list_add(&item->list, &block->chain0.filter_chain_list);
+		mutex_unlock(&block->lock);
+
+		mutex_unlock(&chain0->filter_chain_lock);
+		tcf_chain_put(chain0);
+	}
+
 	return 0;
 }
 
@@ -1464,9 +1488,10 @@ struct tcf_chain_info {
 	struct tcf_proto __rcu *next;
 };
 
-static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain_info *chain_info)
+static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain *chain,
+					   struct tcf_chain_info *chain_info)
 {
-	return rtnl_dereference(*chain_info->pprev);
+	return tcf_chain_dereference(*chain_info->pprev, chain);
 }
 
 static void tcf_chain_tp_insert(struct tcf_chain *chain,
@@ -1475,7 +1500,7 @@ static void tcf_chain_tp_insert(struct tcf_chain *chain,
 {
 	if (*chain_info->pprev == chain->filter_chain)
 		tcf_chain0_head_change(chain, tp);
-	RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain_info));
+	RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain, chain_info));
 	rcu_assign_pointer(*chain_info->pprev, tp);
 	tcf_chain_hold(chain);
 }
@@ -1484,7 +1509,7 @@ static void tcf_chain_tp_remove(struct tcf_chain *chain,
 				struct tcf_chain_info *chain_info,
 				struct tcf_proto *tp)
 {
-	struct tcf_proto *next = rtnl_dereference(chain_info->next);
+	struct tcf_proto *next = tcf_chain_dereference(chain_info->next, chain);
 
 	if (tp == chain->filter_chain)
 		tcf_chain0_head_change(chain, next);
@@ -1502,7 +1527,8 @@ static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 
 	/* Check the chain for existence of proto-tcf with this priority */
 	for (pprev = &chain->filter_chain;
-	     (tp = rtnl_dereference(*pprev)); pprev = &tp->next) {
+	     (tp = tcf_chain_dereference(*pprev, chain));
+	     pprev = &tp->next) {
 		if (tp->prio >= prio) {
 			if (tp->prio == prio) {
 				if (prio_allocate ||
@@ -1710,12 +1736,13 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		goto errout;
 	}
 
+	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
 			       prio, prio_allocate);
 	if (IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
 		err = PTR_ERR(tp);
-		goto errout;
+		goto errout_locked;
 	}
 
 	if (tp == NULL) {
@@ -1724,29 +1751,37 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		if (tca[TCA_KIND] == NULL || !protocol) {
 			NL_SET_ERR_MSG(extack, "Filter kind and protocol must be specified");
 			err = -EINVAL;
-			goto errout;
+			goto errout_locked;
 		}
 
 		if (!(n->nlmsg_flags & NLM_F_CREATE)) {
 			NL_SET_ERR_MSG(extack, "Need both RTM_NEWTFILTER and NLM_F_CREATE to create a new filter");
 			err = -ENOENT;
-			goto errout;
+			goto errout_locked;
 		}
 
 		if (prio_allocate)
-			prio = tcf_auto_prio(tcf_chain_tp_prev(&chain_info));
+			prio = tcf_auto_prio(tcf_chain_tp_prev(chain,
+							       &chain_info));
 
+		mutex_unlock(&chain->filter_chain_lock);
 		tp = tcf_proto_create(nla_data(tca[TCA_KIND]),
 				      protocol, prio, chain, extack);
 		if (IS_ERR(tp)) {
 			err = PTR_ERR(tp);
 			goto errout;
 		}
+
+		mutex_lock(&chain->filter_chain_lock);
+		tcf_chain_tp_insert(chain, &chain_info, tp);
+		mutex_unlock(&chain->filter_chain_lock);
 		tp_created = 1;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
 		err = -EINVAL;
-		goto errout;
+		goto errout_locked;
+	} else {
+		mutex_unlock(&chain->filter_chain_lock);
 	}
 
 	fh = tp->ops->get(tp, t->tcm_handle);
@@ -1772,15 +1807,11 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE,
 			      extack);
-	if (err == 0) {
-		if (tp_created)
-			tcf_chain_tp_insert(chain, &chain_info, tp);
+	if (err == 0)
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false);
-	} else {
-		if (tp_created)
-			tcf_proto_destroy(tp, NULL);
-	}
+	else if (tp_created)
+		tcf_proto_destroy(tp, NULL);
 
 errout:
 	if (chain)
@@ -1790,6 +1821,10 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		/* Replay the request. */
 		goto replay;
 	return err;
+
+errout_locked:
+	mutex_unlock(&chain->filter_chain_lock);
+	goto errout;
 }
 
 static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
@@ -1865,31 +1900,34 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		goto errout;
 	}
 
+	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
 		err = tp ? PTR_ERR(tp) : -ENOENT;
-		goto errout;
+		goto errout_locked;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
 		err = -EINVAL;
+		goto errout_locked;
+	} else if (t->tcm_handle == 0) {
+		tcf_chain_tp_remove(chain, &chain_info, tp);
+		mutex_unlock(&chain->filter_chain_lock);
+
+		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
+			       RTM_DELTFILTER, false);
+		tcf_proto_destroy(tp, extack);
+		err = 0;
 		goto errout;
 	}
+	mutex_unlock(&chain->filter_chain_lock);
 
 	fh = tp->ops->get(tp, t->tcm_handle);
 
 	if (!fh) {
-		if (t->tcm_handle == 0) {
-			tcf_chain_tp_remove(chain, &chain_info, tp);
-			tfilter_notify(net, skb, n, tp, block, q, parent, fh,
-				       RTM_DELTFILTER, false);
-			tcf_proto_destroy(tp, extack);
-			err = 0;
-		} else {
-			NL_SET_ERR_MSG(extack, "Specified filter handle not found");
-			err = -ENOENT;
-		}
+		NL_SET_ERR_MSG(extack, "Specified filter handle not found");
+		err = -ENOENT;
 	} else {
 		bool last;
 
@@ -1899,7 +1937,10 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		if (err)
 			goto errout;
 		if (last) {
+			mutex_lock(&chain->filter_chain_lock);
 			tcf_chain_tp_remove(chain, &chain_info, tp);
+			mutex_unlock(&chain->filter_chain_lock);
+
 			tcf_proto_destroy(tp, extack);
 		}
 	}
@@ -1909,6 +1950,10 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tcf_chain_put(chain);
 	tcf_block_release(q, block);
 	return err;
+
+errout_locked:
+	mutex_unlock(&chain->filter_chain_lock);
+	goto errout;
 }
 
 static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
@@ -1966,8 +2011,10 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		goto errout;
 	}
 
+	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
 			       prio, false);
+	mutex_unlock(&chain->filter_chain_lock);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
 		err = tp ? PTR_ERR(tp) : -ENOENT;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 66ba2ce2320f..e24568f9246c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1366,7 +1366,11 @@ static void mini_qdisc_rcu_func(struct rcu_head *head)
 void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
 			  struct tcf_proto *tp_head)
 {
-	struct mini_Qdisc *miniq_old = rtnl_dereference(*miniqp->p_miniq);
+	/* Protected with chain0->filter_chain_lock.
+	 * Can't access chain directly because tp_head can be NULL.
+	 */
+	struct mini_Qdisc *miniq_old =
+		rcu_dereference_protected(*miniqp->p_miniq, 1);
 	struct mini_Qdisc *miniq;
 
 	if (!tp_head) {
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 08/17] net: sched: introduce reference counting for tcf_proto
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (6 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 09/17] net: sched: traverse classifiers in chain with tcf_get_next_proto() Vlad Buslov
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

In order to remove dependency on rtnl lock and allow concurrent tcf_proto
modification, extend tcf_proto with reference counter. Implement helper
get/put functions for tcf proto and use them to modify cls API to always
take reference to tcf_proto while using it. Only release reference to
parent chain after releasing last reference to tp.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  1 +
 net/sched/cls_api.c       | 53 ++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 85993d7efee6..4372c08fc4d9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -322,6 +322,7 @@ struct tcf_proto {
 	void			*data;
 	const struct tcf_proto_ops	*ops;
 	struct tcf_chain	*chain;
+	refcount_t		refcnt;
 	struct rcu_head		rcu;
 };
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fce30ae9a9b..37c05b96898f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -180,6 +180,7 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 	tp->protocol = protocol;
 	tp->prio = prio;
 	tp->chain = chain;
+	refcount_set(&tp->refcnt, 1);
 
 	err = tp->ops->init(tp);
 	if (err) {
@@ -193,14 +194,29 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 	return ERR_PTR(err);
 }
 
+static void tcf_proto_get(struct tcf_proto *tp)
+{
+	refcount_inc(&tp->refcnt);
+}
+
+static void tcf_chain_put(struct tcf_chain *chain);
+
 static void tcf_proto_destroy(struct tcf_proto *tp,
 			      struct netlink_ext_ack *extack)
 {
 	tp->ops->destroy(tp, extack);
+	tcf_chain_put(tp->chain);
 	module_put(tp->ops->owner);
 	kfree_rcu(tp, rcu);
 }
 
+static void tcf_proto_put(struct tcf_proto *tp,
+			  struct netlink_ext_ack *extack)
+{
+	if (refcount_dec_and_test(&tp->refcnt))
+		tcf_proto_destroy(tp, extack);
+}
+
 #define ASSERT_BLOCK_LOCKED(block)					\
 	lockdep_assert_held(&(block)->lock)
 
@@ -445,18 +461,18 @@ static void tcf_chain_put_explicitly_created(struct tcf_chain *chain)
 
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
-	struct tcf_proto *tp;
+	struct tcf_proto *tp, *tp_next;
 
 	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_dereference(chain->filter_chain, chain);
+	RCU_INIT_POINTER(chain->filter_chain, NULL);
 	tcf_chain0_head_change(chain, NULL);
 	mutex_unlock(&chain->filter_chain_lock);
 
 	while (tp) {
-		RCU_INIT_POINTER(chain->filter_chain, tp->next);
-		tcf_proto_destroy(tp, NULL);
-		tp = rtnl_dereference(chain->filter_chain);
-		tcf_chain_put(chain);
+		tp_next = rcu_dereference_protected(tp->next, 1);
+		tcf_proto_put(tp, NULL);
+		tp = tp_next;
 	}
 }
 
@@ -1500,9 +1516,9 @@ static void tcf_chain_tp_insert(struct tcf_chain *chain,
 {
 	if (*chain_info->pprev == chain->filter_chain)
 		tcf_chain0_head_change(chain, tp);
+	tcf_proto_get(tp);
 	RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain, chain_info));
 	rcu_assign_pointer(*chain_info->pprev, tp);
-	tcf_chain_hold(chain);
 }
 
 static void tcf_chain_tp_remove(struct tcf_chain *chain,
@@ -1514,7 +1530,6 @@ static void tcf_chain_tp_remove(struct tcf_chain *chain,
 	if (tp == chain->filter_chain)
 		tcf_chain0_head_change(chain, next);
 	RCU_INIT_POINTER(*chain_info->pprev, next);
-	tcf_chain_put(chain);
 }
 
 static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
@@ -1541,7 +1556,12 @@ static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 		}
 	}
 	chain_info->pprev = pprev;
-	chain_info->next = tp ? tp->next : NULL;
+	if (tp) {
+		chain_info->next = tp->next;
+		tcf_proto_get(tp);
+	} else {
+		chain_info->next = NULL;
+	}
 	return tp;
 }
 
@@ -1699,6 +1719,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	prio = TC_H_MAJ(t->tcm_info);
 	prio_allocate = false;
 	parent = t->tcm_parent;
+	tp = NULL;
 	cl = 0;
 
 	if (prio == 0) {
@@ -1816,6 +1837,12 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 errout:
 	if (chain)
 		tcf_chain_put(chain);
+	if (chain) {
+		if (tp && !IS_ERR(tp))
+			tcf_proto_put(tp, NULL);
+		if (!tp_created)
+			tcf_chain_put(chain);
+	}
 	tcf_block_release(q, block);
 	if (err == -EAGAIN)
 		/* Replay the request. */
@@ -1946,8 +1973,11 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 
 errout:
-	if (chain)
+	if (chain) {
+		if (tp && !IS_ERR(tp))
+			tcf_proto_put(tp, NULL);
 		tcf_chain_put(chain);
+	}
 	tcf_block_release(q, block);
 	return err;
 
@@ -2038,8 +2068,11 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 
 errout:
-	if (chain)
+	if (chain) {
+		if (tp && !IS_ERR(tp))
+			tcf_proto_put(tp, NULL);
 		tcf_chain_put(chain);
+	}
 	tcf_block_release(q, block);
 	return err;
 }
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 09/17] net: sched: traverse classifiers in chain with tcf_get_next_proto()
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (7 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 08/17] net: sched: introduce reference counting for tcf_proto Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution Vlad Buslov
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

All users of chain->filters_chain rely on rtnl lock and assume that no new
classifier instances are added when traversing the list. Use
tcf_get_next_proto() to traverse filters list without relying on rtnl
mutex. This function iterates over classifiers by taking reference to
current iterator classifier only and doesn't assume external
synchronization of filters list.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h |  2 ++
 net/sched/cls_api.c   | 70 +++++++++++++++++++++++++++++++++++++++++++--------
 net/sched/sch_api.c   |  4 +--
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 38bee7dd21d1..e5dafa5ee1b2 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -46,6 +46,8 @@ struct tcf_chain *tcf_chain_get_by_act(struct tcf_block *block,
 void tcf_chain_put_by_act(struct tcf_chain *chain);
 struct tcf_chain *tcf_get_next_chain(struct tcf_block *block,
 				     struct tcf_chain *chain);
+struct tcf_proto *tcf_get_next_proto(struct tcf_chain *chain,
+				     struct tcf_proto *tp);
 void tcf_block_netif_keep_dst(struct tcf_block *block);
 int tcf_block_get(struct tcf_block **p_block,
 		  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 37c05b96898f..dca8a3bee9c2 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -980,6 +980,45 @@ tcf_get_next_chain(struct tcf_block *block, struct tcf_chain *chain)
 }
 EXPORT_SYMBOL(tcf_get_next_chain);
 
+static struct tcf_proto *
+__tcf_get_next_proto(struct tcf_chain *chain, struct tcf_proto *tp)
+{
+	ASSERT_RTNL();
+	mutex_lock(&chain->filter_chain_lock);
+
+	if (!tp)
+		tp = tcf_chain_dereference(chain->filter_chain, chain);
+	else
+		tp = tcf_chain_dereference(tp->next, chain);
+
+	if (tp)
+		tcf_proto_get(tp);
+
+	mutex_unlock(&chain->filter_chain_lock);
+
+	return tp;
+}
+
+/* Function to be used by all clients that want to iterate over all tp's on
+ * chain. Users of this function must be tolerant to concurrent tp
+ * insertion/deletion or ensure that no concurrent chain modification is
+ * possible. Note that all netlink dump callbacks cannot guarantee to provide
+ * consistent dump because rtnl lock is released each time skb is filled with
+ * data and sent to user-space.
+ */
+
+struct tcf_proto *
+tcf_get_next_proto(struct tcf_chain *chain, struct tcf_proto *tp)
+{
+	struct tcf_proto *tp_next = __tcf_get_next_proto(chain, tp);
+
+	if (tp)
+		tcf_proto_put(tp, NULL);
+
+	return tp_next;
+}
+EXPORT_SYMBOL(tcf_get_next_proto);
+
 static void tcf_block_flush_all_chains(struct tcf_block *block)
 {
 	struct tcf_chain *chain;
@@ -1352,7 +1391,7 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 			    struct netlink_ext_ack *extack)
 {
 	struct tcf_chain *chain, *chain_prev;
-	struct tcf_proto *tp;
+	struct tcf_proto *tp, *tp_prev;
 	int err;
 
 	for (chain = __tcf_get_next_chain(block, NULL);
@@ -1360,8 +1399,10 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 	     chain_prev = chain,
 		     chain = __tcf_get_next_chain(block, chain),
 		     tcf_chain_put(chain_prev)) {
-		for (tp = rtnl_dereference(chain->filter_chain); tp;
-		     tp = rtnl_dereference(tp->next)) {
+		for (tp = __tcf_get_next_proto(chain, NULL); tp;
+		     tp_prev = tp,
+			     tp = __tcf_get_next_proto(chain, tp),
+			     tcf_proto_put(tp_prev, NULL)) {
 			if (tp->ops->reoffload) {
 				err = tp->ops->reoffload(tp, add, cb, cb_priv,
 							 extack);
@@ -1378,6 +1419,7 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 	return 0;
 
 err_playback_remove:
+	tcf_proto_put(tp, NULL);
 	tcf_chain_put(chain);
 	tcf_block_playback_offloads(block, cb, cb_priv, false, offload_in_use,
 				    extack);
@@ -1677,8 +1719,8 @@ static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
 {
 	struct tcf_proto *tp;
 
-	for (tp = rtnl_dereference(chain->filter_chain);
-	     tp; tp = rtnl_dereference(tp->next))
+	for (tp = tcf_get_next_proto(chain, NULL);
+	     tp; tp = tcf_get_next_proto(chain, tp))
 		tfilter_notify(net, oskb, n, tp, block,
 			       q, parent, NULL, event, false);
 }
@@ -2104,11 +2146,15 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 	struct net *net = sock_net(skb->sk);
 	struct tcf_block *block = chain->block;
 	struct tcmsg *tcm = nlmsg_data(cb->nlh);
+	struct tcf_proto *tp, *tp_prev;
 	struct tcf_dump_args arg;
-	struct tcf_proto *tp;
 
-	for (tp = rtnl_dereference(chain->filter_chain);
-	     tp; tp = rtnl_dereference(tp->next), (*p_index)++) {
+	for (tp = __tcf_get_next_proto(chain, NULL);
+	     tp;
+	     tp_prev = tp,
+		     tp = __tcf_get_next_proto(chain, tp),
+		     tcf_proto_put(tp_prev, NULL),
+		     (*p_index)++) {
 		if (*p_index < index_start)
 			continue;
 		if (TC_H_MAJ(tcm->tcm_info) &&
@@ -2125,7 +2171,7 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 					  NETLINK_CB(cb->skb).portid,
 					  cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					  RTM_NEWTFILTER) <= 0)
-				return false;
+				goto errout;
 
 			cb->args[1] = 1;
 		}
@@ -2145,9 +2191,13 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 		cb->args[2] = arg.w.cookie;
 		cb->args[1] = arg.w.count + 1;
 		if (arg.w.stop)
-			return false;
+			goto errout;
 	}
 	return true;
+
+errout:
+	tcf_proto_put(tp, NULL);
+	return false;
 }
 
 /* called with RTNL */
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 80058abc729f..9a530cad2759 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1914,8 +1914,8 @@ static void tc_bind_tclass(struct Qdisc *q, u32 portid, u32 clid,
 	     chain = tcf_get_next_chain(block, chain)) {
 		struct tcf_proto *tp;
 
-		for (tp = rtnl_dereference(chain->filter_chain);
-		     tp; tp = rtnl_dereference(tp->next)) {
+		for (tp = tcf_get_next_proto(chain, NULL);
+		     tp; tp = tcf_get_next_proto(chain, tp)) {
 			struct tcf_bind_args arg = {};
 
 			arg.w.fn = tcf_node_bind;
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (8 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 09/17] net: sched: traverse classifiers in chain with tcf_get_next_proto() Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-15 23:17   ` Cong Wang
  2019-02-18 19:53   ` Cong Wang
  2019-02-11  8:55 ` [PATCH net-next v4 11/17] net: sched: prevent insertion of new classifiers during chain flush Vlad Buslov
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Implement unique insertion function to atomically attach tcf_proto to chain
after verifying that no other tcf proto with specified priority exists.
Implement delete function that verifies that tp is actually empty before
deleting it. Use these functions to refactor cls API to account for
concurrent tp and rule update instead of relying on rtnl lock. Add new
'deleting' flag to tcf proto. Use it to restart search when iterating over
tp's on chain to prevent accessing potentially inval tp->next pointer.

Extend tcf proto with spinlock that is intended to be used to protect its
data from concurrent modification instead of relying on rtnl mutex. Use it
to protect 'deleting' flag. Add lockdep macros to validate that lock is
held when accessing protected fields.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  18 +++++
 net/sched/cls_api.c       | 177 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 170 insertions(+), 25 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4372c08fc4d9..083e566fc380 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -322,6 +322,11 @@ struct tcf_proto {
 	void			*data;
 	const struct tcf_proto_ops	*ops;
 	struct tcf_chain	*chain;
+	/* Lock protects tcf_proto shared state and can be used by unlocked
+	 * classifiers to protect their private data.
+	 */
+	spinlock_t		lock;
+	bool			deleting;
 	refcount_t		refcnt;
 	struct rcu_head		rcu;
 };
@@ -382,16 +387,29 @@ static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
 {
 	return lockdep_is_held(&chain->filter_chain_lock);
 }
+
+static inline bool lockdep_tcf_proto_is_locked(struct tcf_proto *tp)
+{
+	return lockdep_is_held(&tp->lock);
+}
 #else
 static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
 {
 	return true;
 }
+
+static inline bool lockdep_tcf_proto_is_locked(struct tcf_proto *tp)
+{
+	return true;
+}
 #endif /* #ifdef CONFIG_PROVE_LOCKING */
 
 #define tcf_chain_dereference(p, chain)					\
 	rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
 
+#define tcf_proto_dereference(p, tp)					\
+	rcu_dereference_protected(p, lockdep_tcf_proto_is_locked(tp))
+
 static inline void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
 {
 	if (*flags & TCA_CLS_FLAGS_IN_HW)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index dca8a3bee9c2..c6452e3bfc6a 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -180,6 +180,7 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 	tp->protocol = protocol;
 	tp->prio = prio;
 	tp->chain = chain;
+	spin_lock_init(&tp->lock);
 	refcount_set(&tp->refcnt, 1);
 
 	err = tp->ops->init(tp);
@@ -217,6 +218,49 @@ static void tcf_proto_put(struct tcf_proto *tp,
 		tcf_proto_destroy(tp, extack);
 }
 
+static int walker_noop(struct tcf_proto *tp, void *d, struct tcf_walker *arg)
+{
+	return -1;
+}
+
+static bool tcf_proto_is_empty(struct tcf_proto *tp)
+{
+	struct tcf_walker walker = { .fn = walker_noop, };
+
+	if (tp->ops->walk) {
+		tp->ops->walk(tp, &walker);
+		return !walker.stop;
+	}
+	return true;
+}
+
+static bool tcf_proto_check_delete(struct tcf_proto *tp)
+{
+	spin_lock(&tp->lock);
+	if (tcf_proto_is_empty(tp))
+		tp->deleting = true;
+	spin_unlock(&tp->lock);
+	return tp->deleting;
+}
+
+static void tcf_proto_mark_delete(struct tcf_proto *tp)
+{
+	spin_lock(&tp->lock);
+	tp->deleting = true;
+	spin_unlock(&tp->lock);
+}
+
+static bool tcf_proto_is_deleting(struct tcf_proto *tp)
+{
+	bool deleting;
+
+	spin_lock(&tp->lock);
+	deleting = tp->deleting;
+	spin_unlock(&tp->lock);
+
+	return deleting;
+}
+
 #define ASSERT_BLOCK_LOCKED(block)					\
 	lockdep_assert_held(&(block)->lock)
 
@@ -983,13 +1027,27 @@ EXPORT_SYMBOL(tcf_get_next_chain);
 static struct tcf_proto *
 __tcf_get_next_proto(struct tcf_chain *chain, struct tcf_proto *tp)
 {
+	u32 prio = 0;
+
 	ASSERT_RTNL();
 	mutex_lock(&chain->filter_chain_lock);
 
-	if (!tp)
+	if (!tp) {
 		tp = tcf_chain_dereference(chain->filter_chain, chain);
-	else
+	} else if (tcf_proto_is_deleting(tp)) {
+		/* 'deleting' flag is set and chain->filter_chain_lock was
+		 * unlocked, which means next pointer could be invalid. Restart
+		 * search.
+		 */
+		prio = tp->prio + 1;
+		tp = tcf_chain_dereference(chain->filter_chain, chain);
+
+		for (; tp; tp = tcf_chain_dereference(tp->next, chain))
+			if (!tp->deleting && tp->prio >= prio)
+				break;
+	} else {
 		tp = tcf_chain_dereference(tp->next, chain);
+	}
 
 	if (tp)
 		tcf_proto_get(tp);
@@ -1569,6 +1627,7 @@ static void tcf_chain_tp_remove(struct tcf_chain *chain,
 {
 	struct tcf_proto *next = tcf_chain_dereference(chain_info->next, chain);
 
+	tcf_proto_mark_delete(tp);
 	if (tp == chain->filter_chain)
 		tcf_chain0_head_change(chain, next);
 	RCU_INIT_POINTER(*chain_info->pprev, next);
@@ -1577,6 +1636,79 @@ static void tcf_chain_tp_remove(struct tcf_chain *chain,
 static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 					   struct tcf_chain_info *chain_info,
 					   u32 protocol, u32 prio,
+					   bool prio_allocate);
+
+/* Try to insert new proto.
+ * If proto with specified priority already exists, free new proto
+ * and return existing one.
+ */
+
+static struct tcf_proto *tcf_chain_tp_insert_unique(struct tcf_chain *chain,
+						    struct tcf_proto *tp_new,
+						    u32 protocol, u32 prio)
+{
+	struct tcf_chain_info chain_info;
+	struct tcf_proto *tp;
+
+	mutex_lock(&chain->filter_chain_lock);
+
+	tp = tcf_chain_tp_find(chain, &chain_info,
+			       protocol, prio, false);
+	if (!tp)
+		tcf_chain_tp_insert(chain, &chain_info, tp_new);
+	mutex_unlock(&chain->filter_chain_lock);
+
+	if (tp) {
+		tcf_proto_destroy(tp_new, NULL);
+		tp_new = tp;
+	}
+
+	return tp_new;
+}
+
+static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
+				      struct tcf_proto *tp,
+				      struct netlink_ext_ack *extack)
+{
+	struct tcf_chain_info chain_info;
+	struct tcf_proto *tp_iter;
+	struct tcf_proto **pprev;
+	struct tcf_proto *next;
+
+	mutex_lock(&chain->filter_chain_lock);
+
+	/* Atomically find and remove tp from chain. */
+	for (pprev = &chain->filter_chain;
+	     (tp_iter = tcf_chain_dereference(*pprev, chain));
+	     pprev = &tp_iter->next) {
+		if (tp_iter == tp) {
+			chain_info.pprev = pprev;
+			chain_info.next = tp_iter->next;
+			WARN_ON(tp_iter->deleting);
+			break;
+		}
+	}
+	/* Verify that tp still exists and no new filters were inserted
+	 * concurrently.
+	 * Mark tp for deletion if it is empty.
+	 */
+	if (!tp_iter || !tcf_proto_check_delete(tp)) {
+		mutex_unlock(&chain->filter_chain_lock);
+		return;
+	}
+
+	next = tcf_chain_dereference(chain_info.next, chain);
+	if (tp == chain->filter_chain)
+		tcf_chain0_head_change(chain, next);
+	RCU_INIT_POINTER(*chain_info.pprev, next);
+	mutex_unlock(&chain->filter_chain_lock);
+
+	tcf_proto_put(tp, extack);
+}
+
+static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
+					   struct tcf_chain_info *chain_info,
+					   u32 protocol, u32 prio,
 					   bool prio_allocate)
 {
 	struct tcf_proto **pprev;
@@ -1809,6 +1941,8 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	}
 
 	if (tp == NULL) {
+		struct tcf_proto *tp_new = NULL;
+
 		/* Proto-tcf does not exist, create new one */
 
 		if (tca[TCA_KIND] == NULL || !protocol) {
@@ -1828,25 +1962,25 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 							       &chain_info));
 
 		mutex_unlock(&chain->filter_chain_lock);
-		tp = tcf_proto_create(nla_data(tca[TCA_KIND]),
-				      protocol, prio, chain, extack);
-		if (IS_ERR(tp)) {
-			err = PTR_ERR(tp);
+		tp_new = tcf_proto_create(nla_data(tca[TCA_KIND]),
+					  protocol, prio, chain, extack);
+		if (IS_ERR(tp_new)) {
+			err = PTR_ERR(tp_new);
 			goto errout;
 		}
 
-		mutex_lock(&chain->filter_chain_lock);
-		tcf_chain_tp_insert(chain, &chain_info, tp);
-		mutex_unlock(&chain->filter_chain_lock);
 		tp_created = 1;
-	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
-		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
-		err = -EINVAL;
-		goto errout_locked;
+		tp = tcf_chain_tp_insert_unique(chain, tp_new, protocol, prio);
 	} else {
 		mutex_unlock(&chain->filter_chain_lock);
 	}
 
+	if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
+		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
+		err = -EINVAL;
+		goto errout;
+	}
+
 	fh = tp->ops->get(tp, t->tcm_handle);
 
 	if (!fh) {
@@ -1873,12 +2007,10 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	if (err == 0)
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false);
-	else if (tp_created)
-		tcf_proto_destroy(tp, NULL);
 
 errout:
-	if (chain)
-		tcf_chain_put(chain);
+	if (err && tp_created)
+		tcf_chain_tp_delete_empty(chain, tp, NULL);
 	if (chain) {
 		if (tp && !IS_ERR(tp))
 			tcf_proto_put(tp, NULL);
@@ -1984,9 +2116,9 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tcf_chain_tp_remove(chain, &chain_info, tp);
 		mutex_unlock(&chain->filter_chain_lock);
 
+		tcf_proto_put(tp, NULL);
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_DELTFILTER, false);
-		tcf_proto_destroy(tp, extack);
 		err = 0;
 		goto errout;
 	}
@@ -2005,13 +2137,8 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 					 extack);
 		if (err)
 			goto errout;
-		if (last) {
-			mutex_lock(&chain->filter_chain_lock);
-			tcf_chain_tp_remove(chain, &chain_info, tp);
-			mutex_unlock(&chain->filter_chain_lock);
-
-			tcf_proto_destroy(tp, extack);
-		}
+		if (last)
+			tcf_chain_tp_delete_empty(chain, tp, extack);
 	}
 
 errout:
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 11/17] net: sched: prevent insertion of new classifiers during chain flush
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (9 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 12/17] net: sched: track rtnl lock status when validating extensions Vlad Buslov
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Extend tcf_chain with 'flushing' flag. Use the flag to prevent insertion of
new classifier instances when chain flushing is in progress in order to
prevent resource leak when tcf_proto is created by unlocked users
concurrently.

Return EAGAIN error from tcf_chain_tp_insert_unique() to restart
tc_new_tfilter() and lookup the chain/proto again.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  1 +
 net/sched/cls_api.c       | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 083e566fc380..e8cf36ed3e87 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -356,6 +356,7 @@ struct tcf_chain {
 	unsigned int refcnt;
 	unsigned int action_refcnt;
 	bool explicitly_created;
+	bool flushing;
 	const struct tcf_proto_ops *tmplt_ops;
 	void *tmplt_priv;
 };
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index c6452e3bfc6a..3038a82f6591 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -477,9 +477,12 @@ static void __tcf_chain_put(struct tcf_chain *chain, bool by_act,
 	mutex_unlock(&block->lock);
 
 	/* The last dropped non-action reference will trigger notification. */
-	if (is_last && !by_act)
+	if (is_last && !by_act) {
 		tc_chain_notify_delete(tmplt_ops, tmplt_priv, chain_index,
 				       block, NULL, 0, 0, false);
+		/* Last reference to chain, no need to lock. */
+		chain->flushing = false;
+	}
 
 	if (refcnt == 0) {
 		tc_chain_tmplt_del(tmplt_ops, tmplt_priv);
@@ -511,6 +514,7 @@ static void tcf_chain_flush(struct tcf_chain *chain)
 	tp = tcf_chain_dereference(chain->filter_chain, chain);
 	RCU_INIT_POINTER(chain->filter_chain, NULL);
 	tcf_chain0_head_change(chain, NULL);
+	chain->flushing = true;
 	mutex_unlock(&chain->filter_chain_lock);
 
 	while (tp) {
@@ -1610,15 +1614,20 @@ static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain *chain,
 	return tcf_chain_dereference(*chain_info->pprev, chain);
 }
 
-static void tcf_chain_tp_insert(struct tcf_chain *chain,
-				struct tcf_chain_info *chain_info,
-				struct tcf_proto *tp)
+static int tcf_chain_tp_insert(struct tcf_chain *chain,
+			       struct tcf_chain_info *chain_info,
+			       struct tcf_proto *tp)
 {
+	if (chain->flushing)
+		return -EAGAIN;
+
 	if (*chain_info->pprev == chain->filter_chain)
 		tcf_chain0_head_change(chain, tp);
 	tcf_proto_get(tp);
 	RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain, chain_info));
 	rcu_assign_pointer(*chain_info->pprev, tp);
+
+	return 0;
 }
 
 static void tcf_chain_tp_remove(struct tcf_chain *chain,
@@ -1649,18 +1658,22 @@ static struct tcf_proto *tcf_chain_tp_insert_unique(struct tcf_chain *chain,
 {
 	struct tcf_chain_info chain_info;
 	struct tcf_proto *tp;
+	int err = 0;
 
 	mutex_lock(&chain->filter_chain_lock);
 
 	tp = tcf_chain_tp_find(chain, &chain_info,
 			       protocol, prio, false);
 	if (!tp)
-		tcf_chain_tp_insert(chain, &chain_info, tp_new);
+		err = tcf_chain_tp_insert(chain, &chain_info, tp_new);
 	mutex_unlock(&chain->filter_chain_lock);
 
 	if (tp) {
 		tcf_proto_destroy(tp_new, NULL);
 		tp_new = tp;
+	} else if (err) {
+		tcf_proto_destroy(tp_new, NULL);
+		tp_new = ERR_PTR(err);
 	}
 
 	return tp_new;
@@ -1943,6 +1956,11 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	if (tp == NULL) {
 		struct tcf_proto *tp_new = NULL;
 
+		if (chain->flushing) {
+			err = -EAGAIN;
+			goto errout_locked;
+		}
+
 		/* Proto-tcf does not exist, create new one */
 
 		if (tca[TCA_KIND] == NULL || !protocol) {
@@ -1966,11 +1984,15 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 					  protocol, prio, chain, extack);
 		if (IS_ERR(tp_new)) {
 			err = PTR_ERR(tp_new);
-			goto errout;
+			goto errout_tp;
 		}
 
 		tp_created = 1;
 		tp = tcf_chain_tp_insert_unique(chain, tp_new, protocol, prio);
+		if (IS_ERR(tp)) {
+			err = PTR_ERR(tp);
+			goto errout_tp;
+		}
 	} else {
 		mutex_unlock(&chain->filter_chain_lock);
 	}
@@ -2011,6 +2033,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 errout:
 	if (err && tp_created)
 		tcf_chain_tp_delete_empty(chain, tp, NULL);
+errout_tp:
 	if (chain) {
 		if (tp && !IS_ERR(tp))
 			tcf_proto_put(tp, NULL);
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 12/17] net: sched: track rtnl lock status when validating extensions
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (10 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 11/17] net: sched: prevent insertion of new classifiers during chain flush Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 13/17] net: sched: extend proto ops with 'put' callback Vlad Buslov
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Actions API is already updated to not rely on rtnl lock for
synchronization. However, it need to be provided with rtnl status when
called from classifiers API in order to be able to correctly release the
lock when loading kernel module.

Extend extension validation function with 'rtnl_held' flag which is passed
to actions API. Add new 'rtnl_held' parameter to tcf_exts_validate() in cls
API. No classifier is currently updated to support unlocked execution, so
pass hardcoded 'true' flag parameter value.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h    | 2 +-
 net/sched/cls_api.c      | 9 +++++----
 net/sched/cls_basic.c    | 2 +-
 net/sched/cls_bpf.c      | 3 ++-
 net/sched/cls_cgroup.c   | 2 +-
 net/sched/cls_flow.c     | 2 +-
 net/sched/cls_flower.c   | 3 ++-
 net/sched/cls_fw.c       | 2 +-
 net/sched/cls_matchall.c | 3 ++-
 net/sched/cls_route.c    | 2 +-
 net/sched/cls_rsvp.h     | 3 ++-
 net/sched/cls_tcindex.c  | 2 +-
 net/sched/cls_u32.c      | 2 +-
 13 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e5dafa5ee1b2..0e3b61016931 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -416,7 +416,7 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
 
 int tcf_exts_validate(struct net *net, struct tcf_proto *tp,
 		      struct nlattr **tb, struct nlattr *rate_tlv,
-		      struct tcf_exts *exts, bool ovr,
+		      struct tcf_exts *exts, bool ovr, bool rtnl_held,
 		      struct netlink_ext_ack *extack);
 void tcf_exts_destroy(struct tcf_exts *exts);
 void tcf_exts_change(struct tcf_exts *dst, struct tcf_exts *src);
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3038a82f6591..a3e715d34efb 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2841,7 +2841,7 @@ EXPORT_SYMBOL(tcf_exts_destroy);
 
 int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 		      struct nlattr *rate_tlv, struct tcf_exts *exts, bool ovr,
-		      struct netlink_ext_ack *extack)
+		      bool rtnl_held, struct netlink_ext_ack *extack)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	{
@@ -2851,7 +2851,8 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 		if (exts->police && tb[exts->police]) {
 			act = tcf_action_init_1(net, tp, tb[exts->police],
 						rate_tlv, "police", ovr,
-						TCA_ACT_BIND, true, extack);
+						TCA_ACT_BIND, rtnl_held,
+						extack);
 			if (IS_ERR(act))
 				return PTR_ERR(act);
 
@@ -2863,8 +2864,8 @@ int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
 
 			err = tcf_action_init(net, tp, tb[exts->action],
 					      rate_tlv, NULL, ovr, TCA_ACT_BIND,
-					      exts->actions, &attr_size, true,
-					      extack);
+					      exts->actions, &attr_size,
+					      rtnl_held, extack);
 			if (err < 0)
 				return err;
 			exts->nr_actions = err;
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 4a57fec6f306..eaf9c02fe792 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -153,7 +153,7 @@ static int basic_set_parms(struct net *net, struct tcf_proto *tp,
 {
 	int err;
 
-	err = tcf_exts_validate(net, tp, tb, est, &f->exts, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, est, &f->exts, ovr, true, extack);
 	if (err < 0)
 		return err;
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index a95cb240a606..656b3423ad35 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -417,7 +417,8 @@ static int cls_bpf_set_parms(struct net *net, struct tcf_proto *tp,
 	if ((!is_bpf && !is_ebpf) || (is_bpf && is_ebpf))
 		return -EINVAL;
 
-	ret = tcf_exts_validate(net, tp, tb, est, &prog->exts, ovr, extack);
+	ret = tcf_exts_validate(net, tp, tb, est, &prog->exts, ovr, true,
+				extack);
 	if (ret < 0)
 		return ret;
 
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 3bc01bdde165..663ee1c6d606 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -110,7 +110,7 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 		goto errout;
 
 	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &new->exts, ovr,
-				extack);
+				true, extack);
 	if (err < 0)
 		goto errout;
 
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 2bb043cd436b..39a6407d4832 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -445,7 +445,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 		goto err2;
 
 	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &fnew->exts, ovr,
-				extack);
+				true, extack);
 	if (err < 0)
 		goto err2;
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index c5d1db3a3db7..d187bda602fa 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1272,7 +1272,8 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
 {
 	int err;
 
-	err = tcf_exts_validate(net, tp, tb, est, &f->exts, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, est, &f->exts, ovr, true,
+				extack);
 	if (err < 0)
 		return err;
 
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 29eeeaf3ea44..c8173ebb69f2 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -217,7 +217,7 @@ static int fw_set_parms(struct net *net, struct tcf_proto *tp,
 	int err;
 
 	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &f->exts, ovr,
-				extack);
+				true, extack);
 	if (err < 0)
 		return err;
 
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index a1b803fd372e..8848a147c4bf 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -145,7 +145,8 @@ static int mall_set_parms(struct net *net, struct tcf_proto *tp,
 {
 	int err;
 
-	err = tcf_exts_validate(net, tp, tb, est, &head->exts, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, est, &head->exts, ovr, true,
+				extack);
 	if (err < 0)
 		return err;
 
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index 0404aa5fa7cb..44b26038c4c4 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -393,7 +393,7 @@ static int route4_set_parms(struct net *net, struct tcf_proto *tp,
 	struct route4_bucket *b;
 	int err;
 
-	err = tcf_exts_validate(net, tp, tb, est, &f->exts, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, est, &f->exts, ovr, true, extack);
 	if (err < 0)
 		return err;
 
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index e9ccf7daea7d..9dd9530e6a52 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -502,7 +502,8 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 	err = tcf_exts_init(&e, TCA_RSVP_ACT, TCA_RSVP_POLICE);
 	if (err < 0)
 		return err;
-	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &e, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &e, ovr, true,
+				extack);
 	if (err < 0)
 		goto errout2;
 
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index 9ccc93f257db..b7dc667b6ec0 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -314,7 +314,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	err = tcf_exts_init(&e, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
 	if (err < 0)
 		return err;
-	err = tcf_exts_validate(net, tp, tb, est, &e, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, est, &e, ovr, true, extack);
 	if (err < 0)
 		goto errout;
 
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index dcea21004604..e891f30d42e9 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -726,7 +726,7 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
 {
 	int err;
 
-	err = tcf_exts_validate(net, tp, tb, est, &n->exts, ovr, extack);
+	err = tcf_exts_validate(net, tp, tb, est, &n->exts, ovr, true, extack);
 	if (err < 0)
 		return err;
 
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 13/17] net: sched: extend proto ops with 'put' callback
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (11 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 12/17] net: sched: track rtnl lock status when validating extensions Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 14/17] net: sched: extend proto ops to support unlocked classifiers Vlad Buslov
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Add optional tp->ops->put() API to be implemented for filter reference
counting. This new function is called by cls API to release filter
reference for filters returned by tp->ops->change() or tp->ops->get()
functions. Implement tfilter_put() helper to call tp->ops->put() only for
classifiers that implement it.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  1 +
 net/sched/cls_api.c       | 12 +++++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index e8cf36ed3e87..410dda80ca62 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -277,6 +277,7 @@ struct tcf_proto_ops {
 					   struct netlink_ext_ack *extack);
 
 	void*			(*get)(struct tcf_proto*, u32 handle);
+	void			(*put)(struct tcf_proto *tp, void *f);
 	int			(*change)(struct net *net, struct sk_buff *,
 					struct tcf_proto*, unsigned long,
 					u32 handle, struct nlattr **,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index a3e715d34efb..8fe38aa180cf 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1870,6 +1870,12 @@ static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
 			       q, parent, NULL, event, false);
 }
 
+static void tfilter_put(struct tcf_proto *tp, void *fh)
+{
+	if (tp->ops->put && fh)
+		tp->ops->put(tp, fh);
+}
+
 static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			  struct netlink_ext_ack *extack)
 {
@@ -2012,6 +2018,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			goto errout;
 		}
 	} else if (n->nlmsg_flags & NLM_F_EXCL) {
+		tfilter_put(tp, fh);
 		NL_SET_ERR_MSG(extack, "Filter already exists");
 		err = -EEXIST;
 		goto errout;
@@ -2026,9 +2033,11 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE,
 			      extack);
-	if (err == 0)
+	if (err == 0) {
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false);
+		tfilter_put(tp, fh);
+	}
 
 errout:
 	if (err && tp_created)
@@ -2259,6 +2268,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			NL_SET_ERR_MSG(extack, "Failed to send filter notify message");
 	}
 
+	tfilter_put(tp, fh);
 errout:
 	if (chain) {
 		if (tp && !IS_ERR(tp))
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 14/17] net: sched: extend proto ops to support unlocked classifiers
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (12 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 13/17] net: sched: extend proto ops with 'put' callback Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 15/17] net: sched: add flags to Qdisc class ops struct Vlad Buslov
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Add 'rtnl_held' flag to tcf proto change, delete, destroy, dump, walk
functions to track rtnl lock status. Extend users of these function in cls
API to propagate rtnl lock status to them. This allows classifiers to
obtain rtnl lock when necessary and to pass rtnl lock status to extensions
and driver offload callbacks.

Add flags field to tcf proto ops. Add flag value to indicate that
classifier doesn't require rtnl lock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/pkt_cls.h     |   2 +-
 include/net/sch_generic.h |  17 +++--
 net/sched/cls_api.c       | 168 +++++++++++++++++++++++++---------------------
 net/sched/cls_basic.c     |  12 ++--
 net/sched/cls_bpf.c       |  12 ++--
 net/sched/cls_cgroup.c    |  11 +--
 net/sched/cls_flow.c      |  13 ++--
 net/sched/cls_flower.c    |  13 ++--
 net/sched/cls_fw.c        |  13 ++--
 net/sched/cls_matchall.c  |  13 ++--
 net/sched/cls_route.c     |  12 ++--
 net/sched/cls_rsvp.h      |  13 ++--
 net/sched/cls_tcindex.c   |  15 +++--
 net/sched/cls_u32.c       |  12 ++--
 net/sched/sch_api.c       |   6 +-
 15 files changed, 191 insertions(+), 141 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0e3b61016931..6a530bef9253 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -47,7 +47,7 @@ void tcf_chain_put_by_act(struct tcf_chain *chain);
 struct tcf_chain *tcf_get_next_chain(struct tcf_block *block,
 				     struct tcf_chain *chain);
 struct tcf_proto *tcf_get_next_proto(struct tcf_chain *chain,
-				     struct tcf_proto *tp);
+				     struct tcf_proto *tp, bool rtnl_held);
 void tcf_block_netif_keep_dst(struct tcf_block *block);
 int tcf_block_get(struct tcf_block **p_block,
 		  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 410dda80ca62..365801c2a4f5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -273,7 +273,7 @@ struct tcf_proto_ops {
 					    const struct tcf_proto *,
 					    struct tcf_result *);
 	int			(*init)(struct tcf_proto*);
-	void			(*destroy)(struct tcf_proto *tp,
+	void			(*destroy)(struct tcf_proto *tp, bool rtnl_held,
 					   struct netlink_ext_ack *extack);
 
 	void*			(*get)(struct tcf_proto*, u32 handle);
@@ -281,12 +281,13 @@ struct tcf_proto_ops {
 	int			(*change)(struct net *net, struct sk_buff *,
 					struct tcf_proto*, unsigned long,
 					u32 handle, struct nlattr **,
-					void **, bool,
+					void **, bool, bool,
 					struct netlink_ext_ack *);
 	int			(*delete)(struct tcf_proto *tp, void *arg,
-					  bool *last,
+					  bool *last, bool rtnl_held,
 					  struct netlink_ext_ack *);
-	void			(*walk)(struct tcf_proto*, struct tcf_walker *arg);
+	void			(*walk)(struct tcf_proto *tp,
+					struct tcf_walker *arg, bool rtnl_held);
 	int			(*reoffload)(struct tcf_proto *tp, bool add,
 					     tc_setup_cb_t *cb, void *cb_priv,
 					     struct netlink_ext_ack *extack);
@@ -299,12 +300,18 @@ struct tcf_proto_ops {
 
 	/* rtnetlink specific */
 	int			(*dump)(struct net*, struct tcf_proto*, void *,
-					struct sk_buff *skb, struct tcmsg*);
+					struct sk_buff *skb, struct tcmsg*,
+					bool);
 	int			(*tmplt_dump)(struct sk_buff *skb,
 					      struct net *net,
 					      void *tmplt_priv);
 
 	struct module		*owner;
+	int			flags;
+};
+
+enum tcf_proto_ops_flags {
+	TCF_PROTO_OPS_DOIT_UNLOCKED = 1,
 };
 
 struct tcf_proto {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8fe38aa180cf..e8ed461e94af 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -69,7 +69,8 @@ static const struct tcf_proto_ops *__tcf_proto_lookup_ops(const char *kind)
 }
 
 static const struct tcf_proto_ops *
-tcf_proto_lookup_ops(const char *kind, struct netlink_ext_ack *extack)
+tcf_proto_lookup_ops(const char *kind, bool rtnl_held,
+		     struct netlink_ext_ack *extack)
 {
 	const struct tcf_proto_ops *ops;
 
@@ -77,9 +78,11 @@ tcf_proto_lookup_ops(const char *kind, struct netlink_ext_ack *extack)
 	if (ops)
 		return ops;
 #ifdef CONFIG_MODULES
-	rtnl_unlock();
+	if (rtnl_held)
+		rtnl_unlock();
 	request_module("cls_%s", kind);
-	rtnl_lock();
+	if (rtnl_held)
+		rtnl_lock();
 	ops = __tcf_proto_lookup_ops(kind);
 	/* We dropped the RTNL semaphore in order to perform
 	 * the module load. So, even if we succeeded in loading
@@ -162,6 +165,7 @@ static inline u32 tcf_auto_prio(struct tcf_proto *tp)
 
 static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 					  u32 prio, struct tcf_chain *chain,
+					  bool rtnl_held,
 					  struct netlink_ext_ack *extack)
 {
 	struct tcf_proto *tp;
@@ -171,7 +175,7 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 	if (!tp)
 		return ERR_PTR(-ENOBUFS);
 
-	tp->ops = tcf_proto_lookup_ops(kind, extack);
+	tp->ops = tcf_proto_lookup_ops(kind, rtnl_held, extack);
 	if (IS_ERR(tp->ops)) {
 		err = PTR_ERR(tp->ops);
 		goto errout;
@@ -202,20 +206,20 @@ static void tcf_proto_get(struct tcf_proto *tp)
 
 static void tcf_chain_put(struct tcf_chain *chain);
 
-static void tcf_proto_destroy(struct tcf_proto *tp,
+static void tcf_proto_destroy(struct tcf_proto *tp, bool rtnl_held,
 			      struct netlink_ext_ack *extack)
 {
-	tp->ops->destroy(tp, extack);
+	tp->ops->destroy(tp, rtnl_held, extack);
 	tcf_chain_put(tp->chain);
 	module_put(tp->ops->owner);
 	kfree_rcu(tp, rcu);
 }
 
-static void tcf_proto_put(struct tcf_proto *tp,
+static void tcf_proto_put(struct tcf_proto *tp, bool rtnl_held,
 			  struct netlink_ext_ack *extack)
 {
 	if (refcount_dec_and_test(&tp->refcnt))
-		tcf_proto_destroy(tp, extack);
+		tcf_proto_destroy(tp, rtnl_held, extack);
 }
 
 static int walker_noop(struct tcf_proto *tp, void *d, struct tcf_walker *arg)
@@ -223,21 +227,21 @@ static int walker_noop(struct tcf_proto *tp, void *d, struct tcf_walker *arg)
 	return -1;
 }
 
-static bool tcf_proto_is_empty(struct tcf_proto *tp)
+static bool tcf_proto_is_empty(struct tcf_proto *tp, bool rtnl_held)
 {
 	struct tcf_walker walker = { .fn = walker_noop, };
 
 	if (tp->ops->walk) {
-		tp->ops->walk(tp, &walker);
+		tp->ops->walk(tp, &walker, rtnl_held);
 		return !walker.stop;
 	}
 	return true;
 }
 
-static bool tcf_proto_check_delete(struct tcf_proto *tp)
+static bool tcf_proto_check_delete(struct tcf_proto *tp, bool rtnl_held)
 {
 	spin_lock(&tp->lock);
-	if (tcf_proto_is_empty(tp))
+	if (tcf_proto_is_empty(tp, rtnl_held))
 		tp->deleting = true;
 	spin_unlock(&tp->lock);
 	return tp->deleting;
@@ -506,7 +510,7 @@ static void tcf_chain_put_explicitly_created(struct tcf_chain *chain)
 	__tcf_chain_put(chain, false, true);
 }
 
-static void tcf_chain_flush(struct tcf_chain *chain)
+static void tcf_chain_flush(struct tcf_chain *chain, bool rtnl_held)
 {
 	struct tcf_proto *tp, *tp_next;
 
@@ -519,7 +523,7 @@ static void tcf_chain_flush(struct tcf_chain *chain)
 
 	while (tp) {
 		tp_next = rcu_dereference_protected(tp->next, 1);
-		tcf_proto_put(tp, NULL);
+		tcf_proto_put(tp, rtnl_held, NULL);
 		tp = tp_next;
 	}
 }
@@ -1070,18 +1074,19 @@ __tcf_get_next_proto(struct tcf_chain *chain, struct tcf_proto *tp)
  */
 
 struct tcf_proto *
-tcf_get_next_proto(struct tcf_chain *chain, struct tcf_proto *tp)
+tcf_get_next_proto(struct tcf_chain *chain, struct tcf_proto *tp,
+		   bool rtnl_held)
 {
 	struct tcf_proto *tp_next = __tcf_get_next_proto(chain, tp);
 
 	if (tp)
-		tcf_proto_put(tp, NULL);
+		tcf_proto_put(tp, rtnl_held, NULL);
 
 	return tp_next;
 }
 EXPORT_SYMBOL(tcf_get_next_proto);
 
-static void tcf_block_flush_all_chains(struct tcf_block *block)
+static void tcf_block_flush_all_chains(struct tcf_block *block, bool rtnl_held)
 {
 	struct tcf_chain *chain;
 
@@ -1092,12 +1097,12 @@ static void tcf_block_flush_all_chains(struct tcf_block *block)
 	     chain;
 	     chain = tcf_get_next_chain(block, chain)) {
 		tcf_chain_put_explicitly_created(chain);
-		tcf_chain_flush(chain);
+		tcf_chain_flush(chain, rtnl_held);
 	}
 }
 
 static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
-			    struct tcf_block_ext_info *ei)
+			    struct tcf_block_ext_info *ei, bool rtnl_held)
 {
 	if (refcount_dec_and_mutex_lock(&block->refcnt, &block->lock)) {
 		/* Flushing/putting all chains will cause the block to be
@@ -1118,15 +1123,15 @@ static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 		if (free_block)
 			tcf_block_destroy(block);
 		else
-			tcf_block_flush_all_chains(block);
+			tcf_block_flush_all_chains(block, rtnl_held);
 	} else if (q) {
 		tcf_block_offload_unbind(block, q, ei);
 	}
 }
 
-static void tcf_block_refcnt_put(struct tcf_block *block)
+static void tcf_block_refcnt_put(struct tcf_block *block, bool rtnl_held)
 {
-	__tcf_block_put(block, NULL, NULL);
+	__tcf_block_put(block, NULL, NULL, rtnl_held);
 }
 
 /* Find tcf block.
@@ -1244,10 +1249,11 @@ static struct tcf_block *tcf_block_find(struct net *net, struct Qdisc **q,
 	return ERR_PTR(err);
 }
 
-static void tcf_block_release(struct Qdisc *q, struct tcf_block *block)
+static void tcf_block_release(struct Qdisc *q, struct tcf_block *block,
+			      bool rtnl_held)
 {
 	if (!IS_ERR_OR_NULL(block))
-		tcf_block_refcnt_put(block);
+		tcf_block_refcnt_put(block, rtnl_held);
 
 	if (q)
 		qdisc_put(q);
@@ -1358,7 +1364,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
 	tcf_block_owner_del(block, q, ei->binder_type);
 err_block_owner_add:
 err_block_insert:
-	tcf_block_refcnt_put(block);
+	tcf_block_refcnt_put(block, true);
 	return err;
 }
 EXPORT_SYMBOL(tcf_block_get_ext);
@@ -1395,7 +1401,7 @@ void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
 	tcf_chain0_head_change_cb_del(block, ei);
 	tcf_block_owner_del(block, q, ei->binder_type);
 
-	__tcf_block_put(block, q, ei);
+	__tcf_block_put(block, q, ei, true);
 }
 EXPORT_SYMBOL(tcf_block_put_ext);
 
@@ -1464,7 +1470,7 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 		for (tp = __tcf_get_next_proto(chain, NULL); tp;
 		     tp_prev = tp,
 			     tp = __tcf_get_next_proto(chain, tp),
-			     tcf_proto_put(tp_prev, NULL)) {
+			     tcf_proto_put(tp_prev, true, NULL)) {
 			if (tp->ops->reoffload) {
 				err = tp->ops->reoffload(tp, add, cb, cb_priv,
 							 extack);
@@ -1481,7 +1487,7 @@ tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
 	return 0;
 
 err_playback_remove:
-	tcf_proto_put(tp, NULL);
+	tcf_proto_put(tp, true, NULL);
 	tcf_chain_put(chain);
 	tcf_block_playback_offloads(block, cb, cb_priv, false, offload_in_use,
 				    extack);
@@ -1654,7 +1660,8 @@ static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 
 static struct tcf_proto *tcf_chain_tp_insert_unique(struct tcf_chain *chain,
 						    struct tcf_proto *tp_new,
-						    u32 protocol, u32 prio)
+						    u32 protocol, u32 prio,
+						    bool rtnl_held)
 {
 	struct tcf_chain_info chain_info;
 	struct tcf_proto *tp;
@@ -1669,10 +1676,10 @@ static struct tcf_proto *tcf_chain_tp_insert_unique(struct tcf_chain *chain,
 	mutex_unlock(&chain->filter_chain_lock);
 
 	if (tp) {
-		tcf_proto_destroy(tp_new, NULL);
+		tcf_proto_destroy(tp_new, rtnl_held, NULL);
 		tp_new = tp;
 	} else if (err) {
-		tcf_proto_destroy(tp_new, NULL);
+		tcf_proto_destroy(tp_new, rtnl_held, NULL);
 		tp_new = ERR_PTR(err);
 	}
 
@@ -1680,7 +1687,7 @@ static struct tcf_proto *tcf_chain_tp_insert_unique(struct tcf_chain *chain,
 }
 
 static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
-				      struct tcf_proto *tp,
+				      struct tcf_proto *tp, bool rtnl_held,
 				      struct netlink_ext_ack *extack)
 {
 	struct tcf_chain_info chain_info;
@@ -1705,7 +1712,7 @@ static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
 	 * concurrently.
 	 * Mark tp for deletion if it is empty.
 	 */
-	if (!tp_iter || !tcf_proto_check_delete(tp)) {
+	if (!tp_iter || !tcf_proto_check_delete(tp, rtnl_held)) {
 		mutex_unlock(&chain->filter_chain_lock);
 		return;
 	}
@@ -1716,7 +1723,7 @@ static void tcf_chain_tp_delete_empty(struct tcf_chain *chain,
 	RCU_INIT_POINTER(*chain_info.pprev, next);
 	mutex_unlock(&chain->filter_chain_lock);
 
-	tcf_proto_put(tp, extack);
+	tcf_proto_put(tp, rtnl_held, extack);
 }
 
 static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
@@ -1755,7 +1762,8 @@ static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 static int tcf_fill_node(struct net *net, struct sk_buff *skb,
 			 struct tcf_proto *tp, struct tcf_block *block,
 			 struct Qdisc *q, u32 parent, void *fh,
-			 u32 portid, u32 seq, u16 flags, int event)
+			 u32 portid, u32 seq, u16 flags, int event,
+			 bool rtnl_held)
 {
 	struct tcmsg *tcm;
 	struct nlmsghdr  *nlh;
@@ -1783,7 +1791,8 @@ static int tcf_fill_node(struct net *net, struct sk_buff *skb,
 	if (!fh) {
 		tcm->tcm_handle = 0;
 	} else {
-		if (tp->ops->dump && tp->ops->dump(net, tp, fh, skb, tcm) < 0)
+		if (tp->ops->dump &&
+		    tp->ops->dump(net, tp, fh, skb, tcm, rtnl_held) < 0)
 			goto nla_put_failure;
 	}
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
@@ -1798,7 +1807,8 @@ static int tcf_fill_node(struct net *net, struct sk_buff *skb,
 static int tfilter_notify(struct net *net, struct sk_buff *oskb,
 			  struct nlmsghdr *n, struct tcf_proto *tp,
 			  struct tcf_block *block, struct Qdisc *q,
-			  u32 parent, void *fh, int event, bool unicast)
+			  u32 parent, void *fh, int event, bool unicast,
+			  bool rtnl_held)
 {
 	struct sk_buff *skb;
 	u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
@@ -1808,7 +1818,8 @@ static int tfilter_notify(struct net *net, struct sk_buff *oskb,
 		return -ENOBUFS;
 
 	if (tcf_fill_node(net, skb, tp, block, q, parent, fh, portid,
-			  n->nlmsg_seq, n->nlmsg_flags, event) <= 0) {
+			  n->nlmsg_seq, n->nlmsg_flags, event,
+			  rtnl_held) <= 0) {
 		kfree_skb(skb);
 		return -EINVAL;
 	}
@@ -1824,7 +1835,7 @@ static int tfilter_del_notify(struct net *net, struct sk_buff *oskb,
 			      struct nlmsghdr *n, struct tcf_proto *tp,
 			      struct tcf_block *block, struct Qdisc *q,
 			      u32 parent, void *fh, bool unicast, bool *last,
-			      struct netlink_ext_ack *extack)
+			      bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct sk_buff *skb;
 	u32 portid = oskb ? NETLINK_CB(oskb).portid : 0;
@@ -1835,13 +1846,14 @@ static int tfilter_del_notify(struct net *net, struct sk_buff *oskb,
 		return -ENOBUFS;
 
 	if (tcf_fill_node(net, skb, tp, block, q, parent, fh, portid,
-			  n->nlmsg_seq, n->nlmsg_flags, RTM_DELTFILTER) <= 0) {
+			  n->nlmsg_seq, n->nlmsg_flags, RTM_DELTFILTER,
+			  rtnl_held) <= 0) {
 		NL_SET_ERR_MSG(extack, "Failed to build del event notification");
 		kfree_skb(skb);
 		return -EINVAL;
 	}
 
-	err = tp->ops->delete(tp, fh, last, extack);
+	err = tp->ops->delete(tp, fh, last, rtnl_held, extack);
 	if (err) {
 		kfree_skb(skb);
 		return err;
@@ -1860,14 +1872,15 @@ static int tfilter_del_notify(struct net *net, struct sk_buff *oskb,
 static void tfilter_notify_chain(struct net *net, struct sk_buff *oskb,
 				 struct tcf_block *block, struct Qdisc *q,
 				 u32 parent, struct nlmsghdr *n,
-				 struct tcf_chain *chain, int event)
+				 struct tcf_chain *chain, int event,
+				 bool rtnl_held)
 {
 	struct tcf_proto *tp;
 
-	for (tp = tcf_get_next_proto(chain, NULL);
-	     tp; tp = tcf_get_next_proto(chain, tp))
+	for (tp = tcf_get_next_proto(chain, NULL, rtnl_held);
+	     tp; tp = tcf_get_next_proto(chain, tp, rtnl_held))
 		tfilter_notify(net, oskb, n, tp, block,
-			       q, parent, NULL, event, false);
+			       q, parent, NULL, event, false, rtnl_held);
 }
 
 static void tfilter_put(struct tcf_proto *tp, void *fh)
@@ -1896,6 +1909,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	void *fh;
 	int err;
 	int tp_created;
+	bool rtnl_held = true;
 
 	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
@@ -1987,14 +2001,16 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 		mutex_unlock(&chain->filter_chain_lock);
 		tp_new = tcf_proto_create(nla_data(tca[TCA_KIND]),
-					  protocol, prio, chain, extack);
+					  protocol, prio, chain, rtnl_held,
+					  extack);
 		if (IS_ERR(tp_new)) {
 			err = PTR_ERR(tp_new);
 			goto errout_tp;
 		}
 
 		tp_created = 1;
-		tp = tcf_chain_tp_insert_unique(chain, tp_new, protocol, prio);
+		tp = tcf_chain_tp_insert_unique(chain, tp_new, protocol, prio,
+						rtnl_held);
 		if (IS_ERR(tp)) {
 			err = PTR_ERR(tp);
 			goto errout_tp;
@@ -2032,24 +2048,24 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE,
-			      extack);
+			      rtnl_held, extack);
 	if (err == 0) {
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
-			       RTM_NEWTFILTER, false);
+			       RTM_NEWTFILTER, false, rtnl_held);
 		tfilter_put(tp, fh);
 	}
 
 errout:
 	if (err && tp_created)
-		tcf_chain_tp_delete_empty(chain, tp, NULL);
+		tcf_chain_tp_delete_empty(chain, tp, rtnl_held, NULL);
 errout_tp:
 	if (chain) {
 		if (tp && !IS_ERR(tp))
-			tcf_proto_put(tp, NULL);
+			tcf_proto_put(tp, rtnl_held, NULL);
 		if (!tp_created)
 			tcf_chain_put(chain);
 	}
-	tcf_block_release(q, block);
+	tcf_block_release(q, block, rtnl_held);
 	if (err == -EAGAIN)
 		/* Replay the request. */
 		goto replay;
@@ -2078,6 +2094,7 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	unsigned long cl = 0;
 	void *fh = NULL;
 	int err;
+	bool rtnl_held = true;
 
 	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
@@ -2127,8 +2144,8 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 	if (prio == 0) {
 		tfilter_notify_chain(net, skb, block, q, parent, n,
-				     chain, RTM_DELTFILTER);
-		tcf_chain_flush(chain);
+				     chain, RTM_DELTFILTER, rtnl_held);
+		tcf_chain_flush(chain, rtnl_held);
 		err = 0;
 		goto errout;
 	}
@@ -2148,9 +2165,9 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tcf_chain_tp_remove(chain, &chain_info, tp);
 		mutex_unlock(&chain->filter_chain_lock);
 
-		tcf_proto_put(tp, NULL);
+		tcf_proto_put(tp, rtnl_held, NULL);
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
-			       RTM_DELTFILTER, false);
+			       RTM_DELTFILTER, false, rtnl_held);
 		err = 0;
 		goto errout;
 	}
@@ -2166,20 +2183,21 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 		err = tfilter_del_notify(net, skb, n, tp, block,
 					 q, parent, fh, false, &last,
-					 extack);
+					 rtnl_held, extack);
+
 		if (err)
 			goto errout;
 		if (last)
-			tcf_chain_tp_delete_empty(chain, tp, extack);
+			tcf_chain_tp_delete_empty(chain, tp, rtnl_held, extack);
 	}
 
 errout:
 	if (chain) {
 		if (tp && !IS_ERR(tp))
-			tcf_proto_put(tp, NULL);
+			tcf_proto_put(tp, rtnl_held, NULL);
 		tcf_chain_put(chain);
 	}
-	tcf_block_release(q, block);
+	tcf_block_release(q, block, rtnl_held);
 	return err;
 
 errout_locked:
@@ -2205,6 +2223,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	unsigned long cl = 0;
 	void *fh = NULL;
 	int err;
+	bool rtnl_held = true;
 
 	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
 	if (err < 0)
@@ -2263,7 +2282,7 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		err = -ENOENT;
 	} else {
 		err = tfilter_notify(net, skb, n, tp, block, q, parent,
-				     fh, RTM_NEWTFILTER, true);
+				     fh, RTM_NEWTFILTER, true, rtnl_held);
 		if (err < 0)
 			NL_SET_ERR_MSG(extack, "Failed to send filter notify message");
 	}
@@ -2272,10 +2291,10 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 errout:
 	if (chain) {
 		if (tp && !IS_ERR(tp))
-			tcf_proto_put(tp, NULL);
+			tcf_proto_put(tp, rtnl_held, NULL);
 		tcf_chain_put(chain);
 	}
-	tcf_block_release(q, block);
+	tcf_block_release(q, block, rtnl_held);
 	return err;
 }
 
@@ -2296,7 +2315,7 @@ static int tcf_node_dump(struct tcf_proto *tp, void *n, struct tcf_walker *arg)
 	return tcf_fill_node(net, a->skb, tp, a->block, a->q, a->parent,
 			     n, NETLINK_CB(a->cb->skb).portid,
 			     a->cb->nlh->nlmsg_seq, NLM_F_MULTI,
-			     RTM_NEWTFILTER);
+			     RTM_NEWTFILTER, true);
 }
 
 static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
@@ -2313,7 +2332,7 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 	     tp;
 	     tp_prev = tp,
 		     tp = __tcf_get_next_proto(chain, tp),
-		     tcf_proto_put(tp_prev, NULL),
+		     tcf_proto_put(tp_prev, true, NULL),
 		     (*p_index)++) {
 		if (*p_index < index_start)
 			continue;
@@ -2330,9 +2349,8 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 			if (tcf_fill_node(net, skb, tp, block, q, parent, NULL,
 					  NETLINK_CB(cb->skb).portid,
 					  cb->nlh->nlmsg_seq, NLM_F_MULTI,
-					  RTM_NEWTFILTER) <= 0)
+					  RTM_NEWTFILTER, true) <= 0)
 				goto errout;
-
 			cb->args[1] = 1;
 		}
 		if (!tp->ops->walk)
@@ -2347,7 +2365,7 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 		arg.w.skip = cb->args[1] - 1;
 		arg.w.count = 0;
 		arg.w.cookie = cb->args[2];
-		tp->ops->walk(tp, &arg.w);
+		tp->ops->walk(tp, &arg.w, true);
 		cb->args[2] = arg.w.cookie;
 		cb->args[1] = arg.w.count + 1;
 		if (arg.w.stop)
@@ -2356,7 +2374,7 @@ static bool tcf_chain_dump(struct tcf_chain *chain, struct Qdisc *q, u32 parent,
 	return true;
 
 errout:
-	tcf_proto_put(tp, NULL);
+	tcf_proto_put(tp, true, NULL);
 	return false;
 }
 
@@ -2448,7 +2466,7 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	if (tcm->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK)
-		tcf_block_refcnt_put(block);
+		tcf_block_refcnt_put(block, true);
 	cb->args[0] = index;
 
 out:
@@ -2569,7 +2587,7 @@ static int tc_chain_tmplt_add(struct tcf_chain *chain, struct net *net,
 	if (!tca[TCA_KIND])
 		return 0;
 
-	ops = tcf_proto_lookup_ops(nla_data(tca[TCA_KIND]), extack);
+	ops = tcf_proto_lookup_ops(nla_data(tca[TCA_KIND]), true, extack);
 	if (IS_ERR(ops))
 		return PTR_ERR(ops);
 	if (!ops->tmplt_create || !ops->tmplt_destroy || !ops->tmplt_dump) {
@@ -2699,9 +2717,9 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 		break;
 	case RTM_DELCHAIN:
 		tfilter_notify_chain(net, skb, block, q, parent, n,
-				     chain, RTM_DELTFILTER);
+				     chain, RTM_DELTFILTER, true);
 		/* Flush the chain first as the user requested chain removal. */
-		tcf_chain_flush(chain);
+		tcf_chain_flush(chain, true);
 		/* In case the chain was successfully deleted, put a reference
 		 * to the chain previously taken during addition.
 		 */
@@ -2722,7 +2740,7 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 errout:
 	tcf_chain_put(chain);
 errout_block:
-	tcf_block_release(q, block);
+	tcf_block_release(q, block, true);
 	if (err == -EAGAIN)
 		/* Replay the request. */
 		goto replay;
@@ -2829,7 +2847,7 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 	if (tcm->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK)
-		tcf_block_refcnt_put(block);
+		tcf_block_refcnt_put(block, true);
 	cb->args[0] = index;
 
 out:
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index eaf9c02fe792..2383f449d2bc 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -107,7 +107,8 @@ static void basic_delete_filter_work(struct work_struct *work)
 	rtnl_unlock();
 }
 
-static void basic_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void basic_destroy(struct tcf_proto *tp, bool rtnl_held,
+			  struct netlink_ext_ack *extack)
 {
 	struct basic_head *head = rtnl_dereference(tp->root);
 	struct basic_filter *f, *n;
@@ -126,7 +127,7 @@ static void basic_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 }
 
 static int basic_delete(struct tcf_proto *tp, void *arg, bool *last,
-			struct netlink_ext_ack *extack)
+			bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct basic_head *head = rtnl_dereference(tp->root);
 	struct basic_filter *f = arg;
@@ -173,7 +174,7 @@ static int basic_set_parms(struct net *net, struct tcf_proto *tp,
 static int basic_change(struct net *net, struct sk_buff *in_skb,
 			struct tcf_proto *tp, unsigned long base, u32 handle,
 			struct nlattr **tca, void **arg, bool ovr,
-			struct netlink_ext_ack *extack)
+			bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	int err;
 	struct basic_head *head = rtnl_dereference(tp->root);
@@ -247,7 +248,8 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static void basic_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void basic_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		       bool rtnl_held)
 {
 	struct basic_head *head = rtnl_dereference(tp->root);
 	struct basic_filter *f;
@@ -274,7 +276,7 @@ static void basic_bind_class(void *fh, u32 classid, unsigned long cl)
 }
 
 static int basic_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		      struct sk_buff *skb, struct tcmsg *t)
+		      struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct tc_basic_pcnt gpf = {};
 	struct basic_filter *f = fh;
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 656b3423ad35..062350c6621c 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -298,7 +298,7 @@ static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog,
 }
 
 static int cls_bpf_delete(struct tcf_proto *tp, void *arg, bool *last,
-			  struct netlink_ext_ack *extack)
+			  bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 
@@ -307,7 +307,7 @@ static int cls_bpf_delete(struct tcf_proto *tp, void *arg, bool *last,
 	return 0;
 }
 
-static void cls_bpf_destroy(struct tcf_proto *tp,
+static void cls_bpf_destroy(struct tcf_proto *tp, bool rtnl_held,
 			    struct netlink_ext_ack *extack)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
@@ -456,7 +456,8 @@ static int cls_bpf_set_parms(struct net *net, struct tcf_proto *tp,
 static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 			  struct tcf_proto *tp, unsigned long base,
 			  u32 handle, struct nlattr **tca,
-			  void **arg, bool ovr, struct netlink_ext_ack *extack)
+			  void **arg, bool ovr, bool rtnl_held,
+			  struct netlink_ext_ack *extack)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 	struct cls_bpf_prog *oldprog = *arg;
@@ -576,7 +577,7 @@ static int cls_bpf_dump_ebpf_info(const struct cls_bpf_prog *prog,
 }
 
 static int cls_bpf_dump(struct net *net, struct tcf_proto *tp, void *fh,
-			struct sk_buff *skb, struct tcmsg *tm)
+			struct sk_buff *skb, struct tcmsg *tm, bool rtnl_held)
 {
 	struct cls_bpf_prog *prog = fh;
 	struct nlattr *nest;
@@ -636,7 +637,8 @@ static void cls_bpf_bind_class(void *fh, u32 classid, unsigned long cl)
 		prog->res.class = cl;
 }
 
-static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+			 bool rtnl_held)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 	struct cls_bpf_prog *prog;
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 663ee1c6d606..1cef3b416094 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -78,7 +78,7 @@ static void cls_cgroup_destroy_work(struct work_struct *work)
 static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 			     struct tcf_proto *tp, unsigned long base,
 			     u32 handle, struct nlattr **tca,
-			     void **arg, bool ovr,
+			     void **arg, bool ovr, bool rtnl_held,
 			     struct netlink_ext_ack *extack)
 {
 	struct nlattr *tb[TCA_CGROUP_MAX + 1];
@@ -130,7 +130,7 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static void cls_cgroup_destroy(struct tcf_proto *tp,
+static void cls_cgroup_destroy(struct tcf_proto *tp, bool rtnl_held,
 			       struct netlink_ext_ack *extack)
 {
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
@@ -145,12 +145,13 @@ static void cls_cgroup_destroy(struct tcf_proto *tp,
 }
 
 static int cls_cgroup_delete(struct tcf_proto *tp, void *arg, bool *last,
-			     struct netlink_ext_ack *extack)
+			     bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	return -EOPNOTSUPP;
 }
 
-static void cls_cgroup_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void cls_cgroup_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+			    bool rtnl_held)
 {
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 
@@ -166,7 +167,7 @@ static void cls_cgroup_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 }
 
 static int cls_cgroup_dump(struct net *net, struct tcf_proto *tp, void *fh,
-			   struct sk_buff *skb, struct tcmsg *t)
+			   struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 	struct nlattr *nest;
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 39a6407d4832..204e2edae8d5 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -391,7 +391,8 @@ static void flow_destroy_filter_work(struct work_struct *work)
 static int flow_change(struct net *net, struct sk_buff *in_skb,
 		       struct tcf_proto *tp, unsigned long base,
 		       u32 handle, struct nlattr **tca,
-		       void **arg, bool ovr, struct netlink_ext_ack *extack)
+		       void **arg, bool ovr, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
 {
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *fold, *fnew;
@@ -566,7 +567,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 }
 
 static int flow_delete(struct tcf_proto *tp, void *arg, bool *last,
-		       struct netlink_ext_ack *extack)
+		       bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f = arg;
@@ -590,7 +591,8 @@ static int flow_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static void flow_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void flow_destroy(struct tcf_proto *tp, bool rtnl_held,
+			 struct netlink_ext_ack *extack)
 {
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f, *next;
@@ -617,7 +619,7 @@ static void *flow_get(struct tcf_proto *tp, u32 handle)
 }
 
 static int flow_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		     struct sk_buff *skb, struct tcmsg *t)
+		     struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct flow_filter *f = fh;
 	struct nlattr *nest;
@@ -677,7 +679,8 @@ static int flow_dump(struct net *net, struct tcf_proto *tp, void *fh,
 	return -1;
 }
 
-static void flow_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void flow_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		      bool rtnl_held)
 {
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f;
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index d187bda602fa..5663f2b35de0 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -465,7 +465,8 @@ static void fl_destroy_sleepable(struct work_struct *work)
 	module_put(THIS_MODULE);
 }
 
-static void fl_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void fl_destroy(struct tcf_proto *tp, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct fl_flow_mask *mask, *next_mask;
@@ -1300,7 +1301,8 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
 static int fl_change(struct net *net, struct sk_buff *in_skb,
 		     struct tcf_proto *tp, unsigned long base,
 		     u32 handle, struct nlattr **tca,
-		     void **arg, bool ovr, struct netlink_ext_ack *extack)
+		     void **arg, bool ovr, bool rtnl_held,
+		     struct netlink_ext_ack *extack)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct cls_fl_filter *fold = *arg;
@@ -1433,7 +1435,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 }
 
 static int fl_delete(struct tcf_proto *tp, void *arg, bool *last,
-		     struct netlink_ext_ack *extack)
+		     bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct cls_fl_filter *f = arg;
@@ -1445,7 +1447,8 @@ static int fl_delete(struct tcf_proto *tp, void *arg, bool *last,
 	return 0;
 }
 
-static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		    bool rtnl_held)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct cls_fl_filter *f;
@@ -2040,7 +2043,7 @@ static int fl_dump_key(struct sk_buff *skb, struct net *net,
 }
 
 static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		   struct sk_buff *skb, struct tcmsg *t)
+		   struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct cls_fl_filter *f = fh;
 	struct nlattr *nest;
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index c8173ebb69f2..317151bae73b 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -139,7 +139,8 @@ static void fw_delete_filter_work(struct work_struct *work)
 	rtnl_unlock();
 }
 
-static void fw_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void fw_destroy(struct tcf_proto *tp, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	struct fw_filter *f;
@@ -163,7 +164,7 @@ static void fw_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 }
 
 static int fw_delete(struct tcf_proto *tp, void *arg, bool *last,
-		     struct netlink_ext_ack *extack)
+		     bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	struct fw_filter *f = arg;
@@ -250,7 +251,8 @@ static int fw_set_parms(struct net *net, struct tcf_proto *tp,
 static int fw_change(struct net *net, struct sk_buff *in_skb,
 		     struct tcf_proto *tp, unsigned long base,
 		     u32 handle, struct nlattr **tca, void **arg,
-		     bool ovr, struct netlink_ext_ack *extack)
+		     bool ovr, bool rtnl_held,
+		     struct netlink_ext_ack *extack)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	struct fw_filter *f = *arg;
@@ -354,7 +356,8 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static void fw_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void fw_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		    bool rtnl_held)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	int h;
@@ -384,7 +387,7 @@ static void fw_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 }
 
 static int fw_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		   struct sk_buff *skb, struct tcmsg *t)
+		   struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	struct fw_filter *f = fh;
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index 8848a147c4bf..a37137430e61 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -109,7 +109,8 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
 	return 0;
 }
 
-static void mall_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void mall_destroy(struct tcf_proto *tp, bool rtnl_held,
+			 struct netlink_ext_ack *extack)
 {
 	struct cls_mall_head *head = rtnl_dereference(tp->root);
 
@@ -160,7 +161,8 @@ static int mall_set_parms(struct net *net, struct tcf_proto *tp,
 static int mall_change(struct net *net, struct sk_buff *in_skb,
 		       struct tcf_proto *tp, unsigned long base,
 		       u32 handle, struct nlattr **tca,
-		       void **arg, bool ovr, struct netlink_ext_ack *extack)
+		       void **arg, bool ovr, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
 {
 	struct cls_mall_head *head = rtnl_dereference(tp->root);
 	struct nlattr *tb[TCA_MATCHALL_MAX + 1];
@@ -233,12 +235,13 @@ static int mall_change(struct net *net, struct sk_buff *in_skb,
 }
 
 static int mall_delete(struct tcf_proto *tp, void *arg, bool *last,
-		       struct netlink_ext_ack *extack)
+		       bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	return -EOPNOTSUPP;
 }
 
-static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		      bool rtnl_held)
 {
 	struct cls_mall_head *head = rtnl_dereference(tp->root);
 
@@ -280,7 +283,7 @@ static int mall_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
 }
 
 static int mall_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		     struct sk_buff *skb, struct tcmsg *t)
+		     struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct tc_matchall_pcnt gpf = {};
 	struct cls_mall_head *head = fh;
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index 44b26038c4c4..e590c3a2999d 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -276,7 +276,8 @@ static void route4_queue_work(struct route4_filter *f)
 	tcf_queue_work(&f->rwork, route4_delete_filter_work);
 }
 
-static void route4_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void route4_destroy(struct tcf_proto *tp, bool rtnl_held,
+			   struct netlink_ext_ack *extack)
 {
 	struct route4_head *head = rtnl_dereference(tp->root);
 	int h1, h2;
@@ -312,7 +313,7 @@ static void route4_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 }
 
 static int route4_delete(struct tcf_proto *tp, void *arg, bool *last,
-			 struct netlink_ext_ack *extack)
+			 bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct route4_head *head = rtnl_dereference(tp->root);
 	struct route4_filter *f = arg;
@@ -468,7 +469,7 @@ static int route4_set_parms(struct net *net, struct tcf_proto *tp,
 static int route4_change(struct net *net, struct sk_buff *in_skb,
 			 struct tcf_proto *tp, unsigned long base, u32 handle,
 			 struct nlattr **tca, void **arg, bool ovr,
-			 struct netlink_ext_ack *extack)
+			 bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct route4_head *head = rtnl_dereference(tp->root);
 	struct route4_filter __rcu **fp;
@@ -560,7 +561,8 @@ static int route4_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static void route4_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void route4_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+			bool rtnl_held)
 {
 	struct route4_head *head = rtnl_dereference(tp->root);
 	unsigned int h, h1;
@@ -597,7 +599,7 @@ static void route4_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 }
 
 static int route4_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		       struct sk_buff *skb, struct tcmsg *t)
+		       struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct route4_filter *f = fh;
 	struct nlattr *nest;
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 9dd9530e6a52..4d3836178fa5 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -312,7 +312,8 @@ static void rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
 		__rsvp_delete_filter(f);
 }
 
-static void rsvp_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void rsvp_destroy(struct tcf_proto *tp, bool rtnl_held,
+			 struct netlink_ext_ack *extack)
 {
 	struct rsvp_head *data = rtnl_dereference(tp->root);
 	int h1, h2;
@@ -341,7 +342,7 @@ static void rsvp_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 }
 
 static int rsvp_delete(struct tcf_proto *tp, void *arg, bool *last,
-		       struct netlink_ext_ack *extack)
+		       bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct rsvp_head *head = rtnl_dereference(tp->root);
 	struct rsvp_filter *nfp, *f = arg;
@@ -477,7 +478,8 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 		       struct tcf_proto *tp, unsigned long base,
 		       u32 handle,
 		       struct nlattr **tca,
-		       void **arg, bool ovr, struct netlink_ext_ack *extack)
+		       void **arg, bool ovr, bool rtnl_held,
+		       struct netlink_ext_ack *extack)
 {
 	struct rsvp_head *data = rtnl_dereference(tp->root);
 	struct rsvp_filter *f, *nfp;
@@ -655,7 +657,8 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static void rsvp_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void rsvp_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		      bool rtnl_held)
 {
 	struct rsvp_head *head = rtnl_dereference(tp->root);
 	unsigned int h, h1;
@@ -689,7 +692,7 @@ static void rsvp_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 }
 
 static int rsvp_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		     struct sk_buff *skb, struct tcmsg *t)
+		     struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct rsvp_filter *f = fh;
 	struct rsvp_session *s;
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index b7dc667b6ec0..14d6b4058045 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -173,7 +173,7 @@ static void tcindex_destroy_fexts_work(struct work_struct *work)
 }
 
 static int tcindex_delete(struct tcf_proto *tp, void *arg, bool *last,
-			  struct netlink_ext_ack *extack)
+			  bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct tcindex_data *p = rtnl_dereference(tp->root);
 	struct tcindex_filter_result *r = arg;
@@ -226,7 +226,7 @@ static int tcindex_destroy_element(struct tcf_proto *tp,
 {
 	bool last;
 
-	return tcindex_delete(tp, arg, &last, NULL);
+	return tcindex_delete(tp, arg, &last, false, NULL);
 }
 
 static void __tcindex_destroy(struct rcu_head *head)
@@ -499,7 +499,7 @@ static int
 tcindex_change(struct net *net, struct sk_buff *in_skb,
 	       struct tcf_proto *tp, unsigned long base, u32 handle,
 	       struct nlattr **tca, void **arg, bool ovr,
-	       struct netlink_ext_ack *extack)
+	       bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_TCINDEX_MAX + 1];
@@ -522,7 +522,8 @@ tcindex_change(struct net *net, struct sk_buff *in_skb,
 				 tca[TCA_RATE], ovr, extack);
 }
 
-static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker)
+static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker,
+			 bool rtnl_held)
 {
 	struct tcindex_data *p = rtnl_dereference(tp->root);
 	struct tcindex_filter *f, *next;
@@ -558,7 +559,7 @@ static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker)
 	}
 }
 
-static void tcindex_destroy(struct tcf_proto *tp,
+static void tcindex_destroy(struct tcf_proto *tp, bool rtnl_held,
 			    struct netlink_ext_ack *extack)
 {
 	struct tcindex_data *p = rtnl_dereference(tp->root);
@@ -568,14 +569,14 @@ static void tcindex_destroy(struct tcf_proto *tp,
 	walker.count = 0;
 	walker.skip = 0;
 	walker.fn = tcindex_destroy_element;
-	tcindex_walk(tp, &walker);
+	tcindex_walk(tp, &walker, true);
 
 	call_rcu(&p->rcu, __tcindex_destroy);
 }
 
 
 static int tcindex_dump(struct net *net, struct tcf_proto *tp, void *fh,
-			struct sk_buff *skb, struct tcmsg *t)
+			struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct tcindex_data *p = rtnl_dereference(tp->root);
 	struct tcindex_filter_result *r = fh;
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index e891f30d42e9..27d29c04dcc9 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -629,7 +629,8 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
 	return -ENOENT;
 }
 
-static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
+static void u32_destroy(struct tcf_proto *tp, bool rtnl_held,
+			struct netlink_ext_ack *extack)
 {
 	struct tc_u_common *tp_c = tp->data;
 	struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
@@ -663,7 +664,7 @@ static void u32_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 }
 
 static int u32_delete(struct tcf_proto *tp, void *arg, bool *last,
-		      struct netlink_ext_ack *extack)
+		      bool rtnl_held, struct netlink_ext_ack *extack)
 {
 	struct tc_u_hnode *ht = arg;
 	struct tc_u_common *tp_c = tp->data;
@@ -858,7 +859,7 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp,
 
 static int u32_change(struct net *net, struct sk_buff *in_skb,
 		      struct tcf_proto *tp, unsigned long base, u32 handle,
-		      struct nlattr **tca, void **arg, bool ovr,
+		      struct nlattr **tca, void **arg, bool ovr, bool rtnl_held,
 		      struct netlink_ext_ack *extack)
 {
 	struct tc_u_common *tp_c = tp->data;
@@ -1123,7 +1124,8 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg)
+static void u32_walk(struct tcf_proto *tp, struct tcf_walker *arg,
+		     bool rtnl_held)
 {
 	struct tc_u_common *tp_c = tp->data;
 	struct tc_u_hnode *ht;
@@ -1281,7 +1283,7 @@ static void u32_bind_class(void *fh, u32 classid, unsigned long cl)
 }
 
 static int u32_dump(struct net *net, struct tcf_proto *tp, void *fh,
-		    struct sk_buff *skb, struct tcmsg *t)
+		    struct sk_buff *skb, struct tcmsg *t, bool rtnl_held)
 {
 	struct tc_u_knode *n = fh;
 	struct tc_u_hnode *ht_up, *ht_down;
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 9a530cad2759..2283924fb56d 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1914,14 +1914,14 @@ static void tc_bind_tclass(struct Qdisc *q, u32 portid, u32 clid,
 	     chain = tcf_get_next_chain(block, chain)) {
 		struct tcf_proto *tp;
 
-		for (tp = tcf_get_next_proto(chain, NULL);
-		     tp; tp = tcf_get_next_proto(chain, tp)) {
+		for (tp = tcf_get_next_proto(chain, NULL, true);
+		     tp; tp = tcf_get_next_proto(chain, tp, true)) {
 			struct tcf_bind_args arg = {};
 
 			arg.w.fn = tcf_node_bind;
 			arg.classid = clid;
 			arg.cl = new_cl;
-			tp->ops->walk(tp, &arg.w);
+			tp->ops->walk(tp, &arg.w, true);
 		}
 	}
 }
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 15/17] net: sched: add flags to Qdisc class ops struct
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (13 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 14/17] net: sched: extend proto ops to support unlocked classifiers Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 16/17] net: sched: refactor tcf_block_find() into standalone functions Vlad Buslov
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Extend Qdisc_class_ops with flags. Create enum to hold possible class ops
flag values. Add first class ops flags value QDISC_CLASS_OPS_DOIT_UNLOCKED
to indicate that class ops functions can be called without taking rtnl
lock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 365801c2a4f5..e50b729f8691 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -179,6 +179,7 @@ static inline int qdisc_avail_bulklimit(const struct netdev_queue *txq)
 }
 
 struct Qdisc_class_ops {
+	unsigned int		flags;
 	/* Child qdisc manipulation */
 	struct netdev_queue *	(*select_queue)(struct Qdisc *, struct tcmsg *);
 	int			(*graft)(struct Qdisc *, unsigned long cl,
@@ -210,6 +211,13 @@ struct Qdisc_class_ops {
 					struct gnet_dump *);
 };
 
+/* Qdisc_class_ops flag values */
+
+/* Implements API that doesn't require rtnl lock */
+enum qdisc_class_ops_flags {
+	QDISC_CLASS_OPS_DOIT_UNLOCKED = 1,
+};
+
 struct Qdisc_ops {
 	struct Qdisc_ops	*next;
 	const struct Qdisc_class_ops	*cl_ops;
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 16/17] net: sched: refactor tcf_block_find() into standalone functions
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (14 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 15/17] net: sched: add flags to Qdisc class ops struct Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-11  8:55 ` [PATCH net-next v4 17/17] net: sched: unlock rules update API Vlad Buslov
  2019-02-12 18:42 ` [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock David Miller
  17 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Refactor tcf_block_find() code into three standalone functions:
- __tcf_qdisc_find() to lookup Qdisc and increment its reference counter.
- __tcf_qdisc_cl_find() to lookup class.
- __tcf_block_find() to lookup block and increment its reference counter.

This change is necessary to allow netlink tc rule update handlers to call
these functions directly in order to conditionally take rtnl lock
according to Qdisc class ops flags before calling any of class ops
functions.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_api.c | 241 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 149 insertions(+), 92 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index e8ed461e94af..5f9373ee47ce 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -1101,6 +1101,142 @@ static void tcf_block_flush_all_chains(struct tcf_block *block, bool rtnl_held)
 	}
 }
 
+/* Lookup Qdisc and increments its reference counter.
+ * Set parent, if necessary.
+ */
+
+static int __tcf_qdisc_find(struct net *net, struct Qdisc **q,
+			    u32 *parent, int ifindex, bool rtnl_held,
+			    struct netlink_ext_ack *extack)
+{
+	const struct Qdisc_class_ops *cops;
+	struct net_device *dev;
+	int err = 0;
+
+	if (ifindex == TCM_IFINDEX_MAGIC_BLOCK)
+		return 0;
+
+	rcu_read_lock();
+
+	/* Find link */
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev) {
+		rcu_read_unlock();
+		return -ENODEV;
+	}
+
+	/* Find qdisc */
+	if (!*parent) {
+		*q = dev->qdisc;
+		*parent = (*q)->handle;
+	} else {
+		*q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
+		if (!*q) {
+			NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
+			err = -EINVAL;
+			goto errout_rcu;
+		}
+	}
+
+	*q = qdisc_refcount_inc_nz(*q);
+	if (!*q) {
+		NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
+		err = -EINVAL;
+		goto errout_rcu;
+	}
+
+	/* Is it classful? */
+	cops = (*q)->ops->cl_ops;
+	if (!cops) {
+		NL_SET_ERR_MSG(extack, "Qdisc not classful");
+		err = -EINVAL;
+		goto errout_qdisc;
+	}
+
+	if (!cops->tcf_block) {
+		NL_SET_ERR_MSG(extack, "Class doesn't support blocks");
+		err = -EOPNOTSUPP;
+		goto errout_qdisc;
+	}
+
+errout_rcu:
+	/* At this point we know that qdisc is not noop_qdisc,
+	 * which means that qdisc holds a reference to net_device
+	 * and we hold a reference to qdisc, so it is safe to release
+	 * rcu read lock.
+	 */
+	rcu_read_unlock();
+	return err;
+
+errout_qdisc:
+	rcu_read_unlock();
+
+	if (rtnl_held)
+		qdisc_put(*q);
+	else
+		qdisc_put_unlocked(*q);
+	*q = NULL;
+
+	return err;
+}
+
+static int __tcf_qdisc_cl_find(struct Qdisc *q, u32 parent, unsigned long *cl,
+			       int ifindex, struct netlink_ext_ack *extack)
+{
+	if (ifindex == TCM_IFINDEX_MAGIC_BLOCK)
+		return 0;
+
+	/* Do we search for filter, attached to class? */
+	if (TC_H_MIN(parent)) {
+		const struct Qdisc_class_ops *cops = q->ops->cl_ops;
+
+		*cl = cops->find(q, parent);
+		if (*cl == 0) {
+			NL_SET_ERR_MSG(extack, "Specified class doesn't exist");
+			return -ENOENT;
+		}
+	}
+
+	return 0;
+}
+
+static struct tcf_block *__tcf_block_find(struct net *net, struct Qdisc *q,
+					  unsigned long cl, int ifindex,
+					  u32 block_index,
+					  struct netlink_ext_ack *extack)
+{
+	struct tcf_block *block;
+
+	if (ifindex == TCM_IFINDEX_MAGIC_BLOCK) {
+		block = tcf_block_refcnt_get(net, block_index);
+		if (!block) {
+			NL_SET_ERR_MSG(extack, "Block of given index was not found");
+			return ERR_PTR(-EINVAL);
+		}
+	} else {
+		const struct Qdisc_class_ops *cops = q->ops->cl_ops;
+
+		block = cops->tcf_block(q, cl, extack);
+		if (!block)
+			return ERR_PTR(-EINVAL);
+
+		if (tcf_block_shared(block)) {
+			NL_SET_ERR_MSG(extack, "This filter block is shared. Please use the block index to manipulate the filters");
+			return ERR_PTR(-EOPNOTSUPP);
+		}
+
+		/* Always take reference to block in order to support execution
+		 * of rules update path of cls API without rtnl lock. Caller
+		 * must release block when it is finished using it. 'if' block
+		 * of this conditional obtain reference to block by calling
+		 * tcf_block_refcnt_get().
+		 */
+		refcount_inc(&block->refcnt);
+	}
+
+	return block;
+}
+
 static void __tcf_block_put(struct tcf_block *block, struct Qdisc *q,
 			    struct tcf_block_ext_info *ei, bool rtnl_held)
 {
@@ -1146,106 +1282,27 @@ static struct tcf_block *tcf_block_find(struct net *net, struct Qdisc **q,
 	struct tcf_block *block;
 	int err = 0;
 
-	if (ifindex == TCM_IFINDEX_MAGIC_BLOCK) {
-		block = tcf_block_refcnt_get(net, block_index);
-		if (!block) {
-			NL_SET_ERR_MSG(extack, "Block of given index was not found");
-			return ERR_PTR(-EINVAL);
-		}
-	} else {
-		const struct Qdisc_class_ops *cops;
-		struct net_device *dev;
-
-		rcu_read_lock();
-
-		/* Find link */
-		dev = dev_get_by_index_rcu(net, ifindex);
-		if (!dev) {
-			rcu_read_unlock();
-			return ERR_PTR(-ENODEV);
-		}
-
-		/* Find qdisc */
-		if (!*parent) {
-			*q = dev->qdisc;
-			*parent = (*q)->handle;
-		} else {
-			*q = qdisc_lookup_rcu(dev, TC_H_MAJ(*parent));
-			if (!*q) {
-				NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
-				err = -EINVAL;
-				goto errout_rcu;
-			}
-		}
-
-		*q = qdisc_refcount_inc_nz(*q);
-		if (!*q) {
-			NL_SET_ERR_MSG(extack, "Parent Qdisc doesn't exists");
-			err = -EINVAL;
-			goto errout_rcu;
-		}
-
-		/* Is it classful? */
-		cops = (*q)->ops->cl_ops;
-		if (!cops) {
-			NL_SET_ERR_MSG(extack, "Qdisc not classful");
-			err = -EINVAL;
-			goto errout_rcu;
-		}
-
-		if (!cops->tcf_block) {
-			NL_SET_ERR_MSG(extack, "Class doesn't support blocks");
-			err = -EOPNOTSUPP;
-			goto errout_rcu;
-		}
-
-		/* At this point we know that qdisc is not noop_qdisc,
-		 * which means that qdisc holds a reference to net_device
-		 * and we hold a reference to qdisc, so it is safe to release
-		 * rcu read lock.
-		 */
-		rcu_read_unlock();
+	ASSERT_RTNL();
 
-		/* Do we search for filter, attached to class? */
-		if (TC_H_MIN(*parent)) {
-			*cl = cops->find(*q, *parent);
-			if (*cl == 0) {
-				NL_SET_ERR_MSG(extack, "Specified class doesn't exist");
-				err = -ENOENT;
-				goto errout_qdisc;
-			}
-		}
+	err = __tcf_qdisc_find(net, q, parent, ifindex, true, extack);
+	if (err)
+		goto errout;
 
-		/* And the last stroke */
-		block = cops->tcf_block(*q, *cl, extack);
-		if (!block) {
-			err = -EINVAL;
-			goto errout_qdisc;
-		}
-		if (tcf_block_shared(block)) {
-			NL_SET_ERR_MSG(extack, "This filter block is shared. Please use the block index to manipulate the filters");
-			err = -EOPNOTSUPP;
-			goto errout_qdisc;
-		}
+	err = __tcf_qdisc_cl_find(*q, *parent, cl, ifindex, extack);
+	if (err)
+		goto errout_qdisc;
 
-		/* Always take reference to block in order to support execution
-		 * of rules update path of cls API without rtnl lock. Caller
-		 * must release block when it is finished using it. 'if' block
-		 * of this conditional obtain reference to block by calling
-		 * tcf_block_refcnt_get().
-		 */
-		refcount_inc(&block->refcnt);
-	}
+	block = __tcf_block_find(net, *q, *cl, ifindex, block_index, extack);
+	if (IS_ERR(block))
+		goto errout_qdisc;
 
 	return block;
 
-errout_rcu:
-	rcu_read_unlock();
 errout_qdisc:
-	if (*q) {
+	if (*q)
 		qdisc_put(*q);
-		*q = NULL;
-	}
+errout:
+	*q = NULL;
 	return ERR_PTR(err);
 }
 
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH net-next v4 17/17] net: sched: unlock rules update API
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (15 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 16/17] net: sched: refactor tcf_block_find() into standalone functions Vlad Buslov
@ 2019-02-11  8:55 ` Vlad Buslov
  2019-02-18 18:56   ` Cong Wang
  2019-02-12 18:42 ` [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock David Miller
  17 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-11  8:55 UTC (permalink / raw)
  To: netdev; +Cc: jhs, xiyou.wangcong, jiri, davem, ast, daniel, Vlad Buslov

Register netlink protocol handlers for message types RTM_NEWTFILTER,
RTM_DELTFILTER, RTM_GETTFILTER as unlocked. Set rtnl_held variable that
tracks rtnl mutex state to be false by default.

Introduce tcf_proto_is_unlocked() helper that is used to check
tcf_proto_ops->flag to determine if ops can be called without taking rtnl
lock. Manually lookup Qdisc, class and block in rule update handlers.
Verify that both Qdisc ops and proto ops are unlocked before using any of
their callbacks, and obtain rtnl lock otherwise.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 net/sched/cls_api.c | 131 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 114 insertions(+), 17 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 5f9373ee47ce..266fcb34fefe 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -163,6 +163,23 @@ static inline u32 tcf_auto_prio(struct tcf_proto *tp)
 	return TC_H_MAJ(first);
 }
 
+static bool tcf_proto_is_unlocked(const char *kind)
+{
+	const struct tcf_proto_ops *ops;
+	bool ret;
+
+	ops = tcf_proto_lookup_ops(kind, false, NULL);
+	/* On error return false to take rtnl lock. Proto lookup/create
+	 * functions will perform lookup again and properly handle errors.
+	 */
+	if (IS_ERR(ops))
+		return false;
+
+	ret = !!(ops->flags & TCF_PROTO_OPS_DOIT_UNLOCKED);
+	module_put(ops->owner);
+	return ret;
+}
+
 static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
 					  u32 prio, struct tcf_chain *chain,
 					  bool rtnl_held,
@@ -1312,8 +1329,12 @@ static void tcf_block_release(struct Qdisc *q, struct tcf_block *block,
 	if (!IS_ERR_OR_NULL(block))
 		tcf_block_refcnt_put(block, rtnl_held);
 
-	if (q)
-		qdisc_put(q);
+	if (q) {
+		if (rtnl_held)
+			qdisc_put(q);
+		else
+			qdisc_put_unlocked(q);
+	}
 }
 
 struct tcf_block_owner_item {
@@ -1966,7 +1987,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	void *fh;
 	int err;
 	int tp_created;
-	bool rtnl_held = true;
+	bool rtnl_held = false;
 
 	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
@@ -1985,6 +2006,7 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	parent = t->tcm_parent;
 	tp = NULL;
 	cl = 0;
+	block = NULL;
 
 	if (prio == 0) {
 		/* If no priority is provided by the user,
@@ -2001,8 +2023,27 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 	/* Find head of filter chain. */
 
-	block = tcf_block_find(net, &q, &parent, &cl,
-			       t->tcm_ifindex, t->tcm_block_index, extack);
+	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
+	if (err)
+		return err;
+
+	/* Take rtnl mutex if rtnl_held was set to true on previous iteration,
+	 * block is shared (no qdisc found), qdisc is not unlocked, classifier
+	 * type is not specified, classifier is not unlocked.
+	 */
+	if (rtnl_held ||
+	    (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
+	    !tca[TCA_KIND] || !tcf_proto_is_unlocked(nla_data(tca[TCA_KIND]))) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
+	err = __tcf_qdisc_cl_find(q, parent, &cl, t->tcm_ifindex, extack);
+	if (err)
+		goto errout;
+
+	block = __tcf_block_find(net, q, cl, t->tcm_ifindex, t->tcm_block_index,
+				 extack);
 	if (IS_ERR(block)) {
 		err = PTR_ERR(block);
 		goto errout;
@@ -2123,9 +2164,18 @@ static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 			tcf_chain_put(chain);
 	}
 	tcf_block_release(q, block, rtnl_held);
-	if (err == -EAGAIN)
+
+	if (rtnl_held)
+		rtnl_unlock();
+
+	if (err == -EAGAIN) {
+		/* Take rtnl lock in case EAGAIN is caused by concurrent flush
+		 * of target chain.
+		 */
+		rtnl_held = true;
 		/* Replay the request. */
 		goto replay;
+	}
 	return err;
 
 errout_locked:
@@ -2146,12 +2196,12 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	struct Qdisc *q = NULL;
 	struct tcf_chain_info chain_info;
 	struct tcf_chain *chain = NULL;
-	struct tcf_block *block;
+	struct tcf_block *block = NULL;
 	struct tcf_proto *tp = NULL;
 	unsigned long cl = 0;
 	void *fh = NULL;
 	int err;
-	bool rtnl_held = true;
+	bool rtnl_held = false;
 
 	if (!netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
 		return -EPERM;
@@ -2172,8 +2222,27 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 	/* Find head of filter chain. */
 
-	block = tcf_block_find(net, &q, &parent, &cl,
-			       t->tcm_ifindex, t->tcm_block_index, extack);
+	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
+	if (err)
+		return err;
+
+	/* Take rtnl mutex if flushing whole chain, block is shared (no qdisc
+	 * found), qdisc is not unlocked, classifier type is not specified,
+	 * classifier is not unlocked.
+	 */
+	if (!prio ||
+	    (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
+	    !tca[TCA_KIND] || !tcf_proto_is_unlocked(nla_data(tca[TCA_KIND]))) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
+	err = __tcf_qdisc_cl_find(q, parent, &cl, t->tcm_ifindex, extack);
+	if (err)
+		goto errout;
+
+	block = __tcf_block_find(net, q, cl, t->tcm_ifindex, t->tcm_block_index,
+				 extack);
 	if (IS_ERR(block)) {
 		err = PTR_ERR(block);
 		goto errout;
@@ -2255,6 +2324,10 @@ static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tcf_chain_put(chain);
 	}
 	tcf_block_release(q, block, rtnl_held);
+
+	if (rtnl_held)
+		rtnl_unlock();
+
 	return err;
 
 errout_locked:
@@ -2275,12 +2348,12 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	struct Qdisc *q = NULL;
 	struct tcf_chain_info chain_info;
 	struct tcf_chain *chain = NULL;
-	struct tcf_block *block;
+	struct tcf_block *block = NULL;
 	struct tcf_proto *tp = NULL;
 	unsigned long cl = 0;
 	void *fh = NULL;
 	int err;
-	bool rtnl_held = true;
+	bool rtnl_held = false;
 
 	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, rtm_tca_policy, extack);
 	if (err < 0)
@@ -2298,8 +2371,26 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 
 	/* Find head of filter chain. */
 
-	block = tcf_block_find(net, &q, &parent, &cl,
-			       t->tcm_ifindex, t->tcm_block_index, extack);
+	err = __tcf_qdisc_find(net, &q, &parent, t->tcm_ifindex, false, extack);
+	if (err)
+		return err;
+
+	/* Take rtnl mutex if block is shared (no qdisc found), qdisc is not
+	 * unlocked, classifier type is not specified, classifier is not
+	 * unlocked.
+	 */
+	if ((q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
+	    !tca[TCA_KIND] || !tcf_proto_is_unlocked(nla_data(tca[TCA_KIND]))) {
+		rtnl_held = true;
+		rtnl_lock();
+	}
+
+	err = __tcf_qdisc_cl_find(q, parent, &cl, t->tcm_ifindex, extack);
+	if (err)
+		goto errout;
+
+	block = __tcf_block_find(net, q, cl, t->tcm_ifindex, t->tcm_block_index,
+				 extack);
 	if (IS_ERR(block)) {
 		err = PTR_ERR(block);
 		goto errout;
@@ -2352,6 +2443,10 @@ static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tcf_chain_put(chain);
 	}
 	tcf_block_release(q, block, rtnl_held);
+
+	if (rtnl_held)
+		rtnl_unlock();
+
 	return err;
 }
 
@@ -3214,10 +3309,12 @@ static int __init tc_filter_init(void)
 	if (err)
 		goto err_rhash_setup_block_ht;
 
-	rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_new_tfilter, NULL, 0);
-	rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_del_tfilter, NULL, 0);
+	rtnl_register(PF_UNSPEC, RTM_NEWTFILTER, tc_new_tfilter, NULL,
+		      RTNL_FLAG_DOIT_UNLOCKED);
+	rtnl_register(PF_UNSPEC, RTM_DELTFILTER, tc_del_tfilter, NULL,
+		      RTNL_FLAG_DOIT_UNLOCKED);
 	rtnl_register(PF_UNSPEC, RTM_GETTFILTER, tc_get_tfilter,
-		      tc_dump_tfilter, 0);
+		      tc_dump_tfilter, RTNL_FLAG_DOIT_UNLOCKED);
 	rtnl_register(PF_UNSPEC, RTM_NEWCHAIN, tc_ctl_chain, NULL, 0);
 	rtnl_register(PF_UNSPEC, RTM_DELCHAIN, tc_ctl_chain, NULL, 0);
 	rtnl_register(PF_UNSPEC, RTM_GETCHAIN, tc_ctl_chain,
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 01/17] net: sched: protect block state with mutex
  2019-02-11  8:55 ` [PATCH net-next v4 01/17] net: sched: protect block state with mutex Vlad Buslov
@ 2019-02-11 14:15   ` Jiri Pirko
  0 siblings, 0 replies; 52+ messages in thread
From: Jiri Pirko @ 2019-02-11 14:15 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, ast, daniel

Mon, Feb 11, 2019 at 09:55:32AM CET, vladbu@mellanox.com wrote:
>Currently, tcf_block doesn't use any synchronization mechanisms to protect
>critical sections that manage lifetime of its chains. block->chain_list and
>multiple variables in tcf_chain that control its lifetime assume external
>synchronization provided by global rtnl lock. Converting chain reference
>counting to atomic reference counters is not possible because cls API uses
>multiple counters and flags to control chain lifetime, so all of them must
>be synchronized in chain get/put code.
>
>Use single per-block lock to protect block data and manage lifetime of all
>chains on the block. Always take block->lock when accessing chain_list.
>Chain get and put modify chain lifetime-management data and parent block's
>chain_list, so take the lock in these functions. Verify block->lock state
>with assertions in functions that expect to be called with the lock taken
>and are called from multiple places. Take block->lock when accessing
>filter_chain_list.
>
>In order to allow parallel update of rules on single block, move all calls
>to classifiers outside of critical sections protected by new block->lock.
>Rearrange chain get and put functions code to only access protected chain
>data while holding block lock:
>- Rearrange code to only access chain reference counter and chain action
>  reference counter while holding block lock.
>- Extract code that requires block->lock from tcf_chain_destroy() into
>  standalone tcf_chain_destroy() function that is called by
>  __tcf_chain_put() in same critical section that changes chain reference
>  counters.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock
  2019-02-11  8:55 ` [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock Vlad Buslov
@ 2019-02-11 14:15   ` Jiri Pirko
  0 siblings, 0 replies; 52+ messages in thread
From: Jiri Pirko @ 2019-02-11 14:15 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, ast, daniel

Mon, Feb 11, 2019 at 09:55:33AM CET, vladbu@mellanox.com wrote:
>In order to remove dependency on rtnl lock, protect
>tcf_chain->explicitly_created flag with block->lock. Consolidate code that
>checks and resets 'explicitly_created' flag into __tcf_chain_put() to
>execute it atomically with rest of code that puts chain reference.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock
  2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
                   ` (16 preceding siblings ...)
  2019-02-11  8:55 ` [PATCH net-next v4 17/17] net: sched: unlock rules update API Vlad Buslov
@ 2019-02-12 18:42 ` David Miller
  17 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2019-02-12 18:42 UTC (permalink / raw)
  To: vladbu; +Cc: netdev, jhs, xiyou.wangcong, jiri, ast, daniel

From: Vlad Buslov <vladbu@mellanox.com>
Date: Mon, 11 Feb 2019 10:55:31 +0200

> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a third step to remove
> rtnl lock dependency from TC rules update path.
 ...

I have to say, this stuff is very ambitious.  Thanks for working on this.

Series applied, thanks Vlad.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-11  8:55 ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
@ 2019-02-14 18:24   ` Ido Schimmel
  2019-02-15 10:02     ` Vlad Buslov
  2019-02-15 22:35   ` Cong Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Ido Schimmel @ 2019-02-14 18:24 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, ast, daniel

On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
> when accessing filter_chain list, instead of relying on rtnl lock.
> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
> verify that all users of chain_list have the lock taken.
> 
> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
> all necessary code while holding chain lock in order to prevent
> invalidation of chain_info structure by potential concurrent change. This
> also serializes calls to tcf_chain0_head_change(), which allows head change
> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
> mutex.
> 
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>

With this sequence [1] I get the following trace [2]. Bisected it to
this patch. Note that second filter is rejected by the device driver
(that's the intention). I guess it is not properly removed from the
filter chain in the error path?

Thanks

[1]
# tc qdisc add dev swp3 clsact
# tc filter add dev swp3 ingress pref 1 matchall skip_sw \
	action mirred egress mirror dev swp7
# tc filter add dev swp3 ingress pref 2 matchall skip_sw \
	action mirred egress mirror dev swp7
RTNETLINK answers: File exists
We have an error talking to the kernel, -1
# tc qdisc del dev swp3 clsact

[2]
[   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
[   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
[   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
[   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
[   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
[   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
[   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
[   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
[   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
[   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
[   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
[   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
[   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
[   70.675633] Call Trace:
[   70.693046]  tcf_block_playback_offloads+0x94/0x230
[   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
[   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
[   70.738739]  tcf_block_offload_unbind+0x22b/0x350
[   70.748898]  __tcf_block_put+0x203/0x630
[   70.769700]  tcf_block_put_ext+0x2f/0x40
[   70.774098]  clsact_destroy+0x7a/0xb0
[   70.782604]  qdisc_destroy+0x11a/0x5c0
[   70.786810]  qdisc_put+0x5a/0x70
[   70.790435]  notify_and_destroy+0x8e/0xa0
[   70.794933]  qdisc_graft+0xbb7/0xef0
[   70.809009]  tc_get_qdisc+0x518/0xa30
[   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
[   70.835510]  netlink_rcv_skb+0x132/0x380
[   70.848419]  netlink_unicast+0x4c0/0x690
[   70.866019]  netlink_sendmsg+0x929/0xe10
[   70.882134]  sock_sendmsg+0xc8/0x110
[   70.886144]  ___sys_sendmsg+0x77a/0x8f0
[   70.924064]  __sys_sendmsg+0xf7/0x250
[   70.955727]  do_syscall_64+0x14d/0x610

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-14 18:24   ` Ido Schimmel
@ 2019-02-15 10:02     ` Vlad Buslov
  2019-02-15 11:30       ` Ido Schimmel
  2019-02-19  5:08       ` Cong Wang
  0 siblings, 2 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-15 10:02 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, ast, daniel


On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> when accessing filter_chain list, instead of relying on rtnl lock.
>> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> verify that all users of chain_list have the lock taken.
>>
>> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> all necessary code while holding chain lock in order to prevent
>> invalidation of chain_info structure by potential concurrent change. This
>> also serializes calls to tcf_chain0_head_change(), which allows head change
>> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> mutex.
>>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
>
> With this sequence [1] I get the following trace [2]. Bisected it to
> this patch. Note that second filter is rejected by the device driver
> (that's the intention). I guess it is not properly removed from the
> filter chain in the error path?
>
> Thanks
>
> [1]
> # tc qdisc add dev swp3 clsact
> # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
> 	action mirred egress mirror dev swp7
> # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
> 	action mirred egress mirror dev swp7
> RTNETLINK answers: File exists
> We have an error talking to the kernel, -1
> # tc qdisc del dev swp3 clsact
>
> [2]
> [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
> [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
> [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
> [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
> [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
> [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
> e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
> [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
> [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
> [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
> [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
> [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
> [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
> [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
> [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
> [   70.675633] Call Trace:
> [   70.693046]  tcf_block_playback_offloads+0x94/0x230
> [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
> [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
> [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
> [   70.748898]  __tcf_block_put+0x203/0x630
> [   70.769700]  tcf_block_put_ext+0x2f/0x40
> [   70.774098]  clsact_destroy+0x7a/0xb0
> [   70.782604]  qdisc_destroy+0x11a/0x5c0
> [   70.786810]  qdisc_put+0x5a/0x70
> [   70.790435]  notify_and_destroy+0x8e/0xa0
> [   70.794933]  qdisc_graft+0xbb7/0xef0
> [   70.809009]  tc_get_qdisc+0x518/0xa30
> [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
> [   70.835510]  netlink_rcv_skb+0x132/0x380
> [   70.848419]  netlink_unicast+0x4c0/0x690
> [   70.866019]  netlink_sendmsg+0x929/0xe10
> [   70.882134]  sock_sendmsg+0xc8/0x110
> [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
> [   70.924064]  __sys_sendmsg+0xf7/0x250
> [   70.955727]  do_syscall_64+0x14d/0x610

Hi Ido,

Thanks for reporting this.

I looked at the code and problem seems to be matchall classifier
specific. My implementation of unlocked cls API assumes that concurrent
insertions are possible and checks for it when deleting "empty" tp.
Since classifiers don't expose number of elements, the only way to test
this is to do tp->walk() on them and assume that walk callback is called
once per filter on every classifier. In your example new tp is created
for second filter, filter insertion fails, number of elements on newly
created tp is checked with tp->walk() before deleting it. However,
matchall classifier always calls the tp->walk() callback once, even when
it doesn't have a valid filter (in this case with NULL filter pointer).

Possible ways to fix this:

1) Check for NULL filter pointer in tp->walk() callback and ignore it
when counting filters on tp - will work but I don't like it because I
don't think it is ever correct to call tp->walk() callback with NULL
filter pointer.

2) Fix matchall to not call tp->walk() callback with NULL filter
pointers - my preferred simple solution.

3) Extend tp API to have direct way to count filters by implementing
tp->count - requires change to every classifier. Or maybe add it as
optional API that only unlocked classifiers are required to implement
and just delete rtnl lock dependent tp's without checking for concurrent
insertions.

What do you think?

Regards,
Vlad

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-15 10:02     ` Vlad Buslov
@ 2019-02-15 11:30       ` Ido Schimmel
  2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
                           ` (2 more replies)
  2019-02-19  5:08       ` Cong Wang
  1 sibling, 3 replies; 52+ messages in thread
From: Ido Schimmel @ 2019-02-15 11:30 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, ast, daniel

On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
> 
> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
> >> when accessing filter_chain list, instead of relying on rtnl lock.
> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
> >> verify that all users of chain_list have the lock taken.
> >>
> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
> >> all necessary code while holding chain lock in order to prevent
> >> invalidation of chain_info structure by potential concurrent change. This
> >> also serializes calls to tcf_chain0_head_change(), which allows head change
> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
> >> mutex.
> >>
> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
> >
> > With this sequence [1] I get the following trace [2]. Bisected it to
> > this patch. Note that second filter is rejected by the device driver
> > (that's the intention). I guess it is not properly removed from the
> > filter chain in the error path?
> >
> > Thanks
> >
> > [1]
> > # tc qdisc add dev swp3 clsact
> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
> > 	action mirred egress mirror dev swp7
> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
> > 	action mirred egress mirror dev swp7
> > RTNETLINK answers: File exists
> > We have an error talking to the kernel, -1
> > # tc qdisc del dev swp3 clsact
> >
> > [2]
> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
> > [   70.675633] Call Trace:
> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
> > [   70.748898]  __tcf_block_put+0x203/0x630
> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
> > [   70.774098]  clsact_destroy+0x7a/0xb0
> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
> > [   70.786810]  qdisc_put+0x5a/0x70
> > [   70.790435]  notify_and_destroy+0x8e/0xa0
> > [   70.794933]  qdisc_graft+0xbb7/0xef0
> > [   70.809009]  tc_get_qdisc+0x518/0xa30
> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
> > [   70.835510]  netlink_rcv_skb+0x132/0x380
> > [   70.848419]  netlink_unicast+0x4c0/0x690
> > [   70.866019]  netlink_sendmsg+0x929/0xe10
> > [   70.882134]  sock_sendmsg+0xc8/0x110
> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
> > [   70.924064]  __sys_sendmsg+0xf7/0x250
> > [   70.955727]  do_syscall_64+0x14d/0x610
> 
> Hi Ido,
> 
> Thanks for reporting this.
> 
> I looked at the code and problem seems to be matchall classifier
> specific. My implementation of unlocked cls API assumes that concurrent
> insertions are possible and checks for it when deleting "empty" tp.
> Since classifiers don't expose number of elements, the only way to test
> this is to do tp->walk() on them and assume that walk callback is called
> once per filter on every classifier. In your example new tp is created
> for second filter, filter insertion fails, number of elements on newly
> created tp is checked with tp->walk() before deleting it. However,
> matchall classifier always calls the tp->walk() callback once, even when
> it doesn't have a valid filter (in this case with NULL filter pointer).
> 
> Possible ways to fix this:
> 
> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
> when counting filters on tp - will work but I don't like it because I
> don't think it is ever correct to call tp->walk() callback with NULL
> filter pointer.
> 
> 2) Fix matchall to not call tp->walk() callback with NULL filter
> pointers - my preferred simple solution.
> 
> 3) Extend tp API to have direct way to count filters by implementing
> tp->count - requires change to every classifier. Or maybe add it as
> optional API that only unlocked classifiers are required to implement
> and just delete rtnl lock dependent tp's without checking for concurrent
> insertions.
> 
> What do you think?

Since the problem is matchall-specific, then it makes sense to fix it in
matchall like you suggested in option #2.

Can you please use this opportunity and audit other classifiers to
confirm problem is indeed specific to matchall?

In any case, feel free to send me a patch and I'll test it.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
  2019-02-15 11:30       ` Ido Schimmel
@ 2019-02-15 12:11         ` Vlad Buslov
  2019-02-15 13:47           ` Ido Schimmel
                             ` (2 more replies)
  2019-02-15 12:15         ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
  2019-02-15 15:35         ` Vlad Buslov
  2 siblings, 3 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-15 12:11 UTC (permalink / raw)
  To: idosch; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, Vlad Buslov

Check that filter is not NULL before passing it to tcf_walker->fn()
callback. This can happen when mall_change() failed to offload filter to
hardware.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
---
 net/sched/cls_matchall.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index a37137430e61..1f9d481b0fbb 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -247,6 +247,9 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
 
 	if (arg->count < arg->skip)
 		goto skip;
+
+	if (!head)
+		return;
 	if (arg->fn(tp, head, arg) < 0)
 		arg->stop = 1;
 skip:
-- 
2.13.6


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-15 11:30       ` Ido Schimmel
  2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
@ 2019-02-15 12:15         ` Vlad Buslov
  2019-02-15 15:35         ` Vlad Buslov
  2 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-15 12:15 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, ast, daniel


On Fri 15 Feb 2019 at 11:30, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
>>
>> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
>> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> >> when accessing filter_chain list, instead of relying on rtnl lock.
>> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> >> verify that all users of chain_list have the lock taken.
>> >>
>> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> >> all necessary code while holding chain lock in order to prevent
>> >> invalidation of chain_info structure by potential concurrent change. This
>> >> also serializes calls to tcf_chain0_head_change(), which allows head change
>> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> >> mutex.
>> >>
>> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
>> >
>> > With this sequence [1] I get the following trace [2]. Bisected it to
>> > this patch. Note that second filter is rejected by the device driver
>> > (that's the intention). I guess it is not properly removed from the
>> > filter chain in the error path?
>> >
>> > Thanks
>> >
>> > [1]
>> > # tc qdisc add dev swp3 clsact
>> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > RTNETLINK answers: File exists
>> > We have an error talking to the kernel, -1
>> > # tc qdisc del dev swp3 clsact
>> >
>> > [2]
>> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
>> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
>> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
>> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
>> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
>> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
>> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
>> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
>> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
>> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
>> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
>> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
>> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
>> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
>> > [   70.675633] Call Trace:
>> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
>> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
>> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
>> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
>> > [   70.748898]  __tcf_block_put+0x203/0x630
>> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
>> > [   70.774098]  clsact_destroy+0x7a/0xb0
>> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
>> > [   70.786810]  qdisc_put+0x5a/0x70
>> > [   70.790435]  notify_and_destroy+0x8e/0xa0
>> > [   70.794933]  qdisc_graft+0xbb7/0xef0
>> > [   70.809009]  tc_get_qdisc+0x518/0xa30
>> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
>> > [   70.835510]  netlink_rcv_skb+0x132/0x380
>> > [   70.848419]  netlink_unicast+0x4c0/0x690
>> > [   70.866019]  netlink_sendmsg+0x929/0xe10
>> > [   70.882134]  sock_sendmsg+0xc8/0x110
>> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
>> > [   70.924064]  __sys_sendmsg+0xf7/0x250
>> > [   70.955727]  do_syscall_64+0x14d/0x610
>>
>> Hi Ido,
>>
>> Thanks for reporting this.
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>>
>> Possible ways to fix this:
>>
>> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
>> when counting filters on tp - will work but I don't like it because I
>> don't think it is ever correct to call tp->walk() callback with NULL
>> filter pointer.
>>
>> 2) Fix matchall to not call tp->walk() callback with NULL filter
>> pointers - my preferred simple solution.
>>
>> 3) Extend tp API to have direct way to count filters by implementing
>> tp->count - requires change to every classifier. Or maybe add it as
>> optional API that only unlocked classifiers are required to implement
>> and just delete rtnl lock dependent tp's without checking for concurrent
>> insertions.
>>
>> What do you think?
>
> Since the problem is matchall-specific, then it makes sense to fix it in
> matchall like you suggested in option #2.
>
> Can you please use this opportunity and audit other classifiers to
> confirm problem is indeed specific to matchall?
>
> In any case, feel free to send me a patch and I'll test it.
>
> Thanks

I've sent you the patch for matchall and will audit all other
classifiers for this issue.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
  2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
@ 2019-02-15 13:47           ` Ido Schimmel
  2019-02-16  0:24           ` Cong Wang
  2019-02-17 21:27           ` David Miller
  2 siblings, 0 replies; 52+ messages in thread
From: Ido Schimmel @ 2019-02-15 13:47 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem

On Fri, Feb 15, 2019 at 02:11:20PM +0200, Vlad Buslov wrote:
> Check that filter is not NULL before passing it to tcf_walker->fn()
> callback. This can happen when mall_change() failed to offload filter to
> hardware.
> 
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Works for me. Thanks

Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex")
Reported-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-15 11:30       ` Ido Schimmel
  2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
  2019-02-15 12:15         ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
@ 2019-02-15 15:35         ` Vlad Buslov
  2019-02-19  5:26           ` Cong Wang
  2 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-15 15:35 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, jhs, xiyou.wangcong, jiri, davem, ast, daniel


On Fri 15 Feb 2019 at 11:30, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
>>
>> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
>> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> >> when accessing filter_chain list, instead of relying on rtnl lock.
>> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> >> verify that all users of chain_list have the lock taken.
>> >>
>> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> >> all necessary code while holding chain lock in order to prevent
>> >> invalidation of chain_info structure by potential concurrent change. This
>> >> also serializes calls to tcf_chain0_head_change(), which allows head change
>> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> >> mutex.
>> >>
>> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
>> >
>> > With this sequence [1] I get the following trace [2]. Bisected it to
>> > this patch. Note that second filter is rejected by the device driver
>> > (that's the intention). I guess it is not properly removed from the
>> > filter chain in the error path?
>> >
>> > Thanks
>> >
>> > [1]
>> > # tc qdisc add dev swp3 clsact
>> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > RTNETLINK answers: File exists
>> > We have an error talking to the kernel, -1
>> > # tc qdisc del dev swp3 clsact
>> >
>> > [2]
>> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
>> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
>> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
>> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
>> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
>> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
>> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
>> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
>> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
>> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
>> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
>> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
>> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
>> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
>> > [   70.675633] Call Trace:
>> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
>> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
>> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
>> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
>> > [   70.748898]  __tcf_block_put+0x203/0x630
>> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
>> > [   70.774098]  clsact_destroy+0x7a/0xb0
>> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
>> > [   70.786810]  qdisc_put+0x5a/0x70
>> > [   70.790435]  notify_and_destroy+0x8e/0xa0
>> > [   70.794933]  qdisc_graft+0xbb7/0xef0
>> > [   70.809009]  tc_get_qdisc+0x518/0xa30
>> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
>> > [   70.835510]  netlink_rcv_skb+0x132/0x380
>> > [   70.848419]  netlink_unicast+0x4c0/0x690
>> > [   70.866019]  netlink_sendmsg+0x929/0xe10
>> > [   70.882134]  sock_sendmsg+0xc8/0x110
>> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
>> > [   70.924064]  __sys_sendmsg+0xf7/0x250
>> > [   70.955727]  do_syscall_64+0x14d/0x610
>>
>> Hi Ido,
>>
>> Thanks for reporting this.
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>>
>> Possible ways to fix this:
>>
>> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
>> when counting filters on tp - will work but I don't like it because I
>> don't think it is ever correct to call tp->walk() callback with NULL
>> filter pointer.
>>
>> 2) Fix matchall to not call tp->walk() callback with NULL filter
>> pointers - my preferred simple solution.
>>
>> 3) Extend tp API to have direct way to count filters by implementing
>> tp->count - requires change to every classifier. Or maybe add it as
>> optional API that only unlocked classifiers are required to implement
>> and just delete rtnl lock dependent tp's without checking for concurrent
>> insertions.
>>
>> What do you think?
>
> Since the problem is matchall-specific, then it makes sense to fix it in
> matchall like you suggested in option #2.
>
> Can you please use this opportunity and audit other classifiers to
> confirm problem is indeed specific to matchall?
>

Turns out cls_cgroup has the same problem.

Another problem that I found in cls_fw and cls_route is that they set
arg->stop when empty. Both of them have code unchanged since it was
committed initially in 2005 so I assume this convention is no longer
relevant because all other classifiers don't do that (they only set
arg->stop when arg->fn returns negative value).

I sent fixes for all of those cases.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain()
  2019-02-11  8:55 ` [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain() Vlad Buslov
@ 2019-02-15 22:21   ` Cong Wang
  2019-02-18 10:07     ` Vlad Buslov
  0 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-15 22:21 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

(Sorry for joining this late.)

On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> @@ -2432,7 +2474,11 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
>         index_start = cb->args[0];
>         index = 0;
>
> -       list_for_each_entry(chain, &block->chain_list, list) {
> +       for (chain = __tcf_get_next_chain(block, NULL);
> +            chain;
> +            chain_prev = chain,
> +                    chain = __tcf_get_next_chain(block, chain),
> +                    tcf_chain_put(chain_prev)) {

Why do you want to take the block->lock in each iteration
of the loop rather than taking once for the whole loop?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-11  8:55 ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
  2019-02-14 18:24   ` Ido Schimmel
@ 2019-02-15 22:35   ` Cong Wang
  2019-02-18 11:06     ` Vlad Buslov
  1 sibling, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-15 22:35 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> +#ifdef CONFIG_PROVE_LOCKING
> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
> +{
> +       return lockdep_is_held(&chain->filter_chain_lock);
> +}
> +#else
> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
> +{
> +       return true;
> +}
> +#endif /* #ifdef CONFIG_PROVE_LOCKING */
> +
> +#define tcf_chain_dereference(p, chain)                                        \
> +       rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))


Are you sure you need this #ifdef CONFIG_PROVE_LOCKING?
rcu_dereference_protected() should already test CONFIG_PROVE_RCU.

Ditto for tcf_proto_dereference().

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
  2019-02-11  8:55 ` [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution Vlad Buslov
@ 2019-02-15 23:17   ` Cong Wang
  2019-02-18 11:19     ` Vlad Buslov
  2019-02-18 19:53   ` Cong Wang
  1 sibling, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-15 23:17 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> +static bool tcf_proto_is_empty(struct tcf_proto *tp)
> +{
> +       struct tcf_walker walker = { .fn = walker_noop, };
> +
> +       if (tp->ops->walk) {
> +               tp->ops->walk(tp, &walker);
> +               return !walker.stop;
> +       }
> +       return true;
> +}
> +
> +static bool tcf_proto_check_delete(struct tcf_proto *tp)
> +{
> +       spin_lock(&tp->lock);
> +       if (tcf_proto_is_empty(tp))
> +               tp->deleting = true;
> +       spin_unlock(&tp->lock);
> +       return tp->deleting;

If you use this spinlock for walking each tp data structure,
why it is not needed for adding to/deleting filters from each
tp?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
  2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
  2019-02-15 13:47           ` Ido Schimmel
@ 2019-02-16  0:24           ` Cong Wang
  2019-02-18 12:00             ` Vlad Buslov
  2019-02-17 21:27           ` David Miller
  2 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-16  0:24 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Ido Schimmel, Linux Kernel Network Developers, Jamal Hadi Salim,
	Jiri Pirko, David Miller

On Fri, Feb 15, 2019 at 4:11 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Check that filter is not NULL before passing it to tcf_walker->fn()
> callback. This can happen when mall_change() failed to offload filter to
> hardware.
>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> ---
>  net/sched/cls_matchall.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
> index a37137430e61..1f9d481b0fbb 100644
> --- a/net/sched/cls_matchall.c
> +++ b/net/sched/cls_matchall.c
> @@ -247,6 +247,9 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
>
>         if (arg->count < arg->skip)
>                 goto skip;
> +
> +       if (!head)
> +               return;

So head==NULL still counts one given that you check NULL after
checking arg->count. Is this expected?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
  2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
  2019-02-15 13:47           ` Ido Schimmel
  2019-02-16  0:24           ` Cong Wang
@ 2019-02-17 21:27           ` David Miller
  2 siblings, 0 replies; 52+ messages in thread
From: David Miller @ 2019-02-17 21:27 UTC (permalink / raw)
  To: vladbu; +Cc: idosch, netdev, jhs, xiyou.wangcong, jiri

From: Vlad Buslov <vladbu@mellanox.com>
Date: Fri, 15 Feb 2019 14:11:20 +0200

> Check that filter is not NULL before passing it to tcf_walker->fn()
> callback. This can happen when mall_change() failed to offload filter to
> hardware.
> 
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Applied.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain()
  2019-02-15 22:21   ` Cong Wang
@ 2019-02-18 10:07     ` Vlad Buslov
  2019-02-18 18:26       ` Cong Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-18 10:07 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

Hi Cong,

Thanks for reviewing!

On Fri 15 Feb 2019 at 22:21, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> (Sorry for joining this late.)
>
> On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> @@ -2432,7 +2474,11 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
>>         index_start = cb->args[0];
>>         index = 0;
>>
>> -       list_for_each_entry(chain, &block->chain_list, list) {
>> +       for (chain = __tcf_get_next_chain(block, NULL);
>> +            chain;
>> +            chain_prev = chain,
>> +                    chain = __tcf_get_next_chain(block, chain),
>> +                    tcf_chain_put(chain_prev)) {
>
> Why do you want to take the block->lock in each iteration
> of the loop rather than taking once for the whole loop?

This loop calls classifier ops callback in tc_chain_fill_node(). I don't
call any classifier ops callbacks while holding block or chain lock in
this change because the goal is to achieve fine-grained locking for data
structures used by filter update path. Locking per-block or per-chain is
much coarser than taking reference counters to parent structures and
allowing classifiers to implement their own locking.

In this case call to ops->tmplt_dump() is probably quite fast and its
execution time doesn't depend on number of filters on the classifier, so
releasing block->lock on each iteration doesn't provide much benefit, if
at all. However, it is easier for me to reason about locking correctness
in this refactoring by following a simple rule that no locks (besides
rtnl mutex) can be held when calling classifier ops callbacks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-15 22:35   ` Cong Wang
@ 2019-02-18 11:06     ` Vlad Buslov
  2019-02-18 18:31       ` Cong Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-18 11:06 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Fri 15 Feb 2019 at 22:35, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> +#ifdef CONFIG_PROVE_LOCKING
>> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
>> +{
>> +       return lockdep_is_held(&chain->filter_chain_lock);
>> +}
>> +#else
>> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
>> +{
>> +       return true;
>> +}
>> +#endif /* #ifdef CONFIG_PROVE_LOCKING */
>> +
>> +#define tcf_chain_dereference(p, chain)                                        \
>> +       rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
>
>
> Are you sure you need this #ifdef CONFIG_PROVE_LOCKING?
> rcu_dereference_protected() should already test CONFIG_PROVE_RCU.
>
> Ditto for tcf_proto_dereference().

I implemented these macro same way as rtnl_dereference() is implemented,
which they are intended to substitute.

After removing them I get following compilation error with
CONFIG_PROVE_LOCKING disabled:

./include/net/sch_generic.h: In function ‘lockdep_tcf_chain_is_locked’:
./include/net/sch_generic.h:404:9: error: implicit declaration of function ‘lockdep_is_held’; did you mean ‘lockdep_rtnl_is_held’? [-Werror=implicit-function-declaration]
  return lockdep_is_held(&chain->filter_chain_lock);
         ^~~~~~~~~~~~~~~
         lockdep_rtnl_is_held

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
  2019-02-15 23:17   ` Cong Wang
@ 2019-02-18 11:19     ` Vlad Buslov
  2019-02-18 19:55       ` Cong Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-18 11:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann


On Fri 15 Feb 2019 at 23:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> +static bool tcf_proto_is_empty(struct tcf_proto *tp)
>> +{
>> +       struct tcf_walker walker = { .fn = walker_noop, };
>> +
>> +       if (tp->ops->walk) {
>> +               tp->ops->walk(tp, &walker);
>> +               return !walker.stop;
>> +       }
>> +       return true;
>> +}
>> +
>> +static bool tcf_proto_check_delete(struct tcf_proto *tp)
>> +{
>> +       spin_lock(&tp->lock);
>> +       if (tcf_proto_is_empty(tp))
>> +               tp->deleting = true;
>> +       spin_unlock(&tp->lock);
>> +       return tp->deleting;
>
> If you use this spinlock for walking each tp data structure,
> why it is not needed for adding to/deleting filters from each
> tp?

This lock is intended to be used by unlocked classifiers and I use it in
my following flower patch set extensively. Classifiers that do not set
'unlocked' flag continue to rely on rtnl lock for synchronization.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk()
  2019-02-16  0:24           ` Cong Wang
@ 2019-02-18 12:00             ` Vlad Buslov
  0 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-18 12:00 UTC (permalink / raw)
  To: Cong Wang
  Cc: Ido Schimmel, Linux Kernel Network Developers, Jamal Hadi Salim,
	Jiri Pirko, David Miller


On Sat 16 Feb 2019 at 00:24, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Feb 15, 2019 at 4:11 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> Check that filter is not NULL before passing it to tcf_walker->fn()
>> callback. This can happen when mall_change() failed to offload filter to
>> hardware.
>>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> ---
>>  net/sched/cls_matchall.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
>> index a37137430e61..1f9d481b0fbb 100644
>> --- a/net/sched/cls_matchall.c
>> +++ b/net/sched/cls_matchall.c
>> @@ -247,6 +247,9 @@ static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg,
>>
>>         if (arg->count < arg->skip)
>>                 goto skip;
>> +
>> +       if (!head)
>> +               return;
>
> So head==NULL still counts one given that you check NULL after
> checking arg->count. Is this expected?

My intention was to fix the problem (arg->fn() call with NULL filter)
without changing any other functionality, and always incrementing
arg->count once seemed to be the intended behavior. However, since
mall_delete() just returns -EOPNOTSUPP, it might be the case that author
of matchall expected to always have single filter configured when cls
API calls mall_walk(). What would you suggest?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain()
  2019-02-18 10:07     ` Vlad Buslov
@ 2019-02-18 18:26       ` Cong Wang
  2019-02-19 16:04         ` Vlad Buslov
  0 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-18 18:26 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 18, 2019 at 2:07 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Hi Cong,
>
> Thanks for reviewing!
>
> On Fri 15 Feb 2019 at 22:21, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > (Sorry for joining this late.)
> >
> > On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> @@ -2432,7 +2474,11 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
> >>         index_start = cb->args[0];
> >>         index = 0;
> >>
> >> -       list_for_each_entry(chain, &block->chain_list, list) {
> >> +       for (chain = __tcf_get_next_chain(block, NULL);
> >> +            chain;
> >> +            chain_prev = chain,
> >> +                    chain = __tcf_get_next_chain(block, chain),
> >> +                    tcf_chain_put(chain_prev)) {
> >
> > Why do you want to take the block->lock in each iteration
> > of the loop rather than taking once for the whole loop?
>
> This loop calls classifier ops callback in tc_chain_fill_node(). I don't
> call any classifier ops callbacks while holding block or chain lock in
> this change because the goal is to achieve fine-grained locking for data
> structures used by filter update path. Locking per-block or per-chain is
> much coarser than taking reference counters to parent structures and
> allowing classifiers to implement their own locking.

That is the problem, when we have N filter chains in a block, you
lock and unlock mutex N times... And what __tcf_get_next_chain()
does is basically just retrieving the next entry in the list, so the
overhead of mutex is likely more than the list operation itself in
contention situation.

Now I can see why you complained about mutex before, it is
how you use it, not actually its own problem. :)

>
> In this case call to ops->tmplt_dump() is probably quite fast and its
> execution time doesn't depend on number of filters on the classifier, so
> releasing block->lock on each iteration doesn't provide much benefit, if
> at all. However, it is easier for me to reason about locking correctness
> in this refactoring by following a simple rule that no locks (besides
> rtnl mutex) can be held when calling classifier ops callbacks.

Well, for me, a hierarchy locking is always simple when you take
them in the right order, that is locking the larger-scope lock first
and then smaller-scope one.

The way you use the locking here is actually harder for me to
review, because it is hard to valid its atomicity when you unlock
the larger scope lock and re-take the smaller scope lock. You
use refcnt to ensure it will not go way, but that is still far from
guarantee of the atomicity.

For example, tp->ops->change() which changes an existing
filter, I don't see you lock either block->lock or
chain->filter_chain_lock when calling it. How does it even work?

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-18 11:06     ` Vlad Buslov
@ 2019-02-18 18:31       ` Cong Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Cong Wang @ 2019-02-18 18:31 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 18, 2019 at 3:06 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> On Fri 15 Feb 2019 at 22:35, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> +#ifdef CONFIG_PROVE_LOCKING
> >> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
> >> +{
> >> +       return lockdep_is_held(&chain->filter_chain_lock);
> >> +}
> >> +#else
> >> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
> >> +{
> >> +       return true;
> >> +}
> >> +#endif /* #ifdef CONFIG_PROVE_LOCKING */
> >> +
> >> +#define tcf_chain_dereference(p, chain)                                        \
> >> +       rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
> >
> >
> > Are you sure you need this #ifdef CONFIG_PROVE_LOCKING?
> > rcu_dereference_protected() should already test CONFIG_PROVE_RCU.
> >
> > Ditto for tcf_proto_dereference().
>
> I implemented these macro same way as rtnl_dereference() is implemented,
> which they are intended to substitute.
>
> After removing them I get following compilation error with
> CONFIG_PROVE_LOCKING disabled:


This is pretty odd, because net/core/neighbour.c uses it without
any #ifdef CONFIG_PROVE_LOCKING, for instance:

 192                 neigh = rcu_dereference_protected(n->next,
 193
lockdep_is_held(&tbl->lock));
 194                 rcu_assign_pointer(*np, neigh);
 195                 neigh_mark_dead(n);
 196                 retval = true;

So how does this compile when CONFIG_PROVE_LOCKING
is disabled? :-/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 17/17] net: sched: unlock rules update API
  2019-02-11  8:55 ` [PATCH net-next v4 17/17] net: sched: unlock rules update API Vlad Buslov
@ 2019-02-18 18:56   ` Cong Wang
  0 siblings, 0 replies; 52+ messages in thread
From: Cong Wang @ 2019-02-18 18:56 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Register netlink protocol handlers for message types RTM_NEWTFILTER,
> RTM_DELTFILTER, RTM_GETTFILTER as unlocked. Set rtnl_held variable that
> tracks rtnl mutex state to be false by default.
>
> Introduce tcf_proto_is_unlocked() helper that is used to check
> tcf_proto_ops->flag to determine if ops can be called without taking rtnl
> lock. Manually lookup Qdisc, class and block in rule update handlers.
> Verify that both Qdisc ops and proto ops are unlocked before using any of
> their callbacks, and obtain rtnl lock otherwise.

So if you end goal is to completely get rid of RTNL from tc filter and
action control path, why this change is needed?

I was expecting you to break down the RTNL lock down to each
block/filter chain, which should be done in this patchset. So again,
why do we still need RTNL for some cases?

Please state the reasoning in your changelog, rather than just
describing what your code does.

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
  2019-02-11  8:55 ` [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution Vlad Buslov
  2019-02-15 23:17   ` Cong Wang
@ 2019-02-18 19:53   ` Cong Wang
  1 sibling, 0 replies; 52+ messages in thread
From: Cong Wang @ 2019-02-18 19:53 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> +#define tcf_proto_dereference(p, tp)                                   \
> +       rcu_dereference_protected(p, lockdep_tcf_proto_is_locked(tp))
> +

BTW, it is not used anywhere even on top of your cls_flower changes...

$ git grep tcf_proto_dereference
include/net/sch_generic.h:#define tcf_proto_dereference(p, tp)
                         \

Please don't introduce new things unitl you actually use it.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
  2019-02-18 11:19     ` Vlad Buslov
@ 2019-02-18 19:55       ` Cong Wang
  2019-02-19 10:25         ` Vlad Buslov
  0 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-18 19:55 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann

On Mon, Feb 18, 2019 at 3:19 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Fri 15 Feb 2019 at 23:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> +static bool tcf_proto_is_empty(struct tcf_proto *tp)
> >> +{
> >> +       struct tcf_walker walker = { .fn = walker_noop, };
> >> +
> >> +       if (tp->ops->walk) {
> >> +               tp->ops->walk(tp, &walker);
> >> +               return !walker.stop;
> >> +       }
> >> +       return true;
> >> +}
> >> +
> >> +static bool tcf_proto_check_delete(struct tcf_proto *tp)
> >> +{
> >> +       spin_lock(&tp->lock);
> >> +       if (tcf_proto_is_empty(tp))
> >> +               tp->deleting = true;
> >> +       spin_unlock(&tp->lock);
> >> +       return tp->deleting;
> >
> > If you use this spinlock for walking each tp data structure,
> > why it is not needed for adding to/deleting filters from each
> > tp?
>
> This lock is intended to be used by unlocked classifiers and I use it in
> my following flower patch set extensively. Classifiers that do not set
> 'unlocked' flag continue to rely on rtnl lock for synchronization.

It is never late to add it when you seriously use it. The way you
split the patches is really annoying for reviewers...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-15 10:02     ` Vlad Buslov
  2019-02-15 11:30       ` Ido Schimmel
@ 2019-02-19  5:08       ` Cong Wang
  2019-02-19 15:20         ` Vlad Buslov
  1 sibling, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-19  5:08 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel

On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> I looked at the code and problem seems to be matchall classifier
> specific. My implementation of unlocked cls API assumes that concurrent
> insertions are possible and checks for it when deleting "empty" tp.
> Since classifiers don't expose number of elements, the only way to test
> this is to do tp->walk() on them and assume that walk callback is called
> once per filter on every classifier. In your example new tp is created
> for second filter, filter insertion fails, number of elements on newly
> created tp is checked with tp->walk() before deleting it. However,
> matchall classifier always calls the tp->walk() callback once, even when
> it doesn't have a valid filter (in this case with NULL filter pointer).

Again, this can be eliminated by just switching to normal
non-retry logic. This is yet another headache to review this
kind of unlock-and-retry logic, I have no idea why you are such
a big fan of it.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-15 15:35         ` Vlad Buslov
@ 2019-02-19  5:26           ` Cong Wang
  2019-02-19 12:31             ` Vlad Buslov
  0 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-19  5:26 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel

On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Another problem that I found in cls_fw and cls_route is that they set
> arg->stop when empty. Both of them have code unchanged since it was
> committed initially in 2005 so I assume this convention is no longer
> relevant because all other classifiers don't do that (they only set
> arg->stop when arg->fn returns negative value).
>

The question is why do you want to use arg->stop==0 as
an indication for emptiness? Isn't what arg->count==0
supposed to be?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution
  2019-02-18 19:55       ` Cong Wang
@ 2019-02-19 10:25         ` Vlad Buslov
  0 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-19 10:25 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann


On Mon 18 Feb 2019 at 19:55, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Feb 18, 2019 at 3:19 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Fri 15 Feb 2019 at 23:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >> +static bool tcf_proto_is_empty(struct tcf_proto *tp)
>> >> +{
>> >> +       struct tcf_walker walker = { .fn = walker_noop, };
>> >> +
>> >> +       if (tp->ops->walk) {
>> >> +               tp->ops->walk(tp, &walker);
>> >> +               return !walker.stop;
>> >> +       }
>> >> +       return true;
>> >> +}
>> >> +
>> >> +static bool tcf_proto_check_delete(struct tcf_proto *tp)
>> >> +{
>> >> +       spin_lock(&tp->lock);
>> >> +       if (tcf_proto_is_empty(tp))
>> >> +               tp->deleting = true;
>> >> +       spin_unlock(&tp->lock);
>> >> +       return tp->deleting;
>> >
>> > If you use this spinlock for walking each tp data structure,
>> > why it is not needed for adding to/deleting filters from each
>> > tp?
>>
>> This lock is intended to be used by unlocked classifiers and I use it in
>> my following flower patch set extensively. Classifiers that do not set
>> 'unlocked' flag continue to rely on rtnl lock for synchronization.
>
> It is never late to add it when you seriously use it. The way you
> split the patches is really annoying for reviewers...

I made a decision to put all required cls API changes so at this point
anyone can implement their own rtnl-unlocked classifier (or refactor
existing one for unlocked execution) without any further changes to cls
API. However, I can see how this can be confusing to reviewer,
especially if they are not familiar with proposed flower changes. I will
split my patches according to your suggestions in the future.

Thanks,
Vlad

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-19  5:26           ` Cong Wang
@ 2019-02-19 12:31             ` Vlad Buslov
  2019-02-20 22:43               ` Cong Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-19 12:31 UTC (permalink / raw)
  To: Cong Wang; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel


On Tue 19 Feb 2019 at 05:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> Another problem that I found in cls_fw and cls_route is that they set
>> arg->stop when empty. Both of them have code unchanged since it was
>> committed initially in 2005 so I assume this convention is no longer
>> relevant because all other classifiers don't do that (they only set
>> arg->stop when arg->fn returns negative value).
>>
>
> The question is why do you want to use arg->stop==0 as
> an indication for emptiness? Isn't what arg->count==0
> supposed to be?

Good question! I initially wanted to implement it like that, but
reconsidered because iterating through all filters on classifier to
count them is O(N), and terminating on first filter and relying on
arg->stop==1 is constant time. Making function that is called
"tcf_proto_is_empty" linear on number of filters seemed sloppy to me...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-19  5:08       ` Cong Wang
@ 2019-02-19 15:20         ` Vlad Buslov
  2019-02-20 23:00           ` Cong Wang
  0 siblings, 1 reply; 52+ messages in thread
From: Vlad Buslov @ 2019-02-19 15:20 UTC (permalink / raw)
  To: Cong Wang; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel


On Tue 19 Feb 2019 at 05:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>
> Again, this can be eliminated by just switching to normal
> non-retry logic. This is yet another headache to review this
> kind of unlock-and-retry logic, I have no idea why you are such
> a big fan of it.

The retry approach was suggested to me multiple times by Jiri on
previous code reviews so I assumed it is preferred approach in such
cases. I don't have a strong preference in this regard, but locking
whole tp on filter update will remove any parallelism when updating same
classifier instance concurrently. The goal of these changes is to allow
parallel rule update and to achieve that I had to introduce some
complexity into the code.

Now let me explain why these two approaches result completely different
performance in this case. Lets start with a list of most CPU-consuming
parts in new filter creation process in descending order (raw data at
the end of this mail):

1) Hardware offload - if available and no skip_hw.
2) Exts (actions) initalization - most expensive part even with single
action, CPU usage increases with number of actions per filter.
3) cls API.
4) Flower classifier data structure initialization.

Note that 1)+2) is ~80% of cost of creating a flower filter. So if we
just lock the whole flower classifier instance during rule update we
serialize 1, 2 and 4, and only cls API (~13% of CPU cost) can be
executed concurrently. However, in proposed flower implementation hw
offloading and action initialization code is called without any locks
and tp->lock is only obtained when modifying flower data structures,
which means that only 3) is serialized and everything else (87% of CPU
cost) can be executed in parallel.

First page of profiling data:

Samples: 100K of event 'cycles:ppp', Event count (approx.): 11191878316
  Children      Self  Command  Shared Object       Symbol
+   84.71%     0.08%  tc       [kernel.vmlinux]    [k] entry_SYSCALL_64_after_hwframe
+   84.62%     0.06%  tc       [kernel.vmlinux]    [k] do_syscall_64
+   82.63%     0.01%  tc       libc-2.25.so        [.] __libc_sendmsg
+   82.37%     0.00%  tc       [kernel.vmlinux]    [k] __sys_sendmsg
+   82.37%     0.00%  tc       [kernel.vmlinux]    [k] ___sys_sendmsg
+   82.34%     0.00%  tc       [kernel.vmlinux]    [k] sock_sendmsg
+   82.34%     0.01%  tc       [kernel.vmlinux]    [k] netlink_sendmsg
+   82.15%     0.15%  tc       [kernel.vmlinux]    [k] netlink_unicast
+   82.10%     0.11%  tc       [kernel.vmlinux]    [k] netlink_rcv_skb
+   80.76%     0.22%  tc       [kernel.vmlinux]    [k] rtnetlink_rcv_msg
+   80.10%     0.24%  tc       [kernel.vmlinux]    [k] tc_new_tfilter
+   69.30%     2.11%  tc       [cls_flower]        [k] fl_change
+   33.56%     0.05%  tc       [kernel.vmlinux]    [k] tcf_exts_validate
+   33.50%     0.12%  tc       [kernel.vmlinux]    [k] tcf_action_init
+   33.30%     0.10%  tc       [kernel.vmlinux]    [k] tcf_action_init_1
+   32.78%     0.11%  tc       [act_gact]          [k] tcf_gact_init
+   30.93%     0.16%  tc       [kernel.vmlinux]    [k] tc_setup_cb_call
+   29.96%     0.60%  tc       [mlx5_core]         [k] mlx5e_configure_flower
+   27.62%     0.23%  tc       [mlx5_core]         [k] mlx5e_tc_add_nic_flow
+   27.31%     0.45%  tc       [kernel.vmlinux]    [k] tcf_idr_create
+   25.45%     1.75%  tc       [kernel.vmlinux]    [k] pcpu_alloc
+   16.33%     0.07%  tc       [mlx5_core]         [k] mlx5_cmd_exec
+   16.26%     1.96%  tc       [mlx5_core]         [k] cmd_exec
+   14.28%     1.05%  tc       [mlx5_core]         [k] mlx5_add_flow_rules
+   14.02%     0.26%  tc       [kernel.vmlinux]    [k] pcpu_alloc_area
+   13.09%     0.13%  tc       [mlx5_core]         [k] mlx5_fc_create
+    9.77%     0.30%  tc       [mlx5_core]         [k] add_rule_fg.isra.28
+    9.08%     0.84%  tc       [mlx5_core]         [k] mlx5_cmd_set_fte
+    8.90%     0.09%  tc       [mlx5_core]         [k] mlx5_cmd_fc_alloc
+    7.90%     0.12%  tc       [kernel.vmlinux]    [k] tfilter_notify
+    7.34%     0.61%  tc       [kernel.vmlinux]    [k] __queue_work
+    7.25%     0.26%  tc       [kernel.vmlinux]    [k] tcf_fill_node
+    6.73%     0.23%  tc       [kernel.vmlinux]    [k] wait_for_completion_timeout
+    6.67%     0.20%  tc       [cls_flower]        [k] fl_dump
+    6.52%     5.93%  tc       [kernel.vmlinux]    [k] memset_erms
+    5.77%     0.49%  tc       [kernel.vmlinux]    [k] schedule_timeout
+    5.57%     1.29%  tc       [kernel.vmlinux]    [k] try_to_wake_up
+    5.50%     0.11%  tc       [kernel.vmlinux]    [k] pcpu_block_update_hint_alloc
+    5.40%     0.85%  tc       [kernel.vmlinux]    [k] pcpu_block_refresh_hint
+    5.28%     0.11%  tc       [kernel.vmlinux]    [k] queue_work_on
+    5.19%     4.96%  tc       [kernel.vmlinux]    [k] find_next_bit
+    4.77%     0.11%  tc       [kernel.vmlinux]    [k] idr_alloc_u32
+    4.71%     0.10%  tc       [kernel.vmlinux]    [k] schedule
+    4.62%     0.30%  tc       [kernel.vmlinux]    [k] __sched_text_start
+    4.48%     4.41%  tc       [kernel.vmlinux]    [k] idr_get_free
+    4.19%     0.04%  tc       [kernel.vmlinux]    [k] tcf_idr_check_alloc

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain()
  2019-02-18 18:26       ` Cong Wang
@ 2019-02-19 16:04         ` Vlad Buslov
  0 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-19 16:04 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	David Miller, Alexei Starovoitov, Daniel Borkmann


On Mon 18 Feb 2019 at 18:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Feb 18, 2019 at 2:07 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> Hi Cong,
>>
>> Thanks for reviewing!
>>
>> On Fri 15 Feb 2019 at 22:21, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > (Sorry for joining this late.)
>> >
>> > On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >> @@ -2432,7 +2474,11 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
>> >>         index_start = cb->args[0];
>> >>         index = 0;
>> >>
>> >> -       list_for_each_entry(chain, &block->chain_list, list) {
>> >> +       for (chain = __tcf_get_next_chain(block, NULL);
>> >> +            chain;
>> >> +            chain_prev = chain,
>> >> +                    chain = __tcf_get_next_chain(block, chain),
>> >> +                    tcf_chain_put(chain_prev)) {
>> >
>> > Why do you want to take the block->lock in each iteration
>> > of the loop rather than taking once for the whole loop?
>>
>> This loop calls classifier ops callback in tc_chain_fill_node(). I don't
>> call any classifier ops callbacks while holding block or chain lock in
>> this change because the goal is to achieve fine-grained locking for data
>> structures used by filter update path. Locking per-block or per-chain is
>> much coarser than taking reference counters to parent structures and
>> allowing classifiers to implement their own locking.
>
> That is the problem, when we have N filter chains in a block, you
> lock and unlock mutex N times... And what __tcf_get_next_chain()
> does is basically just retrieving the next entry in the list, so the
> overhead of mutex is likely more than the list operation itself in
> contention situation.
>
> Now I can see why you complained about mutex before, it is
> how you use it, not actually its own problem. :)
>
>>
>> In this case call to ops->tmplt_dump() is probably quite fast and its
>> execution time doesn't depend on number of filters on the classifier, so
>> releasing block->lock on each iteration doesn't provide much benefit, if
>> at all. However, it is easier for me to reason about locking correctness
>> in this refactoring by following a simple rule that no locks (besides
>> rtnl mutex) can be held when calling classifier ops callbacks.
>
> Well, for me, a hierarchy locking is always simple when you take
> them in the right order, that is locking the larger-scope lock first
> and then smaller-scope one.
>
> The way you use the locking here is actually harder for me to
> review, because it is hard to valid its atomicity when you unlock
> the larger scope lock and re-take the smaller scope lock. You
> use refcnt to ensure it will not go way, but that is still far from
> guarantee of the atomicity.
>
> For example, tp->ops->change() which changes an existing
> filter, I don't see you lock either block->lock or
> chain->filter_chain_lock when calling it. How does it even work?

Sorry for not describing it in more details in cover letter. Basic
approach is that cls API obtains references to all necessary data
structures (Qdisc, block, chain, tp) and calls classifier ops callbacks
after releasing all the locks. This defers all locking and atomicity
concerns to actual classifier implementations. In case of
tp->ops->change() flower classifier uses tp->lock. In case of dump code
(both chain and filter) there is no atomicity with or without my changes
because rtnl lock is released after sending each skb to userspace and
concurrent changes to tc structures can occur in between.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-19 12:31             ` Vlad Buslov
@ 2019-02-20 22:43               ` Cong Wang
  2019-02-21 15:49                 ` Vlad Buslov
  0 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-20 22:43 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel

On Tue, Feb 19, 2019 at 4:31 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Tue 19 Feb 2019 at 05:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >> Another problem that I found in cls_fw and cls_route is that they set
> >> arg->stop when empty. Both of them have code unchanged since it was
> >> committed initially in 2005 so I assume this convention is no longer
> >> relevant because all other classifiers don't do that (they only set
> >> arg->stop when arg->fn returns negative value).
> >>
> >
> > The question is why do you want to use arg->stop==0 as
> > an indication for emptiness? Isn't what arg->count==0
> > supposed to be?
>
> Good question! I initially wanted to implement it like that, but
> reconsidered because iterating through all filters on classifier to
> count them is O(N), and terminating on first filter and relying on
> arg->stop==1 is constant time. Making function that is called
> "tcf_proto_is_empty" linear on number of filters seemed sloppy to me...

Good point, however arg->stop _was_ supposed to set only when
error happens. Probably you want a new arg here to stop on the first
entry.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-19 15:20         ` Vlad Buslov
@ 2019-02-20 23:00           ` Cong Wang
  2019-02-21 17:11             ` Vlad Buslov
  0 siblings, 1 reply; 52+ messages in thread
From: Cong Wang @ 2019-02-20 23:00 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel

On Tue, Feb 19, 2019 at 7:20 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Tue 19 Feb 2019 at 05:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >> I looked at the code and problem seems to be matchall classifier
> >> specific. My implementation of unlocked cls API assumes that concurrent
> >> insertions are possible and checks for it when deleting "empty" tp.
> >> Since classifiers don't expose number of elements, the only way to test
> >> this is to do tp->walk() on them and assume that walk callback is called
> >> once per filter on every classifier. In your example new tp is created
> >> for second filter, filter insertion fails, number of elements on newly
> >> created tp is checked with tp->walk() before deleting it. However,
> >> matchall classifier always calls the tp->walk() callback once, even when
> >> it doesn't have a valid filter (in this case with NULL filter pointer).
> >
> > Again, this can be eliminated by just switching to normal
> > non-retry logic. This is yet another headache to review this
> > kind of unlock-and-retry logic, I have no idea why you are such
> > a big fan of it.
>
> The retry approach was suggested to me multiple times by Jiri on
> previous code reviews so I assumed it is preferred approach in such
> cases. I don't have a strong preference in this regard, but locking
> whole tp on filter update will remove any parallelism when updating same
> classifier instance concurrently. The goal of these changes is to allow
> parallel rule update and to achieve that I had to introduce some
> complexity into the code.

Yeah, but with unlock-and-retry it would waste more time when
retry occurs. So it can't be better in the worst scenario.

The question is essentially that do we want to waste CPU cycles
when conflicts occurs or just block there until it is safe to enter
the critical section?

And, is the retry bound? Is it possible that we would retry infinitely
as long as we time it correctly?


>
> Now let me explain why these two approaches result completely different
> performance in this case. Lets start with a list of most CPU-consuming
> parts in new filter creation process in descending order (raw data at
> the end of this mail):
>
> 1) Hardware offload - if available and no skip_hw.
> 2) Exts (actions) initalization - most expensive part even with single
> action, CPU usage increases with number of actions per filter.
> 3) cls API.
> 4) Flower classifier data structure initialization.
>
> Note that 1)+2) is ~80% of cost of creating a flower filter. So if we
> just lock the whole flower classifier instance during rule update we
> serialize 1, 2 and 4, and only cls API (~13% of CPU cost) can be
> executed concurrently. However, in proposed flower implementation hw
> offloading and action initialization code is called without any locks
> and tp->lock is only obtained when modifying flower data structures,
> which means that only 3) is serialized and everything else (87% of CPU
> cost) can be executed in parallel.

What about when conflicts detected and retry the whole change?
And, of course, how often do conflicts happen?

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-20 22:43               ` Cong Wang
@ 2019-02-21 15:49                 ` Vlad Buslov
  0 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-21 15:49 UTC (permalink / raw)
  To: Cong Wang; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel


On Wed 20 Feb 2019 at 22:43, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 4:31 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Tue 19 Feb 2019 at 05:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >> Another problem that I found in cls_fw and cls_route is that they set
>> >> arg->stop when empty. Both of them have code unchanged since it was
>> >> committed initially in 2005 so I assume this convention is no longer
>> >> relevant because all other classifiers don't do that (they only set
>> >> arg->stop when arg->fn returns negative value).
>> >>
>> >
>> > The question is why do you want to use arg->stop==0 as
>> > an indication for emptiness? Isn't what arg->count==0
>> > supposed to be?
>>
>> Good question! I initially wanted to implement it like that, but
>> reconsidered because iterating through all filters on classifier to
>> count them is O(N), and terminating on first filter and relying on
>> arg->stop==1 is constant time. Making function that is called
>> "tcf_proto_is_empty" linear on number of filters seemed sloppy to me...
>
> Good point, however arg->stop _was_ supposed to set only when
> error happens. Probably you want a new arg here to stop on the first
> entry.

Got it. I'll prepare a patch for that.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex
  2019-02-20 23:00           ` Cong Wang
@ 2019-02-21 17:11             ` Vlad Buslov
  0 siblings, 0 replies; 52+ messages in thread
From: Vlad Buslov @ 2019-02-21 17:11 UTC (permalink / raw)
  To: Cong Wang; +Cc: Ido Schimmel, netdev, jhs, jiri, davem, ast, daniel


On Wed 20 Feb 2019 at 23:00, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 7:20 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Tue 19 Feb 2019 at 05:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >> I looked at the code and problem seems to be matchall classifier
>> >> specific. My implementation of unlocked cls API assumes that concurrent
>> >> insertions are possible and checks for it when deleting "empty" tp.
>> >> Since classifiers don't expose number of elements, the only way to test
>> >> this is to do tp->walk() on them and assume that walk callback is called
>> >> once per filter on every classifier. In your example new tp is created
>> >> for second filter, filter insertion fails, number of elements on newly
>> >> created tp is checked with tp->walk() before deleting it. However,
>> >> matchall classifier always calls the tp->walk() callback once, even when
>> >> it doesn't have a valid filter (in this case with NULL filter pointer).
>> >
>> > Again, this can be eliminated by just switching to normal
>> > non-retry logic. This is yet another headache to review this
>> > kind of unlock-and-retry logic, I have no idea why you are such
>> > a big fan of it.
>>
>> The retry approach was suggested to me multiple times by Jiri on
>> previous code reviews so I assumed it is preferred approach in such
>> cases. I don't have a strong preference in this regard, but locking
>> whole tp on filter update will remove any parallelism when updating same
>> classifier instance concurrently. The goal of these changes is to allow
>> parallel rule update and to achieve that I had to introduce some
>> complexity into the code.
>
> Yeah, but with unlock-and-retry it would waste more time when
> retry occurs. So it can't be better in the worst scenario.
>
> The question is essentially that do we want to waste CPU cycles
> when conflicts occurs or just block there until it is safe to enter
> the critical section?
>
> And, is the retry bound? Is it possible that we would retry infinitely
> as long as we time it correctly?
>
>
>>
>> Now let me explain why these two approaches result completely different
>> performance in this case. Lets start with a list of most CPU-consuming
>> parts in new filter creation process in descending order (raw data at
>> the end of this mail):
>>
>> 1) Hardware offload - if available and no skip_hw.
>> 2) Exts (actions) initalization - most expensive part even with single
>> action, CPU usage increases with number of actions per filter.
>> 3) cls API.
>> 4) Flower classifier data structure initialization.
>>
>> Note that 1)+2) is ~80% of cost of creating a flower filter. So if we
>> just lock the whole flower classifier instance during rule update we
>> serialize 1, 2 and 4, and only cls API (~13% of CPU cost) can be
>> executed concurrently. However, in proposed flower implementation hw
>> offloading and action initialization code is called without any locks
>> and tp->lock is only obtained when modifying flower data structures,
>> which means that only 3) is serialized and everything else (87% of CPU
>> cost) can be executed in parallel.
>
> What about when conflicts detected and retry the whole change?
> And, of course, how often do conflicts happen?
>
> Thanks.

I had similar concerns when designing this change. Lets look at two
cases when this retry is needed.

One process creates first filter on classifier and fails, while other
processes are trying to concurrently add filter to same block/chain/tp:

1) Process obtains filter_chain_lock, performs unsuccessful tp lookup,
   releases the lock.
2) Calls tcf_chain_tp_insert_unique() which obtains filter_chain_lock,
   inserts new tp, releases the lock.
3) Calls tp->ops->change() that returns an error.
4) Calls tcf_chain_tp_delete_empty() which takes filter_chain_lock, verifies that no
   filters were added to tp concurrently, sets tp->deleting flag, removes
   tp from chain.

This is supposed to be very rare occurrence because for retry to happen
it not only requires concurrent insertions to same block/chain/tp, but
also that tp with requested prio didn't exist before and no concurrent
process succeeded in adding at least one filter to tp during step 3
before it is marked for deletion in step 4 (otherwise
tcf_proto_check_delete() fails and concurrent threads don't perform
retry).

Another case is when last filter is being deleted while concurrent
processes adding new filters to same block/chain/tp:

1) tc_del_tfilter() gets last filter with tp->ops->get()
2) Deletes it with tp->ops->delete()...
3) ... that return 'last' hint set to true.
4) Calls tcf_chain_tp_delete_empty() which takes filter_chain_lock, verifies that no
   filters were added to tp concurrently, sets tp->deleting flag, removes
   tp from chain.

This case is also quite rare because it requires concurrent users to
successfully lookup tp before tp->deleting is set to true and tp is
removed from chain, but not create any new filters on tp during that
time.

After considering this I decided that it is not worth it to penalize
common case of updating filters by completely removing parallelism when
updates target same tp instance for such rare corner cases as described
above.

Now regarding forcing users to retry indefinitely. In later cases no
more than one retry is possible because concurrent add processes create
new tp on first retry. In former case multiple retries are possible, but
to block concurrent users indefinitely would require malicious process
to somehow always have priority when obtaining filter_chain_lock during
steps 1, 2, then wait to allow all concurrent users to lookup the tp,
then obtain filter_chain_lock in step 4 and initiate tp deletion before
any of concurrent users that have a reference to this new tp instance
can insert any single filter on it, then go back to step 1, obtain lock
first and repeat. I don't see how this can be timed from userspace
repeatedly as creating first filter on new tp involves multiple cycles
of getting and releasing filter_chain_lock and each of them require
attacker to "influence" kernel scheduler to behave in very specific
fashion.

Regards,
Vlad

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, back to index

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-11  8:55 [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 01/17] net: sched: protect block state with mutex Vlad Buslov
2019-02-11 14:15   ` Jiri Pirko
2019-02-11  8:55 ` [PATCH net-next v4 02/17] net: sched: protect chain->explicitly_created with block->lock Vlad Buslov
2019-02-11 14:15   ` Jiri Pirko
2019-02-11  8:55 ` [PATCH net-next v4 03/17] net: sched: refactor tc_ctl_chain() to use block->lock Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 04/17] net: sched: protect block->chain0 with block->lock Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 05/17] net: sched: traverse chains in block with tcf_get_next_chain() Vlad Buslov
2019-02-15 22:21   ` Cong Wang
2019-02-18 10:07     ` Vlad Buslov
2019-02-18 18:26       ` Cong Wang
2019-02-19 16:04         ` Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 06/17] net: sched: protect chain template accesses with block lock Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
2019-02-14 18:24   ` Ido Schimmel
2019-02-15 10:02     ` Vlad Buslov
2019-02-15 11:30       ` Ido Schimmel
2019-02-15 12:11         ` [PATCH] net: sched: matchall: verify that filter is not NULL in mall_walk() Vlad Buslov
2019-02-15 13:47           ` Ido Schimmel
2019-02-16  0:24           ` Cong Wang
2019-02-18 12:00             ` Vlad Buslov
2019-02-17 21:27           ` David Miller
2019-02-15 12:15         ` [PATCH net-next v4 07/17] net: sched: protect filter_chain list with filter_chain_lock mutex Vlad Buslov
2019-02-15 15:35         ` Vlad Buslov
2019-02-19  5:26           ` Cong Wang
2019-02-19 12:31             ` Vlad Buslov
2019-02-20 22:43               ` Cong Wang
2019-02-21 15:49                 ` Vlad Buslov
2019-02-19  5:08       ` Cong Wang
2019-02-19 15:20         ` Vlad Buslov
2019-02-20 23:00           ` Cong Wang
2019-02-21 17:11             ` Vlad Buslov
2019-02-15 22:35   ` Cong Wang
2019-02-18 11:06     ` Vlad Buslov
2019-02-18 18:31       ` Cong Wang
2019-02-11  8:55 ` [PATCH net-next v4 08/17] net: sched: introduce reference counting for tcf_proto Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 09/17] net: sched: traverse classifiers in chain with tcf_get_next_proto() Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 10/17] net: sched: refactor tp insert/delete for concurrent execution Vlad Buslov
2019-02-15 23:17   ` Cong Wang
2019-02-18 11:19     ` Vlad Buslov
2019-02-18 19:55       ` Cong Wang
2019-02-19 10:25         ` Vlad Buslov
2019-02-18 19:53   ` Cong Wang
2019-02-11  8:55 ` [PATCH net-next v4 11/17] net: sched: prevent insertion of new classifiers during chain flush Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 12/17] net: sched: track rtnl lock status when validating extensions Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 13/17] net: sched: extend proto ops with 'put' callback Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 14/17] net: sched: extend proto ops to support unlocked classifiers Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 15/17] net: sched: add flags to Qdisc class ops struct Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 16/17] net: sched: refactor tcf_block_find() into standalone functions Vlad Buslov
2019-02-11  8:55 ` [PATCH net-next v4 17/17] net: sched: unlock rules update API Vlad Buslov
2019-02-18 18:56   ` Cong Wang
2019-02-12 18:42 ` [PATCH net-next v4 00/17] Refactor classifier API to work with chain/classifiers without rtnl lock David Miller

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org netdev@archiver.kernel.org
	public-inbox-index netdev


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox