All of lore.kernel.org
 help / color / mirror / Atom feed
* Soft lockup in tc_classify
       [not found] <c1c394d3-3aea-52a8-89e3-be57d4d46b8e@mellanox.com>
@ 2016-12-12  9:43 ` Shahar Klein
  2016-12-12 13:28   ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Shahar Klein @ 2016-12-12  9:43 UTC (permalink / raw)
  To: netdev
  Cc: Roi Dayan, David Miller, Cong Wang, Daniel Borkmann, Jiri Pirko,
	John Fastabend, Or Gerlitz, Hadar Hen Zion

Hi All,

sorry for the spam, the first time was sent with html part and was rejected.

We observed an issue where a classifier instance next member is pointing 
back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding a 
new flower rule using tc.

We added a quick workaround to verify it:

In tc_classify:

         for (; tp; tp = rcu_dereference_bh(tp->next)) {
                 int err;
+               if (tp == tp->next)
+                     RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this 
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl 
and rcu locks.
We'll appreciate your help solving this issue.

Thanks,
Shahar

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12  9:43 ` Soft lockup in tc_classify Shahar Klein
@ 2016-12-12 13:28   ` Daniel Borkmann
  2016-12-12 16:04     ` Shahar Klein
  2016-12-12 21:18     ` Or Gerlitz
  0 siblings, 2 replies; 17+ messages in thread
From: Daniel Borkmann @ 2016-12-12 13:28 UTC (permalink / raw)
  To: Shahar Klein, netdev
  Cc: Roi Dayan, David Miller, Cong Wang, Jiri Pirko, John Fastabend,
	Or Gerlitz, Hadar Hen Zion

Hi Shahar,

On 12/12/2016 10:43 AM, Shahar Klein wrote:
> Hi All,
>
> sorry for the spam, the first time was sent with html part and was rejected.
>
> We observed an issue where a classifier instance next member is pointing back to itself, causing a CPU soft lockup.
> We found it by running traffic on many udp connections and then adding a new flower rule using tc.
>
> We added a quick workaround to verify it:
>
> In tc_classify:
>
>          for (; tp; tp = rcu_dereference_bh(tp->next)) {
>                  int err;
> +               if (tp == tp->next)
> +                     RCU_INIT_POINTER(tp->next, NULL);
>
>
> We also had a print here showing tp->next is pointing to tp. With this workaround we are not hitting the issue anymore.
> We are not sure we fully understand the mechanism here - with the rtnl and rcu locks.
> We'll appreciate your help solving this issue.

Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?

Do you have a script and kernel .config for reproducing this?

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12 13:28   ` Daniel Borkmann
@ 2016-12-12 16:04     ` Shahar Klein
  2016-12-12 19:07       ` Cong Wang
  2016-12-12 21:18     ` Or Gerlitz
  1 sibling, 1 reply; 17+ messages in thread
From: Shahar Klein @ 2016-12-12 16:04 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: shahark, Roi Dayan, David Miller, Cong Wang, Jiri Pirko,
	John Fastabend, Or Gerlitz, Hadar Hen Zion



On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
> Hi Shahar,
>
> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>> Hi All,
>>
>> sorry for the spam, the first time was sent with html part and was
>> rejected.
>>
>> We observed an issue where a classifier instance next member is
>> pointing back to itself, causing a CPU soft lockup.
>> We found it by running traffic on many udp connections and then adding
>> a new flower rule using tc.
>>
>> We added a quick workaround to verify it:
>>
>> In tc_classify:
>>
>>          for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>                  int err;
>> +               if (tp == tp->next)
>> +                     RCU_INIT_POINTER(tp->next, NULL);
>>
>>
>> We also had a print here showing tp->next is pointing to tp. With this
>> workaround we are not hitting the issue anymore.
>> We are not sure we fully understand the mechanism here - with the rtnl
>> and rcu locks.
>> We'll appreciate your help solving this issue.
>
> Note that there's still the RCU fix missing for the deletion race that
> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation in involved during that test?
>
> Do you have a script and kernel .config for reproducing this?

I'm using a user space socket 
app(https://github.com/shahar-klein/noodle)on a vm to push udp packets 
from ~2000 different udp src ports ramping up at ~100 per second towards 
another vm on the same Hypervisor. Once the traffic starts I'm pushing 
ingress flower tc udp rules(even_udp_src_port->mirred, odd->drop) on the 
relevant representor in the Hypervisor.

>
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12 16:04     ` Shahar Klein
@ 2016-12-12 19:07       ` Cong Wang
  2016-12-13 11:59         ` Shahar Klein
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2016-12-12 19:07 UTC (permalink / raw)
  To: Shahar Klein
  Cc: Daniel Borkmann, Linux Kernel Network Developers, Roi Dayan,
	David Miller, Jiri Pirko, John Fastabend, Or Gerlitz,
	Hadar Hen Zion

On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein <shahark@mellanox.com> wrote:
>
>
> On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
>>
>> Hi Shahar,
>>
>> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>>>
>>> Hi All,
>>>
>>> sorry for the spam, the first time was sent with html part and was
>>> rejected.
>>>
>>> We observed an issue where a classifier instance next member is
>>> pointing back to itself, causing a CPU soft lockup.
>>> We found it by running traffic on many udp connections and then adding
>>> a new flower rule using tc.
>>>
>>> We added a quick workaround to verify it:
>>>
>>> In tc_classify:
>>>
>>>          for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>>                  int err;
>>> +               if (tp == tp->next)
>>> +                     RCU_INIT_POINTER(tp->next, NULL);
>>>
>>>
>>> We also had a print here showing tp->next is pointing to tp. With this
>>> workaround we are not hitting the issue anymore.
>>> We are not sure we fully understand the mechanism here - with the rtnl
>>> and rcu locks.
>>> We'll appreciate your help solving this issue.
>>
>>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
>> add a single rule, but no other operation in involved during that test?

Hmm, I thought RCU_INIT_POINTER() respects readers, but seems no?
If so, that could be the cause since we play with the next pointer and
there is only one filter in this case, but I don't see why we could have
a loop here.

>>
>> Do you have a script and kernel .config for reproducing this?
>
>
> I'm using a user space socket app(https://github.com/shahar-klein/noodle)on
> a vm to push udp packets from ~2000 different udp src ports ramping up at
> ~100 per second towards another vm on the same Hypervisor. Once the traffic
> starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred,
> odd->drop) on the relevant representor in the Hypervisor.

Do you mind to share your `tc filter show dev...` output? Also, since you
mentioned you only add one flower filter, just want to make sure you never
delete any filter before/when the bug happens? How reproducible is this?

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12 13:28   ` Daniel Borkmann
  2016-12-12 16:04     ` Shahar Klein
@ 2016-12-12 21:18     ` Or Gerlitz
  2016-12-12 22:51       ` Cong Wang
  1 sibling, 1 reply; 17+ messages in thread
From: Or Gerlitz @ 2016-12-12 21:18 UTC (permalink / raw)
  To: Daniel Borkmann, Cong Wang
  Cc: Shahar Klein, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion

On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:

> Note that there's still the RCU fix missing for the deletion race that
> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation in involved during that test?

What's missing to have the deletion race fixed? making a patch or
testing to a patch which was sent?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12 21:18     ` Or Gerlitz
@ 2016-12-12 22:51       ` Cong Wang
  2016-12-19 16:39         ` Shahar Klein
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2016-12-12 22:51 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Daniel Borkmann, Shahar Klein, Linux Netdev List, Roi Dayan,
	David Miller, Jiri Pirko, John Fastabend, Hadar Hen Zion

[-- Attachment #1: Type: text/plain, Size: 711 bytes --]

On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
>> add a single rule, but no other operation in involved during that test?
>
> What's missing to have the deletion race fixed? making a patch or
> testing to a patch which was sent?

If you think it would help for this problem, here is my patch rebased
on the latest net-next.

Again, I don't see how it could help this case yet, especially I don't
see how we could have a loop in this singly linked list.

[-- Attachment #2: tc-filter-destroy.diff --]
[-- Type: text/plain, Size: 21977 bytes --]

commit f6becda1e12fd8ef74e901fe39adb4558ce6c8f9
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Wed Nov 23 14:58:01 2016 -0800

    net_sched: move the empty tp check from ->destroy() to ->delete()
    
    Roi reported we could have a race condition where in ->classify() path
    we dereference tp->root and meanwhile a parallel ->destroy() makes it
    a NULL.
    
    This is possible because ->destroy() could be called when deleting
    a filter to check if we are the last one in tp, this tp is still
    linked and visible at that time.
    
    The root cause of this problem is the semantic of ->destroy(), it
    does two things (for non-force case):
    
    1) check if tp is empty
    2) if tp is empty we could really destroy it
    
    and its caller, if cares, needs to check its return value to see if
    it is really destroyed. Therefore we can't unlink tp unless we know
    it is empty.
    
    As suggested by Daniel, we could actually move the test logic to ->delete()
    so that we can safely unlink tp after ->delete() tells us the last one is
    just deleted and before ->destroy().
    
    What's more, even we unlink it before ->destroy(), it could still have
    readers since we don't wait for a grace period here, we should not modify
    tp->root in ->destroy() either.
    
    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Reported-by: Roi Dayan <roid@mellanox.com>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: John Fastabend <john.fastabend@gmail.com>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 498f81b..b5eda3f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -203,14 +203,14 @@ struct tcf_proto_ops {
 					    const struct tcf_proto *,
 					    struct tcf_result *);
 	int			(*init)(struct tcf_proto*);
-	bool			(*destroy)(struct tcf_proto*, bool);
+	void			(*destroy)(struct tcf_proto*);
 
 	unsigned long		(*get)(struct tcf_proto*, u32 handle);
 	int			(*change)(struct net *net, struct sk_buff *,
 					struct tcf_proto*, unsigned long,
 					u32 handle, struct nlattr **,
 					unsigned long *, bool);
-	int			(*delete)(struct tcf_proto*, unsigned long);
+	int			(*delete)(struct tcf_proto*, unsigned long, bool*);
 	void			(*walk)(struct tcf_proto*, struct tcf_walker *arg);
 
 	/* rtnetlink specific */
@@ -405,7 +405,7 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue *dev_queue,
 				const struct Qdisc_ops *ops, u32 parentid);
 void __qdisc_calculate_pkt_len(struct sk_buff *skb,
 			       const struct qdisc_size_table *stab);
-bool tcf_destroy(struct tcf_proto *tp, bool force);
+void tcf_destroy(struct tcf_proto *tp);
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
 int skb_do_redirect(struct sk_buff *);
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..f9179e0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -321,7 +321,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 
 			tfilter_notify(net, skb, n, tp, fh,
 				       RTM_DELTFILTER, false);
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 			err = 0;
 			goto errout;
 		}
@@ -331,25 +331,29 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		    !(n->nlmsg_flags & NLM_F_CREATE))
 			goto errout;
 	} else {
+		bool last;
+
 		switch (n->nlmsg_type) {
 		case RTM_NEWTFILTER:
 			err = -EEXIST;
 			if (n->nlmsg_flags & NLM_F_EXCL) {
 				if (tp_created)
-					tcf_destroy(tp, true);
+					tcf_destroy(tp);
 				goto errout;
 			}
 			break;
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {
-				struct tcf_proto *next = rtnl_dereference(tp->next);
-
 				tfilter_notify(net, skb, n, tp,
 					       t->tcm_handle,
 					       RTM_DELTFILTER, false);
-				if (tcf_destroy(tp, false))
+				if (last) {
+					struct tcf_proto *next = rtnl_dereference(tp->next);
+
 					RCU_INIT_POINTER(*back, next);
+					tcf_destroy(tp);
+				}
 			}
 			goto errout;
 		case RTM_GETTFILTER:
@@ -372,7 +376,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
 	} else {
 		if (tp_created)
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 	}
 
 errout:
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 5877f60..8d822e5 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -93,30 +93,28 @@ static void basic_delete_filter(struct rcu_head *head)
 	kfree(f);
 }
 
-static bool basic_destroy(struct tcf_proto *tp, bool force)
+static void basic_destroy(struct tcf_proto *tp)
 {
 	struct basic_head *head = rtnl_dereference(tp->root);
 	struct basic_filter *f, *n;
 
-	if (!force && !list_empty(&head->flist))
-		return false;
-
 	list_for_each_entry_safe(f, n, &head->flist, link) {
 		list_del_rcu(&f->link);
 		tcf_unbind_filter(tp, &f->res);
 		call_rcu(&f->rcu, basic_delete_filter);
 	}
 	kfree_rcu(head, rcu);
-	return true;
 }
 
-static int basic_delete(struct tcf_proto *tp, unsigned long arg)
+static int basic_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
+	struct basic_head *head = rtnl_dereference(tp->root);
 	struct basic_filter *f = (struct basic_filter *) arg;
 
 	list_del_rcu(&f->link);
 	tcf_unbind_filter(tp, &f->res);
 	call_rcu(&f->rcu, basic_delete_filter);
+	*last = list_empty(&head->flist);
 	return 0;
 }
 
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index adc7760..55c9961 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -268,25 +268,24 @@ static void __cls_bpf_delete(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 	call_rcu(&prog->rcu, cls_bpf_delete_prog_rcu);
 }
 
-static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg)
+static int cls_bpf_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
+	struct cls_bpf_head *head = rtnl_dereference(tp->root);
+
 	__cls_bpf_delete(tp, (struct cls_bpf_prog *) arg);
+	*last = list_empty(&head->plist);
 	return 0;
 }
 
-static bool cls_bpf_destroy(struct tcf_proto *tp, bool force)
+static void cls_bpf_destroy(struct tcf_proto *tp)
 {
 	struct cls_bpf_head *head = rtnl_dereference(tp->root);
 	struct cls_bpf_prog *prog, *tmp;
 
-	if (!force && !list_empty(&head->plist))
-		return false;
-
 	list_for_each_entry_safe(prog, tmp, &head->plist, link)
 		__cls_bpf_delete(tp, prog);
 
 	kfree_rcu(head, rcu);
-	return true;
 }
 
 static unsigned long cls_bpf_get(struct tcf_proto *tp, u32 handle)
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index c1f2007..51c822d 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -131,20 +131,16 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static bool cls_cgroup_destroy(struct tcf_proto *tp, bool force)
+static void cls_cgroup_destroy(struct tcf_proto *tp)
 {
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 
-	if (!force)
-		return false;
 	/* Head can still be NULL due to cls_cgroup_init(). */
 	if (head)
 		call_rcu(&head->rcu, cls_cgroup_destroy_rcu);
-
-	return true;
 }
 
-static int cls_cgroup_delete(struct tcf_proto *tp, unsigned long arg)
+static int cls_cgroup_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 6575aba..ea2be75 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -563,12 +563,14 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static int flow_delete(struct tcf_proto *tp, unsigned long arg)
+static int flow_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
+	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f = (struct flow_filter *)arg;
 
 	list_del_rcu(&f->list);
 	call_rcu(&f->rcu, flow_destroy_filter);
+	*last = list_empty(&head->filters);
 	return 0;
 }
 
@@ -584,20 +586,16 @@ static int flow_init(struct tcf_proto *tp)
 	return 0;
 }
 
-static bool flow_destroy(struct tcf_proto *tp, bool force)
+static void flow_destroy(struct tcf_proto *tp)
 {
 	struct flow_head *head = rtnl_dereference(tp->root);
 	struct flow_filter *f, *next;
 
-	if (!force && !list_empty(&head->filters))
-		return false;
-
 	list_for_each_entry_safe(f, next, &head->filters, list) {
 		list_del_rcu(&f->list);
 		call_rcu(&f->rcu, flow_destroy_filter);
 	}
 	kfree_rcu(head, rcu);
-	return true;
 }
 
 static unsigned long flow_get(struct tcf_proto *tp, u32 handle)
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e040c51..328938b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -312,21 +312,16 @@ static void fl_destroy_rcu(struct rcu_head *rcu)
 	schedule_work(&head->work);
 }
 
-static bool fl_destroy(struct tcf_proto *tp, bool force)
+static void fl_destroy(struct tcf_proto *tp)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct cls_fl_filter *f, *next;
 
-	if (!force && !list_empty(&head->filters))
-		return false;
-
 	list_for_each_entry_safe(f, next, &head->filters, list)
 		__fl_delete(tp, f);
 
 	__module_get(THIS_MODULE);
 	call_rcu(&head->rcu, fl_destroy_rcu);
-
-	return true;
 }
 
 static unsigned long fl_get(struct tcf_proto *tp, u32 handle)
@@ -877,7 +872,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static int fl_delete(struct tcf_proto *tp, unsigned long arg)
+static int fl_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct cls_fl_head *head = rtnl_dereference(tp->root);
 	struct cls_fl_filter *f = (struct cls_fl_filter *) arg;
@@ -886,6 +881,7 @@ static int fl_delete(struct tcf_proto *tp, unsigned long arg)
 		rhashtable_remove_fast(&head->ht, &f->ht_node,
 				       head->ht_params);
 	__fl_delete(tp, f);
+	*last = list_empty(&head->filters);
 	return 0;
 }
 
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 9dc63d5..bc8ceb7 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -127,20 +127,14 @@ static void fw_delete_filter(struct rcu_head *head)
 	kfree(f);
 }
 
-static bool fw_destroy(struct tcf_proto *tp, bool force)
+static void fw_destroy(struct tcf_proto *tp)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	struct fw_filter *f;
 	int h;
 
 	if (head == NULL)
-		return true;
-
-	if (!force) {
-		for (h = 0; h < HTSIZE; h++)
-			if (rcu_access_pointer(head->ht[h]))
-				return false;
-	}
+		return;
 
 	for (h = 0; h < HTSIZE; h++) {
 		while ((f = rtnl_dereference(head->ht[h])) != NULL) {
@@ -150,17 +144,17 @@ static bool fw_destroy(struct tcf_proto *tp, bool force)
 			call_rcu(&f->rcu, fw_delete_filter);
 		}
 	}
-	RCU_INIT_POINTER(tp->root, NULL);
 	kfree_rcu(head, rcu);
-	return true;
 }
 
-static int fw_delete(struct tcf_proto *tp, unsigned long arg)
+static int fw_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
 	struct fw_filter *f = (struct fw_filter *)arg;
 	struct fw_filter __rcu **fp;
 	struct fw_filter *pfp;
+	int ret = -EINVAL;
+	int h;
 
 	if (head == NULL || f == NULL)
 		goto out;
@@ -173,11 +167,21 @@ static int fw_delete(struct tcf_proto *tp, unsigned long arg)
 			RCU_INIT_POINTER(*fp, rtnl_dereference(f->next));
 			tcf_unbind_filter(tp, &f->res);
 			call_rcu(&f->rcu, fw_delete_filter);
-			return 0;
+			ret = 0;
+			break;
 		}
 	}
+
+	*last = true;
+	for (h = 0; h < HTSIZE; h++) {
+		if (rcu_access_pointer(head->ht[h])) {
+			*last = false;
+			break;
+		}
+	}
+
 out:
-	return -EINVAL;
+	return ret;
 }
 
 static const struct nla_policy fw_policy[TCA_FW_MAX + 1] = {
diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index f935429..7d54805 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -99,15 +99,12 @@ static void mall_destroy_hw_filter(struct tcf_proto *tp,
 					     &offload);
 }
 
-static bool mall_destroy(struct tcf_proto *tp, bool force)
+static void mall_destroy(struct tcf_proto *tp)
 {
 	struct cls_mall_head *head = rtnl_dereference(tp->root);
 	struct net_device *dev = tp->q->dev_queue->dev;
 	struct cls_mall_filter *f = head->filter;
 
-	if (!force && f)
-		return false;
-
 	if (f) {
 		if (tc_should_offload(dev, tp, f->flags))
 			mall_destroy_hw_filter(tp, f, (unsigned long) f);
@@ -115,7 +112,6 @@ static bool mall_destroy(struct tcf_proto *tp, bool force)
 		call_rcu(&f->rcu, mall_destroy_filter);
 	}
 	kfree_rcu(head, rcu);
-	return true;
 }
 
 static unsigned long mall_get(struct tcf_proto *tp, u32 handle)
@@ -224,7 +220,7 @@ static int mall_change(struct net *net, struct sk_buff *in_skb,
 	return err;
 }
 
-static int mall_delete(struct tcf_proto *tp, unsigned long arg)
+static int mall_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct cls_mall_head *head = rtnl_dereference(tp->root);
 	struct cls_mall_filter *f = (struct cls_mall_filter *) arg;
@@ -236,6 +232,7 @@ static int mall_delete(struct tcf_proto *tp, unsigned long arg)
 	RCU_INIT_POINTER(head->filter, NULL);
 	tcf_unbind_filter(tp, &f->res);
 	call_rcu(&f->rcu, mall_destroy_filter);
+	*last = true;
 	return 0;
 }
 
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index 455fc8f..1a38e41 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -276,20 +276,13 @@ static void route4_delete_filter(struct rcu_head *head)
 	kfree(f);
 }
 
-static bool route4_destroy(struct tcf_proto *tp, bool force)
+static void route4_destroy(struct tcf_proto *tp)
 {
 	struct route4_head *head = rtnl_dereference(tp->root);
 	int h1, h2;
 
 	if (head == NULL)
-		return true;
-
-	if (!force) {
-		for (h1 = 0; h1 <= 256; h1++) {
-			if (rcu_access_pointer(head->table[h1]))
-				return false;
-		}
-	}
+		return;
 
 	for (h1 = 0; h1 <= 256; h1++) {
 		struct route4_bucket *b;
@@ -312,12 +305,10 @@ static bool route4_destroy(struct tcf_proto *tp, bool force)
 			kfree_rcu(b, rcu);
 		}
 	}
-	RCU_INIT_POINTER(tp->root, NULL);
 	kfree_rcu(head, rcu);
-	return true;
 }
 
-static int route4_delete(struct tcf_proto *tp, unsigned long arg)
+static int route4_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct route4_head *head = rtnl_dereference(tp->root);
 	struct route4_filter *f = (struct route4_filter *)arg;
@@ -325,7 +316,7 @@ static int route4_delete(struct tcf_proto *tp, unsigned long arg)
 	struct route4_filter *nf;
 	struct route4_bucket *b;
 	unsigned int h = 0;
-	int i;
+	int i, h1;
 
 	if (!head || !f)
 		return -EINVAL;
@@ -356,16 +347,25 @@ static int route4_delete(struct tcf_proto *tp, unsigned long arg)
 
 				rt = rtnl_dereference(b->ht[i]);
 				if (rt)
-					return 0;
+					goto out;
 			}
 
 			/* OK, session has no flows */
 			RCU_INIT_POINTER(head->table[to_hash(h)], NULL);
 			kfree_rcu(b, rcu);
+			break;
+		}
+	}
 
-			return 0;
+out:
+	*last = true;
+	for (h1 = 0; h1 <= 256; h1++) {
+		if (rcu_access_pointer(head->table[h1])) {
+			*last = false;
+			break;
 		}
 	}
+
 	return 0;
 }
 
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 322438f..1aaff10 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -302,22 +302,13 @@ static void rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
 	call_rcu(&f->rcu, rsvp_delete_filter_rcu);
 }
 
-static bool rsvp_destroy(struct tcf_proto *tp, bool force)
+static void rsvp_destroy(struct tcf_proto *tp)
 {
 	struct rsvp_head *data = rtnl_dereference(tp->root);
 	int h1, h2;
 
 	if (data == NULL)
-		return true;
-
-	if (!force) {
-		for (h1 = 0; h1 < 256; h1++) {
-			if (rcu_access_pointer(data->ht[h1]))
-				return false;
-		}
-	}
-
-	RCU_INIT_POINTER(tp->root, NULL);
+		return;
 
 	for (h1 = 0; h1 < 256; h1++) {
 		struct rsvp_session *s;
@@ -337,10 +328,9 @@ static bool rsvp_destroy(struct tcf_proto *tp, bool force)
 		}
 	}
 	kfree_rcu(data, rcu);
-	return true;
 }
 
-static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
+static int rsvp_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct rsvp_head *head = rtnl_dereference(tp->root);
 	struct rsvp_filter *nfp, *f = (struct rsvp_filter *)arg;
@@ -348,7 +338,7 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
 	unsigned int h = f->handle;
 	struct rsvp_session __rcu **sp;
 	struct rsvp_session *nsp, *s = f->sess;
-	int i;
+	int i, h1;
 
 	fp = &s->ht[(h >> 8) & 0xFF];
 	for (nfp = rtnl_dereference(*fp); nfp;
@@ -361,7 +351,7 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
 
 			for (i = 0; i <= 16; i++)
 				if (s->ht[i])
-					return 0;
+					goto out;
 
 			/* OK, session has no flows */
 			sp = &head->ht[h & 0xFF];
@@ -370,13 +360,23 @@ static int rsvp_delete(struct tcf_proto *tp, unsigned long arg)
 				if (nsp == s) {
 					RCU_INIT_POINTER(*sp, s->next);
 					kfree_rcu(s, rcu);
-					return 0;
+					goto out;
 				}
 			}
 
-			return 0;
+			break;
 		}
 	}
+
+out:
+	*last = true;
+	for (h1 = 0; h1 < 256; h1++) {
+		if (rcu_access_pointer(head->ht[h1])) {
+			*last = false;
+			break;
+		}
+	}
+
 	return 0;
 }
 
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index 0751245..9149a03 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -150,7 +150,7 @@ static void tcindex_destroy_fexts(struct rcu_head *head)
 	kfree(f);
 }
 
-static int tcindex_delete(struct tcf_proto *tp, unsigned long arg)
+static int tcindex_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct tcindex_data *p = rtnl_dereference(tp->root);
 	struct tcindex_filter_result *r = (struct tcindex_filter_result *) arg;
@@ -186,6 +186,8 @@ static int tcindex_delete(struct tcf_proto *tp, unsigned long arg)
 		call_rcu(&f->rcu, tcindex_destroy_fexts);
 	else
 		call_rcu(&r->rcu, tcindex_destroy_rexts);
+
+	*last = false;
 	return 0;
 }
 
@@ -193,7 +195,9 @@ static int tcindex_destroy_element(struct tcf_proto *tp,
 				   unsigned long arg,
 				   struct tcf_walker *walker)
 {
-	return tcindex_delete(tp, arg);
+	bool last;
+
+	return tcindex_delete(tp, arg, &last);
 }
 
 static void __tcindex_destroy(struct rcu_head *head)
@@ -529,14 +533,11 @@ static void tcindex_walk(struct tcf_proto *tp, struct tcf_walker *walker)
 	}
 }
 
-static bool tcindex_destroy(struct tcf_proto *tp, bool force)
+static void tcindex_destroy(struct tcf_proto *tp)
 {
 	struct tcindex_data *p = rtnl_dereference(tp->root);
 	struct tcf_walker walker;
 
-	if (!force)
-		return false;
-
 	pr_debug("tcindex_destroy(tp %p),p %p\n", tp, p);
 	walker.count = 0;
 	walker.skip = 0;
@@ -544,7 +545,6 @@ static bool tcindex_destroy(struct tcf_proto *tp, bool force)
 	tcindex_walk(tp, &walker);
 
 	call_rcu(&p->rcu, __tcindex_destroy);
-	return true;
 }
 
 
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index ae83c3ae..787573b 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -582,37 +582,13 @@ static bool ht_empty(struct tc_u_hnode *ht)
 	return true;
 }
 
-static bool u32_destroy(struct tcf_proto *tp, bool force)
+static void u32_destroy(struct tcf_proto *tp)
 {
 	struct tc_u_common *tp_c = tp->data;
 	struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
 
 	WARN_ON(root_ht == NULL);
 
-	if (!force) {
-		if (root_ht) {
-			if (root_ht->refcnt > 1)
-				return false;
-			if (root_ht->refcnt == 1) {
-				if (!ht_empty(root_ht))
-					return false;
-			}
-		}
-
-		if (tp_c->refcnt > 1)
-			return false;
-
-		if (tp_c->refcnt == 1) {
-			struct tc_u_hnode *ht;
-
-			for (ht = rtnl_dereference(tp_c->hlist);
-			     ht;
-			     ht = rtnl_dereference(ht->next))
-				if (!ht_empty(ht))
-					return false;
-		}
-	}
-
 	if (root_ht && --root_ht->refcnt == 0)
 		u32_destroy_hnode(tp, root_ht);
 
@@ -637,20 +613,22 @@ static bool u32_destroy(struct tcf_proto *tp, bool force)
 	}
 
 	tp->data = NULL;
-	return true;
 }
 
-static int u32_delete(struct tcf_proto *tp, unsigned long arg)
+static int u32_delete(struct tcf_proto *tp, unsigned long arg, bool *last)
 {
 	struct tc_u_hnode *ht = (struct tc_u_hnode *)arg;
 	struct tc_u_hnode *root_ht = rtnl_dereference(tp->root);
+	struct tc_u_common *tp_c = tp->data;
+	int ret = 0;
 
 	if (ht == NULL)
-		return 0;
+		goto out;
 
 	if (TC_U32_KEY(ht->handle)) {
 		u32_remove_hw_knode(tp, ht->handle);
-		return u32_delete_key(tp, (struct tc_u_knode *)ht);
+		ret = u32_delete_key(tp, (struct tc_u_knode *)ht);
+		goto out;
 	}
 
 	if (root_ht == ht)
@@ -663,7 +641,40 @@ static int u32_delete(struct tcf_proto *tp, unsigned long arg)
 		return -EBUSY;
 	}
 
-	return 0;
+out:
+	*last = true;
+	if (root_ht) {
+		if (root_ht->refcnt > 1) {
+			*last = false;
+			goto ret;
+		}
+		if (root_ht->refcnt == 1) {
+			if (!ht_empty(root_ht)) {
+				*last = false;
+				goto ret;
+			}
+		}
+	}
+
+	if (tp_c->refcnt > 1) {
+		*last = false;
+		goto ret;
+	}
+
+	if (tp_c->refcnt == 1) {
+		struct tc_u_hnode *ht;
+
+		for (ht = rtnl_dereference(tp_c->hlist);
+		     ht;
+		     ht = rtnl_dereference(ht->next))
+			if (!ht_empty(ht)) {
+				*last = false;
+				break;
+			}
+	}
+
+ret:
+	return ret;
 }
 
 #define NR_U32_NODE (1<<12)
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index d7b9342..20293ee 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1899,15 +1899,11 @@ int tc_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 }
 EXPORT_SYMBOL(tc_classify);
 
-bool tcf_destroy(struct tcf_proto *tp, bool force)
+void tcf_destroy(struct tcf_proto *tp)
 {
-	if (tp->ops->destroy(tp, force)) {
-		module_put(tp->ops->owner);
-		kfree_rcu(tp, rcu);
-		return true;
-	}
-
-	return false;
+	tp->ops->destroy(tp);
+	module_put(tp->ops->owner);
+	kfree_rcu(tp, rcu);
 }
 
 void tcf_destroy_chain(struct tcf_proto __rcu **fl)
@@ -1916,7 +1912,7 @@ void tcf_destroy_chain(struct tcf_proto __rcu **fl)
 
 	while ((tp = rtnl_dereference(*fl)) != NULL) {
 		RCU_INIT_POINTER(*fl, tp->next);
-		tcf_destroy(tp, true);
+		tcf_destroy(tp);
 	}
 }
 EXPORT_SYMBOL(tcf_destroy_chain);

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12 19:07       ` Cong Wang
@ 2016-12-13 11:59         ` Shahar Klein
  0 siblings, 0 replies; 17+ messages in thread
From: Shahar Klein @ 2016-12-13 11:59 UTC (permalink / raw)
  To: Cong Wang
  Cc: shahark, Daniel Borkmann, Linux Kernel Network Developers,
	Roi Dayan, David Miller, Jiri Pirko, John Fastabend, Or Gerlitz,
	Hadar Hen Zion

[-- Attachment #1: Type: text/plain, Size: 3584 bytes --]



On 12/12/2016 9:07 PM, Cong Wang wrote:
> On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein <shahark@mellanox.com> wrote:
>>
>>
>> On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
>>>
>>> Hi Shahar,
>>>
>>> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>>>>
>>>> Hi All,
>>>>
>>>> sorry for the spam, the first time was sent with html part and was
>>>> rejected.
>>>>
>>>> We observed an issue where a classifier instance next member is
>>>> pointing back to itself, causing a CPU soft lockup.
>>>> We found it by running traffic on many udp connections and then adding
>>>> a new flower rule using tc.
>>>>
>>>> We added a quick workaround to verify it:
>>>>
>>>> In tc_classify:
>>>>
>>>>          for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>>>                  int err;
>>>> +               if (tp == tp->next)
>>>> +                     RCU_INIT_POINTER(tp->next, NULL);
>>>>
>>>>
>>>> We also had a print here showing tp->next is pointing to tp. With this
>>>> workaround we are not hitting the issue anymore.
>>>> We are not sure we fully understand the mechanism here - with the rtnl
>>>> and rcu locks.
>>>> We'll appreciate your help solving this issue.
>>>
>>>
>>> Note that there's still the RCU fix missing for the deletion race that
>>> Cong will still send out, but you say that the only thing you do is to
>>> add a single rule, but no other operation in involved during that test?
>
> Hmm, I thought RCU_INIT_POINTER() respects readers, but seems no?
> If so, that could be the cause since we play with the next pointer and
> there is only one filter in this case, but I don't see why we could have
> a loop here.
>
>>>
>>> Do you have a script and kernel .config for reproducing this?
>>
>>
>> I'm using a user space socket app(https://github.com/shahar-klein/noodle)on
>> a vm to push udp packets from ~2000 different udp src ports ramping up at
>> ~100 per second towards another vm on the same Hypervisor. Once the traffic
>> starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred,
>> odd->drop) on the relevant representor in the Hypervisor.
>
> Do you mind to share your `tc filter show dev...` output? Also, since you
> mentioned you only add one flower filter, just want to make sure you never
> delete any filter before/when the bug happens? How reproducible is this?


The bridge between the two vms is based on ovs and representors.
We have a dpif in the ovs creating tc rules from ovs rules.
We set up 5000 open flow rules looks like this:
cook......, udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7000 actions=drop
cook......, udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7002 actions=drop
cook......, udp,dl_dst=24:8a:07:38:a2:b2,tp_src=7004 actions=drop
.
.
.

and then fire up 2000 udp flows starting at udp src 7000 and ramping up 
at 100 flows per second so after 20 seconds we suppose to have 2000 
active udp flows and half of them are dropped at the tc level.

The first packet of any such match hits the miss rule in the ovs 
datapath and pushed up to the user space ovs which consult the open 
flows rules above and translate the ovs rule to tc rule and push the 
rule back to the kernel via netlink.
I'm not sure I understand what happens to the second packet of the same 
match or all the following packets in the same match till the tc 
datapath is 'ready' for them.

The soft lockup is easily reproducible using this scenario but it won't 
happen if we use a much more easy traffic scheme first, say 100 udp 
flows at 3 per second.

I added a print and a panic when hitting the loop(output attached)

Also attached our .config


>
> Thanks!
>

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34095 bytes --]

[-- Attachment #3: tc_classify_panic.gz --]
[-- Type: application/gzip, Size: 1161 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-12 22:51       ` Cong Wang
@ 2016-12-19 16:39         ` Shahar Klein
  2016-12-19 17:58           ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Shahar Klein @ 2016-12-19 16:39 UTC (permalink / raw)
  To: Cong Wang, Or Gerlitz
  Cc: shahark, Daniel Borkmann, Linux Netdev List, Roi Dayan,
	David Miller, Jiri Pirko, John Fastabend, Hadar Hen Zion,
	Daniel Borkmann

[-- Attachment #1: Type: text/plain, Size: 1583 bytes --]



On 12/13/2016 12:51 AM, Cong Wang wrote:
> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>>> Note that there's still the RCU fix missing for the deletion race that
>>> Cong will still send out, but you say that the only thing you do is to
>>> add a single rule, but no other operation in involved during that test?
>>
>> What's missing to have the deletion race fixed? making a patch or
>> testing to a patch which was sent?
>
> If you think it would help for this problem, here is my patch rebased
> on the latest net-next.
>
> Again, I don't see how it could help this case yet, especially I don't
> see how we could have a loop in this singly linked list.
>

I've applied cong's patch and hit a different lockup(full log attached):

[  264.725414] RIP: 0010:fl_classify+0x1f1/0x2b0 [cls_flower]

(gdb) list *(fl_classify+0x1f1)
0x1591 is in fl_classify (net/sched/cls_flower.c:187).
182		if (f && !tc_skip_sw(f->flags)) {
183			*res = f->res;
184			return tcf_exts_exec(skb, &f->exts, res);
185		}
186		return -1;
187	}
188	
189	static int fl_init(struct tcf_proto *tp)
190	{
191		struct cls_fl_head *head;
(gdb)

Daniel suggested I'll add a print:
                 case RTM_DELTFILTER:
-                   err = tp->ops->delete(tp, fh);
+                 printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+                 err = tp->ops->delete(tp, fh, &last);
                         if (err == 0) {

and I couldn't see this print in the output.....





[-- Attachment #2: lockup_with_congs_patch --]
[-- Type: text/plain, Size: 16806 bytes --]

[  238.507670] GACT probability on
[  264.573942] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:4867]
[  264.582875] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole bridge stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm ib_core irqbypass crct10dif_pclmul crc32_pclmul igb ptp iTCO_wdt pps_core crc32c_intel joydev i2c_algo_bit ghash_clmulni_intel iTCO_vendor_support pcspkr i2c_i801 mei_me ipmi_ssif ioatdma shpchp tpm_tis i2c_smbus tpm_tis_core dca tpm mei lpc_ich ipmi_si ipmi_msghandler wmi nfsd target_core_mod auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  264.703539] CPU: 0 PID: 4867 Comm: modprobe Not tainted 4.9.0+ #27
[  264.710766] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  264.718475] task: ffff89916ad09d80 task.stack: ffffa36f44510000
[  264.725414] RIP: 0010:fl_classify+0x1f1/0x2b0 [cls_flower]
[  264.731863] RSP: 0018:ffff89916fa03ad8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  264.740971] RAX: 00000000ffffffff RBX: ffff89913f522100 RCX: 0000000000000000
[  264.749265] RDX: ffff89916fa03c98 RSI: ffff89915bd743c0 RDI: ffff89913f522100
[  264.757558] RBP: ffff89916fa03c28 R08: 000000000000270f R09: 0000000000000000
[  264.765852] R10: 0000000000000000 R11: 0000000000000004 R12: ffff89916fa03c98
[  264.774144] R13: ffff89913ecc7c00 R14: ffff89915bd743c0 R15: 0000000000000001
[  264.782440] FS:  00007f74f2f5e700(0000) GS:ffff89916fa00000(0000) knlGS:0000000000000000
[  264.792066] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  264.798809] CR2: 00007f74f2491480 CR3: 000000043f3f8000 CR4: 00000000000426f0
[  264.807102] Call Trace:
[  264.810151]  <IRQ>
[  264.812717]  ? handle_edge_irq+0x87/0x130
[  264.817504]  ? handle_irq+0xab/0x130
[  264.821808]  ? irq_exit+0x77/0xe0
[  264.825828]  ? irq_exit+0x77/0xe0
[  264.829849]  ? do_IRQ+0x51/0xd0
[  264.833664]  ? common_interrupt+0x93/0x93
[  264.838462]  tc_classify+0x78/0x120
[  264.842679]  __netif_receive_skb_core+0x623/0xa00
[  264.848264]  ? udp4_gro_receive+0x10b/0x2d0
[  264.853253]  __netif_receive_skb+0x18/0x60
[  264.858150]  netif_receive_skb_internal+0x40/0xb0
[  264.863723]  napi_gro_receive+0xcd/0x120
[  264.868437]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  264.875105]  ? kvm_irq_delivery_to_apic+0x2c0/0x2c0 [kvm]
[  264.881462]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  264.887333]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  264.893102]  net_rx_action+0x260/0x3c0
[  264.897610]  __do_softirq+0xc9/0x28c
[  264.901922]  irq_exit+0xd7/0xe0
[  264.905748]  do_IRQ+0x51/0xd0
[  264.909380]  common_interrupt+0x93/0x93
[  264.913984] RIP: 0010:mpihelp_submul_1+0x91/0xe0
[  264.919460] RSP: 0018:ffffa36f44513940 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffa2
[  264.928565] RAX: 5414d8fd4127cc59 RBX: 0000000100000000 RCX: ffffffffffffff50
[  264.936860] RDX: 00000000ffffffea RSI: 0000000019ba7e4d RDI: 0000000068ef6b86
[  264.945146] RBP: ffffa36f44513960 R08: 352cbbd4573982a8 R09: 64589019128fea22
[  264.953440] R10: 128fea2200000000 R11: 352cbbd5573982a8 R12: ffff89955c585c00
[  264.961732] R13: ffff899134dbfbd0 R14: 66a4c31fb80cf128 R15: d7554d389539f6c8
[  264.970028]  </IRQ>
[  264.972688]  mpihelp_divrem+0x23b/0x740
[  264.977291]  mpi_powm+0x471/0xa10
[  264.981312]  rsa_verify+0xa5/0x130
[  264.985421]  pkcs1pad_verify+0xb9/0xf0
[  264.989925]  public_key_verify_signature+0x1f7/0x280
[  264.995790]  public_key_verify_signature_2+0x15/0x20
[  265.001656]  verify_signature+0x3c/0x50
[  265.006259]  pkcs7_validate_trust+0xa1/0x200
[  265.011352]  verify_pkcs7_signature+0x9a/0x140
[  265.016636]  mod_verify_sig+0x92/0xe0
[  265.021046]  load_module+0x171/0x2810
[  265.025456]  ? vfs_read+0x113/0x130
[  265.029668]  ? kernel_read_file+0x158/0x190
[  265.034659]  ? kernel_read_file_from_fd+0x49/0x80
[  265.040248]  SYSC_finit_module+0xa6/0xf0
[  265.044947]  SyS_finit_module+0xe/0x10
[  265.049452]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  265.054931] RIP: 0033:0x7f74f2439239
[  265.059241] RSP: 002b:00007ffeb8b23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  265.068373] RAX: ffffffffffffffda RBX: 00007ffeb8b21f10 RCX: 00007f74f2439239
[  265.076665] RDX: 0000000000000000 RSI: 00005578b61d02a6 RDI: 0000000000000000
[  265.084958] RBP: 00007ffeb8b21f00 R08: 0000000000000000 R09: 0000000000000000
[  265.098996] R10: 0000000000000000 R11: 0000000000000246 R12: 00005578b61d1be7
[  265.107289] R13: 00007ffeb8b22088 R14: 0000000000000000 R15: 00005578b61d1bf3
[  265.115582] Code: 02 75 e2 48 8b 81 98 00 00 00 48 8b 91 a0 00 00 00 48 8b 7c 24 18 48 89 07 48 89 57 08 8b 81 84 00 00 00 85 c0 0f 85 86 00 00 00 <48> 8b 9c 24 20 01 00 00 65 48 33 1c 25 28 00 00 00 0f 85 a1 00 
[  265.143394] Kernel panic - not syncing: softlockup: hung tasks
[  265.150231] CPU: 0 PID: 4867 Comm: modprobe Tainted: G             L  4.9.0+ #27
[  265.159057] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  265.166766] Call Trace:
[  265.169811]  <IRQ>
[  265.172374]  dump_stack+0x63/0x8c
[  265.176393]  panic+0xeb/0x239
[  265.180047]  watchdog_timer_fn+0x1e5/0x1f0
[  265.184941]  ? watchdog+0x40/0x40
[  265.188960]  __hrtimer_run_queues+0xee/0x270
[  265.194050]  hrtimer_interrupt+0xa8/0x190
[  265.198849]  local_apic_timer_interrupt+0x35/0x60
[  265.204422]  smp_apic_timer_interrupt+0x38/0x50
[  265.209802]  apic_timer_interrupt+0x93/0xa0
[  265.214794] RIP: 0010:fl_classify+0x1f1/0x2b0 [cls_flower]
[  265.221242] RSP: 0018:ffff89916fa03ad8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  265.230339] RAX: 00000000ffffffff RBX: ffff89913f522100 RCX: 0000000000000000
[  265.238635] RDX: ffff89916fa03c98 RSI: ffff89915bd743c0 RDI: ffff89913f522100
[  265.246929] RBP: ffff89916fa03c28 R08: 000000000000270f R09: 0000000000000000
[  265.255223] R10: 0000000000000000 R11: 0000000000000004 R12: ffff89916fa03c98
[  265.263516] R13: ffff89913ecc7c00 R14: ffff89915bd743c0 R15: 0000000000000001
[  265.271812]  ? handle_edge_irq+0x87/0x130
[  265.276599]  ? handle_irq+0xab/0x130
[  265.280902]  ? irq_exit+0x77/0xe0
[  265.284911]  ? irq_exit+0x77/0xe0
[  265.288927]  ? do_IRQ+0x51/0xd0
[  265.292752]  ? common_interrupt+0x93/0x93
[  265.297549]  tc_classify+0x78/0x120
[  265.301762]  __netif_receive_skb_core+0x623/0xa00
[  265.307336]  ? udp4_gro_receive+0x10b/0x2d0
[  265.312328]  __netif_receive_skb+0x18/0x60
[  265.317221]  netif_receive_skb_internal+0x40/0xb0
[  265.322795]  napi_gro_receive+0xcd/0x120
[  265.327502]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  265.334153]  ? kvm_irq_delivery_to_apic+0x2c0/0x2c0 [kvm]
[  265.340510]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  265.346379]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  265.352170]  net_rx_action+0x260/0x3c0
[  265.356675]  __do_softirq+0xc9/0x28c
[  265.360987]  irq_exit+0xd7/0xe0
[  265.364814]  do_IRQ+0x51/0xd0
[  265.368440]  common_interrupt+0x93/0x93
[  265.373041] RIP: 0010:mpihelp_submul_1+0x91/0xe0
[  265.378516] RSP: 0018:ffffa36f44513940 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffa2
[  265.387614] RAX: 5414d8fd4127cc59 RBX: 0000000100000000 RCX: ffffffffffffff50
[  265.395921] RDX: 00000000ffffffea RSI: 0000000019ba7e4d RDI: 0000000068ef6b86
[  265.404235] RBP: ffffa36f44513960 R08: 352cbbd4573982a8 R09: 64589019128fea22
[  265.412526] R10: 128fea2200000000 R11: 352cbbd5573982a8 R12: ffff89955c585c00
[  265.420819] R13: ffff899134dbfbd0 R14: 66a4c31fb80cf128 R15: d7554d389539f6c8
[  265.429113]  </IRQ>
[  265.431769]  mpihelp_divrem+0x23b/0x740
[  265.436373]  mpi_powm+0x471/0xa10
[  265.440394]  rsa_verify+0xa5/0x130
[  265.444508]  pkcs1pad_verify+0xb9/0xf0
[  265.449014]  public_key_verify_signature+0x1f7/0x280
[  265.454878]  public_key_verify_signature_2+0x15/0x20
[  265.460741]  verify_signature+0x3c/0x50
[  265.465343]  pkcs7_validate_trust+0xa1/0x200
[  265.470430]  verify_pkcs7_signature+0x9a/0x140
[  265.475715]  mod_verify_sig+0x92/0xe0
[  265.480147]  load_module+0x171/0x2810
[  265.484555]  ? vfs_read+0x113/0x130
[  265.488758]  ? kernel_read_file+0x158/0x190
[  265.493751]  ? kernel_read_file_from_fd+0x49/0x80
[  265.499328]  SYSC_finit_module+0xa6/0xf0
[  265.504051]  SyS_finit_module+0xe/0x10
[  265.508555]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  265.514030] RIP: 0033:0x7f74f2439239
[  265.518340] RSP: 002b:00007ffeb8b23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  265.527446] RAX: ffffffffffffffda RBX: 00007ffeb8b21f10 RCX: 00007f74f2439239
[  265.535740] RDX: 0000000000000000 RSI: 00005578b61d02a6 RDI: 0000000000000000
[  265.544056] RBP: 00007ffeb8b21f00 R08: 0000000000000000 R09: 0000000000000000
[  265.552354] R10: 0000000000000000 R11: 0000000000000246 R12: 00005578b61d1be7
[  265.560648] R13: 00007ffeb8b22088 R14: 0000000000000000 R15: 00005578b61d1bf3
[  265.568994] Kernel Offset: 0x3b000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  266.845267] ---[ end Kernel panic - not syncing: softlockup: hung tasks
[  266.853000] ------------[ cut here ]------------
[  266.858490] WARNING: CPU: 0 PID: 4867 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
[  266.869551] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole bridge stp llc rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm ib_core irqbypass crct10dif_pclmul crc32_pclmul igb ptp iTCO_wdt pps_core crc32c_intel joydev i2c_algo_bit ghash_clmulni_intel iTCO_vendor_support pcspkr i2c_i801 mei_me ipmi_ssif ioatdma shpchp tpm_tis i2c_smbus tpm_tis_core dca tpm mei lpc_ich ipmi_si ipmi_msghandler wmi nfsd target_core_mod auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  266.990566] CPU: 0 PID: 4867 Comm: modprobe Tainted: G             L  4.9.0+ #27
[  266.999401] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  267.007118] Call Trace:
[  267.010179]  <IRQ>
[  267.012748]  dump_stack+0x63/0x8c
[  267.016783]  __warn+0xd1/0xf0
[  267.020420]  warn_slowpath_null+0x1d/0x20
[  267.025314]  native_smp_send_reschedule+0x3f/0x50
[  267.030895]  try_to_wake_up+0x312/0x390
[  267.035501]  wake_up_process+0x15/0x20
[  267.040009]  swake_up_locked+0x24/0x50
[  267.044519]  swake_up+0x2c/0x40
[  267.048364]  kvm_vcpu_kick+0x32/0x80 [kvm]
[  267.053276]  __apic_accept_irq+0x1b6/0x330 [kvm]
[  267.058774]  kvm_apic_set_irq+0x2a/0x30 [kvm]
[  267.063978]  kvm_irq_delivery_to_apic_fast+0xf5/0x3c0 [kvm]
[  267.070544]  kvm_arch_set_irq_inatomic+0x99/0xb0 [kvm]
[  267.076621]  irqfd_wakeup+0x111/0x160 [kvm]
[  267.081628]  ? kvm_irq_delivery_to_apic+0x2c0/0x2c0 [kvm]
[  267.087984]  __wake_up_common+0x55/0x90
[  267.092583]  __wake_up_locked_key+0x18/0x20
[  267.097580]  eventfd_signal+0x5c/0x80
[  267.102001]  vfio_msihandler+0x16/0x20 [vfio_pci]
[  267.107578]  __handle_irq_event_percpu+0x3c/0x1a0
[  267.113164]  handle_irq_event_percpu+0x32/0x80
[  267.118453]  handle_irq_event+0x2c/0x50
[  267.123068]  handle_edge_irq+0x87/0x130
[  267.127677]  handle_irq+0xab/0x130
[  267.131801]  do_IRQ+0x48/0xd0
[  267.135435]  common_interrupt+0x93/0x93
[  267.140042] RIP: 0010:panic+0x1f5/0x239
[  267.144647] RSP: 0018:ffff89916fa03898 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff73
[  267.153760] RAX: 000000000000003b RBX: 0000000000000000 RCX: 0000000000000006
[  267.162062] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff89916fa0e060
[  267.170362] RBP: ffff89916fa03908 R08: 0000000000000691 R09: ffff898d000bc4c0
[  267.178659] R10: 000000000000010a R11: 0000000000000198 R12: ffffffffbcc4a459
[  267.186964] R13: 0000000000000000 R14: 0000000000000000 R15: ffff89916fa03a28
[  267.195266]  ? panic+0x1f1/0x239
[  267.199201]  watchdog_timer_fn+0x1e5/0x1f0
[  267.204104]  ? watchdog+0x40/0x40
[  267.208130]  __hrtimer_run_queues+0xee/0x270
[  267.213224]  hrtimer_interrupt+0xa8/0x190
[  267.218036]  local_apic_timer_interrupt+0x35/0x60
[  267.223612]  smp_apic_timer_interrupt+0x38/0x50
[  267.228999]  apic_timer_interrupt+0x93/0xa0
[  267.233987] RIP: 0010:fl_classify+0x1f1/0x2b0 [cls_flower]
[  267.240440] RSP: 0018:ffff89916fa03ad8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  267.249539] RAX: 00000000ffffffff RBX: ffff89913f522100 RCX: 0000000000000000
[  267.257829] RDX: ffff89916fa03c98 RSI: ffff89915bd743c0 RDI: ffff89913f522100
[  267.266128] RBP: ffff89916fa03c28 R08: 000000000000270f R09: 0000000000000000
[  267.274425] R10: 0000000000000000 R11: 0000000000000004 R12: ffff89916fa03c98
[  267.282721] R13: ffff89913ecc7c00 R14: ffff89915bd743c0 R15: 0000000000000001
[  267.291019]  ? handle_edge_irq+0x87/0x130
[  267.295818]  ? handle_irq+0xab/0x130
[  267.300133]  ? irq_exit+0x77/0xe0
[  267.304160]  ? irq_exit+0x77/0xe0
[  267.308184]  ? do_IRQ+0x51/0xd0
[  267.312015]  ? common_interrupt+0x93/0x93
[  267.316818]  tc_classify+0x78/0x120
[  267.321038]  __netif_receive_skb_core+0x623/0xa00
[  267.326618]  ? udp4_gro_receive+0x10b/0x2d0
[  267.331616]  __netif_receive_skb+0x18/0x60
[  267.336514]  netif_receive_skb_internal+0x40/0xb0
[  267.342091]  napi_gro_receive+0xcd/0x120
[  267.346809]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  267.353467]  ? kvm_irq_delivery_to_apic+0x2c0/0x2c0 [kvm]
[  267.359837]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  267.365716]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  267.371500]  net_rx_action+0x260/0x3c0
[  267.376008]  __do_softirq+0xc9/0x28c
[  267.380330]  irq_exit+0xd7/0xe0
[  267.384168]  do_IRQ+0x51/0xd0
[  267.387810]  common_interrupt+0x93/0x93
[  267.392426] RIP: 0010:mpihelp_submul_1+0x91/0xe0
[  267.403605] RSP: 0018:ffffa36f44513940 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffa2
[  267.412818] RAX: 5414d8fd4127cc59 RBX: 0000000100000000 RCX: ffffffffffffff50
[  267.421114] RDX: 00000000ffffffea RSI: 0000000019ba7e4d RDI: 0000000068ef6b86
[  267.429418] RBP: ffffa36f44513960 R08: 352cbbd4573982a8 R09: 64589019128fea22
[  267.437713] R10: 128fea2200000000 R11: 352cbbd5573982a8 R12: ffff89955c585c00
[  267.446014] R13: ffff899134dbfbd0 R14: 66a4c31fb80cf128 R15: d7554d389539f6c8
[  267.454311]  </IRQ>
[  267.456981]  mpihelp_divrem+0x23b/0x740
[  267.461598]  mpi_powm+0x471/0xa10
[  267.465630]  rsa_verify+0xa5/0x130
[  267.469759]  pkcs1pad_verify+0xb9/0xf0
[  267.474277]  public_key_verify_signature+0x1f7/0x280
[  267.480148]  public_key_verify_signature_2+0x15/0x20
[  267.486011]  verify_signature+0x3c/0x50
[  267.490619]  pkcs7_validate_trust+0xa1/0x200
[  267.495710]  verify_pkcs7_signature+0x9a/0x140
[  267.501000]  mod_verify_sig+0x92/0xe0
[  267.505404]  load_module+0x171/0x2810
[  267.509822]  ? vfs_read+0x113/0x130
[  267.514047]  ? kernel_read_file+0x158/0x190
[  267.519041]  ? kernel_read_file_from_fd+0x49/0x80
[  267.524620]  SYSC_finit_module+0xa6/0xf0
[  267.529325]  SyS_finit_module+0xe/0x10
[  267.533844]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  267.539325] RIP: 0033:0x7f74f2439239
[  267.543639] RSP: 002b:00007ffeb8b23218 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  267.552740] RAX: ffffffffffffffda RBX: 00007ffeb8b21f10 RCX: 00007f74f2439239
[  267.561042] RDX: 0000000000000000 RSI: 00005578b61d02a6 RDI: 0000000000000000
[  267.569341] RBP: 00007ffeb8b21f00 R08: 0000000000000000 R09: 0000000000000000
[  267.577638] R10: 0000000000000000 R11: 0000000000000246 R12: 00005578b61d1be7
[  267.585935] R13: 00007ffeb8b22088 R14: 0000000000000000 R15: 00005578b61d1bf3
[  267.594231] ---[ end trace 2eb46a584b6ba420 ]---


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-19 16:39         ` Shahar Klein
@ 2016-12-19 17:58           ` Cong Wang
  2016-12-20  6:22             ` Shahar Klein
  0 siblings, 1 reply; 17+ messages in thread
From: Cong Wang @ 2016-12-19 17:58 UTC (permalink / raw)
  To: Shahar Klein
  Cc: Or Gerlitz, Daniel Borkmann, Linux Netdev List, Roi Dayan,
	David Miller, Jiri Pirko, John Fastabend, Hadar Hen Zion

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

Hello,

On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@mellanox.com> wrote:
>
>
> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>
>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>
>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net>
>>> wrote:
>>>
>>>> Note that there's still the RCU fix missing for the deletion race that
>>>> Cong will still send out, but you say that the only thing you do is to
>>>> add a single rule, but no other operation in involved during that test?
>>>
>>>
>>> What's missing to have the deletion race fixed? making a patch or
>>> testing to a patch which was sent?
>>
>>
>> If you think it would help for this problem, here is my patch rebased
>> on the latest net-next.
>>
>> Again, I don't see how it could help this case yet, especially I don't
>> see how we could have a loop in this singly linked list.
>>
>
> I've applied cong's patch and hit a different lockup(full log attached):


Are you sure this is really different? For me, it is still inside the loop
in tc_classify(), with only a slightly different offset.


>
> Daniel suggested I'll add a print:
>                 case RTM_DELTFILTER:
> -                   err = tp->ops->delete(tp, fh);
> +                 printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
> +                 err = tp->ops->delete(tp, fh, &last);
>                         if (err == 0) {
>
> and I couldn't see this print in the output.....

Hmm, that is odd, if this never prints, then my patch should not make any
difference.

There are still two other cases where we could change tp->next, so do you
mind to add two more printk's for debugging?

Attached is the delta patch.

Thanks!

[-- Attachment #2: tc-filter-debug.diff --]
[-- Type: text/plain, Size: 880 bytes --]

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index f9179e0..45bfe9f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -317,6 +317,8 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
 			struct tcf_proto *next = rtnl_dereference(tp->next);
 
+			printk(KERN_ERR "DEBUGG:SK delete filter by: %pf\n", tp->ops->get);
+
 			RCU_INIT_POINTER(*back, next);
 
 			tfilter_notify(net, skb, n, tp, fh,
@@ -370,6 +372,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE);
 	if (err == 0) {
 		if (tp_created) {
+			printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf\n", tp->ops->change);
 			RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
 			rcu_assign_pointer(*back, tp);
 		}

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-19 17:58           ` Cong Wang
@ 2016-12-20  6:22             ` Shahar Klein
  2016-12-20 11:47               ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Shahar Klein @ 2016-12-20  6:22 UTC (permalink / raw)
  To: Cong Wang
  Cc: shahark, Or Gerlitz, Daniel Borkmann, Linux Netdev List,
	Roi Dayan, David Miller, Jiri Pirko, John Fastabend,
	Hadar Hen Zion

[-- Attachment #1: Type: text/plain, Size: 3353 bytes --]



On 12/19/2016 7:58 PM, Cong Wang wrote:
> Hello,
>
> On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@mellanox.com> wrote:
>>
>>
>> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>>
>>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>>
>>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net>
>>>> wrote:
>>>>
>>>>> Note that there's still the RCU fix missing for the deletion race that
>>>>> Cong will still send out, but you say that the only thing you do is to
>>>>> add a single rule, but no other operation in involved during that test?
>>>>
>>>>
>>>> What's missing to have the deletion race fixed? making a patch or
>>>> testing to a patch which was sent?
>>>
>>>
>>> If you think it would help for this problem, here is my patch rebased
>>> on the latest net-next.
>>>
>>> Again, I don't see how it could help this case yet, especially I don't
>>> see how we could have a loop in this singly linked list.
>>>
>>
>> I've applied cong's patch and hit a different lockup(full log attached):
>
>
> Are you sure this is really different? For me, it is still inside the loop
> in tc_classify(), with only a slightly different offset.
>
>
>>
>> Daniel suggested I'll add a print:
>>                 case RTM_DELTFILTER:
>> -                   err = tp->ops->delete(tp, fh);
>> +                 printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
>> +                 err = tp->ops->delete(tp, fh, &last);
>>                         if (err == 0) {
>>
>> and I couldn't see this print in the output.....
>
> Hmm, that is odd, if this never prints, then my patch should not make any
> difference.
>
> There are still two other cases where we could change tp->next, so do you
> mind to add two more printk's for debugging?
>
> Attached is the delta patch.
>
> Thanks!
>

I've added a slightly different debug print:
@@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, 
struct nlmsghdr *n)
                 if (tp_created) {
                         RCU_INIT_POINTER(tp->next, 
rtnl_dereference(*back));
                         rcu_assign_pointer(*back, tp);
+                 printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf 
tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
                 }
                 tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);

full output attached:

[  283.290271] Mirror/redirect action on
[  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9432d704df60 tp->next=          (null)
[  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9436e718d240 tp->next=          (null)
[  283.359997] GACT probability on
[  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
[  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] 
tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! 
[swapper/0:0]


Thanks
Shahar




[-- Attachment #2: tp_p_debug.log --]
[-- Type: text/plain, Size: 18431 bytes --]

[  283.290271] Mirror/redirect action on
[  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9432d704df60 tp->next=          (null)
[  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d240 tp->next=          (null)
[  283.359997] GACT probability on
[  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
[  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
[  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[  308.547322] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma bridge ib_isert stp iscsi_target_mod llc ib_iser libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm igb irqbypass joydev ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt crc32c_intel ptp ipmi_si iTCO_vendor_support pcspkr ghash_clmulni_intel wmi pps_core i2c_algo_bit ipmi_msghandler mei_me i2c_i801 ioatdma tpm_tis mei shpchp i2c_smbus dca tpm_tis_core lpc_ich tpm nfsd target_core_mod auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  308.668291] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0+ #31
[  308.675337] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  308.683060] task: ffffffff94e0e500 task.stack: ffffffff94e00000
[  308.690012] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  308.696275] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  308.705396] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  308.713704] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  308.722099] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  308.730409] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  308.738713] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  308.747013] FS:  0000000000000000(0000) GS:ffff9432efa00000(0000) knlGS:0000000000000000
[  308.756625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  308.763378] CR2: 00007f5415f67914 CR3: 00000005fde07000 CR4: 00000000000426f0
[  308.771684] Call Trace:
[  308.774739]  <IRQ>
[  308.777311]  tc_classify+0x78/0x120
[  308.781549]  __netif_receive_skb_core+0x623/0xa00
[  308.787141]  ? udp4_gro_receive+0x10b/0x2d0
[  308.792143]  __netif_receive_skb+0x18/0x60
[  308.797048]  netif_receive_skb_internal+0x40/0xb0
[  308.802637]  napi_gro_receive+0xcd/0x120
[  308.807462]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  308.814123]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  308.820015]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  308.825818]  net_rx_action+0x260/0x3c0
[  308.830334]  __do_softirq+0xc9/0x28c
[  308.834658]  irq_exit+0xd7/0xe0
[  308.838492]  do_IRQ+0x51/0xd0
[  308.842132]  common_interrupt+0x93/0x93
[  308.846747] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  308.852624] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  308.861766] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  308.870077] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  308.878379] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  308.886690] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  308.895000] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  308.903306]  </IRQ>
[  308.905978]  ? cpuidle_enter_state+0xc0/0x260
[  308.911173]  cpuidle_enter+0x17/0x20
[  308.915498]  call_cpuidle+0x23/0x40
[  308.919721]  do_idle+0x172/0x200
[  308.923656]  cpu_startup_entry+0x71/0x80
[  308.928370]  rest_init+0x77/0x80
[  308.932304]  start_kernel+0x4a6/0x4c7
[  308.936723]  ? set_init_arg+0x55/0x55
[  308.941141]  ? early_idt_handler_array+0x120/0x120
[  308.946823]  x86_64_start_reservations+0x24/0x26
[  308.952314]  x86_64_start_kernel+0x14c/0x16f
[  308.957418]  start_cpu+0x5/0x14
[  308.961242] Code: a8 4c 89 fe 48 8b 4d c8 48 8d 14 07 4c 89 e7 e8 2c fe ff ff e9 14 ff ff ff 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 <41> 56 41 55 41 54 53 48 81 ec 28 01 00 00 65 48 8b 04 25 28 00 
[  308.989075] Kernel panic - not syncing: softlockup: hung tasks
[  308.995924] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L  4.9.0+ #31
[  309.010442] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  309.018160] Call Trace:
[  309.021211]  <IRQ>
[  309.023776]  dump_stack+0x63/0x8c
[  309.027807]  panic+0xeb/0x239
[  309.031449]  watchdog_timer_fn+0x1e5/0x1f0
[  309.036354]  ? watchdog+0x40/0x40
[  309.040386]  __hrtimer_run_queues+0xee/0x270
[  309.045486]  hrtimer_interrupt+0xa8/0x190
[  309.050293]  local_apic_timer_interrupt+0x35/0x60
[  309.055880]  smp_apic_timer_interrupt+0x38/0x50
[  309.061272]  apic_timer_interrupt+0x93/0xa0
[  309.066272] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  309.072538] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  309.081686] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  309.089994] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  309.098297] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  309.106603] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  309.114914] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  309.123229]  tc_classify+0x78/0x120
[  309.127452]  __netif_receive_skb_core+0x623/0xa00
[  309.133031]  ? udp4_gro_receive+0x10b/0x2d0
[  309.138033]  __netif_receive_skb+0x18/0x60
[  309.142949]  netif_receive_skb_internal+0x40/0xb0
[  309.148534]  napi_gro_receive+0xcd/0x120
[  309.153259]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  309.159918]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  309.165823]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  309.171608]  net_rx_action+0x260/0x3c0
[  309.176238]  __do_softirq+0xc9/0x28c
[  309.180563]  irq_exit+0xd7/0xe0
[  309.184395]  do_IRQ+0x51/0xd0
[  309.188035]  common_interrupt+0x93/0x93
[  309.192651] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  309.198527] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  309.207651] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  309.215959] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  309.224268] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  309.232573] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  309.240881] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  309.249187]  </IRQ>
[  309.251858]  ? cpuidle_enter_state+0xc0/0x260
[  309.257057]  cpuidle_enter+0x17/0x20
[  309.261382]  call_cpuidle+0x23/0x40
[  309.265635]  do_idle+0x172/0x200
[  309.269604]  cpu_startup_entry+0x71/0x80
[  309.274314]  rest_init+0x77/0x80
[  309.278247]  start_kernel+0x4a6/0x4c7
[  309.282668]  ? set_init_arg+0x55/0x55
[  309.287089]  ? early_idt_handler_array+0x120/0x120
[  309.292771]  x86_64_start_reservations+0x24/0x26
[  309.298262]  x86_64_start_kernel+0x14c/0x16f
[  309.303361]  start_cpu+0x5/0x14
[  309.307245] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  310.573997] ---[ end Kernel panic - not syncing: softlockup: hung tasks
[  310.581734] ------------[ cut here ]------------
[  310.587236] unchecked MSR access error: WRMSR to 0x83f (tried to write 0x00000000000000f6) at rIP: 0xffffffff94065c14 (native_write_msr+0x4/0x30)
[  310.602404] Call Trace:
[  310.605472]  <IRQ>
[  310.608066]  ? native_apic_msr_write+0x30/0x40
[  310.613371]  x2apic_send_IPI_self+0x1d/0x20
[  310.618390]  arch_irq_work_raise+0x28/0x40
[  310.623309]  irq_work_queue+0x6e/0x80
[  310.627724]  wake_up_klogd+0x34/0x40
[  310.632045]  console_unlock+0x4dc/0x540
[  310.636659]  vprintk_emit+0x2eb/0x4b0
[  310.641091]  ? native_smp_send_reschedule+0x3f/0x50
[  310.646871]  vprintk_default+0x29/0x40
[  310.651393]  printk+0x5d/0x74
[  310.655034]  ? native_smp_send_reschedule+0x3f/0x50
[  310.660807]  __warn+0x3b/0xf0
[  310.664450]  warn_slowpath_null+0x1d/0x20
[  310.669262]  native_smp_send_reschedule+0x3f/0x50
[  310.674849]  try_to_wake_up+0x312/0x390
[  310.679456]  default_wake_function+0x12/0x20
[  310.684560]  __wake_up_common+0x55/0x90
[  310.689170]  __wake_up_locked+0x13/0x20
[  310.693788]  ep_poll_callback+0xbb/0x240
[  310.698493]  __wake_up_common+0x55/0x90
[  310.703101]  __wake_up+0x39/0x50
[  310.707028]  wake_up_klogd_work_func+0x40/0x60
[  310.712316]  irq_work_run_list+0x4d/0x70
[  310.717022]  irq_work_run+0x2c/0x40
[  310.721243]  smp_irq_work_interrupt+0x2e/0x40
[  310.726443]  irq_work_interrupt+0x93/0xa0
[  310.731253] RIP: 0010:panic+0x1f5/0x239
[  310.735876] RSP: 0018:ffff9432efa039e8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff09
[  310.744995] RAX: 000000000000003b RBX: 0000000000000000 RCX: 0000000000000006
[  310.753294] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff9432efa0e060
[  310.761594] RBP: ffff9432efa03a58 R08: 0000000000000674 R09: ffff942e800bb3e0
[  310.769900] R10: 00000000000000ef R11: 0000000000000198 R12: ffffffff94c4a4a9
[  310.778199] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9432efa03b78
[  310.786505]  ? panic+0x1f1/0x239
[  310.790444]  watchdog_timer_fn+0x1e5/0x1f0
[  310.795353]  ? watchdog+0x40/0x40
[  310.799401]  __hrtimer_run_queues+0xee/0x270
[  310.804501]  hrtimer_interrupt+0xa8/0x190
[  310.809318]  local_apic_timer_interrupt+0x35/0x60
[  310.814895]  smp_apic_timer_interrupt+0x38/0x50
[  310.820282]  apic_timer_interrupt+0x93/0xa0
[  310.825287] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  310.831554] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  310.840693] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  310.849007] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  310.857402] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  310.865712] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  310.874020] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  310.882337]  tc_classify+0x78/0x120
[  310.886568]  __netif_receive_skb_core+0x623/0xa00
[  310.892157]  ? udp4_gro_receive+0x10b/0x2d0
[  310.897151]  __netif_receive_skb+0x18/0x60
[  310.902057]  netif_receive_skb_internal+0x40/0xb0
[  310.907643]  napi_gro_receive+0xcd/0x120
[  310.912370]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  310.919031]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  310.924924]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  310.930808]  net_rx_action+0x260/0x3c0
[  310.935319]  __do_softirq+0xc9/0x28c
[  310.939658]  irq_exit+0xd7/0xe0
[  310.943485]  do_IRQ+0x51/0xd0
[  310.947124]  common_interrupt+0x93/0x93
[  310.951748] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  310.957616] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  310.966743] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  310.975044] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  310.983349] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  310.991654] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  310.999952] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  311.008254]  </IRQ>
[  311.010926]  ? cpuidle_enter_state+0xc0/0x260
[  311.016122]  cpuidle_enter+0x17/0x20
[  311.020430]  call_cpuidle+0x23/0x40
[  311.024658]  do_idle+0x172/0x200
[  311.028583]  cpu_startup_entry+0x71/0x80
[  311.033295]  rest_init+0x77/0x80
[  311.037233]  start_kernel+0x4a6/0x4c7
[  311.041646]  ? set_init_arg+0x55/0x55
[  311.046068]  ? early_idt_handler_array+0x120/0x120
[  311.051752]  x86_64_start_reservations+0x24/0x26
[  311.057238]  x86_64_start_kernel+0x14c/0x16f
[  311.062339]  start_cpu+0x5/0x14
[  311.066180] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
[  311.076956] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma bridge ib_isert stp iscsi_target_mod llc ib_iser libiscsi scsi_transport_iscsi ib_srpt ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm igb irqbypass joydev ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt crc32c_intel ptp ipmi_si iTCO_vendor_support pcspkr ghash_clmulni_intel wmi pps_core i2c_algo_bit ipmi_msghandler mei_me i2c_i801 ioatdma tpm_tis mei shpchp i2c_smbus dca tpm_tis_core lpc_ich tpm nfsd target_core_mod auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  311.198587] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G             L  4.9.0+ #31
[  311.207253] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  311.214983] Call Trace:
[  311.218051]  <IRQ>
[  311.220626]  dump_stack+0x63/0x8c
[  311.224657]  __warn+0xd1/0xf0
[  311.228298]  warn_slowpath_null+0x1d/0x20
[  311.233116]  native_smp_send_reschedule+0x3f/0x50
[  311.238702]  try_to_wake_up+0x312/0x390
[  311.243318]  default_wake_function+0x12/0x20
[  311.248418]  __wake_up_common+0x55/0x90
[  311.253034]  __wake_up_locked+0x13/0x20
[  311.257641]  ep_poll_callback+0xbb/0x240
[  311.262346]  __wake_up_common+0x55/0x90
[  311.272771]  __wake_up+0x39/0x50
[  311.276697]  wake_up_klogd_work_func+0x40/0x60
[  311.281986]  irq_work_run_list+0x4d/0x70
[  311.286681]  irq_work_run+0x2c/0x40
[  311.290899]  smp_irq_work_interrupt+0x2e/0x40
[  311.296090]  irq_work_interrupt+0x93/0xa0
[  311.300900] RIP: 0010:panic+0x1f5/0x239
[  311.305508] RSP: 0018:ffff9432efa039e8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff09
[  311.314630] RAX: 000000000000003b RBX: 0000000000000000 RCX: 0000000000000006
[  311.322936] RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffff9432efa0e060
[  311.331245] RBP: ffff9432efa03a58 R08: 0000000000000674 R09: ffff942e800bb3e0
[  311.339543] R10: 00000000000000ef R11: 0000000000000198 R12: ffffffff94c4a4a9
[  311.347855] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9432efa03b78
[  311.356167]  ? panic+0x1f1/0x239
[  311.360106]  watchdog_timer_fn+0x1e5/0x1f0
[  311.365004]  ? watchdog+0x40/0x40
[  311.369035]  __hrtimer_run_queues+0xee/0x270
[  311.374132]  hrtimer_interrupt+0xa8/0x190
[  311.378935]  local_apic_timer_interrupt+0x35/0x60
[  311.384511]  smp_apic_timer_interrupt+0x38/0x50
[  311.389897]  apic_timer_interrupt+0x93/0xa0
[  311.394892] RIP: 0010:fl_classify+0xb/0x2b0 [cls_flower]
[  311.401151] RSP: 0018:ffff9432efa03c20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  311.410270] RAX: 0000000000000008 RBX: ffff9432b59c4100 RCX: 0000000000000000
[  311.418580] RDX: ffff9432efa03c98 RSI: ffff9436e718d3c0 RDI: ffff9432b59c4100
[  311.426967] RBP: ffff9432efa03c28 R08: 000000000000270f R09: 0000000000000000
[  311.435278] R10: 0000000000000000 R11: 0000000000000004 R12: ffff9432efa03c98
[  311.443584] R13: 0000000000000008 R14: ffff9436e718d3c0 R15: 0000000000000001
[  311.451889]  tc_classify+0x78/0x120
[  311.456105]  __netif_receive_skb_core+0x623/0xa00
[  311.461683]  ? udp4_gro_receive+0x10b/0x2d0
[  311.466687]  __netif_receive_skb+0x18/0x60
[  311.471593]  netif_receive_skb_internal+0x40/0xb0
[  311.477186]  napi_gro_receive+0xcd/0x120
[  311.481900]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  311.488555]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  311.494451]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  311.500233]  net_rx_action+0x260/0x3c0
[  311.504751]  __do_softirq+0xc9/0x28c
[  311.509075]  irq_exit+0xd7/0xe0
[  311.512901]  do_IRQ+0x51/0xd0
[  311.516529]  common_interrupt+0x93/0x93
[  311.521143] RIP: 0010:cpuidle_enter_state+0xe1/0x260
[  311.527011] RSP: 0018:ffffffff94e03dc8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffa2
[  311.536123] RAX: ffff9432efa19600 RBX: ffff9432efa23600 RCX: 000000000000001f
[  311.544430] RDX: 0000000000000000 RSI: ffff9432efa16cd8 RDI: 0000000000000000
[  311.552760] RBP: ffffffff94e03e00 R08: 0000000000000001 R09: cccccccccccccccd
[  311.561087] R10: 0000000000000000 R11: 0000000000000008 R12: 0000000000000001
[  311.569396] R13: 0000000000000000 R14: ffffffff94ec79a0 R15: 00000041fab01c8d
[  311.577714]  </IRQ>
[  311.580393]  ? cpuidle_enter_state+0xc0/0x260
[  311.585591]  cpuidle_enter+0x17/0x20
[  311.589913]  call_cpuidle+0x23/0x40
[  311.594136]  do_idle+0x172/0x200
[  311.598069]  cpu_startup_entry+0x71/0x80
[  311.602782]  rest_init+0x77/0x80
[  311.606713]  start_kernel+0x4a6/0x4c7
[  311.611134]  ? set_init_arg+0x55/0x55
[  311.615547]  ? early_idt_handler_array+0x120/0x120
[  311.621231]  x86_64_start_reservations+0x24/0x26
[  311.626717]  x86_64_start_kernel+0x14c/0x16f
[  311.631810]  start_cpu+0x5/0x14
[  311.635648] ---[ end trace c2fd08dd3d93dab3 ]---



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-20  6:22             ` Shahar Klein
@ 2016-12-20 11:47               ` Daniel Borkmann
  2016-12-21  6:44                 ` Shahar Klein
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Borkmann @ 2016-12-20 11:47 UTC (permalink / raw)
  To: Shahar Klein, Cong Wang
  Cc: Or Gerlitz, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion

Hi Shahar,

On 12/20/2016 07:22 AM, Shahar Klein wrote:
> On 12/19/2016 7:58 PM, Cong Wang wrote:
>> On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@mellanox.com> wrote:
>>> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>>>
>>>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>>>
>>>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann <daniel@iogearbox.net>
>>>>> wrote:
>>>>>
>>>>>> Note that there's still the RCU fix missing for the deletion race that
>>>>>> Cong will still send out, but you say that the only thing you do is to
>>>>>> add a single rule, but no other operation in involved during that test?
>>>>>
>>>>> What's missing to have the deletion race fixed? making a patch or
>>>>> testing to a patch which was sent?
>>>>
>>>> If you think it would help for this problem, here is my patch rebased
>>>> on the latest net-next.
>>>>
>>>> Again, I don't see how it could help this case yet, especially I don't
>>>> see how we could have a loop in this singly linked list.
>>>
>>> I've applied cong's patch and hit a different lockup(full log attached):
>>
>> Are you sure this is really different? For me, it is still inside the loop
>> in tc_classify(), with only a slightly different offset.
>>
>>>
>>> Daniel suggested I'll add a print:
>>>                 case RTM_DELTFILTER:
>>> -                   err = tp->ops->delete(tp, fh);
>>> +                 printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
>>> +                 err = tp->ops->delete(tp, fh, &last);
>>>                         if (err == 0) {
>>>
>>> and I couldn't see this print in the output.....
>>
>> Hmm, that is odd, if this never prints, then my patch should not make any
>> difference.
>>
>> There are still two other cases where we could change tp->next, so do you
>> mind to add two more printk's for debugging?
>>
>> Attached is the delta patch.
>>
>> Thanks!
>
> I've added a slightly different debug print:
> @@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
>                  if (tp_created) {
>                          RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
>                          rcu_assign_pointer(*back, tp);
> +                 printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
>                  }
>                  tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);

I'm curious, could you be a bit more verbose why you didn't go with Cong's
debug patch?

In particular, why you removed the hunk from the condition 'n->nlmsg_type ==
RTM_DELTFILTER && t->tcm_handle == 0' where we delete the whole tp instance?

Is it because if you have that printk() there, then the issue doesn't trigger
for you anymore? Or any other reason?

How many CPUs does your test machine have, I suspect more than 1, right?

So iff RTM_DELTFILTER with tcm_handle of 0 really played a role in this, I'm
wondering whether there was a subtle deletion + add race where the newly added
filter on the other CPU still saw a stale pointer in the list. But just a wild
guess at this point.

Hmm, could you try this below to see whether the issue still appears?

Thanks,
Daniel

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..4eee1cb 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -317,7 +317,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
  		if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
  			struct tcf_proto *next = rtnl_dereference(tp->next);

-			RCU_INIT_POINTER(*back, next);
+			rcu_assign_pointer(*back, next);

  			tfilter_notify(net, skb, n, tp, fh,
  				       RTM_DELTFILTER, false);


> full output attached:
>
> [  283.290271] Mirror/redirect action on
> [  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9432d704df60 tp->next=          (null)
> [  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d240 tp->next=          (null)
> [  283.359997] GACT probability on
> [  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
> [  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower] tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
> [  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
>
> Thanks
> Shahar

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-20 11:47               ` Daniel Borkmann
@ 2016-12-21  6:44                 ` Shahar Klein
  2016-12-21  7:03                   ` Cong Wang
  0 siblings, 1 reply; 17+ messages in thread
From: Shahar Klein @ 2016-12-21  6:44 UTC (permalink / raw)
  To: Daniel Borkmann, Cong Wang
  Cc: shahark, Or Gerlitz, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion

[-- Attachment #1: Type: text/plain, Size: 5312 bytes --]



On 12/20/2016 1:47 PM, Daniel Borkmann wrote:
> Hi Shahar,
>
> On 12/20/2016 07:22 AM, Shahar Klein wrote:
>> On 12/19/2016 7:58 PM, Cong Wang wrote:
>>> On Mon, Dec 19, 2016 at 8:39 AM, Shahar Klein <shahark@mellanox.com>
>>> wrote:
>>>> On 12/13/2016 12:51 AM, Cong Wang wrote:
>>>>>
>>>>> On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz <gerlitz.or@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann
>>>>>> <daniel@iogearbox.net>
>>>>>> wrote:
>>>>>>
>>>>>>> Note that there's still the RCU fix missing for the deletion race
>>>>>>> that
>>>>>>> Cong will still send out, but you say that the only thing you do
>>>>>>> is to
>>>>>>> add a single rule, but no other operation in involved during that
>>>>>>> test?
>>>>>>
>>>>>> What's missing to have the deletion race fixed? making a patch or
>>>>>> testing to a patch which was sent?
>>>>>
>>>>> If you think it would help for this problem, here is my patch rebased
>>>>> on the latest net-next.
>>>>>
>>>>> Again, I don't see how it could help this case yet, especially I don't
>>>>> see how we could have a loop in this singly linked list.
>>>>
>>>> I've applied cong's patch and hit a different lockup(full log
>>>> attached):
>>>
>>> Are you sure this is really different? For me, it is still inside the
>>> loop
>>> in tc_classify(), with only a slightly different offset.
>>>
>>>>
>>>> Daniel suggested I'll add a print:
>>>>                 case RTM_DELTFILTER:
>>>> -                   err = tp->ops->delete(tp, fh);
>>>> +                 printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__,
>>>> __LINE__);
>>>> +                 err = tp->ops->delete(tp, fh, &last);
>>>>                         if (err == 0) {
>>>>
>>>> and I couldn't see this print in the output.....
>>>
>>> Hmm, that is odd, if this never prints, then my patch should not make
>>> any
>>> difference.
>>>
>>> There are still two other cases where we could change tp->next, so do
>>> you
>>> mind to add two more printk's for debugging?
>>>
>>> Attached is the delta patch.
>>>
>>> Thanks!
>>
>> I've added a slightly different debug print:
>> @@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
>> struct nlmsghdr *n)
>>                  if (tp_created) {
>>                          RCU_INIT_POINTER(tp->next,
>> rtnl_dereference(*back));
>>                          rcu_assign_pointer(*back, tp);
>> +                 printk(KERN_ERR "DEBUGG:SK add/change filter by: %pf
>> tp=%p tp->next=%p\n", tp->ops->get, tp, tp->next);
>>                  }
>>                  tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER,
>> false);
>
> I'm curious, could you be a bit more verbose why you didn't go with Cong's
> debug patch?
>
> In particular, why you removed the hunk from the condition
> 'n->nlmsg_type ==
> RTM_DELTFILTER && t->tcm_handle == 0' where we delete the whole tp
> instance?

I didn't remove this hunk, just added more. anyway I'm attaching the 
diff to make sure I didn't miss anything.
I've added some more prints in the destroy and added some current info
run log attached also

>
> Is it because if you have that printk() there, then the issue doesn't
> trigger
> for you anymore? Or any other reason?
>
> How many CPUs does your test machine have, I suspect more than 1, right?
>
> So iff RTM_DELTFILTER with tcm_handle of 0 really played a role in this,
> I'm
> wondering whether there was a subtle deletion + add race where the newly
> added
> filter on the other CPU still saw a stale pointer in the list. But just
> a wild
> guess at this point.
>
> Hmm, could you try this below to see whether the issue still appears?
>
> Thanks,
> Daniel
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 3fbba79..4eee1cb 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -317,7 +317,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
> struct nlmsghdr *n)
>          if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
>              struct tcf_proto *next = rtnl_dereference(tp->next);
>
> -            RCU_INIT_POINTER(*back, next);
> +            rcu_assign_pointer(*back, next);
>
>              tfilter_notify(net, skb, n, tp, fh,
>                         RTM_DELTFILTER, false);
>
>

Tried it with same results

>> full output attached:
>>
>> [  283.290271] Mirror/redirect action on
>> [  283.305031] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9432d704df60 tp->next=          (null)
>> [  283.322563] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9436e718d240 tp->next=          (null)
>> [  283.359997] GACT probability on
>> [  283.365923] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9436e718d3c0 tp->next=ffff9436e718d240
>> [  283.378725] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
>> [  283.391310] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
>> [  283.403923] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
>> [  283.416542] DEBUGG:SK add/change filter by: fl_get [cls_flower]
>> tp=ffff9436e718d3c0 tp->next=ffff9436e718d3c0
>> [  308.538571] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
>> [swapper/0:0]
>>
>> Thanks
>> Shahar

[-- Attachment #2: cls_api.c.diff --]
[-- Type: text/plain, Size: 2508 bytes --]

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..1b31fc9 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -305,7 +305,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 			kfree(tp);
 			goto errout;
 		}
-
+		printk(KERN_ERR "DEBUGG:SK thread-%d[cpu-%d] setting tp_created to 1 tp=%p back=%p\n", current->pid, current->on_cpu, tp, rtnl_dereference(*back));
 		tp_created = 1;
 
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
@@ -317,11 +317,13 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
 			struct tcf_proto *next = rtnl_dereference(tp->next);
 
-			RCU_INIT_POINTER(*back, next);
+			printk(KERN_ERR "DEBUGG:SK delete filter by: %pf\n", tp->ops->get);
+
+			rcu_assign_pointer(*back, next);
 
 			tfilter_notify(net, skb, n, tp, fh,
 				       RTM_DELTFILTER, false);
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 			err = 0;
 			goto errout;
 		}
@@ -331,25 +333,30 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		    !(n->nlmsg_flags & NLM_F_CREATE))
 			goto errout;
 	} else {
+		bool last;
+
 		switch (n->nlmsg_type) {
 		case RTM_NEWTFILTER:
 			err = -EEXIST;
 			if (n->nlmsg_flags & NLM_F_EXCL) {
 				if (tp_created)
-					tcf_destroy(tp, true);
+					tcf_destroy(tp);
 				goto errout;
 			}
 			break;
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {
-				struct tcf_proto *next = rtnl_dereference(tp->next);
-
 				tfilter_notify(net, skb, n, tp,
 					       t->tcm_handle,
 					       RTM_DELTFILTER, false);
-				if (tcf_destroy(tp, false))
+				if (last) {
+					struct tcf_proto *next = rtnl_dereference(tp->next);
+
 					RCU_INIT_POINTER(*back, next);
+					tcf_destroy(tp);
+				}
 			}
 			goto errout;
 		case RTM_GETTFILTER:
@@ -368,11 +375,12 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (tp_created) {
 			RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
 			rcu_assign_pointer(*back, tp);
+			printk(KERN_ERR "DEBUGG:SK thread-%d[cpu-%d] add/change filter by: %pf tp=%p tp->next=%p\n", current->pid, current->on_cpu, tp->ops->get, tp, tp->next);
 		}
 		tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
 	} else {
 		if (tp_created)
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 	}
 
 errout:

[-- Attachment #3: tp_p_debug_pid.log --]
[-- Type: text/plain, Size: 17231 bytes --]

[  354.117396] DEBUGG:SK tcf_destroy_chain:1915
[  354.123731] DEBUGG:SK tcf_destroy_chain:1915
[  354.128962] DEBUGG:SK tcf_destroy_chain:1915
[  354.134999] DEBUGG:SK tcf_destroy_chain:1915
[  354.145185] DEBUGG:SK tcf_destroy_chain:1915
[  354.155601] DEBUGG:SK tcf_destroy_chain:1915

[  408.197159] DEBUGG:SK thread-2852[cpu-1] setting tp_created to 1 tp=ffff94b5b669f8a0 back=          (null)
[  408.246733] Mirror/redirect action on
[  408.251498] DEBUGG:SK destroy ffff94b5b669f8a0 tcf_destroy:1905
[  408.258485] DEBUGG:SK thread-2852[cpu-1] setting tp_created to 1 tp=ffff94b5b669f900 back=          (null)
[  408.280217] DEBUGG:SK thread-2852[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b669f900 tp->next=          (null)
[  408.294257] DEBUGG:SK thread-2858[cpu-1] setting tp_created to 1 tp=ffff94b5b5de2060 back=          (null)
[  408.305684] DEBUGG:SK thread-2849[cpu-1] setting tp_created to 1 tp=ffff94b9df4fd420 back=          (null)
[  408.317103] DEBUGG:SK thread-2856[cpu-1] setting tp_created to 1 tp=ffff94b9ea932060 back=          (null)
[  408.333832] DEBUGG:SK thread-2856[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b9ea932060 tp->next=          (null)
[  408.347806] GACT probability on
[  408.348815] DEBUGG:SK thread-2851[cpu-1] setting tp_created to 1 tp=ffff94b5bfaab900 back=ffff94b9ea932060
[  408.348841] DEBUGG:SK thread-2854[cpu-1] setting tp_created to 1 tp=ffff94b5b669fcc0 back=ffff94b9ea932060
[  408.349841] DEBUGG:SK thread-2846[cpu-1] setting tp_created to 1 tp=ffff94b5b4ea9720 back=ffff94b9ea932060
[  408.349868] DEBUGG:SK thread-2847[cpu-1] setting tp_created to 1 tp=ffff94b9ec249ea0 back=ffff94b9ea932060
[  408.350880] DEBUGG:SK thread-2848[cpu-1] setting tp_created to 1 tp=ffff94b9e9542ea0 back=ffff94b9ea932060
[  408.355302] DEBUGG:SK thread-2855[cpu-1] setting tp_created to 1 tp=ffff94b5b402ca80 back=ffff94b9ea932060
[  408.358941] DEBUGG:SK thread-2844[cpu-1] setting tp_created to 1 tp=ffff94b5b0280960 back=ffff94b9ea932060
[  408.358962] DEBUGG:SK thread-2852[cpu-1] setting tp_created to 1 tp=ffff94b9e8e17840 back=ffff94b9ea932060
[  408.358986] DEBUGG:SK thread-2842[cpu-1] setting tp_created to 1 tp=ffff94b5b5de2f00 back=ffff94b9ea932060
[  408.360920] DEBUGG:SK thread-2850[cpu-1] setting tp_created to 1 tp=ffff94b5cfe38ea0 back=ffff94b9ea932060
[  408.360940] DEBUGG:SK thread-2857[cpu-1] setting tp_created to 1 tp=ffff94b5b0280840 back=ffff94b9ea932060
[  408.361827] DEBUGG:SK thread-2856[cpu-1] setting tp_created to 1 tp=ffff94b9ea9321e0 back=          (null)
[  408.362812] DEBUGG:SK thread-2856[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b9ea9321e0 tp->next=          (null)
[  408.362850] DEBUGG:SK thread-2853[cpu-1] setting tp_created to 1 tp=ffff94b5b669fea0 back=ffff94b9ea932060
[  408.362895] DEBUGG:SK thread-2845[cpu-1] setting tp_created to 1 tp=ffff94b5b0280780 back=ffff94b9ea932060
[  408.362904] DEBUGG:SK thread-2856[cpu-1] setting tp_created to 1 tp=ffff94b9ea9322a0 back=ffff94b9ea932060
[  408.362942] DEBUGG:SK thread-2843[cpu-1] setting tp_created to 1 tp=ffff94b5b402c960 back=ffff94b9ea932060
[  408.547689] DEBUGG:SK destroy ffff94b5b669fea0 tcf_destroy:1905
[  408.554689] DEBUGG:SK thread-2853[cpu-1] setting tp_created to 1 tp=ffff94b5b02805a0 back=ffff94b9ea932060
[  408.574258] DEBUGG:SK thread-2853[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b9ea932060
[  408.587849] DEBUGG:SK destroy ffff94b5b0280780 tcf_destroy:1905
[  408.595862] DEBUGG:SK thread-2845[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
[  408.609476] DEBUGG:SK destroy ffff94b9ec249ea0 tcf_destroy:1905
[  408.617394] DEBUGG:SK thread-2847[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
[  408.631068] DEBUGG:SK destroy ffff94b9df4fd420 tcf_destroy:1905
[  408.639090] DEBUGG:SK thread-2849[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
[  408.652656] DEBUGG:SK destroy ffff94b5b5de2060 tcf_destroy:1905
[  408.660530] DEBUGG:SK thread-2858[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
[  432.583406] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [CPU 0/KVM:2574]
[  432.592457] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi bridge stp ib_srpt llc ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crct10dif_pclmul igb joydev crc32_pclmul ipmi_ssif crc32c_intel iTCO_wdt ptp ghash_clmulni_intel pps_core iTCO_vendor_support i2c_algo_bit pcspkr ipmi_si wmi ipmi_msghandler ioatdma mei_me lpc_ich tpm_tis dca tpm_tis_core mei shpchp i2c_i801 i2c_smbus tpm target_core_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  432.714385] CPU: 0 PID: 2574 Comm: CPU 0/KVM Not tainted 4.9.0+ #40
[  432.721734] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  432.729469] task: ffff94b5b69d3b00 task.stack: ffffad7948840000
[  432.736428] RIP: 0010:tc_classify+0x5a/0x120
[  432.741526] RSP: 0018:ffff94b5efa03c38 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[  432.750677] RAX: 00000000ffffffff RBX: ffff94b5b159a100 RCX: 0000000000000000
[  432.758982] RDX: ffff94b5efa03c98 RSI: ffff94b5b02805a0 RDI: ffff94b5b159a100
[  432.767291] RBP: ffff94b5efa03c70 R08: 000000000000270f R09: 0000000000000000
[  432.775603] R10: 0000000000000000 R11: 0000000000000004 R12: ffff94b5efa03c98
[  432.783907] R13: 0000000000000008 R14: ffff94b5b02805a0 R15: 0000000000000001
[  432.792226] FS:  00007f35c1aec700(0000) GS:ffff94b5efa00000(0000) knlGS:0000000000000000
[  432.801861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  432.808626] CR2: 0000562a2a56b0c8 CR3: 000000042c714000 CR4: 00000000000426f0
[  432.816932] Call Trace:
[  432.819993]  <IRQ>
[  432.822580]  __netif_receive_skb_core+0x623/0xa00
[  432.828173]  ? udp4_gro_receive+0x10b/0x2d0
[  432.833189]  __netif_receive_skb+0x18/0x60
[  432.838095]  netif_receive_skb_internal+0x40/0xb0
[  432.843692]  napi_gro_receive+0xcd/0x120
[  432.848425]  mlx5e_handle_rx_cqe_rep+0x61b/0x890 [mlx5_core]
[  432.855103]  ? kvm_irq_delivery_to_apic+0x2c0/0x2c0 [kvm]
[  432.861482]  mlx5e_poll_rx_cq+0x83/0x840 [mlx5_core]
[  432.867373]  mlx5e_napi_poll+0x89/0x480 [mlx5_core]
[  432.873164]  net_rx_action+0x260/0x3c0
[  432.877693]  __do_softirq+0xc9/0x28c
[  432.882020]  irq_exit+0xd7/0xe0
[  432.885855]  do_IRQ+0x51/0xd0
[  432.889506]  common_interrupt+0x93/0x93
[  432.894140] RIP: 0010:kvm_arch_vcpu_ioctl_run+0xa64/0x1590 [kvm]
[  432.896408] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [handler129:2849]
[  432.896436] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi bridge stp ib_srpt llc ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crct10dif_pclmul igb joydev crc32_pclmul ipmi_ssif crc32c_intel iTCO_wdt ptp ghash_clmulni_intel
[  432.896448]  pps_core iTCO_vendor_support i2c_algo_bit pcspkr ipmi_si wmi ipmi_msghandler ioatdma mei_me lpc_ich tpm_tis dca tpm_tis_core mei shpchp i2c_i801 i2c_smbus tpm target_core_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  432.896450] CPU: 5 PID: 2849 Comm: handler129 Not tainted 4.9.0+ #40
[  432.896451] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  432.896452] task: ffff94b5c1870000 task.stack: ffffad7944bb4000
[  432.896456] RIP: 0010:tc_ctl_tfilter+0x1cc/0x8c0
[  432.896458] RSP: 0018:ffffad7944bb7a88 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[  432.896459] RAX: 0000000000020000 RBX: ffff94b5b02805a0 RCX: 000000000000002c
[  432.896459] RDX: ffff94b9e3729400 RSI: 0000000000000000 RDI: ffff94b5c0663400
[  432.896460] RBP: ffffad7944bb7b88 R08: 0000000000000000 R09: ffffad7944bb7af8
[  432.896461] R10: ffff94b5b02805a0 R11: 00000000ffffffea R12: ffff94b5c0663400
[  432.896462] R13: 00000000ffff0000 R14: ffff94b5e7e9c800 R15: 0000000000030000
[  432.896463] FS:  00007f08c93cb700(0000) GS:ffff94b5efb40000(0000) knlGS:0000000000000000
[  432.896464] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  432.896465] CR2: 00007f08b80200a8 CR3: 000000085dfad000 CR4: 00000000000426e0
[  432.896465] Call Trace:
[  432.896470]  ? ns_capable+0x13/0x20
[  432.896475]  rtnetlink_rcv_msg+0xa4/0x230
[  432.896478]  ? __kmalloc_node_track_caller+0x191/0x290
[  432.896481]  ? __alloc_skb+0x8d/0x2b0
[  432.896483]  ? rtnl_newlink+0x870/0x870
[  432.896486]  netlink_rcv_skb+0xa7/0xc0
[  432.896488]  rtnetlink_rcv+0x28/0x30
[  432.896490]  netlink_unicast+0x181/0x240
[  432.896491]  netlink_sendmsg+0x32e/0x3b0
[  432.896495]  sock_sendmsg+0x38/0x50
[  432.896496]  ___sys_sendmsg+0x279/0x290
[  432.896500]  ? task_numa_fault+0x7e5/0xa90
[  432.896502]  ? migrate_misplaced_page+0x169/0x240
[  432.896506]  ? handle_mm_fault+0xa0d/0x1520
[  432.896510]  ? __fget_light+0x25/0x60
[  432.896512]  __sys_sendmsg+0x54/0x90
[  432.896514]  SyS_sendmsg+0x12/0x20
[  432.896516]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  432.896518] RIP: 0033:0x7f08cd4fc2fd
[  432.896519] RSP: 002b:00007f08c938c110 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[  432.896520] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f08cd4fc2fd
[  432.896520] RDX: 0000000000000000 RSI: 00007f08c938c170 RDI: 000000000000000e
[  432.896521] RBP: 00007f08b801e8d0 R08: 0000000000008000 R09: 00007f08b8000158
[  432.896522] R10: 00007f08b8020610 R11: 0000000000000293 R12: 000000000000003f
[  432.896522] R13: 0000564e0421ea18 R14: 00007f08b801e8f8 R15: 00007f08c93ca388
[  432.896538] Code: b5 58 ff ff ff 4c 89 e7 ff d0 48 85 c0 49 89 c2 41 bb ea ff ff ff 74 48 41 0f b7 4e 04 48 8b 18 66 83 f9 2d 75 17 e9 64 02 00 00 <8b> 43 1c 41 39 c7 0f 86 8c 00 00 00 49 89 da 48 8b 1b 48 85 db 
[  432.896540] Kernel panic - not syncing: softlockup: hung tasks
[  432.896541] CPU: 5 PID: 2849 Comm: handler129 Tainted: G             L  4.9.0+ #40
[  432.896542] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  432.896542] Call Trace:
[  432.896543]  <IRQ>
[  432.896546]  dump_stack+0x63/0x8c
[  432.896549]  panic+0xeb/0x239
[  432.896552]  watchdog_timer_fn+0x1e5/0x1f0
[  432.896554]  ? watchdog+0x40/0x40
[  432.896557]  __hrtimer_run_queues+0xee/0x270
[  432.896558]  hrtimer_interrupt+0xa8/0x190
[  432.896563]  local_apic_timer_interrupt+0x35/0x60
[  432.896565]  smp_apic_timer_interrupt+0x38/0x50
[  432.896566]  apic_timer_interrupt+0x93/0xa0
[  432.896568] RIP: 0010:tc_ctl_tfilter+0x1cc/0x8c0
[  432.896569] RSP: 0018:ffffad7944bb7a88 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[  432.896569] RAX: 0000000000020000 RBX: ffff94b5b02805a0 RCX: 000000000000002c
[  432.896570] RDX: ffff94b9e3729400 RSI: 0000000000000000 RDI: ffff94b5c0663400
[  432.896571] RBP: ffffad7944bb7b88 R08: 0000000000000000 R09: ffffad7944bb7af8
[  432.896572] R10: ffff94b5b02805a0 R11: 00000000ffffffea R12: ffff94b5c0663400
[  432.896573] R13: 00000000ffff0000 R14: ffff94b5e7e9c800 R15: 0000000000030000
[  432.896573]  </IRQ>
[  432.896575]  ? tc_ctl_tfilter+0x1ab/0x8c0
[  432.896577]  ? ns_capable+0x13/0x20
[  432.896579]  rtnetlink_rcv_msg+0xa4/0x230
[  432.896580]  ? __kmalloc_node_track_caller+0x191/0x290
[  432.896581]  ? __alloc_skb+0x8d/0x2b0
[  432.896583]  ? rtnl_newlink+0x870/0x870
[  432.896585]  netlink_rcv_skb+0xa7/0xc0
[  432.896586]  rtnetlink_rcv+0x28/0x30
[  432.896588]  netlink_unicast+0x181/0x240
[  432.896590]  netlink_sendmsg+0x32e/0x3b0
[  432.896592]  sock_sendmsg+0x38/0x50
[  432.896593]  ___sys_sendmsg+0x279/0x290
[  432.896594]  ? task_numa_fault+0x7e5/0xa90
[  432.896596]  ? migrate_misplaced_page+0x169/0x240
[  432.896597]  ? handle_mm_fault+0xa0d/0x1520
[  432.896599]  ? __fget_light+0x25/0x60
[  432.896601]  __sys_sendmsg+0x54/0x90
[  432.896603]  SyS_sendmsg+0x12/0x20
[  432.896605]  entry_SYSCALL_64_fastpath+0x1a/0xa9
[  432.896605] RIP: 0033:0x7f08cd4fc2fd
[  432.896606] RSP: 002b:00007f08c938c110 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[  432.896607] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f08cd4fc2fd
[  432.896607] RDX: 0000000000000000 RSI: 00007f08c938c170 RDI: 000000000000000e
[  432.896608] RBP: 00007f08b801e8d0 R08: 0000000000008000 R09: 00007f08b8000158
[  432.896609] R10: 00007f08b8020610 R11: 0000000000000293 R12: 000000000000003f
[  432.896609] R13: 0000564e0421ea18 R14: 00007f08b801e8f8 R15: 00007f08c93ca388
[  432.900868] ------------[ cut here ]------------
[  432.900872] WARNING: CPU: 0 PID: 2574 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3f/0x50
[  432.900893] Modules linked in: act_gact act_mirred openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 vfio_pci vfio_virqfd vfio_iommu_type1 vfio cls_flower mlx5_ib mlx5_core devlink sch_ingress nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat libcrc32c nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun ebtable_filter ebtables ip6table_filter ip6_tables netconsole rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi bridge stp ib_srpt llc ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass crct10dif_pclmul igb joydev crc32_pclmul ipmi_ssif crc32c_intel iTCO_wdt ptp ghash_clmulni_intel
[  432.900902]  pps_core iTCO_vendor_support i2c_algo_bit pcspkr ipmi_si wmi ipmi_msghandler ioatdma mei_me lpc_ich tpm_tis dca tpm_tis_core mei shpchp i2c_i801 i2c_smbus tpm target_core_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc isci libsas serio_raw scsi_transport_sas [last unloaded: devlink]
[  432.900904] CPU: 0 PID: 2574 Comm: CPU 0/KVM Tainted: G             L  4.9.0+ #40
[  432.900905] Hardware name: Supermicro X9DRW/X9DRW, BIOS 3.0a 08/08/2013
[  432.900905] Call Trace:
[  432.900906]  <IRQ>
[  432.900909]  dump_stack+0x63/0x8c
[  432.900911]  __warn+0xd1/0xf0
[  432.900913]  warn_slowpath_null+0x1d/0x20
[  432.900914]  native_smp_send_reschedule+0x3f/0x50
[  432.900917]  try_to_wake_up+0x312/0x390
[  432.900919]  default_wake_function+0x12/0x20
[  432.900921]  pollwake+0x73/0x90
[  432.900922]  ? wake_up_q+0x80/0x80
[  432.900926]  __wake_up_common+0x55/0x90
[  432.900928]  __wake_up_sync_key+0x45/0x60
[  432.900929]  sock_def_readable+0x3d/0x70
[  432.900933]  tun_net_xmit+0x1d8/0x360 [tun]
[  432.900936]  netpoll_start_xmit+0x11d/0x1a0
[  432.900937]  netpoll_send_skb_on_dev+0x19b/0x250
[  432.900945]  __br_forward+0x1b2/0x1d0 [bridge]
[  432.900946]  ? skb_clone+0x4c/0xa0
[  432.900948]  ? __skb_clone+0x2e/0x140
[  432.900951]  deliver_clone+0x39/0x60 [bridge]
[  432.900955]  br_flood+0x13b/0x190 [bridge]
[  432.900958]  br_dev_xmit+0x218/0x2c0 [bridge]
[  432.900959]  netpoll_start_xmit+0x11d/0x1a0
[  432.900961]  ? __alloc_skb+0x5d/0x2b0
[  432.900962]  netpoll_send_skb_on_dev+0x19b/0x250
[  432.900963]  netpoll_send_udp+0x2da/0x440
[  432.900967]  write_msg+0xb2/0xf0 [netconsole]
[  432.900970]  call_console_drivers.isra.10.constprop.29+0xef/0x100
[  432.900971]  console_unlock+0x22d/0x540
[  432.900972]  vprintk_emit+0x2eb/0x4b0
[  432.900975]  ? common_interrupt+0x93/0x93
[  432.900976]  vprintk_default+0x29/0x40
[  432.900978]  printk+0x5d/0x74
[  432.900991]  ? kvm_arch_vcpu_ioctl_run+0xa64/0x1590 [kvm]
[  432.900994]  __show_regs+0x37/0x2e0
[  432.900996]  ? update_stack_state+0x63/0x80
[  432.900998]  ? unwind_next_frame+0x83/0x1f0
[  432.901000]  ? common_interrupt+0x93/0x93
[  432.901001]  ? common_interrupt+0x93/0x93
[  432.901004]  show_trace_log_lvl+0x22d/0x380
[  432.901006]  show_regs+0x9f/0x1a0
[  432.901008]  watchdog_timer_fn+0x197/0x1f0
[  432.901009]  ? watchdog+0x40/0x40
[  432.901011]  __hrtimer_run_queues+0xee/0x270
[  432.901013]  hrtimer_interrupt+0xa8/0x190
[  432.901015]  local_apic_timer_interrupt+0x35/0x60
[  432.901016]  smp_apic_timer_interrupt+0x38/0x50
[  432.901017]  apic_timer_interrupt+0x93/0xa0
[  432.901018] RIP: 0010:tc_classify+0x5a/0x120



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-21  6:44                 ` Shahar Klein
@ 2016-12-21  7:03                   ` Cong Wang
  2016-12-21 10:15                     ` Daniel Borkmann
  2016-12-21 11:25                     ` Shahar Klein
  0 siblings, 2 replies; 17+ messages in thread
From: Cong Wang @ 2016-12-21  7:03 UTC (permalink / raw)
  To: Shahar Klein
  Cc: Daniel Borkmann, Or Gerlitz, Linux Netdev List, Roi Dayan,
	David Miller, Jiri Pirko, John Fastabend, Hadar Hen Zion

On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shahark@mellanox.com> wrote:
>
> Tried it with same results

This piece is pretty interesting:

[  408.554689] DEBUGG:SK thread-2853[cpu-1] setting tp_created to 1
tp=ffff94b5b02805a0 back=ffff94b9ea932060
[  408.574258] DEBUGG:SK thread-2853[cpu-1] add/change filter by:
fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b9ea932060
[  408.587849] DEBUGG:SK destroy ffff94b5b0280780 tcf_destroy:1905
[  408.595862] DEBUGG:SK thread-2845[cpu-1] add/change filter by:
fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0

Looks like you added a debug printk inside tcf_destroy() too,
which seems racy with filter creation, it should not happen since
in both cases we take RTNL lock.

Don't know if changing all RCU_INIT_POINTER in that file to
rcu_assign_pointer could help anything or not. Mind to try?


Thanks for debugging!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-21  7:03                   ` Cong Wang
@ 2016-12-21 10:15                     ` Daniel Borkmann
  2016-12-21 12:58                       ` Shahar Klein
  2016-12-21 11:25                     ` Shahar Klein
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel Borkmann @ 2016-12-21 10:15 UTC (permalink / raw)
  To: Cong Wang, Shahar Klein
  Cc: Or Gerlitz, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion

On 12/21/2016 08:03 AM, Cong Wang wrote:
> On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shahark@mellanox.com> wrote:
[...]
> Looks like you added a debug printk inside tcf_destroy() too,
> which seems racy with filter creation, it should not happen since
> in both cases we take RTNL lock.
>
> Don't know if changing all RCU_INIT_POINTER in that file to
> rcu_assign_pointer could help anything or not. Mind to try?

I don't think at this point that it's RCU related at all.

I have a theory on what is happening. Quoting the piece in question from Shahar's log:

  1: thread-2845[cpu-1] setting tp_created to 1 tp=ffff94b5b0280780 back=ffff94b9ea932060
  2: thread-2856[cpu-1] setting tp_created to 1 tp=ffff94b9ea9322a0 back=ffff94b9ea932060
  3: thread-2843[cpu-1] setting tp_created to 1 tp=ffff94b5b402c960 back=ffff94b9ea932060
  4: destroy ffff94b5b669fea0 tcf_destroy:1905
  5: thread-2853[cpu-1] setting tp_created to 1 tp=ffff94b5b02805a0 back=ffff94b9ea932060
  6: thread-2853[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b9ea932060
  7: destroy ffff94b5b0280780 tcf_destroy:1905
  8: thread-2845[cpu-1] add/change filter by: fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0

The interesting thing is that all this happens on CPU1, so as you say we're under rtnl.
In 1), thread-2845 creates tp=ffff94b5b0280780, which is destroyed in 7), presumably also
by thread-2845, and the weird part is why suddenly in 8) thread-2845 adds a created filter
without actually creating it. Plus, thread-2845 got interrupted, which means it must have
dropped rntl in the middle. We drop it in tc_ctl_tfilter() when we do tcf_proto_lookup_ops()
and need to pull in a module, but here this doesn't make sense at all since i) at this
point we haven't created the tp yet and 2) flower was already there. Thus the only explanation
where this must have happened is where we called tp->ops->change(). So here the return
code must have been -EAGAIN, which makes sense because in 7) we destroyed that specific
tp instance. Which means we goto replay but *do not* clear tp_created. I think that is
the bug in question. So, while we dropped rtnl in the meantime, some other tp instance
was added (tp=ffff94b5b02805a0) that we had a match on in round 2, but we still think it
was newly created which wasn't the actual case. So we'd need to deal with the fact that
->change() callback could return -EAGAIN as well. Now looking at flower, I think the call
chain must have been fl_change() -> fl_set_parms() -> tcf_exts_validate() -> tcf_action_init()
-> tcf_action_init_1(). And here one possibility I see is that tc_lookup_action_n()
failed, therefore we shortly dropped rtnl for the request_module() where the module
got loaded successfully and thus error code from there is -EAGAIN that got propagated
all the way through ->change() from tc_ctl_tfilter(). So it looks like a generic issue
not specifically tied to flower.

Shahar, can you test the following? Thanks!

  net/sched/cls_api.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..1ecdf80 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -148,13 +148,15 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
  	unsigned long cl;
  	unsigned long fh;
  	int err;
-	int tp_created = 0;
+	int tp_created;

  	if ((n->nlmsg_type != RTM_GETTFILTER) &&
  	    !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
  		return -EPERM;

  replay:
+	tp_created = 0;
+
  	err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
  	if (err < 0)
  		return err;
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-21  7:03                   ` Cong Wang
  2016-12-21 10:15                     ` Daniel Borkmann
@ 2016-12-21 11:25                     ` Shahar Klein
  1 sibling, 0 replies; 17+ messages in thread
From: Shahar Klein @ 2016-12-21 11:25 UTC (permalink / raw)
  To: Cong Wang
  Cc: shahark, Daniel Borkmann, Or Gerlitz, Linux Netdev List,
	Roi Dayan, David Miller, Jiri Pirko, John Fastabend,
	Hadar Hen Zion

[-- Attachment #1: Type: text/plain, Size: 1003 bytes --]



On 12/21/2016 9:03 AM, Cong Wang wrote:
> On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shahark@mellanox.com> wrote:
>>
>> Tried it with same results
>
> This piece is pretty interesting:
>
> [  408.554689] DEBUGG:SK thread-2853[cpu-1] setting tp_created to 1
> tp=ffff94b5b02805a0 back=ffff94b9ea932060
> [  408.574258] DEBUGG:SK thread-2853[cpu-1] add/change filter by:
> fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b9ea932060
> [  408.587849] DEBUGG:SK destroy ffff94b5b0280780 tcf_destroy:1905
> [  408.595862] DEBUGG:SK thread-2845[cpu-1] add/change filter by:
> fl_get [cls_flower] tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
>
> Looks like you added a debug printk inside tcf_destroy() too,
> which seems racy with filter creation, it should not happen since
> in both cases we take RTNL lock.
>
> Don't know if changing all RCU_INIT_POINTER in that file to
> rcu_assign_pointer could help anything or not. Mind to try?
>

Tried it with same results

>
> Thanks for debugging!
>

[-- Attachment #2: cls_api_no_RCU.c.diff --]
[-- Type: text/plain, Size: 2704 bytes --]

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..b8a66d8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -305,7 +305,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 			kfree(tp);
 			goto errout;
 		}
-
+		printk(KERN_ERR "DEBUGG:SK thread-%d[cpu-%d] setting tp_created to 1 tp=%p back=%p\n", current->pid, current->on_cpu, tp, rtnl_dereference(*back));
 		tp_created = 1;
 
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
@@ -317,11 +317,13 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		if (n->nlmsg_type == RTM_DELTFILTER && t->tcm_handle == 0) {
 			struct tcf_proto *next = rtnl_dereference(tp->next);
 
-			RCU_INIT_POINTER(*back, next);
+			printk(KERN_ERR "DEBUGG:SK delete filter by: %pf\n", tp->ops->get);
+
+			rcu_assign_pointer(*back, next);
 
 			tfilter_notify(net, skb, n, tp, fh,
 				       RTM_DELTFILTER, false);
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 			err = 0;
 			goto errout;
 		}
@@ -331,25 +333,30 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 		    !(n->nlmsg_flags & NLM_F_CREATE))
 			goto errout;
 	} else {
+		bool last;
+
 		switch (n->nlmsg_type) {
 		case RTM_NEWTFILTER:
 			err = -EEXIST;
 			if (n->nlmsg_flags & NLM_F_EXCL) {
 				if (tp_created)
-					tcf_destroy(tp, true);
+					tcf_destroy(tp);
 				goto errout;
 			}
 			break;
 		case RTM_DELTFILTER:
-			err = tp->ops->delete(tp, fh);
+			printk(KERN_ERR "DEBUGG:SK %s:%d\n", __func__, __LINE__);
+			err = tp->ops->delete(tp, fh, &last);
 			if (err == 0) {
-				struct tcf_proto *next = rtnl_dereference(tp->next);
-
 				tfilter_notify(net, skb, n, tp,
 					       t->tcm_handle,
 					       RTM_DELTFILTER, false);
-				if (tcf_destroy(tp, false))
-					RCU_INIT_POINTER(*back, next);
+				if (last) {
+					struct tcf_proto *next = rtnl_dereference(tp->next);
+
+					rcu_assign_pointer(*back, next);
+					tcf_destroy(tp);
+				}
 			}
 			goto errout;
 		case RTM_GETTFILTER:
@@ -366,13 +373,14 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n)
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE);
 	if (err == 0) {
 		if (tp_created) {
-			RCU_INIT_POINTER(tp->next, rtnl_dereference(*back));
+			rcu_assign_pointer(tp->next, rtnl_dereference(*back));
 			rcu_assign_pointer(*back, tp);
+			printk(KERN_ERR "DEBUGG:SK thread-%d[cpu-%d] add/change filter by: %pf tp=%p tp->next=%p\n", current->pid, current->on_cpu, tp->ops->get, tp, tp->next);
 		}
 		tfilter_notify(net, skb, n, tp, fh, RTM_NEWTFILTER, false);
 	} else {
 		if (tp_created)
-			tcf_destroy(tp, true);
+			tcf_destroy(tp);
 	}
 
 errout:

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-21 10:15                     ` Daniel Borkmann
@ 2016-12-21 12:58                       ` Shahar Klein
  2016-12-21 13:18                         ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Shahar Klein @ 2016-12-21 12:58 UTC (permalink / raw)
  To: Daniel Borkmann, Cong Wang
  Cc: shahark, Or Gerlitz, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion



On 12/21/2016 12:15 PM, Daniel Borkmann wrote:
> On 12/21/2016 08:03 AM, Cong Wang wrote:
>> On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shahark@mellanox.com>
>> wrote:
> [...]
>> Looks like you added a debug printk inside tcf_destroy() too,
>> which seems racy with filter creation, it should not happen since
>> in both cases we take RTNL lock.
>>
>> Don't know if changing all RCU_INIT_POINTER in that file to
>> rcu_assign_pointer could help anything or not. Mind to try?
>
> I don't think at this point that it's RCU related at all.
>
> I have a theory on what is happening. Quoting the piece in question from
> Shahar's log:
>
>  1: thread-2845[cpu-1] setting tp_created to 1 tp=ffff94b5b0280780
> back=ffff94b9ea932060
>  2: thread-2856[cpu-1] setting tp_created to 1 tp=ffff94b9ea9322a0
> back=ffff94b9ea932060
>  3: thread-2843[cpu-1] setting tp_created to 1 tp=ffff94b5b402c960
> back=ffff94b9ea932060
>  4: destroy ffff94b5b669fea0 tcf_destroy:1905
>  5: thread-2853[cpu-1] setting tp_created to 1 tp=ffff94b5b02805a0
> back=ffff94b9ea932060
>  6: thread-2853[cpu-1] add/change filter by: fl_get [cls_flower]
> tp=ffff94b5b02805a0 tp->next=ffff94b9ea932060
>  7: destroy ffff94b5b0280780 tcf_destroy:1905
>  8: thread-2845[cpu-1] add/change filter by: fl_get [cls_flower]
> tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
>
> The interesting thing is that all this happens on CPU1, so as you say
> we're under rtnl.
> In 1), thread-2845 creates tp=ffff94b5b0280780, which is destroyed in
> 7), presumably also
> by thread-2845, and the weird part is why suddenly in 8) thread-2845
> adds a created filter
> without actually creating it. Plus, thread-2845 got interrupted, which
> means it must have
> dropped rntl in the middle. We drop it in tc_ctl_tfilter() when we do
> tcf_proto_lookup_ops()
> and need to pull in a module, but here this doesn't make sense at all
> since i) at this
> point we haven't created the tp yet and 2) flower was already there.
> Thus the only explanation
> where this must have happened is where we called tp->ops->change(). So
> here the return
> code must have been -EAGAIN, which makes sense because in 7) we
> destroyed that specific
> tp instance. Which means we goto replay but *do not* clear tp_created. I
> think that is
> the bug in question. So, while we dropped rtnl in the meantime, some
> other tp instance
> was added (tp=ffff94b5b02805a0) that we had a match on in round 2, but
> we still think it
> was newly created which wasn't the actual case. So we'd need to deal
> with the fact that
> ->change() callback could return -EAGAIN as well. Now looking at flower,
> I think the call
> chain must have been fl_change() -> fl_set_parms() ->
> tcf_exts_validate() -> tcf_action_init()
> -> tcf_action_init_1(). And here one possibility I see is that
> tc_lookup_action_n()
> failed, therefore we shortly dropped rtnl for the request_module() where
> the module
> got loaded successfully and thus error code from there is -EAGAIN that
> got propagated
> all the way through ->change() from tc_ctl_tfilter(). So it looks like a
> generic issue
> not specifically tied to flower.
>
> Shahar, can you test the following? Thanks!
>
>  net/sched/cls_api.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 3fbba79..1ecdf80 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -148,13 +148,15 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
> struct nlmsghdr *n)
>      unsigned long cl;
>      unsigned long fh;
>      int err;
> -    int tp_created = 0;
> +    int tp_created;
>
>      if ((n->nlmsg_type != RTM_GETTFILTER) &&
>          !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
>          return -EPERM;
>
>  replay:
> +    tp_created = 0;
> +
>      err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
>      if (err < 0)
>          return err;

Test run successfully!
I'll remove all other debug "fixes" and run again later

Thanks Daniel!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Soft lockup in tc_classify
  2016-12-21 12:58                       ` Shahar Klein
@ 2016-12-21 13:18                         ` Daniel Borkmann
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Borkmann @ 2016-12-21 13:18 UTC (permalink / raw)
  To: Shahar Klein, Cong Wang
  Cc: Or Gerlitz, Linux Netdev List, Roi Dayan, David Miller,
	Jiri Pirko, John Fastabend, Hadar Hen Zion

On 12/21/2016 01:58 PM, Shahar Klein wrote:
> On 12/21/2016 12:15 PM, Daniel Borkmann wrote:
>> On 12/21/2016 08:03 AM, Cong Wang wrote:
>>> On Tue, Dec 20, 2016 at 10:44 PM, Shahar Klein <shahark@mellanox.com>
>>> wrote:
>> [...]
>>> Looks like you added a debug printk inside tcf_destroy() too,
>>> which seems racy with filter creation, it should not happen since
>>> in both cases we take RTNL lock.
>>>
>>> Don't know if changing all RCU_INIT_POINTER in that file to
>>> rcu_assign_pointer could help anything or not. Mind to try?
>>
>> I don't think at this point that it's RCU related at all.
>>
>> I have a theory on what is happening. Quoting the piece in question from
>> Shahar's log:
>>
>>  1: thread-2845[cpu-1] setting tp_created to 1 tp=ffff94b5b0280780
>> back=ffff94b9ea932060
>>  2: thread-2856[cpu-1] setting tp_created to 1 tp=ffff94b9ea9322a0
>> back=ffff94b9ea932060
>>  3: thread-2843[cpu-1] setting tp_created to 1 tp=ffff94b5b402c960
>> back=ffff94b9ea932060
>>  4: destroy ffff94b5b669fea0 tcf_destroy:1905
>>  5: thread-2853[cpu-1] setting tp_created to 1 tp=ffff94b5b02805a0
>> back=ffff94b9ea932060
>>  6: thread-2853[cpu-1] add/change filter by: fl_get [cls_flower]
>> tp=ffff94b5b02805a0 tp->next=ffff94b9ea932060
>>  7: destroy ffff94b5b0280780 tcf_destroy:1905
>>  8: thread-2845[cpu-1] add/change filter by: fl_get [cls_flower]
>> tp=ffff94b5b02805a0 tp->next=ffff94b5b02805a0
>>
>> The interesting thing is that all this happens on CPU1, so as you say
>> we're under rtnl.
>> In 1), thread-2845 creates tp=ffff94b5b0280780, which is destroyed in
>> 7), presumably also
>> by thread-2845, and the weird part is why suddenly in 8) thread-2845
>> adds a created filter
>> without actually creating it. Plus, thread-2845 got interrupted, which
>> means it must have
>> dropped rntl in the middle. We drop it in tc_ctl_tfilter() when we do
>> tcf_proto_lookup_ops()
>> and need to pull in a module, but here this doesn't make sense at all
>> since i) at this
>> point we haven't created the tp yet and 2) flower was already there.
>> Thus the only explanation
>> where this must have happened is where we called tp->ops->change(). So
>> here the return
>> code must have been -EAGAIN, which makes sense because in 7) we
>> destroyed that specific
>> tp instance. Which means we goto replay but *do not* clear tp_created. I
>> think that is
>> the bug in question. So, while we dropped rtnl in the meantime, some
>> other tp instance
>> was added (tp=ffff94b5b02805a0) that we had a match on in round 2, but
>> we still think it
>> was newly created which wasn't the actual case. So we'd need to deal
>> with the fact that
>> ->change() callback could return -EAGAIN as well. Now looking at flower,
>> I think the call
>> chain must have been fl_change() -> fl_set_parms() ->
>> tcf_exts_validate() -> tcf_action_init()
>> -> tcf_action_init_1(). And here one possibility I see is that
>> tc_lookup_action_n()
>> failed, therefore we shortly dropped rtnl for the request_module() where
>> the module
>> got loaded successfully and thus error code from there is -EAGAIN that
>> got propagated
>> all the way through ->change() from tc_ctl_tfilter(). So it looks like a
>> generic issue
>> not specifically tied to flower.
>>
>> Shahar, can you test the following? Thanks!
>>
>>  net/sched/cls_api.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 3fbba79..1ecdf80 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -148,13 +148,15 @@ static int tc_ctl_tfilter(struct sk_buff *skb,
>> struct nlmsghdr *n)
>>      unsigned long cl;
>>      unsigned long fh;
>>      int err;
>> -    int tp_created = 0;
>> +    int tp_created;
>>
>>      if ((n->nlmsg_type != RTM_GETTFILTER) &&
>>          !netlink_ns_capable(skb, net->user_ns, CAP_NET_ADMIN))
>>          return -EPERM;
>>
>>  replay:
>> +    tp_created = 0;
>> +
>>      err = nlmsg_parse(n, sizeof(*t), tca, TCA_MAX, NULL);
>>      if (err < 0)
>>          return err;
>
> Test run successfully!
> I'll remove all other debug "fixes" and run again later
>
> Thanks Daniel!

Great, thanks for confirming so far Shahar!
I'll cook an official patch in the meantime.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2016-12-21 13:31 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <c1c394d3-3aea-52a8-89e3-be57d4d46b8e@mellanox.com>
2016-12-12  9:43 ` Soft lockup in tc_classify Shahar Klein
2016-12-12 13:28   ` Daniel Borkmann
2016-12-12 16:04     ` Shahar Klein
2016-12-12 19:07       ` Cong Wang
2016-12-13 11:59         ` Shahar Klein
2016-12-12 21:18     ` Or Gerlitz
2016-12-12 22:51       ` Cong Wang
2016-12-19 16:39         ` Shahar Klein
2016-12-19 17:58           ` Cong Wang
2016-12-20  6:22             ` Shahar Klein
2016-12-20 11:47               ` Daniel Borkmann
2016-12-21  6:44                 ` Shahar Klein
2016-12-21  7:03                   ` Cong Wang
2016-12-21 10:15                     ` Daniel Borkmann
2016-12-21 12:58                       ` Shahar Klein
2016-12-21 13:18                         ` Daniel Borkmann
2016-12-21 11:25                     ` Shahar Klein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.