All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
@ 2012-08-27  9:33 Oliver
  2012-08-28 10:52 ` Pablo Neira Ayuso
       [not found] ` <5427975.6moJlq4F9d@gentoovm>
  0 siblings, 2 replies; 16+ messages in thread
From: Oliver @ 2012-08-27  9:33 UTC (permalink / raw)
  To: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 4564 bytes --]

In a previous version of ctnetlink, a race condition could be caused as a 
result of ctnetlink_del_conntrack not setting the IPS_DYING_BIT that is 
checked by death_by_timeout()

I found that in 3.4.9 I could trigger a soft-lockup by packet flooding a pair 
of systems running conntrackd with NetlinkEventsReliable On.

I found that death_by_event() does not currently check the IPS_DYING_BIT and 
therefore, based on the panic stack trace, I added the bit check to 
death_by_event() and have since been unable to reproduce the crash.

I hope this patch is correct/useful - I'm not innately familiar with the 
conntrack code so perhaps I'm breaking the reliable event reporting with this 
change.

kernel panic is as follows:

Aug 24 14:02:39 fw02-lab [ 2544.350016] BUG: soft lockup - CPU#6 stuck for 
24s! [conntrackd:5119]
Aug 24 14:02:39 fw02-lab [ 2544.350536] Kernel panic - not syncing: 
softlockup: hung tasks
Aug 24 14:02:39 fw02-lab [ 2544.350662] Pid: 5119, comm: conntrackd Tainted: G        
W    3.4.9 #2
Aug 24 14:02:39 fw02-lab [ 2544.350786] Call Trace:
Aug 24 14:02:39 fw02-lab [ 2544.350903]  <IRQ>
Aug 24 14:02:39 fw02-lab [<ffffffff81683ab2>] ? panic+0xbe/0x1cd
Aug 24 14:02:39 fw02-lab [ 2544.351078]  [<ffffffff810a8493>] ? 
watchdog_timer_fn+0x173/0x180
Aug 24 14:02:39 fw02-lab [ 2544.351204]  [<ffffffff8106847e>] ? 
__run_hrtimer.clone.33+0x4e/0x110
Aug 24 14:02:39 fw02-lab [ 2544.351330]  [<ffffffff81068d34>] ? 
hrtimer_interrupt+0xf4/0x250
Aug 24 14:02:39 fw02-lab [ 2544.351455]  [<ffffffff8101ee43>] ? 
smp_apic_timer_interrupt+0x63/0xa0
Aug 24 14:02:39 fw02-lab [ 2544.351591]  [<ffffffff816878c7>] ? 
apic_timer_interrupt+0x67/0x70
Aug 24 14:02:39 fw02-lab [ 2544.351715]  [<ffffffff814ed7a1>] ? 
__kfree_skb+0x11/0x90
Aug 24 14:02:39 fw02-lab [ 2544.351837]  [<ffffffff815271e3>] ? 
netlink_broadcast_filtered+0x123/0x3c0
Aug 24 14:02:39 fw02-lab [ 2544.351962]  [<ffffffff8152e22e>] ? 
death_by_event+0x3e/0x1f0
Aug 24 14:02:39 fw02-lab [ 2544.352085]  [<ffffffff8152e3c5>] ? 
death_by_event+0x1d5/0x1f0
Aug 24 14:02:39 fw02-lab [ 2544.352209]  [<ffffffff8105445f>] ? 
run_timer_softirq+0x11f/0x240
Aug 24 14:02:39 fw02-lab [ 2544.352333]  [<ffffffff8152e1f0>] ? 
nf_conntrack_hash_check_insert+0x270/0x270
Aug 24 14:02:39 fw02-lab [ 2544.352524]  [<ffffffff8104f2c8>] ? 
__do_softirq+0x98/0x120
Aug 24 14:02:39 fw02-lab [ 2544.352647]  [<ffffffff8168820c>] ? 
call_softirq+0x1c/0x30
Aug 24 14:02:39 fw02-lab [ 2544.352767]  <EOI>
Aug 24 14:02:39 fw02-lab [<ffffffff8100460d>] ? do_softirq+0x4d/0x80
Aug 24 14:02:39 fw02-lab [ 2544.352938]  [<ffffffff8104f224>] ? 
local_bh_enable+0x94/0xa0
Aug 24 14:02:39 fw02-lab [ 2544.353061]  [<ffffffff8152dd5d>] ? 
____nf_conntrack_find+0x10d/0x120
Aug 24 14:02:39 fw02-lab [ 2544.353186]  [<ffffffff8152ddb9>] ? 
__nf_conntrack_find_get+0x49/0x170
Aug 24 14:02:39 fw02-lab [ 2544.353311]  [<ffffffff8153876c>] ? 
ctnetlink_del_conntrack+0xac/0x300
Aug 24 14:02:39 fw02-lab [ 2544.353435]  [<ffffffff81280f70>] ? 
nla_parse+0x80/0xd0
Aug 24 14:02:39 fw02-lab [ 2544.353558]  [<ffffffff8152c93e>] ? 
nfnetlink_rcv_msg+0x1ee/0x220
Aug 24 14:02:39 fw02-lab [ 2544.353682]  [<ffffffff8152c77a>] ? 
nfnetlink_rcv_msg+0x2a/0x220
Aug 24 14:02:39 fw02-lab [ 2544.353806]  [<ffffffff8152c750>] ? 
nfnl_lock+0x20/0x20
Aug 24 14:02:39 fw02-lab [ 2544.353927]  [<ffffffff81529389>] ? 
netlink_rcv_skb+0x99/0xc0
Aug 24 14:02:39 fw02-lab [ 2544.354050]  [<ffffffff81528d1f>] ? 
netlink_unicast+0x1af/0x200
Aug 24 14:02:39 fw02-lab [ 2544.354173]  [<ffffffff81528fa8>] ? 
netlink_sendmsg+0x238/0x350
Aug 24 14:02:39 fw02-lab [ 2544.354296]  [<ffffffff814e5084>] ? 
sock_sendmsg+0xe4/0x130
Aug 24 14:02:39 fw02-lab [ 2544.354418]  [<ffffffff814e4eed>] ? 
sock_recvmsg+0xed/0x140
Aug 24 14:02:39 fw02-lab [ 2544.354542]  [<ffffffff8111f5fd>] ? 
core_sys_select+0x22d/0x340
Aug 24 14:02:39 fw02-lab [ 2544.354665]  [<ffffffff814e64fb>] ? 
move_addr_to_kernel+0x2b/0xa0
Aug 24 14:02:39 fw02-lab [ 2544.354788]  [<ffffffff814e4272>] ? 
sockfd_lookup_light+0x22/0x90
Aug 24 14:02:39 fw02-lab [ 2544.354912]  [<ffffffff814e6fac>] ? 
sys_sendto+0x13c/0x1a0
Aug 24 14:02:39 fw02-lab [ 2544.355034]  [<ffffffff814ed7a1>] ? 
__kfree_skb+0x11/0x90
Aug 24 14:02:39 fw02-lab [ 2544.355157]  [<ffffffff8126c574>] ? 
rb_insert_color+0xa4/0x140
Aug 24 14:02:39 fw02-lab [ 2544.355279]  [<ffffffff81077e27>] ? 
dequeue_pushable_task+0x27/0x70
Aug 24 14:02:39 fw02-lab [ 2544.355404]  [<ffffffff81686e62>] ? 
system_call_fastpath+0x16/0x1b
Aug 24 14:02:39 fw02-lab [ 2544.355541] Rebooting in 5 seconds..

Thanks,
Oliver

[-- Attachment #2: death_by_event-check-dying-bit.patch --]
[-- Type: text/x-patch, Size: 625 bytes --]

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 729f157..5c274f3 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -250,7 +250,8 @@ static void death_by_event(unsigned long ul_conntrack)
 	struct nf_conn *ct = (void *)ul_conntrack;
 	struct net *net = nf_ct_net(ct);
 
-	if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
+	if (!test_bit(IPS_DYING_BIT, &ct->status) &&
+	    nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
 		/* bad luck, let's retry again */
 		ct->timeout.expires = jiffies +
 			(random32() % net->ct.sysctl_events_retry_timeout);

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-27  9:33 [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack Oliver
@ 2012-08-28 10:52 ` Pablo Neira Ayuso
  2012-08-28 17:16   ` Oliver
       [not found] ` <5427975.6moJlq4F9d@gentoovm>
  1 sibling, 1 reply; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-28 10:52 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

Hi Oliver,

On Mon, Aug 27, 2012 at 11:33:29AM +0200, Oliver wrote:
> In a previous version of ctnetlink, a race condition could be caused as a 
> result of ctnetlink_del_conntrack not setting the IPS_DYING_BIT that is 
> checked by death_by_timeout()
> 
> I found that in 3.4.9 I could trigger a soft-lockup by packet flooding a pair 
> of systems running conntrackd with NetlinkEventsReliable On.
> 
> I found that death_by_event() does not currently check the IPS_DYING_BIT and 
> therefore, based on the panic stack trace, I added the bit check to 
> death_by_event() and have since been unable to reproduce the crash.
> 
> I hope this patch is correct/useful - I'm not innately familiar with the 
> conntrack code so perhaps I'm breaking the reliable event reporting with this 
> change.

It seems we're hitting death_by_event twice, which should not happen.

Would you give a try to the following patch?

Thanks.

[-- Attachment #2: 0001-netfilter-nf_conntrack-fix-race-in-timer-handling-wi.patch --]
[-- Type: text/x-diff, Size: 10946 bytes --]

>From 5d76750a082c85d032bc0ed9783b169c9bbe1cdc Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 27 Aug 2012 14:15:14 +0200
Subject: [PATCH] netfilter: nf_conntrack: fix race in timer handling with
 reliable events

This patch fixes a rare race condition running conntrackd in reliable
events mode with the following conditions:

1) ctnetlink holds a reference to one conntrack in the
   initial ctnetlink_del_conntrack path.
2) the conntrack timer expires so we remove it from hashes.
3) we fail to deliver the event, thus, the entry is reinserted in
   the dying list and the timer is restarted.
4) ctnetlink deletes the timer (while the conntrack is in the dying
   list) and it reinserts the conntrack in the dying list.

This patch fixes this by using the following logic:

1) all entries deleted from the hashes that we failed to insert in
   the dying list has the IPS_DYING bit set.
2) we always check for IPS_DYING, if set, we don't delete the
   timer since the object is already in the dying list.
3) The new function nf_ct_delete sets the IPS_DYING bit (if not set
   already) and deliver the event. If IPS_DYING is not set, the
   event is not delivered.

The functions nf_ct_delete_from_lists and nf_ct_insert_dying_list
are not exported anymore. Instead nf_ct_delete is exported.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack.h        |    5 +-
 include/net/netfilter/nf_conntrack_ecache.h |    2 +-
 net/netfilter/nf_conntrack_core.c           |   75 ++++++++++-----------------
 net/netfilter/nf_conntrack_netlink.c        |   18 ++-----
 net/netfilter/nf_conntrack_proto.c          |    4 +-
 5 files changed, 34 insertions(+), 70 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index f1494fe..0433065 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -181,8 +181,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
 		    const struct nf_conntrack_tuple *tuple);
 
 extern int nf_conntrack_hash_check_insert(struct nf_conn *ct);
-extern void nf_ct_delete_from_lists(struct nf_conn *ct);
-extern void nf_ct_insert_dying_list(struct nf_conn *ct);
+bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report);
 
 extern void nf_conntrack_flush_report(struct net *net, u32 pid, int report);
 
@@ -249,7 +248,7 @@ extern void nf_ct_untracked_status_or(unsigned long bits);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted. */
 extern void
-nf_ct_iterate_cleanup(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data);
+nf_ct_iterate_cleanup(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data, u32 pid, int report);
 extern void nf_conntrack_free(struct nf_conn *ct);
 extern struct nf_conn *
 nf_conntrack_alloc(struct net *net, u16 zone,
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index e1ce104..2de13b9 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -108,7 +108,7 @@ nf_conntrack_eventmask_report(unsigned int eventmask,
 	if (e == NULL)
 		goto out_unlock;
 
-	if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) {
+	if (nf_ct_is_confirmed(ct)) {
 		struct nf_ct_event item = {
 			.ct 	= ct,
 			.pid	= e->pid ? e->pid : pid,
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index fc61f40..f84ad9a 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -231,7 +231,7 @@ destroy_conntrack(struct nf_conntrack *nfct)
 	nf_conntrack_free(ct);
 }
 
-void nf_ct_delete_from_lists(struct nf_conn *ct)
+static void nf_ct_delete_from_lists(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 
@@ -243,7 +243,6 @@ void nf_ct_delete_from_lists(struct nf_conn *ct)
 	clean_from_lists(ct);
 	spin_unlock_bh(&nf_conntrack_lock);
 }
-EXPORT_SYMBOL_GPL(nf_ct_delete_from_lists);
 
 static void death_by_event(unsigned long ul_conntrack)
 {
@@ -257,15 +256,13 @@ static void death_by_event(unsigned long ul_conntrack)
 		add_timer(&ct->timeout);
 		return;
 	}
-	/* we've got the event delivered, now it's dying */
-	set_bit(IPS_DYING_BIT, &ct->status);
 	spin_lock(&nf_conntrack_lock);
 	hlist_nulls_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 	spin_unlock(&nf_conntrack_lock);
 	nf_ct_put(ct);
 }
 
-void nf_ct_insert_dying_list(struct nf_conn *ct)
+static void nf_ct_insert_dying_list(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 
@@ -280,27 +277,32 @@ void nf_ct_insert_dying_list(struct nf_conn *ct)
 		(random32() % net->ct.sysctl_events_retry_timeout);
 	add_timer(&ct->timeout);
 }
-EXPORT_SYMBOL_GPL(nf_ct_insert_dying_list);
 
-static void death_by_timeout(unsigned long ul_conntrack)
+bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report)
 {
-	struct nf_conn *ct = (void *)ul_conntrack;
 	struct nf_conn_tstamp *tstamp;
 
 	tstamp = nf_conn_tstamp_find(ct);
 	if (tstamp && tstamp->stop == 0)
 		tstamp->stop = ktime_to_ns(ktime_get_real());
 
-	if (!test_bit(IPS_DYING_BIT, &ct->status) &&
-	    unlikely(nf_conntrack_event(IPCT_DESTROY, ct) < 0)) {
+	if (!test_and_set_bit(IPS_DYING_BIT, &ct->status) &&
+	    unlikely(nf_conntrack_event_report(IPCT_DESTROY, ct, pid,
+								report) < 0)) {
 		/* destroy event was not delivered */
 		nf_ct_delete_from_lists(ct);
 		nf_ct_insert_dying_list(ct);
-		return;
+		return false;
 	}
-	set_bit(IPS_DYING_BIT, &ct->status);
 	nf_ct_delete_from_lists(ct);
 	nf_ct_put(ct);
+	return true;
+}
+EXPORT_SYMBOL_GPL(nf_ct_delete);
+
+static void death_by_timeout(unsigned long ul_conntrack)
+{
+	nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
 }
 
 /*
@@ -634,11 +636,9 @@ static noinline int early_drop(struct net *net, unsigned int hash)
 	if (!ct)
 		return dropped;
 
-	if (del_timer(&ct->timeout)) {
-		death_by_timeout((unsigned long)ct);
-		/* Check if we indeed killed this entry. Reliable event
-		   delivery may have inserted it into the dying list. */
-		if (test_bit(IPS_DYING_BIT, &ct->status)) {
+	if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout)) {
+		/* Check if we indeed killed this entry */
+		if (nf_ct_delete(ct, 0, 0)) {
 			dropped = 1;
 			NF_CT_STAT_INC_ATOMIC(net, early_drop);
 		}
@@ -1117,8 +1117,8 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 		}
 	}
 
-	if (del_timer(&ct->timeout)) {
-		ct->timeout.function((unsigned long)ct);
+	if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout)) {
+		nf_ct_delete(ct, 0, 0);
 		return true;
 	}
 	return false;
@@ -1219,8 +1219,7 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	}
 	hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
-		if (iter(ct, data))
-			set_bit(IPS_DYING_BIT, &ct->status);
+		iter(ct, data);
 	}
 	spin_unlock_bh(&nf_conntrack_lock);
 	return NULL;
@@ -1232,15 +1231,15 @@ found:
 
 void nf_ct_iterate_cleanup(struct net *net,
 			   int (*iter)(struct nf_conn *i, void *data),
-			   void *data)
+			   void *data, u32 pid, int report)
 {
 	struct nf_conn *ct;
 	unsigned int bucket = 0;
 
 	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
 		/* Time to push up daises... */
-		if (del_timer(&ct->timeout))
-			death_by_timeout((unsigned long)ct);
+		if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout))
+			nf_ct_delete(ct, pid, report);
 		/* ... else the timer will get him soon. */
 
 		nf_ct_put(ct);
@@ -1248,32 +1247,14 @@ void nf_ct_iterate_cleanup(struct net *net,
 }
 EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup);
 
-struct __nf_ct_flush_report {
-	u32 pid;
-	int report;
-};
-
-static int kill_report(struct nf_conn *i, void *data)
+static int kill_all(struct nf_conn *i, void *data)
 {
-	struct __nf_ct_flush_report *fr = (struct __nf_ct_flush_report *)data;
 	struct nf_conn_tstamp *tstamp;
 
 	tstamp = nf_conn_tstamp_find(i);
 	if (tstamp && tstamp->stop == 0)
 		tstamp->stop = ktime_to_ns(ktime_get_real());
 
-	/* If we fail to deliver the event, death_by_timeout() will retry */
-	if (nf_conntrack_event_report(IPCT_DESTROY, i,
-				      fr->pid, fr->report) < 0)
-		return 1;
-
-	/* Avoid the delivery of the destroy event in death_by_timeout(). */
-	set_bit(IPS_DYING_BIT, &i->status);
-	return 1;
-}
-
-static int kill_all(struct nf_conn *i, void *data)
-{
 	return 1;
 }
 
@@ -1289,11 +1270,7 @@ EXPORT_SYMBOL_GPL(nf_ct_free_hashtable);
 
 void nf_conntrack_flush_report(struct net *net, u32 pid, int report)
 {
-	struct __nf_ct_flush_report fr = {
-		.pid 	= pid,
-		.report = report,
-	};
-	nf_ct_iterate_cleanup(net, kill_report, &fr);
+	nf_ct_iterate_cleanup(net, kill_all, NULL, pid, report);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_flush_report);
 
@@ -1337,7 +1314,7 @@ static void nf_conntrack_cleanup_init_net(void)
 static void nf_conntrack_cleanup_net(struct net *net)
 {
  i_see_dead_people:
-	nf_ct_iterate_cleanup(net, kill_all, NULL);
+	nf_ct_iterate_cleanup(net, kill_all, NULL, 0, 0);
 	nf_ct_release_dying_list(net);
 	if (atomic_read(&net->ct.count) != 0) {
 		schedule();
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index da4fc37..9d01635 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -983,21 +983,9 @@ ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
 		}
 	}
 
-	if (del_timer(&ct->timeout)) {
-		if (nf_conntrack_event_report(IPCT_DESTROY, ct,
-					      NETLINK_CB(skb).pid,
-					      nlmsg_report(nlh)) < 0) {
-			nf_ct_delete_from_lists(ct);
-			/* we failed to report the event, try later */
-			nf_ct_insert_dying_list(ct);
-			nf_ct_put(ct);
-			return 0;
-		}
-		/* death_by_timeout would report the event again */
-		set_bit(IPS_DYING_BIT, &ct->status);
-		nf_ct_delete_from_lists(ct);
-		nf_ct_put(ct);
-	}
+	if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout))
+		nf_ct_delete(ct, NETLINK_CB(skb).pid, nlmsg_report(nlh));
+
 	nf_ct_put(ct);
 
 	return 0;
diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
index 51e928d..5e0d547 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -293,7 +293,7 @@ void nf_conntrack_l3proto_unregister(struct net *net,
 	nf_ct_l3proto_unregister_sysctl(net, proto);
 
 	/* Remove all contrack entries for this protocol */
-	nf_ct_iterate_cleanup(net, kill_l3proto, proto);
+	nf_ct_iterate_cleanup(net, kill_l3proto, proto, 0, 0);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_l3proto_unregister);
 
@@ -499,7 +499,7 @@ void nf_conntrack_l4proto_unregister(struct net *net,
 	nf_ct_l4proto_unregister_sysctl(net, pn, l4proto);
 
 	/* Remove all contrack entries for this protocol */
-	nf_ct_iterate_cleanup(net, kill_l4proto, l4proto);
+	nf_ct_iterate_cleanup(net, kill_l4proto, l4proto, 0, 0);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_l4proto_unregister);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-28 10:52 ` Pablo Neira Ayuso
@ 2012-08-28 17:16   ` Oliver
  2012-08-28 23:10     ` Oliver
  0 siblings, 1 reply; 16+ messages in thread
From: Oliver @ 2012-08-28 17:16 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1063 bytes --]

Hi Pablo,

On Tuesday 28 August 2012 12:52:35 you wrote:
> It seems we're hitting death_by_event twice, which should not happen.
> 
> Would you give a try to the following patch?
> 
> Thanks.

I've tried applying the patch against 3.4.10 and found that it doesn't work 
due to having rewritten nf_ct_iterate_cleanup() to take two additional 
arguments and the nf_conntrack_proto.c in your source tree being divergent 
from anything I could find.

So, I've taken a different approach; nf_ct_iterate_cleanup() is a wrapper for 
nf_ct_iterate_cleanup_new() that simply passes 0 for the last two args so as 
to save having to edit every module.

Please take a look at the attached patch and let me know your thoughts; I'd 
Ideally like to have this fix in 3.4 since that's long-term stable.

During testing I found that the kernel is indeed solid and does not panic; 
however, I managed to make conntrackd eat 100% of a CPU core on one of the 
pair and conntrack entries remained unevicted from the kernel until I killed 
the conntrackd process.

Kind Regards,
Oliver

[-- Attachment #2: 0002-netfilter-nf_conntrack-fix-race-in-timer-handling-wi-in-kernel-3.4.patch --]
[-- Type: text/x-patch, Size: 8338 bytes --]

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index ab86036..8f92f77 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -210,8 +210,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
 		    const struct nf_conntrack_tuple *tuple);
 
 extern int nf_conntrack_hash_check_insert(struct nf_conn *ct);
-extern void nf_ct_delete_from_lists(struct nf_conn *ct);
-extern void nf_ct_insert_dying_list(struct nf_conn *ct);
+bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report);
 
 extern void nf_conntrack_flush_report(struct net *net, u32 pid, int report);
 
@@ -278,7 +277,8 @@ extern void nf_ct_untracked_status_or(unsigned long bits);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted. */
 extern void
-nf_ct_iterate_cleanup(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data);
+nf_ct_iterate_cleanup_new(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data, u32 pid, int report);
+extern void nf_ct_iterate_cleanup(struct net *net, int (*iter)(struct nf_conn *i, void *data), void *data);
 extern void nf_conntrack_free(struct nf_conn *ct);
 extern struct nf_conn *
 nf_conntrack_alloc(struct net *net, u16 zone,
diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index a88fb69..2f7b0ac 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -108,7 +108,7 @@ nf_conntrack_eventmask_report(unsigned int eventmask,
 	if (e == NULL)
 		goto out_unlock;
 
-	if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) {
+	if (nf_ct_is_confirmed(ct)) {
 		struct nf_ct_event item = {
 			.ct 	= ct,
 			.pid	= e->pid ? e->pid : pid,
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 729f157..bdc9473 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -231,7 +231,7 @@ destroy_conntrack(struct nf_conntrack *nfct)
 	nf_conntrack_free(ct);
 }
 
-void nf_ct_delete_from_lists(struct nf_conn *ct)
+static void nf_ct_delete_from_lists(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 
@@ -243,7 +243,6 @@ void nf_ct_delete_from_lists(struct nf_conn *ct)
 	clean_from_lists(ct);
 	spin_unlock_bh(&nf_conntrack_lock);
 }
-EXPORT_SYMBOL_GPL(nf_ct_delete_from_lists);
 
 static void death_by_event(unsigned long ul_conntrack)
 {
@@ -257,15 +256,13 @@ static void death_by_event(unsigned long ul_conntrack)
 		add_timer(&ct->timeout);
 		return;
 	}
-	/* we've got the event delivered, now it's dying */
-	set_bit(IPS_DYING_BIT, &ct->status);
 	spin_lock(&nf_conntrack_lock);
 	hlist_nulls_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 	spin_unlock(&nf_conntrack_lock);
 	nf_ct_put(ct);
 }
 
-void nf_ct_insert_dying_list(struct nf_conn *ct)
+static void nf_ct_insert_dying_list(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 
@@ -280,27 +277,32 @@ void nf_ct_insert_dying_list(struct nf_conn *ct)
 		(random32() % net->ct.sysctl_events_retry_timeout);
 	add_timer(&ct->timeout);
 }
-EXPORT_SYMBOL_GPL(nf_ct_insert_dying_list);
 
-static void death_by_timeout(unsigned long ul_conntrack)
+bool nf_ct_delete(struct nf_conn *ct, u32 pid, int report)
 {
-	struct nf_conn *ct = (void *)ul_conntrack;
 	struct nf_conn_tstamp *tstamp;
 
 	tstamp = nf_conn_tstamp_find(ct);
 	if (tstamp && tstamp->stop == 0)
 		tstamp->stop = ktime_to_ns(ktime_get_real());
 
-	if (!test_bit(IPS_DYING_BIT, &ct->status) &&
-	    unlikely(nf_conntrack_event(IPCT_DESTROY, ct) < 0)) {
+	if (!test_and_set_bit(IPS_DYING_BIT, &ct->status) &&
+	    unlikely(nf_conntrack_event_report(IPCT_DESTROY, ct, pid,
+								report) < 0)) {
 		/* destroy event was not delivered */
 		nf_ct_delete_from_lists(ct);
 		nf_ct_insert_dying_list(ct);
-		return;
+		return false;
 	}
-	set_bit(IPS_DYING_BIT, &ct->status);
 	nf_ct_delete_from_lists(ct);
 	nf_ct_put(ct);
+	return true;
+}
+EXPORT_SYMBOL_GPL(nf_ct_delete);
+
+static void death_by_timeout(unsigned long ul_conntrack)
+{
+	nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
 }
 
 /*
@@ -634,11 +636,9 @@ static noinline int early_drop(struct net *net, unsigned int hash)
 	if (!ct)
 		return dropped;
 
-	if (del_timer(&ct->timeout)) {
-		death_by_timeout((unsigned long)ct);
-		/* Check if we indeed killed this entry. Reliable event
-		   delivery may have inserted it into the dying list. */
-		if (test_bit(IPS_DYING_BIT, &ct->status)) {
+	if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout)) {
+		/* Check if we indeed killed this entry */
+		if (nf_ct_delete(ct, 0, 0)) {
 			dropped = 1;
 			NF_CT_STAT_INC_ATOMIC(net, early_drop);
 		}
@@ -1124,8 +1124,8 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 		}
 	}
 
-	if (del_timer(&ct->timeout)) {
-		ct->timeout.function((unsigned long)ct);
+	if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout)) {
+		nf_ct_delete(ct, 0, 0);
 		return true;
 	}
 	return false;
@@ -1225,8 +1225,7 @@ get_next_corpse(struct net *net, int (*iter)(struct nf_conn *i, void *data),
 	}
 	hlist_nulls_for_each_entry(h, n, &net->ct.unconfirmed, hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
-		if (iter(ct, data))
-			set_bit(IPS_DYING_BIT, &ct->status);
+		iter(ct, data);
 	}
 	spin_unlock_bh(&nf_conntrack_lock);
 	return NULL;
@@ -1236,50 +1235,40 @@ found:
 	return ct;
 }
 
-void nf_ct_iterate_cleanup(struct net *net,
+void nf_ct_iterate_cleanup_new(struct net *net,
 			   int (*iter)(struct nf_conn *i, void *data),
-			   void *data)
+			   void *data, u32 pid, int report)
 {
 	struct nf_conn *ct;
 	unsigned int bucket = 0;
 
 	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
 		/* Time to push up daises... */
-		if (del_timer(&ct->timeout))
-			death_by_timeout((unsigned long)ct);
+		if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout))
+			nf_ct_delete(ct, pid, report);
 		/* ... else the timer will get him soon. */
 
 		nf_ct_put(ct);
 	}
 }
-EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup);
+EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup_new);
 
-struct __nf_ct_flush_report {
-	u32 pid;
-	int report;
-};
+void nf_ct_iterate_cleanup(struct net *net,
+			   int (*iter)(struct nf_conn *i, void *data),
+			   void *data)
+{
+	nf_ct_iterate_cleanup_new(net, iter, data, 0, 0);
+}
+EXPORT_SYMBOL_GPL(nf_ct_iterate_cleanup);
 
-static int kill_report(struct nf_conn *i, void *data)
+static int kill_all(struct nf_conn *i, void *data)
 {
-	struct __nf_ct_flush_report *fr = (struct __nf_ct_flush_report *)data;
 	struct nf_conn_tstamp *tstamp;
 
 	tstamp = nf_conn_tstamp_find(i);
 	if (tstamp && tstamp->stop == 0)
 		tstamp->stop = ktime_to_ns(ktime_get_real());
 
-	/* If we fail to deliver the event, death_by_timeout() will retry */
-	if (nf_conntrack_event_report(IPCT_DESTROY, i,
-				      fr->pid, fr->report) < 0)
-		return 1;
-
-	/* Avoid the delivery of the destroy event in death_by_timeout(). */
-	set_bit(IPS_DYING_BIT, &i->status);
-	return 1;
-}
-
-static int kill_all(struct nf_conn *i, void *data)
-{
 	return 1;
 }
 
@@ -1295,11 +1284,7 @@ EXPORT_SYMBOL_GPL(nf_ct_free_hashtable);
 
 void nf_conntrack_flush_report(struct net *net, u32 pid, int report)
 {
-	struct __nf_ct_flush_report fr = {
-		.pid 	= pid,
-		.report = report,
-	};
-	nf_ct_iterate_cleanup(net, kill_report, &fr);
+	nf_ct_iterate_cleanup_new(net, kill_all, NULL, pid, report);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_flush_report);
 
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index ca7e835..2b0c9c1 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -967,21 +967,9 @@ ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
 		}
 	}
 
-	if (del_timer(&ct->timeout)) {
-		if (nf_conntrack_event_report(IPCT_DESTROY, ct,
-					      NETLINK_CB(skb).pid,
-					      nlmsg_report(nlh)) < 0) {
-			nf_ct_delete_from_lists(ct);
-			/* we failed to report the event, try later */
-			nf_ct_insert_dying_list(ct);
-			nf_ct_put(ct);
-			return 0;
-		}
-		/* death_by_timeout would report the event again */
-		set_bit(IPS_DYING_BIT, &ct->status);
-		nf_ct_delete_from_lists(ct);
-		nf_ct_put(ct);
-	}
+	if (!nf_ct_is_dying(ct) && del_timer(&ct->timeout))
+		nf_ct_delete(ct, NETLINK_CB(skb).pid, nlmsg_report(nlh));
+
 	nf_ct_put(ct);
 
 	return 0;

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-28 17:16   ` Oliver
@ 2012-08-28 23:10     ` Oliver
  2012-08-30  0:52       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 16+ messages in thread
From: Oliver @ 2012-08-28 23:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 847 bytes --]

On Tuesday 28 August 2012 19:16:39 Oliver wrote:
> During testing I found that the kernel is indeed solid and does not panic;
> however, I managed to make conntrackd eat 100% of a CPU core on one of the
> pair and conntrack entries remained unevicted from the kernel until I killed
> the conntrackd process.

having conntrackd running while the conntrack table is full is causing a GPF - 
I have attached a dmesg output of the kernel panic resulting from a general 
protection fault.

The first GPF is from the kernel patched with the code provided in my previous 
e-mail (the one for v3.4.10 based on the patch you provided me)

the second is with my only my original patch (the one-liner that checks the 
dying bit in death_by_event) - although that's likely not relevant here since 
that function is not part of the stack.

Kind Regards,
Oliver

[-- Attachment #2: fw02-lab-general_protection_fault.txt --]
[-- Type: text/plain, Size: 16938 bytes --]

Aug 28 15:18:06 fw02-lab [  352.508058] testing netconsole, hello!
Aug 28 15:19:59 fw02-lab [  465.619240] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619258] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619272] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619274] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619286] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619288] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619298] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619314] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619338] nf_conntrack: table full, dropping packet.
Aug 28 15:19:59 fw02-lab [  465.619349] nf_conntrack: table full, dropping packet.
Aug 28 15:20:00 fw02-lab [  467.077056] general protection fault: 0000 [#1] 
Aug 28 15:20:00 fw02-lab SMP 
Aug 28 15:20:00 fw02-lab 
Aug 28 15:20:00 fw02-lab [  467.077205] CPU 0 
Aug 28 15:20:00 fw02-lab 
Aug 28 15:20:00 fw02-lab [  467.077254] Modules linked in:
Aug 28 15:20:00 fw02-lab xt_set
Aug 28 15:20:00 fw02-lab ip_set_hash_ipportnet
Aug 28 15:20:00 fw02-lab ip_set_hash_net
Aug 28 15:20:00 fw02-lab ip_set
Aug 28 15:20:00 fw02-lab igb
Aug 28 15:20:00 fw02-lab bnx2
Aug 28 15:20:00 fw02-lab iTCO_wdt
Aug 28 15:20:00 fw02-lab dca
Aug 28 15:20:00 fw02-lab 
Aug 28 15:20:00 fw02-lab [  467.077769] 
Aug 28 15:20:00 fw02-lab [  467.077818] Pid: 2853, comm: conntrackd Not tainted 3.4.11-zerolag+ #1
Aug 28 15:20:00 fw02-lab Dell Inc. PowerEdge R310
Aug 28 15:20:00 fw02-lab /05XKKK
Aug 28 15:20:00 fw02-lab 
Aug 28 15:20:00 fw02-lab [  467.078017] RIP: 0010:[<ffffffff8152d6c7>] 
Aug 28 15:20:00 fw02-lab [<ffffffff8152d6c7>] nf_ct_delete_from_lists+0x57/0x90
Aug 28 15:20:00 fw02-lab [  467.078184] RSP: 0018:ffff88012650f858  EFLAGS: 00010246
Aug 28 15:20:00 fw02-lab [  467.078292] RAX: ffff880132737c28 RBX: ffff880130c4ca20 RCX: ffff88013229d708
Aug 28 15:20:00 fw02-lab [  467.078404] RDX: dead000000200200 RSI: dead000000200200 RDI: ffffffff81bcf580
Aug 28 15:20:00 fw02-lab [  467.078516] RBP: ffffffff81bcd080 R08: 0000000000000170 R09: 0000000000000001
Aug 28 15:20:00 fw02-lab [  467.078630] R10: 0000000000000002 R11: 000000009b0cb868 R12: 0000000000000000
Aug 28 15:20:00 fw02-lab [  467.078746] R13: ffffffff81bcd080 R14: 0000000000000000 R15: 00000000fffffffe
Aug 28 15:20:00 fw02-lab [  467.078864] FS:  00007f8bc707f700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
Aug 28 15:20:00 fw02-lab [  467.079037] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 15:20:00 fw02-lab [  467.079148] CR2: 00007f8bc72e6ac0 CR3: 0000000126bc3000 CR4: 00000000000007f0
Aug 28 15:20:00 fw02-lab [  467.079260] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 28 15:20:00 fw02-lab [  467.079371] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 28 15:20:00 fw02-lab [  467.079484] Process conntrackd (pid: 2853, threadinfo ffff88012650e000, task ffff8801382eb400)
Aug 28 15:20:00 fw02-lab [  467.079652] Stack:
Aug 28 15:20:00 fw02-lab [  467.079754]  0000000000000286 ffff880130c4ca20 0000000000000000 ffffffff8152e188
Aug 28 15:20:00 fw02-lab [  467.080103]  0000000000000000 0000000000000000 0000000000000286 ffffffff81054901
Aug 28 15:20:00 fw02-lab [  467.080591]  ffff880130c4ca20 ffffffff81bcd080 0000000000000000 ffffffff81bcd080
Aug 28 15:20:00 fw02-lab [  467.080938] Call Trace:
Aug 28 15:20:00 fw02-lab [  467.081044]  [<ffffffff8152e188>] ? nf_ct_delete+0xe8/0x280
Aug 28 15:20:00 fw02-lab [  467.081155]  [<ffffffff81054901>] ? del_timer+0x61/0x80
Aug 28 15:20:00 fw02-lab [  467.081265]  [<ffffffff8152e684>] ? early_drop+0xf4/0x160
Aug 28 15:20:00 fw02-lab [  467.081376]  [<ffffffff8152ed8f>] ? __nf_conntrack_alloc+0x24f/0x290
Aug 28 15:20:00 fw02-lab [  467.081490]  [<ffffffff815374db>] ? ctnetlink_create_conntrack+0x4b/0x440
Aug 28 15:20:00 fw02-lab [  467.081602]  [<ffffffff815353f7>] ? ctnetlink_parse_tuple+0x147/0x170
Aug 28 15:20:00 fw02-lab [  467.081712]  [<ffffffff8152dadd>] ? ____nf_conntrack_find+0x10d/0x120
Aug 28 15:20:00 fw02-lab [  467.081823]  [<ffffffff8152db39>] ? __nf_conntrack_find_get+0x49/0x170
Aug 28 15:20:00 fw02-lab [  467.081935]  [<ffffffff81538c29>] ? ctnetlink_new_conntrack+0x299/0x3e0
Aug 28 15:20:00 fw02-lab [  467.082047]  [<ffffffff812811b0>] ? nla_parse+0x80/0xd0
Aug 28 15:20:00 fw02-lab [  467.082158]  [<ffffffff8152cbee>] ? nfnetlink_rcv_msg+0x1ee/0x220
Aug 28 15:20:00 fw02-lab [  467.082270]  [<ffffffff8152ca2a>] ? nfnetlink_rcv_msg+0x2a/0x220
Aug 28 15:20:00 fw02-lab [  467.082381]  [<ffffffff8152ca00>] ? nfnl_lock+0x20/0x20
Aug 28 15:20:00 fw02-lab [  467.082491]  [<ffffffff81529639>] ? netlink_rcv_skb+0x99/0xc0
Aug 28 15:20:00 fw02-lab [  467.082602]  [<ffffffff81528fcf>] ? netlink_unicast+0x1af/0x200
Aug 28 15:20:00 fw02-lab [  467.082712]  [<ffffffff81529258>] ? netlink_sendmsg+0x238/0x350
Aug 28 15:20:00 fw02-lab [  467.082824]  [<ffffffff814e5204>] ? sock_sendmsg+0xe4/0x130
Aug 28 15:20:00 fw02-lab [  467.082933]  [<ffffffff814e506d>] ? sock_recvmsg+0xed/0x140
Aug 28 15:20:00 fw02-lab [  467.083044]  [<ffffffff8118001e>] ? ext4_file_write+0x6e/0x2a0
Aug 28 15:20:00 fw02-lab [  467.083155]  [<ffffffff8111a10c>] ? do_path_lookup+0x2c/0xd0
Aug 28 15:20:00 fw02-lab [  467.083264]  [<ffffffff811171ef>] ? getname_flags+0x6f/0x270
Aug 28 15:20:00 fw02-lab [  467.083374]  [<ffffffff8110c447>] ? do_sync_write+0xc7/0x100
Aug 28 15:20:00 fw02-lab [  467.083482]  [<ffffffff814e43f2>] ? sockfd_lookup_light+0x22/0x90
Aug 28 15:20:00 fw02-lab [  467.083592]  [<ffffffff814e710c>] ? sys_sendto+0x13c/0x1a0
Aug 28 15:20:00 fw02-lab [  467.083702]  [<ffffffff81687862>] ? system_call_fastpath+0x16/0x1b
Aug 28 15:20:00 fw02-lab [  467.083810] Code: 48 8b 43 08 48 8b 53 10 a8 01 48 89 02 75 04 48 89 50 08 48 8b 43 40 48 be 00 02 20 00 00 00 ad de 48 8b 53 48 48 89 73 10 a8 01 89 02 75 04 48 89 50 08 48 89 df 48 b9 00 02 20 00 00 00 ad
Aug 28 15:20:00 fw02-lab [  467.086984] RIP  [<ffffffff8152d6c7>] nf_ct_delete_from_lists+0x57/0x90
Aug 28 15:20:00 fw02-lab [  467.087142]  RSP <ffff88012650f858>
Aug 28 15:20:00 fw02-lab [  467.087263] ---[ end trace 71ca302640a71c1f ]---
Aug 28 15:20:00 fw02-lab [  467.087525] Kernel panic - not syncing: Fatal exception in interrupt
Aug 28 15:20:00 fw02-lab [  467.087642] Rebooting in 5 seconds..


<SNIP>


Aug 28 15:50:00 fw02-lab [  220.903447] testing netconsole again, hello!
Aug 28 15:51:28 fw02-lab [  308.589992] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590036] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590061] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590082] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590131] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590139] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590201] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590239] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590352] nf_conntrack: table full, dropping packet.
Aug 28 15:51:28 fw02-lab [  308.590442] nf_conntrack: table full, dropping packet.
Aug 28 15:51:29 fw02-lab [  310.128590] general protection fault: 0000 [#1] SMP 
Aug 28 15:51:29 fw02-lab [  310.128741] CPU 0 
Aug 28 15:51:29 fw02-lab [  310.128791] Modules linked in: xt_set ip_set_hash_ipportnet ip_set_hash_net ip_set igb dca bnx2 iTCO_wdt
Aug 28 15:51:29 fw02-lab [  310.129361] Pid: 2859, comm: conntrackd Not tainted 3.4.10-zerolag+ #2 Dell Inc. PowerEdge R310/05XKKK
Aug 28 15:51:29 fw02-lab [  310.129561] RIP: 0010:[<ffffffff8152d7d7>]  [<ffffffff8152d7d7>] nf_ct_delete_from_lists+0x57/0x90
Aug 28 15:51:29 fw02-lab [  310.129726] RSP: 0018:ffff880132391858  EFLAGS: 00010202
Aug 28 15:51:29 fw02-lab [  310.129837] RAX: 000000000019a05b RBX: ffff880132b35b00 RCX: 0000000000245675
Aug 28 15:51:29 fw02-lab [  310.129951] RDX: dead000000200200 RSI: dead000000200200 RDI: ffffffff81bcf580
Aug 28 15:51:29 fw02-lab [  310.130064] RBP: ffffffff81bcd080 R08: 0000000000013c70 R09: ffffffff81527493
Aug 28 15:51:29 fw02-lab [  310.130180] R10: ffffffff814ed911 R11: 0000000000000004 R12: ffff880132aefa98
Aug 28 15:51:29 fw02-lab [  310.130293] R13: 0000000000000000 R14: ffff880132b35b04 R15: 00000000fffffffe
Aug 28 15:51:29 fw02-lab [  310.130408] FS:  00007fb285c74700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
Aug 28 15:51:29 fw02-lab [  310.130579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 28 15:51:29 fw02-lab [  310.130692] CR2: 00007fb2852fe3a0 CR3: 0000000134c54000 CR4: 00000000000007f0
Aug 28 15:51:29 fw02-lab [  310.130804] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 28 15:51:29 fw02-lab [  310.130917] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 28 15:51:29 fw02-lab [  310.131029] Process conntrackd (pid: 2859, threadinfo ffff880132390000, task ffff880139b65480)
Aug 28 15:51:29 fw02-lab [  310.131195] Stack:
Aug 28 15:51:29 fw02-lab [  310.131298]  0000000000000000 ffff880132b35b00 ffff880132b35b78 ffffffff8152e530
Aug 28 15:51:29 fw02-lab [  310.131649]  0000000000000000 0000000000000001 ffff880132b35b00 0000000000000b2a
Aug 28 15:51:29 fw02-lab [  310.131994]  ffffffff81bcd080 ffff880132b35b00 0000000000000000 ffffffff81bcd080
Aug 28 15:51:29 fw02-lab [  310.132343] Call Trace:
Aug 28 15:51:29 fw02-lab [  310.132449]  [<ffffffff8152e530>] ? death_by_timeout+0x1c0/0x1e0
Aug 28 15:51:29 fw02-lab [  310.132560]  [<ffffffff8152e7d8>] ? early_drop+0xe8/0x160
Aug 28 15:51:29 fw02-lab [  310.132670]  [<ffffffff8152eecf>] ? __nf_conntrack_alloc+0x24f/0x290
Aug 28 15:51:29 fw02-lab [  310.132783]  [<ffffffff8153762b>] ? ctnetlink_create_conntrack+0x4b/0x440
Aug 28 15:51:29 fw02-lab [  310.132894]  [<ffffffff81535547>] ? ctnetlink_parse_tuple+0x147/0x170
Aug 28 15:51:29 fw02-lab [  310.133005]  [<ffffffff8152dd2d>] ? ____nf_conntrack_find+0x10d/0x120
Aug 28 15:51:29 fw02-lab [  310.133115]  [<ffffffff8152dd89>] ? __nf_conntrack_find_get+0x49/0x170
Aug 28 15:51:29 fw02-lab [  310.133227]  [<ffffffff81538ef1>] ? ctnetlink_new_conntrack+0x2a1/0x3f0
Aug 28 15:51:29 fw02-lab [  310.133339]  [<ffffffff812811b0>] ? nla_parse+0x80/0xd0
Aug 28 15:51:29 fw02-lab [  310.133564]  [<ffffffff8152cbee>] ? nfnetlink_rcv_msg+0x1ee/0x220
Aug 28 15:51:29 fw02-lab [  310.133674]  [<ffffffff8152ca2a>] ? nfnetlink_rcv_msg+0x2a/0x220
Aug 28 15:51:29 fw02-lab [  310.133784]  [<ffffffff8152ca00>] ? nfnl_lock+0x20/0x20
Aug 28 15:51:29 fw02-lab [  310.133892]  [<ffffffff81529639>] ? netlink_rcv_skb+0x99/0xc0
Aug 28 15:51:29 fw02-lab [  310.134000]  [<ffffffff81528fcf>] ? netlink_unicast+0x1af/0x200
Aug 28 15:51:29 fw02-lab [  310.134110]  [<ffffffff81529258>] ? netlink_sendmsg+0x238/0x350
Aug 28 15:51:29 fw02-lab [  310.134221]  [<ffffffff814e5204>] ? sock_sendmsg+0xe4/0x130
Aug 28 15:51:29 fw02-lab [  310.134329]  [<ffffffff814e506d>] ? sock_recvmsg+0xed/0x140
Aug 28 15:51:29 fw02-lab [  310.134439]  [<ffffffff8118001e>] ? ext4_file_write+0x6e/0x2a0
Aug 28 15:51:29 fw02-lab [  310.134551]  [<ffffffff8111a10c>] ? do_path_lookup+0x2c/0xd0
Aug 28 15:51:29 fw02-lab [  310.134659]  [<ffffffff811171ef>] ? getname_flags+0x6f/0x270
Aug 28 15:51:29 fw02-lab [  310.134771]  [<ffffffff8110c447>] ? do_sync_write+0xc7/0x100
Aug 28 15:51:29 fw02-lab [  310.134880]  [<ffffffff814e43f2>] ? sockfd_lookup_light+0x22/0x90
Aug 28 15:51:29 fw02-lab [  310.134990]  [<ffffffff814e710c>] ? sys_sendto+0x13c/0x1a0
Aug 28 15:51:29 fw02-lab [  310.135102]  [<ffffffff81687b62>] ? system_call_fastpath+0x16/0x1b
Aug 28 15:51:29 fw02-lab [  310.135214] Code: 48 8b 43 08 48 8b 53 10 a8 01 48 89 02 75 04 48 89 50 08 48 8b 43 40 48 be 00 02 20 00 00 00 ad de 48 8b 53 48 48 89 73 10 a8 01 89 02 75 04 48 89 50 08 48 89 df 48 b9 00 02 20 00 00 00 ad
Aug 28 15:51:29 fw02-lab [  310.138394] RIP  [<ffffffff8152d7d7>] nf_ct_delete_from_lists+0x57/0x90
Aug 28 15:51:29 fw02-lab [  310.138551]  RSP <ffff880132391858>
Aug 28 15:51:29 fw02-lab [  310.138671] ---[ end trace d353ce02b91230e2 ]---
Aug 28 15:51:29 fw02-lab [  310.138784] Kernel panic - not syncing: Fatal exception in interrupt
Aug 28 15:51:29 fw02-lab [  310.138904] Rebooting in 5 seconds..

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-28 23:10     ` Oliver
@ 2012-08-30  0:52       ` Pablo Neira Ayuso
  2012-08-30  2:05         ` Oliver
  0 siblings, 1 reply; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-30  0:52 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 1221 bytes --]

On Wed, Aug 29, 2012 at 01:10:29AM +0200, Oliver wrote:
> On Tuesday 28 August 2012 19:16:39 Oliver wrote:
> > During testing I found that the kernel is indeed solid and does not panic;
> > however, I managed to make conntrackd eat 100% of a CPU core on one of the
> > pair and conntrack entries remained unevicted from the kernel until I killed
> > the conntrackd process.
> 
> Having conntrackd running while the conntrack table is full causes a GPF - 
> I have attached a dmesg output of the kernel panic resulting from the 
> general protection fault.
> 
> The first GPF is from the kernel patched with the code provided in my previous 
> e-mail (the one for v3.4.10 based on the patch you provided me)
> 
> the second is with only my original patch (the one-liner that checks the 
> dying bit in death_by_event) - although that's likely not relevant here, 
> since that function is not part of the stack.

The problem seems to be the re-use of the conntrack timer. Races may
happen on entries that were just inserted into the dying list while
packets / ctnetlink still hold a reference to them.

Would you give this patch a try? Please remove the previous one I
sent.

Let me know if you hit more crashes. Thanks.

[-- Attachment #2: 0001-netfilter-nf_conntrack-fix-racy-timer-handling-with-.patch --]
[-- Type: text/x-diff, Size: 3251 bytes --]

>From 413243208640a973ba48111eaec143efa3ca7a31 Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Thu, 30 Aug 2012 02:02:51 +0200
Subject: [PATCH] netfilter: nf_conntrack: fix racy timer handling with
 reliable events

Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets may hold references
to conntrack entries that were just inserted to such list.

This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use support for variable size
extensions to allocate this area on demand.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_ecache.h |    1 +
 net/netfilter/nf_conntrack_core.c           |   16 +++++++++++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index e1ce104..4a045cd 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -18,6 +18,7 @@ struct nf_conntrack_ecache {
 	u16 ctmask;		/* bitmask of ct events to be delivered */
 	u16 expmask;		/* bitmask of expect events to be delivered */
 	u32 pid;		/* netlink pid of destroyer */
+	struct timer_list timeout;
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index fc61f40..2d8f91a 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -249,12 +249,15 @@ static void death_by_event(unsigned long ul_conntrack)
 {
 	struct nf_conn *ct = (void *)ul_conntrack;
 	struct net *net = nf_ct_net(ct);
+	struct nf_conn_ecache *ecache = nf_ct_ecache_find(ct);
+
+	BUG_ON(ecache == NULL);
 
 	if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
 		/* bad luck, let's retry again */
-		ct->timeout.expires = jiffies +
+		ecache->timeout.expires = jiffies +
 			(random32() % net->ct.sysctl_events_retry_timeout);
-		add_timer(&ct->timeout);
+		add_timer(&ecache->timeout);
 		return;
 	}
 	/* we've got the event delivered, now it's dying */
@@ -268,6 +271,9 @@ static void death_by_event(unsigned long ul_conntrack)
 void nf_ct_insert_dying_list(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
+	struct nf_conn_ecache *ecache = nf_ct_ecache_find(ct);
+
+	BUG_ON(ecache == NULL);
 
 	/* add this conntrack to the dying list */
 	spin_lock_bh(&nf_conntrack_lock);
@@ -275,10 +281,10 @@ void nf_ct_insert_dying_list(struct nf_conn *ct)
 			     &net->ct.dying);
 	spin_unlock_bh(&nf_conntrack_lock);
 	/* set a new timer to retry event delivery */
-	setup_timer(&ct->timeout, death_by_event, (unsigned long)ct);
-	ct->timeout.expires = jiffies +
+	setup_timer(&ecache->timeout, death_by_event, (unsigned long)ct);
+	ecache->timeout.expires = jiffies +
 		(random32() % net->ct.sysctl_events_retry_timeout);
-	add_timer(&ct->timeout);
+	add_timer(&ecache->timeout);
 }
 EXPORT_SYMBOL_GPL(nf_ct_insert_dying_list);
 
-- 
1.7.10.4



* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30  0:52       ` Pablo Neira Ayuso
@ 2012-08-30  2:05         ` Oliver
  2012-08-30  2:25           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 16+ messages in thread
From: Oliver @ 2012-08-30  2:05 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Thursday 30 August 2012 02:52:36 you wrote:
> The problem seems to be the re-use of the conntrack timer. Races may
> happen in entries that were just inserted in the dying list while
> packets / ctnetlink still hold a reference to them.
> 
> Would you give a try to this patch? Please, remove the previous I
> sent.
> 
> Let me know if you hit more crashes. Thanks.

Hi Pablo,

I successfully applied the patch against 3.4.10, but it fails to compile, 
likely because there is no definition for the ecache struct.

the specific error from gcc is:

net/netfilter/nf_conntrack_core.c:258:9: error: dereferencing pointer to incomplete type

(plus all the other relevant lines using the ecache ptr)

Kind Regards,
Oliver


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30  2:05         ` Oliver
@ 2012-08-30  2:25           ` Pablo Neira Ayuso
  0 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-30  2:25 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

[-- Attachment #1: Type: text/plain, Size: 897 bytes --]

On Thu, Aug 30, 2012 at 04:05:35AM +0200, Oliver wrote:
> On Thursday 30 August 2012 02:52:36 you wrote:
> > The problem seems to be the re-use of the conntrack timer. Races may
> > happen in entries that were just inserted in the dying list while
> > packets / ctnetlink still hold a reference to them.
> > 
> > Would you give a try to this patch? Please, remove the previous I
> > sent.
> > 
> > Let me know if you hit more crashes. Thanks.
> 
> Hi Pablo,
> 
> I successfully appied the patch against 3.4.10 but it is failing to compile, 
> likely because of a lack of definition for the ecache struct.
> 
> the specific error from gcc is:
> 
> net/netfilter/nf_conntrack_core.c:258:9: error: dereferencing pointer to 
> incomplete type

Sorry, I sent you the wrong patch.

%s/nf_conn_ecache/nf_conntrack_ecache/g
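The substitution above is a global rename; applied from the shell against the patched file it might look like this (demonstrated on a scratch file rather than the real source tree):

```shell
# Make a scratch file containing the misspelled identifier.
workdir=$(mktemp -d)
printf 'struct nf_conn_ecache *ecache = nf_ct_ecache_find(ct);\n' \
    > "$workdir/nf_conntrack_core.c"

# Apply the rename everywhere, as in the %s/.../.../g command above.
sed -i 's/nf_conn_ecache/nf_conntrack_ecache/g' "$workdir/nf_conntrack_core.c"

# The correct type name is now used.
grep -c 'nf_conntrack_ecache' "$workdir/nf_conntrack_core.c"   # prints 1
```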

> (plus all the other relevant lines using the ecache ptr)

New patch attached.

[-- Attachment #2: 0001-netfilter-nf_conntrack-fix-racy-timer-handling-with-.patch --]
[-- Type: text/x-diff, Size: 3261 bytes --]

>From 6ca201a35140a2ff32669267bdbbff7eb01183f2 Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Thu, 30 Aug 2012 02:02:51 +0200
Subject: [PATCH] netfilter: nf_conntrack: fix racy timer handling with
 reliable events

Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets may hold references
to conntrack entries that were just inserted to such list.

This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use support for variable size
extensions to allocate this area on demand.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack_ecache.h |    1 +
 net/netfilter/nf_conntrack_core.c           |   16 +++++++++++-----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index e1ce104..4a045cd 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -18,6 +18,7 @@ struct nf_conntrack_ecache {
 	u16 ctmask;		/* bitmask of ct events to be delivered */
 	u16 expmask;		/* bitmask of expect events to be delivered */
 	u32 pid;		/* netlink pid of destroyer */
+	struct timer_list timeout;
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f83e79d..79dabe8 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -255,12 +255,15 @@ static void death_by_event(unsigned long ul_conntrack)
 {
 	struct nf_conn *ct = (void *)ul_conntrack;
 	struct net *net = nf_ct_net(ct);
+	struct nf_conntrack_ecache *ecache = nf_ct_ecache_find(ct);
+
+	BUG_ON(ecache == NULL);
 
 	if (nf_conntrack_event(IPCT_DESTROY, ct) < 0) {
 		/* bad luck, let's retry again */
-		ct->timeout.expires = jiffies +
+		ecache->timeout.expires = jiffies +
 			(random32() % net->ct.sysctl_events_retry_timeout);
-		add_timer(&ct->timeout);
+		add_timer(&ecache->timeout);
 		return;
 	}
 	/* we've got the event delivered, now it's dying */
@@ -274,6 +277,9 @@ static void death_by_event(unsigned long ul_conntrack)
 void nf_ct_insert_dying_list(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
+	struct nf_conntrack_ecache *ecache = nf_ct_ecache_find(ct);
+
+	BUG_ON(ecache == NULL);
 
 	/* add this conntrack to the dying list */
 	spin_lock_bh(&nf_conntrack_lock);
@@ -281,10 +287,10 @@ void nf_ct_insert_dying_list(struct nf_conn *ct)
 			     &net->ct.dying);
 	spin_unlock_bh(&nf_conntrack_lock);
 	/* set a new timer to retry event delivery */
-	setup_timer(&ct->timeout, death_by_event, (unsigned long)ct);
-	ct->timeout.expires = jiffies +
+	setup_timer(&ecache->timeout, death_by_event, (unsigned long)ct);
+	ecache->timeout.expires = jiffies +
 		(random32() % net->ct.sysctl_events_retry_timeout);
-	add_timer(&ct->timeout);
+	add_timer(&ecache->timeout);
 }
 EXPORT_SYMBOL_GPL(nf_ct_insert_dying_list);
 
-- 
1.7.10.4



* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
       [not found]   ` <20120830025009.GA16782@1984>
@ 2012-08-30  3:09     ` Oliver
  2012-08-30 10:34       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 16+ messages in thread
From: Oliver @ 2012-08-30  3:09 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Thursday 30 August 2012 04:50:09 you wrote:
> Not sure what you mean, you're still crashing with the patch below,
> right?
> 
> My proposal is to give a try to the ecache patch, that requires
> removing the previous patch.

Apologies for the confusion; the patch quoted is essentially the first patch 
you provided me, with my changes to make it work in 3.4.10, *minus* the hunk 
of your patch that removed the nf_ct_is_dying() check from 
nf_conntrack_ecache.h (i.e. I left that check in). With this modification, I 
find that conntrackd is well-behaved and I have thus far been unable to 
cause a kernel panic.

Having tested your latest patch, I can confirm that it also does not crash, 
even when the conntrack table is exhausted.

In terms of overall stability, I would presume your latest patch is superior 
to the previous one (i.e. what I attached most recently)?

Kind Regards,
Oliver


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30  3:09     ` Oliver
@ 2012-08-30 10:34       ` Pablo Neira Ayuso
  2012-08-30 12:28         ` Oliver
  0 siblings, 1 reply; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-30 10:34 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

On Thu, Aug 30, 2012 at 05:09:01AM +0200, Oliver wrote:
> On Thursday 30 August 2012 04:50:09 you wrote:
> > Not sure what you mean, you're still crashing with the patch below,
> > right?
> > 
> > My proposal is to give a try to the ecache patch, that requires
> > removing the previous patch.
> 
> Apologies for the confusion;  the patch quoted is essentially the first patch 
> you provided me, with my changes to make it work in 3.4.10 *plus* the deletion 
> of the change to nf_conntrack_ecache.h where your patch deleted the 
> nf_ct_is_dying() check (i.e I have this check left in) - with this 
> modification, I find that conntrackd is well-behaved and I have thus far not 
> successfully caused a kernel panic.
> 
> Having tested your latest patch, I can also confirm that it also does not 
> crash, including at exhaustion of the conntrack table.
>
> In terms of overall stability, I would presume your latest patch is superior 
> to the previous (i.e. what I attached most recently) ?

Yes, I prefer the second patch. There are still races in the first
patch I sent you; harder to trigger, but still there.

There are several cleanups I'd like to recover from the first patch
though. Would you help test them?

Thanks a lot for testing.


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30 10:34       ` Pablo Neira Ayuso
@ 2012-08-30 12:28         ` Oliver
  2012-08-30 12:39           ` Oliver
  2012-08-30 16:22           ` Pablo Neira Ayuso
  0 siblings, 2 replies; 16+ messages in thread
From: Oliver @ 2012-08-30 12:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Thursday 30 August 2012 12:34:37 you wrote:
> Yes, I prefer the second patch. There is still races in the first
> patch I sent you, harder to trigger, but still there.
> 
> There are several cleanups I'd like to recover from the first patch
> though. Would you help testing them?
> 
> Thanks a lot for testing.

Hi Pablo,

Yep, I'd be happy to test. I've also uncovered a new issue: I have two 
Active-Active machines (conntrackd running in NOTRACK mode with both the 
External and Internal caches disabled).

Under kernel 3.2 this pair handles asymmetric traffic issue-free. Upgrading 
to 3.4 immediately causes around 50% of TCP connection attempts from systems 
behind them to fail. ICMP, on the other hand, is flawless, and DNS lookups 
are also OK, so I *believe* UDP may also be performing well. I've no idea 
where to even look on this one, so any insight would be most appreciated.

Kind Regards,
Oliver


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30 12:28         ` Oliver
@ 2012-08-30 12:39           ` Oliver
  2012-08-30 16:22           ` Pablo Neira Ayuso
  1 sibling, 0 replies; 16+ messages in thread
From: Oliver @ 2012-08-30 12:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Thursday 30 August 2012 14:28:20 Oliver wrote:
> Yep, I'd be happy to test. I've also uncovered a new issue: I have two
> Active- Active machines (conntrackd running NOTRACK mode with both External
> and Internal cache disabled)
> 
> In kernel 3.2 this pair works asymmetric and issue-free. Upgrade it to 3.4
> and it immediately has around 50% failure of TCP connection attempts on
> systems behind them - ICMP on the other hand is flawless, DNS lookups also
> are OK so I *believe* that UDP may also be performing well - I've no idea
> where to even look on this one so any insight would be most appreciated.
> 
> Kind Regards,
> Oliver

Another thing that just entered my mind: I configured raw/PREROUTING with 
-j CT --notrack for TCP port 80 (source and dest) on the appropriate 
interfaces, and this resulted in total packet loss despite the fact that I 
had --ctstate UNTRACKED set to ACCEPT. Again, this behaviour only occurs 
under 3.4.[9|10] (probably earlier too, but I didn't test).
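For reference, the kind of ruleset being described is sketched below; the interface name and chain placement are assumptions, not a copy of the actual configuration:

```shell
# Exempt HTTP from connection tracking in the raw table, which runs
# before conntrack sees the packet (interface name assumed).
iptables -t raw -A PREROUTING -i eth0 -p tcp --dport 80 -j CT --notrack
iptables -t raw -A PREROUTING -i eth0 -p tcp --sport 80 -j CT --notrack

# Untracked packets never gain a conntrack state, so a state-based
# policy must accept the UNTRACKED state explicitly or they are dropped.
iptables -A FORWARD -m conntrack --ctstate UNTRACKED -j ACCEPT
```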


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30 12:28         ` Oliver
  2012-08-30 12:39           ` Oliver
@ 2012-08-30 16:22           ` Pablo Neira Ayuso
  2012-08-30 17:49             ` Oliver
  1 sibling, 1 reply; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-30 16:22 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

Hi Oliver,

On Thu, Aug 30, 2012 at 02:28:20PM +0200, Oliver wrote:
> On Thursday 30 August 2012 12:34:37 you wrote:
> > Yes, I prefer the second patch. There are still races in the first
> > patch I sent you - harder to trigger, but still there.
> > 
> > There are several cleanups I'd like to recover from the first patch
> > though. Would you help testing them?
> > 
> > Thanks a lot for testing.
> 
> Hi Pablo,
> 
> Yep, I'd be happy to test. I've also uncovered a new issue: I have two Active-
> Active machines (conntrackd running NOTRACK mode with both External and 
> Internal cache disabled)

Thanks. I'll send you patches then.

> In kernel 3.2 this pair works asymmetrically and issue-free. Upgrade it to
> 3.4 and it immediately has around 50% failure of TCP connection attempts
> on systems behind them. ICMP, on the other hand, is flawless, and DNS
> lookups are also OK, so I *believe* that UDP may also be performing well.
> I've no idea where to even look on this one, so any insight would be most
> appreciated.

Unfortunately, asymmetric active-active is a crazy setup for conntrack
(the documentation already discusses this). The state synchronization that
we are doing is asynchronous, so state updates race with TCP packets.
We don't support this, sorry.

We can support active-active with hash-based load-sharing using the
cluster match / arptables; that seems more sane to me. The theory is
described here:

http://1984.lsi.us.es/~pablo/docs/intcomp09.pdf
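As an illustration of the hash-based approach, the rules below sketch a
two-node setup using the xt_cluster match; the multicast MAC, interface,
and hash seed are placeholders, and the node 2 rules are symmetric with
--cluster-local-node 2:

```shell
# Both nodes receive traffic for the shared IP on a multicast MAC
# (the ARP side is typically handled with arptables; omitted here).
ip maddr add 01:00:5e:00:01:01 dev eth1

# Node 1: mark only the connections that hash to this node.
iptables -t mangle -A PREROUTING -i eth1 \
    -m cluster --cluster-total-nodes 2 --cluster-local-node 1 \
    --cluster-hash-seed 0xdeadbeef \
    -j MARK --set-mark 0xffff

# Drop everything that hashes to the other node.
iptables -t mangle -A PREROUTING -i eth1 \
    -m mark ! --mark 0xffff -j DROP
```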

However, there is no documentation yet on how to set this up. Last time I
looked at existing HA daemons, I didn't find that they support
active-active setups very well, so they require some changes / we need
some small new HA daemon for this.

I need to work on active-active load-sharing, to fully document and
support it. That's not at the top of my priority list at the moment,
though.

Another (simpler) alternative, in case your firewalls have two IPs, is to
statically distribute the load between your firewalls by assigning
different gateway IPs to your client nodes. That should not be hard to
deploy.
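A rough sketch of that static split, assuming two firewalls at 192.0.2.1
and 192.0.2.2 (addresses are illustrative):

```shell
# On half of the client nodes, route via firewall A:
ip route replace default via 192.0.2.1

# On the other half, route via firewall B:
ip route replace default via 192.0.2.2
```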


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30 16:22           ` Pablo Neira Ayuso
@ 2012-08-30 17:49             ` Oliver
  2012-08-30 18:39               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 16+ messages in thread
From: Oliver @ 2012-08-30 17:49 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Thursday 30 August 2012 18:22:48 you wrote:
> Unfortunately, asymmetric active-active is a crazy setup for conntrack
> (the documentation already discusses this). The state synchronization that
> we are doing is asynchronous, so state updates race with TCP packets.
> We don't support this, sorry.

The environment does fulfil the assumptions necessary for replication to 
happen within the handshake, so under 3.2 there is no issue with handshakes 
completing over an asymmetric path.

Nonetheless, what doesn't make sense is that this works under 3.2 and not 
under 3.4 - nor does the fact that "-j CT --notrack" on specific traffic 
fails (asymmetry should not matter there, because that traffic has no 
stateful tracking).

Kind Regards,
Oliver


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30 17:49             ` Oliver
@ 2012-08-30 18:39               ` Pablo Neira Ayuso
  2012-08-31  0:19                 ` Oliver
  0 siblings, 1 reply; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-30 18:39 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

On Thu, Aug 30, 2012 at 07:49:24PM +0200, Oliver wrote:
> On Thursday 30 August 2012 18:22:48 you wrote:
> > Unfortunately, asymmetric active-active is a crazy setup for conntrack
> > (the documentation already discusses this). The state synchronization that
> > we are doing is asynchronous, so state updates race with TCP packets.
> > We don't support this, sorry.
> 
> The environment does fulfil the assumptions necessary for replication to 
> happen within the handshake, so under 3.2 there is no issue with handshakes 
> completing over an asymmetric path.

Interesting, how are those assumptions fulfilled?

> Nonetheless, what doesn't make sense is that this works under 3.2 and not 
> under 3.4 - nor does the fact that "-j CT --notrack" on specific traffic 
> fails (asymmetry should not matter there, because that traffic has no 
> stateful tracking).

Agreed. But I can't think of any netfilter change that may result in
the problem you're reporting. You'll have to debug this and get back
to me with more information.


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-30 18:39               ` Pablo Neira Ayuso
@ 2012-08-31  0:19                 ` Oliver
  2012-08-31  9:27                   ` Pablo Neira Ayuso
  0 siblings, 1 reply; 16+ messages in thread
From: Oliver @ 2012-08-31  0:19 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Thursday 30 August 2012 20:39:50 Pablo Neira Ayuso wrote:
> Interesting, how are those assumptions fulfilled?

Well, timing of course ;) - essentially, traffic paths are ensured to persist 
longer than the actual time needed to replicate the conntrack state.


> Agreed. But I don't come with any netfilter change that may result in
> that problem you're reporting. You'll have to debug this and get back
> to me with more information.

You can disregard this; it turned out to be due to the unfortunate fact that 
net.ipv4.netfilter.ip_conntrack_tcp_be_liberal is of course replaced by 
net.netfilter.nf_conntrack_tcp_be_liberal under 3.4.
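For anyone hitting the same rename, the new key can be set like this (the 
old net.ipv4.netfilter.* path only exists when the kernel's conntrack 
proc-compat support is built in):

```shell
# New name under net.netfilter (nf_conntrack):
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

# Legacy name, available only with CONFIG_NF_CONNTRACK_PROC_COMPAT=y:
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
```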

Please feel free to send me your latest rework of the patch and I will be 
happy to test it out.

Kind Regards,
Oliver


* Re: [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack
  2012-08-31  0:19                 ` Oliver
@ 2012-08-31  9:27                   ` Pablo Neira Ayuso
  0 siblings, 0 replies; 16+ messages in thread
From: Pablo Neira Ayuso @ 2012-08-31  9:27 UTC (permalink / raw)
  To: Oliver; +Cc: netfilter-devel

On Fri, Aug 31, 2012 at 02:19:36AM +0200, Oliver wrote:
> On Thursday 30 August 2012 20:39:50 Pablo Neira Ayuso wrote:
> > Interesting, how are those assumptions fulfilled?
> 
> Well, timing of course ;) - essentially, traffic paths are ensured to 
> persist longer than the actual time needed to replicate the conntrack state.

I see.

> > Agreed. But I can't think of any netfilter change that may result in
> > the problem you're reporting. You'll have to debug this and get back
> > to me with more information.
> 
> You can disregard this; it turned out to be due to the unfortunate fact that 
> net.ipv4.netfilter.ip_conntrack_tcp_be_liberal is of course replaced by 
> net.netfilter.nf_conntrack_tcp_be_liberal under 3.4.

$ ls /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal 
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal

Probably you forgot to set CONFIG_NF_CONNTRACK_PROC_COMPAT=y

We haven't removed it yet, although it would not be a bad idea to schedule
this for removal.

> Please feel free to send me your latest rework of the patch and I will be 
> happy to test it out.

Will do.


end of thread, other threads:[~2012-08-31  9:27 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-08-27  9:33 [PATCH] death_by_event() does not check IPS_DYING_BIT - race condition against ctnetlink_del_conntrack Oliver
2012-08-28 10:52 ` Pablo Neira Ayuso
2012-08-28 17:16   ` Oliver
2012-08-28 23:10     ` Oliver
2012-08-30  0:52       ` Pablo Neira Ayuso
2012-08-30  2:05         ` Oliver
2012-08-30  2:25           ` Pablo Neira Ayuso
     [not found] ` <5427975.6moJlq4F9d@gentoovm>
     [not found]   ` <20120830025009.GA16782@1984>
2012-08-30  3:09     ` Oliver
2012-08-30 10:34       ` Pablo Neira Ayuso
2012-08-30 12:28         ` Oliver
2012-08-30 12:39           ` Oliver
2012-08-30 16:22           ` Pablo Neira Ayuso
2012-08-30 17:49             ` Oliver
2012-08-30 18:39               ` Pablo Neira Ayuso
2012-08-31  0:19                 ` Oliver
2012-08-31  9:27                   ` Pablo Neira Ayuso
