* [PATCH nf-next 0/6] conntrack: get rid of per-object timer
@ 2016-08-19 11:36 Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 1/6] netfilter: don't rely on DYING bit to detect when destroy event was sent Florian Westphal
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel

During NFWS 2016 it was mentioned that per-conntrack timers have
two drawbacks:
 - the 5-day default established timeout is very large and brings
   extra constraints for the timer subsystem.
 - most distros enable timer stats, so the timer struct eats 80 bytes
   in each conntrack object.

This series replaces the per-object struct timer with a u32 jiffies
stamp and one global delayed work queue for conntrack eviction.
The size of the nf_conn struct is reduced to 256 bytes on x86_64.

Eviction is performed from the packet path when doing the table
lookup; for cases where we have idle periods, the work queue is used.
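
The expiry check itself is plain serial-number arithmetic on u32
jiffies stamps; as a standalone illustration (simplified from the
helpers patch 2 adds below, not code from the series itself):

	#include <stdbool.h>
	#include <stdint.h>

	/* wrap-safe: the signed difference drops to <= 0 once
	 * 'timeout' lies in the past, even across a 2^32 wrap */
	static bool stamp_expired(uint32_t timeout, uint32_t now)
	{
		return (int32_t)(timeout - now) <= 0;
	}

	/* e.g. timeout = 10 (just after wrap), now = 0xfffffff0:
	 * (int32_t)(10 - 0xfffffff0) == 26 > 0, not yet expired */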

Tested with the following script, with conntrackd running in 'reliable
event mode' and httpterm running on another host:

-----------------------------------------------------------
random_resize() {
        while true; do
                # $RANDOM is limited to 0..32767, so this picks a
                # bucket count in the range 8192..40959
                RND=$((RANDOM % 256000 + 8192))
                sysctl net.netfilter.nf_conntrack_buckets=$RND
                sleep $((RANDOM % 120))
        done
}

random_flush() {
        while true; do
                sleep $((RANDOM % 120))
                conntrack -F
        done
}

random_startstop() {
        while true; do
                sleep $((RANDOM % 120))
                pkill -STOP conntrackd
                sleep $((RANDOM % 10))
                pkill -CONT conntrackd
        done
}

http-client -u 1000 -t 3 -F 192.168.10.50 -G 192.168.10.17:8001 &
http-client -u 1000 -t 3 -F 192.168.10.51 -G 192.168.10.17:8001 &
http-client -u 1000 -t 3 -F 192.168.10.52 -G 192.168.10.17:8001 &

random_resize &
random_flush &
random_startstop &

wait
-----------------------------------------------------------

 include/net/netfilter/nf_conntrack.h        |   36 +++--
 include/net/netfilter/nf_conntrack_ecache.h |   17 +-
 net/netfilter/nf_conntrack_core.c           |  186 ++++++++++++++++++++--------
 net/netfilter/nf_conntrack_ecache.c         |   22 ++-
 net/netfilter/nf_conntrack_netlink.c        |   39 ++++-
 net/netfilter/nf_conntrack_pptp.c           |    3 
 net/netfilter/nf_nat_core.c                 |    6 
 7 files changed, 215 insertions(+), 94 deletions(-)


* [PATCH nf-next 1/6] netfilter: don't rely on DYING bit to detect when destroy event was sent
  2016-08-19 11:36 [PATCH nf-next 0/6] conntrack: get rid of per-object timer Florian Westphal
@ 2016-08-19 11:36 ` Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer Florian Westphal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when
re-delivering events from the ecache worker.

Currently, when we delete a conntrack from the main table, we only set
this bit if we could also deliver the netlink destroy event to userspace.

If we fail, we move it to the dying list; the ecache worker will then
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once the timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with another cpu.

Pablo suggested adding a marker in the ecache extension that lets us
skip entries that have been unhashed from the main table but are still
waiting for the last reference count to be dropped (e.g. because one
skb waiting for an nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, we make a note of this in the
ecache extension.  The worker can then skip all entries that are in a
different state: either delivery of the destroy event never failed
(e.g. because the netlink backend was not loaded), or redelivery has
already taken place.

Once the conntrack timer is removed we will be able to replace the
del_timer() test with test_and_set_bit(IPS_DYING_BIT, &ct->status) to
avoid racing with another cpu that tries to evict the same conntrack.

Because IPS_DYING_BIT will then be set right before we report the
destroy event, we can no longer skip event reporting when the dying
bit is set.
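
As a rough sketch, the per-entry state machine this yields (using the
names introduced in the diff below) is:

	/* destroy-event delivery state, per conntrack:
	 *
	 *   NFCT_ECACHE_UNKNOWN       destroy event not sent
	 *   NFCT_ECACHE_DESTROY_FAIL  tried but failed to send it
	 *   NFCT_ECACHE_DESTROY_SENT  sent destroy event after failure
	 */
	e->state = NFCT_ECACHE_DESTROY_FAIL;	/* on failed delivery */
	...
	if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL)
		continue;			/* worker skips these */
	e->state = NFCT_ECACHE_DESTROY_SENT;	/* after redelivery */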

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack_ecache.h | 17 ++++++++++++-----
 net/netfilter/nf_conntrack_ecache.c         | 22 ++++++++++++++--------
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index fa36447..12d967b 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -12,12 +12,19 @@
 #include <linux/netfilter/nf_conntrack_tuple_common.h>
 #include <net/netfilter/nf_conntrack_extend.h>
 
+enum nf_ct_ecache_state {
+	NFCT_ECACHE_UNKNOWN,		/* destroy event not sent */
+	NFCT_ECACHE_DESTROY_FAIL,	/* tried but failed to send destroy event */
+	NFCT_ECACHE_DESTROY_SENT,	/* sent destroy event after failure */
+};
+
 struct nf_conntrack_ecache {
-	unsigned long cache;	/* bitops want long */
-	unsigned long missed;	/* missed events */
-	u16 ctmask;		/* bitmask of ct events to be delivered */
-	u16 expmask;		/* bitmask of expect events to be delivered */
-	u32 portid;		/* netlink portid of destroyer */
+	unsigned long cache;		/* bitops want long */
+	unsigned long missed;		/* missed events */
+	u16 ctmask;			/* bitmask of ct events to be delivered */
+	u16 expmask;			/* bitmask of expect events to be delivered */
+	u32 portid;			/* netlink portid of destroyer */
+	enum nf_ct_ecache_state state;	/* ecache state */
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index d28011b..da9df2d 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -49,8 +49,13 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 
 	hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+		struct nf_conntrack_ecache *e;
 
-		if (nf_ct_is_dying(ct))
+		if (!nf_ct_is_confirmed(ct))
+			continue;
+
+		e = nf_ct_ecache_find(ct);
+		if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL)
 			continue;
 
 		if (nf_conntrack_event(IPCT_DESTROY, ct)) {
@@ -58,8 +63,7 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 			break;
 		}
 
-		/* we've got the event delivered, now it's dying */
-		set_bit(IPS_DYING_BIT, &ct->status);
+		e->state = NFCT_ECACHE_DESTROY_SENT;
 		refs[evicted] = ct;
 
 		if (++evicted >= ARRAY_SIZE(refs)) {
@@ -130,7 +134,7 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
 	if (!e)
 		goto out_unlock;
 
-	if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) {
+	if (nf_ct_is_confirmed(ct)) {
 		struct nf_ct_event item = {
 			.ct	= ct,
 			.portid	= e->portid ? e->portid : portid,
@@ -150,11 +154,13 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
 				 * triggered by a process, we store the PORTID
 				 * to include it in the retransmission.
 				 */
-				if (eventmask & (1 << IPCT_DESTROY) &&
-				    e->portid == 0 && portid != 0)
-					e->portid = portid;
-				else
+				if (eventmask & (1 << IPCT_DESTROY)) {
+					if (e->portid == 0 && portid != 0)
+						e->portid = portid;
+					e->state = NFCT_ECACHE_DESTROY_FAIL;
+				} else {
 					e->missed |= eventmask;
+				}
 			} else {
 				e->missed &= ~missed;
 			}
-- 
2.7.3



* [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-19 11:36 [PATCH nf-next 0/6] conntrack: get rid of per-object timer Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 1/6] netfilter: don't rely on DYING bit to detect when destroy event was sent Florian Westphal
@ 2016-08-19 11:36 ` Florian Westphal
  2016-08-19 14:37   ` Eric Dumazet
  2016-08-19 11:36 ` [PATCH nf-next 3/6] netfilter: evict stale entries on netlink dumps Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

With timer stats enabled, the timer struct eats 80 bytes on x86_64 in
each nf_conn entry.

Remove it and use a 32bit jiffies value containing the timestamp until
which the entry is valid.

During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict; otherwise we could try to evict an already-dead entry
that is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsolete entries (e.g. during
netlink dumps); this is needed to keep expired conntracks from hanging
around for too long when the lookup rate is low after a busy period.
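
Putting it together, the lockless eviction follows the usual
SLAB_DESTROY_BY_RCU discipline; here is an annotated version of the
helper added below (the comments are editorial, not from the patch):

	static void nf_ct_gc_expired(struct nf_conn *ct)
	{
		/* the entry may already be freed and recycled; only
		 * proceed if we can still grab a reference */
		if (!atomic_inc_not_zero(&ct->ct_general.use))
			return;

		/* re-check under the reference: still expired, still
		 * confirmed (hashed), not claimed by another cpu */
		if (nf_ct_should_gc(ct))
			nf_ct_kill(ct);

		nf_ct_put(ct);
	}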

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h | 23 +++++++--
 net/netfilter/nf_conntrack_core.c    | 91 ++++++++++++++++++++----------------
 net/netfilter/nf_conntrack_netlink.c | 14 ++----
 net/netfilter/nf_conntrack_pptp.c    |  3 +-
 net/netfilter/nf_nat_core.c          |  6 ---
 5 files changed, 74 insertions(+), 63 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2a12748..6d8cf06 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -42,7 +42,6 @@ union nf_conntrack_expect_proto {
 
 #include <linux/types.h>
 #include <linux/skbuff.h>
-#include <linux/timer.h>
 
 #ifdef CONFIG_NETFILTER_DEBUG
 #define NF_CT_ASSERT(x)		WARN_ON(!(x))
@@ -73,7 +72,7 @@ struct nf_conn_help {
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 
 struct nf_conn {
-	/* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
+	/* Usage count in here is 1 for hash table, 1 per skb,
 	 * plus 1 for any connection(s) we are `master' for
 	 *
 	 * Hint, SKB address this struct and refcnt via skb->nfct and
@@ -96,8 +95,8 @@ struct nf_conn {
 	/* Have we seen traffic both ways yet? (bitset) */
 	unsigned long status;
 
-	/* Timer function; drops refcnt when it goes off. */
-	struct timer_list timeout;
+	/* jiffies32 when this ct is considered dead */
+	u32 timeout;
 
 	possible_net_t ct_net;
 
@@ -291,14 +290,28 @@ static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
 	return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
 }
 
+#define nfct_time_stamp ((u32)(jiffies))
+
 /* jiffies until ct expires, 0 if already expired */
 static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
 {
-	long timeout = (long)ct->timeout.expires - (long)jiffies;
+	s32 timeout = ct->timeout - nfct_time_stamp;
 
 	return timeout > 0 ? timeout : 0;
 }
 
+static inline bool nf_ct_is_expired(const struct nf_conn *ct)
+{
+        return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
+}
+
+/* use after obtaining a reference count */
+static inline bool nf_ct_should_gc(const struct nf_conn *ct)
+{
+	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
+	       !nf_ct_is_dying(ct);
+}
+
 struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7d90a5d..38c50d1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -371,7 +371,6 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
 	pr_debug("destroy_conntrack(%p)\n", ct);
 	NF_CT_ASSERT(atomic_read(&nfct->use) == 0);
-	NF_CT_ASSERT(!timer_pending(&ct->timeout));
 
 	if (unlikely(nf_ct_is_template(ct))) {
 		nf_ct_tmpl_free(ct);
@@ -434,35 +433,30 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
 {
 	struct nf_conn_tstamp *tstamp;
 
+	if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
+		return false;
+
 	tstamp = nf_conn_tstamp_find(ct);
 	if (tstamp && tstamp->stop == 0)
 		tstamp->stop = ktime_get_real_ns();
 
-	if (nf_ct_is_dying(ct))
-		goto delete;
-
 	if (nf_conntrack_event_report(IPCT_DESTROY, ct,
 				    portid, report) < 0) {
-		/* destroy event was not delivered */
+		/* destroy event was not delivered. nf_ct_put will
+		 * be done by event cache worker on redelivery.
+		 */
 		nf_ct_delete_from_lists(ct);
 		nf_conntrack_ecache_delayed_work(nf_ct_net(ct));
 		return false;
 	}
 
 	nf_conntrack_ecache_work(nf_ct_net(ct));
-	set_bit(IPS_DYING_BIT, &ct->status);
- delete:
 	nf_ct_delete_from_lists(ct);
 	nf_ct_put(ct);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nf_ct_delete);
 
-static void death_by_timeout(unsigned long ul_conntrack)
-{
-	nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
-}
-
 static inline bool
 nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 		const struct nf_conntrack_tuple *tuple,
@@ -480,6 +474,18 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 	       net_eq(net, nf_ct_net(ct));
 }
 
+/* caller must hold rcu readlock and none of the nf_conntrack_locks */
+static void nf_ct_gc_expired(struct nf_conn *ct)
+{
+	if (!atomic_inc_not_zero(&ct->ct_general.use))
+		return;
+
+	if (nf_ct_should_gc(ct))
+		nf_ct_kill(ct);
+
+	nf_ct_put(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -499,6 +505,17 @@ begin:
 	bucket = reciprocal_scale(hash, hsize);
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
+		struct nf_conn *ct;
+
+		ct = nf_ct_tuplehash_to_ctrack(h);
+		if (nf_ct_is_expired(ct)) {
+			nf_ct_gc_expired(ct);
+			continue;
+		}
+
+		if (nf_ct_is_dying(ct))
+			continue;
+
 		if (nf_ct_key_equal(h, tuple, zone, net)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			return h;
@@ -597,7 +614,6 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 				    zone, net))
 			goto out;
 
-	add_timer(&ct->timeout);
 	smp_wmb();
 	/* The caller holds a reference to this object */
 	atomic_set(&ct->ct_general.use, 2);
@@ -750,8 +766,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	/* Timer relative to confirmation time, not original
 	   setting time, otherwise we'd get timer wrap in
 	   weird delay cases. */
-	ct->timeout.expires += jiffies;
-	add_timer(&ct->timeout);
+	ct->timeout += nfct_time_stamp;
 	atomic_inc(&ct->ct_general.use);
 	ct->status |= IPS_CONFIRMED;
 
@@ -814,8 +829,16 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[hash], hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
-		if (ct != ignored_conntrack &&
-		    nf_ct_key_equal(h, tuple, zone, net)) {
+
+		if (ct == ignored_conntrack)
+			continue;
+
+		if (nf_ct_is_expired(ct)) {
+			nf_ct_gc_expired(ct);
+			continue;
+		}
+
+		if (nf_ct_key_equal(h, tuple, zone, net)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			rcu_read_unlock();
 			return 1;
@@ -843,6 +866,11 @@ static unsigned int early_drop_list(struct net *net,
 	hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
 		tmp = nf_ct_tuplehash_to_ctrack(h);
 
+		if (nf_ct_is_expired(tmp)) {
+			nf_ct_gc_expired(tmp);
+			continue;
+		}
+
 		if (test_bit(IPS_ASSURED_BIT, &tmp->status) ||
 		    !net_eq(nf_ct_net(tmp), net) ||
 		    nf_ct_is_dying(tmp))
@@ -860,7 +888,6 @@ static unsigned int early_drop_list(struct net *net,
 		 */
 		if (net_eq(nf_ct_net(tmp), net) &&
 		    nf_ct_is_confirmed(tmp) &&
-		    del_timer(&tmp->timeout) &&
 		    nf_ct_delete(tmp, 0, 0))
 			drops++;
 
@@ -930,8 +957,6 @@ __nf_conntrack_alloc(struct net *net,
 	/* save hash for reusing when confirming */
 	*(unsigned long *)(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev) = hash;
 	ct->status = 0;
-	/* Don't set timer yet: wait for confirmation */
-	setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
 	write_pnet(&ct->ct_net, net);
 	memset(&ct->__nfct_init_offset[0], 0,
 	       offsetof(struct nf_conn, proto) -
@@ -1305,7 +1330,6 @@ void __nf_ct_refresh_acct(struct nf_conn *ct,
 			  unsigned long extra_jiffies,
 			  int do_acct)
 {
-	NF_CT_ASSERT(ct->timeout.data == (unsigned long)ct);
 	NF_CT_ASSERT(skb);
 
 	/* Only update if this is not a fixed timeout */
@@ -1313,18 +1337,10 @@ void __nf_ct_refresh_acct(struct nf_conn *ct,
 		goto acct;
 
 	/* If not in hash table, timer will not be active yet */
-	if (!nf_ct_is_confirmed(ct)) {
-		ct->timeout.expires = extra_jiffies;
-	} else {
-		unsigned long newtime = jiffies + extra_jiffies;
-
-		/* Only update the timeout if the new timeout is at least
-		   HZ jiffies from the old timeout. Need del_timer for race
-		   avoidance (may already be dying). */
-		if (newtime - ct->timeout.expires >= HZ)
-			mod_timer_pending(&ct->timeout, newtime);
-	}
+	if (nf_ct_is_confirmed(ct))
+		extra_jiffies += nfct_time_stamp;
 
+	ct->timeout = extra_jiffies;
 acct:
 	if (do_acct)
 		nf_ct_acct_update(ct, ctinfo, skb->len);
@@ -1339,11 +1355,7 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 	if (do_acct)
 		nf_ct_acct_update(ct, ctinfo, skb->len);
 
-	if (del_timer(&ct->timeout)) {
-		ct->timeout.function((unsigned long)ct);
-		return true;
-	}
-	return false;
+	return nf_ct_delete(ct, 0, 0);
 }
 EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
 
@@ -1478,11 +1490,8 @@ void nf_ct_iterate_cleanup(struct net *net,
 
 	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
 		/* Time to push up daises... */
-		if (del_timer(&ct->timeout))
-			nf_ct_delete(ct, portid, report);
-
-		/* ... else the timer will get him soon. */
 
+		nf_ct_delete(ct, portid, report);
 		nf_ct_put(ct);
 		cond_resched();
 	}
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 68800c1..81fd34c 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1144,9 +1144,7 @@ static int ctnetlink_del_conntrack(struct net *net, struct sock *ctnl,
 		}
 	}
 
-	if (del_timer(&ct->timeout))
-		nf_ct_delete(ct, NETLINK_CB(skb).portid, nlmsg_report(nlh));
-
+	nf_ct_delete(ct, NETLINK_CB(skb).portid, nlmsg_report(nlh));
 	nf_ct_put(ct);
 
 	return 0;
@@ -1514,11 +1512,10 @@ static int ctnetlink_change_timeout(struct nf_conn *ct,
 {
 	u_int32_t timeout = ntohl(nla_get_be32(cda[CTA_TIMEOUT]));
 
-	if (!del_timer(&ct->timeout))
-		return -ETIME;
+	ct->timeout = nfct_time_stamp + timeout * HZ;
 
-	ct->timeout.expires = jiffies + timeout * HZ;
-	add_timer(&ct->timeout);
+	if (test_bit(IPS_DYING_BIT, &ct->status))
+		return -ETIME;
 
 	return 0;
 }
@@ -1716,9 +1713,8 @@ ctnetlink_create_conntrack(struct net *net,
 
 	if (!cda[CTA_TIMEOUT])
 		goto err1;
-	ct->timeout.expires = ntohl(nla_get_be32(cda[CTA_TIMEOUT]));
 
-	ct->timeout.expires = jiffies + ct->timeout.expires * HZ;
+	ct->timeout = nfct_time_stamp + ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ;
 
 	rcu_read_lock();
  	if (cda[CTA_HELP]) {
diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c
index 5588c7a..f60a475 100644
--- a/net/netfilter/nf_conntrack_pptp.c
+++ b/net/netfilter/nf_conntrack_pptp.c
@@ -157,8 +157,7 @@ static int destroy_sibling_or_exp(struct net *net, struct nf_conn *ct,
 		pr_debug("setting timeout of conntrack %p to 0\n", sibling);
 		sibling->proto.gre.timeout	  = 0;
 		sibling->proto.gre.stream_timeout = 0;
-		if (del_timer(&sibling->timeout))
-			sibling->timeout.function((unsigned long)sibling);
+		nf_ct_kill(sibling);
 		nf_ct_put(sibling);
 		return 1;
 	} else {
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index de31818..81ae41f 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -565,16 +565,10 @@ static int nf_nat_proto_clean(struct nf_conn *ct, void *data)
 	 * Else, when the conntrack is destoyed, nf_nat_cleanup_conntrack()
 	 * will delete entry from already-freed table.
 	 */
-	if (!del_timer(&ct->timeout))
-		return 1;
-
 	ct->status &= ~IPS_NAT_DONE_MASK;
-
 	rhashtable_remove_fast(&nf_nat_bysource_table, &ct->nat_bysource,
 			       nf_nat_bysource_params);
 
-	add_timer(&ct->timeout);
-
 	/* don't delete conntrack.  Although that would make things a lot
 	 * simpler, we'd end up flushing all conntracks on nat rmmod.
 	 */
-- 
2.7.3



* [PATCH nf-next 3/6] netfilter: evict stale entries on netlink dumps
  2016-08-19 11:36 [PATCH nf-next 0/6] conntrack: get rid of per-object timer Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 1/6] netfilter: don't rely on DYING bit to detect when destroy event was sent Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer Florian Westphal
@ 2016-08-19 11:36 ` Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries Florian Westphal
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.

We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the global
resize lock, so we can't evict without adding an 'expired' list to
drop from later.  Considering that resizes are very rare, it doesn't
seem worth doing.
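
The dump path defers the actual kill: nf_ct_delete() needs to take
hash locks itself, so while holding the bucket lock we only collect up
to eight candidates and dispose of them after the lock is dropped.
Schematically (condensed from the diff below; the lock-ordering
rationale is editorial):

	/* under nf_conntrack_lock(lockp): remember expired entries */
	if (nf_ct_is_expired(ct)) {
		if (i < ARRAY_SIZE(nf_ct_evict) &&
		    atomic_inc_not_zero(&ct->ct_general.use))
			nf_ct_evict[i++] = ct;
		continue;
	}

	/* after unlock: kill whatever is still eligible */
	while (i) {
		i--;
		if (nf_ct_should_gc(nf_ct_evict[i]))
			nf_ct_kill(nf_ct_evict[i]);
		nf_ct_put(nf_ct_evict[i]);
	}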

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_netlink.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 81fd34c..dedbe4b 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -815,14 +815,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	struct hlist_nulls_node *n;
 	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
 	u_int8_t l3proto = nfmsg->nfgen_family;
-	int res;
+	struct nf_conn *nf_ct_evict[8];
+	int res, i;
 	spinlock_t *lockp;
 
 	last = (struct nf_conn *)cb->args[1];
+	i = 0;
 
 	local_bh_disable();
 	for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) {
 restart:
+		while (i) {
+			i--;
+			if (nf_ct_should_gc(nf_ct_evict[i]))
+				nf_ct_kill(nf_ct_evict[i]);
+			nf_ct_put(nf_ct_evict[i]);
+		}
+
 		lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS];
 		nf_conntrack_lock(lockp);
 		if (cb->args[0] >= nf_conntrack_htable_size) {
@@ -834,6 +843,13 @@ restart:
 			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 				continue;
 			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (nf_ct_is_expired(ct)) {
+				if (i < ARRAY_SIZE(nf_ct_evict) &&
+				    atomic_inc_not_zero(&ct->ct_general.use))
+					nf_ct_evict[i++] = ct;
+				continue;
+			}
+
 			if (!net_eq(net, nf_ct_net(ct)))
 				continue;
 
@@ -875,6 +891,13 @@ out:
 	if (last)
 		nf_ct_put(last);
 
+	while (i) {
+		i--;
+		if (nf_ct_should_gc(nf_ct_evict[i]))
+			nf_ct_kill(nf_ct_evict[i]);
+		nf_ct_put(nf_ct_evict[i]);
+	}
+
 	return skb->len;
 }
 
-- 
2.7.3



* [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries
  2016-08-19 11:36 [PATCH nf-next 0/6] conntrack: get rid of per-object timer Florian Westphal
                   ` (2 preceding siblings ...)
  2016-08-19 11:36 ` [PATCH nf-next 3/6] netfilter: evict stale entries on netlink dumps Florian Westphal
@ 2016-08-19 11:36 ` Florian Westphal
  2016-08-19 14:41   ` Eric Dumazet
  2016-08-19 11:36 ` [PATCH nf-next 5/6] netfilter: conntrack: resched gc again if eviction rate is high Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 6/6] netfilter: remove __nf_ct_kill_acct helper Florian Westphal
  5 siblings, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of
the table (and no more than 8k buckets) to avoid hogging the cpu.

This means that a complete scan of the table will take several minutes
of wall-clock time.
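
For example: with the 1/64th divisor, any table of up to 64 * 8192 =
524288 buckets is covered in 64 runs, i.e. 64 * 5s = 320 seconds per
full scan, a bit over five minutes (assuming no run stops early on
the 256-evictions cap).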

Considering that during normal operation the gc run will never have
to evict any entries (those evictions happen from the packet path),
this should be fine.

We only need gc to make sure userspace (conntrack event listeners)
eventually learns of the timeout, and for resource reclaim in case the
system becomes idle.

We do not disable BH, and we cond_resched() after every bucket, so
this should not introduce noticeable latencies either.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 77 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 38c50d1..3b778d2 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -72,11 +72,24 @@ EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
 struct hlist_nulls_head *nf_conntrack_hash __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_hash);
 
+struct conntrack_gc_work {
+	struct delayed_work	dwork;
+	u32			last_bucket;
+	bool			exiting;
+};
+
 static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
+#define GC_MAX_BUCKETS_DIV	64u
+#define GC_MAX_BUCKETS		8192u
+#define GC_INTERVAL		(5 * HZ)
+#define GC_MAX_EVICTS		256u
+
+static struct conntrack_gc_work conntrack_gc_work;
+
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
 	spin_lock(lock);
@@ -921,6 +934,62 @@ static noinline int early_drop(struct net *net, unsigned int _hash)
 	return false;
 }
 
+static void gc_worker(struct work_struct *work)
+{
+	unsigned int i, goal, buckets = 0, expired_count = 0;
+	unsigned long next_run = GC_INTERVAL;
+	struct conntrack_gc_work *gc_work;
+
+	gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
+
+	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
+	i = gc_work->last_bucket;
+
+	do {
+		struct nf_conntrack_tuple_hash *h;
+		struct hlist_nulls_head *ct_hash;
+		struct hlist_nulls_node *n;
+		unsigned int sequence;
+		struct nf_conn *tmp;
+
+		i++;
+		rcu_read_lock();
+
+		do {
+			sequence = read_seqcount_begin(&nf_conntrack_generation);
+			if (i >= nf_conntrack_htable_size)
+				i = 0;
+			ct_hash = nf_conntrack_hash;
+		} while (read_seqcount_retry(&nf_conntrack_generation, sequence));
+
+		hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
+			tmp = nf_ct_tuplehash_to_ctrack(h);
+
+			if (nf_ct_is_expired(tmp)) {
+				nf_ct_gc_expired(tmp);
+				expired_count++;
+				continue;
+			}
+		}
+
+		rcu_read_unlock();
+		cond_resched();
+	} while (++buckets < goal &&
+		 expired_count < GC_MAX_EVICTS);
+
+	if (gc_work->exiting)
+		return;
+
+	gc_work->last_bucket = i;
+	schedule_delayed_work(&gc_work->dwork, next_run);
+}
+
+static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
+{
+	INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
+	gc_work->exiting = false;
+}
+
 static struct nf_conn *
 __nf_conntrack_alloc(struct net *net,
 		     const struct nf_conntrack_zone *zone,
@@ -1527,6 +1596,7 @@ static int untrack_refs(void)
 
 void nf_conntrack_cleanup_start(void)
 {
+	conntrack_gc_work.exiting = true;
 	RCU_INIT_POINTER(ip_ct_attach, NULL);
 }
 
@@ -1536,6 +1606,9 @@ void nf_conntrack_cleanup_end(void)
 	while (untrack_refs() > 0)
 		schedule();
 
+	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
+	/* can be re-scheduled once */
+	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
 	nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
 
 	nf_conntrack_proto_fini();
@@ -1810,6 +1883,10 @@ int nf_conntrack_init_start(void)
 	}
 	/*  - and look it like as a confirmed connection */
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
+
+	conntrack_gc_work_init(&conntrack_gc_work);
+	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
+
 	return 0;
 
 err_proto:
-- 
2.7.3



* [PATCH nf-next 5/6] netfilter: conntrack: resched gc again if eviction rate is high
  2016-08-19 11:36 [PATCH nf-next 0/6] conntrack: get rid of per-object timer Florian Westphal
                   ` (3 preceding siblings ...)
  2016-08-19 11:36 ` [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries Florian Westphal
@ 2016-08-19 11:36 ` Florian Westphal
  2016-08-19 11:36 ` [PATCH nf-next 6/6] netfilter: remove __nf_ct_kill_acct helper Florian Westphal
  5 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

If we evicted a large fraction of the scanned conntrack entries,
re-schedule the next gc cycle for immediate execution.

This triggers during tests where load is high and then drops to zero,
leaving many connections in TW/CLOSE state with < 30 second timeouts.

Without this change it will take several minutes until conntrack count
comes back to normal.
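
For example: a run that scans 1024 entries and finds 950 of them
expired yields ratio = 950 * 100 / 1024 = 92, which is >= 90, so
next_run is set to 0 and the worker is rescheduled immediately.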

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 3b778d2..d5f8f34 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -938,6 +938,7 @@ static void gc_worker(struct work_struct *work)
 {
 	unsigned int i, goal, buckets = 0, expired_count = 0;
 	unsigned long next_run = GC_INTERVAL;
+	unsigned int ratio, scanned = 0;
 	struct conntrack_gc_work *gc_work;
 
 	gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
@@ -965,6 +966,7 @@ static void gc_worker(struct work_struct *work)
 		hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
 
+			scanned++;
 			if (nf_ct_is_expired(tmp)) {
 				nf_ct_gc_expired(tmp);
 				expired_count++;
@@ -980,6 +982,10 @@ static void gc_worker(struct work_struct *work)
 	if (gc_work->exiting)
 		return;
 
+	ratio = scanned ? expired_count * 100 / scanned : 0;
+	if (ratio >= 90)
+		next_run = 0;
+
 	gc_work->last_bucket = i;
 	schedule_delayed_work(&gc_work->dwork, next_run);
 }
-- 
2.7.3



* [PATCH nf-next 6/6] netfilter: remove __nf_ct_kill_acct helper
  2016-08-19 11:36 [PATCH nf-next 0/6] conntrack: get rid of per-object timer Florian Westphal
                   ` (4 preceding siblings ...)
  2016-08-19 11:36 ` [PATCH nf-next 5/6] netfilter: conntrack: resched gc again if eviction rate is high Florian Westphal
@ 2016-08-19 11:36 ` Florian Westphal
  5 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 11:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

After timer removal this just calls nf_ct_delete, so remove the
__-prefixed version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h | 13 +++----------
 net/netfilter/nf_conntrack_core.c    | 12 +++++-------
 2 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 6d8cf06..a3595c2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -219,21 +219,14 @@ static inline void nf_ct_refresh(struct nf_conn *ct,
 	__nf_ct_refresh_acct(ct, 0, skb, extra_jiffies, 0);
 }
 
-bool __nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
-		       const struct sk_buff *skb, int do_acct);
-
 /* kill conntrack and do accounting */
-static inline bool nf_ct_kill_acct(struct nf_conn *ct,
-				   enum ip_conntrack_info ctinfo,
-				   const struct sk_buff *skb)
-{
-	return __nf_ct_kill_acct(ct, ctinfo, skb, 1);
-}
+bool nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
+		     const struct sk_buff *skb);
 
 /* kill conntrack without accounting */
 static inline bool nf_ct_kill(struct nf_conn *ct)
 {
-	return __nf_ct_kill_acct(ct, 0, NULL, 0);
+	return nf_ct_delete(ct, 0, 0);
 }
 
 /* These are for NAT.  Icky. */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index d5f8f34..282085c 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1422,17 +1422,15 @@ acct:
 }
 EXPORT_SYMBOL_GPL(__nf_ct_refresh_acct);
 
-bool __nf_ct_kill_acct(struct nf_conn *ct,
-		       enum ip_conntrack_info ctinfo,
-		       const struct sk_buff *skb,
-		       int do_acct)
+bool nf_ct_kill_acct(struct nf_conn *ct,
+		     enum ip_conntrack_info ctinfo,
+		     const struct sk_buff *skb)
 {
-	if (do_acct)
-		nf_ct_acct_update(ct, ctinfo, skb->len);
+	nf_ct_acct_update(ct, ctinfo, skb->len);
 
 	return nf_ct_delete(ct, 0, 0);
 }
-EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
+EXPORT_SYMBOL_GPL(nf_ct_kill_acct);
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
 
-- 
2.7.3



* Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-19 11:36 ` [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer Florian Westphal
@ 2016-08-19 14:37   ` Eric Dumazet
  2016-08-19 15:16     ` Florian Westphal
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-08-19 14:37 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote:
> With stats enabled this eats 80 bytes on x86_64 per nf_conn entry.
> 
> Remove it and use a 32bit jiffies value containing timestamp until
> entry is valid.

Great work !

...

> +/* caller must hold rcu readlock and none of the nf_conntrack_locks */
> +static void nf_ct_gc_expired(struct nf_conn *ct)
> +{
> +	if (!atomic_inc_not_zero(&ct->ct_general.use))
> +		return;
> +
> +	if (nf_ct_should_gc(ct))
> +		nf_ct_kill(ct);
> +
> +	nf_ct_put(ct);
> +}
> +
>  /*
>   * Warning :
>   * - Caller must take a reference on returned object
> @@ -499,6 +505,17 @@ begin:
>  	bucket = reciprocal_scale(hash, hsize);
>  
>  	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
> +		struct nf_conn *ct;
> +
> +		ct = nf_ct_tuplehash_to_ctrack(h);
> +		if (nf_ct_is_expired(ct)) {
> +			nf_ct_gc_expired(ct);
> +			continue;
> +		}
> +
> +		if (nf_ct_is_dying(ct))
> +			continue;
> +
>  		if (nf_ct_key_equal(h, tuple, zone, net)) {
>  			NF_CT_STAT_INC_ATOMIC(net, found);
>  			return h;

Florian, I do not see how this part is safe against concurrent lookups
and deletes ?

At least the hlist_nulls_for_each_entry_rcu() looks buggy, since
fetching the next pointer would trigger a use after free ?

Thanks !





* Re: [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries
  2016-08-19 11:36 ` [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries Florian Westphal
@ 2016-08-19 14:41   ` Eric Dumazet
  2016-08-19 15:22     ` Florian Westphal
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-08-19 14:41 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote:
> Conntrack gc worker to evict stale entries.

...

> +
> +		hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
> +			tmp = nf_ct_tuplehash_to_ctrack(h);
> +
> +			if (nf_ct_is_expired(tmp)) {
> +				nf_ct_gc_expired(tmp);
> +				expired_count++;
> +				continue;

Same remark about hlist_nulls_for_each_entry_rcu() not 'safe' here

> +			}
> +		}
> +
> +		rcu_read_unlock();
> +		cond_resched();

This could use cond_resched_rcu_qs() 

> +	} while (++buckets < goal &&
> +		 expired_count < GC_MAX_EVICTS);
> +
> +	if (gc_work->exiting)
> +		return;
> +



* Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-19 14:37   ` Eric Dumazet
@ 2016-08-19 15:16     ` Florian Westphal
  2016-08-19 15:24       ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 15:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > +/* caller must hold rcu readlock and none of the nf_conntrack_locks */
> > +static void nf_ct_gc_expired(struct nf_conn *ct)
> > +{
> > +	if (!atomic_inc_not_zero(&ct->ct_general.use))
> > +		return;
> > +
> > +	if (nf_ct_should_gc(ct))
> > +		nf_ct_kill(ct);
> > +
> > +	nf_ct_put(ct);
> > +}
> > +
> >  /*
> >   * Warning :
> >   * - Caller must take a reference on returned object
> > @@ -499,6 +505,17 @@ begin:
> >  	bucket = reciprocal_scale(hash, hsize);
> >  
> >  	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
> > +		struct nf_conn *ct;
> > +
> > +		ct = nf_ct_tuplehash_to_ctrack(h);
> > +		if (nf_ct_is_expired(ct)) {
> > +			nf_ct_gc_expired(ct);
> > +			continue;
> > +		}
> > +
> > +		if (nf_ct_is_dying(ct))
> > +			continue;
> > +
> >  		if (nf_ct_key_equal(h, tuple, zone, net)) {
> >  			NF_CT_STAT_INC_ATOMIC(net, found);
> >  			return h;
> 
> Florian, I do not see how this part is safe against concurrent lookups
> and deletes ?
> 
> At least the hlist_nulls_for_each_entry_rcu() looks buggy, since
> fetching the next pointer would trigger a use after free ?

Hmm, ____nf_conntrack_find caller needs to hold rcu_read_lock,
in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
of the page.

Should be same as (old) 'death_by_timeout' timer firing during
hlist_nulls_for_each_entry_rcu walk.

Thanks for looking at this Eric!


* Re: [PATCH nf-next 4/6] netfilter: conntrack: add gc worker to remove timed-out entries
  2016-08-19 14:41   ` Eric Dumazet
@ 2016-08-19 15:22     ` Florian Westphal
  0 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 15:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-08-19 at 13:36 +0200, Florian Westphal wrote:
> > Conntrack gc worker to evict stale entries.
> 
> ...
> 
> > +
> > +		hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
> > +			tmp = nf_ct_tuplehash_to_ctrack(h);
> > +
> > +			if (nf_ct_is_expired(tmp)) {
> > +				nf_ct_gc_expired(tmp);
> > +				expired_count++;
> > +				continue;
> 
> Same remark about hlist_nulls_for_each_entry_rcu() not 'safe' here

Hmm, I am missing something; I will wait for your answer on my previous
reply ...

I thought this was safe due to rcu -- another cpu could evict an entry
as well, so we can always see freed objects with refcnt 0.

(nf_ct_gc_expired() skips those to avoid double-free).

> > +		rcu_read_unlock();
> > +		cond_resched();
> 
> This could use cond_resched_rcu_qs() 

Indeed, I did not know about this, nice!

Will add it in V2, thanks!



* Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-19 15:16     ` Florian Westphal
@ 2016-08-19 15:24       ` Eric Dumazet
  2016-08-19 16:04         ` Florian Westphal
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-08-19 15:24 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:

> Hmm, ____nf_conntrack_find caller needs to hold rcu_read_lock,
> in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> of the page.

Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
period, and object can be immediately reused and recycled.

@next pointer can definitely be overwritten.

> 
> Should be same as (old) 'death_by_timeout' timer firing during
> hlist_nulls_for_each_entry_rcu walk.
> 
> Thanks for looking at this Eric!




* Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-19 15:24       ` Eric Dumazet
@ 2016-08-19 16:04         ` Florian Westphal
  2016-08-22  4:13           ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2016-08-19 16:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> 
> > Hmm, ____nf_conntrack_find caller needs to hold rcu_read_lock,
> > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > of the page.
> 
> Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> period, and object can be immediately reused and recycled.
> 
> @next pointer can definitely be overwritten.

I see.  Isn't that detected by the nulls magic (to restart the lookup
if an entry was moved to another chain due to an overwritten next pointer)?

I don't see a hlist_nulls_for_each_entry_rcu_safe variant like we have
for normal lists (list_for_each_entry_safe); do I need to add one?
Currently we have this:

rcu_read_lock()
____nf_conntrack_find
hlist_nulls_for_each_entry_rcu ...

 if (nf_ct_is_expired(ct)) { // lazy test, ct can be reused here
	 nf_ct_gc_expired(ct);
	 continue;
 }

nf_ct_gc_expired() does this:

static void nf_ct_gc_expired(struct nf_conn *ct)
{
        if (!atomic_inc_not_zero(&ct->ct_general.use))
                return;

        if (nf_ct_should_gc(ct))
                nf_ct_kill(ct);

        nf_ct_put(ct);
}

So we only call nf_ct_kill when we have a reference and we know the
ct is in the hash (the check is in nf_ct_should_gc()).

So once we put our reference, the object can be released, but the
next pointer is not altered.

In case the ct object is reused right after this last nf_ct_put, it
could already be re-inserted, e.g. into the dying list or back into
the hash table -- but is this a problem?

A 'dead' (non-recycled) element is caught by the atomic_inc_not_zero
in nf_ct_gc_expired; an about-to-be-inserted one (not yet in the hash)
is detected by nf_ct_should_gc() (it checks the confirmed bit in
ct->status); an already-(re)inserted one (in the hash table) would
lead us to 'magically' move to a different hash chain in
hlist_nulls_for_each_entry_rcu.

But we would then hit the wrong nulls list terminator and restart
the traversal.

Does that make sense?
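
For reference, the restart mentioned above is the existing nulls check
at the end of ____nf_conntrack_find(), roughly:

	/* ended on an unexpected nulls value: the entry we were
	 * walking was moved to another chain, restart the lookup */
	if (get_nulls_value(n) != bucket) {
		NF_CT_STAT_INC_ATOMIC(net, search_restart);
		goto begin;
	}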


* Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-19 16:04         ` Florian Westphal
@ 2016-08-22  4:13           ` Eric Dumazet
  2016-08-22  8:53             ` Florian Westphal
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2016-08-22  4:13 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Fri, 2016-08-19 at 18:04 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> > 
> > > Hmm, ____nf_conntrack_find caller needs to hold rcu_read_lock,
> > > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > > of the page.
> > 
> > Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> > period, and object can be immediately reused and recycled.
> > 
> > @next pointer can definitely be overwritten.
> 
> I see.  Isn't that detected by the nulls magic (to restart the lookup
> if an entry was moved to another chain due to an overwritten next pointer)?

Well, you did not add the nulls magic in your code ;)

It might be fine, since it should be a rare event, and garbage
collection is best effort, so you might add a comment in gc_worker() why
it is probably overkill to restart the loop in this unlikely event.

BTW, maybe nf_conntrack_tuple_taken() should get the nulls magic check,
as it is currently missing.
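
A hypothetical version of the gc loop with the nulls check added, for
illustration only (not something the posted series contains):

	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
		tmp = nf_ct_tuplehash_to_ctrack(h);
		if (nf_ct_is_expired(tmp))
			nf_ct_gc_expired(tmp);
	}
	if (get_nulls_value(n) != i) {
		/* chain moved under us; gc is best effort, so do not
		 * restart -- the next run will rescan this bucket */
	}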






* Re: [PATCH nf-next 2/6] netfilter: conntrack: get rid of conntrack timer
  2016-08-22  4:13           ` Eric Dumazet
@ 2016-08-22  8:53             ` Florian Westphal
  0 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2016-08-22  8:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2016-08-19 at 18:04 +0200, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On Fri, 2016-08-19 at 17:16 +0200, Florian Westphal wrote:
> > > 
> > > > Hmm, ____nf_conntrack_find caller needs to hold rcu_read_lock,
> > > > in case object is free'd SLAB_DESTROY_BY_RCU should delay actual release
> > > > of the page.
> > > 
> > > Well, point is that SLAB_DESTROY_BY_RCU means that we have no grace
> > > period, and object can be immediately reused and recycled.
> > > 
> > > @next pointer can definitely be overwritten.
> > 
> > I see.  Isn't that detected by the nulls magic (to restart the lookup
> > if an entry was moved to another chain due to an overwritten next pointer)?
> 
> Well, you did not add the nulls magic in your code ;)

Oh.  Right, it's indeed missing in the gc code.

> It might be fine, since it should be a rare event, and garbage
> collection is best effort, so you might add a comment in gc_worker() why
> it is probably overkill to restart the loop in this unlikely event.

Seems like a good idea, I will add it.

> BTW, maybe nf_conntrack_tuple_taken() should get the nulls magic check,
> as it is currently missing.

Good point, I will investigate.

Thanks Eric!

