* [PATCH nf-next v2 0/7] netfilter: get rid of per-object conntrack timers
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel

During NFWS 2016 it was mentioned that per-conntrack timers have
two drawbacks:
 - the 5-day default established timeout is very large and brings
   extra constraints for the timer subsystem.
 - most distros enable timer stats, so the timer struct eats 80 bytes
   in each conntrack object.

This series replaces the per-object struct timer with a u32 jiffies
stamp and one global delayed work queue for conntrack eviction.
The size of the nf_conn struct is reduced to 256 bytes on x86_64.

Eviction is performed from the packet path when doing table lookups;
for cases where we have idle periods, the work queue is used.
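
The expiry check itself is a wraparound-safe signed comparison of the
u32 stamp against jiffies; a minimal sketch (the helper name here is
hypothetical, the series adds the real check as nf_ct_is_expired() in
patch 2):

	static inline bool stamp_expired(u32 timeout)
	{
		/* signed 32bit delta handles jiffies wraparound */
		return (s32)(timeout - (u32)jiffies) <= 0;
	}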

No changes since v1, except the addition of a comment in patch #3 and
a one-line change (replace cond_resched with cond_resched_rcu_qs).

Patch #7 was not part of v1.

Tested with the following script, with conntrackd running in 'reliable
event mode' and httpterm running on another host:

-----------------------------------------------------------
random_resize() {
        while true; do
                # random bucket count, minimum 8192
                RND=$((RANDOM % 256000))
                RND=$((RND + 8192))
                sysctl net.netfilter.nf_conntrack_buckets=$RND
                sleep $((RANDOM % 120))
        done
}

random_flush() {
        while true; do
                sleep $((RANDOM % 120))
                conntrack -F
        done
}

random_startstop() {
        while true; do
                sleep $((RANDOM % 120))
                pkill -STOP conntrackd
                sleep $((RANDOM % 10))
                pkill -CONT conntrackd
        done
}

http-client -u 1000 -t 3 -F 192.168.10.50 -G 192.168.10.17:8001 &
http-client -u 1000 -t 3 -F 192.168.10.51 -G 192.168.10.17:8001 &
http-client -u 1000 -t 3 -F 192.168.10.52 -G 192.168.10.17:8001 &

random_resize &
random_flush &
random_startstop &

wait
-----------------------------------------------------------

 include/net/netfilter/nf_conntrack.h        |   36 +++--
 include/net/netfilter/nf_conntrack_ecache.h |   17 +-
 net/netfilter/nf_conntrack_core.c           |  194 +++++++++++++++++++++-------
 net/netfilter/nf_conntrack_ecache.c         |   22 ++-
 net/netfilter/nf_conntrack_netlink.c        |   39 ++++-
 net/netfilter/nf_conntrack_pptp.c           |    3 
 net/netfilter/nf_nat_core.c                 |    6 
 7 files changed, 223 insertions(+), 94 deletions(-)

* [PATCH v2 nf-next 1/7] netfilter: don't rely on DYING bit to detect when destroy event was sent
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

The reliable event delivery mode currently (ab)uses the DYING bit to
detect which entries on the dying list have to be skipped when the
ecache worker re-delivers events.

Currently, when we delete the conntrack from the main table, we only
set this bit if we could also deliver the netlink destroy event to
userspace.

If we fail, we move it to the dying list; the ecache worker will then
reattempt event delivery for all confirmed conntracks on the dying list
that do not have the DYING bit set.

Once the timer is gone, we can no longer use if (del_timer()) to detect
when we 'stole' the reference count owned by the timer/hash entry, so
we need some other way to avoid racing with another cpu.

Pablo suggested adding a marker in the ecache extension that skips
entries that have been unhashed from the main table but are still
waiting for the last reference count to be dropped (e.g. because one
skb waiting on an nfqueue verdict still holds a reference).

We do this by adding a tristate.
If we fail to deliver the destroy event, make a note of this in the
ecache extension.  The worker can then skip all entries that are in a
different state: either they never had a destroy event delivered (e.g.
because the netlink backend was not loaded), or redelivery already
took place.

Once the conntrack timer is removed we will be able to replace the
del_timer() test with test_and_set_bit(IPS_DYING_BIT, &ct->status) to
avoid racing with another cpu that tries to evict the same conntrack.

Because IPS_DYING_BIT will then be set right before we report the
destroy event, we can no longer skip event reporting when the dying
bit is set.
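
Patch 2 below implements this; roughly (sketch of the new
nf_ct_delete() entry point):

	if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
		return false;	/* another cpu already owns this eviction */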

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack_ecache.h | 17 ++++++++++++-----
 net/netfilter/nf_conntrack_ecache.c         | 22 ++++++++++++++--------
 2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_ecache.h b/include/net/netfilter/nf_conntrack_ecache.h
index fa36447..12d967b 100644
--- a/include/net/netfilter/nf_conntrack_ecache.h
+++ b/include/net/netfilter/nf_conntrack_ecache.h
@@ -12,12 +12,19 @@
 #include <linux/netfilter/nf_conntrack_tuple_common.h>
 #include <net/netfilter/nf_conntrack_extend.h>
 
+enum nf_ct_ecache_state {
+	NFCT_ECACHE_UNKNOWN,		/* destroy event not sent */
+	NFCT_ECACHE_DESTROY_FAIL,	/* tried but failed to send destroy event */
+	NFCT_ECACHE_DESTROY_SENT,	/* sent destroy event after failure */
+};
+
 struct nf_conntrack_ecache {
-	unsigned long cache;	/* bitops want long */
-	unsigned long missed;	/* missed events */
-	u16 ctmask;		/* bitmask of ct events to be delivered */
-	u16 expmask;		/* bitmask of expect events to be delivered */
-	u32 portid;		/* netlink portid of destroyer */
+	unsigned long cache;		/* bitops want long */
+	unsigned long missed;		/* missed events */
+	u16 ctmask;			/* bitmask of ct events to be delivered */
+	u16 expmask;			/* bitmask of expect events to be delivered */
+	u32 portid;			/* netlink portid of destroyer */
+	enum nf_ct_ecache_state state;	/* ecache state */
 };
 
 static inline struct nf_conntrack_ecache *
diff --git a/net/netfilter/nf_conntrack_ecache.c b/net/netfilter/nf_conntrack_ecache.c
index d28011b..da9df2d 100644
--- a/net/netfilter/nf_conntrack_ecache.c
+++ b/net/netfilter/nf_conntrack_ecache.c
@@ -49,8 +49,13 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 
 	hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
 		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
+		struct nf_conntrack_ecache *e;
 
-		if (nf_ct_is_dying(ct))
+		if (!nf_ct_is_confirmed(ct))
+			continue;
+
+		e = nf_ct_ecache_find(ct);
+		if (!e || e->state != NFCT_ECACHE_DESTROY_FAIL)
 			continue;
 
 		if (nf_conntrack_event(IPCT_DESTROY, ct)) {
@@ -58,8 +63,7 @@ static enum retry_state ecache_work_evict_list(struct ct_pcpu *pcpu)
 			break;
 		}
 
-		/* we've got the event delivered, now it's dying */
-		set_bit(IPS_DYING_BIT, &ct->status);
+		e->state = NFCT_ECACHE_DESTROY_SENT;
 		refs[evicted] = ct;
 
 		if (++evicted >= ARRAY_SIZE(refs)) {
@@ -130,7 +134,7 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
 	if (!e)
 		goto out_unlock;
 
-	if (nf_ct_is_confirmed(ct) && !nf_ct_is_dying(ct)) {
+	if (nf_ct_is_confirmed(ct)) {
 		struct nf_ct_event item = {
 			.ct	= ct,
 			.portid	= e->portid ? e->portid : portid,
@@ -150,11 +154,13 @@ int nf_conntrack_eventmask_report(unsigned int eventmask, struct nf_conn *ct,
 				 * triggered by a process, we store the PORTID
 				 * to include it in the retransmission.
 				 */
-				if (eventmask & (1 << IPCT_DESTROY) &&
-				    e->portid == 0 && portid != 0)
-					e->portid = portid;
-				else
+				if (eventmask & (1 << IPCT_DESTROY)) {
+					if (e->portid == 0 && portid != 0)
+						e->portid = portid;
+					e->state = NFCT_ECACHE_DESTROY_FAIL;
+				} else {
 					e->missed |= eventmask;
+				}
 			} else {
 				e->missed &= ~missed;
 			}
-- 
2.7.3


* [PATCH v2 nf-next 2/7] netfilter: conntrack: get rid of conntrack timer
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Remove the timer and use a 32bit jiffies value holding the timestamp
until which the entry is valid.

During conntrack lookup, even before doing the tuple comparison, check
the timeout value and evict the entry in case it is too old.
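
The signed comparison makes this safe across jiffies wraparound: e.g.
with nfct_time_stamp == 0xfffffff0 and ct->timeout == 5 (21 jiffies in
the future, past the wrap), (s32)(5 - 0xfffffff0) == 21 > 0, so the
entry is correctly treated as still valid.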

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry.

Because lookup is always lockless, we need to bump the refcnt once
when we evict, else we could try to evict an already-dead entry that
is being recycled.

This is the standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via a work queue
and further places where we can reap obsoleted entries (e.g. during
netlink dumps); this is needed to prevent expired conntracks from
hanging around for too long when the lookup rate is low after a busy
period.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h | 23 +++++++--
 net/netfilter/nf_conntrack_core.c    | 91 ++++++++++++++++++++----------------
 net/netfilter/nf_conntrack_netlink.c | 14 ++----
 net/netfilter/nf_conntrack_pptp.c    |  3 +-
 net/netfilter/nf_nat_core.c          |  6 ---
 5 files changed, 74 insertions(+), 63 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 2a12748..6d8cf06 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -42,7 +42,6 @@ union nf_conntrack_expect_proto {
 
 #include <linux/types.h>
 #include <linux/skbuff.h>
-#include <linux/timer.h>
 
 #ifdef CONFIG_NETFILTER_DEBUG
 #define NF_CT_ASSERT(x)		WARN_ON(!(x))
@@ -73,7 +72,7 @@ struct nf_conn_help {
 #include <net/netfilter/ipv6/nf_conntrack_ipv6.h>
 
 struct nf_conn {
-	/* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
+	/* Usage count in here is 1 for hash table, 1 per skb,
 	 * plus 1 for any connection(s) we are `master' for
 	 *
 	 * Hint, SKB address this struct and refcnt via skb->nfct and
@@ -96,8 +95,8 @@ struct nf_conn {
 	/* Have we seen traffic both ways yet? (bitset) */
 	unsigned long status;
 
-	/* Timer function; drops refcnt when it goes off. */
-	struct timer_list timeout;
+	/* jiffies32 when this ct is considered dead */
+	u32 timeout;
 
 	possible_net_t ct_net;
 
@@ -291,14 +290,28 @@ static inline bool nf_is_loopback_packet(const struct sk_buff *skb)
 	return skb->dev && skb->skb_iif && skb->dev->flags & IFF_LOOPBACK;
 }
 
+#define nfct_time_stamp ((u32)(jiffies))
+
 /* jiffies until ct expires, 0 if already expired */
 static inline unsigned long nf_ct_expires(const struct nf_conn *ct)
 {
-	long timeout = (long)ct->timeout.expires - (long)jiffies;
+	s32 timeout = ct->timeout - nfct_time_stamp;
 
 	return timeout > 0 ? timeout : 0;
 }
 
+static inline bool nf_ct_is_expired(const struct nf_conn *ct)
+{
+	return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
+}
+
+/* use after obtaining a reference count */
+static inline bool nf_ct_should_gc(const struct nf_conn *ct)
+{
+	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
+	       !nf_ct_is_dying(ct);
+}
+
 struct kernel_param;
 
 int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 7d90a5d..38c50d1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -371,7 +371,6 @@ destroy_conntrack(struct nf_conntrack *nfct)
 
 	pr_debug("destroy_conntrack(%p)\n", ct);
 	NF_CT_ASSERT(atomic_read(&nfct->use) == 0);
-	NF_CT_ASSERT(!timer_pending(&ct->timeout));
 
 	if (unlikely(nf_ct_is_template(ct))) {
 		nf_ct_tmpl_free(ct);
@@ -434,35 +433,30 @@ bool nf_ct_delete(struct nf_conn *ct, u32 portid, int report)
 {
 	struct nf_conn_tstamp *tstamp;
 
+	if (test_and_set_bit(IPS_DYING_BIT, &ct->status))
+		return false;
+
 	tstamp = nf_conn_tstamp_find(ct);
 	if (tstamp && tstamp->stop == 0)
 		tstamp->stop = ktime_get_real_ns();
 
-	if (nf_ct_is_dying(ct))
-		goto delete;
-
 	if (nf_conntrack_event_report(IPCT_DESTROY, ct,
 				    portid, report) < 0) {
-		/* destroy event was not delivered */
+		/* destroy event was not delivered. nf_ct_put will
+		 * be done by event cache worker on redelivery.
+		 */
 		nf_ct_delete_from_lists(ct);
 		nf_conntrack_ecache_delayed_work(nf_ct_net(ct));
 		return false;
 	}
 
 	nf_conntrack_ecache_work(nf_ct_net(ct));
-	set_bit(IPS_DYING_BIT, &ct->status);
- delete:
 	nf_ct_delete_from_lists(ct);
 	nf_ct_put(ct);
 	return true;
 }
 EXPORT_SYMBOL_GPL(nf_ct_delete);
 
-static void death_by_timeout(unsigned long ul_conntrack)
-{
-	nf_ct_delete((struct nf_conn *)ul_conntrack, 0, 0);
-}
-
 static inline bool
 nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 		const struct nf_conntrack_tuple *tuple,
@@ -480,6 +474,18 @@ nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
 	       net_eq(net, nf_ct_net(ct));
 }
 
+/* caller must hold rcu readlock and none of the nf_conntrack_locks */
+static void nf_ct_gc_expired(struct nf_conn *ct)
+{
+	if (!atomic_inc_not_zero(&ct->ct_general.use))
+		return;
+
+	if (nf_ct_should_gc(ct))
+		nf_ct_kill(ct);
+
+	nf_ct_put(ct);
+}
+
 /*
  * Warning :
  * - Caller must take a reference on returned object
@@ -499,6 +505,17 @@ begin:
 	bucket = reciprocal_scale(hash, hsize);
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[bucket], hnnode) {
+		struct nf_conn *ct;
+
+		ct = nf_ct_tuplehash_to_ctrack(h);
+		if (nf_ct_is_expired(ct)) {
+			nf_ct_gc_expired(ct);
+			continue;
+		}
+
+		if (nf_ct_is_dying(ct))
+			continue;
+
 		if (nf_ct_key_equal(h, tuple, zone, net)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			return h;
@@ -597,7 +614,6 @@ nf_conntrack_hash_check_insert(struct nf_conn *ct)
 				    zone, net))
 			goto out;
 
-	add_timer(&ct->timeout);
 	smp_wmb();
 	/* The caller holds a reference to this object */
 	atomic_set(&ct->ct_general.use, 2);
@@ -750,8 +766,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	/* Timer relative to confirmation time, not original
 	   setting time, otherwise we'd get timer wrap in
 	   weird delay cases. */
-	ct->timeout.expires += jiffies;
-	add_timer(&ct->timeout);
+	ct->timeout += nfct_time_stamp;
 	atomic_inc(&ct->ct_general.use);
 	ct->status |= IPS_CONFIRMED;
 
@@ -814,8 +829,16 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 
 	hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[hash], hnnode) {
 		ct = nf_ct_tuplehash_to_ctrack(h);
-		if (ct != ignored_conntrack &&
-		    nf_ct_key_equal(h, tuple, zone, net)) {
+
+		if (ct == ignored_conntrack)
+			continue;
+
+		if (nf_ct_is_expired(ct)) {
+			nf_ct_gc_expired(ct);
+			continue;
+		}
+
+		if (nf_ct_key_equal(h, tuple, zone, net)) {
 			NF_CT_STAT_INC_ATOMIC(net, found);
 			rcu_read_unlock();
 			return 1;
@@ -843,6 +866,11 @@ static unsigned int early_drop_list(struct net *net,
 	hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
 		tmp = nf_ct_tuplehash_to_ctrack(h);
 
+		if (nf_ct_is_expired(tmp)) {
+			nf_ct_gc_expired(tmp);
+			continue;
+		}
+
 		if (test_bit(IPS_ASSURED_BIT, &tmp->status) ||
 		    !net_eq(nf_ct_net(tmp), net) ||
 		    nf_ct_is_dying(tmp))
@@ -860,7 +888,6 @@ static unsigned int early_drop_list(struct net *net,
 		 */
 		if (net_eq(nf_ct_net(tmp), net) &&
 		    nf_ct_is_confirmed(tmp) &&
-		    del_timer(&tmp->timeout) &&
 		    nf_ct_delete(tmp, 0, 0))
 			drops++;
 
@@ -930,8 +957,6 @@ __nf_conntrack_alloc(struct net *net,
 	/* save hash for reusing when confirming */
 	*(unsigned long *)(&ct->tuplehash[IP_CT_DIR_REPLY].hnnode.pprev) = hash;
 	ct->status = 0;
-	/* Don't set timer yet: wait for confirmation */
-	setup_timer(&ct->timeout, death_by_timeout, (unsigned long)ct);
 	write_pnet(&ct->ct_net, net);
 	memset(&ct->__nfct_init_offset[0], 0,
 	       offsetof(struct nf_conn, proto) -
@@ -1305,7 +1330,6 @@ void __nf_ct_refresh_acct(struct nf_conn *ct,
 			  unsigned long extra_jiffies,
 			  int do_acct)
 {
-	NF_CT_ASSERT(ct->timeout.data == (unsigned long)ct);
 	NF_CT_ASSERT(skb);
 
 	/* Only update if this is not a fixed timeout */
@@ -1313,18 +1337,10 @@ void __nf_ct_refresh_acct(struct nf_conn *ct,
 		goto acct;
 
 	/* If not in hash table, timer will not be active yet */
-	if (!nf_ct_is_confirmed(ct)) {
-		ct->timeout.expires = extra_jiffies;
-	} else {
-		unsigned long newtime = jiffies + extra_jiffies;
-
-		/* Only update the timeout if the new timeout is at least
-		   HZ jiffies from the old timeout. Need del_timer for race
-		   avoidance (may already be dying). */
-		if (newtime - ct->timeout.expires >= HZ)
-			mod_timer_pending(&ct->timeout, newtime);
-	}
+	if (nf_ct_is_confirmed(ct))
+		extra_jiffies += nfct_time_stamp;
 
+	ct->timeout = extra_jiffies;
 acct:
 	if (do_acct)
 		nf_ct_acct_update(ct, ctinfo, skb->len);
@@ -1339,11 +1355,7 @@ bool __nf_ct_kill_acct(struct nf_conn *ct,
 	if (do_acct)
 		nf_ct_acct_update(ct, ctinfo, skb->len);
 
-	if (del_timer(&ct->timeout)) {
-		ct->timeout.function((unsigned long)ct);
-		return true;
-	}
-	return false;
+	return nf_ct_delete(ct, 0, 0);
 }
 EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
 
@@ -1478,11 +1490,8 @@ void nf_ct_iterate_cleanup(struct net *net,
 
 	while ((ct = get_next_corpse(net, iter, data, &bucket)) != NULL) {
 		/* Time to push up daises... */
-		if (del_timer(&ct->timeout))
-			nf_ct_delete(ct, portid, report);
-
-		/* ... else the timer will get him soon. */
 
+		nf_ct_delete(ct, portid, report);
 		nf_ct_put(ct);
 		cond_resched();
 	}
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 68800c1..81fd34c 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1144,9 +1144,7 @@ static int ctnetlink_del_conntrack(struct net *net, struct sock *ctnl,
 		}
 	}
 
-	if (del_timer(&ct->timeout))
-		nf_ct_delete(ct, NETLINK_CB(skb).portid, nlmsg_report(nlh));
-
+	nf_ct_delete(ct, NETLINK_CB(skb).portid, nlmsg_report(nlh));
 	nf_ct_put(ct);
 
 	return 0;
@@ -1514,11 +1512,10 @@ static int ctnetlink_change_timeout(struct nf_conn *ct,
 {
 	u_int32_t timeout = ntohl(nla_get_be32(cda[CTA_TIMEOUT]));
 
-	if (!del_timer(&ct->timeout))
-		return -ETIME;
+	ct->timeout = nfct_time_stamp + timeout * HZ;
 
-	ct->timeout.expires = jiffies + timeout * HZ;
-	add_timer(&ct->timeout);
+	if (test_bit(IPS_DYING_BIT, &ct->status))
+		return -ETIME;
 
 	return 0;
 }
@@ -1716,9 +1713,8 @@ ctnetlink_create_conntrack(struct net *net,
 
 	if (!cda[CTA_TIMEOUT])
 		goto err1;
-	ct->timeout.expires = ntohl(nla_get_be32(cda[CTA_TIMEOUT]));
 
-	ct->timeout.expires = jiffies + ct->timeout.expires * HZ;
+	ct->timeout = nfct_time_stamp + ntohl(nla_get_be32(cda[CTA_TIMEOUT])) * HZ;
 
 	rcu_read_lock();
  	if (cda[CTA_HELP]) {
diff --git a/net/netfilter/nf_conntrack_pptp.c b/net/netfilter/nf_conntrack_pptp.c
index 5588c7a..f60a475 100644
--- a/net/netfilter/nf_conntrack_pptp.c
+++ b/net/netfilter/nf_conntrack_pptp.c
@@ -157,8 +157,7 @@ static int destroy_sibling_or_exp(struct net *net, struct nf_conn *ct,
 		pr_debug("setting timeout of conntrack %p to 0\n", sibling);
 		sibling->proto.gre.timeout	  = 0;
 		sibling->proto.gre.stream_timeout = 0;
-		if (del_timer(&sibling->timeout))
-			sibling->timeout.function((unsigned long)sibling);
+		nf_ct_kill(sibling);
 		nf_ct_put(sibling);
 		return 1;
 	} else {
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index de31818..81ae41f 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -565,16 +565,10 @@ static int nf_nat_proto_clean(struct nf_conn *ct, void *data)
 	 * Else, when the conntrack is destoyed, nf_nat_cleanup_conntrack()
 	 * will delete entry from already-freed table.
 	 */
-	if (!del_timer(&ct->timeout))
-		return 1;
-
 	ct->status &= ~IPS_NAT_DONE_MASK;
-
 	rhashtable_remove_fast(&nf_nat_bysource_table, &ct->nat_bysource,
 			       nf_nat_bysource_params);
 
-	add_timer(&ct->timeout);
-
 	/* don't delete conntrack.  Although that would make things a lot
 	 * simpler, we'd end up flushing all conntracks on nat rmmod.
 	 */
-- 
2.7.3


* [PATCH v2 nf-next 3/7] netfilter: evict stale entries on netlink dumps
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

When dumping we already have to look at the entire table, so we might
as well toss those entries whose timeout value is in the past.

We also look at every entry during resize operations.
However, eviction there is not as simple because we hold the
global resize lock, so we can't evict without adding an 'expired' list
to drop from later.  Considering that resizes are very rare, it doesn't
seem worth doing.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_netlink.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 81fd34c..dedbe4b 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -815,14 +815,23 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	struct hlist_nulls_node *n;
 	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
 	u_int8_t l3proto = nfmsg->nfgen_family;
-	int res;
+	struct nf_conn *nf_ct_evict[8];
+	int res, i;
 	spinlock_t *lockp;
 
 	last = (struct nf_conn *)cb->args[1];
+	i = 0;
 
 	local_bh_disable();
 	for (; cb->args[0] < nf_conntrack_htable_size; cb->args[0]++) {
 restart:
+		while (i) {
+			i--;
+			if (nf_ct_should_gc(nf_ct_evict[i]))
+				nf_ct_kill(nf_ct_evict[i]);
+			nf_ct_put(nf_ct_evict[i]);
+		}
+
 		lockp = &nf_conntrack_locks[cb->args[0] % CONNTRACK_LOCKS];
 		nf_conntrack_lock(lockp);
 		if (cb->args[0] >= nf_conntrack_htable_size) {
@@ -834,6 +843,13 @@ restart:
 			if (NF_CT_DIRECTION(h) != IP_CT_DIR_ORIGINAL)
 				continue;
 			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (nf_ct_is_expired(ct)) {
+				if (i < ARRAY_SIZE(nf_ct_evict) &&
+				    atomic_inc_not_zero(&ct->ct_general.use))
+					nf_ct_evict[i++] = ct;
+				continue;
+			}
+
 			if (!net_eq(net, nf_ct_net(ct)))
 				continue;
 
@@ -875,6 +891,13 @@ out:
 	if (last)
 		nf_ct_put(last);
 
+	while (i) {
+		i--;
+		if (nf_ct_should_gc(nf_ct_evict[i]))
+			nf_ct_kill(nf_ct_evict[i]);
+		nf_ct_put(nf_ct_evict[i]);
+	}
+
 	return skb->len;
 }
 
-- 
2.7.3


* [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and no more than 8k buckets) per run to avoid hogging the cpu.

This means that a complete scan of the table will take several minutes
of wall-clock time.
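
For example, with 65536 buckets and the constants below
(GC_MAX_BUCKETS_DIV = 64, GC_INTERVAL = 5 * HZ), each run covers
65536 / 64 = 1024 buckets, so a complete pass needs 64 runs,
i.e. about 320 seconds.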

Considering that during normal operation the gc run will never have to
evict any entries, because those evictions happen from the packet path,
this should be fine.

We only need gc to make sure userspace (conntrack event listeners)
eventually learns of the timeout, and for resource reclaim in case the
system becomes idle.

We do not disable BH, and we cond_resched after every bucket, so this
should not introduce noticeable latencies either.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on
nulls value change in gc worker, suggested by Eric Dumazet.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 78 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 38c50d1..e0a7889 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -72,11 +72,24 @@ EXPORT_SYMBOL_GPL(nf_conntrack_expect_lock);
 struct hlist_nulls_head *nf_conntrack_hash __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_hash);
 
+struct conntrack_gc_work {
+	struct delayed_work	dwork;
+	u32			last_bucket;
+	bool			exiting;
+};
+
 static __read_mostly struct kmem_cache *nf_conntrack_cachep;
 static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
+#define GC_MAX_BUCKETS_DIV	64u
+#define GC_MAX_BUCKETS		8192u
+#define GC_INTERVAL		(5 * HZ)
+#define GC_MAX_EVICTS		256u
+
+static struct conntrack_gc_work conntrack_gc_work;
+
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
 	spin_lock(lock);
@@ -921,6 +934,63 @@ static noinline int early_drop(struct net *net, unsigned int _hash)
 	return false;
 }
 
+static void gc_worker(struct work_struct *work)
+{
+	unsigned int i, goal, buckets = 0, expired_count = 0;
+	unsigned long next_run = GC_INTERVAL;
+	struct conntrack_gc_work *gc_work;
+
+	gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
+
+	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
+	i = gc_work->last_bucket;
+
+	do {
+		struct nf_conntrack_tuple_hash *h;
+		struct hlist_nulls_head *ct_hash;
+		struct hlist_nulls_node *n;
+		unsigned int hashsz;
+		struct nf_conn *tmp;
+
+		i++;
+		rcu_read_lock();
+
+		nf_conntrack_get_ht(&ct_hash, &hashsz);
+		if (i >= hashsz)
+			i = 0;
+
+		hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
+			tmp = nf_ct_tuplehash_to_ctrack(h);
+
+			if (nf_ct_is_expired(tmp)) {
+				nf_ct_gc_expired(tmp);
+				expired_count++;
+				continue;
+			}
+		}
+
+		/* could check get_nulls_value() here and restart if ct
+		 * was moved to another chain.  But given gc is best-effort
+		 * we will just continue with next hash slot.
+		 */
+		rcu_read_unlock();
+		cond_resched_rcu_qs();
+	} while (++buckets < goal &&
+		 expired_count < GC_MAX_EVICTS);
+
+	if (gc_work->exiting)
+		return;
+
+	gc_work->last_bucket = i;
+	schedule_delayed_work(&gc_work->dwork, next_run);
+}
+
+static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
+{
+	INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
+	gc_work->exiting = false;
+}
+
 static struct nf_conn *
 __nf_conntrack_alloc(struct net *net,
 		     const struct nf_conntrack_zone *zone,
@@ -1527,6 +1597,7 @@ static int untrack_refs(void)
 
 void nf_conntrack_cleanup_start(void)
 {
+	conntrack_gc_work.exiting = true;
 	RCU_INIT_POINTER(ip_ct_attach, NULL);
 }
 
@@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
 	while (untrack_refs() > 0)
 		schedule();
 
+	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
+	/* can be re-scheduled once */
+	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
 	nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
 
 	nf_conntrack_proto_fini();
@@ -1810,6 +1884,10 @@ int nf_conntrack_init_start(void)
 	}
 	/*  - and look it like as a confirmed connection */
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
+
+	conntrack_gc_work_init(&conntrack_gc_work);
+	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
+
 	return 0;
 
 err_proto:
-- 
2.7.3


* [PATCH v2 nf-next 5/7] netfilter: conntrack: resched gc again if eviction rate is high
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

If we evicted a large fraction of the scanned conntrack entries,
re-schedule the next gc cycle for immediate execution.

This triggers during tests where load is high and then drops to zero,
leaving many connections in TW/CLOSE state with < 30 second timeouts.

Without this change it will take several minutes until the conntrack
count comes back to normal.
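
With the 90% threshold below, a run that finds e.g. 240 of 250 scanned
entries expired (ratio 96) re-arms the work queue with next_run = 0
instead of waiting another GC_INTERVAL.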

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index e0a7889..f2133e8 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -938,6 +938,7 @@ static void gc_worker(struct work_struct *work)
 {
 	unsigned int i, goal, buckets = 0, expired_count = 0;
 	unsigned long next_run = GC_INTERVAL;
+	unsigned int ratio, scanned = 0;
 	struct conntrack_gc_work *gc_work;
 
 	gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
@@ -962,6 +963,7 @@ static void gc_worker(struct work_struct *work)
 		hlist_nulls_for_each_entry_rcu(h, n, &ct_hash[i], hnnode) {
 			tmp = nf_ct_tuplehash_to_ctrack(h);
 
+			scanned++;
 			if (nf_ct_is_expired(tmp)) {
 				nf_ct_gc_expired(tmp);
 				expired_count++;
@@ -981,6 +983,10 @@ static void gc_worker(struct work_struct *work)
 	if (gc_work->exiting)
 		return;
 
+	ratio = scanned ? expired_count * 100 / scanned : 0;
+	if (ratio >= 90)
+		next_run = 0;
+
 	gc_work->last_bucket = i;
 	schedule_delayed_work(&gc_work->dwork, next_run);
 }
-- 
2.7.3


* [PATCH v2 nf-next 6/7] netfilter: remove __nf_ct_kill_acct helper
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

After timer removal this just calls nf_ct_delete, so remove the
__-prefixed version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h | 13 +++----------
 net/netfilter/nf_conntrack_core.c    | 12 +++++-------
 2 files changed, 8 insertions(+), 17 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 6d8cf06..a3595c2 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -219,21 +219,14 @@ static inline void nf_ct_refresh(struct nf_conn *ct,
 	__nf_ct_refresh_acct(ct, 0, skb, extra_jiffies, 0);
 }
 
-bool __nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
-		       const struct sk_buff *skb, int do_acct);
-
 /* kill conntrack and do accounting */
-static inline bool nf_ct_kill_acct(struct nf_conn *ct,
-				   enum ip_conntrack_info ctinfo,
-				   const struct sk_buff *skb)
-{
-	return __nf_ct_kill_acct(ct, ctinfo, skb, 1);
-}
+bool nf_ct_kill_acct(struct nf_conn *ct, enum ip_conntrack_info ctinfo,
+		     const struct sk_buff *skb);
 
 /* kill conntrack without accounting */
 static inline bool nf_ct_kill(struct nf_conn *ct)
 {
-	return __nf_ct_kill_acct(ct, 0, NULL, 0);
+	return nf_ct_delete(ct, 0, 0);
 }
 
 /* These are for NAT.  Icky. */
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index f2133e8..2f7f759 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1423,17 +1423,15 @@ acct:
 }
 EXPORT_SYMBOL_GPL(__nf_ct_refresh_acct);
 
-bool __nf_ct_kill_acct(struct nf_conn *ct,
-		       enum ip_conntrack_info ctinfo,
-		       const struct sk_buff *skb,
-		       int do_acct)
+bool nf_ct_kill_acct(struct nf_conn *ct,
+		     enum ip_conntrack_info ctinfo,
+		     const struct sk_buff *skb)
 {
-	if (do_acct)
-		nf_ct_acct_update(ct, ctinfo, skb->len);
+	nf_ct_acct_update(ct, ctinfo, skb->len);
 
 	return nf_ct_delete(ct, 0, 0);
 }
-EXPORT_SYMBOL_GPL(__nf_ct_kill_acct);
+EXPORT_SYMBOL_GPL(nf_ct_kill_acct);
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
 
-- 
2.7.3


* [PATCH nf-next 7/7] netfilter: restart search if moved to other chain
From: Florian Westphal @ 2016-08-24 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

In case nf_conntrack_tuple_taken did not find a conflicting entry,
check that all entries in this hash slot were tested, and restart
in case an entry was moved to another chain.
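
(Conntrack objects are allocated SLAB_DESTROY_BY_RCU, so during a
lockless walk an entry can be freed and recycled into another chain;
the nulls value terminating each chain encodes the bucket, so a
mismatch means the walk strayed and must be restarted.)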

Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 2f7f759..4dda1ec 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -837,6 +837,7 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 	zone = nf_ct_zone(ignored_conntrack);
 
 	rcu_read_lock();
+ begin:
 	nf_conntrack_get_ht(&ct_hash, &hsize);
 	hash = __hash_conntrack(net, tuple, hsize);
 
@@ -858,6 +859,12 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 		}
 		NF_CT_STAT_INC_ATOMIC(net, searched);
 	}
+
+	if (get_nulls_value(n) != hash) {
+		NF_CT_STAT_INC_ATOMIC(net, search_restart);
+		goto begin;
+	}
+
 	rcu_read_unlock();
 
 	return 0;
-- 
2.7.3


* Re: [PATCH nf-next 7/7] netfilter: restart search if moved to other chain
From: Eric Dumazet @ 2016-08-24 13:20 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> In case nf_conntrack_tuple_taken did not find a conflicting entry
> check that all entries in this hash slot were tested and restart
> in case an entry was moved to another chain.
> 
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---

Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !



* Re: [PATCH v2 nf-next 1/7] netfilter: don't rely on DYING bit to detect when destroy event was sent
From: Eric Dumazet @ 2016-08-24 17:05 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> The reliable event delivery mode currently (ab)uses the DYING bit to
> detect which entries on the dying list have to be skipped when
> re-delivering events from the ecache worker in reliable event mode.
> 
> Currently when we delete the conntrack from main table we only set this
> bit if we could also deliver the netlink destroy event to userspace.
> 
> If we fail we move it to the dying list, the ecache worker will
> reattempt event delivery for all confirmed conntracks on the dying list
> that do not have the DYING bit set.
> 
> Once timer is gone, we can no longer use if (del_timer()) to detect
> when we 'stole' the reference count owned by the timer/hash entry, so
> we need some other way to avoid racing with other cpu.
> 
> Pablo suggested to add a marker in the ecache extension that skips

Acked-by: Eric Dumazet <edumazet@google.com>



* Re: [PATCH v2 nf-next 2/7] netfilter: conntrack: get rid of conntrack timer
From: Eric Dumazet @ 2016-08-24 17:11 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
> Eric Dumazet pointed out during netfilter workshop 2016.

Another reason was the fact that Thomas was about to change the max
timer range:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=500462a9de657f86edaa102f8ab6bff7f7e43fc2

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks !



* Re: [PATCH v2 nf-next 3/7] netfilter: evict stale entries on netlink dumps
From: Eric Dumazet @ 2016-08-24 17:33 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> When dumping we already have to look at the entire table, so we might
> as well toss those entries whose timeout value is in the past.
> 
> We also look at every entry during resize operations.
> However, eviction there is not as simple because we hold the
> global resize lock so we can't evict without adding a 'expired' list
> to drop from later.  Considering that resizes are very rare it doesn't
> seem worth doing it.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
Acked-by: Eric Dumazet <edumazet@google.com>



* Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
From: Eric Dumazet @ 2016-08-24 17:40 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> Conntrack gc worker to evict stale entries.


>  static struct nf_conn *
>  __nf_conntrack_alloc(struct net *net,
>  		     const struct nf_conntrack_zone *zone,
> @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
>  
>  void nf_conntrack_cleanup_start(void)
>  {
> +	conntrack_gc_work.exiting = true;
>  	RCU_INIT_POINTER(ip_ct_attach, NULL);
>  }
>  
> @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
>  	while (untrack_refs() > 0)
>  		schedule();
>  
> +	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> +	/* can be re-scheduled once */

Are you sure ?

As conntrack_gc_work.exiting = true, I do not see how this can happen ?

> +	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
>  	nf_ct_free_hashtable(nf_conntrack_hash, nf_conntrack_htable_size);
>  
>  	nf_conntrack_proto_fini();
> @@ -1810,6 +1884,10 @@ int nf_conntrack_init_start(void)
>  	}
>  	/*  - and look it like as a confirmed connection */
>  	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
> +
> +	conntrack_gc_work_init(&conntrack_gc_work);
> +	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
> +
>  	return 0;
>  
>  err_proto:



* Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
From: Florian Westphal @ 2016-08-24 20:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > Conntrack gc worker to evict stale entries.
> 
> 
> >  static struct nf_conn *
> >  __nf_conntrack_alloc(struct net *net,
> >  		     const struct nf_conntrack_zone *zone,
> > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> >  
> >  void nf_conntrack_cleanup_start(void)
> >  {
> > +	conntrack_gc_work.exiting = true;
> >  	RCU_INIT_POINTER(ip_ct_attach, NULL);
> >  }
> >  
> > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> >  	while (untrack_refs() > 0)
> >  		schedule();
> >  
> > +	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > +	/* can be re-scheduled once */
> 
> Are you sure ?
> 
> As conntrack_gc_work.exiting = true, I do not see how this can happen ?

nf_conntrack_cleanup_start() sets exiting = true

current cpu blocks in

cancel_delayed_work_sync(&conntrack_gc_work.dwork);

If the work queue was running on another cpu but was already past the
gc_work->exiting check, then by the time the first
cancel_delayed_work_sync() returns it will have re-armed itself via
schedule_delayed_work().

So I think the 2nd cancel_delayed_work_sync is needed.

Let me know if you'd like to see a v3 with a more verbose comment
about this.

Thanks!

* Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
From: Eric Dumazet @ 2016-08-24 20:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, 2016-08-24 at 22:11 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > > Conntrack gc worker to evict stale entries.
> > 
> > 
> > >  static struct nf_conn *
> > >  __nf_conntrack_alloc(struct net *net,
> > >  		     const struct nf_conntrack_zone *zone,
> > > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> > >  
> > >  void nf_conntrack_cleanup_start(void)
> > >  {
> > > +	conntrack_gc_work.exiting = true;
> > >  	RCU_INIT_POINTER(ip_ct_attach, NULL);
> > >  }
> > >  
> > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> > >  	while (untrack_refs() > 0)
> > >  		schedule();
> > >  
> > > +	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > > +	/* can be re-scheduled once */
> > 
> > Are you sure ?
> > 
> > As conntrack_gc_work.exiting = true, I do not see how this can happen ?
> 
> nf_conntrack_cleanup_start() sets exiting = true
> 
> current cpu blocks in
> 
> cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> 
> Iff the work queue was running on other cpu but was already past
> gc_work->exiting check then when cancel_delayed_work_sync() (first one)
> returns it will have re-armed itself via schedule_delayed_work().
> 
> So I think the 2nd cancel_delayed_work_sync is needed.
> 
> Let me know if you'd like to see a v3 with more verbose
> comment about this.

If you were using cancel_delayed_work() (instead of
cancel_delayed_work_sync()) I would understand the concern.

But here you are using the thing that was designed exactly to avoid
this issue of the work running on another cpu and/or re-arming itself.

If what you are saying was true, we would have to fix hundreds of
cancel_delayed_work_sync() call sites ...




* Re: [PATCH v2 nf-next 4/7] netfilter: conntrack: add gc worker to remove timed-out entries
From: Florian Westphal @ 2016-08-24 23:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2016-08-24 at 22:11 +0200, Florian Westphal wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On Wed, 2016-08-24 at 13:55 +0200, Florian Westphal wrote:
> > > > Conntrack gc worker to evict stale entries.
> > > 
> > > 
> > > >  static struct nf_conn *
> > > >  __nf_conntrack_alloc(struct net *net,
> > > >  		     const struct nf_conntrack_zone *zone,
> > > > @@ -1527,6 +1597,7 @@ static int untrack_refs(void)
> > > >  
> > > >  void nf_conntrack_cleanup_start(void)
> > > >  {
> > > > +	conntrack_gc_work.exiting = true;
> > > >  	RCU_INIT_POINTER(ip_ct_attach, NULL);
> > > >  }
> > > >  
> > > > @@ -1536,6 +1607,9 @@ void nf_conntrack_cleanup_end(void)
> > > >  	while (untrack_refs() > 0)
> > > >  		schedule();
> > > >  
> > > > +	cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > > > +	/* can be re-scheduled once */
> > > 
> > > Are you sure ?
> > > 
> > > As conntrack_gc_work.exiting = true, I do not see how this can happen ?
> > 
> > nf_conntrack_cleanup_start() sets exiting = true
> > 
> > current cpu blocks in
> > 
> > cancel_delayed_work_sync(&conntrack_gc_work.dwork);
> > 
> > Iff the work queue was running on other cpu but was already past
> > gc_work->exiting check then when cancel_delayed_work_sync() (first one)
> > returns it will have re-armed itself via schedule_delayed_work().
> > 
> > So I think the 2nd cancel_delayed_work_sync is needed.
> > 
> > Let me know if you'd like to see a v3 with more verbose
> > comment about this.
> 
> If you were using cancel_delayed_work() (instead of
> cancel_delayed_work_sync()) I would understand the concern.
> 
> But here you are using the thing that was designed to exactly avoid the
> issue, of both work running on another cpu and/or re-arming itself.
> 
> If what you are saying was true, we would have to fix hundreds of
> cancel_delayed_work_sync() call sites ...

Ok, I see, the _sync version indeed seems to be designed to
suppress/avoid re-arming.

I will send a v3 without the 2nd cancel_delayed_work_sync, thanks Eric!

(I will preserve/add your Acked-by tags to the unchanged patches so you
 won't need to resend them).
