* [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
@ 2015-10-23 10:43 Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 1/9] netfilter: ingress: don't use nf_hook_list_active Florian Westphal
                   ` (10 more replies)
  0 siblings, 11 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC
  To: netfilter-devel

Historically, a particular table or netfilter feature (defrag, iptables
filter table ...) was registered with the netfilter core hook mechanism
on module load.

When netns support was added to iptables, only the ip/ip6tables ruleset
was made namespace aware, not the actual hook points.

This has changed -- after Eric Biederman's recent work we now have
per net namespace hooks.

When a new namespace is created, all the hooks registered 'globally'
(i.e. via nf_register_hook() rather than for a particular namespace via
 the nf_register_net_hook() API) get copied to the new netns.

This means, for example, that when the ipt_filter table/module is loaded on a
system, each namespace on that system has an (empty) iptables filter ruleset.

This work aims to change all major hook users to nf_register_net_hook
so that when a new netns is created it has no hooks at all, even when the
initial namespace uses conntrack, iptables and bridge netfilter.

To keep behaviour somewhat compatible, xtable hooks are registered once an
iptables set/getsockopt call is made within a net namespace.
This also means that e.g. conntrack behaviour is not yet optimal: we
still create all the data structures and only skip hook registration
at this time.
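
To illustrate the difference, here is a small userspace model (not kernel
code). All model_* names are made up for this sketch, and a namespace is
reduced to a list of hook names. It assumes the pre-series behaviour of
copying 'global' hooks at namespace creation, and the new behaviour of
registering table hooks on the first set/getsockopt call in a namespace.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_HOOKS 8

/* Hypothetical stand-in for a net namespace: just a list of hook names. */
struct model_netns {
	const char *hooks[MAX_HOOKS];
	int nhooks;
	bool xt_registered;	/* set once the first sockopt call was seen */
};

/* Hooks registered 'globally', i.e. not for one particular namespace. */
static const char *global_hooks[MAX_HOOKS];
static int nglobal;

/* Models per-namespace registration (the role of nf_register_net_hook). */
static int model_register_net_hook(struct model_netns *ns, const char *name)
{
	assert(ns != NULL);
	if (ns->nhooks >= MAX_HOOKS)
		return -1;
	ns->hooks[ns->nhooks++] = name;
	return 0;
}

/* Models global registration (the role of nf_register_hook). */
static int model_register_hook(const char *name)
{
	if (nglobal >= MAX_HOOKS)
		return -1;
	global_hooks[nglobal++] = name;
	return 0;
}

/* Namespace creation copies every globally registered hook. */
static void model_netns_init(struct model_netns *ns)
{
	memset(ns, 0, sizeof(*ns));
	for (int i = 0; i < nglobal; i++)
		model_register_net_hook(ns, global_hooks[i]);
}

/* Lazy path: the first set/getsockopt call registers the table hook. */
static int model_xt_sockopt(struct model_netns *ns)
{
	if (ns->xt_registered)
		return 0;
	if (model_register_net_hook(ns, "iptable_filter") < 0)
		return -1;
	ns->xt_registered = true;
	return 0;
}
```

With no global users left, model_netns_init() yields a namespace without any
hooks, and only a call to model_xt_sockopt() in that namespace adds one.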

Caveats:
- conntrack is no longer active just by loading the nf_conntrack module -- at
least one (x)tables rule that requires conntrack has to be added, e.g.
a conntrack match or an S/DNAT target.
Loading the nat table is *not* sufficient.

Changes since v1:
- Don't add a dependency on conntrack in the nat table.
It causes hard-to-resolve dependency problems with initcall ordering, e.g.
ip6tables nat will fail if link order causes the nat table initcall to run
before the conntrack_ipv6 initcall, and so on (it also makes the patch set
pretty much useless with MODULES=n builds).
- change all targets and matches that need conntrack, including MASQUERADE,
  REDIRECT, SNAT/DNAT to register with conntrack.
- get rid of a couple of refcount bugs.
- reduce the amount of copy&pastry in iptable_foo modules.  For this to work
all the XXX_register_table() functions had to be extended with the location of
the table pointer storage; the problem is that this must be set up by the time
the hook is registered, since we can see packets right away.

Ads section:
netperf columns: recv socket size (bytes), send socket size (bytes),
send message size (bytes), elapsed time (s), throughput (10^6 bits/s).

conntrack+filter + nat table used in init namespace, single TCP_STREAM lo netperf:
87380  16384  16384    30.00    14348.66
with patch set, netperf running in net namespace without rules:
87380  16384  16384    30.00    15683.97

routing from ns3 -> ns2, filter + nat table & conntrack in all namespaces:
87380  16384  16384    30.00    5664.46
without conntrack+any tables in those namespaces:
87380  16384  16384    30.00    7336.54

Comments welcome.

 include/linux/netfilter.h                      |   29 ++++------
 include/linux/netfilter/x_tables.h             |    6 +-
 include/linux/netfilter_arp/arp_tables.h       |    9 +--
 include/linux/netfilter_ingress.h              |    9 ++-
 include/linux/netfilter_ipv4/ip_tables.h       |    9 +--
 include/linux/netfilter_ipv6/ip6_tables.h      |    9 +--
 include/net/netfilter/ipv4/nf_defrag_ipv4.h    |    3 -
 include/net/netfilter/ipv6/nf_defrag_ipv6.h    |    3 -
 include/net/netfilter/nf_conntrack.h           |    4 +
 include/net/netfilter/nf_conntrack_l3proto.h   |    4 +
 net/bridge/br_netfilter_hooks.c                |   68 +++++++++++++++++++++++--
 net/ipv4/netfilter/arp_tables.c                |   66 +++++++++++++++---------
 net/ipv4/netfilter/arptable_filter.c           |   39 ++++++++------
 net/ipv4/netfilter/ip_tables.c                 |   63 +++++++++++++----------
 net/ipv4/netfilter/ipt_CLUSTERIP.c             |    4 -
 net/ipv4/netfilter/ipt_MASQUERADE.c            |    8 ++
 net/ipv4/netfilter/ipt_SYNPROXY.c              |    4 -
 net/ipv4/netfilter/iptable_filter.c            |   55 ++++++++++++--------
 net/ipv4/netfilter/iptable_mangle.c            |   41 ++++++++++-----
 net/ipv4/netfilter/iptable_nat.c               |   41 ++++++++-------
 net/ipv4/netfilter/iptable_raw.c               |   42 ++++++++++-----
 net/ipv4/netfilter/iptable_security.c          |   44 ++++++++++------
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   62 ++++++++++++++++++----
 net/ipv4/netfilter/nf_defrag_ipv4.c            |   49 ++++++++++++++++--
 net/ipv4/netfilter/nft_masq_ipv4.c             |    7 ++
 net/ipv4/netfilter/nft_redir_ipv4.c            |    7 ++
 net/ipv6/netfilter/ip6_tables.c                |   65 ++++++++++++++---------
 net/ipv6/netfilter/ip6t_SYNPROXY.c             |    4 -
 net/ipv6/netfilter/ip6table_filter.c           |   47 ++++++++++-------
 net/ipv6/netfilter/ip6table_mangle.c           |   45 +++++++++-------
 net/ipv6/netfilter/ip6table_nat.c              |   41 ++++++++-------
 net/ipv6/netfilter/ip6table_raw.c              |   46 ++++++++++------
 net/ipv6/netfilter/ip6table_security.c         |   44 +++++++++-------
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   61 +++++++++++++++++-----
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c      |   50 ++++++++++++++++--
 net/ipv6/netfilter/nft_masq_ipv6.c             |    7 ++
 net/ipv6/netfilter/nft_redir_ipv6.c            |    7 ++
 net/netfilter/nf_conntrack_proto.c             |   48 +++++++++++++++++
 net/netfilter/nft_ct.c                         |   24 ++++----
 net/netfilter/nft_masq.c                       |    5 +
 net/netfilter/nft_nat.c                        |   11 +++-
 net/netfilter/nft_redir.c                      |    2 
 net/netfilter/x_tables.c                       |   65 ++++++++++++++---------
 net/netfilter/xt_CONNSECMARK.c                 |    4 -
 net/netfilter/xt_CT.c                          |    6 +-
 net/netfilter/xt_NETMAP.c                      |   11 +++-
 net/netfilter/xt_REDIRECT.c                    |   12 +++-
 net/netfilter/xt_TPROXY.c                      |   15 +++--
 net/netfilter/xt_connbytes.c                   |    4 -
 net/netfilter/xt_connlabel.c                   |    6 +-
 net/netfilter/xt_connlimit.c                   |    6 +-
 net/netfilter/xt_connmark.c                    |    8 +-
 net/netfilter/xt_conntrack.c                   |    4 -
 net/netfilter/xt_helper.c                      |    4 -
 net/netfilter/xt_nat.c                         |   18 ++++++
 net/netfilter/xt_socket.c                      |   33 ++++++++++--
 net/netfilter/xt_state.c                       |    4 -
 57 files changed, 970 insertions(+), 422 deletions(-)



* [PATCH v2 nf-next 1/9] netfilter: ingress: don't use nf_hook_list_active
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-11-06 18:33   ` Pablo Neira Ayuso
  2015-10-23 10:43 ` [PATCH v2 nf-next 2/9] netfilter: add and use nf_ct_netns_get/put Florian Westphal
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC
  To: netfilter-devel; +Cc: Florian Westphal

nf_hook_list_active() always returns true once at least one device has
an NF_INGRESS hook enabled, even for devices that have no hooks at all.

Thus, don't use this function. Instead, invert the test and use the static
key to elide the list_empty test if no NF_INGRESS hooks are active.
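
The difference between the two tests can be shown with a small userspace
model (not kernel code). The model_* names are made up for this sketch: a
global counter plays the role of the static key and a per-device counter
plays the role of the nf_hooks_ingress list. It assumes, as described above,
that the old helper effectively reported only the global state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the static key: number of ingress hooks on any device. */
static int model_ingress_hooks_needed;

/* Stand-in for a net device and its nf_hooks_ingress list. */
struct model_dev {
	int nhooks;
};

static void model_add_ingress_hook(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->nhooks++;
	model_ingress_hooks_needed++;
}

/* Old test: true as soon as any device has an ingress hook. */
static bool model_old_active(const struct model_dev *dev)
{
	(void)dev;	/* the per-device list is never consulted */
	return model_ingress_hooks_needed > 0;
}

/* New test: the global check is only a shortcut for the common
 * "no ingress hooks anywhere" case; the per-device list decides. */
static bool model_new_active(const struct model_dev *dev)
{
	if (model_ingress_hooks_needed == 0)
		return false;
	return dev->nhooks > 0;
}
```

After a hook is added to one device, model_old_active() reports true for
every other device as well, while model_new_active() stays false for them.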

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no changes since v1.
 include/linux/netfilter_ingress.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
index 187feab..ba7ce88 100644
--- a/include/linux/netfilter_ingress.h
+++ b/include/linux/netfilter_ingress.h
@@ -5,10 +5,13 @@
 #include <linux/netdevice.h>
 
 #ifdef CONFIG_NETFILTER_INGRESS
-static inline int nf_hook_ingress_active(struct sk_buff *skb)
+static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
 {
-	return nf_hook_list_active(&skb->dev->nf_hooks_ingress,
-				   NFPROTO_NETDEV, NF_NETDEV_INGRESS);
+#ifdef HAVE_JUMP_LABEL
+	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
+		return false;
+#endif
+	return !list_empty(&skb->dev->nf_hooks_ingress);
 }
 
 static inline int nf_hook_ingress(struct sk_buff *skb)
-- 
2.0.5



* [PATCH v2 nf-next 2/9] netfilter: add and use nf_ct_netns_get/put
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 1/9] netfilter: ingress: don't use nf_hook_list_active Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 3/9] netfilter: conntrack: register hooks in netns when needed by ruleset Florian Westphal
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC
  To: netfilter-devel; +Cc: Florian Westphal

nf_ct_netns_get/put are currently aliased to
nf_ct_l3proto_try_module_get/_put.
This will be changed in the next patch, when we add functions that make use
of the ->net argument to store a usercount per l3proto tracker.

This is needed so that we can avoid registering the conntrack hooks in all
netns and later enable connection tracking only in those that need conntrack.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v1:
  - prefer par->family over hardcoded NFPROTO_FOO (Jan Engelhardt)

 include/net/netfilter/nf_conntrack.h |  4 ++++
 net/ipv4/netfilter/ipt_CLUSTERIP.c   |  4 ++--
 net/ipv4/netfilter/ipt_SYNPROXY.c    |  4 ++--
 net/ipv6/netfilter/ip6t_SYNPROXY.c   |  4 ++--
 net/netfilter/nf_conntrack_proto.c   | 12 ++++++++++++
 net/netfilter/nft_ct.c               | 24 ++++++++++++------------
 net/netfilter/xt_CONNSECMARK.c       |  4 ++--
 net/netfilter/xt_CT.c                |  6 +++---
 net/netfilter/xt_connbytes.c         |  4 ++--
 net/netfilter/xt_connlabel.c         |  6 +++---
 net/netfilter/xt_connlimit.c         |  6 +++---
 net/netfilter/xt_connmark.c          |  8 ++++----
 net/netfilter/xt_conntrack.c         |  4 ++--
 net/netfilter/xt_helper.c            |  4 ++--
 net/netfilter/xt_state.c             |  4 ++--
 15 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index fde4068..9dd4a6b 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -175,6 +175,10 @@ static inline void nf_ct_put(struct nf_conn *ct)
 int nf_ct_l3proto_try_module_get(unsigned short l3proto);
 void nf_ct_l3proto_module_put(unsigned short l3proto);
 
+/* load module; enable/disable conntrack in this namespace */
+int nf_ct_netns_get(struct net *net, u8 nfproto);
+void nf_ct_netns_put(struct net *net, u8 nfproto);
+
 /*
  * Allocate a hashtable of hlist_head (if nulls == 0),
  * or hlist_nulls_head (if nulls == 1)
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 4a9e6db..2c3fe06 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -419,7 +419,7 @@ static int clusterip_tg_check(const struct xt_tgchk_param *par)
 	}
 	cipinfo->config = config;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -444,7 +444,7 @@ static void clusterip_tg_destroy(const struct xt_tgdtor_param *par)
 
 	clusterip_config_put(cipinfo->config);
 
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 #ifdef CONFIG_COMPAT
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 5fdc556..c2cc22e 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -415,12 +415,12 @@ static int synproxy_tg4_check(const struct xt_tgchk_param *par)
 	    e->ip.invflags & XT_INV_PROTO)
 		return -EINVAL;
 
-	return nf_ct_l3proto_try_module_get(par->family);
+	return nf_ct_netns_get(par->net, par->family);
 }
 
 static void synproxy_tg4_destroy(const struct xt_tgdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_target synproxy_tg4_reg __read_mostly = {
diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c
index 3deed58..484f4b6 100644
--- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
+++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
@@ -436,12 +436,12 @@ static int synproxy_tg6_check(const struct xt_tgchk_param *par)
 	    e->ipv6.invflags & XT_INV_PROTO)
 		return -EINVAL;
 
-	return nf_ct_l3proto_try_module_get(par->family);
+	return nf_ct_netns_get(par->net, par->family);
 }
 
 static void synproxy_tg6_destroy(const struct xt_tgdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_target synproxy_tg6_reg __read_mostly = {
diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
index b65d586..609c789 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -125,6 +125,18 @@ void nf_ct_l3proto_module_put(unsigned short l3proto)
 }
 EXPORT_SYMBOL_GPL(nf_ct_l3proto_module_put);
 
+int nf_ct_netns_get(struct net *net, u8 nfproto)
+{
+	return nf_ct_l3proto_try_module_get(nfproto);
+}
+EXPORT_SYMBOL_GPL(nf_ct_netns_get);
+
+void nf_ct_netns_put(struct net *net, u8 nfproto)
+{
+	nf_ct_l3proto_module_put(nfproto);
+}
+EXPORT_SYMBOL_GPL(nf_ct_netns_put);
+
 struct nf_conntrack_l4proto *
 nf_ct_l4proto_find_get(u_int16_t l3num, u_int8_t l4num)
 {
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index 8cbca34..8c775b1 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -186,37 +186,37 @@ static const struct nla_policy nft_ct_policy[NFTA_CT_MAX + 1] = {
 	[NFTA_CT_SREG]		= { .type = NLA_U32 },
 };
 
-static int nft_ct_l3proto_try_module_get(uint8_t family)
+static int nft_ct_netns_get(struct net *net, uint8_t family)
 {
 	int err;
 
 	if (family == NFPROTO_INET) {
-		err = nf_ct_l3proto_try_module_get(NFPROTO_IPV4);
+		err = nf_ct_netns_get(net, NFPROTO_IPV4);
 		if (err < 0)
 			goto err1;
-		err = nf_ct_l3proto_try_module_get(NFPROTO_IPV6);
+		err = nf_ct_netns_get(net, NFPROTO_IPV6);
 		if (err < 0)
 			goto err2;
 	} else {
-		err = nf_ct_l3proto_try_module_get(family);
+		err = nf_ct_netns_get(net, family);
 		if (err < 0)
 			goto err1;
 	}
 	return 0;
 
 err2:
-	nf_ct_l3proto_module_put(NFPROTO_IPV4);
+	nf_ct_netns_put(net, NFPROTO_IPV4);
 err1:
 	return err;
 }
 
-static void nft_ct_l3proto_module_put(uint8_t family)
+static void nft_ct_netns_put(struct net *net, uint8_t family)
 {
 	if (family == NFPROTO_INET) {
-		nf_ct_l3proto_module_put(NFPROTO_IPV4);
-		nf_ct_l3proto_module_put(NFPROTO_IPV6);
+		nf_ct_netns_put(net, NFPROTO_IPV4);
+		nf_ct_netns_put(net, NFPROTO_IPV6);
 	} else
-		nf_ct_l3proto_module_put(family);
+		nf_ct_netns_put(net, family);
 }
 
 static int nft_ct_get_init(const struct nft_ctx *ctx,
@@ -312,7 +312,7 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
 	if (err < 0)
 		return err;
 
-	err = nft_ct_l3proto_try_module_get(ctx->afi->family);
+	err = nft_ct_netns_get(ctx->net, ctx->afi->family);
 	if (err < 0)
 		return err;
 
@@ -343,7 +343,7 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
 	if (err < 0)
 		return err;
 
-	err = nft_ct_l3proto_try_module_get(ctx->afi->family);
+	err = nft_ct_netns_get(ctx->net, ctx->afi->family);
 	if (err < 0)
 		return err;
 
@@ -353,7 +353,7 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
 static void nft_ct_destroy(const struct nft_ctx *ctx,
 			   const struct nft_expr *expr)
 {
-	nft_ct_l3proto_module_put(ctx->afi->family);
+	nft_ct_netns_put(ctx->net, ctx->afi->family);
 }
 
 static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr)
diff --git a/net/netfilter/xt_CONNSECMARK.c b/net/netfilter/xt_CONNSECMARK.c
index e04dc28..da56c06 100644
--- a/net/netfilter/xt_CONNSECMARK.c
+++ b/net/netfilter/xt_CONNSECMARK.c
@@ -106,7 +106,7 @@ static int connsecmark_tg_check(const struct xt_tgchk_param *par)
 		return -EINVAL;
 	}
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -115,7 +115,7 @@ static int connsecmark_tg_check(const struct xt_tgchk_param *par)
 
 static void connsecmark_tg_destroy(const struct xt_tgdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_target connsecmark_tg_reg __read_mostly = {
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index e7ac07e..48f1ddb 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -216,7 +216,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par,
 		goto err1;
 #endif
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		goto err1;
 
@@ -260,7 +260,7 @@ out:
 err3:
 	nf_ct_tmpl_free(ct);
 err2:
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 err1:
 	return ret;
 }
@@ -341,7 +341,7 @@ static void xt_ct_tg_destroy(const struct xt_tgdtor_param *par,
 		if (help)
 			module_put(help->helper->me);
 
-		nf_ct_l3proto_module_put(par->family);
+		nf_ct_netns_put(par->net, par->family);
 
 		xt_ct_destroy_timeout(ct);
 		nf_ct_put(info->ct);
diff --git a/net/netfilter/xt_connbytes.c b/net/netfilter/xt_connbytes.c
index d4bec26..cad0b7b 100644
--- a/net/netfilter/xt_connbytes.c
+++ b/net/netfilter/xt_connbytes.c
@@ -110,7 +110,7 @@ static int connbytes_mt_check(const struct xt_mtchk_param *par)
 	    sinfo->direction != XT_CONNBYTES_DIR_BOTH)
 		return -EINVAL;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -129,7 +129,7 @@ static int connbytes_mt_check(const struct xt_mtchk_param *par)
 
 static void connbytes_mt_destroy(const struct xt_mtdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_match connbytes_mt_reg __read_mostly = {
diff --git a/net/netfilter/xt_connlabel.c b/net/netfilter/xt_connlabel.c
index bb9cbeb..b7e57f2 100644
--- a/net/netfilter/xt_connlabel.c
+++ b/net/netfilter/xt_connlabel.c
@@ -48,7 +48,7 @@ static int connlabel_mt_check(const struct xt_mtchk_param *par)
 		return -EINVAL;
 	}
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0) {
 		pr_info("cannot load conntrack support for proto=%u\n",
 							par->family);
@@ -57,14 +57,14 @@ static int connlabel_mt_check(const struct xt_mtchk_param *par)
 
 	ret = nf_connlabels_get(par->net, info->bit + 1);
 	if (ret < 0)
-		nf_ct_l3proto_module_put(par->family);
+		nf_ct_netns_put(par->net, par->family);
 	return ret;
 }
 
 static void connlabel_mt_destroy(const struct xt_mtdtor_param *par)
 {
 	nf_connlabels_put(par->net);
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_match connlabels_mt_reg __read_mostly = {
diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 99bbc82..66f5480 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -374,7 +374,7 @@ static int connlimit_mt_check(const struct xt_mtchk_param *par)
 		} while (!rand);
 		cmpxchg(&connlimit_rnd, 0, rand);
 	}
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0) {
 		pr_info("cannot load conntrack support for "
 			"address family %u\n", par->family);
@@ -384,7 +384,7 @@ static int connlimit_mt_check(const struct xt_mtchk_param *par)
 	/* init private data */
 	info->data = kmalloc(sizeof(struct xt_connlimit_data), GFP_KERNEL);
 	if (info->data == NULL) {
-		nf_ct_l3proto_module_put(par->family);
+		nf_ct_netns_put(par->net, par->family);
 		return -ENOMEM;
 	}
 
@@ -420,7 +420,7 @@ static void connlimit_mt_destroy(const struct xt_mtdtor_param *par)
 	const struct xt_connlimit_info *info = par->matchinfo;
 	unsigned int i;
 
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 
 	for (i = 0; i < ARRAY_SIZE(info->data->climit_root4); ++i)
 		destroy_tree(&info->data->climit_root4[i]);
diff --git a/net/netfilter/xt_connmark.c b/net/netfilter/xt_connmark.c
index 69f78e9..ec377cc 100644
--- a/net/netfilter/xt_connmark.c
+++ b/net/netfilter/xt_connmark.c
@@ -77,7 +77,7 @@ static int connmark_tg_check(const struct xt_tgchk_param *par)
 {
 	int ret;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -86,7 +86,7 @@ static int connmark_tg_check(const struct xt_tgchk_param *par)
 
 static void connmark_tg_destroy(const struct xt_tgdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static bool
@@ -107,7 +107,7 @@ static int connmark_mt_check(const struct xt_mtchk_param *par)
 {
 	int ret;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -116,7 +116,7 @@ static int connmark_mt_check(const struct xt_mtchk_param *par)
 
 static void connmark_mt_destroy(const struct xt_mtdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_target connmark_tg_reg __read_mostly = {
diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c
index 188404b9..9480b52 100644
--- a/net/netfilter/xt_conntrack.c
+++ b/net/netfilter/xt_conntrack.c
@@ -273,7 +273,7 @@ static int conntrack_mt_check(const struct xt_mtchk_param *par)
 {
 	int ret;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -282,7 +282,7 @@ static int conntrack_mt_check(const struct xt_mtchk_param *par)
 
 static void conntrack_mt_destroy(const struct xt_mtdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_match conntrack_mt_reg[] __read_mostly = {
diff --git a/net/netfilter/xt_helper.c b/net/netfilter/xt_helper.c
index 9f4ab00..8e451ec 100644
--- a/net/netfilter/xt_helper.c
+++ b/net/netfilter/xt_helper.c
@@ -59,7 +59,7 @@ static int helper_mt_check(const struct xt_mtchk_param *par)
 	struct xt_helper_info *info = par->matchinfo;
 	int ret;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0) {
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -71,7 +71,7 @@ static int helper_mt_check(const struct xt_mtchk_param *par)
 
 static void helper_mt_destroy(const struct xt_mtdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_match helper_mt_reg __read_mostly = {
diff --git a/net/netfilter/xt_state.c b/net/netfilter/xt_state.c
index a507922..5746a33 100644
--- a/net/netfilter/xt_state.c
+++ b/net/netfilter/xt_state.c
@@ -43,7 +43,7 @@ static int state_mt_check(const struct xt_mtchk_param *par)
 {
 	int ret;
 
-	ret = nf_ct_l3proto_try_module_get(par->family);
+	ret = nf_ct_netns_get(par->net, par->family);
 	if (ret < 0)
 		pr_info("cannot load conntrack support for proto=%u\n",
 			par->family);
@@ -52,7 +52,7 @@ static int state_mt_check(const struct xt_mtchk_param *par)
 
 static void state_mt_destroy(const struct xt_mtdtor_param *par)
 {
-	nf_ct_l3proto_module_put(par->family);
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static struct xt_match state_mt_reg __read_mostly = {
-- 
2.0.5



* [PATCH v2 nf-next 3/9] netfilter: conntrack: register hooks in netns when needed by ruleset
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 1/9] netfilter: ingress: don't use nf_hook_list_active Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 2/9] netfilter: add and use nf_ct_netns_get/put Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 4/9] netfilter: xtables: don't register xt hooks in namespace at init time Florian Westphal
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC
  To: netfilter-devel; +Cc: Florian Westphal

This makes use of nf_ct_netns_get/put added in the previous patch.
We add get/put functions to the nf_conntrack_l3proto structure; ipv4 and
ipv6 then implement a use-count to track how many users (nftables or
xtables modules) have a dependency on ipv4 and/or ipv6 connection
tracking functionality.

When count reaches zero, the hooks are unregistered.

The main goal of this patch is to delay activation of connection
tracking inside a namespace until a point in time where such
functionality is needed (which might be "never").
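
The use-count logic can be sketched in userspace as follows (not kernel
code). The model_* names and the fail_register knob are made up for this
sketch, and the mutex that serializes get/put in the real code is left out
for brevity.

```c
#include <assert.h>

/* Stand-in for the per-namespace conntrack state. */
struct model_net {
	unsigned int users;	/* rules/expressions that need conntrack */
	int hooks_registered;	/* 1 while the modelled hooks are installed */
	int fail_register;	/* test knob: make hook registration fail */
};

static int model_register_hooks(struct model_net *net)
{
	if (net->fail_register)
		return -1;
	net->hooks_registered = 1;
	return 0;
}

static void model_unregister_hooks(struct model_net *net)
{
	net->hooks_registered = 0;
}

/* First user installs the hooks; later users only bump the count. */
static int model_ct_netns_get(struct model_net *net)
{
	int err;

	if (++net->users > 1)
		return 0;	/* hooks are already installed */

	err = model_register_hooks(net);
	if (err)
		net->users = 0;	/* a failed get must not leave a reference */
	return err;
}

/* Last user removes the hooks again. */
static void model_ct_netns_put(struct model_net *net)
{
	assert(net->users > 0);	/* every put needs a matching get */
	if (--net->users == 0)
		model_unregister_hooks(net);
}
```

Two gets followed by two puts leave the namespace without hooks again; a
single missing put would keep them installed until the namespace goes away.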

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no changes since v1

 include/net/netfilter/nf_conntrack_l3proto.h   |  4 ++
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 55 ++++++++++++++++++++------
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 54 +++++++++++++++++++------
 net/netfilter/nf_conntrack_proto.c             | 38 +++++++++++++++++-
 4 files changed, 127 insertions(+), 24 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_l3proto.h b/include/net/netfilter/nf_conntrack_l3proto.h
index cdc920b..e0238db 100644
--- a/include/net/netfilter/nf_conntrack_l3proto.h
+++ b/include/net/netfilter/nf_conntrack_l3proto.h
@@ -52,6 +52,10 @@ struct nf_conntrack_l3proto {
 	int (*tuple_to_nlattr)(struct sk_buff *skb,
 			       const struct nf_conntrack_tuple *t);
 
+	/* Called when netns wants to use connection tracking */
+	int (*net_ns_get)(struct net *);
+	void (*net_ns_put)(struct net *);
+
 	/*
 	 * Calculate size of tuple nlattr
 	 */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index 461ca92..2b51a04 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -31,6 +31,13 @@
 #include <net/netfilter/ipv4/nf_defrag_ipv4.h>
 #include <net/netfilter/nf_log.h>
 
+static int conntrack4_net_id __read_mostly;
+static DEFINE_MUTEX(register_ipv4_hooks);
+
+struct conntrack4_net {
+	unsigned int users;
+};
+
 static bool ipv4_pkt_to_tuple(const struct sk_buff *skb, unsigned int nhoff,
 			      struct nf_conntrack_tuple *tuple)
 {
@@ -367,6 +374,38 @@ static int ipv4_init_net(struct net *net)
 	return 0;
 }
 
+static int nf_conntrack_l3proto_ipv4_hooks_register(struct net *net)
+{
+	struct conntrack4_net *cnet = net_generic(net, conntrack4_net_id);
+	int err = 0;
+
+	mutex_lock(&register_ipv4_hooks);
+
+	cnet->users++;
+	if (cnet->users > 1)
+		goto out_unlock;
+
+	err = nf_register_net_hooks(net, ipv4_conntrack_ops,
+				    ARRAY_SIZE(ipv4_conntrack_ops));
+
+	if (err)
+		cnet->users = 0;
+ out_unlock:
+	mutex_unlock(&register_ipv4_hooks);
+	return err;
+}
+
+static void nf_conntrack_l3proto_ipv4_hooks_unregister(struct net *net)
+{
+	struct conntrack4_net *cnet = net_generic(net, conntrack4_net_id);
+
+	mutex_lock(&register_ipv4_hooks);
+	if (--cnet->users == 0)
+		nf_unregister_net_hooks(net, ipv4_conntrack_ops,
+					ARRAY_SIZE(ipv4_conntrack_ops));
+	mutex_unlock(&register_ipv4_hooks);
+}
+
 struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv4 __read_mostly = {
 	.l3proto	 = PF_INET,
 	.name		 = "ipv4",
@@ -383,6 +422,8 @@ struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv4 __read_mostly = {
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_NF_CONNTRACK_PROC_COMPAT)
 	.ctl_table_path  = "net/ipv4/netfilter",
 #endif
+	.net_ns_get	 = nf_conntrack_l3proto_ipv4_hooks_register,
+	.net_ns_put	 = nf_conntrack_l3proto_ipv4_hooks_unregister,
 	.init_net	 = ipv4_init_net,
 	.me		 = THIS_MODULE,
 };
@@ -440,6 +481,8 @@ static void ipv4_net_exit(struct net *net)
 static struct pernet_operations ipv4_net_ops = {
 	.init = ipv4_net_init,
 	.exit = ipv4_net_exit,
+	.id = &conntrack4_net_id,
+	.size = sizeof(struct conntrack4_net),
 };
 
 static int __init nf_conntrack_l3proto_ipv4_init(void)
@@ -461,17 +504,10 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
 		goto cleanup_sockopt;
 	}
 
-	ret = nf_register_hooks(ipv4_conntrack_ops,
-				ARRAY_SIZE(ipv4_conntrack_ops));
-	if (ret < 0) {
-		pr_err("nf_conntrack_ipv4: can't register hooks.\n");
-		goto cleanup_pernet;
-	}
-
 	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_tcp4);
 	if (ret < 0) {
 		pr_err("nf_conntrack_ipv4: can't register tcp4 proto.\n");
-		goto cleanup_hooks;
+		goto cleanup_pernet;
 	}
 
 	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_udp4);
@@ -508,8 +544,6 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_udp4);
  cleanup_tcp4:
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_tcp4);
- cleanup_hooks:
-	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
  cleanup_pernet:
 	unregister_pernet_subsys(&ipv4_net_ops);
  cleanup_sockopt:
@@ -527,7 +561,6 @@ static void __exit nf_conntrack_l3proto_ipv4_fini(void)
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_icmp);
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_udp4);
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_tcp4);
-	nf_unregister_hooks(ipv4_conntrack_ops, ARRAY_SIZE(ipv4_conntrack_ops));
 	unregister_pernet_subsys(&ipv4_net_ops);
 	nf_unregister_sockopt(&so_getorigdst);
 }
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 1aa5848..8916846 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -34,6 +34,13 @@
 #include <net/netfilter/ipv6/nf_defrag_ipv6.h>
 #include <net/netfilter/nf_log.h>
 
+static int conntrack6_net_id;
+static DEFINE_MUTEX(register_ipv6_hooks);
+
+struct conntrack6_net {
+	unsigned int users;
+};
+
 static bool ipv6_pkt_to_tuple(const struct sk_buff *skb, unsigned int nhoff,
 			      struct nf_conntrack_tuple *tuple)
 {
@@ -308,6 +315,36 @@ static int ipv6_nlattr_tuple_size(void)
 }
 #endif
 
+static int nf_conntrack_l3proto_ipv6_hooks_register(struct net *net)
+{
+	struct conntrack6_net *cnet = net_generic(net, conntrack6_net_id);
+	int err = 0;
+
+	mutex_lock(&register_ipv6_hooks);
+	cnet->users++;
+	if (cnet->users > 1)
+		goto out_unlock;
+
+	err = nf_register_net_hooks(net, ipv6_conntrack_ops,
+				    ARRAY_SIZE(ipv6_conntrack_ops));
+	if (err)
+		cnet->users = 0;
+ out_unlock:
+	mutex_unlock(&register_ipv6_hooks);
+	return err;
+}
+
+static void nf_conntrack_l3proto_ipv6_hooks_unregister(struct net *net)
+{
+	struct conntrack6_net *cnet = net_generic(net, conntrack6_net_id);
+
+	mutex_lock(&register_ipv6_hooks);
+	if (--cnet->users == 0)
+		nf_unregister_net_hooks(net, ipv6_conntrack_ops,
+					ARRAY_SIZE(ipv6_conntrack_ops));
+	mutex_unlock(&register_ipv6_hooks);
+}
+
 struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv6 __read_mostly = {
 	.l3proto		= PF_INET6,
 	.name			= "ipv6",
@@ -321,6 +358,8 @@ struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv6 __read_mostly = {
 	.nlattr_to_tuple	= ipv6_nlattr_to_tuple,
 	.nla_policy		= ipv6_nla_policy,
 #endif
+	.net_ns_get	 = nf_conntrack_l3proto_ipv6_hooks_register,
+	.net_ns_put	 = nf_conntrack_l3proto_ipv6_hooks_unregister,
 	.me			= THIS_MODULE,
 };
 
@@ -382,6 +421,8 @@ static void ipv6_net_exit(struct net *net)
 static struct pernet_operations ipv6_net_ops = {
 	.init = ipv6_net_init,
 	.exit = ipv6_net_exit,
+	.id = &conntrack6_net_id,
+	.size = sizeof(struct conntrack6_net),
 };
 
 static int __init nf_conntrack_l3proto_ipv6_init(void)
@@ -401,18 +442,10 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
 	if (ret < 0)
 		goto cleanup_sockopt;
 
-	ret = nf_register_hooks(ipv6_conntrack_ops,
-				ARRAY_SIZE(ipv6_conntrack_ops));
-	if (ret < 0) {
-		pr_err("nf_conntrack_ipv6: can't register pre-routing defrag "
-		       "hook.\n");
-		goto cleanup_pernet;
-	}
-
 	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_tcp6);
 	if (ret < 0) {
 		pr_err("nf_conntrack_ipv6: can't register tcp6 proto.\n");
-		goto cleanup_hooks;
+		goto cleanup_pernet;
 	}
 
 	ret = nf_ct_l4proto_register(&nf_conntrack_l4proto_udp6);
@@ -440,8 +473,6 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_udp6);
  cleanup_tcp6:
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
- cleanup_hooks:
-	nf_unregister_hooks(ipv6_conntrack_ops, ARRAY_SIZE(ipv6_conntrack_ops));
  cleanup_pernet:
 	unregister_pernet_subsys(&ipv6_net_ops);
  cleanup_sockopt:
@@ -456,7 +487,6 @@ static void __exit nf_conntrack_l3proto_ipv6_fini(void)
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_tcp6);
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_udp6);
 	nf_ct_l4proto_unregister(&nf_conntrack_l4proto_icmpv6);
-	nf_unregister_hooks(ipv6_conntrack_ops, ARRAY_SIZE(ipv6_conntrack_ops));
 	unregister_pernet_subsys(&ipv6_net_ops);
 	nf_unregister_sockopt(&so_getorigdst6);
 }
diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
index 609c789..1fb11b6 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -127,12 +127,48 @@ EXPORT_SYMBOL_GPL(nf_ct_l3proto_module_put);
 
 int nf_ct_netns_get(struct net *net, u8 nfproto)
 {
-	return nf_ct_l3proto_try_module_get(nfproto);
+	const struct nf_conntrack_l3proto *l3proto;
+	int ret;
+
+	might_sleep();
+
+	ret = nf_ct_l3proto_try_module_get(nfproto);
+	if (ret < 0)
+		return ret;
+
+	/* we already have a reference, can't fail */
+	rcu_read_lock();
+	l3proto = __nf_ct_l3proto_find(nfproto);
+	rcu_read_unlock();
+
+	if (!l3proto->net_ns_get)
+		return 0;
+
+	ret = l3proto->net_ns_get(net);
+	if (ret < 0)
+		nf_ct_l3proto_module_put(nfproto);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(nf_ct_netns_get);
 
 void nf_ct_netns_put(struct net *net, u8 nfproto)
 {
+	const struct nf_conntrack_l3proto *l3proto;
+
+	might_sleep();
+
+	/* same as nf_ct_netns_get(), reference assumed */
+	rcu_read_lock();
+	l3proto = __nf_ct_l3proto_find(nfproto);
+	rcu_read_unlock();
+
+	if (WARN_ON(!l3proto))
+		return;
+
+	if (l3proto->net_ns_put)
+		l3proto->net_ns_put(net);
+
 	nf_ct_l3proto_module_put(nfproto);
 }
 EXPORT_SYMBOL_GPL(nf_ct_netns_put);
-- 
2.0.5



* [PATCH v2 nf-next 4/9] netfilter: xtables: don't register xt hooks in namespace at init time
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (2 preceding siblings ...)
  2015-10-23 10:43 ` [PATCH v2 nf-next 3/9] netfilter: conntrack: register hooks in netns when needed by ruleset Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 5/9] netfilter: defrag: only register defrag functionality if needed Florian Westphal
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Delay hook registration until the table is requested inside a
namespace.

Historically, a particular table (iptables mangle, ip6tables filter,
etc) was registered on module load.

When netns support was added to iptables, only the ip/ip6tables ruleset
was made namespace aware, not the actual hook points.

This means, for example, that when the ipt_filter table/module is loaded
on a system, each namespace on that system has an (empty) iptables filter
ruleset.

In other words, if a namespace sends a packet, that skb is 'caught'
by the netfilter machinery and fed to the hook points for that table
(i.e. INPUT, FORWARD, etc).

Thanks to Eric Biederman, hooks are no longer global, but per namespace.

This means that we can avoid allocating an empty ruleset in a namespace
and defer hook registration until we need the functionality.

We register a table's hook entry points ONLY in the initial namespace.
When an iptables get/setsockopt is issued inside a given namespace,
we check if the table is found in the per-namespace list.

If not, we attempt to find it in the initial namespace, and,
if found, create an empty default table in the requesting namespace
and register the needed hooks.

Hook points are destroyed only once the namespace is deleted; there is no
'usage count' (it makes no sense since there is no 'remove table'
operation in the xtables API).
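
As a rough illustration of the lookup order described above, here is a
small userspace sketch in C. It is not the kernel code: struct netns,
struct table, ns_lookup(), ns_attach() and find_table() are hypothetical
stand-ins invented for this example, and "hooks" is a plain counter rather
than real nf_hook_ops registration. Teardown at namespace exit is not
modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TABLES 4

struct netns;

/* Hypothetical stand-in for a table template provided by a module. */
struct table {
	const char *name;
	unsigned int nhooks;			/* hook points the table uses */
	int (*table_init)(struct netns *ns);	/* instantiate in a namespace */
};

/* Hypothetical stand-in for a network namespace. */
struct netns {
	const struct table *tables[MAX_TABLES];	/* per-namespace table list */
	unsigned int hooks;			/* hook points registered so far */
};

static struct netns init_ns;			/* models the initial namespace */

/* Search only the given namespace's own table list. */
static const struct table *ns_lookup(const struct netns *ns, const char *name)
{
	for (size_t i = 0; i < MAX_TABLES; i++)
		if (ns->tables[i] && strcmp(ns->tables[i]->name, name) == 0)
			return ns->tables[i];
	return NULL;
}

/* Add the table to the namespace and account for its hook points. */
static int ns_attach(struct netns *ns, const struct table *t)
{
	assert(t != NULL);
	for (size_t i = 0; i < MAX_TABLES; i++) {
		if (!ns->tables[i]) {
			ns->tables[i] = t;
			ns->hooks += t->nhooks;
			return 0;
		}
	}
	return -1;	/* no free slot */
}

static int filter_table_init(struct netns *ns);

static const struct table filter_table = {
	.name		= "filter",
	.nhooks		= 3,
	.table_init	= filter_table_init,
};

/* Idempotent: does nothing if the namespace already has the table. */
static int filter_table_init(struct netns *ns)
{
	if (ns_lookup(ns, "filter"))
		return 0;
	return ns_attach(ns, &filter_table);
}

/*
 * Lookup as issued on behalf of a get/setsockopt call: try the
 * requesting namespace first, then fall back to the initial namespace
 * and instantiate the table on demand.
 */
static const struct table *find_table(struct netns *ns, const char *name)
{
	const struct table *t = ns_lookup(ns, name);

	if (t || ns == &init_ns)
		return t;

	t = ns_lookup(&init_ns, name);
	if (!t || t->table_init(ns) != 0)
		return NULL;

	return ns_lookup(ns, name);
}
```

In this model, once filter_table_init(&init_ns) has run, a freshly
zeroed namespace has no hooks at all. The first find_table() call for
"filter" from that namespace instantiates the table and moves its hook
counter from 0 to 3; repeating the lookup leaves it at 3, and a lookup
for a table unknown to the initial namespace returns NULL.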

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v1:
 - perform netns hook registration from xx_register_table() (Pablo Neira)
 - don't make nat table register conntrack hooks (causes hard to resolve
   link order problems with conntrack initcalls -- followup patch adds
   explicit conntrack-enable calls to MASQUERADE, REDIRECT, etc)

 include/linux/netfilter/x_tables.h        |  6 ++-
 include/linux/netfilter_arp/arp_tables.h  |  9 +++--
 include/linux/netfilter_ipv4/ip_tables.h  |  9 +++--
 include/linux/netfilter_ipv6/ip6_tables.h |  9 +++--
 net/ipv4/netfilter/arp_tables.c           | 66 +++++++++++++++++++------------
 net/ipv4/netfilter/arptable_filter.c      | 39 ++++++++++--------
 net/ipv4/netfilter/ip_tables.c            | 63 +++++++++++++++++------------
 net/ipv4/netfilter/iptable_filter.c       | 55 ++++++++++++++++----------
 net/ipv4/netfilter/iptable_mangle.c       | 41 +++++++++++++------
 net/ipv4/netfilter/iptable_nat.c          | 41 ++++++++++---------
 net/ipv4/netfilter/iptable_raw.c          | 42 +++++++++++++-------
 net/ipv4/netfilter/iptable_security.c     | 44 +++++++++++++--------
 net/ipv6/netfilter/ip6_tables.c           | 65 ++++++++++++++++++------------
 net/ipv6/netfilter/ip6table_filter.c      | 47 +++++++++++++---------
 net/ipv6/netfilter/ip6table_mangle.c      | 45 ++++++++++++---------
 net/ipv6/netfilter/ip6table_nat.c         | 41 ++++++++++---------
 net/ipv6/netfilter/ip6table_raw.c         | 46 ++++++++++++---------
 net/ipv6/netfilter/ip6table_security.c    | 44 +++++++++++++--------
 net/netfilter/x_tables.c                  | 65 ++++++++++++++++++------------
 19 files changed, 474 insertions(+), 303 deletions(-)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index c557741..80a305b 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -200,6 +200,9 @@ struct xt_table {
 	u_int8_t af;		/* address/protocol family */
 	int priority;		/* hook order */
 
+	/* called when table is needed in the given netns */
+	int (*table_init)(struct net *net);
+
 	/* A unique name... */
 	const char name[XT_TABLE_MAXNAMELEN];
 };
@@ -408,8 +411,7 @@ xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)
 	return cnt;
 }
 
-struct nf_hook_ops *xt_hook_link(const struct xt_table *, nf_hookfn *);
-void xt_hook_unlink(const struct xt_table *, struct nf_hook_ops *);
+struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);
 
 #ifdef CONFIG_COMPAT
 #include <net/compat.h>
diff --git a/include/linux/netfilter_arp/arp_tables.h b/include/linux/netfilter_arp/arp_tables.h
index 6f074db..029b95e 100644
--- a/include/linux/netfilter_arp/arp_tables.h
+++ b/include/linux/netfilter_arp/arp_tables.h
@@ -48,10 +48,11 @@ struct arpt_error {
 }
 
 extern void *arpt_alloc_initial_table(const struct xt_table *);
-extern struct xt_table *arpt_register_table(struct net *net,
-					    const struct xt_table *table,
-					    const struct arpt_replace *repl);
-extern void arpt_unregister_table(struct xt_table *table);
+int arpt_register_table(struct net *net, const struct xt_table *table,
+			const struct arpt_replace *repl,
+			const struct nf_hook_ops *ops, struct xt_table **res);
+void arpt_unregister_table(struct net *net, struct xt_table *table,
+			   const struct nf_hook_ops *ops);
 extern unsigned int arpt_do_table(struct sk_buff *skb,
 				  const struct nf_hook_state *state,
 				  struct xt_table *table);
diff --git a/include/linux/netfilter_ipv4/ip_tables.h b/include/linux/netfilter_ipv4/ip_tables.h
index aa598f9..7bfc589 100644
--- a/include/linux/netfilter_ipv4/ip_tables.h
+++ b/include/linux/netfilter_ipv4/ip_tables.h
@@ -24,10 +24,11 @@
 
 extern void ipt_init(void) __init;
 
-extern struct xt_table *ipt_register_table(struct net *net,
-					   const struct xt_table *table,
-					   const struct ipt_replace *repl);
-extern void ipt_unregister_table(struct net *net, struct xt_table *table);
+int ipt_register_table(struct net *net, const struct xt_table *table,
+		       const struct ipt_replace *repl,
+		       const struct nf_hook_ops *ops, struct xt_table **res);
+void ipt_unregister_table(struct net *net, struct xt_table *table,
+			  const struct nf_hook_ops *ops);
 
 /* Standard entry. */
 struct ipt_standard {
diff --git a/include/linux/netfilter_ipv6/ip6_tables.h b/include/linux/netfilter_ipv6/ip6_tables.h
index 0f76e5c..b21c392 100644
--- a/include/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/linux/netfilter_ipv6/ip6_tables.h
@@ -25,10 +25,11 @@
 extern void ip6t_init(void) __init;
 
 extern void *ip6t_alloc_initial_table(const struct xt_table *);
-extern struct xt_table *ip6t_register_table(struct net *net,
-					    const struct xt_table *table,
-					    const struct ip6t_replace *repl);
-extern void ip6t_unregister_table(struct net *net, struct xt_table *table);
+int ip6t_register_table(struct net *net, const struct xt_table *table,
+			const struct ip6t_replace *repl,
+			const struct nf_hook_ops *ops, struct xt_table **res);
+void ip6t_unregister_table(struct net *net, struct xt_table *table,
+			   const struct nf_hook_ops *ops);
 extern unsigned int ip6t_do_table(struct sk_buff *skb,
 				  const struct nf_hook_state *state,
 				  struct xt_table *table);
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 11dccba..15c0129 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -1780,9 +1780,29 @@ static int do_arpt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len
 	return ret;
 }
 
-struct xt_table *arpt_register_table(struct net *net,
-				     const struct xt_table *table,
-				     const struct arpt_replace *repl)
+static void __arpt_unregister_table(struct xt_table *table)
+{
+	struct xt_table_info *private;
+	void *loc_cpu_entry;
+	struct module *table_owner = table->me;
+	struct arpt_entry *iter;
+
+	private = xt_unregister_table(table);
+
+	/* Decrease module usage counts and free resources */
+	loc_cpu_entry = private->entries;
+	xt_entry_foreach(iter, loc_cpu_entry, private->size)
+		cleanup_entry(iter);
+	if (private->number > private->initial_entries)
+		module_put(table_owner);
+	xt_free_table_info(private);
+}
+
+int arpt_register_table(struct net *net,
+			const struct xt_table *table,
+			const struct arpt_replace *repl,
+			const struct nf_hook_ops *ops,
+			struct xt_table **res)
 {
 	int ret;
 	struct xt_table_info *newinfo;
@@ -1791,10 +1811,8 @@ struct xt_table *arpt_register_table(struct net *net,
 	struct xt_table *new_table;
 
 	newinfo = xt_alloc_table_info(repl->size);
-	if (!newinfo) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!newinfo)
+		return -ENOMEM;
 
 	loc_cpu_entry = newinfo->entries;
 	memcpy(loc_cpu_entry, repl->entries, repl->size);
@@ -1809,30 +1827,28 @@ struct xt_table *arpt_register_table(struct net *net,
 		ret = PTR_ERR(new_table);
 		goto out_free;
 	}
-	return new_table;
+
+	/* set res now, will see skbs right after nf_register_net_hooks */
+	WRITE_ONCE(*res, new_table);
+
+	ret = nf_register_net_hooks(net, ops, hweight32(table->valid_hooks));
+	if (ret != 0) {
+		__arpt_unregister_table(new_table);
+		*res = NULL;
+	}
+
+	return ret;
 
 out_free:
 	xt_free_table_info(newinfo);
-out:
-	return ERR_PTR(ret);
+	return ret;
 }
 
-void arpt_unregister_table(struct xt_table *table)
+void arpt_unregister_table(struct net *net, struct xt_table *table,
+			   const struct nf_hook_ops *ops)
 {
-	struct xt_table_info *private;
-	void *loc_cpu_entry;
-	struct module *table_owner = table->me;
-	struct arpt_entry *iter;
-
-	private = xt_unregister_table(table);
-
-	/* Decrease module usage counts and free resources */
-	loc_cpu_entry = private->entries;
-	xt_entry_foreach(iter, loc_cpu_entry, private->size)
-		cleanup_entry(iter);
-	if (private->number > private->initial_entries)
-		module_put(table_owner);
-	xt_free_table_info(private);
+	nf_unregister_net_hooks(net, ops, hweight32(table->valid_hooks));
+	__arpt_unregister_table(table);
 }
 
 /* The built-in targets: standard (NULL) and error. */
diff --git a/net/ipv4/netfilter/arptable_filter.c b/net/ipv4/netfilter/arptable_filter.c
index 1897ee1..3f8188c 100644
--- a/net/ipv4/netfilter/arptable_filter.c
+++ b/net/ipv4/netfilter/arptable_filter.c
@@ -17,12 +17,15 @@ MODULE_DESCRIPTION("arptables filter table");
 #define FILTER_VALID_HOOKS ((1 << NF_ARP_IN) | (1 << NF_ARP_OUT) | \
 			   (1 << NF_ARP_FORWARD))
 
+static int __net_init arptable_filter_table_init(struct net *net);
+
 static const struct xt_table packet_filter = {
 	.name		= "filter",
 	.valid_hooks	= FILTER_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_ARP,
 	.priority	= NF_IP_PRI_FILTER,
+	.table_init	= arptable_filter_table_init,
 };
 
 /* The work comes in here from netfilter.c */
@@ -35,26 +38,31 @@ arptable_filter_hook(void *priv, struct sk_buff *skb,
 
 static struct nf_hook_ops *arpfilter_ops __read_mostly;
 
-static int __net_init arptable_filter_net_init(struct net *net)
+static int __net_init arptable_filter_table_init(struct net *net)
 {
 	struct arpt_replace *repl;
-	
+	int err = 0;
+
+	if (net->ipv4.arptable_filter)
+		return 0;
+
 	repl = arpt_alloc_initial_table(&packet_filter);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv4.arptable_filter =
-		arpt_register_table(net, &packet_filter, repl);
+	err = arpt_register_table(net, &packet_filter, repl, arpfilter_ops,
+				  &net->ipv4.arptable_filter);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv4.arptable_filter);
+	return err;
 }
 
 static void __net_exit arptable_filter_net_exit(struct net *net)
 {
-	arpt_unregister_table(net->ipv4.arptable_filter);
+	if (net->ipv4.arptable_filter)
+		arpt_unregister_table(net, net->ipv4.arptable_filter, arpfilter_ops);
+	net->ipv4.arptable_filter = NULL;
 }
 
 static struct pernet_operations arptable_filter_net_ops = {
-	.init = arptable_filter_net_init,
 	.exit = arptable_filter_net_exit,
 };
 
@@ -62,26 +70,23 @@ static int __init arptable_filter_init(void)
 {
 	int ret;
 
+	arpfilter_ops = xt_hook_ops_alloc(&packet_filter, arptable_filter_hook);
+	if (IS_ERR(arpfilter_ops))
+		return PTR_ERR(arpfilter_ops);
+
 	ret = register_pernet_subsys(&arptable_filter_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(arpfilter_ops);
 		return ret;
-
-	arpfilter_ops = xt_hook_link(&packet_filter, arptable_filter_hook);
-	if (IS_ERR(arpfilter_ops)) {
-		ret = PTR_ERR(arpfilter_ops);
-		goto cleanup_table;
 	}
-	return ret;
 
-cleanup_table:
-	unregister_pernet_subsys(&arptable_filter_net_ops);
 	return ret;
 }
 
 static void __exit arptable_filter_fini(void)
 {
-	xt_hook_unlink(&packet_filter, arpfilter_ops);
 	unregister_pernet_subsys(&arptable_filter_net_ops);
+	kfree(arpfilter_ops);
 }
 
 module_init(arptable_filter_init);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index b99affa..e53f8d6 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -2062,9 +2062,27 @@ do_ipt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	return ret;
 }
 
-struct xt_table *ipt_register_table(struct net *net,
-				    const struct xt_table *table,
-				    const struct ipt_replace *repl)
+static void __ipt_unregister_table(struct net *net, struct xt_table *table)
+{
+	struct xt_table_info *private;
+	void *loc_cpu_entry;
+	struct module *table_owner = table->me;
+	struct ipt_entry *iter;
+
+	private = xt_unregister_table(table);
+
+	/* Decrease module usage counts and free resources */
+	loc_cpu_entry = private->entries;
+	xt_entry_foreach(iter, loc_cpu_entry, private->size)
+		cleanup_entry(iter, net);
+	if (private->number > private->initial_entries)
+		module_put(table_owner);
+	xt_free_table_info(private);
+}
+
+int ipt_register_table(struct net *net, const struct xt_table *table,
+		       const struct ipt_replace *repl,
+		       const struct nf_hook_ops *ops, struct xt_table **res)
 {
 	int ret;
 	struct xt_table_info *newinfo;
@@ -2073,10 +2091,8 @@ struct xt_table *ipt_register_table(struct net *net,
 	struct xt_table *new_table;
 
 	newinfo = xt_alloc_table_info(repl->size);
-	if (!newinfo) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!newinfo)
+		return -ENOMEM;
 
 	loc_cpu_entry = newinfo->entries;
 	memcpy(loc_cpu_entry, repl->entries, repl->size);
@@ -2091,30 +2107,27 @@ struct xt_table *ipt_register_table(struct net *net,
 		goto out_free;
 	}
 
-	return new_table;
+	/* set res now, will see skbs right after nf_register_net_hooks */
+	WRITE_ONCE(*res, new_table);
+
+	ret = nf_register_net_hooks(net, ops, hweight32(table->valid_hooks));
+	if (ret != 0) {
+		__ipt_unregister_table(net, new_table);
+		*res = NULL;
+	}
+
+	return ret;
 
 out_free:
 	xt_free_table_info(newinfo);
-out:
-	return ERR_PTR(ret);
+	return ret;
 }
 
-void ipt_unregister_table(struct net *net, struct xt_table *table)
+void ipt_unregister_table(struct net *net, struct xt_table *table,
+			  const struct nf_hook_ops *ops)
 {
-	struct xt_table_info *private;
-	void *loc_cpu_entry;
-	struct module *table_owner = table->me;
-	struct ipt_entry *iter;
-
-	private = xt_unregister_table(table);
-
-	/* Decrease module usage counts and free resources */
-	loc_cpu_entry = private->entries;
-	xt_entry_foreach(iter, loc_cpu_entry, private->size)
-		cleanup_entry(iter, net);
-	if (private->number > private->initial_entries)
-		module_put(table_owner);
-	xt_free_table_info(private);
+	nf_unregister_net_hooks(net, ops, hweight32(table->valid_hooks));
+	__ipt_unregister_table(net, table);
 }
 
 /* Returns 1 if the type and code is matched by the range, 0 otherwise */
diff --git a/net/ipv4/netfilter/iptable_filter.c b/net/ipv4/netfilter/iptable_filter.c
index 397ef2d..648e5d5 100644
--- a/net/ipv4/netfilter/iptable_filter.c
+++ b/net/ipv4/netfilter/iptable_filter.c
@@ -24,12 +24,21 @@ MODULE_DESCRIPTION("iptables filter table");
 			    (1 << NF_INET_FORWARD) | \
 			    (1 << NF_INET_LOCAL_OUT))
 
+static struct nf_hook_ops *filter_ops __read_mostly;
+
+/* Default to forward because I got too much mail already. */
+static bool forward __read_mostly = true;
+module_param(forward, bool, 0000);
+
+static int __net_init iptable_filter_table_init(struct net *net);
+
 static const struct xt_table packet_filter = {
 	.name		= "filter",
 	.valid_hooks	= FILTER_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV4,
 	.priority	= NF_IP_PRI_FILTER,
+	.table_init	= iptable_filter_table_init,
 };
 
 static unsigned int
@@ -45,15 +54,13 @@ iptable_filter_hook(void *priv, struct sk_buff *skb,
 	return ipt_do_table(skb, state, state->net->ipv4.iptable_filter);
 }
 
-static struct nf_hook_ops *filter_ops __read_mostly;
-
-/* Default to forward because I got too much mail already. */
-static bool forward = true;
-module_param(forward, bool, 0000);
-
-static int __net_init iptable_filter_net_init(struct net *net)
+static int __net_init iptable_filter_table_init(struct net *net)
 {
 	struct ipt_replace *repl;
+	int err;
+
+	if (net->ipv4.iptable_filter)
+		return 0;
 
 	repl = ipt_alloc_initial_table(&packet_filter);
 	if (repl == NULL)
@@ -62,15 +69,26 @@ static int __net_init iptable_filter_net_init(struct net *net)
 	((struct ipt_standard *)repl->entries)[1].target.verdict =
 		forward ? -NF_ACCEPT - 1 : -NF_DROP - 1;
 
-	net->ipv4.iptable_filter =
-		ipt_register_table(net, &packet_filter, repl);
+	err = ipt_register_table(net, &packet_filter, repl, filter_ops,
+			         &net->ipv4.iptable_filter);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv4.iptable_filter);
+	return err;
+}
+
+static int __net_init iptable_filter_net_init(struct net *net)
+{
+	if (net == &init_net || !forward)
+		return iptable_filter_table_init(net);
+
+	return 0;
 }
 
 static void __net_exit iptable_filter_net_exit(struct net *net)
 {
-	ipt_unregister_table(net, net->ipv4.iptable_filter);
+	if (!net->ipv4.iptable_filter)
+		return;
+	ipt_unregister_table(net, net->ipv4.iptable_filter, filter_ops);
+	net->ipv4.iptable_filter = NULL;
 }
 
 static struct pernet_operations iptable_filter_net_ops = {
@@ -82,24 +100,21 @@ static int __init iptable_filter_init(void)
 {
 	int ret;
 
+	filter_ops = xt_hook_ops_alloc(&packet_filter, iptable_filter_hook);
+	if (IS_ERR(filter_ops))
+		return PTR_ERR(filter_ops);
+
 	ret = register_pernet_subsys(&iptable_filter_net_ops);
 	if (ret < 0)
-		return ret;
-
-	/* Register hooks */
-	filter_ops = xt_hook_link(&packet_filter, iptable_filter_hook);
-	if (IS_ERR(filter_ops)) {
-		ret = PTR_ERR(filter_ops);
-		unregister_pernet_subsys(&iptable_filter_net_ops);
-	}
+		kfree(filter_ops);
 
 	return ret;
 }
 
 static void __exit iptable_filter_fini(void)
 {
-	xt_hook_unlink(&packet_filter, filter_ops);
 	unregister_pernet_subsys(&iptable_filter_net_ops);
+	kfree(filter_ops);
 }
 
 module_init(iptable_filter_init);
diff --git a/net/ipv4/netfilter/iptable_mangle.c b/net/ipv4/netfilter/iptable_mangle.c
index ba5d392..57fc97c 100644
--- a/net/ipv4/netfilter/iptable_mangle.c
+++ b/net/ipv4/netfilter/iptable_mangle.c
@@ -28,12 +28,15 @@ MODULE_DESCRIPTION("iptables mangle table");
 			    (1 << NF_INET_LOCAL_OUT) | \
 			    (1 << NF_INET_POST_ROUTING))
 
+static int __net_init iptable_mangle_table_init(struct net *net);
+
 static const struct xt_table packet_mangler = {
 	.name		= "mangle",
 	.valid_hooks	= MANGLE_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV4,
 	.priority	= NF_IP_PRI_MANGLE,
+	.table_init	= iptable_mangle_table_init,
 };
 
 static unsigned int
@@ -92,27 +95,32 @@ iptable_mangle_hook(void *priv,
 }
 
 static struct nf_hook_ops *mangle_ops __read_mostly;
-
-static int __net_init iptable_mangle_net_init(struct net *net)
+static int __net_init iptable_mangle_table_init(struct net *net)
 {
 	struct ipt_replace *repl;
+	int ret;
+
+	if (net->ipv4.iptable_mangle)
+		return 0;
 
 	repl = ipt_alloc_initial_table(&packet_mangler);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv4.iptable_mangle =
-		ipt_register_table(net, &packet_mangler, repl);
+	ret = ipt_register_table(net, &packet_mangler, repl, mangle_ops,
+				 &net->ipv4.iptable_mangle);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv4.iptable_mangle);
+	return ret;
 }
 
 static void __net_exit iptable_mangle_net_exit(struct net *net)
 {
-	ipt_unregister_table(net, net->ipv4.iptable_mangle);
+	if (!net->ipv4.iptable_mangle)
+		return;
+	ipt_unregister_table(net, net->ipv4.iptable_mangle, mangle_ops);
+	net->ipv4.iptable_mangle = NULL;
 }
 
 static struct pernet_operations iptable_mangle_net_ops = {
-	.init = iptable_mangle_net_init,
 	.exit = iptable_mangle_net_exit,
 };
 
@@ -120,15 +128,22 @@ static int __init iptable_mangle_init(void)
 {
 	int ret;
 
+	mangle_ops = xt_hook_ops_alloc(&packet_mangler, iptable_mangle_hook);
+	if (IS_ERR(mangle_ops)) {
+		ret = PTR_ERR(mangle_ops);
+		return ret;
+	}
+
 	ret = register_pernet_subsys(&iptable_mangle_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(mangle_ops);
 		return ret;
+	}
 
-	/* Register hooks */
-	mangle_ops = xt_hook_link(&packet_mangler, iptable_mangle_hook);
-	if (IS_ERR(mangle_ops)) {
-		ret = PTR_ERR(mangle_ops);
+	ret = iptable_mangle_table_init(&init_net);
+	if (ret) {
 		unregister_pernet_subsys(&iptable_mangle_net_ops);
+		kfree(mangle_ops);
 	}
 
 	return ret;
@@ -136,8 +151,8 @@ static int __init iptable_mangle_init(void)
 
 static void __exit iptable_mangle_fini(void)
 {
-	xt_hook_unlink(&packet_mangler, mangle_ops);
 	unregister_pernet_subsys(&iptable_mangle_net_ops);
+	kfree(mangle_ops);
 }
 
 module_init(iptable_mangle_init);
diff --git a/net/ipv4/netfilter/iptable_nat.c b/net/ipv4/netfilter/iptable_nat.c
index ae2cd27..138a24b 100644
--- a/net/ipv4/netfilter/iptable_nat.c
+++ b/net/ipv4/netfilter/iptable_nat.c
@@ -18,6 +18,8 @@
 #include <net/netfilter/nf_nat_core.h>
 #include <net/netfilter/nf_nat_l3proto.h>
 
+static int __net_init iptable_nat_table_init(struct net *net);
+
 static const struct xt_table nf_nat_ipv4_table = {
 	.name		= "nat",
 	.valid_hooks	= (1 << NF_INET_PRE_ROUTING) |
@@ -26,6 +28,7 @@ static const struct xt_table nf_nat_ipv4_table = {
 			  (1 << NF_INET_LOCAL_IN),
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV4,
+	.table_init	= iptable_nat_table_init,
 };
 
 static unsigned int iptable_nat_do_chain(void *priv,
@@ -95,50 +98,50 @@ static struct nf_hook_ops nf_nat_ipv4_ops[] __read_mostly = {
 	},
 };
 
-static int __net_init iptable_nat_net_init(struct net *net)
+static int __net_init iptable_nat_table_init(struct net *net)
 {
 	struct ipt_replace *repl;
+	int ret;
+
+	if (net->ipv4.nat_table)
+		return 0;
 
 	repl = ipt_alloc_initial_table(&nf_nat_ipv4_table);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv4.nat_table = ipt_register_table(net, &nf_nat_ipv4_table, repl);
+	ret = ipt_register_table(net, &nf_nat_ipv4_table, repl,
+				 nf_nat_ipv4_ops, &net->ipv4.nat_table);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv4.nat_table);
+	return ret;
 }
 
 static void __net_exit iptable_nat_net_exit(struct net *net)
 {
-	ipt_unregister_table(net, net->ipv4.nat_table);
+	if (!net->ipv4.nat_table)
+		return;
+	ipt_unregister_table(net, net->ipv4.nat_table, nf_nat_ipv4_ops);
+	net->ipv4.nat_table = NULL;
 }
 
 static struct pernet_operations iptable_nat_net_ops = {
-	.init	= iptable_nat_net_init,
 	.exit	= iptable_nat_net_exit,
 };
 
 static int __init iptable_nat_init(void)
 {
-	int err;
+	int ret = register_pernet_subsys(&iptable_nat_net_ops);
 
-	err = register_pernet_subsys(&iptable_nat_net_ops);
-	if (err < 0)
-		goto err1;
+	if (ret)
+		return ret;
 
-	err = nf_register_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
-	if (err < 0)
-		goto err2;
-	return 0;
-
-err2:
-	unregister_pernet_subsys(&iptable_nat_net_ops);
-err1:
-	return err;
+	ret = iptable_nat_table_init(&init_net);
+	if (ret)
+		unregister_pernet_subsys(&iptable_nat_net_ops);
+	return ret;
 }
 
 static void __exit iptable_nat_exit(void)
 {
-	nf_unregister_hooks(nf_nat_ipv4_ops, ARRAY_SIZE(nf_nat_ipv4_ops));
 	unregister_pernet_subsys(&iptable_nat_net_ops);
 }
 
diff --git a/net/ipv4/netfilter/iptable_raw.c b/net/ipv4/netfilter/iptable_raw.c
index 1ba0281..ca83ed8 100644
--- a/net/ipv4/netfilter/iptable_raw.c
+++ b/net/ipv4/netfilter/iptable_raw.c
@@ -10,12 +10,17 @@
 
 #define RAW_VALID_HOOKS ((1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_OUT))
 
+static struct nf_hook_ops *rawtable_ops __read_mostly;
+
+static int __net_init iptable_raw_table_init(struct net *net);
+
 static const struct xt_table packet_raw = {
 	.name = "raw",
 	.valid_hooks =  RAW_VALID_HOOKS,
 	.me = THIS_MODULE,
 	.af = NFPROTO_IPV4,
 	.priority = NF_IP_PRI_RAW,
+	.table_init = iptable_raw_table_init,
 };
 
 /* The work comes in here from netfilter.c. */
@@ -32,28 +37,32 @@ iptable_raw_hook(void *priv, struct sk_buff *skb,
 	return ipt_do_table(skb, state, state->net->ipv4.iptable_raw);
 }
 
-static struct nf_hook_ops *rawtable_ops __read_mostly;
-
-static int __net_init iptable_raw_net_init(struct net *net)
+static int __net_init iptable_raw_table_init(struct net *net)
 {
 	struct ipt_replace *repl;
+	int ret;
+
+	if (net->ipv4.iptable_raw)
+		return 0;
 
 	repl = ipt_alloc_initial_table(&packet_raw);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv4.iptable_raw =
-		ipt_register_table(net, &packet_raw, repl);
+	ret = ipt_register_table(net, &packet_raw, repl, rawtable_ops,
+				 &net->ipv4.iptable_raw);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv4.iptable_raw);
+	return ret;
 }
 
 static void __net_exit iptable_raw_net_exit(struct net *net)
 {
-	ipt_unregister_table(net, net->ipv4.iptable_raw);
+	if (!net->ipv4.iptable_raw)
+		return;
+	ipt_unregister_table(net, net->ipv4.iptable_raw, rawtable_ops);
+	net->ipv4.iptable_raw = NULL;
 }
 
 static struct pernet_operations iptable_raw_net_ops = {
-	.init = iptable_raw_net_init,
 	.exit = iptable_raw_net_exit,
 };
 
@@ -61,15 +70,20 @@ static int __init iptable_raw_init(void)
 {
 	int ret;
 
+	rawtable_ops = xt_hook_ops_alloc(&packet_raw, iptable_raw_hook);
+	if (IS_ERR(rawtable_ops))
+		return PTR_ERR(rawtable_ops);
+
 	ret = register_pernet_subsys(&iptable_raw_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(rawtable_ops);
 		return ret;
+	}
 
-	/* Register hooks */
-	rawtable_ops = xt_hook_link(&packet_raw, iptable_raw_hook);
-	if (IS_ERR(rawtable_ops)) {
-		ret = PTR_ERR(rawtable_ops);
+	ret = iptable_raw_table_init(&init_net);
+	if (ret) {
 		unregister_pernet_subsys(&iptable_raw_net_ops);
+		kfree(rawtable_ops);
 	}
 
 	return ret;
@@ -77,8 +91,8 @@ static int __init iptable_raw_init(void)
 
 static void __exit iptable_raw_fini(void)
 {
-	xt_hook_unlink(&packet_raw, rawtable_ops);
 	unregister_pernet_subsys(&iptable_raw_net_ops);
+	kfree(rawtable_ops);
 }
 
 module_init(iptable_raw_init);
diff --git a/net/ipv4/netfilter/iptable_security.c b/net/ipv4/netfilter/iptable_security.c
index c2e23d5..ff22659 100644
--- a/net/ipv4/netfilter/iptable_security.c
+++ b/net/ipv4/netfilter/iptable_security.c
@@ -28,12 +28,15 @@ MODULE_DESCRIPTION("iptables security table, for MAC rules");
 				(1 << NF_INET_FORWARD) | \
 				(1 << NF_INET_LOCAL_OUT)
 
+static int __net_init iptable_security_table_init(struct net *net);
+
 static const struct xt_table security_table = {
 	.name		= "security",
 	.valid_hooks	= SECURITY_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV4,
 	.priority	= NF_IP_PRI_SECURITY,
+	.table_init	= iptable_security_table_init,
 };
 
 static unsigned int
@@ -51,26 +54,33 @@ iptable_security_hook(void *priv, struct sk_buff *skb,
 
 static struct nf_hook_ops *sectbl_ops __read_mostly;
 
-static int __net_init iptable_security_net_init(struct net *net)
+static int __net_init iptable_security_table_init(struct net *net)
 {
 	struct ipt_replace *repl;
+	int ret;
+
+	if (net->ipv4.iptable_security)
+		return 0;
 
 	repl = ipt_alloc_initial_table(&security_table);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv4.iptable_security =
-		ipt_register_table(net, &security_table, repl);
+	ret = ipt_register_table(net, &security_table, repl, sectbl_ops,
+				 &net->ipv4.iptable_security);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv4.iptable_security);
+	return ret;
 }
 
 static void __net_exit iptable_security_net_exit(struct net *net)
 {
-	ipt_unregister_table(net, net->ipv4.iptable_security);
+	if (!net->ipv4.iptable_security)
+		return;
+
+	ipt_unregister_table(net, net->ipv4.iptable_security, sectbl_ops);
+	net->ipv4.iptable_security = NULL;
 }
 
 static struct pernet_operations iptable_security_net_ops = {
-	.init = iptable_security_net_init,
 	.exit = iptable_security_net_exit,
 };
 
@@ -78,27 +88,29 @@ static int __init iptable_security_init(void)
 {
 	int ret;
 
+	sectbl_ops = xt_hook_ops_alloc(&security_table, iptable_security_hook);
+	if (IS_ERR(sectbl_ops))
+		return PTR_ERR(sectbl_ops);
+
 	ret = register_pernet_subsys(&iptable_security_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(sectbl_ops);
 		return ret;
-
-	sectbl_ops = xt_hook_link(&security_table, iptable_security_hook);
-	if (IS_ERR(sectbl_ops)) {
-		ret = PTR_ERR(sectbl_ops);
-		goto cleanup_table;
 	}
 
-	return ret;
+	ret = iptable_security_table_init(&init_net);
+	if (ret) {
+		unregister_pernet_subsys(&iptable_security_net_ops);
+		kfree(sectbl_ops);
+	}
 
-cleanup_table:
-	unregister_pernet_subsys(&iptable_security_net_ops);
 	return ret;
 }
 
 static void __exit iptable_security_fini(void)
 {
-	xt_hook_unlink(&security_table, sectbl_ops);
 	unregister_pernet_subsys(&iptable_security_net_ops);
+	kfree(sectbl_ops);
 }
 
 module_init(iptable_security_init);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 99425cf..84f9baf 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -2071,9 +2071,28 @@ do_ip6t_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	return ret;
 }
 
-struct xt_table *ip6t_register_table(struct net *net,
-				     const struct xt_table *table,
-				     const struct ip6t_replace *repl)
+static void __ip6t_unregister_table(struct net *net, struct xt_table *table)
+{
+	struct xt_table_info *private;
+	void *loc_cpu_entry;
+	struct module *table_owner = table->me;
+	struct ip6t_entry *iter;
+
+	private = xt_unregister_table(table);
+
+	/* Decrease module usage counts and free resources */
+	loc_cpu_entry = private->entries;
+	xt_entry_foreach(iter, loc_cpu_entry, private->size)
+		cleanup_entry(iter, net);
+	if (private->number > private->initial_entries)
+		module_put(table_owner);
+	xt_free_table_info(private);
+}
+
+int ip6t_register_table(struct net *net, const struct xt_table *table,
+			const struct ip6t_replace *repl,
+			const struct nf_hook_ops *ops,
+			struct xt_table **res)
 {
 	int ret;
 	struct xt_table_info *newinfo;
@@ -2082,10 +2101,8 @@ struct xt_table *ip6t_register_table(struct net *net,
 	struct xt_table *new_table;
 
 	newinfo = xt_alloc_table_info(repl->size);
-	if (!newinfo) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!newinfo)
+		return -ENOMEM;
 
 	loc_cpu_entry = newinfo->entries;
 	memcpy(loc_cpu_entry, repl->entries, repl->size);
@@ -2099,30 +2116,28 @@ struct xt_table *ip6t_register_table(struct net *net,
 		ret = PTR_ERR(new_table);
 		goto out_free;
 	}
-	return new_table;
+
+	/* set res now, will see skbs right after nf_register_net_hooks */
+	WRITE_ONCE(*res, new_table);
+
+	ret = nf_register_net_hooks(net, ops, hweight32(table->valid_hooks));
+	if (ret != 0) {
+		__ip6t_unregister_table(net, new_table);
+		*res = NULL;
+	}
+
+	return ret;
 
 out_free:
 	xt_free_table_info(newinfo);
-out:
-	return ERR_PTR(ret);
+	return ret;
 }
 
-void ip6t_unregister_table(struct net *net, struct xt_table *table)
+void ip6t_unregister_table(struct net *net, struct xt_table *table,
+			   const struct nf_hook_ops *ops)
 {
-	struct xt_table_info *private;
-	void *loc_cpu_entry;
-	struct module *table_owner = table->me;
-	struct ip6t_entry *iter;
-
-	private = xt_unregister_table(table);
-
-	/* Decrease module usage counts and free resources */
-	loc_cpu_entry = private->entries;
-	xt_entry_foreach(iter, loc_cpu_entry, private->size)
-		cleanup_entry(iter, net);
-	if (private->number > private->initial_entries)
-		module_put(table_owner);
-	xt_free_table_info(private);
+	nf_unregister_net_hooks(net, ops, hweight32(table->valid_hooks));
+	__ip6t_unregister_table(net, table);
 }
 
 /* Returns 1 if the type and code is matched by the range, 0 otherwise */
diff --git a/net/ipv6/netfilter/ip6table_filter.c b/net/ipv6/netfilter/ip6table_filter.c
index 8b277b9..1343077 100644
--- a/net/ipv6/netfilter/ip6table_filter.c
+++ b/net/ipv6/netfilter/ip6table_filter.c
@@ -22,12 +22,15 @@ MODULE_DESCRIPTION("ip6tables filter table");
 			    (1 << NF_INET_FORWARD) | \
 			    (1 << NF_INET_LOCAL_OUT))
 
+static int __net_init ip6table_filter_table_init(struct net *net);
+
 static const struct xt_table packet_filter = {
 	.name		= "filter",
 	.valid_hooks	= FILTER_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV6,
 	.priority	= NF_IP6_PRI_FILTER,
+	.table_init	= ip6table_filter_table_init,
 };
 
 /* The work comes in here from netfilter.c. */
@@ -44,9 +47,13 @@ static struct nf_hook_ops *filter_ops __read_mostly;
 static bool forward = true;
 module_param(forward, bool, 0000);
 
-static int __net_init ip6table_filter_net_init(struct net *net)
+static int __net_init ip6table_filter_table_init(struct net *net)
 {
 	struct ip6t_replace *repl;
+	int err;
+
+	if (net->ipv6.ip6table_filter)
+		return 0;
 
 	repl = ip6t_alloc_initial_table(&packet_filter);
 	if (repl == NULL)
@@ -55,15 +62,26 @@ static int __net_init ip6table_filter_net_init(struct net *net)
 	((struct ip6t_standard *)repl->entries)[1].target.verdict =
 		forward ? -NF_ACCEPT - 1 : -NF_DROP - 1;
 
-	net->ipv6.ip6table_filter =
-		ip6t_register_table(net, &packet_filter, repl);
+	err = ip6t_register_table(net, &packet_filter, repl, filter_ops,
+				  &net->ipv6.ip6table_filter);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv6.ip6table_filter);
+	return err;
+}
+
+static int __net_init ip6table_filter_net_init(struct net *net)
+{
+	if (net == &init_net || !forward)
+		return ip6table_filter_table_init(net);
+
+	return 0;
 }
 
 static void __net_exit ip6table_filter_net_exit(struct net *net)
 {
-	ip6t_unregister_table(net, net->ipv6.ip6table_filter);
+	if (!net->ipv6.ip6table_filter)
+		return;
+	ip6t_unregister_table(net, net->ipv6.ip6table_filter, filter_ops);
+	net->ipv6.ip6table_filter = NULL;
 }
 
 static struct pernet_operations ip6table_filter_net_ops = {
@@ -75,28 +93,21 @@ static int __init ip6table_filter_init(void)
 {
 	int ret;
 
+	filter_ops = xt_hook_ops_alloc(&packet_filter, ip6table_filter_hook);
+	if (IS_ERR(filter_ops))
+		return PTR_ERR(filter_ops);
+
 	ret = register_pernet_subsys(&ip6table_filter_net_ops);
 	if (ret < 0)
-		return ret;
-
-	/* Register hooks */
-	filter_ops = xt_hook_link(&packet_filter, ip6table_filter_hook);
-	if (IS_ERR(filter_ops)) {
-		ret = PTR_ERR(filter_ops);
-		goto cleanup_table;
-	}
+		kfree(filter_ops);
 
 	return ret;
-
- cleanup_table:
-	unregister_pernet_subsys(&ip6table_filter_net_ops);
-	return ret;
 }
 
 static void __exit ip6table_filter_fini(void)
 {
-	xt_hook_unlink(&packet_filter, filter_ops);
 	unregister_pernet_subsys(&ip6table_filter_net_ops);
+	kfree(filter_ops);
 }
 
 module_init(ip6table_filter_init);
diff --git a/net/ipv6/netfilter/ip6table_mangle.c b/net/ipv6/netfilter/ip6table_mangle.c
index abe278b..bc09dc8 100644
--- a/net/ipv6/netfilter/ip6table_mangle.c
+++ b/net/ipv6/netfilter/ip6table_mangle.c
@@ -23,12 +23,14 @@ MODULE_DESCRIPTION("ip6tables mangle table");
 			    (1 << NF_INET_LOCAL_OUT) | \
 			    (1 << NF_INET_POST_ROUTING))
 
+static int __net_init ip6table_mangle_table_init(struct net *net);
 static const struct xt_table packet_mangler = {
 	.name		= "mangle",
 	.valid_hooks	= MANGLE_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV6,
 	.priority	= NF_IP6_PRI_MANGLE,
+	.table_init	= ip6table_mangle_table_init,
 };
 
 static unsigned int
@@ -88,26 +90,33 @@ ip6table_mangle_hook(void *priv, struct sk_buff *skb,
 }
 
 static struct nf_hook_ops *mangle_ops __read_mostly;
-static int __net_init ip6table_mangle_net_init(struct net *net)
+static int __net_init ip6table_mangle_table_init(struct net *net)
 {
 	struct ip6t_replace *repl;
+	int ret;
+
+	if (net->ipv6.ip6table_mangle)
+		return 0;
 
 	repl = ip6t_alloc_initial_table(&packet_mangler);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv6.ip6table_mangle =
-		ip6t_register_table(net, &packet_mangler, repl);
+	ret = ip6t_register_table(net, &packet_mangler, repl, mangle_ops,
+				  &net->ipv6.ip6table_mangle);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv6.ip6table_mangle);
+	return ret;
 }
 
 static void __net_exit ip6table_mangle_net_exit(struct net *net)
 {
-	ip6t_unregister_table(net, net->ipv6.ip6table_mangle);
+	if (!net->ipv6.ip6table_mangle)
+		return;
+
+	ip6t_unregister_table(net, net->ipv6.ip6table_mangle, mangle_ops);
+	net->ipv6.ip6table_mangle = NULL;
 }
 
 static struct pernet_operations ip6table_mangle_net_ops = {
-	.init = ip6table_mangle_net_init,
 	.exit = ip6table_mangle_net_exit,
 };
 
@@ -115,28 +124,28 @@ static int __init ip6table_mangle_init(void)
 {
 	int ret;
 
+	mangle_ops = xt_hook_ops_alloc(&packet_mangler, ip6table_mangle_hook);
+	if (IS_ERR(mangle_ops))
+		return PTR_ERR(mangle_ops);
+
 	ret = register_pernet_subsys(&ip6table_mangle_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(mangle_ops);
 		return ret;
-
-	/* Register hooks */
-	mangle_ops = xt_hook_link(&packet_mangler, ip6table_mangle_hook);
-	if (IS_ERR(mangle_ops)) {
-		ret = PTR_ERR(mangle_ops);
-		goto cleanup_table;
 	}
 
-	return ret;
-
- cleanup_table:
-	unregister_pernet_subsys(&ip6table_mangle_net_ops);
+	ret = ip6table_mangle_table_init(&init_net);
+	if (ret) {
+		unregister_pernet_subsys(&ip6table_mangle_net_ops);
+		kfree(mangle_ops);
+	}
 	return ret;
 }
 
 static void __exit ip6table_mangle_fini(void)
 {
-	xt_hook_unlink(&packet_mangler, mangle_ops);
 	unregister_pernet_subsys(&ip6table_mangle_net_ops);
+	kfree(mangle_ops);
 }
 
 module_init(ip6table_mangle_init);
diff --git a/net/ipv6/netfilter/ip6table_nat.c b/net/ipv6/netfilter/ip6table_nat.c
index de2a10a..7d2bd94 100644
--- a/net/ipv6/netfilter/ip6table_nat.c
+++ b/net/ipv6/netfilter/ip6table_nat.c
@@ -20,6 +20,8 @@
 #include <net/netfilter/nf_nat_core.h>
 #include <net/netfilter/nf_nat_l3proto.h>
 
+static int __net_init ip6table_nat_table_init(struct net *net);
+
 static const struct xt_table nf_nat_ipv6_table = {
 	.name		= "nat",
 	.valid_hooks	= (1 << NF_INET_PRE_ROUTING) |
@@ -28,6 +30,7 @@ static const struct xt_table nf_nat_ipv6_table = {
 			  (1 << NF_INET_LOCAL_IN),
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV6,
+	.table_init	= ip6table_nat_table_init,
 };
 
 static unsigned int ip6table_nat_do_chain(void *priv,
@@ -97,50 +100,50 @@ static struct nf_hook_ops nf_nat_ipv6_ops[] __read_mostly = {
 	},
 };
 
-static int __net_init ip6table_nat_net_init(struct net *net)
+static int __net_init ip6table_nat_table_init(struct net *net)
 {
 	struct ip6t_replace *repl;
+	int ret;
+
+	if (net->ipv6.ip6table_nat)
+		return 0;
 
 	repl = ip6t_alloc_initial_table(&nf_nat_ipv6_table);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv6.ip6table_nat = ip6t_register_table(net, &nf_nat_ipv6_table, repl);
+	ret = ip6t_register_table(net, &nf_nat_ipv6_table, repl,
+				  nf_nat_ipv6_ops, &net->ipv6.ip6table_nat);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv6.ip6table_nat);
+	return ret;
 }
 
 static void __net_exit ip6table_nat_net_exit(struct net *net)
 {
-	ip6t_unregister_table(net, net->ipv6.ip6table_nat);
+	if (!net->ipv6.ip6table_nat)
+		return;
+	ip6t_unregister_table(net, net->ipv6.ip6table_nat, nf_nat_ipv6_ops);
+	net->ipv6.ip6table_nat = NULL;
 }
 
 static struct pernet_operations ip6table_nat_net_ops = {
-	.init	= ip6table_nat_net_init,
 	.exit	= ip6table_nat_net_exit,
 };
 
 static int __init ip6table_nat_init(void)
 {
-	int err;
+	int ret = register_pernet_subsys(&ip6table_nat_net_ops);
 
-	err = register_pernet_subsys(&ip6table_nat_net_ops);
-	if (err < 0)
-		goto err1;
+	if (ret)
+		return ret;
 
-	err = nf_register_hooks(nf_nat_ipv6_ops, ARRAY_SIZE(nf_nat_ipv6_ops));
-	if (err < 0)
-		goto err2;
-	return 0;
-
-err2:
-	unregister_pernet_subsys(&ip6table_nat_net_ops);
-err1:
-	return err;
+	ret = ip6table_nat_table_init(&init_net);
+	if (ret)
+		unregister_pernet_subsys(&ip6table_nat_net_ops);
+	return ret;
 }
 
 static void __exit ip6table_nat_exit(void)
 {
-	nf_unregister_hooks(nf_nat_ipv6_ops, ARRAY_SIZE(nf_nat_ipv6_ops));
 	unregister_pernet_subsys(&ip6table_nat_net_ops);
 }
 
diff --git a/net/ipv6/netfilter/ip6table_raw.c b/net/ipv6/netfilter/ip6table_raw.c
index 9021963..d4bc564 100644
--- a/net/ipv6/netfilter/ip6table_raw.c
+++ b/net/ipv6/netfilter/ip6table_raw.c
@@ -9,12 +9,15 @@
 
 #define RAW_VALID_HOOKS ((1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_OUT))
 
+static int __net_init ip6table_raw_table_init(struct net *net);
+
 static const struct xt_table packet_raw = {
 	.name = "raw",
 	.valid_hooks = RAW_VALID_HOOKS,
 	.me = THIS_MODULE,
 	.af = NFPROTO_IPV6,
 	.priority = NF_IP6_PRI_RAW,
+	.table_init = ip6table_raw_table_init,
 };
 
 /* The work comes in here from netfilter.c. */
@@ -27,26 +30,32 @@ ip6table_raw_hook(void *priv, struct sk_buff *skb,
 
 static struct nf_hook_ops *rawtable_ops __read_mostly;
 
-static int __net_init ip6table_raw_net_init(struct net *net)
+static int __net_init ip6table_raw_table_init(struct net *net)
 {
 	struct ip6t_replace *repl;
+	int ret;
+
+	if (net->ipv6.ip6table_raw)
+		return 0;
 
 	repl = ip6t_alloc_initial_table(&packet_raw);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv6.ip6table_raw =
-		ip6t_register_table(net, &packet_raw, repl);
+	ret = ip6t_register_table(net, &packet_raw, repl, rawtable_ops,
+				  &net->ipv6.ip6table_raw);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv6.ip6table_raw);
+	return ret;
 }
 
 static void __net_exit ip6table_raw_net_exit(struct net *net)
 {
-	ip6t_unregister_table(net, net->ipv6.ip6table_raw);
+	if (!net->ipv6.ip6table_raw)
+		return;
+	ip6t_unregister_table(net, net->ipv6.ip6table_raw, rawtable_ops);
+	net->ipv6.ip6table_raw = NULL;
 }
 
 static struct pernet_operations ip6table_raw_net_ops = {
-	.init = ip6table_raw_net_init,
 	.exit = ip6table_raw_net_exit,
 };
 
@@ -54,28 +63,29 @@ static int __init ip6table_raw_init(void)
 {
 	int ret;
 
+	/* Register hooks */
+	rawtable_ops = xt_hook_ops_alloc(&packet_raw, ip6table_raw_hook);
+	if (IS_ERR(rawtable_ops))
+		return PTR_ERR(rawtable_ops);
+
 	ret = register_pernet_subsys(&ip6table_raw_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(rawtable_ops);
 		return ret;
-
-	/* Register hooks */
-	rawtable_ops = xt_hook_link(&packet_raw, ip6table_raw_hook);
-	if (IS_ERR(rawtable_ops)) {
-		ret = PTR_ERR(rawtable_ops);
-		goto cleanup_table;
 	}
 
-	return ret;
-
- cleanup_table:
-	unregister_pernet_subsys(&ip6table_raw_net_ops);
+	ret = ip6table_raw_table_init(&init_net);
+	if (ret) {
+		unregister_pernet_subsys(&ip6table_raw_net_ops);
+		kfree(rawtable_ops);
+	}
 	return ret;
 }
 
 static void __exit ip6table_raw_fini(void)
 {
-	xt_hook_unlink(&packet_raw, rawtable_ops);
 	unregister_pernet_subsys(&ip6table_raw_net_ops);
+	kfree(rawtable_ops);
 }
 
 module_init(ip6table_raw_init);
diff --git a/net/ipv6/netfilter/ip6table_security.c b/net/ipv6/netfilter/ip6table_security.c
index 0d856fe..cf26ccb 100644
--- a/net/ipv6/netfilter/ip6table_security.c
+++ b/net/ipv6/netfilter/ip6table_security.c
@@ -27,12 +27,15 @@ MODULE_DESCRIPTION("ip6tables security table, for MAC rules");
 				(1 << NF_INET_FORWARD) | \
 				(1 << NF_INET_LOCAL_OUT)
 
+static int __net_init ip6table_security_table_init(struct net *net);
+
 static const struct xt_table security_table = {
 	.name		= "security",
 	.valid_hooks	= SECURITY_VALID_HOOKS,
 	.me		= THIS_MODULE,
 	.af		= NFPROTO_IPV6,
 	.priority	= NF_IP6_PRI_SECURITY,
+	.table_init     = ip6table_security_table_init,
 };
 
 static unsigned int
@@ -44,26 +47,32 @@ ip6table_security_hook(void *priv, struct sk_buff *skb,
 
 static struct nf_hook_ops *sectbl_ops __read_mostly;
 
-static int __net_init ip6table_security_net_init(struct net *net)
+static int __net_init ip6table_security_table_init(struct net *net)
 {
 	struct ip6t_replace *repl;
+	int ret;
+
+	if (net->ipv6.ip6table_security)
+		return 0;
 
 	repl = ip6t_alloc_initial_table(&security_table);
 	if (repl == NULL)
 		return -ENOMEM;
-	net->ipv6.ip6table_security =
-		ip6t_register_table(net, &security_table, repl);
+	ret = ip6t_register_table(net, &security_table, repl, sectbl_ops,
+				  &net->ipv6.ip6table_security);
 	kfree(repl);
-	return PTR_ERR_OR_ZERO(net->ipv6.ip6table_security);
+	return ret;
 }
 
 static void __net_exit ip6table_security_net_exit(struct net *net)
 {
-	ip6t_unregister_table(net, net->ipv6.ip6table_security);
+	if (!net->ipv6.ip6table_security)
+		return;
+	ip6t_unregister_table(net, net->ipv6.ip6table_security, sectbl_ops);
+	net->ipv6.ip6table_security = NULL;
 }
 
 static struct pernet_operations ip6table_security_net_ops = {
-	.init = ip6table_security_net_init,
 	.exit = ip6table_security_net_exit,
 };
 
@@ -71,27 +80,28 @@ static int __init ip6table_security_init(void)
 {
 	int ret;
 
+	sectbl_ops = xt_hook_ops_alloc(&security_table, ip6table_security_hook);
+	if (IS_ERR(sectbl_ops))
+		return PTR_ERR(sectbl_ops);
+
 	ret = register_pernet_subsys(&ip6table_security_net_ops);
-	if (ret < 0)
+	if (ret < 0) {
+		kfree(sectbl_ops);
 		return ret;
-
-	sectbl_ops = xt_hook_link(&security_table, ip6table_security_hook);
-	if (IS_ERR(sectbl_ops)) {
-		ret = PTR_ERR(sectbl_ops);
-		goto cleanup_table;
 	}
 
-	return ret;
-
-cleanup_table:
-	unregister_pernet_subsys(&ip6table_security_net_ops);
+	ret = ip6table_security_table_init(&init_net);
+	if (ret) {
+		unregister_pernet_subsys(&ip6table_security_net_ops);
+		kfree(sectbl_ops);
+	}
 	return ret;
 }
 
 static void __exit ip6table_security_fini(void)
 {
-	xt_hook_unlink(&security_table, sectbl_ops);
 	unregister_pernet_subsys(&ip6table_security_net_ops);
+	kfree(sectbl_ops);
 }
 
 module_init(ip6table_security_init);
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index d4aaad7..63d2896 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -693,12 +693,45 @@ EXPORT_SYMBOL(xt_free_table_info);
 struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
 				    const char *name)
 {
-	struct xt_table *t;
+	struct xt_table *t, *found = NULL;
 
 	mutex_lock(&xt[af].mutex);
 	list_for_each_entry(t, &net->xt.tables[af], list)
 		if (strcmp(t->name, name) == 0 && try_module_get(t->me))
 			return t;
+
+	if (net == &init_net)
+		goto out;
+
+	/* Table doesn't exist in this netns, re-try init */
+	list_for_each_entry(t, &init_net.xt.tables[af], list) {
+		if (strcmp(t->name, name))
+			continue;
+		if (!try_module_get(t->me))
+			goto out;
+
+		mutex_unlock(&xt[af].mutex);
+		if (t->table_init(net) != 0) {
+			module_put(t->me);
+			return NULL;
+		}
+
+		found = t;
+
+		mutex_lock(&xt[af].mutex);
+		break;
+	}
+
+	if (!found)
+		goto out;
+
+	/* and once again: */
+	list_for_each_entry(t, &net->xt.tables[af], list)
+		if (strcmp(t->name, name) == 0)
+			return t;
+
+	module_put(found->me);
+ out:
 	mutex_unlock(&xt[af].mutex);
 	return NULL;
 }
@@ -1169,20 +1202,20 @@ static const struct file_operations xt_target_ops = {
 #endif /* CONFIG_PROC_FS */
 
 /**
- * xt_hook_link - set up hooks for a new table
+ * xt_hook_ops_alloc - set up hooks for a new table
  * @table:	table with metadata needed to set up hooks
  * @fn:		Hook function
  *
- * This function will take care of creating and registering the necessary
- * Netfilter hooks for XT tables.
+ * This function will create the nf_hook_ops that the x_table needs
+ * to hand to xt_hook_link_net().
  */
-struct nf_hook_ops *xt_hook_link(const struct xt_table *table, nf_hookfn *fn)
+struct nf_hook_ops *
+xt_hook_ops_alloc(const struct xt_table *table, nf_hookfn *fn)
 {
 	unsigned int hook_mask = table->valid_hooks;
 	uint8_t i, num_hooks = hweight32(hook_mask);
 	uint8_t hooknum;
 	struct nf_hook_ops *ops;
-	int ret;
 
 	ops = kmalloc(sizeof(*ops) * num_hooks, GFP_KERNEL);
 	if (ops == NULL)
@@ -1199,27 +1232,9 @@ struct nf_hook_ops *xt_hook_link(const struct xt_table *table, nf_hookfn *fn)
 		++i;
 	}
 
-	ret = nf_register_hooks(ops, num_hooks);
-	if (ret < 0) {
-		kfree(ops);
-		return ERR_PTR(ret);
-	}
-
 	return ops;
 }
-EXPORT_SYMBOL_GPL(xt_hook_link);
-
-/**
- * xt_hook_unlink - remove hooks for a table
- * @ops:	nf_hook_ops array as returned by nf_hook_link
- * @hook_mask:	the very same mask that was passed to nf_hook_link
- */
-void xt_hook_unlink(const struct xt_table *table, struct nf_hook_ops *ops)
-{
-	nf_unregister_hooks(ops, hweight32(table->valid_hooks));
-	kfree(ops);
-}
-EXPORT_SYMBOL_GPL(xt_hook_unlink);
+EXPORT_SYMBOL_GPL(xt_hook_ops_alloc);
 
 int xt_proto_init(struct net *net, u_int8_t af)
 {
-- 
2.0.5



* [PATCH v2 nf-next 5/9] netfilter: defrag: only register defrag functionality if needed
From: Florian Westphal @ 2015-10-23 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

nf_defrag modules for ipv4 and ipv6 export an empty stub function.
Any module that needs the defragmentation hooks registered simply
'calls' this empty function to create a 'phony' module dependency --
modprobe magic will make sure the appropriate defrag module is loaded.

This patch delays registration of the defragmentation hooks until the
functionality is requested within a network namespace, instead of
registering them at module load time for all namespaces.

Hooks are only un-registered on module unload or when a namespace
that used such defrag functionality exits.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no changes since v1
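
 For readers who want the idea in isolation, the enable-on-first-use
 behaviour described above can be modelled in plain user-space C.
 This is only a sketch under stated assumptions, not the kernel code:
 struct ns_state, defrag_enable() and defrag_exit() are hypothetical
 stand-ins for the per-netns state, nf_defrag_ipv4_enable() and the
 pernet exit handler, and register_result merely simulates the return
 value of hook registration. The kernel version also serialises
 callers with a mutex, which this single-threaded model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-namespace defrag state. */
struct ns_state {
	bool enabled;		/* are the hooks registered in this namespace? */
	int register_calls;	/* how often registration was attempted */
	int register_result;	/* simulated result: 0 or a negative errno */
};

/* Register the hooks on first use; later calls are no-ops. */
int defrag_enable(struct ns_state *ns)
{
	int err;

	assert(ns != NULL);

	if (ns->enabled)
		return 0;

	ns->register_calls++;
	err = ns->register_result;	/* stands in for hook registration */
	if (err == 0)
		ns->enabled = true;

	return err;
}

/* Namespace teardown: unregister only if this namespace enabled defrag.
 * Returns true when hooks had to be removed.
 */
bool defrag_exit(struct ns_state *ns)
{
	assert(ns != NULL);

	if (!ns->enabled)
		return false;

	ns->enabled = false;
	return true;
}
```

 A namespace that never asks for defragmentation never gets hooks, and
 a failed registration leaves the flag clear so that a later request
 can try again.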

 include/net/netfilter/ipv4/nf_defrag_ipv4.h    |  3 +-
 include/net/netfilter/ipv6/nf_defrag_ipv6.h    |  3 +-
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  7 +++-
 net/ipv4/netfilter/nf_defrag_ipv4.c            | 49 +++++++++++++++++++++++--
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |  7 +++-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c      | 50 +++++++++++++++++++++++---
 net/netfilter/xt_TPROXY.c                      | 15 +++++---
 net/netfilter/xt_socket.c                      | 33 ++++++++++++++---
 8 files changed, 146 insertions(+), 21 deletions(-)

diff --git a/include/net/netfilter/ipv4/nf_defrag_ipv4.h b/include/net/netfilter/ipv4/nf_defrag_ipv4.h
index f01ef20..db405f7 100644
--- a/include/net/netfilter/ipv4/nf_defrag_ipv4.h
+++ b/include/net/netfilter/ipv4/nf_defrag_ipv4.h
@@ -1,6 +1,7 @@
 #ifndef _NF_DEFRAG_IPV4_H
 #define _NF_DEFRAG_IPV4_H
 
-void nf_defrag_ipv4_enable(void);
+struct net;
+int nf_defrag_ipv4_enable(struct net *);
 
 #endif /* _NF_DEFRAG_IPV4_H */
diff --git a/include/net/netfilter/ipv6/nf_defrag_ipv6.h b/include/net/netfilter/ipv6/nf_defrag_ipv6.h
index fb7da5b..aa8dfe6 100644
--- a/include/net/netfilter/ipv6/nf_defrag_ipv6.h
+++ b/include/net/netfilter/ipv6/nf_defrag_ipv6.h
@@ -1,7 +1,8 @@
 #ifndef _NF_DEFRAG_IPV6_H
 #define _NF_DEFRAG_IPV6_H
 
-void nf_defrag_ipv6_enable(void);
+struct net;
+int nf_defrag_ipv6_enable(struct net *);
 
 int nf_ct_frag6_init(void);
 void nf_ct_frag6_cleanup(void);
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index 2b51a04..d804620 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -385,6 +385,12 @@ static int nf_conntrack_l3proto_ipv4_hooks_register(struct net *net)
 	if (cnet->users > 1)
 		goto out_unlock;
 
+	err = nf_defrag_ipv4_enable(net);
+	if (err) {
+		cnet->users = 0;
+		goto out_unlock;
+	}
+
 	err = nf_register_net_hooks(net, ipv4_conntrack_ops,
 				    ARRAY_SIZE(ipv4_conntrack_ops));
 
@@ -490,7 +496,6 @@ static int __init nf_conntrack_l3proto_ipv4_init(void)
 	int ret = 0;
 
 	need_conntrack();
-	nf_defrag_ipv4_enable();
 
 	ret = nf_register_sockopt(&so_getorigdst);
 	if (ret < 0) {
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 0e5591c..e3df959 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -11,6 +11,7 @@
 #include <linux/netfilter.h>
 #include <linux/module.h>
 #include <linux/skbuff.h>
+#include <net/netns/generic.h>
 #include <net/route.h>
 #include <net/ip.h>
 
@@ -22,6 +23,13 @@
 #endif
 #include <net/netfilter/nf_conntrack_zones.h>
 
+static int defrag4_net_id __read_mostly;
+static DEFINE_MUTEX(defrag4_mutex);
+
+struct defrag4_net {
+	bool enabled;
+};
+
 static int nf_ct_ipv4_gather_frags(struct net *net, struct sk_buff *skb,
 				   u_int32_t user)
 {
@@ -107,18 +115,53 @@ static struct nf_hook_ops ipv4_defrag_ops[] = {
 	},
 };
 
+static void __net_exit defrag4_net_exit(struct net *net)
+{
+	struct defrag4_net *n = net_generic(net, defrag4_net_id);
+
+	if (n->enabled)
+		nf_unregister_net_hooks(net, ipv4_defrag_ops,
+					ARRAY_SIZE(ipv4_defrag_ops));
+}
+
+static struct pernet_operations defrag4_net_ops = {
+	.exit = defrag4_net_exit,
+	.id = &defrag4_net_id,
+	.size = sizeof(struct defrag4_net),
+};
+
 static int __init nf_defrag_init(void)
 {
-	return nf_register_hooks(ipv4_defrag_ops, ARRAY_SIZE(ipv4_defrag_ops));
+	return register_pernet_subsys(&defrag4_net_ops);
 }
 
 static void __exit nf_defrag_fini(void)
 {
-	nf_unregister_hooks(ipv4_defrag_ops, ARRAY_SIZE(ipv4_defrag_ops));
+	unregister_pernet_subsys(&defrag4_net_ops);
 }
 
-void nf_defrag_ipv4_enable(void)
+int nf_defrag_ipv4_enable(struct net *net)
 {
+	struct defrag4_net *n = net_generic(net, defrag4_net_id);
+	int err = 0;
+
+	might_sleep();
+
+	if (n->enabled)
+		return 0;
+
+	mutex_lock(&defrag4_mutex);
+	if (n->enabled)
+		goto out_unlock;
+
+	err = nf_register_net_hooks(net, ipv4_defrag_ops,
+				    ARRAY_SIZE(ipv4_defrag_ops));
+	if (err == 0)
+		n->enabled = true;
+
+ out_unlock:
+	mutex_unlock(&defrag4_mutex);
+	return err;
 }
 EXPORT_SYMBOL_GPL(nf_defrag_ipv4_enable);
 
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index 8916846..f3e7ca6 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -325,6 +325,12 @@ static int nf_conntrack_l3proto_ipv6_hooks_register(struct net *net)
 	if (cnet->users > 1)
 		goto out_unlock;
 
+	err = nf_defrag_ipv6_enable(net);
+	if (err < 0) {
+		cnet->users = 0;
+		goto out_unlock;
+	}
+
 	err = nf_register_net_hooks(net, ipv6_conntrack_ops,
 				    ARRAY_SIZE(ipv6_conntrack_ops));
 	if (err)
@@ -430,7 +436,6 @@ static int __init nf_conntrack_l3proto_ipv6_init(void)
 	int ret = 0;
 
 	need_conntrack();
-	nf_defrag_ipv6_enable();
 
 	ret = nf_register_sockopt(&so_getorigdst6);
 	if (ret < 0) {
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index 4fdbed5e..bd9077c 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -30,6 +30,13 @@
 #include <net/netfilter/nf_conntrack_zones.h>
 #include <net/netfilter/ipv6/nf_defrag_ipv6.h>
 
+static int defrag6_net_id __read_mostly;
+static DEFINE_MUTEX(defrag6_mutex);
+
+struct defrag6_net {
+	bool enabled;
+};
+
 static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
 						struct sk_buff *skb)
 {
@@ -97,6 +104,21 @@ static struct nf_hook_ops ipv6_defrag_ops[] = {
 	},
 };
 
+static void __net_exit defrag6_net_exit(struct net *net)
+{
+	struct defrag6_net *n = net_generic(net, defrag6_net_id);
+
+	if (n->enabled)
+		nf_unregister_net_hooks(net, ipv6_defrag_ops,
+					ARRAY_SIZE(ipv6_defrag_ops));
+}
+
+static struct pernet_operations defrag6_net_ops = {
+	.exit = defrag6_net_exit,
+	.id = &defrag6_net_id,
+	.size = sizeof(struct defrag6_net),
+};
+
 static int __init nf_defrag_init(void)
 {
 	int ret = 0;
@@ -106,9 +128,9 @@ static int __init nf_defrag_init(void)
 		pr_err("nf_defrag_ipv6: can't initialize frag6.\n");
 		return ret;
 	}
-	ret = nf_register_hooks(ipv6_defrag_ops, ARRAY_SIZE(ipv6_defrag_ops));
+	ret = register_pernet_subsys(&defrag6_net_ops);
 	if (ret < 0) {
-		pr_err("nf_defrag_ipv6: can't register hooks\n");
+		pr_err("nf_defrag_ipv6: can't register pernet ops\n");
 		goto cleanup_frag6;
 	}
 	return ret;
@@ -121,12 +143,32 @@ cleanup_frag6:
 
 static void __exit nf_defrag_fini(void)
 {
-	nf_unregister_hooks(ipv6_defrag_ops, ARRAY_SIZE(ipv6_defrag_ops));
+	unregister_pernet_subsys(&defrag6_net_ops);
 	nf_ct_frag6_cleanup();
 }
 
-void nf_defrag_ipv6_enable(void)
+int nf_defrag_ipv6_enable(struct net *net)
 {
+	struct defrag6_net *n = net_generic(net, defrag6_net_id);
+	int err = 0;
+
+	might_sleep();
+
+	if (n->enabled)
+		return 0;
+
+	mutex_lock(&defrag6_mutex);
+	if (n->enabled)
+		goto out_unlock;
+
+	err = nf_register_net_hooks(net, ipv6_defrag_ops,
+				    ARRAY_SIZE(ipv6_defrag_ops));
+	if (err == 0)
+		n->enabled = true;
+
+ out_unlock:
+	mutex_unlock(&defrag6_mutex);
+	return err;
 }
 EXPORT_SYMBOL_GPL(nf_defrag_ipv6_enable);
 
diff --git a/net/netfilter/xt_TPROXY.c b/net/netfilter/xt_TPROXY.c
index 3ab591e..f091244 100644
--- a/net/netfilter/xt_TPROXY.c
+++ b/net/netfilter/xt_TPROXY.c
@@ -516,6 +516,11 @@ tproxy_tg6_v1(struct sk_buff *skb, const struct xt_action_param *par)
 static int tproxy_tg6_check(const struct xt_tgchk_param *par)
 {
 	const struct ip6t_ip6 *i = par->entryinfo;
+	int err;
+
+	err = nf_defrag_ipv6_enable(par->net);
+	if (err)
+		return err;
 
 	if ((i->proto == IPPROTO_TCP || i->proto == IPPROTO_UDP) &&
 	    !(i->invflags & IP6T_INV_PROTO))
@@ -530,6 +535,11 @@ static int tproxy_tg6_check(const struct xt_tgchk_param *par)
 static int tproxy_tg4_check(const struct xt_tgchk_param *par)
 {
 	const struct ipt_ip *i = par->entryinfo;
+	int err;
+
+	err = nf_defrag_ipv4_enable(par->net);
+	if (err)
+		return err;
 
 	if ((i->proto == IPPROTO_TCP || i->proto == IPPROTO_UDP)
 	    && !(i->invflags & IPT_INV_PROTO))
@@ -581,11 +591,6 @@ static struct xt_target tproxy_tg_reg[] __read_mostly = {
 
 static int __init tproxy_tg_init(void)
 {
-	nf_defrag_ipv4_enable();
-#ifdef XT_TPROXY_HAVE_IPV6
-	nf_defrag_ipv6_enable();
-#endif
-
 	return xt_register_targets(tproxy_tg_reg, ARRAY_SIZE(tproxy_tg_reg));
 }
 
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 2ec08f0..d0f0064 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -418,9 +418,28 @@ socket_mt6_v1_v2_v3(const struct sk_buff *skb, struct xt_action_param *par)
 }
 #endif
 
+static int socket_mt_enable_defrag(struct net *net, int family)
+{
+	switch (family) {
+	case NFPROTO_IPV4:
+		return nf_defrag_ipv4_enable(net);
+#ifdef XT_SOCKET_HAVE_IPV6
+	case NFPROTO_IPV6:
+		return nf_defrag_ipv6_enable(net);
+#endif
+	}
+	WARN_ONCE(1, "Unknown family %d\n", family);
+	return 0;
+}
+
 static int socket_mt_v1_check(const struct xt_mtchk_param *par)
 {
 	const struct xt_socket_mtinfo1 *info = (struct xt_socket_mtinfo1 *) par->matchinfo;
+	int err;
+
+	err = socket_mt_enable_defrag(par->net, par->family);
+	if (err)
+		return err;
 
 	if (info->flags & ~XT_SOCKET_FLAGS_V1) {
 		pr_info("unknown flags 0x%x\n", info->flags & ~XT_SOCKET_FLAGS_V1);
@@ -432,6 +451,11 @@ static int socket_mt_v1_check(const struct xt_mtchk_param *par)
 static int socket_mt_v2_check(const struct xt_mtchk_param *par)
 {
 	const struct xt_socket_mtinfo2 *info = (struct xt_socket_mtinfo2 *) par->matchinfo;
+	int err;
+
+	err = socket_mt_enable_defrag(par->net, par->family);
+	if (err)
+		return err;
 
 	if (info->flags & ~XT_SOCKET_FLAGS_V2) {
 		pr_info("unknown flags 0x%x\n", info->flags & ~XT_SOCKET_FLAGS_V2);
@@ -444,7 +468,11 @@ static int socket_mt_v3_check(const struct xt_mtchk_param *par)
 {
 	const struct xt_socket_mtinfo3 *info =
 				    (struct xt_socket_mtinfo3 *)par->matchinfo;
+	int err;
 
+	err = socket_mt_enable_defrag(par->net, par->family);
+	if (err)
+		return err;
 	if (info->flags & ~XT_SOCKET_FLAGS_V3) {
 		pr_info("unknown flags 0x%x\n",
 			info->flags & ~XT_SOCKET_FLAGS_V3);
@@ -539,11 +567,6 @@ static struct xt_match socket_mt_reg[] __read_mostly = {
 
 static int __init socket_mt_init(void)
 {
-	nf_defrag_ipv4_enable();
-#ifdef XT_SOCKET_HAVE_IPV6
-	nf_defrag_ipv6_enable();
-#endif
-
 	return xt_register_matches(socket_mt_reg, ARRAY_SIZE(socket_mt_reg));
 }
 
-- 
2.0.5



* [PATCH v2 nf-next 6/9] netfilter: nat: add dependencies on conntrack module
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (4 preceding siblings ...)
  2015-10-23 10:43 ` [PATCH v2 nf-next 5/9] netfilter: defrag: only register defrag functionality if needed Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 7/9] netfilter: bridge: register hooks only when bridge is added Florian Westphal
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

MASQUERADE, S/DNAT and REDIRECT already call functions that depend
on the conntrack module.

However, since the conntrack hooks are now registered in a lazy fashion
(i.e., only when needed) a symbol reference is not enough.

Thus, when something is added to a nat table, make sure that it will see
packets by calling nf_ct_netns_get() which will register the conntrack
hooks in the current netns.

An alternative would be to add these dependencies to the NAT table.

However, that has problems when using MODULES=n builds -- we might try
to register e.g. ipv6 conntrack before that 'modules' initcall has run,
leading to NULL deref crashes since its per-netns storage has not yet been
allocated.

Adding the dependency in the modules instead has the advantage that
the nat table also has no CPU cost until rules are added.
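
The get/put pairing that the targets use in their checkentry and destroy
callbacks can be modelled in user space roughly as follows. This is only a
sketch under assumptions: model_net, model_ct_get() and model_ct_put() are
hypothetical names, and the per-family handling and locking of the real
nf_ct_netns_get()/nf_ct_netns_put() are left out. It shows just the counting
idea: the first rule registers the hooks, the last rule removes them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-netns conntrack state (not a kernel type). */
struct model_net {
	unsigned int users;	/* rules currently depending on conntrack */
	bool hooks_registered;	/* models whether conntrack hooks are attached */
};

/* Called from a ->checkentry() style callback: first user registers hooks. */
static int model_ct_get(struct model_net *net)
{
	if (net->users++ == 0)
		net->hooks_registered = true;
	return 0;
}

/* Called from a ->destroy() style callback: last user removes hooks. */
static void model_ct_put(struct model_net *net)
{
	assert(net->users > 0);
	if (--net->users == 0)
		net->hooks_registered = false;
}
```

With this model a namespace that never has a NAT rule added never has hooks
registered, which is the "no cpu cost until rules are added" property.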

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 not part of v1 patchset.

 net/ipv4/netfilter/ipt_MASQUERADE.c |  8 +++++++-
 net/netfilter/xt_NETMAP.c           | 11 +++++++++--
 net/netfilter/xt_REDIRECT.c         | 12 ++++++++++--
 net/netfilter/xt_nat.c              | 18 +++++++++++++++++-
 4 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_MASQUERADE.c b/net/ipv4/netfilter/ipt_MASQUERADE.c
index da7f02a..e758e53 100644
--- a/net/ipv4/netfilter/ipt_MASQUERADE.c
+++ b/net/ipv4/netfilter/ipt_MASQUERADE.c
@@ -41,7 +41,7 @@ static int masquerade_tg_check(const struct xt_tgchk_param *par)
 		pr_debug("bad rangesize %u\n", mr->rangesize);
 		return -EINVAL;
 	}
-	return 0;
+	return nf_ct_netns_get(par->net, par->family);
 }
 
 static unsigned int
@@ -58,6 +58,11 @@ masquerade_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	return nf_nat_masquerade_ipv4(skb, par->hooknum, &range, par->out);
 }
 
+static void masquerade_tg_destroy(const struct xt_tgdtor_param *par)
+{
+	nf_ct_netns_put(par->net, par->family);
+}
+
 static struct xt_target masquerade_tg_reg __read_mostly = {
 	.name		= "MASQUERADE",
 	.family		= NFPROTO_IPV4,
@@ -66,6 +71,7 @@ static struct xt_target masquerade_tg_reg __read_mostly = {
 	.table		= "nat",
 	.hooks		= 1 << NF_INET_POST_ROUTING,
 	.checkentry	= masquerade_tg_check,
+	.destroy	= masquerade_tg_destroy,
 	.me		= THIS_MODULE,
 };
 
diff --git a/net/netfilter/xt_NETMAP.c b/net/netfilter/xt_NETMAP.c
index b253e07..222458c 100644
--- a/net/netfilter/xt_NETMAP.c
+++ b/net/netfilter/xt_NETMAP.c
@@ -60,7 +60,12 @@ static int netmap_tg6_checkentry(const struct xt_tgchk_param *par)
 
 	if (!(range->flags & NF_NAT_RANGE_MAP_IPS))
 		return -EINVAL;
-	return 0;
+	return nf_ct_netns_get(par->net, par->family);
+}
+
+static void netmap_tg_destroy(const struct xt_tgdtor_param *par)
+{
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static unsigned int
@@ -111,7 +116,7 @@ static int netmap_tg4_check(const struct xt_tgchk_param *par)
 		pr_debug("bad rangesize %u.\n", mr->rangesize);
 		return -EINVAL;
 	}
-	return 0;
+	return nf_ct_netns_get(par->net, par->family);
 }
 
 static struct xt_target netmap_tg_reg[] __read_mostly = {
@@ -127,6 +132,7 @@ static struct xt_target netmap_tg_reg[] __read_mostly = {
 		              (1 << NF_INET_LOCAL_OUT) |
 		              (1 << NF_INET_LOCAL_IN),
 		.checkentry = netmap_tg6_checkentry,
+		.destroy    = netmap_tg_destroy,
 		.me         = THIS_MODULE,
 	},
 	{
@@ -141,6 +147,7 @@ static struct xt_target netmap_tg_reg[] __read_mostly = {
 		              (1 << NF_INET_LOCAL_OUT) |
 		              (1 << NF_INET_LOCAL_IN),
 		.checkentry = netmap_tg4_check,
+		.destroy    = netmap_tg_destroy,
 		.me         = THIS_MODULE,
 	},
 };
diff --git a/net/netfilter/xt_REDIRECT.c b/net/netfilter/xt_REDIRECT.c
index 03f0b37..1406516 100644
--- a/net/netfilter/xt_REDIRECT.c
+++ b/net/netfilter/xt_REDIRECT.c
@@ -40,7 +40,13 @@ static int redirect_tg6_checkentry(const struct xt_tgchk_param *par)
 
 	if (range->flags & NF_NAT_RANGE_MAP_IPS)
 		return -EINVAL;
-	return 0;
+
+	return nf_ct_netns_get(par->net, par->family);
+}
+
+static void redirect_tg_destroy(const struct xt_tgdtor_param *par)
+{
+	nf_ct_netns_put(par->net, par->family);
 }
 
 /* FIXME: Take multiple ranges --RR */
@@ -56,7 +62,7 @@ static int redirect_tg4_check(const struct xt_tgchk_param *par)
 		pr_debug("bad rangesize %u.\n", mr->rangesize);
 		return -EINVAL;
 	}
-	return 0;
+	return nf_ct_netns_get(par->net, par->family);
 }
 
 static unsigned int
@@ -72,6 +78,7 @@ static struct xt_target redirect_tg_reg[] __read_mostly = {
 		.revision   = 0,
 		.table      = "nat",
 		.checkentry = redirect_tg6_checkentry,
+		.destroy    = redirect_tg_destroy,
 		.target     = redirect_tg6,
 		.targetsize = sizeof(struct nf_nat_range),
 		.hooks      = (1 << NF_INET_PRE_ROUTING) |
@@ -85,6 +92,7 @@ static struct xt_target redirect_tg_reg[] __read_mostly = {
 		.table      = "nat",
 		.target     = redirect_tg4,
 		.checkentry = redirect_tg4_check,
+		.destroy    = redirect_tg_destroy,
 		.targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
 		.hooks      = (1 << NF_INET_PRE_ROUTING) |
 		              (1 << NF_INET_LOCAL_OUT),
diff --git a/net/netfilter/xt_nat.c b/net/netfilter/xt_nat.c
index bea7464..8107b3e 100644
--- a/net/netfilter/xt_nat.c
+++ b/net/netfilter/xt_nat.c
@@ -23,7 +23,17 @@ static int xt_nat_checkentry_v0(const struct xt_tgchk_param *par)
 			par->target->name);
 		return -EINVAL;
 	}
-	return 0;
+	return nf_ct_netns_get(par->net, par->family);
+}
+
+static int xt_nat_checkentry(const struct xt_tgchk_param *par)
+{
+	return nf_ct_netns_get(par->net, par->family);
+}
+
+static void xt_nat_destroy(const struct xt_tgdtor_param *par)
+{
+	nf_ct_netns_put(par->net, par->family);
 }
 
 static void xt_nat_convert_range(struct nf_nat_range *dst,
@@ -106,6 +116,7 @@ static struct xt_target xt_nat_target_reg[] __read_mostly = {
 		.name		= "SNAT",
 		.revision	= 0,
 		.checkentry	= xt_nat_checkentry_v0,
+		.destroy	= xt_nat_destroy,
 		.target		= xt_snat_target_v0,
 		.targetsize	= sizeof(struct nf_nat_ipv4_multi_range_compat),
 		.family		= NFPROTO_IPV4,
@@ -118,6 +129,7 @@ static struct xt_target xt_nat_target_reg[] __read_mostly = {
 		.name		= "DNAT",
 		.revision	= 0,
 		.checkentry	= xt_nat_checkentry_v0,
+		.destroy	= xt_nat_destroy,
 		.target		= xt_dnat_target_v0,
 		.targetsize	= sizeof(struct nf_nat_ipv4_multi_range_compat),
 		.family		= NFPROTO_IPV4,
@@ -129,6 +141,8 @@ static struct xt_target xt_nat_target_reg[] __read_mostly = {
 	{
 		.name		= "SNAT",
 		.revision	= 1,
+		.checkentry	= xt_nat_checkentry,
+		.destroy	= xt_nat_destroy,
 		.target		= xt_snat_target_v1,
 		.targetsize	= sizeof(struct nf_nat_range),
 		.table		= "nat",
@@ -139,6 +153,8 @@ static struct xt_target xt_nat_target_reg[] __read_mostly = {
 	{
 		.name		= "DNAT",
 		.revision	= 1,
+		.checkentry	= xt_nat_checkentry,
+		.destroy	= xt_nat_destroy,
 		.target		= xt_dnat_target_v1,
 		.targetsize	= sizeof(struct nf_nat_range),
 		.table		= "nat",
-- 
2.0.5



* [PATCH v2 nf-next 7/9] netfilter: bridge: register hooks only when bridge is added
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (5 preceding siblings ...)
  2015-10-23 10:43 ` [PATCH v2 nf-next 6/9] netfilter: nat: add dependencies on conntrack module Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 8/9] netfilter: don't call nf_hook_state_init/_hook_slow unless needed Florian Westphal
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This moves the last 'common' hooks to a 'register only when needed' scheme.

We use a device notifier to register all the 'call-iptables' netfilter
hooks once a bridge gets added.

This means that if the initial namespace uses a bridge, newly created
network namespaces no longer 'inherit' the PRE_ROUTING ipt_sabotage hook;
instead, it is only registered in a network namespace once a bridge
is added within that namespace.

After this patch, only a handful of netfilter modules still use
global hooks:
- PF_BRIDGE hooks
- CLUSTER match (deprecated)
- ipvs hooks
- SYNPROXY

As long as these modules are not loaded/used, a new network namespace
has an empty hook list and NF_HOOK() boils down to a single list_empty
test, even if the initial namespace does packet filtering, conntrack, etc.
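
The register-once-per-namespace logic of the notifier can be modelled in
user space like this. It is a sketch only: model_netns, model_event and
model_device_event() are hypothetical names, and the counter stands in for
the real nf_register_net_hooks() call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of netdevice events; only REGISTER matters here. */
enum model_event { MODEL_DEV_REGISTER, MODEL_DEV_UP };

struct model_netns {
	bool enabled;		/* mirrors the 'enabled' flag in struct brnf_net */
	int hook_registrations;	/* counts simulated hook registrations */
};

/* Returns true if this event caused the hooks to be registered. */
static bool model_device_event(struct model_netns *ns, enum model_event ev,
			       bool is_bridge)
{
	/* Ignore everything except registration of a bridge device. */
	if (ev != MODEL_DEV_REGISTER || !is_bridge)
		return false;

	/* A second bridge in the same namespace must not register again. */
	if (ns->enabled)
		return false;

	ns->hook_registrations++;
	ns->enabled = true;
	return true;
}
```

Adding a bridge in one namespace leaves every other model_netns untouched,
which is the behaviour change described above.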

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no changes since v1

 net/bridge/br_netfilter_hooks.c | 68 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 7ddbe7e..44114a9 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -37,6 +37,7 @@
 #include <net/addrconf.h>
 #include <net/route.h>
 #include <net/netfilter/br_netfilter.h>
+#include <net/netns/generic.h>
 
 #include <asm/uaccess.h>
 #include "br_private.h"
@@ -44,6 +45,12 @@
 #include <linux/sysctl.h>
 #endif
 
+static int brnf_net_id __read_mostly;
+
+struct brnf_net {
+	bool enabled;
+};
+
 #ifdef CONFIG_SYSCTL
 static struct ctl_table_header *brnf_sysctl_header;
 static int brnf_call_iptables __read_mostly = 1;
@@ -938,6 +945,53 @@ static struct nf_hook_ops br_nf_ops[] __read_mostly = {
 	},
 };
 
+static int brnf_device_event(struct notifier_block *unused, unsigned long event,
+			     void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct brnf_net *brnet;
+	struct net *net;
+	int ret;
+
+	if (event != NETDEV_REGISTER || !(dev->priv_flags & IFF_EBRIDGE))
+		return NOTIFY_DONE;
+
+	ASSERT_RTNL();
+
+	net = dev_net(dev);
+	brnet = net_generic(net, brnf_net_id);
+	if (brnet->enabled)
+		return NOTIFY_OK;
+
+	ret = nf_register_net_hooks(net, br_nf_ops, ARRAY_SIZE(br_nf_ops));
+	if (ret)
+		return NOTIFY_BAD;
+
+	brnet->enabled = true;
+	return NOTIFY_OK;
+}
+
+static void __net_exit brnf_exit_net(struct net *net)
+{
+	struct brnf_net *brnet = net_generic(net, brnf_net_id);
+
+	if (!brnet->enabled)
+		return;
+
+	nf_unregister_net_hooks(net, br_nf_ops, ARRAY_SIZE(br_nf_ops));
+	brnet->enabled = false;
+}
+
+static struct pernet_operations brnf_net_ops __read_mostly = {
+	.exit = brnf_exit_net,
+	.id   = &brnf_net_id,
+	.size = sizeof(struct brnf_net),
+};
+
+static struct notifier_block brnf_notifier __read_mostly = {
+	.notifier_call = brnf_device_event,
+};
+
 #ifdef CONFIG_SYSCTL
 static
 int brnf_sysctl_call_tables(struct ctl_table *ctl, int write,
@@ -1003,16 +1057,23 @@ static int __init br_netfilter_init(void)
 {
 	int ret;
 
-	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
+	ret = register_pernet_subsys(&brnf_net_ops);
 	if (ret < 0)
 		return ret;
 
+	ret = register_netdevice_notifier(&brnf_notifier);
+	if (ret < 0) {
+		unregister_pernet_subsys(&brnf_net_ops);
+		return ret;
+	}
+
 #ifdef CONFIG_SYSCTL
 	brnf_sysctl_header = register_net_sysctl(&init_net, "net/bridge", brnf_table);
 	if (brnf_sysctl_header == NULL) {
 		printk(KERN_WARNING
 		       "br_netfilter: can't register to sysctl.\n");
-		nf_unregister_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
+		unregister_netdevice_notifier(&brnf_notifier);
+		unregister_pernet_subsys(&brnf_net_ops);
 		return -ENOMEM;
 	}
 #endif
@@ -1024,7 +1085,8 @@ static int __init br_netfilter_init(void)
 static void __exit br_netfilter_fini(void)
 {
 	RCU_INIT_POINTER(nf_br_ops, NULL);
-	nf_unregister_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
+	unregister_netdevice_notifier(&brnf_notifier);
+	unregister_pernet_subsys(&brnf_net_ops);
 #ifdef CONFIG_SYSCTL
 	unregister_net_sysctl_table(brnf_sysctl_header);
 #endif
-- 
2.0.5



* [PATCH v2 nf-next 8/9] netfilter: don't call nf_hook_state_init/_hook_slow unless needed
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (6 preceding siblings ...)
  2015-10-23 10:43 ` [PATCH v2 nf-next 7/9] netfilter: bridge: register hooks only when bridge is added Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-23 10:43 ` [PATCH v2 nf-next 9/9] nftables: add conntrack dependencies for nat/masq/redir expressions Florian Westphal
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

With the previous patches in place, a netns nf_hook_list might be empty,
even if e.g. init_net performs filtering/conntrack.

Thus, change nf_hook_thresh to check the hook_list as well before
initializing hook_state and calling nf_hook_slow().

We still make use of static keys: if no netfilter hooks are loaded,
we can elide further testing since the list is guaranteed to be empty.
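
The resulting two-stage test can be sketched in user space as below. The
names model_netns, model_hooks_needed and model_needs_slow_path() are
hypothetical, and a plain bool stands in for the static key; the real code
additionally only consults the key when pf and hook are compile-time
constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-namespace state: number of hooks on one hook list. */
struct model_netns {
	unsigned int nr_hooks;
};

/* Models the global static key: true once any namespace has a hook. */
static bool model_hooks_needed;

/* Returns true only when the slow path (hook traversal) has to run. */
static bool model_needs_slow_path(const struct model_netns *ns)
{
	/* Cheap global test first: nothing is registered anywhere. */
	if (!model_hooks_needed)
		return false;

	/* The key only says "some namespace has hooks"; check this one. */
	return ns->nr_hooks != 0;
}
```

The second check is what lets an empty namespace skip the slow path even
though init_net has flipped the global key.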

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no changes since v1

 include/linux/netfilter.h | 29 +++++++++++------------------
 1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 0ad5567..9230f9a 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -141,22 +141,6 @@ void nf_unregister_sockopt(struct nf_sockopt_ops *reg);
 
 #ifdef HAVE_JUMP_LABEL
 extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
-
-static inline bool nf_hook_list_active(struct list_head *hook_list,
-				       u_int8_t pf, unsigned int hook)
-{
-	if (__builtin_constant_p(pf) &&
-	    __builtin_constant_p(hook))
-		return static_key_false(&nf_hooks_needed[pf][hook]);
-
-	return !list_empty(hook_list);
-}
-#else
-static inline bool nf_hook_list_active(struct list_head *hook_list,
-				       u_int8_t pf, unsigned int hook)
-{
-	return !list_empty(hook_list);
-}
 #endif
 
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state);
@@ -177,9 +161,18 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 				 int (*okfn)(struct net *, struct sock *, struct sk_buff *),
 				 int thresh)
 {
-	struct list_head *hook_list = &net->nf.hooks[pf][hook];
+	struct list_head *hook_list;
+
+#ifdef HAVE_JUMP_LABEL
+	if (__builtin_constant_p(pf) &&
+	    __builtin_constant_p(hook) &&
+	    !static_key_false(&nf_hooks_needed[pf][hook]))
+		return 1;
+#endif
+
+	hook_list = &net->nf.hooks[pf][hook];
 
-	if (nf_hook_list_active(hook_list, pf, hook)) {
+	if (!list_empty(hook_list)) {
 		struct nf_hook_state state;
 
 		nf_hook_state_init(&state, hook_list, hook, thresh,
-- 
2.0.5



* [PATCH v2 nf-next 9/9] nftables: add conntrack dependencies for nat/masq/redir expressions
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (7 preceding siblings ...)
  2015-10-23 10:43 ` [PATCH v2 nf-next 8/9] netfilter: don't call nf_hook_state_init/_hook_slow unless needed Florian Westphal
@ 2015-10-23 10:43 ` Florian Westphal
  2015-10-26 22:55 ` [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Pablo Neira Ayuso
  2015-11-24 10:27 ` Pablo Neira Ayuso
  10 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-23 10:43 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

so that conntrack core will add the needed hooks in this namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 not part of v1 patchset.

 net/ipv4/netfilter/nft_masq_ipv4.c  |  7 +++++++
 net/ipv4/netfilter/nft_redir_ipv4.c |  7 +++++++
 net/ipv6/netfilter/nft_masq_ipv6.c  |  7 +++++++
 net/ipv6/netfilter/nft_redir_ipv6.c |  7 +++++++
 net/netfilter/nft_masq.c            |  5 +++--
 net/netfilter/nft_nat.c             | 11 ++++++++++-
 net/netfilter/nft_redir.c           |  2 +-
 7 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/netfilter/nft_masq_ipv4.c b/net/ipv4/netfilter/nft_masq_ipv4.c
index b72ffc5..ee5bb98 100644
--- a/net/ipv4/netfilter/nft_masq_ipv4.c
+++ b/net/ipv4/netfilter/nft_masq_ipv4.c
@@ -30,12 +30,19 @@ static void nft_masq_ipv4_eval(const struct nft_expr *expr,
 						    &range, pkt->out);
 }
 
+static void
+nft_masq_ipv4_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
+{
+	nf_ct_netns_put(ctx->net, NFPROTO_IPV4);
+}
+
 static struct nft_expr_type nft_masq_ipv4_type;
 static const struct nft_expr_ops nft_masq_ipv4_ops = {
 	.type		= &nft_masq_ipv4_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_masq)),
 	.eval		= nft_masq_ipv4_eval,
 	.init		= nft_masq_init,
+	.destroy	= nft_masq_ipv4_destroy,
 	.dump		= nft_masq_dump,
 	.validate	= nft_masq_validate,
 };
diff --git a/net/ipv4/netfilter/nft_redir_ipv4.c b/net/ipv4/netfilter/nft_redir_ipv4.c
index c09d438..862eb5a 100644
--- a/net/ipv4/netfilter/nft_redir_ipv4.c
+++ b/net/ipv4/netfilter/nft_redir_ipv4.c
@@ -39,12 +39,19 @@ static void nft_redir_ipv4_eval(const struct nft_expr *expr,
 						  pkt->hook);
 }
 
+static void
+nft_redir_ipv4_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
+{
+	nf_ct_netns_put(ctx->net, NFPROTO_IPV4);
+}
+
 static struct nft_expr_type nft_redir_ipv4_type;
 static const struct nft_expr_ops nft_redir_ipv4_ops = {
 	.type		= &nft_redir_ipv4_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_redir)),
 	.eval		= nft_redir_ipv4_eval,
 	.init		= nft_redir_init,
+	.destroy	= nft_redir_ipv4_destroy,
 	.dump		= nft_redir_dump,
 	.validate	= nft_redir_validate,
 };
diff --git a/net/ipv6/netfilter/nft_masq_ipv6.c b/net/ipv6/netfilter/nft_masq_ipv6.c
index cd1ac16..a8b3626 100644
--- a/net/ipv6/netfilter/nft_masq_ipv6.c
+++ b/net/ipv6/netfilter/nft_masq_ipv6.c
@@ -30,12 +30,19 @@ static void nft_masq_ipv6_eval(const struct nft_expr *expr,
 	regs->verdict.code = nf_nat_masquerade_ipv6(pkt->skb, &range, pkt->out);
 }
 
+static void
+nft_masq_ipv6_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
+{
+	nf_ct_netns_put(ctx->net, NFPROTO_IPV6);
+}
+
 static struct nft_expr_type nft_masq_ipv6_type;
 static const struct nft_expr_ops nft_masq_ipv6_ops = {
 	.type		= &nft_masq_ipv6_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_masq)),
 	.eval		= nft_masq_ipv6_eval,
 	.init		= nft_masq_init,
+	.destroy	= nft_masq_ipv6_destroy,
 	.dump		= nft_masq_dump,
 	.validate	= nft_masq_validate,
 };
diff --git a/net/ipv6/netfilter/nft_redir_ipv6.c b/net/ipv6/netfilter/nft_redir_ipv6.c
index aca44e8..ef673cd 100644
--- a/net/ipv6/netfilter/nft_redir_ipv6.c
+++ b/net/ipv6/netfilter/nft_redir_ipv6.c
@@ -38,12 +38,19 @@ static void nft_redir_ipv6_eval(const struct nft_expr *expr,
 	regs->verdict.code = nf_nat_redirect_ipv6(pkt->skb, &range, pkt->hook);
 }
 
+static void
+nft_redir_ipv6_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
+{
+	nf_ct_netns_put(ctx->net, NFPROTO_IPV6);
+}
+
 static struct nft_expr_type nft_redir_ipv6_type;
 static const struct nft_expr_ops nft_redir_ipv6_ops = {
 	.type		= &nft_redir_ipv6_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_redir)),
 	.eval		= nft_redir_ipv6_eval,
 	.init		= nft_redir_init,
+	.destroy	= nft_redir_ipv6_destroy,
 	.dump		= nft_redir_dump,
 	.validate	= nft_redir_validate,
 };
diff --git a/net/netfilter/nft_masq.c b/net/netfilter/nft_masq.c
index 9aea747..c7269c3 100644
--- a/net/netfilter/nft_masq.c
+++ b/net/netfilter/nft_masq.c
@@ -48,13 +48,14 @@ int nft_masq_init(const struct nft_ctx *ctx,
 		return err;
 
 	if (tb[NFTA_MASQ_FLAGS] == NULL)
-		return 0;
+		goto out;
 
 	priv->flags = ntohl(nla_get_be32(tb[NFTA_MASQ_FLAGS]));
 	if (priv->flags & ~NF_NAT_RANGE_MASK)
 		return -EINVAL;
+ out:
+	return nf_ct_netns_get(ctx->net, ctx->afi->family);
 
-	return 0;
 }
 EXPORT_SYMBOL_GPL(nft_masq_init);
 
diff --git a/net/netfilter/nft_nat.c b/net/netfilter/nft_nat.c
index ee2d717..19a7bf3 100644
--- a/net/netfilter/nft_nat.c
+++ b/net/netfilter/nft_nat.c
@@ -209,7 +209,7 @@ static int nft_nat_init(const struct nft_ctx *ctx, const struct nft_expr *expr,
 			return -EINVAL;
 	}
 
-	return 0;
+	return nf_ct_netns_get(ctx->net, family);
 }
 
 static int nft_nat_dump(struct sk_buff *skb, const struct nft_expr *expr)
@@ -257,12 +257,21 @@ nla_put_failure:
 	return -1;
 }
 
+static void
+nft_nat_destroy(const struct nft_ctx *ctx, const struct nft_expr *expr)
+{
+	const struct nft_nat *priv = nft_expr_priv(expr);
+
+	nf_ct_netns_put(ctx->net, priv->family);
+}
+
 static struct nft_expr_type nft_nat_type;
 static const struct nft_expr_ops nft_nat_ops = {
 	.type           = &nft_nat_type,
 	.size           = NFT_EXPR_SIZE(sizeof(struct nft_nat)),
 	.eval           = nft_nat_eval,
 	.init           = nft_nat_init,
+	.destroy        = nft_nat_destroy,
 	.dump           = nft_nat_dump,
 	.validate	= nft_nat_validate,
 };
diff --git a/net/netfilter/nft_redir.c b/net/netfilter/nft_redir.c
index 03f7bf4..f8227ec 100644
--- a/net/netfilter/nft_redir.c
+++ b/net/netfilter/nft_redir.c
@@ -79,7 +79,7 @@ int nft_redir_init(const struct nft_ctx *ctx,
 			return -EINVAL;
 	}
 
-	return 0;
+	return nf_ct_netns_get(ctx->net, ctx->afi->family);
 }
 EXPORT_SYMBOL_GPL(nft_redir_init);
 
-- 
2.0.5



* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (8 preceding siblings ...)
  2015-10-23 10:43 ` [PATCH v2 nf-next 9/9] nftables: add conntrack dependencies for nat/masq/redir expressions Florian Westphal
@ 2015-10-26 22:55 ` Pablo Neira Ayuso
  2015-10-26 23:09   ` Florian Westphal
  2015-11-24 10:27 ` Pablo Neira Ayuso
  10 siblings, 1 reply; 20+ messages in thread
From: Pablo Neira Ayuso @ 2015-10-26 22:55 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi Florian,

On Fri, Oct 23, 2015 at 12:43:17PM +0200, Florian Westphal wrote:
[...]
> This work aims to change all major hook users to nf_register_net_hook
> so that when a new netns is created it has no hooks at all, even when the
> initial namespace uses conntrack, iptables and bridge netfilter.
>
> To keep behaviour somewhat compatible, xtable hooks are registered once a
> iptables set/getsockopt call is made within a net namespace.
> This also means that e.g. conntrack behaviour is not yet optimal, we
> still create all the data structures and only skip hook registration
> at this time.
> 
> Caveats:
> - conntrack is no longer active just by loading nf_conntrack module -- at
> least one (x)tables rule that requires conntrack has to be added, e.g.
> conntrack match or S/DNAT target.

So far it was possible to run conntrack without iptables, eg. to
collect statistics at per-flow level via ctnetlink. Could you find a
way to enable the hooks also from that path?

If the program polls /proc, too bad for it, it shouldn't be using such
interface for that purpose.

We probably should go back to the idea of having an explicit way of
enabling conntrack from the ruleset, but that will need the /proc
switch to keep there the existing semantics that people expect.

Let me know,
Thanks.


* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-26 22:55 ` [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Pablo Neira Ayuso
@ 2015-10-26 23:09   ` Florian Westphal
  2015-10-27 16:35     ` Florian Westphal
  0 siblings, 1 reply; 20+ messages in thread
From: Florian Westphal @ 2015-10-26 23:09 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > This work aims to change all major hook users to nf_register_net_hook
> > so that when a new netns is created it has no hooks at all, even when the
> > initial namespace uses conntrack, iptables and bridge netfilter.
> >
> > To keep behaviour somewhat compatible, xtable hooks are registered once a
> > iptables set/getsockopt call is made within a net namespace.
> > This also means that e.g. conntrack behaviour is not yet optimal, we
> > still create all the data structures and only skip hook registration
> > at this time.
> > Caveats:
> > - conntrack is no longer active just by loading nf_conntrack module -- at
> > least one (x)tables rule that requires conntrack has to be added, e.g.
> > conntrack match or S/DNAT target.
> 
> So far it was possible to run conntrack without iptables, eg. to
> collect statistics at per-flow level via ctnetlink. Could you find a
> way to enable the hooks also from that path?

Good point, I'll look at this tomorrow.  It should not be too hard to
add this.

> We probably should go back to the idea of having an explicit way of
> enabling conntrack from the ruleset, but that will need the /proc
> switch to keep there the existing semantics that people expect.

I'm assuming you mean something like

-t raw -p tcp ... bla ... -j CT --track ?

where this target calls the conntrack_in function directly?

I planned to add such an expression for nft bridge conntrack.

I think that if we go down this route we should also investigate
if we also need to change the way how we deal with defragmentation.
(e.g. for PF_BRIDGE and INGRESS hook points).

Maybe we could/should make it a (nf)table property?

I thought about adding a defrag expression for bridge but it's error
prone, e.g. 'tcp dport 42 defrag' would have to be reordered to defrag
before l4 matching.

Thanks,
Florian


* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-26 23:09   ` Florian Westphal
@ 2015-10-27 16:35     ` Florian Westphal
  2015-10-28 12:39       ` Jan Engelhardt
  2015-10-29 20:35       ` Pablo Neira Ayuso
  0 siblings, 2 replies; 20+ messages in thread
From: Florian Westphal @ 2015-10-27 16:35 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel

Florian Westphal <fw@strlen.de> wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > This work aims to change all major hook users to nf_register_net_hook
> > > so that when a new netns is created it has no hooks at all, even when the
> > > initial namespace uses conntrack, iptables and bridge netfilter.
> > >
> > > To keep behaviour somewhat compatible, xtable hooks are registered once a
> > > iptables set/getsockopt call is made within a net namespace.
> > > This also means that e.g. conntrack behaviour is not yet optimal, we
> > > still create all the data structures and only skip hook registration
> > > at this time.
> > > Caveats:
> > > - conntrack is no longer active just by loading nf_conntrack module -- at
> > > least one (x)tables rule that requires conntrack has to be added, e.g.
> > > conntrack match or S/DNAT target.
> > 
> > So far it was possible to run conntrack without iptables, eg. to
> > collect statistics at per-flow level via ctnetlink. Could you find a
> > way to enable the hooks also from that path?
> 
> Good point, I'll look at this tomorrow.  It should not be too hard to
> add this.

Ahem.  There are strings attached... :-/

So conntrack -L or conntrack -E do not enable connection tracking
if it's not enabled (on current kernels).

So one has to load ipv4/ipv6 etc tracker explicitly.

Problem *after* patches is that this doesn't suffice.

So old behaviour:
conntrack -E

(nothing happens)
(modprobe nf_conntrack_ipv4)
(conntrack -E starts to display events)

new behaviour:
(modprobe nf_conntrack_ipv4)
(conntrack -E doesn't display events since conntrack module doesn't
 see packets due to lack of nf hooks).

My first attempt to fix this was to hook into nfnetlink bind,
but that doesn't really work in a backwards-compatible fashion since
it only makes 'modprobe nf_conntrack_ipv4; conntrack -E' work, but
not nf_conntrack_ipv4 module load *after* an event listener is already
running.
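
The ordering problem can be shown with a small user-space model. This is a
sketch only; model_state, model_listener_bind() and model_tracker_load() are
hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one namespace; none of these are kernel types. */
struct model_state {
	bool tracker_loaded;	/* e.g. nf_conntrack_ipv4 has been loaded */
	bool listener_bound;	/* an event listener (conntrack -E) is running */
	bool hooks_registered;	/* conntrack actually sees packets */
};

/* Bind-time registration: only trackers already present can be enabled. */
static void model_listener_bind(struct model_state *s)
{
	s->listener_bound = true;
	if (s->tracker_loaded)
		s->hooks_registered = true;
}

/* Module load does not look for listeners that are already bound. */
static void model_tracker_load(struct model_state *s)
{
	s->tracker_loaded = true;
}
```

Loading the tracker and then binding ends with hooks_registered set; binding
first and loading afterwards does not, which is the case that used to work.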

The other alternative is to request all the protocol trackers during
the ctnetlink bind request, but that sucks.

Any suggestion?  I don't really see a way out of this.

For reference, this is the change I have:

Subject: netfilter: ctnetlink: make ctnetlink bind register conntrack hooks

several problems here:

1. conntrack -E & modprobe nf_conntrack_ipv4 will *not* register
ipv4 conntrack hooks
2. since ctnetlink has no dependencies on nf_conntrack_xxx it's
possible to rmmod nf_conntrack_xxx while an event listener is running,
which means the tracker has to force-remove hooks on netns destruction.
---
 include/linux/netfilter/nfnetlink.h            |  1 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c | 13 ++++++
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c | 13 ++++++
 net/netfilter/nf_conntrack_netlink.c           | 58 ++++++++++++++++++++++++++
 net/netfilter/nfnetlink.c                      | 26 ++++++++----
 5 files changed, 102 insertions(+), 9 deletions(-)

diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 249d1bb..9049c6a 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -28,6 +28,7 @@ struct nfnetlink_subsystem {
 	const struct nfnl_callback *cb;	/* callback for individual types */
 	int (*commit)(struct sk_buff *skb);
 	int (*abort)(struct sk_buff *skb);
+	int (*bind)(struct net *net);
 };
 
 int nfnetlink_subsys_register(const struct nfnetlink_subsystem *n);
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index d804620..7918b45 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -405,6 +405,9 @@ static void nf_conntrack_l3proto_ipv4_hooks_unregister(struct net *net)
 {
 	struct conntrack4_net *cnet = net_generic(net, conntrack4_net_id);
 
+	if (cnet->users == 0)
+		return;
+
 	mutex_lock(&register_ipv4_hooks);
 	if (--cnet->users == 0)
 		nf_unregister_net_hooks(net, ipv4_conntrack_ops,
@@ -478,6 +481,16 @@ out_tcp:
 
 static void ipv4_net_exit(struct net *net)
 {
+	struct conntrack4_net *cnet = net_generic(net, conntrack4_net_id);
+
+	mutex_lock(&register_ipv4_hooks);
+	if (cnet->users) {
+		cnet->users = 0;
+		nf_unregister_net_hooks(net, ipv4_conntrack_ops,
+					ARRAY_SIZE(ipv4_conntrack_ops));
+	}
+	mutex_unlock(&register_ipv4_hooks);
+
 	nf_ct_l3proto_pernet_unregister(net, &nf_conntrack_l3proto_ipv4);
 	nf_ct_l4proto_pernet_unregister(net, &nf_conntrack_l4proto_icmp);
 	nf_ct_l4proto_pernet_unregister(net, &nf_conntrack_l4proto_udp4);
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index f3e7ca6..dd0fad6 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -344,6 +344,9 @@ static void nf_conntrack_l3proto_ipv6_hooks_unregister(struct net *net)
 {
 	struct conntrack6_net *cnet = net_generic(net, conntrack6_net_id);
 
+	if (cnet->users == 0)
+		return;
+
 	mutex_lock(&register_ipv6_hooks);
 	if (--cnet->users == 0)
 		nf_unregister_net_hooks(net, ipv6_conntrack_ops,
@@ -418,6 +421,16 @@ static int ipv6_net_init(struct net *net)
 
 static void ipv6_net_exit(struct net *net)
 {
+	struct conntrack6_net *cnet = net_generic(net, conntrack6_net_id);
+
+	mutex_lock(&register_ipv6_hooks);
+	if (cnet->users) {
+		cnet->users = 0;
+		nf_unregister_net_hooks(net, ipv6_conntrack_ops,
+					ARRAY_SIZE(ipv6_conntrack_ops));
+	}
+	mutex_unlock(&register_ipv6_hooks);
+
 	nf_ct_l3proto_pernet_unregister(net, &nf_conntrack_l3proto_ipv6);
 	nf_ct_l4proto_pernet_unregister(net, &nf_conntrack_l4proto_icmpv6);
 	nf_ct_l4proto_pernet_unregister(net, &nf_conntrack_l4proto_udp6);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 9f52729..f0585e2 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -57,6 +57,11 @@
 MODULE_LICENSE("GPL");
 
 static char __initdata version[] = "0.93";
+static int ctnetlink_net_id __read_mostly;
+
+struct ctnl_net {
+	DECLARE_BITMAP(enabled, NFPROTO_NUMPROTO);
+};
 
 static inline int
 ctnetlink_dump_tuples_proto(struct sk_buff *skb,
@@ -3257,6 +3262,37 @@ ctnetlink_stat_exp_cpu(struct sock *ctnl, struct sk_buff *skb,
 	return 0;
 }
 
+static int ctnl_bind(struct net *net)
+{
+	struct ctnl_net *ctnet = net_generic(net, ctnetlink_net_id);
+	int i;
+
+	rcu_read_lock();
+
+	for (i = 0; i < NFPROTO_NUMPROTO; i++) {
+		struct nf_conntrack_l3proto *l3proto;
+		int ret;
+
+		/* don't autoload modules; only ensure those present have
+		 * their hooks registered.
+		 */
+		l3proto = __nf_ct_l3proto_find(i);
+		if (!l3proto || !l3proto->net_ns_get)
+			continue;
+
+		if (test_and_set_bit(i, ctnet->enabled))
+			continue;
+
+		ret = l3proto->net_ns_get(net);
+		if (ret < 0)
+			clear_bit(i, ctnet->enabled);
+	}
+
+	rcu_read_unlock();
+
+	return 0;
+}
+
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
 static struct nf_ct_event_notifier ctnl_notifier = {
 	.fcn = ctnetlink_conntrack_event,
@@ -3304,6 +3340,7 @@ static const struct nfnetlink_subsystem ctnl_subsys = {
 	.subsys_id			= NFNL_SUBSYS_CTNETLINK,
 	.cb_count			= IPCTNL_MSG_MAX,
 	.cb				= ctnl_cb,
+	.bind				= ctnl_bind,
 };
 
 static const struct nfnetlink_subsystem ctnl_exp_subsys = {
@@ -3311,6 +3348,7 @@ static const struct nfnetlink_subsystem ctnl_exp_subsys = {
 	.subsys_id			= NFNL_SUBSYS_CTNETLINK_EXP,
 	.cb_count			= IPCTNL_MSG_EXP_MAX,
 	.cb				= ctnl_exp_cb,
+	.bind				= ctnl_bind,
 };
 
 MODULE_ALIAS("ip_conntrack_netlink");
@@ -3346,10 +3384,28 @@ err_out:
 
 static void ctnetlink_net_exit(struct net *net)
 {
+	struct ctnl_net *ctnet = net_generic(net, ctnetlink_net_id);
+	int i;
+
 #ifdef CONFIG_NF_CONNTRACK_EVENTS
 	nf_ct_expect_unregister_notifier(net, &ctnl_notifier_exp);
 	nf_conntrack_unregister_notifier(net, &ctnl_notifier);
 #endif
+	rcu_read_lock();
+
+	for (i = 0; i < NFPROTO_NUMPROTO; i++) {
+		struct nf_conntrack_l3proto *l3proto;
+
+		if (!test_bit(i, ctnet->enabled))
+			continue;
+
+		l3proto = __nf_ct_l3proto_find(i);
+		if (WARN_ON(!l3proto || !l3proto->net_ns_put))
+			continue;
+		l3proto->net_ns_put(net);
+	}
+
+	rcu_read_unlock();
 }
 
 static void __net_exit ctnetlink_net_exit_batch(struct list_head *net_exit_list)
@@ -3363,6 +3419,8 @@ static void __net_exit ctnetlink_net_exit_batch(struct list_head *net_exit_list)
 static struct pernet_operations ctnetlink_net_ops = {
 	.init		= ctnetlink_net_init,
 	.exit_batch	= ctnetlink_net_exit_batch,
+	.id   = &ctnetlink_net_id,
+	.size = sizeof(struct ctnl_net),
 };
 
 static int __init ctnetlink_init(void)
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index f1d9e88..d2ad3fd 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -480,11 +480,10 @@ static void nfnetlink_rcv(struct sk_buff *skb)
 	}
 }
 
-#ifdef CONFIG_MODULES
 static int nfnetlink_bind(struct net *net, int group)
 {
 	const struct nfnetlink_subsystem *ss;
-	int type;
+	int type, ret;
 
 	if (group <= NFNLGRP_NONE || group > NFNLGRP_MAX)
 		return 0;
@@ -492,13 +491,24 @@ static int nfnetlink_bind(struct net *net, int group)
 	type = nfnl_group2type[group];
 
 	rcu_read_lock();
-	ss = nfnetlink_get_subsys(type);
-	rcu_read_unlock();
-	if (!ss)
+	ss = nfnetlink_get_subsys(type << 8);
+	ret = -EINVAL;
+#ifdef CONFIG_MODULES
+	if (!ss) {
+		rcu_read_unlock();
 		request_module("nfnetlink-subsys-%d", type);
-	return 0;
-}
+		rcu_read_lock();
+		ss = nfnetlink_get_subsys(type << 8);
+	}
 #endif
+	if (!ss)
+		goto out;
+
+	ret = ss->bind ? ss->bind(net) : 0;
+ out:
+	rcu_read_unlock();
+	return ret;
+}
 
 static int __net_init nfnetlink_net_init(struct net *net)
 {
@@ -506,9 +516,7 @@ static int __net_init nfnetlink_net_init(struct net *net)
 	struct netlink_kernel_cfg cfg = {
 		.groups	= NFNLGRP_MAX,
 		.input	= nfnetlink_rcv,
-#ifdef CONFIG_MODULES
 		.bind	= nfnetlink_bind,
-#endif
 	};
 
 	nfnl = netlink_kernel_create(net, NETLINK_NETFILTER, &cfg);
-- 
2.0.5


* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-27 16:35     ` Florian Westphal
@ 2015-10-28 12:39       ` Jan Engelhardt
  2015-10-29 20:35       ` Pablo Neira Ayuso
  1 sibling, 0 replies; 20+ messages in thread
From: Jan Engelhardt @ 2015-10-28 12:39 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel


On Tuesday 2015-10-27 17:35, Florian Westphal wrote:
>> 
>> Good point, I'll look at this tomorrow.  It should not be too hard to
>> add this.
>
>Ahem.  There are strings attached... :-/
>
>So conntrack -L or conntrack -E do not enable connection tracking
>if it's not enabled (on current kernels).
>
>So one has to load ipv4/ipv6 etc tracker explicitly.
>
>Problem *after* patches is that this doesn't suffice.
>
>So old behaviour:
>conntrack -E
>
>(nothing happens)
>(modprobe nf_conntrack_ipv4)
>(conntrack -E starts to display events)
>
>new behaviour:
>(modprobe nf_conntrack_ipv4)
>(conntrack -E doesn't display events since conntrack module doesn't
> see packets due to lack of nf hooks).
>
>My first attempt to fix this was to hook into nfnetlink bind,
>but that doesn't really work in a backwards-compatible fashion since
>it only makes 'modprobe nf_conntrack_ipv4; conntrack -E' work, but
>not nf_conntrack_ipv4 module load *after* an event listener is already
>running.
>
>The other alternative is to request all the protocol trackers during
>ctnetlink bind request but that sucks.
>
>Any suggestion?  I don't really see a way out of this.

I am thinking of something like

  echo +PROTO >/proc/net/.../nf_conntrack/bind
  echo -PROTO >/proc/net/.../nf_conntrack/bind

That way, userspace can request enablement per netns. And nf_conntrack can
do
1. call request_module to load it if not already in the system,
2. pin the particular nf_conntrack_PROTO module, with refcounting
   (try_module_get() and module_put(), one ref per netns).
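
The two steps above could be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: bind_write(),
struct netns_model and module_refs are hypothetical names, the counter
stands in for try_module_get()/module_put(), and treating a repeated
"+PROTO" as a no-op is an assumption, not part of the proposal.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel API. */
enum { PROTO_IPV4, PROTO_IPV6, PROTO_MAX };

/* Stand-in for the module refcount: one reference per enabled netns. */
static unsigned int module_refs[PROTO_MAX];

struct netns_model {
	int enabled[PROTO_MAX];	/* 1 if this netns asked for the tracker */
};

static int proto_lookup(const char *name)
{
	if (strcmp(name, "ipv4") == 0)
		return PROTO_IPV4;
	if (strcmp(name, "ipv6") == 0)
		return PROTO_IPV6;
	return -1;
}

/* Handle one write of "+PROTO" or "-PROTO"; returns 0, or -1 on bad input. */
static int bind_write(struct netns_model *ns, const char *cmd)
{
	int proto;

	if (cmd[0] != '+' && cmd[0] != '-')
		return -1;
	proto = proto_lookup(cmd + 1);
	if (proto < 0)
		return -1;

	if (cmd[0] == '+') {
		if (ns->enabled[proto])
			return 0;	/* assumed idempotent */
		/* kernel side: request_module(), try_module_get(), hooks */
		ns->enabled[proto] = 1;
		module_refs[proto]++;
		return 0;
	}

	if (!ns->enabled[proto])
		return -1;		/* unbalanced "-PROTO" */
	/* kernel side: unregister hooks, then module_put() */
	ns->enabled[proto] = 0;
	module_refs[proto]--;
	return 0;
}
```

With two namespaces each writing "+ipv4", the model holds two
references; the tracker only becomes unpinned after both have written
"-ipv4".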


* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-27 16:35     ` Florian Westphal
  2015-10-28 12:39       ` Jan Engelhardt
@ 2015-10-29 20:35       ` Pablo Neira Ayuso
  2015-10-29 22:13         ` Florian Westphal
  1 sibling, 1 reply; 20+ messages in thread
From: Pablo Neira Ayuso @ 2015-10-29 20:35 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi Florian,

On Tue, Oct 27, 2015 at 05:35:52PM +0100, Florian Westphal wrote:
> Ahem.  There are strings attached... :-/
> 
> So conntrack -L or conntrack -E do not enable connection tracking
> if it's not enabled (on current kernels).
> 
> So one has to load ipv4/ipv6 etc tracker explicitly.
> 
> Problem *after* patches is that this doesn't suffice.
> 
> So old behaviour:
> conntrack -E
> 
> (nothing happens)
> (modprobe nf_conntrack_ipv4)
> (conntrack -E starts to display events)
> 
> new behaviour:
> (modprobe nf_conntrack_ipv4)
> (conntrack -E doesn't display events since conntrack module doesn't
>  see packets due to lack of nf hooks).
> 
> My first attempt to fix this was to hook into nfnetlink bind,
> but that doesn't really work in a backwards-compatible fashion since
> it only makes 'modprobe nf_conntrack_ipv4; conntrack -E' work, but
> not nf_conntrack_ipv4 module load *after* an event listener is already
> running.

So conntrack -L currently uses NFPROTO_UNSPEC by default and from
conntrack -E we subscribe to the generic groups.

> The other alternative is to request all the protocol trackers during
> ctnetlink bind request but that sucks.

Agreed, that sucks :).

> Any suggestion?  I don't really see a way out of this.

We can probably register the hooks from ctnetlink based on what we
already have, i.e. if nf_conntrack_ipv4 is loaded and someone runs
conntrack -E (or any custom application that listens to events),
then we get the hooks registered.

On top of that, assuming someone modprobes nf_conntrack_ipv6 later on,
we'll have to iterate over the list of netns available and register
the hooks if anyone is already listening to events as well.

Remember that we now also have nfnetlink_log and nfnetlink_queue
integration with conntrack; there we should register the hooks too in
case the userspace application requests conntrack information.

Another possible solution: we add a sysctl switch to the core that
indicates whether conntrack/defrag hooks are enabled by default (should
be 1 to maintain backward compatibility).

When unset, the user needs to explicitly indicate:

        iptables -I ... -j CT --track

that we want connection tracking, so we stop playing guessing games,
which (although transparent) seems a bit fragile to me.
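
As a rough illustration of that decision, here is a userspace sketch.
The struct and field names are hypothetical, and both the sysctl and
the "-j CT --track" option are only the proposals made above, not
existing interfaces.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the field names are illustrative only. */
struct netns_model {
	bool hooks_default;	/* proposed sysctl; 1 keeps the old behaviour */
	bool explicit_track;	/* an explicit "track this" rule is present */
};

/* Hooks are wanted if the default is on or the ruleset asked for them. */
static bool conntrack_hooks_wanted(const struct netns_model *ns)
{
	return ns->hooks_default || ns->explicit_track;
}
```

With the switch at 1 nothing changes for existing setups; with it at 0
a namespace without an explicit rule gets no conntrack hooks.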

The second path is something that we need to explore anyway for nft as
you indicated in your previous email, so this can probably solve the
problem both for iptables and nft in a more generic way.

Will be giving another spin to this tomorrow, just arrived from a long
trip.

Thanks!

* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-29 20:35       ` Pablo Neira Ayuso
@ 2015-10-29 22:13         ` Florian Westphal
  2015-11-02 13:00           ` Florian Westphal
  0 siblings, 1 reply; 20+ messages in thread
From: Florian Westphal @ 2015-10-29 22:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > (nothing happens)
> > (modprobe nf_conntrack_ipv4)
> > (conntrack -E starts to display events)
> > 
> > new behaviour:
> > (modprobe nf_conntrack_ipv4)
> > (conntrack -E doesn't display events since conntrack module doesn't
> >  see packets due to lack of nf hooks).
> > 
> > My first attempt to fix this was to hook into nfnetlink bind,
> > but that doesn't really work in a backwards-compatible fashion since
> > it only makes 'modprobe nf_conntrack_ipv4; conntrack -E' work, but
> > not nf_conntrack_ipv4 module load *after* an event listener is already
> > running.
> 
> So conntrack -L currently uses NFPROTO_UNSPEC by default and from
> conntrack -E we subscribe to the generic groups.

Yes.

> > The other alternative is to request all the protocol trackers during
> > ctnetlink bind request but that sucks.
> 
> Agreed, that sucks :).

Good, I would have hated implementing it ;)

> > Any suggestion?  I don't really see a way out of this.
> 
> We can probably register the hooks from ctnetlink based on what we
> already have, i.e. if nf_conntrack_ipv4 is loaded and someone runs
> conntrack -E (or any custom application that listens to events),
> then we get the hooks registered.

Right, that's what the RFC patch I sent does: it asks all the loaded
ones to register the hooks on ctnetlink bind (i.e. conntrack -L doesn't
register anything either, but I'm not sure that's what we would want here
since -L is just 'show me all conntrack entries' and there are none;
it seems wrong to take a display request as a hint that conntrack should
be enabled).

> On top of that, assuming someone modprobes nf_conntrack_ipv6 later on,
> we'll have to iterate over the list of netns available and register
> the hooks if anyone is already listening to events as well.

Hmm, I wonder if this is doable without adding any additional module
dependencies... I'll check if it's feasible.

> Remember that we now also have nfnetlink_log and nfnetlink_queue
> integration with conntrack; there we should register the hooks too in
> case the userspace application requests conntrack information.

Ugh, I forgot.  I don't see how this is fixable at the moment.
Userspace has to set the _CFG_F_CONNTRACK flag, but I am not (yet)
convinced that the mere presence of this flag should force registration
of the conntrack hooks.

We'd also need yet another exported pointer-hook to prevent direct
module dependencies from nfqueue/nflog to conntrack.  The best way seems
to be to add a "passive" register request function to

struct nfnl_ct_hook {
	...
}

in nf_conntrack_netlink.c

[ passive in the sense of 'register hooks of all nf_conntrack_nfproto
  modules loaded, but don't grab refcount on those modules and don't
  modprobe anything ]

> Another possible solution: we add a sysctl switch to the core that
> indicates whether conntrack/defrag hooks are enabled by default (should
> be 1 to maintain backward compatibility).
> 
> When unset, the user needs to explicitly indicate:
> 
>         iptables -I ... -j CT --track
> 
> that we want connection tracking, so we stop playing guessing games,
> which (although transparent) seems a bit fragile to me.

Not really, I mean if someone has a -m conntrack or DNAT or whatever
rule it's crystal clear that we need conntrack enabled.
So I don't really think there are any 'guessing games' to be played.

The only corner case that I see is a loaded conntrack with no rule
dependencies at all (no nat, no stateful filtering of any kind)
but e.g. traffic accounting via ctnetlink.

[ The -j CT --track thing has other advantages such as allowing
  fine grained control over what needs tracking rather than the 'notrack'
  stuff we have right now; I'd definitely like to see this as well ]

> The second path is something that we need to explore anyway for nft as
> you indicated in your previous email, so this can probably solve the
> problem both for iptables and nft in a more generic way.

Right.

> Will be giving another spin to this tomorrow, just arrived from a long
> trip.

Take your time, this beast won't be ready for next -next (he he) anyway.

* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-29 22:13         ` Florian Westphal
@ 2015-11-02 13:00           ` Florian Westphal
  0 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-11-02 13:00 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel

Florian Westphal <fw@strlen.de> wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > We can probably register the hooks from ctnetlink based on what we
> > already have, ie. if nf_conntrack_ipv4 is loaded and someone runs
> > conntrack -E (or whatever custom application to listen to events),
> > then we get the hooks registered.
> 
> Right, that's what the RFC patch I sent does: it asks all the loaded
> ones to register the hooks on ctnetlink bind (i.e. conntrack -L doesn't
> register anything either, but I'm not sure that's what we would want here
> since -L is just 'show me all conntrack entries' and there are none;
> it seems wrong to take a display request as a hint that conntrack should
> be enabled).
> 
> > On top of that, assuming someone modprobes nf_conntrack_ipv6 later on,
> > we'll have to iterate over the list of netns available and register
> > the hooks if anyone is already listening to events as well.
> 
> Hmm, I wonder if this is doable without adding any additional module
> dependencies... I'll check if it's feasible.

So the current plan is to add a 'notification' call to
ctnetlink_glue_hook, i.e.

 .newproto       = ctnl_newproto,

and then, from nf_ct_l3proto_register(), do:
nfnl_ct = rcu_dereference(nfnl_ct_hook);
if (nfnl_ct)
	nfnl_ct->newproto(nfproto);

This would kick ctnetlink without adding a module dependency.

Downside is that ctnetlink would need some additional logic to
make newproto() do nothing unless we saw at least one bind attempt
before (otherwise we'd always register the hooks for a new conntracker
in all the network namespaces).

For that we'd have to iterate over all the network namespaces we have
and check a per-netns knob if ctnetlink was requested at some point in the
past.

It might be acceptable though since nf_ct_l3proto_register is only
called from module init hooks.
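
A userspace sketch of that plan, to make the control flow concrete.
Everything here is a hypothetical model: struct ct_hook_model, the
bind_seen knob and the fixed-size namespace array are illustrative
stand-ins, and a plain pointer read replaces rcu_dereference().

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NETNS 4
#define MAX_PROTO 3

/* Hypothetical userspace model of the plan; not kernel code. */
struct netns_model {
	int bind_seen;		/* per-netns knob: ctnetlink was requested */
	int hooks[MAX_PROTO];	/* 1 once hooks for that family are registered */
};

struct ct_hook_model {
	void (*newproto)(int nfproto);
};

static struct netns_model all_netns[MAX_NETNS];
static const struct ct_hook_model *ct_hook;	/* NULL: ctnetlink not loaded */

/* ctnetlink side: register hooks only where a listener showed up before. */
static void ctnl_newproto(int nfproto)
{
	if (nfproto < 0 || nfproto >= MAX_PROTO)
		return;
	for (size_t i = 0; i < MAX_NETNS; i++)
		if (all_netns[i].bind_seen)
			all_netns[i].hooks[nfproto] = 1;
}

static const struct ct_hook_model ctnl_hook = { .newproto = ctnl_newproto };

/* Tracker side: stands in for the call from nf_ct_l3proto_register(). */
static void l3proto_register(int nfproto)
{
	const struct ct_hook_model *hook = ct_hook;

	if (hook && hook->newproto)
		hook->newproto(nfproto);
}
```

The tracker never references ctnetlink symbols directly, which is the
point of going through the hook pointer: with no hook installed the
registration is a no-op.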

What do you think?

> > Remember that we now also have nfnetlink_log and nfnetlink_queue
> > integration with conntrack; there we should register the hooks too in
> > case the userspace application requests conntrack information.
> 
> Ugh, I forgot.  I don't see how this is fixable at the moment.
> Userspace has to set the _CFG_F_CONNTRACK flag, but I am not (yet)
> convinced that the mere presence of this flag should force registration
> of the conntrack hooks.
> 
> We'd also need yet another exported pointer-hook to prevent direct
> module dependencies from nfqueue/nflog to conntrack.  The best way seems
> to be to add a "passive" register request function to
> 
> struct nfnl_ct_hook {
> 	...
> }
> 
> in nf_conntrack_netlink.c
> 
> [ passive in the sense of 'register hooks of all nf_conntrack_nfproto
>   modules loaded, but don't grab refcount on those modules and don't
>   modprobe anything ]

This is less intrusive than I thought and should not affect any hotpath:
we can check at config time whether this feature is wanted and then call

nfnl_ct_hook->register_hooks(net)

* Re: [PATCH v2 nf-next 1/9] netfilter: ingress: don't use nf_hook_list_active
  2015-10-23 10:43 ` [PATCH v2 nf-next 1/9] netfilter: ingress: don't use nf_hook_list_active Florian Westphal
@ 2015-11-06 18:33   ` Pablo Neira Ayuso
  0 siblings, 0 replies; 20+ messages in thread
From: Pablo Neira Ayuso @ 2015-11-06 18:33 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Fri, Oct 23, 2015 at 12:43:18PM +0200, Florian Westphal wrote:
> nf_hook_list_active() always returns true once at least one device has
> NF_INGRESS hook enabled.
> 
> Thus, don't use this function. Instead, invert the test and use the static
> key to elide list_empty test if no NF_INGRESS hooks are active.

Florian, I think this qualifies as a fix, I'm going to apply this to
the nf tree.

Thanks.

* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-10-23 10:43 [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Florian Westphal
                   ` (9 preceding siblings ...)
  2015-10-26 22:55 ` [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces Pablo Neira Ayuso
@ 2015-11-24 10:27 ` Pablo Neira Ayuso
  2015-11-24 10:59   ` Florian Westphal
  10 siblings, 1 reply; 20+ messages in thread
From: Pablo Neira Ayuso @ 2015-11-24 10:27 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Fri, Oct 23, 2015 at 12:43:17PM +0200, Florian Westphal wrote:
> Ads section:
> conntrack+filter + nat table used in init namespace, single TCP_STREAM lo netperf:
> 87380  16384  16384    30.00    14348.66
> with patch set, netperf running in net namespace without rules:
> 87380  16384  16384    30.00    15683.97
> 
> routing from ns3 -> ns2, filter + nat table & conntrack in all namespaces:
> 87380  16384  16384    30.00    5664.46
> without conntrack+any tables in those namespaces:
> 87380  16384  16384    30.00    7336.54

Florian, I haven't had time for this so far, but I really expect that
you follow up on this with a new version addressing or summarizing
possible solutions for the corner cases that we have discussed
previously.

We definitely need this for better netns support in iptables.

Thanks.

* Re: [PATCH v2 nf-next 0/9] netfilter: don't copy initns hooks to new namespaces
  2015-11-24 10:27 ` Pablo Neira Ayuso
@ 2015-11-24 10:59   ` Florian Westphal
  0 siblings, 0 replies; 20+ messages in thread
From: Florian Westphal @ 2015-11-24 10:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Oct 23, 2015 at 12:43:17PM +0200, Florian Westphal wrote:
> > Ads section:
> > conntrack+filter + nat table used in init namespace, single TCP_STREAM lo netperf:
> > 87380  16384  16384    30.00    14348.66
> > with patch set, netperf running in net namespace without rules:
> > 87380  16384  16384    30.00    15683.97
> > 
> > routing from ns3 -> ns2, filter + nat table & conntrack in all namespaces:
> > 87380  16384  16384    30.00    5664.46
> > without conntrack+any tables in those namespaces:
> > 87380  16384  16384    30.00    7336.54
> 
> Florian, I haven't had time for this so far, but I really expect that
> you follow up on this with a new version addressing or summarizing
> possible solutions for the corner cases that we have discussed
> previously.

Yes, a new version will be coming, adding a new

+       void (*newproto)(void);

to struct nfnl_ct_hook.  This allows ct protocol registration
to make ctnetlink aware of new protocols if events are currently in use.

Drawback: once such hooks are registered, they won't go away anymore
unless the ns is destroyed or the module is unloaded, but I don't think
it's a problem.  We can discuss the fine print when the next round is sent.

It's ready but untested, and needs another rebase.
I hope I get to it later this week.

Thanks for your patience.
