All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
@ 2015-04-10 12:15 Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 1/7] net: refactor __netif_receive_skb_core Pablo Neira Ayuso
                   ` (7 more replies)
  0 siblings, 8 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

This patchset adds the Netfilter hook at the ingress path, in a per-device
fashion. This also comes with the new nf_tables 'netdev' family support to
provide access to users to the existing nf_tables features. This includes the
transactional netlink API and the enhanced set infrastructure.  Several patches
come in first place to prepare this support, including the refactoring of
__netif_receive_skb_core() to accomodate the new hook.

As in other netfilter hooks, a static key is used to disable the hook path when
no hooks are registered, so the ingress path remains similar to what we already
have. The static key that governs this is global.

The netfilter ingress hook comes just after handle_ing() to honor the existing
qdisc ingress infrastructure.

You can find a simple example ruleset file to account traffic at:

	http://people.netfilter.org/pablo/nft-ingress.ruleset

Comments are welcome, thanks.

Pablo Neira Ayuso (7):
  net: refactor __netif_receive_skb_core
  netfilter: add nf_hook_list_active()
  netfilter: add hook list to nf_hook_state
  netfilter: cleanup struct nf_hook_ops struct indentation
  net: add netfilter ingress hook
  netfilter: nf_tables: allow to bind table to net_device
  netfilter: nf_tables: add netdev table to filter from ingress

 include/linux/netdevice.h		  |    4 +-
 include/linux/netfilter.h		  |   39 +++---
 include/linux/netfilter_ingress.h	  |   80 +++++++++++++
 include/net/netfilter/nf_tables.h	  |    8 ++
 include/net/netns/nftables.h		  |    1 +
 include/uapi/linux/netfilter.h		  |    6 +
 include/uapi/linux/netfilter/nf_tables.h |    2 +
 net/core/dev.c				  |  190 +++++++++++++++++++-----------
 net/netfilter/Kconfig			  |    6 +
 net/netfilter/Makefile			  |    1 +
 net/netfilter/core.c			  |   28 ++++-
 net/netfilter/nf_tables_api.c		  |   47 +++++++-
 net/netfilter/nf_tables_netdev.c	  |  182 ++++++++++++++++++++++++++++
 13 files changed, 502 insertions(+), 92 deletions(-)
 create mode 100644 include/linux/netfilter_ingress.h
 create mode 100644 net/netfilter/nf_tables_netdev.c

--
1.7.10.4


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 13:47   ` Daniel Borkmann
  2015-04-10 19:56   ` Alexander Duyck
  2015-04-10 12:15 ` [PATCH 2/7] netfilter: add nf_hook_list_active() Pablo Neira Ayuso
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

This patch splits __netif_receive_skb_core() in smaller functions to improve
maintainability.

The function __netif_receive_skb_core() has been split in two:

* __netif_receive_skb_ingress(), to perform all actions up to
  ingress filtering.

* __netif_receive_skb_finish(), if the ingress filter accepts this
  packet, pass it to the corresponding packet_type function handler for further
processing.

This patch also adds __NET_RX_ANOTHER_ROUND that is used when the packet is
stripped off from the vlan header or in case the rx_handler needs it.

This also prepares the introduction of the netfilter ingress hook.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/core/dev.c |  156 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 89 insertions(+), 67 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index b2775f0..0e19e4f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3638,72 +3638,17 @@ static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
 	}
 }
 
-static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+#define __NET_RX_ANOTHER_ROUND	2
+
+static int __netif_receive_skb_finish(struct sk_buff *skb, bool pfmemalloc,
+				      struct packet_type *pt_prev,
+				      struct net_device *orig_dev)
 {
-	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
-	struct net_device *orig_dev;
 	bool deliver_exact = false;
-	int ret = NET_RX_DROP;
+	int ret;
 	__be16 type;
 
-	net_timestamp_check(!netdev_tstamp_prequeue, skb);
-
-	trace_netif_receive_skb(skb);
-
-	orig_dev = skb->dev;
-
-	skb_reset_network_header(skb);
-	if (!skb_transport_header_was_set(skb))
-		skb_reset_transport_header(skb);
-	skb_reset_mac_len(skb);
-
-	pt_prev = NULL;
-
-	rcu_read_lock();
-
-another_round:
-	skb->skb_iif = skb->dev->ifindex;
-
-	__this_cpu_inc(softnet_data.processed);
-
-	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
-	    skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
-		skb = skb_vlan_untag(skb);
-		if (unlikely(!skb))
-			goto unlock;
-	}
-
-#ifdef CONFIG_NET_CLS_ACT
-	if (skb->tc_verd & TC_NCLS) {
-		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
-		goto ncls;
-	}
-#endif
-
-	if (pfmemalloc)
-		goto skip_taps;
-
-	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (pt_prev)
-			ret = deliver_skb(skb, pt_prev, orig_dev);
-		pt_prev = ptype;
-	}
-
-	list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
-		if (pt_prev)
-			ret = deliver_skb(skb, pt_prev, orig_dev);
-		pt_prev = ptype;
-	}
-
-skip_taps:
-#ifdef CONFIG_NET_CLS_ACT
-	skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
-	if (!skb)
-		goto unlock;
-ncls:
-#endif
-
 	if (pfmemalloc && !skb_pfmemalloc_protocol(skb))
 		goto drop;
 
@@ -3713,9 +3658,9 @@ ncls:
 			pt_prev = NULL;
 		}
 		if (vlan_do_receive(&skb))
-			goto another_round;
+			return __NET_RX_ANOTHER_ROUND;
 		else if (unlikely(!skb))
-			goto unlock;
+			return NET_RX_SUCCESS;
 	}
 
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
@@ -3726,10 +3671,9 @@ ncls:
 		}
 		switch (rx_handler(&skb)) {
 		case RX_HANDLER_CONSUMED:
-			ret = NET_RX_SUCCESS;
-			goto unlock;
+			return NET_RX_SUCCESS;
 		case RX_HANDLER_ANOTHER:
-			goto another_round;
+			return __NET_RX_ANOTHER_ROUND;
 		case RX_HANDLER_EXACT:
 			deliver_exact = true;
 		case RX_HANDLER_PASS:
@@ -3780,9 +3724,87 @@ drop:
 		 */
 		ret = NET_RX_DROP;
 	}
+	return ret;
+}
 
-unlock:
+static int __netif_receive_skb_ingress(struct sk_buff *skb, bool pfmemalloc,
+				       struct net_device *orig_dev)
+{
+	struct packet_type *ptype, *pt_prev = NULL;
+	int ret;
+
+	skb->skb_iif = skb->dev->ifindex;
+
+	__this_cpu_inc(softnet_data.processed);
+
+	if (skb->protocol == cpu_to_be16(ETH_P_8021Q) ||
+	    skb->protocol == cpu_to_be16(ETH_P_8021AD)) {
+		skb = skb_vlan_untag(skb);
+		if (unlikely(!skb))
+			return NET_RX_DROP;
+	}
+
+#ifdef CONFIG_NET_CLS_ACT
+	if (skb->tc_verd & TC_NCLS) {
+		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
+		return NET_RX_SUCCESS;
+	}
+#endif
+
+	if (pfmemalloc)
+		goto skip_taps;
+
+	list_for_each_entry_rcu(ptype, &ptype_all, list) {
+		if (pt_prev)
+			ret = deliver_skb(skb, pt_prev, orig_dev);
+		pt_prev = ptype;
+	}
+
+	list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
+		if (pt_prev)
+			ret = deliver_skb(skb, pt_prev, orig_dev);
+		pt_prev = ptype;
+	}
+
+skip_taps:
+#ifdef CONFIG_NET_CLS_ACT
+	skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
+	if (!skb)
+		return NET_RX_DROP;
+#endif
+
+	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
+}
+
+static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
+{
+	struct net_device *orig_dev;
+	int ret;
+
+	net_timestamp_check(!netdev_tstamp_prequeue, skb);
+
+	trace_netif_receive_skb(skb);
+
+	orig_dev = skb->dev;
+
+	skb_reset_network_header(skb);
+	if (!skb_transport_header_was_set(skb))
+		skb_reset_transport_header(skb);
+	skb_reset_mac_len(skb);
+
+	rcu_read_lock();
+
+another_round:
+	ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev);
+	switch (ret) {
+	case NET_RX_SUCCESS:
+	case NET_RX_DROP:
+		break;
+	case __NET_RX_ANOTHER_ROUND:
+		goto another_round;
+	}
 	rcu_read_unlock();
+
 	return ret;
 }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/7] netfilter: add nf_hook_list_active()
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 1/7] net: refactor __netif_receive_skb_core Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 3/7] netfilter: add hook list to nf_hook_state Pablo Neira Ayuso
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

In preparation to have netfilter ingress per-device hook list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter.h |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 63560d0..936f5c2 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -131,26 +131,33 @@ extern struct list_head nf_hooks[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 #ifdef HAVE_JUMP_LABEL
 extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 
-static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
+static inline bool nf_hook_list_active(struct list_head *nf_hook_list,
+				       u_int8_t pf, unsigned int hook)
 {
 	if (__builtin_constant_p(pf) &&
 	    __builtin_constant_p(hook))
 		return static_key_false(&nf_hooks_needed[pf][hook]);
 
-	return !list_empty(&nf_hooks[pf][hook]);
+	return !list_empty(nf_hook_list);
 }
 #else
-static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
+static inline bool nf_hook_list_active(struct list_head *nf_hook_list,
+				       u_int8_t pf, unsigned int hook)
 {
-	return !list_empty(&nf_hooks[pf][hook]);
+	return !list_empty(nf_hook_list);
 }
 #endif
 
+static inline bool nf_hooks_active(u_int8_t pf, unsigned int hook)
+{
+	return nf_hook_list_active(&nf_hooks[pf][hook], pf, hook);
+}
+
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state);
 
 /**
  *	nf_hook_thresh - call a netfilter hook
- *	
+ *
  *	Returns 1 if the hook has allowed the packet to pass.  The function
  *	okfn must be invoked by the caller in this case.  Any other return
  *	value indicates the packet has been consumed by the hook.
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/7] netfilter: add hook list to nf_hook_state
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 1/7] net: refactor __netif_receive_skb_core Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 2/7] netfilter: add nf_hook_list_active() Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 4/7] netfilter: cleanup struct nf_hook_ops struct indentation Pablo Neira Ayuso
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter.h |    7 +++++--
 net/netfilter/core.c      |    6 ++----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 936f5c2..a605ebe 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -54,10 +54,12 @@ struct nf_hook_state {
 	struct net_device *in;
 	struct net_device *out;
 	struct sock *sk;
+	struct list_head *hook_list;
 	int (*okfn)(struct sock *, struct sk_buff *);
 };
 
 static inline void nf_hook_state_init(struct nf_hook_state *p,
+				      struct list_head *hook_list,
 				      unsigned int hook,
 				      int thresh, u_int8_t pf,
 				      struct net_device *indev,
@@ -71,6 +73,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
 	p->in = indev;
 	p->out = outdev;
 	p->sk = sk;
+	p->hook_list = hook_list;
 	p->okfn = okfn;
 }
 
@@ -173,8 +176,8 @@ static inline int nf_hook_thresh(u_int8_t pf, unsigned int hook,
 	if (nf_hooks_active(pf, hook)) {
 		struct nf_hook_state state;
 
-		nf_hook_state_init(&state, hook, thresh, pf,
-				   indev, outdev, sk, okfn);
+		nf_hook_state_init(&state, &nf_hooks[pf][hook], hook, thresh,
+				   pf, indev, outdev, sk, okfn);
 		return nf_hook_slow(skb, &state);
 	}
 	return 1;
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e616301..e418cfd 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -166,11 +166,9 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state)
 	/* We may already have this, but read-locks nest anyway */
 	rcu_read_lock();
 
-	elem = list_entry_rcu(&nf_hooks[state->pf][state->hook],
-			      struct nf_hook_ops, list);
+	elem = list_entry_rcu(state->hook_list, struct nf_hook_ops, list);
 next_hook:
-	verdict = nf_iterate(&nf_hooks[state->pf][state->hook], skb, state,
-			     &elem);
+	verdict = nf_iterate(state->hook_list, skb, state, &elem);
 	if (verdict == NF_ACCEPT || verdict == NF_STOP) {
 		ret = 1;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/7] netfilter: cleanup struct nf_hook_ops struct indentation
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
                   ` (2 preceding siblings ...)
  2015-04-10 12:15 ` [PATCH 3/7] netfilter: add hook list to nf_hook_state Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 13:27   ` Sergei Shtylyov
  2015-04-10 12:15 ` [PATCH 5/7] net: add netfilter ingress hook Pablo Neira Ayuso
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter.h |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index a605ebe..49d0063 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -82,16 +82,16 @@ typedef unsigned int nf_hookfn(const struct nf_hook_ops *ops,
 			       const struct nf_hook_state *state);
 
 struct nf_hook_ops {
-	struct list_head list;
+	struct list_head 	list;
 
 	/* User fills in from here down. */
-	nf_hookfn	*hook;
-	struct module	*owner;
-	void		*priv;
-	u_int8_t	pf;
-	unsigned int	hooknum;
+	nf_hookfn		*hook;
+	struct module		*owner;
+	void			*priv;
+	u_int8_t		pf;
+	unsigned int		hooknum;
 	/* Hooks are ordered in ascending priority. */
-	int		priority;
+	int			priority;
 };
 
 struct nf_sockopt_ops {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
                   ` (3 preceding siblings ...)
  2015-04-10 12:15 ` [PATCH 4/7] netfilter: cleanup struct nf_hook_ops struct indentation Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 13:21   ` Thomas Graf
  2015-04-10 12:15 ` [PATCH 6/7] netfilter: nf_tables: allow to bind table to net_device Pablo Neira Ayuso
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

This patch adds a new NFPROTO_NETDEV family that allows you to register hooks
from the ingress path.

This patch adds a hook list per device, so this introduces a new net_device
structure pointer to nf_hook_ops that needs to be set before hook registration.
The caller is responsible for holding/putting the reference on the net_device
that is attached to nf_hook_ops.

As in other netfilter hooks, we have a static key to enable the netfilter path
if we at least have one registered hook. So the code follows the usual path for
people that don't need this.

To keep the context around, the skb->cb area is used to save and restore the
relevant information since we're in full control of this area as it happens in
handle_ing().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netdevice.h         |    4 +-
 include/linux/netfilter.h         |    1 +
 include/linux/netfilter_ingress.h |   80 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/netfilter.h    |    6 +++
 net/core/dev.c                    |   34 ++++++++++++++++
 net/netfilter/core.c              |   22 +++++++++-
 6 files changed, 145 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/netfilter_ingress.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf6d9df..26f7c65 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1645,7 +1645,9 @@ struct net_device {
 
 	struct netdev_queue __rcu *ingress_queue;
 	unsigned char		broadcast[MAX_ADDR_LEN];
-
+#ifdef CONFIG_NETFILTER
+	struct list_head	nf_hooks_ingress;
+#endif
 
 /*
  * Cache lines mostly used on transmit path
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 49d0063..1138cea 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -90,6 +90,7 @@ struct nf_hook_ops {
 	void			*priv;
 	u_int8_t		pf;
 	unsigned int		hooknum;
+	struct net_device	*dev;
 	/* Hooks are ordered in ascending priority. */
 	int			priority;
 };
diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_ingress.h
new file mode 100644
index 0000000..95c4537
--- /dev/null
+++ b/include/linux/netfilter_ingress.h
@@ -0,0 +1,80 @@
+#ifndef _NETFILTER_INGRESS_H_
+#define _NETFILTER_INGRESS_H_
+
+#include <linux/netfilter.h>
+#include <linux/netdevice.h>
+
+#ifdef CONFIG_NETFILTER
+struct nf_ingress_skb_cb {
+	struct packet_type	*pt_prev;
+	struct net_device	*orig_dev;
+	bool			pfmemalloc;
+};
+
+static inline struct nf_ingress_skb_cb *nf_ingress_skb_cb(struct sk_buff *skb)
+{
+	return (struct nf_ingress_skb_cb *)skb->cb;
+}
+
+static inline void nf_ingress_ctx_save(struct sk_buff *skb,
+				       struct packet_type *pt_prev,
+				       struct net_device *orig_dev,
+				       bool pfmemalloc)
+{
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct nf_ingress_skb_cb));
+
+	nf_ingress_skb_cb(skb)->orig_dev = orig_dev;
+	nf_ingress_skb_cb(skb)->pt_prev = pt_prev;
+	nf_ingress_skb_cb(skb)->pfmemalloc = pfmemalloc;
+}
+
+static inline void nf_ingress_ctx_restore(struct sk_buff *skb,
+					  struct packet_type **pt_prev,
+					  struct net_device **orig_dev,
+					  bool *pfmemalloc)
+{
+	*orig_dev = nf_ingress_skb_cb(skb)->orig_dev;
+	*pt_prev = nf_ingress_skb_cb(skb)->pt_prev;
+	*pfmemalloc = nf_ingress_skb_cb(skb)->pfmemalloc;
+}
+
+static inline int nf_hook_ingress_active(struct sk_buff *skb)
+{
+	return nf_hook_list_active(&skb->dev->nf_hooks_ingress,
+				   NFPROTO_NETDEV, NF_NETDEV_INGRESS);
+}
+
+static inline int
+__nf_hook_ingress(struct sk_buff *skb,
+		  int (*okfn)(struct sock *sk, struct sk_buff *skb))
+{
+	struct nf_hook_state state;
+	int ret;
+
+	nf_hook_state_init(&state, &skb->dev->nf_hooks_ingress,
+			   NF_NETDEV_INGRESS, INT_MIN, NFPROTO_NETDEV, NULL,
+			   skb->dev, NULL, okfn);
+
+	ret = nf_hook_slow(skb, &state);
+	if (ret < 0)
+		return NET_RX_DROP;
+	else if (ret == 1)
+		ret = okfn(NULL, skb);
+
+	return ret;
+}
+#else
+static inline int nf_hook_ingress_active(struct sk_buff *skb)
+{
+	return 0;
+}
+
+static inline int nf_hook_ingress(struct sk_buff *skb,
+				  struct packet_type *pt_prev,
+				  struct net_device *orig_dev,
+				  bool pfmemalloc)
+{
+	BUG("nf_hook_ingress() called with CONFIG_NETFILTER disabled\n");
+}
+#endif /* CONFIG_NETFILTER */
+#endif /* _NETFILTER_INGRESS_H_ */
diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index ef1b1f8..177027c 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -51,11 +51,17 @@ enum nf_inet_hooks {
 	NF_INET_NUMHOOKS
 };
 
+enum nf_dev_hooks {
+	NF_NETDEV_INGRESS,
+	NF_NETDEV_NUMHOOKS
+};
+
 enum {
 	NFPROTO_UNSPEC =  0,
 	NFPROTO_INET   =  1,
 	NFPROTO_IPV4   =  2,
 	NFPROTO_ARP    =  3,
+	NFPROTO_NETDEV =  5,
 	NFPROTO_BRIDGE =  7,
 	NFPROTO_IPV6   = 10,
 	NFPROTO_DECNET = 12,
diff --git a/net/core/dev.c b/net/core/dev.c
index 0e19e4f..9ba8f27 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -135,6 +135,7 @@
 #include <linux/if_macvlan.h>
 #include <linux/errqueue.h>
 #include <linux/hrtimer.h>
+#include <linux/netfilter_ingress.h>
 
 #include "net-sysfs.h"
 
@@ -3727,6 +3728,33 @@ drop:
 	return ret;
 }
 
+#ifdef CONFIG_NETFILTER
+static int nf_hook_ingress_finish(struct sock *sk, struct sk_buff *skb)
+{
+	struct net_device *orig_dev;
+	struct packet_type *pt_prev;
+	bool pfmemalloc;
+
+	nf_ingress_ctx_restore(skb, &pt_prev, &orig_dev, &pfmemalloc);
+
+	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
+}
+
+static int nf_hook_ingress(struct sk_buff *skb, struct packet_type *pt_prev,
+			   struct net_device *orig_dev, bool pfmemalloc)
+{
+	int ret;
+
+	if (pt_prev) {
+		ret = deliver_skb(skb, pt_prev, orig_dev);
+		pt_prev = NULL;
+	}
+	nf_ingress_ctx_save(skb, pt_prev, orig_dev, pfmemalloc);
+
+	return __nf_hook_ingress(skb, nf_hook_ingress_finish);
+}
+#endif
+
 static int __netif_receive_skb_ingress(struct sk_buff *skb, bool pfmemalloc,
 				       struct net_device *orig_dev)
 {
@@ -3772,6 +3800,8 @@ skip_taps:
 	if (!skb)
 		return NET_RX_DROP;
 #endif
+	if (nf_hook_ingress_active(skb))
+		return nf_hook_ingress(skb, pt_prev, orig_dev, pfmemalloc);
 
 	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
 }
@@ -6862,6 +6892,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (netif_alloc_netdev_queues(dev))
 		goto free_all;
 
+#ifdef CONFIG_NETFILTER
+	INIT_LIST_HEAD(&dev->nf_hooks_ingress);
+#endif
+
 #ifdef CONFIG_SYSFS
 	dev->num_rx_queues = rxqs;
 	dev->real_num_rx_queues = rxqs;
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e418cfd..aa817d5 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -64,10 +64,23 @@ static DEFINE_MUTEX(nf_hook_mutex);
 
 int nf_register_hook(struct nf_hook_ops *reg)
 {
+	struct list_head *nf_hook_list;
 	struct nf_hook_ops *elem;
 
 	mutex_lock(&nf_hook_mutex);
-	list_for_each_entry(elem, &nf_hooks[reg->pf][reg->hooknum], list) {
+	switch (reg->pf) {
+	case NFPROTO_NETDEV:
+		if (reg->hooknum == NF_NETDEV_INGRESS) {
+			BUG_ON(reg->dev == NULL);
+			nf_hook_list = &reg->dev->nf_hooks_ingress;
+			break;
+		}
+		/* Fall through. */
+	default:
+		nf_hook_list = &nf_hooks[reg->pf][reg->hooknum];
+		break;
+	}
+	list_for_each_entry(elem, nf_hook_list, list) {
 		if (reg->priority < elem->priority)
 			break;
 	}
@@ -84,6 +97,13 @@ void nf_unregister_hook(struct nf_hook_ops *reg)
 {
 	mutex_lock(&nf_hook_mutex);
 	list_del_rcu(&reg->list);
+	switch (reg->pf) {
+	case NFPROTO_NETDEV:
+		WARN_ON(reg->dev == NULL);
+		break;
+	default:
+		break;
+	}
 	mutex_unlock(&nf_hook_mutex);
 #ifdef HAVE_JUMP_LABEL
 	static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/7] netfilter: nf_tables: allow to bind table to net_device
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
                   ` (4 preceding siblings ...)
  2015-04-10 12:15 ` [PATCH 5/7] net: add netfilter ingress hook Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 12:15 ` [PATCH 7/7] netfilter: nf_tables: add netdev table to filter from ingress Pablo Neira Ayuso
  2015-04-10 13:22 ` [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Thomas Graf
  7 siblings, 0 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

This patch adds the NFT_AF_DEV flag that indicates that indicates that you need
to attach this table to a given net_device.

This is required by a follow up patch that introduces netdev tables that contain
ingress chains.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_tables.h        |    8 +++++
 include/uapi/linux/netfilter/nf_tables.h |    2 ++
 net/netfilter/nf_tables_api.c            |   47 ++++++++++++++++++++++++++----
 3 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index d6a2f0e..aab6e13 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -776,6 +776,7 @@ unsigned int nft_do_chain(struct nft_pktinfo *pkt,
  *	@use: number of chain references to this table
  *	@flags: table flag (see enum nft_table_flags)
  *	@name: name of the table
+ *	@dev: this table is bound to a device (if any)
  */
 struct nft_table {
 	struct list_head		list;
@@ -785,6 +786,11 @@ struct nft_table {
 	u32				use;
 	u16				flags;
 	char				name[NFT_TABLE_MAXNAMELEN];
+	struct net_device		*dev;
+};
+
+enum nft_af_flags {
+	NFT_AF_DEV	= (1 << 0),
 };
 
 /**
@@ -794,6 +800,7 @@ struct nft_table {
  *	@family: address family
  *	@nhooks: number of hooks in this family
  *	@owner: module owner
+ *	@flags: family flags
  *	@tables: used internally
  *	@nops: number of hook ops in this family
  *	@hook_ops_init: initialization function for chain hook ops
@@ -804,6 +811,7 @@ struct nft_af_info {
 	int				family;
 	unsigned int			nhooks;
 	struct module			*owner;
+	u32				flags;
 	struct list_head		tables;
 	unsigned int			nops;
 	void				(*hook_ops_init)(struct nf_hook_ops *,
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 05ee1e0..3859d56 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -117,12 +117,14 @@ enum nft_table_flags {
  * @NFTA_TABLE_NAME: name of the table (NLA_STRING)
  * @NFTA_TABLE_FLAGS: bitmask of enum nft_table_flags (NLA_U32)
  * @NFTA_TABLE_USE: number of chains in this table (NLA_U32)
+ * @NFTA_TABLE_DEV: net device name (NLA_STRING)
  */
 enum nft_table_attributes {
 	NFTA_TABLE_UNSPEC,
 	NFTA_TABLE_NAME,
 	NFTA_TABLE_FLAGS,
 	NFTA_TABLE_USE,
+	NFTA_TABLE_DEV,
 	__NFTA_TABLE_MAX
 };
 #define NFTA_TABLE_MAX		(__NFTA_TABLE_MAX - 1)
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 0b96fa0..4d3a157 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -399,6 +399,8 @@ static const struct nla_policy nft_table_policy[NFTA_TABLE_MAX + 1] = {
 	[NFTA_TABLE_NAME]	= { .type = NLA_STRING,
 				    .len = NFT_TABLE_MAXNAMELEN - 1 },
 	[NFTA_TABLE_FLAGS]	= { .type = NLA_U32 },
+	[NFTA_TABLE_DEV]	= { .type = NLA_STRING,
+				    .len = IFNAMSIZ - 1 },
 };
 
 static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net,
@@ -423,6 +425,10 @@ static int nf_tables_fill_table_info(struct sk_buff *skb, struct net *net,
 	    nla_put_be32(skb, NFTA_TABLE_USE, htonl(table->use)))
 		goto nla_put_failure;
 
+	if (table->dev &&
+	    nla_put_string(skb, NFTA_TABLE_DEV, table->dev->name))
+		goto nla_put_failure;
+
 	nlmsg_end(skb, nlh);
 	return 0;
 
@@ -608,6 +614,11 @@ static int nf_tables_updtable(struct nft_ctx *ctx)
 	if (flags == ctx->table->flags)
 		return 0;
 
+	if ((ctx->afi->flags & NFT_AF_DEV) &&
+	    ctx->nla[NFTA_TABLE_DEV] &&
+	    nla_strcmp(ctx->nla[NFTA_TABLE_DEV], ctx->table->dev->name))
+		return -EBUSY;
+
 	trans = nft_trans_alloc(ctx, NFT_MSG_NEWTABLE,
 				sizeof(struct nft_trans_table));
 	if (trans == NULL)
@@ -645,6 +656,7 @@ static int nf_tables_newtable(struct sock *nlsk, struct sk_buff *skb,
 	struct nft_table *table;
 	struct net *net = sock_net(skb->sk);
 	int family = nfmsg->nfgen_family;
+	struct net_device *dev = NULL;
 	u32 flags = 0;
 	struct nft_ctx ctx;
 	int err;
@@ -679,30 +691,51 @@ static int nf_tables_newtable(struct sock *nlsk, struct sk_buff *skb,
 			return -EINVAL;
 	}
 
+	if (afi->flags & NFT_AF_DEV) {
+		char ifname[IFNAMSIZ];
+
+		if (!nla[NFTA_TABLE_DEV])
+			return -EINVAL;
+
+		nla_strlcpy(ifname, nla[NFTA_TABLE_DEV], IFNAMSIZ);
+		dev = dev_get_by_name(net, ifname);
+		if (!dev)
+			return -ENOENT;
+	} else {
+		if (nla[NFTA_TABLE_DEV])
+			return -EINVAL;
+	}
+
+	err = -EAFNOSUPPORT;
 	if (!try_module_get(afi->owner))
-		return -EAFNOSUPPORT;
+		goto err1;
 
 	err = -ENOMEM;
 	table = kzalloc(sizeof(*table), GFP_KERNEL);
 	if (table == NULL)
-		goto err1;
+		goto err2;
 
 	nla_strlcpy(table->name, name, NFT_TABLE_MAXNAMELEN);
 	INIT_LIST_HEAD(&table->chains);
 	INIT_LIST_HEAD(&table->sets);
 	table->flags = flags;
+	table->dev   = dev;
 
 	nft_ctx_init(&ctx, skb, nlh, afi, table, NULL, nla);
 	err = nft_trans_table_add(&ctx, NFT_MSG_NEWTABLE);
 	if (err < 0)
-		goto err2;
+		goto err3;
 
 	list_add_tail_rcu(&table->list, &afi->tables);
 	return 0;
-err2:
+err3:
 	kfree(table);
-err1:
+err2:
 	module_put(afi->owner);
+err1:
+	if (dev != NULL)
+		dev_put(dev);
+
 	return err;
 }
 
@@ -806,6 +839,9 @@ static void nf_tables_table_destroy(struct nft_ctx *ctx)
 {
 	BUG_ON(ctx->table->use > 0);
 
+	if (ctx->table->dev)
+		dev_put(ctx->table->dev);
+
 	kfree(ctx->table);
 	module_put(ctx->afi->owner);
 }
@@ -1361,6 +1397,7 @@ static int nf_tables_newchain(struct sock *nlsk, struct sk_buff *skb,
 			ops->priority	= priority;
 			ops->priv	= chain;
 			ops->hook	= afi->hooks[ops->hooknum];
+			ops->dev	= table->dev;
 			if (hookfn)
 				ops->hook = hookfn;
 			if (afi->hook_ops_init)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/7] netfilter: nf_tables: add netdev table to filter from ingress
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
                   ` (5 preceding siblings ...)
  2015-04-10 12:15 ` [PATCH 6/7] netfilter: nf_tables: allow to bind table to net_device Pablo Neira Ayuso
@ 2015-04-10 12:15 ` Pablo Neira Ayuso
  2015-04-10 13:22 ` [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Thomas Graf
  7 siblings, 0 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 12:15 UTC (permalink / raw)
  To: netfilter-devel; +Cc: kaber, netdev, davem

This allows us to create netdev tables that contain ingress chains.
Use skb_header_pointer() as we may see shared sk_buffs at this stage.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netns/nftables.h     |    1 +
 net/netfilter/Kconfig            |    6 ++
 net/netfilter/Makefile           |    1 +
 net/netfilter/nf_tables_netdev.c |  182 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 190 insertions(+)
 create mode 100644 net/netfilter/nf_tables_netdev.c

diff --git a/include/net/netns/nftables.h b/include/net/netns/nftables.h
index eee608b..c807811 100644
--- a/include/net/netns/nftables.h
+++ b/include/net/netns/nftables.h
@@ -13,6 +13,7 @@ struct netns_nftables {
 	struct nft_af_info	*inet;
 	struct nft_af_info	*arp;
 	struct nft_af_info	*bridge;
+	struct nft_af_info	*netdev;
 	unsigned int		base_seq;
 	u8			gencursor;
 };
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index f70e34a..e5e8fdf 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -448,6 +448,12 @@ config NF_TABLES_INET
 	help
 	  This option enables support for a mixed IPv4/IPv6 "inet" table.
 
+config NF_TABLES_NETDEV
+	tristate "Netfilter nf_tables netdev tables support"
+	help
+	  This option enables support for the netdev table and the ingress
+          filter chain.
+
 config NFT_EXTHDR
 	tristate "Netfilter nf_tables IPv6 exthdr module"
 	help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index a87d8b8..70d026d 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -75,6 +75,7 @@ nf_tables-objs += nft_bitwise.o nft_byteorder.o nft_payload.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NF_TABLES_INET)	+= nf_tables_inet.o
+obj-$(CONFIG_NF_TABLES_NETDEV)	+= nf_tables_netdev.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
 obj-$(CONFIG_NFT_EXTHDR)	+= nft_exthdr.o
 obj-$(CONFIG_NFT_META)		+= nft_meta.o
diff --git a/net/netfilter/nf_tables_netdev.c b/net/netfilter/nf_tables_netdev.c
new file mode 100644
index 0000000..bf66250
--- /dev/null
+++ b/net/netfilter/nf_tables_netdev.c
@@ -0,0 +1,182 @@
+/*
+ * Copyright (c) 2015 Pablo Neira Ayuso <pablo@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <net/netfilter/nf_tables.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <net/netfilter/nf_tables_ipv4.h>
+#include <net/netfilter/nf_tables_ipv6.h>
+
+static inline void
+nft_netdev_set_pktinfo_ipv4(struct nft_pktinfo *pkt,
+			    const struct nf_hook_ops *ops, struct sk_buff *skb,
+			    const struct nf_hook_state *state)
+{
+	struct iphdr *iph, _iph;
+	u32 len, thoff;
+
+	nft_set_pktinfo(pkt, ops, skb, state);
+
+	iph = skb_header_pointer(skb, skb_network_offset(skb), sizeof(*iph),
+				 &_iph);
+	if (!iph)
+		return;
+
+	iph = ip_hdr(skb);
+	if (iph->ihl < 5 || iph->version != 4)
+		return;
+
+	len = ntohs(iph->tot_len);
+	thoff = iph->ihl * 4;
+	if (skb->len < len)
+		return;
+	else if (len < thoff)
+		return;
+
+	pkt->tprot = iph->protocol;
+	pkt->xt.thoff = thoff;
+	pkt->xt.fragoff = ntohs(iph->frag_off) & IP_OFFSET;
+}
+
+static inline void
+__nft_netdev_set_pktinfo_ipv6(struct nft_pktinfo *pkt,
+			      const struct nf_hook_ops *ops,
+			      struct sk_buff *skb,
+			      const struct nf_hook_state *state)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	struct ipv6hdr *ip6h, _ip6h;
+	int protohdr, thoff = 0;
+	unsigned short frag_off;
+	u32 pkt_len;
+
+	ip6h = skb_header_pointer(skb, skb_network_offset(skb), sizeof(*ip6h),
+				  &_ip6h);
+	if (!ip6h)
+		return;
+
+	if (ip6h->version != 6)
+		return;
+
+	pkt_len = ntohs(ip6h->payload_len);
+	if (pkt_len + sizeof(*ip6h) > skb->len)
+		return;
+
+	protohdr = ipv6_find_hdr(pkt->skb, &thoff, -1, &frag_off, NULL);
+	if (protohdr < 0)
+                return;
+
+	pkt->tprot = protohdr;
+	pkt->xt.thoff = thoff;
+	pkt->xt.fragoff = frag_off;
+#endif
+}
+
+static inline void nft_netdev_set_pktinfo_ipv6(struct nft_pktinfo *pkt,
+					       const struct nf_hook_ops *ops,
+					       struct sk_buff *skb,
+					       const struct nf_hook_state *state)
+{
+	nft_set_pktinfo(pkt, ops, skb, state);
+	__nft_netdev_set_pktinfo_ipv6(pkt, ops, skb, state);
+}
+
+static unsigned int
+nft_do_chain_netdev(const struct nf_hook_ops *ops, struct sk_buff *skb,
+		    const struct nf_hook_state *state)
+{
+	struct nft_pktinfo pkt;
+
+	switch (eth_hdr(skb)->h_proto) {
+	case htons(ETH_P_IP):
+		nft_netdev_set_pktinfo_ipv4(&pkt, ops, skb, state);
+		break;
+	case htons(ETH_P_IPV6):
+		nft_netdev_set_pktinfo_ipv6(&pkt, ops, skb, state);
+		break;
+	default:
+		nft_set_pktinfo(&pkt, ops, skb, state);
+		break;
+	}
+
+	return nft_do_chain(&pkt, ops);
+}
+
+static struct nft_af_info nft_af_netdev __read_mostly = {
+	.family		= NFPROTO_NETDEV,
+	.nhooks		= NF_NETDEV_NUMHOOKS,
+	.owner		= THIS_MODULE,
+	.flags		= NFT_AF_DEV,
+	.nops		= 1,
+	.hooks		= {
+		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
+	},
+};
+
+static int nf_tables_netdev_init_net(struct net *net)
+{
+	net->nft.netdev = kmalloc(sizeof(struct nft_af_info), GFP_KERNEL);
+	if (net->nft.netdev == NULL)
+		return -ENOMEM;
+
+	memcpy(net->nft.netdev, &nft_af_netdev, sizeof(nft_af_netdev));
+
+	if (nft_register_afinfo(net, net->nft.netdev) < 0)
+		goto err;
+
+	return 0;
+err:
+	kfree(net->nft.netdev);
+	return -ENOMEM;
+}
+
+static void nf_tables_netdev_exit_net(struct net *net)
+{
+	nft_unregister_afinfo(net->nft.netdev);
+	kfree(net->nft.netdev);
+}
+
+static struct pernet_operations nf_tables_netdev_net_ops = {
+	.init	= nf_tables_netdev_init_net,
+	.exit	= nf_tables_netdev_exit_net,
+};
+
+static const struct nf_chain_type nft_filter_chain_netdev = {
+	.name		= "filter",
+	.type		= NFT_CHAIN_T_DEFAULT,
+	.family		= NFPROTO_NETDEV,
+	.owner		= THIS_MODULE,
+	.hook_mask	= (1 << NF_NETDEV_INGRESS),
+};
+
+static int __init nf_tables_netdev_init(void)
+{
+	int ret;
+
+	nft_register_chain_type(&nft_filter_chain_netdev);
+	ret = register_pernet_subsys(&nf_tables_netdev_net_ops);
+	if (ret < 0)
+		nft_unregister_chain_type(&nft_filter_chain_netdev);
+
+	return ret;
+}
+
+static void __exit nf_tables_netdev_exit(void)
+{
+	unregister_pernet_subsys(&nf_tables_netdev_net_ops);
+	nft_unregister_chain_type(&nft_filter_chain_netdev);
+}
+
+module_init(nf_tables_netdev_init);
+module_exit(nf_tables_netdev_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pablo Neira Ayuso <pablo@netfilter.org>");
+MODULE_ALIAS_NFT_FAMILY(5);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 12:15 ` [PATCH 5/7] net: add netfilter ingress hook Pablo Neira Ayuso
@ 2015-04-10 13:21   ` Thomas Graf
  2015-04-10 13:36     ` Patrick McHardy
  2015-04-10 20:08     ` Pablo Neira Ayuso
  0 siblings, 2 replies; 39+ messages in thread
From: Thomas Graf @ 2015-04-10 13:21 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kaber, netdev, davem

On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
>  static int __netif_receive_skb_ingress(struct sk_buff *skb, bool pfmemalloc,
>  				       struct net_device *orig_dev)
>  {
> @@ -3772,6 +3800,8 @@ skip_taps:
>  	if (!skb)
>  		return NET_RX_DROP;
>  #endif
> +	if (nf_hook_ingress_active(skb))
> +		return nf_hook_ingress(skb, pt_prev, orig_dev, pfmemalloc);
>  
>  	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
>  }

I would favour if we avoid for every subsystem to manage its ingress
filter pointers in net_device. From a net_device perspective, all it
takes is a single pointer which points to a single linked list of
filters which need to be run through. These entries could represent
an ingress qdisc or a netfilter chain or something else (L2 ingress
qdisc?).

I know it's only 24 bytes but I'm trying hard to keep net_device below
2K.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
                   ` (6 preceding siblings ...)
  2015-04-10 12:15 ` [PATCH 7/7] netfilter: nf_tables: add netdev table to filter from ingress Pablo Neira Ayuso
@ 2015-04-10 13:22 ` Thomas Graf
  2015-04-10 20:09   ` Pablo Neira Ayuso
  7 siblings, 1 reply; 39+ messages in thread
From: Thomas Graf @ 2015-04-10 13:22 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel, kaber, netdev, davem

On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
> This patchset adds the Netfilter hook at the ingress path, in a per-device
> fashion. This also comes with the new nf_tables 'netdev' family support to
> provide access to users to the existing nf_tables features. This includes the
> transactional netlink API and the enhanced set infrastructure.  Several patches
> come in first place to prepare this support, including the refactoring of
> __netif_receive_skb_core() to accomodate the new hook.

Is the goal to eventually replace ebtables with this?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/7] netfilter: cleanup struct nf_hook_ops struct indentation
  2015-04-10 12:15 ` [PATCH 4/7] netfilter: cleanup struct nf_hook_ops struct indentation Pablo Neira Ayuso
@ 2015-04-10 13:27   ` Sergei Shtylyov
  0 siblings, 0 replies; 39+ messages in thread
From: Sergei Shtylyov @ 2015-04-10 13:27 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel; +Cc: kaber, netdev, davem

Hello.

On 4/10/2015 3:15 PM, Pablo Neira Ayuso wrote:

> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

    The subject is somewhat tautological. :-)

WBR, Sergei


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 13:21   ` Thomas Graf
@ 2015-04-10 13:36     ` Patrick McHardy
  2015-04-10 20:17       ` Pablo Neira Ayuso
  2015-04-10 20:08     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 39+ messages in thread
From: Patrick McHardy @ 2015-04-10 13:36 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Pablo Neira Ayuso, netfilter-devel, netdev, davem

On 10.04, Thomas Graf wrote:
> On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
> >  static int __netif_receive_skb_ingress(struct sk_buff *skb, bool pfmemalloc,
> >  				       struct net_device *orig_dev)
> >  {
> > @@ -3772,6 +3800,8 @@ skip_taps:
> >  	if (!skb)
> >  		return NET_RX_DROP;
> >  #endif
> > +	if (nf_hook_ingress_active(skb))
> > +		return nf_hook_ingress(skb, pt_prev, orig_dev, pfmemalloc);
> >  
> >  	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
> >  }
> 
> I would favour if we avoid for every subsystem to manage its ingress
> filter pointers in net_device. From a net_device perspective, all it
> takes is a single pointer which points to a single linked list of
> filters which need to be run through. These entries could represent
> an ingress qdisc or a netfilter chain or something else (L2 ingress
> qdisc?).

I'm wondering if the hook is the right abstraction at all. Netfilter hooks
require async resumption (okfn) support, which is why all the refactoring is
needed. Is that something that we need for NF_PROTO_NETDEV? For ingress
userspace queueing *might* actually work if the missing pieces are added,
but for offloaded rules it obviously can not work.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-10 12:15 ` [PATCH 1/7] net: refactor __netif_receive_skb_core Pablo Neira Ayuso
@ 2015-04-10 13:47   ` Daniel Borkmann
  2015-04-15 16:09     ` Jesper Dangaard Brouer
  2015-04-10 19:56   ` Alexander Duyck
  1 sibling, 1 reply; 39+ messages in thread
From: Daniel Borkmann @ 2015-04-10 13:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel; +Cc: kaber, netdev, davem

On 04/10/2015 02:15 PM, Pablo Neira Ayuso wrote:
> This patch splits __netif_receive_skb_core() in smaller functions to improve
> maintainability.
>
> The function __netif_receive_skb_core() has been split in two:
>
> * __netif_receive_skb_ingress(), to perform all actions up to
>    ingress filtering.
>
> * __netif_receive_skb_finish(), if the ingress filter accepts this
>    packet, pass it to the corresponding packet_type function handler for further
> processing.
>
> This patch also adds __NET_RX_ANOTHER_ROUND that is used when the packet is
> stripped off from the vlan header or in case the rx_handler needs it.
>
> This also prepares the introduction of the netfilter ingress hook.

Out of curiosity, what is actually the performance impact on all
of this? We were just arguing on a different matter on two more
instructions in the fast-path, here it's refactoring the whole
function into several ones, I presume gcc won't inline it.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-10 12:15 ` [PATCH 1/7] net: refactor __netif_receive_skb_core Pablo Neira Ayuso
  2015-04-10 13:47   ` Daniel Borkmann
@ 2015-04-10 19:56   ` Alexander Duyck
  2015-04-15 12:44     ` David Laight
  1 sibling, 1 reply; 39+ messages in thread
From: Alexander Duyck @ 2015-04-10 19:56 UTC (permalink / raw)
  To: Pablo Neira Ayuso, netfilter-devel; +Cc: kaber, netdev, davem

On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote:
> +another_round:
> +	ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev);
> +	switch (ret) {
> +	case NET_RX_SUCCESS:
> +	case NET_RX_DROP:
> +		break;
> +	case __NET_RX_ANOTHER_ROUND:
> +		goto another_round;
> +	}
>  	rcu_read_unlock();
> +
>  	return ret;
>  }
>  
> 

Couldn't this just be done as a do while?  It would probably be easier
to read and there wouldn't be any need for the another_round label anymore.

- Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 13:21   ` Thomas Graf
  2015-04-10 13:36     ` Patrick McHardy
@ 2015-04-10 20:08     ` Pablo Neira Ayuso
  1 sibling, 0 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 20:08 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netfilter-devel, kaber, netdev, davem

On Fri, Apr 10, 2015 at 02:21:20PM +0100, Thomas Graf wrote:
> On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
> >  static int __netif_receive_skb_ingress(struct sk_buff *skb, bool pfmemalloc,
> >  				       struct net_device *orig_dev)
> >  {
> > @@ -3772,6 +3800,8 @@ skip_taps:
> >  	if (!skb)
> >  		return NET_RX_DROP;
> >  #endif
> > +	if (nf_hook_ingress_active(skb))
> > +		return nf_hook_ingress(skb, pt_prev, orig_dev, pfmemalloc);
> >  
> >  	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
> >  }
> 
> I would favour if we avoid for every subsystem to manage its ingress
> filter pointers in net_device. From a net_device perspective, all it
> takes is a single pointer which points to a single linked list of
> filters which need to be run through. These entries could represent
> an ingress qdisc or a netfilter chain or something else (L2 ingress
> qdisc?).
> 
> I know it's only 24 bytes but I'm trying hard to keep net_device below
> 2K.

Then it would be probably good to investigate if we can come up with
some extension infrastructure for net_device (I think Patrick already
suggested this during netdev0.1), so things are allocated based on
available features.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-10 13:22 ` [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Thomas Graf
@ 2015-04-10 20:09   ` Pablo Neira Ayuso
  2015-04-13  1:14     ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 20:09 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netfilter-devel, kaber, netdev, davem

On Fri, Apr 10, 2015 at 02:22:05PM +0100, Thomas Graf wrote:
> On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
> > This patchset adds the Netfilter hook at the ingress path, in a per-device
> > fashion. This also comes with the new nf_tables 'netdev' family support to
> > provide access to users to the existing nf_tables features. This includes the
> > transactional netlink API and the enhanced set infrastructure.  Several patches
> > come in first place to prepare this support, including the refactoring of
> > __netif_receive_skb_core() to accomodate the new hook.
> 
> Is the goal to eventually replace ebtables with this?

No, nftables comes with a bridge family that will eventually replace
ebtables.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 13:36     ` Patrick McHardy
@ 2015-04-10 20:17       ` Pablo Neira Ayuso
  2015-04-10 21:33         ` Patrick McHardy
  0 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-10 20:17 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, netfilter-devel, netdev, davem

[-- Attachment #1: Type: text/plain, Size: 1662 bytes --]

On Fri, Apr 10, 2015 at 02:36:11PM +0100, Patrick McHardy wrote:
> On 10.04, Thomas Graf wrote:
> > On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
> > >  static int __netif_receive_skb_ingress(struct sk_buff *skb, bool pfmemalloc,
> > >  				       struct net_device *orig_dev)
> > >  {
> > > @@ -3772,6 +3800,8 @@ skip_taps:
> > >  	if (!skb)
> > >  		return NET_RX_DROP;
> > >  #endif
> > > +	if (nf_hook_ingress_active(skb))
> > > +		return nf_hook_ingress(skb, pt_prev, orig_dev, pfmemalloc);
> > >  
> > >  	return __netif_receive_skb_finish(skb, pfmemalloc, pt_prev, orig_dev);
> > >  }
> > 
> > I would favour if we avoid for every subsystem to manage its ingress
> > filter pointers in net_device. From a net_device perspective, all it
> > takes is a single pointer which points to a single linked list of
> > filters which need to be run through. These entries could represent
> > an ingress qdisc or a netfilter chain or something else (L2 ingress
> > qdisc?).
> 
> I'm wondering if the hook is the right abstraction at all. Netfilter hooks
> require async resumption (okfn) support, which is why all the refactoring is
> needed. Is that something that we need for NF_PROTO_NETDEV? For ingress
> userspace queueing *might* actually work if the missing pieces are added,
> but for offloaded rules it obviously can not work.

For userspace queueing from ingress we still have to call
skb_share_check() and hold a reference to orig_dev from the escape
path. But this support is still missing in nf_tables (actually, we
only support NFPROTO_IPV4 and NFPROTO_IPV6 at this moment, see patch
attached). Regarding offload, this path will not see any packet.

[-- Attachment #2: 0001-netfilter-nf_tables-restrict-nft_queue-to-AF_INET-an.patch --]
[-- Type: text/x-diff, Size: 973 bytes --]

>From db2fba74dea98b69ee7615fca86b9847bc42887f Mon Sep 17 00:00:00 2001
From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Fri, 10 Apr 2015 21:40:58 +0200
Subject: [PATCH] netfilter: nf_tables: restrict nft_queue to AF_INET and
 AF_INET6

Other families need the corresponding struct nf_afinfo in place to work.
Restrict it to NFPROTO_IPV4 and NFPROTO_IPV6 until the necessary code is in
place.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nft_queue.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nft_queue.c b/net/netfilter/nft_queue.c
index e8ae2f6..42ca976 100644
--- a/net/netfilter/nft_queue.c
+++ b/net/netfilter/nft_queue.c
@@ -129,4 +129,5 @@ module_exit(nft_queue_module_exit);
 
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("Eric Leblond <eric@regit.org>");
-MODULE_ALIAS_NFT_EXPR("queue");
+MODULE_ALIAS_NFT_AF_EXPR(AF_INET, "queue");
+MODULE_ALIAS_NFT_AF_EXPR(AF_INET6, "queue");
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 20:17       ` Pablo Neira Ayuso
@ 2015-04-10 21:33         ` Patrick McHardy
  2015-04-11 12:55           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick McHardy @ 2015-04-10 21:33 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Thomas Graf, netfilter-devel, netdev, davem

On 10.04, Pablo Neira Ayuso wrote:
> On Fri, Apr 10, 2015 at 02:36:11PM +0100, Patrick McHardy wrote:
> > 
> > I'm wondering if the hook is the right abstraction at all. Netfilter hooks
> > require async resumption (okfn) support, which is why all the refactoring is
> > needed. Is that something that we need for NF_PROTO_NETDEV? For ingress
> > userspace queueing *might* actually work if the missing pieces are added,
> > but for offloaded rules it obviously can not work.
> 
> For userspace queueing from ingress we still have to call
> skb_share_check() and hold a reference to orig_dev from the escape
> path. But this support is still missing in nf_tables (actually, we
> only support NFPROTO_IPV4 and NFPROTO_IPV6 at this moment, see patch
> attached). Regarding offload, this path will not see any packet.

We do support all families using the regular NF_QUEUE verdict of course.
But yes, nf_queue.c will simply drop packets that don't have a netfilter
AF registered.

But my question is whether queueing is something that is even worth
considering for the NFPROTO_NETDEV family. As I said, it will at best
work for ingress anyways and that will actually be more tricky than just
calling skb_share_check(), we need to take care of keeping valid
references to all the data you currently store in the CB, including the
packet_type, the device, things attached to the skb at this point to
the stack etc.

If we decide not to support queueing for this family we don't have to
use netfilter hooks for this and all the refactoring for async resume
becomes unnecessary. 

> >From db2fba74dea98b69ee7615fca86b9847bc42887f Mon Sep 17 00:00:00 2001
> From: Pablo Neira Ayuso <pablo@netfilter.org>
> Date: Fri, 10 Apr 2015 21:40:58 +0200
> Subject: [PATCH] netfilter: nf_tables: restrict nft_queue to AF_INET and
>  AF_INET6
> 
> Other families need the corresponding struct nf_afinfo in place to work.
> Restrict it to NFPROTO_IPV4 and NFPROTO_IPV6 until the necessary code is in
> place.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-10 21:33         ` Patrick McHardy
@ 2015-04-11 12:55           ` Pablo Neira Ayuso
  2015-04-11 13:06             ` Patrick McHardy
  0 siblings, 1 reply; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-11 12:55 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, netfilter-devel, netdev, davem

On Fri, Apr 10, 2015 at 10:33:12PM +0100, Patrick McHardy wrote:
> On 10.04, Pablo Neira Ayuso wrote:
> > On Fri, Apr 10, 2015 at 02:36:11PM +0100, Patrick McHardy wrote:
> > > 
> > > I'm wondering if the hook is the right abstraction at all. Netfilter hooks
> > > require async resumption (okfn) support, which is why all the refactoring is
> > > needed. Is that something that we need for NF_PROTO_NETDEV? For ingress
> > > userspace queueing *might* actually work if the missing pieces are added,
> > > but for offloaded rules it obviously can not work.
> > 
> > For userspace queueing from ingress we still have to call
> > skb_share_check() and hold a reference to orig_dev from the escape
> > path. But this support is still missing in nf_tables (actually, we
> > only support NFPROTO_IPV4 and NFPROTO_IPV6 at this moment, see patch
> > attached). Regarding offload, this path will not see any packet.
> 
> We do support all families using the regular NF_QUEUE verdict of course.
> But yes, nf_queue.c will simply drop packets that don't have a netfilter
> AF registered.
> 
> But my question is whether queueing is something that is even worth
> considering for the NFPROTO_NETDEV family. As I said, it will at best
> work for ingress anyways and that will actually be more tricky than just
> calling skb_share_check(), we need to take care of keeping valid
> references to all the data you currently store in the CB, including the
> packet_type, the device, things attached to the skb at this point to
> the stack etc.

I think we only need to hold the reference on orig_dev. The pt_prev
pointer in skb CB can actually be removed. Other things attached to
the skb we already handle this from nf_queue to make sure they don't
vanish.

> If we decide not to support queueing for this family we don't have to
> use netfilter hooks for this and all the refactoring for async resume
> becomes unnecessary.

I think the refactoring is worth. Have a look at the current state of
this function. It has grown with features along time and it got many
gotos that force you travel back and forth when reading this code.

Regarding the nf_queue support at ingress, I don't see any major
technical obstacule at this moment to support this and I think that
existing programs that inspect traffic from userspace can benefit from
this feature (eg. IPS).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-11 12:55           ` Pablo Neira Ayuso
@ 2015-04-11 13:06             ` Patrick McHardy
  2015-04-11 13:32               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick McHardy @ 2015-04-11 13:06 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Thomas Graf, netfilter-devel, netdev, davem

On 11.04, Pablo Neira Ayuso wrote:
> On Fri, Apr 10, 2015 at 10:33:12PM +0100, Patrick McHardy wrote:
> > On 10.04, Pablo Neira Ayuso wrote:
> > > On Fri, Apr 10, 2015 at 02:36:11PM +0100, Patrick McHardy wrote:
> > We do support all families using the regular NF_QUEUE verdict of course.
> > But yes, nf_queue.c will simply drop packets that don't have a netfilter
> > AF registered.
> > 
> > But my question is whether queueing is something that is even worth
> > considering for the NFPROTO_NETDEV family. As I said, it will at best
> > work for ingress anyways and that will actually be more tricky than just
> > calling skb_share_check(), we need to take care of keeping valid
> > references to all the data you currently store in the CB, including the
> > packet_type, the device, things attached to the skb at this point to
> > the stack etc.
> 
> I think we only need to hold the reference on orig_dev. The pt_prev
> pointer in skb CB can actually be removed. Other things attached to
> the skb we already handle this from nf_queue to make sure they don't
> vanish.

Are you sure? What about removable protocols or packet sockets?

> > If we decide not to support queueing for this family we don't have to
> > use netfilter hooks for this and all the refactoring for async resume
> > becomes unnecessary.
> 
> I think the refactoring is worth. Have a look at the current state of
> this function. It has grown with features along time and it got many
> gotos that force you travel back and forth when reading this code.
> 
> Regarding the nf_queue support at ingress, I don't see any major
> technical obstacule at this moment to support this and I think that
> existing programs that inspect traffic from userspace can benefit from
> this feature (eg. IPS).

Yeah, that might be useful, although they seem to be pretty fine with
getting only IPv4 and IPv6. I guess ARP might be interesting as well,
but we also have hooks for that already.

Regarding the refactoring, there seem to be concerns about performance
impact. My suggestions would be to use nf_hook(), make sure no queueing
can happen and therefore no okfn invocations and then you can simply
add this as a function call to the existing code without the need for
any refactoring or storing state.

You don't loose anything, it only massively simplifies the patches. If
queuing supported is added, you can still change it.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/7] net: add netfilter ingress hook
  2015-04-11 13:06             ` Patrick McHardy
@ 2015-04-11 13:32               ` Pablo Neira Ayuso
  0 siblings, 0 replies; 39+ messages in thread
From: Pablo Neira Ayuso @ 2015-04-11 13:32 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Thomas Graf, netfilter-devel, netdev, davem

On Sat, Apr 11, 2015 at 02:06:48PM +0100, Patrick McHardy wrote:
> On 11.04, Pablo Neira Ayuso wrote:
> > On Fri, Apr 10, 2015 at 10:33:12PM +0100, Patrick McHardy wrote:
> > > On 10.04, Pablo Neira Ayuso wrote:
> > > > On Fri, Apr 10, 2015 at 02:36:11PM +0100, Patrick McHardy wrote:
> > > We do support all families using the regular NF_QUEUE verdict of course.
> > > But yes, nf_queue.c will simply drop packets that don't have a netfilter
> > > AF registered.
> > > 
> > > But my question is whether queueing is something that is even worth
> > > considering for the NFPROTO_NETDEV family. As I said, it will at best
> > > work for ingress anyways and that will actually be more tricky than just
> > > calling skb_share_check(), we need to take care of keeping valid
> > > references to all the data you currently store in the CB, including the
> > > packet_type, the device, things attached to the skb at this point to
> > > the stack etc.
> > 
> > I think we only need to hold the reference on orig_dev. The pt_prev
> > pointer in skb CB can actually be removed. Other things attached to
> > the skb we already handle this from nf_queue to make sure they don't
> > vanish.
> 
> Are you sure? What about removable protocols or packet sockets?

pt_prev will be always NULL if we enter the netfilter ingress hook, so
no need to store it.

> > > If we decide not to support queueing for this family we don't have to
> > > use netfilter hooks for this and all the refactoring for async resume
> > > becomes unnecessary.
> > 
> > I think the refactoring is worth. Have a look at the current state of
> > this function. It has grown with features along time and it got many
> > gotos that force you travel back and forth when reading this code.
> > 
> > Regarding the nf_queue support at ingress, I don't see any major
> > technical obstacule at this moment to support this and I think that
> > existing programs that inspect traffic from userspace can benefit from
> > this feature (eg. IPS).
> 
> Yeah, that might be useful, although they seem to be pretty fine with
> getting only IPv4 and IPv6. I guess ARP might be interesting as well,
> but we also have hooks for that already.

For security applications, I guess they will be happy to get pretty
much everything that they can inspect.

> Regarding the refactoring, there seem to be concerns about performance
> impact. My suggestions would be to use nf_hook(), make sure no queueing
> can happen and therefore no okfn invocations and then you can simply
> add this as a function call to the existing code without the need for
> any refactoring or storing state.

I'll come back with numbers and more feedback anyway.

> You don't loose anything, it only massively simplifies the patches. If
> queuing supported is added, you can still change it.

I'll explore this, this seems like a good alternative if performance
becomes a real issue.

Thanks.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-10 20:09   ` Pablo Neira Ayuso
@ 2015-04-13  1:14     ` David Miller
  2015-04-13 20:19       ` Patrick McHardy
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2015-04-13  1:14 UTC (permalink / raw)
  To: pablo; +Cc: tgraf, netfilter-devel, kaber, netdev

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Fri, 10 Apr 2015 22:09:01 +0200

> On Fri, Apr 10, 2015 at 02:22:05PM +0100, Thomas Graf wrote:
>> On 04/10/15 at 02:15pm, Pablo Neira Ayuso wrote:
>> > This patchset adds the Netfilter hook at the ingress path, in a per-device
>> > fashion. This also comes with the new nf_tables 'netdev' family support to
>> > provide access to users to the existing nf_tables features. This includes the
>> > transactional netlink API and the enhanced set infrastructure.  Several patches
>> > come in first place to prepare this support, including the refactoring of
>> > __netif_receive_skb_core() to accomodate the new hook.
>> 
>> Is the goal to eventually replace ebtables with this?
> 
> No, nftables comes with a bridge family that will eventually replace
> ebtables.

It's always been the case that if you want to do L2 or lower level
stuff, you use the ingress classifier, packet actions, and qdiscs.

If you want to do higher level things, NF hooks provide that facility.

The only example I've seen is packet counting, and one could do that
just as easily with the ingress qdisc.

So given what I've been shown so far I'm very far from convinced that
this new hook in an already over polluted, most critical of all
critical code paths, is justified at all.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-13  1:14     ` David Miller
@ 2015-04-13 20:19       ` Patrick McHardy
  2015-04-14  9:00         ` Thomas Graf
  2015-04-14 12:27         ` Jamal Hadi Salim
  0 siblings, 2 replies; 39+ messages in thread
From: Patrick McHardy @ 2015-04-13 20:19 UTC (permalink / raw)
  To: David Miller; +Cc: pablo, tgraf, netfilter-devel, netdev

On 12.04, David Miller wrote:
> From: Pablo Neira Ayuso <pablo@netfilter.org>
> Date: Fri, 10 Apr 2015 22:09:01 +0200
> 
> >> > This patchset adds the Netfilter hook at the ingress path, in a per-device
> >> > fashion. This also comes with the new nf_tables 'netdev' family support to
> >> > provide access to users to the existing nf_tables features. This includes the
> >> > transactional netlink API and the enhanced set infrastructure.  Several patches
> >> > come in first place to prepare this support, including the refactoring of
> >> > __netif_receive_skb_core() to accomodate the new hook.
> 
> It's always been the case that if you want to do L2 or lower level
> stuff, you use the ingress classifier, packet actions, and qdiscs.
> 
> If you want to do higher level things, NF hooks provide that facility.
> 
> The only example I've seen is packet counting, and one could do that
> just as easily with the ingress qdisc.
> 
> So given what I've been shown so far I'm very far from convinced that
> this new hook in an already over polluted, most critical of all
> critical code paths, is justified at all.

I'm going to jump in here since I think a good case for this can be
made. It's going to be a bit longer, sorry. You can skip to the
examples after the first three paragraphs since it's just a lot of
TC bashing :)

First, lets start with the ingress qdisc hook, since it is somewhat
related. The ingress qdisc is basically a hack to get filters to run
at ingress. There is no queue, its a workaround for the fact that
we don't have any other way to filter at ingress, and in my opinion
not a nice one.

Its is today implemented as an enqueue call in netif_receive_skb(),
but it used to register with IPv4/IPv6 netfilter to receive packets.
If it is actually used, there is full serialization since a global
lock is used. If it is not used, the cost is pretty minimal except
for the quite large amount of code.

Now, if we had a netfilter hook at ingress *without* the okfn, the
cost when disabled would simply be a static key, when enabled a
hook invocation, pretty much comparable to ingress. Given that
ingress used to be implemented on top of netfilter, there's no
reason why we couldn't do that again. I'm pretty much convinced
that we could easily make the impact smaller than it is now.

Now, given the features. Ingress filtering is mainly used for
TC actions, which are in my opinion the most horible way imaginable
to so something like that. tc is a horrible mess and tc actions
are the worst part of it. The fact that people over time have still
added some TC actions is in my opinion telling that a better
way to do this stuff would be appreciated, its hard to imagine this
being the first choice of anyone.

Looking at the TC actions, some are directly related to using
netfilter (ipt, connmark). ipt is actually dangerous in that
it can easily crash your kernel, from a quick look connmark
looks equally unsafe in that it doesn't perform even basic
header validation that conntrack relies on. We can obviously
take care of the netfilter related actions easily. skbedit is
like a very primitive meta expression from nftables, stateless
NAT and pedit are something that can easily be supported using
nftables and we need that (generic) support anyways. The VLAN
action and policing is something I'm unsure about, but all
the remaining ones we don't already support basically come for
free since we need that functionality (packet editing) anyway.

Now the advantages of being able to use nft. First, the obvious
one is that we have a nice userspace tool, a well defined
grammar, and that people would be able to use the same tool for
very similar tasks. nftables in the kernel is almost completely
lockless, we support way more possibilites already and we won't
have to add new special case TC actions anymore. Look at the
connmark action for example. It can set a value. How long until
someone wants to use a bitmask? We support all operations
(assignment, bit operations) for all types, we have sets for fast
lookups, maps for associating values quickly, we have a nice and
readable syntax and full translation back to the readable
representation and much more.

Now you asked for some examples :)

As I mentioned, we can set meta data in various ways, we will
be able to do stateless NAT and pedit, of course very flexible
classification etc etc, but some of the cool stuff which would
be useful especially in ingress:

Are you wondering why CPUs are used unbalanced and want to
see what kind of traffic is causing it? With the dynamic
expression instantiation queued in nf-next, you could do:

nft add ingress filter \
	flow table debug cpu . ether saddr counter

To dynamically create a table of counters called "debug" for
all CPU + MAC source combinations. Listing it will show you

	0 . 20:1a:06:e5:cb:be ... bytes ... packets
	0 . 9c:d2:1e:74:eb:75 ... bytes ... packets
	1 . 52:54:00:79:83:a2 ... bytes ... packets
	...

I'm currently working on various ways to sort it in every
dimension so you can easily find the information you're looking
for.

If it doesn't seem to be unevenly distributed based on MACs,
you could try IP source and IP destination address instead:

nft add ingress filter \
	flow table debug cpu . ip saddr . ip daddr counter

Which will give you a table of counters for every CPU + saddr + daddr
combination. You can combine any kind of key we support, which
are a lot. This could be a nice help for debugging performance
problems.

So, in short, I think this makes a lot of sense to have, but
in order to avoid performance impact, ingress should be moved
back to netfilter (since its the more generic approach) and
the async resumption (okfn) should be avoided completely.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-13 20:19       ` Patrick McHardy
@ 2015-04-14  9:00         ` Thomas Graf
  2015-04-14  9:06           ` Patrick McHardy
  2015-04-14 12:27         ` Jamal Hadi Salim
  1 sibling, 1 reply; 39+ messages in thread
From: Thomas Graf @ 2015-04-14  9:00 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, pablo, netfilter-devel, netdev

On 04/13/15 at 09:19pm, Patrick McHardy wrote:
> Now the advantages of being able to use nft. First, the obvious
> one is that we have a nice userspace tool, a well defined
> grammar, and that people would be able to use the same tool for
> very similar tasks. nftables in the kernel is almost completely
> lockless, we support way more possibilites already and we won't
> have to add new special case TC actions anymore. Look at the
> connmark action for example. It can set a value. How long until
> someone wants to use a bitmask? We support all operations
> (assignment, bit operations) for all types, we have sets for fast
> lookups, maps for associating values quickly, we have a nice and
> readable syntax and full translation back to the readable
> representation and much more.

*cough* Performance numbers? *cough* ;-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14  9:00         ` Thomas Graf
@ 2015-04-14  9:06           ` Patrick McHardy
  2015-04-14 10:08             ` Thomas Graf
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick McHardy @ 2015-04-14  9:06 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, pablo, netfilter-devel, netdev

On 14.04, Thomas Graf wrote:
> On 04/13/15 at 09:19pm, Patrick McHardy wrote:
> > Now the advantages of being able to use nft. First, the obvious
> > one is that we have a nice userspace tool, a well defined
> > grammar, and that people would be able to use the same tool for
> > very similar tasks. nftables in the kernel is almost completely
> > lockless, we support way more possibilites already and we won't
> > have to add new special case TC actions anymore. Look at the
> > connmark action for example. It can set a value. How long until
> > someone wants to use a bitmask? We support all operations
> > (assignment, bit operations) for all types, we have sets for fast
> > lookups, maps for associating values quickly, we have a nice and
> > readable syntax and full translation back to the readable
> > representation and much more.
> 
> *cough* Performance numbers? *cough* ;-)

I'm just arguing, not implementing :)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14  9:06           ` Patrick McHardy
@ 2015-04-14 10:08             ` Thomas Graf
  2015-04-14 10:13               ` Patrick McHardy
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Graf @ 2015-04-14 10:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, pablo, netfilter-devel, netdev

On 04/14/15 at 10:06am, Patrick McHardy wrote:
> On 14.04, Thomas Graf wrote:
> > On 04/13/15 at 09:19pm, Patrick McHardy wrote:
> > > Now the advantages of being able to use nft. First, the obvious
> > > one is that we have a nice userspace tool, a well defined
> > > grammar, and that people would be able to use the same tool for
> > > very similar tasks. nftables in the kernel is almost completely
> > > lockless, we support way more possibilites already and we won't
> > > have to add new special case TC actions anymore. Look at the
> > > connmark action for example. It can set a value. How long until
> > > someone wants to use a bitmask? We support all operations
> > > (assignment, bit operations) for all types, we have sets for fast
> > > lookups, maps for associating values quickly, we have a nice and
> > > readable syntax and full translation back to the readable
> > > representation and much more.
> > 
> > *cough* Performance numbers? *cough* ;-)
> 
> I'm just arguing, not implementing :)

OK ;-) Seriously though, we need to start putting emphasis on
numbers as well. We are supposed to run data centers with all of
this, we can't just horse around for fun ;-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14 10:08             ` Thomas Graf
@ 2015-04-14 10:13               ` Patrick McHardy
  2015-04-14 10:32                 ` Thomas Graf
  0 siblings, 1 reply; 39+ messages in thread
From: Patrick McHardy @ 2015-04-14 10:13 UTC (permalink / raw)
  To: Thomas Graf; +Cc: David Miller, pablo, netfilter-devel, netdev

On 14.04, Thomas Graf wrote:
> On 04/14/15 at 10:06am, Patrick McHardy wrote:
> > On 14.04, Thomas Graf wrote:
> > > On 04/13/15 at 09:19pm, Patrick McHardy wrote:
> > > > Now the advantages of being able to use nft. First, the obvious
> > > > one is that we have a nice userspace tool, a well defined
> > > > grammar, and that people would be able to use the same tool for
> > > > very similar tasks. nftables in the kernel is almost completely
> > > > lockless, we support way more possibilites already and we won't
> > > > have to add new special case TC actions anymore. Look at the
> > > > connmark action for example. It can set a value. How long until
> > > > someone wants to use a bitmask? We support all operations
> > > > (assignment, bit operations) for all types, we have sets for fast
> > > > lookups, maps for associating values quickly, we have a nice and
> > > > readable syntax and full translation back to the readable
> > > > representation and much more.
> > > 
> > > *cough* Performance numbers? *cough* ;-)
> > 
> > I'm just arguing, not implementing :)
> 
> OK ;-) Seriously though, we need to start putting emphasis on
> numbers as well. We are supposed to run data centers with all of
> this, we can't just horse around for fun ;-)

I would actually expect them to use neither TC nor nft, so the most
interesting number would be the impact if not used. Additionally I'd
like to see the numbers for moving ingress to use the netfilter hook
if it is actually used.

The costs of TC actions vs nft are actually not relevant in my
opinion since we're not replacing anything.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14 10:13               ` Patrick McHardy
@ 2015-04-14 10:32                 ` Thomas Graf
  2015-04-14 20:05                   ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Graf @ 2015-04-14 10:32 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, pablo, netfilter-devel, netdev

On 04/14/15 at 11:13am, Patrick McHardy wrote:
> I would actually expect them to use neither TC nor nft, so the most
> interesting number would be the impact if not used. Additionally I'd
> like to see the numbers for moving ingress to use the netfilter hook
> if it is actually used.
> 
> The costs of TC actions vs nft are actually not relevant in my
> opinion since we're not replacing anything.

Ingress filtering to implement distribtued packet filters is very
relevant for data centers. The times of no-policy data centers are
gone with multi tenancy.

Not all packets are routed so at least some of the filtering must
occur before prerouting. I'm afraid you can't take yourself out
of the fast path that easily ;-)

This is not a pledge specific to nft. I would like to see more
numbers in general. We are putting APIs and frameworks in place that
we can't remove afterwards without knowing how they really scale and
perform.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-13 20:19       ` Patrick McHardy
  2015-04-14  9:00         ` Thomas Graf
@ 2015-04-14 12:27         ` Jamal Hadi Salim
  2015-04-14 15:12           ` John Fastabend
  1 sibling, 1 reply; 39+ messages in thread
From: Jamal Hadi Salim @ 2015-04-14 12:27 UTC (permalink / raw)
  To: Patrick McHardy, David Miller; +Cc: pablo, tgraf, netfilter-devel, netdev

On 04/13/15 16:19, Patrick McHardy wrote:

> I'm going to jump in here since I think a good case for this can be
> made. It's going to be a bit longer, sorry. You can skip to the
> examples after the first three paragraphs since it's just a lot of
> TC bashing :)

Patrick, maybe i am misunderstanding.
Here I am arguing against Alexei for introducing a couple of
statements for the sake of performance and you want to put netfilter
on that code path and maybe move ingress to be dependent on netfilter
further up the stack? I know you are trying to sell the virtues
of nft but since you are busy bashing here, how about this polite
statement:
Netfilter is not known for its performance. Even with the ingress qlock
tc will outperform an equivalent operation done via netfilter.
Last i poked:
As the number of rules goes up, the performance difference goes up
significantly as well. When you mention the ingress qdisc lock
the analogy would be saying like saying "tc is overweight" when
infact netfilter is obese. tc can get better and there is effort
to get rid of the  lock (i am aware of efforts in that direction).
Having said that netfilter is more usable than tc - but that usability
is costly.

I like your example use case but I am not sure why you think those use
cases cant be trivially implemented with tc actions? I see a table
per cpu which is updated with info gleaned from arriving packets
as described by policy.
Maybe i am missing the connection.

In regards to ipt or conntrack; i am sure we could use your help
(and Pablo has been very nice to us) to make it a smoother experience.
We are trying not to replicate what you have done and therefore
reusing your code. OTOH, if i understood correctly you are talking
about replacing all these actions with ones you are going to write.
If you see issues we should fix them. Again, it would be preferable
if there are ways you can help make this even more seamless. And if
you decide you want to use something in tc we should make it smooth for
you.
But using conntrack or ipt or whatever new thing in the netfilter
code is optional for tc. Enforcing it by default is wrong. I also dont
believe in monopolies for classifiers. Sure nft is nice but i should
be able to use bpf if needed. It will make sense in some cases.
This is the design tc has. I know having a single approach as does
nft does reduce ui confusions - but even in your case a refresh is
needed (as in the introduction of nft). And i am sure that is not the
last time you'll do it. That experience transformation
is much smoother with tc when introducing a new classifier.

cheers,
jamal

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14 12:27         ` Jamal Hadi Salim
@ 2015-04-14 15:12           ` John Fastabend
  2015-04-14 15:36             ` Alexei Starovoitov
  0 siblings, 1 reply; 39+ messages in thread
From: John Fastabend @ 2015-04-14 15:12 UTC (permalink / raw)
  To: Jamal Hadi Salim, Patrick McHardy, David Miller
  Cc: pablo, tgraf, netfilter-devel, netdev

On 04/14/2015 05:27 AM, Jamal Hadi Salim wrote:
> On 04/13/15 16:19, Patrick McHardy wrote:
> 
>> I'm going to jump in here since I think a good case for this can be
>> made. It's going to be a bit longer, sorry. You can skip to the
>> examples after the first three paragraphs since it's just a lot of
>> TC bashing :)
> 
> Patrick, maybe i am misunderstanding.
> Here I am arguing against Alexei for introducing a couple of
> statements for the sake of performance and you want to put netfilter
> on that code path and maybe move ingress to be dependent on netfilter
> further up the stack? I know you are trying to sell the virtues
> of nft but since you are busy bashing here, how about this polite
> statement:
> Netfilter is not known for its performance. Even with the ingress qlock
> tc will outperform an equivalent operation done via netfilter.

avoiding the nft vs tc vs ebpf debate I will the ingress qlock is not
needed anymore I just haven't got to removing it yet. Its not actually
protecting any data structures though.

I was hoping to push the skb lists onto something like rte_ring
used by the DPDK folks or possibly some of the lockless ring work Jesper
created. This is needed for many qdisc's to drop the qlock but not the
ingress qdisc. Been busy working on switch bits lately but might be
able to pick this up next merge window.

Thanks,
John

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14 15:12           ` John Fastabend
@ 2015-04-14 15:36             ` Alexei Starovoitov
  2015-04-15  7:35               ` John Fastabend
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2015-04-14 15:36 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jamal Hadi Salim, Patrick McHardy, David Miller, pablo, tgraf,
	netfilter-devel, netdev

On Tue, Apr 14, 2015 at 08:12:18AM -0700, John Fastabend wrote:
>
> I was hoping to push the skb lists onto something like rte_ring
> used by the DPDK folks or possibly some of the lockless ring work Jesper
> created. This is needed for many qdisc's to drop the qlock but not the
> ingress qdisc. Been busy working on switch bits lately but might be
> able to pick this up next merge window.

I've spent quite a bit of time reanalyzying your work ;) It seems
only trivial stuff left to drop ingress spinlock. Can you send me
your TC test scripts ? I'm only starting building mine and they're
not covering everything. Roughly I'm creating namespaces and running
traffic between them while varying csum/gso/gro offload settings.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14 10:32                 ` Thomas Graf
@ 2015-04-14 20:05                   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-04-14 20:05 UTC (permalink / raw)
  To: Thomas Graf; +Cc: brouer, Patrick McHardy, pablo, netfilter-devel, netdev

On Tue, 14 Apr 2015 11:32:02 +0100 Thomas Graf <tgraf@suug.ch> wrote:

> This is not a pledge specific to nft. I would like to see more
> numbers in general. We are putting APIs and frameworks in place that
> we can't remove afterwards without knowing how they really scale and
> perform.

I also vote for more performance numbers and measurements ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-14 15:36             ` Alexei Starovoitov
@ 2015-04-15  7:35               ` John Fastabend
  2015-04-15  9:19                 ` Daniel Borkmann
  2015-04-15 16:24                 ` Alexei Starovoitov
  0 siblings, 2 replies; 39+ messages in thread
From: John Fastabend @ 2015-04-15  7:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Jamal Hadi Salim, Patrick McHardy, David Miller,
	pablo, tgraf, netfilter-devel, netdev

On 04/14/2015 08:36 AM, Alexei Starovoitov wrote:
> On Tue, Apr 14, 2015 at 08:12:18AM -0700, John Fastabend wrote:
>>
>> I was hoping to push the skb lists onto something like rte_ring
>> used by the DPDK folks or possibly some of the lockless ring work Jesper
>> created. This is needed for many qdisc's to drop the qlock but not the
>> ingress qdisc. Been busy working on switch bits lately but might be
>> able to pick this up next merge window.
>
> I've spent quite a bit of time reanalyzying your work ;) It seems
> only trivial stuff left to drop ingress spinlock. Can you send me
> your TC test scripts ? I'm only starting building mine and they're
> not covering everything. Roughly I'm creating namespaces and running
> traffic between them while varying csum/gso/gro offload settings.
>

I'll dig up my scripts and post them to github this weekend. They
are a bit organized and all over the place at the moment.

Maybe we can build a master repository. I know there a lot of different
scripts running around, for example I already collected a few from
Jamal and I think Cong must have some as well.

Here is a patch that has been running on my dev box sans the quick
port to Dave's master tree. It seems to work at least it has been
running on my dev box for a few months. But I haven't had a chance to
run any recent perf numbers on it. Actually what I would really like
is to drop the lock on pfifo_fast with a lockless skb ring and make
drivers expose a descriptor ring per core (most already do anyways).

---

net: sched: run ingress qdisc without locks

From: John Fastabend <john.r.fastabend@intel.com>

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
  net/core/dev.c          |    2 --
  net/sched/sch_ingress.c |    3 ++-
  2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index af4a1b0..9b34a18 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3547,10 +3547,8 @@ static int ing_filter(struct sk_buff *skb, struct 
netdev_queue *rxq)

  	q = rcu_dereference(rxq->qdisc);
  	if (q != &noop_qdisc) {
-		spin_lock(qdisc_lock(q));
  		if (likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
  			result = qdisc_enqueue_root(skb, q);
-		spin_unlock(qdisc_lock(q));
  	}

  	return result;
diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 4cdbfb8..a2542ac 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -69,7 +69,7 @@ static int ingress_enqueue(struct sk_buff *skb, struct 
Qdisc *sch)
  	switch (result) {
  	case TC_ACT_SHOT:
  		result = TC_ACT_SHOT;
-		qdisc_qstats_drop(sch);
+		qdisc_qstats_drop_cpu(sch);
  		break;
  	case TC_ACT_STOLEN:
  	case TC_ACT_QUEUED:
@@ -91,6 +91,7 @@ static int ingress_enqueue(struct sk_buff *skb, struct 
Qdisc *sch)
  static int ingress_init(struct Qdisc *sch, struct nlattr *opt)
  {
  	net_inc_ingress_queue();
+	sch->flags |= TCQ_F_CPUSTATS;

  	return 0;
  }

-- 
John Fastabend         Intel Corporation

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-15  7:35               ` John Fastabend
@ 2015-04-15  9:19                 ` Daniel Borkmann
  2015-04-15 16:24                 ` Alexei Starovoitov
  1 sibling, 0 replies; 39+ messages in thread
From: Daniel Borkmann @ 2015-04-15  9:19 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov
  Cc: John Fastabend, Jamal Hadi Salim, Patrick McHardy, David Miller,
	pablo, tgraf, netfilter-devel, netdev

On 04/15/2015 09:35 AM, John Fastabend wrote:
> On 04/14/2015 08:36 AM, Alexei Starovoitov wrote:
>> On Tue, Apr 14, 2015 at 08:12:18AM -0700, John Fastabend wrote:
>>>
>>> I was hoping to push the skb lists onto something like rte_ring
>>> used by the DPDK folks or possibly some of the lockless ring work Jesper
>>> created. This is needed for many qdisc's to drop the qlock but not the
>>> ingress qdisc. Been busy working on switch bits lately but might be
>>> able to pick this up next merge window.
>>
>> I've spent quite a bit of time reanalyzying your work ;) It seems
>> only trivial stuff left to drop ingress spinlock. Can you send me
>> your TC test scripts ? I'm only starting building mine and they're
>> not covering everything. Roughly I'm creating namespaces and running
>> traffic between them while varying csum/gso/gro offload settings.
>
> I'll dig up my scripts and post them to github this weekend. They
> are a bit organized and all over the place at the moment.
>
> Maybe we can build a master repository. I know there a lot of different
> scripts running around, for example I already collected a few from
> Jamal and I think Cong must have some as well.

Sounds awesome, I think that will be really useful for better test
coverage.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* RE: [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-10 19:56   ` Alexander Duyck
@ 2015-04-15 12:44     ` David Laight
  2015-04-15 13:28       ` Alexander Duyck
  0 siblings, 1 reply; 39+ messages in thread
From: David Laight @ 2015-04-15 12:44 UTC (permalink / raw)
  To: 'Alexander Duyck', Pablo Neira Ayuso, netfilter-devel
  Cc: kaber, netdev, davem

From: Alexander Duyck
> Sent: 10 April 2015 20:56
> On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote:
> > +another_round:
> > +	ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev);
> > +	switch (ret) {
> > +	case NET_RX_SUCCESS:
> > +	case NET_RX_DROP:
> > +		break;
> > +	case __NET_RX_ANOTHER_ROUND:
> > +		goto another_round;
> > +	}
> >  	rcu_read_unlock();
> > +
> >  	return ret;
> >  }
> >
> >
> 
> Couldn't this just be done as a do while?  It would probably be easier
> to read and there wouldn't be any need for the another_round label anymore.

Or an infinite loop with a break at the bottom, as in:
	for (;;) {
		switch (...) {
		case again:
			continue;
		default:
			break;
		}
		break;
	}

	David

			

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-15 12:44     ` David Laight
@ 2015-04-15 13:28       ` Alexander Duyck
  0 siblings, 0 replies; 39+ messages in thread
From: Alexander Duyck @ 2015-04-15 13:28 UTC (permalink / raw)
  To: David Laight, 'Alexander Duyck',
	Pablo Neira Ayuso, netfilter-devel
  Cc: kaber, netdev, davem



On 04/15/2015 05:44 AM, David Laight wrote:
> From: Alexander Duyck
>> Sent: 10 April 2015 20:56
>> On 04/10/2015 05:15 AM, Pablo Neira Ayuso wrote:
>>> +another_round:
>>> +	ret = __netif_receive_skb_ingress(skb, pfmemalloc, orig_dev);
>>> +	switch (ret) {
>>> +	case NET_RX_SUCCESS:
>>> +	case NET_RX_DROP:
>>> +		break;
>>> +	case __NET_RX_ANOTHER_ROUND:
>>> +		goto another_round;
>>> +	}
>>>   	rcu_read_unlock();
>>> +
>>>   	return ret;
>>>   }
>>>
>>>
>>
>> Couldn't this just be done as a do while?  It would probably be easier
>> to read and there wouldn't be any need for the another_round label anymore.
>
> Or an infinite loop with a break at the bottom, as in:
> 	for (;;) {
> 		switch (...) {
> 		case again:
> 			continue;
> 		default:
> 			break;
> 		}
> 		break;
> 	}
>
> 	David
>

That is even more complicated.  What I was thinking was
	do {
		ret = __netif_receive_skb_ingress(skb, pfmemalloc,
						  orig_dev);
	} while (ret == __NET_RX_ANOTHER_ROUND);

Either that or the switch could just be replaced with a if statement 
since the only case that really goes anywhere is __NET_RX_ANOTHER_ROUND 
and everything else just exits anyway.  I had just suggested a do/while 
since that lets the goto be dropped, but an if would allow for avoiding 
any unnecessary indentation on the call to __netif_receive_skb_ingress.

- Alex

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-10 13:47   ` Daniel Borkmann
@ 2015-04-15 16:09     ` Jesper Dangaard Brouer
  2015-04-16  5:49       ` Patrick McHardy
  0 siblings, 1 reply; 39+ messages in thread
From: Jesper Dangaard Brouer @ 2015-04-15 16:09 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Pablo Neira Ayuso, netfilter-devel, kaber, netdev, davem

On Fri, 10 Apr 2015 15:47:34 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/10/2015 02:15 PM, Pablo Neira Ayuso wrote:
> > This patch splits __netif_receive_skb_core() in smaller functions to improve
> > maintainability.
> >
> > The function __netif_receive_skb_core() has been split in two:
> >
> > * __netif_receive_skb_ingress(), to perform all actions up to
> >    ingress filtering.
> >
> > * __netif_receive_skb_finish(), if the ingress filter accepts this
> >    packet, pass it to the corresponding packet_type function handler for further
> > processing.
> >
> > This patch also adds __NET_RX_ANOTHER_ROUND that is used when the packet is
> > stripped off from the vlan header or in case the rx_handler needs it.
> >
> > This also prepares the introduction of the netfilter ingress hook.
> 
> Out of curiosity, what is actually the performance impact on all
> of this? We were just arguing on a different matter on two more
> instructions in the fast-path, here it's refactoring the whole
> function into several ones, I presume gcc won't inline it.

Pablo asked me to performance test this change.  Full test report below.

The performance effect (of this patch) depend on the Gcc compiler
version.

Two tests:
 1. IP-forwarding (unloaded netfilter modules)
 2. Early drop in iptables "raw" table

With GCC 4.4.7, which does not inline the new functions
(__netif_receive_skb_ingress and __netif_receive_skb_finish) the
performance impact/regression is definitly measurable.

With GCC 4.4.7:
 1. IP-forwarding: +25.18 ns (slower) (-27776 pps)
 2. Early-drop   :  +7.55 ns (slower) (-66577 pps)

With GCC 4.9.1, the new functions gets inlined, thus the refactor
splitup of __netif_receive_skb_core() is basically "cancled".
Strangly there is a small improvement for forwarding, likely due to
some lucky assember reordering that give less icache/fetch-misses.
The early-drop improvement is below accuracy levels, can cannot be
trusted.

With GCC 4.9.1:
 1. IP-forwarding: -10.05ns (faster) (+17532 pps)
 2. Early-drop   :  -1.54ns (faster) (+16216 pps) below accuracy levels

I don't know what to conclude, as the result depend on the compiler
version... but these kind of change do affect performance, and should
be tested/measured.

- --
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

Quick eval of Pablo's refactor of __netif_receive_skb_core
==========================================================
:Version: 0.1
:Author:  Jesper Dangaard Brouer

Summary
=======

Pablo is refactoring __netif_receive_skb_core() into several
functions, to allow for some other upcomming changes.
Daniel Borkmann question if this will affect performance.
Jesper tests this.

The performance effect (of this patch) depend on the Gcc compiler
version.

Two tests:
 1. IP-forwarding (unloaded netfilter modules)
 2. Early drop in iptables "raw" table

With GCC 4.4.7, which does not inline the new functions
(__netif_receive_skb_ingress and __netif_receive_skb_finish) the
performance impact/regression is definitly measurable.

With GCC 4.4.7:
 1. IP-forwarding: +25.18 ns (slower) (-27776 pps)
 2. Early-drop   :  +7.55 ns (slower) (-66577 pps)

With GCC 4.9.1, the new functions gets inlined, thus the refactor
splitup of __netif_receive_skb_core() is basically "cancled".
Strangly there is a small improvement for forwarding, likely due to
some lucky assember reordering that give less icache/fetch-misses.
The early-drop improvement is below accuracy levels, can cannot be
trusted.

With GCC 4.9.1:
 1. IP-forwarding: -10.05ns (faster) (+17532 pps)
 2. Early-drop   :  -1.54ns (faster) (+16216 pps) below accuracy levels


Setup
=====

On host: ivy
------------

Host ivy is the "sink" or DUT (Device Under Test).
 * CPU E5-2695ES @ 2.80GHz

netfilter_unload_modules.sh
netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth4  # 10G sink/receiving interface (ixgbe)
base_device_setup.sh eth5
sudo ethtool --coalesce eth4 rx-usecs 30

Make a fake route to 198.18.0.0/15 out via eth5

sudo ip neigh add 192.168.21.66 dev eth5 lladdr 00:00:ba:d0:ba:d0
sudo ip route add 198.18.0.0/15 via 192.168.21.66 dev eth5

Disable power saving to get more accurate measurements (see blogpost)::

 $ sudo tuned-adm active
 Current active profile: latency-performance
 Service tuned: enabled, running
 Service ktune: enabled, running

Early drop in raw
-----------------

alias iptables='sudo iptables'
iptables -t raw -N simple || iptables -t raw -F simple
iptables -t raw -I simple -d 198.18.0.0/15 -j DROP
iptables -t raw -D PREROUTING -j simple
iptables -t raw -I PREROUTING -j simple


On host: dragon
---------------

Host dragon is the packet generator.
 * 2x CPU E5-2630 0 @ 2.30GHz

Generator NIC: eth8 - ixgbe 10G

netfilter_unload_modules.sh
netfilter_unload_modules.sh
sudo rmmod nf_reject_ipv4 nf_reject_ipv6

base_device_setup.sh eth8  # 10G generator interface (ixgbe)
sudo ethtool --coalesce eth8 rx-usecs 30

Generator
~~~~~~~~~
Generator command::

 ./pktgen02_burst.sh -d 198.18.0.2 -i eth8 -m 90:E2:BA:0A:56:B4 -b 32 -s 64 -t 4

This will gen approx 12Mpps towards single IP, thus single flow.

Notice this single flow, one activates 1-CPU on target host.  This is
on purpose.


Baseline measurements
=====================

Kernel: 4.0.0-rc7-net-next-01881-ge60a9de
 - Thus, kernel at commit e60a9de49c3 ("Merge branch ...jkirsher/next-queue")

Two compiler versions:
 * gcc version 4.4.7 20120313 (Red Hat 4.4.7-11) (GCC)
  - (This results in slower kernel)
 * gcc version 4.9.1 20140922 (Red Hat 4.9.1-10) (GCC)


Forwarding
----------

IP-forward, single flow:
 * Gcc 4.4.7
  - Run01: instant rx:0 tx:1064064 pps n:100 average: rx:0 tx:1064066 pps
    (instant variation TX -0.002 ns (min:-0.431 max:0.171) RX 0.000 ns)

IP-forward, single flow:
 * Gcc 4.9.1 <-- **NOTICE GCC version**
  - Run02: instant rx:0 tx:1312000 pps n:96 average: rx:0 tx:1311854 pps
    (instant variation TX 0.085 ns (min:-0.359 max:1.331) RX 0.000 ns)
  - Run03: instant rx:0 tx:1311168 pps n:106 average: rx:0 tx:1310818 pps
    (instant variation TX 0.203 ns (min:-0.593 max:0.512) RX 0.000 ns)


Early drop
----------

Early drop iptables raw, single flow:
 * Gcc 4.4.7
  - Run01: instant rx:3003072 tx:0 pps n:300 average: rx:3002243 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 0.092 ns)

Early drop iptables raw, single flow:
 * Gcc 4.9.1 <-- **NOTICE GCC version**
  - Run02: instant rx:3233600 tx:0 pps n:83 average: rx:3233151 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 0.043 ns)

Change measurement
==================

Performance test for Pablo of patch:
 * http://patchwork.ozlabs.org/patch/460069/

Pablo is refactoring __netif_receive_skb_core() into several
functions, to allow for some other upcomming changes.

Daniel Borkmann question if this will affect performance.

Kernel: 4.0.0-rc7-pablo01-refactor--netif_receive_skb_core+
 * on top of commit e60a9de49c3 ("Merge branch ...jkirsher/next-queue")

Forwarding
----------

IP-forward, single flow:
 * Gcc 4.4.7
  - Run01: instant rx:0 tx:1034236 pps n:74 average: rx:0 tx:1034824 pps
    (instant variation TX -0.550 ns (min:-1.577 max:0.224) RX 0.000 ns)
  - Run02: instant rx:0 tx:1036292 pps n:60 average: rx:0 tx:1036290 pps
    (instant variation TX 0.001 ns (min:-0.271 max:0.259) RX 0.000 ns)
 * (Gcc 4.4.7) compare against baseline (run01 vs run02)
  - 1036290 - 1064066 = -27776 pps (slower)
  - (1/1036290*10^9)-(1/1064066*10^9) = +25.18 ns (slower)

IP-forward, single flow:
 * Gcc 4.9.1  <-- **NOTICE GCC version**
  - Run01: instant rx:0 tx:1335676 pps n:60 average: rx:0 tx:1335773 pps
    (instant variation TX -0.055 ns (min:-0.163 max:0.126) RX 0.000 ns)
  - Run02: instant rx:0 tx:1328708 pps n:80 average: rx:0 tx:1329386 pps
    (instant variation TX -0.384 ns (min:-1.298 max:0.316) RX 0.000 ns)
   * run01 vs run02 variance:
    - (1/1335773*10^9) - (1/1329386*10^9) = -3.5967 ns
 * (Gcc 4.9.1) compare against baseline(Run02) vs this Run02
  - 1329386 - 1311854 = +17532 pps (faster)
  - (1/1329386*10^9) - (1/1311854*10^9) = -10.05ns (faster)


Early drop
----------

Early drop iptables raw, single flow:
 * Gcc 4.4.7
  - Run01: instant rx:2942528 tx:0 pps n:91 average: rx:2940013 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 0.291 ns)
  - Run02: instant rx:2929280 tx:0 pps n:120 average: rx:2935666 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.743 ns)
 * (Gcc 4.4.7) compare against baseline (run01 vs run02)
  - 2935666 - 3002243 = -66577 pps (slower)
  - (1/2935666*10^9) - (1/3002243*10^9) = +7.55 ns (slower)

My theory behind why IP-forwarding shows larger impact/regression is
that IP-forwarding cause more icache misses, and these function splits
is also more expensive instruction icache fetch wise.

Early drop iptables raw, single flow:
 * Gcc 4.9.1 <-- **NOTICE GCC version**
  - Run01: instant rx:3249280 tx:0 pps n:140 average: rx:3249367 tx:0 pps
    (instant variation TX 0.000 ns (min:0.000 max:0.000) RX -0.008 ns)
 * (Gcc 4.9.1) compare against baseline(Run02) vs this Run02
  - 3249367 - 3233151 = +16216 pps (faster)
  - (1/3249367*10^9) - (1/3233151*10^9) = -1.54ns (faster)


In perf top below, notice __netif_receive_skb_finish,
__netif_receive_skb_ingress and __netif_receive_skb_core, are seperate
functions, thus not inlined by Gcc 4.4.7. (early-drop test)

Perf top::

 Samples:: 218K of event 'cycles', Event count (approx.): 43758793383
 Overhead  Shared Object        Symbol
   8.59%  [ip_tables]          [k] ipt_do_table
   6.45%  [kernel]             [k] build_skb
   5.62%  [ixgbe]              [k] ixgbe_fetch_rx_buffer
   5.55%  [kernel]             [k] ip_rcv
   5.31%  [ixgbe]              [k] ixgbe_clean_rx_irq
   4.64%  [kernel]             [k] __netif_receive_skb_finish
   4.28%  [kernel]             [k] dev_gro_receive
   3.76%  [kernel]             [k] inet_gro_receive
   3.30%  [kernel]             [k] kmem_cache_alloc
   3.09%  [kernel]             [k] __rcu_read_unlock
   2.85%  [kernel]             [k] put_compound_page
   2.80%  [kernel]             [k] __memcpy
   2.67%  [kernel]             [k] nf_iterate
   2.49%  [kernel]             [k] nf_hook_slow
   2.43%  [kernel]             [k] __netif_receive_skb_ingress
   2.09%  [kernel]             [k] udp4_gro_receive
   2.04%  [kernel]             [k] kmem_cache_free
   2.03%  [ixgbe]              [k] ixgbe_process_skb_fields
   1.99%  [kernel]             [k] __local_bh_enable_ip
   1.99%  [kernel]             [k] __rcu_read_lock
   1.82%  [kernel]             [k] __netif_receive_skb_core


Total usage of modified functions::

   4.64%  [kernel]             [k] __netif_receive_skb_finish
   2.43%  [kernel]             [k] __netif_receive_skb_ingress
   1.82%  [kernel]             [k] __netif_receive_skb_core
   ---------------
   8.89%

Perf top of same early-drop workload with Gcc 4.9.1. Notice
__netif_receive_skb_core(8.16%) have other functions inlined.

Samples: 221K of event 'cycles', Event count (approx.): 44134618001
Overhead  Shared Object        Symbol
  11.45%  [ixgbe]              [k] ixgbe_clean_rx_irq
   9.55%  [ip_tables]          [k] ipt_do_table
   8.35%  [kernel]             [k] build_skb
   8.16%  [kernel]             [k] __netif_receive_skb_core
   7.44%  [kernel]             [k] ip_rcv
   6.69%  [kernel]             [k] dev_gro_receive
   3.38%  [kernel]             [k] __memcpy
   3.37%  [kernel]             [k] put_compound_page
   3.33%  [kernel]             [k] inet_gro_receive
   2.84%  [kernel]             [k] __rcu_read_unlock
   2.66%  [kernel]             [k] udp4_gro_receive
   2.47%  [kernel]             [k] kmem_cache_alloc
   2.41%  [kernel]             [k] nf_iterate
   2.19%  [kernel]             [k] eth_type_trans
   2.15%  [kernel]             [k] nf_hook_slow
   2.09%  [kernel]             [k] kmem_cache_free
   1.94%  [kernel]             [k] skb_free_head
   1.82%  [kernel]             [k] __local_bh_enable_ip
   1.78%  [kernel]             [k] __rcu_read_lock
   1.70%  [kernel]             [k] udp_gro_receive
   1.59%  [kernel]             [k] skb_release_data
   1.08%  [kernel]             [k] __alloc_page_frag
   1.06%  [kernel]             [k] __alloc_rx_skb
   0.91%  [kernel]             [k] __napi_alloc_skb
   0.83%  [kernel]             [k] skb_release_head_state
   0.75%  [kernel]             [k] napi_gro_receive
   0.66%  [kernel]             [k] skb_release_all
   0.64%  [kernel]             [k] kfree_skb
   0.63%  [ixgbe]              [k] ixgbe_alloc_rx_buffers
   0.55%  [kernel]             [k] __netif_receive_skb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/7 RFC] Netfilter/nf_tables ingress support
  2015-04-15  7:35               ` John Fastabend
  2015-04-15  9:19                 ` Daniel Borkmann
@ 2015-04-15 16:24                 ` Alexei Starovoitov
  1 sibling, 0 replies; 39+ messages in thread
From: Alexei Starovoitov @ 2015-04-15 16:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: John Fastabend, Jamal Hadi Salim, Patrick McHardy, David Miller,
	pablo, tgraf, netfilter-devel, netdev

On Wed, Apr 15, 2015 at 12:35:16AM -0700, John Fastabend wrote:
>
> I'll dig up my scripts and post them to github this weekend. They
> are a bit organized and all over the place at the moment.
> 
> Maybe we can build a master repository. I know there a lot of different
> scripts running around, for example I already collected a few from
> Jamal and I think Cong must have some as well.

great. let's start building a proper testsuite.

> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
> index 4cdbfb8..a2542ac 100644
> --- a/net/sched/sch_ingress.c
> +++ b/net/sched/sch_ingress.c
> @@ -69,7 +69,7 @@ static int ingress_enqueue(struct sk_buff *skb, struct
> Qdisc *sch)
>  	switch (result) {
>  	case TC_ACT_SHOT:
>  		result = TC_ACT_SHOT;
> -		qdisc_qstats_drop(sch);
> +		qdisc_qstats_drop_cpu(sch);

Your quick patch missed updating:
- qdisc_bstats_update(sch, skb);
+ qdisc_bstats_update_cpu(sch, skb);

I've tried the same :)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/7] net: refactor __netif_receive_skb_core
  2015-04-15 16:09     ` Jesper Dangaard Brouer
@ 2015-04-16  5:49       ` Patrick McHardy
  0 siblings, 0 replies; 39+ messages in thread
From: Patrick McHardy @ 2015-04-16  5:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Pablo Neira Ayuso, netfilter-devel, netdev, davem

On 15.04, Jesper Dangaard Brouer wrote:
> > Out of curiosity, what is actually the performance impact on all
> > of this? We were just arguing on a different matter on two more
> > instructions in the fast-path, here it's refactoring the whole
> > function into several ones, I presume gcc won't inline it.
> 
> Pablo asked me to performance test this change.  Full test report below.
> 
> The performance effect (of this patch) depend on the Gcc compiler
> version.
> 
> Two tests:
>  1. IP-forwarding (unloaded netfilter modules)
>  2. Early drop in iptables "raw" table
> 
> With GCC 4.4.7, which does not inline the new functions
> (__netif_receive_skb_ingress and __netif_receive_skb_finish) the
> performance impact/regression is definitly measurable.
> 
> With GCC 4.4.7:
>  1. IP-forwarding: +25.18 ns (slower) (-27776 pps)
>  2. Early-drop   :  +7.55 ns (slower) (-66577 pps)
> 
> With GCC 4.9.1, the new functions gets inlined, thus the refactor
> splitup of __netif_receive_skb_core() is basically "cancled".
> Strangly there is a small improvement for forwarding, likely due to
> some lucky assember reordering that give less icache/fetch-misses.
> The early-drop improvement is below accuracy levels, can cannot be
> trusted.
> 
> With GCC 4.9.1:
>  1. IP-forwarding: -10.05ns (faster) (+17532 pps)
>  2. Early-drop   :  -1.54ns (faster) (+16216 pps) below accuracy levels
> 
> I don't know what to conclude, as the result depend on the compiler
> version... but these kind of change do affect performance, and should
> be tested/measured.

Thanks Jesper. This effect without inlinging was to be expected I guess.
The interesting question would be a patch that uses nf_hook() without okfn
callback, moves the ingress qdisc to register with the netfilter ingress
hook and moves the TTL tracking of ing_filter() to the ingress qdisc,
where it belongs.

My expectation would be that this would result in an overall performance
gain.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2015-04-16  5:49 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10 12:15 [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Pablo Neira Ayuso
2015-04-10 12:15 ` [PATCH 1/7] net: refactor __netif_receive_skb_core Pablo Neira Ayuso
2015-04-10 13:47   ` Daniel Borkmann
2015-04-15 16:09     ` Jesper Dangaard Brouer
2015-04-16  5:49       ` Patrick McHardy
2015-04-10 19:56   ` Alexander Duyck
2015-04-15 12:44     ` David Laight
2015-04-15 13:28       ` Alexander Duyck
2015-04-10 12:15 ` [PATCH 2/7] netfilter: add nf_hook_list_active() Pablo Neira Ayuso
2015-04-10 12:15 ` [PATCH 3/7] netfilter: add hook list to nf_hook_state Pablo Neira Ayuso
2015-04-10 12:15 ` [PATCH 4/7] netfilter: cleanup struct nf_hook_ops struct indentation Pablo Neira Ayuso
2015-04-10 13:27   ` Sergei Shtylyov
2015-04-10 12:15 ` [PATCH 5/7] net: add netfilter ingress hook Pablo Neira Ayuso
2015-04-10 13:21   ` Thomas Graf
2015-04-10 13:36     ` Patrick McHardy
2015-04-10 20:17       ` Pablo Neira Ayuso
2015-04-10 21:33         ` Patrick McHardy
2015-04-11 12:55           ` Pablo Neira Ayuso
2015-04-11 13:06             ` Patrick McHardy
2015-04-11 13:32               ` Pablo Neira Ayuso
2015-04-10 20:08     ` Pablo Neira Ayuso
2015-04-10 12:15 ` [PATCH 6/7] netfilter: nf_tables: allow to bind table to net_device Pablo Neira Ayuso
2015-04-10 12:15 ` [PATCH 7/7] netfilter: nf_tables: add netdev table to filter from ingress Pablo Neira Ayuso
2015-04-10 13:22 ` [PATCH 0/7 RFC] Netfilter/nf_tables ingress support Thomas Graf
2015-04-10 20:09   ` Pablo Neira Ayuso
2015-04-13  1:14     ` David Miller
2015-04-13 20:19       ` Patrick McHardy
2015-04-14  9:00         ` Thomas Graf
2015-04-14  9:06           ` Patrick McHardy
2015-04-14 10:08             ` Thomas Graf
2015-04-14 10:13               ` Patrick McHardy
2015-04-14 10:32                 ` Thomas Graf
2015-04-14 20:05                   ` Jesper Dangaard Brouer
2015-04-14 12:27         ` Jamal Hadi Salim
2015-04-14 15:12           ` John Fastabend
2015-04-14 15:36             ` Alexei Starovoitov
2015-04-15  7:35               ` John Fastabend
2015-04-15  9:19                 ` Daniel Borkmann
2015-04-15 16:24                 ` Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.