Netfilter-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH nf-next v3 0/3] Netfilter egress hook
@ 2020-08-27  8:55 Lukas Wunner
  2020-08-27  8:55 ` [PATCH nf-next v3 1/3] netfilter: Rename ingress hook include file Lukas Wunner
                   ` (4 more replies)
  0 siblings, 5 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-08-27  8:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Introduce a netfilter egress hook to allow filtering outbound AF_PACKETs
such as DHCP and to prepare for in-kernel NAT64/NAT46.

An earlier version of this series was applied by Pablo Neira Ayuso back
in March and subsequently reverted by Daniel Borkmann over performance
concerns.  I've now reworked the series following a discussion between
Daniel and Florian Westphal:

https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/

Briefly, traffic control and netfilter handling is moved out of the
__dev_queue_xmit() hotpath into a noinline function which is dynamically
patched in using a static_key.  In that function, each of tc and nft are
patched in with additional static_keys.

Thus, if neither tc nor nft is used, performance improves compared to
the status quo (see measurements in patch [3/3]).  However if tc is
used, performance degrades a little due to the "noinline", the additional
outer static key and the added netfilter code.  That's kind of a bummer.
If anyone has ideas how to mitigate this performance degradation, please
come forward.

To test the new netfilter egress hook, apply this nft patch to add rules
from user space:

https://lore.kernel.org/netfilter-devel/d6b6896fdd8408e4ddbd66ab524709e5cf82ea32.1583929080.git.lukas@wunner.de/

Thanks!

Lukas Wunner (3):
  netfilter: Rename ingress hook include file
  netfilter: Generalize ingress hook
  netfilter: Introduce egress hook

 include/linux/netdevice.h         |   8 +++
 include/linux/netfilter_ingress.h |  58 -----------------
 include/linux/netfilter_netdev.h  | 102 ++++++++++++++++++++++++++++++
 include/linux/rtnetlink.h         |   2 +-
 include/uapi/linux/netfilter.h    |   1 +
 net/core/dev.c                    |  56 +++++++++++++---
 net/netfilter/Kconfig             |   8 +++
 net/netfilter/core.c              |  24 +++++--
 net/netfilter/nft_chain_filter.c  |   4 +-
 net/sched/Kconfig                 |   3 +
 10 files changed, 194 insertions(+), 72 deletions(-)
 delete mode 100644 include/linux/netfilter_ingress.h
 create mode 100644 include/linux/netfilter_netdev.h

-- 
2.27.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH nf-next v3 1/3] netfilter: Rename ingress hook include file
  2020-08-27  8:55 [PATCH nf-next v3 0/3] Netfilter egress hook Lukas Wunner
@ 2020-08-27  8:55 ` Lukas Wunner
  2020-08-27  8:55 ` [PATCH nf-next v3 2/3] netfilter: Generalize ingress hook Lukas Wunner
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-08-27  8:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/{netfilter_ingress.h => netfilter_netdev.h} | 0
 net/core/dev.c                                            | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename include/linux/{netfilter_ingress.h => netfilter_netdev.h} (100%)

diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_netdev.h
similarity index 100%
rename from include/linux/netfilter_ingress.h
rename to include/linux/netfilter_netdev.h
diff --git a/net/core/dev.c b/net/core/dev.c
index b5d1129d8310..e7ff11ec492c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -136,7 +136,7 @@
 #include <linux/if_macvlan.h>
 #include <linux/errqueue.h>
 #include <linux/hrtimer.h>
-#include <linux/netfilter_ingress.h>
+#include <linux/netfilter_netdev.h>
 #include <linux/crash_dump.h>
 #include <linux/sctp.h>
 #include <net/udp_tunnel.h>
-- 
2.27.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH nf-next v3 2/3] netfilter: Generalize ingress hook
  2020-08-27  8:55 [PATCH nf-next v3 0/3] Netfilter egress hook Lukas Wunner
  2020-08-27  8:55 ` [PATCH nf-next v3 1/3] netfilter: Rename ingress hook include file Lukas Wunner
@ 2020-08-27  8:55 ` Lukas Wunner
  2020-08-27  8:55 ` [PATCH nf-next v3 3/3] netfilter: Introduce egress hook Lukas Wunner
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-08-27  8:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Prepare for addition of a netfilter egress hook by generalizing the
ingress hook introduced by commit e687ad60af09 ("netfilter: add
netfilter ingress hook after handle_ing() under unique static key").

In particular, rename and refactor the ingress hook's static inlines
such that they can be reused for an egress hook.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/netfilter_netdev.h | 45 ++++++++++++++++++++++----------
 net/core/dev.c                   |  2 +-
 2 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index a13774be2eb5..49e26479642e 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -1,34 +1,37 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _NETFILTER_INGRESS_H_
-#define _NETFILTER_INGRESS_H_
+#ifndef _NETFILTER_NETDEV_H_
+#define _NETFILTER_NETDEV_H_
 
 #include <linux/netfilter.h>
 #include <linux/netdevice.h>
 
-#ifdef CONFIG_NETFILTER_INGRESS
-static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
+#ifdef CONFIG_NETFILTER
+static __always_inline bool nf_hook_netdev_active(enum nf_dev_hooks hooknum,
+					  struct nf_hook_entries __rcu *hooks)
 {
 #ifdef CONFIG_JUMP_LABEL
-	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_INGRESS]))
+	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][hooknum]))
 		return false;
 #endif
-	return rcu_access_pointer(skb->dev->nf_hooks_ingress);
+	return rcu_access_pointer(hooks);
 }
 
 /* caller must hold rcu_read_lock */
-static inline int nf_hook_ingress(struct sk_buff *skb)
+static __always_inline int nf_hook_netdev(struct sk_buff *skb,
+					  enum nf_dev_hooks hooknum,
+					  struct nf_hook_entries __rcu *hooks)
 {
-	struct nf_hook_entries *e = rcu_dereference(skb->dev->nf_hooks_ingress);
+	struct nf_hook_entries *e = rcu_dereference(hooks);
 	struct nf_hook_state state;
 	int ret;
 
-	/* Must recheck the ingress hook head, in the event it became NULL
-	 * after the check in nf_hook_ingress_active evaluated to true.
+	/* Must recheck the hook head, in the event it became NULL
+	 * after the check in nf_hook_netdev_active evaluated to true.
 	 */
 	if (unlikely(!e))
 		return 0;
 
-	nf_hook_state_init(&state, NF_NETDEV_INGRESS,
+	nf_hook_state_init(&state, hooknum,
 			   NFPROTO_NETDEV, skb->dev, NULL, NULL,
 			   dev_net(skb->dev), NULL);
 	ret = nf_hook_slow(skb, &state, e, 0);
@@ -37,10 +40,26 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 
 	return ret;
 }
+#endif /* CONFIG_NETFILTER */
 
-static inline void nf_hook_ingress_init(struct net_device *dev)
+static inline void nf_hook_netdev_init(struct net_device *dev)
 {
+#ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
+#endif
+}
+
+#ifdef CONFIG_NETFILTER_INGRESS
+static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
+{
+	return nf_hook_netdev_active(NF_NETDEV_INGRESS,
+				     skb->dev->nf_hooks_ingress);
+}
+
+static inline int nf_hook_ingress(struct sk_buff *skb)
+{
+	return nf_hook_netdev(skb, NF_NETDEV_INGRESS,
+			      skb->dev->nf_hooks_ingress);
 }
 #else /* CONFIG_NETFILTER_INGRESS */
 static inline int nf_hook_ingress_active(struct sk_buff *skb)
@@ -52,7 +71,5 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 {
 	return 0;
 }
-
-static inline void nf_hook_ingress_init(struct net_device *dev) {}
 #endif /* CONFIG_NETFILTER_INGRESS */
 #endif /* _NETFILTER_INGRESS_H_ */
diff --git a/net/core/dev.c b/net/core/dev.c
index e7ff11ec492c..e1aae2df6762 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10308,7 +10308,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (!dev->ethtool_ops)
 		dev->ethtool_ops = &default_ethtool_ops;
 
-	nf_hook_ingress_init(dev);
+	nf_hook_netdev_init(dev);
 
 	return dev;
 
-- 
2.27.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-08-27  8:55 [PATCH nf-next v3 0/3] Netfilter egress hook Lukas Wunner
  2020-08-27  8:55 ` [PATCH nf-next v3 1/3] netfilter: Rename ingress hook include file Lukas Wunner
  2020-08-27  8:55 ` [PATCH nf-next v3 2/3] netfilter: Generalize ingress hook Lukas Wunner
@ 2020-08-27  8:55 ` Lukas Wunner
  2020-08-28 18:52   ` John Fastabend
  2020-09-19 15:54   ` Pablo Neira Ayuso
  2020-08-27 10:36 ` [PATCH nf-next v3 0/3] Netfilter " Laura García Liébana
  2020-08-28  7:14 ` Daniel Borkmann
  4 siblings, 2 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-08-27  8:55 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.

Support the same on egress.  This allows filtering locally generated
traffic such as DHCP, or outbound AF_PACKETs in general.  It will also
allow introducing in-kernel NAT64 and NAT46.  A patch for nftables to
hook up egress rules from user space has been submitted separately.

Position the hook immediately before a packet is handed to traffic
control and then sent out on an interface, thereby mirroring the ingress
order.  This order allows marking packets in the netfilter egress hook
and subsequently using the mark in tc.  Another benefit of this order is
consistency with a lot of existing documentation which says that egress
tc is performed after netfilter hooks.

To avoid a performance degradation in the default case (with neither
netfilter nor traffic control used), Daniel Borkmann suggests "a single
static_key which wraps an empty function call entry which can then be
patched by the kernel at runtime. Inside that trampoline we can still
keep the ordering [between netfilter and traffic control] intact":

https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/

To this end, introduce nf_sch_egress() which is dynamically patched into
__dev_queue_xmit(), contingent on egress_needed_key.  Inside that
function, nf_egress() and sch_handle_egress() is called, each contingent
on its own separate static_key.

nf_sch_egress() is declared noinline per Florian Westphal's suggestion.
This change alone causes a speedup if neither netfilter nor traffic
control is used, apparently because it reduces instruction cache
pressure.  The same effect was previously observed by Eric Dumazet for
the ingress path:

https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/

Overall, performance improves with this commit if neither netfilter nor
traffic control is used. However it degrades a little if only traffic
control is used, due to the "noinline", the additional outer static key
and the added netfilter code:

* Before:       4730418pps 2270Mb/sec (2270600640bps)
* After:        4759206pps 2284Mb/sec (2284418880bps)

* Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
* After  + tc:  4007728pps 1923Mb/sec (1923709440bps)

* After  + nft: 3714546pps 1782Mb/sec (1782982080bps)

Measured on a bare-metal Core i7-3615QM.

Commands to perform a measurement:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Commands to enable egress traffic control:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Commands to enable egress netfilter:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 include/linux/netdevice.h        |  8 +++++
 include/linux/netfilter_netdev.h | 27 +++++++++++++++++
 include/linux/rtnetlink.h        |  2 +-
 include/uapi/linux/netfilter.h   |  1 +
 net/core/dev.c                   | 52 ++++++++++++++++++++++++++++----
 net/netfilter/Kconfig            |  8 +++++
 net/netfilter/core.c             | 24 ++++++++++++---
 net/netfilter/nft_chain_filter.c |  4 ++-
 net/sched/Kconfig                |  3 ++
 9 files changed, 117 insertions(+), 12 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0e303f6603f..5d88a40aca21 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -796,6 +796,10 @@ struct xps_dev_maps {
 
 #endif /* CONFIG_XPS */
 
+#ifdef CONFIG_NET_EGRESS
+extern struct static_key_false egress_needed_key;
+#endif
+
 #define TC_MAX_QUEUE	16
 #define TC_BITMASK	15
 /* HW offloaded queuing disciplines txq count and offset maps */
@@ -1779,6 +1783,7 @@ enum netdev_priv_flags {
  *	@xps_maps:	XXX: need comments on this one
  *	@miniq_egress:		clsact qdisc specific data for
  *				egress processing
+ *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
  *				the watchdog (see dev_watchdog())
@@ -2057,6 +2062,9 @@ struct net_device {
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	struct nf_hook_entries __rcu *nf_hooks_egress;
+#endif
 
 #ifdef CONFIG_NET_SCHED
 	DECLARE_HASHTABLE	(qdisc_hash, 4);
diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index 49e26479642e..92d3611a782e 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -47,6 +47,9 @@ static inline void nf_hook_netdev_init(struct net_device *dev)
 #ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_egress, NULL);
+#endif
 }
 
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -72,4 +75,28 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 	return 0;
 }
 #endif /* CONFIG_NETFILTER_INGRESS */
+
+#ifdef CONFIG_NETFILTER_EGRESS
+static inline bool nf_hook_egress_active(const struct sk_buff *skb)
+{
+	return nf_hook_netdev_active(NF_NETDEV_EGRESS,
+				     skb->dev->nf_hooks_egress);
+}
+
+static inline int nf_hook_egress(struct sk_buff *skb)
+{
+	return nf_hook_netdev(skb, NF_NETDEV_EGRESS,
+			      skb->dev->nf_hooks_egress);
+}
+#else /* CONFIG_NETFILTER_EGRESS */
+static inline int nf_hook_egress_active(struct sk_buff *skb)
+{
+	return 0;
+}
+
+static inline int nf_hook_egress(struct sk_buff *skb)
+{
+	return 0;
+}
+#endif /* CONFIG_NETFILTER_EGRESS */
 #endif /* _NETFILTER_INGRESS_H_ */
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index bb9cb84114c1..49fe4f3b8fbd 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -97,7 +97,7 @@ void net_inc_ingress_queue(void);
 void net_dec_ingress_queue(void);
 #endif
 
-#ifdef CONFIG_NET_EGRESS
+#ifdef CONFIG_NET_SCH_EGRESS
 void net_inc_egress_queue(void);
 void net_dec_egress_queue(void);
 #endif
diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index ca9e63d6e0e4..d1616574c54f 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -50,6 +50,7 @@ enum nf_inet_hooks {
 
 enum nf_dev_hooks {
 	NF_NETDEV_INGRESS,
+	NF_NETDEV_EGRESS,
 	NF_NETDEV_NUMHOOKS
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index e1aae2df6762..ebdd2a348f38 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2099,20 +2099,30 @@ EXPORT_SYMBOL_GPL(net_dec_ingress_queue);
 #endif
 
 #ifdef CONFIG_NET_EGRESS
-static DEFINE_STATIC_KEY_FALSE(egress_needed_key);
+DEFINE_STATIC_KEY_FALSE(egress_needed_key);
+
+#ifdef CONFIG_NET_SCH_EGRESS
+static DEFINE_STATIC_KEY_FALSE(sch_egress_needed_key);
 
 void net_inc_egress_queue(void)
 {
 	static_branch_inc(&egress_needed_key);
+	static_branch_inc(&sch_egress_needed_key);
 }
 EXPORT_SYMBOL_GPL(net_inc_egress_queue);
 
 void net_dec_egress_queue(void)
 {
+	static_branch_dec(&sch_egress_needed_key);
 	static_branch_dec(&egress_needed_key);
 }
 EXPORT_SYMBOL_GPL(net_dec_egress_queue);
-#endif
+
+#define sch_egress_needed static_branch_unlikely(&sch_egress_needed_key)
+#else
+#define sch_egress_needed false
+#endif /* CONFIG_NET_SCH_EGRESS */
+#endif /* CONFIG_NET_EGRESS */
 
 static DEFINE_STATIC_KEY_FALSE(netstamp_needed_key);
 #ifdef CONFIG_JUMP_LABEL
@@ -3854,9 +3864,20 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 EXPORT_SYMBOL(dev_loopback_xmit);
 
 #ifdef CONFIG_NET_EGRESS
-static struct sk_buff *
+static inline int nf_egress(struct sk_buff *skb)
+{
+	int ret;
+
+	rcu_read_lock();
+	ret = nf_hook_egress(skb);
+	rcu_read_unlock();
+	return ret;
+}
+
+static inline struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+#ifdef CONFIG_NET_SCH_EGRESS
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
@@ -3890,6 +3911,25 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	default:
 		break;
 	}
+#endif /* CONFIG_NET_SCH_EGRESS */
+
+	return skb;
+}
+
+/* noinline reduces icache pressure in __dev_queue_xmit() hotpath for
+ * default case that neither netfilter nor traffic control is used.
+ */
+static noinline struct sk_buff *
+nf_sch_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
+{
+	if (nf_hook_egress_active(skb)) {
+		*ret = nf_egress(skb);
+		if (*ret < 0)
+			return NULL;
+	}
+
+	if (sch_egress_needed)
+		return sch_handle_egress(skb, ret, dev);
 
 	return skb;
 }
@@ -4081,13 +4121,13 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_at_ingress = 0;
-# ifdef CONFIG_NET_EGRESS
+#endif
+#ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
-		skb = sch_handle_egress(skb, &rc, dev);
+		skb = nf_sch_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
 	}
-# endif
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 25313c29d799..93a300af82b8 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -10,6 +10,14 @@ config NETFILTER_INGRESS
 	  This allows you to classify packets from ingress using the Netfilter
 	  infrastructure.
 
+config NETFILTER_EGRESS
+	bool "Netfilter egress support"
+	default y
+	select NET_EGRESS
+	help
+	  This allows you to classify packets before transmission using the
+	  Netfilter infrastructure.
+
 config NETFILTER_NETLINK
 	tristate
 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 3ac7c8c1548d..d42c8c2a7c30 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -306,6 +306,12 @@ nf_hook_entry_head(struct net *net, int pf, unsigned int hooknum,
 		if (dev && dev_net(dev) == net)
 			return &dev->nf_hooks_ingress;
 	}
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (hooknum == NF_NETDEV_EGRESS) {
+		if (dev && dev_net(dev) == net)
+			return &dev->nf_hooks_egress;
+	}
 #endif
 	WARN_ON_ONCE(1);
 	return NULL;
@@ -318,11 +324,13 @@ static int __nf_register_net_hook(struct net *net, int pf,
 	struct nf_hook_entries __rcu **pp;
 
 	if (pf == NFPROTO_NETDEV) {
-#ifndef CONFIG_NETFILTER_INGRESS
-		if (reg->hooknum == NF_NETDEV_INGRESS)
+		if ((!IS_ENABLED(CONFIG_NETFILTER_INGRESS) &&
+		     reg->hooknum == NF_NETDEV_INGRESS) ||
+		    (!IS_ENABLED(CONFIG_NETFILTER_EGRESS) &&
+		     reg->hooknum == NF_NETDEV_EGRESS))
 			return -EOPNOTSUPP;
-#endif
-		if (reg->hooknum != NF_NETDEV_INGRESS ||
+		if ((reg->hooknum != NF_NETDEV_INGRESS &&
+		     reg->hooknum != NF_NETDEV_EGRESS) ||
 		    !reg->dev || dev_net(reg->dev) != net)
 			return -EINVAL;
 	}
@@ -348,6 +356,10 @@ static int __nf_register_net_hook(struct net *net, int pf,
 	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
 		net_inc_ingress_queue();
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
+		static_branch_inc(&egress_needed_key);
+#endif
 #ifdef CONFIG_JUMP_LABEL
 	static_key_slow_inc(&nf_hooks_needed[pf][reg->hooknum]);
 #endif
@@ -406,6 +418,10 @@ static void __nf_unregister_net_hook(struct net *net, int pf,
 		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_INGRESS)
 			net_dec_ingress_queue();
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS)
+			static_branch_dec(&egress_needed_key);
+#endif
 #ifdef CONFIG_JUMP_LABEL
 		static_key_slow_dec(&nf_hooks_needed[pf][reg->hooknum]);
 #endif
diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index c78d01bc02e9..67ce6dbb5496 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -277,9 +277,11 @@ static const struct nft_chain_type nft_chain_filter_netdev = {
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_NETDEV,
-	.hook_mask	= (1 << NF_NETDEV_INGRESS),
+	.hook_mask	= (1 << NF_NETDEV_INGRESS) |
+			  (1 << NF_NETDEV_EGRESS),
 	.hooks		= {
 		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
+		[NF_NETDEV_EGRESS]	= nft_do_chain_netdev,
 	},
 };
 
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a3b37d88800e..390b7e6b9e91 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -396,6 +396,9 @@ config NET_SCH_INGRESS
 	  To compile this code as a module, choose M here: the module will be
 	  called sch_ingress with alias of sch_clsact.
 
+config NET_SCH_EGRESS
+	def_bool NET_SCH_INGRESS
+
 config NET_SCH_PLUG
 	tristate "Plug network traffic until release (PLUG)"
 	help
-- 
2.27.0


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 0/3] Netfilter egress hook
  2020-08-27  8:55 [PATCH nf-next v3 0/3] Netfilter egress hook Lukas Wunner
                   ` (2 preceding siblings ...)
  2020-08-27  8:55 ` [PATCH nf-next v3 3/3] netfilter: Introduce egress hook Lukas Wunner
@ 2020-08-27 10:36 ` Laura García Liébana
  2020-08-28  7:14 ` Daniel Borkmann
  4 siblings, 0 replies; 34+ messages in thread
From: Laura García Liébana @ 2020-08-27 10:36 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Daniel Borkmann, Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	David Miller

Hi Lukas, thank you for your patches.

On Thu, Aug 27, 2020 at 10:55 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> Introduce a netfilter egress hook to allow filtering outbound AF_PACKETs
> such as DHCP and to prepare for in-kernel NAT64/NAT46.
>

Actually, we've found 2 additional use cases in container-based nodes
that use the egress hook:

1. intra-node DSR load balancing connectivity
2. container-based outbound security policies

We've been using your previous patch in an experimental project and
it's working fine.

Great job!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 0/3] Netfilter egress hook
  2020-08-27  8:55 [PATCH nf-next v3 0/3] Netfilter egress hook Lukas Wunner
                   ` (3 preceding siblings ...)
  2020-08-27 10:36 ` [PATCH nf-next v3 0/3] Netfilter " Laura García Liébana
@ 2020-08-28  7:14 ` Daniel Borkmann
  2020-08-28  9:14   ` Eric Dumazet
  4 siblings, 1 reply; 34+ messages in thread
From: Daniel Borkmann @ 2020-08-28  7:14 UTC (permalink / raw)
  To: Lukas Wunner, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia, David Miller

Hi Lukas,

On 8/27/20 10:55 AM, Lukas Wunner wrote:
> Introduce a netfilter egress hook to allow filtering outbound AF_PACKETs
> such as DHCP and to prepare for in-kernel NAT64/NAT46.

Thinking more about this, how will this allow to sufficiently filter AF_PACKET?
It won't. Any AF_PACKET application can freely set PACKET_QDISC_BYPASS without
additional privileges and then dev_queue_xmit() is being bypassed in the host ns.
This is therefore ineffective and not sufficient. (From container side these can
be caught w/ host veth on ingress, but not in host ns, of course, so hook won't
be invoked.)

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 0/3] Netfilter egress hook
  2020-08-28  7:14 ` Daniel Borkmann
@ 2020-08-28  9:14   ` Eric Dumazet
  0 siblings, 0 replies; 34+ messages in thread
From: Eric Dumazet @ 2020-08-28  9:14 UTC (permalink / raw)
  To: Daniel Borkmann, Lukas Wunner, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia, David Miller



On 8/28/20 12:14 AM, Daniel Borkmann wrote:
> Hi Lukas,
> 
> On 8/27/20 10:55 AM, Lukas Wunner wrote:
>> Introduce a netfilter egress hook to allow filtering outbound AF_PACKETs
>> such as DHCP and to prepare for in-kernel NAT64/NAT46.
> 
> Thinking more about this, how will this allow to sufficiently filter AF_PACKET?
> It won't. Any AF_PACKET application can freely set PACKET_QDISC_BYPASS without
> additional privileges and then dev_queue_xmit() is being bypassed in the host ns.
> This is therefore ineffective and not sufficient. (From container side these can
> be caught w/ host veth on ingress, but not in host ns, of course, so hook won't
> be invoked.)


Presumably dev_direct_xmit() could be augmented to support the hook.

dev_direct_xmit() (packet_direct_xmit()) was introduced to bypass qdisc,
not to bypass everything.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-08-27  8:55 ` [PATCH nf-next v3 3/3] netfilter: Introduce egress hook Lukas Wunner
@ 2020-08-28 18:52   ` John Fastabend
  2020-09-03  5:00     ` John Fastabend
  2020-09-19 15:54   ` Pablo Neira Ayuso
  1 sibling, 1 reply; 34+ messages in thread
From: John Fastabend @ 2020-08-28 18:52 UTC (permalink / raw)
  To: Lukas Wunner, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Lukas Wunner wrote:
> Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> handle_ing() under unique static key") introduced the ability to
> classify packets on ingress.
> 
> Support the same on egress.  This allows filtering locally generated
> traffic such as DHCP, or outbound AF_PACKETs in general.  It will also
> allow introducing in-kernel NAT64 and NAT46.  A patch for nftables to
> hook up egress rules from user space has been submitted separately.
> 
> Position the hook immediately before a packet is handed to traffic
> control and then sent out on an interface, thereby mirroring the ingress
> order.  This order allows marking packets in the netfilter egress hook
> and subsequently using the mark in tc.  Another benefit of this order is
> consistency with a lot of existing documentation which says that egress
> tc is performed after netfilter hooks.
> 
> To avoid a performance degradation in the default case (with neither
> netfilter nor traffic control used), Daniel Borkmann suggests "a single
> static_key which wraps an empty function call entry which can then be
> patched by the kernel at runtime. Inside that trampoline we can still
> keep the ordering [between netfilter and traffic control] intact":
> 
> https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/
> 
> To this end, introduce nf_sch_egress() which is dynamically patched into
> __dev_queue_xmit(), contingent on egress_needed_key.  Inside that
> function, nf_egress() and sch_handle_egress() is called, each contingent
> on its own separate static_key.
> 
> nf_sch_egress() is declared noinline per Florian Westphal's suggestion.
> This change alone causes a speedup if neither netfilter nor traffic
> control is used, apparently because it reduces instruction cache
> pressure.  The same effect was previously observed by Eric Dumazet for
> the ingress path:
> 
> https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/
> 
> Overall, performance improves with this commit if neither netfilter nor
> traffic control is used. However it degrades a little if only traffic
> control is used, due to the "noinline", the additional outer static key
> and the added netfilter code:
> 
> * Before:       4730418pps 2270Mb/sec (2270600640bps)
> * After:        4759206pps 2284Mb/sec (2284418880bps)

These baseline numbers seem low to me.

> 
> * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> 
> * After  + nft: 3714546pps 1782Mb/sec (1782982080bps)
> 
> Measured on a bare-metal Core i7-3615QM.

OK I have some server class systems here I would like to run these
benchmarks again on to be sure we don't have any performance
regressions on that side.

I'll try to get to it asap, but likely will be Monday morning
by the time I get to it. I assume that should be no problem
seeing we are only on rc2.

Thanks.

> 
> Commands to perform a measurement:
> ip link add dev foo type dummy
> ip link set dev foo up
> modprobe pktgen
> echo "add_device foo" > /proc/net/pktgen/kpktgend_3
> samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Thats a single thread correct? -t option if I recall correctly.
I think we should also try with many threads to see if
that makes a difference. I guess probably not, but lets
see.

> 
> Commands to enable egress traffic control:
> tc qdisc add dev foo clsact
> tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
> 
> Commands to enable egress netfilter:
> nft add table netdev t
> nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; \}
> nft add rule netdev t co ip daddr 4.3.2.1/32 drop
> 

I'll give above a try.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* RE: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-08-28 18:52   ` John Fastabend
@ 2020-09-03  5:00     ` John Fastabend
  2020-09-04  8:54       ` Laura García Liébana
  2020-09-04 16:21       ` Lukas Wunner
  0 siblings, 2 replies; 34+ messages in thread
From: John Fastabend @ 2020-09-03  5:00 UTC (permalink / raw)
  To: John Fastabend, Lukas Wunner, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

John Fastabend wrote:
> Lukas Wunner wrote:
> > Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> > handle_ing() under unique static key") introduced the ability to
> > classify packets on ingress.
> > 
> > Support the same on egress.  This allows filtering locally generated
> > traffic such as DHCP, or outbound AF_PACKETs in general.  It will also
> > allow introducing in-kernel NAT64 and NAT46.  A patch for nftables to
> > hook up egress rules from user space has been submitted separately.
> > 
> > Position the hook immediately before a packet is handed to traffic
> > control and then sent out on an interface, thereby mirroring the ingress
> > order.  This order allows marking packets in the netfilter egress hook
> > and subsequently using the mark in tc.  Another benefit of this order is
> > consistency with a lot of existing documentation which says that egress
> > tc is performed after netfilter hooks.
> > 
> > To avoid a performance degradation in the default case (with neither
> > netfilter nor traffic control used), Daniel Borkmann suggests "a single
> > static_key which wraps an empty function call entry which can then be
> > patched by the kernel at runtime. Inside that trampoline we can still
> > keep the ordering [between netfilter and traffic control] intact":
> > 
> > https://lore.kernel.org/netdev/20200318123315.GI979@breakpoint.cc/
> > 
> > To this end, introduce nf_sch_egress() which is dynamically patched into
> > __dev_queue_xmit(), contingent on egress_needed_key.  Inside that
> > function, nf_egress() and sch_handle_egress() is called, each contingent
> > on its own separate static_key.
> > 
> > nf_sch_egress() is declared noinline per Florian Westphal's suggestion.
> > This change alone causes a speedup if neither netfilter nor traffic
> > control is used, apparently because it reduces instruction cache
> > pressure.  The same effect was previously observed by Eric Dumazet for
> > the ingress path:
> > 
> > https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/
> > 
> > Overall, performance improves with this commit if neither netfilter nor
> > traffic control is used. However it degrades a little if only traffic
> > control is used, due to the "noinline", the additional outer static key
> > and the added netfilter code:

I don't think it actualy improves performance at least I didn't observe
that. From the code its not clear why this would be the case either. As
a nit I would prefer that line removed from the commit message.

I guess the Before/After below is just showing some noise in the
measurement.

> > 
> > * Before:       4730418pps 2270Mb/sec (2270600640bps)
> > * After:        4759206pps 2284Mb/sec (2284418880bps)
> 
> These baseline numbers seem low to me.

I used a 10Gbps ixgbe nic to measure the performance after the dummy
device hung on me for some reason. I'll try to investigate what happened
later. It was unrelated to these patches though.

But, with 10Gbps NIC and doing a pktgen benchmark with and without
the patches applied I didn't see any measurable differences. Both
cases reached 14Mpps.

> 
> > 
> > * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> > * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)

Same here before/after aggregate appears to be the same. Even the
numbers above show a 1.2% degradation. Just curious is the above
from a single run or averaged over multiple runs or something
else? Seems like noise to me.

I did see something odd where it appeared fairness between threads
was slightly worse. I don't have any explanation for this? Did
you have a chance to run the test with -t >1?

Also the overhead on your system for adding a tc rule seems
a bit high. In my case a single tc drop rule added ~7% overhead
at 14mpps. Above it looks more like 16% so double. Maybe a
missing JIT or some other configuration. Either a perf trace
or looking at your config would help figure that out.

> > 
> > * After  + nft: 3714546pps 1782Mb/sec (1782982080bps)
> > 

I haven't had a chance to do these benchmarks, but for my use
cases its more important to _not_ degrade tc performance.

I will note though that this is getting close to a 10% perf
degradation from using tc. I haven't looked much into it,
but that seems high to simply drop a packet.

Do you have plans to address the performance degradation? Otherwise
if I was building some new components its unclear why we would
choose the slower option over the tc hook. The two suggested
use cases security policy and DSR sound like new features, any
reason to not just use existing infrastructure?

Is the use case primarily legacy things already running in
nft infrastructure? I guess if you have code running now
moving it to this hook is faster and even if its 10% slower
than it could be that may be better than a rewrite?

> > Measured on a bare-metal Core i7-3615QM.
> 
> OK I have some server class systems here I would like to run these
> benchmarks again on to be sure we don't have any performance
> regressions on that side.
> 
> I'll try to get to it asap, but likely will be Monday morning
> by the time I get to it. I assume that should be no problem
> seeing we are only on rc2.

Sorry Monday had to look into a different bug.

> 
> Thanks.
> 

Thanks.
John

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-03  5:00     ` John Fastabend
@ 2020-09-04  8:54       ` Laura García Liébana
  2020-09-04 15:46         ` John Fastabend
  2020-09-04 16:21       ` Lukas Wunner
  1 sibling, 1 reply; 34+ messages in thread
From: Laura García Liébana @ 2020-09-04  8:54 UTC (permalink / raw)
  To: John Fastabend
  Cc: Lukas Wunner, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Netfilter Development Mailing list, coreteam,
	netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Thomas Graf, David Miller

Hi,

On Thu, Sep 3, 2020 at 7:00 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
[...]
>
> I don't think it actualy improves performance at least I didn't observe
> that. From the code its not clear why this would be the case either. As
> a nit I would prefer that line removed from the commit message.
>

It hasn't been proven to be untrue either.


[...]
>
> Do you have plans to address the performance degradation? Otherwise
> if I was building some new components its unclear why we would
> choose the slower option over the tc hook. The two suggested
> use cases security policy and DSR sound like new features, any
> reason to not just use existing infrastructure?
>

Unfortunately, tc is not an option as it is required to interact with
nft objects (sets, maps, chains, etc), more complex than just a drop.
Also, when building new features we try to maintain the application
stack as simple as possible, not trying to do ugly integrations.

I understand that you measure performance with a drop, but using this
hook we reduce the datapath consistently for these use cases and
hence, improving traffic performance.

Thank you for your time!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04  8:54       ` Laura García Liébana
@ 2020-09-04 15:46         ` John Fastabend
  2020-09-05 11:13           ` Laura García Liébana
  0 siblings, 1 reply; 34+ messages in thread
From: John Fastabend @ 2020-09-04 15:46 UTC (permalink / raw)
  To: Laura García Liébana, John Fastabend
  Cc: Lukas Wunner, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Netfilter Development Mailing list, coreteam,
	netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Thomas Graf, David Miller

Laura García Liébana wrote:
> Hi,
> 
> On Thu, Sep 3, 2020 at 7:00 AM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> [...]
> >
> > I don't think it actualy improves performance at least I didn't observe
> > that. From the code its not clear why this would be the case either. As
> > a nit I would prefer that line removed from the commit message.
> >
> 
> It hasn't been proven to be untrue either.

huh? Its stated in the commit message with no reason for why it might
be the case and I can't reproduce it. Also the numbers posted show such a
slight increase (~1%) its likely just random system noise.

Sorry maybe that was a joke? Just poured some coffee so might be missing it.

> 
> 
> [...]
> >
> > Do you have plans to address the performance degradation? Otherwise
> > if I was building some new components its unclear why we would
> > choose the slower option over the tc hook. The two suggested
> > use cases security policy and DSR sound like new features, any
> > reason to not just use existing infrastructure?
> >
> 
> Unfortunately, tc is not an option as it is required to interact with
> nft objects (sets, maps, chains, etc), more complex than just a drop.
> Also, when building new features we try to maintain the application
> stack as simple as possible, not trying to do ugly integrations.

We have code that interacts with iptables as well. How I read the
above is in your case you have a bunch of existing software and you
want something slightly faster. Even if its not as fast the 10%
overhead is OK in your case and/or you believe the overhead of all
the other components is much higher so it will be lost in the noise.

> I understand that you measure performance with a drop, but using this
> hook we reduce the datapath consistently for these use cases and
> hence, improving traffic performance.

I measured drops because it was the benchmark provided in the patch
series. Also it likely looks a lot like any DDOS that might be put
there. You mentioned security policies which should probably include
DDOS so I would expect drop performance to be at least a useful
metric even if its not the only or most important in your case.

Lets post a selftest that represents the use case so folks like
myself can understand and benchmark correctly. This gives the extra
benefit of ensuring we don't regress going forward and can add it
to CI.

> 
> Thank you for your time!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-03  5:00     ` John Fastabend
  2020-09-04  8:54       ` Laura García Liébana
@ 2020-09-04 16:21       ` Lukas Wunner
  2020-09-04 21:14         ` Daniel Borkmann
  2020-09-08 18:58         ` John Fastabend
  1 sibling, 2 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-09-04 16:21 UTC (permalink / raw)
  To: John Fastabend
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

On Wed, Sep 02, 2020 at 10:00:32PM -0700, John Fastabend wrote:
> Lukas Wunner wrote:
> > * Before:       4730418pps 2270Mb/sec (2270600640bps)
> > * After:        4759206pps 2284Mb/sec (2284418880bps)
> 
> I used a 10Gbps ixgbe nic to measure the performance after the dummy
> device hung on me for some reason. I'll try to investigate what happened
> later. It was unrelated to these patches though.
> 
> But, with 10Gbps NIC and doing a pktgen benchmark with and without
> the patches applied I didn't see any measurable differences. Both
> cases reached 14Mpps.

Hm, I strongly suspect you may have measured performance of the NIC and
that you'd get different before/after numbers with the dummy device.


> > * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> > * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> 
> Same here before/after aggregate appears to be the same. Even the
> numbers above show a 1.2% degradation. Just curious is the above
> from a single run or averaged over multiple runs or something
> else? Seems like noise to me.

I performed at least 3 runs, but included just a single number in
the commit message for brevity.  That number is intended to show
where the numbers settled:

Before:           2257 2270 2270           Mb/sec
After:            2282 2283 2284 2282      Mb/sec

Before + tc:      1941 1950 1951           Mb/sec
After  + tc:      1923 1923 1919 1920 1919 Mb/sec

After + nft:      1782 1783 1782 1781      Mb/sec
After + nft + tc: 1574 1566 1566           Mb/sec

So there's certainly some noise but still a speedup is clearly
visible if neither tc nor nft is used, and a slight degradation
if tc is used.


> I did see something odd where it appeared fairness between threads
> was slightly worse. I don't have any explanation for this? Did
> you have a chance to run the test with -t >1?

Sorry, no, I only tested with a single thread on an otherwise idle
machine.


> Do you have plans to address the performance degradation? Otherwise
> if I was building some new components its unclear why we would
> choose the slower option over the tc hook. The two suggested
> use cases security policy and DSR sound like new features, any
> reason to not just use existing infrastructure?
> 
> Is the use case primarily legacy things already running in
> nft infrastructure? I guess if you have code running now
> moving it to this hook is faster and even if its 10% slower
> than it could be that may be better than a rewrite?

nft and tc are orthogonal, i.e. filtering/mangling versus queueing.
However tc has gained the ability to filter packets as well, hence
there's some overlap in functionality.  Naturally tc does not allow
the full glory of nft filtering/mangling options as Laura has stated,
hence the need to add nft in the egress path.


> huh? Its stated in the commit message with no reason for why it might
> be the case and I can't reproduce it.

The reason as stated in the commit message is that cache pressure is
apparently reduced with the tc handling moved out of the hotpath,
an effect that Eric Dumazet had previously observed for the ingress path:

https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/

Thanks a lot for taking the time to give these patches a whirl.

Lukas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04 16:21       ` Lukas Wunner
@ 2020-09-04 21:14         ` Daniel Borkmann
  2020-09-05  5:24           ` Lukas Wunner
                             ` (2 more replies)
  2020-09-08 18:58         ` John Fastabend
  1 sibling, 3 replies; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-04 21:14 UTC (permalink / raw)
  To: Lukas Wunner, John Fastabend
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia, David Miller

On 9/4/20 6:21 PM, Lukas Wunner wrote:
> On Wed, Sep 02, 2020 at 10:00:32PM -0700, John Fastabend wrote:
>> Lukas Wunner wrote:
[...]
>> Do you have plans to address the performance degradation? Otherwise
>> if I was building some new components its unclear why we would
>> choose the slower option over the tc hook. The two suggested
>> use cases security policy and DSR sound like new features, any
>> reason to not just use existing infrastructure?
>>
>> Is the use case primarily legacy things already running in
>> nft infrastructure? I guess if you have code running now
>> moving it to this hook is faster and even if its 10% slower
>> than it could be that may be better than a rewrite?
> 
> nft and tc are orthogonal, i.e. filtering/mangling versus queueing.
> However tc has gained the ability to filter packets as well, hence
> there's some overlap in functionality.  Naturally tc does not allow
> the full glory of nft filtering/mangling options as Laura has stated,
> hence the need to add nft in the egress path.

Heh, really!? It sounds to me that you never looked serious enough into what
tc/BPF is actually doing. Please check your facts before making any such claim
since it's false.

Lets do a reality check for your original motivation of adding this hook and
see whether that matches your claim ... quote [0]:

   The module I need this for is out-of-tree:

   https://github.com/RevolutionPi/piControl/commit/da199ccd2099

   In my experience the argument that a feature is needed for an out-of-tree
   module holds zero value upstream.  If there's no in-tree user, the feature
   isn't merged, I've seen this more than enough.  Which is why I didn't mention
   it in the first place.

So in essence what you had in that commit is:

   static unsigned int revpi_gate_nf_hook(void *priv, struct sk_buff *skb,
				         const struct nf_hook_state *state)
   {
	u16 eth_proto = ntohs(eth_hdr(skb)->h_proto);

	return likely(eth_proto == ETH_P_KUNBUSGW) ? NF_ACCEPT : NF_DROP;
   }

   static const struct nf_hook_ops revpi_gate_nf_hook_ops = {
	.hook	  = revpi_gate_nf_hook,
	.pf	  = NFPROTO_NETDEV,
	.hooknum  = NF_NETDEV_EGRESS,
	.priority = INT_MAX,
   };

Its trivial to achieve with tc/BPF on the existing egress hook today. Probably
takes less time than to write up this mail ...

root@x:~/x# cat foo.c

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

#ifndef __section
# define __section(NAME)		\
    __attribute__((section(NAME), used))
#endif

#define ETH_P_KUNBUSGW	0x419C

#define PASS	0
#define DROP	2

int foo(struct __sk_buff *skb)
{
	void *data_end = (void *)(long)skb->data_end;
	void *data = (void *)(long)skb->data;
	struct ethhdr *eth = data;

	if (data + sizeof(*eth) > data_end)
		return DROP;

	return eth->h_proto == htons(ETH_P_KUNBUSGW) ? PASS : DROP;
}

char __license[] __section("license") = "";

root@x:~/x# clang -target bpf -Wall -O2 -c foo.c -o foo.o
root@x:~/x# ip link add dev foo type dummy
root@x:~/x# ip link set up dev foo
root@x:~/x# tc qdisc add dev foo clsact
root@x:~/x# tc filter add dev foo egress bpf da obj foo.o sec .text

There we go, attached to the device on existing egress. Double checking it
does what we want:

root@x:~/x# cat foo.t
{
    0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
    0xbb, 0xbb, 0xbb, 0xbb, 0xbb, 0xbb,
    0x41, 0x9c
}
root@x:~/x# trafgen -i foo.t -o foo -n 1 -q
root@x:~/x# tcpdump -i foo
[...]
22:43:42.981112 bb:bb:bb:bb:bb:bb (oui Unknown) > aa:aa:aa:aa:aa:aa (oui Unknown), ethertype Unknown (0x419c), length 14:

root@x:~/x# cat bar.t
{
    0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
    0xbb, 0xbb, 0xbb, 0xbb, 0xbb, 0xbb,
    0xee, 0xee
}
root@x:~/x# trafgen -i bar.t -o foo -n 1 -q
root@x:~/x# tcpdump -i foo
[... nothing/filtered ...]

Done, and this is exactly intended for exotic stuff like this. It also mangles packets
and whatnot, I guess I don't need to list all of this here (catching up on docs is
easy enough). The tc queueing layer which is below is not the tc egress hook; the
latter is for filtering/mangling/forwarding or helping the lower tc queueing layer to
classify.

Now given you've stated that you're not mentioning your out of tree stuff in the
commit message in the first place, you bring up "This allows filtering locally
generated traffic such as DHCP, or outbound AF_PACKETs in general. It will also
allow introducing in-kernel NAT64 and NAT46."

I haven't seen any NAT64/NAT46 in this set and I guess it's some sort of future
work (fwiw, you can also already do this today with tc/BPF), and filtering AF_PACKET
is currently broken with your approach as elaborated earlier so needs another hook...
also slow-path DHCP filtering should rather be moved into AF_PACKET itself. Why paying
the performance hit going into the nft interpreter for this hook for *every* other
*unrelated* packet in the fast-path... the case is rather if distros start adding DHCP
filtering rules by default there as per your main motivation then everyone needs to
pay this price, which is completely unreasonable to perform in __dev_queue_xmit().

Thanks,
Daniel

   [0] https://lore.kernel.org/netdev/20191123142305.g2kkaudhhyui22fq@wunner.de/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04 21:14         ` Daniel Borkmann
@ 2020-09-05  5:24           ` Lukas Wunner
  2020-09-08 12:55             ` Daniel Borkmann
  2020-09-05 11:18           ` Laura García Liébana
  2020-09-08 11:46           ` Arturo Borrero Gonzalez
  2 siblings, 1 reply; 34+ messages in thread
From: Lukas Wunner @ 2020-09-05  5:24 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netfilter-devel, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
> On 9/4/20 6:21 PM, Lukas Wunner wrote:
> > nft and tc are orthogonal, i.e. filtering/mangling versus queueing.
> > However tc has gained the ability to filter packets as well, hence
> > there's some overlap in functionality.  Naturally tc does not allow
> > the full glory of nft filtering/mangling options as Laura has stated,
> > hence the need to add nft in the egress path.
> 
> Heh, really!? It sounds to me that you never looked serious enough into what
> tc/BPF is actually doing.

It wasn't my intention to denigrate or belittle tc's capabilities
with the above statement.

I don't dispute that the original use case for which these patches
were developed might be solved equally well with tc.  As I've stated
before, I chose netfilter over tc because I needed an in-kernel API,
which netfilter provides with nf_register_net_hook():

https://lore.kernel.org/netdev/20191123142305.g2kkaudhhyui22fq@wunner.de/

The motivation for these patches has pivoted away from my original
use case, which is why I no longer mentioned it in the commit message.


> The tc queueing layer which is below is not the tc egress hook; the
> latter is for filtering/mangling/forwarding or helping the lower tc
> queueing layer to classify.

People want to apply netfilter rules on egress, so either we need an
egress hook in the xmit path or we'd have to teach tc to filter and
mangle based on netfilter rules.  The former seemed more straight-forward
to me but I'm happy to pursue other directions.


> Why paying the performance hit going into the nft interpreter for this
> hook for *every* other *unrelated* packet in the fast-path...

As long as neither tc nor nft is used, *no* packet goes through the
nft interpreter and I'm measuring a speedup as a result of moving
the two out of the hotpath.

If nft is used, only those interfaces for which netfilter rules have
been hooked up go through the nft interpreter.


> the case is rather if distros start adding DHCP
> filtering rules by default there as per your main motivation then
> everyone needs to pay this price, which is completely unreasonable
> to perform in __dev_queue_xmit().

So first you're saying that the patches are unnecessary and everything
they do can be achieved with tc... and then you're saying distros are
going to use the nft hook to filter DHCP by default, which will cost
performance.  That seems contradictory.  Why aren't distros using tc
today to filter DHCP?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04 15:46         ` John Fastabend
@ 2020-09-05 11:13           ` Laura García Liébana
  0 siblings, 0 replies; 34+ messages in thread
From: Laura García Liébana @ 2020-09-05 11:13 UTC (permalink / raw)
  To: John Fastabend
  Cc: Lukas Wunner, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Netfilter Development Mailing list, coreteam,
	netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Thomas Graf, David Miller

Hi John,

On Fri, Sep 4, 2020 at 5:46 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Laura García Liébana wrote:
> > Hi,
> >
> > On Thu, Sep 3, 2020 at 7:00 AM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > [...]
> > >
> > > I don't think it actualy improves performance at least I didn't observe
> > > that. From the code its not clear why this would be the case either. As
> > > a nit I would prefer that line removed from the commit message.
> > >
> >
> > It hasn't been proven to be untrue either.
>
> huh? Its stated in the commit message with no reason for why it might
> be the case and I can't reproduce it. Also the numbers posted show such a
> slight increase (~1%) its likely just random system noise.
>
> Sorry maybe that was a joke? Just poured some coffee so might be missing it.
>
> >
> >
> > [...]
> > >
> > > Do you have plans to address the performance degradation? Otherwise
> > > if I was building some new components its unclear why we would
> > > choose the slower option over the tc hook. The two suggested
> > > use cases security policy and DSR sound like new features, any
> > > reason to not just use existing infrastructure?
> > >
> >
> > Unfortunately, tc is not an option as it is required to interact with
> > nft objects (sets, maps, chains, etc), more complex than just a drop.
> > Also, when building new features we try to maintain the application
> > stack as simple as possible, not trying to do ugly integrations.
>
> We have code that interacts with iptables as well. How I read the
> above is in your case you have a bunch of existing software and you
> want something slightly faster. Even if its not as fast the 10%
> overhead is OK in your case and/or you believe the overhead of all
> the other components is much higher so it will be lost in the noise.
>

Not a bunch of software, but the other way around. We replaced, a year
now, all the existing iptables, ip6tables, ebtables, arptables,
x_tables, ipset and ipvs components (both in-kernel and user-space)
with just nftables. As all these components features are integrated
natively, objects (sets, maps, chains, stateful objects, etc.) are
created in a form of nftables scheme that are integrated all together.
That is why the tc workaround is not an option for people that are
moving to nftables to use just a hook.


> > I understand that you measure performance with a drop, but using this
> > hook we reduce the datapath consistently for these use cases and
> > hence, improving traffic performance.
>
> I measured drops because it was the benchmark provided in the patch
> series. Also it likely looks a lot like any DDOS that might be put
> there. You mentioned security policies which should probably include
> DDOS so I would expect drop performance to be at least a useful
> metric even if its not the only or most important in your case.
>
> Lets post a selftest that represents the use case so folks like
> myself can understand and benchmark correctly. This gives the extra
> benefit of ensuring we don't regress going forward and can add it
> to CI.
>

From the 4 use cases we found until now (although I'm sure there will
be many more), 2 are related to filtering (not necessarily related to
DDoS mitigation though) and 2 are related to packet mangling. One of
the packet mangling DSR case, it is working great from ingress but in
certain traffic generated from user-space in the node, we require the
egress hook. In addition to that, having this hook available in nft,
we could improve performance by optimizing the datapath for several
load balancing cases.

As I said, I understand that you're worried about dropping performance
but this hook will allow many more possibilities to improve the
traffic datapath.

Thank you!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04 21:14         ` Daniel Borkmann
  2020-09-05  5:24           ` Lukas Wunner
@ 2020-09-05 11:18           ` Laura García Liébana
  2020-09-07 22:11             ` Daniel Borkmann
  2020-09-08 11:46           ` Arturo Borrero Gonzalez
  2 siblings, 1 reply; 34+ messages in thread
From: Laura García Liébana @ 2020-09-05 11:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

On Fri, Sep 4, 2020 at 11:14 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
[...]
>
> Its trivial to achieve with tc/BPF on the existing egress hook today. Probably
> takes less time than to write up this mail ...
>
> root@x:~/x# cat foo.c
>
> #include <linux/bpf.h>
> #include <linux/if_ether.h>
> #include <arpa/inet.h>
>
> #ifndef __section
> # define __section(NAME)                \
>     __attribute__((section(NAME), used))
> #endif
>
> #define ETH_P_KUNBUSGW  0x419C
>
> #define PASS    0
> #define DROP    2
>
> int foo(struct __sk_buff *skb)
> {
>         void *data_end = (void *)(long)skb->data_end;
>         void *data = (void *)(long)skb->data;
>         struct ethhdr *eth = data;
>
>         if (data + sizeof(*eth) > data_end)
>                 return DROP;
>
>         return eth->h_proto == htons(ETH_P_KUNBUSGW) ? PASS : DROP;
> }
>
> char __license[] __section("license") = "";
>
> root@x:~/x# clang -target bpf -Wall -O2 -c foo.c -o foo.o
> root@x:~/x# ip link add dev foo type dummy
> root@x:~/x# ip link set up dev foo
> root@x:~/x# tc qdisc add dev foo clsact
> root@x:~/x# tc filter add dev foo egress bpf da obj foo.o sec .text
>
> There we go, attached to the device on existing egress. Double checking it
> does what we want:
>
> root@x:~/x# cat foo.t
> {
>     0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
>     0xbb, 0xbb, 0xbb, 0xbb, 0xbb, 0xbb,
>     0x41, 0x9c
> }
> root@x:~/x# trafgen -i foo.t -o foo -n 1 -q
> root@x:~/x# tcpdump -i foo
> [...]
> 22:43:42.981112 bb:bb:bb:bb:bb:bb (oui Unknown) > aa:aa:aa:aa:aa:aa (oui Unknown), ethertype Unknown (0x419c), length 14:
>
> root@x:~/x# cat bar.t
> {
>     0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
>     0xbb, 0xbb, 0xbb, 0xbb, 0xbb, 0xbb,
>     0xee, 0xee
> }
> root@x:~/x# trafgen -i bar.t -o foo -n 1 -q
> root@x:~/x# tcpdump -i foo
> [... nothing/filtered ...]
>

Something like this seems more trivial to me:

table netdev mytable {
    chain mychain {
        type filter hook egress device "eth0" priority 100; policy drop;
        meta protocol != 0x419C accept
    }
}

Cheers.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-05 11:18           ` Laura García Liébana
@ 2020-09-07 22:11             ` Daniel Borkmann
  2020-09-08  6:19               ` Laura García Liébana
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-07 22:11 UTC (permalink / raw)
  To: Laura García Liébana
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

On 9/5/20 1:18 PM, Laura García Liébana wrote:
> On Fri, Sep 4, 2020 at 11:14 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
[...]
> Something like this seems more trivial to me:
> 
> table netdev mytable {
>      chain mychain {
>          type filter hook egress device "eth0" priority 100; policy drop;
>          meta protocol != 0x419C accept
>      }
> }

Sure, different frontends, so what?! You could also wrap that code into a
simple a.out or have nft style syntax jit to bpf ...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-07 22:11             ` Daniel Borkmann
@ 2020-09-08  6:19               ` Laura García Liébana
  0 siblings, 0 replies; 34+ messages in thread
From: Laura García Liébana @ 2020-09-08  6:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

On Tue, Sep 8, 2020 at 12:11 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/5/20 1:18 PM, Laura García Liébana wrote:
> > On Fri, Sep 4, 2020 at 11:14 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> [...]
> > Something like this seems more trivial to me:
> >
> > table netdev mytable {
> >      chain mychain {
> >          type filter hook egress device "eth0" priority 100; policy drop;
> >          meta protocol != 0x419C accept
> >      }
> > }
>
> Sure, different frontends, so what?! You could also wrap that code into a
> simple a.out or have nft style syntax jit to bpf ...

Not only syntax but get rid of bpf and tc from the application stack
and only maintain 1 interface.

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04 21:14         ` Daniel Borkmann
  2020-09-05  5:24           ` Lukas Wunner
  2020-09-05 11:18           ` Laura García Liébana
@ 2020-09-08 11:46           ` Arturo Borrero Gonzalez
  2020-09-08 13:27             ` Daniel Borkmann
  2 siblings, 1 reply; 34+ messages in thread
From: Arturo Borrero Gonzalez @ 2020-09-08 11:46 UTC (permalink / raw)
  To: Daniel Borkmann, Lukas Wunner, John Fastabend
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia, David Miller

On 2020-09-04 23:14, Daniel Borkmann wrote:
> root@x:~/x# clang -target bpf -Wall -O2 -c foo.c -o foo.o

In my honest opinion (debian hat), the simplification of the stack is a key
point for end users/developers. A gain in usability might justify a small
performance penalty.

I can think on both sysadmins and network apps developers, or even casual
advanced users. For many people, dealing with the network stack is already
challenging enough.

Also, ideally, servers would be clean of the GCC or CLANG suites.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-05  5:24           ` Lukas Wunner
@ 2020-09-08 12:55             ` Daniel Borkmann
  2020-09-11  7:42               ` Laura García Liébana
  2020-10-11  7:59               ` Lukas Wunner
  0 siblings, 2 replies; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-08 12:55 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: John Fastabend, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netfilter-devel, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Hi Lukas,

On 9/5/20 7:24 AM, Lukas Wunner wrote:
> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
[...]
>> The tc queueing layer which is below is not the tc egress hook; the
>> latter is for filtering/mangling/forwarding or helping the lower tc
>> queueing layer to classify.
> 
> People want to apply netfilter rules on egress, so either we need an
> egress hook in the xmit path or we'd have to teach tc to filter and
> mangle based on netfilter rules.  The former seemed more straight-forward
> to me but I'm happy to pursue other directions.

I would strongly prefer something where nf integrates into existing tc hook,
not only due to the hook reuse which would be better, but also to allow for a
more flexible interaction between tc/BPF use cases and nf, to name one
example... consider two different entities in the system setting up the two, that
is, one adding rules for nf ingress/egress on the phys device for host fw and
the other one for routing traffic into/from containers at the tc layer,
then traffic going into host ns will hit nf ingress and on egress side the
nf egress part; however, traffic going to containers via existing tc redirect
will not see the nf ingress as expected but would on reverse path incorrectly
hit the nf egress one which is /not/ the case for dev_queue_xmit() today. So
there would need to be more flexible coordination between the two so these
subsystems don't step on each other and the orchestration system can flexibly
arrange those needs depending on the use case. Conceptually the tc/nf
ingress/egress hook would be the same anyway in the sense that we have
some sort of a list or array with callbacks performing actions on the skb,
passing on, dropping or forwarding, so this should be consolidated where
both can register into an array of callbacks as processing pipeline that
can be atomically swapped at runtime, and then similar as with tc or LSMs
allow to delegate or terminate the processing in a generic way.

[...]
>> the case is rather if distros start adding DHCP
>> filtering rules by default there as per your main motivation then
>> everyone needs to pay this price, which is completely unreasonable
>> to perform in __dev_queue_xmit().
> 
> So first you're saying that the patches are unnecessary and everything
> they do can be achieved with tc... and then you're saying distros are
> going to use the nft hook to filter DHCP by default, which will cost
> performance.  That seems contradictory.  Why aren't distros using tc
> today to filter DHCP?

Again, I'm not sure why you ask me, you guys brought up lack of DHCP filtering
as why this hook is needed. My gut feeling why it is not there today, because the
use case was not strong enough to do it on nf or tc layer that anyone cared to fix
it over the last few decades (!). And if you check a typical DHCP client that is
present on major modern distros like systemd-networkd's DHCP client then they
already implement filtering of malicious packets via BPF at socket layer including
checking for cookies in the DHCP header that are set by the application itself to
prevent spoofing [0].

Thanks,
Daniel

   [0] https://github.com/systemd/systemd/blob/master/src/libsystemd-network/dhcp-network.c#L28

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-08 11:46           ` Arturo Borrero Gonzalez
@ 2020-09-08 13:27             ` Daniel Borkmann
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-08 13:27 UTC (permalink / raw)
  To: Arturo Borrero Gonzalez, Lukas Wunner, John Fastabend
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia, David Miller

On 9/8/20 1:46 PM, Arturo Borrero Gonzalez wrote:
> On 2020-09-04 23:14, Daniel Borkmann wrote:
>> root@x:~/x# clang -target bpf -Wall -O2 -c foo.c -o foo.o
> 
> In my honest opinion (debian hat), the simplification of the stack is a key
> point for end users/developers. A gain in usability might justify a small
> performance penalty.

Not really, both are independent from each other. Usability is typically achieved
through abstractions, e.g. hiding complexity in libraries (think of raw syscalls
vs libc). Same with the example of bpf or any other kernel subsystem fwiw, users
don't need to be aware of the details as applications abstract this away entirely
but they can benefit from efficiency underneath nevertheless. One example is how
systemd implements cgroup-aware firewalling and accounting for its services via bpf
[0]. Zero knowledge required while it presents meta data in user friendly way via
systemctl status. I'm not trying to convince you of bpf (or systemd), just that
this argument is moot.

> I can think on both sysadmins and network apps developers, or even casual
> advanced users. For many people, dealing with the network stack is already
> challenging enough.

In the age of containers and distributed computing there is no such thing as
sysadmin anymore as we know it from our university days where a bunch of grey
bearded admins maintained a bunch of old sun boxes, printers, etc manually. ;-)
But yes, devops these days is complex, hence abstractions to improve usability
and gain introspection, but kernel is just a tiny fraction in the overall stack.

> Also, ideally, servers would be clean of the GCC or CLANG suites.

Yes agree, one can compile out all other backends (in case of clang at least) that
would generate executable code though.

   [0] http://0pointer.net/blog/ip-accounting-and-access-lists-with-systemd.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-04 16:21       ` Lukas Wunner
  2020-09-04 21:14         ` Daniel Borkmann
@ 2020-09-08 18:58         ` John Fastabend
  1 sibling, 0 replies; 34+ messages in thread
From: John Fastabend @ 2020-09-08 18:58 UTC (permalink / raw)
  To: Lukas Wunner, John Fastabend
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

Lukas Wunner wrote:
> On Wed, Sep 02, 2020 at 10:00:32PM -0700, John Fastabend wrote:
> > Lukas Wunner wrote:
> > > * Before:       4730418pps 2270Mb/sec (2270600640bps)
> > > * After:        4759206pps 2284Mb/sec (2284418880bps)
> > 
> > I used a 10Gbps ixgbe nic to measure the performance after the dummy
> > device hung on me for some reason. I'll try to investigate what happened
> > later. It was unrelated to these patches though.
> > 
> > But, with 10Gbps NIC and doing a pktgen benchmark with and without
> > the patches applied I didn't see any measurable differences. Both
> > cases reached 14Mpps.
> 
> Hm, I strongly suspect you may have measured performance of the NIC and
> that you'd get different before/after numbers with the dummy device.

OK tried again on dummy device.

> 
> 
> > > * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> > > * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> > 
> > Same here before/after aggregate appears to be the same. Even the
> > numbers above show a 1.2% degradation. Just curious is the above
> > from a single run or averaged over multiple runs or something
> > else? Seems like noise to me.
> 
> I performed at least 3 runs, but included just a single number in
> the commit message for brevity.  That number is intended to show
> where the numbers settled:
> 
> Before:           2257 2270 2270           Mb/sec
> After:            2282 2283 2284 2282      Mb/sec
> 
> Before + tc:      1941 1950 1951           Mb/sec
> After  + tc:      1923 1923 1919 1920 1919 Mb/sec
> 
> After + nft:      1782 1783 1782 1781      Mb/sec
> After + nft + tc: 1574 1566 1566           Mb/sec
> 
> So there's certainly some noise but still a speedup is clearly
> visible if neither tc nor nft is used, and a slight degradation
> if tc is used.

After running multiple times it does seem to be some small performance
improvement by adding noinline there. Its small though (maybe 1-2%?)
and I can't detect this on anything, but the dummy device.

But the degradation with clsact is also caused from this noinline.
If I add the noinline directly into the existing code I see the
same impact. Likely some some small performance improvement for the
no clsact case, but a very real degradation in the clsact case. Presumably,
due to having to do a call now. I didn't collect perf output
just did the simple test.

One piece I don't understand fully yet is on a single thread test
the degradation is small, but as the number of threads increases
the degradation increases. At single thread its 1-2%, but creeps
up to about 5% with 16 cores. Can you confirm this?

> 
> 
> > I did see something odd where it appeared fairness between threads
> > was slightly worse. I don't have any explanation for this? Did
> > you have a chance to run the test with -t >1?
> 
> Sorry, no, I only tested with a single thread on an otherwise idle
> machine.

We need to also consider the case with many threads or at least
become convinced its not going to change with thread count. What
I see is degradation is creeping up as cores increase.

> 
> 
> > Do you have plans to address the performance degradation? Otherwise
> > if I was building some new components its unclear why we would
> > choose the slower option over the tc hook. The two suggested
> > use cases security policy and DSR sound like new features, any
> > reason to not just use existing infrastructure?
> > 
> > Is the use case primarily legacy things already running in
> > nft infrastructure? I guess if you have code running now
> > moving it to this hook is faster and even if its 10% slower
> > than it could be that may be better than a rewrite?
> 
> nft and tc are orthogonal, i.e. filtering/mangling versus queueing.
> However tc has gained the ability to filter packets as well, hence
> there's some overlap in functionality.  Naturally tc does not allow
> the full glory of nft filtering/mangling options as Laura has stated,
> hence the need to add nft in the egress path.
> 
> 
> > huh? Its stated in the commit message with no reason for why it might
> > be the case and I can't reproduce it.
> 
> The reason as stated in the commit message is that cache pressure is
> apparently reduced with the tc handling moved out of the hotpath,
> an effect that Eric Dumazet had previously observed for the ingress path:
> 
> https://lore.kernel.org/netdev/1431387038.566.47.camel@edumazet-glaptop2.roam.corp.google.com/

OK, seems possible it could be an icache miss we are hitting. To really
confirm this though I would want to look at icache statistics. Otherwise
it feels likely, but difficult to tell for sure.

This noinline change subtle and buried in another patch. If you really
want to noinline that function pull it out of the series and push as its
own patch. I am against it because it appears to be directly degrading
performance for my use case and only providing small (if measurable at all)
gain in the normal case. But, at least if its submitted as its own patch we
can debate the merits. We would need performance data for some real devices
veth, nic, and dummy device also across many threads to get a good
handle on it. Also perf data would help understand whats happening. My
preference would be to also nail down the icache stats so we can be sure
this noininline improvment in non clsact case is fully understood.

> 
> Thanks a lot for taking the time to give these patches a whirl.
> 
> Lukas



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-08 12:55             ` Daniel Borkmann
@ 2020-09-11  7:42               ` Laura García Liébana
  2020-09-11 16:27                 ` Daniel Borkmann
  2020-10-11  7:59               ` Lukas Wunner
  1 sibling, 1 reply; 34+ messages in thread
From: Laura García Liébana @ 2020-09-11  7:42 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Hi Lukas,
>
> On 9/5/20 7:24 AM, Lukas Wunner wrote:
> > On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
> >> On 9/4/20 6:21 PM, Lukas Wunner wrote:
> [...]
> >> The tc queueing layer which is below is not the tc egress hook; the
> >> latter is for filtering/mangling/forwarding or helping the lower tc
> >> queueing layer to classify.
> >
> > People want to apply netfilter rules on egress, so either we need an
> > egress hook in the xmit path or we'd have to teach tc to filter and
> > mangle based on netfilter rules.  The former seemed more straight-forward
> > to me but I'm happy to pursue other directions.
>
> I would strongly prefer something where nf integrates into existing tc hook,
> not only due to the hook reuse which would be better, but also to allow for a
> more flexible interaction between tc/BPF use cases and nf, to name one

That sounds good but I'm afraid that it would take too much back and
forth discussions. We'll really appreciate it if this small patch can
be unblocked and then rethink the refactoring of ingress/egress hooks
that you commented in another thread.

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-11  7:42               ` Laura García Liébana
@ 2020-09-11 16:27                 ` Daniel Borkmann
  2020-09-14 11:29                   ` Laura García Liébana
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-11 16:27 UTC (permalink / raw)
  To: Laura García Liébana
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

On 9/11/20 9:42 AM, Laura García Liébana wrote:
> On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 9/5/20 7:24 AM, Lukas Wunner wrote:
>>> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
>>>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
>> [...]
>>>> The tc queueing layer which is below is not the tc egress hook; the
>>>> latter is for filtering/mangling/forwarding or helping the lower tc
>>>> queueing layer to classify.
>>>
>>> People want to apply netfilter rules on egress, so either we need an
>>> egress hook in the xmit path or we'd have to teach tc to filter and
>>> mangle based on netfilter rules.  The former seemed more straight-forward
>>> to me but I'm happy to pursue other directions.
>>
>> I would strongly prefer something where nf integrates into existing tc hook,
>> not only due to the hook reuse which would be better, but also to allow for a
>> more flexible interaction between tc/BPF use cases and nf, to name one
> 
> That sounds good but I'm afraid that it would take too much back and
> forth discussions. We'll really appreciate it if this small patch can
> be unblocked and then rethink the refactoring of ingress/egress hooks
> that you commented in another thread.

I'm not sure whether your comment was serious or not, but nope, this needs
to be addressed as mentioned as otherwise this use case would regress. It
is one thing for you wanting to remove tc / BPF from your application stack
as you call it, but not at the cost of breaking others.

Thank you,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-11 16:27                 ` Daniel Borkmann
@ 2020-09-14 11:29                   ` Laura García Liébana
  2020-09-14 22:02                     ` Daniel Borkmann
  0 siblings, 1 reply; 34+ messages in thread
From: Laura García Liébana @ 2020-09-14 11:29 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

On Fri, Sep 11, 2020 at 6:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/11/20 9:42 AM, Laura García Liébana wrote:
> > On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 9/5/20 7:24 AM, Lukas Wunner wrote:
> >>> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
> >>>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
> >> [...]
> >>>> The tc queueing layer which is below is not the tc egress hook; the
> >>>> latter is for filtering/mangling/forwarding or helping the lower tc
> >>>> queueing layer to classify.
> >>>
> >>> People want to apply netfilter rules on egress, so either we need an
> >>> egress hook in the xmit path or we'd have to teach tc to filter and
> >>> mangle based on netfilter rules.  The former seemed more straight-forward
> >>> to me but I'm happy to pursue other directions.
> >>
> >> I would strongly prefer something where nf integrates into existing tc hook,
> >> not only due to the hook reuse which would be better, but also to allow for a
> >> more flexible interaction between tc/BPF use cases and nf, to name one
> >
> > That sounds good but I'm afraid that it would take too much back and
> > forth discussions. We'll really appreciate it if this small patch can
> > be unblocked and then rethink the refactoring of ingress/egress hooks
> > that you commented in another thread.
>
> I'm not sure whether your comment was serious or not, but nope, this needs
> to be addressed as mentioned as otherwise this use case would regress. It

This patch doesn't break anything. The tc redirect use case that you
just commented on is the expected behavior and the same will happen
with ingress. To be consistent, in the case that someone requires both
hooks, another tc redirect would be needed in the egress path. If you
mean to bypass the nf egress if tc redirect in ingress is used, that
would lead in a huge security concern.

Please elaborate on where do you see a break in this patch.

> is one thing for you wanting to remove tc / BPF from your application stack
> as you call it, but not at the cost of breaking others.
>

I'm not intended to remove tc / BPF from my application stack as I'm
not using it and, as I explained in past emails, it can't be used for
my use cases.

In addition, let's review your NACK reasons:

   This reverts the following commits:

     8537f78647c0 ("netfilter: Introduce egress hook")
     5418d3881e1f ("netfilter: Generalize ingress hook")
     b030f194aed2 ("netfilter: Rename ingress hook include file")

   From the discussion in [0], the author's main motivation to add a hook
   in fast path is for an out of tree kernel module, which is a red flag
   to begin with. Other mentioned potential use cases like NAT{64,46}
   is on future extensions w/o concrete code in the tree yet. Revert as
   suggested [1] given the weak justification to add more hooks to critical
   fast-path.

     [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
     [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/


It has been explained already that there are more use cases that
require this hook in nf, not only for future developments or out of
tree modules.

Thank you!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-14 11:29                   ` Laura García Liébana
@ 2020-09-14 22:02                     ` Daniel Borkmann
  2020-09-17 10:28                       ` Laura García Liébana
  2020-10-11  8:26                       ` Lukas Wunner
  0 siblings, 2 replies; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-14 22:02 UTC (permalink / raw)
  To: Laura García Liébana
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

On 9/14/20 1:29 PM, Laura García Liébana wrote:
> On Fri, Sep 11, 2020 at 6:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 9/11/20 9:42 AM, Laura García Liébana wrote:
>>> On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 9/5/20 7:24 AM, Lukas Wunner wrote:
>>>>> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
>>>>>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
>>>> [...]
>>>>>> The tc queueing layer which is below is not the tc egress hook; the
>>>>>> latter is for filtering/mangling/forwarding or helping the lower tc
>>>>>> queueing layer to classify.
>>>>>
>>>>> People want to apply netfilter rules on egress, so either we need an
>>>>> egress hook in the xmit path or we'd have to teach tc to filter and
>>>>> mangle based on netfilter rules.  The former seemed more straight-forward
>>>>> to me but I'm happy to pursue other directions.
>>>>
>>>> I would strongly prefer something where nf integrates into existing tc hook,
>>>> not only due to the hook reuse which would be better, but also to allow for a
>>>> more flexible interaction between tc/BPF use cases and nf, to name one
>>>
>>> That sounds good but I'm afraid that it would take too much back and
>>> forth discussions. We'll really appreciate it if this small patch can
>>> be unblocked and then rethink the refactoring of ingress/egress hooks
>>> that you commented in another thread.
>>
>> I'm not sure whether your comment was serious or not, but nope, this needs
>> to be addressed as mentioned as otherwise this use case would regress. It
> 
> This patch doesn't break anything. The tc redirect use case that you
> just commented on is the expected behavior and the same will happen
> with ingress. To be consistent, in the case that someone requires both
> hooks, another tc redirect would be needed in the egress path. If you
> mean to bypass the nf egress if tc redirect in ingress is used, that
> would lead in a huge security concern.

I'm not sure I parse what you're saying above ... today it is possible and
perfectly fine to e.g. redirect to a host-facing veth from tc ingress which
then goes into container. Only traffic that goes up the host stack is seen
by nf ingress hook in that case. Likewise, reply traffic can be redirected
from host-facing veth to phys dev for xmit w/o any netfilter interference.
This means netfilter in host ns really only sees traffic to/from host as
intended. This is fine today, however, if 3rd party entities (e.g. distro
side) start pushing down rules on the two nf hooks, then these use cases will
break on the egress one due to this asymmetric layering violation. Hence my
ask that this needs to be configurable from a control plane perspective so
that both use cases can live next to each other w/o breakage. Most trivial
one I can think of is (aside from the fact to refactor the hooks and improve
their performance) a flag e.g. for skb that can be set from tc/BPF layer to
bypass the nf hooks. Basically a flexible opt-in so that existing use-cases
can be retained w/o breakage. This is one option with what I meant in my
earlier mail.

>> is one thing for you wanting to remove tc / BPF from your application stack
>> as you call it, but not at the cost of breaking others.
> 
> I'm not intended to remove tc / BPF from my application stack as I'm
> not using it and, as I explained in past emails, it can't be used for
> my use cases.
> 
> In addition, let's review your NACK reasons:
> 
>     This reverts the following commits:
> 
>       8537f78647c0 ("netfilter: Introduce egress hook")
>       5418d3881e1f ("netfilter: Generalize ingress hook")
>       b030f194aed2 ("netfilter: Rename ingress hook include file")
> 
>     From the discussion in [0], the author's main motivation to add a hook
>     in fast path is for an out of tree kernel module, which is a red flag
>     to begin with. Other mentioned potential use cases like NAT{64,46}
>     is on future extensions w/o concrete code in the tree yet. Revert as
>     suggested [1] given the weak justification to add more hooks to critical
>     fast-path.
> 
>       [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
>       [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/
> 
> It has been explained already that there are more use cases that
> require this hook in nf, not only for future developments or out of
> tree modules.

Sure, aside from the two mentioned cases above, we scratched DHCP a little
bit on the surface but it was found that i) you need a af_packet specific
hook to get there instead, and ii) dhcp clients implement their own filtering
internally to check for bogus messages. What is confusing to me is whether
this is just brought up as an example or whether you actually care to solve
it (.. but then why would you do that in fast-path to penalize every other
traffic as well just for this type of slow-path filtering instead of doing
in af_packet only). Similarly, why not add this along with /actual/ nat64
code with /concrete/ explanation of why it cannot be performed in post-routing?
Either way, whatever your actual/real use-case, the above must be addressed
one way or another.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-14 22:02                     ` Daniel Borkmann
@ 2020-09-17 10:28                       ` Laura García Liébana
  2020-09-18 20:31                         ` Daniel Borkmann
  2020-10-11  8:26                       ` Lukas Wunner
  1 sibling, 1 reply; 34+ messages in thread
From: Laura García Liébana @ 2020-09-17 10:28 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

On Tue, Sep 15, 2020 at 12:02 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/14/20 1:29 PM, Laura García Liébana wrote:
> > On Fri, Sep 11, 2020 at 6:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 9/11/20 9:42 AM, Laura García Liébana wrote:
> >>> On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>> On 9/5/20 7:24 AM, Lukas Wunner wrote:
> >>>>> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
> >>>>>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
> >>>> [...]
> >>>>>> The tc queueing layer which is below is not the tc egress hook; the
> >>>>>> latter is for filtering/mangling/forwarding or helping the lower tc
> >>>>>> queueing layer to classify.
> >>>>>
> >>>>> People want to apply netfilter rules on egress, so either we need an
> >>>>> egress hook in the xmit path or we'd have to teach tc to filter and
> >>>>> mangle based on netfilter rules.  The former seemed more straight-forward
> >>>>> to me but I'm happy to pursue other directions.
> >>>>
> >>>> I would strongly prefer something where nf integrates into existing tc hook,
> >>>> not only due to the hook reuse which would be better, but also to allow for a
> >>>> more flexible interaction between tc/BPF use cases and nf, to name one
> >>>
> >>> That sounds good but I'm afraid that it would take too much back and
> >>> forth discussions. We'll really appreciate it if this small patch can
> >>> be unblocked and then rethink the refactoring of ingress/egress hooks
> >>> that you commented in another thread.
> >>
> >> I'm not sure whether your comment was serious or not, but nope, this needs
> >> to be addressed as mentioned as otherwise this use case would regress. It
> >
> > This patch doesn't break anything. The tc redirect use case that you
> > just commented on is the expected behavior and the same will happen
> > with ingress. To be consistent, in the case that someone requires both
> > hooks, another tc redirect would be needed in the egress path. If you
> > mean to bypass the nf egress if tc redirect in ingress is used, that
> > would lead in a huge security concern.
>
> I'm not sure I parse what you're saying above ... today it is possible and
> perfectly fine to e.g. redirect to a host-facing veth from tc ingress which
> then goes into container. Only traffic that goes up the host stack is seen
> by nf ingress hook in that case. Likewise, reply traffic can be redirected
> from host-facing veth to phys dev for xmit w/o any netfilter interference.
> This means netfilter in host ns really only sees traffic to/from host as
> intended. This is fine today, however, if 3rd party entities (e.g. distro
> side) start pushing down rules on the two nf hooks, then these use cases will
> break on the egress one due to this asymmetric layering violation. Hence my
> ask that this needs to be configurable from a control plane perspective so
> that both use cases can live next to each other w/o breakage. Most trivial

Why does it should be symmetric? Fast-paths create "asymmetric
layering" continuously, see: packet hit XDP to user space bypassing
ingress, but in the response will hit egress. So the "breakage" is
already there.

Also, we're here to create mechanisms not policies that distros have to follow.

> one I can think of is (aside from the fact to refactor the hooks and improve
> their performance) a flag e.g. for skb that can be set from tc/BPF layer to
> bypass the nf hooks. Basically a flexible opt-in so that existing use-cases
> can be retained w/o breakage. This is one option with what I meant in my
> earlier mail.

No comment.

>
> >> is one thing for you wanting to remove tc / BPF from your application stack
> >> as you call it, but not at the cost of breaking others.
> >
> > I'm not intended to remove tc / BPF from my application stack as I'm
> > not using it and, as I explained in past emails, it can't be used for
> > my use cases.
> >
> > In addition, let's review your NACK reasons:
> >
> >     This reverts the following commits:
> >
> >       8537f78647c0 ("netfilter: Introduce egress hook")
> >       5418d3881e1f ("netfilter: Generalize ingress hook")
> >       b030f194aed2 ("netfilter: Rename ingress hook include file")
> >
> >     From the discussion in [0], the author's main motivation to add a hook
> >     in fast path is for an out of tree kernel module, which is a red flag
> >     to begin with. Other mentioned potential use cases like NAT{64,46}
> >     is on future extensions w/o concrete code in the tree yet. Revert as
> >     suggested [1] given the weak justification to add more hooks to critical
> >     fast-path.
> >
> >       [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
> >       [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/
> >
> > It has been explained already that there are more use cases that
> > require this hook in nf, not only for future developments or out of
> > tree modules.
>
> Sure, aside from the two mentioned cases above, we scratched DHCP a little
> bit on the surface but it was found that i) you need a af_packet specific
> hook to get there instead, and ii) dhcp clients implement their own filtering
> internally to check for bogus messages. What is confusing to me is whether
> this is just brought up as an example or whether you actually care to solve

I need a af_packet filter in nf egress but never said it was related
to DHCP. It is more related to clustering.

> it (.. but then why would you do that in fast-path to penalize every other
> traffic as well just for this type of slow-path filtering instead of doing

With nft if the hook is not registered it is not going to be used at
all, so the penalty will never happen to any traffic.

> in af_packet only). Similarly, why not add this along with /actual/ nat64
> code with /concrete/ explanation of why it cannot be performed in post-routing?
> Either way, whatever your actual/real use-case, the above must be addressed
> one way or another.
>

Pablo already explained why it should be done in egress [0].

Thank you for your time!


[0] https://marc.info/?l=linux-netdev&m=158449203321811&w=2

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-17 10:28                       ` Laura García Liébana
@ 2020-09-18 20:31                         ` Daniel Borkmann
  2020-09-19 15:52                           ` Pablo Neira Ayuso
  2020-09-21  7:07                           ` Laura García Liébana
  0 siblings, 2 replies; 34+ messages in thread
From: Daniel Borkmann @ 2020-09-18 20:31 UTC (permalink / raw)
  To: Laura García Liébana
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

On 9/17/20 12:28 PM, Laura García Liébana wrote:
> On Tue, Sep 15, 2020 at 12:02 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 9/14/20 1:29 PM, Laura García Liébana wrote:
>>> On Fri, Sep 11, 2020 at 6:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 9/11/20 9:42 AM, Laura García Liébana wrote:
>>>>> On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>> On 9/5/20 7:24 AM, Lukas Wunner wrote:
>>>>>>> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
>>>>>>>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
>>>>>> [...]
>>>>>>>> The tc queueing layer which is below is not the tc egress hook; the
>>>>>>>> latter is for filtering/mangling/forwarding or helping the lower tc
>>>>>>>> queueing layer to classify.
>>>>>>>
>>>>>>> People want to apply netfilter rules on egress, so either we need an
>>>>>>> egress hook in the xmit path or we'd have to teach tc to filter and
>>>>>>> mangle based on netfilter rules.  The former seemed more straight-forward
>>>>>>> to me but I'm happy to pursue other directions.
>>>>>>
>>>>>> I would strongly prefer something where nf integrates into existing tc hook,
>>>>>> not only due to the hook reuse which would be better, but also to allow for a
>>>>>> more flexible interaction between tc/BPF use cases and nf, to name one
>>>>>
>>>>> That sounds good but I'm afraid that it would take too much back and
>>>>> forth discussions. We'll really appreciate it if this small patch can
>>>>> be unblocked and then rethink the refactoring of ingress/egress hooks
>>>>> that you commented in another thread.
>>>>
>>>> I'm not sure whether your comment was serious or not, but nope, this needs
>>>> to be addressed as mentioned as otherwise this use case would regress. It
>>>
>>> This patch doesn't break anything. The tc redirect use case that you
>>> just commented on is the expected behavior and the same will happen
>>> with ingress. To be consistent, in the case that someone requires both
>>> hooks, another tc redirect would be needed in the egress path. If you
>>> mean to bypass the nf egress if tc redirect in ingress is used, that
>>> would lead in a huge security concern.
>>
>> I'm not sure I parse what you're saying above ... today it is possible and
>> perfectly fine to e.g. redirect to a host-facing veth from tc ingress which
>> then goes into container. Only traffic that goes up the host stack is seen
>> by nf ingress hook in that case. Likewise, reply traffic can be redirected
>> from host-facing veth to phys dev for xmit w/o any netfilter interference.
>> This means netfilter in host ns really only sees traffic to/from host as
>> intended. This is fine today, however, if 3rd party entities (e.g. distro
>> side) start pushing down rules on the two nf hooks, then these use cases will
>> break on the egress one due to this asymmetric layering violation. Hence my
>> ask that this needs to be configurable from a control plane perspective so
>> that both use cases can live next to each other w/o breakage. Most trivial
> 
> Why does it should be symmetric? Fast-paths create "asymmetric
> layering" continuously, see: packet hit XDP to user space bypassing
> ingress, but in the response will hit egress. So the "breakage" is
> already there.

Not quite sure what you mean exactly here or into which issue you ran. Either
you push the xdp buffer back out from XDP layer for load balancer case so upper
stack never sees it, or you push it to upper stack, and it goes through the
ingress/egress hooks e.g. from tc side. AF_XDP will bypass either. If you mean
the redirect from XDP layer to the veth devs where they have XDP support, then
the reply path also needs to operate /below/ netfilter on tc layer exactly for
the reason /not/ to break, as otherwise we get potentially hard to debug skb
drops on netfilter side when CT is involved and it figures it must drop due to
invalid CT state to name one example. That is if there is an opt-in to such data
path being used, then it also needs to continue to work, which gets me back to
the earlier mentioned example with the interaction on the egress side with that
hook that it needs to /interoperate/ with tc to avoid breakage of existing use
cases in the wild. Reuse of skb flag could be one option to move forward, or as
mentioned in earlier mails overall rework of ingress/egress side to be a more
flexible pipeline (think of cont/ok actions as with tc filters or stackable LSMs
to process & delegate).

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-18 20:31                         ` Daniel Borkmann
@ 2020-09-19 15:52                           ` Pablo Neira Ayuso
  2020-09-21  7:07                           ` Laura García Liébana
  1 sibling, 0 replies; 34+ messages in thread
From: Pablo Neira Ayuso @ 2020-09-19 15:52 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Laura García Liébana, Lukas Wunner, John Fastabend,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

Long time no see, unfortunately this complicated situation is keeping
us away from personal reach, that's unfortunate. Now, looking into
this topic...

On Fri, Sep 18, 2020 at 10:31:09PM +0200, Daniel Borkmann wrote:
> [...] That is if there is an opt-in to such data path being used, then it also
> needs to continue to work, which gets me back to the earlier mentioned example
> with the interaction on the egress side with that hook that it needs to
> /interoperate/ with tc to avoid breakage of existing use cases in the wild.
> Reuse of skb flag could be one option to move forward, or as mentioned in
> earlier mails overall rework of ingress/egress side to be a more flexible
> pipeline (think of cont/ok actions as with tc filters or stackable LSMs to
> process & delegate).

The netfilter ingress hook was introduced many years after the tc
ingress "qdisc" (in the 4.2 kernel series), and I have absolutely no
records of one single complain from users in the netdev and netfilter
mailing lists regarding this being an issue / breaking anything. The
ingress hook needs to be *explicitly* registered by the user, so an
explicit user action to register the hook is required to register this
hook. As for this egress hook, it will be disabled by default too, since
egress chains are only registered on demand.

Assuming that preventing Netfilter to operate will *not* break things
makes no sense. It's the user that make sure that policies are
consistent across the datapath. I don't think there is any mechanism
that ensures that user policy fully makes sense.

Note that:

- The user can easily inspect if someone registered an egress hook.
- Your software can just report a warning to your user if there is an
  interaction with other subsystems makes no sense, it's just a bit of
  Netlink code from userspace if you don't want to wait for the user to
  notice.

You mentioned there is a real issue at this moment since AF_PACKET might
bypass dev_queue_xmit(), I think we can just ask Lukas to extend his
patch to include a hook there, so you can also follow up to fix this
issue for you too.

Thank you Daniel, stay safe.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-08-27  8:55 ` [PATCH nf-next v3 3/3] netfilter: Introduce egress hook Lukas Wunner
  2020-08-28 18:52   ` John Fastabend
@ 2020-09-19 15:54   ` Pablo Neira Ayuso
  2020-09-28 12:20     ` Lukas Wunner
  1 sibling, 1 reply; 34+ messages in thread
From: Pablo Neira Ayuso @ 2020-09-19 15:54 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Jozsef Kadlecsik, Florian Westphal, netfilter-devel, coreteam,
	netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Thomas Graf, Laura Garcia, David Miller

Hi Lukas,

On Thu, Aug 27, 2020 at 10:55:03AM +0200, Lukas Wunner wrote:
[...]
> Overall, performance improves with this commit if neither netfilter nor
> traffic control is used. However it degrades a little if only traffic
> control is used, due to the "noinline", the additional outer static key
> and the added netfilter code:
> 
> * Before:       4730418pps 2270Mb/sec (2270600640bps)
> * After:        4759206pps 2284Mb/sec (2284418880bps)
> 
> * Before + tc:  4063912pps 1950Mb/sec (1950677760bps)
> * After  + tc:  4007728pps 1923Mb/sec (1923709440bps)
> 
> * After  + nft: 3714546pps 1782Mb/sec (1782982080bps)
[...]
> Commands to enable egress traffic control:
> tc qdisc add dev foo clsact
> tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

1,6 0 0 0, means drop. This is a program with one instruction that
says "drop this packet".

> Commands to enable egress netfilter:
> nft add table netdev t
> nft add chain netdev t co \{ type filter hook egress device foo priority 0 \; \}
> nft add rule netdev t co ip daddr 4.3.2.1/32 drop

However, this is actually doing much more than that:

nft --debug=netlink add rule netdev t co ip daddr 4.3.2.1/32 drop
netdev 
  [ meta load protocol => reg 1 ]
  [ cmp eq reg 1 0x00000008 ]
  [ payload load 4b @ network header + 16 => reg 1 ]
  [ bitwise reg 1 = (reg=1 & 0xffffffff ) ^ 0x00000000 ]
  [ cmp eq reg 1 0x01020304 ]
  [ immediate reg 0 drop ]

So this is comparing apples and pears in some way :-)

Then, I'd suggest the Netfilter ruleset to compare it with tc should be:

add table netdev t
add chain netdev t co { type filter hook egress device foo priority 0 ; policy drop; }

Would you redo these numbers using this ruleset to address Daniel's
comments regarding performance?

Moreover, Daniel also suggested dev_direct_xmit() path from AF_PACKET
allows packets to escape from policy, it seems this also needs to be
extended to add a hook there too.

Could you work on this and send a v2?

Thank you.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-18 20:31                         ` Daniel Borkmann
  2020-09-19 15:52                           ` Pablo Neira Ayuso
@ 2020-09-21  7:07                           ` Laura García Liébana
  1 sibling, 0 replies; 34+ messages in thread
From: Laura García Liébana @ 2020-09-21  7:07 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lukas Wunner, John Fastabend, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

Hi Daniel,

On Fri, Sep 18, 2020 at 10:31 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/17/20 12:28 PM, Laura García Liébana wrote:
> > On Tue, Sep 15, 2020 at 12:02 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 9/14/20 1:29 PM, Laura García Liébana wrote:
> >>> On Fri, Sep 11, 2020 at 6:28 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>> On 9/11/20 9:42 AM, Laura García Liébana wrote:
> >>>>> On Tue, Sep 8, 2020 at 2:55 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>>> On 9/5/20 7:24 AM, Lukas Wunner wrote:
> >>>>>>> On Fri, Sep 04, 2020 at 11:14:37PM +0200, Daniel Borkmann wrote:
> >>>>>>>> On 9/4/20 6:21 PM, Lukas Wunner wrote:
> >>>>>> [...]
> >>>>>>>> The tc queueing layer which is below is not the tc egress hook; the
> >>>>>>>> latter is for filtering/mangling/forwarding or helping the lower tc
> >>>>>>>> queueing layer to classify.
> >>>>>>>
> >>>>>>> People want to apply netfilter rules on egress, so either we need an
> >>>>>>> egress hook in the xmit path or we'd have to teach tc to filter and
> >>>>>>> mangle based on netfilter rules.  The former seemed more straight-forward
> >>>>>>> to me but I'm happy to pursue other directions.
> >>>>>>
> >>>>>> I would strongly prefer something where nf integrates into existing tc hook,
> >>>>>> not only due to the hook reuse which would be better, but also to allow for a
> >>>>>> more flexible interaction between tc/BPF use cases and nf, to name one
> >>>>>
> >>>>> That sounds good but I'm afraid that it would take too much back and
> >>>>> forth discussions. We'll really appreciate it if this small patch can
> >>>>> be unblocked and then rethink the refactoring of ingress/egress hooks
> >>>>> that you commented in another thread.
> >>>>
> >>>> I'm not sure whether your comment was serious or not, but nope, this needs
> >>>> to be addressed as mentioned as otherwise this use case would regress. It
> >>>
> >>> This patch doesn't break anything. The tc redirect use case that you
> >>> just commented on is the expected behavior and the same will happen
> >>> with ingress. To be consistent, in the case that someone requires both
> >>> hooks, another tc redirect would be needed in the egress path. If you
> >>> mean to bypass the nf egress if tc redirect in ingress is used, that
> >>> would lead in a huge security concern.
> >>
> >> I'm not sure I parse what you're saying above ... today it is possible and
> >> perfectly fine to e.g. redirect to a host-facing veth from tc ingress which
> >> then goes into container. Only traffic that goes up the host stack is seen
> >> by nf ingress hook in that case. Likewise, reply traffic can be redirected
> >> from host-facing veth to phys dev for xmit w/o any netfilter interference.
> >> This means netfilter in host ns really only sees traffic to/from host as
> >> intended. This is fine today, however, if 3rd party entities (e.g. distro
> >> side) start pushing down rules on the two nf hooks, then these use cases will
> >> break on the egress one due to this asymmetric layering violation. Hence my
> >> ask that this needs to be configurable from a control plane perspective so
> >> that both use cases can live next to each other w/o breakage. Most trivial
> >
> > Why does it should be symmetric? Fast-paths create "asymmetric
> > layering" continuously, see: packet hit XDP to user space bypassing
> > ingress, but in the response will hit egress. So the "breakage" is
> > already there.
>
> Not quite sure what you mean exactly here or into which issue you ran. Either

I don't really know, we were discussing about your issue :)

> you push the xdp buffer back out from XDP layer for load balancer case so upper
> stack never sees it, or you push it to upper stack, and it goes through the
> ingress/egress hooks e.g. from tc side. AF_XDP will bypass either. If you mean
> the redirect from XDP layer to the veth devs where they have XDP support, then
> the reply path also needs to operate /below/ netfilter on tc layer exactly for
> the reason /not/ to break, as otherwise we get potentially hard to debug skb
> drops on netfilter side when CT is involved and it figures it must drop due to

So, the "breakage" is about because it is difficult to debug CT drops,
or maybe it creates asymmetric layering, or maybe problems with nf/tc
interoperation? Sorry, this is so confusing. I suspect that the issue
you're talking about is a consequence of having different hooks,
nothing specifically related to this patch cause these use cases
you're referring to are happening right now with XDP, tc and CT.

The advantage is that with nf you only register the hooks required, so
this won't be a problem. Also, we are having more and more mechanisms
for ingress and CT interoperation, and this will be quite easy to
extend to egress.

> invalid CT state to name one example. That is if there is an opt-in to such data
> path being used, then it also needs to continue to work, which gets me back to
> the earlier mentioned example with the interaction on the egress side with that
> hook that it needs to /interoperate/ with tc to avoid breakage of existing use
> cases in the wild. Reuse of skb flag could be one option to move forward, or as
> mentioned in earlier mails overall rework of ingress/egress side to be a more
> flexible pipeline (think of cont/ok actions as with tc filters or stackable LSMs
> to process & delegate).

Again, a flag is not needed as you can register and de-register nf
hooks on demand.

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-19 15:54   ` Pablo Neira Ayuso
@ 2020-09-28 12:20     ` Lukas Wunner
  0 siblings, 0 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-09-28 12:20 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Florian Westphal, netfilter-devel, coreteam,
	netdev, Daniel Borkmann, Alexei Starovoitov, Eric Dumazet,
	Thomas Graf, Laura Garcia, David Miller

On Sat, Sep 19, 2020 at 05:54:05PM +0200, Pablo Neira Ayuso wrote:
> Would you redo these numbers using this ruleset to address Daniel's
> comments regarding performance?
> 
> Moreover, Daniel also suggested dev_direct_xmit() path from AF_PACKET
> allows packets to escape from policy, it seems this also needs to be
> extended to add a hook there too.
> 
> Could you work on this and send a v2?

Sure, will do.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-08 12:55             ` Daniel Borkmann
  2020-09-11  7:42               ` Laura García Liébana
@ 2020-10-11  7:59               ` Lukas Wunner
  1 sibling, 0 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-10-11  7:59 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: John Fastabend, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netfilter-devel, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, Laura Garcia,
	David Miller

On Tue, Sep 08, 2020 at 02:55:36PM +0200, Daniel Borkmann wrote:
> I would strongly prefer something where nf integrates into existing
> tc hook, not only due to the hook reuse which would be better,
> but also to allow for a more flexible interaction between tc/BPF
> use cases
[...]
> one option to move forward [...] overall rework of ingress/egress
> side to be a more flexible pipeline (think of cont/ok actions
> as with tc filters or stackable LSMs to process & delegate).

Interaction between netfilter and tc is facilitated by skb->mark.
Both netfilter and tc are able to set and match by way of the mark.
E.g. a netfilter hook may set the mark and tc may later perform an
action if a matching mark is found.

Because the placement of netfilter and tc hooks in the data path
has been unchanged for decades, we must assume that users depend
on their order for setting and matching the mark.

Thus, reworking the data path in the way you suggest (a flexible
pipeline) must not change the order of the hooks.  It would have
to be a fixed pipeline.  But what's the benefit then compared to
separate netfilter and tc hooks which are patched in at runtime
and become NOPs if not used?  (Which is what the present series is
aiming for.)


> to name one example... consider two different entities in the system
> setting up the two, that is, one adding rules for nf ingress/egress
> on the phys device for host fw and the other one for routing traffic
> into/from containers at the tc layer, then traffic going into host ns
> will hit nf ingress and on egress side the nf egress part; however,
> traffic going to containers via existing tc redirect will not see the
> nf ingress as expected but would on reverse path incorrectly
> hit the nf egress one which is /not/ the case for dev_queue_xmit() today.

Using tc to bounce ingress traffic into a container -- is that actually
a thing or is it a hypothetical example?  I think at least Docker uses
plain vanilla routing and bridging to move packets in and out of
containers.

However you're right that if tc *is* used to redirect ingress packets
to a container veth, then the data path would look like:

host tc -> container tc -> container nft

Whereas the egress data path would look like:

container nft -> container tc -> host nft -> host tc

But I argue that the egress data path is actually correct because the
host must be able to firewall packets coming out of the container
in case the container has been compromised.


> And if you check a typical DHCP client that is present on major
> modern distros like systemd-networkd's DHCP client then they
> already implement filtering of malicious packets via BPF at
> socket layer including checking for cookies in the DHCP header
> that are set by the application itself to prevent spoofing [0].
> 
> [0] https://github.com/systemd/systemd/blob/master/src/libsystemd-network/dhcp-network.c#L28

That's an *ingress* filter so that user space only receives DHCP
packets and nothing else.

We're talking about the ability to filter *egress* DHCP packets
(among others) at the kernel level to guard against unwanted
packets coming from user space.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH nf-next v3 3/3] netfilter: Introduce egress hook
  2020-09-14 22:02                     ` Daniel Borkmann
  2020-09-17 10:28                       ` Laura García Liébana
@ 2020-10-11  8:26                       ` Lukas Wunner
  1 sibling, 0 replies; 34+ messages in thread
From: Lukas Wunner @ 2020-10-11  8:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Laura García Liébana, John Fastabend,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Netfilter Development Mailing list, coreteam, netdev,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf, David Miller

On Tue, Sep 15, 2020 at 12:02:03AM +0200, Daniel Borkmann wrote:
> today it is possible and
> perfectly fine to e.g. redirect to a host-facing veth from tc ingress which
> then goes into container. Only traffic that goes up the host stack is seen
> by nf ingress hook in that case. Likewise, reply traffic can be redirected
> from host-facing veth to phys dev for xmit w/o any netfilter interference.
> This means netfilter in host ns really only sees traffic to/from host as
> intended. This is fine today, however, if 3rd party entities (e.g. distro
> side) start pushing down rules on the two nf hooks, then these use cases will
> break on the egress one due to this asymmetric layering violation. Hence my
> ask that this needs to be configurable from a control plane perspective so
> that both use cases can live next to each other w/o breakage. Most trivial
> one I can think of is (aside from the fact to refactor the hooks and improve
> their performance) a flag e.g. for skb that can be set from tc/BPF layer to
> bypass the nf hooks. Basically a flexible opt-in so that existing use-cases
> can be retained w/o breakage.

I argue that being able to filter traffic coming out of the container
is desirable because why should the host trust the software running
in the container to never send malicious packets.

As for the flag you're asking for, it already exists in the form of
skb->mark.  Just let tc set the mark when the packet exits the container
and add a netfilter rule to accept packets carrying that mark.  Or do
not set any netfilter egress rules at all to disable the egress
filtering and avoid the performance impact it would imply.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, back to index

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-27  8:55 [PATCH nf-next v3 0/3] Netfilter egress hook Lukas Wunner
2020-08-27  8:55 ` [PATCH nf-next v3 1/3] netfilter: Rename ingress hook include file Lukas Wunner
2020-08-27  8:55 ` [PATCH nf-next v3 2/3] netfilter: Generalize ingress hook Lukas Wunner
2020-08-27  8:55 ` [PATCH nf-next v3 3/3] netfilter: Introduce egress hook Lukas Wunner
2020-08-28 18:52   ` John Fastabend
2020-09-03  5:00     ` John Fastabend
2020-09-04  8:54       ` Laura García Liébana
2020-09-04 15:46         ` John Fastabend
2020-09-05 11:13           ` Laura García Liébana
2020-09-04 16:21       ` Lukas Wunner
2020-09-04 21:14         ` Daniel Borkmann
2020-09-05  5:24           ` Lukas Wunner
2020-09-08 12:55             ` Daniel Borkmann
2020-09-11  7:42               ` Laura García Liébana
2020-09-11 16:27                 ` Daniel Borkmann
2020-09-14 11:29                   ` Laura García Liébana
2020-09-14 22:02                     ` Daniel Borkmann
2020-09-17 10:28                       ` Laura García Liébana
2020-09-18 20:31                         ` Daniel Borkmann
2020-09-19 15:52                           ` Pablo Neira Ayuso
2020-09-21  7:07                           ` Laura García Liébana
2020-10-11  8:26                       ` Lukas Wunner
2020-10-11  7:59               ` Lukas Wunner
2020-09-05 11:18           ` Laura García Liébana
2020-09-07 22:11             ` Daniel Borkmann
2020-09-08  6:19               ` Laura García Liébana
2020-09-08 11:46           ` Arturo Borrero Gonzalez
2020-09-08 13:27             ` Daniel Borkmann
2020-09-08 18:58         ` John Fastabend
2020-09-19 15:54   ` Pablo Neira Ayuso
2020-09-28 12:20     ` Lukas Wunner
2020-08-27 10:36 ` [PATCH nf-next v3 0/3] Netfilter " Laura García Liébana
2020-08-28  7:14 ` Daniel Borkmann
2020-08-28  9:14   ` Eric Dumazet

Netfilter-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netfilter-devel/0 netfilter-devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netfilter-devel netfilter-devel/ https://lore.kernel.org/netfilter-devel \
		netfilter-devel@vger.kernel.org
	public-inbox-index netfilter-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netfilter-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git