netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH nf-next v4 0/5] Netfilter egress hook
@ 2021-01-22  8:47 Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling Lukas Wunner
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Lukas Wunner @ 2021-01-22  8:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

Netfilter egress hook, 4th iteration:

Previously traffic control suffered a performance degradation with this
series applied.  Not anymore, see patch [1/5].

Pablo added netfilter egress handling to af_packet, patch [5/5].

Pablo also moved the netfilter egress hook behind traffic control to
address an objection from Daniel Borkmann, see patch [4/5].  The commit
message was amended with Laura's and Pablo's use cases to make it clear
that the series is no longer motivated by an out-of-tree module.
A bunch of small performance improvements and bugfixes were applied.

Please review and test.  Thanks!

Link to previous version:
https://lore.kernel.org/netfilter-devel/cover.1598517739.git.lukas@wunner.de/


Lukas Wunner (4):
  net: sched: Micro-optimize egress handling
  netfilter: Rename ingress hook include file
  netfilter: Generalize ingress hook include file
  netfilter: Introduce egress hook

Pablo Neira Ayuso (1):
  af_packet: Introduce egress hook

 include/linux/netdevice.h         |   4 ++
 include/linux/netfilter_ingress.h |  58 ----------------
 include/linux/netfilter_netdev.h  | 112 ++++++++++++++++++++++++++++++
 include/uapi/linux/netfilter.h    |   1 +
 net/core/dev.c                    |  16 +++--
 net/netfilter/Kconfig             |   8 +++
 net/netfilter/core.c              |  34 ++++++++-
 net/netfilter/nft_chain_filter.c  |   4 +-
 net/packet/af_packet.c            |  35 ++++++++++
 9 files changed, 206 insertions(+), 66 deletions(-)
 delete mode 100644 include/linux/netfilter_ingress.h
 create mode 100644 include/linux/netfilter_netdev.h

-- 
2.29.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-22  8:47 [PATCH nf-next v4 0/5] Netfilter egress hook Lukas Wunner
@ 2021-01-22  8:47 ` Lukas Wunner
  2021-01-22  9:40   ` Eric Dumazet
  2021-01-24  3:26   ` Jakub Kicinski
  2021-01-22  8:47 ` [PATCH nf-next v4 2/5] netfilter: Rename ingress hook include file Lukas Wunner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 21+ messages in thread
From: Lukas Wunner @ 2021-01-22  8:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

sch_handle_egress() returns either the skb or NULL to signal to its
caller __dev_queue_xmit() whether a packet should continue to be
processed.

The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
NULL pointer deref right at its top.

But the compiler doesn't know that.  So if sch_handle_egress() signals
success by returning the skb, the "if (!skb) goto out;" statement
results in a gratuitous NULL pointer check in the Assembler output.

Avoid by telling the compiler that __dev_queue_xmit() is never passed a
NULL skb.  This also eliminates another gratuitous NULL pointer check in
__dev_queue_xmit()
  qdisc_pkt_len_init()
    skb_header_pointer()
      __skb_header_pointer()

The speedup is barely measurable:
Before: 1877 1875 1878 1874 1882 1873 Mb/sec
After:  1877 1877 1880 1883 1888 1886 Mb/sec

However we're about to add a netfilter egress hook to __dev_queue_xmit()
and without the micro-optimization, it will result in a performance
degradation which is indeed measurable:
With netfilter hook:               1853 1852 1850 1848 1849 1851 Mb/sec
With netfilter hook + micro-optim: 1874 1877 1881 1875 1876 1876 Mb/sec

The performance degradation is caused by a JNE instruction ("if (skb)")
being flipped to a JE instruction ("if (!skb)") once the netfilter hook
is added.  The micro-optimization removes the test and jump instructions
altogether.

Measurements were performed on a Core i7-3615QM.  Reproducer:
ip link add dev foo type dummy
ip link set dev foo up
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 net/core/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 7afbb642e203..4c16b9932823 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4072,6 +4072,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
  *      the BH enable code must have IRQs enabled so that it will not deadlock.
  *          --BLG
  */
+__attribute__((nonnull(1)))
 static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 {
 	struct net_device *dev = skb->dev;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH nf-next v4 2/5] netfilter: Rename ingress hook include file
  2021-01-22  8:47 [PATCH nf-next v4 0/5] Netfilter egress hook Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling Lukas Wunner
@ 2021-01-22  8:47 ` Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 3/5] netfilter: Generalize " Lukas Wunner
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Lukas Wunner @ 2021-01-22  8:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 include/linux/{netfilter_ingress.h => netfilter_netdev.h} | 0
 net/core/dev.c                                            | 2 +-
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename include/linux/{netfilter_ingress.h => netfilter_netdev.h} (100%)

diff --git a/include/linux/netfilter_ingress.h b/include/linux/netfilter_netdev.h
similarity index 100%
rename from include/linux/netfilter_ingress.h
rename to include/linux/netfilter_netdev.h
diff --git a/net/core/dev.c b/net/core/dev.c
index 4c16b9932823..98c5abf22e63 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -137,7 +137,7 @@
 #include <linux/if_macvlan.h>
 #include <linux/errqueue.h>
 #include <linux/hrtimer.h>
-#include <linux/netfilter_ingress.h>
+#include <linux/netfilter_netdev.h>
 #include <linux/crash_dump.h>
 #include <linux/sctp.h>
 #include <net/udp_tunnel.h>
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH nf-next v4 3/5] netfilter: Generalize ingress hook include file
  2021-01-22  8:47 [PATCH nf-next v4 0/5] Netfilter egress hook Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 2/5] netfilter: Rename ingress hook include file Lukas Wunner
@ 2021-01-22  8:47 ` Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 4/5] netfilter: Introduce egress hook Lukas Wunner
  2021-01-22  8:47 ` [PATCH nf-next v4 5/5] af_packet: " Lukas Wunner
  4 siblings, 0 replies; 21+ messages in thread
From: Lukas Wunner @ 2021-01-22  8:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

Prepare for addition of a netfilter egress hook by generalizing the
ingress hook include file.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 include/linux/netfilter_netdev.h | 20 +++++++++++---------
 net/core/dev.c                   |  2 +-
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index a13774be2eb5..5812b0fb0278 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _NETFILTER_INGRESS_H_
-#define _NETFILTER_INGRESS_H_
+#ifndef _NETFILTER_NETDEV_H_
+#define _NETFILTER_NETDEV_H_
 
 #include <linux/netfilter.h>
 #include <linux/netdevice.h>
@@ -38,10 +38,6 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 	return ret;
 }
 
-static inline void nf_hook_ingress_init(struct net_device *dev)
-{
-	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
-}
 #else /* CONFIG_NETFILTER_INGRESS */
 static inline int nf_hook_ingress_active(struct sk_buff *skb)
 {
@@ -52,7 +48,13 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 {
 	return 0;
 }
-
-static inline void nf_hook_ingress_init(struct net_device *dev) {}
 #endif /* CONFIG_NETFILTER_INGRESS */
-#endif /* _NETFILTER_INGRESS_H_ */
+
+static inline void nf_hook_netdev_init(struct net_device *dev)
+{
+#ifdef CONFIG_NETFILTER_INGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
+#endif
+}
+
+#endif /* _NETFILTER_NETDEV_H_ */
diff --git a/net/core/dev.c b/net/core/dev.c
index 98c5abf22e63..931149bd654a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10602,7 +10602,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	if (!dev->ethtool_ops)
 		dev->ethtool_ops = &default_ethtool_ops;
 
-	nf_hook_ingress_init(dev);
+	nf_hook_netdev_init(dev);
 
 	return dev;
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH nf-next v4 4/5] netfilter: Introduce egress hook
  2021-01-22  8:47 [PATCH nf-next v4 0/5] Netfilter egress hook Lukas Wunner
                   ` (2 preceding siblings ...)
  2021-01-22  8:47 ` [PATCH nf-next v4 3/5] netfilter: Generalize " Lukas Wunner
@ 2021-01-22  8:47 ` Lukas Wunner
  2021-01-26 19:13   ` Daniel Borkmann
  2021-01-22  8:47 ` [PATCH nf-next v4 5/5] af_packet: " Lukas Wunner
  4 siblings, 1 reply; 21+ messages in thread
From: Lukas Wunner @ 2021-01-22  8:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets with netfilter on ingress.

Support the same on egress to satisfy user requirements such as:
* outbound security policies for containers (Laura)
* filtering and mangling intra-node Direct Server Return (DSR) traffic
  on a load balancer (Laura)
* filtering locally generated traffic coming in through AF_PACKET,
  such as local ARP traffic generated for clustering purposes or DHCP
  (Laura; the AF_PACKET plumbing is contained in a separate commit)
* L2 filtering from ingress and egress for AVB (Audio Video Bridging)
  and gPTP with nftables (Pablo)
* in the future: in-kernel NAT64/NAT46 (Pablo)

A patch for nftables to hook up egress rules from user space has been
submitted separately, so users may immediately take advantage of the
feature.

The hook is positioned after packet handling by traffic control.
Thus, if packets are redirected into and out of containers with tc,
the data path is:
ingress: host tc -> container tc -> container nft
egress:  container tc -> host tc -> host nft

This was done to address an objection from Daniel Borkmann:  If desired,
nft does not get into tc's way performance-wise.  The host is able to
firewall malicious packets coming out of a container, but only after tc
has done its duty.  An implication is that tc may set skb->mark on
egress for nft to act on it, but not the other way round.

If egress netfilter handling is not enabled on any interface, it is
patched out of the data path by way of a static_key and doesn't make a
performance difference that is discernible from noise:

Before:             2076 2076 2076 2077 2077 2074 Mb/sec
After:              2080 2078 2078 2079 2079 2077 Mb/sec
Before + tc accept: 1877 1875 1878 1874 1882 1873 Mb/sec
After  + tc accept: 1874 1877 1881 1875 1876 1876 Mb/sec
Before + tc drop:   2167 2182 2168 2171 2164 2171 Mb/sec
After  + tc drop:   2176 2180 2184 2178 2178 2180 Mb/sec

Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
ip link add dev foo type dummy
ip link set dev foo up
modprobe pktgen
echo "add_device foo" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1

Accept all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'

Drop all traffic with tc:
tc qdisc add dev foo clsact
tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'

Apply this patch when measuring packet drops to avoid errors in dmesg:
https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Laura García Liébana <nevola@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Thomas Graf <tgraf@suug.ch>
---
 include/linux/netdevice.h        |  4 +++
 include/linux/netfilter_netdev.h | 52 ++++++++++++++++++++++++++++++++
 include/uapi/linux/netfilter.h   |  1 +
 net/core/dev.c                   | 11 +++++--
 net/netfilter/Kconfig            |  8 +++++
 net/netfilter/core.c             | 34 +++++++++++++++++++--
 net/netfilter/nft_chain_filter.c |  4 ++-
 7 files changed, 108 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1ec3ac5d5bbf..af0774cc20d2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1764,6 +1764,7 @@ enum netdev_priv_flags {
  *	@xps_maps:	XXX: need comments on this one
  *	@miniq_egress:		clsact qdisc specific data for
  *				egress processing
+ *	@nf_hooks_egress:	netfilter hooks executed for egress packets
  *	@qdisc_hash:		qdisc hash table
  *	@watchdog_timeo:	Represents the timeout that is used by
  *				the watchdog (see dev_watchdog())
@@ -2057,6 +2058,9 @@ struct net_device {
 #ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc __rcu	*miniq_egress;
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	struct nf_hook_entries __rcu *nf_hooks_egress;
+#endif
 
 #ifdef CONFIG_NET_SCHED
 	DECLARE_HASHTABLE	(qdisc_hash, 4);
diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index 5812b0fb0278..5ed6e90d46f6 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -50,11 +50,63 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 }
 #endif /* CONFIG_NETFILTER_INGRESS */
 
+#ifdef CONFIG_NETFILTER_EGRESS
+static inline bool nf_hook_egress_active(void)
+{
+#ifdef CONFIG_JUMP_LABEL
+	if (!static_key_false(&nf_hooks_needed[NFPROTO_NETDEV][NF_NETDEV_EGRESS]))
+		return false;
+#endif
+	return true;
+}
+
+/* caller must hold rcu_read_lock */
+static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
+					     struct net_device *dev)
+{
+	struct nf_hook_entries *e = rcu_dereference(dev->nf_hooks_egress);
+	struct nf_hook_state state;
+	int ret;
+
+	if (!e)
+		return skb;
+
+	nf_hook_state_init(&state, NF_NETDEV_EGRESS,
+			   NFPROTO_NETDEV, dev, NULL, NULL,
+			   dev_net(dev), NULL);
+	ret = nf_hook_slow(skb, &state, e, 0);
+
+	if (ret == 1) {
+		return skb;
+	} else if (ret < 0) {
+		*rc = NET_XMIT_DROP;
+		return NULL;
+	} else { /* ret == 0 */
+		*rc = NET_XMIT_SUCCESS;
+		return NULL;
+	}
+}
+#else /* CONFIG_NETFILTER_EGRESS */
+static inline bool nf_hook_egress_active(void)
+{
+	return false;
+}
+
+static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
+					     struct net_device *dev)
+{
+	return skb;
+}
+#endif /* CONFIG_NETFILTER_EGRESS */
+
 static inline void nf_hook_netdev_init(struct net_device *dev)
 {
 #ifdef CONFIG_NETFILTER_INGRESS
 	RCU_INIT_POINTER(dev->nf_hooks_ingress, NULL);
 #endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	RCU_INIT_POINTER(dev->nf_hooks_egress, NULL);
+#endif
 }
 
 #endif /* _NETFILTER_NETDEV_H_ */
diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index ef9a44286e23..53411ccc69db 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -51,6 +51,7 @@ enum nf_inet_hooks {
 
 enum nf_dev_hooks {
 	NF_NETDEV_INGRESS,
+	NF_NETDEV_EGRESS,
 	NF_NETDEV_NUMHOOKS
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 931149bd654a..ecf881515b62 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3870,6 +3870,7 @@ EXPORT_SYMBOL(dev_loopback_xmit);
 static struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
+#ifdef CONFIG_NET_CLS_ACT
 	struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
 	struct tcf_result cl_res;
 
@@ -3904,6 +3905,7 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 	default:
 		break;
 	}
+#endif /* CONFIG_NET_CLS_ACT */
 
 	return skb;
 }
@@ -4096,13 +4098,18 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 	qdisc_pkt_len_init(skb);
 #ifdef CONFIG_NET_CLS_ACT
 	skb->tc_at_ingress = 0;
-# ifdef CONFIG_NET_EGRESS
+#endif
+#ifdef CONFIG_NET_EGRESS
 	if (static_branch_unlikely(&egress_needed_key)) {
 		skb = sch_handle_egress(skb, &rc, dev);
 		if (!skb)
 			goto out;
+		if (nf_hook_egress_active()) {
+			skb = nf_hook_egress(skb, &rc, dev);
+			if (!skb)
+				goto out;
+		}
 	}
-# endif
 #endif
 	/* If device/qdisc don't need skb->dst, release it right now while
 	 * its hot in this cpu cache.
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 49fbef0d99be..ade86afa3b1b 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -10,6 +10,14 @@ config NETFILTER_INGRESS
 	  This allows you to classify packets from ingress using the Netfilter
 	  infrastructure.
 
+config NETFILTER_EGRESS
+	bool "Netfilter egress support"
+	default y
+	select NET_EGRESS
+	help
+	  This allows you to classify packets before transmission using the
+	  Netfilter infrastructure.
+
 config NETFILTER_NETLINK
 	tristate
 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 63d032191e62..3a32a813fcde 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -316,6 +316,12 @@ nf_hook_entry_head(struct net *net, int pf, unsigned int hooknum,
 		if (dev && dev_net(dev) == net)
 			return &dev->nf_hooks_ingress;
 	}
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (hooknum == NF_NETDEV_EGRESS) {
+		if (dev && dev_net(dev) == net)
+			return &dev->nf_hooks_egress;
+	}
 #endif
 	WARN_ON_ONCE(1);
 	return NULL;
@@ -344,6 +350,11 @@ static inline bool nf_ingress_hook(const struct nf_hook_ops *reg, int pf)
 	return false;
 }
 
+static inline bool nf_egress_hook(const struct nf_hook_ops *reg, int pf)
+{
+	return pf == NFPROTO_NETDEV && reg->hooknum == NF_NETDEV_EGRESS;
+}
+
 static void nf_static_key_inc(const struct nf_hook_ops *reg, int pf)
 {
 #ifdef CONFIG_JUMP_LABEL
@@ -383,9 +394,18 @@ static int __nf_register_net_hook(struct net *net, int pf,
 
 	switch (pf) {
 	case NFPROTO_NETDEV:
-		err = nf_ingress_check(net, reg, NF_NETDEV_INGRESS);
-		if (err < 0)
-			return err;
+#ifndef CONFIG_NETFILTER_INGRESS
+		if (reg->hooknum == NF_NETDEV_INGRESS)
+			return -EOPNOTSUPP;
+#endif
+#ifndef CONFIG_NETFILTER_EGRESS
+		if (reg->hooknum == NF_NETDEV_EGRESS)
+			return -EOPNOTSUPP;
+#endif
+		if ((reg->hooknum != NF_NETDEV_INGRESS &&
+		     reg->hooknum != NF_NETDEV_EGRESS) ||
+		    !reg->dev || dev_net(reg->dev) != net)
+			return -EINVAL;
 		break;
 	case NFPROTO_INET:
 		if (reg->hooknum != NF_INET_INGRESS)
@@ -417,6 +437,10 @@ static int __nf_register_net_hook(struct net *net, int pf,
 #ifdef CONFIG_NETFILTER_INGRESS
 	if (nf_ingress_hook(reg, pf))
 		net_inc_ingress_queue();
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (nf_egress_hook(reg, pf))
+		net_inc_egress_queue();
 #endif
 	nf_static_key_inc(reg, pf);
 
@@ -474,6 +498,10 @@ static void __nf_unregister_net_hook(struct net *net, int pf,
 #ifdef CONFIG_NETFILTER_INGRESS
 		if (nf_ingress_hook(reg, pf))
 			net_dec_ingress_queue();
+#endif
+#ifdef CONFIG_NETFILTER_EGRESS
+		if (nf_egress_hook(reg, pf))
+			net_dec_egress_queue();
 #endif
 		nf_static_key_dec(reg, pf);
 	} else {
diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index ff8528ad3dc6..c9dc5f36569b 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -310,9 +310,11 @@ static const struct nft_chain_type nft_chain_filter_netdev = {
 	.name		= "filter",
 	.type		= NFT_CHAIN_T_DEFAULT,
 	.family		= NFPROTO_NETDEV,
-	.hook_mask	= (1 << NF_NETDEV_INGRESS),
+	.hook_mask	= (1 << NF_NETDEV_INGRESS) |
+			  (1 << NF_NETDEV_EGRESS),
 	.hooks		= {
 		[NF_NETDEV_INGRESS]	= nft_do_chain_netdev,
+		[NF_NETDEV_EGRESS]	= nft_do_chain_netdev,
 	},
 };
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH nf-next v4 5/5] af_packet: Introduce egress hook
  2021-01-22  8:47 [PATCH nf-next v4 0/5] Netfilter egress hook Lukas Wunner
                   ` (3 preceding siblings ...)
  2021-01-22  8:47 ` [PATCH nf-next v4 4/5] netfilter: Introduce egress hook Lukas Wunner
@ 2021-01-22  8:47 ` Lukas Wunner
  2021-01-22 16:13   ` Willem de Bruijn
  4 siblings, 1 reply; 21+ messages in thread
From: Lukas Wunner @ 2021-01-22  8:47 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

From: Pablo Neira Ayuso <pablo@netfilter.org>

Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
socket option set to on, which allows packets to escape without being
filtered in the egress path.

This patch only updates the AF_PACKET path, it does not update
dev_direct_xmit() so the XDP infrastructure has a chance to bypass
Netfilter.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
[lukas: acquire rcu_read_lock, fix typos, rebase]
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 net/packet/af_packet.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 6bbc7a448593..6dca6ead1162 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -89,6 +89,7 @@
 #endif
 #include <linux/bpf.h>
 #include <net/compat.h>
+#include <linux/netfilter_netdev.h>
 
 #include "internal.h"
 
@@ -239,8 +240,42 @@ struct packet_skb_cb {
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
+#ifdef CONFIG_NETFILTER_EGRESS
+static noinline struct sk_buff *nf_hook_direct_egress(struct sk_buff *skb)
+{
+	struct sk_buff *next, *head = NULL, *tail;
+	int rc;
+
+	rcu_read_lock();
+	for (; skb != NULL; skb = next) {
+		next = skb->next;
+		skb_mark_not_on_list(skb);
+
+		if (!nf_hook_egress(skb, &rc, skb->dev))
+			continue;
+
+		if (!head)
+			head = skb;
+		else
+			tail->next = skb;
+
+		tail = skb;
+	}
+	rcu_read_unlock();
+
+	return head;
+}
+#endif
+
 static int packet_direct_xmit(struct sk_buff *skb)
 {
+#ifdef CONFIG_NETFILTER_EGRESS
+	if (nf_hook_egress_active()) {
+		skb = nf_hook_direct_egress(skb);
+		if (!skb)
+			return NET_XMIT_DROP;
+	}
+#endif
 	return dev_direct_xmit(skb, packet_pick_tx_queue(skb));
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-22  8:47 ` [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling Lukas Wunner
@ 2021-01-22  9:40   ` Eric Dumazet
  2021-01-24 10:33     ` Lukas Wunner
  2021-01-24  3:26   ` Jakub Kicinski
  1 sibling, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2021-01-22  9:40 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Thomas Graf, Laura Garcia Liebana,
	John Fastabend

On Fri, Jan 22, 2021 at 9:55 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> sch_handle_egress() returns either the skb or NULL to signal to its
> caller __dev_queue_xmit() whether a packet should continue to be
> processed.
>
> The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> NULL pointer deref right at its top.
>
> But the compiler doesn't know that.  So if sch_handle_egress() signals
> success by returning the skb, the "if (!skb) goto out;" statement
> results in a gratuitous NULL pointer check in the Assembler output.
>
> Avoid by telling the compiler that __dev_queue_xmit() is never passed a
> NULL skb.  This also eliminates another gratuitous NULL pointer check in
> __dev_queue_xmit()
>   qdisc_pkt_len_init()
>     skb_header_pointer()
>       __skb_header_pointer()
>
> The speedup is barely measurable:
> Before: 1877 1875 1878 1874 1882 1873 Mb/sec
> After:  1877 1877 1880 1883 1888 1886 Mb/sec
>
> However we're about to add a netfilter egress hook to __dev_queue_xmit()
> and without the micro-optimization, it will result in a performance
> degradation which is indeed measurable:
> With netfilter hook:               1853 1852 1850 1848 1849 1851 Mb/sec
> With netfilter hook + micro-optim: 1874 1877 1881 1875 1876 1876 Mb/sec
>
> The performance degradation is caused by a JNE instruction ("if (skb)")
> being flipped to a JE instruction ("if (!skb)") once the netfilter hook
> is added.  The micro-optimization removes the test and jump instructions
> altogether.
>
> Measurements were performed on a Core i7-3615QM.  Reproducer:
> ip link add dev foo type dummy
> ip link set dev foo up
> tc qdisc add dev foo clsact
> tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
> modprobe pktgen
> echo "add_device foo" > /proc/net/pktgen/kpktgend_3
> samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> ---
>  net/core/dev.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7afbb642e203..4c16b9932823 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4072,6 +4072,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
>   *      the BH enable code must have IRQs enabled so that it will not deadlock.
>   *          --BLG
>   */
> +__attribute__((nonnull(1)))
>  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  {
>         struct net_device *dev = skb->dev;
> --
> 2.29.2
>

Interesting !

It is a bit sad the compilers do not automatically get this knowledge
from the very first instruction :

 struct net_device *dev = skb->dev;

I see this also makes qdisc_pkt_len_init() slightly faster because
this removes the if (!skb) test from __skb_header_pointer()

I guess we also could add this patch to benefit all
skb_header_pointer() callers :

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5f60c9e907c9d8eae1e85ae0329838383e3325df..db8774c50cc6ab99deaecdb1dfc31dc5306b9a37
100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3627,6 +3627,7 @@ __skb_header_pointer(const struct sk_buff *skb,
int offset,
        return buffer;
 }

+__attribute__((nonnull(1)))
 static inline void * __must_check
 skb_header_pointer(const struct sk_buff *skb, int offset, int len,
void *buffer)
 {

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 5/5] af_packet: Introduce egress hook
  2021-01-22  8:47 ` [PATCH nf-next v4 5/5] af_packet: " Lukas Wunner
@ 2021-01-22 16:13   ` Willem de Bruijn
  2021-01-24 11:14     ` Lukas Wunner
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2021-01-22 16:13 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, Network Development, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Fri, Jan 22, 2021 at 4:44 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> From: Pablo Neira Ayuso <pablo@netfilter.org>
>
> Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
> socket option set to on, which allows packets to escape without being
> filtered in the egress path.
>
> This patch only updates the AF_PACKET path, it does not update
> dev_direct_xmit() so the XDP infrastructure has a chance to bypass
> Netfilter.
>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> [lukas: acquire rcu_read_lock, fix typos, rebase]
> Signed-off-by: Lukas Wunner <lukas@wunner.de>

Isn't the point of PACKET_QDISC_BYPASS to skip steps like this?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-22  8:47 ` [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling Lukas Wunner
  2021-01-22  9:40   ` Eric Dumazet
@ 2021-01-24  3:26   ` Jakub Kicinski
  2021-01-24 10:46     ` Lukas Wunner
  1 sibling, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2021-01-24  3:26 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend, LKML

On Fri, 22 Jan 2021 09:47:01 +0100 Lukas Wunner wrote:
> sch_handle_egress() returns either the skb or NULL to signal to its
> caller __dev_queue_xmit() whether a packet should continue to be
> processed.
> 
> The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> NULL pointer deref right at its top.
> 
> But the compiler doesn't know that.  So if sch_handle_egress() signals
> success by returning the skb, the "if (!skb) goto out;" statement
> results in a gratuitous NULL pointer check in the Assembler output.

Which exact compiler are we talking about it? Did you report this?
As Eric pointed the compiler should be able to figure this out quite
easily.

> Avoid by telling the compiler that __dev_queue_xmit() is never passed a
> NULL skb.  This also eliminates another gratuitous NULL pointer check in
> __dev_queue_xmit()
>   qdisc_pkt_len_init()
>     skb_header_pointer()
>       __skb_header_pointer()
> 
> The speedup is barely measurable:
> Before: 1877 1875 1878 1874 1882 1873 Mb/sec
> After:  1877 1877 1880 1883 1888 1886 Mb/sec
> 
> However we're about to add a netfilter egress hook to __dev_queue_xmit()
> and without the micro-optimization, it will result in a performance
> degradation which is indeed measurable:
> With netfilter hook:               1853 1852 1850 1848 1849 1851 Mb/sec
> With netfilter hook + micro-optim: 1874 1877 1881 1875 1876 1876 Mb/sec
> 
> The performance degradation is caused by a JNE instruction ("if (skb)")
> being flipped to a JE instruction ("if (!skb)") once the netfilter hook
> is added.  The micro-optimization removes the test and jump instructions
> altogether.
> 
> Measurements were performed on a Core i7-3615QM.  Reproducer:
> ip link add dev foo type dummy
> ip link set dev foo up
> tc qdisc add dev foo clsact
> tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
> modprobe pktgen
> echo "add_device foo" > /proc/net/pktgen/kpktgend_3
> samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
> 
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Thomas Graf <tgraf@suug.ch>
> ---
>  net/core/dev.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7afbb642e203..4c16b9932823 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4072,6 +4072,7 @@ struct netdev_queue *netdev_core_pick_tx(struct net_device *dev,
>   *      the BH enable code must have IRQs enabled so that it will not deadlock.
>   *          --BLG
>   */
> +__attribute__((nonnull(1)))
>  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>  {
>  	struct net_device *dev = skb->dev;


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-22  9:40   ` Eric Dumazet
@ 2021-01-24 10:33     ` Lukas Wunner
  2021-01-25 19:39       ` Jakub Kicinski
  0 siblings, 1 reply; 21+ messages in thread
From: Lukas Wunner @ 2021-01-24 10:33 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Thomas Graf, Laura Garcia Liebana,
	John Fastabend

On Fri, Jan 22, 2021 at 10:40:05AM +0100, Eric Dumazet wrote:
> On Fri, Jan 22, 2021 at 9:55 AM Lukas Wunner <lukas@wunner.de> wrote:
> > sch_handle_egress() returns either the skb or NULL to signal to its
> > caller __dev_queue_xmit() whether a packet should continue to be
> > processed.
> >
> > The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> > NULL pointer deref right at its top.
> >
> > But the compiler doesn't know that.  So if sch_handle_egress() signals
> > success by returning the skb, the "if (!skb) goto out;" statement
> > results in a gratuitous NULL pointer check in the Assembler output.
> >
> > Avoid by telling the compiler that __dev_queue_xmit() is never passed a
> > NULL skb.
[...]
> > we're about to add a netfilter egress hook to __dev_queue_xmit()
> > and without the micro-optimization, it will result in a performance
> > degradation which is indeed measurable:
[...]
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > +__attribute__((nonnull(1)))
> >  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >  {
> >         struct net_device *dev = skb->dev;
> 
> It is a bit sad the compilers do not automatically get this knowledge
> from the very first instruction :
> 
>  struct net_device *dev = skb->dev;

The compiler (gcc) is capable of doing that, but the feature was disabled by:

    commit a3ca86aea507904148870946d599e07a340b39bf
    Author: Eugene Teo <eteo@redhat.com>
    Date:   Wed Jul 15 14:59:10 2009 +0800
    
    Add '-fno-delete-null-pointer-checks' to gcc CFLAGS

If -fno-delete-null-pointer-checks is dropped from the top-level Makefile
then the gratuitous NULL pointer checks disappear from the Assembler output,
obviating the need to litter hot paths with __attribute__((nonnull(1)))
annotations.

Taking a closer look at that commit, its rationale appears questionable:
It says that broken code such as ...

	struct agnx_priv *priv = dev->priv;

	if (!dev)
		return;

... would result in the NULL pointer check being optimized away.
The commit message claims that keeping the NULL pointer check in
"makes it harder to abuse" the broken code.

I don't see how that's the case:  If dev is NULL, the NULL pointer
dereference at the function's top causes termination of the task
in kernel/exit.c:do_exit().  So the NULL pointer check is never
reached by the task.  If on the other hand dev is non-NULL,
the task isn't terminated but then the NULL pointer check is
unnecessary as well.

So the point of the commit remains elusive to me.  I could submit
an RFC patch which drops -fno-delete-null-pointer-checks and see
if any security folks cry foul.  Thoughts?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-24  3:26   ` Jakub Kicinski
@ 2021-01-24 10:46     ` Lukas Wunner
  0 siblings, 0 replies; 21+ messages in thread
From: Lukas Wunner @ 2021-01-24 10:46 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend, LKML

On Sat, Jan 23, 2021 at 07:26:24PM -0800, Jakub Kicinski wrote:
> On Fri, 22 Jan 2021 09:47:01 +0100 Lukas Wunner wrote:
> > sch_handle_egress() returns either the skb or NULL to signal to its
> > caller __dev_queue_xmit() whether a packet should continue to be
> > processed.
> > 
> > The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> > NULL pointer deref right at its top.
> > 
> > But the compiler doesn't know that.  So if sch_handle_egress() signals
> > success by returning the skb, the "if (!skb) goto out;" statement
> > results in a gratuitous NULL pointer check in the Assembler output.
> 
> Which exact compiler are we talking about it? Did you report this?
> As Eric pointed the compiler should be able to figure this out quite
> easily.

I tested with gcc 8, 9, 10.

No need to report as it's the expected behavior with
-fno-delete-null-pointer-checks, whose motivation appears
questionable though (per my preceding e-mail).

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 5/5] af_packet: Introduce egress hook
  2021-01-22 16:13   ` Willem de Bruijn
@ 2021-01-24 11:14     ` Lukas Wunner
  2021-01-24 16:18       ` Willem de Bruijn
  0 siblings, 1 reply; 21+ messages in thread
From: Lukas Wunner @ 2021-01-24 11:14 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, Network Development, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Fri, Jan 22, 2021 at 11:13:19AM -0500, Willem de Bruijn wrote:
> On Fri, Jan 22, 2021 at 4:44 AM Lukas Wunner <lukas@wunner.de> wrote:
> > Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
> > socket option set to on, which allows packets to escape without being
> > filtered in the egress path.
> >
> > This patch only updates the AF_PACKET path, it does not update
> > dev_direct_xmit() so the XDP infrastructure has a chance to bypass
> > Netfilter.
> 
> Isn't the point of PACKET_QDISC_BYPASS to skip steps like this?

I suppose PACKET_QDISC_BYPASS "was introduced to bypass qdisc,
not to bypass everything."

(The quote is taken from this message by Eric Dumazet:
https://lore.kernel.org/netfilter-devel/a9006cf7-f4ba-81b1-fca1-fd2e97939fdc@gmail.com/
)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 5/5] af_packet: Introduce egress hook
  2021-01-24 11:14     ` Lukas Wunner
@ 2021-01-24 16:18       ` Willem de Bruijn
  2021-01-30 16:26         ` Lukas Wunner
  0 siblings, 1 reply; 21+ messages in thread
From: Willem de Bruijn @ 2021-01-24 16:18 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, Network Development, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Sun, Jan 24, 2021 at 6:14 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Fri, Jan 22, 2021 at 11:13:19AM -0500, Willem de Bruijn wrote:
> > On Fri, Jan 22, 2021 at 4:44 AM Lukas Wunner <lukas@wunner.de> wrote:
> > > Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
> > > socket option set to on, which allows packets to escape without being
> > > filtered in the egress path.
> > >
> > > This patch only updates the AF_PACKET path, it does not update
> > > dev_direct_xmit() so the XDP infrastructure has a chance to bypass
> > > Netfilter.
> >
> > Isn't the point of PACKET_QDISC_BYPASS to skip steps like this?
>
> I suppose PACKET_QDISC_BYPASS "was introduced to bypass qdisc,
> not to bypass everything."
>
> (The quote is taken from this message by Eric Dumazet:
> https://lore.kernel.org/netfilter-devel/a9006cf7-f4ba-81b1-fca1-fd2e97939fdc@gmail.com/
> )

I see. I don't understand the value of a short-cut fast path if we
start chipping away at its characteristic feature.

It also bypasses sch_handle_egress and packet sockets. This new
feature seems broadly similar to the first?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-24 10:33     ` Lukas Wunner
@ 2021-01-25 19:39       ` Jakub Kicinski
  2021-01-26  8:58         ` Dan Carpenter
  0 siblings, 1 reply; 21+ messages in thread
From: Jakub Kicinski @ 2021-01-25 19:39 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Eric Dumazet, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netfilter-devel, coreteam, netdev,
	Daniel Borkmann, Alexei Starovoitov, Thomas Graf,
	Laura Garcia Liebana, John Fastabend, Dan Carpenter

On Sun, 24 Jan 2021 11:33:01 +0100 Lukas Wunner wrote:
> On Fri, Jan 22, 2021 at 10:40:05AM +0100, Eric Dumazet wrote:
> > On Fri, Jan 22, 2021 at 9:55 AM Lukas Wunner <lukas@wunner.de> wrote:  
> > > sch_handle_egress() returns either the skb or NULL to signal to its
> > > caller __dev_queue_xmit() whether a packet should continue to be
> > > processed.
> > >
> > > The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> > > NULL pointer deref right at its top.
> > >
> > > But the compiler doesn't know that.  So if sch_handle_egress() signals
> > > success by returning the skb, the "if (!skb) goto out;" statement
> > > results in a gratuitous NULL pointer check in the Assembler output.
> > >
> > > Avoid by telling the compiler that __dev_queue_xmit() is never passed a
> > > NULL skb.  
> [...]
> > > we're about to add a netfilter egress hook to __dev_queue_xmit()
> > > and without the micro-optimization, it will result in a performance
> > > degradation which is indeed measurable:  
> [...]
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > +__attribute__((nonnull(1)))
> > >  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > >  {
> > >         struct net_device *dev = skb->dev;  
> > 
> > It is a bit sad the compilers do not automatically get this knowledge
> > from the very first instruction :
> > 
> >  struct net_device *dev = skb->dev;  
> 
> The compiler (gcc) is capable of doing that, but the feature was disabled by:
> 
>     commit a3ca86aea507904148870946d599e07a340b39bf
>     Author: Eugene Teo <eteo@redhat.com>
>     Date:   Wed Jul 15 14:59:10 2009 +0800
>     
>     Add '-fno-delete-null-pointer-checks' to gcc CFLAGS
> 
> If -fno-delete-null-pointer-checks is dropped from the top-level Makefile
> then the gratuitous NULL pointer checks disappear from the Assembler output,
> obviating the need to litter hot paths with __attribute__((nonnull(1)))
> annotations.
> 
> Taking a closer look at that commit, its rationale appears questionable:
> It says that broken code such as ...
> 
> 	struct agnx_priv *priv = dev->priv;
> 
> 	if (!dev)
> 		return;
> 
> ... would result in the NULL pointer check being optimized away.
> The commit message claims that keeping the NULL pointer check in
> "makes it harder to abuse" the broken code.
> 
> I don't see how that's the case:  If dev is NULL, the NULL pointer
> dereference at the function's top causes termination of the task
> in kernel/exit.c:do_exit().  So the NULL pointer check is never
> reached by the task.  If on the other hand dev is non-NULL,
> the task isn't terminated but then the NULL pointer check is
> unnecessary as well.
> 
> So the point of the commit remains elusive to me.  I could submit
> an RFC patch which drops -fno-delete-null-pointer-checks and see
> if any security folks cry foul.  Thoughts?

I wonder if modern compilers can't simply warn about this particular
case. Not to mention our static checkers..


Dan, do you think the concern from the above-quoted commit is still
valid? Is this something that smatch flags these days? We're apparently
paying a real performance price in networking for tying compiler's hands
with -fno-delete-null-pointer-checks

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-25 19:39       ` Jakub Kicinski
@ 2021-01-26  8:58         ` Dan Carpenter
  2021-01-30 16:00           ` Lukas Wunner
  0 siblings, 1 reply; 21+ messages in thread
From: Dan Carpenter @ 2021-01-26  8:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Lukas Wunner, Eric Dumazet, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, netfilter-devel, coreteam, netdev,
	Daniel Borkmann, Alexei Starovoitov, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Mon, Jan 25, 2021 at 11:39:08AM -0800, Jakub Kicinski wrote:
> On Sun, 24 Jan 2021 11:33:01 +0100 Lukas Wunner wrote:
> > On Fri, Jan 22, 2021 at 10:40:05AM +0100, Eric Dumazet wrote:
> > > On Fri, Jan 22, 2021 at 9:55 AM Lukas Wunner <lukas@wunner.de> wrote:  
> > > > sch_handle_egress() returns either the skb or NULL to signal to its
> > > > caller __dev_queue_xmit() whether a packet should continue to be
> > > > processed.
> > > >
> > > > The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> > > > NULL pointer deref right at its top.
> > > >
> > > > But the compiler doesn't know that.  So if sch_handle_egress() signals
> > > > success by returning the skb, the "if (!skb) goto out;" statement
> > > > results in a gratuitous NULL pointer check in the Assembler output.
> > > >
> > > > Avoid by telling the compiler that __dev_queue_xmit() is never passed a
> > > > NULL skb.  
> > [...]
> > > > we're about to add a netfilter egress hook to __dev_queue_xmit()
> > > > and without the micro-optimization, it will result in a performance
> > > > degradation which is indeed measurable:  
> > [...]
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > +__attribute__((nonnull(1)))
> > > >  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > >  {
> > > >         struct net_device *dev = skb->dev;  
> > > 
> > > It is a bit sad the compilers do not automatically get this knowledge
> > > from the very first instruction :
> > > 
> > >  struct net_device *dev = skb->dev;  
> > 
> > The compiler (gcc) is capable of doing that, but the feature was disabled by:
> > 
> >     commit a3ca86aea507904148870946d599e07a340b39bf
> >     Author: Eugene Teo <eteo@redhat.com>
> >     Date:   Wed Jul 15 14:59:10 2009 +0800
> >     
> >     Add '-fno-delete-null-pointer-checks' to gcc CFLAGS
> > 
> > If -fno-delete-null-pointer-checks is dropped from the top-level Makefile
> > then the gratuitous NULL pointer checks disappear from the Assembler output,
> > obviating the need to litter hot paths with __attribute__((nonnull(1)))
> > annotations.
> > 
> > Taking a closer look at that commit, its rationale appears questionable:
> > It says that broken code such as ...
> > 
> > 	struct agnx_priv *priv = dev->priv;
> > 
> > 	if (!dev)
> > 		return;
> > 
> > ... would result in the NULL pointer check being optimized away.
> > The commit message claims that keeping the NULL pointer check in
> > "makes it harder to abuse" the broken code.
> > 
> > I don't see how that's the case:  If dev is NULL, the NULL pointer
> > dereference at the function's top causes termination of the task
> > in kernel/exit.c:do_exit().  So the NULL pointer check is never
> > reached by the task.  If on the other hand dev is non-NULL,
> > the task isn't terminated but then the NULL pointer check is
> > unnecessary as well.
> > 
> > So the point of the commit remains elusive to me.  I could submit
> > an RFC patch which drops -fno-delete-null-pointer-checks and see
> > if any security folks cry foul.  Thoughts?
> 

This was a famous tun.c bug back in the day.  In those days we weren't
careful to disallow remapping NULL to a different pointer.  See
/proc/sys/vm/mmap_min_addr.  The exploit was to remap NULL to be a valid
user controlled pointer.  It should have been impossible to exploit
because the code had a check for NULL, but the compiler optimized it
away.

https://lwn.net/Articles/342330/

> I wonder if modern compilers can't simply warn about this particular
> case. Not to mention our static checkers..
> 
> 
> Dan, do you think the concern from the above-quoted commit is still
> valid? Is this something that smatch flags these days? We're apparently
> paying a real performance price in networking for tying compiler's hands
> with -fno-delete-null-pointer-checks

If I had to guess why GCC doesn't warn about this I would say that
probably it's because a lot of macros have NULL checks built in.  Most
static analysis tools have a warning about inconsistent NULL checks but
Smatch won't warn about it unless it can lead to a NULL dereference.
The fact that pointless NULL checks slow down the code has never
bothered anyone up to now.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook
  2021-01-22  8:47 ` [PATCH nf-next v4 4/5] netfilter: Introduce egress hook Lukas Wunner
@ 2021-01-26 19:13   ` Daniel Borkmann
  2021-09-11 21:26     ` Lukas Wunner
  0 siblings, 1 reply; 21+ messages in thread
From: Daniel Borkmann @ 2021-01-26 19:13 UTC (permalink / raw)
  To: Lukas Wunner, Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal
  Cc: netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia Liebana, John Fastabend

On 1/22/21 9:47 AM, Lukas Wunner wrote:
> Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> handle_ing() under unique static key") introduced the ability to
> classify packets with netfilter on ingress.
> 
> Support the same on egress to satisfy user requirements such as:
> * outbound security policies for containers (Laura)
> * filtering and mangling intra-node Direct Server Return (DSR) traffic
>    on a load balancer (Laura)
> * filtering locally generated traffic coming in through AF_PACKET,
>    such as local ARP traffic generated for clustering purposes or DHCP
>    (Laura; the AF_PACKET plumbing is contained in a separate commit)
> * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
>    and gPTP with nftables (Pablo)
> * in the future: in-kernel NAT64/NAT46 (Pablo)
> 
> A patch for nftables to hook up egress rules from user space has been
> submitted separately, so users may immediately take advantage of the
> feature.
> 
> The hook is positioned after packet handling by traffic control.
> Thus, if packets are redirected into and out of containers with tc,
> the data path is:
> ingress: host tc -> container tc -> container nft
> egress:  container tc -> host tc -> host nft
> 
> This was done to address an objection from Daniel Borkmann:  If desired,
> nft does not get into tc's way performance-wise.  The host is able to
> firewall malicious packets coming out of a container, but only after tc
> has done its duty.  An implication is that tc may set skb->mark on
> egress for nft to act on it, but not the other way round.
> 
> If egress netfilter handling is not enabled on any interface, it is
> patched out of the data path by way of a static_key and doesn't make a
> performance difference that is discernible from noise:
> 
[...]
> @@ -4096,13 +4098,18 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
>   	qdisc_pkt_len_init(skb);
>   #ifdef CONFIG_NET_CLS_ACT
>   	skb->tc_at_ingress = 0;
> -# ifdef CONFIG_NET_EGRESS
> +#endif
> +#ifdef CONFIG_NET_EGRESS
>   	if (static_branch_unlikely(&egress_needed_key)) {
>   		skb = sch_handle_egress(skb, &rc, dev);
>   		if (!skb)
>   			goto out;
> +		if (nf_hook_egress_active()) {
> +			skb = nf_hook_egress(skb, &rc, dev);
> +			if (!skb)
> +				goto out;
> +		}

This won't work unfortunately, the issue is that this now creates an asymmetric path, for
example: [tc ingress] -> [nf ingress] -> [host] -> [tc egress] -> [nf egress]. So any NAT
translation done on tc ingress + tc egress will break on the nf hooks given nf is not able to
correlate inbound with outbound traffic. Likewise for container-based traffic that is in its
own netns, one of the existing paths is: [tc ingress (phys,hostns)] -> [tc ingress (veth,podns)] ->
[reply] -> [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] -> [nf egress (phys,hostns)*] ->
[tc egress (phys,hostns)]. As can be seen the [nf ingress] hook is not an issue at all given
everything operates in tc layer but the [nf egress*] one is in this case, as it touches in tc
layer where existing data planes will start to break upon rule injection. Wrt above objection,
what needs to be done for the minimum case is to i) fix the asymmetry problem from here, and
ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *; with that in place
this concern should be resolved. Thanks!

>   	}
> -# endif
>   #endif
>   	/* If device/qdisc don't need skb->dst, release it right now while
>   	 * its hot in this cpu cache.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling
  2021-01-26  8:58         ` Dan Carpenter
@ 2021-01-30 16:00           ` Lukas Wunner
  0 siblings, 0 replies; 21+ messages in thread
From: Lukas Wunner @ 2021-01-30 16:00 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Jakub Kicinski, Eric Dumazet, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, netfilter-devel, coreteam,
	netdev, Daniel Borkmann, Alexei Starovoitov, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Tue, Jan 26, 2021 at 11:58:17AM +0300, Dan Carpenter wrote:
> On Mon, Jan 25, 2021 at 11:39:08AM -0800, Jakub Kicinski wrote:
> > On Sun, 24 Jan 2021 11:33:01 +0100 Lukas Wunner wrote:
> > > On Fri, Jan 22, 2021 at 10:40:05AM +0100, Eric Dumazet wrote:
> > > > On Fri, Jan 22, 2021 at 9:55 AM Lukas Wunner <lukas@wunner.de> wrote:  
> > > > > sch_handle_egress() returns either the skb or NULL to signal to its
> > > > > caller __dev_queue_xmit() whether a packet should continue to be
> > > > > processed.
> > > > >
> > > > > The skb is always non-NULL, otherwise __dev_queue_xmit() would hit a
> > > > > NULL pointer deref right at its top.
> > > > >
> > > > > But the compiler doesn't know that.  So if sch_handle_egress() signals
> > > > > success by returning the skb, the "if (!skb) goto out;" statement
> > > > > results in a gratuitous NULL pointer check in the Assembler output.
> > > > >
> > > > > Avoid by telling the compiler that __dev_queue_xmit() is never passed a
> > > > > NULL skb.  
> > > [...]
> > > > > we're about to add a netfilter egress hook to __dev_queue_xmit()
> > > > > and without the micro-optimization, it will result in a performance
> > > > > degradation which is indeed measurable:  
> > > [...]
> > > > > --- a/net/core/dev.c
> > > > > +++ b/net/core/dev.c
> > > > > +__attribute__((nonnull(1)))
> > > > >  static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> > > > >  {
> > > > >         struct net_device *dev = skb->dev;  
> > > > 
> > > > It is a bit sad the compilers do not automatically get this knowledge
> > > > from the very first instruction :
> > > > 
> > > >  struct net_device *dev = skb->dev;  
> > > 
> > > The compiler (gcc) is capable of doing that, but the feature was disabled by:
> > > 
> > >     commit a3ca86aea507904148870946d599e07a340b39bf
> > >     Author: Eugene Teo <eteo@redhat.com>
> > >     Date:   Wed Jul 15 14:59:10 2009 +0800
> > >     
> > >     Add '-fno-delete-null-pointer-checks' to gcc CFLAGS
> > > 
> > > If -fno-delete-null-pointer-checks is dropped from the top-level Makefile
> > > then the gratuitous NULL pointer checks disappear from the Assembler output,
> > > obviating the need to litter hot paths with __attribute__((nonnull(1)))
> > > annotations.
> > > 
> > > Taking a closer look at that commit, its rationale appears questionable:
> > > It says that broken code such as ...
> > > 
> > > 	struct agnx_priv *priv = dev->priv;
> > > 
> > > 	if (!dev)
> > > 		return;
> > > 
> > > ... would result in the NULL pointer check being optimized away.
> > > The commit message claims that keeping the NULL pointer check in
> > > "makes it harder to abuse" the broken code.
> > > 
> > > I don't see how that's the case:  If dev is NULL, the NULL pointer
> > > dereference at the function's top causes termination of the task
> > > in kernel/exit.c:do_exit().  So the NULL pointer check is never
> > > reached by the task.  If on the other hand dev is non-NULL,
> > > the task isn't terminated but then the NULL pointer check is
> > > unnecessary as well.
> > > 
> > > So the point of the commit remains elusive to me.  I could submit
> > > an RFC patch which drops -fno-delete-null-pointer-checks and see
> > > if any security folks cry foul.  Thoughts?
> 
> This was a famous tun.c bug back in the day.  In those days we weren't
> careful to disallow remapping NULL to a different pointer.  See
> /proc/sys/vm/mmap_min_addr.  The exploit was to remap NULL to be a valid
> user controlled pointer.  It should have been impossible to exploit
> because the code had a check for NULL, but the compiler optimized it
> away.
> 
> https://lwn.net/Articles/342330/

Thanks Dan, that's valuable information.  I'll add it to the commit message
when respinning the series.

So to summarize, keeping seemingly useless NULL pointer checks in
is a mitigation in case accesses to the zero page don't result in
an exception because an attacker managed to map something to that
page.

I'm assuming for now that selectively dropping those NULL pointer
checks from hotpaths is acceptable if otherwise a measurable
performance degradation occurs.

Maintainers please speak up if you disagree.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 5/5] af_packet: Introduce egress hook
  2021-01-24 16:18       ` Willem de Bruijn
@ 2021-01-30 16:26         ` Lukas Wunner
  2021-01-30 16:58           ` Willem de Bruijn
  0 siblings, 1 reply; 21+ messages in thread
From: Lukas Wunner @ 2021-01-30 16:26 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, Network Development, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Sun, Jan 24, 2021 at 11:18:00AM -0500, Willem de Bruijn wrote:
> On Sun, Jan 24, 2021 at 6:14 AM Lukas Wunner <lukas@wunner.de> wrote:
> > On Fri, Jan 22, 2021 at 11:13:19AM -0500, Willem de Bruijn wrote:
> > > On Fri, Jan 22, 2021 at 4:44 AM Lukas Wunner <lukas@wunner.de> wrote:
> > > > Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
> > > > socket option set to on, which allows packets to escape without being
> > > > filtered in the egress path.
> > > >
> > > > This patch only updates the AF_PACKET path, it does not update
> > > > dev_direct_xmit() so the XDP infrastructure has a chance to bypass
> > > > Netfilter.
> > >
> > > Isn't the point of PACKET_QDISC_BYPASS to skip steps like this?
> >
> > I suppose PACKET_QDISC_BYPASS "was introduced to bypass qdisc,
> > not to bypass everything."
> >
> > (The quote is taken from this message by Eric Dumazet:
> > https://lore.kernel.org/netfilter-devel/a9006cf7-f4ba-81b1-fca1-fd2e97939fdc@gmail.com/
> > )
> 
> I see. I don't understand the value of a short-cut fast path if we
> start chipping away at its characteristic feature.

The point is to filter traffic coming in through af_packet.
Exempting PACKET_QDISC_BYPASS from filtering would open up a
trivial security hole.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 5/5] af_packet: Introduce egress hook
  2021-01-30 16:26         ` Lukas Wunner
@ 2021-01-30 16:58           ` Willem de Bruijn
  0 siblings, 0 replies; 21+ messages in thread
From: Willem de Bruijn @ 2021-01-30 16:58 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, Network Development, Daniel Borkmann,
	Alexei Starovoitov, Eric Dumazet, Thomas Graf,
	Laura Garcia Liebana, John Fastabend

On Sat, Jan 30, 2021 at 11:26 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Sun, Jan 24, 2021 at 11:18:00AM -0500, Willem de Bruijn wrote:
> > On Sun, Jan 24, 2021 at 6:14 AM Lukas Wunner <lukas@wunner.de> wrote:
> > > On Fri, Jan 22, 2021 at 11:13:19AM -0500, Willem de Bruijn wrote:
> > > > On Fri, Jan 22, 2021 at 4:44 AM Lukas Wunner <lukas@wunner.de> wrote:
> > > > > Add egress hook for AF_PACKET sockets that have the PACKET_QDISC_BYPASS
> > > > > socket option set to on, which allows packets to escape without being
> > > > > filtered in the egress path.
> > > > >
> > > > > This patch only updates the AF_PACKET path, it does not update
> > > > > dev_direct_xmit() so the XDP infrastructure has a chance to bypass
> > > > > Netfilter.
> > > >
> > > > Isn't the point of PACKET_QDISC_BYPASS to skip steps like this?
> > >
> > > I suppose PACKET_QDISC_BYPASS "was introduced to bypass qdisc,
> > > not to bypass everything."
> > >
> > > (The quote is taken from this message by Eric Dumazet:
> > > https://lore.kernel.org/netfilter-devel/a9006cf7-f4ba-81b1-fca1-fd2e97939fdc@gmail.com/
> > > )
> >
> > I see. I don't understand the value of a short-cut fast path if we
> > start chipping away at its characteristic feature.
>
> The point is to filter traffic coming in through af_packet.
> Exempting PACKET_QDISC_BYPASS from filtering would open up a
> trivial security hole.

Sure. But that argument is no different for TC_EGRESS.

That's why packet sockets require CAP_NET_RAW. It is perhaps
unfortunately that it is ns_capable instead of capable. But there is
nothing netfilter specific about this.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook
  2021-01-26 19:13   ` Daniel Borkmann
@ 2021-09-11 21:26     ` Lukas Wunner
  2021-09-15  9:45       ` Daniel Borkmann
  0 siblings, 1 reply; 21+ messages in thread
From: Lukas Wunner @ 2021-09-11 21:26 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia Liebana, John Fastabend

On Tue, Jan 26, 2021 at 08:13:19PM +0100, Daniel Borkmann wrote:
> On 1/22/21 9:47 AM, Lukas Wunner wrote:
> > Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
> > handle_ing() under unique static key") introduced the ability to
> > classify packets with netfilter on ingress.
> > 
> > Support the same on egress to satisfy user requirements such as:
[...]
> > The hook is positioned after packet handling by traffic control.
> > Thus, if packets are redirected into and out of containers with tc,
> > the data path is:
> > ingress: host tc -> container tc -> container nft
> > egress:  container tc -> host tc -> host nft
> > 
> > This was done to address an objection from Daniel Borkmann:  If desired,
> > nft does not get into tc's way performance-wise.  The host is able to
> > firewall malicious packets coming out of a container, but only after tc
> > has done its duty.  An implication is that tc may set skb->mark on
> > egress for nft to act on it, but not the other way round.
> [...]
> > @@ -4096,13 +4098,18 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> >   	qdisc_pkt_len_init(skb);
> >   #ifdef CONFIG_NET_CLS_ACT
> >   	skb->tc_at_ingress = 0;
> > -# ifdef CONFIG_NET_EGRESS
> > +#endif
> > +#ifdef CONFIG_NET_EGRESS
> >   	if (static_branch_unlikely(&egress_needed_key)) {
> >   		skb = sch_handle_egress(skb, &rc, dev);
> >   		if (!skb)
> >   			goto out;
> > +		if (nf_hook_egress_active()) {
> > +			skb = nf_hook_egress(skb, &rc, dev);
> > +			if (!skb)
> > +				goto out;
> > +		}
> 
> This won't work unfortunately, the issue is that this now creates an
> asymmetric path, for example:
>   [tc ingress] -> [nf ingress] -> [host] -> [tc egress] -> [nf egress].
> So any NAT translation done on tc ingress + tc egress will break on
> the nf hooks given nf is not able to correlate inbound with outbound
> traffic.
> 
> Likewise for container-based traffic that is in its own netns,
> one of the existing paths is:
>   [tc ingress (phys,hostns)] -> [tc ingress (veth,podns)] -> [reply] ->
>   [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] ->
>   [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)].
> As can be seen the [nf ingress] hook is not an issue at all given
> everything operates in tc layer but the [nf egress*] one is in this case,
> as it touches in tc layer where existing data planes will start to break
> upon rule injection.
> 
> Wrt above objection, what needs to be done for the minimum case is to
> i) fix the asymmetry problem from here, and
> ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *;
> with that in place this concern should be resolved. Thanks!

Daniel, your reply above has left me utterly confused.  After thinking
about it for a while, I'm requesting clarification what you really want.
We do want the netfilter egress hook in mainline and we're happy to
accommodate to your requirements, but we need you to specify them clearly.

Regarding the data path for packets which are going out from a container,
I think you've erroneously mixed up the last two elements above:
If the nf egress hook is placed after tc egress (as is done in the present
patch), the data path is actually as follows:

  [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] ->
  [tc egress (phys,hostns)] -> [nf egress (phys,hostns)*]

By contrast, if nf egress is put in front of tc egress (as you're
requesting above), the data path becomes:

  [nf egress (veth,podns)] -> [tc egress (veth,podns)] ->
  [tc ingress (veth,hostns)] ->
  [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)]

So this order results in an *additional* nf egress hook in the data path,
which may cost performance.  Previously you voiced concerns that the
nf egress hook may degrade performance.  To address that concern,
we ordered nf egress *after* tc egress, thereby avoiding any performance
impact as much as possible.

I agree with your argument that an inverse order vis-a-vis ingress is
more logical because it avoids NAT incongruencies etc.  So I'll be happy
to reorder nf egress before tc egress.  I'm just confused that you're now
requesting an order which may be *more* detrimental to performance.

As a reminder:
- If no nf egress rules are in use on the system, there is no performance
  impact at all because the hook is patched out of the data path.
- If an interface on the system uses nf egress rules, all other interfaces
  suffer a minor performance impact because a NULL pointer check is
  performed for sk_buff->nf_hooks_egress for every packet.
- The actual netfilter processing only happens on the interface(s)
  for which nf egress rules have been configured.

You further request above to "add a flag for tc layer-redirected skbs
to skip the nf egress hook *".

However the data path *starts* with an nf egress hook.  So even if a flag
to skip subsequent nf egress processing is set upon tc redirect,
the initial nf egress hook is never skipped.  The benefit of such a flag
is therefore questionable, which is another source of confusion to me.

Again, please clarify what your concerns are and how they can be addressed.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH nf-next v4 4/5] netfilter: Introduce egress hook
  2021-09-11 21:26     ` Lukas Wunner
@ 2021-09-15  9:45       ` Daniel Borkmann
  0 siblings, 0 replies; 21+ messages in thread
From: Daniel Borkmann @ 2021-09-15  9:45 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	netfilter-devel, coreteam, netdev, Alexei Starovoitov,
	Eric Dumazet, Thomas Graf, Laura Garcia Liebana, John Fastabend

Hi Lukas,

On 9/11/21 11:26 PM, Lukas Wunner wrote:
> On Tue, Jan 26, 2021 at 08:13:19PM +0100, Daniel Borkmann wrote:
>> On 1/22/21 9:47 AM, Lukas Wunner wrote:
[...]
>> Wrt above objection, what needs to be done for the minimum case is to
>> i) fix the asymmetry problem from here, and
>> ii) add a flag for tc layer-redirected skbs to skip the nf egress hook *;
>> with that in place this concern should be resolved. Thanks!
> 
> Daniel, your reply above has left me utterly confused.  After thinking
> about it for a while, I'm requesting clarification what you really want.
> We do want the netfilter egress hook in mainline and we're happy to
> accommodate to your requirements, but we need you to specify them clearly.
> 
> Regarding the data path for packets which are going out from a container,
> I think you've erroneously mixed up the last two elements above:
> If the nf egress hook is placed after tc egress (as is done in the present
> patch), the data path is actually as follows:
> 
>    [tc egress (veth,podns)] -> [tc ingress (veth,hostns)] ->
>    [tc egress (phys,hostns)] -> [nf egress (phys,hostns)*]
> 
> By contrast, if nf egress is put in front of tc egress (as you're
> requesting above), the data path becomes:
> 
>    [nf egress (veth,podns)] -> [tc egress (veth,podns)] ->
>    [tc ingress (veth,hostns)] ->
>    [nf egress (phys,hostns)*] -> [tc egress (phys,hostns)]
> 
> So this order results in an *additional* nf egress hook in the data path,
> which may cost performance.  Previously you voiced concerns that the
> nf egress hook may degrade performance.  To address that concern,
> we ordered nf egress *after* tc egress, thereby avoiding any performance
> impact as much as possible.
> 
> I agree with your argument that an inverse order vis-a-vis ingress is
> more logical because it avoids NAT incongruencies etc.  So I'll be happy
> to reorder nf egress before tc egress.  I'm just confused that you're now
> requesting an order which may be *more* detrimental to performance.

Right, the issue is that placing it either in front or after the existing tc
egress hook just as-is results in layering violations given both tc/nf subsystems
will inevitably step on each other when both dataplanes are used by different
components where things break. To provide a walk-through on what I think would
work w/o breaking layers:

1) Packet going up hostns stack. Here, we'll end up with:

    [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
      upper stack ->
        [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

2) Packet going up podns stack. Here, if the forwarding happens entirely in tc
    layer, then the datapath (as-is today) operates _below_ nf and must look like:

    [tc ingress (phys,hostns)] -> [tc egress (veth,hostns)] ->
      netns switch -> (*) -> netns switch ->
        [tc ingress (veth,hostns)] -> [tc egress (phys,hostns)]

    For simplicity, I left out (*), but it's essentially the same as case 1)
    just for the Pod's/container's stack PoV.

3) Packet is locally forwarded between Pods. Same as 2) for the case where the
    forwarding happens entirely in tc (as-is today) which operates _below_ nf and
    must look like:

    (+) -> [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] ->
      netns switch -> (*) -> netns switch ->
        [tc ingress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)

    The (+) denotes the src/sink where we enter/exit the hostns after netns switch.

The skb bit marker would be that tc [& BPF]-related redirect functions are internally
setting that bit, so that nf egress hook is skipped for cases like 2)/3).

Walking through a similar 1/2/3) scenario from nf side one layer higher if you were
to do an equivalent, things would look like:

1) Packet going up hostns stack. Same as above:

    [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
      upper stack ->
        [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

2) Packet going up podns stack with forwarding from nf side:

    [tc ingress (phys,hostns)] -> [nf ingress (phys,hostns)] ->
      [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
        netns switch -> (*) -> netns switch ->
          [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
            [nf egress (phys,hostns)] -> [tc egress (phys,hostns)]

3) Packet is locally forwarded between Pods with forwarding from nf side:

    (+) -> [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
      [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] ->
        netns switch -> (*) -> netns switch ->
          [tc ingress (veth,hostns)] -> [nf ingress (veth,hostns)] ->
            [nf egress (veth,hostns)] -> [tc egress (veth,hostns)] -> (+)

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-09-15  9:45 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-22  8:47 [PATCH nf-next v4 0/5] Netfilter egress hook Lukas Wunner
2021-01-22  8:47 ` [PATCH nf-next v4 1/5] net: sched: Micro-optimize egress handling Lukas Wunner
2021-01-22  9:40   ` Eric Dumazet
2021-01-24 10:33     ` Lukas Wunner
2021-01-25 19:39       ` Jakub Kicinski
2021-01-26  8:58         ` Dan Carpenter
2021-01-30 16:00           ` Lukas Wunner
2021-01-24  3:26   ` Jakub Kicinski
2021-01-24 10:46     ` Lukas Wunner
2021-01-22  8:47 ` [PATCH nf-next v4 2/5] netfilter: Rename ingress hook include file Lukas Wunner
2021-01-22  8:47 ` [PATCH nf-next v4 3/5] netfilter: Generalize " Lukas Wunner
2021-01-22  8:47 ` [PATCH nf-next v4 4/5] netfilter: Introduce egress hook Lukas Wunner
2021-01-26 19:13   ` Daniel Borkmann
2021-09-11 21:26     ` Lukas Wunner
2021-09-15  9:45       ` Daniel Borkmann
2021-01-22  8:47 ` [PATCH nf-next v4 5/5] af_packet: " Lukas Wunner
2021-01-22 16:13   ` Willem de Bruijn
2021-01-24 11:14     ` Lukas Wunner
2021-01-24 16:18       ` Willem de Bruijn
2021-01-30 16:26         ` Lukas Wunner
2021-01-30 16:58           ` Willem de Bruijn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).