* [RFC 0/9 v2] netfilter: bpf base hook program generator
@ 2022-10-05 14:13 Florian Westphal
  2022-10-05 14:13 ` [RFC v2 1/9] netfilter: nf_queue: carry index in hook state Florian Westphal
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

Sending as another RFC even though patches are unchanged vs. last iteration
to provide background/context ahead of bpf office hours on Oct 6th, thus
deliberately omitting netdev@ and nf-devel@.

This series adds a bpf program generator for netfilter base hooks.
'netfilter base hooks' are C functions that get called from the NF_HOOK()
stubs found in a myriad of locations in the network stack.

Examples from ipv4 (ip_input.c):
254         return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN,
255                        net, NULL, skb, skb->dev, NULL,
256                        ip_local_deliver_finish);
[..]
564         return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING,
565                        net, NULL, skb, dev, NULL,
566                        ip_rcv_finish);

Well-known users of this facility are iptables and nftables, but also
connection tracking and selinux.  Conntrack is a particularly greedy module,
with hooks in prerouting, input, output and postrouting, plus another two
via its nf_defrag(_ipv4) module dependency.
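
For illustration, such a base hook is registered through a struct nf_hook_ops.
A minimal sketch (hypothetical names, using the current three-argument hook
signature that this series later reduces to a single argument):

-----
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

/* illustrative only -- not part of this series */
static unsigned int example_hook(void *priv, struct sk_buff *skb,
				 const struct nf_hook_state *state)
{
	/* inspect skb here, then let it continue to the next hook */
	return NF_ACCEPT;
}

static const struct nf_hook_ops example_ops = {
	.hook		= example_hook,
	.pf		= NFPROTO_IPV4,
	.hooknum	= NF_INET_PRE_ROUTING,
	.priority	= NF_IP_PRI_FILTER,
};

/* registered e.g. from a pernet init function:
 *	err = nf_register_net_hook(net, &example_ops);
 */
-----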

Eliding the static-key handling, NF_HOOK() expands to:

-----
struct nf_hook_entries *hooks = rcu_dereference(net->nf.hooks_ipv4[hook]);
/* where [hook] is any one of prerouting, input, and so on */
ret = nf_hook_slow(skb, &state, hooks, 0);

if (ret == 1) /* packet is allowed to pass */
   okfn(net, sk, skb);
------
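
For completeness, the elided static-key handling amounts to an early return
when no hook is registered for the (pf, hook) combination.  Roughly (condensed
from nf_hook() in include/linux/netfilter.h, only the ipv4 case shown):

-----
#ifdef CONFIG_JUMP_LABEL
	if (__builtin_constant_p(pf) && __builtin_constant_p(hook) &&
	    !static_key_false(&nf_hooks_needed[pf][hook]))
		return 1; /* no hooks registered, packet passes */
#endif
	rcu_read_lock();
	hook_head = rcu_dereference(net->nf.hooks_ipv4[hook]);
	if (hook_head) {
		struct nf_hook_state state;

		nf_hook_state_init(&state, hook, pf, indev, outdev,
				   sk, net, okfn);
		ret = nf_hook_slow(skb, &state, hook_head, 0);
	}
	rcu_read_unlock();
-----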

'hooks' is an array of function-address/void * arg pairs that is
iterated in nf_hook_slow():

for i in hooks[]; do
  verdict = hooks[i].addr(hooks[i].arg, skb, state);
  switch (verdict) { ....

Each hook can choose to toss the packet (NF_DROP), move on to the next hook
(NF_ACCEPT), assume skb ownership (NF_STOLEN), and so on.
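
Spelled out a bit more, the verdict handling in nf_hook_slow() is roughly the
following (simplified, with the pre-series function signatures):

-----
	for (; s < e->num_hook_entries; s++) {
		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
		switch (verdict & NF_VERDICT_MASK) {
		case NF_ACCEPT:
			break;		/* continue with the next hook */
		case NF_DROP:
			kfree_skb(skb);
			ret = NF_DROP_GETERR(verdict);
			if (ret == 0)
				ret = -EPERM;
			return ret;
		case NF_QUEUE:
			ret = nf_queue(skb, state, s, verdict);
			if (ret == 1)	/* queue bypass, try next hook */
				continue;
			return ret;
		default:		/* NF_STOLEN and friends */
			return 0;
		}
	}
	return 1;	/* all hooks accepted, caller invokes okfn() */
-----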

All hooks have access to the skb, to the private void *arg (used by
nf_tables and ip_tables -- the start of the user-defined ruleset to
evaluate) and to a context structure that wraps extra data: the incoming and
outgoing network interfaces, the net namespace the hook is registered in,
the protocol family, the hook location (input, prerouting, forward, ...),
and so on.
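
After patch 4 of this series the skb and the private argument are carried in
that same context structure, so each hook receives a single pointer.  The
resulting struct nf_hook_state looks roughly like this (simplified):

-----
struct nf_hook_state {
	struct sk_buff *skb;	/* new: packet being evaluated */
	void *priv;		/* new: this hook's private data (ruleset, ...) */
	u8 hook;		/* NF_INET_PRE_ROUTING, NF_INET_LOCAL_IN, ... */
	u8 pf;			/* protocol family, e.g. NFPROTO_IPV4 */
	u16 hook_index;		/* index in hook_entries->hook[] */
	struct net_device *in;
	struct net_device *out;
	struct sock *sk;
	struct net *net;
	int (*okfn)(struct net *, struct sock *, struct sk_buff *);
};
-----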

Even for simple iptables-filter + nat this results in multiple indirect
calls per packet.

The proposed autogenerator unrolls nf_hook_slow() and builds a bpf program
that performs those function calls sequentially, i.e.:

state->priv = hooks[0].hook_arg;
v = firstfunction(state);
if (v != NF_ACCEPT) goto out;
state->priv = hooks[1].hook_arg;
v = secondfunction(state);
if (v != NF_ACCEPT) goto out;

... and so on.  As the function arguments are still taken from struct net at runtime,
rather than embedded as constants, those programs can be shared across net namespaces
that have the exact same set of registered hooks.  (Example: 10 netns with the
iptables filter table and active conntrack will all share the same 5 programs, one
each for prerouting, input, forward, output and postrouting, rather than 50 bpf
programs.)

Invocation of the autogenerated programs is done via the bpf dispatcher from
nf_hook(); instead of

ret = nf_hook_slow( ... )

this is now:
------------------
struct bpf_prog *prog = READ_ONCE(e->hook_prog);

state.priv = (void *)e;
state.skb = skb;

migrate_disable();
ret = __bpf_prog_run(prog, state, BPF_DISPATCHER_FUNC(nf_hook_base));
migrate_enable();
------------------

As long as NF_QUEUE is not used -- which should be rare -- the data path no
longer calls the nf_hook_slow() "interpreter".

No changes in BPF core and no UAPI additions, although I suppose it would make
sense to add an 'enable/disable' sysctl for this.

I think it makes little sense to consider any form of nf_tables (or iptables)
JIT without indirect-call avoidance first, unless such a 'jit' were to target
the XDP hook.

For that I would propose an 'xdptables' tool (or an 'xdp' family for nftables),
without kernel changes.

Comments welcome.

Florian Westphal (9):
  netfilter: nf_queue: carry index in hook state
  netfilter: nat: split nat hook iteration into a helper
  netfilter: remove hook index from nf_hook_slow arguments
  netfilter: make hook functions accept only one argument
  netfilter: reduce allowed hook count to 32
  netfilter: add bpf base hook program generator
  netfilter: core: do not rebuild bpf program on dying netns
  netfilter: netdev: switch to invocation via bpf
  netfilter: hook_jit: add prog cache

 drivers/net/ipvlan/ipvlan_l3s.c            |   4 +-
 include/linux/netfilter.h                  |  82 ++-
 include/linux/netfilter_arp/arp_tables.h   |   3 +-
 include/linux/netfilter_bridge/ebtables.h  |   3 +-
 include/linux/netfilter_ipv4/ip_tables.h   |   4 +-
 include/linux/netfilter_ipv6/ip6_tables.h  |   3 +-
 include/linux/netfilter_netdev.h           |  33 +-
 include/net/netfilter/br_netfilter.h       |   7 +-
 include/net/netfilter/nf_flow_table.h      |   6 +-
 include/net/netfilter/nf_hook_bpf.h        |  21 +
 include/net/netfilter/nf_queue.h           |   3 +-
 include/net/netfilter/nf_synproxy.h        |   6 +-
 net/bridge/br_input.c                      |   3 +-
 net/bridge/br_netfilter_hooks.c            |  30 +-
 net/bridge/br_netfilter_ipv6.c             |   5 +-
 net/bridge/netfilter/ebtable_broute.c      |   9 +-
 net/bridge/netfilter/ebtables.c            |   6 +-
 net/bridge/netfilter/nf_conntrack_bridge.c |   8 +-
 net/ipv4/netfilter/arp_tables.c            |   7 +-
 net/ipv4/netfilter/ip_tables.c             |   7 +-
 net/ipv4/netfilter/ipt_CLUSTERIP.c         |   6 +-
 net/ipv4/netfilter/iptable_mangle.c        |  15 +-
 net/ipv4/netfilter/nf_defrag_ipv4.c        |   5 +-
 net/ipv6/ila/ila_xlat.c                    |   6 +-
 net/ipv6/netfilter/ip6_tables.c            |   6 +-
 net/ipv6/netfilter/ip6table_mangle.c       |  13 +-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c  |   5 +-
 net/netfilter/Kconfig                      |  10 +
 net/netfilter/Makefile                     |   1 +
 net/netfilter/core.c                       | 121 ++++-
 net/netfilter/ipvs/ip_vs_core.c            |  13 +-
 net/netfilter/nf_conntrack_proto.c         |  34 +-
 net/netfilter/nf_flow_table_inet.c         |   8 +-
 net/netfilter/nf_flow_table_ip.c           |  12 +-
 net/netfilter/nf_hook_bpf.c                | 574 +++++++++++++++++++++
 net/netfilter/nf_nat_core.c                |  50 +-
 net/netfilter/nf_nat_proto.c               |  56 +-
 net/netfilter/nf_queue.c                   |  12 +-
 net/netfilter/nf_synproxy_core.c           |   8 +-
 net/netfilter/nft_chain_filter.c           |  48 +-
 net/netfilter/nft_chain_nat.c              |   7 +-
 net/netfilter/nft_chain_route.c            |  22 +-
 security/apparmor/lsm.c                    |   5 +-
 security/selinux/hooks.c                   |  22 +-
 security/smack/smack_netfilter.c           |   8 +-
 45 files changed, 1044 insertions(+), 273 deletions(-)
 create mode 100644 include/net/netfilter/nf_hook_bpf.h
 create mode 100644 net/netfilter/nf_hook_bpf.c

-- 
2.35.1



* [RFC v2 1/9] netfilter: nf_queue: carry index in hook state
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 2/9] netfilter: nat: split nat hook iteration into a helper Florian Westphal
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

Rather than passing the index (hook function to call next)
as function argument, store it in the hook state.

This is a prerequisite for passing all nf hook arguments in a single
structure.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter.h        |  1 +
 include/net/netfilter/nf_queue.h |  3 +--
 net/bridge/br_input.c            |  3 ++-
 net/netfilter/core.c             |  6 +++++-
 net/netfilter/nf_queue.c         | 12 ++++++------
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index d8817d381c14..7a1a2c4787f0 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -67,6 +67,7 @@ struct sock;
 struct nf_hook_state {
 	u8 hook;
 	u8 pf;
+	u16 hook_index; /* index in hook_entries->hook[] */
 	struct net_device *in;
 	struct net_device *out;
 	struct sock *sk;
diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
index 980daa6e1e3a..bdcdece2bbff 100644
--- a/include/net/netfilter/nf_queue.h
+++ b/include/net/netfilter/nf_queue.h
@@ -13,7 +13,6 @@ struct nf_queue_entry {
 	struct list_head	list;
 	struct sk_buff		*skb;
 	unsigned int		id;
-	unsigned int		hook_index;	/* index in hook_entries->hook[] */
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	struct net_device	*physin;
 	struct net_device	*physout;
@@ -125,6 +124,6 @@ nfqueue_hash(const struct sk_buff *skb, u16 queue, u16 queues_total, u8 family,
 }
 
 int nf_queue(struct sk_buff *skb, struct nf_hook_state *state,
-	     unsigned int index, unsigned int verdict);
+	     unsigned int verdict);
 
 #endif /* _NF_QUEUE_H */
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index 68b3e850bcb9..5be7e4573528 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -264,7 +264,8 @@ static int nf_hook_bridge_pre(struct sk_buff *skb, struct sk_buff **pskb)
 			kfree_skb(skb);
 			return RX_HANDLER_CONSUMED;
 		case NF_QUEUE:
-			ret = nf_queue(skb, &state, i, verdict);
+			state.hook_index = i;
+			ret = nf_queue(skb, &state, verdict);
 			if (ret == 1)
 				continue;
 			return RX_HANDLER_CONSUMED;
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 5a6705a0e4ec..c094742e3ec3 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -623,7 +623,8 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
 				ret = -EPERM;
 			return ret;
 		case NF_QUEUE:
-			ret = nf_queue(skb, state, s, verdict);
+			state->hook_index = s;
+			ret = nf_queue(skb, state, verdict);
 			if (ret == 1)
 				continue;
 			return ret;
@@ -772,6 +773,9 @@ int __init netfilter_init(void)
 {
 	int ret;
 
+	/* state->index */
+	BUILD_BUG_ON(MAX_HOOK_COUNT > USHRT_MAX);
+
 	ret = register_pernet_subsys(&netfilter_net_ops);
 	if (ret < 0)
 		goto err;
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index 63d1516816b1..9f9dfde3e054 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -156,7 +156,7 @@ static void nf_ip6_saveroute(const struct sk_buff *skb,
 }
 
 static int __nf_queue(struct sk_buff *skb, const struct nf_hook_state *state,
-		      unsigned int index, unsigned int queuenum)
+		      unsigned int queuenum)
 {
 	struct nf_queue_entry *entry = NULL;
 	const struct nf_queue_handler *qh;
@@ -204,7 +204,6 @@ static int __nf_queue(struct sk_buff *skb, const struct nf_hook_state *state,
 	*entry = (struct nf_queue_entry) {
 		.skb	= skb,
 		.state	= *state,
-		.hook_index = index,
 		.size	= sizeof(*entry) + route_key_size,
 	};
 
@@ -235,11 +234,11 @@ static int __nf_queue(struct sk_buff *skb, const struct nf_hook_state *state,
 
 /* Packets leaving via this function must come back through nf_reinject(). */
 int nf_queue(struct sk_buff *skb, struct nf_hook_state *state,
-	     unsigned int index, unsigned int verdict)
+	     unsigned int verdict)
 {
 	int ret;
 
-	ret = __nf_queue(skb, state, index, verdict >> NF_VERDICT_QBITS);
+	ret = __nf_queue(skb, state, verdict >> NF_VERDICT_QBITS);
 	if (ret < 0) {
 		if (ret == -ESRCH &&
 		    (verdict & NF_VERDICT_FLAG_QUEUE_BYPASS))
@@ -311,7 +310,7 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 
 	hooks = nf_hook_entries_head(net, pf, entry->state.hook);
 
-	i = entry->hook_index;
+	i = entry->state.hook_index;
 	if (WARN_ON_ONCE(!hooks || i >= hooks->num_hook_entries)) {
 		kfree_skb(skb);
 		nf_queue_entry_free(entry);
@@ -343,7 +342,8 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 		local_bh_enable();
 		break;
 	case NF_QUEUE:
-		err = nf_queue(skb, &entry->state, i, verdict);
+		entry->state.hook_index = i;
+		err = nf_queue(skb, &entry->state, verdict);
 		if (err == 1)
 			goto next_hook;
 		break;
-- 
2.35.1



* [RFC v2 2/9] netfilter: nat: split nat hook iteration into a helper
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
  2022-10-05 14:13 ` [RFC v2 1/9] netfilter: nf_queue: carry index in hook state Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 3/9] netfilter: remove hook index from nf_hook_slow arguments Florian Westphal
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

Makes conversion in followup patch simpler.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_nat_core.c | 46 +++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 7981be526f26..bd5ac4ff03f9 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -709,6 +709,32 @@ static bool in_vrf_postrouting(const struct nf_hook_state *state)
 	return false;
 }
 
+static unsigned int nf_nat_inet_run_hooks(const struct nf_hook_state *state,
+					  struct sk_buff *skb,
+					  struct nf_conn *ct,
+					  struct nf_nat_lookup_hook_priv *lpriv)
+{
+	enum nf_nat_manip_type maniptype = HOOK2MANIP(state->hook);
+	struct nf_hook_entries *e = rcu_dereference(lpriv->entries);
+	unsigned int ret;
+	int i;
+
+	if (!e)
+		goto null_bind;
+
+	for (i = 0; i < e->num_hook_entries; i++) {
+		ret = e->hooks[i].hook(e->hooks[i].priv, skb, state);
+		if (ret != NF_ACCEPT)
+			return ret;
+
+		if (nf_nat_initialized(ct, maniptype))
+			return NF_ACCEPT;
+	}
+
+null_bind:
+	return nf_nat_alloc_null_binding(ct, state->hook);
+}
+
 unsigned int
 nf_nat_inet_fn(void *priv, struct sk_buff *skb,
 	       const struct nf_hook_state *state)
@@ -740,23 +766,9 @@ nf_nat_inet_fn(void *priv, struct sk_buff *skb,
 		 */
 		if (!nf_nat_initialized(ct, maniptype)) {
 			struct nf_nat_lookup_hook_priv *lpriv = priv;
-			struct nf_hook_entries *e = rcu_dereference(lpriv->entries);
 			unsigned int ret;
-			int i;
-
-			if (!e)
-				goto null_bind;
-
-			for (i = 0; i < e->num_hook_entries; i++) {
-				ret = e->hooks[i].hook(e->hooks[i].priv, skb,
-						       state);
-				if (ret != NF_ACCEPT)
-					return ret;
-				if (nf_nat_initialized(ct, maniptype))
-					goto do_nat;
-			}
-null_bind:
-			ret = nf_nat_alloc_null_binding(ct, state->hook);
+
+			ret = nf_nat_inet_run_hooks(state, skb, ct, lpriv);
 			if (ret != NF_ACCEPT)
 				return ret;
 		} else {
@@ -775,7 +787,7 @@ nf_nat_inet_fn(void *priv, struct sk_buff *skb,
 		if (nf_nat_oif_changed(state->hook, ctinfo, nat, state->out))
 			goto oif_changed;
 	}
-do_nat:
+
 	return nf_nat_packet(ct, ctinfo, state->hook, skb);
 
 oif_changed:
-- 
2.35.1



* [RFC v2 3/9] netfilter: remove hook index from nf_hook_slow arguments
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
  2022-10-05 14:13 ` [RFC v2 1/9] netfilter: nf_queue: carry index in hook state Florian Westphal
  2022-10-05 14:13 ` [RFC v2 2/9] netfilter: nat: split nat hook iteration into a helper Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 4/9] netfilter: make hook functions accept only one argument Florian Westphal
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

A previous patch added the hook_index member to struct nf_hook_state, so
use that for passing the index.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter.h        | 5 +++--
 include/linux/netfilter_netdev.h | 4 ++--
 net/bridge/br_netfilter_hooks.c  | 3 ++-
 net/netfilter/core.c             | 6 +++---
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 7a1a2c4787f0..ec416d79352e 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -154,6 +154,7 @@ static inline void nf_hook_state_init(struct nf_hook_state *p,
 {
 	p->hook = hook;
 	p->pf = pf;
+	p->hook_index = 0;
 	p->in = indev;
 	p->out = outdev;
 	p->sk = sk;
@@ -198,7 +199,7 @@ extern struct static_key nf_hooks_needed[NFPROTO_NUMPROTO][NF_MAX_HOOKS];
 #endif
 
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
-		 const struct nf_hook_entries *e, unsigned int i);
+		 const struct nf_hook_entries *e);
 
 void nf_hook_slow_list(struct list_head *head, struct nf_hook_state *state,
 		       const struct nf_hook_entries *e);
@@ -255,7 +256,7 @@ static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 		nf_hook_state_init(&state, hook, pf, indev, outdev,
 				   sk, net, okfn);
 
-		ret = nf_hook_slow(skb, &state, hook_head, 0);
+		ret = nf_hook_slow(skb, &state, hook_head);
 	}
 	rcu_read_unlock();
 
diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index 8676316547cc..92996b1ac90f 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -31,7 +31,7 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 	nf_hook_state_init(&state, NF_NETDEV_INGRESS,
 			   NFPROTO_NETDEV, skb->dev, NULL, NULL,
 			   dev_net(skb->dev), NULL);
-	ret = nf_hook_slow(skb, &state, e, 0);
+	ret = nf_hook_slow(skb, &state, e);
 	if (ret == 0)
 		return -1;
 
@@ -104,7 +104,7 @@ static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
 
 	/* nf assumes rcu_read_lock, not just read_lock_bh */
 	rcu_read_lock();
-	ret = nf_hook_slow(skb, &state, e, 0);
+	ret = nf_hook_slow(skb, &state, e);
 	rcu_read_unlock();
 
 	if (ret == 1) {
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index f20f4373ff40..cc4b5a19ca31 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -1036,7 +1036,8 @@ int br_nf_hook_thresh(unsigned int hook, struct net *net,
 	nf_hook_state_init(&state, hook, NFPROTO_BRIDGE, indev, outdev,
 			   sk, net, okfn);
 
-	ret = nf_hook_slow(skb, &state, e, i);
+	state.hook_index = i;
+	ret = nf_hook_slow(skb, &state, e);
 	if (ret == 1)
 		ret = okfn(net, sk, skb);
 
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index c094742e3ec3..a8176351f120 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -605,9 +605,9 @@ EXPORT_SYMBOL(nf_unregister_net_hooks);
 /* Returns 1 if okfn() needs to be executed by the caller,
  * -EPERM for NF_DROP, 0 otherwise.  Caller must hold rcu_read_lock. */
 int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
-		 const struct nf_hook_entries *e, unsigned int s)
+		 const struct nf_hook_entries *e)
 {
-	unsigned int verdict;
+	unsigned int verdict, s = state->hook_index;
 	int ret;
 
 	for (; s < e->num_hook_entries; s++) {
@@ -651,7 +651,7 @@ void nf_hook_slow_list(struct list_head *head, struct nf_hook_state *state,
 
 	list_for_each_entry_safe(skb, next, head, list) {
 		skb_list_del_init(skb);
-		ret = nf_hook_slow(skb, state, e, 0);
+		ret = nf_hook_slow(skb, state, e);
 		if (ret == 1)
 			list_add_tail(&skb->list, &sublist);
 	}
-- 
2.35.1



* [RFC v2 4/9] netfilter: make hook functions accept only one argument
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
                   ` (2 preceding siblings ...)
  2022-10-05 14:13 ` [RFC v2 3/9] netfilter: remove hook index from nf_hook_slow arguments Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 5/9] netfilter: reduce allowed hook count to 32 Florian Westphal
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

BPF conversion requirement: one pointer-to-structure as argument.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 drivers/net/ipvlan/ipvlan_l3s.c            |  4 +-
 include/linux/netfilter.h                  | 10 ++--
 include/linux/netfilter_arp/arp_tables.h   |  3 +-
 include/linux/netfilter_bridge/ebtables.h  |  3 +-
 include/linux/netfilter_ipv4/ip_tables.h   |  4 +-
 include/linux/netfilter_ipv6/ip6_tables.h  |  3 +-
 include/net/netfilter/br_netfilter.h       |  7 +--
 include/net/netfilter/nf_flow_table.h      |  6 +--
 include/net/netfilter/nf_synproxy.h        |  6 +--
 net/bridge/br_netfilter_hooks.c            | 27 +++++------
 net/bridge/br_netfilter_ipv6.c             |  5 +-
 net/bridge/netfilter/ebtable_broute.c      |  9 ++--
 net/bridge/netfilter/ebtables.c            |  6 +--
 net/bridge/netfilter/nf_conntrack_bridge.c |  8 ++--
 net/ipv4/netfilter/arp_tables.c            |  7 ++-
 net/ipv4/netfilter/ip_tables.c             |  7 ++-
 net/ipv4/netfilter/ipt_CLUSTERIP.c         |  6 +--
 net/ipv4/netfilter/iptable_mangle.c        | 15 +++---
 net/ipv4/netfilter/nf_defrag_ipv4.c        |  5 +-
 net/ipv6/ila/ila_xlat.c                    |  6 +--
 net/ipv6/netfilter/ip6_tables.c            |  6 +--
 net/ipv6/netfilter/ip6table_mangle.c       | 13 +++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c  |  5 +-
 net/netfilter/core.c                       |  5 +-
 net/netfilter/ipvs/ip_vs_core.c            | 13 +++--
 net/netfilter/nf_conntrack_proto.c         | 34 +++++--------
 net/netfilter/nf_flow_table_inet.c         |  8 ++--
 net/netfilter/nf_flow_table_ip.c           | 12 ++---
 net/netfilter/nf_nat_core.c                | 10 ++--
 net/netfilter/nf_nat_proto.c               | 56 +++++++++++-----------
 net/netfilter/nf_synproxy_core.c           |  8 ++--
 net/netfilter/nft_chain_filter.c           | 48 +++++++++----------
 net/netfilter/nft_chain_nat.c              |  7 ++-
 net/netfilter/nft_chain_route.c            | 22 ++++-----
 security/apparmor/lsm.c                    |  5 +-
 security/selinux/hooks.c                   | 22 ++++-----
 security/smack/smack_netfilter.c           |  8 ++--
 37 files changed, 201 insertions(+), 228 deletions(-)

diff --git a/drivers/net/ipvlan/ipvlan_l3s.c b/drivers/net/ipvlan/ipvlan_l3s.c
index 943d26cbf39f..a6af569fcc27 100644
--- a/drivers/net/ipvlan/ipvlan_l3s.c
+++ b/drivers/net/ipvlan/ipvlan_l3s.c
@@ -90,9 +90,9 @@ static const struct l3mdev_ops ipvl_l3mdev_ops = {
 	.l3mdev_l3_rcv = ipvlan_l3_rcv,
 };
 
-static unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
-				    const struct nf_hook_state *state)
+static unsigned int ipvlan_nf_input(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct ipvl_addr *addr;
 	unsigned int len;
 
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index ec416d79352e..7c604ef8e8cb 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -65,6 +65,8 @@ struct nf_hook_ops;
 struct sock;
 
 struct nf_hook_state {
+	struct sk_buff *skb;
+	void *priv;
 	u8 hook;
 	u8 pf;
 	u16 hook_index; /* index in hook_entries->hook[] */
@@ -75,9 +77,7 @@ struct nf_hook_state {
 	int (*okfn)(struct net *, struct sock *, struct sk_buff *);
 };
 
-typedef unsigned int nf_hookfn(void *priv,
-			       struct sk_buff *skb,
-			       const struct nf_hook_state *state);
+typedef unsigned int nf_hookfn(const struct nf_hook_state *state);
 enum nf_hook_ops_type {
 	NF_HOOK_OP_UNDEFINED,
 	NF_HOOK_OP_NF_TABLES,
@@ -140,7 +140,9 @@ static inline int
 nf_hook_entry_hookfn(const struct nf_hook_entry *entry, struct sk_buff *skb,
 		     struct nf_hook_state *state)
 {
-	return entry->hook(entry->priv, skb, state);
+	state->skb = skb;
+	state->priv = entry->priv;
+	return entry->hook(state);
 }
 
 static inline void nf_hook_state_init(struct nf_hook_state *p,
diff --git a/include/linux/netfilter_arp/arp_tables.h b/include/linux/netfilter_arp/arp_tables.h
index a40aaf645fa4..651462358ee1 100644
--- a/include/linux/netfilter_arp/arp_tables.h
+++ b/include/linux/netfilter_arp/arp_tables.h
@@ -54,8 +54,7 @@ int arpt_register_table(struct net *net, const struct xt_table *table,
 			const struct nf_hook_ops *ops);
 void arpt_unregister_table(struct net *net, const char *name);
 void arpt_unregister_table_pre_exit(struct net *net, const char *name);
-extern unsigned int arpt_do_table(void *priv, struct sk_buff *skb,
-				  const struct nf_hook_state *state);
+extern unsigned int arpt_do_table(const struct nf_hook_state *state);
 
 #ifdef CONFIG_NETFILTER_XTABLES_COMPAT
 #include <net/compat.h>
diff --git a/include/linux/netfilter_bridge/ebtables.h b/include/linux/netfilter_bridge/ebtables.h
index fd533552a062..3d664027e14f 100644
--- a/include/linux/netfilter_bridge/ebtables.h
+++ b/include/linux/netfilter_bridge/ebtables.h
@@ -108,8 +108,7 @@ extern int ebt_register_table(struct net *net,
 			      const struct nf_hook_ops *ops);
 extern void ebt_unregister_table(struct net *net, const char *tablename);
 void ebt_unregister_table_pre_exit(struct net *net, const char *tablename);
-extern unsigned int ebt_do_table(void *priv, struct sk_buff *skb,
-				 const struct nf_hook_state *state);
+extern unsigned int ebt_do_table(const struct nf_hook_state *state);
 
 /* True if the hook mask denotes that the rule is in a base chain,
  * used in the check() functions */
diff --git a/include/linux/netfilter_ipv4/ip_tables.h b/include/linux/netfilter_ipv4/ip_tables.h
index 132b0e4a6d4d..270963c73245 100644
--- a/include/linux/netfilter_ipv4/ip_tables.h
+++ b/include/linux/netfilter_ipv4/ip_tables.h
@@ -63,9 +63,7 @@ struct ipt_error {
 }
 
 extern void *ipt_alloc_initial_table(const struct xt_table *);
-extern unsigned int ipt_do_table(void *priv,
-				 struct sk_buff *skb,
-				 const struct nf_hook_state *state);
+extern unsigned int ipt_do_table(const struct nf_hook_state *state);
 
 #ifdef CONFIG_NETFILTER_XTABLES_COMPAT
 #include <net/compat.h>
diff --git a/include/linux/netfilter_ipv6/ip6_tables.h b/include/linux/netfilter_ipv6/ip6_tables.h
index 8b8885a73c76..f786fb7ef47f 100644
--- a/include/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/linux/netfilter_ipv6/ip6_tables.h
@@ -29,8 +29,7 @@ int ip6t_register_table(struct net *net, const struct xt_table *table,
 			const struct nf_hook_ops *ops);
 void ip6t_unregister_table_pre_exit(struct net *net, const char *name);
 void ip6t_unregister_table_exit(struct net *net, const char *name);
-extern unsigned int ip6t_do_table(void *priv, struct sk_buff *skb,
-				  const struct nf_hook_state *state);
+extern unsigned int ip6t_do_table(const struct nf_hook_state *state);
 
 #ifdef CONFIG_NETFILTER_XTABLES_COMPAT
 #include <net/compat.h>
diff --git a/include/net/netfilter/br_netfilter.h b/include/net/netfilter/br_netfilter.h
index 371696ec11b2..9c37bf316077 100644
--- a/include/net/netfilter/br_netfilter.h
+++ b/include/net/netfilter/br_netfilter.h
@@ -57,9 +57,7 @@ struct net_device *setup_pre_routing(struct sk_buff *skb,
 
 #if IS_ENABLED(CONFIG_IPV6)
 int br_validate_ipv6(struct net *net, struct sk_buff *skb);
-unsigned int br_nf_pre_routing_ipv6(void *priv,
-				    struct sk_buff *skb,
-				    const struct nf_hook_state *state);
+unsigned int br_nf_pre_routing_ipv6(const struct nf_hook_state *state);
 #else
 static inline int br_validate_ipv6(struct net *net, struct sk_buff *skb)
 {
@@ -67,8 +65,7 @@ static inline int br_validate_ipv6(struct net *net, struct sk_buff *skb)
 }
 
 static inline unsigned int
-br_nf_pre_routing_ipv6(void *priv, struct sk_buff *skb,
-		       const struct nf_hook_state *state)
+br_nf_pre_routing_ipv6(const struct nf_hook_state *state)
 {
 	return NF_ACCEPT;
 }
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index cd982f4a0f50..fc86c2573c3c 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -291,10 +291,8 @@ struct flow_ports {
 	__be16 source, dest;
 };
 
-unsigned int nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
-				     const struct nf_hook_state *state);
-unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
-				       const struct nf_hook_state *state);
+unsigned int nf_flow_offload_ip_hook(const struct nf_hook_state *state);
+unsigned int nf_flow_offload_ipv6_hook(const struct nf_hook_state *state);
 
 #define MODULE_ALIAS_NF_FLOWTABLE(family)	\
 	MODULE_ALIAS("nf-flowtable-" __stringify(family))
diff --git a/include/net/netfilter/nf_synproxy.h b/include/net/netfilter/nf_synproxy.h
index a336f9434e73..9cf8db712e88 100644
--- a/include/net/netfilter/nf_synproxy.h
+++ b/include/net/netfilter/nf_synproxy.h
@@ -60,8 +60,7 @@ bool synproxy_recv_client_ack(struct net *net,
 
 struct nf_hook_state;
 
-unsigned int ipv4_synproxy_hook(void *priv, struct sk_buff *skb,
-				const struct nf_hook_state *nhs);
+unsigned int ipv4_synproxy_hook(const struct nf_hook_state *nhs);
 int nf_synproxy_ipv4_init(struct synproxy_net *snet, struct net *net);
 void nf_synproxy_ipv4_fini(struct synproxy_net *snet, struct net *net);
 
@@ -75,8 +74,7 @@ bool synproxy_recv_client_ack_ipv6(struct net *net, const struct sk_buff *skb,
 				   const struct tcphdr *th,
 				   struct synproxy_options *opts, u32 recv_seq);
 
-unsigned int ipv6_synproxy_hook(void *priv, struct sk_buff *skb,
-				const struct nf_hook_state *nhs);
+unsigned int ipv6_synproxy_hook(const struct nf_hook_state *nhs);
 int nf_synproxy_ipv6_init(struct synproxy_net *snet, struct net *net);
 void nf_synproxy_ipv6_fini(struct synproxy_net *snet, struct net *net);
 #else
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index cc4b5a19ca31..f42faf572c21 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -474,10 +474,9 @@ struct net_device *setup_pre_routing(struct sk_buff *skb, const struct net *net)
  * receiving device) to make netfilter happy, the REDIRECT
  * target in particular.  Save the original destination IP
  * address to be able to detect DNAT afterwards. */
-static unsigned int br_nf_pre_routing(void *priv,
-				      struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int br_nf_pre_routing(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nf_bridge_info *nf_bridge;
 	struct net_bridge_port *p;
 	struct net_bridge *br;
@@ -504,7 +503,7 @@ static unsigned int br_nf_pre_routing(void *priv,
 		}
 
 		nf_bridge_pull_encap_header_rcsum(skb);
-		return br_nf_pre_routing_ipv6(priv, skb, state);
+		return br_nf_pre_routing_ipv6(state);
 	}
 
 	if (!brnet->call_iptables && !br_opt_get(br, BROPT_NF_CALL_IPTABLES))
@@ -574,10 +573,9 @@ static int br_nf_forward_finish(struct net *net, struct sock *sk, struct sk_buff
  * but we are still able to filter on the 'real' indev/outdev
  * because of the physdev module. For ARP, indev and outdev are the
  * bridge ports. */
-static unsigned int br_nf_forward_ip(void *priv,
-				     struct sk_buff *skb,
-				     const struct nf_hook_state *state)
+static unsigned int br_nf_forward_ip(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nf_bridge_info *nf_bridge;
 	struct net_device *parent;
 	u_int8_t pf;
@@ -640,10 +638,9 @@ static unsigned int br_nf_forward_ip(void *priv,
 	return NF_STOLEN;
 }
 
-static unsigned int br_nf_forward_arp(void *priv,
-				      struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int br_nf_forward_arp(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct net_bridge_port *p;
 	struct net_bridge *br;
 	struct net_device **d = (struct net_device **)(skb->cb);
@@ -813,10 +810,9 @@ static int br_nf_dev_queue_xmit(struct net *net, struct sock *sk, struct sk_buff
 }
 
 /* PF_BRIDGE/POST_ROUTING ********************************************/
-static unsigned int br_nf_post_routing(void *priv,
-				       struct sk_buff *skb,
-				       const struct nf_hook_state *state)
+static unsigned int br_nf_post_routing(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 	struct net_device *realoutdev = bridge_parent(skb->dev);
 	u_int8_t pf;
@@ -862,10 +858,9 @@ static unsigned int br_nf_post_routing(void *priv,
 /* IP/SABOTAGE *****************************************************/
 /* Don't hand locally destined packets to PF_INET(6)/PRE_ROUTING
  * for the second time. */
-static unsigned int ip_sabotage_in(void *priv,
-				   struct sk_buff *skb,
-				   const struct nf_hook_state *state)
+static unsigned int ip_sabotage_in(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nf_bridge_info *nf_bridge = nf_bridge_info_get(skb);
 
 	if (nf_bridge && !nf_bridge->in_prerouting &&
diff --git a/net/bridge/br_netfilter_ipv6.c b/net/bridge/br_netfilter_ipv6.c
index 6b07f30675bb..87e5c7f60ae2 100644
--- a/net/bridge/br_netfilter_ipv6.c
+++ b/net/bridge/br_netfilter_ipv6.c
@@ -213,11 +213,10 @@ static int br_nf_pre_routing_finish_ipv6(struct net *net, struct sock *sk, struc
 /* Replicate the checks that IPv6 does on packet reception and pass the packet
  * to ip6tables.
  */
-unsigned int br_nf_pre_routing_ipv6(void *priv,
-				    struct sk_buff *skb,
-				    const struct nf_hook_state *state)
+unsigned int br_nf_pre_routing_ipv6(const struct nf_hook_state *state)
 {
 	struct nf_bridge_info *nf_bridge;
+	struct sk_buff *skb = state->skb;
 
 	if (br_validate_ipv6(state->net, skb))
 		return NF_DROP;
diff --git a/net/bridge/netfilter/ebtable_broute.c b/net/bridge/netfilter/ebtable_broute.c
index 8f19253024b0..e98791176341 100644
--- a/net/bridge/netfilter/ebtable_broute.c
+++ b/net/bridge/netfilter/ebtable_broute.c
@@ -43,9 +43,9 @@ static const struct ebt_table broute_table = {
 	.me		= THIS_MODULE,
 };
 
-static unsigned int ebt_broute(void *priv, struct sk_buff *skb,
-			       const struct nf_hook_state *s)
+static unsigned int ebt_broute(const struct nf_hook_state *s)
 {
+	struct sk_buff *skb = s->skb;
 	struct net_bridge_port *p = br_port_get_rcu(skb->dev);
 	struct nf_hook_state state;
 	unsigned char *dest;
@@ -58,7 +58,10 @@ static unsigned int ebt_broute(void *priv, struct sk_buff *skb,
 			   NFPROTO_BRIDGE, s->in, NULL, NULL,
 			   s->net, NULL);
 
-	ret = ebt_do_table(priv, skb, &state);
+	state.skb = skb;
+	state.priv = s->priv;
+
+	ret = ebt_do_table(&state);
 	if (ret != NF_DROP)
 		return ret;
 
diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index ce5dfa3babd2..8e99e72e90e9 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -189,10 +189,10 @@ ebt_get_target_c(const struct ebt_entry *e)
 }
 
 /* Do some firewalling */
-unsigned int ebt_do_table(void *priv, struct sk_buff *skb,
-			  const struct nf_hook_state *state)
+unsigned int ebt_do_table(const struct nf_hook_state *state)
 {
-	struct ebt_table *table = priv;
+	struct ebt_table *table = state->priv;
+	struct sk_buff *skb = state->skb;
 	unsigned int hook = state->hook;
 	int i, nentries;
 	struct ebt_entry *point;
diff --git a/net/bridge/netfilter/nf_conntrack_bridge.c b/net/bridge/netfilter/nf_conntrack_bridge.c
index 73242962be5d..b0a9187cd399 100644
--- a/net/bridge/netfilter/nf_conntrack_bridge.c
+++ b/net/bridge/netfilter/nf_conntrack_bridge.c
@@ -237,10 +237,10 @@ static int nf_ct_br_ipv6_check(const struct sk_buff *skb)
 	return 0;
 }
 
-static unsigned int nf_ct_bridge_pre(void *priv, struct sk_buff *skb,
-				     const struct nf_hook_state *state)
+static unsigned int nf_ct_bridge_pre(const struct nf_hook_state *state)
 {
 	struct nf_hook_state bridge_state = *state;
+	struct sk_buff *skb = state->skb;
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
 	u32 len;
@@ -396,9 +396,9 @@ static unsigned int nf_ct_bridge_confirm(struct sk_buff *skb)
 	return nf_confirm(skb, protoff, ct, ctinfo);
 }
 
-static unsigned int nf_ct_bridge_post(void *priv, struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int nf_ct_bridge_post(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	int ret;
 
 	ret = nf_ct_bridge_confirm(skb);
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index ffc0cab7cf18..b870773590ba 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -179,11 +179,10 @@ struct arpt_entry *arpt_next_entry(const struct arpt_entry *entry)
 	return (void *)entry + entry->next_offset;
 }
 
-unsigned int arpt_do_table(void *priv,
-			   struct sk_buff *skb,
-			   const struct nf_hook_state *state)
+unsigned int arpt_do_table(const struct nf_hook_state *state)
 {
-	const struct xt_table *table = priv;
+	const struct xt_table *table = state->priv;
+	struct sk_buff *skb = state->skb;
 	unsigned int hook = state->hook;
 	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
 	unsigned int verdict = NF_DROP;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 2ed7c58b471a..c49d3e324f99 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -222,11 +222,10 @@ struct ipt_entry *ipt_next_entry(const struct ipt_entry *entry)
 
 /* Returns one of the generic firewall policies, like NF_ACCEPT. */
 unsigned int
-ipt_do_table(void *priv,
-	     struct sk_buff *skb,
-	     const struct nf_hook_state *state)
+ipt_do_table(const struct nf_hook_state *state)
 {
-	const struct xt_table *table = priv;
+	const struct xt_table *table = state->priv;
+	struct sk_buff *skb = state->skb;
 	unsigned int hook = state->hook;
 	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
 	const struct iphdr *ip;
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index f8e176c77d1c..60ea95739a35 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -75,7 +75,7 @@ struct clusterip_net {
 	unsigned int hook_users;
 };
 
-static unsigned int clusterip_arp_mangle(void *priv, struct sk_buff *skb, const struct nf_hook_state *state);
+static unsigned int clusterip_arp_mangle(const struct nf_hook_state *state);
 
 static const struct nf_hook_ops cip_arp_ops = {
 	.hook = clusterip_arp_mangle,
@@ -638,9 +638,9 @@ static void arp_print(struct arp_payload *payload)
 #endif
 
 static unsigned int
-clusterip_arp_mangle(void *priv, struct sk_buff *skb,
-		     const struct nf_hook_state *state)
+clusterip_arp_mangle(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct arphdr *arp = arp_hdr(skb);
 	struct arp_payload *payload;
 	struct clusterip_config *c;
diff --git a/net/ipv4/netfilter/iptable_mangle.c b/net/ipv4/netfilter/iptable_mangle.c
index 3abb430af9e6..dca4637ad844 100644
--- a/net/ipv4/netfilter/iptable_mangle.c
+++ b/net/ipv4/netfilter/iptable_mangle.c
@@ -33,9 +33,9 @@ static const struct xt_table packet_mangler = {
 	.priority	= NF_IP_PRI_MANGLE,
 };
 
-static unsigned int
-ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
+static unsigned int ipt_mangle_out(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	unsigned int ret;
 	const struct iphdr *iph;
 	u_int8_t tos;
@@ -50,7 +50,7 @@ ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *stat
 	daddr = iph->daddr;
 	tos = iph->tos;
 
-	ret = ipt_do_table(priv, skb, state);
+	ret = ipt_do_table(state);
 	/* Reroute for ANY change. */
 	if (ret != NF_DROP && ret != NF_STOLEN) {
 		iph = ip_hdr(skb);
@@ -69,14 +69,11 @@ ipt_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *stat
 }
 
 /* The work comes in here from netfilter.c. */
-static unsigned int
-iptable_mangle_hook(void *priv,
-		     struct sk_buff *skb,
-		     const struct nf_hook_state *state)
+static unsigned int iptable_mangle_hook(const struct nf_hook_state *state)
 {
 	if (state->hook == NF_INET_LOCAL_OUT)
-		return ipt_mangle_out(priv, skb, state);
-	return ipt_do_table(priv, skb, state);
+		return ipt_mangle_out(state);
+	return ipt_do_table(state);
 }
 
 static struct nf_hook_ops *mangle_ops __read_mostly;
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index e61ea428ea18..8fda6f06fe2b 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -58,10 +58,9 @@ static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum,
 		return IP_DEFRAG_CONNTRACK_OUT + zone_id;
 }
 
-static unsigned int ipv4_conntrack_defrag(void *priv,
-					  struct sk_buff *skb,
-					  const struct nf_hook_state *state)
+static unsigned int ipv4_conntrack_defrag(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct sock *sk = skb->sk;
 
 	if (sk && sk_fullsock(sk) && (sk->sk_family == PF_INET) &&
diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index 47447f0241df..94d21bbed412 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -184,10 +184,10 @@ static void ila_free_cb(void *ptr, void *arg)
 static int ila_xlat_addr(struct sk_buff *skb, bool sir2ila);
 
 static unsigned int
-ila_nf_input(void *priv,
-	     struct sk_buff *skb,
-	     const struct nf_hook_state *state)
+ila_nf_input(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
+
 	ila_xlat_addr(skb, false);
 	return NF_ACCEPT;
 }
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 2d816277f2c5..4da1d61b9b42 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -247,10 +247,10 @@ ip6t_next_entry(const struct ip6t_entry *entry)
 
 /* Returns one of the generic firewall policies, like NF_ACCEPT. */
 unsigned int
-ip6t_do_table(void *priv, struct sk_buff *skb,
-	      const struct nf_hook_state *state)
+ip6t_do_table(const struct nf_hook_state *state)
 {
-	const struct xt_table *table = priv;
+	const struct xt_table *table = state->priv;
+	struct sk_buff *skb = state->skb;
 	unsigned int hook = state->hook;
 	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
 	/* Initializing verdict to NF_DROP keeps gcc happy. */
diff --git a/net/ipv6/netfilter/ip6table_mangle.c b/net/ipv6/netfilter/ip6table_mangle.c
index a88b2ce4a3cb..33b0e3ab3399 100644
--- a/net/ipv6/netfilter/ip6table_mangle.c
+++ b/net/ipv6/netfilter/ip6table_mangle.c
@@ -28,9 +28,9 @@ static const struct xt_table packet_mangler = {
 	.priority	= NF_IP6_PRI_MANGLE,
 };
 
-static unsigned int
-ip6t_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
+static unsigned int ip6t_mangle_out(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	unsigned int ret;
 	struct in6_addr saddr, daddr;
 	u_int8_t hop_limit;
@@ -46,7 +46,7 @@ ip6t_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *sta
 	/* flowlabel and prio (includes version, which shouldn't change either */
 	flowlabel = *((u_int32_t *)ipv6_hdr(skb));
 
-	ret = ip6t_do_table(priv, skb, state);
+	ret = ip6t_do_table(state);
 
 	if (ret != NF_DROP && ret != NF_STOLEN &&
 	    (!ipv6_addr_equal(&ipv6_hdr(skb)->saddr, &saddr) ||
@@ -64,12 +64,11 @@ ip6t_mangle_out(void *priv, struct sk_buff *skb, const struct nf_hook_state *sta
 
 /* The work comes in here from netfilter.c. */
 static unsigned int
-ip6table_mangle_hook(void *priv, struct sk_buff *skb,
-		     const struct nf_hook_state *state)
+ip6table_mangle_hook(const struct nf_hook_state *state)
 {
 	if (state->hook == NF_INET_LOCAL_OUT)
-		return ip6t_mangle_out(priv, skb, state);
-	return ip6t_do_table(priv, skb, state);
+		return ip6t_mangle_out(state);
+	return ip6t_do_table(state);
 }
 
 static struct nf_hook_ops *mangle_ops __read_mostly;
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index cb4eb1d2c620..25aae7deb7cc 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -48,10 +48,9 @@ static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
 		return IP6_DEFRAG_CONNTRACK_OUT + zone_id;
 }
 
-static unsigned int ipv6_defrag(void *priv,
-				struct sk_buff *skb,
-				const struct nf_hook_state *state)
+static unsigned int ipv6_defrag(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	int err;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index a8176351f120..593fec9434d7 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -88,9 +88,7 @@ static void nf_hook_entries_free(struct nf_hook_entries *e)
 	call_rcu(&head->head, __nf_hook_entries_free);
 }
 
-static unsigned int accept_all(void *priv,
-			       struct sk_buff *skb,
-			       const struct nf_hook_state *state)
+static unsigned int accept_all(const struct nf_hook_state *state)
 {
 	return NF_ACCEPT; /* ACCEPT makes nf_hook_slow call next hook */
 }
@@ -610,6 +608,7 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
 	unsigned int verdict, s = state->hook_index;
 	int ret;
 
+	state->skb = skb;
 	for (; s < e->num_hook_entries; s++) {
 		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
 		switch (verdict & NF_VERDICT_MASK) {
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index 51ad557a525b..8c36e2aa7f82 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -1330,10 +1330,11 @@ handle_response(int af, struct sk_buff *skb, struct ip_vs_proto_data *pd,
  *	Check if outgoing packet belongs to the established ip_vs_conn.
  */
 static unsigned int
-ip_vs_out_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
+ip_vs_out_hook(const struct nf_hook_state *state)
 {
 	struct netns_ipvs *ipvs = net_ipvs(state->net);
 	unsigned int hooknum = state->hook;
+	struct sk_buff *skb = state->skb;
 	struct ip_vs_iphdr iph;
 	struct ip_vs_protocol *pp;
 	struct ip_vs_proto_data *pd;
@@ -1910,10 +1911,11 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
  *	and send it on its way...
  */
 static unsigned int
-ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state)
+ip_vs_in_hook(const struct nf_hook_state *state)
 {
 	struct netns_ipvs *ipvs = net_ipvs(state->net);
 	unsigned int hooknum = state->hook;
+	struct sk_buff *skb = state->skb;
 	struct ip_vs_iphdr iph;
 	struct ip_vs_protocol *pp;
 	struct ip_vs_proto_data *pd;
@@ -2103,12 +2105,15 @@ ip_vs_in_hook(void *priv, struct sk_buff *skb, const struct nf_hook_state *state
  *      and send them to ip_vs_in_icmp.
  */
 static unsigned int
-ip_vs_forward_icmp(void *priv, struct sk_buff *skb,
-		   const struct nf_hook_state *state)
+ip_vs_forward_icmp(const struct nf_hook_state *state)
 {
 	struct netns_ipvs *ipvs = net_ipvs(state->net);
+	struct sk_buff *skb = state->skb;
 	int r;
 
+	if (ip_hdr(skb)->protocol != IPPROTO_ICMP)
+		return NF_ACCEPT;
+
 	/* ipvs enabled in this netns ? */
 	if (unlikely(sysctl_backup_only(ipvs) || !ipvs->enable))
 		return NF_ACCEPT;
diff --git a/net/netfilter/nf_conntrack_proto.c b/net/netfilter/nf_conntrack_proto.c
index 895b09cbd7cf..95e2f5a87dc3 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -165,10 +165,9 @@ static bool in_vrf_postrouting(const struct nf_hook_state *state)
 	return false;
 }
 
-static unsigned int ipv4_confirm(void *priv,
-				 struct sk_buff *skb,
-				 const struct nf_hook_state *state)
+static unsigned int ipv4_confirm(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
 
@@ -184,17 +183,15 @@ static unsigned int ipv4_confirm(void *priv,
 			  ct, ctinfo);
 }
 
-static unsigned int ipv4_conntrack_in(void *priv,
-				      struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int ipv4_conntrack_in(const struct nf_hook_state *state)
 {
-	return nf_conntrack_in(skb, state);
+	return nf_conntrack_in(state->skb, state);
 }
 
-static unsigned int ipv4_conntrack_local(void *priv,
-					 struct sk_buff *skb,
-					 const struct nf_hook_state *state)
+static unsigned int ipv4_conntrack_local(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
+
 	if (ip_is_fragment(ip_hdr(skb))) { /* IP_NODEFRAG setsockopt set */
 		enum ip_conntrack_info ctinfo;
 		struct nf_conn *tmpl;
@@ -373,10 +370,9 @@ static struct nf_sockopt_ops so_getorigdst6 = {
 	.owner		= THIS_MODULE,
 };
 
-static unsigned int ipv6_confirm(void *priv,
-				 struct sk_buff *skb,
-				 const struct nf_hook_state *state)
+static unsigned int ipv6_confirm(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
 	unsigned char pnum = ipv6_hdr(skb)->nexthdr;
@@ -400,18 +396,14 @@ static unsigned int ipv6_confirm(void *priv,
 	return nf_confirm(skb, protoff, ct, ctinfo);
 }
 
-static unsigned int ipv6_conntrack_in(void *priv,
-				      struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int ipv6_conntrack_in(const struct nf_hook_state *state)
 {
-	return nf_conntrack_in(skb, state);
+	return nf_conntrack_in(state->skb, state);
 }
 
-static unsigned int ipv6_conntrack_local(void *priv,
-					 struct sk_buff *skb,
-					 const struct nf_hook_state *state)
+static unsigned int ipv6_conntrack_local(const struct nf_hook_state *state)
 {
-	return nf_conntrack_in(skb, state);
+	return nf_conntrack_in(state->skb, state);
 }
 
 static const struct nf_hook_ops ipv6_conntrack_ops[] = {
diff --git a/net/netfilter/nf_flow_table_inet.c b/net/netfilter/nf_flow_table_inet.c
index 0ccabf3fa6aa..315db69f4ca8 100644
--- a/net/netfilter/nf_flow_table_inet.c
+++ b/net/netfilter/nf_flow_table_inet.c
@@ -9,9 +9,9 @@
 #include <linux/if_vlan.h>
 
 static unsigned int
-nf_flow_offload_inet_hook(void *priv, struct sk_buff *skb,
-			  const struct nf_hook_state *state)
+nf_flow_offload_inet_hook(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct vlan_ethhdr *veth;
 	__be16 proto;
 
@@ -30,9 +30,9 @@ nf_flow_offload_inet_hook(void *priv, struct sk_buff *skb,
 
 	switch (proto) {
 	case htons(ETH_P_IP):
-		return nf_flow_offload_ip_hook(priv, skb, state);
+		return nf_flow_offload_ip_hook(state);
 	case htons(ETH_P_IPV6):
-		return nf_flow_offload_ipv6_hook(priv, skb, state);
+		return nf_flow_offload_ipv6_hook(state);
 	}
 
 	return NF_ACCEPT;
diff --git a/net/netfilter/nf_flow_table_ip.c b/net/netfilter/nf_flow_table_ip.c
index b350fe9d00b0..98c7e7272ab4 100644
--- a/net/netfilter/nf_flow_table_ip.c
+++ b/net/netfilter/nf_flow_table_ip.c
@@ -337,12 +337,12 @@ static unsigned int nf_flow_queue_xmit(struct net *net, struct sk_buff *skb,
 }
 
 unsigned int
-nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
-			const struct nf_hook_state *state)
+nf_flow_offload_ip_hook(const struct nf_hook_state *state)
 {
+	struct nf_flowtable *flow_table = state->priv;
 	struct flow_offload_tuple_rhash *tuplehash;
-	struct nf_flowtable *flow_table = priv;
 	struct flow_offload_tuple tuple = {};
+	struct sk_buff *skb = state->skb;
 	enum flow_offload_tuple_dir dir;
 	struct flow_offload *flow;
 	struct net_device *outdev;
@@ -599,12 +599,12 @@ static int nf_flow_tuple_ipv6(struct sk_buff *skb, const struct net_device *dev,
 }
 
 unsigned int
-nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
-			  const struct nf_hook_state *state)
+nf_flow_offload_ipv6_hook(const struct nf_hook_state *state)
 {
+	struct nf_flowtable *flow_table = state->priv;
 	struct flow_offload_tuple_rhash *tuplehash;
-	struct nf_flowtable *flow_table = priv;
 	struct flow_offload_tuple tuple = {};
+	struct sk_buff *skb = state->skb;
 	enum flow_offload_tuple_dir dir;
 	const struct in6_addr *nexthop;
 	struct flow_offload *flow;
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index bd5ac4ff03f9..71d860b049c2 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -710,20 +710,24 @@ static bool in_vrf_postrouting(const struct nf_hook_state *state)
 }
 
 static unsigned int nf_nat_inet_run_hooks(const struct nf_hook_state *state,
-					  struct sk_buff *skb,
 					  struct nf_conn *ct,
 					  struct nf_nat_lookup_hook_priv *lpriv)
 {
 	enum nf_nat_manip_type maniptype = HOOK2MANIP(state->hook);
 	struct nf_hook_entries *e = rcu_dereference(lpriv->entries);
+	struct nf_hook_state __state;
 	unsigned int ret;
 	int i;
 
 	if (!e)
 		goto null_bind;
 
+	__state = *state;
+
 	for (i = 0; i < e->num_hook_entries; i++) {
-		ret = e->hooks[i].hook(e->hooks[i].priv, skb, state);
+		__state.priv = e->hooks[i].priv;
+
+		ret = e->hooks[i].hook(&__state);
 		if (ret != NF_ACCEPT)
 			return ret;
 
@@ -768,7 +772,7 @@ nf_nat_inet_fn(void *priv, struct sk_buff *skb,
 			struct nf_nat_lookup_hook_priv *lpriv = priv;
 			unsigned int ret;
 
-			ret = nf_nat_inet_run_hooks(state, skb, ct, lpriv);
+			ret = nf_nat_inet_run_hooks(state, ct, lpriv);
 			if (ret != NF_ACCEPT)
 				return ret;
 		} else {
diff --git a/net/netfilter/nf_nat_proto.c b/net/netfilter/nf_nat_proto.c
index 48cc60084d28..9d1d6a20ae1e 100644
--- a/net/netfilter/nf_nat_proto.c
+++ b/net/netfilter/nf_nat_proto.c
@@ -622,11 +622,12 @@ int nf_nat_icmp_reply_translation(struct sk_buff *skb,
 EXPORT_SYMBOL_GPL(nf_nat_icmp_reply_translation);
 
 static unsigned int
-nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
-	       const struct nf_hook_state *state)
+nf_nat_ipv4_fn(const struct nf_hook_state *state)
 {
-	struct nf_conn *ct;
+	struct sk_buff *skb = state->skb;
+	void *priv = state->priv;
 	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
 
 	ct = nf_ct_get(skb, &ctinfo);
 	if (!ct)
@@ -646,13 +647,13 @@ nf_nat_ipv4_fn(void *priv, struct sk_buff *skb,
 }
 
 static unsigned int
-nf_nat_ipv4_pre_routing(void *priv, struct sk_buff *skb,
-			const struct nf_hook_state *state)
+nf_nat_ipv4_pre_routing(const struct nf_hook_state *state)
 {
-	unsigned int ret;
+	struct sk_buff *skb = state->skb;
 	__be32 daddr = ip_hdr(skb)->daddr;
+	unsigned int ret;
 
-	ret = nf_nat_ipv4_fn(priv, skb, state);
+	ret = nf_nat_ipv4_fn(state);
 	if (ret == NF_ACCEPT && daddr != ip_hdr(skb)->daddr)
 		skb_dst_drop(skb);
 
@@ -698,14 +699,14 @@ static int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int
 #endif
 
 static unsigned int
-nf_nat_ipv4_local_in(void *priv, struct sk_buff *skb,
-		     const struct nf_hook_state *state)
+nf_nat_ipv4_local_in(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	__be32 saddr = ip_hdr(skb)->saddr;
 	struct sock *sk = skb->sk;
 	unsigned int ret;
 
-	ret = nf_nat_ipv4_fn(priv, skb, state);
+	ret = nf_nat_ipv4_fn(state);
 
 	if (ret == NF_ACCEPT && sk && saddr != ip_hdr(skb)->saddr &&
 	    !inet_sk_transparent(sk))
@@ -715,17 +716,17 @@ nf_nat_ipv4_local_in(void *priv, struct sk_buff *skb,
 }
 
 static unsigned int
-nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
-		const struct nf_hook_state *state)
+nf_nat_ipv4_out(const struct nf_hook_state *state)
 {
 #ifdef CONFIG_XFRM
+	struct sk_buff *skb = state->skb;
 	const struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
 	int err;
 #endif
 	unsigned int ret;
 
-	ret = nf_nat_ipv4_fn(priv, skb, state);
+	ret = nf_nat_ipv4_fn(state);
 #ifdef CONFIG_XFRM
 	if (ret != NF_ACCEPT)
 		return ret;
@@ -752,15 +753,15 @@ nf_nat_ipv4_out(void *priv, struct sk_buff *skb,
 }
 
 static unsigned int
-nf_nat_ipv4_local_fn(void *priv, struct sk_buff *skb,
-		     const struct nf_hook_state *state)
+nf_nat_ipv4_local_fn(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	const struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
 	unsigned int ret;
 	int err;
 
-	ret = nf_nat_ipv4_fn(priv, skb, state);
+	ret = nf_nat_ipv4_fn(state);
 	if (ret != NF_ACCEPT)
 		return ret;
 
@@ -901,9 +902,10 @@ int nf_nat_icmpv6_reply_translation(struct sk_buff *skb,
 EXPORT_SYMBOL_GPL(nf_nat_icmpv6_reply_translation);
 
 static unsigned int
-nf_nat_ipv6_fn(void *priv, struct sk_buff *skb,
-	       const struct nf_hook_state *state)
+nf_nat_ipv6_fn(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
+	void *priv = state->priv;
 	struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
 	__be16 frag_off;
@@ -938,13 +940,13 @@ nf_nat_ipv6_fn(void *priv, struct sk_buff *skb,
 }
 
 static unsigned int
-nf_nat_ipv6_in(void *priv, struct sk_buff *skb,
-	       const struct nf_hook_state *state)
+nf_nat_ipv6_in(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	unsigned int ret;
 	struct in6_addr daddr = ipv6_hdr(skb)->daddr;
 
-	ret = nf_nat_ipv6_fn(priv, skb, state);
+	ret = nf_nat_ipv6_fn(state);
 	if (ret != NF_DROP && ret != NF_STOLEN &&
 	    ipv6_addr_cmp(&daddr, &ipv6_hdr(skb)->daddr))
 		skb_dst_drop(skb);
@@ -953,17 +955,17 @@ nf_nat_ipv6_in(void *priv, struct sk_buff *skb,
 }
 
 static unsigned int
-nf_nat_ipv6_out(void *priv, struct sk_buff *skb,
-		const struct nf_hook_state *state)
+nf_nat_ipv6_out(const struct nf_hook_state *state)
 {
 #ifdef CONFIG_XFRM
+	struct sk_buff *skb = state->skb;
 	const struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
 	int err;
 #endif
 	unsigned int ret;
 
-	ret = nf_nat_ipv6_fn(priv, skb, state);
+	ret = nf_nat_ipv6_fn(state);
 #ifdef CONFIG_XFRM
 	if (ret != NF_ACCEPT)
 		return ret;
@@ -990,15 +992,15 @@ nf_nat_ipv6_out(void *priv, struct sk_buff *skb,
 }
 
 static unsigned int
-nf_nat_ipv6_local_fn(void *priv, struct sk_buff *skb,
-		     const struct nf_hook_state *state)
+nf_nat_ipv6_local_fn(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	const struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
 	unsigned int ret;
 	int err;
 
-	ret = nf_nat_ipv6_fn(priv, skb, state);
+	ret = nf_nat_ipv6_fn(state);
 	if (ret != NF_ACCEPT)
 		return ret;
 
diff --git a/net/netfilter/nf_synproxy_core.c b/net/netfilter/nf_synproxy_core.c
index 16915f8eef2b..d7bcfd4072c7 100644
--- a/net/netfilter/nf_synproxy_core.c
+++ b/net/netfilter/nf_synproxy_core.c
@@ -636,10 +636,10 @@ synproxy_recv_client_ack(struct net *net,
 EXPORT_SYMBOL_GPL(synproxy_recv_client_ack);
 
 unsigned int
-ipv4_synproxy_hook(void *priv, struct sk_buff *skb,
-		   const struct nf_hook_state *nhs)
+ipv4_synproxy_hook(const struct nf_hook_state *nhs)
 {
 	struct net *net = nhs->net;
+	struct sk_buff *skb = nhs->skb;
 	struct synproxy_net *snet = synproxy_pernet(net);
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
@@ -1053,9 +1053,9 @@ synproxy_recv_client_ack_ipv6(struct net *net,
 EXPORT_SYMBOL_GPL(synproxy_recv_client_ack_ipv6);
 
 unsigned int
-ipv6_synproxy_hook(void *priv, struct sk_buff *skb,
-		   const struct nf_hook_state *nhs)
+ipv6_synproxy_hook(const struct nf_hook_state *nhs)
 {
+	struct sk_buff *skb = nhs->skb;
 	struct net *net = nhs->net;
 	struct synproxy_net *snet = synproxy_pernet(net);
 	enum ip_conntrack_info ctinfo;
diff --git a/net/netfilter/nft_chain_filter.c b/net/netfilter/nft_chain_filter.c
index c3563f0be269..f451c081958a 100644
--- a/net/netfilter/nft_chain_filter.c
+++ b/net/netfilter/nft_chain_filter.c
@@ -11,16 +11,15 @@
 #include <net/netfilter/nf_tables_ipv6.h>
 
 #ifdef CONFIG_NF_TABLES_IPV4
-static unsigned int nft_do_chain_ipv4(void *priv,
-				      struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int nft_do_chain_ipv4(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nft_pktinfo pkt;
 
 	nft_set_pktinfo(&pkt, skb, state);
 	nft_set_pktinfo_ipv4(&pkt);
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 static const struct nft_chain_type nft_chain_filter_ipv4 = {
@@ -56,15 +55,15 @@ static inline void nft_chain_filter_ipv4_fini(void) {}
 #endif /* CONFIG_NF_TABLES_IPV4 */
 
 #ifdef CONFIG_NF_TABLES_ARP
-static unsigned int nft_do_chain_arp(void *priv, struct sk_buff *skb,
-				     const struct nf_hook_state *state)
+static unsigned int nft_do_chain_arp(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nft_pktinfo pkt;
 
 	nft_set_pktinfo(&pkt, skb, state);
 	nft_set_pktinfo_unspec(&pkt);
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 static const struct nft_chain_type nft_chain_filter_arp = {
@@ -95,16 +94,15 @@ static inline void nft_chain_filter_arp_fini(void) {}
 #endif /* CONFIG_NF_TABLES_ARP */
 
 #ifdef CONFIG_NF_TABLES_IPV6
-static unsigned int nft_do_chain_ipv6(void *priv,
-				      struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int nft_do_chain_ipv6(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nft_pktinfo pkt;
 
 	nft_set_pktinfo(&pkt, skb, state);
 	nft_set_pktinfo_ipv6(&pkt);
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 static const struct nft_chain_type nft_chain_filter_ipv6 = {
@@ -140,9 +138,9 @@ static inline void nft_chain_filter_ipv6_fini(void) {}
 #endif /* CONFIG_NF_TABLES_IPV6 */
 
 #ifdef CONFIG_NF_TABLES_INET
-static unsigned int nft_do_chain_inet(void *priv, struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int nft_do_chain_inet(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nft_pktinfo pkt;
 
 	nft_set_pktinfo(&pkt, skb, state);
@@ -158,13 +156,13 @@ static unsigned int nft_do_chain_inet(void *priv, struct sk_buff *skb,
 		break;
 	}
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
-static unsigned int nft_do_chain_inet_ingress(void *priv, struct sk_buff *skb,
-					      const struct nf_hook_state *state)
+static unsigned int nft_do_chain_inet_ingress(const struct nf_hook_state *state)
 {
 	struct nf_hook_state ingress_state = *state;
+	struct sk_buff *skb = state->skb;
 	struct nft_pktinfo pkt;
 
 	switch (skb->protocol) {
@@ -189,7 +187,7 @@ static unsigned int nft_do_chain_inet_ingress(void *priv, struct sk_buff *skb,
 		return NF_ACCEPT;
 	}
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 static const struct nft_chain_type nft_chain_filter_inet = {
@@ -228,10 +226,9 @@ static inline void nft_chain_filter_inet_fini(void) {}
 
 #if IS_ENABLED(CONFIG_NF_TABLES_BRIDGE)
 static unsigned int
-nft_do_chain_bridge(void *priv,
-		    struct sk_buff *skb,
-		    const struct nf_hook_state *state)
+nft_do_chain_bridge(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct nft_pktinfo pkt;
 
 	nft_set_pktinfo(&pkt, skb, state);
@@ -248,7 +245,7 @@ nft_do_chain_bridge(void *priv,
 		break;
 	}
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 static const struct nft_chain_type nft_chain_filter_bridge = {
@@ -284,14 +281,13 @@ static inline void nft_chain_filter_bridge_fini(void) {}
 #endif /* CONFIG_NF_TABLES_BRIDGE */
 
 #ifdef CONFIG_NF_TABLES_NETDEV
-static unsigned int nft_do_chain_netdev(void *priv, struct sk_buff *skb,
-					const struct nf_hook_state *state)
+static unsigned int nft_do_chain_netdev(const struct nf_hook_state *state)
 {
 	struct nft_pktinfo pkt;
 
-	nft_set_pktinfo(&pkt, skb, state);
+	nft_set_pktinfo(&pkt, state->skb, state);
 
-	switch (skb->protocol) {
+	switch (state->skb->protocol) {
 	case htons(ETH_P_IP):
 		nft_set_pktinfo_ipv4_validate(&pkt);
 		break;
@@ -303,7 +299,7 @@ static unsigned int nft_do_chain_netdev(void *priv, struct sk_buff *skb,
 		break;
 	}
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 static const struct nft_chain_type nft_chain_filter_netdev = {
diff --git a/net/netfilter/nft_chain_nat.c b/net/netfilter/nft_chain_nat.c
index 98e4946100c5..7eff7e499f54 100644
--- a/net/netfilter/nft_chain_nat.c
+++ b/net/netfilter/nft_chain_nat.c
@@ -7,12 +7,11 @@
 #include <net/netfilter/nf_tables_ipv4.h>
 #include <net/netfilter/nf_tables_ipv6.h>
 
-static unsigned int nft_nat_do_chain(void *priv, struct sk_buff *skb,
-				     const struct nf_hook_state *state)
+static unsigned int nft_nat_do_chain(const struct nf_hook_state *state)
 {
 	struct nft_pktinfo pkt;
 
-	nft_set_pktinfo(&pkt, skb, state);
+	nft_set_pktinfo(&pkt, state->skb, state);
 
 	switch (state->pf) {
 #ifdef CONFIG_NF_TABLES_IPV4
@@ -29,7 +28,7 @@ static unsigned int nft_nat_do_chain(void *priv, struct sk_buff *skb,
 		break;
 	}
 
-	return nft_do_chain(&pkt, priv);
+	return nft_do_chain(&pkt, state->priv);
 }
 
 #ifdef CONFIG_NF_TABLES_IPV4
diff --git a/net/netfilter/nft_chain_route.c b/net/netfilter/nft_chain_route.c
index 925db0dce48d..8c9f31a96d6f 100644
--- a/net/netfilter/nft_chain_route.c
+++ b/net/netfilter/nft_chain_route.c
@@ -13,10 +13,10 @@
 #include <net/ip.h>
 
 #ifdef CONFIG_NF_TABLES_IPV4
-static unsigned int nf_route_table_hook4(void *priv,
-					 struct sk_buff *skb,
-					 const struct nf_hook_state *state)
+static unsigned int nf_route_table_hook4(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
+	void *priv = state->priv;
 	const struct iphdr *iph;
 	struct nft_pktinfo pkt;
 	__be32 saddr, daddr;
@@ -62,10 +62,10 @@ static const struct nft_chain_type nft_chain_route_ipv4 = {
 #endif
 
 #ifdef CONFIG_NF_TABLES_IPV6
-static unsigned int nf_route_table_hook6(void *priv,
-					 struct sk_buff *skb,
-					 const struct nf_hook_state *state)
+static unsigned int nf_route_table_hook6(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
+	void *priv = state->priv;
 	struct in6_addr saddr, daddr;
 	struct nft_pktinfo pkt;
 	u32 mark, flowlabel;
@@ -112,17 +112,17 @@ static const struct nft_chain_type nft_chain_route_ipv6 = {
 #endif
 
 #ifdef CONFIG_NF_TABLES_INET
-static unsigned int nf_route_table_inet(void *priv,
-					struct sk_buff *skb,
-					const struct nf_hook_state *state)
+static unsigned int nf_route_table_inet(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
+	void *priv = state->priv;
 	struct nft_pktinfo pkt;
 
 	switch (state->pf) {
 	case NFPROTO_IPV4:
-		return nf_route_table_hook4(priv, skb, state);
+		return nf_route_table_hook4(state);
 	case NFPROTO_IPV6:
-		return nf_route_table_hook6(priv, skb, state);
+		return nf_route_table_hook6(state);
 	default:
 		nft_set_pktinfo(&pkt, skb, state);
 		break;
diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index e29cade7b662..582fa381af20 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -1788,10 +1788,9 @@ static inline int apparmor_init_sysctl(void)
 #endif /* CONFIG_SYSCTL */
 
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_NETWORK_SECMARK)
-static unsigned int apparmor_ip_postroute(void *priv,
-					  struct sk_buff *skb,
-					  const struct nf_hook_state *state)
+static unsigned int apparmor_ip_postroute(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct aa_sk_ctx *ctx;
 	struct sock *sk;
 
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 03bca97c8b29..31d052be21ee 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -5612,10 +5612,9 @@ static int selinux_tun_dev_open(void *security)
 }
 
 #ifdef CONFIG_NETFILTER
-
-static unsigned int selinux_ip_forward(void *priv, struct sk_buff *skb,
-				       const struct nf_hook_state *state)
+static unsigned int selinux_ip_forward(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	int ifindex;
 	u16 family;
 	char *addrp;
@@ -5672,9 +5671,9 @@ static unsigned int selinux_ip_forward(void *priv, struct sk_buff *skb,
 	return NF_ACCEPT;
 }
 
-static unsigned int selinux_ip_output(void *priv, struct sk_buff *skb,
-				      const struct nf_hook_state *state)
+static unsigned int selinux_ip_output(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb;
 	struct sock *sk;
 	u32 sid;
 
@@ -5684,6 +5683,7 @@ static unsigned int selinux_ip_output(void *priv, struct sk_buff *skb,
 	/* we do this in the LOCAL_OUT path and not the POST_ROUTING path
 	 * because we want to make sure we apply the necessary labeling
 	 * before IPsec is applied so we can leverage AH protection */
+	skb = state->skb;
 	sk = skb->sk;
 	if (sk) {
 		struct sk_security_struct *sksec;
@@ -5714,10 +5714,9 @@ static unsigned int selinux_ip_output(void *priv, struct sk_buff *skb,
 	return NF_ACCEPT;
 }
 
-
-static unsigned int selinux_ip_postroute_compat(struct sk_buff *skb,
-					const struct nf_hook_state *state)
+static unsigned int selinux_ip_postroute_compat(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	struct sock *sk;
 	struct sk_security_struct *sksec;
 	struct common_audit_data ad;
@@ -5748,10 +5747,9 @@ static unsigned int selinux_ip_postroute_compat(struct sk_buff *skb,
 	return NF_ACCEPT;
 }
 
-static unsigned int selinux_ip_postroute(void *priv,
-					 struct sk_buff *skb,
-					 const struct nf_hook_state *state)
+static unsigned int selinux_ip_postroute(const struct nf_hook_state *state)
 {
+	struct sk_buff *skb = state->skb;
 	u16 family;
 	u32 secmark_perm;
 	u32 peer_sid;
@@ -5767,7 +5765,7 @@ static unsigned int selinux_ip_postroute(void *priv,
 	 * special handling.  We do this in an attempt to keep this function
 	 * as fast and as clean as possible. */
 	if (!selinux_policycap_netpeer())
-		return selinux_ip_postroute_compat(skb, state);
+		return selinux_ip_postroute_compat(state);
 
 	secmark_active = selinux_secmark_enabled();
 	peerlbl_active = selinux_peerlbl_enabled();
diff --git a/security/smack/smack_netfilter.c b/security/smack/smack_netfilter.c
index b945c1d3a743..309a2b8191a5 100644
--- a/security/smack/smack_netfilter.c
+++ b/security/smack/smack_netfilter.c
@@ -18,14 +18,14 @@
 #include <net/net_namespace.h>
 #include "smack.h"
 
-static unsigned int smack_ip_output(void *priv,
-					struct sk_buff *skb,
-					const struct nf_hook_state *state)
+static unsigned int smack_ip_output(const struct nf_hook_state *state)
 {
-	struct sock *sk = skb_to_full_sk(skb);
+	struct sk_buff *skb = state->skb;
 	struct socket_smack *ssp;
 	struct smack_known *skp;
+	struct sock *sk;
 
+	sk = skb_to_full_sk(skb);
 	if (sk && sk->sk_security) {
 		ssp = sk->sk_security;
 		skp = ssp->smk_out;
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 5/9] netfilter: reduce allowed hook count to 32
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
                   ` (3 preceding siblings ...)
  2022-10-05 14:13 ` [RFC v2 4/9] netfilter: make hook functions accept only one argument Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 6/9] netfilter: add bpf base hook program generator Florian Westphal
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

1k is huge and would mean we'd need to support tail calls in the
nf_hook bpf converter.

We need about 5 insns per hook at this time, ignoring prologue/epilogue.

32 should be fine; even extreme cases typically need only about 8 hooks
per hook location.
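
As a back-of-envelope check (assumption: ~5 insns per hook plus a small
fixed prologue/epilogue, and the BPF_MAXINSNS (4096) ceiling the
generator works with):

  32 hooks * 5 insns/hook = 160 insns

which leaves plenty of headroom in a single program, whereas 1024 hooks
would need ~5k insns and hence tail calls.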

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 593fec9434d7..17165f9cf4a1 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -42,7 +42,7 @@ EXPORT_SYMBOL(nf_hooks_needed);
 static DEFINE_MUTEX(nf_hook_mutex);
 
 /* max hooks per family/hooknum */
-#define MAX_HOOK_COUNT		1024
+#define MAX_HOOK_COUNT		32
 
 #define nf_entry_dereference(e) \
 	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 6/9] netfilter: add bpf base hook program generator
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
                   ` (4 preceding siblings ...)
  2022-10-05 14:13 ` [RFC v2 5/9] netfilter: reduce allowed hook count to 32 Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-06  2:52   ` Alexei Starovoitov
  2022-10-05 14:13 ` [RFC v2 7/9] netfilter: core: do not rebuild bpf program on dying netns Florian Westphal
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

Add a kernel bpf program generator for netfilter base hooks.

Currently netfilter hooks are invoked by nf_hook_slow:

for i in hooks; do
  verdict = hooks[i]->indirect_func(hooks->[i].hook_arg, skb, state);

  switch (verdict) { ....

The autogenerator unrolls the loop, so we get:

state->priv = hooks->[0].hook_arg;
v = first_hook_function(state);
if (v != ACCEPT) goto done;
state->priv = hooks->[1].hook_arg;
v = second_hook_function(state); ...

Indirections are replaced by direct calls. Invocation of the
autogenerated programs is done via bpf dispatcher from nf_hook().

The autogenerated program has the same return value scheme as
nf_hook_slow(). NF_HOOK() call sites are converted to call the
autogenerated bpf program instead of nf_hook_slow().

Purpose of this is to eventually add a 'netfilter prog type' to bpf and
permit attachment of (userspace generated) bpf programs to the netfilter
machinery, e.g.  'attach bpf prog id 1234 to ipv6 PREROUTING at prio -300'.

This will require exposing the context structure (program argument,
'__nf_hook_state'), with accesses rewritten to match the nf_hook_state layout.

NAT hooks are still handled via indirect calls, but they are only called
once per connection.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter.h           |  66 ++++-
 include/net/netfilter/nf_hook_bpf.h |  21 ++
 net/netfilter/Kconfig               |  10 +
 net/netfilter/Makefile              |   1 +
 net/netfilter/core.c                |  92 +++++-
 net/netfilter/nf_hook_bpf.c         | 424 ++++++++++++++++++++++++++++
 6 files changed, 605 insertions(+), 9 deletions(-)
 create mode 100644 include/net/netfilter/nf_hook_bpf.h
 create mode 100644 net/netfilter/nf_hook_bpf.c

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 7c604ef8e8cb..b7874b772dd1 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_NETFILTER_H
 #define __LINUX_NETFILTER_H
 
+#include <linux/filter.h>
 #include <linux/init.h>
 #include <linux/skbuff.h>
 #include <linux/net.h>
@@ -106,6 +107,9 @@ struct nf_hook_entries_rcu_head {
 };
 
 struct nf_hook_entries {
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	struct bpf_prog			*hook_prog;
+#endif
 	u16				num_hook_entries;
 	/* padding */
 	struct nf_hook_entry		hooks[];
@@ -205,6 +209,17 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
 
 void nf_hook_slow_list(struct list_head *head, struct nf_hook_state *state,
 		       const struct nf_hook_entries *e);
+
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+DECLARE_BPF_DISPATCHER(nf_hook_base);
+
+static __always_inline int bpf_prog_run_nf(const struct bpf_prog *prog,
+					   struct nf_hook_state *state)
+{
+	return __bpf_prog_run(prog, state, BPF_DISPATCHER_FUNC(nf_hook_base));
+}
+#endif
+
 /**
  *	nf_hook - call a netfilter hook
  *
@@ -213,17 +228,17 @@ void nf_hook_slow_list(struct list_head *head, struct nf_hook_state *state,
  *	value indicates the packet has been consumed by the hook.
  */
 static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
-			  struct sock *sk, struct sk_buff *skb,
-			  struct net_device *indev, struct net_device *outdev,
-			  int (*okfn)(struct net *, struct sock *, struct sk_buff *))
+		struct sock *sk, struct sk_buff *skb,
+		struct net_device *indev, struct net_device *outdev,
+		int (*okfn)(struct net *, struct sock *, struct sk_buff *))
 {
 	struct nf_hook_entries *hook_head = NULL;
 	int ret = 1;
 
 #ifdef CONFIG_JUMP_LABEL
 	if (__builtin_constant_p(pf) &&
-	    __builtin_constant_p(hook) &&
-	    !static_key_false(&nf_hooks_needed[pf][hook]))
+			__builtin_constant_p(hook) &&
+			!static_key_false(&nf_hooks_needed[pf][hook]))
 		return 1;
 #endif
 
@@ -254,11 +269,24 @@ static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
 
 	if (hook_head) {
 		struct nf_hook_state state;
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+		const struct bpf_prog *p = READ_ONCE(hook_head->hook_prog);
+
+		nf_hook_state_init(&state, hook, pf, indev, outdev,
+				   sk, net, okfn);
+
+		state.priv = (void *)hook_head;
+		state.skb = skb;
 
+		migrate_disable();
+		ret = bpf_prog_run_nf(p, &state);
+		migrate_enable();
+#else
 		nf_hook_state_init(&state, hook, pf, indev, outdev,
 				   sk, net, okfn);
 
 		ret = nf_hook_slow(skb, &state, hook_head);
+#endif
 	}
 	rcu_read_unlock();
 
@@ -336,10 +364,38 @@ NF_HOOK_LIST(uint8_t pf, unsigned int hook, struct net *net, struct sock *sk,
 
 	if (hook_head) {
 		struct nf_hook_state state;
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+		const struct bpf_prog *p = hook_head->hook_prog;
+		struct sk_buff *skb, *next;
+		struct list_head sublist;
+		int ret;
+
+		nf_hook_state_init(&state, hook, pf, in, out, sk, net, okfn);
+
+		INIT_LIST_HEAD(&sublist);
 
+		migrate_disable();
+
+		list_for_each_entry_safe(skb, next, head, list) {
+			skb_list_del_init(skb);
+
+			state.priv = (void *)hook_head;
+			state.skb = skb;
+
+			ret = bpf_prog_run_nf(p, &state);
+			if (ret == 1)
+				list_add_tail(&skb->list, &sublist);
+		}
+
+		migrate_enable();
+
+		/* Put passed packets back on main list */
+		list_splice(&sublist, head);
+#else
 		nf_hook_state_init(&state, hook, pf, in, out, sk, net, okfn);
 
 		nf_hook_slow_list(head, &state, hook_head);
+#endif
 	}
 	rcu_read_unlock();
 }
diff --git a/include/net/netfilter/nf_hook_bpf.h b/include/net/netfilter/nf_hook_bpf.h
new file mode 100644
index 000000000000..1792f97a806d
--- /dev/null
+++ b/include/net/netfilter/nf_hook_bpf.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+struct bpf_dispatcher;
+struct bpf_prog;
+
+struct bpf_prog *nf_hook_bpf_create_fb(void);
+
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+struct bpf_prog *nf_hook_bpf_create(const struct nf_hook_entries *n);
+
+void nf_hook_bpf_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, struct bpf_prog *to);
+#else
+static inline void
+nf_hook_bpf_change_prog(struct bpf_dispatcher *d, struct bpf_prog *f, struct bpf_prog *t)
+{
+}
+
+static inline struct bpf_prog *nf_hook_bpf_create(const struct nf_hook_entries *n)
+{
+	return NULL;
+}
+#endif
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 4b8d04640ff3..2610786b6ad8 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -30,6 +30,16 @@ config NETFILTER_FAMILY_BRIDGE
 config NETFILTER_FAMILY_ARP
 	bool
 
+config HAVE_NF_HOOK_BPF
+	bool
+
+config NF_HOOK_BPF
+	bool "netfilter base hook bpf translator"
+	depends on BPF_JIT
+	help
+	  This unrolls the nf_hook_slow interpreter loop with an
+	  auto-generated BPF program.
+
 config NETFILTER_NETLINK_HOOK
 	tristate "Netfilter base hook dump support"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 06df49ea6329..e465659e87ad 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -21,6 +21,7 @@ nf_conntrack-$(CONFIG_DEBUG_INFO_BTF) += nf_conntrack_bpf.o
 endif
 
 obj-$(CONFIG_NETFILTER) = netfilter.o
+obj-$(CONFIG_NF_HOOK_BPF) += nf_hook_bpf.o
 
 obj-$(CONFIG_NETFILTER_NETLINK) += nfnetlink.o
 obj-$(CONFIG_NETFILTER_NETLINK_ACCT) += nfnetlink_acct.o
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 17165f9cf4a1..6888c7fd5aeb 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -24,6 +24,7 @@
 #include <linux/rcupdate.h>
 #include <net/net_namespace.h>
 #include <net/netfilter/nf_queue.h>
+#include <net/netfilter/nf_hook_bpf.h>
 #include <net/sock.h>
 
 #include "nf_internals.h"
@@ -47,6 +48,33 @@ static DEFINE_MUTEX(nf_hook_mutex);
 #define nf_entry_dereference(e) \
 	rcu_dereference_protected(e, lockdep_is_held(&nf_hook_mutex))
 
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+DEFINE_BPF_DISPATCHER(nf_hook_base);
+
+#define NF_DISPATCHER_PTR	BPF_DISPATCHER_PTR(nf_hook_base)
+#else
+#define NF_DISPATCHER_PTR	NULL
+#endif
+
+static struct bpf_prog *fallback_nf_hook_slow;
+
+static void nf_hook_bpf_prog_set(struct nf_hook_entries *e,
+				 struct bpf_prog *p)
+{
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	WRITE_ONCE(e->hook_prog, p);
+#endif
+}
+
+static struct bpf_prog *nf_hook_bpf_prog_get(struct nf_hook_entries *e)
+{
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	if (e)
+		return e->hook_prog;
+#endif
+	return NULL;
+}
+
 static struct nf_hook_entries *allocate_hook_entries_size(u16 num)
 {
 	struct nf_hook_entries *e;
@@ -58,9 +86,23 @@ static struct nf_hook_entries *allocate_hook_entries_size(u16 num)
 	if (num == 0)
 		return NULL;
 
-	e = kvzalloc(alloc, GFP_KERNEL_ACCOUNT);
-	if (e)
-		e->num_hook_entries = num;
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	if (!fallback_nf_hook_slow) {
+		/* never free'd */
+		fallback_nf_hook_slow = nf_hook_bpf_create_fb();
+
+		if (!fallback_nf_hook_slow)
+			return NULL;
+	}
+#endif
+
+	e = kvzalloc(alloc, GFP_KERNEL);
+	if (!e)
+		return NULL;
+
+	e->num_hook_entries = num;
+	nf_hook_bpf_prog_set(e, fallback_nf_hook_slow);
+
 	return e;
 }
 
@@ -98,6 +140,29 @@ static const struct nf_hook_ops dummy_ops = {
 	.priority = INT_MIN,
 };
 
+static void nf_hook_entries_grow_bpf(const struct nf_hook_entries *old,
+				     struct nf_hook_entries *new)
+{
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	struct bpf_prog *hook_bpf_prog = nf_hook_bpf_create(new);
+
+	/* allocate_hook_entries_size() pre-inits new->hook_prog
+	 * to a fallback program that calls nf_hook_slow().
+	 */
+	if (hook_bpf_prog) {
+		struct bpf_prog *old_prog = NULL;
+
+		new->hook_prog = hook_bpf_prog;
+
+		if (old)
+			old_prog = old->hook_prog;
+
+		nf_hook_bpf_change_prog(BPF_DISPATCHER_PTR(nf_hook_base),
+					old_prog, hook_bpf_prog);
+	}
+#endif
+}
+
 static struct nf_hook_entries *
 nf_hook_entries_grow(const struct nf_hook_entries *old,
 		     const struct nf_hook_ops *reg)
@@ -156,6 +221,7 @@ nf_hook_entries_grow(const struct nf_hook_entries *old,
 		new->hooks[nhooks].priv = reg->priv;
 	}
 
+	nf_hook_entries_grow_bpf(old, new);
 	return new;
 }
 
@@ -221,6 +287,7 @@ static void *__nf_hook_entries_try_shrink(struct nf_hook_entries *old,
 					  struct nf_hook_entries __rcu **pp)
 {
 	unsigned int i, j, skip = 0, hook_entries;
+	struct bpf_prog *hook_bpf_prog = NULL;
 	struct nf_hook_entries *new = NULL;
 	struct nf_hook_ops **orig_ops;
 	struct nf_hook_ops **new_ops;
@@ -244,8 +311,13 @@ static void *__nf_hook_entries_try_shrink(struct nf_hook_entries *old,
 
 	hook_entries -= skip;
 	new = allocate_hook_entries_size(hook_entries);
-	if (!new)
+	if (!new) {
+		struct bpf_prog *old_prog = nf_hook_bpf_prog_get(old);
+
+		nf_hook_bpf_prog_set(old, fallback_nf_hook_slow);
+		nf_hook_bpf_change_prog(NF_DISPATCHER_PTR, old_prog, NULL);
 		return NULL;
+	}
 
 	new_ops = nf_hook_entries_get_hook_ops(new);
 	for (i = 0, j = 0; i < old->num_hook_entries; i++) {
@@ -256,7 +328,13 @@ static void *__nf_hook_entries_try_shrink(struct nf_hook_entries *old,
 		j++;
 	}
 	hooks_validate(new);
+
+	/* if this fails fallback prog calls nf_hook_slow. */
+	hook_bpf_prog = nf_hook_bpf_create(new);
+	if (hook_bpf_prog)
+		nf_hook_bpf_prog_set(new, hook_bpf_prog);
 out_assign:
+	nf_hook_bpf_change_prog(NF_DISPATCHER_PTR, nf_hook_bpf_prog_get(old), hook_bpf_prog);
 	rcu_assign_pointer(*pp, new);
 	return old;
 }
@@ -609,6 +687,7 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state,
 	int ret;
 
 	state->skb = skb;
+
 	for (; s < e->num_hook_entries; s++) {
 		verdict = nf_hook_entry_hookfn(&e->hooks[s], skb, state);
 		switch (verdict & NF_VERDICT_MASK) {
@@ -783,6 +862,11 @@ int __init netfilter_init(void)
 	if (ret < 0)
 		goto err_pernet;
 
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	fallback_nf_hook_slow = nf_hook_bpf_create_fb();
+	WARN_ON_ONCE(!fallback_nf_hook_slow);
+#endif
+
 	return 0;
 err_pernet:
 	unregister_pernet_subsys(&netfilter_net_ops);
diff --git a/net/netfilter/nf_hook_bpf.c b/net/netfilter/nf_hook_bpf.c
new file mode 100644
index 000000000000..dab13b803801
--- /dev/null
+++ b/net/netfilter/nf_hook_bpf.c
@@ -0,0 +1,424 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/string.h>
+#include <linux/hashtable.h>
+#include <linux/jhash.h>
+#include <linux/netfilter.h>
+
+#include <net/netfilter/nf_hook_bpf.h>
+#include <net/netfilter/nf_queue.h>
+
+#define JMP_INVALID 0
+#define JIT_SIZE_MAX 0xffff
+
+/* BPF translator for netfilter hooks.
+ *
+ * Create a bpf program that can be called *instead* of nf_hook_slow().
+ * This program thus has same return value as nf_hook_slow and
+ * handles nfqueue and packet drops internally.
+ * Call nf_hook_bpf_create(struct nf_hook_entries *e)
+ * to unroll the functions described by nf_hook_entries into such
+ * a bpf program.
+ *
+ * These bpf programs are called/run from nf_hook() inline function.
+ *
+ * Register usage is:
+ *
+ * BPF_REG_0: verdict.
+ * BPF_REG_1: struct nf_hook_state *
+ * BPF_REG_2: reserved as arg to nf_queue()
+ * BPF_REG_3: reserved as arg to nf_queue()
+ *
+ * Prologue storage:
+ * BPF_REG_6: copy of REG_1 (original struct nf_hook_state *)
+ * BPF_REG_7: copy of original state->priv value
+ * BPF_REG_8: copy of state->hook_index
+ */
+struct nf_hook_prog {
+	struct bpf_insn *insns;
+	unsigned int pos;
+};
+
+static bool emit(struct nf_hook_prog *p, struct bpf_insn insn)
+{
+	if (WARN_ON_ONCE(p->pos >= BPF_MAXINSNS))
+		return false;
+
+	p->insns[p->pos] = insn;
+	p->pos++;
+	return true;
+}
+
+static bool xlate_one_hook(struct nf_hook_prog *p, const struct nf_hook_entries *e,
+			   const struct nf_hook_entry *h)
+{
+	int width = bytes_to_bpf_size(sizeof(h->priv));
+
+	/* if priv is NULL, the called hookfn does not use the priv member. */
+	if (!h->priv)
+		goto emit_hook_call;
+
+	if (WARN_ON_ONCE(width < 0))
+		return false;
+
+	/* x = entries[s]->priv; */
+	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_2, BPF_REG_7,
+				 (unsigned long)&h->priv - (unsigned long)e)))
+		return false;
+
+	/* state->priv = x */
+	if (!emit(p, BPF_STX_MEM(width, BPF_REG_6, BPF_REG_2,
+				 offsetof(struct nf_hook_state, priv))))
+		return false;
+
+emit_hook_call:
+	if (!emit(p, BPF_EMIT_CALL(h->hook)))
+		return false;
+
+	/* Only advance to next hook on ACCEPT verdict.
+	 * Else, skip rest and move to tail.
+	 *
+	 * Postprocessing patches the jump offset to the
+	 * correct position, after last hook.
+	 */
+	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, NF_ACCEPT, JMP_INVALID)))
+		return false;
+
+	return true;
+}
+
+static bool emit_mov_ptr_reg(struct nf_hook_prog *p, u8 dreg, u8 sreg)
+{
+	if (sizeof(void *) == sizeof(u64))
+		return emit(p, BPF_MOV64_REG(dreg, sreg));
+	if (sizeof(void *) == sizeof(u32))
+		return emit(p, BPF_MOV32_REG(dreg, sreg));
+
+	return false;
+}
+
+static bool do_prologue(struct nf_hook_prog *p)
+{
+	int width = bytes_to_bpf_size(sizeof(void *));
+
+	if (WARN_ON_ONCE(width < 0))
+		return false;
+
+	/* argument to program is a pointer to struct nf_hook_state, in BPF_REG_1. */
+	if (!emit_mov_ptr_reg(p, BPF_REG_6, BPF_REG_1))
+		return false;
+
+	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_7, BPF_REG_1,
+				 offsetof(struct nf_hook_state, priv))))
+		return false;
+
+	/* could load state->hook_index, but we don't support index > 0 for bpf call. */
+	if (!emit(p, BPF_MOV32_IMM(BPF_REG_8, 0)))
+		return false;
+
+	return true;
+}
+
+static void patch_hook_jumps(struct nf_hook_prog *p)
+{
+	unsigned int i;
+
+	if (!p->insns)
+		return;
+
+	for (i = 0; i < p->pos; i++) {
+		if (BPF_CLASS(p->insns[i].code) != BPF_JMP)
+			continue;
+
+		if (p->insns[i].code == (BPF_EXIT | BPF_JMP))
+			continue;
+		if (p->insns[i].code == (BPF_CALL | BPF_JMP))
+			continue;
+
+		if (p->insns[i].off != JMP_INVALID)
+			continue;
+		p->insns[i].off = p->pos - i - 1;
+	}
+}
+
+static bool emit_retval(struct nf_hook_prog *p, int retval)
+{
+	if (!emit(p, BPF_MOV32_IMM(BPF_REG_0, retval)))
+		return false;
+
+	return emit(p, BPF_EXIT_INSN());
+}
+
+static bool emit_nf_hook_slow(struct nf_hook_prog *p)
+{
+	int width = bytes_to_bpf_size(sizeof(void *));
+
+	/* restore the original state->priv. */
+	if (!emit(p, BPF_STX_MEM(width, BPF_REG_6, BPF_REG_7,
+				 offsetof(struct nf_hook_state, priv))))
+		return false;
+
+	/* arg1 is state->skb */
+	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_1, BPF_REG_6,
+				 offsetof(struct nf_hook_state, skb))))
+		return false;
+
+	/* arg2 is "struct nf_hook_state *" */
+	if (!emit(p, BPF_MOV64_REG(BPF_REG_2, BPF_REG_6)))
+		return false;
+
+	/* arg3 is nf_hook_entries (original state->priv) */
+	if (!emit(p, BPF_MOV64_REG(BPF_REG_3, BPF_REG_7)))
+		return false;
+
+	if (!emit(p, BPF_EMIT_CALL(nf_hook_slow)))
+		return false;
+
+	/* No further action needed, return retval provided by nf_hook_slow */
+	return emit(p, BPF_EXIT_INSN());
+}
+
+static bool emit_nf_queue(struct nf_hook_prog *p)
+{
+	int width = bytes_to_bpf_size(sizeof(void *));
+
+	if (width < 0) {
+		WARN_ON_ONCE(1);
+		return false;
+	}
+
+	/* int nf_queue(struct sk_buff *skb, struct nf_hook_state *state, unsigned int verdict) */
+	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_1, BPF_REG_6,
+				 offsetof(struct nf_hook_state, skb))))
+		return false;
+	if (!emit(p, BPF_STX_MEM(BPF_H, BPF_REG_6, BPF_REG_8,
+				 offsetof(struct nf_hook_state, hook_index))))
+		return false;
+	/* arg2: struct nf_hook_state * */
+	if (!emit(p, BPF_MOV64_REG(BPF_REG_2, BPF_REG_6)))
+		return false;
+	/* arg3: original hook return value: (NUM << NF_VERDICT_QBITS | NF_QUEUE) */
+	if (!emit(p, BPF_MOV32_REG(BPF_REG_3, BPF_REG_0)))
+		return false;
+	if (!emit(p, BPF_EMIT_CALL(nf_queue)))
+		return false;
+
+	/* Check nf_queue return value.  Abnormal case: nf_queue returned != 0.
+	 *
+	 * Fall back to nf_hook_slow().
+	 */
+	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2)))
+		return false;
+
+	/* Normal case: skb was stolen. Return 0. */
+	return emit_retval(p, 0);
+}
+
+static bool do_epilogue_base_hooks(struct nf_hook_prog *p)
+{
+	int width = bytes_to_bpf_size(sizeof(void *));
+
+	if (WARN_ON_ONCE(width < 0))
+		return false;
+
+	/* last 'hook'. We arrive here if previous hook returned ACCEPT,
+	 * i.e. all hooks passed -- we are done.
+	 *
+	 * Return 1, skb can continue traversing network stack.
+	 */
+	if (!emit_retval(p, 1))
+		return false;
+
+	/* Patch all hook jumps, in case any of these are taken
+	 * we need to jump to this location.
+	 *
+	 * This happens when verdict is != ACCEPT.
+	 */
+	patch_hook_jumps(p);
+
+	/* need to ignore upper 24 bits, might contain errno or queue number */
+	if (!emit(p, BPF_MOV32_REG(BPF_REG_3, BPF_REG_0)))
+		return false;
+	if (!emit(p, BPF_ALU32_IMM(BPF_AND, BPF_REG_3, 0xff)))
+		return false;
+
+	/* ACCEPT handled, check STOLEN. */
+	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_3, NF_STOLEN, 2)))
+		return false;
+
+	if (!emit_retval(p, 0))
+		return false;
+
+	/* ACCEPT and STOLEN handled.  Check DROP next */
+	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_3, NF_DROP, 1 + 2 + 2 + 2 + 2)))
+		return false;
+
+	/* First step. Extract the errno number. 1 insn. */
+	if (!emit(p, BPF_ALU32_IMM(BPF_RSH, BPF_REG_0, NF_VERDICT_QBITS)))
+		return false;
+
+	/* Second step: replace errno with EPERM if it was 0. 2 insns. */
+	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1)))
+		return false;
+	if (!emit(p, BPF_MOV32_IMM(BPF_REG_0, EPERM)))
+		return false;
+
+	/* Third step: negate reg0: Caller expects -EFOO and stash the result.  2 insns. */
+	if (!emit(p, BPF_ALU32_IMM(BPF_NEG, BPF_REG_0, 0)))
+		return false;
+	if (!emit(p, BPF_MOV32_REG(BPF_REG_8, BPF_REG_0)))
+		return false;
+
+	/* Fourth step: free the skb. 2 insns. */
+	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_1, BPF_REG_6,
+				 offsetof(struct nf_hook_state, skb))))
+		return false;
+	if (!emit(p, BPF_EMIT_CALL(kfree_skb)))
+		return false;
+
+	/* Last step: return. 2 insns. */
+	if (!emit(p, BPF_MOV32_REG(BPF_REG_0, BPF_REG_8)))
+		return false;
+	if (!emit(p, BPF_EXIT_INSN()))
+		return false;
+
+	/* ACCEPT, STOLEN and DROP have been handled.
+	 * REPEAT and STOP are not allowed anymore for individual hook functions.
+	 * This leaves NFQUEUE as the only remaining return value.
+	 *
+	 * In this case BPF_REG_0 still contains the original verdict of
+	 * '(NUM << NF_VERDICT_QBITS | NF_QUEUE)', so pass it to nf_queue() as-is.
+	 */
+	if (!emit_nf_queue(p))
+		return false;
+
+	/* Increment hook index and store it in nf_hook_state so nf_hook_slow will
+	 * start at the next hook, if any.
+	 */
+	if (!emit(p, BPF_ALU32_IMM(BPF_ADD, BPF_REG_8, 1)))
+		return false;
+	if (!emit(p, BPF_STX_MEM(BPF_H, BPF_REG_6, BPF_REG_8,
+				 offsetof(struct nf_hook_state, hook_index))))
+		return false;
+
+	return emit_nf_hook_slow(p);
+}
+
+static int nf_hook_prog_init(struct nf_hook_prog *p)
+{
+	memset(p, 0, sizeof(*p));
+
+	p->insns = kcalloc(BPF_MAXINSNS, sizeof(*p->insns), GFP_KERNEL);
+	if (!p->insns)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void nf_hook_prog_free(struct nf_hook_prog *p)
+{
+	kfree(p->insns);
+}
+
+static int xlate_base_hooks(struct nf_hook_prog *p, const struct nf_hook_entries *e)
+{
+	unsigned int i, len;
+
+	len = e->num_hook_entries;
+
+	if (!do_prologue(p))
+		goto out;
+
+	for (i = 0; i < len; i++) {
+		if (!xlate_one_hook(p, e, &e->hooks[i]))
+			goto out;
+
+		if (i + 1 < len) {
+			if (!emit(p, BPF_MOV64_REG(BPF_REG_1, BPF_REG_6)))
+				goto out;
+
+			if (!emit(p, BPF_ALU32_IMM(BPF_ADD, BPF_REG_8, 1)))
+				goto out;
+		}
+	}
+
+	if (!do_epilogue_base_hooks(p))
+		goto out;
+
+	return 0;
+out:
+	return -EINVAL;
+}
+
+static struct bpf_prog *nf_hook_jit_compile(struct bpf_insn *insns, unsigned int len)
+{
+	struct bpf_prog *prog;
+	int err = 0;
+
+	prog = bpf_prog_alloc(bpf_prog_size(len), 0);
+	if (!prog)
+		return NULL;
+
+	prog->len = len;
+	prog->type = BPF_PROG_TYPE_SOCKET_FILTER;
+	memcpy(prog->insnsi, insns, prog->len * sizeof(struct bpf_insn));
+
+	prog = bpf_prog_select_runtime(prog, &err);
+	if (err) {
+		bpf_prog_free(prog);
+		return NULL;
+	}
+
+	return prog;
+}
+
+/* fallback program, invokes nf_hook_slow interpreter.
+ *
+ * Used when a hook is unregistered and new/replacement program cannot
+ * be compiled for some reason.
+ */
+struct bpf_prog *nf_hook_bpf_create_fb(void)
+{
+	struct bpf_prog *prog;
+	struct nf_hook_prog p;
+	int err;
+
+	err = nf_hook_prog_init(&p);
+	if (err)
+		return NULL;
+
+	if (!do_prologue(&p))
+		goto err;
+
+	if (!emit_nf_hook_slow(&p))
+		goto err;
+
+	prog = nf_hook_jit_compile(p.insns, p.pos);
+err:
+	nf_hook_prog_free(&p);
+	return prog;
+}
+
+struct bpf_prog *nf_hook_bpf_create(const struct nf_hook_entries *new)
+{
+	struct bpf_prog *prog;
+	struct nf_hook_prog p;
+	int err;
+
+	err = nf_hook_prog_init(&p);
+	if (err)
+		return NULL;
+
+	err = xlate_base_hooks(&p, new);
+	if (err)
+		goto err;
+
+	prog = nf_hook_jit_compile(p.insns, p.pos);
+err:
+	nf_hook_prog_free(&p);
+	return prog;
+}
+
+void nf_hook_bpf_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, struct bpf_prog *to)
+{
+	bpf_dispatcher_change_prog(d, from, to);
+}
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 7/9] netfilter: core: do not rebuild bpf program on dying netns
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
                   ` (5 preceding siblings ...)
  2022-10-05 14:13 ` [RFC v2 6/9] netfilter: add bpf base hook program generator Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 8/9] netfilter: netdev: switch to invocation via bpf Florian Westphal
  2022-10-05 14:13 ` [RFC v2 9/9] netfilter: hook_jit: add prog cache Florian Westphal
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

We can save a few cycles on netns destruction.
When a hook is removed we can just skip building a new
program with the remaining hooks; those will be removed too
in the immediate future.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/core.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 6888c7fd5aeb..71974c55de50 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -272,6 +272,7 @@ EXPORT_SYMBOL_GPL(nf_hook_entries_insert_raw);
  *
  * @old -- current hook blob at @pp
  * @pp -- location of hook blob
+ * @recompile -- false if bpf prog should not be replaced
  *
  * Hook unregistration must always succeed, so to-be-removed hooks
  * are replaced by a dummy one that will just move to next hook.
@@ -284,7 +285,8 @@ EXPORT_SYMBOL_GPL(nf_hook_entries_insert_raw);
  * Returns address to free, or NULL.
  */
 static void *__nf_hook_entries_try_shrink(struct nf_hook_entries *old,
-					  struct nf_hook_entries __rcu **pp)
+					  struct nf_hook_entries __rcu **pp,
+					  bool recompile)
 {
 	unsigned int i, j, skip = 0, hook_entries;
 	struct bpf_prog *hook_bpf_prog = NULL;
@@ -329,10 +331,12 @@ static void *__nf_hook_entries_try_shrink(struct nf_hook_entries *old,
 	}
 	hooks_validate(new);
 
-	/* if this fails fallback prog calls nf_hook_slow. */
-	hook_bpf_prog = nf_hook_bpf_create(new);
-	if (hook_bpf_prog)
-		nf_hook_bpf_prog_set(new, hook_bpf_prog);
+	if (recompile) {
+		/* if this fails fallback prog calls nf_hook_slow. */
+		hook_bpf_prog = nf_hook_bpf_create(new);
+		if (hook_bpf_prog)
+			nf_hook_bpf_prog_set(new, hook_bpf_prog);
+	}
 out_assign:
 	nf_hook_bpf_change_prog(NF_DISPATCHER_PTR, nf_hook_bpf_prog_get(old), hook_bpf_prog);
 	rcu_assign_pointer(*pp, new);
@@ -581,7 +585,7 @@ static void __nf_unregister_net_hook(struct net *net, int pf,
 		WARN_ONCE(1, "hook not found, pf %d num %d", pf, reg->hooknum);
 	}
 
-	p = __nf_hook_entries_try_shrink(p, pp);
+	p = __nf_hook_entries_try_shrink(p, pp, check_net(net));
 	mutex_unlock(&nf_hook_mutex);
 	if (!p)
 		return;
@@ -612,7 +616,7 @@ void nf_hook_entries_delete_raw(struct nf_hook_entries __rcu **pp,
 
 	p = rcu_dereference_raw(*pp);
 	if (nf_remove_net_hook(p, reg)) {
-		p = __nf_hook_entries_try_shrink(p, pp);
+		p = __nf_hook_entries_try_shrink(p, pp, false);
 		nf_hook_entries_free(p);
 	}
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 8/9] netfilter: netdev: switch to invocation via bpf
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
                   ` (6 preceding siblings ...)
  2022-10-05 14:13 ` [RFC v2 7/9] netfilter: core: do not rebuild bpf program on dying netns Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  2022-10-05 14:13 ` [RFC v2 9/9] netfilter: hook_jit: add prog cache Florian Westphal
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

Handle ingress and egress hook invocation via bpf.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_netdev.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/netfilter_netdev.h b/include/linux/netfilter_netdev.h
index 92996b1ac90f..b0d50a28626f 100644
--- a/include/linux/netfilter_netdev.h
+++ b/include/linux/netfilter_netdev.h
@@ -19,6 +19,9 @@ static inline bool nf_hook_ingress_active(const struct sk_buff *skb)
 static inline int nf_hook_ingress(struct sk_buff *skb)
 {
 	struct nf_hook_entries *e = rcu_dereference(skb->dev->nf_hooks_ingress);
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	const struct bpf_prog *prog;
+#endif
 	struct nf_hook_state state;
 	int ret;
 
@@ -31,7 +34,19 @@ static inline int nf_hook_ingress(struct sk_buff *skb)
 	nf_hook_state_init(&state, NF_NETDEV_INGRESS,
 			   NFPROTO_NETDEV, skb->dev, NULL, NULL,
 			   dev_net(skb->dev), NULL);
+
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	prog = READ_ONCE(e->hook_prog);
+
+	state.priv = (void *)e;
+	state.skb = skb;
+
+	migrate_disable();
+	ret = bpf_prog_run_nf(prog, &state);
+	migrate_enable();
+#else
 	ret = nf_hook_slow(skb, &state, e);
+#endif
 	if (ret == 0)
 		return -1;
 
@@ -87,6 +102,9 @@ static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
 {
 	struct nf_hook_entries *e;
 	struct nf_hook_state state;
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	const struct bpf_prog *prog;
+#endif
 	int ret;
 
 #ifdef CONFIG_NETFILTER_SKIP_EGRESS
@@ -104,7 +122,18 @@ static inline struct sk_buff *nf_hook_egress(struct sk_buff *skb, int *rc,
 
 	/* nf assumes rcu_read_lock, not just read_lock_bh */
 	rcu_read_lock();
+#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
+	prog = READ_ONCE(e->hook_prog);
+
+	state.priv = (void *)e;
+	state.skb = skb;
+
+	migrate_disable();
+	ret = bpf_prog_run_nf(prog, &state);
+	migrate_enable();
+#else
 	ret = nf_hook_slow(skb, &state, e);
+#endif
 	rcu_read_unlock();
 
 	if (ret == 1) {
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 9/9] netfilter: hook_jit: add prog cache
  2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
                   ` (7 preceding siblings ...)
  2022-10-05 14:13 ` [RFC v2 8/9] netfilter: netdev: switch to invocation via bpf Florian Westphal
@ 2022-10-05 14:13 ` Florian Westphal
  8 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-05 14:13 UTC (permalink / raw)
  To: bpf; +Cc: Florian Westphal

This allows re-use of the same program.  For example, an nft
ruleset that attaches filter basechains to input, forward and output would
use the same program for all three hook points.

The cache is intentionally netns agnostic, so the same config
in different netns will use the same programs.
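
A minimal sketch of the intended cache semantics (illustrative only, not
part of the patch; 'entries_input' and 'entries_forward' are hypothetical
hook blobs that list the same hook functions in the same order):

	struct bpf_prog *p1, *p2;

	p1 = nf_hook_bpf_create(entries_input);		/* compiles and caches */
	p2 = nf_hook_bpf_create(entries_forward);	/* cache hit */

	/* p1 == p2: the second call only takes a reference on the cached
	 * program instead of building a new one.
	 */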

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_hook_bpf.c | 150 ++++++++++++++++++++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/net/netfilter/nf_hook_bpf.c b/net/netfilter/nf_hook_bpf.c
index dab13b803801..0ca2e4404b1b 100644
--- a/net/netfilter/nf_hook_bpf.c
+++ b/net/netfilter/nf_hook_bpf.c
@@ -38,6 +38,24 @@ struct nf_hook_prog {
 	unsigned int pos;
 };
 
+struct nf_hook_bpf_prog {
+	struct rcu_head rcu_head;
+
+	struct hlist_node node_key;
+	struct hlist_node node_prog;
+	u32 key;
+	u16 hook_count;
+	refcount_t refcnt;
+	struct bpf_prog	*prog;
+	unsigned long hooks[32];
+};
+
+#define NF_BPF_PROG_HT_BITS	8
+
+/* users need to hold nf_hook_mutex */
+static DEFINE_HASHTABLE(nf_bpf_progs_ht_key, NF_BPF_PROG_HT_BITS);
+static DEFINE_HASHTABLE(nf_bpf_progs_ht_prog, NF_BPF_PROG_HT_BITS);
+
 static bool emit(struct nf_hook_prog *p, struct bpf_insn insn)
 {
 	if (WARN_ON_ONCE(p->pos >= BPF_MAXINSNS))
@@ -398,12 +416,112 @@ struct bpf_prog *nf_hook_bpf_create_fb(void)
 	return prog;
 }
 
+static u32 nf_hook_entries_hash(const struct nf_hook_entries *new)
+{
+	u32 i = 0, hook_count = new->num_hook_entries;
+	u32 a, b, c;
+
+	a = b = c = JHASH_INITVAL + hook_count;
+
+	while (hook_count > 3) {
+		a += hash32_ptr(new->hooks[i].hook);
+		b += hash32_ptr(new->hooks[i + 1].hook);
+		c += hash32_ptr(new->hooks[i + 2].hook);
+		__jhash_mix(a, b, c);
+		hook_count -= 3;
+		i += 3;
+	}
+
+	switch (hook_count) {
+	case 3:
+		c += hash32_ptr(new->hooks[i + 2].hook);
+		fallthrough;
+	case 2:
+		b += hash32_ptr(new->hooks[i + 1].hook);
+		fallthrough;
+	case 1:
+		a += hash32_ptr(new->hooks[i].hook);
+		__jhash_final(a, b, c);
+		break;
+	}
+
+	return c;
+}
+
+static struct bpf_prog *nf_hook_bpf_find_prog_by_key(const struct nf_hook_entries *new, u32 key)
+{
+	int i, hook_count = new->num_hook_entries;
+	struct nf_hook_bpf_prog *pc;
+
+	hash_for_each_possible(nf_bpf_progs_ht_key, pc, node_key, key) {
+		if (pc->hook_count != hook_count ||
+		    pc->key != key)
+			continue;
+
+		for (i = 0; i < hook_count; i++) {
+			if (pc->hooks[i] != (unsigned long)new->hooks[i].hook)
+				break;
+		}
+
+		if (i == hook_count) {
+			refcount_inc(&pc->refcnt);
+			return pc->prog;
+		}
+	}
+
+	return NULL;
+}
+
+static struct nf_hook_bpf_prog *nf_hook_bpf_find_prog(const struct bpf_prog *p)
+{
+	struct nf_hook_bpf_prog *pc;
+
+	hash_for_each_possible(nf_bpf_progs_ht_prog, pc, node_prog, (unsigned long)p) {
+		if (pc->prog == p)
+			return pc;
+	}
+
+	return NULL;
+}
+
+static void nf_hook_bpf_prog_store(const struct nf_hook_entries *new,
+				   struct bpf_prog *prog, u32 key)
+{
+	unsigned int i, hook_count = new->num_hook_entries;
+	struct nf_hook_bpf_prog *alloc;
+
+	if (hook_count >= ARRAY_SIZE(alloc->hooks))
+		return;
+
+	alloc = kzalloc(sizeof(*alloc), GFP_KERNEL);
+	if (!alloc)
+		return;
+
+	alloc->hook_count = new->num_hook_entries;
+	alloc->prog = prog;
+	alloc->key = key;
+
+	for (i = 0; i < hook_count; i++)
+		alloc->hooks[i] = (unsigned long)new->hooks[i].hook;
+
+	hash_add(nf_bpf_progs_ht_key, &alloc->node_key, key);
+	hash_add(nf_bpf_progs_ht_prog, &alloc->node_prog, (unsigned long)prog);
+	refcount_set(&alloc->refcnt, 1);
+
+	bpf_prog_inc(prog);
+}
+
 struct bpf_prog *nf_hook_bpf_create(const struct nf_hook_entries *new)
 {
+	u32 key = nf_hook_entries_hash(new);
 	struct bpf_prog *prog;
 	struct nf_hook_prog p;
 	int err;
 
+	prog = nf_hook_bpf_find_prog_by_key(new, key);
+	if (prog)
+		return prog;
+
 	err = nf_hook_prog_init(&p);
 	if (err)
 		return NULL;
@@ -413,12 +531,44 @@ struct bpf_prog *nf_hook_bpf_create(const struct nf_hook_entries *new)
 		goto err;
 
 	prog = nf_hook_jit_compile(p.insns, p.pos);
+	if (prog)
+		nf_hook_bpf_prog_store(new, prog, key);
 err:
 	nf_hook_prog_free(&p);
 	return prog;
 }
 
+static void __nf_hook_free_prog(struct rcu_head *head)
+{
+	struct nf_hook_bpf_prog *old = container_of(head, struct nf_hook_bpf_prog, rcu_head);
+
+	bpf_prog_put(old->prog);
+	kfree(old);
+}
+
+static void nf_hook_free_prog(struct nf_hook_bpf_prog *old)
+{
+	call_rcu(&old->rcu_head, __nf_hook_free_prog);
+}
+
 void nf_hook_bpf_change_prog(struct bpf_dispatcher *d, struct bpf_prog *from, struct bpf_prog *to)
 {
+	if (from == to)
+		return;
+
+	if (from) {
+		struct nf_hook_bpf_prog *old;
+
+		old = nf_hook_bpf_find_prog(from);
+		if (old) {
+			WARN_ON_ONCE(from != old->prog);
+			if (refcount_dec_and_test(&old->refcnt)) {
+				hash_del(&old->node_key);
+				hash_del(&old->node_prog);
+				nf_hook_free_prog(old);
+			}
+		}
+	}
+
 	bpf_dispatcher_change_prog(d, from, to);
 }
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC v2 6/9] netfilter: add bpf base hook program generator
  2022-10-05 14:13 ` [RFC v2 6/9] netfilter: add bpf base hook program generator Florian Westphal
@ 2022-10-06  2:52   ` Alexei Starovoitov
  2022-10-06 13:51     ` Florian Westphal
  2022-10-07 11:45     ` Florian Westphal
  0 siblings, 2 replies; 15+ messages in thread
From: Alexei Starovoitov @ 2022-10-06  2:52 UTC (permalink / raw)
  To: Florian Westphal; +Cc: bpf

On Wed, Oct 05, 2022 at 04:13:06PM +0200, Florian Westphal wrote:
>  
> @@ -254,11 +269,24 @@ static inline int nf_hook(u_int8_t pf, unsigned int hook, struct net *net,
>  
>  	if (hook_head) {
>  		struct nf_hook_state state;
> +#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
> +		const struct bpf_prog *p = READ_ONCE(hook_head->hook_prog);
> +
> +		nf_hook_state_init(&state, hook, pf, indev, outdev,
> +				   sk, net, okfn);
> +
> +		state.priv = (void *)hook_head;
> +		state.skb = skb;
>  
> +		migrate_disable();
> +		ret = bpf_prog_run_nf(p, &state);
> +		migrate_enable();

Since the generated prog doesn't do any per-cpu work and doesn't use any maps
there is no need for migrate_disable.
There is cant_migrate() in __bpf_prog_run(), but it's probably better
to silence that instead of adding migrate_disable/enable overhead.
I guess it's ok for now.

> +static bool emit_mov_ptr_reg(struct nf_hook_prog *p, u8 dreg, u8 sreg)
> +{
> +	if (sizeof(void *) == sizeof(u64))
> +		return emit(p, BPF_MOV64_REG(dreg, sreg));
> +	if (sizeof(void *) == sizeof(u32))
> +		return emit(p, BPF_MOV32_REG(dreg, sreg));

I bet that was never tested :) because... see below.

> +
> +	return false;
> +}
> +
> +static bool do_prologue(struct nf_hook_prog *p)
> +{
> +	int width = bytes_to_bpf_size(sizeof(void *));
> +
> +	if (WARN_ON_ONCE(width < 0))
> +		return false;
> +
> +	/* argument to program is a pointer to struct nf_hook_state, in BPF_REG_1. */
> +	if (!emit_mov_ptr_reg(p, BPF_REG_6, BPF_REG_1))
> +		return false;
> +
> +	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_7, BPF_REG_1,
> +				 offsetof(struct nf_hook_state, priv))))
> +		return false;
> +
> +	/* could load state->hook_index, but we don't support index > 0 for bpf call. */
> +	if (!emit(p, BPF_MOV32_IMM(BPF_REG_8, 0)))
> +		return false;
> +
> +	return true;
> +}
> +
> +static void patch_hook_jumps(struct nf_hook_prog *p)
> +{
> +	unsigned int i;
> +
> +	if (!p->insns)
> +		return;
> +
> +	for (i = 0; i < p->pos; i++) {
> +		if (BPF_CLASS(p->insns[i].code) != BPF_JMP)
> +			continue;
> +
> +		if (p->insns[i].code == (BPF_EXIT | BPF_JMP))
> +			continue;
> +		if (p->insns[i].code == (BPF_CALL | BPF_JMP))
> +			continue;
> +
> +		if (p->insns[i].off != JMP_INVALID)
> +			continue;
> +		p->insns[i].off = p->pos - i - 1;

Pls add a check that it fits in 16-bits.
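
Something along these lines, perhaps (untested sketch; would also require
letting patch_hook_jumps() report failure instead of returning void):

	int off = p->pos - i - 1;

	if (WARN_ON_ONCE(off > S16_MAX))
		return false;
	p->insns[i].off = off;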

> +	}
> +}
> +
> +static bool emit_retval(struct nf_hook_prog *p, int retval)
> +{
> +	if (!emit(p, BPF_MOV32_IMM(BPF_REG_0, retval)))
> +		return false;
> +
> +	return emit(p, BPF_EXIT_INSN());
> +}
> +
> +static bool emit_nf_hook_slow(struct nf_hook_prog *p)
> +{
> +	int width = bytes_to_bpf_size(sizeof(void *));
> +
> +	/* restore the original state->priv. */
> +	if (!emit(p, BPF_STX_MEM(width, BPF_REG_6, BPF_REG_7,
> +				 offsetof(struct nf_hook_state, priv))))
> +		return false;
> +
> +	/* arg1 is state->skb */
> +	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_1, BPF_REG_6,
> +				 offsetof(struct nf_hook_state, skb))))
> +		return false;
> +
> +	/* arg2 is "struct nf_hook_state *" */
> +	if (!emit(p, BPF_MOV64_REG(BPF_REG_2, BPF_REG_6)))
> +		return false;
> +
> +	/* arg3 is nf_hook_entries (original state->priv) */
> +	if (!emit(p, BPF_MOV64_REG(BPF_REG_3, BPF_REG_7)))
> +		return false;
> +
> +	if (!emit(p, BPF_EMIT_CALL(nf_hook_slow)))
> +		return false;
> +
> +	/* No further action needed, return retval provided by nf_hook_slow */
> +	return emit(p, BPF_EXIT_INSN());
> +}
> +
> +static bool emit_nf_queue(struct nf_hook_prog *p)
> +{
> +	int width = bytes_to_bpf_size(sizeof(void *));
> +
> +	if (width < 0) {
> +		WARN_ON_ONCE(1);
> +		return false;
> +	}
> +
> +	/* int nf_queue(struct sk_buff *skb, struct nf_hook_state *state, unsigned int verdict) */
> +	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_1, BPF_REG_6,
> +				 offsetof(struct nf_hook_state, skb))))
> +		return false;
> +	if (!emit(p, BPF_STX_MEM(BPF_H, BPF_REG_6, BPF_REG_8,
> +				 offsetof(struct nf_hook_state, hook_index))))
> +		return false;
> +	/* arg2: struct nf_hook_state * */
> +	if (!emit(p, BPF_MOV64_REG(BPF_REG_2, BPF_REG_6)))
> +		return false;
> +	/* arg3: original hook return value: (NUM << NF_VERDICT_QBITS | NF_QUEUE) */
> +	if (!emit(p, BPF_MOV32_REG(BPF_REG_3, BPF_REG_0)))
> +		return false;
> +	if (!emit(p, BPF_EMIT_CALL(nf_queue)))
> +		return false;

This and the other CALLs work by accident on x86-64.
You need to wrap them with BPF_CALL_ and point BPF_EMIT_CALL to that wrapper.
On x86-64 it will be a nop.
On x86-32 it will do quite a bit of work.
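
For illustration, such a wrapper could look roughly like this (untested
sketch; 'nf_queue_bpf' is a made-up name, BPF_CALL_3 is the existing macro
from include/linux/filter.h):

	BPF_CALL_3(nf_queue_bpf, struct sk_buff *, skb,
		   struct nf_hook_state *, state, unsigned int, verdict)
	{
		return nf_queue(skb, state, verdict);
	}

and the generator would then BPF_EMIT_CALL(nf_queue_bpf) instead of
calling nf_queue directly; same idea for the kfree_skb call further down.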

> +
> +	/* Check nf_queue return value.  Abnormal case: nf_queue returned != 0.
> +	 *
> +	 * Fall back to nf_hook_slow().
> +	 */
> +	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2)))
> +		return false;
> +
> +	/* Normal case: skb was stolen. Return 0. */
> +	return emit_retval(p, 0);
> +}
> +
> +static bool do_epilogue_base_hooks(struct nf_hook_prog *p)
> +{
> +	int width = bytes_to_bpf_size(sizeof(void *));
> +
> +	if (WARN_ON_ONCE(width < 0))
> +		return false;
> +
> +	/* last 'hook'. We arrive here if previous hook returned ACCEPT,
> +	 * i.e. all hooks passed -- we are done.
> +	 *
> +	 * Return 1, skb can continue traversing network stack.
> +	 */
> +	if (!emit_retval(p, 1))
> +		return false;
> +
> +	/* Patch all hook jumps, in case any of these are taken
> +	 * we need to jump to this location.
> +	 *
> +	 * This happens when verdict is != ACCEPT.
> +	 */
> +	patch_hook_jumps(p);
> +
> +	/* need to ignore upper 24 bits, might contain errno or queue number */
> +	if (!emit(p, BPF_MOV32_REG(BPF_REG_3, BPF_REG_0)))
> +		return false;
> +	if (!emit(p, BPF_ALU32_IMM(BPF_AND, BPF_REG_3, 0xff)))
> +		return false;
> +
> +	/* ACCEPT handled, check STOLEN. */
> +	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_3, NF_STOLEN, 2)))
> +		return false;
> +
> +	if (!emit_retval(p, 0))
> +		return false;
> +
> +	/* ACCEPT and STOLEN handled.  Check DROP next */
> +	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_3, NF_DROP, 1 + 2 + 2 + 2 + 2)))
> +		return false;
> +
> +	/* First step. Extract the errno number. 1 insn. */
> +	if (!emit(p, BPF_ALU32_IMM(BPF_RSH, BPF_REG_0, NF_VERDICT_QBITS)))
> +		return false;
> +
> +	/* Second step: replace errno with EPERM if it was 0. 2 insns. */
> +	if (!emit(p, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1)))
> +		return false;
> +	if (!emit(p, BPF_MOV32_IMM(BPF_REG_0, EPERM)))
> +		return false;
> +
> +	/* Third step: negate reg0 (caller expects -EFOO) and stash the result.  2 insns. */
> +	if (!emit(p, BPF_ALU32_IMM(BPF_NEG, BPF_REG_0, 0)))
> +		return false;
> +	if (!emit(p, BPF_MOV32_REG(BPF_REG_8, BPF_REG_0)))
> +		return false;
> +
> +	/* Fourth step: free the skb. 2 insns. */
> +	if (!emit(p, BPF_LDX_MEM(width, BPF_REG_1, BPF_REG_6,
> +				 offsetof(struct nf_hook_state, skb))))
> +		return false;
> +	if (!emit(p, BPF_EMIT_CALL(kfree_skb)))
> +		return false;

ditto.

> +
> +	/* Last step: return. 2 insns. */
> +	if (!emit(p, BPF_MOV32_REG(BPF_REG_0, BPF_REG_8)))
> +		return false;
> +	if (!emit(p, BPF_EXIT_INSN()))
> +		return false;
> +
> +	/* ACCEPT, STOLEN and DROP have been handled.
> +	 * REPEAT and STOP are not allowed anymore for individual hook functions.
> +	 * This leaves NFQUEUE as the only remaining return value.
> +	 *
> +	 * In this case BPF_REG_0 still contains the original verdict of
> +	 * '(NUM << NF_VERDICT_QBITS | NF_QUEUE)', so pass it to nf_queue() as-is.
> +	 */
> +	if (!emit_nf_queue(p))
> +		return false;
> +
> +	/* Increment hook index and store it in nf_hook_state so nf_hook_slow will
> +	 * start at the next hook, if any.
> +	 */
> +	if (!emit(p, BPF_ALU32_IMM(BPF_ADD, BPF_REG_8, 1)))
> +		return false;
> +	if (!emit(p, BPF_STX_MEM(BPF_H, BPF_REG_6, BPF_REG_8,
> +				 offsetof(struct nf_hook_state, hook_index))))
> +		return false;
> +
> +	return emit_nf_hook_slow(p);
> +}
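
For readability: the verdict handling emitted above corresponds roughly to
the following C (illustrative only, not part of the patch; 'verdict' is the
hook return value left in BPF_REG_0, 'state' the nf_hook_state in BPF_REG_6):

	/* reached only when a hook returned something other than NF_ACCEPT */
	switch (verdict & 0xff) {
	case NF_STOLEN:
		return 0;
	case NF_DROP: {
		int err = verdict >> NF_VERDICT_QBITS;

		if (err == 0)
			err = EPERM;
		kfree_skb(state->skb);
		return -err;
	}
	default:
		/* NF_QUEUE, queue number in the upper bits: try nf_queue(),
		 * fall back to nf_hook_slow() if that fails.
		 */
		break;
	}
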
> +
> +static int nf_hook_prog_init(struct nf_hook_prog *p)
> +{
> +	memset(p, 0, sizeof(*p));
> +
> +	p->insns = kcalloc(BPF_MAXINSNS, sizeof(*p->insns), GFP_KERNEL);
> +	if (!p->insns)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +static void nf_hook_prog_free(struct nf_hook_prog *p)
> +{
> +	kfree(p->insns);
> +}
> +
> +static int xlate_base_hooks(struct nf_hook_prog *p, const struct nf_hook_entries *e)
> +{
> +	unsigned int i, len;
> +
> +	len = e->num_hook_entries;
> +
> +	if (!do_prologue(p))
> +		goto out;
> +
> +	for (i = 0; i < len; i++) {
> +		if (!xlate_one_hook(p, e, &e->hooks[i]))
> +			goto out;
> +
> +		if (i + 1 < len) {
> +			if (!emit(p, BPF_MOV64_REG(BPF_REG_1, BPF_REG_6)))
> +				goto out;
> +
> +			if (!emit(p, BPF_ALU32_IMM(BPF_ADD, BPF_REG_8, 1)))
> +				goto out;
> +		}
> +	}
> +
> +	if (!do_epilogue_base_hooks(p))
> +		goto out;
> +
> +	return 0;
> +out:
> +	return -EINVAL;
> +}
> +
> +static struct bpf_prog *nf_hook_jit_compile(struct bpf_insn *insns, unsigned int len)
> +{
> +	struct bpf_prog *prog;
> +	int err = 0;
> +
> +	prog = bpf_prog_alloc(bpf_prog_size(len), 0);
> +	if (!prog)
> +		return NULL;
> +
> +	prog->len = len;
> +	prog->type = BPF_PROG_TYPE_SOCKET_FILTER;

lol. Just say BPF_PROG_TYPE_UNSPEC ?

> +	memcpy(prog->insnsi, insns, prog->len * sizeof(struct bpf_insn));
> +
> +	prog = bpf_prog_select_runtime(prog, &err);
> +	if (err) {
> +		bpf_prog_free(prog);
> +		return NULL;
> +	}

Would be good to do bpf_prog_alloc_id() so it can be seen in
bpftool prog show.
and bpf_prog_kallsyms_add() to make 'perf report' and
stack traces readable.
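
Something like the below after bpf_prog_select_runtime() succeeds (sketch
only; bpf_prog_alloc_id(), if it is still static in kernel/bpf/syscall.c,
would have to be made available to this code first):

	err = bpf_prog_alloc_id(prog);	/* visible via 'bpftool prog show' */
	if (err) {
		bpf_prog_free(prog);
		return NULL;
	}

	bpf_prog_kallsyms_add(prog);	/* readable symbols in perf/stack traces */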

Overall I don't hate it, but don't like it either.
Please provide performance numbers.
It's a lot of tricky code and not clear what the benefits are.
Who will maintain this body of code long term?
How are we going to deal with refactoring that will touch generic bpf bits
and this generated prog?

> Purpose of this is to eventually add a 'netfilter prog type' to bpf and
> permit attachment of (userspace generated) bpf programs to the netfilter
> machinery, e.g.  'attach bpf prog id 1234 to ipv6 PREROUTING at prio -300'.
> 
> This will require exposing the context structure (program argument,
> '__nf_hook_state'), rewriting accesses to match the nf_hook_state layout.

This part is orthogonal, right? I don't see how this work is connected
to above idea.
I'm still convinced that xt_bpf was a bad choice for many reasons.
"Add a 'netfilter prog type' to bpf" would repeat the same mistakes.
Let's evaluate this set independently.

* Re: [RFC v2 6/9] netfilter: add bpf base hook program generator
  2022-10-06  2:52   ` Alexei Starovoitov
@ 2022-10-06 13:51     ` Florian Westphal
  2022-10-07 11:45     ` Florian Westphal
  1 sibling, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-06 13:51 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Florian Westphal, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > +#if IS_ENABLED(CONFIG_NF_HOOK_BPF)
> > +		const struct bpf_prog *p = READ_ONCE(hook_head->hook_prog);
> > +
> > +		nf_hook_state_init(&state, hook, pf, indev, outdev,
> > +				   sk, net, okfn);
> > +
> > +		state.priv = (void *)hook_head;
> > +		state.skb = skb;
> >  
> > +		migrate_disable();
> > +		ret = bpf_prog_run_nf(p, &state);
> > +		migrate_enable();
> 
> Since generated prog doesn't do any per-cpu work and not using any maps
> there is no need for migrate_disable.
> There is cant_migrate() in __bpf_prog_run(), but it's probably better
> to silence that instead of adding migrate_disable/enable overhead.

Ah, thanks -- noted.
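
The call site would then shrink to something like this (minimal sketch,
assuming the cant_migrate() check in __bpf_prog_run() is dealt with
separately):

	state.priv = (void *)hook_head;
	state.skb = skb;

	ret = bpf_prog_run_nf(p, &state);

with the migrate_disable()/migrate_enable() pair simply dropped.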

> > +static bool emit_mov_ptr_reg(struct nf_hook_prog *p, u8 dreg, u8 sreg)
> > +{
> > +	if (sizeof(void *) == sizeof(u64))
> > +		return emit(p, BPF_MOV64_REG(dreg, sreg));
> > +	if (sizeof(void *) == sizeof(u32))
> > +		return emit(p, BPF_MOV32_REG(dreg, sreg));
> 
> I bet that was never tested :) because... see below.

Right, never tested -- only on the amd64 arch.

I suspect that real 32-bit support won't reduce readability too much;
otherwise I can either remove it or add it in a different patch.

> > +static void patch_hook_jumps(struct nf_hook_prog *p)
> > +{
> > +	unsigned int i;
> > +
> > +	if (!p->insns)
> > +		return;
> > +
> > +	for (i = 0; i < p->pos; i++) {
> > +		if (BPF_CLASS(p->insns[i].code) != BPF_JMP)
> > +			continue;
> > +
> > +		if (p->insns[i].code == (BPF_EXIT | BPF_JMP))
> > +			continue;
> > +		if (p->insns[i].code == (BPF_CALL | BPF_JMP))
> > +			continue;
> > +
> > +		if (p->insns[i].off != JMP_INVALID)
> > +			continue;
> > +		p->insns[i].off = p->pos - i - 1;
> 
> Pls add a check that it fits in 16-bits.

Makes sense.

> > +	if (!emit(p, BPF_EMIT_CALL(nf_queue)))
> > +		return false;
> 
> here and the other CALLs work by accident on x86-64.
> You need to wrap them with BPF_CALL_ and point BPF_EMIT_CALL to that wrapper.
> On x86-64 it will be a nop.
> On x86-32 it will do quite a bit of work.

I see, thanks.

> > +	prog->len = len;
> > +	prog->type = BPF_PROG_TYPE_SOCKET_FILTER;
> 
> lol. Just say BPF_PROG_TYPE_UNSPEC ?

Right, will do that.

> > +	memcpy(prog->insnsi, insns, prog->len * sizeof(struct bpf_insn));
> > +
> > +	prog = bpf_prog_select_runtime(prog, &err);
> > +	if (err) {
> > +		bpf_prog_free(prog);
> > +		return NULL;
> > +	}
> 
> Would be good to do bpf_prog_alloc_id() so it can be seen in
> bpftool prog show.

Agree.

> and bpf_prog_kallsyms_add() to make 'perf report' and
> stack traces readable.

Good to know, will check that this works.

> Overall I don't hate it, but don't like it either.
> Please provide performance numbers.

Oh, right, I should have included those in the cover letter.
Tests were done on 5.19-rc3 on a 56-core Intel machine using pktgen
(based on pktgen_bench_xmit_mode_netif_receive.sh), i.e.
64-byte UDP packets that get forwarded to a dummy device.

The ruleset had a single 'ct state new accept' rule in the forward chain.

Baseline, with 56 rx queues: 682006 pps, 348 Mb/s
with this patchset:          696743 pps, 356 Mb/s

Averaged over 10 runs each, with a reboot after each run.
irqbalance was off, scaling_governor was set to 'performance'.

I would redo those tests for a future patch submission.
If there is a particular test I should run, please let me know.

I also did a test via iperf3 forwarding
(netns -> veth1 -> netns -> veth -> netns), but the 'improvement'
was within the noise range; there is too much other overhead for the
indirection avoidance to be noticeable.

> It's a lot of tricky code and not clear what the benefits are.
> Who will maintain this body of code long term?
> How are we going to deal with refactoring that will touch generic bpf bits
> and this generated prog?

Good questions.  The only 'good' answer is that it could always be
marked BROKEN and then reverted if needed, as it doesn't add new
functionality per se.

Furthermore (I have NOT looked at this at all), this opens the door for
more complexity/trickery.  For example, the bpf prog could check (during
code generation) whether $indirect_hook is the ipv4 or ipv6 defrag hook
and then insert extra code that avoids the function call in the common
case.  There are probably more hack^W tricks that could be done.

So yes, maintainability is a good question, as is which other users in
the tree might want something similar (selinux hook invocation, for
example...).

I guess it depends on whether the perf numbers are decent enough.
If they are, then I'd suggest just doing a live experiment and giving
it a try -- if it turns out to be a big pain point
(maintenance, frequent crashes, hard-to-debug correctness bugs, e.g.
 'generator failed to re-jit and now it skips my iptables filter
 table', ...) or whatever, mark it as BROKEN in Kconfig and, if
everything fails, just rip it out again.

Does that sound ok?

> > Purpose of this is to eventually add a 'netfilter prog type' to bpf and
> > permit attachment of (userspace generated) bpf programs to the netfilter
> > machinery, e.g.  'attach bpf prog id 1234 to ipv6 PREROUTING at prio -300'.
> > 
> > This will require exposing the context structure (program argument,
> > '__nf_hook_state'), rewriting accesses to match the nf_hook_state layout.
> 
> This part is orthogonal, right? I don't see how this work is connected
> to above idea.

Yes, orthogonal from technical pov.

> I'm still convinced that xt_bpf was a bad choice for many reasons.

Hmmm, ok -- there is nothing I can say, it looks reasonably
innocent/harmless to me wrt. backwards kludge risk etc.

> "Add a 'netfilter prog type' to bpf" would repeat the same mistakes.

Hmm, to me it would be more like the 'xtc/tcx' stuff rather than
cls/act_bpf/xt_bpf etc., but perhaps I'm missing something.

> Let's evaluate this set independently.

Ok, sure.

* Re: [RFC v2 6/9] netfilter: add bpf base hook program generator
  2022-10-06  2:52   ` Alexei Starovoitov
  2022-10-06 13:51     ` Florian Westphal
@ 2022-10-07 11:45     ` Florian Westphal
  2022-10-07 19:08       ` Alexei Starovoitov
  1 sibling, 1 reply; 15+ messages in thread
From: Florian Westphal @ 2022-10-07 11:45 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Florian Westphal, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > +	if (!emit(p, BPF_STX_MEM(BPF_H, BPF_REG_6, BPF_REG_8,
> > +				 offsetof(struct nf_hook_state, hook_index))))
> > +		return false;
> > +	/* arg2: struct nf_hook_state * */
> > +	if (!emit(p, BPF_MOV64_REG(BPF_REG_2, BPF_REG_6)))
> > +		return false;
> > +	/* arg3: original hook return value: (NUM << NF_VERDICT_QBITS | NF_QUEUE) */
> > +	if (!emit(p, BPF_MOV32_REG(BPF_REG_3, BPF_REG_0)))
> > +		return false;
> > +	if (!emit(p, BPF_EMIT_CALL(nf_queue)))
> > +		return false;
> 
> here and the other CALLs work by accident on x86-64.
> You need to wrap them with BPF_CALL_ and point BPF_EMIT_CALL to that wrapper.

Do you mean this? :

BPF_CALL_3(nf_queue_bpf, struct sk_buff *, skb, struct nf_hook_state *,
           state, unsigned int, verdict)
{
     return nf_queue(skb, state, verdict);
}

-       if (!emit(p, BPF_EMIT_CALL(nf_hook_slow)))
+       if (!emit(p, BPF_EMIT_CALL(nf_hook_slow_bpf)))

?

If yes, I don't see how this will work for the case where I only have an
address, i.e.:

if (!emit(p, BPF_EMIT_CALL(h->hook))) ....

(Also, the address might be in a kernel module)

> On x86-64 it will be a nop.
> On x86-32 it will do quite a bit of work.

If this is only a problem for 32-bit arches, I could also make this
'depends on CONFIG_64BIT'.

But perhaps I am on the wrong track, I see existing code doing:
        *insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);

(kernel/bpf/hashtab.c).

> > +	prog = bpf_prog_select_runtime(prog, &err);
> > +	if (err) {
> > +		bpf_prog_free(prog);
> > +		return NULL;
> > +	}
> 
> Would be good to do bpf_prog_alloc_id() so it can be seen in
> bpftool prog show.

Thanks a lot for the hint:

39: unspec  tag 0000000000000000
xlated 416B  jited 221B  memlock 4096B

bpftool prog  dump xlated id 39
   0: (bf) r6 = r1
   1: (79) r7 = *(u64 *)(r1 +8)
   2: (b4) w8 = 0
   3: (85) call ipv6_defrag#526144928
   4: (55) if r0 != 0x1 goto pc+24
   5: (bf) r1 = r6
   6: (04) w8 += 1
   7: (85) call ipv6_conntrack_in#526206096
   [..]

* Re: [RFC v2 6/9] netfilter: add bpf base hook program generator
  2022-10-07 11:45     ` Florian Westphal
@ 2022-10-07 19:08       ` Alexei Starovoitov
  2022-10-07 19:35         ` Florian Westphal
  0 siblings, 1 reply; 15+ messages in thread
From: Alexei Starovoitov @ 2022-10-07 19:08 UTC (permalink / raw)
  To: Florian Westphal; +Cc: bpf

On Fri, Oct 7, 2022 at 4:45 AM Florian Westphal <fw@strlen.de> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > +   if (!emit(p, BPF_STX_MEM(BPF_H, BPF_REG_6, BPF_REG_8,
> > > +                            offsetof(struct nf_hook_state, hook_index))))
> > > +           return false;
> > > +   /* arg2: struct nf_hook_state * */
> > > +   if (!emit(p, BPF_MOV64_REG(BPF_REG_2, BPF_REG_6)))
> > > +           return false;
> > > +   /* arg3: original hook return value: (NUM << NF_VERDICT_QBITS | NF_QUEUE) */
> > > +   if (!emit(p, BPF_MOV32_REG(BPF_REG_3, BPF_REG_0)))
> > > +           return false;
> > > +   if (!emit(p, BPF_EMIT_CALL(nf_queue)))
> > > +           return false;
> >
> > here and the other CALLs work by accident on x86-64.
> > You need to wrap them with BPF_CALL_ and point BPF_EMIT_CALL to that wrapper.
>
> Do you mean this? :
>
> BPF_CALL_3(nf_queue_bpf, struct sk_buff *, skb, struct nf_hook_state *,
>            state, unsigned int, verdict)
> {
>      return nf_queue(skb, state, verdict);
> }

yep.

>
> -       if (!emit(p, BPF_EMIT_CALL(nf_hook_slow)))
> +       if (!emit(p, BPF_EMIT_CALL(nf_hook_slow_bpf)))
>
> ?
>
> If yes, I don't see how this will work for the case where I only have an
> address, i.e.:
>
> if (!emit(p, BPF_EMIT_CALL(h->hook))) ....
>
> (Also, the address might be in a kernel module)
>
> > On x86-64 it will be a nop.
> > On x86-32 it will do quite a bit of work.
>
> If this is only a problem for 32-bit arches, I could also make this
> 'depends on CONFIG_64BIT'.

If that's acceptable, sure.

> But perhaps I am on the wrong track, I see existing code doing:
>         *insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);

Yes, because we do:
                /* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
                 * and other inlining handlers are currently limited to 64 bit
                 * only.
                 */
                if (prog->jit_requested && BITS_PER_LONG == 64 &&


I think you already gate this feature with jit_requested?
Otherwise it's going to be slow in the interpreter.
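
If it is not gated yet, a minimal sketch of such a gate, assuming the
generator simply keeps using nf_hook_slow() when no JIT is available,
would be an early bail-out in nf_hook_jit_compile():

	if (!bpf_jit_enable)
		return NULL;	/* caller keeps using nf_hook_slow() */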

> (kernel/bpf/hashtab.c).
>
> > > +   prog = bpf_prog_select_runtime(prog, &err);
> > > +   if (err) {
> > > +           bpf_prog_free(prog);
> > > +           return NULL;
> > > +   }
> >
> > Would be good to do bpf_prog_alloc_id() so it can be seen in
> > bpftool prog show.
>
> Thanks a lot for the hint:
>
> 39: unspec  tag 0000000000000000
> xlated 416B  jited 221B  memlock 4096B

Probably should do bpf_prog_calc_tag() too.
And please give it some meaningful name.
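
As a sketch (the 'hooknum' argument and the name format are purely
illustrative):

	snprintf(prog->aux->name, sizeof(prog->aux->name),
		 "nf_hook_%u", hooknum);
	err = bpf_prog_calc_tag(prog);

so that 'bpftool prog show' prints a non-zero tag and a recognizable name.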

> bpftool prog  dump xlated id 39
>    0: (bf) r6 = r1
>    1: (79) r7 = *(u64 *)(r1 +8)
>    2: (b4) w8 = 0
>    3: (85) call ipv6_defrag#526144928
>    4: (55) if r0 != 0x1 goto pc+24
>    5: (bf) r1 = r6
>    6: (04) w8 += 1
>    7: (85) call ipv6_conntrack_in#526206096
>    [..]

Nice.
bpftool prog profile
should work too.

* Re: [RFC v2 6/9] netfilter: add bpf base hook program generator
  2022-10-07 19:08       ` Alexei Starovoitov
@ 2022-10-07 19:35         ` Florian Westphal
  0 siblings, 0 replies; 15+ messages in thread
From: Florian Westphal @ 2022-10-07 19:35 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Florian Westphal, bpf

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > -       if (!emit(p, BPF_EMIT_CALL(nf_hook_slow)))
> > +       if (!emit(p, BPF_EMIT_CALL(nf_hook_slow_bpf)))
> >
> > ?
> >
> > If yes, I don't see how this will work for the case where I only have an
> > address, i.e.:
> >
> > if (!emit(p, BPF_EMIT_CALL(h->hook))) ....
> >
> > (Also, the address might be in a kernel module)
> >
> > > On x86-64 it will be a nop.
> > > On x86-32 it will do quite a bit of work.
> >
> > If this is only a problem for 32-bit arches, I could also make this
> > 'depends on CONFIG_64BIT'.
> 
> If that's acceptable, sure.

Good, thanks!

> > But perhaps I am on the wrong track, I see existing code doing:
> >         *insn++ = BPF_EMIT_CALL(__htab_map_lookup_elem);
> 
> Yes, because we do:
>                 /* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
>                  * and other inlining handlers are currently limited to 64 bit
>                  * only.
>                  */
>                 if (prog->jit_requested && BITS_PER_LONG == 64 &&

Ah, thanks, makes sense.

> I think you already gate this feature with jit_requested?
> Otherwise it's going to be slow in the interpreter.

Right, use of bpf interpreter is silly for this.

> > 39: unspec  tag 0000000000000000
> > xlated 416B  jited 221B  memlock 4096B
> 
> Probably should do bpf_prog_calc_tag() too.
> And please give it some meaningful name.

Agree, will add this.

Thread overview: 15+ messages
2022-10-05 14:13 [RFC 0/9 v2] netfilter: bpf base hook program generator Florian Westphal
2022-10-05 14:13 ` [RFC v2 1/9] netfilter: nf_queue: carry index in hook state Florian Westphal
2022-10-05 14:13 ` [RFC v2 2/9] netfilter: nat: split nat hook iteration into a helper Florian Westphal
2022-10-05 14:13 ` [RFC v2 3/9] netfilter: remove hook index from nf_hook_slow arguments Florian Westphal
2022-10-05 14:13 ` [RFC v2 4/9] netfilter: make hook functions accept only one argument Florian Westphal
2022-10-05 14:13 ` [RFC v2 5/9] netfilter: reduce allowed hook count to 32 Florian Westphal
2022-10-05 14:13 ` [RFC v2 6/9] netfilter: add bpf base hook program generator Florian Westphal
2022-10-06  2:52   ` Alexei Starovoitov
2022-10-06 13:51     ` Florian Westphal
2022-10-07 11:45     ` Florian Westphal
2022-10-07 19:08       ` Alexei Starovoitov
2022-10-07 19:35         ` Florian Westphal
2022-10-05 14:13 ` [RFC v2 7/9] netfilter: core: do not rebuild bpf program on dying netns Florian Westphal
2022-10-05 14:13 ` [RFC v2 8/9] netfilter: netdev: switch to invocation via bpf Florian Westphal
2022-10-05 14:13 ` [RFC v2 9/9] netfilter: hook_jit: add prog cache Florian Westphal
