netdev.vger.kernel.org archive mirror
* [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
@ 2018-06-01 15:32 Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 1/5] bpf: add bpf_prog_get_type_dev_file Florian Westphal
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev

This patch series adds a JIT layer to translate nft expressions
to ebpf programs.

From the commit phase, we spawn a userspace program (using the recently
added UMH infrastructure).

We then provide the rules that came in this transaction to the helper via
a pipe, using the same nf_tables netlink format that nftables already uses.

The userspace helper translates the rules and, if successful, installs the
generated program(s) via the bpf() syscall.

For each rule, a small response containing the corresponding ebpf file
descriptor (can be -1 on failure) and an attribute count (how many
expressions were jitted) is sent back to the kernel via the pipe.
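
For reference, the response is a small fixed-size structure (as
introduced in patch 3):

  struct nft_jit_data_from_user {
	int ebpf_fd;	/* fd to get program from, or < 0 if jitter error */
	u32 expr_count;	/* number of translated expressions */
  };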

If translation fails, the rule will be processed by the nf_tables
interpreter (as before this patch).

If translation succeeds, nf_tables fetches the bpf program via the file
descriptor, then allocates a new rule blob containing the new 'ebpf'
expression (and possibly trailing untranslated expressions).

It then replaces the original rule in the transaction log with the new
'ebpf-rule'.  The original rule is retained in a private area inside the
ebpf expression to be able to present the original expressions back to
userspace on 'nft list ruleset'.
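
Schematically, assuming the first two expressions of a rule could be
translated (sketch, not the actual memory layout):

  original rule:  [ expr1 | expr2 | expr3 | udata ]  (kept via 'original')
  ebpf rule:      [ ebpf  | expr3 | udata ]          (used on packet path)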

For easier review, this contains the kernel-side only.
nf_tables_jit_work() will not do anything, yet.

Unresolved issues:
 - maps and sets.
   It might be possible to add a new ebpf map type that just wraps
   the nft set infrastructure for lookups.
   This would allow nft userspace to continue to work as-is while
   not requiring a new ebpf helper.
   Anonymous sets should be a lot easier as they're immutable
   and could probably be handled already by existing infra.

 - BPF_PROG_RUN() is bolted into the nft main loop via a middleman
   expression.  I'm also abusing skb->cb[] to pass network and transport
   header offsets (see the cb[] layout excerpt after this list).
   It's not a 'public' API, so this can be changed later.

 - always uses BPF_PROG_TYPE_SCHED_CLS.
   This is because it "works" for current RFC purposes.

 - we should eventually support translating multiple (adjacent) rules
   into a single program.

   If we do this, the kernel will need to track the mapping of rules to
   programs (to re-jit when a rule is changed).  This isn't implemented
   so far, but can be added later.  Alternatively, one could also add a
   'readonly' table switch to just prevent further updates.

   We will also need to dump the 'next' generation of the
   to-be-translated table.  The kernel has this information, so it's only
   a matter of serializing it back to userspace from the commit phase.
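
For reference, the skb->cb[] layout used by patch 3 to pass these
offsets (excerpt):

  /* Dirty hack: pass nft_pktinfo in skb->cb[] */
  struct nft_jit_args_inet_cb {
	u16 thoff;	/* 0: unset */
	u16 lloff;	/* 0: unset */
	u16 l4proto;	/* thoff = 0? unset */
	u16 reserved;
  };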

The jitter is still limited.  So far it supports:

 * payload expression for network and transport header
 * meta mark, nfproto, l4proto
 * 32 bit immediates
 * 32 bit bitmask ops
 * accept/drop verdicts

As this uses netlink, there is also no technical requirement for
libnftnl; it's simply used here for convenience.

It doesn't need any userspace changes.  Patches for libnftnl and nftables
make debug info available (e.g. to map a rule to its bpf prog id).

Comments welcome.

Florian Westphal (5):
      bpf: add bpf_prog_get_type_dev_file
      netfilter: nf_tables: add ebpf expression
      netfilter: nf_tables: add rule ebpf jit infrastructure
      netfilter: nf_tables_jit: add dumping of original rule
      netfilter: nf_tables_jit: add userspace nft to ebpf translator

 include/linux/bpf.h                              |   11 
 include/net/netfilter/nf_tables_core.h           |   22 
 include/uapi/linux/netfilter/nf_tables.h         |   18 
 kernel/bpf/syscall.c                             |   18 
 net/netfilter/Kconfig                            |    7 
 net/netfilter/Makefile                           |    5 
 net/netfilter/nf_tables_api.c                    |   16 
 net/netfilter/nf_tables_core.c                   |   61 +
 net/netfilter/nf_tables_jit.c                    |  242 +++
 net/netfilter/nf_tables_jit/Makefile             |   19 
 net/netfilter/nf_tables_jit/imr.c                | 1401 +++++++++++++++++++++++
 net/netfilter/nf_tables_jit/imr.h                |   96 +
 net/netfilter/nf_tables_jit/main.c               |  579 +++++++++
 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c |  175 ++
 14 files changed, 2670 insertions(+)


* [RFC nf-next 1/5] bpf: add bpf_prog_get_type_dev_file
  2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
@ 2018-06-01 15:32 ` Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 2/5] netfilter: nf_tables: add ebpf expression Florian Westphal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal

Same as bpf_prog_get_type_dev, but gets struct file* instead of fd.

In the case of the nf_tables jit, the file descriptor representing the
ebpf program is passed to the kernel via a pipe from the (userspace) jit
helper, i.e. it doesn't belong to 'current', so the existing
bpf_prog_get_type_dev() doesn't work.
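
A minimal usage sketch (hypothetical caller; assumes 'f' was already
looked up from the helper's file table):

	struct bpf_prog *prog;

	prog = bpf_prog_get_type_dev_file(f, BPF_PROG_TYPE_SCHED_CLS, false);
	if (IS_ERR(prog))
		return PTR_ERR(prog);
	/* ... use prog; drop the reference with bpf_prog_put() when done */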

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/bpf.h  | 11 +++++++++++
 kernel/bpf/syscall.c | 18 ++++++++++++++++++
 2 files changed, 29 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index bbe297436e5d..be7796ac48ac 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -24,6 +24,7 @@ struct bpf_map;
 struct sock;
 struct seq_file;
 struct btf;
+struct file;
 
 /* map is generic key/value storage optionally accesible by eBPF programs */
 struct bpf_map_ops {
@@ -417,6 +418,9 @@ extern const struct bpf_verifier_ops xdp_analyzer_ops;
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 				       bool attach_drv);
+struct bpf_prog *bpf_prog_get_type_dev_file(struct file *,
+					    enum bpf_prog_type type,
+					    bool attach_drv);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -523,6 +527,13 @@ static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline struct bpf_prog *bpf_prog_get_type_dev_file(struct file *f,
+							  enum bpf_prog_type type,
+							  bool b)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog,
 							  int i)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 388d4feda348..3fcfd26f0290 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1203,6 +1203,24 @@ struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 }
 EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev);
 
+struct bpf_prog *bpf_prog_get_type_dev_file(struct file *f,
+					    enum bpf_prog_type type,
+					    bool attach_drv)
+{
+	struct bpf_prog *prog;
+
+	if (f->f_op != &bpf_prog_fops)
+		return ERR_PTR(-EINVAL);
+
+	prog = f->private_data;
+
+	if (!bpf_prog_get_ok(prog, &type, attach_drv))
+		return ERR_PTR(-EINVAL);
+
+	return bpf_prog_inc(prog);
+}
+EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev_file);
+
 /* Initially all BPF programs could be loaded w/o specifying
  * expected_attach_type. Later for some of them specifying expected_attach_type
  * at load time became required so that program could be validated properly.
-- 
2.16.4


* [RFC nf-next 2/5] netfilter: nf_tables: add ebpf expression
  2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 1/5] bpf: add bpf_prog_get_type_dev_file Florian Westphal
@ 2018-06-01 15:32 ` Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 3/5] netfilter: nf_tables: add rule ebpf jit infrastructure Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal

This expression serves two purposes:
1. a middleman to invoke BPF_PROG_RUN() from nf_tables main eval loop
2. to expose the bpf program id via netlink, so userspace
   can map nftables rules to their corresponding ebpf program.

2) is added in a followup patch.

It's currently not possible to attach arbitrary ebpf programs from
userspace, but this limitation is easy to remove if needed.
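
For reference, the program's return value is interpreted as an nf_tables
verdict; a hand-written equivalent of a generated program would look
roughly like this (sketch; 'matches' stands in for the emitted match
logic):

	int nft_ebpf_rule(struct __sk_buff *skb)
	{
		if (!matches(skb))		/* hypothetical match logic */
			return NFT_BREAK;	/* mismatch: go to next rule */
		return NF_ACCEPT;		/* or NF_DROP / NFT_CONTINUE */
	}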

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h   |  9 ++++++
 include/uapi/linux/netfilter/nf_tables.h | 18 ++++++++++++
 net/netfilter/Makefile                   |  3 +-
 net/netfilter/nf_tables_core.c           | 33 ++++++++++++++++++++++
 net/netfilter/nf_tables_jit.c            | 48 ++++++++++++++++++++++++++++++++
 5 files changed, 110 insertions(+), 1 deletion(-)
 create mode 100644 net/netfilter/nf_tables_jit.c

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index e0c0c2558ec4..90087a84f127 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -15,6 +15,7 @@ extern struct nft_expr_type nft_range_type;
 extern struct nft_expr_type nft_meta_type;
 extern struct nft_expr_type nft_rt_type;
 extern struct nft_expr_type nft_exthdr_type;
+extern struct nft_expr_type nft_ebpf_type;
 
 int nf_tables_core_module_init(void);
 void nf_tables_core_module_exit(void);
@@ -62,6 +63,14 @@ struct nft_payload_set {
 
 extern const struct nft_expr_ops nft_payload_fast_ops;
 
+struct nft_ebpf {
+	struct bpf_prog *prog;
+	u8 expressions;
+	const struct nft_rule *original;
+};
+
+extern const struct nft_expr_ops nft_ebpf_fast_ops;
+
 extern struct static_key_false nft_counters_enabled;
 extern struct static_key_false nft_trace_enabled;
 
diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 9c71f024f9cc..e05799652a4c 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -718,6 +718,24 @@ enum nft_payload_attributes {
 };
 #define NFTA_PAYLOAD_MAX	(__NFTA_PAYLOAD_MAX - 1)
 
+/**
+ * enum nft_ebpf_attributes - nf_tables ebpf expression netlink attributes
+ *
+ * @NFTA_EBPF_FD: file descriptor holding ebpf program (NLA_S32)
+ * @NFTA_EBPF_ID: bpf program id (NLA_U32)
+ * @NFTA_EBPF_TAG: bpf tag (NLA_BINARY)
+ * @NFTA_EBPF_EXPR_COUNT: expressions covered by this jit (NLA_U32)
+ */
+enum nft_ebpf_attributes {
+	NFTA_EBPF_UNSPEC,
+	NFTA_EBPF_FD,
+	NFTA_EBPF_ID,
+	NFTA_EBPF_TAG,
+	NFTA_EBPF_EXPR_COUNT,
+	__NFTA_EBPF_MAX,
+};
+#define NFTA_EBPF_MAX	(__NFTA_EBPF_MAX - 1)
+
 enum nft_exthdr_flags {
 	NFT_EXTHDR_F_PRESENT = (1 << 0),
 };
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 9b3434360d49..49c6e0a535f9 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,7 +76,8 @@ obj-$(CONFIG_NF_DUP_NETDEV)	+= nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 		  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
 		  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
-		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o
+		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o \
+		  nf_tables_jit.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 47cf667b15ca..038a15243508 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/filter.h>
 #include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/skbuff.h>
@@ -92,6 +93,35 @@ static bool nft_payload_fast_eval(const struct nft_expr *expr,
 	return true;
 }
 
+static void nft_ebpf_fast_eval(const struct nft_expr *expr,
+			       struct nft_regs *regs,
+			       const struct nft_pktinfo *pkt)
+{
+	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	struct bpf_skb_data_end cb_saved;
+	int ret;
+
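+	/* bpf_compute_data_pointers() writes into skb->cb[]; save and
+	 * restore the original cb contents around the program run.
+	 */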
+	memcpy(&cb_saved, pkt->skb->cb, sizeof(cb_saved));
+	bpf_compute_data_pointers(pkt->skb);
+
+	ret = BPF_PROG_RUN(priv->prog, pkt->skb);
+
+	memcpy(pkt->skb->cb, &cb_saved, sizeof(cb_saved));
+
+	switch (ret) {
+	case NF_DROP:
+	case NF_ACCEPT:
+	case NFT_BREAK:
+		regs->verdict.code = ret;
+		return;
+	case NFT_CONTINUE:
+		return;
+	default:
+		pr_debug("Unknown verdict %d\n", ret);
+		regs->verdict.code = NF_DROP;
+		break;
+	}
+}
 DEFINE_STATIC_KEY_FALSE(nft_counters_enabled);
 
 static noinline void nft_update_chain_stats(const struct nft_chain *chain,
@@ -151,6 +181,8 @@ nft_do_chain(struct nft_pktinfo *pkt, void *priv)
 		nft_rule_for_each_expr(expr, last, rule) {
 			if (expr->ops == &nft_cmp_fast_ops)
 				nft_cmp_fast_eval(expr, &regs);
+			else if (expr->ops == &nft_ebpf_fast_ops)
+				nft_ebpf_fast_eval(expr, &regs, pkt);
 			else if (expr->ops != &nft_payload_fast_ops ||
 				 !nft_payload_fast_eval(expr, &regs, pkt))
 				expr->ops->eval(expr, &regs, pkt);
@@ -232,6 +264,7 @@ static struct nft_expr_type *nft_basic_types[] = {
 	&nft_meta_type,
 	&nft_rt_type,
 	&nft_exthdr_type,
+	&nft_ebpf_type,
 };
 
 int __init nf_tables_core_module_init(void)
diff --git a/net/netfilter/nf_tables_jit.c b/net/netfilter/nf_tables_jit.c
new file mode 100644
index 000000000000..415c2acfa471
--- /dev/null
+++ b/net/netfilter/nf_tables_jit.c
@@ -0,0 +1,48 @@
+#include <linux/bpf.h>
+#include <linux/netfilter.h>
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+
+struct nft_ebpf_expression {
+	struct nft_expr e;
+	struct nft_ebpf priv;
+};
+
+static const struct nla_policy nft_ebpf_policy[NFTA_EBPF_MAX + 1] = {
+	[NFTA_EBPF_FD]			= { .type = NLA_S32 },
+	[NFTA_EBPF_ID]			= { .type = NLA_U32 },
+	[NFTA_EBPF_EXPR_COUNT]		= { .type = NLA_U32 },
+	[NFTA_EBPF_TAG]			= { .type = NLA_BINARY,
+					    .len = BPF_TAG_SIZE, },
+};
+
+static int nft_ebpf_init(const struct nft_ctx *ctx,
+			 const struct nft_expr *expr,
+			 const struct nlattr * const tb[])
+{
+	return -EOPNOTSUPP;
+}
+
+static void nft_ebpf_destroy(const struct nft_ctx *ctx,
+			     const struct nft_expr *expr)
+{
+	struct nft_ebpf *priv = nft_expr_priv(expr);
+
+	bpf_prog_put(priv->prog);
+	kfree(priv->original);
+}
+
+const struct nft_expr_ops nft_ebpf_fast_ops = {
+	.type		= &nft_ebpf_type,
+	.size		= NFT_EXPR_SIZE(sizeof(struct nft_ebpf)),
+	.init		= nft_ebpf_init,
+	.destroy	= nft_ebpf_destroy,
+};
+
+struct nft_expr_type nft_ebpf_type __read_mostly = {
+	.name		= "ebpf",
+	.ops		= &nft_ebpf_fast_ops,
+	.policy		= nft_ebpf_policy,
+	.maxattr	= NFTA_EBPF_MAX,
+	.owner		= THIS_MODULE,
+};
-- 
2.16.4


* [RFC nf-next 3/5] netfilter: nf_tables: add rule ebpf jit infrastructure
  2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 1/5] bpf: add bpf_prog_get_type_dev_file Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 2/5] netfilter: nf_tables: add ebpf expression Florian Westphal
@ 2018-06-01 15:32 ` Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 4/5] netfilter: nf_tables_jit: add dumping of original rule Florian Westphal
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal

This adds a JIT helper infrastructure to translate nft expressions to ebpf
programs.

From the commit phase, we spawn the jit module (a userspace program),
and then provide the rules that came in this transaction to that program
via a pipe (in nf_tables netlink format).

The userspace helper translates the rules where possible, and installs
the generated program(s) via the bpf() syscall.

For each rule, a small response containing the corresponding file
descriptor (can be -1 on failure) and an attribute count (how many
expressions were jitted) is sent back to the kernel via the pipe.

If translation fails, the rule will be processed by the nf_tables
interpreter (as before this patch).

If translation succeeds, nf_tables fetches the bpf program via the file
descriptor, then allocates a new rule blob containing the new 'ebpf'
expression (and possibly trailing untranslated expressions).

It then replaces the original rule in the transaction log with the new
'ebpf-rule'.
The original rule is retained in a private area inside the ebpf expression
to be able to present the original expressions to userspace when
'nft list ruleset' is called.

For easier review, this contains the kernel-side only.
nf_tables_jit_work() will not do anything, yet.

Unresolved issues:
 - maps and sets.
   It might be possible to add a new ebpf map type that just wraps
   the nft set infrastructure for lookups.
   This would allow nft userspace to continue to work as-is while
   not requiring a new ebpf helper.

 - we should eventually support translating multiple (adjacent) rules
   into a single program.

   If we do this, the kernel will need to track the mapping of rules to
   programs (to re-jit when a rule is changed).  This isn't implemented
   so far, but can be added later.

   We will also need to dump the 'next' generation of the
   to-be-translated table.  The kernel has this information, so it's only
   a matter of serializing it back to userspace from the commit phase.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h           |  12 ++
 net/netfilter/Kconfig                            |   7 ++
 net/netfilter/Makefile                           |   8 +-
 net/netfilter/nf_tables_api.c                    |   5 +
 net/netfilter/nf_tables_core.c                   |  31 ++++-
 net/netfilter/nf_tables_jit.c                    | 139 +++++++++++++++++++++++
 net/netfilter/nf_tables_jit/Makefile             |  18 +++
 net/netfilter/nf_tables_jit/main.c               |  21 ++++
 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c |  33 ++++++
 9 files changed, 270 insertions(+), 4 deletions(-)
 create mode 100644 net/netfilter/nf_tables_jit/Makefile
 create mode 100644 net/netfilter/nf_tables_jit/main.c
 create mode 100644 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index 90087a84f127..e9b5cc20ec45 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -71,6 +71,18 @@ struct nft_ebpf {
 
 extern const struct nft_expr_ops nft_ebpf_fast_ops;
 
+struct nft_jit_data_from_user {
+	int ebpf_fd;		/* fd to get program from, or < 0 if jitter error */
+	u32 expr_count;		/* number of translated expressions */
+};
+
+#if IS_ENABLED(CONFIG_NF_TABLES_JIT)
+int nft_jit_commit(struct net *net);
+#else
+static inline int nft_jit_commit(struct net *net) { return 0; }
+#endif
+int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e);
+
 extern struct static_key_false nft_counters_enabled;
 extern struct static_key_false nft_trace_enabled;
 
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 3ec8886850b2..82162fe931bb 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -473,6 +473,13 @@ config NF_TABLES_NETDEV
 	help
 	  This option enables support for the "netdev" table.
 
+config NF_TABLES_JIT
+	bool "Netfilter nf_tables jit infrastructure"
+	depends on BPF
+	help
+	  This option enables support for translation of nf_tables
+	  expressions to ebpf.
+
 config NFT_NUMGEN
 	tristate "Netfilter nf_tables number generator module"
 	help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 49c6e0a535f9..ecb371160cf7 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -76,8 +76,12 @@ obj-$(CONFIG_NF_DUP_NETDEV)	+= nf_dup_netdev.o
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nft_chain_filter.o \
 		  nf_tables_trace.o nft_immediate.o nft_cmp.o nft_range.o \
 		  nft_bitwise.o nft_byteorder.o nft_payload.o nft_lookup.o \
-		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o \
-		  nf_tables_jit.o
+		  nft_dynset.o nft_meta.o nft_rt.o nft_exthdr.o
+
+obj-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit/
+nf_tables-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit.o
+nf_tables-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit/nf_tables_jit_kern.o
+nf_tables-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit/nf_tables_jit_umh.o
 
 obj-$(CONFIG_NF_TABLES)		+= nf_tables.o
 obj-$(CONFIG_NFT_COMPAT)	+= nft_compat.o
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 89e61b2d048b..40c2de230400 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -6092,6 +6092,11 @@ static int nf_tables_commit(struct net *net, struct sk_buff *skb)
 	struct nft_trans_elem *te;
 	struct nft_chain *chain;
 	struct nft_table *table;
+	int ret;
+
+	ret = nft_jit_commit(net);
+	if (ret < 0)
+		return ret;
 
 	/* 1.  Allocate space for next generation rules_gen_X[] */
 	list_for_each_entry_safe(trans, next, &net->nft.commit_list, list) {
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 038a15243508..5557b2709f98 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -93,19 +93,46 @@ static bool nft_payload_fast_eval(const struct nft_expr *expr,
 	return true;
 }
 
+/* Dirty hack: pass nft_pktinfo in skb->cb[] */
+struct nft_jit_args_inet_cb {
+	/* cb[0] */
+	u16 thoff;	 /* 0: unset */
+	u16 lloff;	 /* 0: unset */
+
+	/* cb[1] */
+	u16 l4proto;	/* thoff = 0? unset */
+	u16 reserved;
+
+	/* 12 bytes left */
+};
+
 static void nft_ebpf_fast_eval(const struct nft_expr *expr,
 			       struct nft_regs *regs,
 			       const struct nft_pktinfo *pkt)
 {
 	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	struct nft_jit_args_inet_cb *jit_args;
 	struct bpf_skb_data_end cb_saved;
 	int ret;
 
+	BUILD_BUG_ON(sizeof(struct nft_jit_args_inet_cb) > QDISC_CB_PRIV_LEN);
+
 	memcpy(&cb_saved, pkt->skb->cb, sizeof(cb_saved));
+
+	jit_args = (void *)bpf_skb_cb(pkt->skb);
+	memset(jit_args, 0, sizeof(*jit_args));
+
+	if (skb_mac_header_was_set(pkt->skb))
+		jit_args->lloff = skb_mac_header_len(pkt->skb);
+
+	if (pkt->tprot_set) {
+		jit_args->thoff = pkt->xt.thoff;
+		jit_args->l4proto = pkt->tprot;
+	}
+
 	bpf_compute_data_pointers(pkt->skb);
 
 	ret = BPF_PROG_RUN(priv->prog, pkt->skb);
-
 	memcpy(pkt->skb->cb, &cb_saved, sizeof(cb_saved));
 
 	switch (ret) {
@@ -119,9 +146,9 @@ static void nft_ebpf_fast_eval(const struct nft_expr *expr,
 	default:
 		pr_debug("Unknown verdict %d\n", ret);
 		regs->verdict.code = NF_DROP;
-		break;
 	}
 }
+
 DEFINE_STATIC_KEY_FALSE(nft_counters_enabled);
 
 static noinline void nft_update_chain_stats(const struct nft_chain *chain,
diff --git a/net/netfilter/nf_tables_jit.c b/net/netfilter/nf_tables_jit.c
index 415c2acfa471..a8f4696249bf 100644
--- a/net/netfilter/nf_tables_jit.c
+++ b/net/netfilter/nf_tables_jit.c
@@ -1,13 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
 #include <linux/bpf.h>
+#include <linux/filter.h>
 #include <linux/netfilter.h>
 #include <net/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
+#include <linux/file.h>
+
+static int nft_jit_dump_ruleinfo(struct sk_buff *skb,
+				 const struct nft_ctx *ctx, const struct nft_rule *rule)
+{
+	const struct nft_expr *expr, *next;
+	struct nfgenmsg *nfmsg;
+	struct nlmsghdr *nlh;
+	struct nlattr *list;
+	int ret;
+	u16 type = nfnl_msg_type(NFNL_SUBSYS_NFTABLES, NFT_MSG_NEWRULE);
+
+	nlh = nlmsg_put(skb, ctx->portid, ctx->seq, type, sizeof(struct nfgenmsg), 0);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	nfmsg = nlmsg_data(nlh);
+	nfmsg->nfgen_family = ctx->family;
+	nfmsg->version = NFNETLINK_V0;
+	nfmsg->res_id = htons(ctx->net->nft.base_seq & 0xffff);
+
+	ret = nla_put_string(skb, NFTA_RULE_TABLE, ctx->table->name);
+	if (ret < 0)
+		return ret;
+	ret = nla_put_string(skb, NFTA_RULE_CHAIN, ctx->chain->name);
+	if (ret < 0)
+		return ret;
+	ret = nla_put_be64(skb, NFTA_RULE_HANDLE, cpu_to_be64(rule->handle),
+			   NFTA_RULE_PAD);
+	if (ret < 0)
+		return ret;
+
+	list = nla_nest_start(skb, NFTA_RULE_EXPRESSIONS);
+	if (list == NULL)
+		return -EMSGSIZE;
+
+	nft_rule_for_each_expr(expr, next, rule) {
+		ret = nft_expr_dump(skb, NFTA_LIST_ELEM, expr);
+		if (ret)
+			return ret;
+	}
+	nla_nest_end(skb, list);
+	nlmsg_end(skb, nlh);
+	return 0;
+}
 
 struct nft_ebpf_expression {
 	struct nft_expr e;
 	struct nft_ebpf priv;
 };
 
+static int nft_jit_rule(struct nft_trans *trans, struct sk_buff *skb)
+{
+	const struct nft_rule *r = nft_trans_rule(trans);
+	const struct nft_expr *e, *last;
+	struct nft_ebpf_expression ebpf = { 0 };
+	struct nft_rule *rule;
+	struct nft_expr *new;
+	unsigned int size = sizeof(ebpf);
+	int err, expr_count;
+
+	err = nft_jit_dump_ruleinfo(skb, &trans->ctx, nft_trans_rule(trans));
+	if (err < 0)
+		return err;
+
+	err = nf_tables_jit_work(skb, &ebpf.priv);
+	if (err < 0)
+		return err;
+
+	if (!ebpf.priv.prog)
+		return 0;
+
+	ebpf.priv.original = r;
+
+	if (r->udata) {
+		struct nft_userdata *udata = nft_userdata(r);
+
+		size += udata->len + 1;
+	}
+
+	rule = kmalloc(sizeof(*rule) + r->dlen + size, GFP_KERNEL);
+	if (!rule) {
+		bpf_prog_put(ebpf.priv.prog);
+		return -ENOMEM;
+	}
+
+	memcpy(rule, r, sizeof(*r));
+	rule->dlen = r->dlen + sizeof(ebpf);
+
+	new = nft_expr_first(rule);
+	memcpy(new, &ebpf, sizeof(ebpf));
+	new->ops = &nft_ebpf_fast_ops;
+	size = sizeof(ebpf);
+
+	expr_count = 0;
+	nft_rule_for_each_expr(e, last, r) {
+		++expr_count;
+		if (expr_count <= ebpf.priv.expressions)
+			continue; /* expression was jitted */
+
+		new = nft_expr_next(new);
+		memcpy(new, e, e->ops->size);
+		size += e->ops->size;
+	}
+
+	rule->dlen = size;
+	if (r->udata) {
+		const struct nft_userdata *udata = nft_userdata(r);
+
+		memcpy(nft_userdata(rule), udata, udata->len + 1);
+	}
+
+	list_replace_rcu(&nft_trans_rule(trans)->list, &rule->list);
+	nft_trans_rule(trans) = rule;
+
+	return 0;
+}
+
+int nft_jit_commit(struct net *net)
+{
+	struct nft_trans *trans;
+	struct sk_buff *skb;
+	int ret = 0;
+
+	skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	list_for_each_entry(trans, &net->nft.commit_list, list) {
+		if (trans->msg_type != NFT_MSG_NEWRULE)
+			continue;
+
+		ret = nft_jit_rule(trans, skb);
+		if (ret < 0)
+			break;
+		skb->head = skb->data;
+		skb_reset_tail_pointer(skb);
+	}
+
+	kfree_skb(skb);
+	return ret;
+}
+
 static const struct nla_policy nft_ebpf_policy[NFTA_EBPF_MAX + 1] = {
 	[NFTA_EBPF_FD]			= { .type = NLA_S32 },
 	[NFTA_EBPF_ID]			= { .type = NLA_U32 },
diff --git a/net/netfilter/nf_tables_jit/Makefile b/net/netfilter/nf_tables_jit/Makefile
new file mode 100644
index 000000000000..aa7509e49589
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/Makefile
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+
+hostprogs-y := nf_tables_jit_umh
+nf_tables_jit_umh-objs := main.o
+HOSTCFLAGS += -I. -Itools/include/
+
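+# objcopy the helper ELF into a .rodata-only object file so it can be
+# linked into the kernel image and executed via fork_usermode_blob().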
+quiet_cmd_copy_umh = GEN $@
+      cmd_copy_umh = echo ':' > $(obj)/.nf_tables_jit_umh.o.cmd; \
+      $(OBJCOPY) -I binary -O $(CONFIG_OUTPUT_FORMAT) \
+      -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
+      --rename-section .data=.rodata $< $@
+
+$(obj)/nf_tables_jit_umh.o: $(obj)/nf_tables_jit_umh
+	$(call cmd,copy_umh)
+
+obj-$(CONFIG_NF_TABLES_JIT) += nf_tables_jit.o
+nf_tables_jit-objs += nf_tables_jit_kern.o nf_tables_jit_umh.o
diff --git a/net/netfilter/nf_tables_jit/main.c b/net/netfilter/nf_tables_jit/main.c
new file mode 100644
index 000000000000..6f6a4423c2e4
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/main.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <unistd.h>
+
+int main(void)
+{
+	static struct {
+		int fd, count;
+	} response;
+
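+	/* stub: always answer "not translated" (fd -1, 0 expressions) */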
+	response.fd = -1;
+	for (;;) {
+		char buf[8192];
+
+		if (read(0, buf, sizeof(buf)) <= 0)
+			return 1;
+		if (write(1, &response, sizeof(response)) != sizeof(response))
+			return 2;
+	}
+
+	return 0;
+}
diff --git a/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
new file mode 100644
index 000000000000..4778f53b2683
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/umh.h>
+#include <linux/netfilter/nfnetlink.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+
+#define UMH_start _binary_net_netfilter_nf_tables_jit_nf_tables_jit_umh_start
+#define UMH_end _binary_net_netfilter_nf_tables_jit_nf_tables_jit_umh_end
+
+extern char UMH_start;
+extern char UMH_end;
+
+static struct umh_info info;
+
+static int nft_jit_load_umh(void)
+{
+	return fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
+}
+
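+/* Fork the embedded helper blob on first use; a valid pipe_to_umh
+ * doubles as the "helper is already running" indicator.
+ */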
+int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
+{
+	if (!info.pipe_to_umh) {
+		int ret = nft_jit_load_umh();
+		if (ret)
+			return ret;
+
+		if (WARN_ON(!info.pipe_to_umh))
+			return -EINVAL;
+	}
+
+	return 0;
+}
-- 
2.16.4


* [RFC nf-next 4/5] netfilter: nf_tables_jit: add dumping of original rule
  2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
                   ` (2 preceding siblings ...)
  2018-06-01 15:32 ` [RFC nf-next 3/5] netfilter: nf_tables: add rule ebpf jit infrastructure Florian Westphal
@ 2018-06-01 15:32 ` Florian Westphal
  2018-06-01 15:32 ` [RFC nf-next 5/5] netfilter: nf_tables_jit: add userspace nft to ebpf translator Florian Westphal
  2018-06-11 22:12 ` [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Alexei Starovoitov
  5 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal

After the previous patch, userspace can't discover the original
expressions anymore when listing the ruleset.

This change adds a dump callback to the ebpf expression and
special handling in the main dumper loop.

When we see an ebpf expression in a rule, we skip normal dump handling
and leave it to the nft ebpf expression -- it has a copy of the
original expressions and can then simply add them back.

In order to allow userspace to discover presence of auto-jit,
and to map the rule to the ebpf program, we still include the ebpf
expression itself as the first expression in the dump.

For now, we expose the ebpf tag and the ebpf id plus the
number of expressions that are supposedly covered by the program.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_tables_api.c | 11 +++++++++
 net/netfilter/nf_tables_jit.c | 55 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 40c2de230400..4c5acd5d1cab 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -2054,6 +2054,17 @@ static int nf_tables_fill_rule_info(struct sk_buff *skb, struct net *net,
 	if (list == NULL)
 		goto nla_put_failure;
 	nft_rule_for_each_expr(expr, next, rule) {
+		/*
+		 * special case: ebpf_fast_ops will add original expressions
+		 * to the netlink message; it will call
+		 * nf_tables_fill_expr_info() itself.
+		 */
+		if (expr->ops == &nft_ebpf_fast_ops) {
+			if (expr->ops->dump(skb, expr) < 0)
+				goto nla_put_failure;
+			break;
+		}
+
 		if (nft_expr_dump(skb, NFTA_LIST_ELEM, expr) < 0)
 			goto nla_put_failure;
 	}
diff --git a/net/netfilter/nf_tables_jit.c b/net/netfilter/nf_tables_jit.c
index a8f4696249bf..864331aaee6b 100644
--- a/net/netfilter/nf_tables_jit.c
+++ b/net/netfilter/nf_tables_jit.c
@@ -171,11 +171,66 @@ static void nft_ebpf_destroy(const struct nft_ctx *ctx,
 	kfree(priv->original);
 }
 
+static int nft_ebpf_dump(struct sk_buff *skb, const struct nft_expr *expr)
+{
+	const struct nft_ebpf *priv = nft_expr_priv(expr);
+	const struct bpf_prog *prog = priv->prog;
+	const struct nft_expr *next;
+	struct nlattr *nest, *data;
+	int ret;
+
+	/*
+	 * From a netlink perspective, dumps of normal and ebpf-jitted rules
+	 * are the same, except the ebpf-jitted rule has the ebpf expression
+	 * prepended to it.  The ebpf expression allows us to propagate the
+	 * ebpf tag and some other metadata back to userspace.
+	 *
+	 * After the ebpf expression we serialize the expressions of the
+	 * original rule (rather than the ebpf-rule blob used in the packet
+	 * path).
+	 */
+	nest = nla_nest_start(skb, NFTA_LIST_ELEM);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (nla_put_string(skb, NFTA_EXPR_NAME, expr->ops->type->name))
+		return -EMSGSIZE;
+
+	/* first, add ebpf expr meta data */
+	data = nla_nest_start(skb, NFTA_EXPR_DATA);
+	if (data == NULL)
+		return -EMSGSIZE;
+
+	ret = nla_put_be32(skb, NFTA_EBPF_ID, htonl(prog->aux->id));
+	if (ret)
+		return ret;
+
+	ret = nla_put(skb, NFTA_EBPF_TAG, sizeof(prog->tag), prog->tag);
+	if (ret)
+		return ret;
+
+	ret = nla_put_be32(skb, NFTA_EBPF_EXPR_COUNT, htonl(priv->expressions));
+	if (ret)
+		return ret;
+	nla_nest_end(skb, data);
+	nla_nest_end(skb, nest);
+
+	/* ... followed by the expressions that made up the original rule. */
+	nft_rule_for_each_expr(expr, next, priv->original) {
+		if (WARN_ON(expr->ops->dump == nft_ebpf_dump))
+			break;
+		if (nft_expr_dump(skb, NFTA_LIST_ELEM, expr) < 0)
+			return -EMSGSIZE;
+	}
+
+	return 0;
+}
+
 const struct nft_expr_ops nft_ebpf_fast_ops = {
 	.type		= &nft_ebpf_type,
 	.size		= NFT_EXPR_SIZE(sizeof(struct nft_ebpf)),
 	.init		= nft_ebpf_init,
 	.destroy	= nft_ebpf_destroy,
+	.dump		= nft_ebpf_dump,
 };
 
 struct nft_expr_type nft_ebpf_type __read_mostly = {
-- 
2.16.4


* [RFC nf-next 5/5] netfilter: nf_tables_jit: add userspace nft to ebpf translator
  2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
                   ` (3 preceding siblings ...)
  2018-06-01 15:32 ` [RFC nf-next 4/5] netfilter: nf_tables_jit: add dumping of original rule Florian Westphal
@ 2018-06-01 15:32 ` Florian Westphal
  2018-06-11 22:12 ` [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Alexei Starovoitov
  5 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-01 15:32 UTC (permalink / raw)
  To: netfilter-devel; +Cc: ast, daniel, netdev, Florian Westphal

The translator is currently rather limited.

It supports:
 * payload expression for network and transport header
 * meta mark, nfproto, l4proto
 * 32 bit immediates
 * 32 bit bitmask ops
 * accept/drop verdicts

Currently the kernel will submit each rule on its own.
However, the jitter is (eventually) supposed to also cope with complete
chains (including goto/jump).

It also lacks support for any kind of sets; anonymous sets would
be a good initial target as they can't change.

As this uses netlink, there is also no technical requirement for
libnftnl; it's simply used for convenience.

This doesn't need any userspace changes to work; however, a libnftnl and
nft patch will make debug info available (e.g. to match a rule with its
bpf program id).
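
Translation inside the helper is a two-stage pipeline: nft expressions
are first lowered to a small intermediate representation (IMR), which is
then jitted to ebpf.  A rough sketch of the lowering step, using the
imr.h API added below (offsets and values illustrative, error handling
omitted):

	struct imr_state *state = imr_state_alloc();

	/* e.g. 'ip protocol tcp accept': compare one payload byte */
	imr_state_add_obj(state, imr_object_alloc_alu(IMR_ALU_OP_EQ,
		imr_object_alloc_payload(IMR_PAYLOAD_BASE_NH,
					 offsetof(struct iphdr, protocol), 1),
		imr_object_alloc_imm32(IPPROTO_TCP)));
	imr_state_add_obj(state, imr_object_alloc_verdict(IMR_VERDICT_PASS));
	imr_state_rule_end(state);
	/* ... jit the state to a bpf program, send its fd to the kernel */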

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_tables_core.h           |    1 +
 net/netfilter/nf_tables_core.c                   |    1 +
 net/netfilter/nf_tables_jit/Makefile             |    3 +-
 net/netfilter/nf_tables_jit/imr.c                | 1401 ++++++++++++++++++++++
 net/netfilter/nf_tables_jit/imr.h                |   96 ++
 net/netfilter/nf_tables_jit/main.c               |  582 ++++++++-
 net/netfilter/nf_tables_jit/nf_tables_jit_kern.c |  146 ++-
 7 files changed, 2215 insertions(+), 15 deletions(-)
 create mode 100644 net/netfilter/nf_tables_jit/imr.c
 create mode 100644 net/netfilter/nf_tables_jit/imr.h

diff --git a/include/net/netfilter/nf_tables_core.h b/include/net/netfilter/nf_tables_core.h
index e9b5cc20ec45..f3e85e6c8cc6 100644
--- a/include/net/netfilter/nf_tables_core.h
+++ b/include/net/netfilter/nf_tables_core.h
@@ -82,6 +82,7 @@ int nft_jit_commit(struct net *net);
 static inline int nft_jit_commit(struct net *net) { return 0; }
 #endif
 int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e);
+void nft_jit_stop_umh(void);
 
 extern struct static_key_false nft_counters_enabled;
 extern struct static_key_false nft_trace_enabled;
diff --git a/net/netfilter/nf_tables_core.c b/net/netfilter/nf_tables_core.c
index 5557b2709f98..8956f873a8cb 100644
--- a/net/netfilter/nf_tables_core.c
+++ b/net/netfilter/nf_tables_core.c
@@ -319,4 +319,5 @@ void nf_tables_core_module_exit(void)
 	i = ARRAY_SIZE(nft_basic_types);
 	while (i-- > 0)
 		nft_unregister_expr(nft_basic_types[i]);
+	nft_jit_stop_umh();
 }
diff --git a/net/netfilter/nf_tables_jit/Makefile b/net/netfilter/nf_tables_jit/Makefile
index aa7509e49589..a1b8eb5a4c45 100644
--- a/net/netfilter/nf_tables_jit/Makefile
+++ b/net/netfilter/nf_tables_jit/Makefile
@@ -2,8 +2,9 @@
 #
 
 hostprogs-y := nf_tables_jit_umh
-nf_tables_jit_umh-objs := main.o
+nf_tables_jit_umh-objs := main.o imr.o
 HOSTCFLAGS += -I. -Itools/include/
+HOSTLOADLIBES_nf_tables_jit_umh = `pkg-config --libs libnftnl libmnl`
 
 quiet_cmd_copy_umh = GEN $@
       cmd_copy_umh = echo ':' > $(obj)/.nf_tables_jit_umh.o.cmd; \
diff --git a/net/netfilter/nf_tables_jit/imr.c b/net/netfilter/nf_tables_jit/imr.c
new file mode 100644
index 000000000000..2242bc7379ee
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/imr.c
@@ -0,0 +1,1401 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdbool.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <linux/if_ether.h>
+#include <arpa/inet.h>
+#include <linux/netfilter.h>
+
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+
+#include "imr.h"
+
+#define div_round_up(n, d)      (((n) + (d) - 1) / (d))
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
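+/* append instructions to the program image; bail out with -ENOMEM if
+ * the BPF_MAXINSNS limit would be exceeded.
+ */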
+#define EMIT(ctx, x)							\
+	do {								\
+		struct bpf_insn __tmp[] = { x };			\
+		if ((ctx)->len_cur + ARRAY_SIZE(__tmp) > BPF_MAXINSNS)	\
+			return -ENOMEM;					\
+		memcpy((ctx)->img + (ctx)->len_cur, &__tmp, sizeof(__tmp));		\
+		(ctx)->len_cur += ARRAY_SIZE(__tmp);			\
+	} while (0)
+
+struct imr_object {
+	enum imr_obj_type type:8;
+	uint8_t len;
+	uint8_t refcnt;
+
+	union {
+		struct {
+			union {
+				uint64_t value_large[8];
+				uint64_t value64;
+				uint32_t value32;
+			};
+		} imm;
+		struct {
+			uint16_t offset;
+			enum imr_payload_base base:8;
+		} payload;
+		struct {
+			enum imr_verdict verdict;
+		} verdict;
+		struct {
+			enum imr_meta_key key:8;
+		} meta;
+		struct {
+			struct imr_object *left;
+			struct imr_object *right;
+			enum imr_alu_op op:8;
+		} alu;
+	};
+};
+
+struct imr_state {
+	struct bpf_insn	*img;
+	uint16_t	len_cur;
+	uint16_t	num_objects;
+	uint8_t		nfproto;
+	uint8_t		regcount;
+
+	/* payload access <= headlen will use direct skb->data access.
+	 * Normally set to either sizeof(iphdr) or sizeof(ipv6hdr).
+	 *
+	 * Access >= headlen will need to go through skb_header_pointer().
+	 */
+	uint8_t		headlen;
+
+	/* where skb->data points to at start
+	 * of program.  Usually this is IMR_PAYLOAD_BASE_NH.
+	 */
+	enum imr_payload_base base:8;
+
+	/* hints to emitter */
+	bool reload_r2;
+
+	struct imr_object *registers[IMR_REG_COUNT];
+
+	struct imr_object **objects;
+};
+
+static int imr_jit_object(struct imr_state *, const struct imr_object *o);
+
+static void internal_error(const char *s)
+{
+	fprintf(stderr, "FIXME: internal error %s\n", s);
+	exit(1);
+}
+
+static unsigned int imr_regs_needed(unsigned int len)
+{
+	return div_round_up(len, sizeof(uint64_t));
+}
+
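+/* trivial stack-style register allocator: a value occupies one bpf
+ * register per 64 bits of its length, tracked via s->regcount.
+ */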
+static int imr_register_alloc(struct imr_state *s, uint32_t len)
+{
+	unsigned int regs_needed = imr_regs_needed(len);
+	uint8_t reg = s->regcount;
+
+	if (s->regcount + regs_needed >= IMR_REG_COUNT) {
+		internal_error("out of BPF registers");
+		return -1;
+	}
+
+	s->regcount += regs_needed;
+
+	return reg;
+}
+
+static int imr_register_get(const struct imr_state *s, uint32_t len)
+{
+	unsigned int regs_needed = imr_regs_needed(len);
+
+	if (s->regcount < regs_needed)
+		internal_error("not enough registers in use");
+
+	return s->regcount - regs_needed;
+}
+
+static int bpf_reg_width(unsigned int len)
+{
+	switch (len) {
+	case sizeof(uint8_t): return BPF_B;
+	case sizeof(uint16_t): return BPF_H;
+	case sizeof(uint32_t): return BPF_W;
+	case sizeof(uint64_t): return BPF_DW;
+	default:
+		internal_error("reg size not supported");
+	}
+
+	return -EINVAL;
+}
+
+/* map op to negated bpf opcode.
+ * This is because if we want to check 'eq', we need
+ * to jump to end of rule ('break') on inequality, i.e.
+ * 'branch if NOT equal'.
+ */
+static int alu_jmp_get_negated_bpf_opcode(enum imr_alu_op op)
+{
+	switch (op) {
+	case IMR_ALU_OP_EQ:
+		return BPF_JNE;
+	case IMR_ALU_OP_NE:
+		return BPF_JEQ;
+	case IMR_ALU_OP_LT:
+		return BPF_JGE;
+	case IMR_ALU_OP_LTE:
+		return BPF_JGT;
+	case IMR_ALU_OP_GT:
+		return BPF_JLE;
+	case IMR_ALU_OP_GTE:
+		return BPF_JLT;
+	case IMR_ALU_OP_LSHIFT:
+	case IMR_ALU_OP_AND:
+		break;
+        }
+
+	internal_error("invalid imr alu op");
+	return -EINVAL;
+}
+
+static void imr_register_release(struct imr_state *s, uint32_t len)
+{
+	unsigned int regs_needed = imr_regs_needed(len);
+
+	if (s->regcount < regs_needed)
+		internal_error("regcount underflow");
+	s->regcount -= regs_needed;
+}
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num reg, struct imr_object *o)
+{
+	struct imr_object *old;
+
+	old = s->registers[reg];
+	if (old)
+		imr_object_free(old);
+
+	s->registers[reg] = o;
+}
+
+struct imr_object *imr_register_load(const struct imr_state *s, enum imr_reg_num reg)
+{
+	struct imr_object *o = s->registers[reg];
+
+	if (!o)
+		internal_error("empty register");
+
+	if (!o->refcnt)
+		internal_error("already free'd object in register");
+
+	o->refcnt++;
+	return o;
+}
+
+struct imr_state *imr_state_alloc(void)
+{
+	struct imr_state *s = calloc(1, sizeof(*s));
+
+	return s;
+}
+
+void imr_state_free(struct imr_state *s)
+{
+	int i;
+
+	for (i = 0; i < s->num_objects; i++)
+		imr_object_free(s->objects[i]);
+
+	free(s->objects);
+	free(s->img);
+	free(s);
+}
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t)
+{
+	struct imr_object *o = calloc(1, sizeof(*o));
+
+	if (!o)
+		return NULL;
+
+	o->refcnt = 1;
+	o->type = t;
+	return o;
+}
+
+static struct imr_object *imr_object_copy(const struct imr_object *old)
+{
+	struct imr_object *o = imr_object_alloc(old->type);
+
+	if (!o)
+		return NULL;
+
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+	case IMR_OBJ_TYPE_IMMEDIATE:
+	case IMR_OBJ_TYPE_PAYLOAD:
+	case IMR_OBJ_TYPE_META:
+		memcpy(o, old, sizeof(*o));
+		o->refcnt = 1;
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		o->alu.op = old->alu.op;
+		o->alu.left = imr_object_copy(old->alu.left);
+		o->alu.right = imr_object_copy(old->alu.right);
+		if (!o->alu.left || !o->alu.right) {
+			imr_object_free(o);
+			return NULL;
+		}
+		break;
+	}
+
+	o->len = old->len;
+	return o;
+}
+
+static struct imr_object *imr_object_split64(struct imr_object *to_split)
+{
+	struct imr_object *o = NULL;
+
+	if (to_split->len < sizeof(uint64_t))
+		internal_error("bogus split of size <= uint64_t");
+
+	to_split->len -= sizeof(uint64_t);
+
+	switch (to_split->type) {
+	case IMR_OBJ_TYPE_IMMEDIATE: {
+		uint64_t tmp;
+
+		o = imr_object_copy(to_split);
+		o->imm.value64 = to_split->imm.value_large[0];
+
+		switch (to_split->len) {
+		case 0:
+			break;
+		case sizeof(uint32_t):
+			tmp = to_split->imm.value_large[1];
+			to_split->imm.value32 = tmp;
+			break;
+		case sizeof(uint64_t):
+			tmp = to_split->imm.value_large[1];
+			to_split->imm.value64 = tmp;
+			break;
+		default:
+			memmove(to_split->imm.value_large, &to_split->imm.value_large[1],
+				sizeof(to_split->imm.value_large) - sizeof(to_split->imm.value_large[0]));
+			break;
+		}
+		}
+		break;
+	case IMR_OBJ_TYPE_PAYLOAD:
+		o = imr_object_copy(to_split);
+		to_split->payload.offset += sizeof(uint64_t);
+		break;
+	case IMR_OBJ_TYPE_META:
+		internal_error("can't split meta");
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		o = imr_object_alloc(to_split->type);
+		if (!o)
+			return NULL;
+		o->alu.left = imr_object_split64(to_split->alu.left);
+		o->alu.right = imr_object_split64(to_split->alu.right);
+
+		if (!o->alu.left || !o->alu.right) {
+			imr_object_free(o);
+			return NULL; /* Can't recover */
+
+		}
+		break;
+	case IMR_OBJ_TYPE_VERDICT:
+		internal_error("can't split type");
+	}
+
+	if (o)
+		o->len = sizeof(uint64_t);
+	return o;
+}
+
+void imr_object_free(struct imr_object *o)
+{
+	if (!o)
+		return;
+
+	if (o->refcnt == 0) {
+		internal_error("double-free, refcnt already zero");
+		o->refcnt--;
+	}
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+	case IMR_OBJ_TYPE_IMMEDIATE:
+	case IMR_OBJ_TYPE_PAYLOAD:
+	case IMR_OBJ_TYPE_META:
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		imr_object_free(o->alu.left);
+		imr_object_free(o->alu.right);
+		break;
+	}
+
+	o->refcnt--;
+	if (o->refcnt > 0)
+		return;
+
+	free(o);
+}
+
+struct imr_object *imr_object_alloc_imm32(uint32_t value)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+
+	if (o) {
+		o->imm.value32 = value;
+		o->len = sizeof(value);
+	}
+	return o;
+}
+
+struct imr_object *imr_object_alloc_imm64(uint64_t value)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+
+	if (o) {
+		o->imm.value64 = value;
+		o->len = sizeof(value);
+	}
+	return o;
+}
+
+struct imr_object *imr_object_alloc_imm(const uint32_t *data, unsigned int len)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+	unsigned int left = len;
+	int i = 0;
+
+	if (!o)
+		return NULL;
+
+	while (left >= sizeof(uint64_t)) {
+		uint64_t value = *data;
+
+		left -= sizeof(uint64_t);
+
+		value <<= 32;
+		data++;
+		value |= *data;
+		data++;
+
+		if (i >= ARRAY_SIZE(o->imm.value_large)) {
+			internal_error("value too large");
+			imr_object_free(o);
+			return NULL;
+		}
+		o->imm.value_large[i++] = value;
+	}
+
+	if (left) {
+		if (left != sizeof(uint32_t))
+			internal_error("values are expected in 4-byte chunks at least");
+
+		if (i >= ARRAY_SIZE(o->imm.value_large)) {
+			internal_error("value too large");
+			imr_object_free(o);
+			return NULL;
+		}
+		o->imm.value_large[i] = *data;
+	}
+
+	o->len = len;
+	return o;
+}
+
+struct imr_object *imr_object_alloc_verdict(enum imr_verdict v)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_VERDICT);
+
+	if (!o)
+		return NULL;
+
+	o->verdict.verdict = v;
+	o->len = sizeof(v);
+
+	return o;
+}
+
+static const char * alu_op_to_str(enum imr_alu_op op)
+{
+	switch (op) {
+	case IMR_ALU_OP_EQ: return "eq";
+	case IMR_ALU_OP_NE: return "ne";
+	case IMR_ALU_OP_LT: return "<";
+	case IMR_ALU_OP_LTE: return "<=";
+	case IMR_ALU_OP_GT: return ">";
+	case IMR_ALU_OP_GTE: return ">=";
+	case IMR_ALU_OP_AND: return "&";
+	case IMR_ALU_OP_LSHIFT: return "<<";
+	}
+
+	return "?";
+}
+
+static const char *verdict_to_str(enum imr_verdict v)
+{
+	switch (v) {
+	case IMR_VERDICT_NONE: return "none";
+	case IMR_VERDICT_NEXT: return "next";
+	case IMR_VERDICT_PASS: return "pass";
+	case IMR_VERDICT_DROP: return "drop";
+	}
+
+	return "invalid";
+}
+
+static int imr_object_print_imm(FILE *fp, const struct imr_object *o)
+{
+	switch (o->len) {
+	case sizeof(uint64_t):
+		return fprintf(fp, "(0x%16llx)", (unsigned long long)o->imm.value64);
+	case sizeof(uint32_t):
+		return fprintf(fp, "(0x%08x)", (unsigned int)o->imm.value32);
+	default:
+		return fprintf(fp, "(0x%llx?)", (unsigned long long)o->imm.value64);
+	}
+}
+
+static const char *meta_to_str(enum imr_meta_key k)
+{
+	switch (k) {
+	case IMR_META_NFMARK:
+		return "nfmark";
+	case IMR_META_NFPROTO:
+		return "nfproto";
+	case IMR_META_L4PROTO:
+		return "l4proto";
+	}
+
+	return "unknown";
+}
+
+static const char *type_to_str(enum imr_obj_type t)
+{
+	switch (t) {
+	case IMR_OBJ_TYPE_VERDICT: return "verdict";
+	case IMR_OBJ_TYPE_IMMEDIATE: return "imm";
+	case IMR_OBJ_TYPE_PAYLOAD: return "payload";
+	case IMR_OBJ_TYPE_ALU: return "alu";
+	case IMR_OBJ_TYPE_META: return "meta";
+	}
+
+	return "unknown";
+}
+
+static int imr_object_print(FILE *fp, const struct imr_object *o)
+{
+	int ret, total = 0;
+
+	ret = fprintf(fp, "%s", type_to_str(o->type));
+	if (ret < 0)
+		return ret;
+	total += ret;
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+		ret = fprintf(fp, "(%s)", verdict_to_str(o->verdict.verdict));
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_PAYLOAD:
+		ret = fprintf(fp, "(base %d, off %d, len %d)",
+				o->payload.base, o->payload.offset, o->len);
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_IMMEDIATE:
+		ret = imr_object_print_imm(fp, o);
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_ALU:
+		ret = fprintf(fp, "(");
+		if (ret < 0)
+			break;
+		total += ret;
+		ret = imr_object_print(fp, o->alu.left);
+		if (ret < 0)
+			break;
+		total += ret;
+
+		ret = fprintf(fp , " %s ", alu_op_to_str(o->alu.op));
+		if (ret < 0)
+			break;
+		total += ret;
+
+		ret = imr_object_print(fp, o->alu.right);
+		if (ret < 0)
+			break;
+		total += ret;
+
+		ret = fprintf(fp, ") ");
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	case IMR_OBJ_TYPE_META:
+		ret = fprintf(fp , " %s ", meta_to_str(o->meta.key));
+		if (ret < 0)
+			break;
+		total += ret;
+		break;
+	default:
+		internal_error("missing print support");
+		break;
+	}
+
+	return total;
+}
+
+void imr_state_print(FILE *fp, struct imr_state *s)
+{
+	int i;
+
+	for (i = 0; i < s->num_objects; i++) {
+		imr_object_print(fp, s->objects[i]);
+		putc('\n', fp);
+	}
+}
+
+struct imr_object *imr_object_alloc_meta(enum imr_meta_key k)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_META);
+
+	if (!o)
+		return NULL;
+
+	o->meta.key = k;
+
+	switch (k) {
+	case IMR_META_L4PROTO:
+		o->len = sizeof(uint16_t);
+		break;
+	case IMR_META_NFPROTO:
+		o->len = sizeof(uint8_t);
+		break;
+	case IMR_META_NFMARK:
+		o->len = sizeof(uint32_t);
+		break;
+	}
+
+	return o;
+}
+
+struct imr_object *imr_object_alloc_payload(enum imr_payload_base b, uint16_t off, uint16_t len)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_PAYLOAD);
+
+	if (!o)
+		return NULL;
+
+	o->payload.base = b;
+	o->payload.offset = off;
+
+	if (len == 0)
+		internal_error("payload length is 0");
+	if (len > 16)
+		internal_error("payload length exceeds 16 byte");
+
+	o->len = len;
+
+	return o;
+}
+
+struct imr_object *imr_object_alloc_alu(enum imr_alu_op op, struct imr_object *l, struct imr_object *r)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_ALU);
+
+	if (!o)
+		return NULL;
+
+	if (l == r)
+		internal_error("same operands");
+
+	o->alu.op = op;
+	o->alu.left = l;
+	o->alu.right = r;
+
+	if (l->len == 0 || r->len == 0)
+		internal_error("alu op with 0 op length");
+
+	o->len = l->len;
+	if (r->len > o->len)
+		o->len = r->len;
+
+	return o;
+}
+
+static int imr_state_add_obj_alu(struct imr_state *s, struct imr_object *o)
+{
+	struct imr_object *old;
+
+	if (s->num_objects == 0 || o->len > sizeof(uint64_t))
+		return -EINVAL;
+
+	old = s->objects[s->num_objects - 1];
+
+	if (old->type != IMR_OBJ_TYPE_ALU)
+		return -EINVAL;
+	if (old->alu.left != o->alu.left)
+		return -EINVAL;
+
+	imr_object_free(o->alu.left);
+	o->alu.left = old;
+	s->objects[s->num_objects - 1] = o;
+
+	if (old->len != o->len)
+		internal_error("different op len but same src");
+	return 0;
+}
+
+int imr_state_add_obj(struct imr_state *s, struct imr_object *o)
+{
+	struct imr_object **new;
+	uint32_t slot = s->num_objects;
+
+	if (s->num_objects >= 0xffff / sizeof(*o))
+		return -1;
+
+	if (o->type == IMR_OBJ_TYPE_ALU &&
+	    imr_state_add_obj_alu(s, o) == 0)
+		return 0;
+
+	s->num_objects++;
+
+	new = realloc(s->objects, sizeof(o) * s->num_objects);
+	if (!new) {
+		imr_object_free(o);
+		return -1;
+	}
+
+	new[slot] = o;
+	if (new != s->objects)
+		s->objects = new;
+
+	return 0;
+}
+
+int imr_state_rule_end(struct imr_state *s)
+{
+	uint32_t slot = s->num_objects;
+	struct imr_object *last;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(s->registers); i++) {
+		last = s->registers[i];
+		if (last)
+			imr_register_store(s, i, NULL);
+	}
+
+	if (slot == 0)
+		internal_error("rule end, but no objects present\n");
+	last = s->objects[slot - 1];
+
+	if (last->type == IMR_OBJ_TYPE_VERDICT)
+		return 0;
+
+	return imr_state_add_obj(s, imr_object_alloc_verdict(IMR_VERDICT_NEXT));
+}
+
+static int imr_jit_obj_immediate(struct imr_state *s,
+				 const struct imr_object *o)
+{
+	int bpf_reg = imr_register_get(s, o->len);
+
+	switch (o->len) {
+	case sizeof(uint32_t):
+		EMIT(s, BPF_MOV32_IMM(bpf_reg, o->imm.value32));
+		return 0;
+	case sizeof(uint64_t):
+		EMIT(s, BPF_LD_IMM64(bpf_reg, o->imm.value64));
+		return 0;
+	default:
+		break;
+	}
+
+	internal_error("unhandled immediate size");
+	return -EINVAL;
+}
+
+static int imr_jit_verdict(struct imr_state *s, int verdict)
+{
+	EMIT(s, BPF_MOV32_IMM(BPF_REG_0, verdict));
+	EMIT(s, BPF_EXIT_INSN());
+	return 0;
+}
+
+static int imr_jit_obj_verdict(struct imr_state *s,
+			       const struct imr_object *o)
+{
+	int verdict = o->verdict.verdict;
+
+	switch (o->verdict.verdict) {
+	case IMR_VERDICT_NEXT: /* no-op: continue with next rule */
+		return 0;
+	case IMR_VERDICT_PASS:
+		verdict = NF_ACCEPT;
+		break;
+	case IMR_VERDICT_DROP:
+		verdict = NF_DROP;
+		break;
+	case IMR_VERDICT_NONE:
+		verdict = -1; /* NFT_CONTINUE */
+		break;
+	default:
+		internal_error("unhandled verdict");
+	}
+
+	return imr_jit_verdict(s, verdict);
+}
+
+static unsigned int align_for_stack(uint16_t len)
+{
+	return div_round_up(len, sizeof(uint64_t)) * sizeof(uint64_t);
+}
+
+static int imr_reload_skb_data(struct imr_state *state)
+{
+	int tmp_reg = imr_register_alloc(state, sizeof(uint64_t));
+
+	/* headlen tells how much bytes we can expect to reside
+	 * in the skb linear area.
+	 *
+	 * Used to decide when to prefer direct access vs.
+	 * bpf equivalent of skb_header_pointer().
+	 */
+	EMIT(state, BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
+			 offsetof(struct __sk_buff, data)));
+	EMIT(state, BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
+			 offsetof(struct __sk_buff, data_end)));
+
+	EMIT(state, BPF_MOV64_REG(tmp_reg, BPF_REG_2));
+	EMIT(state, BPF_ALU64_IMM(BPF_ADD, tmp_reg, state->headlen));
+
+	/* This is so that verifier can mark accesses to
+	 * skb->data as safe provided they don't exceed data_end (R3).
+	 *
+	 * IMR makes sure it switches to bpf_skb_load_bytes helper for
+	 * accesses that are larger, else verifier rejects program.
+	 *
+	 * R3 and R4 are only used temporarily here, no need to preserve them.
+	 */
+	EMIT(state, BPF_JMP_REG(BPF_JLE, tmp_reg, BPF_REG_3, 2));
+
+	imr_register_release(state, sizeof(uint64_t));
+
+	/*
+	 * ((R2 (data) + headlen) > R3 data_end.
+	 * Should never happen for nf hook points, ip/ipv6 stack pulls
+	 * at least ip(6) header into linear area, and caller will
+	 * pass this header size as headlen.
+	 */
+	EMIT(state, BPF_MOV32_IMM(BPF_REG_0, NF_DROP));
+	EMIT(state, BPF_EXIT_INSN());
+	return 0;
+}
+
+static int imr_load_thoff(struct imr_state *s, int bpfreg)
+{
+	/* fetch 16bit off cb[0] */
+	EMIT(s, BPF_LDX_MEM(BPF_H, bpfreg, BPF_REG_1, offsetof(struct __sk_buff, cb[0])));
+	return 0;
+}
+
+static int imr_maybe_reload_skb_data(struct imr_state *state)
+{
+	if (state->reload_r2) {
+		state->reload_r2 = false;
+		return imr_reload_skb_data(state);
+	}
+
+	return 0;
+}
+
+/*
+ * Though R10 is correct read-only register and has type PTR_TO_STACK
+ * and R10 - 4 is within stack bounds, there were no stores into that location.
+ */
+static int bpf_skb_load_bytes(struct imr_state *state,
+			      uint16_t offset, uint16_t olen,
+			      int bpf_reg_hdr_off)
+{
+	int len = align_for_stack(olen);
+	int tmp_reg;
+
+	tmp_reg = imr_register_alloc(state, sizeof(uint64_t));
+	if (tmp_reg < 0)
+		return -ENOSPC;
+
+	EMIT(state, BPF_MOV64_IMM(BPF_REG_2, offset));
+	state->reload_r2 = true;
+
+	EMIT(state, BPF_ALU64_REG(BPF_ADD, BPF_REG_2, bpf_reg_hdr_off));
+
+	EMIT(state, BPF_ALU64_REG(BPF_MOV, BPF_REG_3, BPF_REG_10));
+	EMIT(state, BPF_ALU64_IMM(BPF_ADD, BPF_REG_3, -len));
+
+	EMIT(state, BPF_MOV64_IMM(BPF_REG_4, olen));
+
+	EMIT(state, BPF_MOV64_REG(tmp_reg, BPF_REG_1));
+
+	EMIT(state, BPF_EMIT_CALL(BPF_FUNC_skb_load_bytes));
+
+	/* 0: ok, so move to next rule on error */
+	EMIT(state, BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 0));
+
+	EMIT(state, BPF_MOV64_REG(BPF_REG_1, tmp_reg));
+	imr_register_release(state, sizeof(uint64_t));
+	return 0;
+}
+
+static int imr_jit_obj_payload(struct imr_state *state,
+			       const struct imr_object *o)
+{
+	int base = o->payload.base;
+	int offset = o->payload.offset;
+	int bpf_width = bpf_reg_width(o->len);
+	int bpf_reg = imr_register_get(state, o->len);
+	int ret, bpf_reg_hdr_off;
+
+	switch (base) {
+	case IMR_PAYLOAD_BASE_LL: /* XXX: */
+		internal_error("can't handle ll yet");
+		return -ENOTSUP;
+	case IMR_PAYLOAD_BASE_NH:
+		if (state->base == base &&
+		    offset <= state->headlen) {
+			ret = imr_maybe_reload_skb_data(state);
+			if (ret < 0)
+				return ret;
+			EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_2, offset));
+			return 0;
+		}
+		/* XXX: use bpf_load_bytes helper if offset is too big */
+		internal_error("can't handle nonlinear yet");
+		return -ENOTSUP;
+	case IMR_PAYLOAD_BASE_TH:
+		if (o->len > sizeof(uint64_t))
+			internal_error("can't handle size exceeding 8 bytes");
+
+		bpf_reg_hdr_off = imr_register_alloc(state, sizeof(uint16_t));
+		if (bpf_reg_hdr_off < 0)
+			return -ENOSPC;
+
+		ret = imr_load_thoff(state, bpf_reg_hdr_off);
+		if (ret < 0) {
+			imr_register_release(state, sizeof(uint16_t));
+			return ret;
+		}
+
+		ret = bpf_skb_load_bytes(state, offset, o->len,
+					 bpf_reg_hdr_off);
+		imr_register_release(state, sizeof(uint16_t));
+
+		if (ret)
+			return ret;
+
+		EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_10,
+					-align_for_stack(o->len)));
+		return 0;
+	}
+
+	internal_error("invalid base");
+	return -ENOTSUP;
+}
+
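+/* Conditional jumps are emitted with off 0 as a placeholder, e.g. a
+ * 'cmp' becomes 'JNE reg, imm, 0'.  Once a rule has been fully
+ * emitted, every such pending jump is retargeted to the first
+ * instruction after the rule, so a failed match falls through to the
+ * next rule (or the final break verdict).
+ */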
+static void imr_fixup_jumps(struct imr_state *state, unsigned int poc_start)
+{
+	unsigned int pc, pc_end, i;
+
+	if (poc_start >= state->len_cur)
+		internal_error("old poc >= current one");
+
+	pc = 0;
+	pc_end = state->len_cur - poc_start;
+
+	for (i = poc_start; pc < pc_end; pc++, i++) {
+		if (BPF_CLASS(state->img[i].code) == BPF_JMP) {
+			if (state->img[i].code == (BPF_EXIT | BPF_JMP))
+				continue;
+			if (state->img[i].code == (BPF_CALL | BPF_JMP))
+				continue;
+
+			if (state->img[i].off)
+				continue;
+			state->img[i].off = pc_end - pc - 1;
+		}
+	}
+}
+
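+/* For reference: the nft interpreter's cmp expression that the jitted
+ * compare/jump sequences below must match.
+ */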
+#if 0
+static void nft_cmp_eval(const struct nft_expr *expr,
+                         struct nft_regs *regs,
+                         const struct nft_pktinfo *pkt)
+{
+        const struct nft_cmp_expr *priv = nft_expr_priv(expr);
+        int d;
+
+        d = memcmp(&regs->data[priv->sreg], &priv->data, priv->len);
+        switch (priv->op) {
+        case NFT_CMP_EQ:
+                if (d != 0)
+                        goto mismatch;
+                break;
+        case NFT_CMP_NEQ:
+                if (d == 0)
+                        goto mismatch;
+                break;
+        case NFT_CMP_LT:
+                if (d == 0)
+                        goto mismatch;
+                /* fall through */
+        case NFT_CMP_LTE:
+                if (d > 0)
+                        goto mismatch;
+                break;
+        case NFT_CMP_GT:
+                if (d == 0)
+                        goto mismatch;
+                /* fall through */
+        case NFT_CMP_GTE:
+                if (d < 0)
+                        goto mismatch;
+                break;
+        }
+        return;
+
+mismatch:
+        regs->verdict.code = NFT_BREAK;
+}
+#endif
+
+static int __imr_jit_memcmp_sub64(struct imr_state *state,
+				  struct imr_object *sub,
+				  int regl)
+{
+	int regr, ret = imr_jit_object(state, sub->alu.left);
+
+	if (ret < 0)
+		return ret;
+
+	regr = imr_register_alloc(state, sizeof(uint64_t));
+	if (regr < 0)
+		return -ENOSPC;
+
+	ret = imr_jit_object(state, sub->alu.right);
+	if (ret < 0) {
+		imr_register_release(state, sizeof(uint64_t));
+		return ret;
+	}
+
+	EMIT(state, BPF_ALU64_REG(BPF_SUB, regl, regr));
+
+	imr_register_release(state, sizeof(uint64_t));
+	return 0;
+}
+
+static int __imr_jit_memcmp_sub32(struct imr_state *state,
+				  struct imr_object *sub,
+				  int regl)
+{
+	const struct imr_object *right = sub->alu.right;
+	int regr, ret = imr_jit_object(state, sub->alu.left);
+
+	if (ret < 0)
+		return ret;
+
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE && right->len) {
+		EMIT(state, BPF_ALU32_IMM(BPF_SUB, regl, right->imm.value32));
+		return 0;
+	}
+
+	regr = imr_register_alloc(state, sizeof(uint32_t));
+	if (regr < 0)
+		return -ENOSPC;
+
+	ret = imr_jit_object(state, right);
+	if (ret < 0) {
+		imr_register_release(state, sizeof(uint32_t));
+		return ret;
+	}
+
+	EMIT(state, BPF_ALU32_REG(BPF_SUB, regl, regr));
+	imr_register_release(state, sizeof(uint32_t));
+	return 0;
+}
+
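+/* Compares wider than 8 bytes, e.g. ipv6 addresses, are split into
+ * 64bit chunks: each chunk is subtracted and tested individually, so
+ * a 16 byte compare turns into two sub64/jump pairs, plus a 32bit
+ * step for any remainder.
+ */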
+static int imr_jit_alu_bigcmp(struct imr_state *state, const struct imr_object *o)
+{
+	struct imr_object *copy = imr_object_copy(o);
+	unsigned int start_insn = state->len_cur;
+	int regl, ret;
+
+	if (!copy)
+		return -ENOMEM;
+
+	regl = imr_register_alloc(state, sizeof(uint64_t));
+	if (regl < 0) {
+		imr_object_free(copy);
+		return -ENOSPC;
+	}
+
+	do {
+		struct imr_object *tmp;
+
+		tmp = imr_object_split64(copy);
+		if (!tmp) {
+			imr_register_release(state, sizeof(uint64_t));
+			imr_object_free(copy);
+			return -ENOMEM;
+		}
+
+		ret = __imr_jit_memcmp_sub64(state, tmp, regl);
+		imr_object_free(tmp);
+		if (ret < 0) {
+			imr_register_release(state, sizeof(uint64_t));
+			imr_object_free(copy);
+			return ret;
+		}
+		/* XXX: 64bit */
+		EMIT(state, BPF_JMP_IMM(BPF_JNE, regl, 0, 0));
+	} while (copy->len >= sizeof(uint64_t));
+
+	if (copy->len) {
+		ret = __imr_jit_memcmp_sub32(state, copy, regl);
+
+		if (ret < 0) {
+			imr_object_free(copy);
+			imr_register_release(state, sizeof(uint64_t));
+			return ret;
+		}
+	}
+
+	imr_object_free(copy);
+	imr_fixup_jumps(state, start_insn);
+
+	switch (o->alu.op) {
+	case IMR_ALU_OP_AND:
+	case IMR_ALU_OP_LSHIFT:
+		internal_error("not a jump");
+	case IMR_ALU_OP_EQ:
+	case IMR_ALU_OP_NE:
+	case IMR_ALU_OP_LT:
+	case IMR_ALU_OP_LTE:
+	case IMR_ALU_OP_GT:
+	case IMR_ALU_OP_GTE:
+		EMIT(state, BPF_JMP_IMM(alu_jmp_get_negated_bpf_opcode(o->alu.op), regl, 0, 0));
+		break;
+	}
+
+	imr_register_release(state, sizeof(uint64_t));
+	return 0;
+}
+
+static int __imr_jit_obj_alu_jmp(struct imr_state *state,
+			         const struct imr_object *o,
+				 int regl)
+{
+	const struct imr_object *right;
+	enum imr_reg_num regr;
+	int op, ret;
+
+	right = o->alu.right;
+
+	op = alu_jmp_get_negated_bpf_opcode(o->alu.op);
+
+	/* avoid 2nd register if possible */
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE) {
+		switch (right->len) {
+		case sizeof(uint32_t):
+			EMIT(state, BPF_JMP_IMM(op, regl, right->imm.value32, 0));
+			return 0;
+		}
+	}
+
+	regr = imr_register_alloc(state, right->len);
+	if (regr < 0)
+		return -ENOSPC;
+
+	ret = imr_jit_object(state, right);
+	if (ret == 0) {
+		EMIT(state, BPF_MOV32_IMM(BPF_REG_0, -2)); /* NFT_BREAK */
+		EMIT(state, BPF_JMP_REG(op, regl, regr, 0));
+	}
+
+	imr_register_release(state, right->len);
+	return ret;
+}
+
+static int imr_jit_obj_alu_jmp(struct imr_state *state,
+			       const struct imr_object *o,
+			       int regl)
+
+{
+	int ret;
+
+	/* multiple tests on same source? */
+	if (o->alu.left->type == IMR_OBJ_TYPE_ALU) {
+		ret = imr_jit_obj_alu_jmp(state, o->alu.left, regl);
+		if (ret < 0)
+			return ret;
+	} else {
+		ret = imr_jit_object(state, o->alu.left);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = __imr_jit_obj_alu_jmp(state, o, regl);
+
+	return ret;
+}
+
+static int imr_jit_obj_alu(struct imr_state *state, const struct imr_object *o)
+{
+	const struct imr_object *right;
+	enum imr_reg_num regl;
+	int ret, op;
+
+	switch (o->alu.op) {
+	case IMR_ALU_OP_AND:
+		op = BPF_AND;
+		break;
+	case IMR_ALU_OP_LSHIFT:
+		op = BPF_LSH;
+		break;
+	case IMR_ALU_OP_EQ:
+	case IMR_ALU_OP_NE:
+	case IMR_ALU_OP_LT:
+	case IMR_ALU_OP_LTE:
+	case IMR_ALU_OP_GT:
+	case IMR_ALU_OP_GTE:
+		if (o->len > sizeof(uint64_t))
+			return imr_jit_alu_bigcmp(state, o);
+
+		regl = imr_register_alloc(state, o->len);
+		if (regl < 0)
+			return -ENOSPC;
+
+		ret = imr_jit_obj_alu_jmp(state, o, regl);
+		imr_register_release(state, o->len);
+		return ret;
+	}
+
+	ret = imr_jit_object(state, o->alu.left);
+	if (ret)
+		return ret;
+
+	regl = imr_register_get(state, o->len);
+	if (regl < 0)
+		return -EINVAL;
+
+	right = o->alu.right;
+
+	/* avoid 2nd register if possible */
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE) {
+		switch (right->len) {
+		case sizeof(uint32_t):
+			EMIT(state, BPF_ALU32_IMM(op, regl, right->imm.value32));
+			return 0;
+		}
+	}
+
+	internal_error("alu bitops only handle 32bit immediate RHS");
+	return -EINVAL;
+}
+
+static int imr_jit_obj_meta(struct imr_state *state, const struct imr_object *o)
+{
+	int bpf_reg = imr_register_get(state, o->len);
+	int bpf_width = bpf_reg_width(o->len);
+	int ret;
+
+	switch (o->meta.key) {
+	case IMR_META_NFMARK:
+		EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_1,
+					 offsetof(struct __sk_buff, mark)));
+		break;
+	case IMR_META_L4PROTO:
+		ret = imr_load_thoff(state, bpf_reg);
+		if (ret < 0)
+			return ret;
+
+		EMIT(state, BPF_JMP_IMM(BPF_JEQ, bpf_reg, 0, 0)); /* th == 0? L4PROTO undefined. */
+		EMIT(state, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_1,
+					 offsetof(struct __sk_buff, cb[1])));
+		break;
+	case IMR_META_NFPROTO:
+		switch (state->nfproto) {
+		case NFPROTO_IPV4:
+		case NFPROTO_IPV6:
+			EMIT(state, BPF_MOV32_IMM(bpf_reg, state->nfproto));
+			break;
+		case NFPROTO_INET:	/* first need to check ihl->version */
+			ret = imr_maybe_reload_skb_data(state);
+			if (ret < 0)
+				return ret;
+
+			/* bpf_reg = iph->version & 0xf0 */
+			EMIT(state, BPF_LDX_MEM(BPF_B, bpf_reg, BPF_REG_2, 0));		/* ihl->version/hdrlen */
+			EMIT(state, BPF_ALU32_IMM(BPF_AND, bpf_reg, 0xf0));		/* retain version */
+
+			EMIT(state, BPF_JMP_IMM(BPF_JNE, bpf_reg, 4 << 4, 2));		/* ipv4? */
+			EMIT(state, BPF_MOV32_IMM(bpf_reg, NFPROTO_IPV4));
+			EMIT(state, BPF_JMP_IMM(BPF_JA, 0, 0, 5));			/* skip NF_DROP */
+
+			EMIT(state, BPF_JMP_IMM(BPF_JNE, bpf_reg, 6 << 4, 4));		/* ipv6? */
+			EMIT(state, BPF_MOV32_IMM(bpf_reg, NFPROTO_IPV6));
+			EMIT(state, BPF_JMP_IMM(BPF_JA, 0, 0, 2));			/* skip NF_DROP */
+
+			EMIT(state, BPF_MOV32_IMM(BPF_REG_0, NF_DROP));
+			EMIT(state, BPF_EXIT_INSN());
+			/* Not ipv4, not ipv6? Should not happen: INET hooks from ipv4/ipv6 stack */
+			break;
+		default:
+			internal_error("unsupported family");
+		}
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static int imr_jit_object(struct imr_state *s, const struct imr_object *o)
+{
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+		return imr_jit_obj_verdict(s, o);
+	case IMR_OBJ_TYPE_PAYLOAD:
+		return imr_jit_obj_payload(s, o);
+	case IMR_OBJ_TYPE_IMMEDIATE:
+		return imr_jit_obj_immediate(s, o);
+	case IMR_OBJ_TYPE_ALU:
+		return imr_jit_obj_alu(s, o);
+	case IMR_OBJ_TYPE_META:
+		return imr_jit_obj_meta(s, o);
+	}
+
+	return -EINVAL;
+}
+
+static int imr_jit_rule(struct imr_state *state, int i)
+{
+	unsigned int start, end, count, len_cur;
+
+	end = state->num_objects;
+	if (i >= end)
+		return -EINVAL;
+
+	len_cur = state->len_cur;
+
+	start = i;
+	count = 0;
+
+	for (i = start; i < end; i++) {
+		int ret = imr_jit_object(state, state->objects[i]);
+
+		if (ret < 0) {
+			fprintf(stderr, "failed to JIT object type %d\n",  state->objects[i]->type);
+			return ret;
+		}
+
+		count++;
+
+		if (state->objects[i]->type == IMR_OBJ_TYPE_VERDICT)
+			break;
+	}
+
+	if (i == end)
+		internal_error("no verdict found in rule");
+
+	imr_fixup_jumps(state, len_cur);
+
+	return count;
+}
+
+/* R0: return value.
+ * R1: __sk_buff (BPF_PROG_RUN() argument).
+ * R2-R5 are unused, (caller saved registers).
+ *   imr_state_init sets R2 to be start of skb->data.
+ * R2-R5 are invalidated after BPF function calls.
+ *
+ * R6-R9 are callee saved registers.
+ */
+int imr_state_init(struct imr_state *state, int family)
+{
+	if (!state->img) {
+		state->img = calloc(BPF_MAXINSNS, sizeof(struct bpf_insn));
+		if (!state->img)
+			return -ENOMEM;
+	}
+
+	state->len_cur = 0;
+	state->nfproto = family;
+
+	switch (family) {
+	case NFPROTO_INET:
+	case NFPROTO_IPV4:
+		state->headlen = sizeof(struct iphdr);
+		state->base = IMR_PAYLOAD_BASE_NH;
+		break;
+	case NFPROTO_IPV6:
+		state->headlen = sizeof(struct ip6_hdr);
+		state->base = IMR_PAYLOAD_BASE_NH;
+		break;
+	default:
+		state->base = IMR_PAYLOAD_BASE_NH;
+		break;
+	}
+
+	if (state->headlen) {
+		int ret = imr_reload_skb_data(state);
+		if (ret < 0)
+			return ret;
+	}
+
+	return 0;
+}
+
+struct bpf_insn	*imr_translate(struct imr_state *s, unsigned int *insn_count)
+{
+	struct bpf_insn *img;
+	int ret = 0, i = 0;
+
+	if (!s->img) {
+		ret = imr_state_init(s, s->nfproto);
+		if (ret < 0)
+			return NULL;
+	}
+
+	/* Only use R6..R9 for now to simplify helper calls (R1..R5 will be clobbered) */
+	s->regcount = 6;
+
+	do {
+		int insns = imr_jit_rule(s, i);
+		if (insns < 0) {
+			ret = insns;
+			break;
+		}
+		if (insns == 0)
+			internal_error("rule jit yields 0 insns");
+
+		i += insns;
+	} while (i < s->num_objects);
+
+	if (ret != 0)
+		return NULL;
+
+	ret = imr_jit_verdict(s, -2); /* XXX: policy support. -2: NFT_BREAK */
+	if (ret < 0)
+		return NULL;
+
+	*insn_count = s->len_cur;
+	img = s->img;
+
+	s->img = NULL;
+	s->len_cur = 0;
+
+	return img;
+}
diff --git a/net/netfilter/nf_tables_jit/imr.h b/net/netfilter/nf_tables_jit/imr.h
new file mode 100644
index 000000000000..7ebbf78526f9
--- /dev/null
+++ b/net/netfilter/nf_tables_jit/imr.h
@@ -0,0 +1,96 @@
+#ifndef IMR_HDR
+#define IMR_HDR
+#include <stdint.h>
+#include <stdio.h>
+
+/* map 1:1 to BPF regs. */
+enum imr_reg_num {
+	IMR_REG_0,
+	IMR_REG_1,
+	IMR_REG_2,
+	IMR_REG_3,
+	IMR_REG_4,
+	IMR_REG_5,
+	IMR_REG_6,
+	IMR_REG_7,
+	IMR_REG_8,
+	IMR_REG_9,
+	/* R10 is frame pointer */
+	IMR_REG_COUNT,
+};
+
+struct imr_state;
+struct imr_object;
+
+enum imr_obj_type {
+	IMR_OBJ_TYPE_VERDICT,
+	IMR_OBJ_TYPE_IMMEDIATE,
+	IMR_OBJ_TYPE_PAYLOAD,
+	IMR_OBJ_TYPE_ALU,
+	IMR_OBJ_TYPE_META,
+};
+
+enum imr_alu_op {
+	IMR_ALU_OP_EQ,
+	IMR_ALU_OP_NE,
+	IMR_ALU_OP_LT,
+	IMR_ALU_OP_LTE,
+	IMR_ALU_OP_GT,
+	IMR_ALU_OP_GTE,
+	IMR_ALU_OP_AND,
+	IMR_ALU_OP_LSHIFT,
+};
+
+enum imr_verdict {
+	IMR_VERDICT_NONE,	/* partially translated rule, no verdict */
+	IMR_VERDICT_NEXT,	/* move to next rule */
+	IMR_VERDICT_PASS,	/* end processing, accept packet */
+	IMR_VERDICT_DROP,	/* end processing, drop packet */
+};
+
+enum imr_payload_base {
+	IMR_PAYLOAD_BASE_INVALID,
+	IMR_PAYLOAD_BASE_LL,
+	IMR_PAYLOAD_BASE_NH,
+	IMR_PAYLOAD_BASE_TH,
+};
+
+enum imr_meta_key {
+	IMR_META_L4PROTO,
+	IMR_META_NFPROTO,
+	IMR_META_NFMARK,
+};
+
+struct imr_state *imr_state_alloc(void);
+void imr_state_free(struct imr_state *s);
+void imr_state_print(FILE *fp, struct imr_state *s);
+
+static inline int imr_state_rule_begin(struct imr_state *s)
+{
+	/* nothing for now */
+	return 0;
+}
+
+int imr_state_rule_end(struct imr_state *s);
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num r, struct imr_object *o);
+struct imr_object *imr_register_load(const struct imr_state *s, enum imr_reg_num r);
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t);
+void imr_object_free(struct imr_object *o);
+
+struct imr_object *imr_object_alloc_imm32(uint32_t value);
+struct imr_object *imr_object_alloc_imm64(uint64_t value);
+struct imr_object *imr_object_alloc_imm(const uint32_t *data, unsigned int len);
+struct imr_object *imr_object_alloc_verdict(enum imr_verdict v);
+
+struct imr_object *imr_object_alloc_payload(enum imr_payload_base b, uint16_t off, uint16_t len);
+struct imr_object *imr_object_alloc_alu(enum imr_alu_op op, struct imr_object *l, struct imr_object *r);
+struct imr_object *imr_object_alloc_meta(enum imr_meta_key k);
+
+int imr_state_add_obj(struct imr_state *s, struct imr_object *o);
+
+int imr_state_init(struct imr_state *state, int family);
+struct bpf_insn	*imr_translate(struct imr_state *s, unsigned int *insn_count);
+
+#endif /* IMR_HDR */
diff --git a/net/netfilter/nf_tables_jit/main.c b/net/netfilter/nf_tables_jit/main.c
index 6f6a4423c2e4..42b9d6d5d1fb 100644
--- a/net/netfilter/nf_tables_jit/main.c
+++ b/net/netfilter/nf_tables_jit/main.c
@@ -1,20 +1,578 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <unistd.h>
+#include <sys/syscall.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stddef.h>
+#include <time.h>
+#include <string.h>
+#include <netinet/in.h>
+#include <errno.h>
 
-int main(void)
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+#include <linux/netfilter/nfnetlink.h>
+
+#include <libmnl/libmnl.h>
+#include <libnftnl/common.h>
+#include <libnftnl/ruleset.h>
+#include <libnftnl/table.h>
+#include <libnftnl/chain.h>
+#include <libnftnl/set.h>
+#include <libnftnl/expr.h>
+#include <libnftnl/rule.h>
+
+#include <linux/if_ether.h>
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+
+#include "imr.h"
+
+struct nft_jit_data_from_user {
+	int ebpf_fd;		/* fd to get program from, or < 0 if jitter error */
+	uint32_t expr_count;	/* number of translated expressions */
+};
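+/* One such response is written back to the kernel for every rule, e.g.
+ * { .ebpf_fd = -1, .expr_count = 0 } when nothing could be translated.
+ */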
+
+static FILE *log_file;
+#define NFTNL_EXPR_EBPF_FD      NFTNL_EXPR_BASE
+
+static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
 {
-	static struct {
-		int fd, count;
-	} response;
+#ifndef __NR_bpf
+#define __NR_bpf 321 /* x86_64 */
+#endif
+	return syscall(__NR_bpf, cmd, attr, size);
+}
 
-	response.fd = -1;
-	for (;;) {
-		char buf[8192];
+struct nft_ebpf_prog {
+	enum bpf_prog_type type;
+	const struct bpf_insn *insn;
+	unsigned int len;
+};
+
+struct cb_args {
+	unsigned int buflen;
+	uint32_t exprs_seen;
+	uint32_t stmt_exprs;
+	struct imr_state *s;
+	int fd;
+};
+
+static void memory_allocation_error(void)
+{
+	perror("allocation failed");
+	exit(1);
+}
+
+static int bpf_prog_load(const struct nft_ebpf_prog *prog)
+{
+	union bpf_attr attr = {};
+	char *log;
+	int ret;
+
+	attr.prog_type  = prog->type;
+	attr.insns      = (uint64_t)prog->insn;
+	attr.insn_cnt   = prog->len;
+	attr.license    = (uint64_t)("GPL");
+
+	log = malloc(8192);
+	if (!log)
+		memory_allocation_error();
+
+	attr.log_buf = (uint64_t)log;
+	attr.log_size = 8192;
+	attr.log_level = 1;
+
+	ret = bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
+	if (ret < 0)
+		fprintf(log_file, "bpf errlog: %s\n", log);
+	free(log);
+	return ret;
+}
+
+static int nft_reg_to_imr_reg(int nfreg)
+{
+	switch (nfreg) {
+	case NFT_REG_VERDICT:
+		return IMR_REG_0;
+	/* old register numbers, 4 128 bit registers. */
+	case NFT_REG_1:
+		return IMR_REG_4;
+	case NFT_REG_2:
+		return IMR_REG_6;
+	case NFT_REG_3:
+		return IMR_REG_8;
+	case NFT_REG_4:
+		break;
+#ifdef NFT_REG32_SIZE
+	/* new register numbers, 16 32 bit registers, map to old ones */
+	case NFT_REG32_00:
+		return IMR_REG_4;
+	case NFT_REG32_01:
+		return IMR_REG_5;
+	case NFT_REG32_02:
+		return IMR_REG_6;
+#endif
+	default:
+		return -1;
+	}
+	return -1;
+}
+
+static int netlink_parse_immediate(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	struct imr_object *o = NULL;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_DATA)) {
+		uint32_t len;
+		int reg;
+
+		nftnl_expr_get(nle, NFTNL_EXPR_IMM_DATA, &len);
+
+		switch (len) {
+		case sizeof(uint32_t):
+			o = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_DATA));
+			break;
+		case sizeof(uint64_t):
+			o = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, NFTNL_EXPR_IMM_DATA));
+			break;
+		default:
+			return -ENOTSUP;
+		}
+		if (!o)
+			return -ENOMEM;
+
+		reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle,
+					     NFTNL_EXPR_IMM_DREG));
+		if (reg < 0) {
+			imr_object_free(o);
+			return reg;
+		}
+
+		imr_register_store(state, reg, o);
+		return 0;
+	} else if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_VERDICT)) {
+		uint32_t verdict;
+		int ret;
+
+		if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_CHAIN))
+			return -ENOTSUP;
+
+		verdict = nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_VERDICT);
+
+		switch (verdict) {
+		case NF_ACCEPT:
+			o = imr_object_alloc_verdict(IMR_VERDICT_PASS);
+			break;
+		case NF_DROP:
+			o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+			break;
+		default:
+			fprintf(log_file, "Unhandled verdict %d\n", verdict);
+			o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+			break;
+		}
+
+		if (!o)
+			return -ENOMEM;
+
+		ret = imr_state_add_obj(state, o);
+		if (ret < 0)
+			imr_object_free(o);
+
+		return ret;
+	}
+
+	return -ENOTSUP;
+}
+
+static int netlink_parse_cmp(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_object *o, *imm, *left;
+	const uint32_t *raw;
+	uint32_t tmp, len;
+	struct imr_state *state = out;
+	enum imr_alu_op op;
+	int ret;
+
+	switch (nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_OP)) {
+	case NFT_CMP_EQ:
+		op = IMR_ALU_OP_EQ;
+		break;
+	case NFT_CMP_NEQ:
+		op = IMR_ALU_OP_NE;
+		break;
+	case NFT_CMP_LT:
+		op = IMR_ALU_OP_LT;
+		break;
+	case NFT_CMP_LTE:
+		op = IMR_ALU_OP_LTE;
+		break;
+	case NFT_CMP_GT:
+		op = IMR_ALU_OP_GT;
+		break;
+	case NFT_CMP_GTE:
+		op = IMR_ALU_OP_GTE;
+		break;
+	default:
+		return -ENOTSUP;
+	}
+
+	raw = nftnl_expr_get(nle, NFTNL_EXPR_CMP_DATA, &len);
+	switch (len) {
+	case sizeof(uint64_t):
+		imm = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, NFTNL_EXPR_CMP_DATA));
+		break;
+	case sizeof(uint32_t):
+		imm = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_DATA));
+		break;
+	case sizeof(uint16_t):
+		tmp = nftnl_expr_get_u16(nle, NFTNL_EXPR_CMP_DATA);
+
+		imm = imr_object_alloc_imm32(tmp);
+		break;
+	case sizeof(uint8_t):
+		tmp = nftnl_expr_get_u8(nle, NFTNL_EXPR_CMP_DATA);
+
+		imm = imr_object_alloc_imm32(tmp);
+		break;
+	default:
+		imm = imr_object_alloc_imm(raw, len);
+		break;
+	}
+
+	if (!imm)
+		return -ENOMEM;
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_SREG));
+	if (ret < 0) {
+		imr_object_free(imm);
+		return ret;
+	}
+
+	left = imr_register_load(state, ret);
+	if (!left) {
+		fprintf(log_file, "%s:%d\n", __FILE__, __LINE__);
+		imr_object_free(imm);
+		return -EINVAL;
+	}
+
+	o = imr_object_alloc_alu(op, left, imm);
+	if (!o) {
+		imr_object_free(imm);
+		return -ENOMEM;
+	}
+
+	return imr_state_add_obj(state, o);
+}
+
+static int netlink_parse_meta(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	struct imr_object *meta;
+	enum imr_meta_key key;
+	int ret;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_META_SREG))
+		return -EOPNOTSUPP;
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_META_DREG));
+	if (ret < 0)
+		return ret;
+
+	switch (nftnl_expr_get_u32(nle, NFTNL_EXPR_META_KEY)) {
+	case NFT_META_NFPROTO:
+		key = IMR_META_NFPROTO;
+		break;
+	case NFT_META_L4PROTO:
+		key = IMR_META_L4PROTO;
+		break;
+	case NFT_META_MARK:
+		key = IMR_META_NFMARK;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	meta = imr_object_alloc_meta(key);
+	if (!meta)
+		return -ENOMEM;
+
+	imr_register_store(state, ret, meta);
+	return 0;
+}
+
+static int netlink_parse_payload(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	enum imr_payload_base imr_base;
+	uint32_t base, offset, len;
+	struct imr_object *payload;
+	int ret;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_PAYLOAD_SREG) ||
+	    nftnl_expr_is_set(nle, NFTNL_EXPR_PAYLOAD_FLAGS))
+		return -EOPNOTSUPP;
+
+	base = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_BASE);
+	offset = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_OFFSET);
+	len = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_LEN);
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_DREG));
+	if (ret < 0)
+		return ret;
+
+	switch (base) {
+	case NFT_PAYLOAD_LL_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_LL;
+		break;
+	case NFT_PAYLOAD_NETWORK_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_NH;
+		break;
+	case NFT_PAYLOAD_TRANSPORT_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_TH;
+		break;
+	default:
+		fprintf(log_file, "%s:%d\n", __FILE__, __LINE__);
+		return -EINVAL;
+	}
+
+	payload = imr_object_alloc_payload(imr_base, offset, len);
+	if (!payload)
+		return -ENOMEM;
+
+	imr_register_store(state, ret, payload);
+	return 0;
+}
+
+static int netlink_parse_bitwise(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_object *imm, *alu, *left;
+	struct imr_state *state = out;
+	uint32_t len_mask, len_xor;
+	int reg;
+
+	reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_BITWISE_SREG));
+	if (reg < 0)
+		return reg;
+
+	left = imr_register_load(state, reg);
+	if (!left) {
+		fprintf(log_file, "%s:%d\n", __FILE__, __LINE__);
+		return -EINVAL;
+	}
+
+	nftnl_expr_get(nle, NFTNL_EXPR_BITWISE_XOR, &len_xor);
+	switch (len_xor) {
+	case sizeof(uint32_t):
+		if (nftnl_expr_get_u32(nle, NFTNL_EXPR_BITWISE_XOR) != 0)
+			return -EOPNOTSUPP;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	nftnl_expr_get(nle, NFTNL_EXPR_BITWISE_MASK, &len_mask);
+	switch (len_mask) {
+	case sizeof(uint32_t):
+		imm = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_BITWISE_MASK));
+		if (!imm)
+			return -ENOMEM;
+		break;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	alu = imr_object_alloc_alu(IMR_ALU_OP_AND, left, imm);
+	if (!alu) {
+		imr_object_free(imm);
+		return -ENOMEM;
+	}
+
+	imr_register_store(state, reg, alu);
+	return 0;
+}
+
+static const struct {
+	const char *name;
+	int (*parse)(const struct nftnl_expr *nle, void *out);
+} netlink_parsers[] = {
+	{ .name = "immediate",	.parse = netlink_parse_immediate },
+	{ .name = "cmp",	.parse = netlink_parse_cmp },
+	{ .name = "payload",	.parse = netlink_parse_payload },
+	{ .name = "bitwise",	.parse = netlink_parse_bitwise },
+	{ .name = "meta",	.parse = netlink_parse_meta },
+};
+
+static int expr_parse_cb(struct nftnl_expr *expr, void *data)
+{
+	const char *name = nftnl_expr_get_str(expr, NFTNL_EXPR_NAME);
+	struct cb_args *args = data;
+	struct imr_state *state = args->s;
+	unsigned int i;
+
+	if (!name)
+		return -1;
+
+	for (i = 0; i < MNL_ARRAY_SIZE(netlink_parsers); i++) {
+		int ret;
+
+		if (strcmp(netlink_parsers[i].name, name))
+			continue;
+
+		ret = netlink_parsers[i].parse(expr, state);
+		if (ret == 0) {
+			args->exprs_seen++;
+
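+			/* cmp and immediate terminate a statement:
+			 * fold the expressions seen so far into the
+			 * count of fully translated ones.
+			 */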
+			if (strcmp(netlink_parsers[i].name, "cmp") == 0 ||
+			    strcmp(netlink_parsers[i].name, "immediate") == 0) {
+
+				args->stmt_exprs += args->exprs_seen;
+				args->exprs_seen = 0;
+			}
+		}
+
+		fprintf(log_file, "parse: %s got %d\n", name, ret);
+		return ret;
+	}
+
+	fprintf(log_file, "cannot handle expression %s\n", name);
+	return -EOPNOTSUPP;
+}
+
+static int nlmsg_parse_newrule(const struct nlmsghdr *nlh, struct cb_args *args)
+{
+	struct nft_ebpf_prog prog;
+	struct imr_state *state;
+	struct nftnl_rule *rule;
+	int ret = -ENOMEM;
+
+	rule = nftnl_rule_alloc();
+	if (!rule)
+		memory_allocation_error();
+
+	if (nftnl_rule_nlmsg_parse(nlh, rule) < 0)
+		goto err_free;
+
+	state = imr_state_alloc();
+	if (!state)
+		goto err_free;
+
+	ret = imr_state_init(state,
+			     nftnl_rule_get_u32(rule, NFTNL_RULE_FAMILY));
+	if (ret < 0) {
+		imr_state_free(state);
+		goto err_free;
+	}
 
-		if (read(0, buf, sizeof(buf)) < 0)
-			return 1;
-		if (write(1, &response, sizeof(response)) != sizeof(response))
-			return 2;
+	ret = imr_state_rule_begin(state);
+	if (ret < 0) {
+		imr_state_free(state);
+		goto err_free;
+	}
+
+	args->s = state;
+	ret = nftnl_expr_foreach(rule, expr_parse_cb, args);
+	if (ret == 0) {
+		fprintf(log_file, "completed tranlsation, %d stmt_exprs and %d partial\n",
+				  args->stmt_exprs, args->exprs_seen);
+	} else {
+		fprintf(log_file, "failed translation, %d stmt_exprs and %d partial\n",
+				  args->stmt_exprs, args->exprs_seen);
+		if (args->stmt_exprs) {
+			ret = imr_state_add_obj(state, imr_object_alloc_verdict(IMR_VERDICT_NONE));
+			if (ret < 0) {
+				imr_state_free(state);
+				goto err_free;
+			}
+		}
+	}
+
+	ret = imr_state_rule_end(state);
+	if (ret < 0) {
+		imr_state_free(state);
+		goto err_free;
+	}
+
+	imr_state_print(log_file, state);
+
+	if (args->stmt_exprs) {
+		prog.type = BPF_PROG_TYPE_SCHED_CLS;
+		prog.insn = imr_translate(state, &prog.len);
+
+		imr_state_free(state);
+		if (!prog.insn) {
+			ret = -ENOMEM;
+			goto err_free;
+		}
+
+		args->fd = bpf_prog_load(&prog);
+		free((void *)prog.insn);
+		if (args->fd < 0) {
+			ret = -EINVAL;
+			goto err_free;
+		}
+		ret = 0;
+	} else {
+		imr_state_free(state);
+	}
+
+err_free:
+	nftnl_rule_free(rule);
+	return ret;
+}
+
+static int nlmsg_parse(const struct nlmsghdr *nlh, void *data)
+{
+	struct cb_args *args = data;
+
+	fprintf(log_file, "%s:%d, buflen %d, nlh %d, nl len %d\n", __FILE__, __LINE__,
+		(int)args->buflen, (int)sizeof(*nlh), (int)nlh->nlmsg_len);
+	if (args->buflen < sizeof(*nlh) || args->buflen < nlh->nlmsg_len) {
+		fprintf(log_file, "%s:%d- ERROR: buflen %d, nlh %d, nl len %d\n", __FILE__, __LINE__,
+		(int)args->buflen, (int)sizeof(*nlh), (int)nlh->nlmsg_len);
+		return -EINVAL;
+	}
+
+	switch (NFNL_MSG_TYPE(nlh->nlmsg_type)) {
+	case NFT_MSG_NEWRULE:
+		return nlmsg_parse_newrule(nlh, args);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int doit(int *prog_fd)
+{
+	struct cb_args args;
+	struct nft_jit_data_from_user to_kernel = { .ebpf_fd = -1 };
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	ssize_t len;
+	int ret;
+
+	fprintf(log_file, "block in read, pid %d\n", (int) getpid());
+	len = read(0, buf, sizeof(buf));
+	if (len <= 0)
+		return 1;
+
+	/* kernel fetched the previous program while we were blocked */
+	if (*prog_fd >= 0)
+		close(*prog_fd);
+	*prog_fd = -1;
+
+	memset(&args, 0, sizeof(args));
+	args.buflen = len;
+	args.fd = -1;
+
+	ret = mnl_cb_run(buf, len, 0, 0, nlmsg_parse, &args);
+	to_kernel.ebpf_fd = args.fd;
+	to_kernel.expr_count = args.stmt_exprs;
+	if (ret < 0)
+		fprintf(log_file, "%s: mnl_cb_run: %d\n", __func__, ret);
+
+	if (write(1, &to_kernel, sizeof(to_kernel)) != (ssize_t)sizeof(to_kernel))
+		return 2;
+
+	/* keep the program fd open until the kernel has fetched it */
+	*prog_fd = args.fd;
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	log_file = fopen("/tmp/debug.log", "a");
+	if (!log_file)
+		return 1;
+
+	fd = -1;
+	for (;;) {
+		int ret = doit(&fd);
+
+		if (ret != 0)
+			return ret;
+	}
 
 	return 0;
diff --git a/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
index 4778f53b2683..bd319d41e2d1 100644
--- a/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
+++ b/net/netfilter/nf_tables_jit/nf_tables_jit_kern.c
@@ -1,6 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include <linux/umh.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/fs.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/skbuff.h>
+#include <linux/bpf.h>
+
 #include <linux/netfilter/nfnetlink.h>
 #include <linux/netfilter/nf_tables.h>
 #include <net/netfilter/nf_tables_core.h>
@@ -18,8 +26,54 @@ static int nft_jit_load_umh(void)
 	return fork_usermode_blob(&UMH_start, &UMH_end - &UMH_start, &info);
 }
 
-int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
+static void nft_jit_fd_to_prog(struct nft_ebpf *e, int fd, u32 expr_count)
+{
+	struct task_struct *task = pid_task(find_vpid(info.pid), PIDTYPE_PID);
+	struct files_struct *files;
+	struct bpf_prog *p;
+	struct file *file;
+
+	if (WARN_ON_ONCE(!task) || expr_count > 128) {
+		nft_jit_stop_umh();
+		return;
+	}
+
+	if (expr_count == 0) /* could not translate */
+		return;
+
+	task_lock(task);
+	files = task->files;
+	if (!files)
+		goto out_unlock;
+
+	file = fcheck_files(files, fd);
+	if (file && !get_file_rcu(file))
+		file = NULL;
+
+	if (!file)
+		goto out_unlock;
+
+	p = bpf_prog_get_type_dev_file(file, BPF_PROG_TYPE_SCHED_CLS, false);
+
+	task_unlock(task);
+
+	if (!IS_ERR(p)) {
+		e->prog = p;
+		e->expressions = expr_count;
+	}
+
+	fput(file);
+	return;
+out_unlock:
+	task_unlock(task);
+	nft_jit_stop_umh();
+}
+
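+/* Pipe protocol with the umh: the rule's nf_tables netlink message is
+ * written in full, then a single struct nft_jit_data_from_user response
+ * is read back.
+ */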
+static int nft_jit_write_rule_info(const struct sk_buff *nlskb)
 {
+	const char *addr = nlskb->data;
+	ssize_t w, n, nr = nlskb->len;
+
 	if (!info.pipe_to_umh) {
 		int ret = nft_jit_load_umh();
 		if (ret)
@@ -29,5 +83,93 @@ int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
 			return -EINVAL;
 	}
 
-	return 0;
+	w = 0;
+	do {
+		loff_t pos = 0;
+
+		n = __kernel_write(info.pipe_to_umh, addr, nr, &pos);
+		if (n < 0)
+			return n;
+		w += n;
+		nr -= n;
+		if (nr == 0)
+			break;
+		addr += n;
+	} while (!signal_pending(current));
+
+	if (w == nlskb->len)
+		return 0;
+
+	return -EINTR;
+}
+
+static int nft_jit_read_result(struct nft_jit_data_from_user *res)
+{
+	char *addr = (char *)res;
+	ssize_t r, n, nr = sizeof(*res);
+
+	r = 0;
+
+	do {
+		loff_t pos = 0;
+
+		n = kernel_read(info.pipe_from_umh, addr + r, nr, &pos);
+		if (n < 0)
+			return n;
+		if (n == 0)
+			return -EPIPE;
+		r += n;
+		nr -= n;
+		if (nr == 0)
+			break;
+	} while (!signal_pending(current));
+
+	if (r == (ssize_t)sizeof(*res))
+		return 0;
+
+	return -EINTR;
+}
+
+int nf_tables_jit_work(const struct sk_buff *nlskb, struct nft_ebpf *e)
+{
+	struct nft_jit_data_from_user from_usr;
+	int ret;
+
+	ret = nft_jit_write_rule_info(nlskb);
+	if (ret < 0) {
+		nft_jit_stop_umh();
+		pr_info("write rule info: ret %d\n", ret);
+		return ret;
+	}
+
+	ret = nft_jit_read_result(&from_usr);
+	if (ret < 0) {
+		pr_info("read rule info: ret %d\n", ret);
+		nft_jit_stop_umh();
+		return ret;
+	}
+
+	if (from_usr.ebpf_fd >= 0) {
+		rcu_read_lock();
+		nft_jit_fd_to_prog(e, from_usr.ebpf_fd, from_usr.expr_count);
+		rcu_read_unlock();
+		return 0;
+	}
+
+	return ret;
+}
+
+void nft_jit_stop_umh(void)
+{
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = pid_task(find_vpid(info.pid), PIDTYPE_PID);
+	if (tsk)
+		force_sig(SIGKILL, tsk);
+	rcu_read_unlock();
+	if (info.pipe_to_umh)
+		fput(info.pipe_to_umh);
+	if (info.pipe_from_umh)
+		fput(info.pipe_from_umh);
+	memset(&info, 0, sizeof(info));
+
+	info.pid = -1;
 }
-- 
2.16.4

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
  2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
                   ` (4 preceding siblings ...)
  2018-06-01 15:32 ` [RFC nf-next 5/5] netfilter: nf_tables_jit: add userspace nft to ebpf translator Florian Westphal
@ 2018-06-11 22:12 ` Alexei Starovoitov
  2018-06-12  9:28   ` Florian Westphal
  5 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2018-06-11 22:12 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, ast, daniel, netdev, David S. Miller, ecree

On Fri, Jun 01, 2018 at 05:32:11PM +0200, Florian Westphal wrote:
> This patch series adds a JIT layer to translate nft expressions
> to ebpf programs.
> 
> From commit phase, spawn a userspace program (using recently added UMH
> infrastructure).
> 
> We then provide rules that came in this transaction to the helper via pipe,
> using same nf_tables netlink that nftables already uses.
> 
> The userspace helper translates the rules, and, if successful, installs the
> generated program(s) via bpf syscall.
> 
> For each rule a small response containing the corresponding epbf file
> descriptor (can be -1 on failure) and a attribute count (how many
> expressions were jitted) gets sent back to kernel via pipe.
> 
> If translation fails, the rule is will be processed by nf_tables
> interpreter (as before this patch).
> 
> If translation succeeded, nf_tables fetches the bpf program using the file
> descriptor identifier, allocates a new rule blob containing the new 'ebpf'
> expression (and possible trailing un-translated expressions).
> 
> It then replaces the original rule in the transaction log with the new
> 'ebpf-rule'.  The original rule is retained in a private area inside the epbf
> expression to be able to present the original expressions back to userspace
> on 'nft list ruleset'.
> 
> For easier review, this contains the kernel-side only.
> nf_tables_jit_work() will not do anything, yet.
> 
> Unresolved issues:
>  - maps and sets.
>    It might be possible to add a new ebpf map type that just wraps
>    the nft set infrastructure for lookups.
>    This would allow nft userspace to continue to work as-is while
>    not requiring new ebpf helper.
>    Anonymous set should be a lot easier as they're immutable
>    and could probably be handled already by existing infra.
> 
>  - BPF_PROG_RUN() is bolted into nft main loop via a middleman expression.
>    I'm also abusing skb->cb[] to pass network and transport header offsets.
>    Its not 'public' api so this can be changed later.
> 
>  - always uses BPF_PROG_TYPE_SCHED_CLS.
>    This is because it "works" for current RFC purposes.
> 
>  - we should eventually support translating multiple (adjacent) rules
>    into single program.
> 
>    If we do this kernel will need to track mapping of rules to
>    program (to re-jit when a rule is changed.  This isn't implemented
>    so far, but can be added later.  Alternatively, one could also add a
>    'readonly' table switch to just prevent further updates.
> 
>    We will also need to dump the 'next' generation of the
>    to-be-translated table.  The kernel has this information, so its only
>    a matter of serializing it back to userspace from the commit phase.
> 
> The jitter is still limited.  So far it supports:
> 
>  * payload expression for network and transport header
>  * meta mark, nfproto, l4proto
>  * 32 bit immediates
>  * 32 bit bitmask ops
>  * accept/drop verdicts
> 
> As this uses netlink, there is also no technical requirement for
> libnftnl, its simply used here for convienience.
> 
> It doesn't need any userspace changes. Patches for libnftnl and nftables
> make debug info available (e.g. to map rule to its bpf prog id).
> 
> Comments welcome.

The implementation of patch 5 looks good to me, but I'm concerned with
patch 2 that adds 'ebpf expression' to nft. I see no reason to do so.
It seems existing support for infinite number of nft expressions is
used as a way to execute infinite number of bpf programs sequentially.
I don't think it was a scalable approach before and won't scale in the future.
I think the algorithm should consider all nft rules at once and generate
a program or two that will execute fast even when number of rules is large.
We have the same scalability issue with bpfilter RFC patches. That's why
they're still in RFC stage, since we need to figure out a way to support
thousands of iptable rules in scalable way.
There are papers on scalable packet classification algorithms that
use decision trees (hicuts, hypercuts, efficuts, etc)
Imo that is the direction we should be looking at.
If we implement one of the algorithms as a tree(trie) with a generic lookup
it will be usable from bpf(bpfilter), from XDP, and other places
inside the kernel.
We can even have multiple algorithms implemented and pick and choose
depending on the size of ruleset and its properties, since one size
doesn't always fit all.
I'm imagining umh will be doing iptables->trie+bpf conversion and
nft->trie+bpf conversion where bpf progs will be dealing with pieces
of logic that don't fit into trie lookup and provide generic mechanism
for parsing the packet in the specific way suited for trie lookup
for the given ruleset. The trie will be sized differently depending
on tuples needed in the lookup. Like if there is no ipv6 in the ruleset
the bpf prog won't be parsing that to prepare a tuple for given trie.
Just like bpf hash map can be of different key/value size, this new
trie will be customized for specific ruleset on the fly by umh.
At the end the trie lookup is fully generic and bpf progs before
and after are generic as well.
imo this way majority of iptables/nft rules can be converted and
performance will be great even with large rulesets.
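
To make the pre/post-prog split concrete, a rough sketch of the shape
such a generic lookup could take -- purely illustrative, all names
(cls_key, cls_node, cls_lookup) made up rather than any existing
kernel or bpf API, endianness and field packing handwaved:

#include <stdint.h>
#include <string.h>

/* key assembled by the pre-lookup bpf prog; the layout would be
 * generated per-ruleset by the umh
 */
struct cls_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t dport;
	uint8_t  l4proto;
};

/* interior nodes cut on a single key field (in the spirit of the
 * decision tree papers above); leaves encode the verdict
 */
struct cls_node {
	uint16_t field_off;	/* offset of the key field to test */
	uint16_t field_len;	/* in bytes, <= 4 */
	uint32_t split;		/* go left if value <= split */
	int32_t  left, right;	/* child index, or ~verdict for a leaf */
};

static int cls_lookup(const struct cls_node *tree, int root,
		      const struct cls_key *key)
{
	int n = root;

	while (n >= 0) {
		const struct cls_node *node = &tree[n];
		uint32_t v = 0;

		memcpy(&v, (const uint8_t *)key + node->field_off,
		       node->field_len);
		n = v <= node->split ? node->left : node->right;
	}
	return ~n;	/* decoded leaf verdict */
}

The umh would build the tree and the key layout per ruleset; the bpf
progs before and after the lookup only assemble the key and act on the
returned verdict.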

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
  2018-06-11 22:12 ` [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Alexei Starovoitov
@ 2018-06-12  9:28   ` Florian Westphal
  2018-06-13  0:43     ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Westphal @ 2018-06-12  9:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Florian Westphal, netfilter-devel, ast, daniel, netdev,
	David S. Miller, ecree

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Fri, Jun 01, 2018 at 05:32:11PM +0200, Florian Westphal wrote:
> > The userspace helper translates the rules, and, if successful, installs the
> > generated program(s) via bpf syscall.
> > 
> > For each rule a small response containing the corresponding epbf file
> > descriptor (can be -1 on failure) and a attribute count (how many
> > expressions were jitted) gets sent back to kernel via pipe.
> > 
> > If translation fails, the rule is will be processed by nf_tables
> > interpreter (as before this patch).
> > 
> > If translation succeeded, nf_tables fetches the bpf program using the file
> > descriptor identifier, allocates a new rule blob containing the new 'ebpf'
> > expression (and possible trailing un-translated expressions).
> > 
> > It then replaces the original rule in the transaction log with the new
> > 'ebpf-rule'.  The original rule is retained in a private area inside the epbf
> > expression to be able to present the original expressions back to userspace
> > on 'nft list ruleset'.
> > 
> > For easier review, this contains the kernel-side only.
> > nf_tables_jit_work() will not do anything, yet.
> > 
> > Unresolved issues:
> >  - maps and sets.
> >    It might be possible to add a new ebpf map type that just wraps
> >    the nft set infrastructure for lookups.
> >    This would allow nft userspace to continue to work as-is while
> >    not requiring new ebpf helper.
> >    Anonymous set should be a lot easier as they're immutable
> >    and could probably be handled already by existing infra.
> > 
> >  - BPF_PROG_RUN() is bolted into nft main loop via a middleman expression.
> >    I'm also abusing skb->cb[] to pass network and transport header offsets.
> >    Its not 'public' api so this can be changed later.
> > 
> >  - always uses BPF_PROG_TYPE_SCHED_CLS.
> >    This is because it "works" for current RFC purposes.
> > 
> >  - we should eventually support translating multiple (adjacent) rules
> >    into single program.
> > 
> >    If we do this kernel will need to track mapping of rules to
> >    program (to re-jit when a rule is changed.  This isn't implemented
> >    so far, but can be added later.  Alternatively, one could also add a
> >    'readonly' table switch to just prevent further updates.
> > 
> >    We will also need to dump the 'next' generation of the
> >    to-be-translated table.  The kernel has this information, so its only
> >    a matter of serializing it back to userspace from the commit phase.
> > 
> > The jitter is still limited.  So far it supports:
> > 
> >  * payload expression for network and transport header
> >  * meta mark, nfproto, l4proto
> >  * 32 bit immediates
> >  * 32 bit bitmask ops
> >  * accept/drop verdicts
> > 
> > As this uses netlink, there is also no technical requirement for
> > libnftnl, its simply used here for convienience.
> > 
> > It doesn't need any userspace changes. Patches for libnftnl and nftables
> > make debug info available (e.g. to map rule to its bpf prog id).
> > 
> > Comments welcome.
> 
> The implementation of patch 5 looks good to me, but I'm concerned with
> patch 2 that adds 'ebpf expression' to nft. I see no reason to do so.

I think it's important user(space) can see which rules are jitted, and
which ebpf prog corresponds to which rule(s); using an expression as
container allows us to re-use existing nft config plane code to serialize
this via netlink attributes.

> It seems existing support for infinite number of nft expressions is
> used as a way to execute infinite number of bpf programs sequentially.

In this RFC, yes.

> I don't think it was a scalable approach before and won't scale in the future.
> I think the algorithm should consider all nft rules at once and generate
> a program or two that will execute fast even when number of rules is large.

Yes, but existence of the ebpf expression doesn't prevent doing this in
the future.  Doing it now complicates things and given unresolved issues
(see above cover letter) I'm reluctant to implement this already. The
UMH in this RFC can translate only a very small subset of
expressions.  To make full-table realistic I think issues outlined above
need to be addressed first.

It can be done; in that case the ebpf expression would replace not just
one rule but possibly all of them.

Netlink dump of such a fully-translated table would have the ebpf
expression at the beginning of the first rule, exposing ebpf program id/tag,
and a list of the nft rule IDs that it replaced.  In the extreme (ideal)
case, it would thus list all rule handle IDs of the chain (including
those reachable via jump-to-user-defined-chains).

Rest of dump would be as if ebpf did not exist, but these rules would
all be "dead" from packet-path point of view.  They are linked from via
the nft epbf pseudo-expression, but no different from an arbitrary
cookie/comment.

As explained above, this also needs kernel to track mapping of
n nft rules to m ebpf progs, rather than the simple 1:1 mapping done
in this RFC.

The 1:1 mapping is not set in stone here, it's just the initial
step to get the needed plumbing in, also see "Unresolved issues"
in cover letter above.

So:

Step 1: 1:1 mapping, an nft rule has at most one ebpf prog.
Step 2: figure out how to handle maps, sets, and how to cope with
        not-yet-translatable expressions
Step 3: m:n mapping: kernel provides adjacent rules to the UMH for
        jitting.  Example: user appends rules a, b, c.  UMH creates
	single ebpf prog from a/b/c.
      	nft-pseudo-expression replaces a/b/c in the
	packet path, original rules a/b/c are linked from the pseudo
	expression for tracking.  If user deletes rule b, we provide
	a/c to UMH to create new ebpf prog that replaces new
	sequence a/c.
Step 4: always provide entire future base chain and all reachable chains
        to the umh.  Ideally all of it is replaced by single program.

Eventually, entire eval loop could be replaced by ebpf prog.
But it will need some time to get there -- at this point existing
nft expressions would no longer provide an ->eval() function.

Does that make sense to you?

If you see this as flawed, please let me know, but as I have no idea
how to resolve these issues going from 0 to 4 makes no sense to me.

> There are papers on scalable packet classification algorithms that
> use decision trees (hicuts, hypercuts, efficuts, etc)
> Imo that is the direction we should be looking at.

Okay, but without any idea how to consider existing expressions,
sets, maps etc. I'm not sure it makes sense to work on that at this
point.

We also have the second problem that the netfilter base hook infra
(NF_HOOK) already imposes indirect calls on us.

Is there a plan to have a way to replace those indirect calls with
direct ones?  We can't do that easily because most of the functions are
in modules, but AFAIU ebpf could rewrite that to a sequence of direct
calls.

[..]

> imo this way majority of iptables/nft rules can be converted and
> performance will be great even with large rulesets.

Oh, I do not doubt that multiple rules can be compiled into single program,
sorry if the RFC 1:1 mapping was confusing or gave that impression.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
  2018-06-12  9:28   ` Florian Westphal
@ 2018-06-13  0:43     ` Alexei Starovoitov
  2018-06-13 20:59       ` Florian Westphal
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2018-06-13  0:43 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, ast, daniel, netdev, David S. Miller, ecree

On Tue, Jun 12, 2018 at 11:28:12AM +0200, Florian Westphal wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > On Fri, Jun 01, 2018 at 05:32:11PM +0200, Florian Westphal wrote:
> > > The userspace helper translates the rules, and, if successful, installs the
> > > generated program(s) via bpf syscall.
> > > 
> > > For each rule a small response containing the corresponding epbf file
> > > descriptor (can be -1 on failure) and a attribute count (how many
> > > expressions were jitted) gets sent back to kernel via pipe.
> > > 
> > > If translation fails, the rule is will be processed by nf_tables
> > > interpreter (as before this patch).
> > > 
> > > If translation succeeded, nf_tables fetches the bpf program using the file
> > > descriptor identifier, allocates a new rule blob containing the new 'ebpf'
> > > expression (and possible trailing un-translated expressions).
> > > 
> > > It then replaces the original rule in the transaction log with the new
> > > 'ebpf-rule'.  The original rule is retained in a private area inside the epbf
> > > expression to be able to present the original expressions back to userspace
> > > on 'nft list ruleset'.
> > > 
> > > For easier review, this contains the kernel-side only.
> > > nf_tables_jit_work() will not do anything, yet.
> > > 
> > > Unresolved issues:
> > >  - maps and sets.
> > >    It might be possible to add a new ebpf map type that just wraps
> > >    the nft set infrastructure for lookups.
> > >    This would allow nft userspace to continue to work as-is while
> > >    not requiring new ebpf helper.
> > >    Anonymous set should be a lot easier as they're immutable
> > >    and could probably be handled already by existing infra.
> > > 
> > >  - BPF_PROG_RUN() is bolted into nft main loop via a middleman expression.
> > >    I'm also abusing skb->cb[] to pass network and transport header offsets.
> > >    Its not 'public' api so this can be changed later.
> > > 
> > >  - always uses BPF_PROG_TYPE_SCHED_CLS.
> > >    This is because it "works" for current RFC purposes.
> > > 
> > >  - we should eventually support translating multiple (adjacent) rules
> > >    into single program.
> > > 
> > >    If we do this kernel will need to track mapping of rules to
> > >    program (to re-jit when a rule is changed.  This isn't implemented
> > >    so far, but can be added later.  Alternatively, one could also add a
> > >    'readonly' table switch to just prevent further updates.
> > > 
> > >    We will also need to dump the 'next' generation of the
> > >    to-be-translated table.  The kernel has this information, so its only
> > >    a matter of serializing it back to userspace from the commit phase.
> > > 
> > > The jitter is still limited.  So far it supports:
> > > 
> > >  * payload expression for network and transport header
> > >  * meta mark, nfproto, l4proto
> > >  * 32 bit immediates
> > >  * 32 bit bitmask ops
> > >  * accept/drop verdicts
> > > 
> > > As this uses netlink, there is also no technical requirement for
> > > libnftnl, its simply used here for convienience.
> > > 
> > > It doesn't need any userspace changes. Patches for libnftnl and nftables
> > > make debug info available (e.g. to map rule to its bpf prog id).
> > > 
> > > Comments welcome.
> > 
> > The implementation of patch 5 looks good to me, but I'm concerned with
> > patch 2 that adds 'ebpf expression' to nft. I see no reason to do so.
> 
> I think it's important user(space) can see which rules are jitted, and
> which ebpf prog corresponds to which rule(s); using an expression as
> container allows us to re-use existing nft config plane code to serialize
> this via netlink attributes.

In my mind it would be all or nothing. I don't think it helps
to convert some rules and not all.

> > It seems existing support for infinite number of nft expressions is
> > used as a way to execute infinite number of bpf programs sequentially.
> 
> In this RFC, yes.
> 
> > I don't think it was a scalable approach before and won't scale in the future.
> > I think the algorithm should consider all nft rules at once and generate
> > a program or two that will execute fast even when number of rules is large.
> 
> Yes, but existence of the ebpf expression doesn't prevent doing this in
> the future.  Doing it now complicates things and given unresolved issues
> (see above cover letter) I'm reluctant to implement this already. The
> UMH in this RFC can translate only a very small subset of
> expressions.  To make full-table realistic I think issues outlined above
> need to be addressed first.
> 
> It can be done; in that case the ebpf expression would replace not just
> one rule but possibly all of them.

I think 'all of them' is mandatory. Same for bpfilter.
Existing iptables/nft work as fallback already.
Only when converting all rules do we get the performance benefit.
Partial conversion only makes things harder to debug and confuses users.

> Netlink dump of such a fully-translated table would have the ebpf
> expression at the beginning of the first rule, exposing ebpf program id/tag,
> and a list of the nft rule IDs that it replaced.  In the extreme (ideal)
> case, it would thus list all rule handle IDs of the chain (including
> those reachable via jump-to-user-defined-chains).
> 
> Rest of dump would be as if ebpf did not exist, but these rules would
> all be "dead" from packet-path point of view.  They are linked from via
> the nft epbf pseudo-expression, but no different from an arbitrary
> cookie/comment.
> 
> As explained above, this also needs kernel to track mapping of
> n nft rules to m ebpf progs, rather than the simple 1:1 mapping done
> in this RFC.
> 
> The 1:1 mapping is not set in stone here, it's just the initial
> step to get the needed plumbing in, also see "Unresolved issues"
> in cover letter above.
> 
> So:
> 
> Step 1: 1:1 mapping, an nft rule has at most one ebpf prog.
> Step 2: figure out how to handle maps, sets, and how to cope with
>         not-yet-translatable expressions
> Step 3: m:n mapping: kernel provides adjacent rules to the UMH for
>         jitting.  Example: user appends rules a, b, c.  UMH creates
> 	single ebpf prog from a/b/c.
>       	nft-pseudo-expression replaces a/b/c in the
> 	packet path, original rules a/b/c are linked from the pseudo
> 	expression for tracking.  If user deletes rule b, we provide
> 	a/c to UMH to create new ebpf prog that replaces new
> 	sequence a/c.
> Step 4: always provide entire future base chain and all reachable chains
>         to the umh.  Ideally all of it is replaced by single program.

Right. I think the first implementation of the converter should
be translating all rules at once. Not necessarily all features,
but all rules. Even if 60% of rules can be translated as bpf+trie
there is not much benefit to do that and somehow mix and match
the other 40% of old style iterative rule evaluation.
Algorithms are too different. Iterative will be a drag on trie.

> 
> Eventually, entire eval loop could be replaced by ebpf prog.
> But it will need some time to get there -- at this point existing
> nft expressions would no longer provide an ->eval() function.
> 
> Does that make sense to you?
> 
> If you see this as flawed, please let me know, but as I have no idea
> how to resolve these issues going from 0 to 4 makes no sense to me.

I think the challenge is how to implement 4 without doing step 1, right?
imo doing such 1:1 (single rule to single bpf prog) translation does not
help to break hard problem into smaller pieces. Such 1:1 is great
for prototype, but not to land upstream.
For the same reasons in bpfilter we did single iptable rule to single
bpf prog translation, but such code doesn't belong in upstream tree,
since it's not a scalable approach.
It's too easy to follow that road, but it goes nowhere.
Hence my proposal to invest time into building decision tree based
algorithm coupled with pre- and post- bpf progs that supply 'key'
into decision trie lookup and interpret the result.
This way thousands of basic firewall rules will be translated
in efficient way, but even tiny ruleset with complex features (like
nat) won't be translated and that's ok.
We can build on top of an algorithm that considers all rules at once,
but not on top of a translator that does one rule at a time.

> > There are papers on scalable packet classification algorithms that
> > use decision trees (hicuts, hypercuts, efficuts, etc)
> > Imo that is the direction should we should be looking at.
> 
> Okay, but without any idea how to consider existing expressions,
> sets, maps etc. I'm not sure it makes sense to work on that at this
> point.

I think sets and ipset (in case of iptables) fit well into trie model.

> We also have the second problem that the netfilter base hook infra
> (NF_HOOK) already imposes indirect calls on us.
> 
> Is there a plan to have a way to replace those indirect calls with
> direct ones?  We can't do that easily because most of the functions are
> in modules, but AFAIU ebpf could rewrite that to a sequence of direct
> calls.

Yes. The abundance of indirect calls is a separate, but equally
important problem. We need to address both of them.

> 
> [..]
> 
> > imo this way majority of iptables/nft rules can be converted and
> > performance will be great even with large rulesets.
> 
> Oh, I do not doubt that multiple rules can be compiled into a single
> program, sorry if the RFC's 1:1 mapping was confusing or gave that
> impression.

I think the bpfilter RFC also made folks believe that translating
iptables rules one by one is what we were going to do as well.
I hope this confusion is now resolved.
The kernel doesn't need another sequential-match firewall.


* Re: [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure
  2018-06-13  0:43     ` Alexei Starovoitov
@ 2018-06-13 20:59       ` Florian Westphal
  0 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2018-06-13 20:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Florian Westphal, netfilter-devel, ast, daniel, netdev,
	David S. Miller, ecree

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Tue, Jun 12, 2018 at 11:28:12AM +0200, Florian Westphal wrote:
> > I think it's important that user(space) can see which rules are jitted,
> > and which ebpf prog corresponds to which rule(s); using an expression as
> > a container allows re-using existing nft config plane code to serialize
> > this via netlink attributes.
> 
> In my mind it would be all or nothing. I don't think it helps
> to convert some rules and not all.

Ok.  Still, even in that case I think it would be good if we were able
to tell userspace the ebpf program id that corresponds to the ruleset.
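
A rough sketch of what the userspace side of that could look like,
using the existing bpf syscall wrappers from libbpf (error handling
trimmed; the function name and output format are made up for
illustration):

#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>

static void show_prog(__u32 id)
{
	struct bpf_prog_info info = {};
	__u32 len = sizeof(info);
	int fd;

	/* turn the id exposed via netlink back into an fd */
	fd = bpf_prog_get_fd_by_id(id);
	if (fd < 0)
		return;

	/* fetch name/size so tools can show which prog backs the ruleset */
	if (!bpf_obj_get_info_by_fd(fd, &info, &len))
		printf("prog id %u: name %s, jited size %u\n",
		       id, info.name, info.jited_prog_len);

	close(fd);
}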

> > Step 1: 1:1 mapping, an nft rule has at most one ebpf prog.
> > Step 2: figure out how to handle maps, sets, and how to cope with
> >         not-yet-translatable expressions
> > Step 3: m:n mapping: kernel provides adjacent rules to the UMH for
> >         jitting.  Example: user appends rules a, b, c.  UMH creates a
> >         single ebpf prog from a/b/c.
> >         An nft pseudo-expression replaces a/b/c in the packet path;
> >         the original rules a/b/c are linked from the pseudo
> >         expression for tracking.  If the user deletes rule b, we
> >         provide a/c to the UMH to create a new ebpf prog that
> >         replaces the new sequence a/c.
> > Step 4: always provide the entire future base chain and all reachable
> >         chains to the UMH.  Ideally all of it is replaced by a single
> >         program.

[..]

> > Does that make sense to you?
> > 
> > If you see this as flawed, please let me know, but as I have no idea
> > how to resolve these issues, going from 0 to 4 makes no sense to me.
>
> I think the challenge is how to implement 4 without doing step 1, right?

Yes.

> imo doing such a 1:1 (single rule to single bpf prog) translation does
> not help to break the hard problem into smaller pieces. Such 1:1 is
> great for a prototype, but not for landing upstream.
> For the same reasons we did single-iptables-rule-to-single-bpf-prog
> translation in bpfilter, but such code doesn't belong in the upstream
> tree, since it's not a scalable approach.
[..]

> > Okay, but without any idea how to consider existing expressions,
> > sets, maps etc., I'm not sure it makes sense to work on that at this
> > point.
> 
> I think sets and ipset (in the case of iptables) fit well into the trie model.

Yes, but that's going to be a lot of effort to handle properly
without breaking (or replacing) userland plumbing.

For nft we could aim for full translation of the ingress hook
initially, as that takes stateful filtering out of the picture
(ingress occurs before conntrack).

We could also ignore named sets for now and only deal with anonymous
sets (they are immutable, and the data stored in such sets can be made
available to the UMH).
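
A sketch of how the UMH could do that with current libbpf (the map
type, names and the flat element array are assumptions for
illustration):

#include <bpf/bpf.h>

/* copy the elements of an immutable anonymous set into a bpf map once,
 * at translation time; the returned fd can then be wired into the
 * generated program */
static int populate_anon_set(const __u32 *addrs, __u32 n)
{
	__u32 one = 1, i;
	int map_fd;

	map_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "anon_set",
				sizeof(__u32), sizeof(__u32), n, NULL);
	if (map_fd < 0)
		return map_fd;

	for (i = 0; i < n; i++)
		bpf_map_update_elem(map_fd, &addrs[i], &one, BPF_ANY);

	return map_fd;
}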

I can rework the RFC to emit the "future table" to the UMH instead of
individual rules, but I don't know yet when I will have time to work on
it again.

Thanks,
Florian


end of thread [~2018-06-13 20:59 UTC]

Thread overview: 10+ messages
2018-06-01 15:32 [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Florian Westphal
2018-06-01 15:32 ` [RFC nf-next 1/5] bpf: add bpf_prog_get_type_dev_file Florian Westphal
2018-06-01 15:32 ` [RFC nf-next 2/5] netfilter: nf_tables: add ebpf expression Florian Westphal
2018-06-01 15:32 ` [RFC nf-next 3/5] netfilter: nf_tables: add rule ebpf jit infrastructure Florian Westphal
2018-06-01 15:32 ` [RFC nf-next 4/5] netfilter: nf_tables_jit: add dumping of original rule Florian Westphal
2018-06-01 15:32 ` [RFC nf-next 5/5] netfilter: nf_tables_jit: add userspace nft to ebpf translator Florian Westphal
2018-06-11 22:12 ` [RFC nf-next 0/5] netfilter: add ebpf translation infrastructure Alexei Starovoitov
2018-06-12  9:28   ` Florian Westphal
2018-06-13  0:43     ` Alexei Starovoitov
2018-06-13 20:59       ` Florian Westphal
