* [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
@ 2018-03-04 19:40 Florian Westphal
  2018-03-04 19:40 ` [RFC,POC 1/3] bpfilter: add experimental IMR bpf translator Florian Westphal
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Florian Westphal @ 2018-03-04 19:40 UTC
  To: netdev; +Cc: daniel, ast, pablo

These patches, which go on top of the 'bpfilter' RFC patches,
demonstrate an nftables to ebpf translation (done in userspace).
In order to not duplicate the ebpf code generation efforts, the rules

iptables -i lo -d 127.0.0.2 -j DROP
and
nft add rule ip filter input ip daddr 127.0.0.2 drop

are first translated to a common intermediate representation, and then
to ebpf, which attaches resulting prog to the XDP hook.

The IMR representation is identical in both cases, so both
rules result in the same ebpf program.

The IMR currently assumes that translation will always be to ebpf.
As per previous discussion it doesn't consider other targets, so
for instance IMR pseudo-registers map 1:1 to ebpf ones.

The IMR is also supposed to be generic enough to make it easy to convert
'frontend' formats (iptables rule blob, nftables netlink) to it, and
also extend it to cover ip rule, ovs or any other inputs in the future
without the need for major changes to the IMR.

The IMR currently implements the following basic operations:
 - relational (equal, not equal)
 - immediates (32 and 64bit constants)
 - payload with relative addressing (mac, network, transport header)
 - verdict (pass, drop, next rule)
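
As an illustration, the 127.0.0.2 example rule above decomposes into
IMR objects roughly like this (a sketch against the API added in patch 1;
declarations and error handling omitted, addresses in network byte order
as the frontends hand them over):

	/* ip daddr 127.0.0.2 drop */
	imr_state_rule_begin(state);
	left  = imr_object_alloc_payload(IMR_PAYLOAD_BASE_NH,
					 offsetof(struct iphdr, daddr),
					 sizeof(uint32_t));
	right = imr_object_alloc_imm32(htonl(0x7f000002));
	imr_state_add_obj(state,
			  imr_object_alloc_relational(IMR_RELOP_EQ, left, right));
	imr_state_add_obj(state, imr_object_alloc_verdict(IMR_VERDICT_DROP));
	imr_state_rule_end(state);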

It's still at an early stage, but I think it's good enough as
a proof-of-concept.

Known differences between nftjit.ko and bpfilter.ko:
nftjit.ko currently doesn't run transparently, but that's only
because I wanted to focus on the IMR and get the POC out the door.

It should be possible to get it transparent via the bpfilter.ko approach.

Next steps for the IMR could be the addition of binary operations for
prefixes ("-d 192.168.0.1/24"); these are also needed e.g. for tcp flag
matching (-p tcp --syn in iptables) and so on.
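
A prefix match could then decompose to '(daddr & mask) == net', roughly
(sketch; imr_object_alloc_binop() and IMR_BINOP_AND don't exist yet and
are hypothetical):

	/* -d 192.168.0.1/24 */
	left = imr_object_alloc_payload(IMR_PAYLOAD_BASE_NH,
					offsetof(struct iphdr, daddr),
					sizeof(uint32_t));
	left = imr_object_alloc_binop(IMR_BINOP_AND, left,
				      imr_object_alloc_imm32(htonl(0xffffff00)));
	imr_state_add_obj(state,
			  imr_object_alloc_relational(IMR_RELOP_EQ, left,
				      imr_object_alloc_imm32(htonl(0xc0a80000))));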

I'd also be interested in whether XDP is seen as an appropriate
target hook.  AFAICS the XDP and the nftables ingress hook are similar
enough to consider just (re)using the XDP hook to jit the nftables ingress
hook.  The translator could check if the hook is unused, and return
early if some other program is already attached.

Comments welcome, especially wrt. IMR concept and what might be
next step(s) in moving forward.

The patches are also available via git at
https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .


* [RFC,POC 1/3] bpfilter: add experimental IMR bpf translator
  2018-03-04 19:40 [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Florian Westphal
@ 2018-03-04 19:40 ` Florian Westphal
  2018-03-04 19:40 ` [RFC,POC 2/3] bpfilter: add nftables jit proof-of-concept Florian Westphal
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2018-03-04 19:40 UTC
  To: netdev; +Cc: daniel, ast, pablo, Florian Westphal

This is a basic intermediate representation to decouple
the ruleset representation (iptables, nftables) from the
ebpf translation.

The IMR currently assumes that translation will always be
into ebpf; its pseudo-registers map 1:1 to ebpf ones.

Objects implemented at the moment:
- relop (eq, ne only for now)
- immediate (32, 64 bit constants)
- payload, with relative addressing (mac header, network header, transport header)

This doesn't add a user; files will not even be compiled yet.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/bpfilter/imr.c | 661 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/imr.h |  78 +++++++
 2 files changed, 739 insertions(+)
 create mode 100644 net/bpfilter/imr.c
 create mode 100644 net/bpfilter/imr.h

diff --git a/net/bpfilter/imr.c b/net/bpfilter/imr.c
new file mode 100644
index 000000000000..09c557ea7c21
--- /dev/null
+++ b/net/bpfilter/imr.c
@@ -0,0 +1,661 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <limits.h>
+
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
+#include <linux/if_ether.h>
+typedef __u16 __bitwise __sum16; /* hack */
+#include <linux/ip.h>
+#include <arpa/inet.h>
+
+#include "imr.h"
+#include "bpfilter_gen.h"
+
+#define EMIT(ctx, x)					\
+	do {						\
+		if ((ctx)->len_cur + 1 > (ctx)->len_max)\
+			return -ENOMEM;			\
+		(ctx)->img[(ctx)->len_cur++] = x;	\
+	} while (0)
+
+struct imr_object {
+	enum imr_obj_type type:8;
+	uint8_t len;
+
+	union {
+		struct {
+			union {
+				uint64_t value64;
+				uint32_t value32;
+			};
+		} immediate;
+		struct {
+			struct imr_object *left;
+			struct imr_object *right;
+			enum imr_relop op:8;
+		} relational;
+		struct {
+			uint16_t offset;
+			enum imr_payload_base base:8;
+		} payload;
+		struct {
+			enum imr_verdict verdict;
+		} verdict;
+	};
+};
+
+struct imr_state {
+	struct bpf_insn	*img;
+	uint32_t len_cur;
+	uint32_t len_max;
+
+	struct imr_object *registers[IMR_REG_COUNT];
+	uint8_t regcount;
+
+	uint32_t num_objects;
+	struct imr_object **objects;
+};
+
+static int imr_jit_object(struct bpfilter_gen_ctx *ctx,
+			  struct imr_state *, const struct imr_object *o);
+
+static void internal_error(const char *s)
+{
+	fprintf(stderr, "FIXME: internal error %s\n", s);
+	exit(1);
+}
+
+/* FIXME: consider len too (e.g. reserve 2 registers for len == 8) */
+static int imr_register_alloc(struct imr_state *s, uint32_t len)
+{
+	uint8_t reg = s->regcount;
+
+	if (s->regcount >= IMR_REG_COUNT)
+		return -1;
+
+	s->regcount++;
+
+	return reg;
+}
+
+static int imr_register_get(const struct imr_state *s, uint32_t len)
+{
+	if (len > sizeof(uint64_t))
+		internal_error(">64bit types not yet implemented");
+	if (s->regcount == 0)
+		internal_error("no registers in use");
+
+	return s->regcount - 1;
+}
+
+static int imr_to_bpf_reg(enum imr_reg_num n)
+{
+	/* currently maps 1:1 */
+	return (int)n;
+}
+
+static int bpf_reg_width(unsigned int len)
+{
+	switch (len) {
+	case sizeof(uint8_t): return BPF_B;
+	case sizeof(uint16_t): return BPF_H;
+	case sizeof(uint32_t): return BPF_W;
+	case sizeof(uint64_t): return BPF_DW;
+	default:
+		internal_error("reg size not supported");
+	}
+
+	return -EINVAL;
+}
+
+static void imr_register_release(struct imr_state *s)
+{
+	if (s->regcount == 0)
+		internal_error("regcount underflow");
+	s->regcount--;
+}
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num reg, struct imr_object *o)
+{
+	s->registers[reg] = o;
+}
+
+struct imr_object *imr_register_load(const struct imr_state *s, enum imr_reg_num reg)
+{
+	return s->registers[reg];
+}
+
+struct imr_state *imr_state_alloc(void)
+{
+	struct imr_state *s = calloc(1, sizeof(*s));
+
+	return s;
+}
+
+void imr_state_free(struct imr_state *s)
+{
+	int i;
+
+	for (i = 0; i < s->num_objects; i++)
+		imr_object_free(s->objects[i]);
+
+	free(s);
+}
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t)
+{
+	struct imr_object *o = calloc(1, sizeof(*o));
+
+	if (o)
+		o->type = t;
+
+	return o;
+}
+
+void imr_object_free(struct imr_object *o)
+{
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+	case IMR_OBJ_TYPE_IMMEDIATE:
+	case IMR_OBJ_TYPE_PAYLOAD:
+		break;
+	case IMR_OBJ_TYPE_RELATIONAL:
+		imr_object_free(o->relational.left);
+		imr_object_free(o->relational.right);
+		break;
+	}
+
+	free(o);
+}
+
+struct imr_object *imr_object_alloc_imm32(uint32_t value)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+
+	if (o) {
+		o->immediate.value32 = value;
+		o->len = sizeof(value);
+	}
+	return o;
+}
+
+struct imr_object *imr_object_alloc_imm64(uint64_t value)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_IMMEDIATE);
+
+	if (o) {
+		o->immediate.value64 = value;
+		o->len = sizeof(value);
+	}
+	return o;
+}
+
+struct imr_object *imr_object_alloc_verdict(enum imr_verdict v)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_VERDICT);
+
+	if (!o)
+		return NULL;
+
+	o->verdict.verdict = v;
+	o->len = sizeof(v);
+
+	return o;
+}
+
+static const char *op_to_str(enum imr_relop op)
+{
+	switch (op) {
+	case IMR_RELOP_NE: return "ne";
+	case IMR_RELOP_EQ: return "eq";
+	}
+
+	return "invalid";
+}
+
+static const char *verdict_to_str(enum imr_verdict v)
+{
+	switch (v) {
+	case IMR_VERDICT_NEXT: return "next";
+	case IMR_VERDICT_PASS: return "pass";
+	case IMR_VERDICT_DROP: return "drop";
+	}
+
+	return "invalid";
+}
+
+static int imr_object_print_imm(FILE *fp, const struct imr_object *o)
+{
+	int ret = fprintf(fp, "TYPE_IMMEDIATE (");
+	if (ret < 0)
+		return ret;
+
+	switch (o->len) {
+	case sizeof(uint64_t):
+		return fprintf(fp, "0x%16llx)\n", (unsigned long long)o->immediate.value64);
+	case sizeof(uint32_t):
+		return fprintf(fp, "0x%08x)\n", (unsigned int)o->immediate.value32);
+	default:
+		return fprintf(fp, "0x%llx (?)\n", (unsigned long long)o->immediate.value64);
+	}
+}
+
+static int imr_object_print(FILE *fp, int depth, const struct imr_object *o)
+{
+	int ret, total = 0;
+	int i;
+
+	for (i = 0; i < depth; i++) {
+		ret = fprintf(fp, "\t");
+		if (ret < 0)
+			return ret;
+	}
+
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+		return fprintf(fp, "TYPE_VERDICT: %s\n", verdict_to_str(o->verdict.verdict));
+	case IMR_OBJ_TYPE_RELATIONAL:
+		++depth;
+
+		ret = fprintf(fp, "IMR_OBJ_TYPE_RELATIONAL {\n");
+		if (ret < 0)
+			return ret;
+		total += ret;
+
+		ret = imr_object_print(fp, depth, o->relational.left);
+		if (ret < 0)
+			return ret;
+		total += ret;
+
+		for (i = 0; i < depth; i++)
+			fprintf(fp, "\t");
+
+		ret = fprintf(fp , "op: %s\n", op_to_str(o->relational.op));
+		if (ret < 0)
+			return ret;
+		total += ret;
+
+		ret = imr_object_print(fp, depth, o->relational.right);
+		if (ret < 0)
+			return ret;
+		total += ret;
+
+		--depth;
+		for (i = 0; i < depth; i++)
+			fprintf(fp, "\t");
+
+		ret = fprintf(fp, "}\n");
+		if (ret < 0)
+			return ret;
+
+		return total + ret;
+	case IMR_OBJ_TYPE_PAYLOAD:
+		return fprintf(fp, "TYPE_PAYLOAD: base %d,offset %d, length %d\n",
+				o->payload.base, o->payload.offset, o->len);
+	case IMR_OBJ_TYPE_IMMEDIATE:
+		return imr_object_print_imm(fp, o);
+	}
+
+	internal_error("missing print support");
+	return 0;
+}
+
+void imr_state_print(FILE *fp, struct imr_state *s)
+{
+	int i;
+
+	for (i = 0; i < s->num_objects; i++)
+		imr_object_print(fp, 0, s->objects[i]);
+}
+
+struct imr_object *imr_object_alloc_payload(enum imr_payload_base b, uint16_t off, uint16_t len)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_PAYLOAD);
+
+	if (!o)
+		return NULL;
+
+	o->payload.base = b;
+	o->payload.offset = off;
+	if (len == 0)
+		internal_error("payload length is 0");
+	if (len > 16)
+		internal_error("payload length exceeds 16 byte");
+
+	o->len = len;
+
+	return o;
+}
+
+struct imr_object *imr_object_alloc_relational(enum imr_relop op, struct imr_object *l, struct imr_object *r)
+{
+	struct imr_object *o = imr_object_alloc(IMR_OBJ_TYPE_RELATIONAL);
+
+	if (!o)
+		return NULL;
+
+	o->relational.op = op;
+	o->relational.left = l;
+	o->relational.right = r;
+
+	if (l->len == 0 || r->len == 0)
+		internal_error("relational op with 0 op length");
+
+	o->len = l->len;
+	if (r->len > o->len)
+		o->len = r->len;
+
+	return o;
+}
+
+int imr_state_add_obj(struct imr_state *s, struct imr_object *o)
+{
+	struct imr_object **new;
+	uint32_t slot = s->num_objects;
+
+	if (s->num_objects >= INT_MAX / sizeof(*o))
+		return -1;
+
+	s->num_objects++;
+	new = realloc(s->objects, sizeof(o) * s->num_objects);
+	if (!new) {
+		imr_object_free(o);
+		return -1;
+	}
+
+	new[slot] = o;
+	if (new != s->objects)
+		s->objects = new;
+
+	return 0;
+}
+
+int imr_state_rule_end(struct imr_state *s)
+{
+	uint32_t slot = s->num_objects;
+	struct imr_object *last;
+
+	if (slot == 0)
+		internal_error("rule end, but no objects present\n");
+	last = s->objects[slot - 1];
+
+	if (last->type == IMR_OBJ_TYPE_VERDICT)
+		return 0;
+
+	return imr_state_add_obj(s, imr_object_alloc_verdict(IMR_VERDICT_NEXT));
+}
+
+static int imr_jit_obj_immediate(struct bpfilter_gen_ctx *ctx,
+				 const struct imr_state *s,
+				 const struct imr_object *o)
+{
+	int bpf_reg = imr_to_bpf_reg(imr_register_get(s, o->len));
+
+	fprintf(stderr, "store immediate in bpf reg %d\n", bpf_reg);
+	switch (o->len) {
+	case sizeof(uint32_t):
+		EMIT(ctx, BPF_MOV32_IMM(bpf_reg, o->immediate.value32));
+		return 0;
+	case sizeof(uint64_t):
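+		/* XXX: BPF_MOV64_IMM takes a 32bit immediate (sign
+		 * extended); a full 64bit constant would need the
+		 * two-insn BPF_LD_IMM64 instead. */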
+		EMIT(ctx, BPF_MOV64_IMM(bpf_reg, o->immediate.value64));
+		return 0;
+	default:
+		break;
+	}
+
+	internal_error("unhandled immediate size");
+	return -EINVAL;
+}
+
+static int imr_jit_obj_verdict(struct bpfilter_gen_ctx *ctx,
+			       const struct imr_state *s,
+			       const struct imr_object *o)
+{
+	uint32_t verdict = o->verdict.verdict;
+	enum xdp_action match_xdp;
+
+	match_xdp = verdict == IMR_VERDICT_DROP ? XDP_DROP : XDP_PASS;
+	fprintf(stderr, "jit verdict: %s (imr: %d)\n", match_xdp == XDP_DROP ? "drop" : "pass", verdict);
+
+	EMIT(ctx, BPF_MOV32_IMM(BPF_REG_0, match_xdp));
+	EMIT(ctx, BPF_EXIT_INSN());
+
+	return 0;
+}
+
+static int imr_jit_obj_payload(struct bpfilter_gen_ctx *ctx,
+			       const struct imr_state *state,
+			       const struct imr_object *o)
+{
+	int base = o->payload.base;
+	int offset;
+	int bpf_width, bpf_reg;
+
+	offset = o->payload.offset;
+
+	switch (base) {
+	case IMR_PAYLOAD_BASE_LL:
+		EMIT(ctx, BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+					-(int)sizeof(struct ethhdr)));
+		break;
+	case IMR_PAYLOAD_BASE_NH:
+		break;
+	case IMR_PAYLOAD_BASE_TH:
+		/* XXX: ip options */
+		offset += sizeof(struct iphdr);
+		break;
+	}
+
+	bpf_width = bpf_reg_width(o->len);
+	bpf_reg = imr_to_bpf_reg(imr_register_get(state, o->len));
+
+	fprintf(stderr, "store payload in bpf reg %d\n", bpf_reg);
+	EMIT(ctx, BPF_LDX_MEM(bpf_width, bpf_reg, BPF_REG_1, offset));
+
+	switch (base) {
+	case IMR_PAYLOAD_BASE_LL:
+		EMIT(ctx, BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+					(int)sizeof(struct ethhdr)));
+		break;
+	case IMR_PAYLOAD_BASE_NH:
+		break;
+	case IMR_PAYLOAD_BASE_TH:
+		EMIT(ctx, BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+					-(int)sizeof(struct iphdr)));
+		break;
+	}
+
+	return 0;
+}
+
+static int imr_jit_obj_relational(struct bpfilter_gen_ctx *ctx,
+				  struct imr_state *state,
+				  const struct imr_object *o)
+{
+	const struct imr_object *right;
+	enum imr_reg_num regl, regr;
+	int ret, op, bpf_reg;
+
+	switch (o->relational.op) {
+	case IMR_RELOP_EQ:
+		op = BPF_JNE;
+		break;
+	case IMR_RELOP_NE:
+		op = BPF_JEQ;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	regl = imr_register_alloc(state, o->len);
+	if (regl < 0)
+		return -ENOSPC;
+
+	ret = imr_jit_object(ctx, state, o->relational.left);
+	if (ret) {
+		imr_register_release(state);
+		return ret;
+	}
+
+	right = o->relational.right;
+	bpf_reg = imr_to_bpf_reg(regl);
+
+	/* avoid 2nd register if possible */
+	if (right->type == IMR_OBJ_TYPE_IMMEDIATE) {
+		switch (right->len) {
+		case sizeof(uint32_t):
+			EMIT(ctx, BPF_JMP_IMM(op, bpf_reg, right->immediate.value32, 0));
+			imr_register_release(state);
+			return 0;
+		}
+	}
+
+	regr = imr_register_alloc(state, right->len);
+	if (regr < 0) {
+		imr_register_release(state);
+		return -ENOSPC;
+	}
+
+	ret = imr_jit_object(ctx, state, right);
+	if (ret) {
+		imr_register_release(state);
+		imr_register_release(state);
+		return ret;
+	}
+
+	fprintf(stderr, "CMP: %d %d\n", bpf_reg, imr_to_bpf_reg(regr));
+	EMIT(ctx, BPF_JMP_REG(op, bpf_reg, imr_to_bpf_reg(regr), 0));
+	imr_register_release(state);
+	imr_register_release(state);
+	return 0;
+}
+
+static int imr_jit_object(struct bpfilter_gen_ctx *ctx,
+			  struct imr_state *s,
+			  const struct imr_object *o)
+{
+	switch (o->type) {
+	case IMR_OBJ_TYPE_VERDICT:
+		return imr_jit_obj_verdict(ctx, s, o);
+	case IMR_OBJ_TYPE_RELATIONAL:
+		return imr_jit_obj_relational(ctx, s, o);
+	case IMR_OBJ_TYPE_PAYLOAD:
+		return imr_jit_obj_payload(ctx, s, o);
+	case IMR_OBJ_TYPE_IMMEDIATE:
+		return imr_jit_obj_immediate(ctx, s, o);
+	}
+
+	return -EINVAL;
+}
+
+static int imr_jit_rule(struct bpfilter_gen_ctx *ctx,
+			struct imr_state *state,
+			int i)
+{
+	unsigned int start, end, count, pc, pc_end, len_cur;
+
+	end = state->num_objects;
+	if (i >= end)
+		return -EINVAL;
+
+	len_cur = ctx->len_cur;
+
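+	/* BPF_REG_2/BPF_REG_3 are expected to hold data/data_end
+	 * (set up by bpfilter_gen_prologue).  Bounds-check room for
+	 * eth+ip, then leave BPF_REG_1 pointing at the network header;
+	 * the JGT offset is retargeted to the next rule by the fixup
+	 * loop at the end of this function. */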
+	EMIT(ctx, BPF_MOV64_REG(BPF_REG_1, BPF_REG_2));
+	EMIT(ctx, BPF_ALU64_IMM(BPF_ADD, BPF_REG_1,
+			   sizeof(struct ethhdr) + sizeof(struct iphdr)));
+	EMIT(ctx, BPF_JMP_REG(BPF_JGT, BPF_REG_1, BPF_REG_3, 0));
+	EMIT(ctx, BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -(int)sizeof(struct iphdr)));
+
+	start = i;
+	count = 0;
+
+	for (i = start; i < end; i++) {
+		int ret = imr_jit_object(ctx, state, state->objects[i]);
+
+		if (ret < 0) {
+			fprintf(stderr, "failed to JIT object type %d\n",  state->objects[i]->type);
+			return ret;
+		}
+
+		count++;
+
+		if (state->objects[i]->type == IMR_OBJ_TYPE_VERDICT)
+			break;
+	}
+
+	if (i == end) {/* malformed -- no verdict */
+		fprintf(stderr, "rule had no verdict, start %d end %d\n", start, end);
+		internal_error("no verdict found in rule");
+	}
+
+	pc = 0;
+	pc_end = ctx->len_cur - len_cur; /* start of next rule */
+
+	for (i = len_cur; pc < pc_end; pc++, i++) {
+		if (BPF_CLASS(ctx->img[i].code) == BPF_JMP) {
+			if (ctx->img[i].code == (BPF_EXIT | BPF_JMP))
+				continue;
+
+			fprintf(stderr, "fix jump to %d: should be %d, pc is %d\n", ctx->img[i].off, pc_end - pc, pc);
+			ctx->img[i].off = pc_end - pc - 1;
+		}
+	}
+
+	return count;
+}
+
+/* test function, would only return bpf prog */
+int imr_do_bpf(struct imr_state *s)
+{
+	struct bpfilter_gen_ctx ctx;
+	int ret, i = 0;
+
+	ret = bpfilter_gen_init(&ctx);
+	if (ret < 0)
+		return ret;
+
+	ret = bpfilter_gen_prologue(&ctx);
+	if (ret < 0)
+		return ret;
+
+	/* Hack: don't touch/use first 4 bpf registers */
+	s->regcount = 4;
+	do {
+		int insns = imr_jit_rule(&ctx, s, i);
+		if (insns < 0) {
+			ret = insns;
+			break;
+		}
+		if (insns == 0)
+			internal_error("rule jit yields 0 insns");
+
+		i += insns;
+	} while (i < s->num_objects);
+
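+	/* XXX: hardcoded; ifindex 1 is normally loopback, which matches
+	 * the '-i lo' example rule */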
+	ctx.ifindex = 1;
+	if (ret == 0) {
+		EMIT(&ctx, BPF_MOV32_IMM(BPF_REG_0, XDP_PASS));
+		EMIT(&ctx, BPF_EXIT_INSN());
+		bpfilter_gen_commit(&ctx);
+	} else {
+		fprintf(stderr, "Error when generating bpf code");
+	}
+
+	bpfilter_gen_destroy(&ctx);
+
+	return ret;
+}
diff --git a/net/bpfilter/imr.h b/net/bpfilter/imr.h
new file mode 100644
index 000000000000..3f602bf315df
--- /dev/null
+++ b/net/bpfilter/imr.h
@@ -0,0 +1,78 @@
+#ifndef IMR_HDR
+#define IMR_HDR
+#include <stdint.h>
+#include <stdio.h>
+
+enum imr_reg_num {
+	IMR_REG_0 = 0,
+	IMR_REG_1,
+	IMR_REG_2,
+	IMR_REG_3,
+	IMR_REG_4,
+	IMR_REG_5,
+	IMR_REG_6,
+	IMR_REG_7,
+	IMR_REG_8,
+	IMR_REG_9,
+	IMR_REG_10,
+	IMR_REG_COUNT,
+};
+
+struct imr_state;
+struct imr_object;
+
+enum imr_obj_type {
+	IMR_OBJ_TYPE_VERDICT,
+	IMR_OBJ_TYPE_IMMEDIATE,
+	IMR_OBJ_TYPE_RELATIONAL,
+	IMR_OBJ_TYPE_PAYLOAD,
+};
+
+enum imr_relop {
+	IMR_RELOP_EQ,
+	IMR_RELOP_NE,
+};
+
+enum imr_verdict {
+	IMR_VERDICT_NEXT,	/* move to next rule */
+	IMR_VERDICT_PASS,	/* end processing, accept packet */
+	IMR_VERDICT_DROP,	/* end processing, drop packet */
+};
+
+enum imr_payload_base {
+	IMR_PAYLOAD_BASE_INVALID,
+	IMR_PAYLOAD_BASE_LL,
+	IMR_PAYLOAD_BASE_NH,
+	IMR_PAYLOAD_BASE_TH,
+};
+
+struct imr_state *imr_state_alloc(void);
+void imr_state_free(struct imr_state *s);
+void imr_state_print(FILE *fp, struct imr_state *s);
+
+static inline int imr_state_rule_begin(struct imr_state *s)
+{
+	/* nothing for now */
+	return 0;
+}
+
+int imr_state_rule_end(struct imr_state *s);
+
+void imr_register_store(struct imr_state *s, enum imr_reg_num r, struct imr_object *o);
+struct imr_object *imr_register_load(const struct imr_state *s, enum imr_reg_num r);
+
+struct imr_object *imr_object_alloc(enum imr_obj_type t);
+void imr_object_free(struct imr_object *o);
+
+struct imr_object *imr_object_alloc_imm32(uint32_t value);
+struct imr_object *imr_object_alloc_imm64(uint64_t value);
+struct imr_object *imr_object_alloc_verdict(enum imr_verdict v);
+
+struct imr_object *imr_object_alloc_payload(enum imr_payload_base b, uint16_t off, uint16_t len);
+struct imr_object *imr_object_alloc_relational(enum imr_relop op, struct imr_object *l, struct imr_object *r);
+
+int imr_state_add_obj(struct imr_state *s, struct imr_object *o);
+
+int imr_do_bpf(struct imr_state *s);
+
+#endif /* IMR_HDR */
-- 
2.16.1


* [RFC,POC 2/3] bpfilter: add nftables jit proof-of-concept
  2018-03-04 19:40 [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Florian Westphal
  2018-03-04 19:40 ` [RFC,POC 1/3] bpfilter: add experimental IMR bpf translator Florian Westphal
@ 2018-03-04 19:40 ` Florian Westphal
  2018-03-04 19:40 ` [RFC,POC 3/3] bpfilter: switch bpfilter to iptables->IMR translation Florian Westphal
  2018-03-06 14:22 ` [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Daniel Borkmann
  3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2018-03-04 19:40 UTC
  To: netdev; +Cc: daniel, ast, pablo, Florian Westphal

This adds an nftables frontend for the IMR->BPF translator.

This doesn't work via UMH yet.

AFAIU it should be possible to get transparent ebpf translation for
nftables, similar to the bpfilter/iptables UMH.

However, at this time I think it's better to get the IMR "right".

nftjit.ko currently needs libnftnl/libmnl, but that's convenience on
my end and not a "must have".

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/bpfilter/Makefile   |   7 +-
 net/bpfilter/nftables.c | 687 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 693 insertions(+), 1 deletion(-)
 create mode 100644 net/bpfilter/nftables.c

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index 5a85ef7d7a4d..a4064986dc2f 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -3,7 +3,12 @@
 # Makefile for the Linux BPFILTER layer.
 #
 
-hostprogs-y := bpfilter.ko
+hostprogs-y := nftjit.ko bpfilter.ko
 always := $(hostprogs-y)
 bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
+
+NFT_LIBS = -lnftnl
+nftjit.ko-objs := tgts.o targets.o tables.o init.o ctor.o gen.o nftables.o imr.o
+HOSTLOADLIBES_nftjit.ko = `pkg-config --libs libnftnl libmnl`
+
 HOSTCFLAGS += -I. -Itools/include/
diff --git a/net/bpfilter/nftables.c b/net/bpfilter/nftables.c
new file mode 100644
index 000000000000..5a756ccd03a1
--- /dev/null
+++ b/net/bpfilter/nftables.c
@@ -0,0 +1,687 @@
+/*
+ * based on previous code from:
+ *
+ * Copyright (c) 2013 Arturo Borrero Gonzalez <arturo@netfilter.org>
+ * Copyright (c) 2013 Pablo Neira Ayuso <pablo@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <stdlib.h>
+#include <stdint.h>
+#include <time.h>
+#include <string.h>
+#include <netinet/in.h>
+#include <errno.h>
+#include <utils.h>
+
+#include <linux/netfilter.h>
+#include <linux/netfilter/nf_tables.h>
+
+#include <libmnl/libmnl.h>
+#include <libnftnl/common.h>
+#include <libnftnl/ruleset.h>
+#include <libnftnl/table.h>
+#include <libnftnl/chain.h>
+#include <libnftnl/set.h>
+#include <libnftnl/expr.h>
+#include <libnftnl/rule.h>
+
+#include <linux/if_ether.h>
+
+#include "bpfilter_mod.h"
+#include "imr.h"
+
+/* Hack, we don't link bpfilter.o */
+extern long int syscall (long int __sysno, ...);
+
+int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
+{
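+	/* 321 is __NR_bpf on x86-64 */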
+	return syscall(321, cmd, attr, size);
+}
+
+static int seq;
+
+static void memory_allocation_error(void) { perror("allocation failed"); exit(1); }
+
+static int nft_reg_to_imr_reg(int nfreg)
+{
+	switch (nfreg) {
+	case NFT_REG_VERDICT:
+		return IMR_REG_0;
+	/* old register numbers, 4 128 bit registers. */
+	case NFT_REG_1:
+		return IMR_REG_4;
+	case NFT_REG_2:
+		return IMR_REG_6;
+	case NFT_REG_3:
+		return IMR_REG_8;
+	case NFT_REG_4:
+		break;
+	/* new register numbers, 16 32 bit registers, map to old ones */
+	case NFT_REG32_00:
+		return IMR_REG_4;
+	case NFT_REG32_01:
+		return IMR_REG_5;
+	case NFT_REG32_02:
+		return IMR_REG_6;
+	default:
+		return -1;
+	}
+	return -1;
+}
+
+static int netlink_parse_immediate(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	struct imr_object *o = NULL;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_DATA)) {
+		uint32_t len;
+		int reg;
+
+		nftnl_expr_get(nle, NFTNL_EXPR_IMM_DATA, &len);
+
+		switch (len) {
+		case sizeof(uint32_t):
+			o = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_DATA));
+			break;
+		case sizeof(uint64_t):
+			o = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, NFTNL_EXPR_IMM_DATA));
+			break;
+		default:
+			return -ENOTSUPP;
+		}
+		reg = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle,
+					     NFTNL_EXPR_IMM_DREG));
+		if (reg < 0) {
+			imr_object_free(o);
+			return reg;
+		}
+
+		imr_register_store(state, reg, o);
+		return 0;
+	} else if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_VERDICT)) {
+		uint32_t verdict;
+		int ret;
+
+		if (nftnl_expr_is_set(nle, NFTNL_EXPR_IMM_CHAIN))
+			return -ENOTSUPP;
+
+		verdict = nftnl_expr_get_u32(nle, NFTNL_EXPR_IMM_VERDICT);
+
+		switch (verdict) {
+		case NF_ACCEPT:
+			o = imr_object_alloc_verdict(IMR_VERDICT_PASS);
+			break;
+		case NF_DROP:
+			o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+			break;
+		default:
+			fprintf(stderr, "Unhandled verdict %d\n", verdict);
+			o = imr_object_alloc_verdict(IMR_VERDICT_DROP);
+			break;
+		}
+
+		ret = imr_state_add_obj(state, o);
+		if (ret < 0)
+			imr_object_free(o);
+
+		return ret;
+	}
+
+	return -ENOTSUPP;
+}
+
+static int netlink_parse_cmp(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_object *o, *imm, *left;
+	struct imr_state *state = out;
+	enum imr_relop op;
+	uint32_t tmp, len;
+	int ret;
+
+	switch (nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_OP)) {
+	case NFT_CMP_EQ:
+		op = IMR_RELOP_EQ;
+		break;
+	case NFT_CMP_NEQ:
+		op = IMR_RELOP_NE;
+		break;
+	default:
+		return -ENOTSUPP;
+	}
+
+	nftnl_expr_get(nle, NFTNL_EXPR_CMP_DATA, &len);
+	switch (len) {
+	case sizeof(uint64_t):
+		imm = imr_object_alloc_imm64(nftnl_expr_get_u64(nle, NFTNL_EXPR_CMP_DATA));
+		break;
+	case sizeof(uint32_t):
+		imm = imr_object_alloc_imm32(nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_DATA));
+		break;
+	case sizeof(uint16_t):
+		tmp = nftnl_expr_get_u16(nle, NFTNL_EXPR_CMP_DATA);
+
+		imm = imr_object_alloc_imm32(tmp);
+		break;
+	case sizeof(uint8_t):
+		tmp = nftnl_expr_get_u8(nle, NFTNL_EXPR_CMP_DATA);
+
+		imm = imr_object_alloc_imm32(tmp);
+		break;
+	default:
+		return -ENOTSUPP;
+	}
+
+	if (!imm)
+		return -ENOMEM;
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_CMP_SREG));
+	if (ret < 0) {
+		imr_object_free(imm);
+		return ret;
+	}
+
+	left = imr_register_load(state, ret);
+	if (!left)
+		return -EINVAL;
+
+	o = imr_object_alloc_relational(op, left, imm);
+	if (!o) {
+		imr_object_free(imm);
+		return -ENOMEM;
+	}
+
+	return imr_state_add_obj(state, o);
+}
+
+static int netlink_parse_payload(const struct nftnl_expr *nle, void *out)
+{
+	struct imr_state *state = out;
+	enum imr_payload_base imr_base;
+	uint32_t base, offset, len;
+	struct imr_object *payload;
+	int ret;
+
+	if (nftnl_expr_is_set(nle, NFTNL_EXPR_PAYLOAD_SREG) ||
+	    nftnl_expr_is_set(nle, NFTNL_EXPR_PAYLOAD_FLAGS))
+		return -EOPNOTSUPP;
+
+	base = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_BASE);
+	offset = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_OFFSET);
+	len = nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_LEN);
+
+	printf("payload: base %d off %d len %d\n", base, offset, len);
+
+	ret = nft_reg_to_imr_reg(nftnl_expr_get_u32(nle, NFTNL_EXPR_PAYLOAD_DREG));
+	if (ret < 0)
+		return ret;
+
+	switch (base) {
+	case NFT_PAYLOAD_LL_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_LL;
+		break;
+	case NFT_PAYLOAD_NETWORK_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_NH;
+		break;
+	case NFT_PAYLOAD_TRANSPORT_HEADER:
+		imr_base = IMR_PAYLOAD_BASE_TH;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	payload = imr_object_alloc_payload(imr_base, offset, len);
+	if (!payload)
+		return -ENOMEM;
+
+	imr_register_store(state, ret, payload);
+	return 0;
+}
+
+static const struct {
+	const char *name;
+	int (*parse)(const struct nftnl_expr *nle,
+				 void *);
+} netlink_parsers[] = {
+	{ .name = "immediate",	.parse = netlink_parse_immediate },
+	{ .name = "cmp",	.parse = netlink_parse_cmp },
+	{ .name = "payload",	.parse = netlink_parse_payload },
+};
+
+static int expr_parse_cb(struct nftnl_expr *expr, void *data)
+{
+	const char *name = nftnl_expr_get_str(expr, NFTNL_EXPR_NAME);
+	struct imr_state *state = data;
+	unsigned int i;
+
+	if (!name)
+		return -1;
+
+	for (i = 0; i < ARRAY_SIZE(netlink_parsers); i++) {
+		if (strcmp(netlink_parsers[i].name, name))
+			continue;
+
+		printf("parse: %s\n", nftnl_expr_get_str(expr, NFTNL_EXPR_NAME));
+		netlink_parsers[i].parse(expr, state);
+	}
+
+	return 0;
+}
+
+static int rule_parse_cb(struct nftnl_rule *rule, void *data)
+{
+	struct imr_state *state = data;
+	int ret;
+
+	ret = imr_state_rule_begin(state);
+	if (ret < 0)
+		return ret;
+	nftnl_expr_foreach(rule, expr_parse_cb, data);
+
+	return imr_state_rule_end(state);
+}
+
+static int
+mnl_talk(struct mnl_socket *nf_sock, const void *data, unsigned int len,
+	 int (*cb)(const struct nlmsghdr *nlh, void *data), void *cb_data)
+{
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	uint32_t portid = mnl_socket_get_portid(nf_sock);
+	int ret;
+
+	if (mnl_socket_sendto(nf_sock, data, len) < 0)
+		return -1;
+
+	ret = mnl_socket_recvfrom(nf_sock, buf, sizeof(buf));
+	while (ret > 0) {
+		ret = mnl_cb_run(buf, ret, seq, portid, cb, cb_data);
+		if (ret <= 0)
+			goto out;
+
+		ret = mnl_socket_recvfrom(nf_sock, buf, sizeof(buf));
+	}
+out:
+	if (ret < 0 && errno == EAGAIN)
+		return 0;
+
+	return ret;
+}
+
+/*
+ * Rule
+ */
+static int rule_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nftnl_rule_list *nlr_list = data;
+	struct nftnl_rule *r;
+
+	r = nftnl_rule_alloc();
+	if (r == NULL)
+		memory_allocation_error();
+
+	if (nftnl_rule_nlmsg_parse(nlh, r) < 0)
+		goto err_free;
+
+	nftnl_rule_list_add_tail(r, nlr_list);
+	return MNL_CB_OK;
+
+err_free:
+	nftnl_rule_free(r);
+	return MNL_CB_OK;
+}
+
+static struct nftnl_rule_list *mnl_rule_dump(struct mnl_socket *nf_sock,
+					   int family)
+{
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct nlmsghdr *nlh;
+	struct nftnl_rule_list *nlr_list;
+	int ret;
+
+	nlr_list = nftnl_rule_list_alloc();
+	if (nlr_list == NULL)
+		memory_allocation_error();
+
+	nlh = nftnl_rule_nlmsg_build_hdr(buf, NFT_MSG_GETRULE, family,
+					 NLM_F_DUMP, seq);
+
+	ret = mnl_talk(nf_sock, nlh, nlh->nlmsg_len, rule_cb, nlr_list);
+	if (ret < 0)
+		goto err;
+
+	return nlr_list;
+err:
+	nftnl_rule_list_free(nlr_list);
+	return NULL;
+}
+
+/*
+ * Chain
+ */
+static int chain_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nftnl_chain_list *nlc_list = data;
+	struct nftnl_chain *c;
+
+	c = nftnl_chain_alloc();
+	if (c == NULL)
+		memory_allocation_error();
+
+	if (nftnl_chain_nlmsg_parse(nlh, c) < 0)
+		goto err_free;
+
+	nftnl_chain_list_add_tail(c, nlc_list);
+	return MNL_CB_OK;
+
+err_free:
+	nftnl_chain_free(c);
+	return MNL_CB_OK;
+}
+
+static struct nftnl_chain_list *mnl_chain_dump(struct mnl_socket *nf_sock,
+					     int family)
+{
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct nlmsghdr *nlh;
+	struct nftnl_chain_list *nlc_list;
+	int ret;
+
+	nlc_list = nftnl_chain_list_alloc();
+	if (nlc_list == NULL)
+		memory_allocation_error();
+
+	nlh = nftnl_chain_nlmsg_build_hdr(buf, NFT_MSG_GETCHAIN, family,
+					NLM_F_DUMP, seq);
+
+	ret = mnl_talk(nf_sock, nlh, nlh->nlmsg_len, chain_cb, nlc_list);
+	if (ret < 0)
+		goto err;
+
+	return nlc_list;
+err:
+	nftnl_chain_list_free(nlc_list);
+	return NULL;
+}
+
+/*
+ * Table
+ */
+static int table_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nftnl_ruleset *rs = data;
+	struct nftnl_table *t;
+
+	t = nftnl_table_alloc();
+	if (t == NULL)
+		memory_allocation_error();
+
+	if (nftnl_table_nlmsg_parse(nlh, t) < 0)
+		goto err_free;
+
+	nftnl_ruleset_set(rs, NFTNL_RULESET_TABLELIST, t);
+
+	return MNL_CB_OK;
+
+err_free:
+	nftnl_table_free(t);
+	return MNL_CB_ERROR;
+}
+
+/*
+ * Set elements
+ */
+static int set_elem_cb(const struct nlmsghdr *nlh, void *data)
+{
+	nftnl_set_elems_nlmsg_parse(nlh, data);
+	return MNL_CB_OK;
+}
+
+static int mnl_setelem_get(struct mnl_socket *nf_sock, struct nftnl_set *nls)
+{
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct nlmsghdr *nlh;
+	uint32_t family = nftnl_set_get_u32(nls, NFTNL_SET_FAMILY);
+
+	nlh = nftnl_set_nlmsg_build_hdr(buf, NFT_MSG_GETSETELEM, family,
+				      NLM_F_DUMP|NLM_F_ACK, seq);
+	nftnl_set_nlmsg_build_payload(nlh, nls);
+
+	return mnl_talk(nf_sock, nlh, nlh->nlmsg_len, set_elem_cb, nls);
+}
+
+/*
+ * Set
+ */
+static int set_cb(const struct nlmsghdr *nlh, void *data)
+{
+	struct nftnl_set_list *nls_list = data;
+	struct nftnl_set *s;
+
+	s = nftnl_set_alloc();
+	if (s == NULL)
+		memory_allocation_error();
+
+	if (nftnl_set_nlmsg_parse(nlh, s) < 0)
+		goto err_free;
+
+	nftnl_set_list_add_tail(s, nls_list);
+	return MNL_CB_OK;
+
+err_free:
+	nftnl_set_free(s);
+	return MNL_CB_OK;
+}
+
+static struct nftnl_set_list *
+mnl_set_dump(struct mnl_socket *nf_sock, int family)
+{
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct nlmsghdr *nlh;
+	struct nftnl_set *s;
+	struct nftnl_set_list *nls_list;
+	struct nftnl_set *si;
+	struct nftnl_set_list_iter *i;
+	int ret;
+
+	s = nftnl_set_alloc();
+	if (s == NULL)
+		memory_allocation_error();
+
+	nlh = nftnl_set_nlmsg_build_hdr(buf, NFT_MSG_GETSET, family,
+				      NLM_F_DUMP|NLM_F_ACK, seq);
+	nftnl_set_nlmsg_build_payload(nlh, s);
+	nftnl_set_free(s);
+
+	nls_list = nftnl_set_list_alloc();
+	if (nls_list == NULL)
+		memory_allocation_error();
+
+	ret = mnl_talk(nf_sock, nlh, nlh->nlmsg_len, set_cb, nls_list);
+	if (ret < 0)
+		goto err;
+
+	i = nftnl_set_list_iter_create(nls_list);
+	if (i == NULL)
+		memory_allocation_error();
+
+	si = nftnl_set_list_iter_next(i);
+	while (si != NULL) {
+		if (mnl_setelem_get(nf_sock, si) != 0) {
+			perror("E: Unable to get set elements");
+			nftnl_set_list_iter_destroy(i);
+			goto err;
+		}
+		si = nftnl_set_list_iter_next(i);
+	}
+
+	nftnl_set_list_iter_destroy(i);
+
+	return nls_list;
+err:
+	nftnl_set_list_free(nls_list);
+	return NULL;
+}
+
+static struct nftnl_ruleset *mnl_table_ruleset(struct mnl_socket *nf_sock,
+					       int family,
+					       const char *table)
+{
+	char buf[MNL_SOCKET_BUFFER_SIZE];
+	struct nftnl_ruleset *rs;
+	struct nftnl_table *t;
+	struct nlmsghdr *nlh;
+	int ret;
+
+	nlh = nftnl_table_nlmsg_build_hdr(buf, NFT_MSG_GETTABLE, family,
+					  NLM_F_ACK, seq);
+	t = nftnl_table_alloc();
+	if (t == NULL)
+		memory_allocation_error();
+
+	nftnl_table_set(t, NFTNL_TABLE_NAME, table);
+	nftnl_table_nlmsg_build_payload(nlh, t);
+	nftnl_table_free(t);
+
+	rs = nftnl_ruleset_alloc();
+	if (rs == NULL)
+		memory_allocation_error();
+	ret = mnl_talk(nf_sock, nlh, nlh->nlmsg_len, table_cb, rs);
+	if (ret < 0)
+		goto err;
+
+	return rs;
+err:
+	nftnl_ruleset_free(rs);
+	return NULL;
+}
+
+static struct nftnl_ruleset *mnl_ruleset_dump(struct mnl_socket *nf_sock, int family)
+{
+	struct nftnl_ruleset *rs;
+	struct nftnl_chain_list *c;
+	struct nftnl_set_list *s;
+	struct nftnl_rule_list *r;
+	uint32_t type = NFTNL_OUTPUT_DEFAULT;
+
+	rs = mnl_table_ruleset(nf_sock, family, "filter");
+	if (!rs)
+		return NULL;
+
+	c = mnl_chain_dump(nf_sock, family);
+	if (c != NULL)
+		nftnl_ruleset_set(rs, NFTNL_RULESET_CHAINLIST, c);
+
+	s = mnl_set_dump(nf_sock, family);
+	if (s != NULL)
+		nftnl_ruleset_set(rs, NFTNL_RULESET_SETLIST, s);
+
+	r = mnl_rule_dump(nf_sock, family);
+	if (r != NULL)
+		nftnl_ruleset_set(rs, NFTNL_RULESET_RULELIST, r);
+
+	nftnl_ruleset_fprintf(stdout, rs, type, 0);
+	return rs;
+}
+
+/* ether type ne 0x800 accept */
+static int nft_ipv4_only(struct imr_state *state)
+{
+	struct imr_object *eth_p_ip, *lltype, *relop;
+	int ret;
+
+	imr_state_rule_begin(state);
+	lltype = imr_object_alloc_payload(IMR_PAYLOAD_BASE_LL,
+					  offsetof(struct ethhdr, h_proto),
+					  sizeof(uint16_t));
+	if (!lltype)
+		return -ENOMEM;
+
+	eth_p_ip = imr_object_alloc_imm32(htons(ETH_P_IP));
+	if (!eth_p_ip) {
+		imr_object_free(lltype);
+		return -ENOMEM;
+	}
+
+	relop = imr_object_alloc_relational(IMR_RELOP_NE, lltype, eth_p_ip);
+	if (!relop) {
+		imr_object_free(eth_p_ip);
+		imr_object_free(lltype);
+		return -ENOMEM;
+	}
+
+	ret = imr_state_add_obj(state, relop);
+	if (ret == 0) {
+		ret = imr_state_add_obj(state, imr_object_alloc_verdict(IMR_VERDICT_PASS));
+		if (ret == 0)
+			return imr_state_rule_end(state);
+	}
+
+	return ret;
+}
+
+static int nft2imr(const struct nftnl_ruleset *rs)
+{
+	struct nftnl_rule_list *l = nftnl_ruleset_get(rs, NFTNL_RULESET_RULELIST);
+	struct imr_state *state;
+	int ret;
+
+	state = imr_state_alloc();
+	if (!state)
+		return -ENOMEM;
+
+	ret = nft_ipv4_only(state);
+	if (ret < 0) {
+		imr_state_free(state);
+		return ret;
+	}
+
+	ret = nftnl_rule_list_foreach(l, rule_parse_cb, state);
+	if (ret < 0) {
+		imr_state_free(state);
+		return ret;
+	}
+
+	imr_state_print(stdout, state);
+	imr_do_bpf(state);
+	imr_state_free(state);
+
+	return 0;
+}
+
+int main(int argc, char *argv[])
+{
+	struct mnl_socket *nl;
+	struct nftnl_ruleset *rs;
+
+	if (argc > 2) {
+		fprintf(stderr, "%s {json}\n",
+			argv[0]);
+		exit(EXIT_FAILURE);
+	}
+
+	nl = mnl_socket_open(NETLINK_NETFILTER);
+	if (nl == NULL) {
+		perror("mnl_socket_open");
+		exit(EXIT_FAILURE);
+	}
+
+	if (mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID) < 0) {
+		perror("mnl_socket_bind");
+		exit(EXIT_FAILURE);
+	}
+
+	seq = time(NULL);
+
+	rs = mnl_ruleset_dump(nl, NFPROTO_IPV4);
+	if (rs == NULL) {
+		perror("ruleset_dump");
+		exit(EXIT_FAILURE);
+	}
+
+	return nft2imr(rs);
+}
-- 
2.16.1


* [RFC,POC 3/3] bpfilter: switch bpfilter to iptables->IMR translation
  2018-03-04 19:40 [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Florian Westphal
  2018-03-04 19:40 ` [RFC,POC 1/3] bpfilter: add experimental IMR bpf translator Florian Westphal
  2018-03-04 19:40 ` [RFC,POC 2/3] bpfilter: add nftables jit proof-of-concept Florian Westphal
@ 2018-03-04 19:40 ` Florian Westphal
  2018-03-06 14:22 ` [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Daniel Borkmann
  3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2018-03-04 19:40 UTC
  To: netdev; +Cc: daniel, ast, pablo, Florian Westphal

Translate the basic iptables rule blob to the IMR, then ask the
IMR to translate to ebpf.

The IMR is shared between the nft and bpfilter translators.
iptables_gen_append() is the only relevant function here,
as it demonstrates a simple 'source/destination matches x' test.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/bpfilter/Makefile       |  2 +-
 net/bpfilter/bpfilter_gen.h | 15 +++++++++
 net/bpfilter/bpfilter_mod.h | 16 +---------
 net/bpfilter/iptables.c     | 80 +++++++++++++++++++++++++++++++++++++++++++++
 net/bpfilter/iptables.h     |  4 +++
 net/bpfilter/sockopt.c      | 73 +++++++++++++++++++++++++++++++++----------
 6 files changed, 158 insertions(+), 32 deletions(-)
 create mode 100644 net/bpfilter/bpfilter_gen.h
 create mode 100644 net/bpfilter/iptables.c
 create mode 100644 net/bpfilter/iptables.h

diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
index a4064986dc2f..21a8afb60b7c 100644
--- a/net/bpfilter/Makefile
+++ b/net/bpfilter/Makefile
@@ -5,7 +5,7 @@
 
 hostprogs-y := nftjit.ko bpfilter.ko
 always := $(hostprogs-y)
-bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o
+bpfilter.ko-objs := bpfilter.o tgts.o targets.o tables.o init.o ctor.o sockopt.o gen.o iptables.o imr.o
 
 NFT_LIBS = -lnftnl
 nftjit.ko-objs := tgts.o targets.o tables.o init.o ctor.o gen.o nftables.o imr.o
diff --git a/net/bpfilter/bpfilter_gen.h b/net/bpfilter/bpfilter_gen.h
new file mode 100644
index 000000000000..71c6e8a73e24
--- /dev/null
+++ b/net/bpfilter/bpfilter_gen.h
@@ -0,0 +1,15 @@
+struct bpfilter_gen_ctx {
+	struct bpf_insn		*img;
+	u32			len_cur;
+	u32			len_max;
+	u32			default_verdict;
+	int			fd;
+	int			ifindex;
+	bool			offloaded;
+};
+
+int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
+int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
+void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
diff --git a/net/bpfilter/bpfilter_mod.h b/net/bpfilter/bpfilter_mod.h
index b4209985efff..dc3a90df1788 100644
--- a/net/bpfilter/bpfilter_mod.h
+++ b/net/bpfilter/bpfilter_mod.h
@@ -4,6 +4,7 @@
 
 #include "include/uapi/linux/bpfilter.h"
 #include <linux/list.h>
+#include "bpfilter_gen.h"
 
 struct bpfilter_table {
 	struct hlist_node	hash;
@@ -71,26 +72,11 @@ struct bpfilter_target {
 	u8			rev;
 };
 
-struct bpfilter_gen_ctx {
-	struct bpf_insn		*img;
-	u32			len_cur;
-	u32			len_max;
-	u32			default_verdict;
-	int			fd;
-	int			ifindex;
-	bool			offloaded;
-};
-
 union bpf_attr;
 int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
 
-int bpfilter_gen_init(struct bpfilter_gen_ctx *ctx);
-int bpfilter_gen_prologue(struct bpfilter_gen_ctx *ctx);
-int bpfilter_gen_epilogue(struct bpfilter_gen_ctx *ctx);
 int bpfilter_gen_append(struct bpfilter_gen_ctx *ctx,
 			struct bpfilter_ipt_ip *ent, int verdict);
-int bpfilter_gen_commit(struct bpfilter_gen_ctx *ctx);
-void bpfilter_gen_destroy(struct bpfilter_gen_ctx *ctx);
 
 struct bpfilter_target *bpfilter_target_get_by_name(const char *name);
 void bpfilter_target_put(struct bpfilter_target *tgt);
diff --git a/net/bpfilter/iptables.c b/net/bpfilter/iptables.c
new file mode 100644
index 000000000000..055cfa8fbf21
--- /dev/null
+++ b/net/bpfilter/iptables.c
@@ -0,0 +1,80 @@
+#include <string.h>
+#include <stdint.h>
+
+typedef uint16_t __sum16; /* hack */
+#include <linux/ip.h>
+
+#include "bpfilter_mod.h"
+#include "iptables.h"
+#include "imr.h"
+
+static int check_entry(const struct bpfilter_ipt_ip *ent)
+{
+#define M_FF	"\xff\xff\xff\xff"
+	static const __u8 mask1[IFNAMSIZ] = M_FF M_FF M_FF M_FF;
+	static const __u8 mask0[IFNAMSIZ] = { };
+	int ones = strlen(ent->in_iface);
+#undef M_FF
+
+	/* the interface match must cover the name plus its trailing NUL */
+	if (ones > 0)
+		ones++;
+	if (strlen(ent->out_iface) > 0)
+		return -ENOTSUPP;
+	if (memcmp(ent->in_iface_mask, mask1, ones) ||
+	    memcmp(&ent->in_iface_mask[ones], mask0, sizeof(mask0) - ones))
+		return -ENOTSUPP;
+	if ((ent->src_mask != 0 && ent->src_mask != 0xffffffff) ||
+	    (ent->dst_mask != 0 && ent->dst_mask != 0xffffffff))
+		return -ENOTSUPP;
+
+	return 0;
+}
+
+int iptables_gen_append(struct imr_state *state,
+			struct bpfilter_ipt_ip *ent, int verdict)
+{
+	struct imr_object *left, *right, *relop;
+	int ret;
+
+	ret = check_entry(ent);
+	if (ret < 0)
+		return ret;
+	if (ent->src_mask == 0 && ent->dst_mask == 0)
+		return 0;
+
+	imr_state_rule_begin(state);
+
+	if (ent->src_mask) {
+		left = imr_object_alloc_payload(IMR_PAYLOAD_BASE_NH,
+					        offsetof(struct iphdr, saddr),
+					        sizeof(uint32_t));
+		right = imr_object_alloc_imm32(ent->src);
+
+		relop = imr_object_alloc_relational(IMR_RELOP_EQ, left, right);
+		imr_state_add_obj(state, relop);
+	}
+
+	if (ent->dst_mask) {
+		left = imr_object_alloc_payload(IMR_PAYLOAD_BASE_NH,
+					        offsetof(struct iphdr, daddr),
+					        sizeof(uint32_t));
+		right = imr_object_alloc_imm32(ent->dst);
+
+		relop = imr_object_alloc_relational(IMR_RELOP_EQ, left, right);
+		imr_state_add_obj(state, relop);
+	}
+
+	switch (verdict) {
+	case -1:
+		verdict = IMR_VERDICT_DROP;
+		break;
+	default:
+		verdict = IMR_VERDICT_PASS;
+		break;
+	}
+
+	imr_state_add_obj(state, imr_object_alloc_verdict(verdict));
+
+	return imr_state_rule_end(state);
+}
diff --git a/net/bpfilter/iptables.h b/net/bpfilter/iptables.h
new file mode 100644
index 000000000000..8d1299f10e55
--- /dev/null
+++ b/net/bpfilter/iptables.h
@@ -0,0 +1,4 @@
+#include "imr.h"
+
+int iptables_gen_append(struct imr_state *state,
+			struct bpfilter_ipt_ip *ent, int verdict);
diff --git a/net/bpfilter/sockopt.c b/net/bpfilter/sockopt.c
index 26ad12a11736..c601fa0839a1 100644
--- a/net/bpfilter/sockopt.c
+++ b/net/bpfilter/sockopt.c
@@ -5,8 +5,13 @@
 #include <stdlib.h>
 
 #include <sys/socket.h>
+#include <net/if.h>
+
+#include <linux/if_ether.h>
+#include <arpa/inet.h>
 
 #include "bpfilter_mod.h"
+#include "iptables.h"
 
 /* TODO: Get all of this in here properly done in encoding/decoding layer. */
 static int fetch_name(void *addr, int len, char *name, int name_len)
@@ -144,22 +149,53 @@ int bpfilter_get_entries(void *cmd, int len)
 	return err;
 }
 
+/* We run via XDP so skip non ip ethernet frames */
+static int iptables_accept_non_ipv4(struct imr_state *state)
+{
+	struct imr_object *eth_p_ip, *lltype, *relop;
+	int ret;
+
+	imr_state_rule_begin(state);
+	lltype = imr_object_alloc_payload(IMR_PAYLOAD_BASE_LL,
+					  offsetof(struct ethhdr, h_proto),
+					  sizeof(uint16_t));
+	if (!lltype)
+		return -ENOMEM;
+
+	eth_p_ip = imr_object_alloc_imm32(htons(ETH_P_IP));
+	if (!eth_p_ip) {
+		imr_object_free(lltype);
+		return -ENOMEM;
+	}
+
+	relop = imr_object_alloc_relational(IMR_RELOP_NE, lltype, eth_p_ip);
+	if (!relop) {
+		imr_object_free(eth_p_ip);
+		imr_object_free(lltype);
+		return -ENOMEM;
+	}
+
+	ret = imr_state_add_obj(state, relop);
+	if (ret == 0) {
+		ret = imr_state_add_obj(state, imr_object_alloc_verdict(IMR_VERDICT_PASS));
+		if (ret == 0)
+			return imr_state_rule_end(state);
+	}
+
+	return ret;
+}
+
 static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
 			  struct bpfilter_table *tbl)
 {
 	unsigned int total_size = req->size;
 	struct bpfilter_table_info *info;
 	struct bpfilter_ipt_entry *ent;
-	struct bpfilter_gen_ctx ctx;
+	struct imr_state *imr_state;
 	unsigned int off, sents = 0, ents = 0;
+	int last_iface = 0;
 	int ret;
 
-	ret = bpfilter_gen_init(&ctx);
-	if (ret < 0)
-		return ret;
-	ret = bpfilter_gen_prologue(&ctx);
-	if (ret < 0)
-		return ret;
 	info = bpfilter_ipv4_table_alloc(tbl, total_size);
 	if (!info)
 		return -ENOMEM;
@@ -167,15 +203,25 @@ static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
 		free(info);
 		return -EFAULT;
 	}
+	imr_state = imr_state_alloc();
 	base = &info->entries[0];
+	iptables_accept_non_ipv4(imr_state);
 	for (off = 0; off < total_size; off += ent->next_offset) {
 		struct bpfilter_standard_target *tgt;
+		int ifindex;
 		ent = base + off;
+		ifindex = if_nametoindex(ent->ip.in_iface);
+		if (!ifindex)
+			continue;
+		if (last_iface && last_iface != ifindex)
+			return -ENOTSUPP;
+		else
+			last_iface = ifindex;
 		ents++;
 		sents += ent->next_offset;
 		tgt = (void *) ent + ent->target_offset;
 		target_u2k(tgt);
-		ret = bpfilter_gen_append(&ctx, &ent->ip, tgt->verdict);
+		ret = iptables_gen_append(imr_state, &ent->ip, tgt->verdict);
                 if (ret < 0)
                         goto err;
 	}
@@ -183,16 +229,11 @@ static int do_set_replace(struct bpfilter_ipt_replace *req, void *base,
 	info->size = sents;
 	memcpy(info->hook_entry, req->hook_entry, sizeof(info->hook_entry));
 	memcpy(info->underflow, req->underflow, sizeof(info->hook_entry));
-	ret = bpfilter_gen_epilogue(&ctx);
-	if (ret < 0)
-		goto err;
-	ret = bpfilter_gen_commit(&ctx);
-	if (ret < 0)
-		goto err;
+	ret = imr_do_bpf(imr_state);
+	imr_state_free(imr_state);
 	free(tbl->info);
 	tbl->info = info;
-	bpfilter_gen_destroy(&ctx);
-	dprintf(debug_fd, "offloaded %u\n", ctx.offloaded);
+
 	return ret;
 err:
 	free(info);
-- 
2.16.1


* Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
  2018-03-04 19:40 [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Florian Westphal
                   ` (2 preceding siblings ...)
  2018-03-04 19:40 ` [RFC,POC 3/3] bpfilter: switch bpfilter to iptables->IMR translation Florian Westphal
@ 2018-03-06 14:22 ` Daniel Borkmann
  2018-03-06 16:42   ` Florian Westphal
  3 siblings, 1 reply; 12+ messages in thread
From: Daniel Borkmann @ 2018-03-06 14:22 UTC
  To: Florian Westphal, netdev; +Cc: ast, pablo

On 03/04/2018 08:40 PM, Florian Westphal wrote:
> These patches, which go on top of the 'bpfilter' RFC patches,
> demonstrate an nftables to ebpf translation (done in userspace).
> In order to not duplicate the ebpf code generation efforts, the rules
> 
> iptables -i lo -d 127.0.0.2 -j DROP
> and
> nft add rule ip filter input ip daddr 127.0.0.2 drop
> 
> are first translated to a common intermediate representation, and then
> to ebpf, which attaches resulting prog to the XDP hook.
> 
> The IMR representation is identical in both cases, so both
> rules result in the same ebpf program.
> 
> The IMR currently assumes that translation will always be to ebpf.
> As per previous discussion it doesn't consider other targets, so
> for instance IMR pseudo-registers map 1:1 to ebpf ones.
> 
> The IMR is also supposed to be generic enough to make it easy to convert
> 'frontend' formats (iptables rule blob, nftables netlink) to it, and
> also extend it to cover ip rule, ovs or any other inputs in the future
> without the need for major changes to the IMR.
> 
> The IMR currently implements the following basic operations:
>  - relational (equal, not equal)
>  - immediates (32 and 64bit constants)
>  - payload with relative addressing (mac, network, transport header)
>  - verdict (pass, drop, next rule)
> 
> It's still at an early stage, but I think it's good enough as
> a proof-of-concept.

Thanks a lot for working on this! Overall I like the PoC and the
underlying idea of it! I think the design of such an IMR would indeed
be the critical part, in that it needs to be flexible enough to cover
both front ends well without having to make compromises to either of
them. The same goes for optimization passes, e.g. when we know that
two successive rules match on TCP header bits, we can reuse the
register that loaded/parsed them previously at that point. Similarly
for maps, where the lookup value would need to
propagate through the linked imr objects. Do you see such optimizations,
or propagation of state in general, as a direct part of the IMR, or rather
somewhat hidden in the IMR layer when doing the IMR to BPF 'jit' phase?

Which other parts do you think would be needed for the IMR aside
from the above basic operations? ALU ops, JMPs for the relational ops?
I think it would be good to have clear semantics in terms of what
it would eventually abstract away from raw BPF when sitting on top
of it; potentially these could be details on packet access or
interaction with helpers or other BPF features, such that it's BPF
prog type independent at this stage. Another point could be that,
given the priv/cap level of the uapi request, there could also be
different BPF code gen backends that implement against the IMR;
e.g. when the request comes out of a userns, it has feature
constraints, such as having to use the bpf_skb_{load,store}_bytes()
helpers for packet access instead of direct packet access, or not
being able to use BPF to BPF calls, etc; wdyt?

> Known differences between nftjit.ko and bpfilter.ko:
> nftjit.ko currently doesn't run transparently, but that's only
> because I wanted to focus on the IMR and get the POC out the door.
> 
> It should be possible to get it transparent via the bpfilter.ko approach.
> 
> Next steps for the IMR could be the addition of binary operations for
> prefixes ("-d 192.168.0.1/24"); these are also needed e.g. for tcp flag
> matching (-p tcp --syn in iptables) and so on.
> 
> I'd also be interested in whether XDP is seen as an appropriate
> target hook.  AFAICS the XDP and the nftables ingress hook are similar
> enough to consider just (re)using the XDP hook to jit the nftables ingress
> hook.  The translator could check if the hook is unused, and return
> early if some other program is already attached.
> 
> Comments welcome, especially wrt. IMR concept and what might be
> next step(s) in moving forward.
> 
> The patches are also available via git at
> https://git.breakpoint.cc/cgit/fw/net-next.git/log/?h=bpfilter7 .

Thanks,
Daniel


* Re: [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer
  2018-03-06 14:22 ` [RFC,POC] iptables/nftables to ebpf/xdp via common intermediate layer Daniel Borkmann
@ 2018-03-06 16:42   ` Florian Westphal
  2018-03-06 17:24     ` Edward Cree
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2018-03-06 16:42 UTC
  To: Daniel Borkmann; +Cc: Florian Westphal, netdev, ast, pablo

Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 03/04/2018 08:40 PM, Florian Westphal wrote:
> > It's still at an early stage, but I think it's good enough as
> > a proof-of-concept.
> 
> be the critical part, in that it needs to be flexible enough to cover
> both front ends well without having to make compromises to either of
> them.

Yes, thats the idea.

> The same goes for optimization passes, e.g. when we know that
> two successive rules match on TCP header bits, we can reuse the
> register that loaded/parsed them previously at that point.

Yes, I already thought about this.

First step would be to simply track the last accessed header base
(mac, nh, th) in the imr state so we can reuse the offset if possible.
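
Roughly like this (sketch only; 'last_base' would be a new field in
struct imr_state, reset at each rule begin):

	/* in imr_jit_obj_payload(): rebase BPF_REG_1 only when needed */
	if (state->last_base != o->payload.base) {
		/* ... emit the BPF_REG_1 adjustments done today ... */
		state->last_base = o->payload.base;
	}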

> Similarly for maps, where the lookup value would need to
> propagate through the linked imr objects. Do you see such optimizations,
> or propagation of state in general, as a direct part of the IMR, or rather
> somewhat hidden in the IMR layer when doing the IMR to BPF 'jit' phase?

Direct part.  I would only say that it should be hidden in the sense that
frontends don't need to do anything here.
I would do such propagation in the IMR before generating code.

The IMR has enough info to figure out:
1. the maximum number of (ebpf) registers needed simultaneously
2. how many payload requests relative to (mac/network/tcp header) there are
   (both in total and in relation to rule count).

So it could e.g. decide to set aside 2 registers to contain the
th/nh offsets to have them ready at all times, while not covering the mac
header at all because it's not used anywhere.

Likewise, we could even avoid for instance extension header parsing,
as the IMR would see that there are no 'IPv6+transport header' rules.

> Which other parts do you think would be needed for the IMR aside
> from the above basic operations? ALU ops, JMPs for the relational ops?

We'll need at least the following IMR features/objects (pseudo instructions):

- relops: gt, ge, lt etc.
- all bitops (shift, xor, and, etc)
- payload: allow stores instead of loads (this includes csum fixup)
- verdict: jumps to other chains, nfqueue verdict
- meta: (accesses to in/output interface names, skb length, and the like).
- register allocation for ipv6 addresses, interface names, etc (anything
larger than 64 bit).
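
For the last point, the existing imr_register_alloc() already takes the
length but ignores it so far; a sketch of how it could hand out a
contiguous register block instead:

	static int imr_register_alloc(struct imr_state *s, uint32_t len)
	{
		uint8_t needed = (len + sizeof(uint64_t) - 1) / sizeof(uint64_t);
		uint8_t reg = s->regcount;

		if (s->regcount + needed > IMR_REG_COUNT)
			return -1;

		s->regcount += needed;
		return reg;
	}

imr_register_release() would then have to take the width as well.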

I would also add 'highlevel' objects that are themselves translated into
basic operations.  The most obvious example
is 'fetch 4 bytes located x bytes into the transport header'.

The frontend should not need to bother with ipv4 details, such as ip
options.  Instead the IMR should take care of this and generate the needed
instructions to fetch iph->ihl and figure out the correct transport
header offset.

It could do so by adding the needed part of the rule (fetch byte containing
ihl, mask/shift result, store the offset, etc.) internally.
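
At the ebpf level that would boil down to something like this (sketch,
using the insn macros from linux/filter.h; assumes BPF_REG_1 holds the
network header pointer as in the PoC and BPF_REG_4 is free):

	BPF_LDX_MEM(BPF_B, BPF_REG_4, BPF_REG_1, 0),	/* version|ihl byte */
	BPF_ALU32_IMM(BPF_AND, BPF_REG_4, 0x0f),	/* keep ihl */
	BPF_ALU32_IMM(BPF_LSH, BPF_REG_4, 2),		/* ihl counts 32bit words */
	BPF_ALU64_REG(BPF_ADD, BPF_REG_4, BPF_REG_1),	/* -> transport header */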

I would also add:
- limit (to jit both xt_limit and nft limit)
- statistic (random sampling)

as both nft and iptables have those features and I think they're
pretty much the same, so they can be handled by common code.
Same for other extensions, but those have side effects and I don't
know yet how to handle that (log, counter and the like).

Same for set lookups.

In the iptables case, they will only be 'address/network/interface is (not) in set',
i.e. only match/no match, and set manipulations (SET target).

In the nftables case it can also be something like 'retrieve the value
belonging to a key' (dreg = lookup(map, sreg, sreg_len)), and then the
dreg value gets used in other ways.

I think that making the IMR also handle the nftables case rather than
just the iptables one has the advantage that the IMR might be able to
optimize repetitive iptables rules (and nft rules that do not use a
jump map even if they could).

iptables .. -i eth0 -j ETH0_RULES
iptables .. -i eth1 -j ETH1_RULES
..
iptables .. -i ethX -j ETHX_RULES

could be transparently thrown into a map, with the jump target retrieved
from it based on the iifname...
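
The generated construct could look roughly like the following
restricted-C sketch (samples/bpf style; the map and section names are
made up, and a real version would have to handle ifindexes beyond
max_entries):

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") iif_jump = {
	.type        = BPF_MAP_TYPE_PROG_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),
	.max_entries = 64,
};

SEC("xdp")
int dispatch(struct xdp_md *ctx)
{
	__u32 iif = ctx->ingress_ifindex;

	/* jump to the per-interface chain, if one was loaded */
	bpf_tail_call(ctx, &iif_jump, iif);

	return XDP_PASS;	/* no chain for this interface */
}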

The IMR would also need to be able to create/allocate such sets/maps
on its own.

Do you see this as something that has to be decided now?
Ideally, the feeling about the IMR would be that we can probably handle
this later by extending it, rather than having to throw it away and do a
rewrite :)

> I think it would be good to have clear semantics in terms of what
> it would eventually abstract away from raw BPF when sitting on top
> of it; potentially these could be details on packet access or

[..]

Yes, it would be good if the IMR could e.g. merge adjacent loads:

load 4 bytes from network header at offset 12  (relop, equal) imm 0x1
load 4 bytes from network header at offset 16  (relop, equal) imm 0x2

 ..  would be condensed to ...

load 8 bytes from network header at offset 12  (relop, equal) imm 0x100000002

... before generating ebpf code.

I don't think this should be part of the frontend (too low-level a task
for it), and I don't see why it should happen when generating code (too
high-level a task for that stage).

> interaction with helpers or other BPF features, such that it's BPF prog
> type independent at this stage. Another point could be that, given
> the priv/cap level of the uapi request, there could also be different
> BPF code gen backends that implement against the IMR, e.g. when the
> request comes out of a userns it has feature constraints, in terms
> of e.g. having to use bpf_skb_{load,store}_bytes() helpers for packet
> access instead of direct packet access, or not being able to use BPF to
> BPF calls, etc; wdyt?

I hadn't thought about this at all; I think it would be useful.

When initializing the IMR state we could provide feature flag(s) that could
either instantly reject something (e.g. the context only allows read-only
access to packets but the IMR was just told to mangle payload), or that we
could consider during the code-gen phase (I don't understand ebpf well
enough to say whether that's useful or not, but if it is, it would be good
if the IMR could do it).
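
E.g. capability flags passed at IMR init time (a sketch; the flag names
are invented):

enum imr_features {
	IMR_F_PKT_WRITE    = 1 << 0,	/* payload mangling allowed */
	IMR_F_DIRECT_PKT   = 1 << 1,	/* else bpf_skb_load_bytes() */
	IMR_F_BPF_CALLS    = 1 << 2,	/* bpf-to-bpf calls usable */
};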

I did not intend the IMR as a 'public' library -- it would only be for
use by in-tree umh translators.

So anything that needs changing at a later stage should be doable,
but obviously the less rework it needs later on, the better...

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer
  2018-03-06 16:42   ` Florian Westphal
@ 2018-03-06 17:24     ` Edward Cree
  2018-03-06 18:03       ` Florian Westphal
  0 siblings, 1 reply; 12+ messages in thread
From: Edward Cree @ 2018-03-06 17:24 UTC (permalink / raw)
  To: Florian Westphal, Daniel Borkmann; +Cc: netdev, ast, pablo

On 06/03/18 16:42, Florian Westphal wrote:
> I would also add 'highlevel' objects that are themselves translated into
> basic operations.  The most obvious example
> is 'fetch 4 bytes at offset x into the transport header'.
>
> Frontends should not need to bother with ipv4 details such as ip
> options.  Instead, the IMR should take care of this and generate the
> instructions needed to fetch iph->ihl and figure out the correct
> transport header offset.
Presumably then for this the IMR regs will cease to have any connection to
 BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
 limited to 10 regs like BPF)?  Then register allocation all happens in
 the IMR->BPF conversion (even for things 64 bits or smaller).

I wonder how sophisticated we should be about register allocation; whether
 we should go the whole hog with graph-colouring algorithms or linear
 scan, or just do something naïve like an LRU.
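
A naive LRU could be as small as this (a sketch only; names invented):

#define NREGS 10	/* BPF register budget */

struct lru_alloc {
	int value[NREGS];	/* IMR value id held, -1 if free */
	unsigned long stamp[NREGS];
	unsigned long clock;
};

static int lru_get_reg(struct lru_alloc *a, int value_id)
{
	int i, victim = 0;

	for (i = 0; i < NREGS; i++)
		if (a->value[i] == value_id) {	/* already resident */
			a->stamp[i] = ++a->clock;
			return i;
		}
	for (i = 0; i < NREGS; i++)
		if (a->value[i] < 0) {		/* free register */
			victim = i;
			goto claim;
		}
	for (i = 1; i < NREGS; i++)		/* else evict the LRU one */
		if (a->stamp[i] < a->stamp[victim])
			victim = i;
	/* caller must spill or rematerialise the evicted value here */
claim:
	a->value[victim] = value_id;
	a->stamp[victim] = ++a->clock;
	return victim;
}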

Relatedly, should we spill values to the stack when we run out of
 registers, or should we just rely on being able to rematerialise them
 from parsing the packet again?

-Ed

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer
  2018-03-06 17:24     ` Edward Cree
@ 2018-03-06 18:03       ` Florian Westphal
  2018-03-06 18:18         ` Edward Cree
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2018-03-06 18:03 UTC (permalink / raw)
  To: Edward Cree; +Cc: Florian Westphal, Daniel Borkmann, netdev, ast, pablo

Edward Cree <ecree@solarflare.com> wrote:
> On 06/03/18 16:42, Florian Westphal wrote:
> > I would also add 'highlevel' objects that are themselves translated into
> > basic operations.  The most obvious example
> > is 'fetch 4 bytes at offset x into the transport header'.
> >
> > Frontends should not need to bother with ipv4 details such as ip
> > options.  Instead, the IMR should take care of this and generate the
> > instructions needed to fetch iph->ihl and figure out the correct
> > transport header offset.
> Presumably then for this the IMR regs will cease to have any connection to
>  BPF regs and will simply be (SSA?) r0, r1, ... as far as needed (not
>  limited to 10 regs like BPF)?  Then register allocation all happens in
>  the IMR->BPF conversion (even for things 64 bits or smaller).
> 
> I wonder how sophisticated we should be about register allocation; whether
>  we should go the whole hog with graph-colouring algorithms or linear
>  scan, or just do something naïve like an LRU.
> 
> Relatedly, should we spill values to the stack when we run out of
>  registers, or should we just rely on being able to rematerialise them
>  from parsing the packet again?

I don't know.  I suspect we should go for a naive algorithm only,
but I would defer such a decision to Alexei/Daniel.

F.e. I don't know if using llvm is a good idea or not; I did not
intend to turn the proposed IMR into a full-blown compiler in any case.
I only want to avoid code duplication for the iptables/nftables -> ebpf
translators.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer
  2018-03-06 18:03       ` Florian Westphal
@ 2018-03-06 18:18         ` Edward Cree
  2018-03-15 16:13           ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Edward Cree @ 2018-03-06 18:18 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Daniel Borkmann, netdev, ast, pablo

On 06/03/18 18:03, Florian Westphal wrote:
> I don't know.  I suspect we should go for a naive algorithm only,
> but I would defer such a decision to Alexei/Daniel.
>
> F.e. I don't know if using llvm is a good idea or not;
Yeah, I wondered about that too.  I think it's probably not a good idea,
 since it's a big project with a complex interface and the tail would
 likely wag the dog.

> I did not
> intend to turn the proposed IMR into a full-blown compiler in any case;
> I only want to avoid code duplication for the iptables/nftables -> ebpf
> translators.
Well, I think anything that calling code does by hand will entail limits
 on what the imr->ebpf step can do.  E.g. if imr uses registers by name
 then the backend can't expand any highlevel constructs that need to use
 scratch registers, unless some regs are reserved for the backend and
 can't be used by the imr.

Ultimately I think the imr will have to become practically a compiler
 back-end, because any jobs it punts to the caller will necessarily be
 duplicated between iptables and nftables.

-Ed

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer
  2018-03-06 18:18         ` Edward Cree
@ 2018-03-15 16:13           ` Alexei Starovoitov
  2018-03-15 17:00             ` Florian Westphal
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2018-03-15 16:13 UTC (permalink / raw)
  To: Edward Cree
  Cc: Florian Westphal, Daniel Borkmann, netdev, ast, pablo, David S. Miller

On Tue, Mar 06, 2018 at 06:18:04PM +0000, Edward Cree wrote:
> On 06/03/18 18:03, Florian Westphal wrote:
> > I don't know.  I suspect we should go for a naive algorithm only,
> > but I would defer such a decision to Alexei/Daniel.
> >
> > F.e. I don't know if using llvm is a good idea or not;
> Yeah, I wondered about that too.  I think it's probably not a good idea,
>  since it's a big project with a complex interface and the tail would
>  likely wag the dog.
> 
> > I did not
> > intend to turn the proposed IMR into a full-blown compiler in any case;
> > I only want to avoid code duplication for the iptables/nftables -> ebpf
> > translators.
> Well, I think anything that calling code does by hand will entail limits
>  on what the imr->ebpf step can do.  E.g. if imr uses registers by name
>  then the backend can't expand any highlevel constructs that need to use
>  scratch registers, unless some regs are reserved for the backend and
>  can't be used by the imr.
> 
> Ultimately I think the imr will have to become practically a compiler
>  back-end, because any jobs it punts to the caller will necessarily be
>  duplicated between iptables and nftables.

A few years back we had a prototype of an nft->bpf converter.
It read nft netlink, generated really dumb C code,
and fed that into clang/llvm. Performance of the generated code was
surprisingly good. llvm did particularly well at combining
the same loads from multiple rules into a single load and
at collapsing all the 'if' conditions into a few.
So the full-blown compiler approach is known to work, but ...
we cannot use it here.
We cannot have the bpfilter umh module depend on llvm.
The converter has to be fully contained within the kernel source tree.
It's probably fine to start simple with a compiler-like thingy and
later improve it with sophisticated optimizations and a fancy
register allocator, but I suspect we can achieve better performance
and easier-to-maintain code if we specialize the converters.

The way this IMR is defined today looks pretty much like nft, and
it feels a bit too low-level for what the iptables conversion would need.
Converting iptables to this IMR is like going back to the old days of gcc,
when the C/C++ front-ends generated RTL and all optimizations
were done on the RTL representation. Eventually gcc had to adopt
gimple (a variant of SSA) to do optimizations that were
very hard to do with RTL.
I suspect that if we go iptables->nft (same as this IMR), history
will repeat itself. It will work, but it will be difficult to apply
interesting optimizations.

Similarly, I don't quite see why this particular IMR is needed
for nft->bpf either.
I think it would be simpler to have user-space-only extensions
and opcodes added to bpf for the purpose of the translation.
For example, there is no bpf instruction called 'load from IP header',
but we can make one. Just extend extended bpf with an instruction
like this, and in the first pass do a full conversion of nft
directly into this 'extended extended bpf'.
Then do a second pass lowering the 'extended extended bpf' into the eBPF
that the kernel actually understands.
That would be similar to what llvm does when it lowers a higher-level
representation into a lower one.
The higher-level IR doesn't need to be completely different from the
lower-level IR; often the lower-level IR is a subset of the higher-level one.
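
E.g. a made-up pseudo opcode plus a lowering pass, very roughly (all
names and the encoding here are invented for illustration):

/* first pass emits a stream that may contain pseudo opcodes */
enum { XBPF_LD_IPHDR = 0xe0 };	/* userspace-only, never hits the kernel */

struct xbpf_insn {
	unsigned char  op;
	unsigned char  dreg;
	unsigned short off;	/* offset within the ip header */
};

/* second pass: lower pseudo opcodes into real eBPF */
static int lower_insn(const struct xbpf_insn *in)
{
	switch (in->op) {
	case XBPF_LD_IPHDR:
		/* expand: compute the network header offset,
		 * bounds-check, then emit the actual load */
		return 0;
	default:
		/* already a real eBPF opcode: copy through */
		return 0;
	}
}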

For iptables->bpf I think it's easier to construct a decision tree
out of all the iptables rules, optimize it as a tree, and then
convert it into a set of map lookups glued together with a bit of bpf code.
Converting iptables rules one by one isn't going to scale.
That was fine for a prototype, but now we're talking production.
If there are 1k iptables rules and we generate just 4 bpf insns
for every rule, we already hit the 4k insn limit.
We can easily bump the limit, but even if it's 128k insns,
it's still a limit. All iptables rules should be converted together.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer
  2018-03-15 16:13           ` Alexei Starovoitov
@ 2018-03-15 17:00             ` Florian Westphal
  2018-03-15 20:26               ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2018-03-15 17:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Edward Cree, Florian Westphal, Daniel Borkmann, netdev, ast,
	pablo, David S. Miller

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> The way this IMR is defined today looks pretty much like nft, and
> it feels a bit too low-level for what the iptables conversion would need.

It wasn't so much about a specific IMR as about avoiding code duplication
between the nft and iptables translators.

> I think it would be simpler to have user-space-only extensions
> and opcodes added to bpf for the purpose of the translation.
> For example, there is no bpf instruction called 'load from IP header',
> but we can make one. Just extend extended bpf with an instruction
> like this, and in the first pass do a full conversion of nft
> directly into this 'extended extended bpf'.

I don't want to duplicate any ebpf conversion (and optimisations)
in the nft part.

If nft can be translated to this 'extended extended bpf', and that
then generates the bpf code from the nft input, all is good.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer
  2018-03-15 17:00             ` Florian Westphal
@ 2018-03-15 20:26               ` Alexei Starovoitov
  0 siblings, 0 replies; 12+ messages in thread
From: Alexei Starovoitov @ 2018-03-15 20:26 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Edward Cree, Daniel Borkmann, netdev, ast, pablo, David S. Miller

On Thu, Mar 15, 2018 at 06:00:22PM +0100, Florian Westphal wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > The way this IMR is defined today looks pretty much like nft, and
> > it feels a bit too low-level for what the iptables conversion would need.
> 
> It wasn't so much about a specific IMR as about avoiding code duplication
> between the nft and iptables translators.
> 
> > I think it would be simpler to have user-space-only extensions
> > and opcodes added to bpf for the purpose of the translation.
> > For example, there is no bpf instruction called 'load from IP header',
> > but we can make one. Just extend extended bpf with an instruction
> > like this, and in the first pass do a full conversion of nft
> > directly into this 'extended extended bpf'.
> 
> I don't want to duplicate any ebpf conversion (and optimisations)
> in the nft part.
> 
> If nft can be translated to this 'extended extended bpf', and that
> then generates the bpf code from the nft input, all is good.

If possible it's great to avoid duplication, but it shouldn't be
such an ultimate goal that it cripples the iptables->bpf conversion
just to reuse the nft->bpf bits.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread

Thread overview: 12+ messages
2018-03-04 19:40 [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer Florian Westphal
2018-03-04 19:40 ` [RFC,POC 1/3] bpfilter: add experimental IMR bpf translator Florian Westphal
2018-03-04 19:40 ` [RFC,POC 2/3] bpfilter: add nftables jit proof-of-concept Florian Westphal
2018-03-04 19:40 ` [RFC,POC 3/3] bpfilter: switch bpfilter to iptables->IMR translation Florian Westphal
2018-03-06 14:22 ` [RFC,POC] iptables/nftables to epbf/xdp via common intermediate layer Daniel Borkmann
2018-03-06 16:42   ` Florian Westphal
2018-03-06 17:24     ` Edward Cree
2018-03-06 18:03       ` Florian Westphal
2018-03-06 18:18         ` Edward Cree
2018-03-15 16:13           ` Alexei Starovoitov
2018-03-15 17:00             ` Florian Westphal
2018-03-15 20:26               ` Alexei Starovoitov
