All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC v2 PATCH 0/4] eBPF and struct scatterlist
@ 2018-06-19 18:00 Tushar Dave
  2018-06-19 18:00 ` [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev,
	acme, sowmini.varadhan

This follows up on https://patchwork.ozlabs.org/cover/927050/
where the review feedback was to use bpf_skb_load_bytes() to deal with
linear and non-linear skbs. While that feedback is valid and correct,
the motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run both over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module)

This RFC (call it v2) uses exiting socket filter infrastructure and
extend it with new eBPF program type that deals with struct scatterlist.
For RDS, the integrated approach treats the scatterlist as the common
denominator, and allows the application to write a filter for processing
a scatterlist.


Details:
Patch 1 adds new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which
uses the existing socket filter infrastructure for bpf program attach
and load. eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER receives
struct scatterlist as bpf context contrast to
BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF
program type allow socket filter to run on packet data that is in form
form of struct scatterlist.

Patch 2 adds functionality to run BPF_PROG_TYPE_SOCKET_SG_FILTER socket
filter program. A bpf helpers bpf_sg_next() is also added so users can
retrieve sg elements from scatterlist.

Patch 3 adds socket filter eBPF sample program that uses patch 1 and
patch 2. The sample program opens an rds socket, attach ebpf program
(socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to rds socket and uses
bpf_sg_next helper to look into sg. For a test, current ebpf program
only prints first few bytes from each elements of sg list.

Finally, patch 4 allows rds_recv_incoming to invoke socket filter
program which deals with scatterlist.

Thanks.

-Tushar

Tushar Dave (4):
  eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  ebpf: Add sg_filter_run and sg helper
  ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  rds: invoke socket sg filter attached to rds socket

 include/linux/bpf_types.h                 |   1 +
 include/linux/filter.h                    |  10 +
 include/uapi/linux/bpf.h                  |  17 +-
 kernel/bpf/syscall.c                      |   1 +
 kernel/bpf/verifier.c                     |   1 +
 net/core/filter.c                         | 149 ++++++++++++-
 net/rds/ib.c                              |   1 +
 net/rds/ib.h                              |   1 +
 net/rds/ib_recv.c                         |  12 ++
 net/rds/rds.h                             |   2 +
 net/rds/recv.c                            |  16 ++
 net/rds/tcp.c                             |   2 +
 net/rds/tcp.h                             |   2 +
 net/rds/tcp_recv.c                        |  38 ++++
 samples/bpf/Makefile                      |   3 +
 samples/bpf/bpf_load.c                    |  11 +-
 samples/bpf/rds_filter_kern.c             |  78 +++++++
 samples/bpf/rds_filter_user.c             | 339 ++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c                  |   1 +
 tools/include/uapi/linux/bpf.h            |  17 +-
 tools/lib/bpf/libbpf.c                    |   3 +
 tools/lib/bpf/libbpf.h                    |   2 +
 tools/testing/selftests/bpf/bpf_helpers.h |   3 +
 23 files changed, 703 insertions(+), 7 deletions(-)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-06-19 18:00 [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
@ 2018-06-19 18:00 ` Tushar Dave
  2018-06-29  7:25   ` Daniel Borkmann
  2018-06-29  8:27   ` Daniel Borkmann
  2018-06-19 18:00 ` [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper Tushar Dave
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev,
	acme, sowmini.varadhan

Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
existing socket filter infrastructure for bpf program attach and load.
SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
contrast to SOCKET_FILTER which deals with struct skb. This is useful
for kernel entities that don't have skb to represent packet data but
want to run eBPF socket filter on packet data that is in form of struct
scatterlist e.g. IB/RDMA

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/bpf_types.h      |  1 +
 include/linux/filter.h         |  8 +++++
 include/uapi/linux/bpf.h       |  7 ++++
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  1 +
 net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
 samples/bpf/bpf_load.c         | 11 ++++--
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  7 ++++
 tools/lib/bpf/libbpf.c         |  3 ++
 tools/lib/bpf/libbpf.h         |  2 ++
 11 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index c5700c2..f8b4b56 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 45fc0f5..71618b1 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -517,6 +517,14 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct bpf_scatterlist {
+	struct scatterlist *sg;
+	void *start;
+	void *end;
+	int cur_sg;
+	int num_sg;
+};
+
 struct sk_msg_buff {
 	void *data;
 	void *data_end;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6..ef0a7b6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -144,6 +144,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
@@ -2358,6 +2359,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+/* use accessible scatterlist */
+struct sg_filter_md {
+	void *data; /* sg_virt(sg) */
+	void *data_end; /* sg_virt(sg) + sg->length */
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0fa2062..74193a8 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1300,6 +1300,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6403b5..a00d3eb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1320,6 +1320,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		if (meta)
 			return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 3d9ba7e..8f67942 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1130,7 +1130,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
 		bpf_prog_put(prog);
 	} else {
 		bpf_release_orig_filter(prog);
@@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+	struct bpf_prog *prog;
+
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog))
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+	return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4710,6 +4717,15 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
@@ -5037,6 +5053,30 @@ static bool sk_filter_is_valid_access(int off, int size,
 	return bpf_skb_is_valid_access(off, size, type, prog, info);
 }
 
+static bool socksg_filter_is_valid_access(int off, int size,
+					  enum bpf_access_type type,
+					  const struct bpf_prog *prog,
+					  struct bpf_insn_access_aux *info)
+{
+	switch (off) {
+	case offsetof(struct sg_filter_md, data):
+		info->reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct sg_filter_md, data_end):
+		info->reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	if (off < 0 || off >= sizeof(struct sg_filter_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != sizeof(__u64))
+		return false;
+
+	return true;
+}
+
 static bool lwt_is_valid_access(int off, int size,
 				enum bpf_access_type type,
 				const struct bpf_prog *prog,
@@ -6516,6 +6556,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sg_filter_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_scatterlist, start),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_scatterlist, start));
+		break;
+	case offsetof(struct sg_filter_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_scatterlist, end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_scatterlist, end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 				     const struct bpf_insn *si,
 				     struct bpf_insn *insn_buf,
@@ -6654,6 +6718,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+	.get_func_proto         = socksg_filter_func_proto,
+	.is_valid_access        = socksg_filter_is_valid_access,
+	.convert_ctx_access     = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.get_func_proto		= tc_cls_act_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 89161c9..15c355e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_socksg) {
+		prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -119,8 +123,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+		if (is_socket || is_socksg)
 			event += 6;
 		else
 			event += 7;
@@ -624,7 +628,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "socksg", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index a4f4352..06b2fef 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -72,6 +72,7 @@
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e0b0678..c87ae16 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -144,6 +144,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
@@ -2358,6 +2359,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+/* use accessible scatterlist */
+struct sg_filter_md {
+	void *data; /* sg_virt(sg) */
+	void *data_end; /* sg_virt(sg) + sg->length */
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a1e96b5..7628278 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1463,6 +1463,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SK_MSG:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -1998,6 +1999,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -2048,6 +2050,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+	BPF_PROG_SEC("socksg",          BPF_PROG_TYPE_SOCKET_SG_FILTER),
 };
 
 #undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 0997653..3be165b 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -195,6 +195,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -204,6 +205,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 bool bpf_program__is_sched_act(struct bpf_program *prog);
 bool bpf_program__is_xdp(struct bpf_program *prog);
 bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
  2018-06-19 18:00 [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
  2018-06-19 18:00 ` [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
@ 2018-06-19 18:00 ` Tushar Dave
  2018-06-29  8:18   ` Daniel Borkmann
  2018-06-29  8:32   ` Daniel Borkmann
  2018-06-19 18:00 ` [RFC v2 PATCH 3/4] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev,
	acme, sowmini.varadhan

When sg_filter_run() is invoked it runs the attached eBPF
SOCKET_SG_FILTER program which deals with struct scatterlist.

In addition, this patch also adds bpf_sg_next helper function that
allows users to retrieve the next sg element from sg list.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/filter.h                    |  2 +
 include/uapi/linux/bpf.h                  | 10 ++++-
 net/core/filter.c                         | 72 +++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h            | 10 ++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 5 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 71618b1..d176402 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1072,4 +1072,6 @@ struct bpf_sock_ops_kern {
 					 */
 };
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ef0a7b6..036432b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
  * 	Return
  * 		A 64-bit integer containing the current cgroup id based
  * 		on the cgroup within which the current task is running.
+ *
+ * int bpf_sg_next(struct bpf_scatterlist *sg)
+ *	Description
+ *		This helper allows user to retrieve next sg element from
+ *		sg list.
+ *	Return
+ *		Returns 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
 	FN(rc_repeat),			\
 	FN(rc_keydown),			\
 	FN(skb_cgroup_id),		\
-	FN(get_current_cgroup_id),
+	FN(get_current_cgroup_id),	\
+	FN(sg_next),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 8f67942..702ff5b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,53 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+	struct sk_filter *filter;
+	int err;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct bpf_scatterlist bpfsg;
+		int num_sg;
+
+		if (!sg) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		num_sg = sg_nents(sg);
+		if (num_sg <= 0) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		/* We store a reference  to the sg list so it can later used by
+		 * eBPF helpers to retrieve the next sg element.
+		 */
+		bpfsg.num_sg = num_sg;
+		bpfsg.cur_sg = 0;
+		bpfsg.sg = sg;
+
+		/* For the first sg element, we store the pkt access pointers
+		 * into start and end so eBPF program can have pkt access using
+		 * data and data_end. The pkt access for subsequent element of
+		 * sg list is possible when eBPF program invokes bpf_sg_next
+		 * which takes care of setting start and end to the correct sg
+		 * element.
+		 */
+		bpfsg.start = sg_virt(sg);
+		bpfsg.end = bpfsg.start + sg->length;
+		BPF_PROG_RUN(filter->prog, &bpfsg);
+	}
+out:
+	rcu_read_unlock();
+
+	return err;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
@@ -3753,6 +3800,29 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_sg_next, struct bpf_scatterlist *, bpfsg)
+{
+	struct scatterlist *sg = bpfsg->sg;
+	int cur_sg = bpfsg->cur_sg;
+
+	cur_sg++;
+	if (cur_sg >= bpfsg->num_sg)
+		return -ENODATA;
+
+	bpfsg->cur_sg = cur_sg;
+	bpfsg->start = sg_virt(&sg[cur_sg]);
+	bpfsg->end = bpfsg->start + sg[cur_sg].length;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_sg_next_proto = {
+	.func		= bpf_sg_next,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
 BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	   int, level, int, optname, char *, optval, int, optlen)
 {
@@ -4720,6 +4790,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
+	case BPF_FUNC_sg_next:
+		return &bpf_sg_next_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c87ae16..a298498 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
  * 	Return
  * 		A 64-bit integer containing the current cgroup id based
  * 		on the cgroup within which the current task is running.
+ *
+ * int bpf_sg_next(struct bpf_scatterlist *sg)
+ *	Description
+ *		This helper allows user to retrieve next sg element from
+ *		sg list.
+ *	Return
+ *		Returns 0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
 	FN(rc_repeat),			\
 	FN(rc_keydown),			\
 	FN(skb_cgroup_id),		\
-	FN(get_current_cgroup_id),
+	FN(get_current_cgroup_id),	\
+	FN(sg_next),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index f2f28b6..1997ba2 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -133,6 +133,9 @@ static int (*bpf_rc_keydown)(void *ctx, unsigned int protocol,
 	(void *) BPF_FUNC_rc_keydown;
 static unsigned long long (*bpf_get_current_cgroup_id)(void) =
 	(void *) BPF_FUNC_get_current_cgroup_id;
+static unsigned long long (*bpf_sg_next)(void *ctx) =
+	(void *) BPF_FUNC_sg_next;
+
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 PATCH 3/4] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-06-19 18:00 [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
  2018-06-19 18:00 ` [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
  2018-06-19 18:00 ` [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper Tushar Dave
@ 2018-06-19 18:00 ` Tushar Dave
  2018-06-19 18:00 ` [RFC v2 PATCH 4/4] rds: invoke socket sg filter attached to rds socket Tushar Dave
  2018-06-26  0:14 ` [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
  4 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev,
	acme, sowmini.varadhan

Add a sample program that shows how socksg program is used and attached
to socket filter. The kernel sample program deals with struct
scatterlist that is passed as bpf context.

When run in server mode, the sample RDS program opens PF_RDS socket,
attaches eBPF program to RDS socket which then uses bpf_sg_next
helper along with bpf tail calls to retrieve packet data contained in
struct scatterlist form.

To ease testing, RDS client functionality is also added so that users
can generate RDS packet.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

BPF program output:
[root@lab71]# cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [007] ..s.   525.994894: 0: Print first 6 bytes from sg element
          <idle>-0     [007] ..s.   525.994897: 0: First sg element:
          <idle>-0     [007] ..s.   525.994899: 0: 30 31 32
          <idle>-0     [007] ..s.   525.994900: 0: 33 34 35
          <idle>-0     [007] ..s.   525.994901: 0: next sg element:
          <idle>-0     [007] ..s.   525.994902: 0: a8 a9 aa
          <idle>-0     [007] ..s.   525.994903: 0: ab ac ad
          <idle>-0     [007] ..s.   525.994904: 0: next sg element:
          <idle>-0     [007] ..s.   525.994905: 0: 50 51 52
          <idle>-0     [007] ..s.   525.994905: 0: 53 54 55
          <idle>-0     [007] ..s.   525.994906: 0: next sg element:
          <idle>-0     [007] ..s.   525.994907: 0: f8 f9 fa
          <idle>-0     [007] ..s.   525.994907: 0: fb fc fd
          <idle>-0     [007] ..s.   525.994908: 0: next sg element:
          <idle>-0     [007] ..s.   525.994909: 0: a0 a1 a2
          <idle>-0     [007] ..s.   525.994909: 0: a3 a4 a5
          <idle>-0     [007] ..s.   525.994910: 0: next sg element:
          <idle>-0     [007] ..s.   525.994911: 0: 48 49 4a
          <idle>-0     [007] ..s.   525.994911: 0: 4b 4c 4d
          <idle>-0     [007] ..s.   525.994912: 0: no more sg element

Similary specifying '-t ib' will run this on IB link.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |   3 +
 samples/bpf/rds_filter_kern.c |  78 ++++++++++
 samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 420 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 1303af1..5de238b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -52,6 +52,7 @@ hostprogs-y += xdp_adjust_tail
 hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -107,6 +108,7 @@ xdp_adjust_tail-objs := xdp_adjust_tail_user.o
 xdpsock-objs := bpf_load.o xdpsock_user.o
 xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -163,6 +165,7 @@ always += xdp_adjust_tail_kern.o
 always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
+always += rds_filter_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..8fe3d3c
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define PROG(F) SEC("socksg/"__stringify(F)) int bpf_func_##F
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+struct bpf_map_def SEC("maps") jmp_table = {
+	.type = BPF_MAP_TYPE_PROG_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = 2,
+};
+
+#define SG1 1
+
+static inline void dump_sg(struct sg_filter_md *sg)
+{
+	void *data = (void *)(long) sg->data;
+	void *data_end = (void *)(long) sg->data_end;
+	unsigned char *d;
+
+	if (data + 8 > data_end)
+		return;
+
+	d = (unsigned char *)data;
+	bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+	bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+	return;
+
+}
+
+static void sg_dispatcher(struct sg_filter_md *sg)
+{
+	int ret;
+
+	ret = bpf_sg_next(sg);
+	if (ret == -ENODATA) {
+		bpf_printk("no more sg element\n");
+		return;
+	}
+
+	/* We use same function to walk sg list */
+	bpf_tail_call(sg, &jmp_table, 1);
+}
+
+/* walk sg list */
+PROG(SG1)(struct sg_filter_md *sg)
+{
+	bpf_printk("next sg element:\n");
+	dump_sg(sg);
+	sg_dispatcher(sg);
+	return 0;
+}
+
+SEC("socksg/0")
+int main_prog(struct sg_filter_md *sg)
+{
+	bpf_printk("Print first 6 bytes from sg element\n");
+	bpf_printk("First sg element:\n");
+	dump_sg(sg);
+	sg_dispatcher(sg);
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1165f1e
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT	4000
+#define BUFSIZE		8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+	if (strcmp(trans, "tcp") == 0)
+		return RDS_TRANS_TCP;
+	if (strcmp(trans, "ib") == 0)
+		return RDS_TRANS_IB;
+	return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+	switch (trans) {
+	case RDS_TRANS_TCP:
+		return ("tcp");
+	case RDS_TRANS_IB:
+		return ("ib");
+	case RDS_TRANS_NONE:
+		return ("none");
+	default:
+		return ("unknown");
+	}
+}
+
+static int gettransport(int sock)
+{
+	int err;
+	char val;
+	socklen_t len = sizeof(int);
+
+	err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&val, &len);
+	if (err < 0) {
+		fprintf(stderr, "%s: getsockopt %s\n",
+			__func__, strerror(errno));
+		return err;
+	}
+	return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+	int err;
+
+	err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&transport, sizeof(transport));
+	if (err < 0) {
+		fprintf(stderr, "could not set transport %s, %s\n",
+			trans2str(transport), strerror(errno));
+	}
+	return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+	socklen_t sin_size = sizeof(struct sockaddr_in);
+	struct sockaddr_in sin;
+	int err;
+
+	err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+	if (err < 0) {
+		fprintf(stderr, "%s getsockname %s\n",
+			__func__, strerror(errno));
+		return;
+	}
+	printf("%s address: %s port %d\n",
+		(str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+	if (ret != NULL)
+		*ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+	int i;
+
+	printf("payload contains:");
+	for (i = 0; i < 10; i++)
+		printf("%x ", buf[i]);
+	printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	struct iovec *iov;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(address);
+	sin.sin_port = htons(port);
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	/* attach bpf prog */
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd[1],
+			  sizeof(prog_fd[0])) == 0);
+
+	print_sock_local_info(sock, "server bound to", NULL);
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	while (1) {
+		memset(buf, 0, BUFSIZE);
+		iov[0].iov_base = buf;
+		iov[0].iov_len = BUFSIZE;
+
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_name = &din;
+		msg.msg_namelen = sizeof(din);
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 1;
+
+		printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+		rc = recvmsg(sock, &msg, 0);
+		if (rc < 0) {
+			fprintf(stderr, "%s: recvmsg %s\n",
+				__func__, strerror(errno));
+			break;
+		}
+
+		printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+			inet_ntoa(sin.sin_addr),
+			inet_ntoa(din.sin_addr),
+			(uint32_t) iov[0].iov_len,
+			(uint32_t) msg.msg_controllen,
+			ntohs(din.sin_port));
+
+		print_payload(buf);
+	}
+	free(iov);
+out:
+	free(buf);
+}
+
+static void create_message(char *buf)
+{
+	unsigned int i;
+
+	for (i = 0; i < BUFSIZE; i++) {
+		buf[i] = i + 0x30;
+	}
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+	struct iovec *iov;
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return -1;
+	}
+
+	msg->msg_iov = iov;
+	msg->msg_iovlen = 1;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUFSIZE * sizeof(char);
+
+	return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	create_message(buf);
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(localaddr);
+	sin.sin_port = 0;
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	print_sock_local_info(sock, "client bound to",  &sin);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+
+	memset(&din, 0, sizeof(din));
+	din.sin_family = AF_INET;
+	din.sin_addr.s_addr = inet_addr(remoteaddr);
+	din.sin_port = htons(server_port);
+
+	rc = build_rds_packet(&msg, buf);
+	if (rc < 0)
+		goto out;
+
+	printf("client sending %d byte message from %s to %s on port %d\n",
+		(uint32_t) msg.msg_iov->iov_len, localaddr,
+		remoteaddr, ntohs(sin.sin_port));
+
+	rc = sendmsg(sock, &msg, 0);
+	if (rc < 0)
+		fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+	print_payload(buf);
+
+	if (msg.msg_control)
+		free(msg.msg_control);
+	if (msg.msg_iov)
+		free(msg.msg_iov);
+out:
+	free(buf);
+
+	return;
+}
+
+static void usage(char *progname)
+{
+	fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+		"\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+	in_port_t server_port = TESTPORT;
+	char *serveraddr = NULL;
+	char *clientaddr = NULL;
+	char filename[256];
+	int opt;
+
+	while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+		switch (opt) {
+		case 's':
+			serveraddr = optarg;
+			break;
+		case 'c':
+			clientaddr = optarg;
+			break;
+		case 't':
+			transport = str2trans(optarg);
+			if (transport == RDS_TRANS_NONE) {
+				fprintf(stderr,
+					"unknown transport %s\n", optarg);
+					usage(argv[0]);
+					return (-1);
+			}
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+		return 1;
+	}
+
+	if (serveraddr && !clientaddr) {
+		printf("running server in a loop\n");
+		server(serveraddr, server_port);
+	} else if (serveraddr && clientaddr) {
+		client(clientaddr, serveraddr, server_port);
+	}
+
+	return 0;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC v2 PATCH 4/4] rds: invoke socket sg filter attached to rds socket
  2018-06-19 18:00 [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
                   ` (2 preceding siblings ...)
  2018-06-19 18:00 ` [RFC v2 PATCH 3/4] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
@ 2018-06-19 18:00 ` Tushar Dave
  2018-06-26  0:14 ` [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
  4 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-19 18:00 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev,
	acme, sowmini.varadhan

RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so messages
arrive in form of skb (over TCP) and scatterlist (over IB/RDMA).
However, because socket filter only deal with skb (e.g. struct skb as
bpf context) we can only use socket filter for rds_tcp and not for
rds_rdma.

Considering one filtering solution for RDS, it seems that the common
denominator between sk_buff and scatterlist is scatterlist. Therefore,
this patch converts skb to sgvec and invoke sg_filter_run for
rds_tcp and simply invoke sg_filter_run for IB/rds_rdma.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/ib.c       |  1 +
 net/rds/ib.h       |  1 +
 net/rds/ib_recv.c  | 12 ++++++++++++
 net/rds/rds.h      |  2 ++
 net/rds/recv.c     | 16 ++++++++++++++++
 net/rds/tcp.c      |  2 ++
 net/rds/tcp.h      |  2 ++
 net/rds/tcp_recv.c | 38 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 74 insertions(+)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 02deee2..3027832 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -421,6 +421,7 @@ struct rds_transport rds_ib_transport = {
 	.conn_path_shutdown	= rds_ib_conn_path_shutdown,
 	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
 	.inc_free		= rds_ib_inc_free,
+	.inc_to_sg_get		= rds_ib_inc_to_sg_get,
 	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
 	.cm_handle_connect	= rds_ib_cm_handle_connect,
 	.cm_connect_complete	= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index a6f4d7d..699b5b9b 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -375,6 +375,7 @@ void rds_ib_cm_connect_complete(struct rds_connection *conn,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 			     struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index b4e421a..62be497 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
 	rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	*sg =  &frag->f_sg;
+
+	return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
 				  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index b04c333..f5ea833 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -528,6 +528,8 @@ struct rds_transport {
 	int (*recv_path)(struct rds_conn_path *cp);
 	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
 	void (*inc_free)(struct rds_incoming *inc);
+	int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
+	void (*inc_to_sg_put)(struct scatterlist **sg);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 				 struct rdma_cm_event *event);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index dc67458..e0c5b4c 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -286,6 +286,7 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 	struct sock *sk;
 	unsigned long flags;
 	struct rds_conn_path *cp;
+	struct sk_filter *filter;
 
 	inc->i_conn = conn;
 	inc->i_rx_jiffies = jiffies;
@@ -369,6 +370,21 @@ void rds_recv_incoming(struct rds_connection *conn, __be32 saddr, __be32 daddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		if (conn->c_trans->inc_to_sg_get) {
+			struct scatterlist *sg;
+
+			if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0) {
+				sg_filter_run(sk, sg);
+				if (conn->c_trans->inc_to_sg_put)
+					conn->c_trans->inc_to_sg_put(&sg);
+			}
+		}
+	}
+	rcu_read_unlock();
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 351a284..b431854 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -376,6 +376,8 @@ struct rds_transport rds_tcp_transport = {
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
 	.inc_free		= rds_tcp_inc_free,
+	.inc_to_sg_get		= rds_tcp_inc_to_sg_get,
+	.inc_to_sg_put		= rds_tcp_inc_to_sg_put,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
 	.t_owner		= THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index c6fa080..466bdb9 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -82,6 +82,8 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 int rds_tcp_recv_path(struct rds_conn_path *cp);
 void rds_tcp_inc_free(struct rds_incoming *inc);
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg);
 
 /* tcp_send.c */
 void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index b9fbd2e..ce62712 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -56,6 +56,44 @@ void rds_tcp_inc_free(struct rds_incoming *inc)
 	kmem_cache_free(rds_tcp_incoming_slab, tinc);
 }
 
+#define MAX_SG 17
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct scatterlist *sg_list;
+	struct rds_tcp_incoming *tinc;
+	struct sk_buff *skb;
+	int num_sg = 0;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+	/* For now we are assuming that the max sg elements we need is MAX_SG.
+	 * To determine actual number of sg elements we need to traverse the
+	 * skb queue e.g.
+	 *
+	 * skb_queue_walk(&tinc->ti_skb_list, skb) {
+	 *	num_sg += skb_shinfo(skb)->nr_frags + 1;
+	 * }
+	 */
+	sg_list = kzalloc(sizeof(*sg_list) * MAX_SG, GFP_KERNEL);
+	if (!sg_list)
+		return -ENOMEM;
+
+	sg_init_table(sg_list, MAX_SG);
+	skb_queue_walk(&tinc->ti_skb_list, skb) {
+		num_sg += skb_to_sgvec_nomark(skb, &sg_list[num_sg], 0,
+					      skb->len);
+	}
+	sg_mark_end(&sg_list[num_sg - 1]);
+	*sg = sg_list;
+
+	return 0;
+}
+
+void rds_tcp_inc_to_sg_put(struct scatterlist **sg)
+{
+	kfree(*sg);
+}
+
 /*
  * this is pretty lame, but, whatever.
  */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 0/4] eBPF and struct scatterlist
  2018-06-19 18:00 [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
                   ` (3 preceding siblings ...)
  2018-06-19 18:00 ` [RFC v2 PATCH 4/4] rds: invoke socket sg filter attached to rds socket Tushar Dave
@ 2018-06-26  0:14 ` Tushar Dave
  4 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-26  0:14 UTC (permalink / raw)
  To: ast, daniel, davem, jakub.kicinski, quentin.monnet, jiong.wang,
	guro, sandipan, john.fastabend, kafai, rdna, brakmo, netdev,
	acme, sowmini.varadhan



On 06/19/2018 11:00 AM, Tushar Dave wrote:
> This follows up on https://patchwork.ozlabs.org/cover/927050/
> where the review feedback was to use bpf_skb_load_bytes() to deal with
> linear and non-linear skbs. While that feedback is valid and correct,
> the motivation for this work is to allow eBPF based firewalling for
> kernel modules that do not always get their packet as an sk_buff from
> their downlink drivers. One such instance of this use-case is RDS, which
> can be run both over IB (driver RDMA's a scatterlist to the RDS module)
> or over TCP (TCP passes an sk_buff to the RDS module)
> 
> This RFC (call it v2) uses exiting socket filter infrastructure and
> extend it with new eBPF program type that deals with struct scatterlist.
> For RDS, the integrated approach treats the scatterlist as the common
> denominator, and allows the application to write a filter for processing
> a scatterlist.
> 
> 
> Details:
> Patch 1 adds new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which
> uses the existing socket filter infrastructure for bpf program attach
> and load. eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER receives
> struct scatterlist as bpf context contrast to
> BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF
> program type allow socket filter to run on packet data that is in form
> form of struct scatterlist.
> 
> Patch 2 adds functionality to run BPF_PROG_TYPE_SOCKET_SG_FILTER socket
> filter program. A bpf helpers bpf_sg_next() is also added so users can
> retrieve sg elements from scatterlist.
> 
> Patch 3 adds socket filter eBPF sample program that uses patch 1 and
> patch 2. The sample program opens an rds socket, attach ebpf program
> (socksg i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to rds socket and uses
> bpf_sg_next helper to look into sg. For a test, current ebpf program
> only prints first few bytes from each elements of sg list.
> 
> Finally, patch 4 allows rds_recv_incoming to invoke socket filter
> program which deals with scatterlist.

It would be really helpful to have your thoughts/comments on my
direction. I am planning to put together some complex example for
filtering RDS traffic using eBPF, and so it would be good to have
feedback before I go too far ahead.

Thanks.

-Tushar

> 
> Thanks.
> 
> -Tushar
> 
> Tushar Dave (4):
>    eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
>    ebpf: Add sg_filter_run and sg helper
>    ebpf: Add sample ebpf program for SOCKET_SG_FILTER
>    rds: invoke socket sg filter attached to rds socket
> 
>   include/linux/bpf_types.h                 |   1 +
>   include/linux/filter.h                    |  10 +
>   include/uapi/linux/bpf.h                  |  17 +-
>   kernel/bpf/syscall.c                      |   1 +
>   kernel/bpf/verifier.c                     |   1 +
>   net/core/filter.c                         | 149 ++++++++++++-
>   net/rds/ib.c                              |   1 +
>   net/rds/ib.h                              |   1 +
>   net/rds/ib_recv.c                         |  12 ++
>   net/rds/rds.h                             |   2 +
>   net/rds/recv.c                            |  16 ++
>   net/rds/tcp.c                             |   2 +
>   net/rds/tcp.h                             |   2 +
>   net/rds/tcp_recv.c                        |  38 ++++
>   samples/bpf/Makefile                      |   3 +
>   samples/bpf/bpf_load.c                    |  11 +-
>   samples/bpf/rds_filter_kern.c             |  78 +++++++
>   samples/bpf/rds_filter_user.c             | 339 ++++++++++++++++++++++++++++++
>   tools/bpf/bpftool/prog.c                  |   1 +
>   tools/include/uapi/linux/bpf.h            |  17 +-
>   tools/lib/bpf/libbpf.c                    |   3 +
>   tools/lib/bpf/libbpf.h                    |   2 +
>   tools/testing/selftests/bpf/bpf_helpers.h |   3 +
>   23 files changed, 703 insertions(+), 7 deletions(-)
>   create mode 100644 samples/bpf/rds_filter_kern.c
>   create mode 100644 samples/bpf/rds_filter_user.c
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-06-19 18:00 ` [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
@ 2018-06-29  7:25   ` Daniel Borkmann
  2018-06-29  8:48     ` Daniel Borkmann
  2018-06-29  8:27   ` Daniel Borkmann
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Borkmann @ 2018-06-29  7:25 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan

On 06/19/2018 08:00 PM, Tushar Dave wrote:
> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
> existing socket filter infrastructure for bpf program attach and load.
> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
> contrast to SOCKET_FILTER which deals with struct skb. This is useful
> for kernel entities that don't have skb to represent packet data but
> want to run eBPF socket filter on packet data that is in form of struct
> scatterlist e.g. IB/RDMA
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/bpf_types.h      |  1 +
>  include/linux/filter.h         |  8 +++++
>  include/uapi/linux/bpf.h       |  7 ++++
>  kernel/bpf/syscall.c           |  1 +
>  kernel/bpf/verifier.c          |  1 +
>  net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
>  samples/bpf/bpf_load.c         | 11 ++++--
>  tools/bpf/bpftool/prog.c       |  1 +
>  tools/include/uapi/linux/bpf.h |  7 ++++
>  tools/lib/bpf/libbpf.c         |  3 ++
>  tools/lib/bpf/libbpf.h         |  2 ++
>  11 files changed, 114 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index c5700c2..f8b4b56 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -16,6 +16,7 @@
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>  #endif
>  #ifdef CONFIG_BPF_EVENTS
>  BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 45fc0f5..71618b1 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -517,6 +517,14 @@ struct bpf_skb_data_end {
>  	void *data_end;
>  };
>  
> +struct bpf_scatterlist {
> +	struct scatterlist *sg;
> +	void *start;
> +	void *end;
> +	int cur_sg;
> +	int num_sg;
> +};
> +
>  struct sk_msg_buff {
>  	void *data;
>  	void *data_end;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6..ef0a7b6 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -144,6 +144,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>  	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>  	BPF_PROG_TYPE_LIRC_MODE2,
> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>  };
>  
>  enum bpf_attach_type {
> @@ -2358,6 +2359,12 @@ enum sk_action {
>  	SK_PASS,
>  };
>  
> +/* use accessible scatterlist */
> +struct sg_filter_md {
> +	void *data; /* sg_virt(sg) */
> +	void *data_end; /* sg_virt(sg) + sg->length */
> +};
> +
>  /* user accessible metadata for SK_MSG packet hook, new fields must
>   * be added to the end of this structure
>   */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0fa2062..74193a8 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1300,6 +1300,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>  
>  	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>  	    type != BPF_PROG_TYPE_CGROUP_SKB &&
> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
>  	    !capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d6403b5..a00d3eb 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1320,6 +1320,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>  	case BPF_PROG_TYPE_LWT_XMIT:
>  	case BPF_PROG_TYPE_SK_SKB:
>  	case BPF_PROG_TYPE_SK_MSG:
> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>  		if (meta)
>  			return meta->pkt_access;
>  
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3d9ba7e..8f67942 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1130,7 +1130,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>  
>  static void __bpf_prog_release(struct bpf_prog *prog)
>  {
> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>  		bpf_prog_put(prog);
>  	} else {
>  		bpf_release_orig_filter(prog);
> @@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>  
>  static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
>  {
> +	struct bpf_prog *prog;
> +
>  	if (sock_flag(sk, SOCK_FILTER_LOCKED))
>  		return ERR_PTR(-EPERM);
>  
> -	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
> +	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
> +	if (IS_ERR(prog))
> +		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
> +
> +	return prog;
>  }

Hmm, I don't think this works: this now means as unpriviledged I can attach a new
BPF_PROG_TYPE_SOCKET_SG_FILTER to a non-rds socket e.g. normal tcp/udp through the
SO_ATTACH_BPF sockopt, where input context is skb instead of sg list and thus crash
my box?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
  2018-06-19 18:00 ` [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper Tushar Dave
@ 2018-06-29  8:18   ` Daniel Borkmann
  2018-06-30  0:24     ` Tushar Dave
  2018-06-29  8:32   ` Daniel Borkmann
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Borkmann @ 2018-06-29  8:18 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan

On 06/19/2018 08:00 PM, Tushar Dave wrote:
> When sg_filter_run() is invoked it runs the attached eBPF
> SOCKET_SG_FILTER program which deals with struct scatterlist.
> 
> In addition, this patch also adds bpf_sg_next helper function that
> allows users to retrieve the next sg element from sg list.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/filter.h                    |  2 +
>  include/uapi/linux/bpf.h                  | 10 ++++-
>  net/core/filter.c                         | 72 +++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h            | 10 ++++-
>  tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
>  5 files changed, 95 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 71618b1..d176402 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1072,4 +1072,6 @@ struct bpf_sock_ops_kern {
>  					 */
>  };
>  
> +int sg_filter_run(struct sock *sk, struct scatterlist *sg);
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ef0a7b6..036432b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
>   * 	Return
>   * 		A 64-bit integer containing the current cgroup id based
>   * 		on the cgroup within which the current task is running.
> + *
> + * int bpf_sg_next(struct bpf_scatterlist *sg)
> + *	Description
> + *		This helper allows user to retrieve next sg element from
> + *		sg list.
> + *	Return
> + *		Returns 0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
>  	FN(rc_repeat),			\
>  	FN(rc_keydown),			\
>  	FN(skb_cgroup_id),		\
> -	FN(get_current_cgroup_id),
> +	FN(get_current_cgroup_id),	\
> +	FN(sg_next),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 8f67942..702ff5b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -121,6 +121,53 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>  }
>  EXPORT_SYMBOL(sk_filter_trim_cap);
>  
> +int sg_filter_run(struct sock *sk, struct scatterlist *sg)
> +{
> +	struct sk_filter *filter;
> +	int err;
> +
> +	rcu_read_lock();
> +	filter = rcu_dereference(sk->sk_filter);
> +	if (filter) {
> +		struct bpf_scatterlist bpfsg;
> +		int num_sg;
> +
> +		if (!sg) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		num_sg = sg_nents(sg);
> +		if (num_sg <= 0) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* We store a reference  to the sg list so it can later used by
> +		 * eBPF helpers to retrieve the next sg element.
> +		 */
> +		bpfsg.num_sg = num_sg;
> +		bpfsg.cur_sg = 0;
> +		bpfsg.sg = sg;
> +
> +		/* For the first sg element, we store the pkt access pointers
> +		 * into start and end so eBPF program can have pkt access using
> +		 * data and data_end. The pkt access for subsequent element of
> +		 * sg list is possible when eBPF program invokes bpf_sg_next
> +		 * which takes care of setting start and end to the correct sg
> +		 * element.
> +		 */
> +		bpfsg.start = sg_virt(sg);
> +		bpfsg.end = bpfsg.start + sg->length;
> +		BPF_PROG_RUN(filter->prog, &bpfsg);
> +	}
> +out:
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +EXPORT_SYMBOL(sg_filter_run);
> +
>  BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
>  {
>  	return skb_get_poff(skb);
> @@ -3753,6 +3800,29 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
>  	.arg1_type      = ARG_PTR_TO_CTX,
>  };
>  
> +BPF_CALL_1(bpf_sg_next, struct bpf_scatterlist *, bpfsg)
> +{
> +	struct scatterlist *sg = bpfsg->sg;
> +	int cur_sg = bpfsg->cur_sg;
> +
> +	cur_sg++;
> +	if (cur_sg >= bpfsg->num_sg)
> +		return -ENODATA;
> +
> +	bpfsg->cur_sg = cur_sg;
> +	bpfsg->start = sg_virt(&sg[cur_sg]);
> +	bpfsg->end = bpfsg->start + sg[cur_sg].length;
> +
> +	return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_sg_next_proto = {
> +	.func		= bpf_sg_next,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +};

Should be added to bpf_helper_changes_pkt_data() in order to enforce a reload
of all pkt pointers. Otherwise this is buggy in the sense that someone could only
reload pkt_end pointer in the prog while old pkt_start still points to previous
sg entry, so you would be able to access out of bounds.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-06-19 18:00 ` [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
  2018-06-29  7:25   ` Daniel Borkmann
@ 2018-06-29  8:27   ` Daniel Borkmann
  2018-06-30  0:46     ` Tushar Dave
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Borkmann @ 2018-06-29  8:27 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan

On 06/19/2018 08:00 PM, Tushar Dave wrote:
> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
> existing socket filter infrastructure for bpf program attach and load.
> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
> contrast to SOCKET_FILTER which deals with struct skb. This is useful
> for kernel entities that don't have skb to represent packet data but
> want to run eBPF socket filter on packet data that is in form of struct
> scatterlist e.g. IB/RDMA
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/bpf_types.h      |  1 +
>  include/linux/filter.h         |  8 +++++
>  include/uapi/linux/bpf.h       |  7 ++++
>  kernel/bpf/syscall.c           |  1 +
>  kernel/bpf/verifier.c          |  1 +
>  net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
>  samples/bpf/bpf_load.c         | 11 ++++--
>  tools/bpf/bpftool/prog.c       |  1 +
>  tools/include/uapi/linux/bpf.h |  7 ++++
>  tools/lib/bpf/libbpf.c         |  3 ++
>  tools/lib/bpf/libbpf.h         |  2 ++
>  11 files changed, 114 insertions(+), 5 deletions(-)
> 
[...]
>  
> +static bool socksg_filter_is_valid_access(int off, int size,
> +					  enum bpf_access_type type,
> +					  const struct bpf_prog *prog,
> +					  struct bpf_insn_access_aux *info)
> +{
> +	switch (off) {
> +	case offsetof(struct sg_filter_md, data):
> +		info->reg_type = PTR_TO_PACKET;
> +		break;
> +	case offsetof(struct sg_filter_md, data_end):
> +		info->reg_type = PTR_TO_PACKET_END;
> +		break;
> +	}
> +
> +	if (off < 0 || off >= sizeof(struct sg_filter_md))
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +	if (size != sizeof(__u64))
> +		return false;
> +
> +	return true;
> +}

Just a note, don't know much about rds, but when you make this writeable for
rds/tcp you definitely must make sure that it can be handled properly in there,
meaning when program rewrites packet data that this data is private to the BPF
prog (to avoid races/corruption) and that the rewritten data is correctly handled
from there.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
  2018-06-19 18:00 ` [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper Tushar Dave
  2018-06-29  8:18   ` Daniel Borkmann
@ 2018-06-29  8:32   ` Daniel Borkmann
  2018-06-30  0:27     ` Tushar Dave
  1 sibling, 1 reply; 15+ messages in thread
From: Daniel Borkmann @ 2018-06-29  8:32 UTC (permalink / raw)
  To: Tushar Dave, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan

On 06/19/2018 08:00 PM, Tushar Dave wrote:
[...]
> +int sg_filter_run(struct sock *sk, struct scatterlist *sg)
> +{
> +	struct sk_filter *filter;
> +	int err;
> +
> +	rcu_read_lock();
> +	filter = rcu_dereference(sk->sk_filter);
> +	if (filter) {
> +		struct bpf_scatterlist bpfsg;
> +		int num_sg;
> +
> +		if (!sg) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		num_sg = sg_nents(sg);
> +		if (num_sg <= 0) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		/* We store a reference  to the sg list so it can later used by
> +		 * eBPF helpers to retrieve the next sg element.
> +		 */
> +		bpfsg.num_sg = num_sg;
> +		bpfsg.cur_sg = 0;
> +		bpfsg.sg = sg;
> +
> +		/* For the first sg element, we store the pkt access pointers
> +		 * into start and end so eBPF program can have pkt access using
> +		 * data and data_end. The pkt access for subsequent element of
> +		 * sg list is possible when eBPF program invokes bpf_sg_next
> +		 * which takes care of setting start and end to the correct sg
> +		 * element.
> +		 */
> +		bpfsg.start = sg_virt(sg);
> +		bpfsg.end = bpfsg.start + sg->length;
> +		BPF_PROG_RUN(filter->prog, &bpfsg);

Return code here from BPF prog is ignored entirely, I thought you wanted to
use it also for dropping packets? If UAPI would get frozen like this then it's
baked in stone.

> +	}
> +out:
> +	rcu_read_unlock();
> +
> +	return err;
> +}
> +EXPORT_SYMBOL(sg_filter_run);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-06-29  7:25   ` Daniel Borkmann
@ 2018-06-29  8:48     ` Daniel Borkmann
  2018-06-30  0:20       ` Tushar Dave
  0 siblings, 1 reply; 15+ messages in thread
From: Daniel Borkmann @ 2018-06-29  8:48 UTC (permalink / raw)
  To: Daniel Borkmann, Tushar Dave, ast, davem, jakub.kicinski,
	quentin.monnet, jiong.wang, guro, sandipan, john.fastabend,
	kafai, rdna, brakmo, netdev, acme, sowmini.varadhan

On 06/29/2018 09:25 AM, Daniel Borkmann wrote:
> On 06/19/2018 08:00 PM, Tushar Dave wrote:
>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>> existing socket filter infrastructure for bpf program attach and load.
>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>> for kernel entities that don't have skb to represent packet data but
>> want to run eBPF socket filter on packet data that is in form of struct
>> scatterlist e.g. IB/RDMA
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
[...]
>>  static void __bpf_prog_release(struct bpf_prog *prog)
>>  {
>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>  		bpf_prog_put(prog);
>>  	} else {
>>  		bpf_release_orig_filter(prog);
>> @@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>  
>>  static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
>>  {
>> +	struct bpf_prog *prog;
>> +
>>  	if (sock_flag(sk, SOCK_FILTER_LOCKED))
>>  		return ERR_PTR(-EPERM);
>>  
>> -	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
>> +	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
>> +	if (IS_ERR(prog))
>> +		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
>> +
>> +	return prog;
>>  }
> 
> Hmm, I don't think this works: this now means as unpriviledged I can attach a new
> BPF_PROG_TYPE_SOCKET_SG_FILTER to a non-rds socket e.g. normal tcp/udp through the
> SO_ATTACH_BPF sockopt, where input context is skb instead of sg list and thus crash
> my box?

... probably best to just make a setsockopt specific to rds here, so the two are fully
separated.

Also worth exploring whether you can reuse as much as possible from the struct sk_msg_buff
context and in general the BPF_PROG_TYPE_SK_MSG type that is using this which we already
have in sockmap today. At least feels like some of the concepts are a bit similar. For
pulling in more payload you have bpf_msg_pull_data() there which I think might be more
user-friendly at least in that you have the full payload from start to the 'current' end
available and don't need to navigate through individual sg entries back/forth which could
perhaps end up being bit painful for users, though I can see that it's a middle ground
between some skb_load_bytes()-alike helper that would copy the pieces out of the sg entries
vs needing to linearize. What are the requirements here, would it make sense to offer both
as an option or is this impractical based on what you've measured?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-06-29  8:48     ` Daniel Borkmann
@ 2018-06-30  0:20       ` Tushar Dave
  0 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-30  0:20 UTC (permalink / raw)
  To: Daniel Borkmann, Daniel Borkmann, ast, davem, jakub.kicinski,
	quentin.monnet, jiong.wang, guro, sandipan, john.fastabend,
	kafai, rdna, brakmo, netdev, acme, sowmini.varadhan



On 06/29/2018 01:48 AM, Daniel Borkmann wrote:
> On 06/29/2018 09:25 AM, Daniel Borkmann wrote:
>> On 06/19/2018 08:00 PM, Tushar Dave wrote:
>>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>>> existing socket filter infrastructure for bpf program attach and load.
>>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>>> for kernel entities that don't have skb to represent packet data but
>>> want to run eBPF socket filter on packet data that is in form of struct
>>> scatterlist e.g. IB/RDMA
>>>
>>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> [...]
>>>   static void __bpf_prog_release(struct bpf_prog *prog)
>>>   {
>>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>>   		bpf_prog_put(prog);
>>>   	} else {
>>>   		bpf_release_orig_filter(prog);
>>> @@ -1551,10 +1552,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
>>>   
>>>   static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
>>>   {
>>> +	struct bpf_prog *prog;
>>> +
>>>   	if (sock_flag(sk, SOCK_FILTER_LOCKED))
>>>   		return ERR_PTR(-EPERM);
>>>   
>>> -	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
>>> +	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
>>> +	if (IS_ERR(prog))
>>> +		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
>>> +
>>> +	return prog;
>>>   }
>>
>> Hmm, I don't think this works: this now means as unpriviledged I can attach a new
>> BPF_PROG_TYPE_SOCKET_SG_FILTER to a non-rds socket e.g. normal tcp/udp through the
>> SO_ATTACH_BPF sockopt, where input context is skb instead of sg list and thus crash
>> my box?

hmm.. I see the problem.

> 
> ... probably best to just make a setsockopt specific to rds here, so the two are fully
> separated.

Yes, it makes sense to make setsockopt specific to RDS that gives us
complete separation (and not to worry about above problem).
btw, to to make setsockopt specific to RDS though we have to export
sk_attach_bpf().

> 
> Also worth exploring whether you can reuse as much as possible from the struct sk_msg_buff
> context and in general the BPF_PROG_TYPE_SK_MSG type that is using this which we already
> have in sockmap today. At least feels like some of the concepts are a bit similar. For
> pulling in more payload you have bpf_msg_pull_data() there which I think might be more
> user-friendly at least in that you have the full payload from start to the 'current' end
> available and don't need to navigate through individual sg entries back/forth which could
> perhaps end up being bit painful for users, though I can see that it's a middle ground
> between some skb_load_bytes()-alike helper that would copy the pieces out of the sg entries
> vs needing to linearize. What are the requirements here, would it make sense to offer both
> as an option or is this impractical based on what you've measured?

Yes, sockmap also deal with struct scatterlist so from that prospective
I certainly try to reuse code wherever I can e.g. most likely get rid of
struct bpf_scatterlist and use struct sk_msg_buff.

Form use-case prospective, we want to look at complete payload (that
includes RDS header as well) and based on that take actions like pass
,drop or forward. So, I agree that it makes sense to not iterate over sg
element back/forth [1]. I guess bpf_msg_pull_data() would do the work.
If there is need we may add sg_load_bytes()-like helper.

Thanks for taking time reviewing. I will work on the suggested changes.

-Tushar

[1]
btw, bpf_sg_next() was added to just showcase an example and if there is
no real need of it I will get rid of it.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
  2018-06-29  8:18   ` Daniel Borkmann
@ 2018-06-30  0:24     ` Tushar Dave
  0 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-30  0:24 UTC (permalink / raw)
  To: Daniel Borkmann, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan



On 06/29/2018 01:18 AM, Daniel Borkmann wrote:
> On 06/19/2018 08:00 PM, Tushar Dave wrote:
>> When sg_filter_run() is invoked it runs the attached eBPF
>> SOCKET_SG_FILTER program which deals with struct scatterlist.
>>
>> In addition, this patch also adds bpf_sg_next helper function that
>> allows users to retrieve the next sg element from sg list.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/filter.h                    |  2 +
>>   include/uapi/linux/bpf.h                  | 10 ++++-
>>   net/core/filter.c                         | 72 +++++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h            | 10 ++++-
>>   tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
>>   5 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 71618b1..d176402 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1072,4 +1072,6 @@ struct bpf_sock_ops_kern {
>>   					 */
>>   };
>>   
>> +int sg_filter_run(struct sock *sk, struct scatterlist *sg);
>> +
>>   #endif /* __LINUX_FILTER_H__ */
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index ef0a7b6..036432b 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -2076,6 +2076,13 @@ struct bpf_stack_build_id {
>>    * 	Return
>>    * 		A 64-bit integer containing the current cgroup id based
>>    * 		on the cgroup within which the current task is running.
>> + *
>> + * int bpf_sg_next(struct bpf_scatterlist *sg)
>> + *	Description
>> + *		This helper allows user to retrieve next sg element from
>> + *		sg list.
>> + *	Return
>> + *		Returns 0 on success, or a negative error in case of failure.
>>    */
>>   #define __BPF_FUNC_MAPPER(FN)		\
>>   	FN(unspec),			\
>> @@ -2158,7 +2165,8 @@ struct bpf_stack_build_id {
>>   	FN(rc_repeat),			\
>>   	FN(rc_keydown),			\
>>   	FN(skb_cgroup_id),		\
>> -	FN(get_current_cgroup_id),
>> +	FN(get_current_cgroup_id),	\
>> +	FN(sg_next),
>>   
>>   /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>    * function eBPF program intends to call
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 8f67942..702ff5b 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -121,6 +121,53 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
>>   }
>>   EXPORT_SYMBOL(sk_filter_trim_cap);
>>   
>> +int sg_filter_run(struct sock *sk, struct scatterlist *sg)
>> +{
>> +	struct sk_filter *filter;
>> +	int err;
>> +
>> +	rcu_read_lock();
>> +	filter = rcu_dereference(sk->sk_filter);
>> +	if (filter) {
>> +		struct bpf_scatterlist bpfsg;
>> +		int num_sg;
>> +
>> +		if (!sg) {
>> +			err = -EINVAL;
>> +			goto out;
>> +		}
>> +
>> +		num_sg = sg_nents(sg);
>> +		if (num_sg <= 0) {
>> +			err = -EINVAL;
>> +			goto out;
>> +		}
>> +
>> +		/* We store a reference  to the sg list so it can later used by
>> +		 * eBPF helpers to retrieve the next sg element.
>> +		 */
>> +		bpfsg.num_sg = num_sg;
>> +		bpfsg.cur_sg = 0;
>> +		bpfsg.sg = sg;
>> +
>> +		/* For the first sg element, we store the pkt access pointers
>> +		 * into start and end so eBPF program can have pkt access using
>> +		 * data and data_end. The pkt access for subsequent element of
>> +		 * sg list is possible when eBPF program invokes bpf_sg_next
>> +		 * which takes care of setting start and end to the correct sg
>> +		 * element.
>> +		 */
>> +		bpfsg.start = sg_virt(sg);
>> +		bpfsg.end = bpfsg.start + sg->length;
>> +		BPF_PROG_RUN(filter->prog, &bpfsg);
>> +	}
>> +out:
>> +	rcu_read_unlock();
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL(sg_filter_run);
>> +
>>   BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
>>   {
>>   	return skb_get_poff(skb);
>> @@ -3753,6 +3800,29 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
>>   	.arg1_type      = ARG_PTR_TO_CTX,
>>   };
>>   
>> +BPF_CALL_1(bpf_sg_next, struct bpf_scatterlist *, bpfsg)
>> +{
>> +	struct scatterlist *sg = bpfsg->sg;
>> +	int cur_sg = bpfsg->cur_sg;
>> +
>> +	cur_sg++;
>> +	if (cur_sg >= bpfsg->num_sg)
>> +		return -ENODATA;
>> +
>> +	bpfsg->cur_sg = cur_sg;
>> +	bpfsg->start = sg_virt(&sg[cur_sg]);
>> +	bpfsg->end = bpfsg->start + sg[cur_sg].length;
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct bpf_func_proto bpf_sg_next_proto = {
>> +	.func		= bpf_sg_next,
>> +	.gpl_only	= false,
>> +	.ret_type	= RET_INTEGER,
>> +	.arg1_type	= ARG_PTR_TO_CTX,
>> +};
> 
> Should be added to bpf_helper_changes_pkt_data() in order to enforce a reload
> of all pkt pointers. Otherwise this is buggy in the sense that someone could only
> reload pkt_end pointer in the prog while old pkt_start still points to previous
> sg entry, so you would be able to access out of bounds.

Sure thing. Will do so.

Thank you.

-Tushar
> 
> Thanks,
> Daniel
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper
  2018-06-29  8:32   ` Daniel Borkmann
@ 2018-06-30  0:27     ` Tushar Dave
  0 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-30  0:27 UTC (permalink / raw)
  To: Daniel Borkmann, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan



On 06/29/2018 01:32 AM, Daniel Borkmann wrote:
> On 06/19/2018 08:00 PM, Tushar Dave wrote:
> [...]
>> +int sg_filter_run(struct sock *sk, struct scatterlist *sg)
>> +{
>> +	struct sk_filter *filter;
>> +	int err;
>> +
>> +	rcu_read_lock();
>> +	filter = rcu_dereference(sk->sk_filter);
>> +	if (filter) {
>> +		struct bpf_scatterlist bpfsg;
>> +		int num_sg;
>> +
>> +		if (!sg) {
>> +			err = -EINVAL;
>> +			goto out;
>> +		}
>> +
>> +		num_sg = sg_nents(sg);
>> +		if (num_sg <= 0) {
>> +			err = -EINVAL;
>> +			goto out;
>> +		}
>> +
>> +		/* We store a reference  to the sg list so it can later used by
>> +		 * eBPF helpers to retrieve the next sg element.
>> +		 */
>> +		bpfsg.num_sg = num_sg;
>> +		bpfsg.cur_sg = 0;
>> +		bpfsg.sg = sg;
>> +
>> +		/* For the first sg element, we store the pkt access pointers
>> +		 * into start and end so eBPF program can have pkt access using
>> +		 * data and data_end. The pkt access for subsequent element of
>> +		 * sg list is possible when eBPF program invokes bpf_sg_next
>> +		 * which takes care of setting start and end to the correct sg
>> +		 * element.
>> +		 */
>> +		bpfsg.start = sg_virt(sg);
>> +		bpfsg.end = bpfsg.start + sg->length;
>> +		BPF_PROG_RUN(filter->prog, &bpfsg);
> 
> Return code here from BPF prog is ignored entirely, I thought you wanted to
> use it also for dropping packets? If UAPI would get frozen like this then it's
> baked in stone.

Yeah, I am going to add return code necessary for pass, drop and
forward. I will do that. Thanks.

-Tushar

> 
>> +	}
>> +out:
>> +	rcu_read_unlock();
>> +
>> +	return err;
>> +}
>> +EXPORT_SYMBOL(sg_filter_run);
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-06-29  8:27   ` Daniel Borkmann
@ 2018-06-30  0:46     ` Tushar Dave
  0 siblings, 0 replies; 15+ messages in thread
From: Tushar Dave @ 2018-06-30  0:46 UTC (permalink / raw)
  To: Daniel Borkmann, ast, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, guro, sandipan, john.fastabend, kafai, rdna, brakmo,
	netdev, acme, sowmini.varadhan



On 06/29/2018 01:27 AM, Daniel Borkmann wrote:
> On 06/19/2018 08:00 PM, Tushar Dave wrote:
>> Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which uses the
>> existing socket filter infrastructure for bpf program attach and load.
>> SOCKET_SG_FILTER eBPF program receives struct scatterlist as bpf context
>> contrast to SOCKET_FILTER which deals with struct skb. This is useful
>> for kernel entities that don't have skb to represent packet data but
>> want to run eBPF socket filter on packet data that is in form of struct
>> scatterlist e.g. IB/RDMA
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/bpf_types.h      |  1 +
>>   include/linux/filter.h         |  8 +++++
>>   include/uapi/linux/bpf.h       |  7 ++++
>>   kernel/bpf/syscall.c           |  1 +
>>   kernel/bpf/verifier.c          |  1 +
>>   net/core/filter.c              | 77 ++++++++++++++++++++++++++++++++++++++++--
>>   samples/bpf/bpf_load.c         | 11 ++++--
>>   tools/bpf/bpftool/prog.c       |  1 +
>>   tools/include/uapi/linux/bpf.h |  7 ++++
>>   tools/lib/bpf/libbpf.c         |  3 ++
>>   tools/lib/bpf/libbpf.h         |  2 ++
>>   11 files changed, 114 insertions(+), 5 deletions(-)
>>
> [...]
>>   
>> +static bool socksg_filter_is_valid_access(int off, int size,
>> +					  enum bpf_access_type type,
>> +					  const struct bpf_prog *prog,
>> +					  struct bpf_insn_access_aux *info)
>> +{
>> +	switch (off) {
>> +	case offsetof(struct sg_filter_md, data):
>> +		info->reg_type = PTR_TO_PACKET;
>> +		break;
>> +	case offsetof(struct sg_filter_md, data_end):
>> +		info->reg_type = PTR_TO_PACKET_END;
>> +		break;
>> +	}
>> +
>> +	if (off < 0 || off >= sizeof(struct sg_filter_md))
>> +		return false;
>> +	if (off % size != 0)
>> +		return false;
>> +	if (size != sizeof(__u64))
>> +		return false;
>> +
>> +	return true;
>> +}
> 
> Just a note, don't know much about rds, but when you make this writeable for
> rds/tcp you definitely must make sure that it can be handled properly in there,
> meaning when program rewrites packet data that this data is private to the BPF
> prog (to avoid races/corruption) and that the rewritten data is correctly handled
> from there.

Sure thing. When I add something like bpf_sg_store_bytes(), I will make
sure to take care of rewrites.

Thanks.

-Tushar
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-06-30  1:02 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-19 18:00 [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave
2018-06-19 18:00 ` [RFC v2 PATCH 1/4] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
2018-06-29  7:25   ` Daniel Borkmann
2018-06-29  8:48     ` Daniel Borkmann
2018-06-30  0:20       ` Tushar Dave
2018-06-29  8:27   ` Daniel Borkmann
2018-06-30  0:46     ` Tushar Dave
2018-06-19 18:00 ` [RFC v2 PATCH 2/4] ebpf: Add sg_filter_run and sg helper Tushar Dave
2018-06-29  8:18   ` Daniel Borkmann
2018-06-30  0:24     ` Tushar Dave
2018-06-29  8:32   ` Daniel Borkmann
2018-06-30  0:27     ` Tushar Dave
2018-06-19 18:00 ` [RFC v2 PATCH 3/4] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
2018-06-19 18:00 ` [RFC v2 PATCH 4/4] rds: invoke socket sg filter attached to rds socket Tushar Dave
2018-06-26  0:14 ` [RFC v2 PATCH 0/4] eBPF and struct scatterlist Tushar Dave

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.