* [PATCH net-next 0/5] eBPF and struct scatterlist
@ 2018-09-11 19:37 Tushar Dave
  2018-09-11 19:38 ` [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page Tushar Dave
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: Tushar Dave @ 2018-09-11 19:37 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

This non-RFC patch set is a follow-up to the RFC v3 that was sent earlier.
(https://www.spinics.net/lists/netdev/msg519380.html)

The following changes have been made, RFC v3 -> this patch set:

- "RFC v3 patch 3" is removed as it is no longer needed: bpf_msg_pull_data()
now has all the required bugs fixed. Thanks Daniel.

- Use __GFP_COMP while allocating pages in bpf_msg_pull_data() to avoid
the page_copy_sane() warning when an sg page is used in
copy_page_to_iter() (patch 1).

- In sg_filter_run(), after the BPF prog returns, mb.sg_data may have
changed, e.g. when multiple scatterlist entries are linearized into one.
Therefore, make sure to update the original sg and mark the sg end
correctly before returning (patch 3).

- A BPF program can write to and modify the RDS packet; in that case the
modified packet data is represented in the scatterlist. Therefore, use
the scatterlist (not the skb) while copying the payload back to
userspace, and carefully release the scatterlist and its associated
pages via get_page()/put_page() (patch 4).



Details:
--------
eBPF: Patch 1 uses __GFP_COMP while allocating pages in
bpf_msg_pull_data() to avoid the page_copy_sane() warning.

eBPF: Patch 2 adds the new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER,
which uses the existing socket filter infrastructure for bpf program
attach and load. An eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER
deals with struct scatterlist as its bpf context, in contrast to
BPF_PROG_TYPE_SOCKET_FILTER, which deals with struct sk_buff. This new
eBPF program type allows a socket filter to run on packet data that is
in the form of a struct scatterlist.

eBPF: Patch 3 adds sg_filter_run(), which runs BPF_PROG_TYPE_SOCKET_SG_FILTER
programs.

RDS: Patch 4 allows rds_recv_incoming() to invoke the socket filter
program, which deals with struct scatterlist.

bpf/samples: Patch 5 adds a socket filter eBPF sample program that
exercises patches 1 to 4. The sample program opens an RDS socket,
attaches an eBPF program (socksg, i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER)
to the RDS socket, and uses the bpf_msg_pull_data() helper to inspect
RDS packet data. As a test, the current sample program only prints the
first few bytes of packet data.


Background:
-----------
The motivation for this work is to allow eBPF-based firewalling for
kernel modules that do not always get their packets as an sk_buff from
their downlink drivers. One such instance of this use case is RDS, which
can run either over IB (the driver RDMAs a scatterlist to the RDS
module) or over TCP (TCP passes an sk_buff to the RDS module).

This patch set uses the existing socket filter infrastructure and
extends it with a new eBPF program type that deals with struct
scatterlist. The existing bpf helper bpf_msg_pull_data() is used to
inspect packet data that is in the form of a struct scatterlist. For
RDS, the integrated approach treats the scatterlist as the common
denominator and allows the application to write a filter for processing
a scatterlist.
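
To make this concrete, a minimal filter of this type (abridged from the
sample in patch 5) looks like:

	SEC("socksg")
	int main_prog(struct sk_msg_md *msg)
	{
		/* Linearize the first 6 bytes so they can be read
		 * directly through msg->data.
		 */
		if (bpf_msg_pull_data(msg, 0, 6, 0))
			return SOCKSG_PASS;

		if (msg->data + 6 > msg->data_end)
			return SOCKSG_PASS;

		/* inspect packet bytes here ... */
		return SOCKSG_PASS;
	}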


Testing:
--------
To confirm data accuracy and results, RDS packets of various sizes have
been tested with the socksg program, along with various start and end
values for bpf_msg_pull_data(). All such tests show accurate results.

Thanks.

-Tushar


Tushar Dave (5):
  bpf: use __GFP_COMP while allocating page
  eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  ebpf: Add sg_filter_run()
  rds: invoke socket sg filter attached to rds socket
  ebpf: Add sample ebpf program for SOCKET_SG_FILTER

 include/linux/bpf_types.h      |   1 +
 include/linux/filter.h         |   8 +
 include/uapi/linux/bpf.h       |   7 +
 kernel/bpf/syscall.c           |   1 +
 kernel/bpf/verifier.c          |   1 +
 net/core/filter.c              |  93 ++++++++++-
 net/rds/ib.c                   |   1 +
 net/rds/ib.h                   |   1 +
 net/rds/ib_recv.c              |  12 ++
 net/rds/rds.h                  |   1 +
 net/rds/recv.c                 |  12 ++
 net/rds/tcp.c                  |   1 +
 net/rds/tcp.h                  |   2 +
 net/rds/tcp_recv.c             | 108 ++++++++++++-
 samples/bpf/Makefile           |   3 +
 samples/bpf/bpf_load.c         |  11 +-
 samples/bpf/rds_filter_kern.c  |  42 +++++
 samples/bpf/rds_filter_user.c  | 339 +++++++++++++++++++++++++++++++++++++++++
 tools/bpf/bpftool/prog.c       |   1 +
 tools/include/uapi/linux/bpf.h |   7 +
 tools/lib/bpf/libbpf.c         |   3 +
 tools/lib/bpf/libbpf.h         |   2 +
 22 files changed, 650 insertions(+), 7 deletions(-)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

-- 
1.8.3.1


* [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
  2018-09-11 19:37 [PATCH net-next 0/5] eBPF and struct scatterlist Tushar Dave
@ 2018-09-11 19:38 ` Tushar Dave
  2018-09-12 16:21   ` Tushar Dave
  2018-09-11 19:38 ` [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

Helper bpf_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() on it triggers the warning below, e.g.:

[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

To avoid this warning, use __GFP_COMP while allocating multiple
contiguous pages.
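
For reference, the check that fires is page_copy_sane() in
lib/iov_iter.c, which looks roughly like the following (reproduced here
from that file for context only; not part of this patch):

	static bool page_copy_sane(struct page *page, size_t offset, size_t n)
	{
		struct page *head = compound_head(page);
		size_t v = n + offset + page_address(page) - page_address(head);

		if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
			return true;
		WARN_ON(1);
		return false;
	}

Without __GFP_COMP the extra pages are not marked as part of a compound
page, so compound_order(head) is 0 and any copy larger than PAGE_SIZE
fails the check.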

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
---
 net/core/filter.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d301134..0b40f95 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
 	if (unlikely(bytes_sg_total > copy))
 		return -EINVAL;
 
-	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
+	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
+			   get_order(copy));
 	if (unlikely(!page))
 		return -ENOMEM;
 	p = page_address(page);
-- 
1.8.3.1


* [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-09-11 19:37 [PATCH net-next 0/5] eBPF and struct scatterlist Tushar Dave
  2018-09-11 19:38 ` [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page Tushar Dave
@ 2018-09-11 19:38 ` Tushar Dave
  2018-09-12  3:57   ` Alexei Starovoitov
  2018-09-11 19:38 ` [PATCH net-next 3/5] ebpf: Add sg_filter_run() Tushar Dave
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

Add a new eBPF prog type, BPF_PROG_TYPE_SOCKET_SG_FILTER, which uses the
existing socket filter infrastructure for bpf program attach and load.
A SOCKET_SG_FILTER eBPF program receives a struct scatterlist as its bpf
context, in contrast to SOCKET_FILTER, which deals with struct sk_buff.
This is useful for kernel entities that don't have an skb to represent
packet data but want to run an eBPF socket filter on packet data that is
in the form of a struct scatterlist, e.g. IB/RDMA.
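
From userspace the load/attach flow is unchanged; a rough sketch (the
sample in patch 5 goes through samples/bpf/bpf_load.c, but the
equivalent with the tools/lib/bpf helper would be):

	char log_buf[65536];
	int sock, prog_fd;

	/* load with the new program type ... */
	prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_SG_FILTER, insns,
				   insns_cnt, "GPL", 0, log_buf,
				   sizeof(log_buf));

	/* ... and attach via the existing socket filter sockopt */
	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
	setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
		   sizeof(prog_fd));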

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/bpf_types.h      |  1 +
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/syscall.c           |  1 +
 kernel/bpf/verifier.c          |  1 +
 net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
 samples/bpf/bpf_load.c         | 11 ++++++---
 tools/bpf/bpftool/prog.c       |  1 +
 tools/include/uapi/linux/bpf.h |  1 +
 tools/lib/bpf/libbpf.c         |  3 +++
 tools/lib/bpf/libbpf.h         |  2 ++
 10 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index cd26c09..7dc1503 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -16,6 +16,7 @@
 BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3c9636f..5f302b7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
 
 	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
 	    type != BPF_PROG_TYPE_CGROUP_SKB &&
+	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
 	    !capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f4ff0c5..17fc4d2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
 	case BPF_PROG_TYPE_LWT_XMIT:
 	case BPF_PROG_TYPE_SK_SKB:
 	case BPF_PROG_TYPE_SK_MSG:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		if (meta)
 			return meta->pkt_access;
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 0b40f95..469c488 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
 
 static void __bpf_prog_release(struct bpf_prog *prog)
 {
-	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
+	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
+	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
 		bpf_prog_put(prog);
 	} else {
 		bpf_release_orig_filter(prog);
@@ -1539,10 +1540,16 @@ int sk_reuseport_attach_filter(struct sock_fprog *fprog, struct sock *sk)
 
 static struct bpf_prog *__get_bpf(u32 ufd, struct sock *sk)
 {
+	struct bpf_prog *prog;
+
 	if (sock_flag(sk, SOCK_FILTER_LOCKED))
 		return ERR_PTR(-EPERM);
 
-	return bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_FILTER);
+	if (IS_ERR(prog))
+		prog = bpf_prog_get_type(ufd, BPF_PROG_TYPE_SOCKET_SG_FILTER);
+
+	return prog;
 }
 
 int sk_attach_bpf(u32 ufd, struct sock *sk)
@@ -4935,6 +4942,17 @@ bool bpf_helper_changes_pkt_data(void *func)
 }
 
 static const struct bpf_func_proto *
+socksg_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	switch (func_id) {
+	case BPF_FUNC_msg_pull_data:
+		return &bpf_msg_pull_data_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
@@ -6753,6 +6771,30 @@ static u32 sk_skb_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 socksg_filter_convert_ctx_access(enum bpf_access_type type,
+					    const struct bpf_insn *si,
+					    struct bpf_insn *insn_buf,
+					    struct bpf_prog *prog,
+					    u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct sk_msg_md, data):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data));
+		break;
+	case offsetof(struct sk_msg_md, data_end):
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_msg_buff, data_end),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct sk_msg_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 				     const struct bpf_insn *si,
 				     struct bpf_insn *insn_buf,
@@ -6891,6 +6933,15 @@ static u32 sk_msg_convert_ctx_access(enum bpf_access_type type,
 	.test_run		= bpf_prog_test_run_skb,
 };
 
+const struct bpf_verifier_ops socksg_filter_verifier_ops = {
+	.get_func_proto         = socksg_filter_func_proto,
+	.is_valid_access        = sk_msg_is_valid_access,
+	.convert_ctx_access     = socksg_filter_convert_ctx_access,
+};
+
+const struct bpf_prog_ops socksg_filter_prog_ops = {
+};
+
 const struct bpf_verifier_ops tc_cls_act_verifier_ops = {
 	.get_func_proto		= tc_cls_act_func_proto,
 	.is_valid_access	= tc_cls_act_is_valid_access,
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 904e775..3b1697d 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -69,6 +69,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
 	bool is_sk_msg = strncmp(event, "sk_msg", 6) == 0;
+	bool is_socksg = strncmp(event, "socksg", 6) == 0;
+
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -102,6 +104,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SK_SKB;
 	} else if (is_sk_msg) {
 		prog_type = BPF_PROG_TYPE_SK_MSG;
+	} else if (is_socksg) {
+		prog_type = BPF_PROG_TYPE_SOCKET_SG_FILTER;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -122,8 +126,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket || is_sockops || is_sk_skb || is_sk_msg) {
-		if (is_socket)
+	if (is_socket || is_sockops || is_sk_skb || is_sk_msg || is_socksg) {
+		if (is_socket || is_socksg)
 			event += 6;
 		else
 			event += 7;
@@ -627,7 +631,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "cgroup/", 7) == 0 ||
 		    memcmp(shname, "sockops", 7) == 0 ||
 		    memcmp(shname, "sk_skb", 6) == 0 ||
-		    memcmp(shname, "sk_msg", 6) == 0) {
+		    memcmp(shname, "sk_msg", 6) == 0 ||
+		    memcmp(shname, "socksg", 6) == 0) {
 			ret = load_and_attach(shname, data->d_buf,
 					      data->d_size);
 			if (ret != 0)
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index dce960d..9c57c4e 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -74,6 +74,7 @@
 	[BPF_PROG_TYPE_RAW_TRACEPOINT]	= "raw_tracepoint",
 	[BPF_PROG_TYPE_CGROUP_SOCK_ADDR] = "cgroup_sock_addr",
 	[BPF_PROG_TYPE_LIRC_MODE2]	= "lirc_mode2",
+	[BPF_PROG_TYPE_SOCKET_SG_FILTER] = "socket_sg_filter",
 };
 
 static void print_boot_time(__u64 nsecs, char *buf, unsigned int size)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 66917a4..6ec1e32 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -152,6 +152,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_SEG6LOCAL,
 	BPF_PROG_TYPE_LIRC_MODE2,
 	BPF_PROG_TYPE_SK_REUSEPORT,
+	BPF_PROG_TYPE_SOCKET_SG_FILTER,
 };
 
 enum bpf_attach_type {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abd0f1..a7ac51c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1502,6 +1502,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_LIRC_MODE2:
 	case BPF_PROG_TYPE_SK_REUSEPORT:
+	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
 		return false;
 	case BPF_PROG_TYPE_UNSPEC:
 	case BPF_PROG_TYPE_KPROBE:
@@ -2077,6 +2078,7 @@ static bool bpf_program__is_type(struct bpf_program *prog,
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(socket_sg_filter, BPF_PROG_TYPE_SOCKET_SG_FILTER);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -2129,6 +2131,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 	BPF_SA_PROG_SEC("cgroup/sendmsg6", BPF_CGROUP_UDP6_SENDMSG),
 	BPF_S_PROG_SEC("cgroup/post_bind4", BPF_CGROUP_INET4_POST_BIND),
 	BPF_S_PROG_SEC("cgroup/post_bind6", BPF_CGROUP_INET6_POST_BIND),
+	BPF_PROG_SEC("socksg",          BPF_PROG_TYPE_SOCKET_SG_FILTER),
 };
 
 #undef BPF_PROG_SEC
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 96c55fa..7527ea4 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -208,6 +208,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
 void bpf_program__set_type(struct bpf_program *prog, enum bpf_prog_type type);
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type);
+int bpf_program__set_socket_sg_filter(struct bpf_program *prog);
 
 bool bpf_program__is_socket_filter(struct bpf_program *prog);
 bool bpf_program__is_tracepoint(struct bpf_program *prog);
@@ -217,6 +218,7 @@ void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 bool bpf_program__is_sched_act(struct bpf_program *prog);
 bool bpf_program__is_xdp(struct bpf_program *prog);
 bool bpf_program__is_perf_event(struct bpf_program *prog);
+bool bpf_program__is_socket_sg_filter(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
-- 
1.8.3.1


* [PATCH net-next 3/5] ebpf: Add sg_filter_run()
  2018-09-11 19:37 [PATCH net-next 0/5] eBPF and struct scatterlist Tushar Dave
  2018-09-11 19:38 ` [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page Tushar Dave
  2018-09-11 19:38 ` [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
@ 2018-09-11 19:38 ` Tushar Dave
  2018-09-12  3:58   ` Alexei Starovoitov
  2018-09-11 19:38 ` [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket Tushar Dave
  2018-09-11 19:38 ` [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
  4 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

When sg_filter_run() is invoked, it runs the attached eBPF prog of
type BPF_PROG_TYPE_SOCKET_SG_FILTER, which deals with struct
scatterlist.
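
For reference, the intended call pattern, as wired up for RDS in
patch 4, is:

	rcu_read_lock();
	filter = rcu_dereference(sk->sk_filter);
	if (filter && conn->c_trans->inc_to_sg_get) {
		struct scatterlist *sg = NULL;

		/* ask the transport for the packet as a scatterlist */
		if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0)
			result = sg_filter_run(sk, sg);
	}
	rcu_read_unlock();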

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 include/linux/filter.h         |  8 ++++++++
 include/uapi/linux/bpf.h       |  6 ++++++
 net/core/filter.c              | 35 +++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  6 ++++++
 4 files changed, 55 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6791a0a..ae664a9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
 					 */
 };
 
+enum __socksg_action {
+	__SOCKSG_PASS = 0,
+	__SOCKSG_DROP,
+	__SOCKSG_REDIRECT,
+};
+
+int sg_filter_run(struct sock *sk, struct scatterlist *sg);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6ec1e32..1e11789 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_PASS = 0,
+	SOCKSG_DROP,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index 469c488..a3afc61 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -121,6 +121,41 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
 }
 EXPORT_SYMBOL(sk_filter_trim_cap);
 
+int sg_filter_run(struct sock *sk, struct scatterlist *sg)
+{
+	struct sk_filter *filter;
+	int result = 0;
+
+	if (!sg)
+		return result;
+
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter) {
+		struct sk_msg_buff mb = {0};
+
+		memcpy(mb.sg_data, sg, sizeof(*sg) * MAX_SKB_FRAGS);
+		mb.sg_start = 0;
+		mb.sg_end = sg_nents(sg);
+		mb.data = sg_virt(sg);
+		mb.data_end = mb.data + sg->length;
+		mb.sg_copy[mb.sg_end - 1] = true;
+
+		result = BPF_PROG_RUN(filter->prog, &mb);
+
+		/* BPF prog may have changed mb.sg_data e.g. may linearize
+		 * multiple scatterlist entries into one. Therefore, make sure
+		 * to update original sg and mark the sg end.
+		 */
+		memcpy(sg, mb.sg_data, sizeof(*sg) * MAX_SKB_FRAGS);
+		sg_mark_end(&sg[mb.sg_end - 1]);
+	}
+	rcu_read_unlock();
+
+	return result;
+}
+EXPORT_SYMBOL(sg_filter_run);
+
 BPF_CALL_1(bpf_skb_get_pay_offset, struct sk_buff *, skb)
 {
 	return skb_get_poff(skb);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6ec1e32..1e11789 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2428,6 +2428,12 @@ enum sk_action {
 	SK_PASS,
 };
 
+enum socksg_action {
+	SOCKSG_PASS = 0,
+	SOCKSG_DROP,
+	SOCKSG_REDIRECT,
+};
+
 /* user accessible metadata for SK_MSG packet hook, new fields must
  * be added to the end of this structure
  */
-- 
1.8.3.1


* [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
  2018-09-11 19:37 [PATCH net-next 0/5] eBPF and struct scatterlist Tushar Dave
                   ` (2 preceding siblings ...)
  2018-09-11 19:38 ` [PATCH net-next 3/5] ebpf: Add sg_filter_run() Tushar Dave
@ 2018-09-11 19:38 ` Tushar Dave
  2018-09-11 21:06   ` santosh.shilimkar
  2018-09-11 19:38 ` [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
  4 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
messages arrive either in the form of an skb (over TCP) or a scatterlist
(over IB/RDMA). However, because the socket filter only deals with skbs
(i.e. struct sk_buff as the bpf context), we can only use a socket
filter for rds_tcp and not for rds_rdma.

Considering one filtering solution for RDS, the common denominator
between sk_buff and scatterlist is the scatterlist. Therefore, this
patch converts the skb to an sgvec and invokes sg_filter_run() for
rds_tcp, and simply invokes sg_filter_run() for IB/rds_rdma.
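
For the TCP path, the conversion boils down to the following (condensed
from rds_tcp_inc_to_sg_get() below):

	sg_init_table(tinc->sg, MAX_SG);
	skb_queue_walk(&tinc->ti_skb_list, skb) {
		num_sg += skb_to_sgvec_nomark(skb, &tinc->sg[num_sg], 0,
					      skb->len);
	}
	sg_mark_end(&tinc->sg[num_sg - 1]);

For the IB path no conversion is needed; rds_ib_inc_to_sg_get() simply
hands back the existing fragment scatterlist.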

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/ib.c       |   1 +
 net/rds/ib.h       |   1 +
 net/rds/ib_recv.c  |  12 ++++++
 net/rds/rds.h      |   1 +
 net/rds/recv.c     |  12 ++++++
 net/rds/tcp.c      |   1 +
 net/rds/tcp.h      |   2 +
 net/rds/tcp_recv.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 8 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index eba75c1..6c40652 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -527,6 +527,7 @@ struct rds_transport rds_ib_transport = {
 	.conn_path_shutdown	= rds_ib_conn_path_shutdown,
 	.inc_copy_to_user	= rds_ib_inc_copy_to_user,
 	.inc_free		= rds_ib_inc_free,
+	.inc_to_sg_get		= rds_ib_inc_to_sg_get,
 	.cm_initiate_connect	= rds_ib_cm_initiate_connect,
 	.cm_handle_connect	= rds_ib_cm_handle_connect,
 	.cm_connect_complete	= rds_ib_cm_connect_complete,
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 73427ff..0a12b41 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -404,6 +404,7 @@ int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev,
 void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
 void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
 			     struct rds_ib_ack_state *state);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 2f16146..0054c7c 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -219,6 +219,18 @@ void rds_ib_inc_free(struct rds_incoming *inc)
 	rds_ib_recv_cache_put(&ibinc->ii_cache_entry, &ic->i_cache_incs);
 }
 
+int rds_ib_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_ib_incoming *ibinc;
+	struct rds_page_frag *frag;
+
+	ibinc = container_of(inc, struct rds_ib_incoming, ii_inc);
+	frag = list_entry(ibinc->ii_frags.next, struct rds_page_frag, f_item);
+	*sg =  &frag->f_sg;
+
+	return 0;
+}
+
 static void rds_ib_recv_clear_one(struct rds_ib_connection *ic,
 				  struct rds_ib_recv_work *recv)
 {
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 6bfaf05..9f3e4df 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -542,6 +542,7 @@ struct rds_transport {
 	int (*recv_path)(struct rds_conn_path *cp);
 	int (*inc_copy_to_user)(struct rds_incoming *inc, struct iov_iter *to);
 	void (*inc_free)(struct rds_incoming *inc);
+	int (*inc_to_sg_get)(struct rds_incoming *inc, struct scatterlist **sg);
 
 	int (*cm_handle_connect)(struct rdma_cm_id *cm_id,
 				 struct rdma_cm_event *event, bool isv6);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index 1271965..424042e 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -290,6 +290,8 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	struct sock *sk;
 	unsigned long flags;
 	struct rds_conn_path *cp;
+	struct sk_filter *filter;
+	int result = __SOCKSG_PASS;
 
 	inc->i_conn = conn;
 	inc->i_rx_jiffies = jiffies;
@@ -374,6 +376,16 @@ void rds_recv_incoming(struct rds_connection *conn, struct in6_addr *saddr,
 	/* We can be racing with rds_release() which marks the socket dead. */
 	sk = rds_rs_to_sk(rs);
 
+	rcu_read_lock();
+	filter = rcu_dereference(sk->sk_filter);
+	if (filter && conn->c_trans->inc_to_sg_get) {
+		struct scatterlist *sg = NULL;
+
+		if (conn->c_trans->inc_to_sg_get(inc, &sg) == 0)
+			result = sg_filter_run(sk, sg);
+	}
+	rcu_read_unlock();
+
 	/* serialize with rds_release -> sock_orphan */
 	write_lock_irqsave(&rs->rs_recv_lock, flags);
 	if (!sock_flag(sk, SOCK_DEAD)) {
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index b9bbcf3..b0683e6 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -464,6 +464,7 @@ struct rds_transport rds_tcp_transport = {
 	.conn_path_shutdown	= rds_tcp_conn_path_shutdown,
 	.inc_copy_to_user	= rds_tcp_inc_copy_to_user,
 	.inc_free		= rds_tcp_inc_free,
+	.inc_to_sg_get		= rds_tcp_inc_to_sg_get,
 	.stats_info_copy	= rds_tcp_stats_info_copy,
 	.exit			= rds_tcp_exit,
 	.t_owner		= THIS_MODULE,
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361..e4ea16e 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -7,6 +7,7 @@
 struct rds_tcp_incoming {
 	struct rds_incoming	ti_inc;
 	struct sk_buff_head	ti_skb_list;
+	struct scatterlist	*sg;
 };
 
 struct rds_tcp_connection {
@@ -82,6 +83,7 @@ void rds_tcp_restore_callbacks(struct socket *sock,
 int rds_tcp_recv_path(struct rds_conn_path *cp);
 void rds_tcp_inc_free(struct rds_incoming *inc);
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg);
 
 /* tcp_send.c */
 void rds_tcp_xmit_path_prepare(struct rds_conn_path *cp);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 42c5ff1..22d84f2 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -50,14 +50,113 @@ static void rds_tcp_inc_purge(struct rds_incoming *inc)
 void rds_tcp_inc_free(struct rds_incoming *inc)
 {
 	struct rds_tcp_incoming *tinc;
+	int i;
+
 	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
 	rds_tcp_inc_purge(inc);
+
+	if (tinc->sg) {
+		for (i = 0; i < sg_nents(tinc->sg); i++) {
+			struct page *page;
+
+			page = sg_page(&tinc->sg[i]);
+			put_page(page);
+		}
+		kfree(tinc->sg);
+	}
+
 	rdsdebug("freeing tinc %p inc %p\n", tinc, inc);
 	kmem_cache_free(rds_tcp_incoming_slab, tinc);
 }
 
+#define MAX_SG MAX_SKB_FRAGS
+int rds_tcp_inc_to_sg_get(struct rds_incoming *inc, struct scatterlist **sg)
+{
+	struct rds_tcp_incoming *tinc;
+	struct sk_buff *skb;
+	int num_sg = 0;
+	int i;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+
+	/* For now we are assuming that the max sg elements we need is MAX_SG.
+	 * To determine actual number of sg elements we need to traverse the
+	 * skb queue e.g.
+	 *
+	 * skb_queue_walk(&tinc->ti_skb_list, skb) {
+	 *	num_sg += skb_shinfo(skb)->nr_frags + 1;
+	 * }
+	 */
+	tinc->sg = kzalloc(sizeof(*tinc->sg) * MAX_SG, GFP_KERNEL);
+	if (!tinc->sg)
+		return -ENOMEM;
+
+	sg_init_table(tinc->sg, MAX_SG);
+	skb_queue_walk(&tinc->ti_skb_list, skb) {
+		num_sg += skb_to_sgvec_nomark(skb, &tinc->sg[num_sg], 0,
+					      skb->len);
+	}
+
+	/* packet can have zero length */
+	if (num_sg <= 0) {
+		kfree(tinc->sg);
+		tinc->sg = NULL;
+		return -ENODATA;
+	}
+
+	sg_mark_end(&tinc->sg[num_sg - 1]);
+	*sg = tinc->sg;
+
+	for (i = 0; i < num_sg; i++)
+		get_page(sg_page(&tinc->sg[i]));
+
+	return 0;
+}
+
+static int rds_tcp_inc_copy_sg_to_user(struct rds_incoming *inc,
+				       struct iov_iter *to)
+{
+	struct rds_tcp_incoming *tinc;
+	struct scatterlist *sg;
+	unsigned long copied = 0;
+	unsigned long len;
+	u8 i = 0;
+
+	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
+	len = be32_to_cpu(inc->i_hdr.h_len);
+	sg = tinc->sg;
+
+	do {
+		struct page *page;
+		unsigned long n, copy, to_copy;
+
+		sg = &tinc->sg[i];
+		copy = sg->length;
+		page = sg_page(sg);
+		to_copy = iov_iter_count(to);
+		to_copy = min_t(unsigned long, to_copy, copy);
+
+		n = copy_page_to_iter(page, sg->offset, to_copy, to);
+		if (n != copy)
+			return -EFAULT;
+
+		rds_stats_add(s_copy_to_user, to_copy);
+		copied += to_copy;
+		sg->offset += to_copy;
+		sg->length -= to_copy;
+
+		if (!sg->length)
+			i++;
+
+		if (copied == len)
+			break;
+	} while (i != sg_nents(tinc->sg));
+	return copied;
+}
 /*
- * this is pretty lame, but, whatever.
+ * This is pretty lame, but, whatever.
+ * Note: bpf filter can change RDS packet and if so then the modified packet is
+ * contained in the form of scatterlist, not skb.
  */
 int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
 {
@@ -70,6 +169,12 @@ int rds_tcp_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to)
 
 	tinc = container_of(inc, struct rds_tcp_incoming, ti_inc);
 
+	/* if tinc->sg is not NULL means bpf filter ran on packet and so packet
+	 * now is in the form of scatterlist.
+	 */
+	if (tinc->sg)
+		return rds_tcp_inc_copy_sg_to_user(inc, to);
+
 	skb_queue_walk(&tinc->ti_skb_list, skb) {
 		unsigned long to_copy, skb_off;
 		for (skb_off = 0; skb_off < skb->len; skb_off += to_copy) {
@@ -176,6 +281,7 @@ static int rds_tcp_data_recv(read_descriptor_t *desc, struct sk_buff *skb,
 				desc->error = -ENOMEM;
 				goto out;
 			}
+			tinc->sg = NULL;
 			tc->t_tinc = tinc;
 			rdsdebug("alloced tinc %p\n", tinc);
 			rds_inc_path_init(&tinc->ti_inc, cp,
-- 
1.8.3.1


* [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-11 19:37 [PATCH net-next 0/5] eBPF and struct scatterlist Tushar Dave
                   ` (3 preceding siblings ...)
  2018-09-11 19:38 ` [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket Tushar Dave
@ 2018-09-11 19:38 ` Tushar Dave
  2018-09-12  4:00   ` Alexei Starovoitov
  4 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-11 19:38 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

Add a sample program that shows how a socksg program is used and
attached as a socket filter. The kernel sample program deals with the
struct scatterlist that is passed as the bpf context.

When run in server mode, the sample RDS program opens a PF_RDS socket
and attaches the eBPF program to the RDS socket; the program then uses
the bpf_msg_pull_data helper to inspect packet data contained in the
struct scatterlist and returns the appropriate action code back to the
kernel.

To ease testing, RDS client functionality is also added so that users
can generate RDS packets.

Server:
[root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
running server in a loop
transport tcp
server bound to address: 192.168.3.71 port 4000
server listening on 192.168.3.71

Client:
[root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
transport tcp
client bound to address: 192.168.3.70 port 25278
client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...

Server output:
192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
on port 25278
payload contains:30 31 32 33 34 35 36 37 38 39 ...
server listening on 192.168.3.71

[root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
          <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
          <idle>-0     [038] ..s.   146.947364: 0: 33 34 35

Similarly specifying '-t ib' will run this on IB link.

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 samples/bpf/Makefile          |   3 +
 samples/bpf/rds_filter_kern.c |  42 ++++++
 samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 384 insertions(+)
 create mode 100644 samples/bpf/rds_filter_kern.c
 create mode 100644 samples/bpf/rds_filter_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..bbac5ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += rds_filter
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+rds_filter-objs := bpf_load.o rds_filter_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -167,6 +169,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += rds_filter_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/rds_filter_kern.c b/samples/bpf/rds_filter_kern.c
new file mode 100644
index 0000000..633e687
--- /dev/null
+++ b/samples/bpf/rds_filter_kern.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/filter.h>
+#include <linux/ptrace.h>
+#include <linux/version.h>
+#include <uapi/linux/bpf.h>
+#include <linux/rds.h>
+#include "bpf_helpers.h"
+
+#define bpf_printk(fmt, ...)				\
+({							\
+	char ____fmt[] = fmt;				\
+	bpf_trace_printk(____fmt, sizeof(____fmt),	\
+			##__VA_ARGS__);			\
+})
+
+SEC("socksg")
+int main_prog(struct sk_msg_md *msg)
+{
+	int start, end, err;
+	unsigned char *d;
+
+	start = 0;
+	end = 6;
+
+	err = bpf_msg_pull_data(msg, start, end, 0);
+	if (err) {
+		bpf_printk("socksg: pull_data err %i\n", err);
+		return SOCKSG_PASS;
+	}
+
+	if (msg->data + 6 > msg->data_end)
+		return SOCKSG_PASS;
+
+	d = (unsigned char *)msg->data;
+	bpf_printk("%x %x %x\n", d[0], d[1], d[2]);
+	bpf_printk("%x %x %x\n", d[3], d[4], d[5]);
+
+	return SOCKSG_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/rds_filter_user.c b/samples/bpf/rds_filter_user.c
new file mode 100644
index 0000000..1186345
--- /dev/null
+++ b/samples/bpf/rds_filter_user.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <arpa/inet.h>
+#include <assert.h>
+#include "bpf_load.h"
+#include <getopt.h>
+#include <errno.h>
+#include <netinet/in.h>
+#include <limits.h>
+#include <linux/sockios.h>
+#include <linux/rds.h>
+#include <linux/errqueue.h>
+#include <linux/bpf.h>
+#include <strings.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <string.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#define TESTPORT	4000
+#define BUFSIZE		8192
+
+int transport = -1;
+
+static int str2trans(const char *trans)
+{
+	if (strcmp(trans, "tcp") == 0)
+		return RDS_TRANS_TCP;
+	if (strcmp(trans, "ib") == 0)
+		return RDS_TRANS_IB;
+	return (RDS_TRANS_NONE);
+}
+
+static const char *trans2str(int trans)
+{
+	switch (trans) {
+	case RDS_TRANS_TCP:
+		return ("tcp");
+	case RDS_TRANS_IB:
+		return ("ib");
+	case RDS_TRANS_NONE:
+		return ("none");
+	default:
+		return ("unknown");
+	}
+}
+
+static int gettransport(int sock)
+{
+	int err;
+	char val;
+	socklen_t len = sizeof(int);
+
+	err = getsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&val, &len);
+	if (err < 0) {
+		fprintf(stderr, "%s: getsockopt %s\n",
+			__func__, strerror(errno));
+		return err;
+	}
+	return (int)val;
+}
+
+static int settransport(int sock, int transport)
+{
+	int err;
+
+	err = setsockopt(sock, SOL_RDS, SO_RDS_TRANSPORT,
+			 (char *)&transport, sizeof(transport));
+	if (err < 0) {
+		fprintf(stderr, "could not set transport %s, %s\n",
+			trans2str(transport), strerror(errno));
+	}
+	return err;
+}
+
+static void print_sock_local_info(int fd, char *str, struct sockaddr_in *ret)
+{
+	socklen_t sin_size = sizeof(struct sockaddr_in);
+	struct sockaddr_in sin;
+	int err;
+
+	err = getsockname(fd, (struct sockaddr *)&sin, &sin_size);
+	if (err < 0) {
+		fprintf(stderr, "%s getsockname %s\n",
+			__func__, strerror(errno));
+		return;
+	}
+	printf("%s address: %s port %d\n",
+		(str ? str : ""), inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
+
+	if (ret != NULL)
+		*ret = sin;
+}
+
+static void print_payload(char *buf)
+{
+	int i;
+
+	printf("payload contains:");
+	for (i = 0; i < 10; i++)
+		printf("%x ", buf[i]);
+	printf("...\n");
+}
+
+static void server(char *address, in_port_t port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	struct iovec *iov;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(address);
+	sin.sin_port = htons(port);
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	/* attach bpf prog */
+	assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
+			  sizeof(prog_fd[0])) == 0);
+
+	print_sock_local_info(sock, "server bound to", NULL);
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	while (1) {
+		memset(buf, 0, BUFSIZE);
+		iov[0].iov_base = buf;
+		iov[0].iov_len = BUFSIZE;
+
+		memset(&msg, 0, sizeof(msg));
+		msg.msg_name = &din;
+		msg.msg_namelen = sizeof(din);
+		msg.msg_iov = iov;
+		msg.msg_iovlen = 1;
+
+		printf("server listening on %s\n", inet_ntoa(sin.sin_addr));
+
+		rc = recvmsg(sock, &msg, 0);
+		if (rc < 0) {
+			fprintf(stderr, "%s: recvmsg %s\n",
+				__func__, strerror(errno));
+			break;
+		}
+
+		printf("%s received a packet from %s of len %d cmsg len %d, on port %d\n",
+			inet_ntoa(sin.sin_addr),
+			inet_ntoa(din.sin_addr),
+			(uint32_t) iov[0].iov_len,
+			(uint32_t) msg.msg_controllen,
+			ntohs(din.sin_port));
+
+		print_payload(buf);
+	}
+	free(iov);
+out:
+	free(buf);
+}
+
+static void create_message(char *buf)
+{
+	unsigned int i;
+
+	for (i = 0; i < BUFSIZE; i++) {
+		buf[i] = i + 0x30;
+	}
+}
+
+static int build_rds_packet(struct msghdr *msg, char *buf)
+{
+	struct iovec *iov;
+
+	iov = calloc(1, sizeof(struct iovec));
+	if (!iov) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return -1;
+	}
+
+	msg->msg_iov = iov;
+	msg->msg_iovlen = 1;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = BUFSIZE * sizeof(char);
+
+	return 0;
+}
+
+static void client(char *localaddr, char *remoteaddr, in_port_t server_port)
+{
+	struct sockaddr_in sin, din;
+	struct msghdr msg;
+	int rc, sock;
+	char *buf;
+
+	buf = calloc(BUFSIZE, sizeof(char));
+	if (!buf) {
+		fprintf(stderr, "%s: calloc %s\n", __func__, strerror(errno));
+		return;
+	}
+
+	create_message(buf);
+
+	sock = socket(PF_RDS, SOCK_SEQPACKET, 0);
+	if (sock < 0) {
+		fprintf(stderr, "%s: socket %s\n", __func__, strerror(errno));
+		goto out;
+	}
+
+	if (settransport(sock, transport) < 0)
+		goto out;
+
+	printf("transport %s\n", trans2str(gettransport(sock)));
+
+	memset(&sin, 0, sizeof(sin));
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = inet_addr(localaddr);
+	sin.sin_port = 0;
+
+	rc = bind(sock, (struct sockaddr *)&sin, sizeof(sin));
+	if (rc < 0) {
+		fprintf(stderr, "%s: bind %s\n", __func__, strerror(errno));
+		goto out;
+	}
+	print_sock_local_info(sock, "client bound to",  &sin);
+
+	memset(&msg, 0, sizeof(msg));
+	msg.msg_name = &din;
+	msg.msg_namelen = sizeof(din);
+
+	memset(&din, 0, sizeof(din));
+	din.sin_family = AF_INET;
+	din.sin_addr.s_addr = inet_addr(remoteaddr);
+	din.sin_port = htons(server_port);
+
+	rc = build_rds_packet(&msg, buf);
+	if (rc < 0)
+		goto out;
+
+	printf("client sending %d byte message from %s to %s on port %d\n",
+		(uint32_t) msg.msg_iov->iov_len, localaddr,
+		remoteaddr, ntohs(sin.sin_port));
+
+	rc = sendmsg(sock, &msg, 0);
+	if (rc < 0)
+		fprintf(stderr, "%s: sendmsg %s\n", __func__, strerror(errno));
+
+	print_payload(buf);
+
+	if (msg.msg_control)
+		free(msg.msg_control);
+	if (msg.msg_iov)
+		free(msg.msg_iov);
+out:
+	free(buf);
+
+	return;
+}
+
+static void usage(char *progname)
+{
+	fprintf(stderr, "Usage %s [-s srvaddr] [-c clientaddr] [-t transport]"
+		"\n", progname);
+}
+
+int main(int argc, char **argv)
+{
+	in_port_t server_port = TESTPORT;
+	char *serveraddr = NULL;
+	char *clientaddr = NULL;
+	char filename[256];
+	int opt;
+
+	while ((opt = getopt(argc, argv, "s:c:t:")) != -1) {
+		switch (opt) {
+		case 's':
+			serveraddr = optarg;
+			break;
+		case 'c':
+			clientaddr = optarg;
+			break;
+		case 't':
+			transport = str2trans(optarg);
+			if (transport == RDS_TRANS_NONE) {
+				fprintf(stderr,
+					"unknown transport %s\n", optarg);
+					usage(argv[0]);
+					return (-1);
+			}
+			break;
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(filename)) {
+		fprintf(stderr, "Error: load_bpf_file %s", bpf_log_buf);
+		return 1;
+	}
+
+	if (serveraddr && !clientaddr) {
+		printf("running server in a loop\n");
+		server(serveraddr, server_port);
+	} else if (serveraddr && clientaddr) {
+		client(clientaddr, serveraddr, server_port);
+	}
+
+	return 0;
+}
-- 
1.8.3.1


* Re: [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket
  2018-09-11 19:38 ` [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket Tushar Dave
@ 2018-09-11 21:06   ` santosh.shilimkar
  0 siblings, 0 replies; 22+ messages in thread
From: santosh.shilimkar @ 2018-09-11 21:06 UTC (permalink / raw)
  To: Tushar Dave, ast, daniel, davem, jakub.kicinski, quentin.monnet,
	jiong.wang, sandipan, john.fastabend, kafai, rdna, yhs, netdev,
	rds-devel, sowmini.varadhan

On 9/11/18 12:38 PM, Tushar Dave wrote:
> The RDS module sits on top of TCP (rds_tcp) and IB (rds_rdma), so
> messages arrive either in the form of an skb (over TCP) or a scatterlist
> (over IB/RDMA). However, because the socket filter only deals with skbs
> (i.e. struct sk_buff as the bpf context), we can only use a socket
> filter for rds_tcp and not for rds_rdma.
> 
> Considering one filtering solution for RDS, the common denominator
> between sk_buff and scatterlist is the scatterlist. Therefore, this
> patch converts the skb to an sgvec and invokes sg_filter_run() for
> rds_tcp, and simply invokes sg_filter_run() for IB/rds_rdma.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
I remember acking the earlier version. Here it is again...

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>


* Re: [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-09-11 19:38 ` [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
@ 2018-09-12  3:57   ` Alexei Starovoitov
  2018-09-12 19:25     ` Tushar Dave
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2018-09-12  3:57 UTC (permalink / raw)
  To: Tushar Dave
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

On Tue, Sep 11, 2018 at 09:38:01PM +0200, Tushar Dave wrote:
> Add a new eBPF prog type, BPF_PROG_TYPE_SOCKET_SG_FILTER, which uses the
> existing socket filter infrastructure for bpf program attach and load.
> A SOCKET_SG_FILTER eBPF program receives a struct scatterlist as its bpf
> context, in contrast to SOCKET_FILTER, which deals with struct sk_buff.
> This is useful for kernel entities that don't have an skb to represent
> packet data but want to run an eBPF socket filter on packet data that is
> in the form of a struct scatterlist, e.g. IB/RDMA.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/bpf_types.h      |  1 +
>  include/uapi/linux/bpf.h       |  1 +
>  kernel/bpf/syscall.c           |  1 +
>  kernel/bpf/verifier.c          |  1 +
>  net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
>  samples/bpf/bpf_load.c         | 11 ++++++---
>  tools/bpf/bpftool/prog.c       |  1 +
>  tools/include/uapi/linux/bpf.h |  1 +
>  tools/lib/bpf/libbpf.c         |  3 +++
>  tools/lib/bpf/libbpf.h         |  2 ++

please do not mix core kernel and user space into a single patch.
split the tools/include/uapi/linux/bpf.h sync into a separate patch
and the changes to tools/lib/bpf into yet another patch.

>  10 files changed, 72 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index cd26c09..7dc1503 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -16,6 +16,7 @@
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>  BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>  #endif
>  #ifdef CONFIG_BPF_EVENTS
>  BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 66917a4..6ec1e32 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>  	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>  	BPF_PROG_TYPE_LIRC_MODE2,
>  	BPF_PROG_TYPE_SK_REUSEPORT,
> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>  };
>  
>  enum bpf_attach_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 3c9636f..5f302b7 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>  
>  	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>  	    type != BPF_PROG_TYPE_CGROUP_SKB &&
> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&

I'm not comfortable letting unpriv use this right away.
Can you live with root-only?

>  	    !capable(CAP_SYS_ADMIN))
>  		return -EPERM;
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f4ff0c5..17fc4d2 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>  	case BPF_PROG_TYPE_LWT_XMIT:
>  	case BPF_PROG_TYPE_SK_SKB:
>  	case BPF_PROG_TYPE_SK_MSG:
> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>  		if (meta)
>  			return meta->pkt_access;
>  
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0b40f95..469c488 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>  
>  static void __bpf_prog_release(struct bpf_prog *prog)
>  {
> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>  		bpf_prog_put(prog);

this doesn't look right.
Why is this needed?
Are you using the old-style setsockopt to attach?
I think the new style of attaching that all bpf prog types that came
after socket_filter use is preferred.
Pls take a look at the BPF_PROG_ATTACH cmd.

Also it looks like the first patch doesn't really add the useful logic,
but adds a few lines of code here and there. Then more code comes in
patches 3 and 4. Please rearrange them so that they're reviewable as
logical pieces.


* Re: [PATCH net-next 3/5] ebpf: Add sg_filter_run()
  2018-09-11 19:38 ` [PATCH net-next 3/5] ebpf: Add sg_filter_run() Tushar Dave
@ 2018-09-12  3:58   ` Alexei Starovoitov
  2018-09-12 19:27     ` Tushar Dave
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2018-09-12  3:58 UTC (permalink / raw)
  To: Tushar Dave
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

On Tue, Sep 11, 2018 at 09:38:02PM +0200, Tushar Dave wrote:
> When sg_filter_run() is invoked it runs the attached eBPF
> prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
> struct scatterlist.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  include/linux/filter.h         |  8 ++++++++
>  include/uapi/linux/bpf.h       |  6 ++++++
>  net/core/filter.c              | 35 +++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  6 ++++++
>  4 files changed, 55 insertions(+)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 6791a0a..ae664a9 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
>  					 */
>  };
>  
> +enum __socksg_action {
> +	__SOCKSG_PASS = 0,
> +	__SOCKSG_DROP,
> +	__SOCKSG_REDIRECT,

what is this? I see no code that handles it either in this patch
or in the later patches?!


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-11 19:38 ` [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
@ 2018-09-12  4:00   ` Alexei Starovoitov
  2018-09-12 19:32     ` Tushar Dave
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2018-09-12  4:00 UTC (permalink / raw)
  To: Tushar Dave
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

On Tue, Sep 11, 2018 at 09:38:04PM +0200, Tushar Dave wrote:
> Add a sample program that shows how a socksg program is used and
> attached as a socket filter. The kernel sample program deals with the
> struct scatterlist that is passed as the bpf context.
> 
> When run in server mode, the sample RDS program opens a PF_RDS socket
> and attaches the eBPF program to the RDS socket; the program then uses
> the bpf_msg_pull_data helper to inspect packet data contained in the
> struct scatterlist and returns the appropriate action code back to the
> kernel.
> 
> To ease testing, RDS client functionality is also added so that users
> can generate RDS packets.
> 
> Server:
> [root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
> running server in a loop
> transport tcp
> server bound to address: 192.168.3.71 port 4000
> server listening on 192.168.3.71
> 
> Client:
> [root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
> transport tcp
> client bound to address: 192.168.3.70 port 25278
> client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
> port 25278
> payload contains:30 31 32 33 34 35 36 37 38 39 ...
> 
> Server output:
> 192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
> on port 25278
> payload contains:30 31 32 33 34 35 36 37 38 39 ...
> server listening on 192.168.3.71
> 
> [root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
>           <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
>           <idle>-0     [038] ..s.   146.947364: 0: 33 34 35
> 
> Similarly specifying '-t ib' will run this on IB link.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>  samples/bpf/Makefile          |   3 +
>  samples/bpf/rds_filter_kern.c |  42 ++++++
>  samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++

please no samples.
Add this as a proper test under tools/testing/selftests/bpf
that reports PASS/FAIL and can be run automatically.
samples/bpf is effectively dead code.


* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
  2018-09-11 19:38 ` [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page Tushar Dave
@ 2018-09-12 16:21   ` Tushar Dave
  2018-09-12 16:51     ` John Fastabend
  0 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-12 16:21 UTC (permalink / raw)
  To: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan



On 09/11/2018 12:38 PM, Tushar Dave wrote:
> Helper bpf_msg_pull_data() can allocate multiple pages while
> linearizing multiple scatterlist elements into one shared page.
> However, if the shared page has size > PAGE_SIZE, using
> copy_page_to_iter() on it triggers the warning below, e.g.:
> 
> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> To avoid this warning, use __GFP_COMP while allocating multiple
> contiguous pages.
> 
> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
> ---
>   net/core/filter.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index d301134..0b40f95 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>   	if (unlikely(bytes_sg_total > copy))
>   		return -EINVAL;
>   
> -	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
> +	page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
> +			   get_order(copy));
>   	if (unlikely(!page))
>   		return -ENOMEM;
>   	p = page_address(page);

I should have mentioned that I could re-order this patch anywhere in
the patch series (as long as it doesn't break git bisect). I kept it
first because I think it is more of a bug fix. I sent it along with
this patch series because the series provides the context of why and
where I need this patch.

Daniel, John,

Not sure if you guys have hit this page_copy_sane warning. I hit it
when RDS copies an sg page to userspace using copy_page_to_iter().

example:

RDS packet size 8KB represented in scatterlist:
sg_data[0].length = 1400
sg_data[1].length = 1448
sg_data[2].length = 1448
sg_data[3].length = 1448
sg_data[4].length = 1448
sg_data[5].length = 1000

If start=0 and end=8192, bpf_msg_pull_data() will linearize all
sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
Using this sg_data[0].page in function copy_page_to_iter() causes:
WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

(FYI, patch 4 has code that does copy_page_to_iter)
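
For context, the failing check is roughly the following; this is a
paraphrase of the lib/iov_iter.c check, not the exact code:

/* Rough paraphrase of the lib/iov_iter.c sanity check, for context. */
static bool page_copy_sane(struct page *page, size_t offset, size_t n)
{
	struct page *head = compound_head(page);
	size_t v = n + offset;

	/* Without __GFP_COMP the high-order allocation is not marked
	 * compound, so compound_order(head) is 0 and any copy larger
	 * than PAGE_SIZE (like the linearized 8192 bytes above) warns.
	 */
	if (likely(n <= v && v <= (PAGE_SIZE << compound_order(head))))
		return true;
	WARN_ON(1);
	return false;
}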


Comments?

Thanks in advance,
-Tushar


* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
  2018-09-12 16:21   ` Tushar Dave
@ 2018-09-12 16:51     ` John Fastabend
  2018-09-12 20:15       ` Tushar Dave
  0 siblings, 1 reply; 22+ messages in thread
From: John Fastabend @ 2018-09-12 16:51 UTC (permalink / raw)
  To: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan

On 09/12/2018 09:21 AM, Tushar Dave wrote:
> 
> 
> On 09/11/2018 12:38 PM, Tushar Dave wrote:
>> Helper bpf_msg_pull_data() can allocate multiple pages while
>> linearizing multiple scatterlist elements into one shared page.
>> However, if the shared page has size > PAGE_SIZE, using
>> copy_page_to_iter() causes the warning below:
>>
>> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>> page_copy_sane.part.8+0x0/0x8
>>
>> To avoid the above warning, use __GFP_COMP when allocating multiple
>> contiguous pages.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> ---
>>   net/core/filter.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index d301134..0b40f95 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>       if (unlikely(bytes_sg_total > copy))
>>           return -EINVAL;
>>   -    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>> +    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
>> +               get_order(copy));
>>       if (unlikely(!page))
>>           return -ENOMEM;
>>       p = page_address(page);
> 
> I should have mentioned that I could re-order this patch anywhere in
> the series (as long as it doesn't break git bisect). I kept it first
> because I think it is more of a bug fix. I sent it along with this
> series so that we have the context of why and for what I need this
> patch.
> 
> Daniel, John,
> 
> Not sure if you guys have hit this page_copy_sane warning. I hit it
> when RDS copies an sg page to userspace using copy_page_to_iter().
> 

I have not hit this before but I'm working on a set of patches for
test_sockmap to test the bpf_msg_pull_data() so I'll add a case
for this. Currently, we only test the simple case where we pull
data out of a single page in selftests. This was sufficient for
my use case but missed a handful of other valid cases.

> example:
> 
> RDS packet size 8KB represented in scatterlist:
> sg_data[0].length = 1400
> sg_data[1].length = 1448
> sg_data[2].length = 1448
> sg_data[3].length = 1448
> sg_data[4].length = 1448
> sg_data[5].length = 1000
> 
> If start=0 and end=8192, bpf_msg_pull_data() will linearize all
> sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
> Using this sg_data[0].page in function copy_page_to_iter() causes:
> WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
> page_copy_sane.part.8+0x0/0x8
> 
> (FYI, patch 4 has code that does copy_page_to_iter)
> 

How about sending it as a bugfix against bpf on its own. It
looks like we could reproduce this with a combination of
bpf_msg_pull_data() + redirect (to ingress) perhaps. Either
way seems like a candidate for the bpf fixes tree to me.
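
Something along these lines, perhaps; the map layout and key below are
illustrative, not from an existing test:

/* Sketch of the suggested repro: linearize across sg entries with
 * bpf_msg_pull_data(), then redirect to ingress. Map name and key
 * are assumptions.
 */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") sock_map = {
	.type = BPF_MAP_TYPE_SOCKMAP,
	.key_size = sizeof(int),
	.value_size = sizeof(int),
	.max_entries = 1,
};

SEC("sk_msg")
int msg_pull_then_redirect(struct sk_msg_md *msg)
{
	/* pull bytes [0, 8192): with MTU-sized sg entries this spans
	 * several elements and exercises the multi-page linearize path */
	bpf_msg_pull_data(msg, 0, 8192, 0);
	return bpf_msg_redirect_map(msg, &sock_map, 0, BPF_F_INGRESS);
}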

Thanks,
John

> 
> Comments?
> 
> Thanks in advance,
> -Tushar
> 
> 


* Re: [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
  2018-09-12  3:57   ` Alexei Starovoitov
@ 2018-09-12 19:25     ` Tushar Dave
  0 siblings, 0 replies; 22+ messages in thread
From: Tushar Dave @ 2018-09-12 19:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan



On 09/11/2018 08:57 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:01PM +0200, Tushar Dave wrote:
>> Add a new eBPF prog type, BPF_PROG_TYPE_SOCKET_SG_FILTER, which uses
>> the existing socket filter infrastructure for bpf program attach and
>> load. A SOCKET_SG_FILTER eBPF program receives struct scatterlist as
>> its bpf context, in contrast to SOCKET_FILTER, which deals with
>> struct skb. This is useful for kernel entities that don't have an skb
>> to represent packet data but want to run an eBPF socket filter on
>> packet data that is in the form of a struct scatterlist, e.g. IB/RDMA.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/bpf_types.h      |  1 +
>>   include/uapi/linux/bpf.h       |  1 +
>>   kernel/bpf/syscall.c           |  1 +
>>   kernel/bpf/verifier.c          |  1 +
>>   net/core/filter.c              | 55 ++++++++++++++++++++++++++++++++++++++++--
>>   samples/bpf/bpf_load.c         | 11 ++++++---
>>   tools/bpf/bpftool/prog.c       |  1 +
>>   tools/include/uapi/linux/bpf.h |  1 +
>>   tools/lib/bpf/libbpf.c         |  3 +++
>>   tools/lib/bpf/libbpf.h         |  2 ++
>

Alexei,

Thank you for reviewing the patches.

> please do not mix core kernel and user space into single patch.
> split tools/include/uapi/linux/bpf.h sync into separate patch
> and changes to tools/lib/bpf as yet another patch.

Sure, I can do that.

> 
>>   10 files changed, 72 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
>> index cd26c09..7dc1503 100644
>> --- a/include/linux/bpf_types.h
>> +++ b/include/linux/bpf_types.h
>> @@ -16,6 +16,7 @@
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_SKB, sk_skb)
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_SK_MSG, sk_msg)
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_SOCKET_SG_FILTER, socksg_filter)
>>   #endif
>>   #ifdef CONFIG_BPF_EVENTS
>>   BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 66917a4..6ec1e32 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -152,6 +152,7 @@ enum bpf_prog_type {
>>   	BPF_PROG_TYPE_LWT_SEG6LOCAL,
>>   	BPF_PROG_TYPE_LIRC_MODE2,
>>   	BPF_PROG_TYPE_SK_REUSEPORT,
>> +	BPF_PROG_TYPE_SOCKET_SG_FILTER,
>>   };
>>   
>>   enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 3c9636f..5f302b7 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1361,6 +1361,7 @@ static int bpf_prog_load(union bpf_attr *attr)
>>   
>>   	if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
>>   	    type != BPF_PROG_TYPE_CGROUP_SKB &&
>> +	    type != BPF_PROG_TYPE_SOCKET_SG_FILTER &&
> 
> I'm not comfortable letting unpriv use this right away.
> Can you live with root-only?

Honestly, I prepared this the same way we treat
BPF_PROG_TYPE_SOCKET_FILTER. But sure, I can live with root-only.

> 
>>   	    !capable(CAP_SYS_ADMIN))
>>   		return -EPERM;
>>   
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index f4ff0c5..17fc4d2 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1234,6 +1234,7 @@ static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
>>   	case BPF_PROG_TYPE_LWT_XMIT:
>>   	case BPF_PROG_TYPE_SK_SKB:
>>   	case BPF_PROG_TYPE_SK_MSG:
>> +	case BPF_PROG_TYPE_SOCKET_SG_FILTER:
>>   		if (meta)
>>   			return meta->pkt_access;
>>   
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 0b40f95..469c488 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -1140,7 +1140,8 @@ static void bpf_release_orig_filter(struct bpf_prog *fp)
>>   
>>   static void __bpf_prog_release(struct bpf_prog *prog)
>>   {
>> -	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER) {
>> +	if (prog->type == BPF_PROG_TYPE_SOCKET_FILTER ||
>> +	    prog->type == BPF_PROG_TYPE_SOCKET_SG_FILTER) {
>>   		bpf_prog_put(prog);
> 
> this doesn't look right.
> Why is this needed?
> Are you using old-style setsockopt to attach?

Yes.

> I think the new style of attaching, which all bpf prog types that came
> after socket_filter use, is preferred.
> Pls take a look at BPF_PROG_ATTACH cmd.

Well, I am not sure that is going to work.

I did it this way so I can attach an eBPF prog of type
BPF_PROG_TYPE_SOCKET_SG_FILTER to a socket, just as today we can attach
a regular socket filter bpf program (e.g. BPF_PROG_TYPE_SOCKET_FILTER)
to TCP and UDP sockets using setsockopt. The only difference between
them is that BPF_PROG_TYPE_SOCKET_FILTER deals with struct sk_buff
while BPF_PROG_TYPE_SOCKET_SG_FILTER deals with struct scatterlist.

e.g. setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, prog_fd,
sizeof(prog_fd[0]));
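
In full, that flow looks roughly like the sketch below; the RDS socket
setup is an assumption and error handling is trimmed:

/* Sketch: open an RDS socket and attach an already-loaded socksg
 * program via the classic SO_ATTACH_BPF path; bind() and error
 * handling are trimmed for brevity.
 */
#include <sys/socket.h>

int attach_socksg(int prog_fd)
{
	int sock = socket(PF_RDS, SOCK_SEQPACKET, 0);

	if (sock < 0)
		return -1;
	/* same setsockopt interface used for BPF_PROG_TYPE_SOCKET_FILTER */
	return setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF,
			  &prog_fd, sizeof(prog_fd));
}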


> 
> Also, it looks like the first patch doesn't really add the useful
> logic, but adds a few lines of code here and there. Then more code
> comes in patches 3 and 4. Please rearrange them so that they're
> reviewable as logical pieces.

okay.

-Tushar
> 
> 


* Re: [PATCH net-next 3/5] ebpf: Add sg_filter_run()
  2018-09-12  3:58   ` Alexei Starovoitov
@ 2018-09-12 19:27     ` Tushar Dave
  0 siblings, 0 replies; 22+ messages in thread
From: Tushar Dave @ 2018-09-12 19:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan



On 09/11/2018 08:58 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:02PM +0200, Tushar Dave wrote:
>> When sg_filter_run() is invoked it runs the attached eBPF
>> prog of type BPF_PROG_TYPE_SOCKET_SG_FILTER which deals with
>> struct scatterlist.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   include/linux/filter.h         |  8 ++++++++
>>   include/uapi/linux/bpf.h       |  6 ++++++
>>   net/core/filter.c              | 35 +++++++++++++++++++++++++++++++++++
>>   tools/include/uapi/linux/bpf.h |  6 ++++++
>>   4 files changed, 55 insertions(+)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 6791a0a..ae664a9 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -1113,4 +1113,12 @@ struct bpf_sock_ops_kern {
>>   					 */
>>   };
>>   
>> +enum __socksg_action {
>> +	__SOCKSG_PASS = 0,
>> +	__SOCKSG_DROP,
>> +	__SOCKSG_REDIRECT,
> 
> what is this? I see no code that handles it either in this patch
> or in the later patches?!

Yes, I am not handling these actions in RDS in this patch series. I was
thinking I should first have the basic infrastructure in place (and
make sure it works and is accepted by the community), and then add more
changes for RDS, including handling the socksg actions at the RDS
level. But fine, I can add code that handles the socksg actions.
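
For example, the RDS receive path could do something like the sketch
below; the sg_filter_run() signature and the drop handling are
assumptions, not the posted code:

/* Assumed shape of caller-side action handling; not from the posted
 * series. The sg_filter_run() signature is an assumption as well.
 */
switch (sg_filter_run(sk, sg_data)) {
case __SOCKSG_PASS:
	break;			/* deliver the payload as usual */
case __SOCKSG_REDIRECT:		/* reserved; treat as drop for now */
case __SOCKSG_DROP:
default:
	goto drop;		/* free sg pages and count the drop */
}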

Thanks

-Tushar
> 
> 


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-12  4:00   ` Alexei Starovoitov
@ 2018-09-12 19:32     ` Tushar Dave
  2018-09-13  0:59       ` Sowmini Varadhan
  0 siblings, 1 reply; 22+ messages in thread
From: Tushar Dave @ 2018-09-12 19:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: ast, daniel, davem, santosh.shilimkar, jakub.kicinski,
	quentin.monnet, jiong.wang, sandipan, john.fastabend, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan



On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> On Tue, Sep 11, 2018 at 09:38:04PM +0200, Tushar Dave wrote:
>> Add a sample program that shows how a socksg program is used and
>> attached as a socket filter. The kernel sample program deals with the
>> struct scatterlist that is passed as the bpf context.
>>
>> When run in server mode, the sample RDS program opens a PF_RDS socket
>> and attaches the eBPF program to the RDS socket; the program then
>> uses the bpf_msg_pull_data helper to inspect packet data contained in
>> struct scatterlist and returns an appropriate action code back to the
>> kernel.
>>
>> To ease testing, RDS client functionality is also added so that users
>> can generate RDS packets.
>>
>> Server:
>> [root@lab71 bpf]# ./rds_filter -s 192.168.3.71 -t tcp
>> running server in a loop
>> transport tcp
>> server bound to address: 192.168.3.71 port 4000
>> server listening on 192.168.3.71
>>
>> Client:
>> [root@lab70 bpf]# ./rds_filter -s 192.168.3.71 -c 192.168.3.70 -t tcp
>> transport tcp
>> client bound to address: 192.168.3.70 port 25278
>> client sending 8192 byte message  from 192.168.3.70 to 192.168.3.71 on
>> port 25278
>> payload contains:30 31 32 33 34 35 36 37 38 39 ...
>>
>> Server output:
>> 192.168.3.71 received a packet from 192.168.3.71 of len 8192 cmsg len 0,
>> on port 25278
>> payload contains:30 31 32 33 34 35 36 37 38 39 ...
>> server listening on 192.168.3.71
>>
>> [root@lab71 tushar]# cat /sys/kernel/debug/tracing/trace_pipe
>>            <idle>-0     [038] ..s.   146.947362: 0: 30 31 32
>>            <idle>-0     [038] ..s.   146.947364: 0: 33 34 35
>>
>> Similarly specifying '-t ib' will run this on IB link.
>>
>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
>> ---
>>   samples/bpf/Makefile          |   3 +
>>   samples/bpf/rds_filter_kern.c |  42 ++++++
>>   samples/bpf/rds_filter_user.c | 339 ++++++++++++++++++++++++++++++++++++++++++
> 
> please no samples.
> Add this as proper test to tools/testing/selftests/bpf
> that reports PASS/FAIL and can be run automatically.
> samples/bpf is effectively dead code.

Okay :(

-Tushar
> 
> 


* Re: [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page
  2018-09-12 16:51     ` John Fastabend
@ 2018-09-12 20:15       ` Tushar Dave
  0 siblings, 0 replies; 22+ messages in thread
From: Tushar Dave @ 2018-09-12 20:15 UTC (permalink / raw)
  To: John Fastabend, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan, kafai,
	rdna, yhs, netdev, rds-devel, sowmini.varadhan



On 09/12/2018 09:51 AM, John Fastabend wrote:
> On 09/12/2018 09:21 AM, Tushar Dave wrote:
>>
>>
>> On 09/11/2018 12:38 PM, Tushar Dave wrote:
>>> Helper bpf_msg_pull_data() can allocate multiple pages while
>>> linearizing multiple scatterlist elements into one shared page.
>>> However, if the shared page has size > PAGE_SIZE, using
>>> copy_page_to_iter() causes the warning below:
>>>
>>> [ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>>> page_copy_sane.part.8+0x0/0x8
>>>
>>> To avoid the above warning, use __GFP_COMP when allocating multiple
>>> contiguous pages.
>>>
>>> Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
>>> ---
>>>    net/core/filter.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index d301134..0b40f95 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -2344,7 +2344,8 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
>>>        if (unlikely(bytes_sg_total > copy))
>>>            return -EINVAL;
>>>    -    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC, get_order(copy));
>>> +    page = alloc_pages(__GFP_NOWARN | GFP_ATOMIC | __GFP_COMP,
>>> +               get_order(copy));
>>>        if (unlikely(!page))
>>>            return -ENOMEM;
>>>        p = page_address(page);
>>
>> I should have mentioned that I could re-order this patch anywhere in
>> the series (as long as it doesn't break git bisect). I kept it first
>> because I think it is more of a bug fix. I sent it along with this
>> series so that we have the context of why and for what I need this
>> patch.
>>
>> Daniel, John,
>>
>> Not sure if you guys have hit this page_copy_sane warning. I hit it
>> when RDS copies an sg page to userspace using copy_page_to_iter().
>>
> 
> I have not hit this before but I'm working on a set of patches for
> test_sockmap to test the bpf_msg_pull_data() so I'll add a case
> for this. Currently, we only test the simple case where we pull
> data out of a single page in selftests. This was sufficient for
> my use case but missed a handful of other valid cases.
> 
>> example:
>>
>> RDS packet size 8KB represented in scatterlist:
>> sg_data[0].length = 1400
>> sg_data[1].length = 1448
>> sg_data[2].length = 1448
>> sg_data[3].length = 1448
>> sg_data[4].length = 1448
>> sg_data[5].length = 1000
>>
>> If start=0 and end=8192, bpf_msg_pull_data() will linearize all
>> sg_data elements into one shared page. e.g. sg_data[0].length = 8192.
>> Using this sg_data[0].page in function copy_page_to_iter() causes:
>> WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
>> page_copy_sane.part.8+0x0/0x8
>>
>> (FYI, patch 4 has code that does copy_page_to_iter)
>>
> 
> How about sending it as a bugfix against bpf on its own. It
> looks like we could reproduce this with a combination of
> bpf_msg_pull_data() + redirect (to ingress) perhaps. Either
> way seems like a candidate for the bpf fixes tree to me.

Done.

Thanks.
-Tushar
> 
> Thanks,
> John
> 
>>
>> Comments?
>>
>> Thanks in advance,
>> -Tushar
>>
>>
> 
> 


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-12 19:32     ` Tushar Dave
@ 2018-09-13  0:59       ` Sowmini Varadhan
  2018-09-13  2:07         ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Sowmini Varadhan @ 2018-09-13  0:59 UTC (permalink / raw)
  To: Tushar Dave
  Cc: Alexei Starovoitov, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
	john.fastabend, kafai, rdna, yhs, netdev, rds-devel

> On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> >please no samples.
> >Add this as proper test to tools/testing/selftests/bpf
> >that reports PASS/FAIL and can be run automatically.
> >samples/bpf is effectively dead code.

Just a second.

You do realize that RDS is doing real networking, so it needs
RDMA-capable hardware to test the rds_rdma paths? Also, when we
"talk to ourselves" we default to the rds_loop transport, so
we would even bypass the rds-tcp module.

I don't think this can be tested with some academic "test it
over lo0" exercise. I suppose you can add example code in
selftests for this, but asking for a "proper test" may be
a little unrealistic here: a proper test needs proper hardware
in this case.

--Sowmini


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-13  0:59       ` Sowmini Varadhan
@ 2018-09-13  2:07         ` Alexei Starovoitov
  2018-09-13 10:10           ` Sowmini Varadhan
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2018-09-13  2:07 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
	john.fastabend, kafai, rdna, yhs, netdev, rds-devel

On Wed, Sep 12, 2018 at 08:59:54PM -0400, Sowmini Varadhan wrote:
> > On 09/11/2018 09:00 PM, Alexei Starovoitov wrote:
> > >please no samples.
> > >Add this as proper test to tools/testing/selftests/bpf
> > >that reports PASS/FAIL and can be run automatically.
> > >samples/bpf is effectively dead code.
> 
> Just a second.
> 
> > You do realize that RDS is doing real networking, so it needs
> > RDMA-capable hardware to test the rds_rdma paths? Also, when we
> > "talk to ourselves" we default to the rds_loop transport, so
> > we would even bypass the rds-tcp module.
> > 
> > I don't think this can be tested with some academic "test it
> > over lo0" exercise. I suppose you can add example code in
> > selftests for this, but asking for a "proper test" may be
> > a little unrealistic here: a proper test needs proper hardware
> > in this case.

I didn't know that. The way I understand your statement, this
new program type, new sg logic, and all the complexity
are only applicable to RDMA-capable hw and RDS.
In such a case, I think such BPF extensions do not belong
in the upstream kernel. I don't want BPF to support niche technologies,
since the maintenance overhead makes them prohibitive long term.
If the kernel had a way to deprecate an api it would have been
a different story. Every new feature and bpf extension lands once
and is then maintained forever. Hence we have to be very
selective, weighing the benefits vs long-term maintenance.
I'm happy to review the code further and suggest improvements,
but it's not going to be applied.


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-13  2:07         ` Alexei Starovoitov
@ 2018-09-13 10:10           ` Sowmini Varadhan
  2018-09-17 23:15             ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Sowmini Varadhan @ 2018-09-13 10:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
	john.fastabend, kafai, rdna, yhs, netdev, rds-devel

On (09/12/18 19:07), Alexei Starovoitov wrote:
> 
> > I didn't know that. The way I understand your statement, this
> > new program type, new sg logic, and all the complexity
> > are only applicable to RDMA-capable hw and RDS.

I don't know if you have been following the RFC series at all
(and DanielB/JohnF's feedback to it), but that is not what the patch
set is about.

To repeat a summary of the original problem statement:

RDS (hardly a "niche" driver; let's please not get carried away
with strong assertions based on incomplete understanding) is an
example of a driver that happens to pass up packets as both
scatterlists and sk_buffs to the ULPs.

The scatterlists come from IB, the sk_buffs come from the ethernet
drivers. At the moment, the only way to build firewalls for
this is to convert the scatterlist to an skb and use either netfilter
or eBPF on the skb. What Tushar is adding is support to use eBPF
on the scatterlist itself, so that you don't have to do this
inefficient scatterlist->skb conversion.
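
For reference, that conversion looks roughly like the sketch below;
this is hypothetical code, not the actual RDS implementation:

/* Hypothetical sketch of the scatterlist -> skb conversion being
 * avoided; not actual RDS code. Every payload byte is copied once.
 */
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/scatterlist.h>

static struct sk_buff *sg_to_skb(struct scatterlist *sg, unsigned int len)
{
	struct sk_buff *skb = alloc_skb(len, GFP_ATOMIC);

	if (!skb)
		return NULL;
	for (; sg && len; sg = sg_next(sg)) {
		unsigned int n = min_t(unsigned int, sg->length, len);

		skb_put_data(skb, sg_virt(sg), n);	/* copy one element */
		len -= n;
	}
	return skb;
}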

> In such a case, I think such BPF extensions do not belong
> in the upstream kernel. I don't want BPF to support niche technologies,

> since the maintenance overhead makes them prohibitive long term.

After I sent the message, I noticed that selftests/bpf has
some tests using veth/netns. I think Tushar should be able to write
tests for the rds-tcp path (and thus test the eBPF infrastructure, 
which seems to be what you are worried about).

Does that address your concern?

--Sowmini


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-13 10:10           ` Sowmini Varadhan
@ 2018-09-17 23:15             ` Alexei Starovoitov
  2018-09-17 23:23               ` Sowmini Varadhan
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2018-09-17 23:15 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
	john.fastabend, kafai, rdna, yhs, netdev, rds-devel

On Thu, Sep 13, 2018 at 06:10:13AM -0400, Sowmini Varadhan wrote:
> On (09/12/18 19:07), Alexei Starovoitov wrote:
> > 
> > I didn't know that. The way I understand your statement that
> > this new program type, new sg logic, and all the complexity
> > are only applicable to RDMA capable hw and RDS.
> 
> I dont know if you have been following the RFC series at all 
> (and DanielB/JohnF feedback to it) but that is not what the patch 
> set is about.
> 
> To repeat a summary of the original problem statement:
> 
> RDS (hardly a "niche" driver, let's please not get carried away 
> with strong assertions based on incomplete understanding), 
> is an example of a driver that happens to pass up packets
> as both scatterlist and sk_buffs to the ULPs. 
> 
> The scatterlist comes from IB, the sk_buffs come from the ethernet
> drivers. At the moment, the only way to build firewalls for
> this is to convert scatterlist to skb and use  either netfilter
> or eBPF on the skb. What Tushar is adding is support to use eBPF
> on the scatterlist itself, so that you dont have to do this
> inefficient scatterlist->skb conversion.

If the goal is to add firewall ability to RDS, then the patch set
is going in the wrong direction.
A new bpf prog type and attaching to sockets isn't going to be
helpful in building firewalls.
Also, there was a mention of some form of 'redirect' for some
future use? That doesn't fit the firewall goal either.
I think it would be best to start from scratch and discuss
the bigger goal first.
Maybe the right answer is to teach rds to behave like the rest of
the protocols. Then all existing tooling and features will 'just work'?


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-17 23:15             ` Alexei Starovoitov
@ 2018-09-17 23:23               ` Sowmini Varadhan
  2018-09-17 23:26                 ` Alexei Starovoitov
  0 siblings, 1 reply; 22+ messages in thread
From: Sowmini Varadhan @ 2018-09-17 23:23 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
	john.fastabend, kafai, rdna, yhs, netdev, rds-devel

On (09/17/18 16:15), Alexei Starovoitov wrote:
> 
> If the goal is to add firewall ability to RDS, then the patch set
> is going in the wrong direction.

The goal is to add the ability to process scatterlist directly,
just like we process skb's today.

Your main objection was that you wanted a test case in selftests
that was aligned with existing tests; Tushar is working on that
patchset. Why don't we wait for that patchset before continuing
this discussion further?

> Maybe the right answer is to teach rds to behave like the rest of
> the protocols. Then all existing tooling and features will 'just work'?

RDS does not need to be taught anything :-) I think KCM is modeled
on the RDS stack model. Before we "teach" rds anything, "we" need
to understand what RDS does first; google can provide lots of slide
decks that explain the rds stack, and I suggest you look at those
first.

Meanwhile, how about waiting for Tushar's next patchset, where
you will have your selftests based on veth/netns,
just like the existing tests for XDP, vxlan, etc. I strongly suggest
waiting for that.

And btw, it would have been very useful/courteous to help with 
the RFC reviews to start with.

--Sowmini


* Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
  2018-09-17 23:23               ` Sowmini Varadhan
@ 2018-09-17 23:26                 ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2018-09-17 23:26 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Tushar Dave, ast, daniel, davem, santosh.shilimkar,
	jakub.kicinski, quentin.monnet, jiong.wang, sandipan,
	john.fastabend, kafai, rdna, yhs, netdev, rds-devel

On Mon, Sep 17, 2018 at 07:23:48PM -0400, Sowmini Varadhan wrote:
> On (09/17/18 16:15), Alexei Starovoitov wrote:
> > 
> > If the goal is to add firewall ability to RDS, then the patch set
> > is going in the wrong direction.
> 
> The goal is to add the ability to process scatterlist directly,
> just like we process skb's today.

that doesn't answer my question and doesn't address the objection.
It is still a nack.


Thread overview: 22+ messages
2018-09-11 19:37 [PATCH net-next 0/5] eBPF and struct scatterlist Tushar Dave
2018-09-11 19:38 ` [PATCH net-next 1/5] bpf: use __GFP_COMP while allocating page Tushar Dave
2018-09-12 16:21   ` Tushar Dave
2018-09-12 16:51     ` John Fastabend
2018-09-12 20:15       ` Tushar Dave
2018-09-11 19:38 ` [PATCH net-next 2/5] eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER Tushar Dave
2018-09-12  3:57   ` Alexei Starovoitov
2018-09-12 19:25     ` Tushar Dave
2018-09-11 19:38 ` [PATCH net-next 3/5] ebpf: Add sg_filter_run() Tushar Dave
2018-09-12  3:58   ` Alexei Starovoitov
2018-09-12 19:27     ` Tushar Dave
2018-09-11 19:38 ` [PATCH net-next 4/5] rds: invoke socket sg filter attached to rds socket Tushar Dave
2018-09-11 21:06   ` santosh.shilimkar
2018-09-11 19:38 ` [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER Tushar Dave
2018-09-12  4:00   ` Alexei Starovoitov
2018-09-12 19:32     ` Tushar Dave
2018-09-13  0:59       ` Sowmini Varadhan
2018-09-13  2:07         ` Alexei Starovoitov
2018-09-13 10:10           ` Sowmini Varadhan
2018-09-17 23:15             ` Alexei Starovoitov
2018-09-17 23:23               ` Sowmini Varadhan
2018-09-17 23:26                 ` Alexei Starovoitov
