* PATCH net-next v3 00/15
@ 2017-06-20  3:00 Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 01/15] bpf: BPF support for sock_ops Lawrence Brakmo
                   ` (14 more replies)
  0 siblings, 15 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.) and to set
connection parameters such as buffer sizes, initial window, SYN/SYN-ACK
RTOs, etc.

Unlike current BPF program types that expect to be called at a particular
place in the network stack code, a SOCK_OPS program can be called at
different places and uses an "op" field to indicate the context. There
are currently two types of operations: those whose effect is through
their return value and those whose effect is through the new
bpf_setsockopt BPF helper function.

Example operations of the first type are:
  BPF_SOCK_OPS_TIMEOUT_INIT
  BPF_SOCK_OPS_RWND_INIT
  BPF_SOCK_OPS_NEEDS_ECN

Example operations of the second type are:
  BPF_SOCK_OPS_TCP_CONNECT_CB
  BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB
  BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
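
To illustrate, a minimal sock_ops program combining both kinds of ops
could look like the sketch below (it follows the same pattern as the
samples later in this series; the RTO and buffer values are only
examples):

SEC("sockops")
int bpf_example(struct bpf_sock_ops *skops)
{
	int bufsize = 1500000;
	int rv = -1;

	switch (skops->op) {
	case BPF_SOCK_OPS_TIMEOUT_INIT:
		/* first type: the return value is the SYN RTO to use */
		rv = 10;
		break;
	case BPF_SOCK_OPS_TCP_CONNECT_CB:
		/* second type: the effect is through the helper call,
		 * the return value itself is not used by the stack
		 */
		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
				    &bufsize, sizeof(bufsize));
		break;
	}
	return rv;
}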

Current operations are only called during connection establishment, so
there should not be any BPF overhead after connection establishment. The
main idea is to use connection information from both hosts, such as IP
addresses and ports, to allow setting of per connection parameters that
optimize the connection's performance.

Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (e.g. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier to express than route metric
rules and can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.
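
For instance, a sketch of the probabilistic case using the existing
bpf_get_prandom_u32() helper (the 10% split and the RTO value are
made-up examples):

SEC("sockops")
int bpf_ab_test(struct bpf_sock_ops *skops)
{
	int rv = -1;

	/* apply an experimental SYN RTO to roughly 10% of flows,
	 * returning -1 keeps the kernel default for the rest
	 */
	if (skops->op == BPF_SOCK_OPS_TIMEOUT_INIT &&
	    bpf_get_prandom_u32() % 10 == 0)
		rv = 10;

	return rv;
}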

Currently there is functionality to load one global BPF program of this
type but I plan to add support for loading per cgroup sock_ops BPF
programs in the near future. When that is done, the global program could
be called when a cgroup has no program associated with it.
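
With the current patches the global program is installed and removed
with the existing attach/detach commands (see patch 02); using the
sample bpf library helpers this is roughly:

	/* prog_fd: fd of a loaded BPF_PROG_TYPE_SOCK_OPS program */
	bpf_prog_attach(prog_fd, 0, BPF_GLOBAL_SOCK_OPS, 0);

	/* remove the currently loaded global program */
	bpf_prog_detach(0, BPF_GLOBAL_SOCK_OPS);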

One question is whether I should add this functionality into David Ahern's
BPF_PROG_TYPE_CGROUP_SOCK or create a new cgroup bpf type. Whereas the
current cgroup_sock type expects to be called only once during a connection's
lifetime, the new sock_ops type could be called multiple times. My preference
is to define a new bpf attach type, BPF_CGROUP_SOCK_OPS, to attach
BPF_PROG_TYPE_SOCK_OPS to cgroups.

This patch set also includes sample BPF programs to demonstrate the different
features.

v2: Formatting changes, rebased to latest net-next

v3: Fixed build issues, changed socket_ops to sock_ops throughout,
    fixed formatting issues, removed the syscall to load sock_ops
    program and added functionality to use existing bpf attach and
    bpf detach system calls, removed reader/writer locks in
    sock_bpfops.c (used when saving sock_ops global program)

Consists of the following patches:


 include/linux/bpf.h           |   6 ++
 include/linux/bpf_types.h     |   1 +
 include/linux/filter.h        |  10 ++
 include/net/tcp.h             |  60 ++++++++++-
 include/uapi/linux/bpf.h      |  66 +++++++++++-
 kernel/bpf/syscall.c          |  62 +++++++++---
 net/core/Makefile             |   3 +-
 net/core/filter.c             | 271 ++++++++++++++++++++++++++++++++++++++++++++++++++
 net/core/sock_bpfops.c        |  65 ++++++++++++
 net/ipv4/tcp.c                |   2 +-
 net/ipv4/tcp_cong.c           |  32 ++++--
 net/ipv4/tcp_fastopen.c       |   1 +
 net/ipv4/tcp_input.c          |  10 +-
 net/ipv4/tcp_minisocks.c      |   9 +-
 net/ipv4/tcp_output.c         |  18 +++-
 samples/bpf/Makefile          |   9 ++
 samples/bpf/bpf_helpers.h     |   3 +
 samples/bpf/bpf_load.c        |  13 ++-
 samples/bpf/tcp_bpf.c         |  86 ++++++++++++++++
 samples/bpf/tcp_bufs_kern.c   |  76 ++++++++++++++
 samples/bpf/tcp_clamp_kern.c  |  93 +++++++++++++++++
 samples/bpf/tcp_cong_kern.c   |  73 ++++++++++++++
 samples/bpf/tcp_iw_kern.c     |  78 +++++++++++++++
 samples/bpf/tcp_rwnd_kern.c   |  60 +++++++++++
 samples/bpf/tcp_synrto_kern.c |  59 +++++++++++
 25 files changed, 1126 insertions(+), 40 deletions(-)


* [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-22 22:41   ` Daniel Borkmann
  2017-06-20  3:00 ` [PATCH net-next v3 02/15] bpf: program to load sock_ops BPF programs Lawrence Brakmo
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). Currently there is
functionality to load one global BPF program of this type which can be
called at appropriate times to set relevant connection parameters such
as buffer sizes, SYN and SYN-ACK RTOs, etc., based on connection
information such as IP addresses, port numbers, etc.

Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (e.g. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier to express than route metric
rules and can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.

I plan to add support for loading per cgroup sock_ops BPF programs in
the near future. One question is whether I should add this functionality
into David Ahern's BPF_PROG_TYPE_CGROUP_SOCK or create a new cgroup bpf
type. Whereas the current cgroup_sock type expects to be called only once
during a connection's lifetime, the new sock_ops type could be called
multiple times: for example, before sending SYN and SYN-ACKs to set an
appropriate timeout, or when the connection is established to set
congestion control, etc. As a result it has an "op" field to specify the
type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use Facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.

This patch only contains the framework to support the new BPF program
type, following patches add the functionality to set various connection
parameters.

This patch defines a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a
new bpf attach type, BPF_GLOBAL_SOCK_OPS, which is used with the existing
BPF_PROG_ATTACH and BPF_PROG_DETACH commands to install and remove the
global program of this type.

Two new corresponding structs (one for the kernel, one for the user/BPF
program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        bool   is_req_sock:1;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;
        __u32 local_ip4;
        __u32 remote_ip6[4];
        __u32 local_ip6[4];
        __u32 remote_port;
        __u32 local_port;
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and the return value is
ignored.

The reply fields of the bpf_sock_ops struct are there in case a BPF
program needs to return a value larger than a single integer.
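
Since writes to the op/reply/replylong fields are allowed by the
verifier, an op that needs a wider result could fill them in instead of
encoding everything in the return value. No op in this series uses this
yet, so the value names below are purely illustrative:

	skops->replylong[0] = value_lo;	/* hypothetical wide reply */
	skops->replylong[1] = value_hi;
	return 0;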

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/bpf.h       |   6 ++
 include/linux/bpf_types.h |   1 +
 include/linux/filter.h    |  10 +++
 include/net/tcp.h         |  30 ++++++++
 include/uapi/linux/bpf.h  |  28 ++++++++
 kernel/bpf/syscall.c      |  62 +++++++++++++----
 net/core/Makefile         |   3 +-
 net/core/filter.c         | 170 ++++++++++++++++++++++++++++++++++++++++++++++
 net/core/sock_bpfops.c    |  65 ++++++++++++++++++
 samples/bpf/bpf_load.c    |  13 +++-
 10 files changed, 370 insertions(+), 18 deletions(-)
 create mode 100644 net/core/sock_bpfops.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1bcbf0a..a1a1f2f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -362,4 +362,10 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+/* sock_ops related */
+struct bpf_sock_ops_kern;
+
+int bpf_sock_ops_attach_global_prog(int fd);
+int bpf_sock_ops_detach_global_prog(void);
+int bpf_sock_ops_call(struct bpf_sock_ops_kern *bpf_sock);
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 03bf223..3d137c3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock_prog_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout_prog_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout_prog_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit_prog_ops)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops_prog_ops)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1fa26dc..bbd6429 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -898,4 +898,14 @@ static inline int bpf_tell_extensions(void)
 	return SKF_AD_MAX;
 }
 
+struct bpf_sock_ops_kern {
+	struct	sock *sk;
+	bool	is_req_sock:1;
+	u32	op;
+	union {
+		u32 reply;
+		u32 replylong[4];
+	};
+};
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d0751b7..f6f415c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -46,6 +46,9 @@
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
 
+#include <linux/bpf.h>
+#include <linux/filter.h>
+
 extern struct inet_hashinfo tcp_hashinfo;
 
 extern struct percpu_counter tcp_orphan_count;
@@ -2021,4 +2024,31 @@ int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
 
+/* Call BPF_SOCK_OPS program that returns an int. If the return value
+ * is < 0, then the BPF op failed (for example if the loaded BPF
+ * program does not support the chosen operation or there is no BPF
+ * program loaded).
+ */
+#ifdef CONFIG_BPF
+static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
+{
+	struct bpf_sock_ops_kern sock_ops;
+
+	if (!is_req_sock)
+		sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, sizeof(sock_ops));
+	sock_ops.sk = sk;
+	sock_ops.is_req_sock = is_req_sock;
+	sock_ops.op = op;
+
+	return bpf_sock_ops_call(&sock_ops);
+}
+#else
+static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
+{
+	return -1;
+}
+#endif
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f94b48b..861dbe9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -120,12 +120,14 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_IN,
 	BPF_PROG_TYPE_LWT_OUT,
 	BPF_PROG_TYPE_LWT_XMIT,
+	BPF_PROG_TYPE_SOCK_OPS,
 };
 
 enum bpf_attach_type {
 	BPF_CGROUP_INET_INGRESS,
 	BPF_CGROUP_INET_EGRESS,
 	BPF_CGROUP_INET_SOCK_CREATE,
+	BPF_GLOBAL_SOCK_OPS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -720,4 +722,30 @@ struct bpf_map_info {
 	__u32 map_flags;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_ops struct to access socket values and specify request ops
+ * and their replies.
+ * New fields can only be added at the end of this structure
+ */
+struct bpf_sock_ops {
+	__u32 op;
+	union {
+		__u32 reply;
+		__u32 replylong[4];
+	};
+	__u32 family;
+	__u32 remote_ip4;
+	__u32 local_ip4;
+	__u32 remote_ip6[4];
+	__u32 local_ip6[4];
+	__u32 remote_port;
+	__u32 local_port;
+};
+
+/* List of known BPF sock_ops operators.
+ * New entries can only be added at the end
+ */
+enum {
+	BPF_SOCK_OPS_VOID,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8942c82..e02831f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1041,23 +1041,17 @@ static int bpf_obj_get(const union bpf_attr *attr)
 	return bpf_obj_get_user(u64_to_user_ptr(attr->pathname));
 }
 
-#ifdef CONFIG_CGROUP_BPF
-
 #define BPF_PROG_ATTACH_LAST_FIELD attach_flags
 
-static int bpf_prog_attach(const union bpf_attr *attr)
+#ifdef CONFIG_CGROUP_BPF
+
+static int bpf_prog_attach_cgroup(const union bpf_attr *attr)
 {
 	enum bpf_prog_type ptype;
 	struct bpf_prog *prog;
 	struct cgroup *cgrp;
 	int ret;
 
-	if (!capable(CAP_NET_ADMIN))
-		return -EPERM;
-
-	if (CHECK_ATTR(BPF_PROG_ATTACH))
-		return -EINVAL;
-
 	if (attr->attach_flags & ~BPF_F_ALLOW_OVERRIDE)
 		return -EINVAL;
 
@@ -1092,9 +1086,32 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	return ret;
 }
 
+#else
+static int bpf_prog_attach_cgroup(const union bpf_attr *attr)
+{
+	return -EINVAL;
+}
+#endif
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_ATTACH))
+		return -EINVAL;
+
+	if (attr->attach_type == BPF_GLOBAL_SOCK_OPS)
+		return bpf_sock_ops_attach_global_prog(attr->attach_bpf_fd);
+	else
+		return bpf_prog_attach_cgroup(attr);
+}
+
 #define BPF_PROG_DETACH_LAST_FIELD attach_type
 
-static int bpf_prog_detach(const union bpf_attr *attr)
+#ifdef CONFIG_CGROUP_BPF
+
+static int bpf_prog_detach_cgroup(const union bpf_attr *attr)
 {
 	struct cgroup *cgrp;
 	int ret;
@@ -1116,14 +1133,33 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 		ret = cgroup_bpf_update(cgrp, NULL, attr->attach_type, false);
 		cgroup_put(cgrp);
 		break;
-
 	default:
 		return -EINVAL;
 	}
 
 	return ret;
 }
-#endif /* CONFIG_CGROUP_BPF */
+
+#else
+static int bpf_prog_detach_cgroup(const union bpf_attr *attr)
+{
+	return -EINVAL;
+}
+#endif
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (CHECK_ATTR(BPF_PROG_DETACH))
+		return -EINVAL;
+
+	if (attr->attach_type == BPF_GLOBAL_SOCK_OPS)
+		return bpf_sock_ops_detach_global_prog();
+	else
+		return bpf_prog_detach_cgroup(attr);
+}
 
 #define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
 
@@ -1431,14 +1467,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
 	case BPF_OBJ_GET:
 		err = bpf_obj_get(&attr);
 		break;
-#ifdef CONFIG_CGROUP_BPF
 	case BPF_PROG_ATTACH:
 		err = bpf_prog_attach(&attr);
 		break;
 	case BPF_PROG_DETACH:
 		err = bpf_prog_detach(&attr);
 		break;
-#endif
 	case BPF_PROG_TEST_RUN:
 		err = bpf_prog_test_run(&attr, uattr);
 		break;
diff --git a/net/core/Makefile b/net/core/Makefile
index 79f9479..5d711c2 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 
 obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
-			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o
+			sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
+			sock_bpfops.o
 
 obj-$(CONFIG_XFRM) += flow.o
 obj-y += net-sysfs.o
diff --git a/net/core/filter.c b/net/core/filter.c
index 60ed6f3..7d69d16 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3095,6 +3095,36 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+static bool __is_valid_sock_ops_access(int off, int size)
+{
+	if (off < 0 || off >= sizeof(struct bpf_sock_ops))
+		return false;
+	/* The verifier guarantees that size > 0. */
+	if (off % size != 0)
+		return false;
+	if (size != sizeof(__u32))
+		return false;
+
+	return true;
+}
+
+static bool sock_ops_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     enum bpf_reg_type *reg_type)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sock_ops, op) ...
+		     offsetof(struct bpf_sock_ops, replylong[3]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	return __is_valid_sock_ops_access(off, size);
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -3364,6 +3394,140 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
+				       const struct bpf_insn *si,
+				       struct bpf_insn *insn_buf,
+				       struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sock_ops, op) ...
+	     offsetof(struct bpf_sock_ops, replylong[3]):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct bpf_sock_ops, op) !=
+			     FIELD_SIZEOF(struct bpf_sock_ops_kern, op));
+		BUILD_BUG_ON(FIELD_SIZEOF(struct bpf_sock_ops, reply) !=
+			     FIELD_SIZEOF(struct bpf_sock_ops_kern, reply));
+		BUILD_BUG_ON(FIELD_SIZEOF(struct bpf_sock_ops, replylong) !=
+			     FIELD_SIZEOF(struct bpf_sock_ops_kern, replylong));
+		off = si->off;
+		off -= offsetof(struct bpf_sock_ops, op);
+		off += offsetof(struct bpf_sock_ops_kern, op);
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
+					      off);
+		else
+			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+					      off);
+		break;
+
+	case offsetof(struct bpf_sock_ops, family):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+					      struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_family));
+		break;
+
+	case offsetof(struct bpf_sock_ops, remote_ip4):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_daddr));
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+		break;
+
+	case offsetof(struct bpf_sock_ops, local_ip4):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+					      struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common,
+					       skc_rcv_saddr));
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+		break;
+
+	case offsetof(struct bpf_sock_ops, remote_ip6[0]) ...
+	     offsetof(struct bpf_sock_ops, remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+					  skc_v6_daddr.s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sock_ops, remote_ip6[0]);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common,
+					       skc_v6_daddr.s6_addr32[0]) +
+				      off);
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sock_ops, local_ip6[0]) ...
+	     offsetof(struct bpf_sock_ops, local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+					  skc_v6_rcv_saddr.s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sock_ops, local_ip6[0]);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common,
+					       skc_v6_rcv_saddr.s6_addr32[0]) +
+				      off);
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sock_ops, remote_port):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_dport));
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
+		break;
+
+	case offsetof(struct bpf_sock_ops, local_port):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_num));
+		break;
+	}
+	return insn - insn_buf;
+}
+
 const struct bpf_verifier_ops sk_filter_prog_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -3413,6 +3577,12 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
 	.convert_ctx_access	= sock_filter_convert_ctx_access,
 };
 
+const struct bpf_verifier_ops sock_ops_prog_ops = {
+	.get_func_proto		= bpf_base_func_proto,
+	.is_valid_access	= sock_ops_is_valid_access,
+	.convert_ctx_access	= sock_ops_convert_ctx_access,
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/net/core/sock_bpfops.c b/net/core/sock_bpfops.c
new file mode 100644
index 0000000..06f4a64
--- /dev/null
+++ b/net/core/sock_bpfops.c
@@ -0,0 +1,65 @@
+/*
+ * BPF support for sockets
+ *
+ * Copyright (c) 2016 Lawrence Brakmo <brakmo@fb.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation.
+ */
+
+#include <net/sock.h>
+#include <linux/skbuff.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/errno.h>
+#ifdef CONFIG_NET_NS
+#include <net/net_namespace.h>
+#include <linux/proc_ns.h>
+#endif
+
+/* Global BPF program for sockets */
+static struct bpf_prog *bpf_global_sock_ops_prog;
+
+int bpf_sock_ops_detach_global_prog(void)
+{
+	struct bpf_prog *old_prog;
+
+	old_prog = xchg(&bpf_global_sock_ops_prog, NULL);
+
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+int bpf_sock_ops_attach_global_prog(int fd)
+{
+	struct bpf_prog *prog, *old_prog;
+	int err = 0;
+
+	prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCK_OPS);
+	if (IS_ERR(prog))
+		return PTR_ERR(prog);
+
+	old_prog = xchg(&bpf_global_sock_ops_prog, prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+	return err;
+}
+
+int bpf_sock_ops_call(struct bpf_sock_ops_kern *bpf_sock)
+{
+	struct bpf_prog *prog;
+	int ret;
+
+	rcu_read_lock();
+	prog =  READ_ONCE(bpf_global_sock_ops_prog);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, bpf_sock);
+	else
+		ret = -1;
+	rcu_read_unlock();
+
+	return ret;
+}
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a91c57d..a4be7cf 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -64,6 +64,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
 	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
 	bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
+	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -89,6 +90,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_CGROUP_SKB;
 	} else if (is_cgroup_sk) {
 		prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
+	} else if (is_sockops) {
+		prog_type = BPF_PROG_TYPE_SOCK_OPS;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -106,8 +109,11 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket) {
-		event += 6;
+	if (is_socket || is_sockops) {
+		if (is_socket)
+			event += 6;
+		else
+			event += 7;
 		if (*event != '/')
 			return 0;
 		event++;
@@ -560,7 +566,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "xdp", 3) == 0 ||
 		    memcmp(shname, "perf_event", 10) == 0 ||
 		    memcmp(shname, "socket", 6) == 0 ||
-		    memcmp(shname, "cgroup/", 7) == 0)
+		    memcmp(shname, "cgroup/", 7) == 0 ||
+		    memcmp(shname, "sockops", 7) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
 
-- 
2.9.3


* [PATCH net-next v3 02/15] bpf: program to load sock_ops BPF programs
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 01/15] bpf: BPF support for sock_ops Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 03/15] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

The program tcp_bpf can be used to remove current global sock_ops program
and to load/replace sock_ops BPF programs. There is also an option to
print the bpf trace buffer (for debugging purposes).

USAGE:
  ./tcp_bpf [-r] [-l] [<pname>]
WHERE:
  -r      remove current loaded sock_ops BPF program
          not needed if loading a new program
  -l      print BPF trace buffer. Used when loading a new program
  <pname> name of BPF sock_ops program to load
          if <pname> does not end in ".o", then "_kern.o" is appended
          example: using tcp_rto will load tcp_rto_kern.o
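
For example (tcp_synrto is one of the samples added later in this series):
  ./tcp_bpf -l tcp_synrto   load tcp_synrto_kern.o and print the trace buffer
  ./tcp_bpf -r              remove the currently loaded sock_ops program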

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile  |  3 ++
 samples/bpf/tcp_bpf.c | 86 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)
 create mode 100644 samples/bpf/tcp_bpf.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a0561dc..ed6bc75 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -36,6 +36,7 @@ hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
 hostprogs-y += test_map_in_map
 hostprogs-y += per_socket_stats_example
+hostprogs-y += tcp_bpf
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -52,6 +53,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tcp_bpf-objs := bpf_load.o $(LIBBPF) tcp_bpf.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
@@ -130,6 +132,7 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
+HOSTLOADLIBES_tcp_bpf += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
diff --git a/samples/bpf/tcp_bpf.c b/samples/bpf/tcp_bpf.c
new file mode 100644
index 0000000..735b8b2
--- /dev/null
+++ b/samples/bpf/tcp_bpf.c
@@ -0,0 +1,86 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <errno.h>
+#include <linux/unistd.h>
+
+static void usage(char *pname)
+{
+	printf("USAGE:\n  %s [-r] [-l] <pname>\n", pname);
+	printf("WHERE:\n");
+	printf("  -r      remove current loaded sock_ops BPF program\n");
+	printf("          not needed if loading a new program\n");
+	printf("  -l      print out BPF log buffer\n");
+	printf("  <pname> name of BPF sock_ops program to load\n");
+	printf("          if <pname> does not end in \".o\", then \"_kern.o\" "
+	       "is appended\n");
+	printf("          example: using tcp1 will load tcp1_kern.o\n");
+	printf("\n");
+}
+
+int main(int argc, char **argv)
+{
+	//union bpf_attr attr;
+	int k, logFlag = 0;
+	int error = 0;
+	char fn[500];
+
+	if (argc <= 1)
+		usage(argv[0]);
+	for (k = 1; k < argc; k++) {
+		if (!strcmp(argv[k], "-r")) {
+			error = bpf_prog_detach(0, BPF_GLOBAL_SOCK_OPS);
+			if (error) {
+				printf("ERROR: bpf_prog_detach: %d (%s)\n",
+				       error, strerror(errno));
+				error =  1;
+			}
+		} else if (!strcmp(argv[k], "-l")) {
+			logFlag = 1;
+		} else if (!strcmp(argv[k], "-h")) {
+			usage(argv[0]);
+		} else if (argv[k][0] == '-') {
+			printf("Error, unknown flag: %s\n", argv[k]);
+			error = 2;
+		} else if (strlen(argv[k]) > 450) {
+			printf("Error, program name too long %d\n",
+			       (int) strlen(argv[k]));
+			error = 3;
+		} else {
+			if (!strcmp(argv[k]+strlen(argv[k])-2, ".o"))
+				strcpy(fn, argv[k]);
+			else
+				sprintf(fn, "%s_kern.o", argv[k]);
+			if (logFlag)
+				printf("loading bpf file:%s\n", fn);
+			if (load_bpf_file(fn)) {
+				printf("%s", bpf_log_buf);
+				return 1;
+			}
+			if (logFlag) {
+				printf("TCP BPF Loaded %s\n", fn);
+				printf("%s\n", bpf_log_buf);
+			}
+			error = bpf_prog_attach(prog_fd[0], 0,
+						BPF_GLOBAL_SOCK_OPS, 0);
+			if (error) {
+				printf("ERROR: bpf_prog_attach: %d (%s)\n",
+				       error, strerror(errno));
+				error = 4;
+			} else if (logFlag) {
+				read_trace_pipe();
+			}
+		}
+	}
+	return error;
+}
-- 
2.9.3


* [PATCH net-next v3 03/15] bpf: Support for per connection  SYN/SYN-ACK RTOs
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 01/15] bpf: BPF support for sock_ops Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 02/15] bpf: program to load sock_ops BPF programs Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 04/15] bpf: Sample bpf program to set " Lawrence Brakmo
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

This patch adds support for setting per connection SYN and
SYN-ACK RTOs from within a BPF_SOCK_OPS program, for example
to set small RTOs when it is known that both hosts are within a
datacenter.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/tcp.h        | 11 +++++++++++
 include/uapi/linux/bpf.h |  3 +++
 net/ipv4/tcp_input.c     |  3 ++-
 net/ipv4/tcp_output.c    |  2 +-
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f6f415c..bdf6bfd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2051,4 +2051,15 @@ static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
 }
 #endif
 
+static inline u32 tcp_timeout_init(struct sock *sk, bool is_req_sock)
+{
+	int timeout;
+
+	timeout = tcp_call_bpf(sk, is_req_sock, BPF_SOCK_OPS_TIMEOUT_INIT);
+
+	if (timeout <= 0)
+		timeout = TCP_TIMEOUT_INIT;
+	return timeout;
+}
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 861dbe9..4532c31 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -746,6 +746,9 @@ struct bpf_sock_ops {
  */
 enum {
 	BPF_SOCK_OPS_VOID,
+	BPF_SOCK_OPS_TIMEOUT_INIT,	/* Should return SYN-RTO value to use or
+					 * -1 if default value should be used
+					 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2ab7e2f..0867b05 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6406,7 +6406,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	} else {
 		tcp_rsk(req)->tfo_listener = false;
 		if (!want_cookie)
-			inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+			inet_csk_reqsk_queue_hash_add(sk, req,
+				tcp_timeout_init((struct sock *)req, true));
 		af_ops->send_synack(sk, dst, &fl, req, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
 						   TCP_SYNACK_COOKIE);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9a9c395..5e478a1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3327,7 +3327,7 @@ static void tcp_connect_init(struct sock *sk)
 	tp->rcv_wup = tp->rcv_nxt;
 	tp->copied_seq = tp->rcv_nxt;
 
-	inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+	inet_csk(sk)->icsk_rto = tcp_timeout_init(sk, false);
 	inet_csk(sk)->icsk_retransmits = 0;
 	tcp_clear_retrans(tp);
 }
-- 
2.9.3


* [PATCH net-next v3 04/15] bpf: Sample bpf program to set SYN/SYN-ACK RTOs
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (2 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 03/15] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 05/15] bpf: Support for setting initial receive window Lawrence Brakmo
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

The sample BPF program, tcp_synrto_kern.c, sets the SYN and SYN-ACK
RTOs to 10ms in an environment where common IPv6 prefixes indicate
that both hosts are in the same datacenter (i.e. the RTT between them
is small).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile          |  1 +
 samples/bpf/tcp_synrto_kern.c | 59 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)
 create mode 100644 samples/bpf/tcp_synrto_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ed6bc75..21cb016 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -113,6 +113,7 @@ always += lwt_len_hist_kern.o
 always += xdp_tx_iptunnel_kern.o
 always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
+always += tcp_synrto_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_synrto_kern.c b/samples/bpf/tcp_synrto_kern.c
new file mode 100644
index 0000000..b11efd8
--- /dev/null
+++ b/samples/bpf/tcp_synrto_kern.c
@@ -0,0 +1,59 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set SYN and SYN-ACK RTOs to 10ms when using IPv6 addresses
+ * and the first 5.5 bytes of the IPv6 addresses are the same (in this example
+ * that means both hosts are in the same datacenter).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_synrto(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int rv = -1;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if neither port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check for TIMEOUT_INIT operation and IPv6 addresses */
+	if (op == BPF_SOCK_OPS_TIMEOUT_INIT &&
+		skops->family == AF_INET6) {
+
+		/* If the first 5.5 bytes of the IPv6 address are the same
+		 * then both hosts are in the same datacenter
+		 * so use an RTO of 10ms
+		 */
+		if (skops->local_ip6[0] == skops->remote_ip6[0] &&
+		    (skops->local_ip6[1] & 0xfff00000) ==
+		    (skops->remote_ip6[1] & 0xfff00000))
+			rv = 10;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3


* [PATCH net-next v3 05/15] bpf: Support for setting initial receive window
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (3 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 04/15] bpf: Sample bpf program to set " Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 06/15] bpf: Sample bpf program to set initial window Lawrence Brakmo
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

This patch adds support for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/tcp.h        | 10 ++++++++++
 include/uapi/linux/bpf.h |  4 ++++
 net/ipv4/tcp_minisocks.c |  9 ++++++++-
 net/ipv4/tcp_output.c    |  7 ++++++-
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index bdf6bfd..ff806d7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2062,4 +2062,14 @@ static inline u32 tcp_timeout_init(struct sock *sk, bool is_req_sock)
 	return timeout;
 }
 
+static inline u32 tcp_rwnd_init_bpf(struct sock *sk, bool is_req_sock)
+{
+	int rwnd;
+
+	rwnd = tcp_call_bpf(sk, is_req_sock, BPF_SOCK_OPS_RWND_INIT);
+
+	if (rwnd < 0)
+		rwnd = 0;
+	return rwnd;
+}
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4532c31..314fdf3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -749,6 +749,10 @@ enum {
 	BPF_SOCK_OPS_TIMEOUT_INIT,	/* Should return SYN-RTO value to use or
 					 * -1 if default value should be used
 					 */
+	BPF_SOCK_OPS_RWND_INIT,		/* Should return initial advertized
+					 * window (in packets) or -1 if default
+					 * value should be used
+					 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index d30ee31..bbaf3c6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -351,6 +351,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 	int full_space = tcp_full_space(sk_listener);
 	u32 window_clamp;
 	__u8 rcv_wscale;
+	u32 rcv_wnd;
 	int mss;
 
 	mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
@@ -363,6 +364,12 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 	    (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
 		req->rsk_window_clamp = full_space;
 
+	rcv_wnd = tcp_rwnd_init_bpf((struct sock *)req, true);
+	if (rcv_wnd == 0)
+		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
+	else if (full_space < rcv_wnd * mss)
+		full_space = rcv_wnd * mss;
+
 	/* tcp_full_space because it is guaranteed to be the first packet */
 	tcp_select_initial_window(full_space,
 		mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
@@ -370,7 +377,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 		&req->rsk_window_clamp,
 		ireq->wscale_ok,
 		&rcv_wscale,
-		dst_metric(dst, RTAX_INITRWND));
+		rcv_wnd);
 	ireq->rcv_wscale = rcv_wscale;
 }
 EXPORT_SYMBOL(tcp_openreq_init_rwin);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5e478a1..e5f623f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3267,6 +3267,7 @@ static void tcp_connect_init(struct sock *sk)
 	const struct dst_entry *dst = __sk_dst_get(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u8 rcv_wscale;
+	u32 rcv_wnd;
 
 	/* We'll fix this up when we get a response from the other end.
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
@@ -3300,13 +3301,17 @@ static void tcp_connect_init(struct sock *sk)
 	    (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
 		tp->window_clamp = tcp_full_space(sk);
 
+	rcv_wnd = tcp_rwnd_init_bpf(sk, false);
+	if (rcv_wnd == 0)
+		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
+
 	tcp_select_initial_window(tcp_full_space(sk),
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
 				  sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
 				  &rcv_wscale,
-				  dst_metric(dst, RTAX_INITRWND));
+				  rcv_wnd);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
 	tp->rcv_ssthresh = tp->rcv_wnd;
-- 
2.9.3


* [PATCH net-next v3 06/15] bpf: Sample bpf program to set initial window
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (4 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 05/15] bpf: Support for setting initial receive window Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

The sample bpf program, tcp_rwnd_kern.c, sets the initial
advertized window to 40 packets in an environment where
distinct IPv6 prefixes indicate that both hosts are not
in the same data center.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  1 +
 samples/bpf/tcp_rwnd_kern.c | 60 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 samples/bpf/tcp_rwnd_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 21cb016..9aca209 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -114,6 +114,7 @@ always += xdp_tx_iptunnel_kern.o
 always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
 always += tcp_synrto_kern.o
+always += tcp_rwnd_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_rwnd_kern.c b/samples/bpf/tcp_rwnd_kern.c
new file mode 100644
index 0000000..26c1370
--- /dev/null
+++ b/samples/bpf/tcp_rwnd_kern.c
@@ -0,0 +1,60 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set initial receive window to 40 packets when using IPv6
+ * and the first 5.5 bytes of the IPv6 addresses are not the same (in this
+ * example that means both hosts are not in the same datacenter).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_rwnd(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int rv = -1;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if neither port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check for RWND_INIT operation and IPv6 addresses */
+	if (op == BPF_SOCK_OPS_RWND_INIT &&
+		skops->family == AF_INET6) {
+
+		/* If the first 5.5 bytes of the IPv6 address are not the same
+		 * then both hosts are not in the same datacenter
+		 * so use a larger initial advertized window (40 packets)
+		 */
+		if (skops->local_ip6[0] != skops->remote_ip6[0] ||
+		    (skops->local_ip6[1] & 0xfff00000) !=
+		    (skops->remote_ip6[1] & 0xfff00000))
+			rv = 40;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3


* [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (5 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 06/15] bpf: Sample bpf program to set initial window Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20 21:25   ` Craig Gallek
  2017-06-20  3:00 ` [PATCH net-next v3 08/15] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Added support for calling a subset of socket setsockopt options from
BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
than changed to call the socket setsockopt function directly because
the changes required for that would have been larger.

The ops supported are:
  SO_RCVBUF
  SO_SNDBUF
  SO_MAX_PACING_RATE
  SO_PRIORITY
  SO_RCVLOWAT
  SO_MARK
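
For example (mirroring the tcp_bufs sample later in this series), a
sock_ops program can enlarge a connection's buffers from one of its
callbacks:

	int bufsize = 1500000;

	rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
			    &bufsize, sizeof(bufsize));
	rv = rv * 100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
				       &bufsize, sizeof(bufsize));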

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h  | 14 ++++++++-
 net/core/filter.c         | 77 ++++++++++++++++++++++++++++++++++++++++++++++-
 samples/bpf/bpf_helpers.h |  3 ++
 3 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 314fdf3..86595f9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -520,6 +520,17 @@ union bpf_attr {
  *     Set full skb->hash.
  *     @skb: pointer to skb
  *     @hash: hash to set
+ *
+ * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
+ *     Calls setsockopt. Not all opts are available, only those with
+ *     integer optvals plus TCP_CONGESTION.
+ *     Supported levels: SOL_SOCKET and IPPROTO_TCP
+ *     @bpf_socket: pointer to bpf_socket
+ *     @level: SOL_SOCKET or IPPROTO_TCP
+ *     @optname: option name
+ *     @optval: pointer to option value
+ *     @optlen: length of optval in bytes
+ *     Return: 0 or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -570,7 +581,8 @@ union bpf_attr {
 	FN(probe_read_str),		\
 	FN(get_socket_cookie),		\
 	FN(get_socket_uid),		\
-	FN(set_hash),
+	FN(set_hash),			\
+	FN(setsockopt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 7d69d16..b114ae1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -54,6 +54,7 @@
 #include <net/dst.h>
 #include <net/sock_reuseport.h>
 #include <net/busy_poll.h>
+#include <net/tcp.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
+	   int, level, int, optname, char *, optval, int, optlen)
+{
+	struct sock *sk = bpf_sock->sk;
+	int ret = 0;
+	int val;
+
+	if (bpf_sock->is_req_sock)
+		return -EINVAL;
+
+	if (level == SOL_SOCKET) {
+		/* Only some socketops are supported */
+		val = *((int *)optval);
+
+		switch (optname) {
+		case SO_RCVBUF:
+			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+			sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+			break;
+		case SO_SNDBUF:
+			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
+			sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
+			break;
+		case SO_MAX_PACING_RATE:
+			sk->sk_max_pacing_rate = val;
+			sk->sk_pacing_rate = min(sk->sk_pacing_rate,
+						 sk->sk_max_pacing_rate);
+			break;
+		case SO_PRIORITY:
+			sk->sk_priority = val;
+			break;
+		case SO_RCVLOWAT:
+			if (val < 0)
+				val = INT_MAX;
+			sk->sk_rcvlowat = val ? : 1;
+			break;
+		case SO_MARK:
+			sk->sk_mark = val;
+			break;
+		default:
+			ret = -EINVAL;
+		}
+	} else if (level == SOL_TCP &&
+		   sk->sk_prot->setsockopt == tcp_setsockopt) {
+		/* Place holder */
+		ret = -EINVAL;
+	} else {
+		ret = -EINVAL;
+	}
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_setsockopt_proto = {
+	.func		= bpf_setsockopt,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2822,6 +2886,17 @@ lwt_inout_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+sock_ops_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_setsockopt:
+		return &bpf_setsockopt_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 lwt_xmit_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -3578,7 +3653,7 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
 };
 
 const struct bpf_verifier_ops sock_ops_prog_ops = {
-	.get_func_proto		= bpf_base_func_proto,
+	.get_func_proto		= sock_ops_func_proto,
 	.is_valid_access	= sock_ops_is_valid_access,
 	.convert_ctx_access	= sock_ops_convert_ctx_access,
 };
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f4840b8..d50ac34 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -60,6 +60,9 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
 	(void *) BPF_FUNC_get_prandom_u32;
 static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
 	(void *) BPF_FUNC_xdp_adjust_head;
+static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval,
+			     int optlen) =
+	(void *) BPF_FUNC_setsockopt;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.3


* [PATCH net-next v3 08/15] bpf: Add TCP connection BPF callbacks
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (6 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 09/15] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Added callbacks to a BPF SOCK_OPS type program before an active
connection is initialized and after a passive or active connection is
established.

The following patch demonstrates how they can be used to set send and
receive buffer sizes.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 11 +++++++++++
 net/ipv4/tcp_fastopen.c  |  1 +
 net/ipv4/tcp_input.c     |  4 +++-
 net/ipv4/tcp_output.c    |  1 +
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 86595f9..4856d16 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -765,6 +765,17 @@ enum {
 					 * window (in packets) or -1 if default
 					 * value should be used
 					 */
+	BPF_SOCK_OPS_TCP_CONNECT_CB,	/* Calls BPF program right before an
+					 * active connection is initialized
+					 */
+	BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB,	/* Calls BPF program when an
+						 * active connection is
+						 * established
+						 */
+	BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB,	/* Calls BPF program when a
+						 * passive connection is
+						 * established
+						 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 4af82b9..ed6b549 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -221,6 +221,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	tcp_init_congestion_control(child);
 	tcp_mtup_init(child);
 	tcp_init_metrics(child);
+	tcp_call_bpf(child, false, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 	tcp_init_buffer_space(child);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0867b05..1b868ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5571,7 +5571,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	icsk->icsk_af_ops->rebuild_header(sk);
 
 	tcp_init_metrics(sk);
-
+	tcp_call_bpf(sk, false, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
 	tcp_init_congestion_control(sk);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
@@ -5977,6 +5977,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else {
 			/* Make sure socket is routed, for correct metrics. */
 			icsk->icsk_af_ops->rebuild_header(sk);
+			tcp_call_bpf(sk, false,
+				     BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 			tcp_init_congestion_control(sk);
 
 			tcp_mtup_init(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e5f623f..958edc8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3445,6 +3445,7 @@ int tcp_connect(struct sock *sk)
 	struct sk_buff *buff;
 	int err;
 
+	tcp_call_bpf(sk, false, BPF_SOCK_OPS_TCP_CONNECT_CB);
 	tcp_connect_init(sk);
 
 	if (unlikely(tp->repair)) {
-- 
2.9.3


* [PATCH net-next v3 09/15] bpf: Sample BPF program to set buffer sizes
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (7 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 08/15] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

This patch contains a BPF program to set initial receive window to
40 packets and send and receive buffers to 1.5MB. This would usually
be done after doing appropriate checks that indicate the hosts are
far enough away (i.e. large RTT).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  1 +
 samples/bpf/tcp_bufs_kern.c | 76 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)
 create mode 100644 samples/bpf/tcp_bufs_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 9aca209..942c7c7 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -115,6 +115,7 @@ always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
 always += tcp_synrto_kern.o
 always += tcp_rwnd_kern.o
+always += tcp_bufs_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_bufs_kern.c b/samples/bpf/tcp_bufs_kern.c
new file mode 100644
index 0000000..6cc096c
--- /dev/null
+++ b/samples/bpf/tcp_bufs_kern.c
@@ -0,0 +1,76 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set initial receive window to 40 packets and send
+ * and receive buffers to 1.5MB. This would usually be done after
+ * doing appropriate checks that indicate the hosts are far enough
+ * away (i.e. large RTT).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_bufs(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int bufsize = 1500000;
+	int rwnd_init = 40;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute the rest of the BPF program
+	 * if one of the port numbers is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Usually there would be a check to ensure the hosts are far
+	 * from each other so it makes sense to increase buffer sizes
+	 */
+	switch (op) {
+	case BPF_SOCK_OPS_RWND_INIT:
+		rv = rwnd_init;
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		/* Set sndbuf and rcvbuf of active connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		/* Nothing to do */
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		/* Set sndbuf and rcvbuf of passive connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3
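
For completeness, a minimal user space loader for the sockops samples in
this series could look roughly like the sketch below. It assumes the
BPF_GLOBAL_SOCK_OPS attach type from patch 01 and the load_bpf_file(),
prog_fd[] and bpf_log_buf helpers already provided by
samples/bpf/bpf_load.h; treat the details as illustrative rather than as
part of the series:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>
#include "bpf_load.h"

int main(int argc, char **argv)
{
	union bpf_attr attr;

	/* Load the SEC("sockops") program from the sample object file */
	if (load_bpf_file("tcp_bufs_kern.o")) {
		printf("%s", bpf_log_buf);
		return 1;
	}

	/* Attach it as the global sockops program; the attach type name
	 * is assumed from patch 01 of this series.
	 */
	memset(&attr, 0, sizeof(attr));
	attr.attach_bpf_fd = prog_fd[0];
	attr.attach_type = BPF_GLOBAL_SOCK_OPS;

	if (syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr))) {
		perror("BPF_PROG_ATTACH");
		return 1;
	}
	return 0;
}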

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net-next v3 10/15] bpf: Add support for changing congestion control
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (8 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 09/15] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  8:40   ` kbuild test robot
  2017-06-20  3:00 ` [PATCH net-next v3 11/15] bpf: Sample BPF program to set " Lawrence Brakmo
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Added support for changing a socket's congestion control from SOCK_OPS
BPF programs through the setsockopt bpf helper function. Also added a
new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for congestion
controls, like dctcp, that need to enable ECN in the SYN packets.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/tcp.h        |  9 ++++++++-
 include/uapi/linux/bpf.h |  3 +++
 net/core/filter.c        | 11 +++++++++--
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_cong.c      | 32 ++++++++++++++++++++++----------
 net/ipv4/tcp_input.c     |  3 ++-
 net/ipv4/tcp_output.c    |  8 +++++---
 7 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ff806d7..58d67be 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1003,7 +1003,9 @@ void tcp_get_default_congestion_control(char *name);
 void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
-int tcp_set_congestion_control(struct sock *sk, const char *name);
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load);
+void tcp_reinit_congestion_control(struct sock *sk,
+				   const struct tcp_congestion_ops *ca);
 u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
@@ -2072,4 +2074,9 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk, bool is_req_sock)
 		rwnd = 0;
 	return rwnd;
 }
+
+static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
+{
+	return (tcp_call_bpf(sk, true, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+}
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4856d16..c222059 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -776,6 +776,9 @@ enum {
 						 * passive connection is
 						 * established
 						 */
+	BPF_SOCK_OPS_NEEDS_ECN,		/* If connection's congestion control
+					 * needs ECN
+					 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index b114ae1..bbf8f78 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2716,8 +2716,15 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 		}
 	} else if (level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
-		/* Place holder */
-		ret = -EINVAL;
+		if (optname == TCP_CONGESTION) {
+			ret = tcp_set_congestion_control(sk, optval, false);
+			if (!ret && bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
+				/* replacing an existing ca */
+				tcp_reinit_congestion_control(sk,
+					inet_csk(sk)->icsk_ca_ops);
+		} else {
+			ret = -EINVAL;
+		}
 	} else {
 		ret = -EINVAL;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 058f509..9476fd6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2479,7 +2479,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		name[val] = 0;
 
 		lock_sock(sk);
-		err = tcp_set_congestion_control(sk, name);
+		err = tcp_set_congestion_control(sk, name, true);
 		release_sock(sk);
 		return err;
 	}
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 324c9bc..fde983f 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
 		INET_ECN_dontxmit(sk);
 }
 
-static void tcp_reinit_congestion_control(struct sock *sk,
-					  const struct tcp_congestion_ops *ca)
+void tcp_reinit_congestion_control(struct sock *sk,
+				   const struct tcp_congestion_ops *ca)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
@@ -333,8 +333,12 @@ int tcp_set_allowed_congestion_control(char *val)
 	return ret;
 }
 
-/* Change congestion control for socket */
-int tcp_set_congestion_control(struct sock *sk, const char *name)
+/* Change congestion control for socket. If load is false, then it is the
+ * responsibility of the caller to call tcp_init_congestion_control or
+ * tcp_reinit_congestion_control (if the current congestion control was
+ * already initialized.
+ */
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_congestion_ops *ca;
@@ -344,21 +348,29 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 		return -EPERM;
 
 	rcu_read_lock();
-	ca = __tcp_ca_find_autoload(name);
+	if (!load)
+		ca = tcp_ca_find(name);
+	else
+		ca = __tcp_ca_find_autoload(name);
 	/* No change asking for existing value */
 	if (ca == icsk->icsk_ca_ops) {
 		icsk->icsk_ca_setsockopt = 1;
 		goto out;
 	}
-	if (!ca)
+	if (!ca) {
 		err = -ENOENT;
-	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
-		   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
+	} else if (!load) {
+		icsk->icsk_ca_ops = ca;
+		if (!try_module_get(ca->owner))
+			err = -EBUSY;
+	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
+		     ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) {
 		err = -EPERM;
-	else if (!try_module_get(ca->owner))
+	} else if (!try_module_get(ca->owner)) {
 		err = -EBUSY;
-	else
+	} else {
 		tcp_reinit_congestion_control(sk, ca);
+	}
  out:
 	rcu_read_unlock();
 	return err;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1b868ae..23f9707 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6192,7 +6192,8 @@ static void tcp_ecn_create_request(struct request_sock *req,
 	ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
 
 	if ((!ect && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
-	    (ecn_ok_dst & DST_FEATURE_ECN_CA))
+	    (ecn_ok_dst & DST_FEATURE_ECN_CA) ||
+	    tcp_bpf_ca_needs_ecn((struct sock *)req))
 		inet_rsk(req)->ecn_ok = 1;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 958edc8..a273117 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -316,7 +316,8 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
 	if (!(tp->ecn_flags & TCP_ECN_OK))
 		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
-	else if (tcp_ca_needs_ecn(sk))
+	else if (tcp_ca_needs_ecn(sk) ||
+		 tcp_bpf_ca_needs_ecn(sk))
 		INET_ECN_xmit(sk);
 }
 
@@ -324,8 +325,9 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
 	bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
-		       tcp_ca_needs_ecn(sk);
+		tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
 
 	if (!use_ecn) {
 		const struct dst_entry *dst = __sk_dst_get(sk);
@@ -339,7 +341,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 	if (use_ecn) {
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
 		tp->ecn_flags = TCP_ECN_OK;
-		if (tcp_ca_needs_ecn(sk))
+		if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
 			INET_ECN_xmit(sk);
 	}
 }
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net-next v3 11/15] bpf: Sample BPF program to set congestion control
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (9 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 12/15] bpf: Adds support for setting initial cwnd Lawrence Brakmo
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Sample BPF program that sets congestion control to dctcp when both hosts
are within the same datacenter. In this example that is assumed to be the
case when the first 5.5 bytes of their IPv6 addresses are the same.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  1 +
 samples/bpf/tcp_cong_kern.c | 73 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+)
 create mode 100644 samples/bpf/tcp_cong_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 942c7c7..eb324e0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -116,6 +116,7 @@ always += cookie_uid_helper_example.o
 always += tcp_synrto_kern.o
 always += tcp_rwnd_kern.o
 always += tcp_bufs_kern.o
+always += tcp_cong_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_cong_kern.c b/samples/bpf/tcp_cong_kern.c
new file mode 100644
index 0000000..d56fb8a
--- /dev/null
+++ b/samples/bpf/tcp_cong_kern.c
@@ -0,0 +1,73 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set congestion control to dctcp when both hosts are
+ * in the same datacenter (as determined by IPv6 prefix).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_cong(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	char cong[] = "dctcp";
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute the rest of the BPF program
+	 * if one of the port numbers is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check if both hosts are in the same datacenter. For this
+	 * example they are if the first 5.5 bytes of their IPv6
+	 * addresses are the same.
+	 */
+	if (skops->family == AF_INET6 &&
+	    skops->local_ip6[0] == skops->remote_ip6[0] &&
+	    (skops->local_ip6[1] & 0xfff00000) ==
+	    (skops->remote_ip6[1] & 0xfff00000)) {
+		switch (op) {
+		case BPF_SOCK_OPS_NEEDS_ECN:
+			rv = 1;
+			break;
+		case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+			rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
+					    cong, sizeof(cong));
+			break;
+		case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+			rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
+					    cong, sizeof(cong));
+			break;
+		default:
+			rv = -1;
+		}
+	} else {
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net-next v3 12/15] bpf: Adds support for setting initial cwnd
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (10 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 11/15] bpf: Sample BPF program to set " Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 13/15] bpf: Sample BPF program to set " Lawrence Brakmo
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
initial congestion window. This can be used when the hosts are far
apart (large RTTs) and it is safe to start with a large initial cwnd.
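
From a sockops program the new option is used roughly as in the minimal
sketch below (patch 13 contains the complete sample; the program name
bpf_set_iw here is arbitrary):

#include <uapi/linux/bpf.h>
#include <linux/socket.h>
#include "bpf_helpers.h"

SEC("sockops")
int bpf_set_iw(struct bpf_sock_ops *skops)
{
	int iw = 40;
	int rv = 0;

	/* TCP_BPF_IW is only accepted before any data segments have been
	 * sent (tp->data_segs_out == 0), so set it right after the
	 * connection is established.
	 */
	if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
		rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw,
				    sizeof(iw));

	return rv;
}
char _license[] SEC("license") = "GPL";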

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h |  2 ++
 net/core/filter.c        | 14 +++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c222059..a07acc6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -781,4 +781,6 @@ enum {
 					 */
 };
 
+#define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index bbf8f78..db6d30c0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2723,7 +2723,19 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 				tcp_reinit_congestion_control(sk,
 					inet_csk(sk)->icsk_ca_ops);
 		} else {
-			ret = -EINVAL;
+			struct tcp_sock *tp = tcp_sk(sk);
+
+			val = *((int *)optval);
+			switch (optname) {
+			case TCP_BPF_IW:
+				if (val <= 0 || tp->data_segs_out > 0)
+					ret = -EINVAL;
+				else
+					tp->snd_cwnd = val;
+				break;
+			default:
+				ret = -EINVAL;
+			}
 		}
 	} else {
 		ret = -EINVAL;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net-next v3 13/15] bpf: Sample BPF program to set initial cwnd
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (11 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 12/15] bpf: Adds support for setting initial cwnd Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 14/15] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 15/15] bpf: Sample bpf program to set " Lawrence Brakmo
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Sample BPF program that assumes hosts are far away (i.e. large RTTs)
and sets initial cwnd and initial receive window to 40 packets,
send and receive buffers to 1.5MB.

In practice there would be a test to ensure the hosts are actually
far enough away.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile      |  1 +
 samples/bpf/tcp_iw_kern.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)
 create mode 100644 samples/bpf/tcp_iw_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index eb324e0..3ec96a0 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -117,6 +117,7 @@ always += tcp_synrto_kern.o
 always += tcp_rwnd_kern.o
 always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
+always += tcp_iw_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_iw_kern.c b/samples/bpf/tcp_iw_kern.c
new file mode 100644
index 0000000..4f978fc
--- /dev/null
+++ b/samples/bpf/tcp_iw_kern.c
@@ -0,0 +1,78 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set initial congestion window and initial receive
+ * window to 40 packets and send and receive buffers to 1.5MB. This
+ * would usually be done after doing appropriate checks that indicate
+ * the hosts are far enough away (i.e. large RTT).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_iw(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int bufsize = 1500000;
+	int rwnd_init = 40;
+	int iw = 40;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute the rest of the BPF program
+	 * if one of the port numbers is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Usually there would be a check to ensure the hosts are far
+	 * from each other so it makes sense to increase buffer sizes
+	 */
+	switch (op) {
+	case BPF_SOCK_OPS_RWND_INIT:
+		rv = rwnd_init;
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		/* Set sndbuf and rcvbuf of active connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw,
+				    sizeof(iw));
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		/* Set sndbuf and rcvbuf of passive connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net-next v3 14/15] bpf: Adds support for setting sndcwnd clamp
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (12 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 13/15] bpf: Sample BPF program to set " Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  2017-06-20  3:00 ` [PATCH net-next v3 15/15] bpf: Sample bpf program to set " Lawrence Brakmo
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
sets the socket's snd_cwnd_clamp (and snd_ssthresh). It is useful to
limit the sndcwnd when the hosts are close to each other (small RTT).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 1 +
 net/core/filter.c        | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a07acc6..47189e5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -782,5 +782,6 @@ enum {
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
+#define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index db6d30c0..664bb9f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2733,6 +2733,14 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 				else
 					tp->snd_cwnd = val;
 				break;
+			case TCP_BPF_SNDCWND_CLAMP:
+				if (val <= 0) {
+					ret = -EINVAL;
+				} else {
+					tp->snd_cwnd_clamp = val;
+					tp->snd_ssthresh = val;
+				}
+				break;
 			default:
 				ret = -EINVAL;
 			}
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH net-next v3 15/15] bpf: Sample bpf program to set sndcwnd clamp
  2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
                   ` (13 preceding siblings ...)
  2017-06-20  3:00 ` [PATCH net-next v3 14/15] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
@ 2017-06-20  3:00 ` Lawrence Brakmo
  14 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-20  3:00 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Sample BPF program, tcp_clamp_kern.c, to demonstrate the use
of setting the sndcwnd clamp. This program assumes that if the
first 5.5 bytes of the hosts' IPv6 addresses are the same, then
the hosts are in the same datacenter and sets the sndcwnd clamp to
100 packets, SYN and SYN-ACK RTOs to 10ms and send/receive buffer
sizes to 150KB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile         |  1 +
 samples/bpf/tcp_clamp_kern.c | 93 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 samples/bpf/tcp_clamp_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3ec96a0..59975c3 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -118,6 +118,7 @@ always += tcp_rwnd_kern.o
 always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
+always += tcp_clamp_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_clamp_kern.c b/samples/bpf/tcp_clamp_kern.c
new file mode 100644
index 0000000..413eeba
--- /dev/null
+++ b/samples/bpf/tcp_clamp_kern.c
@@ -0,0 +1,93 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Sample BPF program to set send and receive buffers to 150KB, sndcwnd clamp
+ * to 100 packets and SYN and SYN_ACK RTOs to 10ms when both hosts are within
+ * the same datacenter. For this example, we assume they are within the same
+ * datacenter when the first 5.5 bytes of their IPv6 addresses are the same.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_clamp(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int bufsize = 150000;
+	int to_init = 10;
+	int clamp = 100;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute the rest of the BPF program
+	 * if one of the port numbers is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check that both hosts are within the same datacenter. For this
+	 * example it is the case when the first 5.5 bytes of their IPv6
+	 * addresses are the same.
+	 */
+	if (skops->family == AF_INET6 &&
+	    skops->local_ip6[0] == skops->remote_ip6[0] &&
+	    (skops->local_ip6[1] & 0xfff00000) ==
+	    (skops->remote_ip6[1] & 0xfff00000)) {
+		switch (op) {
+		case BPF_SOCK_OPS_TIMEOUT_INIT:
+			rv = to_init;
+			break;
+		case BPF_SOCK_OPS_TCP_CONNECT_CB:
+			/* Set sndbuf and rcvbuf of active connections */
+			rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
+					    &bufsize, sizeof(bufsize));
+			rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+						      SO_RCVBUF, &bufsize,
+						      sizeof(bufsize));
+			break;
+		case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+			rv = bpf_setsockopt(skops, SOL_TCP,
+					    TCP_BPF_SNDCWND_CLAMP,
+					    &clamp, sizeof(clamp));
+			break;
+		case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+			/* Set clamp, sndbuf and rcvbuf of passive connections */
+			rv = bpf_setsockopt(skops, SOL_TCP,
+					    TCP_BPF_SNDCWND_CLAMP,
+					    &clamp, sizeof(clamp));
+			rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+						      SO_SNDBUF, &bufsize,
+						      sizeof(bufsize));
+			rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+						      SO_RCVBUF, &bufsize,
+						      sizeof(bufsize));
+			break;
+		default:
+			rv = -1;
+		}
+	} else {
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 10/15] bpf: Add support for changing congestion control
  2017-06-20  3:00 ` [PATCH net-next v3 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
@ 2017-06-20  8:40   ` kbuild test robot
  0 siblings, 0 replies; 28+ messages in thread
From: kbuild test robot @ 2017-06-20  8:40 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Daniel Borkmann, David Ahern

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]

Hi Lawrence,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/bpf-BPF-support-for-sock_ops/20170620-142609
config: i386-randconfig-i0-201725 (attached as .config)
compiler: gcc-4.8 (Debian 4.8.4-1) 4.8.4
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `bpf_setsockopt':
>> (.text+0x326e9): undefined reference to `tcp_setsockopt'
   net/built-in.o: In function `bpf_setsockopt':
>> (.text+0x326f9): undefined reference to `tcp_set_congestion_control'
   net/built-in.o: In function `bpf_setsockopt':
>> (.text+0x32712): undefined reference to `tcp_reinit_congestion_control'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21781 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf
  2017-06-20  3:00 ` [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
@ 2017-06-20 21:25   ` Craig Gallek
  2017-06-21 16:51     ` Lawrence Brakmo
  0 siblings, 1 reply; 28+ messages in thread
From: Craig Gallek @ 2017-06-20 21:25 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Daniel Borkmann, David Ahern

On Mon, Jun 19, 2017 at 11:00 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
> Added support for calling a subset of socket setsockopts from
> BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
> than making the changes to call the socket setsockopt function because
> the changes required would have been larger.
>
> @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>         .arg1_type      = ARG_PTR_TO_CTX,
>  };
>
> +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> +          int, level, int, optname, char *, optval, int, optlen)
> +{
> +       struct sock *sk = bpf_sock->sk;
> +       int ret = 0;
> +       int val;
> +
> +       if (bpf_sock->is_req_sock)
> +               return -EINVAL;
> +
> +       if (level == SOL_SOCKET) {
> +               /* Only some socketops are supported */
> +               val = *((int *)optval);
> +
> +               switch (optname) {
> +               case SO_RCVBUF:
> +                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +                       sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +                       break;
> +               case SO_SNDBUF:
> +                       sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> +                       sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
> +                       break;
> +               case SO_MAX_PACING_RATE:
> +                       sk->sk_max_pacing_rate = val;
> +                       sk->sk_pacing_rate = min(sk->sk_pacing_rate,
> +                                                sk->sk_max_pacing_rate);
> +                       break;
> +               case SO_PRIORITY:
> +                       sk->sk_priority = val;
> +                       break;
> +               case SO_RCVLOWAT:
> +                       if (val < 0)
> +                               val = INT_MAX;
> +                       sk->sk_rcvlowat = val ? : 1;
> +                       break;
> +               case SO_MARK:
> +                       sk->sk_mark = val;
> +                       break;

Isn't the socket lock required when manipulating these fields?  It's
not obvious that the lock is held from every bpf hook point that could
trigger this function...

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf
  2017-06-20 21:25   ` Craig Gallek
@ 2017-06-21 16:51     ` Lawrence Brakmo
  2017-06-21 17:13       ` Craig Gallek
  0 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-21 16:51 UTC (permalink / raw)
  To: Craig Gallek
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Daniel Borkmann, David Ahern


On 6/20/17, 2:25 PM, "Craig Gallek" <kraigatgoog@gmail.com> wrote:

    On Mon, Jun 19, 2017 at 11:00 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
    > Added support for calling a subset of socket setsockopts from
    > BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
    > than making the changes to call the socket setsockopt function because
    > the changes required would have been larger.
    >
    > @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
    >         .arg1_type      = ARG_PTR_TO_CTX,
    >  };
    >
    > +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
    > +          int, level, int, optname, char *, optval, int, optlen)
    > +{
    > +       struct sock *sk = bpf_sock->sk;
    > +       int ret = 0;
    > +       int val;
    > +
    > +       if (bpf_sock->is_req_sock)
    > +               return -EINVAL;
    > +
    > +       if (level == SOL_SOCKET) {
    > +               /* Only some socketops are supported */
    > +               val = *((int *)optval);
    > +
    > +               switch (optname) {
    > +               case SO_RCVBUF:
    > +                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
    > +                       sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
    > +                       break;
    > +               case SO_SNDBUF:
    > +                       sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
    > +                       sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
    > +                       break;
    > +               case SO_MAX_PACING_RATE:
    > +                       sk->sk_max_pacing_rate = val;
    > +                       sk->sk_pacing_rate = min(sk->sk_pacing_rate,
    > +                                                sk->sk_max_pacing_rate);
    > +                       break;
    > +               case SO_PRIORITY:
    > +                       sk->sk_priority = val;
    > +                       break;
    > +               case SO_RCVLOWAT:
    > +                       if (val < 0)
    > +                               val = INT_MAX;
    > +                       sk->sk_rcvlowat = val ? : 1;
    > +                       break;
    > +               case SO_MARK:
    > +                       sk->sk_mark = val;
    > +                       break;
    
    Isn't the socket lock required when manipulating these fields?  It's
    not obvious that the lock is held from every bpf hook point that could
    trigger this function...
    
The sock_ops BPF programs are being called from within the network
stack and my understanding is that the lock has already been taken.
Currently they are only called:
(1) after a packet is received, where there is the call to
bh_lock_sock_nested() in tcp_v4_rcv() before calling
tcp_v4_do_rcv().
(2) in tcp_connect(), where there should be no issue

Just in case I added a check “sock_owned_by_me(sk)” in tcp_call_bpf()
Do you think this is enough, or should I explicitly add a bh_lock_sock_nested
in the bpf_setsockopt function?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf
  2017-06-21 16:51     ` Lawrence Brakmo
@ 2017-06-21 17:13       ` Craig Gallek
  2017-06-21 23:55         ` Lawrence Brakmo
  0 siblings, 1 reply; 28+ messages in thread
From: Craig Gallek @ 2017-06-21 17:13 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Daniel Borkmann, David Ahern

On Wed, Jun 21, 2017 at 12:51 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>
> On 6/20/17, 2:25 PM, "Craig Gallek" <kraigatgoog@gmail.com> wrote:
>
>     On Mon, Jun 19, 2017 at 11:00 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>     > Added support for calling a subset of socket setsockopts from
>     > BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
>     > than making the changes to call the socket setsockopt function because
>     > the changes required would have been larger.
>     >
>     > @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>     >         .arg1_type      = ARG_PTR_TO_CTX,
>     >  };
>     >
>     > +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
>     > +          int, level, int, optname, char *, optval, int, optlen)
>     > +{
>     > +       struct sock *sk = bpf_sock->sk;
>     > +       int ret = 0;
>     > +       int val;
>     > +
>     > +       if (bpf_sock->is_req_sock)
>     > +               return -EINVAL;
>     > +
>     > +       if (level == SOL_SOCKET) {
>     > +               /* Only some socketops are supported */
>     > +               val = *((int *)optval);
>     > +
>     > +               switch (optname) {
>     > +               case SO_RCVBUF:
>     > +                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
>     > +                       sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
>     > +                       break;
>     > +               case SO_SNDBUF:
>     > +                       sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
>     > +                       sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
>     > +                       break;
>     > +               case SO_MAX_PACING_RATE:
>     > +                       sk->sk_max_pacing_rate = val;
>     > +                       sk->sk_pacing_rate = min(sk->sk_pacing_rate,
>     > +                                                sk->sk_max_pacing_rate);
>     > +                       break;
>     > +               case SO_PRIORITY:
>     > +                       sk->sk_priority = val;
>     > +                       break;
>     > +               case SO_RCVLOWAT:
>     > +                       if (val < 0)
>     > +                               val = INT_MAX;
>     > +                       sk->sk_rcvlowat = val ? : 1;
>     > +                       break;
>     > +               case SO_MARK:
>     > +                       sk->sk_mark = val;
>     > +                       break;
>
>     Isn't the socket lock required when manipulating these fields?  It's
>     not obvious that the lock is held from every bpf hook point that could
>     trigger this function...
>
> The sock_ops BPF programs are being called from within the network
> stack and my understanding is that the lock has already been taken.
> Currently they are only called:
> (1) after a packet is received, where there is the call to
> bh_lock_sock_nested() in tcp_v4_rcv() before calling
> tcp_v4_do_rcv().
> (2) in tcp_connect(), where there should be no issue
Someone who understands the TCP stack better than I should verify
this, but even if it's OK to do in these specific spots, it's not
unreasonable to believe that someone will add another socket-context
bpf hook in the future where it would not be safe.  Without some
additional check to prevent this setsockopt function from being called
in those spots, we could run into trouble.  The only other
socket-context point currently is the cgroup one, which happens during
socket creation and should also be safe.

> Just in case I added a check “sock_owned_by_me(sk)” in tcp_call_bpf()
> Do you think this is enough, or should I explicitly add a bh_lock_sock_nested
> in the bpf_setsockopt function?
Adding the check is certainly a way to test the behavior as
implemented, but this bpf function could be called by any
socket-context bpf (not just the tcp_call_bpf ones).  I believe the
current bpf hook points only guarantee RCU read-side lock.  Adding an
additional lock guarantee may have undesirable performance
implications.  If this is just for socket creation or other rare
events it's probably not a big deal, but if it's for a hook in the
fast path it's probably a non-starter.

I guess the higher level question is what should the locking
guarantees for socket-context bpf programs be?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf
  2017-06-21 17:13       ` Craig Gallek
@ 2017-06-21 23:55         ` Lawrence Brakmo
  0 siblings, 0 replies; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-21 23:55 UTC (permalink / raw)
  To: Craig Gallek
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Daniel Borkmann, David Ahern


On 6/21/17, 10:13 AM, "Craig Gallek" <kraigatgoog@gmail.com> wrote:

    On Wed, Jun 21, 2017 at 12:51 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > On 6/20/17, 2:25 PM, "Craig Gallek" <kraigatgoog@gmail.com> wrote:
    >
    >     On Mon, Jun 19, 2017 at 11:00 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
    >     > Added support for calling a subset of socket setsockopts from
    >     > BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
    >     > than making the changes to call the socket setsockopt function because
    >     > the changes required would have been larger.
    >     >
    >     > @@ -2671,6 +2672,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
    >     >         .arg1_type      = ARG_PTR_TO_CTX,
    >     >  };
    >     >
    >     > +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
    >     > +          int, level, int, optname, char *, optval, int, optlen)
    >     > +{
    >     > +       struct sock *sk = bpf_sock->sk;
    >     > +       int ret = 0;
    >     > +       int val;
    >     > +
    >     > +       if (bpf_sock->is_req_sock)
    >     > +               return -EINVAL;
    >     > +
    >     > +       if (level == SOL_SOCKET) {
    >     > +               /* Only some socketops are supported */
    >     > +               val = *((int *)optval);
    >     > +
    >     > +               switch (optname) {
    >     > +               case SO_RCVBUF:
    >     > +                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
    >     > +                       sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
    >     > +                       break;
    >     > +               case SO_SNDBUF:
    >     > +                       sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
    >     > +                       sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
    >     > +                       break;
    >     > +               case SO_MAX_PACING_RATE:
    >     > +                       sk->sk_max_pacing_rate = val;
    >     > +                       sk->sk_pacing_rate = min(sk->sk_pacing_rate,
    >     > +                                                sk->sk_max_pacing_rate);
    >     > +                       break;
    >     > +               case SO_PRIORITY:
    >     > +                       sk->sk_priority = val;
    >     > +                       break;
    >     > +               case SO_RCVLOWAT:
    >     > +                       if (val < 0)
    >     > +                               val = INT_MAX;
    >     > +                       sk->sk_rcvlowat = val ? : 1;
    >     > +                       break;
    >     > +               case SO_MARK:
    >     > +                       sk->sk_mark = val;
    >     > +                       break;
    >
    >     Isn't the socket lock required when manipulating these fields?  It's
    >     not obvious that the lock is held from every bpf hook point that could
    >     trigger this function...
    >
    > The sock_ops BPF programs are being called from within the network
    > stack and my understanding is that the lock has already been taken.
    > Currently they are only called:
    > (1) after a packet is received, where there is the call to
    > bh_lock_sock_nested() in tcp_v4_rcv() before calling
    > tcp_v4_do_rcv().
    > (2) in tcp_connect(), where there should be no issue
    Someone who understands the TCP stack better than I should verify
    this, but even if it's OK to do in these specific spots, it's not
    unreasonable to believe that someone will add another socket-context
    bpf hook in the future where it would not be safe.  Without some
    additional check to prevent this setsockopt function from being called
    in those spots, we could run into trouble.  The only other
    socket-context point currently is the cgroup one, which happens during
    socket creation and should also be safe.

The cgroup socket (BPF_PROG_TYPE_CGROUP_SOCK) is a different prog type
than BPF_PROG_TYPE_SOCK_OPS and it cannot call the bpf_setsockopt function.
And it should not because the context arguments are different (struct sock *
vs. struct bpf_sock_ops_kern *)
  
    > Just in case I added a check “sock_owned_by_me(sk)” in tcp_call_bpf()
    > Do you think this is enough, or should I explicitly add a bh_lock_sock_nested
    > in the bpf_setsockopt function?
    Adding the check is certainly a way to test the behavior as
    implemented, but this bpf function could be called by any
    socket-context bpf (not just the tcp_call_bpf ones).  I believe the

Can only be called by tcp_call_bpf ones

    current bpf hook points only guarantee RCU read-side lock.  Adding an
    additional lock guarantee may have undesirable performance
    implications.  If this is just for socket creation or other rare
    events it's probably not a big deal, but if it's for a hook in the
    fast path it's probably a non-starter.

I agree
    
    I guess the higher level question is what should the locking
    guarantees for socket-context bpf programs be?

Since the whole point of this new bpf prog type is to be able to
modify sock and TCP parameters, we should insure that
tcp_call_bpf is only called when it is safe to change tcp sock state
(either because the sock lock is held or because it is safe for
other reasons).
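
Concretely, the sock_owned_by_me() check mentioned earlier amounts to
something like the following sketch of tcp_call_bpf(); the actual helper
added in patch 01 may be shaped a bit differently, and sock_owned_by_me()
only warns under lockdep when the caller does not hold the socket lock:

static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
{
	struct bpf_sock_ops_kern sock_ops;

	/* Assert that the caller holds the socket lock; req_socks are
	 * not locked this way, so skip the check for them.
	 */
	if (!is_req_sock)
		sock_owned_by_me(sk);

	memset(&sock_ops, 0, sizeof(sock_ops));
	sock_ops.sk = sk;
	sock_ops.is_req_sock = is_req_sock;
	sock_ops.op = op;

	return bpf_sock_ops_call(&sock_ops);
}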


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-20  3:00 ` [PATCH net-next v3 01/15] bpf: BPF support for sock_ops Lawrence Brakmo
@ 2017-06-22 22:41   ` Daniel Borkmann
  2017-06-22 22:58     ` Lawrence Brakmo
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2017-06-22 22:41 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 06/20/2017 05:00 AM, Lawrence Brakmo wrote:
[...]
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f94b48b..861dbe9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -120,12 +120,14 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_LWT_IN,
>   	BPF_PROG_TYPE_LWT_OUT,
>   	BPF_PROG_TYPE_LWT_XMIT,
> +	BPF_PROG_TYPE_SOCK_OPS,
>   };
>
>   enum bpf_attach_type {
>   	BPF_CGROUP_INET_INGRESS,
>   	BPF_CGROUP_INET_EGRESS,
>   	BPF_CGROUP_INET_SOCK_CREATE,
> +	BPF_GLOBAL_SOCK_OPS,
>   	__MAX_BPF_ATTACH_TYPE
>   };
[...]
>   #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8942c82..e02831f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
[...]
> +static int bpf_prog_attach(const union bpf_attr *attr)
> +{
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	if (CHECK_ATTR(BPF_PROG_ATTACH))
> +		return -EINVAL;
> +
> +	if (attr->attach_type == BPF_GLOBAL_SOCK_OPS)
> +		return bpf_sock_ops_attach_global_prog(attr->attach_bpf_fd);
> +	else
> +		return bpf_prog_attach_cgroup(attr);
> +}
> +
[...]
> +static int bpf_prog_detach(const union bpf_attr *attr)
> +{
> +	if (!capable(CAP_NET_ADMIN))
> +		return -EPERM;
> +
> +	if (CHECK_ATTR(BPF_PROG_DETACH))
> +		return -EINVAL;
> +
> +	if (attr->attach_type == BPF_GLOBAL_SOCK_OPS)
> +		return bpf_sock_ops_detach_global_prog();
> +	else
> +		return bpf_prog_detach_cgroup(attr);
> +}
>
>   #define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
>
> @@ -1431,14 +1467,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>   	case BPF_OBJ_GET:
>   		err = bpf_obj_get(&attr);
>   		break;
> -#ifdef CONFIG_CGROUP_BPF
>   	case BPF_PROG_ATTACH:
>   		err = bpf_prog_attach(&attr);
>   		break;
>   	case BPF_PROG_DETACH:
>   		err = bpf_prog_detach(&attr);
>   		break;
> -#endif
>   	case BPF_PROG_TEST_RUN:
>   		err = bpf_prog_test_run(&attr, uattr);
>   		break;
[...]
> diff --git a/net/core/sock_bpfops.c b/net/core/sock_bpfops.c
> new file mode 100644
> index 0000000..06f4a64
> --- /dev/null
> +++ b/net/core/sock_bpfops.c
> @@ -0,0 +1,65 @@
> +/*
> + * BPF support for sockets
> + *
> + * Copyright (c) 2016 Lawrence Brakmo <brakmo@fb.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2
> + * as published by the Free Software Foundation.
> + */
> +
> +#include <net/sock.h>
> +#include <linux/skbuff.h>
> +#include <linux/bpf.h>
> +#include <linux/filter.h>
> +#include <linux/errno.h>
> +#ifdef CONFIG_NET_NS
> +#include <net/net_namespace.h>
> +#include <linux/proc_ns.h>
> +#endif
> +
> +/* Global BPF program for sockets */
> +static struct bpf_prog *bpf_global_sock_ops_prog;
> +
> +int bpf_sock_ops_detach_global_prog(void)
> +{
> +	struct bpf_prog *old_prog;
> +
> +	old_prog = xchg(&bpf_global_sock_ops_prog, NULL);
> +
> +	if (old_prog)
> +		bpf_prog_put(old_prog);
> +
> +	return 0;
> +}
> +
> +int bpf_sock_ops_attach_global_prog(int fd)
> +{
> +	struct bpf_prog *prog, *old_prog;
> +	int err = 0;
> +
> +	prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCK_OPS);
> +	if (IS_ERR(prog))
> +		return PTR_ERR(prog);
> +
> +	old_prog = xchg(&bpf_global_sock_ops_prog, prog);
> +	if (old_prog)
> +		bpf_prog_put(old_prog);
> +	return err;
> +}
> +
> +int bpf_sock_ops_call(struct bpf_sock_ops_kern *bpf_sock)
> +{
> +	struct bpf_prog *prog;
> +	int ret;
> +
> +	rcu_read_lock();
> +	prog =  READ_ONCE(bpf_global_sock_ops_prog);
> +	if (prog)
> +		ret = BPF_PROG_RUN(prog, bpf_sock);
> +	else
> +		ret = -1;
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Now that we integrate with BPF_PROG_ATTACH/DETACH, can you make all
the above also per cgroup as we have with all other BPF_CGROUP_INET_*
progs? It seems kind of weird that we have one single global program
doing the enforcement of TCP and congctl options. Something on a more
fine-grained level like cgroups would be more suited wrt containers,
etc. Right now there's no notion of global program of such kind.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-22 22:41   ` Daniel Borkmann
@ 2017-06-22 22:58     ` Lawrence Brakmo
  2017-06-22 23:19       ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-22 22:58 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern


On 6/22/17, 3:41 PM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:

    On 06/20/2017 05:00 AM, Lawrence Brakmo wrote:
    [...]
    > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
    > index f94b48b..861dbe9 100644
    > --- a/include/uapi/linux/bpf.h
    > +++ b/include/uapi/linux/bpf.h
    > @@ -120,12 +120,14 @@ enum bpf_prog_type {
    >   	BPF_PROG_TYPE_LWT_IN,
    >   	BPF_PROG_TYPE_LWT_OUT,
    >   	BPF_PROG_TYPE_LWT_XMIT,
    > +	BPF_PROG_TYPE_SOCK_OPS,
    >   };
    >
    >   enum bpf_attach_type {
    >   	BPF_CGROUP_INET_INGRESS,
    >   	BPF_CGROUP_INET_EGRESS,
    >   	BPF_CGROUP_INET_SOCK_CREATE,
    > +	BPF_GLOBAL_SOCK_OPS,
    >   	__MAX_BPF_ATTACH_TYPE
    >   };
    [...]
    >   #endif /* _UAPI__LINUX_BPF_H__ */
    > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
    > index 8942c82..e02831f 100644
    > --- a/kernel/bpf/syscall.c
    > +++ b/kernel/bpf/syscall.c
    [...]
    > +static int bpf_prog_attach(const union bpf_attr *attr)
    > +{
    > +	if (!capable(CAP_NET_ADMIN))
    > +		return -EPERM;
    > +
    > +	if (CHECK_ATTR(BPF_PROG_ATTACH))
    > +		return -EINVAL;
    > +
    > +	if (attr->attach_type == BPF_GLOBAL_SOCK_OPS)
    > +		return bpf_sock_ops_attach_global_prog(attr->attach_bpf_fd);
    > +	else
    > +		return bpf_prog_attach_cgroup(attr);
    > +}
    > +
    [...]
    > +static int bpf_prog_detach(const union bpf_attr *attr)
    > +{
    > +	if (!capable(CAP_NET_ADMIN))
    > +		return -EPERM;
    > +
    > +	if (CHECK_ATTR(BPF_PROG_DETACH))
    > +		return -EINVAL;
    > +
    > +	if (attr->attach_type == BPF_GLOBAL_SOCK_OPS)
    > +		return bpf_sock_ops_detach_global_prog();
    > +	else
    > +		return bpf_prog_detach_cgroup(attr);
    > +}
    >
    >   #define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
    >
    > @@ -1431,14 +1467,12 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
    >   	case BPF_OBJ_GET:
    >   		err = bpf_obj_get(&attr);
    >   		break;
    > -#ifdef CONFIG_CGROUP_BPF
    >   	case BPF_PROG_ATTACH:
    >   		err = bpf_prog_attach(&attr);
    >   		break;
    >   	case BPF_PROG_DETACH:
    >   		err = bpf_prog_detach(&attr);
    >   		break;
    > -#endif
    >   	case BPF_PROG_TEST_RUN:
    >   		err = bpf_prog_test_run(&attr, uattr);
    >   		break;
    [...]
    > diff --git a/net/core/sock_bpfops.c b/net/core/sock_bpfops.c
    > new file mode 100644
    > index 0000000..06f4a64
    > --- /dev/null
    > +++ b/net/core/sock_bpfops.c
    > @@ -0,0 +1,65 @@
    > +/*
    > + * BPF support for sockets
    > + *
    > + * Copyright (c) 2016 Lawrence Brakmo <brakmo@fb.com>
    > + *
    > + * This program is free software; you can redistribute it and/or modify
    > + * it under the terms of the GNU General Public License version 2
    > + * as published by the Free Software Foundation.
    > + */
    > +
    > +#include <net/sock.h>
    > +#include <linux/skbuff.h>
    > +#include <linux/bpf.h>
    > +#include <linux/filter.h>
    > +#include <linux/errno.h>
    > +#ifdef CONFIG_NET_NS
    > +#include <net/net_namespace.h>
    > +#include <linux/proc_ns.h>
    > +#endif
    > +
    > +/* Global BPF program for sockets */
    > +static struct bpf_prog *bpf_global_sock_ops_prog;
    > +
    > +int bpf_sock_ops_detach_global_prog(void)
    > +{
    > +	struct bpf_prog *old_prog;
    > +
    > +	old_prog = xchg(&bpf_global_sock_ops_prog, NULL);
    > +
    > +	if (old_prog)
    > +		bpf_prog_put(old_prog);
    > +
    > +	return 0;
    > +}
    > +
    > +int bpf_sock_ops_attach_global_prog(int fd)
    > +{
    > +	struct bpf_prog *prog, *old_prog;
    > +	int err = 0;
    > +
    > +	prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCK_OPS);
    > +	if (IS_ERR(prog))
    > +		return PTR_ERR(prog);
    > +
    > +	old_prog = xchg(&bpf_global_sock_ops_prog, prog);
    > +	if (old_prog)
    > +		bpf_prog_put(old_prog);
    > +	return err;
    > +}
    > +
    > +int bpf_sock_ops_call(struct bpf_sock_ops_kern *bpf_sock)
    > +{
    > +	struct bpf_prog *prog;
    > +	int ret;
    > +
    > +	rcu_read_lock();
    > +	prog =  READ_ONCE(bpf_global_sock_ops_prog);
    > +	if (prog)
    > +		ret = BPF_PROG_RUN(prog, bpf_sock);
    > +	else
    > +		ret = -1;
    > +	rcu_read_unlock();
    > +
    > +	return ret;
    > +}
    
    Now that we integrate with BPF_PROG_ATTACH/DETACH, can you make all
    the above also per cgroup as we have with all other BPF_CGROUP_INET_*
    progs? It seems kind of weird that we have one single global program
    doing the enforcement of TCP and congctl options. Something on a more
    fine-grained level like cgroups would be more suited wrt containers,
    etc. Right now there's no notion of global program of such kind.
    
    Thanks,
    Daniel


Daniel, I see value for having a global program, so I would like to keep that. When
this patchset is accepted, I will submit one that adds support for per cgroup
sock_ops programs, with the option to use the global one if none is
specified for a cgroup. We could also have the option of the cgroup sock_ops
program choosing if the global program should run for a particular op based on
its return value. We can iron out the details when that patch is submitted.

Is it acceptable?

Thanks,
Lawrence    


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-22 22:58     ` Lawrence Brakmo
@ 2017-06-22 23:19       ` Daniel Borkmann
  2017-06-22 23:57         ` Lawrence Brakmo
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2017-06-22 23:19 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 06/23/2017 12:58 AM, Lawrence Brakmo wrote:
[...]
> Daniel, I see value in having a global program, so I would like to keep that. When
> this patchset is accepted, I will submit one that adds support for per cgroup
> sock_ops programs, with the option to use the global one if none is
> specified for a cgroup. We could also have the option of the cgroup sock_ops
> program choosing whether the global program should run for a particular op based on
> its return value. We can iron out the details when that patch is submitted.

Hm, could you elaborate on the value compared to per-cgroup ops?
My understanding is that per cgroup would already be a proper superset
of just the global one anyway, so why not go with that in the first
place since you're working on it?

What would be the additional value? How would the global and per-cgroup
ones interact with each other in terms of enforcement? E.g., there are
already semantics in place for cgroup descendants; would we set
TCP parameters twice, or would you disable the global one altogether?
Just wondering, as you could avoid these questions altogether by going
via cgroups initially.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-22 23:19       ` Daniel Borkmann
@ 2017-06-22 23:57         ` Lawrence Brakmo
  2017-06-23 21:15           ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-22 23:57 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern


On 6/22/17, 4:19 PM, "netdev-owner@vger.kernel.org on behalf of Daniel Borkmann" <netdev-owner@vger.kernel.org on behalf of daniel@iogearbox.net> wrote:

    On 06/23/2017 12:58 AM, Lawrence Brakmo wrote:
    [...]
    > Daniel, I see value in having a global program, so I would like to keep that. When
    > this patchset is accepted, I will submit one that adds support for per cgroup
    > sock_ops programs, with the option to use the global one if none is
    > specified for a cgroup. We could also have the option of the cgroup sock_ops
    > program choosing whether the global program should run for a particular op based on
    > its return value. We can iron out the details when that patch is submitted.
    
    Hm, could you elaborate on the value compared to per-cgroup ops?
    My understanding is that per cgroup would already be a proper superset
    of just the global one anyway, so why not go with that in the first
    place since you're working on it?
    
    What would be the additional value? How would the global and per-cgroup
    ones interact with each other in terms of enforcement? E.g., there are
    already semantics in place for cgroup descendants; would we set
    TCP parameters twice, or would you disable the global one altogether?
    Just wondering, as you could avoid these questions altogether by going
    via cgroups initially.
    
    Thanks,
    Daniel
    
Well, for starters the global program will work even if CONFIG_CGROUP_BPF is
not defined. It is also a simpler concept when a global program is all that
is required. But I also had in mind that behaviors that are common to
most cgroup programs could be handled by the global program instead of
adding them to every cgroup program. In this scenario the global program
represents the default behavior that can be overridden by the cgroup
program (per op). For example, the cgroup program could return a value
to indicate that the op should be passed to the global program.
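
To illustrate what I have in mind (a rough sketch only, following the style of
the sample programs in this series; the RUN_GLOBAL return value is made up and
not something this patch set defines):

/* Hypothetical per-cgroup sock_ops program that handles one op itself and
 * defers everything else to the global program via a made-up return value.
 */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#define RUN_GLOBAL	-2	/* hypothetical "defer to the global program" value */

SEC("sockops")
int cg_sock_ops(struct bpf_sock_ops *skops)
{
	switch (skops->op) {
	case BPF_SOCK_OPS_RWND_INIT:
		/* This cgroup wants its own initial receive window. */
		return 40;
	default:
		/* Everything else falls back to the global program. */
		return RUN_GLOBAL;
	}
}

char _license[] SEC("license") = "GPL";

Whether the kernel then actually runs the global program for that op is exactly
the kind of detail to settle when the per-cgroup patch goes out.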

I agree 100% with you on the value of cgroup programs, but I just happen
to think there is also value in the global program.

Thanks,
Lawrence


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-22 23:57         ` Lawrence Brakmo
@ 2017-06-23 21:15           ` Daniel Borkmann
  2017-06-28 17:45             ` Lawrence Brakmo
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2017-06-23 21:15 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 06/23/2017 01:57 AM, Lawrence Brakmo wrote:
> On 6/22/17, 4:19 PM, "netdev-owner@vger.kernel.org on behalf of Daniel Borkmann" <netdev-owner@vger.kernel.org on behalf of daniel@iogearbox.net> wrote:
>
>      On 06/23/2017 12:58 AM, Lawrence Brakmo wrote:
>      [...]
>      > Daniel, I see value in having a global program, so I would like to keep that. When
>      > this patchset is accepted, I will submit one that adds support for per cgroup
>      > sock_ops programs, with the option to use the global one if none is
>      > specified for a cgroup. We could also have the option of the cgroup sock_ops
>      > program choosing whether the global program should run for a particular op based on
>      > its return value. We can iron out the details when that patch is submitted.
>
>      Hm, could you elaborate on the value compared to per-cgroup ops?
>      My understanding is that per cgroup would already be a proper superset
>      of just the global one anyway, so why not go with that in the first
>      place since you're working on it?
>
>      What would be the additional value? How would the global and per-cgroup
>      ones interact with each other in terms of enforcement? E.g., there are
>      already semantics in place for cgroup descendants; would we set
>      TCP parameters twice, or would you disable the global one altogether?
>      Just wondering, as you could avoid these questions altogether by going
>      via cgroups initially.
>
>      Thanks,
>      Daniel
>
> Well, for starters the global program will work even if CONFIG_CGROUP_BPF is
> not defined. It is also a simpler concept when a global program is all that

Otoh, major distros are highly likely to enable this by default anyway.

> is required. But I also had in mind that behaviors that are common to
> most cgroup programs could be handled by the global program instead of
> adding them to every cgroup program. In this scenario the global program
> represents the default behavior that can be overridden by the cgroup
> program (per op). For example, the cgroup program could return a value
> to indicate that the op should be passed to the global program.

But then you would need to go through two program passes for setting
such parameters? Another option could be to make the per-cgroup ops more
fine-grained and use the effective one that was inherited for delegating
to default ops. My gut feeling is just that this makes the interactions
and enforcement harder to manage once the planned per-cgroup ops land,
when the same use case could be handled with per-cgroup ops only.

> I agree 100% with you on the value of cgroup programs, but I just happen
> to think there is also value in the global program.
>
> Thanks,
> Lawrence

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-23 21:15           ` Daniel Borkmann
@ 2017-06-28 17:45             ` Lawrence Brakmo
  2017-06-29  9:47               ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:45 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 6/23/17, 2:15 PM, "Daniel Borkmann" <daniel@iogearbox.net> wrote:

    On 06/23/2017 01:57 AM, Lawrence Brakmo wrote:
    > On 6/22/17, 4:19 PM, "netdev-owner@vger.kernel.org on behalf of Daniel Borkmann" <netdev-owner@vger.kernel.org on behalf of daniel@iogearbox.net> wrote:
    >
    >      On 06/23/2017 12:58 AM, Lawrence Brakmo wrote:
    >      [...]
    >      > Daniel, I see value in having a global program, so I would like to keep that. When
    >      > this patchset is accepted, I will submit one that adds support for per cgroup
    >      > sock_ops programs, with the option to use the global one if none is
    >      > specified for a cgroup. We could also have the option of the cgroup sock_ops
    >      > program choosing whether the global program should run for a particular op based on
    >      > its return value. We can iron out the details when that patch is submitted.
    >
    >      Hm, could you elaborate on the value compared to per-cgroup ops?
    >      My understanding is that per cgroup would already be a proper superset
    >      of just the global one anyway, so why not go with that in the first
    >      place since you're working on it?
    >
    >      What would be the additional value? How would the global and per-cgroup
    >      ones interact with each other in terms of enforcement? E.g., there are
    >      already semantics in place for cgroup descendants; would we set
    >      TCP parameters twice, or would you disable the global one altogether?
    >      Just wondering, as you could avoid these questions altogether by going
    >      via cgroups initially.
    >
    >      Thanks,
    >      Daniel
    >
    > Well, for starters the global program will work even if CONFIG_CGROUP_BPF is
    > not defined. It is also a simpler concept when a global program is all that
    
    Otoh, major distros are highly likely to enable this by default anyway.
    
    > is required. But I also had in mind that behaviors that are common to
    > most cgroup programs could be handled by the global program instead of
    > adding them to every cgroup program. In this scenario the global program
    > represents the default behavior that can be overridden by the cgroup
    > program (per op). For example, the cgroup program could return a value
    > to indicate that the op should be passed to the global program.
    
    But then you would need to go through two program passes for setting
    such parameters? Another option could be to make the per-cgroup ops more
    fine-grained and use the effective one that was inherited for delegating
    to default ops. My gut feeling is just that this makes the interactions
    and enforcement harder to manage once the planned per-cgroup ops land,
    when the same use case could be handled with per-cgroup ops only.
    
Daniel, thank you for the feedback. I just submitted a new patch set without
the global program and using the bpf cgroups framework instead.

    > I agree 100% with you on the value of cgroup programs, but I just happen
    > to think there is also value in the global program.
    >
    > Thanks,
    > Lawrence
    


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH net-next v3 01/15] bpf: BPF support for sock_ops
  2017-06-28 17:45             ` Lawrence Brakmo
@ 2017-06-29  9:47               ` Daniel Borkmann
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2017-06-29  9:47 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 06/28/2017 07:45 PM, Lawrence Brakmo wrote:
[...]
> Daniel, thank you for the feedback. I just submitted a new patch set without
> the global program and using the bpf cgroups framework instead.

Awesome, thanks for working on it!

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-06-29  9:47 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-20  3:00 PATCH net-next v3 00/15 Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 01/15] bpf: BPF support for sock_ops Lawrence Brakmo
2017-06-22 22:41   ` Daniel Borkmann
2017-06-22 22:58     ` Lawrence Brakmo
2017-06-22 23:19       ` Daniel Borkmann
2017-06-22 23:57         ` Lawrence Brakmo
2017-06-23 21:15           ` Daniel Borkmann
2017-06-28 17:45             ` Lawrence Brakmo
2017-06-29  9:47               ` Daniel Borkmann
2017-06-20  3:00 ` [PATCH net-next v3 02/15] bpf: program to load sock_ops BPF programs Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 03/15] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 04/15] bpf: Sample bpf program to set " Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 05/15] bpf: Support for setting initial receive window Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 06/15] bpf: Sample bpf program to set initial window Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 07/15] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
2017-06-20 21:25   ` Craig Gallek
2017-06-21 16:51     ` Lawrence Brakmo
2017-06-21 17:13       ` Craig Gallek
2017-06-21 23:55         ` Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 08/15] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 09/15] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 10/15] bpf: Add support for changing congestion control Lawrence Brakmo
2017-06-20  8:40   ` kbuild test robot
2017-06-20  3:00 ` [PATCH net-next v3 11/15] bpf: Sample BPF program to set " Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 12/15] bpf: Adds support for setting initial cwnd Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 13/15] bpf: Sample BPF program to set " Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 14/15] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
2017-06-20  3:00 ` [PATCH net-next v3 15/15] bpf: Sample bpf program to set " Lawrence Brakmo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.