* [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops
@ 2017-06-28 17:31 Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
                   ` (15 more replies)
  0 siblings, 16 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.) and to set
connection parameters such as buffer sizes, initial window, SYN/SYN-ACK
RTOs, etc.

Unlike current BPF program types that expect to be called at a particular
place in the network stack code, SOCK_OPS programs can be called at
different places and use an "op" field to indicate the context. There
are currently two types of operations: those whose effect is through
their return value and those whose effect is through the new
bpf_setsockopt BPF helper function.

Example ops of the first type are:
  BPF_SOCK_OPS_TIMEOUT_INIT
  BPF_SOCK_OPS_RWND_INIT
  BPF_SOCK_OPS_NEEDS_ECN

Example ops of the second type are:
  BPF_SOCK_OPS_TCP_CONNECT_CB
  BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB
  BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
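
To make the two kinds of ops concrete, here is a minimal userspace sketch
of how a program might dispatch on the "op" field. It is not one of the
samples in this series: the mock_* names, the enum values and the reply
value 10 are placeholders invented for illustration, and the helper call
of a real program is replaced by a flag.

```c
#include <stdint.h>

/* Hypothetical stand-ins: the real op values and context struct come
 * from the patches in this series, not from this sketch.
 */
enum mock_op {
	MOCK_OP_VOID,
	MOCK_OP_TIMEOUT_INIT,		/* first type: answers via reply */
	MOCK_OP_ACTIVE_ESTABLISHED_CB,	/* second type: side effects only */
};

struct mock_sock_ops {
	uint32_t op;
	uint32_t reply;
};

/* Records what a second-type op would have configured. */
struct mock_conn_state {
	int bufs_tuned;
};

/* Returns 1 to signal that the program ran, mirroring the "== 1" check
 * in __cgroup_bpf_run_filter_sock_ops().
 */
static int mock_sock_ops_prog(struct mock_sock_ops *ctx,
			      struct mock_conn_state *st)
{
	int rv = -1;	/* -1: op not handled, caller keeps its default */

	switch (ctx->op) {
	case MOCK_OP_TIMEOUT_INIT:
		rv = 10;		/* arbitrary illustrative value */
		break;
	case MOCK_OP_ACTIVE_ESTABLISHED_CB:
		st->bufs_tuned = 1;	/* stand-in for a helper call */
		rv = 0;
		break;
	default:
		break;
	}
	ctx->reply = (uint32_t)rv;
	return 1;
}
```

A single program handles every op it cares about in one switch and
reports -1 for the rest, so the caller can fall back to its defaults.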

Current ops are only called during connection establishment so
there should not be any BPF overhead after connection establishment. The
main idea is to use connection information from both hosts, such as IP
addresses and ports, to allow setting of per-connection parameters to
optimize the connection's performance.

Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (e.g. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.
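
As a rough illustration of the probabilistic case, the sketch below puts
a flow into an experimental bucket based on a hash of its addressing
information. The mixer, the function names and the choice of inputs are
assumptions made for this example; the series itself does not prescribe
how flows are sampled.

```c
#include <stdint.h>

/* Small integer mixer (a hypothetical choice; any well-distributed hash
 * would serve the same purpose).
 */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	return x;
}

/* Returns 1 if the flow falls into the experimental bucket, which is
 * sized by "percent" (0..100). The same flow always gets the same answer.
 */
static int flow_in_experiment(uint32_t remote_ip4, uint32_t local_port,
			      uint32_t remote_port, uint32_t percent)
{
	uint32_t h = mix32(remote_ip4 ^
			   mix32((local_port << 16) | remote_port));

	return (h % 100) < percent;
}
```

Because the decision is a pure function of the connection's fields, both
the treated and the control flows can be identified later when comparing
results.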

It uses the existing bpf cgroups infrastructure so the programs can be
attached per cgroup with full inheritance support. Although the bpf cgroup
framework already contains a sock related program type
(BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
only once during the connection's lifetime. In contrast, the new program
type will be called multiple times from different places in the network
stack code. For example, it is called before sending SYNs and SYN-ACKs to
set an appropriate timeout, and when the connection is established to set
congestion control. As a result it has an "op" field to specify the type
of operation requested.

This patch set also includes sample BPF programs to demonstrate the
different features.

v2: Formatting changes, rebased to latest net-next

v3: Fixed build issues, changed socket_ops to sock_ops throughout,
    fixed formatting issues, removed the syscall to load sock_ops
    program and added functionality to use existing bpf attach and
    bpf detach system calls, removed reader/writer locks in
    sock_bpfops.c (used when saving sock_ops global program)
    and fixed missing module refcount increment.

v4: Removed global sock_ops program and instead used existing cgroup bpf
    infrastructure to support a new attach type (BPF_CGROUP_SOCK_OPS).

Consists of the following patches:


 include/linux/bpf-cgroup.h     |  18 ++++
 include/linux/bpf_types.h      |   1 +
 include/linux/filter.h         |  10 ++
 include/net/tcp.h              |  67 +++++++++++-
 include/uapi/linux/bpf.h       |  66 +++++++++++-
 kernel/bpf/cgroup.c            |  37 +++++++
 kernel/bpf/syscall.c           |   5 +
 net/core/filter.c              | 271 +++++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp.c                 |   2 +-
 net/ipv4/tcp_cong.c            |  32 ++++--
 net/ipv4/tcp_fastopen.c        |   1 +
 net/ipv4/tcp_input.c           |  10 +-
 net/ipv4/tcp_minisocks.c       |   9 +-
 net/ipv4/tcp_output.c          |  18 +++-
 samples/bpf/Makefile           |   9 ++
 samples/bpf/bpf_helpers.h      |   3 +
 samples/bpf/bpf_load.c         |  13 ++-
 samples/bpf/load_sock_ops.c    |  97 +++++++++++++++++
 samples/bpf/tcp_bufs_kern.c    |  77 ++++++++++++++
 samples/bpf/tcp_clamp_kern.c   |  94 ++++++++++++++++
 samples/bpf/tcp_cong_kern.c    |  74 +++++++++++++
 samples/bpf/tcp_iw_kern.c      |  79 ++++++++++++++
 samples/bpf/tcp_rwnd_kern.c    |  61 +++++++++++
 samples/bpf/tcp_synrto_kern.c  |  60 +++++++++++
 tools/include/uapi/linux/bpf.h |  66 +++++++++++-
 25 files changed, 1154 insertions(+), 26 deletions(-)


* [PATCH net-next v4 01/16] bpf: BPF support for sock_ops
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 19:53   ` Alexei Starovoitov
                     ` (3 more replies)
  2017-06-28 17:31 ` [PATCH net-next v4 02/16] bpf: program to load and attach sock_ops BPF progs Lawrence Brakmo
                   ` (14 subsequent siblings)
  15 siblings, 4 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
struct that allows BPF programs of this type to access some of the
socket's fields (such as IP addresses, ports, etc.). It uses the
existing bpf cgroups infrastructure so the programs can be attached per
cgroup with full inheritance support. The program will be called at
appropriate times to set relevant connection parameters such as buffer
sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
as IP addresses, port numbers, etc.

Although there are already 3 mechanisms to set parameters (sysctls,
route metrics and setsockopts), this new mechanism provides some
distinct advantages. Unlike sysctls, it can set parameters per
connection. In contrast to route metrics, it can also use port numbers
and information provided by a user level program. In addition, it could
set parameters probabilistically for evaluation purposes (i.e. do
something different on 10% of the flows and compare results with the
other 90% of the flows). Also, in cases where IPv6 addresses contain
geographic information, the rules to make changes based on the distance
(or RTT) between the hosts are much easier than route metric rules and
can be global. Finally, unlike setsockopt, it does not require
application changes and it can be updated easily at any time.

Although the bpf cgroup framework already contains a sock related
program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
(BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
only once during the connection's lifetime. In contrast, the new
program type will be called multiple times from different places in the
network stack code. For example, it is called before sending SYNs and
SYN-ACKs to set an appropriate timeout, and when the connection is
established to set congestion control. As a result it has an "op" field
to specify the type of operation requested.

The purpose of this new program type is to simplify setting connection
parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
easy to use Facebook's internal IPv6 addresses to determine if both hosts
of a connection are in the same datacenter. Therefore, it is easy to
write a BPF program to choose a small SYN RTO value when both hosts are
in the same datacenter.
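
A minimal sketch of that datacenter check, written as plain C rather
than as a loadable BPF program. It assumes, purely for illustration,
that the first 64 bits of an address identify the datacenter; the actual
address layout is not described in this series. DC_PREFIX_WORDS and
RTO_INTRA_DC are made-up names and values.

```c
#include <stdint.h>

#define DC_PREFIX_WORDS	2	/* assumption: first 64 bits name the DC */
#define RTO_INTRA_DC	10	/* arbitrary illustrative RTO value */

/* Compares the leading 32-bit words of both addresses, using the same
 * four-word layout as local_ip6[]/remote_ip6[] in struct bpf_sock_ops.
 */
static int same_datacenter(const uint32_t local_ip6[4],
			   const uint32_t remote_ip6[4])
{
	int i;

	for (i = 0; i < DC_PREFIX_WORDS; i++) {
		if (local_ip6[i] != remote_ip6[i])
			return 0;
	}
	return 1;
}

/* Value a program could report for a SYN RTO request: a small RTO inside
 * the datacenter, otherwise -1 so the stack keeps its default.
 */
static int pick_syn_rto(const uint32_t local_ip6[4],
			const uint32_t remote_ip6[4])
{
	return same_datacenter(local_ip6, remote_ip6) ? RTO_INTRA_DC : -1;
}
```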

This patch only contains the framework to support the new BPF program
type. The following patches add the functionality to set various
connection parameters.

This patch defines a new BPF program type, BPF_PROG_TYPE_SOCK_OPS,
and a new cgroup attach type, BPF_CGROUP_SOCK_OPS, which is used with
the existing bpf attach and detach commands.

It also adds two corresponding structs (one for the kernel, one for the
user/BPF program):

/* kernel version */
struct bpf_sock_ops_kern {
        struct sock *sk;
        bool   is_req_sock:1;
        __u32  op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
};

/* user version */
struct bpf_sock_ops {
        __u32 op;
        union {
                __u32 reply;
                __u32 replylong[4];
        };
        __u32 family;
        __u32 remote_ip4;
        __u32 local_ip4;
        __u32 remote_ip6[4];
        __u32 local_ip6[4];
        __u32 remote_port;
        __u32 local_port;
};

Currently there are two types of ops. The first type expects the BPF
program to return a value which is then used by the caller (or a
negative value to indicate the operation is not supported). The second
type expects state changes to be done by the BPF program, for example
through a setsockopt BPF helper function, and they ignore the return
value.

The reply fields of the bpf_sock_ops struct are there in case a bpf
program needs to return a value larger than an integer.
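
The way the caller turns the program's verdict and reply field into a
single result can be sketched in userspace as follows. This mirrors the
logic of tcp_call_bpf() and __cgroup_bpf_run_filter_sock_ops() in this
patch, but every mock_* name and both sample programs are invented for
the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct mock_ops_kern {
	uint32_t op;
	uint32_t reply;
};

typedef int (*mock_prog_fn)(struct mock_ops_kern *ops);

/* Sample program: fills in a reply and reports success. */
static int prog_reply_40(struct mock_ops_kern *ops)
{
	ops->reply = 40;
	return 1;
}

/* Sample program: returns something other than 1, so its reply is
 * discarded by the caller.
 */
static int prog_reject(struct mock_ops_kern *ops)
{
	ops->reply = 99;
	return 0;
}

/* 0 when no program is attached or the program returned 1,
 * -EPERM otherwise.
 */
static int mock_run_filter(mock_prog_fn prog, struct mock_ops_kern *ops)
{
	if (!prog)
		return 0;
	return prog(ops) == 1 ? 0 : -EPERM;
}

/* On success the reply field is the result; on failure the result is -1. */
static int mock_call_bpf(mock_prog_fn prog, uint32_t op)
{
	struct mock_ops_kern ops;

	memset(&ops, 0, sizeof(ops));
	ops.op = op;

	if (mock_run_filter(prog, &ops) == 0)
		return (int)ops.reply;
	return -1;
}
```

Note that with no program attached the zeroed reply is passed through,
so the result is 0 rather than a negative value.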

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/bpf-cgroup.h |  18 +++++
 include/linux/bpf_types.h  |   1 +
 include/linux/filter.h     |  10 +++
 include/net/tcp.h          |  37 ++++++++++
 include/uapi/linux/bpf.h   |  28 ++++++++
 kernel/bpf/cgroup.c        |  37 ++++++++++
 kernel/bpf/syscall.c       |   5 ++
 net/core/filter.c          | 170 +++++++++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_load.c     |  13 +++-
 9 files changed, 316 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index c970a25..26449c7 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -7,6 +7,7 @@
 struct sock;
 struct cgroup;
 struct sk_buff;
+struct bpf_sock_ops_kern;
 
 #ifdef CONFIG_CGROUP_BPF
 
@@ -42,6 +43,10 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 int __cgroup_bpf_run_filter_sk(struct sock *sk,
 			       enum bpf_attach_type type);
 
+int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
+				     struct bpf_sock_ops_kern *sock_ops,
+				     enum bpf_attach_type type);
+
 /* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)			      \
 ({									      \
@@ -75,6 +80,18 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
 	__ret;								       \
 })
 
+#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
+({									       \
+	int __ret = 0;							       \
+	if (cgroup_bpf_enabled && (sock_ops) && (sock_ops)->sk) {	       \
+		typeof(sk) __sk = sk_to_full_sk((sock_ops)->sk);	       \
+		if (sk_fullsock(__sk))					       \
+			__ret = __cgroup_bpf_run_filter_sock_ops(__sk,	       \
+								 sock_ops,     \
+							 BPF_CGROUP_SOCK_OPS); \
+	}								       \
+	__ret;								       \
+})
 #else
 
 struct cgroup_bpf {};
@@ -85,6 +102,7 @@ static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
 #define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_INET_SOCK(sk) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 
 #endif /* CONFIG_CGROUP_BPF */
 
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 03bf223..3d137c3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SOCK, cg_sock_prog_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_IN, lwt_inout_prog_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_OUT, lwt_inout_prog_ops)
 BPF_PROG_TYPE(BPF_PROG_TYPE_LWT_XMIT, lwt_xmit_prog_ops)
+BPF_PROG_TYPE(BPF_PROG_TYPE_SOCK_OPS, sock_ops_prog_ops)
 #endif
 #ifdef CONFIG_BPF_EVENTS
 BPF_PROG_TYPE(BPF_PROG_TYPE_KPROBE, kprobe_prog_ops)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1fa26dc..bbd6429 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -898,4 +898,14 @@ static inline int bpf_tell_extensions(void)
 	return SKF_AD_MAX;
 }
 
+struct bpf_sock_ops_kern {
+	struct	sock *sk;
+	bool	is_req_sock:1;
+	u32	op;
+	union {
+		u32 reply;
+		u32 replylong[4];
+	};
+};
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d0751b7..804c27a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -46,6 +46,10 @@
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
 
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/bpf-cgroup.h>
+
 extern struct inet_hashinfo tcp_hashinfo;
 
 extern struct percpu_counter tcp_orphan_count;
@@ -2021,4 +2025,37 @@ int tcp_set_ulp(struct sock *sk, const char *name);
 void tcp_get_available_ulp(char *buf, size_t len);
 void tcp_cleanup_ulp(struct sock *sk);
 
+/* Call BPF_SOCK_OPS program that returns an int. If the return value
+ * is < 0, then the BPF op failed (for example if the loaded BPF
+ * program does not support the chosen operation or there is no BPF
+ * program loaded).
+ */
+#ifdef CONFIG_BPF
+static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
+{
+	struct bpf_sock_ops_kern sock_ops;
+	int ret;
+
+	if (!is_req_sock)
+		sock_owned_by_me(sk);
+
+	memset(&sock_ops, 0, sizeof(sock_ops));
+	sock_ops.sk = sk;
+	sock_ops.is_req_sock = is_req_sock;
+	sock_ops.op = op;
+
+	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
+	if (ret == 0)
+		ret = sock_ops.reply;
+	else
+		ret = -1;
+	return ret;
+}
+#else
+static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
+{
+	return -EPERM;
+}
+#endif
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f94b48b..617fb66 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -120,12 +120,14 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_IN,
 	BPF_PROG_TYPE_LWT_OUT,
 	BPF_PROG_TYPE_LWT_XMIT,
+	BPF_PROG_TYPE_SOCK_OPS,
 };
 
 enum bpf_attach_type {
 	BPF_CGROUP_INET_INGRESS,
 	BPF_CGROUP_INET_EGRESS,
 	BPF_CGROUP_INET_SOCK_CREATE,
+	BPF_CGROUP_SOCK_OPS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -720,4 +722,30 @@ struct bpf_map_info {
 	__u32 map_flags;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_ops struct to access socket values and specify request ops
+ * and their replies.
+ * New fields can only be added at the end of this structure
+ */
+struct bpf_sock_ops {
+	__u32 op;
+	union {
+		__u32 reply;
+		__u32 replylong[4];
+	};
+	__u32 family;
+	__u32 remote_ip4;
+	__u32 local_ip4;
+	__u32 remote_ip6[4];
+	__u32 local_ip6[4];
+	__u32 remote_port;
+	__u32 local_port;
+};
+
+/* List of known BPF sock_ops operators.
+ * New entries can only be added at the end
+ */
+enum {
+	BPF_SOCK_OPS_VOID,
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index ea6033c..5461134 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -236,3 +236,40 @@ int __cgroup_bpf_run_filter_sk(struct sock *sk,
 	return ret;
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_sk);
+
+/**
+ * __cgroup_bpf_run_filter_sock_ops() - Run a program on a sock
+ * @sk: socket to get cgroup from
+ * @sock_ops: bpf_sock_ops_kern struct to pass to program. Contains
+ * sk with connection information (IP addresses, etc.) May not contain
+ * cgroup info if it is a req sock.
+ * @type: The type of program to be executed
+ *
+ * socket passed is expected to be of type INET or INET6.
+ *
+ * The program type passed in via @type must be suitable for sock_ops
+ * filtering. No further check is performed to assert that.
+ *
+ * This function will return %-EPERM if an attached program was found
+ * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ */
+int __cgroup_bpf_run_filter_sock_ops(struct sock *sk,
+				     struct bpf_sock_ops_kern *sock_ops,
+				     enum bpf_attach_type type)
+{
+	struct cgroup *cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
+	struct bpf_prog *prog;
+	int ret = 0;
+
+
+	rcu_read_lock();
+
+	prog = rcu_dereference(cgrp->bpf.effective[type]);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, sock_ops) == 1 ? 0 : -EPERM;
+
+	rcu_read_unlock();
+
+	return ret;
+}
+EXPORT_SYMBOL(__cgroup_bpf_run_filter_sock_ops);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8942c82..19905e3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1069,6 +1069,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_SOCK_CREATE:
 		ptype = BPF_PROG_TYPE_CGROUP_SOCK;
 		break;
+	case BPF_CGROUP_SOCK_OPS:
+		ptype = BPF_PROG_TYPE_SOCK_OPS;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1109,6 +1112,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_INET_INGRESS:
 	case BPF_CGROUP_INET_EGRESS:
 	case BPF_CGROUP_INET_SOCK_CREATE:
+	case BPF_CGROUP_SOCK_OPS:
 		cgrp = cgroup_get_from_fd(attr->target_fd);
 		if (IS_ERR(cgrp))
 			return PTR_ERR(cgrp);
@@ -1123,6 +1127,7 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 
 	return ret;
 }
+
 #endif /* CONFIG_CGROUP_BPF */
 
 #define BPF_PROG_TEST_RUN_LAST_FIELD test.duration
diff --git a/net/core/filter.c b/net/core/filter.c
index b39c869..bb54832 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3110,6 +3110,36 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+static bool __is_valid_sock_ops_access(int off, int size)
+{
+	if (off < 0 || off >= sizeof(struct bpf_sock_ops))
+		return false;
+	/* The verifier guarantees that size > 0. */
+	if (off % size != 0)
+		return false;
+	if (size != sizeof(__u32))
+		return false;
+
+	return true;
+}
+
+static bool sock_ops_is_valid_access(int off, int size,
+				     enum bpf_access_type type,
+				     struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE) {
+		switch (off) {
+		case offsetof(struct bpf_sock_ops, op) ...
+		     offsetof(struct bpf_sock_ops, replylong[3]):
+			break;
+		default:
+			return false;
+		}
+	}
+
+	return __is_valid_sock_ops_access(off, size);
+}
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,
@@ -3379,6 +3409,140 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
+				       const struct bpf_insn *si,
+				       struct bpf_insn *insn_buf,
+				       struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_sock_ops, op) ...
+	     offsetof(struct bpf_sock_ops, replylong[3]):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct bpf_sock_ops, op) !=
+			     FIELD_SIZEOF(struct bpf_sock_ops_kern, op));
+		BUILD_BUG_ON(FIELD_SIZEOF(struct bpf_sock_ops, reply) !=
+			     FIELD_SIZEOF(struct bpf_sock_ops_kern, reply));
+		BUILD_BUG_ON(FIELD_SIZEOF(struct bpf_sock_ops, replylong) !=
+			     FIELD_SIZEOF(struct bpf_sock_ops_kern, replylong));
+		off = si->off;
+		off -= offsetof(struct bpf_sock_ops, op);
+		off += offsetof(struct bpf_sock_ops_kern, op);
+		if (type == BPF_WRITE)
+			*insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg,
+					      off);
+		else
+			*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+					      off);
+		break;
+
+	case offsetof(struct bpf_sock_ops, family):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_family) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+					      struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_family));
+		break;
+
+	case offsetof(struct bpf_sock_ops, remote_ip4):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_daddr));
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+		break;
+
+	case offsetof(struct bpf_sock_ops, local_ip4):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+					      struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common,
+					       skc_rcv_saddr));
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+		break;
+
+	case offsetof(struct bpf_sock_ops, remote_ip6[0]) ...
+	     offsetof(struct bpf_sock_ops, remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+					  skc_v6_daddr.s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sock_ops, remote_ip6[0]);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common,
+					       skc_v6_daddr.s6_addr32[0]) +
+				      off);
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sock_ops, local_ip6[0]) ...
+	     offsetof(struct bpf_sock_ops, local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
+					  skc_v6_rcv_saddr.s6_addr32[0]) != 4);
+
+		off = si->off;
+		off -= offsetof(struct bpf_sock_ops, local_ip6[0]);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common,
+					       skc_v6_rcv_saddr.s6_addr32[0]) +
+				      off);
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_sock_ops, remote_port):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_dport));
+		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
+		break;
+
+	case offsetof(struct bpf_sock_ops, local_port):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+						struct bpf_sock_ops_kern, sk),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct bpf_sock_ops_kern, sk));
+		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
+				      offsetof(struct sock_common, skc_num));
+		break;
+	}
+	return insn - insn_buf;
+}
+
 const struct bpf_verifier_ops sk_filter_prog_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -3428,6 +3592,12 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
 	.convert_ctx_access	= sock_filter_convert_ctx_access,
 };
 
+const struct bpf_verifier_ops sock_ops_prog_ops = {
+	.get_func_proto		= bpf_base_func_proto,
+	.is_valid_access	= sock_ops_is_valid_access,
+	.convert_ctx_access	= sock_ops_convert_ctx_access,
+};
+
 int sk_detach_filter(struct sock *sk)
 {
 	int ret = -ENOENT;
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index a91c57d..a4be7cf 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -64,6 +64,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
 	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
 	bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
+	bool is_sockops = strncmp(event, "sockops", 7) == 0;
 	size_t insns_cnt = size / sizeof(struct bpf_insn);
 	enum bpf_prog_type prog_type;
 	char buf[256];
@@ -89,6 +90,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_CGROUP_SKB;
 	} else if (is_cgroup_sk) {
 		prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
+	} else if (is_sockops) {
+		prog_type = BPF_PROG_TYPE_SOCK_OPS;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -106,8 +109,11 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
 		return 0;
 
-	if (is_socket) {
-		event += 6;
+	if (is_socket || is_sockops) {
+		if (is_socket)
+			event += 6;
+		else
+			event += 7;
 		if (*event != '/')
 			return 0;
 		event++;
@@ -560,7 +566,8 @@ static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
 		    memcmp(shname, "xdp", 3) == 0 ||
 		    memcmp(shname, "perf_event", 10) == 0 ||
 		    memcmp(shname, "socket", 6) == 0 ||
-		    memcmp(shname, "cgroup/", 7) == 0)
+		    memcmp(shname, "cgroup/", 7) == 0 ||
+		    memcmp(shname, "sockops", 7) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
 
-- 
2.9.3


* [PATCH net-next v4 02/16] bpf: program to load and attach sock_ops BPF progs
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 03/16] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

The program load_sock_ops can be used to load a sock_ops bpf program and
to attach it to an existing (v2) cgroup. It can also be used to detach
sock_ops programs.

Examples:
    load_sock_ops [-l] <cg-path> <prog filename>
	Loads and attaches a sock_ops program at the specified cgroup.
	If "-l" is used, the program will continue to run to output the
	BPF log buffer.
	If the specified filename does not end in ".o", it appends
	"_kern.o" to the name.

    load_sock_ops -r <cg-path>
	Detaches the currently attached sock_ops program from the
	specified cgroup.
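
The filename rule above can be sketched as a small standalone helper.
This is not the code in load_sock_ops.c, which uses strcpy()/sprintf()
after a length check; resolve_prog_name() is a hypothetical name and the
bounded snprintf() variant is just one possible way to write it.

```c
#include <stdio.h>
#include <string.h>

/* Writes the object file name for "prog" into "out". Names already
 * ending in ".o" are used as-is, anything else gets "_kern.o" appended.
 * Returns 0 on success, -1 if the result does not fit in "outlen" bytes.
 */
static int resolve_prog_name(const char *prog, char *out, size_t outlen)
{
	size_t len = strlen(prog);
	int n;

	if (len >= 2 && strcmp(prog + len - 2, ".o") == 0)
		n = snprintf(out, outlen, "%s", prog);
	else
		n = snprintf(out, outlen, "%s_kern.o", prog);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With this rule, "tcp_synrto" and "tcp_synrto_kern.o" both resolve to the
same object file.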

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  3 ++
 samples/bpf/load_sock_ops.c | 97 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)
 create mode 100644 samples/bpf/load_sock_ops.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index e7ec9b8..015589b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -36,6 +36,7 @@ hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
 hostprogs-y += test_map_in_map
 hostprogs-y += per_socket_stats_example
+hostprogs-y += load_sock_ops
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -52,6 +53,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
 lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
@@ -130,6 +132,7 @@ HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
+HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
 HOSTLOADLIBES_trace_output += -lelf -lrt
 HOSTLOADLIBES_lathist += -lelf
diff --git a/samples/bpf/load_sock_ops.c b/samples/bpf/load_sock_ops.c
new file mode 100644
index 0000000..91aa00d
--- /dev/null
+++ b/samples/bpf/load_sock_ops.c
@@ -0,0 +1,97 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/unistd.h>
+
+static void usage(char *pname)
+{
+	printf("USAGE:\n  %s [-l] <cg-path> <prog filename>\n", pname);
+	printf("\tLoad and attach a sock_ops program to the specified "
+	       "cgroup\n");
+	printf("\tIf \"-l\" is used, the program will continue to run\n");
+	printf("\tprinting the BPF log buffer\n");
+	printf("\tIf the specified filename does not end in \".o\", it\n");
+	printf("\tappends \"_kern.o\" to the name\n");
+	printf("\n");
+	printf("  %s -r <cg-path>\n", pname);
+	printf("\tDetaches the currently attached sock_ops program\n");
+	printf("\tfrom the specified cgroup\n");
+	printf("\n");
+	exit(0);
+}
+
+int main(int argc, char **argv)
+{
+	int logFlag = 0;
+	int error = 0;
+	char *cg_path;
+	char fn[500];
+	char *prog;
+	int cg_fd;
+
+	if (argc < 3)
+		usage(argv[0]);
+
+	if (!strcmp(argv[1], "-r")) {
+		cg_path = argv[2];
+		cg_fd = open(cg_path, O_DIRECTORY | O_RDONLY);
+		error = bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS);
+		if (error) {
+			printf("ERROR: bpf_prog_detach: %d (%s)\n",
+			       error, strerror(errno));
+			return 1;
+		}
+		return 0;
+	} else if (!strcmp(argv[1], "-h")) {
+		usage(argv[0]);
+	} else if (!strcmp(argv[1], "-l")) {
+		logFlag = 1;
+		if (argc < 4)
+			usage(argv[0]);
+	}
+
+	prog = argv[argc - 1];
+	cg_path = argv[argc - 2];
+	if (strlen(prog) > 480) {
+		fprintf(stderr, "ERROR: program name too long (> 480 chars)\n");
+		exit(2);
+	}
+	cg_fd = open(cg_path, O_DIRECTORY | O_RDONLY);
+
+	if (!strcmp(prog + strlen(prog)-2, ".o"))
+		strcpy(fn, prog);
+	else
+		sprintf(fn, "%s_kern.o", prog);
+	if (logFlag)
+		printf("loading bpf file:%s\n", fn);
+	if (load_bpf_file(fn)) {
+		printf("ERROR: load_bpf_file failed for: %s\n", fn);
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+	if (logFlag)
+		printf("TCP BPF Loaded %s\n", fn);
+
+	error = bpf_prog_attach(prog_fd[0], cg_fd, BPF_CGROUP_SOCK_OPS, 0);
+	if (error) {
+		printf("ERROR: bpf_prog_attach: %d (%s)\n",
+		       error, strerror(errno));
+		return 4;
+	} else if (logFlag) {
+		read_trace_pipe();
+	}
+
+	return error;
+}
-- 
2.9.3


* [PATCH net-next v4 03/16] bpf: Support for per connection SYN/SYN-ACK RTOs
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 02/16] bpf: program to load and attach sock_ops BPF progs Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 04/16] bpf: Sample bpf program to set " Lawrence Brakmo
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

This patch adds support for setting per-connection SYN and
SYN-ACK RTOs from within a BPF_SOCK_OPS program, for example
to set small RTOs when both hosts are known to be within the
same datacenter.
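
As a rough illustration of the fallback this patch introduces, here is a
userspace sketch (not kernel code). The names pick_syn_rto and
DEFAULT_TIMEOUT_INIT are hypothetical stand-ins; the value 1000 is only a
placeholder for whatever TCP_TIMEOUT_INIT evaluates to on a given kernel.

```c
#include <stdio.h>

/* Hypothetical placeholder for the kernel's TCP_TIMEOUT_INIT. */
#define DEFAULT_TIMEOUT_INIT 1000

/* Models the decision in tcp_timeout_init(): a reply of zero or less
 * from the BPF program means "use the default".
 */
static unsigned int pick_syn_rto(int bpf_reply)
{
	if (bpf_reply <= 0)
		return DEFAULT_TIMEOUT_INIT;
	return (unsigned int)bpf_reply;
}
```

With this model, pick_syn_rto(-1) and pick_syn_rto(0) both fall back to the
default, while pick_syn_rto(10) keeps the value chosen by the program.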

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/tcp.h        | 11 +++++++++++
 include/uapi/linux/bpf.h |  3 +++
 net/ipv4/tcp_input.c     |  3 ++-
 net/ipv4/tcp_output.c    |  2 +-
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 804c27a..cd9ef63 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2058,4 +2058,15 @@ static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
 }
 #endif
 
+static inline u32 tcp_timeout_init(struct sock *sk, bool is_req_sock)
+{
+	int timeout;
+
+	timeout = tcp_call_bpf(sk, is_req_sock, BPF_SOCK_OPS_TIMEOUT_INIT);
+
+	if (timeout <= 0)
+		timeout = TCP_TIMEOUT_INIT;
+	return timeout;
+}
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 617fb66..4174668 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -746,6 +746,9 @@ struct bpf_sock_ops {
  */
 enum {
 	BPF_SOCK_OPS_VOID,
+	BPF_SOCK_OPS_TIMEOUT_INIT,	/* Should return SYN-RTO value to use or
+					 * -1 if default value should be used
+					 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2ab7e2f..0867b05 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6406,7 +6406,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	} else {
 		tcp_rsk(req)->tfo_listener = false;
 		if (!want_cookie)
-			inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+			inet_csk_reqsk_queue_hash_add(sk, req,
+				tcp_timeout_init((struct sock *)req, true));
 		af_ops->send_synack(sk, dst, &fl, req, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
 						   TCP_SYNACK_COOKIE);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9a9c395..5e478a1 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3327,7 +3327,7 @@ static void tcp_connect_init(struct sock *sk)
 	tp->rcv_wup = tp->rcv_nxt;
 	tp->copied_seq = tp->rcv_nxt;
 
-	inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+	inet_csk(sk)->icsk_rto = tcp_timeout_init(sk, false);
 	inet_csk(sk)->icsk_retransmits = 0;
 	tcp_clear_retrans(tp);
 }
-- 
2.9.3

* [PATCH net-next v4 04/16] bpf: Sample bpf program to set SYN/SYN-ACK RTOs
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (2 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 03/16] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-29 19:39   ` Jesper Dangaard Brouer
  2017-06-28 17:31 ` [PATCH net-next v4 05/16] bpf: Support for setting initial receive window Lawrence Brakmo
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

The sample BPF program, tcp_synrto_kern.c, sets the SYN and SYN-ACK
RTOs to 10ms when both hosts are within the same datacenter (i.e.
small RTTs) in an environment where common IPv6 prefixes indicate
both hosts are in the same data center.
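
The prefix test can be sketched in plain C as below. This is a standalone
illustration, not the sample itself: same_prefix44 is a hypothetical helper,
and it assumes each address is held as four 32-bit words that have already
been converted to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare the first 44 bits (5.5 bytes) of two IPv6 addresses.
 * Assumption: words are in host byte order, most significant first.
 */
static bool same_prefix44(const uint32_t a[4], const uint32_t b[4])
{
	const uint32_t mask = 0xfff00000u; /* top 12 bits of second word */

	return a[0] == b[0] && (a[1] & mask) == (b[1] & mask);
}
```

Two addresses that agree on the first word and on the top 12 bits of the
second word are treated as being in the same datacenter; anything else is
treated as remote.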

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile          |  1 +
 samples/bpf/tcp_synrto_kern.c | 60 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 samples/bpf/tcp_synrto_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 015589b..e29370a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -113,6 +113,7 @@ always += lwt_len_hist_kern.o
 always += xdp_tx_iptunnel_kern.o
 always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
+always += tcp_synrto_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_synrto_kern.c b/samples/bpf/tcp_synrto_kern.c
new file mode 100644
index 0000000..b16ac39
--- /dev/null
+++ b/samples/bpf/tcp_synrto_kern.c
@@ -0,0 +1,60 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set SYN and SYN-ACK RTOs to 10ms when using IPv6 addresses
+ * and the first 5.5 bytes of the IPv6 addresses are the same (in this example
+ * that means both hosts are in the same datacenter).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_synrto(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int rv = -1;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check for TIMEOUT_INIT operation and IPv6 addresses */
+	if (op == BPF_SOCK_OPS_TIMEOUT_INIT &&
+		skops->family == AF_INET6) {
+
+		/* If the first 5.5 bytes of the IPv6 address are the same
+		 * then both hosts are in the same datacenter
+		 * so use an RTO of 10ms
+		 */
+		if (skops->local_ip6[0] == skops->remote_ip6[0] &&
+		    (skops->local_ip6[1] & 0xfff00000) ==
+		    (skops->remote_ip6[1] & 0xfff00000))
+			rv = 10;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

* [PATCH net-next v4 05/16] bpf: Support for setting initial receive window
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (3 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 04/16] bpf: Sample bpf program to set " Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 06/16] bpf: Sample bpf program to set initial window Lawrence Brakmo
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

This patch adds support for setting the initial advertized window from
within a BPF_SOCK_OPS program. This can be used to support larger
initial cwnd values in environments where it is known to be safe.
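
A userspace sketch of the selection logic on the passive side
(tcp_openreq_init_rwin) may help. The struct and function names here are
hypothetical; only the branch structure follows the patch. The active side
(tcp_connect_init) does the same fallback but does not grow full_space.

```c
#include <stdint.h>

struct rwnd_choice {
	uint32_t rcv_wnd;    /* initial window, in packets */
	uint32_t full_space; /* receive buffer space, in bytes */
};

/* bpf_reply < 0 is treated as "no opinion" (0). A window of 0 falls back
 * to the route metric; otherwise full_space is raised if it cannot hold
 * rcv_wnd packets of size mss.
 */
static struct rwnd_choice choose_initial_rwnd(int bpf_reply,
					      uint32_t route_initrwnd,
					      uint32_t mss,
					      uint32_t full_space)
{
	struct rwnd_choice c;
	uint32_t rwnd = bpf_reply < 0 ? 0 : (uint32_t)bpf_reply;

	c.full_space = full_space;
	if (rwnd == 0) {
		c.rcv_wnd = route_initrwnd;
	} else {
		c.rcv_wnd = rwnd;
		if (c.full_space < rwnd * mss)
			c.full_space = rwnd * mss;
	}
	return c;
}
```

For example, a reply of 40 with an MSS of 1460 and 29200 bytes of space
raises full_space to 58400, while a reply of -1 leaves it alone and uses the
route metric.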

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/tcp.h        | 10 ++++++++++
 include/uapi/linux/bpf.h |  4 ++++
 net/ipv4/tcp_minisocks.c |  9 ++++++++-
 net/ipv4/tcp_output.c    |  7 ++++++-
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cd9ef63..af404aa 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2069,4 +2069,14 @@ static inline u32 tcp_timeout_init(struct sock *sk, bool is_req_sock)
 	return timeout;
 }
 
+static inline u32 tcp_rwnd_init_bpf(struct sock *sk, bool is_req_sock)
+{
+	int rwnd;
+
+	rwnd = tcp_call_bpf(sk, is_req_sock, BPF_SOCK_OPS_RWND_INIT);
+
+	if (rwnd < 0)
+		rwnd = 0;
+	return rwnd;
+}
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4174668..cdec348 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -749,6 +749,10 @@ enum {
 	BPF_SOCK_OPS_TIMEOUT_INIT,	/* Should return SYN-RTO value to use or
 					 * -1 if default value should be used
 					 */
+	BPF_SOCK_OPS_RWND_INIT,		/* Should return initial advertized
+					 * window (in packets) or -1 if default
+					 * value should be used
+					 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index d30ee31..bbaf3c6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -351,6 +351,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 	int full_space = tcp_full_space(sk_listener);
 	u32 window_clamp;
 	__u8 rcv_wscale;
+	u32 rcv_wnd;
 	int mss;
 
 	mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
@@ -363,6 +364,12 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 	    (req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
 		req->rsk_window_clamp = full_space;
 
+	rcv_wnd = tcp_rwnd_init_bpf((struct sock *)req, true);
+	if (rcv_wnd == 0)
+		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
+	else if (full_space < rcv_wnd * mss)
+		full_space = rcv_wnd * mss;
+
 	/* tcp_full_space because it is guaranteed to be the first packet */
 	tcp_select_initial_window(full_space,
 		mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
@@ -370,7 +377,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 		&req->rsk_window_clamp,
 		ireq->wscale_ok,
 		&rcv_wscale,
-		dst_metric(dst, RTAX_INITRWND));
+		rcv_wnd);
 	ireq->rcv_wscale = rcv_wscale;
 }
 EXPORT_SYMBOL(tcp_openreq_init_rwin);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5e478a1..e5f623f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3267,6 +3267,7 @@ static void tcp_connect_init(struct sock *sk)
 	const struct dst_entry *dst = __sk_dst_get(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u8 rcv_wscale;
+	u32 rcv_wnd;
 
 	/* We'll fix this up when we get a response from the other end.
 	 * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
@@ -3300,13 +3301,17 @@ static void tcp_connect_init(struct sock *sk)
 	    (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
 		tp->window_clamp = tcp_full_space(sk);
 
+	rcv_wnd = tcp_rwnd_init_bpf(sk, false);
+	if (rcv_wnd == 0)
+		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
+
 	tcp_select_initial_window(tcp_full_space(sk),
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
 				  sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
 				  &rcv_wscale,
-				  dst_metric(dst, RTAX_INITRWND));
+				  rcv_wnd);
 
 	tp->rx_opt.rcv_wscale = rcv_wscale;
 	tp->rcv_ssthresh = tp->rcv_wnd;
-- 
2.9.3

* [PATCH net-next v4 06/16] bpf: Sample bpf program to set initial window
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (4 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 05/16] bpf: Support for setting initial receive window Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 07/16] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

The sample bpf program, tcp_rwnd_kern.c, sets the initial
advertized window to 40 packets in an environment where
distinct IPv6 prefixes indicate that both hosts are not
in the same data center.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  1 +
 samples/bpf/tcp_rwnd_kern.c | 61 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)
 create mode 100644 samples/bpf/tcp_rwnd_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index e29370a..ca95528 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -114,6 +114,7 @@ always += xdp_tx_iptunnel_kern.o
 always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
 always += tcp_synrto_kern.o
+always += tcp_rwnd_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_rwnd_kern.c b/samples/bpf/tcp_rwnd_kern.c
new file mode 100644
index 0000000..5daa649
--- /dev/null
+++ b/samples/bpf/tcp_rwnd_kern.c
@@ -0,0 +1,61 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set initial receive window to 40 packets when using IPv6
+ * and the first 5.5 bytes of the IPv6 addresses are not the same (in this
+ * example that means both hosts are not in the same datacenter).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_rwnd(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int rv = -1;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check for RWND_INIT operation and IPv6 addresses */
+	if (op == BPF_SOCK_OPS_RWND_INIT &&
+		skops->family == AF_INET6) {
+
+		/* If the first 5.5 bytes of the IPv6 address are not the same
+		 * then both hosts are not in the same datacenter
+		 * so use a larger initial advertized window (40 packets)
+		 */
+		if (skops->local_ip6[0] != skops->remote_ip6[0] ||
+		    (skops->local_ip6[1] & 0xfffff000) !=
+		    (skops->remote_ip6[1] & 0xfffff000)) {
+			rv = 40;
+		}
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

* [PATCH net-next v4 07/16] bpf: Add setsockopt helper function to bpf
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (5 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 06/16] bpf: Sample bpf program to set initial window Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-29 10:08   ` Daniel Borkmann
  2017-06-28 17:31 ` [PATCH net-next v4 08/16] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Added support for calling a subset of socket setsockopts from
BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
than making the changes to call the socket setsockopt function because
the changes required would have been larger.

The ops supported are:
  SO_RCVBUF
  SO_SNDBUF
  SO_MAX_PACING_RATE
  SO_PRIORITY
  SO_RCVLOWAT
  SO_MARK
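
For the buffer and low-water-mark options, the value handling can be
modelled in userspace roughly as follows. The function names and the
min_size parameter are hypothetical; the clamps on val are guards added for
this sketch only, to keep the doubling from overflowing, and are not taken
from the patch.

```c
#include <limits.h>

/* Models SO_RCVBUF / SO_SNDBUF: the requested size is doubled and then
 * raised to a minimum. min_size stands in for SOCK_MIN_RCVBUF or
 * SOCK_MIN_SNDBUF.
 */
static int model_bufsize(int val, int min_size)
{
	int doubled;

	if (val < 0)
		val = 0;            /* sketch-only guard */
	if (val > INT_MAX / 2)
		val = INT_MAX / 2;  /* sketch-only guard */
	doubled = val * 2;
	return doubled > min_size ? doubled : min_size;
}

/* Models SO_RCVLOWAT: negative means "as large as possible", and zero
 * is bumped to one.
 */
static int model_rcvlowat(int val)
{
	if (val < 0)
		val = INT_MAX;
	return val ? val : 1;
}
```

So a request for 1500000 bytes ends up as 3000000, and a very small request
is raised to the minimum.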

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h  | 14 ++++++++-
 net/core/filter.c         | 77 ++++++++++++++++++++++++++++++++++++++++++++++-
 samples/bpf/bpf_helpers.h |  3 ++
 3 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cdec348..2dbae9e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -520,6 +520,17 @@ union bpf_attr {
  *     Set full skb->hash.
  *     @skb: pointer to skb
  *     @hash: hash to set
+ *
+ * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
+ *     Calls setsockopt. Not all opts are available, only those with
+ *     integer optvals plus TCP_CONGESTION.
+ *     Supported levels: SOL_SOCKET and IPPROTO_TCP
+ *     @bpf_socket: pointer to bpf_socket
+ *     @level: SOL_SOCKET or IPPROTO_TCP
+ *     @optname: option name
+ *     @optval: pointer to option value
+ *     @optlen: length of optval in bytes
+ *     Return: 0 or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -570,7 +581,8 @@ union bpf_attr {
 	FN(probe_read_str),		\
 	FN(get_socket_cookie),		\
 	FN(get_socket_uid),		\
-	FN(set_hash),
+	FN(set_hash),			\
+	FN(setsockopt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index bb54832..167eca0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -54,6 +54,7 @@
 #include <net/dst.h>
 #include <net/sock_reuseport.h>
 #include <net/busy_poll.h>
+#include <net/tcp.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2672,6 +2673,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
 	.arg1_type      = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
+	   int, level, int, optname, char *, optval, int, optlen)
+{
+	struct sock *sk = bpf_sock->sk;
+	int ret = 0;
+	int val;
+
+	if (bpf_sock->is_req_sock)
+		return -EINVAL;
+
+	if (level == SOL_SOCKET) {
+		/* Only some socketops are supported */
+		val = *((int *)optval);
+
+		switch (optname) {
+		case SO_RCVBUF:
+			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+			sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+			break;
+		case SO_SNDBUF:
+			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
+			sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
+			break;
+		case SO_MAX_PACING_RATE:
+			sk->sk_max_pacing_rate = val;
+			sk->sk_pacing_rate = min(sk->sk_pacing_rate,
+						 sk->sk_max_pacing_rate);
+			break;
+		case SO_PRIORITY:
+			sk->sk_priority = val;
+			break;
+		case SO_RCVLOWAT:
+			if (val < 0)
+				val = INT_MAX;
+			sk->sk_rcvlowat = val ? : 1;
+			break;
+		case SO_MARK:
+			sk->sk_mark = val;
+			break;
+		default:
+			ret = -EINVAL;
+		}
+	} else if (level == SOL_TCP &&
+		   sk->sk_prot->setsockopt == tcp_setsockopt) {
+		/* Place holder */
+		ret = -EINVAL;
+	} else {
+		ret = -EINVAL;
+	}
+	return ret;
+}
+
+static const struct bpf_func_proto bpf_setsockopt_proto = {
+	.func		= bpf_setsockopt,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2823,6 +2887,17 @@ lwt_inout_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
+sock_ops_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_setsockopt:
+		return &bpf_setsockopt_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
 lwt_xmit_func_proto(enum bpf_func_id func_id)
 {
 	switch (func_id) {
@@ -3593,7 +3668,7 @@ const struct bpf_verifier_ops cg_sock_prog_ops = {
 };
 
 const struct bpf_verifier_ops sock_ops_prog_ops = {
-	.get_func_proto		= bpf_base_func_proto,
+	.get_func_proto		= sock_ops_func_proto,
 	.is_valid_access	= sock_ops_is_valid_access,
 	.convert_ctx_access	= sock_ops_convert_ctx_access,
 };
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f4840b8..d50ac34 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -60,6 +60,9 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
 	(void *) BPF_FUNC_get_prandom_u32;
 static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
 	(void *) BPF_FUNC_xdp_adjust_head;
+static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval,
+			     int optlen) =
+	(void *) BPF_FUNC_setsockopt;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.3

* [PATCH net-next v4 08/16] bpf: Add TCP connection BPF callbacks
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (6 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 07/16] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 09/16] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Added callbacks to BPF SOCK_OPS type program before an active
connection is initialized and after a passive or active connection is
established.

The following patch demonstrates how they can be used to set send and
receive buffer sizes.
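
To make the distinction between the two kinds of ops concrete, here is a
small standalone sketch. The MODEL_OP_* names are hypothetical copies of the
uapi enum in the order it has after this patch; op_uses_reply is not a
kernel function.

```c
#include <stdbool.h>

/* Illustrative copy of the op order after this patch. */
enum model_sock_op {
	MODEL_OP_VOID,
	MODEL_OP_TIMEOUT_INIT,
	MODEL_OP_RWND_INIT,
	MODEL_OP_TCP_CONNECT_CB,
	MODEL_OP_ACTIVE_ESTABLISHED_CB,
	MODEL_OP_PASSIVE_ESTABLISHED_CB,
};

/* True for ops whose effect comes from the value the program replies
 * with; false for callbacks, which act through helper calls instead.
 */
static bool op_uses_reply(enum model_sock_op op)
{
	switch (op) {
	case MODEL_OP_TIMEOUT_INIT:
	case MODEL_OP_RWND_INIT:
		return true;
	default:
		return false;
	}
}
```

A program would typically switch on the op in the same way and only call
bpf_setsockopt from the callback cases.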

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 11 +++++++++++
 net/ipv4/tcp_fastopen.c  |  1 +
 net/ipv4/tcp_input.c     |  4 +++-
 net/ipv4/tcp_output.c    |  1 +
 4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2dbae9e..5b7207d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -765,6 +765,17 @@ enum {
 					 * window (in packets) or -1 if default
 					 * value should be used
 					 */
+	BPF_SOCK_OPS_TCP_CONNECT_CB,	/* Calls BPF program right before an
+					 * active connection is initialized
+					 */
+	BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB,	/* Calls BPF program when an
+						 * active connection is
+						 * established
+						 */
+	BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB,	/* Calls BPF program when a
+						 * passive connection is
+						 * established
+						 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 4af82b9..ed6b549 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -221,6 +221,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	tcp_init_congestion_control(child);
 	tcp_mtup_init(child);
 	tcp_init_metrics(child);
+	tcp_call_bpf(child, false, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 	tcp_init_buffer_space(child);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0867b05..1b868ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5571,7 +5571,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 	icsk->icsk_af_ops->rebuild_header(sk);
 
 	tcp_init_metrics(sk);
-
+	tcp_call_bpf(sk, false, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
 	tcp_init_congestion_control(sk);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
@@ -5977,6 +5977,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else {
 			/* Make sure socket is routed, for correct metrics. */
 			icsk->icsk_af_ops->rebuild_header(sk);
+			tcp_call_bpf(sk, false,
+				     BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 			tcp_init_congestion_control(sk);
 
 			tcp_mtup_init(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e5f623f..958edc8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3445,6 +3445,7 @@ int tcp_connect(struct sock *sk)
 	struct sk_buff *buff;
 	int err;
 
+	tcp_call_bpf(sk, false, BPF_SOCK_OPS_TCP_CONNECT_CB);
 	tcp_connect_init(sk);
 
 	if (unlikely(tp->repair)) {
-- 
2.9.3

* [PATCH net-next v4 09/16] bpf: Sample BPF program to set buffer sizes
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (7 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 08/16] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 10/16] bpf: Add support for changing congestion control Lawrence Brakmo
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

This patch contains a BPF program to set initial receive window to
40 packets and send and receive buffers to 1.5MB. This would usually
be done after doing appropriate checks that indicate the hosts are
far enough away (i.e. large RTT).

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  1 +
 samples/bpf/tcp_bufs_kern.c | 77 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 78 insertions(+)
 create mode 100644 samples/bpf/tcp_bufs_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ca95528..3b300db 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -115,6 +115,7 @@ always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
 always += tcp_synrto_kern.o
 always += tcp_rwnd_kern.o
+always += tcp_bufs_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_bufs_kern.c b/samples/bpf/tcp_bufs_kern.c
new file mode 100644
index 0000000..ccd3bbe
--- /dev/null
+++ b/samples/bpf/tcp_bufs_kern.c
@@ -0,0 +1,77 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set initial receive window to 40 packets and send
+ * and receive buffers to 1.5MB. This would usually be done after
+ * doing appropriate checks that indicate the hosts are far enough
+ * away (i.e. large RTT).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_bufs(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int bufsize = 1500000;
+	int rwnd_init = 40;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Usually there would be a check to ensure the hosts are far
+	 * from each other so it makes sense to increase buffer sizes
+	 */
+	switch (op) {
+	case BPF_SOCK_OPS_RWND_INIT:
+		rv = rwnd_init;
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		/* Set sndbuf and rcvbuf of active connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		/* Nothing to do */
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		/* Set sndbuf and rcvbuf of passive connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

* [PATCH net-next v4 10/16] bpf: Add support for changing congestion control
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (8 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 09/16] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-30 12:50   ` kbuild test robot
  2017-06-28 17:31 ` [PATCH net-next v4 11/16] bpf: Sample BPF program to set " Lawrence Brakmo
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Added support for changing congestion control for SOCK_OPS bpf
programs through the setsockopt bpf helper function. It also adds
a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for
congestion controls, like dctcp, that need to enable ECN in the
SYN packets.
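
The effect of the new load argument on tcp_set_congestion_control can be
sketched in userspace as follows. Everything here is a simplified model:
the table of registered algorithms, struct ca_entry and model_set_ca are
hypothetical, and module autoload, reference counting and the early -EPERM
check at the top of the real function are not modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a registered congestion control. */
struct ca_entry {
	const char *name;
	bool non_restricted; /* models TCP_CONG_NON_RESTRICTED */
};

static const struct ca_entry registered[] = {
	{ "cubic", true },
	{ "dctcp", false },
};

static const struct ca_entry *ca_find(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(registered) / sizeof(registered[0]); i++)
		if (strcmp(registered[i].name, name) == 0)
			return &registered[i];
	return NULL;
}

/* load == true models the setsockopt() path, load == false the BPF
 * helper path, which skips the capability check.
 */
static int model_set_ca(const char *name, bool load, bool cap_net_admin)
{
	const struct ca_entry *ca = ca_find(name);

	if (!ca)
		return -ENOENT;
	if (!load)
		return 0;
	if (!ca->non_restricted && !cap_net_admin)
		return -EPERM;
	return 0;
}
```

In this model an unprivileged setsockopt() caller cannot select a
restricted algorithm, whereas the BPF path can, as long as the algorithm is
already registered.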

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/tcp.h        |  9 ++++++++-
 include/uapi/linux/bpf.h |  3 +++
 net/core/filter.c        | 11 +++++++++--
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_cong.c      | 32 ++++++++++++++++++++++----------
 net/ipv4/tcp_input.c     |  3 ++-
 net/ipv4/tcp_output.c    |  8 +++++---
 7 files changed, 50 insertions(+), 18 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index af404aa..4faa8d1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1004,7 +1004,9 @@ void tcp_get_default_congestion_control(char *name);
 void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
-int tcp_set_congestion_control(struct sock *sk, const char *name);
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load);
+void tcp_reinit_congestion_control(struct sock *sk,
+				   const struct tcp_congestion_ops *ca);
 u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
@@ -2079,4 +2081,9 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk, bool is_req_sock)
 		rwnd = 0;
 	return rwnd;
 }
+
+static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
+{
+	return (tcp_call_bpf(sk, true, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+}
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5b7207d..77d05ff 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -776,6 +776,9 @@ enum {
 						 * passive connection is
 						 * established
 						 */
+	BPF_SOCK_OPS_NEEDS_ECN,		/* If connection's congestion control
+					 * needs ECN
+					 */
 };
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 167eca0..b36ec83 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2717,8 +2717,15 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 		}
 	} else if (level == SOL_TCP &&
 		   sk->sk_prot->setsockopt == tcp_setsockopt) {
-		/* Place holder */
-		ret = -EINVAL;
+		if (optname == TCP_CONGESTION) {
+			ret = tcp_set_congestion_control(sk, optval, false);
+			if (!ret && bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
+				/* replacing an existing ca */
+				tcp_reinit_congestion_control(sk,
+					inet_csk(sk)->icsk_ca_ops);
+		} else {
+			ret = -EINVAL;
+		}
 	} else {
 		ret = -EINVAL;
 	}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4c88d20..5199952 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2479,7 +2479,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		name[val] = 0;
 
 		lock_sock(sk);
-		err = tcp_set_congestion_control(sk, name);
+		err = tcp_set_congestion_control(sk, name, true);
 		release_sock(sk);
 		return err;
 	}
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 324c9bc..fde983f 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
 		INET_ECN_dontxmit(sk);
 }
 
-static void tcp_reinit_congestion_control(struct sock *sk,
-					  const struct tcp_congestion_ops *ca)
+void tcp_reinit_congestion_control(struct sock *sk,
+				   const struct tcp_congestion_ops *ca)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
@@ -333,8 +333,12 @@ int tcp_set_allowed_congestion_control(char *val)
 	return ret;
 }
 
-/* Change congestion control for socket */
-int tcp_set_congestion_control(struct sock *sk, const char *name)
+/* Change congestion control for socket. If load is false, then it is the
+ * responsibility of the caller to call tcp_init_congestion_control or
+ * tcp_reinit_congestion_control (if the current congestion control was
+ * already initialized).
+ */
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	const struct tcp_congestion_ops *ca;
@@ -344,21 +348,29 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 		return -EPERM;
 
 	rcu_read_lock();
-	ca = __tcp_ca_find_autoload(name);
+	if (!load)
+		ca = tcp_ca_find(name);
+	else
+		ca = __tcp_ca_find_autoload(name);
 	/* No change asking for existing value */
 	if (ca == icsk->icsk_ca_ops) {
 		icsk->icsk_ca_setsockopt = 1;
 		goto out;
 	}
-	if (!ca)
+	if (!ca) {
 		err = -ENOENT;
-	else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
-		   ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)))
+	} else if (!load) {
+		icsk->icsk_ca_ops = ca;
+		if (!try_module_get(ca->owner))
+			err = -EBUSY;
+	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
+		     ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) {
 		err = -EPERM;
-	else if (!try_module_get(ca->owner))
+	} else if (!try_module_get(ca->owner)) {
 		err = -EBUSY;
-	else
+	} else {
 		tcp_reinit_congestion_control(sk, ca);
+	}
  out:
 	rcu_read_unlock();
 	return err;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1b868ae..23f9707 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6192,7 +6192,8 @@ static void tcp_ecn_create_request(struct request_sock *req,
 	ecn_ok = net->ipv4.sysctl_tcp_ecn || ecn_ok_dst;
 
 	if ((!ect && ecn_ok) || tcp_ca_needs_ecn(listen_sk) ||
-	    (ecn_ok_dst & DST_FEATURE_ECN_CA))
+	    (ecn_ok_dst & DST_FEATURE_ECN_CA) ||
+	    tcp_bpf_ca_needs_ecn((struct sock *)req))
 		inet_rsk(req)->ecn_ok = 1;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 958edc8..a273117 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -316,7 +316,8 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 	TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_CWR;
 	if (!(tp->ecn_flags & TCP_ECN_OK))
 		TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_ECE;
-	else if (tcp_ca_needs_ecn(sk))
+	else if (tcp_ca_needs_ecn(sk) ||
+		 tcp_bpf_ca_needs_ecn(sk))
 		INET_ECN_xmit(sk);
 }
 
@@ -324,8 +325,9 @@ static void tcp_ecn_send_synack(struct sock *sk, struct sk_buff *skb)
 static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	bool bpf_needs_ecn = tcp_bpf_ca_needs_ecn(sk);
 	bool use_ecn = sock_net(sk)->ipv4.sysctl_tcp_ecn == 1 ||
-		       tcp_ca_needs_ecn(sk);
+		tcp_ca_needs_ecn(sk) || bpf_needs_ecn;
 
 	if (!use_ecn) {
 		const struct dst_entry *dst = __sk_dst_get(sk);
@@ -339,7 +341,7 @@ static void tcp_ecn_send_syn(struct sock *sk, struct sk_buff *skb)
 	if (use_ecn) {
 		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_ECE | TCPHDR_CWR;
 		tp->ecn_flags = TCP_ECN_OK;
-		if (tcp_ca_needs_ecn(sk))
+		if (tcp_ca_needs_ecn(sk) || bpf_needs_ecn)
 			INET_ECN_xmit(sk);
 	}
 }
-- 
2.9.3

* [PATCH net-next v4 11/16] bpf: Sample BPF program to set congestion control
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (9 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 10/16] bpf: Add support for changing congestion control Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 12/16] bpf: Adds support for setting initial cwnd Lawrence Brakmo
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Sample BPF program that sets congestion control to dctcp when both hosts
are within the same datacenter. In this example, that is assumed to be
the case when the first 5.5 bytes of their IPv6 addresses are the same.
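
For illustration, the "first 5.5 bytes" test can be modeled in plain C
outside the kernel. This is only a sketch: same_dc_prefix() and
DC_PREFIX_MASK_WORD1 are hypothetical names, and the address words are
assumed to be in host byte order with word 0 holding the first four bytes.
The mask 0xfff00000 keeps the top 12 bits (1.5 bytes) of the second word,
which together with the full first word gives 5.5 bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Top 12 bits of the second 32-bit word: 4 bytes + 1.5 bytes = 5.5 bytes. */
#define DC_PREFIX_MASK_WORD1 0xfff00000u

/* Hypothetical helper, not part of the patch. Words are assumed to be in
 * host byte order, with word 0 holding the first four address bytes.
 */
static bool same_dc_prefix(const uint32_t local_ip6[4],
			   const uint32_t remote_ip6[4])
{
	return local_ip6[0] == remote_ip6[0] &&
	       (local_ip6[1] & DC_PREFIX_MASK_WORD1) ==
	       (remote_ip6[1] & DC_PREFIX_MASK_WORD1);
}
```

Only the first word and the masked part of the second word take part in
the comparison; the remaining 10.5 bytes of each address are ignored.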

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile        |  1 +
 samples/bpf/tcp_cong_kern.c | 74 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 75 insertions(+)
 create mode 100644 samples/bpf/tcp_cong_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3b300db..6fdf32d 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -116,6 +116,7 @@ always += cookie_uid_helper_example.o
 always += tcp_synrto_kern.o
 always += tcp_rwnd_kern.o
 always += tcp_bufs_kern.o
+always += tcp_cong_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_cong_kern.c b/samples/bpf/tcp_cong_kern.c
new file mode 100644
index 0000000..fdced0f
--- /dev/null
+++ b/samples/bpf/tcp_cong_kern.c
@@ -0,0 +1,74 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set congestion control to dctcp when both hosts are
+ * in the same datacenter (as determined by IPv6 prefix).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_cong(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	char cong[] = "dctcp";
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check if both hosts are in the same datacenter. For this
+	 * example they are if the 1st 5.5 bytes in the IPv6 address
+	 * are the same.
+	 */
+	if (skops->family == AF_INET6 &&
+	    skops->local_ip6[0] == skops->remote_ip6[0] &&
+	    (skops->local_ip6[1] & 0xfff00000) ==
+	    (skops->remote_ip6[1] & 0xfff00000)) {
+		switch (op) {
+		case BPF_SOCK_OPS_NEEDS_ECN:
+			rv = 1;
+			break;
+		case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+			rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
+					    cong, sizeof(cong));
+			break;
+		case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+			rv = bpf_setsockopt(skops, SOL_TCP, TCP_CONGESTION,
+					    cong, sizeof(cong));
+			break;
+		default:
+			rv = -1;
+		}
+	} else {
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

* [PATCH net-next v4 12/16] bpf: Adds support for setting initial cwnd
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (10 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 11/16] bpf: Sample BPF program to set " Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 13/16] bpf: Sample BPF program to set " Lawrence Brakmo
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_IW, which sets the
initial congestion window. This can be used when the hosts are far
apart (large RTTs) and it is safe to start with a large initial cwnd.
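
As a rough user-space model of the check this option performs (see the
filter.c hunk below): the value must be positive, and it is rejected once
any data segment has been sent. The struct and function names here are
hypothetical stand-ins, not kernel definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Minimal stand-in for the two tcp_sock fields the handler touches. */
struct tcp_state_model {
	uint32_t snd_cwnd;
	uint32_t data_segs_out;
};

/* Hypothetical model of the TCP_BPF_IW case: returns 0 on success or
 * -EINVAL if val is not positive or data has already been sent.
 */
static int set_initial_cwnd(struct tcp_state_model *tp, int val)
{
	if (val <= 0 || tp->data_segs_out > 0)
		return -EINVAL;
	tp->snd_cwnd = (uint32_t)val;
	return 0;
}
```

In this model the window is left untouched whenever the call fails, which
matches the if/else structure of the handler.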

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h |  2 ++
 net/core/filter.c        | 14 +++++++++++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 77d05ff..0d9ff6d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -781,4 +781,6 @@ enum {
 					 */
 };
 
+#define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index b36ec83..147b637 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2724,7 +2724,19 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 				tcp_reinit_congestion_control(sk,
 					inet_csk(sk)->icsk_ca_ops);
 		} else {
-			ret = -EINVAL;
+			struct tcp_sock *tp = tcp_sk(sk);
+
+			val = *((int *)optval);
+			switch (optname) {
+			case TCP_BPF_IW:
+				if (val <= 0 || tp->data_segs_out > 0)
+					ret = -EINVAL;
+				else
+					tp->snd_cwnd = val;
+				break;
+			default:
+				ret = -EINVAL;
+			}
 		}
 	} else {
 		ret = -EINVAL;
-- 
2.9.3

* [PATCH net-next v4 13/16] bpf: Sample BPF program to set initial cwnd
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (11 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 12/16] bpf: Adds support for setting initial cwnd Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 14/16] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Sample BPF program that assumes hosts are far away (i.e. large RTTs)
and sets initial cwnd and initial receive window to 40 packets, and
send and receive buffers to 1.5MB.

In practice there would be a test to ensure the hosts are actually
far enough away.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile      |  1 +
 samples/bpf/tcp_iw_kern.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 80 insertions(+)
 create mode 100644 samples/bpf/tcp_iw_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 6fdf32d..242d76e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -117,6 +117,7 @@ always += tcp_synrto_kern.o
 always += tcp_rwnd_kern.o
 always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
+always += tcp_iw_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_iw_kern.c b/samples/bpf/tcp_iw_kern.c
new file mode 100644
index 0000000..28626f9
--- /dev/null
+++ b/samples/bpf/tcp_iw_kern.c
@@ -0,0 +1,79 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set initial congestion window and initial receive
+ * window to 40 packets and send and receive buffers to 1.5MB. This
+ * would usually be done after doing appropriate checks that indicate
+ * the hosts are far enough away (i.e. large RTT).
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_iw(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int bufsize = 1500000;
+	int rwnd_init = 40;
+	int iw = 40;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Usually there would be a check to ensure the hosts are far
+	 * from each other so it makes sense to increase buffer sizes
+	 */
+	switch (op) {
+	case BPF_SOCK_OPS_RWND_INIT:
+		rv = rwnd_init;
+		break;
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+		/* Set sndbuf and rcvbuf of active connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+		rv = bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw,
+				    sizeof(iw));
+		break;
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		/* Set sndbuf and rcvbuf of passive connections */
+		rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize,
+				    sizeof(bufsize));
+		rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF,
+					     &bufsize, sizeof(bufsize));
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

* [PATCH net-next v4 14/16] bpf: Adds support for setting sndcwnd clamp
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (12 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 13/16] bpf: Sample BPF program to set " Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 15/16] bpf: Sample bpf program to set " Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 16/16] bpf: update tools/include/uapi/linux/bpf.h Lawrence Brakmo
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_SNDCWND_CLAMP, which
sets the send congestion window clamp (snd_cwnd_clamp) and snd_ssthresh.
It is useful for limiting the sndcwnd when the hosts are close to each
other (small RTT).
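
A small user-space model of the option handling may help when reading the
filter.c hunk below. The struct, enum and function names are hypothetical
stand-ins. Note that the case has to end in a break: without one, control
would fall into the default label and report -EINVAL even after the
fields were set.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for the two tcp_sock fields the option writes. */
struct tcp_clamp_model {
	uint32_t snd_cwnd_clamp;
	uint32_t snd_ssthresh;
};

enum { OPT_SNDCWND_CLAMP = 1002 };	/* mirrors TCP_BPF_SNDCWND_CLAMP */

/* Hypothetical model of the option switch: 0 on success, -EINVAL for a
 * non-positive value or an unknown option.
 */
static int set_tcp_opt(struct tcp_clamp_model *tp, int optname, int val)
{
	int ret = 0;

	switch (optname) {
	case OPT_SNDCWND_CLAMP:
		if (val <= 0) {
			ret = -EINVAL;
		} else {
			tp->snd_cwnd_clamp = (uint32_t)val;
			tp->snd_ssthresh = (uint32_t)val;
		}
		break;	/* keep a successful ret from being overwritten */
	default:
		ret = -EINVAL;
	}
	return ret;
}
```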

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 1 +
 net/core/filter.c        | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0d9ff6d..284b366 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -782,5 +782,6 @@ enum {
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
+#define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
 
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 147b637..516353e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2734,6 +2734,14 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 				else
 					tp->snd_cwnd = val;
 				break;
+			case TCP_BPF_SNDCWND_CLAMP:
+				if (val <= 0) {
+					ret = -EINVAL;
+				} else {
+					tp->snd_cwnd_clamp = val;
+					tp->snd_ssthresh = val;
+				}
+				break;
 			default:
 				ret = -EINVAL;
 			}
-- 
2.9.3

* [PATCH net-next v4 15/16] bpf: Sample bpf program to set sndcwnd clamp
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (13 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 14/16] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  2017-06-28 17:31 ` [PATCH net-next v4 16/16] bpf: update tools/include/uapi/linux/bpf.h Lawrence Brakmo
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Sample BPF program, tcp_clamp_kern.c, to demonstrate setting the sndcwnd
clamp. This program assumes that if the first 5.5 bytes of the hosts'
IPv6 addresses are the same, then the hosts are in the same datacenter,
and it sets the sndcwnd clamp to 100 packets, SYN and SYN-ACK RTOs to
10ms and send/receive buffer sizes to 150KB.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile         |  1 +
 samples/bpf/tcp_clamp_kern.c | 94 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+)
 create mode 100644 samples/bpf/tcp_clamp_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 242d76e..9c65058 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -118,6 +118,7 @@ always += tcp_rwnd_kern.o
 always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
+always += tcp_clamp_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
diff --git a/samples/bpf/tcp_clamp_kern.c b/samples/bpf/tcp_clamp_kern.c
new file mode 100644
index 0000000..07e334e
--- /dev/null
+++ b/samples/bpf/tcp_clamp_kern.c
@@ -0,0 +1,94 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Sample BPF program to set send and receive buffers to 150KB, sndcwnd clamp
+ * to 100 packets and SYN and SYN_ACK RTOs to 10ms when both hosts are within
+ * the same datacenter. For this example, we assume they are within the same
+ * datacenter when the first 5.5 bytes of their IPv6 addresses are the same.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+
+#define DEBUG 1
+
+SEC("sockops")
+int bpf_clamp(struct bpf_sock_ops *skops)
+{
+	char fmt1[] = "BPF command: %d\n";
+	char fmt2[] = "  Returning %d\n";
+	int bufsize = 150000;
+	int to_init = 10;
+	int clamp = 100;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute rest of BPF program
+	 * if either port number is 55601
+	 */
+	if (skops->remote_port != 55601 && skops->local_port != 55601)
+		return -1;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_trace_printk(fmt1, sizeof(fmt1), op);
+#endif
+
+	/* Check that both hosts are within same datacenter. For this example
+	 * it is the case when the first 5.5 bytes of their IPv6 addresses are
+	 * the same.
+	 */
+	if (skops->family == AF_INET6 &&
+	    skops->local_ip6[0] == skops->remote_ip6[0] &&
+	    (skops->local_ip6[1] & 0xfff00000) ==
+	    (skops->remote_ip6[1] & 0xfff00000)) {
+		switch (op) {
+		case BPF_SOCK_OPS_TIMEOUT_INIT:
+			rv = to_init;
+			break;
+		case BPF_SOCK_OPS_TCP_CONNECT_CB:
+			/* Set sndbuf and rcvbuf of active connections */
+			rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
+					    &bufsize, sizeof(bufsize));
+			rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+						      SO_RCVBUF, &bufsize,
+						      sizeof(bufsize));
+			break;
+		case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+			rv = bpf_setsockopt(skops, SOL_TCP,
+					    TCP_BPF_SNDCWND_CLAMP,
+					    &clamp, sizeof(clamp));
+			break;
+		case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+			/* Set clamp, sndbuf, rcvbuf of passive connections */
+			rv = bpf_setsockopt(skops, SOL_TCP,
+					    TCP_BPF_SNDCWND_CLAMP,
+					    &clamp, sizeof(clamp));
+			rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+						      SO_SNDBUF, &bufsize,
+						      sizeof(bufsize));
+			rv = rv*100 + bpf_setsockopt(skops, SOL_SOCKET,
+						      SO_RCVBUF, &bufsize,
+						      sizeof(bufsize));
+			break;
+		default:
+			rv = -1;
+		}
+	} else {
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_trace_printk(fmt2, sizeof(fmt2), rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.3

* [PATCH net-next v4 16/16] bpf: update tools/include/uapi/linux/bpf.h
  2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
                   ` (14 preceding siblings ...)
  2017-06-28 17:31 ` [PATCH net-next v4 15/16] bpf: Sample bpf program to set " Lawrence Brakmo
@ 2017-06-28 17:31 ` Lawrence Brakmo
  15 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-28 17:31 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, Daniel Borkmann,
	David Ahern

Update tools/include/uapi/linux/bpf.h to include changes related to new
bpf sock_ops program type.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 tools/include/uapi/linux/bpf.h | 66 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f94b48b..284b366 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -120,12 +120,14 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LWT_IN,
 	BPF_PROG_TYPE_LWT_OUT,
 	BPF_PROG_TYPE_LWT_XMIT,
+	BPF_PROG_TYPE_SOCK_OPS,
 };
 
 enum bpf_attach_type {
 	BPF_CGROUP_INET_INGRESS,
 	BPF_CGROUP_INET_EGRESS,
 	BPF_CGROUP_INET_SOCK_CREATE,
+	BPF_CGROUP_SOCK_OPS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -518,6 +520,17 @@ union bpf_attr {
  *     Set full skb->hash.
  *     @skb: pointer to skb
  *     @hash: hash to set
+ *
+ * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
+ *     Calls setsockopt. Not all opts are available, only those with
+ *     integer optvals plus TCP_CONGESTION.
+ *     Supported levels: SOL_SOCKET and IPPROTO_TCP
+ *     @bpf_socket: pointer to bpf_socket
+ *     @level: SOL_SOCKET or IPPROTO_TCP
+ *     @optname: option name
+ *     @optval: pointer to option value
+ *     @optlen: length of optval in bytes
+ *     Return: 0 or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -568,7 +581,8 @@ union bpf_attr {
 	FN(probe_read_str),		\
 	FN(get_socket_cookie),		\
 	FN(get_socket_uid),		\
-	FN(set_hash),
+	FN(set_hash),			\
+	FN(setsockopt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -720,4 +734,54 @@ struct bpf_map_info {
 	__u32 map_flags;
 } __attribute__((aligned(8)));
 
+/* User bpf_sock_ops struct to access socket values and specify request ops
+ * and their replies.
+ * New fields can only be added at the end of this structure
+ */
+struct bpf_sock_ops {
+	__u32 op;
+	union {
+		__u32 reply;
+		__u32 replylong[4];
+	};
+	__u32 family;
+	__u32 remote_ip4;
+	__u32 local_ip4;
+	__u32 remote_ip6[4];
+	__u32 local_ip6[4];
+	__u32 remote_port;
+	__u32 local_port;
+};
+
+/* List of known BPF sock_ops operators.
+ * New entries can only be added at the end
+ */
+enum {
+	BPF_SOCK_OPS_VOID,
+	BPF_SOCK_OPS_TIMEOUT_INIT,	/* Should return SYN-RTO value to use or
+					 * -1 if default value should be used
+					 */
+	BPF_SOCK_OPS_RWND_INIT,		/* Should return initial advertized
+					 * window (in packets) or -1 if default
+					 * value should be used
+					 */
+	BPF_SOCK_OPS_TCP_CONNECT_CB,	/* Calls BPF program right before an
+					 * active connection is initialized
+					 */
+	BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB,	/* Calls BPF program when an
+						 * active connection is
+						 * established
+						 */
+	BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB,	/* Calls BPF program when a
+						 * passive connection is
+						 * established
+						 */
+	BPF_SOCK_OPS_NEEDS_ECN,		/* If connection's congestion control
+					 * needs ECN
+					 */
+};
+
+#define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
+#define TCP_BPF_SNDCWND_CLAMP	1002	/* Set sndcwnd_clamp */
+
 #endif /* _UAPI__LINUX_BPF_H__ */
-- 
2.9.3

* Re: [PATCH net-next v4 01/16] bpf: BPF support for sock_ops
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
@ 2017-06-28 19:53   ` Alexei Starovoitov
  2017-06-29  9:46   ` Daniel Borkmann
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Alexei Starovoitov @ 2017-06-28 19:53 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Daniel Borkmann, David Ahern

On 6/28/17 10:31 AM, Lawrence Brakmo wrote:
> +#ifdef CONFIG_BPF
> +static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
> +{
> +	struct bpf_sock_ops_kern sock_ops;
> +	int ret;
> +
> +	if (!is_req_sock)
> +		sock_owned_by_me(sk);
> +
> +	memset(&sock_ops, 0, sizeof(sock_ops));
> +	sock_ops.sk = sk;
> +	sock_ops.is_req_sock = is_req_sock;
> +	sock_ops.op = op;
> +
> +	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
> +	if (ret == 0)
> +		ret = sock_ops.reply;
> +	else
> +		ret = -1;
> +	return ret;
> +}

the switch to cgroup attached only made it really nice and clean.
No global state to worry about.
I haven't looked through the minor patch details, but overall
it all looks good to me. I don't have any architectural concerns.

Acked-by: Alexei Starovoitov <ast@kernel.org>
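
The return convention of the quoted tcp_call_bpf() can be restated as a
small user-space model. This is a sketch only: sock_ops_result() and
needs_ecn() are hypothetical names, prog_ret stands for whatever
BPF_CGROUP_RUN_PROG_SOCK_OPS() yields, and needs_ecn() mirrors the
"== 1" comparison in tcp_bpf_ca_needs_ecn() from patch 10.

```c
#include <stdbool.h>

/* The quoted code uses the program's reply only when the run macro
 * yields 0; any other result is collapsed to -1.
 */
static int sock_ops_result(int prog_ret, int reply)
{
	return prog_ret == 0 ? reply : -1;
}

/* ECN is requested only when the effective result is exactly 1. */
static bool needs_ecn(int prog_ret, int reply)
{
	return sock_ops_result(prog_ret, reply) == 1;
}
```

A caller that sees -1 cannot tell a failed run from a program that replied
-1 itself, which fits the samples' use of -1 as "use the default".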

* Re: [PATCH net-next v4 01/16] bpf: BPF support for sock_ops
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
  2017-06-28 19:53   ` Alexei Starovoitov
@ 2017-06-29  9:46   ` Daniel Borkmann
  2017-06-30  7:27     ` Lawrence Brakmo
  2017-06-29 15:57   ` kbuild test robot
  2017-06-29 16:21   ` kbuild test robot
  3 siblings, 1 reply; 26+ messages in thread
From: Daniel Borkmann @ 2017-06-29  9:46 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 06/28/2017 07:31 PM, Lawrence Brakmo wrote:
> Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
> struct that allows BPF programs of this type to access some of the
> socket's fields (such as IP addresses, ports, etc.). It uses the
> existing bpf cgroups infrastructure so the programs can be attached per
> cgroup with full inheritance support. The program will be called at
> appropriate times to set relevant connections parameters such as buffer
> sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
> as IP addresses, port numbers, etc.
[...]
> Currently there are two types of ops. The first type expects the BPF
> program to return a value which is then used by the caller (or a
> negative value to indicate the operation is not supported). The second
> type expects state changes to be done by the BPF program, for example
> through a setsockopt BPF helper function, and they ignore the return
> value.
>
> The reply fields of the bpf_sock_ops struct are there in case a bpf
> program needs to return a value larger than an integer.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>

For BPF bits:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

> @@ -3379,6 +3409,140 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
>   	return insn - insn_buf;
>   }
>
> +static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> +				       const struct bpf_insn *si,
> +				       struct bpf_insn *insn_buf,
> +				       struct bpf_prog *prog)
> +{
> +	struct bpf_insn *insn = insn_buf;
> +	int off;
> +
> +	switch (si->off) {
[...]
> +	case offsetof(struct bpf_sock_ops, remote_ip4):
> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +						struct bpf_sock_ops_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sock_ops_kern, sk));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +				      offsetof(struct sock_common, skc_daddr));
> +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> +		break;
> +
> +	case offsetof(struct bpf_sock_ops, local_ip4):
> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +					      struct bpf_sock_ops_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sock_ops_kern, sk));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +				      offsetof(struct sock_common,
> +					       skc_rcv_saddr));
> +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> +		break;
> +
> +	case offsetof(struct bpf_sock_ops, remote_ip6[0]) ...
> +	     offsetof(struct bpf_sock_ops, remote_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
> +					  skc_v6_daddr.s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sock_ops, remote_ip6[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +						struct bpf_sock_ops_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sock_ops_kern, sk));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +				      offsetof(struct sock_common,
> +					       skc_v6_daddr.s6_addr32[0]) +
> +				      off);
> +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case offsetof(struct bpf_sock_ops, local_ip6[0]) ...
> +	     offsetof(struct bpf_sock_ops, local_ip6[3]):
> +#if IS_ENABLED(CONFIG_IPV6)
> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
> +					  skc_v6_rcv_saddr.s6_addr32[0]) != 4);
> +
> +		off = si->off;
> +		off -= offsetof(struct bpf_sock_ops, local_ip6[0]);
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +						struct bpf_sock_ops_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sock_ops_kern, sk));
> +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> +				      offsetof(struct sock_common,
> +					       skc_v6_rcv_saddr.s6_addr32[0]) +
> +				      off);
> +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
> +#else
> +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
> +#endif
> +		break;
> +
> +	case offsetof(struct bpf_sock_ops, remote_port):
> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +						struct bpf_sock_ops_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sock_ops_kern, sk));
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> +				      offsetof(struct sock_common, skc_dport));
> +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
> +		break;
> +
> +	case offsetof(struct bpf_sock_ops, local_port):
> +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
> +						struct bpf_sock_ops_kern, sk),
> +				      si->dst_reg, si->src_reg,
> +				      offsetof(struct bpf_sock_ops_kern, sk));
> +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
> +				      offsetof(struct sock_common, skc_num));

That one is indeed in host endianness. Makes sense to have remote_port
and local_port in a consistent representation.

I was wondering though whether we should do all the conversion of
BPF_ENDIAN(BPF_FROM_BE, ...) or just leave it to the user whether
he needs the BPF_ENDIAN(BPF_FROM_BE, ...) or process it in network
byte order as-is. In case the user needs to go and undo again via
BPF_ENDIAN(BPF_TO_BE, ...), e.g., to reconstruct a full v6 addr,
then we have two unneeded insns for each of the remote_ip6[X] /
local_ip6[X]. So, not providing it in host byte order, the user can
still always choose to do a BPF_ENDIAN(BPF_FROM_BE, ...) by himself,
if this representation is preferred. Wdyt?
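
To make the round trip concrete, here is a plain C sketch of what a
consumer has to do to get the wire-order bytes of a v6 address back when
it is handed four host-order words. ip6_words_to_bytes() is a hypothetical
helper for illustration; htonl() is the standard POSIX function.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Rebuild the 16 wire-order bytes of an IPv6 address from four
 * host-order words. Each htonl() undoes one FROM_BE conversion.
 */
static void ip6_words_to_bytes(const uint32_t host_words[4], uint8_t out[16])
{
	for (int i = 0; i < 4; i++) {
		uint32_t be = htonl(host_words[i]);

		memcpy(out + 4 * i, &be, sizeof(be));
	}
}
```

If the words were exposed in network byte order instead, the loop would
reduce to a straight copy with no per-word conversion.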

> +		break;
> +	}
> +	return insn - insn_buf;
> +}
> +
>   const struct bpf_verifier_ops sk_filter_prog_ops = {
>   	.get_func_proto		= sk_filter_func_proto,
[...]


* Re: [PATCH net-next v4 07/16] bpf: Add setsockopt helper function to bpf
  2017-06-28 17:31 ` [PATCH net-next v4 07/16] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
@ 2017-06-29 10:08   ` Daniel Borkmann
  0 siblings, 0 replies; 26+ messages in thread
From: Daniel Borkmann @ 2017-06-29 10:08 UTC (permalink / raw)
  To: Lawrence Brakmo, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern

On 06/28/2017 07:31 PM, Lawrence Brakmo wrote:
> Added support for calling a subset of socket setsockopts from
> BPF_PROG_TYPE_SOCK_OPS programs. The code was duplicated rather
> than making the changes to call the socket setsockopt function because
> the changes required would have been larger.
>
> The ops supported are:
>    SO_RCVBUF
>    SO_SNDBUF
>    SO_MAX_PACING_RATE
>    SO_PRIORITY
>    SO_RCVLOWAT
>    SO_MARK
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
[...]
> @@ -2672,6 +2673,69 @@ static const struct bpf_func_proto bpf_get_socket_uid_proto = {
>   	.arg1_type      = ARG_PTR_TO_CTX,
>   };
>
> +BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
> +	   int, level, int, optname, char *, optval, int, optlen)

Nit: I would rather make optlen a u32. But more below.

> +{
> +	struct sock *sk = bpf_sock->sk;
> +	int ret = 0;
> +	int val;
> +
> +	if (bpf_sock->is_req_sock)
> +		return -EINVAL;
> +
> +	if (level == SOL_SOCKET) {

		if (optlen != sizeof(int))
			return -EINVAL;

> +		/* Only some socketops are supported */
> +		val = *((int *)optval);
> +
> +		switch (optname) {
> +		case SO_RCVBUF:
> +			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +			sk->sk_rcvbuf = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +			break;
> +		case SO_SNDBUF:
> +			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
> +			sk->sk_sndbuf = max_t(int, val * 2, SOCK_MIN_SNDBUF);
> +			break;
> +		case SO_MAX_PACING_RATE:
> +			sk->sk_max_pacing_rate = val;
> +			sk->sk_pacing_rate = min(sk->sk_pacing_rate,
> +						 sk->sk_max_pacing_rate);
> +			break;
> +		case SO_PRIORITY:
> +			sk->sk_priority = val;
> +			break;
> +		case SO_RCVLOWAT:
> +			if (val < 0)
> +				val = INT_MAX;
> +			sk->sk_rcvlowat = val ? : 1;
> +			break;
> +		case SO_MARK:
> +			sk->sk_mark = val;
> +			break;
> +		default:
> +			ret = -EINVAL;
> +		}
> +	} else if (level == SOL_TCP &&
> +		   sk->sk_prot->setsockopt == tcp_setsockopt) {
> +		/* Place holder */
> +		ret = -EINVAL;
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	return ret;
> +}
> +
> +static const struct bpf_func_proto bpf_setsockopt_proto = {
> +	.func		= bpf_setsockopt,
> +	.gpl_only	= true,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_ANYTHING,
> +	.arg4_type	= ARG_PTR_TO_MEM,
> +	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,

Any reason you went with the ARG_CONST_SIZE_OR_ZERO type? Semantics
of this are that allowed [arg4, arg5] pair can be i) [NULL, 0] or
ii) [non-NULL, non-zero], where in case ii) verifier checks that the
area is initialized when coming from BPF stack.

So the above 'val = *((int *)optval);' would give a NULL pointer deref
when NULL is passed as arg, and in case optlen was < sizeof(int) we
potentially access the stack out of bounds. If the [NULL, 0] pair is not
required, I would just make that an ARG_CONST_SIZE and then check for size before
accessing optval.
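
A minimal userspace sketch of the "check for size before accessing optval" idea follows. The function read_int_opt() is a hypothetical stand-in for the SOL_SOCKET path of the helper, not the patch's code; it only shows the order of the checks.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Validate the [optval, optlen] pair before reading an int from it.
 * Returns 0 and stores the value in *out, or -EINVAL if the pair is
 * unusable (NULL pointer, or a length that is not exactly sizeof(int)).
 */
int read_int_opt(const void *optval, unsigned int optlen, int *out)
{
	if (optval == NULL || optlen != sizeof(int))
		return -EINVAL;

	/* Only now is it safe to read sizeof(int) bytes from optval. */
	memcpy(out, optval, sizeof(int));
	return 0;
}
```

With ARG_CONST_SIZE the verifier would already reject the [NULL, 0] pair, so the NULL test here is only a defensive extra in this standalone sketch.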

> +};
> +
>   static const struct bpf_func_proto *
>   bpf_base_func_proto(enum bpf_func_id func_id)
>   {


* Re: [PATCH net-next v4 01/16] bpf: BPF support for sock_ops
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
  2017-06-28 19:53   ` Alexei Starovoitov
  2017-06-29  9:46   ` Daniel Borkmann
@ 2017-06-29 15:57   ` kbuild test robot
  2017-06-29 16:21   ` kbuild test robot
  3 siblings, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2017-06-29 15:57 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Daniel Borkmann, David Ahern


Hi Lawrence,

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/bpf-BPF-cgroup-support-for-sock_ops/20170629-203719
config: tile-allyesconfig (attached as .config)
compiler: tilegx-linux-gcc (GCC) 4.6.2
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=tile 

All warnings (new ones prefixed by >>):

   In file included from include/linux/netfilter/ipset/pfxlen.h:6:0,
                    from net/netfilter/ipset/pfxlen.c:2:
   include/net/tcp.h: In function 'tcp_call_bpf':
>> include/net/tcp.h:2047:8: warning: the address of 'sock_ops' will always evaluate as 'true' [-Waddress]

vim +2047 include/net/tcp.h

  2031	 * program loaded).
  2032	 */
  2033	#ifdef CONFIG_BPF
  2034	static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)
  2035	{
  2036		struct bpf_sock_ops_kern sock_ops;
  2037		int ret;
  2038	
  2039		if (!is_req_sock)
  2040			sock_owned_by_me(sk);
  2041	
  2042		memset(&sock_ops, 0, sizeof(sock_ops));
  2043		sock_ops.sk = sk;
  2044		sock_ops.is_req_sock = is_req_sock;
  2045		sock_ops.op = op;
  2046	
> 2047		ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
  2048		if (ret == 0)
  2049			ret = sock_ops.reply;
  2050		else
  2051			ret = -1;
  2052		return ret;
  2053	}
  2054	#else
  2055	static inline int tcp_call_bpf(struct sock *sk, bool is_req_sock, int op)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 49221 bytes --]


* Re: [PATCH net-next v4 01/16] bpf: BPF support for sock_ops
  2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
                     ` (2 preceding siblings ...)
  2017-06-29 15:57   ` kbuild test robot
@ 2017-06-29 16:21   ` kbuild test robot
  3 siblings, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2017-06-29 16:21 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Daniel Borkmann, David Ahern


Hi Lawrence,

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/bpf-BPF-cgroup-support-for-sock_ops/20170629-203719
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All warnings (new ones prefixed by >>):

   In file included from include/linux/cgroup-defs.h:20:0,
                    from include/linux/cgroup.h:26,
                    from include/net/netprio_cgroup.h:17,
                    from include/linux/netdevice.h:47,
                    from include/net/sock.h:51,
                    from include/linux/tcp.h:23,
                    from include/net/tcp.h:24,
                    from net//ipv6/netfilter/nf_socket_ipv6.c:13:
   include/net/tcp.h: In function 'tcp_call_bpf':
>> include/linux/bpf-cgroup.h:86:25: warning: the address of 'sock_ops' will always evaluate as 'true' [-Waddress]
     if (cgroup_bpf_enabled && (sock_ops) && (sock_ops)->sk) {        \
                            ^
>> include/net/tcp.h:2047:8: note: in expansion of macro 'BPF_CGROUP_RUN_PROG_SOCK_OPS'
     ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
           ^
--
   In file included from include/linux/cgroup-defs.h:20:0,
                    from include/linux/cgroup.h:26,
                    from include/net/netprio_cgroup.h:17,
                    from include/linux/netdevice.h:47,
                    from include/net/sock.h:51,
                    from include/linux/tcp.h:23,
                    from net//netfilter/ipvs/ip_vs_core.c:33:
   include/net/tcp.h: In function 'tcp_call_bpf':
>> include/linux/bpf-cgroup.h:86:25: warning: the address of 'sock_ops' will always evaluate as 'true' [-Waddress]
     if (cgroup_bpf_enabled && (sock_ops) && (sock_ops)->sk) {        \
                            ^
>> include/net/tcp.h:2047:8: note: in expansion of macro 'BPF_CGROUP_RUN_PROG_SOCK_OPS'
     ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
           ^
   net//netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_sched_persist':
   net//netfilter/ipvs/ip_vs_core.c:399:1: warning: the frame size of 1072 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
   net//netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_new_conn_out':
   net//netfilter/ipvs/ip_vs_core.c:1199:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
--
   In file included from include/linux/cgroup-defs.h:20:0,
                    from include/linux/cgroup.h:26,
                    from include/net/netprio_cgroup.h:17,
                    from include/linux/netdevice.h:47,
                    from include/net/sock.h:51,
                    from include/linux/tcp.h:23,
                    from net/netfilter/ipvs/ip_vs_core.c:33:
   include/net/tcp.h: In function 'tcp_call_bpf':
>> include/linux/bpf-cgroup.h:86:25: warning: the address of 'sock_ops' will always evaluate as 'true' [-Waddress]
     if (cgroup_bpf_enabled && (sock_ops) && (sock_ops)->sk) {        \
                            ^
>> include/net/tcp.h:2047:8: note: in expansion of macro 'BPF_CGROUP_RUN_PROG_SOCK_OPS'
     ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
           ^
   net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_sched_persist':
   net/netfilter/ipvs/ip_vs_core.c:399:1: warning: the frame size of 1072 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^
   net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_new_conn_out':
   net/netfilter/ipvs/ip_vs_core.c:1199:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=]
    }
    ^

vim +86 include/linux/bpf-cgroup.h

    70		__ret;								       \
    71	})
    72	
    73	#define BPF_CGROUP_RUN_PROG_INET_SOCK(sk)				       \
    74	({									       \
    75		int __ret = 0;							       \
    76		if (cgroup_bpf_enabled && sk) {					       \
    77			__ret = __cgroup_bpf_run_filter_sk(sk,			       \
    78							 BPF_CGROUP_INET_SOCK_CREATE); \
    79		}								       \
    80		__ret;								       \
    81	})
    82	
    83	#define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)				       \
    84	({									       \
    85		int __ret = 0;							       \
  > 86		if (cgroup_bpf_enabled && (sock_ops) && (sock_ops)->sk) {	       \
    87			typeof(sk) __sk = sk_to_full_sk((sock_ops)->sk);	       \
    88			if (sk_fullsock(__sk))					       \
    89				__ret = __cgroup_bpf_run_filter_sock_ops(__sk,	       \
    90									 sock_ops,     \
    91								 BPF_CGROUP_SOCK_OPS); \
    92		}								       \
    93		__ret;								       \
    94	})
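
For context on the warning: the macro tests "(sock_ops)" for NULL, but tcp_call_bpf() passes "&sock_ops", the address of a local variable, which can never be NULL. The sketch below reproduces that shape in standalone C and shows one possible way to avoid it, an inline function whose NULL test is on a parameter. All names are hypothetical, and this is not necessarily how the series resolves the warning.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct bpf_sock_ops_kern. */
struct ops_ctx {
	void *sk;
	int reply;
};

/* Macro in the style of line 86: the "(ctx) &&" test is evaluated on
 * whatever expression the caller passes. With "&local" that is an
 * address that can never be NULL, which is what -Waddress reports.
 */
#define CTX_USABLE_MACRO(ctx) ((ctx) && (ctx)->sk)

/* Function form: the NULL test is on a parameter, so it stays
 * meaningful and the warning should not trigger.
 */
static inline int ctx_usable(const struct ops_ctx *ctx)
{
	return ctx != NULL && ctx->sk != NULL;
}

/* Mirrors the shape of tcp_call_bpf(): a local context passed by address. */
int call_with_local_ctx(void *sk)
{
	struct ops_ctx local = { .sk = sk, .reply = 0 };

	return ctx_usable(&local) ? 1 : 0;
}
```

Another option would be to drop the redundant pointer test from the macro, since every caller passes a valid address.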

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 50369 bytes --]


* Re: [PATCH net-next v4 04/16] bpf: Sample bpf program to set SYN/SYN-ACK RTOs
  2017-06-28 17:31 ` [PATCH net-next v4 04/16] bpf: Sample bpf program to set " Lawrence Brakmo
@ 2017-06-29 19:39   ` Jesper Dangaard Brouer
  2017-06-29 22:25     ` Lawrence Brakmo
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2017-06-29 19:39 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: brouer, netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Daniel Borkmann, David Ahern

On Wed, 28 Jun 2017 10:31:12 -0700
Lawrence Brakmo <brakmo@fb.com> wrote:

> +++ b/samples/bpf/tcp_synrto_kern.c
> @@ -0,0 +1,60 @@
> +/* Copyright (c) 2017 Facebook
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * BPF program to set SYN and SYN-ACK RTOs to 10ms when using IPv6 addresses
> + * and the first 5.5 bytes of the IPv6 addresses are the same (in this example
> + * that means both hosts are in the same datacenter.

Missing end ")".
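
As an aside on the "first 5.5 bytes" wording in the quoted comment: it amounts to a 44-bit prefix match. A minimal sketch of such a test is below, assuming the addresses are available as 16 raw bytes in network order. The function same_prefix44() is hypothetical and is not the code in tcp_synrto_kern.c.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Return true if the first 44 bits (5.5 bytes) of two IPv6 addresses
 * match. Working on raw bytes avoids any byte-order conversion.
 */
bool same_prefix44(const uint8_t a[16], const uint8_t b[16])
{
	/* First five whole bytes must be identical. */
	if (memcmp(a, b, 5) != 0)
		return false;

	/* Then only the upper nibble of the sixth byte is compared. */
	return ((a[5] ^ b[5]) & 0xf0) == 0;
}
```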

I really like this short comment of what the program does, as it helps
people browsing these sample programs. 

Can you also mention in the comment of each of these bpf programs that
people load the bpf object file via the program 'load_sock_ops'?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH net-next v4 04/16] bpf: Sample bpf program to set SYN/SYN-ACK RTOs
  2017-06-29 19:39   ` Jesper Dangaard Brouer
@ 2017-06-29 22:25     ` Lawrence Brakmo
  0 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-29 22:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Kernel Team, Blake Matheny, Alexei Starovoitov,
	Daniel Borkmann, David Ahern


On 6/29/17, 12:39 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:

    On Wed, 28 Jun 2017 10:31:12 -0700
    Lawrence Brakmo <brakmo@fb.com> wrote:
    
    > +++ b/samples/bpf/tcp_synrto_kern.c
    > @@ -0,0 +1,60 @@
    > +/* Copyright (c) 2017 Facebook
    > + *
    > + * This program is free software; you can redistribute it and/or
    > + * modify it under the terms of version 2 of the GNU General Public
    > + * License as published by the Free Software Foundation.
    > + *
    > + * BPF program to set SYN and SYN-ACK RTOs to 10ms when using IPv6 addresses
    > + * and the first 5.5 bytes of the IPv6 addresses are the same (in this example
    > + * that means both hosts are in the same datacenter.
    
    Missing end ")".
    
    I really like this short comment of what the program does, as it helps
    people browsing these sample programs. 
    
    Can you also mention in the comment (of all these) bpf programs that
    people load this bpf object file via the program 'load_sock_ops'?

Thank you for finding the typo and for suggesting that the comments
mention how to load the sample programs. Both will be addressed in v5,
due later today.
    
    -- 
    Best regards,
      Jesper Dangaard Brouer
      MSc.CS, Principal Kernel Engineer at Red Hat
      LinkedIn: http://www.linkedin.com/in/brouer
    



* Re: [PATCH net-next v4 01/16] bpf: BPF support for sock_ops
  2017-06-29  9:46   ` Daniel Borkmann
@ 2017-06-30  7:27     ` Lawrence Brakmo
  0 siblings, 0 replies; 26+ messages in thread
From: Lawrence Brakmo @ 2017-06-30  7:27 UTC (permalink / raw)
  To: Daniel Borkmann, netdev
  Cc: Kernel Team, Blake Matheny, Alexei Starovoitov, David Ahern


On 6/29/17, 2:46 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:

    On 06/28/2017 07:31 PM, Lawrence Brakmo wrote:
    > Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
    > struct that allows BPF programs of this type to access some of the
    > socket's fields (such as IP addresses, ports, etc.). It uses the
    > existing bpf cgroups infrastructure so the programs can be attached per
    > cgroup with full inheritance support. The program will be called at
    > appropriate times to set relevant connections parameters such as buffer
    > sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
    > as IP addresses, port numbers, etc.
    [...]
    > Currently there are two types of ops. The first type expects the BPF
    > program to return a value which is then used by the caller (or a
    > negative value to indicate the operation is not supported). The second
    > type expects state changes to be done by the BPF program, for example
    > through a setsockopt BPF helper function, and they ignore the return
    > value.
    >
    > The reply fields of the bpf_sock_ops struct are there in case a bpf
    > program needs to return a value larger than an integer.
    >
    > Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
    
    For BPF bits:
    
    Acked-by: Daniel Borkmann <daniel@iogearbox.net>
    
    > @@ -3379,6 +3409,140 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
    >   	return insn - insn_buf;
    >   }
    >
    > +static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
    > +				       const struct bpf_insn *si,
    > +				       struct bpf_insn *insn_buf,
    > +				       struct bpf_prog *prog)
    > +{
    > +	struct bpf_insn *insn = insn_buf;
    > +	int off;
    > +
    > +	switch (si->off) {
    [...]
    > +	case offsetof(struct bpf_sock_ops, remote_ip4):
    > +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_daddr) != 4);
    > +
    > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
    > +						struct bpf_sock_ops_kern, sk),
    > +				      si->dst_reg, si->src_reg,
    > +				      offsetof(struct bpf_sock_ops_kern, sk));
    > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
    > +				      offsetof(struct sock_common, skc_daddr));
    > +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
    > +		break;
    > +
    > +	case offsetof(struct bpf_sock_ops, local_ip4):
    > +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_rcv_saddr) != 4);
    > +
    > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
    > +					      struct bpf_sock_ops_kern, sk),
    > +				      si->dst_reg, si->src_reg,
    > +				      offsetof(struct bpf_sock_ops_kern, sk));
    > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
    > +				      offsetof(struct sock_common,
    > +					       skc_rcv_saddr));
    > +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
    > +		break;
    > +
    > +	case offsetof(struct bpf_sock_ops, remote_ip6[0]) ...
    > +	     offsetof(struct bpf_sock_ops, remote_ip6[3]):
    > +#if IS_ENABLED(CONFIG_IPV6)
    > +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
    > +					  skc_v6_daddr.s6_addr32[0]) != 4);
    > +
    > +		off = si->off;
    > +		off -= offsetof(struct bpf_sock_ops, remote_ip6[0]);
    > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
    > +						struct bpf_sock_ops_kern, sk),
    > +				      si->dst_reg, si->src_reg,
    > +				      offsetof(struct bpf_sock_ops_kern, sk));
    > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
    > +				      offsetof(struct sock_common,
    > +					       skc_v6_daddr.s6_addr32[0]) +
    > +				      off);
    > +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
    > +#else
    > +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
    > +#endif
    > +		break;
    > +
    > +	case offsetof(struct bpf_sock_ops, local_ip6[0]) ...
    > +	     offsetof(struct bpf_sock_ops, local_ip6[3]):
    > +#if IS_ENABLED(CONFIG_IPV6)
    > +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common,
    > +					  skc_v6_rcv_saddr.s6_addr32[0]) != 4);
    > +
    > +		off = si->off;
    > +		off -= offsetof(struct bpf_sock_ops, local_ip6[0]);
    > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
    > +						struct bpf_sock_ops_kern, sk),
    > +				      si->dst_reg, si->src_reg,
    > +				      offsetof(struct bpf_sock_ops_kern, sk));
    > +		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
    > +				      offsetof(struct sock_common,
    > +					       skc_v6_rcv_saddr.s6_addr32[0]) +
    > +				      off);
    > +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 32);
    > +#else
    > +		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
    > +#endif
    > +		break;
    > +
    > +	case offsetof(struct bpf_sock_ops, remote_port):
    > +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_dport) != 2);
    > +
    > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
    > +						struct bpf_sock_ops_kern, sk),
    > +				      si->dst_reg, si->src_reg,
    > +				      offsetof(struct bpf_sock_ops_kern, sk));
    > +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
    > +				      offsetof(struct sock_common, skc_dport));
    > +		*insn++ = BPF_ENDIAN(BPF_FROM_BE, si->dst_reg, 16);
    > +		break;
    > +
    > +	case offsetof(struct bpf_sock_ops, local_port):
    > +		BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_num) != 2);
    > +
    > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
    > +						struct bpf_sock_ops_kern, sk),
    > +				      si->dst_reg, si->src_reg,
    > +				      offsetof(struct bpf_sock_ops_kern, sk));
    > +		*insn++ = BPF_LDX_MEM(BPF_H, si->dst_reg, si->dst_reg,
    > +				      offsetof(struct sock_common, skc_num));
    
    That one is indeed in host endianness. Makes sense to have remote_port
    and local_port in a consistent representation.
    
    I was wondering though whether we should do all the conversion of
    BPF_ENDIAN(BPF_FROM_BE, ...) or just leave it to the user whether
    he needs the BPF_ENDIAN(BPF_FROM_BE, ...) or process it in network
    byte order as-is. In case the user needs to go and undo again via
    BPF_ENDIAN(BPF_TO_BE, ...), e.g., to reconstruct a full v6 addr,
    then we have two unneeded insns for each of the remote_ip6[X] /
    local_ip6[X]. So, not providing it in host byte order, the user can
    still always choose to do a BPF_ENDIAN(BPF_FROM_BE, ...) by himself,
    if this representation is preferred. Wdyt?

Good point about endianness. I will present the data in the same
endianness as it has in the kernel sock struct, and document this in
the sock_ops struct.
I will submit a new patch set soon.  
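
A small userspace sketch of what "same endianness as in the kernel sock struct" could mean for a program comparing ports: skc_dport is kept in network byte order and skc_num in host byte order, so the remote port is compared against a converted constant while the local port is compared directly. The struct and function names are hypothetical, not the final sock_ops layout.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the two port fields, each kept in the byte order
 * of the field it mirrors.
 */
struct conn_ports {
	uint16_t remote_port_be;	/* network byte order, like skc_dport */
	uint16_t local_port;		/* host byte order, like skc_num */
};

/* Convert the constant once instead of converting the field. */
bool is_remote_port(const struct conn_ports *c, uint16_t port)
{
	return c->remote_port_be == htons(port);
}

/* Already in host order, so a direct comparison is enough. */
bool is_local_port(const struct conn_ports *c, uint16_t port)
{
	return c->local_port == port;
}
```

When the port is a compile-time constant, the conversion can typically be folded at build time, so the comparison costs no extra instruction at run time.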
    
    > +		break;
    > +	}
    > +	return insn - insn_buf;
    > +}
    > +
    >   const struct bpf_verifier_ops sk_filter_prog_ops = {
    >   	.get_func_proto		= sk_filter_func_proto,
    [...]
    



* Re: [PATCH net-next v4 10/16] bpf: Add support for changing congestion control
  2017-06-28 17:31 ` [PATCH net-next v4 10/16] bpf: Add support for changing congestion control Lawrence Brakmo
@ 2017-06-30 12:50   ` kbuild test robot
  0 siblings, 0 replies; 26+ messages in thread
From: kbuild test robot @ 2017-06-30 12:50 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: kbuild-all, netdev, Kernel Team, Blake Matheny,
	Alexei Starovoitov, Daniel Borkmann, David Ahern


Hi Lawrence,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Lawrence-Brakmo/bpf-BPF-cgroup-support-for-sock_ops/20170629-203719
config: arm-spear3xx_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `____bpf_setsockopt':
>> net/core/filter.c:2721: undefined reference to `tcp_set_congestion_control'
>> net/core/filter.c:2724: undefined reference to `tcp_reinit_congestion_control'
>> net/core/filter.c:2724: undefined reference to `tcp_setsockopt'

vim +2721 net/core/filter.c

  2715			default:
  2716				ret = -EINVAL;
  2717			}
  2718		} else if (level == SOL_TCP &&
  2719			   sk->sk_prot->setsockopt == tcp_setsockopt) {
  2720			if (optname == TCP_CONGESTION) {
> 2721				ret = tcp_set_congestion_control(sk, optval, false);
  2722				if (!ret && bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
  2723					/* replacing an existing ca */
> 2724					tcp_reinit_congestion_control(sk,
  2725						inet_csk(sk)->icsk_ca_ops);
  2726			} else {
  2727				ret = -EINVAL;
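
The link errors suggest a configuration where net/core/filter.c is built but the TCP code is not (presumably CONFIG_INET is unset; that is an assumption, the report does not say so). One common way to handle this is to compile the TCP branch only when the symbols exist. The standalone sketch below uses a hypothetical HAVE_TCP macro in place of a Kconfig symbol and hypothetical function names.

```c
#include <errno.h>

/* Hypothetical stand-in for a Kconfig symbol. Leave it undefined to
 * model a build without TCP; define it only if tcp_set_cc() is linked in.
 */
/* #define HAVE_TCP 1 */

#define LEVEL_TCP 6	/* same numeric value as SOL_TCP */

#ifdef HAVE_TCP
int tcp_set_cc(const char *name);	/* provided by the TCP code */
#endif

/* The TCP branch, and with it every reference to TCP symbols, is only
 * compiled when TCP support is configured. Otherwise the option is
 * reported as unsupported and the file links without those symbols.
 */
int set_option(int level, const char *name)
{
#ifdef HAVE_TCP
	if (level == LEVEL_TCP)
		return tcp_set_cc(name);
#endif
	(void)level;
	(void)name;
	return -EINVAL;
}
```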

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 15767 bytes --]


Thread overview: 26+ messages
2017-06-28 17:31 [PATCH net-next v4 00/16] bpf: BPF cgroup support for sock_ops Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 01/16] bpf: BPF " Lawrence Brakmo
2017-06-28 19:53   ` Alexei Starovoitov
2017-06-29  9:46   ` Daniel Borkmann
2017-06-30  7:27     ` Lawrence Brakmo
2017-06-29 15:57   ` kbuild test robot
2017-06-29 16:21   ` kbuild test robot
2017-06-28 17:31 ` [PATCH net-next v4 02/16] bpf: program to load and attach sock_ops BPF progs Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 03/16] bpf: Support for per connection SYN/SYN-ACK RTOs Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 04/16] bpf: Sample bpf program to set " Lawrence Brakmo
2017-06-29 19:39   ` Jesper Dangaard Brouer
2017-06-29 22:25     ` Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 05/16] bpf: Support for setting initial receive window Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 06/16] bpf: Sample bpf program to set initial window Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 07/16] bpf: Add setsockopt helper function to bpf Lawrence Brakmo
2017-06-29 10:08   ` Daniel Borkmann
2017-06-28 17:31 ` [PATCH net-next v4 08/16] bpf: Add TCP connection BPF callbacks Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 09/16] bpf: Sample BPF program to set buffer sizes Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 10/16] bpf: Add support for changing congestion control Lawrence Brakmo
2017-06-30 12:50   ` kbuild test robot
2017-06-28 17:31 ` [PATCH net-next v4 11/16] bpf: Sample BPF program to set " Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 12/16] bpf: Adds support for setting initial cwnd Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 13/16] bpf: Sample BPF program to set " Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 14/16] bpf: Adds support for setting sndcwnd clamp Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 15/16] bpf: Sample bpf program to set " Lawrence Brakmo
2017-06-28 17:31 ` [PATCH net-next v4 16/16] bpf: update tools/include/uapi/linux/bpf.h Lawrence Brakmo
