* [PATCH net-next 0/5] bpf: add support for BASE_RTT
@ 2017-10-20 18:05 Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT Lawrence Brakmo
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Lawrence Brakmo @ 2017-10-20 18:05 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny,
	Lawrence Brakmo

This patch set adds the following functionality to socket_ops BPF
programs:
1) Add the bpf helper function bpf_getsockopt. Currently it only
   supports TCP_CONGESTION.
2) Add the BPF_SOCK_OPS_BASE_RTT op to get the base RTT of the
   connection. In general, the base RTT indicates the threshold such
   that RTTs above it indicate congestion. More details are in the
   relevant patches.

The set consists of the following patches:

[PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT
[PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops
[PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv
[PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program
[PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme

 include/uapi/linux/bpf.h                  | 26 +++++++++++--
 net/core/filter.c                         | 46 ++++++++++++++++++++++-
 net/ipv4/tcp_nv.c                         | 40 +++++++++++++++++++-
 samples/bpf/Makefile                      |  1 +
 samples/bpf/tcp_basertt_kern.c            | 78 +++++++++++++++++++++++++++++++++++++++
 samples/bpf/tcp_bbf.readme                | 27 ++++++++++++++
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 7 files changed, 214 insertions(+), 7 deletions(-)

* [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT
  2017-10-20 18:05 [PATCH net-next 0/5] bpf: add support for BASE_RTT Lawrence Brakmo
@ 2017-10-20 18:05 ` Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops Lawrence Brakmo
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Lawrence Brakmo @ 2017-10-20 18:05 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny,
	Lawrence Brakmo

A congestion control algorithm can make a call to the BPF socket_ops
program to request the base RTT. The base RTT can be congestion control
dependent and is meant to represent a congestion threshold such that
RTTs above it indicate congestion. This is especially useful for flows
within a DC where the base RTT is easy to obtain.

Being provided a base RTT solves a basic problem in RTT-based congestion
avoidance algorithms (such as Vegas, NV and BBR). Although it is easy
to get the base RTT when the network is not congested, it is very
difficult to do when it is very congested. Newer connections get an
inflated value of the base RTT, leading to unfairness (newer flows with a
larger base RTT get more bandwidth). As a result, RTT-based congestion
avoidance algorithms tend to update their base RTTs to improve fairness.
In very congested networks this can lead to base RTT inflation, reducing
the ability of these RTT-based congestion control algorithms to prevent
congestion.

Note that in my experiments with TCP-NV, the base RTT provided can be
much larger than the actual hardware RTT. For example, experimenting
with hosts within a rack where the hardware RTT is 16-20us, I've used
base RTTs up to 150us. The effect of using a larger base RTT is that the
congestion avoidance algorithm will allow more queueing. When there are
only a few flows the main effect is larger measured RTTs and RPC
latencies due to the increased queueing. When there are a lot of flows,
a larger base RTT can lead to more congestion and more packet drops.
For this case, where the hardware RTT is 20us, a base RTT of 80us
produces good results.

This patch only introduces BPF_SOCK_OPS_BASE_RTT; a later patch in this
set adds support for using it in TCP-NV. Further study and testing are
needed before support can be added to other delay-based congestion
avoidance algorithms.
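
For illustration only (not part of this patch), here is a minimal sketch
of how a delay-based congestion control module could consume this op,
assuming the existing tcp_call_bpf() helper from include/net/tcp.h; the
function and parameter names are hypothetical:

	/* Ask the attached socket_ops program for a base RTT (in us).
	 * tcp_call_bpf() returns the program's reply; a value <= 0 means
	 * no program is attached or it could not provide a base RTT, in
	 * which case we fall back to a caller-supplied default.
	 */
	static u32 example_get_base_rtt(struct sock *sk, u32 fallback_us)
	{
		int ret = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);

		return ret > 0 ? (u32)ret : fallback_us;
	}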

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/uapi/linux/bpf.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d83f95e..1aca744 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -955,6 +955,13 @@ enum {
 	BPF_SOCK_OPS_NEEDS_ECN,		/* If connection's congestion control
 					 * needs ECN
 					 */
+	BPF_SOCK_OPS_BASE_RTT,		/* Get base RTT. The correct value is
+					 * based on the path and may be
+					 * dependent on the congestion control
+					 * algorithm. In general it indicates
+					 * a congestion threshold. RTTs above
+					 * this indicate congestion
+					 */
 };
 
 #define TCP_BPF_IW		1001	/* Set TCP initial congestion window */
-- 
2.9.5

* [PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops
  2017-10-20 18:05 [PATCH net-next 0/5] bpf: add support for BASE_RTT Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT Lawrence Brakmo
@ 2017-10-20 18:05 ` Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv Lawrence Brakmo
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Lawrence Brakmo @ 2017-10-20 18:05 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny,
	Lawrence Brakmo

Add support for the helper function bpf_getsockopt to socket_ops BPF
programs. This patch only supports TCP_CONGESTION.
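
For illustration (not part of this patch), a minimal usage sketch of the
new helper from a socket_ops program; it assumes the same includes as the
sample program added later in this series, uses the bpf_getsockopt()
wrapper added to bpf_helpers.h below, and the program name is
hypothetical:

	SEC("sockops")
	int example_read_cc(struct bpf_sock_ops *skops)
	{
		char cc[20];
		int rv;

		/* Only TCP_CONGESTION at level SOL_TCP is supported so
		 * far; rv is 0 on success or a negative error, in which
		 * case cc is zeroed by the helper.
		 */
		rv = bpf_getsockopt(skops, SOL_TCP, TCP_CONGESTION,
				    cc, sizeof(cc));
		skops->reply = rv;
		return 1;
	}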

Signed-off-by: Vlad Vysotsky <vlad@cs.ucla.edu>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Alexei Starovoitov <ast@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/uapi/linux/bpf.h                  | 19 ++++++++++---
 net/core/filter.c                         | 46 ++++++++++++++++++++++++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1aca744..f650346 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -613,12 +613,22 @@ union bpf_attr {
  * int bpf_setsockopt(bpf_socket, level, optname, optval, optlen)
  *     Calls setsockopt. Not all opts are available, only those with
  *     integer optvals plus TCP_CONGESTION.
- *     Supported levels: SOL_SOCKET and IPROTO_TCP
+ *     Supported levels: SOL_SOCKET and IPPROTO_TCP
  *     @bpf_socket: pointer to bpf_socket
- *     @level: SOL_SOCKET or IPROTO_TCP
+ *     @level: SOL_SOCKET or IPPROTO_TCP
  *     @optname: option name
  *     @optval: pointer to option value
- *     @optlen: length of optval in byes
+ *     @optlen: length of optval in bytes
+ *     Return: 0 or negative error
+ *
+ * int bpf_getsockopt(bpf_socket, level, optname, optval, optlen)
+ *     Calls getsockopt. Not all opts are available.
+ *     Supported levels: IPPROTO_TCP
+ *     @bpf_socket: pointer to bpf_socket
+ *     @level: IPPROTO_TCP
+ *     @optname: option name
+ *     @optval: pointer to option value
+ *     @optlen: length of optval in bytes
  *     Return: 0 or negative error
  *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
@@ -721,7 +731,8 @@ union bpf_attr {
 	FN(sock_map_update),		\
 	FN(xdp_adjust_meta),		\
 	FN(perf_event_read_value),	\
-	FN(perf_prog_read_value),
+	FN(perf_prog_read_value),	\
+	FN(getsockopt),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 09e011f..ccf62f4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3273,7 +3273,7 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 
 static const struct bpf_func_proto bpf_setsockopt_proto = {
 	.func		= bpf_setsockopt,
-	.gpl_only	= true,
+	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 	.arg2_type	= ARG_ANYTHING,
@@ -3282,6 +3282,48 @@ static const struct bpf_func_proto bpf_setsockopt_proto = {
 	.arg5_type	= ARG_CONST_SIZE,
 };
 
+BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock,
+	   int, level, int, optname, char *, optval, int, optlen)
+{
+	struct sock *sk = bpf_sock->sk;
+	int ret = 0;
+
+	if (!sk_fullsock(sk))
+		goto err_clear;
+
+#ifdef CONFIG_INET
+	if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) {
+		if (optname == TCP_CONGESTION) {
+			struct inet_connection_sock *icsk = inet_csk(sk);
+
+			if (!icsk->icsk_ca_ops || optlen <= 1)
+				goto err_clear;
+			strncpy(optval, icsk->icsk_ca_ops->name, optlen);
+			optval[optlen - 1] = 0;
+		} else {
+			goto err_clear;
+		}
+	} else {
+		goto err_clear;
+	}
+	return ret;
+#endif
+err_clear:
+	memset(optval, 0, optlen);
+	return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_getsockopt_proto = {
+	.func		= bpf_getsockopt,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_UNINIT_MEM,
+	.arg5_type	= ARG_CONST_SIZE,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -3460,6 +3502,8 @@ static const struct bpf_func_proto *
 	switch (func_id) {
 	case BPF_FUNC_setsockopt:
 		return &bpf_setsockopt_proto;
+	case BPF_FUNC_getsockopt:
+		return &bpf_getsockopt_proto;
 	case BPF_FUNC_sock_map_update:
 		return &bpf_sock_map_update_proto;
 	default:
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e25dbf6..609514f 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -67,6 +67,9 @@ static int (*bpf_xdp_adjust_meta)(void *ctx, int offset) =
 static int (*bpf_setsockopt)(void *ctx, int level, int optname, void *optval,
 			     int optlen) =
 	(void *) BPF_FUNC_setsockopt;
+static int (*bpf_getsockopt)(void *ctx, int level, int optname, void *optval,
+			     int optlen) =
+	(void *) BPF_FUNC_getsockopt;
 static int (*bpf_sk_redirect_map)(void *map, int key, int flags) =
 	(void *) BPF_FUNC_sk_redirect_map;
 static int (*bpf_sock_map_update)(void *map, void *key, void *value,
-- 
2.9.5

* [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv
  2017-10-20 18:05 [PATCH net-next 0/5] bpf: add support for BASE_RTT Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops Lawrence Brakmo
@ 2017-10-20 18:05 ` Lawrence Brakmo
  2017-10-21 19:37   ` Alexei Starovoitov
  2017-10-20 18:05 ` [PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program Lawrence Brakmo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 8+ messages in thread
From: Lawrence Brakmo @ 2017-10-20 18:05 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny,
	Lawrence Brakmo

TCP_NV will try to get the base RTT from a socket_ops BPF program if one
is loaded. NV will then use the base RTT to bound its min RTT (its
notion of the base RTT). It uses the base RTT as an upper bound and 80%
of the base RTT as its lower bound.
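
As a concrete worked example of the fixed-point math used in the patch
below: with a provided base RTT of 80us,

	nv_base_rtt        = 80 us
	nv_lower_bound_rtt = (80 * 205) >> 8 = 64 us   (~80% of nv_base_rtt)

so filtered RTTs are clamped to the range [64us, 80us] before being used
to update the min RTT.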

In other words, NV will consider filtered RTTs larger than base RTT as a
sign of congestion. As a result, there is no minRTT inflation when there
is a lot of congestion. For example, in a DC where the RTTs are less
than 40us when there is no congestion, a base RTT value of 80us improves
the performance of NV. The difference between the uncongested RTT and
the base RTT provided represents how much queueing we are willing to
have (in practice it can be higher).

NV has been tuned to reduce congestion when there are many flows, at the
cost of one flow not achieving full bandwidth utilization. When a
reasonable base RTT is provided, a single NV flow can now fully utilize
the available bandwidth. In addition, performance is also improved when
there are many flows.

In the following examples the NV results are using a kernel with this
patch set (i.e. both NV results are using the new nv_loss_dec_factor).

With one host sending to another host and only one flow, the
goodputs are:
  Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps

With 2 hosts sending to one host (1 flow per host), the goodput per flow
is:
  Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us): 4.6 Gbps

But the RTTs seen by a ping process on the senders are:
  Cubic: 3.3ms,  NV: 97us,  NV (baseRTT=80us): 146us

With a lot of flows, things look even better for NV with baseRTT. Here we
have 3 hosts sending to one host. Each sending host has 6 flows: 1
stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully
utilize the available bandwidth. However, the distribution of
bandwidth among the flows is very different. For the 10KB RPC flow:
  Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps

The 99th percentile latencies for the 10KB flows are:
  Cubic: 26ms,  NV: 1ms,  NV (baseRTT=80us): 500us

The RTTs seen by a ping process at the senders are:
  Cubic: 3.2ms  NV: 720us,  NV (baseRTT=80us): 330us

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked_by: Alexei Starovoitov <ast@fb.com>
---
 net/ipv4/tcp_nv.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 1ff7398..a978e3f 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -39,7 +39,7 @@
  * nv_cong_dec_mult	Decrease cwnd by X% (30%) of congestion when detected
  * nv_ssthresh_factor	On congestion set ssthresh to this * <desired cwnd> / 8
  * nv_rtt_factor	RTT averaging factor
- * nv_loss_dec_factor	Decrease cwnd by this (50%) when losses occur
+ * nv_loss_dec_factor	Decrease cwnd to this (80%) when losses occur
  * nv_dec_eval_min_calls	Wait this many RTT measurements before dec cwnd
  * nv_inc_eval_min_calls	Wait this many RTT measurements before inc cwnd
  * nv_ssthresh_eval_min_calls	Wait this many RTT measurements before stopping
@@ -61,7 +61,7 @@ static int nv_min_cwnd __read_mostly = 2;
 static int nv_cong_dec_mult __read_mostly = 30 * 128 / 100; /* = 30% */
 static int nv_ssthresh_factor __read_mostly = 8; /* = 1 */
 static int nv_rtt_factor __read_mostly = 128; /* = 1/2*old + 1/2*new */
-static int nv_loss_dec_factor __read_mostly = 512; /* => 50% */
+static int nv_loss_dec_factor __read_mostly = 819; /* => 80% */
 static int nv_cwnd_growth_rate_neg __read_mostly = 8;
 static int nv_cwnd_growth_rate_pos __read_mostly; /* 0 => fixed like Reno */
 static int nv_dec_eval_min_calls __read_mostly = 60;
@@ -101,6 +101,11 @@ struct tcpnv {
 	u32 nv_last_rtt;	/* last rtt */
 	u32 nv_min_rtt;		/* active min rtt. Used to determine slope */
 	u32 nv_min_rtt_new;	/* min rtt for future use */
+	u32 nv_base_rtt;        /* If non-zero it represents the threshold for
+				 * congestion */
+	u32 nv_lower_bound_rtt; /* Used in conjunction with nv_base_rtt. It is
+				 * set to 80% of nv_base_rtt. It helps reduce
+				 * unfairness between flows */
 	u32 nv_rtt_max_rate;	/* max rate seen during current RTT */
 	u32 nv_rtt_start_seq;	/* current RTT ends when packet arrives
 				 * acking beyond nv_rtt_start_seq */
@@ -132,9 +137,24 @@ static inline void tcpnv_reset(struct tcpnv *ca, struct sock *sk)
 static void tcpnv_init(struct sock *sk)
 {
 	struct tcpnv *ca = inet_csk_ca(sk);
+	int base_rtt;
 
 	tcpnv_reset(ca, sk);
 
+	/* See if base_rtt is available from socket_ops bpf program.
+	 * It is meant to be used in environments, such as communication
+	 * within a datacenter, where we have reasonable estimates of
+	 * RTTs
+	 */
+	base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+	if (base_rtt > 0) {
+		ca->nv_base_rtt = base_rtt;
+		ca->nv_lower_bound_rtt = (base_rtt * 205) >> 8; /* 80% */
+	} else {
+		ca->nv_base_rtt = 0;
+		ca->nv_lower_bound_rtt = 0;
+	}
+
 	ca->nv_allow_cwnd_growth = 1;
 	ca->nv_min_rtt_reset_jiffies = jiffies + 2 * HZ;
 	ca->nv_min_rtt = NV_INIT_RTT;
@@ -144,6 +164,19 @@ static void tcpnv_init(struct sock *sk)
 	ca->cwnd_growth_factor = 0;
 }
 
+/* If provided, apply upper (base_rtt) and lower (lower_bound_rtt)
+ * bounds to RTT.
+ */
+inline u32 nv_get_bounded_rtt(struct tcpnv *ca, u32 val)
+{
+	if (ca->nv_lower_bound_rtt > 0 && val < ca->nv_lower_bound_rtt)
+		return ca->nv_lower_bound_rtt;
+	else if (ca->nv_base_rtt > 0 && val > ca->nv_base_rtt)
+		return ca->nv_base_rtt;
+	else
+		return val;
+}
+
 static void tcpnv_cong_avoid(struct sock *sk, u32 ack, u32 acked)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -265,6 +298,9 @@ static void tcpnv_acked(struct sock *sk, const struct ack_sample *sample)
 	if (ca->nv_eval_call_cnt < 255)
 		ca->nv_eval_call_cnt++;
 
+	/* Apply bounds to rtt. Only used to update min_rtt */
+	avg_rtt = nv_get_bounded_rtt(ca, avg_rtt);
+
 	/* update min rtt if necessary */
 	if (avg_rtt < ca->nv_min_rtt)
 		ca->nv_min_rtt = avg_rtt;
-- 
2.9.5

* [PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program
  2017-10-20 18:05 [PATCH net-next 0/5] bpf: add support for BASE_RTT Lawrence Brakmo
                   ` (2 preceding siblings ...)
  2017-10-20 18:05 ` [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv Lawrence Brakmo
@ 2017-10-20 18:05 ` Lawrence Brakmo
  2017-10-20 18:05 ` [PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme Lawrence Brakmo
  2017-10-22  2:12 ` [PATCH net-next 0/5] bpf: add support for BASE_RTT David Miller
  5 siblings, 0 replies; 8+ messages in thread
From: Lawrence Brakmo @ 2017-10-20 18:05 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny,
	Lawrence Brakmo

Sample socket_ops BPF program to test the BPF helper function
bpf_getsockopt and the new socket_ops op BPF_SOCK_OPS_BASE_RTT.

The program provides a base RTT of 80us when the calling flow is
within a DC (as determined by the IPv6 prefix) and the congestion
control algorithm is "nv".
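
To make the "same datacenter" check concrete, here is a worked example
with hypothetical addresses (the program compares the first 32 bits plus
the top 12 bits of the next 32-bit word, i.e. a 44-bit / 5.5-byte
prefix):

	local  = fd00:1234:5670::1,  remote = fd00:1234:5678::2
	local_ip6[0] == remote_ip6[0]                    (both fd00:1234)
	bpf_ntohl(local_ip6[1])  & 0xfff00000 == 0x56700000
	bpf_ntohl(remote_ip6[1]) & 0xfff00000 == 0x56700000
	=> the first 44 bits match; if the congestion control is "nv",
	   the program replies with a base RTT of 80us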

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked_by: Alexei Starovoitov <ast@fb.com>
---
 samples/bpf/Makefile           |  1 +
 samples/bpf/tcp_basertt_kern.c | 78 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+)
 create mode 100644 samples/bpf/tcp_basertt_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 3534ccf..ea2b9e6 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -129,6 +129,7 @@ always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
+always += tcp_basertt_kern.o
 always += xdp_redirect_kern.o
 always += xdp_redirect_map_kern.o
 always += xdp_redirect_cpu_kern.o
diff --git a/samples/bpf/tcp_basertt_kern.c b/samples/bpf/tcp_basertt_kern.c
new file mode 100644
index 0000000..4bf4fc5
--- /dev/null
+++ b/samples/bpf/tcp_basertt_kern.c
@@ -0,0 +1,78 @@
+/* Copyright (c) 2017 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * BPF program to set base_rtt to 80us when host is running TCP-NV and
+ * both hosts are in the same datacenter (as determined by IPv6 prefix).
+ *
+ * Use load_sock_ops to load this BPF program.
+ */
+
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <linux/socket.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define DEBUG 1
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	       char ____fmt[] = fmt;				\
+	       bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				##__VA_ARGS__);			\
+})
+
+SEC("sockops")
+int bpf_basertt(struct bpf_sock_ops *skops)
+{
+	char cong[20];
+	char nv[] = "nv";
+	int rv = 0, n;
+	int op;
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_printk("BPF command: %d\n", op);
+#endif
+
+	/* Check if both hosts are in the same datacenter. For this
+	 * example they are if the 1st 5.5 bytes in the IPv6 address
+	 * are the same.
+	 */
+	if (skops->family == AF_INET6 &&
+	    skops->local_ip6[0] == skops->remote_ip6[0] &&
+	    (bpf_ntohl(skops->local_ip6[1]) & 0xfff00000) ==
+	    (bpf_ntohl(skops->remote_ip6[1]) & 0xfff00000)) {
+		switch (op) {
+		case BPF_SOCK_OPS_BASE_RTT:
+			n = bpf_getsockopt(skops, SOL_TCP, TCP_CONGESTION,
+					   cong, sizeof(cong));
+			if (!n && !__builtin_memcmp(cong, nv, sizeof(nv))) {
+				/* Set base_rtt to 80us */
+				rv = 80;
+			} else if (n) {
+				rv = n;
+			} else {
+				rv = -1;
+			}
+			break;
+		default:
+			rv = -1;
+		}
+	} else {
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_printk("Returning %d\n", rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.9.5

* [PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme
  2017-10-20 18:05 [PATCH net-next 0/5] bpf: add support for BASE_RTT Lawrence Brakmo
                   ` (3 preceding siblings ...)
  2017-10-20 18:05 ` [PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program Lawrence Brakmo
@ 2017-10-20 18:05 ` Lawrence Brakmo
  2017-10-22  2:12 ` [PATCH net-next 0/5] bpf: add support for BASE_RTT David Miller
  5 siblings, 0 replies; 8+ messages in thread
From: Lawrence Brakmo @ 2017-10-20 18:05 UTC (permalink / raw)
  To: netdev
  Cc: Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny,
	Lawrence Brakmo

Readme file explaining how to create a cgroupv2 and attach one
of the tcp_*_kern.o socket_ops BPF programs.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked_by: Alexei Starovoitov <ast@fb.com>
---
 samples/bpf/tcp_bbf.readme | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 samples/bpf/tcp_bbf.readme

diff --git a/samples/bpf/tcp_bbf.readme b/samples/bpf/tcp_bbf.readme
new file mode 100644
index 0000000..3692056
--- /dev/null
+++ b/samples/bpf/tcp_bbf.readme
@@ -0,0 +1,27 @@
+This file describes how to run the tcp_*_kern.o tcp_bpf (or socket_ops)
+programs. These programs attach to a cgroupv2. The following commands create
+a cgroupv2 and attach a bash shell to the group.
+
+  mkdir -p /tmp/cgroupv2
+  mount -t cgroup2 none /tmp/cgroupv2
+  mkdir -p /tmp/cgroupv2/foo
+  bash
+  echo $$ >> /tmp/cgroupv2/foo/cgroup.procs
+
+Anything that runs under this shell belongs to the foo cgroupv2. To load
+(attach) one of the tcp_*_kern.o programs:
+
+  ./load_sock_ops -l /tmp/cgroupv2/foo tcp_basertt_kern.o
+
+If the "-l" flag is used, the load_sock_ops program will continue to run
+printing the BPF log buffer. The tcp_*_kern.o programs use special print
+functions to print logging information (if enabled by the ifdef).
+
+If using netperf/netserver to create traffic, you need to run them under the
+cgroupv2 to which the BPF programs are attached (i.e. under bash shell
+attached to the cgroupv2).
+
+To remove (detach) a socket_ops BPF program from a cgroupv2:
+
+  ./load_sock_ops -r /tmp/cgroupv2/foo
+
-- 
2.9.5

* Re: [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv
  2017-10-20 18:05 ` [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv Lawrence Brakmo
@ 2017-10-21 19:37   ` Alexei Starovoitov
  0 siblings, 0 replies; 8+ messages in thread
From: Alexei Starovoitov @ 2017-10-21 19:37 UTC (permalink / raw)
  To: Lawrence Brakmo
  Cc: netdev, Kernel Team, Alexei Starovoitov, Daniel Borkmann, Blake Matheny

On Fri, Oct 20, 2017 at 11:05:41AM -0700, Lawrence Brakmo wrote:
> TCP_NV will try to get the base RTT from a socket_ops BPF program if one
> is loaded. NV will then use the base RTT to bound its min RTT (its
> notion of the base RTT). It uses the base RTT as an upper bound and 80%
> of the base RTT as its lower bound.
...
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> Acked_by: Alexei Starovoitov <ast@fb.com>
       ^
just noticed the typo above :)
fixing it with:
Acked-by: Alexei Starovoitov <ast@kernel.org>

so patchwork can pick it up automatically.

* Re: [PATCH net-next 0/5] bpf: add support for BASE_RTT
  2017-10-20 18:05 [PATCH net-next 0/5] bpf: add support for BASE_RTT Lawrence Brakmo
                   ` (4 preceding siblings ...)
  2017-10-20 18:05 ` [PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme Lawrence Brakmo
@ 2017-10-22  2:12 ` David Miller
  5 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2017-10-22  2:12 UTC (permalink / raw)
  To: brakmo; +Cc: netdev, kernel-team, ast, daniel, bmatheny

From: Lawrence Brakmo <brakmo@fb.com>
Date: Fri, 20 Oct 2017 11:05:38 -0700

> This patch set adds the following functionality to socket_ops BPF
> programs:
> 1) Add the bpf helper function bpf_getsockopt. Currently it only
>    supports TCP_CONGESTION.
> 2) Add the BPF_SOCK_OPS_BASE_RTT op to get the base RTT of the
>    connection. In general, the base RTT indicates the threshold such
>    that RTTs above it indicate congestion. More details are in the
>    relevant patches.

Series applied, thank you.
