* [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
@ 2019-02-23  1:06 brakmo
  2019-02-23  1:06 ` [PATCH v2 bpf-next 1/9] bpf: Remove const from get_func_proto brakmo
                   ` (9 more replies)
  0 siblings, 10 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:06 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Network Resource Manager (NRM) is a framework for limiting the bandwidth
used by v2 cgroups. It consists of 4 BPF helpers and a sample BPF program
to limit egress bandwidth, as well as a sample user program and script to
simplify NRM testing.
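For readers unfamiliar with this style of limiter, the core idea the
sample egress program implements can be sketched as a credit (token)
bucket per cgroup. The struct layout, constants, and function names below
are illustrative only, not the sample program's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* One bucket of state per cgroup; names are ours, not nrm_out_kern.c's. */
struct bucket {
	int64_t credit_bytes;	/* may go negative when we overshoot */
	uint64_t last_ns;	/* last refill timestamp */
};

#define BURST_BYTES (100 * 1500)	/* cap on accumulated credit */

/* Refill credit at rate_bps, charge one packet, and report whether the
 * packet stays within the limit (1 = allow, 0 = drop/mark). */
static int bucket_admit(struct bucket *b, uint64_t now_ns,
			uint64_t rate_bps, uint32_t pkt_len)
{
	uint64_t elapsed = now_ns - b->last_ns;

	b->credit_bytes += (int64_t)(elapsed * rate_bps / 8 / 1000000000ULL);
	if (b->credit_bytes > BURST_BYTES)
		b->credit_bytes = BURST_BYTES;
	b->last_ns = now_ns;

	b->credit_bytes -= pkt_len;
	return b->credit_bytes >= 0;
}
```

In the BPF version the bucket would live in a map keyed by cgroup, and
"drop/mark" maps onto the ECN/CWR helpers introduced later in the series.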

The sample NRM BPF program is not meant to be production quality; it is
provided as a proof of concept. A lot more information, including sample
runs in some cases, is provided in the commit messages of the individual
patches.

Two more BPF programs, one to limit ingress and one that limits egress
and uses fq's Earliest Departure Time (EDT) feature, will be provided in
an upcoming patchset.
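The EDT variant mentioned above is not part of this series, but the
mechanism it relies on can be sketched: instead of dropping, the program
stamps each skb with an earliest departure time and lets fq pace it. All
names below are ours, purely to illustrate the idea:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Per-flow pacing state; field names are ours, not from the (not yet
 * posted) EDT patchset. */
struct edt_state {
	uint64_t next_ts_ns;	/* earliest departure time of next packet */
};

/* Return the timestamp to place in skb->tstamp so that fq releases
 * packets no faster than rate_bytes_per_sec. */
static uint64_t edt_stamp(struct edt_state *s, uint64_t now_ns,
			  uint32_t wire_len, uint64_t rate_bytes_per_sec)
{
	uint64_t ts = s->next_ts_ns > now_ns ? s->next_ts_ns : now_ns;

	/* time this packet occupies the link at the target rate */
	s->next_ts_ns = ts + wire_len * NSEC_PER_SEC / rate_bytes_per_sec;
	return ts;
}
```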

Changes from v1 to v2:
  * bpf_tcp_enter_cwr can only be called from a cgroup skb egress BPF
    program (otherwise load or attach will fail) where we already hold
    the sk lock. Also only applies for ESTABLISHED state.
  * bpf_skb_ecn_set_ce uses INET_ECN_set_ce()
  * bpf_tcp_check_probe_timer now uses tcp_reset_xmit_timer. Can only be
    used by egress cgroup skb programs.
  * removed load_cg_skb user program. 
  * nrm bpf egress program checks packet header in skb to determine
    ECN value. Now also works for ECN enabled UDP packets.
    Using ECN_ defines instead of integers.
  * NRM script test program now uses bpftool instead of load_cg_skb

Martin KaFai Lau (2):
  bpf: Remove const from get_func_proto
  bpf: Add bpf helper bpf_tcp_enter_cwr

brakmo (7):
  bpf: Test bpf_tcp_enter_cwr in test_verifier
  bpf: add bpf helper bpf_skb_ecn_set_ce
  bpf: Add bpf helper bpf_tcp_check_probe_timer
  bpf: sync bpf.h to tools and update bpf_helpers.h
  bpf: Sample NRM BPF program to limit egress bw
  bpf: User program for testing NRM
  bpf: NRM test script

 drivers/media/rc/bpf-lirc.c                 |   2 +-
 include/linux/bpf.h                         |   3 +-
 include/linux/filter.h                      |   3 +-
 include/uapi/linux/bpf.h                    |  27 +-
 kernel/bpf/cgroup.c                         |   2 +-
 kernel/bpf/syscall.c                        |  12 +
 kernel/bpf/verifier.c                       |   4 +
 kernel/trace/bpf_trace.c                    |  10 +-
 net/core/filter.c                           | 101 ++++-
 samples/bpf/Makefile                        |   5 +
 samples/bpf/do_nrm_test.sh                  | 437 +++++++++++++++++++
 samples/bpf/nrm.c                           | 440 ++++++++++++++++++++
 samples/bpf/nrm.h                           |  31 ++
 samples/bpf/nrm_kern.h                      | 137 ++++++
 samples/bpf/nrm_out_kern.c                  | 190 +++++++++
 tools/include/uapi/linux/bpf.h              |  27 +-
 tools/testing/selftests/bpf/bpf_helpers.h   |   6 +
 tools/testing/selftests/bpf/verifier/sock.c |  33 ++
 18 files changed, 1444 insertions(+), 26 deletions(-)
 create mode 100755 samples/bpf/do_nrm_test.sh
 create mode 100644 samples/bpf/nrm.c
 create mode 100644 samples/bpf/nrm.h
 create mode 100644 samples/bpf/nrm_kern.h
 create mode 100644 samples/bpf/nrm_out_kern.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 bpf-next 1/9] bpf: Remove const from get_func_proto
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
@ 2019-02-23  1:06 ` brakmo
  2019-02-23  1:06 ` [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr brakmo
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:06 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

From: Martin KaFai Lau <kafai@fb.com>

The next patch needs to set a bit in "prog" in
cg_skb_func_proto().  Hence, the "const struct bpf_prog *"
as a second argument will not work.

This patch removes the "const" from get_func_proto and
makes the needed changes to all get_func_proto implementations
to avoid compiler error.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 drivers/media/rc/bpf-lirc.c |  2 +-
 include/linux/bpf.h         |  2 +-
 kernel/bpf/cgroup.c         |  2 +-
 kernel/trace/bpf_trace.c    | 10 +++++-----
 net/core/filter.c           | 30 +++++++++++++++---------------
 5 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/media/rc/bpf-lirc.c b/drivers/media/rc/bpf-lirc.c
index 390a722e6211..6adb7f734cb9 100644
--- a/drivers/media/rc/bpf-lirc.c
+++ b/drivers/media/rc/bpf-lirc.c
@@ -82,7 +82,7 @@ static const struct bpf_func_proto rc_pointer_rel_proto = {
 };
 
 static const struct bpf_func_proto *
-lirc_mode2_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lirc_mode2_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_rc_repeat:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index de18227b3d95..d5ba2fc01af3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -287,7 +287,7 @@ struct bpf_verifier_ops {
 	/* return eBPF function prototype for verification */
 	const struct bpf_func_proto *
 	(*get_func_proto)(enum bpf_func_id func_id,
-			  const struct bpf_prog *prog);
+			  struct bpf_prog *prog);
 
 	/* return true if 'size' wide access at offset 'off' within bpf_context
 	 * with 'type' (read or write) is allowed
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 4e807973aa80..0de0f5d98b46 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -701,7 +701,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 EXPORT_SYMBOL(__cgroup_bpf_check_dev_permission);
 
 static const struct bpf_func_proto *
-cgroup_dev_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+cgroup_dev_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f1a86a0d881d..0d2f60828d7d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -561,7 +561,7 @@ static const struct bpf_func_proto bpf_probe_read_str_proto = {
 };
 
 static const struct bpf_func_proto *
-tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+tracing_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_map_lookup_elem:
@@ -610,7 +610,7 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-kprobe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+kprobe_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -726,7 +726,7 @@ static const struct bpf_func_proto bpf_get_stack_proto_tp = {
 };
 
 static const struct bpf_func_proto *
-tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+tp_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -790,7 +790,7 @@ static const struct bpf_func_proto bpf_perf_prog_read_value_proto = {
 };
 
 static const struct bpf_func_proto *
-pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+pe_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -873,7 +873,7 @@ static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
 };
 
 static const struct bpf_func_proto *
-raw_tp_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+raw_tp_prog_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
diff --git a/net/core/filter.c b/net/core/filter.c
index 85749f6ec789..97916eedfe69 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5508,7 +5508,7 @@ bpf_base_func_proto(enum bpf_func_id func_id)
 }
 
 static const struct bpf_func_proto *
-sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sock_filter_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	/* inet and inet6 sockets are created in a process
@@ -5524,7 +5524,7 @@ sock_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sock_addr_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	/* inet and inet6 sockets are created in a process
@@ -5558,7 +5558,7 @@ sock_addr_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sk_filter_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
@@ -5575,7 +5575,7 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_get_local_storage:
@@ -5592,7 +5592,7 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+tc_cls_act_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_store_bytes:
@@ -5685,7 +5685,7 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+xdp_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_perf_event_output:
@@ -5723,7 +5723,7 @@ const struct bpf_func_proto bpf_sock_map_update_proto __weak;
 const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
 static const struct bpf_func_proto *
-sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sock_ops_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_setsockopt:
@@ -5751,7 +5751,7 @@ const struct bpf_func_proto bpf_msg_redirect_map_proto __weak;
 const struct bpf_func_proto bpf_msg_redirect_hash_proto __weak;
 
 static const struct bpf_func_proto *
-sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sk_msg_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_msg_redirect_map:
@@ -5777,7 +5777,7 @@ const struct bpf_func_proto bpf_sk_redirect_map_proto __weak;
 const struct bpf_func_proto bpf_sk_redirect_hash_proto __weak;
 
 static const struct bpf_func_proto *
-sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+sk_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_store_bytes:
@@ -5812,7 +5812,7 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+flow_dissector_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
@@ -5823,7 +5823,7 @@ flow_dissector_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_out_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_load_bytes:
@@ -5850,7 +5850,7 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_in_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_lwt_push_encap:
@@ -5861,7 +5861,7 @@ lwt_in_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_xmit_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_skb_get_tunnel_key:
@@ -5898,7 +5898,7 @@ lwt_xmit_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 }
 
 static const struct bpf_func_proto *
-lwt_seg6local_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+lwt_seg6local_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 {
 	switch (func_id) {
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
@@ -8124,7 +8124,7 @@ static const struct bpf_func_proto sk_reuseport_load_bytes_relative_proto = {
 
 static const struct bpf_func_proto *
 sk_reuseport_func_proto(enum bpf_func_id func_id,
-			const struct bpf_prog *prog)
+			struct bpf_prog *prog)
 {
 	switch (func_id) {
 	case BPF_FUNC_sk_select_reuseport:
-- 
2.17.1



* [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
  2019-02-23  1:06 ` [PATCH v2 bpf-next 1/9] bpf: Remove const from get_func_proto brakmo
@ 2019-02-23  1:06 ` brakmo
  2019-02-24  1:32   ` Eric Dumazet
  2019-02-25 23:14   ` Stanislav Fomichev
  2019-02-23  1:06 ` [PATCH v2 bpf-next 3/9] bpf: Test bpf_tcp_enter_cwr in test_verifier brakmo
                   ` (7 subsequent siblings)
  9 siblings, 2 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:06 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

From: Martin KaFai Lau <kafai@fb.com>

This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
"int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
to the egress path where the bpf prog is called by
ip_finish_output() or ip6_finish_output().  The verifier
ensures that the parameter must be a tcp_sock.

This helper makes a tcp_sock enter CWR state.  It can be used
by a bpf_prog to manage egress network bandwidth limit per
cgroupv2.  A later patch will have a sample program to
show how it can be used to limit bandwidth usage per cgroupv2.

To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
during load time if the prog uses this new helper.
The newly added prog->enforce_expected_attach_type bit will also be set
if this new helper is used.  This bit exists for backward compatibility:
prog->expected_attach_type has so far been ignored for
BPF_PROG_TYPE_CGROUP_SKB.  At attach time, prog->expected_attach_type is
only enforced if the prog->enforce_expected_attach_type bit is set,
i.e. only if this new helper is used by the prog.
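The backward-compatibility rule above can be distilled into a tiny model
of the attach-time check this patch adds (compare the
bpf_prog_attach_check_attach_type hunk below); this is a simplification
for illustration, not the kernel code itself:

```c
#include <assert.h>
#include <errno.h>

enum { INET_INGRESS, INET_EGRESS };

/* Model of the BPF_PROG_TYPE_CGROUP_SKB attach check: the attach type
 * only has to match expected_attach_type when the enforce bit is set,
 * i.e. when the prog used one of the new egress-only helpers. */
static int cgroup_skb_attach_ok(int enforce_expected_attach_type,
				int expected_attach_type, int attach_type)
{
	return enforce_expected_attach_type &&
		expected_attach_type != attach_type ? -EINVAL : 0;
}
```

A legacy prog (enforce bit clear) still attaches anywhere, regardless of
whatever stale expected_attach_type its loader passed.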

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/bpf.h      |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h |  9 ++++++++-
 kernel/bpf/syscall.c     | 12 ++++++++++++
 kernel/bpf/verifier.c    |  4 ++++
 net/core/filter.c        | 25 +++++++++++++++++++++++++
 6 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d5ba2fc01af3..2d54ba7cf9dd 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -195,6 +195,7 @@ enum bpf_arg_type {
 	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
 	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
 	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
+	ARG_PTR_TO_TCP_SOCK,    /* pointer to tcp_sock */
 };
 
 /* type of values returned from helper functions */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index f32b3eca5a04..c6e878bdc5a6 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -510,7 +510,8 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				enforce_expected_attach_type:1; /* Enforce expected_attach_type checking at attach time */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bcdd2474eee7..95b5058fa945 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2359,6 +2359,12 @@ union bpf_attr {
  *	Return
  *		A **struct bpf_tcp_sock** pointer on success, or NULL in
  *		case of failure.
+ *
+ * int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)
+ *	Description
+ *		Make a tcp_sock enter CWR state.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2457,7 +2463,8 @@ union bpf_attr {
 	FN(spin_lock),			\
 	FN(spin_unlock),		\
 	FN(sk_fullsock),		\
-	FN(tcp_sock),
+	FN(tcp_sock),			\
+	FN(tcp_enter_cwr),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ec7c552af76b..9a478f2875cd 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1482,6 +1482,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		switch (expected_attach_type) {
+		case BPF_CGROUP_INET_INGRESS:
+		case BPF_CGROUP_INET_EGRESS:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	default:
 		return 0;
 	}
@@ -1725,6 +1733,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		return prog->enforce_expected_attach_type &&
+			prog->expected_attach_type != attach_type ?
+			-EINVAL : 0;
 	default:
 		return 0;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1b9496c41383..95fb385c6f3c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2424,6 +2424,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 			return -EFAULT;
 		}
 		meta->ptr_id = reg->id;
+	} else if (arg_type == ARG_PTR_TO_TCP_SOCK) {
+		expected_type = PTR_TO_TCP_SOCK;
+		if (type != expected_type)
+			goto err_type;
 	} else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
 		if (meta->func_id == BPF_FUNC_spin_lock) {
 			if (process_spin_lock(env, regno, true))
diff --git a/net/core/filter.c b/net/core/filter.c
index 97916eedfe69..ca57ef25279c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5426,6 +5426,24 @@ static const struct bpf_func_proto bpf_tcp_sock_proto = {
 	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
 };
 
+BPF_CALL_1(bpf_tcp_enter_cwr, struct tcp_sock *, tp)
+{
+	struct sock *sk = (struct sock *)tp;
+
+	if (sk->sk_state == TCP_ESTABLISHED) {
+		tcp_enter_cwr(sk);
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
+	.func        = bpf_tcp_enter_cwr,
+	.gpl_only    = false,
+	.ret_type    = RET_INTEGER,
+	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
+};
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5585,6 +5603,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 #ifdef CONFIG_INET
 	case BPF_FUNC_tcp_sock:
 		return &bpf_tcp_sock_proto;
+	case BPF_FUNC_tcp_enter_cwr:
+		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+			prog->enforce_expected_attach_type = 1;
+			return &bpf_tcp_enter_cwr_proto;
+		} else {
+			return NULL;
+		}
 #endif
 	default:
 		return sk_filter_func_proto(func_id, prog);
-- 
2.17.1



* [PATCH v2 bpf-next 3/9] bpf: Test bpf_tcp_enter_cwr in test_verifier
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
  2019-02-23  1:06 ` [PATCH v2 bpf-next 1/9] bpf: Remove const from get_func_proto brakmo
  2019-02-23  1:06 ` [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr brakmo
@ 2019-02-23  1:06 ` brakmo
  2019-02-23  1:06 ` [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce brakmo
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:06 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

This test ensures the verifier checks that arg1 of
BPF_FUNC_tcp_enter_cwr is of type ARG_PTR_TO_TCP_SOCK.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 tools/testing/selftests/bpf/verifier/sock.c | 33 +++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/tools/testing/selftests/bpf/verifier/sock.c b/tools/testing/selftests/bpf/verifier/sock.c
index 0ddfdf76aba5..b07a083eeb59 100644
--- a/tools/testing/selftests/bpf/verifier/sock.c
+++ b/tools/testing/selftests/bpf/verifier/sock.c
@@ -382,3 +382,36 @@
 	.result = REJECT,
 	.errstr = "type=tcp_sock expected=sock",
 },
+{
+	"bpf_tcp_enter_cwr(skb->sk)",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_enter_cwr),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = REJECT,
+	.errstr = "type=sock_common expected=tcp_sock",
+},
+{
+	"bpf_tcp_enter_cwr(bpf_tcp_sock(skb->sk))",
+	.insns = {
+	BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, offsetof(struct __sk_buff, sk)),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_sock),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+	BPF_EMIT_CALL(BPF_FUNC_tcp_enter_cwr),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+	.result = ACCEPT,
+},
-- 
2.17.1



* [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (2 preceding siblings ...)
  2019-02-23  1:06 ` [PATCH v2 bpf-next 3/9] bpf: Test bpf_tcp_enter_cwr in test_verifier brakmo
@ 2019-02-23  1:06 ` brakmo
  2019-02-23  1:14   ` Daniel Borkmann
  2019-02-23  1:06 ` [PATCH v2 bpf-next 5/9] bpf: Add bpf helper bpf_tcp_check_probe_timer brakmo
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 29+ messages in thread
From: brakmo @ 2019-02-23  1:06 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
"int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog, which currently can
be attached to the ingress and egress path. The helper is needed
because this type of bpf_prog cannot modify the skb directly.

This helper is used to set the ECN field of ECN-capable IP packets to ce
(congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
used by a bpf_prog to manage egress or ingress network bandwidth limits
per cgroupv2 by inducing an ECN response in the TCP sender.
This works best when using DCTCP.
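The helper's ECN semantics (delegated to INET_ECN_set_ce in the hunk
below) boil down to a two-bit test on the TOS/traffic-class byte. Here is
a simplified model that ignores the IPv4 checksum patching the real
kernel code performs:

```c
#include <assert.h>
#include <stdint.h>

/* Low two bits of the IPv4 TOS / IPv6 traffic-class byte (RFC 3168). */
#define ECN_MASK	0x03
#define ECN_NOT_ECT	0x00
#define ECN_CE		0x03

/* Simplified model of INET_ECN_set_ce on a TOS byte: mark CE only if
 * the packet is ECN capable (ECT(0), ECT(1), or already CE). */
static int ecn_set_ce(uint8_t *tos)
{
	if ((*tos & ECN_MASK) == ECN_NOT_ECT)
		return 0;	/* not ECN capable: leave untouched */
	*tos |= ECN_CE;
	return 1;
}
```

This matches the "1 if set, 0 if not set" return contract documented in
the uapi comment.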

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 10 +++++++++-
 net/core/filter.c        | 14 ++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 95b5058fa945..fc646f3eaf9b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2365,6 +2365,13 @@ union bpf_attr {
  *		Make a tcp_sock enter CWR state.
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
+ *	Description
+ *		Sets ECN of IP header to ce (congestion encountered) if
+ *		current value is ect (ECN capable). Works with IPv6 and IPv4.
+ *	Return
+ *		1 if set, 0 if not set.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2464,7 +2471,8 @@ union bpf_attr {
 	FN(spin_unlock),		\
 	FN(sk_fullsock),		\
 	FN(tcp_sock),			\
-	FN(tcp_enter_cwr),
+	FN(tcp_enter_cwr),		\
+	FN(skb_ecn_set_ce),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index ca57ef25279c..955369c6ed30 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5444,6 +5444,18 @@ static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
 	.ret_type    = RET_INTEGER,
 	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
 };
+
+BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb)
+{
+	return INET_ECN_set_ce(skb);
+}
+
+static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
+	.func		= bpf_skb_ecn_set_ce,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5610,6 +5622,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 		} else {
 			return NULL;
 		}
+	case BPF_FUNC_skb_ecn_set_ce:
+		return &bpf_skb_ecn_set_ce_proto;
 #endif
 	default:
 		return sk_filter_func_proto(func_id, prog);
-- 
2.17.1



* [PATCH v2 bpf-next 5/9] bpf: Add bpf helper bpf_tcp_check_probe_timer
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (3 preceding siblings ...)
  2019-02-23  1:06 ` [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce brakmo
@ 2019-02-23  1:06 ` brakmo
  2019-02-23  1:07 ` [PATCH v2 bpf-next 6/9] bpf: sync bpf.h to tools and update bpf_helpers.h brakmo
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:06 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

This patch adds a new bpf helper BPF_FUNC_tcp_check_probe_timer
"int bpf_tcp_check_probe_timer(struct bpf_tcp_sock *tp, u32 when_us)".
It is added to BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently
can be attached to the ingress and egress path.

To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
during load time if the prog uses this new helper.
The newly added prog->enforce_expected_attach_type bit will also be set
if this new helper is used.  This bit exists for backward compatibility:
prog->expected_attach_type has so far been ignored for
BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
prog->expected_attach_type is only enforced if the
prog->enforce_expected_attach_type bit is set,
i.e. only if this new helper is used by the prog.

The function forces when_us to be at least TCP_TIMEOUT_MIN (currently
2 jiffies) and no more than TCP_RTO_MIN (currently 200ms).

When using a bpf_prog to limit the egress bandwidth of a cgroup,
it can happen that we drop a packet of a connection that has no
packets out. In this case, the connection may not retry sending
the packet until the probe timer fires. Since the default value
of the probe timer is at least 200ms, this can introduce link
underutilization (i.e. the cgroup egress bandwidth being smaller
than the specified rate) and thus increased tail latency.
This helper function allows for setting a smaller probe timer.
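The clamping described above (and implemented in the net/core/filter.c
hunk below) can be modeled in plain C; the jiffies conversion here
assumes HZ=1000 purely for illustration, while the real constants come
from tcp.h:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed HZ=1000, so one jiffy is 1 ms; illustrative only. */
#define HZ		1000
#define TCP_TIMEOUT_MIN	2		/* jiffies */
#define TCP_RTO_MIN	(HZ / 5)	/* 200 ms in jiffies */

static uint64_t usecs_to_jiffies(uint64_t us)
{
	return (us + (1000000 / HZ) - 1) / (1000000 / HZ); /* round up */
}

/* Bound the caller-supplied timeout exactly as the helper does. */
static uint64_t probe_timer_jiffies(uint64_t when_us)
{
	uint64_t when = usecs_to_jiffies(when_us);

	if (when < TCP_TIMEOUT_MIN)
		when = TCP_TIMEOUT_MIN;
	else if (when > TCP_RTO_MIN)
		when = TCP_RTO_MIN;
	return when;
}
```

So a bpf_prog can shorten the probe timer well below the 200ms default,
but never below 2 jiffies and never above TCP_RTO_MIN.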

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/uapi/linux/bpf.h | 12 +++++++++++-
 net/core/filter.c        | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fc646f3eaf9b..5d0bed852800 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2372,6 +2372,15 @@ union bpf_attr {
  *		current value is ect (ECN capable). Works with IPv6 and IPv4.
  *	Return
  *		1 if set, 0 if not set.
+ *
+ * int bpf_tcp_check_probe_timer(struct bpf_tcp_sock *tp, int when_us)
+ *	Description
+ *		Checks that there are no packets out and there is no pending
+ *		timer. If both of these are true, it bounds when_us by
+ *		TCP_TIMEOUT_MIN (2 jiffies) or TCP_RTO_MIN (200ms) and
+ *		sets the probe timer.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2472,7 +2481,8 @@ union bpf_attr {
 	FN(sk_fullsock),		\
 	FN(tcp_sock),			\
 	FN(tcp_enter_cwr),		\
-	FN(skb_ecn_set_ce),
+	FN(skb_ecn_set_ce),		\
+	FN(tcp_check_probe_timer),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 955369c6ed30..7d7026768840 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5456,6 +5456,31 @@ static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_CTX,
 };
+
+BPF_CALL_2(bpf_tcp_check_probe_timer, struct tcp_sock *, tp, u32, when_us)
+{
+	struct sock *sk = (struct sock *) tp;
+	unsigned long when = usecs_to_jiffies(when_us);
+
+	if (!tp->packets_out && !inet_csk(sk)->icsk_pending) {
+		if (when < TCP_TIMEOUT_MIN)
+			when = TCP_TIMEOUT_MIN;
+		else if (when > TCP_RTO_MIN)
+			when = TCP_RTO_MIN;
+
+		tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
+				     when, TCP_RTO_MAX, NULL);
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_tcp_check_probe_timer_proto = {
+	.func		= bpf_tcp_check_probe_timer,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_TCP_SOCK,
+	.arg2_type	= ARG_ANYTHING,
+};
 #endif /* CONFIG_INET */
 
 bool bpf_helper_changes_pkt_data(void *func)
@@ -5624,6 +5649,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
 		}
 	case BPF_FUNC_skb_ecn_set_ce:
 		return &bpf_skb_ecn_set_ce_proto;
+	case BPF_FUNC_tcp_check_probe_timer:
+		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+			prog->enforce_expected_attach_type = 1;
+			return &bpf_tcp_check_probe_timer_proto;
+		} else {
+			return NULL;
+		}
 #endif
 	default:
 		return sk_filter_func_proto(func_id, prog);
-- 
2.17.1



* [PATCH v2 bpf-next 6/9] bpf: sync bpf.h to tools and update bpf_helpers.h
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (4 preceding siblings ...)
  2019-02-23  1:06 ` [PATCH v2 bpf-next 5/9] bpf: Add bpf helper bpf_tcp_check_probe_timer brakmo
@ 2019-02-23  1:07 ` brakmo
  2019-02-23  1:07 ` [PATCH v2 bpf-next 7/9] bpf: Sample NRM BPF program to limit egress bw brakmo
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:07 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

This patch syncs the uapi bpf.h to tools/ and also updates
bpf_helpers.h in tools/

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 tools/include/uapi/linux/bpf.h            | 27 ++++++++++++++++++++++-
 tools/testing/selftests/bpf/bpf_helpers.h |  6 +++++
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bcdd2474eee7..5d0bed852800 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2359,6 +2359,28 @@ union bpf_attr {
  *	Return
  *		A **struct bpf_tcp_sock** pointer on success, or NULL in
  *		case of failure.
+ *
+ * int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)
+ *	Description
+ *		Make a tcp_sock enter CWR state.
+ *	Return
+ *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
+ *	Description
+ *		Sets ECN of IP header to CE (congestion encountered) if
+ *		current value is ECT (ECN capable). Works with IPv6 and IPv4.
+ *	Return
+ *		1 if set, 0 if not set.
+ *
+ * int bpf_tcp_check_probe_timer(struct bpf_tcp_sock *tp, int when_us)
+ *	Description
+ *		Checks that there are no packets out and there is no pending
+ *		timer. If both of these are true, it clamps when_us to the
+ *		range [TCP_TIMEOUT_MIN (2 jiffies), TCP_RTO_MIN (200ms)] and
+ *		sets the probe timer.
+ *	Return
+ *		0
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -2457,7 +2479,10 @@ union bpf_attr {
 	FN(spin_lock),			\
 	FN(spin_unlock),		\
 	FN(sk_fullsock),		\
-	FN(tcp_sock),
+	FN(tcp_sock),			\
+	FN(tcp_enter_cwr),		\
+	FN(skb_ecn_set_ce),		\
+	FN(tcp_check_probe_timer),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d9999f1ed1d2..8aec59624ebc 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -180,6 +180,12 @@ static struct bpf_sock *(*bpf_sk_fullsock)(struct bpf_sock *sk) =
 	(void *) BPF_FUNC_sk_fullsock;
 static struct bpf_tcp_sock *(*bpf_tcp_sock)(struct bpf_sock *sk) =
 	(void *) BPF_FUNC_tcp_sock;
+static int (*bpf_tcp_enter_cwr)(struct bpf_tcp_sock *tp) =
+	(void *) BPF_FUNC_tcp_enter_cwr;
+static int (*bpf_skb_ecn_set_ce)(void *ctx) =
+	(void *) BPF_FUNC_skb_ecn_set_ce;
+static int (*bpf_tcp_check_probe_timer)(struct bpf_tcp_sock *tp, int when_us) =
+	(void *) BPF_FUNC_tcp_check_probe_timer;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.17.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 bpf-next 7/9] bpf: Sample NRM BPF program to limit egress bw
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (5 preceding siblings ...)
  2019-02-23  1:07 ` [PATCH v2 bpf-next 6/9] bpf: sync bpf.h to tools and update bpf_helpers.h brakmo
@ 2019-02-23  1:07 ` brakmo
  2019-02-23  1:07 ` [PATCH v2 bpf-next 8/9] bpf: User program for testing NRM brakmo
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:07 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

A cgroup skb BPF program to limit cgroup output bandwidth.
It uses a modified virtual token bucket queue to limit average
egress bandwidth. The implementation uses credits instead of tokens.
Negative credits imply that queueing would have happened (this is
a virtual queue, so no queueing is done by it; however, queueing may
occur at the actual qdisc, which is not used for rate limiting).

This implementation uses 3 thresholds, one to start marking packets and
the other two to drop packets:
                                 CREDIT
       - <--------------------------|------------------------> +
             |    |          |      0
             |  Large pkt    |
             |  drop thresh  |
  Small pkt drop             Mark threshold
      thresh

The effect of marking depends on the type of packet:
a) If the packet is ECN enabled and it is a TCP packet, then the packet
   is ECN marked. The current mark threshold is tuned for DCTCP.
b) If the packet is a TCP packet, then we probabilistically call tcp_cwr
   to reduce the congestion window. The current implementation uses a linear
   distribution (0% probability at marking threshold, 100% probability
   at drop threshold).
c) If the packet is not a TCP packet, then it is dropped.

If the credit is below the drop threshold, the packet is dropped. If it
is a TCP packet, then it also calls tcp_cwr, since packets dropped by
a cgroup skb BPF program do not automatically trigger a call to
tcp_cwr in the current kernel code.

This BPF program actually uses 2 drop thresholds, one threshold
for larger packets (>= 120 bytes) and another for smaller packets. This
protects smaller packets such as SYNs, ACKs, etc.

The default bandwidth limit is set at 1Gbps but this can be changed by
a user program through a shared BPF map. In addition, by default this BPF
program does not limit connections using loopback. This behavior can be
overridden by the user program. There is also an option to calculate
some statistics, such as percent of packets marked or dropped, which
the user program can access.

A later patch provides such a program (nrm.c)

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile       |   2 +
 samples/bpf/nrm.h          |  31 ++++++
 samples/bpf/nrm_kern.h     | 137 ++++++++++++++++++++++++++
 samples/bpf/nrm_out_kern.c | 190 +++++++++++++++++++++++++++++++++++++
 4 files changed, 360 insertions(+)
 create mode 100644 samples/bpf/nrm.h
 create mode 100644 samples/bpf/nrm_kern.h
 create mode 100644 samples/bpf/nrm_out_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a0ef7eddd0b3..897b467066fd 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -167,6 +167,7 @@ always += xdpsock_kern.o
 always += xdp_fwd_kern.o
 always += task_fd_query_kern.o
 always += xdp_sample_pkts_kern.o
+always += nrm_out_kern.o
 
 KBUILD_HOSTCFLAGS += -I$(objtree)/usr/include
 KBUILD_HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -266,6 +267,7 @@ $(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
 $(src)/*.c: verify_target_bpf $(LIBBPF)
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
+$(obj)/nrm_out_kern.o: $(src)/nrm.h $(src)/nrm_kern.h
 
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
diff --git a/samples/bpf/nrm.h b/samples/bpf/nrm.h
new file mode 100644
index 000000000000..ea89d6027ff0
--- /dev/null
+++ b/samples/bpf/nrm.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Include file for NRM programs
+ */
+struct nrm_vqueue {
+	struct bpf_spin_lock lock;
+	/* 4 byte hole */
+	unsigned long long lasttime;	/* In ns */
+	int credit;			/* In bytes */
+	unsigned int rate;		/* In bytes per NS << 20 */
+};
+
+struct nrm_queue_stats {
+	unsigned long rate;		/* in Mbps*/
+	unsigned long stats:1,		/* get NRM stats (marked, dropped,..) */
+		loopback:1;		/* also limit flows using loopback */
+	unsigned long long pkts_marked;
+	unsigned long long bytes_marked;
+	unsigned long long pkts_dropped;
+	unsigned long long bytes_dropped;
+	unsigned long long pkts_total;
+	unsigned long long bytes_total;
+	unsigned long long firstPacketTime;
+	unsigned long long lastPacketTime;
+};
diff --git a/samples/bpf/nrm_kern.h b/samples/bpf/nrm_kern.h
new file mode 100644
index 000000000000..e48d4d2944a9
--- /dev/null
+++ b/samples/bpf/nrm_kern.h
@@ -0,0 +1,137 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Include file for sample NRM BPF programs
+ */
+#define KBUILD_MODNAME "foo"
+#include <stddef.h>
+#include <stdbool.h>
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include <uapi/linux/ipv6.h>
+#include <uapi/linux/in.h>
+#include <uapi/linux/tcp.h>
+#include <uapi/linux/filter.h>
+#include <uapi/linux/pkt_cls.h>
+#include <net/ipv6.h>
+#include <net/inet_ecn.h>
+#include "bpf_endian.h"
+#include "bpf_helpers.h"
+#include "nrm.h"
+
+#define DROP_PKT	0
+#define ALLOW_PKT	1
+#define TCP_ECN_OK	1
+
+#define NRM_DEBUG 0  // Set to 1 to enable debugging
+#if NRM_DEBUG
+#define bpf_printk(fmt, ...)					\
+({								\
+	char ____fmt[] = fmt;					\
+	bpf_trace_printk(____fmt, sizeof(____fmt),		\
+			 ##__VA_ARGS__);			\
+})
+#else
+#define bpf_printk(fmt, ...)
+#endif
+
+#define INITIAL_CREDIT_PACKETS	100
+#define MAX_BYTES_PER_PACKET	1500
+#define MARK_THRESH		(80 * MAX_BYTES_PER_PACKET)
+#define DROP_THRESH		(80 * 5 * MAX_BYTES_PER_PACKET)
+#define LARGE_PKT_DROP_THRESH	(DROP_THRESH - (15 * MAX_BYTES_PER_PACKET))
+#define MARK_REGION_SIZE	(LARGE_PKT_DROP_THRESH - MARK_THRESH)
+#define LARGE_PKT_THRESH	120
+#define MAX_CREDIT		(100 * MAX_BYTES_PER_PACKET)
+#define INIT_CREDIT		(INITIAL_CREDIT_PACKETS * MAX_BYTES_PER_PACKET)
+
+// rate in bytes per ns << 20
+#define CREDIT_PER_NS(delta, rate) ((((u64)(delta)) * (rate)) >> 20)
+
+struct bpf_map_def SEC("maps") queue_state = {
+	.type = BPF_MAP_TYPE_CGROUP_STORAGE,
+	.key_size = sizeof(struct bpf_cgroup_storage_key),
+	.value_size = sizeof(struct nrm_vqueue),
+};
+BPF_ANNOTATE_KV_PAIR(queue_state, struct bpf_cgroup_storage_key,
+		     struct nrm_vqueue);
+
+struct bpf_map_def SEC("maps") queue_stats = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct nrm_queue_stats),
+	.max_entries = 1,
+};
+BPF_ANNOTATE_KV_PAIR(queue_stats, int, struct nrm_queue_stats);
+
+struct nrm_pkt_info {
+	bool	is_ip;
+	bool	is_tcp;
+	short	ecn;
+};
+
+static __always_inline void nrm_get_pkt_info(struct __sk_buff *skb,
+					     struct nrm_pkt_info *pkti)
+{
+	struct iphdr iph;
+	struct ipv6hdr *ip6h;
+
+	bpf_skb_load_bytes(skb, 0, &iph, 12);
+	if (iph.version == 6) {
+		ip6h = (struct ipv6hdr *)&iph;
+		pkti->is_ip = true;
+		pkti->is_tcp = (ip6h->nexthdr == 6);
+		pkti->ecn = (ip6h->flow_lbl[0] >> 4) & INET_ECN_MASK;
+	} else if (iph.version == 4) {
+		pkti->is_ip = true;
+		pkti->is_tcp = (iph.protocol == 6);
+		pkti->ecn = iph.tos & INET_ECN_MASK;
+	} else {
+		pkti->is_ip = false;
+		pkti->is_tcp = false;
+		pkti->ecn = 0;
+	}
+}
+
+static __always_inline void nrm_init_vqueue(struct nrm_vqueue *qdp, int rate)
+{
+		bpf_printk("Initializing queue_state, rate:%d\n", rate * 128);
+		qdp->lasttime = bpf_ktime_get_ns();
+		qdp->credit = INIT_CREDIT;
+		qdp->rate = rate * 128;
+}
+
+static __always_inline void nrm_update_stats(struct nrm_queue_stats *qsp,
+					     int len,
+					     unsigned long long curtime,
+					     bool congestion_flag,
+					     bool drop_flag)
+{
+	if (qsp != NULL) {
+		// Following is needed for work conserving
+		__sync_add_and_fetch(&(qsp->bytes_total), len);
+		if (qsp->stats) {
+			// Optionally update statistics
+			if (qsp->firstPacketTime == 0)
+				qsp->firstPacketTime = curtime;
+			qsp->lastPacketTime = curtime;
+			__sync_add_and_fetch(&(qsp->pkts_total), 1);
+			if (congestion_flag) {
+				__sync_add_and_fetch(&(qsp->pkts_marked), 1);
+				__sync_add_and_fetch(&(qsp->bytes_marked), len);
+			}
+			if (drop_flag) {
+				__sync_add_and_fetch(&(qsp->pkts_dropped), 1);
+				__sync_add_and_fetch(&(qsp->bytes_dropped),
+						     len);
+			}
+		}
+	}
+}
diff --git a/samples/bpf/nrm_out_kern.c b/samples/bpf/nrm_out_kern.c
new file mode 100644
index 000000000000..2d4c5a647daa
--- /dev/null
+++ b/samples/bpf/nrm_out_kern.c
@@ -0,0 +1,190 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Sample Network Resource Manager (NRM) BPF program.
+ *
+ * A cgroup skb BPF egress program to limit cgroup output bandwidth.
+ * It uses a modified virtual token bucket queue to limit average
+ * egress bandwidth. The implementation uses credits instead of tokens.
+ * Negative credits imply that queueing would have happened (this is
+ * a virtual queue, so no queueing is done by it. However, queueing may
+ * occur at the actual qdisc (which is not used for rate limiting).
+ *
+ * This implementation uses 3 thresholds, one to start marking packets and
+ * the other two to drop packets:
+ *                                  CREDIT
+ *        - <--------------------------|------------------------> +
+ *              |    |          |      0
+ *              |  Large pkt    |
+ *              |  drop thresh  |
+ *   Small pkt drop             Mark threshold
+ *       thresh
+ *
+ * The effect of marking depends on the type of packet:
+ * a) If the packet is ECN enabled and it is a TCP packet, then the packet
+ *    is ECN marked.
+ * b) If the packet is a TCP packet, then we probabilistically call tcp_cwr
+ *    to reduce the congestion window. The current implementation uses a linear
+ *    distribution (0% probability at marking threshold, 100% probability
+ *    at drop threshold).
+ * c) If the packet is not a TCP packet, then it is dropped.
+ *
+ * If the credit is below the drop threshold, the packet is dropped. If it
+ * is a TCP packet, then it also calls tcp_cwr, since packets dropped by
+ * a cgroup skb BPF program do not automatically trigger a call to
+ * tcp_cwr in the current kernel code.
+ *
+ * This BPF program actually uses 2 drop thresholds, one threshold
+ * for larger packets (>= 120 bytes) and another for smaller packets. This
+ * protects smaller packets such as SYNs, ACKs, etc.
+ *
+ * The default bandwidth limit is set at 1Gbps but this can be changed by
+ * a user program through a shared BPF map. In addition, by default this BPF
+ * program does not limit connections using loopback. This behavior can be
+ * overridden by the user program. There is also an option to calculate
+ * some statistics, such as percent of packets marked or dropped, which
+ * the user program can access.
+ *
+ * A later patch provides such a program (nrm.c)
+ */
+
+#include "nrm_kern.h"
+
+SEC("cgroup_skb/egress")
+int _nrm_out_cg(struct __sk_buff *skb)
+{
+	struct nrm_pkt_info pkti;
+	int len = skb->len;
+	unsigned int queue_index = 0;
+	unsigned long long curtime;
+	int credit;
+	signed long long delta = 0, zero = 0;
+	int max_credit = MAX_CREDIT;
+	bool congestion_flag = false;
+	bool drop_flag = false;
+	bool cwr_flag = false;
+	struct nrm_vqueue *qdp;
+	struct nrm_queue_stats *qsp = NULL;
+	int rv = ALLOW_PKT;
+
+	qsp = bpf_map_lookup_elem(&queue_stats, &queue_index);
+	if (qsp != NULL && !qsp->loopback && (skb->ifindex == 1))
+		return ALLOW_PKT;
+
+	nrm_get_pkt_info(skb, &pkti);
+
+	// We may want to account for the length of headers in len
+	// calculation, like ETH header + overhead, specially if it
+	// is a gso packet. But I am not doing it right now.
+
+	qdp = bpf_get_local_storage(&queue_state, 0);
+	if (!qdp)
+		return ALLOW_PKT;
+	else if (qdp->lasttime == 0)
+		nrm_init_vqueue(qdp, 1024);
+
+	curtime = bpf_ktime_get_ns();
+
+	// Begin critical section
+	bpf_spin_lock(&qdp->lock);
+	credit = qdp->credit;
+	delta = curtime - qdp->lasttime;
+	/* delta < 0 implies that another process with a curtime greater
+	 * than ours beat us to the critical section and already added
+	 * the new credit, so we should not add it ourselves
+	 */
+	if (delta > 0) {
+		qdp->lasttime = curtime;
+		credit += CREDIT_PER_NS(delta, qdp->rate);
+		if (credit > MAX_CREDIT)
+			credit = MAX_CREDIT;
+	}
+	credit -= len;
+	qdp->credit = credit;
+	bpf_spin_unlock(&qdp->lock);
+	// End critical section
+
+	// Check if we should update rate
+	if (qsp != NULL && (qsp->rate * 128) != qdp->rate) {
+		qdp->rate = qsp->rate * 128;
+		bpf_printk("Updating rate: %d (1sec:%llu bits)\n",
+			   (int)qdp->rate,
+			   CREDIT_PER_NS(1000000000, qdp->rate) * 8);
+	}
+
+	// Set flags (drop, congestion, cwr)
+	// Dropping => we are congested, so ignore congestion flag
+	if (pkti.is_ip) {
+		if (credit < -DROP_THRESH ||
+		    (len > LARGE_PKT_THRESH &&
+		     credit < -LARGE_PKT_DROP_THRESH)) {
+			// Very congested, set drop flag
+			drop_flag = true;
+			if (pkti.is_tcp && pkti.ecn == 0)
+				cwr_flag = true;
+		} else if (credit < 0) {
+			// Congested, set congestion flag
+			if (pkti.is_tcp || pkti.ecn) {
+				if (credit < -MARK_THRESH)
+					congestion_flag = true;
+				else
+					congestion_flag = false;
+			} else {
+				congestion_flag = true;
+			}
+		}
+
+		if (congestion_flag) {
+			if (!pkti.ecn || !bpf_skb_ecn_set_ce(skb)) {
+				if (pkti.is_tcp) {
+					u32 rand = bpf_get_prandom_u32();
+
+					if (-credit >= MARK_THRESH +
+					    (rand % MARK_REGION_SIZE)) {
+						// Do cong avoidance
+						cwr_flag = true;
+					}
+				} else if (len > LARGE_PKT_THRESH) {
+					// Problem if too many small packets?
+					drop_flag = true;
+					congestion_flag = false;
+				}
+			}
+		}
+
+		if (pkti.is_tcp && (drop_flag || cwr_flag)) {
+			struct bpf_sock *sk;
+			struct bpf_tcp_sock *tp = NULL;
+
+			sk = skb->sk;
+			if (sk) {
+				sk = bpf_sk_fullsock(sk);
+				if (sk)
+					tp = bpf_tcp_sock(sk);
+			}
+			if (tp && drop_flag)
+				bpf_tcp_check_probe_timer(tp, 20000);
+			if (tp && cwr_flag)
+				bpf_tcp_enter_cwr(tp);
+		}
+
+		if (drop_flag)
+			rv = DROP_PKT;
+
+	} else if (credit < -MARK_THRESH) {
+		drop_flag = true;
+		rv =  DROP_PKT;
+	}
+
+	nrm_update_stats(qsp, len, curtime, congestion_flag, drop_flag);
+
+	if (rv == DROP_PKT)
+		__sync_add_and_fetch(&(qdp->credit), len);
+
+	return rv;
+}
+char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 bpf-next 8/9] bpf: User program for testing NRM
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (6 preceding siblings ...)
  2019-02-23  1:07 ` [PATCH v2 bpf-next 7/9] bpf: Sample NRM BPF program to limit egress bw brakmo
@ 2019-02-23  1:07 ` brakmo
  2019-02-23  1:07 ` [PATCH v2 bpf-next 9/9] bpf: NRM test script brakmo
  2019-02-23  3:03 ` [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) David Ahern
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:07 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

The program nrm creates a cgroup and attaches a BPF program to the
cgroup for testing NRM for egress traffic. One still needs to create
network traffic. This can be done through netesto, netperf or iperf3.
A follow-up patch contains a script to create traffic.

USAGE: nrm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>]
           [-w] [-h] [prog]
  Where:
   -d        Print BPF trace debug buffer
   -l        Also limit flows doing loopback
   -n <#>    To create cgroup "/nrm#" and attach prog. Default is /nrm1
             This is convenient when testing NRM in more than 1 cgroup
   -r <rate> Rate limit in Mbps
   -s        Get NRM stats (marked, dropped, etc.)
   -t <time> Exit after specified seconds (default is 0)
   -w        Work conserving flag. cgroup can increase its bandwidth
             beyond the rate limit specified while there is available
             bandwidth. Current implementation assumes there is only
             one NIC (eth0), but can be extended to support multiple
             NICs. Currently only supported for egress. Note, this is
             just a proof of concept.
   -h        Print this info
   prog      BPF program file name. Name defaults to nrm_out_kern.o for
             output, and nrm_in_kern.o for input.

More information about NRM can be found in the paper "BPF Host Resource
Management" presented at the 2018 Linux Plumbers Conference, Networking Track
(http://vger.kernel.org/lpc_net2018_talks/LPC%20BPF%20Network%20Resource%20Paper.pdf)

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/Makefile |   3 +
 samples/bpf/nrm.c    | 440 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 443 insertions(+)
 create mode 100644 samples/bpf/nrm.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 897b467066fd..6186c9fc3179 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -53,6 +53,7 @@ hostprogs-y += xdpsock
 hostprogs-y += xdp_fwd
 hostprogs-y += task_fd_query
 hostprogs-y += xdp_sample_pkts
+hostprogs-y += nrm
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -109,6 +110,7 @@ xdpsock-objs := xdpsock_user.o
 xdp_fwd-objs := xdp_fwd_user.o
 task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
+nrm-objs := bpf_load.o nrm.o $(CGROUP_HELPERS)
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -268,6 +270,7 @@ $(src)/*.c: verify_target_bpf $(LIBBPF)
 
 $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
 $(obj)/nrm_out_kern.o: $(src)/nrm.h $(src)/nrm_kern.h
+$(obj)/nrm.o: $(src)/nrm.h
 
 # asm/sysreg.h - inline assembly used by it is incompatible with llvm.
 # But, there is no easy way to fix it, so just exclude it since it is
diff --git a/samples/bpf/nrm.c b/samples/bpf/nrm.c
new file mode 100644
index 000000000000..ae2ab61b0fb3
--- /dev/null
+++ b/samples/bpf/nrm.c
@@ -0,0 +1,440 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2019 Facebook
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * Example program for Network Resource Management
+ *
+ * This program loads a cgroup skb BPF program to enforce cgroup output
+ * (egress) or input (ingress) bandwidth limits.
+ *
+ * USAGE: nrm [-d] [-l] [-n <id>] [-r <rate>] [-s] [-t <secs>] [-w] [-h] [prog]
+ *   Where:
+ *    -d	Print BPF trace debug buffer
+ *    -l	Also limit flows doing loopback
+ *    -n <#>	To create cgroup \"/nrm#\" and attach prog
+ *		Default is /nrm1
+ *    -r <rate>	Rate limit in Mbps
+ *    -s	Get NRM stats (marked, dropped, etc.)
+ *    -t <time>	Exit after specified seconds (default is 0)
+ *    -w	Work conserving flag. cgroup can increase its bandwidth
+ *		beyond the rate limit specified while there is available
+ *		bandwidth. Current implementation assumes there is only
+ *		one NIC (eth0), but can be extended to support multiple
+ *		NICs. Currently only supported for egress.
+ *    -h	Print this info
+ *    prog	BPF program file name. Name defaults to nrm_out_kern.o
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <sys/resource.h>
+#include <sys/time.h>
+#include <unistd.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/unistd.h>
+
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+
+#include "bpf_load.h"
+#include "bpf_rlimit.h"
+#include "cgroup_helpers.h"
+#include "nrm.h"
+#include "bpf_util.h"
+#include "bpf/bpf.h"
+#include "bpf/libbpf.h"
+
+bool outFlag = true;
+int minRate = 1000;		/* cgroup rate limit in Mbps */
+int rate = 1000;		/* can grow if rate conserving is enabled */
+int dur = 1;
+bool stats_flag;
+bool loopback_flag;
+bool debugFlag;
+bool work_conserving_flag;
+
+static void Usage(void);
+static void read_trace_pipe2(void);
+static void do_error(char *msg, bool errno_flag);
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+struct bpf_object *obj;
+int bpfprog_fd;
+int cgroup_storage_fd;
+
+static void read_trace_pipe2(void)
+{
+	int trace_fd;
+	FILE *outf;
+	char *outFname = "nrm_out.log";
+
+	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+	if (trace_fd < 0) {
+		printf("Error opening trace_pipe\n");
+		return;
+	}
+
+	if (!outFlag)
+		outFname = "nrm_in.log";
+	outf = fopen(outFname, "w");
+
+	if (outf == NULL)
+		printf("Error creating %s\n", outFname);
+
+	while (1) {
+		static char buf[4097];
+		ssize_t sz;
+
+		sz = read(trace_fd, buf, sizeof(buf) - 1);
+		if (sz > 0) {
+			buf[sz] = 0;
+			puts(buf);
+			if (outf != NULL) {
+				fprintf(outf, "%s\n", buf);
+				fflush(outf);
+			}
+		}
+	}
+}
+
+static void do_error(char *msg, bool errno_flag)
+{
+	if (errno_flag)
+		printf("ERROR: %s, errno: %d\n", msg, errno);
+	else
+		printf("ERROR: %s\n", msg);
+	exit(1);
+}
+
+static int prog_load(char *prog)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+		.file = prog,
+		.expected_attach_type = BPF_CGROUP_INET_EGRESS,
+	};
+	int map_fd;
+	struct bpf_map *map;
+
+	int ret = 0;
+
+	if (access(prog, O_RDONLY) < 0) {
+		printf("Error accessing file %s: %s\n", prog, strerror(errno));
+		return 1;
+	}
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &bpfprog_fd))
+		ret = 1;
+	if (!ret) {
+		map = bpf_object__find_map_by_name(obj, "queue_stats");
+		map_fd = bpf_map__fd(map);
+		if (map_fd < 0) {
+			printf("Map not found: %s\n", strerror(map_fd));
+			ret = 1;
+		}
+	}
+
+	if (ret) {
+		printf("ERROR: load_bpf_file failed for: %s\n", prog);
+		printf("  Output from verifier:\n%s\n------\n", bpf_log_buf);
+		ret = -1;
+	} else {
+		ret = map_fd;
+	}
+
+	return ret;
+}
+
+static int run_bpf_prog(char *prog, int cg_id)
+{
+	int map_fd;
+	int rc = 0;
+	int key = 0;
+	int cg1 = 0;
+	int type = BPF_CGROUP_INET_EGRESS;
+	char cg_dir[100];
+	struct nrm_queue_stats qstats = {0};
+
+	sprintf(cg_dir, "/nrm%d", cg_id);
+	map_fd = prog_load(prog);
+	if (map_fd  == -1)
+		return 1;
+
+	if (setup_cgroup_environment()) {
+		printf("ERROR: setting cgroup environment\n");
+		goto err;
+	}
+	cg1 = create_and_get_cgroup(cg_dir);
+	if (!cg1) {
+		printf("ERROR: create_and_get_cgroup\n");
+		goto err;
+	}
+	if (join_cgroup(cg_dir)) {
+		printf("ERROR: join_cgroup\n");
+		goto err;
+	}
+
+	qstats.rate = rate;
+	qstats.stats = stats_flag ? 1 : 0;
+	qstats.loopback = loopback_flag ? 1 : 0;
+	if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
+		printf("ERROR: Could not update map element\n");
+		goto err;
+	}
+
+	if (!outFlag)
+		type = BPF_CGROUP_INET_INGRESS;
+	if (bpf_prog_attach(bpfprog_fd, cg1, type, 0)) {
+		printf("ERROR: bpf_prog_attach fails!\n");
+		log_err("Attaching prog");
+		goto err;
+	}
+
+	if (work_conserving_flag) {
+		struct timeval t0, t_last, t_new;
+		FILE *fin;
+		unsigned long long last_eth_tx_bytes, new_eth_tx_bytes;
+		signed long long last_cg_tx_bytes, new_cg_tx_bytes;
+		signed long long delta_time, delta_bytes, delta_rate;
+		int delta_ms;
+#define DELTA_RATE_CHECK 10000		/* in us */
+#define RATE_THRESHOLD 9500000000	/* 9.5 Gbps */
+
+		bpf_map_lookup_elem(map_fd, &key, &qstats);
+		if (gettimeofday(&t0, NULL) < 0)
+			do_error("gettimeofday failed", true);
+		t_last = t0;
+		fin = fopen("/sys/class/net/eth0/statistics/tx_bytes", "r");
+		if (fscanf(fin, "%llu", &last_eth_tx_bytes) != 1)
+			do_error("fscanf fails", false);
+		fclose(fin);
+		last_cg_tx_bytes = qstats.bytes_total;
+		while (true) {
+			usleep(DELTA_RATE_CHECK);
+			if (gettimeofday(&t_new, NULL) < 0)
+				do_error("gettimeofday failed", true);
+			delta_ms = (t_new.tv_sec - t0.tv_sec) * 1000 +
+				(t_new.tv_usec - t0.tv_usec)/1000;
+			if (delta_ms > dur * 1000)
+				break;
+			delta_time = (t_new.tv_sec - t_last.tv_sec) * 1000000 +
+				(t_new.tv_usec - t_last.tv_usec);
+			if (delta_time == 0)
+				continue;
+			t_last = t_new;
+			fin = fopen("/sys/class/net/eth0/statistics/tx_bytes",
+				    "r");
+			if (fscanf(fin, "%llu", &new_eth_tx_bytes) != 1)
+				do_error("fscanf fails", false);
+			fclose(fin);
+			printf("  new_eth_tx_bytes:%llu\n",
+			       new_eth_tx_bytes);
+			bpf_map_lookup_elem(map_fd, &key, &qstats);
+			new_cg_tx_bytes = qstats.bytes_total;
+			delta_bytes = new_eth_tx_bytes - last_eth_tx_bytes;
+			last_eth_tx_bytes = new_eth_tx_bytes;
+			delta_rate = (delta_bytes * 8000000) / delta_time;
+			printf("%5d - eth_rate:%.1fGbps cg_rate:%.3fGbps",
+			       delta_ms, delta_rate/1000000000.0,
+			       rate/1000.0);
+			if (delta_rate < RATE_THRESHOLD) {
+				/* can increase cgroup rate limit, but first
+				 * check if we are using the current limit.
+				 * Currently increasing by 6.25%, unknown
+				 * if that is the optimal rate.
+				 */
+				int rate_diff100;
+
+				delta_bytes = new_cg_tx_bytes -
+					last_cg_tx_bytes;
+				last_cg_tx_bytes = new_cg_tx_bytes;
+				delta_rate = (delta_bytes * 8000000) /
+					delta_time;
+				printf(" rate:%.3fGbps",
+				       delta_rate/1000000000.0);
+				rate_diff100 = (((long long)rate)*1000000 -
+						     delta_rate) * 100 /
+					(((long long) rate) * 1000000);
+				printf("  rdiff:%d", rate_diff100);
+				if (rate_diff100  <= 3) {
+					rate += (rate >> 4);
+					if (rate > RATE_THRESHOLD / 1000000)
+						rate = RATE_THRESHOLD / 1000000;
+					qstats.rate = rate;
+					printf(" INC\n");
+				} else {
+					printf("\n");
+				}
+			} else {
+				/* Need to decrease cgroup rate limit.
+				 * Currently decreasing by 12.5%, unknown
+				 * if that is optimal
+				 */
+				printf(" DEC\n");
+				rate -= (rate >> 3);
+				if (rate < minRate)
+					rate = minRate;
+				qstats.rate = rate;
+			}
+			if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY))
+				do_error("update map element fails", false);
+		}
+	} else {
+		sleep(dur);
+	}
+	// Get stats!
+	if (stats_flag && bpf_map_lookup_elem(map_fd, &key, &qstats)) {
+		char fname[100];
+		FILE *fout;
+
+		if (!outFlag)
+			sprintf(fname, "nrm.%d.in", cg_id);
+		else
+			sprintf(fname, "nrm.%d.out", cg_id);
+		fout = fopen(fname, "w");
+		fprintf(fout, "id:%d\n", cg_id);
+		fprintf(fout, "ERROR: Could not lookup queue_stats\n");
+	} else if (stats_flag && qstats.lastPacketTime >
+		   qstats.firstPacketTime) {
+		long long delta_us = (qstats.lastPacketTime -
+				      qstats.firstPacketTime)/1000;
+		unsigned int rate_mbps = ((qstats.bytes_total -
+					   qstats.bytes_dropped) * 8 /
+					  delta_us);
+		double percent_pkts, percent_bytes;
+		char fname[100];
+		FILE *fout;
+
+		if (!outFlag)
+			sprintf(fname, "nrm.%d.in", cg_id);
+		else
+			sprintf(fname, "nrm.%d.out", cg_id);
+		fout = fopen(fname, "w");
+		fprintf(fout, "id:%d\n", cg_id);
+		fprintf(fout, "rate_mbps:%d\n", rate_mbps);
+		fprintf(fout, "duration:%.1f secs\n",
+			(qstats.lastPacketTime - qstats.firstPacketTime) /
+			1000000000.0);
+		fprintf(fout, "packets:%d\n", (int)qstats.pkts_total);
+		fprintf(fout, "bytes_MB:%d\n", (int)(qstats.bytes_total /
+						     1000000));
+		fprintf(fout, "pkts_dropped:%d\n", (int)qstats.pkts_dropped);
+		fprintf(fout, "bytes_dropped_MB:%d\n",
+			(int)(qstats.bytes_dropped /
+						       1000000));
+		// Marked Pkts and Bytes
+		percent_pkts = (qstats.pkts_marked * 100.0) /
+			(qstats.pkts_total + 1);
+		percent_bytes = (qstats.bytes_marked * 100.0) /
+			(qstats.bytes_total + 1);
+		fprintf(fout, "pkts_marked_percent:%6.2f\n", percent_pkts);
+		fprintf(fout, "bytes_marked_percent:%6.2f\n", percent_bytes);
+
+		// Dropped Pkts and Bytes
+		percent_pkts = (qstats.pkts_dropped * 100.0) /
+			(qstats.pkts_total + 1);
+		percent_bytes = (qstats.bytes_dropped * 100.0) /
+			(qstats.bytes_total + 1);
+		fprintf(fout, "pkts_dropped_percent:%6.2f\n", percent_pkts);
+		fprintf(fout, "bytes_dropped_percent:%6.2f\n", percent_bytes);
+		fclose(fout);
+	}
+
+	if (debugFlag)
+		read_trace_pipe2();
+	return rc;
+err:
+	rc = 1;
+
+	if (cg1)
+		close(cg1);
+	cleanup_cgroup_environment();
+
+	return rc;
+}
+
+static void Usage(void)
+{
+	printf("This program loads a cgroup skb BPF program to enforce\n"
+	       "cgroup output (egress) bandwidth limits.\n\n"
+	       "USAGE: nrm [-o] [-d]  [-l] [-n <id>] [-r <rate>] [-s]\n"
+	       "           [-t <secs>] [-w] [-h] [prog]\n"
+	       "  Where:\n"
+	       "    -o         indicates egress direction (default)\n"
+	       "    -d         print BPF trace debug buffer\n"
+	       "    -l         also limit flows using loopback\n"
+	       "    -n <#>     to create cgroup \"/nrm#\" and attach prog\n"
+	       "               Default is /nrm1\n"
+	       "    -r <rate>  Rate in Mbps\n"
+	       "    -s         Update NRM stats\n"
+	       "    -t <time>  Exit after specified seconds (default is 0)\n"
+	       "    -w         Work conserving flag. cgroup can increase\n"
+	       "               bandwidth beyond the rate limit specified\n"
+	       "               while there is available bandwidth. Current\n"
+	       "               implementation assumes there is only eth0\n"
+	       "               but can be extended to support multiple NICs\n"
+	       "    -h         print this info\n"
+	       "    prog       BPF program file name. Name defaults to\n"
+	       "                 nrm_out_kern.o for output, and\n"
+	       "                 nrm_in_kern.o for input.\n");
+}
+
+int main(int argc, char **argv)
+{
+	char *prog = "nrm_out_kern.o";
+	int  k;
+	int cg_id = 1;
+	char *optstring = "iodln:r:st:wh";
+
+	while ((k = getopt(argc, argv, optstring)) != -1) {
+		switch (k) {
+		case 'o':
+			break;
+		case 'd':
+			debugFlag = true;
+			break;
+		case 'l':
+			loopback_flag = true;
+			break;
+		case 'n':
+			cg_id = atoi(optarg);
+			break;
+		case 'r':
+			minRate = atoi(optarg) * 1.024;
+			rate = minRate;
+			break;
+		case 's':
+			stats_flag = true;
+			break;
+		case 't':
+			dur = atoi(optarg);
+			break;
+		case 'w':
+			work_conserving_flag = true;
+			break;
+		case '?':
+			if (optopt == 'n' || optopt == 'r' || optopt == 't')
+				fprintf(stderr,
+					"Option -%c requires an argument.\n\n",
+					optopt);
+			// fallthrough
+		case 'h':
+		default:
+			Usage();
+			return 0;
+		}
+	}
+
+	if (optind < argc)
+		prog = argv[optind];
+	printf("NRM prog: %s\n", prog != NULL ? prog : "NULL");
+
+	return run_bpf_prog(prog, cg_id);
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v2 bpf-next 9/9] bpf: NRM test script
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (7 preceding siblings ...)
  2019-02-23  1:07 ` [PATCH v2 bpf-next 8/9] bpf: User program for testing NRM brakmo
@ 2019-02-23  1:07 ` brakmo
  2019-02-23  3:03 ` [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) David Ahern
  9 siblings, 0 replies; 29+ messages in thread
From: brakmo @ 2019-02-23  1:07 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Script for testing NRM (Network Resource Manager) framework.
It creates a cgroup to use for testing and loads a BPF program to limit
egress bandwidth. It then uses iperf3 or netperf to create
loads. The output is the goodput in Mbps (unless -D is used).

It can work on a single host using loopback or between two hosts (with netperf).
When using loopback, it is recommended to also introduce a delay of at least
1ms (-d=1), otherwise the assigned bandwidth is likely to be underutilized.
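For reference, the goodput figure the script reports is parsed out of the
iperf3 summary with a short grep pipeline. A standalone sketch of that
parsing, using a fabricated iperf3 receiver line (the numbers are made up
for illustration only):

```shell
# Sketch of the goodput extraction the script applies to iperf3 output.
# The summary line is fabricated; real iperf3 output has the same shape.
summary='[  5]   0.00-5.00   sec   590 MBytes   989 Mbits/sec          receiver'

# Keep the receiver line, pull out "<rate> Mbits", then keep the integer part.
rate=$(echo "$summary" | grep receiver | grep -o "[0-9.]* Mbits" | grep -o "^[0-9]*")
echo "AGGREGATE_GOODPUT:$rate"
```

With multiple flows (-f), the script applies the same extraction per flow
and sums the results into the aggregate.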

USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]
             [-d=<delay>|--delay=<delay>] [--debug] [-E]
             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id>] [-l]
             [-N] [-p=<port>|--port=<port>] [-P] [-q=<qdisc>]
             [-R] [-s=<server>|--server=<server>] [--stats]
             [-t=<time>|--time=<time>] [-w] [cubic|dctcp]
  Where:
    out               Egress (default)
    -b or --bpf       BPF program filename to load and attach.
                      Default is nrm_out_kern.o for egress,
                      nrm_in_kern.o for ingress
    -c or -cc         TCP congestion control (cubic or dctcp)
    -d or --delay     Add a delay in ms using netem
    -D                In addition to the goodput in Mbps, it also outputs
                      other detailed information. This information is
                      test dependent (i.e. iperf3 or netperf).
    --debug           Print BPF trace buffer
    -E                Enable ECN (not required for dctcp)
    -f or --flows     Number of concurrent flows (default=1)
    -i or --id        cgroup id (an integer, default is 1)
    -l                Also limit flows using loopback
    -N                Use netperf instead of iperf3
    -h                Help
    -p or --port      iperf3 port (default is 5201)
    -P                Use an iperf3 instance for each flow
    -q                Use the specified qdisc.
    -r or --rate      Rate in Mbps (default is 1Gbps)
    -R                Use TCP_RR for netperf. 1st flow has req
                      size of 10KB, rest of 1MB. Reply in all
                      cases is 1 byte.
                      More detailed output for each flow can be found
                      in the files netperf.<cg>.<flow>, where <cg> is the
                      cgroup id as specified with the -i flag, and <flow>
                      is the flow id, starting at 1 and increasing by 1 for
                      each flow (as specified by -f).
    -s or --server    hostname of netperf server. Used to create netperf
                      test traffic between two hosts (default is within
                      host). netserver must be running on that host.
    --stats           Get NRM stats (marked, dropped, etc.)
    -t or --time      duration of iperf3 in seconds (default=5)
    -w                Work conserving flag. cgroup can increase its
                      bandwidth beyond the rate limit specified
                      while there is available bandwidth. Current
                      implementation assumes there is only one NIC
                      (eth0), but can be extended to support multiple
                      NICs. This is just a proof of concept.
    cubic or dctcp    specify TCP CC to use

Examples:
 ./do_nrm_test.sh -l -d=1 -D --stats
     Runs a 5 second test, using a single iperf3 flow and with the default
     rate limit of 1Gbps and a delay of 1ms (using netem) using the default
     TCP congestion control on the loopback device (hence we use "-l" to
     enforce bandwidth limit on loopback device). Since no direction is
     specified, it defaults to egress. Since no TCP CC algorithm is
     specified it uses the system default.
     With no -D flag, only the AGGREGATE_GOODPUT value would show.
     id refers to the cgroup id and is useful when running multi cgroup
     tests (see do_nrm_test_multi.sh script).
   Output:
     Details for NRM in cgroup 1
     id:1
     rate_mbps:713
     duration:4.9 secs
     packets:10072
     bytes_MB:468
     pkts_dropped:491
     bytes_dropped_MB:32
     pkts_marked_percent: 28.64
     bytes_marked_percent: 29.15
     pkts_dropped_percent:  4.87
     bytes_dropped_percent:  6.86
     PING AVG DELAY:2.072
     AGGREGATE_GOODPUT:729

./do_nrm_test.sh -l -d=1 -D --stats dctcp
     Same as above but using dctcp. Note that fewer bytes are dropped
     (0.13% vs. 6.86%).
   Output:
     Details for NRM in cgroup 1
     id:1
     rate_mbps:932
     duration:4.9 secs
     packets:15514
     bytes_MB:570
     pkts_dropped:11
     bytes_dropped_MB:0
     pkts_marked_percent: 40.38
     bytes_marked_percent: 46.82
     pkts_dropped_percent:  0.07
     bytes_dropped_percent:  0.13
     PING AVG DELAY:2.069
     AGGREGATE_GOODPUT:953

./do_nrm_test.sh -d=1 -D --stats
     As in the first example, but without limiting the loopback device (i.e. no
     "-l" flag). Since there is no bandwidth limiting, no details for
     NRM are printed out.
   Output:
     Details for NRM in cgroup 1
     PING AVG DELAY:2.021
     AGGREGATE_GOODPUT:40226

./do_nrm_test.sh -l -d=1 -D --stats -f=2
     Uses iperf3 and does 2 flows.
./do_nrm_test.sh -l -d=1 -D --stats -f=4 -P
     Uses iperf3 and does 4 flows, each flow as a separate process.
./do_nrm_test.sh -l -d=1 -D --stats -f=4 -N
     Uses netperf, 4 flows
./do_nrm_test.sh -f=1 -r=2000 -t=5 -N -D --stats dctcp -s=<server-name>
     Uses netperf between two hosts. The remote host name is specified
     with -s= and you need to start the program netserver manually on
     the remote host. It will use 1 flow, a rate limit of 2Gbps and dctcp.
./do_nrm_test.sh -f=1 -r=2000 -t=5 -N -D --stats -w dctcp \
     -s=<server-name>
     As previous, but allows use of extra bandwidth. For this test the
     rate is 8Gbps vs. 1Gbps of the previous test.
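The PING AVG DELAY value shown in the outputs above is derived from ping's
rtt summary line. A standalone sketch of that extraction, using a
fabricated rtt line for illustration:

```shell
# Sketch of how PING AVG DELAY is derived from ping's summary output.
# The rtt line is fabricated for illustration.
line='rtt min/avg/max/mdev = 2.021/2.072/2.150/0.050 ms'

# Grab "= <min>/<avg>", then keep the trailing avg field.
delay=$(echo "$line" | grep "avg" | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$")
echo "PING AVG DELAY:$delay"
```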

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/do_nrm_test.sh | 437 +++++++++++++++++++++++++++++++++++++
 1 file changed, 437 insertions(+)
 create mode 100755 samples/bpf/do_nrm_test.sh

diff --git a/samples/bpf/do_nrm_test.sh b/samples/bpf/do_nrm_test.sh
new file mode 100755
index 000000000000..91d99237aea5
--- /dev/null
+++ b/samples/bpf/do_nrm_test.sh
@@ -0,0 +1,437 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2019 Facebook
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of version 2 of the GNU General Public
+# License as published by the Free Software Foundation.
+
+Usage() {
+  echo "Script for testing NRM (Network Resource Manager) framework."
+  echo "It creates a cgroup to use for testing and loads a BPF program to limit"
+  echo "egress or ingress bandwidth. It then uses iperf3 or netperf to create"
+  echo "loads. The output is the goodput in Mbps (unless -D was used)."
+  echo ""
+  echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]"
+  echo "             [-d=<delay>|--delay=<delay>] [--debug] [-E]"
+  echo "             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id>]"
+  echo "             [-l] [-N] [-p=<port>|--port=<port>] [-P]"
+  echo "             [-q=<qdisc>] [-R] [-s=<server>|--server=<server>]"
+  echo "             [-S|--stats] [-t=<time>|--time=<time>] [-w] [cubic|dctcp]"
+  echo "  Where:"
+  echo "    out               egress (default)"
+  echo "    -b or --bpf       BPF program filename to load and attach."
+  echo "                      Default is nrm_out_kern.o for egress,"
+  echo "                      nrm_in_kern.o for ingress"
+  echo "    -c or -cc         TCP congestion control (cubic or dctcp)"
+  echo "    --debug           print BPF trace buffer"
+  echo "    -d or --delay     add a delay in ms using netem"
+  echo "    -D                In addition to the goodput in Mbps, it also outputs"
+  echo "                      other detailed information. This information is"
+  echo "                      test dependent (i.e. iperf3 or netperf)."
+  echo "    -E                enable ECN (not required for dctcp)"
+  echo "    -f or --flows     number of concurrent flows (default=1)"
+  echo "    -i or --id        cgroup id (an integer, default is 1)"
+  echo "    -N                use netperf instead of iperf3"
+  echo "    -l                also limit flows using loopback"
+  echo "    -h                Help"
+  echo "    -p or --port      iperf3 port (default is 5201)"
+  echo "    -P                use an iperf3 instance for each flow"
+  echo "    -q                use the specified qdisc"
+  echo "    -r or --rate      rate in Mbps (default is 1Gbps)"
+  echo "    -R                Use TCP_RR for netperf. 1st flow has req"
+  echo "                      size of 10KB, rest of 1MB. Reply in all"
+  echo "                      cases is 1 byte."
+  echo "                      More detailed output for each flow can be found"
+  echo "                      in the files netperf.<cg>.<flow>, where <cg> is the"
+  echo "                      cgroup id as specified with the -i flag, and <flow>"
+  echo "                      is the flow id, starting at 1 and increasing by 1 for"
+  echo "                      each flow (as specified by -f)."
+  echo "    -s or --server    hostname of netperf server. Used to create netperf"
+  echo "                      test traffic between two hosts (default is within host)."
+  echo "                      netserver must be running on the host."
+  echo "    -S or --stats     get NRM stats (marked, dropped, etc.)"
+  echo "    -t or --time      duration of iperf3 in seconds (default=5)"
+  echo "    -w                Work conserving flag. cgroup can increase its"
+  echo "                      bandwidth beyond the rate limit specified"
+  echo "                      while there is available bandwidth. Current"
+  echo "                      implementation assumes there is only one NIC"
+  echo "                      (eth0), but can be extended to support multiple"
+  echo "                      NICs."
+  echo "    cubic or dctcp    specify which TCP CC to use"
+  echo " "
+  exit
+}
+
+#set -x
+
+debug_flag=0
+args="$@"
+name="$0"
+netem=0
+cc=x
+dir="-o"
+dir_name="out"
+dur=5
+flows=1
+id=1
+prog=""
+port=5201
+rate=1000
+multi_iperf=0
+flow_cnt=1
+use_netperf=0
+rr=0
+ecn=0
+details=0
+server=""
+qdisc=""
+flags=""
+do_stats=0
+
+function start_nrm () {
+  rm -f nrm.out
+  echo "./nrm $dir -n $id -r $rate -t $dur $flags $dbg $prog" > nrm.out
+  echo " " >> nrm.out
+  ./nrm $dir -n $id -r $rate -t $dur $flags $dbg $prog >> nrm.out 2>&1  &
+  echo $!
+}
+
+processArgs () {
+  for i in $args ; do
+    case $i in
+    # Support for upcoming ingress rate limiting
+    #in)
+    #  dir="-i"
+    #  dir_name="in"
+    #  ;;
+    out)
+      dir="-o"
+      dir_name="out"
+      ;;
+    -b=*|--bpf=*)
+      prog="${i#*=}"
+      ;;
+    -c=*|--cc=*)
+      cc="${i#*=}"
+      ;;
+    --debug)
+      flags="$flags -d"
+      debug_flag=1
+      ;;
+    -d=*|--delay=*)
+      netem="${i#*=}"
+      ;;
+    -D)
+      details=1
+      ;;
+    -E)
+     ecn=1
+     ;;
+    # Support for upcoming fq Earliest Departure Time (EDT) egress rate limiting
+    #--edt)
+    # prog="nrm_out_edt_kern.o"
+    # qdisc="fq"
+    # ;;
+    -f=*|--flows=*)
+      flows="${i#*=}"
+      ;;
+    -i=*|--id=*)
+      id="${i#*=}"
+      ;;
+    -l)
+      flags="$flags -l"
+      ;;
+    -N)
+      use_netperf=1
+      ;;
+    -p=*|--port=*)
+      port="${i#*=}"
+      ;;
+    -P)
+      multi_iperf=1
+      ;;
+    -q=*)
+      qdisc="${i#*=}"
+      ;;
+    -r=*|--rate=*)
+      rate="${i#*=}"
+      ;;
+    -R)
+      rr=1
+      ;;
+    -s=*|--server=*)
+      server="${i#*=}"
+      ;;
+    -S|--stats)
+      flags="$flags -s"
+      do_stats=1
+      ;;
+    -t=*|--time=*)
+      dur="${i#*=}"
+      ;;
+    -w)
+      flags="$flags -w"
+      ;;
+    cubic)
+      cc=cubic
+      ;;
+    dctcp)
+      cc=dctcp
+      ;;
+    *)
+      echo "Unknown arg:$i"
+      Usage
+      ;;
+    esac
+  done
+}
+
+processArgs
+
+if [ $debug_flag -eq 1 ] ; then
+  rm -f nrm_out.log
+fi
+
+nrm_pid=$(start_nrm)
+usleep 100000
+
+host=`hostname`
+cg_base_dir=/sys/fs/cgroup
+cg_dir="$cg_base_dir/cgroup-test-work-dir/nrm$id"
+
+echo $$ >> $cg_dir/cgroup.procs
+
+ulimit -l unlimited
+
+rm -f ss.out
+rm -f nrm.[0-9]*.$dir_name
+if [ $ecn -ne 0 ] ; then
+  sysctl -w -q -n net.ipv4.tcp_ecn=1
+fi
+
+if [ $use_netperf -eq 0 ] ; then
+  cur_cc=`sysctl -n net.ipv4.tcp_congestion_control`
+  if [ "$cc" != "x" ] ; then
+    sysctl -w -q -n net.ipv4.tcp_congestion_control=$cc
+  fi
+fi
+
+if [ "$netem" -ne "0" ] ; then
+  if [ "$qdisc" != "" ] ; then
+    echo "WARNING: Ignoring -q option because -d option was used"
+  fi
+  tc qdisc del dev lo root > /dev/null 2>&1
+  tc qdisc add dev lo root netem delay $netem\ms > /dev/null 2>&1
+elif [ "$qdisc" != "" ] ; then
+  tc qdisc del dev lo root > /dev/null 2>&1
+  tc qdisc add dev lo root $qdisc > /dev/null 2>&1
+fi
+
+n=0
+m=$[$dur * 5]
+hn="::1"
+if [ $use_netperf -ne 0 ] ; then
+  if [ "$server" != "" ] ; then
+    hn=$server
+  fi
+fi
+
+( ping6 -i 0.2 -c $m $hn > ping.out 2>&1 ) &
+
+if [ $use_netperf -ne 0 ] ; then
+  begNetserverPid=`ps ax | grep netserver | grep --invert-match "grep" | \
+                   awk '{ print $1 }'`
+  if [ "$begNetserverPid" == "" ] ; then
+    if [ "$server" == "" ] ; then
+      ( ./netserver > /dev/null 2>&1) &
+      usleep 100000
+    fi
+  fi
+  flow_cnt=1
+  if [ "$server" == "" ] ; then
+    np_server=$host
+  else
+    np_server=$server
+  fi
+  if [ "$cc" == "x" ] ; then
+    np_cc=""
+  else
+    np_cc="-K $cc,$cc"
+  fi
+  replySize=1
+  while [ $flow_cnt -le $flows ] ; do
+    if [ $rr -ne 0 ] ; then
+      reqSize=1M
+      if [ $flow_cnt -eq 1 ] ; then
+        reqSize=10K
+      fi
+      if [ "$dir" == "-i" ] ; then
+        replySize=$reqSize
+        reqSize=1
+      fi
+      ( ./netperf -H $np_server -l $dur -f m -j -t TCP_RR -- -r $reqSize,$replySize $np_cc -k P50_LATENCY,P90_LATENCY,LOCAL_TRANSPORT_RETRANS,REMOTE_TRANSPORT_RETRANS,LOCAL_SEND_THROUGHPUT,LOCAL_RECV_THROUGHPUT,REQUEST_SIZE,RESPONSE_SIZE > netperf.$id.$flow_cnt ) &
+    else
+      if [ "$dir" == "-i" ] ; then
+        ( ./netperf -H $np_server -l $dur -f m -j -t TCP_RR -- -r 1,10M $np_cc -k P50_LATENCY,P90_LATENCY,LOCAL_TRANSPORT_RETRANS,LOCAL_SEND_THROUGHPUT,REMOTE_TRANSPORT_RETRANS,REMOTE_SEND_THROUGHPUT,REQUEST_SIZE,RESPONSE_SIZE > netperf.$id.$flow_cnt ) &
+      else
+        ( ./netperf -H $np_server -l $dur -f m -j -t TCP_STREAM -- $np_cc -k P50_LATENCY,P90_LATENCY,LOCAL_TRANSPORT_RETRANS,LOCAL_SEND_THROUGHPUT,REQUEST_SIZE,RESPONSE_SIZE > netperf.$id.$flow_cnt ) &
+      fi
+    fi
+    flow_cnt=$[flow_cnt+1]
+  done
+
+# sleep for duration of test (plus some buffer)
+  n=$[dur+2]
+  sleep $n
+
+# force graceful termination of netperf
+  pids=`pgrep netperf`
+  for p in $pids ; do
+    kill -SIGALRM $p
+  done
+
+  flow_cnt=1
+  rate=0
+  if [ $details -ne 0 ] ; then
+    echo ""
+    echo "Details for NRM in cgroup $id"
+    if [ $do_stats -eq 1 ] ; then
+      if [ -e nrm.$id.$dir_name ] ; then
+        cat nrm.$id.$dir_name
+      fi
+    fi
+  fi
+  while [ $flow_cnt -le $flows ] ; do
+    if [ "$dir" == "-i" ] ; then
+      r=`cat netperf.$id.$flow_cnt | grep -o "REMOTE_SEND_THROUGHPUT=[0-9]*" | grep -o "[0-9]*"`
+    else
+      r=`cat netperf.$id.$flow_cnt | grep -o "LOCAL_SEND_THROUGHPUT=[0-9]*" | grep -o "[0-9]*"`
+    fi
+    echo "rate for flow $flow_cnt: $r"
+    rate=$[rate+r]
+    if [ $details -ne 0 ] ; then
+      echo "-----"
+      echo "Details for cgroup $id, flow $flow_cnt"
+      cat netperf.$id.$flow_cnt
+    fi
+    flow_cnt=$[flow_cnt+1]
+  done
+  if [ $details -ne 0 ] ; then
+    echo ""
+    delay=`grep "avg" ping.out | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$"`
+    echo "PING AVG DELAY:$delay"
+    echo "AGGREGATE_GOODPUT:$rate"
+  else
+    echo $rate
+  fi
+elif [ $multi_iperf -eq 0 ] ; then
+  (iperf3 -s -p $port -1 > /dev/null 2>&1) &
+  usleep 100000
+  iperf3 -c $host -p $port -i 0 -P $flows -f m -t $dur > iperf.$id
+  rates=`grep receiver iperf.$id | grep -o "[0-9.]* Mbits" | grep -o "^[0-9]*"`
+  rate=`echo $rates | grep -o "[0-9]*$"`
+
+  if [ $details -ne 0 ] ; then
+    echo ""
+    echo "Details for NRM in cgroup $id"
+    if [ $do_stats -eq 1 ] ; then
+      if [ -e nrm.$id.$dir_name ] ; then
+        cat nrm.$id.$dir_name
+      fi
+    fi
+    delay=`grep "avg" ping.out | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$"`
+    echo "PING AVG DELAY:$delay"
+    echo "AGGREGATE_GOODPUT:$rate"
+  else
+    echo $rate
+  fi
+else
+  flow_cnt=1
+  while [ $flow_cnt -le $flows ] ; do
+    (iperf3 -s -p $port -1 > /dev/null 2>&1) &
+    ( iperf3 -c $host -p $port -i 0 -P 1 -f m -t $dur | grep receiver | grep -o "[0-9.]* Mbits" | grep -o "^[0-9]*" | grep -o "[0-9]*$" > iperf3.$id.$flow_cnt ) &
+    port=$[port+1]
+    flow_cnt=$[flow_cnt+1]
+  done
+  n=$[dur+1]
+  sleep $n
+  flow_cnt=1
+  rate=0
+  if [ $details -ne 0 ] ; then
+    echo ""
+    echo "Details for NRM in cgroup $id"
+    if [ $do_stats -eq 1 ] ; then
+      if [ -e nrm.$id.$dir_name ] ; then
+        cat nrm.$id.$dir_name
+      fi
+    fi
+  fi
+
+  while [ $flow_cnt -le $flows ] ; do
+    r=`cat iperf3.$id.$flow_cnt`
+#    echo "rate for flow $flow_cnt: $r"
+    if [ $details -ne 0 ] ; then
+      echo "Rate for cgroup $id, flow $flow_cnt LOCAL_SEND_THROUGHPUT=$r"
+    fi
+    rate=$[rate+r]
+    flow_cnt=$[flow_cnt+1]
+  done
+  if [ $details -ne 0 ] ; then
+    delay=`grep "avg" ping.out | grep -o "= [0-9.]*/[0-9.]*" | grep -o "[0-9.]*$"`
+    echo "PING AVG DELAY:$delay"
+    echo "AGGREGATE_GOODPUT:$rate"
+  else
+    echo $rate
+  fi
+fi
+
+if [ $use_netperf -eq 0 ] ; then
+  sysctl -w -q -n net.ipv4.tcp_congestion_control=$cur_cc
+fi
+if [ $ecn -ne 0 ] ; then
+  sysctl -w -q -n net.ipv4.tcp_ecn=0
+fi
+if [ "$netem" -ne "0" ] ; then
+  tc qdisc del dev lo root > /dev/null 2>&1
+fi
+
+sleep 2
+
+nrmPid=`ps ax | grep "nrm " | grep --invert-match "grep" | awk '{ print $1 }'`
+if [ "$nrmPid" == "$nrm_pid" ] ; then
+  kill $nrm_pid
+fi
+
+sleep 1
+
+# Detach any BPF programs that may have lingered
+ttx=`bpftool cgroup tree | grep nrm`
+v=2
+for x in $ttx ; do
+    if [ "${x:0:36}" == "/sys/fs/cgroup/cgroup-test-work-dir/" ] ; then
+	cg=$x ; v=0
+    elif [ $v -eq 0 ] ; then
+	id=$x ; v=1
+    elif [ $v -eq 1 ] ; then
+	type=$x ; bpftool cgroup detach $cg $type id $id
+	v=0
+    fi
+done
+
+if [ $use_netperf -ne 0 ] ; then
+  if [ "$server" == "" ] ; then
+    if [ "$begNetserverPid" == "" ] ; then
+      netserverPid=`ps ax | grep netserver | grep --invert-match "grep" | awk '{ print $1 }'`
+      if [ "$netserverPid" != "" ] ; then
+        kill $netserverPid
+      fi
+    fi
+  fi
+fi
+exit
-- 
2.17.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce
  2019-02-23  1:06 ` [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce brakmo
@ 2019-02-23  1:14   ` Daniel Borkmann
  2019-02-23  7:30     ` Martin Lau
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2019-02-23  1:14 UTC (permalink / raw)
  To: brakmo, netdev; +Cc: Martin Lau, Alexei Starovoitov, Eric Dumazet, Kernel Team

On 02/23/2019 02:06 AM, brakmo wrote:
> This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
> "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
> BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
> be attached to the ingress and egress path. The helper is needed
> because his type of bpf_prog cannot modify the skb directly.
> 
> This helper is used to set the ECN field of ECN capable IP packets to ce
> (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
> used by a bpf_prog to manage egress or ingress network bandwidth limit
> per cgroupv2 by inducing an ECN response in the TCP sender.
> This works best when using DCTCP.
> 
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
>  include/uapi/linux/bpf.h | 10 +++++++++-
>  net/core/filter.c        | 14 ++++++++++++++
>  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 95b5058fa945..fc646f3eaf9b 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2365,6 +2365,13 @@ union bpf_attr {
>   *		Make a tcp_sock enter CWR state.
>   *	Return
>   *		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
> + *	Description
> + *		Sets ECN of IP header to ce (congestion encountered) if
> + *		current value is ect (ECN capable). Works with IPv6 and IPv4.
> + *	Return
> + *		1 if set, 0 if not set.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -2464,7 +2471,8 @@ union bpf_attr {
>  	FN(spin_unlock),		\
>  	FN(sk_fullsock),		\
>  	FN(tcp_sock),			\
> -	FN(tcp_enter_cwr),
> +	FN(tcp_enter_cwr),		\
> +	FN(skb_ecn_set_ce),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ca57ef25279c..955369c6ed30 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5444,6 +5444,18 @@ static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
>  	.ret_type    = RET_INTEGER,
>  	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
>  };
> +
> +BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb)
> +{
> +	return INET_ECN_set_ce(skb);

Hm, but as mentioned last time, don't we have to ensure here that skb
is writable (aka skb->data private to us before writing into it)?

> +}
> +
> +static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
> +	.func		= bpf_skb_ecn_set_ce,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +};
>  #endif /* CONFIG_INET */
>  
>  bool bpf_helper_changes_pkt_data(void *func)
> @@ -5610,6 +5622,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
>  		} else {
>  			return NULL;
>  		}
> +	case BPF_FUNC_skb_ecn_set_ce:
> +		return &bpf_skb_ecn_set_ce_proto;
>  #endif
>  	default:
>  		return sk_filter_func_proto(func_id, prog);
> 

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
                   ` (8 preceding siblings ...)
  2019-02-23  1:07 ` [PATCH v2 bpf-next 9/9] bpf: NRM test script brakmo
@ 2019-02-23  3:03 ` David Ahern
  2019-02-23 18:39   ` Eric Dumazet
  9 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2019-02-23  3:03 UTC (permalink / raw)
  To: brakmo, netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

On 2/22/19 8:06 PM, brakmo wrote:
> Network Resource Manager is a framework for limiting the bandwidth used
> by v2 cgroups. It consists of 4 BPF helpers and a sample BPF program to
> limit egress bandwidth as well as a sample user program and script to
> simplify NRM testing.

'resource manager' is a really generic name. Since you are referring to
bandwidth, how about renaming to Network Bandwidth Manager?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce
  2019-02-23  1:14   ` Daniel Borkmann
@ 2019-02-23  7:30     ` Martin Lau
  2019-02-25 10:10       ` Daniel Borkmann
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Lau @ 2019-02-23  7:30 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Eric Dumazet, Kernel Team

On Sat, Feb 23, 2019 at 02:14:26AM +0100, Daniel Borkmann wrote:
> On 02/23/2019 02:06 AM, brakmo wrote:
> > This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
> > "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
> > BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
> > be attached to the ingress and egress path. The helper is needed
> > because this type of bpf_prog cannot modify the skb directly.
> > 
> > This helper is used to set the ECN field of ECN capable IP packets to ce
> > (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
> > used by a bpf_prog to manage egress or ingress network bandwidth limit
> > per cgroupv2 by inducing an ECN response in the TCP sender.
> > This works best when using DCTCP.
> > 
> > Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> > ---
> >  include/uapi/linux/bpf.h | 10 +++++++++-
> >  net/core/filter.c        | 14 ++++++++++++++
> >  2 files changed, 23 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 95b5058fa945..fc646f3eaf9b 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2365,6 +2365,13 @@ union bpf_attr {
> >   *		Make a tcp_sock enter CWR state.
> >   *	Return
> >   *		0 on success, or a negative error in case of failure.
> > + *
> > + * int bpf_skb_ecn_set_ce(struct sk_buff *skb)
> > + *	Description
> > + *		Sets ECN of IP header to ce (congestion encountered) if
> > + *		current value is ect (ECN capable). Works with IPv6 and IPv4.
> > + *	Return
> > + *		1 if set, 0 if not set.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)		\
> >  	FN(unspec),			\
> > @@ -2464,7 +2471,8 @@ union bpf_attr {
> >  	FN(spin_unlock),		\
> >  	FN(sk_fullsock),		\
> >  	FN(tcp_sock),			\
> > -	FN(tcp_enter_cwr),
> > +	FN(tcp_enter_cwr),		\
> > +	FN(skb_ecn_set_ce),
> >  
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> >   * function eBPF program intends to call
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index ca57ef25279c..955369c6ed30 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5444,6 +5444,18 @@ static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
> >  	.ret_type    = RET_INTEGER,
> >  	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
> >  };
> > +
> > +BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb)
> > +{
> > +	return INET_ECN_set_ce(skb);
> 
> Hm, but as mentioned last time, don't we have to ensure here that skb
> is writable (aka skb->data private to us before writing into it)?
INET_ECN_set_ce(skb) is also called from a few net/sched/sch_*.c files,
but I don't see how they ensure that a skb is writable.

May be I have missed something there that can also be borrowed and
reused here?

Thanks,
Martin

> 
> > +}
> > +
> > +static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
> > +	.func		= bpf_skb_ecn_set_ce,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_INTEGER,
> > +	.arg1_type	= ARG_PTR_TO_CTX,
> > +};
> >  #endif /* CONFIG_INET */
> >  
> >  bool bpf_helper_changes_pkt_data(void *func)
> > @@ -5610,6 +5622,8 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
> >  		} else {
> >  			return NULL;
> >  		}
> > +	case BPF_FUNC_skb_ecn_set_ce:
> > +		return &bpf_skb_ecn_set_ce_proto;
> >  #endif
> >  	default:
> >  		return sk_filter_func_proto(func_id, prog);
> > 
> 
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-23  3:03 ` [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) David Ahern
@ 2019-02-23 18:39   ` Eric Dumazet
  2019-02-23 20:40     ` Alexei Starovoitov
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2019-02-23 18:39 UTC (permalink / raw)
  To: David Ahern, brakmo, netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team



On 02/22/2019 07:03 PM, David Ahern wrote:
> On 2/22/19 8:06 PM, brakmo wrote:
>> Network Resource Manager is a framework for limiting the bandwidth used
>> by v2 cgroups. It consists of 4 BPF helpers and a sample BPF program to
>> limit egress bandwidth as well as a sample user program and script to
>> simplify NRM testing.
> 
> 'resource manager' is a really generic name. Since you are referring to
> bandwidth, how about renaming to Network Bandwidth Manager?
> 

Or just use the normal word for a policer ...

Really this is beyond me that TCP experts can still push policers out there,
they are really a huge pain.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-23 18:39   ` Eric Dumazet
@ 2019-02-23 20:40     ` Alexei Starovoitov
  2019-02-23 20:43       ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2019-02-23 20:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Sat, Feb 23, 2019 at 10:39:53AM -0800, Eric Dumazet wrote:
> 
> 
> On 02/22/2019 07:03 PM, David Ahern wrote:
> > On 2/22/19 8:06 PM, brakmo wrote:
> >> Network Resource Manager is a framework for limiting the bandwidth used
> >> by v2 cgroups. It consists of 4 BPF helpers and a sample BPF program to
> >> limit egress bandwdith as well as a sample user program and script to
> >> simplify NRM testing.
> > 
> > 'resource manager' is a really generic name. Since you are referring to
> > bandwidth, how about renaming to Network Bandwidth Manager?
> > 
> 
> Or just use the normal word for a policer ...
> 
> Really this is beyond me that TCP experts can still push policers out there,
> they are really a huge pain.

hmm. please see our NRM presentation at LPC.
It is a networking _resource_ management framework for cgroups.
Bandwidth enforcement is a particular example.
It's not a policer either.



* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-23 20:40     ` Alexei Starovoitov
@ 2019-02-23 20:43       ` Eric Dumazet
  2019-02-23 23:25         ` Alexei Starovoitov
  0 siblings, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2019-02-23 20:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team



On 02/23/2019 12:40 PM, Alexei Starovoitov wrote:
> On Sat, Feb 23, 2019 at 10:39:53AM -0800, Eric Dumazet wrote:
>>
>>
>> On 02/22/2019 07:03 PM, David Ahern wrote:
>>> On 2/22/19 8:06 PM, brakmo wrote:
>>>> Network Resource Manager is a framework for limiting the bandwidth used
>>>> by v2 cgroups. It consists of 4 BPF helpers and a sample BPF program to
>>>> limit egress bandwdith as well as a sample user program and script to
>>>> simplify NRM testing.
>>>
>>> 'resource manager' is a really generic name. Since you are referring to
>>> bandwidth, how about renaming to Network Bandwidth Manager?
>>>
>>
>> Or just use the normal word for a policer ...
>>
>> Really this is beyond me that TCP experts can still push policers out there,
>> they are really a huge pain.
> 
> hmm. please see our NRM presentation at LPC.
> It is a networking _resource_ management for cgroups.
> Bandwidth enforcement is a particular example.
> It's not a policer either.
> 

Well, this definitely looks like a policer to me, sorry if we disagree, this is fine.


* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-23 20:43       ` Eric Dumazet
@ 2019-02-23 23:25         ` Alexei Starovoitov
  2019-02-24  2:58           ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2019-02-23 23:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Sat, Feb 23, 2019 at 12:43:51PM -0800, Eric Dumazet wrote:
> 
> 
> On 02/23/2019 12:40 PM, Alexei Starovoitov wrote:
> > On Sat, Feb 23, 2019 at 10:39:53AM -0800, Eric Dumazet wrote:
> >>
> >>
> >> On 02/22/2019 07:03 PM, David Ahern wrote:
> >>> On 2/22/19 8:06 PM, brakmo wrote:
> >>>> Network Resource Manager is a framework for limiting the bandwidth used
> >>>> by v2 cgroups. It consists of 4 BPF helpers and a sample BPF program to
> >>>> limit egress bandwdith as well as a sample user program and script to
> >>>> simplify NRM testing.
> >>>
> >>> 'resource manager' is a really generic name. Since you are referring to
> >>> bandwidth, how about renaming to Network Bandwidth Manager?
> >>>
> >>
> >> Or just use the normal word for a policer ...
> >>
> >> Really this is beyond me that TCP experts can still push policers out there,
> >> they are really a huge pain.
> > 
> > hmm. please see our NRM presentation at LPC.
> > It is a networking _resource_ management for cgroups.
> > Bandwidth enforcement is a particular example.
> > It's not a policer either.
> > 
> 
> Well, this definitely looks a policer to me, sorry if we disagree, this is fine.

this particular example certainly does look like it. we both agree.
It's overall direction of this work that is aiming to do
network resource management. For example bpf prog may choose
to react on SLA violations in one cgroup by throttling flows
in the other cgroup. Aggregated per-cgroup bandwidth doesn't
need to cross a threshold for bpf prog to take action.
It could do 'work conserving' 'policer'.
I think this set of patches represent a revolutionary approach and existing
networking nomenclature doesn't have precise words to describe it :)
'NRM' describes our goals the best.
Other folks may choose to use it differently, of course.
Note that NRM abbreviation doesn't leak anywhere in uapi.
It's only used in examples. So not sure what we're arguing about.



* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-23  1:06 ` [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr brakmo
@ 2019-02-24  1:32   ` Eric Dumazet
  2019-02-24  3:08     ` Martin Lau
  2019-02-25 23:14   ` Stanislav Fomichev
  1 sibling, 1 reply; 29+ messages in thread
From: Eric Dumazet @ 2019-02-24  1:32 UTC (permalink / raw)
  To: brakmo, netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team



On 02/22/2019 05:06 PM, brakmo wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> 
> This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
> "int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
> It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
> to the egress path where the bpf prog is called by
> ip_finish_output() or ip6_finish_output().  The verifier
> ensures that the parameter must be a tcp_sock.
> 
> This helper makes a tcp_sock enter CWR state.  It can be used
> by a bpf_prog to manage egress network bandwidth limit per
> cgroupv2.  A later patch will have a sample program to
> show how it can be used to limit bandwidth usage per cgroupv2.
> 
> To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
> attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
> during load time if the prog uses this new helper.
> The newly added prog->enforce_expected_attach_type bit will also be set
> if this new helper is used.  This bit is for backward compatibility reason
> because currently prog->expected_attach_type has been ignored in
> BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
> prog->expected_attach_type is only enforced if the
> prog->enforce_expected_attach_type bit is set.
> i.e. prog->expected_attach_type is only enforced if this new helper
> is used by the prog.
> 

BTW, it seems to me that BPF_CGROUP_INET_EGRESS can be used while the socket lock is not held.

Maybe we should fix :/



* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-23 23:25         ` Alexei Starovoitov
@ 2019-02-24  2:58           ` David Ahern
  2019-02-24  4:48             ` Alexei Starovoitov
  0 siblings, 1 reply; 29+ messages in thread
From: David Ahern @ 2019-02-24  2:58 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On 2/23/19 6:25 PM, Alexei Starovoitov wrote:
>>> hmm. please see our NRM presentation at LPC.

Reference?

We also gave a talk about a resource manager in November 2017:

https://netdevconf.org/2.2/papers/roulin-hardwareresourcesmgmt-talk.pdf

in this case the context is hardware resources for networking which
aligns with devlink and switchdev.

>>> It is a networking _resource_ management for cgroups.
>>> Bandwidth enforcement is a particular example.
>>> It's not a policer either.
>>>
>>
>> Well, this definitely looks a policer to me, sorry if we disagree, this is fine.
> 
> this particular example certainly does look like it. we both agree.
> It's overall direction of this work that is aiming to do
> network resource management. For example bpf prog may choose
> to react on SLA violations in one cgroup by throttling flows
> in the other cgroup. Aggregated per-cgroup bandwidth doesn't
> need to cross a threshold for bpf prog to take action.
> It could do 'work conserving' 'policer'.
> I think this set of patches represent a revolutionary approach and existing
> networking nomenclature doesn't have precise words to describe it :)
> 'NRM' describes our goals the best.

Are you doing something beyond bandwidth usage? e.g., are you limiting
neighbor entries, fdb entries or FIB entries by cgroup? what about
router interfaces or vlans? I cannot imagine why or how you would manage
that but my point is the meaning of 'network resources'.


> Other folks may choose to use it differently, of course.
> Note that NRM abbreviation doesn't leak anywhere in uapi.
> It's only used in examples. So not sure what we're arguing about.
> 

It was a simple request for a more specific name that better represents
the scope of the project. Everything presented so far has been about
bandwidth.


* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-24  1:32   ` Eric Dumazet
@ 2019-02-24  3:08     ` Martin Lau
  2019-02-24  4:44       ` Alexei Starovoitov
  2019-02-24 18:00       ` Eric Dumazet
  0 siblings, 2 replies; 29+ messages in thread
From: Martin Lau @ 2019-02-24  3:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On Sat, Feb 23, 2019 at 05:32:14PM -0800, Eric Dumazet wrote:
> 
> 
> On 02/22/2019 05:06 PM, brakmo wrote:
> > From: Martin KaFai Lau <kafai@fb.com>
> > 
> > This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
> > "int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
> > It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
> > to the egress path where the bpf prog is called by
> > ip_finish_output() or ip6_finish_output().  The verifier
> > ensures that the parameter must be a tcp_sock.
> > 
> > This helper makes a tcp_sock enter CWR state.  It can be used
> > by a bpf_prog to manage egress network bandwidth limit per
> > cgroupv2.  A later patch will have a sample program to
> > show how it can be used to limit bandwidth usage per cgroupv2.
> > 
> > To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
> > attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
> > during load time if the prog uses this new helper.
> > The newly added prog->enforce_expected_attach_type bit will also be set
> > if this new helper is used.  This bit is for backward compatibility reason
> > because currently prog->expected_attach_type has been ignored in
> > BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
> > prog->expected_attach_type is only enforced if the
> > prog->enforce_expected_attach_type bit is set.
> > i.e. prog->expected_attach_type is only enforced if this new helper
> > is used by the prog.
> > 
> 
> BTW, it seems to me that BPF_CGROUP_INET_EGRESS can be used while the socket lock is not held.
Thanks for pointing it out.

ic. I just noticed the comments at ip6_xmit():
/*
 * xmit an sk_buff (used by TCP, SCTP and DCCP)
 * Note : socket lock is not held for SYNACK packets, but might be modified
 * by calls to skb_set_owner_w() and ipv6_local_error(),
 * which are using proper atomic operations or spinlocks.
 */
Are there other cases besides SYNACK?

Thanks,
Martin


* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-24  3:08     ` Martin Lau
@ 2019-02-24  4:44       ` Alexei Starovoitov
  2019-02-24 18:00       ` Eric Dumazet
  1 sibling, 0 replies; 29+ messages in thread
From: Alexei Starovoitov @ 2019-02-24  4:44 UTC (permalink / raw)
  To: Martin Lau
  Cc: Eric Dumazet, Lawrence Brakmo, netdev, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Sun, Feb 24, 2019 at 03:08:48AM +0000, Martin Lau wrote:
> On Sat, Feb 23, 2019 at 05:32:14PM -0800, Eric Dumazet wrote:
> > 
> > 
> > On 02/22/2019 05:06 PM, brakmo wrote:
> > > From: Martin KaFai Lau <kafai@fb.com>
> > > 
> > > This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
> > > "int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
> > > It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
> > > to the egress path where the bpf prog is called by
> > > ip_finish_output() or ip6_finish_output().  The verifier
> > > ensures that the parameter must be a tcp_sock.
> > > 
> > > This helper makes a tcp_sock enter CWR state.  It can be used
> > > by a bpf_prog to manage egress network bandwidth limit per
> > > cgroupv2.  A later patch will have a sample program to
> > > show how it can be used to limit bandwidth usage per cgroupv2.
> > > 
> > > To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
> > > attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
> > > during load time if the prog uses this new helper.
> > > The newly added prog->enforce_expected_attach_type bit will also be set
> > > if this new helper is used.  This bit is for backward compatibility reason
> > > because currently prog->expected_attach_type has been ignored in
> > > BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
> > > prog->expected_attach_type is only enforced if the
> > > prog->enforce_expected_attach_type bit is set.
> > > i.e. prog->expected_attach_type is only enforced if this new helper
> > > is used by the prog.
> > > 
> > 
> > BTW, it seems to me that BPF_CGROUP_INET_EGRESS can be used while the socket lock is not held.
> Thanks for pointing it out.
> 
> ic. I just noticed the comments at ip6_xmit():
> /*
>  * xmit an sk_buff (used by TCP, SCTP and DCCP)
>  * Note : socket lock is not held for SYNACK packets, but might be modified
>  * by calls to skb_set_owner_w() and ipv6_local_error(),
>  * which are using proper atomic operations or spinlocks.
>  */
> Is there other cases other than SYNACK?

I don't think it's a problem.
the helper does:
BPF_CALL_1(bpf_tcp_enter_cwr, struct tcp_sock *, tp)
+{
+	struct sock *sk = (struct sock *)tp;
+
+	if (sk->sk_state == TCP_ESTABLISHED) {
+		tcp_enter_cwr(sk);

I believe at the time ip_finish_output is called on established socket
it's safe to call tcp_enter_cwr.
I don't see how this is different from normal __tcp_transmit_skb path.

Eric, what issue do you see?



* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-24  2:58           ` David Ahern
@ 2019-02-24  4:48             ` Alexei Starovoitov
  2019-02-25  1:38               ` David Ahern
  0 siblings, 1 reply; 29+ messages in thread
From: Alexei Starovoitov @ 2019-02-24  4:48 UTC (permalink / raw)
  To: David Ahern
  Cc: Eric Dumazet, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Sat, Feb 23, 2019 at 09:58:57PM -0500, David Ahern wrote:
> On 2/23/19 6:25 PM, Alexei Starovoitov wrote:
> >>> hmm. please see our NRM presentation at LPC.
> 
> Reference?
> 
> We also gave a talk about a resource manager in November 2017:
> 
> https://netdevconf.org/2.2/papers/roulin-hardwareresourcesmgmt-talk.pdf
> 
> in this case the context is hardware resources for networking which
> aligns with devlink and switchdev.
> 
> >>> It is a networking _resource_ management for cgroups.
> >>> Bandwidth enforcement is a particular example.
> >>> It's not a policer either.
> >>>
> >>
> >> Well, this definitely looks a policer to me, sorry if we disagree, this is fine.
> > 
> > this particular example certainly does look like it. we both agree.
> > It's overall direction of this work that is aiming to do
> > network resource management. For example bpf prog may choose
> > to react on SLA violations in one cgroup by throttling flows
> > in the other cgroup. Aggregated per-cgroup bandwidth doesn't
> > need to cross a threshold for bpf prog to take action.
> > It could do 'work conserving' 'policer'.
> > I think this set of patches represent a revolutionary approach and existing
> > networking nomenclature doesn't have precise words to describe it :)
> > 'NRM' describes our goals the best.
> 
> Are you doing something beyond bandwidth usage? e.g., are you limiting
> neighbor entries, fdb entries or FIB entries by cgroup? what about
> router interfaces or vlans? I cannot imagine why or how you would manage
> that but my point is the meaning of 'network resources'.

'network resources' also include backbone and ToR (top-of-rack) capacity,
and this mechanism is going to help address that as well.



* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-24  3:08     ` Martin Lau
  2019-02-24  4:44       ` Alexei Starovoitov
@ 2019-02-24 18:00       ` Eric Dumazet
  1 sibling, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2019-02-24 18:00 UTC (permalink / raw)
  To: Martin Lau, Eric Dumazet
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 02/23/2019 07:08 PM, Martin Lau wrote:
> On Sat, Feb 23, 2019 at 05:32:14PM -0800, Eric Dumazet wrote:
>>
>>
>> On 02/22/2019 05:06 PM, brakmo wrote:
>>> From: Martin KaFai Lau <kafai@fb.com>
>>>
>>> This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
>>> "int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
>>> It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
>>> to the egress path where the bpf prog is called by
>>> ip_finish_output() or ip6_finish_output().  The verifier
>>> ensures that the parameter must be a tcp_sock.
>>>
>>> This helper makes a tcp_sock enter CWR state.  It can be used
>>> by a bpf_prog to manage egress network bandwidth limit per
>>> cgroupv2.  A later patch will have a sample program to
>>> show how it can be used to limit bandwidth usage per cgroupv2.
>>>
>>> To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
>>> attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
>>> during load time if the prog uses this new helper.
>>> The newly added prog->enforce_expected_attach_type bit will also be set
>>> if this new helper is used.  This bit is for backward compatibility reason
>>> because currently prog->expected_attach_type has been ignored in
>>> BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
>>> prog->expected_attach_type is only enforced if the
>>> prog->enforce_expected_attach_type bit is set.
>>> i.e. prog->expected_attach_type is only enforced if this new helper
>>> is used by the prog.
>>>
>>
>> BTW, it seems to me that BPF_CGROUP_INET_EGRESS can be used while the socket lock is not held.
> Thanks for pointing it out.
> 
> ic. I just noticed the comments at ip6_xmit():
> /*
>  * xmit an sk_buff (used by TCP, SCTP and DCCP)
>  * Note : socket lock is not held for SYNACK packets, but might be modified
>  * by calls to skb_set_owner_w() and ipv6_local_error(),
>  * which are using proper atomic operations or spinlocks.
>  */
> Is there other cases other than SYNACK?


Well, I was referring to various virtual devices re-entering the IP stack.

Since we can have a qdisc on any netdev, there is no way we can guarantee the socket is
locked by the current thread.

Random example :

ipvlan_process_v4_outbound()
...
     err = ip_local_out(net, skb->sk, skb);
...




* Re: [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM)
  2019-02-24  4:48             ` Alexei Starovoitov
@ 2019-02-25  1:38               ` David Ahern
  0 siblings, 0 replies; 29+ messages in thread
From: David Ahern @ 2019-02-25  1:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On 2/23/19 11:48 PM, Alexei Starovoitov wrote:
> 'network resources' also include back bone and TOR capacity and
> this mechanism is going to help address that as well.

This appears to be the talk you are referring to:

http://vger.kernel.org/lpc_net2018_talks/LPC%20NRM.pdf

and from my reading it only references throttling at L4 - i.e.,
bandwidth. Hence my request for a better name than 'network resources'
in the commit logs and code references.


* Re: [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce
  2019-02-23  7:30     ` Martin Lau
@ 2019-02-25 10:10       ` Daniel Borkmann
  2019-02-25 16:52         ` Eric Dumazet
  0 siblings, 1 reply; 29+ messages in thread
From: Daniel Borkmann @ 2019-02-25 10:10 UTC (permalink / raw)
  To: Martin Lau
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Eric Dumazet, Kernel Team

On 02/23/2019 08:30 AM, Martin Lau wrote:
> On Sat, Feb 23, 2019 at 02:14:26AM +0100, Daniel Borkmann wrote:
>> On 02/23/2019 02:06 AM, brakmo wrote:
>>> This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
>>> "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
>>> BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
>>> be attached to the ingress and egress path. The helper is needed
>>> because his type of bpf_prog cannot modify the skb directly.
>>>
>>> This helper is used to set the ECN field of ECN capable IP packets to ce
>>> (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
>>> used by a bpf_prog to manage egress or ingress network bandwdith limit
>>> per cgroupv2 by inducing an ECN response in the TCP sender.
>>> This works best when using DCTCP.
>>>
>>> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
>>> ---
>>>  include/uapi/linux/bpf.h | 10 +++++++++-
>>>  net/core/filter.c        | 14 ++++++++++++++
>>>  2 files changed, 23 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 95b5058fa945..fc646f3eaf9b 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2365,6 +2365,13 @@ union bpf_attr {
>>>   *		Make a tcp_sock enter CWR state.
>>>   *	Return
>>>   *		0 on success, or a negative error in case of failure.
>>> + *
>>> + * int bpf_skb_ecn_set_ce(struct sk_buf *skb)
>>> + *	Description
>>> + *		Sets ECN of IP header to ce (congestion encountered) if
>>> + *		current value is ect (ECN capable). Works with IPv6 and IPv4.
>>> + *	Return
>>> + *		1 if set, 0 if not set.
>>>   */
>>>  #define __BPF_FUNC_MAPPER(FN)		\
>>>  	FN(unspec),			\
>>> @@ -2464,7 +2471,8 @@ union bpf_attr {
>>>  	FN(spin_unlock),		\
>>>  	FN(sk_fullsock),		\
>>>  	FN(tcp_sock),			\
>>> -	FN(tcp_enter_cwr),
>>> +	FN(tcp_enter_cwr),		\
>>> +	FN(skb_ecn_set_ce),
>>>  
>>>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>>   * function eBPF program intends to call
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index ca57ef25279c..955369c6ed30 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -5444,6 +5444,18 @@ static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
>>>  	.ret_type    = RET_INTEGER,
>>>  	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
>>>  };
>>> +
>>> +BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb)
>>> +{
>>> +	return INET_ECN_set_ce(skb);
>>
>> Hm, but as mentioned last time, don't we have to ensure here that skb
>> is writable (aka skb->data private to us before writing into it)?
> INET_ECN_set_ce(skb) is also called from a few net/sched/sch_*.c
> but I don't see how they ensure if a skb is writable.
> 
> May be I have missed something there that can also be borrowed and
> reused here?

My understanding is that before doing any writes into skb, we should make
sure the data area is private to us (and offset in linear data). In tc BPF
(ingress, egress) we use bpf_try_make_writable() helper for this, others
like act_{pedit,skbmod} or ovs have similar logic before writing into skb,
note that in all these cases it's mostly about generic writes, so location
could also be L4, for example.

A difference of the above helper compared to the net/sched/sch_*.c instances
could be that i) for the qdisc case INET_ECN_set_ce() is only called on
egress, and ii) there may be a convention that qdiscs specifically may
mangle it, whereas the helper could be called on ingress and egress and
confuse other subsystems since they won't see the original packet, or
race by seeing a partially updated (invalid) packet.

Eric, have a chance to clarify? Perhaps then would make sense to disallow
the helper in cgroup ingress path.


* Re: [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce
  2019-02-25 10:10       ` Daniel Borkmann
@ 2019-02-25 16:52         ` Eric Dumazet
  0 siblings, 0 replies; 29+ messages in thread
From: Eric Dumazet @ 2019-02-25 16:52 UTC (permalink / raw)
  To: Daniel Borkmann, Martin Lau
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Kernel Team



On 02/25/2019 02:10 AM, Daniel Borkmann wrote:

> My understanding is that before doing any writes into skb, we should make
> sure the data area is private to us (and offset in linear data). In tc BPF
> (ingress, egress) we use bpf_try_make_writable() helper for this, others
> like act_{pedit,skbmod} or ovs have similar logic before writing into skb,
> note that in all these cases it's mostly about generic writes, so location
> could also be L4, for example.
> 
> Difference of above helper compared to net/sched/sch_*.c instances could
> be that it's i) for the qdisc case it's only on egress INET_ECN_set_ce()
> and that there may be a convention that qdiscs specifically may mangle
> it whereas the helper could be called on ingress and egress and confuse
> other subsystems since they won't see original or race by seeing partially
> updated (invalid) packet.
> 
> Eric, have a chance to clarify? Perhaps then would make sense to disallow
> the helper in cgroup ingress path.

Good observations Daniel, thanks for bringing this up.

skb_ensure_writable() seems like a big hammer for the case where we change some bits in the IP header.

TCP cloned packets certainly can have their headers mangled, so maybe
we need to use something based on skb_header_cloned() instead of skb_cloned()


* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-23  1:06 ` [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr brakmo
  2019-02-24  1:32   ` Eric Dumazet
@ 2019-02-25 23:14   ` Stanislav Fomichev
  2019-02-26  1:30     ` Martin Lau
  1 sibling, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2019-02-25 23:14 UTC (permalink / raw)
  To: brakmo
  Cc: netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Kernel Team

On 02/22, brakmo wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> 
> This patch adds a new bpf helper BPF_FUNC_tcp_enter_cwr
> "int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)".
> It is added to BPF_PROG_TYPE_CGROUP_SKB which can be attached
> to the egress path where the bpf prog is called by
> ip_finish_output() or ip6_finish_output().  The verifier
> ensures that the parameter must be a tcp_sock.
> 
> This helper makes a tcp_sock enter CWR state.  It can be used
> by a bpf_prog to manage egress network bandwidth limit per
> cgroupv2.  A later patch will have a sample program to
> show how it can be used to limit bandwidth usage per cgroupv2.
> 
> To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
> attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
> during load time if the prog uses this new helper.
> The newly added prog->enforce_expected_attach_type bit will also be set
> if this new helper is used.  This bit is for backward compatibility reason
> because currently prog->expected_attach_type has been ignored in
> BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
> prog->expected_attach_type is only enforced if the
> prog->enforce_expected_attach_type bit is set.
> i.e. prog->expected_attach_type is only enforced if this new helper
> is used by the prog.
> 
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> ---
>  include/linux/bpf.h      |  1 +
>  include/linux/filter.h   |  3 ++-
>  include/uapi/linux/bpf.h |  9 ++++++++-
>  kernel/bpf/syscall.c     | 12 ++++++++++++
>  kernel/bpf/verifier.c    |  4 ++++
>  net/core/filter.c        | 25 +++++++++++++++++++++++++
>  6 files changed, 52 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index d5ba2fc01af3..2d54ba7cf9dd 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -195,6 +195,7 @@ enum bpf_arg_type {
>  	ARG_PTR_TO_SOCKET,	/* pointer to bpf_sock */
>  	ARG_PTR_TO_SPIN_LOCK,	/* pointer to bpf_spin_lock */
>  	ARG_PTR_TO_SOCK_COMMON,	/* pointer to sock_common */
> +	ARG_PTR_TO_TCP_SOCK,    /* pointer to tcp_sock */
>  };
>  
>  /* type of values returned from helper functions */
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index f32b3eca5a04..c6e878bdc5a6 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -510,7 +510,8 @@ struct bpf_prog {
>  				blinded:1,	/* Was blinded */
>  				is_func:1,	/* program is a bpf function */
>  				kprobe_override:1, /* Do we override a kprobe? */
> -				has_callchain_buf:1; /* callchain buffer allocated? */
> +				has_callchain_buf:1, /* callchain buffer allocated? */
> +				enforce_expected_attach_type:1; /* Enforce expected_attach_type checking at attach time */
>  	enum bpf_prog_type	type;		/* Type of BPF program */
>  	enum bpf_attach_type	expected_attach_type; /* For some prog types */
>  	u32			len;		/* Number of filter blocks */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index bcdd2474eee7..95b5058fa945 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2359,6 +2359,12 @@ union bpf_attr {
>   *	Return
>   *		A **struct bpf_tcp_sock** pointer on success, or NULL in
>   *		case of failure.
> + *
> + * int bpf_tcp_enter_cwr(struct bpf_tcp_sock *tp)
> + *	Description
> + *		Make a tcp_sock enter CWR state.
> + *	Return
> + *		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -2457,7 +2463,8 @@ union bpf_attr {
>  	FN(spin_lock),			\
>  	FN(spin_unlock),		\
>  	FN(sk_fullsock),		\
> -	FN(tcp_sock),
> +	FN(tcp_sock),			\
> +	FN(tcp_enter_cwr),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index ec7c552af76b..9a478f2875cd 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1482,6 +1482,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
>  		default:
>  			return -EINVAL;
>  		}
> +	case BPF_PROG_TYPE_CGROUP_SKB:
> +		switch (expected_attach_type) {
> +		case BPF_CGROUP_INET_INGRESS:
> +		case BPF_CGROUP_INET_EGRESS:
> +			return 0;
> +		default:
> +			return -EINVAL;
> +		}
>  	default:
>  		return 0;
>  	}
> @@ -1725,6 +1733,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
>  	case BPF_PROG_TYPE_CGROUP_SOCK:
>  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
>  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> +	case BPF_PROG_TYPE_CGROUP_SKB:
> +		return prog->enforce_expected_attach_type &&
> +			prog->expected_attach_type != attach_type ?
> +			-EINVAL : 0;
>  	default:
>  		return 0;
>  	}
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1b9496c41383..95fb385c6f3c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2424,6 +2424,10 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>  			return -EFAULT;
>  		}
>  		meta->ptr_id = reg->id;
> +	} else if (arg_type == ARG_PTR_TO_TCP_SOCK) {
> +		expected_type = PTR_TO_TCP_SOCK;
> +		if (type != expected_type)
> +			goto err_type;
>  	} else if (arg_type == ARG_PTR_TO_SPIN_LOCK) {
>  		if (meta->func_id == BPF_FUNC_spin_lock) {
>  			if (process_spin_lock(env, regno, true))
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 97916eedfe69..ca57ef25279c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5426,6 +5426,24 @@ static const struct bpf_func_proto bpf_tcp_sock_proto = {
>  	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
>  };
>  
> +BPF_CALL_1(bpf_tcp_enter_cwr, struct tcp_sock *, tp)
> +{
> +	struct sock *sk = (struct sock *)tp;
> +
> +	if (sk->sk_state == TCP_ESTABLISHED) {
> +		tcp_enter_cwr(sk);
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
> +	.func        = bpf_tcp_enter_cwr,
> +	.gpl_only    = false,
> +	.ret_type    = RET_INTEGER,
> +	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
> +};
>  #endif /* CONFIG_INET */
>  
>  bool bpf_helper_changes_pkt_data(void *func)
> @@ -5585,6 +5603,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
>  #ifdef CONFIG_INET
>  	case BPF_FUNC_tcp_sock:
>  		return &bpf_tcp_sock_proto;

[...]
> +	case BPF_FUNC_tcp_enter_cwr:
> +		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
> +			prog->enforce_expected_attach_type = 1;
> +			return &bpf_tcp_enter_cwr_proto;
Instead of this back and forth with enforce_expected_attach_type, can we
just do here:

if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS)
	return &bpf_tcp_enter_cwr_proto;
else
	return NULL;

Wouldn't it have the same effect?

> +		} else {
> +			return NULL;
> +		}
>  #endif
>  	default:
>  		return sk_filter_func_proto(func_id, prog);
> -- 
> 2.17.1
> 


* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-25 23:14   ` Stanislav Fomichev
@ 2019-02-26  1:30     ` Martin Lau
  2019-02-26  3:32       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Martin Lau @ 2019-02-26  1:30 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Kernel Team

On Mon, Feb 25, 2019 at 03:14:38PM -0800, Stanislav Fomichev wrote:
[ ... ]

> > 
> > To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
> > attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
> > during load time if the prog uses this new helper.
> > The newly added prog->enforce_expected_attach_type bit will also be set
> > if this new helper is used.  This bit is for backward compatibility reason
> > because currently prog->expected_attach_type has been ignored in
> > BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
> > prog->expected_attach_type is only enforced if the
> > prog->enforce_expected_attach_type bit is set.
> > i.e. prog->expected_attach_type is only enforced if this new helper
> > is used by the prog.
> > 
[ ... ]

> > @@ -1725,6 +1733,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> >  	case BPF_PROG_TYPE_CGROUP_SOCK:
> >  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> >  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> > +	case BPF_PROG_TYPE_CGROUP_SKB:
> > +		return prog->enforce_expected_attach_type &&
> > +			prog->expected_attach_type != attach_type ?
> > +			-EINVAL : 0;
> >  	default:
> >  		return 0;
> >  	}
[ ... ]

> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 97916eedfe69..ca57ef25279c 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -5426,6 +5426,24 @@ static const struct bpf_func_proto bpf_tcp_sock_proto = {
> >  	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
> >  };
> >  
> > +BPF_CALL_1(bpf_tcp_enter_cwr, struct tcp_sock *, tp)
> > +{
> > +	struct sock *sk = (struct sock *)tp;
> > +
> > +	if (sk->sk_state == TCP_ESTABLISHED) {
> > +		tcp_enter_cwr(sk);
> > +		return 0;
> > +	}
> > +
> > +	return -EINVAL;
> > +}
> > +
> > +static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
> > +	.func        = bpf_tcp_enter_cwr,
> > +	.gpl_only    = false,
> > +	.ret_type    = RET_INTEGER,
> > +	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
> > +};
> >  #endif /* CONFIG_INET */
> >  
> >  bool bpf_helper_changes_pkt_data(void *func)
> > @@ -5585,6 +5603,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
> >  #ifdef CONFIG_INET
> >  	case BPF_FUNC_tcp_sock:
> >  		return &bpf_tcp_sock_proto;
> 
> [...]
> > +	case BPF_FUNC_tcp_enter_cwr:
> > +		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
> > +			prog->enforce_expected_attach_type = 1;
> > +			return &bpf_tcp_enter_cwr_proto;
> Instead of this back and forth with enforce_expected_attach_type, can we
> just do here:
> 
> if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS)
> 	return &bpf_tcp_enter_cwr_proto;
> else
> 	return NULL;
> 
> Wouldn't it have the same effect?
The attr->expected_attach_type is currently ignored (i.e. not checked)
during the bpf load time.

How to avoid breaking backward compatibility without selectively
enforcing prog->expected_attach_type during attach time?


* Re: [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr
  2019-02-26  1:30     ` Martin Lau
@ 2019-02-26  3:32       ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2019-02-26  3:32 UTC (permalink / raw)
  To: Martin Lau
  Cc: Lawrence Brakmo, netdev, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Kernel Team

On 02/26, Martin Lau wrote:
> On Mon, Feb 25, 2019 at 03:14:38PM -0800, Stanislav Fomichev wrote:
> [ ... ]
> 
> > > 
> > > To ensure it is only called from BPF_CGROUP_INET_EGRESS, the
> > > attr->expected_attach_type must be specified as BPF_CGROUP_INET_EGRESS
> > > during load time if the prog uses this new helper.
> > > The newly added prog->enforce_expected_attach_type bit will also be set
> > > if this new helper is used.  This bit is for backward compatibility reason
> > > because currently prog->expected_attach_type has been ignored in
> > > BPF_PROG_TYPE_CGROUP_SKB.  During attach time,
> > > prog->expected_attach_type is only enforced if the
> > > prog->enforce_expected_attach_type bit is set.
> > > i.e. prog->expected_attach_type is only enforced if this new helper
> > > is used by the prog.
> > > 
> [ ... ]
> 
> > > @@ -1725,6 +1733,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
> > >  	case BPF_PROG_TYPE_CGROUP_SOCK:
> > >  	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
> > >  		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
> > > +	case BPF_PROG_TYPE_CGROUP_SKB:
> > > +		return prog->enforce_expected_attach_type &&
> > > +			prog->expected_attach_type != attach_type ?
> > > +			-EINVAL : 0;
> > >  	default:
> > >  		return 0;
> > >  	}
> [ ... ]
> 
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 97916eedfe69..ca57ef25279c 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -5426,6 +5426,24 @@ static const struct bpf_func_proto bpf_tcp_sock_proto = {
> > >  	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
> > >  };
> > >  
> > > +BPF_CALL_1(bpf_tcp_enter_cwr, struct tcp_sock *, tp)
> > > +{
> > > +	struct sock *sk = (struct sock *)tp;
> > > +
> > > +	if (sk->sk_state == TCP_ESTABLISHED) {
> > > +		tcp_enter_cwr(sk);
> > > +		return 0;
> > > +	}
> > > +
> > > +	return -EINVAL;
> > > +}
> > > +
> > > +static const struct bpf_func_proto bpf_tcp_enter_cwr_proto = {
> > > +	.func        = bpf_tcp_enter_cwr,
> > > +	.gpl_only    = false,
> > > +	.ret_type    = RET_INTEGER,
> > > +	.arg1_type    = ARG_PTR_TO_TCP_SOCK,
> > > +};
> > >  #endif /* CONFIG_INET */
> > >  
> > >  bool bpf_helper_changes_pkt_data(void *func)
> > > @@ -5585,6 +5603,13 @@ cg_skb_func_proto(enum bpf_func_id func_id, struct bpf_prog *prog)
> > >  #ifdef CONFIG_INET
> > >  	case BPF_FUNC_tcp_sock:
> > >  		return &bpf_tcp_sock_proto;
> > 
> > [...]
> > > +	case BPF_FUNC_tcp_enter_cwr:
> > > +		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
> > > +			prog->enforce_expected_attach_type = 1;
> > > +			return &bpf_tcp_enter_cwr_proto;
> > Instead of this back and forth with enforce_expected_attach_type, can we
> > just do here:
> > 
> > if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS)
> > 	return &bpf_tcp_enter_cwr_proto;
> > else
> > 	return NULL;
> > 
> > Wouldn't it have the same effect?
> The attr->expected_attach_type is currently ignored (i.e. not checked)
> during the bpf load time.
But nothing stops you from checking prog->expected_attach_type in
the cg_skb_func_proto, right? That is done at the time of loading.
So depending on prog->expected_attach_type just return null or non-null
and the verifier will take care of the rest. Then, at attach time just
make sure we are attaching it to the expected_attach_type.

We also should not have any existing use cases for
BPF_FUNC_tcp_enter_cwr I suppose.

In other words, why something like below won't work? Am I missing something?

---

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b155cd17c1bd..86dc7cd00f34 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1678,6 +1678,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		if (prog->expected_attach_type)
+			return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+		return 0;
 	default:
 		return 0;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 7559d6835ecb..56f70468fc7a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5396,6 +5396,11 @@ cg_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	switch (func_id) {
 	case BPF_FUNC_get_local_storage:
 		return &bpf_get_local_storage_proto;
+	case BPF_FUNC_tcp_enter_cwr:
+		if (prog->expected_attach_type == BPF_CGROUP_INET_EGRESS)
+			return &bpf_tcp_enter_cwr_proto;
+		else
+			return NULL;
 	default:
 		return sk_filter_func_proto(func_id, prog);
 	}

> 
> How to avoid breaking backward compatibility without selectively
> enforcing prog->expected_attach_type during attach time?


end of thread, other threads:[~2019-02-26  3:32 UTC | newest]

Thread overview: 29+ messages
2019-02-23  1:06 [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) brakmo
2019-02-23  1:06 ` [PATCH v2 bpf-next 1/9] bpf: Remove const from get_func_proto brakmo
2019-02-23  1:06 ` [PATCH v2 bpf-next 2/9] bpf: Add bpf helper bpf_tcp_enter_cwr brakmo
2019-02-24  1:32   ` Eric Dumazet
2019-02-24  3:08     ` Martin Lau
2019-02-24  4:44       ` Alexei Starovoitov
2019-02-24 18:00       ` Eric Dumazet
2019-02-25 23:14   ` Stanislav Fomichev
2019-02-26  1:30     ` Martin Lau
2019-02-26  3:32       ` Stanislav Fomichev
2019-02-23  1:06 ` [PATCH v2 bpf-next 3/9] bpf: Test bpf_tcp_enter_cwr in test_verifier brakmo
2019-02-23  1:06 ` [PATCH v2 bpf-next 4/9] bpf: add bpf helper bpf_skb_ecn_set_ce brakmo
2019-02-23  1:14   ` Daniel Borkmann
2019-02-23  7:30     ` Martin Lau
2019-02-25 10:10       ` Daniel Borkmann
2019-02-25 16:52         ` Eric Dumazet
2019-02-23  1:06 ` [PATCH v2 bpf-next 5/9] bpf: Add bpf helper bpf_tcp_check_probe_timer brakmo
2019-02-23  1:07 ` [PATCH v2 bpf-next 6/9] bpf: sync bpf.h to tools and update bpf_helpers.h brakmo
2019-02-23  1:07 ` [PATCH v2 bpf-next 7/9] bpf: Sample NRM BPF program to limit egress bw brakmo
2019-02-23  1:07 ` [PATCH v2 bpf-next 8/9] bpf: User program for testing NRM brakmo
2019-02-23  1:07 ` [PATCH v2 bpf-next 9/9] bpf: NRM test script brakmo
2019-02-23  3:03 ` [PATCH v2 bpf-next 0/9] bpf: Network Resource Manager (NRM) David Ahern
2019-02-23 18:39   ` Eric Dumazet
2019-02-23 20:40     ` Alexei Starovoitov
2019-02-23 20:43       ` Eric Dumazet
2019-02-23 23:25         ` Alexei Starovoitov
2019-02-24  2:58           ` David Ahern
2019-02-24  4:48             ` Alexei Starovoitov
2019-02-25  1:38               ` David Ahern
