* [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
@ 2019-03-23  8:05 brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.

Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patch-set adds support
for cgroup skb BPF programs to send congestion notifications in the
return value when the packets are TCP packets. Rather than the
current 1 for keeping the packet and 0 for dropping it, they can
now return:
    NET_XMIT_SUCCESS    (0)    - continue with packet output
    NET_XMIT_DROP       (1)    - drop packet and do cn
    NET_XMIT_CN         (2)    - continue with packet output and do cn
    -EPERM                     - drop packet
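
To make the new return contract concrete, here is a minimal,
illustrative sketch of a cgroup skb egress BPF program using this
series' convention: the program itself returns 0-3 (patches 1 and 2),
which the kernel folds into the NET_XMIT codes listed above. The
length check is a made-up stand-in for a real congestion signal such
as the credit tracking in hbm_out_kern.c, and the section name follows
the libbpf convention (the sample loader in this series may differ):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"        /* for SEC(), as used in samples/bpf */

  SEC("cgroup_skb/egress")
  int egress_cn_sketch(struct __sk_buff *skb)
  {
          /* Placeholder condition; a real program would consult a
           * BPF map (e.g. rate/credit state) instead.
           */
          if (skb->len > 1500)
                  return 2;       /* drop packet and request cn */
          return 1;               /* keep packet, no cn */
  }

  char _license[] SEC("license") = "GPL";

Note that, per patch 2, such a program must be loaded with
expected_attach_type set to BPF_CGROUP_INET_EGRESS before the verifier
accepts return values 2 and 3.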

There is also support for setting the probe timer to a small value,
specified by a sysctl, when a packet is dropped by the queue_xmit
call in __tcp_transmit_skb() and there are no other packets in
transit.

In addition, HBM programs are modified to collect and return more
statistics.

The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.
Smaller probe timers improve the performance of Cubic and DCTCP when the
rates are small enough that HBM sometimes cannot send a packet every
RTT while still maintaining the bandwidth limit.

The following results were obtained for rate limits of 1Gbps and 200Mbps
using netperf with a single flow between two servers. We also show how
reducing the max delayed ACK timer can improve the performance when
using Cubic.

A follow-up patch will add support for fq's Earliest Departure Time (EDT).

The command used was:
  ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                   -s=<server running netserver>
  where:
     <rate>   is 1000 or 200
     --no_cn  specifies no cwr notifications
     dctcp    use DCTCP as the congestion control

                       Cubic                    DCTCP
Lim,Prob,DA    Mbps cwnd cred drops  Mbps cwnd cred drops
------------   ---- ---- ---- -----  ---- ---- ---- -----
  1G, 0,40       35  462 -320 67%     995    1 -212  0.05%
  1G, 0,40,cn   349    3 -229  0.15   995    1 -212  0.05
  1G, 0, 5,cn   941    2 -189  0.13   995    1 -212  0.05

200M, 0,40,cn    50    3 -152  0.34    31    3 -203  0.50
200M, 0, 5,cn    43    2 -202  0.48    33    3 -199  0.50
200M,20, 5,cn   199    2 -209  0.38   199    1 -214  0.30

Notes:
  --no_cn has no effect with DCTCP
  Lim = rate limit
  Prob = Probe timer
  DA = maximum delay ack timer
  cred = credit in packets
  drops = % packets dropped

brakmo (7):
  bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
  bpf: cgroup inet skb programs can return 0 to 3
  bpf: Update __cgroup_bpf_run_filter_skb with cn
  bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
  bpf: sysctl for probe_on_drop
  bpf: Add cn support to hbm_out_kern.c
  bpf: Add more stats to HBM

 include/linux/bpf.h        | 50 +++++++++++++++++++++++++++++
 include/linux/filter.h     |  3 +-
 include/net/netns/ipv4.h   |  1 +
 kernel/bpf/cgroup.c        | 25 ++++++++++++---
 kernel/bpf/syscall.c       | 12 +++++++
 kernel/bpf/verifier.c      | 16 +++++++--
 net/ipv4/ip_output.c       | 39 ++++++++++++----------
 net/ipv4/sysctl_net_ipv4.c | 10 ++++++
 net/ipv4/tcp_ipv4.c        |  1 +
 net/ipv4/tcp_output.c      | 18 +++++++++--
 net/ipv6/ip6_output.c      | 22 +++++++------
 samples/bpf/do_hbm_test.sh | 10 ++++--
 samples/bpf/hbm.c          | 51 +++++++++++++++++++++++++++--
 samples/bpf/hbm.h          |  9 +++++-
 samples/bpf/hbm_kern.h     | 66 ++++++++++++++++++++++++++++++++++++--
 samples/bpf/hbm_out_kern.c | 48 +++++++++++++++++++--------
 16 files changed, 321 insertions(+), 60 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
__cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
request cwr for TCP packets.

Current cgroup skb programs can only return 0 or 1 (0 to drop the
packet). This macro changes the behavior so the low order bit
indicates whether the packet should be dropped (0) or not (1)
and the next bit is used for congestion notification (cn).

Hence, new allowed return values of CGROUP EGRESS BPF programs are:
  0: drop packet
  1: keep packet
  2: drop packet and call cwr
  3: keep packet and call cwr

This macro then converts the aggregated result to one of the NET_XMIT
values, or to -EPERM, which callers interpret as a drop with no cn:
  0 (drop, no cn)  -> -EPERM            skb is dropped (no cn)
  1 (keep, no cn)  -> NET_XMIT_SUCCESS  skb is transmitted (no cn)
  2 (drop and cn)  -> NET_XMIT_DROP     skb is dropped and cwr is called
  3 (keep and cn)  -> NET_XMIT_CN       skb is transmitted and cwr is called

Note that when more than one BPF program is called, the packet is
dropped if at least one of the programs requests it be dropped, and
there is cn if at least one program returns cn.
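
As a quick illustration of that aggregation rule, here is a small,
self-contained user-space sketch (the two sample return values are
made up; the bit handling mirrors the macro below):

  /* bit 0 (keep) is ANDed across programs, bit 1 (cn) is ORed */
  #include <stdio.h>

  int main(void)
  {
          unsigned int rets[] = { 1 /* keep */, 2 /* drop and cn */ };
          unsigned int keep = 1, cn = 0;
          int i;

          for (i = 0; i < 2; i++) {
                  keep &= rets[i] & 1;
                  cn |= rets[i] & 2;
          }
          /* keep == 0 and cn != 0 here, so the macro would return
           * NET_XMIT_DROP: the packet is dropped and cn is requested.
           */
          printf("keep=%u cn=%u\n", keep, cn);
          return 0;
  }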

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/bpf.h | 50 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f62897198844..4037cf952d86 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -515,6 +515,56 @@ _out:							\
 		_ret;					\
 	 })
 
+/* To be used by __cgroup_bpf_run_filter_skb for EGRESS BPF progs
+ * so BPF programs can request cwr for TCP packets.
+ *
+ * Current cgroup skb programs can only return 0 or 1 (0 to drop the
+ * packet. This macro changes the behavior so the low order bit
+ * indicates whether the packet should be dropped (0) or not (1)
+ * and the next bit is a congestion notification bit. This could be
+ * used by TCP to call tcp_enter_cwr()
+ *
+ * Hence, new allowed return values of CGROUP EGRESS BPF programs are:
+ *   0: drop packet
+ *   1: keep packet
+ *   2: drop packet and cn
+ *   3: keep packet and cn
+ *
+ * This macro then converts it to one of the NET_XMIT or an error
+ * code that is then interpreted as drop packet (and no cn):
+ *   0: NET_XMIT_SUCCESS  skb should be transmitted
+ *   1: NET_XMIT_DROP     skb should be dropped and cn
+ *   2: NET_XMIT_CN       skb should be transmitted and cn
+ *   3: -EPERM            skb should be dropped
+ */
+#define BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(array, ctx, func)		\
+	({						\
+		struct bpf_prog_array_item *_item;	\
+		struct bpf_prog *_prog;			\
+		struct bpf_prog_array *_array;		\
+		u32 ret;				\
+		u32 _ret = 1;				\
+		u32 _cn = 0;				\
+		preempt_disable();			\
+		rcu_read_lock();			\
+		_array = rcu_dereference(array);	\
+		_item = &_array->items[0];		\
+		while ((_prog = READ_ONCE(_item->prog))) {		\
+			bpf_cgroup_storage_set(_item->cgroup_storage);	\
+			ret = func(_prog, ctx);		\
+			_ret &= (ret & 1);		\
+			_cn |= (ret & 2);		\
+			_item++;			\
+		}					\
+		rcu_read_unlock();			\
+		preempt_enable_no_resched();		\
+		if (_ret)				\
+			_ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS);	\
+		else					\
+			_ret = (_cn ? NET_XMIT_DROP : -EPERM);		\
+		_ret;					\
+	})
+
 #define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
 	__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to determine if congestion occurred and the
higher level protocol should decrease its rate, e.g. TCP would call
tcp_enter_cwr().

The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).

The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB.  For example, an existing bpf_prog with
expected_attach_type set to BPF_CGROUP_INET_EGRESS can attach to
BPF_CGROUP_INET_INGRESS.  Blindly enforcing expected_attach_type would
break backward compatibility.

This patch adds an enforce_expected_attach_type bit so that
expected_attach_type is only enforced when the program uses the new
return values.
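
For illustration, here is a rough user-space sketch of loading such a
program with expected_attach_type set via the bpf(2) syscall. The
two-instruction program body is a placeholder and error handling is
omitted:

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  /* Placeholder program: unconditionally "return 3" (keep and cn). */
  static struct bpf_insn insns[] = {
          { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0,
            .imm = 3 },
          { .code = BPF_JMP | BPF_EXIT },
  };

  static int load_egress_prog(void)
  {
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.prog_type = BPF_PROG_TYPE_CGROUP_SKB;
          /* Lets the verifier accept return values 2 and 3 and
           * restricts attachment to BPF_CGROUP_INET_EGRESS.
           */
          attr.expected_attach_type = BPF_CGROUP_INET_EGRESS;
          attr.insns = (unsigned long)insns;
          attr.insn_cnt = 2;
          attr.license = (unsigned long)"GPL";
          return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
  }

Attaching such a program to BPF_CGROUP_INET_INGRESS would then be
rejected at attach time, which is the enforcement this patch adds.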

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h |  3 ++-
 kernel/bpf/syscall.c   | 12 ++++++++++++
 kernel/bpf/verifier.c  | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6074aa064b54..67755a837c48 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -510,7 +510,8 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				enforce_expected_attach_type:1; /* Enforce expected_attach_type checking at attach time */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 62f6bced3a3c..b32facec19f3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1513,6 +1513,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		switch (expected_attach_type) {
+		case BPF_CGROUP_INET_INGRESS:
+		case BPF_CGROUP_INET_EGRESS:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	default:
 		return 0;
 	}
@@ -1757,6 +1765,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		return prog->enforce_expected_attach_type &&
+			prog->expected_attach_type != attach_type ?
+			-EINVAL : 0;
 	default:
 		return 0;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index dffeec3706ce..f9dfb6e756d3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5118,11 +5118,16 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 
 static int check_return_code(struct bpf_verifier_env *env)
 {
+	struct tnum enforce_attach_type_range = tnum_unknown;
 	struct bpf_reg_state *reg;
 	struct tnum range = tnum_range(0, 1);
 
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SKB:
+		if (env->prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+			range = tnum_range(0, 3);
+			enforce_attach_type_range = tnum_range(2, 3);
+		}
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_SOCK_OPS:
@@ -5140,18 +5145,23 @@ static int check_return_code(struct bpf_verifier_env *env)
 	}
 
 	if (!tnum_in(range, reg->var_off)) {
+		char tn_buf[48];
+
 		verbose(env, "At program exit the register R0 ");
 		if (!tnum_is_unknown(reg->var_off)) {
-			char tn_buf[48];
-
 			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
 			verbose(env, "has value %s", tn_buf);
 		} else {
 			verbose(env, "has unknown scalar value");
 		}
-		verbose(env, " should have been 0 or 1\n");
+		tnum_strn(tn_buf, sizeof(tn_buf), range);
+		verbose(env, " should have been %s\n", tn_buf);
 		return -EINVAL;
 	}
+
+	if (!tnum_is_unknown(enforce_attach_type_range) &&
+	    tnum_in(enforce_attach_type_range, reg->var_off))
+		env->prog->enforce_expected_attach_type = 1;
 	return 0;
 }
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

For egress packets, __cgroup_bpf_run_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of BPF_PROG_RUN_ARRAY()
in order to propagate congestion notification (cn) requests to TCP
callers.

For egress packets, this function can return:
   NET_XMIT_SUCCESS    (0)    - continue with packet output
   NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
   NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                to call cwr
   -EPERM                     - drop packet

For ingress packets, this function will return -EPERM if any attached
program was found and if it returned != 1 during execution. Otherwise 0
is returned.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 kernel/bpf/cgroup.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 4e807973aa80..ad8108a02153 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -545,8 +545,16 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr,
  * The program type passed in via @type must be suitable for network
  * filtering. No further check is performed to assert that.
  *
- * This function will return %-EPERM if any if an attached program was found
- * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ * For egress packets, this function can return:
+ *   NET_XMIT_SUCCESS    (0)	- continue with packet output
+ *   NET_XMIT_DROP       (1)	- drop packet and notify TCP to call cwr
+ *   NET_XMIT_CN         (2)	- continue with packet output and notify TCP
+ *				  to call cwr
+ *   -EPERM			- drop packet
+ *
+ * For ingress packets, this function will return -EPERM if any
+ * attached program was found and if it returned != 1 during execution.
+ * Otherwise 0 is returned.
  */
 int __cgroup_bpf_run_filter_skb(struct sock *sk,
 				struct sk_buff *skb,
@@ -572,12 +580,19 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	/* compute pointers for the bpf prog */
 	bpf_compute_and_save_data_end(skb, &saved_data_end);
 
-	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
-				 __bpf_prog_run_save_cb);
+	if (type == BPF_CGROUP_INET_EGRESS) {
+		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
+			cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
+	} else {
+		ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
+					  __bpf_prog_run_save_cb);
+		ret = (ret == 1 ? 0 : -EPERM);
+	}
 	bpf_restore_data_end(skb, saved_data_end);
 	__skb_pull(skb, offset);
 	skb->sk = save_sk;
-	return ret == 1 ? 0 : -EPERM;
+
+	return ret;
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (2 preceding siblings ...)
  2019-03-23  8:05 ` [PATCH bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications.

If BPF_CGROUP_RUN_PROG_INET_EGRESS() returns a value other than
NET_XMIT_SUCCESS or NET_XMIT_CN, the skb is dropped and the value
is returned to the caller.

Otherwise, the output function is called. If its return value is not
NET_XMIT_SUCCESS, that value is returned; if it is NET_XMIT_SUCCESS,
the value returned by BPF_CGROUP_RUN_PROG_INET_EGRESS() is returned
instead, so that NET_XMIT_CN is propagated to the caller.
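
To make the precedence explicit, here is a small, self-contained
user-space sketch of the combination rule; the NET_XMIT_* values are
copied from include/linux/netdevice.h and combine() merely mirrors the
logic added to the callers in the diff below:

  #include <errno.h>
  #include <stdio.h>

  #define NET_XMIT_SUCCESS 0x00
  #define NET_XMIT_DROP    0x01
  #define NET_XMIT_CN      0x02

  /* ret_bpf: value from BPF_CGROUP_RUN_PROG_INET_EGRESS()
   * ret_out: value the output path would have returned; only
   *          meaningful when the skb is not dropped
   */
  static int combine(int ret_bpf, int ret_out)
  {
          if (ret_bpf != NET_XMIT_SUCCESS && ret_bpf != NET_XMIT_CN)
                  return ret_bpf;         /* skb dropped, value returned */
          /* output ran; keep its error, else propagate ret_bpf */
          return ret_out ? ret_out : ret_bpf;
  }

  int main(void)
  {
          printf("%d\n", combine(NET_XMIT_CN, NET_XMIT_SUCCESS));  /* 2 */
          printf("%d\n", combine(-EPERM, 0));                      /* -1 */
          return 0;
  }

The same rule is applied to ip_finish_output(), ip_mc_finish_output()
and ip6_finish_output() below.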

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/ip_output.c  | 39 ++++++++++++++++++++++-----------------
 net/ipv6/ip6_output.c | 22 +++++++++++++---------
 2 files changed, 35 insertions(+), 26 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c80188875f39..efa0b9a195b4 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -292,43 +292,48 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	unsigned int mtu;
+	int ret_bpf;
 	int ret;
 
-	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-	if (ret) {
+	ret_bpf = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	if (ret_bpf != NET_XMIT_SUCCESS && ret_bpf != NET_XMIT_CN) {
 		kfree_skb(skb);
-		return ret;
+		return ret_bpf;
 	}
 
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
 	/* Policy lookup after SNAT yielded a new policy */
 	if (skb_dst(skb)->xfrm) {
 		IPCB(skb)->flags |= IPSKB_REROUTED;
-		return dst_output(net, sk, skb);
-	}
+		ret = dst_output(net, sk, skb);
+	} else
 #endif
-	mtu = ip_skb_dst_mtu(sk, skb);
-	if (skb_is_gso(skb))
-		return ip_finish_output_gso(net, sk, skb, mtu);
-
-	if (skb->len > mtu || (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
-		return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
-
-	return ip_finish_output2(net, sk, skb);
+	{
+		mtu = ip_skb_dst_mtu(sk, skb);
+		if (skb_is_gso(skb))
+			ret = ip_finish_output_gso(net, sk, skb, mtu);
+		else if (skb->len > mtu || (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
+			ret = ip_fragment(net, sk, skb, mtu, ip_finish_output2);
+		else
+			ret = ip_finish_output2(net, sk, skb);
+	}
+	return ret ? : ret_bpf;
 }
 
 static int ip_mc_finish_output(struct net *net, struct sock *sk,
 			       struct sk_buff *skb)
 {
+	int ret_bpf;
 	int ret;
 
-	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-	if (ret) {
+	ret_bpf = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	if (ret_bpf != NET_XMIT_SUCCESS && ret_bpf != NET_XMIT_CN) {
 		kfree_skb(skb);
-		return ret;
+		return ret_bpf;
 	}
 
-	return dev_loopback_xmit(net, sk, skb);
+	ret = dev_loopback_xmit(net, sk, skb);
+	return ret ? : ret_bpf;
 }
 
 int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index edbd12067170..53a838d82a21 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -130,28 +130,32 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 
 static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	int ret_bpf;
 	int ret;
 
-	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-	if (ret) {
+	ret_bpf = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	if (ret_bpf != NET_XMIT_SUCCESS && ret_bpf != NET_XMIT_CN) {
 		kfree_skb(skb);
-		return ret;
+		return ret_bpf;
 	}
 
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
 	/* Policy lookup after SNAT yielded a new policy */
 	if (skb_dst(skb)->xfrm) {
 		IPCB(skb)->flags |= IPSKB_REROUTED;
-		return dst_output(net, sk, skb);
-	}
+		ret = dst_output(net, sk, skb);
+	} else
 #endif
 
 	if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
 	    dst_allfrag(skb_dst(skb)) ||
-	    (IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
-		return ip6_fragment(net, sk, skb, ip6_finish_output2);
-	else
-		return ip6_finish_output2(net, sk, skb);
+	    (IP6CB(skb)->frag_max_size && skb->len >
+	     IP6CB(skb)->frag_max_size)) {
+		ret = ip6_fragment(net, sk, skb, ip6_finish_output2);
+	} else {
+		ret = ip6_finish_output2(net, sk, skb);
+	}
+	return ret ? : ret_bpf;
 }
 
 int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (3 preceding siblings ...)
  2019-03-23  8:05 ` [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

When a packet is dropped by the queue_xmit call in __tcp_transmit_skb()
and packets_out is 0, it is beneficial to set a small probe timer.
Otherwise, the throughput of the flow can suffer because it may need to
rely on the probe timer to start sending again. The probe timer's
default value is at least 200ms; this patch sets it to 20ms when a
packet is dropped and there are no other packets in flight.

This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
used to specify the duration of the probe timer for the case described
earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
disables setting the probe timer with a small value.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
 net/ipv4/tcp_ipv4.c        |  1 +
 net/ipv4/tcp_output.c      | 18 +++++++++++++++---
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 104a6669e344..d5716a193883 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -165,6 +165,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
 	int sysctl_tcp_comp_sack_nr;
+	int sysctl_tcp_probe_on_drop_ms;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	struct inet_timewait_death_row tcp_death_row;
 	int sysctl_max_syn_backlog;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index ba0fc4b18465..50837e66313f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static int comp_sack_nr_max = 255;
 static u32 u32_max_div_HZ = UINT_MAX / HZ;
+static int probe_on_drop_max = TCP_RTO_MIN;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -1219,6 +1220,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1		= &zero,
 		.extra2		= &comp_sack_nr_max,
 	},
+	{
+		.procname	= "tcp_probe_on_drop_ms",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &probe_on_drop_max,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 277d71239d75..5aba95850d61 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2679,6 +2679,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
 	atomic_set(&net->ipv4.tfo_active_disable_times, 0);
+	net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;
 
 	/* Reno is always built in */
 	if (!net_eq(net, &init_net) &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4522579aaca2..95a0102fde3b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1158,9 +1158,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 
 	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
 
-	if (unlikely(err > 0)) {
-		tcp_enter_cwr(sk);
-		err = net_xmit_eval(err);
+	if (unlikely(err)) {
+		if (unlikely(err > 0)) {
+			tcp_enter_cwr(sk);
+			err = net_xmit_eval(err);
+		}
+		/* Packet was dropped. If there are no packets out,
+		 * we may need to depend on probe timer to start sending
+		 * again. Hence, use a smaller value.
+		 */
+		if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
+		    sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
+			tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
+					     sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
+					     TCP_RTO_MAX, NULL);
+		}
 	}
 	if (!err && oskb) {
 		tcp_update_skb_after_send(sk, oskb, prior_wstamp);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (4 preceding siblings ...)
  2019-03-23  8:05 ` [PATCH bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  8:05 ` [PATCH bpf-next 7/7] bpf: Add more stats to HBM brakmo
  2019-03-23  9:12 ` [PATCH bpf-next 0/7] bpf: Propagate cn to TCP Eric Dumazet
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Update hbm_out_kern.c to support returning cn notifications. Also
update the relevant files to allow disabling cn notifications.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/do_hbm_test.sh | 10 +++++++---
 samples/bpf/hbm.c          | 18 +++++++++++++++---
 samples/bpf/hbm.h          |  3 ++-
 samples/bpf/hbm_out_kern.c | 26 +++++++++++++++++++++-----
 4 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/samples/bpf/do_hbm_test.sh b/samples/bpf/do_hbm_test.sh
index 56c8b4115c95..e48b047d4646 100755
--- a/samples/bpf/do_hbm_test.sh
+++ b/samples/bpf/do_hbm_test.sh
@@ -13,10 +13,10 @@ Usage() {
   echo "egress or ingress bandwidht. It then uses iperf3 or netperf to create"
   echo "loads. The output is the goodput in Mbps (unless -D was used)."
   echo ""
-  echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]"
-  echo "             [-d=<delay>|--delay=<delay>] [--debug] [-E]"
+  echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>]"
+  echo "             [-D] [-d=<delay>|--delay=<delay>] [--debug] [-E]"
   echo "             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id >]"
-  echo "             [-l] [-N] [-p=<port>|--port=<port>] [-P]"
+  echo "             [-l] [-N] [--no_cn] [-p=<port>|--port=<port>] [-P]"
   echo "             [-q=<qdisc>] [-R] [-s=<server>|--server=<server]"
   echo "             [-S|--stats] -t=<time>|--time=<time>] [-w] [cubic|dctcp]"
   echo "  Where:"
@@ -33,6 +33,7 @@ Usage() {
   echo "    -f or --flows     number of concurrent flows (default=1)"
   echo "    -i or --id        cgroup id (an integer, default is 1)"
   echo "    -N                use netperf instead of iperf3"
+  echo "    --no_cn           Do not return CN notifications"
   echo "    -l                do not limit flows using loopback"
   echo "    -h                Help"
   echo "    -p or --port      iperf3 port (default is 5201)"
@@ -115,6 +116,9 @@ processArgs () {
     -c=*|--cc=*)
       cc="${i#*=}"
       ;;
+    --no_cn)
+      flags="$flags --no_cn"
+      ;;
     --debug)
       flags="$flags -d"
       debug_flag=1
diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index a79828ab273f..c5629bae67ab 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -16,6 +16,7 @@
  *    -l	Also limit flows doing loopback
  *    -n <#>	To create cgroup \"/hbm#\" and attach prog
  *		Default is /hbm1
+ *    --no_cn   Do not return cn notifications
  *    -r <rate>	Rate limit in Mbps
  *    -s	Get HBM stats (marked, dropped, etc.)
  *    -t <time>	Exit after specified seconds (default is 0)
@@ -42,6 +43,7 @@
 
 #include <linux/bpf.h>
 #include <bpf/bpf.h>
+#include <getopt.h>
 
 #include "bpf_load.h"
 #include "bpf_rlimit.h"
@@ -59,6 +61,7 @@ bool stats_flag;
 bool loopback_flag;
 bool debugFlag;
 bool work_conserving_flag;
+bool no_cn_flag;
 
 static void Usage(void);
 static void read_trace_pipe2(void);
@@ -185,6 +188,7 @@ static int run_bpf_prog(char *prog, int cg_id)
 	qstats.rate = rate;
 	qstats.stats = stats_flag ? 1 : 0;
 	qstats.loopback = loopback_flag ? 1 : 0;
+	qstats.no_cn = no_cn_flag ? 1 : 0;
 	if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
 		printf("ERROR: Could not update map element\n");
 		goto err;
@@ -366,14 +370,15 @@ static void Usage(void)
 {
 	printf("This program loads a cgroup skb BPF program to enforce\n"
 	       "cgroup output (egress) bandwidth limits.\n\n"
-	       "USAGE: hbm [-o] [-d]  [-l] [-n <id>] [-r <rate>] [-s]\n"
-	       "           [-t <secs>] [-w] [-h] [prog]\n"
+	       "USAGE: hbm [-o] [-d]  [-l] [-n <id>] [--no_cn] [-r <rate>]\n"
+	       "           [-s] [-t <secs>] [-w] [-h] [prog]\n"
 	       "  Where:\n"
 	       "    -o         indicates egress direction (default)\n"
 	       "    -d         print BPF trace debug buffer\n"
 	       "    -l         also limit flows using loopback\n"
 	       "    -n <#>     to create cgroup \"/hbm#\" and attach prog\n"
 	       "               Default is /hbm1\n"
+	       "    --no_cn    disable CN notifcations\n"
 	       "    -r <rate>  Rate in Mbps\n"
 	       "    -s         Update HBM stats\n"
 	       "    -t <time>  Exit after specified seconds (default is 0)\n"
@@ -393,9 +398,16 @@ int main(int argc, char **argv)
 	int  k;
 	int cg_id = 1;
 	char *optstring = "iodln:r:st:wh";
+	struct option loptions[] = {
+		{"no_cn", 0, NULL, 1},
+		{NULL, 0, NULL, 0}
+	};
 
-	while ((k = getopt(argc, argv, optstring)) != -1) {
+	while ((k = getopt_long(argc, argv, optstring, loptions, NULL)) != -1) {
 		switch (k) {
+		case 1:
+			no_cn_flag = true;
+			break;
 		case'o':
 			break;
 		case 'd':
diff --git a/samples/bpf/hbm.h b/samples/bpf/hbm.h
index 518e8147d084..c08247cec2a7 100644
--- a/samples/bpf/hbm.h
+++ b/samples/bpf/hbm.h
@@ -19,7 +19,8 @@ struct hbm_vqueue {
 struct hbm_queue_stats {
 	unsigned long rate;		/* in Mbps*/
 	unsigned long stats:1,		/* get HBM stats (marked, dropped,..) */
-		loopback:1;		/* also limit flows using loopback */
+		loopback:1,		/* also limit flows using loopback */
+		no_cn:1;		/* do not use cn flags */
 	unsigned long long pkts_marked;
 	unsigned long long bytes_marked;
 	unsigned long long pkts_dropped;
diff --git a/samples/bpf/hbm_out_kern.c b/samples/bpf/hbm_out_kern.c
index f806863d0b79..fa3ea92e1564 100644
--- a/samples/bpf/hbm_out_kern.c
+++ b/samples/bpf/hbm_out_kern.c
@@ -119,13 +119,16 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	// Set flags (drop, congestion, cwr)
 	// Dropping => we are congested, so ignore congestion flag
 	if (credit < -DROP_THRESH ||
-	    (len > LARGE_PKT_THRESH &&
-	     credit < -LARGE_PKT_DROP_THRESH)) {
-		// Very congested, set drop flag
+	    (len > LARGE_PKT_THRESH && credit < -LARGE_PKT_DROP_THRESH)) {
+		// Very congested, set drop packet
 		drop_flag = true;
+		if (pkti.ecn)
+			congestion_flag = true;
+		else if (pkti.is_tcp)
+			cwr_flag = true;
 	} else if (credit < 0) {
 		// Congested, set congestion flag
-		if (pkti.ecn) {
+		if (pkti.ecn || pkti.is_tcp) {
 			if (credit < -MARK_THRESH)
 				congestion_flag = true;
 			else
@@ -137,7 +140,15 @@ int _hbm_out_cg(struct __sk_buff *skb)
 
 	if (congestion_flag) {
 		if (!bpf_skb_ecn_set_ce(skb)) {
-			if (len > LARGE_PKT_THRESH) {
+			if (pkti.is_tcp) {
+				unsigned int rand = bpf_get_prandom_u32();
+
+				if (-credit >= MARK_THRESH +
+				    (rand % MARK_REGION_SIZE)) {
+					// Do congestion control
+					cwr_flag = true;
+				}
+			} else if (len > LARGE_PKT_THRESH) {
 				// Problem if too many small packets?
 				drop_flag = true;
 			}
@@ -146,12 +157,17 @@ int _hbm_out_cg(struct __sk_buff *skb)
 
 	if (drop_flag)
 		rv = DROP_PKT;
+	if (qsp != NULL)
+		if (qsp->no_cn)
+			cwr_flag = false;
 
 	hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag);
 
 	if (rv == DROP_PKT)
 		__sync_add_and_fetch(&(qdp->credit), len);
 
+	if (cwr_flag)
+		rv |= 2;
 	return rv;
 }
 char _license[] SEC("license") = "GPL";
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH bpf-next 7/7] bpf: Add more stats to HBM
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (5 preceding siblings ...)
  2019-03-23  8:05 ` [PATCH bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
@ 2019-03-23  8:05 ` brakmo
  2019-03-23  9:12 ` [PATCH bpf-next 0/7] bpf: Propagate cn to TCP Eric Dumazet
  7 siblings, 0 replies; 24+ messages in thread
From: brakmo @ 2019-03-23  8:05 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Adds more stats to HBM, including the average cwnd and rtt of all TCP
flows, the percentage of packets that are ECN CE marked, and the
distribution of return values.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/hbm.c          | 33 +++++++++++++++++++
 samples/bpf/hbm.h          |  6 ++++
 samples/bpf/hbm_kern.h     | 66 ++++++++++++++++++++++++++++++++++++--
 samples/bpf/hbm_out_kern.c | 22 ++++++++-----
 4 files changed, 117 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index c5629bae67ab..480b7ad6a1f2 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -316,6 +316,14 @@ static int run_bpf_prog(char *prog, int cg_id)
 		double percent_pkts, percent_bytes;
 		char fname[100];
 		FILE *fout;
+		int k;
+		static const char *returnValNames[] = {
+			"DROP_PKT",
+			"ALLOW_PKT",
+			"DROP_PKT_CWR",
+			"ALLOW_PKT_CWR"
+		};
+#define RET_VAL_COUNT 4
 
 // Future support of ingress
 //		if (!outFlag)
@@ -350,6 +358,31 @@ static int run_bpf_prog(char *prog, int cg_id)
 			(qstats.bytes_total + 1);
 		fprintf(fout, "pkts_dropped_percent:%6.2f\n", percent_pkts);
 		fprintf(fout, "bytes_dropped_percent:%6.2f\n", percent_bytes);
+
+		// ECN CE markings
+		percent_pkts = (qstats.pkts_ecn_ce * 100.0) /
+			(qstats.pkts_total + 1);
+		fprintf(fout, "pkts_ecn_ce:%6.2f (%d)\n", percent_pkts,
+			(int)qstats.pkts_ecn_ce);
+
+		// Average cwnd
+		fprintf(fout, "avg cwnd:%d\n",
+			(int)(qstats.sum_cwnd / (qstats.sum_cwnd_cnt + 1)));
+		// Average rtt
+		fprintf(fout, "avg rtt:%d\n",
+			(int)(qstats.sum_rtt / (qstats.pkts_total + 1)));
+		// Average credit
+		fprintf(fout, "avg credit:%d\n",
+			(int)(qstats.sum_credit /
+			      (1500 * ((int)qstats.pkts_total) + 1)));
+
+		// Return values stats
+		for (k = 0; k < RET_VAL_COUNT; k++) {
+			percent_pkts = (qstats.returnValCount[k] * 100.0) /
+				(qstats.pkts_total + 1);
+			fprintf(fout, "%s:%6.2f (%d)\n", returnValNames[k],
+				percent_pkts, (int)qstats.returnValCount[k]);
+		}
 		fclose(fout);
 	}
 
diff --git a/samples/bpf/hbm.h b/samples/bpf/hbm.h
index c08247cec2a7..f0963ed6a562 100644
--- a/samples/bpf/hbm.h
+++ b/samples/bpf/hbm.h
@@ -29,4 +29,10 @@ struct hbm_queue_stats {
 	unsigned long long bytes_total;
 	unsigned long long firstPacketTime;
 	unsigned long long lastPacketTime;
+	unsigned long long pkts_ecn_ce;
+	unsigned long long returnValCount[4];
+	unsigned long long sum_cwnd;
+	unsigned long long sum_rtt;
+	unsigned long long sum_cwnd_cnt;
+	long long sum_credit;
 };
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index c5635d924193..508a3f93d200 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -72,17 +72,43 @@ struct bpf_map_def SEC("maps") queue_stats = {
 BPF_ANNOTATE_KV_PAIR(queue_stats, int, struct hbm_queue_stats);
 
 struct hbm_pkt_info {
+	int	cwnd;
+	int	rtt;
 	bool	is_ip;
 	bool	is_tcp;
 	short	ecn;
 };
 
+static int get_tcp_info(struct __sk_buff *skb, struct hbm_pkt_info *pkti)
+{
+	struct bpf_sock *sk;
+	struct bpf_tcp_sock *tp;
+
+	sk = skb->sk;
+	if (sk) {
+		sk = bpf_sk_fullsock(sk);
+		if (sk) {
+			if (sk->protocol == IPPROTO_TCP) {
+				tp = bpf_tcp_sock(sk);
+				if (tp) {
+					pkti->cwnd = tp->snd_cwnd;
+					pkti->rtt = tp->srtt_us >> 3;
+					return 0;
+				}
+			}
+		}
+	}
+	return 1;
+}
+
 static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb,
 					     struct hbm_pkt_info *pkti)
 {
 	struct iphdr iph;
 	struct ipv6hdr *ip6h;
 
+	pkti->cwnd = 0;
+	pkti->rtt = 0;
 	bpf_skb_load_bytes(skb, 0, &iph, 12);
 	if (iph.version == 6) {
 		ip6h = (struct ipv6hdr *)&iph;
@@ -98,6 +124,8 @@ static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb,
 		pkti->is_tcp = false;
 		pkti->ecn = 0;
 	}
+	if (pkti->is_tcp)
+		get_tcp_info(skb, pkti);
 }
 
 static __always_inline void hbm_init_vqueue(struct hbm_vqueue *qdp, int rate)
@@ -112,8 +140,14 @@ static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
 					     int len,
 					     unsigned long long curtime,
 					     bool congestion_flag,
-					     bool drop_flag)
+					     bool drop_flag,
+					     bool cwr_flag,
+					     bool ecn_ce_flag,
+					     struct hbm_pkt_info *pkti,
+					     int credit)
 {
+	int rv = ALLOW_PKT;
+
 	if (qsp != NULL) {
 		// Following is needed for work conserving
 		__sync_add_and_fetch(&(qsp->bytes_total), len);
@@ -123,7 +157,7 @@ static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
 				qsp->firstPacketTime = curtime;
 			qsp->lastPacketTime = curtime;
 			__sync_add_and_fetch(&(qsp->pkts_total), 1);
-			if (congestion_flag || drop_flag) {
+			if (congestion_flag) {
 				__sync_add_and_fetch(&(qsp->pkts_marked), 1);
 				__sync_add_and_fetch(&(qsp->bytes_marked), len);
 			}
@@ -132,6 +166,34 @@ static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
 				__sync_add_and_fetch(&(qsp->bytes_dropped),
 						     len);
 			}
+			if (ecn_ce_flag)
+				__sync_add_and_fetch(&(qsp->pkts_ecn_ce), 1);
+			if (pkti->cwnd) {
+				__sync_add_and_fetch(&(qsp->sum_cwnd),
+						     pkti->cwnd);
+				__sync_add_and_fetch(&(qsp->sum_cwnd_cnt), 1);
+			}
+			if (pkti->rtt)
+				__sync_add_and_fetch(&(qsp->sum_rtt),
+						     pkti->rtt);
+			__sync_add_and_fetch(&(qsp->sum_credit), credit);
+
+			if (drop_flag)
+				rv = DROP_PKT;
+			if (cwr_flag)
+				rv |= 2;
+			if (rv == DROP_PKT)
+				__sync_add_and_fetch(&(qsp->returnValCount[0]),
+						     1);
+			else if (rv == ALLOW_PKT)
+				__sync_add_and_fetch(&(qsp->returnValCount[1]),
+						     1);
+			else if (rv == 2)
+				__sync_add_and_fetch(&(qsp->returnValCount[2]),
+						     1);
+			else if (rv == 3)
+				__sync_add_and_fetch(&(qsp->returnValCount[3]),
+						     1);
 		}
 	}
 }
diff --git a/samples/bpf/hbm_out_kern.c b/samples/bpf/hbm_out_kern.c
index fa3ea92e1564..829934bd43cb 100644
--- a/samples/bpf/hbm_out_kern.c
+++ b/samples/bpf/hbm_out_kern.c
@@ -62,11 +62,12 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	unsigned int queue_index = 0;
 	unsigned long long curtime;
 	int credit;
-	signed long long delta = 0, zero = 0;
+	signed long long delta = 0, new_credit;
 	int max_credit = MAX_CREDIT;
 	bool congestion_flag = false;
 	bool drop_flag = false;
 	bool cwr_flag = false;
+	bool ecn_ce_flag = false;
 	struct hbm_vqueue *qdp;
 	struct hbm_queue_stats *qsp = NULL;
 	int rv = ALLOW_PKT;
@@ -99,9 +100,11 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	 */
 	if (delta > 0) {
 		qdp->lasttime = curtime;
-		credit += CREDIT_PER_NS(delta, qdp->rate);
-		if (credit > MAX_CREDIT)
+		new_credit = credit + CREDIT_PER_NS(delta, qdp->rate);
+		if (new_credit > MAX_CREDIT)
 			credit = MAX_CREDIT;
+		else
+			credit = new_credit;
 	}
 	credit -= len;
 	qdp->credit = credit;
@@ -139,7 +142,9 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	}
 
 	if (congestion_flag) {
-		if (!bpf_skb_ecn_set_ce(skb)) {
+		if (bpf_skb_ecn_set_ce(skb)) {
+			ecn_ce_flag = true;
+		} else {
 			if (pkti.is_tcp) {
 				unsigned int rand = bpf_get_prandom_u32();
 
@@ -155,16 +160,17 @@ int _hbm_out_cg(struct __sk_buff *skb)
 		}
 	}
 
-	if (drop_flag)
-		rv = DROP_PKT;
 	if (qsp != NULL)
 		if (qsp->no_cn)
 			cwr_flag = false;
 
-	hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag);
+	hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag,
+			 cwr_flag, ecn_ce_flag, &pkti, credit);
 
-	if (rv == DROP_PKT)
+	if (drop_flag) {
 		__sync_add_and_fetch(&(qdp->credit), len);
+		rv = DROP_PKT;
+	}
 
 	if (cwr_flag)
 		rv |= 2;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (6 preceding siblings ...)
  2019-03-23  8:05 ` [PATCH bpf-next 7/7] bpf: Add more stats to HBM brakmo
@ 2019-03-23  9:12 ` Eric Dumazet
  2019-03-23 15:41   ` Alexei Starovoitov
  2019-03-24  1:14   ` Lawrence Brakmo
  7 siblings, 2 replies; 24+ messages in thread
From: Eric Dumazet @ 2019-03-23  9:12 UTC (permalink / raw)
  To: brakmo, netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team



On 03/23/2019 01:05 AM, brakmo wrote:
> This patchset adds support for propagating congestion notifications (cn)
> to TCP from cgroup inet skb egress BPF programs.
> 
> Current cgroup skb BPF programs cannot trigger TCP congestion window
> reductions, even when they drop a packet. This patch-set adds support
> for cgroup skb BPF programs to send congestion notifications in the
> return value when the packets are TCP packets. Rather than the
> current 1 for keeping the packet and 0 for dropping it, they can
> now return:
>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>     NET_XMIT_DROP       (1)    - drop packet and do cn
>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>     -EPERM                     - drop packet
>

I believe I already mentioned this model is broken, if you have any virtual
device before the cgroup BPF program.

Please think about offloading the pacing/throttling in the NIC,
there is no way we will report back to tcp stack instant notifications.

This patch series is going way too far for my taste.

This idea is not new, you were at Google when it was experimented by Nandita and
others, and we know it is not worth the pain.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-23  9:12 ` [PATCH bpf-next 0/7] bpf: Propagate cn to TCP Eric Dumazet
@ 2019-03-23 15:41   ` Alexei Starovoitov
  2019-03-24  5:36     ` Eric Dumazet
  2019-03-24  5:48     ` Eric Dumazet
  2019-03-24  1:14   ` Lawrence Brakmo
  1 sibling, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2019-03-23 15:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/23/2019 01:05 AM, brakmo wrote:
> > This patchset adds support for propagating congestion notifications (cn)
> > to TCP from cgroup inet skb egress BPF programs.
> > 
> > Current cgroup skb BPF programs cannot trigger TCP congestion window
> > reductions, even when they drop a packet. This patch-set adds support
> > for cgroup skb BPF programs to send congestion notifications in the
> > return value when the packets are TCP packets. Rather than the
> > current 1 for keeping the packet and 0 for dropping it, they can
> > now return:
> >     NET_XMIT_SUCCESS    (0)    - continue with packet output
> >     NET_XMIT_DROP       (1)    - drop packet and do cn
> >     NET_XMIT_CN         (2)    - continue with packet output and do cn
> >     -EPERM                     - drop packet
> >
> 
> I believe I already mentioned this model is broken, if you have any virtual
> device before the cgroup BPF program.
> 
> Please think about offloading the pacing/throttling in the NIC,
> there is no way we will report back to tcp stack instant notifications.

I don't think 'offload to google proprietary nic' is a suggestion
that folks can practically follow.
Very few NICs can offload pacing to hw and there are plenty of limitations.
This patch set represents a pure sw solution that works and scales to millions of flows.

> This patch series is going way too far for my taste.

I would really appreciate if you can do a technical review of the patches.
Our previous approach didn't quite work due to complexity around locked/non-locked socket.
This is a cleaner approach.
Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
This approach is better since it works for other protocols and can be
used by qdiscs w/o any bpf.

> This idea is not new, you were at Google when it was experimented by Nandita and
> others, and we know it is not worth the pain.

google networking needs are different from the rest of the world.

Thank you.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-23  9:12 ` [PATCH bpf-next 0/7] bpf: Propagate cn to TCP Eric Dumazet
  2019-03-23 15:41   ` Alexei Starovoitov
@ 2019-03-24  1:14   ` Lawrence Brakmo
  2019-03-24  5:58     ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Lawrence Brakmo @ 2019-03-24  1:14 UTC (permalink / raw)
  To: Eric Dumazet, netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Kernel Team


On 3/23/19, 10:12 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

    
    
    On 03/23/2019 01:05 AM, brakmo wrote:
    > This patchset adds support for propagating congestion notifications (cn)
    > to TCP from cgroup inet skb egress BPF programs.
    > 
    > Current cgroup skb BPF programs cannot trigger TCP congestion window
    > reductions, even when they drop a packet. This patch-set adds support
    > for cgroup skb BPF programs to send congestion notifications in the
    > return value when the packets are TCP packets. Rather than the
    > current 1 for keeping the packet and 0 for dropping it, they can
    > now return:
    >     NET_XMIT_SUCCESS    (0)    - continue with packet output
    >     NET_XMIT_DROP       (1)    - drop packet and do cn
    >     NET_XMIT_CN         (2)    - continue with packet output and do cn
    >     -EPERM                     - drop packet
    >
    
    I believe I already mentioned this model is broken, if you have any virtual
    device before the cgroup BPF program.

A current qdisc can return values 0 to 2; how is this different from the cgroup
BPF program returning these values? I understand that virtual devices before
the cgroup or qdisc may not propagate these values (and I would say the
problem is then with the virtual device), but not everyone uses virtual
devices like that. For them, HBM is not appropriate if they are using
Cubic (however, it works with DCTCP or with fq's EDT).
    
    Please think about offloading the pacing/throttling in the NIC,
    there is no way we will report back to tcp stack instant notifications.
    
Not everyone has the ability to offload to the NIC.

    This patch series is going way too far for my taste.
    
Too far in which way? I'm simply extending to cgroup egress BPF programs
a mechanism that has been present in qdiscs for a while. I understand if
it has no use in your environment, but we believe it has in ours.

    This idea is not new, you were at Google when it was experimented by Nandita and
    others, and we know it is not worth the pain.
    
There was no eBPF at that time. We like the flexibility we get by programming
the algorithms in eBPF.

These are not intrusive changes, they simply extend the current limited return
values from cgroup skb egress BPF programs to be more in line with qdiscs.
    


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-23 15:41   ` Alexei Starovoitov
@ 2019-03-24  5:36     ` Eric Dumazet
  2019-03-24 16:19       ` Alexei Starovoitov
  2019-03-24  5:48     ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2019-03-24  5:36 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/23/2019 01:05 AM, brakmo wrote:
>>> This patchset adds support for propagating congestion notifications (cn)
>>> to TCP from cgroup inet skb egress BPF programs.
>>>
>>> Current cgroup skb BPF programs cannot trigger TCP congestion window
>>> reductions, even when they drop a packet. This patch-set adds support
>>> for cgroup skb BPF programs to send congestion notifications in the
>>> return value when the packets are TCP packets. Rather than the
>>> current 1 for keeping the packet and 0 for dropping it, they can
>>> now return:
>>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>>>     NET_XMIT_DROP       (1)    - drop packet and do cn
>>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>>>     -EPERM                     - drop packet
>>>
>>
>> I believe I already mentioned this model is broken, if you have any virtual
>> device before the cgroup BPF program.
>>
>> Please think about offloading the pacing/throttling in the NIC,
>> there is no way we will report back to tcp stack instant notifications.
> 
> I don't think 'offload to google proprietary nic' is a suggestion
> that folks can practically follow.
> Very few NICs can offload pacing to hw and there are plenty of limitations.
> This patch set represents a pure sw solution that works and scales to millions of flows.
> 
>> This patch series is going way too far for my taste.
> 
> I would really appreciate if you can do a technical review of the patches.
> Our previous approach didn't quite work due to complexity around locked/non-locked socket.
> This is a cleaner approach.
> Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
> This approach is better since it works for other protocols and can be
> used by qdiscs w/o any bpf.
> 
>> This idea is not new, you were at Google when it was experimented by Nandita and
>> others, and we know it is not worth the pain.
> 
> google networking needs are different from the rest of the world.
> 

This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.

We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
is not enough.

All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.

Please show us experimental results before we have to carry these huge hacks.

Thank you.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-23 15:41   ` Alexei Starovoitov
  2019-03-24  5:36     ` Eric Dumazet
@ 2019-03-24  5:48     ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2019-03-24  5:48 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:

>> Please think about offloading the pacing/throttling in the NIC,
>> there is no way we will report back to tcp stack instant notifications.
> 
> I don't think 'offload to google proprietary nic' is a suggestion
> that folks can practically follow.
> Very few NICs can offload pacing to hw and there are plenty of limitations.
> This patch set represents a pure sw solution that works and scales to millions of flows.

offloading is not a unique feature to "Google nic", this is ridiculous.

By offloading I meant : Any hop that might not be controlled by linux stack.

Think about VM, and think about pacers that can be on various existing NIC,
like Mellanox ones. For them, it is trivial to implement L4S style capabilities.

The beauty of ECN is that it relies on standard TCP feedback, not on some immediate
reaction from one intermediate hop sending another signal back to the sender.

HTTP 3 is planned to use QUIC (over UDP); we do not want to add TCP-only features in
the IP fast path.

Lets have experimental results first, please ?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-24  1:14   ` Lawrence Brakmo
@ 2019-03-24  5:58     ` Eric Dumazet
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2019-03-24  5:58 UTC (permalink / raw)
  To: Lawrence Brakmo, Eric Dumazet, netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 03/23/2019 06:14 PM, Lawrence Brakmo wrote:

> There was no eBPF at that time. We like the flexibility we get by programming
> the algorithms in eBPF.

eBPF being there does not mean we can adopt whatever research work;
we want evaluations of the costs and benefits.

> 
> These are not intrusive changes, they simply extend the current limited return
> values from cgroup skb egress BPF programs to be more in line with qdiscs.

They are quite intrusive changes to IP layer, we can not deny this.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-24  5:36     ` Eric Dumazet
@ 2019-03-24 16:19       ` Alexei Starovoitov
  2019-03-25  8:33         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2019-03-24 16:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> > On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 03/23/2019 01:05 AM, brakmo wrote:
> >>> This patchset adds support for propagating congestion notifications (cn)
> >>> to TCP from cgroup inet skb egress BPF programs.
> >>>
> >>> Current cgroup skb BPF programs cannot trigger TCP congestion window
> >>> reductions, even when they drop a packet. This patch-set adds support
> >>> for cgroup skb BPF programs to send congestion notifications in the
> >>> return value when the packets are TCP packets. Rather than the
> >>> current 1 for keeping the packet and 0 for dropping it, they can
> >>> now return:
> >>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
> >>>     NET_XMIT_DROP       (1)    - drop packet and do cn
> >>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
> >>>     -EPERM                     - drop packet
> >>>
> >>
> >> I believe I already mentioned this model is broken, if you have any virtual
> >> device before the cgroup BPF program.
> >>
> >> Please think about offloading the pacing/throttling in the NIC,
> >> there is no way we will report back to tcp stack instant notifications.
> > 
> > I don't think 'offload to google proprietary nic' is a suggestion
> > that folks can practically follow.
> > Very few NICs can offload pacing to hw and there are plenty of limitations.
> > This patch set represents a pure sw solution that works and scales to millions of flows.
> > 
> >> This patch series is going way too far for my taste.
> > 
> > I would really appreciate if you can do a technical review of the patches.
> > Our previous approach didn't quite work due to complexity around locked/non-locked socket.
> > This is a cleaner approach.
> > Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
> > This approach is better since it works for other protocols and can be
> > used by qdiscs w/o any bpf.
> > 
> >> This idea is not new, you were at Google when it was experimented by Nandita and
> >> others, and we know it is not worth the pain.
> > 
> > google networking needs are different from the rest of the world.
> > 
> 
> This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.
> 
> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
> is not enough.
> 
> All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.
> 
> Please show us experimental results before we have to carry these huge hacks.

I have a feeling we're looking at different patches.
All of your comments seem to be talking about something else.
I have a hard time connecting them to this patch set.
Have you read the cover letter and patches of _this_ set?

The results are in the cover letter and there is no change in behavior in the networking stack.
Patches 1, 2, 3 are simple refactoring of the bpf-cgroup return codes.
Patch 4 is the only one that touches net/ipv4/ip_output.c, and only to pass
the return code to the tcp/udp layer.
The concept works for both, despite your claim that this is tcp-only.

The cover letter also explains why bpf_skb_ecn_set_ce is not enough.
Please realize that existing qdiscs are already doing this.
The patchset allows bpf-cgroup to do the same.
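
To make the contract concrete, here is a minimal sketch of a cgroup egress
program using the extended return codes (illustrative only, not one of the
patches; the map, program name and the simplistic, non-atomic credit check
are made up):

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Return codes as described in the cover letter. */
#define NET_XMIT_SUCCESS	0	/* continue with packet output */
#define NET_XMIT_CN		2	/* continue with packet output and do cn */

/* One-element array holding a byte credit, refilled from user space. */
struct bpf_map_def SEC("maps") credit_map = {
	.type		= BPF_MAP_TYPE_ARRAY,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(long),
	.max_entries	= 1,
};

SEC("cgroup_skb/egress")
int egress_cn(struct __sk_buff *skb)
{
	__u32 key = 0;
	long *credit = bpf_map_lookup_elem(&credit_map, &key);

	if (!credit)
		return NET_XMIT_SUCCESS;

	/* Charge the packet (a real program would need atomic updates);
	 * once the credit is gone, keep sending but ask the stack to do
	 * congestion notification instead of silently dropping.
	 */
	*credit -= (long)skb->len;
	if (*credit <= 0)
		return NET_XMIT_CN;

	return NET_XMIT_SUCCESS;
}

char _license[] SEC("license") = "GPL";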

If you were arguing about the sysctl knob in patch 5, that I would understand.
Instead of a knob we can hard-code the value for now. But please explain
what issues you see with that knob.

Also to be fair I applied your
commit f11216b24219 ("bpf: add skb->tstamp r/w access from tc clsact and cg skb progs")
without demanding 'experimental results' because the feature makes sense.
Yet, you folks didn't produce a _single_ example or test result since then.
This is not cool.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-24 16:19       ` Alexei Starovoitov
@ 2019-03-25  8:33         ` Eric Dumazet
  2019-03-25  8:48           ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2019-03-25  8:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
>>> On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 03/23/2019 01:05 AM, brakmo wrote:
>>>>> This patchset adds support for propagating congestion notifications (cn)
>>>>> to TCP from cgroup inet skb egress BPF programs.
>>>>>
>>>>> Current cgroup skb BPF programs cannot trigger TCP congestion window
>>>>> reductions, even when they drop a packet. This patch-set adds support
>>>>> for cgroup skb BPF programs to send congestion notifications in the
>>>>> return value when the packets are TCP packets. Rather than the
>>>>> current 1 for keeping the packet and 0 for dropping it, they can
>>>>> now return:
>>>>>     NET_XMIT_SUCCESS    (0)    - continue with packet output
>>>>>     NET_XMIT_DROP       (1)    - drop packet and do cn
>>>>>     NET_XMIT_CN         (2)    - continue with packet output and do cn
>>>>>     -EPERM                     - drop packet
>>>>>
>>>>
>>>> I believe I already mentioned this model is broken, if you have any virtual
>>>> device before the cgroup BPF program.
>>>>
>>>> Please think about offloading the pacing/throttling in the NIC,
>>>> there is no way we will report back to tcp stack instant notifications.
>>>
>>> I don't think 'offload to google proprietary nic' is a suggestion
>>> that folks can practically follow.
>>> Very few NICs can offload pacing to hw and there are plenty of limitations.
>>> This patch set represents a pure sw solution that works and scales to millions of flows.
>>>
>>>> This patch series is going way too far for my taste.
>>>
>>> I would really appreciate if you can do a technical review of the patches.
>>> Our previous approach didn't quite work due to complexity around locked/non-locked socket.
>>> This is a cleaner approach.
>>> Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
>>> This approach is better since it works for other protocols and can be
>>> used by qdiscs w/o any bpf.
>>>
>>>> This idea is not new, you were at Google when it was experimented by Nandita and
>>>> others, and we know it is not worth the pain.
>>>
>>> google networking needs are different from the rest of the world.
>>>
>>
>> This has nothing to do with Google against Facebook really, it is a bit sad you react like this Alexei.
>>
>> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers to show that this strategy
>> is not enough.
>>
>> All recent discussions about ECN (TCP Prague and SCE) do not _require_ instant feedback to the sender.
>>
>> Please show us experimental results before we have to carry these huge hacks.
> 
> I have a feeling we're looking at different patches.
> All of your comments seems to be talking about something else.
> I have a hard time connecting them to this patch set.
> Have you read the cover letter and patches of _this_ set?
> 
> The results are in the cover letter and there is no change in behavior in networking stack.
> Patches 1, 2, 3 are simple refactoring of bpf-cgroup return codes.
> Patch 4 is the only one that touches net/ipv4/ip_output.c only to pass
> the return code to tcp/udp layer.
> The concept works for both despite of what you're claiming being tcp only.
> 
> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> Please realize that existing qdiscs already doing this.
> The patchset allows bpf-cgroup to do the same.

Not the same thing I am afraid.

> 
> If you were arguing about sysctl knob in patch 5 that I would understand.
> Instead of a knob we can hard code the value for now. But please explain
> what issues you see with that knob.
> 
> Also to be fair I applied your
> commit f11216b24219 ("bpf: add skb->tstamp r/w access from tc clsact and cg skb progs")
> without demanding 'experimental results' because the feature makes sense.
> Yet, you folks didn't produce a _single_ example or test result since then.
> This is not cool.

This patch did not change the fast path at all, Alexei.

Look at [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls

This part changes the IP stacks.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-25  8:33         ` Eric Dumazet
@ 2019-03-25  8:48           ` Eric Dumazet
  2019-03-26  4:27             ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2019-03-25  8:48 UTC (permalink / raw)
  To: Eric Dumazet, Alexei Starovoitov
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> 
> 
> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:

>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
>> Please realize that existing qdiscs already doing this.
>> The patchset allows bpf-cgroup to do the same.
> 
> Not the same thing I am afraid.

To be clear, Alexei:

Existing qdiscs set the CE mark on a packet, exactly like a router would do.
Simple and universal.
This can be stacked, and done far away from the sender.

We do not _call_ back into local TCP to propagate cn.

We simply rely on the incoming ACKs carrying the needed information,
and the TCP stack already handles that case just fine.
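
(For reference, the CE-marking path from BPF needs nothing more than the helper
that was just merged; a rough sketch, assuming bpf_helpers.h declares the new
helper and with the actual marking policy left out:)

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("cgroup_skb/egress")
int egress_ce(struct __sk_buff *skb)
{
	/* CE-mark the packet (a no-op for non-ECT packets) and always let it
	 * go out; the sender learns about the congestion from the returning
	 * ACKs, with no callback into the local TCP stack.
	 */
	bpf_skb_ecn_set_ce(skb);
	return 1;
}

char _license[] SEC("license") = "GPL";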

Larry's cover letter does not really explain why we need to handle a corner case
(local drops) with such intrusive changes.

TCP Small Queues should already make sure local drops are non-existent.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-25  8:48           ` Eric Dumazet
@ 2019-03-26  4:27             ` Alexei Starovoitov
  2019-03-26  8:06               ` Eric Dumazet
  2019-03-26  8:13               ` Eric Dumazet
  0 siblings, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2019-03-26  4:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> > 
> > 
> > On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> 
> >> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >> Please realize that existing qdiscs already doing this.
> >> The patchset allows bpf-cgroup to do the same.
> > 
> > Not the same thing I am afraid.
> 
> To be clear Alexei :
> 
> Existing qdisc set CE mark on a packet, exactly like a router would do.
> Simple and universal.
> This can be stacked, and done far away from the sender.
> 
> We do not _call_ back local TCP to propagate cn.

How do you classify NET_XMIT_CN?
It is exactly a local callback that indicates CN to TCP from the layers below.
A tc-bpf prog returning the 'drop' code means drop+cn,
whereas a cgroup-bpf prog returning 'drop' means drop only.
This patch set fixes that discrepancy.
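
The consumer of that signal already exists on the TCP side; roughly, paraphrasing
from memory what __tcp_transmit_skb() in net/ipv4/tcp_output.c does today:

	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);

	if (unlikely(err > 0)) {
		tcp_enter_cwr(sk);		/* NET_XMIT_DROP and NET_XMIT_CN both enter CWR */
		err = net_xmit_eval(err);	/* NET_XMIT_CN is then treated as success */
	}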

> We simply rely on the fact that incoming ACK will carry the needed information,
> and TCP stack already handles the case just fine.
> 
> Larry cover letter does not really explain why we need to handle a corner case
> (local drops) with such intrusive changes.

Only after this many rounds of back and forth am I starting
to understand your 'intrusive change' comment.
I think you're referring to:
-       return ip_finish_output2(net, sk, skb);
+       ret = ip_finish_output2(net, sk, skb);
+       return ret ? : ret_bpf;

right?
And is your concern that this change slows down the IP stack?
Please spell out your concerns in a more verbose way to avoid this guessing game.

I've looked at the assembly, and indeed this change pessimizes the tail-call
optimization. What kind of benchmark do you want to see?
As an alternative, we can do it under the static_key that cgroup-bpf is already under.
It will be a larger number of lines changed, but the tail-call
optimization can be preserved.
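
Something along these lines (hypothetical sketch, not the posted patch):

	/* keep the direct tail call into ip_finish_output2() when no
	 * cgroup-bpf egress program can be attached
	 */
	if (!cgroup_bpf_enabled)
		return ip_finish_output2(net, sk, skb);

	ret = ip_finish_output2(net, sk, skb);
	return ret ? : ret_bpf;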

> TCP Small Queues already should make sure local drops are non existent.

TSQ doesn't help.
Here is the comment from patch 5:
"When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
and packets_out is 0, it is beneficial to set a small probe timer.
Otherwise, the throughput for the flow can suffer because it may need to
depend on the probe timer to start sending again. "

In other words, when we're clamping aggregated cgroup bandwidth with this facility
we may need to drop the only in-flight packet for this flow,
and the flow then restarts only after the default 200ms probe timer.
In such a case it's well under the TSQ limit.
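
Roughly what patch 5 does, per the description above (a sketch only; the
"probe_on_drop_ms" name is a stand-in for the actual sysctl, see the patch
for the real field names and placement):

	/* queue_xmit just failed and nothing else is in flight, so rearm
	 * the probe timer with a small value instead of waiting for the
	 * default probe timeout.
	 */
	if (err && !tcp_sk(sk)->packets_out)
		inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
					  msecs_to_jiffies(probe_on_drop_ms),
					  TCP_RTO_MAX);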

Thinking about TSQ... I think it would be very interesting to add a bpf hook
there as well and have a programmable, dynamic TSQ limit, but that is
orthogonal to this patch set. We will explore this idea separately.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-26  4:27             ` Alexei Starovoitov
@ 2019-03-26  8:06               ` Eric Dumazet
  2019-03-26 15:07                 ` Alexei Starovoitov
  2019-03-26  8:13               ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2019-03-26  8:06 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
>>
>>
>> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
>>>
>>>
>>> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
>>
>>>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
>>>> Please realize that existing qdiscs already doing this.
>>>> The patchset allows bpf-cgroup to do the same.
>>>
>>> Not the same thing I am afraid.
>>
>> To be clear Alexei :
>>
>> Existing qdisc set CE mark on a packet, exactly like a router would do.
>> Simple and universal.
>> This can be stacked, and done far away from the sender.
>>
>> We do not _call_ back local TCP to propagate cn.
> 
> How do you classify NET_XMIT_CN ?


> It's exactly local call back to indicate CN into tcp from layers below tcp.
> tc-bpf prog returning 'drop' code means drop+cn
> whereas cgroup-bpf prog returning 'drop' means drop only.
> This patch set is fixing this discrepancy.


Except this does not work universally.
team or bonding or any tunnel won't propagate this signal.
(This is not something that can be fixed either, since a qdisc can be installed there.)

This is a wrong design, and we want to get rid of it, and not 'fix it'.

Back to my original concerns :

1) Larry just upstreamed the bpf_skb_ecn_set_ce().
   This part was fine; we did the same in fq_codel and fq already.

2)
 This ability to inject local drops directly into the TCP stack is exactly what confirms
 why I complained earlier about policers often being wrong. Trying to cope with them
 is a nightmare.

I feel you need this because your eBPF code was only able to accept or drop a packet (aka a policer).

The EDT model allows extending skb->tstamp without breaking the TSQ logic: no local drops.

One of the ECN goals is to avoid drops in the first place, so adding something to favor
local drops sounds like a step in the wrong direction.
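
(To illustrate the EDT point above: from a clsact program this is just a timestamp
update, relying on sch_fq to enforce it. A rough sketch; the program name is made
up and the per-flow rate bookkeeping is omitted:)

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"

SEC("classifier")
int edt_pace(struct __sk_buff *skb)
{
	__u64 now = bpf_ktime_get_ns();
	__u64 delay_ns = 100000;	/* placeholder; real code derives this from a rate */

	/* Push the departure time out and let sch_fq pace the packet:
	 * no drop, no callback into TCP, and TSQ keeps working.
	 */
	if (skb->tstamp < now)
		skb->tstamp = now;
	skb->tstamp += delay_ns;

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";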


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-26  4:27             ` Alexei Starovoitov
  2019-03-26  8:06               ` Eric Dumazet
@ 2019-03-26  8:13               ` Eric Dumazet
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2019-03-26  8:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:

> In other words when we're clamping aggregated cgroup bandwidth with this facility
> we may need to drop the only in flight packet for this flow
> and the flow restarts after default 200ms probe timer.

TLP should send a probe that does not depend on the 200ms timer, but rather on the RTT.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-26  8:06               ` Eric Dumazet
@ 2019-03-26 15:07                 ` Alexei Starovoitov
  2019-03-26 15:43                   ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2019-03-26 15:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Tue, Mar 26, 2019 at 01:06:23AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 09:27 PM, Alexei Starovoitov wrote:
> > On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> >>>
> >>>
> >>> On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> >>
> >>>> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >>>> Please realize that existing qdiscs already doing this.
> >>>> The patchset allows bpf-cgroup to do the same.
> >>>
> >>> Not the same thing I am afraid.
> >>
> >> To be clear Alexei :
> >>
> >> Existing qdisc set CE mark on a packet, exactly like a router would do.
> >> Simple and universal.
> >> This can be stacked, and done far away from the sender.
> >>
> >> We do not _call_ back local TCP to propagate cn.
> > 
> > How do you classify NET_XMIT_CN ?
> 
> 
> > It's exactly local call back to indicate CN into tcp from layers below tcp.
> > tc-bpf prog returning 'drop' code means drop+cn
> > whereas cgroup-bpf prog returning 'drop' means drop only.
> > This patch set is fixing this discrepancy.
> 
> 
> Except this does not work universally.
> team or bonding or any tunnel wont propagate this signal.
> (This is not something that can be fixed either, since qdisc can be installed there)
> 
> This is a wrong design, and we want to get rid of it, and not 'fix it'.

So after 20+ years the Linux qdisc design is wrong?
bpf is about choice. We have to give people tools to experiment even
when we philosophically disagree on the design.
We need to make sure that new changes don't hurt existing use cases, though.
The performance bar has to remain high.

> Back to my original concerns :
> 
> 1) Larry just upstreamed the bpf_skb_ecn_set_ce().
>    This part was fine, we did the same in fq_codel an fq already.

ecn doesn't work with all congestion controls.

> 2)
>  This ability to inject directly to TCP stack local drops is exactly confirming
>  why I complained earlier about why policers are often wrong. Trying to cope with them
>  is a nightmare.
> 
> I feel you need this because your ebpf code was only able to accept or drop a packet (aka a policer)

I've tried to explain several times that it's not about a policer.
That's why we called it NRM: Network Resource Manager.
The goal is to control networking resources in a containerized environment
where the container scope is a cgroup.
We could have done some of this logic from the tc-bpf layer, but the overhead is prohibitive there,
since tc sees all packets while we need to control only certain cgroups and
work properly within the cgroup hierarchy.

> EDT model allows for extending the skb->tstamp and not breaking TSQ logic : no local drops.

We already discussed this. EDT is not applicable in all cases.

> One of the ECN goal is to avoid drops in the first place, so adding something to favor
> local drops sounds a step in the wrong direction.

According to that statement, BBR is a wrong design then, since it ignores ECN?

Anyway, to move forward...
We're going to explore a few ways to reduce the tail-call pessimization in patch 4 and
will proceed with the rest as-is.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-26 15:07                 ` Alexei Starovoitov
@ 2019-03-26 15:43                   ` Eric Dumazet
  2019-03-26 17:01                     ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2019-03-26 15:43 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/26/2019 08:07 AM, Alexei Starovoitov wrote:

> so after 20+ years linux qdisc design is wrong?

Yes, it is how it is: return values cannot be propagated back to the TCP stack in all cases.

When a packet is queued to Qdisc 1, there is no way we can return
a value that can represent what the packet becomes when dequeued later and queued into Qdisc 2.

Also, some qdiscs take their drop decision later (e.g. codel and fq_codel), so ->enqueue() will
return a success which might be a lie.


> bpf is about choice. We have to give people tools to experiment even
> when we philosophically disagree on the design.

Maybe, but I feel that for the moment, the choice is only for FB, and the rest
of the world has to re-invent private eBPF code in order to benefit from all of this.

I doubt many TCP users will have the skills/money to benefit from this.

Meanwhile, we as a community have to maintain a TCP/IP stack with added hooks and complexity.

It seems the TCP stack has become a playground for experiments.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-26 15:43                   ` Eric Dumazet
@ 2019-03-26 17:01                     ` Alexei Starovoitov
  2019-03-26 18:07                       ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2019-03-26 17:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team

On Tue, Mar 26, 2019 at 08:43:11AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/26/2019 08:07 AM, Alexei Starovoitov wrote:
> 
> > so after 20+ years linux qdisc design is wrong?
> 
> Yes it is how it is, return values can not be propagated back to the TCP stack in all cases.
> 
> When a packet is queued to Qdisc 1, there is no way we can return
> a value that can represent what the packet becomes when dequeued later and queued into Qdisc 2.
> 
> Also some qdisc take their drop decision later (eg : codel and fq_codel), so ->enqueue() will
> return a success which might be a lie.

Root and child qdiscs already propagate the return value.
Different, unrelated qdiscs are just like different physical switches:
they can communicate only via the on-the-wire protocol.
Nothing wrong with that.
But not everything can, or should, communicate over the wire.

> > bpf is about choice. We have to give people tools to experiment even
> > when we philosophically disagree on the design.
> 
> Maybe, but I feel that for the moment, the choice is only for FB, and rest
> of the world has to re-invent private ebpf code in order to benefit from all of this.

Clearly a misunderstanding. Please see samples/bpf/hbm_out_kern.c.
The source code of the bpf program is not only public, but the algorithm is well documented.
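
For readers who do not want to dig through the sample, the documented algorithm
is roughly a per-cgroup token bucket (a from-memory sketch with made-up names
and constants, not a copy of hbm_out_kern.c):

#define HBM_BURST_BYTES		(128 * 1024)	/* illustrative burst cap */
#define HBM_DROP_THRESH		(-64 * 1024)	/* drop/cn when deep in debt */

struct hbm_bucket {
	__u64	last_ns;	/* time of the last refill */
	long	credit;		/* byte credit, may go negative */
};

enum hbm_verdict { HBM_PASS, HBM_MARK_CE, HBM_DROP_CN };

static enum hbm_verdict hbm_charge(struct hbm_bucket *b, __u64 now_ns,
				   __u64 rate_bytes_per_sec, __u32 wire_len)
{
	/* refill the credit for the elapsed time, cap the burst */
	b->credit += (now_ns - b->last_ns) * rate_bytes_per_sec / 1000000000ULL;
	b->last_ns = now_ns;
	if (b->credit > HBM_BURST_BYTES)
		b->credit = HBM_BURST_BYTES;

	/* charge this packet */
	b->credit -= wire_len;

	if (b->credit < HBM_DROP_THRESH)
		return HBM_DROP_CN;	/* drop, or return NET_XMIT_CN from the prog */
	if (b->credit < 0)
		return HBM_MARK_CE;	/* candidate for bpf_skb_ecn_set_ce() */
	return HBM_PASS;
}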

> I doubt many TCP users will have the skills/money to benefit from this.

The majority of bpf features are actively used, and often not by the authors
who introduced them.

> Meanwhile, we as a community have to maintain a TCP/IP stack with added hooks and complexity.

Your herculean effort to keep the TCP stack in excellent shape
is greatly appreciated. No doubt about that.

> It seems TCP stack became a playground for experiments.

Exactly.
The next TCP congestion control algorithm should be implementable in bpf.
If the kernel were extensible to that degree, there would likely have been
no need to bypass it and invent QUIC.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH bpf-next 0/7] bpf: Propagate cn to TCP
  2019-03-26 17:01                     ` Alexei Starovoitov
@ 2019-03-26 18:07                       ` Eric Dumazet
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2019-03-26 18:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: brakmo, netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team



On 03/26/2019 10:01 AM, Alexei Starovoitov wrote:

> The next tcp congestion control algorithm should be implementable in bpf.

Congestion control does not need per-packet decisions, and can indeed be
implemented in bpf.

Some researchers have even implemented it in user space (CCP, if I remember well).

> If kernel was extensible to that degree likely there would have been
> no need to bypass it and invent quic.

QUIC is a complete re-design, and works around fundamental things
that TCP cannot change.

bpf in TCP could not have changed the rise of QUIC or any other
modern transport, really.

Let's stop this discussion, we are digressing :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2019-03-26 18:07 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-23  8:05 [PATCH bpf-next 0/7] bpf: Propagate cn to TCP brakmo
2019-03-23  8:05 ` [PATCH bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
2019-03-23  8:05 ` [PATCH bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
2019-03-23  8:05 ` [PATCH bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
2019-03-23  8:05 ` [PATCH bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
2019-03-23  8:05 ` [PATCH bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
2019-03-23  8:05 ` [PATCH bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
2019-03-23  8:05 ` [PATCH bpf-next 7/7] bpf: Add more stats to HBM brakmo
2019-03-23  9:12 ` [PATCH bpf-next 0/7] bpf: Propagate cn to TCP Eric Dumazet
2019-03-23 15:41   ` Alexei Starovoitov
2019-03-24  5:36     ` Eric Dumazet
2019-03-24 16:19       ` Alexei Starovoitov
2019-03-25  8:33         ` Eric Dumazet
2019-03-25  8:48           ` Eric Dumazet
2019-03-26  4:27             ` Alexei Starovoitov
2019-03-26  8:06               ` Eric Dumazet
2019-03-26 15:07                 ` Alexei Starovoitov
2019-03-26 15:43                   ` Eric Dumazet
2019-03-26 17:01                     ` Alexei Starovoitov
2019-03-26 18:07                       ` Eric Dumazet
2019-03-26  8:13               ` Eric Dumazet
2019-03-24  5:48     ` Eric Dumazet
2019-03-24  1:14   ` Lawrence Brakmo
2019-03-24  5:58     ` Eric Dumazet
