* [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP
@ 2019-04-04  0:12 brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

This patchset adds support for propagating congestion notifications (cn)
to TCP from cgroup inet skb egress BPF programs.

Current cgroup skb BPF programs cannot trigger TCP congestion window
reductions, even when they drop a packet. This patchset adds support
for cgroup skb BPF programs to send congestion notifications in their
return value when the packets are TCP packets. Rather than the
current 1 for keeping the packet and 0 for dropping it, they can
now return:
    NET_XMIT_SUCCESS    (0)    - continue with packet output
    NET_XMIT_DROP       (1)    - drop packet and do cn
    NET_XMIT_CN         (2)    - continue with packet output and do cn
    -EPERM                     - drop packet
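
For illustration, a minimal sketch (not part of this patchset) of an
egress cgroup skb program using the new return values; the section name,
helper header and threshold are assumptions and depend on the loader
used. As patch 2 notes, a program returning 2 or 3 must be loaded with
expected_attach_type set to BPF_CGROUP_INET_EGRESS:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  #define ALLOW_PKT      1
  #define ALLOW_PKT_CWR  3

  SEC("cgroup_skb/egress")
  int throttle_egress(struct __sk_buff *skb)
  {
          /* Hypothetical policy: ask TCP to reduce its cwnd (cn) for
           * packets larger than 1500 bytes, let everything else through.
           */
          if (skb->len > 1500)
                  return ALLOW_PKT_CWR;   /* keep packet and do cn */
          return ALLOW_PKT;               /* keep packet, no cn */
  }

  char _license[] SEC("license") = "GPL";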

There is also support for setting the probe timer to a small value,
specified by a sysctl, when a packet is dropped by queue_xmit in
__tcp_transmit_skb and there are no other packets in transit.

Finally, HBM programs are modified to collect and return more
statistics.

There has been some discussion regarding the best place to manage
bandwidth. Some believe this should be done in the qdisc, where it can
also be managed with a BPF program. We believe there are advantages
to doing it with a BPF program in the cgroup/skb callback. For example,
it reduces overhead in the cases where there is one primary workload and
one or more secondary workloads, where each workload is running in its
own cgroup v2. In this scenario, we only need to throttle the secondary
workloads, and there is no overhead for the primary workload since there
will be no BPF program attached to its cgroup.

Regardless, we agree that this mechanism should not penalize those who
are not using it. We tested this by doing 1-byte req/reply RPCs over
loopback. Each test consists of 30 seconds of back-to-back 1-byte RPCs.
Each test was repeated 50 times with a 1-minute delay between each set
of 10. We then calculated the average RPCs/sec over the 50 tests. We
compared plain upstream, upstream + patchset with no BPF program, and
upstream + patchset with a BPF program that just returns ALLOW_PKT.
Here are the results:

upstream                           80937 RPCs/sec
upstream + patches, no BPF program 80894 RPCs/sec
upstream + patches, BPF program    80634 RPCs/sec

These numbers indicate that there is no penalty for these patches.

The use of congestion notifications improves the performance of HBM when
using Cubic. Without congestion notifications, Cubic will not decrease its
cwnd and HBM will need to drop a large percentage of the packets.
Smaller probe timers improve the performance of Cubic and DCTCP when the
rates are small enough that there are times when HBM cannot send a packet
per RTT and still maintain the bandwidth limit.

The following results were obtained for rate limits of 1Gbps and 200Mbps,
between two servers using netperf, with only one flow. We also show how
reducing the maximum delayed ACK timer can improve the performance when
using Cubic.

Command used was:
  ./do_hbm_test.sh -l -D --stats -N -r=<rate> [--no_cn] [dctcp] \
                   -s=<server running netserver>
  where:
     <rate>   is 1000 or 200
     --no_cn  specifies no cwr notifications
     dctcp    uses dctcp

                       Cubic                    DCTCP
Lim,Prob,DA    Mbps cwnd cred drops  Mbps cwnd cred drops
------------   ---- ---- ---- -----  ---- ---- ---- -----
  1G, 0,40       35  462 -320 67%     995    1 -212  0.05%
  1G, 0,40,cn   736    9  -78  0.07   995    1 -212  0.05
  1G, 0, 5,cn   941    2 -189  0.13   995    1 -212  0.05

200M, 0,40,cn    50    3 -152  0.34    31    3 -203  0.50
200M, 0, 5,cn    43    2 -202  0.48    33    3 -199  0.50
200M,20, 5,cn   199    2 -209  0.38   199    1 -214  0.30

Notes:
  --no_cn has no effect with DCTCP
  Lim = rate limit
  Prob = probe timer
  DA = maximum delayed ACK timer
  cred = credit in packets
  drops = % packets dropped

brakmo (7):
  bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
  bpf: cgroup inet skb programs can return 0 to 3
  bpf: Update __cgroup_bpf_run_filter_skb with cn
  bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
  bpf: sysctl for probe_on_drop
  bpf: Add cn support to hbm_out_kern.c
  bpf: Add more stats to HBM

 include/linux/bpf.h        | 50 +++++++++++++++++++++++++++++
 include/linux/filter.h     |  3 +-
 include/net/netns/ipv4.h   |  1 +
 kernel/bpf/cgroup.c        | 25 ++++++++++++---
 kernel/bpf/syscall.c       | 12 +++++++
 kernel/bpf/verifier.c      | 16 +++++++--
 net/ipv4/ip_output.c       | 34 +++++++++++++-------
 net/ipv4/sysctl_net_ipv4.c | 10 ++++++
 net/ipv4/tcp_ipv4.c        |  1 +
 net/ipv4/tcp_output.c      | 18 +++++++++--
 net/ipv6/ip6_output.c      | 26 +++++++++------
 samples/bpf/do_hbm_test.sh | 10 ++++--
 samples/bpf/hbm.c          | 51 +++++++++++++++++++++++++++--
 samples/bpf/hbm.h          |  9 +++++-
 samples/bpf/hbm_kern.h     | 66 ++++++++++++++++++++++++++++++++++++--
 samples/bpf/hbm_out_kern.c | 48 +++++++++++++++++++--------
 16 files changed, 326 insertions(+), 54 deletions(-)

-- 
2.17.1


* [PATCH v2 bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
@ 2019-04-04  0:12 ` brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Create a new macro, BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(), to be used by
__cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
request cwr for TCP packets.

Current cgroup skb programs can only return 0 or 1 (0 to drop the
packet). This macro changes the behavior so the low order bit
indicates whether the packet should be dropped (0) or not (1)
and the next bit is used for congestion notification (cn).

Hence, new allowed return values of CGROUP EGRESS BPF programs are:
  0: drop packet
  1: keep packet
  2: drop packet and call cwr
  3: keep packet and call cwr

This macro then converts the program's return value to one of the
NET_XMIT values or to -EPERM, which has the effect of dropping the
packet with no cn:
  NET_XMIT_SUCCESS (0)  skb should be transmitted (no cn)
  NET_XMIT_DROP    (1)  skb should be dropped and cwr called
  NET_XMIT_CN      (2)  skb should be transmitted and cwr called
  -EPERM                skb should be dropped (no cn)

Note that when more than one BPF program is called, the packet is
dropped if at least one of the programs requests it be dropped, and
there is cn if at least one program returns cn.
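
For example (return values chosen purely for illustration), with two
programs A and B attached to the same cgroup and using the _ret/_cn
logic of the macro below:

  A returns 3 (keep + cn), B returns 1 (keep, no cn):
    _ret = 1 & (3 & 1) & (1 & 1) = 1,  _cn = (3 & 2) | (1 & 2) = 2
    -> NET_XMIT_CN (transmit and do cn)
  A returns 3 (keep + cn), B returns 0 (drop, no cn):
    _ret = 1 & (3 & 1) & (0 & 1) = 0,  _cn = (3 & 2) | (0 & 2) = 2
    -> NET_XMIT_DROP (drop and do cn)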

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/linux/bpf.h | 50 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f62897198844..4037cf952d86 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -515,6 +515,56 @@ _out:							\
 		_ret;					\
 	 })
 
+/* To be used by __cgroup_bpf_run_filter_skb for EGRESS BPF progs
+ * so BPF programs can request cwr for TCP packets.
+ *
+ * Current cgroup skb programs can only return 0 or 1 (0 to drop the
+ * packet). This macro changes the behavior so the low order bit
+ * indicates whether the packet should be dropped (0) or not (1)
+ * and the next bit is a congestion notification bit. This could be
+ * used by TCP to call tcp_enter_cwr()
+ *
+ * Hence, new allowed return values of CGROUP EGRESS BPF programs are:
+ *   0: drop packet
+ *   1: keep packet
+ *   2: drop packet and cn
+ *   3: keep packet and cn
+ *
+ * This macro then converts it to one of the NET_XMIT values or to an
+ * error code that is interpreted as drop packet (and no cn):
+ *   NET_XMIT_SUCCESS (0)  skb should be transmitted
+ *   NET_XMIT_DROP    (1)  skb should be dropped and cn
+ *   NET_XMIT_CN      (2)  skb should be transmitted and cn
+ *   -EPERM                skb should be dropped
+ */
+#define BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(array, ctx, func)		\
+	({						\
+		struct bpf_prog_array_item *_item;	\
+		struct bpf_prog *_prog;			\
+		struct bpf_prog_array *_array;		\
+		u32 ret;				\
+		u32 _ret = 1;				\
+		u32 _cn = 0;				\
+		preempt_disable();			\
+		rcu_read_lock();			\
+		_array = rcu_dereference(array);	\
+		_item = &_array->items[0];		\
+		while ((_prog = READ_ONCE(_item->prog))) {		\
+			bpf_cgroup_storage_set(_item->cgroup_storage);	\
+			ret = func(_prog, ctx);		\
+			_ret &= (ret & 1);		\
+			_cn |= (ret & 2);		\
+			_item++;			\
+		}					\
+		rcu_read_unlock();			\
+		preempt_enable_no_resched();		\
+		if (_ret)				\
+			_ret = (_cn ? NET_XMIT_CN : NET_XMIT_SUCCESS);	\
+		else					\
+			_ret = (_cn ? NET_XMIT_DROP : -EPERM);		\
+		_ret;					\
+	})
+
 #define BPF_PROG_RUN_ARRAY(array, ctx, func)		\
 	__BPF_PROG_RUN_ARRAY(array, ctx, func, false)
 
-- 
2.17.1


* [PATCH v2 bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
@ 2019-04-04  0:12 ` brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Allows cgroup inet skb programs to return values in the range [0, 3].
The second bit is used to determine if congestion occurred and the
higher level protocol should decrease its rate. E.g. TCP would call
tcp_enter_cwr().

The bpf_prog must set expected_attach_type to BPF_CGROUP_INET_EGRESS
at load time if it uses the new return values (i.e. 2 or 3).

The expected_attach_type is currently not enforced for
BPF_PROG_TYPE_CGROUP_SKB. For example, a bpf_prog with
expected_attach_type set to BPF_CGROUP_INET_EGRESS can currently be
attached to BPF_CGROUP_INET_INGRESS. Blindly enforcing
expected_attach_type would break backward compatibility.

This patch adds an enforce_expected_attach_type bit so that
expected_attach_type is only enforced when the program uses the new
return values.
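
For illustration only, a minimal user-space sketch of loading a trivial
program that uses the new return values and therefore declares
BPF_CGROUP_INET_EGRESS at load time. It assumes libbpf's
bpf_load_program_xattr() and bpf_prog_attach() helpers; the cgroup path
and the program itself are hypothetical:

  #include <fcntl.h>
  #include <stdio.h>
  #include <linux/bpf.h>
  #include <bpf/bpf.h>

  static int load_and_attach_egress(const char *cgroup_path)
  {
          /* Trivial program: "r0 = 3; exit", i.e. keep packet and do cn */
          struct bpf_insn insns[] = {
                  { .code = BPF_ALU64 | BPF_MOV | BPF_K,
                    .dst_reg = BPF_REG_0, .imm = 3 },
                  { .code = BPF_JMP | BPF_EXIT },
          };
          struct bpf_load_program_attr attr = {
                  .prog_type = BPF_PROG_TYPE_CGROUP_SKB,
                  /* Needed by the verifier to allow return values 2 and 3 */
                  .expected_attach_type = BPF_CGROUP_INET_EGRESS,
                  .insns = insns,
                  .insns_cnt = 2,
                  .license = "GPL",
          };
          char log[4096] = "";
          int prog_fd, cg_fd;

          prog_fd = bpf_load_program_xattr(&attr, log, sizeof(log));
          if (prog_fd < 0) {
                  fprintf(stderr, "prog load failed:\n%s\n", log);
                  return -1;
          }
          cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
          if (cg_fd < 0)
                  return -1;
          /* Attaching the same prog at INGRESS would now fail with -EINVAL */
          return bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_EGRESS, 0);
  }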

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 include/linux/filter.h |  3 ++-
 kernel/bpf/syscall.c   | 12 ++++++++++++
 kernel/bpf/verifier.c  | 16 +++++++++++++---
 3 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6074aa064b54..67755a837c48 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -510,7 +510,8 @@ struct bpf_prog {
 				blinded:1,	/* Was blinded */
 				is_func:1,	/* program is a bpf function */
 				kprobe_override:1, /* Do we override a kprobe? */
-				has_callchain_buf:1; /* callchain buffer allocated? */
+				has_callchain_buf:1, /* callchain buffer allocated? */
+				enforce_expected_attach_type:1; /* Enforce expected_attach_type checking at attach time */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index afca36f53c49..36c24ef0b32d 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1521,6 +1521,14 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		switch (expected_attach_type) {
+		case BPF_CGROUP_INET_INGRESS:
+		case BPF_CGROUP_INET_EGRESS:
+			return 0;
+		default:
+			return -EINVAL;
+		}
 	default:
 		return 0;
 	}
@@ -1765,6 +1773,10 @@ static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 		return attach_type == prog->expected_attach_type ? 0 : -EINVAL;
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		return prog->enforce_expected_attach_type &&
+			prog->expected_attach_type != attach_type ?
+			-EINVAL : 0;
 	default:
 		return 0;
 	}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 87221fda1321..bf66aa7581da 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5160,11 +5160,16 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 
 static int check_return_code(struct bpf_verifier_env *env)
 {
+	struct tnum enforce_attach_type_range = tnum_unknown;
 	struct bpf_reg_state *reg;
 	struct tnum range = tnum_range(0, 1);
 
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_CGROUP_SKB:
+		if (env->prog->expected_attach_type == BPF_CGROUP_INET_EGRESS) {
+			range = tnum_range(0, 3);
+			enforce_attach_type_range = tnum_range(2, 3);
+		}
 	case BPF_PROG_TYPE_CGROUP_SOCK:
 	case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
 	case BPF_PROG_TYPE_SOCK_OPS:
@@ -5182,18 +5187,23 @@ static int check_return_code(struct bpf_verifier_env *env)
 	}
 
 	if (!tnum_in(range, reg->var_off)) {
+		char tn_buf[48];
+
 		verbose(env, "At program exit the register R0 ");
 		if (!tnum_is_unknown(reg->var_off)) {
-			char tn_buf[48];
-
 			tnum_strn(tn_buf, sizeof(tn_buf), reg->var_off);
 			verbose(env, "has value %s", tn_buf);
 		} else {
 			verbose(env, "has unknown scalar value");
 		}
-		verbose(env, " should have been 0 or 1\n");
+		tnum_strn(tn_buf, sizeof(tn_buf), range);
+		verbose(env, " should have been %s\n", tn_buf);
 		return -EINVAL;
 	}
+
+	if (!tnum_is_unknown(enforce_attach_type_range) &&
+	    tnum_in(enforce_attach_type_range, reg->var_off))
+		env->prog->enforce_expected_attach_type = 1;
 	return 0;
 }
 
-- 
2.17.1


* [PATCH v2 bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
@ 2019-04-04  0:12 ` brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

For egress packets, __cgroup_bpf_run_filter_skb() will now call
BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() instead of BPF_PROG_RUN_ARRAY()
in order to propagate congestion notification (cn) requests to TCP
callers.

For egress packets, this function can return:
   NET_XMIT_SUCCESS    (0)    - continue with packet output
   NET_XMIT_DROP       (1)    - drop packet and notify TCP to call cwr
   NET_XMIT_CN         (2)    - continue with packet output and notify TCP
                                to call cwr
   -EPERM                     - drop packet

For ingress packets, this function will return -EPERM if any attached
program was found and it returned != 1 during execution. Otherwise 0
is returned.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 kernel/bpf/cgroup.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 4e807973aa80..ad8108a02153 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -545,8 +545,16 @@ int cgroup_bpf_prog_query(const union bpf_attr *attr,
  * The program type passed in via @type must be suitable for network
  * filtering. No further check is performed to assert that.
  *
- * This function will return %-EPERM if any if an attached program was found
- * and if it returned != 1 during execution. In all other cases, 0 is returned.
+ * For egress packets, this function can return:
+ *   NET_XMIT_SUCCESS    (0)	- continue with packet output
+ *   NET_XMIT_DROP       (1)	- drop packet and notify TCP to call cwr
+ *   NET_XMIT_CN         (2)	- continue with packet output and notify TCP
+ *				  to call cwr
+ *   -EPERM			- drop packet
+ *
+ * For ingress packets, this function will return -EPERM if any
+ * attached program was found and if it returned != 1 during execution.
+ * Otherwise 0 is returned.
  */
 int __cgroup_bpf_run_filter_skb(struct sock *sk,
 				struct sk_buff *skb,
@@ -572,12 +580,19 @@ int __cgroup_bpf_run_filter_skb(struct sock *sk,
 	/* compute pointers for the bpf prog */
 	bpf_compute_and_save_data_end(skb, &saved_data_end);
 
-	ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
-				 __bpf_prog_run_save_cb);
+	if (type == BPF_CGROUP_INET_EGRESS) {
+		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
+			cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
+	} else {
+		ret = BPF_PROG_RUN_ARRAY(cgrp->bpf.effective[type], skb,
+					  __bpf_prog_run_save_cb);
+		ret = (ret == 1 ? 0 : -EPERM);
+	}
 	bpf_restore_data_end(skb, saved_data_end);
 	__skb_pull(skb, offset);
 	skb->sk = save_sk;
-	return ret == 1 ? 0 : -EPERM;
+
+	return ret;
 }
 EXPORT_SYMBOL(__cgroup_bpf_run_filter_skb);
 
-- 
2.17.1


* [PATCH v2 bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (2 preceding siblings ...)
  2019-04-04  0:12 ` [PATCH v2 bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
@ 2019-04-04  0:12 ` brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Update BPF_CGROUP_RUN_PROG_INET_EGRESS() callers to support returning
congestion notifications from the BPF programs.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 net/ipv4/ip_output.c  | 34 +++++++++++++++++++++++-----------
 net/ipv6/ip6_output.c | 26 +++++++++++++++++---------
 2 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c80188875f39..26071f16eb98 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -289,16 +289,9 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	return ret;
 }
 
-static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	unsigned int mtu;
-	int ret;
-
-	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-	if (ret) {
-		kfree_skb(skb);
-		return ret;
-	}
 
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
 	/* Policy lookup after SNAT yielded a new policy */
@@ -317,18 +310,37 @@ static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *sk
 	return ip_finish_output2(net, sk, skb);
 }
 
+static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+
+	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	switch (ret) {
+	case NET_XMIT_SUCCESS:
+		return __ip_finish_output(net, sk, skb);
+	case NET_XMIT_CN:
+		return __ip_finish_output(net, sk, skb) ? : ret;
+	default:
+		kfree_skb(skb);
+		return ret;
+	}
+}
+
 static int ip_mc_finish_output(struct net *net, struct sock *sk,
 			       struct sk_buff *skb)
 {
 	int ret;
 
 	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-	if (ret) {
+	switch (ret) {
+	case NET_XMIT_SUCCESS:
+		return dev_loopback_xmit(net, sk, skb);
+	case NET_XMIT_CN:
+		return dev_loopback_xmit(net, sk, skb) ? : ret;
+	default:
 		kfree_skb(skb);
 		return ret;
 	}
-
-	return dev_loopback_xmit(net, sk, skb);
 }
 
 int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index edbd12067170..733f098b28fb 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -128,16 +128,8 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
 	return -EINVAL;
 }
 
-static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-	int ret;
-
-	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
-	if (ret) {
-		kfree_skb(skb);
-		return ret;
-	}
-
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
 	/* Policy lookup after SNAT yielded a new policy */
 	if (skb_dst(skb)->xfrm) {
@@ -154,6 +146,22 @@ static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *s
 		return ip6_finish_output2(net, sk, skb);
 }
 
+static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	int ret;
+
+	ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+	switch (ret) {
+	case NET_XMIT_SUCCESS:
+		return __ip6_finish_output(net, sk, skb);
+	case NET_XMIT_CN:
+		return __ip6_finish_output(net, sk, skb) ? : ret;
+	default:
+		kfree_skb(skb);
+		return ret;
+	}
+}
+
 int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct net_device *dev = skb_dst(skb)->dev;
-- 
2.17.1


* [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (3 preceding siblings ...)
  2019-04-04  0:12 ` [PATCH v2 bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
@ 2019-04-04  0:12 ` brakmo
  2019-04-08 16:16   ` Neal Cardwell
  2019-04-04  0:12 ` [PATCH v2 bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 7/7] bpf: Add more stats to HBM brakmo
  6 siblings, 1 reply; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

When a packet is dropped by queue_xmit in __tcp_transmit_skb and
packets_out is 0, it is beneficial to set a small probe timer.
Otherwise, the throughput for the flow can suffer because it may need to
depend on the probe timer to start sending again. The default value for
the probe timer is at least 200ms; this patch sets it to 20ms when a
packet is dropped and there are no other packets in flight.

This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
used to specify the duration of the probe timer for the case described
earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
disables setting the probe timer with a small value.
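
For example (the sysctl path follows from the ipv4_net_table entry
below; the values shown are only illustrative):

  # probe 5ms after a drop that leaves no packets in flight
  sysctl -w net.ipv4.tcp_probe_on_drop_ms=5

  # disable the small probe-on-drop timer entirely
  sysctl -w net.ipv4.tcp_probe_on_drop_ms=0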

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 include/net/netns/ipv4.h   |  1 +
 net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
 net/ipv4/tcp_ipv4.c        |  1 +
 net/ipv4/tcp_output.c      | 18 +++++++++++++++---
 4 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 7698460a3dd1..26c52d294dea 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -166,6 +166,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
 	int sysctl_tcp_comp_sack_nr;
+	int sysctl_tcp_probe_on_drop_ms;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	struct inet_timewait_death_row tcp_death_row;
 	int sysctl_max_syn_backlog;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2316c08e9591..b1e4420b6d6c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static int comp_sack_nr_max = 255;
 static u32 u32_max_div_HZ = UINT_MAX / HZ;
+static int probe_on_drop_max = TCP_RTO_MIN;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -1228,6 +1229,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1		= &zero,
 		.extra2		= &comp_sack_nr_max,
 	},
+	{
+		.procname	= "tcp_probe_on_drop_ms",
+		.data		= &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &probe_on_drop_max,
+	},
 	{
 		.procname	= "udp_rmem_min",
 		.data		= &init_net.ipv4.sysctl_udp_rmem_min,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3979939804b7..3df6735f0f07 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
 	net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
 	atomic_set(&net->ipv4.tfo_active_disable_times, 0);
+	net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;
 
 	/* Reno is always built in */
 	if (!net_eq(net, &init_net) &&
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e265d1aeeb66..f101e4d7c1ae 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
 
 	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
 
-	if (unlikely(err > 0)) {
-		tcp_enter_cwr(sk);
-		err = net_xmit_eval(err);
+	if (unlikely(err)) {
+		if (unlikely(err > 0)) {
+			tcp_enter_cwr(sk);
+			err = net_xmit_eval(err);
+		}
+		/* Packet was dropped. If there are no packets out,
+		 * we may need to depend on probe timer to start sending
+		 * again. Hence, use a smaller value.
+		 */
+		if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
+		    sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
+			tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
+					     sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
+					     TCP_RTO_MAX, NULL);
+		}
 	}
 	if (!err && oskb) {
 		tcp_update_skb_after_send(sk, oskb, prior_wstamp);
-- 
2.17.1


* [PATCH v2 bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (4 preceding siblings ...)
  2019-04-04  0:12 ` [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
@ 2019-04-04  0:12 ` brakmo
  2019-04-04  0:12 ` [PATCH v2 bpf-next 7/7] bpf: Add more stats to HBM brakmo
  6 siblings, 0 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Update hbm_out_kern.c to support returning cn notifications.
Also update the relevant files to allow disabling cn notifications.
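
For example, a hypothetical invocation (the rate, run time and object
file name are assumptions; see the usage text in the diff below):

  # 1 Gbps limit on cgroup /hbm1, cn notifications disabled, stats on
  ./hbm --no_cn -r 1000 -t 60 -s hbm_out_kern.o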

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/do_hbm_test.sh | 10 +++++++---
 samples/bpf/hbm.c          | 18 +++++++++++++++---
 samples/bpf/hbm.h          |  3 ++-
 samples/bpf/hbm_out_kern.c | 26 +++++++++++++++++++++-----
 4 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/samples/bpf/do_hbm_test.sh b/samples/bpf/do_hbm_test.sh
index 56c8b4115c95..e48b047d4646 100755
--- a/samples/bpf/do_hbm_test.sh
+++ b/samples/bpf/do_hbm_test.sh
@@ -13,10 +13,10 @@ Usage() {
   echo "egress or ingress bandwidht. It then uses iperf3 or netperf to create"
   echo "loads. The output is the goodput in Mbps (unless -D was used)."
   echo ""
-  echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>] [-D]"
-  echo "             [-d=<delay>|--delay=<delay>] [--debug] [-E]"
+  echo "USAGE: $name [out] [-b=<prog>|--bpf=<prog>] [-c=<cc>|--cc=<cc>]"
+  echo "             [-D] [-d=<delay>|--delay=<delay>] [--debug] [-E]"
   echo "             [-f=<#flows>|--flows=<#flows>] [-h] [-i=<id>|--id=<id >]"
-  echo "             [-l] [-N] [-p=<port>|--port=<port>] [-P]"
+  echo "             [-l] [-N] [--no_cn] [-p=<port>|--port=<port>] [-P]"
   echo "             [-q=<qdisc>] [-R] [-s=<server>|--server=<server]"
   echo "             [-S|--stats] -t=<time>|--time=<time>] [-w] [cubic|dctcp]"
   echo "  Where:"
@@ -33,6 +33,7 @@ Usage() {
   echo "    -f or --flows     number of concurrent flows (default=1)"
   echo "    -i or --id        cgroup id (an integer, default is 1)"
   echo "    -N                use netperf instead of iperf3"
+  echo "    --no_cn           Do not return CN notifications"
   echo "    -l                do not limit flows using loopback"
   echo "    -h                Help"
   echo "    -p or --port      iperf3 port (default is 5201)"
@@ -115,6 +116,9 @@ processArgs () {
     -c=*|--cc=*)
       cc="${i#*=}"
       ;;
+    --no_cn)
+      flags="$flags --no_cn"
+      ;;
     --debug)
       flags="$flags -d"
       debug_flag=1
diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index a79828ab273f..c5629bae67ab 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -16,6 +16,7 @@
  *    -l	Also limit flows doing loopback
  *    -n <#>	To create cgroup \"/hbm#\" and attach prog
  *		Default is /hbm1
+ *    --no_cn   Do not return cn notifications
  *    -r <rate>	Rate limit in Mbps
  *    -s	Get HBM stats (marked, dropped, etc.)
  *    -t <time>	Exit after specified seconds (default is 0)
@@ -42,6 +43,7 @@
 
 #include <linux/bpf.h>
 #include <bpf/bpf.h>
+#include <getopt.h>
 
 #include "bpf_load.h"
 #include "bpf_rlimit.h"
@@ -59,6 +61,7 @@ bool stats_flag;
 bool loopback_flag;
 bool debugFlag;
 bool work_conserving_flag;
+bool no_cn_flag;
 
 static void Usage(void);
 static void read_trace_pipe2(void);
@@ -185,6 +188,7 @@ static int run_bpf_prog(char *prog, int cg_id)
 	qstats.rate = rate;
 	qstats.stats = stats_flag ? 1 : 0;
 	qstats.loopback = loopback_flag ? 1 : 0;
+	qstats.no_cn = no_cn_flag ? 1 : 0;
 	if (bpf_map_update_elem(map_fd, &key, &qstats, BPF_ANY)) {
 		printf("ERROR: Could not update map element\n");
 		goto err;
@@ -366,14 +370,15 @@ static void Usage(void)
 {
 	printf("This program loads a cgroup skb BPF program to enforce\n"
 	       "cgroup output (egress) bandwidth limits.\n\n"
-	       "USAGE: hbm [-o] [-d]  [-l] [-n <id>] [-r <rate>] [-s]\n"
-	       "           [-t <secs>] [-w] [-h] [prog]\n"
+	       "USAGE: hbm [-o] [-d]  [-l] [-n <id>] [--no_cn] [-r <rate>]\n"
+	       "           [-s] [-t <secs>] [-w] [-h] [prog]\n"
 	       "  Where:\n"
 	       "    -o         indicates egress direction (default)\n"
 	       "    -d         print BPF trace debug buffer\n"
 	       "    -l         also limit flows using loopback\n"
 	       "    -n <#>     to create cgroup \"/hbm#\" and attach prog\n"
 	       "               Default is /hbm1\n"
+	       "    --no_cn    disable CN notifications\n"
 	       "    -r <rate>  Rate in Mbps\n"
 	       "    -s         Update HBM stats\n"
 	       "    -t <time>  Exit after specified seconds (default is 0)\n"
@@ -393,9 +398,16 @@ int main(int argc, char **argv)
 	int  k;
 	int cg_id = 1;
 	char *optstring = "iodln:r:st:wh";
+	struct option loptions[] = {
+		{"no_cn", 0, NULL, 1},
+		{NULL, 0, NULL, 0}
+	};
 
-	while ((k = getopt(argc, argv, optstring)) != -1) {
+	while ((k = getopt_long(argc, argv, optstring, loptions, NULL)) != -1) {
 		switch (k) {
+		case 1:
+			no_cn_flag = true;
+			break;
 		case'o':
 			break;
 		case 'd':
diff --git a/samples/bpf/hbm.h b/samples/bpf/hbm.h
index 518e8147d084..c08247cec2a7 100644
--- a/samples/bpf/hbm.h
+++ b/samples/bpf/hbm.h
@@ -19,7 +19,8 @@ struct hbm_vqueue {
 struct hbm_queue_stats {
 	unsigned long rate;		/* in Mbps*/
 	unsigned long stats:1,		/* get HBM stats (marked, dropped,..) */
-		loopback:1;		/* also limit flows using loopback */
+		loopback:1,		/* also limit flows using loopback */
+		no_cn:1;		/* do not use cn flags */
 	unsigned long long pkts_marked;
 	unsigned long long bytes_marked;
 	unsigned long long pkts_dropped;
diff --git a/samples/bpf/hbm_out_kern.c b/samples/bpf/hbm_out_kern.c
index f806863d0b79..fa3ea92e1564 100644
--- a/samples/bpf/hbm_out_kern.c
+++ b/samples/bpf/hbm_out_kern.c
@@ -119,13 +119,16 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	// Set flags (drop, congestion, cwr)
 	// Dropping => we are congested, so ignore congestion flag
 	if (credit < -DROP_THRESH ||
-	    (len > LARGE_PKT_THRESH &&
-	     credit < -LARGE_PKT_DROP_THRESH)) {
-		// Very congested, set drop flag
+	    (len > LARGE_PKT_THRESH && credit < -LARGE_PKT_DROP_THRESH)) {
+		// Very congested, set drop packet
 		drop_flag = true;
+		if (pkti.ecn)
+			congestion_flag = true;
+		else if (pkti.is_tcp)
+			cwr_flag = true;
 	} else if (credit < 0) {
 		// Congested, set congestion flag
-		if (pkti.ecn) {
+		if (pkti.ecn || pkti.is_tcp) {
 			if (credit < -MARK_THRESH)
 				congestion_flag = true;
 			else
@@ -137,7 +140,15 @@ int _hbm_out_cg(struct __sk_buff *skb)
 
 	if (congestion_flag) {
 		if (!bpf_skb_ecn_set_ce(skb)) {
-			if (len > LARGE_PKT_THRESH) {
+			if (pkti.is_tcp) {
+				unsigned int rand = bpf_get_prandom_u32();
+
+				if (-credit >= MARK_THRESH +
+				    (rand % MARK_REGION_SIZE)) {
+					// Do congestion control
+					cwr_flag = true;
+				}
+			} else if (len > LARGE_PKT_THRESH) {
 				// Problem if too many small packets?
 				drop_flag = true;
 			}
@@ -146,12 +157,17 @@ int _hbm_out_cg(struct __sk_buff *skb)
 
 	if (drop_flag)
 		rv = DROP_PKT;
+	if (qsp != NULL)
+		if (qsp->no_cn)
+			cwr_flag = false;
 
 	hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag);
 
 	if (rv == DROP_PKT)
 		__sync_add_and_fetch(&(qdp->credit), len);
 
+	if (cwr_flag)
+		rv |= 2;
 	return rv;
 }
 char _license[] SEC("license") = "GPL";
-- 
2.17.1


* [PATCH v2 bpf-next 7/7] bpf: Add more stats to HBM
  2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
                   ` (5 preceding siblings ...)
  2019-04-04  0:12 ` [PATCH v2 bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
@ 2019-04-04  0:12 ` brakmo
  6 siblings, 0 replies; 14+ messages in thread
From: brakmo @ 2019-04-04  0:12 UTC (permalink / raw)
  To: netdev
  Cc: Martin Lau, Alexei Starovoitov, Daniel Borkmann, Eric Dumazet,
	Kernel Team

Adds more stats to HBM, including the average cwnd and rtt of all TCP
flows, the percentage of packets that are ECN CE marked, and the
distribution of return values.

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 samples/bpf/hbm.c          | 33 +++++++++++++++++++
 samples/bpf/hbm.h          |  6 ++++
 samples/bpf/hbm_kern.h     | 66 ++++++++++++++++++++++++++++++++++++--
 samples/bpf/hbm_out_kern.c | 22 ++++++++-----
 4 files changed, 117 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/hbm.c b/samples/bpf/hbm.c
index c5629bae67ab..480b7ad6a1f2 100644
--- a/samples/bpf/hbm.c
+++ b/samples/bpf/hbm.c
@@ -316,6 +316,14 @@ static int run_bpf_prog(char *prog, int cg_id)
 		double percent_pkts, percent_bytes;
 		char fname[100];
 		FILE *fout;
+		int k;
+		static const char *returnValNames[] = {
+			"DROP_PKT",
+			"ALLOW_PKT",
+			"DROP_PKT_CWR",
+			"ALLOW_PKT_CWR"
+		};
+#define RET_VAL_COUNT 4
 
 // Future support of ingress
 //		if (!outFlag)
@@ -350,6 +358,31 @@ static int run_bpf_prog(char *prog, int cg_id)
 			(qstats.bytes_total + 1);
 		fprintf(fout, "pkts_dropped_percent:%6.2f\n", percent_pkts);
 		fprintf(fout, "bytes_dropped_percent:%6.2f\n", percent_bytes);
+
+		// ECN CE markings
+		percent_pkts = (qstats.pkts_ecn_ce * 100.0) /
+			(qstats.pkts_total + 1);
+		fprintf(fout, "pkts_ecn_ce:%6.2f (%d)\n", percent_pkts,
+			(int)qstats.pkts_ecn_ce);
+
+		// Average cwnd
+		fprintf(fout, "avg cwnd:%d\n",
+			(int)(qstats.sum_cwnd / (qstats.sum_cwnd_cnt + 1)));
+		// Average rtt
+		fprintf(fout, "avg rtt:%d\n",
+			(int)(qstats.sum_rtt / (qstats.pkts_total + 1)));
+		// Average credit
+		fprintf(fout, "avg credit:%d\n",
+			(int)(qstats.sum_credit /
+			      (1500 * ((int)qstats.pkts_total) + 1)));
+
+		// Return values stats
+		for (k = 0; k < RET_VAL_COUNT; k++) {
+			percent_pkts = (qstats.returnValCount[k] * 100.0) /
+				(qstats.pkts_total + 1);
+			fprintf(fout, "%s:%6.2f (%d)\n", returnValNames[k],
+				percent_pkts, (int)qstats.returnValCount[k]);
+		}
 		fclose(fout);
 	}
 
diff --git a/samples/bpf/hbm.h b/samples/bpf/hbm.h
index c08247cec2a7..f0963ed6a562 100644
--- a/samples/bpf/hbm.h
+++ b/samples/bpf/hbm.h
@@ -29,4 +29,10 @@ struct hbm_queue_stats {
 	unsigned long long bytes_total;
 	unsigned long long firstPacketTime;
 	unsigned long long lastPacketTime;
+	unsigned long long pkts_ecn_ce;
+	unsigned long long returnValCount[4];
+	unsigned long long sum_cwnd;
+	unsigned long long sum_rtt;
+	unsigned long long sum_cwnd_cnt;
+	long long sum_credit;
 };
diff --git a/samples/bpf/hbm_kern.h b/samples/bpf/hbm_kern.h
index c5635d924193..508a3f93d200 100644
--- a/samples/bpf/hbm_kern.h
+++ b/samples/bpf/hbm_kern.h
@@ -72,17 +72,43 @@ struct bpf_map_def SEC("maps") queue_stats = {
 BPF_ANNOTATE_KV_PAIR(queue_stats, int, struct hbm_queue_stats);
 
 struct hbm_pkt_info {
+	int	cwnd;
+	int	rtt;
 	bool	is_ip;
 	bool	is_tcp;
 	short	ecn;
 };
 
+static int get_tcp_info(struct __sk_buff *skb, struct hbm_pkt_info *pkti)
+{
+	struct bpf_sock *sk;
+	struct bpf_tcp_sock *tp;
+
+	sk = skb->sk;
+	if (sk) {
+		sk = bpf_sk_fullsock(sk);
+		if (sk) {
+			if (sk->protocol == IPPROTO_TCP) {
+				tp = bpf_tcp_sock(sk);
+				if (tp) {
+					pkti->cwnd = tp->snd_cwnd;
+					pkti->rtt = tp->srtt_us >> 3;
+					return 0;
+				}
+			}
+		}
+	}
+	return 1;
+}
+
 static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb,
 					     struct hbm_pkt_info *pkti)
 {
 	struct iphdr iph;
 	struct ipv6hdr *ip6h;
 
+	pkti->cwnd = 0;
+	pkti->rtt = 0;
 	bpf_skb_load_bytes(skb, 0, &iph, 12);
 	if (iph.version == 6) {
 		ip6h = (struct ipv6hdr *)&iph;
@@ -98,6 +124,8 @@ static __always_inline void hbm_get_pkt_info(struct __sk_buff *skb,
 		pkti->is_tcp = false;
 		pkti->ecn = 0;
 	}
+	if (pkti->is_tcp)
+		get_tcp_info(skb, pkti);
 }
 
 static __always_inline void hbm_init_vqueue(struct hbm_vqueue *qdp, int rate)
@@ -112,8 +140,14 @@ static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
 					     int len,
 					     unsigned long long curtime,
 					     bool congestion_flag,
-					     bool drop_flag)
+					     bool drop_flag,
+					     bool cwr_flag,
+					     bool ecn_ce_flag,
+					     struct hbm_pkt_info *pkti,
+					     int credit)
 {
+	int rv = ALLOW_PKT;
+
 	if (qsp != NULL) {
 		// Following is needed for work conserving
 		__sync_add_and_fetch(&(qsp->bytes_total), len);
@@ -123,7 +157,7 @@ static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
 				qsp->firstPacketTime = curtime;
 			qsp->lastPacketTime = curtime;
 			__sync_add_and_fetch(&(qsp->pkts_total), 1);
-			if (congestion_flag || drop_flag) {
+			if (congestion_flag) {
 				__sync_add_and_fetch(&(qsp->pkts_marked), 1);
 				__sync_add_and_fetch(&(qsp->bytes_marked), len);
 			}
@@ -132,6 +166,34 @@ static __always_inline void hbm_update_stats(struct hbm_queue_stats *qsp,
 				__sync_add_and_fetch(&(qsp->bytes_dropped),
 						     len);
 			}
+			if (ecn_ce_flag)
+				__sync_add_and_fetch(&(qsp->pkts_ecn_ce), 1);
+			if (pkti->cwnd) {
+				__sync_add_and_fetch(&(qsp->sum_cwnd),
+						     pkti->cwnd);
+				__sync_add_and_fetch(&(qsp->sum_cwnd_cnt), 1);
+			}
+			if (pkti->rtt)
+				__sync_add_and_fetch(&(qsp->sum_rtt),
+						     pkti->rtt);
+			__sync_add_and_fetch(&(qsp->sum_credit), credit);
+
+			if (drop_flag)
+				rv = DROP_PKT;
+			if (cwr_flag)
+				rv |= 2;
+			if (rv == DROP_PKT)
+				__sync_add_and_fetch(&(qsp->returnValCount[0]),
+						     1);
+			else if (rv == ALLOW_PKT)
+				__sync_add_and_fetch(&(qsp->returnValCount[1]),
+						     1);
+			else if (rv == 2)
+				__sync_add_and_fetch(&(qsp->returnValCount[2]),
+						     1);
+			else if (rv == 3)
+				__sync_add_and_fetch(&(qsp->returnValCount[3]),
+						     1);
 		}
 	}
 }
diff --git a/samples/bpf/hbm_out_kern.c b/samples/bpf/hbm_out_kern.c
index fa3ea92e1564..829934bd43cb 100644
--- a/samples/bpf/hbm_out_kern.c
+++ b/samples/bpf/hbm_out_kern.c
@@ -62,11 +62,12 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	unsigned int queue_index = 0;
 	unsigned long long curtime;
 	int credit;
-	signed long long delta = 0, zero = 0;
+	signed long long delta = 0, new_credit;
 	int max_credit = MAX_CREDIT;
 	bool congestion_flag = false;
 	bool drop_flag = false;
 	bool cwr_flag = false;
+	bool ecn_ce_flag = false;
 	struct hbm_vqueue *qdp;
 	struct hbm_queue_stats *qsp = NULL;
 	int rv = ALLOW_PKT;
@@ -99,9 +100,11 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	 */
 	if (delta > 0) {
 		qdp->lasttime = curtime;
-		credit += CREDIT_PER_NS(delta, qdp->rate);
-		if (credit > MAX_CREDIT)
+		new_credit = credit + CREDIT_PER_NS(delta, qdp->rate);
+		if (new_credit > MAX_CREDIT)
 			credit = MAX_CREDIT;
+		else
+			credit = new_credit;
 	}
 	credit -= len;
 	qdp->credit = credit;
@@ -139,7 +142,9 @@ int _hbm_out_cg(struct __sk_buff *skb)
 	}
 
 	if (congestion_flag) {
-		if (!bpf_skb_ecn_set_ce(skb)) {
+		if (bpf_skb_ecn_set_ce(skb)) {
+			ecn_ce_flag = true;
+		} else {
 			if (pkti.is_tcp) {
 				unsigned int rand = bpf_get_prandom_u32();
 
@@ -155,16 +160,17 @@ int _hbm_out_cg(struct __sk_buff *skb)
 		}
 	}
 
-	if (drop_flag)
-		rv = DROP_PKT;
 	if (qsp != NULL)
 		if (qsp->no_cn)
 			cwr_flag = false;
 
-	hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag);
+	hbm_update_stats(qsp, len, curtime, congestion_flag, drop_flag,
+			 cwr_flag, ecn_ce_flag, &pkti, credit);
 
-	if (rv == DROP_PKT)
+	if (drop_flag) {
 		__sync_add_and_fetch(&(qdp->credit), len);
+		rv = DROP_PKT;
+	}
 
 	if (cwr_flag)
 		rv |= 2;
-- 
2.17.1


* Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-04  0:12 ` [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
@ 2019-04-08 16:16   ` Neal Cardwell
  2019-04-08 17:06     ` Eric Dumazet
  2019-05-04  2:33     ` Lawrence Brakmo
  0 siblings, 2 replies; 14+ messages in thread
From: Neal Cardwell @ 2019-04-08 16:16 UTC (permalink / raw)
  To: brakmo
  Cc: netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Kernel Team

On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@fb.com> wrote:
>
> When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
> and packets_out is 0, it is beneficial to set a small probe timer.
> Otherwise, the throughput for the flow can suffer because it may need to
> depend on the probe timer to start sending again. The default value for
> the probe timer is at least 200ms, this patch sets it to 20ms when a
> packet is dropped and there are no other packets in flight.
>
> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
> used to specify the duration of the probe timer for the case described
> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
> disables setting the probe timer with a small value.
>
> Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
> ---
>  include/net/netns/ipv4.h   |  1 +
>  net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
>  net/ipv4/tcp_ipv4.c        |  1 +
>  net/ipv4/tcp_output.c      | 18 +++++++++++++++---
>  4 files changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 7698460a3dd1..26c52d294dea 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -166,6 +166,7 @@ struct netns_ipv4 {
>         int sysctl_tcp_wmem[3];
>         int sysctl_tcp_rmem[3];
>         int sysctl_tcp_comp_sack_nr;
> +       int sysctl_tcp_probe_on_drop_ms;
>         unsigned long sysctl_tcp_comp_sack_delay_ns;
>         struct inet_timewait_death_row tcp_death_row;
>         int sysctl_max_syn_backlog;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 2316c08e9591..b1e4420b6d6c 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
>  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
>  static int comp_sack_nr_max = 255;
>  static u32 u32_max_div_HZ = UINT_MAX / HZ;
> +static int probe_on_drop_max = TCP_RTO_MIN;
>
>  /* obsolete */
>  static int sysctl_tcp_low_latency __read_mostly;
> @@ -1228,6 +1229,15 @@ static struct ctl_table ipv4_net_table[] = {
>                 .extra1         = &zero,
>                 .extra2         = &comp_sack_nr_max,
>         },
> +       {
> +               .procname       = "tcp_probe_on_drop_ms",
> +               .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,

The tcp_reset_xmit_timer () function expects a time in jiffies, so
this would probably need to be proc_dointvec_ms_jiffies?

> +               .extra1         = &zero,
> +               .extra2         = &probe_on_drop_max,
> +       },
>         {
>                 .procname       = "udp_rmem_min",
>                 .data           = &init_net.ipv4.sysctl_udp_rmem_min,
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 3979939804b7..3df6735f0f07 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net)
>         spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
>         net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
>         atomic_set(&net->ipv4.tfo_active_disable_times, 0);
> +       net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;

The tcp_reset_xmit_timer () function expects a time in jiffies, so
this would probably need to be msecs_to_jiffies(20)?

>
>         /* Reno is always built in */
>         if (!net_eq(net, &init_net) &&
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index e265d1aeeb66..f101e4d7c1ae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
>
>         err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
>
> -       if (unlikely(err > 0)) {
> -               tcp_enter_cwr(sk);
> -               err = net_xmit_eval(err);
> +       if (unlikely(err)) {
> +               if (unlikely(err > 0)) {
> +                       tcp_enter_cwr(sk);
> +                       err = net_xmit_eval(err);
> +               }
> +               /* Packet was dropped. If there are no packets out,
> +                * we may need to depend on probe timer to start sending
> +                * again. Hence, use a smaller value.
> +                */
> +               if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
> +                   sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
> +                       tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
> +                                            sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
> +                                            TCP_RTO_MAX, NULL);
> +               }

My reading of the code is that any timer that was set here would
typically be quickly overridden by the tcp_check_probe_timer() that
usually happens after most tcp_write_xmit() calls. Am I reading that
correctly?

There seems to be a philosophical difference between this patch and
the existing logic in tcp_check_probe_timer()/tcp_probe0_base():

/* Something is really bad, we could not queue an additional packet,
 * because qdisc is full or receiver sent a 0 window, or we are paced.
 * We do not want to add fuel to the fire, or abort too early,
 * so make sure the timer we arm now is at least 200ms in the future,
 * regardless of current icsk_rto value (as it could be ~2ms)
 */
static inline unsigned long tcp_probe0_base(const struct sock *sk)
{
  return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
}
...
static inline void tcp_check_probe_timer(struct sock *sk)
{
        if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
                tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                     tcp_probe0_base(sk), TCP_RTO_MAX,
                                     NULL);
}

If we apply some version of this patch then we should probably at
least update this comment above tcp_probe0_base() to clarify the new
strategy.

cheers,
neal

* Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-08 16:16   ` Neal Cardwell
@ 2019-04-08 17:06     ` Eric Dumazet
  2019-04-08 17:38       ` Yuchung Cheng
  2019-05-04  2:35       ` Lawrence Brakmo
  2019-05-04  2:33     ` Lawrence Brakmo
  1 sibling, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2019-04-08 17:06 UTC (permalink / raw)
  To: Neal Cardwell, brakmo
  Cc: netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann, Kernel Team



On 04/08/2019 09:16 AM, Neal Cardwell wrote:
> On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@fb.com> wrote:
>>
>> When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
>> and packets_out is 0, it is beneficial to set a small probe timer.
>> Otherwise, the throughput for the flow can suffer because it may need to
>> depend on the probe timer to start sending again. The default value for
>> the probe timer is at least 200ms, this patch sets it to 20ms when a
>> packet is dropped and there are no other packets in flight.
>>
>> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
>> used to specify the duration of the probe timer for the case described
>> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
>> disables setting the probe timer with a small value.

This seems to contradict our recent work ?

See recent Yuchung patch series :

c1d5674f8313b9f8e683c265f1c00a2582cf5fc5 tcp: less aggressive window probing on local congestion
590d2026d62418bb27de9ca87526e9131c1f48af tcp: retry more conservatively on local congestion
9721e709fa68ef9b860c322b474cfbd1f8285b0f tcp: simplify window probe aborting on USER_TIMEOUT
01a523b071618abbc634d1958229fe3bd2dfa5fa tcp: create a helper to model exponential backoff
c7d13c8faa74f4e8ef191f88a252cefab6805b38 tcp: properly track retry time on passive Fast Open
7ae189759cc48cf8b54beebff566e9fd2d4e7d7c tcp: always set retrans_stamp on recovery
7f12422c4873e9b274bc151ea59cb0cdf9415cf1 tcp: always timestamp on every skb transmission


* Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-08 17:06     ` Eric Dumazet
@ 2019-04-08 17:38       ` Yuchung Cheng
  2019-05-04  2:38         ` Lawrence Brakmo
  2019-05-04  2:35       ` Lawrence Brakmo
  1 sibling, 1 reply; 14+ messages in thread
From: Yuchung Cheng @ 2019-04-08 17:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, brakmo, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team

On Mon, Apr 8, 2019 at 10:07 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 04/08/2019 09:16 AM, Neal Cardwell wrote:
> > On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@fb.com> wrote:
> >>
> >> When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
> >> and packets_out is 0, it is beneficial to set a small probe timer.
> >> Otherwise, the throughput for the flow can suffer because it may need to
> >> depend on the probe timer to start sending again. The default value for
> >> the probe timer is at least 200ms, this patch sets it to 20ms when a
> >> packet is dropped and there are no other packets in flight.
> >>
> >> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
> >> used to specify the duration of the probe timer for the case described
> >> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
> >> disables setting the probe timer with a small value.
>
> This seems to contradict our recent work ?
>
> See recent Yuchung patch series :
>
> c1d5674f8313b9f8e683c265f1c00a2582cf5fc5 tcp: less aggressive window probing on local congestion
> 590d2026d62418bb27de9ca87526e9131c1f48af tcp: retry more conservatively on local congestion

I would appreciate it if a direct change to the TCP stack started with
a tcp: subject instead of the confusing bpf one, for TCP developers.

A packet being dropped at the local layer is a sign of severe congestion
-- it's caused by the application bursting on many (idle or new)
connections. With this patch, the (many) connections that fail on the
first try (including SYN and pure ACKs) will all come back at 20ms,
instead of their RTT-adjusted RTOs. So the end effect is the
application repeatedly pounding the local qdisc to squeeze
out some performance.

This patch seems to apply to a special case where local congestion
only lives for a very short period. I don't think it applies well as a
general principle for congestion control.

> 9721e709fa68ef9b860c322b474cfbd1f8285b0f tcp: simplify window probe aborting on USER_TIMEOUT
> 01a523b071618abbc634d1958229fe3bd2dfa5fa tcp: create a helper to model exponential backoff
> c7d13c8faa74f4e8ef191f88a252cefab6805b38 tcp: properly track retry time on passive Fast Open
> 7ae189759cc48cf8b54beebff566e9fd2d4e7d7c tcp: always set retrans_stamp on recovery
> 7f12422c4873e9b274bc151ea59cb0cdf9415cf1 tcp: always timestamp on every skb transmission
>

* Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-08 16:16   ` Neal Cardwell
  2019-04-08 17:06     ` Eric Dumazet
@ 2019-05-04  2:33     ` Lawrence Brakmo
  1 sibling, 0 replies; 14+ messages in thread
From: Lawrence Brakmo @ 2019-05-04  2:33 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Yuchung Cheng, Kernel Team

Sorry for the delay, due to a combination of changing teams and getting pneumonia. Answers inline, but note that I will be sending a patchset for review that does not include these changes to the probe timer.

On 4/8/19, 9:17 AM, "Neal Cardwell" <ncardwell@google.com> wrote:

    On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@fb.com> wrote:
    >
    > When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
    > and packets_out is 0, it is beneficial to set a small probe timer.
    > Otherwise, the throughput for the flow can suffer because it may need to
    > depend on the probe timer to start sending again. The default value for
    > the probe timer is at least 200ms, this patch sets it to 20ms when a
    > packet is dropped and there are no other packets in flight.
    >
    > This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
    > used to specify the duration of the probe timer for the case described
    > earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
    > disables setting the probe timer with a small value.
    >
    > Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
    > ---
    >  include/net/netns/ipv4.h   |  1 +
    >  net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
    >  net/ipv4/tcp_ipv4.c        |  1 +
    >  net/ipv4/tcp_output.c      | 18 +++++++++++++++---
    >  4 files changed, 27 insertions(+), 3 deletions(-)
    >
    > diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
    > index 7698460a3dd1..26c52d294dea 100644
    > --- a/include/net/netns/ipv4.h
    > +++ b/include/net/netns/ipv4.h
    > @@ -166,6 +166,7 @@ struct netns_ipv4 {
    >         int sysctl_tcp_wmem[3];
    >         int sysctl_tcp_rmem[3];
    >         int sysctl_tcp_comp_sack_nr;
    > +       int sysctl_tcp_probe_on_drop_ms;
    >         unsigned long sysctl_tcp_comp_sack_delay_ns;
    >         struct inet_timewait_death_row tcp_death_row;
    >         int sysctl_max_syn_backlog;
    > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
    > index 2316c08e9591..b1e4420b6d6c 100644
    > --- a/net/ipv4/sysctl_net_ipv4.c
    > +++ b/net/ipv4/sysctl_net_ipv4.c
    > @@ -49,6 +49,7 @@ static int ip_ping_group_range_min[] = { 0, 0 };
    >  static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
    >  static int comp_sack_nr_max = 255;
    >  static u32 u32_max_div_HZ = UINT_MAX / HZ;
    > +static int probe_on_drop_max = TCP_RTO_MIN;
    >
    >  /* obsolete */
    >  static int sysctl_tcp_low_latency __read_mostly;
    > @@ -1228,6 +1229,15 @@ static struct ctl_table ipv4_net_table[] = {
    >                 .extra1         = &zero,
    >                 .extra2         = &comp_sack_nr_max,
    >         },
    > +       {
    > +               .procname       = "tcp_probe_on_drop_ms",
    > +               .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
    > +               .maxlen         = sizeof(int),
    > +               .mode           = 0644,
    > +               .proc_handler   = proc_dointvec_minmax,
    
    The tcp_reset_xmit_timer() function expects a time in jiffies, so
    this would probably need to be proc_dointvec_ms_jiffies?

Yes, good catch.
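
Something like this for v3 (untested sketch; also, if I read kernel/sysctl.c
right, proc_dointvec_ms_jiffies does not look at extra1/extra2, so the
0..TCP_RTO_MIN clamp would then have to be enforced where the value is used
instead):

        {
                .procname       = "tcp_probe_on_drop_ms",
                .data           = &init_net.ipv4.sysctl_tcp_probe_on_drop_ms,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                /* value is written in ms and stored in jiffies */
                .proc_handler   = proc_dointvec_ms_jiffies,
        },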
    
    > +               .extra1         = &zero,
    > +               .extra2         = &probe_on_drop_max,
    > +       },
    >         {
    >                 .procname       = "udp_rmem_min",
    >                 .data           = &init_net.ipv4.sysctl_udp_rmem_min,
    > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
    > index 3979939804b7..3df6735f0f07 100644
    > --- a/net/ipv4/tcp_ipv4.c
    > +++ b/net/ipv4/tcp_ipv4.c
    > @@ -2686,6 +2686,7 @@ static int __net_init tcp_sk_init(struct net *net)
    >         spin_lock_init(&net->ipv4.tcp_fastopen_ctx_lock);
    >         net->ipv4.sysctl_tcp_fastopen_blackhole_timeout = 60 * 60;
    >         atomic_set(&net->ipv4.tfo_active_disable_times, 0);
    > +       net->ipv4.sysctl_tcp_probe_on_drop_ms = 20;
    
    The tcp_reset_xmit_timer() function expects a time in jiffies, so
    this would probably need to be msecs_to_jiffies(20)?

Correct, thanks.
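
I.e. roughly:

        net->ipv4.sysctl_tcp_probe_on_drop_ms = msecs_to_jiffies(20);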
    
    >
    >         /* Reno is always built in */
    >         if (!net_eq(net, &init_net) &&
    > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
    > index e265d1aeeb66..f101e4d7c1ae 100644
    > --- a/net/ipv4/tcp_output.c
    > +++ b/net/ipv4/tcp_output.c
    > @@ -1154,9 +1154,21 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
    >
    >         err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
    >
    > -       if (unlikely(err > 0)) {
    > -               tcp_enter_cwr(sk);
    > -               err = net_xmit_eval(err);
    > +       if (unlikely(err)) {
    > +               if (unlikely(err > 0)) {
    > +                       tcp_enter_cwr(sk);
    > +                       err = net_xmit_eval(err);
    > +               }
    > +               /* Packet was dropped. If there are no packets out,
    > +                * we may need to depend on probe timer to start sending
    > +                * again. Hence, use a smaller value.
    > +                */
    > +               if (!tp->packets_out && !inet_csk(sk)->icsk_pending &&
    > +                   sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms > 0) {
    > +                       tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
    > +                                            sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
    > +                                            TCP_RTO_MAX, NULL);
    > +               }
    
    My reading of the code is that any timer that was set here would
    typically be quickly overridden by the tcp_check_probe_timer() that
    usually happens after most tcp_write_xmit() calls. Am I reading that
    correctly?

No. tcp_check_probe_timer() first checks whether there is a pending timer and
will not overwrite it.
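
Roughly, as I read the ordering (sketch, not actual code):

        __tcp_transmit_skb()
            /* queue_xmit failed; the patch arms a short probe timer */
            tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                 sock_net(sk)->ipv4.sysctl_tcp_probe_on_drop_ms,
                                 TCP_RTO_MAX, NULL);
            /* icsk_pending is now ICSK_TIME_PROBE0 */
        ...
        tcp_check_probe_timer()
            /* !icsk_pending is false here, so this does not re-arm the
             * timer and the short probe timer set above stays in place
             */
            if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
                    tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                         tcp_probe0_base(sk), TCP_RTO_MAX,
                                         NULL);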
    
    There seems to be a philosophical difference between this patch and
    the existing logic in tcp_check_probe_timer()/tcp_probe0_base():
    
    /* Something is really bad, we could not queue an additional packet,
     * because qdisc is full or receiver sent a 0 window, or we are paced.
     * We do not want to add fuel to the fire, or abort too early,
     * so make sure the timer we arm now is at least 200ms in the future,
     * regardless of current icsk_rto value (as it could be ~2ms)
     */
    static inline unsigned long tcp_probe0_base(const struct sock *sk)
    {
      return max_t(unsigned long, inet_csk(sk)->icsk_rto, TCP_RTO_MIN);
    }
    ...
    static inline void tcp_check_probe_timer(struct sock *sk)
    {
            if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
                    tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
                                         tcp_probe0_base(sk), TCP_RTO_MAX,
                                         NULL);
    }
    
    If we apply some version of this patch then we should probably at
    least update this comment above tcp_probe0_base() to clarify the new
    strategy.
    
Thanks for pointing this out. In the use case we consider, packets can be
dropped as a result of applying a rate limit, so in this case drops are not a
sign that things are really bad (and in my measurements they do not occur very
often). In retrospect I should have set the default sysctl value to 0, so the
behavior would not change unless one opts in by setting the sysctl to a
non-zero value.
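
I.e. in tcp_sk_init() the default would simply become (sketch):

        /* 0 = off; opt in by writing a non-zero ms value to the sysctl */
        net->ipv4.sysctl_tcp_probe_on_drop_ms = 0;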

    cheers,
    neal
    


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-08 17:06     ` Eric Dumazet
  2019-04-08 17:38       ` Yuchung Cheng
@ 2019-05-04  2:35       ` Lawrence Brakmo
  1 sibling, 0 replies; 14+ messages in thread
From: Lawrence Brakmo @ 2019-05-04  2:35 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell
  Cc: netdev, Martin Lau, Alexei Starovoitov, Daniel Borkmann,
	Kernel Team, Yuchung Cheng



On 4/8/19, 10:07 AM, "Eric Dumazet" <eric.dumazet@gmail.com> wrote:

    
    
    On 04/08/2019 09:16 AM, Neal Cardwell wrote:
    > On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@fb.com> wrote:
    >>
    >> When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
    >> and packets_out is 0, it is beneficial to set a small probe timer.
    >> Otherwise, the throughput for the flow can suffer because it may need to
    >> depend on the probe timer to start sending again. The default value for
    >> the probe timer is at least 200ms; this patch sets it to 20ms when a
    >> packet is dropped and there are no other packets in flight.
    >>
    >> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
    >> used to specify the duration of the probe timer for the case described
    >> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
    >> disables setting the probe timer with a small value.
    
    This seems to contradict our recent work ?
    
    See recent Yuchung patch series :
    
    c1d5674f8313b9f8e683c265f1c00a2582cf5fc5 tcp: less aggressive window probing on local congestion
    590d2026d62418bb27de9ca87526e9131c1f48af tcp: retry more conservatively on local congestion
    9721e709fa68ef9b860c322b474cfbd1f8285b0f tcp: simplify window probe aborting on USER_TIMEOUT
    01a523b071618abbc634d1958229fe3bd2dfa5fa tcp: create a helper to model exponential backoff
    c7d13c8faa74f4e8ef191f88a252cefab6805b38 tcp: properly track retry time on passive Fast Open
    7ae189759cc48cf8b54beebff566e9fd2d4e7d7c tcp: always set retrans_stamp on recovery
    7f12422c4873e9b274bc151ea59cb0cdf9415cf1 tcp: always timestamp on every skb transmission
    
Thank you for pointing this out. I'm taking this patch out of the patchset and will reconsider it.
    


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop
  2019-04-08 17:38       ` Yuchung Cheng
@ 2019-05-04  2:38         ` Lawrence Brakmo
  0 siblings, 0 replies; 14+ messages in thread
From: Lawrence Brakmo @ 2019-05-04  2:38 UTC (permalink / raw)
  To: Yuchung Cheng, Eric Dumazet
  Cc: Neal Cardwell, netdev, Martin Lau, Alexei Starovoitov,
	Daniel Borkmann, Kernel Team


On 4/8/19, 10:39 AM, "Yuchung Cheng" <ycheng@google.com> wrote:

    On Mon, Apr 8, 2019 at 10:07 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
    >
    >
    >
    > On 04/08/2019 09:16 AM, Neal Cardwell wrote:
    > > On Wed, Apr 3, 2019 at 8:13 PM brakmo <brakmo@fb.com> wrote:
    > >>
    > >> When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
    > >> and packets_out is 0, it is beneficial to set a small probe timer.
    > >> Otherwise, the throughput for the flow can suffer because it may need to
    > >> depend on the probe timer to start sending again. The default value for
    > >> the probe timer is at least 200ms; this patch sets it to 20ms when a
    > >> packet is dropped and there are no other packets in flight.
    > >>
    > >> This patch introduces a new sysctl, sysctl_tcp_probe_on_drop_ms, that is
    > >> used to specify the duration of the probe timer for the case described
    > >> earlier. The allowed values are between 0 and TCP_RTO_MIN. A value of 0
    > >> disables setting the probe timer with a small value.
    >
    > This seems to contradict our recent work ?
    >
    > See recent Yuchung patch series :
    >
    > c1d5674f8313b9f8e683c265f1c00a2582cf5fc5 tcp: less aggressive window probing on local congestion
    > 590d2026d62418bb27de9ca87526e9131c1f48af tcp: retry more conservatively on local congestion
    
    I would appreciate it if a direct change to the TCP stack started with a
    "tcp:" subject prefix instead of "bpf:", which is confusing for TCP
    developers.

You are right, thank you for pointing this out.
    
    A packet being dropped at the local layer is a sign of severe congestion;
    it is caused by the application bursting on many (idle or new)
    connections. With this patch, the (many) connections that fail on the
    first try (including SYNs and pure ACKs) will all come back at 20ms,
    instead of at their RTT-adjusted RTOs. So the end effect is the
    application repetitively pounding the local qdisc to squeeze
    out some performance.
    
    This patch seems to apply to a special case where local congestion
    only lasts for a very short period. I don't think it applies well as a
    general principle for congestion control.
    
I am taking this out of the patchset and will revisit it later.
    
    
    > 9721e709fa68ef9b860c322b474cfbd1f8285b0f tcp: simplify window probe aborting on USER_TIMEOUT
    > 01a523b071618abbc634d1958229fe3bd2dfa5fa tcp: create a helper to model exponential backoff
    > c7d13c8faa74f4e8ef191f88a252cefab6805b38 tcp: properly track retry time on passive Fast Open
    > 7ae189759cc48cf8b54beebff566e9fd2d4e7d7c tcp: always set retrans_stamp on recovery
    > 7f12422c4873e9b274bc151ea59cb0cdf9415cf1 tcp: always timestamp on every skb transmission
    >
    


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-05-04  2:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-04  0:12 [PATCH v2 bpf-next 0/7] bpf: Propagate cn to TCP brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 1/7] bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 2/7] bpf: cgroup inet skb programs can return 0 to 3 brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 3/7] bpf: Update __cgroup_bpf_run_filter_skb with cn brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 4/7] bpf: Update BPF_CGROUP_RUN_PROG_INET_EGRESS calls brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 5/7] bpf: sysctl for probe_on_drop brakmo
2019-04-08 16:16   ` Neal Cardwell
2019-04-08 17:06     ` Eric Dumazet
2019-04-08 17:38       ` Yuchung Cheng
2019-05-04  2:38         ` Lawrence Brakmo
2019-05-04  2:35       ` Lawrence Brakmo
2019-05-04  2:33     ` Lawrence Brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 6/7] bpf: Add cn support to hbm_out_kern.c brakmo
2019-04-04  0:12 ` [PATCH v2 bpf-next 7/7] bpf: Add more stats to HBM brakmo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.