* [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure
@ 2022-05-06 19:43 Peilin Ye
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 1/4] net: Introduce " Peilin Ye
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-06 19:43 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko
  Cc: Peilin Ye, netdev, linux-kernel, Cong Wang, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Hi all,

Currently sockets (especially UDP ones) can drop a lot of skbs at TC
egress when rate limited by shaper Qdiscs like HTB.  This experimental
patchset tries to improve this by introducing a backpressure mechanism, so
that sockets are temporarily throttled when they "send too much".

For now it takes care of TBF, HTB and CBQ, for UDP and TCP sockets.  Any
comments, suggestions would be much appreciated.  Thanks!

Contents
     [I] Background
    [II] Design Overview
   [III] Performance Numbers & Issues
    [IV] Synchronization
     [V] Test Setup

______________
[I] Background

Imagine 2 UDP iperf2 clients, both wishing to send at 1 Gbit/sec from veth0.
veth0's egress root Qdisc is a TBF Qdisc, with a configured rate of 200
Mbits/sec.  I tested this setup [V]:

[  3] 10.0-10.5 sec  25.9 MBytes   434 Mbits/sec
[  3] 10.0-10.5 sec  22.0 MBytes   369 Mbits/sec
[  3] 10.5-11.0 sec  24.3 MBytes   407 Mbits/sec
[  3] 10.5-11.0 sec  21.7 MBytes   363 Mbits/sec
<...>                              ^^^^^^^^^^^^^

[  3]  0.0-30.0 sec   358 MBytes   100 Mbits/sec   0.030 ms 702548/958104 (73%)
[  3]  0.0-30.0 sec   347 MBytes  96.9 Mbits/sec   0.136 ms 653610/900810 (73%)
                                                            ^^^^^^^^^^^^^ ^^^^^

On average, both clients try to send at 389.82 Mbits/sec.  TBF drops 74.7%
of the traffic, in order to keep veth0's egress rate (196.93 Mbits/sec)
under the configured 200 Mbits/sec.
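(Derived from the rates above: together the clients attempt 2 * 389.82 =
779.64 Mbits/sec, and 1 - 196.93/779.64 is roughly 74.7% of that traffic
which never leaves veth0.)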

Why is this happening?  Consider sk->sk_wmem_alloc, the number of bytes of
skbs that a socket has currently "committed" to the "transmit queue":

         ┌─────┐         ┌───────────┐     ┌──────────────┐
    ─ ─ >│ UDP │─ ... ─ >│ TBF Qdisc │─ ─ >│ device queue │─ ┬ >
         └───┬─┘         └───────────┘     └──────────────┘ [b]
            [a]

Normally, sk_wmem_alloc is increased right before an skb leaves UDP [a],
and decreased when an skb is consumed upon TX completion [b].
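
For reference, [a] and [b] roughly correspond to the existing helpers
skb_set_owner_w() and sock_wfree(); a simplified sketch (not part of this
patchset):

    /* [a] skb_set_owner_w(skb, sk), as the skb leaves the socket: */
    skb->sk = sk;
    skb->destructor = sock_wfree;
    refcount_add(skb->truesize, &sk->sk_wmem_alloc);

    /* [b] sock_wfree(), the skb destructor, on TX completion:
     *     subtracts skb->truesize from sk->sk_wmem_alloc (via
     *     refcount_sub_and_test()) and calls sk->sk_write_space(sk).
     */

If a Qdisc drops the skb instead, kfree_skb() runs the same destructor
immediately, which is case [c] below.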

However, when TBF becomes full and starts to drop skbs (assuming a
simple taildrop inner Qdisc like bfifo for brevity):

         ┌─────┐         ┌───────────┐     ┌──────────────┐
    ─ ─ >│ UDP │─ ... ─ >│ TBF Qdisc │─ ─ >│ device queue │─ ┬ >
         └───┬─┘         ├───────────┘     └──────────────┘ [b]
            [a]         [c]

For those dropped skbs, sk_wmem_alloc is decreased right before TBF [c].
This is problematic: since the (a,c) "interval" does not cover TBF,
whenever UDP starts to send faster, TBF simply drops faster, keeping
sk_wmem_alloc balanced, and tricking UDP into thinking "it is okay to send
more".

A similar thing happens with other shapers as well.  TCP behaves much better
than UDP, since TCP slows itself down when TC egress fails to enqueue an
skb.

____________________
[II] Design Overview

Hacking sk_wmem_alloc turned out to be tricky.  Instead, introduce the
following state machine for sockets:

  ┌────────────────┐  [d]  ┌────────────────┐  [e]  ┌────────────────┐
  │ SK_UNTHROTTLED │─ ─ ─ >│  SK_OVERLIMIT  │─ ─ ─ >│  SK_THROTTLED  │
  └────────────────┘       └────────────────┘       └────────────────┘
           └─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ < ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘
                                   [f]

Take TBF as an example:

  [d] qdisc_backpressure_overlimit()
      When TBF fails to enqueue an skb belonging to an UNTHROTTLED socket,
      the socket is marked as OVERLIMIT, and added to TBF's
      "backpressure_list";

  [e] qdisc_backpressure_throttle()
      Later, when TBF runs out of tokens, it marks all OVERLIMIT sockets
      on its backpressure_list as THROTTLED, and schedules a Qdisc
      watchdog to wait for more tokens;

    * TCP and UDP sleep on THROTTLED sockets
    * epoll() and friends should not report EPOLL{OUT,WRNORM} for
      THROTTLED sockets

  [f] qdisc_backpressure_unthrottle()
      When the timer expires, all THROTTLED sockets are removed from TBF's
      backpressure_list, and marked back as UNTHROTTLED.

    * Also call ->sk_write_space(), so that all TCP and UDP waiters are
      woken up, and all SOCKWQ_ASYNC_NOSPACE subscribers are notified

This should work for all Qdisc watchdog-based shapers (though, as pointed out
by Cong, ETF seems to be a special case; I haven't looked into it yet).

(As discussed with Cong, a slightly different design would be: only
 mark OVERLIMIT sockets as THROTTLED when:
 
 1. TBF is full, AND
 2. TBF is out of tokens i.e. Qdisc watchdog timer is active

 This approach seems to have a slightly lower drop rate under heavy load,
 at least for TBF.  I'm still working on it.)
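
A rough sketch of that variant, in tbf_dequeue() terms (illustration only;
tbf_child_is_full() is a hypothetical helper standing in for "the inner
Qdisc is full", and the real implementation may place the check differently):

    if (tbf_child_is_full(q))                  /* 1. TBF is full, AND          */
            qdisc_backpressure_throttle(sch);  /* 2. out of tokens, since we   */
                                               /*    are about to arm the      */
                                               /*    watchdog below            */
    qdisc_watchdog_schedule_ns(&q->watchdog,
                               now + max_t(long, -toks, -ptoks));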

__________________________________
[III] Performance Numbers & Issues

I tested the same setup [V] after applying this patchset:

[  3] 10.0-10.5 sec  6.29 MBytes   106 Mbits/sec
[  3] 10.0-10.5 sec  5.84 MBytes  98.0 Mbits/sec
[  3] 10.5-11.0 sec  6.31 MBytes   106 Mbits/sec
[  3] 10.5-11.0 sec  6.01 MBytes   101 Mbits/sec
<...>                              ^^^^^^^^^^^^^

[  3]  0.0-30.0 sec   358 MBytes   100 Mbits/sec  62.444 ms 8500/263825 (3.2%)
[  3]  0.0-30.0 sec   357 MBytes  99.9 Mbits/sec   0.217 ms 8411/263310 (3.2%)
                                                            ^^^^^^^^^^^ ^^^^^^

On average, drop rate decreased from 74.7% to 3.2%.  No significant
effect on throughput (196.93 Mbits/sec becomes 197.46 Mbits/sec).

However, drop rate starts to increase when we add more iperf2 clients.
For example, 3 clients (also UDP -b 1G):

[  3]  0.0-30.0 sec   232 MBytes  65.0 Mbits/sec   0.092 ms 104961/270765 (39%)
[  3]  0.0-30.0 sec   232 MBytes  64.8 Mbits/sec   0.650 ms 102493/267769 (38%)
[  3]  0.0-30.1 sec   239 MBytes  66.8 Mbits/sec   0.045 ms 99234/269987 (37%)
                                                            ^^^^^^^^^^^^ ^^^^^

38% of the traffic is dropped.  This is still a lot better than the current
behavior without this patchset (87%), but not ideal.  There is a thundering
herd problem: when the Qdisc
watchdog timer expires, we wake up all THROTTLED sockets at once in [f],
so all of them just resume sending (and dropping...).  We probably need a
better algorithm here, please advise, thanks!

One "workaround" is making TBF's queue larger (the "limit" parameter).  In
the above 3-client example, raising "limit" from 200k to 300k decreases
drop rate from 38% back to 3.1%.  Without this patchset, it requires about
a 400k "limit" for the same setup to drop less than 5% of the traffic.
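
For reference, that change amounts to re-running the Test Setup command from
[V] with a larger "limit" (all other values unchanged):

    tc qdisc replace dev veth0 handle 1: root \
        tbf rate 200mbit burst 20kb limit 300k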

____________________
[IV] Synchronization

1. For each device, all backpressure_list operations are serialized by
   Qdisc root_lock:

   [d] and [e] happen in the shaper's ->enqueue() and ->dequeue()
   respectively; in both cases we hold root_lock.

   [f] happens in TX softirq, right after grabbing root_lock.  Scheduling
   the Qdisc watchdog is not the only way to wait for more tokens; see
   htb_work_func() for an example.  However, all of them end up raising
   the TX softirq, so [f] runs there.

2. Additionally, we should prevent 2 CPUs from trying to add the same
   socket to 2 different backpressure_lists (on 2 different devices).  Use
   memory barriers to make sure this cannot happen.  Please see the [1/4]
   commit message for details.

______________
[V] Test Setup

    # setup network namespace
    ip netns add red
    ip link add name veth0 type veth peer name veth1
    ip link set veth1 netns red
    ip addr add 10.0.0.1/24 dev veth0
    ip -n red addr add 10.0.0.2/24 dev veth1
    ip link set veth0 up
    ip -n red link set veth1 up

    tc qdisc replace dev veth0 handle 1: root \
        tbf rate 200mbit burst 20kb limit 200k

    # servers
    ip netns exec red iperf -u -s -p 5555 -i 0.5 -t 1000 &
    ip netns exec red iperf -u -s -p 6666 -i 0.5 -t 1000 &

    # clients
    iperf -u -c 10.0.0.2 -p 5555 -i 0.5 -t 30 -b 1G &
    iperf -u -c 10.0.0.2 -p 6666 -i 0.5 -t 30 -b 1G &

Thanks,
Peilin Ye (4):
  net: Introduce Qdisc backpressure infrastructure
  net/sched: sch_tbf: Use Qdisc backpressure infrastructure
  net/sched: sch_htb: Use Qdisc backpressure infrastructure
  net/sched: sch_cbq: Use Qdisc backpressure infrastructure

 include/net/sch_generic.h | 43 +++++++++++++++++++++++++++++++++++++++
 include/net/sock.h        | 18 +++++++++++++++-
 net/core/dev.c            |  1 +
 net/core/sock.c           |  6 ++++--
 net/ipv4/tcp_ipv4.c       | 11 +++++++---
 net/sched/sch_cbq.c       |  6 +++++-
 net/sched/sch_generic.c   |  4 ++++
 net/sched/sch_htb.c       |  5 +++++
 net/sched/sch_tbf.c       |  2 ++
 9 files changed, 89 insertions(+), 7 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH RFC v1 net-next 1/4] net: Introduce Qdisc backpressure infrastructure
  2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
@ 2022-05-06 19:44 ` Peilin Ye
  2022-05-06 20:31   ` Stephen Hemminger
  2022-05-09  7:53   ` Dave Taht
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 2/4] net/sched: sch_tbf: Use " Peilin Ye
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-06 19:44 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko
  Cc: Peilin Ye, netdev, linux-kernel, Cong Wang, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Currently sockets (especially UDP ones) can drop a lot of traffic at TC
egress when rate limited by shaper Qdiscs like HTB.  Improve this by
implementing the following state machine for sockets (currently only UDP
and TCP sockets are supported):

  ┌────────────────┐  [a]  ┌────────────────┐  [b]  ┌────────────────┐
  │ SK_UNTHROTTLED │─ ─ ─ >│  SK_OVERLIMIT  │─ ─ ─ >│  SK_THROTTLED  │
  └────────────────┘       └────────────────┘       └────────────────┘
           └─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ < ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─┘
                                   [c]

Take TBF as an example:

  [a] When TBF's inner Qdisc (e.g. bfifo) becomes full, TBF fails to
      enqueue an skb belonging to UNTHROTTLED socket A.  Socket A is then
      marked as OVERLIMIT, and added to TBF's "backpressure_list";

  [b] When TBF runs out of tokens, it marks all OVERLIMIT sockets
      (including A) on its backpressure_list as THROTTLED, and schedules
      a Qdisc watchdog timer to wait for more tokens;

  [c] After the timer expires, all THROTTLED sockets (including A) are
      removed from TBF's backpressure_list, and marked as UNTHROTTLED.

UDP and TCP sleep on THROTTLED sockets in sock_wait_for_wmem() and
sk_stream_wait_memory() respectively.  epoll() and friends should not
report EPOLL{OUT,WRNORM} for THROTTLED sockets.  When unthrottling in [c],
call ->sk_write_space() to wake up UDP and/or TCP waiters, and notify
SOCKWQ_ASYNC_NOSPACE subscribers.

For each device, backpressure_list operations are always serialized by
Qdisc root_lock.

When marking a socket as OVERLIMIT in [a], use a cmpxchg() to make sure
that multiple CPUs do not try to add this socket to different
backpressure_lists (on different devices) concurrently.

After removing a THROTTLED socket from backpressure_list in [c], use a
smp_store_release() to make sure changes have been committed to memory
before marking the socket as UNTHROTTLED.

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 include/net/sch_generic.h | 43 +++++++++++++++++++++++++++++++++++++++
 include/net/sock.h        | 18 +++++++++++++++-
 net/core/dev.c            |  1 +
 net/core/sock.c           |  6 ++++--
 net/ipv4/tcp_ipv4.c       | 11 +++++++---
 net/sched/sch_generic.c   |  4 ++++
 6 files changed, 77 insertions(+), 6 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9bab396c1f3b..5ddbe0b65cb6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 #include <net/flow_offload.h>
+#include <net/sock.h>
 
 struct Qdisc_ops;
 struct qdisc_walker;
@@ -108,6 +109,7 @@ struct Qdisc {
 	struct gnet_stats_queue	__percpu *cpu_qstats;
 	int			pad;
 	refcount_t		refcnt;
+	struct list_head	backpressure_list;
 
 	/*
 	 * For performance sake on SMP, we put highly modified fields at the end
@@ -1221,6 +1223,47 @@ static inline int qdisc_drop_all(struct sk_buff *skb, struct Qdisc *sch,
 	return NET_XMIT_DROP;
 }
 
+static inline void qdisc_backpressure_overlimit(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	if (!sk || !sk_fullsock(sk))
+		return;
+
+	if (cmpxchg(&sk->sk_backpressure_status, SK_UNTHROTTLED, SK_OVERLIMIT) == SK_UNTHROTTLED) {
+		sock_hold(sk);
+		list_add_tail(&sk->sk_backpressure_node, &sch->backpressure_list);
+	}
+}
+
+static inline void qdisc_backpressure_throttle(struct Qdisc *sch)
+{
+	struct list_head *pos;
+	struct sock *sk;
+
+	list_for_each(pos, &sch->backpressure_list) {
+		sk = list_entry(pos, struct sock, sk_backpressure_node);
+
+		WRITE_ONCE(sk->sk_backpressure_status, SK_THROTTLED);
+	}
+}
+
+static inline void qdisc_backpressure_unthrottle(struct Qdisc *sch)
+{
+	struct list_head *pos, *next;
+	struct sock *sk;
+
+	list_for_each_safe(pos, next, &sch->backpressure_list) {
+		sk = list_entry(pos, struct sock, sk_backpressure_node);
+
+		list_del_init(pos);
+		smp_store_release(&sk->sk_backpressure_status, SK_UNTHROTTLED);
+		sk->sk_write_space(sk);
+
+		sock_put(sk);
+	}
+}
+
 /* Length to Time (L2T) lookup in a qdisc_rate_table, to determine how
    long it will take to send a packet given its size.
  */
diff --git a/include/net/sock.h b/include/net/sock.h
index 73063c88a249..6ed2de43dc98 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -315,6 +315,8 @@ struct sk_filter;
   *	@sk_rcvtimeo: %SO_RCVTIMEO setting
   *	@sk_sndtimeo: %SO_SNDTIMEO setting
   *	@sk_txhash: computed flow hash for use on transmit
+  *	@sk_backpressure_status: Qdisc backpressure status
+  *	@sk_backpressure_node: linkage for Qdisc backpressure list
   *	@sk_txrehash: enable TX hash rethink
   *	@sk_filter: socket filtering instructions
   *	@sk_timer: sock cleanup timer
@@ -468,6 +470,8 @@ struct sock {
 	unsigned int		sk_gso_max_size;
 	gfp_t			sk_allocation;
 	__u32			sk_txhash;
+	u32			sk_backpressure_status; /* see enum sk_backpressure */
+	struct list_head	sk_backpressure_node;
 
 	/*
 	 * Because of non atomicity rules, all
@@ -548,6 +552,12 @@ enum sk_pacing {
 	SK_PACING_FQ		= 2,
 };
 
+enum sk_backpressure {
+	SK_UNTHROTTLED	= 0,
+	SK_OVERLIMIT	= 1,
+	SK_THROTTLED	= 2,
+};
+
 /* Pointer stored in sk_user_data might not be suitable for copying
  * when cloning the socket. For instance, it can point to a reference
  * counted object. sk_user_data bottom bit is set if pointer must not
@@ -2522,12 +2532,18 @@ static inline struct page_frag *sk_page_frag(struct sock *sk)
 
 bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
 
+static inline bool sk_is_throttled(const struct sock *sk)
+{
+	return READ_ONCE(sk->sk_backpressure_status) == SK_THROTTLED;
+}
+
 /*
  *	Default write policy as shown to user space via poll/select/SIGIO
  */
 static inline bool sock_writeable(const struct sock *sk)
 {
-	return refcount_read(&sk->sk_wmem_alloc) < (READ_ONCE(sk->sk_sndbuf) >> 1);
+	return !sk_is_throttled(sk) &&
+	       refcount_read(&sk->sk_wmem_alloc) < (READ_ONCE(sk->sk_sndbuf) >> 1);
 }
 
 static inline gfp_t gfp_any(void)
diff --git a/net/core/dev.c b/net/core/dev.c
index c2d73595a7c3..7c3d136725b9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5013,6 +5013,7 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 			if (!(q->flags & TCQ_F_NOLOCK)) {
 				root_lock = qdisc_lock(q);
 				spin_lock(root_lock);
+				qdisc_backpressure_unthrottle(q);
 			} else if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED,
 						     &q->state))) {
 				/* There is a synchronize_net() between
diff --git a/net/core/sock.c b/net/core/sock.c
index be20a1af20e5..7ed9d2bd991f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2034,6 +2034,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 
 		sock_net_set(sk, net);
 		refcount_set(&sk->sk_wmem_alloc, 1);
+		INIT_LIST_HEAD(&sk->sk_backpressure_node);
 
 		mem_cgroup_sk_alloc(sk);
 		cgroup_sk_alloc(&sk->sk_cgrp_data);
@@ -2589,7 +2590,8 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
 			break;
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
-		if (refcount_read(&sk->sk_wmem_alloc) < READ_ONCE(sk->sk_sndbuf))
+		if (!sk_is_throttled(sk) &&
+		    refcount_read(&sk->sk_wmem_alloc) < READ_ONCE(sk->sk_sndbuf))
 			break;
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			break;
@@ -2624,7 +2626,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			goto failure;
 
-		if (sk_wmem_alloc_get(sk) < READ_ONCE(sk->sk_sndbuf))
+		if (!sk_is_throttled(sk) && sk_wmem_alloc_get(sk) < READ_ONCE(sk->sk_sndbuf))
 			break;
 
 		sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 918816ec5dd4..6e905995bfa2 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3006,9 +3006,14 @@ void tcp4_proc_exit(void)
  */
 bool tcp_stream_memory_free(const struct sock *sk, int wake)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 notsent_bytes = READ_ONCE(tp->write_seq) -
-			    READ_ONCE(tp->snd_nxt);
+	const struct tcp_sock *tp;
+	u32 notsent_bytes;
+
+	if (sk_is_throttled(sk))
+		return false;
+
+	tp = tcp_sk(sk);
+	notsent_bytes = READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_nxt);
 
 	return (notsent_bytes << wake) < tcp_notsent_lowat(tp);
 }
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index dba0b3e24af5..9ab314b874a7 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -674,6 +674,7 @@ struct Qdisc noop_qdisc = {
 		.qlen = 0,
 		.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
 	},
+	.backpressure_list =	LIST_HEAD_INIT(noop_qdisc.backpressure_list),
 };
 EXPORT_SYMBOL(noop_qdisc);
 
@@ -947,6 +948,7 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	qdisc_skb_head_init(&sch->q);
 	gnet_stats_basic_sync_init(&sch->bstats);
 	spin_lock_init(&sch->q.lock);
+	INIT_LIST_HEAD(&sch->backpressure_list);
 
 	if (ops->static_flags & TCQ_F_CPUSTATS) {
 		sch->cpu_bstats =
@@ -1025,6 +1027,8 @@ void qdisc_reset(struct Qdisc *qdisc)
 	if (ops->reset)
 		ops->reset(qdisc);
 
+	qdisc_backpressure_unthrottle(qdisc);
+
 	__skb_queue_purge(&qdisc->gso_skb);
 	__skb_queue_purge(&qdisc->skb_bad_txq);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v1 net-next 2/4] net/sched: sch_tbf: Use Qdisc backpressure infrastructure
  2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 1/4] net: Introduce " Peilin Ye
@ 2022-05-06 19:44 ` Peilin Ye
  2022-05-06 19:45 ` [PATCH RFC v1 net-next 3/4] net/sched: sch_htb: " Peilin Ye
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-06 19:44 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko
  Cc: Peilin Ye, netdev, linux-kernel, Cong Wang, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we introduced a Qdisc backpressure infrastructure for TCP and
UDP sockets.  Use it in TBF.

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/sched/sch_tbf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 72102277449e..06229765290b 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -250,6 +250,7 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 	ret = qdisc_enqueue(skb, q->qdisc, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
+		qdisc_backpressure_overlimit(sch, skb);
 		if (net_xmit_drop_count(ret))
 			qdisc_qstats_drop(sch);
 		return ret;
@@ -306,6 +307,7 @@ static struct sk_buff *tbf_dequeue(struct Qdisc *sch)
 			return skb;
 		}
 
+		qdisc_backpressure_throttle(sch);
 		qdisc_watchdog_schedule_ns(&q->watchdog,
 					   now + max_t(long, -toks, -ptoks));
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v1 net-next 3/4] net/sched: sch_htb: Use Qdisc backpressure infrastructure
  2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 1/4] net: Introduce " Peilin Ye
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 2/4] net/sched: sch_tbf: Use " Peilin Ye
@ 2022-05-06 19:45 ` Peilin Ye
  2022-05-06 19:45 ` [PATCH RFC v1 net-next 4/4] net/sched: sch_cbq: " Peilin Ye
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-06 19:45 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko
  Cc: Peilin Ye, netdev, linux-kernel, Cong Wang, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we introduced a Qdisc backpressure infrastructure for TCP and
UDP sockets.  Use it in HTB.

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/sched/sch_htb.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 23a9d6242429..21d78dff08e7 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -623,6 +623,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			__qdisc_enqueue_tail(skb, &q->direct_queue);
 			q->direct_pkts++;
 		} else {
+			qdisc_backpressure_overlimit(sch, skb);
 			return qdisc_drop(skb, sch, to_free);
 		}
 #ifdef CONFIG_NET_CLS_ACT
@@ -634,6 +635,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 #endif
 	} else if ((ret = qdisc_enqueue(skb, cl->leaf.q,
 					to_free)) != NET_XMIT_SUCCESS) {
+		qdisc_backpressure_overlimit(sch, skb);
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
 			cl->drops++;
@@ -978,6 +980,9 @@ static struct sk_buff *htb_dequeue(struct Qdisc *sch)
 				goto ok;
 		}
 	}
+
+	qdisc_backpressure_throttle(sch);
+
 	if (likely(next_event > q->now))
 		qdisc_watchdog_schedule_ns(&q->watchdog, next_event);
 	else
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v1 net-next 4/4] net/sched: sch_cbq: Use Qdisc backpressure infrastructure
  2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
                   ` (2 preceding siblings ...)
  2022-05-06 19:45 ` [PATCH RFC v1 net-next 3/4] net/sched: sch_htb: " Peilin Ye
@ 2022-05-06 19:45 ` Peilin Ye
  2022-05-10  3:26 ` [PATCH RFC v1 net-next 0/4] net: " Eric Dumazet
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
  5 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-06 19:45 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko
  Cc: Peilin Ye, netdev, linux-kernel, Cong Wang, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we introduced a Qdisc backpressure infrastructure for TCP and
UDP sockets.  Use it in CBQ.

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/sched/sch_cbq.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 02d9f0dfe356..4a5204da49d0 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -382,6 +382,7 @@ cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		return ret;
 	}
 
+	qdisc_backpressure_overlimit(sch, skb);
 	if (net_xmit_drop_count(ret)) {
 		qdisc_qstats_drop(sch);
 		cbq_mark_toplevel(q, cl);
@@ -509,6 +510,7 @@ static enum hrtimer_restart cbq_undelay(struct hrtimer *timer)
 
 		time = 0;
 		time = ktime_add_ns(time, PSCHED_TICKS2NS(now + delay));
+		qdisc_backpressure_throttle(sch);
 		hrtimer_start(&q->delay_timer, time, HRTIMER_MODE_ABS_PINNED);
 	}
 
@@ -851,9 +853,11 @@ cbq_dequeue(struct Qdisc *sch)
 
 	if (sch->q.qlen) {
 		qdisc_qstats_overlimit(sch);
-		if (q->wd_expires)
+		if (q->wd_expires) {
+			qdisc_backpressure_throttle(sch);
 			qdisc_watchdog_schedule(&q->watchdog,
 						now + q->wd_expires);
+		}
 	}
 	return NULL;
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 1/4] net: Introduce Qdisc backpressure infrastructure
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 1/4] net: Introduce " Peilin Ye
@ 2022-05-06 20:31   ` Stephen Hemminger
  2022-05-06 23:34     ` Peilin Ye
  2022-05-09  7:53   ` Dave Taht
  1 sibling, 1 reply; 28+ messages in thread
From: Stephen Hemminger @ 2022-05-06 20:31 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Peilin Ye, netdev, linux-kernel, Cong Wang

On Fri,  6 May 2022 12:44:22 -0700
Peilin Ye <yepeilin.cs@gmail.com> wrote:

> +static inline void qdisc_backpressure_overlimit(struct Qdisc *sch, struct sk_buff *skb)
> +{
> +	struct sock *sk = skb->sk;
> +
> +	if (!sk || !sk_fullsock(sk))
> +		return;
> +
> +	if (cmpxchg(&sk->sk_backpressure_status, SK_UNTHROTTLED, SK_OVERLIMIT) == SK_UNTHROTTLED) {
> +		sock_hold(sk);
> +		list_add_tail(&sk->sk_backpressure_node, &sch->backpressure_list);
> +	}
> +}

What if the socket is closed? You are holding a reference, but the application may be gone.

Or if output is stalled indefinitely?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 1/4] net: Introduce Qdisc backpressure infrastructure
  2022-05-06 20:31   ` Stephen Hemminger
@ 2022-05-06 23:34     ` Peilin Ye
  0 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-06 23:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Peilin Ye, netdev, linux-kernel, Cong Wang

Hi Stephen,

On Fri, May 06, 2022 at 01:31:11PM -0700, Stephen Hemminger wrote:
> On Fri,  6 May 2022 12:44:22 -0700, Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > +static inline void qdisc_backpressure_overlimit(struct Qdisc *sch, struct sk_buff *skb)
> > +{
> > +	struct sock *sk = skb->sk;
> > +
> > +	if (!sk || !sk_fullsock(sk))
> > +		return;
> > +
> > +	if (cmpxchg(&sk->sk_backpressure_status, SK_UNTHROTTLED, SK_OVERLIMIT) == SK_UNTHROTTLED) {
> > +		sock_hold(sk);
> > +		list_add_tail(&sk->sk_backpressure_node, &sch->backpressure_list);
> > +	}
> > +}
> 
> What if the socket is closed? You are holding a reference, but the application may be gone.

Thanks for pointing this out!  I just understood how sk_refcnt works
together with sk_wmem_alloc.

By the time we process this in-flight skb, sk_refcnt may have already
reached 0, which means sk_free() may have already decreased that "extra" 1
sk_wmem_alloc, so skb->destructor() may call __sk_free() while I "hold"
the sock here.  Seems like a UAF.

> Or if output is stalled indefinitely?

It would be better to do a cleanup in sock destroying code, but I am
trying to avoid acquiring Qdisc root_lock there.  I will try to come up
with a better solution.

Thanks,
Peilin Ye



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 1/4] net: Introduce Qdisc backpressure infrastructure
  2022-05-06 19:44 ` [PATCH RFC v1 net-next 1/4] net: Introduce " Peilin Ye
  2022-05-06 20:31   ` Stephen Hemminger
@ 2022-05-09  7:53   ` Dave Taht
  2022-05-10  2:23     ` Peilin Ye
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Taht @ 2022-05-09  7:53 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Peilin Ye, Linux Kernel Network Developers, LKML,
	Cong Wang

I am very pleased to see this work.

However,  my "vision" such as it was, and as misguided as it might be,
was to implement a facility similar to tcp_notsent_lowat for udp
packets, tracking the progress of the udp packet through the kernel,
and supplying backpressure and providing better information about
where when and why the packet was dropped in the stack back to the
application.

I've been really impressed by the DROP_REASON work and had had no clue
prior to seeing all that instrumentation, where else packets might be
dropped in the kernel.

I'd be interested to see what happens with sch_cake.

-- 
FQ World Domination pending: https://blog.cerowrt.org/post/state_of_fq_codel/
Dave Täht CEO, TekLibre, LLC

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 1/4] net: Introduce Qdisc backpressure infrastructure
  2022-05-09  7:53   ` Dave Taht
@ 2022-05-10  2:23     ` Peilin Ye
  0 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-10  2:23 UTC (permalink / raw)
  To: Dave Taht
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Peilin Ye, Linux Kernel Network Developers, LKML,
	Cong Wang

Hi Dave,

On Mon, May 09, 2022 at 12:53:28AM -0700, Dave Taht wrote:
> I am very pleased to see this work.

Thanks!

> However,  my "vision" such as it was, and as misguided as it might be,
> was to implement a facility similar to tcp_notsent_lowat for udp
> packets, tracking the progress of the udp packet through the kernel,
> and supplying backpressure and providing better information about
> where when and why the packet was dropped in the stack back to the
> application.

By "a facility similar to tcp_notsent_lowat", do you mean a smaller
sk_sndbuf, or "UDP Small Queues"?

I don't fully understand the implications of using a smaller sk_sndbuf
yet, but I think it can work together with this RFC.

sk_sndbuf is a per-socket attribute, while this RFC tries to improve things
from the Qdisc's perspective.  Using a smaller sk_sndbuf alone does not
prevent the "when UDP sends faster, TBF simply drops faster" issue
(described in [I] of the cover letter) from happening.  There's always a
point where there are too many sockets, so TBF's queue can no longer contain
"sk_sndbuf times the number of sockets" (roughly speaking) bytes of skbs.
After that point, TBF suddenly starts to drop a lot.
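(Rough arithmetic for this setup: 15 clients * 21299 bytes is about 320 KB of
send buffer in total, versus TBF's roughly 200 KB "limit", so the queue alone
can no longer absorb every sender's buffer.)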

For example, I used the default 212992 sk_sndbuf
(/proc/sys/net/core/wmem_default) in the test setup ([V] in the cover
letter).  Let's make it one tenth as large, 21299.  It works well for
the 2-client setup; zero packets dropped.  However, if we test it with
15 iperf2 clients:

[  3]  0.0-30.0 sec  46.4 MBytes  13.0 Mbits/sec   1.251 ms 89991/123091 (73%)
[  3]  0.0-30.0 sec  46.6 MBytes  13.0 Mbits/sec   2.033 ms 91204/124464 (73%)
[  3]  0.0-30.0 sec  46.5 MBytes  13.0 Mbits/sec   0.504 ms 89879/123054 (73%)
<...>                                                       ^^^^^^^^^^^^ ^^^^^

73% drop rate again.  Now apply this RFC:

[  3]  0.0-30.0 sec  46.3 MBytes  12.9 Mbits/sec   1.206 ms  807/33839 (2.4%)
[  3]  0.0-30.0 sec  45.5 MBytes  12.7 Mbits/sec   1.919 ms  839/33283 (2.5%)
[  3]  0.0-30.0 sec  45.8 MBytes  12.8 Mbits/sec   2.521 ms  837/33508 (2.5%)
<...>                                                        ^^^^^^^^^ ^^^^^^

Down to 3% again.

Next, same 21299 sk_sndbuf, 20 iperf2 clients, without RFC:

[  3]  0.0-30.0 sec  34.5 MBytes  9.66 Mbits/sec   1.054 ms 258703/283342 (91%)
[  3]  0.0-30.0 sec  34.5 MBytes  9.66 Mbits/sec   1.033 ms 257324/281964 (91%)
[  3]  0.0-30.0 sec  34.5 MBytes  9.66 Mbits/sec   1.116 ms 257858/282500 (91%)
<...>                                                       ^^^^^^^^^^^^^ ^^^^^

91% drop rate.  Finally, apply RFC:

[  3]  0.0-30.0 sec  34.4 MBytes  9.61 Mbits/sec   0.974 ms 7982/32503 (25%)
[  3]  0.0-30.0 sec  34.1 MBytes  9.54 Mbits/sec   1.381 ms 7394/31732 (23%)
[  3]  0.0-30.0 sec  34.3 MBytes  9.58 Mbits/sec   2.431 ms 8149/32583 (25%)
<...>                                                       ^^^^^^^^^^ ^^^^^

The thundering herd problem ([III] in the cover letter) surfaces, but this
is still an improvement.

In conclusion, assuming we are going to use a smaller sk_sndbuf or "UDP
Small Queues", I think that does not replace this RFC, and vice versa.

> I've been really impressed by the DROP_REASON work and had had no clue
> prior to seeing all that instrumentation, where else packets might be
> dropped in the kernel.
> 
> I'd be interested to see what happens with sch_cake.

Sure, I will cover sch_cake in v2.

Thanks,
Peilin Ye


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure
  2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
                   ` (3 preceding siblings ...)
  2022-05-06 19:45 ` [PATCH RFC v1 net-next 4/4] net/sched: sch_cbq: " Peilin Ye
@ 2022-05-10  3:26 ` Eric Dumazet
  2022-05-10 23:03   ` Peilin Ye
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
  5 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2022-05-10  3:26 UTC (permalink / raw)
  To: Peilin Ye, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko
  Cc: Peilin Ye, netdev, linux-kernel, Cong Wang


On 5/6/22 12:43, Peilin Ye wrote:
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> Hi all,
>
> Currently sockets (especially UDP ones) can drop a lot of skbs at TC
> egress when rate limited by shaper Qdiscs like HTB.  This experimental
> patchset tries to improve this by introducing a backpressure mechanism, so
> that sockets are temporarily throttled when they "send too much".
>
> For now it takes care of TBF, HTB and CBQ, for UDP and TCP sockets.  Any
> comments, suggestions would be much appreciated.  Thanks!


This very much looks like trying to solve an old problem to me.

If you were using the EDT model, a simple eBPF program could get rid of the
HTB/TBF qdisc, and you could use MQ+FQ as the packet schedulers, with true
multiqueue sharding.

FQ provides fairness, so a flow can no longer consume the entire qdisc limit.

(If your UDP sockets send packets all over the place (not connected sockets),
then the eBPF program can also be used to rate limit them.)


> [... rest of the quoted cover letter trimmed; see the original posting above ...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure
  2022-05-10  3:26 ` [PATCH RFC v1 net-next 0/4] net: " Eric Dumazet
@ 2022-05-10 23:03   ` Peilin Ye
  2022-05-10 23:27     ` Peilin Ye
  0 siblings, 1 reply; 28+ messages in thread
From: Peilin Ye @ 2022-05-10 23:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Hideaki YOSHIFUJI,
	David Ahern, Jamal Hadi Salim, Cong Wang, Jiri Pirko, Peilin Ye,
	netdev, linux-kernel, Cong Wang

Hi Eric,

On Mon, May 09, 2022 at 08:26:27PM -0700, Eric Dumazet wrote:
> On 5/6/22 12:43, Peilin Ye wrote:
> > From: Peilin Ye <peilin.ye@bytedance.com>
> > 
> > Hi all,
> > 
> > Currently sockets (especially UDP ones) can drop a lot of skbs at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This experimental
> > patchset tries to improve this by introducing a backpressure mechanism, so
> > that sockets are temporarily throttled when they "send too much".
> > 
> > For now it takes care of TBF, HTB and CBQ, for UDP and TCP sockets.  Any
> > comments, suggestions would be much appreciated.  Thanks!
> 
> This very much looks like trying to solve an old problem to me.
> 
> If you were using EDT model, a simple eBPF program could get rid of the
> HTB/TBF qdisc
> 
> and you could use MQ+FQ as the packet schedulers, with the true multiqueue
> sharding.
> 
> FQ provides fairness, so a flow can not anymore consume all the qdisc limit.

This RFC tries to solve the "when UDP starts to drop (whether because of
per-flow or per-Qdisc limit), it drops a lot" issue described in [I] of
the cover letter; its main goal is not to improve fairness.

> (If your UDP sockets send packets all over the place (not connected
> sockets),
> 
> then the eBPF can also be used to rate limit them)

I was able to reproduce the same issue using EDT: default sch_fq
flow_limit (100 packets), with a 1 Gbit/sec rate limit.  Now if I run
this:

  $ iperf -u -c 1.2.3.4 -p 5678 -l 3K -i 0.5 -t 30 -b 3g

  [ ID] Interval       Transfer     Bandwidth
  [  3]  0.0- 0.5 sec   137 MBytes  2.29 Gbits/sec
  [  3]  0.5- 1.0 sec   142 MBytes  2.38 Gbits/sec
  [  3]  1.0- 1.5 sec   117 MBytes  1.96 Gbits/sec
  [  3]  1.5- 2.0 sec   105 MBytes  1.77 Gbits/sec
  [  3]  2.0- 2.5 sec   132 MBytes  2.22 Gbits/sec
  <...>                             ^^^^^^^^^^^^^^

On average it tries to send 2.31 Gbits per second, dropping 56.71% of
the traffic:

  $ tc -s qdisc show dev eth0
  <...>
  qdisc fq 5: parent 1:4 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 18030 initial_quantum 90150 low_rate_threshold 550Kbit refill_delay 40.0ms
   Sent 16356556 bytes 14159 pkt (dropped 2814461, overlimits 0 requeues 0)
                                  ^^^^^^^^^^^^^^^

This RFC does not cover EDT though, since it does not use Qdisc watchdog
or friends.

Thanks,
Peilin Ye


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure
  2022-05-10 23:03   ` Peilin Ye
@ 2022-05-10 23:27     ` Peilin Ye
  0 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-05-10 23:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Hideaki YOSHIFUJI,
	David Ahern, Jamal Hadi Salim, Cong Wang, Jiri Pirko, Peilin Ye,
	netdev, linux-kernel, Cong Wang

On Tue, May 10, 2022 at 04:03:47PM -0700, Peilin Ye wrote:
> This RFC does not cover EDT though, since it does not use Qdisc watchdog
> or friends.

Sorry, there is a call to qdisc_watchdog_schedule_range_ns() in
sch_fq.c.  I will learn more about how sch_fq is implemented.

Thanks,
Peilin Ye


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
                   ` (4 preceding siblings ...)
  2022-05-10  3:26 ` [PATCH RFC v1 net-next 0/4] net: " Eric Dumazet
@ 2022-08-22  9:10 ` Peilin Ye
  2022-08-22  9:11   ` [PATCH RFC v2 net-next 1/5] net: Introduce " Peilin Ye
                     ` (6 more replies)
  5 siblings, 7 replies; 28+ messages in thread
From: Peilin Ye @ 2022-08-22  9:10 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, netdev, linux-doc, linux-kernel, Cong Wang,
	Stephen Hemminger, Dave Taht, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Hi all,

Currently sockets (especially UDP ones) can drop a lot of packets at TC
egress when rate limited by shaper Qdiscs like HTB.  This patch series
tries to solve this by introducing a Qdisc backpressure mechanism.

RFC v1 [1] used a throttle & unthrottle approach, which introduced several
issues, including a thundering herd problem and a socket reference count
issue [2].  This RFC v2 uses a different approach to avoid those issues:

  1. When a shaper Qdisc drops a packet that belongs to a local socket due
     to TC egress congestion, we make part of the socket's sndbuf
     temporarily unavailable, so it sends slower.
  
  2. Later, when TC egress becomes idle again, we gradually recover the
     socket's sndbuf back to normal.  Patch 2 implements this step using a
     timer for UDP sockets.

The thundering herd problem is avoided, since we no longer wake up all
throttled sockets at the same time in qdisc_watchdog().  The socket
reference count issue is also avoided, since we no longer maintain a socket
list on the Qdisc.

Performance is better than in RFC v1.  There is one concern about fairness
between flows for the TBF Qdisc, which could be addressed by using an SFQ
inner Qdisc.
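
For example, attaching SFQ as TBF's inner Qdisc could look like this (rates
and sizes are illustrative, borrowed from the v1 test setup rather than from
the exact numbers used in patch 2/5):

    tc qdisc replace dev veth0 handle 1: root \
        tbf rate 500mbit burst 20kb limit 200k
    tc qdisc add dev veth0 parent 1:1 handle 2: sfq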

Please see the individual patches for details and numbers.  Any comments,
suggestions would be much appreciated.  Thanks!

[1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
[2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/

Peilin Ye (5):
  net: Introduce Qdisc backpressure infrastructure
  net/udp: Implement Qdisc backpressure algorithm
  net/sched: sch_tbf: Use Qdisc backpressure infrastructure
  net/sched: sch_htb: Use Qdisc backpressure infrastructure
  net/sched: sch_cbq: Use Qdisc backpressure infrastructure

 Documentation/networking/ip-sysctl.rst | 11 ++++
 include/linux/udp.h                    |  3 ++
 include/net/netns/ipv4.h               |  1 +
 include/net/sch_generic.h              | 11 ++++
 include/net/sock.h                     | 21 ++++++++
 include/net/udp.h                      |  1 +
 net/core/sock.c                        |  5 +-
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 net/ipv4/udp.c                         | 69 +++++++++++++++++++++++++-
 net/ipv6/udp.c                         |  2 +-
 net/sched/sch_cbq.c                    |  1 +
 net/sched/sch_htb.c                    |  2 +
 net/sched/sch_tbf.c                    |  2 +
 13 files changed, 132 insertions(+), 4 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH RFC v2 net-next 1/5] net: Introduce Qdisc backpressure infrastructure
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
@ 2022-08-22  9:11   ` Peilin Ye
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 2/5] net/udp: Implement Qdisc backpressure algorithm Peilin Ye
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-08-22  9:11 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, netdev, linux-doc, linux-kernel, Cong Wang,
	Stephen Hemminger, Dave Taht, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Currently sockets (especially UDP ones) can drop a lot of traffic at TC
egress when rate limited by shaper Qdiscs like HTB.  Improve this by
introducing a Qdisc backpressure infrastructure:

  a. A new 'sock struct' field, @sk_overlimits, which keeps track of the
     number of bytes in socket send buffer that are currently
     unavailable due to TC egress congestion.  The size of an overlimit
     socket's "effective" send buffer is represented by @sk_sndbuf minus
     @sk_overlimits, with a lower limit of SOCK_MIN_SNDBUF:

     max(@sk_sndbuf - @sk_overlimits, SOCK_MIN_SNDBUF)

  b. A new (*backpressure) 'struct proto' callback, which is the
     protocol's private algorithm for Qdisc backpressure.

Working together:

  1. When a shaper Qdisc (TBF, HTB, CBQ, etc.) drops a packet that
     belongs to a local socket, it calls qdisc_backpressure().

  2. qdisc_backpressure() eventually invokes the socket protocol's
     (*backpressure) callback, which should increase @sk_overlimits.

  3. The transport layer then sees a smaller "effective" send buffer and
     will send slower.

  4. It is the per-protocol (*backpressure) implementation's
     responsibility to decrease @sk_overlimits when TC egress becomes
     idle again, potentially by using a timer.

Suggested-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 include/net/sch_generic.h | 11 +++++++++++
 include/net/sock.h        | 21 +++++++++++++++++++++
 net/core/sock.c           |  1 +
 3 files changed, 33 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ec693fe7c553..afdf4bf64936 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -19,6 +19,7 @@
 #include <net/gen_stats.h>
 #include <net/rtnetlink.h>
 #include <net/flow_offload.h>
+#include <net/sock.h>
 
 struct Qdisc_ops;
 struct qdisc_walker;
@@ -1188,6 +1189,16 @@ static inline int qdisc_drop_all(struct sk_buff *skb, struct Qdisc *sch,
 	return NET_XMIT_DROP;
 }
 
+static inline void qdisc_backpressure(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	if (!sk || !sk_fullsock(sk))
+		return;
+
+	sk_backpressure(sk);
+}
+
 /* Length to Time (L2T) lookup in a qdisc_rate_table, to determine how
    long it will take to send a packet given its size.
  */
diff --git a/include/net/sock.h b/include/net/sock.h
index 05a1bbdf5805..ef10ca66cf26 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -277,6 +277,7 @@ struct sk_filter;
   *	@sk_pacing_status: Pacing status (requested, handled by sch_fq)
   *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *	@sk_sndbuf: size of send buffer in bytes
+  *	@sk_overlimits: size of temporarily unavailable send buffer in bytes
   *	@__sk_flags_offset: empty field used to determine location of bitfield
   *	@sk_padding: unused element for alignment
   *	@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
@@ -439,6 +440,7 @@ struct sock {
 	struct dst_entry __rcu	*sk_dst_cache;
 	atomic_t		sk_omem_alloc;
 	int			sk_sndbuf;
+	int			sk_overlimits;
 
 	/* ===== cache line for TX ===== */
 	int			sk_wmem_queued;
@@ -1264,6 +1266,7 @@ struct proto {
 
 	bool			(*stream_memory_free)(const struct sock *sk, int wake);
 	bool			(*sock_is_readable)(struct sock *sk);
+	void			(*backpressure)(struct sock *sk);
 	/* Memory pressure */
 	void			(*enter_memory_pressure)(struct sock *sk);
 	void			(*leave_memory_pressure)(struct sock *sk);
@@ -2499,6 +2502,24 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
 	WRITE_ONCE(sk->sk_sndbuf, max_t(u32, val, SOCK_MIN_SNDBUF));
 }
 
+static inline int sk_sndbuf_avail(struct sock *sk)
+{
+	int overlimits, sndbuf = READ_ONCE(sk->sk_sndbuf);
+
+	if (!sk->sk_prot->backpressure)
+		return sndbuf;
+
+	overlimits = READ_ONCE(sk->sk_overlimits);
+
+	return max_t(int, sndbuf - overlimits, SOCK_MIN_SNDBUF);
+}
+
+static inline void sk_backpressure(struct sock *sk)
+{
+	if (sk->sk_prot->backpressure)
+		sk->sk_prot->backpressure(sk);
+}
+
 /**
  * sk_page_frag - return an appropriate page_frag
  * @sk: socket
diff --git a/net/core/sock.c b/net/core/sock.c
index 4cb957d934a2..167d471b176f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2194,6 +2194,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
 
 	/* sk_wmem_alloc set to one (see sk_free() and sock_wfree()) */
 	refcount_set(&newsk->sk_wmem_alloc, 1);
+	newsk->sk_overlimits	= 0;
 
 	atomic_set(&newsk->sk_omem_alloc, 0);
 	sk_init_common(newsk);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v2 net-next 2/5] net/udp: Implement Qdisc backpressure algorithm
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
  2022-08-22  9:11   ` [PATCH RFC v2 net-next 1/5] net: Introduce " Peilin Ye
@ 2022-08-22  9:12   ` Peilin Ye
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 3/5] net/sched: sch_tbf: Use Qdisc backpressure infrastructure Peilin Ye
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-08-22  9:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, netdev, linux-doc, linux-kernel, Cong Wang,
	Stephen Hemminger, Dave Taht, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Support Qdisc backpressure for UDP (IPv4 and IPv6) sockets by
implementing the (*backpressure) callback:

  1. When a shaper Qdisc drops a packet due to TC egress congestion,
     halve the effective send buffer [1], then (re)schedule the
     backpressure timer.

  [1] sndbuf - overlimits_new == 1/2 * (sndbuf - overlimits_old)

  2. When the timer expires, double the effective send buffer [2].  If
     the socket is still overlimit, reschedule the timer itself.

  [2] sndbuf - overlimits_new == 2 * (sndbuf - overlimits_old)

In sock_wait_for_wmem() and sock_alloc_send_pskb(), check the effective
send buffer size instead, so that overlimit sockets send more slowly.
See sk_sndbuf_avail().
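
For illustration, assuming a sndbuf of 212992 bytes (a common default,
not taken from this patch) and ignoring the SOCK_MIN_SNDBUF clamp for
brevity, the effective send buffer would evolve like this:

  drop  : overlimits      0 -> 106496  (effective sndbuf 212992 -> 106496)
  drop  : overlimits 106496 -> 159744  (effective sndbuf 106496 ->  53248)
  expiry: overlimits 159744 -> 106496  (effective sndbuf  53248 -> 106496)
  expiry: overlimits 106496 ->      0  (effective sndbuf 106496 -> 212992)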

The timer interval is specified by a new per-net sysctl,
sysctl_udp_backpressure_interval.  Default is 100 milliseconds, meaning
that an overlimit UDP socket will try to double its effective send
buffer every 100 milliseconds.  Use 0 to disable Qdisc backpressure for
UDP sockets.

Generally, a longer interval means a lower packet drop rate, but it also
makes overlimit sockets slower to recover when TC egress becomes idle (or
when the shaper Qdisc gets removed, etc.).
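
As a usage sketch (the values below are illustrative only), the interval
can be tuned or turned off at runtime:

  sysctl -w net.ipv4.udp_backpressure_interval=1000   # slower recovery, fewer drops
  sysctl -w net.ipv4.udp_backpressure_interval=0      # disable for UDP sockets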

Test results with TBF + SFQ Qdiscs, 500 Mbits/sec rate limit with 16
iperf UDP '-b 1G' clients:

  Interval       Throughput  Drop Rate  CPU Usage [3]
   0 (disabled)  480.0 Mb/s     96.50%     68.38%
   10   ms       486.4 Mb/s      9.28%      1.30%
   100  ms       486.4 Mb/s      1.10%      1.11%
   1000 ms       486.4 Mb/s      0.13%      0.81%

  [3] perf-top, __pv_queued_spin_lock_slowpath()

Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 Documentation/networking/ip-sysctl.rst | 11 ++++
 include/linux/udp.h                    |  3 ++
 include/net/netns/ipv4.h               |  1 +
 include/net/udp.h                      |  1 +
 net/core/sock.c                        |  4 +-
 net/ipv4/sysctl_net_ipv4.c             |  7 +++
 net/ipv4/udp.c                         | 69 +++++++++++++++++++++++++-
 net/ipv6/udp.c                         |  2 +-
 8 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 56cd4ea059b2..a0d8e9518fda 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1070,6 +1070,17 @@ udp_rmem_min - INTEGER
 udp_wmem_min - INTEGER
 	UDP does not have tx memory accounting and this tunable has no effect.
 
+udp_backpressure_interval - INTEGER
+	The time interval (in milliseconds) in which an overlimit UDP socket
+	tries to increase its effective send buffer size, used by Qdisc
+	backpressure.  A longer interval typically results in a lower packet
+	drop rate, but also makes it slower for overlimit UDP sockets to
+	recover from backpressure when TC egress becomes idle.
+
+	0 to disable Qdisc backpressure for UDP sockets.
+
+	Default: 100
+
 RAW variables
 =============
 
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 254a2654400f..dd017994738b 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -86,6 +86,9 @@ struct udp_sock {
 
 	/* This field is dirtied by udp_recvmsg() */
 	int		forward_deficit;
+
+	/* Qdisc backpressure timer */
+	struct timer_list	backpressure_timer;
 };
 
 #define UDP_MAX_SEGMENTS	(1 << 6UL)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c7320ef356d9..01f72ddf23e0 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -182,6 +182,7 @@ struct netns_ipv4 {
 
 	int sysctl_udp_wmem_min;
 	int sysctl_udp_rmem_min;
+	int sysctl_udp_backpressure_interval;
 
 	u8 sysctl_fib_notify_on_flag_change;
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 5ee88ddf79c3..82018e58659b 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -279,6 +279,7 @@ int udp_init_sock(struct sock *sk);
 int udp_pre_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len);
 int __udp_disconnect(struct sock *sk, int flags);
 int udp_disconnect(struct sock *sk, int flags);
+void udp_backpressure(struct sock *sk);
 __poll_t udp_poll(struct file *file, struct socket *sock, poll_table *wait);
 struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb,
 				       netdev_features_t features,
diff --git a/net/core/sock.c b/net/core/sock.c
index 167d471b176f..cb6ba66f80c8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2614,7 +2614,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
 			break;
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
-		if (refcount_read(&sk->sk_wmem_alloc) < READ_ONCE(sk->sk_sndbuf))
+		if (refcount_read(&sk->sk_wmem_alloc) < sk_sndbuf_avail(sk))
 			break;
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			break;
@@ -2649,7 +2649,7 @@ struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			goto failure;
 
-		if (sk_wmem_alloc_get(sk) < READ_ONCE(sk->sk_sndbuf))
+		if (sk_wmem_alloc_get(sk) < sk_sndbuf_avail(sk))
 			break;
 
 		sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5490c285668b..1e509a417b92 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1337,6 +1337,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ONE
 	},
+	{
+		.procname	= "udp_backpressure_interval",
+		.data		= &init_net.ipv4.sysctl_udp_backpressure_interval,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_ms_jiffies,
+	},
 	{
 		.procname	= "fib_notify_on_flag_change",
 		.data		= &init_net.ipv4.sysctl_fib_notify_on_flag_change,
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 34eda973bbf1..ff58f638c834 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -110,6 +110,7 @@
 #include <trace/events/skb.h>
 #include <net/busy_poll.h>
 #include "udp_impl.h"
+#include <net/sock.h>
 #include <net/sock_reuseport.h>
 #include <net/addrconf.h>
 #include <net/udp_tunnel.h>
@@ -1614,10 +1615,73 @@ void udp_destruct_sock(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(udp_destruct_sock);
 
+static inline int udp_backpressure_interval_get(struct sock *sk)
+{
+	return READ_ONCE(sock_net(sk)->ipv4.sysctl_udp_backpressure_interval);
+}
+
+static inline void udp_reset_backpressure_timer(struct sock *sk,
+						unsigned long expires)
+{
+	sk_reset_timer(sk, &udp_sk(sk)->backpressure_timer, expires);
+}
+
+static void udp_backpressure_timer(struct timer_list *t)
+{
+	struct udp_sock *up = from_timer(up, t, backpressure_timer);
+	int interval, sndbuf, overlimits;
+	struct sock *sk = &up->inet.sk;
+
+	interval = udp_backpressure_interval_get(sk);
+	if (!interval) {
+		/* Qdisc backpressure has been turned off */
+		WRITE_ONCE(sk->sk_overlimits, 0);
+		goto out;
+	}
+
+	sndbuf = READ_ONCE(sk->sk_sndbuf);
+	overlimits = READ_ONCE(sk->sk_overlimits);
+
+	/* sndbuf - overlimits_new == 2 * (sndbuf - overlimits_old) */
+	overlimits = min_t(int, overlimits, sndbuf - SOCK_MIN_SNDBUF);
+	overlimits = max_t(int, (2 * overlimits) - sndbuf, 0);
+	WRITE_ONCE(sk->sk_overlimits, overlimits);
+
+	if (overlimits > 0)
+		udp_reset_backpressure_timer(sk, jiffies + interval);
+
+out:
+	sock_put(sk);
+}
+
+void udp_backpressure(struct sock *sk)
+{
+	int interval, sndbuf, overlimits;
+
+	interval = udp_backpressure_interval_get(sk);
+	if (!interval)	/* Qdisc backpressure is off */
+		return;
+
+	sndbuf = READ_ONCE(sk->sk_sndbuf);
+	overlimits = READ_ONCE(sk->sk_overlimits);
+
+	/* sndbuf - overlimits_new == 1/2 * (sndbuf - overlimits_old) */
+	overlimits = min_t(int, overlimits, sndbuf - SOCK_MIN_SNDBUF);
+	overlimits += (sndbuf - overlimits) >> 1;
+	WRITE_ONCE(sk->sk_overlimits, overlimits);
+
+	if (overlimits > 0)
+		udp_reset_backpressure_timer(sk, jiffies + interval);
+}
+EXPORT_SYMBOL_GPL(udp_backpressure);
+
 int udp_init_sock(struct sock *sk)
 {
-	skb_queue_head_init(&udp_sk(sk)->reader_queue);
+	struct udp_sock *up = udp_sk(sk);
+
+	skb_queue_head_init(&up->reader_queue);
 	sk->sk_destruct = udp_destruct_sock;
+	timer_setup(&up->backpressure_timer, udp_backpressure_timer, 0);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(udp_init_sock);
@@ -2653,6 +2717,7 @@ void udp_destroy_sock(struct sock *sk)
 	/* protects from races with udp_abort() */
 	sock_set_flag(sk, SOCK_DEAD);
 	udp_flush_pending_frames(sk);
+	sk_stop_timer(sk, &up->backpressure_timer);
 	unlock_sock_fast(sk, slow);
 	if (static_branch_unlikely(&udp_encap_needed_key)) {
 		if (up->encap_type) {
@@ -2946,6 +3011,7 @@ struct proto udp_prot = {
 #ifdef CONFIG_BPF_SYSCALL
 	.psock_update_sk_prot	= udp_bpf_update_proto,
 #endif
+	.backpressure		= udp_backpressure,
 	.memory_allocated	= &udp_memory_allocated,
 	.per_cpu_fw_alloc	= &udp_memory_per_cpu_fw_alloc,
 
@@ -3268,6 +3334,7 @@ static int __net_init udp_sysctl_init(struct net *net)
 {
 	net->ipv4.sysctl_udp_rmem_min = PAGE_SIZE;
 	net->ipv4.sysctl_udp_wmem_min = PAGE_SIZE;
+	net->ipv4.sysctl_udp_backpressure_interval = msecs_to_jiffies(100);
 
 #ifdef CONFIG_NET_L3_MASTER_DEV
 	net->ipv4.sysctl_udp_l3mdev_accept = 0;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 16c176e7c69a..106032af6756 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1735,7 +1735,7 @@ struct proto udpv6_prot = {
 #ifdef CONFIG_BPF_SYSCALL
 	.psock_update_sk_prot	= udp_bpf_update_proto,
 #endif
-
+	.backpressure		= udp_backpressure,
 	.memory_allocated	= &udp_memory_allocated,
 	.per_cpu_fw_alloc	= &udp_memory_per_cpu_fw_alloc,
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v2 net-next 3/5] net/sched: sch_tbf: Use Qdisc backpressure infrastructure
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
  2022-08-22  9:11   ` [PATCH RFC v2 net-next 1/5] net: Introduce " Peilin Ye
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 2/5] net/udp: Implement Qdisc backpressure algorithm Peilin Ye
@ 2022-08-22  9:12   ` Peilin Ye
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 4/5] net/sched: sch_htb: " Peilin Ye
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-08-22  9:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, netdev, linux-doc, linux-kernel, Cong Wang,
	Stephen Hemminger, Dave Taht, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we introduced a Qdisc backpressure infrastructure (currently
supporting UDP sockets).  Use it in TBF Qdisc.

Tested with 500 Mbits/sec rate limit and SFQ inner Qdisc using 16 iperf UDP
1 Gbit/sec clients.  Before:

[  3]  0.0-15.0 sec  53.6 MBytes  30.0 Mbits/sec   0.208 ms 1190234/1228450 (97%)
[  3]  0.0-15.0 sec  54.7 MBytes  30.6 Mbits/sec   0.085 ms   955591/994593 (96%)
[  3]  0.0-15.0 sec  55.4 MBytes  31.0 Mbits/sec   0.170 ms  966364/1005868 (96%)
[  3]  0.0-15.0 sec  55.0 MBytes  30.8 Mbits/sec   0.167 ms   925083/964333 (96%)
<...>                                                         ^^^^^^^^^^^^^^^^^^^

Total throughput is 480.2 Mbits/sec and average drop rate is 96.5%.

Now enable Qdisc backpressure for UDP sockets, with
udp_backpressure_interval at its default of 100 milliseconds:

[  3]  0.0-15.0 sec  54.4 MBytes  30.4 Mbits/sec   0.097 ms 450/39246 (1.1%)
[  3]  0.0-15.0 sec  54.4 MBytes  30.4 Mbits/sec   0.331 ms 435/39232 (1.1%)
[  3]  0.0-15.0 sec  54.4 MBytes  30.4 Mbits/sec   0.040 ms 435/39212 (1.1%)
[  3]  0.0-15.0 sec  54.4 MBytes  30.4 Mbits/sec   0.031 ms 426/39208 (1.1%)
<...>                                                       ^^^^^^^^^^^^^^^^

Total throughput is 486.4 Mbits/sec (1.29% higher) and average drop rate
is 1.1% (98.86% lower).

However, enabling Qdisc backpressure affects fairness between flows if we
use TBF Qdisc with the default bfifo inner Qdisc:

[  3]  0.0-15.0 sec  46.1 MBytes  25.8 Mbits/sec   1.102 ms 142/33048 (0.43%)
[  3]  0.0-15.0 sec  72.8 MBytes  40.7 Mbits/sec   0.476 ms 145/52081 (0.28%)
[  3]  0.0-15.0 sec  53.2 MBytes  29.7 Mbits/sec   1.047 ms 141/38086 (0.37%)
[  3]  0.0-15.0 sec  45.5 MBytes  25.4 Mbits/sec   1.600 ms 141/32573 (0.43%)
<...>                                                       ^^^^^^^^^^^^^^^^^

In the test, per-flow throughput ranged from 16.4 to 68.7 Mbits/sec.
However, total throughput was still 486.4 Mbits/sec (0.87% higher than
before), and average drop rate was 0.41% (99.58% lower than before).
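
For reference, a sketch of this kind of TBF + SFQ setup (device name,
burst and latency values are assumptions, not taken from the test above):

  tc qdisc replace dev veth0 root handle 1: tbf rate 500mbit burst 128kb latency 50ms
  tc qdisc replace dev veth0 parent 1:1 handle 10: sfq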

Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/sched/sch_tbf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_tbf.c b/net/sched/sch_tbf.c
index 72102277449e..cf9cc7dbf078 100644
--- a/net/sched/sch_tbf.c
+++ b/net/sched/sch_tbf.c
@@ -222,6 +222,7 @@ static int tbf_segment(struct sk_buff *skb, struct Qdisc *sch,
 		len += segs->len;
 		ret = qdisc_enqueue(segs, q->qdisc, to_free);
 		if (ret != NET_XMIT_SUCCESS) {
+			qdisc_backpressure(skb);
 			if (net_xmit_drop_count(ret))
 				qdisc_qstats_drop(sch);
 		} else {
@@ -250,6 +251,7 @@ static int tbf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	}
 	ret = qdisc_enqueue(skb, q->qdisc, to_free);
 	if (ret != NET_XMIT_SUCCESS) {
+		qdisc_backpressure(skb);
 		if (net_xmit_drop_count(ret))
 			qdisc_qstats_drop(sch);
 		return ret;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v2 net-next 4/5] net/sched: sch_htb: Use Qdisc backpressure infrastructure
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
                     ` (2 preceding siblings ...)
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 3/5] net/sched: sch_tbf: Use Qdisc backpressure infrastructure Peilin Ye
@ 2022-08-22  9:12   ` Peilin Ye
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 5/5] net/sched: sch_cbq: " Peilin Ye
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-08-22  9:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, netdev, linux-doc, linux-kernel, Cong Wang,
	Stephen Hemminger, Dave Taht, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we introduced a Qdisc backpressure infrastructure (currently
supporting UDP sockets).  Use it in HTB Qdisc.

Tested with 500 Mbits/sec rate limit using 16 iperf UDP 1 Gbit/sec
clients.  Before:

[  3]  0.0-15.0 sec  54.2 MBytes  30.4 Mbits/sec   0.875 ms 1245750/1284444 (97%)
[  3]  0.0-15.0 sec  54.2 MBytes  30.3 Mbits/sec   1.288 ms 1238753/1277402 (97%)
[  3]  0.0-15.0 sec  54.8 MBytes  30.6 Mbits/sec   1.761 ms 1261762/1300817 (97%)
[  3]  0.0-15.0 sec  53.9 MBytes  30.1 Mbits/sec   1.635 ms 1241690/1280133 (97%)
<...>                                                       ^^^^^^^^^^^^^^^^^^^^^

Total throughput is 482.0 Mbits/sec and average drop rate is 97.0%.

Now enable Qdisc backpressure for UDP sockets, with
udp_backpressure_interval at its default of 100 milliseconds:

[  3]  0.0-15.0 sec  53.0 MBytes  29.6 Mbits/sec   1.621 ms 54/37856 (0.14%)
[  3]  0.0-15.0 sec  55.9 MBytes  31.3 Mbits/sec   1.368 ms  6/39895 (0.015%)
[  3]  0.0-15.0 sec  52.3 MBytes  29.2 Mbits/sec   1.560 ms 56/37340 (0.15%)
[  3]  0.0-15.0 sec  52.7 MBytes  29.5 Mbits/sec   1.495 ms 57/37677 (0.15%)
<...>                                                       ^^^^^^^^^^^^^^^^

Total throughput is 485.9 Mbits/sec (0.81% higher) and average drop rate
is 0.1% (99.9% lower).

Fairness between flows is slightly affected, with per-flow average
throughput ranging from 29.2 to 31.8 Mbits/sec (compared with 29.7 to
30.6 Mbits/sec).
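
For reference, a sketch of this kind of HTB setup (device name, class id
and burst value are assumptions, not taken from the test above):

  tc qdisc replace dev veth0 root handle 1: htb default 1
  tc class add dev veth0 parent 1: classid 1:1 htb rate 500mbit burst 128kb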

Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/sched/sch_htb.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 23a9d6242429..e337b3d0dab3 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -623,6 +623,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			__qdisc_enqueue_tail(skb, &q->direct_queue);
 			q->direct_pkts++;
 		} else {
+			qdisc_backpressure(skb);
 			return qdisc_drop(skb, sch, to_free);
 		}
 #ifdef CONFIG_NET_CLS_ACT
@@ -634,6 +635,7 @@ static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 #endif
 	} else if ((ret = qdisc_enqueue(skb, cl->leaf.q,
 					to_free)) != NET_XMIT_SUCCESS) {
+		qdisc_backpressure(skb);
 		if (net_xmit_drop_count(ret)) {
 			qdisc_qstats_drop(sch);
 			cl->drops++;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH RFC v2 net-next 5/5] net/sched: sch_cbq: Use Qdisc backpressure infrastructure
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
                     ` (3 preceding siblings ...)
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 4/5] net/sched: sch_htb: " Peilin Ye
@ 2022-08-22  9:12   ` Peilin Ye
  2022-08-22 16:17   ` [PATCH RFC v2 net-next 0/5] net: " Jakub Kicinski
  2022-08-22 16:22   ` Eric Dumazet
  6 siblings, 0 replies; 28+ messages in thread
From: Peilin Ye @ 2022-08-22  9:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko
  Cc: Peilin Ye, netdev, linux-doc, linux-kernel, Cong Wang,
	Stephen Hemminger, Dave Taht, Peilin Ye

From: Peilin Ye <peilin.ye@bytedance.com>

Recently we introduced a Qdisc backpressure infrastructure (currently
supporting UDP sockets).  Use it in CBQ Qdisc.

Tested with 500 Mbits/sec rate limit using 16 iperf UDP 1 Gbit/sec
clients.  Before:

[  3]  0.0-15.0 sec  55.8 MBytes  31.2 Mbits/sec   1.185 ms 1073326/1113110 (96%)
[  3]  0.0-15.0 sec  55.9 MBytes  31.3 Mbits/sec   1.001 ms 1080330/1120201 (96%)
[  3]  0.0-15.0 sec  55.6 MBytes  31.1 Mbits/sec   1.750 ms 1078292/1117980 (96%)
[  3]  0.0-15.0 sec  55.3 MBytes  30.9 Mbits/sec   0.895 ms 1089200/1128640 (97%)
<...>                                                       ^^^^^^^^^^^^^^^^^^^^^

Total throughput is 493.7 Mbits/sec and average drop rate is 96.13%.

Now enable Qdisc backpressure for UDP sockets, with
udp_backpressure_interval at its default of 100 milliseconds:

[  3]  0.0-15.0 sec  54.2 MBytes  30.3 Mbits/sec   2.302 ms 54/38692 (0.14%)
[  3]  0.0-15.0 sec  54.1 MBytes  30.2 Mbits/sec   2.227 ms 54/38671 (0.14%)
[  3]  0.0-15.0 sec  53.5 MBytes  29.9 Mbits/sec   2.043 ms 57/38203 (0.15%)
[  3]  0.0-15.0 sec  58.1 MBytes  32.5 Mbits/sec   1.843 ms 1/41480 (0.0024%)
<...>                                                       ^^^^^^^^^^^^^^^^^

Total throughput is 497.1 Mbits/sec (0.69% higher), average drop rate is
0.08% (99.9% lower).

Fairness between flows is slightly affected, with per-flow average
throughput ranging from 29.9 to 32.6 Mbits/sec (compared with 30.3 to
31.3 Mbits/sec).

Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
---
 net/sched/sch_cbq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sched/sch_cbq.c b/net/sched/sch_cbq.c
index 91a0dc463c48..42e44f570988 100644
--- a/net/sched/sch_cbq.c
+++ b/net/sched/sch_cbq.c
@@ -381,6 +381,7 @@ cbq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		return ret;
 	}
 
+	qdisc_backpressure(skb);
 	if (net_xmit_drop_count(ret)) {
 		qdisc_qstats_drop(sch);
 		cbq_mark_toplevel(q, cl);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
                     ` (4 preceding siblings ...)
  2022-08-22  9:12   ` [PATCH RFC v2 net-next 5/5] net/sched: sch_cbq: " Peilin Ye
@ 2022-08-22 16:17   ` Jakub Kicinski
  2022-08-29 16:53     ` Cong Wang
  2022-08-22 16:22   ` Eric Dumazet
  6 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2022-08-22 16:17 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Jonathan Corbet,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Peilin Ye, netdev, linux-doc, linux-kernel,
	Cong Wang, Stephen Hemminger, Dave Taht

On Mon, 22 Aug 2022 02:10:17 -0700 Peilin Ye wrote:
> Currently sockets (especially UDP ones) can drop a lot of packets at TC
> egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> tries to solve this by introducing a Qdisc backpressure mechanism.
> 
> RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> issues, including a thundering herd problem and a socket reference count
> issue [2].  This RFC v2 uses a different approach to avoid those issues:
> 
>   1. When a shaper Qdisc drops a packet that belongs to a local socket due
>      to TC egress congestion, we make part of the socket's sndbuf
>      temporarily unavailable, so it sends slower.
>   
>   2. Later, when TC egress becomes idle again, we gradually recover the
>      socket's sndbuf back to normal.  Patch 2 implements this step using a
>      timer for UDP sockets.
> 
> The thundering herd problem is avoided, since we no longer wake up all
> throttled sockets at the same time in qdisc_watchdog().  The socket
> reference count issue is also avoided, since we no longer maintain socket
> list on Qdisc.
> 
> Performance is better than RFC v1.  There is one concern about fairness
> between flows for TBF Qdisc, which could be solved by using a SFQ inner
> Qdisc.
> 
> Please see the individual patches for details and numbers.  Any comments,
> suggestions would be much appreciated.  Thanks!
> 
> [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/

Similarly to Eric's comments on v1 I'm not seeing the clear motivation
here. Modern high speed UDP users will have a CC in user space, back
off and set transmission time on the packets. Could you describe your
_actual_ use case / application in more detail?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
                     ` (5 preceding siblings ...)
  2022-08-22 16:17   ` [PATCH RFC v2 net-next 0/5] net: " Jakub Kicinski
@ 2022-08-22 16:22   ` Eric Dumazet
  2022-08-29 16:47     ` Cong Wang
  2022-08-30  2:28     ` Yafang Shao
  6 siblings, 2 replies; 28+ messages in thread
From: Eric Dumazet @ 2022-08-22 16:22 UTC (permalink / raw)
  To: Peilin Ye
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Jonathan Corbet,
	Hideaki YOSHIFUJI, David Ahern, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Peilin Ye, netdev, open list:DOCUMENTATION, LKML,
	Cong Wang, Stephen Hemminger, Dave Taht

On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>
> From: Peilin Ye <peilin.ye@bytedance.com>
>
> Hi all,
>
> Currently sockets (especially UDP ones) can drop a lot of packets at TC
> egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> tries to solve this by introducing a Qdisc backpressure mechanism.
>
> RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> issues, including a thundering herd problem and a socket reference count
> issue [2].  This RFC v2 uses a different approach to avoid those issues:
>
>   1. When a shaper Qdisc drops a packet that belongs to a local socket due
>      to TC egress congestion, we make part of the socket's sndbuf
>      temporarily unavailable, so it sends slower.
>
>   2. Later, when TC egress becomes idle again, we gradually recover the
>      socket's sndbuf back to normal.  Patch 2 implements this step using a
>      timer for UDP sockets.
>
> The thundering herd problem is avoided, since we no longer wake up all
> throttled sockets at the same time in qdisc_watchdog().  The socket
> reference count issue is also avoided, since we no longer maintain socket
> list on Qdisc.
>
> Performance is better than RFC v1.  There is one concern about fairness
> between flows for TBF Qdisc, which could be solved by using a SFQ inner
> Qdisc.
>
> Please see the individual patches for details and numbers.  Any comments,
> suggestions would be much appreciated.  Thanks!
>
> [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
>
> Peilin Ye (5):
>   net: Introduce Qdisc backpressure infrastructure
>   net/udp: Implement Qdisc backpressure algorithm
>   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
>   net/sched: sch_htb: Use Qdisc backpressure infrastructure
>   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
>

I think the whole idea is wrong.

Packet schedulers can be remote (offloaded, or on another box)

The idea of going back to socket level from a packet scheduler should
really be a last resort.

Issue of having UDP sockets being able to flood a network is tough, I
am not sure the core networking stack
should pretend it can solve the issue.

Note that FQ based packet schedulers can also help already.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-22 16:22   ` Eric Dumazet
@ 2022-08-29 16:47     ` Cong Wang
  2022-08-29 16:53       ` Eric Dumazet
  2022-08-30  2:28     ` Yafang Shao
  1 sibling, 1 reply; 28+ messages in thread
From: Cong Wang @ 2022-08-29 16:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peilin Ye, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev,
	open list:DOCUMENTATION, LKML, Cong Wang, Stephen Hemminger,
	Dave Taht

On Mon, Aug 22, 2022 at 09:22:39AM -0700, Eric Dumazet wrote:
> On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > Hi all,
> >
> > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > tries to solve this by introducing a Qdisc backpressure mechanism.
> >
> > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > issues, including a thundering herd problem and a socket reference count
> > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> >
> >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> >      to TC egress congestion, we make part of the socket's sndbuf
> >      temporarily unavailable, so it sends slower.
> >
> >   2. Later, when TC egress becomes idle again, we gradually recover the
> >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> >      timer for UDP sockets.
> >
> > The thundering herd problem is avoided, since we no longer wake up all
> > throttled sockets at the same time in qdisc_watchdog().  The socket
> > reference count issue is also avoided, since we no longer maintain socket
> > list on Qdisc.
> >
> > Performance is better than RFC v1.  There is one concern about fairness
> > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > Qdisc.
> >
> > Please see the individual patches for details and numbers.  Any comments,
> > suggestions would be much appreciated.  Thanks!
> >
> > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> >
> > Peilin Ye (5):
> >   net: Introduce Qdisc backpressure infrastructure
> >   net/udp: Implement Qdisc backpressure algorithm
> >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> >
> 
> I think the whole idea is wrong.
> 

Be more specific?

> Packet schedulers can be remote (offloaded, or on another box)

This is not the case we are dealing with (yet).

> 
> The idea of going back to socket level from a packet scheduler should
> really be a last resort.

I think it should be the first resort, as we should backpressure to the
source, rather than anything in the middle.

> 
> Issue of having UDP sockets being able to flood a network is tough, I
> am not sure the core networking stack
> should pretend it can solve the issue.

It seems you misunderstand it here, we are not dealing with UDP on the
network, just on an end host. The backpressure we are dealing with is
from Qdisc to socket on _TX side_ and on one single host.

> 
> Note that FQ based packet schedulers can also help already.

It only helps TCP pacing.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-22 16:17   ` [PATCH RFC v2 net-next 0/5] net: " Jakub Kicinski
@ 2022-08-29 16:53     ` Cong Wang
  2022-08-30  0:21       ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Cong Wang @ 2022-08-29 16:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev, linux-doc,
	linux-kernel, Cong Wang, Stephen Hemminger, Dave Taht

On Mon, Aug 22, 2022 at 09:17:37AM -0700, Jakub Kicinski wrote:
> On Mon, 22 Aug 2022 02:10:17 -0700 Peilin Ye wrote:
> > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > tries to solve this by introducing a Qdisc backpressure mechanism.
> > 
> > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > issues, including a thundering herd problem and a socket reference count
> > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > 
> >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> >      to TC egress congestion, we make part of the socket's sndbuf
> >      temporarily unavailable, so it sends slower.
> >   
> >   2. Later, when TC egress becomes idle again, we gradually recover the
> >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> >      timer for UDP sockets.
> > 
> > The thundering herd problem is avoided, since we no longer wake up all
> > throttled sockets at the same time in qdisc_watchdog().  The socket
> > reference count issue is also avoided, since we no longer maintain socket
> > list on Qdisc.
> > 
> > Performance is better than RFC v1.  There is one concern about fairness
> > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > Qdisc.
> > 
> > Please see the individual patches for details and numbers.  Any comments,
> > suggestions would be much appreciated.  Thanks!
> > 
> > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> 
> Similarly to Eric's comments on v1 I'm not seeing the clear motivation
> here. Modern high speed UDP users will have a CC in user space, back
> off and set transmission time on the packets. Could you describe your
> _actual_ use case / application in more detail?

Not everyone implements QUIC or CC; it is really hard to implement CC
from scratch. This backpressure mechanism is much simpler than CC (TCP or
QUIC), as it clearly does not deal with any remote congestion.

And, although this patchset only implements UDP backpressure, it can be
applied to any other protocol easily; it is protocol-independent.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-29 16:47     ` Cong Wang
@ 2022-08-29 16:53       ` Eric Dumazet
  2022-09-19 17:06         ` Cong Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2022-08-29 16:53 UTC (permalink / raw)
  To: Cong Wang
  Cc: Peilin Ye, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev,
	open list:DOCUMENTATION, LKML, Cong Wang, Stephen Hemminger,
	Dave Taht

On Mon, Aug 29, 2022 at 9:47 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 09:22:39AM -0700, Eric Dumazet wrote:
> > On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > >
> > > From: Peilin Ye <peilin.ye@bytedance.com>
> > >
> > > Hi all,
> > >
> > > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > > tries to solve this by introducing a Qdisc backpressure mechanism.
> > >
> > > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > > issues, including a thundering herd problem and a socket reference count
> > > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > >
> > >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> > >      to TC egress congestion, we make part of the socket's sndbuf
> > >      temporarily unavailable, so it sends slower.
> > >
> > >   2. Later, when TC egress becomes idle again, we gradually recover the
> > >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> > >      timer for UDP sockets.
> > >
> > > The thundering herd problem is avoided, since we no longer wake up all
> > > throttled sockets at the same time in qdisc_watchdog().  The socket
> > > reference count issue is also avoided, since we no longer maintain socket
> > > list on Qdisc.
> > >
> > > Performance is better than RFC v1.  There is one concern about fairness
> > > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > > Qdisc.
> > >
> > > Please see the individual patches for details and numbers.  Any comments,
> > > suggestions would be much appreciated.  Thanks!
> > >
> > > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> > >
> > > Peilin Ye (5):
> > >   net: Introduce Qdisc backpressure infrastructure
> > >   net/udp: Implement Qdisc backpressure algorithm
> > >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> > >
> >
> > I think the whole idea is wrong.
> >
>
> Be more specific?
>
> > Packet schedulers can be remote (offloaded, or on another box)
>
> This is not the case we are dealing with (yet).
>
> >
> > The idea of going back to socket level from a packet scheduler should
> > really be a last resort.
>
> I think it should be the first resort, as we should backpressure to the
> source, rather than anything in the middle.
>
> >
> > Issue of having UDP sockets being able to flood a network is tough, I
> > am not sure the core networking stack
> > should pretend it can solve the issue.
>
> It seems you misunderstand it here, we are not dealing with UDP on the
> network, just on an end host. The backpressure we are dealing with is
> from Qdisc to socket on _TX side_ and on one single host.
>
> >
> > Note that FQ based packet schedulers can also help already.
>
> It only helps TCP pacing.

FQ : Fair Queue.

It definitely helps without the pacing part...

>
> Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-29 16:53     ` Cong Wang
@ 2022-08-30  0:21       ` Jakub Kicinski
  2022-09-19 17:00         ` Cong Wang
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2022-08-30  0:21 UTC (permalink / raw)
  To: Cong Wang
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev, linux-doc,
	linux-kernel, Cong Wang, Stephen Hemminger, Dave Taht

On Mon, 29 Aug 2022 09:53:17 -0700 Cong Wang wrote:
> > Similarly to Eric's comments on v1 I'm not seeing the clear motivation
> > here. Modern high speed UDP users will have a CC in user space, back
> > off and set transmission time on the packets. Could you describe your
> > _actual_ use case / application in more detail?  
> 
> Not everyone implements QUIC or CC; it is really hard to implement CC
> from scratch. This backpressure mechanism is much simpler than CC (TCP or
> QUIC), as it clearly does not deal with any remote congestion.
> 
> And, although this patchset only implements UDP backpressure, it can be
> applied to any other protocol easily; it is protocol-independent.

No disagreement on any of your points. But I don't feel like 
you answered my question about the details of the use case.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-22 16:22   ` Eric Dumazet
  2022-08-29 16:47     ` Cong Wang
@ 2022-08-30  2:28     ` Yafang Shao
  2022-09-19 17:04       ` Cong Wang
  1 sibling, 1 reply; 28+ messages in thread
From: Yafang Shao @ 2022-08-30  2:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peilin Ye, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Peilin Ye, netdev,
	open list:DOCUMENTATION, LKML, Cong Wang, Stephen Hemminger,
	Dave Taht

On Tue, Aug 23, 2022 at 1:02 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> >
> > From: Peilin Ye <peilin.ye@bytedance.com>
> >
> > Hi all,
> >
> > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > tries to solve this by introducing a Qdisc backpressure mechanism.
> >
> > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > issues, including a thundering herd problem and a socket reference count
> > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> >
> >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> >      to TC egress congestion, we make part of the socket's sndbuf
> >      temporarily unavailable, so it sends slower.
> >
> >   2. Later, when TC egress becomes idle again, we gradually recover the
> >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> >      timer for UDP sockets.
> >
> > The thundering herd problem is avoided, since we no longer wake up all
> > throttled sockets at the same time in qdisc_watchdog().  The socket
> > reference count issue is also avoided, since we no longer maintain socket
> > list on Qdisc.
> >
> > Performance is better than RFC v1.  There is one concern about fairness
> > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > Qdisc.
> >
> > Please see the individual patches for details and numbers.  Any comments,
> > suggestions would be much appreciated.  Thanks!
> >
> > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> >
> > Peilin Ye (5):
> >   net: Introduce Qdisc backpressure infrastructure
> >   net/udp: Implement Qdisc backpressure algorithm
> >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> >
>
> I think the whole idea is wrong.
>
> Packet schedulers can be remote (offloaded, or on another box)
>
> The idea of going back to socket level from a packet scheduler should
> really be a last resort.
>
> Issue of having UDP sockets being able to flood a network is tough, I
> am not sure the core networking stack
> should pretend it can solve the issue.
>
> Note that FQ based packet schedulers can also help already.

We encounter a similar issue when using (fq + edt-bpf) to rate-limit UDP
traffic, because of the qdisc buffer limit.
If the qdisc buffer limit is too small, UDP packets will be dropped in
the qdisc layer. But the sender doesn't know that the packets have been
dropped, so it will continue to send packets, and thus more and more
packets will be dropped there.  IOW, the qdisc becomes a bottleneck
before the bandwidth limit is reached.
We work around this issue by enlarging the buffer limit and flow_limit
(the proper values can be calculated from net.ipv4.udp_mem and
net.core.wmem_default).
But obviously this is not a perfect solution, because net.ipv4.udp_mem
or net.core.wmem_default may be changed dynamically.
We have also thought about a solution that builds a connection between
UDP memory and the qdisc limit, but we are not sure whether it is a good
idea either.
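
As a rough sketch of that workaround (the numbers below are illustrative
only; the real values were derived from udp_mem and wmem_default as
described above):

  tc qdisc replace dev eth0 root fq limit 20000 flow_limit 2000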

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-30  0:21       ` Jakub Kicinski
@ 2022-09-19 17:00         ` Cong Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Cong Wang @ 2022-09-19 17:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Peilin Ye, David S. Miller, Eric Dumazet, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev, linux-doc,
	linux-kernel, Cong Wang, Stephen Hemminger, Dave Taht

On Mon, Aug 29, 2022 at 05:21:11PM -0700, Jakub Kicinski wrote:
> On Mon, 29 Aug 2022 09:53:17 -0700 Cong Wang wrote:
> > > Similarly to Eric's comments on v1 I'm not seeing the clear motivation
> > > here. Modern high speed UDP users will have a CC in user space, back
> > > off and set transmission time on the packets. Could you describe your
> > > _actual_ use case / application in more detail?  
> > 
> > Not everyone implements QUIC or CC; it is really hard to implement CC
> > from scratch. This backpressure mechanism is much simpler than CC (TCP or
> > QUIC), as it clearly does not deal with any remote congestion.
> > 
> > And, although this patchset only implements UDP backpressure, it can be
> > applied to any other protocol easily; it is protocol-independent.
> 
> No disagreement on any of your points. But I don't feel like 
> you answered my question about the details of the use case.

Do you need a use case for UDP w/o QUIC? Seriously??? There must be
tons of them...

Take a look at UDP tunnels, for instance wireguard, which is our use
case. ByteDance has a wireguard-based VPN solution for business. (I hate
to brand ourselves, but you are asking for it...)

Please do research on your side; as a netdev maintainer, you are
supposed to know this much better than me.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-30  2:28     ` Yafang Shao
@ 2022-09-19 17:04       ` Cong Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Cong Wang @ 2022-09-19 17:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Eric Dumazet, Peilin Ye, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev,
	open list:DOCUMENTATION, LKML, Cong Wang, Stephen Hemminger,
	Dave Taht

On Tue, Aug 30, 2022 at 10:28:01AM +0800, Yafang Shao wrote:
> On Tue, Aug 23, 2022 at 1:02 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > >
> > > From: Peilin Ye <peilin.ye@bytedance.com>
> > >
> > > Hi all,
> > >
> > > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > > tries to solve this by introducing a Qdisc backpressure mechanism.
> > >
> > > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > > issues, including a thundering herd problem and a socket reference count
> > > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > >
> > >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> > >      to TC egress congestion, we make part of the socket's sndbuf
> > >      temporarily unavailable, so it sends slower.
> > >
> > >   2. Later, when TC egress becomes idle again, we gradually recover the
> > >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> > >      timer for UDP sockets.
> > >
> > > The thundering herd problem is avoided, since we no longer wake up all
> > > throttled sockets at the same time in qdisc_watchdog().  The socket
> > > reference count issue is also avoided, since we no longer maintain socket
> > > list on Qdisc.
> > >
> > > Performance is better than RFC v1.  There is one concern about fairness
> > > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > > Qdisc.
> > >
> > > Please see the individual patches for details and numbers.  Any comments,
> > > suggestions would be much appreciated.  Thanks!
> > >
> > > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> > >
> > > Peilin Ye (5):
> > >   net: Introduce Qdisc backpressure infrastructure
> > >   net/udp: Implement Qdisc backpressure algorithm
> > >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> > >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> > >
> >
> > I think the whole idea is wrong.
> >
> > Packet schedulers can be remote (offloaded, or on another box)
> >
> > The idea of going back to socket level from a packet scheduler should
> > really be a last resort.
> >
> > Issue of having UDP sockets being able to flood a network is tough, I
> > am not sure the core networking stack
> > should pretend it can solve the issue.
> >
> > Note that FQ based packet schedulers can also help already.
> 
> We encounter a similar issue when using (fq + edt-bpf) to rate-limit UDP
> traffic, because of the qdisc buffer limit.
> If the qdisc buffer limit is too small, UDP packets will be dropped in
> the qdisc layer. But the sender doesn't know that the packets have been
> dropped, so it will continue to send packets, and thus more and more
> packets will be dropped there.  IOW, the qdisc becomes a bottleneck
> before the bandwidth limit is reached.
> We work around this issue by enlarging the buffer limit and flow_limit
> (the proper values can be calculated from net.ipv4.udp_mem and
> net.core.wmem_default).
> But obviously this is not a perfect solution, because net.ipv4.udp_mem
> or net.core.wmem_default may be changed dynamically.
> We have also thought about a solution that builds a connection between
> UDP memory and the qdisc limit, but we are not sure whether it is a good
> idea either.

This is literally what this patchset does. Although this patchset does
not touch TCP (as TCP has TSQ), I think this is a better approach than
TSQ, because TSQ has no idea about the Qdisc limit.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH RFC v2 net-next 0/5] net: Qdisc backpressure infrastructure
  2022-08-29 16:53       ` Eric Dumazet
@ 2022-09-19 17:06         ` Cong Wang
  0 siblings, 0 replies; 28+ messages in thread
From: Cong Wang @ 2022-09-19 17:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peilin Ye, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Jonathan Corbet, Hideaki YOSHIFUJI, David Ahern,
	Jamal Hadi Salim, Jiri Pirko, Peilin Ye, netdev,
	open list:DOCUMENTATION, LKML, Cong Wang, Stephen Hemminger,
	Dave Taht

On Mon, Aug 29, 2022 at 09:53:43AM -0700, Eric Dumazet wrote:
> On Mon, Aug 29, 2022 at 9:47 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 09:22:39AM -0700, Eric Dumazet wrote:
> > > On Mon, Aug 22, 2022 at 2:10 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
> > > >
> > > > From: Peilin Ye <peilin.ye@bytedance.com>
> > > >
> > > > Hi all,
> > > >
> > > > Currently sockets (especially UDP ones) can drop a lot of packets at TC
> > > > egress when rate limited by shaper Qdiscs like HTB.  This patchset series
> > > > tries to solve this by introducing a Qdisc backpressure mechanism.
> > > >
> > > > RFC v1 [1] used a throttle & unthrottle approach, which introduced several
> > > > issues, including a thundering herd problem and a socket reference count
> > > > issue [2].  This RFC v2 uses a different approach to avoid those issues:
> > > >
> > > >   1. When a shaper Qdisc drops a packet that belongs to a local socket due
> > > >      to TC egress congestion, we make part of the socket's sndbuf
> > > >      temporarily unavailable, so it sends slower.
> > > >
> > > >   2. Later, when TC egress becomes idle again, we gradually recover the
> > > >      socket's sndbuf back to normal.  Patch 2 implements this step using a
> > > >      timer for UDP sockets.
> > > >
> > > > The thundering herd problem is avoided, since we no longer wake up all
> > > > throttled sockets at the same time in qdisc_watchdog().  The socket
> > > > reference count issue is also avoided, since we no longer maintain socket
> > > > list on Qdisc.
> > > >
> > > > Performance is better than RFC v1.  There is one concern about fairness
> > > > between flows for TBF Qdisc, which could be solved by using a SFQ inner
> > > > Qdisc.
> > > >
> > > > Please see the individual patches for details and numbers.  Any comments,
> > > > suggestions would be much appreciated.  Thanks!
> > > >
> > > > [1] https://lore.kernel.org/netdev/cover.1651800598.git.peilin.ye@bytedance.com/
> > > > [2] https://lore.kernel.org/netdev/20220506133111.1d4bebf3@hermes.local/
> > > >
> > > > Peilin Ye (5):
> > > >   net: Introduce Qdisc backpressure infrastructure
> > > >   net/udp: Implement Qdisc backpressure algorithm
> > > >   net/sched: sch_tbf: Use Qdisc backpressure infrastructure
> > > >   net/sched: sch_htb: Use Qdisc backpressure infrastructure
> > > >   net/sched: sch_cbq: Use Qdisc backpressure infrastructure
> > > >
> > >
> > > I think the whole idea is wrong.
> > >
> >
> > Be more specific?
> >
> > > Packet schedulers can be remote (offloaded, or on another box)
> >
> > This is not the case we are dealing with (yet).
> >
> > >
> > > The idea of going back to socket level from a packet scheduler should
> > > really be a last resort.
> >
> > I think it should be the first resort, as we should backpressure to the
> > source, rather than anything in the middle.
> >
> > >
> > > Issue of having UDP sockets being able to flood a network is tough, I
> > > am not sure the core networking stack
> > > should pretend it can solve the issue.
> >
> > It seems you misunderstand it here, we are not dealing with UDP on the
> > network, just on an end host. The backpressure we are dealing with is
> > from Qdisc to socket on _TX side_ and on one single host.
> >
> > >
> > > Note that FQ based packet schedulers can also help already.
> >
> > It only helps TCP pacing.
> 
> FQ : Fair Queue.
> 
> It definitely helps without the pacing part...

True, but the fair queuing part has nothing to do with this patchset...
Only the pacing part is related to this topic, and it is only about
TCP.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-09-19 17:09 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-06 19:43 [PATCH RFC v1 net-next 0/4] net: Qdisc backpressure infrastructure Peilin Ye
2022-05-06 19:44 ` [PATCH RFC v1 net-next 1/4] net: Introduce " Peilin Ye
2022-05-06 20:31   ` Stephen Hemminger
2022-05-06 23:34     ` Peilin Ye
2022-05-09  7:53   ` Dave Taht
2022-05-10  2:23     ` Peilin Ye
2022-05-06 19:44 ` [PATCH RFC v1 net-next 2/4] net/sched: sch_tbf: Use " Peilin Ye
2022-05-06 19:45 ` [PATCH RFC v1 net-next 3/4] net/sched: sch_htb: " Peilin Ye
2022-05-06 19:45 ` [PATCH RFC v1 net-next 4/4] net/sched: sch_cbq: " Peilin Ye
2022-05-10  3:26 ` [PATCH RFC v1 net-next 0/4] net: " Eric Dumazet
2022-05-10 23:03   ` Peilin Ye
2022-05-10 23:27     ` Peilin Ye
2022-08-22  9:10 ` [PATCH RFC v2 net-next 0/5] " Peilin Ye
2022-08-22  9:11   ` [PATCH RFC v2 net-next 1/5] net: Introduce " Peilin Ye
2022-08-22  9:12   ` [PATCH RFC v2 net-next 2/5] net/udp: Implement Qdisc backpressure algorithm Peilin Ye
2022-08-22  9:12   ` [PATCH RFC v2 net-next 3/5] net/sched: sch_tbf: Use Qdisc backpressure infrastructure Peilin Ye
2022-08-22  9:12   ` [PATCH RFC v2 net-next 4/5] net/sched: sch_htb: " Peilin Ye
2022-08-22  9:12   ` [PATCH RFC v2 net-next 5/5] net/sched: sch_cbq: " Peilin Ye
2022-08-22 16:17   ` [PATCH RFC v2 net-next 0/5] net: " Jakub Kicinski
2022-08-29 16:53     ` Cong Wang
2022-08-30  0:21       ` Jakub Kicinski
2022-09-19 17:00         ` Cong Wang
2022-08-22 16:22   ` Eric Dumazet
2022-08-29 16:47     ` Cong Wang
2022-08-29 16:53       ` Eric Dumazet
2022-09-19 17:06         ` Cong Wang
2022-08-30  2:28     ` Yafang Shao
2022-09-19 17:04       ` Cong Wang
