* [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc
@ 2023-09-01  6:21 Abel Wu
  2023-09-01  6:21 ` [RFC PATCH net-next 1/3] sock: Code cleanup on __sk_mem_raise_allocated() Abel Wu
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Abel Wu @ 2023-09-01  6:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Kefeng Wang, Abel Wu, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing
  Cc: open list, open list:NETWORKING [GENERAL], open list:MEMORY MANAGEMENT

As a cloud service provider, we encountered a problem in our production
environment during the transition from cgroup v1 to v2 (partly due to the
heavy tax of accounting socket memory in v1). Say a workload behaves fine
in cgroup v1 with its memcg limit configured to 10GB memory plus another
1GB tcpmem, but performs poorly (or even gets OOM-killed) in v2 with 11GB
memory due to bursty socket memory usage, since there is no specific limit
for socket memory in cgroup v2, which relies largely on the workloads doing
traffic control themselves.

It's rational for the workloads to build some traffic control to better
utilize the resources they bought, but from the kernel's point of view it's
also reasonable to suppress the allocation of socket memory once there is
a shortage of free memory, given that performance degradation is usually
better than failure.

This patchset aims to be more conservative on alloc for pressure-aware
sockets under global and/or memcg pressure, to avoid further memstall or
possibly OOM in such cases. The patchset includes:

  1/3: simple code cleanup, no functional change intended.
  2/3: record memcg pressure level to enable fine-grained control.
  3/3: throttle alloc for pressure-aware sockets under pressure.

The whole patchset focuses on the pressure-aware protocols, and should
have little or no impact on pressure-unaware protocols such as UDP.
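
For reference, a protocol is "pressure-aware" here when it defines a
memory_pressure flag in its struct proto; the check looks roughly like
the include/net/sock.h helper below (shown for illustration only, not
part of this series):

	static inline bool sk_has_memory_pressure(const struct sock *sk)
	{
		/* e.g. tcp_prot sets .memory_pressure, udp_prot leaves it NULL */
		return sk->sk_prot->memory_pressure != NULL;
	}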

Tested on an Intel Xeon(R) Platinum 8260, a dual-socket machine containing
2 NUMA nodes, each of which has 24C/48T. All the benchmarks are done inside
a separate memcg on a clean host.

  baseline:	net-next c639a708a0b8
  compare:	baseline + patchset

case            	load    	baseline(std%)	compare%( std%)
tbench-loopback        	thread-24	 1.00 (  0.50)	 -0.98 (  0.87)
tbench-loopback        	thread-48	 1.00 (  0.76)	 -0.29 (  0.92)
tbench-loopback        	thread-72	 1.00 (  0.75)	 +1.51 (  0.14)
tbench-loopback        	thread-96	 1.00 (  4.11)	 +1.29 (  3.73)
tbench-loopback        	thread-192	 1.00 (  3.52)	 +1.44 (  3.30)
TCP_RR          	thread-24	 1.00 (  1.87)	 -0.87 (  2.40)
TCP_RR          	thread-48	 1.00 (  0.92)	 -0.22 (  1.61)
TCP_RR          	thread-72	 1.00 (  2.35)	 +2.42 (  2.27)
TCP_RR          	thread-96	 1.00 (  2.66)	 -1.37 (  3.02)
TCP_RR          	thread-192	 1.00 ( 13.25)	 +0.29 ( 11.80)
TCP_STREAM      	thread-24	 1.00 (  1.26)	 -0.75 (  0.87)
TCP_STREAM      	thread-48	 1.00 (  0.29)	 -1.55 (  0.14)
TCP_STREAM      	thread-72	 1.00 (  0.05)	 -1.59 (  0.05)
TCP_STREAM      	thread-96	 1.00 (  0.19)	 -0.06 (  0.29)
TCP_STREAM      	thread-192	 1.00 (  0.23)	 -0.01 (  0.28)
UDP_RR          	thread-24	 1.00 (  2.27)	 +0.33 (  2.82)
UDP_RR          	thread-48	 1.00 (  1.25)	 -0.30 (  1.21)
UDP_RR          	thread-72	 1.00 (  2.54)	 +2.99 (  2.34)
UDP_RR          	thread-96	 1.00 (  4.76)	 +2.49 (  2.19)
UDP_RR          	thread-192	 1.00 ( 14.43)	 -0.02 ( 12.98)
UDP_STREAM      	thread-24	 1.00 (107.41)	 -0.48 (106.93)
UDP_STREAM      	thread-48	 1.00 (100.85)	 +1.38 (100.59)
UDP_STREAM      	thread-72	 1.00 (103.43)	 +1.40 (103.48)
UDP_STREAM      	thread-96	 1.00 ( 99.91)	 -0.25 (100.06)
UDP_STREAM      	thread-192	 1.00 (109.83)	 -3.67 (104.12)

As patch 3 moves the traversal of the cgroup hierarchy earlier for
pressure-aware protocols, which could turn a conditional overhead into a
constant one, tests running inside 5-level-deep cgroups are also performed.

case            	load    	baseline(std%)	compare%( std%)
tbench-loopback        	thread-24	 1.00 (  0.59)	 +0.68 (  0.09)
tbench-loopback        	thread-48	 1.00 (  0.16)	 +0.01 (  0.26)
tbench-loopback        	thread-72	 1.00 (  0.34)	 -0.67 (  0.48)
tbench-loopback        	thread-96	 1.00 (  4.40)	 -3.27 (  4.84)
tbench-loopback        	thread-192	 1.00 (  0.49)	 -1.07 (  1.18)
TCP_RR          	thread-24	 1.00 (  2.40)	 -0.34 (  2.49)
TCP_RR          	thread-48	 1.00 (  1.62)	 -0.48 (  1.35)
TCP_RR          	thread-72	 1.00 (  1.26)	 +0.46 (  0.95)
TCP_RR          	thread-96	 1.00 (  2.98)	 +0.13 (  2.64)
TCP_RR          	thread-192	 1.00 ( 13.75)	 -0.20 ( 15.42)
TCP_STREAM      	thread-24	 1.00 (  0.21)	 +0.68 (  1.02)
TCP_STREAM      	thread-48	 1.00 (  0.20)	 -1.41 (  0.01)
TCP_STREAM      	thread-72	 1.00 (  0.09)	 -1.23 (  0.19)
TCP_STREAM      	thread-96	 1.00 (  0.01)	 +0.01 (  0.01)
TCP_STREAM      	thread-192	 1.00 (  0.20)	 -0.02 (  0.25)
UDP_RR          	thread-24	 1.00 (  2.20)	 +0.84 ( 17.45)
UDP_RR          	thread-48	 1.00 (  1.34)	 -0.73 (  1.12)
UDP_RR          	thread-72	 1.00 (  2.32)	 +0.49 (  2.11)
UDP_RR          	thread-96	 1.00 (  2.36)	 +0.53 (  2.42)
UDP_RR          	thread-192	 1.00 ( 16.34)	 -0.67 ( 14.06)
UDP_STREAM      	thread-24	 1.00 (106.55)	 -0.70 (107.13)
UDP_STREAM      	thread-48	 1.00 (105.11)	 +1.60 (103.48)
UDP_STREAM      	thread-72	 1.00 (100.60)	 +1.98 (101.13)
UDP_STREAM      	thread-96	 1.00 ( 99.91)	 +2.59 (101.04)
UDP_STREAM      	thread-192	 1.00 (135.39)	 -2.51 (108.00)

As expected, no obvious performance gain or loss is observed. As for the
issue we encountered, this patchset provides better worst-case behavior
in that such OOM cases are reduced to some extent, while further fine-
grained traffic control is something the workloads still need to handle.

Comments are welcome! Thanks!

Abel Wu (3):
  sock: Code cleanup on __sk_mem_raise_allocated()
  net-memcg: Record pressure level when under pressure
  sock: Throttle pressure-aware sockets under pressure

 include/linux/memcontrol.h | 39 +++++++++++++++++++++++++----
 include/net/sock.h         |  2 +-
 include/net/tcp.h          |  2 +-
 mm/vmpressure.c            |  9 ++++++-
 net/core/sock.c            | 51 +++++++++++++++++++++++++++++---------
 5 files changed, 83 insertions(+), 20 deletions(-)

-- 
2.37.3



* [RFC PATCH net-next 1/3] sock: Code cleanup on __sk_mem_raise_allocated()
  2023-09-01  6:21 [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
@ 2023-09-01  6:21 ` Abel Wu
  2023-09-14  5:47   ` Shakeel Butt
  2023-09-01  6:21 ` [RFC PATCH net-next 2/3] net-memcg: Record pressure level when under pressure Abel Wu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Abel Wu @ 2023-09-01  6:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Abel Wu, Yafang Shao, Kefeng Wang, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing
  Cc: open list, open list:NETWORKING [GENERAL], open list:MEMORY MANAGEMENT

Code cleanup for both simplicity and readability.
No functional change intended.

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 net/core/sock.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 666a17cab4f5..af778fc60a4d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3040,17 +3040,19 @@ EXPORT_SYMBOL(sk_wait_data);
  */
 int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
-	bool memcg_charge = mem_cgroup_sockets_enabled && sk->sk_memcg;
+	struct mem_cgroup *memcg = mem_cgroup_sockets_enabled ? sk->sk_memcg : NULL;
 	struct proto *prot = sk->sk_prot;
-	bool charged = true;
+	bool charged = false;
 	long allocated;
 
 	sk_memory_allocated_add(sk, amt);
 	allocated = sk_memory_allocated(sk);
-	if (memcg_charge &&
-	    !(charged = mem_cgroup_charge_skmem(sk->sk_memcg, amt,
-						gfp_memcg_charge())))
-		goto suppress_allocation;
+
+	if (memcg) {
+		if (!mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge()))
+			goto suppress_allocation;
+		charged = true;
+	}
 
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
@@ -3105,8 +3107,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 		 */
 		if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) {
 			/* Force charge with __GFP_NOFAIL */
-			if (memcg_charge && !charged) {
-				mem_cgroup_charge_skmem(sk->sk_memcg, amt,
+			if (memcg && !charged) {
+				mem_cgroup_charge_skmem(memcg, amt,
 					gfp_memcg_charge() | __GFP_NOFAIL);
 			}
 			return 1;
@@ -3118,8 +3120,8 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 
 	sk_memory_allocated_sub(sk, amt);
 
-	if (memcg_charge && charged)
-		mem_cgroup_uncharge_skmem(sk->sk_memcg, amt);
+	if (charged)
+		mem_cgroup_uncharge_skmem(memcg, amt);
 
 	return 0;
 }
-- 
2.37.3



* [RFC PATCH net-next 2/3] net-memcg: Record pressure level when under pressure
  2023-09-01  6:21 [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
  2023-09-01  6:21 ` [RFC PATCH net-next 1/3] sock: Code cleanup on __sk_mem_raise_allocated() Abel Wu
@ 2023-09-01  6:21 ` Abel Wu
  2023-09-01  6:21 ` [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets " Abel Wu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Abel Wu @ 2023-09-01  6:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Abel Wu, Yafang Shao, Kefeng Wang, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing
  Cc: open list, open list:NETWORKING [GENERAL], open list:MEMORY MANAGEMENT

For now memcg->socket_pressure is used to judge whether there
is memory reclaim pressure in this memcg. As different reclaim
efficiencies require different strategies, recording the level of
pressure would help enable fine-grained control inside networking,
where performance matters a lot.

The vmpressure infrastructure classifies pressure into 3 levels:
low, medium and critical. It would be too conservative to constrain
socket memory usage at the "low" level, so only the other two are
taken into consideration, and the least significant bit of
socket_pressure is enough to record this information.
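
For illustration, the resulting encoding/decoding of the pressure level
in the least significant bit works roughly as sketched below (the real
changes are in the diff that follows):

	/* writer side (vmpressure): socket_pressure holds a jiffies-based
	 * expiry, so its lowest bit is free to carry the level
	 */
	unsigned long expire = jiffies + HZ;

	if (level == VMPRESSURE_CRITICAL)
		expire |= 1UL;			/* critical */
	else
		expire &= ~1UL;			/* medium */
	WRITE_ONCE(memcg->socket_pressure, expire);

	/* reader side (networking) */
	unsigned long val = READ_ONCE(memcg->socket_pressure);
	bool under_pressure = time_before(jiffies, val);
	bool critical = under_pressure && (val & 1);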

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 include/linux/memcontrol.h | 39 +++++++++++++++++++++++++++++++++-----
 include/net/sock.h         |  2 +-
 include/net/tcp.h          |  2 +-
 mm/vmpressure.c            |  9 ++++++++-
 4 files changed, 44 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dbf26bc89dd4..a24047bf7722 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -288,6 +288,9 @@ struct mem_cgroup {
 	 * Hint of reclaim pressure for socket memroy management. Note
 	 * that this indicator should NOT be used in legacy cgroup mode
 	 * where socket memory is accounted/charged separately.
+	 *
+	 * The least significant bit is used for indicating the level of
+	 * pressure, 1 for 'critical' and 0 otherwise.
 	 */
 	unsigned long		socket_pressure;
 
@@ -1730,15 +1733,40 @@ extern struct static_key_false memcg_sockets_enabled_key;
 #define mem_cgroup_sockets_enabled static_branch_unlikely(&memcg_sockets_enabled_key)
 void mem_cgroup_sk_alloc(struct sock *sk);
 void mem_cgroup_sk_free(struct sock *sk);
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
+
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg,
+						    bool *critical)
 {
+	bool under_pressure = false;
+
+	/*
+	 * When cgroup is in legacy mode where tcpmem is separately
+	 * charged, we have no idea about memcg reclaim pressure from
+	 * here and actually no need to, so just ignore the pressure
+	 * level info.
+	 */
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		return !!memcg->tcpmem_pressure;
+
+	if (critical)
+		*critical = false;
+
 	do {
-		if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
-			return true;
+		unsigned long expire = READ_ONCE(memcg->socket_pressure);
+
+		if (time_before(jiffies, expire)) {
+			if (!under_pressure)
+				under_pressure = true;
+			if (!critical)
+				break;
+			if (expire & 1) {
+				*critical = true;
+				break;
+			}
+		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
-	return false;
+
+	return under_pressure;
 }
 
 int alloc_shrinker_info(struct mem_cgroup *memcg);
@@ -1749,7 +1777,8 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg);
 #define mem_cgroup_sockets_enabled 0
 static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
 static inline void mem_cgroup_sk_free(struct sock *sk) { };
-static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
+static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg,
+						    bool *critical)
 {
 	return false;
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index 11d503417591..079bbee5c400 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1434,7 +1434,7 @@ static inline bool sk_under_memory_pressure(const struct sock *sk)
 		return false;
 
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
+	    mem_cgroup_under_socket_pressure(sk->sk_memcg, NULL))
 		return true;
 
 	return !!READ_ONCE(*sk->sk_prot->memory_pressure);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 07b21d9a9620..81e1f9c90a94 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -261,7 +261,7 @@ extern unsigned long tcp_memory_pressure;
 static inline bool tcp_under_memory_pressure(const struct sock *sk)
 {
 	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
-	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
+	    mem_cgroup_under_socket_pressure(sk->sk_memcg, NULL))
 		return true;
 
 	return READ_ONCE(tcp_memory_pressure);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 22c6689d9302..5a3ac3768c0f 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -308,6 +308,13 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 		level = vmpressure_calc_level(scanned, reclaimed);
 
 		if (level > VMPRESSURE_LOW) {
+			unsigned long expire = jiffies + HZ;
+
+			if (level == VMPRESSURE_CRITICAL)
+				expire |=  1UL;
+			else
+				expire &= ~1UL;
+
 			/*
 			 * Let the socket buffer allocator know that
 			 * we are having trouble reclaiming LRU pages.
@@ -316,7 +323,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree,
 			 * asserted for a second in which subsequent
 			 * pressure events can occur.
 			 */
-			WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
+			WRITE_ONCE(memcg->socket_pressure, expire);
 		}
 	}
 }
-- 
2.37.3



* [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets under pressure
  2023-09-01  6:21 [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
  2023-09-01  6:21 ` [RFC PATCH net-next 1/3] sock: Code cleanup on __sk_mem_raise_allocated() Abel Wu
  2023-09-01  6:21 ` [RFC PATCH net-next 2/3] net-memcg: Record pressure level when under pressure Abel Wu
@ 2023-09-01  6:21 ` Abel Wu
  2023-09-01 13:59   ` Simon Horman
  2023-09-18  7:48   ` Abel Wu
  2023-09-08  7:55 ` [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
  2023-09-14 21:20 ` Shakeel Butt
  4 siblings, 2 replies; 14+ messages in thread
From: Abel Wu @ 2023-09-01  6:21 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Kefeng Wang, Abel Wu, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing
  Cc: open list, open list:NETWORKING [GENERAL], open list:MEMORY MANAGEMENT

A socket is pressure-aware when its protocol has memory pressure defined,
that is, when sk_has_memory_pressure(sk) returns true, e.g. TCP. These
protocols might want to limit the usage of socket memory depending on the
state of both global & memcg pressure through sk_under_memory_pressure(sk).

However, on allocation, memcg pressure is simply ignored when usage is
under the global limit (sysctl_mem[0]). This behavior has different impacts
on the two cgroup modes. In cgroup v2, socket memory and other usage share
the same memory limit, so allowing sockmem to burst under memcg reclaim
pressure could lead to longer stalls, sometimes even OOM, while cgroup v1
has no such worry.

As a cloud service provider, we encountered a problem in our production
environment during the transition from cgroup v1 to v2 (partly due to the
heavy tax of accounting socket memory in v1). Say a workload behaves fine
in cgroup v1 with its memcg limit configured to 10GB memory plus another
1GB tcpmem, but performs poorly (or even gets OOM-killed) in v2 with 11GB
memory due to bursty socket memory usage, since there is no specific limit
for socket memory in cgroup v2, which relies largely on the workloads doing
traffic control themselves.

It's rational for the workloads to build some traffic control to better
utilize the resources they bought, but from the kernel's point of view it's
also reasonable to suppress the allocation of socket memory once there is
a shortage of free memory, given that performance degradation is better
than failure.

As per the above, this patch aims to be more conservative on allocation
for pressure-aware sockets under global and/or memcg pressure. OTOH,
throttling incoming traffic could hurt latency badly, possibly due to
SACKed segments getting dropped from the OFO queue; see the related commit
720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
This patch preserves that decision by throttling RX allocation only at the
'critical' pressure level, when it hardly makes sense to continue receiving
data.

No functional change intended for pressure-unaware protocols.
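
Put differently, the intended policy can be sketched as below (a rough
equivalent of the hunk in the diff, where in_softirq() serves as a proxy
for the RX path):

	bool critical = false;

	if (sk_has_memory_pressure(sk) &&
	    mem_cgroup_under_socket_pressure(memcg, &critical)) {
		if (!in_softirq())	/* TX: throttle on any memcg pressure */
			under_memcg_pressure = true;
		else if (critical)	/* RX: throttle only at 'critical' */
			under_memcg_pressure = true;
	}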

Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
 net/core/sock.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index af778fc60a4d..6c1d13547f1b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3041,6 +3041,7 @@ EXPORT_SYMBOL(sk_wait_data);
 int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 {
 	struct mem_cgroup *memcg = mem_cgroup_sockets_enabled ? sk->sk_memcg : NULL;
+	bool under_memcg_pressure = false;
 	struct proto *prot = sk->sk_prot;
 	bool charged = false;
 	long allocated;
@@ -3051,13 +3052,25 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	if (memcg) {
 		if (!mem_cgroup_charge_skmem(memcg, amt, gfp_memcg_charge()))
 			goto suppress_allocation;
+
+		/* Get pressure info from net-memcg. But consider the memcg
+		 * to be under pressure for incoming traffic iff at 'critical'
+		 * level, see commit 720ca52bcef22 ("net-memcg: avoid stalls
+		 * when under memory pressure").
+		 */
+		if (sk_has_memory_pressure(sk) &&
+		    mem_cgroup_under_socket_pressure(memcg, &under_memcg_pressure) &&
+		    !in_softirq())
+			under_memcg_pressure = true;
+
 		charged = true;
 	}
 
 	/* Under limit. */
 	if (allocated <= sk_prot_mem_limits(sk, 0)) {
 		sk_leave_memory_pressure(sk);
-		return 1;
+		if (!under_memcg_pressure)
+			return 1;
 	}
 
 	/* Under pressure. */
@@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
 	if (sk_has_memory_pressure(sk)) {
 		u64 alloc;
 
-		if (!sk_under_memory_pressure(sk))
+		/* Be more conservative if the socket's memcg (or its
+		 * parents) is under reclaim pressure, try to possibly
+		 * avoid further memstall.
+		 */
+		if (under_memcg_pressure)
+			goto suppress_allocation;
+
+		if (!sk_under_global_memory_pressure(sk))
 			return 1;
+
+		/* Trying to be fair among all the sockets of same
+		 * protocal under global memory pressure, by allowing
+		 * the ones that under average usage to raise.
+		 */
 		alloc = sk_sockets_allocated_read_positive(sk);
 		if (sk_prot_mem_limits(sk, 2) > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
-- 
2.37.3



* Re: [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets under pressure
  2023-09-01  6:21 ` [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets " Abel Wu
@ 2023-09-01 13:59   ` Simon Horman
  2023-09-03  4:54     ` Abel Wu
  2023-09-18  7:48   ` Abel Wu
  1 sibling, 1 reply; 14+ messages in thread
From: Simon Horman @ 2023-09-01 13:59 UTC (permalink / raw)
  To: Abel Wu
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Kefeng Wang, Yafang Shao, Kuniyuki Iwashima, Martin KaFai Lau,
	Breno Leitao, Alexander Mikhalitsyn, David Howells, Jason Xing,
	open list, open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On Fri, Sep 01, 2023 at 02:21:28PM +0800, Abel Wu wrote:
> A socket is pressure-aware when its protocol has memory pressure defined,
> that is, when sk_has_memory_pressure(sk) returns true, e.g. TCP. These
> protocols might want to limit the usage of socket memory depending on the
> state of both global & memcg pressure through sk_under_memory_pressure(sk).
> 
> However, on allocation, memcg pressure is simply ignored when usage is
> under the global limit (sysctl_mem[0]). This behavior has different impacts
> on the two cgroup modes. In cgroup v2, socket memory and other usage share
> the same memory limit, so allowing sockmem to burst under memcg reclaim
> pressure could lead to longer stalls, sometimes even OOM, while cgroup v1
> has no such worry.
> 
> As a cloud service provider, we encountered a problem in our production
> environment during the transition from cgroup v1 to v2 (partly due to the
> heavy tax of accounting socket memory in v1). Say a workload behaves fine
> in cgroup v1 with its memcg limit configured to 10GB memory plus another
> 1GB tcpmem, but performs poorly (or even gets OOM-killed) in v2 with 11GB
> memory due to bursty socket memory usage, since there is no specific limit
> for socket memory in cgroup v2, which relies largely on the workloads doing
> traffic control themselves.
> 
> It's rational for the workloads to build some traffic control to better
> utilize the resources they bought, but from the kernel's point of view it's
> also reasonable to suppress the allocation of socket memory once there is
> a shortage of free memory, given that performance degradation is better
> than failure.
> 
> As per the above, this patch aims to be more conservative on allocation
> for pressure-aware sockets under global and/or memcg pressure. OTOH,
> throttling incoming traffic could hurt latency badly, possibly due to
> SACKed segments getting dropped from the OFO queue; see the related commit
> 720ca52bcef22 ("net-memcg: avoid stalls when under memory pressure").
> This patch preserves that decision by throttling RX allocation only at the
> 'critical' pressure level, when it hardly makes sense to continue receiving
> data.
> 
> No functional change intended for pressure-unaware protocols.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>

...

> @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>  	if (sk_has_memory_pressure(sk)) {
>  		u64 alloc;
>  
> -		if (!sk_under_memory_pressure(sk))
> +		/* Be more conservative if the socket's memcg (or its
> +		 * parents) is under reclaim pressure, try to possibly
> +		 * avoid further memstall.
> +		 */
> +		if (under_memcg_pressure)
> +			goto suppress_allocation;
> +
> +		if (!sk_under_global_memory_pressure(sk))
>  			return 1;
> +
> +		/* Trying to be fair among all the sockets of same
> +		 * protocal under global memory pressure, by allowing

nit: checkpatch.pl --codespell says, protocal -> protocol

> +		 * the ones that under average usage to raise.
> +		 */
>  		alloc = sk_sockets_allocated_read_positive(sk);
>  		if (sk_prot_mem_limits(sk, 2) > alloc *
>  		    sk_mem_pages(sk->sk_wmem_queued +
> -- 
> 2.37.3
> 
> 


* Re: Re: [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets under pressure
  2023-09-01 13:59   ` Simon Horman
@ 2023-09-03  4:54     ` Abel Wu
  0 siblings, 0 replies; 14+ messages in thread
From: Abel Wu @ 2023-09-03  4:54 UTC (permalink / raw)
  To: Simon Horman
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Kefeng Wang, Yafang Shao, Kuniyuki Iwashima, Martin KaFai Lau,
	Breno Leitao, Alexander Mikhalitsyn, David Howells, Jason Xing,
	open list, open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

Hi Simon, thanks for reviewing!

On 9/1/23 9:59 PM, Simon Horman wrote:
> On Fri, Sep 01, 2023 at 02:21:28PM +0800, Abel Wu wrote:
>> @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>>   	if (sk_has_memory_pressure(sk)) {
>>   		u64 alloc;
>>   
>> -		if (!sk_under_memory_pressure(sk))
>> +		/* Be more conservative if the socket's memcg (or its
>> +		 * parents) is under reclaim pressure, try to possibly
>> +		 * avoid further memstall.
>> +		 */
>> +		if (under_memcg_pressure)
>> +			goto suppress_allocation;
>> +
>> +		if (!sk_under_global_memory_pressure(sk))
>>   			return 1;
>> +
>> +		/* Trying to be fair among all the sockets of same
>> +		 * protocal under global memory pressure, by allowing
> 
> nit: checkpatch.pl --codespell says, protocal -> protocol

Will fix in next version.

Thanks,
	Abel

> 
>> +		 * the ones that under average usage to raise.
>> +		 */
>>   		alloc = sk_sockets_allocated_read_positive(sk);
>>   		if (sk_prot_mem_limits(sk, 2) > alloc *
>>   		    sk_mem_pages(sk->sk_wmem_queued +
>> -- 
>> 2.37.3
>>
>>


* Re: [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc
  2023-09-01  6:21 [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
                   ` (2 preceding siblings ...)
  2023-09-01  6:21 ` [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets " Abel Wu
@ 2023-09-08  7:55 ` Abel Wu
  2023-09-08 15:42   ` Shakeel Butt
  2023-09-14 21:20 ` Shakeel Butt
  4 siblings, 1 reply; 14+ messages in thread
From: Abel Wu @ 2023-09-08  7:55 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Kefeng Wang, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing
  Cc: open list, open list:NETWORKING [GENERAL], open list:MEMORY MANAGEMENT

Friendly ping :)

On 9/1/23 2:21 PM, Abel Wu wrote:
> As a cloud service provider, we encountered a problem in our production
> environment during the transition from cgroup v1 to v2 (partly due to the
> heavy tax of accounting socket memory in v1). Say a workload behaves fine
> in cgroup v1 with its memcg limit configured to 10GB memory plus another
> 1GB tcpmem, but performs poorly (or even gets OOM-killed) in v2 with 11GB
> memory due to bursty socket memory usage, since there is no specific limit
> for socket memory in cgroup v2, which relies largely on the workloads doing
> traffic control themselves.
> 
> It's rational for the workloads to build some traffic control to better
> utilize the resources they bought, but from the kernel's point of view it's
> also reasonable to suppress the allocation of socket memory once there is
> a shortage of free memory, given that performance degradation is usually
> better than failure.
> 
> This patchset aims to be more conservative on alloc for pressure-aware
> sockets under global and/or memcg pressure, to avoid further memstall or
> possibly OOM in such cases. The patchset includes:
> 
>    1/3: simple code cleanup, no functional change intended.
>    2/3: record memcg pressure level to enable fine-grained control.
>    3/3: throttle alloc for pressure-aware sockets under pressure.
> 
> The whole patchset focuses on the pressure-aware protocols, and should
> have little or no impact on pressure-unaware protocols such as UDP.
> 
> Tested on an Intel Xeon(R) Platinum 8260, a dual-socket machine containing
> 2 NUMA nodes, each of which has 24C/48T. All the benchmarks are done inside
> a separate memcg on a clean host.
> 
>    baseline:	net-next c639a708a0b8
>    compare:	baseline + patchset
> 
> case            	load    	baseline(std%)	compare%( std%)
> tbench-loopback        	thread-24	 1.00 (  0.50)	 -0.98 (  0.87)
> tbench-loopback        	thread-48	 1.00 (  0.76)	 -0.29 (  0.92)
> tbench-loopback        	thread-72	 1.00 (  0.75)	 +1.51 (  0.14)
> tbench-loopback        	thread-96	 1.00 (  4.11)	 +1.29 (  3.73)
> tbench-loopback        	thread-192	 1.00 (  3.52)	 +1.44 (  3.30)
> TCP_RR          	thread-24	 1.00 (  1.87)	 -0.87 (  2.40)
> TCP_RR          	thread-48	 1.00 (  0.92)	 -0.22 (  1.61)
> TCP_RR          	thread-72	 1.00 (  2.35)	 +2.42 (  2.27)
> TCP_RR          	thread-96	 1.00 (  2.66)	 -1.37 (  3.02)
> TCP_RR          	thread-192	 1.00 ( 13.25)	 +0.29 ( 11.80)
> TCP_STREAM      	thread-24	 1.00 (  1.26)	 -0.75 (  0.87)
> TCP_STREAM      	thread-48	 1.00 (  0.29)	 -1.55 (  0.14)
> TCP_STREAM      	thread-72	 1.00 (  0.05)	 -1.59 (  0.05)
> TCP_STREAM      	thread-96	 1.00 (  0.19)	 -0.06 (  0.29)
> TCP_STREAM      	thread-192	 1.00 (  0.23)	 -0.01 (  0.28)
> UDP_RR          	thread-24	 1.00 (  2.27)	 +0.33 (  2.82)
> UDP_RR          	thread-48	 1.00 (  1.25)	 -0.30 (  1.21)
> UDP_RR          	thread-72	 1.00 (  2.54)	 +2.99 (  2.34)
> UDP_RR          	thread-96	 1.00 (  4.76)	 +2.49 (  2.19)
> UDP_RR          	thread-192	 1.00 ( 14.43)	 -0.02 ( 12.98)
> UDP_STREAM      	thread-24	 1.00 (107.41)	 -0.48 (106.93)
> UDP_STREAM      	thread-48	 1.00 (100.85)	 +1.38 (100.59)
> UDP_STREAM      	thread-72	 1.00 (103.43)	 +1.40 (103.48)
> UDP_STREAM      	thread-96	 1.00 ( 99.91)	 -0.25 (100.06)
> UDP_STREAM      	thread-192	 1.00 (109.83)	 -3.67 (104.12)
> 
> As patch 3 moves the traversal of the cgroup hierarchy earlier for
> pressure-aware protocols, which could turn a conditional overhead into a
> constant one, tests running inside 5-level-deep cgroups are also performed.
> 
> case            	load    	baseline(std%)	compare%( std%)
> tbench-loopback        	thread-24	 1.00 (  0.59)	 +0.68 (  0.09)
> tbench-loopback        	thread-48	 1.00 (  0.16)	 +0.01 (  0.26)
> tbench-loopback        	thread-72	 1.00 (  0.34)	 -0.67 (  0.48)
> tbench-loopback        	thread-96	 1.00 (  4.40)	 -3.27 (  4.84)
> tbench-loopback        	thread-192	 1.00 (  0.49)	 -1.07 (  1.18)
> TCP_RR          	thread-24	 1.00 (  2.40)	 -0.34 (  2.49)
> TCP_RR          	thread-48	 1.00 (  1.62)	 -0.48 (  1.35)
> TCP_RR          	thread-72	 1.00 (  1.26)	 +0.46 (  0.95)
> TCP_RR          	thread-96	 1.00 (  2.98)	 +0.13 (  2.64)
> TCP_RR          	thread-192	 1.00 ( 13.75)	 -0.20 ( 15.42)
> TCP_STREAM      	thread-24	 1.00 (  0.21)	 +0.68 (  1.02)
> TCP_STREAM      	thread-48	 1.00 (  0.20)	 -1.41 (  0.01)
> TCP_STREAM      	thread-72	 1.00 (  0.09)	 -1.23 (  0.19)
> TCP_STREAM      	thread-96	 1.00 (  0.01)	 +0.01 (  0.01)
> TCP_STREAM      	thread-192	 1.00 (  0.20)	 -0.02 (  0.25)
> UDP_RR          	thread-24	 1.00 (  2.20)	 +0.84 ( 17.45)
> UDP_RR          	thread-48	 1.00 (  1.34)	 -0.73 (  1.12)
> UDP_RR          	thread-72	 1.00 (  2.32)	 +0.49 (  2.11)
> UDP_RR          	thread-96	 1.00 (  2.36)	 +0.53 (  2.42)
> UDP_RR          	thread-192	 1.00 ( 16.34)	 -0.67 ( 14.06)
> UDP_STREAM      	thread-24	 1.00 (106.55)	 -0.70 (107.13)
> UDP_STREAM      	thread-48	 1.00 (105.11)	 +1.60 (103.48)
> UDP_STREAM      	thread-72	 1.00 (100.60)	 +1.98 (101.13)
> UDP_STREAM      	thread-96	 1.00 ( 99.91)	 +2.59 (101.04)
> UDP_STREAM      	thread-192	 1.00 (135.39)	 -2.51 (108.00)
> 
> As expected, no obvious performance gain or loss is observed. As for the
> issue we encountered, this patchset provides better worst-case behavior
> in that such OOM cases are reduced to some extent, while further fine-
> grained traffic control is something the workloads still need to handle.
> 
> Comments are welcome! Thanks!
> 
> Abel Wu (3):
>    sock: Code cleanup on __sk_mem_raise_allocated()
>    net-memcg: Record pressure level when under pressure
>    sock: Throttle pressure-aware sockets under pressure
> 
>   include/linux/memcontrol.h | 39 +++++++++++++++++++++++++----
>   include/net/sock.h         |  2 +-
>   include/net/tcp.h          |  2 +-
>   mm/vmpressure.c            |  9 ++++++-
>   net/core/sock.c            | 51 +++++++++++++++++++++++++++++---------
>   5 files changed, 83 insertions(+), 20 deletions(-)
> 


* Re: [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc
  2023-09-08  7:55 ` [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
@ 2023-09-08 15:42   ` Shakeel Butt
  2023-09-10  5:09     ` Abel Wu
  0 siblings, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2023-09-08 15:42 UTC (permalink / raw)
  To: Abel Wu
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner,
	Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Kefeng Wang, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing, open list,
	open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On Fri, Sep 8, 2023 at 12:55 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> Friendly ping :)
>

Sorry for the delay, I will get to this over this weekend.


* Re: Re: [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc
  2023-09-08 15:42   ` Shakeel Butt
@ 2023-09-10  5:09     ` Abel Wu
  0 siblings, 0 replies; 14+ messages in thread
From: Abel Wu @ 2023-09-10  5:09 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner,
	Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Kefeng Wang, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing, open list,
	open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On 9/8/23 11:42 PM, Shakeel Butt wrote:
> On Fri, Sep 8, 2023 at 12:55 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>>
>> Friendly ping :)
>>
> 
> Sorry for the delay, I will get to this over this weekend.

Hi Shakeel, I really appreciate your time! Thanks a lot!


* Re: [RFC PATCH net-next 1/3] sock: Code cleanup on __sk_mem_raise_allocated()
  2023-09-01  6:21 ` [RFC PATCH net-next 1/3] sock: Code cleanup on __sk_mem_raise_allocated() Abel Wu
@ 2023-09-14  5:47   ` Shakeel Butt
  0 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2023-09-14  5:47 UTC (permalink / raw)
  To: Abel Wu
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner,
	Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Yafang Shao, Kefeng Wang, Kuniyuki Iwashima, Martin KaFai Lau,
	Breno Leitao, Alexander Mikhalitsyn, David Howells, Jason Xing,
	open list, open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On Fri, Sep 01, 2023 at 02:21:26PM +0800, Abel Wu wrote:
> Code cleanup for both simplicity and readability.
> No functional change intended.
> 
> Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>

This is a simple cleanup and can be included independently.

Acked-by: Shakeel Butt <shakeelb@google.com>


* Re: [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc
  2023-09-01  6:21 [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
                   ` (3 preceding siblings ...)
  2023-09-08  7:55 ` [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc Abel Wu
@ 2023-09-14 21:20 ` Shakeel Butt
  2023-09-15  8:47   ` Abel Wu
  4 siblings, 1 reply; 14+ messages in thread
From: Shakeel Butt @ 2023-09-14 21:20 UTC (permalink / raw)
  To: Abel Wu
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner,
	Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Kefeng Wang, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing, open list,
	open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On Fri, Sep 01, 2023 at 02:21:25PM +0800, Abel Wu wrote:
> 
[...]
> As expected, no obvious performance gain or loss is observed. As for the
> issue we encountered, this patchset provides better worst-case behavior
> in that such OOM cases are reduced to some extent, while further fine-
> grained traffic control is something the workloads still need to handle.
> 

I agree with the motivation but I don't agree with the solution (patches 2
and 3). This is adding one more heuristic to the code, which you yourself
described as only helping to some extent. In addition it adds more
dependency on the vmpressure subsystem, which is in a weird state:
vmpressure is a cgroup v1 feature which the networking subsystem somehow
relies on for cgroup v2 deployments. Moreover, vmpressure acts differently
for workloads with different memory types (mapped, mlocked, kernel memory).

Anyway, have you explored a BPF-based approach? You can induce socket
pressure at the points you care about and define memory pressure however
your use case cares for. You can define memory pressure using PSI or
vmpressure or maybe with MEMCG_HIGH events. What do you think?

thanks,
Shakeel


* Re: Re: [RFC PATCH net-next 0/3] sock: Be aware of memcg pressure on alloc
  2023-09-14 21:20 ` Shakeel Butt
@ 2023-09-15  8:47   ` Abel Wu
  0 siblings, 0 replies; 14+ messages in thread
From: Abel Wu @ 2023-09-15  8:47 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner,
	Yosry Ahmed, Matthew Wilcox (Oracle),
	Yu Zhao, Kefeng Wang, Yafang Shao, Kuniyuki Iwashima,
	Martin KaFai Lau, Breno Leitao, Alexander Mikhalitsyn,
	David Howells, Jason Xing, open list,
	open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On 9/15/23 5:20 AM, Shakeel Butt wrote:
> On Fri, Sep 01, 2023 at 02:21:25PM +0800, Abel Wu wrote:
>>
> [...]
>> As expected, no obvious performance gain or loss is observed. As for the
>> issue we encountered, this patchset provides better worst-case behavior
>> in that such OOM cases are reduced to some extent, while further fine-
>> grained traffic control is something the workloads still need to handle.
>>
> 
> I agree with the motivation but I don't agree with the solution (patches 2
> and 3). This is adding one more heuristic to the code, which you yourself
> described as only helping to some extent. In addition it adds more
> dependency on the vmpressure subsystem, which is in a weird state:
> vmpressure is a cgroup v1 feature which the networking subsystem somehow
> relies on for cgroup v2 deployments. Moreover, vmpressure acts differently
> for workloads with different memory types (mapped, mlocked, kernel memory).

Indeed.

> 
> Anyway, have you explored a BPF-based approach? You can induce socket
> pressure at the points you care about and define memory pressure however
> your use case cares for. You can define memory pressure using PSI or
> vmpressure or maybe with MEMCG_HIGH events. What do you think?

Yeah, this sounds much better. I will re-implement this patchset based
on your suggestion. Thank you for the helpful comments!

Best,
	Abel


* Re: [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets under pressure
  2023-09-01  6:21 ` [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets " Abel Wu
  2023-09-01 13:59   ` Simon Horman
@ 2023-09-18  7:48   ` Abel Wu
  2023-09-18 15:49     ` Shakeel Butt
  1 sibling, 1 reply; 14+ messages in thread
From: Abel Wu @ 2023-09-18  7:48 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Shakeel Butt, Roman Gushchin, Michal Hocko,
	Johannes Weiner, Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Kefeng Wang, Yafang Shao, Kuniyuki Iwashima, Martin KaFai Lau,
	Breno Leitao, Alexander Mikhalitsyn, David Howells, Jason Xing
  Cc: open list, open list:NETWORKING [GENERAL], open list:MEMORY MANAGEMENT

On 9/1/23 2:21 PM, Abel Wu wrote:
> @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
>   	if (sk_has_memory_pressure(sk)) {
>   		u64 alloc;
>   
> -		if (!sk_under_memory_pressure(sk))
> +		/* Be more conservative if the socket's memcg (or its
> +		 * parents) is under reclaim pressure, try to possibly
> +		 * avoid further memstall.
> +		 */
> +		if (under_memcg_pressure)
> +			goto suppress_allocation;
> +
> +		if (!sk_under_global_memory_pressure(sk))
>   			return 1;
> +
> +		/* Trying to be fair among all the sockets of same
> +		 * protocal under global memory pressure, by allowing
> +		 * the ones that under average usage to raise.
> +		 */
>   		alloc = sk_sockets_allocated_read_positive(sk);
>   		if (sk_prot_mem_limits(sk, 2) > alloc *
>   		    sk_mem_pages(sk->sk_wmem_queued +

I totally agree with what Shakeel said in the last reply and will try an
ebpf-based solution to let userspace inject proper strategies. But IMHO
the above hunk is independent of the idea of this patchset, and is the
right thing to do, so maybe it is worth a separate patch?

This hunk originally passes the allocation when this socket is below the
average usage, even under global and/or memcg pressure. It makes sense
to do so under global pressure, as the 'average' is computed in global
scope, but it's really weird from a memcg's point of view. Actually
this pass condition was present before memcg pressure was introduced.
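
For clarity, that pass condition is roughly the per-socket average check
below, just written without the division (a sketch using the same helpers
as the code above):

	u64 nr_socks = sk_sockets_allocated_read_positive(sk);
	int this_sk_pages = sk_mem_pages(sk->sk_wmem_queued +
					 atomic_read(&sk->sk_rmem_alloc) +
					 sk->sk_forward_alloc);

	/* allow raising if this socket uses less than its share of the
	 * hard limit, i.e.
	 *	this_sk_pages < sk_prot_mem_limits(sk, 2) / nr_socks
	 * which the code expresses as
	 *	sk_prot_mem_limits(sk, 2) > nr_socks * this_sk_pages
	 */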

Please correct me if I missed something, thanks!

Best,
	Abel


* Re: [RFC PATCH net-next 3/3] sock: Throttle pressure-aware sockets under pressure
  2023-09-18  7:48   ` Abel Wu
@ 2023-09-18 15:49     ` Shakeel Butt
  0 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2023-09-18 15:49 UTC (permalink / raw)
  To: Abel Wu
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Andrew Morton, Roman Gushchin, Michal Hocko, Johannes Weiner,
	Yosry Ahmed, Yu Zhao, Matthew Wilcox (Oracle),
	Kefeng Wang, Yafang Shao, Kuniyuki Iwashima, Martin KaFai Lau,
	Breno Leitao, Alexander Mikhalitsyn, David Howells, Jason Xing,
	open list, open list:NETWORKING [GENERAL],
	open list:MEMORY MANAGEMENT

On Mon, Sep 18, 2023 at 12:48 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
>
> On 9/1/23 2:21 PM, Abel Wu wrote:
> > @@ -3087,8 +3100,20 @@ int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind)
> >       if (sk_has_memory_pressure(sk)) {
> >               u64 alloc;
> >
> > -             if (!sk_under_memory_pressure(sk))
> > +             /* Be more conservative if the socket's memcg (or its
> > +              * parents) is under reclaim pressure, try to possibly
> > +              * avoid further memstall.
> > +              */
> > +             if (under_memcg_pressure)
> > +                     goto suppress_allocation;
> > +
> > +             if (!sk_under_global_memory_pressure(sk))
> >                       return 1;
> > +
> > +             /* Trying to be fair among all the sockets of same
> > +              * protocal under global memory pressure, by allowing
> > +              * the ones that under average usage to raise.
> > +              */
> >               alloc = sk_sockets_allocated_read_positive(sk);
> >               if (sk_prot_mem_limits(sk, 2) > alloc *
> >                   sk_mem_pages(sk->sk_wmem_queued +
>
> I totally agree with what Shakeel said in the last reply and will try an
> ebpf-based solution to let userspace inject proper strategies. But IMHO
> the above hunk is independent of the idea of this patchset, and is the
> right thing to do, so maybe it is worth a separate patch?
>
> This hunk originally passes the allocation when this socket is below the
> average usage, even under global and/or memcg pressure. It makes sense
> to do so under global pressure, as the 'average' is computed in global
> scope, but it's really weird from a memcg's point of view. Actually
> this pass condition was present before memcg pressure was introduced.
>
> Please correct me if I missed something, thanks!
>

Please send patch 1 and this hunk as separate patches with the
relevant motivation and reasoning.

thanks,
Shakeel
