netdev.vger.kernel.org archive mirror
* [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
@ 2024-02-26  2:24 Adam Li
  2024-02-26 16:01 ` Lameter, Christopher
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Adam Li @ 2024-02-26  2:24 UTC (permalink / raw)
  To: corbet, davem, edumazet, kuba, pabeni
  Cc: willemb, yangtiezhu, atenart, kuniyu, wuyun.abel, leitao,
	alexander, dhowells, paulmck, joel.granados, urezki, joel,
	linux-doc, linux-kernel, netdev, patches, cl, shijie, Adam Li

This patch adds the /proc/sys/net/core/mem_pcpu_rsv sysctl file
to make SK_MEMORY_PCPU_RESERVE tunable.

Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
memory_allocated") introduced per-cpu forward alloc cache:

"Implement a per-cpu cache of +1/-1 MB, to reduce number
of changes to sk->sk_prot->memory_allocated, which
would otherwise be cause of false sharing."

sk_prot->memory_allocated points to global atomic variable:
atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;

By increasing the per-cpu cache size from 1MB to e.g. 16MB,
changes to sk->sk_prot->memory_allocated can be reduced further.
Performance may be improved on systems with many cores.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
---
 Documentation/admin-guide/sysctl/net.rst | 5 +++++
 include/net/sock.h                       | 5 +++--
 net/core/sock.c                          | 1 +
 net/core/sysctl_net_core.c               | 9 +++++++++
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 396091651955..7250c0542828 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -206,6 +206,11 @@ Will increase power usage.
 
 Default: 0 (off)
 
+mem_pcpu_rsv
+------------
+
+Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
+
 rmem_default
 ------------
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 796a902cf4c1..09a0cde8bf52 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1443,6 +1443,7 @@ sk_memory_allocated(const struct sock *sk)
 
 /* 1 MB per cpu, in page units */
 #define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT))
+extern int sysctl_mem_pcpu_rsv;
 
 static inline void
 sk_memory_allocated_add(struct sock *sk, int amt)
@@ -1451,7 +1452,7 @@ sk_memory_allocated_add(struct sock *sk, int amt)
 
 	preempt_disable();
 	local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
-	if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
+	if (local_reserve >= READ_ONCE(sysctl_mem_pcpu_rsv)) {
 		__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
 		atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
 	}
@@ -1465,7 +1466,7 @@ sk_memory_allocated_sub(struct sock *sk, int amt)
 
 	preempt_disable();
 	local_reserve = __this_cpu_sub_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
-	if (local_reserve <= -SK_MEMORY_PCPU_RESERVE) {
+	if (local_reserve <= -READ_ONCE(sysctl_mem_pcpu_rsv)) {
 		__this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
 		atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
 	}
diff --git a/net/core/sock.c b/net/core/sock.c
index 8d86886e39fa..df2ac54a8f74 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -283,6 +283,7 @@ __u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
 EXPORT_SYMBOL(sysctl_rmem_max);
 __u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
 __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
+int sysctl_mem_pcpu_rsv __read_mostly = SK_MEMORY_PCPU_RESERVE;
 
 int sysctl_tstamp_allow_data __read_mostly = 1;
 
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 0f0cb1465e08..986f15e5d6c4 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -30,6 +30,7 @@ static int int_3600 = 3600;
 static int min_sndbuf = SOCK_MIN_SNDBUF;
 static int min_rcvbuf = SOCK_MIN_RCVBUF;
 static int max_skb_frags = MAX_SKB_FRAGS;
+static int min_mem_pcpu_rsv = SK_MEMORY_PCPU_RESERVE;
 
 static int net_msg_warn;	/* Unused, but still a sysctl */
 
@@ -407,6 +408,14 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &min_rcvbuf,
 	},
+	{
+		.procname	= "mem_pcpu_rsv",
+		.data		= &sysctl_mem_pcpu_rsv,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_mem_pcpu_rsv,
+	},
 	{
 		.procname	= "dev_weight",
 		.data		= &weight_p,
-- 
2.25.1
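
For reference, a minimal userspace sketch of exercising the sysctl added
by this patch. This is only a sketch: the value is in page units, 4096
pages is 16MB with 4KB pages, writing it needs root, and
"sysctl -w net.core.mem_pcpu_rsv=4096" does the same thing.

/* Read the current per-cpu reserve, then raise it to 16MB per cpu. */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/net/core/mem_pcpu_rsv";
	FILE *f = fopen(path, "r");
	int pages;

	if (!f || fscanf(f, "%d", &pages) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);
	printf("current per-cpu reserve: %d pages\n", pages);

	/* 4096 pages == 16MB per cpu, assuming 4KB pages. */
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%d\n", 4096);
	fclose(f);
	return 0;
}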



* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-26  2:24 [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable Adam Li
@ 2024-02-26 16:01 ` Lameter, Christopher
  2024-02-27 20:38 ` Eric Dumazet
  2024-02-28  9:30 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 11+ messages in thread
From: Lameter, Christopher @ 2024-02-26 16:01 UTC (permalink / raw)
  To: Adam Li
  Cc: corbet, davem, edumazet, kuba, pabeni, willemb, yangtiezhu,
	atenart, kuniyu, wuyun.abel, leitao, alexander, dhowells,
	paulmck, joel.granados, urezki, joel, linux-doc, linux-kernel,
	netdev, patches, shijie

Looks good to me. What may be done in an additional patch is to set the 
tunable automatically higher on machines with high core counts.
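
A rough sketch of how such an automatic default could be derived,
modelled in userspace so it can be compiled and run. The scaling steps
and the 1% of RAM cap below are assumptions of mine, not anything
proposed in this thread.

/* Illustrative policy: scale the 1MB baseline with the core count,
 * but keep the sum of all per-cpu caches under ~1% of total RAM.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long cpus = sysconf(_SC_NPROCESSORS_ONLN);
	long page_size = sysconf(_SC_PAGESIZE);
	long long ram = (long long)sysconf(_SC_PHYS_PAGES) * page_size;
	long long rsv = 1 << 20;		/* 1MB baseline per cpu */

	if (cpus < 1)
		cpus = 1;
	if (cpus >= 128)
		rsv *= 16;			/* larger batches on big boxes */
	else if (cpus >= 32)
		rsv *= 4;

	/* Cap: all per-cpu caches together stay under ~1% of RAM. */
	if (rsv > ram / 100 / cpus)
		rsv = ram / 100 / cpus;

	printf("suggested mem_pcpu_rsv: %lld pages (%lld KB) per cpu\n",
	       rsv / page_size, rsv >> 10);
	return 0;
}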

Reviewed-by: Christoph Lameter (Ampere) <cl@linux.com>



* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-26  2:24 [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable Adam Li
  2024-02-26 16:01 ` Lameter, Christopher
@ 2024-02-27 20:38 ` Eric Dumazet
  2024-02-27 23:08   ` Lameter, Christopher
  2024-02-28 13:23   ` Adam Li
  2024-02-28  9:30 ` patchwork-bot+netdevbpf
  2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2024-02-27 20:38 UTC (permalink / raw)
  To: Adam Li
  Cc: corbet, davem, kuba, pabeni, willemb, yangtiezhu, atenart,
	kuniyu, wuyun.abel, leitao, alexander, dhowells, paulmck,
	joel.granados, urezki, joel, linux-doc, linux-kernel, netdev,
	patches, cl, shijie

On Mon, Feb 26, 2024 at 3:25 AM Adam Li <adamli@os.amperecomputing.com> wrote:
>
> This patch adds /proc/sys/net/core/mem_pcpu_rsv sysctl file,
> to make SK_MEMORY_PCPU_RESERV tunable.
>
> Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
> memory_allocated") introduced per-cpu forward alloc cache:
>
> "Implement a per-cpu cache of +1/-1 MB, to reduce number
> of changes to sk->sk_prot->memory_allocated, which
> would otherwise be cause of false sharing."
>
> sk_prot->memory_allocated points to global atomic variable:
> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
>
> If increasing the per-cpu cache size from 1MB to e.g. 16MB,
> changes to sk->sk_prot->memory_allocated can be further reduced.
> Performance may be improved on system with many cores.

This looks good, do you have any performance numbers to share ?

On a host with 384 threads, 384*16 ->  6 GB of memory.

With this kind of use, we might need a shrinker...
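
The 6 GB figure is simply the per-cpu reserve multiplied by the CPU
count; a one-liner to make the worst case explicit, using the sizes
from the message above:

/* Worst case: every cpu holds a full 16MB reserve in its local cache. */
#include <stdio.h>

int main(void)
{
	unsigned long long rsv = 16ULL << 20;	/* 16MB per cpu */
	unsigned int cpus = 384;

	/* 384 * 16MB = 6144MB = 6GB never reflected in memory_allocated. */
	printf("worst-case cached memory: %llu MB\n", (rsv * cpus) >> 20);
	return 0;
}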


* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-27 20:38 ` Eric Dumazet
@ 2024-02-27 23:08   ` Lameter, Christopher
  2024-02-28  1:13     ` Jakub Kicinski
  2024-02-28  9:32     ` Eric Dumazet
  2024-02-28 13:23   ` Adam Li
  1 sibling, 2 replies; 11+ messages in thread
From: Lameter, Christopher @ 2024-02-27 23:08 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Adam Li, corbet, davem, kuba, pabeni, willemb, yangtiezhu,
	atenart, kuniyu, wuyun.abel, leitao, alexander, dhowells,
	paulmck, joel.granados, urezki, joel, linux-doc, linux-kernel,
	netdev, patches, shijie

On Tue, 27 Feb 2024, Eric Dumazet wrote:

>> sk_prot->memory_allocated points to global atomic variable:
>> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
>>
>> If increasing the per-cpu cache size from 1MB to e.g. 16MB,
>> changes to sk->sk_prot->memory_allocated can be further reduced.
>> Performance may be improved on system with many cores.
>
> This looks good, do you have any performance numbers to share ?
>
> On a host with 384 threads, 384*16 ->  6 GB of memory.

Those things also come with corresponding memories of a couple of TB...

> With this kind of use, we might need a shrinker...

Yes. No point in keeping the buffers around if the core stops doing 
networking. But it should be done at times when there is no contention, please. 
Isn't there already something like a timeout for skbs in the network stack?





* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-27 23:08   ` Lameter, Christopher
@ 2024-02-28  1:13     ` Jakub Kicinski
  2024-02-28  9:32     ` Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Jakub Kicinski @ 2024-02-28  1:13 UTC (permalink / raw)
  To: Lameter, Christopher
  Cc: Eric Dumazet, Adam Li, corbet, davem, pabeni, willemb,
	yangtiezhu, atenart, kuniyu, wuyun.abel, leitao, alexander,
	dhowells, paulmck, joel.granados, urezki, joel, linux-doc,
	linux-kernel, netdev, patches, shijie

On Tue, 27 Feb 2024 15:08:18 -0800 (PST) Lameter, Christopher wrote:
> > This looks good, do you have any performance numbers to share ?
> >
> > On a host with 384 threads, 384*16 ->  6 GB of memory.  
> 
> Those things also come with corresponding memories of a couple of TB...

We have a lot of machines at Meta with more cores than gigabytes of
memory. Keying on the amount of memory would make sense. Something like
max(1MB, sk_mem / cores / 8) comes to mind?

In fact it may be a better idea to have the sysctl control the divisor
(the 8 in my example above). I had issues in the past with people
"micro-optimizing" the absolute size, forgetting about it, moving the
workload to a larger machine and then complaining that TCP is choking :(
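
A sketch of the sizing rule Jakub describes, modelled in userspace.
Here "sk_mem" is taken to mean the protocol memory budget (e.g.
tcp_mem); the 96GB budget, 384 CPUs and divisor of 8 are made-up
example numbers.

/* per_cpu_rsv = max(1MB, memory_budget / nr_cpus / divisor), with the
 * sysctl controlling 'divisor' rather than an absolute size.
 */
#include <stdio.h>

static unsigned long long pcpu_rsv_bytes(unsigned long long budget,
					 unsigned int nr_cpus,
					 unsigned int divisor)
{
	unsigned long long rsv = budget / nr_cpus / divisor;
	unsigned long long min_rsv = 1ULL << 20;	/* 1MB floor */

	return rsv > min_rsv ? rsv : min_rsv;
}

int main(void)
{
	/* Example: 96GB memory budget, 384 CPUs, divisor 8 -> 32MB. */
	unsigned long long rsv = pcpu_rsv_bytes(96ULL << 30, 384, 8);

	printf("per-cpu reserve: %llu MB\n", rsv >> 20);
	return 0;
}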


* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-26  2:24 [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable Adam Li
  2024-02-26 16:01 ` Lameter, Christopher
  2024-02-27 20:38 ` Eric Dumazet
@ 2024-02-28  9:30 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 11+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-02-28  9:30 UTC (permalink / raw)
  To: Adam Li
  Cc: corbet, davem, edumazet, kuba, pabeni, willemb, yangtiezhu,
	atenart, kuniyu, wuyun.abel, leitao, alexander, dhowells,
	paulmck, joel.granados, urezki, joel, linux-doc, linux-kernel,
	netdev, patches, cl, shijie

Hello:

This patch was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Mon, 26 Feb 2024 02:24:52 +0000 you wrote:
> This patch adds /proc/sys/net/core/mem_pcpu_rsv sysctl file,
> to make SK_MEMORY_PCPU_RESERV tunable.
> 
> Commit 3cd3399dd7a8 ("net: implement per-cpu reserves for
> memory_allocated") introduced per-cpu forward alloc cache:
> 
> "Implement a per-cpu cache of +1/-1 MB, to reduce number
> of changes to sk->sk_prot->memory_allocated, which
> would otherwise be cause of false sharing."
> 
> [...]

Here is the summary with links:
  - net: make SK_MEMORY_PCPU_RESERV tunable
    https://git.kernel.org/netdev/net-next/c/12a686c2e761

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-27 23:08   ` Lameter, Christopher
  2024-02-28  1:13     ` Jakub Kicinski
@ 2024-02-28  9:32     ` Eric Dumazet
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2024-02-28  9:32 UTC (permalink / raw)
  To: Lameter, Christopher
  Cc: Adam Li, corbet, davem, kuba, pabeni, willemb, yangtiezhu,
	atenart, kuniyu, wuyun.abel, leitao, alexander, dhowells,
	paulmck, joel.granados, urezki, joel, linux-doc, linux-kernel,
	netdev, patches, shijie

On Wed, Feb 28, 2024 at 12:08 AM Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Tue, 27 Feb 2024, Eric Dumazet wrote:
>
> >> sk_prot->memory_allocated points to global atomic variable:
> >> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
> >>
> >> If increasing the per-cpu cache size from 1MB to e.g. 16MB,
> >> changes to sk->sk_prot->memory_allocated can be further reduced.
> >> Performance may be improved on system with many cores.
> >
> > This looks good, do you have any performance numbers to share ?
> >
> > On a host with 384 threads, 384*16 ->  6 GB of memory.
>
> Those things also come with corresponding memories of a couple of TB...
>
> > With this kind of use, we might need a shrinker...
>
> Yes. No point of keeping the buffers around if the core stops doing
> networking. But to be done at times when there is no contention please.

I have yet to see the 'contention'?

I usually see contention on the zone spinlock or the memcg ones when
allocating/freeing pages, not on the tcp_memory_allocated atomic.

We can add caches for sure; we had a giant one before my patch, and it
was really a disaster for workloads with millions of TCP sockets.


* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-27 20:38 ` Eric Dumazet
  2024-02-27 23:08   ` Lameter, Christopher
@ 2024-02-28 13:23   ` Adam Li
  2024-02-28 13:46     ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Adam Li @ 2024-02-28 13:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: corbet, davem, kuba, pabeni, willemb, yangtiezhu, atenart,
	kuniyu, wuyun.abel, leitao, alexander, dhowells, paulmck,
	joel.granados, urezki, joel, linux-doc, linux-kernel, netdev,
	patches, cl, shijie

On 2/28/2024 4:38 AM, Eric Dumazet wrote:

>>
>> sk_prot->memory_allocated points to global atomic variable:
>> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
>>
>> If increasing the per-cpu cache size from 1MB to e.g. 16MB,
>> changes to sk->sk_prot->memory_allocated can be further reduced.
>> Performance may be improved on system with many cores.
> 
> This looks good, do you have any performance numbers to share ?

I ran a localhost memcached test on a system with 320 CPU threads;
perf shows 4% of cycles spent in __sk_mem_raise_allocated() --> sk_memory_allocated().
After increasing SK_MEMORY_PCPU_RESERVE to 16MB, the perf cycles spent in
__sk_mem_raise_allocated() drop to 0.4%.

Thanks,
-adam



* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-28 13:23   ` Adam Li
@ 2024-02-28 13:46     ` Eric Dumazet
  2024-03-06 17:01       ` Lameter, Christopher
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2024-02-28 13:46 UTC (permalink / raw)
  To: Adam Li
  Cc: corbet, davem, kuba, pabeni, willemb, yangtiezhu, atenart,
	kuniyu, wuyun.abel, leitao, alexander, dhowells, paulmck,
	joel.granados, urezki, joel, linux-doc, linux-kernel, netdev,
	patches, cl, shijie

On Wed, Feb 28, 2024 at 2:21 PM Adam Li <adamli@os.amperecomputing.com> wrote:
>
> On 2/28/2024 4:38 AM, Eric Dumazet wrote:
>
> >>
> >> sk_prot->memory_allocated points to global atomic variable:
> >> atomic_long_t tcp_memory_allocated ____cacheline_aligned_in_smp;
> >>
> >> If increasing the per-cpu cache size from 1MB to e.g. 16MB,
> >> changes to sk->sk_prot->memory_allocated can be further reduced.
> >> Performance may be improved on system with many cores.
> >
> > This looks good, do you have any performance numbers to share ?
>
> I ran localhost memcached test on system with 320 CPU threads,
> perf shows 4% cycles spent in __sk_mem_raise_allocated() -->sk_memory_allocated().
> If increasing SK_MEMORY_PCPU_RESERV to 16MB, perf cycles spent in
> __sk_mem_raise_allocated() drops to 0.4%.

I suspect some kind of flow/cpu steering issues then.
Also maybe SO_RESERVE_MEM would be better for this workload.
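
For context, SO_RESERVE_MEM is a socket option that lets an application
pre-charge memory to one socket up front, so the hot path touches the
shared accounting less often. A minimal usage sketch, assuming a kernel
that supports the option and has memcg socket accounting enabled; the
1MB figure is arbitrary.

#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_RESERVE_MEM
#define SO_RESERVE_MEM 73	/* value from asm-generic/socket.h */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int bytes = 1 << 20;	/* reserve 1MB up front for this socket */

	if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_RESERVE_MEM,
				 &bytes, sizeof(bytes)) < 0) {
		perror("SO_RESERVE_MEM");
		return 1;
	}
	printf("reserved %d bytes of forward-alloc memory\n", bytes);
	return 0;
}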


* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-02-28 13:46     ` Eric Dumazet
@ 2024-03-06 17:01       ` Lameter, Christopher
  2024-03-06 17:28         ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Lameter, Christopher @ 2024-03-06 17:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Adam Li, corbet, davem, kuba, pabeni, willemb, yangtiezhu,
	atenart, kuniyu, wuyun.abel, leitao, alexander, dhowells,
	paulmck, joel.granados, urezki, joel, linux-doc, linux-kernel,
	netdev, patches, shijie

On Wed, 28 Feb 2024, Eric Dumazet wrote:

>> __sk_mem_raise_allocated() drops to 0.4%.
>
> I suspect some kind of flow/cpu steering issues then.
> Also maybe SO_RESERVE_MEM would be better for this workload.

This is via loopback. So there is a flow steering issue in the IP 
stack?



* Re: [PATCH] net: make SK_MEMORY_PCPU_RESERV tunable
  2024-03-06 17:01       ` Lameter, Christopher
@ 2024-03-06 17:28         ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2024-03-06 17:28 UTC (permalink / raw)
  To: Lameter, Christopher
  Cc: Adam Li, corbet, davem, kuba, pabeni, willemb, yangtiezhu,
	atenart, kuniyu, wuyun.abel, leitao, alexander, dhowells,
	paulmck, joel.granados, urezki, joel, linux-doc, linux-kernel,
	netdev, patches, shijie

On Wed, Mar 6, 2024 at 6:01 PM Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Wed, 28 Feb 2024, Eric Dumazet wrote:
>
> >> __sk_mem_raise_allocated() drops to 0.4%.
> >
> > I suspect some kind of flow/cpu steering issues then.
> > Also maybe SO_RESERVE_MEM would be better for this workload.
>
> This is via loopback. So there is a flow steering issue in the IP
> stack?

Asymmetric allocations/freeing, which will usually have a high cost
for the payload copy anyway.

Maybe hierarchical tracking would avoid false sharing if some arches
pay a high price for it.

- One per-cpu reserve.    (X MB)

- One per-memory-domain reserve.  (number_of_cpu_in_this_domain * X MB)

- A global reserve, with an uncertainty of number_of_cpus * X MB

Basically reworking lib/percpu_counter.c for better NUMA awareness.
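
A purely illustrative userspace model of the hierarchy sketched above.
The structure, thresholds and names (node_rsv, PCPU_BATCH, the 4-cpu
node) are assumptions, not a proposed kernel implementation.

/* Per-cpu deltas fold into a per-NUMA-node counter, and the node
 * counter folds into the global one, so cross-node cacheline traffic
 * only happens at node granularity.
 */
#include <stdio.h>
#include <stdlib.h>

#define PCPU_BATCH	(1L << 20)	/* 1MB per cpu */
#define CPUS_PER_NODE	4

struct node_rsv {
	long reserve;			/* bytes cached at node level */
};

static long global_allocated;		/* stands in for memory_allocated */

static void node_add(struct node_rsv *node, long amt)
{
	node->reserve += amt;
	/* Node-level batch scales with the cpus behind it. */
	if (labs(node->reserve) >= CPUS_PER_NODE * PCPU_BATCH) {
		global_allocated += node->reserve;
		node->reserve = 0;
	}
}

static void cpu_add(long *pcpu, struct node_rsv *node, long amt)
{
	*pcpu += amt;
	if (labs(*pcpu) >= PCPU_BATCH) {	/* flush to the node, not the global */
		node_add(node, *pcpu);
		*pcpu = 0;
	}
}

int main(void)
{
	long pcpu[CPUS_PER_NODE] = { 0 };
	struct node_rsv node = { 0 };

	/* Simulate one cpu allocating 5MB in 64KB chunks. */
	for (int i = 0; i < 80; i++)
		cpu_add(&pcpu[0], &node, 64 << 10);

	printf("pcpu=%ld node=%ld global=%ld\n",
	       pcpu[0], node.reserve, global_allocated);
	return 0;
}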

