linux-kernel.vger.kernel.org archive mirror
* [PATCH] net: skbuff: allocate the fclone in the current NUMA node
@ 2024-02-20  2:18 Huang Shijie
  2024-02-20  5:32 ` Eric Dumazet
  2024-02-26 10:10 ` [PATCH] net: skbuff: allocate the fclone in the current NUMA node Alexander Lobakin
  0 siblings, 2 replies; 19+ messages in thread
From: Huang Shijie @ 2024-02-20  2:18 UTC (permalink / raw)
  To: kuba
  Cc: patches, davem, horms, edumazet, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl, Huang Shijie

The current code passes NUMA_NO_NODE to __alloc_skb(), we found
it may creates fclone SKB in remote NUMA node.

So use numa_node_id() to limit the allocation to current NUMA node.

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 include/linux/skbuff.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2dde34c29203..ebc42b2604ad 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-20  2:18 [PATCH] net: skbuff: allocate the fclone in the current NUMA node Huang Shijie
@ 2024-02-20  5:32 ` Eric Dumazet
  2024-02-20  6:26   ` Shijie Huang
  2024-02-27  6:28   ` [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache Huang Shijie
  2024-02-26 10:10 ` [PATCH] net: skbuff: allocate the fclone in the current NUMA node Alexander Lobakin
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2024-02-20  5:32 UTC (permalink / raw)
  To: Huang Shijie
  Cc: kuba, patches, davem, horms, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl

On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
<shijie@os.amperecomputing.com> wrote:
>
> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> it may creates fclone SKB in remote NUMA node.

This is intended (WAI)

What about the NUMA policies of the current thread ?

Has NUMA_NO_NODE behavior changed recently?

What does "it may creates" mean? Please be more specific.

>
> So use numa_node_id() to limit the allocation to current NUMA node.

We prefer the allocation to succeed, instead of failing if the current
NUMA node has no available memory.

Please check:

grep . /sys/devices/system/node/node*/numastat

Are you going to change ~700 uses of  NUMA_NO_NODE in the kernel ?

Just curious.

>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  include/linux/skbuff.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2dde34c29203..ebc42b2604ad 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>                                                gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> +       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());
>  }
>
>  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-20  5:32 ` Eric Dumazet
@ 2024-02-20  6:26   ` Shijie Huang
  2024-02-20  8:17     ` Eric Dumazet
  2024-02-27  6:28   ` [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache Huang Shijie
  1 sibling, 1 reply; 19+ messages in thread
From: Shijie Huang @ 2024-02-20  6:26 UTC (permalink / raw)
  To: Eric Dumazet, Huang Shijie
  Cc: kuba, patches, davem, horms, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl


在 2024/2/20 13:32, Eric Dumazet 写道:
> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
>> it may creates fclone SKB in remote NUMA node.
> This is intended (WAI)

Okay. thanks a lot.

It seems I should fix the issue in other code, not in the networking code.

>
> What about the NUMA policies of the current thread ?

We use "numactl -m 0" for memcached, so the NUMA policy should allocate
the fclones in node 0, but we can see that many fclones were allocated
in node 1.

We have enough memory to allocate these fclones in node 0.

>
> Has NUMA_NO_NODE behavior changed recently?
I guess not.
>
> What means : "it may creates" ? Please be more specific.

When we use memcached for testing on NUMA, maybe 20% ~ 30% of the
fclones were allocated in the remote NUMA node.

After this patch, all the fclones are allocated correctly, in the
local node.


>> So use numa_node_id() to limit the allocation to current NUMA node.
> We prefer the allocation to succeed, instead of failing if the current
> NUMA node has no available memory.

Got it.


Thanks

Huang Shijie

>
> Please check:
>
> grep . /sys/devices/system/node/node*/numastat
>
> Are you going to change ~700 uses of  NUMA_NO_NODE in the kernel ?
>
> Just curious.
>
>> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
>> ---
>>   include/linux/skbuff.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 2dde34c29203..ebc42b2604ad 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>>   static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>>                                                 gfp_t priority)
>>   {
>> -       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
>> +       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());
>>   }
>>
>>   struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
>> --
>> 2.40.1
>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-20  6:26   ` Shijie Huang
@ 2024-02-20  8:17     ` Eric Dumazet
  2024-02-20  8:37       ` Shijie Huang
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2024-02-20  8:17 UTC (permalink / raw)
  To: Shijie Huang
  Cc: Huang Shijie, kuba, patches, davem, horms, ast, dhowells,
	linyunsheng, aleksander.lobakin, linux-kernel, netdev, cl

On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> 在 2024/2/20 13:32, Eric Dumazet 写道:
> > On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> > <shijie@os.amperecomputing.com> wrote:
> >> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >> it may creates fclone SKB in remote NUMA node.
> > This is intended (WAI)
>
> Okay. thanks a lot.
>
> It seems I should fix the issue in other code, not the networking.
>
> >
> > What about the NUMA policies of the current thread ?
>
> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> fclone in
>
> node 0, but we can see many fclones were allocated in node 1.
>
> We have enough memory to allocate these fclones in node 0.
>
> >
> > Has NUMA_NO_NODE behavior changed recently?
> I guess not.
> >
> > What means : "it may creates" ? Please be more specific.
>
> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> fclones were allocated in
>
> remote NUMA node.

Interesting, how was it measured exactly ?
Are you using SLUB or SLAB ?

>
> After this patch, all the fclones are allocated correctly.

Note that skbs for TCP have three memory components (or more for large packets)

sk_buff
skb->head
page frags (see sk_page_frag_refill() for non zero copy payload)

The payload should follow the NUMA policy of the current thread; that is
really what matters.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-20  8:17     ` Eric Dumazet
@ 2024-02-20  8:37       ` Shijie Huang
  2024-02-24 19:07         ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Shijie Huang @ 2024-02-20  8:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Huang Shijie, kuba, patches, davem, horms, ast, dhowells,
	linyunsheng, aleksander.lobakin, linux-kernel, netdev, cl


在 2024/2/20 16:17, Eric Dumazet 写道:
> On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>> 在 2024/2/20 13:32, Eric Dumazet 写道:
>>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
>>> <shijie@os.amperecomputing.com> wrote:
>>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
>>>> it may creates fclone SKB in remote NUMA node.
>>> This is intended (WAI)
>> Okay. thanks a lot.
>>
>> It seems I should fix the issue in other code, not the networking.
>>
>>> What about the NUMA policies of the current thread ?
>> We use "numactl -m 0" for memcached, the NUMA policy should allocate
>> fclone in
>>
>> node 0, but we can see many fclones were allocated in node 1.
>>
>> We have enough memory to allocate these fclones in node 0.
>>
>>> Has NUMA_NO_NODE behavior changed recently?
>> I guess not.
>>> What means : "it may creates" ? Please be more specific.
>> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
>> fclones were allocated in
>>
>> remote NUMA node.
> Interesting, how was it measured exactly ?

I created a private patch to record the status for each fclone allocation.


> Are you using SLUB or SLAB ?

I think I use SLUB (CONFIG_SLUB=y, CONFIG_SLAB_MERGE_DEFAULT=y,
CONFIG_SLUB_CPU_PARTIAL=y).


Thanks

Huang Shijie


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-20  8:37       ` Shijie Huang
@ 2024-02-24 19:07         ` Eric Dumazet
  2024-02-26 10:18           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2024-02-24 19:07 UTC (permalink / raw)
  To: Shijie Huang, Jesper Dangaard Brouer
  Cc: Huang Shijie, kuba, patches, davem, horms, ast, dhowells,
	linyunsheng, aleksander.lobakin, linux-kernel, netdev, cl

On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> 在 2024/2/20 16:17, Eric Dumazet 写道:
> > On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:
> >>
> >> 在 2024/2/20 13:32, Eric Dumazet 写道:
> >>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> >>> <shijie@os.amperecomputing.com> wrote:
> >>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >>>> it may creates fclone SKB in remote NUMA node.
> >>> This is intended (WAI)
> >> Okay. thanks a lot.
> >>
> >> It seems I should fix the issue in other code, not the networking.
> >>
> >>> What about the NUMA policies of the current thread ?
> >> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> >> fclone in
> >>
> >> node 0, but we can see many fclones were allocated in node 1.
> >>
> >> We have enough memory to allocate these fclones in node 0.
> >>
> >>> Has NUMA_NO_NODE behavior changed recently?
> >> I guess not.
> >>> What means : "it may creates" ? Please be more specific.
> >> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> >> fclones were allocated in
> >>
> >> remote NUMA node.
> > Interesting, how was it measured exactly ?
>
> I created a private patch to record the status for each fclone allocation.
>
>
> > Are you using SLUB or SLAB ?
>
> I think I use SLUB. (CONFIG_SLUB=y,
> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
>

A similar issue comes from tx_action() calling __napi_kfree_skb() on
arbitrary skbs
including ones that were allocated on a different NUMA node.

This pollutes per-cpu caches with non-optimally placed sk_buffs :/

Although this should not impact fclones, which take the __kfree_skb()
path rather than __napi_kfree_skb() ?

commit 15fad714be86eab13e7568fecaf475b2a9730d3e
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Mon Feb 8 13:15:04 2016 +0100

    net: bulk free SKBs that were delay free'ed due to IRQ context

What about :

diff --git a/net/core/dev.c b/net/core/dev.c
index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5162,11 +5162,7 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
                                trace_kfree_skb(skb, net_tx_action,
                                                get_kfree_skb_cb(skb)->reason);

-                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
-                               __kfree_skb(skb);
-                       else
-                               __napi_kfree_skb(skb,
-                                                get_kfree_skb_cb(skb)->reason);
+                       __kfree_skb(skb);
                }
        }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-20  2:18 [PATCH] net: skbuff: allocate the fclone in the current NUMA node Huang Shijie
  2024-02-20  5:32 ` Eric Dumazet
@ 2024-02-26 10:10 ` Alexander Lobakin
  2024-02-27  6:30   ` Shijie Huang
  1 sibling, 1 reply; 19+ messages in thread
From: Alexander Lobakin @ 2024-02-26 10:10 UTC (permalink / raw)
  To: Huang Shijie
  Cc: kuba, patches, davem, horms, edumazet, ast, dhowells,
	linyunsheng, linux-kernel, netdev, cl

From: Huang Shijie <shijie@os.amperecomputing.com>
Date: Tue, 20 Feb 2024 10:18:04 +0800

> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> it may creates fclone SKB in remote NUMA node.
> 
> So use numa_node_id() to limit the allocation to current NUMA node.
> 
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  include/linux/skbuff.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2dde34c29203..ebc42b2604ad 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>  					       gfp_t priority)
>  {
> -	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> +	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());

Because it tries to defragment the memory and pick an optimal node.

__alloc_skb() and skb clones aren't anyway something very hotpathish, do
you have any particular perf numbers and/or usecases where %NUMA_NO_NODE
really hurts?

>  }
>  
>  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);

Thanks,
Olek

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-24 19:07         ` Eric Dumazet
@ 2024-02-26 10:18           ` Jesper Dangaard Brouer
  2024-02-26 10:29             ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2024-02-26 10:18 UTC (permalink / raw)
  To: Eric Dumazet, aleksander.lobakin, Shijie Huang
  Cc: Huang Shijie, kuba, patches, davem, horms, ast, dhowells,
	linyunsheng, linux-kernel, netdev, cl



On 24/02/2024 20.07, Eric Dumazet wrote:
> On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>>
>> 在 2024/2/20 16:17, Eric Dumazet 写道:
>>> On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
>>> <shijie@amperemail.onmicrosoft.com> wrote:
>>>>
>>>> 在 2024/2/20 13:32, Eric Dumazet 写道:
>>>>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
>>>>> <shijie@os.amperecomputing.com> wrote:
>>>>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
>>>>>> it may creates fclone SKB in remote NUMA node.
>>>>> This is intended (WAI)
>>>> Okay. thanks a lot.
>>>>
>>>> It seems I should fix the issue in other code, not the networking.
>>>>
>>>>> What about the NUMA policies of the current thread ?
>>>> We use "numactl -m 0" for memcached, the NUMA policy should allocate
>>>> fclone in
>>>>
>>>> node 0, but we can see many fclones were allocated in node 1.
>>>>
>>>> We have enough memory to allocate these fclones in node 0.
>>>>
>>>>> Has NUMA_NO_NODE behavior changed recently?
>>>> I guess not.
>>>>> What means : "it may creates" ? Please be more specific.
>>>> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
>>>> fclones were allocated in
>>>>
>>>> remote NUMA node.
>>> Interesting, how was it measured exactly ?
>>
>> I created a private patch to record the status for each fclone allocation.
>>
>>
>>> Are you using SLUB or SLAB ?
>>
>> I think I use SLUB. (CONFIG_SLUB=y,
>> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
>>
> 
> A similar issue comes from tx_action() calling __napi_kfree_skb() on
> arbitrary skbs
> including ones that were allocated on a different NUMA node.
> 
> This pollutes per-cpu caches with not optimally placed sk_buff :/
> 
> Although this should not impact fclones, __napi_kfree_skb() only ?
> 
> commit 15fad714be86eab13e7568fecaf475b2a9730d3e
> Author: Jesper Dangaard Brouer <brouer@redhat.com>
> Date:   Mon Feb 8 13:15:04 2016 +0100
> 
>      net: bulk free SKBs that were delay free'ed due to IRQ context
> 
> What about :
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5162,11 +5162,7 @@ static __latent_entropy void
> net_tx_action(struct softirq_action *h)
>                                  trace_kfree_skb(skb, net_tx_action,
>                                                  get_kfree_skb_cb(skb)->reason);
> 
> -                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
> -                               __kfree_skb(skb);
> -                       else
> -                               __napi_kfree_skb(skb,
> -                                                get_kfree_skb_cb(skb)->reason);

Yes, I think it makes sense to avoid calling __napi_kfree_skb here.
The __napi_kfree_skb call will cache the SKB's slab allocation (but
"release" its data) in a per-CPU napi_alloc_cache (see napi_skb_cache_put()).
In net_tx_action() there is a chance this could originate from another
CPU or even NUMA node.  I notice this is only for SKBs on the
softnet_data->completion_queue, which have a high chance of being cache
cold.  My patch 15fad714be86e only made sense when we bulk freed these
SKBs, but after Olek's changes to cache freed SKBs, this shouldn't
be calling __napi_kfree_skb() (previously named __kfree_skb_defer).
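A toy model of the per-CPU cache pollution being described (sizes and names are illustrative; the real code is napi_skb_cache_put()/napi_skb_cache_get() in net/core/skbuff.c): a freed skb's slab object is parked in a small per-CPU array and handed straight back to the next allocation on that CPU, so parking a remote-node object means the next local allocation gets remote memory.

```c
#include <stdio.h>

#define CACHE_SIZE 4

/* Toy per-CPU free-skb cache: stores the NUMA node of each parked object */
struct cpu_cache {
    int node[CACHE_SIZE];
    int count;
};

/* model of __napi_kfree_skb(): park the freed object for reuse */
static void cache_put(struct cpu_cache *c, int obj_node)
{
    if (c->count < CACHE_SIZE)
        c->node[c->count++] = obj_node;
    /* else: would go back to the slab allocator */
}

/* model of the allocation side: reuse a parked object if any */
static int cache_get(struct cpu_cache *c, int local_node)
{
    return c->count ? c->node[--c->count] : local_node;
}

int main(void)
{
    struct cpu_cache c = { .count = 0 };

    /* net_tx_action() on a CPU of node 0 frees an skb that was
     * allocated on node 1 ... */
    cache_put(&c, 1);

    /* ... and the next allocation on this CPU gets that remote-node
     * object back instead of fresh node-0 memory */
    printf("next alloc comes from node %d\n", cache_get(&c, 0));
    return 0;
}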

I support this RFC patch from Eric.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

> +                       __kfree_skb(skb);
>                  }
>          }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-26 10:18           ` Jesper Dangaard Brouer
@ 2024-02-26 10:29             ` Eric Dumazet
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2024-02-26 10:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: aleksander.lobakin, Shijie Huang, Huang Shijie, kuba, patches,
	davem, horms, ast, dhowells, linyunsheng, linux-kernel, netdev,
	cl

On Mon, Feb 26, 2024 at 11:18 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 24/02/2024 20.07, Eric Dumazet wrote:
> > On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:
> >>
> >>
> >> 在 2024/2/20 16:17, Eric Dumazet 写道:
> >>> On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> >>> <shijie@amperemail.onmicrosoft.com> wrote:
> >>>>
> >>>> 在 2024/2/20 13:32, Eric Dumazet 写道:
> >>>>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> >>>>> <shijie@os.amperecomputing.com> wrote:
> >>>>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >>>>>> it may creates fclone SKB in remote NUMA node.
> >>>>> This is intended (WAI)
> >>>> Okay. thanks a lot.
> >>>>
> >>>> It seems I should fix the issue in other code, not the networking.
> >>>>
> >>>>> What about the NUMA policies of the current thread ?
> >>>> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> >>>> fclone in
> >>>>
> >>>> node 0, but we can see many fclones were allocated in node 1.
> >>>>
> >>>> We have enough memory to allocate these fclones in node 0.
> >>>>
> >>>>> Has NUMA_NO_NODE behavior changed recently?
> >>>> I guess not.
> >>>>> What means : "it may creates" ? Please be more specific.
> >>>> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> >>>> fclones were allocated in
> >>>>
> >>>> remote NUMA node.
> >>> Interesting, how was it measured exactly ?
> >>
> >> I created a private patch to record the status for each fclone allocation.
> >>
> >>
> >>> Are you using SLUB or SLAB ?
> >>
> >> I think I use SLUB. (CONFIG_SLUB=y,
> >> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
> >>
> >
> > A similar issue comes from tx_action() calling __napi_kfree_skb() on
> > arbitrary skbs
> > including ones that were allocated on a different NUMA node.
> >
> > This pollutes per-cpu caches with not optimally placed sk_buff :/
> >
> > Although this should not impact fclones, __napi_kfree_skb() only ?
> >
> > commit 15fad714be86eab13e7568fecaf475b2a9730d3e
> > Author: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date:   Mon Feb 8 13:15:04 2016 +0100
> >
> >      net: bulk free SKBs that were delay free'ed due to IRQ context
> >
> > What about :
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5162,11 +5162,7 @@ static __latent_entropy void
> > net_tx_action(struct softirq_action *h)
> >                                  trace_kfree_skb(skb, net_tx_action,
> >                                                  get_kfree_skb_cb(skb)->reason);
> >
> > -                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
> > -                               __kfree_skb(skb);
> > -                       else
> > -                               __napi_kfree_skb(skb,
> > -                                                get_kfree_skb_cb(skb)->reason);
>
> Yes, I think it makes sense to avoid calling __napi_kfree_skb here.
> The __napi_kfree_skb call will cache SKB slub-allocation (but "release"
> data) on a per CPU napi_alloc_cache (see code napi_skb_cache_put()).
> In net_tx_action() there is a chance this could originate from another
> CPU or even NUMA node.  I notice this is only for SKBs on the
> softnet_data->completion_queue, which have a high chance of being cache
> cold.  My patch 15fad714be86e only made sense when we bulk freed these
> SKBs, but after Olek's changes to cache freed SKBs, then this shouldn't
> be calling __napi_kfree_skb() (previously named __kfree_skb_defer).
>
> I support this RFC patch from Eric.
>
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

Note that this should not matter for most NICs, because their drivers
perform TX completion from NAPI context, so we do not hit this path.

It seems that switching to SLUB instead of SLAB has increased the chances
of getting memory from another node.

We probably need to investigate.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-20  5:32 ` Eric Dumazet
  2024-02-20  6:26   ` Shijie Huang
@ 2024-02-27  6:28   ` Huang Shijie
  2024-02-27 12:55     ` Eric Dumazet
  2024-02-29 16:56     ` Christoph Lameter (Ampere)
  1 sibling, 2 replies; 19+ messages in thread
From: Huang Shijie @ 2024-02-27  6:28 UTC (permalink / raw)
  To: kuba
  Cc: patches, davem, horms, edumazet, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl, Huang Shijie

Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
the current skbuff_fclone_cache may not really be allocated: it may
reuse an existing old kmem_cache. On NUMA, the fclone allocated by
alloc_skb_fclone() may then be in a remote node.

So set FLAG_SKB_NO_MERGE for skbuff_fclone_cache to fix it.

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
v1 --> v2:
       set FLAG_SKB_NO_MERGE for skbuff_fclone_cache in initialization.

v1: https://lkml.org/lkml/2024/2/20/121	
---
 net/core/skbuff.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1f918e602bc4..5e3e130fb57a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5013,7 +5013,8 @@ void __init skb_init(void)
 	skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
 						sizeof(struct sk_buff_fclones),
 						0,
-						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+						SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+						FLAG_SKB_NO_MERGE,
 						NULL);
 	/* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
 	 * struct skb_shared_info is located at the end of skb->head,
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] net: skbuff: allocate the fclone in the current NUMA node
  2024-02-26 10:10 ` [PATCH] net: skbuff: allocate the fclone in the current NUMA node Alexander Lobakin
@ 2024-02-27  6:30   ` Shijie Huang
  0 siblings, 0 replies; 19+ messages in thread
From: Shijie Huang @ 2024-02-27  6:30 UTC (permalink / raw)
  To: Alexander Lobakin, Huang Shijie
  Cc: kuba, patches, davem, horms, edumazet, ast, dhowells,
	linyunsheng, linux-kernel, netdev, cl


在 2024/2/26 18:10, Alexander Lobakin 写道:
> __alloc_skb() and skb clones aren't anyway something very hotpathish, do
> you have any particular perf numbers and/or usecases where %NUMA_NO_NODE
> really hurts?

From the memcached test, I do not see any real performance hurt.


Thanks

Huang Shijie


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-27  6:28   ` [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache Huang Shijie
@ 2024-02-27 12:55     ` Eric Dumazet
  2024-02-27 13:15       ` Eric Dumazet
  2024-02-28  7:05       ` Shijie Huang
  2024-02-29 16:56     ` Christoph Lameter (Ampere)
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2024-02-27 12:55 UTC (permalink / raw)
  To: Huang Shijie
  Cc: kuba, patches, davem, horms, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl

On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
<shijie@os.amperecomputing.com> wrote:
>
> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
> the current skbuff_fclone_cache maybe not really allocated, it maybe
> used an exist old kmem_cache. In NUMA, the fclone allocated by
> alloc_skb_fclone() maybe in remote node.

Why is this happening in the first place ? What about skb->head ?

Jesper's patch [1] motivation was not about NUMA, but about
fragmentation and bulk allocations/freeing.

TCP fclones are not bulk allocated/freed, so I do not understand what
your patch is doing.
You need to give more details, and experimental results.

Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
on a dual socket
host with plenty of available memory.
(either sk_buff or skb->head being allocated on the other node).

fclones might be allocated from a cpu running on node A, and freed
from a cpu running on node B.
Maybe SLUB is not properly handling this case ?

SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
really help.

I think we need help from mm/slub experts, instead of trying to 'fix'
networking stacks.

Perhaps we could augment trace_kmem_cache_alloc() to record/print the nodes
of the allocated chunk (we already have the cpu number giving us the
local node).
That would give us more confidence on any fixes.

BTW SLUB is gone, time to remove FLAG_SKB_NO_MERGE and simply use SLAB_NO_MERGE

[1]
commit 0a0643164da4a1976455aa12f0a96d08ee290752
Author: Jesper Dangaard Brouer <hawk@kernel.org>
Date:   Tue Aug 15 17:17:36 2023 +0200

    net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache



>
> So set FLAG_SKB_NO_MERGE for skbuff_fclone_cache to fix it.
>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
> v1 --> v2:
>        set FLAG_SKB_NO_MERGE for skbuff_fclone_cache in initialization.
>
> v1: https://lkml.org/lkml/2024/2/20/121
> ---
>  net/core/skbuff.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1f918e602bc4..5e3e130fb57a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5013,7 +5013,8 @@ void __init skb_init(void)
>         skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
>                                                 sizeof(struct sk_buff_fclones),
>                                                 0,
> -                                               SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> +                                               SLAB_HWCACHE_ALIGN|SLAB_PANIC|
> +                                               FLAG_SKB_NO_MERGE,
>                                                 NULL);
>         /* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
>          * struct skb_shared_info is located at the end of skb->head,
> --
> 2.40.1
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-27 12:55     ` Eric Dumazet
@ 2024-02-27 13:15       ` Eric Dumazet
  2024-02-28  7:05       ` Shijie Huang
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2024-02-27 13:15 UTC (permalink / raw)
  To: Huang Shijie
  Cc: kuba, patches, davem, horms, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl

On Tue, Feb 27, 2024 at 1:55 PM Eric Dumazet <edumazet@google.com> wrote:
>

> BTW SLUB is gone, time to remove FLAG_SKB_NO_MERGE and simply use SLAB_NO_MERGE

Ignore this part, I was thinking about SLOB.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-27 12:55     ` Eric Dumazet
  2024-02-27 13:15       ` Eric Dumazet
@ 2024-02-28  7:05       ` Shijie Huang
  2024-02-28  9:38         ` Eric Dumazet
  2024-02-29 17:00         ` Christoph Lameter (Ampere)
  1 sibling, 2 replies; 19+ messages in thread
From: Shijie Huang @ 2024-02-28  7:05 UTC (permalink / raw)
  To: Eric Dumazet, Huang Shijie
  Cc: kuba, patches, davem, horms, ast, dhowells, linyunsheng,
	aleksander.lobakin, linux-kernel, netdev, cl


在 2024/2/27 20:55, Eric Dumazet 写道:
> On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
>> the current skbuff_fclone_cache maybe not really allocated, it maybe
>> used an exist old kmem_cache. In NUMA, the fclone allocated by
>> alloc_skb_fclone() maybe in remote node.
> Why is this happening in the first place ? Whab about skb->head ?

I tested the fclone first; I did not test the others yet.

I did not check skb->head yet.

But I did check the pfrag's page, and it is okay.


>
> Jesper patch [1] motivation was not about NUMA., but about
> fragmentation and bulk allocations/freeing.
>
> TCP fclones are not bulk allocated/freed, so I do not understand what
> your patch is doing.
> You need to give more details, and experimental results.

1.) My NUMA machine:

      node 0 (CPU 0 ~ CPU 79):
            CPU 0   ~ CPU 39  are used as memcached's server
            CPU 40  ~ CPU 79  are used as memcached's client

      node 1 (CPU 80 ~ CPU 160):
            CPU 80  ~ CPU 119 are used as memcached's server
            CPU 120 ~ CPU 179 are used as memcached's client

    the kernel is linux-next 20240227


  2.) My private patches:

       patch 1 is for slub:

       ---
 mm/slub.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/slub.c b/mm/slub.c
index 5d838ebfa35e..d2ab1e36fd6b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5691,6 +5691,7 @@ __kmem_cache_alias(const char *name, unsigned int size, unsigned int align,
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
+		printk("[%s] origin:%s, shared :%s\n", __func__, name, s->name);
 		if (sysfs_slab_alias(s, name))
 			return NULL;
---------

   This patch is used to check which kmem_cache actually backs
"skbuff_fclone_cache".

   I cannot find "skbuff_fclone_cache" in /proc/slabinfo.

   From my test, "pool_workqueue" is the real working kmem_cache;
"skbuff_fclone_cache" is just an alias of "pool_workqueue" (pwq_cache).


   The following private patch is used to record the fclone allocation:

  ---
 net/ipv4/tcp.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c82dc42f57c6..6f31ddcfc017 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -864,6 +864,24 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos,
 }
 EXPORT_SYMBOL(tcp_splice_read);
 
+unsigned long right_num, wrong_num;
+static void check_fclone(struct sk_buff *skb)
+{
+	int n = numa_mem_id(); /* current node */
+	int node2 = page_to_nid(virt_to_page((unsigned long) skb));
+
+	if (n != node2) {
+		wrong_num++;
+		if ((wrong_num % 1000) == 999)
+			printk(KERN_DEBUG "[%s] current:%d, get from :%d, (%ld, %ld, %ld)\n",
+				__func__, n, node2, wrong_num, right_num, wrong_num * 100 / (wrong_num + right_num));
+	} else {
+		right_num++;
+		if ((right_num % 1000000) == 9999)
+			printk("[%s] we received :%ld, %ld\n", __func__, right_num, wrong_num);
+	}
+}
+
 struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 				     bool force_schedule)
 {
@@ -884,6 +902,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
 			skb_reserve(skb, MAX_TCP_HEADER);
 			skb->ip_summed = CHECKSUM_PARTIAL;
 			INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
+			check_fclone(skb);
 			return skb;
 		}
 		__kfree_skb(skb);
--

   Without this v2 patch, I get the following result after the memcached test:

[ 1027.317645] [check_fclone] current:0, get from :1, (7112999, 9076711, 43)
[ 1027.317653] [check_fclone] current:0, get from :1, (7112999, 9076707, 43)
[ 1027.804110] [check_fclone] we received :10009999, 7113326

It means nearly 43% of the fclones are allocated on the remote node.


  With this v2 patch, I can find "skbuff_fclone_cache" in /proc/slabinfo.

The test result is shown below:

[  503.357293] [check_fclone] we received :8009999, 0
[  503.357293] [check_fclone] we received :8009999, 0
[  503.357305] [check_fclone] we received :8009999, 0

With the v2 patch, I no longer see fclones allocated on the remote node.


>
> Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> on a dual socket
> host with plenty of available memory.
> (either sk_buff or skb->head being allocated on the other node).

Do you mean you can still see misplaced fclones after using SLAB_NO_MERGE?

If so, I guess there is a bug in SLUB.


> fclones might be allocated from a cpu running on node A, and freed
> from a cpu running on node B.
> Maybe SLUB is not properly handling this case ?

Maybe.



> SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
> really help.
>
> I think we need help from mm/slub experts, instead of trying to 'fix'
> networking stacks.

@Christopher

Any idea about this?

Thanks

Huang Shijie

> Perhaps we could augment trace_kmem_cache_alloc() to record/print the nodes
> of the allocated chunk (we already have the cpu number giving us the
> local node).
> That would give us more confidence on any fixes.
>
> BTW SLUB is gone, time to remove FLAG_SKB_NO_MERGE and simply use SLAB_NO_MERGE
>
> [1]
> commit 0a0643164da4a1976455aa12f0a96d08ee290752
> Author: Jesper Dangaard Brouer <hawk@kernel.org>
> Date:   Tue Aug 15 17:17:36 2023 +0200
>
>      net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
>
>
>
>> So set FLAG_SKB_NO_MERGE for skbuff_fclone_cache to fix it.
>>
>> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
>> ---
>> v1 --> v2:
>>         set FLAG_SKB_NO_MERGE for skbuff_fclone_cache in initialization.
>>
>> v1: https://lkml.org/lkml/2024/2/20/121
>> ---
>>   net/core/skbuff.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 1f918e602bc4..5e3e130fb57a 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5013,7 +5013,8 @@ void __init skb_init(void)
>>          skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
>>                                                  sizeof(struct sk_buff_fclones),
>>                                                  0,
>> -                                               SLAB_HWCACHE_ALIGN|SLAB_PANIC,
>> +                                               SLAB_HWCACHE_ALIGN|SLAB_PANIC|
>> +                                               FLAG_SKB_NO_MERGE,
>>                                                  NULL);
>>          /* usercopy should only access first SKB_SMALL_HEAD_HEADROOM bytes.
>>           * struct skb_shared_info is located at the end of skb->head,
>> --
>> 2.40.1
>>

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-28  7:05       ` Shijie Huang
@ 2024-02-28  9:38         ` Eric Dumazet
  2024-02-29 17:00         ` Christoph Lameter (Ampere)
  1 sibling, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2024-02-28  9:38 UTC (permalink / raw)
  To: Shijie Huang
  Cc: Huang Shijie, kuba, patches, davem, horms, ast, dhowells,
	linyunsheng, aleksander.lobakin, linux-kernel, netdev, cl

On Wed, Feb 28, 2024 at 8:06 AM Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> On 2024/2/27 20:55, Eric Dumazet wrote:
> > On Tue, Feb 27, 2024 at 7:29 AM Huang Shijie
> > <shijie@os.amperecomputing.com> wrote:
> >> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
> >> the current skbuff_fclone_cache may not be really allocated; it may
> >> reuse an existing old kmem_cache. In NUMA, the fclone allocated by
> >> alloc_skb_fclone() may be on a remote node.
> > Why is this happening in the first place ? What about skb->head ?
>
> I tested the fclone firstly. I did not test others yet.
>
> I did not check the skb->head yet.
>
> But I ever checked the pfrag's page, it is okay.
>
>
> >
> > Jesper patch [1] motivation was not about NUMA., but about
> > fragmentation and bulk allocations/freeing.
> >
> > TCP fclones are not bulk allocated/freed, so I do not understand what
> > your patch is doing.
> > You need to give more details, and experimental results.
>
> 1.) My NUMA machine:
>
>        node 0 (CPU 0 ~ CPU79):
>
>                       CPU 0 ~  CPU 39 are used as memcached's server
>
>                      CPU 40 ~  CPU 79 are used as memcached's client
>
>        node 1 (CPU 80 ~ CPU160):
>
>                       CPU 80 ~  CPU 119 are used as memcached's server
>
>                      CPU 120 ~  CPU 179 are used as memcached's client
>
>     the kernel is linux-next 20240227
>
>
>   2.) My private patches:
>
>        patch 1 is for slub:
>
>        ---
>   mm/slub.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 5d838ebfa35e..d2ab1e36fd6b 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5691,6 +5691,7 @@ __kmem_cache_alias(const char *name, unsigned int
> size, unsigned int align,
>
>          s = find_mergeable(size, align, flags, name, ctor);
>          if (s) {
> +               printk("[%s] origin:%s, shared :%s\n", __func__, name,
> s->name);
>                  if (sysfs_slab_alias(s, name))
>                          return NULL;
>
> ---------
>
>    This patch is used the check which is the sharing kmem_cache for
> "skbuff_fclone_cache".
>
>    I cannot find the "skbuff_fclone_cache" in /proc/slabinfo.
>
>    From my test, the "pool_workqueue" is the real working kmem_cache.
>
>    The "skbuff_fclone_cache" is just a pointer to "pool_workqueue"
> (pwq_cache).
>
>
>    The following private patch is used to record the fclone allocation:
>
>   ---
>   net/ipv4/tcp.c | 19 +++++++++++++++++++
>   1 file changed, 19 insertions(+)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index c82dc42f57c6..6f31ddcfc017 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -864,6 +864,24 @@ ssize_t tcp_splice_read(struct socket *sock, loff_t
> *ppos,
>   }
>   EXPORT_SYMBOL(tcp_splice_read);
>
> +unsigned long right_num, wrong_num;
> +static void check_fclone(struct sk_buff *skb)
> +{
> +       int n = numa_mem_id(); /* current node */
> +       int node2 = page_to_nid(virt_to_page((unsigned long) skb));
> +
> +       if (n != node2) {
> +               wrong_num++;
> +               if ((wrong_num % 1000) == 999)
> +                       printk(KERN_DEBUG "[%s] current:%d, get from
> :%d, (%ld, %ld, %ld)\n",
> +                               __func__, n, node2, wrong_num,
> right_num, wrong_num * 100 / (wrong_num + right_num));
> +       } else {
> +               right_num++;
> +               if ((right_num % 1000000) == 9999)
> +                       printk("[%s] we received :%ld, %ld\n", __func__,
> right_num, wrong_num);
> +       }
> +}
> +
>   struct sk_buff *tcp_stream_alloc_skb(struct sock *sk, gfp_t gfp,
>                                       bool force_schedule)
>   {
> @@ -884,6 +902,7 @@ struct sk_buff *tcp_stream_alloc_skb(struct sock
> *sk, gfp_t gfp,
>                          skb_reserve(skb, MAX_TCP_HEADER);
>                          skb->ip_summed = CHECKSUM_PARTIAL;
> INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
> +                       check_fclone(skb);
>                          return skb;
>                  }
>                  __kfree_skb(skb);
> --
>
>    Without this v2 patch, I can get the result after the memcached test:
>
> [ 1027.317645] [check_fclone] current:0, get from :1, (7112999, 9076711, 43)
> [ 1027.317653] [check_fclone] current:0, get from :1, (7112999, 9076707, 43)
> [ 1027.804110] [check_fclone] we received :10009999, 7113326
>
> It means nearly 43% fclone is allocated in the remote node.
>
>
>   With this v2 patch,  I can find the "skbuff_fclone_cache" in
> /proc/slabinfo.
>
> The test result shows below:
>
> [  503.357293] [check_fclone] we received :8009999, 0
> [  503.357293] [check_fclone] we received :8009999, 0
> [  503.357305] [check_fclone] we received :8009999, 0
>
> After v2 patch, I cannot see the wrong fclone in remote node.
>
>
> >
> > Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> > on a dual socket
> > host with plenty of available memory.
> > (either sk_buff or skb->head being allocated on the other node).
>
> Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?
>
> If so, I guess there is bug in the slub.
>
>
> > fclones might be allocated from a cpu running on node A, and freed
> > from a cpu running on node B.
> > Maybe SLUB is not properly handling this case ?
>
> Maybe.
>
>
>
> > SLAB_NO_MERGE will avoid merging fclone with kmalloc-512, it does not
> > really help.
> >
> > I think we need help from mm/slub experts, instead of trying to 'fix'
> > networking stacks.
>
> @Christopher
>
> Any idea about this?

I had a simpler bpftrace program to get a histogram of [my_node,
node(sk_buff), node_of(skb->head)], and I can tell that going back to
linux-v6.7 and CONFIG_SLAB=y solved all the issues for me.

99.9999% of SLAB allocations were on the right node.

This looks like a SLUB bug to me.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-27  6:28   ` [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache Huang Shijie
  2024-02-27 12:55     ` Eric Dumazet
@ 2024-02-29 16:56     ` Christoph Lameter (Ampere)
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-02-29 16:56 UTC (permalink / raw)
  To: Huang Shijie
  Cc: kuba, patches, davem, horms, edumazet, ast, dhowells,
	linyunsheng, aleksander.lobakin, linux-kernel, netdev, cl

On Tue, 27 Feb 2024, Huang Shijie wrote:

> Since we do not set FLAG_SKB_NO_MERGE for skbuff_fclone_cache,
> the current skbuff_fclone_cache may not be really allocated; it may
> reuse an existing old kmem_cache. In NUMA, the fclone allocated by
> alloc_skb_fclone() may be on a remote node.

This is not the right approach. If you want to force a local allocation 
you need to use GFP_THISNODE. Merging has nothing to do with locality.
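For concreteness, a sketch of what that suggestion might look like in alloc_skb_fclone() (hypothetical, untested kernel fragment; the thread did not settle on this as the fix):

```c
/* Hypothetical variant of alloc_skb_fclone() that pins the allocation to
 * the local node with __GFP_THISNODE, per the suggestion above.
 *
 * Note the trade-off: __GFP_THISNODE forbids falling back to other
 * nodes, so the allocation can fail under local memory pressure where
 * the NUMA_NO_NODE version would have succeeded remotely. */
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
					       gfp_t priority)
{
	return __alloc_skb(size, priority | __GFP_THISNODE,
			   SKB_ALLOC_FCLONE, numa_node_id());
}
```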


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-28  7:05       ` Shijie Huang
  2024-02-28  9:38         ` Eric Dumazet
@ 2024-02-29 17:00         ` Christoph Lameter (Ampere)
  2024-02-29 17:07           ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-02-29 17:00 UTC (permalink / raw)
  To: Shijie Huang
  Cc: Eric Dumazet, Huang Shijie, kuba, patches, davem, horms, ast,
	dhowells, linyunsheng, aleksander.lobakin, linux-kernel, netdev,
	cl

On Wed, 28 Feb 2024, Shijie Huang wrote:

>> 
>> Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
>> on a dual socket
>> host with plenty of available memory.
>> (either sk_buff or skb->head being allocated on the other node).
>
> Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?
>
> If so, I guess there is bug in the slub.

Merging has nothing to do with memory locality.

>> fclones might be allocated from a cpu running on node A, and freed
>> from a cpu running on node B.
>> Maybe SLUB is not properly handling this case ?
>
> Maybe.

Basic functionality is broken??? Really?

>> I think we need help from mm/slub experts, instead of trying to 'fix'
>> networking stacks.
>
> @Christopher
>
> Any idea about this?


If you want to force a local allocation then use GFP_THISNODE as a flag.

If you do not specify a node or GFP_THISNODE then the slub allocator will 
opportunistically allocate from other nodes to avoid fragmentation of 
slabs. The page allocator also will sporadically go off node in order to 
avoid reclaim. The page allocator may go off node extensively if there is 
an imbalance of allocations between nodes. The page allocator has knobs 
to tune off-node allocation vs reclaim options. Doing more reclaim will 
slow things down but give you local data.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-29 17:00         ` Christoph Lameter (Ampere)
@ 2024-02-29 17:07           ` Eric Dumazet
  2024-02-29 17:57             ` Christoph Lameter (Ampere)
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2024-02-29 17:07 UTC (permalink / raw)
  To: Christoph Lameter (Ampere)
  Cc: Shijie Huang, Huang Shijie, kuba, patches, davem, horms, ast,
	dhowells, linyunsheng, aleksander.lobakin, linux-kernel, netdev,
	cl

On Thu, Feb 29, 2024 at 6:01 PM Christoph Lameter (Ampere) <cl@linux.com> wrote:
>
> On Wed, 28 Feb 2024, Shijie Huang wrote:
>
> >>
> >> Using SLAB_NO_MERGE does not help, I am still seeing wrong allocations
> >> on a dual socket
> >> host with plenty of available memory.
> >> (either sk_buff or skb->head being allocated on the other node).
> >
> > Do you mean you still can see the wrong fclone after using SLAB_NO_MERGE?
> >
> > If so, I guess there is bug in the slub.
>
> Mergin has nothing to do with memory locality.
>
> >> fclones might be allocated from a cpu running on node A, and freed
> >> from a cpu running on node B.
> >> Maybe SLUB is not properly handling this case ?
> >
> > Maybe.
>
> Basic functionality is broken??? Really?

It seems so.

>
> >> I think we need help from mm/slub experts, instead of trying to 'fix'
> >> networking stacks.
> >
> > @Christopher
> >
> > Any idea about this?
>
>
> If you want to force a local allocation then use GFP_THISNODE as a flag.
>
> If you do not specify a node or GFP_THISNODE then the slub allocator will
> opportunistically allocate sporadically from other nodes to avoid
> fragmentation of slabs. The page allocator also will sporadically go off
> node in order to avoid reclaim. The page allocator may go off node
> extensively if there is a imbalance of allocation between node. The page
> allocator has knobs to tune off node vs reclaim options. Doing more
> reclaim will slow things down but give you local data.

Maybe, maybe not.

Going back to CONFIG_SLAB=y removes all mismatches, without having to
use GFP_THISNODE at all,
on hosts with plenty of available memory on all nodes.

I think that is some kind of evidence that something is broken in SLUB land.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache
  2024-02-29 17:07           ` Eric Dumazet
@ 2024-02-29 17:57             ` Christoph Lameter (Ampere)
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Lameter (Ampere) @ 2024-02-29 17:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shijie Huang, Huang Shijie, kuba, patches, davem, horms, ast,
	dhowells, linyunsheng, aleksander.lobakin, linux-kernel, netdev

On Thu, 29 Feb 2024, Eric Dumazet wrote:

>> If you do not specify a node or GFP_THISNODE then the slub allocator will
>> opportunistically allocate sporadically from other nodes to avoid
>> fragmentation of slabs. The page allocator also will sporadically go off
>> node in order to avoid reclaim. The page allocator may go off node
>> extensively if there is a imbalance of allocation between node. The page
>> allocator has knobs to tune off node vs reclaim options. Doing more
>> reclaim will slow things down but give you local data.
>
> Maybe, maybe not.
>
> Going back to CONFIG_SLAB=y removes all mismatches, without having to
> use GFP_THISNODE at all,
> on hosts with plenty of available memory on all nodes.


Slab uses GFP_THISNODE by default and does not respect the memory policies 
etc. set for pages. As such it causes additional overhead through reclaim 
passes etc., and memory policies are not applied on a per-page level (as 
specified) but in its own layer on a per-object basis. It causes 
additional fragmentation.

> I think that is some kind of evidence that something is broken in SLUB land.

That is one of the reasons that SLAB was removed.

Slub defragmentation can be disabled by either GFP_THISNODE or by tuning 
the remote_node_defrag_ratio knob in /sys/kernel/slab/<slabname>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-02-29 17:57 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-20  2:18 [PATCH] net: skbuff: allocate the fclone in the current NUMA node Huang Shijie
2024-02-20  5:32 ` Eric Dumazet
2024-02-20  6:26   ` Shijie Huang
2024-02-20  8:17     ` Eric Dumazet
2024-02-20  8:37       ` Shijie Huang
2024-02-24 19:07         ` Eric Dumazet
2024-02-26 10:18           ` Jesper Dangaard Brouer
2024-02-26 10:29             ` Eric Dumazet
2024-02-27  6:28   ` [PATCH v2] net: skbuff: set FLAG_SKB_NO_MERGE for skbuff_fclone_cache Huang Shijie
2024-02-27 12:55     ` Eric Dumazet
2024-02-27 13:15       ` Eric Dumazet
2024-02-28  7:05       ` Shijie Huang
2024-02-28  9:38         ` Eric Dumazet
2024-02-29 17:00         ` Christoph Lameter (Ampere)
2024-02-29 17:07           ` Eric Dumazet
2024-02-29 17:57             ` Christoph Lameter (Ampere)
2024-02-29 16:56     ` Christoph Lameter (Ampere)
2024-02-26 10:10 ` [PATCH] net: skbuff: allocate the fclone in the current NUMA node Alexander Lobakin
2024-02-27  6:30   ` Shijie Huang
