* [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Pavel Begunkov @ 2024-03-18  0:44 UTC
  To: netdev, edumazet, davem, dsahern, pabeni, kuba; +Cc: Pavel Begunkov

Optimise skb_attempt_defer_free() when it runs on the same CPU the skb
was allocated on. Instead of __kfree_skb() -> kmem_cache_free(), we can
disable softirqs and put the buffer into the CPU-local caches.

CPU-bound TCP ping-pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross-checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note
that I'd expect the win to double with rx-only benchmarks, as the
optimisation targets the receive path, but the test spends >55% of its
CPU time doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

v2: pass @napi_safe=true by using __napi_kfree_skb()

 net/core/skbuff.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b99127712e67..35d37ae70a3d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static void kfree_skb_napi_cache(struct sk_buff *skb)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	local_bh_disable();
+	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
+	local_bh_enable();
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
 	    cpu == raw_smp_processor_id()) {
-nodefer:	__kfree_skb(skb);
+nodefer:	kfree_skb_napi_cache(skb);
 		return;
 	}
 
-- 
2.44.0



* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Jason Xing @ 2024-03-18  2:44 UTC
  To: Pavel Begunkov; +Cc: netdev, edumazet, davem, dsahern, pabeni, kuba

On Mon, Mar 18, 2024 at 8:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,

I doubt that we can stably gain this improvement. The reason I ask is
that the result might come from some factor of chance.

> I'd expect the win doubled with rx only benchmarks, as the optimisation
> is for the receive path, but the test spends >55% of CPU doing writes.

I wonder how you ran this test. Could you tell us more, please?

>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: pass @napi_safe=true by using __napi_kfree_skb()
>
>  net/core/skbuff.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b99127712e67..35d37ae70a3d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +       /* if SKB is a clone, don't handle this case */
> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +               __kfree_skb(skb);
> +               return;
> +       }
> +
> +       local_bh_disable();
> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);

__napi_kfree_skb() here isn't really dropping the skb in the rx path,
I think. How about replacing the reason with SKB_CONSUMED, like
napi_skb_finish() does?
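
For reference, the branch I have in mind in napi_skb_finish() looks
roughly like this (quoted from memory, so please check net/core/gro.c
for the authoritative version):

	case GRO_MERGED_FREE:
		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
			napi_skb_free_stolen_head(skb);
		else if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
			__kfree_skb(skb);
		else
			__napi_kfree_skb(skb, SKB_CONSUMED);
		break;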

Thanks,
Jason

> +       local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>         if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>             !cpu_online(cpu) ||
>             cpu == raw_smp_processor_id()) {
> -nodefer:       __kfree_skb(skb);
> +nodefer:       kfree_skb_napi_cache(skb);
>                 return;
>         }
>
> --
> 2.44.0
>
>


* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Pavel Begunkov @ 2024-03-18  3:20 UTC
  To: Jason Xing; +Cc: netdev, edumazet, davem, dsahern, pabeni, kuba

On 3/18/24 02:44, Jason Xing wrote:
> On Mon, Mar 18, 2024 at 8:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> 
> I suspect that we can stably gain this improvement. The reason why I
> ask is because it might be caused by some factor of chance.

I guess the kernel might even be booting only by some factor of
chance, but no, the testing was quite stable :)

>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>> is for the receive path, but the test spends >55% of CPU doing writes.
> 
> I wonder how you did this test? Could you tell us more, please.

Well, the way I did it: choose a NIC, redirect all its IRQs
to a single CPU of your choice, and taskset your rx side to that
CPU. You might want to make sure there isn't much traffic
apart from the test program; I was lucky to have 2 NICs
in the system. Instead of redirecting IRQs you can also try to
configure steering.

I used netbench [1] for both rx and tx, but that shouldn't be
important; as long as you drive enough traffic, you can try
iperf or something else.

[1] https://github.com/DylanZA/netbench
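
For example, pinning the rx process from inside the benchmark is just
a few lines of userspace C; a minimal sketch (rx_cpu stands for
whichever CPU you pointed the NIC IRQs at):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to rx_cpu, the same CPU that takes the NIC
 * IRQs. Equivalent to `taskset -c <rx_cpu>`; the IRQ affinity itself
 * is set separately, e.g. via /proc/irq/<n>/smp_affinity_list.
 */
static int pin_to_cpu(int rx_cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(rx_cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}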


>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>
>>   net/core/skbuff.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index b99127712e67..35d37ae70a3d 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>   EXPORT_SYMBOL(__skb_ext_put);
>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +       /* if SKB is a clone, don't handle this case */
>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +               __kfree_skb(skb);
>> +               return;
>> +       }
>> +
>> +       local_bh_disable();
>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> 
> __napi_kfree_skb() doesn't care much about why we drop in the rx path,
> I think. How about replacing it with SKB_CONSUMED like
> napi_skb_finish() does?

I'm sticking with the current behaviour here: __kfree_skb(),
which this replaces, passes SKB_DROP_REASON_NOT_SPECIFIED.

I can make the change if maintainers ack it, but it should
probably be done in a separate patch with a proper
explanation.

-- 
Pavel Begunkov


* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Eric Dumazet @ 2024-03-18 10:11 UTC
  To: Pavel Begunkov; +Cc: netdev, davem, dsahern, pabeni, kuba

On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> I'd expect the win doubled with rx only benchmarks, as the optimisation
> is for the receive path, but the test spends >55% of CPU doing writes.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: pass @napi_safe=true by using __napi_kfree_skb()
>
>  net/core/skbuff.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b99127712e67..35d37ae70a3d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +       /* if SKB is a clone, don't handle this case */
> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +               __kfree_skb(skb);
> +               return;
> +       }
> +
> +       local_bh_disable();
> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> +       local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>         if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>             !cpu_online(cpu) ||
>             cpu == raw_smp_processor_id()) {
> -nodefer:       __kfree_skb(skb);
> +nodefer:       kfree_skb_napi_cache(skb);
>                 return;
>         }
>
> --
> 2.44.0
>

1) net-next is currently closed.
2) No NUMA awareness. SLUB does not guarantee the sk_buff was
allocated on the correct node.
3) Given that many skbs (like TCP ACKs) are freed using __kfree_skb(),
I wonder why caching the sk_buff only in this particular path is needed.

Why not change __kfree_skb() instead?

All these helpers are becoming a maze.


* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Pavel Begunkov @ 2024-03-18 11:41 UTC
  To: Eric Dumazet; +Cc: netdev, davem, dsahern, pabeni, kuba

On 3/18/24 10:11, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>> is for the receive path, but the test spends >55% of CPU doing writes.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>
>>   net/core/skbuff.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index b99127712e67..35d37ae70a3d 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>   EXPORT_SYMBOL(__skb_ext_put);
>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +       /* if SKB is a clone, don't handle this case */
>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +               __kfree_skb(skb);
>> +               return;
>> +       }
>> +
>> +       local_bh_disable();
>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>> +       local_bh_enable();
>> +}
>> +
>>   /**
>>    * skb_attempt_defer_free - queue skb for remote freeing
>>    * @skb: buffer
>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>              !cpu_online(cpu) ||
>>              cpu == raw_smp_processor_id()) {
>> -nodefer:       __kfree_skb(skb);
>> +nodefer:       kfree_skb_napi_cache(skb);
>>                  return;
>>          }
>>
>> --
>> 2.44.0
>>
> 
> 1) net-next is currently closed.

Ok

> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
> correct node.

Let me see if I read you right. You're saying that SLUB can
allocate an skb from a different node, so skb->alloc_cpu
might not be indicative of the node, and so we might locally
cache an skb from a foreign NUMA node?

If that's the case, I don't see how it's different from the
cpu != raw_smp_processor_id() path, which will transfer the
skb to another CPU and still put it into the local cache in
softirq.
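
FWIW, if it turns out to be a genuine concern, a node check on top of
this patch would be small. A purely hypothetical, untested sketch for
the head of kfree_skb_napi_cache():

	/* hypothetical: fall back to a plain free when the skb's head
	 * page sits on a remote NUMA node, so the CPU-local cache only
	 * ever holds node-local buffers
	 */
	if (page_to_nid(virt_to_head_page(skb->head)) != numa_node_id()) {
		__kfree_skb(skb);
		return;
	}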


> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
> why trying to cache the sk_buff in this particular path is needed.
> 
> Why not change __kfree_skb() instead ?

IIRC kfree_skb() can be called from any context, including with irqs
off, so it's convenient to have a function that just does the job
without too much extra care. Theoretically it could take a separate
path internally based on irqs_disabled(), but that would be ugly.
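
Something along these lines, purely hypothetical and untested:

/* hypothetical sketch, not proposed code */
static void kfree_skb_any_context(struct sk_buff *skb)
{
	/* the napi caches are protected by BH, so they are only safe
	 * to touch from process/softirq context with hardirqs on
	 */
	if (irqs_disabled() || in_hardirq())
		__kfree_skb(skb);
	else
		kfree_skb_napi_cache(skb);
}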

-- 
Pavel Begunkov


* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Pavel Begunkov @ 2024-03-18 14:14 UTC
  To: Eric Dumazet; +Cc: netdev, davem, dsahern, pabeni, kuba

On 3/18/24 11:41, Pavel Begunkov wrote:
> On 3/18/24 10:11, Eric Dumazet wrote:
>> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>
>>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>>> disable softirqs and put the buffer into cpu local caches.
>>>
>>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>>> is for the receive path, but the test spends >55% of CPU doing writes.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>>>
>>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>>
>>>   net/core/skbuff.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index b99127712e67..35d37ae70a3d 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>>   EXPORT_SYMBOL(__skb_ext_put);
>>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>>
>>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>>> +{
>>> +       /* if SKB is a clone, don't handle this case */
>>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>>> +               __kfree_skb(skb);
>>> +               return;
>>> +       }
>>> +
>>> +       local_bh_disable();
>>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>> +       local_bh_enable();
>>> +}
>>> +
>>>   /**
>>>    * skb_attempt_defer_free - queue skb for remote freeing
>>>    * @skb: buffer
>>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>>              !cpu_online(cpu) ||
>>>              cpu == raw_smp_processor_id()) {
>>> -nodefer:       __kfree_skb(skb);
>>> +nodefer:       kfree_skb_napi_cache(skb);
>>>                  return;
>>>          }
>>>
>>> -- 
>>> 2.44.0
>>>
>>
>> 1) net-next is currently closed.
> 
> Ok
> 
>> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
>> correct node.
> 
> Let me see if I read you right. You're saying that SLUB can
> allocate an skb from a different node, so skb->alloc_cpu
> might be not indicative of the node, and so we might locally
> cache an skb of a foreign numa node?
> 
> If that's the case I don't see how it's different from the
> cpu != raw_smp_processor_id() path, which will transfer the
> skb to another cpu and still put it in the local cache in
> softirq.
> 
> 
>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
>> why trying to cache the sk_buff in this particular path is needed.
>>
>> Why not change __kfree_skb() instead ?
> 
> IIRC kfree_skb() can be called from any context including irqoff,

On the other hand, there is napi_pp_put_page() inside, which
assumes hardirq off, so maybe not; I'd appreciate it if someone
could clarify. I was sure that drivers allocating in hardirq
via e.g. __netdev_alloc_skb() might want to use kfree_skb(),
but maybe they consistently stick to dev_kfree_sk*().


> it's convenient to have a function that just does the job without
> too much of extra care. Theoretically it can have a separate path
> inside based on irqs_disabled(), but that would be ugly.

-- 
Pavel Begunkov


* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Eric Dumazet @ 2024-03-18 14:33 UTC
  To: Pavel Begunkov; +Cc: netdev, davem, dsahern, pabeni, kuba

On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/18/24 10:11, Eric Dumazet wrote:
> > On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> >> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> >> disable softirqs and put the buffer into cpu local caches.
> >>
> >> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> >> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> >> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> >> I'd expect the win doubled with rx only benchmarks, as the optimisation
> >> is for the receive path, but the test spends >55% of CPU doing writes.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>
> >> v2: pass @napi_safe=true by using __napi_kfree_skb()
> >>
> >>   net/core/skbuff.c | 15 ++++++++++++++-
> >>   1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index b99127712e67..35d37ae70a3d 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
> >>   EXPORT_SYMBOL(__skb_ext_put);
> >>   #endif /* CONFIG_SKB_EXTENSIONS */
> >>
> >> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> >> +{
> >> +       /* if SKB is a clone, don't handle this case */
> >> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> >> +               __kfree_skb(skb);
> >> +               return;
> >> +       }
> >> +
> >> +       local_bh_disable();
> >> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> >> +       local_bh_enable();
> >> +}
> >> +
> >>   /**
> >>    * skb_attempt_defer_free - queue skb for remote freeing
> >>    * @skb: buffer
> >> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
> >>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
> >>              !cpu_online(cpu) ||
> >>              cpu == raw_smp_processor_id()) {
> >> -nodefer:       __kfree_skb(skb);
> >> +nodefer:       kfree_skb_napi_cache(skb);
> >>                  return;
> >>          }
> >>
> >> --
> >> 2.44.0
> >>
> >
> > 1) net-next is currently closed.
>
> Ok
>
> > 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
> > correct node.
>
> Let me see if I read you right. You're saying that SLUB can
> allocate an skb from a different node, so skb->alloc_cpu
> might be not indicative of the node, and so we might locally
> cache an skb of a foreign numa node?
>
> If that's the case I don't see how it's different from the
> cpu != raw_smp_processor_id() path, which will transfer the
> skb to another cpu and still put it in the local cache in
> softirq.

The big win for skb_attempt_defer_free() is for the many pages attached
to incoming TCP GRO packets.

A typical GRO packet can have 45 page fragments.

Pages are not recycled by a NIC if the NUMA node does not match.

If the win were only for the sk_buff itself, I think we should instead
ask the SLUB maintainers why yet another cache is needed on top of the
SLUB fast path!


>
>
> > 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
> > why trying to cache the sk_buff in this particular path is needed.
> >
> > Why not change __kfree_skb() instead ?
>
> IIRC kfree_skb() can be called from any context including irqoff,
> it's convenient to have a function that just does the job without
> too much of extra care. Theoretically it can have a separate path
> inside based on irqs_disabled(), but that would be ugly.

Well, adding one case here, another there, and another somewhere else
is also ugly.


* Re: [PATCH net-next v2] net: cache for same cpu skb_attempt_defer_free
From: Pavel Begunkov @ 2024-03-18 23:12 UTC
  To: Eric Dumazet; +Cc: netdev, davem, dsahern, pabeni, kuba

On 3/18/24 14:33, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 3/18/24 10:11, Eric Dumazet wrote:
>>> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>>>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>>>> disable softirqs and put the buffer into cpu local caches.
>>>>
>>>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>>>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>>>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>>>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>>>> is for the receive path, but the test spends >55% of CPU doing writes.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>>>
>>>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>>>
>>>>    net/core/skbuff.c | 15 ++++++++++++++-
>>>>    1 file changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index b99127712e67..35d37ae70a3d 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>>>    EXPORT_SYMBOL(__skb_ext_put);
>>>>    #endif /* CONFIG_SKB_EXTENSIONS */
>>>>
>>>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>>>> +{
>>>> +       /* if SKB is a clone, don't handle this case */
>>>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>>>> +               __kfree_skb(skb);
>>>> +               return;
>>>> +       }
>>>> +
>>>> +       local_bh_disable();
>>>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>>> +       local_bh_enable();
>>>> +}
>>>> +
>>>>    /**
>>>>     * skb_attempt_defer_free - queue skb for remote freeing
>>>>     * @skb: buffer
>>>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>>>           if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>>>               !cpu_online(cpu) ||
>>>>               cpu == raw_smp_processor_id()) {
>>>> -nodefer:       __kfree_skb(skb);
>>>> +nodefer:       kfree_skb_napi_cache(skb);
>>>>                   return;
>>>>           }
>>>>
>>>> --
>>>> 2.44.0
>>>>
>>>
>>> 1) net-next is currently closed.
>>
>> Ok
>>
>>> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
>>> correct node.
>>
>> Let me see if I read you right. You're saying that SLUB can
>> allocate an skb from a different node, so skb->alloc_cpu
>> might be not indicative of the node, and so we might locally
>> cache an skb of a foreign numa node?
>>
>> If that's the case I don't see how it's different from the
>> cpu != raw_smp_processor_id() path, which will transfer the
>> skb to another cpu and still put it in the local cache in
>> softirq.
> 
> The big win for skb_attempt_defer_free() is for the many pages that are attached
> to TCP incoming GRO packets.
> 
> A typical GRO packet can have 45 page fragments.

That's great, and probably a dominating optimisation for
GRO; however, this is a path that all TCP reads go through,
large GRO'ed skbs or not. Even if there is a single frag, the
patch at least enables the skb cache and gives the frag a
chance to hit the page pool's lockless cache.

> Pages are not recycled by a NIC if the NUMA node does not match.

Great. Looking it up, it seems that's only true for pages ending
up in the page pool's ptr_ring, not for those recycled directly.

> If the win was only for sk_buff itself, I think we should have asked
> SLUB maintainers
> why SLUB needs another cache to optimize SLUB fast cache !

Well, it's just slow for hot paths; not awfully slow, but enough
to be noticeable and take 1-2% per allocation per request. There
are caches in io_uring because allocation was slow, there are
caches in the block layer, and there are skb and other caches in
net, I assume all for the same reason. It's not like I'm adding a
new one.
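
For reference, the put side of the cache in question is essentially an
array store plus an occasional bulk flush; condensed from my reading of
net/core/skbuff.c, with KASAN hooks and other details omitted, so treat
it as a sketch rather than the verbatim code:

static void napi_skb_cache_put(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	nc->skb_cache[nc->skb_count++] = skb;
	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		/* return half of the cache to slab in one bulk call */
		kmem_cache_free_bulk(skbuff_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}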

I remember someone was poking into similar questions, but I
haven't seen any drastic SLUB performance changes.

Eric, I'm seriously getting lost in the arguments and don't
really see what you're ultimately against. Summing up topics:

1) Is there a NUMA-awareness problem in the patch that I missed?
If so, can you elaborate? I'll try to address it.
Is it a regression compared to the current state, or something
that would merely be nice to have?

2) Do you trust the numbers for the test described? I.e. the
rx server has multiple connections, each doing small-packet
(<100B) ping-pong style traffic. It's pinned to a CPU, and all
completions are ensured to run on that CPU.

3) Do you agree that it's a valid use case? Or are you shrugging
it off as something that nobody cares about?

4) Maybe you're against the design?


>>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
>>> why trying to cache the sk_buff in this particular path is needed.
>>>
>>> Why not change __kfree_skb() instead ?
>>
>> IIRC kfree_skb() can be called from any context including irqoff,
>> it's convenient to have a function that just does the job without
>> too much of extra care. Theoretically it can have a separate path
>> inside based on irqs_disabled(), but that would be ugly.
> 
> Well, adding one case here, another here, and another here is also ugly.

And FWIW, irqs_disabled() is not great for hot paths. Not to the
extent of irq_disable(), but it still trashes the pipeline a bit,
or so IIRC.

-- 
Pavel Begunkov
