[v4,net-next,09/11] skbuff: allow to optionally use NAPI cache from __alloc_skb()

Message ID 20210210162732.80467-10-alobakin@pm.me
State New, archived
Series
  • skbuff: introduce skbuff_heads bulking and reusing

Commit Message

Alexander Lobakin Feb. 10, 2021, 4:30 p.m. UTC
Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
an skbuff_head from the NAPI cache instead of in-place allocation
inside __alloc_skb().
This implies that the function is called from softirq or BH-off
context, not for allocating a clone or from a distant node.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
---
 net/core/skbuff.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
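For context, napi_skb_cache_get() is introduced earlier in this series and is not shown in this patch. A minimal userspace model of how such a per-CPU head cache behaves might look like the sketch below; the cache sizes are illustrative, and malloc() stands in for the kernel's kmem_cache_alloc_bulk() refill machinery:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-ins for the series' per-CPU cache constants. */
#define NAPI_SKB_CACHE_SIZE 64
#define NAPI_SKB_CACHE_BULK 16

static void *skb_cache[NAPI_SKB_CACHE_SIZE];
static unsigned int skb_count;

static void *napi_skb_cache_get(void)
{
	/* On an empty cache, refill in bulk, as the kernel version does
	 * with kmem_cache_alloc_bulk(); malloc() stands in here. */
	if (skb_count == 0) {
		for (int i = 0; i < NAPI_SKB_CACHE_BULK; i++) {
			void *p = malloc(256);

			if (!p)
				break;
			skb_cache[skb_count++] = p;
		}
		if (skb_count == 0)
			return NULL;
	}
	return skb_cache[--skb_count];
}
```

The point of the cache is that, in softirq context, heads freed by napi_consume_skb() can be handed straight back out without a round trip through the slab allocator.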

Comments

Paolo Abeni Feb. 11, 2021, 10:16 a.m. UTC | #1
On Wed, 2021-02-10 at 16:30 +0000, Alexander Lobakin wrote:
> Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
> an skbuff_head from the NAPI cache instead of in-place allocation
> inside __alloc_skb().
> This implies that the function is called from softirq or BH-off
> context, not for allocating a clone or from a distant node.
> 
> Signed-off-by: Alexander Lobakin <alobakin@pm.me>
> ---
>  net/core/skbuff.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9e1a8ded4acc..750fa1825b28 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	struct sk_buff *skb;
>  	u8 *data;
>  	bool pfmemalloc;
> +	bool clone;
>  
> -	cache = (flags & SKB_ALLOC_FCLONE)
> -		? skbuff_fclone_cache : skbuff_head_cache;
> +	clone = !!(flags & SKB_ALLOC_FCLONE);
> +	cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
>  
>  	if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
>  		gfp_mask |= __GFP_MEMALLOC;
>  
>  	/* Get the HEAD */
> -	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> +	if (!clone && (flags & SKB_ALLOC_NAPI) &&
> +	    likely(node == NUMA_NO_NODE || node == numa_mem_id()))
> +		skb = napi_skb_cache_get();
> +	else
> +		skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
>  	if (unlikely(!skb))
>  		return NULL;
>  	prefetchw(skb);

I hope the opt-in thing would have allowed leaving this code unchanged.
I see it's not trivial to avoid touching this code path.
Still, I think it would be nice if you were able to let the device
driver use the cache without touching the above, which is also used
e.g. by the TCP xmit path, which in turn will not leverage the cache
(as it requires FCLONE skbs).

If I read correctly, the above chunk is needed to
allow __napi_alloc_skb() to access the cache even for small skb
allocation. Good device drivers should not call alloc_skb() in the fast
path.

What about changing __napi_alloc_skb() to always use
the __napi_build_skb(), for both kmalloc and page backed skbs? That is,
always doing the 'data' allocation in __napi_alloc_skb() - either via
page_frag or via kmalloc() - and then call __napi_build_skb().

I think that should avoid adding more checks in __alloc_skb() and
should probably reduce the number of conditionals used
by __napi_alloc_skb().

Thanks!

Paolo
Alexander Lobakin Feb. 11, 2021, 2:28 p.m. UTC | #2
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 11 Feb 2021 11:16:40 +0100

> On Wed, 2021-02-10 at 16:30 +0000, Alexander Lobakin wrote:
> > Reuse the old and forgotten SKB_ALLOC_NAPI to add an option to get
> > an skbuff_head from the NAPI cache instead of in-place allocation
> > inside __alloc_skb().
> > This implies that the function is called from softirq or BH-off
> > context, not for allocating a clone or from a distant node.
> > 
> > Signed-off-by: Alexander Lobakin <alobakin@pm.me>
> > ---
> >  net/core/skbuff.c | 13 +++++++++----
> >  1 file changed, 9 insertions(+), 4 deletions(-)
> > 
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 9e1a8ded4acc..750fa1825b28 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -397,15 +397,20 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  	struct sk_buff *skb;
> >  	u8 *data;
> >  	bool pfmemalloc;
> > +	bool clone;
> >  
> > -	cache = (flags & SKB_ALLOC_FCLONE)
> > -		? skbuff_fclone_cache : skbuff_head_cache;
> > +	clone = !!(flags & SKB_ALLOC_FCLONE);
> > +	cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
> >  
> >  	if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
> >  		gfp_mask |= __GFP_MEMALLOC;
> >  
> >  	/* Get the HEAD */
> > -	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
> > +	if (!clone && (flags & SKB_ALLOC_NAPI) &&
> > +	    likely(node == NUMA_NO_NODE || node == numa_mem_id()))
> > +		skb = napi_skb_cache_get();
> > +	else
> > +		skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
> >  	if (unlikely(!skb))
> >  		return NULL;
> >  	prefetchw(skb);
> 
> I hope the opt-in thing would have allowed leaving this code unchanged.
> I see it's not trivial to avoid touching this code path.
> Still, I think it would be nice if you were able to let the device
> driver use the cache without touching the above, which is also used
> e.g. by the TCP xmit path, which in turn will not leverage the cache
> (as it requires FCLONE skbs).
> 
> If I read correctly, the above chunk is needed to
> allow __napi_alloc_skb() to access the cache even for small skb
> allocation.

Not only. I wanted to give the ability to access the new feature
through __alloc_skb() too, not only through napi_build_skb() or
napi_alloc_skb().
And not only for drivers. As you may remember, napi_consume_skb()'s
batching system first landed for drivers, but then it got used in
network core code.
I think that some core parts may benefit from reusing the NAPI
caches. We'll only see it later.

It's not as complex as it may seem. The NUMA check is cheap and tends
to be true for the vast majority of cases. The check for fclone is
already present in the baseline code, even twice in the function.
So it's mostly about (flags & SKB_ALLOC_NAPI).
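The guarding condition itself is small and can be checked in isolation. In the userspace model below, the flag values and NUMA helpers are illustrative stand-ins for the kernel's; the predicate mirrors the patched head-allocation branch in __alloc_skb():

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel flag bits and NUMA helpers (values illustrative). */
#define SKB_ALLOC_FCLONE 0x01
#define SKB_ALLOC_NAPI   0x04
#define NUMA_NO_NODE     (-1)

static int numa_mem_id(void) { return 0; } /* pretend the CPU sits on node 0 */

/* The condition guarding the NAPI-cache path in the patched __alloc_skb(). */
static bool takes_napi_cache(int flags, int node)
{
	bool clone = !!(flags & SKB_ALLOC_FCLONE);

	return !clone && (flags & SKB_ALLOC_NAPI) &&
	       (node == NUMA_NO_NODE || node == numa_mem_id());
}
```

Clones and distant-node requests fall through to kmem_cache_alloc_node() exactly as before; only opted-in, local, non-fclone allocations touch the cache.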

> Good device drivers should not call alloc_skb() in the fast
> path.

Not really. Several enterprise NIC drivers use __alloc_skb() and
alloc_skb(): Chelsio and Mellanox for inline TLS, Netronome, etc.
Lots of RDMA and wireless drivers (not the legacy ones), too.
__alloc_skb() gives you more control over the NUMA node and the needed
skb headroom, so it's still sometimes useful in drivers.

> What about changing __napi_alloc_skb() to always use
> the __napi_build_skb(), for both kmalloc and page backed skbs? That is,
> always doing the 'data' allocation in __napi_alloc_skb() - either via
> page_frag or via kmalloc() - and then call __napi_build_skb().
>
> I think that should avoid adding more checks in __alloc_skb() and
> should probably reduce the number of conditionals used
> by __napi_alloc_skb().

I thought of this too. But this will introduce a conditional branch
to set or not set skb->head_frag. So one branch less in __alloc_skb(),
one branch more here, and we also lose the ability to __alloc_skb()
with a decached head.

> Thanks!
> 
> Paolo

Thanks,
Al
Paolo Abeni Feb. 11, 2021, 2:55 p.m. UTC | #3
On Thu, 2021-02-11 at 14:28 +0000, Alexander Lobakin wrote:
> From: Paolo Abeni <pabeni@redhat.com> on Thu, 11 Feb 2021 11:16:40 +0100 wrote:
> > What about changing __napi_alloc_skb() to always use
> > the __napi_build_skb(), for both kmalloc and page backed skbs? That is,
> > always doing the 'data' allocation in __napi_alloc_skb() - either via
> > page_frag or via kmalloc() - and then call __napi_build_skb().
> > 
> > I think that should avoid adding more checks in __alloc_skb() and
> > should probably reduce the number of conditionals used
> > by __napi_alloc_skb().
> 
> I thought of this too. But this will introduce a conditional branch
> to set or not set skb->head_frag. So one branch less in __alloc_skb(),
> one branch more here, and we also lose the ability to __alloc_skb()
> with a decached head.

Just to try to be clear, I mean something like the following (not even
build tested). In the fast path it has fewer branches than the current
code - for both kmalloc and page_frag allocation.

---
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 785daff48030..a242fbe4730e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -506,23 +506,12 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc;
+	bool head_frag, pfmemalloc;
 	struct sk_buff *skb;
 	void *data;
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	/* If requested length is either too small or too big,
-	 * we use kmalloc() for skb->head allocation.
-	 */
-	if (len <= SKB_WITH_OVERHEAD(1024) ||
-	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
-	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
-		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
-		if (!skb)
-			goto skb_fail;
-		goto skb_success;
-	}
-
 	nc = this_cpu_ptr(&napi_alloc_cache);
 	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	len = SKB_DATA_ALIGN(len);
@@ -530,25 +519,34 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = page_frag_alloc(&nc->page, len, gfp_mask);
+	if (len <= SKB_WITH_OVERHEAD(1024) ||
+            len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
+            (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+		data = kmalloc_reserve(len, gfp_mask, NUMA_NO_NODE, &pfmemalloc);
+		head_frag = 0;
+		len = 0;
+	} else {
+		data = page_frag_alloc(&nc->page, len, gfp_mask);
+		pfmemalloc = nc->page.pfmemalloc;
+		head_frag = 1;
+	}
 	if (unlikely(!data))
 		return NULL;
 
 	skb = __build_skb(data, len);
 	if (unlikely(!skb)) {
-		skb_free_frag(data);
+		if (head_frag)
+			skb_free_frag(data);
+		else
+			kfree(data);
 		return NULL;
 	}
 
-	if (nc->page.pfmemalloc)
-		skb->pfmemalloc = 1;
-	skb->head_frag = 1;
+	skb->pfmemalloc = pfmemalloc;
+	skb->head_frag = head_frag;
 
-skb_success:
 	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
 	skb->dev = napi->dev;
-
-skb_fail:
 	return skb;
 }
 EXPORT_SYMBOL(__napi_alloc_skb);
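The size dispatch in the sketch above can also be checked in isolation. In the model below, the overhead constant is an assumed stand-in for SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) on a 4 KiB-page system; the real value depends on the architecture and kernel config:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; the real values are architecture-dependent. */
#define PAGE_SIZE            4096u
#define SHINFO_OVERHEAD      320u  /* assumed SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) */
#define SKB_WITH_OVERHEAD(x) ((x) - SHINFO_OVERHEAD)

/* Mirrors the dispatch in the sketch: small or oversized requests, and
 * callers that pass __GFP_DIRECT_RECLAIM or GFP_DMA, go to the kmalloc
 * path; everything else is served from the per-CPU page_frag. */
static bool use_kmalloc(unsigned int len, bool reclaim_or_dma)
{
	return len <= SKB_WITH_OVERHEAD(1024) ||
	       len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
	       reclaim_or_dma;
}
```

A typical MTU-sized receive (plus NET_SKB_PAD and shared-info alignment) lands in the page_frag path, which is the case the fast path optimizes for.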
Alexander Lobakin Feb. 11, 2021, 3:26 p.m. UTC | #4
From: Paolo Abeni <pabeni@redhat.com>
Date: Thu, 11 Feb 2021 15:55:04 +0100

> On Thu, 2021-02-11 at 14:28 +0000, Alexander Lobakin wrote:
> > From: Paolo Abeni <pabeni@redhat.com> on Thu, 11 Feb 2021 11:16:40 +0100 wrote:
> > > What about changing __napi_alloc_skb() to always use
> > > the __napi_build_skb(), for both kmalloc and page backed skbs? That is,
> > > always doing the 'data' allocation in __napi_alloc_skb() - either via
> > > page_frag or via kmalloc() - and then call __napi_build_skb().
> > > 
> > > I think that should avoid adding more checks in __alloc_skb() and
> > > should probably reduce the number of conditionals used
> > > by __napi_alloc_skb().
> > 
> > I thought of this too. But this will introduce a conditional branch
> > to set or not set skb->head_frag. So one branch less in __alloc_skb(),
> > one branch more here, and we also lose the ability to __alloc_skb()
> > with a decached head.
> 
> Just to try to be clear, I mean something like the following (not even
> build tested). In the fast path it has fewer branches than the current
> code - for both kmalloc and page_frag allocation.
> 
> ---
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 785daff48030..a242fbe4730e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -506,23 +506,12 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>  				 gfp_t gfp_mask)
>  {
>  	struct napi_alloc_cache *nc;
> +	bool head_frag, pfmemalloc;
>  	struct sk_buff *skb;
>  	void *data;
>  
>  	len += NET_SKB_PAD + NET_IP_ALIGN;
>  
> -	/* If requested length is either too small or too big,
> -	 * we use kmalloc() for skb->head allocation.
> -	 */
> -	if (len <= SKB_WITH_OVERHEAD(1024) ||
> -	    len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> -	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> -		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
> -		if (!skb)
> -			goto skb_fail;
> -		goto skb_success;
> -	}
> -
>  	nc = this_cpu_ptr(&napi_alloc_cache);
>  	len += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  	len = SKB_DATA_ALIGN(len);
> @@ -530,25 +519,34 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>  	if (sk_memalloc_socks())
>  		gfp_mask |= __GFP_MEMALLOC;
>  
> -	data = page_frag_alloc(&nc->page, len, gfp_mask);
> +	if (len <= SKB_WITH_OVERHEAD(1024) ||
> +            len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
> +            (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> +		data = kmalloc_reserve(len, gfp_mask, NUMA_NO_NODE, &pfmemalloc);
> +		head_frag = 0;
> +		len = 0;
> +	} else {
> +		data = page_frag_alloc(&nc->page, len, gfp_mask);
> +		pfmemalloc = nc->page.pfmemalloc;
> +		head_frag = 1;
> +	}
>  	if (unlikely(!data))
>  		return NULL;

Sure. I have a separate WIP series that reworks all three *alloc_skb()
functions, as there's nice room for optimization, especially now that
tiny skbs fall back to __alloc_skb().
It will likely hit the mailing lists after the merge window and the
next net-next season, not now. And it's not really connected with NAPI
cache reusing.

>  	skb = __build_skb(data, len);
>  	if (unlikely(!skb)) {
> -		skb_free_frag(data);
> +		if (head_frag)
> +			skb_free_frag(data);
> +		else
> +			kfree(data);
>  		return NULL;
>  	}
>  
> -	if (nc->page.pfmemalloc)
> -		skb->pfmemalloc = 1;
> -	skb->head_frag = 1;
> +	skb->pfmemalloc = pfmemalloc;
> +	skb->head_frag = head_frag;
>  
> -skb_success:
>  	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
>  	skb->dev = napi->dev;
> -
> -skb_fail:
>  	return skb;
>  }
>  EXPORT_SYMBOL(__napi_alloc_skb);

Al

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9e1a8ded4acc..750fa1825b28 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -397,15 +397,20 @@  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	struct sk_buff *skb;
 	u8 *data;
 	bool pfmemalloc;
+	bool clone;
 
-	cache = (flags & SKB_ALLOC_FCLONE)
-		? skbuff_fclone_cache : skbuff_head_cache;
+	clone = !!(flags & SKB_ALLOC_FCLONE);
+	cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
 
 	if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
 		gfp_mask |= __GFP_MEMALLOC;
 
 	/* Get the HEAD */
-	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
+	if (!clone && (flags & SKB_ALLOC_NAPI) &&
+	    likely(node == NUMA_NO_NODE || node == numa_mem_id()))
+		skb = napi_skb_cache_get();
+	else
+		skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
 	if (unlikely(!skb))
 		return NULL;
 	prefetchw(skb);
@@ -436,7 +441,7 @@  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	__build_skb_around(skb, data, 0);
 	skb->pfmemalloc = pfmemalloc;
 
-	if (flags & SKB_ALLOC_FCLONE) {
+	if (clone) {
 		struct sk_buff_fclones *fclones;
 
 		fclones = container_of(skb, struct sk_buff_fclones, skb1);