* [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack @ 2016-05-09 13:44 Jesper Dangaard Brouer 2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer ` (3 more replies) 0 siblings, 4 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw) To: netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer This patchset enables use of the slab/kmem_cache bulk alloc API, and completes the use the slab/kmem_cache bulking API in the network stack. I've not included the patches that introduce a SKB bulk hint, which would beneficial for drivers like mlx5. Lets see if we can agree on this patchset first. --- Jesper Dangaard Brouer (3): net: bulk alloc and reuse of SKBs in NAPI context mlx4: use napi_alloc_skb API to get SKB bulk allocations net: warn on napi_alloc_skb being called in wrong context drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 ++- net/core/skbuff.c | 74 ++++++++++++++++++---------- 2 files changed, 52 insertions(+), 29 deletions(-) ^ permalink raw reply [flat|nested] 20+ messages in thread
* [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer @ 2016-05-09 13:44 ` Jesper Dangaard Brouer 2016-05-09 16:20 ` Alexander Duyck 2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer ` (2 subsequent siblings) 3 siblings, 1 reply; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw) To: netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer This patch introduce bulk alloc of SKBs and allow reuse of SKBs free'ed in same softirq cycle. SKBs are normally free'ed during TX completion, but most high speed drivers also cleanup TX ring during NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, SKBs will be avail in the napi_alloc_cache->skb_cache. If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit the potential overshooting unused SKBs needed to free'ed when NAPI cycle ends (flushed in net_rx_action via __kfree_skb_flush()). Generalize ___build_skb() to allow passing it a preallocated SKB. I've previously demonstrated a 1% speedup for IPv4 forwarding, when used on the ixgbe driver. If SKB alloc and free happens on different CPUs (like in RPS use-case) the performance benefit is expected to increase. All drivers using the napi_alloc_skb() call will benefit from this change automatically. Joint work with Alexander Duyck. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> --- net/core/skbuff.c | 71 ++++++++++++++++++++++++++++++++++------------------- 1 file changed, 45 insertions(+), 26 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5586be93632f..e85f1065b263 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -281,32 +281,14 @@ nodata: } EXPORT_SYMBOL(__alloc_skb); -/** - * __build_skb - build a network buffer - * @data: data buffer provided by caller - * @frag_size: size of data, or 0 if head was kmalloced - * - * Allocate a new &sk_buff. Caller provides space holding head and - * skb_shared_info. @data must have been allocated by kmalloc() only if - * @frag_size is 0, otherwise data should come from the page allocator - * or vmalloc() - * The return is the new skb buffer. - * On a failure the return is %NULL, and @data is not freed. - * Notes : - * Before IO, driver allocates only data buffer where NIC put incoming frame - * Driver should add room at head (NET_SKB_PAD) and - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) - * After IO, driver calls build_skb(), to allocate sk_buff and populate it - * before giving packet to stack. - * RX rings only contains data buffers, not full skbs. - */ -struct sk_buff *__build_skb(void *data, unsigned int frag_size) +/* Allows skb being (pre)allocated by caller */ +static inline +struct sk_buff *___build_skb(void *data, unsigned int frag_size, + struct sk_buff *skb) { struct skb_shared_info *shinfo; - struct sk_buff *skb; unsigned int size = frag_size ? : ksize(data); - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); if (!skb) return NULL; @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) return skb; } +/** + * __build_skb - build a network buffer + * @data: data buffer provided by caller + * @frag_size: size of data, or 0 if head was kmalloced + * + * Allocate a new &sk_buff. Caller provides space holding head and + * skb_shared_info. @data must have been allocated by kmalloc() only if + * @frag_size is 0, otherwise data should come from the page allocator + * or vmalloc() + * The return is the new skb buffer. + * On a failure the return is %NULL, and @data is not freed. + * Notes : + * Before IO, driver allocates only data buffer where NIC put incoming frame + * Driver should add room at head (NET_SKB_PAD) and + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) + * After IO, driver calls build_skb(), to allocate sk_buff and populate it + * before giving packet to stack. + * RX rings only contains data buffers, not full skbs. + */ +struct sk_buff *__build_skb(void *data, unsigned int frag_size) +{ + struct sk_buff *skb; + unsigned int size = frag_size ? : ksize(data); + + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); + return ___build_skb(data, size, skb); +} + /* build_skb() is wrapper over __build_skb(), that specifically * takes care of skb->head and skb->pfmemalloc * This means that if @frag_size is not zero, then @data must be backed @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, len += NET_SKB_PAD + NET_IP_ALIGN; - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || + (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); if (!skb) goto skb_fail; @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, if (unlikely(!data)) return NULL; - skb = __build_skb(data, len); - if (unlikely(!skb)) { +#define BULK_ALLOC_SIZE 8 + if (!nc->skb_count) { + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, + gfp_mask, BULK_ALLOC_SIZE, + nc->skb_cache); + } + if (likely(nc->skb_count)) { + skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count]; + } else { + /* alloc bulk failed */ skb_free_frag(data); return NULL; } + skb = ___build_skb(data, len, skb); /* use OR instead of assignment to avoid clearing of bits in mask */ if (nc->page.pfmemalloc) ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer @ 2016-05-09 16:20 ` Alexander Duyck 2016-05-09 19:59 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 20+ messages in thread From: Alexander Duyck @ 2016-05-09 16:20 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > This patch introduce bulk alloc of SKBs and allow reuse of SKBs > free'ed in same softirq cycle. SKBs are normally free'ed during TX > completion, but most high speed drivers also cleanup TX ring during > NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, > SKBs will be avail in the napi_alloc_cache->skb_cache. > > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit > the potential overshooting unused SKBs needed to free'ed when NAPI > cycle ends (flushed in net_rx_action via __kfree_skb_flush()). > > Generalize ___build_skb() to allow passing it a preallocated SKB. > > I've previously demonstrated a 1% speedup for IPv4 forwarding, when > used on the ixgbe driver. If SKB alloc and free happens on different > CPUs (like in RPS use-case) the performance benefit is expected to > increase. Really I am not sure this patch series is worth the effort. For freeing buffers in Tx it was an obvious win. But adding all this complexity for a 1% gain just doesn't seen worth the effort. > All drivers using the napi_alloc_skb() call will benefit from > this change automatically. Yes, but a 1% improvement is essentially just noise. I'd say we need to show a better gain or be able to more precisely show that this is a definite gain and not just a test fluctuation resulting in the improvement. For all I know all of the gain could be due to a function shifting around so that some loop is now 16 byte aligned. > Joint work with Alexander Duyck. > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> The fact that this is still using my redhat.com address says volumes. If I recall I think we were working on this code around 9 months ago and had similar issues with it showing either negative performance or no gain. My advice then was to push the bulk free code and try to find a way to fix the bulk allocation. If we are still at this state for bulk allocation then maybe we need to drop the bulk allocation and start looking at other avenues to pursue. > --- > net/core/skbuff.c | 71 ++++++++++++++++++++++++++++++++++------------------- > 1 file changed, 45 insertions(+), 26 deletions(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 5586be93632f..e85f1065b263 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -281,32 +281,14 @@ nodata: > } > EXPORT_SYMBOL(__alloc_skb); > > -/** > - * __build_skb - build a network buffer > - * @data: data buffer provided by caller > - * @frag_size: size of data, or 0 if head was kmalloced > - * > - * Allocate a new &sk_buff. Caller provides space holding head and > - * skb_shared_info. @data must have been allocated by kmalloc() only if > - * @frag_size is 0, otherwise data should come from the page allocator > - * or vmalloc() > - * The return is the new skb buffer. > - * On a failure the return is %NULL, and @data is not freed. > - * Notes : > - * Before IO, driver allocates only data buffer where NIC put incoming frame > - * Driver should add room at head (NET_SKB_PAD) and > - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > - * After IO, driver calls build_skb(), to allocate sk_buff and populate it > - * before giving packet to stack. > - * RX rings only contains data buffers, not full skbs. > - */ > -struct sk_buff *__build_skb(void *data, unsigned int frag_size) > +/* Allows skb being (pre)allocated by caller */ > +static inline > +struct sk_buff *___build_skb(void *data, unsigned int frag_size, > + struct sk_buff *skb) > { > struct skb_shared_info *shinfo; > - struct sk_buff *skb; > unsigned int size = frag_size ? : ksize(data); > > - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > if (!skb) > return NULL; > > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) > return skb; > } > > +/** > + * __build_skb - build a network buffer > + * @data: data buffer provided by caller > + * @frag_size: size of data, or 0 if head was kmalloced > + * > + * Allocate a new &sk_buff. Caller provides space holding head and > + * skb_shared_info. @data must have been allocated by kmalloc() only if > + * @frag_size is 0, otherwise data should come from the page allocator > + * or vmalloc() > + * The return is the new skb buffer. > + * On a failure the return is %NULL, and @data is not freed. > + * Notes : > + * Before IO, driver allocates only data buffer where NIC put incoming frame > + * Driver should add room at head (NET_SKB_PAD) and > + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > + * After IO, driver calls build_skb(), to allocate sk_buff and populate it > + * before giving packet to stack. > + * RX rings only contains data buffers, not full skbs. > + */ > +struct sk_buff *__build_skb(void *data, unsigned int frag_size) > +{ > + struct sk_buff *skb; > + unsigned int size = frag_size ? : ksize(data); > + > + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > + return ___build_skb(data, size, skb); > +} > + > /* build_skb() is wrapper over __build_skb(), that specifically > * takes care of skb->head and skb->pfmemalloc > * This means that if @frag_size is not zero, then @data must be backed If we can avoid having to break up build_skb into more functions that would be preferred. I realize I probably wrote this code in order to enable the bulk allocation approach, but really I don't want to add more complexity unless we can show a strong gain which we haven't been able to demonstrate. > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, > > len += NET_SKB_PAD + NET_IP_ALIGN; > > - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > + (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) { > skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); > if (!skb) > goto skb_fail; Does this unlikely really make a difference? If so you might want to move it into a patch all on its own. > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, > if (unlikely(!data)) > return NULL; > > - skb = __build_skb(data, len); > - if (unlikely(!skb)) { > +#define BULK_ALLOC_SIZE 8 > + if (!nc->skb_count) { > + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, > + gfp_mask, BULK_ALLOC_SIZE, > + nc->skb_cache); > + } > + if (likely(nc->skb_count)) { > + skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count]; > + } else { > + /* alloc bulk failed */ So did you try doing a low latency socket test with this patch? I'm guessing not as I am certain there is a negative impact from having to allocate 8 buffers, and then free back 7 every time you call the NAPI polling routine with just one buffer in the ring. > skb_free_frag(data); > return NULL; > } > + skb = ___build_skb(data, len, skb); > > /* use OR instead of assignment to avoid clearing of bits in mask */ > if (nc->page.pfmemalloc) > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-09 16:20 ` Alexander Duyck @ 2016-05-09 19:59 ` Jesper Dangaard Brouer 2016-05-09 20:46 ` Alexander Duyck 0 siblings, 1 reply; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 19:59 UTC (permalink / raw) To: Alexander Duyck Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev, brouer On Mon, 9 May 2016 09:20:41 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > This patch introduce bulk alloc of SKBs and allow reuse of SKBs > > free'ed in same softirq cycle. SKBs are normally free'ed during TX > > completion, but most high speed drivers also cleanup TX ring during > > NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, > > SKBs will be avail in the napi_alloc_cache->skb_cache. > > > > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit > > the potential overshooting unused SKBs needed to free'ed when NAPI > > cycle ends (flushed in net_rx_action via __kfree_skb_flush()). > > > > Generalize ___build_skb() to allow passing it a preallocated SKB. > > > > I've previously demonstrated a 1% speedup for IPv4 forwarding, when > > used on the ixgbe driver. If SKB alloc and free happens on different > > CPUs (like in RPS use-case) the performance benefit is expected to > > increase. > > Really I am not sure this patch series is worth the effort. For > freeing buffers in Tx it was an obvious win. But adding all this > complexity for a 1% gain just doesn't seen worth the effort. I still think it is worth it, because: 1) it enables recycling, which is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA completion cleanup, and RX can re-use these), and 2) because bulk alloc and bulk free gets "coupled" (mostly relevant when doing cross CPU). > > All drivers using the napi_alloc_skb() call will benefit from > > this change automatically. > > Yes, but a 1% improvement is essentially just noise. I'd say we need > to show a better gain or be able to more precisely show that this is a > definite gain and not just a test fluctuation resulting in the > improvement. For all I know all of the gain could be due to a > function shifting around so that some loop is now 16 byte aligned. > > > Joint work with Alexander Duyck. > > > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> > > The fact that this is still using my redhat.com address says volumes. > If I recall I think we were working on this code around 9 months ago > and had similar issues with it showing either negative performance or > no gain. My advice then was to push the bulk free code and try to > find a way to fix the bulk allocation. If we are still at this state > for bulk allocation then maybe we need to drop the bulk allocation and > start looking at other avenues to pursue. I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up with a benchmark that show a bigger gain, but is it time well spend? My quick benchmarking with mlx4 actually showed 2.7% improvement, for local UDP socket delivery. And only 0.6% on IP-forwarding. If I find some bidirectional traffic then the benefit should be higher due to recycling kicking in. We did do as you recommended, and the bulk free code first. I'm just getting the last pieces pushed. I didn't see any negative performance in my testing. As you also know, tuning the SLUB system will give higher performance, easily. In the future, I'm planning to get some auto-tuning into the SLUB allocator. I've already discussed this with Christiph Lameter, at MM-summit, see presentation[1] slides 4 and 5. [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp > > --- > > net/core/skbuff.c | 71 > > ++++++++++++++++++++++++++++++++++------------------- 1 file > > changed, 45 insertions(+), 26 deletions(-) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 5586be93632f..e85f1065b263 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -281,32 +281,14 @@ nodata: > > } > > EXPORT_SYMBOL(__alloc_skb); > > > > -/** > > - * __build_skb - build a network buffer > > - * @data: data buffer provided by caller > > - * @frag_size: size of data, or 0 if head was kmalloced > > - * > > - * Allocate a new &sk_buff. Caller provides space holding head and > > - * skb_shared_info. @data must have been allocated by kmalloc() > > only if > > - * @frag_size is 0, otherwise data should come from the page > > allocator > > - * or vmalloc() > > - * The return is the new skb buffer. > > - * On a failure the return is %NULL, and @data is not freed. > > - * Notes : > > - * Before IO, driver allocates only data buffer where NIC put > > incoming frame > > - * Driver should add room at head (NET_SKB_PAD) and > > - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > > - * After IO, driver calls build_skb(), to allocate sk_buff and > > populate it > > - * before giving packet to stack. > > - * RX rings only contains data buffers, not full skbs. > > - */ > > -struct sk_buff *__build_skb(void *data, unsigned int frag_size) > > +/* Allows skb being (pre)allocated by caller */ > > +static inline > > +struct sk_buff *___build_skb(void *data, unsigned int frag_size, > > + struct sk_buff *skb) > > { > > struct skb_shared_info *shinfo; > > - struct sk_buff *skb; > > unsigned int size = frag_size ? : ksize(data); > > > > - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > > if (!skb) > > return NULL; > > > > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, > > unsigned int frag_size) return skb; > > } > > > > +/** > > + * __build_skb - build a network buffer > > + * @data: data buffer provided by caller > > + * @frag_size: size of data, or 0 if head was kmalloced > > + * > > + * Allocate a new &sk_buff. Caller provides space holding head and > > + * skb_shared_info. @data must have been allocated by kmalloc() > > only if > > + * @frag_size is 0, otherwise data should come from the page > > allocator > > + * or vmalloc() > > + * The return is the new skb buffer. > > + * On a failure the return is %NULL, and @data is not freed. > > + * Notes : > > + * Before IO, driver allocates only data buffer where NIC put > > incoming frame > > + * Driver should add room at head (NET_SKB_PAD) and > > + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > > + * After IO, driver calls build_skb(), to allocate sk_buff and > > populate it > > + * before giving packet to stack. > > + * RX rings only contains data buffers, not full skbs. > > + */ > > +struct sk_buff *__build_skb(void *data, unsigned int frag_size) > > +{ > > + struct sk_buff *skb; > > + unsigned int size = frag_size ? : ksize(data); > > + > > + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > > + return ___build_skb(data, size, skb); > > +} > > + > > /* build_skb() is wrapper over __build_skb(), that specifically > > * takes care of skb->head and skb->pfmemalloc > > * This means that if @frag_size is not zero, then @data must be > > backed > > If we can avoid having to break up build_skb into more functions that > would be preferred. I realize I probably wrote this code in order to > enable the bulk allocation approach, but really I don't want to add > more complexity unless we can show a strong gain which we haven't been > able to demonstrate. You do notice that ___build_skb gets inlined, thus there is not performance loss. And this change was explicitly requested last time this patch was reviewed. And I think this variant is much less intrusive. > > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct > > napi_struct *napi, unsigned int len, > > > > len += NET_SKB_PAD + NET_IP_ALIGN; > > > > - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > > + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > > + (gfp_mask & (__GFP_DIRECT_RECLAIM | > > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, > > NUMA_NO_NODE); if (!skb) > > goto skb_fail; > > Does this unlikely really make a difference? If so you might want to > move it into a patch all on its own. I can place it in a patch of it's own. I just noticed the compiler layed out the code wrongly compared to the normal use-case. > > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct > > napi_struct *napi, unsigned int len, if (unlikely(!data)) > > return NULL; > > > > - skb = __build_skb(data, len); > > - if (unlikely(!skb)) { > > +#define BULK_ALLOC_SIZE 8 > > + if (!nc->skb_count) { > > + nc->skb_count = > > kmem_cache_alloc_bulk(skbuff_head_cache, > > + gfp_mask, > > BULK_ALLOC_SIZE, > > + > > nc->skb_cache); > > + } > > + if (likely(nc->skb_count)) { > > + skb = (struct sk_buff > > *)nc->skb_cache[--nc->skb_count]; > > + } else { > > + /* alloc bulk failed */ > > So did you try doing a low latency socket test with this patch? I'm > guessing not as I am certain there is a negative impact from having to > allocate 8 buffers, and then free back 7 every time you call the NAPI > polling routine with just one buffer in the ring. There is a very high probability that pulling out 8 objects, and returning 7 object, will have the same cost of alloc and free of a single object. This is due to how the SLUB allocator's per CPU allocator works. Notice I said "high probability". I have adjusted the slab bulk APIs, such that we can extend them to bulk return "upto" a given number of objects. Then the SLUB allocator can remove the "high probability" part, and make sure only to return object on the per CPU slab-page, thus guaranteeing no cmpxchg_double calls, and only local_irq_disable/enable cycles, which is actually faster than the normal fastpath local cmpxchg_double (non-atomic variant). > > skb_free_frag(data); > > return NULL; > > } > > + skb = ___build_skb(data, len, skb); > > > > /* use OR instead of assignment to avoid clearing of bits > > in mask */ if (nc->page.pfmemalloc) > > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-09 19:59 ` Jesper Dangaard Brouer @ 2016-05-09 20:46 ` Alexander Duyck 2016-05-10 9:29 ` Jesper Dangaard Brouer 2016-05-10 12:30 ` Jesper Dangaard Brouer 0 siblings, 2 replies; 20+ messages in thread From: Alexander Duyck @ 2016-05-09 20:46 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On Mon, May 9, 2016 at 12:59 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Mon, 9 May 2016 09:20:41 -0700 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs >> > free'ed in same softirq cycle. SKBs are normally free'ed during TX >> > completion, but most high speed drivers also cleanup TX ring during >> > NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, >> > SKBs will be avail in the napi_alloc_cache->skb_cache. >> > >> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit >> > the potential overshooting unused SKBs needed to free'ed when NAPI >> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()). >> > >> > Generalize ___build_skb() to allow passing it a preallocated SKB. >> > >> > I've previously demonstrated a 1% speedup for IPv4 forwarding, when >> > used on the ixgbe driver. If SKB alloc and free happens on different >> > CPUs (like in RPS use-case) the performance benefit is expected to >> > increase. >> >> Really I am not sure this patch series is worth the effort. For >> freeing buffers in Tx it was an obvious win. But adding all this >> complexity for a 1% gain just doesn't seen worth the effort. > > I still think it is worth it, because: 1) it enables recycling, which > is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA > completion cleanup, and RX can re-use these), and 2) because bulk alloc > and bulk free gets "coupled" (mostly relevant when doing cross CPU). I disagree. While there will be some ACKs it will likely be significantly less then the number of packets received, especially when you take GRO into account. > >> > All drivers using the napi_alloc_skb() call will benefit from >> > this change automatically. >> >> Yes, but a 1% improvement is essentially just noise. I'd say we need >> to show a better gain or be able to more precisely show that this is a >> definite gain and not just a test fluctuation resulting in the >> improvement. For all I know all of the gain could be due to a >> function shifting around so that some loop is now 16 byte aligned. >> >> > Joint work with Alexander Duyck. >> > >> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> >> >> The fact that this is still using my redhat.com address says volumes. >> If I recall I think we were working on this code around 9 months ago >> and had similar issues with it showing either negative performance or >> no gain. My advice then was to push the bulk free code and try to >> find a way to fix the bulk allocation. If we are still at this state >> for bulk allocation then maybe we need to drop the bulk allocation and >> start looking at other avenues to pursue. > > I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up > with a benchmark that show a bigger gain, but is it time well spend? > > My quick benchmarking with mlx4 actually showed 2.7% improvement, for > local UDP socket delivery. And only 0.6% on IP-forwarding. If I find > some bidirectional traffic then the benefit should be higher due to > recycling kicking in. Is that with or without patch 2 being a part of the set? This patch should be able to prove itself as a stand-alone patch. I'd say go ahead and submit patch 2 as a standalone and then we can start working on collecting numbers on the mlx4. Also are you running your test with GRO disabled? If so you might try enabling it on the mlx4 because that changes the skbuff allocation which would impact the behavior for the device. Try testing with TCP_RR instead and watch the CPU utilization. I'm suspecting allocating 8 and freeing 7 buffers for every 1 buffer received will blow any gains right out of the water. Also try it with a mix of traffic. So have one NIC doing TCP_RR while another is doing a stream test. You are stuffing 7 buffers onto a queue that were were using to perform bulk freeing. How much of a penalty do you take if you are now limited on how many you can bulk free because somebody left a stray 7 packets sitting on the queue? Also please stop focusing on IP-forwarding. There are some use cases for it out there, but the vast majority of people don't care that much about it. It is just one data point out of many. Try running a few TCP based applications and see if you can notice any difference. It is okay to focus on routing numbers when you are doing work that will only impact routing or has next to no likelihood of introducing a regression, but the fact is the bulk alloc has a strong likelihood of having effects on other parts of the network stack since you are pulling in more buffers then you actually need. > We did do as you recommended, and the bulk free code first. I'm > just getting the last pieces pushed. I didn't see any negative > performance in my testing. Not seeing any negative performance is not the same as seeing gains. You are introducing complexity, there needs to be something to show for it. Otherwise you are just making the code harder to maintain for no good reason. > As you also know, tuning the SLUB system will give higher performance, > easily. In the future, I'm planning to get some auto-tuning into the > SLUB allocator. I've already discussed this with Christiph Lameter, at > MM-summit, see presentation[1] slides 4 and 5. We aren't discussing tuning parameters. We are discussing this patch. If you want to argue that with certain tuning parameters this shows more performance then bring the numbers, but don't try to bias things. If you have to tune the system in some way that nobody will there is probably no point in submitting the patch because nobody will use it that way. You also didn't test for low latency. As I said this is coming at a high expense to other use cases, but your use case only shows a 1% improvement. In my opinion it is not worth adding the complexity if we cannot show much of a gain. If you want to win me over we need to be able to justify the complexity. Right now I am not seeing it. In addition patch 3 illustrates the added complexity you are bringing to all this because you are having to add limitations to napi_alloc_skb because you are trying to make it work with recycling which historically has never been a win because there are too many traffic patterns where recycling cannot occur. > [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp > >> > --- >> > net/core/skbuff.c | 71 >> > ++++++++++++++++++++++++++++++++++------------------- 1 file >> > changed, 45 insertions(+), 26 deletions(-) >> > >> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> > index 5586be93632f..e85f1065b263 100644 >> > --- a/net/core/skbuff.c >> > +++ b/net/core/skbuff.c >> > @@ -281,32 +281,14 @@ nodata: >> > } >> > EXPORT_SYMBOL(__alloc_skb); >> > >> > -/** >> > - * __build_skb - build a network buffer >> > - * @data: data buffer provided by caller >> > - * @frag_size: size of data, or 0 if head was kmalloced >> > - * >> > - * Allocate a new &sk_buff. Caller provides space holding head and >> > - * skb_shared_info. @data must have been allocated by kmalloc() >> > only if >> > - * @frag_size is 0, otherwise data should come from the page >> > allocator >> > - * or vmalloc() >> > - * The return is the new skb buffer. >> > - * On a failure the return is %NULL, and @data is not freed. >> > - * Notes : >> > - * Before IO, driver allocates only data buffer where NIC put >> > incoming frame >> > - * Driver should add room at head (NET_SKB_PAD) and >> > - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) >> > - * After IO, driver calls build_skb(), to allocate sk_buff and >> > populate it >> > - * before giving packet to stack. >> > - * RX rings only contains data buffers, not full skbs. >> > - */ >> > -struct sk_buff *__build_skb(void *data, unsigned int frag_size) >> > +/* Allows skb being (pre)allocated by caller */ >> > +static inline >> > +struct sk_buff *___build_skb(void *data, unsigned int frag_size, >> > + struct sk_buff *skb) >> > { >> > struct skb_shared_info *shinfo; >> > - struct sk_buff *skb; >> > unsigned int size = frag_size ? : ksize(data); >> > >> > - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); >> > if (!skb) >> > return NULL; >> > >> > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, >> > unsigned int frag_size) return skb; >> > } >> > >> > +/** >> > + * __build_skb - build a network buffer >> > + * @data: data buffer provided by caller >> > + * @frag_size: size of data, or 0 if head was kmalloced >> > + * >> > + * Allocate a new &sk_buff. Caller provides space holding head and >> > + * skb_shared_info. @data must have been allocated by kmalloc() >> > only if >> > + * @frag_size is 0, otherwise data should come from the page >> > allocator >> > + * or vmalloc() >> > + * The return is the new skb buffer. >> > + * On a failure the return is %NULL, and @data is not freed. >> > + * Notes : >> > + * Before IO, driver allocates only data buffer where NIC put >> > incoming frame >> > + * Driver should add room at head (NET_SKB_PAD) and >> > + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) >> > + * After IO, driver calls build_skb(), to allocate sk_buff and >> > populate it >> > + * before giving packet to stack. >> > + * RX rings only contains data buffers, not full skbs. >> > + */ >> > +struct sk_buff *__build_skb(void *data, unsigned int frag_size) >> > +{ >> > + struct sk_buff *skb; >> > + unsigned int size = frag_size ? : ksize(data); >> > + >> > + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); >> > + return ___build_skb(data, size, skb); >> > +} >> > + >> > /* build_skb() is wrapper over __build_skb(), that specifically >> > * takes care of skb->head and skb->pfmemalloc >> > * This means that if @frag_size is not zero, then @data must be >> > backed >> >> If we can avoid having to break up build_skb into more functions that >> would be preferred. I realize I probably wrote this code in order to >> enable the bulk allocation approach, but really I don't want to add >> more complexity unless we can show a strong gain which we haven't been >> able to demonstrate. > > You do notice that ___build_skb gets inlined, thus there is not > performance loss. And this change was explicitly requested last time > this patch was reviewed. And I think this variant is much less > intrusive. I didn't notice the inline, so it means we are bloating the code a bit instead of adding more function calls. I guess that is a better trade off, but still not the most desirable thing to do. > >> > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct >> > napi_struct *napi, unsigned int len, >> > >> > len += NET_SKB_PAD + NET_IP_ALIGN; >> > >> > - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || >> > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { >> > + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || >> > + (gfp_mask & (__GFP_DIRECT_RECLAIM | >> > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, >> > NUMA_NO_NODE); if (!skb) >> > goto skb_fail; >> >> Does this unlikely really make a difference? If so you might want to >> move it into a patch all on its own. > > I can place it in a patch of it's own. I just noticed the compiler > layed out the code wrongly compared to the normal use-case. What version of gcc are you seeing this with? From what I can tell on the 4.8.5 version that came with my RHEL 7.2 build I am seeing the proper setup with 2 compares and 2 jumps to the remote code. >> > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct >> > napi_struct *napi, unsigned int len, if (unlikely(!data)) >> > return NULL; >> > >> > - skb = __build_skb(data, len); >> > - if (unlikely(!skb)) { >> > +#define BULK_ALLOC_SIZE 8 >> > + if (!nc->skb_count) { >> > + nc->skb_count = >> > kmem_cache_alloc_bulk(skbuff_head_cache, >> > + gfp_mask, >> > BULK_ALLOC_SIZE, >> > + >> > nc->skb_cache); >> > + } >> > + if (likely(nc->skb_count)) { >> > + skb = (struct sk_buff >> > *)nc->skb_cache[--nc->skb_count]; >> > + } else { >> > + /* alloc bulk failed */ >> >> So did you try doing a low latency socket test with this patch? I'm >> guessing not as I am certain there is a negative impact from having to >> allocate 8 buffers, and then free back 7 every time you call the NAPI >> polling routine with just one buffer in the ring. > > There is a very high probability that pulling out 8 objects, and > returning 7 object, will have the same cost of alloc and free of a > single object. This is due to how the SLUB allocator's per CPU > allocator works. I very much doubt that. You are still having to move around many more objects then you really should be and dragging in 8 cache lines as you have to walk the percpu freelist to pull the entries and then push them back in. > Notice I said "high probability". I have adjusted the slab bulk APIs, > such that we can extend them to bulk return "upto" a given number of > objects. Then the SLUB allocator can remove the "high probability" > part, and make sure only to return object on the per CPU slab-page, > thus guaranteeing no cmpxchg_double calls, and only > local_irq_disable/enable cycles, which is actually faster than the > normal fastpath local cmpxchg_double (non-atomic variant). On your system. There is no guarantee that is the case across all CPUs supported by the x86 architecture. Also I can think of scenarios where your code would become very expensive. The fact is you are pushing an additional 7 objects into the L1 cache each time you do one of these bulk allocations. The question you have to ask yourself is what is it you are evicting when you do it. There are scenerios where you could be evicting critical data back out to L2 which will negatively impact your performance. A 1% gain 99% of the time is kind of useless when there is a risk of a significant performance hit for that remaining 1% of cases. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-09 20:46 ` Alexander Duyck @ 2016-05-10 9:29 ` Jesper Dangaard Brouer 2016-05-10 12:30 ` Jesper Dangaard Brouer 1 sibling, 0 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-10 9:29 UTC (permalink / raw) To: Alexander Duyck Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev, brouer On Mon, 9 May 2016 13:46:32 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > > As you also know, tuning the SLUB system will give higher performance, > > easily. In the future, I'm planning to get some auto-tuning into the > > SLUB allocator. I've already discussed this with Christiph Lameter, at > > MM-summit, see presentation[1] slides 4 and 5. > > We aren't discussing tuning parameters. We are discussing this patch. > If you want to argue that with certain tuning parameters this shows > more performance then bring the numbers, but don't try to bias things. > If you have to tune the system in some way that nobody will there is > probably no point in submitting the patch because nobody will use it > that way. I think you missed the point. I didn't do parameter tuning for my benchmarks. I hate tuning parameters, they are huge problem for users of the kernel. My point is that I want to implement auto-tuning in the SLUB allocator. This network stack use-case, will just be one use-case where the auto-tuning need to show improvements. It is not that complicated. FreeBSD have this kind of auto-tuning to the workload in their slab implementation. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-09 20:46 ` Alexander Duyck 2016-05-10 9:29 ` Jesper Dangaard Brouer @ 2016-05-10 12:30 ` Jesper Dangaard Brouer 2016-05-10 13:48 ` Eric Dumazet 2016-05-10 17:46 ` Alexander Duyck 1 sibling, 2 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-10 12:30 UTC (permalink / raw) To: Alexander Duyck Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev, brouer On Mon, 9 May 2016 13:46:32 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > Try testing with TCP_RR instead and watch the CPU utilization. I'm > suspecting allocating 8 and freeing 7 buffers for every 1 buffer > received will blow any gains right out of the water. Also try it with > a mix of traffic. So have one NIC doing TCP_RR while another is doing > a stream test. You are stuffing 7 buffers onto a queue that were were > using to perform bulk freeing. How much of a penalty do you take if > you are now limited on how many you can bulk free because somebody > left a stray 7 packets sitting on the queue? Testing with TCP_RR, is not a very "clean" network test. One have to be very careful what is actually being tested, is it the server or client which is the bottleneck. And most of all this is test of the CPU/process scheduler. We can avoid the scheduler problem by enabling busy_poll/busy_read. I guess you want to see the "scheduler test" first. Default setting of disabled busy poll on both client and server: Disable busy poll on both client and server, Not patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 16384 87380 Disable busy poll on both client and server, patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 78517.32 3.06 2.84 3.118 8.677 16384 87380 I will not call this an improvement... the results are basically the same. Next step enabling busy poll on the server. The server is likely the bottleneck, given it's CPU is slower than the client. Context switches on the server is too high 156K/sec, after enabling busy poll reduced to 620/sec. Note the client is doing around 233k/sec context switches, (fairly impressive). Enabling busy poll on the server: sysctl -w net.core.busy_poll=50 sysctl -w net.core.busy_read=50 Enabled busy poll only on server, Not patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 112480.72 5.90 4.68 4.194 9.984 16384 87380 Enabled busy poll only on server, patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 110152.34 5.84 4.60 4.242 10.014 16384 87380 Numbers are too close, for any conclusions. Running a second run, on Not-patched kernel: Enabled busy poll only on server, Not patched: [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 101554.90 4.12 4.31 3.245 10.185 16384 87380 Thus, variation between runs are bigger than any improvement/regression, thus no performance conclusions from this change can be drawn. Lets move beyond testing the CPU/process scheduler by enabling busy-polling on both client and server: (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50) Enable busy poll on both client and server, Not patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 137987.86 13.18 4.77 7.643 8.298 16384 87380 Enable busy poll on both client and server, patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 147324.38 13.76 4.76 7.474 7.747 16384 87380 I've a little bit surprised to see such a large improvement here 6.76%. 147324/137987*100 = 106.76 I'm remaining skeptic towards this measurement, as the improvement should not be this high. Even if recycling is happening. Perf record does show less calls to __slab_free(), indicating better interaction with SLUB, and perhaps recycling working. But this is only a perf-report change from 0.37% to 0.33%. More testing show not-patched kernel fluctuate between 125k-143k/sec, and patched kernel fluctuate between 131k-152k/sec. The ranges are too high, to say anything conclusive. It seems to be timing dependent, as starting and stoping the test with -D 1, show a rate variation within 2k/sec, but rate itself can vary withing the range stated. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-10 12:30 ` Jesper Dangaard Brouer @ 2016-05-10 13:48 ` Eric Dumazet 2016-05-10 14:48 ` Jesper Dangaard Brouer 2016-05-10 17:46 ` Alexander Duyck 1 sibling, 1 reply; 20+ messages in thread From: Eric Dumazet @ 2016-05-10 13:48 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Alexander Duyck, Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote: > Disable busy poll on both client and server, Not patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > 16384 87380 Tell us more about the -T6,6 For example how many TX/RX queues you have on the NIC, and which cpus service interrupts. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-10 13:48 ` Eric Dumazet @ 2016-05-10 14:48 ` Jesper Dangaard Brouer 2016-05-10 15:44 ` Eric Dumazet 0 siblings, 1 reply; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-10 14:48 UTC (permalink / raw) To: Eric Dumazet Cc: Alexander Duyck, Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev, brouer On Tue, 10 May 2016 06:48:54 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote: > > > Disable busy poll on both client and server, Not patched: > > > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 > > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > > Local /Remote > > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > > Send Recv Size Size Time Rate local remote local remote > > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > > 16384 87380 > > Tell us more about the -T6,6 > > For example how many TX/RX queues you have on the NIC, and which cpus > service interrupts. The -T6,6 option: -T lcpu,rcpu Request netperf/netserver be bound to local/remote cpu I use the option to get more stable results. If I don't pin/bind the CPU netperf/netserver is running on then the CPU scheduler will migrate the processes around. This gives unpredictable results, worst for the busy_poll tests. Especially if the RX softirq runs on the same CPU (also true if it runs on a HyperTread siping). Netperf client (8 cores i7-4790K CPU @ 4.00GHz) RX:8 and TX:8 queues. Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues. Driver mlx4. Disabled GRO to hit code path I changed in patch 2. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-10 14:48 ` Jesper Dangaard Brouer @ 2016-05-10 15:44 ` Eric Dumazet 0 siblings, 0 replies; 20+ messages in thread From: Eric Dumazet @ 2016-05-10 15:44 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Alexander Duyck, Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On Tue, 2016-05-10 at 16:48 +0200, Jesper Dangaard Brouer wrote: > On Tue, 10 May 2016 06:48:54 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote: > > > > > Disable busy poll on both client and server, Not patched: > > > > > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 > > > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > > > Local /Remote > > > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > > > Send Recv Size Size Time Rate local remote local remote > > > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > > > > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > > > 16384 87380 > > > > Tell us more about the -T6,6 > > > > For example how many TX/RX queues you have on the NIC, and which cpus > > service interrupts. > > The -T6,6 option: > -T lcpu,rcpu Request netperf/netserver be bound to local/remote cpu > Sure, I know -T option in netperf. > I use the option to get more stable results. If I don't pin/bind the > CPU netperf/netserver is running on then the CPU scheduler will migrate > the processes around. This gives unpredictable results, worst for the > busy_poll tests. Especially if the RX softirq runs on the same CPU > (also true if it runs on a HyperTread siping). > > Netperf client (8 cores i7-4790K CPU @ 4.00GHz) RX:8 and TX:8 queues. > Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues. > Driver mlx4. > Disabled GRO to hit code path I changed in patch 2. But are you using stuff like aRFS, RPS , RFS ? Each netperf run lands on different cpus, and we know that results can have a 25% variability because of that, even more on 2-node systems. By forcing -T6,6 you force the netperf/netserver cpu, not the RX queues. A nice effort would be to be able to chose the source in the 4-tuple at connect() time so that we know that Toeplitz hash will select the 'correct' RX queue. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context 2016-05-10 12:30 ` Jesper Dangaard Brouer 2016-05-10 13:48 ` Eric Dumazet @ 2016-05-10 17:46 ` Alexander Duyck 1 sibling, 0 replies; 20+ messages in thread From: Alexander Duyck @ 2016-05-10 17:46 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On 05/10/2016 05:30 AM, Jesper Dangaard Brouer wrote: > > On Mon, 9 May 2016 13:46:32 -0700 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> Try testing with TCP_RR instead and watch the CPU utilization. I'm >> suspecting allocating 8 and freeing 7 buffers for every 1 buffer >> received will blow any gains right out of the water. Also try it with >> a mix of traffic. So have one NIC doing TCP_RR while another is doing >> a stream test. You are stuffing 7 buffers onto a queue that were were >> using to perform bulk freeing. How much of a penalty do you take if >> you are now limited on how many you can bulk free because somebody >> left a stray 7 packets sitting on the queue? > > Testing with TCP_RR, is not a very "clean" network test. One have to be > very careful what is actually being tested, is it the server or client > which is the bottleneck. And most of all this is test of the CPU/process > scheduler. > > We can avoid the scheduler problem by enabling busy_poll/busy_read. > > I guess you want to see the "scheduler test" first. Default setting of > disabled busy poll on both client and server: > > Disable busy poll on both client and server, Not patched: Lets also define what "Not patched" means. I only want the bulk allocation patch tested. The other patches you have for the mlx4 and the WARN_ON are just noise. If possible it would be best to focus on just the one patch that is high risk. > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > 16384 87380 > > Disable busy poll on both client and server, patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 78517.32 3.06 2.84 3.118 8.677 > 16384 87380 > > I will not call this an improvement... the results are basically the same. So I have a few suggestions. 1. Either switch to ixgbe and use ATR/Flow Director or look at setting up your test so that the RSS key and indirection table are the same for each test and use the "-- -P" option in netperf to force the use of the same 5 tuple for each test. 2. Cut down on the noise. Specifically rebuild your kernel with as few options enabled as possible. If you don't need it drop it out so that we can identify exactly how much gain there is to be had from your patches. Also you should increase your test to use multiple CPUs, or cut down on the number of CPUs so that you aren't cutting down the the CPU utilization so much. If you have to you might even look at doing something like using SAR -P 6 in order to be able to monitor the CPU utilization of just CPU 6 on your test system. The goal is you want your tests to be repeatable if you had to move to another system or another NIC so I would recommend trying to find a way to make it so that much of the fluctuation is ruled out and that your numbers are as reliable as possible. 3. You might even try a pktgen Rx and drop test to see what the difference is in ns/packet for your allocation routine. Assuming you can clear out the variability that would be a useful datapoint as you could then also collect the perf data to show which functions have reduced their total CPU time. > Next step enabling busy poll on the server. The server is likely the > bottleneck, given it's CPU is slower than the client. Context switches > on the server is too high 156K/sec, after enabling busy poll reduced to > 620/sec. Note the client is doing around 233k/sec context switches, > (fairly impressive). > > Enabling busy poll on the server: > sysctl -w net.core.busy_poll=50 > sysctl -w net.core.busy_read=50 > > > Enabled busy poll only on server, Not patched: > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 112480.72 5.90 4.68 4.194 9.984 > 16384 87380 > > Enabled busy poll only on server, patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 110152.34 5.84 4.60 4.242 10.014 > 16384 87380 > > Numbers are too close, for any conclusions. Agreed. > Running a second run, on Not-patched kernel: > Enabled busy poll only on server, Not patched: > [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 101554.90 4.12 4.31 3.245 10.185 > 16384 87380 > > Thus, variation between runs are bigger than any improvement/regression, > thus no performance conclusions from this change can be drawn. Like Eric mentioned this is likely the fact that you are bouncing between Rx queues. > Lets move beyond testing the CPU/process scheduler by enabling > busy-polling on both client and server: > (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50) > > Enable busy poll on both client and server, Not patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 137987.86 13.18 4.77 7.643 8.298 > 16384 87380 > > > Enable busy poll on both client and server, patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 147324.38 13.76 4.76 7.474 7.747 > 16384 87380 > > I've a little bit surprised to see such a large improvement here 6.76%. > 147324/137987*100 = 106.76 That is a difference of 500ns per packet. I am highly doubtful we are seeing that much of an improvement as well. Odds are you were doing something cross-node in your first run or something along those lines since the difference is too large to be attributed to the bulk allocation change. > I'm remaining skeptic towards this measurement, as the improvement > should not be this high. Even if recycling is happening. > > Perf record does show less calls to __slab_free(), indicating better > interaction with SLUB, and perhaps recycling working. But this is > only a perf-report change from 0.37% to 0.33%. > > More testing show not-patched kernel fluctuate between 125k-143k/sec, > and patched kernel fluctuate between 131k-152k/sec. The ranges are too > high, to say anything conclusive. It seems to be timing dependent, as > starting and stoping the test with -D 1, show a rate variation within > 2k/sec, but rate itself can vary withing the range stated. I am pretty sure we are guaranteed to see a performance regression for socket based workloads. You are hoping for recycling to occur, but in almost all cases recycling almost always fails to show any gains and just ends up introducing the possibility for regressions as something always gets overlooked. On top of that you aren't guaranteed a frame from Tx clean-up is going to be warm in the cache so you may end up taking a few cache line misses as well. I haven't seen enough change in these patches to justify having them submitted to the kernel. A 1% improvement in one specific test case is kind of a vague reason to do something that very likely introduces regressions in many other cases. - Alex ^ permalink raw reply [flat|nested] 20+ messages in thread
* [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations 2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer 2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer @ 2016-05-09 13:44 ` Jesper Dangaard Brouer 2016-05-09 16:47 ` Alexander Duyck 2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer 2016-05-20 8:14 ` [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer 3 siblings, 1 reply; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw) To: netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer Activate the bulk alloc API, simply by changing mlx4 from using netdev_alloc_skb() to using napi_alloc_skb(). Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> --- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index 8ef6875b6cf9..84fd6db5a176 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -577,14 +577,15 @@ fail: static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv, struct mlx4_en_rx_desc *rx_desc, struct mlx4_en_rx_alloc *frags, - unsigned int length) + unsigned int length, + struct napi_struct *napi) { struct sk_buff *skb; void *va; int used_frags; dma_addr_t dma; - skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN); + skb = napi_alloc_skb(napi, SMALL_PACKET_SIZE + NET_IP_ALIGN); if (!skb) { en_dbg(RX_ERR, priv, "Failed allocating skb\n"); return NULL; @@ -932,7 +933,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud } /* GRO not possible, complete processing here */ - skb = mlx4_en_rx_skb(priv, rx_desc, frags, length); + skb = mlx4_en_rx_skb(priv, rx_desc, frags, length, &cq->napi); if (!skb) { ring->dropped++; goto next; ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations 2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer @ 2016-05-09 16:47 ` Alexander Duyck 2016-05-09 20:05 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 20+ messages in thread From: Alexander Duyck @ 2016-05-09 16:47 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > Activate the bulk alloc API, simply by changing mlx4 from using > netdev_alloc_skb() to using napi_alloc_skb(). This patch is just enabling the napi_alloc_skb call. You don't need to call out that it is enabling bulk allocations. This patch could stand on its own without needing to make reference to the bulk allocation API because there is enough of a gain from napi_alloc_skb replacing netdev_alloc_skb. > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > --- > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > index 8ef6875b6cf9..84fd6db5a176 100644 > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > @@ -577,14 +577,15 @@ fail: > static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv, > struct mlx4_en_rx_desc *rx_desc, > struct mlx4_en_rx_alloc *frags, > - unsigned int length) > + unsigned int length, > + struct napi_struct *napi) Instead of passing the NAPI structure you could just pass the mlx4_en_cq pointer to be used by the NAPI alloc function. In addition you might try adding the new parameter before length since that way the pointers are in one block followed by integers in a tapering length order. > { > struct sk_buff *skb; > void *va; > int used_frags; > dma_addr_t dma; > > - skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN); > + skb = napi_alloc_skb(napi, SMALL_PACKET_SIZE + NET_IP_ALIGN); The NET_IP_ALIGN is redundant as napi_alloc_skb already takes are of adding that and NET_SKB_PAD. > if (!skb) { > en_dbg(RX_ERR, priv, "Failed allocating skb\n"); > return NULL; > @@ -932,7 +933,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud > } > > /* GRO not possible, complete processing here */ > - skb = mlx4_en_rx_skb(priv, rx_desc, frags, length); > + skb = mlx4_en_rx_skb(priv, rx_desc, frags, length, &cq->napi); > if (!skb) { > ring->dropped++; > goto next; > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations 2016-05-09 16:47 ` Alexander Duyck @ 2016-05-09 20:05 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 20:05 UTC (permalink / raw) To: Alexander Duyck Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev, brouer On Mon, 9 May 2016 09:47:08 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > Activate the bulk alloc API, simply by changing mlx4 from using > > netdev_alloc_skb() to using napi_alloc_skb(). > > This patch is just enabling the napi_alloc_skb call. You don't need > to call out that it is enabling bulk allocations. This patch could > stand on its own without needing to make reference to the bulk > allocation API because there is enough of a gain from napi_alloc_skb > replacing netdev_alloc_skb. Okay, I see you point. > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > > --- > > drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 ++++--- > > 1 file changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > > index 8ef6875b6cf9..84fd6db5a176 100644 > > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c > > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c > > @@ -577,14 +577,15 @@ fail: > > static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv, > > struct mlx4_en_rx_desc *rx_desc, > > struct mlx4_en_rx_alloc *frags, > > - unsigned int length) > > + unsigned int length, > > + struct napi_struct *napi) > > Instead of passing the NAPI structure you could just pass the > mlx4_en_cq pointer to be used by the NAPI alloc function. In addition > you might try adding the new parameter before length since that way > the pointers are in one block followed by integers in a tapering > length order. Okay, will adjust. > > { > > struct sk_buff *skb; > > void *va; > > int used_frags; > > dma_addr_t dma; > > > > - skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN); > > + skb = napi_alloc_skb(napi, SMALL_PACKET_SIZE + NET_IP_ALIGN); > > The NET_IP_ALIGN is redundant as napi_alloc_skb already takes are of > adding that and NET_SKB_PAD. Thanks for catching this. > > if (!skb) { > > en_dbg(RX_ERR, priv, "Failed allocating skb\n"); > > return NULL; > > @@ -932,7 +933,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud > > } > > > > /* GRO not possible, complete processing here */ > > - skb = mlx4_en_rx_skb(priv, rx_desc, frags, length); > > + skb = mlx4_en_rx_skb(priv, rx_desc, frags, length, &cq->napi); > > if (!skb) { > > ring->dropped++; > > goto next; > > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
* [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context 2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer 2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer 2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer @ 2016-05-09 13:44 ` Jesper Dangaard Brouer 2016-05-09 13:53 ` Sergei Shtylyov ` (2 more replies) 2016-05-20 8:14 ` [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer 3 siblings, 3 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw) To: netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer It have always been required to call napi_alloc_skb from NAPI/softirq context, which implies running with local_bh_disable'ed. Thus, this code path should already be well tested. But recent SKB bulk changes introduced will make this more volatile and bugs more subtle, if this is violated. To catch any driver violating this add a loud WARN_ON. Performance wise, I do worry about adding this runtime check code into the hotpath, of this highly optimized function call. I've micro-benchmarked it with both IP-forwarding and local UDP delivery, and didn't see any regressions. It does adds extra code size (icache). add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43) function old new delta __napi_alloc_skb 461 504 +43 Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> --- net/core/skbuff.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index e85f1065b263..99addbf66f2e 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -498,6 +498,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, struct sk_buff *skb; void *data; + /* Catch drivers violating, not having local BH disabled */ + WARN_ON(!in_softirq()); + len += NET_SKB_PAD + NET_IP_ALIGN; if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || ^ permalink raw reply related [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context 2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer @ 2016-05-09 13:53 ` Sergei Shtylyov 2016-05-09 13:53 ` Sergei Shtylyov 2016-05-09 16:33 ` Alexander Duyck 2 siblings, 0 replies; 20+ messages in thread From: Sergei Shtylyov @ 2016-05-09 13:53 UTC (permalink / raw) To: Jesper Dangaard Brouer, netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck Hello. On 5/9/2016 4:44 PM, Jesper Dangaard Brouer wrote: > It have always been required to call napi_alloc_skb from NAPI/softirq It has. > context, which implies running with local_bh_disable'ed. Thus, this Not local_bh_disable'd? > code path should already be well tested. But recent SKB bulk changes > introduced will make this more volatile and bugs more subtle, if this > is violated. > > To catch any driver violating this add a loud WARN_ON. > > Performance wise, I do worry about adding this runtime check code into > the hotpath, of this highly optimized function call. I've Hot path? My spellchecked trips here. > micro-benchmarked it with both IP-forwarding and local UDP delivery, > and didn't see any regressions. It does adds extra code size (icache). > > add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43) > function old new delta > __napi_alloc_skb 461 504 +43 > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> [...] MBR, Sergei ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context 2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer 2016-05-09 13:53 ` Sergei Shtylyov @ 2016-05-09 13:53 ` Sergei Shtylyov 2016-05-09 16:33 ` Alexander Duyck 2 siblings, 0 replies; 20+ messages in thread From: Sergei Shtylyov @ 2016-05-09 13:53 UTC (permalink / raw) To: Jesper Dangaard Brouer, netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck Hello. On 5/9/2016 4:44 PM, Jesper Dangaard Brouer wrote: > It have always been required to call napi_alloc_skb from NAPI/softirq It has. > context, which implies running with local_bh_disable'ed. Thus, this Not local_bh_disable'd? > code path should already be well tested. But recent SKB bulk changes > introduced will make this more volatile and bugs more subtle, if this > is violated. > > To catch any driver violating this add a loud WARN_ON. > > Performance wise, I do worry about adding this runtime check code into > the hotpath, of this highly optimized function call. I've Hot path? My spellchecker trips here. > micro-benchmarked it with both IP-forwarding and local UDP delivery, > and didn't see any regressions. It does adds extra code size (icache). > > add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43) > function old new delta > __napi_alloc_skb 461 504 +43 > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> [...] MBR, Sergei ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context 2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer 2016-05-09 13:53 ` Sergei Shtylyov 2016-05-09 13:53 ` Sergei Shtylyov @ 2016-05-09 16:33 ` Alexander Duyck 2016-05-09 20:03 ` Jesper Dangaard Brouer 2 siblings, 1 reply; 20+ messages in thread From: Alexander Duyck @ 2016-05-09 16:33 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > It have always been required to call napi_alloc_skb from NAPI/softirq > context, which implies running with local_bh_disable'ed. Thus, this > code path should already be well tested. But recent SKB bulk changes > introduced will make this more volatile and bugs more subtle, if this > is violated. It hasn't been required to be in the softirq routine, bottom halves could have been disabled. I used the NAPI parameter to try and enforce a strong correlation between the two but I didn't make it a hard requirement. It isn't until you try to do the recycling that it becomes more of a requirement because then it is tied into the bulk free. The bulk free routine had a stronger requirement for NAPI because it needed to have the __kfree_skb_flush routine called to finally free the skbuffs that were still hanging off the percpu skbuff structure. > To catch any driver violating this add a loud WARN_ON. > > Performance wise, I do worry about adding this runtime check code into > the hotpath, of this highly optimized function call. I've > micro-benchmarked it with both IP-forwarding and local UDP delivery, > and didn't see any regressions. It does adds extra code size (icache). > > add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43) > function old new delta > __napi_alloc_skb 461 504 +43 > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > --- > net/core/skbuff.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index e85f1065b263..99addbf66f2e 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -498,6 +498,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, > struct sk_buff *skb; > void *data; > > + /* Catch drivers violating, not having local BH disabled */ > + WARN_ON(!in_softirq()); > + I'm pretty sure this still lets you get away with just disabling bottom halves. I think what you would want is in_serving_softirq(). > len += NET_SKB_PAD + NET_IP_ALIGN; > > if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context 2016-05-09 16:33 ` Alexander Duyck @ 2016-05-09 20:03 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-09 20:03 UTC (permalink / raw) To: Alexander Duyck Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev, brouer On Mon, 9 May 2016 09:33:25 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > It have always been required to call napi_alloc_skb from NAPI/softirq > > context, which implies running with local_bh_disable'ed. Thus, this > > code path should already be well tested. But recent SKB bulk changes > > introduced will make this more volatile and bugs more subtle, if this > > is violated. > > It hasn't been required to be in the softirq routine, bottom halves > could have been disabled. I used the NAPI parameter to try and > enforce a strong correlation between the two but I didn't make it a > hard requirement. It isn't until you try to do the recycling that it > becomes more of a requirement because then it is tied into the bulk > free. > > The bulk free routine had a stronger requirement for NAPI because it > needed to have the __kfree_skb_flush routine called to finally free > the skbuffs that were still hanging off the percpu skbuff structure. > > > To catch any driver violating this add a loud WARN_ON. > > > > Performance wise, I do worry about adding this runtime check code into > > the hotpath, of this highly optimized function call. I've > > micro-benchmarked it with both IP-forwarding and local UDP delivery, > > and didn't see any regressions. It does adds extra code size (icache). > > > > add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43) > > function old new delta > > __napi_alloc_skb 461 504 +43 > > > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > > --- > > net/core/skbuff.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index e85f1065b263..99addbf66f2e 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -498,6 +498,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, > > struct sk_buff *skb; > > void *data; > > > > + /* Catch drivers violating, not having local BH disabled */ > > + WARN_ON(!in_softirq()); > > + > > I'm pretty sure this still lets you get away with just disabling > bottom halves. I think what you would want is in_serving_softirq(). No, we want to allow this being call with just disabled bottom halves. Thus, we have to use in_softirq() and not in_serving_softirq(). > > len += NET_SKB_PAD + NET_IP_ALIGN; > > > > if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack 2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer ` (2 preceding siblings ...) 2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer @ 2016-05-20 8:14 ` Jesper Dangaard Brouer 3 siblings, 0 replies; 20+ messages in thread From: Jesper Dangaard Brouer @ 2016-05-20 8:14 UTC (permalink / raw) To: netdev, David S. Miller Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, brouer On Mon, 09 May 2016 15:44:24 +0200 Jesper Dangaard Brouer <brouer@redhat.com> wrote: > This patchset enables use of the slab/kmem_cache bulk alloc API, and > completes the use the slab/kmem_cache bulking API in the network stack. > > I've not included the patches that introduce a SKB bulk hint, which > would beneficial for drivers like mlx5. Lets see if we can agree on > this patchset first. Conclusion: Guess we could not agree on this patchset. The main problem seems to be that the patch always bulk allocated 8 SKBs, without knowing if we actually need them. My earlier patchset also included a "hint" interface, as mlx5 driver knows after processing the RX queue, how many SKBs it needs, thus it can request to bulk alloc the needed amount. (The only problem with mlx5 is that preallocating SKBs into it's RX ring is a broken scheme). A better scheme would be, if the driver on RX knows how many packets are available in the RX queue. Then we can bulk alloc the needed amount of SKBs. Thus, we can circle back to this once the driver can provide this information. (This goes nicely hand-in-hand with my idea of pulling out avail RX packet-pages from the RX ring, and start prefetch'es to hide the cache-misses). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat Author of http://www.iptv-analyzer.org LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads:[~2016-05-20 8:14 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer 2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer 2016-05-09 16:20 ` Alexander Duyck 2016-05-09 19:59 ` Jesper Dangaard Brouer 2016-05-09 20:46 ` Alexander Duyck 2016-05-10 9:29 ` Jesper Dangaard Brouer 2016-05-10 12:30 ` Jesper Dangaard Brouer 2016-05-10 13:48 ` Eric Dumazet 2016-05-10 14:48 ` Jesper Dangaard Brouer 2016-05-10 15:44 ` Eric Dumazet 2016-05-10 17:46 ` Alexander Duyck 2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer 2016-05-09 16:47 ` Alexander Duyck 2016-05-09 20:05 ` Jesper Dangaard Brouer 2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer 2016-05-09 13:53 ` Sergei Shtylyov 2016-05-09 13:53 ` Sergei Shtylyov 2016-05-09 16:33 ` Alexander Duyck 2016-05-09 20:03 ` Jesper Dangaard Brouer 2016-05-20 8:14 ` [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.