* [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack
@ 2016-05-09 13:44 Jesper Dangaard Brouer
  2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw)
  To: netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer

This patchset enables use of the slab/kmem_cache bulk alloc API, and
completes the use of the slab/kmem_cache bulking API in the network stack.

I've not included the patches that introduce an SKB bulk hint, which
would be beneficial for drivers like mlx5.  Let's see if we can agree on
this patchset first.

---

Jesper Dangaard Brouer (3):
      net: bulk alloc and reuse of SKBs in NAPI context
      mlx4: use napi_alloc_skb API to get SKB bulk allocations
      net: warn on napi_alloc_skb being called in wrong context


 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    7 ++-
 net/core/skbuff.c                          |   74 ++++++++++++++++++----------
 2 files changed, 52 insertions(+), 29 deletions(-)

* [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
@ 2016-05-09 13:44 ` Jesper Dangaard Brouer
  2016-05-09 16:20   ` Alexander Duyck
  2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw)
  To: netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer

This patch introduces bulk alloc of SKBs and allows reuse of SKBs
freed in the same softirq cycle.  SKBs are normally freed during TX
completion, but most high speed drivers also clean up the TX ring during
the NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
SKBs will be available in the napi_alloc_cache->skb_cache.

If no SKBs are available for reuse, then only bulk alloc 8 SKBs, to limit
the potential overshoot of unused SKBs that need to be freed when the NAPI
cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
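
For reference, the per-CPU cache and the flush helper mentioned above
come from the earlier bulk-free infrastructure in net/core/skbuff.c.
Roughly (sketched from memory; field types and the cache size may
differ slightly from the merged code):

#define NAPI_SKB_CACHE_SIZE	64

struct napi_alloc_cache {
	struct page_frag_cache page;	/* page fragments for skb->data */
	size_t skb_count;		/* number of cached sk_buff heads below */
	void *skb_cache[NAPI_SKB_CACHE_SIZE];
};

static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

/* called from net_rx_action() once the NAPI cycle ends */
void __kfree_skb_flush(void)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* hand any cached sk_buff heads back to the slab in one call */
	if (nc->skb_count) {
		kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count,
				     nc->skb_cache);
		nc->skb_count = 0;
	}
}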

Generalize ___build_skb() to allow passing it a preallocated SKB.

I've previously demonstrated a 1% speedup for IPv4 forwarding, when
used on the ixgbe driver.  If SKB alloc and free happen on different
CPUs (like in the RPS use-case) the performance benefit is expected to
increase.

All drivers using the napi_alloc_skb() call will benefit from
this change automatically.
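
To illustrate the pattern that makes the reuse possible, a simplified,
hypothetical driver poll function (the xyz_* names and XYZ_RX_COPYBREAK
are made up for the example):

static int xyz_napi_poll(struct napi_struct *napi, int budget)
{
	int work = 0;

	/* TX completion first: napi_consume_skb() parks the freed
	 * sk_buff heads in the per-CPU napi_alloc_cache->skb_cache
	 */
	xyz_clean_tx_ring(napi);

	/* RX: napi_alloc_skb() can now pop those heads back out
	 * instead of hitting kmem_cache_alloc() for every packet
	 */
	while (work < budget && xyz_rx_pending(napi)) {
		struct sk_buff *skb;

		skb = napi_alloc_skb(napi, XYZ_RX_COPYBREAK);
		if (!skb)
			break;
		/* ... fill skb and hand it to the stack ... */
		work++;
	}

	return work;
}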

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 net/core/skbuff.c |   71 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 26 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5586be93632f..e85f1065b263 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -281,32 +281,14 @@ nodata:
 }
 EXPORT_SYMBOL(__alloc_skb);
 
-/**
- * __build_skb - build a network buffer
- * @data: data buffer provided by caller
- * @frag_size: size of data, or 0 if head was kmalloced
- *
- * Allocate a new &sk_buff. Caller provides space holding head and
- * skb_shared_info. @data must have been allocated by kmalloc() only if
- * @frag_size is 0, otherwise data should come from the page allocator
- *  or vmalloc()
- * The return is the new skb buffer.
- * On a failure the return is %NULL, and @data is not freed.
- * Notes :
- *  Before IO, driver allocates only data buffer where NIC put incoming frame
- *  Driver should add room at head (NET_SKB_PAD) and
- *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
- *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
- *  before giving packet to stack.
- *  RX rings only contains data buffers, not full skbs.
- */
-struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+/* Allows skb being (pre)allocated by caller */
+static inline
+struct sk_buff *___build_skb(void *data, unsigned int frag_size,
+			     struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo;
-	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
 	if (!skb)
 		return NULL;
 
@@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 	return skb;
 }
 
+/**
+ * __build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ * @frag_size: size of data, or 0 if head was kmalloced
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. @data must have been allocated by kmalloc() only if
+ * @frag_size is 0, otherwise data should come from the page allocator
+ *  or vmalloc()
+ * The return is the new skb buffer.
+ * On a failure the return is %NULL, and @data is not freed.
+ * Notes :
+ *  Before IO, driver allocates only data buffer where NIC put incoming frame
+ *  Driver should add room at head (NET_SKB_PAD) and
+ *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
+ *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
+ *  before giving packet to stack.
+ *  RX rings only contains data buffers, not full skbs.
+ */
+struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+{
+	struct sk_buff *skb;
+	unsigned int size = frag_size ? : ksize(data);
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	return ___build_skb(data, size, skb);
+}
+
 /* build_skb() is wrapper over __build_skb(), that specifically
  * takes care of skb->head and skb->pfmemalloc
  * This means that if @frag_size is not zero, then @data must be backed
@@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+	if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+		     (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
-	if (unlikely(!skb)) {
+#define BULK_ALLOC_SIZE 8
+	if (!nc->skb_count) {
+		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+						      gfp_mask, BULK_ALLOC_SIZE,
+						      nc->skb_cache);
+	}
+	if (likely(nc->skb_count)) {
+		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
+	} else {
+		/* alloc bulk failed */
 		skb_free_frag(data);
 		return NULL;
 	}
+	skb = ___build_skb(data, len, skb);
 
 	/* use OR instead of assignment to avoid clearing of bits in mask */
 	if (nc->page.pfmemalloc)

* [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations
  2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
  2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
@ 2016-05-09 13:44 ` Jesper Dangaard Brouer
  2016-05-09 16:47   ` Alexander Duyck
  2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer
  2016-05-20  8:14 ` [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
  3 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw)
  To: netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer

Activate the bulk alloc API, simply by changing mlx4 from using
netdev_alloc_skb() to using napi_alloc_skb().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 8ef6875b6cf9..84fd6db5a176 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -577,14 +577,15 @@ fail:
 static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
 				      struct mlx4_en_rx_desc *rx_desc,
 				      struct mlx4_en_rx_alloc *frags,
-				      unsigned int length)
+				      unsigned int length,
+				      struct napi_struct *napi)
 {
 	struct sk_buff *skb;
 	void *va;
 	int used_frags;
 	dma_addr_t dma;
 
-	skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN);
+	skb = napi_alloc_skb(napi, SMALL_PACKET_SIZE + NET_IP_ALIGN);
 	if (!skb) {
 		en_dbg(RX_ERR, priv, "Failed allocating skb\n");
 		return NULL;
@@ -932,7 +933,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		}
 
 		/* GRO not possible, complete processing here */
-		skb = mlx4_en_rx_skb(priv, rx_desc, frags, length);
+		skb = mlx4_en_rx_skb(priv, rx_desc, frags, length, &cq->napi);
 		if (!skb) {
 			ring->dropped++;
 			goto next;

* [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context
  2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
  2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
  2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer
@ 2016-05-09 13:44 ` Jesper Dangaard Brouer
  2016-05-09 13:53   ` Sergei Shtylyov
                     ` (2 more replies)
  2016-05-20  8:14 ` [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
  3 siblings, 3 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 13:44 UTC (permalink / raw)
  To: netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, Jesper Dangaard Brouer

It have always been required to call napi_alloc_skb from NAPI/softirq
context, which implies running with local_bh_disable'ed.  Thus, this
code path should already be well tested. But recent SKB bulk changes
introduced will make this more volatile and bugs more subtle, if this
is violated.

To catch any driver violating this add a loud WARN_ON.
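
For illustration only (not part of the patch), a caller outside a NAPI
poll routine would have to keep bottom halves disabled around the call
for this check to pass; "adapter" and "pkt_len" below are hypothetical:

	/* BH-disable protects the per-CPU napi_alloc_cache from
	 * being touched concurrently by the softirq
	 */
	local_bh_disable();
	skb = napi_alloc_skb(&adapter->napi, pkt_len);
	local_bh_enable();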

Performance wise, I do worry about adding this runtime check code into
the hotpath, of this highly optimized function call.  I've
micro-benchmarked it with both IP-forwarding and local UDP delivery,
and didn't see any regressions.  It does adds extra code size (icache).

add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43)
function                                     old     new   delta
__napi_alloc_skb                             461     504     +43

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/skbuff.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e85f1065b263..99addbf66f2e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -498,6 +498,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	struct sk_buff *skb;
 	void *data;
 
+	/* Catch drivers violating, not having local BH disabled */
+	WARN_ON(!in_softirq());
+
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
 	if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||

* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context
  2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer
@ 2016-05-09 13:53   ` Sergei Shtylyov
  2016-05-09 13:53   ` Sergei Shtylyov
  2016-05-09 16:33   ` Alexander Duyck
  2 siblings, 0 replies; 20+ messages in thread
From: Sergei Shtylyov @ 2016-05-09 13:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck

Hello.

On 5/9/2016 4:44 PM, Jesper Dangaard Brouer wrote:

> It have always been required to call napi_alloc_skb from NAPI/softirq

    It has.

> context, which implies running with local_bh_disable'ed.  Thus, this

    Not local_bh_disable'd?

> code path should already be well tested. But recent SKB bulk changes
> introduced will make this more volatile and bugs more subtle, if this
> is violated.
>
> To catch any driver violating this add a loud WARN_ON.
>
> Performance wise, I do worry about adding this runtime check code into
> the hotpath, of this highly optimized function call.  I've

    Hot path? My spellchecker trips here.

> micro-benchmarked it with both IP-forwarding and local UDP delivery,
> and didn't see any regressions.  It does adds extra code size (icache).
>
> add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43)
> function                                     old     new   delta
> __napi_alloc_skb                             461     504     +43
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
[...]

MBR, Sergei

* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context
  2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer
  2016-05-09 13:53   ` Sergei Shtylyov
@ 2016-05-09 13:53   ` Sergei Shtylyov
  2016-05-09 16:33   ` Alexander Duyck
  2 siblings, 0 replies; 20+ messages in thread
From: Sergei Shtylyov @ 2016-05-09 13:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck

Hello.

On 5/9/2016 4:44 PM, Jesper Dangaard Brouer wrote:

> It have always been required to call napi_alloc_skb from NAPI/softirq

    It has.

> context, which implies running with local_bh_disable'ed.  Thus, this

    Not local_bh_disable'd?

> code path should already be well tested. But recent SKB bulk changes
> introduced will make this more volatile and bugs more subtle, if this
> is violated.
>
> To catch any driver violating this add a loud WARN_ON.
>
> Performance wise, I do worry about adding this runtime check code into
> the hotpath, of this highly optimized function call.  I've

    Hot path? My spellchecker trips here.

> micro-benchmarked it with both IP-forwarding and local UDP delivery,
> and didn't see any regressions.  It does adds extra code size (icache).
>
> add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43)
> function                                     old     new   delta
> __napi_alloc_skb                             461     504     +43
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
[...]

MBR, Sergei

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
@ 2016-05-09 16:20   ` Alexander Duyck
  2016-05-09 19:59     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Duyck @ 2016-05-09 16:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev

On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> This patch introduce bulk alloc of SKBs and allow reuse of SKBs
> free'ed in same softirq cycle.  SKBs are normally free'ed during TX
> completion, but most high speed drivers also cleanup TX ring during
> NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
> SKBs will be avail in the napi_alloc_cache->skb_cache.
>
> If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
> the potential overshooting unused SKBs needed to free'ed when NAPI
> cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
>
> Generalize ___build_skb() to allow passing it a preallocated SKB.
>
> I've previously demonstrated a 1% speedup for IPv4 forwarding, when
> used on the ixgbe driver.  If SKB alloc and free happens on different
> CPUs (like in RPS use-case) the performance benefit is expected to
> increase.

Really I am not sure this patch series is worth the effort.  For
freeing buffers in Tx it was an obvious win.  But adding all this
complexity for a 1% gain just doesn't seem worth the effort.

> All drivers using the napi_alloc_skb() call will benefit from
> this change automatically.

Yes, but a 1% improvement is essentially just noise.  I'd say we need
to show a better gain or be able to more precisely show that this is a
definite gain and not just a test fluctuation resulting in the
improvement.  For all I know all of the gain could be due to a
function shifting around so that some loop is now 16 byte aligned.

> Joint work with Alexander Duyck.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>

The fact that this is still using my redhat.com address says volumes.
If I recall I think we were working on this code around 9 months ago
and had similar issues with it showing either negative performance or
no gain.  My advice then was to push the bulk free code and try to
find a way to fix the bulk allocation.  If we are still at this state
for bulk allocation then maybe we need to drop the bulk allocation and
start looking at other avenues to pursue.

> ---
>  net/core/skbuff.c |   71 ++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 26 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5586be93632f..e85f1065b263 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -281,32 +281,14 @@ nodata:
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>
> -/**
> - * __build_skb - build a network buffer
> - * @data: data buffer provided by caller
> - * @frag_size: size of data, or 0 if head was kmalloced
> - *
> - * Allocate a new &sk_buff. Caller provides space holding head and
> - * skb_shared_info. @data must have been allocated by kmalloc() only if
> - * @frag_size is 0, otherwise data should come from the page allocator
> - *  or vmalloc()
> - * The return is the new skb buffer.
> - * On a failure the return is %NULL, and @data is not freed.
> - * Notes :
> - *  Before IO, driver allocates only data buffer where NIC put incoming frame
> - *  Driver should add room at head (NET_SKB_PAD) and
> - *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> - *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
> - *  before giving packet to stack.
> - *  RX rings only contains data buffers, not full skbs.
> - */
> -struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> +/* Allows skb being (pre)allocated by caller */
> +static inline
> +struct sk_buff *___build_skb(void *data, unsigned int frag_size,
> +                            struct sk_buff *skb)
>  {
>         struct skb_shared_info *shinfo;
> -       struct sk_buff *skb;
>         unsigned int size = frag_size ? : ksize(data);
>
> -       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
>         if (!skb)
>                 return NULL;
>
> @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>         return skb;
>  }
>
> +/**
> + * __build_skb - build a network buffer
> + * @data: data buffer provided by caller
> + * @frag_size: size of data, or 0 if head was kmalloced
> + *
> + * Allocate a new &sk_buff. Caller provides space holding head and
> + * skb_shared_info. @data must have been allocated by kmalloc() only if
> + * @frag_size is 0, otherwise data should come from the page allocator
> + *  or vmalloc()
> + * The return is the new skb buffer.
> + * On a failure the return is %NULL, and @data is not freed.
> + * Notes :
> + *  Before IO, driver allocates only data buffer where NIC put incoming frame
> + *  Driver should add room at head (NET_SKB_PAD) and
> + *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> + *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
> + *  before giving packet to stack.
> + *  RX rings only contains data buffers, not full skbs.
> + */
> +struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> +{
> +       struct sk_buff *skb;
> +       unsigned int size = frag_size ? : ksize(data);
> +
> +       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> +       return ___build_skb(data, size, skb);
> +}
> +
>  /* build_skb() is wrapper over __build_skb(), that specifically
>   * takes care of skb->head and skb->pfmemalloc
>   * This means that if @frag_size is not zero, then @data must be backed

If we can avoid having to break up build_skb into more functions that
would be preferred.  I realize I probably wrote this code in order to
enable the bulk allocation approach, but really I don't want to add
more complexity unless we can show a strong gain which we haven't been
able to demonstrate.

> @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>
>         len += NET_SKB_PAD + NET_IP_ALIGN;
>
> -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> +       if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> +                    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>                 if (!skb)
>                         goto skb_fail;

Does this unlikely really make a difference?  If so you might want to
move it into a patch all on its own.

> @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>         if (unlikely(!data))
>                 return NULL;
>
> -       skb = __build_skb(data, len);
> -       if (unlikely(!skb)) {
> +#define BULK_ALLOC_SIZE 8
> +       if (!nc->skb_count) {
> +               nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
> +                                                     gfp_mask, BULK_ALLOC_SIZE,
> +                                                     nc->skb_cache);
> +       }
> +       if (likely(nc->skb_count)) {
> +               skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> +       } else {
> +               /* alloc bulk failed */

So did you try doing a low latency socket test with this patch?  I'm
guessing not as I am certain there is a negative impact from having to
allocate 8 buffers, and then free back 7 every time you call the NAPI
polling routine with just one buffer in the ring.

>                 skb_free_frag(data);
>                 return NULL;
>         }
> +       skb = ___build_skb(data, len, skb);
>
>         /* use OR instead of assignment to avoid clearing of bits in mask */
>         if (nc->page.pfmemalloc)
>

* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context
  2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer
  2016-05-09 13:53   ` Sergei Shtylyov
  2016-05-09 13:53   ` Sergei Shtylyov
@ 2016-05-09 16:33   ` Alexander Duyck
  2016-05-09 20:03     ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 20+ messages in thread
From: Alexander Duyck @ 2016-05-09 16:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev

On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> It have always been required to call napi_alloc_skb from NAPI/softirq
> context, which implies running with local_bh_disable'ed.  Thus, this
> code path should already be well tested. But recent SKB bulk changes
> introduced will make this more volatile and bugs more subtle, if this
> is violated.

It hasn't been required to be in the softirq routine, bottom halves
could have been disabled.  I used the NAPI parameter to try and
enforce a strong correlation between the two but I didn't make it a
hard requirement.  It isn't until you try to do the recycling that it
becomes more of a requirement because then it is tied into the bulk
free.

The bulk free routine had a stronger requirement for NAPI because it
needed to have the __kfree_skb_flush routine called to finally free
the skbuffs that were still hanging off the percpu skbuff structure.

> To catch any driver violating this add a loud WARN_ON.
>
> Performance wise, I do worry about adding this runtime check code into
> the hotpath, of this highly optimized function call.  I've
> micro-benchmarked it with both IP-forwarding and local UDP delivery,
> and didn't see any regressions.  It does adds extra code size (icache).
>
> add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43)
> function                                     old     new   delta
> __napi_alloc_skb                             461     504     +43
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/skbuff.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e85f1065b263..99addbf66f2e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -498,6 +498,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>         struct sk_buff *skb;
>         void *data;
>
> +       /* Catch drivers violating, not having local BH disabled */
> +       WARN_ON(!in_softirq());
> +

I'm pretty sure this still lets you get away with just disabling
bottom halves.  I think what you would want is in_serving_softirq().

>         len += NET_SKB_PAD + NET_IP_ALIGN;
>
>         if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>

* Re: [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations
  2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer
@ 2016-05-09 16:47   ` Alexander Duyck
  2016-05-09 20:05     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Duyck @ 2016-05-09 16:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev

On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Activate the bulk alloc API, simply by changing mlx4 from using
> netdev_alloc_skb() to using napi_alloc_skb().

This patch is just enabling the napi_alloc_skb call.  You don't need
to call out that it is enabling bulk allocations.  This patch could
stand on its own without needing to make reference to the bulk
allocation API because there is enough of a gain from napi_alloc_skb
replacing netdev_alloc_skb.

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 8ef6875b6cf9..84fd6db5a176 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -577,14 +577,15 @@ fail:
>  static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
>                                       struct mlx4_en_rx_desc *rx_desc,
>                                       struct mlx4_en_rx_alloc *frags,
> -                                     unsigned int length)
> +                                     unsigned int length,
> +                                     struct napi_struct *napi)

Instead of passing the NAPI structure you could just pass the
mlx4_en_cq pointer to be used by the NAPI alloc function.  In addition
you might try adding the new parameter before length since that way
the pointers are in one block followed by integers in a tapering
length order.
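
In other words, something along these lines (hypothetical; this only
restates the suggestion, it is not a tested prototype):

	static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
					      struct mlx4_en_rx_desc *rx_desc,
					      struct mlx4_en_rx_alloc *frags,
					      struct mlx4_en_cq *cq,
					      unsigned int length);

with the allocation inside then becoming
napi_alloc_skb(&cq->napi, SMALL_PACKET_SIZE).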

>  {
>         struct sk_buff *skb;
>         void *va;
>         int used_frags;
>         dma_addr_t dma;
>
> -       skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN);
> +       skb = napi_alloc_skb(napi, SMALL_PACKET_SIZE + NET_IP_ALIGN);

The NET_IP_ALIGN is redundant as napi_alloc_skb already takes care of
adding that and NET_SKB_PAD.

>         if (!skb) {
>                 en_dbg(RX_ERR, priv, "Failed allocating skb\n");
>                 return NULL;
> @@ -932,7 +933,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                 }
>
>                 /* GRO not possible, complete processing here */
> -               skb = mlx4_en_rx_skb(priv, rx_desc, frags, length);
> +               skb = mlx4_en_rx_skb(priv, rx_desc, frags, length, &cq->napi);
>                 if (!skb) {
>                         ring->dropped++;
>                         goto next;
>

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-09 16:20   ` Alexander Duyck
@ 2016-05-09 19:59     ` Jesper Dangaard Brouer
  2016-05-09 20:46       ` Alexander Duyck
  0 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 19:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz,
	Eugenia Emantayev, brouer

On Mon, 9 May 2016 09:20:41 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs
> > free'ed in same softirq cycle.  SKBs are normally free'ed during TX
> > completion, but most high speed drivers also cleanup TX ring during
> > NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
> > SKBs will be avail in the napi_alloc_cache->skb_cache.
> >
> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
> > the potential overshooting unused SKBs needed to free'ed when NAPI
> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
> >
> > Generalize ___build_skb() to allow passing it a preallocated SKB.
> >
> > I've previously demonstrated a 1% speedup for IPv4 forwarding, when
> > used on the ixgbe driver.  If SKB alloc and free happens on different
> > CPUs (like in RPS use-case) the performance benefit is expected to
> > increase.  
> 
> Really I am not sure this patch series is worth the effort.  For
> freeing buffers in Tx it was an obvious win.  But adding all this
> complexity for a 1% gain just doesn't seen worth the effort.

I still think it is worth it, because: 1) it enables recycling, which
is more likely for real-life traffic (e.g. some TCP ACKs get TX DMA
completion cleanup, and RX can re-use these), and 2) bulk alloc
and bulk free get "coupled" (mostly relevant when doing cross-CPU).


> > All drivers using the napi_alloc_skb() call will benefit from
> > this change automatically.  
> 
> Yes, but a 1% improvement is essentially just noise.  I'd say we need
> to show a better gain or be able to more precisely show that this is a
> definite gain and not just a test fluctuation resulting in the
> improvement.  For all I know all of the gain could be due to a
> function shifting around so that some loop is now 16 byte aligned.
> 
> > Joint work with Alexander Duyck.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>  
> 
> The fact that this is still using my redhat.com address says volumes.
> If I recall I think we were working on this code around 9 months ago
> and had similar issues with it showing either negative performance or
> no gain.  My advice then was to push the bulk free code and try to
> find a way to fix the bulk allocation.  If we are still at this state
> for bulk allocation then maybe we need to drop the bulk allocation and
> start looking at other avenues to pursue.

I'm sure there is a gain here. Sure, I could spend 2-3 weeks coming up
with a benchmark that shows a bigger gain, but is that time well spent?

My quick benchmarking with mlx4 actually showed a 2.7% improvement for
local UDP socket delivery, and only 0.6% on IP-forwarding.  If I find
some bidirectional traffic then the benefit should be higher due to
recycling kicking in.

We did do as you recommended, and did the bulk free code first.  I'm
just getting the last pieces pushed.  I didn't see any negative
performance in my testing.

As you also know, tuning the SLUB system will give higher performance,
easily.  In the future, I'm planning to get some auto-tuning into the
SLUB allocator.  I've already discussed this with Christoph Lameter at
the MM summit; see presentation[1], slides 4 and 5.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp
 
> > ---
> >  net/core/skbuff.c |   71
> > ++++++++++++++++++++++++++++++++++------------------- 1 file
> > changed, 45 insertions(+), 26 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 5586be93632f..e85f1065b263 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -281,32 +281,14 @@ nodata:
> >  }
> >  EXPORT_SYMBOL(__alloc_skb);
> >
> > -/**
> > - * __build_skb - build a network buffer
> > - * @data: data buffer provided by caller
> > - * @frag_size: size of data, or 0 if head was kmalloced
> > - *
> > - * Allocate a new &sk_buff. Caller provides space holding head and
> > - * skb_shared_info. @data must have been allocated by kmalloc()
> > only if
> > - * @frag_size is 0, otherwise data should come from the page
> > allocator
> > - *  or vmalloc()
> > - * The return is the new skb buffer.
> > - * On a failure the return is %NULL, and @data is not freed.
> > - * Notes :
> > - *  Before IO, driver allocates only data buffer where NIC put
> > incoming frame
> > - *  Driver should add room at head (NET_SKB_PAD) and
> > - *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> > - *  After IO, driver calls build_skb(), to allocate sk_buff and
> > populate it
> > - *  before giving packet to stack.
> > - *  RX rings only contains data buffers, not full skbs.
> > - */
> > -struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > +/* Allows skb being (pre)allocated by caller */
> > +static inline
> > +struct sk_buff *___build_skb(void *data, unsigned int frag_size,
> > +                            struct sk_buff *skb)
> >  {
> >         struct skb_shared_info *shinfo;
> > -       struct sk_buff *skb;
> >         unsigned int size = frag_size ? : ksize(data);
> >
> > -       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> >         if (!skb)
> >                 return NULL;
> >
> > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data,
> > unsigned int frag_size) return skb;
> >  }
> >
> > +/**
> > + * __build_skb - build a network buffer
> > + * @data: data buffer provided by caller
> > + * @frag_size: size of data, or 0 if head was kmalloced
> > + *
> > + * Allocate a new &sk_buff. Caller provides space holding head and
> > + * skb_shared_info. @data must have been allocated by kmalloc()
> > only if
> > + * @frag_size is 0, otherwise data should come from the page
> > allocator
> > + *  or vmalloc()
> > + * The return is the new skb buffer.
> > + * On a failure the return is %NULL, and @data is not freed.
> > + * Notes :
> > + *  Before IO, driver allocates only data buffer where NIC put
> > incoming frame
> > + *  Driver should add room at head (NET_SKB_PAD) and
> > + *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> > + *  After IO, driver calls build_skb(), to allocate sk_buff and
> > populate it
> > + *  before giving packet to stack.
> > + *  RX rings only contains data buffers, not full skbs.
> > + */
> > +struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > +{
> > +       struct sk_buff *skb;
> > +       unsigned int size = frag_size ? : ksize(data);
> > +
> > +       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> > +       return ___build_skb(data, size, skb);
> > +}
> > +
> >  /* build_skb() is wrapper over __build_skb(), that specifically
> >   * takes care of skb->head and skb->pfmemalloc
> >   * This means that if @frag_size is not zero, then @data must be
> > backed  
> 
> If we can avoid having to break up build_skb into more functions that
> would be preferred.  I realize I probably wrote this code in order to
> enable the bulk allocation approach, but really I don't want to add
> more complexity unless we can show a strong gain which we haven't been
> able to demonstrate.

You do notice that ___build_skb gets inlined, thus there is no
performance loss.  And this change was explicitly requested last time
this patch was reviewed.  And I think this variant is much less
intrusive.

 
> > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct
> > napi_struct *napi, unsigned int len,
> >
> >         len += NET_SKB_PAD + NET_IP_ALIGN;
> >
> > -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > -           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > +       if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > +                    (gfp_mask & (__GFP_DIRECT_RECLAIM |
> > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX,
> > NUMA_NO_NODE); if (!skb)
> >                         goto skb_fail;  
> 
> Does this unlikely really make a difference?  If so you might want to
> move it into a patch all on its own.

I can place it in a patch of its own.  I just noticed the compiler
laid out the code wrongly compared to the normal use-case.


> > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct
> > napi_struct *napi, unsigned int len, if (unlikely(!data))
> >                 return NULL;
> >
> > -       skb = __build_skb(data, len);
> > -       if (unlikely(!skb)) {
> > +#define BULK_ALLOC_SIZE 8
> > +       if (!nc->skb_count) {
> > +               nc->skb_count =
> > kmem_cache_alloc_bulk(skbuff_head_cache,
> > +                                                     gfp_mask,
> > BULK_ALLOC_SIZE,
> > +
> > nc->skb_cache);
> > +       }
> > +       if (likely(nc->skb_count)) {
> > +               skb = (struct sk_buff
> > *)nc->skb_cache[--nc->skb_count];
> > +       } else {
> > +               /* alloc bulk failed */  
> 
> So did you try doing a low latency socket test with this patch?  I'm
> guessing not as I am certain there is a negative impact from having to
> allocate 8 buffers, and then free back 7 every time you call the NAPI
> polling routine with just one buffer in the ring.

There is a very high probability that pulling out 8 objects, and
returning 7 objects, will have the same cost as an alloc and free of a
single object.  This is due to how the SLUB allocator's per-CPU
allocator works.

Notice I said "high probability".  I have adjusted the slab bulk APIs,
such that we can extend them to bulk return "up to" a given number of
objects.  Then the SLUB allocator can remove the "high probability"
part, and make sure only to return objects on the per-CPU slab-page,
thus guaranteeing no cmpxchg_double calls, and only
local_irq_disable/enable cycles, which is actually faster than the
normal fastpath local cmpxchg_double (non-atomic variant).
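
For context, the existing bulk calls this refers to are declared in
include/linux/slab.h; a minimal usage sketch (the size 8 and the
leftover handling are only examples):

	void *objs[8];
	int n;

	/* grab up to 8 sk_buff heads in one call; returns the number
	 * of objects actually placed in the array (0 on failure)
	 */
	n = kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC, 8, objs);

	/* ... consume some of them ... */

	/* hand any leftovers back to the slab in one call */
	if (n)
		kmem_cache_free_bulk(skbuff_head_cache, n, objs);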

 
> >                 skb_free_frag(data);
> >                 return NULL;
> >         }
> > +       skb = ___build_skb(data, len, skb);
> >
> >         /* use OR instead of assignment to avoid clearing of bits
> > in mask */ if (nc->page.pfmemalloc)
> >  


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context
  2016-05-09 16:33   ` Alexander Duyck
@ 2016-05-09 20:03     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 20:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz,
	Eugenia Emantayev, brouer

On Mon, 9 May 2016 09:33:25 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > It have always been required to call napi_alloc_skb from NAPI/softirq
> > context, which implies running with local_bh_disable'ed.  Thus, this
> > code path should already be well tested. But recent SKB bulk changes
> > introduced will make this more volatile and bugs more subtle, if this
> > is violated.  
> 
> It hasn't been required to be in the softirq routine, bottom halves
> could have been disabled.  I used the NAPI parameter to try and
> enforce a strong correlation between the two but I didn't make it a
> hard requirement.  It isn't until you try to do the recycling that it
> becomes more of a requirement because then it is tied into the bulk
> free.
> 
> The bulk free routine had a stronger requirement for NAPI because it
> needed to have the __kfree_skb_flush routine called to finally free
> the skbuffs that were still hanging off the percpu skbuff structure.
> 
> > To catch any driver violating this add a loud WARN_ON.
> >
> > Performance wise, I do worry about adding this runtime check code into
> > the hotpath, of this highly optimized function call.  I've
> > micro-benchmarked it with both IP-forwarding and local UDP delivery,
> > and didn't see any regressions.  It does adds extra code size (icache).
> >
> > add/remove: 0/0 grow/shrink: 1/0 up/down: 43/0 (43)
> > function                                     old     new   delta
> > __napi_alloc_skb                             461     504     +43
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  net/core/skbuff.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index e85f1065b263..99addbf66f2e 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -498,6 +498,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
> >         struct sk_buff *skb;
> >         void *data;
> >
> > +       /* Catch drivers violating, not having local BH disabled */
> > +       WARN_ON(!in_softirq());
> > +  
> 
> I'm pretty sure this still lets you get away with just disabling
> bottom halves.  I think what you would want is in_serving_softirq().

No, we want to allow this being called with just bottom halves disabled.
Thus, we have to use in_softirq() and not in_serving_softirq().
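
For reference, the distinction comes from include/linux/preempt.h
(paraphrased): local_bh_disable() raises softirq_count() by
SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET), so the SOFTIRQ_OFFSET bit
is only set while a softirq is actually being served:

	#define in_softirq()		(softirq_count())
	#define in_serving_softirq()	(softirq_count() & SOFTIRQ_OFFSET)

That is why in_softirq() also accepts callers that merely have bottom
halves disabled, which is the behaviour we want here.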


> >         len += NET_SKB_PAD + NET_IP_ALIGN;
> >
> >         if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> >  



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations
  2016-05-09 16:47   ` Alexander Duyck
@ 2016-05-09 20:05     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-09 20:05 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz,
	Eugenia Emantayev, brouer

On Mon, 9 May 2016 09:47:08 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > Activate the bulk alloc API, simply by changing mlx4 from using
> > netdev_alloc_skb() to using napi_alloc_skb().  
> 
> This patch is just enabling the napi_alloc_skb call.  You don't need
> to call out that it is enabling bulk allocations.  This patch could
> stand on its own without needing to make reference to the bulk
> allocation API because there is enough of a gain from napi_alloc_skb
> replacing netdev_alloc_skb.

Okay, I see you point.
 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c |    7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > index 8ef6875b6cf9..84fd6db5a176 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> > @@ -577,14 +577,15 @@ fail:
> >  static struct sk_buff *mlx4_en_rx_skb(struct mlx4_en_priv *priv,
> >                                       struct mlx4_en_rx_desc *rx_desc,
> >                                       struct mlx4_en_rx_alloc *frags,
> > -                                     unsigned int length)
> > +                                     unsigned int length,
> > +                                     struct napi_struct *napi)  
> 
> Instead of passing the NAPI structure you could just pass the
> mlx4_en_cq pointer to be used by the NAPI alloc function.  In addition
> you might try adding the new parameter before length since that way
> the pointers are in one block followed by integers in a tapering
> length order.

Okay, will adjust.

> >  {
> >         struct sk_buff *skb;
> >         void *va;
> >         int used_frags;
> >         dma_addr_t dma;
> >
> > -       skb = netdev_alloc_skb(priv->dev, SMALL_PACKET_SIZE + NET_IP_ALIGN);
> > +       skb = napi_alloc_skb(napi, SMALL_PACKET_SIZE + NET_IP_ALIGN);  
> 
> The NET_IP_ALIGN is redundant as napi_alloc_skb already takes are of
> adding that and NET_SKB_PAD.

Thanks for catching this.
 
> >         if (!skb) {
> >                 en_dbg(RX_ERR, priv, "Failed allocating skb\n");
> >                 return NULL;
> > @@ -932,7 +933,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
> >                 }
> >
> >                 /* GRO not possible, complete processing here */
> > -               skb = mlx4_en_rx_skb(priv, rx_desc, frags, length);
> > +               skb = mlx4_en_rx_skb(priv, rx_desc, frags, length, &cq->napi);
> >                 if (!skb) {
> >                         ring->dropped++;
> >                         goto next;
> >  



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-09 19:59     ` Jesper Dangaard Brouer
@ 2016-05-09 20:46       ` Alexander Duyck
  2016-05-10  9:29         ` Jesper Dangaard Brouer
  2016-05-10 12:30         ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 20+ messages in thread
From: Alexander Duyck @ 2016-05-09 20:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev

On Mon, May 9, 2016 at 12:59 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 9 May 2016 09:20:41 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs
>> > free'ed in same softirq cycle.  SKBs are normally free'ed during TX
>> > completion, but most high speed drivers also cleanup TX ring during
>> > NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
>> > SKBs will be avail in the napi_alloc_cache->skb_cache.
>> >
>> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
>> > the potential overshooting unused SKBs needed to free'ed when NAPI
>> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
>> >
>> > Generalize ___build_skb() to allow passing it a preallocated SKB.
>> >
>> > I've previously demonstrated a 1% speedup for IPv4 forwarding, when
>> > used on the ixgbe driver.  If SKB alloc and free happens on different
>> > CPUs (like in RPS use-case) the performance benefit is expected to
>> > increase.
>>
>> Really I am not sure this patch series is worth the effort.  For
>> freeing buffers in Tx it was an obvious win.  But adding all this
>> complexity for a 1% gain just doesn't seen worth the effort.
>
> I still think it is worth it, because: 1) it enables recycling, which
> is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA
> completion cleanup, and RX can re-use these), and 2) because bulk alloc
> and bulk free gets "coupled" (mostly relevant when doing cross CPU).

I disagree.  While there will be some ACKs it will likely be
significantly less than the number of packets received, especially
when you take GRO into account.

>
>> > All drivers using the napi_alloc_skb() call will benefit from
>> > this change automatically.
>>
>> Yes, but a 1% improvement is essentially just noise.  I'd say we need
>> to show a better gain or be able to more precisely show that this is a
>> definite gain and not just a test fluctuation resulting in the
>> improvement.  For all I know all of the gain could be due to a
>> function shifting around so that some loop is now 16 byte aligned.
>>
>> > Joint work with Alexander Duyck.
>> >
>> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
>>
>> The fact that this is still using my redhat.com address says volumes.
>> If I recall I think we were working on this code around 9 months ago
>> and had similar issues with it showing either negative performance or
>> no gain.  My advice then was to push the bulk free code and try to
>> find a way to fix the bulk allocation.  If we are still at this state
>> for bulk allocation then maybe we need to drop the bulk allocation and
>> start looking at other avenues to pursue.
>
> I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up
> with a benchmark that show a bigger gain, but is it time well spend?
>
> My quick benchmarking with mlx4 actually showed 2.7% improvement, for
> local UDP socket delivery. And only 0.6% on IP-forwarding.  If I find
> some bidirectional traffic then the benefit should be higher due to
> recycling kicking in.

Is that with or without patch 2 being a part of the set?  This patch
should be able to prove itself as a stand-alone patch.  I'd say go
ahead and submit patch 2 as a standalone and then we can start working
on collecting numbers on the mlx4.  Also are you running your test
with GRO disabled?  If so you might try enabling it on the mlx4
because that changes the skbuff allocation which would impact the
behavior for the device.

Try testing with TCP_RR instead and watch the CPU utilization.  I'm
suspecting allocating 8 and freeing 7 buffers for every 1 buffer
received will blow any gains right out of the water.  Also try it with
a mix of traffic.  So have one NIC doing TCP_RR while another is doing
a stream test.  You are stuffing 7 buffers onto a queue that we were
using to perform bulk freeing.  How much of a penalty do you take if
you are now limited on how many you can bulk free because somebody
left a stray 7 packets sitting on the queue?

Also please stop focusing on IP-forwarding.  There are some use cases
for it out there, but the vast majority of people don't care that much
about it.  It is just one data point out of many.  Try running a few
TCP based applications and see if you can notice any difference.  It
is okay to focus on routing numbers when you are doing work that will
only impact routing or has next to no likelihood of introducing a
regression, but the fact is the bulk alloc has a strong likelihood of
having effects on other parts of the network stack since you are
pulling in more buffers than you actually need.

> We did do as you recommended, and the bulk free code first. I'm
> just getting the last pieces pushed.  I didn't see any negative
> performance in my testing.

Not seeing any negative performance is not the same as seeing gains.
You are introducing complexity, there needs to be something to show
for it.  Otherwise you are just making the code harder to maintain for
no good reason.

> As you also know, tuning the SLUB system will give higher performance,
> easily.  In the future, I'm planning to get some auto-tuning into the
> SLUB allocator.  I've already discussed this with Christiph Lameter, at
> MM-summit, see presentation[1] slides 4 and 5.

We aren't discussing tuning parameters.  We are discussing this patch.
If you want to argue that with certain tuning parameters this shows
more performance then bring the numbers, but don't try to bias things.
If you have to tune the system in some way that nobody will, there is
probably no point in submitting the patch because nobody will use it
that way.

You also didn't test for low latency.  As I said this is coming at a
high expense to other use cases, but your use case only shows a 1%
improvement.  In my opinion it is not worth adding the complexity if
we cannot show much of a gain.  If you want to win me over we need to
be able to justify the complexity.  Right now I am not seeing it.  In
addition patch 3 illustrates the added complexity you are bringing to
all this because you are having to add limitations to napi_alloc_skb
because you are trying to make it work with recycling which
historically has never been a win because there are too many traffic
patterns where recycling cannot occur.

> [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp
>
>> > ---
>> >  net/core/skbuff.c |   71
>> > ++++++++++++++++++++++++++++++++++------------------- 1 file
>> > changed, 45 insertions(+), 26 deletions(-)
>> >
>> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> > index 5586be93632f..e85f1065b263 100644
>> > --- a/net/core/skbuff.c
>> > +++ b/net/core/skbuff.c
>> > @@ -281,32 +281,14 @@ nodata:
>> >  }
>> >  EXPORT_SYMBOL(__alloc_skb);
>> >
>> > -/**
>> > - * __build_skb - build a network buffer
>> > - * @data: data buffer provided by caller
>> > - * @frag_size: size of data, or 0 if head was kmalloced
>> > - *
>> > - * Allocate a new &sk_buff. Caller provides space holding head and
>> > - * skb_shared_info. @data must have been allocated by kmalloc()
>> > only if
>> > - * @frag_size is 0, otherwise data should come from the page
>> > allocator
>> > - *  or vmalloc()
>> > - * The return is the new skb buffer.
>> > - * On a failure the return is %NULL, and @data is not freed.
>> > - * Notes :
>> > - *  Before IO, driver allocates only data buffer where NIC put
>> > incoming frame
>> > - *  Driver should add room at head (NET_SKB_PAD) and
>> > - *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
>> > - *  After IO, driver calls build_skb(), to allocate sk_buff and
>> > populate it
>> > - *  before giving packet to stack.
>> > - *  RX rings only contains data buffers, not full skbs.
>> > - */
>> > -struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>> > +/* Allows skb being (pre)allocated by caller */
>> > +static inline
>> > +struct sk_buff *___build_skb(void *data, unsigned int frag_size,
>> > +                            struct sk_buff *skb)
>> >  {
>> >         struct skb_shared_info *shinfo;
>> > -       struct sk_buff *skb;
>> >         unsigned int size = frag_size ? : ksize(data);
>> >
>> > -       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
>> >         if (!skb)
>> >                 return NULL;
>> >
>> > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data,
>> > unsigned int frag_size) return skb;
>> >  }
>> >
>> > +/**
>> > + * __build_skb - build a network buffer
>> > + * @data: data buffer provided by caller
>> > + * @frag_size: size of data, or 0 if head was kmalloced
>> > + *
>> > + * Allocate a new &sk_buff. Caller provides space holding head and
>> > + * skb_shared_info. @data must have been allocated by kmalloc()
>> > only if
>> > + * @frag_size is 0, otherwise data should come from the page
>> > allocator
>> > + *  or vmalloc()
>> > + * The return is the new skb buffer.
>> > + * On a failure the return is %NULL, and @data is not freed.
>> > + * Notes :
>> > + *  Before IO, driver allocates only data buffer where NIC put
>> > incoming frame
>> > + *  Driver should add room at head (NET_SKB_PAD) and
>> > + *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
>> > + *  After IO, driver calls build_skb(), to allocate sk_buff and
>> > populate it
>> > + *  before giving packet to stack.
>> > + *  RX rings only contains data buffers, not full skbs.
>> > + */
>> > +struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>> > +{
>> > +       struct sk_buff *skb;
>> > +       unsigned int size = frag_size ? : ksize(data);
>> > +
>> > +       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
>> > +       return ___build_skb(data, size, skb);
>> > +}
>> > +
>> >  /* build_skb() is wrapper over __build_skb(), that specifically
>> >   * takes care of skb->head and skb->pfmemalloc
>> >   * This means that if @frag_size is not zero, then @data must be
>> > backed
>>
>> If we can avoid having to break up build_skb into more functions that
>> would be preferred.  I realize I probably wrote this code in order to
>> enable the bulk allocation approach, but really I don't want to add
>> more complexity unless we can show a strong gain which we haven't been
>> able to demonstrate.
>
> You do notice that ___build_skb gets inlined, thus there is no
> performance loss.  And this change was explicitly requested last time
> this patch was reviewed.  And I think this variant is much less
> intrusive.

I didn't notice the inline, so it means we are bloating the code a bit
instead of adding more function calls.  I guess that is a better trade
off, but still not the most desirable thing to do.

>
>> > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct
>> > napi_struct *napi, unsigned int len,
>> >
>> >         len += NET_SKB_PAD + NET_IP_ALIGN;
>> >
>> > -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>> > -           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>> > +       if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>> > +                    (gfp_mask & (__GFP_DIRECT_RECLAIM |
>> > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX,
>> > NUMA_NO_NODE); if (!skb)
>> >                         goto skb_fail;
>>
>> Does this unlikely really make a difference?  If so you might want to
>> move it into a patch all on its own.
>
> I can place it in a patch of its own.  I just noticed the compiler
> laid out the code wrongly compared to the normal use-case.

What version of gcc are you seeing this with?  From what I can tell on
the 4.8.5 version that came with my RHEL 7.2 build I am seeing the
proper setup with 2 compares and 2 jumps to the remote code.

>> > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct
>> > napi_struct *napi, unsigned int len, if (unlikely(!data))
>> >                 return NULL;
>> >
>> > -       skb = __build_skb(data, len);
>> > -       if (unlikely(!skb)) {
>> > +#define BULK_ALLOC_SIZE 8
>> > +       if (!nc->skb_count) {
>> > +               nc->skb_count =
>> > kmem_cache_alloc_bulk(skbuff_head_cache,
>> > +                                                     gfp_mask,
>> > BULK_ALLOC_SIZE,
>> > +
>> > nc->skb_cache);
>> > +       }
>> > +       if (likely(nc->skb_count)) {
>> > +               skb = (struct sk_buff
>> > *)nc->skb_cache[--nc->skb_count];
>> > +       } else {
>> > +               /* alloc bulk failed */
>>
>> So did you try doing a low latency socket test with this patch?  I'm
>> guessing not as I am certain there is a negative impact from having to
>> allocate 8 buffers, and then free back 7 every time you call the NAPI
>> polling routine with just one buffer in the ring.
>
> There is a very high probability that pulling out 8 objects, and
> returning 7 objects, will have the same cost as alloc and free of a
> single object.  This is due to how the SLUB allocator's per CPU
> allocator works.

I very much doubt that.  You are still having to move around many more
objects than you really should be, and dragging in 8 cache lines as you
have to walk the percpu freelist to pull the entries and then push
them back in.

> Notice I said "high probability".  I have adjusted the slab bulk APIs,
> such that we can extend them to bulk return "upto" a given number of
> objects.  Then the SLUB allocator can remove the "high probability"
> part, and make sure only to return objects on the per CPU slab-page,
> thus guaranteeing no cmpxchg_double calls, and only
> local_irq_disable/enable cycles, which is actually faster than the
> normal fastpath local cmpxchg_double (non-atomic variant).

On your system.  There is no guarantee that is the case across all
CPUs supported by the x86 architecture.

Also I can think of scenarios where your code would become very
expensive.  The fact is you are pushing an additional 7 objects into
the L1 cache each time you do one of these bulk allocations.  The
question you have to ask yourself is what you are evicting when
you do it.  There are scenarios where you could be evicting critical
data back out to L2, which will negatively impact your performance.  A
1% gain 99% of the time is kind of useless when there is a risk of a
significant performance hit for that remaining 1% of cases.
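
For readers following the thread, the refill pattern being debated boils
down to roughly the following (a sketch distilled from the patch hunk
quoted above; the helper name napi_skb_cache_get() is illustrative, not
an existing kernel symbol):

  #define BULK_ALLOC_SIZE 8

  /* Refill the per-softirq SKB head cache from the slab bulk API when
   * it is empty, then hand out one head.  Heads freed earlier in the
   * same softirq cycle already sit in nc->skb_cache and are reused
   * before any new bulk allocation happens.
   */
  static struct sk_buff *napi_skb_cache_get(struct napi_alloc_cache *nc,
                                            gfp_t gfp_mask)
  {
          if (!nc->skb_count)
                  nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
                                                        gfp_mask,
                                                        BULK_ALLOC_SIZE,
                                                        nc->skb_cache);
          if (unlikely(!nc->skb_count))
                  return NULL;    /* caller falls back to __alloc_skb() */

          return (struct sk_buff *)nc->skb_cache[--nc->skb_count];
  }

The objection above concerns the other half of the cycle: when only one
packet arrives in a NAPI poll, the seven unused heads have to be pushed
back to the slab afterwards, so the bulk alloc/free cost is paid for a
single SKB.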

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-09 20:46       ` Alexander Duyck
@ 2016-05-10  9:29         ` Jesper Dangaard Brouer
  2016-05-10 12:30         ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-10  9:29 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz,
	Eugenia Emantayev, brouer

On Mon, 9 May 2016 13:46:32 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> > As you also know, tuning the SLUB system will give higher performance,
> > easily.  In the future, I'm planning to get some auto-tuning into the
> > SLUB allocator.  I've already discussed this with Christoph Lameter, at
> > MM-summit, see presentation[1] slides 4 and 5.  
> 
> We aren't discussing tuning parameters.  We are discussing this patch.
> If you want to argue that with certain tuning parameters this shows
> more performance, then bring the numbers, but don't try to bias things.
> If you have to tune the system in some way that nobody else will, there is
> probably no point in submitting the patch, because nobody will use it
> that way.

I think you missed the point. I didn't do parameter tuning for my
benchmarks. I hate tuning parameters; they are a huge problem for users
of the kernel.
My point is that I want to implement auto-tuning in the SLUB
allocator.  This network stack use-case will just be one use-case
where the auto-tuning needs to show improvements.  It is not that
complicated. FreeBSD has this kind of auto-tuning to the workload in
its slab implementation.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-09 20:46       ` Alexander Duyck
  2016-05-10  9:29         ` Jesper Dangaard Brouer
@ 2016-05-10 12:30         ` Jesper Dangaard Brouer
  2016-05-10 13:48           ` Eric Dumazet
  2016-05-10 17:46           ` Alexander Duyck
  1 sibling, 2 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-10 12:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz,
	Eugenia Emantayev, brouer


On Mon, 9 May 2016 13:46:32 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Try testing with TCP_RR instead and watch the CPU utilization.  I'm
> suspecting allocating 8 and freeing 7 buffers for every 1 buffer
> received will blow any gains right out of the water.  Also try it with
> a mix of traffic.  So have one NIC doing TCP_RR while another is doing
> a stream test.  You are stuffing 7 buffers onto a queue that we were
> using to perform bulk freeing.  How much of a penalty do you take if
> you are now limited on how many you can bulk free because somebody
> left a stray 7 packets sitting on the queue?

Testing with TCP_RR is not a very "clean" network test. One has to be
very careful about what is actually being tested: is it the server or the
client which is the bottleneck?  And most of all, this is a test of the
CPU/process scheduler.

We can avoid the scheduler problem by enabling busy_poll/busy_read.

I guess you want to see the "scheduler test" first.  Default setting of
disabled busy poll on both client and server:

Disable busy poll on both client and server, Not patched:

 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
 
 16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
 16384  87380 

Disable busy poll on both client and server, patched:

 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
 
 16384  87380  1       1      60.00   78517.32  3.06   2.84   3.118   8.677  
 16384  87380 

I will not call this an improvement... the results are basically the same.



Next step: enabling busy poll on the server.  The server is likely the
bottleneck, given its CPU is slower than the client's.  Context switches
on the server are too high, 156K/sec; after enabling busy poll they are
reduced to 620/sec. Note the client is doing around 233k/sec context
switches (fairly impressive).

Enabling busy poll on the server:
 sysctl -w net.core.busy_poll=50
 sysctl -w net.core.busy_read=50


Enabled busy poll only on server, Not patched:
 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

 16384  87380  1       1      60.00   112480.72  5.90   4.68   4.194   9.984  
 16384  87380 

Enabled busy poll only on server, patched:

 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

 16384  87380  1       1      60.00   110152.34  5.84   4.60   4.242   10.014 
 16384  87380 

Numbers are too close, for any conclusions.

Running a second run, on Not-patched kernel:
 Enabled busy poll only on server, Not patched:
 [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

 16384  87380  1       1      60.00   101554.90  4.12   4.31   3.245   10.185 
 16384  87380 

Thus, variation between runs is bigger than any improvement/regression,
so no performance conclusions can be drawn from this change.


Let's move beyond testing the CPU/process scheduler by enabling
busy-polling on both client and server:
 (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50)

Enable busy poll on both client and server, Not patched:

$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   137987.86  13.18  4.77   7.643   8.298  
16384  87380 


Enable busy poll on both client and server, patched:

$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   147324.38  13.76  4.76   7.474   7.747  
16384  87380 

I'm a little bit surprised to see such a large improvement here: 6.76%.
 147324/137987*100 = 106.76

I remain skeptical about this measurement, as the improvement
should not be this high, even if recycling is happening.

Perf record does show fewer calls to __slab_free(), indicating better
interaction with SLUB, and perhaps recycling working.  But this is
only a perf-report change from 0.37% to 0.33%.

More testing shows the not-patched kernel fluctuates between 125k-143k/sec,
and the patched kernel fluctuates between 131k-152k/sec. The ranges are too
high to say anything conclusive.  It seems to be timing dependent, as
starting and stopping the test with -D 1 shows a rate variation within
2k/sec, but the rate itself can vary within the range stated.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-10 12:30         ` Jesper Dangaard Brouer
@ 2016-05-10 13:48           ` Eric Dumazet
  2016-05-10 14:48             ` Jesper Dangaard Brouer
  2016-05-10 17:46           ` Alexander Duyck
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2016-05-10 13:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Netdev, David S. Miller, Saeed Mahameed,
	Or Gerlitz, Eugenia Emantayev

On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote:

> Disable busy poll on both client and server, Not patched:
> 
>  $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
>  ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>  Local /Remote
>  Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>  Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>  bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>  
>  16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
>  16384  87380 

Tell us more about the -T6,6

For example how many TX/RX queues you have on the NIC, and which cpus
service interrupts.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-10 13:48           ` Eric Dumazet
@ 2016-05-10 14:48             ` Jesper Dangaard Brouer
  2016-05-10 15:44               ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-10 14:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Netdev, David S. Miller, Saeed Mahameed,
	Or Gerlitz, Eugenia Emantayev, brouer

On Tue, 10 May 2016 06:48:54 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote:
> 
> > Disable busy poll on both client and server, Not patched:
> > 
> >  $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> >  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 
> >  ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> >  Local /Remote
> >  Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> >  Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> >  bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> >  
> >  16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
> >  16384  87380   
> 
> Tell us more about the -T6,6
> 
> For example how many TX/RX queues you have on the NIC, and which cpus
> service interrupts.

The -T6,6 option:
 -T lcpu,rcpu      Request netperf/netserver be bound to local/remote cpu

I use the option to get more stable results.  If I don't pin/bind the
CPU netperf/netserver is running on, then the CPU scheduler will migrate
the processes around.  This gives unpredictable results, worst of all for
the busy_poll tests, especially if the RX softirq runs on the same CPU
(also true if it runs on a HyperThread sibling).

Netperf client (8 cores i7-4790K CPU @ 4.00GHz)  RX:8 and TX:8 queues.
Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues.
Driver mlx4.
Disabled GRO to hit code path I changed in patch 2.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-10 14:48             ` Jesper Dangaard Brouer
@ 2016-05-10 15:44               ` Eric Dumazet
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2016-05-10 15:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Netdev, David S. Miller, Saeed Mahameed,
	Or Gerlitz, Eugenia Emantayev

On Tue, 2016-05-10 at 16:48 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 10 May 2016 06:48:54 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote:
> > 
> > > Disable busy poll on both client and server, Not patched:
> > > 
> > >  $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> > >  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 
> > >  ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> > >  Local /Remote
> > >  Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> > >  Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> > >  bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> > >  
> > >  16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
> > >  16384  87380   
> > 
> > Tell us more about the -T6,6
> > 
> > For example how many TX/RX queues you have on the NIC, and which cpus
> > service interrupts.
> 
> The -T6,6 option:
>  -T lcpu,rcpu      Request netperf/netserver be bound to local/remote cpu
> 

Sure, I know the -T option in netperf.

> I use the option to get more stable results.  If I don't pin/bind the
> CPU netperf/netserver is running on, then the CPU scheduler will migrate
> the processes around.  This gives unpredictable results, worst of all for
> the busy_poll tests, especially if the RX softirq runs on the same CPU
> (also true if it runs on a HyperThread sibling).
> 
> Netperf client (8 cores i7-4790K CPU @ 4.00GHz)  RX:8 and TX:8 queues.
> Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues.
> Driver mlx4.
> Disabled GRO to hit code path I changed in patch 2.

But are you using stuff like aRFS, RPS, RFS?

Each netperf run lands on different cpus, and we know that results can
have a 25% variability  because of that, even more on 2-node systems.

By forcing -T6,6 you force the netperf/netserver cpu, not the RX queues.

A nice effort would be to be able to choose the source in the 4-tuple at
connect() time, so that we know the Toeplitz hash will select the
'correct' RX queue.
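
A minimal userspace sketch of that idea, for reference: fix the client
side of the 4-tuple by bind()ing to a known local port before connect(),
so the RSS/Toeplitz hash input is identical from run to run.  The helper
name and the choice of port are purely illustrative:

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdint.h>
  #include <sys/socket.h>
  #include <unistd.h>

  /* Connect to dst_ip:dst_port from a fixed local port, so the 4-tuple
   * (and therefore the RX queue selected by the NIC hash) is stable.
   */
  static int connect_fixed_tuple(const char *dst_ip, uint16_t dst_port,
                                 uint16_t src_port)
  {
          struct sockaddr_in src = { 0 }, dst = { 0 };
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          if (fd < 0)
                  return -1;

          src.sin_family      = AF_INET;
          src.sin_port        = htons(src_port);  /* e.g. 40000 */
          src.sin_addr.s_addr = htonl(INADDR_ANY);

          dst.sin_family = AF_INET;
          dst.sin_port   = htons(dst_port);
          inet_pton(AF_INET, dst_ip, &dst.sin_addr);

          if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
              connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                  close(fd);
                  return -1;
          }
          return fd;
  }

This doesn't pin the RX queue by itself, but it makes the hash input --
and hence the queue picked by the indirection table -- repeatable across
test runs.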

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context
  2016-05-10 12:30         ` Jesper Dangaard Brouer
  2016-05-10 13:48           ` Eric Dumazet
@ 2016-05-10 17:46           ` Alexander Duyck
  1 sibling, 0 replies; 20+ messages in thread
From: Alexander Duyck @ 2016-05-10 17:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, David S. Miller, Saeed Mahameed, Or Gerlitz, Eugenia Emantayev

On 05/10/2016 05:30 AM, Jesper Dangaard Brouer wrote:
>
> On Mon, 9 May 2016 13:46:32 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Try testing with TCP_RR instead and watch the CPU utilization.  I'm
>> suspecting allocating 8 and freeing 7 buffers for every 1 buffer
>> received will blow any gains right out of the water.  Also try it with
>> a mix of traffic.  So have one NIC doing TCP_RR while another is doing
>> a stream test.  You are stuffing 7 buffers onto a queue that we were
>> using to perform bulk freeing.  How much of a penalty do you take if
>> you are now limited on how many you can bulk free because somebody
>> left a stray 7 packets sitting on the queue?
>
> Testing with TCP_RR is not a very "clean" network test. One has to be
> very careful about what is actually being tested: is it the server or the
> client which is the bottleneck?  And most of all, this is a test of the
> CPU/process scheduler.
>
> We can avoid the scheduler problem by enabling busy_poll/busy_read.
>
> I guess you want to see the "scheduler test" first.  Default setting of
> disabled busy poll on both client and server:
>
> Disable busy poll on both client and server, Not patched:

Let's also define what "Not patched" means.  I only want the bulk 
allocation patch tested.  The other patches you have for the mlx4 and 
the WARN_ON are just noise.  If possible, it would be best to focus on 
just the one patch that is high risk.

>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265
>   16384  87380
>
> Disable busy poll on both client and server, patched:
>
>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   78517.32  3.06   2.84   3.118   8.677
>   16384  87380
>
> I will not call this an improvement... the results are basically the same.

So I have a few suggestions.

1.  Either switch to ixgbe and use ATR/Flow Director, or look at setting 
up your test so that the RSS key and indirection table are the same for 
each test, and use the "-- -P" option in netperf to force the use of the 
same 5-tuple for each test.

2.  Cut down on the noise.  Specifically rebuild your kernel with as few 
options enabled as possible.  If you don't need it drop it out so that 
we can identify exactly how much gain there is to be had from your 
patches.  Also you should increase your test to use multiple CPUs, or 
cut down on the number of CPUs so that you aren't cutting down the 
CPU utilization so much.  If you have to you might even look at doing 
something like using SAR -P 6 in order to be able to monitor the CPU 
utilization of just CPU 6 on your test system.

The goal is that your tests should be repeatable even if you had to move 
to another system or another NIC, so I would recommend finding a way to 
rule out as much of the fluctuation as you can, so that your numbers are 
as reliable as possible.

3.  You might even try a pktgen Rx and drop test to see what the 
difference is in ns/packet for your allocation routine.  Assuming you 
can clear out the variability that would be a useful datapoint as you 
could then also collect the perf data to show which functions have 
reduced their total CPU time.

> Next step: enabling busy poll on the server.  The server is likely the
> bottleneck, given its CPU is slower than the client's.  Context switches
> on the server are too high, 156K/sec; after enabling busy poll they are
> reduced to 620/sec. Note the client is doing around 233k/sec context
> switches (fairly impressive).
>
> Enabling busy poll on the server:
>   sysctl -w net.core.busy_poll=50
>   sysctl -w net.core.busy_read=50
>
>
> Enabled busy poll only on server, Not patched:
>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   112480.72  5.90   4.68   4.194   9.984
>   16384  87380
>
> Enabled busy poll only on server, patched:
>
>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   110152.34  5.84   4.60   4.242   10.014
>   16384  87380
>
> Numbers are too close, for any conclusions.

Agreed.

> Running a second run, on Not-patched kernel:
>   Enabled busy poll only on server, Not patched:
>   [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   101554.90  4.12   4.31   3.245   10.185
>   16384  87380
>
> Thus, variation between runs is bigger than any improvement/regression,
> so no performance conclusions can be drawn from this change.

Like Eric mentioned, this is likely due to the fact that you are 
bouncing between Rx queues.

> Let's move beyond testing the CPU/process scheduler by enabling
> busy-polling on both client and server:
>   (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50)
>
> Enable busy poll on both client and server, Not patched:
>
> $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
> 16384  87380  1       1      60.00   137987.86  13.18  4.77   7.643   8.298
> 16384  87380
>
>
> Enable busy poll on both client and server, patched:
>
> $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
> 16384  87380  1       1      60.00   147324.38  13.76  4.76   7.474   7.747
> 16384  87380
>
> I'm a little bit surprised to see such a large improvement here: 6.76%.
>   147324/137987*100 = 106.76

That is a difference of 500ns per packet.  I am highly doubtful we are 
seeing that much of an improvement as well.  Odds are you were doing 
something cross-node in  your first run or something along those lines 
since the difference is too large to be attributed to the bulk 
allocation change.

> I remain skeptical about this measurement, as the improvement
> should not be this high, even if recycling is happening.
>
> Perf record does show fewer calls to __slab_free(), indicating better
> interaction with SLUB, and perhaps recycling working.  But this is
> only a perf-report change from 0.37% to 0.33%.
>
> More testing shows the not-patched kernel fluctuates between 125k-143k/sec,
> and the patched kernel fluctuates between 131k-152k/sec. The ranges are too
> high to say anything conclusive.  It seems to be timing dependent, as
> starting and stopping the test with -D 1 shows a rate variation within
> 2k/sec, but the rate itself can vary within the range stated.

I am pretty sure we are guaranteed to see a performance regression for 
socket based workloads.  You are hoping for recycling to occur, but 
recycling almost always fails to show any gains and just ends up 
introducing the possibility of regressions, as something always gets 
overlooked.  On top of that, you aren't guaranteed that a frame from 
Tx clean-up is going to be warm in the cache, so you may end up 
taking a few cache line misses as well.

I haven't seen enough change in these patches to justify having them 
submitted to the kernel.  A 1% improvement in one specific test case is 
kind of a vague reason to do something that very likely introduces 
regressions in many other cases.

- Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack
  2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer
@ 2016-05-20  8:14 ` Jesper Dangaard Brouer
  3 siblings, 0 replies; 20+ messages in thread
From: Jesper Dangaard Brouer @ 2016-05-20  8:14 UTC (permalink / raw)
  To: netdev, David S. Miller
  Cc: saeedm, gerlitz.or, eugenia, Alexander Duyck, brouer

On Mon, 09 May 2016 15:44:24 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> This patchset enables use of the slab/kmem_cache bulk alloc API, and
> completes the use the slab/kmem_cache bulking API in the network stack.
> 
> I've not included the patches that introduce a SKB bulk hint, which
> would beneficial for drivers like mlx5.  Lets see if we can agree on
> this patchset first.

Conclusion: Guess we could not agree on this patchset.

The main problem seems to be that the patch always bulk allocates 8
SKBs, without knowing if we actually need them.  My earlier patchset
also included a "hint" interface, as the mlx5 driver knows, after
processing the RX queue, how many SKBs it needs, and thus it can request
to bulk alloc the needed amount. (The only problem with mlx5 is that
preallocating SKBs into its RX ring is a broken scheme).

A better scheme would be if the driver, on RX, knew how many packets
are available in the RX queue.  Then we could bulk alloc the needed
amount of SKBs.  Thus, we can circle back to this once the driver can
provide this information. (This goes nicely hand-in-hand with my idea
of pulling available RX packet-pages out of the RX ring and starting
prefetches to hide the cache-misses.)
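
A purely hypothetical sketch of what such a hint-driven refill could
look like (napi_skb_cache_refill() and the hint parameter do not exist
in the kernel; this only illustrates topping up the existing skb_cache
with what the driver says it needs, instead of a fixed bulk of 8):

  /* Hypothetical: the driver reports how many packets it found in the
   * RX ring, and the SKB head cache is topped up with at most that
   * many entries from the slab bulk API.
   */
  static void napi_skb_cache_refill(struct napi_alloc_cache *nc,
                                    unsigned int hint, gfp_t gfp_mask)
  {
          unsigned int want = min_t(unsigned int, hint, NAPI_SKB_CACHE_SIZE);

          if (nc->skb_count >= want)
                  return;

          nc->skb_count += kmem_cache_alloc_bulk(skbuff_head_cache, gfp_mask,
                                                 want - nc->skb_count,
                                                 &nc->skb_cache[nc->skb_count]);
  }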

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-05-20  8:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-09 13:44 [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
2016-05-09 13:44 ` [net-next PATCH V1 1/3] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
2016-05-09 16:20   ` Alexander Duyck
2016-05-09 19:59     ` Jesper Dangaard Brouer
2016-05-09 20:46       ` Alexander Duyck
2016-05-10  9:29         ` Jesper Dangaard Brouer
2016-05-10 12:30         ` Jesper Dangaard Brouer
2016-05-10 13:48           ` Eric Dumazet
2016-05-10 14:48             ` Jesper Dangaard Brouer
2016-05-10 15:44               ` Eric Dumazet
2016-05-10 17:46           ` Alexander Duyck
2016-05-09 13:44 ` [net-next PATCH V1 2/3] mlx4: use napi_alloc_skb API to get SKB bulk allocations Jesper Dangaard Brouer
2016-05-09 16:47   ` Alexander Duyck
2016-05-09 20:05     ` Jesper Dangaard Brouer
2016-05-09 13:44 ` [net-next PATCH V1 3/3] net: warn on napi_alloc_skb being called in wrong context Jesper Dangaard Brouer
2016-05-09 13:53   ` Sergei Shtylyov
2016-05-09 13:53   ` Sergei Shtylyov
2016-05-09 16:33   ` Alexander Duyck
2016-05-09 20:03     ` Jesper Dangaard Brouer
2016-05-20  8:14 ` [net-next PATCH V1 0/3] net: enable use of kmem_cache_alloc_bulk in network stack Jesper Dangaard Brouer
