* [PATCH 0/4] net: mitigating kmem_cache slowpath for network stack in NAPI context
From: Jesper Dangaard Brouer @ 2015-10-23 12:46 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Duyck, linux-mm, Jesper Dangaard Brouer, Joonsoo Kim,
	Andrew Morton, Christoph Lameter

It has been a long road. Back in July 2014 I realized that the network
stack was hitting the kmem_cache/SLUB slowpath when freeing SKBs, but
I had no solution.  In Dec 2014 I implemented a solution called
qmempool[1], which showed it was possible to improve this, but it got
rejected for being a cache on top of kmem_cache.  In July 2015
improvements to kmem_cache were proposed, and recently (Oct 2015) my
kmem_cache (SLAB+SLUB) patches for bulk alloc and free were accepted
into the AKPM quilt tree.

This patchset is the first real use-case for kmem_cache bulk alloc and
free.  It is joint work with Alexander Duyck, done while he was still
at Red Hat.

Using bulk free to avoid the SLUB slowpath shows the full potential.
In this patchset it is realized in NAPI/softirq context:
 1. During DMA TX completion, bulk free is optimal and does not
    introduce any added latency.
 2. Bulk free of SKBs whose free was delayed due to IRQ context, via
    the net_tx_action softirq completion queue.

Using bulk alloc shows a minor improvement for SLUB (+0.9%), but a
very slight slowdown for SLAB (-0.1%).

[1] http://thread.gmane.org/gmane.linux.network/342347/focus=126138
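
For reference, a minimal sketch of how the kmem_cache bulk API is used
throughout this series (SIZE, count and the objs[] array are
illustrative placeholders, not taken from the patches; see patch 1/4
and 4/4 for the real call sites):

  void *objs[SIZE];	/* SIZE and count are placeholders */

  /* Bulk alloc: fill objs[] with up to SIZE objects from the cache;
   * the call sites in this series treat a zero return as failure.
   */
  if (kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC, SIZE, objs)) {
          /* objs[] now holds pointers usable as struct sk_buff heads */
  }

  /* Bulk free: hand 'count' objects back to the cache in one call,
   * instead of taking the per-object free slowpath.
   */
  kmem_cache_free_bulk(skbuff_head_cache, count, objs);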


This patchset is based on net-next (commit 26440c835), BUT I've
applied several patches from AKPM's MM-tree.

I cherry-picked some commits from the MMOTM tree, branch/tag
mmotm-2015-10-06-16-30, from
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
(the below commit IDs are obviously not stable).

Picked up my own MM changes:
 b0aa3e95ce82 ("slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG")
 114a2b37847c ("slab: implement bulking for SLAB allocator")
 606397476e8b ("slub: support for bulk free with SLUB freelists")
 ee29cd6a570c ("slub: optimize bulk slowpath free by detached freelist")
 491c6e0ca89d ("slub-optimize-bulk-slowpath-free-by-detached-freelist-fix")

Picked up slab.h changes:
 d9a47e0b776b ("compiler.h: add support for function attribute assume_aligned")
 1c3a5c789b4f ("slab.h: sprinkle __assume_aligned attributes")

I wanted Kirill A. Shutemov's changes, as they change virt_to_head_page();
I had to apply the patches manually from http://ozlabs.org/~akpm/mmotm/
(stamp-2015-10-20-16-33), as AKPM made several small fixes.

---

Jesper Dangaard Brouer (4):
      net: bulk free infrastructure for NAPI context, use napi_consume_skb
      net: bulk free SKBs that were delay free'ed due to IRQ context
      ixgbe: bulk free SKBs during TX completion cleanup cycle
      net: bulk alloc and reuse of SKBs in NAPI context


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 +
 include/linux/skbuff.h                        |    4 +
 net/core/dev.c                                |    9 ++
 net/core/skbuff.c                             |  122 +++++++++++++++++++++++--
 4 files changed, 127 insertions(+), 14 deletions(-)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* [PATCH 1/4] net: bulk free infrastructure for NAPI context, use napi_consume_skb
From: Jesper Dangaard Brouer @ 2015-10-23 12:46 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Duyck, linux-mm, Jesper Dangaard Brouer, Joonsoo Kim,
	Andrew Morton, Christoph Lameter

Discovered that the network stack was hitting the kmem_cache/SLUB
slowpath when freeing SKBs.  Doing bulk free with kmem_cache_free_bulk()
can speed up this slowpath.

NAPI context is a bit special; let's take advantage of that for bulk
freeing of SKBs.

In NAPI context we are running in softirq, which gives us certain
protection.  A softirq can run on several CPUs at once, BUT the
important part is that a softirq will never preempt another softirq
running on the same CPU.  This allows us to safely access per-CPU
variables in softirq context.

Extend napi_alloc_cache (which before only contained a page_frag_cache)
to be a struct with a small array-based stack for holding SKBs.
Introduce an SKB defer and flush API for accessing this.

Introduce napi_consume_skb() as a replacement for e.g. dev_consume_skb_any()
when running in NAPI context.  A small trick to detect whether we are
called from netpoll is to check if budget is 0; in that case, we need
to invoke dev_consume_skb_irq().
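
A minimal driver-side usage sketch (my_ring, my_clean_tx() and
my_next_completed_skb() are hypothetical placeholders, not part of this
patch; the real conversion is done for ixgbe later in the series):

  /* hypothetical TX-completion cleanup, called from the NAPI poll routine */
  static void my_clean_tx(struct my_ring *ring, int budget)
  {
          struct sk_buff *skb;

          while ((skb = my_next_completed_skb(ring)) != NULL) {
                  /* budget == 0 means netpoll: napi_consume_skb()
                   * falls back to dev_consume_skb_irq().  Otherwise
                   * the SKB head is deferred to the per-CPU skb_cache
                   * and bulk free'ed when the cache fills or when
                   * net_rx_action() calls __kfree_skb_flush().
                   */
                  napi_consume_skb(skb, budget);
          }
  }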

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 include/linux/skbuff.h |    3 ++
 net/core/dev.c         |    1 +
 net/core/skbuff.c      |   83 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4398411236f1..a3dec82e0e2c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2308,6 +2308,9 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 {
 	return __napi_alloc_skb(napi, length, GFP_ATOMIC);
 }
+void napi_consume_skb(struct sk_buff *skb, int budget);
+
+void __kfree_skb_flush(void);
 
 /**
  * __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index 1225b4be8ed6..204059f67154 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4841,6 +4841,7 @@ static void net_rx_action(struct softirq_action *h)
 		}
 	}
 
+	__kfree_skb_flush();
 	local_irq_disable();
 
 	list_splice_tail_init(&sd->poll_list, &list);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fab4599ba8b2..2682ac46d640 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -347,8 +347,16 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+#define NAPI_SKB_CACHE_SIZE	64
+
+struct napi_alloc_cache {
+	struct page_frag_cache page;
+	size_t skb_count;
+	void *skb_cache[NAPI_SKB_CACHE_SIZE];
+};
+
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
-static DEFINE_PER_CPU(struct page_frag_cache, napi_alloc_cache);
+static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
 static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
@@ -378,9 +386,9 @@ EXPORT_SYMBOL(netdev_alloc_frag);
 
 static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
-	return __alloc_page_frag(nc, fragsz, gfp_mask);
+	return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
 }
 
 void *napi_alloc_frag(unsigned int fragsz)
@@ -474,7 +482,7 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 	struct sk_buff *skb;
 	void *data;
 
@@ -494,7 +502,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = __alloc_page_frag(nc, len, gfp_mask);
+	data = __alloc_page_frag(&nc->page, len, gfp_mask);
 	if (unlikely(!data))
 		return NULL;
 
@@ -505,7 +513,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	}
 
 	/* use OR instead of assignment to avoid clearing of bits in mask */
-	if (nc->pfmemalloc)
+	if (nc->page.pfmemalloc)
 		skb->pfmemalloc = 1;
 	skb->head_frag = 1;
 
@@ -747,6 +755,69 @@ void consume_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(consume_skb);
 
+void __kfree_skb_flush(void)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+	/* flush skb_cache if containing objects */
+	if (nc->skb_count) {
+		kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count,
+				     nc->skb_cache);
+		nc->skb_count = 0;
+	}
+}
+
+static void __kfree_skb_defer(struct sk_buff *skb)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+	/* drop skb->head and call any destructors for packet */
+	skb_release_all(skb);
+
+	/* record skb to CPU local list */
+	nc->skb_cache[nc->skb_count++] = skb;
+
+#ifdef CONFIG_SLUB
+	/* SLUB writes into objects when freeing */
+	prefetchw(skb);
+#endif
+
+	/* flush skb_cache if it is filled */
+	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
+		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
+				     nc->skb_cache);
+		nc->skb_count = 0;
+	}
+}
+
+void napi_consume_skb(struct sk_buff *skb, int budget)
+{
+	if (unlikely(!skb))
+		return;
+
+	/* if budget is 0 assume netpoll w/ IRQs disabled */
+	if (unlikely(!budget)) {
+		dev_consume_skb_irq(skb);
+		return;
+	}
+
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+	/* if reaching here SKB is ready to free */
+	trace_consume_skb(skb);
+
+	/* if SKB is a clone, don't handle this case */
+	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	__kfree_skb_defer(skb);
+}
+EXPORT_SYMBOL(napi_consume_skb);
+
 /* Make sure a field is enclosed inside headers_start/headers_end section */
 #define CHECK_SKB_FIELD(field) \
 	BUILD_BUG_ON(offsetof(struct sk_buff, field) <		\


* [PATCH 2/4] net: bulk free SKBs that were delay free'ed due to IRQ context
From: Jesper Dangaard Brouer @ 2015-10-23 12:46 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Duyck, linux-mm, Jesper Dangaard Brouer, Joonsoo Kim,
	Andrew Morton, Christoph Lameter

The network stack defers freeing SKBs in case the free happens in IRQ
context or while IRQs are disabled.  This happens in __dev_kfree_skb_irq(),
which places SKBs freed during IRQ on the softirq completion queue
(softnet_data.completion_queue).

These SKBs are naturally delayed, and are cleaned up during NET_TX_SOFTIRQ
in net_tx_action().  Take advantage of this and use the SKB defer and
flush API, as we are already in softirq context.

For modern drivers this rarely happens, although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed.  This is because netpoll can call
drivers from IRQ context.
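
Conceptually, the existing deferral path works roughly like this
(paraphrased, not an exact copy of net/core/dev.c):

  void __dev_kfree_skb_any(struct sk_buff *skb, enum skb_free_reason reason)
  {
          /* paraphrased: see net/core/dev.c for the exact code */
          if (in_irq() || irqs_disabled())
                  __dev_kfree_skb_irq(skb, reason); /* queue on completion_queue */
          else
                  dev_kfree_skb(skb);
  }

  /* NET_TX_SOFTIRQ later runs net_tx_action(), which walks
   * softnet_data.completion_queue and frees the queued SKBs;
   * this patch converts that walk to the defer/flush bulk API.
   */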

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    1 +
 net/core/dev.c         |    8 +++++++-
 net/core/skbuff.c      |    8 ++++++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a3dec82e0e2c..f314abff2cbd 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2311,6 +2311,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 void napi_consume_skb(struct sk_buff *skb, int budget);
 
 void __kfree_skb_flush(void);
+void __kfree_skb_defer(struct sk_buff *skb);
 
 /**
  * __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index 204059f67154..0e88397db1fa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3588,8 +3588,14 @@ static void net_tx_action(struct softirq_action *h)
 				trace_consume_skb(skb);
 			else
 				trace_kfree_skb(skb, net_tx_action);
-			__kfree_skb(skb);
+
+			if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
+				__kfree_skb(skb);
+			else
+				__kfree_skb_defer(skb);
 		}
+
+		__kfree_skb_flush();
 	}
 
 	if (sd->output_queue) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2682ac46d640..2ffcb014e00b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -767,7 +767,7 @@ void __kfree_skb_flush(void)
 	}
 }
 
-static void __kfree_skb_defer(struct sk_buff *skb)
+static inline void _kfree_skb_defer(struct sk_buff *skb)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
@@ -789,6 +789,10 @@ static void __kfree_skb_defer(struct sk_buff *skb)
 		nc->skb_count = 0;
 	}
 }
+void __kfree_skb_defer(struct sk_buff *skb)
+{
+	_kfree_skb_defer(skb);
+}
 
 void napi_consume_skb(struct sk_buff *skb, int budget)
 {
@@ -814,7 +818,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
 		return;
 	}
 
-	__kfree_skb_defer(skb);
+	_kfree_skb_defer(skb);
 }
 EXPORT_SYMBOL(napi_consume_skb);
 


* [PATCH 3/4] ixgbe: bulk free SKBs during TX completion cleanup cycle
From: Jesper Dangaard Brouer @ 2015-10-23 12:46 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Duyck, linux-mm, Jesper Dangaard Brouer, Joonsoo Kim,
	Andrew Morton, Christoph Lameter

There is an opportunity to bulk free SKBs while reclaiming resources
after DMA transmit completes in ixgbe_clean_tx_irq.  Bulk freeing at
this point does not introduce any added latency.

Simply use napi_consume_skb(), which was recently introduced.  The
napi_budget parameter is needed by napi_consume_skb() to detect whether
it is called from netpoll.

Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
 Single CPU/flow numbers: before: 1982144 pps ->  after : 2064446 pps
 Improvement: +82302 pps, -20 nanosec, +4.1%
 (SLUB and GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
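
 (The nanosecond figure follows from the packet rates: 1/1982144 pps is
  roughly 504.5 ns per packet and 1/2064446 pps roughly 484.4 ns, i.e.
  about 20 ns saved per packet.)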

Joint work with Alexander Duyck.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9f8a7fd7a195..4614dfb3febd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1090,7 +1090,7 @@ static void ixgbe_tx_timeout_reset(struct ixgbe_adapter *adapter)
  * @tx_ring: tx ring to clean
  **/
 static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
-			       struct ixgbe_ring *tx_ring)
+			       struct ixgbe_ring *tx_ring, int napi_budget)
 {
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct ixgbe_tx_buffer *tx_buffer;
@@ -1128,7 +1128,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		total_packets += tx_buffer->gso_segs;
 
 		/* free the skb */
-		dev_consume_skb_any(tx_buffer->skb);
+		napi_consume_skb(tx_buffer->skb, napi_budget);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -2784,7 +2784,7 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
 #endif
 
 	ixgbe_for_each_ring(ring, q_vector->tx)
-		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);
+		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring, budget);
 
 	if (!ixgbe_qv_lock_napi(q_vector))
 		return budget;


* [PATCH 4/4] net: bulk alloc and reuse of SKBs in NAPI context
From: Jesper Dangaard Brouer @ 2015-10-23 12:46 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Duyck, linux-mm, Jesper Dangaard Brouer, Joonsoo Kim,
	Andrew Morton, Christoph Lameter

Think twice before applying
 - This patch can potentially introduce added latency in some workloads

This patch introduces bulk alloc of SKBs and allows reuse of SKBs
freed in the same softirq cycle.  SKBs are normally freed during TX
completion, but most high speed drivers also clean up the TX ring
during the NAPI RX poll cycle.  Thus, when using
napi_consume_skb/__kfree_skb_defer, SKBs will be available in the
napi_alloc_cache->skb_cache.

If no SKBs are available for reuse, then only bulk alloc 8 SKBs, to
limit the potential overshoot of unused SKBs that need to be freed when
the NAPI cycle ends (flushed in net_rx_action via __kfree_skb_flush()).

Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
(GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
 Allocator SLUB:
  Single CPU/flow numbers: before: 2064446 pps -> after: 2083031 pps
  Improvement: +18585 pps, -4.3 nanosec, +0.9%
 Allocator SLAB:
  Single CPU/flow numbers: before: 2035949 pps -> after: 2033567 pps
  Regression: -2382 pps, +0.57 nanosec, -0.1 %
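
 (Per-packet cost derived from the rates: SLUB 1/2064446 pps ~484.4 ns
  vs 1/2083031 pps ~480.1 ns, about 4.3 ns saved; SLAB 1/2035949 pps
  ~491.2 ns vs 1/2033567 pps ~491.7 ns, about 0.6 ns added.)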

Even though benchmarking does show an improvement for SLUB (+0.9%), I'm
not convinced bulk alloc will be a win in all situations:
 * I see stalls on walking the SLUB freelist (normally hidden by prefetch)
 * In case the RX queue is not full, we alloc and free more SKBs than needed

More testing is needed with more real-life benchmarks.

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 net/core/skbuff.c |   35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2ffcb014e00b..d0e0ccccfb11 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -483,13 +483,14 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	void *data;
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+	if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+		     (gfp_mask & (__GFP_WAIT | GFP_DMA)))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -506,12 +507,38 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
-	if (unlikely(!skb)) {
+#define BULK_ALLOC_SIZE 8
+	if (!nc->skb_count &&
+	    kmem_cache_alloc_bulk(skbuff_head_cache, gfp_mask,
+				  BULK_ALLOC_SIZE, nc->skb_cache)) {
+		nc->skb_count = BULK_ALLOC_SIZE;
+	}
+	if (likely(nc->skb_count)) {
+		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
+	} else {
+		/* alloc bulk failed */
 		skb_free_frag(data);
 		return NULL;
 	}
 
+	len -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->truesize = SKB_TRUESIZE(len);
+	atomic_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + len;
+	skb->mac_header = (typeof(skb->mac_header))~0U;
+	skb->transport_header = (typeof(skb->transport_header))~0U;
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+
 	/* use OR instead of assignment to avoid clearing of bits in mask */
 	if (nc->page.pfmemalloc)
 		skb->pfmemalloc = 1;


* Re: [PATCH 0/4] net: mitigating kmem_cache slowpath for network stack in NAPI context
From: David Miller @ 2015-10-27  1:09 UTC (permalink / raw)
  To: brouer; +Cc: netdev, alexander.duyck, linux-mm, iamjoonsoo.kim, akpm, cl

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 23 Oct 2015 14:46:01 +0200

> It has been a long road. Back in July 2014 I realized that the network
> stack was hitting the kmem_cache/SLUB slowpath when freeing SKBs, but
> I had no solution.  In Dec 2014 I implemented a solution called
> qmempool[1], which showed it was possible to improve this, but it got
> rejected for being a cache on top of kmem_cache.  In July 2015
> improvements to kmem_cache were proposed, and recently (Oct 2015) my
> kmem_cache (SLAB+SLUB) patches for bulk alloc and free were accepted
> into the AKPM quilt tree.
> 
> This patchset is the first real use-case for kmem_cache bulk alloc and
> free.  It is joint work with Alexander Duyck, done while he was still
> at Red Hat.
> 
> Using bulk free to avoid the SLUB slowpath shows the full potential.
> In this patchset it is realized in NAPI/softirq context:
>  1. During DMA TX completion, bulk free is optimal and does not
>     introduce any added latency.
>  2. Bulk free of SKBs whose free was delayed due to IRQ context, via
>     the net_tx_action softirq completion queue.
> 
> Using bulk alloc shows a minor improvement for SLUB (+0.9%), but a
> very slight slowdown for SLAB (-0.1%).
> 
> [1] http://thread.gmane.org/gmane.linux.network/342347/focus=126138
> 
> 
> This patchset is based on net-next (commit 26440c835), BUT I've
> applied several patches from AKPM's MM-tree.
> 
> I cherry-picked some commits from the MMOTM tree, branch/tag
> mmotm-2015-10-06-16-30, from
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> (the below commit IDs are obviously not stable).

Logically I'm fine with this series, but as you mention there are
dependencies that need to hit upstream before I can merge any of
this stuff into my tree.

I also think that patch #4 is a net win, and it will also expose the
bulking code to more testing, since it will be used more often.


* [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches
From: Jesper Dangaard Brouer @ 2016-02-02 21:11 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

This patchset is relevant for my NetDev 1.1 "Network Performance BoF" [1].

The first 4 patches are a repost[2] of the first real use-case of the
kmem_cache bulk alloc and free API.  They were adjusted slightly to
accommodate my latest slab API changes.  They should be ready for
inclusion in net-next, as the needed MM-tree changes are available in
net-next.

Patch 5 also enables the SKB bulk API for mlx5.

Thus, patches 1-5 should be ready for net-next.

After patch 5 the experimental patches begin; these are proof-of-concept
code for what we will be discussing during the Network Performance BoF [1].

[1] http://netdevconf.org/1.1/bof-network-performance-bof-jesper-dangaard-brouer.html
[2] http://thread.gmane.org/gmane.linux.network/384302/

---

Jesper Dangaard Brouer (11):
      net: bulk free infrastructure for NAPI context, use napi_consume_skb
      net: bulk free SKBs that were delay free'ed due to IRQ context
      ixgbe: bulk free SKBs during TX completion cleanup cycle
      net: bulk alloc and reuse of SKBs in NAPI context
      mlx5: use napi_*_skb APIs to get bulk alloc and free
      mlx5: RX bulking or bundling of packets before calling network stack
      net: introduce napi_alloc_skb_bulk() for more use-cases
      mlx5: hint the NAPI alloc skb API about the expected bulk size
      EXPERIMENT: dummy: bulk free
      net: API for RX handover of multiple SKBs to stack
      net: RPS bulk enqueue to backlog


 drivers/net/dummy.c                               |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |    6 -
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    5 -
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   24 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |    4 -
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |    4 -
 include/linux/netdevice.h                         |   13 ++
 include/linux/skbuff.h                            |   23 ++-
 net/core/dev.c                                    |  160 ++++++++++++++++++++-
 net/core/skbuff.c                                 |  122 +++++++++++++++-
 10 files changed, 327 insertions(+), 37 deletions(-)


* [net-next PATCH 01/11] net: bulk free infrastructure for NAPI context, use napi_consume_skb
From: Jesper Dangaard Brouer @ 2016-02-02 21:11 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

Discovered that the network stack was hitting the kmem_cache/SLUB
slowpath when freeing SKBs.  Doing bulk free with kmem_cache_free_bulk()
can speed up this slowpath.

NAPI context is a bit special; let's take advantage of that for bulk
freeing of SKBs.

In NAPI context we are running in softirq, which gives us certain
protection.  A softirq can run on several CPUs at once, BUT the
important part is that a softirq will never preempt another softirq
running on the same CPU.  This allows us to safely access per-CPU
variables in softirq context.

Extend napi_alloc_cache (which before only contained a page_frag_cache)
to be a struct with a small array-based stack for holding SKBs.
Introduce an SKB defer and flush API for accessing this.

Introduce napi_consume_skb() as a replacement for e.g. dev_consume_skb_any()
when running in NAPI context.  A small trick to detect whether we are
called from netpoll is to check if budget is 0; in that case, we need
to invoke dev_consume_skb_irq().

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 include/linux/skbuff.h |    3 ++
 net/core/dev.c         |    1 +
 net/core/skbuff.c      |   83 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11f935c1a090..3c8d348223d7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2399,6 +2399,9 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 {
 	return __napi_alloc_skb(napi, length, GFP_ATOMIC);
 }
+void napi_consume_skb(struct sk_buff *skb, int budget);
+
+void __kfree_skb_flush(void);
 
 /**
  * __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index cc9e3652cf93..73e6cbc10ac6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5149,6 +5149,7 @@ static void net_rx_action(struct softirq_action *h)
 		}
 	}
 
+	__kfree_skb_flush();
 	local_irq_disable();
 
 	list_splice_tail_init(&sd->poll_list, &list);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375ec9c2..e26bb2b1dba4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -347,8 +347,16 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+#define NAPI_SKB_CACHE_SIZE	64
+
+struct napi_alloc_cache {
+	struct page_frag_cache page;
+	size_t skb_count;
+	void *skb_cache[NAPI_SKB_CACHE_SIZE];
+};
+
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
-static DEFINE_PER_CPU(struct page_frag_cache, napi_alloc_cache);
+static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
 static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
@@ -378,9 +386,9 @@ EXPORT_SYMBOL(netdev_alloc_frag);
 
 static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
-	return __alloc_page_frag(nc, fragsz, gfp_mask);
+	return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
 }
 
 void *napi_alloc_frag(unsigned int fragsz)
@@ -474,7 +482,7 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 	struct sk_buff *skb;
 	void *data;
 
@@ -494,7 +502,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = __alloc_page_frag(nc, len, gfp_mask);
+	data = __alloc_page_frag(&nc->page, len, gfp_mask);
 	if (unlikely(!data))
 		return NULL;
 
@@ -505,7 +513,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	}
 
 	/* use OR instead of assignment to avoid clearing of bits in mask */
-	if (nc->pfmemalloc)
+	if (nc->page.pfmemalloc)
 		skb->pfmemalloc = 1;
 	skb->head_frag = 1;
 
@@ -747,6 +755,69 @@ void consume_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(consume_skb);
 
+void __kfree_skb_flush(void)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+	/* flush skb_cache if containing objects */
+	if (nc->skb_count) {
+		kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count,
+				     nc->skb_cache);
+		nc->skb_count = 0;
+	}
+}
+
+static void __kfree_skb_defer(struct sk_buff *skb)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+	/* drop skb->head and call any destructors for packet */
+	skb_release_all(skb);
+
+	/* record skb to CPU local list */
+	nc->skb_cache[nc->skb_count++] = skb;
+
+#ifdef CONFIG_SLUB
+	/* SLUB writes into objects when freeing */
+	prefetchw(skb);
+#endif
+
+	/* flush skb_cache if it is filled */
+	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
+		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
+				     nc->skb_cache);
+		nc->skb_count = 0;
+	}
+}
+
+void napi_consume_skb(struct sk_buff *skb, int budget)
+{
+	if (unlikely(!skb))
+		return;
+
+	/* if budget is 0 assume netpoll w/ IRQs disabled */
+	if (unlikely(!budget)) {
+		dev_consume_skb_irq(skb);
+		return;
+	}
+
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+	/* if reaching here SKB is ready to free */
+	trace_consume_skb(skb);
+
+	/* if SKB is a clone, don't handle this case */
+	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	__kfree_skb_defer(skb);
+}
+EXPORT_SYMBOL(napi_consume_skb);
+
 /* Make sure a field is enclosed inside headers_start/headers_end section */
 #define CHECK_SKB_FIELD(field) \
 	BUILD_BUG_ON(offsetof(struct sk_buff, field) <		\


* [net-next PATCH 02/11] net: bulk free SKBs that were delay free'ed due to IRQ context
From: Jesper Dangaard Brouer @ 2016-02-02 21:11 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

The network stack defers freeing SKBs in case the free happens in IRQ
context or while IRQs are disabled.  This happens in __dev_kfree_skb_irq(),
which places SKBs freed during IRQ on the softirq completion queue
(softnet_data.completion_queue).

These SKBs are naturally delayed, and are cleaned up during NET_TX_SOFTIRQ
in net_tx_action().  Take advantage of this and use the SKB defer and
flush API, as we are already in softirq context.

For modern drivers this rarely happens, although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed.  This is because netpoll can call
drivers from IRQ context.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    1 +
 net/core/dev.c         |    8 +++++++-
 net/core/skbuff.c      |    8 ++++++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3c8d348223d7..b06ba2e07c89 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2402,6 +2402,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 void napi_consume_skb(struct sk_buff *skb, int budget);
 
 void __kfree_skb_flush(void);
+void __kfree_skb_defer(struct sk_buff *skb);
 
 /**
  * __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index 73e6cbc10ac6..24be1d07d854 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3829,8 +3829,14 @@ static void net_tx_action(struct softirq_action *h)
 				trace_consume_skb(skb);
 			else
 				trace_kfree_skb(skb, net_tx_action);
-			__kfree_skb(skb);
+
+			if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
+				__kfree_skb(skb);
+			else
+				__kfree_skb_defer(skb);
 		}
+
+		__kfree_skb_flush();
 	}
 
 	if (sd->output_queue) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e26bb2b1dba4..d278e51789e9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -767,7 +767,7 @@ void __kfree_skb_flush(void)
 	}
 }
 
-static void __kfree_skb_defer(struct sk_buff *skb)
+static inline void _kfree_skb_defer(struct sk_buff *skb)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
@@ -789,6 +789,10 @@ static void __kfree_skb_defer(struct sk_buff *skb)
 		nc->skb_count = 0;
 	}
 }
+void __kfree_skb_defer(struct sk_buff *skb)
+{
+	_kfree_skb_defer(skb);
+}
 
 void napi_consume_skb(struct sk_buff *skb, int budget)
 {
@@ -814,7 +818,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
 		return;
 	}
 
-	__kfree_skb_defer(skb);
+	_kfree_skb_defer(skb);
 }
 EXPORT_SYMBOL(napi_consume_skb);
 


* [net-next PATCH 03/11] ixgbe: bulk free SKBs during TX completion cleanup cycle
From: Jesper Dangaard Brouer @ 2016-02-02 21:11 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

There is an opportunity to bulk free SKBs while reclaiming resources
after DMA transmit completes in ixgbe_clean_tx_irq.  Bulk freeing at
this point does not introduce any added latency.

Simply use napi_consume_skb(), which was recently introduced.  The
napi_budget parameter is needed by napi_consume_skb() to detect whether
it is called from netpoll.

Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
 Single CPU/flow numbers: before: 1982144 pps ->  after : 2064446 pps
 Improvement: +82302 pps, -20 nanosec, +4.1%
 (SLUB and GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))

Joint work with Alexander Duyck.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c4003a88bbf6..0c701b8438b6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1089,7 +1089,7 @@ static void ixgbe_tx_timeout_reset(struct ixgbe_adapter *adapter)
  * @tx_ring: tx ring to clean
  **/
 static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
-			       struct ixgbe_ring *tx_ring)
+			       struct ixgbe_ring *tx_ring, int napi_budget)
 {
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct ixgbe_tx_buffer *tx_buffer;
@@ -1127,7 +1127,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		total_packets += tx_buffer->gso_segs;
 
 		/* free the skb */
-		dev_consume_skb_any(tx_buffer->skb);
+		napi_consume_skb(tx_buffer->skb, napi_budget);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -2784,7 +2784,7 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
 #endif
 
 	ixgbe_for_each_ring(ring, q_vector->tx)
-		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);
+		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring, budget);
 
 	/* Exit if we are called by netpoll or busy polling is active */
 	if ((budget <= 0) || !ixgbe_qv_lock_napi(q_vector))


* [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context
From: Jesper Dangaard Brouer @ 2016-02-02 21:12 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

Think twice before applying
 - This patch can potentially introduce added latency in some workloads

This patch introduces bulk alloc of SKBs and allows reuse of SKBs
freed in the same softirq cycle.  SKBs are normally freed during TX
completion, but most high speed drivers also clean up the TX ring
during the NAPI RX poll cycle.  Thus, when using
napi_consume_skb/__kfree_skb_defer, SKBs will be available in the
napi_alloc_cache->skb_cache.

If no SKBs are available for reuse, then only bulk alloc 8 SKBs, to
limit the potential overshoot of unused SKBs that need to be freed when
the NAPI cycle ends (flushed in net_rx_action via __kfree_skb_flush()).

Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
(GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
 Allocator SLUB:
  Single CPU/flow numbers: before: 2064446 pps -> after: 2083031 pps
  Improvement: +18585 pps, -4.3 nanosec, +0.9%
 Allocator SLAB:
  Single CPU/flow numbers: before: 2035949 pps -> after: 2033567 pps
  Regression: -2382 pps, +0.57 nanosec, -0.1 %

Even though benchmarking does show an improvement for SLUB (+0.9%), I'm
not convinced bulk alloc will be a win in all situations:
 * I see stalls on walking the SLUB freelist (normally hidden by prefetch)
 * In case the RX queue is not full, we alloc and free more SKBs than needed

More testing is needed with more real-life benchmarks.

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 net/core/skbuff.c |   35 +++++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d278e51789e9..ae8cdbec90ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -483,13 +483,14 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	void *data;
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+	if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+		     (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -506,12 +507,38 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
-	if (unlikely(!skb)) {
+#define BULK_ALLOC_SIZE 8
+	if (!nc->skb_count) {
+		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+						      gfp_mask, BULK_ALLOC_SIZE,
+						      nc->skb_cache);
+	}
+	if (likely(nc->skb_count)) {
+		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
+	} else {
+		/* alloc bulk failed */
 		skb_free_frag(data);
 		return NULL;
 	}
 
+	len -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->truesize = SKB_TRUESIZE(len);
+	atomic_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + len;
+	skb->mac_header = (typeof(skb->mac_header))~0U;
+	skb->transport_header = (typeof(skb->transport_header))~0U;
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+
 	/* use OR instead of assignment to avoid clearing of bits in mask */
 	if (nc->page.pfmemalloc)
 		skb->pfmemalloc = 1;


* [net-next PATCH 05/11] mlx5: use napi_*_skb APIs to get bulk alloc and free
From: Jesper Dangaard Brouer @ 2016-02-02 21:12 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

Bulk alloc and free of SKBs happen transparently via the API calls
napi_alloc_skb() and napi_consume_skb().

The mlx5 driver gets an extra large benefit from these changes,
because it already has a loop refilling its RX ring queue.  I
considered whether the alloc API should be allowed to request a larger
allocation, knowing the number of objects it needs for the refill.  For
now, just use the default bulk size hidden inside napi_alloc_skb().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |    9 +++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |    4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |    4 ++--
 4 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 9ea49a893323..3e531fae9ed3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -591,9 +591,9 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
 void mlx5e_completion_event(struct mlx5_core_cq *mcq);
 void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
-bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq);
+bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
-bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
+bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index dd959d929aad..e923f4adc0f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -42,12 +42,13 @@ static inline bool mlx5e_rx_hw_stamp(struct mlx5e_tstamp *tstamp)
 }
 
 static inline int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq,
-				     struct mlx5e_rx_wqe *wqe, u16 ix)
+				     struct mlx5e_rx_wqe *wqe, u16 ix,
+				     struct napi_struct *napi)
 {
 	struct sk_buff *skb;
 	dma_addr_t dma_addr;
 
-	skb = netdev_alloc_skb(rq->netdev, rq->wqe_sz);
+	skb = napi_alloc_skb(napi, rq->wqe_sz);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
@@ -76,7 +77,7 @@ err_free_skb:
 	return -ENOMEM;
 }
 
-bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
+bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi)
 {
 	struct mlx5_wq_ll *wq = &rq->wq;
 
@@ -86,7 +87,7 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 	while (!mlx5_wq_ll_is_full(wq)) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
 
-		if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, wq->head)))
+		if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, wq->head, napi)))
 			break;
 
 		mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 2c3fba0fff54..06a29c8f5712 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -326,7 +326,7 @@ netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev)
 	return mlx5e_sq_xmit(sq, skb);
 }
 
-bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq)
+bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 {
 	struct mlx5e_sq *sq;
 	u32 dma_fifo_cc;
@@ -402,7 +402,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq)
 			npkts++;
 			nbytes += wi->num_bytes;
 			sqcc += wi->num_wqebbs;
-			dev_kfree_skb(skb);
+			napi_consume_skb(skb, napi_budget);
 		} while (!last_wqe);
 	}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 4ac8d716dbdd..8fd07c8087e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -60,11 +60,11 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	clear_bit(MLX5E_CHANNEL_NAPI_SCHED, &c->flags);
 
 	for (i = 0; i < c->num_tc; i++)
-		busy |= mlx5e_poll_tx_cq(&c->sq[i].cq);
+		busy |= mlx5e_poll_tx_cq(&c->sq[i].cq, budget);
 
 	work_done = mlx5e_poll_rx_cq(&c->rq.cq, budget);
 	busy |= work_done == budget;
-	busy |= mlx5e_post_rx_wqes(&c->rq);
+	busy |= mlx5e_post_rx_wqes(&c->rq, napi);
 
 	if (busy)
 		return budget;

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (4 preceding siblings ...)
  2016-02-02 21:12   ` [net-next PATCH 05/11] mlx5: use napi_*_skb APIs to get bulk alloc and free Jesper Dangaard Brouer
@ 2016-02-02 21:13   ` Jesper Dangaard Brouer
  2016-02-09 11:57     ` Saeed Mahameed
  2016-02-02 21:13   ` [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases Jesper Dangaard Brouer
                     ` (5 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-02 21:13 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

There are several techniques/concepts combined in this optimization.
It is both a data-cache and instruction-cache optimization.

First of all, this is primarily about delaying touching
packet-data, which happens in eth_type_trans(), until the prefetch
has had time to fetch.  Thus, hopefully avoiding a cache-miss on
packet data.

Secondly, the instruction-cache optimization is about not calling
the network stack for every packet that is pulled out of the RX
ring.  Calling into the full stack likely removes/flushes the
instruction cache every time.

Thus, use two loops: one loop pulling packets out of the RX ring
and starting the prefetch, and a second loop calling
eth_type_trans() and invoking the stack via napi_gro_receive().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>


Notes:
This is the patch that gave a speed-up from 6.2 Mpps to 12 Mpps, when
trying to measure the lowest RX level, by dropping the packets in the
driver itself (the drop point is marked with a comment).

For now, the ring is emptied up to the budget.  I don't know if it
would be better to chunk it up more?

In the future, I imagine that we can call the stack with the full
SKB-list instead of this local loop.  But then it would look a
bit strange to call eth_type_trans() as the only function...
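
A condensed sketch of the two-loop structure (not taken from the patch
itself; mydrv_get_completed_rx_skb() is an illustrative stand-in for the
mlx5 CQE handling shown in the diff below):

	struct sk_buff_head rx_skb_list;
	struct sk_buff *skb;

	__skb_queue_head_init(&rx_skb_list);

	/* Loop 1: pull packets off the RX ring and start the prefetch */
	while ((skb = mydrv_get_completed_rx_skb(rq)) != NULL) { /* illustrative */
		prefetch(skb->data);	/* hide the cache-miss on packet data */
		__skb_queue_tail(&rx_skb_list, skb);
	}

	/* Loop 2: only now touch packet data and call into the stack */
	while ((skb = __skb_dequeue(&rx_skb_list)) != NULL) {
		skb->protocol = eth_type_trans(skb, rq->netdev);
		napi_gro_receive(cq->napi, skb);
	}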
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index e923f4adc0f8..5d96d6682db0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -214,8 +214,6 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 
 	mlx5e_handle_csum(netdev, cqe, rq, skb);
 
-	skb->protocol = eth_type_trans(skb, netdev);
-
 	skb_record_rx_queue(skb, rq->ix);
 
 	if (likely(netdev->features & NETIF_F_RXHASH))
@@ -229,8 +227,15 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+	struct sk_buff_head rx_skb_list;
+	struct sk_buff *rx_skb;
 	int work_done;
 
+	/* Using SKB list infrastructure, even-though some instructions
+	 * could be saved by open-coding it on skb->next directly.
+	 */
+	__skb_queue_head_init(&rx_skb_list);
+
 	/* avoid accessing cq (dma coherent memory) if not needed */
 	if (!test_and_clear_bit(MLX5E_CQ_HAS_CQES, &cq->flags))
 		return 0;
@@ -252,7 +257,6 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 		wqe_counter    = be16_to_cpu(wqe_counter_be);
 		wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
 		skb            = rq->skb[wqe_counter];
-		prefetch(skb->data);
 		rq->skb[wqe_counter] = NULL;
 
 		dma_unmap_single(rq->pdev,
@@ -265,16 +269,27 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 			dev_kfree_skb(skb);
 			goto wq_ll_pop;
 		}
+		prefetch(skb->data);
 
 		mlx5e_build_rx_skb(cqe, rq, skb);
 		rq->stats.packets++;
-		napi_gro_receive(cq->napi, skb);
+		__skb_queue_tail(&rx_skb_list, skb);
 
 wq_ll_pop:
 		mlx5_wq_ll_pop(&rq->wq, wqe_counter_be,
 			       &wqe->next.next_wqe_index);
 	}
 
+	while ((rx_skb = __skb_dequeue(&rx_skb_list)) != NULL) {
+		rx_skb->protocol = eth_type_trans(rx_skb, rq->netdev);
+		napi_gro_receive(cq->napi, rx_skb);
+
+		/* NOT FOR UPSTREAM INCLUSION:
+		 * How I did isolated testing of driver RX, I here called:
+		 *  napi_consume_skb(rx_skb, budget);
+		 */
+	}
+
 	mlx5_cqwq_update_db_record(&cq->wq);
 
 	/* ensure cq space is freed before enabling more cqes */

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (5 preceding siblings ...)
  2016-02-02 21:13   ` [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack Jesper Dangaard Brouer
@ 2016-02-02 21:13   ` Jesper Dangaard Brouer
  2016-02-02 22:29     ` kbuild test robot
  2016-02-02 21:14   ` [net-next PATCH 08/11] mlx5: hint the NAPI alloc skb API about the expected bulk size Jesper Dangaard Brouer
                     ` (4 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-02 21:13 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

The default bulk alloc size, arbitrarily chosen to be 8, might
not suit all use-cases.  This introduces a function napi_alloc_skb_hint()
that allows the caller to specify the bulk size it is expecting.
It is a hint because __napi_alloc_skb() limits the bulk size to
the array size.

One user is the mlx5 driver, which bulk re-populates its RX ring
with both SKBs and pages.  Thus, it would like to work with
bigger bulk alloc chunks.
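
A minimal usage sketch (the driver structure and mydrv_post_rx_buffer()
are illustrative assumptions, not real code; only napi_alloc_skb_hint()
comes from this patch):

	/* A refill loop knows how many buffers it is about to post, so it
	 * can hint that number instead of the default bulk size of 8.
	 * The hint is clamped to NAPI_SKB_CACHE_SIZE inside __napi_alloc_skb().
	 */
	static int mydrv_refill_rx(struct mydrv_rq *rq, struct napi_struct *napi,
				   unsigned int missing)
	{
		unsigned int i;

		for (i = 0; i < missing; i++) {
			struct sk_buff *skb;

			skb = napi_alloc_skb_hint(napi, rq->buf_sz, missing);
			if (unlikely(!skb))
				break;
			mydrv_post_rx_buffer(rq, skb); /* illustrative */
		}
		return i;
	}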

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |   19 +++++++++++++++----
 net/core/skbuff.c      |    8 +++-----
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b06ba2e07c89..4d0c0eacbc34 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2391,14 +2391,25 @@ static inline void skb_free_frag(void *addr)
 	__free_page_frag(addr);
 }
 
+#define NAPI_SKB_CACHE_SIZE	64U /* Used in struct napi_alloc_cache */
+#define NAPI_SKB_BULK_ALLOC	 8U /* Default slab bulk alloc in NAPI */
+
 void *napi_alloc_frag(unsigned int fragsz);
-struct sk_buff *__napi_alloc_skb(struct napi_struct *napi,
-				 unsigned int length, gfp_t gfp_mask);
+struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
+				 unsigned int bulk_hint, gfp_t gfp_mask);
 static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
-					     unsigned int length)
+					     unsigned int len)
+{
+	return __napi_alloc_skb(napi, len, NAPI_SKB_BULK_ALLOC, GFP_ATOMIC);
+}
+static inline struct sk_buff *napi_alloc_skb_hint(struct napi_struct *napi,
+						  unsigned int len,
+						  unsigned int bulk_hint)
 {
-	return __napi_alloc_skb(napi, length, GFP_ATOMIC);
+	bulk_hint = bulk_hint ? : 1;
+	return __napi_alloc_skb(napi, len, bulk_hint, GFP_ATOMIC);
 }
+
 void napi_consume_skb(struct sk_buff *skb, int budget);
 
 void __kfree_skb_flush(void);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ae8cdbec90ee..f77209fb5361 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -347,8 +347,6 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
-#define NAPI_SKB_CACHE_SIZE	64
-
 struct napi_alloc_cache {
 	struct page_frag_cache page;
 	size_t skb_count;
@@ -480,9 +478,10 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
  *	%NULL is returned if there is no free memory.
  */
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
-				 gfp_t gfp_mask)
+				 unsigned int bulk_hint, gfp_t gfp_mask)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	unsigned int bulk_sz = min(bulk_hint, NAPI_SKB_CACHE_SIZE);
 	struct skb_shared_info *shinfo;
 	struct sk_buff *skb;
 	void *data;
@@ -507,10 +506,9 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-#define BULK_ALLOC_SIZE 8
 	if (!nc->skb_count) {
 		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
-						      gfp_mask, BULK_ALLOC_SIZE,
+						      gfp_mask, bulk_sz,
 						      nc->skb_cache);
 	}
 	if (likely(nc->skb_count)) {

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH 08/11] mlx5: hint the NAPI alloc skb API about the expected bulk size
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (6 preceding siblings ...)
  2016-02-02 21:13   ` [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases Jesper Dangaard Brouer
@ 2016-02-02 21:14   ` Jesper Dangaard Brouer
  2016-02-02 21:14   ` [net-next PATCH 09/11] RFC: dummy: bulk free SKBs Jesper Dangaard Brouer
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-02 21:14 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

Use the newly introduced napi_alloc_skb_hint() API to get the underlying
slab bulk allocation sizes to align with what the mlx5 driver needs for
refilling its RX ring queue.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   10 ++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c |    2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3e531fae9ed3..b2aba498e8d4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -593,7 +593,8 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
 int mlx5e_napi_poll(struct napi_struct *napi, int budget);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
 int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
-bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi);
+bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi,
+			unsigned int bulk_hint);
 struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
 
 void mlx5e_update_stats(struct mlx5e_priv *priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5d96d6682db0..88f88d354abc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -43,12 +43,13 @@ static inline bool mlx5e_rx_hw_stamp(struct mlx5e_tstamp *tstamp)
 
 static inline int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq,
 				     struct mlx5e_rx_wqe *wqe, u16 ix,
-				     struct napi_struct *napi)
+				     struct napi_struct *napi,
+				     unsigned int bulk_hint)
 {
 	struct sk_buff *skb;
 	dma_addr_t dma_addr;
 
-	skb = napi_alloc_skb(napi, rq->wqe_sz);
+	skb = napi_alloc_skb_hint(napi, rq->wqe_sz, bulk_hint);
 	if (unlikely(!skb))
 		return -ENOMEM;
 
@@ -77,7 +78,8 @@ err_free_skb:
 	return -ENOMEM;
 }
 
-bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi)
+bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi,
+			unsigned int hint)
 {
 	struct mlx5_wq_ll *wq = &rq->wq;
 
@@ -87,7 +89,7 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq, struct napi_struct *napi)
 	while (!mlx5_wq_ll_is_full(wq)) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
 
-		if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, wq->head, napi)))
+		if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, wq->head, napi, hint)))
 			break;
 
 		mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 8fd07c8087e3..6488404edff6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -64,7 +64,7 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 
 	work_done = mlx5e_poll_rx_cq(&c->rq.cq, budget);
 	busy |= work_done == budget;
-	busy |= mlx5e_post_rx_wqes(&c->rq, napi);
+	busy |= mlx5e_post_rx_wqes(&c->rq, napi, work_done);
 
 	if (busy)
 		return budget;

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH 09/11] RFC: dummy: bulk free SKBs
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (7 preceding siblings ...)
  2016-02-02 21:14   ` [net-next PATCH 08/11] mlx5: hint the NAPI alloc skb API about the expected bulk size Jesper Dangaard Brouer
@ 2016-02-02 21:14   ` Jesper Dangaard Brouer
  2016-02-02 21:15   ` [net-next PATCH 10/11] RFC: net: API for RX handover of multiple SKBs to stack Jesper Dangaard Brouer
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-02 21:14 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

Normal TX completion uses napi_consume_skb(), thus also make the dummy
driver use it, as that makes it easier to see the effect of bulk freeing SKBs.

---
 drivers/net/dummy.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 69fc8409a973..985565ec5d60 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -85,7 +85,8 @@ static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
 	dstats->tx_bytes += skb->len;
 	u64_stats_update_end(&dstats->syncp);
 
-	dev_kfree_skb(skb);
+	//dev_kfree_skb(skb);
+	napi_consume_skb(skb, 1);
 	return NETDEV_TX_OK;
 }
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH 10/11] RFC: net: API for RX handover of multiple SKBs to stack
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (8 preceding siblings ...)
  2016-02-02 21:14   ` [net-next PATCH 09/11] RFC: dummy: bulk free SKBs Jesper Dangaard Brouer
@ 2016-02-02 21:15   ` Jesper Dangaard Brouer
  2016-02-02 21:15   ` [net-next PATCH 11/11] RFC: net: RPS bulk enqueue to backlog Jesper Dangaard Brouer
  2016-02-07 19:25   ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches David Miller
  11 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-02 21:15 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

Introduce napi_gro_receive_list() which takes a full SKB-list
for processing by the stack.

It also takes over invoking eth_type_trans().

One purpose is to disconnect the icache usage/sharing between
driver-level RX (the NAPI loop) and the upper RX network stack.

Another advantage is that the stack now knows how many packets it
received, and can do the appropriate packet bundling inside the
stack, e.g. flush/process these bundles when the skb list is empty.

PITFALLS: It is slightly overkill to use a struct sk_buff_head (24 bytes),
allocated on the caller's stack, to hand over packets.  It also
maintains a qlen, which is unnecessary in this hotpath code.
A simple list within the first SKB could be a minimal solution.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   12 +-----------
 include/linux/netdevice.h                       |    3 +++
 net/core/dev.c                                  |   18 ++++++++++++++++++
 3 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 88f88d354abc..b6e7cc29f02c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -230,7 +230,6 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 {
 	struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
 	struct sk_buff_head rx_skb_list;
-	struct sk_buff *rx_skb;
 	int work_done;
 
 	/* Using SKB list infrastructure, even-though some instructions
@@ -281,16 +280,7 @@ wq_ll_pop:
 		mlx5_wq_ll_pop(&rq->wq, wqe_counter_be,
 			       &wqe->next.next_wqe_index);
 	}
-
-	while ((rx_skb = __skb_dequeue(&rx_skb_list)) != NULL) {
-		rx_skb->protocol = eth_type_trans(rx_skb, rq->netdev);
-		napi_gro_receive(cq->napi, rx_skb);
-
-		/* NOT FOR UPSTREAM INCLUSION:
-		 * How I did isolated testing of driver RX, I here called:
-		 *  napi_consume_skb(rx_skb, budget);
-		 */
-	}
+	napi_gro_receive_list(cq->napi, &rx_skb_list, rq->netdev);
 
 	mlx5_cqwq_update_db_record(&cq->wq);
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5ac140dcb789..11df9af41a3c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3142,6 +3142,9 @@ int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
+void napi_gro_receive_list(struct napi_struct *napi,
+			   struct sk_buff_head *skb_list,
+			   struct net_device *netdev);
 void napi_gro_flush(struct napi_struct *napi, bool flush_old);
 struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
diff --git a/net/core/dev.c b/net/core/dev.c
index 24be1d07d854..35c92a968937 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4579,6 +4579,24 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(napi_gro_receive);
 
+void napi_gro_receive_list(struct napi_struct *napi,
+			   struct sk_buff_head *skb_list,
+			   struct net_device *netdev)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(skb_list)) != NULL) {
+		skb->protocol = eth_type_trans(skb, netdev);
+
+		skb_mark_napi_id(skb, napi);
+		trace_napi_gro_receive_entry(skb);
+
+		skb_gro_reset_offset(skb);
+		napi_skb_finish(dev_gro_receive(napi, skb), skb);
+	}
+}
+EXPORT_SYMBOL(napi_gro_receive_list);
+
 static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 {
 	if (unlikely(skb->pfmemalloc)) {

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH 11/11] RFC: net: RPS bulk enqueue to backlog
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (9 preceding siblings ...)
  2016-02-02 21:15   ` [net-next PATCH 10/11] RFC: net: API for RX handover of multiple SKBs to stack Jesper Dangaard Brouer
@ 2016-02-02 21:15   ` Jesper Dangaard Brouer
  2016-02-07 19:25   ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches David Miller
  11 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-02 21:15 UTC (permalink / raw)
  To: netdev
  Cc: Christoph Lameter, tom, Alexander Duyck, alexei.starovoitov,
	Jesper Dangaard Brouer, ogerlitz, gerlitz.or

NEED TO CLEAN UP PATCH (likely still contains bugs...)

When enabling Receive Packet Steering (RPS) like :

 echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

 for N in $(seq 0 7) ; do
    echo 4096 > /sys/class/net/${DEV}/queues/rx-$N/rps_flow_cnt
    echo f >  /sys/class/net/${DEV}/queues/rx-$N/rps_cpus
    grep -H . /sys/class/net/${DEV}/queues/rx-$N/rps_cpus
 done

I noticed high contention on enqueue_to_backlog().  To mitigate
this, introduce enqueue_list_to_backlog(), which allows enqueueing
an entire skb list, instead of having to take the per-packet cost.

The skb lists needed for bulk enqueueing are constructed in the
per-CPU area of softnet_data, and thus can be constructed
without heavy CPU synchronization.

I'm excited about the performance improvement of this patch:

Before:
 Ethtool(mlx5p2  ) stat:      7246630 (      7,246,630) <= rx3_packets /sec
After:
 Ethtool(mlx5p2  ) stat:      9182886 (      9,182,886) <= rx3_packets /sec
Improvement:
 (1/9182886-1/7246630)*10^9 = saving -29.0 ns

The benchmark is a single pktgen flow, which is RPS directed to
another CPU, for further processing via process_backlog().  The
remote CPU cannot handle the load and this CPU drops packets when
it cannot get them enqueued.
---
 include/linux/netdevice.h |   10 +++
 net/core/dev.c            |  133 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 140 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 11df9af41a3c..dc5baef95d27 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -633,6 +633,15 @@ struct rps_dev_flow {
 };
 #define RPS_NO_FILTER 0xffff
 
+struct rps_cpu_queue {
+	struct sk_buff_head skb_list;
+	int to_cpu;
+	struct rps_dev_flow *rflow;
+	struct net_device *dev;
+};
+#define RPS_CPU_QUEUES		2 /* Must be power of 2 */
+#define RPS_CPU_QUEUES_MASK	(RPS_CPU_QUEUES - 1)
+
 /*
  * The rps_dev_flow_table structure contains a table of flow mappings.
  */
@@ -2662,6 +2671,7 @@ struct softnet_data {
 	unsigned int		received_rps;
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
+	struct rps_cpu_queue	local_rps_queue[RPS_CPU_QUEUES];
 #endif
 #ifdef CONFIG_NET_FLOW_LIMIT
 	struct sd_flow_limit __rcu *flow_limit;
diff --git a/net/core/dev.c b/net/core/dev.c
index 35c92a968937..0a231529bc0c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3736,6 +3736,60 @@ drop:
 	return NET_RX_DROP;
 }
 
+static int enqueue_list_to_backlog(struct sk_buff_head *skb_list, int cpu,
+				   unsigned int *qtail, struct net_device *dev)
+{
+       unsigned int qlen, qlen_drop;
+       struct softnet_data *sd;
+       struct sk_buff *skb;
+       unsigned long flags;
+
+       sd = &per_cpu(softnet_data, cpu);
+
+       local_irq_save(flags);
+
+       rps_lock(sd);
+       if (!netif_running(dev))
+               goto drop;
+       qlen = skb_queue_len(&sd->input_pkt_queue);
+       /* NOTICE: Had to drop !skb_flow_limit(skb, qlen) check here */
+       if (qlen <= netdev_max_backlog) {
+               if (qlen) {
+enqueue:
+                       //__skb_queue_tail(&sd->input_pkt_queue, skb);
+                       skb_queue_splice_tail_init(skb_list,
+                                                  &sd->input_pkt_queue);
+                       input_queue_tail_incr_save(sd, qtail);
+                       rps_unlock(sd);
+                       local_irq_restore(flags);
+                       return NET_RX_SUCCESS;
+               }
+
+               /* Schedule NAPI for backlog device
+                * We can use non atomic operation since we own the queue lock
+                */
+               if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
+                       if (!rps_ipi_queued(sd))
+                               ____napi_schedule(sd, &sd->backlog);
+               }
+               goto enqueue;
+       }
+
+drop:
+       qlen_drop = skb_queue_len(skb_list);
+       sd->dropped += qlen_drop;
+       rps_unlock(sd);
+
+       local_irq_restore(flags);
+
+       atomic_long_add(qlen_drop, &dev->rx_dropped);
+       while ((skb = __skb_dequeue(skb_list)) != NULL) {
+               __kfree_skb_defer(skb);
+       }
+       return NET_RX_DROP;
+}
+
+
 static int netif_rx_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4211,14 +4265,43 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 #ifdef CONFIG_RPS
 	if (static_key_false(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
+		struct softnet_data *sd = this_cpu_ptr(&softnet_data);
+		struct rps_cpu_queue *lq; /* softnet cpu local queue (lq) */
+
 		int cpu = get_rps_cpu(skb->dev, skb, &rflow);
+		if (cpu < 0)
+			goto no_rps;
+
+		/* RPS destinated packet */
+		// XXX: is local_irq_disable needed here?
+		sd = this_cpu_ptr(&softnet_data);
+		lq = &sd->local_rps_queue[cpu & RPS_CPU_QUEUES_MASK];
 
-		if (cpu >= 0) {
-			ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
+		if (lq->to_cpu == cpu && lq->dev == skb->dev) {
+			/* Bonus, RPS dest match prev CPU, q pkt for later */
+			__skb_queue_tail(&lq->skb_list, skb);
+			lq->rflow = rflow;
 			rcu_read_unlock();
-			return ret;
+			return NET_RX_SUCCESS;
 		}
+		if (unlikely(lq->to_cpu < 0))
+			goto init_localq;
+
+		/* No match, bulk enq to remote cpu backlog */
+		ret = enqueue_list_to_backlog(&lq->skb_list, lq->to_cpu,
+					      &lq->rflow->last_qtail, lq->dev);
+	init_localq: /* start new localq (lq) */
+		/* XXX: check if lq->skb_list was already re-init'ed */
+		skb_queue_head_init(&lq->skb_list);
+		__skb_queue_tail(&lq->skb_list, skb);
+		lq->rflow = rflow;
+		lq->to_cpu = cpu;
+		lq->dev = skb->dev;
+
+		rcu_read_unlock();
+		return ret;
 	}
+no_rps:
 #endif
 	ret = __netif_receive_skb(skb);
 	rcu_read_unlock();
@@ -4579,6 +4662,30 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 }
 EXPORT_SYMBOL(napi_gro_receive);
 
+static void rps_flush_local_queues(struct softnet_data *sd)
+{
+#ifdef CONFIG_RPS
+	int i;
+
+//	local_irq_disable(); //TEST
+//	if (!sd)
+	sd = this_cpu_ptr(&softnet_data);
+
+	/* Bulk flush any remaining locally queued packets for RPS */
+	for (i = 0; i < RPS_CPU_QUEUES; i++) {
+		struct rps_cpu_queue *lq = &sd->local_rps_queue[i];
+
+		if (skb_queue_empty(&lq->skb_list))
+			continue;
+
+		enqueue_list_to_backlog(&lq->skb_list, lq->to_cpu,
+					&lq->rflow->last_qtail, lq->dev);
+		// FAILS: lq->to_cpu = -1;
+	}
+//	local_irq_enable(); //TEST
+#endif
+}
+
 void napi_gro_receive_list(struct napi_struct *napi,
 			   struct sk_buff_head *skb_list,
 			   struct net_device *netdev)
@@ -4594,6 +4701,10 @@ void napi_gro_receive_list(struct napi_struct *napi,
 		skb_gro_reset_offset(skb);
 		napi_skb_finish(dev_gro_receive(napi, skb), skb);
 	}
+#ifdef CONFIG_RPS
+//	if (static_key_false(&rps_needed))
+//		rps_flush_local_queues(NULL);
+#endif
 }
 EXPORT_SYMBOL(napi_gro_receive_list);
 
@@ -4747,6 +4858,8 @@ static void net_rps_action_and_irq_enable(struct softnet_data *sd)
 
 		local_irq_enable();
 
+//		rps_flush_local_queues(NULL);
+
 		/* Send pending IPI's to kick RPS processing on remote cpus. */
 		while (remsd) {
 			struct softnet_data *next = remsd->rps_ipi_next;
@@ -5176,6 +5289,8 @@ static void net_rx_action(struct softirq_action *h)
 	__kfree_skb_flush();
 	local_irq_disable();
 
+        rps_flush_local_queues(NULL);
+
 	list_splice_tail_init(&sd->poll_list, &list);
 	list_splice_tail(&repoll, &list);
 	list_splice(&list, &sd->poll_list);
@@ -8085,6 +8200,9 @@ static int __init net_dev_init(void)
 
 	for_each_possible_cpu(i) {
 		struct softnet_data *sd = &per_cpu(softnet_data, i);
+#ifdef CONFIG_RPS
+		int j;
+#endif
 
 		skb_queue_head_init(&sd->input_pkt_queue);
 		skb_queue_head_init(&sd->process_queue);
@@ -8094,6 +8212,15 @@ static int __init net_dev_init(void)
 		sd->csd.func = rps_trigger_softirq;
 		sd->csd.info = sd;
 		sd->cpu = i;
+		for (j = 0; j < RPS_CPU_QUEUES; j++) {
+			struct rps_cpu_queue *lq = &sd->local_rps_queue[j];
+
+			skb_queue_head_init(&lq->skb_list);
+			lq->to_cpu = -1;
+			// XXX: below not needed
+			lq->rflow = NULL;
+			lq->dev = NULL;
+		}
 #endif
 
 		sd->backlog.poll = process_backlog;

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases
  2016-02-02 21:13   ` [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases Jesper Dangaard Brouer
@ 2016-02-02 22:29     ` kbuild test robot
  0 siblings, 0 replies; 35+ messages in thread
From: kbuild test robot @ 2016-02-02 22:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: kbuild-all, netdev, Christoph Lameter, tom, Alexander Duyck,
	alexei.starovoitov, Jesper Dangaard Brouer, ogerlitz, gerlitz.or

[-- Attachment #1: Type: text/plain, Size: 3942 bytes --]

Hi Jesper,

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Jesper-Dangaard-Brouer/net-mitigating-kmem_cache-slowpath-and-BoF-discussion-patches/20160203-051706
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   include/linux/skbuff.h:922: warning: No description found for parameter 'sk'
>> net/core/skbuff.c:482: warning: No description found for parameter 'bulk_hint'
   net/core/gen_stats.c:155: warning: No description found for parameter 'cpu'
   net/core/gen_estimator.c:212: warning: No description found for parameter 'cpu_bstats'
   net/core/gen_estimator.c:303: warning: No description found for parameter 'cpu_bstats'
   net/core/dev.c:6450: warning: No description found for parameter 'len'
   include/linux/netdevice.h:1321: warning: Enum value 'IFF_XMIT_DST_RELEASE_PERM' not described in enum 'netdev_priv_flags'
   include/linux/netdevice.h:1321: warning: Enum value 'IFF_IPVLAN_MASTER' not described in enum 'netdev_priv_flags'
   include/linux/netdevice.h:1321: warning: Enum value 'IFF_IPVLAN_SLAVE' not described in enum 'netdev_priv_flags'
   include/linux/netdevice.h:1826: warning: No description found for parameter 'ptype_all'
   include/linux/netdevice.h:1826: warning: No description found for parameter 'ptype_specific'

vim +/bulk_hint +482 net/core/skbuff.c

^1da177e Linus Torvalds         2005-04-16  466  
fd11a83d Alexander Duyck        2014-12-09  467  /**
fd11a83d Alexander Duyck        2014-12-09  468   *	__napi_alloc_skb - allocate skbuff for rx in a specific NAPI instance
fd11a83d Alexander Duyck        2014-12-09  469   *	@napi: napi instance this buffer was allocated for
d7499160 Masanari Iida          2015-08-24  470   *	@len: length to allocate
fd11a83d Alexander Duyck        2014-12-09  471   *	@gfp_mask: get_free_pages mask, passed to alloc_skb and alloc_pages
fd11a83d Alexander Duyck        2014-12-09  472   *
fd11a83d Alexander Duyck        2014-12-09  473   *	Allocate a new sk_buff for use in NAPI receive.  This buffer will
fd11a83d Alexander Duyck        2014-12-09  474   *	attempt to allocate the head from a special reserved region used
fd11a83d Alexander Duyck        2014-12-09  475   *	only for NAPI Rx allocation.  By doing this we can save several
fd11a83d Alexander Duyck        2014-12-09  476   *	CPU cycles by avoiding having to disable and re-enable IRQs.
fd11a83d Alexander Duyck        2014-12-09  477   *
fd11a83d Alexander Duyck        2014-12-09  478   *	%NULL is returned if there is no free memory.
fd11a83d Alexander Duyck        2014-12-09  479   */
9451980a Alexander Duyck        2015-05-06  480  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
c24f01ac Jesper Dangaard Brouer 2016-02-02  481  				 unsigned int bulk_hint, gfp_t gfp_mask)
fd11a83d Alexander Duyck        2014-12-09 @482  {
1ec46e92 Jesper Dangaard Brouer 2016-02-02  483  	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
c24f01ac Jesper Dangaard Brouer 2016-02-02  484  	unsigned int bulk_sz = min(bulk_hint, NAPI_SKB_CACHE_SIZE);
fc755a89 Jesper Dangaard Brouer 2016-02-02  485  	struct skb_shared_info *shinfo;
fd11a83d Alexander Duyck        2014-12-09  486  	struct sk_buff *skb;
9451980a Alexander Duyck        2015-05-06  487  	void *data;
fd11a83d Alexander Duyck        2014-12-09  488  
9451980a Alexander Duyck        2015-05-06  489  	len += NET_SKB_PAD + NET_IP_ALIGN;
9451980a Alexander Duyck        2015-05-06  490  

:::::: The code at line 482 was first introduced by commit
:::::: fd11a83dd3630ec6a60f8a702446532c5c7e1991 net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb

:::::: TO: Alexander Duyck <alexander.h.duyck@redhat.com>
:::::: CC: David S. Miller <davem@davemloft.net>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 6229 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context
  2016-02-02 21:12   ` [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
@ 2016-02-03  0:52     ` Alexei Starovoitov
  2016-02-03 10:38       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2016-02-03  0:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Christoph Lameter, tom, Alexander Duyck, ogerlitz, gerlitz.or

On Tue, Feb 02, 2016 at 10:12:01PM +0100, Jesper Dangaard Brouer wrote:
> Think twice before applying
>  - This patch can potentially introduce added latency in some workloads
> 
> This patch introduce bulk alloc of SKBs and allow reuse of SKBs
> free'ed in same softirq cycle.  SKBs are normally free'ed during TX
> completion, but most high speed drivers also cleanup TX ring during
> NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
> SKBs will be avail in the napi_alloc_cache->skb_cache.
> 
> If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
> the potential overshooting unused SKBs needed to free'ed when NAPI
> cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
> 
> Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
> (GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
>  Allocator SLUB:
>   Single CPU/flow numbers: before: 2064446 pps -> after: 2083031 pps
>   Improvement: +18585 pps, -4.3 nanosec, +0.9%
>  Allocator SLAB:
>   Single CPU/flow numbers: before: 2035949 pps -> after: 2033567 pps
>   Regression: -2382 pps, +0.57 nanosec, -0.1 %
> 
> Even-though benchmarking does show an improvement for SLUB(+0.9%), I'm
> not convinced bulk alloc will be a win in all situations:
>  * I see stalls on walking the SLUB freelist (normal hidden by prefetch)
>  * In case RX queue is not full, alloc and free more SKBs than needed
> 
> More testing is needed with more real life benchmarks.
> 
> Joint work with Alexander Duyck.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
...
> -	skb = __build_skb(data, len);
> -	if (unlikely(!skb)) {
> +#define BULK_ALLOC_SIZE 8
> +	if (!nc->skb_count) {
> +		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
> +						      gfp_mask, BULK_ALLOC_SIZE,
> +						      nc->skb_cache);
> +	}
> +	if (likely(nc->skb_count)) {
> +		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> +	} else {
> +		/* alloc bulk failed */
>  		skb_free_frag(data);
>  		return NULL;
>  	}
>  
> +	len -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
> +	memset(skb, 0, offsetof(struct sk_buff, tail));
> +	skb->truesize = SKB_TRUESIZE(len);
> +	atomic_set(&skb->users, 1);
> +	skb->head = data;
> +	skb->data = data;
> +	skb_reset_tail_pointer(skb);
> +	skb->end = skb->tail + len;
> +	skb->mac_header = (typeof(skb->mac_header))~0U;
> +	skb->transport_header = (typeof(skb->transport_header))~0U;
> +
> +	/* make sure we initialize shinfo sequentially */
> +	shinfo = skb_shinfo(skb);
> +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +	atomic_set(&shinfo->dataref, 1);
> +	kmemcheck_annotate_variable(shinfo->destructor_arg);

copy-pasting from __build_skb()...
Either a new helper is needed, or __build_skb() should be extended to take
a pre-allocated 'raw_skb' pointer.
This interface is questionable until patch 7 comes along to use it.
It would have helped if they were back to back.

Overall I like the first 3 patches. I think they're useful
on their own.
As far as bulk alloc... have you considered splitting
bulk alloc of skb and init of skb?
Like in the above
+	skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
will give a cold pointer and the first memset() will miss the cache.
Either a prefetch is needed, the way slab_alloc_node() does it
in the line prefetch_freepointer(s, next_object);
or bulk_alloc_skb and bulk_init_skb need to be two loops
driven by drivers.
Another idea is that we can move the skb init all the way up to
eth_type_trans(), and the driver should prefetch both the
skb->data and skb pointers.  Then an eth_type_trans_and_skb_init()
helper will read from cache and store into cache.
Rephrasing the idea:
when the drivers do napi_alloc_skb() they don't really
need an initialized 'struct sk_buff'.  They either need skb->data
to copy headers into or shinfo->frags to add a page to;
the full init can wait till eth_type_trans_and_init()
right before napi_gro_receive().
Thoughts?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context
  2016-02-03  0:52     ` Alexei Starovoitov
@ 2016-02-03 10:38       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-03 10:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, Christoph Lameter, tom, Alexander Duyck, ogerlitz,
	gerlitz.or, brouer

On Tue, 2 Feb 2016 16:52:50 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Feb 02, 2016 at 10:12:01PM +0100, Jesper Dangaard Brouer wrote:
> > Think twice before applying
> >  - This patch can potentially introduce added latency in some workloads
> > 
> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs
> > free'ed in same softirq cycle.  SKBs are normally free'ed during TX
> > completion, but most high speed drivers also cleanup TX ring during
> > NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
> > SKBs will be avail in the napi_alloc_cache->skb_cache.
> > 
> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
> > the potential overshooting unused SKBs needed to free'ed when NAPI
> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
> > 
> > Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
> > (GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
> >  Allocator SLUB:
> >   Single CPU/flow numbers: before: 2064446 pps -> after: 2083031 pps
> >   Improvement: +18585 pps, -4.3 nanosec, +0.9%
> >  Allocator SLAB:
> >   Single CPU/flow numbers: before: 2035949 pps -> after: 2033567 pps
> >   Regression: -2382 pps, +0.57 nanosec, -0.1 %
> > 
> > Even-though benchmarking does show an improvement for SLUB(+0.9%), I'm
> > not convinced bulk alloc will be a win in all situations:
> >  * I see stalls on walking the SLUB freelist (normal hidden by prefetch)
> >  * In case RX queue is not full, alloc and free more SKBs than needed
> > 
> > More testing is needed with more real life benchmarks.
> > 
> > Joint work with Alexander Duyck.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>  
> ...
> > -	skb = __build_skb(data, len);
> > -	if (unlikely(!skb)) {
> > +#define BULK_ALLOC_SIZE 8
> > +	if (!nc->skb_count) {
> > +		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
> > +						      gfp_mask, BULK_ALLOC_SIZE,
> > +						      nc->skb_cache);
> > +	}
> > +	if (likely(nc->skb_count)) {
> > +		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> > +	} else {
> > +		/* alloc bulk failed */
> >  		skb_free_frag(data);
> >  		return NULL;
> >  	}
> >  
> > +	len -= SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +
> > +	memset(skb, 0, offsetof(struct sk_buff, tail));
> > +	skb->truesize = SKB_TRUESIZE(len);
> > +	atomic_set(&skb->users, 1);
> > +	skb->head = data;
> > +	skb->data = data;
> > +	skb_reset_tail_pointer(skb);
> > +	skb->end = skb->tail + len;
> > +	skb->mac_header = (typeof(skb->mac_header))~0U;
> > +	skb->transport_header = (typeof(skb->transport_header))~0U;
> > +
> > +	/* make sure we initialize shinfo sequentially */
> > +	shinfo = skb_shinfo(skb);
> > +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > +	atomic_set(&shinfo->dataref, 1);
> > +	kmemcheck_annotate_variable(shinfo->destructor_arg);  
> 
> copy-pasting from __build_skb()...
> Either new helper is needed or extend __build_skb() to take
> pre-allocated 'raw_skb' pointer.

Guess I should have created a helper for the basic skb setup.  I just kept
it here, as I was planning to do tricks like removing the memset, as
discussed below.

> Overall I like the first 3 patches. I think they're useful
> on their won.

Great, hope they can go into net-next soon.

> As far as bulk alloc... have you considered splitting
> bulk alloc of skb and init of skb?

I have many considerations about the SKB memset, as it shows up quite high
in profiles: how we can either 1) avoid clearing this much, or 2) speed
up the clearing by clearing more (full pages in the allocator).

My idea (2) is about clearing larger memory chunks, as the
memset/rep-stos operation has a fixed startup cost, but doing it inside
the slab allocator's bulk alloc call.  During bulk alloc we can
identify objects that are "next-to-each-other" and exec one rep-stos
operation.  My measurements show we need 3x 256-byte objects cleared
together before it is a win.
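
A rough sketch of idea (2), purely illustrative and not part of any posted
patch; it assumes the object array returned by kmem_cache_alloc_bulk() and
a fixed object size, and in the SKB case only part of each object would
actually need clearing:

	/* Merge runs of physically adjacent objects so each run is cleared
	 * with a single memset (one rep-stos startup cost per run).
	 */
	static void bulk_clear_adjacent(void **objs, size_t nr, size_t size)
	{
		size_t i = 0;

		while (i < nr) {
			char *start = objs[i];
			size_t run = 1;

			/* extend the run while the next object directly follows */
			while (i + run < nr &&
			       (char *)objs[i + run] == start + run * size)
				run++;

			memset(start, 0, run * size); /* one rep-stos per run */
			i += run;
		}
	}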

*BUT* I actually like your idea better, below, of delaying the init
... more comments below.

> Like in the above
> +	skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> will give cold pointer and first memset() will be missing cache.
> Either prefetch is needed the way slab_alloc_node() is doing
> in the line prefetch_freepointer(s, next_object);

I could prefetch the next elem in skb_cache[]. I have been playing with
that.  It didn't have much effect, as the bulk alloc (at least for
SLUB) has already walked the memory.  It helped a little to prefetch
the 3rd cache-line for the memset speed... (but the effect was so small
that it was difficult to measure with enough confidence).

> or buld_alloc_skb and bulk_init_skb need to be two loops
> driven by drivers.
> Another idea is we can move skb_init all the way up till
> eth_type_trans() and the driver should prefetch both
> skb->data and skb pointers. Then eth_type_trans_and_skb_init()
> helper will read from cache and store into cache.
>
> Rephrasing the idea:
> when the drivers do napi_alloc_skb() they don't really
> need initialized 'struct sk_buff'. They either need skb->data
> to copy headers into or shinfo->frags to add a page to,
> the full init can wait till eth_type_trans_and_init()
> right before napi_gro_receive().
> Thoughts?

I actually like this idea better than my own, of delaying the memset init
of the SKB.  (I didn't understand it until after you rephrased it ;-))

The drivers also need some extra fields, like the hash and some flags.  But
we could easily reorganize sk_buff to account for this.

I also like it because, by delaying it until napi_gro_receive, we might
have a chance to avoid the clearing, e.g. if GRO can merge it, or for
other fastpath forwarding we have not invented yet.  (Thinking out
loud, maybe even RPS could delay the memset until it reaches the remote
CPU, saving many cache-line transfers between the CPUs.)

I think I like your delayed memset idea the best.  Let's see if others
shoot it down? ;-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches
  2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
                     ` (10 preceding siblings ...)
  2016-02-02 21:15   ` [net-next PATCH 11/11] RFC: net: RPS bulk enqueue to backlog Jesper Dangaard Brouer
@ 2016-02-07 19:25   ` David Miller
  2016-02-08 12:14       ` Jesper Dangaard Brouer
  11 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2016-02-07 19:25 UTC (permalink / raw)
  To: brouer
  Cc: netdev, cl, tom, alexander.duyck, alexei.starovoitov, ogerlitz,
	gerlitz.or

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Tue, 02 Feb 2016 22:11:28 +0100

> This patchset is relevant for my NetDev 1.1 "Network Performance BoF" [1].
> 
> The first 4 patches, is a repost[2], for the first real use-case of
> kmem_cache bulk alloc and free API.  They were adjusted slightly to
> accomodate my last slab API changes.  They should be ready for
> inclusion in net-next, as the needed MM tree are avail in net-next.
> 
> Patch 5 is also enabling the SKB bulk API for mlx5.
> 
> Thus, patches 1-5 should be ready for net-next.
> 
> After patch 5, the experimental patches begin, which is Prove-of-Concept
> code for what we will be discussing during the Network Performance BoF [1]
> 
> [1] http://netdevconf.org/1.1/bof-network-performance-bof-jesper-dangaard-brouer.html
> [2] http://thread.gmane.org/gmane.linux.network/384302/

Very exciting work, indeed!

Jesper, why don't you resubmit just the first three patches as a
group?  They seem a lot less controversial and as Alexei said, stand
on their own.

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath
  2016-02-07 19:25   ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches David Miller
@ 2016-02-08 12:14       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-08 12:14 UTC (permalink / raw)
  To: netdev, Jeff Kirsher
  Cc: Andrew Morton, tom, Alexander Duyck, alexei.starovoitov,
	linux-mm, Jesper Dangaard Brouer, Christoph Lameter,
	David S. Miller

This patchset is the first real use-case for kmem_cache bulk _free_.
The use of bulk _alloc_ is NOT included in this patchset. The full use
has previously been posted here [1].

The bulk free side has the largest benefit for the network stack
use-case, because the network stack is hitting the kmem_cache/SLUB
slowpath when freeing SKBs, due to the amount of outstanding SKBs.
This is solved by using the new API kmem_cache_free_bulk().

Introduce a new API, napi_consume_skb(), that hides/handles bulk freeing
for the caller.  The drivers simply need to use this call when freeing
SKBs in NAPI context, e.g. replacing their calls to dev_kfree_skb() /
dev_consume_skb_any().

Driver ixgbe is the first user of this new API.

[1] http://thread.gmane.org/gmane.linux.network/384302/focus=397373

---

Jesper Dangaard Brouer (3):
      net: bulk free infrastructure for NAPI context, use napi_consume_skb
      net: bulk free SKBs that were delay free'ed due to IRQ context
      ixgbe: bulk free SKBs during TX completion cleanup cycle


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 +-
 include/linux/skbuff.h                        |    4 +
 net/core/dev.c                                |    9 ++-
 net/core/skbuff.c                             |   87 +++++++++++++++++++++++--
 4 files changed, 96 insertions(+), 10 deletions(-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [net-next PATCH V2 1/3] net: bulk free infrastructure for NAPI context, use napi_consume_skb
  2016-02-08 12:14       ` Jesper Dangaard Brouer
  (?)
@ 2016-02-08 12:14       ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-08 12:14 UTC (permalink / raw)
  To: netdev, Jeff Kirsher
  Cc: Andrew Morton, tom, Alexander Duyck, alexei.starovoitov,
	linux-mm, Jesper Dangaard Brouer, Christoph Lameter,
	David S. Miller

Discovered that the network stack was hitting the kmem_cache/SLUB
slowpath when freeing SKBs.  Doing bulk free with kmem_cache_free_bulk()
can speed up this slowpath.

NAPI context is a bit special, let's take advantage of that for bulk
freeing SKBs.

In NAPI context we are running in softirq, which gives us certain
protection.  A softirq can run on several CPUs at once, BUT the
important part is that a softirq will never preempt another softirq
running on the same CPU.  This gives us the opportunity to access
per-cpu variables in softirq context.

Extend napi_alloc_cache (which before only contained the page_frag_cache)
to be a struct with a small array-based stack for holding SKBs.  Introduce
an SKB defer and flush API for accessing this.

Introduce napi_consume_skb() as a replacement for e.g. dev_consume_skb_any()
when running in NAPI context.  A small trick to handle/detect if we
are called from netpoll is to see if budget is 0.  In that case, we
need to invoke dev_consume_skb_irq().

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
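
A minimal driver-side sketch of the intended usage (the ring and helper
names are illustrative assumptions, not taken from ixgbe; only
napi_consume_skb() is the new API):

	/* TX completion cleanup, running from the driver's NAPI poll.
	 * Pass the NAPI budget through, so napi_consume_skb() can detect
	 * the netpoll case (budget == 0) and fall back to dev_consume_skb_irq().
	 */
	static bool mydrv_clean_tx_irq(struct mydrv_tx_ring *tx_ring, int napi_budget)
	{
		struct sk_buff *skb;

		while ((skb = mydrv_next_completed_tx_skb(tx_ring)) != NULL) { /* illustrative */
			/* was: dev_consume_skb_any(skb); */
			napi_consume_skb(skb, napi_budget);
		}
		return true;
	}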
---
 include/linux/skbuff.h |    3 ++
 net/core/dev.c         |    1 +
 net/core/skbuff.c      |   83 +++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 11f935c1a090..3c8d348223d7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2399,6 +2399,9 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 {
 	return __napi_alloc_skb(napi, length, GFP_ATOMIC);
 }
+void napi_consume_skb(struct sk_buff *skb, int budget);
+
+void __kfree_skb_flush(void);
 
 /**
  * __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index 8cba3d852f25..44384a8c9613 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5152,6 +5152,7 @@ static void net_rx_action(struct softirq_action *h)
 		}
 	}
 
+	__kfree_skb_flush();
 	local_irq_disable();
 
 	list_splice_tail_init(&sd->poll_list, &list);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2df375ec9c2..e26bb2b1dba4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -347,8 +347,16 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 }
 EXPORT_SYMBOL(build_skb);
 
+#define NAPI_SKB_CACHE_SIZE	64
+
+struct napi_alloc_cache {
+	struct page_frag_cache page;
+	size_t skb_count;
+	void *skb_cache[NAPI_SKB_CACHE_SIZE];
+};
+
 static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
-static DEFINE_PER_CPU(struct page_frag_cache, napi_alloc_cache);
+static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
 
 static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
@@ -378,9 +386,9 @@ EXPORT_SYMBOL(netdev_alloc_frag);
 
 static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
-	return __alloc_page_frag(nc, fragsz, gfp_mask);
+	return __alloc_page_frag(&nc->page, fragsz, gfp_mask);
 }
 
 void *napi_alloc_frag(unsigned int fragsz)
@@ -474,7 +482,7 @@ EXPORT_SYMBOL(__netdev_alloc_skb);
 struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 				 gfp_t gfp_mask)
 {
-	struct page_frag_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 	struct sk_buff *skb;
 	void *data;
 
@@ -494,7 +502,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (sk_memalloc_socks())
 		gfp_mask |= __GFP_MEMALLOC;
 
-	data = __alloc_page_frag(nc, len, gfp_mask);
+	data = __alloc_page_frag(&nc->page, len, gfp_mask);
 	if (unlikely(!data))
 		return NULL;
 
@@ -505,7 +513,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	}
 
 	/* use OR instead of assignment to avoid clearing of bits in mask */
-	if (nc->pfmemalloc)
+	if (nc->page.pfmemalloc)
 		skb->pfmemalloc = 1;
 	skb->head_frag = 1;
 
@@ -747,6 +755,69 @@ void consume_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(consume_skb);
 
+void __kfree_skb_flush(void)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+	/* flush skb_cache if containing objects */
+	if (nc->skb_count) {
+		kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count,
+				     nc->skb_cache);
+		nc->skb_count = 0;
+	}
+}
+
+static void __kfree_skb_defer(struct sk_buff *skb)
+{
+	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+	/* drop skb->head and call any destructors for packet */
+	skb_release_all(skb);
+
+	/* record skb to CPU local list */
+	nc->skb_cache[nc->skb_count++] = skb;
+
+#ifdef CONFIG_SLUB
+	/* SLUB writes into objects when freeing */
+	prefetchw(skb);
+#endif
+
+	/* flush skb_cache if it is filled */
+	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
+		kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
+				     nc->skb_cache);
+		nc->skb_count = 0;
+	}
+}
+
+void napi_consume_skb(struct sk_buff *skb, int budget)
+{
+	if (unlikely(!skb))
+		return;
+
+	/* if budget is 0 assume netpoll w/ IRQs disabled */
+	if (unlikely(!budget)) {
+		dev_consume_skb_irq(skb);
+		return;
+	}
+
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+	/* if reaching here SKB is ready to free */
+	trace_consume_skb(skb);
+
+	/* if SKB is a clone, don't handle this case */
+	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	__kfree_skb_defer(skb);
+}
+EXPORT_SYMBOL(napi_consume_skb);
+
 /* Make sure a field is enclosed inside headers_start/headers_end section */
 #define CHECK_SKB_FIELD(field) \
 	BUILD_BUG_ON(offsetof(struct sk_buff, field) <		\


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH V2 2/3] net: bulk free SKBs that were delay free'ed due to IRQ context
  2016-02-08 12:14       ` Jesper Dangaard Brouer
  (?)
  (?)
@ 2016-02-08 12:15       ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-08 12:15 UTC (permalink / raw)
  To: netdev, Jeff Kirsher
  Cc: Andrew Morton, tom, Alexander Duyck, alexei.starovoitov,
	linux-mm, Jesper Dangaard Brouer, Christoph Lameter,
	David S. Miller

The network stack defers freeing SKBs in case the free happens in IRQ
context or while IRQs are disabled.  This happens in
__dev_kfree_skb_irq(), which writes SKBs that were free'ed during IRQ
to the softirq completion queue (softnet_data.completion_queue).

These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
in function net_tx_action().  Take advantage of this and use the SKB
defer and flush API, as we are already in softirq context.

For modern drivers this rarely happens, although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed.  This is because netpoll can call
into the driver from IRQ context.
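
For context, the path that lands SKBs on the completion queue looks
roughly like the sketch below (a paraphrase, not part of this patch;
treat the helper name as illustrative):

	/* sketch: how a free from IRQ context gets deferred to
	 * softnet_data.completion_queue, later drained by net_tx_action()
	 */
	static inline void kfree_skb_any_sketch(struct sk_buff *skb)
	{
		if (in_irq() || irqs_disabled())
			__dev_kfree_skb_irq(skb, SKB_REASON_CONSUMED);
		else
			dev_kfree_skb(skb);
	}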

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    1 +
 net/core/dev.c         |    8 +++++++-
 net/core/skbuff.c      |    8 ++++++--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3c8d348223d7..b06ba2e07c89 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2402,6 +2402,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
 void napi_consume_skb(struct sk_buff *skb, int budget);
 
 void __kfree_skb_flush(void);
+void __kfree_skb_defer(struct sk_buff *skb);
 
 /**
  * __dev_alloc_pages - allocate page for network Rx
diff --git a/net/core/dev.c b/net/core/dev.c
index 44384a8c9613..b185d7eaa2e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3829,8 +3829,14 @@ static void net_tx_action(struct softirq_action *h)
 				trace_consume_skb(skb);
 			else
 				trace_kfree_skb(skb, net_tx_action);
-			__kfree_skb(skb);
+
+			if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
+				__kfree_skb(skb);
+			else
+				__kfree_skb_defer(skb);
 		}
+
+		__kfree_skb_flush();
 	}
 
 	if (sd->output_queue) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e26bb2b1dba4..d278e51789e9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -767,7 +767,7 @@ void __kfree_skb_flush(void)
 	}
 }
 
-static void __kfree_skb_defer(struct sk_buff *skb)
+static inline void _kfree_skb_defer(struct sk_buff *skb)
 {
 	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
 
@@ -789,6 +789,10 @@ static void __kfree_skb_defer(struct sk_buff *skb)
 		nc->skb_count = 0;
 	}
 }
+void __kfree_skb_defer(struct sk_buff *skb)
+{
+	_kfree_skb_defer(skb);
+}
 
 void napi_consume_skb(struct sk_buff *skb, int budget)
 {
@@ -814,7 +818,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
 		return;
 	}
 
-	__kfree_skb_defer(skb);
+	_kfree_skb_defer(skb);
 }
 EXPORT_SYMBOL(napi_consume_skb);
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [net-next PATCH V2 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle
  2016-02-08 12:14       ` Jesper Dangaard Brouer
                         ` (2 preceding siblings ...)
  (?)
@ 2016-02-08 12:15       ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-08 12:15 UTC (permalink / raw)
  To: netdev, Jeff Kirsher
  Cc: Andrew Morton, tom, Alexander Duyck, alexei.starovoitov,
	linux-mm, Jesper Dangaard Brouer, Christoph Lameter,
	David S. Miller

There is an opportunity to bulk free SKBs while reclaiming resources
after DMA transmit completes in ixgbe_clean_tx_irq().  As the SKBs are
already being released at this point, bulk freeing does not introduce
any added latency.

Simply use napi_consume_skb(), which was recently introduced.  The
napi_budget parameter is needed by napi_consume_skb() to detect whether
it is called from netpoll.

Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
 Single CPU/flow numbers: before: 1982144 pps ->  after : 2064446 pps
 Improvement: +82302 pps, -20 nanosec, +4.1%
 (SLUB and GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))

Joint work with Alexander Duyck.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c4003a88bbf6..0c701b8438b6 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1089,7 +1089,7 @@ static void ixgbe_tx_timeout_reset(struct ixgbe_adapter *adapter)
  * @tx_ring: tx ring to clean
  **/
 static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
-			       struct ixgbe_ring *tx_ring)
+			       struct ixgbe_ring *tx_ring, int napi_budget)
 {
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	struct ixgbe_tx_buffer *tx_buffer;
@@ -1127,7 +1127,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		total_packets += tx_buffer->gso_segs;
 
 		/* free the skb */
-		dev_consume_skb_any(tx_buffer->skb);
+		napi_consume_skb(tx_buffer->skb, napi_budget);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -2784,7 +2784,7 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
 #endif
 
 	ixgbe_for_each_ring(ring, q_vector->tx)
-		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring);
+		clean_complete &= !!ixgbe_clean_tx_irq(q_vector, ring, budget);
 
 	/* Exit if we are called by netpoll or busy polling is active */
 	if ((budget <= 0) || !ixgbe_qv_lock_napi(q_vector))


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack
  2016-02-02 21:13   ` [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack Jesper Dangaard Brouer
@ 2016-02-09 11:57     ` Saeed Mahameed
  2016-02-10 20:26       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Saeed Mahameed @ 2016-02-09 11:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Christoph Lameter, tom, Alexander Duyck,
	alexei.starovoitov, Or Gerlitz, Or Gerlitz, Eran Ben Elisha,
	Rana Shahout

On Tue, Feb 2, 2016 at 11:13 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> There are several techniques/concepts combined in this optimization.
> It is both a data-cache and instruction-cache optimization.
>
> First of all, this is primarily about delaying touching
> packet-data, which happend in eth_type_trans, until the prefetch
> have had time to fetch.  Thus, hopefully avoiding a cache-miss on
> packet data.
>
> Secondly, the instruction-cache optimization is about, not
> calling the network stack for every packet, which is pulled out
> of the RX ring.  Calling the full stack likely removes/flushes
> the instruction cache every time.
>
> Thus, have two loops, one loop pulling out packet from the RX
> ring and starting the prefetching, and the second loop calling
> eth_type_trans() and invoking the stack via napi_gro_receive().
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
>
> Notes:
> This is the patch that gave a speed up of 6.2Mpps to 12Mpps, when
> trying to measure lowest RX level, by dropping the packets in the
> driver itself (marked drop point as comment).
Indeed it looks very promising with respect to the instruction-cache
optimization, but I have some doubts regarding the data-cache
optimizations (prefetch); please see my questions below.

We will take this patch and test it in house.

>
> For now, the ring is emptied upto the budget.  I don't know if it
> would be better to chunk it up more?
Not sure; according to netdevice.h:

/* Default NAPI poll() weight
 * Device drivers are strongly advised to not use bigger value
 */
#define NAPI_POLL_WEIGHT 64

We will also compare different budget values with your approach, but I
doubt it will be accepted to increase NAPI_POLL_WEIGHT for the mlx5
drivers.  Furthermore, increasing the NAPI poll budget might cause
cache overflow with this approach, since you are batching up all the
"prefetch(skb->data)" calls (I haven't done the math yet regarding
cache utilization with this approach).

>         mlx5e_handle_csum(netdev, cqe, rq, skb);
>
> -       skb->protocol = eth_type_trans(skb, netdev);
> -
mlx5e_handle_csum also accesses skb->data in the is_first_ethertype_ip
function, but I think it is not interesting since this is not the
common case.  E.g. for the non-common case of L4 traffic with no HW
checksum offload, you won't benefit from this optimization, since we
access skb->data to find the L3 header type.  This can be fixed in the
driver code by checking the CQE meta data for these fields instead of
accessing skb->data, but I will need to look further into that.

> @@ -252,7 +257,6 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>                 wqe_counter    = be16_to_cpu(wqe_counter_be);
>                 wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
>                 skb            = rq->skb[wqe_counter];
> -               prefetch(skb->data);
>                 rq->skb[wqe_counter] = NULL;
>
>                 dma_unmap_single(rq->pdev,
> @@ -265,16 +269,27 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>                         dev_kfree_skb(skb);
>                         goto wq_ll_pop;
>                 }
> +               prefetch(skb->data);
Is this optimal for all CPU archs?  Is it OK to use up to 64 cache
lines at once?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack
  2016-02-09 11:57     ` Saeed Mahameed
@ 2016-02-10 20:26       ` Jesper Dangaard Brouer
  2016-02-16  0:01         ` Saeed Mahameed
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2016-02-10 20:26 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: netdev, Christoph Lameter, tom, Alexander Duyck,
	alexei.starovoitov, Or Gerlitz, Or Gerlitz, Eran Ben Elisha,
	Rana Shahout, brouer

On Tue, 9 Feb 2016 13:57:41 +0200
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> On Tue, Feb 2, 2016 at 11:13 PM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > There are several techniques/concepts combined in this optimization.
> > It is both a data-cache and instruction-cache optimization.
> >
> > First of all, this is primarily about delaying touching
> > packet-data, which happend in eth_type_trans, until the prefetch
> > have had time to fetch.  Thus, hopefully avoiding a cache-miss on
> > packet data.
> >
> > Secondly, the instruction-cache optimization is about, not
> > calling the network stack for every packet, which is pulled out
> > of the RX ring.  Calling the full stack likely removes/flushes
> > the instruction cache every time.
> >
> > Thus, have two loops, one loop pulling out packet from the RX
> > ring and starting the prefetching, and the second loop calling
> > eth_type_trans() and invoking the stack via napi_gro_receive().
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >
> >
> > Notes:
> > This is the patch that gave a speed up of 6.2Mpps to 12Mpps, when
> > trying to measure lowest RX level, by dropping the packets in the
> > driver itself (marked drop point as comment).
>  
> Indeed looks very promising in respect of instruction-cache
> optimization, but i have some doubts regarding the data-cache
> optimizations (prefetch), please see my below questions.
> 
> We will take this patch and test it in house.
> 
> >
> > For now, the ring is emptied upto the budget.  I don't know if it
> > would be better to chunk it up more?  
>
> Not sure, according to netdevice.h :
> 
> /* Default NAPI poll() weight
>  * Device drivers are strongly advised to not use bigger value
>  */
> #define NAPI_POLL_WEIGHT 64
> 
> we will also compare different budget values with your approach, but I
> doubt it will be accepted to increase the NAPI_POLL_WEIGHT for mlx5
> drivers. furthermore increasing NAPI poll budget might cause cache overflow
> with this approach since you are chunking up all "prefetch(skb->data)"
> (I didn't do the math yet in regards of cache utilization with this
> approach).

You misunderstood me... I don't want to increase NAPI_POLL_WEIGHT.
I want to keep the 64, but sort of split it up, and e.g. call the stack
for every 16 packets, due to cache-size limits.

One approach could be to compare the HW skb->hash to the previous
packet's hash, and exit the loop if they don't match (and call the
netstack with this bundle).
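
Something along these lines -- only a rough, untested sketch, where
pull_rx_skb() and bundle_to_stack() are placeholders for the driver's
RX-ring dequeue and for the eth_type_trans() + napi_gro_receive() loop:

#define RX_BUNDLE_SIZE 16	/* inner bundle, well below the NAPI budget of 64 */

static int rx_poll_bundled(struct napi_struct *napi, struct my_rq *rq,
			   int budget)
{
	struct sk_buff *bundle[RX_BUNDLE_SIZE];
	struct sk_buff *skb;
	u32 prev_hash = 0;
	int work = 0, n = 0;

	while (work < budget && (skb = pull_rx_skb(rq))) {
		prefetch(skb->data);	/* start fetching packet data early */

		/* flush bundle when full, or when the HW hash says new flow */
		if (n == RX_BUNDLE_SIZE || (n && skb->hash != prev_hash)) {
			bundle_to_stack(napi, bundle, n);
			n = 0;
		}
		prev_hash = skb->hash;
		bundle[n++] = skb;
		work++;
	}
	if (n)
		bundle_to_stack(napi, bundle, n);

	return work;
}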


> >         mlx5e_handle_csum(netdev, cqe, rq, skb);
> >
> > -       skb->protocol = eth_type_trans(skb, netdev);
> > -  
>
> mlx5e_handle_csum also access the skb->data in is_first_ethertype_ip
> function, but i think it is not interesting since this is not the
> common case,
> e.g: for the none common case of L4 traffic with no HW checksum
> offload you won't benefit from this optimization since we access the
> skb->data to know the L3 header type, and this can be fixed in driver
> code to check the CQE meta data for these fields instead of accessing
> the skb->data, but I will need to look further into that.

Okay, understood.  We should look into this too, but not as a top
priority.  We can simply move mlx5e_handle_csum(), like eth_type_trans().


> > @@ -252,7 +257,6 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
> >                 wqe_counter    = be16_to_cpu(wqe_counter_be);
> >                 wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
> >                 skb            = rq->skb[wqe_counter];
> > -               prefetch(skb->data);
> >                 rq->skb[wqe_counter] = NULL;
> >
> >                 dma_unmap_single(rq->pdev,
> > @@ -265,16 +269,27 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
> >                         dev_kfree_skb(skb);
> >                         goto wq_ll_pop;
> >                 }
> > +               prefetch(skb->data);  
>
> is this optimal for all CPU archs ?

For some CPU archs the prefetch is removed at compile time.

> is it ok to use up to 64 cache lines at once ?

That is not the problem: 64 cache lines * 64 bytes = 4096 bytes.
The BIOS/HW prefetcher sometimes also pulls in the next cache line => 8192 bytes.
The problem is also the SKB clearing of 4 cache lines per SKB = 16384 bytes.

We should of course keep this CPU independent, but for Intel SandyBridge
CPUs the optimal prefetch loop size is likely 10.

Quote from Intel's optimization manual:
 The L1 DCache can handle multiple outstanding cache misses and continue
 to service incoming stores and loads. Up to 10 requests of missing
 cache lines can be managed simultaneously using the LFB (Line Fill Buffers).
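
So the second loop could e.g. keep only a small prefetch window in
flight -- rough sketch, assuming skbs[]/n is the bundle gathered by the
first loop and PREFETCH_WINDOW is sized after the LFB limit above:

#define PREFETCH_WINDOW 10	/* roughly the number of line fill buffers */

/* prime the window */
for (i = 0; i < n && i < PREFETCH_WINDOW; i++)
	prefetch(skbs[i]->data);

/* keep ~10 misses outstanding while handing packets to the stack */
for (i = 0; i < n; i++) {
	if (i + PREFETCH_WINDOW < n)
		prefetch(skbs[i + PREFETCH_WINDOW]->data);
	skbs[i]->protocol = eth_type_trans(skbs[i], netdev);
	napi_gro_receive(napi, skbs[i]);
}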

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath
  2016-02-08 12:14       ` Jesper Dangaard Brouer
                         ` (3 preceding siblings ...)
  (?)
@ 2016-02-11 16:59       ` David Miller
  -1 siblings, 0 replies; 35+ messages in thread
From: David Miller @ 2016-02-11 16:59 UTC (permalink / raw)
  To: brouer
  Cc: netdev, jeffrey.t.kirsher, akpm, tom, alexander.duyck,
	alexei.starovoitov, linux-mm, cl

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 08 Feb 2016 13:14:54 +0100

> This patchset is the first real use-case for kmem_cache bulk _free_.
> The use of bulk _alloc_ is NOT included in this patchset. The full use
> have previously been posted here [1].
> 
> The bulk free side have the largest benefit for the network stack
> use-case, because network stack is hitting the kmem_cache/SLUB
> slowpath when freeing SKBs, due to the amount of outstanding SKBs.
> This is solved by using the new API kmem_cache_free_bulk().
> 
> Introduce new API napi_consume_skb(), that hides/handles bulk freeing
> for the caller.  The drivers simply need to use this call when freeing
> SKBs in NAPI context, e.g. replacing their calles to dev_kfree_skb() /
> dev_consume_skb_any().
> 
> Driver ixgbe is the first user of this new API.
> 
> [1] http://thread.gmane.org/gmane.linux.network/384302/focus=397373

Series applied, thanks.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath
  2016-02-08 12:14       ` Jesper Dangaard Brouer
                         ` (4 preceding siblings ...)
  (?)
@ 2016-02-13 11:12       ` Tilman Schmidt
  -1 siblings, 0 replies; 35+ messages in thread
From: Tilman Schmidt @ 2016-02-13 11:12 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Jeff Kirsher, Andrew Morton, tom, Alexander Duyck,
	alexei.starovoitov, linux-mm, Christoph Lameter, David S. Miller


Hi Jesper,

On 08.02.2016 13:14, Jesper Dangaard Brouer wrote:
> Introduce new API napi_consume_skb(), that hides/handles bulk freeing
> for the caller.  The drivers simply need to use this call when freeing
> SKBs in NAPI context, e.g. replacing their calles to dev_kfree_skb() /
> dev_consume_skb_any().

Would you mind adding a kerneldoc comment for the new API function?

Thanks,
Tilman

-- 
Tilman Schmidt                              E-Mail: tilman@imap.cc
Bonn, Germany
We have flowers and candles to protect us.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack
  2016-02-10 20:26       ` Jesper Dangaard Brouer
@ 2016-02-16  0:01         ` Saeed Mahameed
  0 siblings, 0 replies; 35+ messages in thread
From: Saeed Mahameed @ 2016-02-16  0:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, Christoph Lameter, Tom Herbert, Alexander Duyck,
	Alexei Starovoitov, Or Gerlitz, Or Gerlitz, Eran Ben Elisha,
	Rana Shahout

On Wed, Feb 10, 2016 at 10:26 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Tue, 9 Feb 2016 13:57:41 +0200
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>
>> On Tue, Feb 2, 2016 at 11:13 PM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > There are several techniques/concepts combined in this optimization.
>> > It is both a data-cache and instruction-cache optimization.
>> >
>> > First of all, this is primarily about delaying touching
>> > packet-data, which happend in eth_type_trans, until the prefetch
>> > have had time to fetch.  Thus, hopefully avoiding a cache-miss on
>> > packet data.
>> >
>> > Secondly, the instruction-cache optimization is about, not
>> > calling the network stack for every packet, which is pulled out
>> > of the RX ring.  Calling the full stack likely removes/flushes
>> > the instruction cache every time.
>> >
>> > Thus, have two loops, one loop pulling out packet from the RX
>> > ring and starting the prefetching, and the second loop calling
>> > eth_type_trans() and invoking the stack via napi_gro_receive().
>> >
>> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> >
>> >
>> > Notes:
>> > This is the patch that gave a speed up of 6.2Mpps to 12Mpps, when
>> > trying to measure lowest RX level, by dropping the packets in the
>> > driver itself (marked drop point as comment).
>>
>> Indeed looks very promising in respect of instruction-cache
>> optimization, but i have some doubts regarding the data-cache
>> optimizations (prefetch), please see my below questions.
>>
>> We will take this patch and test it in house.
>>
>> >
>> > For now, the ring is emptied upto the budget.  I don't know if it
>> > would be better to chunk it up more?
>>
>> Not sure, according to netdevice.h :
>>
>> /* Default NAPI poll() weight
>>  * Device drivers are strongly advised to not use bigger value
>>  */
>> #define NAPI_POLL_WEIGHT 64
>>
>> we will also compare different budget values with your approach, but I
>> doubt it will be accepted to increase the NAPI_POLL_WEIGHT for mlx5
>> drivers. furthermore increasing NAPI poll budget might cause cache overflow
>> with this approach since you are chunking up all "prefetch(skb->data)"
>> (I didn't do the math yet in regards of cache utilization with this
>> approach).
>
> You misunderstood me... I don't want to increase the NAPI_POLL_WEIGHT.
> I want to keep the 64, but sort of split it up, and e.g. call the stack
> for each 16 packets. Due to cache-size limits...
I see.

>
> One approach could be to compare the HW skb->hash to prev packet, and
> exit loop if they don't match (and call netstack with this bundle).
Sorry, I am failing to see how this could help; either way you need an
inner budget of 16, as you said before.
>
>
>> >         mlx5e_handle_csum(netdev, cqe, rq, skb);
>> >
>> > -       skb->protocol = eth_type_trans(skb, netdev);
>> > -
>>
>> mlx5e_handle_csum also access the skb->data in is_first_ethertype_ip
>> function, but i think it is not interesting since this is not the
>> common case,
>> e.g: for the none common case of L4 traffic with no HW checksum
>> offload you won't benefit from this optimization since we access the
>> skb->data to know the L3 header type, and this can be fixed in driver
>> code to check the CQE meta data for these fields instead of accessing
>> the skb->data, but I will need to look further into that.
>
> Okay, understood.  We should look into this too, but not as top priority.
> We can simply move mlx5e_handle_csum() like eth_type_trans().
No, it is not that simple.  mlx5e_handle_csum needs the cqe from the
first loop, and referencing back to the cqe in the second loop might
introduce new cache misses, as the cqe is already "cold".  What I like
about your approach is that you separated two different flows
(read from device & create SKB bundle --> pass bundle to netstack);
now we don't want the "pass bundle to netstack" flow to look back at
the device's structures (cqes/wqes etc.).

Again, this is not the main issue for now, as it is not the common case,
but we are preparing a patch that fixes mlx5e_handle_csum to not look
at skb->data at all; we will share it once it is ready.
>
>
>> > @@ -252,7 +257,6 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>> >                 wqe_counter    = be16_to_cpu(wqe_counter_be);
>> >                 wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
>> >                 skb            = rq->skb[wqe_counter];
>> > -               prefetch(skb->data);
>> >                 rq->skb[wqe_counter] = NULL;
>> >
>> >                 dma_unmap_single(rq->pdev,
>> > @@ -265,16 +269,27 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>> >                         dev_kfree_skb(skb);
>> >                         goto wq_ll_pop;
>> >                 }
>> > +               prefetch(skb->data);
>>
>> is this optimal for all CPU archs ?
>
> For some CPU ARCHs the prefetch is compile time removed.
>
>> is it ok to use up to 64 cache lines at once ?
>
> That is not the problem, using 64 cache-lines * 64 = 4096 bytes.
> The BIOS/HW sometime also take next cache line => 8092 bytes.
> The problem is also SKB 4x cache-lines clearing = 16384 bytes.
>
> We should of-cause keep this CPU independent, but for Intel SandyBridge
> CPUs the optimal prefetch loop size is likely 10.
>
> Quote from Intels optimization manual:
>  The L1 DCache can handle multiple outstanding cache misses and continue
>  to service incoming stores and loads. Up to 10 requests of missing
>  cache lines can be managed simultaneously using the LFB (Line File Buffers).
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2016-02-16  0:01 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-23 12:46 [PATCH 0/4] net: mitigating kmem_cache slowpath for network stack in NAPI context Jesper Dangaard Brouer
2015-10-23 12:46 ` Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 1/4] net: bulk free infrastructure for NAPI context, use napi_consume_skb Jesper Dangaard Brouer
2015-10-23 12:46   ` Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 2/4] net: bulk free SKBs that were delay free'ed due to IRQ context Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 3/4] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2015-10-23 12:46   ` Jesper Dangaard Brouer
2015-10-23 12:46 ` [PATCH 4/4] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
2015-10-27  1:09 ` [PATCH 0/4] net: mitigating kmem_cache slowpath for network stack " David Miller
2016-02-02 21:11 ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches Jesper Dangaard Brouer
2016-02-02 21:11   ` [net-next PATCH 01/11] net: bulk free infrastructure for NAPI context, use napi_consume_skb Jesper Dangaard Brouer
2016-02-02 21:11   ` [net-next PATCH 02/11] net: bulk free SKBs that were delay free'ed due to IRQ context Jesper Dangaard Brouer
2016-02-02 21:11   ` [net-next PATCH 03/11] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2016-02-02 21:12   ` [net-next PATCH 04/11] net: bulk alloc and reuse of SKBs in NAPI context Jesper Dangaard Brouer
2016-02-03  0:52     ` Alexei Starovoitov
2016-02-03 10:38       ` Jesper Dangaard Brouer
2016-02-02 21:12   ` [net-next PATCH 05/11] mlx5: use napi_*_skb APIs to get bulk alloc and free Jesper Dangaard Brouer
2016-02-02 21:13   ` [net-next PATCH 06/11] RFC: mlx5: RX bulking or bundling of packets before calling network stack Jesper Dangaard Brouer
2016-02-09 11:57     ` Saeed Mahameed
2016-02-10 20:26       ` Jesper Dangaard Brouer
2016-02-16  0:01         ` Saeed Mahameed
2016-02-02 21:13   ` [net-next PATCH 07/11] net: introduce napi_alloc_skb_hint() for more use-cases Jesper Dangaard Brouer
2016-02-02 22:29     ` kbuild test robot
2016-02-02 21:14   ` [net-next PATCH 08/11] mlx5: hint the NAPI alloc skb API about the expected bulk size Jesper Dangaard Brouer
2016-02-02 21:14   ` [net-next PATCH 09/11] RFC: dummy: bulk free SKBs Jesper Dangaard Brouer
2016-02-02 21:15   ` [net-next PATCH 10/11] RFC: net: API for RX handover of multiple SKBs to stack Jesper Dangaard Brouer
2016-02-02 21:15   ` [net-next PATCH 11/11] RFC: net: RPS bulk enqueue to backlog Jesper Dangaard Brouer
2016-02-07 19:25   ` [net-next PATCH 00/11] net: mitigating kmem_cache slowpath and BoF discussion patches David Miller
2016-02-08 12:14     ` [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath Jesper Dangaard Brouer
2016-02-08 12:14       ` Jesper Dangaard Brouer
2016-02-08 12:14       ` [net-next PATCH V2 1/3] net: bulk free infrastructure for NAPI context, use napi_consume_skb Jesper Dangaard Brouer
2016-02-08 12:15       ` [net-next PATCH V2 2/3] net: bulk free SKBs that were delay free'ed due to IRQ context Jesper Dangaard Brouer
2016-02-08 12:15       ` [net-next PATCH V2 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2016-02-11 16:59       ` [net-next PATCH V2 0/3] net: mitigating kmem_cache free slowpath David Miller
2016-02-13 11:12       ` Tilman Schmidt
