netdev.vger.kernel.org archive mirror
* [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
       [not found] <20150824005727.2947.36065.stgit@localhost>
@ 2015-09-04 17:00 ` Jesper Dangaard Brouer
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
                     ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:00 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

During TX DMA completion cleanup there exists an opportunity in the NIC
drivers to perform bulk free, without introducing additional latency.

For an IPv4 forwarding workload the network stack is hitting the
slowpath of the kmem_cache "slub" allocator.  This slowpath can be
mitigated by bulk free via the detached freelists patchset.

Depends on patchset:
 http://thread.gmane.org/gmane.linux.kernel.mm/137469

Kernel based on MMOTM tag 2015-08-24-16-12 from git repo:
 git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
 Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation"


Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen):
 * Before: 2043575 pps
 * After : 2090522 pps
 * Improvements: +46947 pps and -10.99 ns
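
(For reference, the -10.99 ns is simply the difference in per-packet cost
 implied by the packet rates:
   1/2043575 * 10^9 = 489.34 ns/pkt (before)
   1/2090522 * 10^9 = 478.35 ns/pkt (after)
   489.34 - 478.35  =  10.99 ns saved per packet)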

In the before case, perf report shows slub free hits the slowpath:
 1.98%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
 1.29%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
 0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_free
 0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_alloc
 0.20%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
 0.17%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
 0.09%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69

After the slowpath calls are almost gone:
 0.22%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
 0.18%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
 0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
 0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
 0.08%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69


Extra info, tuning SLUB per CPU structures gives further improvements:
 * slub-tuned: 2124217 pps
 * patched increase: +33695 pps and  -7.59 ns
 * before  increase: +80642 pps and -18.58 ns

Tuning done:
 echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial
 echo 9   > /sys/kernel/slab/skbuff_head_cache/min_partial

Without SLUB tuning, same performance comes with kernel cmdline "slab_nomerge":
 * slab_nomerge: 2121824 pps

Test notes:
 * Notice very fast CPU: Intel i7-4790K @ 4.00GHz
 * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
 * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP
 * Generator pktgen UDP single flow (pktgen_sample03_burst_single_flow.sh)
 * Tuned for forwarding:
  - unloaded netfilter modules
  - Sysctl settings:
  - net/ipv4/conf/default/rp_filter = 0
  - net/ipv4/conf/all/rp_filter = 0
  - (Forwarding performance is affected by early demux)
  - net/ipv4/ip_early_demux = 0
  - net.ipv4.ip_forward = 1
  - Disabled GRO on NICs
  - ethtool -K ixgbe3 gro off tso off gso off

---

Jesper Dangaard Brouer (3):
      net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
      net: NIC helper API for building array of skbs to free
      ixgbe: bulk free SKBs during TX completion cleanup cycle


 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   13 +++-
 include/linux/netdevice.h                     |   62 ++++++++++++++++++
 include/linux/skbuff.h                        |    1 
 net/core/skbuff.c                             |   87 ++++++++++++++++++++-----
 4 files changed, 144 insertions(+), 19 deletions(-)

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
@ 2015-09-04 17:00   ` Jesper Dangaard Brouer
  2015-09-04 18:47     ` Tom Herbert
  2015-09-08 21:01     ` David Miller
  2015-09-04 17:01   ` [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free Jesper Dangaard Brouer
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:00 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

Introduce the first user of the SLAB bulk free API kmem_cache_free_bulk()
in the network stack, in the form of the function kfree_skb_bulk(), which
bulk frees SKBs (not skb clones or skb->head, yet).

As this is the third user of SKB reference decrementing, split out the
refcnt decrement into a helper function and use it at all call points.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/skbuff.h |    1 +
 net/core/skbuff.c      |   87 +++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 71 insertions(+), 17 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b97597970ce7..e5f1e007723b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -762,6 +762,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 }
 
 void kfree_skb(struct sk_buff *skb);
+void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size);
 void kfree_skb_list(struct sk_buff *segs);
 void skb_tx_error(struct sk_buff *skb);
 void consume_skb(struct sk_buff *skb);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 429b407b4fe6..034545934158 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -661,26 +661,83 @@ void __kfree_skb(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(__kfree_skb);
 
+/*
+ *	skb_dec_and_test - Helper to drop ref to SKB and check if it is ready to free
+ *	@skb: buffer to decrement reference
+ *
+ *	Drop a reference to the buffer, and return true if it is ready
+ *	to free, which is when the usage count has hit zero or is equal to 1.
+ *
+ *	This is performance critical code that should be inlined.
+ */
+static inline bool skb_dec_and_test(struct sk_buff *skb)
+{
+	if (unlikely(!skb))
+		return false;
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return false;
+	/* If reaching here SKB is ready to free */
+	return true;
+}
+
 /**
  *	kfree_skb - free an sk_buff
  *	@skb: buffer to free
  *
  *	Drop a reference to the buffer and free it if the usage count has
- *	hit zero.
+ *	hit zero or is equal to 1.
  */
 void kfree_skb(struct sk_buff *skb)
 {
-	if (unlikely(!skb))
-		return;
-	if (likely(atomic_read(&skb->users) == 1))
-		smp_rmb();
-	else if (likely(!atomic_dec_and_test(&skb->users)))
-		return;
-	trace_kfree_skb(skb, __builtin_return_address(0));
-	__kfree_skb(skb);
+	if (skb_dec_and_test(skb)) {
+		trace_kfree_skb(skb, __builtin_return_address(0));
+		__kfree_skb(skb);
+	}
 }
 EXPORT_SYMBOL(kfree_skb);
 
+/**
+ *	kfree_skb_bulk - bulk free SKBs when refcnt allows to
+ *	@skbs: array of SKBs to free
+ *	@size: number of SKBs in array
+ *
+ *	If SKB refcnt allows for free, then release any auxiliary data
+ *	and then bulk free SKBs to the SLAB allocator.
+ *
+ *	Note that interrupts must be enabled when calling this function.
+ */
+void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
+{
+	int i;
+	size_t cnt = 0;
+
+	for (i = 0; i < size; i++) {
+		struct sk_buff *skb = skbs[i];
+
+		if (!skb_dec_and_test(skb))
+			continue; /* skip skb, not ready to free */
+
+		/* Construct an array of SKBs, ready to be free'ed and
+		 * cleanup all auxiliary, before bulk free to SLAB.
+		 * For now, only handle non-cloned SKBs, related to
+		 * SLAB skbuff_head_cache
+		 */
+		if (skb->fclone == SKB_FCLONE_UNAVAILABLE) {
+			skb_release_all(skb);
+			skbs[cnt++] = skb;
+		} else {
+			/* SKB was a clone, don't handle this case */
+			__kfree_skb(skb);
+		}
+	}
+	if (likely(cnt)) {
+		kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
+	}
+}
+EXPORT_SYMBOL(kfree_skb_bulk);
+
 void kfree_skb_list(struct sk_buff *segs)
 {
 	while (segs) {
@@ -722,14 +779,10 @@ EXPORT_SYMBOL(skb_tx_error);
  */
 void consume_skb(struct sk_buff *skb)
 {
-	if (unlikely(!skb))
-		return;
-	if (likely(atomic_read(&skb->users) == 1))
-		smp_rmb();
-	else if (likely(!atomic_dec_and_test(&skb->users)))
-		return;
-	trace_consume_skb(skb);
-	__kfree_skb(skb);
+	if (skb_dec_and_test(skb)) {
+		trace_consume_skb(skb);
+		__kfree_skb(skb);
+	}
 }
 EXPORT_SYMBOL(consume_skb);
 

--

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free
  2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
@ 2015-09-04 17:01   ` Jesper Dangaard Brouer
  2015-09-04 17:01   ` [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:01 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

The NIC device drivers are expected to use this small helper API when
building up an array of objects/skbs to bulk free, while (loop)
processing the objects to free.  Objects to be freed later are added
(dev_free_waitlist_add) to an array, which is flushed if the array runs
full.  After processing, the array is flushed (dev_free_waitlist_flush).
The array should be stored on the local stack.

Usage: e.g. during the TX completion loop, the NIC driver can replace
dev_consume_skb_any() with an "add", followed by a "flush" after the loop.

For performance reasons the compiler should inline most of these
functions.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/netdevice.h |   62 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 05b9a694e213..d0133e778314 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2935,6 +2935,68 @@ static inline void dev_consume_skb_any(struct sk_buff *skb)
 	__dev_kfree_skb_any(skb, SKB_REASON_CONSUMED);
 }
 
+/* The NIC device drivers are expected to use this small helper API,
+ * when building up an array of objects/skbs to bulk free, while
+ * (loop) processing objects to free.  Objects to be freed later are
+ * added (dev_free_waitlist_add) to an array, which is flushed if the array
+ * runs full.  After processing, the array is flushed (dev_free_waitlist_flush).
+ * The array should be stored on the local stack.
+ *
+ * Usage e.g. during TX completion loop the NIC driver can replace
+ * dev_consume_skb_any() with an "add" and after the loop a "flush".
+ *
+ * For performance reasons the compiler should inline most of these
+ * functions.
+ */
+struct dev_free_waitlist {
+	struct sk_buff **skbs;
+	unsigned int skb_cnt;
+};
+
+static void __dev_free_waitlist_bulkfree(struct dev_free_waitlist *wl)
+{
+	/* Cannot bulk free from interrupt context or with IRQs
+	 * disabled, due to how the SLAB bulk API works (and gains its
+	 * speedup).  This can e.g. happen due to invocation from
+	 * netconsole/netpoll.
+	 */
+	if (unlikely(in_irq() || irqs_disabled())) {
+		int i;
+
+		for (i = 0; i < wl->skb_cnt; i++)
+			dev_consume_skb_irq(wl->skbs[i]);
+	} else {
+		/* Likely fastpath, don't call with cnt == 0 */
+		kfree_skb_bulk(wl->skbs, wl->skb_cnt);
+	}
+}
+
+static inline void dev_free_waitlist_flush(struct dev_free_waitlist *wl)
+{
+	/* Flush the waitlist, but only if any objects remain, as bulk
+	 * freeing "zero" objects is not supported, and it avoids
+	 * pointless function calls.
+	 */
+	if (likely(wl->skb_cnt))
+		__dev_free_waitlist_bulkfree(wl);
+}
+
+static __always_inline void dev_free_waitlist_add(struct dev_free_waitlist *wl,
+						  struct sk_buff *skb,
+						  unsigned int max)
+{
+	/* It is recommended that max is a builtin constant, as this
+	 * saves one register when inlined. Catch offenders with:
+	 * BUILD_BUG_ON(!__builtin_constant_p(max));
+	 */
+	wl->skbs[wl->skb_cnt++] = skb;
+	if (wl->skb_cnt == max) {
+		/* Detect when waitlist array is full, then flush and reset */
+		__dev_free_waitlist_bulkfree(wl);
+		wl->skb_cnt = 0;
+	}
+}
+
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb_sk(struct sock *sk, struct sk_buff *skb);

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle
  2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
  2015-09-04 17:01   ` [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free Jesper Dangaard Brouer
@ 2015-09-04 17:01   ` Jesper Dangaard Brouer
  2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
  2015-09-16 10:02   ` Experiences with slub bulk use-case for network stack Jesper Dangaard Brouer
  4 siblings, 0 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-04 17:01 UTC (permalink / raw)
  To: netdev, akpm
  Cc: linux-mm, Jesper Dangaard Brouer, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

First user of the SKB bulk free API (namely kfree_skb_bulk() via
waitlist helper add-and-flush API).

There is an opportunity to bulk free SKBs while reclaiming resources
after DMA transmit completes in ixgbe_clean_tx_irq.  Thus, bulk freeing
at this point does not introduce any added latency.  A bulk size of 32
is chosen, even though the budget usually is 64, (1) to limit the stack
usage and (2) because the SLAB behind the SKBs has 32 objects per slab.
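
(For reference: with 8-byte pointers on a 64-bit kernel, the on-stack
waitlist array then costs 32 * 8 = 256 bytes, versus 512 bytes if the
full NAPI budget of 64 had been used.)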

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 463ff47200f1..d35d6b47bae2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1075,6 +1075,7 @@ static void ixgbe_tx_timeout_reset(struct ixgbe_adapter *adapter)
  * @q_vector: structure containing interrupt and ring information
  * @tx_ring: tx ring to clean
  **/
+#define BULK_FREE_SIZE 32
 static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 			       struct ixgbe_ring *tx_ring)
 {
@@ -1084,6 +1085,11 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 	unsigned int total_bytes = 0, total_packets = 0;
 	unsigned int budget = q_vector->tx.work_limit;
 	unsigned int i = tx_ring->next_to_clean;
+	struct sk_buff *skbs[BULK_FREE_SIZE];
+	struct dev_free_waitlist wl;
+
+	wl.skb_cnt = 0;
+	wl.skbs = skbs;
 
 	if (test_bit(__IXGBE_DOWN, &adapter->state))
 		return true;
@@ -1113,8 +1119,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		total_bytes += tx_buffer->bytecount;
 		total_packets += tx_buffer->gso_segs;
 
-		/* free the skb */
-		dev_consume_skb_any(tx_buffer->skb);
+		/* delay skb free and bulk free later */
+		dev_free_waitlist_add(&wl, tx_buffer->skb, BULK_FREE_SIZE);
 
 		/* unmap skb header data */
 		dma_unmap_single(tx_ring->dev,
@@ -1164,6 +1170,8 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 		budget--;
 	} while (likely(budget));
 
+	dev_free_waitlist_flush(&wl); /* free remaining SKBs on waitlist */
+
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
 	u64_stats_update_begin(&tx_ring->syncp);
@@ -1224,6 +1232,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 	return !!budget;
 }
+#undef BULK_FREE_SIZE
 
 #ifdef CONFIG_IXGBE_DCA
 static void ixgbe_update_tx_dca(struct ixgbe_adapter *adapter,

--

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
                     ` (2 preceding siblings ...)
  2015-09-04 17:01   ` [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
@ 2015-09-04 18:09   ` Alexander Duyck
  2015-09-04 18:55     ` Christoph Lameter
  2015-09-07  8:16     ` Jesper Dangaard Brouer
  2015-09-16 10:02   ` Experiences with slub bulk use-case for network stack Jesper Dangaard Brouer
  4 siblings, 2 replies; 23+ messages in thread
From: Alexander Duyck @ 2015-09-04 18:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, akpm
  Cc: linux-mm, aravinda, Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim

On 09/04/2015 10:00 AM, Jesper Dangaard Brouer wrote:
> During TX DMA completion cleanup there exist an opportunity in the NIC
> drivers to perform bulk free, without introducing additional latency.
>
> For an IPv4 forwarding workload the network stack is hitting the
> slowpath of the kmem_cache "slub" allocator.  This slowpath can be
> mitigated by bulk free via the detached freelists patchset.
>
> Depend on patchset:
>   http://thread.gmane.org/gmane.linux.kernel.mm/137469
>
> Kernel based on MMOTM tag 2015-08-24-16-12 from git repo:
>   git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
>   Also contains Christoph's patch "slub: Avoid irqoff/on in bulk allocation"
>
>
> Benchmarking: Single CPU IPv4 forwarding UDP (generator pktgen):
>   * Before: 2043575 pps
>   * After : 2090522 pps
>   * Improvements: +46947 pps and -10.99 ns
>
> In the before case, perf report shows slub free hits the slowpath:
>   1.98%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
>   1.29%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
>   0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_free
>   0.95%  ksoftirqd/6  [kernel.vmlinux]  [k] kmem_cache_alloc
>   0.20%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
>   0.17%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
>   0.09%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69
>
> After the slowpath calls are almost gone:
>   0.22%  ksoftirqd/6  [kernel.vmlinux]  [k] __cmpxchg_double_slab.isra.60
>   0.18%  ksoftirqd/6  [kernel.vmlinux]  [k] ___slab_alloc.isra.68
>   0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_free.isra.72
>   0.14%  ksoftirqd/6  [kernel.vmlinux]  [k] cmpxchg_double_slab.isra.71
>   0.08%  ksoftirqd/6  [kernel.vmlinux]  [k] __slab_alloc.isra.69
>
>
> Extra info, tuning SLUB per CPU structures gives further improvements:
>   * slub-tuned: 2124217 pps
>   * patched increase: +33695 pps and  -7.59 ns
>   * before  increase: +80642 pps and -18.58 ns
>
> Tuning done:
>   echo 256 > /sys/kernel/slab/skbuff_head_cache/cpu_partial
>   echo 9   > /sys/kernel/slab/skbuff_head_cache/min_partial
>
> Without SLUB tuning, same performance comes with kernel cmdline "slab_nomerge":
>   * slab_nomerge: 2121824 pps
>
> Test notes:
>   * Notice very fast CPU i7-4790K CPU @ 4.00GHz
>   * gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
>   * kernel 4.1.0-mmotm-2015-08-24-16-12+ #271 SMP
>   * Generator pktgen UDP single flow (pktgen_sample03_burst_single_flow.sh)
>   * Tuned for forwarding:
>    - unloaded netfilter modules
>    - Sysctl settings:
>    - net/ipv4/conf/default/rp_filter = 0
>    - net/ipv4/conf/all/rp_filter = 0
>    - (Forwarding performance is affected by early demux)
>    - net/ipv4/ip_early_demux = 0
>    - net.ipv4.ip_forward = 1
>    - Disabled GRO on NICs
>    - ethtool -K ixgbe3 gro off tso off gso off
>
> ---

This is an interesting start.  However I feel like it might work better 
if you were to create a per-cpu pool for skbs that could be freed and 
allocated in NAPI context.  So for example we already have 
napi_alloc_skb, why not just add a napi_free_skb and then make the array 
of objects to be freed part of a pool that could be used for either 
allocation or freeing?  If the pool runs empty you just allocate 
something like 8 or 16 new skb heads, and if you fill it you just free 
half of the list?
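
Very roughly, something along these lines (purely illustrative: napi_free_skb,
the pool sizes and the helper names are all made up here, I'm assuming an
all-or-nothing kmem_cache_alloc_bulk(), and refcount/clone handling is omitted):

#include <linux/skbuff.h>
#include <linux/percpu.h>

#define NAPI_SKB_POOL_SIZE	32
#define NAPI_SKB_POOL_REFILL	16

struct napi_skb_pool {
	unsigned int	count;
	void		*objs[NAPI_SKB_POOL_SIZE];
};
static DEFINE_PER_CPU(struct napi_skb_pool, napi_skb_pool);

/* Free side: stash the skb head; when full, bulk free half back to SLAB.
 * Only ever touched from softirq/NAPI context, so no locking is needed.
 */
static void napi_free_skb(struct sk_buff *skb)
{
	struct napi_skb_pool *pool = this_cpu_ptr(&napi_skb_pool);

	skb_release_all(skb);		/* release data, frags, dst, etc. */
	pool->objs[pool->count++] = skb;
	if (pool->count == NAPI_SKB_POOL_SIZE) {
		pool->count -= NAPI_SKB_POOL_SIZE / 2;
		kmem_cache_free_bulk(skbuff_head_cache,
				     NAPI_SKB_POOL_SIZE / 2,
				     &pool->objs[pool->count]);
	}
}

/* Alloc side: hand out a recycled head, refilling in bulk when empty.
 * The caller must re-initialize the head, as __alloc_skb() would.
 */
static struct sk_buff *napi_pool_get_head(gfp_t gfp)
{
	struct napi_skb_pool *pool = this_cpu_ptr(&napi_skb_pool);

	if (unlikely(!pool->count) &&
	    kmem_cache_alloc_bulk(skbuff_head_cache, gfp,
				  NAPI_SKB_POOL_REFILL, pool->objs))
		pool->count = NAPI_SKB_POOL_REFILL;

	return pool->count ? pool->objs[--pool->count] : NULL;
}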

- Alex

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
@ 2015-09-04 18:47     ` Tom Herbert
  2015-09-07  8:41       ` Jesper Dangaard Brouer
  2015-09-08 21:01     ` David Miller
  1 sibling, 1 reply; 23+ messages in thread
From: Tom Herbert @ 2015-09-04 18:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim

On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(),
> in the network stack in form of function kfree_skb_bulk() which bulk
> free SKBs (not skb clones or skb->head, yet).
>
> As this is the third user of SKB reference decrementing, split out
> refcnt decrement into helper function and use this in all call points.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/linux/skbuff.h |    1 +
>  net/core/skbuff.c      |   87 +++++++++++++++++++++++++++++++++++++++---------
>  2 files changed, 71 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index b97597970ce7..e5f1e007723b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -762,6 +762,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
>  }
>
>  void kfree_skb(struct sk_buff *skb);
> +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size);
>  void kfree_skb_list(struct sk_buff *segs);
>  void skb_tx_error(struct sk_buff *skb);
>  void consume_skb(struct sk_buff *skb);
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 429b407b4fe6..034545934158 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -661,26 +661,83 @@ void __kfree_skb(struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(__kfree_skb);
>
> +/*
> + *     skb_dec_and_test - Helper to drop ref to SKB and see is ready to free
> + *     @skb: buffer to decrement reference
> + *
> + *     Drop a reference to the buffer, and return true if it is ready
> + *     to free. Which is if the usage count has hit zero or is equal to 1.
> + *
> + *     This is performance critical code that should be inlined.
> + */
> +static inline bool skb_dec_and_test(struct sk_buff *skb)
> +{
> +       if (unlikely(!skb))
> +               return false;
> +       if (likely(atomic_read(&skb->users) == 1))
> +               smp_rmb();
> +       else if (likely(!atomic_dec_and_test(&skb->users)))
> +               return false;
> +       /* If reaching here SKB is ready to free */
> +       return true;
> +}
> +
>  /**
>   *     kfree_skb - free an sk_buff
>   *     @skb: buffer to free
>   *
>   *     Drop a reference to the buffer and free it if the usage count has
> - *     hit zero.
> + *     hit zero or is equal to 1.
>   */
>  void kfree_skb(struct sk_buff *skb)
>  {
> -       if (unlikely(!skb))
> -               return;
> -       if (likely(atomic_read(&skb->users) == 1))
> -               smp_rmb();
> -       else if (likely(!atomic_dec_and_test(&skb->users)))
> -               return;
> -       trace_kfree_skb(skb, __builtin_return_address(0));
> -       __kfree_skb(skb);
> +       if (skb_dec_and_test(skb)) {
> +               trace_kfree_skb(skb, __builtin_return_address(0));
> +               __kfree_skb(skb);
> +       }
>  }
>  EXPORT_SYMBOL(kfree_skb);
>
> +/**
> + *     kfree_skb_bulk - bulk free SKBs when refcnt allows to
> + *     @skbs: array of SKBs to free
> + *     @size: number of SKBs in array
> + *
> + *     If SKB refcnt allows for free, then release any auxiliary data
> + *     and then bulk free SKBs to the SLAB allocator.
> + *
> + *     Note that interrupts must be enabled when calling this function.
> + */
> +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
> +{
Why not pass a list of skbs (e.g. using skb->next)?

> +       int i;
> +       size_t cnt = 0;
> +
> +       for (i = 0; i < size; i++) {
> +               struct sk_buff *skb = skbs[i];
> +
> +               if (!skb_dec_and_test(skb))
> +                       continue; /* skip skb, not ready to free */
> +
> +               /* Construct an array of SKBs, ready to be free'ed and
> +                * cleanup all auxiliary, before bulk free to SLAB.
> +                * For now, only handle non-cloned SKBs, related to
> +                * SLAB skbuff_head_cache
> +                */
> +               if (skb->fclone == SKB_FCLONE_UNAVAILABLE) {
> +                       skb_release_all(skb);
> +                       skbs[cnt++] = skb;
> +               } else {
> +                       /* SKB was a clone, don't handle this case */
> +                       __kfree_skb(skb);
> +               }
> +       }
> +       if (likely(cnt)) {
> +               kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
> +       }
> +}
> +EXPORT_SYMBOL(kfree_skb_bulk);
> +
>  void kfree_skb_list(struct sk_buff *segs)
>  {
>         while (segs) {
> @@ -722,14 +779,10 @@ EXPORT_SYMBOL(skb_tx_error);
>   */
>  void consume_skb(struct sk_buff *skb)
>  {
> -       if (unlikely(!skb))
> -               return;
> -       if (likely(atomic_read(&skb->users) == 1))
> -               smp_rmb();
> -       else if (likely(!atomic_dec_and_test(&skb->users)))
> -               return;
> -       trace_consume_skb(skb);
> -       __kfree_skb(skb);
> +       if (skb_dec_and_test(skb)) {
> +               trace_consume_skb(skb);
> +               __kfree_skb(skb);
> +       }
>  }
>  EXPORT_SYMBOL(consume_skb);
>
>

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
@ 2015-09-04 18:55     ` Christoph Lameter
  2015-09-04 20:39       ` Alexander Duyck
  2015-09-07  8:16     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2015-09-04 18:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Fri, 4 Sep 2015, Alexander Duyck wrote:

> were to create a per-cpu pool for skbs that could be freed and allocated in
> NAPI context.  So for example we already have napi_alloc_skb, why not just add
> a napi_free_skb and then make the array of objects to be freed part of a pool
> that could be used for either allocation or freeing?  If the pool runs empty
> you just allocate something like 8 or 16 new skb heads, and if you fill it you
> just free half of the list?

The slab allocators provide something like a per cpu pool for you to
optimize object alloc and free.

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 18:55     ` Christoph Lameter
@ 2015-09-04 20:39       ` Alexander Duyck
  2015-09-04 23:45         ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Alexander Duyck @ 2015-09-04 20:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On 09/04/2015 11:55 AM, Christoph Lameter wrote:
> On Fri, 4 Sep 2015, Alexander Duyck wrote:
>
>> were to create a per-cpu pool for skbs that could be freed and allocated in
>> NAPI context.  So for example we already have napi_alloc_skb, why not just add
>> a napi_free_skb and then make the array of objects to be freed part of a pool
>> that could be used for either allocation or freeing?  If the pool runs empty
>> you just allocate something like 8 or 16 new skb heads, and if you fill it you
>> just free half of the list?
> The slab allocators provide something like a per cpu pool for you to
> optimize object alloc and free.

Right, but one of the reasons for Jesper to implement the bulk 
alloc/free is to avoid the cmpxchg that is being used to get stuff into 
or off of the per cpu lists.

In the case of network drivers they are running in softirq context 
almost exclusively.  As such it is useful to have a set of buffers that 
can be acquired or freed from this context without the need to use any 
synchronization primitives.  Then once the softirq context ends then we 
can free up some or all of the resources back to the slab allocator.

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 20:39       ` Alexander Duyck
@ 2015-09-04 23:45         ` Christoph Lameter
  2015-09-05 11:18           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2015-09-04 23:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Fri, 4 Sep 2015, Alexander Duyck wrote:
> Right, but one of the reasons for Jesper to implement the bulk alloc/free is
> to avoid the cmpxchg that is being used to get stuff into or off of the per
> cpu lists.

There is no full cmpxchg used for the per cpu lists. It's a cmpxchg without
lock semantics, which is very cheap.

> In the case of network drivers they are running in softirq context almost
> exclusively.  As such it is useful to have a set of buffers that can be
> acquired or freed from this context without the need to use any
> synchronization primitives.  Then once the softirq context ends then we can
> free up some or all of the resources back to the slab allocator.

That is the case in the slab allocators.

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 23:45         ` Christoph Lameter
@ 2015-09-05 11:18           ` Jesper Dangaard Brouer
  2015-09-08 17:32             ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-05 11:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim, brouer

On Fri, 4 Sep 2015 18:45:13 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Fri, 4 Sep 2015, Alexander Duyck wrote:
> > Right, but one of the reasons for Jesper to implement the bulk alloc/free is
> > to avoid the cmpxchg that is being used to get stuff into or off of the per
> > cpu lists.
> 
> There is no full cmpxchg used for the per cpu lists. Its a cmpxchg without
> lock semantics which is very cheap.

The double_cmpxchg without lock prefix still costs 9 cycles, which is
very fast but still a cost (add approx 19 cycles for a lock prefix).

It is slower than local_irq_disable + local_irq_enable, which together
only cost 7 cycles and which the bulking call uses.  (That is the reason
bulk calls with 1 object can almost compete with the fastpath.)


> > In the case of network drivers they are running in softirq context almost
> > exclusively.  As such it is useful to have a set of buffers that can be
> > acquired or freed from this context without the need to use any
> > synchronization primitives.  Then once the softirq context ends then we can
> > free up some or all of the resources back to the slab allocator.
> 
> That is the case in the slab allocators.

There is a potential for taking advantage of this softirq context,
which is basically what my qmempool implementation did.

But we have now optimized the slub allocator to an extent where it (in
case of slab-tuning or slab_nomerge) is faster than my qmempool
implementation.

Thus, I would like a smaller/slimmer layer than qmempool.  We do need
some per-CPU cache for allocations, like Alex suggests, but I'm not
sure we need that for the free side.  For now I'm returning
objects/skbs directly to slub, and am hoping enough objects can be
merged into a detached freelist, which allows me to return several
objects with a single locked double_cmpxchg.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
  2015-09-04 18:55     ` Christoph Lameter
@ 2015-09-07  8:16     ` Jesper Dangaard Brouer
  2015-09-07 21:23       ` Alexander Duyck
  1 sibling, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07  8:16 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: netdev, akpm, linux-mm, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim, brouer

On Fri, 4 Sep 2015 11:09:21 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> This is an interesting start.  However I feel like it might work better 
> if you were to create a per-cpu pool for skbs that could be freed and 
> allocated in NAPI context.  So for example we already have 
> napi_alloc_skb, why not just add a napi_free_skb

I do like the idea...

> and then make the array 
> of objects to be freed part of a pool that could be used for either 
> allocation or freeing?  If the pool runs empty you just allocate 
> something like 8 or 16 new skb heads, and if you fill it you just free 
> half of the list?

But I worry that this algorithm will "randomize" the (skb) objects.
And the SLUB bulk optimization only works if we have many objects
belonging to the same page.

It would likely be fastest to implement a simple stack (for these
per-cpu pools), but I again worry that it would randomize the
object-pages.  A simple queue might be better, but slightly slower.
Guess I could just reuse part of qmempool / alf_queue as a quick test.

Having a per-cpu pool in networking would solve the problem that the slub
per-cpu pool isn't large enough for our use-case.  On the other hand,
maybe we should fix slub to dynamically adjust the size of its per-cpu
resources?


Some prerequisite knowledge (for people not knowing slub's internal details):
the slub alloc path will pick up a page, and empty all objects for that page
before proceeding to the next page.  Thus, slub bulk alloc will give
many objects belonging to the same page.  I'm trying to keep these objects
grouped together until they can be freed in a bulk.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 18:47     ` Tom Herbert
@ 2015-09-07  8:41       ` Jesper Dangaard Brouer
  2015-09-07 16:25         ` Tom Herbert
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07  8:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim, brouer

On Fri, 4 Sep 2015 11:47:17 -0700 Tom Herbert <tom@herbertland.com> wrote:

> On Fri, Sep 4, 2015 at 10:00 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > Introduce the first user of SLAB bulk free API kmem_cache_free_bulk(),
> > in the network stack in form of function kfree_skb_bulk() which bulk
> > free SKBs (not skb clones or skb->head, yet).
> >
[...]
> > +/**
> > + *     kfree_skb_bulk - bulk free SKBs when refcnt allows to
> > + *     @skbs: array of SKBs to free
> > + *     @size: number of SKBs in array
> > + *
> > + *     If SKB refcnt allows for free, then release any auxiliary data
> > + *     and then bulk free SKBs to the SLAB allocator.
> > + *
> > + *     Note that interrupts must be enabled when calling this function.
> > + */
> > +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
> > +{
>
> What not pass a list of skbs (e.g. using skb->next)?

Because the next layer, the slab API, needs an array:
  kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)

Look at the patch:
 [PATCH V2 3/3] slub: build detached freelist with look-ahead
 http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472

Where I use this array to progressively scan for objects belonging to
the same page.  (A subtle detail is I manage to zero out the array,
which is good from a security/error-handling point of view, as pointers
to the objects are not left dangling on the stack).


I cannot argue that writing skb->next comes as an additional cost,
because the slUb free also writes into this cacheline.  Perhaps the
slAb allocator does not?

[...]
> > +       if (likely(cnt)) {
> > +               kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
> > +       }
> > +}
> > +EXPORT_SYMBOL(kfree_skb_bulk);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-07  8:41       ` Jesper Dangaard Brouer
@ 2015-09-07 16:25         ` Tom Herbert
  2015-09-07 20:14           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Tom Herbert @ 2015-09-07 16:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim

>> What not pass a list of skbs (e.g. using skb->next)?
>
> Because the next layer, the slab API needs an array:
>   kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>

I suppose we could ask the same question of that function. IMO
encouraging drivers to define arrays of pointers on the stack like
you're doing in the ixgbe patch is a bad direction.

In any case I believe it would be simpler on the networking side
just to maintain a list of skbs to free. Then the dev_free_waitlist
structure might not be needed, since we could just use an
sk_buff_head for that.
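
Roughly what I have in mind (purely illustrative; skb_dec_and_test() and
skb_release_all() are the helpers used in patch 1/3, and note that an array
is still needed underneath to feed kmem_cache_free_bulk()):

static void kfree_skb_queue_bulk(struct sk_buff_head *list)
{
	void *objs[32];
	struct sk_buff *skb;
	size_t cnt = 0;

	while ((skb = __skb_dequeue(list)) != NULL) {
		if (!skb_dec_and_test(skb))
			continue;		/* not ready to free */
		if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
			__kfree_skb(skb);	/* clones: fall back */
			continue;
		}
		skb_release_all(skb);
		objs[cnt++] = skb;
		if (cnt == ARRAY_SIZE(objs)) {
			kmem_cache_free_bulk(skbuff_head_cache, cnt, objs);
			cnt = 0;
		}
	}
	if (cnt)
		kmem_cache_free_bulk(skbuff_head_cache, cnt, objs);
}

The driver would then do __skb_queue_head_init(&list), __skb_queue_tail()
in its TX clean loop, and a single kfree_skb_queue_bulk(&list) after it.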


Tom

> Look at the patch:
>  [PATCH V2 3/3] slub: build detached freelist with look-ahead
>  http://thread.gmane.org/gmane.linux.kernel.mm/137469/focus=137472
>
> Where I use this array to progressively scan for objects belonging to
> the same page.  (A subtle detail is I manage to zero out the array,
> which is good from a security/error-handling point of view, as pointers
> to the objects are not left dangling on the stack).
>
>
> I cannot argue that, writing skb->next comes as an additional cost,
> because the slUb free also writes into this cacheline.  Perhaps the
> slAb allocator does not?
>
> [...]
>> > +       if (likely(cnt)) {
>> > +               kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
>> > +       }
>> > +}
>> > +EXPORT_SYMBOL(kfree_skb_bulk);
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Sr. Network Kernel Developer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-07 16:25         ` Tom Herbert
@ 2015-09-07 20:14           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-07 20:14 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Linux Kernel Network Developers, akpm, linux-mm, aravinda,
	Christoph Lameter, Paul E. McKenney, iamjoonsoo.kim, brouer


On Mon, 7 Sep 2015 09:25:49 -0700 Tom Herbert <tom@herbertland.com> wrote:

> >> What not pass a list of skbs (e.g. using skb->next)?
> >
> > Because the next layer, the slab API needs an array:
> >   kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >
> 
> I suppose we could ask the same question of that function. IMO
> encouraging drivers to define arrays of pointers on the stack like
> you're doing in the ixgbe patch is a bad direction.
> 
> In any case I believe this would be simpler in the networking side
> just to maintain a list of skb's to free. Then the dev_free_waitlist
> structure might not be needed then since we could just use a
> skb_buf_head for that.

I guess it is more natural for the network side to work with skb lists.
But I'm keeping the array for slab/slub, as we cannot assume/enforce
objects of a specific data type.

I worried about how large a bulk free we should allow, due to the
interaction with skb->destructor, which for sockets affects their memory
accounting. E.g. we have seen issues with hypervisor network drivers
(Xen and HyperV) that are so slow to clean up their TX completion queue
that their TCP bandwidth gets limited by tcp_limit_output_bytes.
I capped it at 32, and the NAPI budget will cap it at 64.


By the following argument, a bulk free of 64 objects/skbs is not a problem.
The delay I'm introducing, before the first real kfree_skb is called
(which invokes the destructor that frees up socket memory accounting),
is very small.

Assume measured packet rate of: 2105011 pps
Time between packets (1/2105011*10^9): 475 ns

Perf shows ixgbe_clean_tx_irq() takes: 1.23%
Extrapolating the function call cost: 5.84 ns (475*(1.23/100))

Processing 64 packets in ixgbe_clean_tx_irq() thus takes: 373 ns (64*5.84).
At 10Gbit/s, how many bytes can arrive in this period?  Only 466 bytes
((373/10^9)*(10000*10^6)/8).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-07  8:16     ` Jesper Dangaard Brouer
@ 2015-09-07 21:23       ` Alexander Duyck
  0 siblings, 0 replies; 23+ messages in thread
From: Alexander Duyck @ 2015-09-07 21:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, akpm, linux-mm, aravinda, Christoph Lameter,
	Paul E. McKenney, iamjoonsoo.kim

On 09/07/2015 01:16 AM, Jesper Dangaard Brouer wrote:
> On Fri, 4 Sep 2015 11:09:21 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> This is an interesting start.  However I feel like it might work better
>> if you were to create a per-cpu pool for skbs that could be freed and
>> allocated in NAPI context.  So for example we already have
>> napi_alloc_skb, why not just add a napi_free_skb
> I do like the idea...

If nothing else you want to avoid having to redo this code for every 
driver.  If you can just replace dev_kfree_skb with some other freeing 
call it will make it much easier to convert other drivers.

>> and then make the array
>> of objects to be freed part of a pool that could be used for either
>> allocation or freeing?  If the pool runs empty you just allocate
>> something like 8 or 16 new skb heads, and if you fill it you just free
>> half of the list?
> But I worry that this algorithm will "randomize" the (skb) objects.
> And the SLUB bulk optimization only works if we have many objects
> belonging to the same page.

Agreed to some extent, however at the same time what this does is allow 
for a certain amount of skb recycling.  So instead of freeing the 
buffers received from the socket you would likely be recycling them and 
sending them back as Rx skbs.  In the case of a heavy routing workload 
you would likely just be cycling through the same set of buffers and 
cleaning them off of transmit and placing them back on receive.  The 
general idea is to keep the memory footprint small so recycling Tx 
buffers to use for Rx can have its advantages in terms of keeping things 
confined to limits of the L1/L2 cache.

> It would likely be fastest to implement a simple stack (for these
> per-cpu pools), but I again worry that it would randomize the
> object-pages.  A simple queue might be better, but slightly slower.
> Guess I could just reuse part of qmempool / alf_queue as a quick test.

I would say don't over engineer it.  A stack is the simplest.  The 
qmempool / alf_queue is just going to add extra overhead.

The added advantage to the stack is that you are working with pointers 
and you are guaranteed that the list of pointers are going to be 
linear.  If you use a queue clean-up will require up to 2 blocks of 
freeing in case the ring has wrapped.

> Having a per-cpu pool in networking would solve the problem of the slub
> per-cpu pool isn't large enough for our use-case.  On the other hand,
> maybe we should fix slub to dynamically adjust the size of it's per-cpu
> resources?

The per-cpu pool is just meant to replace the per-driver pool you 
were using.  By using a per-cpu pool you would get better aggregation 
and can just flush the freed buffers at the end of the Rx softirq or 
when the pool is full instead of having to flush smaller lists per call 
to napi->poll.

> A pre-req knowledge (for people not knowing slub's internal details):
> Slub alloc path will pickup a page, and empty all objects for that page
> before proceeding to the next page.  Thus, slub bulk alloc will give
> many objects belonging to the page.  I'm trying to keep these objects
> grouped together until they can be free'ed in a bulk.

The problem is you aren't going to be able to keep them together very 
easily.  Yes they might be allocated all from one spot on Rx but they 
can very easily end up scattered to multiple locations. The same applies 
to Tx where you will have multiple flows all outgoing on one port.  That 
is why I was thinking adding some skb recycling via a per-cpu stack 
might be useful especially since you have to either fill or empty the 
stack when you allocate or free multiple skbs anyway.  In addition it 
provides an easy way for a bulk alloc and a bulk free to share data 
structures without adding additional overhead by keeping them separate.

If you managed it with some sort of high-water/low-water mark type setup 
you could very well keep the bulk-alloc/free busy without too much 
fragmentation.  For the socket transmit/receive case the thing you have 
to keep in mind is that if you reuse the buffers you are just going to 
be throwing them back at the sockets which are likely not using 
bulk-free anyway.  So in that case reuse could actually improve things 
by simply reducing the number of calls to bulk-alloc you will need to 
make since things like TSO allow you to send 64K using a single sk_buff, 
while you will likely be receiving one or more acks on the receive 
side which will require allocations.

- Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-05 11:18           ` Jesper Dangaard Brouer
@ 2015-09-08 17:32             ` Christoph Lameter
  2015-09-09 12:59               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2015-09-08 17:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote:

> The double_cmpxchg without lock prefix still cost 9 cycles, which is
> very fast but still a cost (add approx 19 cycles for a lock prefix).
>
> It is slower than local_irq_disable + local_irq_enable that only cost
> 7 cycles, which the bulking call uses.  (That is the reason bulk calls
> with 1 object can almost compete with fastpath).

Hmmm... Guess we need to come up with distinct versions of kmalloc() for
irq and non-irq contexts to take advantage of that.  Most are at non-irq
context anyway.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk()
  2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
  2015-09-04 18:47     ` Tom Herbert
@ 2015-09-08 21:01     ` David Miller
  1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2015-09-08 21:01 UTC (permalink / raw)
  To: brouer; +Cc: netdev, akpm, linux-mm, aravinda, cl, paulmck, iamjoonsoo.kim

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 04 Sep 2015 19:00:53 +0200

> +/**
> + *	kfree_skb_bulk - bulk free SKBs when refcnt allows to
> + *	@skbs: array of SKBs to free
> + *	@size: number of SKBs in array
> + *
> + *	If SKB refcnt allows for free, then release any auxiliary data
> + *	and then bulk free SKBs to the SLAB allocator.
> + *
> + *	Note that interrupts must be enabled when calling this function.
> + */
> +void kfree_skb_bulk(struct sk_buff **skbs, unsigned int size)
> +{
> +	int i;
> +	size_t cnt = 0;
> +
> +	for (i = 0; i < size; i++) {
> +		struct sk_buff *skb = skbs[i];
> +
> +		if (!skb_dec_and_test(skb))
> +			continue; /* skip skb, not ready to free */
> +
> +		/* Construct an array of SKBs, ready to be free'ed and
> +		 * cleanup all auxiliary, before bulk free to SLAB.
> +		 * For now, only handle non-cloned SKBs, related to
> +		 * SLAB skbuff_head_cache
> +		 */
> +		if (skb->fclone == SKB_FCLONE_UNAVAILABLE) {
> +			skb_release_all(skb);
> +			skbs[cnt++] = skb;
> +		} else {
> +			/* SKB was a clone, don't handle this case */
> +			__kfree_skb(skb);
> +		}
> +	}
> +	if (likely(cnt)) {
> +		kmem_cache_free_bulk(skbuff_head_cache, cnt, (void **) skbs);
> +	}
> +}

You're going to have to do a trace_kfree_skb() or trace_consume_skb() for
these things.

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-08 17:32             ` Christoph Lameter
@ 2015-09-09 12:59               ` Jesper Dangaard Brouer
  2015-09-09 14:08                 ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-09 12:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim, brouer

On Tue, 8 Sep 2015 12:32:40 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Sat, 5 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> > The double_cmpxchg without lock prefix still cost 9 cycles, which is
> > very fast but still a cost (add approx 19 cycles for a lock prefix).
> >
> > It is slower than local_irq_disable + local_irq_enable that only cost
> > 7 cycles, which the bulking call uses.  (That is the reason bulk calls
> > with 1 object can almost compete with fastpath).
> 
> Hmmm... Guess we need to come up with distinct version of kmalloc() for
> irq and non irq contexts to take advantage of that . Most at non irq
> context anyways.

I agree, it would be an easy win.  Do notice this will have the most
impact for the slAb allocator.

I estimate the combined alloc + free cost savings would be:
 * slAb would save approx 60 cycles
 * slUb would save approx  4 cycles

We might consider keeping the slUb approach as it would be more
friendly for RT with less IRQ disabling.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

--

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API.
  2015-09-09 12:59               ` Jesper Dangaard Brouer
@ 2015-09-09 14:08                 ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2015-09-09 14:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, netdev, akpm, linux-mm, aravinda,
	Paul E. McKenney, iamjoonsoo.kim

On Wed, 9 Sep 2015, Jesper Dangaard Brouer wrote:

> > Hmmm... Guess we need to come up with distinct version of kmalloc() for
> > irq and non irq contexts to take advantage of that . Most at non irq
> > context anyways.
>
> I agree, it would be an easy win.  Do notice this will have the most
> impact for the slAb allocator.
>
> I estimate alloc + free cost would save:
>  * slAb would save approx 60 cycles
>  * slUb would save approx  4 cycles
>
> We might consider keeping the slUb approach as it would be more
> friendly for RT with less IRQ disabling.

IRQ disabling is a mixed bag. Older cpus have higher latencies there, and
virtualized contexts may require that the hypervisor track the interrupt
state.

For recent intel cpus this is certainly a workable approach.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Experiences with slub bulk use-case for network stack
  2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
                     ` (3 preceding siblings ...)
  2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
@ 2015-09-16 10:02   ` Jesper Dangaard Brouer
  2015-09-16 15:13     ` Christoph Lameter
  4 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-16 10:02 UTC (permalink / raw)
  To: linux-mm, Christoph Lameter; +Cc: netdev, akpm, Alexander Duyck, iamjoonsoo.kim


Hint: this leads up to discussing whether the current bulk *ALLOC* API
needs to be changed...

Alex and I have been working hard on a practical use-case for SLAB
bulking (mostly slUb) in the network stack.  Here is a summary of
what we have learned so far.

Bulk free'ing SKBs during TX completion is a big and easy win.

Specifically for slUb, the normal path for freeing these objects (which
are not on c->freelist) requires a locked double_cmpxchg per object.
Bulk free (via the detached freelist patch) allows all objects
belonging to the same slab-page to be freed with a single locked
double_cmpxchg. Thus, the bulk free speedup is quite an improvement.
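
As a driver-side illustration, here is a minimal sketch of the raw
kmem_cache bulk free API (the completion loop and next_completed_object()
are made-up placeholders; real skbs would go through the proposed
kfree_skb_bulk() path rather than raw kmem_cache_free_bulk()):

  void *bulk[32];              /* arbitrary batch size for the example */
  size_t cnt = 0;
  void *obj;

  /* TX completion: collect finished objects and free them in bulk */
  while ((obj = next_completed_object(ring)) != NULL) {
          bulk[cnt++] = obj;
          if (cnt == ARRAY_SIZE(bulk)) {
                  kmem_cache_free_bulk(cache, cnt, bulk);
                  cnt = 0;
          }
  }
  if (cnt)
          kmem_cache_free_bulk(cache, cnt, bulk);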

The slUb alloc fastpath is hard to beat on speed:
 * accessing c->freelist is a local cmpxchg, 9 cycles (38% of the cost)
 * c->freelist is refilled with a single locked cmpxchg

In micro benchmarking it looks like we can beat alloc, because we do a
single local_irq_{disable,enable} (cost 7 cycles) and then pull out all
objects on c->freelist.  Thus, we save 9 cycles per object (counting
from the 2nd object).
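
Roughly, the bulk alloc fastpath amounts to the following (a simplified
sketch, not the actual SLUB code; get_freepointer() is SLUB's internal
helper for following the freelist link):

  local_irq_disable();
  while (count < nr) {
          void *object = c->freelist;

          if (!object)
                  break;  /* freelist empty, fall back to refill path */
          c->freelist = get_freepointer(s, object);
          p[count++] = object;
  }
  local_irq_enable();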

However, in practical use-cases we are seeing single object alloc win
over bulk alloc; we believe this to be due to prefetching.  When
c->freelist gets (semi) cache-cold, it becomes more expensive to walk
the freelist (which is a basic singly linked list pointing to the next
free object).

For bulk alloc the full freelist is walked (right away) and the objects
are pulled out into the array.  Normal single object alloc returns only
a single object, but it prefetches the next object pointer.  Thus, the
next time single alloc is called, that object will already have been
prefetched.  Doing a prefetch in bulk alloc only helps a little, as
there is not enough "time" between accessing/walking the freelist to
hide the latency of fetching the objects.
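
In other words, the single object fastpath effectively does the
following (simplified sketch, ignoring the cmpxchg/tid details):

  object = c->freelist;
  next_object = get_freepointer(s, object);
  prefetch(next_object);    /* warm the cache line for the *next* alloc */
  c->freelist = next_object;
  return object;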

So, how can we solve this and make bulk alloc faster?


Alex and I had the idea that bulk alloc could return an "allocator
specific cache" data-structure (and we add some helpers to access it).

In the slUb case, the freelist is a singly linked pointer list.  In
the network stack the skb objects have a skb->next pointer, which is
located at the same offset as the freelist pointer.  Thus, simply
returning the freelist directly could be interpreted as an skb-list.
The helper API would then do the prefetching when pulling out
objects.
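
A hypothetical helper (the name is made up for illustration) that pulls
objects off such a returned freelist, relying on the skb->next /
freelist-pointer overlap described above, could look like:

  /* Hypothetical: pull one skb off a detached freelist, prefetch next.
   * Assumes the freelist pointer sits at the same offset as skb->next.
   */
  static inline struct sk_buff *bulk_cache_pull(struct sk_buff **cache)
  {
          struct sk_buff *skb = *cache;

          if (skb) {
                  *cache = skb->next;  /* freelist link doubles as skb->next */
                  prefetch(*cache);
          }
          return skb;
  }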

For the slUb case, we would simply cmpxchg either c->freelist or
page->freelist with a NULL ptr, and then own all objects on the
freelist. This also reduces the time we keep IRQs disabled.
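
Detaching the per-cpu freelist could then be as simple as (sketch only;
the real SLUB fastpath would also have to handle the transaction-id and
preemption details):

  void *freelist;

  do {
          freelist = READ_ONCE(c->freelist);
  } while (cmpxchg(&c->freelist, freelist, NULL) != freelist);
  /* we now own every object that was on the per-cpu freelist */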

API wise, we don't (necessarily) know how many objects are on the
freelist (without first walking the list, which would cause the data
stalls we are trying to avoid).

Thus, an API that always returns the exact number of requested
objects will not work...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

(related to http://thread.gmane.org/gmane.linux.kernel.mm/137469)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Experiences with slub bulk use-case for network stack
  2015-09-16 10:02   ` Experiences with slub bulk use-case for network stack Jesper Dangaard Brouer
@ 2015-09-16 15:13     ` Christoph Lameter
  2015-09-17 20:17       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Lameter @ 2015-09-16 15:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, netdev, akpm, Alexander Duyck, iamjoonsoo.kim

On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:

>
> Hint, this leads up to discussing if current bulk *ALLOC* API need to
> be changed...
>
> Alex and I have been working hard on practical use-case for SLAB
> bulking (mostly slUb), in the network stack.  Here is a summary of
> what we have learned so far.

SLAB refers to the SLAB allocator, which is one slab allocator, and
SLUB is another slab allocator.

Please keep that consistent, otherwise things get confusing.

> Bulk free'ing SKBs during TX completion is a big and easy win.
>
> Specifically for slUb, normal path for freeing these objects (which
> are not on c->freelist) require a locked double_cmpxchg per object.
> The bulk free (via detached freelist patch) allow to free all objects
> belonging to the same slab-page, to be free'ed with a single locked
> double_cmpxchg. Thus, the bulk free speedup is quite an improvement.

Yep.

> Alex and I had the idea of bulk alloc returns an "allocator specific
> cache" data-structure (and we add some helpers to access this).

Maybe add some macros to handle this?

> In the slUb case, the freelist is a single linked pointer list.  In
> the network stack the skb objects have a skb->next pointer, which is
> located at the same position as freelist pointer.  Thus, simply
> returning the freelist directly, could be interpreted as a skb-list.
> The helper API would then do the prefetching, when pulling out
> objects.

The problem with the SLUB case is that the objects must be on the same
slab page.

> For the slUb case, we would simply cmpxchg either c->freelist or
> page->freelist with a NULL ptr, and then own all objects on the
> freelist. This also reduce the time we keep IRQs disabled.

You don't need to disable interrupts for the cmpxchges. There is
additional state in the page struct though, so the updates must be done
carefully.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Experiences with slub bulk use-case for network stack
  2015-09-16 15:13     ` Christoph Lameter
@ 2015-09-17 20:17       ` Jesper Dangaard Brouer
  2015-09-17 23:57         ` Christoph Lameter
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2015-09-17 20:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, netdev, akpm, Alexander Duyck, iamjoonsoo.kim, brouer

On Wed, 16 Sep 2015 10:13:25 -0500 (CDT)
Christoph Lameter <cl@linux.com> wrote:

> On Wed, 16 Sep 2015, Jesper Dangaard Brouer wrote:
> 
> >
> > Hint, this leads up to discussing if current bulk *ALLOC* API need to
> > be changed...
> >
> > Alex and I have been working hard on practical use-case for SLAB
> > bulking (mostly slUb), in the network stack.  Here is a summary of
> > what we have learned so far.
> 
> SLAB refers to the SLAB allocator which is one slab allocator and SLUB is
> another slab allocator.
> 
> Please keep that consistent otherwise things get confusing

This naming scheme is really confusing.  I'll try to be more
consistent.  So, you want capital letters SLAB and SLUB when talking
about a specific slab allocator implementation.


> > Bulk free'ing SKBs during TX completion is a big and easy win.
> >
> > Specifically for slUb, normal path for freeing these objects (which
> > are not on c->freelist) require a locked double_cmpxchg per object.
> > The bulk free (via detached freelist patch) allow to free all objects
> > belonging to the same slab-page, to be free'ed with a single locked
> > double_cmpxchg. Thus, the bulk free speedup is quite an improvement.
> 
> Yep.
> 
> > Alex and I had the idea of bulk alloc returns an "allocator specific
> > cache" data-structure (and we add some helpers to access this).
> 
> Maybe add some Macros to handle this?

Yes, helpers will likely turn out to be macros.


> > In the slUb case, the freelist is a single linked pointer list.  In
> > the network stack the skb objects have a skb->next pointer, which is
> > located at the same position as freelist pointer.  Thus, simply
> > returning the freelist directly, could be interpreted as a skb-list.
> > The helper API would then do the prefetching, when pulling out
> > objects.
> 
> The problem with the SLUB case is that the objects must be on the same
> slab page.

Yes, I'm aware of that; that is what we are trying to take advantage of.


> > For the slUb case, we would simply cmpxchg either c->freelist or
> > page->freelist with a NULL ptr, and then own all objects on the
> > freelist. This also reduce the time we keep IRQs disabled.
> 
> You dont need to disable interrupts for the cmpxchges. There is
> additional state in the page struct though so the updates must be
> done carefully.

Yes, I'm aware that cmpxchg does not need interrupts disabled, and I
plan to take advantage of this in this new approach for bulk alloc.

Our current bulk alloc disables interrupts for the full period (of
collecting the requested number of objects).

What I'm proposing is keeping interrupts on, and then simply
cmpxchg'ing e.g. 2 slab-pages' worth of objects out of the SLUB
allocator (what the SLUB code calls freelists). The bulk call now owns
these freelists, and returns them to the caller.  The API caller gets
some helpers/macros to access the objects, shielding them from the
details of SLUB freelists.
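
Caller-side, the helpers/macros could look something like this (all
names are hypothetical, just to show the intended shape of the API):

  struct kmem_bulk_cache cache;   /* opaque, allocator specific */
  void *obj;

  kmem_cache_detach_bulk(skbuff_head_cache, GFP_ATOMIC, &cache);

  /* helper walks the detached freelists and prefetches ahead */
  for_each_bulk_object(obj, &cache)
          do_something_with(obj);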

The pitfall with this API is that we don't know how many objects are
on a SLUB freelist.  And we cannot walk the freelist to count them,
because then we hit the memory/cache stalls that we are trying so hard
to avoid.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Experiences with slub bulk use-case for network stack
  2015-09-17 20:17       ` Jesper Dangaard Brouer
@ 2015-09-17 23:57         ` Christoph Lameter
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Lameter @ 2015-09-17 23:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: linux-mm, netdev, akpm, Alexander Duyck, iamjoonsoo.kim

On Thu, 17 Sep 2015, Jesper Dangaard Brouer wrote:

> What I'm proposing is keeping interrupts on, and then simply cmpxchg
> e.g 2 slab-pages out of the SLUB allocator (which the SLUB code calls
> freelist's). The bulk call now owns these freelists, and returns them
> to the caller.  The API caller gets some helpers/macros to access
> objects, to shield him from the details (of SLUB freelist's).
>
> The pitfall with this API is we don't know how many objects are on a
> SLUB freelist.  And we cannot walk the freelist and count them, because
> then we hit the problem of memory/cache stalls (that we are trying so
> hard to avoid).

If you get a fresh page from the page allocator then you know how many
objects are available in a slab page.

There is also a counter in each slab page for the objects allocated.
The number of free objects is page->objects - page->inuse.

This is only true for a locked cmpxchg. The unlocked cmpxchg used for
the per cpu freelist does not use the counters in the page struct.
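
I.e. the number of available objects could be computed along these
lines (sketch; per the above, only meaningful under the locked
protocol):

  /* Free objects remaining in a slab page */
  static inline int slab_page_free_objects(struct page *page)
  {
          return page->objects - page->inuse;
  }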



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-09-17 23:57 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20150824005727.2947.36065.stgit@localhost>
2015-09-04 17:00 ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Jesper Dangaard Brouer
2015-09-04 17:00   ` [RFC PATCH 1/3] net: introduce kfree_skb_bulk() user of kmem_cache_free_bulk() Jesper Dangaard Brouer
2015-09-04 18:47     ` Tom Herbert
2015-09-07  8:41       ` Jesper Dangaard Brouer
2015-09-07 16:25         ` Tom Herbert
2015-09-07 20:14           ` Jesper Dangaard Brouer
2015-09-08 21:01     ` David Miller
2015-09-04 17:01   ` [RFC PATCH 2/3] net: NIC helper API for building array of skbs to free Jesper Dangaard Brouer
2015-09-04 17:01   ` [RFC PATCH 3/3] ixgbe: bulk free SKBs during TX completion cleanup cycle Jesper Dangaard Brouer
2015-09-04 18:09   ` [RFC PATCH 0/3] Network stack, first user of SLAB/kmem_cache bulk free API Alexander Duyck
2015-09-04 18:55     ` Christoph Lameter
2015-09-04 20:39       ` Alexander Duyck
2015-09-04 23:45         ` Christoph Lameter
2015-09-05 11:18           ` Jesper Dangaard Brouer
2015-09-08 17:32             ` Christoph Lameter
2015-09-09 12:59               ` Jesper Dangaard Brouer
2015-09-09 14:08                 ` Christoph Lameter
2015-09-07  8:16     ` Jesper Dangaard Brouer
2015-09-07 21:23       ` Alexander Duyck
2015-09-16 10:02   ` Experiences with slub bulk use-case for network stack Jesper Dangaard Brouer
2015-09-16 15:13     ` Christoph Lameter
2015-09-17 20:17       ` Jesper Dangaard Brouer
2015-09-17 23:57         ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).